[GitHub] spark issue #20503: [SPARK-23299][SQL][PYSPARK] Fix __repr__ behaviour for R...

2018-10-26 Thread ashashwat
Github user ashashwat commented on the issue:

https://github.com/apache/spark/pull/20503
  
@holdenk I am on it.


---




[GitHub] spark issue #20503: [SPARK-23299][SQL][PYSPARK] Fix __repr__ behaviour for R...

2018-02-08 Thread ashashwat
Github user ashashwat commented on the issue:

https://github.com/apache/spark/pull/20503
  
@HyukjinKwon Should I add more tests covering Unicode?


---




[GitHub] spark issue #20503: [SPARK-23299][SQL][PYSPARK] Fix __repr__ behaviour for R...

2018-02-04 Thread ashashwat
Github user ashashwat commented on the issue:

https://github.com/apache/spark/pull/20503
  
@HyukjinKwon `return "<Row(%s)>" % ", ".join("%s" % (fields) for fields in self)` handles all of these cases:
```

>>> Row ("aa", 11)
<Row(aa, 11)>

>>> Row (u"아", 11)
<Row(아, 11)>

>>> Row ("아", 11)
<Row(아, 11)>
```
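
For completeness, here is a minimal self-contained sketch of that formatting, using a hypothetical `MiniRow` as a simplified stand-in for `pyspark.sql.types.Row` (the real class has far more machinery):

```
# Hypothetical MiniRow: a tuple subclass standing in for pyspark's Row,
# used only to demonstrate the "%s"-based formatting on Python 3.
class MiniRow(tuple):
    def __repr__(self):
        # Format each field individually so non-string fields work too.
        return "<Row(%s)>" % ", ".join("%s" % (field,) for field in self)

print(MiniRow(("aa", 11)))   # <Row(aa, 11)>
print(MiniRow((u"아", 11)))  # <Row(아, 11)>
```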


---




[GitHub] spark issue #20503: [SPARK-23299][SQL][PYSPARK] Fix __repr__ behaviour for R...

2018-02-04 Thread ashashwat
Github user ashashwat commented on the issue:

https://github.com/apache/spark/pull/20503
  
@HyukjinKwon Here is what I tried:

```
# Code: return "<Row(%s)>" % ", ".join(fields.encode("utf8") for fields in self)
>>> Row(u"아", "11")
<Row(아, 11)>
# Fails for integer fields.

# Code: return "<Row(%s)>" % ", ".join(str(fields) for fields in self)
>>> Row(u"아", "11")
UnicodeEncodeError: 'ascii' codec can't encode character u'\uc544' in position 0: ordinal not in range(128)

# Code: return "<Row(%s)>" % ", ".join(repr(fields) for fields in self)
>>> Row(u"아", 11)
<Row(u'\uc544', 11)>

# Code: return "<Row(%s)>" % ", ".join(unicode(fields).encode("utf8") for fields in self)
>>> Row(u"아", 11)
<Row(아, 11)>
```

`repr` is definitely a better option than `str`.  But why not `unicode`?
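
(One likely reason to avoid `unicode`, sketched under the assumption that this code must run on both Python 2 and 3: the `unicode` builtin does not exist on Python 3, so any `__repr__` built on it needs a version check or a compatibility shim.)

```
import sys

def portable_format(field):
    # unicode() exists only on Python 2; shared code has to branch
    # (or use a six-style shim) before calling it.
    if sys.version_info[0] == 2:
        return unicode(field).encode("utf-8")  # noqa: F821 (Py2 only)
    return "%s" % (field,)

print(portable_format(11))  # "11" on both Python 2 and Python 3
```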


---




[GitHub] spark issue #20503: [SPARK-23299][SQL][PYSPARK] Fix __repr__ behaviour for R...

2018-02-04 Thread ashashwat
Github user ashashwat commented on the issue:

https://github.com/apache/spark/pull/20503
  
@HyukjinKwon Do you mean something like `Row(a=1, b=2, c=3)` or `Row(1="Alice", 2=11)`?  The former works fine; the latter fails with `SyntaxError: keyword can't be an expression`.
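
(Both behaviors can be reproduced as below; the invalid call goes through `eval` because a numeric keyword is rejected at parse time, and the exact message varies across Python versions:)

```
from pyspark.sql.types import Row

print(Row(a=1, b=2, c=3))  # Row(a=1, b=2, c=3)

try:
    # The parser itself rejects a numeric keyword argument.
    eval('Row(1="Alice", 2=11)')
except SyntaxError as e:
    print(e)  # e.g. keyword can't be an expression
```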


---




[GitHub] spark pull request #20503: [SPARK-23299][SQL][PYSPARK] Fix __repr__ behaviou...

2018-02-04 Thread ashashwat
GitHub user ashashwat opened a pull request:

https://github.com/apache/spark/pull/20503

[SPARK-23299][SQL][PYSPARK] Fix __repr__ behaviour for Rows.

## What changes were proposed in this pull request?

Fix `__repr__` behaviour for Rows.

Row's `__repr__` assumes every field is a string when column names are missing.
Examples:
```
>>> from pyspark.sql.types import Row
>>> Row ("Alice", "11")
<Row(Alice, 11)>

>>> Row (name="Alice", age=11)
Row(age=11, name='Alice')

>>> Row ("Alice", 11)

TypeError: sequence item 1: expected string, int found
```

This is because `Row()`, when called without column names, assumes
every field is a string.
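
The underlying mechanics can be reproduced without Spark at all; this sketch only illustrates why the join fails (Python 3 phrases the message as "expected str instance"):

```
# str.join() requires every element to already be a string, which is
# exactly the assumption the old __repr__ made about row fields.
fields = ("Alice", 11)
try:
    print("<Row(%s)>" % ", ".join(fields))
except TypeError as e:
    print(e)  # sequence item 1: expected str instance, int found
```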

## How was this patch tested?

Manually tested, and a unit test was added in `python/pyspark/sql/tests.py`.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ashashwat/spark SPARK-23299

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20503.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20503


commit 6604e9fdaa710cd894b4799390144e404667402e
Author: Shashwat Anand <me@...>
Date:   2018-02-04T10:27:31Z

Fix __repr__ behaviour for Rows.

Row's __repr__ assumes every field is a string when column names are missing.

Examples:
>>> Row("Alice", "11")
<Row(Alice, 11)>

>>> Row(name="Alice", age=11)
Row(age=11, name='Alice')

>>> Row("Alice", 11)
TypeError: sequence item 1: expected string, int found

This is because Row(), when called without column names, assumes
every field is a string.




---




[GitHub] spark pull request #20497: [MINOR][DOC] Use raw triple double quotes around ...

2018-02-03 Thread ashashwat
Github user ashashwat commented on a diff in the pull request:

https://github.com/apache/spark/pull/20497#discussion_r165813235
  
--- Diff: examples/src/main/python/streaming/hdfs_wordcount.py ---
@@ -15,7 +15,7 @@
 # limitations under the License.
 #
 
-"""
+r"""
--- End diff --

Yes.  Thanks for pointing it out.  Fixed.


---




[GitHub] spark pull request #20497: [MINOR][DOC] Use raw triple double quotes around ...

2018-02-03 Thread ashashwat
GitHub user ashashwat opened a pull request:

https://github.com/apache/spark/pull/20497

[MINOR][DOC] Use raw triple double quotes around docstrings where there are occurrences of backslashes.

From [PEP 257](https://www.python.org/dev/peps/pep-0257/):  

> For consistency, always use """triple double quotes""" around docstrings. Use r"""raw triple double quotes""" if you use any backslashes in your docstrings. For Unicode docstrings, use u"""Unicode triple-quoted strings""".


For example, this is what `help(kafka_wordcount)` shows:

```
DESCRIPTION
    Counts words in UTF8 encoded, '
' delimited text received from the network every second.
    Usage: kafka_wordcount.py <zk> <topic>

    To run this on your local machine, you need to setup Kafka and create a producer first, see
    http://kafka.apache.org/documentation.html#quickstart

    and then run the example
    `$ bin/spark-submit --jars   external/kafka-assembly/target/scala-*/spark-streaming-kafka-assembly-*.jar   examples/src/main/python/streaming/kafka_wordcount.py   localhost:2181 test`
```

This is what it shows after the fix:

```
DESCRIPTION
    Counts words in UTF8 encoded, '\n' delimited text received from the network every second.
    Usage: kafka_wordcount.py <zk> <topic>

    To run this on your local machine, you need to setup Kafka and create a producer first, see
    http://kafka.apache.org/documentation.html#quickstart

    and then run the example
    `$ bin/spark-submit --jars \
      external/kafka-assembly/target/scala-*/spark-streaming-kafka-assembly-*.jar \
      examples/src/main/python/streaming/kafka_wordcount.py \
      localhost:2181 test`
```

The thing worth noticing is that the spurious linebreak in the help output is gone.
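
The same effect can be seen in isolation; a minimal sketch with two hypothetical functions, one plain docstring and one raw:

```
def plain():
    """Counts '\n' delimited text."""   # \n is interpreted: a real newline

def raw():
    r"""Counts '\n' delimited text."""  # \n survives as two characters

print(plain.__doc__)  # Counts '
                      # ' delimited text.
print(raw.__doc__)    # Counts '\n' delimited text.
```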

## What changes were proposed in this pull request?

Change triple double quotes to raw triple double quotes where there are occurrences of backslashes in docstrings.

## How was this patch tested?

Manually as this is a doc fix.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ashashwat/spark docstring-fixes

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20497.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20497


commit 78b1de3fab8d2bd8256fbbde7b45c230432946a8
Author: Shashwat Anand <me@...>
Date:   2018-02-03T10:27:25Z

Use raw triple double quotes around docstrings to preserve backslashes.

From PEP 257:
For consistency, always use """triple double quotes""" around docstrings. Use r"""raw triple double quotes""" if you use any backslashes in your docstrings. For Unicode docstrings, use u"""Unicode triple-quoted strings""".

For example, this is what help(kafka_wordcount) shows:

NAME
kafka_wordcount

FILE

/Users/shashwatanand/Repositories/spark/examples/src/main/python/streaming/kafka_wordcount.py

DESCRIPTION
    Counts words in UTF8 encoded, '
' delimited text received from the network every second.
    Usage: kafka_wordcount.py <zk> <topic>

    To run this on your local machine, you need to setup Kafka and create a producer first, see
    http://kafka.apache.org/documentation.html#quickstart

    and then run the example
    `$ bin/spark-submit --jars   external/kafka-assembly/target/scala-*/spark-streaming-kafka-assembly-*.jar   examples/src/main/python/streaming/kafka_wordcount.py   localhost:2181 test`

This is what it shows after the fix:

NAME
kafka_wordcount

FILE

/Users/shashwatanand/Repositories/Codes/spark/examples/src/main/python/streaming/kafka_wordcount.py

DESCRIPTION
    Counts words in UTF8 encoded, '\n' delimited text received from the network every second.
    Usage: kafka_wordcount.py <zk> <topic>

    To run this on your local machine, you need to setup Kafka and create a producer first, see
    http://kafka.apache.org/documentation.html#quickstart

    and then run the example
    `$ bin/spark-submit --jars \
      external/kafka-assembly/target/scala-*/spark-streaming-kafka-assembly-*.jar \
      examples/src/main/python/streaming/kafka_wordcount.py \
      localhost:2181 test`

Notice there is no longer a spurious linebreak in the help.




---




[GitHub] spark pull request #20436: [MINOR] Fix typos in dev/* scripts.

2018-01-30 Thread ashashwat
Github user ashashwat commented on a diff in the pull request:

https://github.com/apache/spark/pull/20436#discussion_r164749320
  
--- Diff: dev/lint-python ---
@@ -60,9 +60,9 @@ export "PYLINT_HOME=$PYTHONPATH"
 export "PATH=$PYTHONPATH:$PATH"
 
 # There is no need to write this output to a file
-#+ first, but we do so so that the check status can
-#+ be output before the report, like with the
-#+ scalastyle and RAT checks.
--- End diff --

Is that so?  We have a 100-character limit per line according to the style guide.  Maybe all four lines could be rearranged?


---




[GitHub] spark issue #20436: [SPARK-23174][DOC][PYTHON] python code style checker upd...

2018-01-30 Thread ashashwat
Github user ashashwat commented on the issue:

https://github.com/apache/spark/pull/20436
  
@HyukjinKwon Let me go ahead and check all the scripts for similar instances or typos.


---




[GitHub] spark pull request #20436: [SPARK-23174][DOC][PYTHON] python code style chec...

2018-01-30 Thread ashashwat
GitHub user ashashwat opened a pull request:

https://github.com/apache/spark/pull/20436

[SPARK-23174][DOC][PYTHON] python code style checker update fix.

## What changes were proposed in this pull request?

Improve consistency in style and grammar, and remove extraneous characters.

## How was this patch tested?

Manually as this is a doc change.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ashashwat/spark SPARK-23174

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20436.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20436


commit 78cb070a88bc61a3117ecde25a25ab40157ccfe1
Author: Shashwat Anand <me@...>
Date:   2018-01-30T10:45:01Z

[SPARK-23174][DOC][PYTHON] python code style checker update fix.




---




[GitHub] spark issue #20336: [SPARK-23165][DOC] Spelling mistake fix in quick-start d...

2018-01-20 Thread ashashwat
Github user ashashwat commented on the issue:

https://github.com/apache/spark/pull/20336
  
Retest this please.


---




[GitHub] spark issue #20336: [SPARK-23165][DOC] Spelling mistake fix in quick-start d...

2018-01-20 Thread ashashwat
Github user ashashwat commented on the issue:

https://github.com/apache/spark/pull/20336
  
@srowen Let me go ahead and do that.


---




[GitHub] spark pull request #20336: [SPARK-23165][DOC] Spelling mistake fix in quick-...

2018-01-20 Thread ashashwat
GitHub user ashashwat opened a pull request:

https://github.com/apache/spark/pull/20336

[SPARK-23165][DOC] Spelling mistake fix in quick-start doc.

## What changes were proposed in this pull request?

Fix spelling in quick-start doc.

## How was this patch tested?

Doc only.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ashashwat/spark SPARK-23165

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20336.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20336


commit 785fccff1c35f93fc479d460b527bbb6fcfc00a7
Author: Shashwat Anand <me@...>
Date:   2018-01-20T14:50:44Z

[SPARK-23165][DOC] Spelling mistake fix in quick-start doc.




---
