[GitHub] spark issue #20503: [SPARK-23299][SQL][PYSPARK] Fix __repr__ behaviour for R...
Github user ashashwat commented on the issue: https://github.com/apache/spark/pull/20503

@holdenk I am on it.
[GitHub] spark issue #20503: [SPARK-23299][SQL][PYSPARK] Fix __repr__ behaviour for R...
Github user ashashwat commented on the issue: https://github.com/apache/spark/pull/20503

@HyukjinKwon Should I add more tests covering Unicode?
[GitHub] spark issue #20503: [SPARK-23299][SQL][PYSPARK] Fix __repr__ behaviour for R...
Github user ashashwat commented on the issue: https://github.com/apache/spark/pull/20503

@HyukjinKwon `return "<Row(%s)>" % ", ".join("%s" % (fields) for fields in self)` takes care of everything.

```
>>> Row("aa", 11)
<Row(aa, 11)>
>>> Row(u"아", 11)
<Row(아, 11)>
>>> Row("아", 11)
<Row(아, 11)>
```
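As a quick check outside a Spark shell, the proposed one-liner can be exercised on a hypothetical tuple-backed stand-in (`DemoRow` below is not the real pyspark class):

```python
# DemoRow is a hypothetical stand-in for pyspark.sql.types.Row,
# just to exercise the proposed __repr__ without a Spark install.

class DemoRow(tuple):
    def __repr__(self):
        # "%s" % field stringifies ints and strings alike, so mixed
        # field types no longer break the join().
        return "<Row(%s)>" % ", ".join("%s" % field for field in self)

print(repr(DemoRow(("aa", 11))))  # <Row(aa, 11)>
```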
[GitHub] spark issue #20503: [SPARK-23299][SQL][PYSPARK] Fix __repr__ behaviour for R...
Github user ashashwat commented on the issue: https://github.com/apache/spark/pull/20503

@HyukjinKwon Here is what I tried:

```
# Code: return "<Row(%s)>" % ", ".join(fields.encode("utf8") for fields in self)
>>> Row(u"아", "11")
<Row(아, 11)>
# Fails for integer fields.

# Code: return "<Row(%s)>" % ", ".join(str(fields) for fields in self)
>>> Row(u"아", "11")
UnicodeEncodeError: 'ascii' codec can't encode character u'\uc544' in position 0: ordinal not in range(128)

# Code: return "<Row(%s)>" % ", ".join(repr(fields) for fields in self)
>>> Row(u"아", 11)
<Row(u'\uc544', 11)>

# Code: return "<Row(%s)>" % ", ".join(unicode(fields).encode("utf8") for fields in self)
>>> Row(u"아", 11)
<Row(아, 11)>
```

`repr` is definitely a better option than `str`. But why not `unicode`?
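For anyone replaying these experiments outside PySpark, a minimal stand-in reproduces the trade-offs (`MyRow` below is hypothetical, not Spark's `Row`; Python 2 only, since `unicode` no longer exists in Python 3):

```python
# -*- coding: utf-8 -*-
# Python 2 only: replays the variants above on a hypothetical MyRow.

class MyRow(tuple):
    def __repr__(self):
        # repr() never raises, but escapes unicode fields: u'\uc544'.
        return "<Row(%s)>" % ", ".join(repr(f) for f in self)

row = MyRow((u"아", 11))
print(repr(row))  # <Row(u'\uc544', 11)>

try:
    ", ".join(str(f) for f in row)
except UnicodeEncodeError as e:
    print("str() fails: %s" % e)  # implicit ASCII encoding of u"아"

# unicode() shows the character itself, at the cost of being Python 2 only.
print("<Row(%s)>" % ", ".join(unicode(f).encode("utf8") for f in row))
```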
[GitHub] spark issue #20503: [SPARK-23299][SQL][PYSPARK] Fix __repr__ behaviour for R...
Github user ashashwat commented on the issue: https://github.com/apache/spark/pull/20503

@HyukjinKwon Do you mean something like `Row(a=1, b=2, c=3)` or `Row(1="Alice", 2=11)`? The former works fine; the latter fails with `SyntaxError: keyword can't be an expression`.
[GitHub] spark pull request #20503: [SPARK-23299][SQL][PYSPARK] Fix __repr__ behaviou...
GitHub user ashashwat opened a pull request: https://github.com/apache/spark/pull/20503

[SPARK-23299][SQL][PYSPARK] Fix __repr__ behaviour for Rows.

## What changes were proposed in this pull request?

Fix `__repr__` behaviour for Rows. Row's `__repr__` assumes the data is all strings when column names are missing. Examples:

```
>>> from pyspark.sql.types import Row
>>> Row("Alice", "11")
<Row(Alice, 11)>
>>> Row(name="Alice", age=11)
Row(age=11, name='Alice')
>>> Row("Alice", 11)
TypeError: sequence item 1: expected string, int found
```

This is because `Row()`, when called without column names, treats every field as a string.

## How was this patch tested?

Manually tested, and a unit test was added in `python/pyspark/sql/tests.py`.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ashashwat/spark SPARK-23299

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20503.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #20503

commit 6604e9fdaa710cd894b4799390144e404667402e
Author: Shashwat Anand <me@...>
Date: 2018-02-04T10:27:31Z

    Fix __repr__ behaviour for Rows.
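The failure is easy to reproduce without Spark; the sketch below uses a hypothetical tuple-backed `BuggyRow`, not the real pyspark class:

```python
# Minimal reproduction of the reported TypeError on a hypothetical
# tuple subclass; the real pyspark.sql.types.Row fails the same way.

class BuggyRow(tuple):
    def __repr__(self):
        # str.join() requires every item to already be a string,
        # so an int field blows up here.
        return "<Row(%s)>" % ", ".join(self)

try:
    repr(BuggyRow(("Alice", 11)))
except TypeError as e:
    print(e)  # sequence item 1: expected string, int found (Python 2 wording)
```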
[GitHub] spark pull request #20497: [MINOR][DOC] Use raw triple double quotes around ...
Github user ashashwat commented on a diff in the pull request: https://github.com/apache/spark/pull/20497#discussion_r165813235

--- Diff: examples/src/main/python/streaming/hdfs_wordcount.py ---
@@ -15,7 +15,7 @@
 # limitations under the License.
 #
-"""
+r"""
--- End diff --

Yes. Thanks for pointing it out. Fixed.
[GitHub] spark pull request #20497: [MINOR][DOC] Use raw triple double quotes around ...
GitHub user ashashwat opened a pull request: https://github.com/apache/spark/pull/20497

[MINOR][DOC] Use raw triple double quotes around docstrings where there are occurrences of backslashes.

From [PEP 257](https://www.python.org/dev/peps/pep-0257/):

> For consistency, always use """triple double quotes""" around docstrings. Use r"""raw triple double quotes""" if you use any backslashes in your docstrings. For Unicode docstrings, use u"""Unicode triple-quoted strings""".

For example, this is what help(kafka_wordcount) shows:

```
DESCRIPTION
    Counts words in UTF8 encoded, '
    ' delimited text received from the network every second.

    Usage: kafka_wordcount.py

    To run this on your local machine, you need to setup Kafka and create a producer first, see
    http://kafka.apache.org/documentation.html#quickstart
    and then run the example
    `$ bin/spark-submit --jars external/kafka-assembly/target/scala-*/spark-streaming-kafka-assembly-*.jar examples/src/main/python/streaming/kafka_wordcount.py localhost:2181 test`
```

This is what it shows after the fix:

```
DESCRIPTION
    Counts words in UTF8 encoded, '\n' delimited text received from the network every second.

    Usage: kafka_wordcount.py

    To run this on your local machine, you need to setup Kafka and create a producer first, see
    http://kafka.apache.org/documentation.html#quickstart
    and then run the example
    `$ bin/spark-submit --jars \
      external/kafka-assembly/target/scala-*/spark-streaming-kafka-assembly-*.jar \
      examples/src/main/python/streaming/kafka_wordcount.py \
      localhost:2181 test`
```

The thing worth noticing is that the unfixed help output has no line breaks in the `spark-submit` command: inside a normal docstring, each backslash-newline pair is treated as a line continuation.

## What changes were proposed in this pull request?

Change triple double quotes to raw triple double quotes when there are occurrences of backslashes in docstrings.

## How was this patch tested?

Manually, as this is a doc fix.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ashashwat/spark docstring-fixes

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20497.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #20497

commit 78b1de3fab8d2bd8256fbbde7b45c230432946a8
Author: Shashwat Anand <me@...>
Date: 2018-02-03T10:27:25Z

    Use raw triple double quotes around docstrings to escape backslashes.
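The mechanics are easy to see outside Spark; here is a self-contained sketch (both functions are made up for illustration):

```python
# Why PEP 257 wants r""" when a docstring contains backslashes:
# in a normal string literal, escape sequences such as \n are
# interpreted before the docstring is ever stored.

def plain():
    """Counts words in '\n' delimited text."""

def raw():
    r"""Counts words in '\n' delimited text."""

print(plain.__doc__)  # the \n became a real line break inside the quotes
print(raw.__doc__)    # the two characters \ and n survive verbatim
```

`help()` simply renders whatever string ended up in `__doc__`, which is why the unfixed output above shows a stray line break for `\n` and none for the continuation backslashes.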
[GitHub] spark pull request #20436: [MINOR] Fix typos in dev/* scripts.
Github user ashashwat commented on a diff in the pull request: https://github.com/apache/spark/pull/20436#discussion_r164749320

--- Diff: dev/lint-python ---
@@ -60,9 +60,9 @@
 export "PYLINT_HOME=$PYTHONPATH"
 export "PATH=$PYTHONPATH:$PATH"

 # There is no need to write this output to a file
-#+ first, but we do so so that the check status can
-#+ be output before the report, like with the
-#+ scalastyle and RAT checks.
--- End diff --

Is that so? We have a 100-character limit on a single line according to the style guide. Maybe all four lines could be rearranged?
[GitHub] spark issue #20436: [SPARK-23174][DOC][PYTHON] python code style checker upd...
Github user ashashwat commented on the issue: https://github.com/apache/spark/pull/20436

@HyukjinKwon Let me go ahead and check all the scripts for similar instances or typos.
[GitHub] spark pull request #20436: [SPARK-23174][DOC][PYTHON] python code style chec...
GitHub user ashashwat opened a pull request: https://github.com/apache/spark/pull/20436

[SPARK-23174][DOC][PYTHON] python code style checker update fix.

## What changes were proposed in this pull request?

Consistency in style and grammar, and removal of extraneous characters.

## How was this patch tested?

Manually, as this is a doc change.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ashashwat/spark SPARK-23174

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20436.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #20436

commit 78cb070a88bc61a3117ecde25a25ab40157ccfe1
Author: Shashwat Anand <me@...>
Date: 2018-01-30T10:45:01Z

    [SPARK-23174][DOC][PYTHON] python code style checker update fix.
[GitHub] spark issue #20336: [SPARK-23165][DOC] Spelling mistake fix in quick-start d...
Github user ashashwat commented on the issue: https://github.com/apache/spark/pull/20336

Retest this please.
[GitHub] spark issue #20336: [SPARK-23165][DOC] Spelling mistake fix in quick-start d...
Github user ashashwat commented on the issue: https://github.com/apache/spark/pull/20336

@srowen Let me go ahead and do that.
[GitHub] spark pull request #20336: [SPARK-23165][DOC] Spelling mistake fix in quick-...
GitHub user ashashwat opened a pull request: https://github.com/apache/spark/pull/20336

[SPARK-23165][DOC] Spelling mistake fix in quick-start doc.

## What changes were proposed in this pull request?

Fix spelling in quick-start doc.

## How was this patch tested?

Doc only.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ashashwat/spark SPARK-23165

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20336.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #20336

commit 785fccff1c35f93fc479d460b527bbb6fcfc00a7
Author: Shashwat Anand <me@...>
Date: 2018-01-20T14:50:44Z

    [SPARK-23165][DOC] Spelling mistake fix in quick-start doc.