[GitHub] spark issue #19833: [SPARK-22605][SQL] SQL write job should also set Spark t...

2017-11-28 Thread JasonMWhite
Github user JasonMWhite commented on the issue: https://github.com/apache/spark/pull/19833 Thanks @cloud-fan! --- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

[GitHub] spark issue #17293: [SPARK-19950][SQL] Fix to ignore nullable when df.load()...

2017-06-28 Thread JasonMWhite
Github user JasonMWhite commented on the issue: https://github.com/apache/spark/pull/17293 Reliable nullability information is about far more than non-nullable optimization to us. I would happily opt in to any performance penalty that validated that non-nullable columns were actually...

[GitHub] spark issue #17293: [SPARK-19950][SQL] Fix to ignore nullable when df.load()...

2017-06-27 Thread JasonMWhite
Github user JasonMWhite commented on the issue: https://github.com/apache/spark/pull/17293 This issue causes a lot of headaches for us when picking up parquet datasets. To get around this issue, we write the schema alongside the parquet files in a side-band, and then when loading...

[GitHub] spark issue #17774: [SPARK-18371][Streaming] Spark Streaming backpressure ge...

2017-05-02 Thread JasonMWhite
Github user JasonMWhite commented on the issue: https://github.com/apache/spark/pull/17774 LGTM --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well.

[GitHub] spark issue #17774: [SPARK-18371][Streaming] Spark Streaming backpressure ge...

2017-04-27 Thread JasonMWhite
Github user JasonMWhite commented on the issue: https://github.com/apache/spark/pull/17774 Tests have some fairly repetitive code, but not sure if that's a problem or not. Looks good to me.

[GitHub] spark pull request #17774: [SPARK-18371][Streaming] Spark Streaming backpres...

2017-04-27 Thread JasonMWhite
Github user JasonMWhite commented on a diff in the pull request: https://github.com/apache/spark/pull/17774#discussion_r113753764 --- Diff: external/kafka-0-10/src/test/scala/org/apache/spark/streaming/kafka010/DirectKafkaStreamSuite.scala --- @@ -617,6 +617,94 @@ class

[GitHub] spark issue #17774: [SPARK-18371][Streaming] Spark Streaming backpressure ge...

2017-04-27 Thread JasonMWhite
Github user JasonMWhite commented on the issue: https://github.com/apache/spark/pull/17774 I think @koeninger's suggestion is valid. `effectiveRateLimitPerPartition` is the upper bound on the number of messages per partition per second, and `maxMessagesPerPartition` sets an...
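The relationship described in this comment can be sketched in Python. This is a hypothetical simplification of the Scala logic inside the Kafka direct stream, not Spark's actual code; the function name and dict shapes are assumptions made for illustration:

```python
def max_messages_per_partition(rate_limit_per_partition, batch_interval_secs):
    """Cap the number of messages fetched from each partition in one batch.

    `rate_limit_per_partition` maps partition -> messages/second (the
    effective per-partition rate limit); the per-batch cap is simply that
    rate multiplied by the batch interval.
    """
    return {
        partition: int(rate * batch_interval_secs)
        for partition, rate in rate_limit_per_partition.items()
    }

# A 2-second batch with per-partition limits of 100 and 50 msgs/sec:
caps = max_messages_per_partition({0: 100.0, 1: 50.0}, batch_interval_secs=2)
# → {0: 200, 1: 100}
```

The point of the upper bound is that backpressure may lower the rate below the configured maximum, but never raise it above it.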

[GitHub] spark issue #17774: [SPARK-18371][Streaming] Spark Streaming backpressure ge...

2017-04-26 Thread JasonMWhite
Github user JasonMWhite commented on the issue: https://github.com/apache/spark/pull/17774 Code looks sound. Could you add or modify a test to illustrate/verify?

[GitHub] spark issue #15918: [SPARK-18122][SQL][WIP]Fallback to Kryo for unsupported ...

2017-03-12 Thread JasonMWhite
Github user JasonMWhite commented on the issue: https://github.com/apache/spark/pull/15918 @windpiger I see you flipped it back to `WIP`. What else needs to be done?

[GitHub] spark issue #17200: [SPARK-19561][SQL] add int case handling for TimestampTy...

2017-03-09 Thread JasonMWhite
Github user JasonMWhite commented on the issue: https://github.com/apache/spark/pull/17200 Thanks @cloud-fan!

[GitHub] spark pull request #17200: [SPARK-19561][Python] cast TimestampType.toIntern...

2017-03-08 Thread JasonMWhite
Github user JasonMWhite commented on a diff in the pull request: https://github.com/apache/spark/pull/17200#discussion_r105083494 --- Diff: python/pyspark/sql/types.py --- @@ -189,7 +189,7 @@ def toInternal(self, dt): if dt is not None: seconds

[GitHub] spark pull request #17200: [SPARK-19561][Python] cast TimestampType.toIntern...

2017-03-08 Thread JasonMWhite
Github user JasonMWhite commented on a diff in the pull request: https://github.com/apache/spark/pull/17200#discussion_r104919956 --- Diff: python/pyspark/sql/types.py --- @@ -189,7 +189,7 @@ def toInternal(self, dt): if dt is not None: seconds

[GitHub] spark pull request #17200: [SPARK-19561][Python] cast TimestampType.toIntern...

2017-03-07 Thread JasonMWhite
Github user JasonMWhite commented on a diff in the pull request: https://github.com/apache/spark/pull/17200#discussion_r104856801 --- Diff: python/pyspark/sql/types.py --- @@ -189,7 +189,7 @@ def toInternal(self, dt): if dt is not None: seconds

[GitHub] spark pull request #17200: [SPARK-19561][Python] cast TimestampType.toIntern...

2017-03-07 Thread JasonMWhite
Github user JasonMWhite commented on a diff in the pull request: https://github.com/apache/spark/pull/17200#discussion_r104856310 --- Diff: python/pyspark/sql/types.py --- @@ -189,7 +189,7 @@ def toInternal(self, dt): if dt is not None: seconds

[GitHub] spark issue #17200: [SPARK-19561][Python] cast TimestampType.toInternal outp...

2017-03-07 Thread JasonMWhite
Github user JasonMWhite commented on the issue: https://github.com/apache/spark/pull/17200 Ah, test failed on Python 3.4 only. That makes some sense, I only tested locally on 2.6, and there are changes with how Python 3 handles ints vs longs. I'll dig in with Python 3.4 and see...

[GitHub] spark pull request #17200: [SPARK-19561][Python] cast TimestampType.toIntern...

2017-03-07 Thread JasonMWhite
Github user JasonMWhite commented on a diff in the pull request: https://github.com/apache/spark/pull/17200#discussion_r104847039 --- Diff: python/pyspark/sql/tests.py --- @@ -1435,6 +1435,12 @@ def test_time_with_timezone(self): self.assertEqual(now, now1

[GitHub] spark pull request #17200: [SPARK-19561][Python] cast TimestampType.toIntern...

2017-03-07 Thread JasonMWhite
GitHub user JasonMWhite opened a pull request: https://github.com/apache/spark/pull/17200 [SPARK-19561][Python] cast TimestampType.toInternal output to long ## What changes were proposed in this pull request? Cast the output of `TimestampType.toInternal` to long to allow...
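The underlying issue is easy to reproduce with the standard library: `time.mktime` returns a float, so naive arithmetic on it yields a float where an integral microsecond timestamp is expected. The snippet below is a sketch of the idea behind the fix, not Spark's actual `toInternal` code; the variable names and the exact conversion are simplified stand-ins:

```python
import datetime
import time

dt = datetime.datetime(2017, 3, 7, 12, 0, 0, 123456)
seconds = time.mktime(dt.timetuple())  # float (e.g. 1488906000.0)

# Float arithmetic produces the wrong type for an internal microsecond value:
naive = seconds * 1e6 + dt.microsecond

# Casting to an integer first keeps the arithmetic exact and integral:
fixed = int(seconds) * 1000000 + dt.microsecond

assert isinstance(naive, float)
assert isinstance(fixed, int)
assert fixed % 1000000 == dt.microsecond  # microseconds preserved exactly
```

On Python 2, `int` values silently promote to `long` as needed, which is why an explicit cast to the integral type was the whole fix.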

[GitHub] spark issue #16896: [SPARK-19561][Python] cast TimestampType.toInternal outp...

2017-03-06 Thread JasonMWhite
Github user JasonMWhite commented on the issue: https://github.com/apache/spark/pull/16896 This PR is pretty tiny, and corrects a problem that led to corrupt Parquet files in our case. Can anyone spare a minute to review?

[GitHub] spark issue #16896: [SPARK-19561][Python] cast TimestampType.toInternal outp...

2017-02-21 Thread JasonMWhite
Github user JasonMWhite commented on the issue: https://github.com/apache/spark/pull/16896 Ping @davies

[GitHub] spark issue #16896: [SPARK-19561][Python] cast TimestampType.toInternal outp...

2017-02-15 Thread JasonMWhite
Github user JasonMWhite commented on the issue: https://github.com/apache/spark/pull/16896 Modified as suggested. Don't think this has been through CI at all yet.

[GitHub] spark pull request #16896: [SPARK-19561][Python] cast TimestampType.toIntern...

2017-02-11 Thread JasonMWhite
Github user JasonMWhite commented on a diff in the pull request: https://github.com/apache/spark/pull/16896#discussion_r100681604 --- Diff: python/pyspark/sql/tests.py --- @@ -1435,6 +1435,12 @@ def test_time_with_timezone(self): self.assertEqual(now, now1

[GitHub] spark pull request #16896: [SPARK-19561][Python] cast TimestampType.toIntern...

2017-02-11 Thread JasonMWhite
GitHub user JasonMWhite opened a pull request: https://github.com/apache/spark/pull/16896 [SPARK-19561][Python] cast TimestampType.toInternal output to long ## What changes were proposed in this pull request? Cast the output of `TimestampType.toInternal` to long to allow...

[GitHub] spark issue #15254: [SPARK-17679] [PYSPARK] remove unnecessary Py4J ListConv...

2016-09-27 Thread JasonMWhite
Github user JasonMWhite commented on the issue: https://github.com/apache/spark/pull/15254 @JoshRosen @lins05 As requested, I've removed all remaining explicit mentions of `ListConverter` and `MapConverter` as they seemed to all be doing the same thing - getting around...

[GitHub] spark issue #15254: [SPARK-17679] [PYSPARK] remove unnecessary Py4J ListConv...

2016-09-26 Thread JasonMWhite
Github user JasonMWhite commented on the issue: https://github.com/apache/spark/pull/15254 @davies you authored https://github.com/apache/spark/pull/5570 and reported the issue in Py4J https://github.com/bartdag/py4j/issues/160. I happened across this while spelunking through Py4J...

[GitHub] spark pull request #15254: [SPARK-17679] [PYSPARK] remove unnecessary Py4J L...

2016-09-26 Thread JasonMWhite
GitHub user JasonMWhite opened a pull request: https://github.com/apache/spark/pull/15254 [SPARK-17679] [PYSPARK] remove unnecessary Py4J ListConverter patch ## What changes were proposed in this pull request? This PR removes a patch on ListConverter from https...

[GitHub] spark pull request: [SPARK-14700] [Python] adding SQL Row equality...

2016-04-18 Thread JasonMWhite
Github user JasonMWhite commented on the pull request: https://github.com/apache/spark/pull/12470#issuecomment-211712175 After some experimentation on the Scala side, I see that this PR would be a significant departure from Scala's interface, where the schema and field names are...

[GitHub] spark pull request: [SPARK-14700] [Python] adding SQL Row equality...

2016-04-18 Thread JasonMWhite
Github user JasonMWhite closed the pull request at: https://github.com/apache/spark/pull/12470

[GitHub] spark pull request: [SPARK-14700] [Python] adding SQL Row equality...

2016-04-18 Thread JasonMWhite
Github user JasonMWhite commented on the pull request: https://github.com/apache/spark/pull/12470#issuecomment-211635125 Hmm, I see what you mean.

```
r1 = df.select('a', 'b', df['b'].alias('a')).collect()[0]
r1  # Row(a=1, b=2, a=2)
```

[GitHub] spark pull request: [SPARK-14700] [Python] adding SQL Row equality...

2016-04-18 Thread JasonMWhite
Github user JasonMWhite commented on the pull request: https://github.com/apache/spark/pull/12470#issuecomment-211600203 The "few bugs" are included in this PR, all column naming issues. They're in `python/pyspark/sql/tests.py`. Could you provide an example a...

[GitHub] spark pull request: [SPARK-14700] [Python] adding SQL Row equality...

2016-04-18 Thread JasonMWhite
Github user JasonMWhite commented on a diff in the pull request: https://github.com/apache/spark/pull/12470#discussion_r60134541 --- Diff: python/pyspark/sql/types.py --- @@ -1448,6 +1448,54 @@ def __repr__(self): ...

[GitHub] spark pull request: [SPARK-14700] [Python] adding SQL Row equality...

2016-04-18 Thread JasonMWhite
Github user JasonMWhite commented on a diff in the pull request: https://github.com/apache/spark/pull/12470#discussion_r60122211 --- Diff: python/pyspark/sql/types.py --- @@ -1448,6 +1448,54 @@ def __repr__(self): ...

[GitHub] spark pull request: [SPARK-14700] [Python] adding SQL Row equality...

2016-04-18 Thread JasonMWhite
GitHub user JasonMWhite opened a pull request: https://github.com/apache/spark/pull/12470 [SPARK-14700] [Python] adding SQL Row equality and inequality overrides ## What changes were proposed in this pull request? This PR adds equality and inequality overrides to...

[GitHub] spark pull request: [SPARK-12073] [Streaming] backpressure rate co...

2016-03-02 Thread JasonMWhite
Github user JasonMWhite commented on the pull request: https://github.com/apache/spark/pull/10089#issuecomment-191159802 All comments addressed, builds cleanly, all tests passing. GTM?

[GitHub] spark pull request: [SPARK-12073] [Streaming] backpressure rate co...

2016-03-01 Thread JasonMWhite
Github user JasonMWhite commented on the pull request: https://github.com/apache/spark/pull/10089#issuecomment-191113140 `DirectKafkaStreamSuite` passes all tests for me locally, but the test failure above appeared to be on an outdated sha. Addressed @zsxwing's comments...

[GitHub] spark pull request: [SPARK-12073] [Streaming] backpressure rate co...

2016-02-18 Thread JasonMWhite
Github user JasonMWhite commented on the pull request: https://github.com/apache/spark/pull/10089#issuecomment-186037636 It looks like this PySpark unit test failure above was prior to my commit to add a backwards-compatible single-argument version as suggested. Could someone kick it...

[GitHub] spark pull request: [SPARK-12073] [Streaming] backpressure rate co...

2016-02-17 Thread JasonMWhite
Github user JasonMWhite commented on the pull request: https://github.com/apache/spark/pull/10089#issuecomment-185521431 A single-argument version is easy enough, and good to support backwards-compatibility anyway. I haven't been able to get PySpark tests running locally yet, s...

[GitHub] spark pull request: [SPARK-12073] [Streaming] backpressure rate co...

2016-02-17 Thread JasonMWhite
Github user JasonMWhite commented on the pull request: https://github.com/apache/spark/pull/10089#issuecomment-185517414 Changes addressed, but looks like it's causing PySpark unit tests to fail. Investigating...

[GitHub] spark pull request: [SPARK-12073] [Streaming] backpressure rate co...

2016-02-17 Thread JasonMWhite
Github user JasonMWhite commented on a diff in the pull request: https://github.com/apache/spark/pull/10089#discussion_r53265282 --- Diff: external/kafka/src/test/scala/org/apache/spark/streaming/kafka/DirectKafkaStreamSuite.scala --- @@ -353,10 +353,52 @@ class

[GitHub] spark pull request: [SPARK-12073] [Streaming] backpressure rate co...

2016-02-17 Thread JasonMWhite
Github user JasonMWhite commented on a diff in the pull request: https://github.com/apache/spark/pull/10089#discussion_r53265248 --- Diff: external/kafka/src/main/scala/org/apache/spark/streaming/kafka/DirectKafkaInputDStream.scala --- @@ -89,23 +89,32 @@ class

[GitHub] spark pull request: [SPARK-12073] [Streaming] backpressure rate co...

2016-02-17 Thread JasonMWhite
Github user JasonMWhite commented on the pull request: https://github.com/apache/spark/pull/10089#issuecomment-185478351 I added unit tests for `maxMessagesPerPartition` to cover 3 edge cases that have been raised here:
- when backpressure is disabled, simply uses the...

[GitHub] spark pull request: [SPARK-12073] [Streaming] backpressure rate co...

2016-02-16 Thread JasonMWhite
Github user JasonMWhite commented on the pull request: https://github.com/apache/spark/pull/10089#issuecomment-184916602 Sorry, did I put the MiMa exclusion in the wrong section of the file?

[GitHub] spark pull request: [SPARK-12073] [Streaming] backpressure rate co...

2016-02-16 Thread JasonMWhite
Github user JasonMWhite commented on the pull request: https://github.com/apache/spark/pull/10089#issuecomment-184900516 @koeninger I've added the two error messages from Jenkins to the MimaExcludes, under the v2.0 section. There was one from the test function, but this PR...

[GitHub] spark pull request: [SPARK-12073] [Streaming] backpressure rate co...

2016-02-09 Thread JasonMWhite
Github user JasonMWhite commented on the pull request: https://github.com/apache/spark/pull/10089#issuecomment-182008116 Thanks!

[GitHub] spark pull request: [SPARK-12073] [Streaming] backpressure rate co...

2016-02-09 Thread JasonMWhite
Github user JasonMWhite commented on the pull request: https://github.com/apache/spark/pull/10089#issuecomment-181990835 I'm not sure how to handle the mima test failure, could you point me to where to add the exclude? I'll rebase and add the exception.

[GitHub] spark pull request: [SPARK-12073] [Streaming] backpressure rate co...

2015-12-17 Thread JasonMWhite
Github user JasonMWhite commented on the pull request: https://github.com/apache/spark/pull/10089#issuecomment-165518544 Is this a CI failure? I don't have access rights to see what happened. I'd like to get this ready to merge, could someone give me a hand?

[GitHub] spark pull request: [SPARK-12073] [Streaming] backpressure rate co...

2015-12-11 Thread JasonMWhite
Github user JasonMWhite commented on a diff in the pull request: https://github.com/apache/spark/pull/10089#discussion_r47363415 --- Diff: external/kafka/src/test/scala/org/apache/spark/streaming/kafka/DirectKafkaStreamSuite.scala --- @@ -364,8 +365,8 @@ class

[GitHub] spark pull request: [SPARK-12073] [Streaming] backpressure rate co...

2015-12-09 Thread JasonMWhite
Github user JasonMWhite commented on the pull request: https://github.com/apache/spark/pull/10089#issuecomment-163427146 Could someone verify the patch please? @tdas perhaps?

[GitHub] spark pull request: [SPARK-12073] [Streaming] backpressure rate co...

2015-12-05 Thread JasonMWhite
Github user JasonMWhite commented on the pull request: https://github.com/apache/spark/pull/10089#issuecomment-162260106 `maxRatePerPartition` now respects the limit set by `maxRateLimitPerPartition`, if it is set. Let me know if you think any additional tests are needed.

[GitHub] spark pull request: [SPARK-12073] [Streaming] backpressure rate co...

2015-12-05 Thread JasonMWhite
Github user JasonMWhite commented on the pull request: https://github.com/apache/spark/pull/10089#issuecomment-162250749 This patch solved our skew problem. Below is a 15-minute snapshot of our lag earlier this week, showing a single partition getting slowly worse. It would get to...
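The idea behind SPARK-12073 — give lagging partitions a proportionally larger share of each batch so skew corrects itself — can be sketched as follows. This is a hypothetical Python simplification; Spark's implementation lives in the Scala Kafka direct stream, and the function and parameter names here are made up for illustration:

```python
def allocate_by_lag(lag_per_partition, batch_budget):
    """Split a batch's message budget across partitions in proportion to lag.

    A partition that is further behind receives a larger share, so a skewed
    partition catches up instead of falling progressively further behind.
    """
    total_lag = sum(lag_per_partition.values())
    if total_lag == 0:
        # nothing is behind; no preferential allocation needed
        return {p: 0 for p in lag_per_partition}
    return {
        p: int(batch_budget * lag / total_lag)
        for p, lag in lag_per_partition.items()
    }

# Partition 2 is far behind, so it receives most of the batch budget:
allocate_by_lag({0: 100, 1: 100, 2: 800}, batch_budget=1000)
# → {0: 100, 1: 100, 2: 800}
```

Contrast this with an even split, where the lagging partition would receive the same allocation as the healthy ones and never recover.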

[GitHub] spark pull request: [SPARK-12073] [Streaming] backpressure rate co...

2015-12-05 Thread JasonMWhite
Github user JasonMWhite commented on a diff in the pull request: https://github.com/apache/spark/pull/10089#discussion_r46763463 --- Diff: external/kafka/src/main/scala/org/apache/spark/streaming/kafka/DirectKafkaInputDStream.scala --- @@ -89,23 +89,29 @@ class

[GitHub] spark pull request: [SPARK-12073] [Streaming] backpressure rate co...

2015-12-02 Thread JasonMWhite
Github user JasonMWhite commented on a diff in the pull request: https://github.com/apache/spark/pull/10089#discussion_r46449516 --- Diff: external/kafka/src/test/scala/org/apache/spark/streaming/kafka/DirectKafkaStreamSuite.scala --- @@ -36,9 +36,10 @@ import

[GitHub] spark pull request: [SPARK-12073] [Streaming] backpressure rate co...

2015-12-02 Thread JasonMWhite
Github user JasonMWhite commented on a diff in the pull request: https://github.com/apache/spark/pull/10089#discussion_r46449571 --- Diff: external/kafka/src/main/scala/org/apache/spark/streaming/kafka/DirectKafkaInputDStream.scala --- @@ -89,23 +89,29 @@ class

[GitHub] spark pull request: [SPARK-12073] [Streaming] backpressure rate co...

2015-12-02 Thread JasonMWhite
Github user JasonMWhite commented on a diff in the pull request: https://github.com/apache/spark/pull/10089#discussion_r46449467 --- Diff: external/kafka/src/test/scala/org/apache/spark/streaming/kafka/DirectKafkaStreamSuite.scala --- @@ -364,8 +365,8 @@ class

[GitHub] spark pull request: [SPARK-12073] [Streaming] backpressure rate co...

2015-12-01 Thread JasonMWhite
GitHub user JasonMWhite opened a pull request: https://github.com/apache/spark/pull/10089 [SPARK-12073] [Streaming] backpressure rate controller consumes events preferentially from lagging partitions I'm pretty sure this is the reason we couldn't easily re...

[GitHub] spark pull request: [SPARK-11437] [PySpark] Don't .take when conve...

2015-10-31 Thread JasonMWhite
Github user JasonMWhite commented on the pull request: https://github.com/apache/spark/pull/9392#issuecomment-152774426 Can one of the admins verify this patch?

[GitHub] spark pull request: [SPARK-11437] [PySpark] Don't .take when conve...

2015-10-31 Thread JasonMWhite
GitHub user JasonMWhite opened a pull request: https://github.com/apache/spark/pull/9392 [SPARK-11437] [PySpark] Don't .take when converting RDD to DataFrame with provided schema When creating a DataFrame from an RDD in PySpark, `createDataFrame` calls `.take(10)` to verify...
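Why the `.take(10)` matters: materializing even a handful of rows forces computation of the RDD, which the provided-schema path can skip entirely. The toy model below illustrates the two code paths in pure Python; it is not Spark's API, and `create_dataframe` and its parameters are invented for this sketch:

```python
def create_dataframe(rows, schema=None, sample_size=10):
    """Toy model of createDataFrame's schema handling (not Spark's API).

    Without a schema, some rows must be inspected to infer field types -
    the analogue of Spark's `.take(10)`. With an explicit schema, no row
    needs to be touched up front.
    """
    if schema is None:
        sample = rows[:sample_size]  # the costly ".take(10)" equivalent
        schema = {name: type(value).__name__ for name, value in sample[0].items()}
    return schema

# Inference must sample rows; an explicit schema avoids touching any:
create_dataframe([{"a": 1, "b": "x"}])     # → {'a': 'int', 'b': 'str'}
create_dataframe([], schema={"a": "int"})  # → {'a': 'int'}, no rows read
```

For a lazily computed RDD that is expensive to produce, skipping the sampling step avoids triggering a whole extra job just to double-check a schema the caller already supplied.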