Github user JasonMWhite commented on the issue:
https://github.com/apache/spark/pull/19833
Thanks @cloud-fan !
Github user JasonMWhite commented on the issue:
https://github.com/apache/spark/pull/17293
For us, reliable nullability information is about far more than the non-nullable
optimization. I would happily opt in to any performance penalty that
validated that non-nullable columns were actually
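A minimal PySpark sketch of the kind of validation being asked for, assuming a hypothetical dataset path; this is an illustration, not anything in Spark itself:
```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("/data/events")  # hypothetical path

# For every field the schema declares non-nullable, count actual nulls
# and fail fast if the data contradicts the declared schema.
for field in df.schema.fields:
    if not field.nullable:
        nulls = df.filter(F.col(field.name).isNull()).count()
        if nulls > 0:
            raise ValueError("column %s is declared non-nullable but "
                             "contains %d nulls" % (field.name, nulls))
```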
Github user JasonMWhite commented on the issue:
https://github.com/apache/spark/pull/17293
This issue causes a lot of headaches for us when picking up Parquet
datasets. To work around it, we write the schema alongside the Parquet
files in a side-band, and then when loading
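A rough sketch of that side-band workaround, assuming hypothetical paths and an existing `df` and `spark` from context:
```python
import json
from pyspark.sql.types import StructType

# On write: persist the exact schema next to the Parquet output.
df.write.parquet("/data/events")
with open("/data/events.schema.json", "w") as f:
    f.write(df.schema.json())

# On read: apply the saved schema rather than trusting what the
# Parquet files themselves report.
with open("/data/events.schema.json") as f:
    schema = StructType.fromJson(json.load(f))
restored = spark.read.schema(schema).parquet("/data/events")
```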
Github user JasonMWhite commented on the issue:
https://github.com/apache/spark/pull/17774
LGTM
Github user JasonMWhite commented on the issue:
https://github.com/apache/spark/pull/17774
The tests have some fairly repetitive code, but I'm not sure whether that's a
problem. Looks good to me.
Github user JasonMWhite commented on a diff in the pull request:
https://github.com/apache/spark/pull/17774#discussion_r113753764
--- Diff:
external/kafka-0-10/src/test/scala/org/apache/spark/streaming/kafka010/DirectKafkaStreamSuite.scala
---
@@ -617,6 +617,94 @@ class
Github user JasonMWhite commented on the issue:
https://github.com/apache/spark/pull/17774
I think @koeninger's suggestion is valid. `effectiveRateLimitPerPartition`
is the upper bound on the number of messages per partition per second, and
`maxMessagesPerPartition` sets an
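As a rough illustration of that relationship (the function below is hypothetical; the real logic lives in `DirectKafkaInputDStream`, in Scala):
```python
def max_messages_per_partition(effective_rate_limit_per_partition,
                               batch_interval_secs):
    # A per-second, per-partition rate cap times the batch duration
    # bounds how many messages one partition contributes to a batch.
    return int(effective_rate_limit_per_partition * batch_interval_secs)

# e.g. a cap of 1000 msg/s/partition with 5-second batches allows
# at most 5000 messages per partition per batch:
assert max_messages_per_partition(1000, 5) == 5000
```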
Github user JasonMWhite commented on the issue:
https://github.com/apache/spark/pull/17774
Code looks sound. Could you add or modify a test to illustrate/verify?
Github user JasonMWhite commented on the issue:
https://github.com/apache/spark/pull/15918
@windpiger I see you flipped it back to `WIP`. What else needs to be done?
Github user JasonMWhite commented on the issue:
https://github.com/apache/spark/pull/17200
Thanks @cloud-fan !
Github user JasonMWhite commented on a diff in the pull request:
https://github.com/apache/spark/pull/17200#discussion_r105083494
--- Diff: python/pyspark/sql/types.py ---
@@ -189,7 +189,7 @@ def toInternal(self, dt):
if dt is not None:
seconds
Github user JasonMWhite commented on a diff in the pull request:
https://github.com/apache/spark/pull/17200#discussion_r104919956
--- Diff: python/pyspark/sql/types.py ---
@@ -189,7 +189,7 @@ def toInternal(self, dt):
if dt is not None:
seconds
Github user JasonMWhite commented on a diff in the pull request:
https://github.com/apache/spark/pull/17200#discussion_r104856801
--- Diff: python/pyspark/sql/types.py ---
@@ -189,7 +189,7 @@ def toInternal(self, dt):
if dt is not None:
seconds
Github user JasonMWhite commented on a diff in the pull request:
https://github.com/apache/spark/pull/17200#discussion_r104856310
--- Diff: python/pyspark/sql/types.py ---
@@ -189,7 +189,7 @@ def toInternal(self, dt):
if dt is not None:
seconds
Github user JasonMWhite commented on the issue:
https://github.com/apache/spark/pull/17200
Ah, the test failed on Python 3.4 only. That makes some sense: I only tested
locally on 2.6, and Python 3 changed how ints and longs are handled.
I'll dig in with Python 3.4 and see
Github user JasonMWhite commented on a diff in the pull request:
https://github.com/apache/spark/pull/17200#discussion_r104847039
--- Diff: python/pyspark/sql/tests.py ---
@@ -1435,6 +1435,12 @@ def test_time_with_timezone(self):
self.assertEqual(now, now1
GitHub user JasonMWhite opened a pull request:
https://github.com/apache/spark/pull/17200
[SPARK-19561][Python] cast TimestampType.toInternal output to long
## What changes were proposed in this pull request?
Cast the output of `TimestampType.toInternal` to long to allow
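A simplified sketch of the cast being described, modeled on the shape of `toInternal` in `python/pyspark/sql/types.py` but not the exact patch:
```python
import calendar
import time

def to_internal(dt):
    if dt is not None:
        # time.mktime returns a float, so without an explicit cast the
        # result could be a float, which downstream consumers such as
        # Parquet writers cannot handle as a timestamp.
        seconds = (calendar.timegm(dt.utctimetuple()) if dt.tzinfo
                   else time.mktime(dt.timetuple()))
        return int(seconds * 1e6 + dt.microsecond)
```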
Github user JasonMWhite commented on the issue:
https://github.com/apache/spark/pull/16896
This PR is pretty tiny, and corrects a problem that led to corrupt Parquet
files in our case. Can anyone spare a minute to review?
Github user JasonMWhite commented on the issue:
https://github.com/apache/spark/pull/16896
Ping @davies
Github user JasonMWhite commented on the issue:
https://github.com/apache/spark/pull/16896
Modified as suggested. Don't think this has been through CI at all yet.
Github user JasonMWhite commented on a diff in the pull request:
https://github.com/apache/spark/pull/16896#discussion_r100681604
--- Diff: python/pyspark/sql/tests.py ---
@@ -1435,6 +1435,12 @@ def test_time_with_timezone(self):
self.assertEqual(now, now1
GitHub user JasonMWhite opened a pull request:
https://github.com/apache/spark/pull/16896
[SPARK-19561][Python] cast TimestampType.toInternal output to long
## What changes were proposed in this pull request?
Cast the output of `TimestampType.toInternal` to long to allow
Github user JasonMWhite commented on the issue:
https://github.com/apache/spark/pull/15254
@JoshRosen @lins05 As requested, I've removed all remaining explicit
mentions of `ListConverter` and `MapConverter`, as they all seemed to be doing
the same thing: getting around
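For context, the Py4J feature that makes those explicit converters redundant, sketched against a hypothetical running gateway (`auto_convert` is real Py4J API; the Java call is just an example):
```python
from py4j.java_gateway import JavaGateway, GatewayParameters

# With auto_convert=True, Py4J converts Python lists, sets, and dicts
# into their Java counterparts as they cross the gateway, which is
# what the explicit ListConverter/MapConverter usage did by hand.
gateway = JavaGateway(gateway_parameters=GatewayParameters(auto_convert=True))

# A plain Python list can now be passed where a java.util.Collection
# is expected (requires a running Java gateway):
gateway.jvm.java.util.Collections.frequency([1, 2, 2, 3], 2)  # -> 2
```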
Github user JasonMWhite commented on the issue:
https://github.com/apache/spark/pull/15254
@davies you authored https://github.com/apache/spark/pull/5570 and reported
the issue in Py4J https://github.com/bartdag/py4j/issues/160. I happened across
this while spelunking through Py4J
GitHub user JasonMWhite opened a pull request:
https://github.com/apache/spark/pull/15254
[SPARK-17679] [PYSPARK] remove unnecessary Py4J ListConverter patch
## What changes were proposed in this pull request?
This PR removes a patch on ListConverter from
https
Github user JasonMWhite commented on the pull request:
https://github.com/apache/spark/pull/12470#issuecomment-211712175
After some experimentation on the Scala side, I see that this PR would be a
significant departure from Scala's interface, where the schema and field names
ar
Github user JasonMWhite closed the pull request at:
https://github.com/apache/spark/pull/12470
Github user JasonMWhite commented on the pull request:
https://github.com/apache/spark/pull/12470#issuecomment-211635125
Hmm, I see what you mean.
```
r1 = df.select('a', 'b', df['b'].alias('a')).collect()[0]
r1 # Row(a=1, b=2, a=2)
```
Github user JasonMWhite commented on the pull request:
https://github.com/apache/spark/pull/12470#issuecomment-211600203
The "few bugs" are included in this PR, all column naming issues. They're
in `python/pyspark/sql/tests.py`
Could you provide an example a
Github user JasonMWhite commented on a diff in the pull request:
https://github.com/apache/spark/pull/12470#discussion_r60134541
--- Diff: python/pyspark/sql/types.py ---
@@ -1448,6 +1448,54 @@ def __repr__(self):
else:
return "" % &quo
Github user JasonMWhite commented on a diff in the pull request:
https://github.com/apache/spark/pull/12470#discussion_r60122211
--- Diff: python/pyspark/sql/types.py ---
@@ -1448,6 +1448,54 @@ def __repr__(self):
else:
return "" % &quo
GitHub user JasonMWhite opened a pull request:
https://github.com/apache/spark/pull/12470
[SPARK-14700] [Python] adding SQL Row equality and inequality overrides
## What changes were proposed in this pull request?
This PR adds equality and inequality overrides to
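An illustrative sketch of field-aware `Row` equality, simplified from scratch rather than taken from the PR:
```python
class Row(tuple):
    # Simplified stand-in for pyspark.sql.Row, for illustration only.
    def __new__(cls, **kwargs):
        names, values = zip(*kwargs.items())
        row = tuple.__new__(cls, values)
        row.__fields__ = list(names)
        return row

    def __eq__(self, other):
        # Compare field names as well as values, so rows with equal
        # values under different column names do not compare equal.
        return (isinstance(other, Row)
                and self.__fields__ == other.__fields__
                and tuple(self) == tuple(other))

    def __ne__(self, other):
        return not self.__eq__(other)

    __hash__ = tuple.__hash__  # keep Row hashable after overriding __eq__

assert Row(a=1, b=2) == Row(a=1, b=2)
assert Row(a=1, b=2) != Row(x=1, y=2)
```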
Github user JasonMWhite commented on the pull request:
https://github.com/apache/spark/pull/10089#issuecomment-191159802
All comments addressed, builds cleanly, all tests passing. GTM?
Github user JasonMWhite commented on the pull request:
https://github.com/apache/spark/pull/10089#issuecomment-191113140
`DirectKafkaStreamSuite` passes all tests for me locally, but the test
failure above appeared to be on an outdated sha.
Addressed @zsxwing's comments
Github user JasonMWhite commented on the pull request:
https://github.com/apache/spark/pull/10089#issuecomment-186037636
It looks like the PySpark unit test failure above predates my commit
adding a backwards-compatible single-argument version, as suggested. Could
someone kick it
Github user JasonMWhite commented on the pull request:
https://github.com/apache/spark/pull/10089#issuecomment-185521431
A single-argument version is easy enough, and good for
backwards compatibility anyway. I haven't been able to get the PySpark tests
running locally yet, s
Github user JasonMWhite commented on the pull request:
https://github.com/apache/spark/pull/10089#issuecomment-185517414
Changes addressed, but looks like it's causing PySpark unit tests to fail.
Investigating...
Github user JasonMWhite commented on a diff in the pull request:
https://github.com/apache/spark/pull/10089#discussion_r53265282
--- Diff:
external/kafka/src/test/scala/org/apache/spark/streaming/kafka/DirectKafkaStreamSuite.scala
---
@@ -353,10 +353,52 @@ class
Github user JasonMWhite commented on a diff in the pull request:
https://github.com/apache/spark/pull/10089#discussion_r53265248
--- Diff:
external/kafka/src/main/scala/org/apache/spark/streaming/kafka/DirectKafkaInputDStream.scala
---
@@ -89,23 +89,32 @@ class
Github user JasonMWhite commented on the pull request:
https://github.com/apache/spark/pull/10089#issuecomment-185478351
I added unit tests for `maxMessagesPerPartition` to cover 3 edge cases that
have been raised here:
- when backpressure is disabled, simply uses the
Github user JasonMWhite commented on the pull request:
https://github.com/apache/spark/pull/10089#issuecomment-184916602
Sorry, did I put the MiMa exclusion in the wrong section of the file?
Github user JasonMWhite commented on the pull request:
https://github.com/apache/spark/pull/10089#issuecomment-184900516
@koeninger I've added the two error messages from Jenkins to the
MimaExcludes, under the v2.0 section. There was one from the test function, but
this PR
Github user JasonMWhite commented on the pull request:
https://github.com/apache/spark/pull/10089#issuecomment-182008116
Thanks!
Github user JasonMWhite commented on the pull request:
https://github.com/apache/spark/pull/10089#issuecomment-181990835
I'm not sure how to handle the MiMa test failure; could you point me to
where to add the exclude? I'll rebase and add the exception.
Github user JasonMWhite commented on the pull request:
https://github.com/apache/spark/pull/10089#issuecomment-165518544
Is this a CI failure? I don't have access rights to see what happened. I'd
like to get this ready to merge; could someone give me a hand?
Github user JasonMWhite commented on a diff in the pull request:
https://github.com/apache/spark/pull/10089#discussion_r47363415
--- Diff:
external/kafka/src/test/scala/org/apache/spark/streaming/kafka/DirectKafkaStreamSuite.scala
---
@@ -364,8 +365,8 @@ class
Github user JasonMWhite commented on the pull request:
https://github.com/apache/spark/pull/10089#issuecomment-163427146
Could someone verify the patch please? @tdas perhaps?
Github user JasonMWhite commented on the pull request:
https://github.com/apache/spark/pull/10089#issuecomment-162260106
`maxRatePerPartition` now respects the limit set by
`maxRateLimitPerPartition`, if it is set. Let me know if you think any
additional tests are needed.
Github user JasonMWhite commented on the pull request:
https://github.com/apache/spark/pull/10089#issuecomment-162250749
This patch solved our skew problem. Below is a 15-minute snapshot of our
lag earlier this week, showing a single partition getting slowly worse. It
would get to
Github user JasonMWhite commented on a diff in the pull request:
https://github.com/apache/spark/pull/10089#discussion_r46763463
--- Diff:
external/kafka/src/main/scala/org/apache/spark/streaming/kafka/DirectKafkaInputDStream.scala
---
@@ -89,23 +89,29 @@ class
Github user JasonMWhite commented on a diff in the pull request:
https://github.com/apache/spark/pull/10089#discussion_r46449516
--- Diff:
external/kafka/src/test/scala/org/apache/spark/streaming/kafka/DirectKafkaStreamSuite.scala
---
@@ -36,9 +36,10 @@ import
Github user JasonMWhite commented on a diff in the pull request:
https://github.com/apache/spark/pull/10089#discussion_r46449571
--- Diff:
external/kafka/src/main/scala/org/apache/spark/streaming/kafka/DirectKafkaInputDStream.scala
---
@@ -89,23 +89,29 @@ class
Github user JasonMWhite commented on a diff in the pull request:
https://github.com/apache/spark/pull/10089#discussion_r46449467
--- Diff:
external/kafka/src/test/scala/org/apache/spark/streaming/kafka/DirectKafkaStreamSuite.scala
---
@@ -364,8 +365,8 @@ class
GitHub user JasonMWhite opened a pull request:
https://github.com/apache/spark/pull/10089
[SPARK-12073] [Streaming] backpressure rate controller consumes events
preferentially from lagging partitions
I'm pretty sure this is the reason we couldn't easily re
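The idea in the title, sketched as a hypothetical allocation helper (the PR itself is Scala, inside the streaming rate controller):
```python
def allocate_by_lag(lag_per_partition, total_budget):
    # Split a batch's total message budget across partitions in
    # proportion to each partition's lag, so the partitions furthest
    # behind are drained preferentially.
    total_lag = sum(lag_per_partition.values())
    if total_lag == 0:
        return {p: 0 for p in lag_per_partition}
    return {p: int(total_budget * lag / total_lag)
            for p, lag in lag_per_partition.items()}

# A partition with 10x the lag of another gets 10x the budget:
assert allocate_by_lag({0: 1000, 1: 100, 2: 100}, 600) == {0: 500, 1: 50, 2: 50}
```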
Github user JasonMWhite commented on the pull request:
https://github.com/apache/spark/pull/9392#issuecomment-152774426
Can one of the admins verify this patch?
GitHub user JasonMWhite opened a pull request:
https://github.com/apache/spark/pull/9392
[SPARK-11437] [PySpark] Don't .take when converting RDD to DataFrame with
provided schema
When creating a DataFrame from an RDD in PySpark, `createDataFrame` calls
`.take(10)` to verif
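For reference, the call pattern in question, shown with the modern `SparkSession` API (which postdates this PR) and hypothetical column names; with an explicit schema there is nothing left to infer from sampled rows:
```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, LongType, StringType

spark = SparkSession.builder.getOrCreate()
rdd = spark.sparkContext.parallelize([(1, "a"), (2, "b")])

schema = StructType([
    StructField("id", LongType(), nullable=False),
    StructField("label", StringType(), nullable=True),
])

# Supplying the schema up front means createDataFrame has no need to
# eagerly sample rows (the `.take(10)` the PR refers to) for inference.
df = spark.createDataFrame(rdd, schema)
```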