[GitHub] spark issue #21320: [SPARK-4502][SQL] Parquet nested column pruning - founda...

2018-10-29 Thread Gauravshah
Github user Gauravshah commented on the issue: https://github.com/apache/spark/pull/21320 Backported to 2.3.2, just in case somebody needs it: https://github.com/Gauravshah/spark/tree/branch-2.3_SPARK-4502 Thanks @mallman

[GitHub] spark issue #21320: [SPARK-4502][SQL] Parquet nested column pruning - founda...

2018-09-25 Thread Gauravshah
Github user Gauravshah commented on the issue: https://github.com/apache/spark/pull/21320 @mallman is there any way I can help pull in the rest of the changes from your original PR (https://github.com/apache/spark/pull/16578) for the next release?

[GitHub] spark issue #21320: [SPARK-4502][SQL] Parquet nested column pruning - founda...

2018-08-13 Thread Gauravshah
Github user Gauravshah commented on the issue: https://github.com/apache/spark/pull/21320 Or @ajacques can open a PR to Mallman's branch and he can merge it. That makes it less work for him. --- - To unsubscribe

[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2018-03-01 Thread Gauravshah
Github user Gauravshah commented on the issue: https://github.com/apache/spark/pull/16578 We have back-ported it to 2.2. In production it has made our jobs at least 2x faster on average.

[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2018-03-01 Thread Gauravshah
Github user Gauravshah commented on the issue: https://github.com/apache/spark/pull/16578 @marmbrus can we target it for 2.4? We need help with reviews. It has been in a waiting state for a very long time.

[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2018-01-08 Thread Gauravshah
Github user Gauravshah commented on the issue: https://github.com/apache/spark/pull/16578 @mallman do you foresee any issues? Planning to backport it to Spark 2.2 on a personal fork. Will probably make a jitpack release

[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2018-01-03 Thread Gauravshah
Github user Gauravshah commented on the issue: https://github.com/apache/spark/pull/16578 @marmbrus can we start the review process so that it can make it into the next release?

[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2018-01-03 Thread Gauravshah
Github user Gauravshah commented on the issue: https://github.com/apache/spark/pull/16578 @DaimonPl branch 2.3 is already cut, so at the very least it's not making it into 2.3 :(

[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2017-12-13 Thread Gauravshah
Github user Gauravshah commented on the issue: https://github.com/apache/spark/pull/16578 Thanks @mallman for rebasing each time. @gatorsmile can you take a look at it?

[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2017-09-05 Thread Gauravshah
Github user Gauravshah commented on the issue: https://github.com/apache/spark/pull/16578 just in case someone wants to try: ``` resolvers += "jitpack" at "https://jitpack.io" libraryDependencies += "com.github.VideoAmp" % "

[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2017-07-28 Thread Gauravshah
Github user Gauravshah commented on the issue: https://github.com/apache/spark/pull/16578 @mallman not sure how I can help to push it --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have

[GitHub] spark issue #17006: [SPARK-17636] Parquet filter push down doesn't handle st...

2017-07-21 Thread Gauravshah
Github user Gauravshah commented on the issue: https://github.com/apache/spark/pull/17006 PR https://github.com/apache/spark/pull/16578 should solve this

[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2017-07-21 Thread Gauravshah
Github user Gauravshah commented on the issue: https://github.com/apache/spark/pull/16578 Time in seconds
| Test | w/ Patch | w/o Patch |
| - | - | - |
| count

[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2017-07-20 Thread Gauravshah
Github user Gauravshah commented on the issue: https://github.com/apache/spark/pull/16578 I cannot do it. @marmbrus can you help with this pull request ?

[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2017-07-20 Thread Gauravshah
Github user Gauravshah commented on the issue: https://github.com/apache/spark/pull/16578 Thanks @mallman, this helps a lot with performance. For 100 million rows in a partition, we are able to go from 4.5 minutes to 35 seconds with the patch. Will share more results by end of the day
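
As a quick sanity check on the figures quoted in this comment (4.5 minutes versus 35 seconds), the short Python snippet here computes the implied speedup. It assumes both numbers are wall-clock times for the same job; the variable names are illustrative only and do not come from the thread.

```python
# Wall-clock times quoted in the comment, converted to seconds.
# Assumption: both figures measure the same job on the same partition.
before_s = 4.5 * 60   # 4.5 minutes without the patch
after_s = 35.0        # 35 seconds with the patch

speedup = before_s / after_s        # how many times faster the patched run is
reduction = 1 - after_s / before_s  # fraction of the original time saved

print(f"speedup: {speedup:.1f}x, time saved: {reduction:.0%}")
# prints "speedup: 7.7x, time saved: 87%"
```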

[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2017-07-16 Thread Gauravshah
Github user Gauravshah commented on the issue: https://github.com/apache/spark/pull/16578 @mallman code looks good, I can test things out with my dataset. I have deeply nested data (20-30 levels of nesting) and millions of rows in a partition to test performance with. Will do it Monday
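
To make "nested column pruning" concrete for deeply nested data like the set described in this comment, the toy Python sketch here trims a nested schema down to only the fields a query asks for. It is not Spark's implementation: the dict-based schema, the dotted-path syntax, and the function name `prune_schema` are all assumptions made for illustration.

```python
from typing import Any, Dict, Iterable, List

# Hypothetical schema encoding: a leaf is a type name (str),
# a struct is a nested dict of field name -> leaf or struct.
Schema = Dict[str, Any]


def prune_schema(schema: Schema, paths: Iterable[str]) -> Schema:
    """Keep only the parts of `schema` reachable through dotted `paths`."""
    # Group the remainder of each path by its first component.
    wanted: Dict[str, List[str]] = {}
    for path in paths:
        head, _, rest = path.partition(".")
        wanted.setdefault(head, []).append(rest)

    pruned: Schema = {}
    for name, rests in wanted.items():
        if name not in schema:
            raise KeyError(f"unknown field: {name}")
        field = schema[name]
        if isinstance(field, dict):
            if all(rests):
                # Only sub-fields were requested: recurse into the struct.
                pruned[name] = prune_schema(field, rests)
            else:
                # The whole struct was requested at least once: keep it all.
                pruned[name] = field
        elif any(rests):
            raise KeyError(f"{name} is not a struct")
        else:
            pruned[name] = field
    return pruned


schema = {
    "id": "long",
    "user": {
        "name": "string",
        "address": {"city": "string", "zip": "string"},
    },
}

# Only `id` and one deeply nested leaf are needed, so the rest is dropped.
print(prune_schema(schema, ["id", "user.address.city"]))
```

A reader that applies the pruned schema only has to materialise `id` and `user.address.city`, rather than every field under `user`, which is where the saving comes from on wide, deeply nested records.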

[GitHub] spark issue #14957: [SPARK-4502][SQL]Support parquet nested struct pruning a...

2017-05-22 Thread Gauravshah
Github user Gauravshah commented on the issue: https://github.com/apache/spark/pull/14957 @saulshanabrook looks like #16578 is a superset, trying to invest in that pull request.

[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2017-03-30 Thread Gauravshah
Github user Gauravshah commented on the issue: https://github.com/apache/spark/pull/16578 can I do something to help this pull request ?

[GitHub] spark issue #16842: [SPARK-19304] [Streaming] [Kinesis] fix kinesis slow che...

2017-03-06 Thread Gauravshah
Github user Gauravshah commented on the issue: https://github.com/apache/spark/pull/16842 thanks @srowen & @brkyvz

[GitHub] spark pull request #16842: [SPARK-19304] [Streaming] [Kinesis] fix kinesis s...

2017-03-03 Thread Gauravshah
Github user Gauravshah commented on a diff in the pull request: https://github.com/apache/spark/pull/16842#discussion_r104282408 --- Diff: external/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisBackedBlockRDD.scala --- @@ -193,9 +201,10 @@ class

[GitHub] spark issue #16842: [SPARK-19304] [Streaming] [Kinesis] fix kinesis slow che...

2017-02-28 Thread Gauravshah
Github user Gauravshah commented on the issue: https://github.com/apache/spark/pull/16842 @brkyvz Thank you for taking this forward. We have a batch interval of 2 minutes, and each batch takes ~1.1 minutes to process. With the older code it takes 10-12 minutes to recover, and with the limit fix it reco

[GitHub] spark pull request #16842: [SPARK-19304] [Streaming] [Kinesis] fix kinesis s...

2017-02-28 Thread Gauravshah
Github user Gauravshah commented on a diff in the pull request: https://github.com/apache/spark/pull/16842#discussion_r103502545 --- Diff: external/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisBackedBlockRDD.scala --- @@ -212,7 +214,7 @@ class

[GitHub] spark issue #16842: [SPARK-19304] [Streaming] [Kinesis] fix kinesis slow che...

2017-02-22 Thread Gauravshah
Github user Gauravshah commented on the issue: https://github.com/apache/spark/pull/16842 @srowen I assumed that you cannot update code if you want to recover from checkpoint.

[GitHub] spark pull request #16842: [SPARK-19304] [Streaming] [Kinesis] fix kinesis s...

2017-02-21 Thread Gauravshah
Github user Gauravshah commented on a diff in the pull request: https://github.com/apache/spark/pull/16842#discussion_r102366702 --- Diff: external/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisBackedBlockRDD.scala --- @@ -204,10 +208,11 @@ class

[GitHub] spark pull request #16842: [SPARK-19304] [Streaming] [Kinesis] fix kinesis s...

2017-02-21 Thread Gauravshah
Github user Gauravshah commented on a diff in the pull request: https://github.com/apache/spark/pull/16842#discussion_r102366680 --- Diff: external/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisBackedBlockRDD.scala --- @@ -36,7 +36,8 @@ import

[GitHub] spark issue #16842: [SPARK-19304] [Streaming] [Kinesis] fix kinesis slow che...

2017-02-20 Thread Gauravshah
Github user Gauravshah commented on the issue: https://github.com/apache/spark/pull/16842 @brkyvz can I do something to take it forward ?

[GitHub] spark issue #16842: [WIP] [SPARK-19304] [Streaming] [Kinesis] fix kinesis sl...

2017-02-08 Thread Gauravshah
Github user Gauravshah commented on the issue: https://github.com/apache/spark/pull/16842 Jenkins, retest this please

[GitHub] spark issue #16842: [SPARK-19304] [Streaming] [Kinesis] fix kinesis slow che...

2017-02-07 Thread Gauravshah
Github user Gauravshah commented on the issue: https://github.com/apache/spark/pull/16842 Will work on test cases today

[GitHub] spark pull request #16842: [SPARK-19304] [Streaming] [Kinesis] fix kinesis s...

2017-02-07 Thread Gauravshah
Github user Gauravshah commented on a diff in the pull request: https://github.com/apache/spark/pull/16842#discussion_r99929314 --- Diff: external/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisBackedBlockRDD.scala --- @@ -36,7 +36,8 @@ import

[GitHub] spark pull request #16842: [SPARK-19304] [Streaming] [Kinesis] fix kinesis s...

2017-02-07 Thread Gauravshah
Github user Gauravshah commented on a diff in the pull request: https://github.com/apache/spark/pull/16842#discussion_r99928115 --- Diff: external/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisBackedBlockRDD.scala --- @@ -36,7 +36,8 @@ import

[GitHub] spark pull request #16842: [SPARK-19304] [Streaming] [Kinesis] fix kinesis s...

2017-02-07 Thread Gauravshah
Github user Gauravshah commented on a diff in the pull request: https://github.com/apache/spark/pull/16842#discussion_r99927422 --- Diff: external/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisBackedBlockRDD.scala --- @@ -36,7 +36,8 @@ import

[GitHub] spark pull request #16842: SPARK-19304 fix kinesis slow checkpoint recovery

2017-02-07 Thread Gauravshah
GitHub user Gauravshah opened a pull request: https://github.com/apache/spark/pull/16842 SPARK-19304 fix kinesis slow checkpoint recovery ## What changes were proposed in this pull request? Added a limit to the getRecords API call in KinesisBackedBlockRDD. This helps reduce
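
The effect of putting a limit on a paged read can be sketched with a toy in-memory shard. Everything in this Python example is hypothetical (the `FakeShard` class, its `get_records` method, and the default batch size of 1000); it is neither the Kinesis client API nor the code in this pull request, only an illustration of why capping each call to the number of records still needed avoids over-fetching.

```python
from typing import List, Optional, Tuple


class FakeShard:
    """Hypothetical in-memory stand-in for a stream shard (not the Kinesis API)."""

    def __init__(self, records: List[str]) -> None:
        self.records = records
        self.transferred = 0  # total records returned across all calls

    def get_records(self, position: int, limit: Optional[int] = None,
                    default_batch: int = 1000) -> Tuple[List[str], int]:
        """Return up to `limit` records from `position`, plus the next position."""
        size = default_batch if limit is None else limit
        batch = self.records[position:position + size]
        self.transferred += len(batch)
        return batch, position + len(batch)


def read_range(shard: FakeShard, start: int, count: int,
               use_limit: bool) -> List[str]:
    """Read `count` records starting at `start`, optionally capping each call."""
    out: List[str] = []
    position = start
    while len(out) < count:
        remaining = count - len(out)
        limit = remaining if use_limit else None
        batch, position = shard.get_records(position, limit)
        if not batch:
            break  # ran off the end of the shard
        out.extend(batch[:remaining])  # discard anything beyond the range
    return out


records = [f"r{i}" for i in range(5000)]

unlimited = FakeShard(records)
read_range(unlimited, start=100, count=10, use_limit=False)

limited = FakeShard(records)
read_range(limited, start=100, count=10, use_limit=True)

# Both reads return the same 10 records, but the unlimited call pulls a
# full default batch to get them.
print(unlimited.transferred, limited.transferred)
```

In this toy setup the unlimited read transfers a full batch of 1000 records to recover 10, while the limited read transfers exactly 10. Whether real recovery times improve by a similar ratio depends on the service and workload.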

[GitHub] spark issue #16213: [SPARK-18020][Streaming][Kinesis] Checkpoint SHARD_END t...

2017-01-25 Thread Gauravshah
Github user Gauravshah commented on the issue: https://github.com/apache/spark/pull/16213 We are facing the same issue randomly on prod. Can I help in some way to push it?

[GitHub] spark issue #16339: [SPARK-18917][SQL] Add Skip Partition Check Flag to avoi...

2017-01-16 Thread Gauravshah
Github user Gauravshah commented on the issue: https://github.com/apache/spark/pull/16339 should help us save 20 mins on each iteration scanning directories.