[GitHub] spark issue #21868: [SPARK-24906][SQL] Adaptively enlarge split / partition ...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21868 I don't think we should add this. Basically, the dynamic estimation logic is too flaky, and I don't think we need this in the current state. Let's not add it for now. While I am revisiting old PRs, I am trying to suggest closing PRs that do not look likely to be merged. Let me suggest closing this for now, but please feel free to recreate the PR if you strongly feel this is needed in Spark. No objection. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21868: [SPARK-24906][SQL] Adaptively enlarge split / partition ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21868 Can one of the admins verify this patch?
[GitHub] spark issue #21868: [SPARK-24906][SQL] Adaptively enlarge split / partition ...
Github user habren commented on the issue: https://github.com/apache/spark/pull/21868 @HyukjinKwon Yes, this is to handle it dynamically. For ad-hoc queries, the selected columns differ from query to query, and it is inconvenient or even impossible for users to set a different maxPartitionBytes for each query. And for general (non-advanced) users, it is not easy to choose a proper value for maxPartitionBytes. So this change makes it easier. > BTW, just for clarification, you can set the bigger number to spark.sql.files.maxPartitionBytes explicitly and that resolved your issue. This one is to handle it dynamically, right?
[GitHub] spark issue #21868: [SPARK-24906][SQL] Adaptively enlarge split / partition ...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21868 @habren, BTW, just for clarification, you can set a larger value for `spark.sql.files.maxPartitionBytes` explicitly and that would resolve your issue. This PR is to handle it dynamically, right?
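For readers following the thread, the explicit workaround mentioned above can be sketched as follows. This is a minimal illustration, not part of the PR: the application name, the 512 MiB value, the input path, and the column name are all placeholders chosen for the example, and a suitable value depends on the data and the query.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session, used here only for illustration.
val spark = SparkSession.builder()
  .appName("max-partition-bytes-example") // hypothetical name
  .master("local[*]")
  .getOrCreate()

// Raise the maximum number of bytes packed into a single file-scan partition.
// 512 MiB is an arbitrary example value, not a recommendation.
spark.conf.set("spark.sql.files.maxPartitionBytes", 512L * 1024 * 1024)

// Hypothetical path and column: a query that reads only a few columns of a
// wide Parquet table is the case discussed in this thread.
val df = spark.read.parquet("/path/to/wide_table").select("id")
```

Because the setting applies to the whole session, it has to be changed by hand whenever the set of selected columns changes, which is the inconvenience the PR author describes.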
[GitHub] spark issue #21868: [SPARK-24906][SQL] Adaptively enlarge split / partition ...
Github user habren commented on the issue: https://github.com/apache/spark/pull/21868 Hi @HyukjinKwon, I moved the change to the master branch just now. Please help review it.
[GitHub] spark issue #21868: [SPARK-24906][SQL] Adaptively enlarge split / partition ...
Github user habren commented on the issue: https://github.com/apache/spark/pull/21868 @HyukjinKwon Thanks for your comments. I will submit it to master soon
[GitHub] spark issue #21868: [SPARK-24906][SQL] Adaptively enlarge split / partition ...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21868 Why does this still target branch-2.3? Is this a backport?
[GitHub] spark issue #21868: [SPARK-24906][SQL] Adaptively enlarge split / partition ...
Github user habren commented on the issue: https://github.com/apache/spark/pull/21868 @maropu Thanks for your comments. ORC can also benefit from this change, since ORC is also a columnar file format. Do you think I should add ORC support by changing the line below from `if (fsRelation.fileFormat.isInstanceOf[ParquetSource])` to `if (fsRelation.fileFormat.isInstanceOf[ParquetSource] || fsRelation.fileFormat.isInstanceOf[OrcFileFormat])`?
[GitHub] spark issue #21868: [SPARK-24906][SQL] Adaptively enlarge split / partition ...
Github user maropu commented on the issue: https://github.com/apache/spark/pull/21868 Is this a Parquet-specific issue? For example, how about ORC?
[GitHub] spark issue #21868: [SPARK-24906][SQL] Adaptively enlarge split / partition ...
Github user habren commented on the issue: https://github.com/apache/spark/pull/21868 Hi @maropu and @viirya, do you agree with the basic idea that we should take column pruning into consideration when splitting the input files?
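The idea raised in this comment, taking column pruning into account when sizing splits, can be illustrated with a small self-contained sketch. It is not the logic from the PR: the `Column` type, the `adjustedSplitBytes` function, the per-column width estimates, and the `maxFactor` cap are all hypothetical, and the sketch assumes that the share of bytes actually read is roughly proportional to the estimated width of the selected columns.

```scala
// Hypothetical sketch: none of these names come from the PR or from Spark.
object SplitSizeSketch {
  // Assumed per-row width estimate, in bytes, for one column of the file.
  final case class Column(name: String, estimatedBytes: Long)

  /** Enlarge the configured split size by the ratio of total row width to
    * the width of the selected columns, capped at maxFactor. Falls back to
    * the configured size when no usable estimate is available. */
  def adjustedSplitBytes(
      maxPartitionBytes: Long,
      allColumns: Seq[Column],
      selected: Set[String],
      maxFactor: Double = 16.0): Long = {
    val totalWidth = allColumns.map(_.estimatedBytes).sum
    val readWidth = allColumns
      .filter(c => selected.contains(c.name))
      .map(_.estimatedBytes)
      .sum
    if (totalWidth <= 0 || readWidth <= 0) maxPartitionBytes
    else {
      val factor = math.min(totalWidth.toDouble / readWidth, maxFactor)
      (maxPartitionBytes * factor).toLong
    }
  }
}
```

For a two-column file with estimated widths of 8 and 56 bytes, selecting only the 8-byte column gives a factor of 8, so a 128 MiB setting becomes 1 GiB. The reviewers' concern elsewhere in this thread, that such an estimate is fragile and performance-sensitive, applies to any heuristic of this kind, since type-based widths say little about the compressed on-disk size of a column.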
[GitHub] spark issue #21868: [SPARK-24906][SQL] Adaptively enlarge split / partition ...
Github user habren commented on the issue: https://github.com/apache/spark/pull/21868 @maropu If I understand correctly, your concern is about how to calculate
[GitHub] spark issue #21868: [SPARK-24906][SQL] Adaptively enlarge split / partition ...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21868 BTW, why does this PR target branch-2.3? I think it should be master.
[GitHub] spark issue #21868: [SPARK-24906][SQL] Adaptively enlarge split / partition ...
Github user maropu commented on the issue: https://github.com/apache/spark/pull/21868 Thanks for the work, but we probably first need consensus to work on this, because this part is pretty performance-sensitive. As @viirya described in the JIRA, I think we need a more general approach than the current fix (for example, I'm not sure that this fix doesn't cause any performance regression).