[GitHub] spark issue #16671: [SPARK-19327][SparkSQL] a better balance partition metho...

2017-01-22 Thread gatorsmile
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/16671 The connectors provided by some DBMS vendors use the UNLOAD utility, which performs much better, and build the RDD inside the connectors. Normally, JDBC is not a good option for fetching large tables.

[GitHub] spark issue #16671: [SPARK-19327][SparkSQL] a better balance partition metho...

2017-01-22 Thread djvulee
Github user djvulee commented on the issue: https://github.com/apache/spark/pull/16671 @HyukjinKwon One assumption behind this design is that the specified column has an index in most real scenarios, so the table scan cost is not very high. What I observed is that most large tables have an index on such a column.

[GitHub] spark issue #16671: [SPARK-19327][SparkSQL] a better balance partition metho...

2017-01-22 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/16671 FWIW, I am negative on this approach too. It does not look like a good solution to require full table scans to resolve skew between partitions. As said, it is not good for a large table.

[GitHub] spark issue #16671: [SPARK-19327][SparkSQL] a better balance partition metho...

2017-01-22 Thread djvulee
Github user djvulee commented on the issue: https://github.com/apache/spark/pull/16671 Yes. I will leave this PR open for a few days to see if others are interested in it, and then close it.

[GitHub] spark issue #16671: [SPARK-19327][SparkSQL] a better balance partition metho...

2017-01-22 Thread gatorsmile
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/16671 It is still achievable if the SQL interface of the underlying database supports it. Currently, if performance really matters, JDBC is not a good interface. Thus, most DBMS vendors provide their own connectors.
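As an illustration of pushing the sampling into the database itself, the sketch below wraps a vendor sampling clause in the subquery that Spark's JDBC reader accepts in place of a table name. The table name `big_table`, column `id`, the connection details, and the PostgreSQL-style `TABLESAMPLE SYSTEM` clause are all assumptions; other databases use different sampling syntax.

```scala
import java.util.Properties
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("jdbc-sampling").getOrCreate()

// Hypothetical connection details.
val url = "jdbc:postgresql://db-host:5432/mydb"
val props = new Properties()
props.setProperty("user", "spark")
props.setProperty("password", "secret")

// Push the sampling down to the database: Spark's JDBC source accepts a
// subquery wherever a table name is expected. TABLESAMPLE SYSTEM (1)
// reads roughly 1% of the table's pages (PostgreSQL syntax).
val sample = spark.read.jdbc(
  url,
  "(SELECT id FROM big_table TABLESAMPLE SYSTEM (1)) AS s",
  props)
```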

[GitHub] spark issue #16671: [SPARK-19327][SparkSQL] a better balance partition metho...

2017-01-22 Thread djvulee
Github user djvulee commented on the issue: https://github.com/apache/spark/pull/16671 Yes, I agree with you: a sampling-based approach is the right choice, but it is not possible to achieve this through the `jdbc` API.

[GitHub] spark issue #16671: [SPARK-19327][SparkSQL] a better balance partition metho...

2017-01-21 Thread gatorsmile
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/16671 So far, the best workaround is the predicate-based JDBC API; otherwise, as I mentioned above, we need to use sampling to find the boundary of each block.
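For reference, this is roughly what the predicate-based workaround looks like: derive quantile boundaries of the partition column with `approxQuantile`, then hand one non-overlapping `WHERE` clause per partition to the `predicates` overload of `DataFrameReader.jdbc`. The `sample` DataFrame, `id` column, connection values, and the choice of quartile boundaries are assumptions carried over from the sketch above.

```scala
// Derive three boundaries (quartiles) of the sampled partition column;
// approxQuantile trades a small relative error (here 1%) for speed.
val bounds = sample.stat.approxQuantile("id", Array(0.25, 0.5, 0.75), 0.01)

// One WHERE clause per partition. The ranges must not overlap and must
// together cover the whole column, or rows get duplicated or dropped.
val predicates = Array(
  s"id < ${bounds(0)}",
  s"id >= ${bounds(0)} AND id < ${bounds(1)}",
  s"id >= ${bounds(1)} AND id < ${bounds(2)}",
  s"id >= ${bounds(2)}")

// Each predicate becomes one partition of the resulting DataFrame.
val df = spark.read.jdbc(url, "big_table", predicates, props)
```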

[GitHub] spark issue #16671: [SPARK-19327][SparkSQL] a better balance partition metho...

2017-01-21 Thread djvulee
Github user djvulee commented on the issue: https://github.com/apache/spark/pull/16671 Using the *predicates* parameter to split the table seems reasonable, but in my personal opinion it just pushes work that should be done by Spark onto users. Users need to know how to split the table uniformly.
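By contrast, the built-in numeric partitioning only asks users for a column plus bounds, but it slices the `[lowerBound, upperBound]` range into equal strides, so a skewed column yields unbalanced partitions. A minimal sketch, assuming the same hypothetical table and connection as above:

```scala
// Built-in range partitioning: Spark generates numPartitions WHERE
// clauses of equal stride over [lowerBound, upperBound]. If most ids
// cluster in a narrow band, most rows land in a few partitions.
val stock = spark.read.jdbc(
  url,
  "big_table",
  "id",        // partitionColumn
  0L,          // lowerBound
  10000000L,   // upperBound
  8,           // numPartitions
  props)
```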

[GitHub] spark issue #16671: [SPARK-19327][SparkSQL] a better balance partition metho...

2017-01-21 Thread djvulee
Github user djvulee commented on the issue: https://github.com/apache/spark/pull/16671 Yes, this solution is not suitable for a large table, but I cannot find a better one; this is the best optimisation I can find. So I just add it as an option and let users know what they are doing.