Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/16671
The connectors from some DBMS vendors use the UNLOAD utility, which
performs much better, and build the RDD inside the connector.
Normally, JDBC is not a good option for fetching large tables
Github user djvulee commented on the issue:
https://github.com/apache/spark/pull/16671
@HyukjinKwon One assumption behind this design is that the specified column
has an index in most real scenarios, so the cost of the table scan is not very
high. What I observed is that most large tables
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/16671
FWIW, I am negative about this approach too. Requiring full table scans to
resolve skew between partitions does not look like a good solution.
As said, it is not good for a large table. Then
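For context, Spark's existing column-based JDBC partitioning cuts the range
`[lowerBound, upperBound]` into equal-width strides, which is exactly what
skews when the data is not uniformly distributed. A minimal sketch (connection
URL, table, and column names are illustrative):

```scala
import java.util.Properties

val props = new Properties()
props.setProperty("user", "spark")

// Equal-width stride partitioning over [lowerBound, upperBound].
val df = spark.read.jdbc(
  "jdbc:postgresql://host/db",  // illustrative URL
  "big_table",                  // illustrative table
  "id",                         // partition column
  0L,                           // lowerBound
  4000000L,                     // upperBound
  4,                            // numPartitions
  props)
// Spark generates WHERE clauses roughly like: id < 1000000 (or NULL),
// 1000000 <= id < 2000000, ..., id >= 3000000. If most rows fall into one
// stride, that partition carries most of the data -- the skew this PR targets.
```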
Github user djvulee commented on the issue:
https://github.com/apache/spark/pull/16671
Yes. I will leave this PR open for a few days to see if others are interested
in it, and then close it.
Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/16671
It is still achievable if the SQL interface of the underlying database
supports it. Currently, if performance really matters, JDBC is not a good
interface. Thus, most DBMS vendors provide the
Github user djvulee commented on the issue:
https://github.com/apache/spark/pull/16671
Yes, I agree with you: a sampling-based approach is the right choice, but it
is not possible to achieve this through the `jdbc` API.
Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/16671
So far, the best workaround is the predicate-based JDBC API; otherwise, as
I mentioned above, we need to use sampling to find the boundary of each
block.
> In one embodiment, a
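For reference, the predicate-based overload mentioned above lets the caller
supply one WHERE clause per partition, so the split points can follow the
actual data distribution. A minimal sketch (URL, table, and boundaries are
illustrative):

```scala
import java.util.Properties

val props = new Properties()
props.setProperty("user", "spark")

// One element per partition; each string becomes that partition's WHERE clause.
val predicates = Array(
  "id < 1000000",
  "id >= 1000000 AND id < 2000000",
  "id >= 2000000")

val df = spark.read.jdbc("jdbc:postgresql://host/db", "big_table", predicates, props)
```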
Github user djvulee commented on the issue:
https://github.com/apache/spark/pull/16671
Using the *predicates* parameter to split the table seems reasonable, but in
my personal opinion it just pushes work that should be done by Spark onto
users. Users need to know how to split the table uniformly
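To make that burden concrete: with the predicates overload, finding roughly
uniform boundaries is left entirely to the user, e.g. by sampling the
partition column and cutting at quantiles. A hedged sketch (the 1%
`TABLESAMPLE` subquery is PostgreSQL-specific syntax, and the table/column
names are illustrative):

```scala
// Pull a small sample of the partition column through a DB-side sample.
// (TABLESAMPLE syntax varies by vendor; this form is PostgreSQL's.)
val sample = spark.read.jdbc(
  "jdbc:postgresql://host/db",
  "(SELECT id FROM big_table TABLESAMPLE SYSTEM (1)) AS s",
  props)

// Approximate quartile boundaries of the sampled ids.
val cuts = sample.stat.approxQuantile("id", Array(0.25, 0.5, 0.75), 0.01)

// Turn the boundaries into one predicate per partition.
val inner = cuts.sliding(2).collect {
  case Array(lo, hi) => s"id >= $lo AND id < $hi"
}.toArray
val predicates = (s"id < ${cuts.head}" +: inner) :+ s"id >= ${cuts.last}"

val df = spark.read.jdbc("jdbc:postgresql://host/db", "big_table", predicates, props)
```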
Github user djvulee commented on the issue:
https://github.com/apache/spark/pull/16671
Yes, this solution is not suitable for large tables, but I cannot find a
better one; this is the best optimisation I can find.
So just add it as a choice, and let users know what they are doing