Zack Behringer created SPARK-22299:
--------------------------------------

             Summary: Use OFFSET and LIMIT for JDBC DataFrameReader striping
                 Key: SPARK-22299
                 URL: https://issues.apache.org/jira/browse/SPARK-22299
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.2.0, 2.1.0, 2.0.0, 1.6.0, 1.5.0, 1.4.0
            Reporter: Zack Behringer
            Priority: Minor


Loading a large table (300M rows) over JDBC can be split across partitioned tasks using 
the column, numPartitions, lowerBound and upperBound parameters of 
DataFrameReader.jdbc(), but that becomes troublesome if the column is 
skewed/fragmented (e.g. somebody used a global sequence for the partition 
column instead of a sequence specific to the table, or the table has become 
fragmented by deletes, etc.).
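
For context, the existing striped read looks roughly like this; connection details, table and column names below are made up:

{code:scala}
import java.util.Properties

// Hypothetical connection details; the real table has ~300M rows.
val url = "jdbc:postgresql://db-host:5432/appdb"
val props = new Properties()
props.setProperty("user", "reader")

// Spark builds numPartitions WHERE clauses over evenly spaced ranges of the
// partition column between lowerBound and upperBound, one per task.
val df = spark.read.jdbc(
  url,
  "events",      // table
  "event_id",    // partition column
  0L,            // lowerBound
  300000000L,    // upperBound
  16,            // numPartitions
  props)
{code}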
This can be worked around by using a modulus operation on the column, but that 
will be slow unless there is already an index on the modulus expression with 
the exact numPartitions value, so it doesn't scale well if you want to change 
the number of partitions. Another way would be to use an expression index on a 
hash of the partition column, but I'm not sure if JDBC striping is smart 
enough to create hash ranges for each stripe from hashes of the lower and 
upper bound parameters. If it is, that is great, but it still requires a very 
large index just for this use case.
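
For reference, the modulus workaround can be expressed today with the predicates overload of DataFrameReader.jdbc(); a minimal sketch, again with made-up names, and each WHERE clause will scan the whole table unless a matching expression index exists:

{code:scala}
import java.util.Properties

val url = "jdbc:postgresql://db-host:5432/appdb"
val props = new Properties()
props.setProperty("user", "reader")

val numPartitions = 16

// One WHERE clause per partition; each task reads the rows whose partition
// column falls into its modulus bucket.
val predicates = (0 until numPartitions)
  .map(i => s"event_id % $numPartitions = $i")
  .toArray

// jdbc(url, table, predicates, connectionProperties) creates one partition
// per predicate.
val df = spark.read.jdbc(url, "events", predicates, props)
{code}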

A less invasive approach would be to use the table's physical ordering along 
with OFFSET and LIMIT, so that only the total number of records to read would 
need to be known beforehand to distribute the work evenly; no indexes would be 
needed. I realize that OFFSET and LIMIT are not standard SQL keywords.

I also see that a list of custom predicates can be defined. I haven't tried 
that yet to see whether I can embed numPartitions-specific predicates, each 
with its own OFFSET and LIMIT range.
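
Since those predicates are injected into WHERE clauses, OFFSET and LIMIT probably cannot be embedded there directly. One way to approximate the idea today is a per-stripe subquery in dbtable; this is only a sketch of that workaround (not the proposed built-in support), with a hypothetical table, ordering column and sizes, and PostgreSQL-style syntax:

{code:scala}
val url = "jdbc:postgresql://db-host:5432/appdb"

val totalRows = 300000000L   // would need a prior COUNT(*) or a good estimate
val numStripes = 64
val stripeSize = (totalRows + numStripes - 1) / numStripes

// One single-partition DataFrame per stripe, unioned together. ORDER BY on a
// stable column keeps the stripes disjoint across queries; deep OFFSETs can
// still be expensive for the database.
val stripes = (0 until numStripes).map { i =>
  val offset = i.toLong * stripeSize
  spark.read
    .format("jdbc")
    .option("url", url)
    .option("user", "reader")
    .option("dbtable",
      s"(SELECT * FROM events ORDER BY event_id OFFSET $offset LIMIT $stripeSize) AS stripe_$i")
    .load()
}

val df = stripes.reduce(_ union _)
{code}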

Some relational databases take quite a long time to count the number of records 
in order to determine the stripe size, though, so this can also be troublesome. 
Could a feature similar to "spark.sql.files.maxRecordsPerFile" be used in 
conjunction with the number of executors to read manageable batches (internally 
using OFFSET and LIMIT) until there are no more available results?
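
To make the request concrete, here is a rough driver-side sketch of the kind of loop such a feature could replace: read fixed-size batches until the source runs dry, with no up-front COUNT(*). Again, table name, batch size and output path are hypothetical, and the syntax assumes PostgreSQL:

{code:scala}
val url = "jdbc:postgresql://db-host:5432/appdb"
val batchSize = 5000000L
var offset = 0L
var more = true

while (more) {
  val batch = spark.read
    .format("jdbc")
    .option("url", url)
    .option("user", "reader")
    .option("dbtable",
      s"(SELECT * FROM events ORDER BY event_id OFFSET $offset LIMIT $batchSize) AS batch")
    .load()

  // Stop once a batch comes back empty; otherwise persist it and advance.
  if (batch.take(1).isEmpty) {
    more = false
  } else {
    batch.write.mode("append").parquet("/data/events_import")
    offset += batchSize
  }
}
{code}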


