Yang Juan hu created SPARK-14955:
------------------------------------

             Summary: JDBCRelation should report an IllegalArgumentException if stride equals 0
                 Key: SPARK-14955
                 URL: https://issues.apache.org/jira/browse/SPARK-14955
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 1.6.1, 1.5.1
            Reporter: Yang Juan hu
            Priority: Minor


In the file
https://github.com/apache/spark/blob/40ed2af587cedadc6e5249031857a922b3b234ca/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRelation.scala

lines 56 and 57 contain the following:

val stride: Long = (partitioning.upperBound / numPartitions
                      - partitioning.lowerBound / numPartitions)

If we invoke columnPartition as below:

columnPartition(JDBCPartitioningInfo("partitionColumn", 0, 7, 8))

integer division gives stride = 7 / 8 - 0 / 8 = 0, and columnPartition
generates the following where clauses:

whereClause: partitionColumn < 0
whereClause: partitionColumn >= 0 AND partitionColumn < 0
whereClause: partitionColumn >= 0 AND partitionColumn < 0
whereClause: partitionColumn >= 0 AND partitionColumn < 0
whereClause: partitionColumn >= 0 AND partitionColumn < 0
whereClause: partitionColumn >= 0 AND partitionColumn < 0
whereClause: partitionColumn >= 0 AND partitionColumn < 0
whereClause: partitionColumn >= 0
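
The clauses above can be reproduced with a small Python sketch. It is not the Spark code: where_clauses is a hypothetical helper that assumes the loop starts at lowerBound, adds stride once per partition, and leaves the first and last partitions open-ended, which is what the listed clauses suggest. Python's // agrees with Scala's Long division only for non-negative bounds.

```python
def where_clauses(column, lower_bound, upper_bound, num_partitions):
    """Hypothetical Python mirror of the columnPartition loop (not Spark code)."""
    # Divide first, then subtract, as in the Scala expression quoted above.
    # Assumption: bounds are non-negative, so // matches Scala's Long division.
    stride = upper_bound // num_partitions - lower_bound // num_partitions
    clauses = []
    current = lower_bound
    for i in range(num_partitions):
        # The first partition has no lower edge, the last has no upper edge.
        lower = f"{column} >= {current}" if i != 0 else None
        current += stride
        upper = f"{column} < {current}" if i != num_partitions - 1 else None
        if upper is None:
            clauses.append(lower)
        elif lower is None:
            clauses.append(upper)
        else:
            clauses.append(f"{lower} AND {upper}")
    return stride, clauses


stride, clauses = where_clauses("partitionColumn", 0, 7, 8)
print("stride:", stride)
for clause in clauses:
    print("whereClause:", clause)
```

With a stride of 0 every boundary collapses onto lowerBound, so only the open-ended last clause can match any rows at or above 0.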

This causes data skew: the last partition will contain all of the data.

I propose throwing an exception if stride equals 0, so that Spark users become
aware of the data skew issue as early as possible.

if (stride == 0) {
  throw new IllegalArgumentException(
    "partitioning.upperBound / numPartitions - " +
    "partitioning.lowerBound / numPartitions is zero")
}
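
The same fail-fast idea, sketched in Python so it can be tried outside Spark. checked_stride is a hypothetical helper, ValueError stands in for IllegalArgumentException, and the bounds are assumed to be non-negative so that // behaves like Scala's Long division.

```python
def checked_stride(lower_bound, upper_bound, num_partitions):
    """Hypothetical helper: compute the stride and reject a zero result."""
    stride = upper_bound // num_partitions - lower_bound // num_partitions
    if stride == 0:
        # Fail early instead of silently producing one overloaded partition.
        raise ValueError(
            "upperBound / numPartitions - lowerBound / numPartitions is zero "
            f"(lowerBound={lower_bound}, upperBound={upper_bound}, "
            f"numPartitions={num_partitions})"
        )
    return stride


print(checked_stride(1, 8, 8))  # bounds 1..8 over 8 partitions are accepted
try:
    checked_stride(0, 7, 8)     # bounds 0..7 over 8 partitions are rejected
except ValueError as err:
    print("rejected:", err)
```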

Because partitionColumn must be an integral type, loading a big table from a
DBMS in parallel requires a workaround, such as deriving an integer partition ID.


A real case: exporting data from an Oracle database through PySpark.

# Version with the data skew issue: PART_ID ranges over 0..7, so the stride is 0

df = ssc.read.format("jdbc").options(
    url=url,
    dbtable="( SELECT ORA_HASH(PART_COL,7) AS PART_ID, A.* FROM DBMS_TAB A ) TAB_ALIAS",
    fetchSize="1000",
    partitionColumn="PART_ID",
    numPartitions="8",
    lowerBound="0",
    upperBound="7").load()


# Version without the data skew issue: PART_ID ranges over 1..8, so the stride is 1
df = ssc.read.format("jdbc").options(
    url=url,
    dbtable="( SELECT ORA_HASH(PART_COL,7)+1 AS PART_ID, A.* FROM DBMS_TAB A ) TAB_ALIAS",
    fetchSize="1000",
    partitionColumn="PART_ID",
    numPartitions="8",
    lowerBound="1",
    upperBound="8").load()
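
To see why shifting the hash by one removes the skew, the sketch below assigns each possible PART_ID to the partition whose where clause would match it. partition_of is a hypothetical helper rather than a Spark API; it assumes the boundary logic described in this report and non-negative bounds.

```python
from collections import Counter


def partition_of(value, lower_bound, upper_bound, num_partitions):
    """Hypothetical: index of the first partition whose clause matches value."""
    stride = upper_bound // num_partitions - lower_bound // num_partitions
    current = lower_bound
    for i in range(num_partitions):
        low = current        # inclusive lower edge, ignored for the first partition
        current += stride
        high = current       # exclusive upper edge, ignored for the last partition
        above = i == 0 or value >= low
        below = i == num_partitions - 1 or value < high
        if above and below:
            return i
    raise AssertionError("no partition matched")


# ORA_HASH(PART_COL, 7) yields bucket ids 0..7; the second query shifts them to 1..8.
skewed = Counter(partition_of(h, 0, 7, 8) for h in range(0, 8))
even = Counter(partition_of(h, 1, 8, 8) for h in range(1, 9))
print("lowerBound=0, upperBound=7:", dict(skewed))
print("lowerBound=1, upperBound=8:", dict(even))
```

In the first configuration all eight hash buckets land in partition 7; in the second each partition receives exactly one bucket.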



