[jira] [Updated] (SPARK-14955) JDBCRelation should report an IllegalArgumentException if stride equals 0

Yang Juan hu (JIRA) Tue, 24 May 2016 20:25:35 -0700

     [ 
https://issues.apache.org/jira/browse/SPARK-14955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Yang Juan hu updated SPARK-14955:
---------------------------------
    Affects Version/s: 2.0.0

> JDBCRelation should report an IllegalArgumentException if stride equals 0
> -------------------------------------------------------------------------
>
>                 Key: SPARK-14955
>                 URL: https://issues.apache.org/jira/browse/SPARK-14955
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.5.1, 1.6.1, 2.0.0
>            Reporter: Yang Juan hu
>            Priority: Minor
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> In file 
> https://github.com/apache/spark/blob/40ed2af587cedadc6e5249031857a922b3b234ca/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRelation.scala
> row 56 and 57 has following line
> val stride: Long = (partitioning.upperBound / numPartitions
>                       - partitioning.lowerBound / numPartitions)
> if we invoke columnPartition as below: 
> columnPartition( JDBCPartitioningInfo("partitionColumn", 0, 7, 8) );
> columnPartition will generate following where condition:
> whereClause: partitionColumn < 0
> whereClause: partitionColumn >= 0 AND partitionColumn < 0
> whereClause: partitionColumn >= 0 AND partitionColumn < 0
> whereClause: partitionColumn >= 0 AND partitionColumn < 0
> whereClause: partitionColumn >= 0 AND partitionColumn < 0
> whereClause: partitionColumn >= 0 AND partitionColumn < 0
> whereClause: partitionColumn >= 0 AND partitionColumn < 0
> whereClause: partitionColumn >= 0
> it will cause data skew, the last partition will contain all data.
> Propose to throw an exception if stride equal 0, help spark user to aware  
> data skew issue ASAP.
> if (stride == 0) return throw new 
> IllegalArgumentException("partitioning.upperBound / numPartitions -  
> partitioning.lowerBound / numPartitions is zero");
> partitionColumn must be an integral type, if we want to load a big table from 
> DBMS,  we need to do some work around.
> Real case to export data from ORACLE database through pyspark.
> #data skew issue version
> df=ssc.read.format("jdbc").options( url=url,
>     dbtable="( SELECT ORA_HASH(PART_COL,7)  AS PART_ID, A.* FROM DBMS_TAB A ) 
> TAB_ALIAS",
>     fetchSize="1000",
>     partitionColumn="PART_ID",
>     numPartitions="8",
>     lowerBound="0",
>     upperBound="7").load()
> #no data skew issue version
> df=ssc.read.format("jdbc").options( url=url,
>     dbtable="( SELECT ORA_HASH(PART_COL,7)+1  AS PART_ID, A.* FROM DBMS_TAB A 
> ) TAB_ALIAS",
>     fetchSize="1000",
>     partitionColumn="PART_ID",
>     numPartitions="8",
>     lowerBound="1",
>     upperBound="8").load()



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-14955) JDBCRelation should report an IllegalArgumentException if stride equals 0

Reply via email to