[ https://issues.apache.org/jira/browse/SPARK-14955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Yang Juan hu updated SPARK-14955: --------------------------------- Affects Version/s: 2.0.0 > JDBCRelation should report an IllegalArgumentException if stride equals 0 > ------------------------------------------------------------------------- > > Key: SPARK-14955 > URL: https://issues.apache.org/jira/browse/SPARK-14955 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 1.5.1, 1.6.1, 2.0.0 > Reporter: Yang Juan hu > Priority: Minor > Original Estimate: 2h > Remaining Estimate: 2h > > In file > https://github.com/apache/spark/blob/40ed2af587cedadc6e5249031857a922b3b234ca/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRelation.scala > row 56 and 57 has following line > val stride: Long = (partitioning.upperBound / numPartitions > - partitioning.lowerBound / numPartitions) > if we invoke columnPartition as below: > columnPartition( JDBCPartitioningInfo("partitionColumn", 0, 7, 8) ); > columnPartition will generate following where condition: > whereClause: partitionColumn < 0 > whereClause: partitionColumn >= 0 AND partitionColumn < 0 > whereClause: partitionColumn >= 0 AND partitionColumn < 0 > whereClause: partitionColumn >= 0 AND partitionColumn < 0 > whereClause: partitionColumn >= 0 AND partitionColumn < 0 > whereClause: partitionColumn >= 0 AND partitionColumn < 0 > whereClause: partitionColumn >= 0 AND partitionColumn < 0 > whereClause: partitionColumn >= 0 > it will cause data skew, the last partition will contain all data. > Propose to throw an exception if stride equal 0, help spark user to aware > data skew issue ASAP. > if (stride == 0) return throw new > IllegalArgumentException("partitioning.upperBound / numPartitions - > partitioning.lowerBound / numPartitions is zero"); > partitionColumn must be an integral type, if we want to load a big table from > DBMS, we need to do some work around. > Real case to export data from ORACLE database through pyspark. > #data skew issue version > df=ssc.read.format("jdbc").options( url=url, > dbtable="( SELECT ORA_HASH(PART_COL,7) AS PART_ID, A.* FROM DBMS_TAB A ) > TAB_ALIAS", > fetchSize="1000", > partitionColumn="PART_ID", > numPartitions="8", > lowerBound="0", > upperBound="7").load() > #no data skew issue version > df=ssc.read.format("jdbc").options( url=url, > dbtable="( SELECT ORA_HASH(PART_COL,7)+1 AS PART_ID, A.* FROM DBMS_TAB A > ) TAB_ALIAS", > fetchSize="1000", > partitionColumn="PART_ID", > numPartitions="8", > lowerBound="1", > upperBound="8").load() -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org