Yang Juan hu created SPARK-14955: ------------------------------------ Summary: JDBCRelation should report an IllegalArgumentException if stride equals 0 Key: SPARK-14955 URL: https://issues.apache.org/jira/browse/SPARK-14955 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.6.1, 1.5.1 Reporter: Yang Juan hu Priority: Minor
In file https://github.com/apache/spark/blob/40ed2af587cedadc6e5249031857a922b3b234ca/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRelation.scala row 56 and 57 has following line val stride: Long = (partitioning.upperBound / numPartitions - partitioning.lowerBound / numPartitions) if we invoke columnPartition as below: columnPartition( JDBCPartitioningInfo("partitionColumn", 0, 7, 8) ); columnPartition will generate following where condition: whereClause: partitionColumn < 0 whereClause: partitionColumn >= 0 AND partitionColumn < 0 whereClause: partitionColumn >= 0 AND partitionColumn < 0 whereClause: partitionColumn >= 0 AND partitionColumn < 0 whereClause: partitionColumn >= 0 AND partitionColumn < 0 whereClause: partitionColumn >= 0 AND partitionColumn < 0 whereClause: partitionColumn >= 0 AND partitionColumn < 0 whereClause: partitionColumn >= 0 it will cause data skew, the last partition will contain all data. Propose to throw an exception if stride equal 0, help spark user to aware data skew issue ASAP. if (stride == 0) return throw new IllegalArgumentException("partitioning.upperBound / numPartitions - partitioning.lowerBound / numPartitions is zero"); partitionColumn must be an integral type, if we want to load a big table from DBMS, we need to do some work around. Real case to export data from ORACLE database through pyspark. #data skew issue version df=ssc.read.format("jdbc").options( url=url, dbtable="( SELECT ORA_HASH(PART_COL,7) AS PART_ID, A.* FROM DBMS_TAB A ) TAB_ALIAS", fetchSize="1000", partitionColumn="PART_ID", numPartitions="8", lowerBound="0", upperBound="7").load() #no data skew issue version df=ssc.read.format("jdbc").options( url=url, dbtable="( SELECT ORA_HASH(PART_COL,7)+1 AS PART_ID, A.* FROM DBMS_TAB A ) TAB_ALIAS", fetchSize="1000", partitionColumn="PART_ID", numPartitions="8", lowerBound="1", upperBound="8").load() -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org