[ https://issues.apache.org/jira/browse/SPARK-13156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15130411#comment-15130411 ]

Charles Drotar commented on SPARK-13156:
----------------------------------------

Thanks, Sean, for the quick response!

That was exactly my initial thought. I created the modulo of the id column as
the partition column specifically to address any skewness, by looking only at
the distribution of the final digit across five bins. There are approximately
48 million distinct ids, and they are distributed fairly evenly across the
five bins since they truly represent account id numbers. As a means of
validation I wrote a simple Python script to plot the modulo column. All five
bins, 0 through 4, are very close in counts and differ only by a small amount.
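
As an in-line illustration, here is a minimal Scala equivalent of that
validation (a sketch only; the actual check was a standalone Python script,
and df and the modulo column refer to the example code quoted below):

// Sketch of the bin-count validation, assuming `df` is the DataFrame
// produced by the sqlContext.read.jdbc(...) call in the quoted issue.
val binCounts = df.groupBy("modulo").count().orderBy("modulo")
binCounts.show()
// Shows five bins (0 through 4) with roughly equal counts, i.e. no
// meaningful skew on the partition column.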

> JDBC using multiple partitions creates additional tasks but only executes on one
> --------------------------------------------------------------------------------
>
>                 Key: SPARK-13156
>                 URL: https://issues.apache.org/jira/browse/SPARK-13156
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output
>    Affects Versions: 1.5.0
>         Environment: Hadoop 2.6.0-cdh5.4.0, Teradata, yarn-client
>            Reporter: Charles Drotar
>
> I can successfully kick off a query through JDBC to Teradata, and when it
> runs it creates a task on each executor for every partition. The problem is
> that all of the tasks except one complete within a couple of seconds, while
> the final task handles the entire dataset.
> Example Code:
> private val properties = new java.util.Properties()
> properties.setProperty("driver", "com.teradata.jdbc.TeraDriver")
> properties.setProperty("user", "foo")
> properties.setProperty("password", "bar")
> // FASTEXPORT with 10 sessions against the Teradata host "oneview"
> val url = "jdbc:teradata://oneview/TMODE=TERA,TYPE=FASTEXPORT,SESSIONS=10"
> val numPartitions = 5
> // s-interpolator so numPartitions is actually substituted into the subquery
> val dbTableTemp = s"(SELECT id MOD $numPartitions AS modulo, id FROM db.table) AS TEMP_TABLE"
> val partitionColumn = "modulo"
> val lowerBound = 0L
> val upperBound = (numPartitions - 1).toLong
> val df = sqlContext.read.jdbc(url, dbTableTemp, partitionColumn, lowerBound, upperBound, numPartitions, properties)
> df.write.parquet("/output/path/for/df/")
> When I look at the Spark UI I see the 5 tasks, but only 1 is actually 
> querying.
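
For anyone reading along, here is a minimal sketch of the partition-splitting
arithmetic Spark 1.5 applies to these bounds (an illustrative reconstruction
of JDBCRelation.columnPartition, not the exact source; it also omits the
IS NULL handling on the first partition). With lowerBound = 0, upperBound = 4
and numPartitions = 5, the integer stride rounds down to 0, so the last
partition's predicate degenerates to modulo >= 0 and matches every row:

// Illustrative reconstruction of how Spark 1.5 turns the JDBC bounds
// into one WHERE-clause fragment per partition (simplified sketch).
def wherePredicates(column: String, lower: Long, upper: Long, numPartitions: Int): Seq[String] = {
  // Integer division: (4 / 5) - (0 / 5) == 0, so the stride collapses.
  val stride = upper / numPartitions - lower / numPartitions
  var current = lower
  (0 until numPartitions).map { i =>
    val lo = if (i != 0) Some(s"$column >= $current") else None
    current += stride
    val hi = if (i != numPartitions - 1) Some(s"$column < $current") else None
    Seq(lo, hi).flatten.mkString(" AND ")
  }
}

// wherePredicates("modulo", 0L, 4L, 5) yields:
//   modulo < 0
//   modulo >= 0 AND modulo < 0   (three times)
//   modulo >= 0
// i.e. four empty partitions and one that scans the whole table, which
// matches the single long-running task described above. If this
// reconstruction is right, setting upperBound to numPartitions (5)
// would give a stride of 1 and split the five bins as intended.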


