lichenglin created SPARK-18413:
----------------------------------

             Summary: Add a property to control the number of partitions when save a jdbc rdd
                 Key: SPARK-18413
                 URL: https://issues.apache.org/jira/browse/SPARK-18413
             Project: Spark
          Issue Type: Wish
          Components: SQL
    Affects Versions: 2.0.1
            Reporter: lichenglin

{code}
CREATE or replace TEMPORARY VIEW resultview
USING org.apache.spark.sql.jdbc
OPTIONS (
  url "jdbc:oracle:thin:@10.129.10.111:1521:BKDB",
  dbtable "result",
  user "HIVE",
  password "HIVE"
);
--set spark.sql.shuffle.partitions=200
insert overwrite table resultview select g,count(1) as count from tnet.DT_LIVE_INFO group by g
{code}

I'm trying to save a Spark SQL result to Oracle, and I found that Spark creates a JDBC connection for each partition.
If the query produces too many partitions, the database can't handle that many connections and returns an exception.

In the situation above the number of partitions is 200, because of the "group by" and "spark.sql.shuffle.partitions".

The relevant Spark source code, in JdbcUtils, is:

{code}
def saveTable(
    df: DataFrame,
    url: String,
    table: String,
    properties: Properties) {
  val dialect = JdbcDialects.get(url)
  val nullTypes: Array[Int] = df.schema.fields.map { field =>
    getJdbcType(field.dataType, dialect).jdbcNullType
  }

  val rddSchema = df.schema
  val getConnection: () => Connection = createConnectionFactory(url, properties)
  val batchSize = properties.getProperty(JDBC_BATCH_INSERT_SIZE, "1000").toInt
  df.foreachPartition { iterator =>
    savePartition(getConnection, table, iterator, rddSchema, nullTypes, batchSize, dialect)
  }
}
{code}

Maybe we can add a property so that the write becomes df.repartition(num).foreachPartition?
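
Until such a property exists, the same effect can probably be had from user code by reducing the number of partitions before the JDBC write. The following is only a sketch of that workaround, not the change to saveTable proposed above. The names CappedJdbcWrite, cappedPartitionCount, capPartitions and maxConnections are hypothetical, and the limit of 10 is an arbitrary illustrative value. It assumes Spark 2.x on the classpath with Hive support, so that tnet.DT_LIVE_INFO resolves.

```scala
import java.util.Properties
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

object CappedJdbcWrite {

  // Hypothetical helper: how many partitions (and therefore JDBC
  // connections) to use, given the current count and an upper limit.
  def cappedPartitionCount(current: Int, maxConnections: Int): Int = {
    require(maxConnections > 0, "maxConnections must be positive")
    math.min(current, maxConnections)
  }

  // Hypothetical helper: shrink the DataFrame to at most maxConnections
  // partitions. coalesce only merges partitions, it adds no extra shuffle.
  def capPartitions(df: DataFrame, maxConnections: Int): DataFrame = {
    val current = df.rdd.getNumPartitions
    val target = cappedPartitionCount(current, maxConnections)
    if (target < current) df.coalesce(target) else df
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("capped-jdbc-write")
      .enableHiveSupport() // assumption: the source table lives in Hive
      .getOrCreate()

    val props = new Properties()
    props.setProperty("user", "HIVE")
    props.setProperty("password", "HIVE")

    // Same aggregation as the SQL in the report; the group by yields
    // spark.sql.shuffle.partitions partitions (200 by default).
    val result = spark.table("tnet.DT_LIVE_INFO").groupBy("g").count()

    // At most 10 partitions reach the JDBC writer, so at most 10
    // connections are opened (10 is an illustrative limit).
    capPartitions(result, maxConnections = 10)
      .write
      .mode(SaveMode.Overwrite)
      .jdbc("jdbc:oracle:thin:@10.129.10.111:1521:BKDB", "result", props)

    spark.stop()
  }
}
```

Note the trade-off: coalesce avoids a second shuffle, but the aggregation's final stage then also runs with the reduced number of tasks. Using repartition(num), as suggested in the report, keeps the aggregation at full parallelism at the cost of one more shuffle.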