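Hi, I suggest you partition the JDBC read on an indexed column of the MySQL table. A plain read.jdbc(url, table, prop) call pulls the whole table through a single connection in one task, which is likely why the connection is dropped partway through.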
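
A minimal sketch of what I mean, not a tuned solution. The helper name rangePredicates, the column name "id", and the 1 to 1,000,000,000 value range are assumptions for illustration; substitute an indexed numeric column and the real min/max from your table. The helper only builds WHERE-clause strings; the commented lines show how they might be passed to the predicates overload of read.jdbc.

```scala
// Split [lower, upper] into at most `parts` contiguous ranges on a numeric column.
// Each returned string is meant to be the WHERE clause of one JDBC partition.
def rangePredicates(column: String, lower: Long, upper: Long, parts: Int): Array[String] = {
  require(parts > 0 && lower <= upper, "need parts > 0 and lower <= upper")
  val total = upper - lower + 1
  val step = math.max(1L, (total + parts - 1) / parts) // ceiling division
  (lower to upper by step).map { start =>
    val end = math.min(start + step - 1, upper)
    s"$column >= $start AND $column <= $end"
  }.toArray
}

// Hypothetical usage in spark-shell, reusing url1 and prop from the quoted code
// (assumes an indexed numeric column "id" spanning roughly 1 to 1,000,000,000):
//   val predicates = rangePredicates("id", 1L, 1000000000L, 200)
//   val jdbcDF1 = sqlContext.read.jdbc(url1, "video", predicates, prop)
```

DataFrameReader.jdbc also has an overload that takes columnName, lowerBound, upperBound and numPartitions and generates similar range queries itself. Either way, each task should run a much shorter query, which should make it less likely that a connection stays open long enough to hit the timeout.
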

2016-01-20 10:11 GMT+08:00 fightf...@163.com <fightf...@163.com>:

> Hi,
> I want to load really large volume datasets from MySQL using the Spark
> DataFrame API, and then save them as Parquet or ORC files for use with
> Hive / Impala. The dataset size is about 1 billion records, and when I
> run the following naive code, errors occur and executors are lost.
>
> val prop = new java.util.Properties
> prop.setProperty("user","test")
> prop.setProperty("password", "test")
>
> val url1 = "jdbc:mysql://172.16.54.136:3306/db1"
> val url2 = "jdbc:mysql://172.16.54.138:3306/db1"
> val jdbcDF1 = sqlContext.read.jdbc(url1,"video",prop)
> val jdbcDF2 = sqlContext.read.jdbc(url2,"video",prop)
>
> val jdbcDF3 = jdbcDF1.unionAll(jdbcDF2)
> jdbcDF3.write.format("parquet").save("hdfs://172.16.54.138:8020/perf")
>
> The executor log is shown below. I can see from it that the wait_timeout
> threshold was reached and that there is no retry mechanism in the code
> path. So I am asking you experts to help with tuning this. Or should I
> try to use a JDBC connection pool to increase parallelism?
>
> 16/01/19 17:04:28 ERROR executor.Executor: Exception in task 0.0 in stage 0.0 (TID 0)
> com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure
>
> The last packet successfully received from the server was 377,769 milliseconds ago. The last packet sent successfully to the server was 377,790 milliseconds ago.
>     at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>
> Caused by: java.io.EOFException: Can not read response from server. Expected to read 4 bytes, read 1 bytes before connection was unexpectedly lost.
>     at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:2914)
>     at com.mysql.jdbc.MysqlIO.nextRowFast(MysqlIO.java:1996)
>     ... 22 more
>
> 16/01/19 17:10:47 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 4
> 16/01/19 17:10:47 INFO jdbc.JDBCRDD: closed connection
>
> 16/01/19 17:10:47 ERROR executor.Executor: Exception in task 1.1 in stage 0.0 (TID 2)
> com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure
>
> ------------------------------
> fightf...@163.com