Looks like a bug in your lambda function: some of the lines you are processing must split into fewer than 6 fields, so indexing into the array fails. The ArrayIndexOutOfBoundsException: 1 in your trace means at least one line produced only a single field, so even p(1) is out of range, let alone p(5).
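
One way to guard against that is to split with a limit of -1 (so trailing empty columns aren't dropped) and filter out any row that still has fewer than 6 fields before building Point. Rough sketch below, reusing your Point/sc/df names; the SimpleDateFormat pattern is just a placeholder since your df definition wasn't in the snippet, and it assumes you still have the sqlContext implicits in scope for saveAsParquetFile, as in your working single-file run:

import java.text.SimpleDateFormat
import java.util.Date

case class Point(dt: String, uid: String, kw: String, tz: Int,
                 success: Int, code: String)

// placeholder pattern -- substitute whatever your df actually uses
val df = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")

// split with limit -1 so trailing empty columns are kept as empty strings
val rows = sc.textFile("data/raw_data_*").map(_.split("\t", -1))

// optional: count the malformed lines so you know how much is being dropped
println("rows with fewer than 6 fields: " + rows.filter(_.length < 6).count())

val point = rows.filter(_.length >= 6).map(p =>
  Point(df.format(new Date(p(0).trim.toLong * 1000L)),
        p(1), p(2), p(3).trim.toInt, p(4).trim.toInt, p(5)))

point.saveAsParquetFile("point.parquet")

FWIW, this isn't related to the 5M rows or SPARK_DRIVER_MEMORY: the task is failing on a malformed line, not on memory.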
On Wed, Jul 23, 2014 at 11:44 AM, buntu <buntu...@gmail.com> wrote:
> Thanks Michael.
>
> If I read in multiple files and attempt to saveAsParquetFile() I get the
> ArrayIndexOutOfBoundsException. I don't see this if I try the same with a
> single file:
>
> > case class Point(dt: String, uid: String, kw: String, tz: Int, success:
> > Int, code: String )
>
> > val point = sc.textFile("data/raw_data_*").map(_.split("\t")).map(p =>
> > Point(df.format(new Date( p(0).trim.toLong*1000L )), p(1), p(2),
> > p(3).trim.toInt, p(4).trim.toInt ,p(5)))
>
> > point.saveAsParquetFile("point.parquet")
>
> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
> SLF4J: Defaulting to no-operation (NOP) logger implementation
> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
> 14/07/23 11:30:54 ERROR Executor: Exception in task ID 18
> java.lang.ArrayIndexOutOfBoundsException: 1
>     at $line17.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(<console>:21)
>     at $line17.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(<console>:21)
>     at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>     at scala.collection.Iterator$$anon$1.next(Iterator.scala:853)
>     at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>     at org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:248)
>     at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:264)
>     at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:264)
>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
>     at org.apache.spark.scheduler.Task.run(Task.scala:51)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>     at java.lang.Thread.run(Thread.java:745)
>
> Is this due to the amount of data (about 5M rows) being processed? I've set
> the SPARK_DRIVER_MEMORY to 8g.
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Convert-raw-data-files-to-Parquet-format-tp10526p10536.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.