Error when saving as parquet to S3

2015-04-30 Thread cosmincatalin
After repartitioning a DataFrame in Spark 1.3.0, I get a Parquet exception
when saving to Amazon's S3. The data I am trying to write is about 10 GB.

logsForDate
.repartition(10)
.saveAsParquetFile(destination) // -- Exception here
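
For context, here is roughly how the job is put together. The bucket names
and the way logsForDate is built below are illustrative only, not the exact
code:

// Illustrative sketch of the pipeline; paths are made up.
val logsForDate = sqlContext.parquetFile("s3n://my-bucket/logs/date=2015-04-29/")
val destination = "s3n://my-bucket/output/date=2015-04-29/"

logsForDate
.repartition(10)
.saveAsParquetFile(destination) // exception is thrown here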

The exception I receive is:

java.io.IOException: The file being written is in an invalid state. Probably caused by an error thrown previously. Current state: COLUMN
  at parquet.hadoop.ParquetFileWriter$STATE.error(ParquetFileWriter.java:137)
  at parquet.hadoop.ParquetFileWriter$STATE.startBlock(ParquetFileWriter.java:129)
  at parquet.hadoop.ParquetFileWriter.startBlock(ParquetFileWriter.java:173)
  at parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:152)
  at parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:112)
  at parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:73)
  at org.apache.spark.sql.parquet.ParquetRelation2.org$apache$spark$sql$parquet$ParquetRelation2$$writeShard$1(newParquet.scala:635)
  at org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$insert$2.apply(newParquet.scala:649)
  at org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$insert$2.apply(newParquet.scala:649)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
  at org.apache.spark.scheduler.Task.run(Task.scala:64)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
  at java.lang.Thread.run(Thread.java:745)

I would like to know what the problem is and how to solve it.





Disable partition discovery

2015-04-24 Thread cosmincatalin
How can one disable *Partition discovery* in *Spark 1.3.0* when using
*sqlContext.parquetFile*?

Alternatively, is there a way to load /.parquet/ files without *Partition
discovery*?
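
To illustrate the kind of workaround I have in mind: pass the leaf paths to
*sqlContext.parquetFile* explicitly instead of the partitioned root folder, so
there is no directory tree to discover. This is only a sketch under the
assumption that parquetFile accepts multiple paths in 1.3.0; the s3n:// paths
are made up:

// Hypothetical paths; the point is to avoid handing Spark the root folder.
val paths = Seq(
  "s3n://my-bucket/logs/date=2015-04-20",
  "s3n://my-bucket/logs/date=2015-04-21"
)
val df = sqlContext.parquetFile(paths: _*)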




Loading lots of .parquet files in Spark 1.3.1 (Hadoop 2.4)

2015-04-22 Thread cosmincatalin
I am trying to read a few hundred .parquet files from S3 into an EMR cluster.
The .parquet files are structured by date and have /_common_metadata/ in
each of the folders (as well as /_metadata/). The *sqlContext.parquetFile*
operation takes a very long time, opening each of the .parquet files for
reading. I would have expected the /*_metadata/ files to be used for the
structure, so that Spark does not have to go through all the files in a
folder. I have also tried this experiment on a single folder: all the
.parquet files were opened and the /*_metadata/ files were apparently
ignored.

What can I do to speed up the loading process? Can I load the .parquet files
in parallel? What is the purpose of the /*_metadata/ files?
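
To make the "load in parallel" question concrete, here is a rough sketch of
what I have in mind: load one DataFrame per date folder, triggering the
metadata reads from a parallel collection on the driver, and combine them with
unionAll. The folder paths are made up and all folders are assumed to share
the same schema:

// Hypothetical folder list; all folders must have an identical schema.
val folders = Seq(
  "s3n://my-bucket/logs/date=2015-04-20",
  "s3n://my-bucket/logs/date=2015-04-21",
  "s3n://my-bucket/logs/date=2015-04-22"
)

// Open the folders concurrently, then union the per-folder DataFrames.
val perFolder = folders.par.map(path => sqlContext.parquetFile(path)).toList
val all = perFolder.reduce(_ unionAll _)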


