Error when saving as parquet to S3

2015-04-30 Thread cosmincatalin
After repartitioning a DataFrame in Spark 1.3.0, I get a Parquet exception
when saving to Amazon's S3. The data I am trying to write is about 10 GB.

logsForDate
.repartition(10)
.saveAsParquetFile(destination) // -- Exception here
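
For context, here is roughly how the job is put together. The bucket names
and the way logsForDate is built below are illustrative only, not the exact
code:

// Illustrative sketch of the pipeline; paths are made up.
val logsForDate = sqlContext.parquetFile("s3n://my-bucket/logs/date=2015-04-29/")
val destination = "s3n://my-bucket/output/date=2015-04-29/"

logsForDate
.repartition(10)
.saveAsParquetFile(destination) // exception is thrown here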

The exception I receive is:

java.io.IOException: The file being written is in an invalid state. Probably caused by an error thrown previously. Current state: COLUMN
  at parquet.hadoop.ParquetFileWriter$STATE.error(ParquetFileWriter.java:137)
  at parquet.hadoop.ParquetFileWriter$STATE.startBlock(ParquetFileWriter.java:129)
  at parquet.hadoop.ParquetFileWriter.startBlock(ParquetFileWriter.java:173)
  at parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:152)
  at parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:112)
  at parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:73)
  at org.apache.spark.sql.parquet.ParquetRelation2.org$apache$spark$sql$parquet$ParquetRelation2$$writeShard$1(newParquet.scala:635)
  at org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$insert$2.apply(newParquet.scala:649)
  at org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$insert$2.apply(newParquet.scala:649)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
  at org.apache.spark.scheduler.Task.run(Task.scala:64)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
  at java.lang.Thread.run(Thread.java:745)

I would like to know what the problem is and how to solve it.





Disable partition discovery

2015-04-24 Thread cosmincatalin
How can one disable *Partition discovery* in *Spark 1.3.0* when using
*sqlContext.parquetFile*?

Alternatively, is there a way to load /.parquet/ files without *Partition
discovery*?
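
To illustrate the kind of workaround I have in mind: pass the leaf paths to
*sqlContext.parquetFile* explicitly instead of the partitioned root folder, so
there is no directory tree to discover. This is only a sketch under the
assumption that parquetFile accepts multiple paths in 1.3.0; the s3n:// paths
are made up:

// Hypothetical paths; the point is to avoid handing Spark the root folder.
val paths = Seq(
  "s3n://my-bucket/logs/date=2015-04-20",
  "s3n://my-bucket/logs/date=2015-04-21"
)
val df = sqlContext.parquetFile(paths: _*)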




Loading lots of .parquet files in Spark 1.3.1 (Hadoop 2.4)

2015-04-22 Thread cosmincatalin
I am trying to read a few hundred .parquet files from S3 into an EMR cluster.
The .parquet files are structured by date and have /_common_metadata/ in
each of the folders (as well as /_metadata/). The *sqlContext.parquetFile*
operation takes a very long time, opening each of the .parquet files for
reading. I would have expected the /*_metadata/ files to be used for the
structure, so that Spark does not have to go through all the files in a
folder. I have also tried this experiment on a single folder: all the
.parquet files were opened and the /*_metadata/ files were apparently
ignored.

What can I do to speed up the loading process? Can I load the .parquet files
in parallel? What is the purpose of the /*_metadata/ files?
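
To make the "load in parallel" question concrete, here is a rough sketch of
what I have in mind: load one DataFrame per date folder, triggering the
metadata reads from a parallel collection on the driver, and combine them with
unionAll. The folder paths are made up and all folders are assumed to share
the same schema:

// Hypothetical folder list; all folders must have an identical schema.
val folders = Seq(
  "s3n://my-bucket/logs/date=2015-04-20",
  "s3n://my-bucket/logs/date=2015-04-21",
  "s3n://my-bucket/logs/date=2015-04-22"
)

// Open the folders concurrently, then union the per-folder DataFrames.
val perFolder = folders.par.map(path => sqlContext.parquetFile(path)).toList
val all = perFolder.reduce(_ unionAll _)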


