Just out of curiosity, what is the advantage of using Parquet without Hadoop?

Sent from my iPhone
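The short answer running through this thread is that Parquet is just a file format: Spark can write and read it against the plain local filesystem, with no HDFS or Hadoop cluster involved. A minimal sketch of that round trip, assuming the Spark 1.4.x spark-shell used in the thread (where sc and sqlContext already exist); the path and column name here are illustrative, not from the thread:

// In spark-shell (1.4.x), sc and sqlContext are already created for you.
import sqlContext.implicits._

// Same toy data as in the thread below.
val data = sc.parallelize(Seq(2, 3, 5, 7, 2, 3, 6, 1)).toDF("value")

// An explicit file:// URI makes it plain this is the local filesystem, not HDFS.
data.write.parquet("file:///tmp/parquet_demo")

// Read it back to verify the round trip.
sqlContext.read.parquet("file:///tmp/parquet_demo").show()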
> On 11 Aug, 2015, at 11:12 am, <saif.a.ell...@wellsfargo.com> wrote:
>
> I confirm that it works.
>
> I was just hitting this issue: https://issues.apache.org/jira/browse/SPARK-8450
>
> Saif
>
> From: Ellafi, Saif A.
> Sent: Tuesday, August 11, 2015 12:01 PM
> To: Ellafi, Saif A.; deanwamp...@gmail.com
> Cc: user@spark.apache.org
> Subject: RE: Parquet without hadoop: Possible?
>
> Sorry, I provided bad information. This example worked fine with reduced
> parallelism.
>
> It seems my problem has to do with something specific to the real data
> frame at the point of reading.
>
> Saif
>
> From: saif.a.ell...@wellsfargo.com [mailto:saif.a.ell...@wellsfargo.com]
> Sent: Tuesday, August 11, 2015 11:49 AM
> To: deanwamp...@gmail.com
> Cc: user@spark.apache.org
> Subject: RE: Parquet without hadoop: Possible?
>
> I am launching my spark-shell with:
> spark-1.4.1-bin-hadoop2.6/bin/spark-shell
>
> 15/08/11 09:43:32 INFO SparkILoop: Created sql context (with Hive support)..
> SQL context available as sqlContext.
>
> scala> val data = sc.parallelize(Array(2,3,5,7,2,3,6,1)).toDF
> scala> data.write.parquet("/var/data/Saif/pq")
>
> Then I get a million errors:
>
> 15/08/11 09:46:01 INFO CodecPool: Got brand-new compressor [.gz]
> 15/08/11 09:46:01 INFO CodecPool: Got brand-new compressor [.gz]
> 15/08/11 09:46:01 INFO CodecPool: Got brand-new compressor [.gz]
> 15/08/11 09:46:07 ERROR InsertIntoHadoopFsRelation: Aborting task.
> java.lang.OutOfMemoryError: Java heap space
> 15/08/11 09:46:09 ERROR InsertIntoHadoopFsRelation: Aborting task.
> java.lang.OutOfMemoryError: Java heap space
> ... (the same "Aborting task" / OutOfMemoryError pair repeats for several more tasks) ...
> 15/08/11 09:46:07 ERROR InsertIntoHadoopFsRelation: Aborting task.
> java.lang.OutOfMemoryError: Java heap space
>     at parquet.bytes.CapacityByteArrayOutputStream.initSlabs(CapacityByteArrayOutputStream.java:65)
>     at parquet.bytes.CapacityByteArrayOutputStream.<init>(CapacityByteArrayOutputStream.java:57)
>     at parquet.hadoop.ColumnChunkPageWriteStore$ColumnChunkPageWriter.<init>(ColumnChunkPageWriteStore.java:68)
>     at parquet.hadoop.ColumnChunkPageWriteStore$ColumnChunkPageWriter.<init>(ColumnChunkPageWriteStore.java:48)
>     at parquet.hadoop.ColumnChunkPageWriteStore.getPageWriter(ColumnChunkPageWriteStore.java:215)
>     at parquet.column.impl.ColumnWriteStoreImpl.newMemColumn(ColumnWriteStoreImpl.java:67)
>     at parquet.column.impl.ColumnWriteStoreImpl.getColumnWriter(ColumnWriteStoreImpl.java:56)
>     at parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.<init>(MessageColumnIO.java:178)
>     at parquet.io.MessageColumnIO.getRecordWriter(MessageColumnIO.java:369)
>     at parquet.hadoop.InternalParquetRecordWriter.initStore(InternalParquetRecordWriter.java:108)
>     at parquet.hadoop.InternalParquetRecordWriter.<init>(InternalParquetRecordWriter.java:94)
>     at parquet.hadoop.ParquetRecordWriter.<init>(ParquetRecordWriter.java:64)
>     at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:282)
>     at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:252)
>     at org.apache.spark.sql.parquet.ParquetOutputWriter.<init>(newParquet.scala:83)
>     at org.apache.spark.sql.parquet.ParquetRelation2$$anon$4.newInstance(newParquet.scala:229)
>     at org.apache.spark.sql.sources.DefaultWriterContainer.initWriters(commands.scala:470)
>     at org.apache.spark.sql.sources.BaseWriterContainer.executorSideSetup(commands.scala:360)
>     at org.apache.spark.sql.sources.InsertIntoHadoopFsRelation.org$apache$spark$sql$sources$InsertIntoHadoopFsRelation$$writeRows$1(commands.scala:172)
>     at org.apache.spark.sql.sources.InsertIntoHadoopFsRelation$$anonfun$insert$1.apply(commands.scala:160)
>     at org.apache.spark.sql.sources.InsertIntoHadoopFsRelation$$anonfun$insert$1.apply(commands.scala:160)
>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
>     at org.apache.spark.scheduler.Task.run(Task.scala:70)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:745)
> 15/08/11 09:46:08 ERROR InsertIntoHadoopFsRelation: Aborting task.
> ...
> 15/08/11 09:46:10 ERROR DefaultWriterContainer: Task attempt attempt_201508110946_0000_m_000011_0 aborted.
> 15/08/11 09:46:10 ERROR Executor: Exception in task 31.0 in stage 0.0 (TID 31)
> org.apache.spark.SparkException: Task failed while writing rows.
>     at org.apache.spark.sql.sources.InsertIntoHadoopFsRelation.org$apache$spark$sql$sources$InsertIntoHadoopFsRelation$$writeRows$1(commands.scala:191)
>     at org.apache.spark.sql.sources.InsertIntoHadoopFsRelation$$anonfun$insert$1.apply(commands.scala:160)
>     at org.apache.spark.sql.sources.InsertIntoHadoopFsRelation$$anonfun$insert$1.apply(commands.scala:160)
>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
>     at org.apache.spark.scheduler.Task.run(Task.scala:70)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.OutOfMemoryError: Java heap space
> ...
>
> From: Dean Wampler [mailto:deanwamp...@gmail.com]
> Sent: Tuesday, August 11, 2015 11:39 AM
> To: Ellafi, Saif A.
> Cc: user@spark.apache.org
> Subject: Re: Parquet without hadoop: Possible?
>
> It should work fine. I have an example script here:
> https://github.com/deanwampler/spark-workshop/blob/master/src/main/scala/sparkworkshop/SparkSQLParquet10-script.scala
> (Spark 1.4.X)
>
> What does "I am failing to do so" mean?
>
> Dean Wampler, Ph.D.
> Author: Programming Scala, 2nd Edition (O'Reilly)
> Typesafe
> @deanwampler
> http://polyglotprogramming.com
>
> On Tue, Aug 11, 2015 at 9:28 AM, <saif.a.ell...@wellsfargo.com> wrote:
> Hi all,
>
> I don't have any Hadoop filesystem installed in my environment, but I would
> like to store dataframes in Parquet files. I am failing to do so; does
> anyone have any pointers?
>
> Thank you,
> Saif
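The fix Saif reports above is "reduced parallelism". A hedged sketch of what that can look like in the 1.4.x spark-shell: each open Parquet writer buffers column data on the heap before flushing a row group, so capping the number of concurrent write tasks (and optionally shrinking the row-group size) lowers the peak heap demand. The path, sizes, and partition counts below are illustrative, not from the thread, and whether parquet.block.size takes effect via hadoopConfiguration is an assumption about how Spark 1.4 forwards the Hadoop conf to parquet-mr:

// In spark-shell (1.4.x), sc and sqlContext already exist.
import sqlContext.implicits._

// A smaller row-group size lowers each writer's heap footprint.
// (Assumption: Spark 1.4 passes this Hadoop conf through to parquet-mr;
// the 32 MB value is illustrative.)
sc.hadoopConfiguration.setInt("parquet.block.size", 32 * 1024 * 1024)

// A deliberately over-partitioned DataFrame, standing in for the real one.
val df = sc.parallelize(1 to 1000000, 200).toDF("n")

// coalesce() caps the number of write tasks, and with it the number of
// Parquet writers open at once per executor, without a full shuffle.
df.coalesce(8).write.parquet("file:///tmp/parquet_workaround")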