Is there a way to save a CSV file fast?

2016-02-10 Thread Eli Super
Hi

I work with pyspark & spark 1.5.2

Currently, saving an RDD to a CSV file is very slow; it uses only 2% CPU.

I use:
my_dd.write.format("com.databricks.spark.csv").option("header",
"false").save('file:///my_folder')

Is there a way to save CSV faster?

Many thanks


Re: Is there a way to save a CSV file fast?

2016-02-10 Thread Steve Loughran

> On 10 Feb 2016, at 10:56, Eli Super <eli.su...@gmail.com> wrote:
> 
> Hi 
> 
> I work with pyspark & spark 1.5.2 
> 
> Currently, saving an RDD to a CSV file is very slow; it uses only 2% CPU.
> 
> I use:
> my_dd.write.format("com.databricks.spark.csv").option("header",
> "false").save('file:///my_folder')
> 
> Is there a way to save CSV faster?
> 
> Many thanks

On a Linux box you can run "iotop" to see what's happening on the disks; it may
just be that the disk is the bottleneck.
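
For example (assuming iotop is installed; the -o flag limits the output to
processes that are actually doing I/O):

sudo iotop -o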




Re: Is there a way to save a CSV file fast?

2016-02-10 Thread Gourav Sengupta
Hi,

Write throughput, in terms of the number of records written simultaneously, can
be increased by increasing the number of partitions; try repartitioning and see
how it performs, as in the sketch below. There is, though, an upper cap on the
number of parallel writes you can do (one I ran into on Ubuntu), imposed by the
operating system setup, and when you hit it you will see errors.
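
For example, a minimal sketch of the repartitioned write based on your original
snippet (32 is an arbitrary partition count to tune for your cluster):

my_dd.repartition(32) \
    .write.format("com.databricks.spark.csv") \
    .option("header", "false") \
    .save('file:///my_folder')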

By the way, why are you expecting high CPU utilisation for writes? Shouldn't it
be more of an I/O issue? But I may be wrong.


Regards,
Gourav

On Wed, Feb 10, 2016 at 10:56 AM, Eli Super <eli.su...@gmail.com> wrote:

> Hi
>
> I work with pyspark & spark 1.5.2
>
> Currently, saving an RDD to a CSV file is very slow; it uses only 2% CPU.
>
> I use:
> my_dd.write.format("com.databricks.spark.csv").option("header",
> "false").save('file:///my_folder')
>
> Is there a way to save CSV faster?
>
> Many thanks
>


Re: parquet.io.ParquetEncodingException Warning when trying to save parquet file in Spark

2015-11-09 Thread Fengdong Yu
Yes, that’s the problem.
http://search.maven.org/#artifactdetails%7Ccom.twitter%7Cparquet-avro%7C1.6.0%7Cjar

This artifact depends on parquet-hadoop-1.6.0, which triggers this bug.

Can you change the version to 1.6.0rc7 manually?
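
For reference, the pinned dependency would look like this in the pom (a sketch,
assuming the 1.6.0rc7 artifact is published under the same coordinates):

<dependency>
  <groupId>com.twitter</groupId>
  <artifactId>parquet-avro</artifactId>
  <version>1.6.0rc7</version>
</dependency>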




> On Nov 9, 2015, at 9:34 PM, swetha kasireddy <swethakasire...@gmail.com> wrote:
> 
> I am using the following:
> 
> <dependency>
>   <groupId>com.twitter</groupId>
>   <artifactId>parquet-avro</artifactId>
>   <version>1.6.0</version>
> </dependency>
> 
> On Mon, Nov 9, 2015 at 1:00 AM, Fengdong Yu <fengdo...@everstring.com> wrote:
> Which Spark version are you using?
> 
> It was fixed in Parquet 1.7.x, so Spark 1.5.x should work.
> 
> 
> 
> 
> > On Nov 9, 2015, at 3:43 PM, swetha <swethakasire...@gmail.com> wrote:
> >
> > Hi,
> >
> > I see an unwanted warning when I try to save a Parquet file to HDFS in Spark.
> > Please find the code and the warning message below. Any idea how to
> > avoid the warning?
> >
> > activeSessionsToBeSaved.saveAsNewAPIHadoopFile("test", classOf[Void],
> > classOf[ActiveSession],
> >  classOf[ParquetOutputFormat[ActiveSession]], job.getConfiguration)
> >
> > Nov 8, 2015 11:35:39 PM WARNING: parquet.hadoop.ParquetOutputCommitter:
> > could not write summary file for active_sessions_current
> > parquet.io.ParquetEncodingException:
> > maprfs:/user/testId/active_sessions_current/part-r-00142.parquet invalid:
> > all the files must be contained in the root active_sessions_current
> >   at
> > parquet.hadoop.ParquetFileWriter.mergeFooters(ParquetFileWriter.java:422)
> >   at
> > parquet.hadoop.ParquetFileWriter.writeMetadataFile(ParquetFileWriter.java:398)
> >   at
> > parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:51)
> >   at
> > org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1056)
> >   at
> > org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:998)
> >
> >
> >
> > --
> > View this message in context: 
> > http://apache-spark-user-list.1001560.n3.nabble.com/parquet-io-ParquetEncodingException-Warning-when-trying-to-save-parquet-file-in-Spark-tp25326.html
> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
> >


Re: parquet.io.ParquetEncodingException Warning when trying to save parquet file in Spark

2015-11-09 Thread swetha kasireddy
I am using the following:

<dependency>
  <groupId>com.twitter</groupId>
  <artifactId>parquet-avro</artifactId>
  <version>1.6.0</version>
</dependency>


On Mon, Nov 9, 2015 at 1:00 AM, Fengdong Yu <fengdo...@everstring.com>
wrote:

> Which Spark version are you using?
>
> It was fixed in Parquet 1.7.x, so Spark 1.5.x should work.
>
>
>
>
> > On Nov 9, 2015, at 3:43 PM, swetha <swethakasire...@gmail.com> wrote:
> >
> > Hi,
> >
> > I see an unwanted warning when I try to save a Parquet file to HDFS in Spark.
> > Please find the code and the warning message below. Any idea how to
> > avoid the warning?
> >
> > activeSessionsToBeSaved.saveAsNewAPIHadoopFile("test", classOf[Void],
> > classOf[ActiveSession],
> >  classOf[ParquetOutputFormat[ActiveSession]], job.getConfiguration)
> >
> > Nov 8, 2015 11:35:39 PM WARNING: parquet.hadoop.ParquetOutputCommitter:
> > could not write summary file for active_sessions_current
> > parquet.io.ParquetEncodingException:
> > maprfs:/user/testId/active_sessions_current/part-r-00142.parquet invalid:
> > all the files must be contained in the root active_sessions_current
> >   at parquet.hadoop.ParquetFileWriter.mergeFooters(ParquetFileWriter.java:422)
> >   at parquet.hadoop.ParquetFileWriter.writeMetadataFile(ParquetFileWriter.java:398)
> >   at parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:51)
> >   at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1056)
> >   at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:998)
> >
> >
> >
> > --
> > View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/parquet-io-ParquetEncodingException-Warning-when-trying-to-save-parquet-file-in-Spark-tp25326.html
> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
> >


Re: parquet.io.ParquetEncodingException Warning when trying to save parquet file in Spark

2015-11-09 Thread Fengdong Yu
Which Spark version are you using?

It was fixed in Parquet 1.7.x, so Spark 1.5.x should work.




> On Nov 9, 2015, at 3:43 PM, swetha <swethakasire...@gmail.com> wrote:
> 
> Hi,
> 
> I see an unwanted warning when I try to save a Parquet file to HDFS in Spark.
> Please find the code and the warning message below. Any idea how to
> avoid the warning?
> 
> activeSessionsToBeSaved.saveAsNewAPIHadoopFile("test", classOf[Void],
> classOf[ActiveSession],
>  classOf[ParquetOutputFormat[ActiveSession]], job.getConfiguration)
> 
> Nov 8, 2015 11:35:39 PM WARNING: parquet.hadoop.ParquetOutputCommitter:
> could not write summary file for active_sessions_current
> parquet.io.ParquetEncodingException:
> maprfs:/user/testId/active_sessions_current/part-r-00142.parquet invalid:
> all the files must be contained in the root active_sessions_current
>   at
> parquet.hadoop.ParquetFileWriter.mergeFooters(ParquetFileWriter.java:422)
>   at
> parquet.hadoop.ParquetFileWriter.writeMetadataFile(ParquetFileWriter.java:398)
>   at
> parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:51)
>   at
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1056)
>   at
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:998)
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/parquet-io-ParquetEncodingException-Warning-when-trying-to-save-parquet-file-in-Spark-tp25326.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 





Re: parquet.io.ParquetEncodingException Warning when trying to save parquet file in Spark

2015-11-09 Thread Ted Yu
Please see
https://issues.apache.org/jira/browse/PARQUET-124
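
If you don't need the summary file at all, one possible workaround (my
assumption, not something that JIRA prescribes) is to turn summary metadata
off on the job configuration before saving, extending the snippet below:

// assumes your parquet-hadoop version honours this key; it skips the
// summary-file step in ParquetOutputCommitter.commitJob
job.getConfiguration.setBoolean("parquet.enable.summary-metadata", false)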


> On Nov 8, 2015, at 11:43 PM, swetha <swethakasire...@gmail.com> wrote:
> 
> Hi,
> 
> I see an unwanted warning when I try to save a Parquet file to HDFS in Spark.
> Please find the code and the warning message below. Any idea how to
> avoid the warning?
> 
> activeSessionsToBeSaved.saveAsNewAPIHadoopFile("test", classOf[Void],
> classOf[ActiveSession],
>  classOf[ParquetOutputFormat[ActiveSession]], job.getConfiguration)
> 
> Nov 8, 2015 11:35:39 PM WARNING: parquet.hadoop.ParquetOutputCommitter:
> could not write summary file for active_sessions_current
> parquet.io.ParquetEncodingException:
> maprfs:/user/testId/active_sessions_current/part-r-00142.parquet invalid:
> all the files must be contained in the root active_sessions_current
>   at parquet.hadoop.ParquetFileWriter.mergeFooters(ParquetFileWriter.java:422)
>   at parquet.hadoop.ParquetFileWriter.writeMetadataFile(ParquetFileWriter.java:398)
>   at parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:51)
>   at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1056)
>   at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:998)
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/parquet-io-ParquetEncodingException-Warning-when-trying-to-save-parquet-file-in-Spark-tp25326.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 




parquet.io.ParquetEncodingException Warning when trying to save parquet file in Spark

2015-11-08 Thread swetha
Hi,

I see an unwanted warning when I try to save a Parquet file to HDFS in Spark.
Please find the code and the warning message below. Any idea how to
avoid the warning?

activeSessionsToBeSaved.saveAsNewAPIHadoopFile("test", classOf[Void],
classOf[ActiveSession],
  classOf[ParquetOutputFormat[ActiveSession]], job.getConfiguration)

Nov 8, 2015 11:35:39 PM WARNING: parquet.hadoop.ParquetOutputCommitter:
could not write summary file for active_sessions_current
parquet.io.ParquetEncodingException:
maprfs:/user/testId/active_sessions_current/part-r-00142.parquet invalid:
all the files must be contained in the root active_sessions_current
at
parquet.hadoop.ParquetFileWriter.mergeFooters(ParquetFileWriter.java:422)
at
parquet.hadoop.ParquetFileWriter.writeMetadataFile(ParquetFileWriter.java:398)
at
parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:51)
at
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1056)
at
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:998)



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/parquet-io-ParquetEncodingException-Warning-when-trying-to-save-parquet-file-in-Spark-tp25326.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: Exception on save s3n file (1.4.1, hadoop 2.6)

2015-09-25 Thread Steve Loughran

On 25 Sep 2015, at 03:35, Zhang, Jingyu wrote:


I got the following exception when I run
JavaPairRDD.values().saveAsTextFile("s3n://bucket");
can anyone help me out? Thanks


15/09/25 12:24:32 INFO SparkContext: Successfully stopped SparkContext

Exception in thread "main" java.lang.NoClassDefFoundError: 
org/jets3t/service/ServiceException


You need the same jets3t JAR that ships with Hadoop 2.6 (jets3t 0.9.0) on your
classpath; use --jars.
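
For example (the JAR path and application JAR are placeholders; point --jars at
wherever your Hadoop 2.6 distribution keeps jets3t):

spark-submit --jars /path/to/jets3t-0.9.0.jar your-app.jar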


at 
org.apache.hadoop.fs.s3native.NativeS3FileSystem.createDefaultStore(NativeS3FileSystem.java:338)

at 
org.apache.hadoop.fs.s3native.NativeS3FileSystem.initialize(NativeS3FileSystem.java:328)

at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2596)

at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)

at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)

at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)

at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)

at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)

at 
org.apache.spark.SparkHadoopWriter$.createPathFromString(SparkHadoopWriter.scala:170)




Exception on save s3n file (1.4.1, hadoop 2.6)

2015-09-24 Thread Zhang, Jingyu
I got the following exception when I run
JavaPairRDD.values().saveAsTextFile("s3n://bucket");
can anyone help me out? Thanks


15/09/25 12:24:32 INFO SparkContext: Successfully stopped SparkContext

Exception in thread "main" java.lang.NoClassDefFoundError:
org/jets3t/service/ServiceException

at org.apache.hadoop.fs.s3native.NativeS3FileSystem.createDefaultStore(
NativeS3FileSystem.java:338)

at org.apache.hadoop.fs.s3native.NativeS3FileSystem.initialize(
NativeS3FileSystem.java:328)

at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2596)

at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)

at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)

at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)

at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)

at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)

at
org.apache.spark.SparkHadoopWriter$.createPathFromString(SparkHadoopWriter.scala:170)

at
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply$mcV$sp(PairRDDFunctions.scala:988)

at
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply(PairRDDFunctions.scala:965)

at
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply(PairRDDFunctions.scala:965)

at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)

at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)

at org.apache.spark.rdd.RDD.withScope(RDD.scala:286)

at
org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:965)

at
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply$mcV$sp(PairRDDFunctions.scala:897)

at
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply(PairRDDFunctions.scala:897)

at
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply(PairRDDFunctions.scala:897)

at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)

at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)

at org.apache.spark.rdd.RDD.withScope(RDD.scala:286)

at
org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:896)

at
org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply$mcV$sp(RDD.scala:1404)

at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply(RDD.scala:1383)

at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply(RDD.scala:1383)

at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)

at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)

at org.apache.spark.rdd.RDD.withScope(RDD.scala:286)

at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1383)

at
org.apache.spark.api.java.JavaRDDLike$class.saveAsTextFile(JavaRDDLike.scala:519)

at
org.apache.spark.api.java.AbstractJavaRDDLike.saveAsTextFile(JavaRDDLike.scala:47)

at com.news.report.section.SectionSubSection.run(SectionSubSection.java:184)

at com.news.report.section.SectionSubSection.main(SectionSubSection.java:67)

Caused by: java.lang.ClassNotFoundException:
org.jets3t.service.ServiceException

at java.net.URLClassLoader.findClass(URLClassLoader.java:381)

at java.lang.ClassLoader.loadClass(ClassLoader.java:424)

at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)

at java.lang.ClassLoader.loadClass(ClassLoader.java:357)

... 34 more



save as file

2014-11-11 Thread Naveen Kumar Pokala
Hi,

I am on Spark 1.1.0. I need help with saving an RDD to a JSON file.

How do I do that? And how do I specify an HDFS path in the program?


-Naveen




Re: save as file

2014-11-11 Thread Akhil Das
One approach would be to use saveAsNewAPIHadoopFile and specify a JSON
OutputFormat.

Another simple one would be like:

import scala.util.parsing.json.JSONObject

val rdd = sc.parallelize(1 to 100)
// wrap each element in a one-field map and turn it into a JSON object
val json = rdd.map { x =>
  val m: Map[String, Int] = Map("id" -> x)
  new JSONObject(m)
}

json.saveAsTextFile("output")


Thanks
Best Regards

On Tue, Nov 11, 2014 at 6:28 PM, Naveen Kumar Pokala 
npok...@spcapitaliq.com wrote:

 Hi,

 I am on Spark 1.1.0. I need help with saving an RDD to a JSON file.

 How do I do that? And how do I specify an HDFS path in the program?

 -Naveen







Re: save as file

2014-11-11 Thread Ritesh Kumar Singh
We have RDD.saveAsTextFile and RDD.saveAsObjectFile for saving the output
to any specified location. The params to be provided are:
- the path of the storage location
- the no. of partitions

For an HDFS path we use the following format, as in the sketch below:
/user/user-name/directory-to-store/
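
For example (a sketch; the namenode host and port are assumptions that depend
on your cluster configuration):

rdd.saveAsTextFile("hdfs://namenode:8020/user/user-name/directory-to-store/")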

On Tue, Nov 11, 2014 at 6:28 PM, Naveen Kumar Pokala 
npok...@spcapitaliq.com wrote:

 Hi,

 I am on Spark 1.1.0. I need help with saving an RDD to a JSON file.

 How do I do that? And how do I specify an HDFS path in the program?

 -Naveen