Is there a way to save a CSV file fast?
Hi,

I work with PySpark and Spark 1.5.2. Currently, saving an RDD to a CSV file is very slow and uses only 2% CPU. I use:

    my_dd.write.format("com.databricks.spark.csv").option("header", "false").save('file:///my_folder')

Is there a way to save the CSV faster?

Many thanks
Re: Is there a way to save a CSV file fast?
> On 10 Feb 2016, at 10:56, Eli Super <eli.su...@gmail.com> wrote:
>
> Hi,
>
> I work with PySpark and Spark 1.5.2. Currently, saving an RDD to a CSV file is very slow and uses only 2% CPU. I use:
>
>     my_dd.write.format("com.databricks.spark.csv").option("header", "false").save('file:///my_folder')
>
> Is there a way to save the CSV faster?
>
> Many thanks

On a Linux box you can run "iotop" to see what's happening on the disks; it may just be that the disk is the bottleneck.
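For reference, a typical invocation of the tool Steve mentions (assuming iotop is installed; the -o flag limits the output to processes actually doing IO):

    sudo iotop -o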
Re: Is there a way to save a CSV file fast?
Hi,

The writes, in terms of the number of records written simultaneously, can be increased if you increase the number of partitions. You can try increasing the number of partitions and check how it works. There is, though, an upper cap (one that I hit on Ubuntu) on the number of parallel writes you can do, depending on the operating system setup, and when you hit it you will see errors.

By the way, why are you expecting high CPU utilisation for writes? Should it not be more of an IO issue? But I may be wrong.

Regards,
Gourav

On Wed, Feb 10, 2016 at 10:56 AM, Eli Super <eli.su...@gmail.com> wrote:
> Hi,
>
> I work with PySpark and Spark 1.5.2. Currently, saving an RDD to a CSV file is very slow and uses only 2% CPU. I use:
>
>     my_dd.write.format("com.databricks.spark.csv").option("header", "false").save('file:///my_folder')
>
> Is there a way to save the CSV faster?
>
> Many thanks
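For concreteness, a minimal PySpark sketch of the repartitioning Gourav suggests, reusing the my_dd DataFrame from the original post; the partition count of 16 is an illustrative placeholder to tune for your cluster and OS limits, not a recommendation:

    # Increase write parallelism by repartitioning before the save.
    repartitioned = my_dd.repartition(16)
    (repartitioned.write
        .format("com.databricks.spark.csv")
        .option("header", "false")
        .save("file:///my_folder"))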
Re: parquet.io.ParquetEncodingException Warning when trying to save a Parquet file in Spark
Yes, that’s the problem.

http://search.maven.org/#artifactdetails%7Ccom.twitter%7Cparquet-avro%7C1.6.0%7Cjar

This depends on parquet-hadoop-1.6.0, which triggers this bug. Can you change the version to 1.6.0rc7 manually?

> On Nov 9, 2015, at 9:34 PM, swetha kasireddy <swethakasire...@gmail.com> wrote:
>
> I am using the following:
>
> <dependency>
>   <groupId>com.twitter</groupId>
>   <artifactId>parquet-avro</artifactId>
>   <version>1.6.0</version>
> </dependency>
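A sketch of the pom change Fengdong suggests, applied to the dependency from swetha's message; only the version line changes:

    <dependency>
      <groupId>com.twitter</groupId>
      <artifactId>parquet-avro</artifactId>
      <!-- 1.6.0 pulls in parquet-hadoop-1.6.0, which carries the PARQUET-124 bug -->
      <version>1.6.0rc7</version>
    </dependency>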
Re: parquet.io.ParquetEncodingException Warning when trying to save a Parquet file in Spark
I am using the following:

    <dependency>
      <groupId>com.twitter</groupId>
      <artifactId>parquet-avro</artifactId>
      <version>1.6.0</version>
    </dependency>

On Mon, Nov 9, 2015 at 1:00 AM, Fengdong Yu <fengdo...@everstring.com> wrote:
> Which Spark version are you using?
>
> It was fixed in Parquet 1.7.x, so Spark 1.5.x should work.
Re: parquet.io.ParquetEncodingException Warning when trying to save a Parquet file in Spark
Which Spark version are you using?

It was fixed in Parquet 1.7.x, so Spark 1.5.x should work.

> On Nov 9, 2015, at 3:43 PM, swetha <swethakasire...@gmail.com> wrote:
>
> Hi,
>
> I see an unwanted warning when I try to save a Parquet file to HDFS in Spark. Please find below the code and the warning message. Any idea how to avoid the unwanted warning?
>
>     activeSessionsToBeSaved.saveAsNewAPIHadoopFile("test", classOf[Void],
>       classOf[ActiveSession],
>       classOf[ParquetOutputFormat[ActiveSession]], job.getConfiguration)
>
> Nov 8, 2015 11:35:39 PM WARNING: parquet.hadoop.ParquetOutputCommitter: could not write summary file for active_sessions_current
> parquet.io.ParquetEncodingException: maprfs:/user/testId/active_sessions_current/part-r-00142.parquet invalid: all the files must be contained in the root active_sessions_current
Re: parquet.io.ParquetEncodingException Warning when trying to save a Parquet file in Spark
Please see https://issues.apache.org/jira/browse/PARQUET-124

> On Nov 8, 2015, at 11:43 PM, swetha <swethakasire...@gmail.com> wrote:
>
> Hi,
>
> I see an unwanted warning when I try to save a Parquet file to HDFS in Spark. Please find below the code and the warning message. Any idea how to avoid the unwanted warning?
>
>     activeSessionsToBeSaved.saveAsNewAPIHadoopFile("test", classOf[Void],
>       classOf[ActiveSession],
>       classOf[ParquetOutputFormat[ActiveSession]], job.getConfiguration)
>
> Nov 8, 2015 11:35:39 PM WARNING: parquet.hadoop.ParquetOutputCommitter: could not write summary file for active_sessions_current
> parquet.io.ParquetEncodingException: maprfs:/user/testId/active_sessions_current/part-r-00142.parquet invalid: all the files must be contained in the root active_sessions_current
parquet.io.ParquetEncodingException Warning when trying to save a Parquet file in Spark
Hi,

I see an unwanted warning when I try to save a Parquet file to HDFS in Spark. Please find below the code and the warning message. Any idea how to avoid the unwanted warning?

    activeSessionsToBeSaved.saveAsNewAPIHadoopFile("test", classOf[Void],
      classOf[ActiveSession],
      classOf[ParquetOutputFormat[ActiveSession]], job.getConfiguration)

Nov 8, 2015 11:35:39 PM WARNING: parquet.hadoop.ParquetOutputCommitter: could not write summary file for active_sessions_current
parquet.io.ParquetEncodingException: maprfs:/user/testId/active_sessions_current/part-r-00142.parquet invalid: all the files must be contained in the root active_sessions_current
    at parquet.hadoop.ParquetFileWriter.mergeFooters(ParquetFileWriter.java:422)
    at parquet.hadoop.ParquetFileWriter.writeMetadataFile(ParquetFileWriter.java:398)
    at parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:51)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1056)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:998)
Re: Exception on saving s3n file (1.4.1, Hadoop 2.6)
> On 25 Sep 2015, at 03:35, Zhang, Jingyu wrote:
>
> I got the following exception when I run JavaPairRDD.values().saveAsTextFile("s3n://bucket"); can anyone help me out? Thanks.
>
> 15/09/25 12:24:32 INFO SparkContext: Successfully stopped SparkContext
> Exception in thread "main" java.lang.NoClassDefFoundError: org/jets3t/service/ServiceException

You need the same jets3t JAR that ships with Hadoop 2.6 (jets3t 0.9.0) on your classpath; use --jars.

>     at org.apache.hadoop.fs.s3native.NativeS3FileSystem.createDefaultStore(NativeS3FileSystem.java:338)
>     at org.apache.hadoop.fs.s3native.NativeS3FileSystem.initialize(NativeS3FileSystem.java:328)
>     at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2596)
>     at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
>     at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
>     at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
>     at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
>     at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
>     at org.apache.spark.SparkHadoopWriter$.createPathFromString(SparkHadoopWriter.scala:170)
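A minimal sketch of the --jars approach Steve describes; the jar path below is a typical Hadoop 2.6 layout but is an assumption, so substitute wherever jets3t-0.9.0.jar actually lives on your machine, and the application jar name is a placeholder (the main class is taken from the stack trace in this thread):

    # Ship the jets3t jar to the driver and executors at submit time.
    # Path and application jar are placeholders.
    spark-submit \
      --jars /opt/hadoop-2.6.0/share/hadoop/tools/lib/jets3t-0.9.0.jar \
      --class com.news.report.section.SectionSubSection \
      my-app.jar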
Exception on saving s3n file (1.4.1, Hadoop 2.6)
I got the following exception when I run JavaPairRDD.values().saveAsTextFile("s3n://bucket"); can anyone help me out? Thanks.

15/09/25 12:24:32 INFO SparkContext: Successfully stopped SparkContext
Exception in thread "main" java.lang.NoClassDefFoundError: org/jets3t/service/ServiceException
    at org.apache.hadoop.fs.s3native.NativeS3FileSystem.createDefaultStore(NativeS3FileSystem.java:338)
    at org.apache.hadoop.fs.s3native.NativeS3FileSystem.initialize(NativeS3FileSystem.java:328)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2596)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
    at org.apache.spark.SparkHadoopWriter$.createPathFromString(SparkHadoopWriter.scala:170)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply$mcV$sp(PairRDDFunctions.scala:988)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply(PairRDDFunctions.scala:965)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply(PairRDDFunctions.scala:965)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:286)
    at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:965)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply$mcV$sp(PairRDDFunctions.scala:897)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply(PairRDDFunctions.scala:897)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply(PairRDDFunctions.scala:897)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:286)
    at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:896)
    at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply$mcV$sp(RDD.scala:1404)
    at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply(RDD.scala:1383)
    at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply(RDD.scala:1383)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:286)
    at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1383)
    at org.apache.spark.api.java.JavaRDDLike$class.saveAsTextFile(JavaRDDLike.scala:519)
    at org.apache.spark.api.java.AbstractJavaRDDLike.saveAsTextFile(JavaRDDLike.scala:47)
    at com.news.report.section.SectionSubSection.run(SectionSubSection.java:184)
    at com.news.report.section.SectionSubSection.main(SectionSubSection.java:67)
Caused by: java.lang.ClassNotFoundException: org.jets3t.service.ServiceException
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    ... 34 more
save as file
Hi,

I am on Spark 1.1.0. I need help with saving an RDD to a JSON file. How do I do that? And how do I mention an HDFS path in the program?

-Naveen
Re: save as file
One approach would be to use saveAsNewAPIHadoopFile and specify a JSON output format. Another simple one would be like:

    import scala.util.parsing.json.JSONObject

    val rdd = sc.parallelize(1 to 100)
    // Wrap each record in a Map and emit it as a JSON object;
    // saveAsTextFile then writes one JSON string per line.
    val json = rdd.map(x => {
      val m: Map[String, Int] = Map("id" -> x)
      new JSONObject(m)
    })
    json.saveAsTextFile("output")

Thanks
Best Regards

On Tue, Nov 11, 2014 at 6:28 PM, Naveen Kumar Pokala <npok...@spcapitaliq.com> wrote:
> Hi,
>
> I am on Spark 1.1.0. I need help with saving an RDD to a JSON file. How do I do that? And how do I mention an HDFS path in the program?
>
> -Naveen
Re: save as file
We have RDD.saveAsTextFile and RDD.saveAsObjectFile for saving the output to any specified location. The params to be provided are:

- the path of the storage location
- the number of partitions

For giving an HDFS path we use the following format: /user/user-name/directory-to-store/

On Tue, Nov 11, 2014 at 6:28 PM, Naveen Kumar Pokala <npok...@spcapitaliq.com> wrote:
> Hi,
>
> I am on Spark 1.1.0. I need help with saving an RDD to a JSON file. How do I do that? And how do I mention an HDFS path in the program?
>
> -Naveen
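A minimal Scala sketch of both calls with an HDFS path; the namenode address and paths are placeholders, and a bare path (no scheme) resolves against the default filesystem configured in core-site.xml (fs.defaultFS):

    val rdd = sc.parallelize(1 to 100)
    // Fully qualified HDFS URI:
    rdd.saveAsTextFile("hdfs://namenode:8020/user/user-name/directory-to-store")
    // Bare path, resolved against the cluster's default filesystem:
    rdd.saveAsObjectFile("/user/user-name/objects-to-store")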