Re: coalesce(1).saveAsTextfile() takes forever?
Another option is to try rdd.toLocalIterator() — not sure if it will help, though. I had the same problem and ended up moving all the part files to local disk (with the Hadoop FileSystem API) and then processing them locally.

On 5 January 2016 at 22:08, Alexander Pivovarov <apivova...@gmail.com> wrote:
> try coalesce(1, true).
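A minimal sketch of the two workarounds mentioned above, assuming a Spark 1.x SQLContext with an existing DataFrame named `sourceFrame`; all paths are placeholders:

```scala
import java.io.PrintWriter
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Option 1: stream rows to the driver one partition at a time and write
// a single local CSV. This avoids the single-partition shuffle, but the
// driver must touch every row, so it is slow for very large data.
val out = new PrintWriter("/tmp/single.csv")
sourceFrame.rdd.toLocalIterator.foreach { row =>
  out.println(row.mkString(","))
}
out.close()

// Option 2: let Spark write with its natural partitioning in parallel,
// then pull the part files down with the Hadoop FileSystem API and
// merge/process them locally.
sourceFrame.write.format("com.databricks.spark.csv").save("/path/hadoop/parts")
val fs = FileSystem.get(new Configuration())
fs.copyToLocalFile(new Path("/path/hadoop/parts"), new Path("/local/parts"))
```

Option 1 trades cluster parallelism for simplicity; option 2 keeps the write parallel and defers the merge to the local machine.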
coalesce(1).saveAsTextfile() takes forever?
Hi, I am trying to save many partitions of a DataFrame into one CSV file, and it takes forever for large data sets of around 5-6 GB.

sourceFrame.coalesce(1).write().format("com.databricks.spark.csv").option("gzip").save("/path/hadoop")

For small data the above code works well, but for large data it hangs forever and does not move on, because only one partition has to shuffle GBs of data. Please help me.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/coalesce-1-saveAsTextfile-takes-forever-tp25886.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
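One common alternative to forcing a single partition inside Spark is to let the job write many part files in parallel and merge them afterward on the command line. A sketch, assuming HDFS and placeholder paths:

```shell
# Write with the DataFrame's natural partitioning (no coalesce), then
# concatenate the part files into one local file. Note: this produces a
# valid single file for plain-text CSV; for gzipped parts, the result is
# a multi-member gzip stream, which most gzip tools can still decompress.
hadoop fs -getmerge /path/hadoop/parts /local/output.csv
```

This keeps the expensive write fully parallel and moves the single-file merge off the cluster entirely.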
Re: coalesce(1).saveAsTextfile() takes forever?
Try coalesce(1, true).

On Tue, Jan 5, 2016 at 11:58 AM, unk1102 <umesh.ka...@gmail.com> wrote:
> hi I am trying to save many partitions of Dataframe into one CSV file and it take forever for large data sets of around 5-6 GB. [...]
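For context, the second argument is the RDD coalesce overload's shuffle flag: with shuffle = true, the upstream stages keep their full parallelism and only the final stage runs with one partition. A sketch, assuming a hypothetical RDD[String] named `lines`:

```scala
// coalesce(1) without shuffle collapses the whole lineage into a single
// task; coalesce(1, shuffle = true) inserts a shuffle boundary so the
// upstream work stays parallel and only the write is single-partition.
val merged = lines.coalesce(1, shuffle = true)
merged.saveAsTextFile("/path/hadoop/single")
```

The shuffle costs extra I/O, but it usually beats computing 5-6 GB of upstream work in one task.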
Re: coalesce(1).saveAsTextfile() takes forever?
Hi, DataFrame has no boolean shuffle option for coalesce — that overload exists only on RDD, I believe.

sourceFrame.coalesce(1, true) // gives a compilation error

On Wed, Jan 6, 2016 at 1:38 AM, Alexander Pivovarov <apivova...@gmail.com> wrote:
> try coalesce(1, true).
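Two ways around that limitation in the Spark 1.x DataFrame API, sketched below under the same assumptions as the original snippet (an existing `sourceFrame` DataFrame, placeholder paths):

```scala
// DataFrame.repartition(1) always shuffles, so it behaves like the
// RDD-level coalesce(1, shuffle = true):
sourceFrame.repartition(1)
  .write.format("com.databricks.spark.csv")
  .save("/path/hadoop/single")

// Or drop down to the RDD, where the boolean overload does exist:
sourceFrame.rdd.map(_.mkString(","))
  .coalesce(1, shuffle = true)
  .saveAsTextFile("/path/hadoop/single-rdd")
```

Note the RDD route loses the CSV writer's quoting/escaping, so the simple `mkString(",")` shown here is only safe when field values contain no commas.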
Re: coalesce(1).saveAsTextfile() takes forever?
Hi Unk1102,

I also had trouble when I used coalesce(); repartition() worked much better. Keep in mind that if you have a large number of partitions you are probably going to have high communication costs.

Also, my code works a lot better on 1.6.0. DataFrame memory could not be spilled in 1.5.2; in 1.6.0, unpersist() actually frees up memory.

Another strange thing I noticed in 1.5.1 was that I had thousands of partitions, and many of them were empty. Having lots of empty partitions really slowed things down.

Andy

From: unk1102 <umesh.ka...@gmail.com>
Date: Tuesday, January 5, 2016 at 11:58 AM
To: "user @spark" <user@spark.apache.org>
Subject: coalesce(1).saveAsTextfile() takes forever?
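The empty-partition point above can be sketched like this — a hedged example assuming a heavily filtered DataFrame and a hypothetical column name:

```scala
// After aggressive filtering, many of the original partitions can end up
// empty, yet Spark still schedules one (useless) task per partition.
val filtered = sourceFrame.filter("someColumn IS NOT NULL")

// Compacting before the write avoids thousands of empty tasks:
//  - coalesce(n) merges partitions without a shuffle (cheap), but can
//    leave the merged partitions unevenly sized;
//  - repartition(n) shuffles, which costs more but rebalances skewed or
//    empty partitions evenly.
filtered.repartition(200)
  .write.format("com.databricks.spark.csv")
  .save("/path/hadoop/out")
```

The target partition count (200 here) is a placeholder; it should be tuned to the data size and cluster, not copied literally.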