saveAsTextFile

2015-04-16 Thread Vadim Bichutskiy
I am using Spark Streaming where during each micro-batch I output data to S3 using saveAsTextFile. Right now each batch of data is put into its own directory containing 2 objects, "_SUCCESS" and "part-0." How do I output each batch into a common directory? Thanks, Vadim
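For context, a minimal Scala sketch of the setup being described (bucket name and source are hypothetical): DStream.saveAsTextFiles writes each micro-batch to its own directory named &lt;prefix&gt;-&lt;batchTimeMillis&gt;, which is the per-batch layout described above.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("S3BatchOutput")
    val ssc = new StreamingContext(conf, Seconds(60))
    val lines = ssc.socketTextStream("localhost", 9999) // stand-in source
    // Each batch lands in its own directory, e.g. s3n://my-bucket/out-<ts>,
    // containing _SUCCESS and part-* files.
    lines.saveAsTextFiles("s3n://my-bucket/out")
    ssc.start()
    ssc.awaitTermination()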

Re: saveAsTextFile

2015-01-03 Thread Pankaj Narang
If you can paste the code here I can certainly help. Also confirm the version of Spark you are using. Regards, Pankaj Infoshore Software India -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/saveAsTextFile-tp20951p20953.html Sent from the Apache Spark

Re: saveAsTextFile

2015-01-03 Thread Sanjay Subramanian
@laila Based on the error you mentioned in the Nabble link below, it seems like there are no permissions to write to HDFS. So this is possibly why saveAsTextFile is failing. From: Pankaj Narang To: user@spark.apache.org Sent: Saturday, January 3, 2015 4:07 AM Subject: Re

Re: saveAsTextFile

2015-01-15 Thread Prannoy
certainly help. > > Also confirm the version of spark you are using > > Regards > Pankaj > Infoshore Software > India > > -- > If you reply to this email, your message will be added to the discussion > below: > > http://apache-spar

Re: saveAsTextFile

2015-01-15 Thread ankits
I have seen this happen when the RDD contains null values. Essentially, saveAsTextFile calls toString() on the elements of the RDD, so a call to null.toString will result in an NPE. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/saveAsTextFile
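A hedged sketch of the workaround this implies — drop null elements before saving (RDD name hypothetical):

    // saveAsTextFile writes each element's toString(), so a null element
    // triggers a NullPointerException; filter nulls out first.
    val cleaned = records.filter(_ != null) // records: RDD[String]
    cleaned.saveAsTextFile("hdfs:///output/no-nulls")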

Re: saveAsTextFile

2015-04-16 Thread Sean Owen
skiy wrote: > I am using Spark Streaming where during each micro-batch I output data to S3 > using > saveAsTextFile. Right now each batch of data is put into its own directory > containing > 2 objects, "_SUCCESS" and "part-0." > > How do I

Re: saveAsTextFile

2015-04-16 Thread Vadim Bichutskiy
;, which are really directories containing > partitions, as is common in Hadoop. You can move them later, or just > read them where they are. > > On Thu, Apr 16, 2015 at 6:32 PM, Vadim Bichutskiy > wrote: >> I am using Spark Streaming where during each micro-batch I output data

RE: saveAsTextFile

2015-04-16 Thread Evo Eftimov
HDFS adapter and invoke it in foreachRDD and foreach Regards Evo Eftimov From: Vadim Bichutskiy [mailto:vadim.bichuts...@gmail.com] Sent: Thursday, April 16, 2015 6:33 PM To: user@spark.apache.org Subject: saveAsTextFile I am using Spark Streaming where during each micro-batch I
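A sketch of the foreachRDD approach suggested here (paths hypothetical). Saving each batch yourself gives you full control over the directory layout:

    dstream.foreachRDD { (rdd, time) =>
      if (!rdd.isEmpty()) {
        // One subdirectory per batch under a common parent; a custom
        // adapter could instead manage and append to its own files.
        rdd.saveAsTextFile(s"s3n://my-bucket/common/batch-${time.milliseconds}")
      }
    }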

RE: saveAsTextFile

2015-04-16 Thread Evo Eftimov
Nope Sir, it is possible - check my reply earlier -Original Message- From: Sean Owen [mailto:so...@cloudera.com] Sent: Thursday, April 16, 2015 6:35 PM To: Vadim Bichutskiy Cc: user@spark.apache.org Subject: Re: saveAsTextFile You can't, since that's how it's designed t

RE: saveAsTextFile

2015-04-16 Thread Evo Eftimov
@spark.apache.org Subject: Re: saveAsTextFile Thanks Sean. I want to load each batch into Redshift. What's the best/most efficient way to do that? Vadim > On Apr 16, 2015, at 1:35 PM, Sean Owen wrote: > > You can't, since that's how it's designed to work. Batches are saved >

Re: saveAsTextFile

2015-04-16 Thread Vadim Bichutskiy
im.bichuts...@gmail.com] > Sent: Thursday, April 16, 2015 6:33 PM > To: user@spark.apache.org > Subject: saveAsTextFile > > I am using Spark Streaming where during each micro-batch I output data to S3 > using > saveAsTextFile. Right now each batch of data is put into its own

Re: saveAsTextFile

2015-04-16 Thread Sean Owen
d >> in different "files", which are really directories containing >> partitions, as is common in Hadoop. You can move them later, or just >> read them where they are. >> >> On Thu, Apr 16, 2015 at 6:32 PM, Vadim Bichutskiy >> wrote: >>> I am using Spark

RE: saveAsTextFile

2015-04-16 Thread Evo Eftimov
files and directories From: Vadim Bichutskiy [mailto:vadim.bichuts...@gmail.com] Sent: Thursday, April 16, 2015 6:45 PM To: Evo Eftimov Cc: Subject: Re: saveAsTextFile Thanks Evo for your detailed explanation. On Apr 16, 2015, at 1:38 PM, Evo Eftimov wrote: The reason for this is

Re: saveAsTextFile

2015-04-16 Thread Vadim Bichutskiy
in different "files", which are really directories containing >>> partitions, as is common in Hadoop. You can move them later, or just >>> read them where they are. >>> >>> On Thu, Apr 16, 2015 at 6:32 PM, Vadim Bichutskiy >>> wrote: >>>

Re: saveAsTextFile

2014-08-10 Thread durin
will be created and must therefore not exist before. Best regards, Simon -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/saveAsTextFile-tp11803p11846.html Sent from the Apache Spark User List mailing l

JavaPairDStream saveAsTextFile

2014-10-08 Thread SA
Hi, I am looking at the documentation for the Java API for streams. The Scala library has an option to save the file locally, but the Java version doesn't seem to. The only option I see is saveAsHadoopFiles. Is there a reason why this option was left out of the Java API? http://spark.apache.org/docs/1.0.0/a

saveAsTextFile error

2014-11-14 Thread Niko Gamulin
r: [error] /home/bart/spark-1.1.0/examples/src/main/scala/org/apache/spark/examples/streaming/NetworkWordCountModified.scala:57: value saveAsTextFile is not a member of org.apache.spark.streaming.dstream.DStream[(String, Int)] [error] wordCounts.saveAsTextFile("/home/bart/rest_services/outp
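The likely fix (hedged): DStream exposes saveAsTextFiles (plural), which takes a prefix and suffix, rather than the RDD method saveAsTextFile. A sketch with a hypothetical output prefix:

    // Writes each batch to a <prefix>-<batchTimeMillis>.txt directory.
    wordCounts.saveAsTextFiles("/tmp/wordcounts", "txt")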

Improve saveAsTextFile performance

2015-12-02 Thread Ram VISWANADHA
JavaRDD.saveAsTextFile is taking a long time to succeed. There are 10 tasks, the first 9 complete in a reasonable time but the last task is taking a long time to complete. The last task contains the maximum number of records like 90% of the total number of records. Is there any way to paralleli
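A hedged sketch of one mitigation: when a single partition holds most of the records, repartitioning (a full shuffle) before the save spreads the write across tasks. The partition count is illustrative:

    rdd.repartition(200).saveAsTextFile("hdfs:///output/balanced")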

FP Growth saveAsTextFile

2015-05-20 Thread Eric Tanner
I am having trouble with saving an FP-Growth model as a text file. I can print out the results, but when I try to save the model I get a NullPointerException. model.freqItemsets.saveAsTextFile("c://fpGrowth/model") Thanks, Eric

PySpark saveAsTextFile gzip

2015-01-15 Thread Tom Seddon
Hi, I've searched but can't seem to find a PySpark example. How do I write compressed text file output to S3 using PySpark saveAsTextFile? Thanks, Tom
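For reference, a sketch in Scala (bucket hypothetical): saveAsTextFile takes a compression codec class as an optional second argument. PySpark appears to expose the same option through a compressionCodecClass string parameter; check the docs for your version.

    import org.apache.hadoop.io.compress.GzipCodec
    rdd.saveAsTextFile("s3n://my-bucket/out-gz", classOf[GzipCodec])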

NPE using saveAsTextFile

2014-04-08 Thread Nick Pentreath
Hi I'm using Spark 0.9.0. When calling saveAsTextFile on a custom hadoop inputformat (loaded with newAPIHadoopRDD), I get the following error below. If I call count, I get the correct count of number of records, so the inputformat is being read correctly... the issue only appears when tryi

java.lang.OutOfMemoryError with saveAsTextFile

2014-06-18 Thread Muttineni, Vinay
Hi, I have a 5 million record, 300 column data set. I am running a spark job in yarn-cluster mode, with the following args --driver-memory 11G --executor-memory 11G --executor-cores 16 --num-executors 500 The spark job replaces all categorical variables with some integers. I am getting the below

Pyspark saveAsTextFile exceptions

2015-03-13 Thread Madabhattula Rajesh Kumar
Hi Team, I'm getting below exception for saving the results into hadoop. *Code :* rdd.saveAsTextFile("hdfs://localhost:9000/home/rajesh/data/result.rdd") Could you please help me how to resolve this issue. 15/03/13 17:19:31 INFO spark.SparkContext: Starting job: sa

Re: JavaPairDStream saveAsTextFile

2014-10-08 Thread Sean Owen
Yeah it's not there. I imagine it was simply never added, and that there's not a good reason it couldn't be. On Thu, Oct 9, 2014 at 4:53 AM, SA wrote: > HI, > > I am looking at the documentation for Java API for Streams. The scala > library has option to save file locally, but the Java version d

Re: JavaPairDStream saveAsTextFile

2014-10-08 Thread Mayur Rustagi
That's a cryptic way to say there should be a Jira for it :) Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi On Thu, Oct 9, 2014 at 11:46 AM, Sean Owen wrote: > Yeah it's not there. I imagine it was simply never added, and tha

Re: saveAsTextFile error

2014-11-14 Thread Harold Nguyen
en I ran sbt/sbt package it returned the following error: > > [error] > /home/bart/spark-1.1.0/examples/src/main/scala/org/apache/spark/examples/streaming/NetworkWordCountModified.scala:57: > value saveAsTextFile is not a member of > org.apache.spark.streaming.dstream.DStream[(String

Re: saveAsTextFile error

2014-11-15 Thread Prannoy
Hi Niko, Have you tried running it keeping the wordCounts.print()? Possibly the import of the package *org.apache.spark.streaming._* is missing, so during sbt package it is unable to locate the saveAsTextFile API. Go to https://github.com/apache/spark/blob/master/examples/src/main/scala/org

Appending with saveAsTextFile?

2014-11-29 Thread YaoPau
actually from the 1:00pm-2:00pm timespan. But 0.1% will be data that, for one of several reasons, trickled in from the 12:00pm hour or even earlier. What I'd like to do is split my RDD by timestamp into several RDDs, then use saveAsTextFile() to write each RDD to its proper location on disk. So
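A sketch of the split-then-save idea (key extraction and paths hypothetical). Note it filters the cached RDD once per distinct key:

    val keyed = events.keyBy(line => hourOf(line)) // hourOf is a stand-in
    keyed.cache()                                  // avoid recomputing per key
    val hours = keyed.keys.distinct().collect()
    hours.foreach { h =>
      keyed.filter(_._1 == h).values.saveAsTextFile(s"hdfs:///logs/hour=$h")
    }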

Exception during SaveAstextFile Stage

2015-09-24 Thread Chirag Dewan
Hi, I have 2 stages in my job, map and saveAsTextFile. During the save-as-text-file stage I am getting an exception: 15/09/24 15:38:16 WARN AkkaUtils: Error sending message in 1 attempts java.util.concurrent.TimeoutException: Futures timed out after [30 seconds] at scala.concurrent.impl.

Re: Improve saveAsTextFile performance

2015-12-02 Thread Ted Yu
Have you tried calling coalesce() before saveAsTextFile ? Cheers On Wed, Dec 2, 2015 at 3:15 PM, Ram VISWANADHA < ram.viswana...@dailymotion.com> wrote: > JavaRDD.saveAsTextFile is taking a long time to succeed. There are 10 > tasks, the first 9 complete in a reasonable time but t

Re: Improve saveAsTextFile performance

2015-12-02 Thread Ram VISWANADHA
Yes. That did not help. Best Regards, Ram From: Ted Yu <yuzhih...@gmail.com> Date: Wednesday, December 2, 2015 at 3:25 PM To: Ram VISWANADHA <ram.viswana...@dailymotion.com> Cc: user <user@spark.apache.org> Subject: Re: Improve saveAsTextFile performan

Re: Improve saveAsTextFile performance

2015-12-02 Thread Sahil Sareen
From: Ted Yu > Date: Wednesday, December 2, 2015 at 3:25 PM > To: Ram VISWANADHA > Cc: user > Subject: Re: Improve saveAsTextFile performance > > Have you tried calling coalesce() before saveAsTextFile ? > > Cheers > > On Wed, Dec 2, 2015 at 3:15 PM, Ram V

Re: Improve saveAsTextFile performance

2015-12-04 Thread Ram VISWANADHA
That didn’t work :( Any help? I have documented some steps here. http://stackoverflow.com/questions/34048340/spark-saveastextfile-last-stage-almost-never-finishes Best Regards, Ram From: Sahil Sareen <sareen...@gmail.com> Date: Wednesday, December 2, 2015 at 10:18 PM To: Ram VISW

Re: Improve saveAsTextFile performance

2015-12-05 Thread Akhil Das
the partitions. Thanks Best Regards On Sat, Dec 5, 2015 at 8:24 AM, Ram VISWANADHA < ram.viswana...@dailymotion.com> wrote: > That didn’t work :( > Any help I have documented some steps here. > > http://stackoverflow.com/questions/34048340/spark-saveastextfile-last-stage-alm

Re: Improve saveAsTextFile performance

2015-12-05 Thread Ram VISWANADHA
0 8 3.9 MB / 95334 Best Regards, Ram From: Akhil Das <ak...@sigmoidanalytics.com> Date: Saturday, December 5, 2015 at 1:32 AM To: Ram VISWANADHA <ram.viswana...@dailymotion.com> Cc: user <user@spark.apache.org> Subject: Re: Improve saveAsTex

Re: Improve saveAsTextFile performance

2015-12-05 Thread Ram VISWANADHA
, Ram -- Date: Saturday, December 5, 2015 at 7:18 AM To: Akhil Das <ak...@sigmoidanalytics.com> Cc: user <user@spark.apache.org> Subject: Re: Improve saveAsTextFile performance >If you are doing a join/groupBy kind of operations then you need to make sure >the keys are

Re: Improve saveAsTextFile performance

2015-12-07 Thread Akhil Das
3b#file-saveasparquet-java-L80 > > Best Regards, > Ram > -- > > Date: Saturday, December 5, 2015 at 7:18 AM > To: Akhil Das > > Cc: user > Subject: Re: Improve saveAsTextFile performance > > >If you are doing a join/groupBy kind of operations then you need t

Spark saveAsTextFile Disk Recommendation

2021-03-20 Thread Ranju Jain
Hi All, I have a large RDD dataset of around 60-70 GB which I cannot send to the driver using collect, so I first write it to disk using saveAsTextFile; then this data gets saved in the form of multiple part files on each node of the cluster, and after that the driver reads the data from that

Spark saveAsTextFile Disk Recommendation

2021-03-21 Thread ranju goel
just checked the 2nd argument of saveAsTextFile and I believe read and write will be faster on disk after use of compression. I will try this. So I think there is no special requirement on the type of disk for execution of saveAsTextFile, as these are local I/O operations. Regards Ranju

Re: FP Growth saveAsTextFile

2015-05-20 Thread Xiangrui Meng
Could you post the stack trace? If you are using Spark 1.3 or 1.4, it would be easier to save freq itemsets as a Parquet file. -Xiangrui On Wed, May 20, 2015 at 12:16 PM, Eric Tanner wrote: > I am having trouble with saving an FP-Growth model as a text file. I can > print out the results, but wh

Re: FP Growth saveAsTextFile

2015-05-20 Thread Xiangrui Meng
at this. > > scala> > model.freqItemsets.saveAsTextFile("c:///repository/trunk/Scala_210_wspace/fpGrowth/modelText1") > 15/05/20 14:07:47 INFO SparkContext: Starting job: saveAsTextFile at > :33 > 15/05/20 14:07:47 INFO DAGScheduler: Got job 15 (saveAsTextFile at > :33)

Re: PySpark saveAsTextFile gzip

2015-01-15 Thread Akhil Das
25> to use the API. Thanks Best Regards On Thu, Jan 15, 2015 at 5:16 PM, Tom Seddon wrote: > Hi, > > I've searched but can't seem to find a PySpark example. How do I write > compressed text file output to S3 using PySpark saveAsTextFile? > > Thanks, > > Tom >

SaveAsTextFile to S3 bucket

2015-01-26 Thread Chen, Kevin
the bucket //nexgen-software write permission, I don't get an exception. But the output is not created under dev. Rather, a different /dev/output directory is created directly in the bucket (//nexgen-software). Is this how saveAsTextFile behaves in S3? Is there any way I can have output cr

ClassCastException when using saveAsTextFile

2014-03-25 Thread Niko Stahl
Hi, I'm trying to save an RDD to HDFS with the saveAsTextFile method on my ec2 cluster and am encountering the following exception (the app is called GraphTest): Exception failure: java.lang.ClassCastException: cannot assign instance of GraphTest$$anonfun$3 to

Re: NPE using saveAsTextFile

2014-04-09 Thread Nick Pentreath
4:50 PM, Nick Pentreath wrote: > Hi > > I'm using Spark 0.9.0. > > When calling saveAsTextFile on a custom hadoop inputformat (loaded with > newAPIHadoopRDD), I get the following error below. > > If I call count, I get the correct count of number of records, so the >

Re: NPE using saveAsTextFile

2014-04-09 Thread Matei Zaharia
just can't save it as text file. > > > > > On Tue, Apr 8, 2014 at 4:50 PM, Nick Pentreath > wrote: > Hi > > I'm using Spark 0.9.0. > > When calling saveAsTextFile on a custom hadoop inputformat (loaded with > newAPIHadoopRDD), I get the foll

Re: NPE using saveAsTextFile

2014-04-09 Thread Nick Pentreath
at. But as I say, I can parse the data (count, first() etc). I > just can't save it as text file. > > > > > On Tue, Apr 8, 2014 at 4:50 PM, Nick Pentreath > wrote: > >> Hi >> >> I'm using Spark 0.9.0. >> >> When calling saveAsTextFi

Re: NPE using saveAsTextFile

2014-04-10 Thread Nick Pentreath
elasticsearch-hadoop plugin >> for ESInputFormat. But as I say, I can parse the data (count, first() etc). >> I just can't save it as text file. >> >> >> >> >> On Tue, Apr 8, 2014 at 4:50 PM, Nick Pentreath >> wrote: >> >>> Hi &g

saveAsTextFile hangs with hdfs

2014-08-19 Thread David
I have a simple spark job that seems to hang when saving to hdfs. When looking at the spark web ui, the job reached 97 of 100 tasks completed. I need some help determining why the job appears to hang. The job hangs on the "saveAsTextFile()" call. https://www.dropbox.com/s/fdp7c

OOM with groupBy + saveAsTextFile

2014-11-01 Thread Bharath Ravi Kumar
Hi, I'm trying to run groupBy(function) followed by saveAsTextFile on an RDD of count ~ 100 million. The data size is 20GB and groupBy results in an RDD of 1061 keys with values being Iterable>. The job runs on 3 hosts in a standalone setup with each host's executor having 100G RAM

coalesce(1).saveAsTextfile() takes forever?

2016-01-05 Thread unk1102
ll but for large data it hangs forever does not move on because of only one partitions has to shuffle data of GBs please help me -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/coalesce-1-saveAsTextfile-takes-forever-tp25886.html Sent from the Apache Spark Use

Combining Spark Files with saveAsTextFile

2015-08-04 Thread Brandon White
What is the best way to make saveAsTextFile save as only a single file?

EC2 cluster doesn't work saveAsTextFile

2015-08-10 Thread Yasemin Kaya
Hi, I have an EC2 cluster, and am using Spark 1.3, YARN and HDFS. When I submit locally there is no problem, but when I run on the cluster, saveAsTextFile doesn't work. It gives me "User class threw exception: Output directory hdfs://172.31.42.10:54310/./weblogReadResult <http://172.3

Re: Spark saveAsTextFile Disk Recommendation

2021-03-20 Thread Attila Zsolt Piros
Hi! I would like to reflect only on the first part of your mail: I have a large RDD dataset of around 60-70 GB which I cannot send to driver > using *collect* so first writing that to disk using *saveAsTextFile* and > then this data gets saved in the form of multiple part files on eac

RE: Spark saveAsTextFile Disk Recommendation

2021-03-21 Thread Ranju Jain
csv file. This script runs on every node and later they all combine into a single file. On the other hand: is your data really just a collection of strings without any repetitions? [Ranju]: Yes, it is a comma-separated string. And I just checked the 2nd argument of saveAsTextFile and I believe read and

saveAsTextFile() part- files are missing

2015-05-21 Thread rroxanaioana
oing wrong? Thank you! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/saveAsTextFile-part-files-are-missing-tp22974.html Sent from the Apache Spark User List mailing list archive at Nabble.com. --

RDD saveAsTextFile() to local disk

2015-07-08 Thread Vijay Pawnarkar
Getting an exception when writing an RDD to local disk using the following function saveAsTextFile("file:home/someuser/dir2/testupload/20150708/") The dir (/home/someuser/dir2/testupload/) was created before running the job. The error message is misleading. org.apache.spark.SparkExce

Re: SaveAsTextFile to S3 bucket

2015-01-26 Thread Nick Pentreath
ResponseCode=403, ResponseMessage=Forbidden > > > I have verified /dev has write permission. However, if I grant the > bucket //nexgen-software write permission, I don't get exception. But the > output is not created under dev. Rather, a different /dev/output directory > is c

Re: SaveAsTextFile to S3 bucket

2015-01-26 Thread Ashish Rangole
By default, the files will be created under the path provided as the argument for saveAsTextFile. This argument is considered as a folder in the bucket and the actual files are created in it with the naming convention part-n, where n is the index of the output partition. On Mon, Jan 26, 2015 at

Re: SaveAsTextFile to S3 bucket

2015-01-26 Thread Chen, Kevin
to have output directory created under dev directory I created upfront. From: Nick Pentreath <nick.pentre...@gmail.com> Date: Monday, January 26, 2015 9:15 PM To: "user@spark.apache.org" Subject: Re:

Re: SaveAsTextFile to S3 bucket

2015-01-27 Thread Thomas Demoor
is possible to have output directory created under dev directory I > created upfront. > > From: Nick Pentreath > Date: Monday, January 26, 2015 9:15 PM > To: "user@spark.apache.org" > Subject: Re: SaveAsTextFile to S3 bucket > > Your output folder specifies &g

saveAsTextFile() failing for large datasets

2014-03-19 Thread Soila Pertet Kavulya
the cluster. Wordcount fails due to connection errors during saveAsTextFile() when the input size is 1TB. I have tried experimenting with different timeouts, and akka frame sizes but the job is still failing. Are there any changes that I should make to get the job to run successfully? Here is my

Re: ClassCastException when using saveAsTextFile

2014-03-25 Thread Niko Stahl
AsTextFile("hdfs://" + masterDomain + ":9000/user/root/" + "test_dir") Even this simple mapping give me a java.lang.ClassCastException. Sorry, my knowledge of Scala is very rudimentary. Thanks, Niko On Tue, Mar 25, 2014 at 5:55 PM, Niko Stahl wrote: > Hi, > &

Re: ClassCastException when using saveAsTextFile

2014-06-04 Thread Kanwaldeep
using-saveAsTextFile-tp3206p7018.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: ClassCastException when using saveAsTextFile

2014-06-05 Thread Anwar Rizal
uot; + masterDomain + > ":9000/user/root/" + "test_dir") > > Even this simple mapping give me a java.lang.ClassCastException. Sorry, > my knowledge of Scala is very rudimentary. > > Thanks, > Niko > > > On Tue, Mar 25, 2014 at 5:55 PM, Niko Stahl

saveAsTextFile extremely slow near finish

2015-03-09 Thread mingweili0x
alues values.saveAsTextFile Input size is ~ 160G, and I made 1000 partitions specified in JavaSparkContext.textFile and JavaPairRDD.sortByKey. From WebUI, the job is splitted into two stages: saveAsTextFile and mapToPair. MapToPair finished in 8 mins. While saveAsTextFile took ~15mins to reach (2366/2373) progress an

Re: saveAsTextFile hangs with hdfs

2014-08-19 Thread evadnoob
update: hangs even when not writing to hdfs. I changed the code to avoid saveAsTextFile() and instead do a foreachPartition and log the results. This time it hangs at 96/100 tasks. I changed the saveAsTextFile to: stringIntegerJavaPairRDD.foreachPartition(p

Re: saveAsTextFile hangs with hdfs

2014-08-19 Thread evadnoob
34:08 INFO ConnectionManager: Handling connection error on connection to ConnectionManagerId(localhost,39840) 14/08/19 20:34:08 INFO ConnectionManager: Removing SendingConnection to ConnectionManagerId(localhost,39840) -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/save

Re: saveAsTextFile hangs with hdfs

2014-08-26 Thread Burak Yavuz
t; Sent: Tuesday, August 19, 2014 1:44:18 PM Subject: saveAsTextFile hangs with hdfs I have a simple spark job that seems to hang when saving to hdfs. When looking at the spark web ui, the job reached 97 of 100 tasks completed. I need some help determining why the job appears to hang. The j

Re: OOM with groupBy + saveAsTextFile

2014-11-01 Thread Bharath Ravi Kumar
Minor clarification: I'm running spark 1.1.0 on JDK 1.8, Linux 64 bit. On Sun, Nov 2, 2014 at 1:06 AM, Bharath Ravi Kumar wrote: > Hi, > > I'm trying to run groupBy(function) followed by saveAsTextFile on an RDD > of count ~ 100 million. The data size is 20GB and groupBy

Re: OOM with groupBy + saveAsTextFile

2014-11-01 Thread Bharath Ravi Kumar
rying to run groupBy(function) followed by saveAsTextFile on an RDD > of count ~ 100 million. The data size is 20GB and groupBy results in an RDD > of 1061 keys with values being Iterable String>>. The job runs on 3 hosts in a standalone setup with each host's > executor having

Re: OOM with groupBy + saveAsTextFile

2014-11-01 Thread arthur.hk.c...@gmail.com
: > Hi, > > I'm trying to run groupBy(function) followed by saveAsTextFile on an RDD of > count ~ 100 million. The data size is 20GB and groupBy results in an RDD of > 1061 keys with values being Iterable String>>. The job runs on 3 hosts in a standalone setup with each host

Re: OOM with groupBy + saveAsTextFile

2014-11-01 Thread Reynold Xin
houldn't be the norm for a common groupby / > sort use case in a framework that is leading in sorting bench marks? Or is > there something fundamentally wrong in the usage? > On 02-Nov-2014 1:06 am, "Bharath Ravi Kumar" wrote: > >> Hi, >> >> I'm try

Re: OOM with groupBy + saveAsTextFile

2014-11-02 Thread Bharath Ravi Kumar
Ravi Kumar > wrote: > > Resurfacing the thread. Oom shouldn't be the norm for a common groupby / > sort use case in a framework that is leading in sorting bench marks? Or is > there something fundamentally wrong in the usage? > On 02-Nov-2014 1:06 am, "Bharath Ravi Kumar&

Re: OOM with groupBy + saveAsTextFile

2014-11-02 Thread Sean Owen
saveAsText means "save every element of the RDD as one line of text". It works like TextOutputFormat in Hadoop MapReduce since that's what it uses. So you are causing it to create one big string out of each Iterable this way. On Sun, Nov 2, 2014 at 4:48 PM, Bharath Ravi Kumar wrote: > Thanks for
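A hedged rework of the save that follows from this: emit one output line per value instead of one huge string per key, so no single record has to fit in memory at once.

    // grouped: RDD[(K, Iterable[V])] from the groupBy (names hypothetical)
    grouped
      .flatMap { case (k, vs) => vs.iterator.map(v => s"$k\t$v") }
      .saveAsTextFile("hdfs:///output/flattened")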

Re: OOM with groupBy + saveAsTextFile

2014-11-03 Thread Bharath Ravi Kumar
The result was no different with saveAsHadoopFile. In both cases, I can see that I've misinterpreted the API docs. I'll explore the API's a bit further for ways to save the iterable as chunks rather than one large text/binary. It might also help to clarify this aspect in the API docs. For those (li

Re: OOM with groupBy + saveAsTextFile

2014-11-03 Thread Bharath Ravi Kumar
I also realized from your description of saveAsText that the API is indeed behaving as expected i.e. it is appropriate (though not optimal) for the API to construct a single string out of the value. If the value turns out to be large, the user of the API needs to reconsider the implementation appro

Re: OOM with groupBy + saveAsTextFile

2014-11-03 Thread Sean Owen
Yes, that's the same thing really. You're still writing a huge value as part of one single (key,value) record. The value exists in memory in order to be written to storage. Although there aren't hard limits, in general, keys and values aren't intended to be huge, like, hundreds of megabytes. You s

Re: coalesce(1).saveAsTextfile() takes forever?

2016-01-05 Thread Alexander Pivovarov
n context: > http://apache-spark-user-list.1001560.n3.nabble.com/coalesce-1-saveAsTextfile-takes-forever-tp25886.html > Sent from the Apache Spark User List mailing list archive at Nabble.com. > > - > To unsubs

Re: coalesce(1).saveAsTextfile() takes forever?

2016-01-05 Thread Igor Berman
t;> >> >> -- >> View this message in context: >> http://apache-spark-user-list.1001560.n3.nabble.com/coalesce-1-saveAsTextfile-takes-forever-tp25886.html >> Sent from the Apache Spark User List mailing list archive at Nabble.com. >> >> -

Re: coalesce(1).saveAsTextfile() takes forever?

2016-01-05 Thread Umesh Kacha
") >> >> For small data above code works well but for large data it hangs forever >> does not move on because of only one partitions has to shuffle data of GBs >> please help me >> >> >> >> -- >> View this message in context: >> http:/

Re: coalesce(1).saveAsTextfile() takes forever?

2016-01-05 Thread Andy Davidson
coalesce(1).saveAsTextfile() takes forever? > hi I am trying to save many partitions of Dataframe into one CSV file and it > take forever for large data sets of around 5-6 GB. > > sourceFrame.coalesce(1).write().format("com.databricks.spark.csv").option("gzi > p&qu

RE: Combining Spark Files with saveAsTextFile

2015-08-04 Thread Mohammed Guller
One option is to use the coalesce method in the RDD class. Mohammed From: Brandon White [mailto:bwwintheho...@gmail.com] Sent: Tuesday, August 4, 2015 7:23 PM To: user Subject: Combining Spark Files with saveAsTextFile What is the best way to make saveAsTextFile save as only a single file?

RE: Combining Spark Files with saveAsTextFile

2015-08-04 Thread Mohammed Guller
Just to further clarify, you can first call coalesce with argument 1 and then call saveAsTextFile. For example, rdd.coalesce(1).saveAsTextFile(...) Mohammed From: Mohammed Guller Sent: Tuesday, August 4, 2015 9:39 PM To: 'Brandon White'; user Subject: RE: Combining Spark

Re: Combining Spark Files with saveAsTextFile

2015-08-04 Thread Igor Berman
August 2015 at 07:43, Mohammed Guller wrote: > Just to further clarify, you can first call coalesce with argument 1 and > then call saveAsTextFile. For example, > > > > rdd.coalesce(1).saveAsTextFile(...) > > > > > > > > Mohammed > > > > *F

Re: Combining Spark Files with saveAsTextFile

2015-08-05 Thread Igor Berman
her clarify, you can first call coalesce with argument 1 and >> then call saveAsTextFile. For example, >> >> >> >> rdd.coalesce(1).saveAsTextFile(...) >> >> >> >> >> >> >> >> Mohammed >> >> >> >

Re: Combining Spark Files with saveAsTextFile

2015-08-06 Thread MEETHU MATHEW
Hi, Try using coalesce(1) before calling saveAsTextFile(). Thanks & Regards, Meethu M On Wednesday, 5 August 2015 7:53 AM, Brandon White wrote: What is the best way to make saveAsTextFile save as only a single file?

Re: EC2 cluster doesn't work saveAsTextFile

2015-08-10 Thread Dean Wampler
3, yarn and HDFS . When i submit > at local there is no problem , but i run at cluster, saveAsTextFile doesn't > work."*It says me User class threw exception: Output directory > hdfs://172.31.42.10:54310/./weblogReadResult > <http://172.31.42.10:54310/./weblogReadResul

Re: EC2 cluster doesn't work saveAsTextFile

2015-08-10 Thread Yasemin Kaya
m> > @deanwampler <http://twitter.com/deanwampler> > http://polyglotprogramming.com > > On Mon, Aug 10, 2015 at 7:08 AM, Yasemin Kaya wrote: > >> Hi, >> >> I have EC2 cluster, and am using spark 1.3, yarn and HDFS . When i submit >> at local there i

Re: EC2 cluster doesn't work saveAsTextFile

2015-08-10 Thread Dean Wampler
twitter.com/deanwampler> >> http://polyglotprogramming.com >> >> On Mon, Aug 10, 2015 at 7:08 AM, Yasemin Kaya wrote: >> >>> Hi, >>> >>> I have EC2 cluster, and am using spark 1.3, yarn and HDFS . When i >>> submit at local there is n

error with saveAsTextFile in local directory

2015-11-03 Thread Jack Yang
res.show() --working res.map{ x => tranRow2Str(x) }.coalesce(1).saveAsTextFile(hdfsFilePath) --still working res.map{ x => tranRow2Str(x) }.coalesce(1).saveAsTextFile(localFilePath) --wrong! then at last, I get the correct results in hdfsFilePath, but nothing in localFilePath. Btw, the l

How to improve performance of saveAsTextFile()

2017-03-10 Thread Parsian, Mahmoud
How to improve performance of JavaRDD.saveAsTextFile("hdfs://…")? This is taking over 30 minutes on a cluster of 10 nodes. Running Spark on YARN. The JavaRDD has 120 million entries. Thank you, Best regards, Mahmoud
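Hedged starting points, assuming the save is I/O-bound: raise the number of write tasks and compress the output. Numbers are illustrative; measure before and after.

    import org.apache.hadoop.io.compress.GzipCodec
    bigRdd.repartition(400).saveAsTextFile("hdfs:///out", classOf[GzipCodec])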

Re: saveAsTextFile() part- files are missing

2015-05-21 Thread Tomasz Fruboes
1001560.n3.nabble.com/saveAsTextFile-part-files-are-missing-tp22974.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-ma

Re: RDD saveAsTextFile() to local disk

2015-07-08 Thread canan chen
wing function > > saveAsTextFile("file:home/someuser/dir2/testupload/20150708/") > > The dir (/home/someuser/dir2/testupload/) was created before running the > job. The error message is misleading. > > > org.apache.spark.SparkException: Job aborted due to stage

Re: RDD saveAsTextFile() to local disk

2015-07-08 Thread Vijay Pawnarkar
Thanks for the help. Following are the folders I was trying to write to: saveAsTextFile("file:///home/someuser/test2/testupload/20150708/0/") saveAsTextFile("file:///home/someuser/test2/testupload/20150708/1/") saveAsTextFile("file:///home/someuser/te

pyspark 1.1.1 on windows saveAsTextFile - NullPointerException

2014-12-18 Thread mj
Hi, I'm trying to use pyspark to save a simple rdd to a text file (code below), but it keeps throwing an error. - Python Code - items=["Hello", "world"] items2 = sc.parallelize(items) items2.coalesce(1).saveAsTextFile('c:/tmp/python_out.csv

Question on saveAsTextFile with overwrite option

2014-12-24 Thread Shao, Saisai
Hi, We have requirements to save RDD output to HDFS with a saveAsTextFile-like API, but need to overwrite the data if it exists. I'm not sure if current Spark supports such an operation, or whether I need to handle this manually. There's a thread in the mailing list discussed about t
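One manual route (a sketch, not a built-in Spark feature as far as this thread establishes): delete the target with the Hadoop FileSystem API before saving.

    import org.apache.hadoop.fs.{FileSystem, Path}
    val out = new Path("hdfs:///output/daily") // hypothetical target
    val fs = FileSystem.get(sc.hadoopConfiguration)
    if (fs.exists(out)) fs.delete(out, true)   // recursive delete
    rdd.saveAsTextFile(out.toString)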

Re: saveAsTextFile of RDD[Array[Any]]

2015-02-09 Thread Jong Wook Kim
If you have `RDD[Array[Any]]` you can do rdd.map(_.mkString("\t")) or with some other delimiter to make it `RDD[String]`, and then call `saveAsTextFile`. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/saveAsTextFile-of-RDD-Array-Any-tp21548p

saveAsTextFile with replication factor in HDFS

2014-05-14 Thread Sai Prasanna
Hi, Can we override the default file-replication factor while using saveAsTextFile() to HDFS? My default replication factor is >1, but intermediate files that I want to put in HDFS while running a Spark query need not be replicated, so is there a way? Thanks!
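A hedged sketch: setting dfs.replication on the job's Hadoop configuration should apply to files the job writes afterwards (note it affects everything written through that configuration, not just one save):

    sc.hadoopConfiguration.set("dfs.replication", "1")
    rdd.saveAsTextFile("hdfs:///tmp/intermediate") // hypothetical path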
