Re: Missing output partition file in S3

2016-09-21 Thread Steve Loughran
<user@spark.apache.org> Subject: Re: Missing output partition file in S3 On 15 Sep 2016, at 19:37, Chen, Kevin <kevin.c...@neustar.biz> wrote: Hi, Has anyone encountered an issue of missing output partition file in S3

Re: Missing output partition file in S3

2016-09-19 Thread Chen, Kevin
<user@spark.apache.org> Subject: Re: Missing output partition file in S3 On 15 Sep 2016, at 19:37, Chen, Kevin <kevin.c...@neustar.biz> wrote: Hi, Has anyone encountered an issue of missing output par

Re: Spark output data to S3 is very slow

2016-09-17 Thread Qiang Li
Tried several times; it is as slow as before. I will have Spark output data to HDFS, then sync the data to S3 as a temporary solution. Thank you. On Sat, Sep 17, 2016 at 10:43 AM, Takeshi Yamamuro wrote: > Hi, > > Have you seen the previous thread? >
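
For reference, a minimal sketch of the workaround described above, writing to HDFS and copying to S3 out of band; the paths are placeholders, not anything from the original thread:

    // Write the job output to HDFS first (fast, atomic renames), then copy to S3 afterwards.
    rdd.saveAsTextFile("hdfs:///tmp/job-output")    // illustrative path
    // Then, from the shell:  hadoop distcp hdfs:///tmp/job-output s3a://my-bucket/job-output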

Re: Spark output data to S3 is very slow

2016-09-16 Thread Takeshi Yamamuro
Hi, Have you seen the previous thread? https://www.mail-archive.com/user@spark.apache.org/msg56791.html // maropu On Sat, Sep 17, 2016 at 11:34 AM, Qiang Li wrote: > Hi, > > > I ran some jobs with Spark 2.0 on Yarn, I found all tasks finished very > quickly, but the last

Re: Missing output partition file in S3

2016-09-16 Thread Tracy Li
Sent from my iPhone > On Sep 15, 2016, at 1:37 PM, Chen, Kevin wrote: > > Hi, > > Has anyone encountered an issue of missing output partition file in S3? My > Spark job writes output to an S3 location. Occasionally, I noticed one > partition file is missing. As a

Re: Missing output partition file in S3

2016-09-16 Thread Igor Berman
Are you using speculation? On 15 September 2016 at 21:37, Chen, Kevin wrote: > Hi, > > Has anyone encountered an issue of missing output partition file in S3? > My Spark job writes output to an S3 location. Occasionally, I noticed one > partition file is missing. As a
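
For context on the question above: speculative attempts that lose the race can interfere with output written directly to S3, so the setting is worth checking. A minimal sketch of turning it off explicitly (app name and values are illustrative):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("s3-output-job")             // illustrative
      .set("spark.speculation", "false")       // default is false; confirm it isn't enabled elsewhere
    val sc = new SparkContext(conf)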

Re: Missing output partition file in S3

2016-09-16 Thread Steve Loughran
On 15 Sep 2016, at 19:37, Chen, Kevin wrote: Hi, Has anyone encountered an issue of missing output partition file in S3? My Spark job writes output to an S3 location. Occasionally, I noticed one partition file is missing. As a result,

Re: SVD output within Spark

2016-08-31 Thread Yanbo Liang
The signs of the eigenvectors are essentially arbitrary, so both the Spark and Matlab results are right. Thanks On Thu, Jul 21, 2016 at 3:50 PM, Martin Somers wrote: > > just looking at a comparison between Matlab and Spark for svd with an > input matrix N > > > this is
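
For anyone repeating the comparison, a minimal sketch of the MLlib call (the rank k is illustrative); since the sign of each singular vector is arbitrary, compare the columns up to sign:

    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    val mat: RowMatrix = ???                          // the input matrix N from the question
    val svd = mat.computeSVD(3, computeU = true)      // k = 3 is illustrative
    val (u, s, v) = (svd.U, svd.s, svd.V)             // U: RowMatrix, s: singular values, V: Matrix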

Re: Writing output of key-value Pair RDD

2016-05-05 Thread Afshartous, Nick
Sent: Thursday, May 5, 2016 3:35:17 PM To: Nicholas Chammas; user@spark.apache.org Subject: Re: Writing output of key-value Pair RDD Thanks, I got the example below working. Though it writes both the keys and values to the output file. Is there any way to write just the values?
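
A minimal sketch of the MultipleTextOutputFormat approach referenced further down in this thread: the key chooses the output file but is replaced by NullWritable, so only the values are written. The class name, types, and path here are illustrative:

    import org.apache.hadoop.io.NullWritable
    import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

    class RDDMultipleTextOutputFormat extends MultipleTextOutputFormat[Any, Any] {
      // Drop the key from each output record; only the value is written.
      override def generateActualKey(key: Any, value: Any): Any = NullWritable.get()
      // Route each record to a file named after its key.
      override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String = key.toString
    }

    pairRdd.saveAsHadoopFile("s3://bucket/output",    // pairRdd: RDD[(String, String)], path illustrative
      classOf[String], classOf[String], classOf[RDDMultipleTextOutputFormat])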

Re: Writing output of key-value Pair RDD

2016-05-05 Thread Afshartous, Nick
lelize(Arrays.asList(strings)) .mapToPair(pairFunction) .saveAsHadoopFile("s3://...", String.class, String.class, RDDMultipleTextOutputFormat.class); From: Nicholas Chammas <nicholas.cham...@gmail.com> Sent: Wednesday, May 4, 2016

Re: Writing output of key-value Pair RDD

2016-05-04 Thread Nicholas Chammas
You're looking for this discussion: http://stackoverflow.com/q/23995040/877069 Also, a simpler alternative with DataFrames: https://github.com/apache/spark/pull/8375#issuecomment-202458325 On Wed, May 4, 2016 at 4:09 PM Afshartous, Nick wrote: > Hi, > > > Is there any

Re: Get output of the ALS algorithm.

2016-03-15 Thread Bryan Cutler
Jacek is correct for using org.apache.spark.ml.recommendation.ALSModel If you are trying to save org.apache.spark.mllib.recommendation.MatrixFactorizationModel, then it is similar, but just a little different, see the example here
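
Following on from the distinction above, a minimal sketch of saving the mllib MatrixFactorizationModel, and of writing recommendations themselves to a file; paths and parameter values are illustrative:

    import org.apache.spark.mllib.recommendation.{ALS, MatrixFactorizationModel, Rating}

    val model = ALS.train(ratings, rank = 10, iterations = 10)   // ratings: RDD[Rating]
    model.save(sc, "hdfs:///models/als")                         // persists the user/product factors
    val reloaded = MatrixFactorizationModel.load(sc, "hdfs:///models/als")

    // Alternatively, write the recommendation output itself to a file:
    model.recommendProductsForUsers(5).saveAsTextFile("hdfs:///output/recommendations")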

Re: Get output of the ALS algorithm.

2016-03-11 Thread Jacek Laskowski
What about write.save(file)? P.s. I'm new to Spark MLlib. 11.03.2016 4:57 AM "Shishir Anshuman" wrote: > hello, > > I am new to Apache Spark and would like to get the Recommendation output > of the ALS algorithm in a file. > Please suggest a solution. > >

Re: Get output of the ALS algorithm.

2016-03-11 Thread Bryan Cutler
Are you trying to save predictions on a dataset to a file, or the model produced after training with ALS? On Thu, Mar 10, 2016 at 7:57 PM, Shishir Anshuman wrote: > hello, > > I am new to Apache Spark and would like to get the Recommendation output > of the ALS

Re: Writing output fails when spark.unsafe.offHeap is enabled

2015-12-21 Thread Mayuresh Kunjir
Not quite sure if the error is resolved. Upon further probing, the setting spark.memory.offHeap.enabled is not getting applied in this build. When I print its value from core/src/main/scala/org/apache/spark/memory/MemoryManager.scala, it returns false even though the webUI is indicating that it's been
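
For later readers on 1.6+, the off-heap settings are spelled as below; a minimal sketch with an illustrative size (the size must be positive when the flag is enabled):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.memory.offHeap.enabled", "true")
      .set("spark.memory.offHeap.size", "2g")    // illustrative; must be > 0 when enabled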

Re: Writing output fails when spark.unsafe.offHeap is enabled

2015-12-21 Thread Mayuresh Kunjir
Thanks Ted. That stack trace is from 1.5.1 build. I tried on the latest code as you suggested. Memory management seems to have changed quite a bit and this error has been fixed as well. :) Thanks for the help! Regards, ~Mayuresh On Mon, Dec 21, 2015 at 10:10 AM, Ted Yu

Re: Writing output fails when spark.unsafe.offHeap is enabled

2015-12-21 Thread Mayuresh Kunjir
Any intuition on this? ~Mayuresh On Thu, Dec 17, 2015 at 8:04 PM, Mayuresh Kunjir wrote: > I am testing a simple Sort program written using Dataframe APIs. When I > enable spark.unsafe.offHeap, the output stage fails with a NPE. The > exception when run on spark-1.5.1

Re: Writing output fails when spark.unsafe.offHeap is enabled

2015-12-21 Thread Ted Yu
w.r.t. at org.apache.spark.sql.execution.UnsafeExternalRowSorter$RowComparator.compare(UnsafeExternalRowSorter.java:202) I looked at UnsafeExternalRowSorter.java in 1.6.0 which only has 192 lines of code. Can you run with latest RC of 1.6.0 and paste the stack trace ? Thanks On Thu, Dec 17,

Re: flatMap output on disk / flatMap memory overhead

2015-08-01 Thread Puneet Kapoor
Hi Octavian, Just out of curiosity, did you try persisting your RDD in serialized format MEMORY_AND_DISK_SER or MEMORY_ONLY_SER? i.e. changing your rdd.persist(MEMORY_AND_DISK) to rdd.persist(MEMORY_ONLY_SER) Regards On Wed, Jun 10, 2015 at 7:27 AM, Imran Rashid iras...@cloudera.com
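
A minimal sketch of the suggestion above, with a toy stand-in for the real flatMap so the snippet is self-contained; the input path and the expansion function are illustrative:

    import org.apache.spark.storage.StorageLevel

    val lines = sc.textFile("hdfs:///input")                         // sc: an existing SparkContext
    val counts = lines
      .persist(StorageLevel.MEMORY_AND_DISK_SER)                     // serialized in memory, spills to disk
      .flatMap(line => line.split("\\s+").map(word => (word, 1)))    // stand-in for the real expansion
      .reduceByKey(_ + _)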

Re: flatMap output on disk / flatMap memory overhead

2015-06-09 Thread Imran Rashid
I agree with Richard. It looks like the issue here is shuffling, and shuffle data is always written to disk, so the issue is definitely not that all the output of flatMap has to be stored in memory. If at all possible, I'd first suggest upgrading to a new version of spark -- even in 1.2, there

Re: flatMap output on disk / flatMap memory overhead

2015-06-02 Thread Akhil Das
You could try rdd.persist(MEMORY_AND_DISK/DISK_ONLY).flatMap(...), I think StorageLevel MEMORY_AND_DISK means Spark will try to keep the data in memory and, if there isn't sufficient space, spill the rest to disk. Thanks Best Regards On Mon, Jun 1, 2015 at 11:02 PM, octavian.ganea

Re: flatMap output on disk / flatMap memory overhead

2015-06-02 Thread octavian.ganea
I tried using reduceByKey, without success. I also tried this: rdd.persist(MEMORY_AND_DISK).flatMap(...).reduceByKey. However, I got the same error as before, namely the error described here:

Re: flatMap output on disk / flatMap memory overhead

2015-06-02 Thread Richard Marscher
Are you sure it's memory related? What is the disk utilization and IO performance on the workers? The error you posted looks to be related to shuffle trying to obtain block data from another worker node and failing to do so in a reasonable amount of time. It may still be memory related, but I'm not

Re: Extra output from Spark run

2015-03-05 Thread Sean Owen
In the console, you'd find this draws a progress bar illustrating the current stage progress. In logs, it shows up as this sort of 'pyramid' since CR makes a newline. You can turn it off with spark.ui.showConsoleProgress = false On Thu, Mar 5, 2015 at 2:11 AM, cjwang c...@cjwang.us wrote: When

Re: Extra output from Spark run

2015-03-05 Thread davidkl
If you do not want those progress indications to appear, just set spark.ui.showConsoleProgress to false, e.g.: System.setProperty("spark.ui.showConsoleProgress", "false"); Regards
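
The same flag can also be set on the SparkConf before the context is created; a minimal sketch:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf().set("spark.ui.showConsoleProgress", "false")
    val sc = new SparkContext(conf)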

Re: sorting output of join operation

2015-02-23 Thread Imran Rashid
sortByKey() is probably the easiest way: import org.apache.spark.SparkContext._ joinedRdd.map{case(word, (file1Counts, file2Counts)) => (file1Counts, (word, file1Counts, file2Counts))}.sortByKey() On Mon, Feb 23, 2015 at 10:41 AM, Anupama Joshi anupama.jo...@gmail.com wrote: Hi , To
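
If re-keying isn't needed, sortBy is a slightly shorter equivalent; a sketch over the same joined RDD:

    // Sort by the count from the first file while keeping the records in their original shape.
    joinedRdd.sortBy { case (_, (file1Counts, _)) => file1Counts }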

Re: No Output

2015-01-18 Thread Deep Pradhan
The error in the log file says: *java.lang.OutOfMemoryError: GC overhead limit exceeded* with certain task ID and the error repeats for further task IDs. What could be the problem? On Sun, Jan 18, 2015 at 2:45 PM, Deep Pradhan pradhandeep1...@gmail.com wrote: Updating the Spark version means

Re: No Output

2015-01-18 Thread Deep Pradhan
Updating the Spark version means setting up the entire cluster once more? Or can we update it in some other way? On Sat, Jan 17, 2015 at 3:22 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Can you paste the code? Also you can try updating your spark version. Thanks Best Regards On Sat,

Re: No Output

2015-01-18 Thread Akhil Das
You can try increasing the parallelism. Can you be more specific about the task that you are doing? Maybe pasting the piece of code would help. On 18 Jan 2015 13:22, Deep Pradhan pradhandeep1...@gmail.com wrote: The error in the log file says: *java.lang.OutOfMemoryError: GC overhead limit
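
A sketch of the two usual parallelism knobs; the values are illustrative, not recommendations:

    import org.apache.spark.SparkConf

    // Cluster-wide default used by shuffles that don't specify a partition count:
    val conf = new SparkConf().set("spark.default.parallelism", "200")

    // Or per operation, by asking for more partitions explicitly:
    val counts = pairs.reduceByKey(_ + _, 200)    // `pairs` stands in for the user's pair RDD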

Re: No Output

2015-01-17 Thread Akhil Das
Can you paste the code? Also you can try updating your spark version. Thanks Best Regards On Sat, Jan 17, 2015 at 2:40 PM, Deep Pradhan pradhandeep1...@gmail.com wrote: Hi, I am using Spark-1.0.0 in a single node cluster. When I run a job with small data set it runs perfectly but when I use

Re: Getting Output From a Cluster

2015-01-12 Thread Su She
Hello Everyone, Quick followup: is there any way I can append output to one file rather than create a new directory/file every X milliseconds? Thanks! Suhas Shekar University of California, Los Angeles B.A. Economics, Specialization in Computing 2014 On Thu, Jan 8, 2015 at 11:41 PM, Su She

Re: Getting Output From a Cluster

2015-01-12 Thread Akhil Das
There is no direct way of doing that. If you need a single file for every batch duration, then you can repartition the data to 1 before saving. Another way would be to use Hadoop's copyMerge command/API (available from 2.0 versions). On 13 Jan 2015 01:08, Su She suhsheka...@gmail.com wrote: Hello
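
A minimal sketch of the first suggestion, collapsing each batch to one partition so every interval produces a single part file; the output prefix is illustrative:

    dstream.foreachRDD { (rdd, time) =>
      // One part file per batch; the directory name carries the batch time.
      rdd.repartition(1).saveAsTextFile(s"hdfs:///output/batch-${time.milliseconds}")
    }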

Re: Getting Output From a Cluster

2015-01-12 Thread Su She
Okay, thanks Akhil! Suhas Shekar University of California, Los Angeles B.A. Economics, Specialization in Computing 2014 On Mon, Jan 12, 2015 at 1:24 PM, Akhil Das ak...@sigmoidanalytics.com wrote: There is no direct way of doing that. If you need a Single file for every batch duration, then

Re: Getting Output From a Cluster

2015-01-08 Thread Akhil Das
saveAsHadoopFiles requires you to specify the output format, which I believe you are not specifying anywhere, and hence the program crashes. You could try something like this: Class<? extends OutputFormat<?,?>> outputFormatClass = (Class<? extends OutputFormat<?,?>>) (Class<?>)
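
A minimal Scala sketch of passing the output format explicitly to saveAsHadoopFiles; it assumes a DStream of (Text, Text) pairs, and the key/value types and prefix are illustrative:

    import org.apache.hadoop.io.Text
    import org.apache.hadoop.mapred.TextOutputFormat
    import org.apache.spark.streaming.StreamingContext._   // pair DStream functions on older releases

    pairStream.saveAsHadoopFiles(
      "hdfs:///output/prefix", "txt",
      classOf[Text], classOf[Text],
      classOf[TextOutputFormat[Text, Text]])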

Re: Getting Output From a Cluster

2015-01-08 Thread Su She
1) Thank you everyone for the help once again...the support here is really amazing and I hope to contribute soon! 2) The solution I actually ended up using was from this thread:

Re: Getting Output From a Cluster

2015-01-08 Thread Su She
Yes, I am calling saveAsHadoopFiles on the DStream. However, when I call print on the DStream it works? If I had to do foreachRDD to saveAsHadoopFile, then why is it working for print? Also, if I am doing foreachRDD, do I need connections, or can I simply put the saveAsHadoopFiles inside the

Re: Getting Output From a Cluster

2015-01-08 Thread Yana Kadiyska
Are you calling the saveAsTextFiles on the DStream -- looks like it? Look at the section called Design Patterns for using foreachRDD in the link you sent -- you want to do dstream.foreachRDD(rdd => rdd.saveAs) On Thu, Jan 8, 2015 at 5:20 PM, Su She suhsheka...@gmail.com wrote: Hello

Re: Map output statuses exceeds frameSize

2014-11-13 Thread pouryas
Anyone experienced this before? Any help would be appreciated -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Map-output-statuses-exceeds-frameSize-tp18783p18866.html
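
For reference, the error in the subject is governed by the Akka frame size on pre-2.0 versions; a sketch with an illustrative value (the unit is MB):

    import org.apache.spark.SparkConf

    // Map output status messages larger than the frame size trigger this error;
    // raising the limit is the usual workaround.
    val conf = new SparkConf().set("spark.akka.frameSize", "128")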

Re: Spark output to s3 extremely slow

2014-10-16 Thread Anny Chen
Hi Rafal, Thanks for the explanation and solution! I need to write maybe 100 GB to s3. I will try your way and see whether it works for me. Thanks again! On Wed, Oct 15, 2014 at 1:44 AM, Rafal Kwasny m...@entropy.be wrote: Hi, How large is the dataset you're saving into S3? Actually saving

Re: Spark output to s3 extremely slow

2014-10-15 Thread Rafal Kwasny
Hi, How large is the dataset you're saving into S3? Actually saving to S3 is done in two steps: 1) writing temporary files 2) committing them to the proper directory Step 2) could be slow because S3 does not have a quick atomic move operation; you have to copy (server side, but it still takes time) and then

Re: overwriting output directory

2014-06-12 Thread Nan Zhu
Hi, SK For 1.0.0 you have to delete it manually; in 1.0.1 there will be a parameter to enable overwriting: https://github.com/apache/spark/pull/947/files Best, -- Nan Zhu On Thursday, June 12, 2014 at 1:57 PM, SK wrote: Hi, When we have multiple runs of a program writing to the same
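
For later readers: the parameter added by that pull request is, to the best of my recollection, spark.hadoop.validateOutputSpecs; a sketch (use with care, since it disables the existing-directory check):

    import org.apache.spark.SparkConf

    // Spark 1.0.1+: let saveAs* proceed even if the output directory already exists.
    val conf = new SparkConf().set("spark.hadoop.validateOutputSpecs", "false")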

Re: Spark output compression on HDFS

2014-04-04 Thread Azuryy
There is no compress type for snappy. Sent from my iPhone5s On 4 Apr 2014, at 23:06, Konstantin Kudryavtsev kudryavtsev.konstan...@gmail.com wrote: Can anybody suggest how to change compression level (Record, Block) for Snappy? if it is possible, of course thank you in advance Thank

Re: Spark output compression on HDFS

2014-04-02 Thread Patrick Wendell
For textFile I believe we overload it and let you set a codec directly: https://github.com/apache/spark/blob/master/core/src/test/scala/org/apache/spark/FileSuite.scala#L59 For saveAsSequenceFile yep, I think Mark is right, you need an option. On Wed, Apr 2, 2014 at 12:36 PM, Mark Hamstra
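
A minimal sketch of both forms mentioned in this thread: a codec passed directly to saveAsTextFile, and the Option form for saveAsSequenceFile (codec choices and paths are illustrative):

    import org.apache.hadoop.io.compress.{GzipCodec, SnappyCodec}

    // Text output: the codec is an extra argument.
    rdd.saveAsTextFile("hdfs:///out/text", classOf[GzipCodec])

    // Sequence-file output: the codec is passed as an Option.
    pairRdd.saveAsSequenceFile("hdfs:///out/seq", Some(classOf[SnappyCodec]))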

Re: Spark output compression on HDFS

2014-04-02 Thread Nicholas Chammas
Is this a Scala-only (http://spark.incubator.apache.org/docs/latest/api/pyspark/pyspark.rdd.RDD-class.html#saveAsTextFile) feature? On Wed, Apr 2, 2014 at 5:55 PM, Patrick Wendell pwend...@gmail.com wrote: For textFile I believe we overload it and let you set a codec directly:

Re: Spark output compression on HDFS

2014-04-02 Thread Nicholas Chammas
Thanks for pointing that out. On Wed, Apr 2, 2014 at 6:11 PM, Mark Hamstra m...@clearstorydata.com wrote: First, you shouldn't be using spark.incubator.apache.org anymore, just spark.apache.org. Second, saveAsSequenceFile doesn't appear to exist in the Python API at this point. On Wed,