Re: Question regarding doing aggregation over custom partitions

2014-05-03 Thread Arun Swami
Thanks, that was what I was missing! arun (Arun Swami, +1 408-338-0906) On Fri, May 2, 2014 at 4:28 AM, Mayur Rustagi mayur.rust...@gmail.com wrote: You need to first partition the data by the key. Use mapPartitions instead of map. Mayur Rustagi, Ph: +1 (760) 203 3257
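A minimal sketch of the pattern Mayur describes, assuming an existing SparkContext named `sc` and a hypothetical (key, value) pair RDD; the partitioner size and aggregation logic are placeholders:

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.SparkContext._ // pair-RDD operations need this implicit import on the Spark versions in this thread

// Hypothetical (key, value) data
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// Partition by key first, so every record for a key lands in the same partition
val partitioned = pairs.partitionBy(new HashPartitioner(4))

// mapPartitions sees a whole partition at once, so a custom aggregation
// can run over all of a key's values locally within that partition
val aggregated = partitioned.mapPartitions { iter =>
  iter.toSeq.groupBy(_._1).mapValues(_.map(_._2).sum).iterator
}
```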

Re: string to int conversion

2014-05-03 Thread Sean Owen
On Sat, May 3, 2014 at 2:00 AM, SK skrishna...@gmail.com wrote: 1) I have a csv file where one of the fields has integer data but it appears as strings: 1, 3 etc. I tried using toInt to convert the strings to int after reading (field(3).toInt). But I got a NumberFormatException. So I
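One common cause of that NumberFormatException is leading whitespace left over from splitting on commas; a small sketch, assuming the field really contains only digits (the line layout here is hypothetical):

```scala
val line = "foo,bar,baz, 3"        // note the space before the 3
val fields = line.split(",")

// " 3".toInt throws NumberFormatException, so trim the field first
val value = fields(3).trim.toInt   // 3
```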

Re: sbt/sbt run command returns a JVM problem

2014-05-03 Thread Carter
Hi, thanks for all your help. I tried your setting in the sbt file, but the problem is still there. The Java setting in my sbt file is: `java -Xmx1200m -XX:MaxPermSize=350m -XX:ReservedCodeCacheSize=256m -jar ${JAR} $@` I have tried to set these 3 parameters bigger and smaller, but

Re: Spark's behavior

2014-05-03 Thread Eduardo Costa Alfaia
Hi TD, Thanks for the reply. I ran this last experiment on one computer, in local mode, but I think the time gap grows when I add more computers. I will run it again now with 8 workers and 1 word source and see what happens. I will also check the clocks, as suggested by Andrew. On May 3, 2014,

what's local[n]

2014-05-03 Thread Weide Zhang
In the Spark Kafka example, it says `./bin/run-example org.apache.spark.streaming.examples.KafkaWordCount local[2] zoo01,zoo02,zoo03 my-consumer-group topic1,topic2 1`. Can anyone tell me what local[2] represents? I thought the master URL should be something like spark://hostname:port. Also,

Re: Crazy Kryo Exception

2014-05-03 Thread Soren Macbeth
Poking around in the bowels of Scala, it seems like this has something to do with implicit Scala-to-Java collection munging. Why would it be doing this, and where? The stack trace given is entirely unhelpful to me. Is there a better one buried in my task logs? None of my tasks actually failed, so it

Re: what's local[n]

2014-05-03 Thread Andrew Ash
Hi Weide, The answer to your first question about local[2] can be found in the "Running the Examples and Shell" section of https://spark.apache.org/docs/latest/. Note that all of the sample programs take a master parameter specifying the cluster URL to connect to. This can be a URL for a
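For reference, a sketch of the two kinds of master values; the app name and thread count are arbitrary here:

```scala
import org.apache.spark.SparkContext

// local[2]: run Spark in-process with 2 worker threads, no cluster needed.
// Streaming examples want at least 2 threads so the receiver and the
// processing do not starve each other.
val sc = new SparkContext("local[2]", "KafkaWordCount")

// Against a standalone cluster you would instead pass its URL:
// val sc = new SparkContext("spark://master-host:7077", "KafkaWordCount")
```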

Lease Exception hadoop 2.4

2014-05-03 Thread Andre Kuhnen
Hello, I am getting this warning after upgrading to Hadoop 2.4, when I try to write something to HDFS. The content is written correctly, but I do not like this warning. Do I have to compile Spark with Hadoop 2.4? WARN TaskSetManager: Loss was due to org.apache.hadoop.ipc.RemoteException

Spark-1.0.0-rc3 compiled against Hadoop 2.3.0 cannot read HDFS 2.3.0?

2014-05-03 Thread Nan Zhu
Hi all, I compiled Spark-1.0.0-rc3 against Hadoop 2.3.0 with `SPARK_HADOOP_VERSION=2.3.0 sbt/sbt assembly` and deployed Spark with HDFS 2.3.0 (yarn = false). It seems that it cannot read data from HDFS; the following exception is thrown: java.lang.VerifyError: class

Re: sbt/sbt run command returns a JVM problem

2014-05-03 Thread Michael Armbrust
The problem is probably not with the JVM running sbt but with the one that sbt is forking to run your program. See here for the relevant option: https://github.com/apache/spark/blob/master/project/SparkBuild.scala#L186 You might try starting sbt with no arguments (to bring up the sbt console).
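A hedged sketch of how the forked run JVM's memory can be raised in a project's own sbt build (build.sbt style); the heap size here is arbitrary, not a recommendation:

```scala
// build.sbt
// Fork a separate JVM for `run` and give it more heap; these options apply
// to the forked process, not to the JVM that runs sbt itself.
fork in run := true
javaOptions in run += "-Xmx2G"
```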

Re: Multiple Streams with Spark Streaming

2014-05-03 Thread Chris Fregly
If you want to use true Spark Streaming (not the same as Hadoop Streaming/Piping, as Mayur pointed out), you can use the DStream.union() method as described in the following docs: http://spark.apache.org/docs/0.9.1/streaming-custom-receivers.html
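A minimal sketch of merging several input streams into one DStream, assuming a running StreamingContext named `ssc`; the hosts and port are placeholders:

```scala
import org.apache.spark.streaming.dstream.DStream

// Two independent receivers...
val stream1 = ssc.socketTextStream("host1", 9999)
val stream2 = ssc.socketTextStream("host2", 9999)

// ...merged into a single DStream that downstream operations see as one
val merged: DStream[String] = stream1.union(stream2)
merged.count().print()
```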

Re: Reading multiple S3 objects, transforming, writing back one

2014-05-03 Thread Chris Fregly
Not sure if this directly addresses your issue, Peter, but it's worth mentioning a handy AWS EMR utility called s3distcp that can upload a single HDFS file - in parallel - to a single, concatenated S3 file once all the partitions are uploaded. Kinda cool. Here's some info:

Re: Reading multiple S3 objects, transforming, writing back one

2014-05-03 Thread Patrick Wendell
Hi Peter, If your dataset is large (3GB) then why not just chunk it into multiple files? You'll need to do this anyways if you want to read/write this from S3 quickly, because S3's throughput per file is limited (as you've seen). This is exactly why the Hadoop API lets you save datasets with
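A sketch of the multi-file approach being suggested; the bucket paths, partition count, and transformation are placeholders:

```scala
// Read many S3 objects in parallel...
val data = sc.textFile("s3n://my-bucket/input/*")

val transformed = data.map(_.toUpperCase)   // placeholder transformation

// ...and write the result back as a directory of part-* files rather than
// one large object; repartition controls how many output files (and thus
// how many parallel S3 writes) are produced.
transformed.repartition(16).saveAsTextFile("s3n://my-bucket/output/")
```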

Re: Setting the Scala version in the EC2 script?

2014-05-03 Thread Patrick Wendell
Spark will only work with Scala 2.10... are you trying to do a minor version upgrade or upgrade to a totally different version? You could do this as follows if you want: 1. Fork the spark-ec2 repository and change the file here: https://github.com/mesos/spark-ec2/blob/v2/scala/init.sh 2. Modify

Re: when to use broadcast variables

2014-05-03 Thread Patrick Wendell
Broadcast variables need to fit entirely in memory - so that's a pretty good litmus test for whether or not to broadcast a smaller dataset or turn it into an RDD. On Fri, May 2, 2014 at 7:50 AM, Prashant Sharma scrapco...@gmail.com wrote: I'd like to be corrected on this, but I am just trying
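A small sketch of that trade-off, assuming a lookup table that comfortably fits in memory on each executor:

```scala
// A small lookup table: broadcast it once instead of shipping it inside
// every task closure.
val lookup = Map("a" -> 1, "b" -> 2)
val bLookup = sc.broadcast(lookup)

val resolved = sc.parallelize(Seq("a", "b", "a"))
  .map(k => bLookup.value.getOrElse(k, 0))

// If `lookup` were too large to hold in memory, it would instead become an
// RDD and be combined via a join rather than a broadcast.
```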

Re: performance improvement on second operation...without caching?

2014-05-03 Thread Matei Zaharia
Hi Diana, Apart from these reasons, in a multi-stage job, Spark saves the map output files from map stages to the filesystem, so it only needs to rerun the last reduce stage. This is why you only saw one stage executing. These files are saved for fault recovery but they speed up subsequent

Re: Spark's behavior

2014-05-03 Thread Andrew Ash
Hi Eduardo, Yep those machines look pretty well synchronized at this point. Just wanted to throw that out there and eliminate it as a possible source of confusion. Good luck on continuing the debugging! Andrew On Sat, May 3, 2014 at 11:59 AM, Eduardo Costa Alfaia e.costaalf...@unibs.it

Re: help me

2014-05-03 Thread Chris Fregly
As Mayur indicated, it's odd that you are seeing better performance from a less-local configuration. However, the non-deterministic behavior that you describe is likely caused by GC pauses in your JVM process. Take note of the spark.locality.wait configuration parameter described here:
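For context, a sketch of where that parameter is set; the wait value is only an illustration, and the unit (milliseconds on the Spark versions discussed in this thread) is an assumption worth checking against the docs linked above:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// spark.locality.wait: how long the scheduler waits for a data-local slot
// before falling back to a less-local one.
val conf = new SparkConf()
  .setAppName("LocalityDemo")
  .set("spark.locality.wait", "3000")

val sc = new SparkContext(conf)
```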

spark run issue

2014-05-03 Thread Weide Zhang
Hi, I'm trying to run the kafka-word-count example in spark2.9.1. I encountered an exception when initializing the Kafka consumer/producer config. I'm using Scala 2.10.3 and used the Maven build of the Spark Streaming Kafka library that comes with spark2.9.1. Has anyone seen this exception before? Thanks,

Re: spark run issue

2014-05-03 Thread Weide Zhang
Hi Tathagata, I figured out the reason. I was adding a wrong Kafka lib alongside the version Spark uses. Sorry for spamming. Weide On Sat, May 3, 2014 at 7:04 PM, Tathagata Das tathagata.das1...@gmail.com wrote: I am a little confused about the version of Spark you are using. Are you

Re: spark run issue

2014-05-03 Thread Weide Zhang
Hi Tathagata, I actually have a separate question. What's the purpose of the lib_managed folder inside the Spark source folder? Are those the libraries required for Spark Streaming to run? Do they need to be added to the Spark classpath when starting the Spark cluster? Weide On Sat, May 3, 2014 at 7:08 PM,

cache not work as expected for iteration?

2014-05-03 Thread Earthson
I'm using Spark for an LDA implementation. I need to cache an RDD for the next step of Gibbs sampling, cache the new result, and then uncache the previous one. Something like an LRU cache should drop the previous cached RDD because it is never used again, but the caching runs into confusion. Here is the code:
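A minimal sketch of the cache-then-unpersist pattern for an iterative job like this, assuming each iteration builds a new RDD from the previous one; the update function is a placeholder, not the actual Gibbs-sampling step:

```scala
// Cache the new state, materialize it, then explicitly drop the previous
// iteration's cache rather than relying on LRU eviction.
var state = sc.parallelize(1 to 1000).cache()

for (i <- 1 to 10) {
  val next = state.map(_ + 1).cache()  // placeholder update step
  next.count()                         // force materialization before unpersisting the parent
  state.unpersist()
  state = next
}
```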

Re: performance improvement on second operation...without caching?

2014-05-03 Thread Koert Kuipers
Hey Matei, Not sure I understand that. These are 2 separate jobs. So the second job takes advantage of the fact that there is map output left somewhere on disk from the first job, and re-uses that? On Sat, May 3, 2014 at 8:29 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Hi Diana, Apart

Re: sbt/sbt run command returns a JVM problem

2014-05-03 Thread Carter
Hi Michael, Thank you very much for your reply. Sorry, I am not very familiar with sbt. Could you tell me where to set the Java options for the JVM that sbt forks for my program? I brought up the sbt console and ran `set javaOptions += -Xmx1G` in it, but it returned an error: [error]

Re: performance improvement on second operation...without caching?

2014-05-03 Thread Matei Zaharia
Yes, this happens as long as you use the same RDD. For example, say you do the following: data1 = sc.textFile(…).map(…).reduceByKey(…) data1.count() data1.filter(…).count() The first count() causes outputs of the map/reduce pair in there to be written out to shuffle files. Next time you do a
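Matei's example, written out as a runnable sketch; the input path and the map/filter functions are placeholders:

```scala
val data1 = sc.textFile("hdfs:///some/input")  // placeholder path
  .map(line => (line, 1))                      // placeholder map
  .reduceByKey(_ + _)

data1.count()                    // runs the map stage and writes its shuffle files
data1.filter(_._2 > 1).count()   // reuses those shuffle files; only the post-shuffle stage reruns
```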