Re: Spark resilience

2014-04-16 Thread Arpit Tak
1. If we add more executors to the cluster and the data is already cached in the system (the RDDs are already there), will the job run on the new executors even though the RDDs are not present there? If yes, how is the performance on the new executors? 2. What is the replication factor

Re: Proper caching method

2014-04-16 Thread Arpit Tak
Hi Cheng, Is it possible to delete or replicate an RDD? rdd1 = textFile(hdfs...).cache() rdd2 = rdd1.filter(userDefinedFunc1).cache() rdd3 = rdd1.filter(userDefinedFunc2).cache() To reframe the above question: if rdd1 is around 50G and after filtering it comes down to, say, 4G, then to increase
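A minimal sketch of one way to do both things asked here, using hypothetical filter predicates and an input path in place of the ones in the question: the filtered RDDs are materialized and replicated with a _2 storage level, and the large parent is then dropped with unpersist().

    import org.apache.spark.storage.StorageLevel

    // Hypothetical stand-ins for userDefinedFunc1/2 and the HDFS path above.
    val rdd1 = sc.textFile("hdfs:///path/to/input").cache()
    val rdd2 = rdd1.filter(_.contains("A")).persist(StorageLevel.MEMORY_ONLY_2) // 2 replicas
    val rdd3 = rdd1.filter(_.contains("B")).cache()

    // Materialize the children first, then drop the large parent from the cache.
    rdd2.count()
    rdd3.count()
    rdd1.unpersist()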

Re: Why these operations are slower than the equivalent on Hadoop?

2014-04-16 Thread Eugen Cepoi
Yes, the second example does that. It transforms all the points of a partition into a single element, the skyline, so reduce will run on the skylines of two partitions rather than on single points. On 16 Apr 2014 06:47, Yanzhe Chen yanzhe...@gmail.com wrote: Eugen, Thanks for your tip and I do
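A sketch of the pattern being described, with a hypothetical skyline function (not from the thread) that keeps the non-dominated points of a 2-D point set:

    // Hypothetical helper: q dominates p if it is no worse in both coordinates.
    def dominates(q: (Double, Double), p: (Double, Double)): Boolean =
      q._1 <= p._1 && q._2 <= p._2 && q != p

    def skyline(points: Seq[(Double, Double)]): Seq[(Double, Double)] =
      points.filter(p => !points.exists(q => dominates(q, p)))

    val points = sc.parallelize(Seq((1.0, 5.0), (2.0, 2.0), (4.0, 1.0), (3.0, 3.0)), 2)
    val globalSkyline = points
      .mapPartitions(iter => Iterator(skyline(iter.toSeq)))  // one local skyline per partition
      .reduce((a, b) => skyline(a ++ b))                     // merge skylines, not raw points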

Re: Spark resilience

2014-04-16 Thread Aaron Davidson
1. Spark prefers to run tasks where the data is, but it is able to move cached data between executors if no cores are available where the data is initially cached (which is often much faster than recomputing the data from scratch). The result is that data is automatically spread out across the
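On the replication question from the original post, a minimal sketch: caching with a _2 storage level keeps two copies of each cached partition, so losing one executor does not force recomputation. The path is a placeholder.

    import org.apache.spark.storage.StorageLevel

    val lines = sc.textFile("hdfs:///path/to/input")
    val cached = lines.persist(StorageLevel.MEMORY_AND_DISK_2) // each partition stored on two executors
    cached.count()                                             // materializes the cache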

Re: StackOverflow Error when run ALS with 100 iterations

2014-04-16 Thread Nick Pentreath
I'd also say that running for 100 iterations is a waste of resources, as ALS will typically converge pretty quickly, usually within 10-20 iterations. On Wed, Apr 16, 2014 at 3:54 AM, Xiaoli Li lixiaolima...@gmail.com wrote: Thanks a lot for your information. It really helps me. On Tue, Apr
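A sketch of training MLlib ALS with a modest iteration count instead of 100; the input path, rank and lambda values here are placeholders, not recommendations from the thread:

    import org.apache.spark.mllib.recommendation.{ALS, Rating}

    val ratings = sc.textFile("hdfs:///path/to/ratings")   // hypothetical "user,product,rating" lines
      .map(_.split(',') match { case Array(u, p, r) => Rating(u.toInt, p.toInt, r.toDouble) })

    // 10-20 iterations is typically enough for ALS to converge.
    val model = ALS.train(ratings, /* rank = */ 10, /* iterations = */ 15, /* lambda = */ 0.01)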

what is a partition? how it works?

2014-04-16 Thread Joe L
I want to know the following: what is a partition? How does it work? How is it different from a Hadoop partition? For example: sc.parallelize([1,2,3,4]).map(lambda x: (x,x)).partitionBy(2).glom().collect() returns [[(2,2), (4,4)], [(1,1), (3,3)]]. From this we get 2 partitions, but what does it mean? how
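For reference, a Scala sketch equivalent to the PySpark example above; glom() turns each partition into an array so the partition boundaries become visible:

    import org.apache.spark.HashPartitioner

    val parts = sc.parallelize(Seq(1, 2, 3, 4))
      .map(x => (x, x))
      .partitionBy(new HashPartitioner(2))  // key.hashCode % 2 decides the partition
      .glom()
      .collect()
    // Array(Array((2,2), (4,4)), Array((1,1), (3,3))) -- even keys in one partition, odd keys in the other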

Re: Proper caching method

2014-04-16 Thread Arpit Tak
Thanks Cheng, that was helpful. On Wed, Apr 16, 2014 at 1:29 PM, Cheng Lian lian.cs@gmail.com wrote: You can remove the cached rdd1 from the cache manager by calling rdd1.unpersist(). But there are some subtleties: RDD.cache() is *lazy* while RDD.unpersist() is *eager*. When .cache() is
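A small sketch of the lazy/eager distinction Cheng describes; the ordering of the count() and unpersist() calls is what matters, and the path and filter are placeholders:

    val rdd1 = sc.textFile("hdfs:///path/to/input").cache()  // lazy: nothing is cached yet
    val rdd2 = rdd1.filter(_.nonEmpty).cache()

    rdd2.count()       // first action: rdd1 and rdd2 are actually materialized and cached here
    rdd1.unpersist()   // eager: rdd1's cached blocks are dropped immediately

    // Calling rdd1.unpersist() before the first action would simply cancel the pending
    // cache() -- rdd1 would never be cached at all.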

Java heap space and spark.akka.frameSize

2014-04-16 Thread Chieh-Yen
Dear all, I developed an application whose communication message size is sometimes greater than 10 MB. For smaller datasets it works fine, but it fails for larger datasets. Please check the error message below. I surveyed the situation online and lots of people said the problem can be
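A sketch of one common workaround discussed for this class of error, assuming Spark 0.9/1.0-era configuration: raising spark.akka.frameSize (specified in MB, default 10) so that larger messages fit in a single Akka frame.

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("large-messages")
      .set("spark.akka.frameSize", "64")   // MB; default was 10 in this era
    val sc = new SparkContext(conf)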

Re: Spark program thows OutOfMemoryError

2014-04-16 Thread Andre Bois-Crettez
It seems you do not have enough memory on the Spark driver. Hints below: On 2014-04-15 12:10, Qin Wei wrote: val resourcesRDD = jsonRDD.map(arg => arg.get(rid).toString.toLong).distinct // the program crashes at this line of code val bcResources =
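A sketch of the usual alternatives when collecting and broadcasting a large distinct set overwhelms the driver; the data here is a hypothetical stand-in for the RDDs in the snippet above:

    // Hypothetical stand-in for resourcesRDD.
    val resourcesRDD = sc.parallelize(Seq(1L, 2L, 3L)).distinct()

    // Option 1: give the driver more heap before collecting/broadcasting,
    //           e.g. spark-submit --driver-memory 4g (Spark 1.0+).
    // Option 2: avoid pulling the distinct IDs onto the driver at all:
    //           keep them as a keyed RDD and join instead of collect() + broadcast.
    val resourceKeys = resourcesRDD.map(id => (id, ()))
    val records = sc.parallelize(Seq((1L, "a"), (4L, "b")))    // hypothetical data keyed by resource id
    val filtered = records.join(resourceKeys).mapValues(_._1)  // keeps only records with a known resource id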

PySpark still reading only text?

2014-04-16 Thread Bertrand Dechoux
Hi, I have browsed the online documentation and it states that PySpark only reads text files as sources. Is that still the case? From what I understand, the RDD can, after this first step, be any serialized Python structure if the class definitions are well distributed. Is it not possible to read

using saveAsNewAPIHadoopFile with OrcOutputFormat

2014-04-16 Thread Brock Bose
Howdy all, I recently saw that the OrcInputFormat/OrcOutputFormat have been exposed for use outside of Hive (https://issues.apache.org/jira/browse/HIVE-5728). Does anyone know how one could use this with saveAsNewAPIHadoopFile to write records in ORC format? In particular, I would
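For the general mechanics, a sketch of saveAsNewAPIHadoopFile with a standard output format; the ORC case would follow the same shape with the ORC output format and writable/serde classes from hive-exec substituted in (not shown here, since their exact signatures depend on the Hive version):

    import org.apache.hadoop.io.{NullWritable, Text}
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat

    val records = sc.parallelize(Seq("a", "b", "c"))
    records
      .map(r => (NullWritable.get(), new Text(r)))   // (key, value) pairs the OutputFormat expects
      .saveAsNewAPIHadoopFile(
        "hdfs:///path/to/output",                    // hypothetical path
        classOf[NullWritable],
        classOf[Text],
        classOf[TextOutputFormat[NullWritable, Text]])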

SPARK_YARN_APP_JAR, SPARK_CLASSPATH and ADD_JARS in a spark-shell on YARN

2014-04-16 Thread Christophe Préaud
Hi, I am running Spark 0.9.1 on a YARN cluster, and I am wondering what the correct way is to add external jars when running a Spark shell on a YARN cluster. Packaging all these dependencies in an assembly whose path is then set in SPARK_YARN_APP_JAR (as written in the doc:

Re: Error reading HDFS file using spark 0.9.0 / hadoop 2.2.0 - incompatible protobuf 2.5 and 2.4.1

2014-04-16 Thread Arpit Tak
I am stuck on the same issue too, but on Shark (0.9 with Spark 0.9) on Hadoop 2.2.0. On the other Hadoop versions it works perfectly. Regards, Arpit Tak On Wed, Apr 16, 2014 at 11:18 PM, Aureliano Buendia buendia...@gmail.com wrote: Is this resolved in Spark 0.9.1? On Tue, Apr 15, 2014 at 6:55

Re: Spark packaging

2014-04-16 Thread Arpit Tak
Also try these: http://docs.sigmoidanalytics.com/index.php/How_to_Install_Spark_on_Ubuntu-12.04 http://docs.sigmoidanalytics.com/index.php/How_to_Install_Spark_on_HortonWorks_VM Regards, Arpit On Thu, Apr 10, 2014 at 3:04 AM, Pradeep baji pradeep.chanum...@gmail.com wrote: Thanks Prabeesh.

Re: sbt assembly error

2014-04-16 Thread Arpit Tak
It's because the slf4j directory does not exist there; maybe they are updating it. https://oss.sonatype.org/content/repositories/snapshots/org/ Hard luck, try again after some time... Regards, Arpit On Thu, Apr 17, 2014 at 12:33 AM, Yiou Li liy...@gmail.com wrote: Hi all, I am trying to

Re: How to cogroup/join pair RDDs with different key types?

2014-04-16 Thread Andrew Ash
Glad to hear you're making progress! Do you have a working version of the join? Is there anything else you need help with? On Wed, Apr 16, 2014 at 7:11 PM, Roger Hoover roger.hoo...@gmail.com wrote: Ah, in case this helps others, looks like RDD.zipPartitions will accomplish step 4. On
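For readers landing here, a minimal sketch of RDD.zipPartitions, the method Roger mentions; the two RDDs must have the same number of partitions, and the element types are placeholders:

    // Both RDDs are created with the same number of partitions (2), which zipPartitions requires.
    val left  = sc.parallelize(Seq(1, 2, 3, 4), 2)
    val right = sc.parallelize(Seq("a", "b", "c", "d"), 2)

    // The function receives one iterator per RDD for each corresponding pair of partitions.
    val zipped = left.zipPartitions(right) { (it1, it2) =>
      it1.zip(it2)                       // Iterator[(Int, String)]
    }
    // zipped.collect() => Array((1,a), (2,b), (3,c), (4,d))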

Re: How to cogroup/join pair RDDs with different key types?

2014-04-16 Thread Roger Hoover
Thanks for following up. I hope to get some free time this afternoon to get it working. Will let you know. On Wed, Apr 16, 2014 at 12:43 PM, Andrew Ash and...@andrewash.com wrote: Glad to hear you're making progress! Do you have a working version of the join? Is there anything else you

Re: PySpark still reading only text?

2014-04-16 Thread Matei Zaharia
Hi Bertrand, We should probably add a SparkContext.pickleFile and RDD.saveAsPickleFile that will allow saving pickled objects. Unfortunately this is not in yet, but there is an issue up to track it: https://issues.apache.org/jira/browse/SPARK-1161. In 1.0, one feature we do have now is the

Regarding Partitioner

2014-04-16 Thread yh18190
Hi, I have a large dataset of elements [RDD] and I want to divide it into two exactly equal-sized partitions while maintaining the order of elements. I tried using RangePartitioner like var data = partitionedFile.partitionBy(new RangePartitioner(2, partitionedFile)). This doesn't give satisfactory results
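One way to get an order-preserving split into two equal halves, as a sketch: index every element, count, and filter around the midpoint. This assumes RDD.zipWithIndex (available from Spark 1.0); on older versions the same index can be built with mapPartitionsWithIndex and per-partition counts.

    val data = sc.parallelize(1 to 10)          // stand-in for partitionedFile's elements

    val indexed = data.zipWithIndex()           // (element, globalIndex), preserving order
    val mid = indexed.count() / 2

    val firstHalf  = indexed.filter { case (_, i) => i < mid  }.map(_._1)
    val secondHalf = indexed.filter { case (_, i) => i >= mid }.map(_._1)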

Re: GC overhead limit exceeded

2014-04-16 Thread Nicholas Chammas
Never mind. I'll take it from both Andrew and Syed's comments that the answer is yes. Dunno why I thought otherwise. On Wed, Apr 16, 2014 at 5:43 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: I’m running into a similar issue as the OP. I’m running the same job over and over (with

Re: sbt assembly error

2014-04-16 Thread Yiou Li
Hi Sean, It's true that sbt is trying different links, but ALL of them have connection issues (actually a 404 File Not Found error) and the build process takes forever connecting to the different links. I don't think it's a proxy issue because my other projects (other than Spark) build well

Re: choose the number of partition according to the number of nodes

2014-04-16 Thread Nicholas Chammas
From the Spark tuning guide (http://spark.apache.org/docs/latest/tuning.html): In general, we recommend 2-3 tasks per CPU core in your cluster. I think you can only get one task per partition running concurrently for a given RDD. So if your RDD has 10 partitions, then at most 10 tasks can operate
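A sketch of applying that guideline; the core count and input path are placeholders for whatever the cluster actually has:

    val totalCores = 16                                  // hypothetical: executors x cores per executor
    val rdd = sc.textFile("hdfs:///path/to/input")

    // Aim for 2-3 tasks per core so every core stays busy even when task sizes are uneven.
    val tuned = rdd.repartition(totalCores * 3)
    println(tuned.partitions.length)                     // => 48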

Re: PySpark still reading only text?

2014-04-16 Thread Jesvin Jose
When this is implemented, can you load/save an RDD of pickled objects to HDFS? On Thu, Apr 17, 2014 at 1:51 AM, Matei Zaharia matei.zaha...@gmail.com wrote: Hi Bertrand, We should probably add a SparkContext.pickleFile and RDD.saveAsPickleFile that will allow saving pickled objects.