Re: Lost an executor error - Jobs fail

2014-04-15 Thread Aaron Davidson
Hmm, interesting. I created https://issues.apache.org/jira/browse/SPARK-1499 to track the issue of Workers continuously spewing bad executors, but the real issue seems to be a combination of that and some other bug in Shark or Spark which fails to handle the situation properly. Please let us know

Re: storage.MemoryStore estimated size 7 times larger than real

2014-04-15 Thread Aaron Davidson
Hey, I was talking about something more like:

  val size = 1024 * 1024
  val numSlices = 8
  val arr = Array.fill[Array[Int]](numSlices) { new Array[Int](size / numSlices) }
  val rdd = sc.parallelize(arr, numSlices).cache()
  val size2 = rdd.map(_.length).sum()
  assert( size2 ==

groupByKey returns a single partition in an RDD?

2014-04-15 Thread Joe L
I want to apply the following transformations to 60 GB of data on 7 nodes with 10 GB of memory each, and I am wondering whether the groupByKey() function returns an RDD with a single partition for each key. If so, what will happen if the size of that partition doesn't fit into that particular node? rdd =
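
A quick way to check the partitioning behavior yourself (a minimal Scala sketch, not from the original post; it assumes a running SparkContext sc, e.g. in spark-shell): groupByKey does not produce one partition per key, it hash-partitions the keys across however many partitions you ask for, although all values of a given key do land in the same partition.

  val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)))
  val grouped = pairs.groupByKey(4)        // hash-partition keys into 4 partitions
  println(grouped.partitions.length)       // 4 -- not one partition per key
  // all values for one key still end up together, so a single very hot key
  // can still exceed the memory of the node holding its partition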

Re: Comparing GraphX and GraphLab

2014-04-15 Thread Qi Song
Hi Debasish, I found that PageRank on LiveJournal cost less than 100 seconds for GraphX on your EC2 setup. But when I run the example (LiveJournalPageRank) you provided on my machines with the same LiveJournal dataset, it took more than 10 minutes. Following are some details: Environment: 8 machines with each

Re: How to stop system info output in spark shell

2014-04-15 Thread Wei Da
The solution: edit /opt/spark-0.9.0-incubating-bin-hadoop2/conf/log4j.properties, changing Spark's logging output to WARN. Done! Refer to: https://github.com/amplab-extras/SparkR-pkg/blob/master/pkg/src/src/main/resources/log4j.properties#L8 eduardocalfaia wrote: Have you already tried in
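
For reference, the change described above amounts to one line in conf/log4j.properties (a sketch based on the template linked above; the appender name assumes Spark's stock console appender):

  # conf/log4j.properties -- quiet Spark's console logging from INFO down to WARN
  log4j.rootCategory=WARN, console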

Spark program throws OutOfMemoryError

2014-04-15 Thread Qin Wei
Hi all, my Spark program always gives me the error java.lang.OutOfMemoryError: Java heap space in my standalone cluster. Here is my code: object SimCalcuTotal { def main(args: Array[String]) { val sc = new SparkContext("spark://192.168.2.184:7077", "Sim Calcu Total",

standalone vs YARN

2014-04-15 Thread ishaaq
Hi all, I am evaluating Spark for use here at my work. We have an existing Hadoop 1.x install which I am planning to upgrade to Hadoop 2.3. I am trying to work out whether I should install YARN or simply set up a Spark standalone cluster. We already use ZooKeeper so it isn't a problem to set up

Re: standalone vs YARN

2014-04-15 Thread Surendranauth Hiraman
Prashant, in another email thread several weeks ago, it was mentioned that YARN support is considered beta until Spark 1.0. Is that not the case? -Suren On Tue, Apr 15, 2014 at 8:38 AM, Prashant Sharma scrapco...@gmail.com wrote: Hi Ishaaq, answers inline from what I know; I'd like to be

Re: Scala vs Python performance differences

2014-04-15 Thread Ian Ferreira
This would be super useful. Thanks. On 4/15/14, 1:30 AM, Jeremy Freeman freeman.jer...@gmail.com wrote: Hi Andrew, I'm putting together some benchmarks for PySpark vs Scala. I'm focusing on ML algorithms, as I'm particularly curious about the relative performance of MLlib in Scala vs the Python

partitioning of small data sets

2014-04-15 Thread Diana Carroll
I loaded a very tiny file into Spark -- 23 lines of text, 2.6 KB. Given the size, and that it is a single file, I assumed it would only be in a single partition. But when I cache it, I can see in the Spark App UI that it actually splits it into two partitions: [inline screenshot of the Spark App UI omitted] Is this

Re: Spark resilience

2014-04-15 Thread Manoj Samel
Thanks Aaron, this is useful ! - Manoj On Mon, Apr 14, 2014 at 8:12 PM, Aaron Davidson ilike...@gmail.com wrote: Launching drivers inside the cluster was a feature added in 0.9, for standalone cluster mode:

Re: storage.MemoryStore estimated size 7 times larger than real

2014-04-15 Thread Aaron Davidson
Ah, I think I can see where your issue may be coming from. In spark-shell, the MASTER is local[*], which just means it uses a pre-set number of cores. This distinction only matters because the default number of slices created from sc.parallelize() is based on the number of cores. So when you run
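
A small illustration of why that distinction matters (a sketch against the 0.9-era Scala API, not from the original reply; it assumes a SparkContext sc, e.g. from spark-shell):

  val data = 1 to (1024 * 1024)
  val byDefault = sc.parallelize(data)      // slice count defaults to sc.defaultParallelism,
                                            // which follows the number of cores
  val explicit  = sc.parallelize(data, 8)   // slice count fixed at 8, regardless of cores
  println(byDefault.partitions.length + " vs " + explicit.partitions.length)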

Re: partitioning of small data sets

2014-04-15 Thread Aaron Davidson
Take a look at the minSplits argument for SparkContext#textFile [1] -- the default value is 2. You can simply set this to 1 if you'd prefer not to split your data. [1] http://spark.apache.org/docs/latest/api/core/index.html#org.apache.spark.SparkContext On Tue, Apr 15, 2014 at 8:44 AM, Diana
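
For example (a sketch; the file path is a hypothetical stand-in, and in the 0.9-era API the second positional argument is minSplits):

  val oneSplit = sc.textFile("data/tiny.txt", 1)   // minSplits = 1: keep the small file whole
  println(oneSplit.partitions.length)              // expect 1 for a single small file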

Why these operations are slower than the equivalent on Hadoop?

2014-04-15 Thread Yanzhe Chen
Hi all, as in a previous thread, I am asking how to implement a divide-and-conquer algorithm (skyline) in Spark. Here is my current solution: val data = sc.textFile(…).map(line => line.split(",").map(_.toDouble)) val result = data.mapPartitions(points =>

Streaming job having Cassandra query : OutOfMemoryError

2014-04-15 Thread sonyjv
Hi all, I am desperately looking for some help. My cluster is 6 nodes, each having a dual-core CPU and 8 GB of RAM. The Spark version running on the cluster is spark-0.9.0-incubating-bin-cdh4. I am getting an OutOfMemoryError when running a Spark Streaming job (the non-streaming version works fine) which queries

Re: partitioning of small data sets

2014-04-15 Thread Matei Zaharia
Yup, one reason it’s 2 actually is to give people a similar experience to working with large files, in case their code doesn’t deal well with the file being partitioned. Matei On Apr 15, 2014, at 9:53 AM, Aaron Davidson ilike...@gmail.com wrote: Take a look at the minSplits argument for

scheduler question

2014-04-15 Thread Mohit Jaggi
Hi folks, I have some questions about how the Spark scheduler works:
- How does Spark know how many resources a job might need?
- How does it fairly share resources between multiple jobs?
- Does it know about data and partition sizes and use that information for scheduling?
Mohit.

Re: Why these operations are slower than the equivalent on Hadoop?

2014-04-15 Thread Cheng Lian
Your Spark solution first reduces partial results into a single partition, computes the final result, and then collects it to the driver side. This involves a shuffle and two waves of network traffic. Instead, you can directly collect partial results to the driver and then compute the final result
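
In code, that suggestion looks roughly like this (a sketch only, not Cheng's exact proposal; it assumes the skyline(points) helper from the original post, taking and returning a collection of points):

  // compute a small per-partition skyline, ship only those partial results to the driver,
  // and finish the merge locally instead of shuffling everything into one partition
  val partials = data.mapPartitions(points => Iterator(skyline(points.toArray))).collect()
  val result   = skyline(partials.flatten)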

Re: How to cogroup/join pair RDDs with different key types?

2014-04-15 Thread Roger Hoover
Andrew, thank you very much for your feedback. Unfortunately, the ranges are not of predictable size, but you gave me an idea of how to handle it. Here's what I'm thinking:
1. Choose a number of partitions, n, over the IP space
2. Preprocess the IPRanges, splitting any of them that cross partition

Re: Error reading HDFS file using spark 0.9.0 / hadoop 2.2.0 - incompatible protobuf 2.5 and 2.4.1

2014-04-15 Thread anant
I've received the same error with Spark built using Maven. It turns out that mesos-0.13.0 depends on protobuf-2.4.1, which is causing the clash at runtime. The protobuf included by Akka is shaded and doesn't cause any problems. The solution is to update the mesos dependency to 0.18.0 in Spark's

Re: How to cogroup/join pair RDDs with different key types?

2014-04-15 Thread Roger Hoover
I'm thinking of creating a union type for the key so that IPRange and IP types can be joined. On Tue, Apr 15, 2014 at 10:44 AM, Roger Hoover roger.hoo...@gmail.com wrote: Andrew, Thank you very much for your feedback. Unfortunately, the ranges are not of predictable size but you gave me an
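
One way to express that union key in Scala (a sketch only; the IP and range fields are hypothetical stand-ins, and since a join still needs exact key equality, the ranges would still have to be split or expanded into joinable keys as in the earlier preprocessing idea):

  sealed trait JoinKey extends Serializable
  case class IpKey(ip: Long) extends JoinKey
  case class RangeKey(start: Long, end: Long) extends JoinKey

  // both RDDs can then be keyed by the common JoinKey type before the cogroup/join, e.g.
  // events.map(e => (IpKey(e.ip): JoinKey, e)) and ranges.map(r => (RangeKey(r.start, r.end): JoinKey, r))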

Re: can't sc.parallelize in Spark 0.7.3 spark-shell

2014-04-15 Thread Walrus theCat
Actually altering the classpath in the REPL causes the provided SparkContext to disappear:

  scala> sc.parallelize(List(1,2,3))
  res0: spark.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:13
  scala> :cp /root
  Added '/root'. Your new classpath is:

Can't run a simple spark application with 0.9.1

2014-04-15 Thread Paul Schooss
Hello, I currently deployed Spark 0.9.1 using a new way of starting it up: exec start-stop-daemon --start --pidfile /var/run/spark.pid --make-pidfile --chuid ${SPARK_USER}:${SPARK_GROUP} --chdir ${SPARK_HOME} --exec /usr/bin/java -- -cp ${CLASSPATH}

Re: Why these operations are slower than the equivalent on Hadoop?

2014-04-15 Thread Eugen Cepoi
It depends on your algorithm, but I guess that you probably should use reduce (the code probably doesn't compile, but it shows you the idea): val result = data.reduce { case (left, right) => skyline(left ++ right) } Or in the case you want to merge the result of a partition with another one you

Re: can't sc.parallelize in Spark 0.7.3 spark-shell

2014-04-15 Thread Aaron Davidson
This is probably related to the Scala bug that :cp does not work: https://issues.scala-lang.org/browse/SI-6502 On Tue, Apr 15, 2014 at 11:21 AM, Walrus theCat walrusthe...@gmail.com wrote: Actually altering the classpath in the REPL causes the provided SparkContext to disappear: scala

Re: can't sc.parallelize in Spark 0.7.3 spark-shell

2014-04-15 Thread Walrus theCat
Thank you very much! On Tue, Apr 15, 2014 at 11:29 AM, Aaron Davidson ilike...@gmail.com wrote: This is probably related to the Scala bug that :cp does not work: https://issues.scala-lang.org/browse/SI-6502 On Tue, Apr 15, 2014 at 11:21 AM, Walrus theCat walrusthe...@gmail.com wrote: Actually

Shark: class java.io.IOException: Cannot run program /bin/java

2014-04-15 Thread ge ko
Hi, after starting the shark shell via /opt/shark/shark-0.9.1/bin/shark-withinfo -skipRddReload I receive lots of output, including the exception that /bin/java cannot be executed. But it is linked to /usr/bin/java?!

  root# ls -al /bin/java
  lrwxrwxrwx 1 root root 13 15. Apr 21:45 /bin/java

Problem with KryoSerializer

2014-04-15 Thread yh18190
Hi, I have a problem when I want to use the Spark KryoSerializer: I extend KryoRegistrator to register custom classes in order to create objects. I am getting the following exception when I run the program below. Please let me know what the problem could be... ] (run-main)
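
For comparison, the registrator pattern from the 0.9-era tuning guide looks roughly like this (a sketch; MyClass, its fields, and the package name are placeholders, not from the original post):

  import com.esotericsoftware.kryo.Kryo
  import org.apache.spark.serializer.KryoRegistrator

  class MyClass(val id: Int, val name: String)   // hypothetical custom class

  class MyRegistrator extends KryoRegistrator {
    override def registerClasses(kryo: Kryo) {
      kryo.register(classOf[MyClass])            // register every custom class Kryo will serialize
    }
  }

  // before creating the SparkContext:
  // System.setProperty("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // System.setProperty("spark.kryo.registrator", "mypackage.MyRegistrator")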

StackOverflow Error when run ALS with 100 iterations

2014-04-15 Thread Xiaoli Li
Hi, I am testing ALS using 7 nodes. Each node has 4 cores and 8 GB of memory. The ALS program cannot run, even with a very small training set (about 91 lines), due to a StackOverflowError when I set the number of iterations to 100. I think the problem may be caused by the updateFeatures method, which

Re: partitioning of small data sets

2014-04-15 Thread Nicholas Chammas
Looking at the Python version of textFile() (http://spark.apache.org/docs/latest/api/pyspark/pyspark.context-pysrc.html#SparkContext.textFile), shouldn't it be *max*(self.defaultParallelism, 2)? If the default parallelism is, say, 4, wouldn't we want to use that for minSplits instead of 2? On Tue,
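
For what it's worth, the Scala side makes the same choice: as of the 0.9/1.0 code, SparkContext defines the default along these lines (paraphrased, so treat it as approximate), i.e. a cap on the minimum number of splits rather than a use of the full default parallelism:

  def defaultMinSplits: Int = math.min(defaultParallelism, 2)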

Multi-tenant?

2014-04-15 Thread Ian Ferreira
What is the support for multi-tenancy in Spark? I assume more than one driver can share the same cluster, but can a driver run two jobs in parallel?

java.net.SocketException: Network is unreachable while connecting to HBase

2014-04-15 Thread amit karmakar
I am getting a java.net.SocketException: Network is unreachable whenever I do a count on one of my tables. If I just do a take(1), I see the task status as killed on the master UI, but I get back the results. My driver runs on my local system, which is accessible over the public internet, and

Re: Can't run a simple spark application with 0.9.1

2014-04-15 Thread Paul Schooss
I am a dork; please disregard this issue. I did not have the slaves correctly configured. This error is very misleading. On Tue, Apr 15, 2014 at 11:21 AM, Paul Schooss paulmscho...@gmail.com wrote: Hello, Currently I deployed 0.9.1 spark using a new way of starting up spark exec

Re: Multi-tenant?

2014-04-15 Thread Matei Zaharia
Yes, both things can happen. Take a look at http://spark.apache.org/docs/latest/job-scheduling.html, which includes scheduling concurrent jobs within the same driver. Matei On Apr 15, 2014, at 4:08 PM, Ian Ferreira ianferre...@hotmail.com wrote: What is the support for multi-tenancy in
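
A rough sketch of the within-driver case that page describes, for a standalone 0.9-era app (the master URL, app name, and input paths are hypothetical; each thread submits its own job into its own fair-scheduler pool):

  import org.apache.spark.SparkContext

  System.setProperty("spark.scheduler.mode", "FAIR")        // enable fair scheduling between jobs
  val sc = new SparkContext("spark://master:7077", "MultiJobDemo")

  new Thread() { override def run() {
    sc.setLocalProperty("spark.scheduler.pool", "pool1")
    println(sc.textFile("hdfs:///data/a").count())          // job 1
  } }.start()

  new Thread() { override def run() {
    sc.setLocalProperty("spark.scheduler.pool", "pool2")
    println(sc.textFile("hdfs:///data/b").count())          // job 2, running concurrently
  } }.start()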

RE: Multi-tenant?

2014-04-15 Thread Ian Ferreira
Thanks Matei! Sent from my Windows Phone From: Matei Zaharia matei.zaha...@gmail.com Sent: 4/15/2014 7:14 PM To: user@spark.apache.org Subject: Re: Multi-tenant? Yes, both things can happen. Take a look at

Re: Error reading HDFS file using spark 0.9.0 / hadoop 2.2.0 - incompatible protobuf 2.5 and 2.4.1

2014-04-15 Thread giive chen
Hi Prasad, sorry for missing your reply. https://gist.github.com/thegiive/10791823 Here it is. Wisely Chen On Fri, Apr 4, 2014 at 11:57 PM, Prasad ramachandran.pra...@gmail.com wrote: Hi Wisely, Could you please post your pom.xml here. Thanks -- View this message in context:

Re: storage.MemoryStore estimated size 7 times larger than real

2014-04-15 Thread wxhsdp
Thank you so much, Davidson. Yes, you are right: in both sbt and the spark shell the result of my code is 28MB; it's irrelevant to numSlices. Yesterday I got a result of 4.2MB in the spark shell because I had removed the array initialization out of laziness :) for (i <- 0 until size) { array(i) = i } --

Re: StackOverflow Error when run ALS with 100 iterations

2014-04-15 Thread Cheng Lian
Probably this JIRA issue (https://spark-project.atlassian.net/browse/SPARK-1006) solves your problem. When running with a large iteration number, the lineage DAG of ALS becomes very deep, and both the DAGScheduler and the Java serializer may overflow because they are implemented in a recursive way. You may resort
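
The reply is truncated here, but one commonly suggested workaround for such a deep iterative lineage is periodic checkpointing, which cuts the DAG (a sketch only, not necessarily what the reply goes on to recommend; it assumes a SparkContext sc, and the checkpoint directory and update step are placeholders):

  sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")       // hypothetical HDFS directory
  var features = sc.parallelize(1 to 1000).map(i => (i, math.random))
  for (iter <- 1 to 100) {
    features = features.map { case (k, v) => (k, v * 0.9) }  // stand-in for the real update step
    if (iter % 10 == 0) {
      features.checkpoint()                                   // truncate the lineage every 10 iterations
      features.count()                                        // force the checkpoint to materialize
    }
  }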

JMX with Spark

2014-04-15 Thread Paul Schooss
Has anyone got this working? I have enabled the properties for it in the metrics.conf file and ensured that it is placed under Spark's home directory. Any ideas why I don't see Spark beans?

Re: JMX with Spark

2014-04-15 Thread Parviz Deyhim
The home directory, or the $home/conf directory? It works for me with metrics.properties hosted under the conf dir. On Tue, Apr 15, 2014 at 6:08 PM, Paul Schooss paulmscho...@gmail.com wrote: Has anyone got this working? I have enabled the properties for it in the metrics.conf file and ensure that it is
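
For reference, a minimal conf/metrics.properties that turns on the JMX sink looks roughly like this (a sketch; the sink name is arbitrary, but the class is the JMX sink shipped with Spark 0.9):

  # conf/metrics.properties -- expose Spark metrics as JMX MBeans on every instance
  *.sink.jmx.class=org.apache.spark.metrics.sink.JmxSink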

Re: java.net.SocketException: Network is unreachable while connecting to HBase

2014-04-15 Thread amit
In the worker logs I can see: 14/04/16 01:02:47 ERROR EndpointWriter: AssociationError [akka.tcp://sparkWorker@xx:10548] -> [akka.tcp://sparkExecutor@xx:16041]: Error [Association failed with [akka.tcp://sparkExecutor@xx:16041]] [ akka.remote.EndpointAssociationException: Association

Re: StackOverflow Error when run ALS with 100 iterations

2014-04-15 Thread Xiaoli Li
Thanks a lot for your information. It really helps me. On Tue, Apr 15, 2014 at 7:57 PM, Cheng Lian lian.cs@gmail.com wrote: Probably this JIRA issue (https://spark-project.atlassian.net/browse/SPARK-1006) solves your problem. When running with large iteration number, the lineage DAG of

Re: Why these operations are slower than the equivalent on Hadoop?

2014-04-15 Thread Yanzhe Chen
Eugen, thanks for your tip. I do want to merge the result of a partition with another one, but I am still not quite clear how to do it. Say the original data RDD has 32 partitions; since mapPartitions won't change the number of partitions, it will remain 32 partitions, each of which contains
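
Putting the two earlier suggestions together, the merge being asked about could look roughly like this (a sketch only; it assumes the same hypothetical skyline helper, taking and returning a collection of points that supports ++):

  val result = data
    .mapPartitions(points => Iterator(skyline(points.toArray)))   // one partial skyline per partition (32 here)
    .reduce((left, right) => skyline(left ++ right))               // pairwise merge of the partials down to one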

what is the difference between element and partition?

2014-04-15 Thread Joe L
-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/what-is-the-difference-between-element-and-partition-tp4317.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

groupByKey(None) returns partitions according to the keys?

2014-04-15 Thread Joe L
I was wondering if groupByKey returns 2 partitions in the example below: x = sc.parallelize([("a", 1), ("b", 1), ("a", 1)]) sorted(x.groupByKey().collect()) [('a', [1, 1]), ('b', [1])] -- View this message in context: