Re: Error when compiling spark in IDEA and best practice to use IDE?

2014-04-09 Thread Xiangrui Meng
After sbt/sbt gen-idea, do not import as an SBT project; instead choose Open Project and point it to the spark folder. -Xiangrui On Tue, Apr 8, 2014 at 10:45 PM, Sean Owen so...@cloudera.com wrote: I let IntelliJ read the Maven build directly and that works fine. -- Sean Owen | Director, Data
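
A minimal sketch of the suggested workflow (paths assumed, run from the spark checkout):

    # generate IntelliJ IDEA project files from the sbt build
    sbt/sbt gen-idea
    # then in IDEA: File > Open... and select the spark folder
    # (do not use File > Import Project with the SBT importer)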

Re: Preconditions on RDDs for creating a Graph?

2014-04-09 Thread Ankur Dave
On Tue, Apr 8, 2014 at 2:33 PM, Adam Novak ano...@soe.ucsc.edu wrote: What, exactly, needs to be true about the RDDs that you pass to Graph() to be sure of constructing a valid graph? (Do they need to have the same number of partitions? The same number of partitions and no empty partitions?
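
For context, a minimal sketch of how a Graph is built from two RDDs (the values and structure here are illustrative, not an answer to the preconditions question):

    import org.apache.spark.SparkContext
    import org.apache.spark.graphx._
    import org.apache.spark.rdd.RDD

    val sc = new SparkContext("local", "graph-example")
    // vertices are (VertexId, attribute) pairs; edges carry (src, dst, attribute)
    val vertices: RDD[(VertexId, String)] =
      sc.parallelize(Seq((1L, "a"), (2L, "b"), (3L, "c")))
    val edges: RDD[Edge[Int]] =
      sc.parallelize(Seq(Edge(1L, 2L, 7), Edge(2L, 3L, 4)))
    val graph = Graph(vertices, edges)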

KafkaReceiver Error when starting ssc (Actor name not unique)

2014-04-09 Thread gaganbm
Hi All, I am getting this exception when doing ssc.start to start the streaming context. ERROR KafkaReceiver - Error receiving data akka.actor.InvalidActorNameException: actor name [NetworkReceiver-0] is not unique! at

Spark on YARN performance

2014-04-09 Thread Flavio Pompermaier
Hi everybody, I'm new to Spark and I'd like to know whether running Spark on top of YARN or Mesos could affect its performance (and by how much). Is there any doc about this? Best, Flavio

Re: trouble with join on large RDDs

2014-04-09 Thread Andrew Ash
A JVM can easily be limited in how much memory it uses with the -Xmx parameter, but Python doesn't have memory limits built in such a first-class way. Maybe the memory limits aren't making it to the Python executors. What was your SPARK_MEM setting? The JVM below seems to be using 603201
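
For reference, a sketch of the 0.9-era knobs being discussed (the 10g value is an assumption, not the poster's setting):

    # caps the heap of the launched JVMs via -Xmx
    export SPARK_MEM=10g
    # the preferred per-application equivalent, set before creating the SparkContext:
    #   System.setProperty("spark.executor.memory", "10g")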

Re: Spark Disk Usage

2014-04-09 Thread Andrew Ash
For 1, persist can be used to save an RDD to disk using the various persistence levels. When a persistence level is set on an RDD, when that RDD is evaluated it's saved to memory/disk/elsewhere so that it can be re-used. It's applied to that RDD, so that subsequent uses of the RDD can use the
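
A minimal sketch of the behavior being described (path and transformations assumed):

    import org.apache.spark.storage.StorageLevel

    val words = sc.textFile("hdfs:///data/input.txt").flatMap(_.split(" "))
    words.persist(StorageLevel.MEMORY_AND_DISK) // mark for reuse; nothing runs yet
    val total = words.count()                   // first action evaluates and stores the RDD
    val unique = words.distinct().count()       // reuses the persisted data instead of re-reading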

Re: Spark Disk Usage

2014-04-09 Thread Andrew Ash
Which persistence level are you talking about? MEMORY_AND_DISK ? Sent from my mobile phone On Apr 9, 2014 2:28 PM, Surendranauth Hiraman suren.hira...@velos.io wrote: Thanks, Andrew. That helps. For 1, it sounds like the data for the RDD is held in memory and then only written to disk after

Re: AWS Spark-ec2 script with different user

2014-04-09 Thread Nicholas Chammas
Marco, If you call spark-ec2 launch without specifying an AMI, it will default to the Spark-provided AMI. Nick On Wed, Apr 9, 2014 at 9:43 AM, Marco Costantini silvio.costant...@granatads.com wrote: Hi there, To answer your question: no, there is no reason NOT to use an AMI that Spark has

Re: AWS Spark-ec2 script with different user

2014-04-09 Thread Nicholas Chammas
And for the record, that AMI is ami-35b1885c. Again, you don't need to specify it explicitly; spark-ec2 will default to it. On Wed, Apr 9, 2014 at 11:08 AM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Marco, If you call spark-ec2 launch without specifying an AMI, it will default to
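
For illustration, an assumed invocation that relies on the default AMI (key pair, size, and cluster name are placeholders):

    ./spark-ec2 -k my-keypair -i ~/my-keypair.pem -s 2 launch my-spark-cluster
    # no --ami flag given, so spark-ec2 falls back to the Spark-provided AMI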

executors not registering with the driver

2014-04-09 Thread azurecoder
Up until last week we had no problems running a Spark standalone cluster. We now have a problem registering executors with the driver node in any application. Although we can run start-all and see the worker on 8080, no executors are registered with the block manager. The feedback we have is scant but

Re: AWS Spark-ec2 script with different user

2014-04-09 Thread Marco Costantini
Ah, tried that. I believe this is an HVM AMI? We are exploring paravirtual AMIs. On Wed, Apr 9, 2014 at 11:17 AM, Nicholas Chammas nicholas.cham...@gmail.com wrote: And for the record, that AMI is ami-35b1885c. Again, you don't need to specify it explicitly; spark-ec2 will default to it.

Re: AWS Spark-ec2 script with different user

2014-04-09 Thread Shivaram Venkataraman
The AMI should automatically switch between PVM and HVM based on the instance type you specify on the command line. For reference (note you don't need to specify this on the command line), the PVM ami id is ami-5bb18832 in us-east-1. FWIW we maintain the list of AMI Ids (across regions and pvm,

Re: How does Spark handle RDD via HDFS ?

2014-04-09 Thread Andrew Ash
The typical way to handle that use case would be to join the 3 files together into one RDD and then do the factorization on that. There will definitely be network traffic during the initial join to get everything into one table, and after that there will likely be more network traffic for various
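
A sketch of the suggested approach, with hypothetical paths and a made-up extractKey helper:

    import org.apache.spark.SparkContext._ // pair-RDD functions such as join

    // key each dataset by the shared join key (hypothetical format: key\tpayload)
    def extractKey(line: String): String = line.split("\t")(0)

    val a = sc.textFile("hdfs:///data/file1").map(l => (extractKey(l), l))
    val b = sc.textFile("hdfs:///data/file2").map(l => (extractKey(l), l))
    val c = sc.textFile("hdfs:///data/file3").map(l => (extractKey(l), l))

    // one RDD of (key, ((lineA, lineB), lineC)) to run the factorization over
    val combined = a.join(b).join(c)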

Re: Spark Disk Usage

2014-04-09 Thread Andrew Ash
The groupByKey would be aware of the subsequent persist -- that's part of the reason why operations are lazy. As for whether it's materialized in memory first and then flushed to disk vs streamed to disk, I'm not sure of the exact behavior. What I'd expect to happen would be that the RDD is
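
A small sketch of that ordering (pairs is an assumed (K, V) RDD):

    import org.apache.spark.storage.StorageLevel

    // both the shuffle and the persist marker are recorded lazily; nothing runs yet
    val grouped = pairs.groupByKey().persist(StorageLevel.DISK_ONLY)
    grouped.count() // the first action triggers the shuffle and stores the result per the level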

How to change the parallelism level of input dstreams

2014-04-09 Thread Dong Mo
Dear list, A quick question about Spark Streaming: Say I have this stage set up in my Spark Streaming cluster: batched TCP stream ==> map(expensive computation) ==> reduceByKey I know I can set the number of tasks for reduceByKey. But I didn't find a place to specify the parallelism for the
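
A sketch of one way to widen the map stage (host, port, and counts assumed; DStream.repartition availability depends on your Spark version, otherwise union several input streams):

    val lines = ssc.socketTextStream("host", 9999)
    val widened = lines.repartition(8)   // spread each batch over 8 tasks
    val result = widened
      .map(expensiveComputation)         // hypothetical function
      .reduceByKey(_ + _, 8)             // numPartitions controls reduce parallelism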

Re: hbase scan performance

2014-04-09 Thread Jerry Lam
Hi Dave, This is the HBase solution to the poor scan performance issue: https://issues.apache.org/jira/browse/HBASE-8369 I encountered the same issue before. To the best of my knowledge, this is not a MapReduce issue; it is an HBase issue. If you are planning to swap out MapReduce and replace it with

Re: Why doesn't the driver node do any work?

2014-04-09 Thread Mayur Rustagi
Also, the driver can run on one of the slave nodes (you will still need a Spark master, though, for resource allocation etc). Regards Mayur Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Tue, Apr 8, 2014 at 2:46 PM, Nan Zhu

cannot run spark shell in yarn-client mode

2014-04-09 Thread Pennacchiotti, Marco
I am pretty new to Spark and I am trying to run the Spark shell on a YARN cluster from the CLI (in yarn-client mode). I am able to start the shell with the following command: SPARK_JAR=../spark-0.9.0-incubating/jars/spark-assembly-0.9.0-incubating-hadoop2.2.0.jar \ SPARK_YARN_APP_JAR=emptyfile
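
For reference, the general 0.9-era pattern looks roughly like this (paths are placeholders):

    SPARK_JAR=/path/to/spark-assembly-0.9.0-incubating-hadoop2.2.0.jar \
    SPARK_YARN_APP_JAR=/path/to/your-app.jar \
    MASTER=yarn-client ./bin/spark-shell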

Re: Spark RDD to Shark table IN MEMORY conversion

2014-04-09 Thread abhietc31
Never mind...plz return it later with interest

KafkaInputDStream Stops reading new messages

2014-04-09 Thread Kanwaldeep
The Spark Streaming job was running on two worker nodes and then there was an error on one of the nodes. The Spark job showed as running, but no progress was being made and no new messages were being processed. Based on the driver log files I see the following errors. I would expect the stream reading would

is it possible to initiate Spark jobs from Oozie?

2014-04-09 Thread Segerlind, Nathan L
Howdy. Is it possible to initiate Spark jobs from Oozie (presumably as a java action)? If so, are there known limitations to this? And would anybody have a pointer to an example? Thanks, Nate

Re: Spark packaging

2014-04-09 Thread Pradeep baji
Thanks Prabeesh. On Wed, Apr 9, 2014 at 12:37 AM, prabeesh k prabsma...@gmail.com wrote: Please refer http://prabstechblog.blogspot.in/2014/04/creating-single-jar-for-spark-project.html Regards, prabeesh On Wed, Apr 9, 2014 at 1:04 PM, Pradeep baji pradeep.chanum...@gmail.comwrote:

Spark 0.9.1 released

2014-04-09 Thread Tathagata Das
Hi everyone, We have just posted Spark 0.9.1, which is a maintenance release with bug fixes, performance improvements, better stability with YARN, and improved parity of the Scala and Python APIs. We recommend that all 0.9.0 users upgrade to this stable release. This is the first release since Spark

Re: Spark 0.9.1 released

2014-04-09 Thread Tathagata Das
A small additional note: Please use the direct download links on the Spark Downloads page (http://spark.apache.org/downloads.html). The Apache mirrors take a day or so to sync from the main repo, so they may not work immediately. TD On Wed, Apr 9, 2014 at 2:54 PM, Tathagata Das

Problem with running LogisticRegression in spark cluster mode

2014-04-09 Thread Jenny Zhao
Hi all, I have been able to run LR in local mode, but I am facing a problem running it in cluster mode. Below are the source script and the stack trace from running it in cluster mode. I used sbt package to build the project; I am not sure what it is complaining about. Another question I have is for

Multi master Spark

2014-04-09 Thread Pradeep Ch
Hi, I want to enable Spark Master HA in Spark. The documentation specifies that we can do this with the help of ZooKeeper. But what I am worried about is how to configure one master with the other, and similarly how do workers know that they have two masters? Where do you specify the multi-master

Re: Spark 0.9.1 released

2014-04-09 Thread Matei Zaharia
Thanks TD for managing this release, and thanks to everyone who contributed! Matei On Apr 9, 2014, at 2:59 PM, Tathagata Das tathagata.das1...@gmail.com wrote: A small additional note: Please use the direct download links in the Spark Downloads page. The Apache mirrors take a day or so to

Re: Problem with running LogisticRegression in spark cluster mode

2014-04-09 Thread Jagat Singh
Hi Jenny, How are you packaging your jar? Can you please confirm whether you have included the MLlib jar inside the fat jar you have created for your code: libraryDependencies += "org.apache.spark" % "spark-mllib_2.9.3" % "0.8.1-incubating" Thanks, Jagat Singh On Thu, Apr 10, 2014 at 8:05 AM, Jenny

Re: Multi master Spark

2014-04-09 Thread Dmitriy Lyubimov
The only way I know to do this is to use Mesos with ZooKeeper: you specify a ZooKeeper URL as the Spark URL, and it can contain multiple ZooKeeper hosts. Multiple Mesos masters are then elected through ZooKeeper leader election until the current leader dies, at which point Mesos will elect another master (if still

Re: programmatic way to tell Spark version

2014-04-09 Thread Nicholas Chammas
Hey Patrick, I've created SPARK-1458 https://issues.apache.org/jira/browse/SPARK-1458 to track this request, in case the team/community wants to implement it in the future. Nick On Sat, Feb 22, 2014 at 7:25 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: No use case at the moment.

Re: Multi master Spark

2014-04-09 Thread Pradeep Ch
Thanks Dmitriy. But I want multi-master support when running Spark standalone. Also, I want to know whether this multi-master setup works if I use spark-shell. On Wed, Apr 9, 2014 at 3:26 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: The only way i know to do this is to use mesos with zookeepers.

Re: pySpark memory usage

2014-04-09 Thread Jim Blomo
Hi Matei, thanks for working with me to find these issues. To summarize, the issues I've seen are: 0.9.0: - https://issues.apache.org/jira/browse/SPARK-1323 SNAPSHOT 2014-03-18: - When persist() used and batchSize=1, java.lang.OutOfMemoryError: Java heap space. To me this indicates a memory

Re: Problem with running LogisticRegression in spark cluster mode

2014-04-09 Thread Jenny Zhao
Hi Jagat, yes, I did specify mllib in build.sbt: name := "Spark LogisticRegression" version := "1.0" scalaVersion := "2.10.3" libraryDependencies += "org.apache.spark" % "spark-core_2.10" % "0.9.0-incubating" libraryDependencies += "org.apache.spark" % "spark-mllib_2.10" % "0.9.0-incubating"

Re: Spark 0.9.1 released

2014-04-09 Thread Nicholas Chammas
A very nice addition for us PySpark users in 0.9.1 is the addition of RDD.repartition(), which is not mentioned in the release notes (http://spark.apache.org/releases/spark-release-0-9-1.html)! This is super helpful for when you create an RDD from a gzipped file and then need to explicitly shuffle
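
A sketch of the pattern (path and partition count assumed; the equivalent call exists in PySpark as of 0.9.1):

    // a gzipped file is not splittable, so it loads as a single partition;
    // repartition shuffles it across the cluster before further work
    val data = sc.textFile("hdfs:///logs/big.gz").repartition(16)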

Re: pySpark memory usage

2014-04-09 Thread Matei Zaharia
Okay, thanks. Do you have any info on how large your records and data file are? I’d like to reproduce and fix this. Matei On Apr 9, 2014, at 3:52 PM, Jim Blomo jim.bl...@gmail.com wrote: Hi Matei, thanks for working with me to find these issues. To summarize, the issues I've seen are:

Re: trouble with join on large RDDs

2014-04-09 Thread Brad Miller
I set SPARK_MEM in the driver process by setting spark.executor.memory to 10G. Each machine had 32G of RAM and a dedicated 32G spill volume. I believe all of the units are in pages, and the page size is the standard 4K. There are 15 slave nodes in the cluster and the sizes of the datasets I'm

Re: Spark 0.9.1 released

2014-04-09 Thread Nicholas Chammas
Ah, looks good now. It took me a minute to realize that I needed a hard refresh; the cached docs page was missing the RDD class doc page... And thanks for updating the release notes. On Wed, Apr 9, 2014 at 7:21 PM, Tathagata Das tathagata.das1...@gmail.com wrote: Thanks Nick for pointing that out! I

Re: Best way to turn an RDD back into a SchemaRDD

2014-04-09 Thread Michael Armbrust
Good question. This is something we wanted to fix, but unfortunately I'm not sure how to do it without changing the API to RDD, which is undesirable now that the 1.0 branch has been cut. We should figure something out though for 1.1. I've created https://issues.apache.org/jira/browse/SPARK-1460

Re: Multi master Spark

2014-04-09 Thread Aaron Davidson
It is as Jagat said. The Masters do not need to know about one another, as ZooKeeper manages their implicit communication. As for Workers (and applications, such as spark-shell), once a Worker is registered with some Master, its metadata is stored in ZooKeeper such that if another Master is
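
A configuration sketch based on the standalone-HA settings documented for this era (hostnames assumed):

    # conf/spark-env.sh on each Master
    SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER \
      -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181,zk3:2181"

    # Workers and spark-shell list every Master and register with whichever leads
    MASTER=spark://master1:7077,master2:7077 ./bin/spark-shell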

Re: Only TraversableOnce?

2014-04-09 Thread wxhsdp
thank you, it works. After my operation over p, I return p.toIterator, because mapPartitions has an iterator return type; is that right? rdd.mapPartitions { D => { val p = D.toArray; ...; p.toIterator } }

Re: Only TraversableOnce?

2014-04-09 Thread Nan Zhu
Yeah, should be right -- Nan Zhu On Wednesday, April 9, 2014 at 8:54 PM, wxhsdp wrote: thank you, it works. After my operation over p, I return p.toIterator, because mapPartitions has an iterator return type; is that right? rdd.mapPartitions { D => { val p = D.toArray; ...; p.toIterator } }
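
Spelled out as a runnable sketch of the pattern under discussion:

    val rdd = sc.parallelize(1 to 10, 2)
    val result = rdd.mapPartitions { iter =>
      val p = iter.toArray // materialize: the input iterator is TraversableOnce
      // ... operate on p as many times as needed ...
      p.toIterator         // mapPartitions requires an Iterator return value
    }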

Re: pySpark memory usage

2014-04-09 Thread Jim Blomo
This dataset is uncompressed text at ~54GB. stats() returns (count: 56757667, mean: 1001.68740583, stdev: 601.775217822, max: 8965, min: 343) On Wed, Apr 9, 2014 at 6:59 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Okay, thanks. Do you have any info on how large your records and data file