unsubscribe

2014-05-19 Thread Jayaraman Babu

Status stays at ACCEPTED

2014-05-19 Thread Jan Holmberg
Hi, I’m new to Spark and trying to test my first Spark program. I’m running SparkPi successfully in yarn-client mode, but when running the same in yarn-cluster mode the app gets stuck in the ACCEPTED state. I’ve spent hours trying to hunt down the reason, but the outcome is always the same. Any hints what to look for next?

Re: life if an executor

2014-05-19 Thread Tathagata Das
That's one of the main motivations for using Tachyon ;) http://tachyon-project.org/ It gives off-heap in-memory caching. And starting with Spark 0.9, you can cache any RDD in Tachyon just by specifying the appropriate StorageLevel. TD On Mon, May 19, 2014 at 10:22 PM, Mohit Jaggi wrote: > I guess it
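A minimal sketch of what TD describes, assuming the Tachyon-backed StorageLevel.OFF_HEAP level is available in your Spark build (the input path is illustrative):

    import org.apache.spark.storage.StorageLevel

    // Hedged sketch: persist outside the JVM heap via Tachyon, so the
    // cached data is not subject to JVM garbage collection.
    val rdd = sc.textFile("hdfs:///data/input").map(_.length)
    rdd.persist(StorageLevel.OFF_HEAP)
    rdd.count()  // first action materializes the off-heap cache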

Re: life if an executor

2014-05-19 Thread Mohit Jaggi
I guess it "needs" to be this way to benefit from caching of RDDs in memory. It would be nice, however, if the RDD cache could be dissociated from the JVM heap, so that in cases where garbage collection is difficult to tune, one could choose to discard the JVM and run the next operation in a new one.

Re: filling missing values in a sequence

2014-05-19 Thread Xiangrui Meng
Actually there is a sliding method implemented in mllib.rdd.RDDFunctions. Since this is not for general use cases, we didn't include it in spark-core. You can take a look at the implementation there and see whether it fits. -Xiangrui On Mon, May 19, 2014 at 10:06 PM, Mohit Jaggi wrote: > Thanks S
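For reference, a minimal sketch of the sliding helper Xiangrui mentions, assuming the RDDFunctions implicit conversion is importable in your version:

    import org.apache.spark.mllib.rdd.RDDFunctions._

    // A window of size 2 pairs each element with its successor, spanning
    // partition boundaries, which is what gap-filling in a sequence needs.
    val seq = sc.parallelize(Seq(1, 2, 4, 7, 8))
    val pairs = seq.sliding(2).map { case Array(a, b) => (a, b) }
    pairs.collect().foreach(println)  // (1,2), (2,4), (4,7), (7,8)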

Re: filling missing values in a sequence

2014-05-19 Thread Mohit Jaggi
Thanks Sean. Yes, your solution works :-) I did oversimplify my real problem, which has other parameters that go along with the sequence. On Fri, May 16, 2014 at 3:03 AM, Sean Owen wrote: > Not sure if this is feasible, but this literally does what I think you > are describing: > > sc.paralleli

Re: life if an executor

2014-05-19 Thread Matei Zaharia
They’re tied to the SparkContext (application) that launched them. Matei On May 19, 2014, at 8:44 PM, Koert Kuipers wrote: > from looking at the source code i see executors run in their own jvm > subprocesses. > > how long do they live for? as long as the worker/slave? or are they tied to >

Re: accessing partition i+1 from mapper of partition i

2014-05-19 Thread Mohit Jaggi
Thanks Brian. This works. I used Accumulable to do the "collect" in step B. While doing that I found that Accumulable.value is not a Spark "action"; I need to call "cache" on the underlying RDD for "value" to work. Not sure if that is intentional or a bug. The "collect" of step B can be done as a n
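A hedged sketch of the Accumulable-based "collect" described above: gathering the first element of each partition on the driver (the names FirstElems and firsts are illustrative, not from the thread):

    import org.apache.spark.AccumulableParam

    // Accumulates a map of partitionIndex -> first element of that partition.
    implicit object FirstElems extends AccumulableParam[Map[Int, Int], (Int, Int)] {
      def addAccumulator(acc: Map[Int, Int], e: (Int, Int)) = acc + e
      def addInPlace(a: Map[Int, Int], b: Map[Int, Int]) = a ++ b
      def zero(init: Map[Int, Int]) = Map.empty[Int, Int]
    }

    val firsts = sc.accumulable(Map.empty[Int, Int])
    val rdd = sc.parallelize(1 to 100, 4)
    rdd.mapPartitionsWithIndex { (i, iter) =>
      val buf = iter.toArray
      if (buf.nonEmpty) firsts += (i -> buf.head)
      buf.iterator
    }.count()  // an action must run before firsts.value is populated
    println(firsts.value)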

Setting queue for spark job on yarn

2014-05-19 Thread Ron Gonzalez
Hi, how does one submit a Spark job to YARN and specify a queue? The code that successfully submits to YARN is:

    val conf = new SparkConf()
    val sc = new SparkContext("yarn-client", "Simple App", conf)

Where do I need to specify the queue? Thanks in advance for any help on this...
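A minimal sketch of one way to do it, assuming your Spark version honors the spark.yarn.queue property (the queue name "research" is made up):

    import org.apache.spark.{SparkConf, SparkContext}

    // Hedged: set the YARN queue in the conf before creating the context.
    val conf = new SparkConf().set("spark.yarn.queue", "research")
    val sc = new SparkContext("yarn-client", "Simple App", conf)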

Re: Spark stalling during shuffle (maybe a memory issue)

2014-05-19 Thread jonathan.keebler
Thanks for the response, Aaron! We'll give it a try tomorrow. On Tue, May 20, 2014 at 12:13 AM, Aaron Davidson [via Apache Spark User List] wrote: > This is very likely because the serialized map output locations buffer > exceeds the akka frame size. Please try setting "spark.akka.frameSize" >

Re: Spark stalling during shuffle (maybe a memory issue)

2014-05-19 Thread Aaron Davidson
This is very likely because the serialized map output locations buffer exceeds the akka frame size. Please try setting "spark.akka.frameSize" (default 10 MB) to some higher number, like 64 or 128. In the newest version of Spark, this would throw a better error, for what it's worth. On Mon, May
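A minimal sketch of that setting; the value is interpreted in megabytes, and the master/app name here are illustrative:

    import org.apache.spark.{SparkConf, SparkContext}

    // Raise the Akka frame size from the 10 MB default to 128 MB so the
    // serialized map output locations fit in a single message.
    val conf = new SparkConf()
      .setMaster("local[2]")     // illustrative master
      .setAppName("ShuffleJob")  // illustrative name
      .set("spark.akka.frameSize", "128")
    val sc = new SparkContext(conf)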

facebook data mining with Spark

2014-05-19 Thread Joe L
Is there any way to get Facebook data into Spark and filter its content?

Re: Problem when sorting big file

2014-05-19 Thread Andrew Ash
Is your RDD of Strings? If so, you should make sure to use the Kryo serializer instead of the default Java one. It stores strings as UTF8 rather than Java's default UTF16 representation, which can save you half the memory usage in the right situation. Try setting the persistence level on the RDD
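A minimal sketch of both suggestions, Kryo plus a serialized in-memory persistence level (master, app name, and path are illustrative):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    val conf = new SparkConf()
      .setMaster("local[2]")  // illustrative
      .setAppName("BigSort")  // illustrative
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    val sc = new SparkContext(conf)

    // MEMORY_ONLY_SER keeps the strings serialized (UTF8 under Kryo)
    // instead of as in-heap java.lang.String objects.
    val lines = sc.textFile("hdfs:///data/big.txt")
    lines.persist(StorageLevel.MEMORY_ONLY_SER)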

Re: How to run the SVM and LogisticRegression

2014-05-19 Thread yxzhao
Thanks Andrew, Yes, I have downloaded the master code. But actually I just want to know how to run the classification algorithms SVM and LogisticRegression implemented under /spark-0.9.0-incubating/mllib/src/main/scala/org/apache/spark/mllib/classification . Thanks. On Mon, May

life if an executor

2014-05-19 Thread Koert Kuipers
from looking at the source code i see executors run in their own jvm subprocesses. how long do they live for? as long as the worker/slave? or are they tied to the sparkcontext and live/die with it? thx

Re: Reading from .bz2 files with Spark

2014-05-19 Thread Andrew Ash
Hi Xiangrui, many thanks to you and Sandy for fixing this issue! On Fri, May 16, 2014 at 10:23 PM, Xiangrui Meng wrote: > Hi Andrew, > > I submitted a patch and verified it solves the problem. You can > download the patch from > https://issues.apache.org/jira/browse/HADOOP-10614 . > > Best, > X

Spark stalling during shuffle (maybe a memory issue)

2014-05-19 Thread jonathan.keebler
Has anyone observed Spark worker threads stalling during a shuffle phase, with the following message (one per worker host) being echoed to the terminal on the driver thread?

    INFO spark.MapOutputTrackerActor: Asked to send map output locations for shuffle 0 to [worker host]...

At this point Spark-

Re: How to run the SVM and LogisticRegression

2014-05-19 Thread Andrew Ash
Hi yxzhao, Those are branches in the source code git repository. You can get to them with "git checkout branch-1.0" once you've cloned the git repository. Cheers, Andrew On Mon, May 19, 2014 at 8:30 PM, yxzhao wrote: > Thanks Xiangrui, > > Sorry I am new for Spark, could you give me

Re: How to run the SVM and LogisticRegression

2014-05-19 Thread yxzhao
Thanks Xiangrui, Sorry, I am new to Spark. Could you give me more detail about "master or branch-1.0"? I do not know what "master or branch-1.0" is. Thanks again. On Mon, May 19, 2014 at 10:17 PM, Xiangrui Meng [via Apache Spark User List] wrote: > Checkout the master or branch-1.0. Th

Re: How to run the SVM and LogisticRegression

2014-05-19 Thread Xiangrui Meng
Checkout the master or branch-1.0. Then the examples should be there. -Xiangrui On Mon, May 19, 2014 at 11:36 AM, yxzhao wrote: > Thanks Xiangrui, > > But I did not find the directory: > examples/src/main/scala/org/apache/spark/examples/mllib. > Could you give me more detail or show me one exampl

Re: advice on maintaining a production spark cluster?

2014-05-19 Thread Matei Zaharia
Which version is this with? I haven’t seen standalone masters lose workers. Is there other stuff on the machines that’s killing them, or what errors do you see? Matei On May 16, 2014, at 9:53 AM, Josh Marcus wrote: > Hey folks, > > I'm wondering what strategies other folks are using for main

Re: advice on maintaining a production spark cluster?

2014-05-19 Thread Mayur Rustagi
You are better off using Mesos for a production cluster. Standalone mode will not provide reliability & availability in production. That said, it depends on what production means. Many of my analytics customers use standalone in production. Regards Mayur Mayur Rustagi Ph: +1 (760) 203 3257 http://www

Re: Benchmarking Graphx

2014-05-19 Thread ankurdave
On May 17, 2014 at 2:59pm, Hari wrote: > a) Is there a way to get the total time taken for the execution from start to finish? Assuming you're running the benchmark as a standalone program, such as by invoking the Analytics driver

combinebykey throw classcastexception

2014-05-19 Thread xiemeilong
I am using CDH5 on a three-machine cluster. I map data from HBase as (String, V) pairs, then call combineByKey like this:

    .combineByKey[C](
      (v: V) => new C(v),   // this line throws java.lang.ClassCastException: C cannot be cast to V
      (v: C, v: V) => C,
      (c1: C, c2: C) => C)

I am very confu
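For contrast, a hedged sketch with concrete types standing in for the thread's V and C (Int and List[Int]); each of the three functions must return a combiner value, not name the type:

    // createCombiner: V => C, mergeValue: (C, V) => C, mergeCombiners: (C, C) => C
    val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))
    val combined = pairs.combineByKey[List[Int]](
      (v: Int) => List(v),
      (c: List[Int], v: Int) => v :: c,
      (c1: List[Int], c2: List[Int]) => c1 ++ c2)
    combined.collect().foreach(println)  // (a,List(2, 1)), (b,List(3))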

Checkpoint serialization

2014-05-19 Thread Vadim Chekan
Hi all, I'm writing a Spark Streaming app which consumes a Kafka stream. The objects consumed are Avro objects (they do not implement Java serialization). I decided to use Kryo and have set "spark.serializer" and "spark.closure.serializer". But now I am getting an exception in CheckpointWriter: 14/05/19 1

Re: How to compile the examples directory?

2014-05-19 Thread Matei Zaharia
If you’d like to work on just this code for your own changes, it might be best to copy it to a separate project. Look at http://spark.apache.org/docs/latest/quick-start.html for how to set up a standalone job. Matei On May 19, 2014, at 4:53 AM, Hao Wang wrote: > Hi, > > I am running some ex

Re: unsubscribe

2014-05-19 Thread Shangyu Luo
Hi Andrew and Madhu, Thank you for your help here! Will unsubscribe through another address and may subscribe to the digest instead! Best, Shangyu On Sun, May 18, 2014 at 3:49 PM, Andrew Ash wrote: > Hi Shangyu (and everyone else looking to unsubscribe!), > > If you'd like to get off this mailing lis

Re: Advanced log processing

2014-05-19 Thread Mayur Rustagi
It seems you are not reducing the data in size. If you are not, then you are better off partitioning the data into buckets (folders?) & keeping the data sorted in those buckets. A cleaner approach is to use HBase to keep track of keys, adding keys as you find them & letting HBase handle it. Mayur

Re: Spark Streaming and Shark | Streaming Taking All CPUs

2014-05-19 Thread Mayur Rustagi
Simply configure the spark.cores.max variable for your application in the Spark context. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi On Mon, May 19, 2014 at 11:13 PM, anishs...@yahoo.co.in < anishs...@yahoo.co.in> wrote: > Hi Al
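A minimal sketch of that setting (master URL, app name, and core count are illustrative); spark.cores.max caps the app on a standalone or Mesos cluster:

    import org.apache.spark.{SparkConf, SparkContext}

    // Cap this app at 4 cores so concurrent Shark queries can get the rest.
    val conf = new SparkConf()
      .setMaster("spark://master:7077")
      .setAppName("KafkaToHDFS")
      .set("spark.cores.max", "4")
    val sc = new SparkContext(conf)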

Re: specifying worker nodes when using the repl?

2014-05-19 Thread Eric Friedman
Sandy, thank you so much — that was indeed my omission! Eric On May 19, 2014, at 10:14 AM, Sandy Ryza wrote: > Hi Eric, > > Have you tried setting the SPARK_WORKER_INSTANCES env variable before running > spark-shell? > http://spark.apache.org/docs/0.9.0/running-on-yarn.html > > -Sandy > > >

Re: making spark/conf/spark-defaults.conf changes take effect

2014-05-19 Thread Andrew Or
Hm, it should just take effect immediately. But yes, there is a script for syncing everything:

    /root/spark-ec2/copy-dir --delete /root/spark

After that you should do:

    /root/spark/sbin/stop-all.sh
    /root/spark/sbin/start-all.sh

2014-05-18 16:56 GMT-07:00 Daniel Mahler : > > I am running in an a

Which component(s) of Spark do not support IPv6?

2014-05-19 Thread queniesun
Besides Hadoop, are there any other components of Spark that do not support IPv6?

Re: How to run the SVM and LogisticRegression

2014-05-19 Thread yxzhao
Thanks Xiangrui, But I did not find the directory: examples/src/main/scala/org/apache/spark/examples/mllib. Could you give me more detail or show me one example? Thanks a lot.

Re: sync master with slaves with bittorrent?

2014-05-19 Thread Mosharaf Chowdhury
Good catch. In that case, using BitTornado/murder would be better. -- Mosharaf Chowdhury http://www.mosharaf.com/ On Mon, May 19, 2014 at 11:17 AM, Aaron Davidson wrote: > On the ec2 machines, you can update the slaves from the master using > something like "~/spark-ec2/copy-dir ~/spark". > >

Re: sync master with slaves with bittorrent?

2014-05-19 Thread Aaron Davidson
On the ec2 machines, you can update the slaves from the master using something like "~/spark-ec2/copy-dir ~/spark". Spark's TorrentBroadcast relies on the Block Manager to distribute blocks, making it relatively hard to extract. On Mon, May 19, 2014 at 12:36 AM, Daniel Mahler wrote: > btw is t

Re: unsubscribe

2014-05-19 Thread Andrew Ash
Agree that the links to the archives should probably point to the Apache archives rather than Nabble's, so the unsubscribe documentation is clearer. Also, an (unsubscribe) link right next to subscribe with the email already generated could help too. I'd be one of those highly against a footer on

spark ec2 commandline tool error "VPC security groups may not be used for a non-VPC launch"

2014-05-19 Thread Matt Work Coarr
Hi, I'm attempting to run "spark-ec2 launch" on AWS. My AWS instances would be in our EC2 VPC (which seems to be causing a problem). The two security groups MyClusterName-master and MyClusterName-slaves have already been setup with the same ports open as the security group that spark-ec2 tries to

Re: unsubscribe

2014-05-19 Thread Marcelo Vanzin
Hey Andrew, Since we're seeing so many of these e-mails, I think it's worth pointing out that it's not really obvious to find unsubscription information for the lists. The community link on the Spark site (http://spark.apache.org/community.html) does not have instructions for unsubscribing; it li

Spark Streaming and Shark | Streaming Taking All CPUs

2014-05-19 Thread anishs...@yahoo.co.in
Hi All, I am new to Spark. I was trying to use Spark Streaming and Shark at the same time. I was receiving messages from Kafka and pushing them to HDFS after minor processing. It was working fine, but it was taking all the CPUs, and at the same time on another terminal I tried to access Shark but

Re: specifying worker nodes when using the repl?

2014-05-19 Thread Sandy Ryza
Hi Eric, Have you tried setting the SPARK_WORKER_INSTANCES env variable before running spark-shell? http://spark.apache.org/docs/0.9.0/running-on-yarn.html -Sandy On Mon, May 19, 2014 at 8:08 AM, Eric Friedman wrote: > Hi > > I am working with a Cloudera 5 cluster with 192 nodes and can’t work

Re: persist @ disk-only failing

2014-05-19 Thread Sai Prasanna
Ok Thanks! On Mon, May 19, 2014 at 10:09 PM, Matei Zaharia wrote: > This is the patch for it: https://github.com/apache/spark/pull/50/. It > might be possible to backport it to 0.8. > > Matei > > On May 19, 2014, at 2:04 AM, Sai Prasanna wrote: > > Matei, I am using 0.8.1 !! > > But is there a

OutofMemory: Failed on spark/examples/bagel/WikipediaPageRank.scala

2014-05-19 Thread Hao Wang
Hi all, I am running a 30GB Wikipedia dataset on a 7-server cluster, using WikipediaPageRank under examples/bagel. My Spark version is commit bae07e3 ("fix different versions of commons-lang dependency", apache/spark#746 addendum). The problem is that the job will fail after several stages becau

Re: persist @ disk-only failing

2014-05-19 Thread Matei Zaharia
This is the patch for it: https://github.com/apache/spark/pull/50/. It might be possible to backport it to 0.8. Matei On May 19, 2014, at 2:04 AM, Sai Prasanna wrote: > Matei, I am using 0.8.1 !! > > But is there a way without moving to 0.9.1 to bypass cache ? > > > On Mon, May 19, 2014 at

Re: Yarn configuration file doesn't work when run with yarn-client mode

2014-05-19 Thread Arun Ahuja
I am encountering the same thing. Basic YARN apps work, as does the SparkPi example, but my custom application gives this result. I am using compute-classpath to create the proper classpath for my application, same as with SparkPi - was there a resolution to this issue? Thanks, Arun On Wed, Feb 12

Re: java.lang.NoClassDefFoundError: org/apache/spark/deploy/worker/Worker

2014-05-19 Thread Hao Wang
I made a mistake: the machines in my cluster run different JDKs. After I unified all the JDKs, the problem was solved. Regards, Wang Hao(王灏) CloudTeam | School of Software Engineering Shanghai Jiao Tong University Address:800 Dongchuan Road, Minhang District, Shanghai, 200240 Email:wh.s...@gmail.co

specifying worker nodes when using the repl?

2014-05-19 Thread Eric Friedman
Hi I am working with a Cloudera 5 cluster with 192 nodes and can’t work out how to get the Spark repl to use more than 2 nodes in an interactive session. So, this works, but is non-interactive (using yarn-client as MASTER): /opt/cloudera/parcels/CDH-5.0.0-1.cdh5.0.0.p0.47/lib/spark/bin/spark-cla

Re: File present but file not found exception

2014-05-19 Thread Koert Kuipers
why does it need to be a local file? why not do some filter ops on the hdfs file and save to hdfs, from where you can create an rdd? you can read a small file in on the driver program and use sc.parallelize to turn it into an RDD On May 16, 2014 7:01 PM, "Sai Prasanna" wrote: > I found that if a file is present
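A minimal sketch of the second suggestion (the path is illustrative):

    import scala.io.Source

    // Read the small file on the driver, then distribute its lines as an RDD.
    val lines = Source.fromFile("/tmp/small-input.txt").getLines().toList
    val rdd = sc.parallelize(lines)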

How to compile the examples directory?

2014-05-19 Thread Hao Wang
Hi, I am running some examples of Spark on a cluster. Because I need to modify some source code, I have to re-compile the whole of Spark using `sbt/sbt assembly`, which takes a long time. I have tried `mvn package` under the examples directory; it failed because of some dependency problems. Any way

Re: sync master with slaves with bittorrent?

2014-05-19 Thread Daniel Mahler
On Mon, May 19, 2014 at 2:04 AM, Daniel Mahler wrote: > I agree that for updating rsync is probably preferable, and it seems like > for that purpose it would also parallelize well since most of the time is > spent computing checksums so the process is not constrained by the total > i/o capacity o

Re: Packaging a spark job using maven

2014-05-19 Thread Eugen Cepoi
2014-05-19 10:35 GMT+02:00 Laurent T : > Hi Eugen, > > Thanks for your help. I'm not familiar with the shaded plugin and i was > wondering: does it replace the assembly plugin ? Nope it doesn't replace it. It allows you to make "fat jars" and other nice things such as relocating classes to some

Re: Using mongo with PySpark

2014-05-19 Thread Nick Pentreath
You need to use mapPartitions (or foreachPartition) to instantiate your client in each partition as it is not serializable by the pickle library. Something like:

    def mapper(iter):
        db = MongoClient()['spark_test_db']
        collec = db['programs']
        for val in iter:
            asc = val.encode('

Re: persist @ disk-only failing

2014-05-19 Thread Sai Prasanna
Matei, I am using 0.8.1 !! But is there a way, without moving to 0.9.1, to bypass the cache? On Mon, May 19, 2014 at 10:09 PM, Matei Zaharia wrote: > This is the patch for it: https://github.com/apache/spark/pull/50/. It > might be possible to backport it to 0.8. > > Matei > > On May 19, 2014, at 2:04 AM, Sai Prasanna wrote: > > Matei, I am using 0.8.1 !! > > But is there a

Re: Advanced log processing

2014-05-19 Thread Laurent T
(resending this, as a lot of mails seem not to be delivered) Hi, I have some complex behavior I'd like to be advised on, as I'm really new to Spark. I'm reading some log files that contain various events. There are two types of events: parents and children. A child event can only have one pare

Re: Packaging a spark job using maven

2014-05-19 Thread Laurent T
Hi Eugen, Thanks for your help. I'm not familiar with the shaded plugin and I was wondering: does it replace the assembly plugin? Also, do I have to specify all the artifacts and sub-artifacts in the artifactSet? Or can I just use a *:* wildcard and let the Maven scopes do their work? I have a

Re: persist @ disk-only failing

2014-05-19 Thread Matei Zaharia
What version is this with? We used to build each partition first before writing it out, but this was fixed a while back (0.9.1, but it may also be in 0.9.0). Matei On May 19, 2014, at 12:41 AM, Sai Prasanna wrote: > Hi all, > > When i gave the persist level as DISK_ONLY, still Spark tries to

persist @ disk-only failing

2014-05-19 Thread Sai Prasanna
Hi all, When I set the persistence level to DISK_ONLY, Spark still tries to use memory and caches. Any reason? Do I need to override some parameter elsewhere? Thanks!
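For reference, the call under discussion (the input path is illustrative); with the fix Matei cites above, partitions spill straight to disk instead of being built in memory first:

    import org.apache.spark.storage.StorageLevel

    val rdd = sc.textFile("hdfs:///data/huge.txt")
    rdd.persist(StorageLevel.DISK_ONLY)  // no in-memory caching requested
    rdd.count()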

Re: sync master with slaves with bittorrent?

2014-05-19 Thread Daniel Mahler
btw is there a command or script to update the slaves from the master? thanks Daniel On Mon, May 19, 2014 at 1:48 AM, Andrew Ash wrote: > If the codebase for Spark's broadcast is pretty self-contained, you could > consider creating a small bootstrap sent out via the doubling rsync > strategy t

Re: sync master with slaves with bittorrent?

2014-05-19 Thread Daniel Mahler
I agree that for updating rsync is probably preferable, and it seems like for that purpose it would also parallelize well since most of the time is spent computing checksums so the process is not constrained by the total i/o capacity of the master. However it is a problem for the initial replicatio