spark streaming kafka output

2014-05-05 Thread Weide Zhang
Hi, is there any code to implement a Kafka output for Spark Streaming? My use case is that all the output needs to be dumped back to the Kafka cluster again after the data is processed. What would be the guideline to implement such a function? I heard foreachRDD will create one instance of the producer per batch? If
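A minimal sketch of the usual pattern, assuming a dstream: DStream[String] and the Kafka 0.8 Scala producer API; the broker address and topic name below are placeholders. The producer is created once per partition per batch, not per record:

    import java.util.Properties
    import kafka.producer.{KeyedMessage, Producer, ProducerConfig}

    dstream.foreachRDD { rdd =>
      rdd.foreachPartition { partition =>
        // Build the producer inside the partition so it is never
        // serialized and shipped from the driver.
        val props = new Properties()
        props.put("metadata.broker.list", "broker1:9092") // placeholder broker list
        props.put("serializer.class", "kafka.serializer.StringEncoder")
        val producer = new Producer[String, String](new ProducerConfig(props))
        partition.foreach { msg =>
          producer.send(new KeyedMessage[String, String]("out-topic", msg))
        }
        producer.close()
      }
    }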

Re: NoSuchMethodError: breeze.linalg.DenseMatrix

2014-05-05 Thread DB Tsai
Since the breeze jar is brought into Spark by the mllib package, you may want to add mllib as your dependency in Spark 1.0. To bring it in from your application yourself, you can either use sbt assembly in your build project to generate a flat myApp-assembly.jar which contains the breeze jar, or use spark add
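A minimal sketch of the dependency in an sbt build, assuming the Spark 1.0.0 artifacts (the version string is illustrative):

    // build.sbt -- pulling in mllib, which transitively brings in breeze
    libraryDependencies += "org.apache.spark" %% "spark-core"  % "1.0.0"
    libraryDependencies += "org.apache.spark" %% "spark-mllib" % "1.0.0"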

Check your cluster UI to ensure that workers are registered and have sufficient memory

2014-05-05 Thread Sai Prasanna
I executed the following commands to launch a Spark app in YARN client mode. I have Hadoop 2.3.0, Spark 0.8.1 and Scala 2.9.3:

    SPARK_HADOOP_VERSION=2.3.0 SPARK_YARN=true sbt/sbt assembly
    SPARK_YARN_MODE=true \
    SPARK_JAR=./assembly/target/scala-2.9.3/spark-assembly-0.8.1-incubating-hadoop2.3.0.jar

unsubscribe

2014-05-05 Thread Konstantin Kudryavtsev
unsubscribe Thank you, Konstantin Kudryavtsev

unsubscribe

2014-05-05 Thread Shubhabrata Roy
unsubscribe

Re: Cache issue for iteration with broadcast

2014-05-05 Thread Earthson
.set("spark.cleaner.ttl", "120") drops broadcast_0, which causes the exception below. It is strange, because broadcast_0 is no longer needed, I have broadcast_3 instead, and the recent RDD is persisted, so there is no need for recomputing... What is the problem? Need help. ~~~ 14/05/05 17:03:12 INFO

Re: Is any idea on architecture based on Spark + Spray + Akka

2014-05-05 Thread Quintus Zhou
Hi Yi, your project sounds interesting to me. I'm also working in the 3G/4G communication domain; besides, I've done a tiny project based on Hadoop which analyzes execution logs. Recently I planned to pick it up again. So, if you don't mind, may I know the introduction of your log analyzing

unsubscribe

2014-05-05 Thread Chhaya Vishwakarma
unsubscribe Regards, Chhaya Vishwakarma

java.io.FileNotFoundException: /test/spark-0.9.1/work/app-20140505053550-0000/2/stdout (No such file or directory)

2014-05-05 Thread Francis . Hu
Hi all, we run a Spark cluster with three workers. We created a Spark Streaming application, then ran the Spark project using the command below: sbt run spark://192.168.219.129:7077 tcp://192.168.20.118:5556 foo We looked at the web UI of the workers; jobs failed without any error or

Re: Cache issue for iteration with broadcast

2014-05-05 Thread Earthson
Using checkpoint. It removes dependencies :) -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Cache-issue-for-iteration-with-broadcast-tp5350p5368.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Cache issue for iteration with broadcast

2014-05-05 Thread Earthson
RDD.checkpoint works fine. But spark.cleaner.ttl is really ugly for broadcast cleaning. Maybe it could be removed automatically when there are no dependencies. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Cache-issue-for-iteration-with-broadcast-tp5350p5369.html

Re: Cache issue for iteration with broadcast

2014-05-05 Thread Cheng Lian
Have you tried Broadcast.unpersist()? On Mon, May 5, 2014 at 6:34 PM, Earthson earthson...@gmail.com wrote: RDD.checkpoint works fine. But spark.cleaner.ttl is really ugly for broadcast cleaning. Maybe it could be removed automatically when there are no dependencies. -- View this message in
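A minimal sketch of the unpersist approach for iterative re-broadcasting; model, f, combine and update are hypothetical names, and Broadcast.unpersist() is available from Spark 1.0:

    var bc = sc.broadcast(model)                  // `model` is hypothetical
    for (i <- 1 to numIterations) {
      val delta = rdd.map(x => f(x, bc.value)).reduce(combine) // hypothetical f/combine
      bc.unpersist()                              // drop the stale copy from the executors
      bc = sc.broadcast(update(model, delta))     // hypothetical update step
    }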

Spark Streaming and JMS

2014-05-05 Thread Patrick McGloin
Hi all, is there a best practice for subscribing to JMS with Spark Streaming? I have searched but not found anything conclusive. In the absence of a standard practice, the solution I was thinking of was to use Akka + Camel (akka.camel.Consumer) to create a subscription for a Spark Streaming

Re: master attempted to re-register the worker and then took all workers as unregistered

2014-05-05 Thread Nan Zhu
Ah, I think this should be fixed in 0.9.1? Did you see the exception thrown on the worker side? Best, -- Nan Zhu On Sunday, May 4, 2014 at 10:15 PM, Cheney Sun wrote: Hi Nan, Have you found a way to fix the issue? Now I run into the same problem with version 0.9.1. Thanks,

Re: Shark on cloudera CDH5 error

2014-05-05 Thread manas Kar
No replies yet. I guess everyone who had this problem knew the obvious reason why the error occurred. It took me some time to figure out the workaround, though. It seems Shark depends on /var/lib/spark/shark-0.9.1/lib_managed/jars/org.apache.hadoop/hadoop-core/hadoop-core.jar for client server

Re: configure spark history server for running on Yarn

2014-05-05 Thread Tom Graves
Since 1.0 is still in development, you can pick up the latest docs in git: https://github.com/apache/spark/tree/branch-1.0/docs I didn't see anywhere that you said you had started the Spark history server. There are multiple things that need to happen for the Spark history server to work. 1)
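As a minimal sketch, the application side needs event logging switched on before the history server has anything to show; the log directory below is a placeholder:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.eventLog.enabled", "true")
      .set("spark.eventLog.dir", "hdfs:///spark-events") // placeholder path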

Spark GCE Script

2014-05-05 Thread Akhil Das
Hi Sparkers, We have created a quick spark_gce script which can launch a spark cluster in the Google Cloud. I'm sharing it because it might be helpful for someone using the Google Cloud for deployment rather than AWS. Here's the link to the script https://github.com/sigmoidanalytics/spark_gce

Re: Using google cloud storage for spark big data

2014-05-05 Thread Akhil Das
Hi Aureliano, You might want to check this script out: https://github.com/sigmoidanalytics/spark_gce Let me know if you need any help around that. Thanks Best Regards On Tue, Apr 22, 2014 at 7:12 PM, Aureliano Buendia buendia...@gmail.com wrote: On Tue, Apr 22, 2014 at 10:50 AM, Andras

Caused by: java.lang.OutOfMemoryError: unable to create new native thread

2014-05-05 Thread Soumya Simanta
I just upgraded my Spark version to 1.0.0_SNAPSHOT (commit f25ebed9f4552bc2c88a96aef06729d9fc2ee5b3, Author: witgo wi...@qq.com, Date: Fri May 2 12:40:27 2014 -0700). I'm running a standalone cluster with 3 workers. Workers: 3, Cores: 48 total (0 used), Memory: 469.8 GB

RE: another updateStateByKey question - updated w possible Spark bug

2014-05-05 Thread Adrian Mocanu
I've encountered this issue again and am able to reproduce it about 10% of the time. 1. Here is the input:

    RDD[ (a, 126232566, 1), (a, 126232566, 2) ]
    RDD[ (a, 126232566, 1), (a, 126232566, 3) ]
    RDD[ (a, 126232566, 3) ]
    RDD[ (a, 126232566, 4) ]
    RDD[ (a, 126232566, 2)

RE: another updateStateByKey question - updated w possible Spark bug

2014-05-05 Thread Adrian Mocanu
Forgot to mention that my batch interval is 1 second: val ssc = new StreamingContext(conf, Seconds(1)), hence the Thread.sleep(1100). From: Adrian Mocanu Subject: RE: another
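For reference, a minimal sketch of the updateStateByKey pattern under discussion, assuming the triples have first been mapped to a pairs: DStream[(String, Int)] and a checkpoint directory has been set via ssc.checkpoint(...):

    // Running sum per key; the state survives across batches.
    val updateFunc = (values: Seq[Int], state: Option[Int]) =>
      Some(values.sum + state.getOrElse(0))
    val totals = pairs.updateStateByKey[Int](updateFunc)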

Comprehensive Port Configuration reference?

2014-05-05 Thread Scott Clasen
Is it documented somewhere how one would go about configuring every open port a Spark application needs? This seems like one of the main things that make running Spark hard in places like EC2 where you aren't using the canned Spark scripts. Starting an app, it looks like you'll see ports open for

Re: Cache issue for iteration with broadcast

2014-05-05 Thread Earthson
Yes, I've tried. The problem is that a new broadcast object is generated at every step, until all of the memory is eaten up. I solved it by using RDD.checkpoint to remove dependencies on the old broadcast objects, and using spark.cleaner.ttl to clean up these broadcast objects automatically. If there's a more elegant way to

Re: Spark GCE Script

2014-05-05 Thread Matei Zaharia
Very cool! Have you thought about sending this as a pull request? We’d be happy to maintain it inside Spark, though it might be interesting to find a single Python package that can manage clusters across both EC2 and GCE. Matei On May 5, 2014, at 7:18 AM, Akhil Das ak...@sigmoidanalytics.com

Problem with sharing class across worker nodes using spark-shell on Spark 1.0.0

2014-05-05 Thread Soumya Simanta
Hi, I'm trying to run a simple Spark job that uses a 3rd-party class (in this case twitter4j.Status) in the spark-shell using spark-1.0.0_SNAPSHOT. I'm starting my bin/spark-shell with the following command: ./spark-shell

Re: Spark GCE Script

2014-05-05 Thread François Le lay
Has anyone considered using jclouds tooling to support multiple cloud providers? Maybe using Pallet? François On May 5, 2014, at 3:22 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: I second this motion. :) A unified cloud deployment tool would be absolutely great. On Mon,

Re: performance improvement on second operation...without caching?

2014-05-05 Thread Diana Carroll
Ethan, you're not the only one, which is why I was asking about this! :-) Matei, thanks for your response. Your answer explains the performance jump in my code, but shows I've missed something key in my understanding of Spark! I was not aware until just now that map output was saved to disk

Re: spark streaming question

2014-05-05 Thread Tathagata Das
One main reason why Spark Streaming can achieve higher throughput than Storm is that Spark Streaming operates in coarser-grained batches - second-scale massive batches - which reduce per-tuple overheads in shuffles and other kinds of data movement. Note that it is also true that

Local Dev Env with Mesos + Spark Streaming on Docker: Can't submit jobs.

2014-05-05 Thread Gerard Maas
Hi all, I'm currently working on creating a set of Docker images to facilitate local development with Spark/Streaming on Mesos (+ZK, HDFS, Kafka). After solving the initial hurdles to get things working together in Docker containers, everything now seems to start up correctly, and the Mesos UI

Re: Spark Streaming and JMS

2014-05-05 Thread Tathagata Das
A few high-level suggestions. 1. I recommend using the new Receiver API in almost-released Spark 1.0 (see branch-1.0 / master branch on github). It's a slightly better version of the earlier NetworkReceiver, as it hides away the BlockGenerator (which needed to be unnecessarily manually started and
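A minimal sketch of a custom receiver against the Spark 1.0 Receiver API, with the JMS plumbing elided; the class and parameter names are illustrative:

    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.receiver.Receiver

    class JmsReceiver(brokerUrl: String, queueName: String)
      extends Receiver[String](StorageLevel.MEMORY_AND_DISK) {

      def onStart(): Unit = {
        // Consume JMS messages on a background thread; store() hands each
        // message to Spark Streaming for block generation.
        new Thread("jms-receiver") {
          override def run(): Unit = {
            // ... connect to brokerUrl / queueName via JMS (elided) ...
            // for each received TextMessage m: store(m.getText)
          }
        }.start()
      }

      def onStop(): Unit = {
        // ... close the JMS connection (elided) ...
      }
    }

It would then be plugged in with ssc.receiverStream(new JmsReceiver(...)).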

Re: Local Dev Env with Mesos + Spark Streaming on Docker: Can't submit jobs.

2014-05-05 Thread Gerard Maas
Hi Benjamin, Yes, we initially used a modified version of the AmpLab docker scripts [1]. The AmpLab docker images are a good starting point. One of the biggest hurdles has been HDFS, which requires reverse DNS, and I didn't want to go the dnsmasq route, to keep the containers relatively simple to

Spark 0.9.1 - saveAsSequenceFile and large RDD

2014-05-05 Thread Allen Lee
Hi, fairly new to Spark. I'm using Spark's saveAsSequenceFile() to write large sequence files to HDFS. The sequence files need to be large to be efficiently accessed in HDFS, preferably larger than Hadoop's block size, 64 MB. The task works for files smaller than 64 MB (with a warning for
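A minimal sketch of the call in question, assuming an RDD of key/value pairs whose types have implicit Writable conversions (the path is a placeholder):

    // Int keys and String values are converted to IntWritable/Text implicitly.
    val pairs = sc.parallelize(1 to 1000000).map(i => (i, i.toString))
    pairs.saveAsSequenceFile("hdfs:///tmp/large-seqfile") // placeholder path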

Re: Incredible slow iterative computation

2014-05-05 Thread Andrea Esposito
Update: the checkpointing doesn't happen, it seems. I checked with the isCheckpointed method but it always returns false. ??? 2014-05-05 23:14 GMT+02:00 Andrea Esposito and1...@gmail.com: Checkpoint doesn't help, it seems. I do it at each iteration/superstep. Looking deeper, the RDDs are recomputed just

Re: Incredible slow iterative computation

2014-05-05 Thread Earthson
checkpoint seems to just add a checkpoint mark? You need an action after marking it. I have tried it with success :)

    val newRdd = oldRdd.map(myFun).persist(myStorageLevel)
    newRdd.checkpoint       // checkpoint here
    newRdd.isCheckpointed   // false here
    newRdd.foreach(x => {}) // force evaluation
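One detail worth adding (not in the original message): RDD checkpointing only takes effect once a checkpoint directory has been set on the context, e.g.:

    sc.setCheckpointDir("hdfs:///tmp/checkpoints") // placeholder path; call before checkpoint()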

答复: java.io.FileNotFoundException: /test/spark-0.9.1/work/app-20140505053550-0000/2/stdout (No such file or directory)

2014-05-05 Thread Francis . Hu
The file does not exist, in fact, and there is no permission issue.

    francis@ubuntu-4:/test/spark-0.9.1$ ll work/app-20140505053550-/
    total 24
    drwxrwxr-x  6 francis francis 4096 May 5 05:35 ./
    drwxrwxr-x 11 francis francis 4096 May 5 06:18 ../
    drwxrwxr-x  2 francis francis 4096 May 5 05:35 2/

details about event log

2014-05-05 Thread wxhsdp
Hi, I'm looking at the event log, and I'm a little confused about some metrics. Here's the info of one task:

    Launch Time: 1399336904603
    Finish Time: 1399336906465
    Executor Run Time: 1781
    Shuffle Read Metrics: Shuffle Finish Time: 1399336906027, Fetch Wait Time: 0
    Shuffle Write Metrics: {Shuffle Bytes

RE: sbt/sbt run command returns a JVM problem

2014-05-05 Thread Carter
Hi, I still have over 1 GB left for my program. The earlier reply (Sun, 4 May 2014) asked: the total memory of your machine is 2G, right? Then how much memory is left free?

Re: How to use spark-submit

2014-05-05 Thread Soumya Simanta
Yes, I'm struggling with a similar problem where my classes are not found on the worker nodes. I'm using 1.0.0_SNAPSHOT. I would really appreciate it if someone could provide some documentation on the usage of spark-submit. Thanks On May 5, 2014, at 10:24 PM, Stephen Boesch java...@gmail.com

Can I share RDD between a pyspark and spark API

2014-05-05 Thread manas Kar
Hi experts, I have some pre-built Python parsers that I am planning to use, just because I don't want to write them again in Scala. However, after the data is parsed, I would like to take the RDD and use it in a Scala program. (Yes, I like Scala more than Python and am more comfortable in Scala :) In
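A minimal sketch of one common workaround, assuming the Python job has already written its parsed output to HDFS as tab-separated text (the path and record format are placeholders):

    // Scala side: re-create an RDD from the Python job's output.
    val parsed = sc.textFile("hdfs:///tmp/parsed-output")
      .map(_.split("\t")) // assuming tab-separated fields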

about broadcast

2014-05-05 Thread randylu
In my code, there are two broadcast variables. Sometimes reading the small one takes more time than the big one, so strange! The log on a slave node is as follows:

    Block broadcast_2 stored as values to memory (estimated size 4.0 KB, free 17.2 GB)
    Reading broadcast variable 2 took 9.998537123 s

Re: about broadcast

2014-05-05 Thread randylu
Additionally, reading the big broadcast variable always took about 2 s. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/about-broadcast-tp5416p5417.html Sent from the Apache Spark User List mailing list archive at Nabble.com.