Re: Akka Connection refused - standalone cluster using spark-0.9.0

2014-05-28 Thread Gino Bustelo
I've been playing with the amplab docker scripts and I needed to set 
spark.driver.host to the driver host's IP, one that all Spark processes can get 
to. 
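
For anyone hitting the same thing, here is a minimal sketch of the kind of setup
that worked for me. The master URL and the IP are placeholders; the point is only
that spark.driver.host must be an address every Spark process can reach.

  import org.apache.spark.{SparkConf, SparkContext}

  // Sketch only: point spark.driver.host at an address the master, workers and
  // executors can all reach. "172.17.0.2" stands in for your driver host's IP.
  val conf = new SparkConf()
    .setMaster("spark://spark-master:7077")     // placeholder master URL
    .setAppName("driver-host-example")
    .set("spark.driver.host", "172.17.0.2")

  val sc = new SparkContext(conf)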

> On May 28, 2014, at 4:35 AM, jaranda  wrote:
> 
> Same here, got stuck at this point. Any hints on what might be going on?
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Akka-Connection-refused-standalone-cluster-using-spark-0-9-0-tp1297p6463.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: how to construct a ClassTag object as a method parameter in Java

2014-06-03 Thread Gino Bustelo
A better way seems to be to use ClassTag$.apply(Class). 

I'm going by memory since I'm on my phone, but I just did that today. 

Gino B.
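
For reference, a rough sketch of the two approaches mentioned in this thread, in
Scala. From Java the companion object shows up as scala.reflect.ClassTag$.MODULE$,
so the equivalent calls are ClassTag$.MODULE$.apply(String.class) and
ClassTag$.MODULE$.AnyRef().

  import scala.reflect.ClassTag

  // A tag built from a concrete class (what ClassTag$.MODULE$.apply(String.class)
  // gives you on the Java side).
  val stringTag: ClassTag[String] = ClassTag(classOf[String])

  // The "fake" tag Michael mentions; it erases everything to AnyRef, which was
  // apparently enough as a workaround for the pre-1.0 Java API.
  val anyRefTag: ClassTag[AnyRef] = ClassTag.AnyRef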

> On Jun 3, 2014, at 11:04 AM, Michael Armbrust  wrote:
> 
> Ah, this is a bug that was fixed in 1.0.
> 
> I think you should be able to workaround it by using a fake class tag: 
> scala.reflect.ClassTag$.MODULE$.AnyRef()
> 
> 
>> On Mon, Jun 2, 2014 at 8:22 PM, bluejoe2008  wrote:
>> spark 0.9.1
>> textInput is a JavaRDD object
>> i am programming in Java
>>  
>> 2014-06-03
>> bluejoe2008
>>  
>> From: Michael Armbrust
>> Date: 2014-06-03 10:09
>> To: user
>> Subject: Re: how to construct a ClassTag object as a method parameter in Java
>> What version of Spark are you using?  Also are you sure the type of 
>> textInput is a JavaRDD and not an RDD?
>> 
>> It looks like the 1.0 Java API does not require a class tag.
>> 
>> 
>>> On Mon, Jun 2, 2014 at 5:59 PM, bluejoe2008  wrote:
>>> hi,all
>>> i am programming with Spark in Java, and now i have a question:
>>> when i made a method call on a JavaRDD such as:
>>>  
>>> textInput.mapPartitionsWithIndex(
>>> new Function2<Integer, Iterator<String>, Iterator<String>>()
>>> {...},
>>> false,
>>> PARAM3
>>> );
>>>  
>>> what value should i pass as the PARAM3 parameter?
>>> it is required as a ClassTag value, then how can i define such a value in 
>>> Java? i really have no idea...
>>>  
>>> best regards,
>>> bluejoe2008
> 


Re: Interactive modification of DStreams

2014-06-03 Thread Gino Bustelo
Thanks for the reply. Are there plans to allow these runtime interactions with a 
DStream context? On the surface they seem doable. What is preventing this from 
working?

Also... I implemented the modifiable window DStream and it seemed to work well. 
Thanks for the pointer. 

Gino B.
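
For anyone following along, here is a rough, untested sketch of TD's first
suggestion quoted below: stop the old streaming context and rebuild the DStream
graph with a new window duration. The source, host/port and buildContext helper
are placeholders.

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  val sc = new SparkContext(new SparkConf().setAppName("dynamic-window"))

  // Hypothetical helper: rebuild the whole streaming graph for a given window size.
  def buildContext(windowSecs: Long): StreamingContext = {
    val ssc = new StreamingContext(sc, Seconds(1))
    val lines = ssc.socketTextStream("localhost", 9999)   // placeholder source
    lines.window(Seconds(windowSecs), Seconds(windowSecs)).count().print()
    ssc
  }

  var ssc = buildContext(60)
  ssc.start()
  // ... later, when the window duration needs to change:
  ssc.stop(stopSparkContext = false)   // 1.0 API; keeps the SparkContext alive
  ssc = buildContext(120)
  ssc.start()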

> On Jun 2, 2014, at 7:14 PM, Tathagata Das  wrote:
> 
> Currently Spark Streaming does not support addition/deletion/modification of 
> DStream after the streaming context has been started. 
> Nor can you restart a stopped streaming context. 
> Also, multiple spark contexts (and therefore multiple streaming contexts) 
> cannot be run concurrently in the same JVM. 
> 
> To change the window duration, I would do one of the following.
> 
> 1. Stop the previous streaming context, create a new streaming context, and 
> setup the dstreams once again with the new window duration
> 2. Create a custom DStream, say DynamicWindowDStream. Take a look at how 
> WindowedDStream is implemented (pretty simple, just a union over RDDs across 
> time). That should allow you to modify the window duration. However, do make 
> sure you have a maximum window duration that you will never exceed, and make 
> sure you define parentRememberDuration as "rememberDuration + 
> maxWindowDuration". That field defines which RDDs can be forgotten, so it is 
> sensitive to the window duration. Then you have to take care of correctly 
> (atomically, etc.) modifying the window duration as per your requirements.
> 
> Happy streaming!
> 
> TD
> 
> 
> 
> 
>> On Mon, Jun 2, 2014 at 2:46 PM, lbustelo  wrote:
>> This is a general question about whether Spark Streaming can be interactive
>> like batch Spark jobs. I've read plenty of threads and done my fair bit of
>> experimentation and I'm thinking the answer is NO, but it does not hurt to
>> ask.
>> 
>> More specifically, I would like to be able to do:
>> 1. Add/Remove steps to the Streaming Job
>> 2. Modify Window durations
>> 3. Stop and Restart context.
>> 
>> I've tried the following:
>> 
>> 1. Modify the DStream after it has been started… BOOM! Exceptions
>> everywhere.
>> 
>> 2. Stop the DStream, Make modification, Start… NOT GOOD :( In 0.9.0 I was
>> getting deadlocks. I also tried 1.0.0 and it did not work.
>> 
>> 3. Based on information provided here, I was able to prototype modifying the
>> RDD computation within a forEachRDD. That is nice, but you are then bound to
>> the specified batch size. That got me wanting to modify window durations. Is
>> changing the window duration possible?
>> 
>> 4. Tried running multiple streaming contexts from within a single driver
>> application and got several exceptions. The first one was a bind exception on
>> the web port. Then, once the app started running (cores were taken by the
>> 1st job), it did not run correctly: a lot of
>> "akka.pattern.AskTimeoutException: Timed out".
>> 
>> I've tried my experiments in 0.9.0, 0.9.1 and 1.0.0 running on Standalone
>> Cluster setup.
>> Thanks in advance
>> 
>> 
>> 
>> --
>> View this message in context: 
>> http://apache-spark-user-list.1001560.n3.nabble.com/Interactive-modification-of-DStreams-tp6740.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 


Re: New user streaming question

2014-06-07 Thread Gino Bustelo
I would make sure that your workers are running. It is very difficult to tell 
from the console dribble whether you just have no data or the workers have 
disassociated from the master. 

Gino B.
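
For comparison, this is roughly the minimal netcat word-count setup being
discussed below, with "nc -lk 9999" feeding it. Host and port are the usual
example values. One common gotcha: run with at least two local cores (e.g.
--master local[2]), otherwise the receiver takes the only core and nothing is
ever processed or printed.

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}
  import org.apache.spark.streaming.StreamingContext._   // pair DStream ops (pre-1.3)

  val conf = new SparkConf().setAppName("netcat-wordcount")
  val ssc = new StreamingContext(conf, Seconds(1))

  val lines = ssc.socketTextStream("localhost", 9999)
  val counts = lines.flatMap(_.split(" ")).map(w => (w, 1)).reduceByKey(_ + _)
  counts.print()   // prints a "Time: ..." block every batch, even when empty

  ssc.start()
  ssc.awaitTermination()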

> On Jun 6, 2014, at 11:32 PM, Jeremy Lee  
> wrote:
> 
> Yup, when it's running, DStream.print() will print out a timestamped block 
> for every time step, even if the block is empty. (for v1.0.0, which I have 
> running in the other window)
> 
> If you're not getting that, I'd guess the stream hasn't started up properly. 
> 
> 
>> On Sat, Jun 7, 2014 at 11:50 AM, Michael Campbell 
>>  wrote:
>> I've been playing with spark and streaming and have a question on stream 
>> outputs.  The symptom is I don't get any.
>> 
>> I have run spark-shell and all does as I expect, but when I run the 
>> word-count example with streaming, it *works* in that things happen and 
>> there are no errors, but I never get any output.  
>> 
>> Am I understanding how it is supposed to work correctly?  Is the 
>> Dstream.print() method supposed to print the output for every (micro)batch 
>> of the streamed data?  If that's the case, I'm not seeing it.
>> 
>> I'm using the "netcat" example and the StreamingContext uses the network to 
>> read words, but as I said, nothing comes out. 
>> 
>> I tried changing the .print() to .saveAsTextFiles(), and I AM getting a 
>> file, but nothing is in it other than a "_temporary" subdir.
>> 
>> I'm sure I'm confused here, but not sure where.  Help?
> 
> 
> 
> -- 
> Jeremy Lee  BCompSci(Hons)
>   The Unorthodox Engineers


Re: Best practise for 'Streaming' dumps?

2014-06-07 Thread Gino Bustelo
Have you thought of using window?

Gino B.

> On Jun 6, 2014, at 11:49 PM, Jeremy Lee  
> wrote:
> 
> 
> It's going well enough that this is a "how should I in 1.0.0" rather than 
> "how do i" question.
> 
> So I've got data coming in via Streaming (twitters) and I want to archive/log 
> it all. It seems a bit wasteful to generate a new HDFS file for each DStream, 
> but also I want to guard against data loss from crashes,
> 
> I suppose what I want is to let things build up into "superbatches" over a 
> few minutes, and then serialize those to parquet files, or similar? Or do i?
> 
> Do I count-down the number of DStreams, or does Spark have a preferred way of 
> scheduling cron events?
> 
> What's the best practise for keeping persistent data for a streaming app? 
> (Across restarts) And to clean up on termination?
> 
> 
> -- 
> Jeremy Lee  BCompSci(Hons)
>   The Unorthodox Engineers


Re: Best practise for 'Streaming' dumps?

2014-06-08 Thread Gino Bustelo
Yeah... Have not tried it, but if you set the slidingDuration == windowDuration 
that should prevent overlaps. 

Gino B.
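
Untested, but the idea would look something like this: a 5-minute tumbling
window that gets dumped once per window. The save path is a placeholder, and
tweets stands in for whatever DStream the app already has.

  import org.apache.spark.streaming.Minutes
  import org.apache.spark.streaming.dstream.DStream

  def archive(tweets: DStream[String]): Unit = {
    // windowDuration == slideDuration gives non-overlapping "superbatches".
    tweets
      .window(Minutes(5), Minutes(5))
      .saveAsTextFiles("hdfs:///archive/tweets")   // placeholder; one dir per window
  }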

> On Jun 8, 2014, at 8:25 AM, Jeremy Lee  wrote:
> 
> I read it more carefully, and window() might actually work for some other 
> stuff like logs. (assuming I can have multiple windows with entirely 
> different attributes on a single stream..) 
> 
> Thanks for that!
> 
> 
>> On Sun, Jun 8, 2014 at 11:11 PM, Jeremy Lee  
>> wrote:
>> Yes.. but from what I understand that's a "sliding window" so for a window 
>> of (60) over (1) second DStreams, that would save the entire last minute of 
>> data once per second. That's more than I need.
>> 
>> I think what I'm after is probably updateStateByKey... I want to mutate data 
>> structures (probably even graphs) as the stream comes in, but I also want 
>> that state to be persistent across restarts of the application, (Or parallel 
>> version of the app, if possible) So I'd have to save that structure 
>> occasionally and reload it as the "primer" on the next run.
>> 
>> I was almost going to use HBase or Hive, but they seem to have been 
>> deprecated in 1.0.0? Or just late to the party?
>> 
>> Also, I've been having trouble deleting hadoop directories.. the old "two 
>> line" examples don't seem to work anymore. I actually managed to fill up the 
>> worker instances (I gave them tiny EBS) and I think I crashed them.
>> 
>> 
>> 
>>> On Sat, Jun 7, 2014 at 10:23 PM, Gino Bustelo  wrote:
>>> Have you thought of using window?
>>> 
>>> Gino B.
>>> 
>>> > On Jun 6, 2014, at 11:49 PM, Jeremy Lee  
>>> > wrote:
>>> >
>>> >
>>> > It's going well enough that this is a "how should I in 1.0.0" rather than 
>>> > "how do i" question.
>>> >
>>> > So I've got data coming in via Streaming (twitters) and I want to 
>>> > archive/log it all. It seems a bit wasteful to generate a new HDFS file 
>>> > for each DStream, but also I want to guard against data loss from crashes,
>>> >
>>> > I suppose what I want is to let things build up into "superbatches" over 
>>> > a few minutes, and then serialize those to parquet files, or similar? Or 
>>> > do i?
>>> >
>>> > Do I count-down the number of DStreams, or does Spark have a preferred 
>>> > way of scheduling cron events?
>>> >
>>> > What's the best practise for keeping persistent data for a streaming app? 
>>> > (Across restarts) And to clean up on termination?
>>> >
>>> >
>>> > --
>>> > Jeremy Lee  BCompSci(Hons)
>>> >   The Unorthodox Engineers
>> 
>> 
>> 
>> -- 
>> Jeremy Lee  BCompSci(Hons)
>>   The Unorthodox Engineers
> 
> 
> 
> -- 
> Jeremy Lee  BCompSci(Hons)
>   The Unorthodox Engineers
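
On the updateStateByKey idea above, a rough sketch of how that usually looks,
with checkpointing enabled so the running state can survive a driver restart
(the checkpoint path and the per-key state type are placeholders):

  import org.apache.spark.streaming.StreamingContext
  import org.apache.spark.streaming.StreamingContext._   // pair DStream ops (pre-1.3)
  import org.apache.spark.streaming.dstream.DStream

  // Hypothetical: keep a running count per key across batches.
  def runningCounts(pairs: DStream[(String, Int)],
                    ssc: StreamingContext): DStream[(String, Long)] = {
    // State (plus the metadata needed to rebuild it) is persisted here; with
    // StreamingContext.getOrCreate the app can pick it back up after a restart.
    ssc.checkpoint("hdfs:///checkpoints/myapp")   // placeholder path

    pairs.updateStateByKey[Long] { (values: Seq[Int], state: Option[Long]) =>
      Some(state.getOrElse(0L) + values.sum)
    }
  }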


Re: Master not seeing recovered nodes ("Got heartbeat from unregistered worker ....")

2014-06-13 Thread Gino Bustelo
I get the same problem, but I'm running in a dev environment based on
docker scripts. The additional issue is that the worker processes do not
die and so the docker container does not exit. So I end up with worker
containers that are not participating in the cluster.


On Fri, Jun 13, 2014 at 9:44 AM, Mayur Rustagi 
wrote:

> I have also had trouble with workers joining the working set. I have
> typically moved to a Mesos-based setup. Frankly, for high availability you are
> better off using a cluster manager.
>
> Mayur Rustagi
> Ph: +1 (760) 203 3257
> http://www.sigmoidanalytics.com
> @mayur_rustagi 
>
>
>
> On Fri, Jun 13, 2014 at 8:57 AM, Yana Kadiyska 
> wrote:
>
>> Hi, I see this has been asked before but has not gotten any satisfactory
>> answer so I'll try again:
>>
>> (here is the original thread I found:
>> http://mail-archives.apache.org/mod_mbox/spark-user/201403.mbox/%3c1394044078706-2312.p...@n3.nabble.com%3E
>> )
>>
>> I have a set of workers dying and coming back again. The master prints
>> the following warning:
>>
>> "Got heartbeat from unregistered worker "
>>
>> What is the solution to this -- rolling the master is very undesirable to
>> me as I have a Shark context sitting on top of it (it's meant to be highly
>> available).
>>
>> Insights appreciated -- I don't think an executor going down is very
>> unexpected but it does seem odd that it won't be able to rejoin the working
>> set.
>>
>> I'm running Spark 0.9.1 on CDH
>>
>>
>>
>


Re: spark streaming, kafka, SPARK_CLASSPATH

2014-06-16 Thread Gino Bustelo
+1 for this issue. The documentation for spark-submit is misleading. Among many 
issues, the jar support is bad. HTTP URLs do not work. This is because Spark is 
using Hadoop's FileSystem class. You have to specify the jars twice to get 
things to work: once for the DriverWrapper to load your classes, and a 2nd time 
in the Context to distribute them to workers. 

I would like to see some contrib response to this issue. 

Gino B.
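
To make the "specify the jars twice" point concrete, this is roughly what we
ended up doing (the path matches the /tmp/myjar.jar from the thread below, but it
is just an example): pass the assembly to spark-submit as usual so the driver can
load it, and also list it on the SparkConf so the context ships it to the
executors.

  import org.apache.spark.{SparkConf, SparkContext}

  // In addition to whatever --jars / application-jar was given to spark-submit,
  // we had to list the assembly again here so it gets distributed to workers.
  val conf = new SparkConf()
    .setAppName("streaming-kafka-job")
    .setJars(Seq("/tmp/myjar.jar"))   // same assembly referenced in the submit command

  val sc = new SparkContext(conf)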

> On Jun 16, 2014, at 1:49 PM, Luis Ángel Vicente Sánchez 
>  wrote:
> 
> Did you manage to make it work? I'm facing similar problems and this is a 
> serious blocker issue. spark-submit seems kind of broken to me if you can't use 
> it for spark-streaming.
> 
> Regards,
> 
> Luis
> 
> 
> 2014-06-11 1:48 GMT+01:00 lannyripple :
>> I am using Spark 1.0.0 compiled with Hadoop 1.2.1.
>> 
>> I have a toy spark-streaming-kafka program.  It reads from a kafka queue and
>> does
>> 
>> stream
>>   .map {case (k, v) => (v, 1)}
>>   .reduceByKey(_ + _)
>>   .print()
>> 
>> using a 1 second interval on the stream.
>> 
>> The docs say to make Spark and Hadoop jars 'provided' but this breaks for
>> spark-streaming.  Including spark-streaming (and spark-streaming-kafka) as
>> 'compile' to sweep them into our assembly gives collisions on javax.*
>> classes.  To work around this I modified
>> $SPARK_HOME/bin/compute-classpath.sh to include spark-streaming,
>> spark-streaming-kafka, and zkclient.  (Note that kafka is included as
>> 'compile' in my project and picked up in the assembly.)
>> 
>> I have set up conf/spark-env.sh as needed.  I have copied my assembly to
>> /tmp/myjar.jar on all spark hosts and to my hdfs /tmp/jars directory.  I am
>> running spark-submit from my spark master.  I am guided by the information
>> here https://spark.apache.org/docs/latest/submitting-applications.html
>> 
>> Well at this point I was going to detail all the ways spark-submit fails to
>> follow it's own documentation.  If I do not invoke sparkContext.setJars()
>> then it just fails to find the driver class.  This is using various
>> combinations of absolute path, file:, hdfs: (Warning: Skip remote jar)??,
>> and local: prefixes on the application-jar and --jars arguments.
>> 
>> If I invoke sparkContext.setJars() and include my assembly jar I get
>> further.  At this point I get a failure from
>> kafka.consumer.ConsumerConnector not being found.  I suspect this is because
>> spark-streaming-kafka needs the Kafka dependency it but my assembly jar is
>> too late in the classpath.
>> 
>> At this point I try setting spark.files.userClassPathFirst to 'true' but
>> this causes more things to blow up.
>> 
>> I finally found something that works.  Namely setting environment variable
>> SPARK_CLASSPATH=/tmp/myjar.jar  But silly me, this is deprecated and I'm
>> helpfully informed to
>> 
>>   Please instead use:
>>- ./spark-submit with --driver-class-path to augment the driver classpath
>>- spark.executor.extraClassPath to augment the executor classpath
>> 
>> which when put into a file and introduced with --properties-file does not
>> work.  (Also tried spark.files.userClassPathFirst here.)  These fail with
>> the kafka.consumer.ConsumerConnector error.
>> 
>> At a guess what's going on is that using SPARK_CLASSPATH I have my assembly
>> jar in the classpath at SparkSubmit invocation
>> 
>>   Spark Command: java -cp
>> /tmp/myjar.jar::/opt/spark/conf:/opt/spark/lib/spark-assembly-1.0.0-hadoop1.2.1.jar:/opt/spark/lib/spark-streaming_2.10-1.0.0.jar:/opt/spark/lib/spark-streaming-kafka_2.10-1.0.0.jar:/opt/spark/lib/zkclient-0.4.jar
>> -XX:MaxPermSize=128m -Djava.library.path= -Xms512m -Xmx512m
>> org.apache.spark.deploy.SparkSubmit --class me.KafkaStreamingWC
>> /tmp/myjar.jar
>> 
>> but using --properties-file then the assembly is not available for
>> SparkSubmit.
>> 
>> I think the root cause is either spark-submit not handling the
>> spark-streaming libraries so they can be 'provided' or the inclusion of
>> org.eclipse.jetty.orbit in the streaming libraries, which causes
>> 
>>   [error] (*:assembly) deduplicate: different file contents found in the
>> following:
>>   [error]
>> /Users/lanny/.ivy2/cache/org.eclipse.jetty.orbit/javax.transaction/orbits/javax.transaction-1.1.1.v201105210645.jar:META-INF/ECLIPSEF.RSA
>>   [error]
>> /Users/lanny/.ivy2/cache/org.eclipse.jetty.orbit/javax.servlet/orbits/javax.servlet-3.0.0.v201112011016.jar:META-INF/ECLIPSEF.RSA
>>   [error]
>> /Users/lanny/.ivy2/cache/org.eclipse.jetty.orbit/javax.mail.glassfish/orbits/javax.mail.glassfish-1.4.1.v201005082020.jar:META-INF/ECLIPSEF.RSA
>>   [error]
>> /Users/lanny/.ivy2/cache/org.eclipse.jetty.orbit/javax.activation/orbits/javax.activation-1.1.0.v201105071233.jar:META-INF/ECLIPSEF.RSA
>> 
>> I've tried applying mergeStrategy in assembly for my assembly.sbt but then I
>> get
>> 
>>   Invalid signature file digest for Manifest main attributes
>> 
>> If anyone knows the magic to get this working a reply would be greatly
>> appreciated

Re: spark streaming, kafka, SPARK_CLASSPATH

2014-06-17 Thread Gino Bustelo
>>> com.esotericsoftware.minlog", "minlog")
>>> )
>>> 
>>> mergeStrategy in assembly <<= (mergeStrategy in assembly) { (old) =>
>>>   {
>>>     case x if x.startsWith("META-INF/ECLIPSEF.RSA") => MergeStrategy.last
>>>     case x if x.startsWith("META-INF/mailcap") => MergeStrategy.last
>>>     case x if x.startsWith("plugin.properties") => MergeStrategy.last
>>>     case x => old(x)
>>>   }
>>> }
>>> 
>>> 
>>> You can see the "exclude()" has to go around the spark-streaming-kafka 
>>> dependency, and I've used a MergeStrategy to solve the "deduplicate: 
>>> different file contents found in the following" errors.
>>> 
>>> Build the JAR with sbt assembly and use the scripts in bin/ to run the 
>>> examples.
>>> 
>>> I'm using this same approach to run my Spark Streaming jobs with 
>>> spark-submit and have them managed using Mesos/Marathon to handle failures 
>>> and restarts with long running processes.
>>> 
>>> Good luck!
>>> 
>>> MC
>>> 
>>> 
>>> 
>>> 
>>> 
>>> Michael Cutler
>>> Founder, CTO
>>> 
>>> 
>>> Mobile: +44 789 990 7847
>>> Email:   mich...@tumra.com
>>> Web: tumra.com
>>> Visit us at our offices in Chiswick Park
>>> Registered in England & Wales, 07916412. VAT No. 130595328
>>> 
>>> 
>>> This email and any files transmitted with it are confidential and may also 
>>> be privileged. It is intended only for the person to whom it is addressed. 
>>> If you have received this email in error, please inform the sender 
>>> immediately. If you are not the intended recipient you must not use, 
>>> disclose, copy, print, distribute or rely on this email.
>>> 
>>> 
>>>> On 17 June 2014 02:51, Gino Bustelo  wrote:
>>>> +1 for this issue. The documentation for spark-submit is misleading. Among 
>>>> many issues, the jar support is bad. HTTP URLs do not work. This is 
>>>> because Spark is using Hadoop's FileSystem class. You have to specify the 
>>>> jars twice to get things to work: once for the DriverWrapper to load your 
>>>> classes, and a 2nd time in the Context to distribute them to workers. 
>>>> 
>>>> I would like to see some contrib response to this issue. 
>>>> 
>>>> Gino B.
>>>> 
>>>>> On Jun 16, 2014, at 1:49 PM, Luis Ángel Vicente Sánchez 
>>>>>  wrote:
>>>>> 
>>>>> Did you manage to make it work? I'm facing similar problems and this is a 
>>>>> serious blocker issue. spark-submit seems kind of broken to me if you 
>>>>> can't use it for spark-streaming.
>>>>> 
>>>>> Regards,
>>>>> 
>>>>> Luis
>>>>> 
>>>>> 
>>>>> 2014-06-11 1:48 GMT+01:00 lannyripple :
>>>>>> I am using Spark 1.0.0 compiled with Hadoop 1.2.1.
>>>>>> 
>>>>>> I have a toy spark-streaming-kafka program.  It reads from a kafka queue 
>>>>>> and
>>>>>> does
>>>>>> 
>>>>>> stream
>>>>>>   .map {case (k, v) => (v, 1)}
>>>>>>   .reduceByKey(_ + _)
>>>>>>   .print()
>>>>>> 
>>>>>> using a 1 second interval on the stream.
>>>>>> 
>>>>>> The docs say to make Spark and Hadoop jars 'provided' but this breaks for
>>>>>> spark-streaming.  Including spark-streaming (and spark-streaming-kafka) 
>>>>>> as
>>>>>> 'compile' to sweep them into our assembly gives collisions on javax.*
>>>>>> classes.  To work around this I modified
>>>>>> $SPARK_HOME/bin/compute-classpath.sh to include spark-streaming,
>>>>>> spark-streaming-kafka, and zkclient.  (Note that kafka is included as
>>>>>> 'compile' in my project and picked up in the assembly.)
>>>>>> 
>>>>>> I have set up conf/spark-

Re: problem about cluster mode of spark 1.0.0

2014-06-20 Thread Gino Bustelo
I've found that the jar will be copied to the worker from HDFS fine, but it is 
not added to the spark context for you. You have to know that the jar will end 
up in the driver's working dir, and so you just add the file name of the jar 
to the context in your program. 

In your example below, just add to the context "test.jar". 

Btw, the context will not have the master URL either, so add that while you are 
at it. 
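
Concretely, something like this in the driver program (the master URL is a
placeholder; test.jar is the file name from your example):

  import org.apache.spark.{SparkConf, SparkContext}

  // The jar ends up in the driver's working directory, so the bare file name is
  // enough; the master URL also has to be set explicitly in this mode.
  val conf = new SparkConf()
    .setAppName("cluster-mode-job")
    .setMaster("spark://spark-master:7077")   // placeholder master URL
    .setJars(Seq("test.jar"))

  val sc = new SparkContext(conf)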

This is a big issue. I posted about it a week ago and got no replies. Hopefully 
it gets more attention as more people start hitting this. Basically, 
spark-submit on a standalone cluster with cluster deploy mode is broken. 

Gino B.

> On Jun 20, 2014, at 2:46 AM, randylu  wrote:
> 
> in addition, jar file can be copied to driver node automatically.
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/problem-about-cluster-mode-of-spark-1-0-0-tp7982p7984.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: problem about cluster mode of spark 1.0.0

2014-06-24 Thread Gino Bustelo
Andrew,

Thanks for your answer. It validates our finding. Unfortunately, client mode 
assumes that I'm running on a "privileged" node. What I mean by privileged is a 
node that has network access to all the workers and vice versa. This is a big 
assumption to make, and unreasonable in certain circumstances.

I'd much rather have a single point of contact, like a job server (such as 
Ooyala's), that handles jar uploading and manages driver lifecycles. I think 
these are basic requirements for standalone clusters. 

Gino B.

> On Jun 24, 2014, at 1:32 PM, Andrew Or  wrote:
> 
> Hi Randy and Gino,
> 
> The issue is that standalone-cluster mode is not officially supported. Please 
> use standalone-client mode instead, i.e. specify --deploy-mode client in 
> spark-submit, or simply leave out this config because it defaults to client 
> mode.
> 
> Unfortunately, this is not currently documented anywhere, and the existing 
> explanation for the distinction between cluster and client modes is highly 
> misleading. In general, cluster mode means the driver runs on one of the 
> worker nodes, just like the executors. The corollary is that the output of 
> the application is not forwarded to the command that launched the application 
> (spark-submit in this case), but is accessible instead through the worker 
> logs. In contrast, client mode means the command that launches the 
> application also launches the driver, while the executors still run on the 
> worker nodes. This means the spark-submit command also returns the output of 
> the application. For instance, it doesn't make sense to run the Spark shell 
> in cluster mode, because the stdin / stdout / stderr will not be redirected 
> to the spark-submit command.
> 
> If you are hosting your own cluster and can launch applications from within 
> the cluster, then there is little benefit for launching your application in 
> cluster mode, which is primarily intended to cut down the latency between the 
> driver and the executors in the first place. However, if you are still intent 
> on using standalone-cluster mode after all, you can use the deprecated way of 
> launching org.apache.spark.deploy.Client directly through bin/spark-class. 
> Note that this is not recommended and only serves as a temporary workaround 
> until we fix standalone-cluster mode through spark-submit.
> 
> I have filed the relevant issues: 
> https://issues.apache.org/jira/browse/SPARK-2259 and 
> https://issues.apache.org/jira/browse/SPARK-2260. Thanks for pointing this 
> out, and we will get to fixing these shortly.
> 
> Best,
> Andrew
> 
> 
> 2014-06-20 6:06 GMT-07:00 Gino Bustelo :
>> I've found that the jar will be copied to the worker from HDFS fine, but it 
>> is not added to the spark context for you. You have to know that the jar 
>> will end up in the driver's working dir, and so you just add the file name 
>> of the jar to the context in your program.
>> 
>> In your example below, just add to the context "test.jar".
>> 
>> Btw, the context will not have the master URL either, so add that while you 
>> are at it.
>> 
>> This is a big issue. I posted about it a week ago and got no replies. 
>> Hopefully it gets more attention as more people start hitting this. 
>> Basically, spark-submit on a standalone cluster with cluster deploy mode is 
>> broken.
>> 
>> Gino B.
>> 
>> > On Jun 20, 2014, at 2:46 AM, randylu  wrote:
>> >
>> > in addition, jar file can be copied to driver node automatically.
>> >
>> >
>> >
>> > --
>> > View this message in context: 
>> > http://apache-spark-user-list.1001560.n3.nabble.com/problem-about-cluster-mode-of-spark-1-0-0-tp7982p7984.html
>> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 


Re: jackson-core-asl jar (1.8.8 vs 1.9.x) conflict with the spark-sql (version 1.x)

2014-06-27 Thread Gino Bustelo
Hit this same problem yesterday. My fix might not be ideal for you, but we were 
able to get rid of the error by turning off annotation deser in ObjectMapper. 

Gino B.
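
Going from memory, the change was along these lines on the Jackson 1.x
ObjectMapper, so the annotation introspection path that needs JsonClass is never
taken. Treat it as a sketch rather than a drop-in fix.

  import org.codehaus.jackson.map.{DeserializationConfig, ObjectMapper, SerializationConfig}

  // Sketch from memory: turn off annotation processing so Jackson does not go
  // through JacksonAnnotationIntrospector (the code path that wants JsonClass).
  val mapper = new ObjectMapper()
  mapper.configure(DeserializationConfig.Feature.USE_ANNOTATIONS, false)
  mapper.configure(SerializationConfig.Feature.USE_ANNOTATIONS, false)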

> On Jun 27, 2014, at 2:58 PM, M Singh  wrote:
> 
> Hi:
> 
> I am using spark to stream data to cassandra and it works fine in local mode. 
> But when I execute the application in a standalone clustered env I got 
> exception included below (java.lang.NoClassDefFoundError: 
> org/codehaus/jackson/annotate/JsonClass).
> 
> I think this is due to the jackson-core-asl dependency conflict 
> (jackson-core-asl 1.8.8 has the JsonClass but 1.9.x does not).  The 1.9.x 
> version is being pulled in by spark-sql project.  I tried adding 
> jackson-core-asl 1.8.8 with --jars argument while submitting the application 
> for execution but it did not work.  So I created a custom spark build 
> excluding sql project.  With this custom spark install I was able to resolve 
> the issue at least on a single node cluster (separate master and worker).  
> 
> If there is an alternate way to resolve this conflicting jar issue without a 
> custom build (eg: configuration to use the user defined jars in the executor 
> class path first), please let me know.
> 
> Also, is there a comprehensive list of configuration properties available for 
> spark ?
> 
> Thanks
> 
> Mans
> 
> Exception trace
> 
>  TaskSetManager: Loss was due to java.lang.NoClassDefFoundError
> java.lang.NoClassDefFoundError: org/codehaus/jackson/annotate/JsonClass
> at 
> org.codehaus.jackson.map.introspect.JacksonAnnotationIntrospector.findDeserializationType(JacksonAnnotationIntrospector.java:524)
> at 
> org.codehaus.jackson.map.deser.BasicDeserializerFactory.modifyTypeByAnnotation(BasicDeserializerFactory.java:732)
> at 
> org.codehaus.jackson.map.deser.BasicDeserializerFactory.createCollectionDeserializer(BasicDeserializerFactory.java:229)
> at 
> org.codehaus.jackson.map.deser.StdDeserializerProvider._createDeserializer(StdDeserializerProvider.java:386)
> at 
> org.codehaus.jackson.map.deser.StdDeserializerProvider._createAndCache2(StdDeserializerProvider.java:307)
> at 
> org.codehaus.jackson.map.deser.StdDeserializerProvider._createAndCacheValueDeserializer(StdDeserializerProvider.java:287)
> at 
> org.codehaus.jackson.map.deser.StdDeserializerProvider.findValueDeserializer(StdDeserializerProvider.java:136)
> at 
> org.codehaus.jackson.map.deser.StdDeserializerProvider.findTypedValueDeserializer(StdDeserializerProvider.java:157)
> at 
> org.codehaus.jackson.map.ObjectMapper._findRootDeserializer(ObjectMapper.java:2468)
> at 
> org.codehaus.jackson.map.ObjectMapper._readMapAndClose(ObjectMapper.java:2402)
> at org.codehaus.jackson.map.ObjectMapper.readValue(ObjectMapper.java:1602)
>  


Re: unserializable object in Spark Streaming context

2014-07-18 Thread Gino Bustelo
I get TD's recommendation of sharing a connection among tasks. Now, is there a 
good way to determine when to close connections? 

Gino B.
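
For what it's worth, the pattern I've seen suggested is one connection per
partition, opened and closed inside foreachPartition so nothing connection-like
is ever serialized from the driver. A rough sketch; DbConnection and
createConnection are placeholders for whatever client is actually used.

  import org.apache.spark.streaming.dstream.DStream

  // Hypothetical connection type and factory standing in for a real DB client.
  trait DbConnection { def insert(record: String): Unit; def close(): Unit }
  def createConnection(): DbConnection = ???   // placeholder

  def saveToDb(stream: DStream[String]): Unit = {
    stream.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        // One connection per partition, created and closed on the worker itself.
        val conn = createConnection()
        try records.foreach(conn.insert) finally conn.close()
      }
    }
  }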

> On Jul 17, 2014, at 7:05 PM, Yan Fang  wrote:
> 
> Hi Sean,
> 
> Thank you. I see your point. What I was thinking is that, do computation in a 
> distributed fashion and do the storing from a single place. But you are 
> right, having multiple DB connections actually is fine.
> 
> Thanks for answering my questions. That helps me understand the system.
> 
> Cheers,
> 
> Fang, Yan
> yanfang...@gmail.com
> +1 (206) 849-4108
> 
> 
>> On Thu, Jul 17, 2014 at 2:53 PM, Sean Owen  wrote:
>> On Thu, Jul 17, 2014 at 10:39 PM, Yan Fang  wrote:
>> > Thank you for the help. If I use TD's approach, it works and there is no
>> > exception. Only drawback is that it will create many connections to the DB,
>> > which I was trying to avoid.
>> 
>> Connection-like objects aren't data that can be serialized. What would
>> it mean to share one connection with N workers? that they all connect
>> back to the driver, and through one DB connection there? this defeats
>> the purpose of distributed computing. You want multiple DB
>> connections. You can limit the number of partitions if needed.
>> 
>> 
>> > Here is a snapshot of my code. Mark as red for the important code. What I
>> > was thinking is that, if I call the collect() method, Spark Streaming will
>> > bring the data to the driver and then the db object does not need to be 
>> > sent
>> 
>> The Function you pass to foreachRDD() has a reference to db though.
>> That's what is making it be serialized.
>> 
>> > to executors. My observation is that, though exceptions are thrown, the
>> > insert function still works. Any thought about that? Also paste the log in
>> > case it helps .http://pastebin.com/T1bYvLWB
>> 
>> Any executors that run locally might skip the serialization and
>> succeed (?) but I don't think the remote executors can be succeeding.
>