Re: Problems after upgrading to spark 1.4.0

2015-07-14 Thread Luis Ángel Vicente Sánchez
or not, but somehow the executor's shutdown hook is being called. Can you check the driver logs to see if the driver's shutdown hook is accidentally being called? On Mon, Jul 13, 2015 at 9:23 AM, Luis Ángel Vicente Sánchez langel.gro...@gmail.com wrote: I forgot to mention that this is a long

Duplicated UnusedStubClass in assembly

2015-07-13 Thread Luis Ángel Vicente Sánchez
I have just upgraded to spark 1.4.0 and it seems that spark-streaming-kafka has a dependency on org.spark-project.spark:unused:1.0.0 but it also embeds that jar in its artifact, causing a problem while creating a fatjar. This is the error: [Step 1/1] (*:assembly) deduplicate: different file

Re: Problems after upgrading to spark 1.4.0

2015-07-13 Thread Luis Ángel Vicente Sánchez
I forgot to mention that this is a long-running job, actually a spark streaming job, and it's using mesos coarse mode. I'm still using the unreliable kafka receiver. 2015-07-13 17:15 GMT+01:00 Luis Ángel Vicente Sánchez langel.gro...@gmail.com: I have just upgraded one of my spark jobs from

Re: Duplicated UnusedStubClass in assembly

2015-07-13 Thread Luis Ángel Vicente Sánchez
to be an intermittent problem. You can just add an exclude: mergeStrategy in assembly := { case PathList("org", "apache", "spark", "unused", "UnusedStubClass.class") => MergeStrategy.first case x => (mergeStrategy in assembly).value(x) } On Mon, Jul 13, 2015 at 6:55 AM, Luis Ángel Vicente Sánchez
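Cleaned up as a build.sbt fragment, the suggested exclude looks roughly like this (a sketch assuming the sbt-assembly 0.x `mergeStrategy` key used at the time; the archive stripped the string quotes and arrows):

```scala
// build.sbt — sketch: send the stub class embedded by spark-streaming-kafka
// to MergeStrategy.first, and fall back to the default strategy otherwise.
mergeStrategy in assembly := {
  case PathList("org", "apache", "spark", "unused", "UnusedStubClass.class") =>
    MergeStrategy.first
  case x =>
    (mergeStrategy in assembly).value(x)
}
```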

Problems after upgrading to spark 1.4.0

2015-07-13 Thread Luis Ángel Vicente Sánchez
I have just upgraded one of my spark jobs from spark 1.2.1 to spark 1.4.0 and after deploying it to mesos, it's not working anymore. The upgrade process was quite easy: - Create a new docker container for spark 1.4.0. - Upgrade the spark job to use spark 1.4.0 as a dependency and create a new fatjar.

Re: Streaming problems running 24x7

2015-04-20 Thread Luis Ángel Vicente Sánchez
You have a window operation; I have seen that behaviour before with window operations in spark streaming. My solution was to move away from window operations using probabilistic data structures; it might not be an option for you. 2015-04-20 10:29 GMT+01:00 Marius Soutier mps@gmail.com: The

foreachRDD execution

2015-03-25 Thread Luis Ángel Vicente Sánchez
I have a simple and probably dumb question about foreachRDD. We are using spark streaming + cassandra to compute concurrent users every 5min. Our batch size is 10secs and our block interval is 2.5secs. At the end of the day we are using foreachRDD to join the data in the RDD with existing data
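A minimal sketch of the foreachRDD pattern being described (the stream name, key types, and the per-partition write are illustrative placeholders, not from the original message):

```scala
// Sketch: process each micro-batch at the end of the pipeline.
// `users` and the Cassandra upsert are hypothetical placeholders.
users.foreachRDD { rdd =>
  rdd.cache()                        // reused below, so cache once
  val counts = rdd.reduceByKey(_ + _)
  counts.foreachPartition { part =>  // one connection per partition, not per record
    part.foreach { case (key, count) =>
      // upsert (key, count) into Cassandra here (driver-specific code omitted)
    }
  }
  rdd.unpersist()
}
```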

Re: NullPointer when access rdd.sparkContext (Spark 1.1.1)

2015-01-21 Thread Luis Ángel Vicente Sánchez
that. 2015-01-21 12:35 GMT+00:00 Luis Ángel Vicente Sánchez langel.gro...@gmail.com: The following functions, def sink(data: DStream[((GameID, Category), ((TimeSeriesKey, Platform), HLL))]): Unit = { data.foreachRDD { rdd => rdd.cache() val (minTime, maxTime): (Long, Long

NullPointer when access rdd.sparkContext (Spark 1.1.1)

2015-01-21 Thread Luis Ángel Vicente Sánchez
The following functions, def sink(data: DStream[((GameID, Category), ((TimeSeriesKey, Platform), HLL))]): Unit = { data.foreachRDD { rdd => rdd.cache() val (minTime, maxTime): (Long, Long) = rdd.map { case (_, ((TimeSeriesKey(_, time), _), _)) => (time, time)

Re: NullPointer when access rdd.sparkContext (Spark 1.1.1)

2015-01-21 Thread Luis Ángel Vicente Sánchez
, rdd) } everything works as expected. 2015-01-21 14:18 GMT+00:00 Sean Owen so...@cloudera.com: It looks like you are trying to use the RDD in a distributed operation, which won't work. The context will be null. On Jan 21, 2015 1:50 PM, Luis Ángel Vicente Sánchez langel.gro

Kafka Receiver not recreated after executor died

2014-12-16 Thread Luis Ángel Vicente Sánchez
Dear spark community, We were testing a spark failure scenario where the executor that is running a Kafka Receiver dies. We are running our streaming jobs on top of mesos and we killed the mesos slave that was running the executor; a new executor was created on another mesos-slave but

Re: Kafka Receiver not recreated after executor died

2014-12-16 Thread Luis Ángel Vicente Sánchez
It seems to be slightly related to this: https://issues.apache.org/jira/browse/SPARK-1340 But in this case, it's not the Task that is failing but the entire executor where the Kafka Receiver resides. 2014-12-16 16:53 GMT+00:00 Luis Ángel Vicente Sánchez langel.gro...@gmail.com: Dear spark

Re: Low Level Kafka Consumer for Spark

2014-12-03 Thread Luis Ángel Vicente Sánchez
My main complain about the WAL mechanism in the new reliable kafka receiver is that you have to enable checkpointing and for some reason, even if spark.cleaner.ttl is set to a reasonable value, only the metadata is cleaned periodically. In my tests, using a folder in my filesystem as the

Re: Spark 1.1.1 released but not available on maven repositories

2014-11-28 Thread Luis Ángel Vicente Sánchez
Is there any news about this issue? I have checked Maven Central again and the artefacts are still not there. Regards, Luis 2014-11-27 10:42 GMT+00:00 Luis Ángel Vicente Sánchez langel.gro...@gmail.com: I have just read on the website that spark 1.1.1 has been released but when I upgraded

Re: RDD data checkpoint cleaning

2014-11-28 Thread Luis Ángel Vicente Sánchez
2014-11-21 15:17 GMT+00:00 Luis Ángel Vicente Sánchez langel.gro...@gmail.com: I have seen the same behaviour while testing the latest spark 1.2.0 snapshot. I'm trying the ReliableKafkaReceiver and it works quite well but the checkpoints folder is always increasing in size

Spark 1.1.1 released but not available on maven repositories

2014-11-27 Thread Luis Ángel Vicente Sánchez
I have just read on the website that spark 1.1.1 has been released but when I upgraded my project to use 1.1.1 I discovered that the artefacts are not on maven yet. [info] Resolving org.apache.spark#spark-streaming-kafka_2.10;1.1.1 ... [warn] module not found:

Re: RDD data checkpoint cleaning

2014-11-21 Thread Luis Ángel Vicente Sánchez
I have seen the same behaviour while testing the latest spark 1.2.0 snapshot. I'm trying the ReliableKafkaReceiver and it works quite well but the checkpoints folder is always increasing in size. The receivedMetaData folder remains almost constant in size but the receivedData folder is always

Spark Streaming: CoarseGrainedExecutorBackend: Slave registration failed: Duplicate executor ID

2014-09-16 Thread Luis Ángel Vicente Sánchez
I have a standalone spark cluster and from within the same scala application I'm creating two different spark contexts to run two different spark streaming jobs, as SparkConf is different for each of them. I'm getting this error that... I don't really understand: 14/09/16 11:51:35 ERROR

Re: Spark Streaming: CoarseGrainedExecutorBackend: Slave registration failed: Duplicate executor ID

2014-09-16 Thread Luis Ángel Vicente Sánchez
It seems that, as I have a single scala application, the scheduler is the same and there is a collision between executors of both spark contexts. Is there a way to change how the executor ID is generated (maybe a uuid instead of a sequential number)? 2014-09-16 13:07 GMT+01:00 Luis Ángel

Re: Spark Streaming: CoarseGrainedExecutorBackend: Slave registration failed: Duplicate executor ID

2014-09-16 Thread Luis Ángel Vicente Sánchez
When I said scheduler I meant executor backend. 2014-09-16 13:26 GMT+01:00 Luis Ángel Vicente Sánchez langel.gro...@gmail.com: It seems that, as I have a single scala application, the scheduler is the same and there is a collision between executors of both spark contexts. Is there a way

Re: Spark Streaming: CoarseGrainedExecutorBackend: Slave registration failed: Duplicate executor ID

2014-09-16 Thread Luis Ángel Vicente Sánchez
a message to each actor with configuration details to create a SparkContext/StreamingContext. 3. send a message to each actor to start the job and streaming context. 2014-09-16 13:29 GMT+01:00 Luis Ángel Vicente Sánchez langel.gro...@gmail.com: When I said scheduler I meant executor backend. 2014-09

Re: spark.cleaner.ttl and spark.streaming.unpersist

2014-09-10 Thread Luis Ángel Vicente Sánchez
fast, you can try "spark.streaming.receiver.maxRate" to control the injection rate. Thanks Jerry *From:* Luis Ángel Vicente Sánchez [mailto:langel.gro...@gmail.com] *Sent:* Wednesday, September 10, 2014 5:21 AM *To:* user@spark.apache.org *Subject:* spark.cleaner.ttl

Re: Spark streaming: size of DStream

2014-09-09 Thread Luis Ángel Vicente Sánchez
If you take into account what streaming means in spark, your goal doesn't really make sense; you have to assume that your streams are infinite and that you will have to process them until the end of days. Operations on a DStream define what you want to do with each element of each RDD, but spark

spark.cleaner.ttl and spark.streaming.unpersist

2014-09-09 Thread Luis Ángel Vicente Sánchez
The executors of my spark streaming application are being killed due to memory issues. The memory consumption is quite high on startup because it is the first run and there are quite a few events on the kafka queues that are consumed at a rate of 100K events per sec. I wonder if it's recommended to
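For reference, the settings discussed in this thread (and the receiver rate limit mentioned in the reply) are configured like this; a sketch with illustrative values, not a recommendation:

```scala
// Sketch: illustrative values only — tune for your own workload.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.cleaner.ttl", "3600")                 // seconds; periodic cleanup of old metadata
  .set("spark.streaming.unpersist", "true")         // let Spark unpersist processed RDDs
  .set("spark.streaming.receiver.maxRate", "10000") // cap events/sec per receiver on startup
```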

Re: using multiple dstreams together (spark streaming)

2014-07-16 Thread Luis Ángel Vicente Sánchez
I'm joining several kafka dstreams using the join operation, but you have the limitation that the duration of the batch has to be the same, i.e. a 1-second window for all dstreams... so it would not work for you. 2014-07-16 18:08 GMT+01:00 Walrus theCat walrusthe...@gmail.com: Hi, My application has
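The limitation follows from both streams living in one StreamingContext, which fixes a single batch duration. A sketch of the join being described (stream names and value types are illustrative):

```scala
// Sketch: both input streams come from the same StreamingContext,
// so they necessarily share one batch duration (here 1 second).
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.DStream

val ssc = new StreamingContext(conf, Seconds(1))
val clicks: DStream[(String, Long)] = ???  // e.g. from one Kafka topic
val views:  DStream[(String, Long)] = ???  // e.g. from another topic

// Pair-wise join per batch, keyed on the String key.
val joined: DStream[(String, (Long, Long))] = clicks.join(views)
```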

Re: using multiple dstreams together (spark streaming)

2014-07-16 Thread Luis Ángel Vicente Sánchez
the .union operation and it didn't work for that reason. Surely there has to be a way to do this, as I imagine this is a commonly desired goal in streaming applications? On Wed, Jul 16, 2014 at 10:10 AM, Luis Ángel Vicente Sánchez langel.gro...@gmail.com wrote: I'm joining several kafka

Re: Cassandra driver Spark question

2014-07-09 Thread Luis Ángel Vicente Sánchez
Is MyType serializable? Everything inside the foreachRDD closure has to be serializable. 2014-07-09 14:24 GMT+01:00 RodrigoB rodrigo.boav...@aspect.com: Hi all, I am currently trying to save to Cassandra after some Spark Streaming computation. I call a myDStream.foreachRDD so that I can
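A quick way to check whether a type will survive closure serialization is to round-trip an instance through plain Java serialization — no Spark needed (MyType here is a stand-in for the type in the original question):

```scala
import java.io._

// Stand-in for the user's type; a plain case class is Serializable by default.
case class MyType(id: String, count: Long)

// Throws NotSerializableException if `a` (or anything it references) isn't serializable.
def roundTrip[A <: Serializable](a: A): A = {
  val bytes = new ByteArrayOutputStream()
  val out = new ObjectOutputStream(bytes)
  out.writeObject(a)
  out.close()
  val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
  in.readObject().asInstanceOf[A]
}

val copy = roundTrip(MyType("user-1", 42L))
assert(copy == MyType("user-1", 42L))
```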

Re: Cassandra driver Spark question

2014-07-09 Thread Luis Ángel Vicente Sánchez
Yes, I'm using it to count concurrent users from a kafka stream of events without problems. I'm currently testing it using local mode, but any serialization problem would have already appeared, so I don't expect any serialization issue when I deploy to my cluster. 2014-07-09 15:39 GMT+01:00

Possible bug in Spark Streaming :: TextFileStream

2014-07-07 Thread Luis Ángel Vicente Sánchez
I have a basic spark streaming job that is watching a folder, processing any new file and updating a column family in cassandra using the new cassandra-spark-driver. I think there is a problem with StreamingContext.textFileStream... if I start my job in local mode with no files in the folder

Re: spark streaming rate limiting from kafka

2014-07-01 Thread Luis Ángel Vicente Sánchez
Maybe reducing the batch duration would help :\ 2014-07-01 17:57 GMT+01:00 Chen Song chen.song...@gmail.com: In my use case, if I need to stop spark streaming for a while, data would accumulate a lot on kafka topic-partitions. After I restart spark streaming job, the worker's heap will go

Re: spark streaming, kafka, SPARK_CLASSPATH

2014-06-17 Thread Luis Ángel Vicente Sánchez
class. You have to specify the jars twice to get things to work. Once for the DriverWrapper to load your classes and a 2nd time in the Context to distribute to workers. I would like to see some contrib response to this issue. Gino B. On Jun 16, 2014, at 1:49 PM, Luis Ángel Vicente Sánchez

Re: Worker dies while submitting a job

2014-06-17 Thread Luis Ángel Vicente Sánchez
to find why that state changed message is sent... I will continue updating this thread until I find the problem :D 2014-06-16 18:25 GMT+01:00 Luis Ángel Vicente Sánchez langel.gro...@gmail.com: I'm playing with a modified version of the TwitterPopularTags example and when I tried to submit the job

Re: Worker dies while submitting a job

2014-06-17 Thread Luis Ángel Vicente Sánchez
-SNAPSHOT.jar)) Now I'm getting this error on my worker: 4/06/17 17:03:40 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory 2014-06-17 17:36 GMT+01:00 Luis Ángel Vicente Sánchez langel.gro

Re: spark streaming, kafka, SPARK_CLASSPATH

2014-06-17 Thread Luis Ángel Vicente Sánchez
-SNAPSHOT.jar)) Now I'm getting this error on my worker: 4/06/17 17:03:40 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory 2014-06-17 15:38 GMT+01:00 Luis Ángel Vicente Sánchez langel.gro

Worker dies while submitting a job

2014-06-16 Thread Luis Ángel Vicente Sánchez
I'm playing with a modified version of the TwitterPopularTags example and when I tried to submit the job to my cluster, workers keep dying with this message: 14/06/16 17:11:16 INFO DriverRunner: Launch Command: java -cp

Re: spark streaming, kafka, SPARK_CLASSPATH

2014-06-16 Thread Luis Ángel Vicente Sánchez
Did you manage to make it work? I'm facing similar problems and this is a serious blocker issue. spark-submit seems kind of broken to me if you can't use it for spark-streaming. Regards, Luis 2014-06-11 1:48 GMT+01:00 lannyripple lanny.rip...@gmail.com: I am using Spark 1.0.0 compiled with Hadoop

Re: Anyone using value classes in RDDs?

2014-04-20 Thread Luis Ángel Vicente Sánchez
Type aliases aren't safe as you could use any string as a name or id. On 20 Apr 2014 14:18, Surendranauth Hiraman suren.hira...@velos.io wrote: If the purpose is only aliasing, rather than adding additional methods and avoiding runtime allocation, what about type aliases? type ID = String type
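The distinction being made: a type alias is fully interchangeable with String, while a value class gets a distinct compile-time type without runtime allocation (a sketch; the names are illustrative):

```scala
// Type alias: no safety — any String is accepted where an ID is expected.
type ID = String

// Value classes: distinct compile-time types, erased to a bare String at runtime.
case class UserId(value: String) extends AnyVal
case class Name(value: String) extends AnyVal

def lookup(id: UserId): Unit = ()

lookup(UserId("u-123"))   // compiles
// lookup(Name("alice"))  // does not compile: type mismatch
```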