Re: Job keeps aborting because of org.apache.spark.shuffle.FetchFailedException: Failed to connect to server/ip:39232

2017-07-28 Thread Juan Rodríguez Hortalá
Hi Jeff, Can you provide more information about how you are running your job? In particular: - which cluster manager are you using? Is it YARN, Mesos, or Spark Standalone? - which configuration options are you using to submit the job? In particular, are you using dynamic allocation or external

Re: unit testing in spark

2016-12-11 Thread Juan Rodríguez Hortalá
Hi all, I would also like to participate in that. Greetings, Juan On Fri, Dec 9, 2016 at 6:03 AM, Michael Stratton < michael.strat...@komodohealth.com> wrote: > That sounds great, please include me so I can get involved. > > On Fri, Dec 9, 2016 at 7:39 AM, Marco Mistroni

Re: JobScheduler: Error generating jobs for time for custom InputDStream

2015-09-29 Thread Juan Rodríguez Hortalá
sxw...@gmail.com>: > Looks like you return a "Some(null)" in "compute". If you don't want to > create an RDD, it should return None. If you want to return an empty RDD, it > should return "Some(sc.emptyRDD)". > > Best Regards, > Shixiong Zh
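A minimal sketch of a compute that respects that contract, assuming a custom InputDStream[A] with a ClassTag for A in scope and a hypothetical dequeueBatch helper:

    override def compute(validTime: Time): Option[RDD[A]] = {
      val batch = dequeueBatch(validTime)  // hypothetical: the Seq[A] for this batch
      if (batch.isEmpty)
        Some(context.sparkContext.emptyRDD[A])  // empty RDD; returning None is also valid
      else
        Some(context.sparkContext.parallelize(batch))
    }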

JobScheduler: Error generating jobs for time for custom InputDStream

2015-08-26 Thread Juan Rodríguez Hortalá
Hi, I've developed a ScalaCheck property for testing Spark Streaming transformations. To do that I had to develop a custom InputDStream, which is very similar to QueueInputDStream but has a method for adding new test cases (objects of type Seq[Seq[A]]) to the DStream. You
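A rough sketch of such a DStream under those assumptions (class and method names hypothetical):

    import scala.collection.mutable
    import scala.reflect.ClassTag
    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.{StreamingContext, Time}
    import org.apache.spark.streaming.dstream.InputDStream

    class TestCaseInputDStream[A: ClassTag](ssc0: StreamingContext)
        extends InputDStream[A](ssc0) {
      private val batches = mutable.Queue.empty[Seq[A]]

      // a test case is a Seq[Seq[A]]: one inner Seq per generated batch
      def addTestCase(testCase: Seq[Seq[A]]): Unit =
        batches.synchronized { batches ++= testCase }

      override def start(): Unit = {}
      override def stop(): Unit = {}
      override def compute(validTime: Time): Option[RDD[A]] =
        batches.synchronized {
          if (batches.isEmpty) Some(ssc0.sparkContext.emptyRDD[A])
          else Some(ssc0.sparkContext.parallelize(batches.dequeue()))
        }
    }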

Difficulties developing a Specs2 matcher for Spark Streaming

2015-08-24 Thread Juan Rodríguez Hortalá
Hi, I've had some trouble developing a Specs2 matcher that checks that a predicate holds for all the elements of an RDD, and using it for testing a simple Spark Streaming program. I've finally been able to get code that works, you can see it in
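One way to sketch the core of such a matcher, assuming Specs2's matcher adaptation operator ^^ and its beTrue matcher (the matcher name is hypothetical, and the predicate must be serializable because it runs inside a Spark closure):

    import org.apache.spark.rdd.RDD
    import org.specs2.matcher.{Matcher, MustMatchers}

    trait RDDMatchers extends MustMatchers {
      // holds iff p is true for every element: no element fails the predicate
      def foreachElement[A](p: A => Boolean): Matcher[RDD[A]] =
        beTrue ^^ { (rdd: RDD[A]) => rdd.filter(x => !p(x)).isEmpty }
    }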

Re: Writing streaming data to cassandra creates duplicates

2015-07-30 Thread Juan Rodríguez Hortalá
Hi, Just my two cents. I understand your problem is that you have messages with the same key in two different dstreams. What I would do would be making a union of all the dstreams with StreamingContext.union or several calls to DStream.union, and then I would create a pair
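A rough sketch of that approach, assuming the records carry a key and keeping a single record per key within each batch is acceptable (record.key and the stream names are hypothetical):

    // merge all input dstreams into one
    val unified = ssc.union(Seq(stream1, stream2))
    // key each record and collapse duplicates per key inside each batch,
    // so only one write per key reaches Cassandra
    val deduped = unified
      .map(record => (record.key, record))
      .reduceByKey((a, b) => a)  // arbitrary winner; replace with domain logic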

Messages are not stored for actorStream when using RoundRobinRouter

2015-07-28 Thread Juan Rodríguez Hortalá
Hi, I'm using a simple akka actor to create an actorStream. The actor just forwards the messages received to the stream by calling super[ActorHelper].store(msg). This works ok when I create the stream with ssc.actorStream[A](Props(new ProxyReceiverActor[A]), receiverActorName) but when I try to
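A minimal sketch of such a proxy actor for the Spark 1.x actor receiver API (ProxyReceiverActor is the name used above; its body below is an assumption):

    import akka.actor.Actor
    import org.apache.spark.streaming.receiver.ActorHelper

    class ProxyReceiverActor[A] extends Actor with ActorHelper {
      def receive = {
        // forward every message into the stream; the cast is unchecked (erasure)
        case msg => store(msg.asInstanceOf[A])
      }
    }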

Stopping StreamingContext before receiver has started

2015-07-13 Thread Juan Rodríguez Hortalá
Hi, I have noticed that when StreamingContext.stop is called when no receiver has started yet, then the context is not really stopped. Watching the logs it looks like a stop signal is sent to 0 receivers, because the receivers have not started yet, and then the receivers are started and the

Re: SparkDriverExecutionException when using actorStream

2015-07-11 Thread Juan Rodríguez Hortalá
in Spark 1.3.1 and Spark 1.4.0. The corrected example is at https://gist.github.com/juanrh/eaf34cf0a308a87db32c in case someone is interested. Greetings, Juan 2015-07-10 18:27 GMT+02:00 Juan Rodríguez Hortalá juan.rodriguez.hort...@gmail.com: Hi, I'm trying to create a Spark Streaming

SparkDriverExecutionException when using actorStream

2015-07-10 Thread Juan Rodríguez Hortalá
Hi, I'm trying to create a Spark Streaming actor stream but I'm having several problems. First of all the guide from https://spark.apache.org/docs/latest/streaming-custom-receivers.html refers to the code

Re: java.lang.IllegalArgumentException: A metric named ... already exists

2015-07-06 Thread Juan Rodríguez Hortalá
. Can you make it a JIRA? TD On Tue, Jun 23, 2015 at 7:59 AM, Juan Rodríguez Hortalá juan.rodriguez.hort...@gmail.com wrote: Hi, I'm running a program in Spark 1.4 where several Spark Streaming contexts are created from the same Spark context. As pointed out in https://spark.apache.org/docs

java.lang.IllegalArgumentException: A metric named ... already exists

2015-06-23 Thread Juan Rodríguez Hortalá
Hi, I'm running a program in Spark 1.4 where several Spark Streaming contexts are created from the same Spark context. As pointed out in https://spark.apache.org/docs/latest/streaming-programming-guide.html each Spark Streaming context is stopped before creating the next Spark Streaming context. The
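The sequence being described, sketched minimally (assuming a shared SparkContext sc):

    val ssc1 = new StreamingContext(sc, Seconds(1))
    // ... declare and run the first streaming computation ...
    ssc1.stop(stopSparkContext = false)  // keep sc alive for the next context

    val ssc2 = new StreamingContext(sc, Seconds(1))
    // the metrics registered by ssc1 should be gone by this point; the thread
    // reports they are not, hence the "metric already exists" exception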

Re: Spark 1.4: Python API for getting Kafka offsets in direct mode?

2015-06-13 Thread Juan Rodríguez Hortalá
Perfect! I'll start working on it 2015-06-13 2:23 GMT+02:00 Amit Ramesh a...@yelp.com: Hi Juan, I have created a ticket for this: https://issues.apache.org/jira/browse/SPARK-8337 Thanks! Amit On Fri, Jun 12, 2015 at 3:17 PM, Juan Rodríguez Hortalá juan.rodriguez.hort...@gmail.com

Re: Spark 1.4: Python API for getting Kafka offsets in direct mode?

2015-06-12 Thread Juan Rodríguez Hortalá
Hi, If you want, I would be happy to work on this. I have worked with KafkaUtils.createDirectStream before, in a pull request that wasn't accepted https://github.com/apache/spark/pull/5367. I'm fluent with Python and I'm starting to feel comfortable with Scala, so if someone opens a JIRA I can
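For reference, the Scala-side pattern for reading offsets in direct mode, which the thread proposes exposing in Python (a sketch, assuming Spark 1.3+ and the spark-streaming-kafka artifact):

    import org.apache.spark.streaming.kafka.HasOffsetRanges

    directStream.transform { rdd =>
      // RDDs produced by createDirectStream carry their Kafka offset ranges
      val offsets = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      offsets.foreach(o =>
        println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}"))
      rdd
    }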

Re: Creating topology in spark streaming

2015-05-06 Thread Juan Rodríguez Hortalá
Hi, I agree with Evo, Spark works at a different abstraction level than Storm, and there is not a direct translation from Storm topologies to Spark Streaming jobs. I think something remotely close is the notion of lineage of DStreams or RDDs, which is similar to a logical plan of an engine like

Re: Creating topology in spark streaming

2015-05-06 Thread Juan Rodríguez Hortalá
Hi, You can use the method repartition from DStream (for the Scala API) or JavaDStream (for the Java API): def repartition(numPartitions: Int): DStream[T] (https://spark.apache.org/docs/latest/api/scala/org/apache/spark/streaming/dstream/DStream.html) Return a new DStream with an increased or
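For example (the partition count is arbitrary):

    // spread each batch's records across 10 partitions before the heavy work
    val repartitioned = dstream.repartition(10)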

Re: Event generator for SPARK-Streaming from csv

2015-05-01 Thread Juan Rodríguez Hortalá
Hi, Maybe you could use streamingContext.fileStream like in the example from https://spark.apache.org/docs/latest/streaming-programming-guide.html#input-dstreams-and-receivers, you can read from files on any file system compatible with the HDFS API (that is, HDFS, S3, NFS, etc.). You could split
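A sketch of that suggestion using the textFileStream convenience wrapper (the directory is hypothetical; each group of events must arrive as a new file moved atomically into the watched directory):

    // every new file in the directory becomes part of the next batch
    val events = ssc.textFileStream("hdfs:///tmp/event-input")
      .map(_.split(","))  // parse each csv line into its fields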

Re: Is there a limit to the number of RDDs in a Spark context?

2015-03-12 Thread Juan Rodríguez Hortalá
everything inside Spark if possible. Maybe I'm missing something here, any idea would be appreciated. Thanks in advance for your help, Greetings, Juan Rodriguez 2015-02-18 20:23 GMT+01:00 Juan Rodríguez Hortalá juan.rodriguez.hort...@gmail.com: Hi Sean, Thanks a lot for your answer

Re: Memory problems when calling pipe()

2015-02-24 Thread Juan Rodríguez Hortalá
and https://issues.apache.org/jira/browse/SPARK-2444, and now it works ok. Greetings, Juan 2015-02-23 10:40 GMT+01:00 Juan Rodríguez Hortalá juan.rodriguez.hort...@gmail.com: Hi, I'm having problems using pipe() from a Spark program written in Java, where I call a python script, running in a YARN
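The JIRAs referenced above relate to YARN container memory; a sketch of the overhead setting typically involved in such fixes (an assumption here, since the exact change is truncated away; the value in MB is arbitrary, and it is the pipe()d python process that has to fit in this overhead, not in the executor heap):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      // extra off-heap room per executor container, in MB
      .set("spark.yarn.executor.memoryOverhead", "1024")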

Memory problems when calling pipe()

2015-02-23 Thread Juan Rodríguez Hortalá
Hi, I'm having problems using pipe() from a Spark program written in Java, where I call a python script, running in a YARN cluster. The problem is that the job fails when YARN kills the container because the python script is going beyond the memory limits. I get something like this in the log:

Is there a limit to the number of RDDs in a Spark context?

2015-02-18 Thread Juan Rodríguez Hortalá
Hi, I'm writing a Spark program where I want to divide an RDD into different groups, but the groups are too big to use groupByKey. To cope with that, since I know in advance the list of keys for each group, I build a map from the keys to the RDDs that result from filtering the input RDD to get the
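A sketch of that construction, assuming a pair RDD and a known key list (input and keys are hypothetical names):

    // assuming input: RDD[(String, Int)] and keys: Seq[String]; note this
    // schedules a separate pass over the input per key, which is why
    // aggregateByKey is suggested later in the thread
    val byKey: Map[String, RDD[(String, Int)]] =
      keys.map(k => k -> input.filter { case (key, _) => key == k }).toMap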

Re: Is there a limit to the number of RDDs in a Spark context?

2015-02-18 Thread Juan Rodríguez Hortalá
just in pure Java / Scala to do whatever you need inside your function. On Wed, Feb 18, 2015 at 2:12 PM, Juan Rodríguez Hortalá juan.rodriguez.hort...@gmail.com wrote: Hi Paweł, Thanks a lot for your answer. I finally got the program to work by using aggregateByKey, but I was wondering why
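The aggregateByKey alternative mentioned above, sketched for collecting the values per key in a single pass (the zero value and both ops are placeholders):

    // assuming input: RDD[(String, Int)]; one aggregate per key, one pass,
    // instead of one filtered RDD per key
    val aggregated = input.aggregateByKey(List.empty[Int])(
      (acc, v) => v :: acc,   // fold a value into a partition-local accumulator
      (a, b) => a ::: b)      // merge accumulators across partitions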

Re: Is there a limit to the number of RDDs in a Spark context?

2015-02-18 Thread Juan Rodríguez Hortalá
into memory issues. Please let me know if that helped. Pawel Szulc @rabbitonweb http://www.rabbitonweb.com On Wed, Feb 18, 2015 at 12:06 PM, Juan Rodríguez Hortalá juan.rodriguez.hort...@gmail.com wrote: Hi, I'm writing a Spark program where I want to divide an RDD into different groups

Re: Implementing a spark version of Haskell's partition

2014-12-18 Thread Juan Rodríguez Hortalá
is not that a problem (especially if you can cache it), and in fact it's way better to keep the advantages of resilience etc etc that come with Spark. my2c andy On Wed Dec 17 2014 at 7:07:05 PM Juan Rodríguez Hortalá juan.rodriguez.hort...@gmail.com wrote: Hi Andy, thanks for your response. I

Implementing a spark version of Haskell's partition

2014-12-17 Thread Juan Rodríguez Hortalá
Hi all, I would like to be able to split an RDD into two pieces according to a predicate. That would be equivalent to applying filter twice, with the predicate and its complement, which is also similar to Haskell's partition list function (
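The double-filter version under discussion, as a small sketch (caching the input so the two passes don't recompute it):

    import scala.reflect.ClassTag
    import org.apache.spark.rdd.RDD

    // analogous to Haskell's partition :: (a -> Bool) -> [a] -> ([a], [a])
    def partitionRDD[A: ClassTag](rdd: RDD[A])(p: A => Boolean): (RDD[A], RDD[A]) = {
      val cached = rdd.cache()
      (cached.filter(p), cached.filter(x => !p(x)))
    }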

Re: Implementing a spark version of Haskell's partition

2014-12-17 Thread Juan Rodríguez Hortalá
you'll have to do is to partition each partition (:-D) or create two RDDs by filtering twice → hence tasks will be scheduled distinctly, and data read twice. Choose what's best for you! hth, andy On Wed Dec 17 2014 at 5:57:56 PM Juan Rodríguez Hortalá

Number of partitions in RDD for input DStreams

2014-11-12 Thread Juan Rodríguez Hortalá
Hi list, In an excellent blog post on Kafka and Spark Streaming integration ( http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/), Michael Noll poses an assumption about the number of partitions of the RDDs created by input DStreams. He says his

Re: Debugging Task not serializable

2014-08-15 Thread Juan Rodríguez Hortalá
On Wed, Jul 30, 2014 at 2:13 PM, Juan Rodríguez Hortalá juan.rodriguez.hort...@gmail.com wrote: Akhil, Andy, thanks a lot for your suggestions. I will take a look at those JVM options. Greetings, Juan 2014-07-28 18:56 GMT+02:00 andy petrella andy.petre...@gmail.com: Also check the guides
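The JVM option usually meant in this context is the JDK's serialization debug switch, which makes NotSerializableException report the chain of references that pulled the offending object into the closure. A sketch of enabling it via Spark's extra-options settings (an assumption; the exact options discussed are truncated away above):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.driver.extraJavaOptions",
           "-Dsun.io.serialization.extendedDebugInfo=true")
      .set("spark.executor.extraJavaOptions",
           "-Dsun.io.serialization.extendedDebugInfo=true")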

Re: Debugging Task not serializable

2014-07-30 Thread Juan Rodríguez Hortalá
it by heart :/ On 28 Jul 2014 at 18:44, Akhil Das ak...@sigmoidanalytics.com wrote: A quick fix would be to implement java.io.Serializable in those classes which are causing this exception. Thanks Best Regards On Mon, Jul 28, 2014 at 9:21 PM, Juan Rodríguez Hortalá

Re: Server IPC version 7 cannot communicate with client version 4 with Spark Streaming 1.0.0 in Java and CH4 quickstart in local mode

2014-07-19 Thread Juan Rodríguez Hortalá
for CDH4.6 maybe? they are certainly built vs Hadoop 2) On Wed, Jul 16, 2014 at 10:32 AM, Juan Rodríguez Hortalá juan.rodriguez.hort...@gmail.com wrote: Hi, I'm running a Java program using Spark Streaming 1.0.0 on Cloudera 4.4.0 quickstart virtual machine, with hadoop-client 2.0.0-mr1-cdh4.4.0

Server IPC version 7 cannot communicate with client version 4 with Spark Streaming 1.0.0 in Java and CH4 quickstart in local mode

2014-07-16 Thread Juan Rodríguez Hortalá
Hi, I'm running a Java program using Spark Streaming 1.0.0 on Cloudera 4.4.0 quickstart virtual machine, with hadoop-client 2.0.0-mr1-cdh4.4.0, which is the one corresponding to my Hadoop distribution, and that works with other mapreduce programs, and with the maven property

Re: Which is the best way to get a connection to an external database per task in Spark Streaming?

2014-07-09 Thread Juan Rodríguez Hortalá
program to prevent resource leakage. A good choice is to use a third-party DB connection library which supports connection pooling, as that will reduce your programming effort. Thanks Jerry *From:* Juan Rodríguez Hortalá [mailto:juan.rodriguez.hort...@gmail.com] *Sent:* Tuesday, July 08, 2014 6

Re: Which is the best way to get a connection to an external database per task in Spark Streaming?

2014-07-08 Thread Juan Rodríguez Hortalá
) or multiple times (to deal with failures/timeouts) etc., which is maybe something you want to deal with in your SQL. Tobias On Tue, Jul 8, 2014 at 3:40 AM, Juan Rodríguez Hortalá juan.rodriguez.hort...@gmail.com wrote: Hi list, I'm writing a Spark Streaming program that reads from a kafka

Re: Which is the best way to get a connection to an external database per task in Spark Streaming?

2014-07-08 Thread Juan Rodríguez Hortalá
task can get this connection each time it executes a task, instead of creating a new one; that would be good for your scenario, since creating a connection is quite expensive for each task. Thanks Jerry *From:* Juan Rodríguez Hortalá [mailto:juan.rodriguez.hort...@gmail.com] *Sent:* Tuesday, July 08

Which is the best way to get a connection to an external database per task in Spark Streaming?

2014-07-07 Thread Juan Rodríguez Hortalá
Hi list, I'm writing a Spark Streaming program that reads from a kafka topic, performs some transformations on the data, and then inserts each record in a database with foreachRDD. I was wondering which is the best way to handle the connection to the database so each worker, or even each task,
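The pattern the replies converge on, sketched minimally (ConnectionPool, borrow, give and insert are hypothetical names; the replies suggest a third-party pooling library):

    dstream.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        // one connection per partition (and hence per task), borrowed from a
        // pool so tasks reuse connections instead of paying the setup cost
        val conn = ConnectionPool.borrow()
        try records.foreach(r => insert(conn, r))
        finally ConnectionPool.give(conn)
      }
    }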

[no subject]

2014-07-07 Thread Juan Rodríguez Hortalá
Hi all, I'm writing a Spark Streaming program that uses reduceByKeyAndWindow(), and when I change the windowLength or slideInterval I get the following exceptions, running in local mode 14/07/06 13:03:46 ERROR actor.OneForOneStrategy: key not found: 1404677026000 ms
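For reference, a minimal reduceByKeyAndWindow call (pairs is a hypothetical DStream[(String, Int)], and the concrete durations are arbitrary); both durations must be multiples of the batch interval:

    val ssc = new StreamingContext(conf, Seconds(10))  // batch interval
    val counts = pairs.reduceByKeyAndWindow(
      (a: Int, b: Int) => a + b,
      Seconds(30),  // window length: a multiple of the batch interval
      Seconds(20))  // sliding interval: also a multiple of the batch interval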

Re: No FileSystem for scheme: hdfs

2014-07-04 Thread Juan Rodríguez Hortalá
Hi, To cope with the issue with META-INF that Sean is pointing out, my solution is replacing maven-assembly.plugin with maven-shade-plugin, using the ServicesResourceTransformer ( http://maven.apache.org/plugins/maven-shade-plugin/examples/resource-transformers.html#ServicesResourceTransformer)
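The ServicesResourceTransformer configuration being referred to, sketched as the usual pom.xml fragment (plugin version omitted; see the linked page for the authoritative form):

    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-shade-plugin</artifactId>
      <configuration>
        <transformers>
          <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
        </transformers>
      </configuration>
    </plugin>

This merges the META-INF/services entries from all jars instead of letting one overwrite the others, which is what breaks the hdfs filesystem registration.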