DStream flatMap "swallows" records

2015-09-15 Thread Jeffrey Jedele
Hi all, I've got a problem with Spark Streaming (both 1.3.1 and 1.5). The situation: a component receives a DStream of URLs. For each of these URLs, it should fetch the content, retrieve several data elements, and pass those on for further processing. The relevant code looks like this:
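The code itself is cut off in the archive. A minimal sketch of the pattern described, with a hypothetical fetchElements helper, might look like this:

```scala
import org.apache.spark.streaming.dstream.DStream

// Hypothetical reconstruction -- the original retrieval/parsing code is not preserved.
def fetchElements(url: String): Seq[String] =
  scala.io.Source.fromURL(url).getLines().toSeq

def extract(urls: DStream[String]): DStream[String] =
  urls.flatMap { url =>
    try fetchElements(url)
    catch { case _: java.io.IOException => Seq.empty[String] } // a failed fetch emits nothing
  }
```

Note that flatMap emits no output for any element whose function returns an empty collection, so a swallowed exception or an empty parse result silently drops that record.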

Re: Spark as a service

2015-03-24 Thread Jeffrey Jedele
Hi Ashish, this might be what you're looking for: https://spark.apache.org/docs/latest/sql-programming-guide.html#running-the-thrift-jdbcodbc-server Regards, Jeff 2015-03-24 11:28 GMT+01:00 Ashish Mukherjee ashish.mukher...@gmail.com: Hello, As of now, if I have to execute a Spark job, I
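The linked guide covers the Thrift JDBC/ODBC server. Once it is started (sbin/start-thriftserver.sh, default port 10000), any JDBC client can query Spark SQL. A minimal sketch, with a made-up table name and assuming the Hive JDBC driver is on the classpath:

```scala
import java.sql.DriverManager

// Connect to the Spark SQL Thrift server over plain JDBC.
val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000", "user", "")
val stmt = conn.createStatement()
val rs = stmt.executeQuery("SELECT count(*) FROM my_table") // my_table is made up
while (rs.next()) println(rs.getLong(1))
conn.close()
```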

Re: Spark as a service

2015-03-24 Thread Jeffrey Jedele
Regards, Ashish On Tue, Mar 24, 2015 at 4:07 PM, Jeffrey Jedele jeffrey.jed...@gmail.com wrote: Hi Ashish, this might be what you're looking for: https://spark.apache.org/docs/latest/sql-programming-guide.html#running-the-thrift-jdbcodbc-server Regards, Jeff 2015-03-24 11:28 GMT+01:00

Re: RDD storage in Spark streaming

2015-03-23 Thread Jeffrey Jedele
Hey Abhi, many of StreamingContext's methods to create input streams take a StorageLevel parameter to configure this behavior. RDD partitions are generally stored in the in-memory cache of worker nodes I think. You can also configure replication and spilling to disk if needed. Regards, Jeff
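A minimal sketch of passing a StorageLevel when creating an input stream (socket source, host/port made up; assumes sc is an existing SparkContext):

```scala
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(10))
// The third argument controls how received blocks are stored:
// serialized, spilling to disk, and replicated to a second node.
val lines = ssc.socketTextStream("host", 9999, StorageLevel.MEMORY_AND_DISK_SER_2)
```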

Re: Spark streaming alerting

2015-03-23 Thread Jeffrey Jedele
What exactly do you mean by alerts? Something specific to your data, or general events of the Spark cluster? For the former, something like Akhil suggested should work. For the latter, I would suggest having a log consolidation system like Logstash in place and using it to generate alerts. Regards, Jeff

Re: Load balancing

2015-03-22 Thread Jeffrey Jedele
node the data is going to go to? On Fri, Mar 20, 2015 at 1:33 AM, Jeffrey Jedele jeffrey.jed...@gmail.com wrote: Hi Mohit, it also depends on what the source for your streaming application is. If you use Kafka, you can easily partition topics and have multiple receivers on different

Re: Spark per app logging

2015-03-21 Thread Jeffrey Jedele
Hi, I'm not completely sure about this either, but this is what we are doing currently: configure your logging to write to STDOUT, not to a file explicitly. Spark will capture stdout and stderr and separate the messages into an app/driver folder structure in the configured worker directory. We
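A sketch of a log4j.properties that routes everything to the console, along the lines of Spark's bundled template:

```
# Send all log output to the console; the Spark worker captures
# stdout/stderr per app/driver in its configured work directory.
log4j.rootCategory=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.out
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
```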

Re: Spark Streaming Not Reading Messages From Multiple Kafka Topics

2015-03-21 Thread Jeffrey Jedele
Hey Eason! Weird problem indeed. More information will probably help to find the issue: Have you searched the logs for peculiar messages? What does your Spark environment look like? #workers, #threads, etc.? Does it work if you create separate receivers for the topics? Regards, Jeff 2015-03-21

Re: Visualizing Spark Streaming data

2015-03-20 Thread Jeffrey Jedele
it becomes hard to manage them all. On 20 March 2015 at 12:02, Jeffrey Jedele jeffrey.jed...@gmail.com wrote: Hey Harut, I don't think there'll be any general practices, as this part heavily depends on your environment, skills and what you want to achieve. If you don't have a general direction yet

Re: Visualizing Spark Streaming data

2015-03-20 Thread Jeffrey Jedele
Hey Harut, I don't think there'll be any general practices, as this part heavily depends on your environment, skills and what you want to achieve. If you don't have a general direction yet, I'd suggest you have a look at Elasticsearch+Kibana. It's very easy to set up, powerful and therefore
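A sketch of feeding a DStream into Elasticsearch with the elasticsearch-hadoop connector so Kibana can chart it; the index/type name is made up, and stream is assumed to be an existing DStream[String]:

```scala
import org.elasticsearch.spark._ // elasticsearch-hadoop connector

// Index each micro-batch as documents; Kibana then visualizes the index.
stream.foreachRDD { rdd =>
  rdd.map(event => Map("message" -> event, "ts" -> System.currentTimeMillis))
     .saveToEs("metrics/events") // "metrics/events" is a made-up index/type
}
```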

Re: Load balancing

2015-03-20 Thread Jeffrey Jedele
Hi Mohit, it also depends on what the source for your streaming application is. If you use Kafka, you can easily partition topics and have multiple receivers on different machines. If you have something like an HTTP, socket, etc. stream, you probably can't do that. The Spark RDDs generated by your
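A minimal sketch of the multi-receiver setup for a partitioned Kafka topic (receiver-based spark-streaming-kafka API; hostnames, group and topic are made up; assumes ssc is an existing StreamingContext):

```scala
import org.apache.spark.streaming.kafka.KafkaUtils

// Each createStream call creates one receiver; Spark places them on
// different workers, spreading ingest load across machines.
val streams = (1 to 4).map { _ =>
  KafkaUtils.createStream(ssc, "zk1:2181", "my-consumer-group", Map("events" -> 1))
}
val unified = ssc.union(streams)
```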

Re: Writing Spark Streaming Programs

2015-03-19 Thread Jeffrey Jedele
I second what has been said already. We just built a streaming app in Java, and next time I would definitely choose Scala. Regards, Jeff 2015-03-19 16:34 GMT+01:00 Emre Sevinc emre.sev...@gmail.com: Hello James, I've been working with Spark Streaming for the last 6 months, and I'm coding in

Re: Spark + Kafka

2015-03-18 Thread Jeffrey Jedele
Probably 1.3.0 - it has some improvements in the included Kafka receiver for streaming. https://spark.apache.org/releases/spark-release-1-3-0.html Regards, Jeff 2015-03-18 10:38 GMT+01:00 James King jakwebin...@gmail.com: Hi All, Which build of Spark is best when using Kafka? Regards jk
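Among the 1.3.0 additions was the receiver-less "direct" Kafka stream; a minimal sketch (broker list and topic are made up; assumes ssc is an existing StreamingContext):

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

// Direct stream: no receivers, one RDD partition per Kafka partition.
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("my-topic"))
```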

Re: GraphX: Get edges for a vertex

2015-03-18 Thread Jeffrey Jedele
Hi Mas, I've never actually worked with GraphX, but one idea: as far as I know, you can directly access the vertex and edge RDDs of your Graph object. Why not simply run a .filter() on the edge RDD to get all edges that originate from or end at your vertex? Regards, Jeff 2015-03-18 10:52 GMT+01:00
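A minimal sketch of that filter idea:

```scala
import org.apache.spark.graphx.{Graph, VertexId}

// All edges that start or end at the given vertex, straight from the edge RDD.
def edgesOf[VD, ED](graph: Graph[VD, ED], vid: VertexId) =
  graph.edges.filter(e => e.srcId == vid || e.dstId == vid)
```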

Re: Spark + Kafka

2015-03-18 Thread Jeffrey Jedele
recommendations hadoop, MapR, CDH? On Wed, Mar 18, 2015 at 10:48 AM, James King jakwebin...@gmail.com wrote: Many thanks Jeff will give it a go. On Wed, Mar 18, 2015 at 10:47 AM, Jeffrey Jedele jeffrey.jed...@gmail.com wrote: Probably 1.3.0 - it has some improvements in the included Kafka

IllegalAccessError in GraphX (Spark 1.3.0 LDA)

2015-03-17 Thread Jeffrey Jedele
Hi all, I'm trying to use the new LDA in MLlib, but when trying to train the model, I'm getting the following error: java.lang.IllegalAccessError: tried to access class org.apache.spark.util.collection.Sorter from class org.apache.spark.graphx.impl.EdgePartitionBuilder at

Re: IllegalAccessError in GraphX (Spark 1.3.0 LDA)

2015-03-17 Thread Jeffrey Jedele
don't have multiple Spark versions deployed. If the classpath looks correct, please create a JIRA for this issue. Thanks! -Xiangrui On Tue, Mar 17, 2015 at 2:03 AM, Jeffrey Jedele jeffrey.jed...@gmail.com wrote: Hi all, I'm trying to use the new LDA in mllib, but when trying to train

Re: KafkaUtils and specifying a specific partition

2015-03-12 Thread Jeffrey Jedele
Hi Colin, my understanding is that this is currently not possible with KafkaUtils. You would have to write a custom receiver using Kafka's SimpleConsumer API. https://spark.apache.org/docs/1.2.0/streaming-custom-receivers.html
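A skeleton of such a custom receiver; the actual SimpleConsumer logic for reading one specific partition is left out:

```scala
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class SinglePartitionReceiver(topic: String, partition: Int)
    extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  def onStart(): Unit = {
    new Thread("kafka-partition-reader") {
      override def run(): Unit = {
        while (!isStopped) {
          // fetch messages for (topic, partition) via SimpleConsumer,
          // then hand each one to Spark with store(message)
        }
      }
    }.start()
  }

  def onStop(): Unit = {} // shut the SimpleConsumer down here
}
```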

Re: How to use the TF-IDF model?

2015-03-09 Thread Jeffrey Jedele
Hi, well, it really depends on what you want to do ;) TF-IDF is a measure that originates in the information retrieval context and can be used to judge the relevance of a document in the context of a given search term. It's also often used for text-related machine learning tasks. E.g. have a
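A minimal sketch of computing TF-IDF vectors with MLlib (input path and whitespace tokenization are made up; assumes sc is an existing SparkContext):

```scala
import org.apache.spark.mllib.feature.{HashingTF, IDF}

// documents: one token sequence per document.
val documents = sc.textFile("docs.txt").map(_.split(" ").toSeq)
val tf = new HashingTF().transform(documents) // term-frequency vectors
tf.cache()
val tfidf = new IDF().fit(tf).transform(tf)   // reweighted by inverse document frequency
```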

Re: Augment more data to existing MatrixFactorization Model?

2015-02-27 Thread Jeffrey Jedele
Hey Anish, machine learning models that are updated with incoming data are commonly known as online learning systems. Matrix factorization is one way to implement recommender systems, but not the only one. There are papers about how to do online matrix factorization, but you will likely have to

Re: how to map and filter in one step?

2015-02-27 Thread Jeffrey Jedele
Hi, we are using RDD#mapPartitions() to achieve the same. Are there advantages/disadvantages of using one method over the other? Regards, Jeff 2015-02-26 20:02 GMT+01:00 Mark Hamstra m...@clearstorydata.com: rdd.map(foo).filter(bar) and rdd.filter(bar).map(foo) will each already be pipelined
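A sketch of the equivalent formulations (foo, bar and the input RDD are placeholders; assumes sc is an existing SparkContext):

```scala
val rdd = sc.parallelize(1 to 100)
def bar(x: Int): Boolean = x % 2 == 0  // placeholder predicate
def foo(x: Int): Int = x * 10          // placeholder transform

val a = rdd.filter(bar).map(foo)                             // pipelined by Spark anyway
val b = rdd.mapPartitions(_.filter(bar).map(foo))            // one function call per partition
val c = rdd.flatMap(x => if (bar(x)) Some(foo(x)) else None) // map+filter fused by hand
```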

Re: spark streaming, batch interval, window interval and window sliding interval difference

2015-02-27 Thread Jeffrey Jedele
If you read the streaming programming guide, you'll notice that Spark does not do real streaming but emulates it with a so-called mini-batching approach. Let's say you want to work with a continuous stream of incoming events from a computing centre: Batch interval: That's the basic heartbeat of
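A sketch showing all three intervals together (host/port made up; assumes conf is an existing SparkConf):

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Batch interval: the basic heartbeat -- one new RDD every 2 seconds.
val ssc = new StreamingContext(conf, Seconds(2))
val events = ssc.socketTextStream("host", 9999)

// Window interval (30s): how much history each windowed RDD spans.
// Sliding interval (10s): how often a new windowed RDD is emitted.
// Both must be multiples of the batch interval.
val counts = events.window(Seconds(30), Seconds(10)).count()
```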

Re: SparkStreaming failing with exception Could not compute split, block input

2015-02-27 Thread Jeffrey Jedele
I don't have an idea, but perhaps a little more context would be helpful. What is the source of your streaming data? What's the storage level you're using? What are you doing? Some kind of window operations? Regards, Jeff 2015-02-26 18:59 GMT+01:00 Mukesh Jha me.mukesh@gmail.com: On

Re: Considering Spark for large data elements

2015-02-26 Thread Jeffrey Jedele
Hi Rob, I fear your questions will be hard to answer without additional information about what kind of simulations you plan to do. int[r][c] basically means you have a matrix of integers? You could, for example, map this to a row-oriented RDD of integer arrays or to a column-oriented RDD of integer
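For instance, the row-oriented layout could be sketched like this (assumes sc is an existing SparkContext; the column-oriented variant would transpose first):

```scala
// int[r][c] as one Array[Int] per matrix row, distributed across the cluster.
val matrix: Array[Array[Int]] = Array.fill(1000, 100)(0) // placeholder data
val rows = sc.parallelize(matrix)                        // RDD[Array[Int]]
```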

Re: spark streaming: stderr does not roll

2015-02-26 Thread Jeffrey Jedele
So to summarize (I had a similar question): Spark's log4j is configured by default to log to the console? Those messages end up in the stderr files, and the approach does not support rolling? If I configure log4j to log to files, how can I keep the folder structure? Should I use relative paths

Re: How to efficiently control concurrent Spark jobs

2015-02-26 Thread Jeffrey Jedele
So basically you have lots of small ML tasks you want to run concurrently? With "I've used repartition and cache to store the sub-datasets on only one machine", do you mean that you reduced each RDD to a single partition? Maybe you want to give the fair scheduler a try to get more of your tasks
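A sketch of turning the fair scheduler on (the pool name is made up):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// FAIR mode lets many small concurrent jobs share executors
// instead of queuing up behind each other FIFO-style.
val conf = new SparkConf().setAppName("concurrent-ml").set("spark.scheduler.mode", "FAIR")
val sc = new SparkContext(conf)
sc.setLocalProperty("spark.scheduler.pool", "ml-jobs") // per-thread pool assignment
```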

Re: Number of Executors per worker process

2015-02-26 Thread Jeffrey Jedele
Hi Spico, yes, I think an executor core in Spark is basically a thread in a worker pool. For best performance it's recommended to have one executor core per physical core on your machine, but in theory you can create as many threads as your OS allows. For deployment: there seems to be

Re: Re: Many Receiver vs. Many threads per Receiver

2015-02-26 Thread Jeffrey Jedele
As I understand the matter: Option 1) has benefits when you think your network bandwidth may be a bottleneck, because Spark opens several network connections on possibly several different physical machines. Option 2) - as you already pointed out - has the benefit that you occupy less
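A sketch contrasting the two options with the receiver-based Kafka API (hosts, group and topic are made up; assumes ssc is an existing StreamingContext):

```scala
import org.apache.spark.streaming.kafka.KafkaUtils

// Option 1: several receivers, potentially on different machines --
// more network connections, but each receiver occupies a core.
val manyReceivers = (1 to 3).map { _ =>
  KafkaUtils.createStream(ssc, "zk1:2181", "group", Map("events" -> 1))
}

// Option 2: one receiver consuming the topic with several threads --
// fewer cores occupied, but all traffic flows through one machine.
val oneReceiver = KafkaUtils.createStream(ssc, "zk1:2181", "group", Map("events" -> 3))
```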