DStream flatMap "swallows" records

2015-09-15 Thread Jeffrey Jedele
Hi all, I've got a problem with Spark Streaming (both 1.3.1 and 1.5). The situation is as follows: there is a component which gets a DStream of URLs. For each of these URLs, it should fetch the content, retrieve several data elements and pass those on for further processing. The relevant code looks like this: ...
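The code itself is elided above; as a hedged reconstruction of the described pattern (the fetch() helper and the socket source are illustrative assumptions, not the original code), note how an exception thrown inside flatMap can make records appear to be "swallowed":

    import org.apache.spark.streaming.dstream.DStream

    // sketch only: fetch() is a hypothetical helper; `ssc` is assumed to be
    // an existing StreamingContext and the socket source a stand-in
    def fetch(url: String): Seq[String] =
      scala.io.Source.fromURL(url).getLines().toSeq

    val urls: DStream[String] = ssc.socketTextStream("localhost", 9999)
    val elements: DStream[String] = urls.flatMap { url =>
      try fetch(url)
      catch {
        case e: Exception =>
          // an uncaught exception here would silently drop the record with it
          System.err.println(s"failed to fetch $url: ${e.getMessage}")
          Seq.empty[String]
      }
    }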

Re: Spark as a service

2015-03-24 Thread Jeffrey Jedele
> Regards, Ashish > On Tue, Mar 24, 2015 at 4:07 PM, Jeffrey Jedele wrote: >> Hi Ashish, this might be what you're looking for: >> https://spark.apache.org/docs/latest/sql-programming-guide.html#running-the-thrift-jdbcodbc-server

Re: Spark as a service

2015-03-24 Thread Jeffrey Jedele
Hi Ashish, this might be what you're looking for: https://spark.apache.org/docs/latest/sql-programming-guide.html#running-the-thrift-jdbcodbc-server Regards, Jeff 2015-03-24 11:28 GMT+01:00 Ashish Mukherjee: > Hello, > As of now, if I have to execute a Spark job, I need to create a jar and
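Once the Thrift JDBC/ODBC server is started (./sbin/start-thriftserver.sh in the Spark distribution, as described in the linked guide), jobs can be submitted as plain SQL over JDBC instead of packaging a jar. A minimal sketch, assuming the server runs with default settings on localhost:10000 and the Hive JDBC driver is on the classpath:

    import java.sql.DriverManager

    Class.forName("org.apache.hive.jdbc.HiveDriver")
    val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "", "")
    val rs = conn.createStatement().executeQuery("SHOW TABLES")
    while (rs.next()) println(rs.getString(1))
    conn.close()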

Re: RDD storage in spark steaming

2015-03-23 Thread Jeffrey Jedele
Hey Abhi, many of StreamingContext's methods to create input streams take a StorageLevel parameter to configure this behavior. RDD partitions are generally stored in the in-memory cache of worker nodes I think. You can also configure replication and spilling to disk if needed. Regards, Jeff 2015-
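For example, socketTextStream has an overload taking a StorageLevel; a sketch (host, port and batch interval are placeholders) that stores received blocks serialized, replicated twice and with spilling to disk:

    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // assuming `sc` is an existing SparkContext
    val ssc = new StreamingContext(sc, Seconds(5))
    val lines = ssc.socketTextStream(
      "localhost", 9999,                  // placeholder host and port
      StorageLevel.MEMORY_AND_DISK_SER_2) // serialized, 2x replicated, spills to disk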

Re: Spark streaming alerting

2015-03-23 Thread Jeffrey Jedele
What exactly do you mean by "alerts"? Something specific to your data, or general events of the Spark cluster? For the former, something like Akhil suggested should work. For the latter, I would suggest having a log consolidation system like Logstash in place and using it to generate alerts. Regards, Jeff
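For the data-specific case, one common pattern is to scan each mini-batch in foreachRDD and push matches to an external channel. A sketch under assumptions (isAlert and sendAlert are hypothetical application code, `events` an existing DStream[String]):

    def isAlert(event: String): Boolean = event.contains("ERROR")  // placeholder rule
    def sendAlert(event: String): Unit = println(s"ALERT: $event") // stand-in side effect

    events.foreachRDD { rdd =>
      // collect() is only safe if the number of alerts per batch stays small
      rdd.filter(isAlert).collect().foreach(sendAlert)
    }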

Re: Load balancing

2015-03-22 Thread Jeffrey Jedele
> of the receiver. When the client gets a handle to a Spark context and calls something like "val lines = ssc.socketTextStream("localhost", )", is this the point when the Spark master is contacted to determine which Spark worker node the data is going

Re: Spark per app logging

2015-03-21 Thread Jeffrey Jedele
Hi, I'm not completely sure about this either, but this is what we are doing currently: configure your logging to write to STDOUT, not to a file explicitly. Spark will capture stdout and stderr and separate the messages into an app/driver folder structure in the configured worker directory. We the
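For reference, Spark's bundled conf/log4j.properties.template does essentially this (a console appender writing to stderr), which is why the messages end up in the captured per-app files:

    log4j.rootCategory=INFO, console
    log4j.appender.console=org.apache.log4j.ConsoleAppender
    log4j.appender.console.target=System.err
    log4j.appender.console.layout=org.apache.log4j.PatternLayout
    log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n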

Re: Spark Streaming Not Reading Messages From Multiple Kafka Topics

2015-03-21 Thread Jeffrey Jedele
Hey Eason! Weird problem indeed. More information will probably help to find the issue: Have you searched the logs for peculiar messages? What does your Spark environment look like? #workers, #threads, etc.? Does it work if you create separate receivers for the topics? Regards, Jeff 2015-03-21 2:27
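For the "separate receivers" variant, a sketch (ZooKeeper quorum, consumer group and topic names are placeholders; `ssc` is assumed to be an existing StreamingContext): one receiver per topic, unioned back into a single stream:

    import org.apache.spark.streaming.kafka.KafkaUtils

    val topics = Seq("topicA", "topicB") // placeholder topic names
    val streams = topics.map { topic =>
      KafkaUtils.createStream(ssc, "zkhost:2181", "my-group", Map(topic -> 1))
    }
    val allMessages = ssc.union(streams) // DStream of (key, message) pairs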

Re: Load balancing

2015-03-20 Thread Jeffrey Jedele
Hi Mohit, it also depends on what the source for your streaming application is. If you use Kafka, you can easily partition topics and have multiple receivers on different machines. If you have something like an HTTP, socket, etc. stream, you probably can't do that. The Spark RDDs generated by your receiv

Re: Visualizing Spark Streaming data

2015-03-20 Thread Jeffrey Jedele
ifying their own criteria, it becomes hard to manage them all. > On 20 March 2015 at 12:02, Jeffrey Jedele wrote: >> Hey Harut, I don't think there'll be any general practices as this part heavily depends on your environment, skills and what you w

Re: Visualizing Spark Streaming data

2015-03-20 Thread Jeffrey Jedele
Hey Harut, I don't think there'll be any general practices as this part heavily depends on your environment, skills and what you want to achieve. If you don't have a general direction yet, I'd suggest you have a look at Elasticsearch+Kibana. It's very easy to set up, powerful and therefore gets

Re: Writing Spark Streaming Programs

2015-03-19 Thread Jeffrey Jedele
I second what has been said already. We just built a streaming app in Java and I would definitely choose Scala this time. Regards, Jeff 2015-03-19 16:34 GMT+01:00 Emre Sevinc: > Hello James, > I've been working with Spark Streaming for the last 6 months, and I'm coding in Java 7. Even thou

Re: GraphX: Get edges for a vertex

2015-03-18 Thread Jeffrey Jedele
Hi Mas, I never actually worked with GraphX, but one idea: As far as I know, you can directly access the vertex and edge RDDs of your Graph object. Why not simply run a .filter() on the edge RDD to get all edges that originate from or end at your vertex? Regards, Jeff 2015-03-18 10:52 GMT+01:00
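A sketch of that idea (42L is a placeholder vertex id; `graph` is assumed to be an existing Graph):

    import org.apache.spark.graphx.VertexId

    val vid: VertexId = 42L // placeholder vertex id
    // all edges incident to the vertex, regardless of direction
    val incident = graph.edges.filter(e => e.srcId == vid || e.dstId == vid)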

Re: Spark + Kafka

2015-03-18 Thread Jeffrey Jedele
ng : > Any sub-category recommendations: Hadoop, MapR, CDH? > On Wed, Mar 18, 2015 at 10:48 AM, James King wrote: >> Many thanks Jeff, will give it a go. >> On Wed, Mar 18, 2015 at 10:47 AM, Jeffrey Jedele <jeffrey.jed...@gmail.com> wrote:

Re: Spark + Kafka

2015-03-18 Thread Jeffrey Jedele
Probably 1.3.0 - it has some improvements in the included Kafka receiver for streaming. https://spark.apache.org/releases/spark-release-1-3-0.html Regards, Jeff 2015-03-18 10:38 GMT+01:00 James King: > Hi All, > Which build of Spark is best when using Kafka? > Regards, jk
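The most visible streaming change in 1.3.0 is the (then experimental) receiver-less "direct" Kafka API; a sketch with broker and topic as placeholders, assuming an existing StreamingContext `ssc`:

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092") // placeholder broker
    val topics = Set("my-topic")                                    // placeholder topic
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)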

Re: IllegalAccessError in GraphX (Spark 1.3.0 LDA)

2015-03-17 Thread Jeffrey Jedele
't have multiple Spark versions deployed. If the classpath looks correct, please create a JIRA for this issue. Thanks! -Xiangrui > On Tue, Mar 17, 2015 at 2:03 AM, Jeffrey Jedele wrote: >> Hi all, I'm trying to use the new LDA in mllib, but when trying to trai

IllegalAccessError in GraphX (Spark 1.3.0 LDA)

2015-03-17 Thread Jeffrey Jedele
Hi all, I'm trying to use the new LDA in mllib, but when trying to train the model, I'm getting the following error: java.lang.IllegalAccessError: tried to access class org.apache.spark.util.collection.Sorter from class org.apache.spark.graphx.impl.EdgePartitionBuilder at org.apache.spark.graphx.i

Re: KafkaUtils and specifying a specific partition

2015-03-12 Thread Jeffrey Jedele
Hi Colin, my understanding is that this is currently not possible with KafkaUtils. You would have to write a custom receiver using Kafka's SimpleConsumer API. https://spark.apache.org/docs/1.2.0/streaming-custom-receivers.html https://cwiki.apache.org/confluence/display/KAFKA/0.8.0+SimpleConsumer+
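The skeleton of such a custom receiver (a sketch only: fetchFromPartition is a hypothetical stand-in for the actual SimpleConsumer fetch logic):

    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.receiver.Receiver

    class PartitionReceiver(topic: String, partition: Int)
        extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

      // hypothetical helper -- the real work would use Kafka's SimpleConsumer API
      private def fetchFromPartition(): Seq[String] = Seq.empty

      def onStart(): Unit = {
        new Thread("partition-receiver") {
          override def run(): Unit =
            while (!isStopped()) fetchFromPartition().foreach(store)
        }.start()
      }

      def onStop(): Unit = {} // the loop in onStart checks isStopped()
    }

    // assuming `ssc` is an existing StreamingContext:
    val stream = ssc.receiverStream(new PartitionReceiver("my-topic", 0))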

Re: How to use the TF-IDF model?

2015-03-09 Thread Jeffrey Jedele
Hi, well, it really depends on what you want to do ;) TF-IDF is a measure that originates in the information retrieval context and that can be used to judge the relevancy of a document in the context of a given search term. It's also often used for text-related machine learning tasks. E.g. have a loo
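For the machine-learning route, MLlib ships a TF-IDF pipeline; a minimal sketch (the file path is a placeholder and whitespace tokenization a simplification; `sc` is an existing SparkContext):

    import org.apache.spark.mllib.feature.{HashingTF, IDF}
    import org.apache.spark.rdd.RDD

    val documents: RDD[Seq[String]] =
      sc.textFile("docs.txt").map(_.split(" ").toSeq) // placeholder path

    val tf = new HashingTF().transform(documents) // term-frequency vectors
    tf.cache()                                    // reused by fit() and transform()
    val tfidf = new IDF().fit(tf).transform(tf)   // TF-IDF weighted vectors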

Re: how to map and filter in one step?

2015-02-27 Thread Jeffrey Jedele
Hi, we are using RDD#mapPartitions() to achieve the same. Are there advantages/disadvantages of using one method over the other? Regards, Jeff 2015-02-26 20:02 GMT+01:00 Mark Hamstra: > rdd.map(foo).filter(bar) and rdd.filter(bar).map(foo) will each already be pipelined into a single stage,
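For comparison, a sketch of both forms (foo and bar are placeholder functions): since map and filter already get pipelined into one stage, the mapPartitions variant mostly saves some per-element function-call overhead rather than a pass over the data:

    def bar(x: Int): Boolean = x % 2 == 0 // placeholder predicate
    def foo(x: Int): String = s"value-$x" // placeholder mapping

    val rdd = sc.parallelize(1 to 10) // assuming an existing SparkContext `sc`

    val chained = rdd.filter(bar).map(foo)
    val fused   = rdd.mapPartitions(iter => iter.filter(bar).map(foo))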

Re: Augment more data to existing MatrixFactorization Model?

2015-02-27 Thread Jeffrey Jedele
Hey Anish, machine learning models that are updated with incoming data are commonly known as "online learning systems". Matrix factorization is one way to implement recommender systems, but not the only one. There are papers about how to do online matrix factorization, but you will likely have to i

Re: SparkStreaming failing with exception Could not compute split, block input

2015-02-27 Thread Jeffrey Jedele
I don't have an idea, but perhaps a little more context would be helpful. What is the source of your streaming data? What's the storage level you're using? What are you doing? Some kind of window operations? Regards, Jeff 2015-02-26 18:59 GMT+01:00 Mukesh Jha: > On Wed, Feb 25, 2015 at 8:09

Re: spark streaming, batchinterval,windowinterval and window sliding interval difference

2015-02-27 Thread Jeffrey Jedele
If you read the streaming programming guide, you'll notice that Spark does not do "real" streaming but "emulates" it with a so-called mini-batching approach. Let's say you want to work with a continuous stream of incoming events from a computing centre. Batch interval: that's the basic "heartbeat"
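A sketch to make the three intervals concrete (the durations are placeholders; window and slide must be multiples of the batch interval):

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // batch interval: one RDD of incoming events every 2 seconds
    val ssc = new StreamingContext(conf, Seconds(2)) // assuming an existing SparkConf `conf`

    // assuming `events` is a DStream[(String, Long)] of (eventType, 1L) pairs
    val counts = events.reduceByKeyAndWindow(
      (a: Long, b: Long) => a + b,
      Seconds(30), // window interval: each result covers the last 30s of events
      Seconds(10)) // sliding interval: a new result is emitted every 10s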

Re: Considering Spark for large data elements

2015-02-26 Thread Jeffrey Jedele
Hi Rob, I fear your questions will be hard to answer without additional information about what kind of simulations you plan to do. int[r][c] basically means you have a matrix of integers? You could, for example, map this to a row-oriented RDD of integer arrays or to a column-oriented RDD of integer a
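A sketch of the two layouts (toy random data standing in for the real matrix; `sc` is assumed to be an existing SparkContext):

    // toy 3x4 integer matrix
    val matrix: Array[Array[Int]] = Array.fill(3, 4)(scala.util.Random.nextInt(100))

    val rowRdd = sc.parallelize(matrix)           // row-oriented: one integer array per row
    val colRdd = sc.parallelize(matrix.transpose) // column-oriented: one array per column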

Re: Number of Executors per worker process

2015-02-26 Thread Jeffrey Jedele
Hi Spico, Yes, I think an "executor core" in Spark is basically a thread in a worker pool. It's recommended to have one executor core per physical core on your machine for best performance, but I think in theory you can create as many threads as your OS allows. For deployment: There seems to be t

Re: How to efficiently control concurrent Spark jobs

2015-02-26 Thread Jeffrey Jedele
So basically you have lots of small ML tasks you want to run concurrently? With "I've used repartition and cache to store the sub-datasets on only one machine" you mean that you reduced each RDD to have one partition only? Maybe you want to give the fair scheduler a try to get more of your tasks
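A sketch of the fair-scheduler setup (app and pool names are placeholders); jobs submitted from different threads then share the cluster instead of queueing FIFO:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("many-small-ml-jobs")    // placeholder app name
      .set("spark.scheduler.mode", "FAIR")
    val sc = new SparkContext(conf)

    // work submitted from this thread goes into a named pool (placeholder name)
    sc.setLocalProperty("spark.scheduler.pool", "ml-tasks")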

Re: spark streaming: stderr does not roll

2015-02-26 Thread Jeffrey Jedele
So to summarize (I had a similar question): Spark's log4j is configured by default to log to the console? Those messages end up in the stderr files and this approach does not support rolling? If I configure log4j to log to files, how can I keep the folder structure? Should I use relative paths an
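For what it's worth, the captured executor stderr/stdout files can be rolled via configuration (these property names are documented for Spark 1.1+; the values below are placeholders), e.g. in spark-defaults.conf:

    spark.executor.logs.rolling.strategy          time
    spark.executor.logs.rolling.time.interval     daily
    spark.executor.logs.rolling.maxRetainedFiles  7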

Re: Re: Many Receiver vs. Many threads per Receiver

2015-02-26 Thread Jeffrey Jedele
As I understand the matter: Option 1) has benefits when you think that your network bandwidth may be a bottleneck, because Spark opens several network connections on possibly several different physical machines. Option 2) - as you already pointed out - has the benefit that you occupy less worker
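A sketch of both options with the Kafka receiver (ZooKeeper address, group and topic are placeholders; `ssc` is assumed to be an existing StreamingContext):

    import org.apache.spark.streaming.kafka.KafkaUtils

    // Option 1: several receivers, which Spark may place on different machines
    val streams = (1 to 3).map { _ =>
      KafkaUtils.createStream(ssc, "zkhost:2181", "group", Map("my-topic" -> 1))
    }
    val option1 = ssc.union(streams)

    // Option 2: a single receiver consuming with three threads (one worker slot used)
    val option2 = KafkaUtils.createStream(ssc, "zkhost:2181", "group", Map("my-topic" -> 3))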