Hi all,
I've got a problem with Spark Streaming (both 1.3.1 and 1.5). The
situation is as follows:
There is a component which gets a DStream of URLs. For each of these URLs,
it should fetch the content, retrieve several data elements, and pass those
on for further processing.
The relevant code looks like this:
...
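Since the code itself is cut off above, here is a plain-Python sketch of the per-batch pattern described: fetch each URL, collect the retrieved elements, and flatten them for downstream processing. `fetch_elements` is a hypothetical stub; in the real job this step would run inside a DStream transformation such as flatMap.

```python
# Plain-Python sketch of the fetch-and-flatten step the (elided) DStream
# code performs per batch. `fetch_elements` is a stub standing in for the
# real HTTP access; all names here are hypothetical.

def fetch_elements(url):
    # Stub: a real implementation would do an HTTP GET and parse the body.
    return [f"{url}/item-{i}" for i in range(2)]

def process_batch(urls):
    # Equivalent of urls_dstream.flatMap(fetch_elements): one flat list
    # containing every element retrieved from every URL in the batch.
    results = []
    for url in urls:
        results.extend(fetch_elements(url))
    return results
```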
> Regards,
> Ashish
>
> On Tue, Mar 24, 2015 at 4:07 PM, Jeffrey Jedele
> wrote:
>
>> Hi Ashish,
>> this might be what you're looking for:
>>
>>
>> https://spark.apache.org/docs/latest/sql-programming-guide.html#running-the-thrift-jdbcodb
Hi Ashish,
this might be what you're looking for:
https://spark.apache.org/docs/latest/sql-programming-guide.html#running-the-thrift-jdbcodbc-server
Regards,
Jeff
2015-03-24 11:28 GMT+01:00 Ashish Mukherjee :
> Hello,
>
> As of now, if I have to execute a Spark job, I need to create a jar and
>
Hey Abhi,
many of StreamingContext's methods to create input streams take a
StorageLevel parameter to configure this behavior. RDD partitions are
generally stored in the in-memory cache of worker nodes I think. You can
also configure replication and spilling to disk if needed.
Regards,
Jeff
2015-
What exactly do you mean by "alerts"?
Something specific to your data, or general events of the Spark cluster? For
the first, something like what Akhil suggested should work. For the latter, I
would suggest having a log consolidation system like Logstash in place and
using it to generate alerts.
Regards,
Jeff
of the receiver. When client
>> gets handle to a spark context and calls something like "val lines = ssc.
>> socketTextStream("localhost", )", is this the point when spark
>> master is contacted to determine which spark worker node the data is going
>
Hi,
I'm not completely sure about this either, but this is what we are doing
currently:
Configure your logging to write to STDOUT, not to a file explicitly. Spark
will capture stdout and stderr and separate the messages into an app/driver
folder structure in the configured worker directory.
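As a sketch of that first step, a minimal log4j.properties that routes everything to STDOUT could look like this (appender name and pattern are illustrative, not Spark's shipped defaults):

```properties
# Hypothetical minimal log4j.properties: route all logging to STDOUT so
# Spark can capture it into the per-app worker directories.
log4j.rootLogger=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.out
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c: %m%n
```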
We the
Hey Eason!
Weird problem indeed. More information will probably help to find the issue:
Have you searched the logs for peculiar messages?
What does your Spark environment look like? #workers, #threads, etc.?
Does it work if you create separate receivers for the topics?
Regards,
Jeff
2015-03-21 2:27
Hi Mohit,
it also depends on what the source for your streaming application is.
If you use Kafka, you can easily partition topics and have multiple
receivers on different machines.
If you have something like an HTTP, socket, etc. stream, you probably can't do
that. The Spark RDDs generated by your receiv
ifying their own criteria, it becomes hard to
> manage them all.
>
> On 20 March 2015 at 12:02, Jeffrey Jedele
> wrote:
>
>> Hey Harut,
>> I don't think there'll be any general practices as this part heavily
>> depends on your environment, skills and what you w
Hey Harut,
I don't think there'll be any general practices, as this part heavily
depends on your environment, skills, and what you want to achieve.
If you don't have a general direction yet, I'd suggest you have a look
at Elasticsearch+Kibana. It's very easy to set up, powerful and therefore
gets
I second what has been said already.
We just built a streaming app in Java and I would definitely choose Scala
this time.
Regards,
Jeff
2015-03-19 16:34 GMT+01:00 Emre Sevinc :
> Hello James,
>
> I've been working with Spark Streaming for the last 6 months, and I'm
> coding in Java 7. Even thou
Hi Mas,
I've never actually worked with GraphX, but one idea:
As far as I know, you can directly access the vertex and edge RDDs of your
Graph object. Why not simply run a .filter() on the edge RDD to get all
edges that originate from or end at your vertex?
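As a plain-Python sketch of that filter (modelling the edge RDD as a list of (src, dst) pairs; in GraphX itself this would be something along the lines of graph.edges.filter(e => e.srcId == v || e.dstId == v)):

```python
# Plain-Python analogue of filtering a GraphX edge RDD for the edges that
# touch a given vertex. Edges are simple (src, dst) pairs here.

def edges_touching(edges, vertex):
    return [(s, d) for (s, d) in edges if s == vertex or d == vertex]
```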
Regards,
Jeff
2015-03-18 10:52 GMT+01:00
ng :
> Any sub-category recommendations hadoop, MapR, CDH?
>
> On Wed, Mar 18, 2015 at 10:48 AM, James King
> wrote:
>
>> Many thanks Jeff will give it a go.
>>
>> On Wed, Mar 18, 2015 at 10:47 AM, Jeffrey Jedele <
>> jeffrey.jed...@gmail.com> wrote:
>
Probably 1.3.0 - it has some improvements in the included Kafka receiver
for streaming.
https://spark.apache.org/releases/spark-release-1-3-0.html
Regards,
Jeff
2015-03-18 10:38 GMT+01:00 James King :
> Hi All,
>
> Which build of Spark is best when using Kafka?
>
> Regards
> jk
>
't have multiple
> Spark versions deployed. If the classpath looks correct, please create
> a JIRA for this issue. Thanks! -Xiangrui
>
> On Tue, Mar 17, 2015 at 2:03 AM, Jeffrey Jedele
> wrote:
> > Hi all,
> > I'm trying to use the new LDA in mllib, but when trying to trai
Hi all,
I'm trying to use the new LDA in mllib, but when trying to train the model,
I'm getting following error:
java.lang.IllegalAccessError: tried to access class
org.apache.spark.util.collection.Sorter from class
org.apache.spark.graphx.impl.EdgePartitionBuilder
at
org.apache.spark.graphx.i
Hi Colin,
my understanding is that this is currently not possible with KafkaUtils.
You would have to write a custom receiver using Kafka's SimpleConsumer API.
https://spark.apache.org/docs/1.2.0/streaming-custom-receivers.html
https://cwiki.apache.org/confluence/display/KAFKA/0.8.0+SimpleConsumer+
Hi,
well, it really depends on what you want to do ;)
TF-IDF is a measure that originates in the information-retrieval context
and can be used to judge the relevance of a document for a given search
term.
It's also often used for text-related machine learning tasks. E.g. have a
loo
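A minimal sketch of the measure itself, using the common tf * log(N/df) form (MLlib's exact formula may differ):

```python
import math

# Minimal TF-IDF sketch over a toy corpus: tf = term count in the
# document, idf = log(N / number of documents containing the term).

def tf_idf(corpus):
    n = len(corpus)
    df = {}  # document frequency per term
    for doc in corpus:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    # One score dict per document: terms occurring everywhere score 0.
    return [{t: doc.count(t) * math.log(n / df[t]) for t in set(doc)}
            for doc in corpus]
```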
Hi,
we are using RDD#mapPartitions() to achieve the same.
Are there advantages/disadvantages of using one method over the other?
Regards,
Jeff
2015-02-26 20:02 GMT+01:00 Mark Hamstra :
> rdd.map(foo).filter(bar) and rdd.filter(bar).map(foo) will each already be
> pipelined into a single stage,
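To illustrate the usual argument for mapPartitions() over map(), here is a plain-Python model where partitions are just lists and the "resource" is a counter-tracked stub: both produce the same results, but per-element setup cost becomes per-partition.

```python
# Model of map() vs mapPartitions(): mapPartitions() pays any expensive
# setup (e.g. opening a connection) once per partition instead of once
# per element. Partitions are modelled as plain lists.

setup_calls = {"n": 0}

def make_resource():
    # Stub for something expensive, e.g. a DB connection.
    setup_calls["n"] += 1
    return "resource"

def map_style(partitions, f):
    # Like rdd.map: setup happens for every element.
    return [[f(make_resource(), x) for x in part] for part in partitions]

def map_partitions_style(partitions, f):
    # Like rdd.mapPartitions: setup happens once per partition.
    out = []
    for part in partitions:
        r = make_resource()
        out.append([f(r, x) for x in part])
    return out
```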
Hey Anish,
machine learning models that are updated with incoming data are commonly
known as "online learning systems". Matrix factorization is one way to
implement recommender systems, but not the only one. There are papers about
how to do online matrix factorization, but you will likely have to
i
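As a toy sketch of the "online" part (the hyperparameters and single-observation update below are illustrative, not from any particular paper): each incoming (user, item, rating) observation triggers one SGD step on the two latent vectors.

```python
# Toy online matrix-factorization update: one SGD step per incoming
# (user, item, rating) observation. P and Q map ids to latent vectors.

def sgd_step(P, Q, user, item, rating, lr=0.05, reg=0.02):
    # Prediction error for this single observation.
    err = rating - sum(p * q for p, q in zip(P[user], Q[item]))
    # Regularized gradient step on both latent vectors.
    for k in range(len(P[user])):
        p, q = P[user][k], Q[item][k]
        P[user][k] += lr * (err * q - reg * p)
        Q[item][k] += lr * (err * p - reg * q)
    return err
```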
I don't have an idea, but perhaps a little more context would be helpful.
What is the source of your streaming data? What's the storage level you're
using?
What are you doing? Some kind of window operations?
Regards,
Jeff
2015-02-26 18:59 GMT+01:00 Mukesh Jha :
>
> On Wed, Feb 25, 2015 at 8:09
If you read the streaming programming guide, you'll notice that Spark does
not do "real" streaming but "emulates" it with a so-called mini-batching
approach. Let's say you want to work with a continuous stream of incoming
events from a computing centre:
Batch interval:
That's the basic "heartbeat"
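The chopping itself can be sketched in a few lines of plain Python (timestamps in seconds; in Spark each resulting batch would become one RDD):

```python
# Sketch of the mini-batching idea: a continuous event stream is chopped
# into fixed-length batches by timestamp, and each batch is then
# processed as one small job.

def to_batches(events, batch_interval):
    # events: list of (timestamp, payload) pairs.
    batches = {}
    for ts, payload in events:
        batches.setdefault(int(ts // batch_interval), []).append(payload)
    return [batches[k] for k in sorted(batches)]
```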
Hi Rob,
I fear your questions will be hard to answer without additional information
about what kind of simulations you plan to do. int[r][c] basically means
you have a matrix of integers? You could for example map this to a
row-oriented RDD of integer-arrays or to a column oriented RDD of integer
a
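The two layouts can be sketched like this (plain lists standing in for RDDs; in Spark one would turn either list into an RDD with sc.parallelize):

```python
# An int[r][c] matrix mapped to a row-oriented collection (one array per
# row) vs a column-oriented one (one array per column).

def to_rows(matrix):
    return [list(row) for row in matrix]

def to_columns(matrix):
    # zip(*matrix) transposes the matrix, yielding one tuple per column.
    return [list(col) for col in zip(*matrix)]
```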
Hi Spico,
Yes, I think an "executor core" in Spark is basically a thread in a worker
pool. It's recommended to have one executor core per physical core on your
machine for best performance, but I think in theory you can create as many
threads as your OS allows.
For deployment:
There seems to be t
So basically you have lots of small ML tasks you want to run concurrently?
With "I've used repartition and cache to store the sub-datasets on only one
machine", do you mean that you reduced each RDD to a single partition?
Maybe you want to give the fair scheduler a try to get more of your tasks
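Enabling it is a one-line change in spark-defaults.conf; pool assignment per job group is then done at runtime via sc.setLocalProperty("spark.scheduler.pool", ...). A minimal sketch:

```properties
# Switch the scheduler from the default FIFO mode to FAIR so many small
# concurrent jobs share executors instead of queueing behind each other.
spark.scheduler.mode  FAIR
```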
So to summarize (I had a similar question):
Spark's log4j is by default configured to log to the console? Those
messages end up in the stderr files, and the approach does not support
rolling?
If I configure log4j to log to files, how can I keep the folder structure?
Should I use relative paths an
As I understand the matter:
Option 1) has benefits when you think that your network bandwidth may be a
bottleneck, because Spark opens several network connections on possibly
several different physical machines.
Option 2) - as you already pointed out - has the benefit that you occupy
less worker