Hi all,
I've got a problem with Spark Streaming (both 1.3.1 and 1.5). The situation is as follows:
There is a component which gets a DStream of URLs. For each of these URLs, it should fetch the content, retrieve several data elements, and pass those on for further processing.
The relevant code looks like this:
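(The snippet is cut off here, so below is a minimal sketch of that pattern; urls, the fetch logic, and extractElements are illustrative assumptions, not the original code:)

    import scala.io.Source
    import org.apache.spark.streaming.dstream.DStream

    // hypothetical parser standing in for whatever extraction the component does
    def extractElements(body: String): Seq[String] = body.split("\n").toSeq

    // for each URL in the stream, fetch it on the executors and emit the elements
    def process(urls: DStream[String]): DStream[String] =
      urls.flatMap { url =>
        val body = Source.fromURL(url).mkString
        extractElements(body)
      }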
Hi Ashish,
this might be what you're looking for:
https://spark.apache.org/docs/latest/sql-programming-guide.html#running-the-thrift-jdbcodbc-server
Regards,
Jeff
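(If it helps, a minimal sketch of talking to the Thrift server over JDBC once it's started via sbin/start-thriftserver.sh; host and port are the defaults and assumed here:)

    import java.sql.DriverManager

    // the Hive JDBC driver must be on the classpath
    Class.forName("org.apache.hive.jdbc.HiveDriver")
    val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "", "")
    val rs = conn.createStatement().executeQuery("SHOW TABLES")
    while (rs.next()) println(rs.getString(1))
    conn.close()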
2015-03-24 11:28 GMT+01:00 Ashish Mukherjee ashish.mukher...@gmail.com:
Hello,
As of now, if I have to execute a Spark job, I ...
Regards,
Ashish
Hey Abhi,
Many of StreamingContext's methods for creating input streams take a StorageLevel parameter to configure this behavior. I think RDD partitions are generally stored in the in-memory cache of the worker nodes. You can also configure replication and spilling to disk if needed.
Regards,
Jeff
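(A minimal sketch of what I mean, e.g. for a socket stream; the app name, host, and port are placeholders:)

    import org.apache.spark.SparkConf
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(new SparkConf().setAppName("demo"), Seconds(2))
    // serialized in-memory storage, replicated to two nodes, spilling to disk if needed
    val lines = ssc.socketTextStream("host", 9999, StorageLevel.MEMORY_AND_DISK_SER_2)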
What exactly do you mean by alerts?
Something specific to your data, or general events of the Spark cluster? For the former, something like Akhil suggested should work. For the latter, I would suggest having a log consolidation system like Logstash in place and using it to generate alerts.
Regards,
Jeff
node the data is going
to go to?
Hi,
I'm not completely sure about this either, but this is what we are doing
currently:
Configure your logging to write to STDOUT, not to a file explicitly. Spark will capture stdout and stderr and separate the messages into an app/driver folder structure in the configured worker directory.
We
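(For reference, a minimal log4j.properties sketch that routes everything to stdout; the pattern layout is just our choice:)

    log4j.rootCategory=INFO, console
    log4j.appender.console=org.apache.log4j.ConsoleAppender
    log4j.appender.console.target=System.out
    log4j.appender.console.layout=org.apache.log4j.PatternLayout
    log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n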
Hey Eason!
Weird problem indeed. More information will probably help to find the issue:
Have you searched the logs for peculiar messages?
What does your Spark environment look like? #workers, #threads, etc.?
Does it work if you create separate receivers for the topics?
Regards,
Jeff
2015-03-21
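(Regarding the last question above, a sketch of what I mean, assuming an existing StreamingContext ssc and the receiver-based KafkaUtils API; the ZooKeeper address and topic names are placeholders:)

    import org.apache.spark.streaming.kafka.KafkaUtils

    // one receiver per topic instead of a single receiver for both
    val a = KafkaUtils.createStream(ssc, "zkhost:2181", "my-group", Map("topicA" -> 1))
    val b = KafkaUtils.createStream(ssc, "zkhost:2181", "my-group", Map("topicB" -> 1))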
, it becomes hard to
manage them all.
On 20 March 2015 at 12:02, Jeffrey Jedele jeffrey.jed...@gmail.com
wrote:
Hey Harut,
I don't think there'll be any general practices, as this part heavily depends on your environment, skills, and what you want to achieve.
If you don't have a general direction yet, I'd suggest you have a look at Elasticsearch+Kibana. It's very easy to set up, powerful, and therefore
Hi Mohit,
it also depends on what the source for your streaming application is.
If you use Kafka, you can easily partition topics and have multiple
receivers on different machines.
If you have something like an HTTP, socket, etc. stream, you probably can't do that. The Spark RDDs generated by your
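(A sketch of the Kafka case, assuming an existing StreamingContext ssc; the ZooKeeper address, group, and topic are placeholders:)

    import org.apache.spark.streaming.kafka.KafkaUtils

    // three receivers reading the same topic in parallel, merged into one stream
    val streams = (1 to 3).map { _ =>
      KafkaUtils.createStream(ssc, "zkhost:2181", "my-group", Map("events" -> 1))
    }
    val unified = ssc.union(streams)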
I second what has been said already.
We just built a streaming app in Java and I would definitely choose Scala
this time.
Regards,
Jeff
2015-03-19 16:34 GMT+01:00 Emre Sevinc emre.sev...@gmail.com:
Hello James,
I've been working with Spark Streaming for the last 6 months, and I'm
coding in
Probably 1.3.0 - it has some improvements in the included Kafka receiver
for streaming.
https://spark.apache.org/releases/spark-release-1-3-0.html
Regards,
Jeff
2015-03-18 10:38 GMT+01:00 James King jakwebin...@gmail.com:
Hi All,
Which build of Spark is best when using Kafka?
Regards
jk
Hi Mas,
I've never actually worked with GraphX, but here's one idea:
As far as I know, you can directly access the vertex and edge RDDs of your
Graph object. Why not simply run a .filter() on the edge RDD to get all
edges that originate from or end at your vertex?
Regards,
Jeff
2015-03-18 10:52 GMT+01:00
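(Untested, but roughly what I have in mind; graph is assumed to be an existing Graph[VD, ED] and the vertex id is a placeholder:)

    // all edges that originate from or end at vertex 42
    val vid = 42L
    val incident = graph.edges.filter(e => e.srcId == vid || e.dstId == vid)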
recommendations: Hadoop, MapR, CDH?
On Wed, Mar 18, 2015 at 10:48 AM, James King jakwebin...@gmail.com
wrote:
Many thanks Jeff will give it a go.
Hi all,
I'm trying to use the new LDA in MLlib, but when trying to train the model, I'm getting the following error:
java.lang.IllegalAccessError: tried to access class
org.apache.spark.util.collection.Sorter from class
org.apache.spark.graphx.impl.EdgePartitionBuilder
at
don't have multiple
Spark versions deployed. If the classpath looks correct, please create
a JIRA for this issue. Thanks! -Xiangrui
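(For context, a minimal sketch of the kind of call that triggers the training, with a hypothetical corpus: RDD[(Long, Vector)] of (document id, term-count vector) assumed to exist:)

    import org.apache.spark.mllib.clustering.LDA

    // training hits the GraphX-based implementation internally
    val model = new LDA().setK(10).run(corpus)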
Hi Colin,
my understanding is that this is currently not possible with KafkaUtils.
You would have to write a custom receiver using Kafka's SimpleConsumer API.
https://spark.apache.org/docs/1.2.0/streaming-custom-receivers.html
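(A skeleton of such a receiver, per the guide above; the SimpleConsumer fetch logic itself is omitted:)

    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.receiver.Receiver

    class PartitionReceiver(broker: String, topic: String, partition: Int)
      extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

      def onStart(): Unit = {
        // receive on a separate thread so onStart() returns immediately
        new Thread("kafka-partition-receiver") {
          override def run(): Unit = receive()
        }.start()
      }

      def onStop(): Unit = {} // the receiving thread checks isStopped()

      private def receive(): Unit = {
        while (!isStopped()) {
          // fetch from (topic, partition) via kafka.consumer.SimpleConsumer,
          // then hand each message to Spark with store(message)
        }
      }
    }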
Hi,
well, it really depends on what you want to do ;)
TF-IDF is a measure that originates in the information retrieval context and can be used to judge the relevance of a document in the context of a given search term.
It's also often used for text-related machine learning tasks. E.g. have a
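(A small MLlib sketch of computing TF-IDF vectors, assuming pre-tokenized documents docs: RDD[Seq[String]] already exist:)

    import org.apache.spark.mllib.feature.{HashingTF, IDF}

    val tf = new HashingTF().transform(docs)
    tf.cache() // IDF makes two passes over the data: fit, then transform
    val tfidf = new IDF().fit(tf).transform(tf)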
Hey Anish,
Machine learning models that are updated with incoming data are commonly
known as online learning systems. Matrix factorization is one way to
implement recommender systems, but not the only one. There are papers about
how to do online matrix factorization, but you will likely have to
Hi,
We are using RDD#mapPartitions() to achieve the same.
Are there advantages/disadvantages of using one method over the other?
Regards,
Jeff
2015-02-26 20:02 GMT+01:00 Mark Hamstra m...@clearstorydata.com:
rdd.map(foo).filter(bar) and rdd.filter(bar).map(foo) will each already be
pipelined
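(A small sketch of the two styles; foo, bar, and the RDD are placeholders, and sc is an existing SparkContext:)

    val rdd = sc.parallelize(1 to 100)
    def foo(x: Int) = x * 2
    def bar(x: Int) = x % 3 == 0

    // per-element operators (pipelined by Spark) vs. doing it per partition
    val viaMapFilter  = rdd.map(foo).filter(bar)
    val viaPartitions = rdd.mapPartitions(_.map(foo).filter(bar))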
If you read the streaming programming guide, you'll notice that Spark does
not do real streaming but emulates it with a so-called mini-batching
approach. Let's say you want to work with a continuous stream of incoming
events from a computing centre:
Batch interval:
That's the basic heartbeat of
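(The batch interval is fixed when the StreamingContext is created; a minimal sketch:)

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // every 2 seconds, the events received so far become one mini-batch (an RDD)
    val ssc = new StreamingContext(new SparkConf().setAppName("events"), Seconds(2))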
I don't have an idea, but perhaps a little more context would be helpful.
What is the source of your streaming data? What's the storage level you're
using?
What are you doing? Some kind of window operations?
Regards,
Jeff
2015-02-26 18:59 GMT+01:00 Mukesh Jha me.mukesh@gmail.com:
On
Hi Rob,
I fear your questions will be hard to answer without additional information
about what kind of simulations you plan to do. int[r][c] basically means you have a matrix of integers? You could, for example, map this to a row-oriented RDD of integer arrays or to a column-oriented RDD of integer
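(E.g. the row-oriented variant, with a toy matrix and an existing SparkContext sc:)

    // int[r][c] as a row-oriented RDD of (rowIndex, rowValues)
    val matrix = Array(Array(1, 2, 3), Array(4, 5, 6))
    val rows = sc.parallelize(matrix.zipWithIndex.map { case (row, i) => (i.toLong, row) })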
So to summarize (I had a similar question):
Spark's log4j is configured by default to log to the console? Those messages end up in the stderr files, and the approach does not support rolling?
If I configure log4j to log to files, how can I keep the folder structure? Should I use relative paths
So basically you have lots of small ML tasks you want to run concurrently?
With "I've used repartition and cache to store the sub-datasets on only one machine", do you mean that you reduced each RDD to have only one partition?
Maybe you want to give the fair scheduler a try to get more of your tasks
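(Enabling it is a one-liner, sketched here; further pool configuration via an XML file is optional:)

    import org.apache.spark.SparkConf

    // switch the scheduler from FIFO to FAIR so concurrent jobs share resources
    val conf = new SparkConf().set("spark.scheduler.mode", "FAIR")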
Hi Spico,
Yes, I think an executor core in Spark is basically a thread in a worker
pool. It's recommended to have one executor core per physical core on your
machine for best performance, but I think in theory you can create as many
threads as your OS allows.
For deployment:
There seems to be
As I understand the matter:
Option 1) has benefits when you think that your network bandwidth may be a bottleneck, because Spark opens several network connections on possibly several different physical machines.
Option 2) - as you already pointed out - has the benefit that you occupy less