Hi,
Any reason why we might be getting this error? The code works fine in
non-distributed mode, but the same code, when run as a Spark job, cannot
reach Elasticsearch.
Spark version: 2.0.1 built for Hadoop 2.4, Scala 2.11
Elastic version: 2.3.1
I've verified the Elastic hosts and
Hi,
I'm seeing quite a bit of information on Spark memory management. I'm just
trying to set the heap size, e.g. -Xms512m and -Xmx1g, or some such.
Per
http://apache-spark-user-list.1001560.n3.nabble.com/Use-of-SPARK-DAEMON-JAVA-OPTS-tt10479.html#a10529:
SPARK_DAEMON_JAVA_OPTS is not
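For the heap question above, a sketch of the usual knobs, under the assumption that SPARK_DAEMON_JAVA_OPTS only affects the standalone master/worker daemons, while application heap is set per submit (the class and jar names are placeholders):

```shell
# Daemon (master/worker) heap, in conf/spark-env.sh:
export SPARK_DAEMON_MEMORY=512m

# Application heap, per spark-submit; Spark derives the JVM heap flags
# for the driver and executor processes from these:
spark-submit \
  --driver-memory 512m \
  --executor-memory 1g \
  --class com.example.MyApp myapp.jar
```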
I have about 20 environment variables to pass to my Spark workers. Even
though they're in the init scripts on the Linux box, the workers don't see
these variables.
Does Spark do something to shield itself from what may be defined in the
environment?
I see multiple pieces of info on how to pass
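On the environment-variable question: as far as I can tell, the standalone daemons only source conf/spark-env.sh, and executor JVMs are launched by the worker process, so login-shell init scripts on the box are never consulted. A sketch, with MY_SETTING as a placeholder name:

```shell
# conf/spark-env.sh on each worker host (sourced when the daemon starts):
export MY_SETTING=foo

# Or pass a variable to the executors of a single application:
spark-submit --conf spark.executorEnv.MY_SETTING=foo ...
```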
Hi,
I am seeing a lot of posts on singletons vs. broadcast variables, such as
* http://apache-spark-user-list.1001560.n3.nabble.com/Best-way-to-have-some-singleton-per-worker-tt20277.html
*
Is there any way to retrieve the time of each message's arrival into a Kafka
topic, when streaming in Spark, whether with receiver-based or direct
streaming?
Thanks.
Hi Gerard,
Have there been any responses? Any insights as to what you ended up doing to
enable custom metrics? I'm thinking of implementing a custom metrics sink,
not sure how doable that is yet...
Thanks.
Hi,
I was wondering if there've been any responses to this?
Thanks.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Custom-Metrics-Sink-tp10068p23425.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Hi,
I'm running Spark Standalone on a single node with 16 cores. Master and 4
workers are running.
I'm trying to submit two applications via spark-submit and am getting the
following error when submitting the second one: Initial job has not
accepted any resources; check your cluster UI to ensure
I just realized that --conf takes one key=value pair per occurrence. And
somehow I needed
--conf spark.cores.max=2 \
However, when it was
--conf spark.deploy.defaultCores=2 \
then one job would take up all 16 cores on the box.
What's the actual model here?
We've got 10 apps
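A back-of-the-envelope check of the model, under the assumption (from the standalone docs) that scheduling is FIFO, an app holds its cores for its lifetime, and it grabs every free core unless spark.cores.max caps it. If I read the docs right, spark.deploy.defaultCores is a master-side property, which would explain why setting it via --conf on the app had no effect:

```java
// Standalone FIFO scheduling: each app keeps its granted cores until it
// exits, so apps running concurrently = total cores / cores per app.
public class CoresMath {
    static int concurrentApps(int totalCores, int coresPerApp) {
        return totalCores / coresPerApp;
    }
}
```

With 16 cores and spark.cores.max=2, eight of the ten apps could run at once; with no cap, the first app takes all 16 and the rest wait.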
I'd like to understand better what happens when a streaming consumer job
(with direct streaming, but also with receiver-based streaming) is
killed/terminated/crashes.
Assuming it was processing a batch of RDD data, what happens when the job is
restarted? How much state is maintained within
I'm looking at the doc here:
https://spark.apache.org/docs/latest/monitoring.html.
Is there a way to define custom metrics in Spark, via Coda Hale perhaps, and
emit those?
Can a custom metrics sink be defined?
And, can such a sink collect some metrics, execute some metrics handling
logic, then
What is Spark's data retention policy?
As in, the jobs that are sent from the master to the worker nodes, how long
do they persist on those nodes? What about the RDD data, how is that
cleaned up? Are all RDD's cleaned up at GC time unless they've been
.persist()'ed or .cache()'ed?
Hi,
I'm gathering that the typical approach for splitting an RDD is to apply
several filters to it.
rdd1 = rdd.filter(func1);
rdd2 = rdd.filter(func2);
...
Is there/should there be a way to create 'buckets' like these in one go?
List<RDD> rddList = rdd.filter(func1, func2, ..., funcN);
Another
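A plain-Java sketch of the wished-for one-pass bucketing (Buckets and split are made-up names; on an RDD today this would still be one filter call per bucket, each a cheap, lazy transformation):

```java
import java.util.*;
import java.util.function.Predicate;

// Split one input into N buckets in a single pass, one bucket per
// predicate; an item matching several predicates lands in each of them.
public class Buckets {
    @SafeVarargs
    static <T> List<List<T>> split(List<T> input, Predicate<T>... preds) {
        List<List<T>> buckets = new ArrayList<>();
        for (int i = 0; i < preds.length; i++) buckets.add(new ArrayList<>());
        for (T item : input)
            for (int i = 0; i < preds.length; i++)
                if (preds[i].test(item)) buckets.get(i).add(item);
        return buckets;
    }
}
```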
We have some pipelines defined where sometimes we need to load potentially
large resources such as dictionaries.
What would be the best strategy for sharing such resources among the
transformations/actions within a consumer? Can they be shared somehow
across the RDD's?
I'm looking for a way to
I'm looking at https://spark.apache.org/docs/latest/tuning.html. Basically
the takeaway is that all objects passed into the code processing RDD's must
be serializable. So if I've got a few objects that I'd rather initialize
once and deinitialize once outside of the logic processing the RDD's, I'd
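One common workaround for the "initialize once, outside the RDD logic" case, sketched in plain Java: keep the non-serializable resource in a static holder, so each executor JVM builds it once on first use instead of shipping it with the serialized closure (ResourceHolder and ExpensiveResource are placeholder names):

```java
// Per-JVM lazy singleton: nothing here is serialized with the closure;
// each executor JVM creates the resource once, on first access.
public class ResourceHolder {
    private static volatile ExpensiveResource instance;

    static ExpensiveResource get() {
        if (instance == null) {
            synchronized (ResourceHolder.class) {
                if (instance == null) instance = new ExpensiveResource();
            }
        }
        return instance;
    }

    // Stand-in for a dictionary, client, etc. (hypothetical)
    static class ExpensiveResource {
        final long createdAt = System.nanoTime();
    }
}
```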
Hi,
I've got a Spark Streaming driver job implemented and in it, I register a
streaming listener, like so:
JavaStreamingContext jssc = new JavaStreamingContext(sparkConf,
Durations.milliseconds(params.getBatchDurationMillis()));
jssc.addStreamingListener(new JobListener(jssc));
where
Hi,
Could someone explain the behavior of the
spark.streaming.kafka.maxRatePerPartition parameter? The doc says: "An
important [configuration] is spark.streaming.kafka.maxRatePerPartition, which
is the maximum rate at which each Kafka partition will be read by [the]
direct API."
What is the default
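As far as I can tell from the docs, the cap is per partition, per second, and is unset by default (no limit). The arithmetic for the largest batch it allows:

```java
// Largest number of records one batch can contain under the cap:
// maxRatePerPartition (records/sec/partition) * partitions * batch seconds.
public class RateMath {
    static long maxRecordsPerBatch(long ratePerPartition, int partitions, long batchSeconds) {
        return ratePerPartition * partitions * batchSeconds;
    }
}
```

E.g. a rate of 1000 with 4 partitions and a 5-second batch interval caps each batch at 20,000 records.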
Hi,
What are some of the good/adopted approaches to monitoring Spark Streaming
from Kafka? I see that there are things like
http://quantifind.github.io/KafkaOffsetMonitor, for example. Do they all
assume that Receiver-based streaming is used?
Then: "Note that one disadvantage of this approach
I'm wondering how logging works in Spark.
I see that there's the log4j.properties.template file in the conf directory.
Safe to assume Spark is using log4j 1? What's the approach if we're using
log4j 2? I've got a log4j2.xml file in the job jar which seems to be
working for my log statements
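For what it's worth: Spark 1.x ships log4j 1.2 for its own logging (hence the log4j.properties.template), but a log4j2.xml on the job-jar classpath does work for your own statements, assuming log4j-api and log4j-core are bundled with the jar. A minimal sketch:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Minimal log4j2.xml for the job jar's own statements; Spark's internal
     logging still goes through its bundled log4j 1 configuration. -->
<Configuration status="WARN">
  <Appenders>
    <Console name="Console" target="SYSTEM_OUT">
      <PatternLayout pattern="%d{ISO8601} %-5p %c - %m%n"/>
    </Console>
  </Appenders>
  <Loggers>
    <Root level="info">
      <AppenderRef ref="Console"/>
    </Root>
  </Loggers>
</Configuration>
```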
Hi,
With the no receivers approach to streaming from Kafka, is there a way to
set something like spark.streaming.receiver.maxRate so as not to overwhelm
the Spark consumers?
What would be some of the ways to throttle the streamed messages so that the
consumers don't run out of memory?
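For the direct (receiver-less) API, the analogous knob is per-partition rather than per-receiver, and Spark 1.5+ also has backpressure. A spark-defaults.conf sketch (the 1000 is an arbitrary example value):

```
# Records/sec read from each Kafka partition by the direct API:
spark.streaming.kafka.maxRatePerPartition  1000
# Spark 1.5+: adjust the ingestion rate automatically from scheduling delays:
spark.streaming.backpressure.enabled       true
```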
Hi,
I'm trying to understand if there are design patterns for autoscaling Spark
(add/remove slave machines to the cluster) based on the throughput.
Assuming we can throttle Spark consumers, the respective Kafka topics we
stream data from would start growing. What are some of the ways to
From the performance and scalability standpoint, is it better to plug in, say,
a multi-threaded pipeliner into a Spark job, or to implement pipelining via
Spark's own transformation mechanisms, such as map or filter?
I'm seeing some reference architectures where things like 'morphlines' are
I keep hearing the argument that the way Discretized Streams work with Spark
Streaming is a lot more of a batch processing algorithm than true streaming.
For streaming, one would expect a new item, e.g. in a Kafka topic, to be
available to the streaming consumer immediately.
With the discretized
Hi,
I'm looking at a data ingestion implementation which streams data out of
Kafka with Spark Streaming, then uses a multi-threaded pipeline engine to
process the data in each partition. Have folks looked at ways of speeding
up this type of ingestion?
Let's say the main part of the ingest
Are there existing or under-development versions/modules for streaming
messages out of RabbitMQ with Spark Streaming, or perhaps a RabbitMQ RDD?
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-and-RabbitMQ-tp22852.html
Hi,
I'm wondering about the use-case where you're not doing continuous,
incremental streaming of data out of Kafka but rather want to publish data
once with your Producer(s) and consume it once, in your Consumer, then
terminate the consumer Spark job.
JavaStreamingContext jssc = new
Sorry if this is a total noob question but is there a reason why I'm only
seeing folks' responses to my posts in emails but not in the browser view
under apache-spark-user-list.1001560.n3.nabble.com? Is this a matter of
setting your preferences such that your responses only go to email and never
Still a Spark noob grappling with the concepts...
I'm trying to grok the idea of integrating something like the Morphlines
pipelining library with Spark (or Spark Streaming). The Kite/Morphlines doc
states that runtime executes all commands of a given morphline in the same
thread... there are no
I'm using SolrJ in a Spark program. When I try to send the docs to Solr, I
get the NotSerializableException on the DefaultHttpClient. Is there a
possible fix or workaround?
I'm using Spark 1.2.1 with Hadoop 2.4, SolrJ is version 4.0.0.
final HttpSolrServer solrServer = new
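The exception happens because everything captured by a function shipped to the workers is Java-serialized, and DefaultHttpClient is not Serializable. A plain-Java sketch of one usual fix, a transient field recreated lazily on the worker side (Client is a stand-in for the HTTP client; with Spark, another option is building the server inside a per-partition operation rather than capturing it):

```java
import java.io.*;

// Demonstrates the workaround with plain Java serialization: the
// transient field is skipped when the object is serialized, then
// rebuilt lazily after deserialization on the "worker" side.
public class TransientDemo implements Serializable {
    // Stand-in for DefaultHttpClient: not Serializable.
    static class Client {}

    transient Client client;                 // excluded from serialization

    Client client() {
        if (client == null) client = new Client();  // rebuilt on first use
        return client;
    }

    // Serialize and deserialize, as Spark would when shipping a closure.
    static TransientDemo copy(TransientDemo d) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            new ObjectOutputStream(bos).writeObject(d);
            ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bos.toByteArray()));
            return (TransientDemo) in.readObject();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```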
I'm getting the below error when running spark-submit on my class. This class
has a transitive dependency on HttpClient v.4.3.1 since I'm calling SolrJ
4.10.3 from within the class.
This is in conflict with the older version, HttpClient 3.1 that's a
dependency of Hadoop 2.4 (I'm running Spark
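One workaround, assuming a Maven build: shade/relocate the HttpClient packages inside the job jar so they cannot collide with whatever Hadoop puts on the classpath (version numbers here are examples):

```xml
<!-- Relocate org.apache.http inside the job jar so SolrJ binds to the
     bundled 4.x HttpClient regardless of what is on the cluster. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>2.3</version>
  <executions>
    <execution>
      <phase>package</phase>
      <goals><goal>shade</goal></goals>
      <configuration>
        <relocations>
          <relocation>
            <pattern>org.apache.http</pattern>
            <shadedPattern>shaded.org.apache.http</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>
```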