Hi,
Any reason why we might be getting this error? The code seems to work fine
in the non-distributed mode but the same code when run from a Spark job is
not able to get to Elastic.
Spark version: 2.0.1 built for Hadoop 2.4, Scala 2.11
Elastic version: 2.3.1
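For reference, the write path is configured roughly like this via the elasticsearch-hadoop connector (a simplified sketch, not the exact code; the host names and the index/type are placeholders):

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.elasticsearch.spark.rdd.api.java.JavaEsSpark;

SparkConf conf = new SparkConf()
    .setAppName("EsWriter")                              // placeholder app name
    .set("es.nodes", "es-host-1:9200,es-host-2:9200")    // placeholder hosts
    .set("es.nodes.wan.only", "true");                   // worth trying if workers can't reach the data nodes directly
JavaSparkContext jsc = new JavaSparkContext(conf);

Map<String, String> doc = new HashMap<>();
doc.put("field", "value");
JavaRDD<Map<String, String>> docs = jsc.parallelize(Collections.singletonList(doc));
JavaEsSpark.saveToEs(docs, "my-index/my-type");          // placeholder index/type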
I've verified the Elastic hosts and t
Hi,
I'm seeing quite a bit of information on Spark memory management. I'm just
trying to set the heap size, e.g. Xms as 512m and Xmx as 1g or some such.
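Concretely, what I want to end up with is something like this (just a sketch; the values are examples):

import org.apache.spark.SparkConf;

SparkConf conf = new SparkConf()
    .setAppName("HeapSizeExample")            // placeholder app name
    .set("spark.driver.memory", "512m")       // driver heap; only honored if set before the driver JVM starts
    .set("spark.executor.memory", "1g");      // executor heap

(Normally I'd just pass --driver-memory 512m --executor-memory 1g to spark-submit, but I'm trying to understand how that relates to the daemon settings.)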
Per
http://apache-spark-user-list.1001560.n3.nabble.com/Use-of-SPARK-DAEMON-JAVA-OPTS-tt10479.html#a10529:
"SPARK_DAEMON_JAVA_OPTS is not inten
I have about 20 environment variables to pass to my Spark workers. Even
though they're in the init scripts on the Linux box, the workers don't see
these variables.
Does Spark do something to shield itself from what may be defined in the
environment?
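One thing I'm experimenting with is setting the variables explicitly on the SparkConf rather than relying on the shell environment (a sketch; the variable names and values are made up):

import org.apache.spark.SparkConf;

SparkConf conf = new SparkConf()
    .setAppName("EnvExample")                                        // placeholder app name
    .setExecutorEnv("MY_SERVICE_URL", "http://example.com")          // hypothetical variable
    .set("spark.executorEnv.MY_OTHER_VAR", "some-value");            // equivalent property form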
I see multiple pieces of info on how to pass th
Is there a set of best practices for when to use foreachPartition vs.
foreachRDD?
Is it generally true that using foreachPartition avoids some of the
over-network data shuffling overhead?
When would I definitely want to use one method vs. the other?
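For instance, is the recommended shape something like the following (a rough sketch, assuming a recent Spark where foreachRDD takes a VoidFunction, and a JavaDStream<String> called lines)?

lines.foreachRDD(rdd -> {
    rdd.foreachPartition(records -> {
        // one "expensive" setup per partition (e.g. a DB/HTTP connection) would go here
        int count = 0;
        while (records.hasNext()) {
            records.next();
            count++;
        }
        System.out.println("processed " + count + " records in this partition");
        // ...and one teardown per partition here
    });
});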
Thanks.
Hi,
I am seeing a lot of posts on singletons vs. broadcast variables, such as
* http://apache-spark-user-list.1001560.n3.nabble.com/Best-way-to-have-some-singleton-per-worker-tt20277.html
* http://apache-spark-user-list.1001560.n3.nabble.com/How-to-share-a-NonSerializable-variable-among-tasks-in-t
Is there any way to retrieve the time of each message's arrival into a Kafka
topic, when streaming in Spark, whether with receiver-based or direct
streaming?
Thanks.
Hi Gerard,
Have there been any responses? Any insights as to what you ended up doing to
enable custom metrics? I'm thinking of implementing a custom metrics sink,
not sure how doable that is yet...
Thanks.
Hi,
I was wondering if there've been any responses to this?
Thanks.
I just realized that --conf needs to be given one key-value pair at a time (one --conf per pair). And
somehow I needed
--conf "spark.cores.max=2" \
However, when it was
--conf "spark.deploy.defaultCores=2" \
then one job would take up all 16 cores on the box.
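My working assumption, which I'd like confirmed, is that spark.cores.max is a per-application cap, while spark.deploy.defaultCores is a master-side default that only applies to applications that don't set spark.cores.max at all, roughly:

import org.apache.spark.SparkConf;

// per application (what --conf "spark.cores.max=2" sets):
SparkConf conf = new SparkConf().set("spark.cores.max", "2");

// cluster-wide default, set on the standalone master (e.g. in conf/spark-env.sh),
// used only for applications that do not set spark.cores.max themselves:
//   SPARK_MASTER_OPTS="-Dspark.deploy.defaultCores=2"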
What's the actual model here?
We've got 10 app
Hi,
I'm running Spark Standalone on a single node with 16 cores. Master and 4
workers are running.
I'm trying to submit two applications via spark-submit and am getting the
following error when submitting the second one: "Initial job has not
accepted any resources; check your cluster UI to ensure
I'm looking at the doc here:
https://spark.apache.org/docs/latest/monitoring.html.
Is there a way to define custom metrics in Spark, via Coda Hale perhaps, and
emit those?
Can a custom metrics sink be defined?
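To be concrete about what I mean by custom metrics, here's the kind of thing I'd like to emit, using plain Dropwizard/Coda Hale and nothing Spark-specific yet (the metric names are made up):

import java.util.concurrent.TimeUnit;
import com.codahale.metrics.ConsoleReporter;
import com.codahale.metrics.Counter;
import com.codahale.metrics.Meter;
import com.codahale.metrics.MetricRegistry;

MetricRegistry registry = new MetricRegistry();
Counter badRecords = registry.counter("ingest.badRecords");   // made-up metric names
Meter throughput = registry.meter("ingest.throughput");

ConsoleReporter reporter = ConsoleReporter.forRegistry(registry)
    .convertRatesTo(TimeUnit.SECONDS)
    .convertDurationsTo(TimeUnit.MILLISECONDS)
    .build();
reporter.start(10, TimeUnit.SECONDS);

throughput.mark();    // called from the processing code
badRecords.inc();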
And, can such a sink collect some metrics, execute some metrics handling
logic, then i
What is Spark's data retention policy?
As in, the jobs that are sent from the master to the worker nodes, how long
do they persist on those nodes? What about the RDD data, how is that
cleaned up? Are all RDD's cleaned up at GC time unless they've been
.persist()'ed or .cache()'ed?
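For example, is explicit cleanup like the following ever necessary, or does Spark eventually reclaim the cached blocks on its own once the RDD is garbage-collected on the driver? (A trivial sketch; someRdd is a placeholder.)

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.storage.StorageLevel;

JavaRDD<String> cached = someRdd.persist(StorageLevel.MEMORY_AND_DISK());
long n = cached.count();     // materializes the cached data
// ... more actions against 'cached' ...
cached.unpersist();          // explicitly frees the blocks on the workers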
I'd like to understand better what happens when a streaming consumer job
(with direct streaming, but also with receiver-based streaming) is
killed/terminated/crashes.
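For context, I'm assuming the usual checkpoint-based recovery pattern, something like the following (the checkpoint directory, app name, and batch interval are placeholders):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.Function0;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

final String checkpointDir = "hdfs:///tmp/streaming-checkpoint";   // placeholder path

Function0<JavaStreamingContext> factory = () -> {
    SparkConf conf = new SparkConf().setAppName("RecoverableConsumer");       // placeholder name
    JavaStreamingContext context = new JavaStreamingContext(conf, Durations.seconds(5));
    context.checkpoint(checkpointDir);
    // DStream setup (Kafka source, transformations, output operations) goes here
    return context;
};

JavaStreamingContext jssc = JavaStreamingContext.getOrCreate(checkpointDir, factory);
jssc.start();
jssc.awaitTermination();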
Assuming it was processing a batch of RDD data, what happens when the job is
restarted? How much state is maintained within Spark'
Hi,
I'm gathering that the typical approach for splitting an RDD is to apply
several filters to it.
rdd1 = rdd.filter(func1);
rdd2 = rdd.filter(func2);
...
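Concretely, in Java the current approach looks something like this (jsc, the input path, and the predicates are placeholders):

JavaRDD<String> records = jsc.textFile("hdfs:///data/input");        // placeholder input
JavaRDD<String> shortOnes = records.filter(s -> s.length() < 10);    // "func1"
JavaRDD<String> longOnes  = records.filter(s -> s.length() >= 10);   // "func2"
// each filter is a separate pass over 'records' unless the RDD is cached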
Is there/should there be a way to create 'buckets' like these in one go?
List rddList = rdd.filter(func1, func2, ..., funcN)
Another angle
We have some pipelines defined where sometimes we need to load potentially
large resources such as dictionaries.
What would be the best strategy for sharing such resources among the
transformations/actions within a consumer? Can they be shared somehow
across the RDD's?
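The broadcast-variable version of what I have in mind would look roughly like this, I think (jsc, input, and loadDictionary are placeholders):

import java.util.Map;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.broadcast.Broadcast;

// load the large resource once on the driver, ship it once per executor
Map<String, String> dictionary = loadDictionary();                   // placeholder loader
Broadcast<Map<String, String>> dictBroadcast = jsc.broadcast(dictionary);

JavaRDD<String> enriched = input.map(record ->
    record + "|" + dictBroadcast.value().getOrDefault(record, ""));  // lookup inside the task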
I'm looking for a way to l
Hi,
I've got a Spark Streaming driver job implemented and in it, I register a
streaming listener, like so:
JavaStreamingContext jssc = new JavaStreamingContext(sparkConf,
Durations.milliseconds(params.getBatchDurationMillis()));
jssc.addStreamingListener(new JobListener(jssc));
where JobListe
I'm looking at https://spark.apache.org/docs/latest/tuning.html. Basically
the takeaway is that all objects passed into the code processing RDD's must
be serializable. So if I've got a few objects that I'd rather initialize
once and deinitialize once outside of the logic processing the RDD's, I'd
Hi,
Could someone explain the behavior of the
spark.streaming.kafka.maxRatePerPartition parameter? The doc says "An
important (configuration) is spark.streaming.kafka.maxRatePerPartition which
is the maximum rate at which each Kafka partition will be read by (the)
direct API."
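For reference, I'm assuming it's set like any other property on the SparkConf (the value here is just an example):

import org.apache.spark.SparkConf;

SparkConf conf = new SparkConf()
    .setAppName("KafkaDirectConsumer")                          // placeholder app name
    .set("spark.streaming.kafka.maxRatePerPartition", "1000");  // max records/sec read from each partition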
What is the default
Hi,
What are some of the good/adopted approaches to monitoring Spark Streaming
from Kafka? I see that there are things like
http://quantifind.github.io/KafkaOffsetMonitor, for example. Do they all
assume that Receiver-based streaming is used?
Then "Note that one disadvantage of this approach (R
Hi,
I'm trying to understand if there are design patterns for autoscaling Spark
(add/remove slave machines to the cluster) based on the throughput.
Assuming we can throttle Spark consumers, the respective Kafka topics we
stream data from would start growing. What are some of the ways to generate
Hi,
With the no receivers approach to streaming from Kafka, is there a way to
set something like spark.streaming.receiver.maxRate so as not to overwhelm
the Spark consumers?
What would be some of the ways to throttle the streamed messages so that the
consumers don't run out of memory?
I'm wondering how logging works in Spark.
I see that there's the log4j.properties.template file in the conf directory.
Safe to assume Spark is using log4j 1? What's the approach if we're using
log4j 2? I've got a log4j2.xml file in the job jar which seems to be
working for my log statements but
From the performance and scalability standpoint, is it better to plug in, say,
a multi-threaded pipeliner into a Spark job, or implement pipelining via
Spark's own transformation mechanisms such as e.g. map or filter?
I'm seeing some reference architectures where things like 'morphlines' are
plugg
I keep hearing the argument that the way Discretized Streams work with Spark
Streaming is a lot more of a batch processing algorithm than true streaming.
For streaming, one would expect a new item, e.g. in a Kafka topic, to be
available to the streaming consumer immediately.
With the discretized
Hi,
I'm looking at a data ingestion implementation which streams data out of
Kafka with Spark Streaming, then uses a multi-threaded pipeline engine to
process the data in each partition. Have folks looked at ways of speeding
up this type of ingestion?
Let's say the main part of the ingest proces
Are there existing or under development versions/modules for streaming
messages out of RabbitMQ with SparkStreaming, or perhaps a RabbitMQ RDD?
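If nothing exists yet, I'm guessing the fallback is a custom receiver along these lines (a rough sketch using the plain RabbitMQ Java client; the host and queue names are made up and error handling is mostly omitted):

import org.apache.spark.storage.StorageLevel;
import org.apache.spark.streaming.receiver.Receiver;
import com.rabbitmq.client.*;

public class RabbitMQReceiver extends Receiver<String> {
    private final String host;
    private final String queue;
    private transient Connection connection;

    public RabbitMQReceiver(String host, String queue) {
        super(StorageLevel.MEMORY_AND_DISK_SER());
        this.host = host;
        this.queue = queue;
    }

    @Override
    public void onStart() {
        new Thread(this::receive).start();   // consume on a separate thread; onStart must not block
    }

    @Override
    public void onStop() {
        try {
            if (connection != null) connection.close();
        } catch (Exception ignored) { }
    }

    private void receive() {
        try {
            ConnectionFactory factory = new ConnectionFactory();
            factory.setHost(host);
            connection = factory.newConnection();
            Channel channel = connection.createChannel();
            channel.basicConsume(queue, true, new DefaultConsumer(channel) {
                @Override
                public void handleDelivery(String tag, Envelope env,
                                           AMQP.BasicProperties props, byte[] body) {
                    store(new String(body));   // hand each message to Spark
                }
            });
        } catch (Exception e) {
            restart("Error connecting to RabbitMQ", e);
        }
    }
}

The idea would then be to wire it in with jssc.receiverStream(new RabbitMQReceiver("rabbit-host", "my-queue")).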
Hi,
Is there anything special one must do, running locally and submitting a job
like so:
spark-submit \
--class "com.myco.Driver" \
--master local[*] \
./lib/myco.jar
In my logs, I'm only seeing log messages with the thread identifier of
"Executor task launch worker-0".
Hi,
I'm wondering about the use-case where you're not doing continuous,
incremental streaming of data out of Kafka but rather want to publish data
once with your Producer(s) and consume it once, in your Consumer, then
terminate the consumer Spark job.
JavaStreamingContext jssc = new JavaStreaming
Sorry if this is a total noob question but is there a reason why I'm only
seeing folks' responses to my posts in emails but not in the browser view
under apache-spark-user-list.1001560.n3.nabble.com? Is this a matter of
setting your preferences such that your responses only go to email and never
t
Still a Spark noob grappling with the concepts...
I'm trying to grok the idea of integrating something like the Morphlines
pipelining library with Spark (or SparkStreaming). The Kite/Morphlines doc
states that "runtime executes all commands of a given morphline in the same
thread... there are no
I'm using Solrj in a Spark program. When I try to send the docs to Solr, I
get the NotSerializableException on the DefaultHttpClient. Is there a
possible fix or workaround?
I'm using Spark 1.2.1 with Hadoop 2.4, SolrJ is version 4.0.0.
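One workaround I'm considering is to create the Solr client inside foreachPartition on the workers, so it never has to be serialized from the driver (a sketch; docsRdd, SOLR_URL, and the field mapping are placeholders):

import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

docsRdd.foreachPartition(docs -> {
    // client is created on the worker, once per partition, so it is never serialized
    HttpSolrServer solrServer = new HttpSolrServer(SOLR_URL);
    List<SolrInputDocument> batch = new ArrayList<>();
    while (docs.hasNext()) {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", docs.next());       // placeholder field mapping
        batch.add(doc);
    }
    if (!batch.isEmpty()) {
        solrServer.add(batch);
        solrServer.commit();
    }
    solrServer.shutdown();
});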
final HttpSolrServer solrServer = new HttpSolrServer(SOLR_SE
I'm reading data from a database using JdbcRDD, in Java, and I have an
implementation of Function0 whose instance I supply as the
'getConnection' parameter into the JdbcRDD constructor. Compiles fine.
The definition of the class/function is as follows:
public class GetDbConnection extends Abstr
I'm getting the below error when running spark-submit on my class. This class
has a transitive dependency on HttpClient v.4.3.1 since I'm calling SolrJ
4.10.3 from within the class.
This is in conflict with the older version, HttpClient 3.1 that's a
dependency of Hadoop 2.4 (I'm running Spark 1.2.