NoNodeAvailableException (None of the configured nodes are available) error when trying to push data to Elastic from a Spark job

2017-02-07 Thread dgoldenberg
Hi, any reason why we might be getting this error? The code works fine in non-distributed mode, but the same code run from a Spark job cannot reach Elastic. Spark version: 2.0.1 built for Hadoop 2.4, Scala 2.11. Elastic version: 2.3.1. I've verified the Elastic hosts and
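For reference: the "None of the configured nodes are available" error from the Elasticsearch TransportClient usually comes down to a cluster.name mismatch, pointing the client at the HTTP port 9200 instead of the transport port 9300, a firewall between the Spark workers and the Elastic nodes, or a client instance serialized from the driver instead of created on the workers. A minimal sketch of the per-partition approach, assuming the ES 2.x TransportClient API and a JavaRDD<String> rdd of JSON documents; cluster, host, index, and type names are placeholders:

```java
import java.net.InetAddress;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.transport.InetSocketTransportAddress;

// Build the client on the worker, inside the partition, rather than
// serializing one from the driver.
rdd.foreachPartition(docs -> {
    Settings settings = Settings.settingsBuilder()
        .put("cluster.name", "my-cluster")   // must match the server's cluster.name
        .build();
    TransportClient client = TransportClient.builder().settings(settings).build()
        // transport port 9300, not the HTTP port 9200
        .addTransportAddress(new InetSocketTransportAddress(
            InetAddress.getByName("es-host"), 9300));
    try {
        while (docs.hasNext()) {
            client.prepareIndex("my-index", "my-type")
                  .setSource(docs.next())
                  .get();
        }
    } finally {
        client.close();
    }
});
```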

How to set the heap size on consumers?

2015-07-13 Thread dgoldenberg
Hi, I'm seeing quite a bit of information on Spark memory management. I'm just trying to set the heap size, e.g. -Xms to 512m and -Xmx to 1g or some such. Per http://apache-spark-user-list.1001560.n3.nabble.com/Use-of-SPARK-DAEMON-JAVA-OPTS-tt10479.html#a10529: SPARK_DAEMON_JAVA_OPTS is not
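For reference: SPARK_DAEMON_JAVA_OPTS applies to the standalone master/worker daemons, not to executors. Driver and executor heap are set per application; a sketch, where the class and jar names are placeholders:

```
# Spark launches each executor JVM with -Xms and -Xmx both set to the
# executor-memory value.
spark-submit \
  --class com.example.MyJob \
  --driver-memory 512m \
  --executor-memory 1g \
  my-job.jar
```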

What is a best practice for passing environment variables to Spark workers?

2015-07-09 Thread dgoldenberg
I have about 20 environment variables to pass to my Spark workers. Even though they're in the init scripts on the Linux box, the workers don't see these variables. Does Spark do something to shield itself from what may be defined in the environment? I see multiple pieces of info on how to pass
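One answer that held up for this question: executors inherit the worker daemon's environment, not the submitting shell's, so variables have to be passed explicitly, either via SparkConf or the spark.executorEnv.* config keys. A sketch, where the variable name and value are placeholders:

```java
import org.apache.spark.SparkConf;

// Hand an environment variable to every executor.
SparkConf conf = new SparkConf()
    .setAppName("my-app")
    .setExecutorEnv("DICT_PATH", "/opt/data/dict");

// equivalently, per key on the command line:
//   spark-submit --conf spark.executorEnv.DICT_PATH=/opt/data/dict ...
```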

Best practice for using singletons on workers (seems unanswered) ?

2015-07-07 Thread dgoldenberg
Hi, I am seeing a lot of posts on singletons vs. broadcast variables, such as: http://apache-spark-user-list.1001560.n3.nabble.com/Best-way-to-have-some-singleton-per-worker-tt20277.html
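The pattern most of those threads converge on is a lazily initialized holder, created once per executor JVM rather than shipped from the driver; a sketch, with HeavyResource standing in for whatever must exist once per worker:

```java
// Lazily initialized once per executor JVM; nothing here is serialized
// from the driver.
public final class HeavyResourceHolder {
    private static HeavyResource instance;

    public static synchronized HeavyResource get() {
        if (instance == null) {
            instance = new HeavyResource();  // first call on each executor pays the cost
        }
        return instance;
    }
}

// inside a transformation:
//   rdd.map(x -> HeavyResourceHolder.get().process(x));
```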

Any way to retrieve time of message arrival to Kafka topic, in Spark Streaming?

2015-06-22 Thread dgoldenberg
Is there any way to retrieve the time of each message's arrival into a Kafka topic, when streaming in Spark, whether with receiver-based or direct streaming? Thanks.
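For the record: the Kafka 0.8 protocol this thread was written against stores no per-message arrival time, so neither streaming mode could surface one. Kafka 0.10 later added record timestamps, exposed through the spark-streaming-kafka-0-10 integration; a sketch assuming a JavaStreamingContext jssc is in scope, with placeholder broker, group, and topic names:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;

Map<String, Object> kafkaParams = new HashMap<>();
kafkaParams.put("bootstrap.servers", "broker:9092");
kafkaParams.put("key.deserializer", StringDeserializer.class);
kafkaParams.put("value.deserializer", StringDeserializer.class);
kafkaParams.put("group.id", "my-group");

JavaInputDStream<ConsumerRecord<String, String>> stream =
    KafkaUtils.createDirectStream(jssc,
        LocationStrategies.PreferConsistent(),
        ConsumerStrategies.<String, String>Subscribe(Arrays.asList("my-topic"), kafkaParams));

// record.timestamp() is the broker- or producer-assigned time in epoch millis
stream.map(record -> record.timestamp() + "\t" + record.value()).print();
```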

Re: Registering custom metrics

2015-06-22 Thread dgoldenberg
Hi Gerard, Have there been any responses? Any insights as to what you ended up doing to enable custom metrics? I'm thinking of implementing a custom metrics sink, not sure how doable that is yet... Thanks.

Re: Custom Metrics Sink

2015-06-22 Thread dgoldenberg
Hi, I was wondering if there've been any responses to this? Thanks.

The Initial job has not accepted any resources error; can't seem to set

2015-06-18 Thread dgoldenberg
Hi, I'm running Spark Standalone on a single node with 16 cores. Master and 4 workers are running. I'm trying to submit two applications via spark-submit and am getting the following error when submitting the second one: Initial job has not accepted any resources; check your cluster UI to ensure

Re: The Initial job has not accepted any resources error; can't seem to set

2015-06-18 Thread dgoldenberg
I just realized that --conf takes one key-value pair per flag. And somehow I needed --conf spark.cores.max=2; when it was --conf spark.deploy.defaultCores=2 instead, one job would take up all 16 cores on the box. What's the actual model here? We've got 10 apps
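The model, as far as the standalone scheduler goes: an application with no core cap grabs every available core; spark.cores.max caps an individual application and belongs on each spark-submit; spark.deploy.defaultCores is a master-side default that only applies to applications that don't set spark.cores.max, so passing it with --conf on a job has no effect. A sketch:

```
# per application, on each submit: cap at 2 of the 16 cores
spark-submit --conf spark.cores.max=2 ...

# master-side default for apps that don't set spark.cores.max
# (in the master's environment, not on spark-submit):
export SPARK_MASTER_OPTS="-Dspark.deploy.defaultCores=2"
```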

What happens when a streaming consumer job is killed then restarted?

2015-06-16 Thread dgoldenberg
I'd like to understand better what happens when a streaming consumer job (with direct streaming, but also with receiver-based streaming) is killed/terminated/crashes. Assuming it was processing a batch of RDD data, what happens when the job is restarted? How much state is maintained within
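For reference: recovery across a kill/restart hinges on checkpointing. With a checkpoint directory configured, a restarted driver rebuilds the context from it (with the direct Kafka API, that includes the offsets), and in-flight batches are recomputed. A minimal sketch, assuming sparkConf and batchDuration are defined and the path is a placeholder:

```java
import org.apache.spark.streaming.api.java.JavaStreamingContext;

final String checkpointDir = "hdfs:///checkpoints/my-app";

// On a clean start the factory runs and builds the DStream graph; after a
// kill/restart, the context is rebuilt from the checkpoint and the factory
// is skipped.
JavaStreamingContext jssc = JavaStreamingContext.getOrCreate(checkpointDir, () -> {
    JavaStreamingContext ctx = new JavaStreamingContext(sparkConf, batchDuration);
    // ... define the streaming computation here ...
    ctx.checkpoint(checkpointDir);
    return ctx;
});
jssc.start();
jssc.awaitTermination();
```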

Custom Spark metrics?

2015-06-16 Thread dgoldenberg
I'm looking at the doc here: https://spark.apache.org/docs/latest/monitoring.html. Is there a way to define custom metrics in Spark, via Coda Hale perhaps, and emit those? Can a custom metrics sink be defined? And, can such a sink collect some metrics, execute some metrics handling logic, then
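At the time, Spark's own Source/Sink interfaces were effectively internal (private[spark]), which is largely why these metrics threads went unanswered. One workaround was a job-local Coda Hale registry with its own reporter, independent of Spark's MetricsSystem; a sketch with placeholder metric names:

```java
import java.util.concurrent.TimeUnit;
import com.codahale.metrics.ConsoleReporter;
import com.codahale.metrics.Meter;
import com.codahale.metrics.MetricRegistry;

// A registry and reporter owned by the job itself (one per JVM: driver, or
// each executor), bypassing Spark's metrics plumbing entirely.
MetricRegistry registry = new MetricRegistry();
Meter processed = registry.meter("docs.processed");

ConsoleReporter reporter = ConsoleReporter.forRegistry(registry)
    .convertRatesTo(TimeUnit.SECONDS)
    .build();
reporter.start(30, TimeUnit.SECONDS);  // report every 30 seconds

// in the processing code:
processed.mark();
```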

What is Spark's data retention policy?

2015-06-16 Thread dgoldenberg
What is Spark's data retention policy? As in, the jobs that are sent from the master to the worker nodes, how long do they persist on those nodes? What about the RDD data, how is that cleaned up? Are all RDDs cleaned up at GC time unless they've been .persist()'ed or .cache()'ed?
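Roughly: cached blocks stay on executors until unpersisted or evicted under memory pressure, and Spark's ContextCleaner asynchronously removes shuffle files and cached blocks once their RDDs are garbage-collected on the driver. Explicit control, assuming a JavaSparkContext sc and a placeholder path:

```java
import org.apache.spark.api.java.JavaRDD;

JavaRDD<String> lines = sc.textFile("hdfs:///data/in").cache();
long n = lines.count();   // first action materializes the cached blocks
// ... reuse lines ...
lines.unpersist();        // frees the executors' cached blocks immediately
```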

Split RDD based on criteria

2015-06-10 Thread dgoldenberg
Hi, I'm gathering that the typical approach for splitting an RDD is to apply several filters to it: rdd1 = rdd.filter(func1); rdd2 = rdd.filter(func2); ... Is there/should there be a way to create 'buckets' like these in one go? List<RDD> rddList = rdd.filter(func1, func2, ..., funcN) Another
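There is no built-in multi-way split; one common workaround is to tag each element with a bucket id in a single pass, cache the tagged RDD, and carve buckets out of it with cheap filters. A sketch with a placeholder bucketing rule:

```java
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import scala.Tuple2;

// One pass assigns the bucket; caching means the upstream computation
// isn't redone for every bucket's filter.
JavaPairRDD<Integer, String> tagged = rdd.mapToPair(s ->
    new Tuple2<>(s.length() > 10 ? 1 : 0, s)).cache();

JavaRDD<String> shortOnes = tagged.filter(t -> t._1() == 0).values();
JavaRDD<String> longOnes  = tagged.filter(t -> t._1() == 1).values();
```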

How to share large resources like dictionaries while processing data with Spark ?

2015-06-04 Thread dgoldenberg
We have some pipelines defined where sometimes we need to load potentially large resources such as dictionaries. What would be the best strategy for sharing such resources among the transformations/actions within a consumer? Can they be shared somehow across the RDDs? I'm looking for a way to
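The standard mechanism for this is a broadcast variable: the dictionary is shipped once per executor and shared read-only by all tasks there, rather than serialized into every closure. A sketch, where jsc, loadDictionary(), and the lookup logic are placeholders:

```java
import java.util.Map;
import org.apache.spark.broadcast.Broadcast;

// One read-only copy per executor; tasks reference it via dict.value().
Broadcast<Map<String, String>> dict = jsc.broadcast(loadDictionary());

rdd.map(word -> dict.value().getOrDefault(word, word));
```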

Objects serialized before foreachRDD/foreachPartition ?

2015-06-03 Thread dgoldenberg
I'm looking at https://spark.apache.org/docs/latest/tuning.html. Basically the takeaway is that all objects passed into the code processing RDDs must be serializable. So if I've got a few objects that I'd rather initialize once and deinitialize once outside of the logic processing the RDDs, I'd
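The usual pattern for init-once/deinit-once resources that can't be serialized: create them inside foreachPartition (or mapPartitions), so they live and die on the executor and never cross the wire. A sketch, with ExpensiveClient as a placeholder:

```java
rdd.foreachPartition(records -> {
    ExpensiveClient client = new ExpensiveClient();  // initialized per partition, on the executor
    try {
        while (records.hasNext()) {
            client.send(records.next());
        }
    } finally {
        client.close();                              // deinitialized per partition
    }
});
```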

StreamingListener, anyone?

2015-06-03 Thread dgoldenberg
Hi, I've got a Spark Streaming driver job implemented and in it, I register a streaming listener, like so: JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, Durations.milliseconds(params.getBatchDurationMillis())); jssc.addStreamingListener(new JobListener(jssc)); where
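For reference, a sketch of such a listener against the Spark 1.4-era API. From Java, every callback of the Scala StreamingListener trait needs a body (later Spark versions add further callbacks to this set):

```java
import org.apache.spark.streaming.scheduler.*;

public class JobListener implements StreamingListener {
    @Override
    public void onBatchCompleted(StreamingListenerBatchCompleted batch) {
        // processingDelay() is a Scala Option; printed here just for illustration
        System.out.println("Batch completed, processing delay: "
            + batch.batchInfo().processingDelay());
    }
    // the remaining callbacks as no-ops
    @Override public void onBatchSubmitted(StreamingListenerBatchSubmitted b) {}
    @Override public void onBatchStarted(StreamingListenerBatchStarted b) {}
    @Override public void onReceiverStarted(StreamingListenerReceiverStarted r) {}
    @Override public void onReceiverError(StreamingListenerReceiverError r) {}
    @Override public void onReceiverStopped(StreamingListenerReceiverStopped r) {}
}
```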

Behavior of the spark.streaming.kafka.maxRatePerPartition config param?

2015-06-02 Thread dgoldenberg
Hi, Could someone explain the behavior of the spark.streaming.kafka.maxRatePerPartition parameter? The doc says: "An important (configuration) is spark.streaming.kafka.maxRatePerPartition which is the maximum rate at which each Kafka partition will be read by (the) direct API." What is the default
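The semantics, for reference: the parameter is the maximum number of messages per second read from each Kafka partition by the direct API; it is unset by default, meaning no throttling, and the effective per-batch cap is maxRatePerPartition × number of partitions × batch seconds:

```
# e.g. 4 partitions and a 10-second batch => at most 100 * 4 * 10 = 4,000
# messages per batch
spark-submit --conf spark.streaming.kafka.maxRatePerPartition=100 ...
```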

How to monitor Spark Streaming from Kafka?

2015-06-01 Thread dgoldenberg
Hi, What are some of the good/adopted approaches to monitoring Spark Streaming from Kafka? I see that there are things like http://quantifind.github.io/KafkaOffsetMonitor, for example. Do they all assume that receiver-based streaming is used? Then: "Note that one disadvantage of this approach
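Tools like KafkaOffsetMonitor read consumer offsets from ZooKeeper, which the direct (receiver-less) stream doesn't maintain. Each batch does expose the offset ranges it consumed, though, which can be logged or written wherever a monitoring tool expects them. A sketch, assuming a direct stream named directStream:

```java
import org.apache.spark.streaming.kafka.HasOffsetRanges;
import org.apache.spark.streaming.kafka.OffsetRange;

directStream.foreachRDD(rdd -> {
    // the canonical cast from the Spark docs for the 0.8 direct integration
    OffsetRange[] ranges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();
    for (OffsetRange r : ranges) {
        System.out.println(r.topic() + "[" + r.partition() + "] "
            + r.fromOffset() + " -> " + r.untilOffset());
    }
});
```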

Spark and logging

2015-05-27 Thread dgoldenberg
I'm wondering how logging works in Spark. I see that there's the log4j.properties.template file in the conf directory. Safe to assume Spark is using log4j 1? What's the approach if we're using log4j 2? I've got a log4j2.xml file in the job jar which seems to be working for my log statements
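Yes: Spark 1.x bundles log4j 1.2, and copying conf/log4j.properties.template to conf/log4j.properties is the supported hook. A log4j2.xml in the job jar can drive the application's own statements if the log4j 2 jars are on the classpath, but Spark's internal logging still goes through log4j 1 unless the slf4j binding is swapped. A sketch of the properties route, with a placeholder application package:

```
# conf/log4j.properties (copied from log4j.properties.template)
log4j.rootCategory=WARN, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

# keep the application's own packages verbose
log4j.logger.com.example.myjob=DEBUG
```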

Spark Streaming from Kafka - no receivers and spark.streaming.receiver.maxRate?

2015-05-27 Thread dgoldenberg
Hi, With the no-receivers approach to streaming from Kafka, is there a way to set something like spark.streaming.receiver.maxRate so as not to overwhelm the Spark consumers? What would be some of the ways to throttle the streamed messages so that the consumers don't run out of memory?
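For the direct stream, the receiver-oriented spark.streaming.receiver.maxRate has no effect; the counterpart is spark.streaming.kafka.maxRatePerPartition, discussed in the thread above, and Spark 1.5 later added automatic backpressure:

```
spark-submit \
  --conf spark.streaming.kafka.maxRatePerPartition=1000 \
  --conf spark.streaming.backpressure.enabled=true \
  ...
```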

Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-05-27 Thread dgoldenberg
Hi, I'm trying to understand if there are design patterns for autoscaling Spark (add/remove slave machines to the cluster) based on the throughput. Assuming we can throttle Spark consumers, the respective Kafka topics we stream data from would start growing. What are some of the ways to

Pipelining with Spark

2015-05-21 Thread dgoldenberg
From the performance and scalability standpoint, is it better to plug, say, a multi-threaded pipeliner into a Spark job, or to implement pipelining via Spark's own transformation mechanisms such as map or filter? I'm seeing some reference architectures where things like 'morphlines' are

Spark Streaming and reducing latency

2015-05-17 Thread dgoldenberg
I keep hearing the argument that the way Discretized Streams work with Spark Streaming is a lot more of a batch processing algorithm than true streaming. For streaming, one would expect a new item, e.g. in a Kafka topic, to be available to the streaming consumer immediately. With the discretized

How to speed up data ingestion with Spark

2015-05-12 Thread dgoldenberg
Hi, I'm looking at a data ingestion implementation which streams data out of Kafka with Spark Streaming, then uses a multi-threaded pipeline engine to process the data in each partition. Have folks looked at ways of speeding up this type of ingestion? Let's say the main part of the ingest

Spark and RabbitMQ

2015-05-11 Thread dgoldenberg
Are there existing or under-development versions/modules for streaming messages out of RabbitMQ with Spark Streaming, or perhaps a RabbitMQ RDD?
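Nothing shipped with Spark itself at the time; the documented extension point is a custom receiver. A sketch wiring Spark's Receiver API to the RabbitMQ Java client — an illustration of the approach, not an existing module; host and queue names are placeholders:

```java
import com.rabbitmq.client.*;
import org.apache.spark.storage.StorageLevel;
import org.apache.spark.streaming.receiver.Receiver;

public class RabbitMQReceiver extends Receiver<String> {
    public RabbitMQReceiver() { super(StorageLevel.MEMORY_AND_DISK_2()); }

    @Override public void onStart() {
        new Thread(() -> {
            try {
                ConnectionFactory factory = new ConnectionFactory();
                factory.setHost("rabbit-host");
                Connection conn = factory.newConnection();
                Channel channel = conn.createChannel();
                QueueingConsumer consumer = new QueueingConsumer(channel);
                channel.basicConsume("my-queue", true, consumer);
                while (!isStopped()) {
                    QueueingConsumer.Delivery d = consumer.nextDelivery();
                    store(new String(d.getBody(), "UTF-8"));  // hand message to Spark
                }
                conn.close();
            } catch (Exception e) {
                restart("RabbitMQ connection failed", e);
            }
        }).start();
    }

    @Override public void onStop() { /* the consuming thread checks isStopped() */ }
}

// usage: JavaReceiverInputDStream<String> msgs = jssc.receiverStream(new RabbitMQReceiver());
```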

How to stream all data out of a Kafka topic once, then terminate job?

2015-04-28 Thread dgoldenberg
Hi, I'm wondering about the use-case where you're not doing continuous, incremental streaming of data out of Kafka but rather want to publish data once with your Producer(s) and consume it once, in your Consumer, then terminate the consumer Spark job. JavaStreamingContext jssc = new
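For a one-shot drain, the streaming context can be skipped entirely: the Kafka 0.8 integration's KafkaUtils.createRDD reads a fixed offset range as an ordinary batch RDD, and the job ends when the action finishes. A sketch with placeholder brokers, topic, and offsets (in practice the until-offsets come from the brokers):

```java
import java.util.HashMap;
import java.util.Map;
import kafka.serializer.StringDecoder;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.streaming.kafka.KafkaUtils;
import org.apache.spark.streaming.kafka.OffsetRange;

Map<String, String> kafkaParams = new HashMap<>();
kafkaParams.put("metadata.broker.list", "broker:9092");

OffsetRange[] ranges = {
    OffsetRange.create("my-topic", 0, 0L, 100000L)  // topic, partition, from, until
};

JavaPairRDD<String, String> batch = KafkaUtils.createRDD(
    jsc, String.class, String.class, StringDecoder.class, StringDecoder.class,
    kafkaParams, ranges);

batch.foreach(t -> process(t._2()));  // process() is a placeholder
jsc.stop();                           // done: no streaming context to keep alive
```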

Apache Spark User List: people's responses not showing in the browser view

2015-03-18 Thread dgoldenberg
Sorry if this is a total noob question but is there a reason why I'm only seeing folks' responses to my posts in emails but not in the browser view under apache-spark-user-list.1001560.n3.nabble.com? Is this a matter of setting your preferences such that your responses only go to email and never

Spark and Morphlines, parallelization, multithreading

2015-03-18 Thread dgoldenberg
Still a Spark noob grappling with the concepts... I'm trying to grok the idea of integrating something like the Morphlines pipelining library with Spark (or Spark Streaming). The Kite/Morphlines doc states that the runtime executes all commands of a given morphline in the same thread... there are no

NotSerializableException: org.apache.http.impl.client.DefaultHttpClient when trying to send documents to Solr

2015-02-18 Thread dgoldenberg
I'm using SolrJ in a Spark program. When I try to send the docs to Solr, I get the NotSerializableException on the DefaultHttpClient. Is there a possible fix or workaround? I'm using Spark 1.2.1 with Hadoop 2.4; SolrJ is version 4.0.0. final HttpSolrServer solrServer = new
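The standard workaround, since DefaultHttpClient (and hence the HttpSolrServer wrapping it) can't be serialized into a closure: construct the Solr client on the executor, inside foreachPartition, batch the adds, and shut it down before the partition ends. A sketch assuming an RDD of SolrInputDocument and a placeholder Solr URL:

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

docsRdd.foreachPartition(docs -> {
    // created per partition on the executor, so nothing is serialized
    HttpSolrServer solr = new HttpSolrServer("http://solr-host:8983/solr/core1");
    List<SolrInputDocument> buffer = new ArrayList<>();
    while (docs.hasNext()) {
        buffer.add(docs.next());
        if (buffer.size() >= 500) { solr.add(buffer); buffer.clear(); }
    }
    if (!buffer.isEmpty()) { solr.add(buffer); }
    solr.commit();
    solr.shutdown();
});
```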

Class loading issue, spark.files.userClassPathFirst doesn't seem to be working

2015-02-17 Thread dgoldenberg
I'm getting the below error when running spark-submit on my class. This class has a transitive dependency on HttpClient v4.3.1, since I'm calling SolrJ 4.10.3 from within the class. This conflicts with the older version, HttpClient 3.1, which is a dependency of Hadoop 2.4 (I'm running Spark
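For reference, Spark 1.3 replaced the experimental spark.files.userClassPathFirst with separate driver and executor flags; when class-path ordering still misbehaves, the durable fix is shading (relocating org.apache.http inside the job jar, e.g. with maven-shade-plugin):

```
spark-submit \
  --conf spark.driver.userClassPathFirst=true \
  --conf spark.executor.userClassPathFirst=true \
  ...
```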