NoNodeAvailableException (None of the configured nodes are available) error when trying to push data to Elastic from a Spark job

2017-02-07 Thread dgoldenberg
Hi, Any reason why we might be getting this error? The code works fine in non-distributed mode, but the same code run from a Spark job is not able to reach Elastic. Spark version: 2.0.1 built for Hadoop 2.4, Scala 2.11. Elastic version: 2.3.1. I've verified the Elastic hosts and t
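
A frequent cause of NoNodeAvailableException in this setup is that the transport client either gets serialized from the driver, points at the HTTP port (9200) instead of the transport port (9300), or uses a cluster.name that doesn't match the cluster. A rough sketch of building the Elasticsearch 2.x transport client per partition on the executors (host, port, index, and cluster name below are placeholders); the elasticsearch-hadoop connector is the other common route for pushing RDDs to Elastic:

    import java.net.InetAddress;
    import java.util.Iterator;

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.function.VoidFunction;
    import org.elasticsearch.client.transport.TransportClient;
    import org.elasticsearch.common.settings.Settings;
    import org.elasticsearch.common.transport.InetSocketTransportAddress;

    public class EsWriter {
      public static void write(JavaRDD<String> jsonDocs) {
        jsonDocs.foreachPartition(new VoidFunction<Iterator<String>>() {
          @Override
          public void call(Iterator<String> it) throws Exception {
            // Build the client on the executor; TransportClient is not serializable.
            Settings settings = Settings.settingsBuilder()
                .put("cluster.name", "my-cluster")   // placeholder; must match the ES cluster's name
                .build();
            TransportClient client = TransportClient.builder().settings(settings).build()
                .addTransportAddress(new InetSocketTransportAddress(
                    InetAddress.getByName("es-host"), 9300));  // transport port, not the HTTP port 9200
            try {
              while (it.hasNext()) {
                client.prepareIndex("my-index", "my-type").setSource(it.next()).get();
              }
            } finally {
              client.close();
            }
          }
        });
      }
    }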

How to set the heap size on consumers?

2015-07-13 Thread dgoldenberg
Hi, I'm seeing quite a bit of information on Spark memory management. I'm just trying to set the heap size, e.g. Xms as 512m and Xmx as 1g or some such. Per http://apache-spark-user-list.1001560.n3.nabble.com/Use-of-SPARK-DAEMON-JAVA-OPTS-tt10479.html#a10529: "SPARK_DAEMON_JAVA_OPTS is not inten
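
For the application itself (as opposed to the master/worker daemons, which SPARK_DAEMON_MEMORY and SPARK_DAEMON_JAVA_OPTS control), the heap knobs are spark.driver.memory and spark.executor.memory, or the equivalent --driver-memory / --executor-memory flags. A minimal sketch; note that driver memory generally has to be given at submit time because the driver JVM is already running by the time the SparkConf is read:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    public class HeapSizeExample {
      public static void main(String[] args) {
        SparkConf conf = new SparkConf()
            .setAppName("heap-size-example")
            // Executor heap (-Xmx for each executor JVM).
            .set("spark.executor.memory", "1g");
        // Driver heap is best passed on the command line, e.g.
        //   spark-submit --driver-memory 512m --executor-memory 1g ...
        // since setting spark.driver.memory here is too late in client mode.
        JavaSparkContext sc = new JavaSparkContext(conf);
        sc.stop();
      }
    }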

What is a best practice for passing environment variables to Spark workers?

2015-07-09 Thread dgoldenberg
I have about 20 environment variables to pass to my Spark workers. Even though they're in the init scripts on the Linux box, the workers don't see these variables. Does Spark do something to shield itself from what may be defined in the environment? I see multiple pieces of info on how to pass th
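
In standalone mode the executors are launched by the worker daemons, not by a login shell, so variables exported in the user's init scripts usually aren't visible to them; they can be pushed explicitly through spark.executorEnv.* (or SparkConf.setExecutorEnv), or defined in conf/spark-env.sh on each worker. A small sketch (the variable names and values are placeholders):

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    public class ExecutorEnvExample {
      public static void main(String[] args) {
        SparkConf conf = new SparkConf()
            .setAppName("executor-env-example")
            // Each setExecutorEnv entry becomes an environment variable in the executor JVMs.
            .setExecutorEnv("MY_SERVICE_URL", "http://internal-host:8080")
            .setExecutorEnv("MY_API_KEY", System.getenv("MY_API_KEY"));
        JavaSparkContext sc = new JavaSparkContext(conf);
        // ... job logic ...
        sc.stop();
      }
    }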

foreachRDD vs. foreachPartition?

2015-07-08 Thread dgoldenberg
Is there a set of best practices for when to use foreachPartition vs. foreachRDD? Is it generally true that using foreachPartition avoids some of the over-network data shuffling overhead? When would I definitely want to use one method vs. the other? Thanks.
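
The short version: foreachRDD is a DStream output operation (it runs once per batch and hands you the whole RDD), while foreachPartition runs on the executors and is mainly a way to amortize per-record setup cost, such as opening a connection, over a whole partition; it does not by itself avoid shuffling. A rough sketch of the typical pairing, using JDBC as a stand-in for any external store (the URL and table are placeholders):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.util.Iterator;

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.function.VoidFunction;

    public class ForeachPartitionExample {
      public static void write(JavaRDD<String> records) {
        records.foreachPartition(new VoidFunction<Iterator<String>>() {
          @Override
          public void call(Iterator<String> partition) throws Exception {
            // One connection per partition instead of one per record.
            Connection conn = DriverManager.getConnection("jdbc:postgresql://db-host/mydb"); // placeholder URL
            try {
              PreparedStatement stmt = conn.prepareStatement("INSERT INTO events(payload) VALUES (?)");
              while (partition.hasNext()) {
                stmt.setString(1, partition.next());
                stmt.executeUpdate();
              }
            } finally {
              conn.close();
            }
          }
        });
      }
    }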

Best practice for using singletons on workers (seems unanswered) ?

2015-07-07 Thread dgoldenberg
Hi, I am seeing a lot of posts on singletons vs. broadcast variables, such as * http://apache-spark-user-list.1001560.n3.nabble.com/Best-way-to-have-some-singleton-per-worker-tt20277.html * http://apache-spark-user-list.1001560.n3.nabble.com/How-to-share-a-NonSerializable-variable-among-tasks-in-t
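
The pattern that usually comes out of those threads is a lazily initialized static holder: a static field is created at most once per executor JVM and is never serialized with the task closure, while broadcast variables cover the complementary case of shipping read-only data to each executor once. A minimal sketch, assuming the shared resource can be loaded from the classpath (the resource name is a placeholder):

    import java.io.InputStream;
    import java.util.Properties;

    /** Initialized at most once per executor JVM; never serialized with the task closure. */
    public final class SharedResource {
      private static volatile Properties dictionary;

      private SharedResource() {}

      public static Properties get() {
        if (dictionary == null) {
          synchronized (SharedResource.class) {
            if (dictionary == null) {
              try (InputStream in =
                       SharedResource.class.getResourceAsStream("/dictionary.properties")) { // placeholder resource
                Properties p = new Properties();
                p.load(in);
                dictionary = p;
              } catch (Exception e) {
                throw new RuntimeException("Failed to load dictionary", e);
              }
            }
          }
        }
        return dictionary;
      }
    }

Task code then just calls SharedResource.get() inside map/foreachPartition; the loaded object lives for the lifetime of the executor process.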

Any way to retrieve time of message arrival to Kafka topic, in Spark Streaming?

2015-06-22 Thread dgoldenberg
Is there any way to retrieve the time of each message's arrival into a Kafka topic, when streaming in Spark, whether with receiver-based or direct streaming? Thanks.

Re: Registering custom metrics

2015-06-22 Thread dgoldenberg
Hi Gerard, Have there been any responses? Any insights as to what you ended up doing to enable custom metrics? I'm thinking of implementing a custom metrics sink, not sure how doable that is yet... Thanks.

Re: Custom Metrics Sink

2015-06-22 Thread dgoldenberg
Hi, I was wondering if there've been any responses to this? Thanks.

Re: The "Initial job has not accepted any resources" error; can't seem to set

2015-06-18 Thread dgoldenberg
I just realized that --conf takes one key=value pair per flag. And somehow I needed --conf "spark.cores.max=2"; however, when it was --conf "spark.deploy.defaultCores=2", one job would take up all 16 cores on the box. What's the actual model here? We've got 10 app
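
The distinction the thread is circling: spark.cores.max is a per-application cap on cores requested across the cluster, while spark.deploy.defaultCores is a master-side default (set via SPARK_MASTER_OPTS, not per app) that only applies to applications that don't set spark.cores.max; with neither in effect, a standalone application grabs every available core and the second application starves with "Initial job has not accepted any resources". A small sketch setting the cap in code; passing --conf spark.cores.max=2 at submit time is equivalent:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    public class CoresCapExample {
      public static void main(String[] args) {
        SparkConf conf = new SparkConf()
            .setAppName("cores-cap-example")
            // Cap this application at 2 cores cluster-wide so other apps can still get resources.
            .set("spark.cores.max", "2")
            .set("spark.executor.memory", "1g");
        JavaSparkContext sc = new JavaSparkContext(conf);
        // ... job logic ...
        sc.stop();
      }
    }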

The "Initial job has not accepted any resources" error; can't seem to set

2015-06-18 Thread dgoldenberg
Hi, I'm running Spark Standalone on a single node with 16 cores. Master and 4 workers are running. I'm trying to submit two applications via spark-submit and am getting the following error when submitting the second one: "Initial job has not accepted any resources; check your cluster UI to ensure

Custom Spark metrics?

2015-06-16 Thread dgoldenberg
I'm looking at the doc here: https://spark.apache.org/docs/latest/monitoring.html. Is there a way to define custom metrics in Spark, via Coda Hale perhaps, and emit those? Can a custom metrics sink be defined? And, can such a sink collect some metrics, execute some metrics handling logic, then i

What is Spark's data retention policy?

2015-06-16 Thread dgoldenberg
What is Spark's data retention policy? As in, the jobs that are sent from the master to the worker nodes, how long do they persist on those nodes? What about the RDD data, how is that cleaned up? Are all RDD's cleaned up at GC time unless they've been .persist()'ed or .cache()'ed?

What happens when a streaming consumer job is killed then restarted?

2015-06-16 Thread dgoldenberg
I'd like to understand better what happens when a streaming consumer job (with direct streaming, but also with receiver-based streaming) is killed/terminated/crashes. Assuming it was processing a batch of RDD data, what happens when the job is restarted? How much state is maintained within Spark'
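
Roughly: with checkpointing enabled, a restarted driver can rebuild the streaming context (including, for the direct Kafka approach, the last recorded offsets) from the checkpoint directory; without it, that state is lost and a direct stream restarts from whatever auto.offset.reset dictates. A minimal sketch of the getOrCreate pattern, assuming the Java Function0-based overload available in Spark 1.4+ (checkpoint path and batch interval are placeholders):

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.function.Function0;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;

    public class RestartableStreamingJob {
      private static final String CHECKPOINT_DIR = "hdfs:///checkpoints/my-job"; // placeholder path

      public static void main(String[] args) throws Exception {
        Function0<JavaStreamingContext> factory = new Function0<JavaStreamingContext>() {
          @Override
          public JavaStreamingContext call() {
            SparkConf conf = new SparkConf().setAppName("restartable-streaming-job");
            JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));
            jssc.checkpoint(CHECKPOINT_DIR);
            // ... define the Kafka stream and the processing graph here ...
            return jssc;
          }
        };
        // Rebuilds the context (and, for direct Kafka streams, the offsets) after a crash/restart;
        // only calls the factory when no checkpoint exists yet.
        JavaStreamingContext jssc = JavaStreamingContext.getOrCreate(CHECKPOINT_DIR, factory);
        jssc.start();
        jssc.awaitTermination();
      }
    }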

Split RDD based on criteria

2015-06-10 Thread dgoldenberg
Hi, I'm gathering that the typical approach for splitting an RDD is to apply several filters to it. rdd1 = rdd.filter(func1); rdd2 = rdd.filter(func2); ... Is there/should there be a way to create 'buckets' like these in one go? List rddList = rdd.filter(func1, func2, ..., funcN) Another angle
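
There is no single-pass "multi-filter" in the RDD API; the usual answers are N independent filter() calls (each a lazy, separate pass, ideally over a cached parent) or keying/grouping the data if one pass really matters. A rough sketch of the filter-list approach with caching:

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.function.Function;

    public class SplitRddExample {
      public static List<JavaRDD<Integer>> split(JavaRDD<Integer> rdd,
                                                  List<Function<Integer, Boolean>> predicates) {
        // Cache the parent so each filter pass reads from memory instead of recomputing the lineage.
        JavaRDD<Integer> cached = rdd.cache();
        List<JavaRDD<Integer>> buckets = new ArrayList<JavaRDD<Integer>>();
        for (Function<Integer, Boolean> predicate : predicates) {
          buckets.add(cached.filter(predicate));
        }
        return buckets;
      }
    }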

How to share large resources like dictionaries while processing data with Spark ?

2015-06-04 Thread dgoldenberg
We have some pipelines defined where sometimes we need to load potentially large resources such as dictionaries. What would be the best strategy for sharing such resources among the transformations/actions within a consumer? Can they be shared somehow across the RDD's? I'm looking for a way to l
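
The standard mechanism for read-only resources like dictionaries is a broadcast variable: the driver loads the data once, Spark ships it to each executor once, and tasks read it via value() instead of capturing it in every closure. A small sketch (the dictionary contents are placeholders; in practice they would be loaded from a file):

    import java.util.HashMap;
    import java.util.Map;

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.api.java.function.Function;
    import org.apache.spark.broadcast.Broadcast;

    public class BroadcastDictionaryExample {
      public static JavaRDD<String> normalize(JavaSparkContext sc, JavaRDD<String> terms) {
        Map<String, String> dict = new HashMap<String, String>();
        dict.put("colour", "color");         // placeholder entries
        dict.put("favourite", "favorite");

        // Shipped to each executor once, not serialized into every task closure.
        final Broadcast<Map<String, String>> dictBroadcast = sc.broadcast(dict);

        return terms.map(new Function<String, String>() {
          @Override
          public String call(String term) {
            String replacement = dictBroadcast.value().get(term);
            return replacement != null ? replacement : term;
          }
        });
      }
    }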

StreamingListener, anyone?

2015-06-03 Thread dgoldenberg
Hi, I've got a Spark Streaming driver job implemented and in it, I register a streaming listener, like so:

    JavaStreamingContext jssc = new JavaStreamingContext(sparkConf,
        Durations.milliseconds(params.getBatchDurationMillis()));
    jssc.addStreamingListener(new JobListener(jssc));

where JobListe

Objects serialized before foreachRDD/foreachPartition ?

2015-06-03 Thread dgoldenberg
I'm looking at https://spark.apache.org/docs/latest/tuning.html. Basically the takeaway is that all objects passed into the code processing RDD's must be serializable. So if I've got a few objects that I'd rather initialize once and deinitialize once outside of the logic processing the RDD's, I'd

Behavior of the spark.streaming.kafka.maxRatePerPartition config param?

2015-06-02 Thread dgoldenberg
Hi, Could someone explain the behavior of the spark.streaming.kafka.maxRatePerPartition parameter? The doc says "An important (configuration) is spark.streaming.kafka.maxRatePerPartition which is the maximum rate at which each Kafka partition will be read by (the) direct API." What is the default
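
The parameter is unset by default (no limit), which is why the first batch after a restart can try to pull everything that has accumulated in the topic. The cap is per partition per second, so with a value of 1000, a topic with 4 partitions, and a 10-second batch interval, each batch is limited to 1000 * 4 * 10 = 40,000 records. A small sketch (the numbers are illustrative):

    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;

    public class MaxRateExample {
      public static void main(String[] args) throws Exception {
        SparkConf conf = new SparkConf()
            .setAppName("max-rate-example")
            // At most 1000 records/sec read from EACH Kafka partition by the direct stream.
            // With 4 partitions and a 10s batch: 1000 * 4 * 10 = 40,000 records per batch, max.
            .set("spark.streaming.kafka.maxRatePerPartition", "1000");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));
        // ... create the direct Kafka stream and the processing graph here ...
        jssc.start();
        jssc.awaitTermination();
      }
    }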

How to monitor Spark Streaming from Kafka?

2015-06-01 Thread dgoldenberg
Hi, What are some of the good/adopted approaches to monitoring Spark Streaming from Kafka? I see that there are things like http://quantifind.github.io/KafkaOffsetMonitor, for example. Do they all assume that Receiver-based streaming is used? Then "Note that one disadvantage of this approach (R

Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-05-27 Thread dgoldenberg
Hi, I'm trying to understand if there are design patterns for autoscaling Spark (add/remove slave machines to the cluster) based on the throughput. Assuming we can throttle Spark consumers, the respective Kafka topics we stream data from would start growing. What are some of the ways to generate

Spark Streaming from Kafka - no receivers and spark.streaming.receiver.maxRate?

2015-05-27 Thread dgoldenberg
Hi, With the no receivers approach to streaming from Kafka, is there a way to set something like spark.streaming.receiver.maxRate so as not to overwhelm the Spark consumers? What would be some of the ways to throttle the streamed messages so that the consumers don't run out of memory?

Spark and logging

2015-05-27 Thread dgoldenberg
I'm wondering how logging works in Spark. I see that there's the log4j.properties.template file in the conf directory. Safe to assume Spark is using log4j 1? What's the approach if we're using log4j 2? I've got a log4j2.xml file in the job jar which seems to be working for my log statements but
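
Spark itself logs through log4j 1.2 in the 1.x/2.x line (hence the log4j.properties.template), but a job can do its own logging through log4j 2 as long as the log4j2 jars are on the classpath; the config file location can also be pointed at explicitly with log4j2's -Dlog4j.configurationFile system property for both driver and executors. A rough sketch of the relevant settings (file paths are placeholders):

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    public class Log4j2ConfigExample {
      public static void main(String[] args) {
        SparkConf conf = new SparkConf()
            .setAppName("log4j2-config-example")
            // Tell the executor JVMs where the log4j2 config lives (placeholder path).
            .set("spark.executor.extraJavaOptions",
                 "-Dlog4j.configurationFile=log4j2.xml");
        // The driver JVM is already running by this point, so its option has to be passed
        // at submit time instead, e.g.:
        //   spark-submit --conf "spark.driver.extraJavaOptions=-Dlog4j.configurationFile=log4j2.xml" ...
        // The job jar must also bundle (or the classpath must include) the log4j-api/log4j-core jars.
        JavaSparkContext sc = new JavaSparkContext(conf);
        sc.stop();
      }
    }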

Pipelining with Spark

2015-05-21 Thread dgoldenberg
From the performance and scalability standpoint, is it better to plug in, say, a multi-threaded pipeliner into a Spark job, or implement pipelining via Spark's own transformation mechanisms such as map or filter? I'm seeing some reference architectures where things like 'morphlines' are plugg

Spark Streaming and reducing latency

2015-05-17 Thread dgoldenberg
I keep hearing the argument that the way Discretized Streams work with Spark Streaming is a lot more of a batch processing algorithm than true streaming. For streaming, one would expect a new item, e.g. in a Kafka topic, to be available to the streaming consumer immediately. With the discretized

How to speed up data ingestion with Spark

2015-05-12 Thread dgoldenberg
Hi, I'm looking at a data ingestion implementation which streams data out of Kafka with Spark Streaming, then uses a multi-threaded pipeline engine to process the data in each partition. Have folks looked at ways of speeding up this type of ingestion? Let's say the main part of the ingest proces

Spark and RabbitMQ

2015-05-11 Thread dgoldenberg
Are there existing or under development versions/modules for streaming messages out of RabbitMQ with SparkStreaming, or perhaps a RabbitMQ RDD?

Running Spark in local mode seems to ignore local[N]

2015-05-11 Thread dgoldenberg
Hi, Is there anything special one must do, running locally and submitting a job like so:

    spark-submit \
      --class "com.myco.Driver" \
      --master local[*] \
      ./lib/myco.jar

In my logs, I'm only seeing log messages with the thread identifier of "Executor task launch worker-0".
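
One thing worth checking with this symptom: a master set explicitly on the SparkConf in code takes precedence over the --master flag passed to spark-submit, so a hard-coded setMaster("local") silently pins the job to one worker thread regardless of local[*] on the command line. A small sketch:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    public class LocalMasterExample {
      public static void main(String[] args) {
        // Don't call setMaster here if you want --master local[N] from spark-submit to win;
        // values set directly on the SparkConf override the submit-time flag.
        SparkConf conf = new SparkConf().setAppName("local-master-example");
        JavaSparkContext sc = new JavaSparkContext(conf);
        System.out.println("Effective master = " + sc.sc().master());
        sc.stop();
      }
    }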

How to stream all data out of a Kafka topic once, then terminate job?

2015-04-28 Thread dgoldenberg
Hi, I'm wondering about the use-case where you're not doing continuous, incremental streaming of data out of Kafka but rather want to publish data once with your Producer(s) and consume it once, in your Consumer, then terminate the consumer Spark job. JavaStreamingContext jssc = new JavaStreaming
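
For a one-shot "read what's there, then stop" job, the streaming context isn't strictly necessary: the Kafka 0.8 direct integration (spark-streaming-kafka, Spark 1.3+) also exposes KafkaUtils.createRDD, which reads an explicit set of offset ranges as a plain batch RDD, so the job simply ends when the RDD has been processed. A rough sketch (broker, topic, partition count, and offsets are placeholders; in practice the ending offsets would be fetched from Kafka first):

    import java.util.HashMap;
    import java.util.Map;

    import kafka.serializer.StringDecoder;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.streaming.kafka.KafkaUtils;
    import org.apache.spark.streaming.kafka.OffsetRange;

    public class DrainTopicOnce {
      public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("drain-topic-once");
        JavaSparkContext sc = new JavaSparkContext(conf);

        Map<String, String> kafkaParams = new HashMap<String, String>();
        kafkaParams.put("metadata.broker.list", "broker1:9092"); // placeholder broker

        // One range per topic partition: [fromOffset, untilOffset) -- placeholder values here.
        OffsetRange[] ranges = new OffsetRange[] {
            OffsetRange.create("my-topic", 0, 0L, 100000L),
            OffsetRange.create("my-topic", 1, 0L, 100000L)
        };

        JavaPairRDD<String, String> messages = KafkaUtils.createRDD(
            sc, String.class, String.class, StringDecoder.class, StringDecoder.class,
            kafkaParams, ranges);

        System.out.println("Drained " + messages.count() + " messages");
        sc.stop(); // plain batch job: nothing to keep alive once the RDD is processed
      }
    }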

Apache Spark User List: people's responses not showing in the browser view

2015-03-18 Thread dgoldenberg
Sorry if this is a total noob question but is there a reason why I'm only seeing folks' responses to my posts in emails but not in the browser view under apache-spark-user-list.1001560.n3.nabble.com? Is this a matter of setting your preferences such that your responses only go to email and never t

Spark and Morphlines, parallelization, multithreading

2015-03-18 Thread dgoldenberg
Still a Spark noob grappling with the concepts... I'm trying to grok the idea of integrating something like the Morphlines pipelining library with Spark (or SparkStreaming). The Kite/Morphlines doc states that "runtime executes all commands of a given morphline in the same thread... there are no

NotSerializableException: org.apache.http.impl.client.DefaultHttpClient when trying to send documents to Solr

2015-02-18 Thread dgoldenberg
I'm using Solrj in a Spark program. When I try to send the docs to Solr, I get the NotSerializableException on the DefaultHttpClient. Is there a possible fix or workaround? I'm using Spark 1.2.1 with Hadoop 2.4, SolrJ is version 4.0.0. final HttpSolrServer solrServer = new HttpSolrServer(SOLR_SE
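
The usual workaround is not to serialize the client at all: instead of creating the HttpSolrServer in the driver and capturing it in the closure, create it inside foreachPartition so each executor builds (and shuts down) its own instance. A rough sketch against SolrJ 4.x (the Solr URL and batch size are placeholders):

    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.List;

    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.function.VoidFunction;

    public class SolrIndexer {
      public static void index(JavaRDD<SolrInputDocument> docs) {
        docs.foreachPartition(new VoidFunction<Iterator<SolrInputDocument>>() {
          @Override
          public void call(Iterator<SolrInputDocument> partition) throws Exception {
            // Created on the executor, so it never has to be serialized.
            HttpSolrServer solr = new HttpSolrServer("http://solr-host:8983/solr/collection1"); // placeholder URL
            try {
              List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
              while (partition.hasNext()) {
                batch.add(partition.next());
                if (batch.size() >= 500) {       // flush in batches rather than per document
                  solr.add(batch);
                  batch.clear();
                }
              }
              if (!batch.isEmpty()) {
                solr.add(batch);
              }
              solr.commit();
            } finally {
              solr.shutdown();
            }
          }
        });
      }
    }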

JdbcRDD, ClassCastException with scala.Function0

2015-02-18 Thread dgoldenberg
I'm reading data from a database using JdbcRDD, in Java, and I have an implementation of Function0 whose instance I supply as the 'getConnection' parameter into the JdbcRDD constructor. Compiles fine. The definition of the class/function is as follows: public class GetDbConnection extends Abstr
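
Two things commonly come up with JdbcRDD from Java: the getConnection class should extend scala.runtime.AbstractFunction0<Connection> and also implement java.io.Serializable (the Scala abstract class alone doesn't make it serializable), and newer releases add a Java-friendly JdbcRDD.create that takes a plain ConnectionFactory instead of a Scala Function0. A sketch of the first variant (the JDBC URL is a placeholder):

    import java.io.Serializable;
    import java.sql.Connection;
    import java.sql.DriverManager;

    import scala.runtime.AbstractFunction0;

    /** Supplies a JDBC connection to JdbcRDD; must be serializable since it ships to executors. */
    public class GetDbConnection extends AbstractFunction0<Connection> implements Serializable {
      private static final long serialVersionUID = 1L;

      @Override
      public Connection apply() {
        try {
          return DriverManager.getConnection("jdbc:mysql://db-host/mydb?user=app&password=secret"); // placeholder
        } catch (Exception e) {
          throw new RuntimeException("Could not open JDBC connection", e);
        }
      }
    }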

Class loading issue, spark.files.userClassPathFirst doesn't seem to be working

2015-02-17 Thread dgoldenberg
I'm getting the below error when running spark-submit on my class. This class has a transitive dependency on HttpClient v.4.3.1 since I'm calling SolrJ 4.10.3 from within the class. This conflicts with the older HttpClient 3.1 that Hadoop 2.4 depends on (I'm running Spark 1.2.
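
Besides the experimental spark.files.userClassPathFirst switch (split into spark.executor.userClassPathFirst and spark.driver.userClassPathFirst in later releases), the other common way out of an HttpClient 3.x vs. 4.x clash is to shade the httpclient packages inside the job jar in the build so they can't collide at all. A minimal sketch of the config switch only (a sketch, not a guaranteed fix; the shading route happens in the build, not in code):

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    public class UserClassPathFirstExample {
      public static void main(String[] args) {
        SparkConf conf = new SparkConf()
            .setAppName("user-classpath-first-example")
            // Spark 1.2-era, experimental: prefer classes from the user jar on executors.
            .set("spark.files.userClassPathFirst", "true");
        // Equivalent at submit time:
        //   spark-submit --conf spark.files.userClassPathFirst=true ...
        JavaSparkContext sc = new JavaSparkContext(conf);
        sc.stop();
      }
    }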