Structured Streaming - HDFS State Store Performance Issues

2020-01-14 Thread William Briggs
Hi all, I've got a problem that really has me stumped. I'm running a Structured Streaming query that reads from Kafka, performs some transformations and stateful aggregations (using flatMapGroupsWithState), and outputs any updated aggregates to another Kafka topic. I'm running this job using

Exactly-Once delivery with Structured Streaming and Kafka

2019-01-31 Thread William Briggs
I noticed that Spark 2.4.0 implemented support for reading only committed messages in Kafka, and was excited. Are there currently any plans to update the Kafka output sink to support exactly-once delivery? Thanks, Will

Change in configuration settings?

2018-06-08 Thread William Briggs
I recently upgraded a Structured Streaming application from Spark 2.2.1 -> Spark 2.3.0. This application runs in yarn-cluster mode, and it made use of the spark.yarn.{driver|executor}.memoryOverhead properties. I noticed the job started crashing unexpectedly, and after doing a bunch of digging, it

Structured Streaming + Kafka - Corrupted Checkpoint Offsets / Commits

2018-01-04 Thread William Briggs
I am running a Structured Streaming job (Spark 2.2.0) using EMR 5.9. The job sources data from a Kafka topic, performs a variety of filters and transformations, and sinks data back into a different Kafka topic. Once per day, we stop the query in order to merge the namenode edit logs with the

Re: How to automatically relaunch a Driver program after crashes?

2015-08-19 Thread William Briggs
When submitting to YARN, you can specify two different operation modes for the driver with the --master parameter: yarn-client or yarn-cluster. For more information on submitting to YARN, see this page in the Spark docs: http://spark.apache.org/docs/latest/running-on-yarn.html yarn-cluster mode

Re: Scala: How to match a java object????

2015-08-18 Thread William Briggs
Could you share your pattern matching expression that is failing? On Tue, Aug 18, 2015, 3:38 PM saif.a.ell...@wellsfargo.com wrote: Hi all, I am trying to run a spark job, in which I receive *java.math.BigDecimal* objects, instead of the scala equivalents, and I am trying to convert them

Re: Does spark performance really scale out with multiple machines?

2015-06-15 Thread William Briggs
There are a lot of variables to consider. I'm not an expert on Spark, and my ML knowledge is rudimentary at best, but here are some questions whose answers might help us to help you: - What type of Spark cluster are you running (e.g., Stand-alone, Mesos, YARN)? - What does the HTTP UI

Re: Does spark performance really scale out with multiple machines?

2015-06-15 Thread William Briggs
on how to increase the number of partitions. -Will On Mon, Jun 15, 2015 at 5:00 PM, William Briggs wrbri...@gmail.com wrote: There are a lot of variables to consider. I'm not an expert on Spark, and my ML knowledge is rudimentary at best, but here are some questions whose answers might help us
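A minimal sketch of the partitioning suggestion referenced above (the input path and partition counts are placeholders):

    val sc = org.apache.spark.SparkContext.getOrCreate()
    val rdd = sc.textFile("hdfs:///input/data", minPartitions = 64)  // ask for more input splits up front
    val wider = rdd.repartition(200)                                  // or shuffle into more partitions afterwards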

Re: Can a Spark App run with spark-submit write pdf files to HDFS

2015-06-09 Thread William Briggs
I don't know anything about your use case, so take this with a grain of salt, but typically if you are operating at a scale that benefits from Spark, then you likely will not want to write your output records as individual files into HDFS. Spark has built-in support for the Hadoop SequenceFile
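A sketch of the SequenceFile approach, with placeholder names and HDFS path; binary payloads such as PDF bytes would typically be written as Array[Byte], which maps to BytesWritable:

    import org.apache.spark.SparkContext

    val sc = SparkContext.getOrCreate()
    val records = sc.parallelize(Seq(
      ("doc-1", "contents of document one"),
      ("doc-2", "contents of document two")
    ))

    // saveAsSequenceFile is available on RDDs of (key, value) pairs whose types
    // have Hadoop Writable conversions (String, Int, Array[Byte], ...).
    records.saveAsSequenceFile("hdfs:///output/docs-seqfile")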

Re: SparkContext Threading

2015-06-06 Thread William Briggs
Hi Lee, I'm stuck with only mobile devices for correspondence right now, so I can't get to a shell to play with this issue - this is all supposition; I think that the lambdas are closing over the context because it's a constructor parameter to your Runnable class, which is why inlining the lambdas
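A sketch of the pattern being described, with hypothetical class and member names: because the SparkContext is a constructor parameter, it becomes a field of the Runnable, and any lambda that touches another member of the class drags the whole object, context included, into its closure:

    import org.apache.spark.SparkContext

    class MyJob(sc: SparkContext) extends Runnable {
      val multiplier = 2

      override def run(): Unit = {
        val data = sc.parallelize(1 to 100)

        // Problematic: calling a method of `this` makes the closure capture MyJob,
        // and with it the non-serializable SparkContext field.
        // data.map(x => transform(x)).count()

        // Safer: copy what the closure needs into a local val first, so only that
        // value is shipped with the task.
        val factor = multiplier
        data.map(x => x * factor).count()
      }

      def transform(x: Int): Int = x * multiplier
    }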

Re: Deduping events using Spark

2015-06-04 Thread William Briggs
Hi Lee, You should be able to create a PairRDD using the Nonce as the key, and the AnalyticsEvent as the value. I'm very new to Spark, but here is some uncompilable pseudo code that may or may not help: events.map(event => (event.getNonce, event)).reduceByKey((a, b) => a).map(_._2) The above code
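A compilable version of the same idea, with a hypothetical AnalyticsEvent case class standing in for the real one from the thread:

    import org.apache.spark.SparkContext

    case class AnalyticsEvent(nonce: String, payload: String) {
      def getNonce: String = nonce
    }

    val sc = SparkContext.getOrCreate()
    val events = sc.parallelize(Seq(
      AnalyticsEvent("n1", "first"),
      AnalyticsEvent("n1", "duplicate of first"),
      AnalyticsEvent("n2", "second")
    ))

    // Key by nonce, keep one arbitrary event per nonce, then drop the key again.
    val deduped = events
      .map(event => (event.getNonce, event))
      .reduceByKey((a, b) => a)
      .map(_._2)

    deduped.collect().foreach(println)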

Re: Make HTTP requests from within Spark

2015-06-03 Thread William Briggs
Hi Kaspar, This is definitely doable, but in my opinion, it's important to remember that, at its core, Spark is based around a functional programming paradigm - you're taking input sets of data and, by applying various transformations, you end up with a dataset that represents your answer.
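A sketch of one way to issue the requests as a transformation, assuming sc is a live SparkContext and the endpoint URL is a placeholder; mapPartitions keeps any per-partition setup out of the per-record path:

    import java.net.{HttpURLConnection, URL}
    import scala.io.Source

    val ids = sc.parallelize(Seq("a", "b", "c"))

    val responses = ids.mapPartitions { partition =>
      partition.map { id =>
        val conn = new URL(s"https://example.com/api/items/$id").openConnection()
          .asInstanceOf[HttpURLConnection]
        conn.setRequestMethod("GET")
        val body =
          try Source.fromInputStream(conn.getInputStream).mkString
          finally conn.disconnect()
        (id, body)   // pair each input with the response body it produced
      }
    }

    responses.collect().foreach(println)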