Structured Streaming - HDFS State Store Performance Issues

2020-01-14 Thread William Briggs
Hi all, I've got a problem that really has me stumped. I'm running a Structured Streaming query that reads from Kafka, performs some transformations and stateful aggregations (using flatMapGroupsWithState), and outputs any updated aggregates to another Kafka topic. I'm running this job using
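A rough sketch of the kind of job described here (Kafka in, flatMapGroupsWithState, Kafka out); the Event/Summary types, broker address, topic names, and checkpoint path are placeholders, not the original poster's code:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

    case class Event(key: String, value: Long)
    case class Summary(key: String, total: Long)

    object StatefulAgg {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("stateful-agg").getOrCreate()
        import spark.implicits._

        // Source: Kafka topic with string keys and numeric string values.
        val events = spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "input-topic")
          .load()
          .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
          .as[(String, String)]
          .map { case (k, v) => Event(k, v.toLong) }

        // Stateful aggregation: keep a running total per key in the state store.
        val updated = events
          .groupByKey(_.key)
          .flatMapGroupsWithState(OutputMode.Update, GroupStateTimeout.NoTimeout) {
            (key: String, rows: Iterator[Event], state: GroupState[Long]) =>
              val total = state.getOption.getOrElse(0L) + rows.map(_.value).sum
              state.update(total)
              Iterator(Summary(key, total))
          }

        // Sink: write each updated aggregate back to another Kafka topic.
        updated.selectExpr("key", "CAST(total AS STRING) AS value")
          .writeStream
          .outputMode("update")
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("topic", "output-topic")
          .option("checkpointLocation", "/tmp/checkpoints/stateful-agg")
          .start()
          .awaitTermination()
      }
    }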

Exactly-Once delivery with Structured Streaming and Kafka

2019-01-31 Thread William Briggs
I noticed that Spark 2.4.0 implemented support for reading only committed messages in Kafka, and was excited. Are there currently any plans to update the Kafka output sink to support exactly-once delivery? Thanks, Will

Change in configuration settings?

2018-06-08 Thread William Briggs
I recently upgraded a Structured Streaming application from Spark 2.2.1 -> Spark 2.3.0. This application runs in yarn-cluster mode, and it made use of the spark.yarn.{driver|executor}.memoryOverhead properties. I noticed the job started crashing unexpectedly, and after doing a bunch of digging, it
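For reference, a minimal sketch of the renamed settings (the unprefixed spark.driver.memoryOverhead / spark.executor.memoryOverhead keys that superseded the spark.yarn.* variants around Spark 2.3); the values are placeholders, and in yarn-cluster mode these are normally passed via spark-submit --conf rather than set in code:

    // Hypothetical values; pass via --conf for yarn-cluster deployments.
    val conf = new org.apache.spark.SparkConf()
      .set("spark.driver.memoryOverhead", "1024")
      .set("spark.executor.memoryOverhead", "2048")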

Structured Streaming + Kafka - Corrupted Checkpoint Offsets / Commits

2018-01-04 Thread William Briggs
I am running a Structured Streaming job (Spark 2.2.0) using EMR 5.9. The job sources data from a Kafka topic, performs a variety of filters and transformations, and sinks data back into a different Kafka topic. Once per day, we stop the query in order to merge the namenode edit logs with the

Re: Fw: Spark + Parquet + IBM Block Storage at Bluemix

2016-09-25 Thread Mario Ds Briggs
Hi Daniel, can you give it a try in IBM's Analytics for Spark? The fix has been in for a week now. Thanks, Mario From: Daniel Lopes <dan...@onematch.com.br> To: Mario Ds Briggs/India/IBM@IBMIN Cc: Adam Roberts <arobe...@uk.ibm.com>, user <user@s

Re: Fw: Spark + Parquet + IBM Block Storage at Bluemix

2016-09-12 Thread Mario Ds Briggs
Mario From: Adam Roberts/UK/IBM To: Mario Ds Briggs/India/IBM@IBMIN Date: 12/09/2016 09:37 pm Subject: Fw: Spark + Parquet + IBM Block Storage at Bluemix Mario, in case you've not seen

Re: Spark support for Complex Event Processing (CEP)

2016-04-21 Thread Mario Ds Briggs
-csv thanks Mario From: Mich Talebzadeh <mich.talebza...@gmail.com> To: Mario Ds Briggs/India/IBM@IBMIN Cc: Alonso Isidoro Roman <alons...@gmail.com>, Luciano Resende <luckbr1...@gmail.com>, "user @spark" <user@spark.apache.org> Date: 21/0

Re: Spark support for Complex Event Processing (CEP)

2016-04-20 Thread Mario Ds Briggs
I did see your earlier post about Stratio Decision. Will read up on it. Thanks, Mario From: Alonso Isidoro Roman <alons...@gmail.com> To: Mich Talebzadeh <mich.talebza...@gmail.com> Cc: Mario Ds Briggs/India/IBM@IBMIN, Luciano Resende <luckbr1...@gmail.com

Re: Spark support for Complex Event Processing (CEP)

2016-04-19 Thread Mario Ds Briggs
/external/kafka/src/test/scala/org/apache/spark/streaming/kafka/DirectKafkaStreamSuite.scala#L532 Your feedback is appreciated. thanks Mario From: Mich Talebzadeh <mich.talebza...@gmail.com> To: Mario Ds Briggs/India/IBM@IBMIN Cc: "user @spark" <user@spark.apache.or

Re: Spark support for Complex Event Processing (CEP)

2016-04-18 Thread Mario Ds Briggs
Hey Mich, Luciano. Will provide links with docs by tomorrow. Thanks, Mario - Message from Mich Talebzadeh on Sun, 17 Apr 2016 19:17:38 +0100 - To: Luciano Resende

Re: How to automatically relaunch a Driver program after crashes?

2015-08-19 Thread William Briggs
When submitting to YARN, you can specify two different operation modes for the driver with the --master parameter: yarn-client or yarn-cluster. For more information on submitting to YARN, see this page in the Spark docs: http://spark.apache.org/docs/latest/running-on-yarn.html yarn-cluster mode

Re: Scala: How to match a java object????

2015-08-18 Thread William Briggs
Could you share your pattern matching expression that is failing? On Tue, Aug 18, 2015, 3:38 PM saif.a.ell...@wellsfargo.com wrote: Hi all, I am trying to run a spark job, in which I receive *java.math.BigDecimal* objects, instead of the scala equivalents, and I am trying to convert them
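A minimal sketch of the kind of match being discussed, assuming the values arrive as java.math.BigDecimal and should end up as the Scala wrapper; convert() is a made-up helper name:

    // Match the Java type explicitly, then convert to scala.math.BigDecimal.
    def convert(x: Any): Option[scala.math.BigDecimal] = x match {
      case d: java.math.BigDecimal => Some(scala.math.BigDecimal(d))
      case _                       => None
    }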

Re: Why Kryo Serializer is slower than Java Serializer in TeraSort

2015-07-05 Thread Will Briggs
That code doesn't appear to be registering classes with Kryo, which means the fully-qualified classname is stored with every Kryo record. The Spark documentation has more on this: https://spark.apache.org/docs/latest/tuning.html#data-serialization Regards, Will On July 5, 2015, at 2:31 AM,
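A short sketch of the explicit registration the tuning guide describes; MyRecord stands in for the application's own classes:

    import org.apache.spark.SparkConf

    case class MyRecord(id: Long, payload: String)   // placeholder for real record types

    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .registerKryoClasses(Array(classOf[MyRecord]))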

Re: Kryo fails to serialise output

2015-07-03 Thread Will Briggs
Kryo serialization is used internally by Spark for spilling or shuffling intermediate results, not for writing out an RDD as an action. Look at Sandy Ryza's examples for some hints on how to do this: https://github.com/sryza/simplesparkavroapp Regards, Will On July 3, 2015, at 2:45 AM,

Re: Using Accumulators in Streaming

2015-06-21 Thread Will Briggs
It sounds like accumulators are not necessary in Spark Streaming - see this post ( http://apache-spark-user-list.1001560.n3.nabble.com/Shared-variable-in-Spark-Streaming-td11762.html) for more details. On June 21, 2015, at 7:31 PM, anshu shukla anshushuk...@gmail.com wrote: In spark Streaming

Re: Submitting Spark Applications using Spark Submit

2015-06-16 Thread Will Briggs
In general, you should avoid making direct changes to the Spark source code. If you are using Scala, you can seamlessly blend your own methods on top of the base RDDs using implicit conversions. Regards, Will On June 16, 2015, at 7:53 PM, raggy raghav0110...@gmail.com wrote: I am trying to
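A minimal sketch of the implicit-conversion approach, with made-up names; the custom method appears to live on RDD without any change to RDD.scala:

    import org.apache.spark.rdd.RDD

    object RddExtensions {
      implicit class RichIntRdd(rdd: RDD[Int]) {
        def squared: RDD[Int] = rdd.map(x => x * x)   // custom method "added" to RDD[Int]
      }
    }

    // usage: import RddExtensions._ ; sc.parallelize(1 to 10).squared.collect()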

Re: Spark or Storm

2015-06-16 Thread Will Briggs
The programming models for the two frameworks are conceptually rather different; I haven't worked with Storm for quite some time, but based on my old experience with it, I would equate Spark Streaming more with Storm's Trident API, rather than with the raw Bolt API. Even then, there are

Re: Submitting Spark Applications using Spark Submit

2015-06-16 Thread Will Briggs
(). A member on here suggested I make the change in RDD.scala to accomplish that. Also, this is for a research project, and not for commercial use. So, any advice on how I can get the spark submit to use my custom built jars would be very useful. Thanks, Raghav On Jun 16, 2015, at 6:57 PM, Will Briggs

Re: Does spark performance really scale out with multiple machines?

2015-06-15 Thread William Briggs
There are a lot of variables to consider. I'm not an expert on Spark, and my ML knowledge is rudimentary at best, but here are some questions whose answers might help us to help you: - What type of Spark cluster are you running (e.g., Stand-alone, Mesos, YARN)? - What does the HTTP UI

Re: Does spark performance really scale out with multiple machines?

2015-06-15 Thread William Briggs
on how to increase the number of partitions. -Will On Mon, Jun 15, 2015 at 5:00 PM, William Briggs wrbri...@gmail.com wrote: There are a lot of variables to consider. I'm not an expert on Spark, and my ML knowledge is rudimentary at best, but here are some questions whose answers might help us
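For context, the usual knobs for partition count look roughly like this (the path and numbers are placeholders):

    // Ask for more input splits up front, or reshuffle an existing RDD wider.
    val lines = sc.textFile("hdfs:///data/input", minPartitions = 200)
    val wider = lines.repartition(400)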

Re: creation of RDD from a Tree

2015-06-14 Thread Will Briggs
If you are working on large structures, you probably want to look at the GraphX extension to Spark: https://spark.apache.org/docs/latest/graphx-programming-guide.html On June 14, 2015, at 10:50 AM, lisp lispra...@gmail.com wrote: Hi there, I have a large amount of objects, which I have to
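A minimal sketch of representing tree-like data in GraphX by flattening it into vertex and edge RDDs; the IDs and labels are invented for illustration:

    import org.apache.spark.graphx.{Edge, Graph}

    val vertices = sc.parallelize(Seq((1L, "root"), (2L, "left"), (3L, "right")))
    val edges    = sc.parallelize(Seq(Edge(1L, 2L, "child"), Edge(1L, 3L, "child")))
    val tree     = Graph(vertices, edges)             // parent-child structure as a graph
    println(tree.vertices.count())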

Re: How to split log data into different files according to severity

2015-06-13 Thread Will Briggs
Check out this recent post by Cheng Lian regarding dynamic partitioning in Spark 1.4: https://www.mail-archive.com/user@spark.apache.org/msg30204.html On June 13, 2015, at 5:41 AM, Hao Wang bill...@gmail.com wrote: Hi, I have a bunch of large log files on Hadoop. Each line contains a log and
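A rough sketch of the DataFrame partitionBy approach from the linked post, splitting records into one directory per severity; the case class, sample rows, and output path are placeholders (Spark 1.4-era API):

    import org.apache.spark.sql.SQLContext

    case class LogLine(severity: String, message: String)

    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val logsDF = sc.parallelize(Seq(LogLine("ERROR", "boom"), LogLine("INFO", "ok"))).toDF()
    logsDF.write
      .partitionBy("severity")                        // one sub-directory per severity value
      .format("parquet")
      .save("hdfs:///logs/by-severity")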

Re: Spark distinct() returns incorrect results for some types?

2015-06-11 Thread Will Briggs
To be fair, this is a long-standing issue due to optimizations for object reuse in the Hadoop API, and isn't necessarily a failing in Spark - see this blog post (https://cornercases.wordpress.com/2011/08/18/hadoop-object-reuse-pitfall-all-my-reducer-values-are-the-same/) from 2011 documenting
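A minimal sketch of the usual workaround for the reuse pitfall, assuming Text/Text SequenceFile input with a placeholder path: copy the mutable Writables into immutable values before calling distinct():

    import org.apache.hadoop.io.Text

    val cleaned = sc.sequenceFile("hdfs:///data/events", classOf[Text], classOf[Text])
      .map { case (k, v) => (k.toString, v.toString) }   // copy out of the reused Writables
      .distinct()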

Re: Can a Spark App run with spark-submit write pdf files to HDFS

2015-06-09 Thread William Briggs
I don't know anything about your use case, so take this with a grain of salt, but typically if you are operating at a scale that benefits from Spark, then you likely will not want to write your output records as individual files into HDFS. Spark has built-in support for the Hadoop SequenceFile
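A rough sketch of the SequenceFile idea, packing many small binary outputs into one file of (name, bytes) records; the byte arrays and output path are placeholders:

    import org.apache.hadoop.io.{BytesWritable, Text}

    val docs = sc.parallelize(Seq(
      ("report-1.pdf", Array[Byte](0x25, 0x50, 0x44, 0x46)),   // placeholder bytes
      ("report-2.pdf", Array[Byte](0x25, 0x50, 0x44, 0x46))
    ))
    docs
      .map { case (name, bytes) => (new Text(name), new BytesWritable(bytes)) }
      .saveAsSequenceFile("hdfs:///output/pdf-bundle")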

Re: SparkContext Threading

2015-06-06 Thread Will Briggs
McFadden splee...@gmail.com wrote: On Fri, Jun 5, 2015 at 2:05 PM Will Briggs wrbri...@gmail.com wrote: Your lambda expressions on the RDDs in the SecondRollup class are closing around the context, and Spark has special logic to ensure that all variables in a closure used on an RDD

Re: SparkContext Threading

2015-06-06 Thread William Briggs
lambda expressions a lot - surely those examples also would not work if this is always an issue with lambdas? On Sat, Jun 6, 2015, 12:21 AM Will Briggs wrbri...@gmail.com wrote: Hi Lee, it's actually not related to threading at all - you would still have the same problem even if you were using

Re: write multiple outputs by key

2015-06-06 Thread Will Briggs
I believe groupByKey currently requires that all items for a specific key fit into a single executor's memory: http://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/prefer_reducebykey_over_groupbykey.html This previous discussion has some pointers if you must
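A minimal sketch of the reduceByKey alternative from that knowledge-base page; values are combined map-side, so one key's values never need to be materialized together:

    val pairs  = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)))
    val counts = pairs.reduceByKey(_ + _)              // ("a", 2), ("b", 1)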

Re: SparkContext Threading

2015-06-05 Thread Will Briggs
Your lambda expressions on the RDDs in the SecondRollup class are closing around the context, and Spark has special logic to ensure that all variables in a closure used on an RDD are Serializable - I hate linking to Quora, but there's a good explanation here:
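A minimal sketch of the pitfall and the usual fix, with invented names (the thread's class was called SecondRollup): referencing a constructor field inside an RDD lambda drags the whole enclosing object, context included, into the closure, while copying the value to a local val keeps it out:

    import org.apache.spark.SparkContext

    class Rollup(sc: SparkContext, factor: Int) {
      def run(): Long = {
        val localFactor = factor                       // local copy; closure no longer captures `this`
        sc.parallelize(1 to 1000).map(_ * localFactor).count()
      }
    }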

Re: Deduping events using Spark

2015-06-04 Thread William Briggs
Hi Lee, You should be able to create a PairRDD using the Nonce as the key, and the AnalyticsEvent as the value. I'm very new to Spark, but here is some uncompilable pseudocode that may or may not help: events.map(event => (event.getNonce, event)).reduceByKey((a, b) => a).map(_._2) The above code
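A self-contained version of the same idea, with a placeholder Event case class standing in for AnalyticsEvent:

    case class Event(nonce: String, payload: String)

    val events  = sc.parallelize(Seq(Event("a", "x"), Event("a", "y"), Event("b", "z")))
    val deduped = events
      .map(e => (e.nonce, e))                          // key by nonce
      .reduceByKey((a, _) => a)                        // keep one event per nonce
      .map(_._2)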

Re: Make HTTP requests from within Spark

2015-06-03 Thread William Briggs
Hi Kaspar, This is definitely doable, but in my opinion, it's important to remember that, at its core, Spark is based around a functional programming paradigm - you're taking input sets of data and, by applying various transformations, you end up with a dataset that represents your answer.
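One common shape for this, sketched with placeholder URLs and no retry or rate limiting: issue the HTTP calls inside mapPartitions so per-partition setup stays out of the per-record path:

    val urls = sc.parallelize(Seq("http://example.com/a", "http://example.com/b"))
    val bodies = urls.mapPartitions { iter =>
      iter.map { url =>
        val src = scala.io.Source.fromURL(url)
        try src.mkString finally src.close()           // fetch the body, always close the stream
      }
    }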