Hi all, I've got a problem that really has me stumped. I'm running a
Structured Streaming query that reads from Kafka, performs some
transformations and stateful aggregations (using flatMapGroupsWithState),
and outputs any updated aggregates to another Kafka topic.
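For context, the query is shaped roughly like the sketch below (a hedged, minimal sketch; the topic names, schema, and state logic are placeholders rather than the real job):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

object StatefulKafkaToKafkaSketch {
  case class Event(key: String, value: Long)
  case class AggState(total: Long)
  case class AggOutput(key: String, total: Long)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("stateful-kafka-sketch").getOrCreate()
    import spark.implicits._

    // Source: read records from an input Kafka topic (names are placeholders).
    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "input-topic")
      .load()
      .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")
      .as[(String, String)]
      .map { case (k, v) => Event(k, v.toLong) }

    // Stateful aggregation: keep a running total per key and emit an updated
    // aggregate whenever new rows arrive for that key.
    val updates = events
      .groupByKey(_.key)
      .flatMapGroupsWithState[AggState, AggOutput](
        OutputMode.Update(), GroupStateTimeout.NoTimeout()) {
        (key: String, rows: Iterator[Event], state: GroupState[AggState]) =>
          val prev = state.getOption.getOrElse(AggState(0L))
          val next = AggState(prev.total + rows.map(_.value).sum)
          state.update(next)
          Iterator(AggOutput(key, next.total))
      }

    // Sink: write the updated aggregates back out to another Kafka topic.
    updates
      .selectExpr("key", "CAST(total AS STRING) AS value")
      .writeStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("topic", "output-topic")
      .option("checkpointLocation", "/tmp/checkpoints")
      .outputMode("update")
      .start()
      .awaitTermination()
  }
}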
I'm running this job using
I noticed that Spark 2.4.0 implemented support for reading only committed
messages in Kafka, and was excited. Are there currently any plans to update
the Kafka output sink to support exactly-once delivery?
Thanks,
Will
I recently upgraded a Structured Streaming application from Spark 2.2.1 ->
Spark 2.3.0. This application runs in yarn-cluster mode, and it made use of
the spark.yarn.{driver|executor}.memoryOverhead properties. I noticed the
job started crashing unexpectedly, and after doing a bunch of digging, it
I am running a Structured Streaming job (Spark 2.2.0) using EMR 5.9. The
job sources data from a Kafka topic, performs a variety of filters and
transformations, and sinks data back into a different Kafka topic.
Once per day, we stop the query in order to merge the namenode edit logs
with the
Hi Daniel,
Can you give it a try in IBM's Analytics for Spark? The fix has been in
for a week now.
thanks
Mario
From: Daniel Lopes <dan...@onematch.com.br>
To: Mario Ds Briggs/India/IBM@IBMIN
Cc: Adam Roberts <arobe...@uk.ibm.com>, user
<user@s
Mario
From: Adam Roberts/UK/IBM
To: Mario Ds Briggs/India/IBM@IBMIN
Date: 12/09/2016 09:37 pm
Subject: Fw: Spark + Parquet + IBM Block Storage at Bluemix
Mario, in case you've not seen
-csv
thanks
Mario
From: Mich Talebzadeh <mich.talebza...@gmail.com>
To: Mario Ds Briggs/India/IBM@IBMIN
Cc: Alonso Isidoro Roman <alons...@gmail.com>, Luciano Resende
<luckbr1...@gmail.com>, "user @spark" <user@spark.apache.org>
Date: 21/0
I did see your earlier post about the Stratio decision. Will read up on it
thanks
Mario
From: Alonso Isidoro Roman <alons...@gmail.com>
To: Mich Talebzadeh <mich.talebza...@gmail.com>
Cc: Mario Ds Briggs/India/IBM@IBMIN, Luciano Resende
<luckbr1...@gmail.com
/external/kafka/src/test/scala/org/apache/spark/streaming/kafka/DirectKafkaStreamSuite.scala#L532
Your feedback is appreciated.
thanks
Mario
From: Mich Talebzadeh <mich.talebza...@gmail.com>
To: Mario Ds Briggs/India/IBM@IBMIN
Cc: "user @spark" <user@spark.apache.or
Hey Mich, Luciano
Will provide links with docs by tomorrow
thanks
Mario
- Message from Mich Talebzadeh on Sun, 17
Apr 2016 19:17:38 +0100 -
To: Luciano Resende
When submitting to YARN, you can specify two different operation modes for
the driver with the --master parameter: yarn-client or yarn-cluster. For
more information on submitting to YARN, see this page in the Spark docs:
http://spark.apache.org/docs/latest/running-on-yarn.html
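For example, a cluster-mode submission might look like this (the class, jar, and resource settings here are hypothetical):

spark-submit --master yarn-cluster --class com.example.MyApp \
  --num-executors 4 --executor-memory 2g my-app.jar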
yarn-cluster mode
Could you share your pattern matching expression that is failing?
On Tue, Aug 18, 2015, 3:38 PM saif.a.ell...@wellsfargo.com wrote:
Hi all,
I am trying to run a Spark job in which I receive *java.math.BigDecimal*
objects instead of the Scala equivalents, and I am trying to convert them
That code doesn't appear to be registering classes with Kryo, which means the
fully-qualified classname is stored with every Kryo record. The Spark
documentation has more on this:
https://spark.apache.org/docs/latest/tuning.html#data-serialization
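As a hedged illustration (the case classes below are invented stand-ins for your own record types), the registration step in the driver looks roughly like this:

import org.apache.spark.SparkConf

case class MyEvent(id: Long, name: String)
case class MyAggregate(id: Long, total: Double)

// Registered classes are written as small numeric IDs instead of their
// fully-qualified names in every serialized record.
val conf = new SparkConf()
  .setAppName("kryo-registration-sketch")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[MyEvent], classOf[MyAggregate]))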
Regards,
Will
On July 5, 2015, at 2:31 AM,
Kryo serialization is used internally by Spark for spilling or shuffling
intermediate results, not for writing out an RDD as an action. Look at Sandy
Ryza's examples for some hints on how to do this:
https://github.com/sryza/simplesparkavroapp
Regards,
Will
On July 3, 2015, at 2:45 AM,
It sounds like accumulators are not necessary in Spark Streaming - see this
post (
http://apache-spark-user-list.1001560.n3.nabble.com/Shared-variable-in-Spark-Streaming-td11762.html)
for more details.
On June 21, 2015, at 7:31 PM, anshu shukla anshushuk...@gmail.com wrote:
In spark Streaming
In general, you should avoid making direct changes to the Spark source code. If
you are using Scala, you can seamlessly blend your own methods on top of the
base RDDs using implicit conversions.
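As a rough sketch of that approach (the class and method names are invented for illustration):

import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

object RddExtensions {
  // An implicit class adds extra methods to any RDD without modifying
  // Spark's source code.
  implicit class RichRDD[T: ClassTag](rdd: RDD[T]) {
    // Example custom operation: keep every nth element of the RDD.
    def everyNth(n: Int): RDD[T] =
      rdd.zipWithIndex().filter { case (_, i) => i % n == 0 }.map(_._1)
  }
}

// Usage: import RddExtensions._ and then call myRdd.everyNth(10) as if it
// were a built-in RDD method.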
Regards,
Will
On June 16, 2015, at 7:53 PM, raggy raghav0110...@gmail.com wrote:
I am trying to
The programming models for the two frameworks are conceptually rather
different; I haven't worked with Storm for quite some time, but based on my old
experience with it, I would equate Spark Streaming more with Storm's Trident
API, rather than with the raw Bolt API. Even then, there are
(). A member
on here suggested I make the change in RDD.scala to accomplish that. Also, this
is for a research project, and not for commercial use.
So, any advice on how I can get the spark submit to use my custom built jars
would be very useful.
Thanks,
Raghav
On Jun 16, 2015, at 6:57 PM, Will Briggs
There are a lot of variables to consider. I'm not an expert on Spark, and
my ML knowledge is rudimentary at best, but here are some questions whose
answers might help us to help you:
- What type of Spark cluster are you running (e.g., Stand-alone, Mesos,
YARN)?
- What does the HTTP UI
on how to increase the number of partitions.
-Will
On Mon, Jun 15, 2015 at 5:00 PM, William Briggs wrbri...@gmail.com wrote:
There are a lot of variables to consider. I'm not an expert on Spark, and
my ML knowledge is rudimentary at best, but here are some questions whose
answers might help us
If you are working on large structures, you probably want to look at the GraphX
extension to Spark:
https://spark.apache.org/docs/latest/graphx-programming-guide.html
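For a feel of the API, here is a minimal, hedged sketch (the vertex and edge data are made up):

import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.{SparkConf, SparkContext}

object GraphXSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("graphx-sketch").setMaster("local[*]"))
    // A tiny property graph: vertices carry a name, edges carry a relationship.
    val vertices = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
    val edges = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows")))
    val graph = Graph(vertices, edges)
    println(graph.inDegrees.collect().mkString(", "))
    sc.stop()
  }
}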
On June 14, 2015, at 10:50 AM, lisp lispra...@gmail.com wrote:
Hi there,
I have a large amount of objects, which I have to
Check out this recent post by Cheng Lian regarding dynamic partitioning in
Spark 1.4: https://www.mail-archive.com/user@spark.apache.org/msg30204.html
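The gist of that feature, as a hedged sketch (the column names and output path are hypothetical):

import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}

object PartitionedWriteSketch {
  case class LogLine(date: String, message: String)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("partitioned-write-sketch").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val df = sc.parallelize(Seq(
      LogLine("2015-06-12", "started"),
      LogLine("2015-06-13", "stopped"))).toDF()

    // partitionBy (new in Spark 1.4) writes one subdirectory per distinct
    // value of the partition column, e.g. .../date=2015-06-12/part-*.parquet
    df.write.partitionBy("date").parquet("/tmp/logs-by-date")
    sc.stop()
  }
}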
On June 13, 2015, at 5:41 AM, Hao Wang bill...@gmail.com wrote:
Hi,
I have a bunch of large log files on Hadoop. Each line contains a log and
To be fair, this is a long-standing issue due to optimizations for object reuse
in the Hadoop API, and isn't necessarily a failing in Spark - see this blog
post
(https://cornercases.wordpress.com/2011/08/18/hadoop-object-reuse-pitfall-all-my-reducer-values-are-the-same/)
from 2011 documenting
I don't know anything about your use case, so take this with a grain of
salt, but typically if you are operating at a scale that benefits from
Spark, then you likely will not want to write your output records as
individual files into HDFS. Spark has built-in support for the Hadoop
SequenceFile
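A hedged sketch of that route (the paths and data are made up):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // needed for the Writable implicits on older Spark versions

object SequenceFileSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("sequencefile-sketch").setMaster("local[*]"))
    // Key-value pairs are written into a handful of partition files rather
    // than one tiny HDFS file per record.
    val records = sc.parallelize(Seq(("id-1", "payload-1"), ("id-2", "payload-2")))
    records.saveAsSequenceFile("/tmp/records-seqfile")
    sc.stop()
  }
}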
McFadden splee...@gmail.com wrote:
On Fri, Jun 5, 2015 at 2:05 PM Will Briggs wrbri...@gmail.com wrote:
Your lambda expressions on the RDDs in the SecondRollup class are closing
around the context, and Spark has special logic to ensure that all variables in
a closure used on an RDD
lambda expressions a lot -
surely those examples also would not work if this is always an issue with
lambdas?
On Sat, Jun 6, 2015, 12:21 AM Will Briggs wrbri...@gmail.com wrote:
Hi Lee, it's actually not related to threading at all - you would still
have the same problem even if you were using
I believe groupByKey currently requires that all items for a specific key fit
into a single executor's memory:
http://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/prefer_reducebykey_over_groupbykey.html
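A hedged sketch of the reduceByKey pattern that page recommends (the data is made up):

import org.apache.spark.{SparkConf, SparkContext}

object ReduceByKeySketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("reducebykey-sketch").setMaster("local[*]"))
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)))
    // reduceByKey combines values per key within each partition before the
    // shuffle, so no single executor has to hold every value for a key.
    val counts = pairs.reduceByKey(_ + _)
    counts.collect().foreach(println)
    sc.stop()
  }
}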
This previous discussion has some pointers if you must
Your lambda expressions on the RDDs in the SecondRollup class are closing
around the context, and Spark has special logic to ensure that all variables in
a closure used on an RDD are Serializable - I hate linking to Quora, but
there's a good explanation here:
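The usual workaround, sketched below (the class and field names are invented, not taken from your code), is to copy what the closure needs into a local val so only that value is captured rather than the whole outer object:

import org.apache.spark.rdd.RDD

// Hypothetical outer class standing in for SecondRollup; it is NOT Serializable.
class RollupJob(config: Map[String, String]) {
  def run(lines: RDD[String]): RDD[String] = {
    // Copy the needed value into a local val: the closure below then captures
    // only this String, not the enclosing RollupJob instance.
    val prefix = config.getOrElse("prefix", "")
    lines.map(line => prefix + line)
  }
}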
Hi Lee,
You should be able to create a PairRDD using the Nonce as the key, and the
AnalyticsEvent as the value. I'm very new to Spark, but here is some
uncompilable pseudo code that may or may not help:
events.map(event => (event.getNonce, event)).reduceByKey((a, b) => a).map(_._2)
The above code
Hi Kaspar,
This is definitely doable, but in my opinion, it's important to remember
that, at its core, Spark is based around a functional programming paradigm
- you're taking input sets of data and, by applying various
transformations, you end up with a dataset that represents your answer.
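As a tiny, hedged illustration of that style (the input path and filtering logic are made up):

import org.apache.spark.{SparkConf, SparkContext}

object PipelineSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("pipeline-sketch").setMaster("local[*]"))
    // Each step is a pure transformation of the previous dataset; the final
    // RDD is the "answer" that an action (collect, save, etc.) materializes.
    val answer = sc.textFile("/tmp/input-logs")
      .filter(_.contains("ERROR"))
      .map(line => (line.split(" ")(0), 1))
      .reduceByKey(_ + _)
    answer.collect().foreach(println)
    sc.stop()
  }
}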