Oops - linked the wrong JIRA ticket (that other one is related):
https://issues.apache.org/jira/browse/SPARK-28025
On Wed, Jun 12, 2019 at 1:21 PM Gerard Maas wrote:
> Hi!
> I would like to socialize this issue we are currently facing:
> The Structured Streaming default CheckpointFi
Hi!
I would like to socialize this issue we are currently facing:
The Structured Streaming default CheckpointFileManager leaks .crc files by
leaving them behind after users of this class (like
HDFSBackedStateStoreProvider) apply their cleanup methods.
This results in an unbounded creation of tiny
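To make the leak concrete, here is a minimal sketch that counts the .crc files left behind in a state checkpoint directory; the path is hypothetical and the Hadoop FileSystem API is used only for illustration:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Count leftover .crc files in a (hypothetical) state store checkpoint dir
val stateDir = new Path("/tmp/checkpoints/query1/state/0/0")
val fs = stateDir.getFileSystem(new Configuration())
val leftoverCrc = fs.listStatus(stateDir).count(_.getPath.getName.endsWith(".crc"))
println(s"leftover .crc files: $leftoverCrc")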
Hi,
I'm looking into the Parquet format support for the File source in
Structured Streaming.
The docs mention the use of the option 'mergeSchema' to merge the schemas
of the part files found.[1]
What would be the practical use of that in a streaming context?
In its batch counterpart,
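For reference, a minimal sketch of how the option is set on a streaming read, assuming an existing SparkSession named spark; the path and schema are illustrative. Note that a streaming file source requires a user-supplied schema anyway, which is what makes the practical use of the option unclear to me:

import org.apache.spark.sql.types.{LongType, StringType, StructType}

val staticSchema = new StructType().add("id", LongType).add("payload", StringType)
val parquetStream = spark.readStream
  .schema(staticSchema)            // streaming file sources require an explicit schema
  .option("mergeSchema", "true")   // merge the schemas of the discovered part files
  .parquet("/data/in")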
Devs,
While investigating another issue, I came across this OOM error when using
the Console Sink with any source that can be larger than the available
driver memory. In my case, I was using the File source and I had a 14G file
in the monitored dir.
I traced back the issue to a `df.collect` in
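A minimal repro sketch, assuming a SparkSession named spark and a monitored directory holding a file larger than driver memory; the path is hypothetical:

val query = spark.readStream
  .text("/data/incoming")   // monitored dir containing a ~14G file
  .writeStream
  .format("console")        // materializes each batch on the driver before printing
  .start()
query.awaitTermination()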
Great discussion. Glad to see it happening and lucky to have seen it on the
mailing list due to its high volume.
I had this same conversation with Patrick Wendell a few Spark Summits ago. At
the time, SO was not even listed as a resource and the idea was to make it
the primary "go-to" place for
+1
On Mar 19, 2016 08:33, "Pete Robbins" wrote:
> This seems to me to be unnecessarily restrictive. These are very useful
> extension points for adding 3rd party sources and sinks.
>
> I intend to make an Elasticsearch sink available on spark-packages but
> this will require
Are you sharing the SimpleDateFormat instance? This looks a lot more like
the non-thread-safe behaviour of SimpleDateFormat (which has claimed many
unsuspecting victims over the years) than any 'ugly' Spark Streaming. Try
writing the timestamps in millis to Kafka and compare.
-kr, Gerard.
On
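For anyone hitting the same thing, the two usual remedies look roughly like this; names are illustrative:

import java.text.SimpleDateFormat
import java.time.format.DateTimeFormatter
import java.time.{Instant, ZoneOffset}
import java.util.Date

// Remedy 1: one SimpleDateFormat per thread instead of a shared instance
val fmt = new ThreadLocal[SimpleDateFormat] {
  override def initialValue() = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
}
def format(tsMillis: Long): String = fmt.get.format(new Date(tsMillis))

// Remedy 2 (Java 8+): DateTimeFormatter is immutable and thread-safe
val safeFmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss").withZone(ZoneOffset.UTC)
def formatSafe(tsMillis: Long): String = safeFmt.format(Instant.ofEpochMilli(tsMillis))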
Kay,
Excellent write-up. This should be preserved for reference somewhere
searchable.
-Gerard.
On Fri, Jun 12, 2015 at 1:19 AM, Kay Ousterhout k...@eecs.berkeley.edu
wrote:
Here’s how the shuffle works. This explains what happens for a single
task; this will happen in parallel for each
the spark.executor.uri
(or another one) can take more than one downloadable path.
my 2¢
andy
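For context, a sketch of how the property is set today, with a single URI; the master address and tarball location are hypothetical:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setMaster("mesos://zk://zk1:2181/mesos")
  .set("spark.executor.uri", "hdfs://namenode/dist/spark-1.3.1-bin-hadoop2.4.tgz")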
On Fri, May 29, 2015 at 5:09 PM Gerard Maas gerard.m...@gmail.com
wrote:
Hi Tim,
Thanks for the info. We (Andy Petrella and myself) have been diving a
bit deeper into this log config:
The log
[truncated code snippet ending in .saveAsTextFile(text)]
Is there a way to achieve this with the MetricSystem?
On Mon, Jan 5, 2015 at 10:24 AM, Gerard Maas gerard.m...@gmail.com
wrote:
Hi,
Yes, I managed to register custom metrics by creating an
implementation
Hi,
Yes, I managed to register custom metrics by creating an
implementation of org.apache.spark.metrics.source.Source and registering
it with the metrics subsystem.
Source is [Spark] private, so you need to create it under an org.apache.spark
package. In my case, I'm dealing with Spark
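A minimal sketch of such a source, assuming Spark 1.x and the Codahale MetricRegistry it ships with; the class and metric names are illustrative:

package org.apache.spark.metrics.source   // Source is private[spark], hence this package

import com.codahale.metrics.{Counter, MetricRegistry}

class MyJobSource extends Source {
  override val sourceName: String = "myJob"
  override val metricRegistry: MetricRegistry = new MetricRegistry()
  // A job-level metric; increment it from the job code
  val recordsProcessed: Counter = metricRegistry.counter("recordsProcessed")
}

Registration would then be something like SparkEnv.get.metricsSystem.registerSource(new MyJobSource), which likewise only compiles from inside an org.apache.spark package.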
Hi,
After facing issues with the performance of some of our Spark Streaming
jobs, we invested quite some effort figuring out the factors that affect
the performance characteristics of a Streaming job. We defined an
empirical model that helps us reason about Streaming jobs and applied it to
tune
mode?
I'm making changes to the Spark Mesos scheduler and I think we can propose
the best way to achieve what you mentioned.
Tim
Sent from my iPhone
On Dec 22, 2014, at 8:33 AM, Gerard Maas gerard.m...@gmail.com wrote:
Hi,
After facing issues with the performance of some of our Spark
Hi,
I'm confused about the stage times reported on the Spark UI (Spark 1.1.0)
for a Spark Streaming job. I'm hoping somebody can shed some light on it:
Let's do this with an example:
On the /stages page, stage #232 is reported to have lasted 18 seconds:
232: runJob at RDDFunctions.scala:23
Looks like metrics are not a hot topic to discuss - yet so important to
sleep well when jobs are running in production.
I've created SPARK-4537 https://issues.apache.org/jira/browse/SPARK-4537
to track this issue.
-kr, Gerard.
On Thu, Nov 20, 2014 at 9:25 PM, Gerard Maas gerard.m...@gmail.com
As the Spark Streaming tuning guide indicates, the key indicators of a
healthy streaming job are:
- Processing Time
- Total Delay
The Spark UI page for the Streaming job [1] shows these two indicators but
the metrics source for Spark Streaming (StreamingSource.scala) [2] does
not.
Any reasons
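Until that lands, the same indicators can be pulled from a listener; a sketch assuming the Spark 1.x API and an existing StreamingContext named ssc:

import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

ssc.addStreamingListener(new StreamingListener {
  override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
    val info = batch.batchInfo
    // Both delays are Options until the batch has fully completed
    println(s"processing=${info.processingDelay.getOrElse(-1L)} ms " +
      s"total=${info.totalDelay.getOrElse(-1L)} ms")
  }
})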
Hi,
I've been exploring the metrics exposed by Spark and I'm wondering whether
there's a way to register job-specific metrics that could be exposed
through the existing metrics system.
Would there be an example somewhere?
BTW, documentation about how the metrics work could be improved. I
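For anyone searching later: the metrics system is configured through conf/metrics.properties; a minimal fragment that reports everything to the console every 10 seconds (the sink choice is just for illustration):

# conf/metrics.properties
*.sink.console.class=org.apache.spark.metrics.sink.ConsoleSink
*.sink.console.period=10
*.sink.console.unit=seconds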
Using a case class as a key doesn't seem to work properly. [Spark 1.0.0]
A minimal example:
case class P(name: String)
val ps = Array(P("alice"), P("bob"), P("charly"), P("bob"))
sc.parallelize(ps).map(x => (x, 1)).reduceByKey((x, y) => x + y).collect
[Spark shell local mode] res : Array[(P, Int)] =
,ArrayBuffer(1, 1)))
On Tue, Jul 22, 2014 at 4:20 PM, Gerard Maas gerard.m...@gmail.com wrote:
Using a case class as a key doesn't seem to work properly. [Spark 1.0.0]
A minimal example:
case class P(name: String)
val ps = Array(P("alice"), P("bob"), P("charly"), P("bob"))
sc.parallelize(ps).map(x => (x, 1
, 2014 at 5:37 PM, Gerard Maas gerard.m...@gmail.com wrote:
Yes, right. 'sc.parallelize(ps).map(x => (x.name, 1)).groupByKey().
collect'
An oversight from my side.
Thanks!, Gerard.
On Tue, Jul 22, 2014 at 5:24 PM, Daniel Siegmann daniel.siegm...@velos.io
wrote:
I can confirm this bug
send in a pull request that includes your proposed
changes?
Andrew
On Wed, May 21, 2014 at 10:19 AM, Gerard Maas gerard.m...@gmail.com
wrote:
Spark dev's,
I was looking into a question asked on the user list where a
ClassNotFoundException was thrown when running a job on Mesos
a new ticket for just this particular issue.
On Thu, May 22, 2014 at 11:03 AM, Gerard Maas gerard.m...@gmail.com wrote:
Sure. Should I create a Jira as well?
I saw there's already a broader ticket regarding the ambiguous use of
SPARK_HOME [1] (cc: Patrick as owner of that ticket)
I don't
Hi Tobias,
I was curious about this issue and tried to run your example on my local
Mesos. I was able to reproduce your issue using your current config:
[error] (run-main-0) org.apache.spark.SparkException: Job aborted: Task
1.0:4 failed 4 times (most recent failure: Exception failure:
Spark dev's,
I was looking into a question asked on the user list where a
ClassNotFoundException was thrown when running a job on Mesos. Curious
issue with serialization on Mesos: more details here [1]:
When trying to run that simple example on my Mesos installation, I faced
another issue: I got
for it to work.
The Spark REPL works differently. It uses some dark magic to send the
working session to the workers.
-kr, Gerard.
On Wed, May 21, 2014 at 2:47 PM, Gerard Maas gerard.m...@gmail.com wrote:
Hi Tobias,
I was curious about this issue and tried to run your example on my local
this is cool +1
On Wed, Mar 19, 2014 at 6:54 PM, Patrick Wendell pwend...@gmail.com wrote:
Evan - yep definitely open a JIRA. It would be nice to have a contrib
repo set-up for the 1.0 release.
On Tue, Mar 18, 2014 at 11:28 PM, Evan Chan e...@ooyala.com wrote:
Matei,
Maybe it's time