Is there a proper way to make or get an Encoder for Option in Spark 2.0?
There isn't one by default and while ExpressionEncoder from catalyst will
work, it is private and unsupported.
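For anyone hitting the same wall, one workaround that stays on public API is to keep the Option inside a Product type, since the derived product encoders do handle Option fields. A hedged sketch (assumes a local SparkSession):

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

val spark = SparkSession.builder().master("local[*]").appName("opt-enc").getOrCreate()
import spark.implicits._

// No top-level Encoder[Option[Int]] exists in 2.0, but an Option inside a
// tuple (or case class) is handled by the derived product encoder:
val ds: Dataset[Tuple1[Option[Int]]] =
  Seq(Tuple1(Option(1)), Tuple1(None: Option[Int])).toDS()
```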
--
*Richard Marscher*
Senior Software Engineer
Localytics
Localytics.com <http://localytics.com/> | Our Blog
<http://localytics.com/blog> | Twitter <http://twitter.com/localytics> |
Facebook <http://facebook.com/localytics> | LinkedIn
<http://www.linkedin.com/company/1148792?trk=tyah>
e - I do not see a
>> simple reduceByKey replacement.
>>
>> Regards,
>>
>> Bryan Jeffrey
>>
>>
>
--
wrote:
> That kind of stuff is likely fixed in 2.0. If you can get a reproduction
> working there it would be very helpful if you could open a JIRA.
>
> On Mon, Jun 6, 2016 at 7:37 AM, Richard Marscher <rmarsc...@localytics.com
> > wrote:
>
>> A quick unit test
ault.
>
> That said, I would like to enable that kind of sugar while still taking
> advantage of all the optimizations going on under the covers. Can you get
> it to work if you use `as[...]` instead of `map`?
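A hedged sketch of that suggestion (the `df`, column, and case-class names are made up): project the columns you need and convert with `as[...]`, which keeps Catalyst's typed, optimized plan instead of an opaque `map`:

```scala
import spark.implicits._

case class Out(id: Long, name: String)  // hypothetical target type

// df.map(r => Out(r.getLong(0), r.getString(1)))     // opaque to the optimizer
val typed = df.select($"id", $"name").as[Out]         // stays fully optimized
```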
>
> On Wed, Jun 1, 2016 at 11:59 AM, Richard Marscher &l
1, 2016 at 1:42 PM, Michael Armbrust <mich...@databricks.com>
wrote:
> Thanks for the feedback. I think this will address at least some of the
> problems you are describing: https://github.com/apache/spark/pull/13425
>
> On Wed, Jun 1, 2016 at 9:58 AM, Richard Marscher <rma
to -1 instead of null. Now it's
completely ambiguous what data in the join was actually there versus
populated via this atypical semantic.
Are there additional options available to work around this issue? I can
convert to RDD and back to Dataset but that's less than ideal.
Thanks,
--
was able to trace down the leak to a couple thread
pools that were not shut down properly by noticing the named threads
accumulating in thread dumps of the JVM process.
On Mon, Dec 7, 2015 at 6:41 PM, Richard Marscher <rmarsc...@localytics.com>
wrote:
> Thanks for the response.
>
&
s if anyone has seen a similar issue?
--
read names here?
>
> Best Regards,
> Shixiong Zhu
>
> 2015-12-07 14:30 GMT-08:00 Richard Marscher <rmarsc...@localytics.com>:
>
>> Hi,
>>
>> I've been running benchmarks against Spark in local mode in a long
>> running process. I'm seeing threads leakin
pause for such a long time.
Has anyone else seen similar behavior or aware of some quirk of local mode
that could cause this kind of blocking?
--
I should add that the pauses are not from GC and also in tracing the CPU
call tree in the JVM it seems like nothing is doing any work, just seems to
be idling or blocking.
On Thu, Dec 3, 2015 at 11:24 AM, Richard Marscher <rmarsc...@localytics.com>
wrote:
> Hi,
>
> I'm doi
<alitedu1...@gmail.com>
wrote:
> You can try to run "jstack" a couple of times while the app is hung to
> look for patterns for where the app is hung.
> --
> Ali
>
>
> On Dec 3, 2015, at 8:27 AM, Richard Marscher <rmarsc...@localytics.com>
> wrote:
>
> I
_
>
> *Mike Wright*
> Principal Architect, Software Engineering
>
> SNL Financial LC
> 434-951-7816 *p*
> 434-244-4466 *f*
> 540-470-0119 *m*
>
> mwri...@snl.com
>
> On Tue, Sep 8, 2015 at 5:38 PM, Richard Marscher <rmarsc...@localytics.com
> > wrote:
>
of memory in
the cluster. I think that's a fundamental problem up front. If it's skewed
then that will be even worse for doing aggregation. I think to start the
data either needs to be broken down or the cluster upgraded unfortunately.
On Wed, Sep 9, 2015 at 5:41 PM, Richard Marscher <rmarsc...@localyti
Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
--
e (perhaps after the
> fact, when temporary files have been cleaned up).
>
> Has anyone run into something like this before? I would be happy to see
> OOM errors, because that would be consistent with one understanding of what
> might be going on, but I haven't yet.
>
> Eric
>
pache-spark-user-list.1001560.n3.nabble.com/Why-is-huge-data-shuffling-in-Spark-when-using-union-coalesce-1-false-on-DataFrame-tp24581.html
>
http://apache-spark-user-list.1001560.n3.nabble.com/New-to-Spark-Paritioning-Question-tp24580.html
>
--
!
--
this the only option or can I pass
something in the above method to have more partitions of the RDD.
Please let me know.
Thanks.
--
small to fit)?
On Tue, Aug 4, 2015 at 8:23 AM, Richard Marscher
rmarsc...@localytics.com wrote:
Yes it does, in fact it's probably going to be one of the more expensive
shuffles you could trigger.
On Mon, Aug 3, 2015 at 12:56 PM, Meihua Wu rotationsymmetr...@gmail.com
wrote:
Does
with
multiple workers per machine? Or is there something I am doing wrong?
Thank you in advance for any pointers you can provide.
-sujit
--
? Is it a
good idea to create user threads in spark map task?
Thanks
--
.
--
.
Is there any guide or link which I can refer.
Thank you very much.
-Naveen
--
?
A more general question is, how can I prevent Spark from serializing the
parent class where the RDD is defined, while still supporting passing in
functions defined in other classes?
--
Chen Song
--
.
--
wrote:
My singletons do in fact stick around. They're one per worker, looks
like. So with 4 workers running on the box, we're creating one singleton
per worker process/jvm, which seems OK.
Still curious about foreachPartition vs. foreachRDD though...
On Tue, Jul 7, 2015 at 11:27 AM, Richard
Would it be possible to have a wrapper class that just represents a
reference to a singleton holding the 3rd party object? It could proxy over
calls to the singleton object which will instantiate a private instance of
the 3rd party object lazily? I think something like this might work if the
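A minimal sketch of that wrapper idea (`ThirdPartyClient` is a stand-in for the non-serializable 3rd party object, and `rdd` is assumed): the holder object never ships an instance from the driver; each executor JVM creates its own on first use:

```scala
class ThirdPartyClient { def send(msg: String): Unit = () }  // stand-in

object ClientHolder {
  // Objects are per-JVM singletons and are never serialized from the
  // driver; the lazy val instantiates once per executor on first access.
  lazy val client: ThirdPartyClient = new ThirdPartyClient
}

rdd.foreachPartition { records =>
  records.foreach(r => ClientHolder.client.send(r.toString))
}
```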
This should work
val output: RDD[(DetailInputRecord, VISummary)] =
sc.parallelize(Seq.empty[(DetailInputRecord, VISummary)])
On Mon, Jul 6, 2015 at 5:11 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:
I need to return an empty RDD of type
val output: RDD[(DetailInputRecord, VISummary)]
This
, 2015 at 4:37 AM, Sjoerd Mulder sjoerdmul...@gmail.com
wrote:
Hi Richard,
I have actually applied the following fix to our 1.4.0 version and this
seem to resolve the zombies :)
https://github.com/apache/spark/pull/7077/files
Sjoerd
2015-06-26 20:08 GMT+02:00 Richard Marscher rmarsc
Hi,
not sure what the context is but I think you can do something similar with
mapPartitions:
rdd.mapPartitions { iterator =>
  iterator.grouped(5).map { tupleGroup => emitOneRddForGroup(tupleGroup) }
}
The edge case is when the final grouping doesn't have exactly 5 items, if
that matters.
On
We've seen this issue as well in production. We also aren't sure what
causes it, but have just recently shaded some of the Spark code in
TaskSchedulerImpl that we use to effectively bubble up an exception from
Spark instead of zombie in this situation. If you are interested I can go
into more
since we're planning for production :)
If you could share the workaround or point directions that would be great!
Sjoerd
2015-06-26 16:53 GMT+02:00 Richard Marscher rmarsc...@localytics.com:
We've seen this issue as well in production. We also aren't sure what
causes it, but have just recently
Hi,
can you detail the symptom further? Was it that only 12 requests were
serviced and the other 440 timed out? I don't think that Spark is well
suited for this kind of workload, or at least the way it is being
represented. How long does a single request take Spark to complete?
Even with fair
There should be no difference assuming you don't use the intermediately
stored rdd values you are creating for anything else (rdd1, rdd2). In the
first example it still is creating these intermediate rdd objects you are
just using them implicitly and not storing the value.
It's also worth
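To make the equivalence concrete, a sketch of the two forms (illustrative only, `sc` assumed): both build the same lineage, the named vals just hold handles to lazy RDDs:

```scala
// Form 1: chained, intermediates implicit
val result1 = sc.parallelize(1 to 10).map(_ + 1).filter(_ % 2 == 0)

// Form 2: intermediates named; no extra work happens, RDDs are lazy
val rdd1 = sc.parallelize(1 to 10)
val rdd2 = rdd1.map(_ + 1)
val result2 = rdd2.filter(_ % 2 == 0)   // same lineage as result1
```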
Is spoutLog just a non-Spark file writer? If you run that in the map call
on a cluster it's going to be writing in the filesystem of the executor it's
being run on. I'm not sure if that's what you intended.
On Mon, Jun 22, 2015 at 1:35 PM, anshu shukla anshushuk...@gmail.com
wrote:
Running
It would be the 40%, although it's probably better to think of it as
shuffle vs. data cache and the remainder goes to tasks. As the comments for
the shuffle memory fraction configuration clarify, it will take memory at
the expense of the storage/data cache fraction:
Hi,
if you now want to write 1 file per partition, that's actually built into
Spark as
*saveAsTextFile*(*path*): Write the elements of the dataset as a text file
(or set of text files) in a given directory in the local filesystem, HDFS
or any other Hadoop-supported file system. Spark will call
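So one file per partition falls out directly; a short sketch (`sc` assumed, path hypothetical):

```scala
// 4 partitions => part-00000 .. part-00003 under the output directory
sc.parallelize(1 to 100, numSlices = 4).saveAsTextFile("/tmp/out-demo")
```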
Hi,
I don't have a complete answer to your questions but:
Removing the suffix does not solve the problem - unfortunately this is
true, the master web UI only tries to build out a Spark UI from the event
logs once, at the time the context is closed. If the event logs are
in-progress at this time,
Hi,
we've been seeing occasional issues in production with the FileOutputCommitter
reaching a deadlock situation.
We are writing our data to S3 and currently have speculation enabled. What
we see is that Spark gets a file-not-found error trying to access a
temporary part file that it wrote
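For what it's worth, the usual mitigation to consider (a hedged sketch, not a fix for the committer race itself) is disabling speculation for jobs that write to S3, so duplicate task attempts can't race on the temp files:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("s3-writer")
  .set("spark.speculation", "false")  // no speculative duplicate attempts
```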
I think if you create a bidirectional mapping from AnalyticsEvent to
another type that would wrap it and use the nonce as its equality, you
could then do something like reduceByKey to group by nonce and map back to
AnalyticsEvent after.
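A sketch of that approach (`AnalyticsEvent` and `nonce` as described in the thread; the field names and `events` RDD are assumed): key by the nonce, reduce, unwrap:

```scala
case class AnalyticsEvent(nonce: String, payload: String)  // assumed shape

val deduped = events
  .map(e => (e.nonce, e))        // key by nonce
  .reduceByKey((a, _) => a)      // keep one event per nonce
  .map { case (_, e) => e }      // back to AnalyticsEvent
```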
On Thu, Jun 4, 2015 at 1:10 PM, lbierman
It is possible to start multiple concurrent drivers, Spark dynamically
allocates ports per spark application on driver, master, and workers from
a port range. When you collect results back to the driver, they do not go
through the master. The master is mostly there as a coordinator between the
I had the same issue a couple days ago. It's a bug in 1.3.0 that is fixed
in 1.3.1 and up.
https://issues.apache.org/jira/browse/SPARK-6036
The ordering that the event logs are moved from in-progress to complete is
coded to be after the Master tries to build the history page for the logs.
The
I think the short answer to the question is, no, there is no alternate API
that will not use the System.exit calls. You can craft a workaround like is
being suggested in this thread. For comparison, we are doing programmatic
submission of applications in a long-running client application. To get
Are you sure it's memory related? What is the disk utilization and IO
performance on the workers? The error you posted looks to be related to
shuffle trying to obtain block data from another worker node and failing to
do so in a reasonable amount of time. It may still be memory related, but I'm
not
Hi,
In Spark 1.3.0 I've enabled event logging to write to an existing HDFS
folder on a Standalone cluster. This is generally working, all the logs are
being written. However, from the Master Web UI, the vast majority of
completed applications are labeled as not having a history:
Ah, apologies, I found an existing issue and fix has already gone out for
this in 1.3.1 and up: https://issues.apache.org/jira/browse/SPARK-6036.
On Mon, Jun 1, 2015 at 3:39 PM, Richard Marscher rmarsc...@localytics.com
wrote:
It looks like it is possibly a race condition between removing
the dagScheduler?
Thanks,
Richard
On Mon, Jun 1, 2015 at 12:23 PM, Richard Marscher rmarsc...@localytics.com
wrote:
Hi,
In Spark 1.3.0 I've enabled event logging to write to an existing HDFS
folder on a Standalone cluster. This is generally working, all the logs are
being written. However, from
The doc is a bit confusing IMO, but at least for my application I had to
use a fair pool configuration to get my stages to be scheduled with FAIR.
On Fri, May 15, 2015 at 2:13 PM, Evo Eftimov evo.efti...@isecc.com wrote:
No pools for the moment – for each of the apps using the straightforward
for confirmation that FAIR scheduling is
supported for Spark Streaming Apps
*From:* Richard Marscher [mailto:rmarsc...@localytics.com]
*Sent:* Friday, May 15, 2015 7:20 PM
*To:* Evo Eftimov
*Cc:* Tathagata Das; user
*Subject:* Re: Spark Fair Scheduler for Spark Streaming - 1.2 and beyond
I believe the issue in b and c is that you call iter.size, which actually is
going to exhaust the iterator, so the subsequent attempt to put it into a
vector will yield 0 items. You could use an ArrayBuilder for example and
not need to rely on knowing the size of the iterator.
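To illustrate (plain Scala): `size` consumes the iterator, while a builder copies elements as they stream past:

```scala
val it = Iterator(1, 2, 3)
it.size            // returns 3, but `it` is now exhausted
it.toVector        // Vector() -- the elements are gone

// Builder approach: no need to know the size up front
val b = Array.newBuilder[Int]
Iterator(1, 2, 3).foreach(b += _)
val arr = b.result()   // Array(1, 2, 3)
```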
On Mon, May 11, 2015 at
By default you would expect to find the logs files for master and workers
in the relative `logs` directory from the root of the Spark installation on
each of the respective nodes in the cluster.
On Thu, May 7, 2015 at 10:27 AM, Wang, Ningjun (LNG-NPV)
ningjun.w...@lexisnexis.com wrote:
Ø Can
Hi,
I think you may want to use this setting?:
spark.task.maxFailures (default: 4): Number of individual task failures
before giving up on the job. Should be greater than or equal to 1. Number of
allowed retries = this value - 1.
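Set programmatically it would look like this (sketch; the value is illustrative):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("retries")
  .set("spark.task.maxFailures", "8")  // up to 7 retries per task
```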
On Thu, May 7, 2015 at 2:34 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:
How
I should also add I've recently seen this issue as well when using collect.
I believe in my case it was related to heap space on the driver program not
being able to handle the returned collection.
On Thu, May 7, 2015 at 11:05 AM, Richard Marscher rmarsc...@localytics.com
wrote:
By default you
Hi,
do you have information on how many partitions/tasks the stage/job is
running? By default there is 1 core per task, and your number of concurrent
tasks may be limiting core utilization.
There are a few settings you could play with, assuming your issue is
related to the above:
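A hedged sketch of the kind of settings meant here (names are standard Spark settings, values are illustrative; tune for your cluster):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.cores.max", "16")            // total cores the app may claim
  .set("spark.default.parallelism", "64")  // default task count for shuffles
```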
In regards to the large GC pauses, assuming you allocated all 100GB of
memory per worker you may consider running with less memory on your Worker
nodes, or splitting up the available memory on the Worker nodes amongst
several worker instances. The JVM's garbage collection starts to become
very
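Concretely, splitting workers in standalone mode is done in `spark-env.sh`; a config sketch (sizes are illustrative):

```shell
# spark-env.sh: four 25g workers instead of one 100g worker,
# keeping each JVM heap small enough for reasonable GC pauses
SPARK_WORKER_INSTANCES=4
SPARK_WORKER_MEMORY=25g
```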
I'm not sure, but I wonder if because you are using the Spark REPL that it
may not be representing what a normal runtime execution would look like and
is possibly eagerly running a partial DAG once you define an operation that
would cause a shuffle.
What happens if you setup your same set of
Hi,
I can offer a few ideas to investigate in regards to your issue here. I've
run into resource issues doing shuffle operations with a much smaller
dataset than 2B. The data is going to be saved to disk by the BlockManager
as part of the shuffle and then redistributed across the cluster as
wrote:
Hi Richard,
There are several values of time per id. Is there a way to perform group
by id and sort by time in Spark?
Best regards, Alexander
*From:* Richard Marscher [mailto:rmarsc...@localytics.com]
*Sent:* Monday, April 27, 2015 12:20 PM
*To:* Ulanov, Alexander
*Cc:* user
Hi,
that error seems to indicate the basic query is not properly expressed. If
you group by just ID, then that means it would need to aggregate all the
time values into one value per ID, so you can't sort by it. Thus it tries
to suggest an aggregate function for time so you can have 1 value per
Could you possibly describe what you are trying to learn how to do in more
detail? Some basics of submitting programmatically:
- Create a SparkContext instance and use that to build your RDDs
- You can only have 1 SparkContext per JVM you are running, so if you need
to satisfy concurrent job
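A minimal programmatic sketch of those basics (the master URL is a placeholder):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// One SparkContext per JVM; share it across concurrent jobs
val sc = new SparkContext(
  new SparkConf().setMaster("spark://master:7077").setAppName("programmatic"))
val total = sc.parallelize(1 to 100).sum()  // 5050.0
sc.stop()
```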
If it fails with sbt-assembly but not without it, then there's always the
likelihood of a classpath issue. What dependencies are you rolling up into
your assembly jar?
On Thu, Apr 16, 2015 at 4:46 PM, Jaonary Rabarisoa jaon...@gmail.com
wrote:
Any ideas ?
On Thu, Apr 16, 2015 at 5:04 PM,
Hi,
I've gotten an application working with sbt-assembly and spark, thought I'd
present an option. In my experience, trying to bundle any of the Spark
libraries in your uber jar is going to be a major pain. There will be a lot
of deduplication to work through and even if you resolve them it can
65 matches