Option Encoder

2016-06-23 Thread Richard Marscher
Is there a proper way to make or get an Encoder for Option in Spark 2.0? There isn't one by default and while ExpressionEncoder from catalyst will work, it is private and unsupported.
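
One possible escape hatch, sketched below under the assumption that Catalyst's columnar optimizations can be sacrificed (this is not from the thread): fall back to a Kryo-based encoder, which accepts arbitrary serializable types including Option.

    import org.apache.spark.sql.{Encoder, Encoders, SparkSession}

    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("option-encoder")  // illustrative app name
      .getOrCreate()

    // Kryo encoder: stores values as opaque binary, so no columnar
    // optimizations, but it works where no default encoder exists.
    implicit val optionIntEncoder: Encoder[Option[Int]] = Encoders.kryo[Option[Int]]

    val ds = spark.createDataset(Seq(Some(1), None, Some(3)))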

Re: Dataset - reduceByKey

2016-06-07 Thread Richard Marscher
e - I do not see a >> simple reduceByKey replacement. >> >> Regards, >> >> Bryan Jeffrey
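
For context, a sketch of the closest Dataset-side substitute, assuming Spark 2.0's KeyValueGroupedDataset API (Record is a hypothetical schema):

    import org.apache.spark.sql.SparkSession

    case class Record(key: String, value: Long)  // hypothetical schema

    val spark = SparkSession.builder().master("local[*]").appName("ds-reduce").getOrCreate()
    import spark.implicits._

    val ds = Seq(Record("a", 1L), Record("a", 2L), Record("b", 3L)).toDS()

    // groupByKey + reduceGroups fills the role RDD.reduceByKey used to play.
    val reduced = ds
      .groupByKey(_.key)
      .reduceGroups((a, b) => Record(a.key, a.value + b.value))
      .map(_._2)  // drop the key, keeping the reduced Record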

Re: Dataset Outer Join vs RDD Outer Join

2016-06-07 Thread Richard Marscher
wrote: > That kind of stuff is likely fixed in 2.0. If you can get a reproduction > working there it would be very helpful if you could open a JIRA. > > On Mon, Jun 6, 2016 at 7:37 AM, Richard Marscher <rmarsc...@localytics.com > > wrote: > >> A quick unit test

Re: Dataset Outer Join vs RDD Outer Join

2016-06-06 Thread Richard Marscher
ault. > > That said, I would like to enable that kind of sugar while still taking > advantage of all the optimizations going on under the covers. Can you get > it to work if you use `as[...]` instead of `map`? > > On Wed, Jun 1, 2016 at 11:59 AM, Richard Marscher &l

Re: Dataset Outer Join vs RDD Outer Join

2016-06-01 Thread Richard Marscher
1, 2016 at 1:42 PM, Michael Armbrust <mich...@databricks.com> wrote: > Thanks for the feedback. I think this will address at least some of the > problems you are describing: https://github.com/apache/spark/pull/13425 > > On Wed, Jun 1, 2016 at 9:58 AM, Richard Marscher <rma

Dataset Outer Join vs RDD Outer Join

2016-06-01 Thread Richard Marscher
to -1 instead of null. Now it's completely ambiguous what data in the join was actually there versus populated via this atypical semantic. Are there additional options available to work around this issue? I can convert to RDD and back to Dataset but that's less than ideal. Thanks,
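
For comparison, the RDD route mentioned above keeps absence explicit as None rather than a sentinel; a minimal sketch with illustrative types:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.rdd.RDD

    val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("outer-join"))

    val left = sc.parallelize(Seq((1, "a"), (2, "b")))
    val right = sc.parallelize(Seq((2, 20), (3, 30)))

    // fullOuterJoin yields Options, so a missing side is None rather than
    // an ambiguous default like -1.
    val joined: RDD[(Int, (Option[String], Option[Int]))] = left.fullOuterJoin(right)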

Re: Local Mode: Executor thread leak?

2015-12-08 Thread Richard Marscher
was able to trace down the leak to a couple thread pools that were not shut down properly by noticing the named threads accumulating in thread dumps of the JVM process. On Mon, Dec 7, 2015 at 6:41 PM, Richard Marscher <rmarsc...@localytics.com> wrote: > Thanks for the response. > &

Local Mode: Executor thread leak?

2015-12-07 Thread Richard Marscher
s if anyone has seen a similar issue?

Re: Local Mode: Executor thread leak?

2015-12-07 Thread Richard Marscher
read names here? > > Best Regards, > Shixiong Zhu > > 2015-12-07 14:30 GMT-08:00 Richard Marscher <rmarsc...@localytics.com>: > >> Hi, >> >> I've been running benchmarks against Spark in local mode in a long >> running process. I'm seeing threads leakin

Local mode: Stages hang for minutes

2015-12-03 Thread Richard Marscher
pause for such a long time. Has anyone else seen similar behavior or aware of some quirk of local mode that could cause this kind of blocking?

Re: Local mode: Stages hang for minutes

2015-12-03 Thread Richard Marscher
I should add that the pauses are not from GC and also in tracing the CPU call tree in the JVM it seems like nothing is doing any work, just seems to be idling or blocking. On Thu, Dec 3, 2015 at 11:24 AM, Richard Marscher <rmarsc...@localytics.com> wrote: > Hi, > > I'm doi

Re: Local mode: Stages hang for minutes

2015-12-03 Thread Richard Marscher
<alitedu1...@gmail.com> wrote: > You can try to run "jstack" a couple of times while the app is hung to > look for patterns for where the app is hung. > -- > Ali > > > On Dec 3, 2015, at 8:27 AM, Richard Marscher <rmarsc...@localytics.com> > wrote: > > I

Re: New to Spark - Paritioning Question

2015-09-09 Thread Richard Marscher

Re: What should be the optimal value for spark.sql.shuffle.partition?

2015-09-09 Thread Richard Marscher
of memory in the cluster. I think that's a fundamental problem up front. If it's skewed then that will be even worse for doing aggregation. I think to start the data either needs to be broken down or the cluster upgraded unfortunately. On Wed, Sep 9, 2015 at 5:41 PM, Richard Marscher <rmarsc...@localyti

Re: What should be the optimal value for spark.sql.shuffle.partition?

2015-09-09 Thread Richard Marscher

Re: spark.shuffle.spill=false ignored?

2015-09-09 Thread Richard Marscher
e (perhaps after the > fact, when temporary files have been cleaned up). > > Has anyone run into something like this before? I would be happy to see > OOM errors, because that would be consistent with one understanding of what > might be going on, but I haven't yet. > > Eric >

Re: Why is huge data shuffling in Spark when using union()/coalesce(1,false) on DataFrame?

2015-09-08 Thread Richard Marscher

Re: New to Spark - Paritioning Question

2015-09-08 Thread Richard Marscher

Re: Removing empty partitions before we write to HDFS

2015-08-06 Thread Richard Marscher

Re: Does RDD.cartesian involve shuffling?

2015-08-04 Thread Richard Marscher

Re: Repartition question

2015-08-04 Thread Richard Marscher
this the only option or can I pass something in the above method to have more partitions of the RDD. Please let me know. Thanks.

Re: Does RDD.cartesian involve shuffling?

2015-08-04 Thread Richard Marscher
small to fit)? On Tue, Aug 4, 2015 at 8:23 AM, Richard Marscher rmarsc...@localytics.com wrote: Yes it does, in fact it's probably going to be one of the more expensive shuffles you could trigger. On Mon, Aug 3, 2015 at 12:56 PM, Meihua Wu rotationsymmetr...@gmail.com wrote: Does

Re: How to increase parallelism of a Spark cluster?

2015-08-04 Thread Richard Marscher
with multiple workers per machine? Or is there something I am doing wrong? Thank you in advance for any pointers you can provide. -sujit

Re: user threads in executors

2015-07-21 Thread Richard Marscher
? Is it a good idea to create user threads in spark map task? Thanks

Re: Apache Spark : Custom function for reduceByKey - missing arguments for method

2015-07-10 Thread Richard Marscher

Re: Unit tests of spark application

2015-07-10 Thread Richard Marscher
. Is there any guide or link which I can refer. Thank you very much. -Naveen

Re: Spark serialization in closure

2015-07-09 Thread Richard Marscher
? A more general question is, how can I prevent Spark from serializing the parent class where the RDD is defined, while still supporting passing in functions defined in other classes? -- Chen Song

Re: Create RDD from output of unix command

2015-07-08 Thread Richard Marscher

Re: Best practice for using singletons on workers (seems unanswered) ?

2015-07-08 Thread Richard Marscher
wrote: My singletons do in fact stick around. They're one per worker, looks like. So with 4 workers running on the box, we're creating one singleton per worker process/jvm, which seems OK. Still curious about foreachPartition vs. foreachRDD though... On Tue, Jul 7, 2015 at 11:27 AM, Richard

Re: Best practice for using singletons on workers (seems unanswered) ?

2015-07-07 Thread Richard Marscher
Would it be possible to have a wrapper class that just represents a reference to a singleton holding the 3rd party object? It could proxy over calls to the singleton object which will instantiate a private instance of the 3rd party object lazily? I think something like this might work if the
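
A minimal sketch of that wrapper idea (ThirdPartyClient and its process method are hypothetical stand-ins for the 3rd party object):

    // Hypothetical, non-serializable 3rd party class.
    class ThirdPartyClient {
      def process(s: String): String = s.toUpperCase
    }

    // Scala objects are never shipped in closures; they are re-created as
    // singletons in each executor JVM, and the lazy val defers construction
    // of the client to first use there.
    object ClientHolder {
      lazy val client: ThirdPartyClient = new ThirdPartyClient()
    }

    // Usage inside a task:
    // rdd.mapPartitions(iter => iter.map(ClientHolder.client.process))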

Re: How to create empty RDD

2015-07-06 Thread Richard Marscher
This should work: val output: RDD[(DetailInputRecord, VISummary)] = sc.parallelize(Seq.empty[(DetailInputRecord, VISummary)]) On Mon, Jul 6, 2015 at 5:11 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: I need to return an empty RDD of type val output: RDD[(DetailInputRecord, VISummary)] This

Re: Spark driver hangs on start of job

2015-07-02 Thread Richard Marscher
, 2015 at 4:37 AM, Sjoerd Mulder sjoerdmul...@gmail.com wrote: Hi Richard, I have actually applied the following fix to our 1.4.0 version and this seem to resolve the zombies :) https://github.com/apache/spark/pull/7077/files Sjoerd 2015-06-26 20:08 GMT+02:00 Richard Marscher rmarsc

Re: Applying functions over certain count of tuples .

2015-06-29 Thread Richard Marscher
Hi, not sure what the context is but I think you can do something similar with mapPartitions: rdd.mapPartitions { iterator => iterator.grouped(5).map { tupleGroup => emitOneRddForGroup(tupleGroup) } } The edge case is when the final grouping doesn't have exactly 5 items, if that matters. On
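
Expanded into a runnable form (group size and element type are illustrative):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("grouped"))
    val rdd = sc.parallelize(1 to 23, numSlices = 2)

    // grouped(5) hands out groups of up to 5 elements; the last group in
    // each partition may be smaller, which is the edge case noted above.
    val groupSums = rdd.mapPartitions(iter => iter.grouped(5).map(_.sum))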

Re: Spark driver hangs on start of job

2015-06-26 Thread Richard Marscher
We've seen this issue as well in production. We also aren't sure what causes it, but have just recently shaded some of the Spark code in TaskSchedulerImpl that we use to effectively bubble up an exception from Spark instead of zombie in this situation. If you are interested I can go into more

Re: Spark driver hangs on start of job

2015-06-26 Thread Richard Marscher
since were planning for production :) If you could share the workaround or point directions that would be great! Sjoerd 2015-06-26 16:53 GMT+02:00 Richard Marscher rmarsc...@localytics.com: We've seen this issue as well in production. We also aren't sure what causes it, but have just recently

Re: Limitations using SparkContext

2015-06-23 Thread Richard Marscher
Hi, can you detail the symptom further? Was it that only 12 requests were services and the other 440 timed out? I don't think that Spark is well suited for this kind of workload, or at least the way it is being represented. How long does a single request take Spark to complete? Even with fair

Re: How Spark Execute chaining vs no chaining statements

2015-06-23 Thread Richard Marscher
There should be no difference assuming you don't use the intermediately stored rdd values you are creating for anything else (rdd1, rdd2). In the first example it still is creating these intermediate rdd objects you are just using them implicitly and not storing the value. It's also worth
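
A side-by-side sketch (assuming an existing SparkContext sc and an illustrative input path):

    // Chained: the intermediate RDDs still exist, just unnamed.
    val total1 = sc.textFile("input.txt").map(_.length).filter(_ > 0).count()

    // Unchained: the vals merely name the same lazily-built lineage;
    // nothing runs until count() is called.
    val rdd1 = sc.textFile("input.txt")
    val rdd2 = rdd1.map(_.length)
    val total2 = rdd2.filter(_ > 0).count()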

Re: Multiple executors writing file using java filewriter

2015-06-22 Thread Richard Marscher
Is spoutLog just a non-spark file writer? If you run that in the map call on a cluster its going to be writing in the filesystem of the executor its being run on. I'm not sure if that's what you intended. On Mon, Jun 22, 2015 at 1:35 PM, anshu shukla anshushuk...@gmail.com wrote: Running

Re: Executor memory allocations

2015-06-18 Thread Richard Marscher
It would be the 40%, although it's probably better to think of it as shuffle vs. data cache, with the remainder going to tasks. The comments for the shuffle memory fraction configuration clarify that it takes memory at the expense of the storage/data cache fraction:
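
For reference, a sketch of the 1.x-era knobs being discussed (the values shown are the documented defaults; unified memory management replaced these in 1.6):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.storage.memoryFraction", "0.6")  // data cache share (default 0.6)
      .set("spark.shuffle.memoryFraction", "0.2")  // shuffle share (default 0.2)
    // Whatever remains of the executor heap is what tasks themselves get.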

Re: append file on hdfs

2015-06-10 Thread Richard Marscher
Hi, if you now want to write 1 file per partition, that's actually built into Spark as *saveAsTextFile*(*path*): "Write the elements of the dataset as a text file (or set of text files) in a given directory in the local filesystem, HDFS or any other Hadoop-supported file system." Spark will call
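
A sketch of that behavior (path illustrative; rdd assumed to exist):

    // Each partition becomes one part-NNNNN file under the output directory;
    // coalesce first to control how many files come out.
    rdd.coalesce(4).saveAsTextFile("hdfs:///tmp/output")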

Re: spark eventLog and history server

2015-06-09 Thread Richard Marscher
Hi, I don't have a complete answer to your questions but: Removing the suffix does not solve the problem - unfortunately this is true, the master web UI only tries to build out a Spark UI from the event logs once, at the time the context is closed. If the event logs are in-progress at this time,

FileOutputCommitter deadlock 1.3.1

2015-06-08 Thread Richard Marscher
Hi, we've been seeing occasional issues in production with the FileOutputCommitter reaching a deadlock situation. We are writing our data to S3 and currently have speculation enabled. What we see is that Spark gets a file not found error trying to access a temporary part file that it wrote

Re: Deduping events using Spark

2015-06-04 Thread Richard Marscher
I think if you create a bidirectional mapping from AnalyticsEvent to another type that would wrap it and use the nonce as its equality, you could then do something like reduceByKey to group by nonce and map back to AnalyticsEvent after. On Thu, Jun 4, 2015 at 1:10 PM, lbierman
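
A minimal sketch of that nonce-keyed dedupe (AnalyticsEvent's fields are hypothetical; events is assumed to be an RDD[AnalyticsEvent]):

    case class AnalyticsEvent(nonce: String, payload: String)  // hypothetical shape

    // Key by nonce, keep one arbitrary representative per key, then unwrap.
    val deduped = events
      .map(e => (e.nonce, e))
      .reduceByKey((first, _) => first)
      .values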

Re: Scaling spark jobs returning large amount of data

2015-06-04 Thread Richard Marscher
It is possible to start multiple concurrent drivers, Spark dynamically allocates ports per spark application on driver, master, and workers from a port range. When you collect results back to the driver, they do not go through the master. The master is mostly there as a coordinator between the

Re: Application is always in process when I check out logs of completed application

2015-06-03 Thread Richard Marscher
I had the same issue a couple days ago. It's a bug in 1.3.0 that is fixed in 1.3.1 and up: https://issues.apache.org/jira/browse/SPARK-6036. The event logs are coded to be moved from in-progress to complete only after the Master tries to build the history page for the logs. The

Re: Spark Client

2015-06-03 Thread Richard Marscher
I think the short answer to the question is, no, there is no alternate API that will not use the System.exit calls. You can craft a workaround like is being suggested in this thread. For comparison, we are doing programmatic submission of applications in a long-running client application. To get

Re: flatMap output on disk / flatMap memory overhead

2015-06-02 Thread Richard Marscher
Are you sure it's memory related? What is the disk utilization and IO performance on the workers? The error you posted looks to be related to shuffle trying to obtain block data from another worker node and failing to do so in reasonable amount of time. It may still be memory related, but I'm not

Event Logging to HDFS on Standalone Cluster In Progress

2015-06-01 Thread Richard Marscher
Hi, In Spark 1.3.0 I've enabled event logging to write to an existing HDFS folder on a Standalone cluster. This is generally working, all the logs are being written. However, from the Master Web UI, the vast majority of completed applications are labeled as not having a history:

Re: Event Logging to HDFS on Standalone Cluster In Progress

2015-06-01 Thread Richard Marscher
Ah, apologies, I found an existing issue and fix has already gone out for this in 1.3.1 and up: https://issues.apache.org/jira/browse/SPARK-6036. On Mon, Jun 1, 2015 at 3:39 PM, Richard Marscher rmarsc...@localytics.com wrote: It looks like it is possibly a race condition between removing

Re: Event Logging to HDFS on Standalone Cluster In Progress

2015-06-01 Thread Richard Marscher
the dagScheduler? Thanks, Richard On Mon, Jun 1, 2015 at 12:23 PM, Richard Marscher rmarsc...@localytics.com wrote: Hi, In Spark 1.3.0 I've enabled event logging to write to an existing HDFS folder on a Standalone cluster. This is generally working, all the logs are being written. However, from

Re: Spark Fair Scheduler for Spark Streaming - 1.2 and beyond

2015-05-15 Thread Richard Marscher
The doc is a bit confusing IMO, but at least for my application I had to use a fair pool configuration to get my stages to be scheduled with FAIR. On Fri, May 15, 2015 at 2:13 PM, Evo Eftimov evo.efti...@isecc.com wrote: No pools for the moment – for each of the apps using the straightforward
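
A sketch of that wiring (file path and pool name are illustrative; sc is assumed to exist):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.scheduler.mode", "FAIR")
      .set("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")

    // fairscheduler.xml defines the pools, roughly:
    //   <allocations>
    //     <pool name="streaming">
    //       <schedulingMode>FAIR</schedulingMode>
    //       <weight>1</weight>
    //       <minShare>2</minShare>
    //     </pool>
    //   </allocations>

    // Jobs submitted from the current thread then target the pool via:
    sc.setLocalProperty("spark.scheduler.pool", "streaming")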

Re: Spark Fair Scheduler for Spark Streaming - 1.2 and beyond

2015-05-15 Thread Richard Marscher
for confirmation that FAIR scheduling is supported for Spark Streaming Apps

Re: Looking inside the 'mapPartitions' transformation, some confused observations

2015-05-11 Thread Richard Marscher
I believe the issue in b and c is that you call iter.size which actually is going to flush the iterator so the subsequent attempt to put it into a vector will yield 0 items. You could use an ArrayBuilder for example and not need to rely on knowing the size of the iterator. On Mon, May 11, 2015 at
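
A sketch of the ArrayBuilder alternative (element type illustrative; rdd is assumed to be an RDD[Int]):

    import scala.collection.mutable.ArrayBuilder

    val arrays = rdd.mapPartitions { iter =>
      // Do NOT call iter.size first: that drains the iterator, leaving
      // nothing for the subsequent copy.
      val builder = ArrayBuilder.make[Int]()
      iter.foreach(builder += _)
      Iterator.single(builder.result())
    }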

Re: java.io.IOException: org.apache.spark.SparkException: Failed to get broadcast_2_piece0

2015-05-07 Thread Richard Marscher
By default you would expect to find the log files for master and workers in the relative `logs` directory from the root of the Spark installation on each of the respective nodes in the cluster. On Thu, May 7, 2015 at 10:27 AM, Wang, Ningjun (LNG-NPV) ningjun.w...@lexisnexis.com wrote: Can

Re: Spark Job triggers second attempt

2015-05-07 Thread Richard Marscher
Hi, I think you may want to use this setting: spark.task.maxFailures (default: 4): Number of individual task failures before giving up on the job. Should be greater than or equal to 1. Number of allowed retries = this value - 1. On Thu, May 7, 2015 at 2:34 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: How

Re: java.io.IOException: org.apache.spark.SparkException: Failed to get broadcast_2_piece0

2015-05-07 Thread Richard Marscher
I should also add I've recently seen this issue as well when using collect. I believe in my case it was related to heap space on the driver program not being able to handle the returned collection. On Thu, May 7, 2015 at 11:05 AM, Richard Marscher rmarsc...@localytics.com wrote: By default you

Re: Maximum Core Utilization

2015-05-05 Thread Richard Marscher
Hi, do you have information on how many partitions/tasks the stage/job is running? By default there is 1 core per task, and your number of concurrent tasks may be limiting core utilization. There are a few settings you could play with, assuming your issue is related to the above:
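
A sketch of those settings (values illustrative):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.cores.max", "48")            // app-wide core cap (standalone mode)
      .set("spark.default.parallelism", "96")  // default task count for shuffles

    // Or widen the RDD directly so there are at least as many tasks as cores:
    // val widened = rdd.repartition(96)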

Re: Long GC pauses with Spark SQL 1.3.0 and billion row tables

2015-05-04 Thread Richard Marscher
In regards to the large GC pauses, assuming you allocated all 100GB of memory per worker you may consider running with less memory on your Worker nodes, or splitting up the available memory on the Worker nodes amongst several worker instances. The JVM's garbage collection starts to become very

Re: Extra stage that executes before triggering computation with an action

2015-04-29 Thread Richard Marscher
I'm not sure, but I wonder if because you are using the Spark REPL that it may not be representing what a normal runtime execution would look like and is possibly eagerly running a partial DAG once you define an operation that would cause a shuffle. What happens if you setup your same set of

Re: Scalability of group by

2015-04-28 Thread Richard Marscher
Hi, I can offer a few ideas to investigate in regards to your issue here. I've run into resource issues doing shuffle operations with a much smaller dataset than 2B. The data is going to be saved to disk by the BlockManager as part of the shuffle and then redistributed across the cluster as

Re: Group by order by

2015-04-27 Thread Richard Marscher
wrote: Hi Richard, There are several values of time per id. Is there a way to perform group by id and sort by time in Spark? Best regards, Alexander

Re: Group by order by

2015-04-27 Thread Richard Marscher
Hi, that error seems to indicate the basic query is not properly expressed. If you group by just ID, then that means it would need to aggregate all the time values into one value per ID, so you can't sort by it. Thus it tries to suggest an aggregate function for time so you can have 1 value per
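
Two sketches of what that can look like (table and column names, sqlContext, and eventsRdd are illustrative, 1.x era):

    // Aggregate route: one time value per id, so ordering by it is well-defined.
    sqlContext.sql(
      "SELECT id, MAX(time) AS max_time FROM events GROUP BY id ORDER BY max_time")

    // All-values route: gather every time per id and sort within each group.
    val sortedTimes = eventsRdd          // assumed RDD[(String, Long)] of (id, time)
      .groupByKey()
      .mapValues(_.toSeq.sorted)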

Re: Instantiating/starting Spark jobs programmatically

2015-04-21 Thread Richard Marscher
Could you possibly describe what you are trying to learn how to do in more detail? Some basics of submitting programmatically: - Create a SparkContext instance and use that to build your RDDs - You can only have 1 SparkContext per JVM you are running, so if you need to satisfy concurrent job
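
A minimal programmatic sketch of those basics (master URL illustrative):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setMaster("spark://master-host:7077")  // illustrative master URL
      .setAppName("programmatic-job")

    val sc = new SparkContext(conf)  // only one live SparkContext per JVM
    try {
      val result = sc.parallelize(1 to 1000).map(_ * 2).sum()
      println(result)
    } finally {
      sc.stop()
    }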

Re: Problem with Spark SQL UserDefinedType and sbt assembly

2015-04-16 Thread Richard Marscher
If it fails with sbt-assembly but not without it, then there's always the likelihood of a classpath issue. What dependencies are you rolling up into your assembly jar? On Thu, Apr 16, 2015 at 4:46 PM, Jaonary Rabarisoa jaon...@gmail.com wrote: Any ideas ? On Thu, Apr 16, 2015 at 5:04 PM,

Re: sbt-assembly spark-streaming-kinesis-asl error

2015-04-14 Thread Richard Marscher
Hi, I've gotten an application working with sbt-assembly and spark, thought I'd present an option. In my experience, trying to bundle any of the Spark libraries in your uber jar is going to be a major pain. There will be a lot of deduplication to work through and even if you resolve them it can
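
A build.sbt sketch of the usual way around that pain (versions illustrative): mark Spark itself as provided so it stays out of the uber jar, and bundle only dependencies genuinely absent from the cluster, such as the kinesis-asl connector.

    // build.sbt
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core"                  % "1.3.1" % "provided",
      "org.apache.spark" %% "spark-streaming"             % "1.3.1" % "provided",
      // kinesis-asl is not shipped with the Spark distribution, so it does
      // need to ride along in the assembly:
      "org.apache.spark" %% "spark-streaming-kinesis-asl" % "1.3.1"
    )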