Re: Spark dev-setup

2016-08-23 Thread Nishadi Kirielle
Hi, I'm engaged in learning how query execution flow occurs in Spark SQL. In order to understand the query execution flow, I'm attempting to run an example in debug mode with IntelliJ IDEA. It would be great if anyone can help me with debug configurations. Thanks & Regards Nishadi On Tue, Jun 21,

is the Lineage of RDD stored as a byte code in memory or a file?

2016-08-23 Thread kant kodali
Hi Guys, I have had this question for a very long time, and after diving into the source code (specifically from the links below) I have a feeling that the lineage of an RDD (the transformations) is converted into byte code and stored in memory or on disk. Or if I were to ask another question on a similar

Re: Anyone else having trouble with replicated off heap RDD persistence?

2016-08-23 Thread Michael Allman
I've replied on the issue's page, but in a word, "yes". See https://issues.apache.org/jira/browse/SPARK-17204 . Michael > On Aug 23, 2016, at 11:55 AM, Reynold Xin wrote: > > Does this problem still exist on today's master/branch-2.0? > >

How do we process/scale variable size batches in Apache Spark Streaming

2016-08-23 Thread Rachana Srivastava
I am running a Spark Streaming process where I am getting a batch of data after n seconds. I am using repartition to scale the application. Since the repartition size is fixed, we are getting lots of small files when the batch size is very small. Is there any way I can change the partitioner logic based
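One possible direction for the question above, offered only as a rough sketch and not as an answer taken from the thread: derive the partition count from the size of each batch instead of fixing it. The helper name `partitions_for_batch`, the target of 10,000 records per partition, and the cap of 200 partitions are all illustrative assumptions.

```python
import math


def partitions_for_batch(record_count,
                         target_records_per_partition=10000,
                         max_partitions=200):
    """Pick a partition count proportional to the batch size.

    Hypothetical helper: the defaults are placeholders to be tuned,
    not values recommended by Spark.
    """
    if record_count <= 0:
        # An empty batch still needs a valid partition count.
        return 1
    needed = math.ceil(record_count / target_records_per_partition)
    # Clamp so tiny batches use one partition and huge ones stay bounded.
    return max(1, min(needed, max_partitions))


# Plain-Python check of the sizing rule, no Spark required.
for size in (0, 500, 25000, 10 ** 9):
    print(size, partitions_for_batch(size))
```

In a streaming job this would presumably be applied per batch, for example by calling `rdd.coalesce(partitions_for_batch(rdd.count()))` inside `foreachRDD` before writing. Note that `rdd.count()` runs an extra job, so a cheaper size estimate may be preferable.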

Re: Anyone else having trouble with replicated off heap RDD persistence?

2016-08-23 Thread Reynold Xin
Does this problem still exist on today's master/branch-2.0? SPARK-16550 was merged. It might be fixed already. On Tue, Aug 23, 2016 at 9:37 AM, Michael Allman wrote: > FYI, I posted this to user@ and have followed up with a bug report: > https://issues.apache.org/jira/browse/SPARK-17204 > > Mic

Serialization troubles with mutable.LinkedHashMap

2016-08-23 Thread Rahul Palamuttam
Hi, I initially sent this to the user mailing list, however I didn't get any response. I figured this could be a bug, so it might be of more concern to the dev list. I recently switched to using Kryo serialization and I've been running into errors with the mutable.LinkedHashMap class. If I don't reg

Fwd: Anyone else having trouble with replicated off heap RDD persistence?

2016-08-23 Thread Michael Allman
FYI, I posted this to user@ and have followed up with a bug report: https://issues.apache.org/jira/browse/SPARK-17204 Michael > Begin forwarded message: > > From: Michael Allman > Subject: Anyone else having trouble with replicated off heap

Re: Why can't a Transformer have multiple output columns?

2016-08-23 Thread Nicholas Chammas
Thanks for the pointer! A linked issue from the one you shared also appears to be relevant. SPARK-8418 : "Add single- and multi-value support to ML Transformers" On Tue, Aug 23, 2016 at 10:41 AM Nick Pentreath wrote: > It's not impossible that a

Re: Why can't a Transformer have multiple output columns?

2016-08-23 Thread Nick Pentreath
It's not impossible that a Transformer could output multiple columns - it's simply because none of the current ones do. It's true that it might be a relatively less common use case in general. But take StringIndexer for example. It turns strings (categorical features) into ints (0-based indexes).
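To make the StringIndexer example concrete, here is a plain-Python sketch of the string-to-index mapping described above. It does not use Spark's API: `fit_string_index` and `transform` are hypothetical names, and the ordering (most frequent label first, ties broken alphabetically) is an assumption chosen for illustration.

```python
from collections import Counter


def fit_string_index(values):
    """Build a label -> 0-based index mapping.

    Assumed ordering: most frequent label gets index 0,
    ties are broken alphabetically.
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: index for index, label in enumerate(ordered)}


def transform(values, mapping):
    """Replace each string with its integer index (one input, one output)."""
    return [mapping[value] for value in values]


labels = ["a", "b", "c", "a", "a", "c"]
mapping = fit_string_index(labels)
indexed = transform(labels, mapping)
print(mapping)
print(indexed)
```

The sketch maps a single input column to a single output column, which is the shape the existing Transformers share.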

Why can't a Transformer have multiple output columns?

2016-08-23 Thread Nicholas Chammas
If you create your own Spark 2.x ML Transformer, there are multiple mix-ins (is that the correct term?) that you can use to define its behavior which are in ml/param/shared.py . Among them are the following mix-ins: