Re: Shrinking the DataFrame lineage

2016-05-15 Thread Hamel Kothari
I don't know about the second one, but for question #1: when you convert from a cached DF to an RDD (via a map function or the "rdd" value), the types are converted from the off-heap types to on-heap types. If your rows are fairly large/complex, this can have a pretty big performance impact so I
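A minimal sketch of the boundary being described, assuming a 1.6-era sqlContext and hypothetical paths/column names; staying in the DataFrame API avoids the conversion, crossing into the RDD API pays it:

```scala
val df = sqlContext.read.parquet("/path/to/data").cache()
df.count()                                // materialize the cached, off-heap representation

// Staying in the DataFrame API keeps rows in Spark's internal format:
val filtered = df.filter(df("value") > 0)

// Dropping to an RDD forces each row back into on-heap objects,
// which is where the cost lands for large or deeply nested rows:
val rows = df.rdd                         // RDD[Row], on-heap
val keys = df.map(r => r.getString(0))    // same conversion happens via map
```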

ConvertToSafe being done before functions.explode

2016-04-28 Thread Hamel Kothari
Hi all, I've been looking at some of my query plans and noticed that pretty much every explode that I run (which is always over a column with ArrayData) is prefixed with a ConvertToSafe call in the physical plan. Looking at Generate.scala, it looks like it doesn't override canProcessUnsafeRows in
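A quick way to reproduce the observation (table/column names are hypothetical; the commented plan is roughly the shape of 1.6-era explain() output):

```scala
import org.apache.spark.sql.functions.explode

val df = sqlContext.read.parquet("/path/to/nested")   // assumed to have an array column "items"
df.select(explode(df("items")).as("item")).explain()
// Expected shape of the physical plan, roughly:
//   Generate explode(items#0), ...
//   +- ConvertToSafe
//      +- Scan ParquetRelation[items#0] ...
```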

Skipping Type Conversion and using InternalRows for UDF

2016-04-15 Thread Hamel Kothari
Hi all, So we have these UDFs which take <1ms to execute, yet we're seeing pretty poor performance around them in practice: the overhead is >10ms for the projections (this data is deeply nested with ArrayTypes and MapTypes, so that could be the cause). Looking at the logs and code for ScalaUDF,
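A sketch of where the conversion cost sits (column name is hypothetical; the UDF body is cheap, the per-row conversion around it is not):

```scala
import org.apache.spark.sql.functions.udf

// The function itself runs in <1ms, but before it runs Spark converts the
// internal representation (UnsafeRow/ArrayData/MapData) into Scala objects,
// then converts the result back afterwards. For deeply nested Array/Map data
// that round trip dominates, which matches the >10ms projection overhead.
val firstTag = udf { (tags: Seq[String]) => if (tags.isEmpty) "" else tags.head }

val out = df.withColumn("firstTag", firstTag(df("tags")))
```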

Re: Spark SQL UDF Returning Rows

2016-03-31 Thread Hamel Kothari
Hi Michael, Thanks for the response. I am just extracting part of the nested structure and returning only a piece of that same structure. I haven't looked at Encoders or Datasets since we're bound to 1.6 for now, but I'll look at Encoders to see if that covers it. Datasets seem like they would solve
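For reference, a rough sketch of the Dataset/Encoder route mentioned here (experimental in 1.6; the case classes are illustrative, not the real schema):

```scala
case class Inner(id: Long, name: String)
case class Outer(key: String, inner: Inner, tags: Seq[String])

import sqlContext.implicits._

val ds = df.as[Outer]            // encoder-driven; no manual Row plumbing
val inners = ds.map(_.inner)     // pull out the nested piece with static types
```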

Re: Spark SQL UDF Returning Rows

2016-03-30 Thread Hamel Kothari
30, 2016 at 11:47 AM Hamel Kothari <hamelkoth...@gmail.com> wrote: Hi all, I've been trying for the last couple of days to define a UDF which takes in a deeply nested Row object and performs some extraction to pull out a portion of the Row and return it. This

Spark SQL UDF Returning Rows

2016-03-30 Thread Hamel Kothari
Hi all, I've been trying for the last couple of days to define a UDF which takes in a deeply nested Row object and performs some extraction to pull out a portion of the Row and return it. This row object is nested not just with StructTypes but a bunch of ArrayTypes and MapTypes. From this
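A minimal sketch of one way this can be done, passing the result schema explicitly alongside the function (field and column names are hypothetical):

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.types._

val sliceSchema = StructType(Seq(
  StructField("id", LongType),
  StructField("name", StringType)))

// Struct inputs arrive in a Scala UDF as Row; returning a Row works as long
// as the output DataType is supplied explicitly, since no encoder can be inferred:
val extract = udf((r: Row) => Row(r.getLong(0), r.getString(1)), sliceSchema)

val projected = df.withColumn("slice", extract(df("outer.inner")))
```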

Re: graceful shutdown in external data sources

2016-03-20 Thread Hamel Kothari
Dan, You could probably just register a JVM shutdown hook yourself: https://docs.oracle.com/javase/7/docs/api/java/lang/Runtime.html#addShutdownHook(java.lang.Thread) This at least would let you close the connections when the application as a whole has completed (in standalone) or when your
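A minimal sketch of the suggestion (ConnectionPool is a hypothetical stand-in for whatever owns the data-source clients):

```scala
object ConnectionPool {
  def closeAll(): Unit = { /* close any open connections */ }
}

// Runs when the JVM exits, e.g. once the standalone application finishes:
sys.addShutdownHook {
  ConnectionPool.closeAll()
}
```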

Re: More Robust DataSource Parameters

2016-02-27 Thread Hamel Kothari
Java, Scala) 2. API stability (both backward and forward) On Fri, Feb 26, 2016 at 8:44 AM, Hamel Kothari <hamelkoth...@gmail.com> wrote: Hi devs, Has there been any discussion around changing the DataSource parameters

More Robust DataSource Parameters

2016-02-26 Thread Hamel Kothari
Hi devs, Has there been any discussion around changing the DataSource parameters argument to be something more sophisticated than Map[String, String]? As you write more complex DataSources, there are likely to be a variety of parameters of varying formats which are needed, and having to coerce them
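A sketch of the coercion being described: with Map[String, String], every typed option has to be round-tripped through strings by hand (option names and defaults are hypothetical):

```scala
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider}

class DefaultSource extends RelationProvider {
  override def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String]): BaseRelation = {
    // Every non-string parameter is parsed by hand, with ad-hoc defaults:
    val batchSize = parameters.get("batchSize").map(_.toInt).getOrElse(1000)
    val retries   = parameters.get("retries").map(_.toInt).getOrElse(3)
    // Lists get flattened into comma-separated strings and split back apart:
    val endpoints = parameters.get("endpoints").map(_.split(',').toSeq).getOrElse(Seq.empty)
    ??? // build the relation from the parsed values
  }
}
```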

Re: Spark Job on YARN Hogging the entire Cluster resource

2016-02-24 Thread Hamel Kothari
and reservation. The question is how much preemption tries to preempt queue A if it holds the entire resource without releasing it. We are not able to share the actual configuration, but the answer to the question here will help us. Thanks, Prabhu Joseph

Re: Spark Job on YARN Hogging the entire Cluster resource

2016-02-24 Thread Hamel Kothari
If all queues are identical, this behavior should not be happening. Preemption as designed in the fair scheduler (IIRC) takes place based on the instantaneous fair share, not the steady-state fair share. The fair scheduler docs
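For readers chasing the same symptom, a hedged sketch of the relevant knobs (queue names and timeout values are illustrative; preemption must also be switched on via yarn.scheduler.fair.preemption=true in yarn-site.xml):

```xml
<!-- fair-scheduler.xml: identical queues plus a fair-share preemption timeout -->
<allocations>
  <defaultFairSharePreemptionTimeout>30</defaultFairSharePreemptionTimeout>
  <queue name="queueA">
    <weight>1.0</weight>
  </queue>
  <queue name="queueB">
    <weight>1.0</weight>
  </queue>
</allocations>
```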

Re: Spark 1.6.1

2016-02-01 Thread Hamel Kothari
I noticed that the Jackson dependency was bumped to 2.5 in master for something spark-streaming related. Is there any reason that this upgrade can't be included with 1.6.1? According to later comments on this thread: https://issues.apache.org/jira/browse/SPARK-8332 and my personal experience
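For anyone hitting the same conflict, a sketch of pinning Jackson in an sbt build so the whole dependency tree agrees on one version (versions are illustrative; 1.6 shipped against the 2.4 line):

```scala
// build.sbt — force a single Jackson version across the dependency tree:
dependencyOverrides ++= Set(
  "com.fasterxml.jackson.core"   %  "jackson-core"         % "2.4.4",
  "com.fasterxml.jackson.core"   %  "jackson-databind"     % "2.4.4",
  "com.fasterxml.jackson.module" %% "jackson-module-scala" % "2.4.4"
)
```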

Heuristics for Partitioning Non-Local Data

2016-01-28 Thread Hamel Kothari
Hey spark-devs, I'm in the process of writing a DataSource for what is essentially a Java web service. Each relation which we create will consist of a series of queries to this web service, which returns a pretty much known amount of data (e.g. 2000 rows, 5 string columns or similar) which we can
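A back-of-the-envelope sketch of the heuristic in question: when the result size is known up front and there is no locality to exploit, pick a partition count from the size estimate and cap it by available parallelism (all numbers are hypothetical, and queries/runQuery are stand-ins for the web-service calls):

```scala
val estimatedRows = 2000L
val bytesPerRow   = 5 * 64L               // ~5 string columns
val targetPerPart = 4L * 1024 * 1024      // aim for a few MB per partition

val bySize = math.max(1L, (estimatedRows * bytesPerRow) / targetPerPart)
// Cap by available parallelism so tiny relations don't fan out needlessly:
val numPartitions = math.min(bySize, sc.defaultParallelism.toLong).toInt

// Fan the web-service queries out across that many partitions:
val rows = sc.parallelize(queries, numPartitions).flatMap(runQuery)
```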

Re: timeout in shuffle problem

2016-01-27 Thread Hamel Kothari
Are you running on YARN? Another possibility here is that your shuffle managers are facing GC pain and becoming less responsive, thus missing timeouts. Can you try increasing the memory on the node managers and see if that helps? On Sun, Jan 24, 2016 at 4:58 PM Ted Yu wrote:
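The usual configuration knobs for ruling the timeout side in or out while investigating (values are illustrative, not prescriptive):

```scala
val conf = new org.apache.spark.SparkConf()
  .set("spark.network.timeout", "300s")       // raise the network/ack timeout
  .set("spark.shuffle.io.maxRetries", "10")   // tolerate briefly unresponsive shuffle servers
  .set("spark.shuffle.io.retryWait", "30s")   // back off between fetch retries
```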

Re: java.io.FileNotFoundException(Too many open files) in Spark streaming

2016-01-05 Thread Hamel Kothari
The "Too Many Files" part of the exception is just indicative of the fact that when that call was made, too many files were already open. It doesn't necessarily mean that that line is the source of all of the open files, that's just the point at which it hit its limit. What I would recommend is