Re: 3Ps for Datasets not available?! (=Parquet Predicate Pushdown)

2016-08-30 Thread Jacek Laskowski

Hi Reynold, That's what I was told a few times already (most notably by Adam on Twitter), but I couldn't understand what it meant exactly. It's only now that I understand what you're saying, Reynold :) Does this make DataFrame's Column-based or SQL-based queries usually faster than Datasets with

Re: 3Ps for Datasets not available?! (=Parquet Predicate Pushdown)

2016-08-30 Thread Reynold Xin

The UDF is a black box, so Spark can't know what it is dealing with. There are simple cases in which we can analyze the UDF's bytecode and infer what it is doing, but it is pretty difficult to do in general.

On Tuesday, August 30, 2016, Jacek Laskowski wrote:
> Hi,
>
> I've been
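
To make the "black box" point concrete, here is a toy sketch in plain Python. It is not Spark's or Parquet's API: `RowGroup`, `GreaterThan`, and `scan` are hypothetical names made up for illustration, and the min/max statistics only loosely imitate what a columnar file format keeps per row group. The idea it shows is that a declarative predicate can be checked against statistics to skip whole chunks of data, whereas an opaque function can only be applied after every row has been read.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

Row = Dict[str, int]


@dataclass
class RowGroup:
    """Hypothetical chunk of rows with per-column (min, max) statistics."""
    rows: List[Row]

    def stats(self, column: str) -> Tuple[int, int]:
        values = [row[column] for row in self.rows]
        return min(values), max(values)


@dataclass
class GreaterThan:
    """Declarative predicate: the engine can inspect the column and the bound."""
    column: str
    bound: int

    def matches(self, row: Row) -> bool:
        return row[self.column] > self.bound

    def may_match(self, group: RowGroup) -> bool:
        # If even the largest value fails the test, no row in the group can pass.
        _, highest = group.stats(self.column)
        return highest > self.bound


def scan(
    groups: List[RowGroup],
    predicate: Optional[GreaterThan] = None,
    udf: Optional[Callable[[Row], bool]] = None,
) -> Tuple[List[Row], int]:
    """Return the matching rows and the number of row groups actually read."""
    result: List[Row] = []
    groups_read = 0
    for group in groups:
        # "Pushdown": skip the whole group when statistics rule it out.
        if predicate is not None and not predicate.may_match(group):
            continue
        groups_read += 1
        for row in group.rows:
            if predicate is not None and not predicate.matches(row):
                continue
            # An opaque function can only be called row by row.
            if udf is not None and not udf(row):
                continue
            result.append(row)
    return result, groups_read


groups = [
    RowGroup([{"id": i} for i in range(0, 100)]),
    RowGroup([{"id": i} for i in range(100, 200)]),
    RowGroup([{"id": i} for i in range(200, 300)]),
]

# Same logical filter, expressed declaratively and as an opaque function.
declarative_rows, declarative_reads = scan(groups, predicate=GreaterThan("id", 250))
opaque_rows, opaque_reads = scan(groups, udf=lambda row: row["id"] > 250)

print(declarative_reads, opaque_reads)
```

Both calls return the same rows, but only the declarative form lets the scan skip row groups. A real optimizer works on query plans rather than a loop like this, so treat the sketch as an analogy only.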

3Ps for Datasets not available?! (=Parquet Predicate Pushdown)

2016-08-30 Thread Jacek Laskowski

Hi, I've been playing with UDFs to see why they're a black box for Spark's optimizer, and started with filters to showcase the optimizations in play. My current understanding is that predicate pushdown is supported by the following data sources: 1. Hive tables 2. Parquet files 3. ORC files 4.