Re: Spark SQL with a sorted file

2014-12-23 Thread Cheng Lian
This Parquet bug is triggered only when some row groups are either empty or 
contain only null binary values.


So it’s still safe to turn it on as long as every column is boolean or 
numeric, or is a binary column that contains no nulls.


You may turn it on with SET spark.sql.parquet.filterPushdown=true

Please refer to PARQUET-136 for details: 
https://issues.apache.org/jira/browse/PARQUET-136
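
A minimal Scala sketch of turning the flag on from a SQLContext and running 
a filtered query against Parquet data (this assumes the spark-shell's sc; 
the path, table name, and column here are hypothetical):

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)

    // Turn on Parquet filter pushdown (off by default in Spark 1.2).
    sqlContext.sql("SET spark.sql.parquet.filterPushdown=true")

    // Queries can now skip row groups whose min/max statistics
    // rule out the predicate.
    val logs = sqlContext.parquetFile("/data/logs.parquet")
    logs.registerTempTable("logs")
    sqlContext.sql("SELECT * FROM logs WHERE id > 1000").collect()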


Cheng




Re: Spark SQL with a sorted file

2014-12-22 Thread Jerry Raj

Michael,
Thanks. Is this still turned off in the released 1.2? Is it possible to 
turn it on just to get an idea of how much of a difference it makes?


-Jerry



Re: Spark SQL with a sorted file

2014-12-04 Thread Michael Armbrust
I'll add that some of our data formats will actually infer this sort of
useful information automatically.  Both Parquet and cached in-memory tables
keep min/max statistics for each column.  When you have predicates over
these sorted columns, partitions are eliminated if they can't possibly
match the predicate given the statistics.

For Parquet this is new in Spark 1.2 and it is turned off by default (due
to bugs we are working with the Parquet library team to fix).  Hopefully
soon it will be on by default.
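
A short Scala sketch of the effect, assuming Spark 1.2 with filter pushdown 
enabled as above, and a hypothetical table sorted on an integer id column:

    // Writing the data sorted on "id" gives each Parquet row group a
    // narrow min/max range for that column.
    val sorted = sqlContext.sql("SELECT * FROM raw ORDER BY id")
    sorted.saveAsParquetFile("/data/sorted_by_id.parquet")

    // With pushdown on, row groups whose [min, max] cannot contain 42
    // are skipped rather than read and filtered.
    sqlContext.parquetFile("/data/sorted_by_id.parquet").registerTempTable("t")
    sqlContext.sql("SELECT * FROM t WHERE id = 42").collect()

    // Cached in-memory tables keep the same per-column statistics, so the
    // predicate also prunes cached batches.
    sqlContext.cacheTable("t")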



RE: Spark SQL with a sorted file

2014-12-03 Thread Cheng, Hao
You can try writing your own Relation with filter push-down, or use 
ParquetRelation2 as a workaround. 
(https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/parquet/newParquet.scala)
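
For the first option, here is a rough Scala sketch against the external 
data sources API added in 1.2. The file layout (sorted "key,value" lines) 
and the range logic are invented for illustration; a real reader would seek 
within the sorted file rather than filter the full scan:

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql._
    import org.apache.spark.sql.sources._

    // Hypothetical relation over a text file of "key,value" lines
    // pre-sorted by the integer key. Spark hands us the predicates it can
    // express as simple Filters; the sort order lets us bound the scan.
    case class SortedFileRelation(path: String)
                                 (@transient val sqlContext: SQLContext)
      extends PrunedFilteredScan {

      override def schema: StructType = StructType(
        StructField("key", IntegerType, nullable = false) ::
        StructField("value", StringType, nullable = true) :: Nil)

      override def buildScan(requiredColumns: Array[String],
                             filters: Array[Filter]): RDD[Row] = {
        // Derive a scan range from pushed-down predicates on the sort key.
        val lower = filters.collectFirst { case GreaterThan("key", v: Int) => v }
        val upper = filters.collectFirst { case LessThan("key", v: Int) => v }

        sqlContext.sparkContext.textFile(path)
          .map(_.split(",", 2))
          .map(a => (a(0).trim.toInt, a(1)))
          // A real implementation would use the bounds to skip whole file
          // ranges; a plain filter is still correct, since Spark re-applies
          // the predicates on top of what the relation returns.
          .filter { case (k, _) => lower.forall(k > _) && upper.forall(k < _) }
          .map { case (k, v) =>
            Row(requiredColumns.map { case "key" => k; case "value" => v }: _*)
          }
      }
    }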

Cheng Hao

-----Original Message-----
From: Jerry Raj [mailto:jerry@gmail.com] 
Sent: Thursday, December 4, 2014 11:34 AM
To: user@spark.apache.org
Subject: Spark SQL with a sorted file

Hi,
If I create a SchemaRDD from a file that I know is sorted on a certain field, 
is it possible to somehow pass that information on to Spark SQL so that SQL 
queries referencing that field are optimized?

Thanks
-Jerry
