I'll add that some of our data formats will actually infer this sort of
useful information automatically.  Both Parquet and cached in-memory tables
keep statistics on the min/max values for each column.  When you have
predicates over these sorted columns, partitions are eliminated if they
can't possibly match the predicate given the statistics.
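
For example, with a cached table (a minimal sketch using the 1.2-era
SchemaRDD API; the `events` table and `event_time` column are made up):

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)

    // Assuming `events` is a SchemaRDD with an `event_time` column.
    events.registerTempTable("events")

    // Caching collects per-batch min/max statistics on every column.
    sqlContext.cacheTable("events")

    // Cached batches whose [min, max] range can't satisfy the predicate
    // are skipped when this query runs.
    sqlContext.sql(
      "SELECT COUNT(*) FROM events WHERE event_time >= '2014-12-01'").collect()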

For Parquet this is new in Spark 1.2 and it is turned off by default (due
to bugs we are working with the Parquet library team to fix).  Hopefully it
will be on by default soon.
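
If you want to try it anyway, it can be enabled per SQLContext (a quick
sketch; spark.sql.parquet.filterPushdown is the setting as of 1.2, and the
path, table, and column names below are just placeholders):

    // Opt in to Parquet filter push-down; off by default in 1.2.
    sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")

    // Predicates on `id` can then be pushed into the Parquet reader, so
    // row groups whose statistics can't match are skipped.
    sqlContext.parquetFile("/path/to/data").registerTempTable("t")
    sqlContext.sql("SELECT * FROM t WHERE id > 100").collect()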

On Wed, Dec 3, 2014 at 8:44 PM, Cheng, Hao <hao.ch...@intel.com> wrote:

> You can try to write your own Relation with filter push-down, or use
> ParquetRelation2 as a workaround. (
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/parquet/newParquet.scala
> )
>
> Cheng Hao
>
> -----Original Message-----
> From: Jerry Raj [mailto:jerry....@gmail.com]
> Sent: Thursday, December 4, 2014 11:34 AM
> To: user@spark.apache.org
> Subject: Spark SQL with a sorted file
>
> Hi,
> If I create a SchemaRDD from a file that I know is sorted on a certain
> field, is it possible to somehow pass that information on to Spark SQL so
> that SQL queries referencing that field are optimized?
>
> Thanks
> -Jerry
>
