Hi,

On 06/12/2014 05:47 PM, Toby Douglass wrote:

> In these future jobs, when I come to load the aggregated RDD, will Spark
> load only the columns being accessed by the query? Or will Spark
> load everything, convert it into an internal representation, and then
> execute the query?

The aforementioned native Parquet support in Spark 1.0 supports column
projections, which means only the columns that appear in the query will
be loaded. The next release will also support record filters for simple
pruning predicates ("int-column < value" and such). This is different
from going through a Hadoop input/output format and requires no
additional setup (no extra jars on the classpath and such).
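
A minimal sketch of what that looks like with the Spark 1.0 SQL API
(the file path and column names below are made up for illustration):

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.sql.SQLContext

  object ParquetProjectionExample {
    def main(args: Array[String]) {
      val sc = new SparkContext(new SparkConf().setAppName("parquet-projection"))
      val sqlContext = new SQLContext(sc)

      // Load a previously saved Parquet file as a SchemaRDD.
      val aggregated = sqlContext.parquetFile("aggregated.parquet")
      aggregated.registerAsTable("aggregated")

      // Only the columns named in the query (key, total) are read from
      // disk; the file's other columns are never materialized.
      val result = sqlContext.sql(
        "SELECT key, total FROM aggregated WHERE total > 0")
      result.collect().foreach(println)

      sc.stop()
    }
  }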

For more details see:

http://spark.apache.org/docs/latest/sql-programming-guide.html#using-parquet

Andre
