LATERAL VIEW explode requests the full schema

2015-03-03 Thread matthes
I use LATERAL VIEW explode(...) to read data from a Parquet file, but the
full schema is requested by Parquet instead of just the columns I use. When
I don't use LATERAL VIEW, the requested schema contains just the two
columns that I use. Is this correct, is there room for an optimization, or
am I misunderstanding something?
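
For context, the table is set up roughly like this (the Parquet path below
is a placeholder, not my real one):

  // Sketch of my setup; the path is a placeholder.
  val pef = hiveContext.parquetFile("/path/to/pef.parquet")
  pef.registerTempTable("pef")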

Here are my examples:

1) hiveContext.sql("SELECT userid FROM pef WHERE observeddays == 20140509")

The requested schema is:

  optional group observedDays (LIST) {
    repeated int32 array;
  }
  required int64 userid;
}

This is what I expect (the query itself does not return a correct result,
but that is not the problem here)!

2) hiveContext.sql("SELECT userid FROM pef LATERAL VIEW
explode(observeddays) od AS observed WHERE observed == 20140509")

The requested schema is:

  required int64 userid;
  optional int32 source;
  optional group observedDays (LIST) {
    repeated int32 array;
  }
  optional group placetobe (LIST) {
    repeated group bag {
      optional group array {
        optional binary palces (UTF8);
        optional group dates (LIST) {
          repeated int32 array;
        }
      }
    }
  }
}

Why does Parquet request the full schema? I only use two fields of the
table.
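
By analogy with example 1, I would have expected the requested schema to
contain only something like this (my guess, not actual output):

  required int64 userid;
  optional group observedDays (LIST) {
    repeated int32 array;
  }
}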

Can somebody please explain to me why this happens?

Thanks!







Re: LATERAL VIEW explode requests the full schema

2015-03-03 Thread Michael Armbrust
I believe this has been optimized in Spark 1.3:
https://github.com/apache/spark/commit/2a36292534a1e9f7a501e88f69bfc3a09fb62cb3
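
A quick way to check on 1.3 (a sketch, assuming the table from your
example) is to look at the physical plan:

  // Sketch: on 1.3+, the Parquet scan in the physical plan should list
  // only the columns the query actually references.
  val df = hiveContext.sql(
    "SELECT userid FROM pef LATERAL VIEW explode(observeddays) od AS observed " +
    "WHERE observed == 20140509")
  df.explain(true)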

On Tue, Mar 3, 2015 at 4:36 AM, matthes matthias.diekst...@web.de wrote:

 I use LATERAL VIEW explode(...) to read data from a Parquet file, but the
 full schema is requested by Parquet instead of just the columns I use. [...]