Re: SparkSQL operator priority

2014-07-19 Thread Eric Friedman
Can position be null? It looks like there may be constraints on predicate 
pushdown in that case: https://github.com/apache/spark/pull/511/

 On Jul 18, 2014, at 8:04 PM, Christos Kozanitis kozani...@berkeley.edu 
 wrote:
 
 Hello
 
 In what order does SparkSQL deserialize Parquet fields? Is it 
 possible to change that order?
 
 I am using SparkSQL to query a parquet file that consists of a lot of fields 
 (around 30 or so). Let me call an example table MyTable and let's suppose the 
 name of one of its fields is position.
 
 The query that I am executing is: 
 sql("select * from MyTable where position = 243189160")
 
 The query plan that I get from this query is:
 Filter (position#6L:6 = 243189160)
  ParquetTableScan 
 [contig.contigName#0,contig.contigLength#1L,contig.contigMD5#2,contig.referenceURL#3,contig.assembly#4,contig.species#5,position#6L,rangeOffset#7,rangeLength#8,referenceBase#9,readBase#10,sangerQuality#11,mapQuality#12,numSoftClipped#13,numReverseStrand#14,countAtPosition#15,readName#16,readStart#17L,readEnd#18L,recordGroupSequencingCenter#19,recordGroupDescription#20,recordGroupRunDateEpoch#21L,recordGroupFlowOrder#22,recordGroupKeySequence#23,recordGroupLibrary#24,recordGroupPredictedMedianInsertSize#25,recordGroupPlatform#26,recordGroupPlatformUnit#27,recordGroupSample#28],
  (ParquetRelation hdfs://ec2-54-89-87-167.compute-1.amazonaws.com:9000/genomes/hg00096.plup), None
 
 I expect 14 entries in the output, but running 
 .collect.foreach(println) on the result takes more than an hour on my 
 cluster. 
 
 Is it safe to assume in my example that SparkSQL deserializes all fields 
 first before applying the filter? If so, can a user change this behavior?
 
 To test this assumption I replaced * with position, so the new query is 
 sql("select position from MyTable where position = 243189160"), 
 and it runs much faster on the same hardware (2-3 minutes vs. 65 minutes).
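
 For illustration, the difference between the two strategies being asked about can be sketched in plain Scala. This is a toy model under stated assumptions, not Spark SQL's or Parquet's actual code path: the ColumnarScanSketch object, its two functions, and the sample data are all hypothetical, and only the column name position and the filter value come from the query above. "Cells read" simply counts column values touched.

```scala
// Toy model only: this is NOT how Spark SQL or Parquet is implemented.
object ColumnarScanSketch {
  // A column-oriented table: column name -> values, one entry per row.
  type Table = Map[String, Vector[Any]]
  type Row = Map[String, Any]

  // Hypothetical sample data; only the column name `position` is from the thread.
  val table: Table = Map(
    "position"   -> Vector(243189160L, 7L, 243189160L, 42L),
    "readName"   -> Vector("r1", "r2", "r3", "r4"),
    "mapQuality" -> Vector(60, 30, 55, 10)
  )

  // Assemble one row by reading index i from every column.
  private def assemble(t: Table, i: Int): Row =
    t.map { case (name, values) => name -> values(i) }

  // Strategy A: build every row from all columns, then apply the filter.
  // Returns the matching rows and the number of cells read.
  def materializeThenFilter(t: Table, col: String, value: Any): (Vector[Row], Int) = {
    val numRows = t(col).length
    val rows = (0 until numRows).toVector.map(i => assemble(t, i))
    (rows.filter(row => row(col) == value), numRows * t.size)
  }

  // Strategy B: scan only the filter column, then build just the matching rows.
  def filterThenMaterialize(t: Table, col: String, value: Any): (Vector[Row], Int) = {
    val filterCol = t(col)
    val hits = filterCol.indices.filter(i => filterCol(i) == value).toVector
    // One read per value of the filter column, plus the other columns for each hit.
    val cellsRead = filterCol.length + hits.length * (t.size - 1)
    (hits.map(i => assemble(t, i)), cellsRead)
  }

  def main(args: Array[String]): Unit = {
    val (a, costA) = materializeThenFilter(table, "position", 243189160L)
    val (b, costB) = filterThenMaterialize(table, "position", 243189160L)
    println(s"same rows: ${a == b}; cells read: $costA vs $costB")
  }
}
```

 Both functions return the same rows; the second touches fewer values, and the gap grows with the number of columns and shrinks with the number of matches. Whether a real Parquet scan can work this way depends on the reader and on whether the filter can be pushed down.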
 
 Any ideas?
 
 thanks
 Christos


Re: SparkSQL operator priority

2014-07-19 Thread Christos Kozanitis
Thanks, Eric. That is the case: most of my fields are optional, so it
seems that the problem comes from Parquet.


On Sat, Jul 19, 2014 at 8:27 AM, Eric Friedman eric.d.fried...@gmail.com
wrote:

 Can position be null?  Looks like there may be constraints with predicate
 push down in that case. https://github.com/apache/spark/pull/511/
