Re: SparkSQL operator priority

Eric Friedman Sat, 19 Jul 2014 08:29:25 -0700

Can position be null?  It looks like predicate pushdown may be constrained in 
that case; see https://github.com/apache/spark/pull/511/
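
For intuition about why the plan quoted below is expensive, here is a toy model
in plain Scala. It is not Spark code and does not describe Spark's internals:
ColumnStore, Result, scanThenFilter and filterThenMaterialize are names made up
for this sketch, and every column is assumed to hold Longs. It only counts how
many field values are decoded when the filter runs after a full scan (the shape
of "Filter over ParquetTableScan" in the plan) versus when the predicate is
checked on the filter column first.

```scala
object PushdownSketch {
  // Hypothetical in-memory stand-in for a columnar file: column name -> values.
  final case class ColumnStore(columns: Map[String, Vector[Long]]) {
    val numRows: Int = columns.values.headOption.map(_.size).getOrElse(0)
  }

  // Matching rows plus the number of field values that had to be decoded.
  final case class Result(rows: Vector[Map[String, Long]], decoded: Long)

  // Strategy 1: materialize every column of every row, then apply the filter.
  def scanThenFilter(store: ColumnStore, col: String, value: Long): Result = {
    var decoded = 0L
    val allRows = (0 until store.numRows).toVector.map { i =>
      store.columns.map { case (name, values) =>
        decoded += 1 // one field value decoded
        name -> values(i)
      }
    }
    Result(allRows.filter(row => row(col) == value), decoded)
  }

  // Strategy 2: decode only the filter column, then materialize the
  // remaining columns for the rows that matched.
  def filterThenMaterialize(store: ColumnStore, col: String, value: Long): Result = {
    var decoded = 0L
    val matching = (0 until store.numRows).toVector.filter { i =>
      decoded += 1 // filter column decoded once per row
      store.columns(col)(i) == value
    }
    val rows = matching.map { i =>
      store.columns.map { case (name, values) =>
        if (name != col) decoded += 1 // other columns decoded only for matches
        name -> values(i)
      }
    }
    Result(rows, decoded)
  }
}
```

In this model, a table with 29 columns (as in the plan) and N rows costs 29 * N
decodes with the first strategy, but only N + 28 * 14 with the second when 14
rows match. The same arithmetic would explain why "select position" is so much
cheaper than "select *" when the filter is applied after the scan.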


> On Jul 18, 2014, at 8:04 PM, Christos Kozanitis <kozani...@berkeley.edu> 
> wrote:
> 
> Hello
> 
> In what order does SparkSQL deserialize Parquet fields? Is it possible to 
> change it?
> 
> I am using SparkSQL to query a parquet file that consists of a lot of fields 
> (around 30 or so). Let me call an example table MyTable and let's suppose the 
> name of one of its fields is "position".
> 
> The query that I am executing is: 
> sql("select * from MyTable where position = 243189160")
> 
> The query plan that I get from this query is:
> Filter (position#6L:6 = 243189160)
>  ParquetTableScan 
> [contig.contigName#0,contig.contigLength#1L,contig.contigMD5#2,contig.referenceURL#3,contig.assembly#4,contig.species#5,position#6L,rangeOffset#7,rangeLength#8,referenceBase#9,readBase#10,sangerQuality#11,mapQuality#12,numSoftClipped#13,numReverseStrand#14,countAtPosition#15,readName#16,readStart#17L,readEnd#18L,recordGroupSequencingCenter#19,recordGroupDescription#20,recordGroupRunDateEpoch#21L,recordGroupFlowOrder#22,recordGroupKeySequence#23,recordGroupLibrary#24,recordGroupPredictedMedianInsertSize#25,recordGroupPlatform#26,recordGroupPlatformUnit#27,recordGroupSample#28],
>  (ParquetRelation 
> hdfs://ec2-54-89-87-167.compute-1.amazonaws.com:9000/genomes/hg00096.plup), 
> None
> 
> I expect 14 entries in the output, but running .collect.foreach(println) 
> takes more than an hour on my cluster.
> 
> Is it safe to assume in my example that SparkSQL deserializes all fields 
> first before applying the filter? If so, can a user change this behavior?
> 
> To test my assumption I replaced "*" with "position", so my new query is 
> of the form sql("select position from MyTable where position = 243189160"), 
> and this query runs much faster on the same hardware (2-3 minutes vs 65 min).
> 
> Any ideas?
> 
> thanks
> Christos
