Can position be null? It looks like predicate pushdown may be constrained in that case; see https://github.com/apache/spark/pull/511/
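To make the cost difference in the question quoted below concrete, here is a toy model in plain Scala. It is not Spark or Parquet code and says nothing about their internals; the object name `ScanCostSketch`, its methods, and the tiny sample table are all hypothetical. It only counts how many field values get decoded when every requested column is materialized before the filter runs, versus when the filter column is checked first and the other columns are fetched only for matching rows.

```scala
// Toy model only: NOT Spark or Parquet code. All names here are hypothetical.
object ScanCostSketch {
  // A columnar table: column name -> values; every column has the same length.
  type Table = Map[String, Vector[Long]]

  // Strategy A: decode every requested column for every row, then filter.
  // Returns the matching rows and the number of field values decoded.
  def scanThenFilter(table: Table, columns: Seq[String],
                     filterCol: String, target: Long): (Seq[Map[String, Long]], Long) = {
    var decoded = 0L
    val numRows = table(filterCol).length
    val rows = (0 until numRows).map { i =>
      columns.map { c =>
        decoded += 1
        c -> table(c)(i)
      }.toMap
    }
    (rows.filter(_(filterCol) == target), decoded)
  }

  // Strategy B: decode only the filter column for every row, then decode the
  // remaining requested columns just for the rows that matched.
  def filterThenFetch(table: Table, columns: Seq[String],
                      filterCol: String, target: Long): (Seq[Map[String, Long]], Long) = {
    var decoded = 0L
    val filterValues = table(filterCol)
    val matching = filterValues.indices.filter { i =>
      decoded += 1
      filterValues(i) == target
    }
    val rows = matching.map { i =>
      columns.map { c =>
        if (c != filterCol) decoded += 1 // filter column was already decoded
        c -> table(c)(i)
      }.toMap
    }
    (rows, decoded)
  }

  // Made-up sample data: 3 columns, 4 rows, one row matching the predicate.
  val example: Table = Map(
    "position"  -> Vector(1L, 2L, 243189160L, 4L),
    "readStart" -> Vector(10L, 20L, 30L, 40L),
    "readEnd"   -> Vector(11L, 21L, 31L, 41L)
  )

  def main(args: Array[String]): Unit = {
    val cols = Seq("position", "readStart", "readEnd")
    val (_, costA) = scanThenFilter(example, cols, "position", 243189160L)
    val (_, costB) = filterThenFetch(example, cols, "position", 243189160L)
    println(s"scan then filter decoded $costA values")  // 12
    println(s"filter then fetch decoded $costB values") // 6
  }
}
```

With around 30 columns and only 14 matching rows, the gap between the two counts in this model grows roughly in proportion to the number of columns, which is at least consistent with the "select position" query being much faster than "select *".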
> On Jul 18, 2014, at 8:04 PM, Christos Kozanitis <kozani...@berkeley.edu> wrote:
>
> Hello
>
> What is the order with which SparkSQL deserializes parquet fields? Is it possible to modify it?
>
> I am using SparkSQL to query a parquet file that consists of a lot of fields (around 30 or so). Let me call an example table MyTable and let's suppose the name of one of its fields is "position".
>
> The query that I am executing is:
> sql("select * from MyTable where position = 243189160")
>
> The query plan that I get from this query is:
> Filter (position#6L:6 = 243189160)
>  ParquetTableScan [contig.contigName#0,contig.contigLength#1L,contig.contigMD5#2,contig.referenceURL#3,contig.assembly#4,contig.species#5,position#6L,rangeOffset#7,rangeLength#8,referenceBase#9,readBase#10,sangerQuality#11,mapQuality#12,numSoftClipped#13,numReverseStrand#14,countAtPosition#15,readName#16,readStart#17L,readEnd#18L,recordGroupSequencingCenter#19,recordGroupDescription#20,recordGroupRunDateEpoch#21L,recordGroupFlowOrder#22,recordGroupKeySequence#23,recordGroupLibrary#24,recordGroupPredictedMedianInsertSize#25,recordGroupPlatform#26,recordGroupPlatformUnit#27,recordGroupSample#28], (ParquetRelation hdfs://ec2-54-89-87-167.compute-1.amazonaws.com:9000/genomes/hg00096.plup), None
>
> I expect 14 entries in the output but the execution of .collect.foreach(println) takes forever to run on my cluster (more than an hour).
>
> Is it safe to assume in my example that SparkSQL deserializes all fields first before applying the filter? If so, can a user change this behavior?
>
> To support my assumption I replaced "*" with "position", so my new query is of the form sql("select position from MyTable where position = 243189160") and this query runs much faster on the same hardware (2-3 minutes vs 65 min).
>
> Any ideas?
>
> thanks
> Christos