Re: SparkSQL operator priority

2014-07-19 Thread Eric Friedman

Can position be null? It looks like predicate pushdown may be restricted in
that case: https://github.com/apache/spark/pull/511/

 On Jul 18, 2014, at 8:04 PM, Christos Kozanitis kozani...@berkeley.edu 
 wrote:
 
 Hello
 
 In what order does SparkSQL deserialize Parquet fields? Is it possible to
 change that order?
 
 I am using SparkSQL to query a parquet file that consists of a lot of fields 
 (around 30 or so). Let me call an example table MyTable and let's suppose the 
 name of one of its fields is position.
 
 The query that I am executing is: 
 sql("select * from MyTable where position = 243189160")
 
 The query plan that I get from this query is:
 Filter (position#6L:6 = 243189160)
  ParquetTableScan [contig.contigName#0,contig.contigLength#1L,contig.contigMD5#2,contig.referenceURL#3,contig.assembly#4,contig.species#5,position#6L,rangeOffset#7,rangeLength#8,referenceBase#9,readBase#10,sangerQuality#11,mapQuality#12,numSoftClipped#13,numReverseStrand#14,countAtPosition#15,readName#16,readStart#17L,readEnd#18L,recordGroupSequencingCenter#19,recordGroupDescription#20,recordGroupRunDateEpoch#21L,recordGroupFlowOrder#22,recordGroupKeySequence#23,recordGroupLibrary#24,recordGroupPredictedMedianInsertSize#25,recordGroupPlatform#26,recordGroupPlatformUnit#27,recordGroupSample#28], (ParquetRelation hdfs://ec2-54-89-87-167.compute-1.amazonaws.com:9000/genomes/hg00096.plup), None
 
 I expect 14 rows in the output, but running .collect.foreach(println) on
 the result takes more than an hour on my cluster.
 
 Is it safe to assume in my example that SparkSQL deserializes all fields 
 first before applying the filter? If so, can a user change this behavior?
 
 To test my assumption I replaced * with position, so the new query is
 sql("select position from MyTable where position = 243189160")
 and it runs much faster on the same hardware (2-3 minutes vs 65 minutes).
 
 Any ideas?
 
 thanks
 Christos

Re: SparkSQL operator priority

2014-07-19 Thread Christos Kozanitis

Thanks, Eric. Yes, position can be null, since most of my fields are
optional. So it seems that the problem comes from Parquet.
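
The timing gap between the two queries in this thread (select * vs select
position) can be illustrated with a toy model. The sketch below is plain
Scala with no Spark or Parquet dependency, and it is not how either project
is implemented. ToyTable, DecodeCounter, scanAllThenFilter and
filterThenFetch are hypothetical names invented for the illustration. It
only counts how many cell values get decoded when every column is read
before filtering, compared with reading the filter column first and
fetching the remaining columns for matching rows only.

```scala
// Toy column store. NOT Spark's or Parquet's implementation.
// All names below are hypothetical and exist only for this illustration.
object ColumnScanSketch {
  // Each column is stored separately, as in a columnar file format.
  final case class ToyTable(columns: Map[String, Vector[Any]]) {
    val numRows: Int = columns.values.headOption.map(_.length).getOrElse(0)
  }

  // Counts how many individual cell values were decoded.
  final class DecodeCounter { var cells: Long = 0L }

  // Strategy 1: decode every column of every row, then apply the filter.
  def scanAllThenFilter(t: ToyTable, col: String, value: Any,
                        c: DecodeCounter): Vector[Map[String, Any]] = {
    val rows = (0 until t.numRows).toVector.map { i =>
      t.columns.map { case (name, values) =>
        c.cells += 1
        name -> values(i)
      }
    }
    rows.filter(row => row(col) == value)
  }

  // Strategy 2: decode only the filter column for every row, then decode
  // the remaining columns for the rows that matched.
  def filterThenFetch(t: ToyTable, col: String, value: Any,
                      c: DecodeCounter): Vector[Map[String, Any]] = {
    val filterCol = t.columns(col)
    val matching = (0 until t.numRows).filter { i =>
      c.cells += 1
      filterCol(i) == value
    }
    matching.toVector.map { i =>
      t.columns.map { case (name, values) =>
        if (name != col) c.cells += 1 // filter column was already decoded
        name -> values(i)
      }
    }
  }

  // Small made-up table; the column names are borrowed from the query plan.
  def sampleTable(n: Int): ToyTable = ToyTable(Map(
    "position"   -> Vector.tabulate[Any](n)(i => i.toLong),
    "readName"   -> Vector.tabulate[Any](n)(i => s"read$i"),
    "mapQuality" -> Vector.tabulate[Any](n)(i => i % 60)
  ))

  def main(args: Array[String]): Unit = {
    val table = sampleTable(1000)
    val eager = new DecodeCounter
    val late  = new DecodeCounter
    val a = scanAllThenFilter(table, "position", 42L, eager)
    val b = filterThenFetch(table, "position", 42L, late)
    assert(a == b) // both strategies return the same rows
    println(s"decode all, then filter: ${eager.cells} cells")
    println(s"filter, then fetch:      ${late.cells} cells")
  }
}
```

With a selective predicate the second strategy decodes roughly one column's
worth of cells instead of all of them, and the gap widens as the number of
columns grows. Whether a given Spark and Parquet version can do this for a
nullable column is a separate question; the toy model does not answer it.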

