Re: Data source API | sizeInBytes should be moved to *Scan

2015-02-11 Thread Aniket Bhatnagar
Circling back on this. Did you get a chance to re-look at this? Thanks, Aniket. On Sun, Feb 8, 2015, 2:53 AM Aniket Bhatnagar aniket.bhatna...@gmail.com wrote: Thanks for looking into this. If this is true, isn't this an issue today? The default implementation of sizeInBytes is 1 + broadcast …

Re: Data source API | sizeInBytes should be moved to *Scan

2015-02-11 Thread Reynold Xin
Unfortunately this is not going to happen for 1.3 (as a snapshot release is already cut). We need to figure out how we are going to do cardinality estimation before implementing this. If we need to do this in the future, I think we can do it in a way that doesn't break existing APIs. Given I think this …
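One non-breaking shape such a change could take, as a rough sketch in Scala (the trait and its name are hypothetical, not anything in Spark): an opt-in trait that the planner consults when present, falling back to BaseRelation.sizeInBytes otherwise, so existing sources compile and behave as before.

    import org.apache.spark.sql.sources.Filter

    // Hypothetical opt-in trait (not a real Spark API): relations that can
    // estimate a post-pushdown size mix this in; the planner would fall back
    // to BaseRelation.sizeInBytes for relations that don't, so existing data
    // sources keep working unchanged.
    trait EstimatesScanSize {
      def estimatedScanSizeInBytes(
          requiredColumns: Array[String],
          filters: Array[Filter]): Long
    }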

Re: Data source API | sizeInBytes should be moved to *Scan

2015-02-08 Thread Aniket Bhatnagar
Thanks for looking into this. If this is true, isn't this an issue today? The default implementation of sizeInBytes is 1 + broadcast threshold. So, if catalyst's cardinality estimation applies even a small filter selectivity, the estimate will fall below the threshold and result in broadcasting the relation. Therefore, shouldn't the …
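The arithmetic behind that concern, as a small sketch (the 10 MB figure is the Spark 1.x default for spark.sql.autoBroadcastJoinThreshold; the selectivity value is purely illustrative):

    // Default sizeInBytes sits just above the broadcast threshold, which is
    // meant to keep relations of unknown size from ever being broadcast.
    val autoBroadcastJoinThreshold = 10L * 1024 * 1024      // 10 MB default
    val defaultSizeInBytes = autoBroadcastJoinThreshold + 1

    // If catalyst scales that default by an estimated filter selectivity...
    val estimatedSelectivity = 0.5                          // illustrative
    val estimatedSize = (defaultSizeInBytes * estimatedSelectivity).toLong

    // ...the estimate falls below the threshold, so the relation would be
    // broadcast even though its true size is unknown and may be huge.
    assert(estimatedSize <= autoBroadcastJoinThreshold)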

Re: Data source API | sizeInBytes should be moved to *Scan

2015-02-08 Thread Reynold Xin
We thought about this today after seeing this email. I actually built a patch for this (adding filter/column information to data source stat estimation), but ultimately dropped it due to the potential problems the change could cause. The main problem I see is that column pruning/predicate pushdowns are …
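A sketch of why tying stats to pushdown is hazardous under the 1.x contract, where pushed filters are advisory (a source may ignore any of them, and Spark re-applies them to the returned rows); handledFilter and the 0.1 per-filter selectivity are illustrative assumptions, not anything from the thread:

    import org.apache.spark.sql.sources.{Filter, GreaterThan}

    // Pretend this source can only push down ">" predicates; everything
    // else is silently left for Spark to re-evaluate after the scan.
    def handledFilter(f: Filter): Boolean = f match {
      case GreaterThan(_, _) => true
      case _                 => false
    }

    // A size estimate keyed to the full pushed-down filter list would
    // overstate the filtering whenever some filters are not handled, so
    // only the handled ones should contribute to the estimate.
    def estimatedRows(totalRows: Long, filters: Array[Filter]): Long = {
      val handled = filters.count(handledFilter)
      (totalRows * math.pow(0.1, handled)).toLong
    }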

Data source API | sizeInBytes should be moved to *Scan

2015-02-06 Thread Aniket Bhatnagar
Hi Spark SQL committers, I have started experimenting with the data sources API and I was wondering if it makes sense to move the method sizeInBytes from BaseRelation to the *Scan interfaces. This is because a relation may be able to leverage filter pushdown to estimate size, potentially making a very …
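A minimal sketch of what the proposal amounts to, assuming a variant of the Spark 1.x PrunedFilteredScan trait (the sizeInBytes overload on the scan trait is the proposed change being discussed, not the existing API):

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.Row
    import org.apache.spark.sql.sources.Filter

    // Hypothetical scan trait: the size estimate sees the same pruned
    // columns and pushed-down filters as the scan itself, instead of the
    // whole-table figure that BaseRelation.sizeInBytes reports today.
    trait SizeEstimatingScan {
      def buildScan(requiredColumns: Array[String],
                    filters: Array[Filter]): RDD[Row]

      def sizeInBytes(requiredColumns: Array[String],
                      filters: Array[Filter]): Long
    }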