Circling back on this. Did you get a chance to take another look?

Thanks,
Aniket
On Sun, Feb 8, 2015, 2:53 AM Aniket Bhatnagar <aniket.bhatna...@gmail.com> wrote:

> Thanks for looking into this. If this is true, isn't it an issue today?
> The default implementation of sizeInBytes is 1 + broadcast threshold, so
> if Catalyst's cardinality estimation estimates even a small filter
> selectivity, it will result in the relation being broadcast. Shouldn't
> the default therefore be much higher than the broadcast threshold?
>
> Also, since a default implementation of sizeInBytes already exists in
> BaseRelation, I am not sure why the same (or a similar) default
> implementation can't be provided in *Scan-specific sizeInBytes functions,
> with Catalyst always trusting the size returned by the data source API
> (the default implementation being to never broadcast). Another option
> would be to have sizeInBytes return Option[Long] so that Catalyst knows
> explicitly when the data source was able to estimate the size. The reason
> I would push for sizeInBytes in the *Scan interfaces is that at times the
> data source implementation can predict the output size more accurately.
> For example, data source implementations for MongoDB, Elasticsearch,
> Cassandra, etc. can easily use the pushed-down filters to query the
> underlying storage and predict the size. Such predictions will be more
> accurate than Catalyst's. Therefore, if it's not a fundamental change in
> Catalyst, I think this makes sense.
>
> Thanks,
> Aniket
>
> On Sat, Feb 7, 2015, 4:50 AM Reynold Xin <r...@databricks.com> wrote:
>
>> We thought about this today after seeing this email. I actually built a
>> patch for this (adding filter/column information to data source
>> statistics estimation), but ultimately dropped it due to the potential
>> problems the change could cause.
>>
>> The main problem I see is that column pruning/predicate pushdowns are
>> advisory, i.e. the data source might or might not apply those filters.
>>
>> Without significantly complicating the data source API, it is hard for
>> the optimizer (and future cardinality estimation) to know whether the
>> pushed filters/columns were actually applied, and whether to incorporate
>> that in cardinality estimation.
>>
>> Imagine this scenario: a data source applies a filter, estimates its
>> selectivity at 0.1, and reports a size that is 10% of the original.
>> Catalyst's own cardinality estimation then applies a selectivity of 0.1
>> for the same filter, so the estimated data size becomes 1% of the
>> original, falling below the broadcast threshold. Catalyst decides to
>> broadcast the table, but the actual table size is 10x that estimate.
>>
>> On Fri, Feb 6, 2015 at 3:39 AM, Aniket Bhatnagar <
>> aniket.bhatna...@gmail.com> wrote:
>>
>>> Hi Spark SQL committers,
>>>
>>> I have started experimenting with the data sources API, and I was
>>> wondering if it makes sense to move the method sizeInBytes from
>>> BaseRelation to the *Scan interfaces. This is because a relation may be
>>> able to leverage filter push-down to estimate its size, potentially
>>> making a very large relation broadcast-able. Thoughts?
>>>
>>> Aniket
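A quick sketch of the arithmetic behind the first concern in the thread, in Scala for concreteness. The 10 MB figure is an assumption (the usual default of spark.sql.autoBroadcastJoinThreshold around Spark 1.2/1.3); the threshold + 1 default is as described in the thread.

// Illustrative numbers; 10 MB assumed as the default of
// spark.sql.autoBroadcastJoinThreshold in this era.
val broadcastThreshold = 10L * 1024 * 1024

// BaseRelation's default sizeInBytes per the thread: threshold + 1,
// i.e. deliberately just over the limit so that relations of unknown
// size are never broadcast on their own.
val defaultSizeInBytes = broadcastThreshold + 1

// But any filter selectivity estimate below ~1.0 pushes the estimated
// size back under the threshold, making the relation look broadcast-able:
val selectivity = 0.9
val estimatedSize = (defaultSizeInBytes * selectivity).toLong
assert(estimatedSize <= broadcastThreshold) // broadcast would be triggered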
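And a rough sketch of what the Option[Long] proposal could look like, written against the Spark 1.3-style data sources API (where the *Scan interfaces are traits). SizeEstimatingScan, estimatedSizeInBytes, and MongoLikeRelation are hypothetical names invented for illustration, not real API; only BaseRelation, PrunedFilteredScan, and Filter exist in Spark.

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.sources.{BaseRelation, Filter, PrunedFilteredScan}

// Hypothetical extension from the thread: a scan interface that lets a
// data source report a size estimate *after* seeing the pushed-down
// filters. None means "could not estimate", so Catalyst would fall back
// to a conservative default (never broadcast) instead of guessing.
trait SizeEstimatingScan {
  def estimatedSizeInBytes(filters: Array[Filter]): Option[Long]
}

// Illustrative relation for a store like MongoDB, Elasticsearch or
// Cassandra that can answer count/size queries for a pushed-down predicate.
abstract class MongoLikeRelation extends BaseRelation
    with PrunedFilteredScan with SizeEstimatingScan {

  override def estimatedSizeInBytes(filters: Array[Filter]): Option[Long] = {
    // A real implementation would translate `filters` into a native
    // count/size query (e.g. a MongoDB count over the same predicate)
    // and multiply by the average document size. Returning None keeps
    // the conservative default.
    None
  }
}

Returning Option[Long] rather than a bare Long is what would let Catalyst distinguish "the source estimated its size after pushdown" from "fall back to the default", and in the former case skip applying its own selectivity estimate for the pushed filters, avoiding the double-counting scenario described in the thread.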