yes on LoadMetadata being how the user provides the data, and yes on only attaching to logical operators.
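
To make the idea concrete, here is a minimal sketch of the kind of rule discussed in this thread, assuming the statistics are faked. Every type in it (`JoinType`, `FakeStats`, `FakeJoin`, `MergeJoinRule`) is a hypothetical stand-in invented for illustration. None of them is Pig's real `LOJoin`, `ResourceStatistics`, `ResourceSchema`, or optimizer rule API, and a real rule would have to plug into the Pig 0.8 logical optimizer framework instead.

```java
// Hypothetical stand-ins for illustration only; none of these are Pig classes.
enum JoinType { HASH, MERGE }

// Faked per-input statistics: just the key the input is sorted on, if any.
final class FakeStats {
    final String sortKey; // null means "not known to be sorted"

    FakeStats(String sortKey) {
        this.sortKey = sortKey;
    }

    boolean sortedOn(String key) {
        return sortKey != null && sortKey.equals(key);
    }
}

// Stand-in for a logical join operator with stats attached per input.
final class FakeJoin {
    final String joinKey;
    final FakeStats[] inputStats; // an entry may be null when stats are missing
    JoinType type = JoinType.HASH;

    FakeJoin(String joinKey, FakeStats... inputStats) {
        this.joinKey = joinKey;
        this.inputStats = inputStats;
    }
}

// Rule: switch to a merge join only when both inputs are sorted on the join key.
final class MergeJoinRule {
    static boolean matches(FakeJoin join) {
        if (join.inputStats.length != 2) {
            return false;
        }
        for (FakeStats stats : join.inputStats) {
            // "If we have stats, we use them": missing stats leave the plan alone.
            if (stats == null || !stats.sortedOn(join.joinKey)) {
                return false;
            }
        }
        return true;
    }

    static void apply(FakeJoin join) {
        if (matches(join)) {
            join.type = JoinType.MERGE;
        }
    }
}
```

Calling `MergeJoinRule.apply` on a `FakeJoin` whose two inputs both report the join key as their sort key flips its type to `MERGE`. Any other case, including missing stats, leaves the default in place. Only the logical-level operator is touched, which matches the point that nothing needs to change at the physical level.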

D

On Fri, Nov 5, 2010 at 10:38 AM, Renato Marroquín Mogrovejo <[email protected]> wrote:

> @Gianmarco
> It would be in a way. A cost-based optimizer would be awesome, but when
> dealing with large amounts of data, important things such as the statistics
> needed to make accurate estimations are not that easy to get or to maintain.
> And about just hacking into the code, I guess it is my fault for not
> explaining myself. The approach would be to add a new rule to the logical
> optimizer framework that Pig 0.8 uses.
> I know it would be awesome to have a cost-based optimizer, but I guess we
> will have to get there little by little (:
>
> @Dmitriy
> I am not trying to write the whole optimizer here; I would like to add a
> rule to the logical optimizer.
> So I will try to fake the stats to isolate the optimization problem. For
> example, there is sort information in the ResourceSchema and I would need to
> check this information to change the join operation. But this is what I
> don't understand: OK, I will fake this information, but how would this
> information come from outside? I mean, would the user provide this
> information through the LoadMetadata interface?
> And when you say to attach ResourceStatistics to the operator instances,
> you are talking about any operator which needs this at the logical level
> (e.g. LOJoin), so no changes would have to be made at the physical level,
> right?
>
> Please correct me if I am mistaken. Thanks in advance.
>
> Renato M.
>
> 2010/11/4 Dmitriy Ryaboy <[email protected]>
>
> > 1. Collection is kind of a separate problem. You can write an optimizer
> > from the position of "if we have stats, we use them" and punt on this.
> > Assume there is something that provides the stats. Fake them while you
> > are dealing with the optimization problem.
> >
> > 2. Attach ResourceStatistics to the different operator instances and
> > mutate them as appropriate while walking down the operators.
> >
> > -D
> >
> > On Tue, Nov 2, 2010 at 10:09 PM, Renato Marroquín Mogrovejo <[email protected]> wrote:
> >
> > > A couple of weeks ago in a list discussion Alan suggested an
> > > interesting project to me, which consists of switching join operators
> > > based on some data properties. E.g. at logical plan compile time, a
> > > specific join operator might be chosen, but this operator may not be
> > > the most suitable for the data. For example, if both data sources are
> > > ordered by their key, then a merge join would be the best operator.
> > > But at this point I dunno how I should proceed. I have some 'general
> > > doubts' about the approach that should be taken.
> > >
> > > 1. Data statistics can be passed to the LoadFunc by using the
> > > LoadMetadata interface, right? But how should these statistics be
> > > collected? Should I modify the LOLoad class to use a different
> > > LoadFunc?
> > > 2. And how would these statistics be passed to the optimizer to change
> > > (if it were the case) the join operator?
> > >
> > > Please correct me if I am wrong (which I probably am), and any
> > > suggestions or comments are highly appreciated.
> > > Thanks in advance.
> > >
> > > Renato M.
