I don't have cycles for working on it in the next month or two. Maybe after that.
On Wed, Apr 8, 2015 at 2:16 PM, <andres.qui...@parc.com> wrote: > This is certainly very helpful, thank you. Do you have any cycles to devote > to this issue at the moment, or in the near future? > > -----Original Message----- > From: Thejas Nair [mailto:thejas.n...@gmail.com] > Sent: Wednesday, April 08, 2015 2:32 PM > To: dev > Subject: Re: Request for feedback on work intent for non-equijoin support > > Yes, the theta join paper in northeastern is a good place to start. > There is also a presentation from the folks in youtube, which is also very > useful. > I had a look at this issue as well earlier, and I had written up a rough > proposal. I had not organized the document well enough for sharing publicly, > but in case you find it useful, I have attached it to wiki - > https://cwiki.apache.org/confluence/download/attachments/27362075/theta%20join%20proposal%20-%20thejas.pdf?version=1&modificationDate=1428517702954&api=v2 > It also includes a list of some of the changes that are needed (it is > probably not comprehensive enough). > > > On Wed, Apr 8, 2015 at 5:49 AM, <andres.qui...@parc.com> wrote: >> So, I'd like to get started on this. The description in the design doc and >> the theta join paper from Northeastern seem like a good place to start, to >> have a baseline that I can later use for the more specific join algorithms I >> want to try. >> >> I created a JIRA account, and my username is Andres.Quiroz >> >> Brock, since I'm completely new to this code, could you (or anyone else) >> please point me to the relevant modules to start learning and ramping up? >> Also, please let me know if I can contact you directly for discussing this >> specific topic, or if I should always send a message to the mailing list. >> >> Thank you, >> >> Andrés >> >> -----Original Message----- >> From: andres.qui...@parc.com [mailto:andres.qui...@parc.com] >> Sent: Thursday, April 02, 2015 9:07 AM >> To: dev@hive.apache.org >> Subject: RE: Request for feedback on work intent for non-equijoin >> support >> >> This is a great pointer, Szehon and Brock, thank you. I will catch up with >> the material on theta joins and circle back. >> >> Andrés >> >> -----Original Message----- >> From: Brock Noland [mailto:br...@apache.org] >> Sent: Thursday, April 02, 2015 1:31 AM >> To: dev@hive.apache.org >> Subject: Re: Request for feedback on work intent for non-equijoin >> support >> >> Nice, it'd be great if someone finally implemented this :) >> >> On Wed, Apr 1, 2015 at 10:10 PM, Szehon Ho <sze...@cloudera.com> wrote: >>> From Hive side, there has been some thought on the subject here: >>> https://cwiki.apache.org/confluence/display/Hive/Theta+Join, it has >>> some ideas but nobody has gotten around to giving it a try. It might >>> be of interest. >>> >>> Thanks >>> Szehon >>> >>> >>> On Wed, Apr 1, 2015 at 10:05 PM, Lefty Leverenz >>> <leftylever...@gmail.com> >>> wrote: >>> >>>> D'oh! Thanks Chao. >>>> >>>> -- Lefty >>>> >>>> On Thu, Apr 2, 2015 at 12:59 AM, Chao Sun <c...@cloudera.com> wrote: >>>> >>>> > Hey Lefty, >>>> > >>>> > You need to use the ftp protocol, not http. >>>> > After clicking the link, you'll need to remove "http://" from the >>>> address >>>> > bar. >>>> > >>>> > Best, >>>> > Chao >>>> > >>>> > On Wed, Apr 1, 2015 at 9:41 PM, Lefty Leverenz >>>> > <leftylever...@gmail.com> >>>> > wrote: >>>> > >>>> > > Andrés, I followed that link and got the dread 404 Not Found: >>>> > > >>>> > > "The requested URI /pub/torres/Hiperfuse/extended_hiperfuse.pdf >>>> > > was not found on this server." >>>> > > >>>> > > -- Lefty >>>> > > >>>> > > On Wed, Apr 1, 2015 at 7:23 PM, <andres.qui...@parc.com> wrote: >>>> > > >>>> > > > Dear Lefty, >>>> > > > >>>> > > > Thank you very much for pointing that out and for your initial >>>> > pointers. >>>> > > > Here is the missing link: >>>> > > > >>>> > > > ftp.parc.com/pub/torres/Hiperfuse/extended_hiperfuse.pdf >>>> > > > >>>> > > > Regards, >>>> > > > >>>> > > > Andrés >>>> > > > >>>> > > > -----Original Message----- >>>> > > > From: Lefty Leverenz [mailto:leftylever...@gmail.com] >>>> > > > Sent: Wednesday, April 01, 2015 12:48 AM >>>> > > > To: dev@hive.apache.org >>>> > > > Subject: Re: Request for feedback on work intent for >>>> > > > non-equijoin >>>> > support >>>> > > > >>>> > > > Hello Andres, the link to your paper is missing: >>>> > > > >>>> > > > In our preliminary work, which you can find here (pointer to >>>> > > > the >>>> paper) >>>> > > ... >>>> > > > >>>> > > > >>>> > > > You can find general information about contributing to Hive in >>>> > > > the >>>> > > > wiki: Resources >>>> > > > for Contributors >>>> > > > < >>>> > > > >>>> > > >>>> > >>>> https://cwiki.apache.org/confluence/display/Hive/Home#Home-Resources >>>> f >>>> orContributors >>>> > > > > >>>> > > > , How to Contribute >>>> > > > <https://cwiki.apache.org/confluence/display/Hive/HowToContribute>. >>>> > > > >>>> > > > -- Lefty >>>> > > > >>>> > > > On Tue, Mar 31, 2015 at 10:42 PM, <andres.qui...@parc.com> wrote: >>>> > > > >>>> > > > > Dear Hive development community members, >>>> > > > > >>>> > > > > >>>> > > > > >>>> > > > > I am interested in learning more about the current support >>>> > > > > for non-equijoins in Hive and/or other Hadoop SQL engines, >>>> > > > > and in >>>> getting >>>> > > > > feedback about community interest in more extensive support >>>> > > > > for >>>> such >>>> > a >>>> > > > > feature. I intend to work on this challenge, assuming people >>>> > > > > find >>>> it >>>> > > > > compelling, and I intend to contribute results to the community. >>>> > Where >>>> > > > > possible, it would be great to receive feedback and engage >>>> > > > > in collaborations along the way (for a bit more context, see >>>> > > > > the postscript of this message). >>>> > > > > >>>> > > > > >>>> > > > > >>>> > > > > My initial goal is to support query conditions such as the >>>> following: >>>> > > > > >>>> > > > > >>>> > > > > >>>> > > > > A.x < B.y >>>> > > > > >>>> > > > > A.x in_range [B.y, B.z] >>>> > > > > >>>> > > > > distance(A.x, B.y) < D >>>> > > > > >>>> > > > > >>>> > > > > >>>> > > > > where A and B are distinct tables/files. It is my >>>> > > > > understanding >>>> that >>>> > > > > current support for performing non-equijoins like those >>>> > > > > above is >>>> > quite >>>> > > > > limited, and where some forms are supported (like in >>>> > > > > Cloudera's Impala), this support is based on doing a >>>> > > > > potentially expensive >>>> cross >>>> > > > product join. >>>> > > > > Depending on the data types involved, I believe that joins >>>> > > > > with >>>> these >>>> > > > > conditions can be made to be tractable (at least on the >>>> > > > > average) >>>> with >>>> > > > > join algorithms that exploit properties of the data types, >>>> > > > > possibly with some pre-scanning of the data. >>>> > > > > >>>> > > > > >>>> > > > > >>>> > > > > I am asking for feedback on the interest & need in the >>>> > > > > community >>>> for >>>> > > > > this work, as well as any pointers to similar work. In >>>> > > > > particular, >>>> I >>>> > > > > would appreciate any answers people could give on the >>>> > > > > following >>>> > > > questions: >>>> > > > > >>>> > > > > >>>> > > > > >>>> > > > > - Is my understanding of the state of the art in Hive and >>>> > > > > similar tools accurate? Are there groups currently working >>>> > > > > on similar or related issues, or tools that already >>>> > > > > accomplish some or all of >>>> what >>>> > I >>>> > > > have proposed? >>>> > > > > >>>> > > > > - Is there significant value to the community in the support >>>> > > > > of >>>> such >>>> > a >>>> > > > > feature? In other words, are the manual workarounds >>>> > > > > necessary >>>> because >>>> > > > > of the absence of non-equijoins such as these enough of a >>>> > > > > pain to justify the work I propose? >>>> > > > > >>>> > > > > - Being aware that the potential pre-scanning adds to the >>>> > > > > cost of >>>> the >>>> > > > > join, and that data could still blow-up in the worst case, >>>> > > > > am I missing any other important considerations and >>>> > > > > tradeoffs for this >>>> > > > problem? >>>> > > > > >>>> > > > > - What would be a good avenue to contribute this feature to >>>> > > > > the community (e.g. as a standalone tool on top of Hadoop, >>>> > > > > or as a Hive extension or plugin)? >>>> > > > > >>>> > > > > - What is the best way to get started in working with the >>>> community? >>>> > > > > >>>> > > > > >>>> > > > > >>>> > > > > Thanks for your attention and any info you can provide! >>>> > > > > >>>> > > > > >>>> > > > > >>>> > > > > Andres Quiroz >>>> > > > > >>>> > > > > >>>> > > > > >>>> > > > > P.S. If you are interested in some context, and why/how I am >>>> > proposing >>>> > > > > to do this work, please read on. >>>> > > > > >>>> > > > > >>>> > > > > >>>> > > > > I am part of a small project team at PARC working on the >>>> > > > > general problems of data integration and automated ETL. We >>>> > > > > have proposed a tool called HiperFuse that is designed to >>>> > > > > accept declarative, high-level queries in order to produce >>>> > > > > joined (fused) data sets >>>> from >>>> > > > > multiple heterogeneous raw data sources. In our preliminary >>>> > > > > work, which you can find here (pointer to the paper), we >>>> > > > > designed the architecture of the tool and obtained some >>>> > > > > results separately on >>>> the >>>> > > > > problems of automated data cleansing, data type inference, >>>> > > > > and >>>> query >>>> > > > > planning. One of the planned prototype implementations of >>>> > > > > HiperFuse relies on Hadoop MR, and because the declarative >>>> > > > > language we >>>> proposed >>>> > > > > was closely related to SQL, we thought that we could exploit >>>> > > > > the existing work in Hive and/or other open-source tools for >>>> > > > > handling >>>> the >>>> > > > > SQL part and layer our work on top of that. For example, the >>>> > > > > query given in the paper could easily be expressed in >>>> > > > > SQL-like form with >>>> a >>>> > > > > non-equijoin >>>> > > > > condition: >>>> > > > > >>>> > > > > >>>> > > > > >>>> > > > > SELECT web_access_log.ip, census.income >>>> > > > > >>>> > > > > FROM web_access_log, ip2zip, census >>>> > > > > >>>> > > > > WHERE web_access_log.ip in_range [ip2zip.ip_low, >>>> > > > > ip2zip.ip_high] >>>> > > > > >>>> > > > > AND ip2zip.zip = census.zip >>>> > > > > >>>> > > > > >>>> > > > > >>>> > > > > As you can see, the first impasse that we hit in order to >>>> > > > > bring the elements together to solve this query end-to-end >>>> > > > > was the >>>> realization >>>> > > > > and performance of the non-equality join in the query. The >>>> > > > > intent >>>> now >>>> > > > > is to tackle this problem in a general sense and provide a >>>> > > > > solution for a wide range of queries. >>>> > > > > >>>> > > > > >>>> > > > > >>>> > > > > The work I propose to do would be based on three main >>>> > > > > components within >>>> > > > > HiperFuse: >>>> > > > > >>>> > > > > >>>> > > > > >>>> > > > > - Enhancements to the extensible data type framework in >>>> > > > > HiperFuse >>>> > that >>>> > > > > would categorize data types based on the properties needed >>>> > > > > to >>>> support >>>> > > > > the join algorithms, in order to write join-ready >>>> > > > > domain-specific >>>> > data >>>> > > > > type libraries. >>>> > > > > >>>> > > > > - The join algorithms themselves, based on Hive or directly >>>> > > > > on >>>> Hadoop >>>> > > MR. >>>> > > > > >>>> > > > > - A query planner, which would determine the right algorithm >>>> > > > > to >>>> apply >>>> > > > > and automatically schedule any necessary pre-scanning of the data. >>>> > > > > >>>> > > > > >>>> > > > > >>>> > > > >>>> > > >>>> > >>>> > >>>> > >>>> > -- >>>> > Best, >>>> > Chao >>>> > >>>>