This is a great pointer, Szehon and Brock, thank you. I will catch up with the material on theta joins and circle back.
Andrés

-----Original Message-----
From: Brock Noland [mailto:br...@apache.org]
Sent: Thursday, April 02, 2015 1:31 AM
To: dev@hive.apache.org
Subject: Re: Request for feedback on work intent for non-equijoin support

Nice, it'd be great if someone finally implemented this :)

On Wed, Apr 1, 2015 at 10:10 PM, Szehon Ho <sze...@cloudera.com> wrote:
> From Hive side, there has been some thought on the subject here:
> https://cwiki.apache.org/confluence/display/Hive/Theta+Join, it has
> some ideas but nobody has gotten around to giving it a try. It might
> be of interest.
>
> Thanks
> Szehon
>
> On Wed, Apr 1, 2015 at 10:05 PM, Lefty Leverenz
> <leftylever...@gmail.com> wrote:
>
>> D'oh! Thanks Chao.
>>
>> -- Lefty
>>
>> On Thu, Apr 2, 2015 at 12:59 AM, Chao Sun <c...@cloudera.com> wrote:
>>
>> > Hey Lefty,
>> >
>> > You need to use the ftp protocol, not http.
>> > After clicking the link, you'll need to remove "http://" from the
>> > address bar.
>> >
>> > Best,
>> > Chao
>> >
>> > On Wed, Apr 1, 2015 at 9:41 PM, Lefty Leverenz
>> > <leftylever...@gmail.com> wrote:
>> >
>> > > Andrés, I followed that link and got the dread 404 Not Found:
>> > >
>> > > "The requested URI /pub/torres/Hiperfuse/extended_hiperfuse.pdf
>> > > was not found on this server."
>> > >
>> > > -- Lefty
>> > >
>> > > On Wed, Apr 1, 2015 at 7:23 PM, <andres.qui...@parc.com> wrote:
>> > >
>> > > > Dear Lefty,
>> > > >
>> > > > Thank you very much for pointing that out and for your initial
>> > > > pointers.
>> > > > Here is the missing link:
>> > > >
>> > > > ftp.parc.com/pub/torres/Hiperfuse/extended_hiperfuse.pdf
>> > > >
>> > > > Regards,
>> > > >
>> > > > Andrés
>> > > >
>> > > > -----Original Message-----
>> > > > From: Lefty Leverenz [mailto:leftylever...@gmail.com]
>> > > > Sent: Wednesday, April 01, 2015 12:48 AM
>> > > > To: dev@hive.apache.org
>> > > > Subject: Re: Request for feedback on work intent for
>> > > > non-equijoin support
>> > > >
>> > > > Hello Andres, the link to your paper is missing:
>> > > >
>> > > > In our preliminary work, which you can find here (pointer to
>> > > > the paper) ...
>> > > >
>> > > > You can find general information about contributing to Hive in
>> > > > the wiki: Resources for Contributors
>> > > > <https://cwiki.apache.org/confluence/display/Hive/Home#Home-ResourcesforContributors>,
>> > > > How to Contribute
>> > > > <https://cwiki.apache.org/confluence/display/Hive/HowToContribute>.
>> > > >
>> > > > -- Lefty
>> > > >
>> > > > On Tue, Mar 31, 2015 at 10:42 PM, <andres.qui...@parc.com> wrote:
>> > > >
>> > > > > Dear Hive development community members,
>> > > > >
>> > > > > I am interested in learning more about the current support
>> > > > > for non-equijoins in Hive and/or other Hadoop SQL engines,
>> > > > > and in getting feedback about community interest in more
>> > > > > extensive support for such a feature. I intend to work on
>> > > > > this challenge, assuming people find it compelling, and I
>> > > > > intend to contribute results to the community. Where
>> > > > > possible, it would be great to receive feedback and engage in
>> > > > > collaborations along the way (for a bit more context, see the
>> > > > > postscript of this message).
>> > > > >
>> > > > > My initial goal is to support query conditions such as the
>> > > > > following:
>> > > > >
>> > > > > A.x < B.y
>> > > > >
>> > > > > A.x in_range [B.y, B.z]
>> > > > >
>> > > > > distance(A.x, B.y) < D
>> > > > >
>> > > > > where A and B are distinct tables/files. It is my
>> > > > > understanding that current support for performing
>> > > > > non-equijoins like those above is quite limited, and where
>> > > > > some forms are supported (like in Cloudera's Impala), this
>> > > > > support is based on doing a potentially expensive cross
>> > > > > product join. Depending on the data types involved, I believe
>> > > > > that joins with these conditions can be made to be tractable
>> > > > > (at least on the average) with join algorithms that exploit
>> > > > > properties of the data types, possibly with some pre-scanning
>> > > > > of the data.
>> > > > >
>> > > > > I am asking for feedback on the interest & need in the
>> > > > > community for this work, as well as any pointers to similar
>> > > > > work. In particular, I would appreciate any answers people
>> > > > > could give on the following questions:
>> > > > >
>> > > > > - Is my understanding of the state of the art in Hive and
>> > > > > similar tools accurate? Are there groups currently working on
>> > > > > similar or related issues, or tools that already accomplish
>> > > > > some or all of what I have proposed?
>> > > > >
>> > > > > - Is there significant value to the community in the support
>> > > > > of such a feature? In other words, are the manual workarounds
>> > > > > necessary because of the absence of non-equijoins such as
>> > > > > these enough of a pain to justify the work I propose?
>> > > > >
>> > > > > - Being aware that the potential pre-scanning adds to the
>> > > > > cost of the join, and that data could still blow-up in the
>> > > > > worst case, am I missing any other important considerations
>> > > > > and tradeoffs for this problem?
>> > > > >
>> > > > > - What would be a good avenue to contribute this feature to
>> > > > > the community (e.g. as a standalone tool on top of Hadoop, or
>> > > > > as a Hive extension or plugin)?
>> > > > >
>> > > > > - What is the best way to get started in working with the
>> > > > > community?
>> > > > >
>> > > > > Thanks for your attention and any info you can provide!
>> > > > >
>> > > > > Andres Quiroz
>> > > > >
>> > > > > P.S. If you are interested in some context, and why/how I am
>> > > > > proposing to do this work, please read on.
>> > > > >
>> > > > > I am part of a small project team at PARC working on the
>> > > > > general problems of data integration and automated ETL. We
>> > > > > have proposed a tool called HiperFuse that is designed to
>> > > > > accept declarative, high-level queries in order to produce
>> > > > > joined (fused) data sets from multiple heterogeneous raw data
>> > > > > sources. In our preliminary work, which you can find here
>> > > > > (pointer to the paper), we designed the architecture of the
>> > > > > tool and obtained some results separately on the problems of
>> > > > > automated data cleansing, data type inference, and query
>> > > > > planning. One of the planned prototype implementations of
>> > > > > HiperFuse relies on Hadoop MR, and because the declarative
>> > > > > language we proposed was closely related to SQL, we thought
>> > > > > that we could exploit the existing work in Hive and/or other
>> > > > > open-source tools for handling the SQL part and layer our
>> > > > > work on top of that.
>> > > > > For example, the query given in the paper could easily be
>> > > > > expressed in SQL-like form with a non-equijoin condition:
>> > > > >
>> > > > > SELECT web_access_log.ip, census.income
>> > > > >
>> > > > > FROM web_access_log, ip2zip, census
>> > > > >
>> > > > > WHERE web_access_log.ip in_range [ip2zip.ip_low,
>> > > > > ip2zip.ip_high]
>> > > > >
>> > > > > AND ip2zip.zip = census.zip
>> > > > >
>> > > > > As you can see, the first impasse that we hit in order to
>> > > > > bring the elements together to solve this query end-to-end
>> > > > > was the realization and performance of the non-equality join
>> > > > > in the query. The intent now is to tackle this problem in a
>> > > > > general sense and provide a solution for a wide range of
>> > > > > queries.
>> > > > >
>> > > > > The work I propose to do would be based on three main
>> > > > > components within HiperFuse:
>> > > > >
>> > > > > - Enhancements to the extensible data type framework in
>> > > > > HiperFuse that would categorize data types based on the
>> > > > > properties needed to support the join algorithms, in order to
>> > > > > write join-ready domain-specific data type libraries.
>> > > > >
>> > > > > - The join algorithms themselves, based on Hive or directly
>> > > > > on Hadoop MR.
>> > > > >
>> > > > > - A query planner, which would determine the right algorithm
>> > > > > to apply and automatically schedule any necessary
>> > > > > pre-scanning of the data.
>> > > > >
>> >
>> > --
>> > Best,
>> > Chao
>> >
>>
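
An illustration for readers of the thread: the original message contrasts a cross product join with join algorithms that exploit properties of the data type after some pre-scanning. For the in_range condition on IP addresses, one way this could look is sketched below in Python. This is only a minimal sketch and is not taken from the HiperFuse paper, Hive, or Impala. It assumes the ranges do not overlap and fit in memory, and the function names and sample rows are hypothetical.

```python
from bisect import bisect_right
from ipaddress import ip_address


def range_join(points, ranges):
    """Join each point to the (low, high, ...) range that contains it.

    Assumption: ranges do not overlap, so a point matches at most one.
    Sorting the ranges once plays the role of the "pre-scan"; each
    lookup is then a binary search instead of a scan over all ranges.
    """
    ordered = sorted(ranges, key=lambda r: r[0])
    lows = [r[0] for r in ordered]
    matches = []
    for p in points:
        i = bisect_right(lows, p) - 1  # last range whose low <= p
        if i >= 0 and p <= ordered[i][1]:
            matches.append((p, ordered[i]))
    return matches


def cross_product_join(points, ranges):
    """Baseline: test every (point, range) pair."""
    return [(p, r) for p in points for r in ranges if r[0] <= p <= r[1]]


# Hypothetical sample rows: (ip_low, ip_high, zip), with IPs as integers.
ip2zip = [
    (int(ip_address("10.0.0.0")), int(ip_address("10.0.0.255")), "94304"),
    (int(ip_address("10.0.1.0")), int(ip_address("10.0.1.255")), "10001"),
]
access_log = [
    int(ip_address("10.0.1.7")),
    int(ip_address("10.0.0.42")),
    int(ip_address("192.168.0.1")),  # falls in no range, so it is dropped
]

joined = range_join(access_log, ip2zip)
```

Both functions return the same pairs on this data. The sorted version does roughly (n + m) log m comparisons rather than n * m, which is the kind of saving the message hopes for, but it depends on the data type having a total order and on the non-overlap assumption stated above.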