Hello Andres, the link to your paper is missing: In our preliminary work, which you can find here (pointer to the paper) ...
You can find general information about contributing to Hive in the wiki: Resources for Contributors <https://cwiki.apache.org/confluence/display/Hive/Home#Home-ResourcesforContributors> , How to Contribute <https://cwiki.apache.org/confluence/display/Hive/HowToContribute>. -- Lefty On Tue, Mar 31, 2015 at 10:42 PM, <andres.qui...@parc.com> wrote: > Dear Hive development community members, > > > > I am interested in learning more about the current support for > non-equijoins in Hive and/or other Hadoop SQL engines, and in getting > feedback about community interest in more extensive support for such a > feature. I intend to work on this challenge, assuming people find it > compelling, and I intend to contribute results to the community. Where > possible, it would be great to receive feedback and engage in > collaborations along the way (for a bit more context, see the postscript of > this message). > > > > My initial goal is to support query conditions such as the following: > > > > A.x < B.y > > A.x in_range [B.y, B.z] > > distance(A.x, B.y) < D > > > > where A and B are distinct tables/files. It is my understanding that > current support for performing non-equijoins like those above is quite > limited, and where some forms are supported (like in Cloudera's Impala), > this support is based on doing a potentially expensive cross product join. > Depending on the data types involved, I believe that joins with these > conditions can be made to be tractable (at least on the average) with join > algorithms that exploit properties of the data types, possibly with some > pre-scanning of the data. > > > > I am asking for feedback on the interest & need in the community for this > work, as well as any pointers to similar work. In particular, I would > appreciate any answers people could give on the following questions: > > > > - Is my understanding of the state of the art in Hive and similar tools > accurate? Are there groups currently working on similar or related issues, > or tools that already accomplish some or all of what I have proposed? > > - Is there significant value to the community in the support of such a > feature? In other words, are the manual workarounds necessary because of > the absence of non-equijoins such as these enough of a pain to justify the > work I propose? > > - Being aware that the potential pre-scanning adds to the cost of the > join, and that data could still blow-up in the worst case, am I missing any > other important considerations and tradeoffs for this problem? > > - What would be a good avenue to contribute this feature to the community > (e.g. as a standalone tool on top of Hadoop, or as a Hive extension or > plugin)? > > - What is the best way to get started in working with the community? > > > > Thanks for your attention and any info you can provide! > > > > Andres Quiroz > > > > P.S. If you are interested in some context, and why/how I am proposing to > do this work, please read on. > > > > I am part of a small project team at PARC working on the general problems > of data integration and automated ETL. We have proposed a tool called > HiperFuse that is designed to accept declarative, high-level queries in > order to produce joined (fused) data sets from multiple heterogeneous raw > data sources. In our preliminary work, which you can find here (pointer to > the paper), we designed the architecture of the tool and obtained some > results separately on the problems of automated data cleansing, data type > inference, and query planning. One of the planned prototype implementations > of HiperFuse relies on Hadoop MR, and because the declarative language we > proposed was closely related to SQL, we thought that we could exploit the > existing work in Hive and/or other open-source tools for handling the SQL > part and layer our work on top of that. For example, the query given in the > paper could easily be expressed in SQL-like form with a non-equijoin > condition: > > > > SELECT web_access_log.ip, census.income > > FROM web_access_log, ip2zip, census > > WHERE web_access_log.ip in_range [ip2zip.ip_low, ip2zip.ip_high] > > AND ip2zip.zip = census.zip > > > > As you can see, the first impasse that we hit in order to bring the > elements together to solve this query end-to-end was the realization and > performance of the non-equality join in the query. The intent now is to > tackle this problem in a general sense and provide a solution for a wide > range of queries. > > > > The work I propose to do would be based on three main components within > HiperFuse: > > > > - Enhancements to the extensible data type framework in HiperFuse that > would categorize data types based on the properties needed to support the > join algorithms, in order to write join-ready domain-specific data type > libraries. > > - The join algorithms themselves, based on Hive or directly on Hadoop MR. > > - A query planner, which would determine the right algorithm to apply and > automatically schedule any necessary pre-scanning of the data. > > >