RE: Request for feedback on work intent for non-equijoin support

Andres.Quiroz Thu, 02 Apr 2015 06:09:01 -0700

This is a great pointer, Szehon and Brock, thank you. I will catch up with the 
material on theta joins and circle back.


Andrés

-----Original Message-----
From: Brock Noland [mailto:[email protected]] 
Sent: Thursday, April 02, 2015 1:31 AM
To: [email protected]
Subject: Re: Request for feedback on work intent for non-equijoin support

Nice, it'd be great if someone finally implemented this :)

On Wed, Apr 1, 2015 at 10:10 PM, Szehon Ho <[email protected]> wrote:
> From Hive side, there has been some thought on the subject here:
> https://cwiki.apache.org/confluence/display/Hive/Theta+Join, it has 
> some ideas but nobody has gotten around to giving it a try.  It might 
> be of interest.
>
> Thanks
> Szehon
>
>
> On Wed, Apr 1, 2015 at 10:05 PM, Lefty Leverenz 
> <[email protected]>
> wrote:
>
>> D'oh!  Thanks Chao.
>>
>> -- Lefty
>>
>> On Thu, Apr 2, 2015 at 12:59 AM, Chao Sun <[email protected]> wrote:
>>
>> > Hey Lefty,
>> >
>> > You need to use the ftp protocol, not http.
>> > After clicking the link, you'll need to remove "http://"; from the
>> address
>> > bar.
>> >
>> > Best,
>> > Chao
>> >
>> > On Wed, Apr 1, 2015 at 9:41 PM, Lefty Leverenz 
>> > <[email protected]>
>> > wrote:
>> >
>> > > Andrés, I followed that link and got the dread 404 Not Found:
>> > >
>> > > "The requested URI /pub/torres/Hiperfuse/extended_hiperfuse.pdf 
>> > > was not found on this server."
>> > >
>> > > -- Lefty
>> > >
>> > > On Wed, Apr 1, 2015 at 7:23 PM, <[email protected]> wrote:
>> > >
>> > > > Dear Lefty,
>> > > >
>> > > > Thank you very much for pointing that out and for your initial
>> > pointers.
>> > > > Here is the missing link:
>> > > >
>> > > > ftp.parc.com/pub/torres/Hiperfuse/extended_hiperfuse.pdf
>> > > >
>> > > > Regards,
>> > > >
>> > > > Andrés
>> > > >
>> > > > -----Original Message-----
>> > > > From: Lefty Leverenz [mailto:[email protected]]
>> > > > Sent: Wednesday, April 01, 2015 12:48 AM
>> > > > To: [email protected]
>> > > > Subject: Re: Request for feedback on work intent for 
>> > > > non-equijoin
>> > support
>> > > >
>> > > > Hello Andres, the link to your paper is missing:
>> > > >
>> > > > In our preliminary work, which you can find here (pointer to 
>> > > > the
>> paper)
>> > > ...
>> > > >
>> > > >
>> > > > You can find general information about contributing to Hive in 
>> > > > the
>> > > > wiki:  Resources
>> > > > for Contributors
>> > > > <
>> > > >
>> > >
>> >
>> https://cwiki.apache.org/confluence/display/Hive/Home#Home-Resourcesf
>> orContributors
>> > > > >
>> > > > , How to Contribute
>> > > > <https://cwiki.apache.org/confluence/display/Hive/HowToContribute>.
>> > > >
>> > > > -- Lefty
>> > > >
>> > > > On Tue, Mar 31, 2015 at 10:42 PM, <[email protected]> wrote:
>> > > >
>> > > > >  Dear Hive development community members,
>> > > > >
>> > > > >
>> > > > >
>> > > > > I am interested in learning more about the current support 
>> > > > > for non-equijoins in Hive and/or other Hadoop SQL engines, 
>> > > > > and in
>> getting
>> > > > > feedback about community interest in more extensive support 
>> > > > > for
>> such
>> > a
>> > > > > feature. I intend to work on this challenge, assuming people 
>> > > > > find
>> it
>> > > > > compelling, and I intend to contribute results to the community.
>> > Where
>> > > > > possible, it would be great to receive feedback and engage in 
>> > > > > collaborations along the way (for a bit more context, see the 
>> > > > > postscript of this message).
>> > > > >
>> > > > >
>> > > > >
>> > > > > My initial goal is to support query conditions such as the
>> following:
>> > > > >
>> > > > >
>> > > > >
>> > > > > A.x < B.y
>> > > > >
>> > > > > A.x in_range [B.y, B.z]
>> > > > >
>> > > > > distance(A.x, B.y) < D
>> > > > >
>> > > > >
>> > > > >
>> > > > > where A and B are distinct tables/files. It is my 
>> > > > > understanding
>> that
>> > > > > current support for performing non-equijoins like those above 
>> > > > > is
>> > quite
>> > > > > limited, and where some forms are supported (like in 
>> > > > > Cloudera's Impala), this support is based on doing a 
>> > > > > potentially expensive
>> cross
>> > > > product join.
>> > > > > Depending on the data types involved, I believe that joins 
>> > > > > with
>> these
>> > > > > conditions can be made to be tractable (at least on the 
>> > > > > average)
>> with
>> > > > > join algorithms that exploit properties of the data types, 
>> > > > > possibly with some pre-scanning of the data.
>> > > > >
>> > > > >
>> > > > >
>> > > > > I am asking for feedback on the interest & need in the 
>> > > > > community
>> for
>> > > > > this work, as well as any pointers to similar work. In 
>> > > > > particular,
>> I
>> > > > > would appreciate any answers people could give on the 
>> > > > > following
>> > > > questions:
>> > > > >
>> > > > >
>> > > > >
>> > > > > - Is my understanding of the state of the art in Hive and 
>> > > > > similar tools accurate? Are there groups currently working on 
>> > > > > similar or related issues, or tools that already accomplish 
>> > > > > some or all of
>> what
>> > I
>> > > > have proposed?
>> > > > >
>> > > > > - Is there significant value to the community in the support 
>> > > > > of
>> such
>> > a
>> > > > > feature? In other words, are the manual workarounds necessary
>> because
>> > > > > of the absence of non-equijoins such as these enough of a 
>> > > > > pain to justify the work I propose?
>> > > > >
>> > > > > - Being aware that the potential pre-scanning adds to the 
>> > > > > cost of
>> the
>> > > > > join, and that data could still blow-up in the worst case, am 
>> > > > > I missing any other important considerations and tradeoffs 
>> > > > > for this
>> > > > problem?
>> > > > >
>> > > > > - What would be a good avenue to contribute this feature to 
>> > > > > the community (e.g. as a standalone tool on top of Hadoop, or 
>> > > > > as a Hive extension or plugin)?
>> > > > >
>> > > > > - What is the best way to get started in working with the
>> community?
>> > > > >
>> > > > >
>> > > > >
>> > > > > Thanks for your attention and any info you can provide!
>> > > > >
>> > > > >
>> > > > >
>> > > > > Andres Quiroz
>> > > > >
>> > > > >
>> > > > >
>> > > > > P.S. If you are interested in some context, and why/how I am
>> > proposing
>> > > > > to do this work, please read on.
>> > > > >
>> > > > >
>> > > > >
>> > > > > I am part of a small project team at PARC working on the 
>> > > > > general problems of data integration and automated ETL. We 
>> > > > > have proposed a tool called HiperFuse that is designed to 
>> > > > > accept declarative, high-level queries in order to produce 
>> > > > > joined (fused) data sets
>> from
>> > > > > multiple heterogeneous raw data sources. In our preliminary 
>> > > > > work, which you can find here (pointer to the paper), we 
>> > > > > designed the architecture of the tool and obtained some 
>> > > > > results separately on
>> the
>> > > > > problems of automated data cleansing, data type inference, 
>> > > > > and
>> query
>> > > > > planning. One of the planned prototype implementations of 
>> > > > > HiperFuse relies on Hadoop MR, and because the declarative 
>> > > > > language we
>> proposed
>> > > > > was closely related to SQL, we thought that we could exploit 
>> > > > > the existing work in Hive and/or other open-source tools for 
>> > > > > handling
>> the
>> > > > > SQL part and layer our work on top of that. For example, the 
>> > > > > query given in the paper could easily be expressed in 
>> > > > > SQL-like form with
>> a
>> > > > > non-equijoin
>> > > > > condition:
>> > > > >
>> > > > >
>> > > > >
>> > > > > SELECT web_access_log.ip, census.income
>> > > > >
>> > > > > FROM web_access_log, ip2zip, census
>> > > > >
>> > > > > WHERE web_access_log.ip in_range [ip2zip.ip_low, 
>> > > > > ip2zip.ip_high]
>> > > > >
>> > > > > AND ip2zip.zip = census.zip
>> > > > >
>> > > > >
>> > > > >
>> > > > > As you can see, the first impasse that we hit in order to 
>> > > > > bring the elements together to solve this query end-to-end 
>> > > > > was the
>> realization
>> > > > > and performance of the non-equality join in the query. The 
>> > > > > intent
>> now
>> > > > > is to tackle this problem in a general sense and provide a 
>> > > > > solution for a wide range of queries.
>> > > > >
>> > > > >
>> > > > >
>> > > > > The work I propose to do would be based on three main 
>> > > > > components within
>> > > > > HiperFuse:
>> > > > >
>> > > > >
>> > > > >
>> > > > > - Enhancements to the extensible data type framework in 
>> > > > > HiperFuse
>> > that
>> > > > > would categorize data types based on the properties needed to
>> support
>> > > > > the join algorithms, in order to write join-ready 
>> > > > > domain-specific
>> > data
>> > > > > type libraries.
>> > > > >
>> > > > > - The join algorithms themselves, based on Hive or directly 
>> > > > > on
>> Hadoop
>> > > MR.
>> > > > >
>> > > > > - A query planner, which would determine the right algorithm 
>> > > > > to
>> apply
>> > > > > and automatically schedule any necessary pre-scanning of the data.
>> > > > >
>> > > > >
>> > > > >
>> > > >
>> > >
>> >
>> >
>> >
>> > --
>> > Best,
>> > Chao
>> >
>>

RE: Request for feedback on work intent for non-equijoin support

Reply via email to