Re: Request for feedback on work intent for non-equijoin support

Szehon Ho Wed, 01 Apr 2015 22:11:50 -0700

>From Hive side, there has been some thought on the subject here:
https://cwiki.apache.org/confluence/display/Hive/Theta+Join, it has some
ideas but nobody has gotten around to giving it a try.  It might be of
interest.


Thanks
Szehon


On Wed, Apr 1, 2015 at 10:05 PM, Lefty Leverenz <[email protected]>
wrote:

> D'oh!  Thanks Chao.
>
> -- Lefty
>
> On Thu, Apr 2, 2015 at 12:59 AM, Chao Sun <[email protected]> wrote:
>
> > Hey Lefty,
> >
> > You need to use the ftp protocol, not http.
> > After clicking the link, you'll need to remove "http://"; from the
> address
> > bar.
> >
> > Best,
> > Chao
> >
> > On Wed, Apr 1, 2015 at 9:41 PM, Lefty Leverenz <[email protected]>
> > wrote:
> >
> > > Andrés, I followed that link and got the dread 404 Not Found:
> > >
> > > "The requested URI /pub/torres/Hiperfuse/extended_hiperfuse.pdf was not
> > > found on this server."
> > >
> > > -- Lefty
> > >
> > > On Wed, Apr 1, 2015 at 7:23 PM, <[email protected]> wrote:
> > >
> > > > Dear Lefty,
> > > >
> > > > Thank you very much for pointing that out and for your initial
> > pointers.
> > > > Here is the missing link:
> > > >
> > > > ftp.parc.com/pub/torres/Hiperfuse/extended_hiperfuse.pdf
> > > >
> > > > Regards,
> > > >
> > > > Andrés
> > > >
> > > > -----Original Message-----
> > > > From: Lefty Leverenz [mailto:[email protected]]
> > > > Sent: Wednesday, April 01, 2015 12:48 AM
> > > > To: [email protected]
> > > > Subject: Re: Request for feedback on work intent for non-equijoin
> > support
> > > >
> > > > Hello Andres, the link to your paper is missing:
> > > >
> > > > In our preliminary work, which you can find here (pointer to the
> paper)
> > > ...
> > > >
> > > >
> > > > You can find general information about contributing to Hive in the
> > > > wiki:  Resources
> > > > for Contributors
> > > > <
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/Hive/Home#Home-ResourcesforContributors
> > > > >
> > > > , How to Contribute
> > > > <https://cwiki.apache.org/confluence/display/Hive/HowToContribute>.
> > > >
> > > > -- Lefty
> > > >
> > > > On Tue, Mar 31, 2015 at 10:42 PM, <[email protected]> wrote:
> > > >
> > > > >  Dear Hive development community members,
> > > > >
> > > > >
> > > > >
> > > > > I am interested in learning more about the current support for
> > > > > non-equijoins in Hive and/or other Hadoop SQL engines, and in
> getting
> > > > > feedback about community interest in more extensive support for
> such
> > a
> > > > > feature. I intend to work on this challenge, assuming people find
> it
> > > > > compelling, and I intend to contribute results to the community.
> > Where
> > > > > possible, it would be great to receive feedback and engage in
> > > > > collaborations along the way (for a bit more context, see the
> > > > > postscript of this message).
> > > > >
> > > > >
> > > > >
> > > > > My initial goal is to support query conditions such as the
> following:
> > > > >
> > > > >
> > > > >
> > > > > A.x < B.y
> > > > >
> > > > > A.x in_range [B.y, B.z]
> > > > >
> > > > > distance(A.x, B.y) < D
> > > > >
> > > > >
> > > > >
> > > > > where A and B are distinct tables/files. It is my understanding
> that
> > > > > current support for performing non-equijoins like those above is
> > quite
> > > > > limited, and where some forms are supported (like in Cloudera's
> > > > > Impala), this support is based on doing a potentially expensive
> cross
> > > > product join.
> > > > > Depending on the data types involved, I believe that joins with
> these
> > > > > conditions can be made to be tractable (at least on the average)
> with
> > > > > join algorithms that exploit properties of the data types, possibly
> > > > > with some pre-scanning of the data.
> > > > >
> > > > >
> > > > >
> > > > > I am asking for feedback on the interest & need in the community
> for
> > > > > this work, as well as any pointers to similar work. In particular,
> I
> > > > > would appreciate any answers people could give on the following
> > > > questions:
> > > > >
> > > > >
> > > > >
> > > > > - Is my understanding of the state of the art in Hive and similar
> > > > > tools accurate? Are there groups currently working on similar or
> > > > > related issues, or tools that already accomplish some or all of
> what
> > I
> > > > have proposed?
> > > > >
> > > > > - Is there significant value to the community in the support of
> such
> > a
> > > > > feature? In other words, are the manual workarounds necessary
> because
> > > > > of the absence of non-equijoins such as these enough of a pain to
> > > > > justify the work I propose?
> > > > >
> > > > > - Being aware that the potential pre-scanning adds to the cost of
> the
> > > > > join, and that data could still blow-up in the worst case, am I
> > > > > missing any other important considerations and tradeoffs for this
> > > > problem?
> > > > >
> > > > > - What would be a good avenue to contribute this feature to the
> > > > > community (e.g. as a standalone tool on top of Hadoop, or as a Hive
> > > > > extension or plugin)?
> > > > >
> > > > > - What is the best way to get started in working with the
> community?
> > > > >
> > > > >
> > > > >
> > > > > Thanks for your attention and any info you can provide!
> > > > >
> > > > >
> > > > >
> > > > > Andres Quiroz
> > > > >
> > > > >
> > > > >
> > > > > P.S. If you are interested in some context, and why/how I am
> > proposing
> > > > > to do this work, please read on.
> > > > >
> > > > >
> > > > >
> > > > > I am part of a small project team at PARC working on the general
> > > > > problems of data integration and automated ETL. We have proposed a
> > > > > tool called HiperFuse that is designed to accept declarative,
> > > > > high-level queries in order to produce joined (fused) data sets
> from
> > > > > multiple heterogeneous raw data sources. In our preliminary work,
> > > > > which you can find here (pointer to the paper), we designed the
> > > > > architecture of the tool and obtained some results separately on
> the
> > > > > problems of automated data cleansing, data type inference, and
> query
> > > > > planning. One of the planned prototype implementations of HiperFuse
> > > > > relies on Hadoop MR, and because the declarative language we
> proposed
> > > > > was closely related to SQL, we thought that we could exploit the
> > > > > existing work in Hive and/or other open-source tools for handling
> the
> > > > > SQL part and layer our work on top of that. For example, the query
> > > > > given in the paper could easily be expressed in SQL-like form with
> a
> > > > > non-equijoin
> > > > > condition:
> > > > >
> > > > >
> > > > >
> > > > > SELECT web_access_log.ip, census.income
> > > > >
> > > > > FROM web_access_log, ip2zip, census
> > > > >
> > > > > WHERE web_access_log.ip in_range [ip2zip.ip_low, ip2zip.ip_high]
> > > > >
> > > > > AND ip2zip.zip = census.zip
> > > > >
> > > > >
> > > > >
> > > > > As you can see, the first impasse that we hit in order to bring the
> > > > > elements together to solve this query end-to-end was the
> realization
> > > > > and performance of the non-equality join in the query. The intent
> now
> > > > > is to tackle this problem in a general sense and provide a solution
> > > > > for a wide range of queries.
> > > > >
> > > > >
> > > > >
> > > > > The work I propose to do would be based on three main components
> > > > > within
> > > > > HiperFuse:
> > > > >
> > > > >
> > > > >
> > > > > - Enhancements to the extensible data type framework in HiperFuse
> > that
> > > > > would categorize data types based on the properties needed to
> support
> > > > > the join algorithms, in order to write join-ready domain-specific
> > data
> > > > > type libraries.
> > > > >
> > > > > - The join algorithms themselves, based on Hive or directly on
> Hadoop
> > > MR.
> > > > >
> > > > > - A query planner, which would determine the right algorithm to
> apply
> > > > > and automatically schedule any necessary pre-scanning of the data.
> > > > >
> > > > >
> > > > >
> > > >
> > >
> >
> >
> >
> > --
> > Best,
> > Chao
> >
>

Re: Request for feedback on work intent for non-equijoin support

Reply via email to