Re: Request for feedback on work intent for non-equijoin support

Thejas Nair Wed, 08 Apr 2015 17:49:23 -0700

I don't have cycles for working on it in the next month or two. Maybe
after that.



On Wed, Apr 8, 2015 at 2:16 PM,  <andres.qui...@parc.com> wrote:
> This is certainly very helpful, thank you. Do you have any cycles to devote 
> to this issue at the moment, or in the near future?
>
> -----Original Message-----
> From: Thejas Nair [mailto:thejas.n...@gmail.com]
> Sent: Wednesday, April 08, 2015 2:32 PM
> To: dev
> Subject: Re: Request for feedback on work intent for non-equijoin support
>
> Yes, the theta join paper from Northeastern is a good place to start.
> There is also a presentation from the folks on YouTube, which is also very 
> useful.
> I looked at this issue earlier as well, and I wrote up a rough 
> proposal.  I had not organized the document well enough to share it publicly, 
> but in case you find it useful, I have attached it to the wiki: 
> https://cwiki.apache.org/confluence/download/attachments/27362075/theta%20join%20proposal%20-%20thejas.pdf?version=1&modificationDate=1428517702954&api=v2
> It also includes a list of some of the changes that are needed (it is 
> probably not comprehensive enough).
>
>
> On Wed, Apr 8, 2015 at 5:49 AM,  <andres.qui...@parc.com> wrote:
>> So, I'd like to get started on this. The description in the design doc and 
>> the theta join paper from Northeastern seem like a good place to start, to 
>> have a baseline that I can later use for the more specific join algorithms I 
>> want to try.
>>
>> I created a JIRA account, and my username is Andres.Quiroz
>>
>> Brock, since I'm completely new to this code, could you (or anyone else) 
>> please point me to the relevant modules to start learning and ramping up? 
>> Also, please let me know if I can contact you directly for discussing this 
>> specific topic, or if I should always send a message to the mailing list.
>>
>> Thank you,
>>
>> Andrés
>>
>> -----Original Message-----
>> From: andres.qui...@parc.com [mailto:andres.qui...@parc.com]
>> Sent: Thursday, April 02, 2015 9:07 AM
>> To: dev@hive.apache.org
>> Subject: RE: Request for feedback on work intent for non-equijoin
>> support
>>
>> This is a great pointer, Szehon and Brock, thank you. I will catch up with 
>> the material on theta joins and circle back.
>>
>> Andrés
>>
>> -----Original Message-----
>> From: Brock Noland [mailto:br...@apache.org]
>> Sent: Thursday, April 02, 2015 1:31 AM
>> To: dev@hive.apache.org
>> Subject: Re: Request for feedback on work intent for non-equijoin
>> support
>>
>> Nice, it'd be great if someone finally implemented this :)
>>
>> On Wed, Apr 1, 2015 at 10:10 PM, Szehon Ho <sze...@cloudera.com> wrote:
>>> From Hive side, there has been some thought on the subject here:
>>> https://cwiki.apache.org/confluence/display/Hive/Theta+Join, it has
>>> some ideas but nobody has gotten around to giving it a try.  It might
>>> be of interest.
>>>
>>> Thanks
>>> Szehon
>>>
>>>
>>> On Wed, Apr 1, 2015 at 10:05 PM, Lefty Leverenz
>>> <leftylever...@gmail.com>
>>> wrote:
>>>
>>>> D'oh!  Thanks Chao.
>>>>
>>>> -- Lefty
>>>>
>>>> On Thu, Apr 2, 2015 at 12:59 AM, Chao Sun <c...@cloudera.com> wrote:
>>>>
>>>> > Hey Lefty,
>>>> >
>>>> > You need to use the ftp protocol, not http.
>>>> > After clicking the link, you'll need to remove "http://" from the
>>>> > address bar.
>>>> >
>>>> > Best,
>>>> > Chao
>>>> >
>>>> > On Wed, Apr 1, 2015 at 9:41 PM, Lefty Leverenz
>>>> > <leftylever...@gmail.com>
>>>> > wrote:
>>>> >
>>>> > > Andrés, I followed that link and got the dread 404 Not Found:
>>>> > >
>>>> > > "The requested URI /pub/torres/Hiperfuse/extended_hiperfuse.pdf
>>>> > > was not found on this server."
>>>> > >
>>>> > > -- Lefty
>>>> > >
>>>> > > On Wed, Apr 1, 2015 at 7:23 PM, <andres.qui...@parc.com> wrote:
>>>> > >
>>>> > > > Dear Lefty,
>>>> > > >
>>>> > > > Thank you very much for pointing that out and for your initial
>>>> > > > pointers.
>>>> > > > Here is the missing link:
>>>> > > >
>>>> > > > ftp.parc.com/pub/torres/Hiperfuse/extended_hiperfuse.pdf
>>>> > > >
>>>> > > > Regards,
>>>> > > >
>>>> > > > Andrés
>>>> > > >
>>>> > > > -----Original Message-----
>>>> > > > From: Lefty Leverenz [mailto:leftylever...@gmail.com]
>>>> > > > Sent: Wednesday, April 01, 2015 12:48 AM
>>>> > > > To: dev@hive.apache.org
>>>> > > > Subject: Re: Request for feedback on work intent for non-equijoin support
>>>> > > >
>>>> > > > Hello Andres, the link to your paper is missing:
>>>> > > >
>>>> > > > In our preliminary work, which you can find here (pointer to
>>>> > > > the paper) ...
>>>> > > >
>>>> > > >
>>>> > > > You can find general information about contributing to Hive in
>>>> > > > the wiki:  Resources for Contributors
>>>> > > > <https://cwiki.apache.org/confluence/display/Hive/Home#Home-ResourcesforContributors>,
>>>> > > > How to Contribute
>>>> > > > <https://cwiki.apache.org/confluence/display/Hive/HowToContribute>.
>>>> > > >
>>>> > > > -- Lefty
>>>> > > >
>>>> > > > On Tue, Mar 31, 2015 at 10:42 PM, <andres.qui...@parc.com> wrote:
>>>> > > >
>>>> > > > >  Dear Hive development community members,
>>>> > > > >
>>>> > > > >
>>>> > > > >
>>>> > > > > I am interested in learning more about the current support
>>>> > > > > for non-equijoins in Hive and/or other Hadoop SQL engines,
>>>> > > > > and in getting feedback about community interest in more
>>>> > > > > extensive support for such a feature. I intend to work on
>>>> > > > > this challenge, assuming people find it compelling, and I
>>>> > > > > intend to contribute results to the community. Where
>>>> > > > > possible, it would be great to receive feedback and engage
>>>> > > > > in collaborations along the way (for a bit more context, see
>>>> > > > > the postscript of this message).
>>>> > > > >
>>>> > > > >
>>>> > > > >
>>>> > > > > My initial goal is to support query conditions such as the
>>>> > > > > following:
>>>> > > > >
>>>> > > > >
>>>> > > > >
>>>> > > > > A.x < B.y
>>>> > > > >
>>>> > > > > A.x in_range [B.y, B.z]
>>>> > > > >
>>>> > > > > distance(A.x, B.y) < D
>>>> > > > >
>>>> > > > >
>>>> > > > >
>>>> > > > > where A and B are distinct tables/files. It is my
>>>> > > > > understanding that current support for performing
>>>> > > > > non-equijoins like those above is quite limited, and where
>>>> > > > > some forms are supported (like in Cloudera's Impala), this
>>>> > > > > support is based on doing a potentially expensive cross
>>>> > > > > product join. Depending on the data types involved, I
>>>> > > > > believe that joins with these conditions can be made to be
>>>> > > > > tractable (at least on the average) with join algorithms
>>>> > > > > that exploit properties of the data types, possibly with
>>>> > > > > some pre-scanning of the data.
>>>> > > > >
>>>> > > > >
>>>> > > > >
>>>> > > > > I am asking for feedback on the interest & need in the
>>>> > > > > community for this work, as well as any pointers to similar
>>>> > > > > work. In particular, I would appreciate any answers people
>>>> > > > > could give on the following questions:
>>>> > > > >
>>>> > > > >
>>>> > > > >
>>>> > > > > - Is my understanding of the state of the art in Hive and
>>>> > > > > similar tools accurate? Are there groups currently working
>>>> > > > > on similar or related issues, or tools that already
>>>> > > > > accomplish some or all of what I have proposed?
>>>> > > > >
>>>> > > > > - Is there significant value to the community in the support
>>>> > > > > of such a feature? In other words, are the manual
>>>> > > > > workarounds necessary because of the absence of
>>>> > > > > non-equijoins such as these enough of a pain to justify the
>>>> > > > > work I propose?
>>>> > > > >
>>>> > > > > - Being aware that the potential pre-scanning adds to the
>>>> > > > > cost of the join, and that data could still blow up in the
>>>> > > > > worst case, am I missing any other important considerations
>>>> > > > > and tradeoffs for this problem?
>>>> > > > >
>>>> > > > > - What would be a good avenue to contribute this feature to
>>>> > > > > the community (e.g. as a standalone tool on top of Hadoop,
>>>> > > > > or as a Hive extension or plugin)?
>>>> > > > >
>>>> > > > > - What is the best way to get started in working with the
>>>> > > > > community?
>>>> > > > >
>>>> > > > >
>>>> > > > >
>>>> > > > > Thanks for your attention and any info you can provide!
>>>> > > > >
>>>> > > > >
>>>> > > > >
>>>> > > > > Andres Quiroz
>>>> > > > >
>>>> > > > >
>>>> > > > >
>>>> > > > > P.S. If you are interested in some context, and why/how I am
>>>> > > > > proposing to do this work, please read on.
>>>> > > > >
>>>> > > > >
>>>> > > > >
>>>> > > > > I am part of a small project team at PARC working on the
>>>> > > > > general problems of data integration and automated ETL. We
>>>> > > > > have proposed a tool called HiperFuse that is designed to
>>>> > > > > accept declarative, high-level queries in order to produce
>>>> > > > > joined (fused) data sets from multiple heterogeneous raw
>>>> > > > > data sources. In our preliminary work, which you can find
>>>> > > > > here (pointer to the paper), we designed the architecture of
>>>> > > > > the tool and obtained some results separately on the
>>>> > > > > problems of automated data cleansing, data type inference,
>>>> > > > > and query planning. One of the planned prototype
>>>> > > > > implementations of HiperFuse relies on Hadoop MR, and
>>>> > > > > because the declarative language we proposed was closely
>>>> > > > > related to SQL, we thought that we could exploit the
>>>> > > > > existing work in Hive and/or other open-source tools for
>>>> > > > > handling the SQL part and layer our work on top of that. For
>>>> > > > > example, the query given in the paper could easily be
>>>> > > > > expressed in SQL-like form with a non-equijoin condition:
>>>> > > > >
>>>> > > > >
>>>> > > > >
>>>> > > > > SELECT web_access_log.ip, census.income
>>>> > > > >
>>>> > > > > FROM web_access_log, ip2zip, census
>>>> > > > >
>>>> > > > > WHERE web_access_log.ip in_range [ip2zip.ip_low, ip2zip.ip_high]
>>>> > > > >
>>>> > > > > AND ip2zip.zip = census.zip
>>>> > > > >
>>>> > > > >
>>>> > > > >
>>>> > > > > As you can see, the first impasse that we hit in order to
>>>> > > > > bring the elements together to solve this query end-to-end
>>>> > > > > was the realization and performance of the non-equality join
>>>> > > > > in the query. The intent now is to tackle this problem in a
>>>> > > > > general sense and provide a solution for a wide range of
>>>> > > > > queries.
>>>> > > > >
>>>> > > > >
>>>> > > > >
>>>> > > > > The work I propose to do would be based on three main
>>>> > > > > components within HiperFuse:
>>>> > > > >
>>>> > > > >
>>>> > > > >
>>>> > > > > - Enhancements to the extensible data type framework in
>>>> > > > > HiperFuse that would categorize data types based on the
>>>> > > > > properties needed to support the join algorithms, in order
>>>> > > > > to write join-ready domain-specific data type libraries.
>>>> > > > >
>>>> > > > > - The join algorithms themselves, based on Hive or directly
>>>> > > > > on Hadoop MR.
>>>> > > > >
>>>> > > > > - A query planner, which would determine the right algorithm
>>>> > > > > to apply and automatically schedule any necessary
>>>> > > > > pre-scanning of the data.
>>>> > > > >
>>>> > > > >
>>>> > > > >
>>>> > > >
>>>> > >
>>>> >
>>>> >
>>>> >
>>>> > --
>>>> > Best,
>>>> > Chao
>>>> >
>>>>
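
For readers following the thread, the in_range condition from the original
message (web_access_log.ip in_range [ip2zip.ip_low, ip2zip.ip_high]) can be
illustrated with a small single-machine sketch. This is not Hive code and not
the HiperFuse or theta-join-paper algorithm. It only contrasts the cross
product approach mentioned above with a sorted lookup that exploits the
ordering of the IP data type. The function names, the sample rows, and the
assumption that the ip2zip ranges do not overlap are all hypothetical.

```python
import bisect
import ipaddress


def ip_to_int(ip):
    """Map a dotted IP string to an integer so ranges can be ordered."""
    return int(ipaddress.ip_address(ip))


def cross_product_join(log_ips, ip2zip):
    """Baseline: test every (ip, range) pair, O(len(log_ips) * len(ip2zip))."""
    return [
        (ip, zip_code)
        for ip in log_ips
        for low, high, zip_code in ip2zip
        if ip_to_int(low) <= ip_to_int(ip) <= ip_to_int(high)
    ]


def sorted_range_join(log_ips, ip2zip):
    """Pre-scan: sort the ranges once, then binary-search each IP.

    Assumes (hypothetically) that the ranges in ip2zip do not overlap,
    so each IP matches at most one range.
    """
    ranges = sorted((ip_to_int(lo), ip_to_int(hi), z) for lo, hi, z in ip2zip)
    lows = [r[0] for r in ranges]
    matches = []
    for ip in log_ips:
        x = ip_to_int(ip)
        # Index of the last range whose lower bound is <= x, or -1 if none.
        i = bisect.bisect_right(lows, x) - 1
        if i >= 0 and x <= ranges[i][1]:
            matches.append((ip, ranges[i][2]))
    return matches


# Hypothetical sample rows standing in for the three tables in the query.
web_access_log = ["10.0.0.5", "10.0.1.200", "192.168.1.1"]
ip2zip = [
    ("10.0.0.0", "10.0.0.255", "94304"),
    ("10.0.1.0", "10.0.1.255", "02115"),
]
census = {"94304": 120000, "02115": 65000}

matched = sorted_range_join(web_access_log, ip2zip)
# The remaining condition, ip2zip.zip = census.zip, is an ordinary equijoin.
fused = [(ip, census[z]) for ip, z in matched if z in census]
```

Both functions return the same pairs on this input. The sorted version pays a
one-time sort of the range table (the "pre-scanning" cost raised in the
original message) in exchange for a logarithmic lookup per log row.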
