Re: Request for feedback on work intent for non-equijoin support

2015-05-15 Thread Andres.Quiroz
Hello,

At this point, I have implemented a standalone version of the
1-Bucket-Theta join algorithm described in the Northeastern paper on
Hadoop MR, and would like to start porting it to Hive.
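
For readers unfamiliar with the algorithm named above, here is a minimal
single-machine sketch of the 1-Bucket-Theta idea. It is an illustration
only, not the standalone implementation mentioned in this message. The
function name, the uniform grid tiling, and the parameters grid_rows,
grid_cols and seed are assumptions made for this example. Each grid cell
stands in for one reducer.

```python
import random
from collections import defaultdict


def one_bucket_theta(S, T, theta, grid_rows, grid_cols, seed=0):
    """Simulate a 1-Bucket-Theta style join on a single machine.

    The |S| x |T| join matrix is covered by a grid_rows x grid_cols grid
    of regions (a simplification assumed here); each region plays the
    role of one reducer.
    """
    rng = random.Random(seed)
    # region id -> (S tuples received, T tuples received)
    regions = defaultdict(lambda: ([], []))

    # "Map" side: every S tuple picks a random row band and is replicated
    # to all regions in that band; every T tuple does the same with a
    # random column band.
    for s in S:
        r = rng.randrange(grid_rows)
        for c in range(grid_cols):
            regions[(r, c)][0].append(s)
    for t in T:
        c = rng.randrange(grid_cols)
        for r in range(grid_rows):
            regions[(r, c)][1].append(t)

    # "Reduce" side: each region evaluates the arbitrary predicate theta
    # only on the tuples routed to it.
    out = []
    for s_part, t_part in regions.values():
        out.extend((s, t) for s in s_part for t in t_part if theta(s, t))
    return out


# Example: an inequality join over two small integer lists.
pairs = one_bucket_theta(range(6), range(6), lambda s, t: s < t, 2, 3)
print(len(pairs), "matching pairs")
```

Because an S tuple lands in one row band and a T tuple in one column
band, every (s, t) pair meets in exactly one region, so the output does
not depend on the random choices. The randomness only balances the load
across regions.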

I have been looking at the code and believe that the main goal would be to
implement a new JoinOperator. However, it's still not very clear to me how
this class interacts with the rest of the platform (i.e., how it fits into
the overall query processing workflow).

Could someone please provide or point me to a crash course on implementing
a join operator? If nothing else, a list of steps and other classes that I
may have to touch or add would be a very helpful starting point.

Also, I suppose Tez is preferred for the implementation, right?

Thanks for your help,

Andrés

On 4/8/15, 2:32 PM, Thejas Nair thejas.n...@gmail.com wrote:

Yes, the theta join paper from Northeastern is a good place to start.
There is also a presentation from the folks on YouTube, which is also
very useful.
I had a look at this issue as well earlier, and I had written up a
rough proposal.  I had not organized the document well enough for
sharing publicly, but in case you find it useful, I have attached it
to the wiki:
https://cwiki.apache.org/confluence/download/attachments/27362075/theta%20join%20proposal%20-%20thejas.pdf?version=1&modificationDate=1428517702954&api=v2
It also includes a list of some of the changes that are needed (it is
probably not comprehensive enough).


On Wed, Apr 8, 2015 at 5:49 AM,  andres.qui...@parc.com wrote:
 So, I'd like to get started on this. The description in the design doc
and the theta join paper from Northeastern seem like a good place to
start, to have a baseline that I can later use for the more specific
join algorithms I want to try.

 I created a JIRA account, and my username is Andres.Quiroz

 Brock, since I'm completely new to this code, could you (or anyone
else) please point me to the relevant modules to start learning and
ramping up? Also, please let me know if I can contact you directly for
discussing this specific topic, or if I should always send a message to
the mailing list.

 Thank you,

 Andrés

 -Original Message-
 From: andres.qui...@parc.com [mailto:andres.qui...@parc.com]
 Sent: Thursday, April 02, 2015 9:07 AM
 To: dev@hive.apache.org
 Subject: RE: Request for feedback on work intent for non-equijoin
support

 This is a great pointer, Szehon and Brock, thank you. I will catch up
with the material on theta joins and circle back.

 Andrés

 -Original Message-
 From: Brock Noland [mailto:br...@apache.org]
 Sent: Thursday, April 02, 2015 1:31 AM
 To: dev@hive.apache.org
 Subject: Re: Request for feedback on work intent for non-equijoin
support

 Nice, it'd be great if someone finally implemented this :)

 On Wed, Apr 1, 2015 at 10:10 PM, Szehon Ho sze...@cloudera.com wrote:
 From Hive side, there has been some thought on the subject here:
 https://cwiki.apache.org/confluence/display/Hive/Theta+Join, it has
 some ideas but nobody has gotten around to giving it a try.  It might
 be of interest.

 Thanks
 Szehon


 On Wed, Apr 1, 2015 at 10:05 PM, Lefty Leverenz
 leftylever...@gmail.com
 wrote:

 D'oh!  Thanks Chao.

 -- Lefty

 On Thu, Apr 2, 2015 at 12:59 AM, Chao Sun c...@cloudera.com wrote:

  Hey Lefty,
 
  You need to use the ftp protocol, not http.
  After clicking the link, you'll need to remove "http://" from the
  address bar.
 
  Best,
  Chao
 
  On Wed, Apr 1, 2015 at 9:41 PM, Lefty Leverenz
  leftylever...@gmail.com
  wrote:
 
   Andrés, I followed that link and got the dread 404 Not Found:
  
   The requested URI /pub/torres/Hiperfuse/extended_hiperfuse.pdf
   was not found on this server.
  
   -- Lefty
  
   On Wed, Apr 1, 2015 at 7:23 PM, andres.qui...@parc.com wrote:
  
Dear Lefty,
   
Thank you very much for pointing that out and for your initial
  pointers.
Here is the missing link:
   
ftp.parc.com/pub/torres/Hiperfuse/extended_hiperfuse.pdf
   
Regards,
   
Andrés
   
-Original Message-
From: Lefty Leverenz [mailto:leftylever...@gmail.com]
Sent: Wednesday, April 01, 2015 12:48 AM
To: dev@hive.apache.org
Subject: Re: Request for feedback on work intent for
non-equijoin
  support
   
Hello Andres, the link to your paper is missing:
   
In our preliminary work, which you can find here (pointer to
the
 paper)
   ...
   
   
You can find general information about contributing to Hive in the wiki:
Resources for Contributors
https://cwiki.apache.org/confluence/display/Hive/Home#Home-ResourcesforContributors
and How to Contribute
https://cwiki.apache.org/confluence/display/Hive/HowToContribute.
   
-- Lefty
   
On Tue, Mar 31, 2015 at 10:42 PM, andres.qui...@parc.com
wrote:
   
  Dear Hive development community members,



 I am interested in learning more about the current support
 for non-equijoins in Hive and/or other 

Re: Request for feedback on work intent for non-equijoin support

2015-05-15 Thread Thejas Nair
Hi Andres,
Glad to hear about the progress!

Vikram is a Hive join implementation expert. He can guide you through this.
We can set up a WebEx or Google Hangout and discuss this. Does some time
next week work for you? (Please let us know some hours that work for
you, in the Pacific time zone.)

Anybody else who is interested in the theta join work is also welcome
to join the discussion. Please let me know.

Thanks,
Thejas



Re: Request for feedback on work intent for non-equijoin support

2015-05-15 Thread Andres.Quiroz
Ok, that would be great! Except for Monday and Friday, I could meet any
day next week in the afternoon (Pacific time), since it is the end of the
day for me. 

Thanks a lot,

Andrés


RE: Request for feedback on work intent for non-equijoin support

2015-04-08 Thread Andres.Quiroz
So, I'd like to get started on this. The description in the design doc and the 
theta join paper from Northeastern seem like a good place to start, to have a 
baseline that I can later use for the more specific join algorithms I want to 
try. 

I created a JIRA account, and my username is Andres.Quiroz

Brock, since I'm completely new to this code, could you (or anyone else) please 
point me to the relevant modules to start learning and ramping up? Also, please 
let me know if I can contact you directly for discussing this specific topic, 
or if I should always send a message to the mailing list.

Thank you,

Andrés

On Tue, Mar 31, 2015 at 10:42 PM, andres.qui...@parc.com wrote:
   
  Dear Hive development community members,



 I am interested in learning more about the current support for
 non-equijoins in Hive and/or other Hadoop SQL engines, and in getting
 feedback about community interest in more extensive support for such a
 feature. I intend to work on this challenge, assuming people find it
 compelling, and I intend to contribute results to the community. Where
 possible, it would be great to receive feedback and engage in
 collaborations along the way (for a bit more context, see the
 postscript of this message).



 My initial goal is to support query conditions such as the
 following:



 A.x  B.y

 A.x in_range [B.y, B.z]

 distance(A.x, B.y)  D



 where A and B are distinct tables/files. It is my understanding that
 current support for performing non-equijoins like those above is quite
 limited, and where some forms are supported (like in Cloudera's
 Impala), this support is based on doing a potentially expensive cross
 product join. Depending on the data types involved, I believe that
 joins with these conditions can be made to be tractable (at least on
 the average) with join algorithms that exploit properties of the data
 types, possibly with some pre-scanning of the data.
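
 To make the cross-product baseline described above concrete, here is a
 minimal sketch that evaluates each kind of condition on every pair of
 rows. Everything in it is hypothetical: the sample rows, the threshold
 D, the use of "<" as the comparison (the operators did not survive in
 the archived text), and the one-dimensional absolute difference used as
 the distance.

```python
from itertools import product


def cross_product_join(A, B, predicate):
    """Baseline theta join: test the predicate on every (a, b) pair."""
    return [(a, b) for a, b in product(A, B) if predicate(a, b)]


# Hypothetical rows; field names x, y, z follow the conditions above.
A = [{"x": 1}, {"x": 5}, {"x": 9}]
B = [{"y": 4, "z": 6}, {"y": 8, "z": 12}]
D = 2  # assumed distance threshold


def less_than(a, b):
    # Assumed comparison operator for the first condition.
    return a["x"] < b["y"]


def in_range(a, b):
    # Range condition, treated here as a closed interval [y, z].
    return b["y"] <= a["x"] <= b["z"]


def within_distance(a, b):
    # Assumed 1-D distance; a real distance function may differ.
    return abs(a["x"] - b["y"]) < D


for pred in (less_than, in_range, within_distance):
    matches = cross_product_join(A, B, pred)
    print(pred.__name__, len(matches), "of", len(A) * len(B), "pairs match")
```

 The work is always len(A) * len(B) predicate evaluations regardless of
 how few pairs match, which is the cost that more specialized join
 algorithms would try to avoid.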



 I am asking for feedback on the interest and need in the community for
 this work, as well as any pointers to similar work. In particular, I
 would appreciate any answers people could give on the following
RE: Request for feedback on work intent for non-equijoin support

2015-04-08 Thread Xu, Cheng A
You can start your work from JoinOperator. Before that, you should follow the 
steps in https://cwiki.apache.org/confluence/display/Hive/GettingStarted 


Re: Request for feedback on work intent for non-equijoin support

2015-04-08 Thread Thejas Nair
Yes, the theta join paper from Northeastern is a good place to start.
There is also a presentation from the folks on YouTube, which is also
very useful.
I had a look at this issue as well earlier, and I had written up a
rough proposal.  I had not organized the document well enough for
sharing publicly, but in case you find it useful, I have attached it
to the wiki:
https://cwiki.apache.org/confluence/download/attachments/27362075/theta%20join%20proposal%20-%20thejas.pdf?version=1&modificationDate=1428517702954&api=v2
It also includes a list of some of the changes that are needed (it is
probably not comprehensive enough).



Re: Request for feedback on work intent for non-equijoin support

2015-04-08 Thread Thejas Nair
I don't have cycles for working on it in the next month or two. Maybe
after that.


On Wed, Apr 8, 2015 at 2:16 PM,  andres.qui...@parc.com wrote:
 This is certainly very helpful, thank you. Do you have any cycles to devote 
 to this issue at the moment, or in the near future?

On Tue, Mar 31, 2015 at 10:42 PM, andres.qui...@parc.com wrote:
   
  Dear Hive development community members,



 I am interested in learning more about the current support
 for non-equijoins in Hive and/or other Hadoop SQL engines,
 and in
 getting
 feedback about community interest in more extensive support
 for
 such
  a
 feature. I intend to work on this challenge, assuming people
 find
 it
 compelling, and I intend to contribute results to the community.
  Where
 possible, it would be great to 

RE: Request for feedback on work intent for non-equijoin support

2015-04-08 Thread Andres.Quiroz
This is certainly very helpful, thank you. Do you have any cycles to devote to 
this issue at the moment, or in the near future?

-----Original Message-----
From: Thejas Nair [mailto:thejas.n...@gmail.com] 
Sent: Wednesday, April 08, 2015 2:32 PM
To: dev
Subject: Re: Request for feedback on work intent for non-equijoin support

Yes, the theta join paper in northeastern is a good place to start.
There is also a presentation from the folks in youtube, which is also very 
useful.
I had a look at this issue as well earlier, and I had written up a rough 
proposal.  I had not organized the document well enough for sharing publicly, 
but in case you find it useful, I have attached it to wiki - 
https://cwiki.apache.org/confluence/download/attachments/27362075/theta%20join%20proposal%20-%20thejas.pdf?version=1&modificationDate=1428517702954&api=v2
It also includes a list of some of the changes that are needed (it is probably 
not comprehensive enough).
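
For readers who have not yet read the theta join paper, the randomized join-matrix idea usually called 1-Bucket-Theta can be sketched roughly as follows. This is a simplified, single-process illustration and not Hive code: the function name, the fixed grid of regions, and the seed parameter are all assumptions made for the example, and the paper's own region sizing is more careful than a fixed grid.

```python
import random
from collections import defaultdict


def one_bucket_theta_join(s_rows, t_rows, theta, grid_rows=2, grid_cols=2, seed=0):
    """Hypothetical sketch: join S and T under an arbitrary predicate theta.

    The join matrix S x T is cut into grid_rows * grid_cols regions, one per
    simulated reducer. Every (s, t) pair meets in exactly one region.
    """
    rng = random.Random(seed)
    # (row band, column band) -> (S tuples, T tuples) routed to that region
    regions = defaultdict(lambda: ([], []))

    # "Map" side: each S tuple picks a random row band and is replicated to
    # every region in that band; each T tuple does the same with a column band.
    for s in s_rows:
        r = rng.randrange(grid_rows)
        for c in range(grid_cols):
            regions[(r, c)][0].append(s)
    for t in t_rows:
        c = rng.randrange(grid_cols)
        for r in range(grid_rows):
            regions[(r, c)][1].append(t)

    # "Reduce" side: each region evaluates the predicate on its local pairs.
    out = []
    for s_part, t_part in regions.values():
        out.extend((s, t) for s in s_part for t in t_part if theta(s, t))
    return out
```

Because the routing ignores the join condition, the same scheme works for any predicate; the price is that each input tuple is replicated to a whole row or column of regions.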


On Wed, Apr 8, 2015 at 5:49 AM,  andres.qui...@parc.com wrote:
 So, I'd like to get started on this. The description in the design doc and 
 the theta join paper from Northeastern seem like a good place to start, to 
 have a baseline that I can later use for the more specific join algorithms I 
 want to try.

 I created a JIRA account, and my username is Andres.Quiroz

 Brock, since I'm completely new to this code, could you (or anyone else) 
 please point me to the relevant modules to start learning and ramping up? 
 Also, please let me know if I can contact you directly for discussing this 
 specific topic, or if I should always send a message to the mailing list.

 Thank you,

 Andrés

RE: Request for feedback on work intent for non-equijoin support

2015-04-02 Thread Andres.Quiroz
This is a great pointer, Szehon and Brock, thank you. I will catch up with the 
material on theta joins and circle back.

Andrés

-----Original Message-----
From: Brock Noland [mailto:br...@apache.org] 
Sent: Thursday, April 02, 2015 1:31 AM
To: dev@hive.apache.org
Subject: Re: Request for feedback on work intent for non-equijoin support

Nice, it'd be great if someone finally implemented this :)

On Wed, Apr 1, 2015 at 10:10 PM, Szehon Ho sze...@cloudera.com wrote:
 From Hive side, there has been some thought on the subject here:
 https://cwiki.apache.org/confluence/display/Hive/Theta+Join, it has 
 some ideas but nobody has gotten around to giving it a try.  It might 
 be of interest.

 Thanks
 Szehon


RE: Request for feedback on work intent for non-equijoin support

2015-04-01 Thread Andres.Quiroz
Dear Lefty,

Thank you very much for pointing that out and for your initial pointers. Here 
is the missing link:

ftp.parc.com/pub/torres/Hiperfuse/extended_hiperfuse.pdf

Regards,

Andrés

-----Original Message-----
From: Lefty Leverenz [mailto:leftylever...@gmail.com] 
Sent: Wednesday, April 01, 2015 12:48 AM
To: dev@hive.apache.org
Subject: Re: Request for feedback on work intent for non-equijoin support

Hello Andres, the link to your paper is missing:

In our preliminary work, which you can find here (pointer to the paper) ...


You can find general information about contributing to Hive in the wiki:
Resources for Contributors
(https://cwiki.apache.org/confluence/display/Hive/Home#Home-ResourcesforContributors)
and How to Contribute
(https://cwiki.apache.org/confluence/display/Hive/HowToContribute).

-- Lefty

On Tue, Mar 31, 2015 at 10:42 PM, andres.qui...@parc.com wrote:

  Dear Hive development community members,



 I am interested in learning more about the current support for 
 non-equijoins in Hive and/or other Hadoop SQL engines, and in getting 
 feedback about community interest in more extensive support for such a 
 feature. I intend to work on this challenge, assuming people find it 
 compelling, and I intend to contribute results to the community. Where 
 possible, it would be great to receive feedback and engage in 
 collaborations along the way (for a bit more context, see the 
 postscript of this message).



 My initial goal is to support query conditions such as the following:



 A.x < B.y

 A.x in_range [B.y, B.z]

 distance(A.x, B.y) < D



 where A and B are distinct tables/files. It is my understanding that 
 current support for performing non-equijoins like those above is quite 
 limited, and where some forms are supported (like in Cloudera's 
 Impala), this support is based on doing a potentially expensive cross product 
 join.
 Depending on the data types involved, I believe that joins with these 
 conditions can be made to be tractable (at least on the average) with 
 join algorithms that exploit properties of the data types, possibly 
 with some pre-scanning of the data.
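
To make the cost argument above concrete, here is a minimal Python sketch (not part of the original message) comparing a cross-product evaluation of `A.x in_range [B.y, B.z]` with one that exploits the ordering of the values. The function names are hypothetical, the range is assumed to be inclusive at both ends, and sorting A stands in for the "pre-scanning" step.

```python
from bisect import bisect_left, bisect_right


def cross_product_range_join(a_values, b_intervals):
    """Baseline: test every (x, interval) pair, |A| * |B| comparisons."""
    return [(x, (lo, hi)) for x in a_values for (lo, hi) in b_intervals
            if lo <= x <= hi]


def sorted_range_join(a_values, b_intervals):
    """Sort A once, then locate each interval's matches by binary search."""
    xs = sorted(a_values)  # the pre-scan / preprocessing pass
    out = []
    for lo, hi in b_intervals:
        start = bisect_left(xs, lo)   # first x with x >= lo
        end = bisect_right(xs, hi)    # one past the last x with x <= hi
        out.extend((x, (lo, hi)) for x in xs[start:end])
    return out
```

The sorted version does O(|A| log |A| + |B| log |A|) comparisons plus the size of the output, so it helps on average but cannot avoid the blow-up when most pairs actually match.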



 I am asking for feedback on the interest and need in the community for 
 this work, as well as any pointers to similar work. In particular, I 
 would appreciate any answers people could give on the following questions:



 - Is my understanding of the state of the art in Hive and similar 
 tools accurate? Are there groups currently working on similar or 
 related issues, or tools that already accomplish some or all of what I have 
 proposed?

 - Is there significant value to the community in the support of such a 
 feature? In other words, are the manual workarounds necessary because 
 of the absence of non-equijoins such as these enough of a pain to 
 justify the work I propose?

 - Being aware that the potential pre-scanning adds to the cost of the 
 join, and that data could still blow-up in the worst case, am I 
 missing any other important considerations and tradeoffs for this problem?

 - What would be a good avenue to contribute this feature to the 
 community (e.g. as a standalone tool on top of Hadoop, or as a Hive 
 extension or plugin)?

 - What is the best way to get started in working with the community?



 Thanks for your attention and any info you can provide!



 Andres Quiroz



 P.S. If you are interested in some context, and why/how I am proposing 
 to do this work, please read on.



 I am part of a small project team at PARC working on the general 
 problems of data integration and automated ETL. We have proposed a 
 tool called HiperFuse that is designed to accept declarative, 
 high-level queries in order to produce joined (fused) data sets from 
 multiple heterogeneous raw data sources. In our preliminary work, 
 which you can find here (pointer to the paper), we designed the 
 architecture of the tool and obtained some results separately on the 
 problems of automated data cleansing, data type inference, and query 
 planning. One of the planned prototype implementations of HiperFuse 
 relies on Hadoop MR, and because the declarative language we proposed 
 was closely related to SQL, we thought that we could exploit the 
 existing work in Hive and/or other open-source tools for handling the 
 SQL part and layer our work on top of that. For example, the query 
 given in the paper could easily be expressed in SQL-like form with a 
 non-equijoin condition:



 SELECT web_access_log.ip, census.income

 FROM web_access_log, ip2zip, census

 WHERE web_access_log.ip in_range [ip2zip.ip_low, ip2zip.ip_high]

 AND ip2zip.zip = census.zip
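
As a rough illustration of what this query computes, the sketch below runs an equivalent statement on SQLite from Python. It is not Hive and says nothing about Hive's execution strategy. It assumes that `in_range` means an inclusive range (written here with standard `BETWEEN`) and that IP addresses are stored as integers; the function name and all sample rows are invented for the example.

```python
import sqlite3


def run_range_join_example():
    """Hypothetical demo of the three-table query on an in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE web_access_log (ip INTEGER);
        CREATE TABLE ip2zip (ip_low INTEGER, ip_high INTEGER, zip TEXT);
        CREATE TABLE census (zip TEXT, income INTEGER);
        -- invented sample data: 10.0.0.5 and 192.168.1.1 as integers
        INSERT INTO web_access_log VALUES (167772165), (3232235777);
        -- invented mapping: 10.0.0.0 through 10.0.0.255 -> one zip code
        INSERT INTO ip2zip VALUES (167772160, 167772415, '94304');
        INSERT INTO census VALUES ('94304', 100000);
        """
    )
    rows = conn.execute(
        """
        SELECT web_access_log.ip, census.income
        FROM web_access_log, ip2zip, census
        WHERE web_access_log.ip BETWEEN ip2zip.ip_low AND ip2zip.ip_high
          AND ip2zip.zip = census.zip
        """
    ).fetchall()
    conn.close()
    return rows
```

Only the first log entry falls inside the mapped range, so only that row is joined through to an income value.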



 As you can see, the first impasse that we hit in order to bring the 
 elements together to solve this query end-to-end was the realization 
 and performance of the non-equality join in the query. The intent now 
 is to tackle this problem in a general sense and provide a solution 
 for a wide range of queries.



 The work I propose to do would be based on 

Re: Request for feedback on work intent for non-equijoin support

2015-04-01 Thread Lefty Leverenz
Andrés, I followed that link and got the dread 404 Not Found:

The requested URI /pub/torres/Hiperfuse/extended_hiperfuse.pdf was not
found on this server.

-- Lefty

On Wed, Apr 1, 2015 at 7:23 PM, andres.qui...@parc.com wrote:

 Dear Lefty,

 Thank you very much for pointing that out and for your initial pointers.
 Here is the missing link:

 ftp.parc.com/pub/torres/Hiperfuse/extended_hiperfuse.pdf

 Regards,

 Andrés

Re: Request for feedback on work intent for non-equijoin support

2015-04-01 Thread Szehon Ho
From Hive side, there has been some thought on the subject here:
https://cwiki.apache.org/confluence/display/Hive/Theta+Join, it has some
ideas but nobody has gotten around to giving it a try.  It might be of
interest.

Thanks
Szehon


On Wed, Apr 1, 2015 at 10:05 PM, Lefty Leverenz leftylever...@gmail.com
wrote:

 D'oh!  Thanks Chao.

 -- Lefty

Re: Request for feedback on work intent for non-equijoin support

2015-04-01 Thread Lefty Leverenz
D'oh!  Thanks Chao.

-- Lefty

On Thu, Apr 2, 2015 at 12:59 AM, Chao Sun c...@cloudera.com wrote:

 Hey Lefty,

 You need to use the ftp protocol, not http.
 After clicking the link, you'll need to remove "http://" from the address
 bar.

 Best,
 Chao

Re: Request for feedback on work intent for non-equijoin support

2015-04-01 Thread Brock Noland
Nice, it'd be great if someone finally implemented this :)

On Wed, Apr 1, 2015 at 10:10 PM, Szehon Ho sze...@cloudera.com wrote:
 From Hive side, there has been some thought on the subject here:
 https://cwiki.apache.org/confluence/display/Hive/Theta+Join, it has some
 ideas but nobody has gotten around to giving it a try.  It might be of
 interest.

 Thanks
 Szehon


 On Wed, Apr 1, 2015 at 10:05 PM, Lefty Leverenz leftylever...@gmail.com
 wrote:

 D'oh!  Thanks Chao.

 -- Lefty

 On Thu, Apr 2, 2015 at 12:59 AM, Chao Sun c...@cloudera.com wrote:

  Hey Lefty,
 
  You need to use the ftp protocol, not http.
  After clicking the link, you'll need to remove "http://" from the
  address bar.
 
  Best,
  Chao
 
  On Wed, Apr 1, 2015 at 9:41 PM, Lefty Leverenz leftylever...@gmail.com
  wrote:
 
   Andrés, I followed that link and got the dread 404 Not Found:
  
   The requested URI /pub/torres/Hiperfuse/extended_hiperfuse.pdf was not
   found on this server.
  
   -- Lefty
  
   On Wed, Apr 1, 2015 at 7:23 PM, andres.qui...@parc.com wrote:
  
Dear Lefty,
   
Thank you very much for pointing that out and for your initial pointers.
Here is the missing link:
   
ftp.parc.com/pub/torres/Hiperfuse/extended_hiperfuse.pdf
   
Regards,
   
Andrés
   
-----Original Message-----
From: Lefty Leverenz [mailto:leftylever...@gmail.com]
Sent: Wednesday, April 01, 2015 12:48 AM
To: dev@hive.apache.org
Subject: Re: Request for feedback on work intent for non-equijoin support
   
Hello Andres, the link to your paper is missing:
   
In our preliminary work, which you can find here (pointer to the paper) ...
   
   
You can find general information about contributing to Hive in the wiki:
Resources for Contributors
(https://cwiki.apache.org/confluence/display/Hive/Home#Home-ResourcesforContributors)
and How to Contribute
(https://cwiki.apache.org/confluence/display/Hive/HowToContribute).
   
-- Lefty
   
On Tue, Mar 31, 2015 at 10:42 PM, andres.qui...@parc.com wrote:
   
  Dear Hive development community members,



 I am interested in learning more about the current support for
 non-equijoins in Hive and/or other Hadoop SQL engines, and in getting
 feedback about community interest in more extensive support for such a
 feature. I intend to work on this challenge, assuming people find it
 compelling, and I intend to contribute results to the community. Where
 possible, it would be great to receive feedback and engage in
 collaborations along the way (for a bit more context, see the
 postscript of this message).



 My initial goal is to support query conditions such as the following:



 A.x < B.y

 A.x in_range [B.y, B.z]

 distance(A.x, B.y) < D



 where A and B are distinct tables/files. It is my understanding that
 current support for performing non-equijoins like those above is quite
 limited, and where some forms are supported (like in Cloudera's
 Impala), this support is based on doing a potentially expensive cross
 product join.
 Depending on the data types involved, I believe that joins with these
 conditions can be made to be tractable (at least on the average) with
 join algorithms that exploit properties of the data types, possibly
 with some pre-scanning of the data.



 I am asking for feedback on the interest and need in the community for
 this work, as well as any pointers to similar work. In particular, I
 would appreciate any answers people could give on the following
 questions:



 - Is my understanding of the state of the art in Hive and similar
 tools accurate? Are there groups currently working on similar or
 related issues, or tools that already accomplish some or all of what I
 have proposed?

 - Is there significant value to the community in the support of such a
 feature? In other words, are the manual workarounds necessary because
 of the absence of non-equijoins such as these enough of a pain to
 justify the work I propose?

 - Being aware that the potential pre-scanning adds to the cost of the
 join, and that data could still blow-up in the worst case, am I
 missing any other important considerations and tradeoffs for this
 problem?

 - What would be a good avenue to contribute this feature to the
 community (e.g. as a standalone tool on top of Hadoop, or as a Hive
 extension or plugin)?

 - What is the best way to get started in working with the community?



 Thanks for your attention and any info you can provide!



 Andres Quiroz



 P.S. If you are interested in some context, and why/how I am proposing
 to do this work, please read on.



 I am part 

Re: Request for feedback on work intent for non-equijoin support

2015-04-01 Thread Chao Sun
Hey Lefty,

You need to use the ftp protocol, not http.
After clicking the link, you'll need to remove "http://" from the address
bar.

Best,
Chao

On Wed, Apr 1, 2015 at 9:41 PM, Lefty Leverenz leftylever...@gmail.com
wrote:

 Andrés, I followed that link and got the dread 404 Not Found:

 The requested URI /pub/torres/Hiperfuse/extended_hiperfuse.pdf was not
 found on this server.

 -- Lefty

 On Wed, Apr 1, 2015 at 7:23 PM, andres.qui...@parc.com wrote:

  Dear Lefty,
 
  Thank you very much for pointing that out and for your initial pointers.
  Here is the missing link:
 
  ftp.parc.com/pub/torres/Hiperfuse/extended_hiperfuse.pdf
 
  Regards,
 
  Andrés
 
  -----Original Message-----
  From: Lefty Leverenz [mailto:leftylever...@gmail.com]
  Sent: Wednesday, April 01, 2015 12:48 AM
  To: dev@hive.apache.org
  Subject: Re: Request for feedback on work intent for non-equijoin support
 
  Hello Andres, the link to your paper is missing:
 
  In our preliminary work, which you can find here (pointer to the paper) ...
 
 
  You can find general information about contributing to Hive in the wiki:
  Resources for Contributors
  (https://cwiki.apache.org/confluence/display/Hive/Home#Home-ResourcesforContributors)
  and How to Contribute
  (https://cwiki.apache.org/confluence/display/Hive/HowToContribute).
 
  -- Lefty
 
  On Tue, Mar 31, 2015 at 10:42 PM, andres.qui...@parc.com wrote:
 
Dear Hive development community members,
  
  
  
   I am interested in learning more about the current support for
   non-equijoins in Hive and/or other Hadoop SQL engines, and in getting
   feedback about community interest in more extensive support for such a
   feature. I intend to work on this challenge, assuming people find it
   compelling, and I intend to contribute results to the community. Where
   possible, it would be great to receive feedback and engage in
   collaborations along the way (for a bit more context, see the
   postscript of this message).
  
  
  
   My initial goal is to support query conditions such as the following:
  
  
  
   A.x < B.y
  
   A.x in_range [B.y, B.z]
  
   distance(A.x, B.y) < D
  
  
  
   where A and B are distinct tables/files. It is my understanding that
   current support for performing non-equijoins like those above is quite
   limited, and where some forms are supported (like in Cloudera's
   Impala), this support is based on doing a potentially expensive cross
  product join.
   Depending on the data types involved, I believe that joins with these
   conditions can be made to be tractable (at least on the average) with
   join algorithms that exploit properties of the data types, possibly
   with some pre-scanning of the data.
  
  
  
   I am asking for feedback on the interest and need in the community for
   this work, as well as any pointers to similar work. In particular, I
   would appreciate any answers people could give on the following
  questions:
  
  
  
   - Is my understanding of the state of the art in Hive and similar
   tools accurate? Are there groups currently working on similar or
   related issues, or tools that already accomplish some or all of what I
  have proposed?
  
   - Is there significant value to the community in the support of such a
   feature? In other words, are the manual workarounds necessary because
   of the absence of non-equijoins such as these enough of a pain to
   justify the work I propose?
  
   - Being aware that the potential pre-scanning adds to the cost of the
   join, and that data could still blow-up in the worst case, am I
   missing any other important considerations and tradeoffs for this
  problem?
  
   - What would be a good avenue to contribute this feature to the
   community (e.g. as a standalone tool on top of Hadoop, or as a Hive
   extension or plugin)?
  
   - What is the best way to get started in working with the community?
  
  
  
   Thanks for your attention and any info you can provide!
  
  
  
   Andres Quiroz
  
  
  
   P.S. If you are interested in some context, and why/how I am proposing
   to do this work, please read on.
  
  
  
   I am part of a small project team at PARC working on the general
   problems of data integration and automated ETL. We have proposed a
   tool called HiperFuse that is designed to accept declarative,
   high-level queries in order to produce joined (fused) data sets from
   multiple heterogeneous raw data sources. In our preliminary work,
   which you can find here (pointer to the paper), we designed the
   architecture of the tool and obtained some results separately on the
   problems of automated data cleansing, data type inference, and query
   planning. One of the planned prototype implementations of HiperFuse
   relies on Hadoop MR, and because the declarative language we proposed
   was closely related to SQL, we thought that we could exploit the
   existing work in Hive and/or other open-source tools for handling the
   SQL part and layer our work on top of 

Re: Request for feedback on work intent for non-equijoin support

2015-03-31 Thread Lefty Leverenz
Hello Andres, the link to your paper is missing:

In our preliminary work, which you can find here (pointer to the paper) ...


You can find general information about contributing to Hive in the wiki:
Resources for Contributors
(https://cwiki.apache.org/confluence/display/Hive/Home#Home-ResourcesforContributors)
and How to Contribute
(https://cwiki.apache.org/confluence/display/Hive/HowToContribute).

-- Lefty

On Tue, Mar 31, 2015 at 10:42 PM, andres.qui...@parc.com wrote:

  Dear Hive development community members,



 I am interested in learning more about the current support for
 non-equijoins in Hive and/or other Hadoop SQL engines, and in getting
 feedback about community interest in more extensive support for such a
 feature. I intend to work on this challenge, assuming people find it
 compelling, and I intend to contribute results to the community. Where
 possible, it would be great to receive feedback and engage in
 collaborations along the way (for a bit more context, see the postscript of
 this message).



 My initial goal is to support query conditions such as the following:



 A.x < B.y

 A.x in_range [B.y, B.z]

 distance(A.x, B.y) < D



 where A and B are distinct tables/files. It is my understanding that
 current support for performing non-equijoins like those above is quite
 limited, and where some forms are supported (like in Cloudera's Impala),
 this support is based on doing a potentially expensive cross product join.
 Depending on the data types involved, I believe that joins with these
 conditions can be made to be tractable (at least on the average) with join
 algorithms that exploit properties of the data types, possibly with some
 pre-scanning of the data.
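
 To make the contrast above concrete, here is a minimal single-machine
 sketch in Python for the in_range condition only. It is not Hive or
 Impala code, and it is not the algorithm proposed in this thread; the
 function names are hypothetical. It assumes the A.x values are totally
 ordered, so that sorting them once (standing in for the "pre-scan")
 lets each B interval be matched by binary search instead of by testing
 every pair.

```python
from bisect import bisect_left, bisect_right


def cross_product_range_join(a_values, b_intervals):
    """Baseline: test every (x, interval) pair, as a cross product join would."""
    return [
        (x, (lo, hi))
        for x in a_values
        for (lo, hi) in b_intervals
        if lo <= x <= hi
    ]


def sorted_range_join(a_values, b_intervals):
    """Sort A.x once, then locate each interval's matches by binary search."""
    xs = sorted(a_values)  # the one-off "pre-scan" cost
    out = []
    for lo, hi in b_intervals:
        start = bisect_left(xs, lo)   # first x with x >= lo
        end = bisect_right(xs, hi)    # one past the last x with x <= hi
        out.extend((x, (lo, hi)) for x in xs[start:end])
    return out
```

 Both functions return the same set of pairs. The second does work
 proportional to the sort plus the number of matches, so it still
 degrades to cross-product cost when every interval covers every value,
 which is the worst-case blow-up mentioned in the questions below.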



 I am asking for feedback on the interest and need in the community for this
 work, as well as any pointers to similar work. In particular, I would
 appreciate any answers people could give on the following questions:



 - Is my understanding of the state of the art in Hive and similar tools
 accurate? Are there groups currently working on similar or related issues,
 or tools that already accomplish some or all of what I have proposed?

 - Is there significant value to the community in the support of such a
 feature? In other words, are the manual workarounds necessary because of
 the absence of non-equijoins such as these enough of a pain to justify the
 work I propose?

 - Being aware that the potential pre-scanning adds to the cost of the
 join, and that data could still blow-up in the worst case, am I missing any
 other important considerations and tradeoffs for this problem?

 - What would be a good avenue to contribute this feature to the community
 (e.g. as a standalone tool on top of Hadoop, or as a Hive extension or
 plugin)?

 - What is the best way to get started in working with the community?



 Thanks for your attention and any info you can provide!



 Andres Quiroz



 P.S. If you are interested in some context, and why/how I am proposing to
 do this work, please read on.



 I am part of a small project team at PARC working on the general problems
 of data integration and automated ETL. We have proposed a tool called
 HiperFuse that is designed to accept declarative, high-level queries in
 order to produce joined (fused) data sets from multiple heterogeneous raw
 data sources. In our preliminary work, which you can find here (pointer to
 the paper), we designed the architecture of the tool and obtained some
 results separately on the problems of automated data cleansing, data type
 inference, and query planning. One of the planned prototype implementations
 of HiperFuse relies on Hadoop MR, and because the declarative language we
 proposed was closely related to SQL, we thought that we could exploit the
 existing work in Hive and/or other open-source tools for handling the SQL
 part and layer our work on top of that. For example, the query given in the
 paper could easily be expressed in SQL-like form with a non-equijoin
 condition:



 SELECT web_access_log.ip, census.income

 FROM web_access_log, ip2zip, census

 WHERE web_access_log.ip in_range [ip2zip.ip_low, ip2zip.ip_high]

 AND ip2zip.zip = census.zip



 As you can see, the first impasse that we hit in order to bring the
 elements together to solve this query end-to-end was the realization and
 performance of the non-equality join in the query. The intent now is to
 tackle this problem in a general sense and provide a solution for a wide
 range of queries.
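
 As an illustration of what the query above computes, here is a small
 in-memory sketch in Python. It is not HiperFuse or Hive code. The
 function name, the tuple layouts, and the sample addresses in the usage
 note are hypothetical, and it assumes the ip2zip ranges do not overlap,
 so that each address falls in at most one range.

```python
from bisect import bisect_right
from ipaddress import ip_address


def fuse(web_access_log, ip2zip, census):
    """Return (ip, income) pairs for the range join followed by the equijoin.

    web_access_log: iterable of IP address strings
    ip2zip: iterable of (ip_low, ip_high, zip) tuples, assumed non-overlapping
    census: iterable of (zip, income) tuples
    """
    # Convert range bounds to integers and sort by lower bound.
    ranges = sorted(
        (int(ip_address(lo)), int(ip_address(hi)), zip_code)
        for lo, hi, zip_code in ip2zip
    )
    lows = [r[0] for r in ranges]
    income_by_zip = dict(census)  # hash lookup for ip2zip.zip = census.zip

    out = []
    for ip in web_access_log:
        n = int(ip_address(ip))
        i = bisect_right(lows, n) - 1  # last range whose ip_low <= n
        if i >= 0 and n <= ranges[i][1]:
            zip_code = ranges[i][2]
            if zip_code in income_by_zip:
                out.append((ip, income_by_zip[zip_code]))
    return out
```

 For example, with ranges 10.0.0.0 to 10.0.0.255 mapped to one zip code
 and 10.0.1.0 to 10.0.1.255 mapped to another, an access from 10.0.1.17
 is matched to the second range by one binary search, and its income is
 then found by an ordinary equality lookup on zip.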



 The work I propose to do would be based on three main components within
 HiperFuse:



 - Enhancements to the extensible data type framework in HiperFuse that
 would categorize data types based on the properties needed to support the
 join algorithms, in order to write join-ready domain-specific data type
 libraries.

 - The join algorithms themselves, based on Hive or directly on Hadoop MR.

 - A query planner, which would determine the right algorithm to apply and
 automatically schedule any