RE: Table Wrapper

2013-06-28 Thread Peter Marron
Yes indeed. Thank you for prompting me to post again.
It is always helpful to get a response, any response.
There is nothing worse than posting and getting absolutely
nothing in reply.

As you say, I have now got quite a few suggestions, and even if they
don't all pan out I will have learnt a lot. And I feel happier now with
my ugly hack knowing that at least I haven't missed anything too obvious :)

Thanks again.

Z

From: Stephen Sprague [mailto:sprag...@gmail.com]
Sent: 27 June 2013 16:40
To: user@hive.apache.org
Subject: Re: Table Wrapper

Well.  You got a few suggestions there Peter.  That in itself is reason to 
celebrate!

And that was a good description and I fault you not for going into some detail.
The part about keeping it simple is always a challenge, I know. :)
I get your point, but I don't really have anything more to offer. If running an
MR job is a blocker, I'm not in a position to suggest anything.

On Thu, Jun 27, 2013 at 3:14 AM, Peter Marron
peter.mar...@trilliumsoftware.com wrote:
Hi,

If you're suggesting that I use something like

SELECT * FROM data WHERE MyUdf(data.BLOCK__OFFSET__INSIDE__FILE);

rather than
SELECT * FROM data JOIN small ON data.BLOCK__OFFSET__INSIDE__FILE = 
small.offset;
then, yes, I have thought of that. However, the fact is that reading the
billions of records and filtering them is too slow compared to doing the seeks.

Partitioning would help, but I can't assume that the big data is partitioned in 
a way that suits
this particular query.

Z

From: Jan Dolinár [mailto:dolik@gmail.com]
Sent: 27 June 2013 10:59
To: user
Subject: Re: Table Wrapper

A slightly less hackish way to do this without joins is to write a custom UDF
that takes data.BLOCK__OFFSET__INSIDE__FILE as an input parameter and returns
the corresponding data from the small file. If you mark it deterministic
using @UDFType(deterministic = true), the performance should be quite good.
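
For illustration only, a minimal sketch of such a UDF (used here as a filter,
as in the MyUdf example above) might look like the following. This is
hypothetical code, not a tested implementation: it assumes the small file
fits in memory and is available to the workers as a local file (e.g. shipped
with ADD FILE), and the class and file names are made up.

import java.io.DataInputStream;
import java.io.EOFException;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.hive.ql.udf.UDFType;

// Deterministic, so Hive is free to cache/fold repeated calls.
@UDFType(deterministic = true)
public class OffsetLookup extends UDF {
  private Set<Long> wanted; // offsets from the small file, loaded once per task

  public Boolean evaluate(Long blockOffset) throws IOException {
    if (blockOffset == null) return false;
    if (wanted == null) wanted = loadOffsets("offsets.bin"); // hypothetical name
    return wanted.contains(blockOffset);
  }

  // Read the small binary file: one big-endian 8-byte offset after another.
  private static Set<Long> loadOffsets(String path) throws IOException {
    Set<Long> offsets = new HashSet<Long>();
    DataInputStream in = new DataInputStream(new FileInputStream(path));
    try {
      while (true) offsets.add(in.readLong());
    } catch (EOFException endOfFile) {
      // all offsets read
    } finally {
      in.close();
    }
    return offsets;
  }
}

It would then be registered with CREATE TEMPORARY FUNCTION and used just like
MyUdf in the query above.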

To avoid the full table scan, partitioning is IMHO the best way to speed things 
up.

Best regards,
J. Dolinar

On Thu, Jun 27, 2013 at 11:18 AM, Peter Marron
peter.mar...@trilliumsoftware.com wrote:
Hi,

I have thought about a map-only join but, as I understand it, this is still
going to do a full table scan on my large data file. If this is billions of
records then it's still going to be slow, even if it only returns a handful of
records.

Also I don't know of any way to get Hive to do a join without performing a
Map/Reduce. And, as I mentioned before, just the overhead of setting
up a Map/Reduce, even if it's map-only and does practically nothing, makes the
elapsed time too high. I want it to be interactive. (I guess that something
like Tez, when it becomes available, might solve this problem...)

My ugly hack approach works in seconds; the overhead of setting up
Map/Reduce takes this into minutes.

Indexing looks promising but, as far as I can see, it can't be done without a 
Map/Reduce.

If I could find a way to perform a join or use indexing without a Map/Reduce I 
would be happy to use that approach.

Partitioning and ORC would be helpful but I can't assume anything about the 
original data format.

Z

From: Nitin Pawar [mailto:nitinpawar...@gmail.com]
Sent: 27 June 2013 09:52

To: user@hive.apache.org
Subject: Re: Table Wrapper

A few thoughts:
If you have a smaller file (MBs in size), have you tried considering a
map-only join?
Also, if you are interested in particular records from a table and do not want
to go through the entire table to find them, then partitioning + indexing will
be handy.

ORCFile Format (still very new) can help you in this regard as well.

On Thu, Jun 27, 2013 at 2:16 PM, Peter Marron
peter.mar...@trilliumsoftware.com wrote:
Well, I'm not very good at keeping things brief, unfortunately.
But I'll have a go, trying to keep things simple.

Suppose that I have a data table in Hive and it has many rows - say billions.
I have another file stored in HDFS (it can be a Hive table too if it helps)
and this file is small and contains file offsets into the data, stored as
binary, 8 bytes per offset. Now suppose that I want to read the records from
the data defined by the offsets in the small file, in the order defined in
the small file.

How can I do that?

The obvious way is to turn the small file into a Hive table and provide a
custom InputFormat which can read the binary. I've done that; that's the easy
part (a sketch of the idea appears below) and then I could form a query like
this:

SELECT * FROM data JOIN small ON data.BLOCK__OFFSET__INSIDE__FILE =
small.offset;

But, when it works, this performs awfully.
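
As an aside, the "easy part" mentioned above - exposing the 8-byte binary
offsets as a single-BIGINT-column table - can be sketched roughly as the
record reader below. This is hypothetical code against Hadoop's old mapred
API (which Hive uses for input formats), with the enclosing InputFormat and
error handling omitted:

import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;

// Presents a file of big-endian 8-byte offsets as one row per offset.
public class OffsetRecordReader
    implements RecordReader<NullWritable, LongWritable> {
  private final FSDataInputStream in;
  private final long start;
  private final long end;
  private long pos;

  public OffsetRecordReader(FileSplit split, JobConf job) throws IOException {
    FileSystem fs = split.getPath().getFileSystem(job);
    in = fs.open(split.getPath());
    start = split.getStart();
    end = start + split.getLength();
    in.seek(start);
    pos = start;
  }

  public boolean next(NullWritable key, LongWritable value) throws IOException {
    if (pos + 8 > end) return false; // no complete offset left in this split
    value.set(in.readLong());        // one 8-byte offset becomes one row
    pos += 8;
    return true;
  }

  public NullWritable createKey() { return NullWritable.get(); }
  public LongWritable createValue() { return new LongWritable(); }
  public long getPos() { return pos; }
  public float getProgress() {
    return end == start ? 1.0f : (pos - start) / (float) (end - start);
  }
  public void close() throws IOException { in.close(); }
}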

The approach that I have taken is to create a copy of the data table which is
hacked to use a custom input format which knows about the small file and which
overrides the record reader to use the offsets as seeks before it reads the
records. This is awkward, for various reasons, but it works well. I can avoid
a full table scan; in fact I can suppress any Map/Reduce and so the query runs
very quickly.
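
In outline, the seek-and-read idea inside that record reader looks something
like this (a hypothetical sketch, not the actual implementation):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.util.LineReader;

// Instead of streaming the whole data file, jump straight to each wanted
// offset and read only the record that starts there.
public class SeekingReadDemo {
  public static void readRecordsAt(Configuration conf, Path dataFile,
                                   long[] offsets) throws IOException {
    FileSystem fs = dataFile.getFileSystem(conf);
    FSDataInputStream in = fs.open(dataFile);
    try {
      Text record = new Text();
      for (long offset : offsets) { // in the order the small file dictates
        in.seek(offset);            // position at the record's first byte
        // A fresh LineReader per seek avoids reading stale buffered bytes.
        new LineReader(in, conf).readLine(record);
        System.out.println(record);
      }
    } finally {
      in.close();
    }
  }
}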
