Well. You got a few suggestions there, Peter. That in itself is reason to
celebrate!

And that was a good description, and I fault you not for going into some
detail. Keeping it simple is always a challenge, I know. :)

I get your point, but I don't really have anything more to offer. If running
an MR job is a blocker, I'm not in a position to suggest anything further.


On Thu, Jun 27, 2013 at 3:14 AM, Peter Marron <
peter.mar...@trilliumsoftware.com> wrote:

> Hi,
>
> If you’re suggesting that I use something like
>
> SELECT * FROM data WHERE MyUdf(data.BLOCK__OFFSET__INSIDE__FILE);
>
> rather than
>
> SELECT * FROM data JOIN small ON data.BLOCK__OFFSET__INSIDE__FILE = small.offset;
>
> then, yes, I have thought of that. However, the fact is that reading the
> billions of records and filtering them is too slow compared to doing the
> seeks.
>
> Partitioning would help, but I can’t assume that the big data is
> partitioned in a way that suits this particular query.
>
> Z
>
> From: Jan Dolinár [mailto:dolik....@gmail.com]
> Sent: 27 June 2013 10:59
> To: user
> Subject: Re: Table Wrapper
>
> A slightly less "hackish" way to do this without joins is to write a custom
> UDF that takes data.BLOCK__OFFSET__INSIDE__FILE as an input parameter and
> returns the corresponding data from the small file. If you mark it
> "deterministic" using @UDFType(deterministic = true), the performance
> should be quite good.
>
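> A minimal sketch of such a UDF, written as a filter to be used in a WHERE
> clause on BLOCK__OFFSET__INSIDE__FILE, might look roughly like the code
> below. This is an illustration only: the class name, the hard-coded path of
> the small offsets file and the load-on-first-call strategy are assumptions,
> and in practice the file would probably be shipped via the distributed cache
> or a job property.
>
>     // Hypothetical sketch, not an existing Hive class.
>     package example;
>
>     import java.io.IOException;
>     import java.util.HashSet;
>     import java.util.Set;
>
>     import org.apache.hadoop.conf.Configuration;
>     import org.apache.hadoop.fs.FSDataInputStream;
>     import org.apache.hadoop.fs.FileSystem;
>     import org.apache.hadoop.fs.Path;
>     import org.apache.hadoop.hive.ql.exec.UDF;
>     import org.apache.hadoop.hive.ql.udf.UDFType;
>
>     @UDFType(deterministic = true)
>     public class OffsetFilterUDF extends UDF {
>
>       private Set<Long> wanted;  // offsets loaded once from the small file
>
>       public boolean evaluate(Long blockOffset) throws IOException {
>         if (wanted == null) {
>           wanted = new HashSet<Long>();
>           // Assumed location of the small file of 8-byte binary offsets.
>           Path p = new Path("/user/hive/offsets/small.bin");
>           FileSystem fs = p.getFileSystem(new Configuration());
>           long len = fs.getFileStatus(p).getLen();
>           FSDataInputStream in = fs.open(p);
>           try {
>             for (long i = 0; i < len / 8; i++) {
>               wanted.add(in.readLong());
>             }
>           } finally {
>             in.close();
>           }
>         }
>         return blockOffset != null && wanted.contains(blockOffset);
>       }
>     }
>
> After CREATE TEMPORARY FUNCTION it would be called from a WHERE clause in
> the same way as the MyUdf example quoted above.
>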
> To avoid the full table scan, partitioning is IMHO the best way to speed
> things up.
>
> Best regards,
> J. Dolinar
>
> On Thu, Jun 27, 2013 at 11:18 AM, Peter Marron <
> peter.mar...@trilliumsoftware.com> wrote:
>
> Hi,
>
> I have thought about a map-only join, but as I understand it this is still
> going to do a full table scan on my large data file. If this is billions of
> records then it’s still going to be slow, even if it only returns a handful
> of records.
>
> Also, I don’t know of any way to get Hive to do a join without performing a
> Map/Reduce. And, as I mentioned before, just the overheads of setting up a
> Map/Reduce, even if it’s map-only and does practically nothing, make the
> elapsed time too high. I want it to be interactive. (I guess that something
> like Tez, when it becomes available, might solve this problem…)
>
> My ugly “hack” approach works in seconds; the overhead of setting up
> Map/Reduce takes this into minutes.
>
> Indexing looks promising but, as far as I can see, it can’t be done
> without a Map/Reduce.
>
> If I could find a way to perform a join or use indexing without a
> Map/Reduce I would be happy to use that approach.
>
> Partitioning and ORC would be helpful, but I can’t assume anything about
> the original data format.
>
> Z
>
> From: Nitin Pawar [mailto:nitinpawar...@gmail.com]
> Sent: 27 June 2013 09:52
> To: user@hive.apache.org
> Subject: Re: Table Wrapper
>
> A few thoughts:
>
> If you have a smaller file (a few MBs in size), have you considered a
> map-only join?
>
> Also, if you are interested in particular records from a table and do not
> want to go through the entire table to find them, then partitioning +
> indexing will be handy.
>
> The ORCFile format (still very new) can help you in this regard as well.
>
> On Thu, Jun 27, 2013 at 2:16 PM, Peter Marron <
> peter.mar...@trilliumsoftware.com> wrote:
>
> Well, I’m not very good at keeping things brief, unfortunately.
> But I’ll have a go, trying to keep things simple.
>
> Suppose that I have a data table in Hive and it has many rows – say
> billions. I have another file stored in HDFS (it can be a Hive table too
> if it helps), and this file is small and contains file offsets into the
> data, stored as binary, 8 bytes per offset. Now suppose that I want to
> read the records from the data defined by the offsets in the small file,
> in the order defined in the small file.
>
> How can I do that?
>
> The obvious way is to turn the small file into a Hive table and provide a
> custom InputFormat which can read the binary. I’ve done that, that’s the
> easy part, and then I could form a query like this:
>
> SELECT * FROM data JOIN small ON data.BLOCK__OFFSET__INSIDE__FILE = small.offset;
>
> But, when it works, this performs awfully.
>
> The approach that I have taken is to create a “copy” of the data table
> which is “hacked” to use a custom input format which knows about the small
> file and which overrides the record reader to use the offsets as seeks
> before it reads the records. This is awkward, for various reasons, but it
> works well. I can avoid a full table scan; in fact I can suppress any
> Map/Reduce and so the query runs very quickly.
>
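> For reference, the seek-based record reader might look roughly like the
> sketch below. This is an illustration only, not the actual code: the class
> name is made up, loading the offsets from the small file is left to the
> custom InputFormat, and plain text records are assumed.
>
>     // Hypothetical sketch of a record reader that seeks to each wanted
>     // offset instead of scanning the whole split.
>     package example;
>
>     import java.io.IOException;
>
>     import org.apache.hadoop.fs.FSDataInputStream;
>     import org.apache.hadoop.io.LongWritable;
>     import org.apache.hadoop.io.Text;
>     import org.apache.hadoop.mapred.RecordReader;
>     import org.apache.hadoop.util.LineReader;
>
>     public class SeekingRecordReader implements RecordReader<LongWritable, Text> {
>
>       private final FSDataInputStream in;
>       private final long[] offsets;   // offsets read from the small file
>       private int next = 0;
>
>       public SeekingRecordReader(FSDataInputStream in, long[] offsets) {
>         this.in = in;
>         this.offsets = offsets;
>       }
>
>       public boolean next(LongWritable key, Text value) throws IOException {
>         if (next >= offsets.length) {
>           return false;                     // all requested records returned
>         }
>         long offset = offsets[next++];
>         in.seek(offset);                    // jump straight to the record
>         // Fresh LineReader after each seek so we never read a stale buffer.
>         new LineReader(in).readLine(value);
>         key.set(offset);
>         return true;
>       }
>
>       public LongWritable createKey() { return new LongWritable(); }
>       public Text createValue() { return new Text(); }
>       public long getPos() throws IOException { return in.getPos(); }
>       public float getProgress() { return next / (float) Math.max(offsets.length, 1); }
>       public void close() throws IOException { in.close(); }
>     }
>
> The custom InputFormat then just returns one of these from getRecordReader(),
> with the offsets already read from the small file.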
>
> So I was just trying to “wrap” the data table so that I didn’t have to
> create the copy.
>
> I hope that you don’t regret asking too much.
>
> Regards,
>
> Z
>
> From: Stephen Sprague [mailto:sprag...@gmail.com]
> Sent: 25 June 2013 18:37
> To: user@hive.apache.org
> Subject: Re: Table Wrapper
>
> Good luck, bro. :)  May I ask why you are doing this to yourself? I think
> your instincts are correct: going down the path you describe sounds a tad
> more painful than just hitting yourself in the head with a hammer.
> Different strokes for different folks, though.
>
> So can we back up? What - briefly, if possible - do you want to achieve
> with a "wrapper"? (I'm going to regret asking that, I know.)
>
>
> On Tue, Jun 25, 2013 at 7:29 AM, Peter Marron <
> peter.mar...@trilliumsoftware.com> wrote:
>
> Hi,
>
> Running Hive 0.11.0 over Hadoop 1.0.4.
>
> I would like to be able to “wrap” a Hive table.
>
> So, if I have a table “X” which uses SerDe “s” and InputFormat “i”,
> then I would like to be able to create a table “Y” which has a
> SerDe “ws” which is a wrapper of “s” (and so can encapsulate an instance
> of “s”) and an InputFormat “wi” which is a wrapper of “i” (and similarly
> encapsulates an instance of “i”). So far I have done this by creating a
> table like this:
>
> CREATE TABLE Y (… copy of underlying table’s columns ...)
> ROW FORMAT SERDE 'ws'
> WITH SERDEPROPERTIES (…
>   'wrapped.serde.name'='org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe',
>   'wrapped.inputformat.name'='TextInputFormat',
>   'serialization.format'='|',
>   'field.delim'='|'
> )
> STORED AS
>   INPUTFORMAT 'wi'
>   OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
> TBLPROPERTIES (…);
>
> I have to add the names of the underlying classes “s” and “i”
> into the table properties so that I know what to instantiate.
> I also have to replicate all the column details of the wrapped table
> to ensure the correct information is passed down to the underlying SerDe
> when I instantiate it. I also have to know the output format. I have to
> explicitly add the default SerDe properties to get it to work.
> I have to explicitly provide the default output format too.
> If any changes are made to the underlying table then I need to reflect
> those changes in my “wrapper” table.
> It’s a mess.
>
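> To make the mechanism concrete, the wrapping SerDe “ws” ends up being little
> more than a delegator. A simplified, illustrative sketch of that shape (not
> the actual implementation) is:
>
>     // Simplified sketch of a delegating SerDe; a matching wrapping
>     // InputFormat "wi" is needed as well.
>     package example;
>
>     import java.util.Properties;
>
>     import org.apache.hadoop.conf.Configuration;
>     import org.apache.hadoop.hive.serde2.SerDe;
>     import org.apache.hadoop.hive.serde2.SerDeException;
>     import org.apache.hadoop.hive.serde2.SerDeStats;
>     import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
>     import org.apache.hadoop.io.Writable;
>
>     public class WrappingSerDe implements SerDe {
>
>       private SerDe wrapped;
>
>       public void initialize(Configuration conf, Properties tbl) throws SerDeException {
>         String name = tbl.getProperty("wrapped.serde.name");
>         try {
>           wrapped = (SerDe) Class.forName(name).newInstance();
>         } catch (Exception e) {
>           throw new SerDeException("Cannot instantiate wrapped SerDe " + name, e);
>         }
>         // The wrapper table's column properties are passed straight through,
>         // which is why they have to mirror the underlying table exactly.
>         wrapped.initialize(conf, tbl);
>       }
>
>       public Object deserialize(Writable blob) throws SerDeException {
>         return wrapped.deserialize(blob);
>       }
>
>       public ObjectInspector getObjectInspector() throws SerDeException {
>         return wrapped.getObjectInspector();
>       }
>
>       public Class<? extends Writable> getSerializedClass() {
>         return wrapped.getSerializedClass();
>       }
>
>       public Writable serialize(Object obj, ObjectInspector oi) throws SerDeException {
>         return wrapped.serialize(obj, oi);
>       }
>
>       public SerDeStats getSerDeStats() {
>         return wrapped.getSerDeStats();
>       }
>     }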
>
> What I’d like to be able to do is just parameterise my wrapper
> table with the name of the underlying table and, using that name,
> be able to instantiate the correct SerDe and InputFormat.
>
> Is there an easier way to do this?
>
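> Sketching the idea: with something like a single 'wrapped.table.name'
> property, the wrapper could resolve the underlying SerDe and InputFormat
> class names from the metastore. The helper below is illustrative only and
> assumes the standard metastore client API.
>
>     // Hypothetical sketch: resolve the wrapped table's SerDe and InputFormat
>     // class names from the metastore, given just "db.table".
>     package example;
>
>     import org.apache.hadoop.hive.conf.HiveConf;
>     import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
>     import org.apache.hadoop.hive.metastore.api.StorageDescriptor;
>     import org.apache.hadoop.hive.metastore.api.Table;
>
>     public class WrappedTableLookup {
>
>       /** Returns { serdeClassName, inputFormatClassName } for "db.table". */
>       public static String[] lookup(HiveConf conf, String qualifiedName) throws Exception {
>         String[] parts = qualifiedName.split("\\.");
>         HiveMetaStoreClient client = new HiveMetaStoreClient(conf);
>         try {
>           Table t = client.getTable(parts[0], parts[1]);
>           StorageDescriptor sd = t.getSd();
>           return new String[] { sd.getSerdeInfo().getSerializationLib(),
>                                 sd.getInputFormat() };
>         } finally {
>           client.close();
>         }
>       }
>     }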
>
> Any pointers appreciated.
>
> Z
>
> --
> Nitin Pawar
>
