Well. You got a few suggestions there, Peter. That in itself is reason to celebrate!
And that was a good description, and I fault you not for going into some detail. The part about keeping it simple is always a challenge, I know. :) I get your point, but I don't have anything more really to offer. If running an MR job is a blocker, I'm not in a position to suggest anything.

On Thu, Jun 27, 2013 at 3:14 AM, Peter Marron <peter.mar...@trilliumsoftware.com> wrote:

Hi,

If you're suggesting that I use something like

    SELECT * FROM data WHERE MyUdf(data.BLOCK__OFFSET__INSIDE__FILE);

rather than

    SELECT * FROM data JOIN small ON data.BLOCK__OFFSET__INSIDE__FILE = small.offset;

then, yes, I have thought of that. However, the fact is that reading the billions of records and filtering them is too slow compared to doing the seeks.

Partitioning would help, but I can't assume that the big data is partitioned in a way that suits this particular query.

Z

From: Jan Dolinár [mailto:dolik....@gmail.com]
Sent: 27 June 2013 10:59
To: user
Subject: Re: Table Wrapper

A slightly less "hackish" way to do this without joins is to write a custom UDF that takes data.BLOCK__OFFSET__INSIDE__FILE as an input parameter and returns the corresponding data from the small file. If you mark it deterministic using @UDFType(deterministic = true), the performance should be quite good.

To avoid the full table scan, partitioning is IMHO the best way to speed things up.

Best regards,
J. Dolinar

On Thu, Jun 27, 2013 at 11:18 AM, Peter Marron <peter.mar...@trilliumsoftware.com> wrote:

Hi,

I have thought about a map-only join, but as I understand it this is still going to do a full table scan on my large data file. If this is billions of records then it's still going to be slow, even if it only returns a handful of records.

Also, I don't know of any way to get Hive to do a join without performing a Map/Reduce. And, as I mentioned before, just the overhead of setting up a Map/Reduce, even if it's map-only and does practically nothing, makes the elapsed time too high. I want it to be interactive. (I guess that something like Tez, when it becomes available, might solve this problem...)

My ugly "hack" approach works in seconds; the overhead of setting up Map/Reduce takes this into minutes.

Indexing looks promising but, as far as I can see, it can't be done without a Map/Reduce.

If I could find a way to perform a join or use indexing without a Map/Reduce, I would be happy to use that approach.

Partitioning and ORC would be helpful, but I can't assume anything about the original data format.

Z

From: Nitin Pawar [mailto:nitinpawar...@gmail.com]
Sent: 27 June 2013 09:52
To: user@hive.apache.org
Subject: Re: Table Wrapper

A few thoughts:

If you have a smaller file (MBs in size), have you considered a map-only join?

Also, if you are interested in particular records from a table and do not want to go through the entire table to find them, then partitioning + indexing will be handy.

The ORCFile format (still very new) can help you in this regard as well.

--
Nitin Pawar
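For concreteness, the deterministic UDF that Jan suggests could be sketched roughly as below. This is only a guess at one possible shape: the class name, package, and offset-file path are invented, and it assumes the small file is a raw sequence of 8-byte big-endian offsets (as Peter describes) that fits comfortably in memory.

    // Hypothetical filter UDF: returns true when a row's file offset appears
    // in the small offset file. All names and the path are illustrative.
    package example;

    import java.io.IOException;
    import java.util.HashSet;
    import java.util.Set;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hive.ql.exec.UDF;
    import org.apache.hadoop.hive.ql.udf.UDFType;

    @UDFType(deterministic = true)  // lets Hive avoid re-evaluating where it can
    public class OffsetFilterUdf extends UDF {

      private Set<Long> offsets;  // loaded lazily, once per task

      public boolean evaluate(Long blockOffset) throws IOException {
        if (blockOffset == null) {
          return false;
        }
        if (offsets == null) {
          offsets = loadOffsets(new Path("/tmp/small_offsets.bin"));  // hypothetical
        }
        return offsets.contains(blockOffset);
      }

      // Reads the small file: a raw sequence of 8-byte big-endian offsets.
      private static Set<Long> loadOffsets(Path path) throws IOException {
        Set<Long> result = new HashSet<Long>();
        FileSystem fs = FileSystem.get(new Configuration());
        long len = fs.getFileStatus(path).getLen();
        FSDataInputStream in = fs.open(path);
        try {
          for (long read = 0; read + 8 <= len; read += 8) {
            result.add(in.readLong());
          }
        } finally {
          in.close();
        }
        return result;
      }
    }

Registered with something like CREATE TEMPORARY FUNCTION MyUdf AS 'example.OffsetFilterUdf', this would back the SELECT * FROM data WHERE MyUdf(...) form in Peter's message above. Note that it still reads every row; it only avoids the join, which is exactly the limitation Peter points out.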
On Thu, Jun 27, 2013 at 2:16 PM, Peter Marron <peter.mar...@trilliumsoftware.com> wrote:

Well, I'm not very good at keeping things brief, unfortunately. But I'll have a go, trying to keep things simple.

Suppose that I have a data table in Hive and it has many rows, say billions. I have another file stored in HDFS (it can be a Hive table too if it helps), and this file is small and contains file offsets into the data, stored as binary, 8 bytes per offset. Now suppose that I want to read the records from the data defined by the offsets in the small file, in the order defined in the small file.

How can I do that?

The obvious way is to turn the small file into a Hive table and provide a custom InputFormat which can read the binary. I've done that (that's the easy part), and then I could form a query like this:

    SELECT * FROM data JOIN small ON data.BLOCK__OFFSET__INSIDE__FILE = small.offset;

But, when it works, this performs awfully.

The approach that I have taken is to create a "copy" of the data table which is "hacked" to use a custom input format which knows about the small file and which overrides the record reader to use the offsets as seeks before it reads the records. This is awkward, for various reasons, but it works well. I can avoid a full table scan; in fact I can suppress any Map/Reduce, and so the query runs very quickly.

So I was just trying to "wrap" the data table so that I didn't have to create the copy.

I hope that you don't regret asking too much.

Regards,

Z
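The seek-based record reader Peter describes above might be sketched as follows, using the old mapred API that Hive 0.11 expects. Everything here is an assumption about his setup: the class name is invented, the big file is assumed to be line-oriented text, and the offsets are assumed to be raw 8-byte big-endian longs.

    // Hypothetical record reader: instead of scanning a split, it seeks to
    // each offset listed in the small side file and reads one record there.
    package example;

    import java.io.IOException;

    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.RecordReader;
    import org.apache.hadoop.util.LineReader;

    public class SeekingRecordReader implements RecordReader<LongWritable, Text> {

      private final FSDataInputStream data;     // the big data file
      private final FSDataInputStream offsets;  // the small file, 8 bytes/offset
      private final long numOffsets;
      private long consumed = 0;

      public SeekingRecordReader(JobConf conf, Path dataFile, Path offsetFile)
          throws IOException {
        FileSystem fs = dataFile.getFileSystem(conf);
        data = fs.open(dataFile);
        offsets = fs.open(offsetFile);
        numOffsets = fs.getFileStatus(offsetFile).getLen() / 8;
      }

      public boolean next(LongWritable key, Text value) throws IOException {
        if (consumed >= numOffsets) {
          return false;
        }
        long offset = offsets.readLong();  // next stored offset (big-endian)
        data.seek(offset);                 // jump straight to the record
        key.set(offset);
        // A fresh LineReader per seek avoids stale buffered bytes.
        new LineReader(data).readLine(value);
        consumed++;
        return true;
      }

      public LongWritable createKey() { return new LongWritable(); }
      public Text createValue() { return new Text(); }
      public long getPos() throws IOException { return data.getPos(); }
      public float getProgress() {
        return numOffsets == 0 ? 1.0f : (float) consumed / numOffsets;
      }

      public void close() throws IOException {
        data.close();
        offsets.close();
      }
    }

A custom InputFormat would hand this reader out from getRecordReader(), and since a plain SELECT * can then be answered without launching a Map/Reduce job, this would match the "seconds rather than minutes" behaviour Peter reports.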
From: Stephen Sprague [mailto:sprag...@gmail.com]
Sent: 25 June 2013 18:37
To: user@hive.apache.org
Subject: Re: Table Wrapper

Good luck, bro. :) May I ask why you are doing this to yourself? I think your instincts are correct: going down the path you describe sounds a tad more painful than just hitting yourself in the head with a hammer. Different strokes for different folks, though.

So can we back up? What, briefly if possible, do you want to achieve with a "wrapper"? (I'm going to regret asking that, I know.)

On Tue, Jun 25, 2013 at 7:29 AM, Peter Marron <peter.mar...@trilliumsoftware.com> wrote:

Hi,

Running Hive 0.11.0 over Hadoop 1.0.4.

I would like to be able to "wrap" a Hive table.

So, if I have a table "X" which uses SerDe "s" and InputFormat "i", then I would like to be able to create a table "Y" which has a SerDe "ws" which is a wrapper of "s" (and so can encapsulate an instance of "s") and an InputFormat "wi" which is a wrapper of "i" (and similarly encapsulates an instance of "i"). So far I have done this by creating a table like this:

    CREATE TABLE Y (... copy of underlying table's columns ...)
    ROW FORMAT SERDE 'ws'
    WITH SERDEPROPERTIES (...
      'wrapped.serde.name'='org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe',
      'wrapped.inputformat.name'='TextInputFormat',
      'serialization.format'='|', 'field.delim'='|'
    )
    STORED AS
      INPUTFORMAT 'wi'
      OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
    TBLPROPERTIES (...);

I have to add the names of the underlying classes "s" and "i" into the table properties so that I know what to instantiate. I also have to replicate all the column details of the wrapped table to ensure the correct information is passed down to the underlying SerDe when I instantiate it. I also have to know the output format. I have to explicitly add the default SerDe properties to get it to work. I have to explicitly provide the default output format too. And if any changes are made to the underlying table then I need to reflect those changes in my "wrapper" table. It's a mess.

What I'd like to be able to do is to just parameterise my wrapper table with the name of the underlying table and, using that name, be able to instantiate the correct SerDe and InputFormat.

Is there an easier way to do this?

Any pointers appreciated.

Z
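For reference, the delegating wrapper SerDe Peter describes might look roughly like the sketch below: it instantiates the underlying SerDe from the 'wrapped.serde.name' property shown in his CREATE TABLE and forwards every call to it. This is a guess at the shape of his code, not code from the thread; the class name is invented and error handling is minimal.

    // Hypothetical wrapper SerDe: instantiates the wrapped SerDe by reflection
    // from the table properties and delegates the whole SerDe interface to it.
    package example;

    import java.util.Properties;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hive.serde2.SerDe;
    import org.apache.hadoop.hive.serde2.SerDeException;
    import org.apache.hadoop.hive.serde2.SerDeStats;
    import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
    import org.apache.hadoop.io.Writable;

    public class WrapperSerDe implements SerDe {

      private SerDe wrapped;

      public void initialize(Configuration conf, Properties tbl) throws SerDeException {
        // The property Peter adds by hand so the wrapper knows what to build.
        String className = tbl.getProperty("wrapped.serde.name");
        try {
          wrapped = (SerDe) Class.forName(className).newInstance();
        } catch (Exception e) {
          throw new SerDeException("Cannot instantiate wrapped SerDe " + className, e);
        }
        // The wrapped SerDe sees the same table properties, which is why the
        // wrapper table has to replicate all the columns and defaults by hand.
        wrapped.initialize(conf, tbl);
      }

      public Object deserialize(Writable blob) throws SerDeException {
        // A real wrapper would do its extra work here, around the delegate.
        return wrapped.deserialize(blob);
      }

      public ObjectInspector getObjectInspector() throws SerDeException {
        return wrapped.getObjectInspector();
      }

      public SerDeStats getSerDeStats() {
        return wrapped.getSerDeStats();
      }

      public Class<? extends Writable> getSerializedClass() {
        return wrapped.getSerializedClass();
      }

      public Writable serialize(Object obj, ObjectInspector oi) throws SerDeException {
        return wrapped.serialize(obj, oi);
      }
    }

What Peter is asking for instead is to carry only the underlying table's name and have the wrapper resolve the SerDe, InputFormat, and columns from the metastore itself, so the wrapper table never drifts out of sync.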