Bejoy,

Again, I do understand those two steps, and I understand that I have a lot of options for making them run in sequence, but from the very beginning my point was to avoid having two steps. I want to have a dataset in the Hive warehouse that I can query at any time with just a Hive query, without any preliminary imports/queries. So implementing a custom UDF/InputFormat looks best for now, except for the problem of having too many RDBMS connections (one connection per mapper, as far as I understand).
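For what it's worth, the per-mapper connection cost can be softened by opening the connection lazily and memoizing look-ups inside the task, so each mapper issues at most one connection and one query per distinct key. Below is a minimal sketch of that caching pattern only; the Hive UDF plumbing and the JDBC call are omitted, and `LookupCache` and its loader are hypothetical names, not Hive or JDBC APIs:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical helper: the kind of per-JVM cache a custom UDF could keep so
// that each map task queries the RDBMS at most once per distinct key.
public class LookupCache {
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    private final Function<String, String> loader; // stands in for a JDBC SELECT

    public LookupCache(Function<String, String> loader) {
        this.loader = loader;
    }

    // Returns the cached value, hitting the "database" only on first sight of a key.
    public String get(String key) {
        return cache.computeIfAbsent(key, loader);
    }

    public static void main(String[] args) {
        final int[] dbCalls = {0};
        LookupCache cache = new LookupCache(k -> {
            dbCalls[0]++;                 // in a real UDF this would be a JDBC query
            return k.toUpperCase();
        });
        System.out.println(cache.get("ru"));  // loads from the "db"
        System.out.println(cache.get("ru"));  // served from the cache
        System.out.println("db calls: " + dbCalls[0]);
        // prints RU, RU, db calls: 1
    }
}
```

This does not remove the one-connection-per-mapper issue, but it caps the query volume per task, which is often the larger cost in practice.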
Thanks

On Sat, Jun 16, 2012 at 6:04 AM, Bejoy KS <bejoy...@yahoo.com> wrote:
> Hi Ruslan
>
> The solution Esteban pointed out was:
> 1. Import the look-up data from the RDBMS to HDFS/Hive (you can fire any
> ad hoc query here). If the data is just a few MBs, one or two
> maps/connections are enough.
>
> 2. A look-up on this smaller data can then be achieved by joining it
> with the larger table.
>
> Now, since the look-up table is small, enable map joins so that the look-up
> table is put in the distributed cache and that data is used by the map
> tasks for the join.
>
> The two sequential steps mentioned above can be scheduled using a workflow
> manager such as Oozie.
>
> In simple terms, you can place these steps in order in a shell script and
> just execute the script.
>
> Regards
> Bejoy KS
>
> Sent from handheld, please excuse typos.
>
> -----Original Message-----
> From: Ruslan Al-Fakikh <metarus...@gmail.com>
> Date: Sat, 16 Jun 2012 04:40:36
> To: <user@hive.apache.org>
> Reply-To: user@hive.apache.org
> Subject: Re: Quering RDBMS table in a Hive query
>
> Hi Esteban,
>
> Your solution is what I am trying to avoid: having to keep the HDFS
> data up to date. I know I can easily schedule a dependency between the
> Sqoop import job and the Hive query job, and currently we have a
> scheduling tool (Opswise) for such things. But what if I just want to
> run an ad hoc query and forget to re-import the lookup data, etc.?
> Maybe there is a way to attach the Sqoop import as a hook to a
> particular Hive table, making it run before every query?
> But I understand the problem of having too many connections. I would
> like to have the connection only once and distribute the data to all the
> mappers via the distributed cache or something like that. Isn't there a
> way to do that?
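Conceptually, the map join Bejoy describes loads the small look-up table into an in-memory hash map (Hive ships it to each map task via the distributed cache) and probes it for every row of the big table, with no shuffle. A rough stand-alone illustration of that join logic, with made-up table contents (not Hive code):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of what a map-side (hash) join does per map task: the small table
// lives in memory; each big-table row is probed against it.
public class MapJoinSketch {
    // smallTable: join key -> look-up value (e.g. country code -> country name)
    static List<String> mapJoin(Map<String, String> smallTable, List<String[]> bigRows) {
        List<String> out = new ArrayList<>();
        for (String[] row : bigRows) {          // row = {joinKey, payload}
            String lookup = smallTable.get(row[0]);
            if (lookup != null) {               // inner join: unmatched rows dropped
                out.add(row[1] + "," + lookup);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> countries = new HashMap<>();
        countries.put("ru", "Russia");
        countries.put("us", "United States");
        List<String[]> events = new ArrayList<>();
        events.add(new String[]{"ru", "click"});
        events.add(new String[]{"us", "view"});
        events.add(new String[]{"xx", "view"}); // no match, dropped
        System.out.println(mapJoin(countries, events));
        // prints [click,Russia, view,United States]
    }
}
```

In Hive itself this is not hand-written, of course: one enables automatic map-join conversion (hive.auto.convert.join) or uses the MAPJOIN hint, and Hive handles the distributed-cache distribution.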
>
> Ruslan
>
> On Fri, Jun 15, 2012 at 9:43 PM, Esteban Gutierrez <este...@cloudera.com>
> wrote:
>> Hi Ruslan,
>>
>> Jan's approach sounds like a good workaround only if you can use the output
>> in a mapjoin, but I don't think it will scale nicely if you have a very
>> large number of tasks, since that will translate into DB connections to
>> MySQL. I think a more scalable and reliable way is just to schedule an Oozie
>> workflow to transfer the data from MySQL to HDFS using Sqoop and trigger the
>> Hive query once the transfer is done.
>>
>> cheers!
>> esteban.
>>
>> --
>> Cloudera, Inc.
>>
>> On Fri, Jun 15, 2012 at 10:28 AM, Ruslan Al-Fakikh <metarus...@gmail.com>
>> wrote:
>>>
>>> Thanks Jan
>>>
>>> On Fri, Jun 15, 2012 at 4:35 PM, Jan Dolinár <dolik....@gmail.com> wrote:
>>> > On 6/15/12, Ruslan Al-Fakikh <ruslan.al-fak...@jalent.ru> wrote:
>>> >> I didn't know InputFormat and LineReader could help, though I didn't
>>> >> look at them closely. I was thinking about implementing a
>>> >> Table-Generating Function (UDTF) if there is no already implemented
>>> >> solution.
>>> >
>>> > Both are possible: an InputFormat and/or a UD(T)F. It all depends on
>>> > what you need. I actually use both: in the InputFormat I load lists of
>>> > allowed values to validate the data, and in a UDF I query another
>>> > database for values needed only in some queries. Generally, I'd use an
>>> > InputFormat in situations where all jobs over a given table would
>>> > require the additional data from the RDBMS. Conversely, in situations
>>> > where only a few jobs out of many require the RDBMS connection, I
>>> > would use a UDF.
>>> >
>>> > I think the difference in performance between the two is rather small,
>>> > if any. Also, a UDF is easier to write, so it might be the "weapon of
>>> > choice", at least if you don't already use a custom InputFormat.
>>> >
>>> > Jan