Hi Esteban,

Your solution involves exactly what I am trying to avoid: having to keep the HDFS data up-to-date. I know I can easily schedule a dependency between the Sqoop import job and the Hive query job, and currently we have a scheduling tool (Opswise) for such things. But what if I just want to run an ad hoc query and forget to re-import the lookup data first? Maybe there is a way to register the Sqoop import as a hook for a particular Hive table, so it runs before every query?

I also understand the problem of having too many connections. I would like to hit the database only once and distribute the result over all the mappers, via the distributed cache or something like it. Isn't there a way to do that?
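The "hit the database only once" pattern Ruslan is asking about can be sketched as a lazily initialized per-JVM cache. This is a minimal, self-contained illustration, not an actual Hive UDF: the class and the `fetchLookupFromDb()` stub are hypothetical, and in a real UDF the load would happen once in its initialization callback (backed by JDBC), so each task JVM opens a single connection instead of one per row.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the "load lookup data once per task JVM" pattern.
// fetchLookupFromDb() is a hypothetical stand-in for a JDBC query;
// a real Hive UDF would run it once during initialization, not per row.
public class CachedLookup {
    private static Map<String, String> cache; // shared within one task JVM
    static int loadCount = 0;                 // exposed only for the demo

    private static Map<String, String> fetchLookupFromDb() {
        loadCount++;                          // counts how often we "hit the DB"
        Map<String, String> m = new HashMap<>();
        m.put("US", "United States");         // stand-in lookup rows
        m.put("DE", "Germany");
        return m;
    }

    public static synchronized String lookup(String key) {
        if (cache == null) {
            cache = fetchLookupFromDb();      // only the first call pays the cost
        }
        return cache.get(key);
    }

    public static void main(String[] args) {
        System.out.println(lookup("US"));
        System.out.println(lookup("DE"));
        System.out.println("loads=" + loadCount); // prints loads=1
    }
}
```

Note this caches per JVM, not cluster-wide: with N task JVMs you still get N connections, which is why Esteban's suggestion below (materialize the data in HDFS first) scales better for very large jobs.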
Ruslan

On Fri, Jun 15, 2012 at 9:43 PM, Esteban Gutierrez <este...@cloudera.com> wrote:
> Hi Ruslan,
>
> Jan's approach sounds like a good workaround only if you can use the output
> in a mapjoin, but I don't think it will scale nicely if you have a very
> large number of tasks, since that will translate into DB connections to
> MySQL. I think a more scalable and reliable way is just to schedule an Oozie
> workflow to transfer the data from MySQL to HDFS using Sqoop and trigger the
> Hive query once the transfer is done.
>
> cheers!
> esteban.
>
> --
> Cloudera, Inc.
>
>
> On Fri, Jun 15, 2012 at 10:28 AM, Ruslan Al-Fakikh <metarus...@gmail.com> wrote:
>>
>> Thanks Jan
>>
>> On Fri, Jun 15, 2012 at 4:35 PM, Jan Dolinár <dolik....@gmail.com> wrote:
>> > On 6/15/12, Ruslan Al-Fakikh <ruslan.al-fak...@jalent.ru> wrote:
>> >> I didn't know InputFormat and LineReader could help, though I didn't
>> >> look at them closely. I was thinking about implementing a
>> >> Table-Generating Function (UDTF) if there is no already implemented
>> >> solution.
>> >
>> > Both are possible, InputFormat and/or UD(T)F. It all depends on what
>> > you need. I actually use both: in the InputFormat I load lists of
>> > allowed values to check the data, and in a UDF I query another
>> > database for values needed only in some queries. Generally, I'd use
>> > an InputFormat in situations where all jobs over a given table would
>> > require the additional data from the RDBMS. Conversely, in situations
>> > where only a few jobs out of many require the RDBMS connection, I would
>> > use a UDF.
>> >
>> > I think the difference in performance between the two is rather
>> > small, if any. Also, a UDF is easier to write, so it might be the "weapon
>> > of choice", at least if you don't already use a custom InputFormat.
>> >
>> > Jan
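For reference, the Oozie workflow Esteban describes (Sqoop import, then the Hive query on success) can be sketched roughly as below. This is a hand-written illustration, not a tested workflow: the workflow name, JDBC URL, table, paths, and script name are all hypothetical placeholders, and the action schema versions should be matched to your Oozie release.

```xml
<workflow-app name="refresh-lookup-then-query" xmlns="uri:oozie:workflow:0.4">
  <start to="sqoop-import"/>

  <!-- Step 1: re-import the MySQL lookup table into HDFS -->
  <action name="sqoop-import">
    <sqoop xmlns="uri:oozie:sqoop-action:0.2">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <command>import --connect jdbc:mysql://db.example.com/lookups
        --table country_codes --target-dir /data/lookups/country_codes -m 1</command>
    </sqoop>
    <ok to="hive-query"/>
    <error to="fail"/>
  </action>

  <!-- Step 2: run the Hive query only after the transfer succeeded -->
  <action name="hive-query">
    <hive xmlns="uri:oozie:hive-action:0.2">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <script>query.hql</script>
    </hive>
    <ok to="end"/>
    <error to="fail"/>
  </action>

  <kill name="fail">
    <message>Lookup refresh or query failed</message>
  </kill>
  <end name="end"/>
</workflow-app>
```

The point of the dependency graph is exactly what the thread discusses: the Hive query can never see a stale lookup table, because it is only reachable through the `ok` transition of the import action.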