Hadoop runs tasks in parallel by nature, so any Hadoop job will produce one
output file per reducer (or per mapper in your case: since your query doesn't
do any grouping or joining, it doesn't use any reducers).  In general, there
are a few options for merging the output of a Hadoop job:

* You can ask Hadoop to merge the output after the fact using "hadoop fs
-getmerge <src> <localdst>" (which essentially does a cat for you; see the
shell sketch after this list).

* You can use a single reducer instead of multiple reducers by running "set
mapred.reduce.tasks=1;".  This will likely slow down the query, since the
reduce stage will run on only one node, but it will result in a single
output file.

* You can add an extra map-reduce job (or just a reduce step if your job is
map-only) to the end of the pipeline which just merges the results.  Check
out the hive.merge.mapfiles and hive.merge.mapredfiles options in
hive/conf/hive-default.xml, which tell Hive to do this for you (this was
added relatively recently, so make sure you're using a recent build).  Both
Hive-side settings are shown in the second sketch after this list.

In your particular case, the query you're running doesn't use any reducers,
which complicates things a bit.  You could add a group by clause to the
statement to force Hive to use a reducer, and then set mapred.reduce.tasks=1,
which would merge the output (sketched below).  You should also be able to
do the same thing by setting hive.merge.mapfiles to true, which should add a
reduce step that just merges the output - but for some reason this wasn't
working on my build.
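
Putting that together against the query you quoted below, a rough sketch of
the group-by workaround might look like this.  One caveat: grouping by every
selected column also collapses duplicate rows, so only use it if
deduplication is acceptable for this data:

    -- Force the grouped query through a single reducer.
    SET mapred.reduce.tasks=1;

    INSERT OVERWRITE LOCAL DIRECTORY '../../dwh_out/click_term_20091219.dat'
    SELECT a.date_key, a.deal_id, a.is_roi, a.search_query,
           a.traffic_source, a.country_id
    FROM str_click_term_final a
    -- The group by exists only to force a reduce stage; it also
    -- deduplicates identical rows as a side effect.
    GROUP BY a.date_key, a.deal_id, a.is_roi, a.search_query,
             a.traffic_source, a.country_id;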

On 12/21/09 7:45 PM, "Sagi, Lee" <[email protected]> wrote:

> I'm trying to create a local os file from a hive query:
> 
> INSERT OVERWRITE LOCAL DIRECTORY '../../dwh_out/click_term_20091219.dat'
> SELECT a.date_key, a.deal_id, a.is_roi, a.search_query, a.traffic_source,
> a.country_id FROM str_click_term_final a
> 
> I expected to have a file called "click_term_20091219.dat" in directory
> "../../dwh_out", but instead I got a directory named
> "../../dwh_out/click_term_20091219.dat" containing multiple files like
> "attempt_200912182309_0102_m_000036_0*",
> "attempt_200912182309_0102_m_000037_0*", etc.
> 
> Any idea how I can get one file?  (I know I can "cat" on the OS, but I'm
> looking for a Hive solution.)
> 
> 
> Thanks. 
> 
> Lee Sagi | Data Warehouse Tech Lead & Architect 
