The nature of Hadoop is that it runs tasks in parallel, so any Hadoop job produces one output file per reducer (or per mapper in your case: your query does no grouping or joining, so it doesn't use any reducers). In general, there are a few options for merging the output of a Hadoop job:
* You can ask Hadoop to merge the output after the fact using "hadoop fs -getmerge <src> <localdst>" (which essentially does a cat for you).

* You can use a single reducer instead of multiple reducers using "set mapred.reduce.tasks=1;". This will likely slow down the query since the reduce stage will only run on one node, but it will result in one output file.

* You can add an extra map-reduce job (or just a reduce step if your job is map-only) to the end of the pipeline which just merges the results. Check out the hive.merge.mapfiles and hive.merge.mapredfiles options in hive/conf/hive-default.xml, which tell Hive to do this for you (this was added relatively recently, so make sure you're using a recent build).

In your particular case, the query you're running doesn't use any reducers, which complicates things a bit. You could add a GROUP BY clause to the statement to force Hive to use a reducer, and then set mapred.reduce.tasks=1, which would merge the output. You should also be able to do the same thing by setting hive.merge.mapfiles to true, which should add a reduce step that just merges the output - but for some reason, on my build this wasn't working. (Rough sketches of each approach follow the quoted message below.)

On 12/21/09 7:45 PM, "Sagi, Lee" <[email protected]> wrote:

> I'm trying to create a local os file from a hive query:
>
> INSERT OVERWRITE LOCAL DIRECTORY '../../dwh_out/click_term_20091219.dat'
> SELECT a.date_key, a.deal_id, a.is_roi, a.search_query, a.traffic_source,
> a.country_id FROM str_click_term_final a
>
> I expected to have a file called "click_term_20091219.dat" in directory
> "../../dwh_out"; instead I got a directory named
> "../../dwh_out/click_term_20091219.dat" and in it multiple files like
> "attempt_200912182309_0102_m_000036_0*",
> "attempt_200912182309_0102_m_000037_0*", etc.
>
> Any idea how I can have one file? (I know I can "cat" on the os, but I'm
> looking for a Hive solution.)
>
> Thanks.
>
> Lee Sagi | Data Warehouse Tech Lead & Architect
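To make the getmerge route concrete: getmerge reads from HDFS, so the idea would be to write to an HDFS directory instead of a LOCAL one, then pull the merged result down. A rough, untested sketch ('/tmp/click_term_out' is just a hypothetical staging path):

  INSERT OVERWRITE DIRECTORY '/tmp/click_term_out'
  SELECT a.date_key, a.deal_id, a.is_roi, a.search_query, a.traffic_source, a.country_id
  FROM str_click_term_final a;

and then, from the shell:

  hadoop fs -getmerge /tmp/click_term_out ../../dwh_out/click_term_20091219.dat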
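For the GROUP BY workaround, a sketch along these lines should work (one caveat: grouping on every selected column will also collapse duplicate rows, so only use this if that's acceptable for your data):

  set mapred.reduce.tasks=1;
  INSERT OVERWRITE LOCAL DIRECTORY '../../dwh_out/click_term_20091219.dat'
  SELECT a.date_key, a.deal_id, a.is_roi, a.search_query, a.traffic_source, a.country_id
  FROM str_click_term_final a
  GROUP BY a.date_key, a.deal_id, a.is_roi, a.search_query, a.traffic_source, a.country_id;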
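And for completeness, the merge-file settings would look like this (as I said, this wasn't working on my build, so your mileage may vary):

  set hive.merge.mapfiles=true;
  set hive.merge.mapredfiles=true;
  -- then run your original INSERT OVERWRITE LOCAL DIRECTORY ... query unchanged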
