RE: Fail to Increase Hive Mapper Tasks?
As for the number of 22, I guess that your table has multiple files, probably 2. Hive divides the desired number of map tasks evenly among the files of the table, and the number of map tasks for a file may be increased because the file size can't be divided exactly by it.

-----Original Message-----
From: Ji Zhang [mailto:zhangj...@gmail.com]
Sent: Friday, January 03, 2014 12:27 PM
To: user@hive.apache.org
Subject: Re: Fail to Increase Hive Mapper Tasks?

Hi Rui,

I combined your suggestion with the answer from SO (http://stackoverflow.com/questions/20816726/fail-to-increase-hive-mapper-tasks), and it works:

set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
set mapred.map.tasks=20;
select count(*) from dw_stage.st_dw_marketing_touch_pi_metrics_basic;

It uses 22 mappers, though I don't know why it's not exactly 20. I'm using Hive 0.9 + Hadoop 1.0.1.

Thank you very much.

Jerry

On Fri, Jan 3, 2014 at 10:51 AM, Sun, Rui wrote:
> Hi, you can try set mapred.map.tasks=19.
> It seems that Hive is using the old Hadoop MapReduce API, so
> mapred.max.split.size won't work.
>
> -----Original Message-----
> From: Ji Zhang [mailto:zhangj...@gmail.com]
> Sent: Thursday, January 02, 2014 3:56 PM
> To: user@hive.apache.org
> Subject: Fail to Increase Hive Mapper Tasks?
>
> Hi,
>
> I have a managed Hive table which contains only one 150 MB file. When I run
> "select count(*) from tbl" against it, it uses 2 mappers. I want to set that
> to a bigger number.
>
> First I tried 'set mapred.max.split.size=8388608;', hoping it would use 19
> mappers, but it only uses 3; somehow it still splits the input by 64 MB. I
> also tried 'set dfs.block.size=8388608;', which didn't work either.
>
> Then I tried a vanilla MapReduce job to do the same thing. It initially uses
> 3 mappers, and when I set mapred.max.split.size, it uses 19. So the problem
> lies in Hive, I suppose.
>
> I read some of the Hive source code, like CombineHiveInputFormat and
> ExecDriver, but can't find a clue.
>
> What other settings can I use?
>
> Thanks in advance.
>
> Jerry
Re: Fail to Increase Hive Mapper Tasks?
Hi Rui,

I combined your suggestion with the answer from SO (http://stackoverflow.com/questions/20816726/fail-to-increase-hive-mapper-tasks), and it works:

set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
set mapred.map.tasks=20;
select count(*) from dw_stage.st_dw_marketing_touch_pi_metrics_basic;

It uses 22 mappers, though I don't know why it's not exactly 20. I'm using Hive 0.9 + Hadoop 1.0.1.

Thank you very much.

Jerry

On Fri, Jan 3, 2014 at 10:51 AM, Sun, Rui wrote:
> Hi, you can try set mapred.map.tasks=19.
> It seems that Hive is using the old Hadoop MapReduce API, so
> mapred.max.split.size won't work.
>
> -----Original Message-----
> From: Ji Zhang [mailto:zhangj...@gmail.com]
> Sent: Thursday, January 02, 2014 3:56 PM
> To: user@hive.apache.org
> Subject: Fail to Increase Hive Mapper Tasks?
>
> Hi,
>
> I have a managed Hive table which contains only one 150 MB file. When I run
> "select count(*) from tbl" against it, it uses 2 mappers. I want to set that
> to a bigger number.
>
> First I tried 'set mapred.max.split.size=8388608;', hoping it would use 19
> mappers, but it only uses 3; somehow it still splits the input by 64 MB. I
> also tried 'set dfs.block.size=8388608;', which didn't work either.
>
> Then I tried a vanilla MapReduce job to do the same thing. It initially uses
> 3 mappers, and when I set mapred.max.split.size, it uses 19. So the problem
> lies in Hive, I suppose.
>
> I read some of the Hive source code, like CombineHiveInputFormat and
> ExecDriver, but can't find a clue.
>
> What other settings can I use?
>
> Thanks in advance.
>
> Jerry
RE: Fail to Increase Hive Mapper Tasks?
Hi,

You can try set mapred.map.tasks=19. It seems that Hive is using the old Hadoop MapReduce API, so mapred.max.split.size won't work.

-----Original Message-----
From: Ji Zhang [mailto:zhangj...@gmail.com]
Sent: Thursday, January 02, 2014 3:56 PM
To: user@hive.apache.org
Subject: Fail to Increase Hive Mapper Tasks?

Hi,

I have a managed Hive table which contains only one 150 MB file. When I run "select count(*) from tbl" against it, it uses 2 mappers. I want to set that to a bigger number.

First I tried 'set mapred.max.split.size=8388608;', hoping it would use 19 mappers, but it only uses 3; somehow it still splits the input by 64 MB. I also tried 'set dfs.block.size=8388608;', which didn't work either.

Then I tried a vanilla MapReduce job to do the same thing. It initially uses 3 mappers, and when I set mapred.max.split.size, it uses 19. So the problem lies in Hive, I suppose.

I read some of the Hive source code, like CombineHiveInputFormat and ExecDriver, but can't find a clue.

What other settings can I use?

Thanks in advance.

Jerry
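Putting the thread's findings together, a minimal sketch of a session that raises the mapper count on Hive 0.9 / Hadoop 1.x (the table name is a placeholder; mapred.map.tasks is only a hint to the old API, so the actual count may come out higher, as the 22-vs-20 result in this thread shows):

```sql
-- CombineHiveInputFormat (the default) merges small splits back together,
-- so requests for more mappers are ignored; switch to the plain input format.
set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;

-- Old-API hint: Hive divides this number among the table's files, and
-- per-file rounding can push the actual mapper count above it.
set mapred.map.tasks=20;

-- New-API alternative (when it is honored): cap the split size in bytes.
-- 8388608 bytes = 8 MB, so a 150 MB file yields roughly 19 splits.
-- set mapred.max.split.size=8388608;

select count(*) from my_table;  -- my_table is a placeholder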
Help on loading a data stream to a Hive table
Guys,

I am using Storm to read a data stream from our socket server, entry by entry, and then write each entry to a file: one entry per file. At some point I need to import the data into my Hive table. There are several approaches I can think of:

1. Directly write to the Hive HDFS file whenever I get an entry from our socket server. The problem is that this could be very inefficient, since we have a huge amount of streaming data and I would not want to write to Hive HDFS one entry at a time.

2. Write the entries to files (normal files or HDFS files) on disk, then have a separate job merge those small files into big ones and load them into the Hive table. The problems with this are: a) how can I merge small files into big files for Hive? b) what is the best file size to load into Hive?

I am seeking advice on both approaches, and appreciate your insight.

Thanks,
Chen
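For approach 2, one common compaction pattern is to land the small files in a staging table and rewrite them into the target table, letting Hive's standard merge settings consolidate the output. A sketch, assuming a staging table raw_events and a target table events (both hypothetical names):

```sql
-- Ask Hive to merge small files produced by the job.
set hive.merge.mapfiles=true;            -- merge outputs of map-only jobs
set hive.merge.mapredfiles=true;         -- merge outputs of map-reduce jobs
set hive.merge.size.per.task=256000000;  -- aim for ~256 MB per merged file

-- Rewriting the staged rows into the target table compacts them into a
-- small number of large files.
INSERT OVERWRITE TABLE events
SELECT * FROM raw_events;
```

On question b), a file size at or above the HDFS block size (64 MB by default on Hadoop 1.x) is generally a reasonable target, since files below the block size multiply NameNode metadata and map-task overhead.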
working with HIVE VARIABLE: Pls suggest
Hello Hive Champs,

I have a case statement where I need to check the date passed through a parameter: if the date is the 1st of the month, keep it as is; otherwise set the parameter date to the 1st of the month. Later operations are then performed on that date in Hive queries.

I have written this HiveQL:

select case when as_of_dt = ${hivevar:as_of_dt} then ${hivevar:as_of_dt} else date_sub(${hivevar:as_of_dt}, day(${hivevar:as_of_dt}) - 1) end as as_of_dt from TABLE group by as_of_dt;

The output of this query is, let's say, 2012-08-01. I want to store the value of this query into a variable, like:

MY_VARIABLE = (select case when as_of_dt = ${hivevar:as_of_dt} then ${hivevar:as_of_dt} else date_sub(${hivevar:as_of_dt}, day(${hivevar:as_of_dt}) - 1) end as as_of_dt from TABLE group by as_of_dt;)

How can I achieve that? Pls suggest. Thanks in advance.
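HiveQL itself cannot assign a query result to a hivevar. A common workaround is a shell wrapper that captures the output of hive -e and passes it back in with -hivevar. A sketch, where AS_OF_DT, TABLE, and downstream_query.hql are placeholders:

```sh
#!/bin/sh
# Run the date-normalizing query silently (-S) and capture its single-row
# output into a shell variable.
AS_OF_DT_FIRST=$(hive -S -e "
  select case when as_of_dt = '${AS_OF_DT}' then '${AS_OF_DT}'
              else date_sub('${AS_OF_DT}', day('${AS_OF_DT}') - 1)
         end as as_of_dt
  from TABLE group by as_of_dt;")

# Feed the captured value back into the next script as a hivevar.
hive -hivevar as_of_dt="${AS_OF_DT_FIRST}" -f downstream_query.hql
```

Note that -hivevar requires Hive 0.8 or later; on older versions, -hiveconf with ${hiveconf:...} references works the same way.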
Setting value into hive variable
Hello All,

I have a case statement where I need to check the date passed through a parameter: if the date is the 1st of the month, keep it as is; otherwise set the parameter date to the 1st of the month. Later operations are then performed on that date in Hive queries.

I have written this HiveQL:

select case when as_of_dt = ${hivevar:as_of_dt} then ${hivevar:as_of_dt} else date_sub(${hivevar:as_of_dt}, day(${hivevar:as_of_dt}) - 1) end as as_of_dt from TABLE group by as_of_dt;

I want to store the value of this query into a variable, like:

MY_VARIABLE = output of this query;

How can I achieve that? Pls suggest. Thanks in advance.