Re: help on failed MR jobs (big hive files)
Elaine,

Nitin raises some good points. Continuing along the same lines, let's take a closer look at the query:

insert overwrite table B select a, b, c from table A where datediff(to_date(from_unixtime(unix_timestamp('${logdate}'))), request_date) <= 30

In the above query, "datediff(to_date(from_unixtime(unix_timestamp('${logdate}'))), request_date)" causes this set of nested functions to be evaluated for every record in your 6 GB dataset on the server. It would be best if this computation were done in your client (the bash script or Java code issuing the Hive queries) so that the query sent to the Hive server looks like:

"request_date >= '2012-01-01' and request_date < '2012-06-01'"

That would shave off a lot of time. If the performance is still poor, consider partitioning your data (based on date?). Also make sure you don't suffer from the small files problem:
http://www.cloudera.com/blog/2009/02/the-small-files-problem/

Good luck!
Mark

On Wed, Dec 12, 2012 at 11:36 PM, Nitin Pawar wrote:
> 6GB is nothing. We have done it with a few TB of data in Hive.
> The error you are seeing is on the Hadoop side.
>
> You can always optimize your query based on the Hadoop compute capacity
> you have, and you will also need to design your schema based on the
> patterns in your data.
>
> The problem here may be that you have a function to execute in the where
> clause. Can you try hard-coding it to a date range and see if you get any
> improvement?
>
> Alternatively, if you can partition your data by date, you will have a
> smaller dataset to read.
>
> If you have a good-sized Hadoop cluster, then lower the split size and
> launch more maps; that way it will execute more quickly.
>
> By the heapsize increase, did you mean increasing the Hive heapsize or
> the Hadoop mapred heapsize? You will need to increase the heapsize on
> mapred by setting the properties:
> set mapred.job.map.memory.mb=6000;
> set mapred.job.reduce.memory.mb=4000;
>
> On Wed, Dec 12, 2012 at 3:13 PM, Elaine Gan wrote:
>> Hi,
>>
>> I'm trying to run a program on Hadoop.
>>
>> [Input] tsv file
>>
>> My program does the following.
>> (1) Load tsv into hive
>>     load data local inpath 'tsvfile' overwrite into table A partitioned by xx
>> (2) insert overwrite table B select a, b, c from table A where
>>     datediff(to_date(from_unixtime(unix_timestamp('${logdate}'))), request_date) <= 30
>> (3) Running Mahout
>>
>> In step 2, I am trying to retrieve data from hive for the past month.
>> My hadoop work always stopped here.
>> When I check through my browser utility it says:
>>
>> Diagnostic Info:
>> # of failed Map Tasks exceeded allowed limit. FailedCount: 1.
>> LastFailedTask: task_201211291541_0262_m_001800
>>
>> Task attempt_201211291541_0262_m_001800_0 failed to report status for 1802 seconds. Killing!
>> Error: Java heap space
>> Task attempt_201211291541_0262_m_001800_2 failed to report status for 1800 seconds. Killing!
>> Task attempt_201211291541_0262_m_001800_3 failed to report status for 1801 seconds. Killing!
>>
>> Each hive table is big, around 6 GB.
>>
>> (1) Is it too big to have around 6GB for each hive table?
>> (2) I've increased my HEAPSIZE to 50G, which I think is far more than
>>     enough. Anywhere else I can do the tuning?
>>
>> Thank you.
>>
>> rei

--
Nitin Pawar
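Mark's suggestion (computing the date bounds once in the client rather than per row on the server) can be sketched as follows. The helper name and the 30-day window are illustrative, matching the query discussed above:

```python
from datetime import date, timedelta

def build_query(logdate: str, window_days: int = 30) -> str:
    """Precompute the date range in the client so the Hive server only
    evaluates a simple range predicate on request_date, instead of a
    nested-function call per record."""
    end = date.fromisoformat(logdate)
    start = end - timedelta(days=window_days)
    return (
        "insert overwrite table B select a, b, c from A "
        f"where request_date >= '{start.isoformat()}' "
        f"and request_date <= '{logdate}'"
    )
```

The bash or Java driver would call this once per run and submit the resulting string to Hive.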
Re: help on failed MR jobs (big hive files)
6GB is nothing. We have done it with a few TB of data in Hive. The error you are seeing is on the Hadoop side.

You can always optimize your query based on the Hadoop compute capacity you have, and you will also need to design your schema based on the patterns in your data.

The problem here may be that you have a function to execute in the where clause. Can you try hard-coding it to a date range and see if you get any improvement?

Alternatively, if you can partition your data by date, you will have a smaller dataset to read.

If you have a good-sized Hadoop cluster, then lower the split size and launch more maps; that way it will execute more quickly.

By the heapsize increase, did you mean increasing the Hive heapsize or the Hadoop mapred heapsize? You will need to increase the heapsize on mapred by setting the properties:

set mapred.job.map.memory.mb=6000;
set mapred.job.reduce.memory.mb=4000;

On Wed, Dec 12, 2012 at 3:13 PM, Elaine Gan wrote:
> Hi,
>
> I'm trying to run a program on Hadoop.
>
> [Input] tsv file
>
> My program does the following.
> (1) Load tsv into hive
>     load data local inpath 'tsvfile' overwrite into table A partitioned by xx
> (2) insert overwrite table B select a, b, c from table A where
>     datediff(to_date(from_unixtime(unix_timestamp('${logdate}'))), request_date) <= 30
> (3) Running Mahout
>
> In step 2, I am trying to retrieve data from hive for the past month.
> My hadoop work always stopped here.
> When I check through my browser utility it says:
>
> Diagnostic Info:
> # of failed Map Tasks exceeded allowed limit. FailedCount: 1.
> LastFailedTask: task_201211291541_0262_m_001800
>
> Task attempt_201211291541_0262_m_001800_0 failed to report status for 1802 seconds. Killing!
> Error: Java heap space
> Task attempt_201211291541_0262_m_001800_2 failed to report status for 1800 seconds. Killing!
> Task attempt_201211291541_0262_m_001800_3 failed to report status for 1801 seconds. Killing!
>
> Each hive table is big, around 6 GB.
>
> (1) Is it too big to have around 6GB for each hive table?
> (2) I've increased my HEAPSIZE to 50G, which I think is far more than
>     enough. Anywhere else I can do the tuning?
>
> Thank you.
>
> rei

--
Nitin Pawar
Re: REST API for Hive queries?
Hive takes longer to respond to queries as the data gets larger. The best way to handle this is to process the data in Hive and store the results in an RDBMS like MySQL. On top of that you can then write your own API, or use an interface like Pentaho where users can write queries or see predefined reports. Pentaho does have a Hive connection as well. There are other platforms such as Talend, Datameer, etc. You can have a look at them.

On Thu, Dec 13, 2012 at 1:15 AM, Leena Gupta wrote:
> Hi,
>
> We are using Hive as our data warehouse to run various queries on large
> amounts of data. There are some users who would like to get access to the
> output of these queries and display the data on an existing UI application.
> What is the best way to give them the output of these queries? Should we
> write REST APIs that the front end can call to get the data? How can this
> be done?
> I'd like to know what other people have done to meet this requirement.
> Any pointers would be very helpful.
> Thanks.

--
Nitin Pawar
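A minimal sketch of the pattern Nitin describes (process in Hive, store the results in an RDBMS, serve them over a small API), using only the Python standard library. Here sqlite3 stands in for MySQL and http.server stands in for a real web framework; the table and column names are invented for illustration:

```python
import json
import sqlite3
from http.server import BaseHTTPRequestHandler

def fetch_report(conn: sqlite3.Connection) -> str:
    """Read the precomputed (Hive-exported) report table and render it
    as JSON for the UI to consume."""
    rows = conn.execute(
        "SELECT day, clicks FROM daily_clicks ORDER BY day").fetchall()
    return json.dumps([{"day": d, "clicks": c} for d, c in rows])

class ReportHandler(BaseHTTPRequestHandler):
    """GET /report returns the report as JSON; the front end calls this
    instead of querying Hive directly."""
    def do_GET(self):
        body = fetch_report(self.server.conn).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)
```

A real deployment would point fetch_report at the MySQL table that the nightly Hive job overwrites, attach the connection to the server object, and run `HTTPServer(("", 8080), ReportHandler)`.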
Re: map side join with group by
I think Chen wanted to know why this is a two-phase query, if I understood correctly. When you run a map-side join, it just performs the join; after that, to execute the group-by part, it launches a second job. I may be wrong, but this is how I saw it whenever I executed group-by queries.

On Thu, Dec 13, 2012 at 7:11 AM, Mark Grover wrote:
> Hi Chen,
> I think we would need some more information.
>
> The query refers to a table called "d" in the MAPJOIN hint, but there is
> no such table in the query. Moreover, map joins only make sense when the
> right table is the one being "mapped" (in other words, kept in memory) in
> the case of a Left Outer Join, and similarly when the left table is the
> one being "mapped" in the case of a Right Outer Join. Let me know if this
> is not clear; I'd be happy to offer a better explanation.
>
> In your query, there is a where clause on a column called "hour"; at this
> point I am unsure whether that's a column of table1 or table2. If it's a
> column of table1, that predicate would get pushed up (if you have the
> hive.optimize.ppd property set to true), so it could possibly be done in
> one MR job (I am not sure if that's presently the case; you will have to
> check the explain plan). If, however, the where clause is on a column of
> the right table (table2 in your example), it can't be pushed up, since a
> column of the right table can have different values before and after the
> LEFT OUTER JOIN. Therefore, the where clause would need to be applied in
> a separate MR job.
>
> This is just my understanding; the foolproof answer would lie in checking
> the explain plans and the Semantic Analyzer code.
>
> And for completeness, there is a conditional task (starting in Hive 0.7)
> that will convert your joins automatically to map joins where applicable.
> This can be enabled via the hive.auto.convert.join property.
>
> Mark
>
> On Wed, Dec 12, 2012 at 3:32 PM, Chen Song wrote:
> > I have a silly question about how Hive interprets a simple query with
> > both a map-side join and a group by.
> >
> > The query below translates into two jobs: the first a map-only job
> > doing the join and storing its output in an intermediate location, and
> > the second a map-reduce job taking the first job's output as input and
> > doing the group by.
> >
> > SELECT
> > /*+ MAPJOIN(d) */
> > table.a, sum(table2.b)
> > from table
> > LEFT OUTER JOIN table2
> > ON table.id = table2.id
> > where hour = '2012-12-11 11'
> > group by table.a
> >
> > Why can't this be done within a single map-reduce job? From what I can
> > see in the query plan, all the 2nd job's mappers do is take the 1st
> > job's mapper output.
> >
> > --
> > Chen Song

--
Nitin Pawar
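As a toy illustration (not Hive's actual code) of why the plan has two phases: phase 1 is a map-only broadcast hash join with the small table held in memory, and phase 2 needs a shuffle on the grouping key, which is why a second job is launched. The join is sketched with inner semantics for brevity, and all names and data are invented:

```python
from collections import defaultdict

# Phase 1 (map-only): broadcast hash join. The small table sits in memory
# and every row of the large table probes it; no shuffle is needed.
small = {1: "a1", 2: "a2"}                  # table.id -> table.a
large = [(1, 10), (1, 5), (2, 7), (3, 99)]  # (table2.id, table2.b)

joined = [(small[i], b) for i, b in large if i in small]

# Phase 2 (map-reduce): group by table.a, summing table2.b. This requires
# bringing together all rows with the same key, i.e. a shuffle, hence the
# separate second job in the plan.
totals = defaultdict(int)
for a, b in joined:
    totals[a] += b
```

In Hive, the intermediate `joined` data is what the first job writes out and the second job reads back in.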
Re: Array index support non-constant expression
Different error messages, but they seem to stem from the same problem. Could you try that with a later version of Hive? I think these kinds of bugs have been fixed.

2012/12/13 java8964 java8964 :
> ExprNodeGenericFuncEvaluator
Re: Hive Thrift upgrade to 0.9.0
On Tue, Dec 11, 2012 at 12:07 PM, Shangzhong zhu wrote:
> We are using Hive 0.9.0, and we have seen frequent Thrift Metastore
> timeout issues, probably due to the Thrift memory leak reported in
> THRIFT-1468.
>
> The current solution is to upgrade Thrift to 0.9.0.
>
> I am trying to use the patch (HIVE-2715), but it seems the patch only
> works for Hive trunk (0.10.0). I saw a lot of missing files when I
> applied the patch to 0.9.0.

You probably saw newly generated thrift files for features that were added to trunk after 0.9.0.

> Do we have a patch available for Hive 0.9.0? Or what is the recommended
> approach to upgrade to Thrift 0.9.0?

Currently, we don't have a patch for 0.9.0. The best way I can think of is to regenerate the thrift files for 0.9.0 using the thrift 0.9 compiler.

> Thanks,
> Shanzhong

Thanks.
Shreepadma
RE: Array index support non-constant expression
Hi, Navis:

If I disable both CP/PPD, it is worse, as neither query 1) nor 2) works. But the interesting thing is that for both queries I get the same error message, though a different one compared with my original error message:

2012-12-12 20:36:21,362 WARN org.apache.hadoop.mapred.Child: Error running child
java.lang.RuntimeException: Error in configuring object
    at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:387)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:325)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1157)
    at org.apache.hadoop.mapred.Child.main(Child.java:264)
Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)
    ... 9 more
Caused by: java.lang.RuntimeException: Error in configuring object
    at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
    at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:34)
    ... 14 more
Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)
    ... 17 more
Caused by: java.lang.RuntimeException: Map operator initialization failed
    at org.apache.hadoop.hive.ql.exec.ExecMapper.configure(ExecMapper.java:121)
    ... 22 more
Caused by: java.lang.ClassCastException: org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableLongObjectInspector cannot be cast to org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector
    at org.apache.hadoop.hive.ql.exec.ExprNodeFieldEvaluator.initialize(ExprNodeFieldEvaluator.java:60)
    at org.apache.hadoop.hive.ql.exec.ExprNodeGenericFuncEvaluator.initialize(ExprNodeGenericFuncEvaluator.java:77)
    at org.apache.hadoop.hive.ql.exec.Operator.initEvaluators(Operator.java:878)
    at org.apache.hadoop.hive.ql.exec.Operator.initEvaluatorsAndReturnStruct(Operator.java:904)
    at org.apache.hadoop.hive.ql.exec.SelectOperator.initializeOp(SelectOperator.java:60)
    at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:357)
    at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:433)
    at org.apache.hadoop.hive.ql.exec.Operator.initializeChildren(Operator.java:389)
    at org.apache.hadoop.hive.ql.exec.Operator.initializeOp(Operator.java:374)
    at org.apache.hadoop.hive.ql.exec.LateralViewJoinOperator.initializeOp(LateralViewJoinOperator.java:109)
    at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:357)
    at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:433)
    at org.apache.hadoop.hive.ql.exec.Operator.initializeChildren(Operator.java:389)
    at org.apache.hadoop.hive.ql.exec.Operator.initializeOp(Operator.java:374)
    at org.apache.hadoop.hive.ql.exec.UDTFOperator.initializeOp(UDTFOperator.java:85)
    at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:357)
    at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:433)
    at org.apache.hadoop.hive.ql.exec.Operator.initializeChildren(Operator.java:389)
    at org.apache.hadoop.hive.ql.exec.SelectOperator.initializeOp(SelectOperator.java:62)
    at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:357)
    at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:433)
    at org.apache.hadoop.hive.ql.exec.Operator.initializeChildren(Operator.java:389)
    at org.apache.hadoop.hive.ql.exec.Operator.initializeOp(Operator.java:374)
    at org.apache.hadoop.hive.ql.exec.Operator
Re: map side join with group by
Hi Chen,
I think we would need some more information.

The query refers to a table called "d" in the MAPJOIN hint, but there is no such table in the query. Moreover, map joins only make sense when the right table is the one being "mapped" (in other words, kept in memory) in the case of a Left Outer Join, and similarly when the left table is the one being "mapped" in the case of a Right Outer Join. Let me know if this is not clear; I'd be happy to offer a better explanation.

In your query, there is a where clause on a column called "hour"; at this point I am unsure whether that's a column of table1 or table2. If it's a column of table1, that predicate would get pushed up (if you have the hive.optimize.ppd property set to true), so it could possibly be done in one MR job (I am not sure if that's presently the case; you will have to check the explain plan). If, however, the where clause is on a column of the right table (table2 in your example), it can't be pushed up, since a column of the right table can have different values before and after the LEFT OUTER JOIN. Therefore, the where clause would need to be applied in a separate MR job.

This is just my understanding; the foolproof answer would lie in checking the explain plans and the Semantic Analyzer code.

And for completeness, there is a conditional task (starting in Hive 0.7) that will convert your joins automatically to map joins where applicable. This can be enabled via the hive.auto.convert.join property.

Mark

On Wed, Dec 12, 2012 at 3:32 PM, Chen Song wrote:
> I have a silly question about how Hive interprets a simple query with
> both a map-side join and a group by.
>
> The query below translates into two jobs: the first a map-only job doing
> the join and storing its output in an intermediate location, and the
> second a map-reduce job taking the first job's output as input and doing
> the group by.
>
> SELECT
> /*+ MAPJOIN(d) */
> table.a, sum(table2.b)
> from table
> LEFT OUTER JOIN table2
> ON table.id = table2.id
> where hour = '2012-12-11 11'
> group by table.a
>
> Why can't this be done within a single map-reduce job? From what I can
> see in the query plan, all the 2nd job's mappers do is take the 1st
> job's mapper output.
>
> --
> Chen Song
Re: Array index support non-constant expression
Could you try it with CP/PPD disabled?

set hive.optimize.cp=false;
set hive.optimize.ppd=false;

2012/12/13 java8964 java8964 :
> Hi,
>
> I played with my query further, and found it very puzzling to explain
> the following behaviors:
>
> 1) The following query works:
>
> select c_poi.provider_str, c_poi.name from (select darray(search_results,
> c.rank) as c_poi from nulf_search lateral view explode(search_clicks)
> clickTable as c) a
>
> I get all the results from the above query without any problem.
>
> 2) The following query does NOT work:
>
> select c_poi.provider_str, c_poi.name from (select darray(search_results,
> c.rank) as c_poi from nulf_search lateral view explode(search_clicks)
> clickTable as c) a where c_poi.provider_str = 'POI'
>
> As soon as I add the where criteria on provider_str, or even add another
> level of sub query like the following:
>
> select
>     ps, name
> from
>     (select c_poi.provider_str as ps, c_poi.name as name from (select
>     darray(search_results, c.rank) as c_poi from nulf_search lateral view
>     explode(search_clicks) clickTable as c) a ) b
> where ps = 'POI'
>
> any kind of criteria I tried to add on provider_str made the hive MR jobs
> fail with the same error shown below.
>
> Any idea why this happened? Is it related to the data? But provider_str
> is just a simple String type.
>
> Thanks
>
> Yong
>
> From: java8...@hotmail.com
> To: user@hive.apache.org
> Subject: RE: Array index support non-constant expression
> Date: Wed, 12 Dec 2012 12:15:27 -0500
>
> OK.
>
> I followed the hive source code of
> org.apache.hadoop.hive.ql.udf.generic.GenericUDFArrayContains and wrote
> the UDF. It is quite simple.
>
> It works fine as I expected for the simple case, but when I try to run it
> under a more complex query, the hive MR jobs fail with some strange
> errors. What I mean is that it fails in the HIVE code base; from the
> stack trace, I cannot see that this failure has anything to do with my
> custom code.
>
> I would like some help if someone can tell me what went wrong.
>
> For example, I created this UDF called darray, which stands for "dynamic
> array" and supports a non-constant value as the index into the array.
>
> The following query works fine as I expected:
>
> hive> select c_poi.provider_str as provider_str, c_poi.name as name from
> (select darray(search_results, c.index_loc) as c_poi from search_table
> lateral view explode(search_clicks) clickTable as c) a limit 5;
> POI
> ADDRESS some address
> POI
> POI
> ADDRESSS some address
>
> Of course, in this case, I only want provider_str = 'POI' returned, and
> to filter out any rows with provider_str != 'POI', so it sounds simple.
> I changed the query to the following:
>
> hive> select c_poi.provider_str as provider_str, c_poi.name as name from
> (select darray(search_results, c.rank) as c_poi from search_table lateral
> view explode(search_clicks) clickTable as c) a where c_poi.provider_str =
> 'POI' limit 5;
> Total MapReduce jobs = 1
> Launching Job 1 out of 1
> Number of reduce tasks is set to 0 since there's no reduce operator
> Cannot run job locally: Input Size (= 178314025) is larger than
> hive.exec.mode.local.auto.inputbytes.max (= 134217728)
> Starting Job = job_201212031001_0100, Tracking URL =
> http://blevine-desktop:50030/jobdetails.jsp?jobid=job_201212031001_0100
> Kill Command = /home/yzhang/hadoop/bin/hadoop job
> -Dmapred.job.tracker=blevine-desktop:8021 -kill job_201212031001_0100
> 2012-12-12 11:45:24,090 Stage-1 map = 0%, reduce = 0%
> 2012-12-12 11:45:43,173 Stage-1 map = 100%, reduce = 100%
> Ended Job = job_201212031001_0100 with errors
> FAILED: Execution Error, return code 2 from
> org.apache.hadoop.hive.ql.exec.MapRedTask
>
> I am only adding a WHERE restriction, but to my surprise, the MR jobs
> generated by HIVE failed. I am testing this on my local standalone
> cluster, which is running the CDH3U3 release.
> When I check the hadoop userlog, here is what I got:
>
> 2012-12-12 11:40:22,421 INFO org.apache.hadoop.hive.ql.exec.SelectOperator:
> SELECT
> struct<_col0:bigint,_col1:string,_col2:string,_col3:string,_col4:string,_col5:string,_col6:boolean,_col7:boolean,_col8:boolean,_col9:boolean,_col10:boolean,_col11:boolean,_col12:string,_col13:string,_col14:struct,categories_id:array,categories_name:array,lang_raw:string,lang_rose:string,lang:string,viewport:struct>,_col15:struct>>,_col16:array>,_col17:array>,_col18:string,_col19:struct>
> 2012-12-12 11:40:22,440 WARN org.apache.hadoop.mapred.Child: Error running
> child
> java.lang.RuntimeException: Error in configuring object
>     at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
>     at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
>     at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
>     at org.apache.ha
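For readers following the thread, the semantics of the custom darray UDF (indexing an array with a per-row, non-constant expression) can be modeled in a few lines of Python. This is a toy model with invented sample data, not the actual GenericUDF implementation:

```python
def darray(arr, idx):
    """Toy model of the darray UDF: index an array with a per-row value,
    returning None (SQL NULL) when either input is NULL or the index is
    out of range."""
    if arr is None or idx is None or not (0 <= idx < len(arr)):
        return None
    return arr[idx]

# One row of search_results plus the clicked rank, mirroring the query
# darray(search_results, c.rank) from the thread:
search_results = [{"provider_str": "POI", "name": "x"},
                  {"provider_str": "ADDRESS", "name": "y"}]
clicked = darray(search_results, 1)
```

The reported failure is not in this logic but in how the surrounding operators' object inspectors are initialized once a WHERE clause is added, as the ClassCastException above shows.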
map side join with group by
I have a silly question about how Hive interprets a simple query with both a map-side join and a group by.

The query below translates into two jobs: the first a map-only job doing the join and storing its output in an intermediate location, and the second a map-reduce job taking the first job's output as input and doing the group by.

SELECT
/*+ MAPJOIN(d) */
table.a, sum(table2.b)
from table
LEFT OUTER JOIN table2
ON table.id = table2.id
where hour = '2012-12-11 11'
group by table.a

Why can't this be done within a single map-reduce job? From what I can see in the query plan, all the 2nd job's mappers do is take the 1st job's mapper output.

--
Chen Song
Re: Map side join
Hi Bejoy,

Yes, I ran the pi example. It was fine.

Regarding the HIVE job, what I found is that it took 4 hrs for the first map job to complete. Those map tasks were doing their job and only reported status after completion. It is indeed taking too long to finish. I could find nothing relevant in the logs.

Thanks and regards,
Souvik.

On Wed, Dec 12, 2012 at 8:04 AM, wrote:
> Hi Souvik
>
> Apart from hive jobs, are normal mapreduce jobs like the wordcount
> running fine on your cluster?
>
> If so, for the hive jobs are you seeing anything suspicious in the task,
> TaskTracker, or JobTracker logs?
>
> Regards
> Bejoy KS
>
> Sent from remote device, Please excuse typos
>
> From: Souvik Banerjee
> Date: Tue, 11 Dec 2012 17:12:20 -0600
> Subject: Re: Map side join
>
> Hello Everybody,
>
> I need help on a HIVE join. As we were talking about the map-side join,
> I tried that. I set the flag set hive.auto.convert.join=true;
>
> I saw Hive convert the join to a map join while launching the job. But
> the problem is that none of the map tasks progress in my case. I made
> the dataset smaller; now it's only 512 MB cross 25 MB. I was expecting
> it to be done very quickly.
> No luck with any change of settings.
> Failing to progress with the defaults, I changed these settings:
> set hive.mapred.local.mem=1024; // Initially it was 216, I guess
> set hive.join.cache.size=10; // Initially it was 25000
>
> Also on the Hadoop side I made this change:
> mapred.child.java.opts -Xmx1073741824
>
> But I don't see any progress. After more than 40 minutes of running I am
> at 0% map completion.
> Can you please throw some light on this?
>
> Thanks a lot once again.
>
> Regards,
> Souvik.
>
> On Fri, Dec 7, 2012 at 2:32 PM, Souvik Banerjee wrote:
>> Hi Bejoy,
>>
>> That's wonderful. Thanks for your reply.
>> What I was wondering is whether HIVE can do a map-side join with more
>> than one condition in the JOIN clause.
>> I'll simply try it out and post the result.
>>
>> Thanks once again.
>>
>> Regards,
>> Souvik.
>>
>> On Fri, Dec 7, 2012 at 2:10 PM, wrote:
>>> Hi Souvik
>>>
>>> In earlier versions of hive you had to give the map join hint, but in
>>> later versions just set hive.auto.convert.join = true;
>>> Hive automatically selects the smaller table. It is better to give the
>>> smaller table as the first one in the join.
>>>
>>> You can use a map join if you are joining a small table with a large
>>> one, in terms of data size. By small, it is better to have the smaller
>>> table's size in the range of MBs.
>>>
>>> Regards
>>> Bejoy KS
>>>
>>> Sent from remote device, Please excuse typos
>>>
>>> From: Souvik Banerjee
>>> Date: Fri, 7 Dec 2012 13:58:25 -0600
>>> Subject: Map side join
>>>
>>> Hello everybody,
>>>
>>> I have a question. I didn't come across any post which says something
>>> about this.
>>> I have two tables, say A and B. I want to join A & B in HIVE. I am
>>> currently using HIVE version 0.9.
>>> The join would be on a few columns, like on (A.id1 = B.id1) AND
>>> (A.id2 = B.id2) AND (A.id3 = B.id3).
>>>
>>> Can I ask HIVE to use a map-side join in this scenario? Should I give
>>> a hint to HIVE by saying /*+mapjoin(B)*/?
>>>
>>> Get back to me if you want any more information in this regard.
>>>
>>> Thanks and regards,
>>> Souvik.
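Souvik's question, whether a map-side join can use several equality conditions, comes down to keying the in-memory hash table on a tuple of the join columns. A toy sketch with invented column names and data:

```python
# Map-side (broadcast) join on a composite key (id1, id2, id3): the small
# table B is loaded into a hash map keyed by the tuple of join columns,
# then each row of the large table A probes that map. This is the same
# mechanism as a single-column map join, just with a compound key.
B = {(1, "x", 10): "b-row-1", (2, "y", 20): "b-row-2"}
A = [(1, "x", 10, "a-row-1"), (3, "z", 30, "a-row-2")]

joined = [(a_rest, B[(i1, i2, i3)])
          for i1, i2, i3, a_rest in A
          if (i1, i2, i3) in B]
```

Since the lookup key is just a tuple, nothing in the approach limits it to one column, which matches Bejoy's answer that the hint (or hive.auto.convert.join) applies unchanged.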
REST API for Hive queries?
Hi,

We are using Hive as our data warehouse to run various queries on large amounts of data. There are some users who would like to get access to the output of these queries and display the data on an existing UI application. What is the best way to give them the output of these queries? Should we write REST APIs that the front end can call to get the data? How can this be done?

I'd like to know what other people have done to meet this requirement. Any pointers would be very helpful.

Thanks.
RE: Array index support non-constant expression
Hi,

I played with my query further, and found it very puzzling to explain the following behaviors:

1) The following query works:

select c_poi.provider_str, c_poi.name from (select darray(search_results, c.rank) as c_poi from nulf_search lateral view explode(search_clicks) clickTable as c) a

I get all the results from the above query without any problem.

2) The following query does NOT work:

select c_poi.provider_str, c_poi.name from (select darray(search_results, c.rank) as c_poi from nulf_search lateral view explode(search_clicks) clickTable as c) a where c_poi.provider_str = 'POI'

As soon as I add the where criteria on provider_str, or even add another level of sub query like the following:

select
    ps, name
from
    (select c_poi.provider_str as ps, c_poi.name as name from (select darray(search_results, c.rank) as c_poi from nulf_search lateral view explode(search_clicks) clickTable as c) a ) b
where ps = 'POI'

any kind of criteria I tried to add on provider_str made the hive MR jobs fail with the same error shown below.

Any idea why this happened? Is it related to the data? But provider_str is just a simple String type.

Thanks

Yong

From: java8...@hotmail.com
To: user@hive.apache.org
Subject: RE: Array index support non-constant expression
Date: Wed, 12 Dec 2012 12:15:27 -0500

OK.

I followed the hive source code of org.apache.hadoop.hive.ql.udf.generic.GenericUDFArrayContains and wrote the UDF. It is quite simple.

It works fine as I expected for the simple case, but when I try to run it under a more complex query, the hive MR jobs fail with some strange errors. What I mean is that it fails in the HIVE code base; from the stack trace, I cannot see that this failure has anything to do with my custom code.

I would like some help if someone can tell me what went wrong.

For example, I created this UDF called darray, which stands for "dynamic array" and supports a non-constant value as the index into the array.

The following query works fine as I expected:

hive> select c_poi.provider_str as provider_str, c_poi.name as name from (select darray(search_results, c.index_loc) as c_poi from search_table lateral view explode(search_clicks) clickTable as c) a limit 5;
POI
ADDRESS some address
POI
POI
ADDRESSS some address

Of course, in this case, I only want provider_str = 'POI' returned, and to filter out any rows with provider_str != 'POI', so it sounds simple. I changed the query to the following:

hive> select c_poi.provider_str as provider_str, c_poi.name as name from (select darray(search_results, c.rank) as c_poi from search_table lateral view explode(search_clicks) clickTable as c) a where c_poi.provider_str = 'POI' limit 5;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Cannot run job locally: Input Size (= 178314025) is larger than hive.exec.mode.local.auto.inputbytes.max (= 134217728)
Starting Job = job_201212031001_0100, Tracking URL = http://blevine-desktop:50030/jobdetails.jsp?jobid=job_201212031001_0100
Kill Command = /home/yzhang/hadoop/bin/hadoop job -Dmapred.job.tracker=blevine-desktop:8021 -kill job_201212031001_0100
2012-12-12 11:45:24,090 Stage-1 map = 0%, reduce = 0%
2012-12-12 11:45:43,173 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201212031001_0100 with errors
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask

I am only adding a WHERE restriction, but to my surprise, the MR jobs generated by HIVE failed. I am testing this on my local standalone cluster, which is running the CDH3U3 release.
When I check the hadoop userlog, here is what I got:

2012-12-12 11:40:22,421 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: SELECT struct<_col0:bigint,_col1:string,_col2:string,_col3:string,_col4:string,_col5:string,_col6:boolean,_col7:boolean,_col8:boolean,_col9:boolean,_col10:boolean,_col11:boolean,_col12:string,_col13:string,_col14:struct,categories_id:array,categories_name:array,lang_raw:string,lang_rose:string,lang:string,viewport:struct>,_col15:struct>>,_col16:array>,_col17:array>,_col18:string,_col19:struct>
2012-12-12 11:40:22,440 WARN org.apache.hadoop.mapred.Child: Error running child
java.lang.RuntimeException: Error in configuring object
    at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:387)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:325)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroup
RE: Array index support non-constant expression
OK. I followed the Hive source code of org.apache.hadoop.hive.ql.udf.generic.GenericUDFArrayContains and wrote the UDF. It is quite simple, and it works fine for the simple case. But when I run it inside a more complex query, the Hive MR jobs fail with some strange errors. What I mean is that it fails inside the Hive code base; from the stack trace I cannot see that the failure has anything to do with my custom code. I would appreciate it if someone can tell me what went wrong.

For example, I created a UDF called darray (for "dynamic array"), which supports a non-constant expression as the index into an array. The following query works as I expected:

hive> select c_poi.provider_str as provider_str, c_poi.name as name
      from (select darray(search_results, c.index_loc) as c_poi
            from search_table lateral view explode(search_clicks) clickTable as c) a
      limit 5;
POI
ADDRESS	some address
POI
POI
ADDRESSS	some address

Of course, in this case I only want rows with provider_str = 'POI' returned, and want to filter out any rows with provider_str != 'POI'. So, simple enough, I changed the query to the following:

hive> select c_poi.provider_str as provider_str, c_poi.name as name
      from (select darray(search_results, c.rank) as c_poi
            from search_table lateral view explode(search_clicks) clickTable as c) a
      where c_poi.provider_str = 'POI'
      limit 5;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Cannot run job locally: Input Size (= 178314025) is larger than hive.exec.mode.local.auto.inputbytes.max (= 134217728)
Starting Job = job_201212031001_0100, Tracking URL = http://blevine-desktop:50030/jobdetails.jsp?jobid=job_201212031001_0100
Kill Command = /home/yzhang/hadoop/bin/hadoop job -Dmapred.job.tracker=blevine-desktop:8021 -kill job_201212031001_0100
2012-12-12 11:45:24,090 Stage-1 map = 0%, reduce = 0%
2012-12-12 11:45:43,173 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201212031001_0100 with errors
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask

I only added a WHERE condition, but to my surprise the MR jobs generated by Hive failed. I am testing this on my local standalone cluster, which runs the CDH3U3 release. When I check the Hadoop userlogs, here is what I got:

2012-12-12 11:40:22,421 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: SELECT struct<_col0:bigint,_col1:string,_col2:string,_col3:string,_col4:string,_col5:string,_col6:boolean,_col7:boolean,_col8:boolean,_col9:boolean,_col10:boolean,_col11:boolean,_col12:string,_col13:string,_col14:struct,categories_id:array,categories_name:array,lang_raw:string,lang_rose:string,lang:string,viewport:struct>,_col15:struct>>,_col16:array>,_col17:array>,_col18:string,_col19:struct>
2012-12-12 11:40:22,440 WARN org.apache.hadoop.mapred.Child: Error running child
java.lang.RuntimeException: Error in configuring object
	at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
	at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
	at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
	at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:387)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:325)
	at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:396)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1157)
	at org.apache.hadoop.mapred.Child.main(Child.java:264)
Caused by: java.lang.reflect.InvocationTargetException
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)
	... 9 more
Caused by: java.lang.RuntimeException: Error in configuring object
	at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
	at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
	at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
	at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:34)
	... 14 more
Caused by: java.lang.reflect.InvocationTargetException
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAcces
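[Editorial note in-thread: the darray source is not posted, so here is a plain-Java sketch (class and method names hypothetical) of the core behavior such a "dynamic array index" UDF needs. A real implementation would extend org.apache.hadoop.hive.ql.udf.generic.GenericUDF and work through ObjectInspectors; this shows only the indexing logic, in particular that a well-behaved UDF returns NULL rather than throwing for a null array or out-of-range index, which matters once rows flow through a lateral view and a WHERE filter.]

```java
import java.util.Arrays;
import java.util.List;

public class DArraySketch {
    // Core of a hypothetical darray(array, index) UDF:
    // null array, null index, or out-of-range index yields null (SQL NULL),
    // never an exception that would kill the map task.
    static <T> T elementAt(List<T> arr, Integer idx) {
        if (arr == null || idx == null || idx < 0 || idx >= arr.size()) {
            return null;
        }
        return arr.get(idx);
    }

    public static void main(String[] args) {
        List<String> searchResults = Arrays.asList("POI", "ADDRESS", "POI");
        System.out.println(elementAt(searchResults, 1)); // prints "ADDRESS"
        System.out.println(elementAt(searchResults, 7)); // out of range -> prints "null"
        System.out.println(elementAt(null, 0));          // null array -> prints "null"
    }
}
```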
Re: Modify the number of map tasks
Do you have a page where you explain the steps?

2012/12/12 Mohammad Tariq
> Hi Imen,
>
> I am sorry, I didn't get the question. Are you asking about creating a
> distributed cluster? Yeah, I have done that.
>
> Regards,
> Mohammad Tariq
>
> On Wed, Dec 12, 2012 at 7:45 PM, imen Megdiche wrote:
>> Could you please comment on the configuration of Hadoop on a cluster?
>>
>> Thanks
>>
>> 2012/12/12 Mohammad Tariq
>>> You are always welcome. If you still need any help, you can go here:
>>> http://cloudfront.blogspot.in/2012/07/how-to-configure-hadoop.html
>>> I have outlined the entire process there along with a few small (but
>>> necessary) explanations.
>>>
>>> Regards,
>>> Mohammad Tariq
>>>
>>> On Wed, Dec 12, 2012 at 7:31 PM, imen Megdiche wrote:
>>>> Thank you very much, you're awesome. Fixed.
>>>>
>>>> 2012/12/12 Mohammad Tariq
>>>>> Uncomment the property in core-site.xml. That is a must. After doing
>>>>> this you have to restart the daemons.
>>>>>
>>>>> On Wed, Dec 12, 2012 at 7:08 PM, imen Megdiche wrote:
>>>>>> I changed the files. Now when I run I have this response:
>>>>>>
>>>>>> 12/12/12 14:37:33 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:9001. Already tried 0 time(s).
>>>>>> 12/12/12 14:37:34 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:9001. Already tried 1 time(s).
>>>>>> 12/12/12 14:37:35 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:9001. Already tried 2 time(s).
>>>>>> 12/12/12 14:37:36 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:9001. Already tried 3 time(s).
>>>>>> 12/12/12 14:37:37 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:9001. Already tried 4 time(s).
>>>>>> 12/12/12 14:37:38 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:9001. Already tried 5 time(s).
>>>>>> 12/12/12 14:37:39 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:9001. Already tried 6 time(s).
>>>>>> 12/12/12 14:37:40 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:9001. Already tried 7 time(s).
>>>>>> 12/12/12 14:37:41 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:9001. Already tried 8 time(s).
>>>>>> 12/12/12 14:37:42 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:9001. Already tried 9 time(s).
>>>>>> Exception in thread "main" java.net.ConnectException: Call to localhost/127.0.0.1:9001 failed on connection exception: java.net.ConnectException: Connexion refusée
>>>>>> 	at org.apache.hadoop.ipc.Client.wrapException(Client.java:1099)
>>>>>> 	at org.apache.hadoop.ipc.Client.call(Client.java:1075)
>>>>>> 	at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
>>>>>> 	at org.apache.hadoop.mapred.$Proxy1.getProtocolVersion(Unknown Source)
>>>>>> 	at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:396)
>>>>>> 	at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:379)
>>>>>> 	at org.apache.hadoop.mapred.JobClient.createRPCProxy(JobClient.java:480)
>>>>>> 	at org.apache.hadoop.mapred.JobClient.init(JobClient.java:474)
>>>>>> 	at org.apache.hadoop.mapred.JobClient.<init>(JobClient.java:457)
>>>>>> 	at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1260)
>>>>>> 	at org.myorg.WordCount.run(WordCount.java:115)
>>>>>> 	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>>>> 	at org.myorg.WordCount.main(WordCount.java:120)
>>>>>> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>>> 	at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
>>>>>> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
>>>>>> 	at java.lang.reflect.Method.invoke(Unknown Source)
>>>>>> 	at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
>>>>>> Caused by: java.net.ConnectException: Connexion refusée
>>>>>> 	at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>>>>>> 	at sun.nio.ch.SocketChannelImpl.finishConnect(Unknown Source)
>>>>>> 	at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
>>>>>> 	at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:489)
>>>>>> 	at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:434)
>>>>>> 	at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:560)
>>>>>> 	at org.apache.hadoop.ipc.Client$Connection.access$2000(Client.java:184)
>>>>>> 	at org.apache.hadoop.ipc.Client.getConnection(Client.java:1206)
>>>>>> 	at org.apache.hadoop.ipc.Client.call(Client.java:1050)
>>>>>> 	... 16 more
>>>>>>
>>>>>> 2012/12/12 Mohammad Tariq
>>>>>>> dfs.name.dir
Re: Modify the number of map tasks
Hi Imen,

I am sorry, I didn't get the question. Are you asking about creating a distributed cluster? Yeah, I have done that.

Regards,
Mohammad Tariq

On Wed, Dec 12, 2012 at 7:45 PM, imen Megdiche wrote:
> Could you please comment on the configuration of Hadoop on a cluster?
>
> Thanks
>
> 2012/12/12 Mohammad Tariq
>> You are always welcome. If you still need any help, you can go here:
>> http://cloudfront.blogspot.in/2012/07/how-to-configure-hadoop.html
>> I have outlined the entire process there along with a few small (but
>> necessary) explanations.
>>
>> On Wed, Dec 12, 2012 at 7:31 PM, imen Megdiche wrote:
>>> Thank you very much, you're awesome. Fixed.
>>>
>>> 2012/12/12 Mohammad Tariq
>>>> Uncomment the property in core-site.xml. That is a must. After doing
>>>> this you have to restart the daemons.
>>>>
>>>> On Wed, Dec 12, 2012 at 7:08 PM, imen Megdiche wrote:
>>>>> I changed the files. Now when I run I have this response:
>>>>>
>>>>> 12/12/12 14:37:33 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:9001. Already tried 0 time(s).
>>>>> ...
>>>>> 12/12/12 14:37:42 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:9001. Already tried 9 time(s).
>>>>> Exception in thread "main" java.net.ConnectException: Call to localhost/127.0.0.1:9001 failed on connection exception: java.net.ConnectException: Connexion refusée
>>>>> ...
>>>>>
>>>>> 2012/12/12 Mohammad Tariq
>>>>>> dfs.name.dir
Re: Modify the number of map tasks
Could you please comment on the configuration of Hadoop on a cluster?

Thanks

2012/12/12 Mohammad Tariq
> You are always welcome. If you still need any help, you can go here:
> http://cloudfront.blogspot.in/2012/07/how-to-configure-hadoop.html
> I have outlined the entire process there along with a few small (but
> necessary) explanations.
>
> Regards,
> Mohammad Tariq
>
> On Wed, Dec 12, 2012 at 7:31 PM, imen Megdiche wrote:
>> Thank you very much, you're awesome. Fixed.
>>
>> 2012/12/12 Mohammad Tariq
>>> Uncomment the property in core-site.xml. That is a must. After doing
>>> this you have to restart the daemons.
>>>
>>> On Wed, Dec 12, 2012 at 7:08 PM, imen Megdiche wrote:
>>>> I changed the files. Now when I run I have this response:
>>>>
>>>> 12/12/12 14:37:33 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:9001. Already tried 0 time(s).
>>>> ...
>>>> 12/12/12 14:37:42 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:9001. Already tried 9 time(s).
>>>> Exception in thread "main" java.net.ConnectException: Call to localhost/127.0.0.1:9001 failed on connection exception: java.net.ConnectException: Connexion refusée
>>>> ...
>>>>
>>>> 2012/12/12 Mohammad Tariq
>>>>> dfs.name.dir
Re: Map side join
Hi Souvik

Apart from Hive jobs, are normal MapReduce jobs like the wordcount running fine on your cluster? If those work, then for the Hive jobs, are you seeing anything suspicious in the task, TaskTracker, or JobTracker logs?

Regards
Bejoy KS

Sent from remote device, Please excuse typos

-----Original Message-----
From: Souvik Banerjee
Date: Tue, 11 Dec 2012 17:12:20
Reply-To: user@hive.apache.org
Subject: Re: Map side join

Hello Everybody,

I need help on a Hive join. As we were talking about the map-side join, I tried it. I set the flag:

set hive.auto.convert.join=true;

I saw Hive convert the join to a map join while launching the job. But the problem is that none of the map tasks make progress in my case. I made the dataset smaller; now it's only 512 MB cross 25 MB, so I was expecting it to be done very quickly. No luck with any change of settings. Failing to progress with the defaults, I changed these settings:

set hive.mapred.local.mem=1024;  // initially it was 216, I guess
set hive.join.cache.size=10;     // initially it was 25000

Also on the Hadoop side I made this change:

mapred.child.java.opts = -Xmx1073741824

But I don't see any progress. After more than 40 minutes of running I am at 0% map completion. Can you please throw some light on this?

Thanks a lot once again.

Regards,
Souvik.

On Fri, Dec 7, 2012 at 2:32 PM, Souvik Banerjee wrote:
> Hi Bejoy,
>
> That's wonderful. Thanks for your reply.
> What I was wondering is whether Hive can do a map-side join with more
> than one condition in the JOIN clause.
> I'll simply try it out and post the result.
>
> Thanks once again.
>
> Regards,
> Souvik.
>
> On Fri, Dec 7, 2012 at 2:10 PM, Bejoy KS wrote:
>> Hi Souvik
>>
>> In earlier versions of Hive you had to give the map-join hint, but in
>> later versions you can just set hive.auto.convert.join = true; and Hive
>> automatically selects the smaller table. It is better to give the
>> smaller table as the first one in the join.
>>
>> You can use a map join if you are joining a small table with a large
>> one, in terms of data size. By small, it is better for the smaller
>> table's size to be in the range of MBs.
>>
>> Regards
>> Bejoy KS
>>
>> Sent from remote device, Please excuse typos
>>
>> From: Souvik Banerjee
>> Date: Fri, 7 Dec 2012 13:58:25 -0600
>> Reply-To: user@hive.apache.org
>> Subject: Map side join
>>
>> Hello everybody,
>>
>> I have a question; I didn't come across any post that covers this.
>> I have two tables, say A and B, and I want to join A & B in Hive. I am
>> currently using Hive 0.9.
>> The join would be on a few columns, like on (A.id1 = B.id1) AND
>> (A.id2 = B.id2) AND (A.id3 = B.id3).
>>
>> Can I ask Hive to use a map-side join in this scenario? Should I give
>> a hint to Hive by saying /*+ MAPJOIN(B) */?
>>
>> Get back to me if you want any more information in this regard.
>>
>> Thanks and regards,
>> Souvik.
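[Editorial note in-thread: putting the settings and hint from this exchange together, a minimal sketch looks like the following. Table and column names are the ones from Souvik's example; whether the auto-conversion actually fires depends on the small table fitting under the Hive 0.9-era size thresholds.]

```sql
-- Option 1: let Hive auto-convert the join to a map join
-- (the smaller table is cached in memory on each mapper):
set hive.auto.convert.join=true;

SELECT A.id1, A.id2, A.id3
FROM A JOIN B
  ON (A.id1 = B.id1) AND (A.id2 = B.id2) AND (A.id3 = B.id3);

-- Option 2: give the hint explicitly; multiple equality conditions
-- in the ON clause are fine for a map join:
SELECT /*+ MAPJOIN(B) */ A.id1, A.id2, A.id3
FROM A JOIN B
  ON (A.id1 = B.id1) AND (A.id2 = B.id2) AND (A.id3 = B.id3);
```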
Re: Modify the number of map tasks
Thank you very much, you're awesome. Fixed.

2012/12/12 Mohammad Tariq
> Uncomment the property in core-site.xml. That is a must. After doing this
> you have to restart the daemons.
>
> Regards,
> Mohammad Tariq
>
> On Wed, Dec 12, 2012 at 7:08 PM, imen Megdiche wrote:
>> I changed the files. Now when I run I have this response:
>>
>> 12/12/12 14:37:33 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:9001. Already tried 0 time(s).
>> ...
>> 12/12/12 14:37:42 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:9001. Already tried 9 time(s).
>> Exception in thread "main" java.net.ConnectException: Call to localhost/127.0.0.1:9001 failed on connection exception: java.net.ConnectException: Connexion refusée
>> ...
>>
>> 2012/12/12 Mohammad Tariq
>>> dfs.name.dir
Re: Modify the number of map tasks
I wonder how you are able to run the job without a JT. You must have this in your mapred-site.xml file:

  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>

Also add "hadoop.tmp.dir" in core-site.xml, and "dfs.name.dir" & "dfs.data.dir" in hdfs-site.xml.

Regards,
Mohammad Tariq

On Wed, Dec 12, 2012 at 6:46 PM, imen Megdiche wrote:
> For mapred-site.xml:
>
>   <property>
>     <name>mapred.map.tasks</name>
>     <value>6</value>
>   </property>
>
> For core-site.xml: an empty <configuration/>.
>
> In hdfs-site.xml: nothing.
>
> 2012/12/12 Mohammad Tariq
>> Can I have a look at your config files?
>>
>> Regards,
>> Mohammad Tariq
>>
>> On Wed, Dec 12, 2012 at 6:31 PM, imen Megdiche wrote:
>>> I run start-all.sh and all daemons start without problems, but the log
>>> of the TaskTracker looks like this:
>>>
>>> 2012-12-12 13:53:45,495 INFO org.apache.hadoop.mapred.TaskTracker: STARTUP_MSG:
>>> /************************************************************
>>> STARTUP_MSG: Starting TaskTracker
>>> STARTUP_MSG:   host = megdiche-OptiPlex-GX280/127.0.1.1
>>> STARTUP_MSG:   args = []
>>> STARTUP_MSG:   version = 1.0.4
>>> STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.0 -r 1393290; compiled by 'hortonfo' on Wed Oct 3 05:13:58 UTC 2012
>>> ************************************************************/
>>> 2012-12-12 13:53:47,009 INFO org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties from hadoop-metrics2.properties
>>> 2012-12-12 13:53:47,331 INFO org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source MetricsSystem,sub=Stats registered.
>>> 2012-12-12 13:53:47,336 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot period at 10 second(s).
>>> 2012-12-12 13:53:47,336 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: TaskTracker metrics system started
>>> 2012-12-12 13:53:48,165 INFO org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source ugi registered.
>>> 2012-12-12 13:53:48,192 WARN org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Source name ugi already exists!
>>> 2012-12-12 13:53:48,513 ERROR org.apache.hadoop.mapred.TaskTracker: Can not start task tracker because java.lang.IllegalArgumentException: Does not contain a valid host:port authority: local
>>> 	at org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:162)
>>> 	at org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:128)
>>> 	at org.apache.hadoop.mapred.JobTracker.getAddress(JobTracker.java:2560)
>>> 	at org.apache.hadoop.mapred.TaskTracker.<init>(TaskTracker.java:1426)
>>> 	at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:3742)
>>>
>>> 2012-12-12 13:53:48,519 INFO org.apache.hadoop.mapred.TaskTracker: SHUTDOWN_MSG:
>>> /************************************************************
>>> SHUTDOWN_MSG: Shutting down TaskTracker at megdiche-OptiPlex-GX280/127.0.1.1
>>> ************************************************************/
>>>
>>> 2012/12/12 Mohammad Tariq
>>>> I would check if all the daemons are running properly or not, before
>>>> anything else. If some problem is found, the next place to look is the
>>>> log of each daemon.
>>>>
>>>> The correct command to check the status of a job from the command line
>>>> is: hadoop job -status jobID
>>>> (Mind the 'space' after job, and remove 'command' from the statement.)
>>>>
>>>> HTH
>>>>
>>>> Regards,
>>>> Mohammad Tariq
>>>>
>>>> On Wed, Dec 12, 2012 at 6:14 PM, imen Megdiche wrote:
>>>>> My goal is to analyze the response time of MapReduce depending on the
>>>>> size of the input files. I need to change the number of map and/or
>>>>> reduce tasks and record the execution time. It turns out that nothing
>>>>> works locally on my PC: neither "hadoop job -status job_local_0001"
>>>>> (which returns no job found) nor localhost:50030. I would be very
>>>>> grateful if you could help me better understand these problems.
>>>>>
>>>>> 2012/12/12 Mohammad Tariq
>>>>>> Are you working locally? What exactly is the issue?
>>>>>>
>>>>>> On Wed, Dec 12, 2012 at 6:00 PM, imen Megdiche wrote:
>>>>>>> no
>>>>>>>
>>>>>>> 2012/12/12 Mohammad Tariq
>>>>>>>> Any luck with "localhost:50030"?
>>>>>>>>
>>>>>>>> On Wed, Dec 12, 2012 at 5:53 PM, imen Megdiche wrote:
>>>>>>>>> I run the job through the command line.
>>>>>>>>>
>>>>>>>>> 2012/12/12 Mohammad Tariq
>>>>>>>>>> You have to replace "JobTrackerHost" in "JobTrackerHost:50030"
>>>>>>>>>> with the actual name of the machine where the JobTracker is
>>>>>>>>>> running. For example, if you are working on a local cluster,
>>>>>>>>>> you have to use
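[Editorial note in-thread: the Hadoop 1.x-style configuration this exchange converges on might look like the sketch below. The port and property values are the ones used in the thread; the fs.default.name value and the hadoop.tmp.dir path are illustrative assumptions, not from the thread.]

```xml
<!-- mapred-site.xml: without mapred.job.tracker the TaskTracker dies with
     "Does not contain a valid host:port authority: local" -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
  <property>
    <!-- only a hint: the real map count follows the input splits -->
    <name>mapred.map.tasks</name>
    <value>6</value>
  </property>
</configuration>

<!-- core-site.xml (assumed values) -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/app/hadoop/tmp</value>  <!-- illustrative path -->
  </property>
</configuration>
```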
Re: Modify the number of map tasks
For mapred-site.xml:

  <property>
    <name>mapred.map.tasks</name>
    <value>6</value>
  </property>

For core-site.xml: an empty <configuration/>.

In hdfs-site.xml: nothing.

2012/12/12 Mohammad Tariq
> Can I have a look at your config files?
>
> Regards,
> Mohammad Tariq
>
> On Wed, Dec 12, 2012 at 6:31 PM, imen Megdiche wrote:
>> I run start-all.sh and all daemons start without problems, but the log
>> of the TaskTracker looks like this:
>>
>> 2012-12-12 13:53:45,495 INFO org.apache.hadoop.mapred.TaskTracker: STARTUP_MSG: Starting TaskTracker
>> ...
>> 2012-12-12 13:53:48,513 ERROR org.apache.hadoop.mapred.TaskTracker: Can not start task tracker because java.lang.IllegalArgumentException: Does not contain a valid host:port authority: local
>> ...
>>
>> 2012/12/12 Mohammad Tariq
>>> I would check if all the daemons are running properly or not, before
>>> anything else. The correct command to check the status of a job from
>>> the command line is: hadoop job -status jobID
>>> ...
>>>
>>> On Wed, Dec 12, 2012 at 5:42 PM, imen Megdiche wrote:
>>>> Excuse me, the data size is 98 MB.
>>>>
>>>> 2012/12/12 imen Megdiche
>>>>> The size of the data is 49 MB and the number of maps is 4.
>>>>> The web UI JobTrackerH
Re: Modify the number of map tasks
Can I have a look at your config files?

Regards,
Mohammad Tariq

On Wed, Dec 12, 2012 at 6:31 PM, imen Megdiche wrote:
> i ran start-all.sh and all daemons start without problems, but the
> TaskTracker log ends with:
>
> 2012-12-12 13:53:48,513 ERROR org.apache.hadoop.mapred.TaskTracker: Can
> not start task tracker because java.lang.IllegalArgumentException: Does
> not contain a valid host:port authority: local
>
> [full log and earlier quoted messages trimmed]
Re: Modify the number of map tasks
i ran start-all.sh and all daemons start without problems, but the log of
the TaskTracker looks like this:

2012-12-12 13:53:45,495 INFO org.apache.hadoop.mapred.TaskTracker: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting TaskTracker
STARTUP_MSG:   host = megdiche-OptiPlex-GX280/127.0.1.1
STARTUP_MSG:   args = []
STARTUP_MSG:   version = 1.0.4
STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.0 -r 1393290; compiled by 'hortonfo' on Wed Oct 3 05:13:58 UTC 2012
************************************************************/
2012-12-12 13:53:47,009 INFO org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties from hadoop-metrics2.properties
2012-12-12 13:53:47,331 INFO org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source MetricsSystem,sub=Stats registered.
2012-12-12 13:53:47,336 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot period at 10 second(s).
2012-12-12 13:53:47,336 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: TaskTracker metrics system started
2012-12-12 13:53:48,165 INFO org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source ugi registered.
2012-12-12 13:53:48,192 WARN org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Source name ugi already exists!
2012-12-12 13:53:48,513 ERROR org.apache.hadoop.mapred.TaskTracker: Can not start task tracker because java.lang.IllegalArgumentException: Does not contain a valid host:port authority: local
    at org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:162)
    at org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:128)
    at org.apache.hadoop.mapred.JobTracker.getAddress(JobTracker.java:2560)
    at org.apache.hadoop.mapred.TaskTracker.<init>(TaskTracker.java:1426)
    at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:3742)

2012-12-12 13:53:48,519 INFO org.apache.hadoop.mapred.TaskTracker: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down TaskTracker at megdiche-OptiPlex-GX280/127.0.1.1
************************************************************/

2012/12/12 Mohammad Tariq
> I would check if all the daemons are running properly or not, before
> anything else. If some problem is found, the next place to look is the
> log of each daemon.
> [earlier quoted messages trimmed]
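[Editor's note: the "Does not contain a valid host:port authority: local" error above typically means mapred.job.tracker is set to "local", which selects the in-process LocalJobRunner; a TaskTracker cannot parse that value as host:port. A minimal mapred-site.xml sketch, where the host and port are placeholders to adjust for the actual JobTracker:]

```xml
<!-- Sketch only: localhost:54311 is an assumed placeholder address. -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <!-- A TaskTracker needs a real host:port here; the value "local"
         selects the LocalJobRunner and cannot be parsed as host:port. -->
    <value>localhost:54311</value>
  </property>
</configuration>
```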
Re: Modify the number of map tasks
I would check if all the daemons are running properly or not, before
anything else. If some problem is found, the next place to look is the log
of each daemon.

The correct command to check the status of a job from the command line is:
hadoop job -status jobID
(Mind the space after "job", and remove "command" from the statement.)

HTH

Regards,
Mohammad Tariq

On Wed, Dec 12, 2012 at 6:14 PM, imen Megdiche wrote:
> My goal is to analyze the response time of MapReduce depending on the
> size of the input files.
> [earlier quoted messages trimmed]
Re: Modify the number of map tasks
My goal is to analyze the response time of MapReduce depending on the size
of the input files. I need to change the number of map and/or reduce tasks
and recover the execution time. So it turns out that nothing works locally
on my pc:
neither "hadoop job-status command job_local_0001" (which returns no job
found)
nor localhost:50030.
I would be very grateful if you can help me better understand these
problems.

2012/12/12 Mohammad Tariq
> Are you working locally? What exactly is the issue?
> [earlier quoted messages trimmed]
Re: Modify the number of map tasks
Are you working locally? What exactly is the issue?

Regards,
Mohammad Tariq

On Wed, Dec 12, 2012 at 6:00 PM, imen Megdiche wrote:
> no
> [earlier quoted messages trimmed]
Re: Modify the number of map tasks
no

2012/12/12 Mohammad Tariq
> Any luck with "localhost:50030"?
> [earlier quoted messages trimmed]
Re: Modify the number of map tasks
Any luck with "localhost:50030"?

Regards,
Mohammad Tariq

On Wed, Dec 12, 2012 at 5:53 PM, imen Megdiche wrote:
> i run the job through the command line
> [earlier quoted messages trimmed]
Re: Modify the number of map tasks
i run the job through the command line

2012/12/12 Mohammad Tariq
> You have to replace "JobTrackerHost" in "JobTrackerHost:50030" with the
> actual name of the machine where the JobTracker is running.
> [earlier quoted messages trimmed]
Re: Modify the number of map tasks
You have to replace "JobTrackerHost" in "JobTrackerHost:50030" with the
actual name of the machine where the JobTracker is running. For example, if
you are working on a local cluster, you have to use "localhost:50030".

Are you running your job through the command line or some IDE?

Regards,
Mohammad Tariq

On Wed, Dec 12, 2012 at 5:42 PM, imen Megdiche wrote:
> excuse me the data size is 98 MB
> [earlier quoted messages trimmed]
Re: Modify the number of map tasks
excuse me, the data size is 98 MB

2012/12/12 imen Megdiche
> the size of the data is 49 MB and the number of maps is 4
> [earlier quoted messages trimmed]
Re: Modify the number of map tasks
the size of the data is 49 MB and the number of maps is 4.
the web UI JobTrackerHost:50030 does not work. what should i do to make
this appear? i work on ubuntu.

2012/12/12 Mohammad Tariq
> Hi Imen,
> You can visit the MR web UI at "JobTrackerHost:50030" and see all the
> useful information like the no. of mappers, no. of reducers, time taken
> for the execution, etc.
> [earlier quoted messages trimmed]
Re: Modify the number of map tasks
Hi Imen,

You can visit the MR web UI at "JobTrackerHost:50030" and see all the
useful information like the no. of mappers, no. of reducers, time taken for
the execution, etc.

One quick question for you: what is the size of your data, and what is the
no. of maps which you are getting right now?

Regards,
Mohammad Tariq

On Wed, Dec 12, 2012 at 5:11 PM, imen Megdiche wrote:
> Thank you Mohammad, but the number of map tasks is still the same in the
> execution.
> [earlier quoted messages trimmed]
Re: Modify the number of map tasks
Thank you Mohammad, but the number of map tasks is still the same in the
execution. Do you know how to capture the time spent on execution?

2012/12/12 Mohammad Tariq
> Hi Imen,
> You can add the "mapred.map.tasks" property in your mapred-site.xml file.
> But it is just a hint for the InputFormat; the no. of maps is actually
> determined by the no. of InputSplits created by the InputFormat.
> [earlier quoted messages trimmed]
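[Editor's note: on capturing the time spent, one simple approach is to record wall-clock timestamps around the job submission. A sketch, assuming the hadoop command shown in the comment is a placeholder and is not actually run here:]

```shell
# Sketch: record elapsed wall-clock seconds around a job submission.
start=$(date +%s)
true  # placeholder for: hadoop jar hadoop-examples.jar wordcount in out
end=$(date +%s)
echo "elapsed: $((end - start))s"
```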
Re: Modify the number of map tasks
Hi Imen, You can add the "mapred.map.tasks" property in your mapred-site.xml file. But it is just a hint for the InputFormat. The actual no. of maps is determined by the no. of InputSplits created by the InputFormat. HTH Regards, Mohammad Tariq On Wed, Dec 12, 2012 at 4:11 PM, imen Megdiche wrote: > Hi, > > I try to force the number of map tasks for the mapreduce job with this code: > public static void main(String[] args) throws Exception { > > JobConf conf = new JobConf(WordCount.class); > conf.set("mapred.job.tracker", "local"); > conf.set("fs.default.name", "local"); > conf.setJobName("wordcount"); > > conf.setOutputKeyClass(Text.class); > conf.setOutputValueClass(IntWritable.class); > > conf.setNumMapTasks(6); > conf.setMapperClass(Map.class); > conf.setCombinerClass(Reduce.class); > conf.setReducerClass(Reduce.class); > ... > } > > But it doesn't work. > What can I do to modify the number of map and reduce tasks? > > Thank you >
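[Editor's note: for reference, the property Mohammad mentions would look roughly like this in mapred-site.xml. As he says, it is only a hint to the InputFormat; the value 6 is just Imen's example.]

```xml
<!-- mapred-site.xml: a hint only; the InputSplits still decide the real map count -->
<property>
  <name>mapred.map.tasks</name>
  <value>6</value>
</property>
```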
Re: Re: Number of mapreduce job and the time spent
the command hadoop job -status works fine, but the problem is that it cannot find the job: Could not find job job_local_0001 I don't understand why it does not find it. 2012/12/12 long > Sorry for my mistake. > If $HADOOP_HOME is set, run as follows; otherwise, just use the full path for your > 'hadoop' command instead: > $HADOOP_HOME/bin/hadoop job -status job_xxx > > > > > -- > Best Regards, > longmans > > At 2012-12-12 17:56:45,"imen Megdiche" wrote: > > I think that my job id is in this line: > > 12/12/12 10:43:00 INFO mapred.JobClient: Running job: job_local_0001 > > > But I get this response when I execute: > > hadoop job -status job_local_0001 > Warning: $HADOOP_HOME is deprecated. > > Could not find job job_local_0001 > > > > > > 2012/12/12 long > >> get your job id and use this command: >> $HADOOP_HOME/hadoop job -status job_xxx >> >> >> >> >> -- >> Best Regards, >> longmans >> >> At 2012-12-12 17:23:39,"imen Megdiche" wrote: >> >> Hi, >> >> I want to know, from the output of the execution of the wordcount >> mapreduce example on hadoop: the number of mapreduce jobs and the time >> spent on the execution. >> >> Here is an excerpt from the output. >> >> 12/12/12 10:20:09 INFO mapred.Task: Task 'attempt_local_0001_r_00_0' >> done. 
>> 12/12/12 10:20:10 INFO mapred.JobClient: map 100% reduce 100% >> 12/12/12 10:20:10 INFO mapred.JobClient: Job complete: job_local_0001 >> 12/12/12 10:20:10 INFO mapred.JobClient: Counters: 22 >> 12/12/12 10:20:10 INFO mapred.JobClient: File Input Format Counters >> 12/12/12 10:20:10 INFO mapred.JobClient: Bytes Read=145966941 >> 12/12/12 10:20:10 INFO mapred.JobClient: File Output Format Counters >> 12/12/12 10:20:10 INFO mapred.JobClient: Bytes Written=50704638 >> 12/12/12 10:20:10 INFO mapred.JobClient: >> org.myorg.WordCount$Map$Counters >> 12/12/12 10:20:10 INFO mapred.JobClient: INPUT_WORDS=4980060 >> 12/12/12 10:20:10 INFO mapred.JobClient: FileSystemCounters >> 12/12/12 10:20:10 INFO mapred.JobClient: FILE_BYTES_READ=1777104865 >> 12/12/12 10:20:10 INFO mapred.JobClient: FILE_BYTES_WRITTEN=1783494521 >> 12/12/12 10:20:10 INFO mapred.JobClient: Map-Reduce Framework >> 12/12/12 10:20:10 INFO mapred.JobClient: Map output materialized >> bytes=170854986 >> 12/12/12 10:20:10 INFO mapred.JobClient: Map input records=4980060 >> 12/12/12 10:20:10 INFO mapred.JobClient: Reduce shuffle bytes=0 >> 12/12/12 10:20:10 INFO mapred.JobClient: Spilled Records=14940180 >> 12/12/12 10:20:10 INFO mapred.JobClient: Map output bytes=160894830 >> 12/12/12 10:20:10 INFO mapred.JobClient: Total committed heap usage >> (bytes)=1185910784 >> 12/12/12 10:20:10 INFO mapred.JobClient: CPU time spent (ms)=0 >> 12/12/12 10:20:10 INFO mapred.JobClient: Map input bytes=145954650 >> 12/12/12 10:20:10 INFO mapred.JobClient: SPLIT_RAW_BYTES=614 >> 12/12/12 10:20:10 INFO mapred.JobClient: Combine input records=8426541 >> 12/12/12 10:20:10 INFO mapred.JobClient: Reduce input records=4980060 >> 12/12/12 10:20:10 INFO mapred.JobClient: Reduce input groups=1660020 >> 12/12/12 10:20:10 INFO mapred.JobClient: Combine output >> records=8426541 >> 12/12/12 10:20:10 INFO mapred.JobClient: Physical memory (bytes) >> snapshot=0 >> 12/12/12 10:20:10 INFO mapred.JobClient: Reduce output records=1660020 >> 
12/12/12 10:20:10 INFO mapred.JobClient: Virtual memory (bytes) >> snapshot=0 >> 12/12/12 10:20:10 INFO mapred.JobClient: Map output records=4980060 >> >> >> Thank you for your responses. >> >> >> >> >> > > >
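[Editor's note on the two open questions above: `hadoop job -status` cannot find job_local_0001 because a job_local_* run uses the local runner and never registers with the JobTracker. The elapsed time can still be pulled from the JobClient log itself, by diffing the first and last timestamps. A minimal sketch, assuming the `yy/MM/dd HH:mm:ss` prefix shown in the excerpts above; the class name is my own, not a Hadoop API.]

```java
import java.text.ParsePosition;
import java.text.SimpleDateFormat;
import java.util.Date;

public class JobLogTimer {
    // Parse the "12/12/12 10:20:09" timestamp prefix of a JobClient log line.
    static long epochMillis(String logLine) {
        SimpleDateFormat fmt = new SimpleDateFormat("yy/MM/dd HH:mm:ss");
        Date d = fmt.parse(logLine, new ParsePosition(0));
        if (d == null) throw new IllegalArgumentException("no timestamp: " + logLine);
        return d.getTime();
    }

    // Wall-clock seconds between the first and last lines of a run's log.
    static long elapsedSeconds(String firstLine, String lastLine) {
        return (epochMillis(lastLine) - epochMillis(firstLine)) / 1000;
    }

    public static void main(String[] args) {
        String first = "12/12/12 10:20:09 INFO mapred.Task: Task 'attempt_local_0001_r_00_0' done.";
        String last  = "12/12/12 10:20:10 INFO mapred.JobClient: Job complete: job_local_0001";
        System.out.println(elapsedSeconds(first, last) + "s"); // 1s for this excerpt
    }
}
```

(Simpler still, on a cluster the same start/finish times appear on the JobTracker web UI at port 50030, as Mohammad notes above.)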
Re: Re: Number of mapreduce job and the time spent
Sorry for my mistake. If $HADOOP_HOME is set, run as follows; otherwise, just use the full path for your 'hadoop' command instead: $HADOOP_HOME/bin/hadoop job -status job_xxx -- Best Regards, longmans At 2012-12-12 17:56:45,"imen Megdiche" wrote: I think that my job id is in this line: 12/12/12 10:43:00 INFO mapred.JobClient: Running job: job_local_0001 But I get this response when I execute: hadoop job -status job_local_0001 Warning: $HADOOP_HOME is deprecated. Could not find job job_local_0001 2012/12/12 long get your job id and use this command: $HADOOP_HOME/hadoop job -status job_xxx -- Best Regards, longmans At 2012-12-12 17:23:39,"imen Megdiche" wrote: Hi, I want to know, from the output of the execution of the wordcount mapreduce example on hadoop: the number of mapreduce jobs and the time spent on the execution. Here is an excerpt from the output. 12/12/12 10:20:09 INFO mapred.Task: Task 'attempt_local_0001_r_00_0' done. 12/12/12 10:20:10 INFO mapred.JobClient: map 100% reduce 100% 12/12/12 10:20:10 INFO mapred.JobClient: Job complete: job_local_0001 12/12/12 10:20:10 INFO mapred.JobClient: Counters: 22 12/12/12 10:20:10 INFO mapred.JobClient: File Input Format Counters 12/12/12 10:20:10 INFO mapred.JobClient: Bytes Read=145966941 12/12/12 10:20:10 INFO mapred.JobClient: File Output Format Counters 12/12/12 10:20:10 INFO mapred.JobClient: Bytes Written=50704638 12/12/12 10:20:10 INFO mapred.JobClient: org.myorg.WordCount$Map$Counters 12/12/12 10:20:10 INFO mapred.JobClient: INPUT_WORDS=4980060 12/12/12 10:20:10 INFO mapred.JobClient: FileSystemCounters 12/12/12 10:20:10 INFO mapred.JobClient: FILE_BYTES_READ=1777104865 12/12/12 10:20:10 INFO mapred.JobClient: FILE_BYTES_WRITTEN=1783494521 12/12/12 10:20:10 INFO mapred.JobClient: Map-Reduce Framework 12/12/12 10:20:10 INFO mapred.JobClient: Map output materialized bytes=170854986 12/12/12 10:20:10 INFO mapred.JobClient: Map input records=4980060 12/12/12 10:20:10 INFO mapred.JobClient: Reduce shuffle bytes=0 
12/12/12 10:20:10 INFO mapred.JobClient: Spilled Records=14940180 12/12/12 10:20:10 INFO mapred.JobClient: Map output bytes=160894830 12/12/12 10:20:10 INFO mapred.JobClient: Total committed heap usage (bytes)=1185910784 12/12/12 10:20:10 INFO mapred.JobClient: CPU time spent (ms)=0 12/12/12 10:20:10 INFO mapred.JobClient: Map input bytes=145954650 12/12/12 10:20:10 INFO mapred.JobClient: SPLIT_RAW_BYTES=614 12/12/12 10:20:10 INFO mapred.JobClient: Combine input records=8426541 12/12/12 10:20:10 INFO mapred.JobClient: Reduce input records=4980060 12/12/12 10:20:10 INFO mapred.JobClient: Reduce input groups=1660020 12/12/12 10:20:10 INFO mapred.JobClient: Combine output records=8426541 12/12/12 10:20:10 INFO mapred.JobClient: Physical memory (bytes) snapshot=0 12/12/12 10:20:10 INFO mapred.JobClient: Reduce output records=1660020 12/12/12 10:20:10 INFO mapred.JobClient: Virtual memory (bytes) snapshot=0 12/12/12 10:20:10 INFO mapred.JobClient: Map output records=4980060 Thank you for your responses.
Re: Number of mapreduce job and the time spent
I think that my job id is in this line: 12/12/12 10:43:00 INFO mapred.JobClient: Running job: job_local_0001 But I get this response when I execute: hadoop job -status job_local_0001 Warning: $HADOOP_HOME is deprecated. Could not find job job_local_0001 2012/12/12 long > get your job id and use this command: > $HADOOP_HOME/hadoop job -status job_xxx > > > > > -- > Best Regards, > longmans > > At 2012-12-12 17:23:39,"imen Megdiche" wrote: > > Hi, > > I want to know, from the output of the execution of the wordcount > mapreduce example on hadoop: the number of mapreduce jobs and the time > spent on the execution. > > Here is an excerpt from the output. > > 12/12/12 10:20:09 INFO mapred.Task: Task 'attempt_local_0001_r_00_0' > done. > 12/12/12 10:20:10 INFO mapred.JobClient: map 100% reduce 100% > 12/12/12 10:20:10 INFO mapred.JobClient: Job complete: job_local_0001 > 12/12/12 10:20:10 INFO mapred.JobClient: Counters: 22 > 12/12/12 10:20:10 INFO mapred.JobClient: File Input Format Counters > 12/12/12 10:20:10 INFO mapred.JobClient: Bytes Read=145966941 > 12/12/12 10:20:10 INFO mapred.JobClient: File Output Format Counters > 12/12/12 10:20:10 INFO mapred.JobClient: Bytes Written=50704638 > 12/12/12 10:20:10 INFO mapred.JobClient: org.myorg.WordCount$Map$Counters > 12/12/12 10:20:10 INFO mapred.JobClient: INPUT_WORDS=4980060 > 12/12/12 10:20:10 INFO mapred.JobClient: FileSystemCounters > 12/12/12 10:20:10 INFO mapred.JobClient: FILE_BYTES_READ=1777104865 > 12/12/12 10:20:10 INFO mapred.JobClient: FILE_BYTES_WRITTEN=1783494521 > 12/12/12 10:20:10 INFO mapred.JobClient: Map-Reduce Framework > 12/12/12 10:20:10 INFO mapred.JobClient: Map output materialized > bytes=170854986 > 12/12/12 10:20:10 INFO mapred.JobClient: Map input records=4980060 > 12/12/12 10:20:10 INFO mapred.JobClient: Reduce shuffle bytes=0 > 12/12/12 10:20:10 INFO mapred.JobClient: Spilled Records=14940180 > 12/12/12 10:20:10 INFO mapred.JobClient: Map output bytes=160894830 > 12/12/12 10:20:10 INFO 
mapred.JobClient: Total committed heap usage > (bytes)=1185910784 > 12/12/12 10:20:10 INFO mapred.JobClient: CPU time spent (ms)=0 > 12/12/12 10:20:10 INFO mapred.JobClient: Map input bytes=145954650 > 12/12/12 10:20:10 INFO mapred.JobClient: SPLIT_RAW_BYTES=614 > 12/12/12 10:20:10 INFO mapred.JobClient: Combine input records=8426541 > 12/12/12 10:20:10 INFO mapred.JobClient: Reduce input records=4980060 > 12/12/12 10:20:10 INFO mapred.JobClient: Reduce input groups=1660020 > 12/12/12 10:20:10 INFO mapred.JobClient: Combine output records=8426541 > 12/12/12 10:20:10 INFO mapred.JobClient: Physical memory (bytes) > snapshot=0 > 12/12/12 10:20:10 INFO mapred.JobClient: Reduce output records=1660020 > 12/12/12 10:20:10 INFO mapred.JobClient: Virtual memory (bytes) > snapshot=0 > 12/12/12 10:20:10 INFO mapred.JobClient: Map output records=4980060 > > > Thank you for your responses. > > > > >
Re: Number of mapreduce job and the time spent
get your job id and use this command: $HADOOP_HOME/hadoop job -status job_xxx -- Best Regards, longmans At 2012-12-12 17:23:39,"imen Megdiche" wrote: Hi, I want to know, from the output of the execution of the wordcount mapreduce example on hadoop: the number of mapreduce jobs and the time spent on the execution. Here is an excerpt from the output. 12/12/12 10:20:09 INFO mapred.Task: Task 'attempt_local_0001_r_00_0' done. 12/12/12 10:20:10 INFO mapred.JobClient: map 100% reduce 100% 12/12/12 10:20:10 INFO mapred.JobClient: Job complete: job_local_0001 12/12/12 10:20:10 INFO mapred.JobClient: Counters: 22 12/12/12 10:20:10 INFO mapred.JobClient: File Input Format Counters 12/12/12 10:20:10 INFO mapred.JobClient: Bytes Read=145966941 12/12/12 10:20:10 INFO mapred.JobClient: File Output Format Counters 12/12/12 10:20:10 INFO mapred.JobClient: Bytes Written=50704638 12/12/12 10:20:10 INFO mapred.JobClient: org.myorg.WordCount$Map$Counters 12/12/12 10:20:10 INFO mapred.JobClient: INPUT_WORDS=4980060 12/12/12 10:20:10 INFO mapred.JobClient: FileSystemCounters 12/12/12 10:20:10 INFO mapred.JobClient: FILE_BYTES_READ=1777104865 12/12/12 10:20:10 INFO mapred.JobClient: FILE_BYTES_WRITTEN=1783494521 12/12/12 10:20:10 INFO mapred.JobClient: Map-Reduce Framework 12/12/12 10:20:10 INFO mapred.JobClient: Map output materialized bytes=170854986 12/12/12 10:20:10 INFO mapred.JobClient: Map input records=4980060 12/12/12 10:20:10 INFO mapred.JobClient: Reduce shuffle bytes=0 12/12/12 10:20:10 INFO mapred.JobClient: Spilled Records=14940180 12/12/12 10:20:10 INFO mapred.JobClient: Map output bytes=160894830 12/12/12 10:20:10 INFO mapred.JobClient: Total committed heap usage (bytes)=1185910784 12/12/12 10:20:10 INFO mapred.JobClient: CPU time spent (ms)=0 12/12/12 10:20:10 INFO mapred.JobClient: Map input bytes=145954650 12/12/12 10:20:10 INFO mapred.JobClient: SPLIT_RAW_BYTES=614 12/12/12 10:20:10 INFO mapred.JobClient: Combine input records=8426541 12/12/12 10:20:10 INFO 
mapred.JobClient: Reduce input records=4980060 12/12/12 10:20:10 INFO mapred.JobClient: Reduce input groups=1660020 12/12/12 10:20:10 INFO mapred.JobClient: Combine output records=8426541 12/12/12 10:20:10 INFO mapred.JobClient: Physical memory (bytes) snapshot=0 12/12/12 10:20:10 INFO mapred.JobClient: Reduce output records=1660020 12/12/12 10:20:10 INFO mapred.JobClient: Virtual memory (bytes) snapshot=0 12/12/12 10:20:10 INFO mapred.JobClient: Map output records=4980060 Thank you for your responses.
help on failed MR jobs (big hive files)
Hi, I'm trying to run a program on Hadoop. [Input] tsv file My program does the following. (1) Load tsv into hive load data local inpath 'tsvfile' overwrite into table A partitioned by xx (2) insert overwrite table B select a, b, c from table A where datediff(to_date(from_unixtime(unix_timestamp('${logdate}'))), request_date) <= 30 (3) Running Mahout In step 2, I am trying to retrieve data from hive for the past month. My Hadoop job always stops here. When I check through my browser utility, it says: Diagnostic Info: # of failed Map Tasks exceeded allowed limit. FailedCount: 1. LastFailedTask: task_201211291541_0262_m_001800 Task attempt_201211291541_0262_m_001800_0 failed to report status for 1802 seconds. Killing! Error: Java heap space Task attempt_201211291541_0262_m_001800_2 failed to report status for 1800 seconds. Killing! Task attempt_201211291541_0262_m_001800_3 failed to report status for 1801 seconds. Killing! Each hive table is big, around 6 GB. (1) Is it too big to have around 6 GB for each hive table? (2) I've increased my HEAPSIZE to 50 GB, which I think is far more than enough. Anywhere else I can do the tuning? Thank you. rei
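[Editor's sketch of the client-side rewrite Mark suggests earlier in the thread: compute the date window in the client and send Hive a plain string comparison instead of nested date functions evaluated per row. The class and method names are mine, and the boundary semantics of `datediff(...) <= 30` should be double-checked against your data before adopting the `<=`/`>=` bounds used here.]

```java
import java.text.ParsePosition;
import java.text.SimpleDateFormat;
import java.util.Calendar;

public class DateFilterBuilder {
    // Build the Hive WHERE clause client-side so the server compares plain
    // yyyy-MM-dd strings per row instead of evaluating nested date functions.
    static String whereClause(String logdate, int days) {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd");
        Calendar cal = Calendar.getInstance();
        cal.setTime(fmt.parse(logdate, new ParsePosition(0)));
        cal.add(Calendar.DAY_OF_MONTH, -days);   // lower bound: logdate - days
        return "request_date >= '" + fmt.format(cal.getTime())
             + "' and request_date <= '" + logdate + "'";
    }

    public static void main(String[] args) {
        // e.g. a ${logdate} of 2012-12-12 with a 30-day window
        System.out.println(whereClause("2012-12-12", 30));
    }
}
```

Combined with date-based partitioning, such a clause also lets Hive prune partitions instead of scanning the full 6 GB.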