Re: log4j format logs in Hive table
Hi,

I hope I understood your question correctly - did you describe your table? Like:

    CREATE TABLE YOURTABLE (row1 STRING, row2 STRING, row3 STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY 'YOUR TERMINATOR'
    STORED AS TEXTFILE;

row* = a name of your decision; for the data types, look at the documentation. After that, import via INSERT (OVERWRITE) TABLE YOURTABLE.

- alex

On Tue, Dec 6, 2011 at 8:56 AM, sangeetha k get2sa...@yahoo.com wrote:

Hi, I am new to Hive. I am using a Flume agent to collect log4j logs and send them to HDFS. Now I want to load the log4j-format logs from HDFS into Hive tables. Each of the attributes in the log statements, like timestamp, level, classname etc., should be loaded into a separate column in the Hive tables. I tried creating a table in Hive and loaded the entire log into one column, but I don't know how to load the above-mentioned data into separate columns. Please send me your suggestions, any links, tutorials on this. Thanks, Sangeetha

--
Alexander Lorenz
http://mapredit.blogspot.com
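A minimal end-to-end sketch of what Alex describes, assuming tab-delimited input; the table name, column names, and HDFS path are illustrative, not from the thread:

    CREATE TABLE logs_raw (log_ts STRING, level STRING, classname STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    STORED AS TEXTFILE;

    -- load the files Flume wrote to HDFS (path is hypothetical)
    LOAD DATA INPATH '/flume/logs/' INTO TABLE logs_raw;

    -- or re-import from an existing staging table with the same columns
    INSERT OVERWRITE TABLE logs_raw SELECT * FROM staging_logs;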
Re: log4j format logs in Hive table
Hi,

Thanks for the response. Yes, you got my question. An example line from my log will be as below:

[2011-10-17 16:30:57,281] [ INFO] [33157362@qtp-28456974-0] [net.hp.tr.webservice.referenceimplcustomer.resource.CustomersResource] [Organization: Travelocity] [Client: AA] [Location of device: DFW] [User: 550393] [user_role: ] [CorelationId: 248] [Component: Crossplane] [Server: server01] [Request: seats=5] [Response: yes] [Status: pass] - Entering Method = getKey()

How do I specify the delimiter while describing the table?

Thanks,
Sangeetha

From: alo alt wget.n...@googlemail.com
To: user@hive.apache.org; sangeetha k get2sa...@yahoo.com
Sent: Tuesday, December 6, 2011 2:01 PM
Subject: Re: log4j format logs in Hive table

Hi, I hope I understood your question correctly - did you describe your table? Like: CREATE TABLE YOURTABLE (row1 STRING, row2 STRING, row3 STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY 'YOUR TERMINATOR' STORED AS TEXTFILE; row* = a name of your decision; for the data types, look at the documentation. After that, import via INSERT (OVERWRITE) TABLE YOURTABLE. - alex

On Tue, Dec 6, 2011 at 8:56 AM, sangeetha k get2sa...@yahoo.com wrote: Hi, I am new to Hive. I am using a Flume agent to collect log4j logs and send them to HDFS. Now I want to load the log4j-format logs from HDFS into Hive tables. Each of the attributes in the log statements, like timestamp, level, classname etc., should be loaded into a separate column in the Hive tables. I tried creating a table in Hive and loaded the entire log into one column, but I don't know how to load the above-mentioned data into separate columns. Please send me your suggestions, any links, tutorials on this. Thanks, Sangeetha

--
Alexander Lorenz
http://mapredit.blogspot.com
Hive query taking too much time
Hi All,

My setup is hadoop-0.20.203.0 and hive-0.7.1. I have a 5-node cluster in total: 4 data nodes and 1 namenode (which is also acting as the secondary namenode). On the namenode I have set up Hive with HiveDerbyServerMode to support multiple Hive server connections.

I have inserted plain-text CSV files into HDFS using 'LOAD DATA' Hive query statements; the total number of files is 2624 and their combined size is only 713 MB, which is very small from a Hadoop perspective - Hadoop can handle TBs of data very easily.

The problem is, when I run a simple count query (i.e. select count(*) from a_table), it takes too much time to execute. For instance, it takes almost 17 minutes to execute the said query on a table with 950,000 rows; I understand that is far too long for such a small amount of data. This is only a dev environment, and in the production environment the number of files and their combined size will move into the millions and GBs, respectively.

On analyzing the logs on all the datanodes and the namenode/secondary namenode, I do not find any errors in them. I have tried setting mapred.reduce.tasks to a fixed number as well, but the number of reducers always remains 1, while the number of maps is determined by Hive alone.

Any suggestion what I am doing wrong, or how can I improve the performance of Hive queries? Any suggestion or pointer is highly appreciated.

Keshav
Re: Hive Reducers hanging - interesting problem - skew ?
Hi Mark,

Thanks for your response. I tried the skew optimization and I also saw the video by Lin and Namit. From what I understand about skew join, instead of a single pass they divide it into 2 stages:

Stage 1: join the non-skewed keys, and write the skewed keys into temporary files on HDFS.
Stage 2: do a map join of those files by copying the smaller file into the mappers of the larger file.

I have a doubt here. How can they be so sure that the map join works in stage 2? The files can be so large that they do not fit into memory, and the join is impossible. Am I wrong?

I also ran the query with skew optimization enabled and, as expected, none of the pairs got joined in stage 1 and all of them got written to HDFS (they are huge). Now in stage 2, Hive tries to perform a map join on these large tables; my map phase in stage 2 was stuck at 0.13% after 6 hours, and 2 of my machines went down. I had to kill the job in the end. The size of each table is just 2 GB, which is way smaller than what the Hadoop ecosystem can handle.

So is there any way I can join these tables in Hive? Any thoughts?

Thanks,
jS

On Tue, Dec 6, 2011 at 3:39 AM, Mark Grover mgro...@oanda.com wrote:

jS, check out if this helps:
http://search-hadoop.com/m/l1usr1MAHX32subj=Re+Severely+hit+by+curse+of+last+reducer+

Mark Grover, Business Intelligence Analyst
OANDA Corporation
www: oanda.com www: fxtrade.com e: mgro...@oanda.com
Best Trading Platform - World Finance's Forex Awards 2009. The One to Watch - Treasury Today's Adam Smith Awards 2009.

----- Original Message -----
From: john smith js1987.sm...@gmail.com
To: user@hive.apache.org
Sent: Monday, December 5, 2011 4:38:14 PM
Subject: Hive Reducers hanging - interesting problem - skew ?

Hi list, I am trying to run a join query on my 10-node cluster. My query looks as follows:

select * from A JOIN B on (A.a = B.b)

size of A = 15 million rows
size of B = 1 million rows

The problem is that A.a and B.b have only around 25-30 distinct values per column, which means each key matches a very large number of rows and the reducers are bulky. The performance hit is so horrible that ALL my reducers hang at 75% for 6 hours and don't move further. The only thing the log shows is "Join operator - forwarding rows", a huge number of log lines like this the whole time. What does this mean? There is no swapping happening, and the CPU % is constantly around 40% all this time (observed through Ganglia). Is there any way I can solve this problem? Can anyone help me with this?

Thanks,
jS
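For reference, the skew-join path jS describes is switched on with the following settings (standard Hive parameters; the threshold value shown is only an illustration):

    SET hive.optimize.skewjoin = true;
    -- keys with more rows than this are treated as skewed
    -- and deferred to the stage-2 map join
    SET hive.skewjoin.key = 100000;

    SELECT * FROM A JOIN B ON (A.a = B.b);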
Re: Hive query taking too much time
Hi,

In your case the total file size isn't the main factor that reduces performance; the number of files is. To test this, try merging those 2000+ files into one (or a few) big file(s), then upload to HDFS and test Hive performance again (it should be definitely higher). If this works, you should think about merging those files before or after loading them into HDFS.

The second issue is counts. Try to observe how your jobs use mappers and reducers; my experience is that simple count(*) jobs can be stuck on one reducer (the one that does all the counting) for a long time. I have not resolved this issue, but it was not significant in my case. set mapred.reduce.tasks=xyz doesn't change that behavior, but, for example, using GROUP BY with COUNT works much faster.

I hope this helps.

--
Wojciech Langiewicz

On 06.12.2011 12:00, Savant, Keshav wrote:

Hi All, my setup is hadoop-0.20.203.0 and hive-0.7.1. I have a 5-node cluster in total: 4 data nodes and 1 namenode (which is also acting as the secondary namenode). On the namenode I have set up Hive with HiveDerbyServerMode to support multiple Hive server connections. I have inserted plain-text CSV files into HDFS using 'LOAD DATA' Hive query statements; the total number of files is 2624 and their combined size is only 713 MB, which is very small from a Hadoop perspective - Hadoop can handle TBs of data very easily. The problem is, when I run a simple count query (i.e. select count(*) from a_table), it takes too much time to execute. For instance, it takes almost 17 minutes to execute the said query on a table with 950,000 rows; I understand that is far too long for such a small amount of data. This is only a dev environment, and in the production environment the number of files and their combined size will move into the millions and GBs, respectively. On analyzing the logs on all the datanodes and the namenode/secondary namenode, I do not find any errors in them. I have tried setting mapred.reduce.tasks to a fixed number as well, but the number of reducers always remains 1, while the number of maps is determined by Hive alone. Any suggestion what I am doing wrong, or how can I improve the performance of Hive queries? Any suggestion or pointer is highly appreciated. Keshav
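Two hedged sketches of what Wojciech suggests. First, merging the small files inside Hive by rewriting the table through a single query (the hive.merge.* settings are standard Hive parameters; the table names and size value are illustrative, and a_table_merged is assumed to already exist with the same schema):

    SET hive.merge.mapfiles = true;
    SET hive.merge.mapredfiles = true;
    -- target size per merged output file, in bytes (value is an example)
    SET hive.merge.size.per.task = 256000000;
    INSERT OVERWRITE TABLE a_table_merged SELECT * FROM a_table;

Second, the GROUP BY variant of a count, which spreads the counting across reducers instead of funneling everything through one (some_column is a placeholder):

    SELECT some_column, count(*) FROM a_table GROUP BY some_column;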
Re: Hive query taking too much time
Hi Paul,

I am having the same problem. Do you know any efficient way of merging the files?

-Mohit

On Tue, Dec 6, 2011 at 8:14 PM, Paul Mackles pmack...@adobe.com wrote:

How much time is it spending in the map and reduce phases, respectively? The large number of files could be creating a lot of mappers, which create a lot of overhead. What happens if you merge the 2624 files into a smaller number, like 24 or 48? That should speed up the mapper phase significantly.

From: Savant, Keshav [mailto:keshav.c.sav...@fisglobal.com]
Sent: Tuesday, December 06, 2011 6:01 AM
To: user@hive.apache.org
Subject: Hive query taking too much time

Hi All, my setup is hadoop-0.20.203.0 and hive-0.7.1. I have a 5-node cluster in total: 4 data nodes and 1 namenode (which is also acting as the secondary namenode). On the namenode I have set up Hive with HiveDerbyServerMode to support multiple Hive server connections. I have inserted plain-text CSV files into HDFS using 'LOAD DATA' Hive query statements; the total number of files is 2624 and their combined size is only 713 MB, which is very small from a Hadoop perspective - Hadoop can handle TBs of data very easily. The problem is, when I run a simple count query (i.e. select count(*) from a_table), it takes too much time to execute. For instance, it takes almost 17 minutes to execute the said query on a table with 950,000 rows; I understand that is far too long for such a small amount of data. This is only a dev environment, and in the production environment the number of files and their combined size will move into the millions and GBs, respectively. On analyzing the logs on all the datanodes and the namenode/secondary namenode, I do not find any errors in them. I have tried setting mapred.reduce.tasks to a fixed number as well, but the number of reducers always remains 1, while the number of maps is determined by Hive alone. Any suggestion what I am doing wrong, or how can I improve the performance of Hive queries? Any suggestion or pointer is highly appreciated. Keshav

--
Best Regards,
Mohit Gupta
Software Engineer at Vdopia Inc.
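One way to cut the mapper overhead Paul mentions without physically merging the files is to let Hive combine small input splits. A hedged sketch; whether CombineHiveInputFormat behaves well here depends on the Hive and Hadoop builds in use, and the split size is only an example:

    SET hive.input.format = org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
    -- minimum split size in bytes, so many small files share one mapper
    SET mapred.min.split.size = 134217728;
    SELECT count(*) FROM a_table;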
Hive web console - schema is empty
Hi,

I opened the web console for Hive using http://localhost:/hwi. Under the Browse Schema option, I could see only the default Hive table list name and description; I am not able to view the tables. What could be the issue? I have created 2 tables under the default schema, but I cannot see those tables under the Browse Schema option.

Thanks,
Sangeetha
Re: Hive web console - schema is empty
I get this error message in the console:

11/12/06 08:14:50 INFO DataNucleus.MetaData: Registering listener for metadata initialisation
11/12/06 08:14:50 INFO metastore.ObjectStore: Initialized ObjectStore
11/12/06 08:14:50 WARN DataNucleus.MetaData: MetaData Parser encountered an error in file jar:file:/opt/hive/lib/hive-metastore-0.7.1-cdh3u2.jar!/package.jdo at line 11, column 6 : cvc-elt.1: Cannot find the declaration of element 'jdo'. - Please check your specification of DTD and the validity of the MetaData XML that you have specified.
11/12/06 08:14:50 WARN DataNucleus.MetaData: MetaData Parser encountered an error in file jar:file:/opt/hive/lib/hive-metastore-0.7.1-cdh3u2.jar!/package.jdo at line 312, column 13 : The content of element type class must match (extension*,implements*,datastore-identity?,primary-key?,inheritance?,version?,join*,foreign-key*,index*,unique*,column*,field*,property*,query*,fetch-group*,extension*). - Please check your specification of DTD and the validity of the MetaData XML that you have specified.

From: sangeetha k get2sa...@yahoo.com
To: user@hive.apache.org
Sent: Tuesday, December 6, 2011 8:21 PM
Subject: Hive web console - schema is empty

Hi, I opened the web console for Hive using http://localhost:/hwi. Under the Browse Schema option, I could see only the default Hive table list name and description; I am not able to view the tables. What could be the issue? I have created 2 tables under the default schema, but I cannot see those tables under the Browse Schema option. Thanks, Sangeetha
Re: log4j format logs in Hive table
Hi Sangeetha,

Hive uses a SerDe (Serializer/Deserializer) for reading data from and writing to HDFS. You have many options when choosing the SerDe for your table. For example, if your file contains tab-delimited fields, you could use the default SerDe (by not specifying any SerDe) and specify the delimiter by using FIELDS TERMINATED BY '\t' in your create table statement.

If you desire, you could use the Regex SerDe (albeit with some performance overhead) using something like the following in your create table statement:

    ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
    WITH SERDEPROPERTIES (
      "input.regex" = ".*time:([^,]*)",
      "output.format.string" = "time:%1$s"
    )

As you get more familiar with Hive, you might find the need to write your own UDF for parsing the data.

Here is the link to the Hive wiki for Create Table:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Create%2FDropTable
Here is the link for UDFs:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF

Welcome and good luck!
Mark

Mark Grover, Business Intelligence Analyst
OANDA Corporation
www: oanda.com www: fxtrade.com e: mgro...@oanda.com
Best Trading Platform - World Finance's Forex Awards 2009. The One to Watch - Treasury Today's Adam Smith Awards 2009.

----- Original Message -----
From: sangeetha k get2sa...@yahoo.com
To: user@hive.apache.org
Sent: Tuesday, December 6, 2011 4:26:03 AM
Subject: Re: log4j format logs in Hive table

Hi, thanks for the response. Yes, you got my question. An example line from my log will be as below:

[2011-10-17 16:30:57,281] [ INFO] [33157362@qtp-28456974-0] [net.hp.tr.webservice.referenceimplcustomer.resource.CustomersResource] [Organization: Travelocity] [Client: AA] [Location of device: DFW] [User: 550393] [user_role: ] [CorelationId: 248] [Component: Crossplane] [Server: server01] [Request: seats=5] [Response: yes] [Status: pass] - Entering Method = getKey()

How do I specify the delimiter while describing the table? Thanks, Sangeetha

From: alo alt wget.n...@googlemail.com
To: user@hive.apache.org; sangeetha k get2sa...@yahoo.com
Sent: Tuesday, December 6, 2011 2:01 PM
Subject: Re: log4j format logs in Hive table

Hi, I hope I understood your question correctly - did you describe your table? Like: CREATE TABLE YOURTABLE (row1 STRING, row2 STRING, row3 STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY 'YOUR TERMINATOR' STORED AS TEXTFILE; row* = a name of your decision; for the data types, look at the documentation. After that, import via INSERT (OVERWRITE) TABLE YOURTABLE. - alex

On Tue, Dec 6, 2011 at 8:56 AM, sangeetha k get2sa...@yahoo.com wrote: Hi, I am new to Hive. I am using a Flume agent to collect log4j logs and send them to HDFS. Now I want to load the log4j-format logs from HDFS into Hive tables. Each of the attributes in the log statements, like timestamp, level, classname etc., should be loaded into a separate column in the Hive tables. I tried creating a table in Hive and loaded the entire log into one column, but I don't know how to load the above-mentioned data into separate columns. Please send me your suggestions, any links, tutorials on this. Thanks, Sangeetha

--
Alexander Lorenz
http://mapredit.blogspot.com
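Applying Mark's Regex SerDe suggestion to the sample line Sangeetha posted, a minimal sketch might look like the following. It assumes the hive-contrib jar is on the classpath, splits out only the first four bracketed fields (leaving the remainder in one column), and keeps every column as STRING, which the contrib RegexSerDe expects; the table and column names are illustrative:

    CREATE TABLE log4j_logs (
      log_ts    STRING,
      level     STRING,
      thread    STRING,
      classname STRING,
      message   STRING
    )
    ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
    WITH SERDEPROPERTIES (
      "input.regex" = "\\[([^\\]]*)\\] \\[ *([^\\]]*)\\] \\[([^\\]]*)\\] \\[([^\\]]*)\\] (.*)",
      "output.format.string" = "[%1$s] [%2$s] [%3$s] [%4$s] %5$s"
    )
    STORED AS TEXTFILE;

Lines that do not match the pattern come back as NULLs, so it is worth testing the regex against a few real log lines first.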
Re: log4j format logs in Hive table
Hi Sangeetha,

Sorry, I was on the road and the answer took a while. As Mark wrote, a SerDe will be a good start. If it's useful for you, take a look at http://code.google.com/p/hive-json-serde/wiki/GettingStarted.

- alex

On Tue, Dec 6, 2011 at 10:26 AM, sangeetha k get2sa...@yahoo.com wrote:

Hi, thanks for the response. Yes, you got my question. An example line from my log will be as below:

[2011-10-17 16:30:57,281] [ INFO] [33157362@qtp-28456974-0] [net.hp.tr.webservice.referenceimplcustomer.resource.CustomersResource] [Organization: Travelocity] [Client: AA] [Location of device: DFW] [User: 550393] [user_role: ] [CorelationId: 248] [Component: Crossplane] [Server: server01] [Request: seats=5] [Response: yes] [Status: pass] - Entering Method = getKey()

How do I specify the delimiter while describing the table? Thanks, Sangeetha

From: alo alt wget.n...@googlemail.com
To: user@hive.apache.org; sangeetha k get2sa...@yahoo.com
Sent: Tuesday, December 6, 2011 2:01 PM
Subject: Re: log4j format logs in Hive table

Hi, I hope I understood your question correctly - did you describe your table? Like: CREATE TABLE YOURTABLE (row1 STRING, row2 STRING, row3 STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY 'YOUR TERMINATOR' STORED AS TEXTFILE; row* = a name of your decision; for the data types, look at the documentation. After that, import via INSERT (OVERWRITE) TABLE YOURTABLE. - alex

On Tue, Dec 6, 2011 at 8:56 AM, sangeetha k get2sa...@yahoo.com wrote: Hi, I am new to Hive. I am using a Flume agent to collect log4j logs and send them to HDFS. Now I want to load the log4j-format logs from HDFS into Hive tables. Each of the attributes in the log statements, like timestamp, level, classname etc., should be loaded into a separate column in the Hive tables. I tried creating a table in Hive and loaded the entire log into one column, but I don't know how to load the above-mentioned data into separate columns. Please send me your suggestions, any links, tutorials on this. Thanks, Sangeetha

--
Alexander Lorenz
http://mapredit.blogspot.com
Re: Hive Reducers hanging - interesting problem - skew ?
Can you try "from B join A"? One simple rule for joins in Hive is: largest table last. The smaller tables can then be buffered into the distributed cache for fast retrieval and comparison.

Thanks,
Aaron

On Tue, Dec 6, 2011 at 4:01 AM, john smith js1987.sm...@gmail.com wrote:

Hi Mark, thanks for your response. I tried the skew optimization and I also saw the video by Lin and Namit. From what I understand about skew join, instead of a single pass they divide it into 2 stages: Stage 1 joins the non-skewed keys and writes the skewed keys into temporary files on HDFS; Stage 2 does a map join of those files by copying the smaller file into the mappers of the larger file. I have a doubt here. How can they be so sure that the map join works in stage 2? The files can be so large that they do not fit into memory, and the join is impossible. Am I wrong? I also ran the query with skew optimization enabled and, as expected, none of the pairs got joined in stage 1 and all of them got written to HDFS (they are huge). Now in stage 2, Hive tries to perform a map join on these large tables; my map phase in stage 2 was stuck at 0.13% after 6 hours, and 2 of my machines went down. I had to kill the job in the end. The size of each table is just 2 GB, which is way smaller than what the Hadoop ecosystem can handle. So is there any way I can join these tables in Hive? Any thoughts? Thanks, jS

On Tue, Dec 6, 2011 at 3:39 AM, Mark Grover mgro...@oanda.com wrote:

jS, check out if this helps:
http://search-hadoop.com/m/l1usr1MAHX32subj=Re+Severely+hit+by+curse+of+last+reducer+

Mark Grover, Business Intelligence Analyst
OANDA Corporation
www: oanda.com www: fxtrade.com e: mgro...@oanda.com
Best Trading Platform - World Finance's Forex Awards 2009. The One to Watch - Treasury Today's Adam Smith Awards 2009.

----- Original Message -----
From: john smith js1987.sm...@gmail.com
To: user@hive.apache.org
Sent: Monday, December 5, 2011 4:38:14 PM
Subject: Hive Reducers hanging - interesting problem - skew ?

Hi list, I am trying to run a join query on my 10-node cluster. My query looks as follows:

select * from A JOIN B on (A.a = B.b)

size of A = 15 million rows
size of B = 1 million rows

The problem is that A.a and B.b have only around 25-30 distinct values per column, which means each key matches a very large number of rows and the reducers are bulky. The performance hit is so horrible that ALL my reducers hang at 75% for 6 hours and don't move further. The only thing the log shows is "Join operator - forwarding rows", a huge number of log lines like this the whole time. What does this mean? There is no swapping happening, and the CPU % is constantly around 40% all this time (observed through Ganglia). Is there any way I can solve this problem? Can anyone help me with this?

Thanks,
jS
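Two hedged sketches of Aaron's suggestion, using the table sizes from the thread (A ~15M rows, B ~1M rows). First, simply reorder so the larger table comes last and is streamed through the reducers:

    SELECT * FROM B JOIN A ON (A.a = B.b);

Second, if B is small enough to fit in memory, a map-side join avoids the reduce phase entirely (the MAPJOIN hint is standard Hive; whether B actually fits is an assumption):

    SELECT /*+ MAPJOIN(B) */ * FROM A JOIN B ON (A.a = B.b);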
Re: How to see the intermediate results between AST and optimized logical query plan.
Hi,

I am trying to understand the output of Hive's EXPLAIN command. I found the documentation provided (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Explain) to be of little help. Is there any other place where I can find detailed documentation on this? Hiroyuki, were you able to find any detailed docs on this? Or any other leads?

Thanks,
Mohit

On Wed, Oct 19, 2011 at 6:24 PM, Hiroyuki Yamada mogwa...@gmail.com wrote:

Hello, I have been trying to learn the Hive query compiler, and I am wondering if there is a way to see the result of semantic analysis (the query block tree) and the non-optimized logical query plan. I know we can get the AST and the optimized logical query plan with EXPLAIN, but I want to see the intermediate results between them. Also, is there any detailed documentation about the Hive query compiler? I would really appreciate it if anyone could answer my questions.

Thanks,
Hiroyuki

--
Best Regards,
Mohit Gupta
Software Engineer at Vdopia Inc.
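For reference, the command under discussion (standard Hive syntax; the query itself is just an illustration):

    -- EXPLAIN prints the AST plus the stage/operator plan;
    -- EXPLAIN EXTENDED adds file paths and more operator detail.
    -- Neither shows the intermediate query-block tree Hiroyuki asks about.
    EXPLAIN EXTENDED
    SELECT count(*) FROM a_table;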
Re: log4j format logs in Hive table
Pig has a log loader in Piggybank. You can use it to generate the columns of that table and make the table point to it. Take a look:
https://github.com/apache/pig/tree/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/apachelog

Thanks,
Aniket

On Tue, Dec 6, 2011 at 10:19 AM, Abhishek Pratap Singh manu.i...@gmail.com wrote:

Hi Sangeetha,

One more, easier option is to use Flume decorators to put a delimiter into your stream of data and then load the data into the table. For example, the data below can be converted to, say, pipe-delimited data (you can code for any delimiter) using Flume decorators.

Original:
[2011-10-17 16:30:57,281] [ INFO] [33157362@qtp-28456974-0] [net.hp.tr.webservice.referenceimplcustomer.resource.CustomersResource] [Organization: Travelocity] [Client: AA] [Location of device: DFW] [User: 550393] [user_role: ] [CorelationId: 248] [Component: Crossplane] [Server: server01] [Request: seats=5] [Response: yes] [Status: pass] - Entering Method = getKey()

Pipe-delimited:
2011-10-17 16:30:57,281 | INFO |33157362@qtp-28456974-0|net.hp.tr.webservice.referenceimplcustomer.resource.CustomersResource|Organization: Travelocity|Client: AA|Location of device: DFW|User: 550393|user_role: |CorelationId: 248|Component: Crossplane|Server: server01|Request: seats=5|Response: yes|Status: pass| - Entering Method = getKey()

Now once you have this pipe-delimited data, you can create a table with a pipe delimiter and load this file (see the sketch after this message). You can choose any delimiter, as well as remove some data in the Flume decorator, and finally load into a Hive table with the same schema and delimiter. Hope it helps.

~Abhishek P Singh

On Tue, Dec 6, 2011 at 7:58 AM, alo alt wget.n...@googlemail.com wrote:

Hi Sangeetha, sorry, I was on the road and the answer took a while. As Mark wrote, a SerDe will be a good start. If it's useful for you, take a look at http://code.google.com/p/hive-json-serde/wiki/GettingStarted. - alex

On Tue, Dec 6, 2011 at 10:26 AM, sangeetha k get2sa...@yahoo.com wrote: Hi, thanks for the response. Yes, you got my question. An example line from my log will be as below:

[2011-10-17 16:30:57,281] [ INFO] [33157362@qtp-28456974-0] [net.hp.tr.webservice.referenceimplcustomer.resource.CustomersResource] [Organization: Travelocity] [Client: AA] [Location of device: DFW] [User: 550393] [user_role: ] [CorelationId: 248] [Component: Crossplane] [Server: server01] [Request: seats=5] [Response: yes] [Status: pass] - Entering Method = getKey()

How do I specify the delimiter while describing the table? Thanks, Sangeetha

From: alo alt wget.n...@googlemail.com
To: user@hive.apache.org; sangeetha k get2sa...@yahoo.com
Sent: Tuesday, December 6, 2011 2:01 PM
Subject: Re: log4j format logs in Hive table

Hi, I hope I understood your question correctly - did you describe your table? Like: CREATE TABLE YOURTABLE (row1 STRING, row2 STRING, row3 STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY 'YOUR TERMINATOR' STORED AS TEXTFILE; row* = a name of your decision; for the data types, look at the documentation. After that, import via INSERT (OVERWRITE) TABLE YOURTABLE. - alex

On Tue, Dec 6, 2011 at 8:56 AM, sangeetha k get2sa...@yahoo.com wrote: Hi, I am new to Hive. I am using a Flume agent to collect log4j logs and send them to HDFS. Now I want to load the log4j-format logs from HDFS into Hive tables. Each of the attributes in the log statements, like timestamp, level, classname etc., should be loaded into a separate column in the Hive tables. I tried creating a table in Hive and loaded the entire log into one column, but I don't know how to load the above-mentioned data into separate columns. Please send me your suggestions, any links, tutorials on this. Thanks, Sangeetha

--
Alexander Lorenz
http://mapredit.blogspot.com

--
...:::Aniket:::... Quetzalco@tl
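A minimal sketch of the pipe-delimited table Abhishek describes; the table name, column breakdown, and HDFS path are illustrative, not from the thread:

    CREATE TABLE log4j_piped (
      log_ts       STRING,
      level        STRING,
      thread       STRING,
      classname    STRING,
      organization STRING,
      client       STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
    STORED AS TEXTFILE;

    -- path is hypothetical: wherever the Flume decorator writes its output
    LOAD DATA INPATH '/flume/piped-logs/' INTO TABLE log4j_piped;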
Re: Hive query taking too much time
How about a simple Pig script with a load and a store statement? Set the max number of reducers to, say, 20 or 30; that way you will only have 20-30 files as output. Then put these files in the Hive dir. Make sure to match the delimiters in Hive and Pig.

-Ayon
See My Photos on Flickr
Also check out my Blog for answers to commonly asked questions.

From: Vikas Srivastava vikas.srivast...@one97.net
To: user@hive.apache.org
Sent: Tuesday, December 6, 2011 10:00 PM
Subject: Re: Hive query taking too much time

Hey, if all the files have the same columns then you can easily merge them with a shell script:

    list=$(ls *.csv)
    table=yourtable
    for file in $list
    do
        cat "$file" >> new_file.csv
    done
    hive -e "load data local inpath 'new_file.csv' into table $table"

It will merge all the files into a single file; then you can upload it with the same query.

On Tue, Dec 6, 2011 at 8:16 PM, Mohit Gupta success.mohit.gu...@gmail.com wrote:

Hi Paul, I am having the same problem. Do you know any efficient way of merging the files? -Mohit

On Tue, Dec 6, 2011 at 8:14 PM, Paul Mackles pmack...@adobe.com wrote:

How much time is it spending in the map and reduce phases, respectively? The large number of files could be creating a lot of mappers, which create a lot of overhead. What happens if you merge the 2624 files into a smaller number, like 24 or 48? That should speed up the mapper phase significantly.

From: Savant, Keshav [mailto:keshav.c.sav...@fisglobal.com]
Sent: Tuesday, December 06, 2011 6:01 AM
To: user@hive.apache.org
Subject: Hive query taking too much time

Hi All, my setup is hadoop-0.20.203.0 and hive-0.7.1. I have a 5-node cluster in total: 4 data nodes and 1 namenode (which is also acting as the secondary namenode). On the namenode I have set up Hive with HiveDerbyServerMode to support multiple Hive server connections. I have inserted plain-text CSV files into HDFS using 'LOAD DATA' Hive query statements; the total number of files is 2624 and their combined size is only 713 MB, which is very small from a Hadoop perspective - Hadoop can handle TBs of data very easily. The problem is, when I run a simple count query (i.e. select count(*) from a_table), it takes too much time to execute. For instance, it takes almost 17 minutes to execute the said query on a table with 950,000 rows; I understand that is far too long for such a small amount of data. This is only a dev environment, and in the production environment the number of files and their combined size will move into the millions and GBs, respectively. On analyzing the logs on all the datanodes and the namenode/secondary namenode, I do not find any errors in them. I have tried setting mapred.reduce.tasks to a fixed number as well, but the number of reducers always remains 1, while the number of maps is determined by Hive alone. Any suggestion what I am doing wrong, or how can I improve the performance of Hive queries? Any suggestion or pointer is highly appreciated. Keshav

--
Best Regards,
Mohit Gupta
Software Engineer at Vdopia Inc.

--
With Regards,
Vikas Srivastava
DWH & Analytics Team
Mob: +91 9560885900
One97 | Let's get talking!