Re: log4j format logs in Hive table

2011-12-06 Thread alo alt
Hi,

I hope I understood your question correctly - did you describe your table?
Like:
CREATE TABLE YOURTABLE (row1 STRING, row2 STRING, row3 STRING) ROW FORMAT
DELIMITED FIELDS TERMINATED BY 'YOUR TERMINATOR' STORED AS TEXTFILE;

row* = names of your choice; for the datatypes, see the documentation.

After that, import the data via LOAD DATA or INSERT (OVERWRITE) TABLE YOURTABLE.
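For illustration, a minimal sketch of that import step (the HDFS path and the
staging table name are assumptions):

-- load files that Flume already wrote to HDFS
LOAD DATA INPATH '/flume/logs/app.log' INTO TABLE YOURTABLE;
-- or repopulate from another table already in Hive
INSERT OVERWRITE TABLE YOURTABLE SELECT * FROM staging_logs;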

- alex


On Tue, Dec 6, 2011 at 8:56 AM, sangeetha k get2sa...@yahoo.com wrote:

 Hi,

 I am new to Hive.

 I am using a Flume agent to collect log4j logs and send them to HDFS.
 Now I want to load the log4j-format logs from HDFS into Hive tables.
 Each of the attributes in the log statements, like timestamp, level, classname,
 etc., should be loaded into separate columns in the Hive tables.

 I tried creating a table in Hive and loaded the entire log into one column,
 but I don't know how to load the above-mentioned data into separate columns.

 Please send me your suggestions and any links or tutorials on this.

 Thanks,
 Sangeetha




-- 
Alexander Lorenz
http://mapredit.blogspot.com

P Think of the environment: please don't print this email unless you really need to.


Re: log4j format logs in Hive table

2011-12-06 Thread sangeetha k
Hi,
 
Thanks for the response.
Yes, you got my question.
 
An example of my log message line is below:
 
[2011-10-17 16:30:57,281] [ INFO] [33157362@qtp-28456974-0] 
[net.hp.tr.webservice.referenceimplcustomer.resource.CustomersResource] 
[Organization: Travelocity] [Client: AA] [Location of device: DFW] [User: 
550393] [user_role: ] [CorelationId: 248] [Component: Crossplane] [Server: 
server01] [Request: seats=5] [Response: yes] [Status: pass] - Entering Method = 
getKey() 
 
How do I specify the delimiter when describing the table?
 
Thanks,
Sangeetha




Hive query taking too much time

2011-12-06 Thread Savant, Keshav
Hi All,

 

My setup is 

hadoop-0.20.203.0

hive-0.7.1

 

I have a 5-node cluster in total: 4 datanodes and 1 namenode (which is
also acting as the secondary namenode). On the namenode I have set up Hive
with HiveDerbyServerMode to support multiple Hive server connections.

 

I have inserted plain-text CSV files into HDFS using 'LOAD DATA' Hive
query statements. The total number of files is 2624 and their combined size
is only 713 MB, which is very little from a Hadoop perspective, since Hadoop
can handle TBs of data easily.

 

The problem is, when I run a simple count query (i.e. select count(*)
from a_table), it takes too much time to execute.

 

For instance, it takes almost 17 minutes to execute the said query when the
table has 950,000 rows; I understand that is far too long for a query over
such a small amount of data.

This is only a dev environment; in the production environment the number
of files and their combined size will run into millions and GBs
respectively.

 

On analyzing the logs on all the datanodes and the namenode/secondary
namenode, I do not find any errors in them.

 

I have also tried setting mapred.reduce.tasks to a fixed number, but the
number of reducers always remains 1, while the number of maps is determined
by Hive alone.

 

Any suggestions on what I am doing wrong, or on how I can improve the
performance of Hive queries? Any suggestion or pointer is highly
appreciated.

 

Keshav

_
The information contained in this message is proprietary and/or confidential. 
If you are not the intended recipient, please: (i) delete the message and all 
copies; (ii) do not disclose, distribute or use the message in any manner; and 
(iii) notify the sender immediately. In addition, please be aware that any 
message addressed to our domain is subject to archiving and review by persons 
other than the intended recipient. Thank you.


Re: Hive Reducers hanging - interesting problem - skew ?

2011-12-06 Thread john smith
Hi Mark,

Thanks for your response. I tried the skew optimization and I also saw the
video by Lin and Namit. From what I understand about skew join, instead of
a single pass, they divide it into 2 stages.

Stage 1
Join the non-skewed pairs, and write the skewed pairs into temporary files on HDFS.

Stage 2
Do a map-join of those files by copying the smaller file into the mappers of
the larger one.

I have a doubt here. How can they be so sure that the map-join works in stage 2?
The files can be so large that they do not fit into memory and the join is
impossible. Am I wrong?
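
For reference, "skew optimized" below means settings along these lines (the
key threshold is just an example value):

set hive.optimize.skewjoin=true;
-- join keys with more rows than this are treated as skewed
set hive.skewjoin.key=100000;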

I also ran the query with skew optimization enabled and, as expected, none of
the pairs got joined in stage 1 and all of them got written to HDFS.
(They are huge.)

Now in stage 2, Hive tries to perform a map-join on these large tables; my
map phase in stage 2 is stuck at 0.13% after 6 hours, and 2 of my machines
went down. I finally had to kill the job.

The size of each table is just 2GB, which is way smaller than what the Hadoop
ecosystem can handle.

So is there any way I can join these tables in Hive? Any thoughts?


Thanks,
jS



On Tue, Dec 6, 2011 at 3:39 AM, Mark Grover mgro...@oanda.com wrote:

 jS,
 Check out if this helps:

 http://search-hadoop.com/m/l1usr1MAHX32subj=Re+Severely+hit+by+curse+of+last+reducer+



 Mark Grover, Business Intelligence Analyst
 OANDA Corporation

 www: oanda.com www: fxtrade.com
 e: mgro...@oanda.com

 Best Trading Platform - World Finance's Forex Awards 2009.
 The One to Watch - Treasury Today's Adam Smith Awards 2009.


 - Original Message -
 From: john smith js1987.sm...@gmail.com
 To: user@hive.apache.org
 Sent: Monday, December 5, 2011 4:38:14 PM
 Subject: Hive Reducers hanging - interesting problem - skew ?

 Hi list,

 I am trying to run a Join query on my 10 node cluster. My query looks as
 follows

 select * from A JOIN B on (A.a = B.b)

 size of A = 15 million rows
 size of B = 1 million rows

 The problem is that A.a and B.b have only around 25-30 distinct values per
 column, which implies each join key value matches many rows and the reducers are bulky.

 However, the performance hit is so horrible that ALL my reducers hang at
 75% for 6 hours and don't move further.

 The only thing the log shows is Join operator - forwarding rows
 ---Huge number kinds of entries for all this time. What does
 this mean?
 There is no swapping happening and the CPU % is constantly around 40% for
 all this time (observed through Ganglia).

 Any way I can solve this problem? Can anyone help me with this?

 Thanks,
 jS





Re: Hive query taking too much time

2011-12-06 Thread Wojciech Langiewicz

Hi,
In your case the total file size isn't the main factor that reduces
performance; the number of files is.


To test this, try merging those 2000+ files into one (or a few) big ones,
then upload them to HDFS and test Hive performance (it should be
definitely higher). If this works, you should think about merging those
files before or after loading them to HDFS.
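
One way to do the merge inside Hive itself is to copy into a fresh table with
file merging switched on (a sketch; the table name and size are only examples):

-- ask Hive to merge small output files after the job
set hive.merge.mapfiles=true;
set hive.merge.mapredfiles=true;
-- target size of the merged files, in bytes
set hive.merge.size.per.task=256000000;
-- rewriting the data produces far fewer, larger files
CREATE TABLE a_table_merged LIKE a_table;
INSERT OVERWRITE TABLE a_table_merged SELECT * FROM a_table;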


The second issue is counts. Try to observe how your jobs use mappers and
reducers; my experience is that simple count() jobs can be stuck on
one reducer (the one that does all the counting) for a long time. I have not
resolved this issue, but it was not significant in my case.
Setting mapred.reduce.tasks=xyz doesn't change that behavior, but, for
example, using GROUP BY with COUNT works much faster.


I hope this helps.
--
Wojciech Langiewicz






Re: Hive query taking too much time

2011-12-06 Thread Mohit Gupta
Hi Paul,
I am having the same problem. Do you know of an efficient way to merge the
files?

-Mohit

On Tue, Dec 6, 2011 at 8:14 PM, Paul Mackles pmack...@adobe.com wrote:

 How much time is it spending in the map and reduce phases, respectively? The
 large number of files could be creating a lot of mappers, which creates a lot
 of overhead. What happens if you merge the 2624 files into a smaller number,
 like 24 or 48? That should speed up the map phase significantly.





-- 
Best Regards,

Mohit Gupta
Software Engineer at Vdopia Inc.


Hive web console - schema is empty

2011-12-06 Thread sangeetha k
Hi,
 
I opened the web console for Hive using http://localhost:/hwi
 
In the Browse Schema option, I can see only the default Hive schema's name
and description; I am not able to view the tables. What could be the issue?
 
I have created 2 tables under the default schema, but I cannot see those
tables under the Browse Schema option.
 
Thanks,
Sangeetha

Re: Hive web console - schema is empty

2011-12-06 Thread sangeetha k
I get this error message in the console:
 
11/12/06 08:14:50 INFO DataNucleus.MetaData: Registering listener for metadata 
initialisation
11/12/06 08:14:50 INFO metastore.ObjectStore: Initialized ObjectStore
11/12/06 08:14:50 WARN DataNucleus.MetaData: MetaData Parser encountered an 
error in file 
jar:file:/opt/hive/lib/hive-metastore-0.7.1-cdh3u2.jar!/package.jdo at line 
11, column 6 : cvc-elt.1: Cannot find the declaration of element 'jdo'. - 
Please check your specification of DTD and the validity of the MetaData XML 
that you have specified.
11/12/06 08:14:50 WARN DataNucleus.MetaData: MetaData Parser encountered an 
error in file 
jar:file:/opt/hive/lib/hive-metastore-0.7.1-cdh3u2.jar!/package.jdo at line 
312, column 13 : The content of element type class must match 
(extension*,implements*,datastore-identity?,primary-key?,inheritance?,version?,join*,foreign-key*,index*,unique*,column*,field*,property*,query*,fetch-group*,extension*).
 - Please check your specification of DTD and the validity of the MetaData XML 
that you have specified.





Re: log4j format logs in Hive table

2011-12-06 Thread Mark Grover
Hi Sangeetha,
Hive uses SerDe (Serializer/Deserializer) for reading data from and writing to 
HDFS. You have many options for choosing the SerDe for your table.
For example, if your file contains tab-delimited fields, you could use the
default SerDe (by not specifying any SerDe) and specify the delimiter by using
FIELDS TERMINATED BY '\t'
in your create table statement.

If you desire, you could use the Regex SerDe (albeit with some performance
overhead) using something like:

ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = ".*time:([^,]*)",
  "output.format.string" = "time:%1$s"
)

in your create table statement.
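
For your bracketed sample line, a minimal sketch (the table and column names
are placeholders, and the regex only captures the first four fields - extend
it as needed):

CREATE TABLE log4j_logs (
  log_time  STRING,
  level     STRING,
  thread    STRING,
  classname STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  -- one capture group per column; all columns must be STRING
  "input.regex" = "\\[([^\\]]*)\\] \\[ *([^\\]]*)\\] \\[([^\\]]*)\\] \\[([^\\]]*)\\].*"
)
STORED AS TEXTFILE;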

As you get more familiar with Hive, you might find the need to write your
own UDF for parsing the data.

Here is the link to the Hive wiki for Create Table:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Create%2FDropTable

Here is the link for UDFs:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF


Welcome and good luck!
Mark

Mark Grover, Business Intelligence Analyst
OANDA Corporation 

www: oanda.com www: fxtrade.com 
e: mgro...@oanda.com 

Best Trading Platform - World Finance's Forex Awards 2009. 
The One to Watch - Treasury Today's Adam Smith Awards 2009. 








Re: log4j format logs in Hive table

2011-12-06 Thread alo alt
Hi Sangeetha,

Sorry, I was on the road, so the answer took a while.

As Mark wrote, a SerDe will be a good start. If it's useful for you, take a
look at http://code.google.com/p/hive-json-serde/wiki/GettingStarted.

- alex



-- 
Alexander Lorenz
http://mapredit.blogspot.com

P Think of the environment: please don't print this email unless you really need to.


Re: Hive Reducers hanging - interesting problem - skew ?

2011-12-06 Thread Aaron Sun
Can you try "from B join A"?

One simple rule of joins in Hive is "largest table last". The smaller tables
can then be buffered into the distributed cache for fast retrieval and
comparison.
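
With the tables from your query, that is simply (a sketch):

-- B (1 million rows) first, A (15 million rows) last
SELECT * FROM B JOIN A ON (B.b = A.a);
-- or name the small table explicitly in a map-join hint
SELECT /*+ MAPJOIN(B) */ * FROM A JOIN B ON (A.a = B.b);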

Thanks
Aaron







Re: How to see the intermediate results between AST and optimized logical query plan.

2011-12-06 Thread Mohit Gupta
Hi,
I am trying to understand the output of Hive's EXPLAIN command. I found the
documentation provided (
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Explain )
to be of little help. Is there any other place where I can find
detailed documentation on this?
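
The command in question, for reference (the table name is a placeholder):

-- plain EXPLAIN prints the abstract syntax tree, the stage dependencies,
-- and the stage plans; EXTENDED adds more detail
EXPLAIN EXTENDED SELECT count(*) FROM a_table;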

Hiroyuki, were you able to find any detailed docs on this? Or any other
leads?

Thanks
Mohit

On Wed, Oct 19, 2011 at 6:24 PM, Hiroyuki Yamada mogwa...@gmail.com wrote:

 Hello,

 I have been trying to learn the Hive query compiler, and
 I am wondering if there is a way to see the result of semantic
 analysis (the query block tree)
 and the non-optimized logical query plan.
 I know we can get the AST and the optimized logical query plan with EXPLAIN,
 but I want to see the intermediate results between them.

 Also, is there any detailed documentation about the Hive query compiler?

 I would appreciate it very much if anyone could answer my questions.

 Thanks,
 Hiroyuki




-- 
Best Regards,

Mohit Gupta
Software Engineer at Vdopia Inc.


Re: log4j format logs in Hive table

2011-12-06 Thread Aniket Mokashi
Pig has a log loader in Piggybank. You can use it to generate the columns
of that table and make the table point to the output.

Take a look--
https://github.com/apache/pig/tree/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/apachelog

Thanks,
Aniket

On Tue, Dec 6, 2011 at 10:19 AM, Abhishek Pratap Singh
manu.i...@gmail.comwrote:

 Hi Sangeetha,

 One more, easier option is to use Flume decorators to put a delimiter into
 your stream of data and then load the data into the table.

 For example:
 The data below can be converted to, say, pipe-delimited data (you can code
 for any delimiter) by using Flume decorators.

 [2011-10-17 16:30:57,281] [ INFO] [33157362@qtp-28456974-0]
 [net.hp.tr.webservice.referenceimplcustomer.resource.CustomersResource]
 [Organization: Travelocity] [Client: AA] [Location of device: DFW] [User:
 550393] [user_role: ] [CorelationId: 248] [Component: Crossplane] [Server:
 server01] [Request: seats=5] [Response: yes] [Status: pass] - Entering
 Method = getKey()

 PIPE Delimited---
 2011-10-17 16:30:57,281 |  INFO 
 |33157362@qtp-28456974-0|net.hp.tr.webservice.referenceimplcustomer.resource.CustomersResource|Organization:
 Travelocity|Client: AA|Location of device: DFW|User: 550393|user_role:
 |CorelationId: 248|Component: Crossplane|Server: server01|Request:
 seats=5|Response: yes|Status: pass| - Entering Method = getKey()

 Now, once you have this pipe-delimited data, you can create a table with a
 pipe delimiter and load this file, as sketched below.
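
 A minimal sketch of that table (the column names and the HDFS path are
 assumptions):

 CREATE TABLE logs_piped (
   log_time     STRING,
   level        STRING,
   thread       STRING,
   classname    STRING,
   organization STRING,
   message      STRING
 )
 ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
 STORED AS TEXTFILE;

 LOAD DATA INPATH '/flume/piped/app.log' INTO TABLE logs_piped;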

 You can choose any delimiter, and you can also drop some data in the Flume
 decorator, before finally loading into a Hive table with the same schema and delimiter.
 Hope it helps.

 ~Abhishek P Singh







-- 
...:::Aniket:::... Quetzalco@tl


Re: Hive query taking too much time

2011-12-06 Thread Ayon Sinha
How about a simple Pig script with a load and a store statement? Set the max
number of reducers to, say, 20 or 30; that way you will only have 20-30 files
as output. Then put those files in the Hive directory. Make sure to match the
delimiters in Hive and Pig.
 
-Ayon
See My Photos on Flickr
Also check out my Blog for answers to commonly asked questions.




 From: Vikas Srivastava vikas.srivast...@one97.net
To: user@hive.apache.org 
Sent: Tuesday, December 6, 2011 10:00 PM
Subject: Re: Hive query taking too much time
 

Hey, if all the files have the same columns, then you can easily merge them
with a shell script:

table=yourtable
# append every CSV in the current directory into one merged file
for file in *.csv
do
  cat "$file" >> new_file.csv
done
hive -e "LOAD DATA LOCAL INPATH 'new_file.csv' INTO TABLE $table"

It will merge all the files into a single file; then you can upload that file
with a single query.







-- 
With Regards
Vikas Srivastava

DWH & Analytics Team
Mob:+91 9560885900
One97 | Let's get talking !