Re: Query

2012-10-22 Thread MiaoMiao
My solution:
1. Echo the column headers into a file;
2. Run the query, then append its output to that file.
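
If your Hive version is recent enough, the CLI also has a session option that
prints headers for you, which avoids the manual echo step (a sketch; verify the
option exists in your build — and for JDBC, the same information is available
programmatically from ResultSetMetaData):

set hive.cli.print.header=true;
SELECT * FROM some_table;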

On Tue, Oct 23, 2012 at 1:56 PM, Venugopal Krishnan
 wrote:
> Hi,
>
>
>
> We have a requirement where we need to print the column headers in the
> generated file when executing a query. We are using the JDBC Hive client to
> execute the query.
>
>
>
> Regards,
>
> Venugopal
>
>
>


Re: help in hive

2012-10-22 Thread MiaoMiao
Try

SELECT
client, receive_day, receive_hour as start_time, receive_hour+1 as end_time
FROM some_table
WHERE client='xyz' AND receive_day=7
ORDER BY start_time;
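
If the zero-padded hours in your expected output matter, the computed columns
can be padded as strings too (a sketch; lpad is a standard Hive string UDF, and
the casts keep its first argument a string):

SELECT
client, receive_day,
lpad(cast(receive_hour as string), 2, '0') as start_time,
lpad(cast(receive_hour + 1 as string), 2, '0') as end_time
FROM some_table
WHERE client='xyz' AND receive_day=7
ORDER BY start_time;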

On Mon, Oct 22, 2012 at 4:41 PM, dyuti a  wrote:
> Hi all,
>  I have a hive table with  235 million records.
>
> SAMPLE INPUT:
> receive_year  receive_day  receive_hour  client
> 2012          7            17            xyz
> 2012          7            02            xyz
> 2012          7            17            btn
> 2012          7            04            snm
> 2012          7            05            btn
> 2012          7            02            snm
>
> receive_hour is an int field having values from 01 to 24 (the hour the
> transaction was received, on a 24-hour basis).
>
> EXPECTED OUTPUT:
>
> I need the transactions received on a per-hour basis, i.e., 01 (start_time)
> to 02 (end_time), so the receive_hour column should be displayed as
> start_time and end_time, giving the transactions received in each one-hour
> window for a particular client on that day.
>
> client  receive_day  start_time  end_time
> xyz     7            01          02
> xyz     7            02          03
> xyz     7            03          04
> ... continues up to 23 and 24
>
> and the same for the other clients too.
>
> I am facing a problem displaying the single column's values as two columns in
> hive, i.e., 1-2, 2-3, ... 23-24.
>
> I have written the query to display the other transaction details but got
> stuck on the above scenario.
>
> Please let me know if this is possible in hive; if so, please guide me.
>
>
> Thanks for your help in advance!
>
>
> Regards,
> dti
>
>
>
>
>


Re: Error in semantic analysis: Unable to fetch table

2012-10-18 Thread MiaoMiao
What PHP setup? Are you using exec('hive -S -e "select * from some_table"')
or something like that in PHP?

On Fri, Oct 19, 2012 at 4:46 AM, Keith Wiley  wrote:
> If I run a select command on the command line with -e it works.  If I run the 
> same command inside hive (without -e) it still works...but if I exec it with 
> a -e from php (so obviously the apache user), I get "Error in semantic 
> analysis: Unable to fetch table [tablename]".  Bear in mind that this overall 
> system worked perfectly for weeks.  I haven't modified the code or anything.  
> What did change was I had to reinstall hadoop and hive and I'm sure some 
> configuration parameter is not correct now.  I found some Google references 
> to this error but no clear-cut solutions.
>
> What does this error mean and how do I fix it?
>
> Thanks.
>
> 
> Keith Wiley kwi...@keithwiley.com keithwiley.com
> music.keithwiley.com
>
> "The easy confidence with which I know another man's religion is folly teaches
> me to suspect that my own is also."
>--  Mark Twain
> 
>


Re: why i get Table not found 'dual'

2012-10-17 Thread MiaoMiao
Do you even have a table named 'dual'?
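
Hive has no built-in dual; the usual workaround is to create a one-row table
yourself (a sketch; /tmp/one_line.txt is a stand-in for any local file
containing a single line):

CREATE TABLE dual (dummy STRING);
LOAD DATA LOCAL INPATH '/tmp/one_line.txt' OVERWRITE INTO TABLE dual;
SELECT if(1=2,100,200) FROM dual;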

On Thu, Oct 18, 2012 at 2:24 PM, fellowtree  wrote:
> hi all:
>  some one said hive support sql like  select if(1=2,100,200) from dual;
> however i always get Table not found 'dual' . my hive is 0.9.0 and what is
> the reason?
>
> 
> fellowtree


Re: Hive Query Unable to distribute load evenly in reducers

2012-10-15 Thread MiaoMiao
And your queries were?
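
If the skew comes from a few hot join keys, Hive's skew-join optimization may
help more than raising the reducer count (a sketch; both parameters are
standard Hive settings, and the key threshold here is only an illustration):

set hive.optimize.skewjoin=true;
set hive.skewjoin.key=100000;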

On Mon, Oct 15, 2012 at 8:09 PM, Saurabh Mishra
 wrote:
> Hi,
> I am firing some hive queries joining tables containing up to 30 million
> records each. Since the load on the reducers is very significant in these
> cases, I specifically set the following parameters before executing the
> queries:
>
> set mapred.reduce.tasks=100;
> set hive.exec.reducers.bytes.per.reducer=5;
> set hive.optimize.cp=true;
>
> The number of reducers the job spawns is now 160, but despite the high number
> most of the load remains on 1 or 2 reducers. Hence in the final
> statistics, 158 reducers completed within 2-3 minutes of starting and 2
> reducers took 2 hrs to run.
> Is there any way to overcome this load-distribution disparity?
> Any help in this regard will be highly appreciated.
>
> Sincerely
> Saurabh Mishra


Re: Need Help in Hive storage format

2012-10-12 Thread MiaoMiao
Hi Yogesh:

I think this may help.

https://pig.apache.org/docs/r0.10.0/api/org/apache/pig/builtin/PigStorage.html

Miao
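
One way to sidestep the ^A delimiter question entirely is to have Hive write a
copy with an explicit, Pig-friendly delimiter first (a sketch; nyse_tab is a
hypothetical table name):

CREATE TABLE nyse_tab
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
AS SELECT * FROM NYSE_LOCAL;

PigStorage('\t') can then read the resulting files directly.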

On Fri, Oct 12, 2012 at 5:27 PM, yogesh dhari  wrote:
> Thanks Bejoy,
>
> Now I want to store these rows in Pig.
>
> like
>
> A = load '/Pig/00_0' using PigStorage()
> as
> (id:INT, name:chararray, ddate, prim, ignore, ignorecase, activat);
>
> What should the delimiter in PigStorage() be?
> I have tried PigStorage('/001') but it's showing errors.
> What delimiter should we use?
>
> Please help and Suggest.
>
>
> Thanks & Regards
> Yogesh Kumar
>
>
>
>
>
> 
> Subject: Re: Need Help in Hive storage format
> To: user@hive.apache.org
> From: bejoy...@yahoo.com
> Date: Thu, 11 Oct 2012 17:53:12 +
>
>
> Hi Yogesh.
>
> It should be a simple delimited file with ^A character as the field
> delimiter.
> Regards
> Bejoy KS
>
> Sent from handheld, please excuse typos.
> 
> From: yogesh dhari 
> Date: Thu, 11 Oct 2012 23:18:35 +0530
> To: hive request
> ReplyTo: user@hive.apache.org
> Subject: Need Help in Hive storage format
>
> Hi all,
>
> If we run this query
>
> insert overwrite local directory '/home/yogesh/Downloads/demoyy' select *
> from NYSE_LOCAL;
>
> (describe NYSE_LOCAL gives:
>
> exchange   string
> symbol     string
> ddate      string
> open       float
> high       float
> low        float
> )
>
> ls /home/yogesh/Downloads/demoyy/
>
> it shows the file name 00_0
>
> my Question is:
>
> 1) In which format is the file 00_0?
> 2) What is the delimiter between columns?
>
> Please help
>
> Thanks & Regards
> Yogesh Kumar
>
>


Re: Fw: need help to write a query that does same as BETWEEN operator

2012-10-12 Thread MiaoMiao
Are you familiar with SQL?

I'm sure this manual will help. https://cwiki.apache.org/Hive/
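
Concretely: Hive's JOIN ... ON accepts only equality conditions, but the range
predicate can move into the WHERE clause over a plain join, which Hive treats
as a cross join (a sketch; it filters after forming the full cross product, so
watch your table sizes):

SELECT *
FROM account a JOIN timezone g
WHERE a.create_date BETWEEN g.start_date AND g.end_date;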

On Fri, Oct 12, 2012 at 2:40 PM, Praveenkumar Ch
 wrote:
> Hi, I am new to hive and we have a requirement for converting Teradata
> queries to hive queries. So far I was successful in converting them,
> but now I have a Teradata query that has a BETWEEN operator in it; I don't
> know how to convert it using hive equi-joins. Can someone help me?
>
>
> here is my query:
>
> SELECT *
> FROM account a, timezone g
> WHERE
> a.create_date BETWEEN g.start_date AND g.end_date ;
>


Re: Rolling MAU computation

2012-10-10 Thread MiaoMiao
How about
SELECT day, COUNT(DISTINCT(userid)) FROM logins WHERE day - logins.day
< 30 GROUP BY day;
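
A self-join makes the 30-day window explicit in a single statement (a sketch,
assuming day is in a format datediff understands, such as yyyy-MM-dd; the join
is effectively a cross product, so it can be heavy on a large logins table):

SELECT a.day, COUNT(DISTINCT b.userid)
FROM logins a JOIN logins b
WHERE datediff(a.day, b.day) >= 0
AND datediff(a.day, b.day) < 30
GROUP BY a.day;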

On Thu, Oct 11, 2012 at 6:05 AM, Tom Hubina  wrote:
> I'm trying to compute the number of active users in the previous 30 days for
> each day over a date range. I can't think of any way to do it directly
> within Hive so I'm wondering if you guys have any ideas.
>
> Basically the algorithm is something like:
>
> For each day in date range:
>SELECT day, COUNT(DISTINCT(userid)) FROM logins WHERE day - logins.day <
> 30;
>
> Thanks for your help!
>
> Tom
>


Re: Date Comparisons. in Hive

2012-10-04 Thread MiaoMiao
I suggest you store unix timestamps in hive, so you can compare them
as BIGINT without worrying about STRING comparison.

And if your data is queried on a daily basis, you can split one
big file into small files, say, one file per day, then add them as
partitions of soj_session_container. This can speed Hive up a
little since your queries won't have to read all records in
soj_session_container.

CREATE TABLE soj_session_container (
events MAP
)
PARTITIONED BY (date STRING);
ALTER TABLE soj_session_container ADD PARTITION (date = '20120918')
LOCATION 'loc1';
SELECT * FROM soj_session_container a LATERAL VIEW explode(a.events) t
AS event WHERE date = '20120918' AND event.event_timestamp >=
unix_timestamp('2012-09-18 00:00:00') AND event.event_timestamp <=
unix_timestamp('2012-09-18 02:00:00');
On Thu, Oct 4, 2012 at 8:20 AM, Raihan Jamal  wrote:
> I have this below query from which I am trying to find out those records
> that fall between midnight and 2 A.M on 18th September.
> And the SojTimestampToDate function will give me the date in the format
> yyyy/MM/dd HH:mm:ss.
>
> I am not sure whether the date comparison I did is right or not. And it will
> give me all those records between midnight and 2 AM.
>
> SELECT event.app_payload ['n'] AS changed_cguid
> FROM soj_session_container a LATERAL VIEW explode(a.events) t AS event
> WHERE a.dt = '20120918'
> AND SojTimestampToDate(event.event_timestamp) >= '2012/09/18 00:00:00'
> AND SojTimestampToDate(event.event_timestamp) <= '2012/09/18 02:00:00'
>
> Can anyone shed some light on this whether I am doing right or not?
>
>
>
> Raihan Jamal
>


Re: Hive and RESTFul with RESTEasy (jax-rs)

2012-10-02 Thread MiaoMiao
Don't know any besides JDBC or THRIFT.

On Tue, Oct 2, 2012 at 11:24 PM, Zebeljan, Nebojsa
 wrote:
> Hi,
> I'm very new to Hive and I need to approach how to fire Hive sql queries via
> the RESTEasy framework and to stream back the query result as a JSON string
> to the client.
>
> I wonder, if there is any approach or best practice how I can achieve this
> with Hive and a RESTFul service.
>
> Thanks in advance!
>
> Regards,
> Nebo


Re: Percentile calculation

2012-10-01 Thread MiaoMiao
More info, please.
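
That said, if the exact percentile UDAF is overflowing the reducer heap on
~116 million distinct values, the approximate variant trades exactness for
bounded memory (a sketch; some_table and some_col are stand-ins for your
names):

SELECT percentile_approx(some_col, 0.95) FROM some_table;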

On Mon, Oct 1, 2012 at 4:50 PM, Mayank Bansal
 wrote:
> Hi,
>
>
>
> I am trying to run the hive udf percentile on a
> column with around 116 million unique values.
>
> The maximum space that I can give to the reducer is 12 GB, the job keeps on
> failing due to java heap space error.
>
> Is there a way to optimize this, so that I don’t encounter this error?
>
> Or any other suggestion or solution which could help me out?
>
>
>
> Thanks,
>
> Mayank
>
>


Re: Defining collection items terminated by for a nested data type

2012-09-28 Thread MiaoMiao
Never seen nested collections in Hive, so I'm not sure about
"array<struct<...>>".

In my case, my arrays and maps always contain values of primitive
types, such as string, int, bigint, etc.
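
For what it's worth, as far as I know the delimited SerDe assigns successive
control characters to each nesting level: ^A (\001) between fields, ^B between
collection items, ^C between map keys and values, with deeper levels
continuing down \004, \005, and so on. A one-level sketch with explicit
delimiters, matching the primitive-typed shape I have actually used:

CREATE TABLE nested_demo (
f1 INT,
f2 STRING,
f3 ARRAY<STRING>,
f4 MAP<STRING, INT>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
COLLECTION ITEMS TERMINATED BY ','
MAP KEYS TERMINATED BY '='
STORED AS TEXTFILE;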

On Fri, Sep 28, 2012 at 1:01 PM, Sadananda Hegde  wrote:
> How does "collection items terminated by" work  on a nested structure? Say
> the  table is created with the DDL:
>
> CREATE TABLE table_1(f1 int, f2 string, f3 array<struct<a:..., b:..., c:map<...,...>>>)
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY '|'
> COLLECTION ITEMS TERMINATED BY ','
> MAP KEYS TERMINATED BY '='
> LINES TERMINATED BY '\n'
> STORED AS TEXTFILE;
>
> I guess the comma separator will be used for the items in the outermost
> structure (i.e. the array).  Is that true?
>  1. What would be the separator character between a, b and c (struct
> elements)?
>  2. What would be the separator for map elements?
>  3. Is there a way to explicitly specify those ITEMS separators rather than
> using the default ones like ^B, ^C, etc. (like multiple collection items)?
>
> The original data is in xml format (a complex one with many nested levels)
> and we are planning to parse that xml using a java parser into a delimited
> text file which can be used to load the hive table. My question is:
>  "How should we be representing the f3-like structures in the data
> file?"
>
> The actual file has many more fields with quite a few complex types like f3
> above, but I guess the logic would be the same.
>
> Thanks for your help.
>
> Regards,
> Sadu
>
>
>
>
>
>
>


Re: Problem loading a CSV file

2012-09-28 Thread MiaoMiao
When creating an external table with a location clause, you need to put
your csv into HDFS.
Otherwise, you can load data local as Savant said.
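
Also worth checking: LOCATION is expected to point at a directory, not a
single file, so giving the table the containing directory and putting the csv
inside it usually behaves better (a sketch reusing the question's columns):

CREATE EXTERNAL TABLE table1 (
field1 STRING, field2 STRING, field3 STRING, field4 STRING,
field5 FLOAT, field6 FLOAT, field7 FLOAT, field8 STRING, field9 STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/hduser/dumps/';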

On Fri, Sep 28, 2012 at 1:57 PM, Sarath
 wrote:
> Hi,
>
> I have created a new table using reference to a file on HDFS -
> create external table table1 (field1 STRING, field2 STRING, field3 STRING,
> field3 STRING, field4 STRING, field5 FLOAT, field6 FLOAT, field7 FLOAT,
> field8 STRING, field9 STRING) row format delimited fields terminated by ','
> location '/user/hduser/dumps/table_dump.csv';
>
> The table got created successfully. But when I try retrieving rows from this
> table, it returns me nothing.
> hive> select * from table1;
> OK
> Time taken: 0.156 seconds
>
> I also tried creating the table first and then loading the HDFS file data
> into it -
> hive> create table table1 (field1 STRING, field2 STRING, field3 STRING,
> field3 STRING, field4 STRING, field5 FLOAT, field6 FLOAT, field7 FLOAT,
> field8 STRING, field9 STRING) row format delimited fields terminated by ',';
> OK
> Time taken: 0.088 seconds
>
> But when I try to load data into this table I'm getting below error -
> hive> load data inpath '/user/hduser/dumps/table_dump.csv' overwrite into
> table table1;
> FAILED: Error in semantic analysis: Line 1:17 Invalid path
> ''/user/hduser/dumps/table_dump.csv'': No files matching path
> hdfs://master:54310/user/hduser/dumps/table_dump.csv
>
> What is going wrong? Is there a different way to load a CSV file using hive?
>
> Regards,
> Sarath.


Re: how to load TAB_SEPRATED file in hive table

2012-09-27 Thread MiaoMiao
It is very much about your source file. If you choose the wrong
separators, then hive will not parse your file correctly.
There are four kinds of separators, making it possible to clearly delimit
lines, fields, collections, and key-values.
You can refer to the following uri on this topic.
https://cwiki.apache.org/Hive/languagemanual-ddl.html#LanguageManualDDL-CreateTable
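
Putting that together for the tab-separated case (a sketch; /tmp/records.tsv
is a stand-in path):

CREATE TABLE xyz (name STRING, roll INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

LOAD DATA LOCAL INPATH '/tmp/records.tsv' INTO TABLE xyz;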

On Thu, Sep 27, 2012 at 1:48 PM, yogesh dhari  wrote:
> Thanks Ashok :-),
>
> I am not very aware of storage formats in hive (I am new to
> hive).
>
> Could you please list some of them besides these?
>
> 1) space separated  -->  FIELDS TERMINATED BY " ";
> 2) Control-A separated  --> FIELDS TERMINATED BY '\001'
> 3) Tab separated  -->  FIELDS TERMINATED BY '\t'
>
> Please list some more.
>
>
> Thanks & Regards
> Yogesh Kumar
>
>
>
>
>
>> From: ashok.sa...@wipro.com
>> To: user@hive.apache.org
>> Subject: RE: how to load TAB_SEPRATED file in hive table
>> Date: Thu, 27 Sep 2012 05:20:11 +
>
>>
>> '\t'
>> 
>> From: yogesh dhari [yogeshdh...@live.com]
>> Sent: Thursday, September 27, 2012 10:42 AM
>> To: hive request
>> Subject: how to load TAB_SEPRATED file in hive table
>>
>> Hi all,
>>
>> I have a file in which records are Tab-separated.
>> Please suggest how to upload such a file into a Hive table.
>>
>> Like how to specify
>>
>> Create table XYZ (name STRING, roll INT)
>> ROW FORMAT DELIMITED
>> FIELDS TERMINATED BY ___
>>
>> Please suggest what goes in "___" over here.
>>
>>
>> Thanks & Regards
>> Yogesh Kumar
>>


Re: how to export to a csv file?

2012-09-22 Thread MiaoMiao
I would use this (the /g flag matters, or sed only rewrites the first match
per line):
hive -S -e "SELECT some_field FROM some_table"|sed "s/,/_/g"|sed
"s/\t/,/g">some.csv

On Sun, Sep 23, 2012 at 11:14 AM, zuohua zhang  wrote:
> I tried the following:
>
> CREATE TABLE test
> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
> STORED AS TEXTFILE
> AS
> select
> *
> from
> test_new;
>
> set hive.io.output.fileformat=CSVTextFile;
> INSERT OVERWRITE DIRECTORY '/test'
> select * from test;
>
>
> hadoop dfs -getmerge /test /mnt/test
>
> when I check my /mnt/test, it shows ^A as the delimiter, not ','
>
> How to fix?
>
> Most importantly, some of the fields have ',' in them, so the proper csv
> format would be "day1, day2"; how do I achieve that?
>
> Thanks!


Re: How to run big queries in optimized way ?

2012-09-20 Thread MiaoMiao
Hive implements a format named RCFILE, which could give better
performance, but in my project it just ties with the plain-text
format.

Hive also has an index feature, but it is not so convenient or practical.

I think the best way to optimize is still to reuse the same source
tables, avoid sub-queries, and merge HiveQL statements as much as possible.
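
On the intermediate-compression question below, these session settings are the
usual starting point (a sketch; the codec classes must be available on your
cluster):

set hive.exec.compress.intermediate=true;
set mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
set hive.exec.compress.output=true;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
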
On Fri, Sep 21, 2012 at 10:30 AM, Mapred Learn  wrote:
> Hi,
> We have datasets which are about 10-15 TB in size.
>
> We want to run hive queries on top of this input data.
>
> What are ways to reduce stress on our cluster when running many such big
> queries (including joins) in parallel?
> How do we enable compression etc. for intermediate hive output?
> How do we keep the job cache from growing too large, etc.?
> In short, best practices for huge queries on hive?
>
> Any inputs are really appreciated !
>
> Thanks,
> JJ
>
> Sent from my iPhone


Re: Issue with Inserting/Selecting Data From a ROW FORMAT SERDE table

2012-09-18 Thread MiaoMiao
The URI provided in your log is wrong (it uses taskid where it should be
attemptid); check
http://10.40.35.54:9103/tasklog?attemptid=attempt_201209171421_0029_m_00_0&start=-8193

Shouldn't this misleading URI issue be a bug?
On Tue, Sep 18, 2012 at 11:14 PM, Aniket Daoo  wrote:
> Hi,
>
>
>
> I have a ROW FORMAT SERDE table created using the following DDL.
>
>
>
> CREATE external TABLE multivalset_u6
>
> (
>
> col1 string,
>
> col2 string,
>
> col3 string,
>
> col4 string,
>
> col5 string
>
> )
>
> ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
>
> WITH SERDEPROPERTIES
>
> (
>
> "input.regex" = "(.*)\\t(.*)\\t~([0-9]{6})~(.*)~(.*)",
>
> "output.format.string" = "%1$s %2$s %3$s %4$s %5$s"
>
> )
>
> STORED AS TEXTFILE
>
> LOCATION '/user/admin/u6/parsed/';
>
>
>
> The above table LOCATION contains the file to be read by this table. I need
> the original file to be parsed and stored as a tab delimited file with 5
> columns.
>
>
>
> Sample row from original file:
>
> 02-15-2012-11:34:56  873801356593332362   ~3261961~1~10.0
>
>
>
> Sample row from the expected parsed file:
>
>02-15-2012-11:34:56  873801356593332362   3261961   1  10.0
>
>
>
> To do this, I was trying to create a table with 5 columns at another
> location and insert data from the table multivalset_u6 into it. I
> encountered the following message on the console while doing so.
>
>
>
> Ended Job = job_201209171421_0029 with errors
>
> Error during job, obtaining debugging information...
>
> Examining task ID: task_201209171421_0029_m_02 (and more) from job
> job_201209171421_0029
>
> Exception in thread "Thread-47" java.lang.RuntimeException: Error while
> reading from task log url
>
> at
> org.apache.hadoop.hive.ql.exec.errors.TaskLogProcessor.getErrors(TaskLogProcessor.java:130)
>
> at
> org.apache.hadoop.hive.ql.exec.JobDebugger.showJobFailDebugInfo(JobDebugger.java:211)
>
> at
> org.apache.hadoop.hive.ql.exec.JobDebugger.run(JobDebugger.java:81)
>
> at java.lang.Thread.run(Thread.java:662)
>
> Caused by: java.io.IOException: Server returned HTTP response code: 400 for
> URL:
> http://10.40.35.54:9103/tasklog?taskid=attempt_201209171421_0029_m_00_0&start=-8193
>
> at
> sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1436)
>
> at java.net.URL.openStream(URL.java:1010)
>
> at
> org.apache.hadoop.hive.ql.exec.errors.TaskLogProcessor.getErrors(TaskLogProcessor.java:120)
>
> ... 3 more
>
> Counters:
>
> FAILED: Execution Error, return code 2 from
> org.apache.hadoop.hive.ql.exec.MapRedTask
>
> MapReduce Jobs Launched:
>
> Job 0: Map: 1   HDFS Read: 0 HDFS Write: 0 FAIL
>
> Total MapReduce CPU Time Spent: 0 msec
>
>
>
> I have observed that when I execute a SELECT * FROM multivalset_u6, I get
> the output with all the columns as expected. However, on executing a SELECT
> on individual columns like SELECT col1, col2, col3, col4, col5 FROM
> multivalset_u6, a similar error message appears.
>
>
>
> Am I missing something here? Is there a way to work around this?
>
>
>
> Thanks,
>
> Aniket


Re: Questions about Hive

2012-09-17 Thread MiaoMiao
I believe Hive is not for web users, since it takes several minutes or
even hours to do one query. But I managed to provide a web service via
THRIFT and php.
http://nousefor.net/55/2011/12/php/hbase-and-hive-thrift-php-client/
On Mon, Sep 17, 2012 at 10:42 PM, Hamilton, Robert (Austin)
 wrote:
> Hello, Something :)
>
> Regarding jdbc style: I understand this approach has some limitations, but
> here is an example.
>
> You will need to make sure the hive service is running:
> https://cwiki.apache.org/Hive/hiveserver.html
>
> Here is a sample code that I’ve used for testing. It is not the best java in
> the world but it gets the job done.
>
> You will need to make sure the hive and hadoop jars are on the classpath.
> Note you will have to edit the connectionString.
>
>
>
>
>
> import java.sql.*;
>
> public class RunSQL {
>
>     private static String driverName = "org.apache.hadoop.hive.jdbc.HiveDriver";
>     private static String connectionString = "jdbc:hive://myserver.hp.com:10000/default";
>
>     public static void main(String[] args) throws SQLException,
>             org.apache.hadoop.hive.ql.metadata.HiveException {
>
>         String SQLToRun = args[0];
>         ResultSet res = null;
>
>         try {
>             Class.forName(driverName);
>         } catch (ClassNotFoundException e) {
>             e.printStackTrace();
>             System.exit(1);
>         }
>
>         Connection con = DriverManager.getConnection(connectionString);
>         System.out.println("Connected.");
>
>         Statement stmt = con.createStatement();
>
>         System.out.println("Running: " + SQLToRun);
>         res = stmt.executeQuery(SQLToRun);
>         ResultSetMetaData meta = res.getMetaData();
>         int numberOfColumns = meta.getColumnCount();
>
>         System.out.println("Result:");
>         while (res.next()) {
>             for (int i = 1; i <= numberOfColumns; i++) {
>                 System.out.print("\t" + res.getString(i));
>             }
>             System.out.println();
>         }
>     }
> }
>
>
>
> From: Something Something [mailto:mailinglist...@gmail.com]
> Sent: Monday, September 17, 2012 12:39 AM
>
>
> To: hive-u...@hadoop.apache.org
> Subject: Questions about Hive
>
>
>
> Note:  I am a newbie to Hive.
>
>
>
> Can someone please answer the following questions?
>
> 1)  Does Hive provide APIs (like HBase does) that can be used to retrieve
> data from the tables in Hive from a Java program?  I heard somewhere that
> the data can be accessed with JDBC (style) APIs.  True?
>
> 2)  I don't see how I can add indexes on the tables, so does that mean a
> query such as the following will trigger a MR job that will search files on
> HDFS sequentially?
>
>
>
>
> hive> SELECT a.foo FROM invites a WHERE a.ds='2008-08-15';
>
>
> 3)  Has anyone compared performance of Hive against other NOSQL databases
> such as HBase, MongoDB.  I understand it's not exactly apples to apples
> comparison, but still...
>
> Thanks.


Re: Does hive support nested queries?

2012-09-17 Thread MiaoMiao
Yes, + is an operator.

t1
a b
1 2
3 4

and the result will be:
col
3
7
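
As for the original requirement below: since Hive only allows subqueries in
the FROM clause, the scalar lookup can be rewritten as a join against a
one-row subquery (a sketch; the implicit cross join is cheap here because the
subquery returns a single row):

SELECT a.*
FROM TableA a
JOIN (SELECT timestmp FROM TableB WHERE id = 'hourDim') b
WHERE a.TA_timestamp > b.timestmp;
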
On Mon, Sep 17, 2012 at 4:43 PM, Amila Maha Arachchi
 wrote:
> Hi Bharath,
>
> Thanks for the reply. But I am confused while trying to understand the
> example in the wiki. Example says
>
>
> SELECT col
> FROM (
>   SELECT a+b AS col
>   FROM t1
> ) t2
>
>
> Can you please explain to me what a+b is here. Are they two columns from
> table t1?
>
>
> Can you also tell me whether there is another way to achieve my original
> requirement.
>
>
> Thanks,
>
> Amila.
>
>
> On Mon, Sep 17, 2012 at 1:34 PM, bharath vissapragada
>  wrote:
>>
>> Hive supports subqueries only in the FROM clause as of now.
>>
>> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SubQueries
>>
>>
>> On Mon, Sep 17, 2012 at 1:03 PM, Amila Maha Arachchi
>>  wrote:
>>>
>>> Hi folks,
>>>
>>> I want to provide the output of a select query to a where clause of
>>> another query as shown below.
>>>
>>> select *
>>> from TableA
>>> where TA_timestamp > (select timestmp from TableB where id="hourDim")
>>>
>>> Is this possible in hive? When I try to execute this I get the following
>>> exception.
>>>
>>> If this is not possible, is there another way to achieve this?
>>>
>>> org.apache.hadoop.hive.ql.parse.ParseException: line 1:43 cannot
>>> recognize input near 'select' 'timestmp' 'from' in expression specification
>>>
>>> at
>>> org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:438)
>>> at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:417)
>>> at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:337)
>>> at org.apache.hadoop.hive.ql.Driver.run(Driver.java:889)
>>> at
>>> org.apache.hadoop.hive.service.HiveServer$HiveServerHandler.execute(HiveServer.java:201)
>>> at
>>> org.apache.hadoop.hive.jdbc.HiveStatement.executeQuery(HiveStatement.java:187)
>>> at
>>> org.wso2.carbon.analytics.hive.impl.HiveExecutorServiceImpl$ScriptCallable.call(HiveExecutorServiceImpl.java:323)
>>> at
>>> org.wso2.carbon.analytics.hive.impl.HiveExecutorServiceImpl$ScriptCallable.call(HiveExecutorServiceImpl.java:224)
>>> at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>>> at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>>> at
>>> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>>> at
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>>> at java.lang.Thread.run(Thread.java:662)
>>>
>>> Thanks in advance.
>>>
>>> Amila.
>>
>>
>>
>>
>> --
>> Regards,
>> Bharath .V
>> w:http://researchweb.iiit.ac.in/~bharath.v
>
>


Re: insert into table not working with tables prefixed with database name

2012-09-16 Thread MiaoMiao
Yes, same problem here, sorry I missed your point and tested only once.

On Fri, Sep 14, 2012 at 10:13 PM, Kaufman Ng  wrote:
> Did you run the insert into query twice?  And did it append?
>
> From what I observed the query will complete successfully, but it
> overwrites instead of appending.  That's the problem.
>
>
> On Thu, Sep 13, 2012 at 11:07 PM, MiaoMiao  wrote:
>> This query works on my hive 0.8
>>
>> insert into table tmp.testtest select user,city from source;
>>
>> On Fri, Sep 14, 2012 at 4:36 AM, Kaufman Ng  wrote:
>>> Does anyone know if insert into statement is supposed to work across
>>> databases/schemas?
>>>
>>> For instance if I do this the target table gets appended correctly:
>>>
>>> insert into table target select * from source;
>>>
>>> However, if I have another target table in a different database, the
>>> target table simply gets overwritten instead of appended:
>>>
>>> insert into table my_database.target select * from source;
>>>
>>> To get around this, one needs to switch to the target database first,
>>> then run the insert into statement like this:
>>>
>>> insert into table target select * from default.source;
>>>
>>> Just wanna confirm if this is a bug.  Thx.
>>>
>>> Kaufman


Re: Show job progress when using JDBC to run HIVE query

2012-09-16 Thread MiaoMiao
Not familiar with JDBC, but Thrift doesn't seem to support it.

On Sat, Sep 15, 2012 at 3:17 AM, Haijia Zhou  wrote:
> Hi, All
>  I am writing a Hive client to run a Hive query using the Hive JDBC driver.
>  Since the data amount is huge I really would like to see the progress when
> the query is running.
>  Is there any way I can get the job progress?
> Thanks
> Haijia


Re: How to overwrite a file inside hive table folder

2012-09-16 Thread MiaoMiao
What do you mean by "a file"? If your HDFS dir contains several files
and you want to overwrite just one, then there is no way to do it with HiveQL.
You can check out this link and see if it suits you.

https://cwiki.apache.org/Hive/languagemanual-dml.html#LanguageManualDML-Writingdataintofilesystemfromqueries

My way of exporting a hiveql result as a single file is this:
echo "some,column,name">some.file
hive -S -e "select some, column, name from some_table"|sed -e
"s/\t/,/g" >> some.file
On Sun, Sep 16, 2012 at 12:17 AM, Ramasubramanian
 wrote:
> Hi,
>
> We could not overwrite a file inside a hive table folder. What is the
> command to overwrite it?
>
> Regards,
> Rams


Re: insert into table not working with tables prefixed with database name

2012-09-13 Thread MiaoMiao
This query works on my hive 0.8

insert into table tmp.testtest select user,city from source;

On Fri, Sep 14, 2012 at 4:36 AM, Kaufman Ng  wrote:
> Does anyone know if insert into statement is supposed to work across
> databases/schemas?
>
> For instance if I do this the target table gets appended correctly:
>
> insert into table target select * from source;
>
> However, if I have another target table in a different database, the
> target table simply gets overwritten instead of appended:
>
> insert into table my_database.target select * from source;
>
> To get around this, one needs to switch to the target database first,
> then run the insert into statement like this:
>
> insert into table target select * from default.source;
>
> Just wanna confirm if this is a bug.  Thx.
>
> Kaufman


Re: most efficient way to concatenate 3 tables into one?

2012-09-12 Thread MiaoMiao
Then you will get a copy of each sub table.
My project use external partitions instead, and in this way, you don't
need to cp sub tables.
1) Create a table some_table
2) ALTER TABLE some_table ADD PARTITION (partition_col =
'partition_col_value1') LOCATION '/user/data/subTable1'
ALTER TABLE some_table ADD PARTITION (partition_col =
'partition_col_value2') LOCATION '/user/data/subTable2'

On Thu, Sep 13, 2012 at 4:23 AM, Bejoy KS  wrote:
> Hi
>
> Just follow the steps
> 1) Create an external table with location as /user/aggregatedTable
> 2) Move the contents of 3 monthly tables to this location
> Hadoop fs -cp /user/monthlyTable1/* /user/aggregatedTable
>
> Hadoop fs -cp /user/monthlyTable2/* /user/aggregatedTable
>
> ...
>
> Replace the hdfs dirs in the above commands with those used in your tables.
> Regards
> Bejoy KS
>
> Sent from handheld, please excuse typos.
> 
> From: zuohua zhang 
> Date: Wed, 12 Sep 2012 13:14:20 -0700
> To: ; 
> ReplyTo: user@hive.apache.org
> Subject: Re: most efficient way to concatenate 3 tables into one?
>
> Thanks Bejoy.
> Yes, they have the same schema.
> Can you explain further how to? I am new to hive.
>
>
> On Wed, Sep 12, 2012 at 1:06 PM, Bejoy KS  wrote:
>>
>> Hi
>>
>> If all the 3 tables have the same. Schema, Create an external table and
>> move the data from all the 3 tables to this new table's location. Just a
>> hdfs copy or move is not that expensive.
>> Regards
>> Bejoy KS
>>
>> Sent from handheld, please excuse typos.
>> 
>> From: zuohua zhang 
>> Date: Wed, 12 Sep 2012 13:02:42 -0700
>> To: 
>> ReplyTo: user@hive.apache.org
>> Subject: most efficient way to concatenate 3 tables into one?
>>
>> I have 3 tables, each contains monthly data, I would like to make a table
>> to concatenate all 3 months into 1 table.
>> What is the most efficient way to do that?
>> I am using insert into which takes a long time.
>>
>> Thanks!
>
>


Re: How to get percentage of each group?

2012-09-07 Thread MiaoMiao
Yeah, I figured it out; this query will create 3 jobs, as it seems to do
sub-queries without any optimization.

SELECT
  A.userType
, A.userType_count/B.global_count
FROM
(
SELECT
  userType
, COUNT(1) as userType_count
FROM
some_table
GROUP BY
userType
) A
JOIN
(
SELECT
COUNT(1) as global_count
FROM
some_table
) B
ON (1=1);

But if all userType are enumerable, I can get their ratio in one job,
which seems like a more optimized way.

SELECT
  SUM(IF(userType='A',1,0))/count(1) as A_ratio
, SUM(IF(userType='B',1,0))/count(1) as B_ratio
, SUM(IF(userType='C',1,0))/count(1) as C_ratio
FROM
some_table;

On Fri, Sep 7, 2012 at 2:59 PM, Bejoy KS  wrote:
> Hi
>
> CROSS JOIN is same as giving JOIN keyword. CROSS JOIN just a new notation in
> later releases of hive. JOIN without ON is same as CROSS JOIN
>
> Regards,
> Bejoy KS
>
> 
> From: MiaoMiao 
> To: user@hive.apache.org
> Sent: Friday, September 7, 2012 11:46 AM
> Subject: Re: How to get percentage of each group?
>
> You mean this one? Hive 0.8 doesn't have CROSS JOIN, so I don't know
> if this query works.
> SELECT
>   A.userType
> , A.userType_count/B.global_count
> FROM
> (
> SELECT
>   userType
> , COUNT(1) as userType_count
> FROM
> some_table
> GROUP BY
> userType
> ) A
> CROSS JOIN
> (
> SELECT
> COUNT(1) as global_count
> FROM
> some_table
> ) B;
> On Fri, Sep 7, 2012 at 1:34 PM, Bertrand Dechoux  wrote:
>> Hi,
>>
>> You could use a cross join.
>> You basically have one table
>>
>> select
>>  userType
>>, count(1)
>> from
>>some_table
>> group by
>>userType
>>
>> and a second one
>>
>> select count(1) from some_table
>>
>> With a cross join you can add the global count to every results in the
>> first
>> table and compute a ratio.
>>
>> Regards
>>
>> Bertrand
>>
>>
>> On Fri, Sep 7, 2012 at 7:23 AM, MiaoMiao  wrote:
>>>
>>> I have a table, containing userId and userType.
>>> userId  userType
>>> 1       A
>>> 2       B
>>> 3       C
>>> 4       A
>>> 5       B
>>> 6       B
>>>
>>> I want to get percentage of each userType.
>>> My current solution:
>>> 1. Get count of each group via THRIFT
>>> select
>>>  userType
>>>, count(1)
>>> from
>>>some_table
>>> group by
>>>userType
>>>
>>> 2. Calculate each userType using other programming language like PHP.
>>>
>>> This solution is fine, but I'm just curious, is there a way to do it
>>> in one query?
>>> I know this query works in mysql, but not hive.
>>> select
>>>  userType
>>>, count(1)/(select count(1) from some_table)
>>> from
>>>some_table
>>> group by
>>>userType
>>
>>
>>
>>
>> --
>> Bertrand Dechoux
>
>


Re: How to get percentage of each group?

2012-09-06 Thread MiaoMiao
You mean this one? Hive 0.8 doesn't have CROSS JOIN, so I don't know
if this query works.
SELECT
  A.userType
, A.userType_count/B.global_count
FROM
(
SELECT
  userType
, COUNT(1) as userType_count
FROM
some_table
GROUP BY
userType
) A
CROSS JOIN
(
SELECT
COUNT(1) as global_count
FROM
some_table
) B;
On Fri, Sep 7, 2012 at 1:34 PM, Bertrand Dechoux  wrote:
> Hi,
>
> You could use a cross join.
> You basically have one table
>
> select
>   userType
> , count(1)
> from
> some_table
> group by
> userType
>
> and a second one
>
> select count(1) from some_table
>
> With a cross join you can add the global count to every results in the first
> table and compute a ratio.
>
> Regards
>
> Bertrand
>
>
> On Fri, Sep 7, 2012 at 7:23 AM, MiaoMiao  wrote:
>>
>> I have a table, containing userId and userType.
>> userId  userType
>> 1       A
>> 2       B
>> 3       C
>> 4       A
>> 5       B
>> 6       B
>>
>> I want to get percentage of each userType.
>> My current solution:
>> 1. Get count of each group via THRIFT
>> select
>>   userType
>> , count(1)
>> from
>> some_table
>> group by
>> userType
>>
>> 2. Calculate each userType using other programming language like PHP.
>>
>> This solution is fine, but I'm just curious, is there a way to do it
>> in one query?
>> I know this query works in mysql, but not hive.
>> select
>>   userType
>> , count(1)/(select count(1) from some_table)
>> from
>> some_table
>> group by
>> userType
>
>
>
>
> --
> Bertrand Dechoux


How to get percentage of each group?

2012-09-06 Thread MiaoMiao
I have a table, containing userId and userType.
userId  userType
1       A
2       B
3       C
4       A
5       B
6       B

I want to get percentage of each userType.
My current solution:
1. Get count of each group via THRIFT
select
  userType
, count(1)
from
some_table
group by
userType

2. Calculate each userType using other programming language like PHP.

This solution is fine, but I'm just curious, is there a way to do it
in one query?
I know this query works in mysql, but not hive.
select
  userType
, count(1)/(select count(1) from some_table)
from
some_table
group by
userType


Re: How to set default value for a certain field?

2012-09-05 Thread MiaoMiao
Thank you all, guess I just have to do it this way.

On Thu, Sep 6, 2012 at 11:50 AM, Philip Tromans
 wrote:
> Yep. It's not an ideal solution, but it gets you part of the way there. Hive
> doesn't have a way of specifying default values at schema level.
>
> Phil.
>
> On Sep 5, 2012 11:38 PM, "MiaoMiao"  wrote:
>>
You mean COALESCE(value,3)? This does work on an int field, and is easier
than IF.

select value,COALESCE(value,3) from testtest;
1       1
1       1
2       2
NULL    3
NULL    3
>>
>> On Wed, Sep 5, 2012 at 7:52 PM, Philip Tromans
>>  wrote:
>> > You could do something with the coalesce UDF?
>> >
>> > Phil.
>> >
>> > On Sep 5, 2012 12:24 AM, "MiaoMiao"  wrote:
>> >>
>> >> I have a file whose content is:
>> >> 1,1
>> >> 2,1
>> >> 3,2
>> >> 4,
>> >> 5,
>> >> Then I import in into a hive table.
>> >> create external table testtest (id int,value int) row format delimited
>> >> fields terminated by ',' stored as textfile location '/wtt/test/def';
>> >> select * from testtest;
>> >> 1   1
>> >> 2   1
>> >> 3   2
>> >> 4   NULL
>> >> 5   NULL
>> >>
>> >> I want to set default value for value, but seems hive simply doesn't
>> >> have this feature, any idea?
>> >> 1   1
>> >> 2   1
>> >> 3   2
>> >> 4   3
>> >> 5   3


Re: How to set default value for a certain field?

2012-09-05 Thread MiaoMiao
You mean COALESCE(value,3)? This does work on an int field, and is easier than IF.

select value,COALESCE(value,3) from testtest;
1       1
1       1
2       2
NULL    3
NULL    3

On Wed, Sep 5, 2012 at 7:52 PM, Philip Tromans
 wrote:
> You could do something with the coalesce UDF?
>
> Phil.
>
> On Sep 5, 2012 12:24 AM, "MiaoMiao"  wrote:
>>
>> I have a file whose content is:
>> 1,1
>> 2,1
>> 3,2
>> 4,
>> 5,
>> Then I import in into a hive table.
>> create external table testtest (id int,value int) row format delimited
>> fields terminated by ',' stored as textfile location '/wtt/test/def';
>> select * from testtest;
>> 1   1
>> 2   1
>> 3   2
>> 4   NULL
>> 5   NULL
>>
>> I want to set default value for value, but seems hive simply doesn't
>> have this feature, any idea?
>> 1   1
>> 2   1
>> 3   2
>> 4   3
>> 5   3


Re: How to set default value for a certain field?

2012-09-05 Thread MiaoMiao
COALESCE how?
COALESCE(VALUE) will return VALUE or null.

I know this query works
select IF(VALUE is NULL,3,VALUE) from testtest;

But I have to do this conditional check on every possible field, every
time I perform queries. I mean, does Hive have something like
CREATE TABLE table (field INT NOT NULL) ?
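
One way to avoid repeating the check in every query is to bake the default
into a view once and query the view instead (a sketch; views have been
available since Hive 0.6):

CREATE VIEW testtest_with_default AS
SELECT id, COALESCE(value, 3) AS value
FROM testtest;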

On Wed, Sep 5, 2012 at 7:52 PM, Philip Tromans
 wrote:
> You could do something with the coalesce UDF?
>
> Phil.
>
> On Sep 5, 2012 12:24 AM, "MiaoMiao"  wrote:
>>
>> I have a file whose content is:
>> 1,1
>> 2,1
>> 3,2
>> 4,
>> 5,
>> Then I import in into a hive table.
>> create external table testtest (id int,value int) row format delimited
>> fields terminated by ',' stored as textfile location '/wtt/test/def';
>> select * from testtest;
>> 1   1
>> 2   1
>> 3   2
>> 4   NULL
>> 5   NULL
>>
>> I want to set default value for value, but seems hive simply doesn't
>> have this feature, any idea?
>> 1   1
>> 2   1
>> 3   2
>> 4   3
>> 5   3


Re: How to set default value for a certain field?

2012-09-05 Thread MiaoMiao
This will certainly work, but with 20 tables, each 20GB in size, using
insert overwrite could cost a lot of both time and space.

On Wed, Sep 5, 2012 at 7:57 PM,   wrote:
> Well, you could create the table and then insert overwrite from the file:
>
>
>
> INSERT OVERWRITE TABLE testtest
> SELECT id, CASE WHEN value IS NULL THEN 3 ELSE value END AS value
> FROM testtest;
>
>
>
> Carla
>
>
>
> From: ext Philip Tromans [mailto:philip.j.trom...@gmail.com]
> Sent: Wednesday, September 05, 2012 07:52
> To: user@hive.apache.org
> Subject: Re: How to set default value for a certain field?
>
>
>
> You could do something with the coalesce UDF?
>
> Phil.
>
> On Sep 5, 2012 12:24 AM, "MiaoMiao"  wrote:
>
> I have a file whose content is:
> 1,1
> 2,1
> 3,2
> 4,
> 5,
> Then I import in into a hive table.
> create external table testtest (id int,value int) row format delimited
> fields terminated by ',' stored as textfile location '/wtt/test/def';
> select * from testtest;
> 1   1
> 2   1
> 3   2
> 4   NULL
> 5   NULL
>
> I want to set default value for value, but seems hive simply doesn't
> have this feature, any idea?
> 1   1
> 2   1
> 3   2
> 4   3
> 5   3


Re: Improving query performance on hive and hdfs

2012-09-04 Thread MiaoMiao
You store 90 million records in a DB? What kind?

Sure there are some optimizations to speed up hive query, but I don't
see a universal one, except adding more servers.

On Wed, Sep 5, 2012 at 2:19 PM, iwannaplay games
 wrote:
> Hi all,
>
> I ran a query on hive on top of 90 million records that took 12 minutes to
> execute, and the same query on sql server took 8 minutes. My question is how
> can I make hadoop's performance better. What configurations will improve the
> latency?
>
> Thanks & Regards
> Prabhjot


Re: loading logfile into hive tables using certain format

2012-09-04 Thread MiaoMiao
Awk is good, except that it is not distributed. Some may use it with
Hadoop streaming, but I haven't given it a try yet.

Pig has some advanced features; its field-based sql-like language
(Pig Latin) is more flexible than plain text processing.

Anyway, glad I could help.

On Wed, Sep 5, 2012 at 9:33 AM, Elaine Gan  wrote:
> Hi Miao Miao,
>
> Thanks for the response and solution idea.
> I am not familiar with Pig (as I am still a beginner on hadoop & hive),
> will check it out.
> The simplest way that comes to my mind now is to awk the logs and
> create a csv file with the input values I want before I load it into my
> hive table.
> My hive table would look like the following, a simple one without using
> SERDEPROPERTIES.
> CREATE external TABLE logtable (
> host STRING,
> user STRING,
> request STRING
> )
> ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
> STORED AS TEXTFILE;
>
> Regards.
>
>> I tried importing apache2 logs into hive a few weeks ago, and took a look
>> at SERDEPROPERTIES, but it was too complicated and pasting others'
>> demos wouldn't work.
>>
>> Then I came up with another solution : apache2 log -> Apache Pig (for
>> ETL) -> Hive external table. But I ran into a problem of Pig ( which
>> was later solved with Cheolsoo Park's help), so I finally picked up
>> NodeJS.
>>
>> Node is actually quite great; some advanced features of JavaScript
>> make ETL a lot easier and more flexible, but it can't benefit from Hadoop.
>>
>> So, to sum up, I suggest you do some ETL on logs before importing them
>> into hive.
>>
>> 1. ETL with Apache Pig:
>> --- Pig script
>> read = LOAD '/home/test/input/apacheLog'
>> USING PigStorage(' ')
>> --- space as separator
>> AS (
>>   ip:CHARARRAY
>> , indentity:CHARARRAY
>> , name:CHARARRAY
>> , date:CHARARRAY
>> , timezone:CHARARRAY
>> , method:CHARARRAY
>> , path:CHARARRAY
>> , protocol:CHARARRAY
>> , status:CHARARRAY
>> , size:CHARARRAY
>> );
>> data = FOREACH read GENERATE
>>   ip
>> , REPLACE(date,'\\[','')
>> --- Here be careful with [; it should be escaped because it will cause
>> --- a regex warning where Pig will throw away the whole field.
>> --- (Not documented anywhere) REF: http://goo.gl/g1x1q
>> , REPLACE(timezone,']','')
>> , REPLACE(method,'"','')
>> , path
>> , REPLACE(protocol,'"','')
>> , status
>> , size;
>> STORE data INTO '/home/test/output/apacheLog' USING PigStorage(' ');
>> 2. Import into Hive as external tables or external partitions.
>>
>> On Tue, Sep 4, 2012 at 1:08 PM, Elaine Gan  wrote:
>> > Hi,
>> >
>> > I would like to seek help on  loading logfiles to hive tables.
>> >
>> > I learnt from the "Getting Started" page that we could create hive
>> > tables as follow to import apachelog into it.
>> > --
>> > CREATE TABLE apachelog (
>> >   host STRING,
>> >   identity STRING,
>> >   user STRING,
>> >   time STRING,
>> >   request STRING,
>> >   status STRING,
>> >   size STRING,
>> >   referer STRING,
>> >   agent STRING)
>> > ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
>> > WITH SERDEPROPERTIES (
>> >   "input.regex" = "([^]*) ([^]*) ([^]*) (-|\\[^\\]*\\]) ([^ 
>> > \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\".*\") ([^ 
>> > \"]*|\".*\"))?",
>> >   "output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s"
>> > )
>> > STORED AS TEXTFILE;
>> > --
>> >
>> > I was trying to do the same thing, but changing the value of my 
>> > output.form.string,
>> > let's say i only need, host, user, request.
>> >
>> > CREATE TABLE apachelog (
>> >   host STRING,
>> >   user STRING,
>> >   request STRING)
>> > ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
>> > WITH SERDEPROPERTIES (
>> >   "input.regex" = "([^]*) ([^]*) ([^]*) (-|\\[^\\]*\\]) ([^ 
>> > \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\".*\") ([^ 
>> > \"]*|\".*\"))?",
>> >   "output.format.string" = "%1$s %3$s %5$s"
>> > )
>> > STORED AS TEXTFILE;
>> >
>> > My questions are :
>> > (1) I specified only %1, %3 %5 variables to be input into my table
>> > column, but looks like hive load the first 3 variables into it (%1 %2
>> > %3)
>> > Is there no way that hive could only load the columns i want?
>> >
>> > (2) How can i skip lines which does not fit input.regex pattern match?
>> >
>> > Thank you.
>> >
>> > lai
>> >
>> >
>


Re: loading logfile into hive tables using certain format

2012-09-04 Thread MiaoMiao
I tried importing apache2 logs into hive a few weeks ago, and took a look
at SERDEPROPERTIES, but it was too complicated and pasting others'
demos wouldn't work.

Then I came up with another solution : apache2 log -> Apache Pig (for
ETL) -> Hive external table. But I ran into a problem of Pig ( which
was later solved with Cheolsoo Park's help), so I finally picked up
NodeJS.

Node is actually quite great; some advanced features of JavaScript
make ETL a lot easier and more flexible, but it can't benefit from Hadoop.

So, to sum up, I suggest you do some ETL on logs before importing them
into hive.

1. ETL with Apache Pig:
--- Pig script
read = LOAD '/home/test/input/apacheLog'
USING PigStorage(' ')
--- space as separator
AS (
  ip:CHARARRAY
, indentity:CHARARRAY
, name:CHARARRAY
, date:CHARARRAY
, timezone:CHARARRAY
, method:CHARARRAY
, path:CHARARRAY
, protocol:CHARARRAY
, status:CHARARRAY
, size:CHARARRAY
);
data = FOREACH read GENERATE
  ip
, REPLACE(date,'\\[','')
--- Here be careful with [; it should be escaped because it will cause
--- a regex warning where Pig will throw away the whole field.
--- (Not documented anywhere) REF: http://goo.gl/g1x1q
, REPLACE(timezone,']','')
, REPLACE(method,'"','')
, path
, REPLACE(protocol,'"','')
, status
, size;
STORE data INTO '/home/test/output/apacheLog' USING PigStorage(' ');
2. Import into Hive as external tables or external partitions.

On Tue, Sep 4, 2012 at 1:08 PM, Elaine Gan  wrote:
> Hi,
>
> I would like to seek help on  loading logfiles to hive tables.
>
> I learnt from the "Getting Started" page that we could create hive
> tables as follow to import apachelog into it.
> --
> CREATE TABLE apachelog (
>   host STRING,
>   identity STRING,
>   user STRING,
>   time STRING,
>   request STRING,
>   status STRING,
>   size STRING,
>   referer STRING,
>   agent STRING)
> ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
> WITH SERDEPROPERTIES (
>   "input.regex" = "([^]*) ([^]*) ([^]*) (-|\\[^\\]*\\]) ([^ \"]*|\"[^\"]*\") 
> (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\".*\") ([^ \"]*|\".*\"))?",
>   "output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s"
> )
> STORED AS TEXTFILE;
> --
>
> I was trying to do the same thing, but changing the value of my 
> output.form.string,
> let's say i only need, host, user, request.
>
> CREATE TABLE apachelog (
>   host STRING,
>   user STRING,
>   request STRING)
> ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
> WITH SERDEPROPERTIES (
>   "input.regex" = "([^]*) ([^]*) ([^]*) (-|\\[^\\]*\\]) ([^ \"]*|\"[^\"]*\") 
> (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\".*\") ([^ \"]*|\".*\"))?",
>   "output.format.string" = "%1$s %3$s %5$s"
> )
> STORED AS TEXTFILE;
>
> My questions are :
> (1) I specified only %1, %3 %5 variables to be input into my table
> column, but looks like hive load the first 3 variables into it (%1 %2
> %3)
> Is there no way that hive could only load the columns i want?
>
> (2) How can i skip lines which does not fit input.regex pattern match?
>
> Thank you.
>
> lai
>
>


Re: Group by clause gives error

2012-08-10 Thread MiaoMiao
You need to apply an aggregate function to NAME since it's not used as
a GROUP BY key.

Without an aggregate function, your query is ambiguous. I mean, what would
`select name from sometable group by date;`
mean anyway?

Try

select date, count(type), count(name) from demotable group by date order by date;


On Fri, Aug 10, 2012 at 2:53 PM,   wrote:
> Hi all,
>
> I have a dummy table having values
>
>
> date   name   type
> ---
> 01     x      a
> 01     y      a
> 01     z      b
> 02     u      c
> 02     x      d
> 03     t      a
>
>
> I want to perform an operation that
>
> counts the names and types for the same date, ordered by date.
>
> I wrote this query
>
> select date,  count(*), name from demotable group by date;
>
> its throws error:
>
> Error in semantic analysis:  Expression not in GROUP BY key 'name'
> (Although this query works in Mysql )
>
>
> Please suggest
>
> Thanks & Regards
> Yogesh Kumar
>
>
>
>
>