Re: how to make async call to hive
Hi Gary, HiveServer2 has recently added an API to support asynchronous execution: https://github.com/apache/hive/blob/trunk/service/if/TCLIService.thrift#L604 You will have to create an instance of the Thrift HiveServer2 client and, while creating the request object for ExecuteStatement, set runAsync to true. Thanks, --Vaibhav On Sun, Sep 29, 2013 at 9:23 PM, Gary Zhao garyz...@gmail.com wrote: I'm using node.js, which is async. On Sun, Sep 29, 2013 at 5:32 PM, Brad Ruderman bruder...@radiumone.com wrote: Typically it would be your application that runs the query off the main thread. Hue (Beeswax specifically) does this, and you can see the code here: https://github.com/cloudera/hue/tree/master/apps/beeswax Thx On Sun, Sep 29, 2013 at 5:15 PM, kentkong_work kentkong_w...@163.com wrote: hi all, just wondering if there is an official solution for async calls to Hive? Hive queries run for a long time, and my application can't block until they return. Kent
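For readers working in Java rather than node.js, here is a minimal sketch of the flow Vaibhav describes, written against the generated Thrift bindings. It assumes HiveServer2 is configured for a plain NOSASL transport; the host, port, and statement are placeholders:

import org.apache.hive.service.cli.thrift.*;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransport;

public class AsyncHiveSketch {
    public static void main(String[] args) throws Exception {
        // Raw Thrift connection to HiveServer2 (NOSASL transport assumed)
        TTransport transport = new TSocket("localhost", 10000);
        transport.open();
        TCLIService.Client client = new TCLIService.Client(new TBinaryProtocol(transport));

        TOpenSessionResp session = client.OpenSession(new TOpenSessionReq());

        // runAsync=true makes ExecuteStatement return immediately with an operation handle
        TExecuteStatementReq exec = new TExecuteStatementReq(session.getSessionHandle(),
                "SELECT COUNT(*) FROM some_table");
        exec.setRunAsync(true);
        TExecuteStatementResp resp = client.ExecuteStatement(exec);

        // Poll the operation state instead of blocking on the call
        TGetOperationStatusReq statusReq = new TGetOperationStatusReq(resp.getOperationHandle());
        TOperationState state;
        do {
            Thread.sleep(1000);
            state = client.GetOperationStatus(statusReq).getOperationState();
        } while (state == TOperationState.INITIALIZED_STATE || state == TOperationState.RUNNING_STATE);

        // Once state is FINISHED_STATE, fetch rows with FetchResults
        transport.close();
    }
}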
issue about remote hive client
Hi all, I run the Hive client on a separate box, but every job submitted from that client runs as a local job. Why? When I try the same thing from the box where HiveServer2 is running, the job is submitted as a distributed job.
Load Timestamp data type from local file
Hello, For unit testing I would like to load, from a local file, data that has several columns, one of which is a Timestamp. The command I use is LOAD DATA LOCAL INPATH... . Unfortunately, with that column present I cannot load the dataset at all. There are no errors in the log of my local Apache Hive server; everything looks OK. And officially the Timestamp data type is supported. For completeness, I'm using Hive version 0.10.0, and I report both the script that creates the table and the dataset: - hive DROP TABLE momis_test_a_3 hive CREATE TABLE momis_test_a_3 (col1 STRING, col2 DOUBLE, col3 FLOAT, col4 TIMESTAMP, col5 BOOLEAN) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE hive LOAD DATA LOCAL INPATH '/home/nophiq/Programmi/Eclipse-Indigo-Momis/workspace/datariver/datariver-querymanager/test/sources/hive/dataset3' OVERWRITE INTO TABLE momis_test_a_3 - testo1,100.00,201.00,2013-01-01 04:00:00.123,true testo2,300.00,401.00,2013-01-02 04:00:00.123,false testo3,500.00,601.00,2013-01-03 04:00:00.123,false Finally, here is the log from the local server: Copying data from file:/home/nophiq/Programmi/Eclipse-Indigo-Momis/workspace/datariver/datariver-querymanager/test/sources/hive/dataset3 Copying file: file:/home/nophiq/Programmi/Eclipse-Indigo-Momis/workspace/datariver/datariver-querymanager/test/sources/hive/dataset3 Loading data to table default.momis_test_a_3 Deleted file:/home/nophiq/Programmi/hive-0.10.0/warehouse/momis_test_a_3 Table default.momis_test_a_3 stats: [num_partitions: 0, num_files: 1, num_rows: 0, total_size: 182, raw_data_size: 0] OK OK How can I load the Timestamp data type from a local file? I don't want to create an external table. Any suggestions? Thanks Claudio Reggiani
Re: Load Timestamp data type from local file
Sorry, but I could not understand the issue you are facing. After you loaded the data, did you run a select on the timestamp column? What error did you get? What data did you get? The default timestamp format is yyyy-MM-dd HH:mm:ss, and your sample data seems to match it. Can you show us the error, or what you expect to see as the query output? -- Nitin Pawar
Re: Load Timestamp data type from local file
Thanks Nitin for the reply. If I run the query SELECT * FROM momis_test_a_3, I get an empty result set with no errors, whereas I would expect all the rows. My best guess is that the timestamp column is preventing the whole dataset from being loaded, but since I don't get any errors (of any kind), I don't know where to start looking. Claudio
Re: Load Timestamp data type from local file
Hi Claudio, When you do a SELECT * FROM table, there is no MapReduce involved: Hive reads your files through the HDFS API and displays the data as tab-separated columns. If the data were populated wrongly, Hive would show the entire row in the first column and the remaining columns as NULL. Since you are seeing no data at all, I suspect the data file was somehow deleted from your table. I would recommend the following: 1) create the table with a LOCATION clause, 2) load data into the table and check the directory for the file, 3) if the file is present, run the SELECT * query. Alternatively, check your current table directory, /home/nophiq/Programmi/hive-0.10.0/warehouse/momis_test_a_3: if there are any files in it, you can run hadoop dfs -cat on them and see whether they show your content. If they do, we will need to see why Hive is not able to read the file. -- Nitin Pawar
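If it helps, the same check can be scripted with the Hadoop FileSystem API instead of the shell. A small sketch, assuming the local-filesystem warehouse path from the log above:

import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class CheckTableFiles {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path tableDir = new Path("file:///home/nophiq/Programmi/hive-0.10.0/warehouse/momis_test_a_3");
        FileSystem fs = tableDir.getFileSystem(conf);
        // List whatever files the LOAD left behind and dump each one to stdout
        for (FileStatus status : fs.listStatus(tableDir)) {
            System.out.println(status.getPath() + " (" + status.getLen() + " bytes)");
            try (InputStream in = fs.open(status.getPath())) {
                IOUtils.copyBytes(in, System.out, conf, false);
            }
        }
    }
}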
Re: unable to create a table in hive
Can you share your CREATE TABLE DDL and the hive warehouse directory setting from hive-site.xml? On Mon, Sep 30, 2013 at 4:57 PM, Manickam P manicka...@outlook.com wrote: Guys, when I try to create a new table in Hive I get the error below. FAILED: Error in metadata: MetaException(message:Got exception: java.io.FileNotFoundException /user) FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask I've created directories in HDFS like /home/storate/tmp and /home/storage/user/hive/warehouse and given permissions, but Hive is not picking them up. I'm running an HDFS federated cluster with 2 name nodes. Does anyone have any idea? Thanks, Manickam P -- Nitin Pawar
RE: unable to create a table in hive
Hi, I have given below the script I used. I've not used any hive-site.xml here. CREATE TABLE TABLE_A (EMPLOYEE_ID INT, EMPLOYEE_NAME STRING, EMPLOYEE_LOCATION STRING, EMPLOYEE_DEPT STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE; Thanks, Manickam P
Re: unable to create a table in hive
hive-site.xml will be placed under your Hive conf directory. Anyway, try adding a LOCATION clause to your DDL, like below: CREATE TABLE TABLE_A (EMPLOYEE_ID INT, EMPLOYEE_NAME STRING, EMPLOYEE_LOCATION STRING, EMPLOYEE_DEPT STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE LOCATION '/home/storage/user/hive/warehouse/TABLE_A' -- Nitin Pawar
RE: unable to create a table in hive
Thanks man. I added hive-site.xml and it worked. Thanks, Manickam P
RE: Hive Query via Hue, Only column headers in downloaded CSV or XLS results, sometimes
Hmm... no replies on this one? Does no one use Hue? :-) That would be interesting to know... if not Hue, how are others exposing Hive to end users without giving them a direct login to a node on the cluster? --- Mark E. Sunderlin Data Architect | AOL NETWORKS BDM P: 703-265-6935 | C: 540-327-6222 | AIM: MESunderlin 22000 AOL Way, Dulles, VA 20166 -Original Message- From: Sunderlin, Mark [mailto:mark.sunder...@teamaol.com] Sent: Wednesday, September 18, 2013 2:08 PM To: user@hive.apache.org Subject: Hive Query via Hue, Only column headers in downloaded CSV or XLS results, sometimes Using Hive V11, via Hue from CDH4, I can run my query, output 10 rows (limit 10), and download to a nice CSV or XLS file ... sometimes. :-( Sometimes, even when the run is error free, the download contains only the column headers. This is true for both the CSV and XLS options. It is only ten lines of output, so it cannot be a number-of-rows issue. Is there a limit to the width of the data you can download? A limit on the number of columns? Has anyone seen this before? Does anyone know a fix or a workaround? --- Mark E. Sunderlin Data Architect | AOL NETWORKS BDM P: 703-265-6935 | C: 540-327-6222 | AIM: MESunderlin 22000 AOL Way, Dulles, VA 20166
Error - loading data into tables
Hi, I'm getting the error below while loading data into a Hive table: return code 1 from org.apache.hadoop.hive.ql.exec.MoveTask I used this query to load the table: LOAD DATA INPATH '/home/storage/mount1/tabled.txt' INTO TABLE TEST; Thanks, Manickam P
Not able to execute this query
When executing the query to create a table in Hive, I am getting this error: 'NoneType' object has no attribute 'columns' The table script is below: create external table test1( - ) PARTITIONED BY (col1 timestamp, col2 timestamp) CLUSTERED BY(col1) SORTED BY(col1 ASC) into 40 buckets ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE LOCATION '/user/test/' Thanks, Shouvanik
Re: Not able to execute this query
You are trying to bucket and partition on the same column? I could create the Hive table once I changed the bucketing column to a non-partition column. -- Nitin Pawar
RE: Hive Query via Hue, Only column headers in downloaded CSV or XLS results, sometimes
Hi Mark - we hit this issue as well. We use Hue as the Hive front-end for our users, and this is a pretty big roadblock for them. We're on Hue 2.2 and Hive 11. If you figure out a fix, let me know :)
RE: Not able to execute this query
Hi, Have you used the HUE web console? Actually, I have not used the same columns. But when I run the query, I get that error! Please help? Thanks, Shouvanik
Re: Not able to execute this query
I am really not sure what your entire query is, but the one below works. If possible, share your entire DDL, and mask or hide columns if there is something you cannot share: create table test1( col3 int, col4 string) PARTITIONED BY (col1 timestamp, col2 timestamp) CLUSTERED BY(col3) SORTED BY(col3 ASC) into 40 buckets ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE; -- Nitin Pawar
RE: Not able to execute this query
Hi Nitin, Thanks. That answers my previous query. But if I add a LOCATION '/user/hue/' clause, I get a big fat exception in Beeswax. Thanks, Shouvanik
Converting from textfile to sequencefile using Hive
Hi, I have a lot of tweets saved as text. I created an external table on top of it to access it as textfile. I need to convert these to sequencefiles with each tweet as its own record. To do this, I created another table as a sequencefile table like so - CREATE EXTERNAL TABLE tweetseq( tweet STRING ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\054' STORED AS SEQUENCEFILE LOCATION '/user/hdfs/tweetseq' Now when I insert into this table from my original tweets table, each line gets its own record as expected. This is great. However, I don't have any record ids here. Short of writing my own UDF to make that happen, are there any obvious solutions I am missing here? PS, I need the ids to be there because mahout seq2sparse expects that. Without ids, it fails with - java.lang.ClassCastException: org.apache.hadoop.io.BytesWritable cannot be cast to org.apache.hadoop.io.Text at org.apache.mahout.vectorizer.document.SequenceFileTokenizerMapper.map(SequenceFileTokenizerMapper.java:37) at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:140) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:672) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330) at org.apache.hadoop.mapred.Child$4.run(Child.java:268) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408) at org.apache.hadoop.mapred.Child.main(Child.java:262) Regards, S
Re: Not able to execute this query
Does the hue user have permissions to access '/user/hue'? Does that directory exist? -- Nitin Pawar
Re: Error - loading data into tables
Is this /home/storage/... an HDFS directory? I think it is a normal filesystem directory. Try running this instead: LOAD DATA LOCAL INPATH '/home/storage/mount1/tabled.txt' INTO TABLE TEST; -- Nitin Pawar
RE: how to treat an existing partition data file as a table?
You need to specify a table partition from which you want to sample. Olga From: Yang [mailto:tedd...@gmail.com] Sent: Sunday, September 29, 2013 1:39 PM To: hive-u...@hadoop.apache.org Subject: how to treat an existing partition data file as a table? we have a huge table, including browsing data for the past 5 years, let's say. now I want to take a few samples to play around with. so I did select * from mytable limit 10; but it actually went full out and tried to scan the entire table. is there a way to create a view pointing to only one of the data files used by the original table mytable? this way the total number of files to be scanned is much smaller. thanks! yang
RE: Not able to execute this query
Thanks Nitin. I am able to create the table in HUE now. You are right: there was no such directory, and hence no permissions. Thanks, Shouvanik
Doing MSCK throws error
Hi, On executing MSCK REPAIR TABLE table1, I get the error below: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask What could the cause be? Thanks, Shouvanik
RE: Doing MSCK throws error
The script for this table is: add jar json-serde-1.1.3-jar-with-dependencies.jar; list jars; CREATE EXTERNAL TABLE IF NOT EXISTS table1 ( instance_type string, category string, session_id string, nonce string, user_id string, properties array<struct<name : string, value : string>>, instance map<string,string>, true_as_of_secs string ) PARTITIONED BY (type string, dth string) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' LOCATION 's3n://com...xx.xxx/tables/pl0/ctg='; MSCK REPAIR TABLE table1; Thanks, Shouvanik
RE: Hive Query via Hue, Only column headers in downloaded CSV or XLS results, sometimes
Mark - is the Hive table you're using for this fairly wide? If so, are you doing a select * from table_name limit 10? We ran some tests this morning on one of the Hive tables that was giving us fits: if we limit the select to ~20 columns and keep the limit on the query, we get the results fairly quickly and are able to export.
Re: Hive Query via Hue, Only column headers in downloaded CSV or XLS results, sometimes
+ hue-user hue-u...@cloudera.org thanks Prasad
Re: Converting from textfile to sequencefile using Hive
Are you using Hive just to convert your text files to sequence files? If that's the case, you may want to look at the purpose Hive was developed for. If you want to do data manipulation or enrichment on a routine basis, without any kind of analytics functions, and you do not want to code a lot of MapReduce jobs, you can take a look at Pig scripts. Basically, what you want to do is generate a UUID for each of your tweets and then feed them to the Mahout algorithms. Sorry if I understood it wrong or it sounds rude.
Re: Converting from textfile to sequencefile using Hive
Hi Nitin, No offense taken, and thank you for your response. Part of this is also trying to find the right tool for the job. I am running queries to determine the cuts of tweets that I want, then doing some modest normalization (through a Python script), and then I want to create sequence files from that. So far Hive seems to be the most convenient way to do this, but I can take a look at Pig too. It looked like STORED AS SEQUENCEFILE gets me 99% of the way there, so I was wondering if there was a way to get those ids in there as well. The last piece is always the stumbler :) Thanks again, S
Re: Converting from textfile to sequencefile using Hive
S, Check out these presentations from Data Science Maryland back in May [1]. 1. working with tweets in Hive: http://www.slideshare.net/JoeyEcheverria/analyzing-twitter-data-with-hadoop-20929978 2. then pulling stuff out of Hive to use with Mahout: http://files.meetup.com/6195792/Working%20With%20Mahout.pdf The Mahout talk didn't have a directly useful outcome (largely because it tried to work with the tweets as individual text documents), but it does go through all the mechanics of exactly what you say you want. The meetup page also has links to video, if the slides don't give enough context. HTH [1]: http://www.meetup.com/Data-Science-MD/events/111081282/ -- Sean
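The ClassCastException in the original post shows mahout seq2sparse receiving a BytesWritable key where it expects a Text document id, which is what Hive's STORED AS SEQUENCEFILE output hands it. One workaround is to write the Text/Text file yourself from a plain-text export of the query; a sketch using the Hadoop API, where the paths, the id scheme, and the class name are all illustrative:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class TweetsToSeq {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path in = new Path(args[0]);   // text output of the Hive query, one tweet per line
        Path out = new Path(args[1]);  // Text/Text sequence file for mahout seq2sparse

        SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf, out, Text.class, Text.class);
        BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(in)));
        String line;
        long id = 0;
        while ((line = reader.readLine()) != null) {
            // Mahout wants a Text key (document id) and a Text value (document body)
            writer.append(new Text("tweet-" + id++), new Text(line));
        }
        reader.close();
        writer.close();
    }
}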
Re: Want query to use more reducers
Hey Keith, It sounds like you should tweak the settings for how Hive handles query execution [1]: 1) Tune the guessed number of reducers based on input size: hive.exec.reducers.bytes.per.reducer, which defaults to 1GB. Based on your description, this is probably still at the default. In this case you should also set a max number of reducers based on your cluster size: hive.exec.reducers.max. I usually set this to the number of reduce slots if there's a decent chance I'll get to saturate the cluster; if not, don't worry about it. 2) Hard-code a number of reducers: mapred.reduce.tasks. Setting this will cause Hive to always use that number. It defaults to -1, which tells Hive to use the input-size heuristic to guess. In either case, you should look at the options to merge small files (search for merge in the configuration property list) to avoid getting lots of little outputs. HTH [1]: https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-QueryExecution -Sean On Mon, Sep 30, 2013 at 11:31 AM, Keith Wiley kwi...@keithwiley.com wrote: I have a query that doesn't use reducers as efficiently as I would hope. If I run it on a large table, it uses more reducers, even saturating the cluster, as I desire. However, on smaller tables it uses as few as a single reducer. While I understand the logic in this (not using multiple reducers until the data size is larger), it is nevertheless inefficient to run a query for thirty minutes leaving the entire cluster vacant when the query could distribute the work evenly and wrap things up in a fraction of the time. The query is shown below (abstracted to its basic form). As you can see, it is a little atypical: it is a nested query (which obviously implies two map-reduce jobs), and it uses a script for the reducer stage that I am trying to speed up. I thought the DISTRIBUTE BY clause would make it use the reducers more evenly, but as I said, that is not the behavior I am seeing. Any ideas how I could improve this situation? Thanks. CREATE TABLE output_table ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' as SELECT * FROM ( FROM ( SELECT * FROM input_table DISTRIBUTE BY input_column_1 SORT BY input_column_1 ASC, input_column_2 ASC, input_column_etc ASC) q SELECT TRANSFORM(*) USING 'python my_reducer_script.py' AS( output_column_1, output_column_2, output_column_etc ) ) s ORDER BY output_column_1; Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com Luminous beings are we, not this crude matter. -- Yoda -- Sean
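If the query is submitted from an application rather than the Hive CLI, the same knobs can be set per-session over JDBC before the heavy statement runs. A sketch under assumed values; the HiveServer2 URL is a placeholder, and the numbers shown are not recommendations:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class ReducerTuningSketch {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver; URL, user, and password are placeholders
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        Connection conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "", "");
        Statement stmt = conn.createStatement();
        // Lower the per-reducer input target so smaller inputs still fan out (128 MB here)
        stmt.execute("set hive.exec.reducers.bytes.per.reducer=134217728");
        // Cap the fan-out at roughly the number of reduce slots in the cluster
        stmt.execute("set hive.exec.reducers.max=32");
        // Queries issued on this connection now pick up the tuned settings
        conn.close();
    }
}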
Re: Converting from textfile to sequencefile using Hive
Thanks Sean, that is exactly what I want.
Re: Want query to use more reducers
Thanks. mapred.reduce.tasks and hive.exec.reducers.max seem to have fixed the problem. It is now saturating the cluster and running the query super fast. Excellent! Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com I do not feel obliged to believe that the same God who has endowed us with sense, reason, and intellect has intended us to forgo their use. -- Galileo Galilei
Re: how to treat an existing partition data file as a table?
Thanks guys. I found that the table is not partitioned, so I guess there is no way out...
Re: UDF error?
It's been ages since I wrote one, but the differences from mine:

a) I use LongWritable:

public LongWritable evaluate(LongWritable startAt) {

b) I have annotations on the class (but I think they are just for docs):

@Description(name = "row_sequence",
    value = "_FUNC_() - Returns a generated row sequence number starting from 1")
@UDFType(deterministic = false)
public class UDFRowSequence extends UDF {

Hope this helps!
Tim

On Mon, Sep 30, 2013 at 10:47 PM, Yang tedd...@gmail.com wrote:

I wrote a super simple UDF, but got some errors.

UDF:

package yy;

import org.apache.hadoop.hive.ql.exec.UDF;
import java.util.Random;
import java.util.UUID;
import java.lang.management.*;

public class MyUdf extends UDF {
    static Random rand = new Random(System.currentTimeMillis() + Thread.currentThread().getId() * 100);
    String name = ManagementFactory.getRuntimeMXBean().getName();
    long startValue = Long.valueOf(name.replaceAll("[^\\d]+", "")) * 1 + Thread.currentThread().getId() * 1000;

    public long evaluate(long x) {
        //return (long) UUID.randomUUID().hashCode();
        //return rand.nextLong();
        return startValue++;
    }
}

sql script:

CREATE TEMPORARY FUNCTION gen_uniq2 AS 'yy.MyUdf';
select gen_uniq2(field1), field2 from yy_mapping limit 10;

field1 is bigint, field2 is int.

error:

hive> source aa.sql;
Added ./MyUdf.jar to class path
Added resource: ./MyUdf.jar
OK
Time taken: 0.0070 seconds
FAILED: SemanticException [Error 10014]: Line 2:7 Wrong arguments 'field1': No matching method for class yy.MyUdf with (bigint). Possible choices: _FUNC_()

So I'm declaring a UDF with an arg of long, which should work for a bigint (more importantly, it's complaining not about long vs. bigint, but about bigint vs. void). I tried changing both to int; same failure.

thanks!
yang
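Putting Tim's two points together, here is a minimal sketch of Yang's class using a Hadoop writable in the evaluate() signature plus the two annotations. This is a sketch, not the linked UDFRowSequence; the multipliers in the seed arithmetic are illustrative, since the original post appears to have lost some digits in transit:

package yy;

import java.lang.management.ManagementFactory;
import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.hive.ql.udf.UDFType;
import org.apache.hadoop.io.LongWritable;

@Description(name = "gen_uniq2",
    value = "_FUNC_(col) - returns a counter seeded from the JVM pid and thread id")
@UDFType(deterministic = false)  // tells Hive not to treat every call as the same constant
public class MyUdf extends UDF {
  // The RuntimeMXBean name is typically "pid@hostname"; strip everything but the digits.
  private static long seed() {
    String name = ManagementFactory.getRuntimeMXBean().getName();
    long pid = Long.parseLong(name.replaceAll("[^\\d]+", ""));
    return pid * 1000000L + Thread.currentThread().getId() * 1000L;  // multipliers are illustrative
  }

  private long nextValue = seed();
  private final LongWritable result = new LongWritable();

  // bigint columns arrive as LongWritable; a matching signature avoids the
  // "No matching method ... with (bigint)" error from the original post.
  public LongWritable evaluate(LongWritable ignored) {
    result.set(nextValue++);
    return result;
  }
}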
Re: UDF error?
That class is: https://code.google.com/p/gbif-occurrencestore/source/browse/trunk/occurrence-store/src/main/java/org/gbif/occurrencestore/hive/udf/UDFRowSequence.java

Cheers,
Tim
Re: UDF error?
thanks! at first I did have a no-arg evaluate(), but somehow "select myfunction(), field1, field2 from mytable;" spits out the same value for myfunction() on every row. so I wondered whether the UDF got called only once: since the hive compiler sees that the function takes no arguments, it may assume all invocations yield the same value. I passed in a param to rule out that possibility.
Re: UDF error?
Here is an example of a no-arg UDF that will return a different value for each row: https://code.google.com/p/gbif-occurrencestore/source/browse/trunk/occurrence-store/src/main/java/org/gbif/occurrencestore/hive/udf/UuidUDF.java

Hope this helps,
Tim
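To illustrate the pattern (this is a sketch in the spirit of the linked class, not a copy of it): a no-arg evaluate() works fine for per-row values as long as the class is marked non-deterministic, which is what stops Hive from evaluating the call once and reusing the result:

package yy;

import java.util.UUID;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.hive.ql.udf.UDFType;
import org.apache.hadoop.io.Text;

@UDFType(deterministic = false)  // without this, Hive may fold the call to a single constant
public class UuidUdf extends UDF {
  private final Text result = new Text();

  public Text evaluate() {
    result.set(UUID.randomUUID().toString());
    return result;
  }
}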
Re: UDF error?
ok I found the reason: I had modified the jar file, and even though I re-ran "add jar ./MyUdf.jar; CREATE TEMPORARY FUNCTION ...;", it doesn't take effect. I have to get out of the hive session and then rerun these again.
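Spelled out as a sketch using the names from this thread, the workaround after rebuilding the jar is a fresh session:

hive> quit;
$ hive
hive> add jar ./MyUdf.jar;
hive> CREATE TEMPORARY FUNCTION gen_uniq2 AS 'yy.MyUdf';
hive> select gen_uniq2(field1), field2 from yy_mapping limit 10;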
Re: Tableau connectivity available on KR
Olga, I'm sure it was not intended for me and a lot of us; hive-u...@hadoop.apache.org made it happen.

From: Olga L. Natkovich ol...@yahoo-inc.com
To: kryptonite-u...@yahoo-inc.com; hive-u...@hadoop.apache.org; ygrid-sandbox-annou...@yahoo-inc.com; ygrid-production-annou...@yahoo-inc.com; ygrid-research-annou...@yahoo-inc.com; hcat-us...@yahoo-inc.com
Sent: Monday, September 30, 2013 2:12 PM
Subject: Tableau connectivity available on KR

Dear Grid Users, the Hadoop Services team is happy to announce that Tableau is now supported on KR. Please come give it a try and provide your feedback. The steps to connect with Tableau are described here: http://twiki.corp.yahoo.com/view/Grid/HiveServer2BITools. In addition, we also provide support for MicroStrategy users. If you want to connect your MS server to KR, please follow the instructions here: https://docs.google.com/a/yahoo-inc.com/document/d/1QzAh19bysE6ooFeCPSZZTFgcVR36stK6v8ZPcq2Yi30.

Olga
RE: Tableau connectivity available on KR
Sorry for the spam. This was meant as an internal Yahoo announcement.

Olga

From: Mohammad Islam [mailto:misla...@yahoo.com]
Sent: Monday, September 30, 2013 3:53 PM
To: user@hive.apache.org
Subject: Re: Tableau connectivity available on KR