RE: Failed to create table in Hive (AlreadyExistsException)

2017-06-23 Thread Markovitz, Dudu
Kaustubh, there is not much to do without you supplying a way to reproduce the 
issue and/or the relevant logs.


Dudu

From: Kaustubh Deshpande [mailto:kaustubh.deshpa...@exadatum.com]
Sent: Friday, June 23, 2017 10:29 AM
To: user@hive.apache.org; dev-subscr...@hive.apache.org
Subject: Failed to create table in Hive (AlreadyExistsException)


Hi,



· I am facing an issue with creating a table in Apache Hive v0.13.0.

· I am executing a Hive script which contains DROP TABLE and CREATE TABLE statements.

· The CREATE TABLE statement is of the form CREATE TABLE db_nm.tble_nm AS SELECT * FROM db_nm.other_tbl.

· db_nm.tble_nm is a managed Hive table.

· ERROR - FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. AlreadyExistsException(message:Table tble_nm already exists)

· I hit this issue only once and it has not reoccurred, but I want to find its root cause.

· According to the logs, the table was dropped successfully, the CREATE TABLE started, the map-reduce job that copies the data ran, and only at the end did it fail with 'AlreadyExistsException'.

· When I ran the script without the DROP TABLE statement, it failed with 'FAILED: SemanticException org.apache.hadoop.hive.ql.parse.SemanticException: Table already exists: db_nm.tble_nm' as expected, without running the map-reduce job that copies the data.

· Please help me find the root cause of this error.
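
For reference, the script pattern described above is roughly the following (table names as in the bullets; whether IF EXISTS is used in the actual script is not stated):

DROP TABLE IF EXISTS db_nm.tble_nm;
CREATE TABLE db_nm.tble_nm AS SELECT * FROM db_nm.other_tbl;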





Thanks,

Kaustubh.





RE: Unable to use "." in column name

2017-05-06 Thread Markovitz, Dudu
Hi Ben

Check 
http://stackoverflow.com/questions/43808435/cannot-use-a-in-a-hive-table-column-name


Dudu

From: Ben Johnson [mailto:b...@timber.io]
Sent: Friday, May 05, 2017 6:25 PM
To: user@hive.apache.org
Subject: Unable to use "." in column name

Hi, I have a fairly basic question. I'm attempting to use a "." in a Hive 
column name. I must be doing something wrong, because the Hive documentation 
says:

"by default column names can be specified within backticks (`) and contain any 
Unicode character (HIVE-6013)"

Here's a better breakdown of what I'm doing and the error message I'm 
receiving.

Thanks for your help!

Ben
Timber.io - Blog - 
Github - Twitter





RE: Is it possible to use LOAD DATA INPATH with a PARTITIONED, STORED AS PARQUET table?

2017-04-04 Thread Markovitz, Dudu
“LOAD” is very misleading here. It is all done at the metadata level.
The data is not being touched. The data is not being verified. The “system” 
does not have any clue whether the files' format matches the table definition and 
whether they can actually be used.
The data files are being “moved” (again, a metadata operation) from their 
current HDFS location to the location defined for the table.
Later on, when you query the table, the files will be scanned. If they are in 
the right format you’ll get results; if not, you won’t.
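
For example, a sketch of what such a LOAD amounts to (the staging path, partition values and warehouse location are illustrative assumptions):

LOAD DATA INPATH '/staging/part-00000.parquet'
INTO TABLE db.mytable PARTITION (`date`='2017-04-04', `content_type`='news');

-- is roughly equivalent to
--   hdfs dfs -mv /staging/part-00000.parquet /user/hive/warehouse/db.db/mytable/date=2017-04-04/content_type=news/
-- plus registering the partition in the metastore; the file content itself is never validated.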

From: Dmitry Goldenberg [mailto:dgoldenb...@hexastax.com]
Sent: Tuesday, April 04, 2017 8:54 PM
To: user@hive.apache.org
Subject: Re: Is it possible to use LOAD DATA INPATH with a PARTITIONED, STORED 
AS PARQUET table?

Thanks, Dudu. I think there's a disconnect here. We're using LOAD INPATH on a 
few tables to achieve the effect of actual insertion of records. Is it not the 
case that the LOAD causes the data to get inserted into Hive?

Based on that I'd like to understand whether we can get away with using LOAD 
INPATH instead of INSERT/SELECT FROM.

On Apr 4, 2017, at 1:43 PM, Markovitz, Dudu 
<dmarkov...@paypal.com<mailto:dmarkov...@paypal.com>> wrote:
I just want to verify that you understand the following:


· LOAD DATA INPATH is just a HDFS file movement operation.

You can achieve the same results by using hdfs dfs -mv …



· LOAD DATA LOCAL  INPATH is just a file copying operation from the 
shell to the HDFS.

You can achieve the same results by using hdfs dfs -put …


From: Dmitry Goldenberg [mailto:dgoldenb...@hexastax.com]
Sent: Tuesday, April 04, 2017 7:48 PM
To: user@hive.apache.org<mailto:user@hive.apache.org>
Subject: Re: Is it possible to use LOAD DATA INPATH with a PARTITIONED, STORED 
AS PARQUET table?

Dudu,

This is still in design stages, so we have a way to get the data from its 
source. The data is *not* in the Parquet format.  It's up to us to format it 
the best and most efficient way.  We can roll with CSV or Parquet; ultimately 
the data must make it into a pre-defined PARQUET, PARTITIONED table in Hive.

Thanks,
- Dmitry

On Tue, Apr 4, 2017 at 12:20 PM, Markovitz, Dudu 
<dmarkov...@paypal.com<mailto:dmarkov...@paypal.com>> wrote:
Are your files already in Parquet format?

From: Dmitry Goldenberg 
[mailto:dgoldenb...@hexastax.com<mailto:dgoldenb...@hexastax.com>]
Sent: Tuesday, April 04, 2017 7:03 PM
To: user@hive.apache.org<mailto:user@hive.apache.org>
Subject: Re: Is it possible to use LOAD DATA INPATH with a PARTITIONED, STORED 
AS PARQUET table?

Thanks, Dudu.

Just to re-iterate; the way I'm reading your response is that yes, we can use 
LOAD INPATH for a PARQUET, PARTITIONED table, provided that the data in the 
delimited file is properly formatted.  Then we can LOAD it into the table 
(mytable in my example) directly and avoid the creation of the temp table 
(origtable in my example).  Correct so far?

I did not quite follow the latter part of your response:
>> You should only create an external table which is an interface to read the 
>> files and use it in an INSERT operation.

My assumption was that we would LOAD INPATH and not have to use INSERT 
altogether. Am I missing something in grokking this latter part of your 
response?

Thanks,
- Dmitry

On Tue, Apr 4, 2017 at 11:26 AM, Markovitz, Dudu 
<dmarkov...@paypal.com<mailto:dmarkov...@paypal.com>> wrote:
Since LOAD DATA INPATH only moves files, the answer is very simple.
If your files are already in a format that matches the destination table 
(storage type, number and types of columns etc.) then – yes, and if not, then – 
no.

But –
You don’t need to load the files into an intermediary table.
You should only create an external table which is an interface to read the 
files and use it in an INSERT operation.

Dudu

From: Dmitry Goldenberg 
[mailto:dgoldenb...@hexastax.com<mailto:dgoldenb...@hexastax.com>]
Sent: Tuesday, April 04, 2017 4:52 PM
To: user@hive.apache.org<mailto:user@hive.apache.org>
Subject: Is it possible to use LOAD DATA INPATH with a PARTITIONED, STORED AS 
PARQUET table?

We have a table such as the following defined:

CREATE TABLE IF NOT EXISTS db.mytable (
  `item_id` string,
  `timestamp` string,
  `item_comments` string)
PARTITIONED BY (`date`, `content_type`)
STORED AS PARQUET;

Currently we insert data into this PARQUET, PARTITIONED table as follows, using 
an intermediary table:

INSERT INTO TABLE db.mytable PARTITION(date, content_type)
SELECT itemid as item_id, itemts as timestamp, date, content_type
FROM db.origtable
WHERE date = “${SELECTED_DATE}”
GROUP BY item_id, date, content_type;
Our question is, would it be possible to use the LOAD DATA INPATH.. INTO TABLE 
syntax to load the data from delimited data files into 'mytable' rather than 
populating mytable from the intermediary table?

I see in the Hive documentation that:
* Load operations are currently pure copy/move operations that move datafiles into locations corresponding to Hive tables.

RE: Is it possible to use LOAD DATA INPATH with a PARTITIONED, STORED AS PARQUET table?

2017-04-04 Thread Markovitz, Dudu
I just want to verify that you understand the following:


· LOAD DATA INPATH is just a HDFS file movement operation.

You can achieve the same results by using hdfs dfs -mv …



· LOAD DATA LOCAL  INPATH is just a file copying operation from the 
shell to the HDFS.

You can achieve the same results by using hdfs dfs -put …


From: Dmitry Goldenberg [mailto:dgoldenb...@hexastax.com]
Sent: Tuesday, April 04, 2017 7:48 PM
To: user@hive.apache.org
Subject: Re: Is it possible to use LOAD DATA INPATH with a PARTITIONED, STORED 
AS PARQUET table?

Dudu,

This is still in design stages, so we have a way to get the data from its 
source. The data is *not* in the Parquet format.  It's up to us to format it 
the best and most efficient way.  We can roll with CSV or Parquet; ultimately 
the data must make it into a pre-defined PARQUET, PARTITIONED table in Hive.

Thanks,
- Dmitry

On Tue, Apr 4, 2017 at 12:20 PM, Markovitz, Dudu 
<dmarkov...@paypal.com<mailto:dmarkov...@paypal.com>> wrote:
Are your files already in Parquet format?

From: Dmitry Goldenberg 
[mailto:dgoldenb...@hexastax.com<mailto:dgoldenb...@hexastax.com>]
Sent: Tuesday, April 04, 2017 7:03 PM
To: user@hive.apache.org<mailto:user@hive.apache.org>
Subject: Re: Is it possible to use LOAD DATA INPATH with a PARTITIONED, STORED 
AS PARQUET table?

Thanks, Dudu.

Just to re-iterate; the way I'm reading your response is that yes, we can use 
LOAD INPATH for a PARQUET, PARTITIONED table, provided that the data in the 
delimited file is properly formatted.  Then we can LOAD it into the table 
(mytable in my example) directly and avoid the creation of the temp table 
(origtable in my example).  Correct so far?

I did not quite follow the latter part of your response:
>> You should only create an external table which is an interface to read the 
>> files and use it in an INSERT operation.

My assumption was that we would LOAD INPATH and not have to use INSERT 
altogether. Am I missing something in grokking this latter part of your 
response?

Thanks,
- Dmitry

On Tue, Apr 4, 2017 at 11:26 AM, Markovitz, Dudu 
<dmarkov...@paypal.com<mailto:dmarkov...@paypal.com>> wrote:
Since LOAD DATA INPATH only moves files, the answer is very simple.
If your files are already in a format that matches the destination table 
(storage type, number and types of columns etc.) then – yes, and if not, then – 
no.

But –
You don’t need to load the files into an intermediary table.
You should only create an external table which is an interface to read the 
files and use it in an INSERT operation.

Dudu

From: Dmitry Goldenberg 
[mailto:dgoldenb...@hexastax.com<mailto:dgoldenb...@hexastax.com>]
Sent: Tuesday, April 04, 2017 4:52 PM
To: user@hive.apache.org<mailto:user@hive.apache.org>
Subject: Is it possible to use LOAD DATA INPATH with a PARTITIONED, STORED AS 
PARQUET table?

We have a table such as the following defined:

CREATE TABLE IF NOT EXISTS db.mytable (
  `item_id` string,
  `timestamp` string,
  `item_comments` string)
PARTITIONED BY (`date`, `content_type`)
STORED AS PARQUET;

Currently we insert data into this PARQUET, PARTITIONED table as follows, using 
an intermediary table:

INSERT INTO TABLE db.mytable PARTITION(date, content_type)
SELECT itemid as item_id, itemts as timestamp, date, content_type
FROM db.origtable
WHERE date = “${SELECTED_DATE}”
GROUP BY item_id, date, content_type;
Our question is, would it be possible to use the LOAD DATA INPATH.. INTO TABLE 
syntax to load the data from delimited data files into 'mytable' rather than 
populating mytable from the intermediary table?

I see in the Hive documentation that:
* Load operations are currently pure copy/move operations that move datafiles 
into locations corresponding to Hive tables.
* If the table is partitioned, then one must specify a specific partition of 
the table by specifying values for all of the partitioning columns.
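
For example, a fully specified (static) partition LOAD of the kind the second bullet describes would look roughly like this (the path and partition values are illustrative):

LOAD DATA INPATH '/staging/2017-04-04_news'
INTO TABLE db.mytable PARTITION (`date`='2017-04-04', `content_type`='news');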

This seems to indicate that using LOAD is possible; however looking at this 
discussion: 
http://grokbase.com/t/hive/user/114frbfg0y/can-i-use-hive-dynamic-partition-while-loading-data-into-tables,
 perhaps not?

We'd like to understand if using LOAD in the case of PARQUET, PARTITIONED 
tables is possible and if so, then how does one go about using LOAD in that 
case?

Thanks,
- Dmitry





RE: Is it possible to use LOAD DATA INPATH with a PARTITIONED, STORED AS PARQUET table?

2017-04-04 Thread Markovitz, Dudu
Are your files already in Parquet format?

From: Dmitry Goldenberg [mailto:dgoldenb...@hexastax.com]
Sent: Tuesday, April 04, 2017 7:03 PM
To: user@hive.apache.org
Subject: Re: Is it possible to use LOAD DATA INPATH with a PARTITIONED, STORED 
AS PARQUET table?

Thanks, Dudu.

Just to re-iterate; the way I'm reading your response is that yes, we can use 
LOAD INPATH for a PARQUET, PARTITIONED table, provided that the data in the 
delimited file is properly formatted.  Then we can LOAD it into the table 
(mytable in my example) directly and avoid the creation of the temp table 
(origtable in my example).  Correct so far?

I did not quite follow the latter part of your response:
>> You should only create an external table which is an interface to read the 
>> files and use it in an INSERT operation.

My assumption was that we would LOAD INPATH and not have to use INSERT 
altogether. Am I missing something in grokking this latter part of your 
response?

Thanks,
- Dmitry

On Tue, Apr 4, 2017 at 11:26 AM, Markovitz, Dudu 
<dmarkov...@paypal.com<mailto:dmarkov...@paypal.com>> wrote:
Since LOAD DATA INPATH only moves files, the answer is very simple.
If your files are already in a format that matches the destination table 
(storage type, number and types of columns etc.) then – yes, and if not, then – 
no.

But –
You don’t need to load the files into an intermediary table.
You should only create an external table which is an interface to read the 
files and use it in an INSERT operation.

Dudu

From: Dmitry Goldenberg 
[mailto:dgoldenb...@hexastax.com<mailto:dgoldenb...@hexastax.com>]
Sent: Tuesday, April 04, 2017 4:52 PM
To: user@hive.apache.org<mailto:user@hive.apache.org>
Subject: Is it possible to use LOAD DATA INPATH with a PARTITIONED, STORED AS 
PARQUET table?

We have a table such as the following defined:

CREATE TABLE IF NOT EXISTS db.mytable (
  `item_id` string,
  `timestamp` string,
  `item_comments` string)
PARTITIONED BY (`date`, `content_type`)
STORED AS PARQUET;

Currently we insert data into this PARQUET, PARTITIONED table as follows, using 
an intermediary table:

INSERT INTO TABLE db.mytable PARTITION(date, content_type)
SELECT itemid as item_id, itemts as timestamp, date, content_type
FROM db.origtable
WHERE date = “${SELECTED_DATE}”
GROUP BY item_id, date, content_type;
Our question is, would it be possible to use the LOAD DATA INPATH.. INTO TABLE 
syntax to load the data from delimited data files into 'mytable' rather than 
populating mytable from the intermediary table?

I see in the Hive documentation that:
* Load operations are currently pure copy/move operations that move datafiles 
into locations corresponding to Hive tables.
* If the table is partitioned, then one must specify a specific partition of 
the table by specifying values for all of the partitioning columns.

This seems to indicate that using LOAD is possible; however looking at this 
discussion: 
http://grokbase.com/t/hive/user/114frbfg0y/can-i-use-hive-dynamic-partition-while-loading-data-into-tables,
 perhaps not?

We'd like to understand if using LOAD in the case of PARQUET, PARTITIONED 
tables is possible and if so, then how does one go about using LOAD in that 
case?

Thanks,
- Dmitry




RE: Is it possible to use LOAD DATA INPATH with a PARTITIONED, STORED AS PARQUET table?

2017-04-04 Thread Markovitz, Dudu
Since LOAD DATA INPATH only moves files, the answer is very simple.
If your files are already in a format that matches the destination table 
(storage type, number and types of columns etc.) then – yes, and if not, then – 
no.

But –
You don’t need to load the files into an intermediary table.
You should only create an external table which is an interface to read the 
files and use it in an INSERT operation.
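
A minimal sketch of that approach, assuming the incoming files are comma-delimited text whose column order matches the definition below (the staging path and the dynamic-partition settings are illustrative):

CREATE EXTERNAL TABLE db.mytable_ext (
  `item_id` string,
  `timestamp` string,
  `item_comments` string,
  `date` string,
  `content_type` string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/staging/mytable_csv';

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT INTO TABLE db.mytable PARTITION (`date`, `content_type`)
SELECT `item_id`, `timestamp`, `item_comments`, `date`, `content_type`
FROM db.mytable_ext;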

Dudu

From: Dmitry Goldenberg [mailto:dgoldenb...@hexastax.com]
Sent: Tuesday, April 04, 2017 4:52 PM
To: user@hive.apache.org
Subject: Is it possible to use LOAD DATA INPATH with a PARTITIONED, STORED AS 
PARQUET table?

We have a table such as the following defined:

CREATE TABLE IF NOT EXISTS db.mytable (
  `item_id` string,
  `timestamp` string,
  `item_comments` string)
PARTITIONED BY (`date`, `content_type`)
STORED AS PARQUET;

Currently we insert data into this PARQUET, PARTITIONED table as follows, using 
an intermediary table:

INSERT INTO TABLE db.mytable PARTITION(date, content_type)
SELECT itemid as item_id, itemts as timestamp, date, content_type
FROM db.origtable
WHERE date = “${SELECTED_DATE}”
GROUP BY item_id, date, content_type;
Our question is, would it be possible to use the LOAD DATA INPATH.. INTO TABLE 
syntax to load the data from delimited data files into 'mytable' rather than 
populating mytable from the intermediary table?

I see in the Hive documentation that:
* Load operations are currently pure copy/move operations that move datafiles 
into locations corresponding to Hive tables.
* If the table is partitioned, then one must specify a specific partition of 
the table by specifying values for all of the partitioning columns.

This seems to indicate that using LOAD is possible; however looking at this 
discussion: 
http://grokbase.com/t/hive/user/114frbfg0y/can-i-use-hive-dynamic-partition-while-loading-data-into-tables,
 perhaps not?

We'd like to understand if using LOAD in the case of PARQUET, PARTITIONED 
tables is possible and if so, then how does one go about using LOAD in that 
case?

Thanks,
- Dmitry



How to reset textinputformat.record.delimiter?

2017-03-16 Thread Markovitz, Dudu
Good morning

If you know how to reset textinputformat.record.delimiter within the hive cli / 
beeline, I'd appreciate it if you could share it.
The full question is posted on Stack Overflow and has an open bounty worth 
+50 reputation, ending tomorrow.

How to reset textinputformat.record.delimiter to its default value within hive cli / beeline?


Thanks

Dudu









RE: Hive table for a single file: CREATE/ALTER TABLE differences

2017-01-25 Thread Markovitz, Dudu
Wow. This is gold.

Dudu

From: Dmitry Tolpeko [mailto:dmtolp...@gmail.com]
Sent: Wednesday, January 25, 2017 6:47 PM
To: user@hive.apache.org
Subject: Hive table for a single file: CREATE/ALTER TABLE differences

I accidentally noticed one feature:
(it is well known that in CREATE TABLE you must specify a directory for the table 
LOCATION, otherwise you get: "Can't make directory for path 's3n://dir/file' since 
it is a file.")

But at the same time, ALTER TABLE SET LOCATION 's3n://dir/file' works fine.
SELECT also reads data from the single file only.

I see this in Hive 1.0.0-amzn-4.

Is this just a bug that will be fixed some day (or maybe already fixed), or is it 
there for some reason and will stay?
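
A sketch of the two cases being compared (the table name is illustrative):

-- fails: CREATE TABLE requires the LOCATION to be a directory
CREATE TABLE t (c string) LOCATION 's3n://dir/file';

-- works: create the table with a directory location, then repoint it at the single file
CREATE TABLE t (c string) LOCATION 's3n://dir/';
ALTER TABLE t SET LOCATION 's3n://dir/file';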

Thanks,
Dmitry



RE: import sql file

2016-11-23 Thread Markovitz, Dudu
Hi Patcharee 
The question is not clear.

Dudu

-Original Message-
From: patcharee [mailto:patcharee.thong...@uni.no] 
Sent: Wednesday, November 23, 2016 11:37 AM
To: user@hive.apache.org
Subject: import sql file

Hi,

How can I import .sql file into hive?

Best, Patcharee



RE: Nested JSON Parsing

2016-11-12 Thread Markovitz, Dudu
And your issue/question is?

From: Ajay Tirpude [mailto:tirpudeaj...@gmail.com]
Sent: Sunday, November 13, 2016 4:46 AM
To: user@hive.apache.org
Subject: Nested JSON Parsing

Dear All,

I am trying to parse the JSON file given below; my intention is to convert it 
into a CSV.

{
  "devicetype": "SmartPhone",
  "uuid": "sg76fdhh7gfxhxfhgxf67x",
  "ts": {
"date": "2016-03-23T10:58:34.660Z"
  },
  "events": [
{
  "timestamp": "2016-03-23T10:58:37Z",
  "evt": "first",
  "ad": "v6v75v88n98778mn",
  "tkey": "ngbbc76fbc6fb6fb66fb6",
  "mtp": "Wed Mar 23 2016 19:04:22 GMT 0800 (PHT)",
  "eventid": "eytuy"
},
{
  "timestamp": "2016-03-23T10:58:35Z",
  "evt": "second",
  "ad": "v6v75v88n98778mn",
  "tkey": "ngbbc76fbc6fb6fb66fb6"
},
{
  "timestamp": "2016-03-23T10:58:36Z",
  "evt": "third",
  "ad": "v6v75v88n98778mn",
  "tkey": "ngbbc76fbc6fb6fb66fb6"
}
  ],
  "adid": "v6v75v88n98778mn",
  "ad_tz": {
"date": "2016-03-23T10:58:34.660Z"
  },
  "ua": "Mozilla/5.0 (Linux; U; Android 4.3; en-gb; SM-N9005 Build/JSS15J) 
AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30"
}

There are a few conditions that I need to apply before I parse:

1. I want to get all the fields except timestamp inside the nested events key.
2. I want to loop over the events key for each evt. In the above input file there 
are three evts, but that is not fixed in the actual input files; there can be any 
number of evts, not just 3.
3. Not every evt block is the same. Each evt block can have different extra fields, 
but we need to extract every key. If a key is missing in one evt, its value should 
be blank for that evt. For example, evt "first" has two extra key/value pairs 
(eventid and mtp), and these values should be blank for the other evts. Similarly, 
other evts can have their own extra key/value pairs, which should be blank elsewhere.

At last I want the output to be like this:

devicetype | uuid | ts.date | events.evt | events.ad | events.tkey | events.mtp | events.eventid | adid | ad_tz.date | ua
SmartPhone | sg76fdhh7gfxhxfhgxf67x | 2016-03-23T10:58:34.660Z | first | v6v75v88n98778mn | ngbbc76fbc6fb6fb66fb6 | Wed Mar 23 2016 19:04:22 GMT 0800 (PHT) | eytuy | v6v75v88n98778mn | 2016-03-23T10:58:34.660Z | Mozilla/5.0 (Linux; U; Android 4.3; en-gb; SM-N9005 Build/JSS15J) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30
SmartPhone | sg76fdhh7gfxhxfhgxf67x | 2016-03-23T10:58:34.660Z | second | v6v75v88n98778mn | ngbbc76fbc6fb6fb66fb6 |  |  | v6v75v88n98778mn | 2016-03-23T10:58:34.660Z | Mozilla/5.0 (Linux; U; Android 4.3; en-gb; SM-N9005 Build/JSS15J) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30
SmartPhone | sg76fdhh7gfxhxfhgxf67x | 2016-03-23T10:58:34.660Z | third | v6v75v88n98778mn | ngbbc76fbc6fb6fb66fb6 |  |  | v6v75v88n98778mn | 2016-03-23T10:58:34.660Z | Mozilla/5.0 (Linux; U; Android 4.3; en-gb; SM-N9005 Build/JSS15J) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30
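
One possible way to express this in Hive, as a rough sketch only: it assumes each JSON record is stored on a single line, that the hive-hcatalog-core JSON SerDe is available, and it uses the field names from the sample above.

-- jar path is an assumption; it depends on the installation
ADD JAR /path/to/hive-hcatalog-core.jar;

CREATE EXTERNAL TABLE raw_events (
  devicetype string,
  uuid string,
  -- `date` and `timestamp` are reserved words in newer Hive versions, hence the backticks
  ts struct<`date`:string>,
  events array<struct<`timestamp`:string, evt:string, ad:string, tkey:string, mtp:string, eventid:string>>,
  adid string,
  ad_tz struct<`date`:string>,
  ua string)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION '/tmp/json_events';

-- one output row per element of the events array; keys missing from an evt (e.g. mtp, eventid) come back as NULL
SELECT devicetype, uuid, ts.`date`, e.evt, e.ad, e.tkey, e.mtp, e.eventid, adid, ad_tz.`date`, ua
FROM raw_events
LATERAL VIEW explode(events) x AS e;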


Regards,
Ajay T


RE: Hive Left Join inequality condition

2016-11-11 Thread Markovitz, Dudu
My pleasure ☺

Dudu

From: Goden Yao [mailto:goden@gmail.com]
Sent: Friday, November 11, 2016 1:26 AM
To: user@hive.apache.org
Subject: Re: Hive Left Join inequality condition

This worked!! Thanks so much Dudu!!

On Sat, Nov 5, 2016 at 1:03 PM Markovitz, Dudu 
<dmarkov...@paypal.com<mailto:dmarkov...@paypal.com>> wrote:
Ugly as hell, but should work.

Dudu



SELECT r_id,
       CASE WHEN table1.property_value = 'False' THEN FALSE
            WHEN table1.property_value = 'True'  THEN TRUE
            WHEN r.rea <  rg.laa THEN FALSE
            WHEN r.rea >= rg.laa THEN TRUE
            ELSE FALSE END AS flag
FROM   rs r
LEFT JOIN public.di_re rg
       ON r.re = rg.re
LEFT JOIN (select r.r_id, table1.property_value
           from   rs r
           join   public.tbl table1
                  on r.re = table1.re
           where  table1.property_name = ''
           and    r.rea BETWEEN table1.begin_time AND table1.end_time
          ) table1
       ON r.r_id = table1.r_id

From: Goden Yao [mailto:goden...@apache.org<mailto:goden...@apache.org>]
Sent: Saturday, November 05, 2016 9:22 AM
To: user@hive.apache.org<mailto:user@hive.apache.org>
Subject: Hive Left Join inequality condition


Hello!

Lately we have run into the need to implement an inequality JOIN in Hive; we 
could easily have done that with a WHERE clause if it were not a LEFT join.
Basically, we wonder how people implement LEFT/RIGHT JOINs with inequality 
conditions in Hive without losing efficiency.
Thank you.
Example:

SELECT r_id,

   CASE WHEN table1.property_value = 'False' THEN FALSE

WHEN table1.property_value = 'True' THEN TRUE

WHEN r.rea <  rg.laa THEN FALSE

WHEN r.rea >= rg.laa THEN TRUE

ELSE FALSE END AS flag

  FROM rs r

  LEFT JOIN public.di_re rg

ON r.re = rg.re

  LEFT JOIN public.tbl table1

ON r.re = table1.re

   AND table1.property_name = ''

   AND r.rea BETWEEN table1.begin_time AND table1.end_time

Error:

FAILED: SemanticException Line 0:-1 Both left and right aliases encountered in 
JOIN ...

Ways to resolve:
• Move inequality condition in WHERE clause:

• WHERE r.rea BETWEEN table1.begin_time AND table1.end_time

• WARNING: Affects query logic - filters all the table instead of filtering 
LEFT JOIN clause only;
• Move condition into SELECT field with CASE statement (if possible):

• SELECT r_id,

•  CASE WHEN table1.property_value = 'False'

•AND r.rea BETWEEN table1.begin_time AND  table1.end_time 
THEN FALSE

•   WHEN table1.property_value = 'True'

•AND r.rea BETWEEN table1.begin_time AND table1.end_time 
THEN TRUE
Not possible in every case;
• Divide queries into two separate statements and UNION them: one query 
with WHERE filter and another query totally omitting the JOIN to table that 
needed inequality as well as omitting the ids from the first query:

• WITH stage AS (

• SELECT r_id,

•  CASE WHEN table1.property_value = 'False' THEN FALSE

•   WHEN table1.property_value = 'True' THEN TRUE

•   WHEN r.rea <  rg.laa THEN FALSE

•   WHEN r.rea >= rg.laa THEN TRUE

•   ELSE FALSE END as flag

• FROM rs r

• LEFT JOIN public.di_re rg

•   ON r.re = rg.re

• LEFT JOIN public.tbl table1

•   ON r.region = table1.region

•  AND table1.property_name = ''

• WHERE r.rea BETWEEN table1.begin_time AND table1.end_time

• )

• SELECT * FROM stage

• UNION

• SELECT r_id,

•  CASE WHEN r.rea <  rg.laa THEN FALSE

•   WHEN r.rea >= rg.laa THEN TRUE

•   ELSE FALSE END as flag

• FROM rs r

• LEFT JOIN public.di_re rg

•   ON r.re = rg.re

• WHERE r_id NOT IN (SELECT DISTINCT r_id from stage)
Very expensive in terms of calculation, but in some cases inevitable.
​
--
Goden


RE: Hive Left Join inequality condition

2016-11-05 Thread Markovitz, Dudu
Ugly as hell, but should work.

Dudu



SELECT r_id,
       CASE WHEN table1.property_value = 'False' THEN FALSE
            WHEN table1.property_value = 'True'  THEN TRUE
            WHEN r.rea <  rg.laa THEN FALSE
            WHEN r.rea >= rg.laa THEN TRUE
            ELSE FALSE END AS flag
FROM   rs r
LEFT JOIN public.di_re rg
       ON r.re = rg.re
LEFT JOIN (select r.r_id, table1.property_value
           from   rs r
           join   public.tbl table1
                  on r.re = table1.re
           where  table1.property_name = ''
           and    r.rea BETWEEN table1.begin_time AND table1.end_time
          ) table1
       ON r.r_id = table1.r_id

From: Goden Yao [mailto:goden...@apache.org]
Sent: Saturday, November 05, 2016 9:22 AM
To: user@hive.apache.org
Subject: Hive Left Join inequality condition


Hello!

Lately we have run into the need to implement an inequality JOIN in Hive; we 
could easily have done that with a WHERE clause if it were not a LEFT join.
Basically, we wonder how people implement LEFT/RIGHT JOINs with inequality 
conditions in Hive without losing efficiency.
Thank you.
Example:

SELECT r_id,

   CASE WHEN table1.property_value = 'False' THEN FALSE

WHEN table1.property_value = 'True' THEN TRUE

WHEN r.rea <  rg.laa THEN FALSE

WHEN r.rea >= rg.laa THEN TRUE

ELSE FALSE END AS flag

  FROM rs r

  LEFT JOIN public.di_re rg

ON r.re = rg.re

  LEFT JOIN public.tbl table1

ON r.re = table1.re

   AND table1.property_name = ''

   AND r.rea BETWEEN table1.begin_time AND table1.end_time

Error:

FAILED: SemanticException Line 0:-1 Both left and right aliases encountered in 
JOIN ...

Ways to resolve:
· Move inequality condition in WHERE clause:

· WHERE r.rea BETWEEN table1.begin_time AND table1.end_time

· WARNING: Affects query logic - filters all the table instead of filtering 
LEFT JOIN clause only;
· Move condition into SELECT field with CASE statement (if possible):

· SELECT r_id,

·  CASE WHEN table1.property_value = 'False'

·AND r.rea BETWEEN table1.begin_time AND  table1.end_time 
THEN FALSE

·   WHEN table1.property_value = 'True'

·AND r.rea BETWEEN table1.begin_time AND table1.end_time 
THEN TRUE
Not possible in every case;
· Divide queries into two separate statements and UNION them: one query 
with WHERE filter and another query totally omitting the JOIN to table that 
needed inequality as well as omitting the ids from the first query:

· WITH stage AS (

· SELECT r_id,

·  CASE WHEN table1.property_value = 'False' THEN FALSE

·   WHEN table1.property_value = 'True' THEN TRUE

·   WHEN r.rea <  rg.laa THEN FALSE

·   WHEN r.rea >= rg.laa THEN TRUE

·   ELSE FALSE END as flag

· FROM rs r

· LEFT JOIN public.di_re rg

·   ON r.re = rg.re

· LEFT JOIN public.tbl table1

·   ON r.region = table1.region

·  AND table1.property_name = ''

· WHERE r.rea BETWEEN table1.begin_time AND table1.end_time

· )

· SELECT * FROM stage

· UNION

· SELECT r_id,

·  CASE WHEN r.rea <  rg.laa THEN FALSE

·   WHEN r.rea >= rg.laa THEN TRUE

·   ELSE FALSE END as flag

· FROM rs r

· LEFT JOIN public.di_re rg

·   ON r.re = rg.re

· WHERE r_id NOT IN (SELECT DISTINCT r_id from stage)
Very expensive in terms of calculation, but in some cases inevitable.
​


RE: HDFS small files to Sequence file using Hive

2016-09-23 Thread Markovitz, Dudu
Hi

I’m not sure how this will solve the issue you mentioned, but just for the 
fun of it – here is the code.

Dudu


-- make each whole file one record by using a delimiter that does not occur in the data
set textinputformat.record.delimiter='\0';
-- let the external table pick up files from nested sub-directories
set hive.mapred.supports.subdirectories=true;
set mapred.input.dir.recursive=true;
-- external table over the directory of small files: one row per file, holding the full content
create external table if not exists files_ext (txt string) stored as textfile location '/tmp/t';
-- sequencefile target: key = source file name, val = file content
create table if not exists files (key string,val string) stored as sequencefile;
insert into files select input__file__name,* from files_ext;
select key,length (val),regexp_extract (val,'(.*)\n',1) as val_first_line from files;

hdfs://quickstart.cloudera:8020/tmp/t/t1/t3/t4/xx01   447   Ring-ding-ding-ding-dingeringeding!
hdfs://quickstart.cloudera:8020/tmp/t/t1/t3/t4/xx02   364   Big blue eyes, pointy nose, chasing mice, and digging holes.
hdfs://quickstart.cloudera:8020/tmp/t/t1/t3/t5/xx03   321   Jacha-chacha-chacha-chow!
hdfs://quickstart.cloudera:8020/tmp/t/t1/t3/xx00      256   Dog goes woof, cat goes meow.
hdfs://quickstart.cloudera:8020/tmp/t/t2/xx05         258   You're my guardian angel hiding in the woods.
hdfs://quickstart.cloudera:8020/tmp/t/xx04            171   The secret of the fox, ancient mystery.


From: Arun Patel [mailto:arunp.bigd...@gmail.com]
Sent: Friday, September 23, 2016 7:04 PM
To: user@hive.apache.org
Subject: HDFS small files to Sequence file using Hive

I'm trying to resolve small files issue using Hive.

Is there a way to create an external table on a directory, extract 'key' as 
file name and 'value' as file content and write to a sequence file table?

Or any other better option in Hive?

Thank you

Arun


RE: on duplicate update equivalent?

2016-09-23 Thread Markovitz, Dudu
If these are dimension tables, what do you need to update there?

Dudu

From: Vijay Ramachandran [mailto:vi...@linkedin.com]
Sent: Friday, September 23, 2016 1:46 PM
To: user@hive.apache.org
Subject: Re: on duplicate update equivalent?


On Fri, Sep 23, 2016 at 3:47 PM, Mich Talebzadeh 
> wrote:
What is the use case for UPSERT in Hive. The functionality does not exist but 
there are other solutions.

Are we talking about a set of dimension tables with primary keys that need to be 
updated (existing rows) or inserted (new rows)?

Hi Mich.
Exactly, I'm looking at dimension tables.
thanks,



RE: on duplicate update equivalent?

2016-09-23 Thread Markovitz, Dudu
You may however use code similar to the following.
The main idea is to work with 2 target tables.
Instead of merging the source table into a target table, we create an 
additional target table based on the merge results.
A view points at all times to the most up-to-date target table.

Dudu


Initialize demo -

create table src (i int,c char(1));
insert into src values (2,'b'),(3,'c');

create table trg1 (i int,c char(1)) stored as orc;
insert into trg1 values (1,'X'),(2,'Y');

create view trg as select * from trg1;


Ongoing process -

create table if not exists trg2 as select coalesce (s.i,t.i) as i, coalesce 
(s.c,t.c) as c from src as s full join trg as t on t.i = s.i;

alter view trg as select * from trg2;

drop table if exists trg1;


After some time passes and the source table contains new data -

create table if not exists trg1 as select coalesce (s.i,t.i) as i, coalesce 
(s.c,t.c) as c from src as s full join trg as t on t.i = s.i;

alter view trg as select * from trg1;

drop table if exists trg2;


etc…




From: Markovitz, Dudu [mailto:dmarkov...@paypal.com]
Sent: Friday, September 23, 2016 1:02 PM
To: user@hive.apache.org
Subject: RE: on duplicate update equivalent?

We’re not there yet…
https://issues.apache.org/jira/browse/HIVE-10924

Dudu

From: Vijay Ramachandran [mailto:vi...@linkedin.com]
Sent: Friday, September 23, 2016 11:47 AM
To: user@hive.apache.org<mailto:user@hive.apache.org>
Subject: on duplicate update equivalent?

Hello.
Is there a way to write a query with a behaviour equivalent to mysql's "on 
duplicate update"? i.e., try to insert, and if key exists, update the row 
instead?
thanks,


RE: on duplicate update equivalent?

2016-09-23 Thread Markovitz, Dudu
We’re not there yet…
https://issues.apache.org/jira/browse/HIVE-10924

Dudu

From: Vijay Ramachandran [mailto:vi...@linkedin.com]
Sent: Friday, September 23, 2016 11:47 AM
To: user@hive.apache.org
Subject: on duplicate update equivalent?

Hello.
Is there a way to write a query with a behaviour equivalent to mysql's "on 
duplicate update"? i.e., try to insert, and if key exists, update the row 
instead?
thanks,


RE: Extracting data from ELB log date format

2016-09-21 Thread Markovitz, Dudu
select to_date(ts),year(ts),month(ts),day(ts),hour(ts),minute(ts),second(ts) 
from (select from_unixtime (unix_timestamp 
('2016-09-15T23:45:22.943762Z',"yyyy-MM-dd'T'HH:mm:ss")) as ts) as t;
OK
2016-09-15   2016   9   15   23   45   22

Dudu

From: Manish Rangari [mailto:linuxtricksfordev...@gmail.com]
Sent: Wednesday, September 21, 2016 4:23 PM
To: user@hive.apache.org
Subject: Extracting data from ELB log date format

Guys,

I am trying to extract date, time, month, minute etc from below timestamp 
format but did not find any function for this. Can anyone help me to extract 
the details?

2016-09-15T23:45:22.943762Z
2016-09-15T23:45:22.948829Z

--Manish


RE: ELB Log processing

2016-09-20 Thread Markovitz, Dudu
Or

create view elb_raw_log_detailed
as
select request_date, elbname, requestip, requestport, backendip, backendport, 
requestprocessingtime, backendprocessingtime, clientresponsetime, 
elbresponsecode, backendresponsecode, receivedbytes, sentbytes, requestverb, 
url, parse_url(url, 'QUERY','aid') as aid, parse_url(url, 'QUERY','tid') as 
tid, parse_url(url, 'QUERY','eid') as eid, parse_url(url, 'QUERY','did') as 
did, protocol, useragent, ssl_cipher, ssl_protocol
from elblog;

Dudu

From: Markovitz, Dudu [mailto:dmarkov...@paypal.com]
Sent: Tuesday, September 20, 2016 6:06 PM
To: user@hive.apache.org
Subject: RE: ELB Log processing

create view elb_raw_log_detailed
as
select request_date, elbname, requestip, requestport, backendip, backendport, 
requestprocessingtime, backendprocessingtime, clientresponsetime, 
elbresponsecode, backendresponsecode, receivedbytes, sentbytes, requestverb, 
url, u.aid, u.tid, u.eid,u.did, protocol, useragent, ssl_cipher, ssl_protocol
from elblog
LATERAL VIEW 
parse_url_tuple(url,'QUERY:eid','QUERY:tid','QUERY:aid','QUERY:did') u as 
eid,tid,aid,did
;

Dudu

From: Manish Rangari [mailto:linuxtricksfordev...@gmail.com]
Sent: Tuesday, September 20, 2016 4:09 PM
To: user@hive.apache.org<mailto:user@hive.apache.org>
Subject: Re: ELB Log processing

Guys,

I am struggling to create this view. I keep getting the error below. I found 
that I need to use a lateral view, but I am still not able to get the syntax 
right.

hive> create view elb_raw_log_detailed as select request_date, elbname, 
requestip, requestport, backendip, backendport, requestprocessingtime, 
backendprocessingtime, clientresponsetime, elbresponsecode, 
backendresponsecode, receivedbytes, sentbytes, requestverb, url, 
parse_url_tuple(url, 'QUERY:aid') as aid, parse_url_tuple(url, 'QUERY:tid') as 
tid, parse_url_tuple(url, 'QUERY:eid') as eid, parse_url_tuple(url, 
'QUERY:did') as did, protocol, useragent, ssl_cipher, ssl_protocol from elblogz;

FAILED: SemanticException [Error 10081]: UDTF's are not supported outside the 
SELECT clause, nor nested in expressions

On Tue, Sep 20, 2016 at 3:56 PM, Manish Rangari 
<linuxtricksfordev...@gmail.com<mailto:linuxtricksfordev...@gmail.com>> wrote:
Yes views looks like a way to go

On Tue, Sep 20, 2016 at 3:49 PM, Damien Carol 
<damien.ca...@gmail.com<mailto:damien.ca...@gmail.com>> wrote:
The royal way to do that is a view IMHO.

2016-09-20 12:14 GMT+02:00 Manish Rangari 
<linuxtricksfordev...@gmail.com<mailto:linuxtricksfordev...@gmail.com>>:
Thanks for the reply Damien. The suggestion you gave is really useful. 
Currently I am achieving my desired output by performing the steps below, but I 
want to achieve the desired result in one step instead of two. Is there any way 
I can get the aid, did, etc. in the CREATE TABLE statement? If not, I will have 
to look at the option that you mentioned.

1.
CREATE TABLE elblog (
Request_date STRING,
  ELBName STRING,
  RequestIP STRING,
  RequestPort INT,
  BackendIP STRING,
  BackendPort INT,
  RequestProcessingTime DOUBLE,
  BackendProcessingTime DOUBLE,
  ClientResponseTime DOUBLE,
  ELBResponseCode STRING,
  BackendResponseCode STRING,
  ReceivedBytes BIGINT,
  SentBytes BIGINT,
  RequestVerb STRING,
  URL STRING,
  Protocol STRING,
Useragent STRING,
ssl_cipher STRING,
ssl_protocol STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*):([0-9]*) ([^ ]*):([0-9]*) 
([.0-9]*) ([.0-9]*) ([.0-9]*) (-|[0-9]*) (-|[0-9]*) ([-0-9]*) ([-0-9]*) \"([^ 
]*) ([^ ]*) (- |[^ ]*)\" \"(.*)\" (.*) (.*)$"
)
STORED AS TEXTFILE;

2.
create table elb_raw_log as select request_date, elbname, requestip, 
requestport, backendip, backendport, requestprocessingtime, 
backendprocessingtime, clientresponsetime, elbresponsecode, 
backendresponsecode, receivedbytes, sentbytes, requestverb, url, 
regexp_extract(url, '.*aid=([a-zA-Z0-9]+).*', 1) as aid, regexp_extract(url, 
'.*tid=([a-zA-Z0-9]+).*', 1) as tid, regexp_extract(url, 
'.*eid=([a-zA-Z0-9]+).*', 1) as eid, regexp_extract(url, 
'.*did=([a-zA-Z0-9]+).*', 1) as did, protocol, useragent, ssl_cipher, 
ssl_protocol from elblog;

On Tue, Sep 20, 2016 at 3:12 PM, Damien Carol 
<damien.ca...@gmail.com<mailto:damien.ca...@gmail.com>> wrote:
see the udf parse_url_tuple
SELECT b.*
FROM src LATERAL VIEW parse_url_tuple(fullurl, 'HOST', 'PATH', 'QUERY', 
'QUERY:id') b as host, path, query, query_id LIMIT 1;


https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-parse_url_tuple

2016-09-20 11:22 GMT+02:00 Manish Rangari 
<linuxtricksfordev...@gmail.com<mailto:linuxtricksfordev...@gmail.com>>:
Guys,

I want to get the fields of the ELB logs. A sample ELB log is given below and I am 
using the below create table definition. It is working fine.

RE: ELB Log processing

2016-09-20 Thread Markovitz, Dudu
create view elb_raw_log_detailed
as
select request_date, elbname, requestip, requestport, backendip, backendport, 
requestprocessingtime, backendprocessingtime, clientresponsetime, 
elbresponsecode, backendresponsecode, receivedbytes, sentbytes, requestverb, 
url, u.aid, u.tid, u.eid,u.did, protocol, useragent, ssl_cipher, ssl_protocol
from elblog
LATERAL VIEW 
parse_url_tuple(url,'QUERY:eid','QUERY:tid','QUERY:aid','QUERY:did') u as 
eid,tid,aid,did
;

Dudu

From: Manish Rangari [mailto:linuxtricksfordev...@gmail.com]
Sent: Tuesday, September 20, 2016 4:09 PM
To: user@hive.apache.org
Subject: Re: ELB Log processing

Guys,

I am struggling to create this view. I keep getting the error below. I found 
that I need to use a lateral view, but I am still not able to get the syntax 
right.

hive> create view elb_raw_log_detailed as select request_date, elbname, 
requestip, requestport, backendip, backendport, requestprocessingtime, 
backendprocessingtime, clientresponsetime, elbresponsecode, 
backendresponsecode, receivedbytes, sentbytes, requestverb, url, 
parse_url_tuple(url, 'QUERY:aid') as aid, parse_url_tuple(url, 'QUERY:tid') as 
tid, parse_url_tuple(url, 'QUERY:eid') as eid, parse_url_tuple(url, 
'QUERY:did') as did, protocol, useragent, ssl_cipher, ssl_protocol from elblogz;

FAILED: SemanticException [Error 10081]: UDTF's are not supported outside the 
SELECT clause, nor nested in expressions

On Tue, Sep 20, 2016 at 3:56 PM, Manish Rangari 
> wrote:
Yes views looks like a way to go

On Tue, Sep 20, 2016 at 3:49 PM, Damien Carol 
> wrote:
The royal way to do that is a view IMHO.

2016-09-20 12:14 GMT+02:00 Manish Rangari 
>:
Thanks for the reply Damien. The suggestion you gave is really useful. 
Currently I am achieving my desired output by performing the steps below, but I 
want to achieve the desired result in one step instead of two. Is there any way 
I can get the aid, did, etc. in the CREATE TABLE statement? If not, I will have 
to look at the option that you mentioned.

1.
CREATE TABLE elblog (
Request_date STRING,
  ELBName STRING,
  RequestIP STRING,
  RequestPort INT,
  BackendIP STRING,
  BackendPort INT,
  RequestProcessingTime DOUBLE,
  BackendProcessingTime DOUBLE,
  ClientResponseTime DOUBLE,
  ELBResponseCode STRING,
  BackendResponseCode STRING,
  ReceivedBytes BIGINT,
  SentBytes BIGINT,
  RequestVerb STRING,
  URL STRING,
  Protocol STRING,
Useragent STRING,
ssl_cipher STRING,
ssl_protocol STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*):([0-9]*) ([^ ]*):([0-9]*) 
([.0-9]*) ([.0-9]*) ([.0-9]*) (-|[0-9]*) (-|[0-9]*) ([-0-9]*) ([-0-9]*) \"([^ 
]*) ([^ ]*) (- |[^ ]*)\" \"(.*)\" (.*) (.*)$"
)
STORED AS TEXTFILE;

2.
create table elb_raw_log as select request_date, elbname, requestip, 
requestport, backendip, backendport, requestprocessingtime, 
backendprocessingtime, clientresponsetime, elbresponsecode, 
backendresponsecode, receivedbytes, sentbytes, requestverb, url, 
regexp_extract(url, '.*aid=([a-zA-Z0-9]+).*', 1) as aid, regexp_extract(url, 
'.*tid=([a-zA-Z0-9]+).*', 1) as tid, regexp_extract(url, 
'.*eid=([a-zA-Z0-9]+).*', 1) as eid, regexp_extract(url, 
'.*did=([a-zA-Z0-9]+).*', 1) as did, protocol, useragent, ssl_cipher, 
ssl_protocol from elblog;

On Tue, Sep 20, 2016 at 3:12 PM, Damien Carol 
> wrote:
see the udf parse_url_tuple
SELECT b.*
FROM src LATERAL VIEW parse_url_tuple(fullurl, 'HOST', 'PATH', 'QUERY', 
'QUERY:id') b as host, path, query, query_id LIMIT 1;


https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-parse_url_tuple

2016-09-20 11:22 GMT+02:00 Manish Rangari 
>:
Guys,

I want to get the fields of the ELB logs. A sample ELB log is given below and I am 
using the below create table definition. It is working fine; I am getting what I 
wanted, but now I want the bold part as well, for example eid, tid, aid. Can 
anyone help me match them as well?

NOTE: The position of aid, eid, tid is not fixed and it may change.

2016-09-16T06:55:19.056871Z testelb 2.1.7.2:52399 
192.168.1.5:80 0.21 0.000596 0.2 200 200 0 43 
"GET https://site1.example.com:443/peek?eid=aw123=fskc235n=2ADSFGSDG 
HTTP/1.1" "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) 
Chrome/45.0.2454.85 Safari/537.36" ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2


CREATE TABLE elblog (
Request_date STRING,
  ELBName STRING,
  RequestIP STRING,
  RequestPort INT,
  BackendIP STRING,
  BackendPort INT,
  RequestProcessingTime 

RE: What's the best way to find the nearest neighbor in Hive? Any windowing function?

2016-09-14 Thread Markovitz, Dudu
It seems you’ll have to go with JOIN.
Here are 2 options.

Dudu


select  t0.id as id_0
       ,min (named_struct ("dist", abs((t1.price - t0.price)/100) + abs((t1.number - t0.number)/1000), "id", t1.id)).id as id_1
from    t as t0
join    t as t1
        on  t0.state = t1.state
        and t0.city  = t1.city
where   t0.flag = 0
and     t1.flag = 1
group by t0.id
;


select  t.id_0
       ,t.id_1
from   (select  t0.id as id_0
               ,t1.id as id_1
               ,row_number () over (partition by t0.id order by abs((t1.price - t0.price)/100) + abs((t1.number - t0.number)/1000)) as n
        from    t as t0
        join    t as t1
                on  t0.state = t1.state
                and t0.city  = t1.city
        where   t0.flag = 0
        and     t1.flag = 1
       ) as t
where   n = 1
;
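
The first option relies on min() comparing structs field by field in their declared order: the minimum over the ("dist", "id") pairs is the pair with the smallest distance, and the trailing .id extracts the matching id.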




From: Mobius ReX [mailto:aoi...@gmail.com]
Sent: Tuesday, September 13, 2016 10:47 PM
To: user@hive.apache.org
Subject: What's the best way to find the nearest neighbor in Hive? Any 
windowing function?

Given a table

> $cat data.csv
>
> ID,State,City,Price,Number,Flag
> 1,CA,A,100,1000,0
> 2,CA,A,96,1010,1
> 3,CA,A,195,1010,1
> 4,NY,B,124,2000,0
> 5,NY,B,128,2001,1
> 6,NY,C,24,3,0
> 7,NY,C,27,30100,1
> 8,NY,C,29,30200,0
> 9,NY,C,39,33000,1


Expected Result:

ID0, ID1
1,2
4,5
6,7
8,7

for each ID with Flag=0 above, we want to find another ID from Flag=1, with the 
same "State" and "City", and the nearest Price and Number normalized by the 
corresponding values of that ID with Flag=0.

For example, ID = 1 and ID=2, has the same State and City, but different FLAG.
After normalized the Price and Number (Price divided by 100, Number divided by 
1000), the distance between ID=1 and ID=2 is:
abs(100/100 - 96/100) + abs(1000/1000 - 1010/1000) = 0.04 + 0.01 = 0.05


What's the best way to find such nearest neighbor in Hive? Can we use Lead/Lag 
or Rank for this case? Any valuable tips will be greatly appreciated!


RE: Beeline throws OOM on large input query

2016-09-06 Thread Markovitz, Dudu
Hi Adam

Are you familiar with MBR (Minimum Bounding Rectangle) and tessellation techniques?
You might be able to significantly reduce the number of potentially intersecting 
polygons.
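
A rough sketch of the idea, reusing the st_intersects UDF and the geom columns from this thread; the grid_cell and mbr_* columns are assumptions (they would have to be precomputed, e.g. when the data is loaded):

SELECT t.*
FROM   big_table t
JOIN   reftable  r
       ON  t.grid_cell = r.grid_cell                             -- tessellation: equi-join on a coarse grid cell, which Hive supports
WHERE  t.mbr_xmin <= r.mbr_xmax AND t.mbr_xmax >= r.mbr_xmin     -- cheap MBR overlap test
AND    t.mbr_ymin <= r.mbr_ymax AND t.mbr_ymax >= r.mbr_ymin
AND    st_intersects(t.geom, r.geom);                            -- exact test only on the surviving candidates

Geometries that span more than one grid cell would need to be emitted once per overlapping cell.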

Dudu

From: Adam [mailto:work@gmail.com]
Sent: Sunday, September 04, 2016 4:07 AM
To: user@hive.apache.org
Subject: RE: Beeline throws OOM on large input query

Reply to Stephen Sprague
1) confirm your beeline java process is indeed running with expanded
memory
I used the -XX:+PrintCommandLineFlags which showed:
-XX:MaxHeapSize=17179869184
confirming the 16g setting.

2)
try the hive-cli (or the python one even.)  or "beeline -u
jdbc:hive2://"
I was using the beeline jdbc connect:
  issuing: !connect jdbc:hive2: 

3) chop down your 6K points to 3K or something smaller to see just where
the breaking point is
I didn't bother though it would be good information since I found a work around 
and troubleshooting beeline wasn't my primary goal :)

Reply to Markovitz, Dudu
The query is basically finding geometry intersections.
If you are familiar with Postgis, it is a Java version of the Postgis function 
ST_Intersects (http://postgis.net/docs/ST_Intersects.html) wrapped in a Hive 
UDF.

We are checking intersection of a table's geometry column with a set of N 
geometries (6000+ in this case).

select from table
where st_intersects(table.geom, g1) OR st_intersects(table.geom, g2), etc.

Unfortunately doing it with a table join requires a theta condition which Hive 
doesn't support, something like

select from table inner join reftable on st_intersects(table.geom, 
reftable.geom)

I tried pushing down the predicate but that required a cross join which was not 
feasible for the huge table sizes.






RE: Beeline throws OOM on large input query

2016-09-03 Thread Markovitz, Dudu
Hi Adam

I’m not clear about what you are trying to achieve in your query.
Can you please give a small example?

Thanks

Dudu


From: Adam [mailto:work@gmail.com]
Sent: Friday, September 02, 2016 4:13 PM
To: user@hive.apache.org
Subject: Re: Beeline throws OOM on large input query

I set the heap size using HADOOP_CLIENT_OPTS all the way to 16g and still no 
luck.

I tried to go down the table join route but the problem is that the relation is 
not an equality so it would be a theta join which is not supported in Hive.
Basically what I am doing is a geographic intersection against 6,000 points so 
the where clause has 6000 points in it (I use a custom UDF for the 
intersection).

To avoid the problem I ended up writing another version of the UDF that reads 
the point list from an HDFS file.

It's a low priority I'm sure but I bet there are some inefficiencies in the 
query string handling that could be fixed.  When I traced the code it was doing 
all kinds of StringBuffer and String += type stuff.

Regards,


RE: Crate Non-partitioned table from partitioned table using CREATE TABLE .. LIKE

2016-08-07 Thread Markovitz, Dudu
It won’t help him since ‘*’ represents all columns, including the partition 
columns which he wants to exclude.
Dudu

From: Marcin Tustin [mailto:mtus...@handybook.com]
Sent: Sunday, August 07, 2016 3:17 PM
To: user@hive.apache.org
Subject: Re: Crate Non-partitioned table from partitioned table using CREATE 
TABLE .. LIKE

Will CREATE TABLE sales5 AS SELECT * FROM SALES; not work for you?

On Thu, Aug 4, 2016 at 5:05 PM, Nagabhushanam Bheemisetty 
> wrote:
Hi, I have a scenario where I need to create a table from a partitioned table, but 
my destination table should not be partitioned. I won't know the schema, so I 
cannot create the destination table manually. By the way, both tables are 
external tables.




RE: Crate Non-partitioned table from partitioned table using CREATE TABLE .. LIKE

2016-08-06 Thread Markovitz, Dudu
Hi

Should your destination table contain the source partition values?

e.g.

Assuming this is the source table:

create table src (cust_id int,cust_first_name string,cust_last_name string) 
partitioned by (yr string,mn string,dt string);

Should the destination table look like

create table dst (cust_id int,cust_first_name string,cust_last_name string,yr 
string,mn string,dt string);

or like

create table dst (cust_id int,cust_first_name string,cust_last_name string);


I think I can achieve the first option.
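
For illustration, one way to produce that first shape is a plain CTAS, which turns the partition columns into regular columns of the destination (sketch):

create table dst as select * from src;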

Dudu



From: Nagabhushanam Bheemisetty [mailto:nbheemise...@gmail.com]
Sent: Friday, August 05, 2016 12:05 AM
To: user@hive.apache.org
Subject: Crate Non-partitioned table from partitioned table using CREATE TABLE 
.. LIKE

Hi, I have a scenario where I need to create a table from a partitioned table, but 
my destination table should not be partitioned. I won't know the schema, so I 
cannot create the destination table manually. By the way, both tables are 
external tables.


RE: Error running SQL query through Hive JDBC

2016-08-06 Thread Markovitz, Dudu
1.
SELECT TBL_CODE FROM DB.CODE_MAP WHERE SYSTEM_NAME='TDS' AND 
TABLE_NAME=TRIM('XYZ')

This does not make sense

2.
Can you please also share the DDL and maybe a small set of data?

Thanks

Dudu

From: Amit Bajpai [mailto:amit.baj...@flextronics.com]
Sent: Friday, August 05, 2016 11:08 PM
To: user@hive.apache.org
Subject: RE: Error running SQL query through Hive JDBC

Below is the code snippet with the SQL query which I am running. The same query 
is running fine through Hive CLI.

String sql = " SELECT TBL_CODE FROM DB.CODE_MAP WHERE SYSTEM_NAME='TDS' AND TABLE_NAME=TRIM('XYZ')";

System.out.println("New SQL: " + sql);

String driverName = "org.apache.hive.jdbc.HiveDriver";
try {
    Class.forName(driverName);
    Connection con = DriverManager.getConnection(
            "jdbc:hive2://hiveservername:1/default", "username", "");
    HiveStatement stmt = (HiveStatement) con.createStatement();
    ResultSet res = stmt.executeQuery(sql);

    while (res.next()) {
        Object ret_obj = res.getObject(1);
        System.out.println(res.getString(1));
    }

    stmt.close();
    con.close();
} catch (ClassNotFoundException e) {
    e.printStackTrace();
} catch (SQLException e) {
    e.printStackTrace();
}

From: Markovitz, Dudu [mailto:dmarkov...@paypal.com]
Sent: Friday, August 05, 2016 3:04 PM
To: user@hive.apache.org<mailto:user@hive.apache.org>
Subject: RE: Error running SQL query through Hive JDBC

Can you please share the query?

From: Amit Bajpai [mailto:amit.baj...@flextronics.com]
Sent: Friday, August 05, 2016 10:40 PM
To: user@hive.apache.org<mailto:user@hive.apache.org>
Subject: Error running SQL query through Hive JDBC

Hi,

I am getting the below error when running the SQL query through Hive JDBC. Can 
you suggest how to fix it?

org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: 
FAILED: SemanticException UDF = is not allowed
at org.apache.hive.jdbc.Utils.verifySuccess(Utils.java:231)
at 
org.apache.hive.jdbc.Utils.verifySuccessWithInfo(Utils.java:217)
at 
org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:254)
at 
org.apache.hive.jdbc.HiveStatement.executeQuery(HiveStatement.java:392)
at com.flex.hdp.logs.test.main(test.java:84)
Caused by: org.apache.hive.service.cli.HiveSQLException: Error while compiling 
statement: FAILED: SemanticException UDF = is not allowed
at 
org.apache.hive.service.cli.operation.Operation.toSQLException(Operation.java:314)
at 
org.apache.hive.service.cli.operation.SQLOperation.prepare(SQLOperation.java:111)
at 
org.apache.hive.service.cli.operation.SQLOperation.runInternal(SQLOperation.java:180)
at 
org.apache.hive.service.cli.operation.Operation.run(Operation.java:256)
at 
org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:376)
at 
org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementAsync(HiveSessionImpl.java:363)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:79)
at 
org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:37)
at 
org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:64)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.

RE: Error running SQL query through Hive JDBC

2016-08-05 Thread Markovitz, Dudu
Can you please share the query?

From: Amit Bajpai [mailto:amit.baj...@flextronics.com]
Sent: Friday, August 05, 2016 10:40 PM
To: user@hive.apache.org
Subject: Error running SQL query through Hive JDBC

Hi,

I am getting the below error when running the SQL query through Hive JDBC. Can 
you suggest how to fix it?

org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: 
FAILED: SemanticException UDF = is not allowed
at org.apache.hive.jdbc.Utils.verifySuccess(Utils.java:231)
at 
org.apache.hive.jdbc.Utils.verifySuccessWithInfo(Utils.java:217)
at 
org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:254)
at 
org.apache.hive.jdbc.HiveStatement.executeQuery(HiveStatement.java:392)
at com.flex.hdp.logs.test.main(test.java:84)
Caused by: org.apache.hive.service.cli.HiveSQLException: Error while compiling 
statement: FAILED: SemanticException UDF = is not allowed
at 
org.apache.hive.service.cli.operation.Operation.toSQLException(Operation.java:314)
at 
org.apache.hive.service.cli.operation.SQLOperation.prepare(SQLOperation.java:111)
at 
org.apache.hive.service.cli.operation.SQLOperation.runInternal(SQLOperation.java:180)
at 
org.apache.hive.service.cli.operation.Operation.run(Operation.java:256)
at 
org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:376)
at 
org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementAsync(HiveSessionImpl.java:363)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:79)
at 
org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:37)
at 
org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:64)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
at 
org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:536)
at 
org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:60)
at com.sun.proxy.$Proxy32.executeStatementAsync(Unknown Source)
at 
org.apache.hive.service.cli.CLIService.executeStatementAsync(CLIService.java:271)
at 
org.apache.hive.service.cli.thrift.ThriftCLIService.ExecuteStatement(ThriftCLIService.java:401)
at 
org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1313)
at 
org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1298)
at 
org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
at 
org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
at 
org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:56)
at 
org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:206)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.RuntimeException: 
org.apache.hadoop.hive.ql.parse.SemanticException:UDF = is not allowed
at 
org.apache.hadoop.hive.ql.exec.FunctionRegistry.getFunctionInfo(FunctionRegistry.java:677)
at 
org.apache.hadoop.hive.ql.parse.TypeCheckProcFactory$DefaultExprProcessor.getXpathOrFuncExprNodeDesc(TypeCheckProcFactory.java:810)
at 
org.apache.hadoop.hive.ql.parse.TypeCheckProcFactory$DefaultExprProcessor.process(TypeCheckProcFactory.java:1152)
at 
org.apache.hadoop.hive.ql.lib.DefaultRuleDispatcher.dispatch(DefaultRuleDispatcher.java:90)
at 
org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatchAndReturn(DefaultGraphWalker.java:94)
at 
org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatch(DefaultGraphWalker.java:78)
at 
org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.walk(DefaultGraphWalker.java:132)
at 

RE: Alternatives to self join

2016-07-26 Thread Markovitz, Dudu
Hi

Can you please send your original query and perhaps a small dataset sample?

Thanks

Dudu
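
(For reference — and only as a sketch, since the original query and join conditions were not shared — repeated self-joins on the same key can often be collapsed into a single pass with conditional aggregation; the table and column names below are hypothetical:)

select      k
           ,max (case when cond = 'a' then val end)   as val_a
           ,max (case when cond = 'b' then val end)   as val_b
           ,max (case when cond = 'c' then val end)   as val_c
           ,max (case when cond = 'd' then val end)   as val_d

from        t

group by    k
;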

From: Buntu Dev [mailto:buntu...@gmail.com]
Sent: Tuesday, July 26, 2016 10:46 AM
To: user@hive.apache.org
Subject: Alternatives to self join

I'm currently joining a table to itself 4 times, each time on different conditions.
Although it works fine, I'm not sure whether there are alternatives that would perform
better. Please let me know.

Thanks!


RE: Any way in hive to have functionality like SQL Server collation on Case sensitivity

2016-07-14 Thread Markovitz, Dudu
Yes, java_method is a Synonym for reflect as of Hive 
0.9.0<https://issues.apache.org/jira/browse/HIVE-1877>

The use-case was presented by Mahender at the bottom of this thread (the 
emphasis is mine):

“We would like to have feature in Hive where string comparison should ignore 
case sensitivity while joining on String Columns in hive. This feature helps us 
in reducing code of calling Upper or Lower function on Join columns. If it is 
already there, please let me know settings to enable this feature”.

From: Jörn Franke [mailto:jornfra...@gmail.com]
Sent: Thursday, July 14, 2016 10:43 AM
To: user@hive.apache.org
Subject: Re: Any way in hive to have functionality like SQL Server collation on 
Case sensitivity

I think both are the same.
Can you elaborate a little bit more on your use case, e.g. a query you currently
run and what the exact issue is?

On 14 Jul 2016, at 09:36, Markovitz, Dudu 
<dmarkov...@paypal.com<mailto:dmarkov...@paypal.com>> wrote:
Are you referring to ‘java_method‘ (or ‘reflect’)?

e.g.

hive> select java_method  ('java.lang.Math','min',45,9)  ;
9

I’m not sure how it serves our purpose.

Dudu

From: Jörn Franke [mailto:jornfra...@gmail.com]
Sent: Thursday, July 14, 2016 8:55 AM
To: user@hive.apache.org<mailto:user@hive.apache.org>
Subject: Re: Any way in hive to have functionality like SQL Server collation on 
Case sensitivity


You can use any Java function in Hive without (!) the need to wrap it in a
UDF, via the reflect command.
However, I am not sure if this meets your use case.



Sent from my iPhone
On 13 Jul 2016, at 19:50, Markovitz, Dudu 
<dmarkov...@paypal.com<mailto:dmarkov...@paypal.com>> wrote:
Hi

I’m personally not aware of any method to achieve case-insensitive
comparison other than using lower() / upper().

Dudu

From: Mahender Sarangam [mailto:mahender.bigd...@outlook.com]
Sent: Wednesday, July 13, 2016 12:56 AM
To: user@hive.apache.org<mailto:user@hive.apache.org>
Subject: Re: Any way in hive to have functionality like SQL Server collation on 
Case sensitivity


Thanks Dudu,

I would like to know how case insensitivity is handled in other projects. Is
everyone converting with toLower() or toUpper() in the joins? Is there any
setting applied at the Hive server level that gets reflected in all the queries?



/MS

On 5/25/2016 9:05 AM, Markovitz, Dudu wrote:
It will not be suitable for JOIN operation since it will cause a Cartesian 
product.
Any chosen solution should determine a single representation for any given 
string.

Dudu

From: Mich Talebzadeh [mailto:mich.talebza...@gmail.com]
Sent: Wednesday, May 25, 2016 1:31 AM
To: user <user@hive.apache.org><mailto:user@hive.apache.org>
Subject: Re: Any way in hive to have functionality like SQL Server collation on 
Case sensitivity

I would rather go for something like compare() 
<http://infocenter.sybase.com/help/index.jsp?topic=/com.sybase.infocenter.dc36271.1572/html/blocks/X14054.htm>
 that allows one to directly compare two character strings based on alternate 
collation rules.

Hive does not have it. This is from SAP ASE

1> select compare ("aaa","bbb")
2> go
 ---
  -1
(1 row affected)
1> select compare ("aaa","Aaa")
2> go
 ---
   1
(1 row affected)

1> select compare ("aaa","AAA")
2> go
 ---
   1

•  The compare function returns the following values, based on the collation 
rules that you chose:

· 1 – indicates that char_expression1 or uchar_expression1 is greater 
than char_expression2 or uchar_expression2.

· 0 – indicates that char_expression1 or uchar_expression1 is equal to 
char_expression2 or uchar_expression2.

· -1 – indicates that char_expression1 or uchar_expression1 is less 
than char_expression2 or uchar expression2.

hive> select compare("aaa", "bbb");
FAILED: SemanticException [Error 10011]: Line 1:7 Invalid function 'compare'


HTH




Dr Mich Talebzadeh



LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com<http://talebzadehmich.wordpress.com/>



On 24 May 2016 at 21:15, mahender bigdata 
<mahender.bigd...@outlook.com<mailto:mahender.bigd...@outlook.com>> wrote:
Hi,

We would like to have feature in Hive where string comparison should ignore 
case sensitivity while joining on String Columns in hive. This feature helps us 
in reducing code of calling Upper or Lower function on Join columns. If it is 
already there, please let me know settings to enable this feature.

/MS




RE: Any way in hive to have functionality like SQL Server collation on Case sensitivity

2016-07-14 Thread Markovitz, Dudu
Are you referring to ‘java_method‘ (or ‘reflect’)?

e.g.

hive> select java_method  ('java.lang.Math','min',45,9)  ;
9

I’m not sure how it serves our purpose.

Dudu

From: Jörn Franke [mailto:jornfra...@gmail.com]
Sent: Thursday, July 14, 2016 8:55 AM
To: user@hive.apache.org
Subject: Re: Any way in hive to have functionality like SQL Server collation on 
Case sensitivity


You can use any Java function in Hive without (!) the need to wrap it in a
UDF, via the reflect command.
However, I am not sure if this meets your use case.



Sent from my iPhone
On 13 Jul 2016, at 19:50, Markovitz, Dudu 
<dmarkov...@paypal.com<mailto:dmarkov...@paypal.com>> wrote:
Hi

I’m personally not aware of any method to achieve case-insensitive
comparison other than using lower() / upper().

Dudu

From: Mahender Sarangam [mailto:mahender.bigd...@outlook.com]
Sent: Wednesday, July 13, 2016 12:56 AM
To: user@hive.apache.org<mailto:user@hive.apache.org>
Subject: Re: Any way in hive to have functionality like SQL Server collation on 
Case sensitivity


Thanks Dudu,

I would like to know how case insensitivity is handled in other projects. Is
everyone converting with toLower() or toUpper() in the joins? Is there any
setting applied at the Hive server level that gets reflected in all the queries?



/MS

On 5/25/2016 9:05 AM, Markovitz, Dudu wrote:
It will not be suitable for JOIN operation since it will cause a Cartesian 
product.
Any chosen solution should determine a single representation for any given 
string.

Dudu

From: Mich Talebzadeh [mailto:mich.talebza...@gmail.com]
Sent: Wednesday, May 25, 2016 1:31 AM
To: user <user@hive.apache.org><mailto:user@hive.apache.org>
Subject: Re: Any way in hive to have functionality like SQL Server collation on 
Case sensitivity

I would rather go for something like compare() 
<http://infocenter.sybase.com/help/index.jsp?topic=/com.sybase.infocenter.dc36271.1572/html/blocks/X14054.htm>
 that allows one to directly compare two character strings based on alternate 
collation rules.

Hive does not have it. This is from SAP ASE

1> select compare ("aaa","bbb")
2> go
 ---
  -1
(1 row affected)
1> select compare ("aaa","Aaa")
2> go
 ---
   1
(1 row affected)

1> select compare ("aaa","AAA")
2> go
 ---
   1

•  The compare function returns the following values, based on the collation 
rules that you chose:

· 1 – indicates that char_expression1 or uchar_expression1 is greater 
than char_expression2 or uchar_expression2.

· 0 – indicates that char_expression1 or uchar_expression1 is equal to 
char_expression2 or uchar_expression2.

· -1 – indicates that char_expression1 or uchar_expression1 is less 
than char_expression2 or uchar expression2.

hive> select compare("aaa", "bbb");
FAILED: SemanticException [Error 10011]: Line 1:7 Invalid function 'compare'


HTH




Dr Mich Talebzadeh



LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com<http://talebzadehmich.wordpress.com/>



On 24 May 2016 at 21:15, mahender bigdata 
<mahender.bigd...@outlook.com<mailto:mahender.bigd...@outlook.com>> wrote:
Hi,

We would like to have feature in Hive where string comparison should ignore 
case sensitivity while joining on String Columns in hive. This feature helps us 
in reducing code of calling Upper or Lower function on Join columns. If it is 
already there, please let me know settings to enable this feature.

/MS




RE: Any way in hive to have functionality like SQL Server collation on Case sensitivity

2016-07-13 Thread Markovitz, Dudu
Hi

I’m personally not aware of any method to achieve case-insensitive
comparison other than using lower() / upper().

Dudu
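
(A minimal sketch of the lower()-based approach — the table and column names are hypothetical:)

select      a.id
           ,b.val

from        t1   as a

            join    t2   as b

            on      lower(a.name)  =  lower(b.name)
;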

From: Mahender Sarangam [mailto:mahender.bigd...@outlook.com]
Sent: Wednesday, July 13, 2016 12:56 AM
To: user@hive.apache.org
Subject: Re: Any way in hive to have functionality like SQL Server collation on 
Case sensitivity


Thanks Dudu,

I would like to know how case insensitivity is handled in other projects. Is
everyone converting with toLower() or toUpper() in the joins? Is there any
setting applied at the Hive server level that gets reflected in all the queries?



/MS

On 5/25/2016 9:05 AM, Markovitz, Dudu wrote:
It will not be suitable for JOIN operation since it will cause a Cartesian 
product.
Any chosen solution should determine a single representation for any given 
string.

Dudu

From: Mich Talebzadeh [mailto:mich.talebza...@gmail.com]
Sent: Wednesday, May 25, 2016 1:31 AM
To: user <user@hive.apache.org><mailto:user@hive.apache.org>
Subject: Re: Any way in hive to have functionality like SQL Server collation on 
Case sensitivity

I would rather go for something like compare() 
<http://infocenter.sybase.com/help/index.jsp?topic=/com.sybase.infocenter.dc36271.1572/html/blocks/X14054.htm>
 that allows one to directly compare two character strings based on alternate 
collation rules.

Hive does not have it. This is from SAP ASE

1> select compare ("aaa","bbb")
2> go
 ---
  -1
(1 row affected)
1> select compare ("aaa","Aaa")
2> go
 ---
   1
(1 row affected)

1> select compare ("aaa","AAA")
2> go
 ---
   1

•  The compare function returns the following values, based on the collation 
rules that you chose:

· 1 – indicates that char_expression1 or uchar_expression1 is greater 
than char_expression2 or uchar_expression2.

· 0 – indicates that char_expression1 or uchar_expression1 is equal to 
char_expression2 or uchar_expression2.

· -1 – indicates that char_expression1 or uchar_expression1 is less 
than char_expression2 or uchar expression2.

hive> select compare("aaa", "bbb");
FAILED: SemanticException [Error 10011]: Line 1:7 Invalid function 'compare'


HTH




Dr Mich Talebzadeh



LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com<http://talebzadehmich.wordpress.com/>



On 24 May 2016 at 21:15, mahender bigdata 
<mahender.bigd...@outlook.com<mailto:mahender.bigd...@outlook.com>> wrote:
Hi,

We would like to have feature in Hive where string comparison should ignore 
case sensitivity while joining on String Columns in hive. This feature helps us 
in reducing code of calling Upper or Lower function on Join columns. If it is 
already there, please let me know settings to enable this feature.

/MS




RE: Using Spark on Hive with Hive also using Spark as its execution engine

2016-07-12 Thread Markovitz, Dudu



On 12 July 2016 at 08:16, Markovitz, Dudu 
<dmarkov...@paypal.com<mailto:dmarkov...@paypal.com>> wrote:
This is a simple task –
Read the files, find the local max value and combine the results (find the 
global max value).
How do you explain the differences in the results? Spark reads the files and 
finds a local max 10X (+) faster than MR?
Can you please attach the execution plan?

Thanks

Dudu
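
(For reference, the plan can be captured with EXPLAIN — e.g. against the table used in the test below; EXTENDED adds more detail. This is only a sketch of how to obtain it:)

explain extended
select max(id) from oraclehadoop.dummy;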



From: Mich Talebzadeh 
[mailto:mich.talebza...@gmail.com<mailto:mich.talebza...@gmail.com>]
Sent: Monday, July 11, 2016 11:55 PM
To: user <user@hive.apache.org<mailto:user@hive.apache.org>>; user @spark 
<u...@spark.apache.org<mailto:u...@spark.apache.org>>
Subject: Re: Using Spark on Hive with Hive also using Spark as its execution 
engine

In my test I compared like for like, keeping the setup the same, namely:


  1.  Table was a parquet table of 100 Million rows
  2.  The same set up was used for both Hive on Spark and Hive on MR
  3.  Spark was very impressive compared to MR on this particular test.

Just to see any issues, I created an ORC table in the image of the Parquet one
(insert/select from Parquet to ORC) with stats updated for columns etc.

These were the results of the same run using ORC table this time:

hive> select max(id) from oraclehadoop.dummy;

Starting Spark Job = b886b869-5500-4ef7-aab9-ae6fb4dad22b
Query Hive on Spark job[1] stages:
2
3
Status: Running (Hive on Spark job[1])
Job Progress Format
CurrentTime StageId_StageAttemptId: 
SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount 
[StageCost]
2016-07-11 21:35:45,020 Stage-2_0: 0(+8)/23 Stage-3_0: 0/1
2016-07-11 21:35:48,033 Stage-2_0: 0(+8)/23 Stage-3_0: 0/1
2016-07-11 21:35:51,046 Stage-2_0: 1(+8)/23 Stage-3_0: 0/1
2016-07-11 21:35:52,050 Stage-2_0: 3(+8)/23 Stage-3_0: 0/1
2016-07-11 21:35:53,055 Stage-2_0: 8(+4)/23 Stage-3_0: 0/1
2016-07-11 21:35:54,060 Stage-2_0: 11(+1)/23Stage-3_0: 0/1
2016-07-11 21:35:55,065 Stage-2_0: 12(+0)/23Stage-3_0: 0/1
2016-07-11 21:35:56,071 Stage-2_0: 12(+8)/23Stage-3_0: 0/1
2016-07-11 21:35:57,076 Stage-2_0: 13(+8)/23Stage-3_0: 0/1
2016-07-11 21:35:58,081 Stage-2_0: 20(+3)/23Stage-3_0: 0/1
2016-07-11 21:35:59,085 Stage-2_0: 23/23 Finished   Stage-3_0: 0(+1)/1
2016-07-11 21:36:00,089 Stage-2_0: 23/23 Finished   Stage-3_0: 1/1 Finished
Status: Finished successfully in 16.08 seconds
OK
1
Time taken: 17.775 seconds, Fetched: 1 row(s)

Repeat with MR engine

hive> set hive.execution.engine=mr;
Hive-on-MR is deprecated in Hive 2 and may not be available in the future 
versions. Consider using a different execution engine (i.e. spark, tez) or 
using Hive 1.X releases.

hive> select max(id) from oraclehadoop.dummy;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the 
future versions. Consider using a different execution engine (i.e. spark, tez) 
or using Hive 1.X releases.
Query ID = hduser_20160711213100_8dc2afae-8644-4097-ba33-c7bd3c304bf8
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1468226887011_0008, Tracking URL = 
http://rhes564:8088/proxy/application_1468226887011_0008/
Kill Command = /home/hduser/hadoop-2.6.0/bin/hadoop job  -kill 
job_1468226887011_0008
Hadoop job information for Stage-1: number of mappers: 23; number of reducers: 1
2016-07-11 21:37:00,061 Stage-1 map = 0%,  reduce = 0%
2016-07-11 21:37:06,440 Stage-1 map = 4%,  reduce = 0%, Cumulative CPU 16.48 sec
2016-07-11 21:37:14,751 Stage-1 map = 9%,  reduce = 0%, Cumulative CPU 40.63 sec
2016-07-11 21:37:22,048 Stage-1 map = 13%,  reduce = 0%, Cumulative CPU 58.88 
sec
2016-07-11 21:37:30,412 Stage-1 map = 17%,  reduce = 0%, Cumulative CPU 80.72 
sec
2016-07-11 21:37:37,707 Stage-1 map = 22%,  reduce = 0%, Cumulative CPU 103.43 
sec
2016-07-11 21:37:45,999 Stage-1 map = 26%,  reduce = 0%, Cumulative CPU 125.93 
sec
2016-07-11 21:37:54,300 Stage-1 map = 30%,  reduce = 0%, Cumulative CPU 147.17 
sec
2016-07-11 21:38:01,538 Stage-1 map = 35%,  reduce = 0%, Cumulative CPU 166.56 
sec
2016-07-11 21:38:08,807 Stage-1 map = 39%,  reduce = 0%, Cumulative CPU 189.29 
sec
2016-07-11 21:38:17,115 Stage-1 map = 43%,  reduce = 0%, Cumulative CPU 211.03 
sec
2016-07-11 21:38:24,363 Stage-1 map = 48%,  reduce = 0%, Cumulative CPU 235.68 
sec
2016-07-11 21:38:32,638 Stage-1 map = 52%,  reduce = 0%, Cumulative CPU 258.27 
sec
2016-07-11 21:38:40,916 Stage-1 map = 57%,  reduce = 0%, Cumulative CPU 278.44 
sec
201

RE: Using Spark on Hive with Hive also using Spark as its execution engine

2016-07-12 Thread Markovitz, Dudu
This is a simple task –
Read the files, find the local max value and combine the results (find the 
global max value).
How do you explain the differences in the results? Spark reads the files and 
finds a local max 10X (+) faster than MR?
Can you please attach the execution plan?

Thanks

Dudu



From: Mich Talebzadeh [mailto:mich.talebza...@gmail.com]
Sent: Monday, July 11, 2016 11:55 PM
To: user ; user @spark 
Subject: Re: Using Spark on Hive with Hive also using Spark as its execution 
engine

In my test I compared like for like, keeping the setup the same, namely:


  1.  Table was a parquet table of 100 Million rows
  2.  The same set up was used for both Hive on Spark and Hive on MR
  3.  Spark was very impressive compared to MR on this particular test.

Just to see any issues, I created an ORC table in the image of the Parquet one
(insert/select from Parquet to ORC) with stats updated for columns etc.

These were the results of the same run using ORC table this time:

hive> select max(id) from oraclehadoop.dummy;

Starting Spark Job = b886b869-5500-4ef7-aab9-ae6fb4dad22b
Query Hive on Spark job[1] stages:
2
3
Status: Running (Hive on Spark job[1])
Job Progress Format
CurrentTime StageId_StageAttemptId: 
SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount 
[StageCost]
2016-07-11 21:35:45,020 Stage-2_0: 0(+8)/23 Stage-3_0: 0/1
2016-07-11 21:35:48,033 Stage-2_0: 0(+8)/23 Stage-3_0: 0/1
2016-07-11 21:35:51,046 Stage-2_0: 1(+8)/23 Stage-3_0: 0/1
2016-07-11 21:35:52,050 Stage-2_0: 3(+8)/23 Stage-3_0: 0/1
2016-07-11 21:35:53,055 Stage-2_0: 8(+4)/23 Stage-3_0: 0/1
2016-07-11 21:35:54,060 Stage-2_0: 11(+1)/23Stage-3_0: 0/1
2016-07-11 21:35:55,065 Stage-2_0: 12(+0)/23Stage-3_0: 0/1
2016-07-11 21:35:56,071 Stage-2_0: 12(+8)/23Stage-3_0: 0/1
2016-07-11 21:35:57,076 Stage-2_0: 13(+8)/23Stage-3_0: 0/1
2016-07-11 21:35:58,081 Stage-2_0: 20(+3)/23Stage-3_0: 0/1
2016-07-11 21:35:59,085 Stage-2_0: 23/23 Finished   Stage-3_0: 0(+1)/1
2016-07-11 21:36:00,089 Stage-2_0: 23/23 Finished   Stage-3_0: 1/1 Finished
Status: Finished successfully in 16.08 seconds
OK
1
Time taken: 17.775 seconds, Fetched: 1 row(s)

Repeat with MR engine

hive> set hive.execution.engine=mr;
Hive-on-MR is deprecated in Hive 2 and may not be available in the future 
versions. Consider using a different execution engine (i.e. spark, tez) or 
using Hive 1.X releases.

hive> select max(id) from oraclehadoop.dummy;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the 
future versions. Consider using a different execution engine (i.e. spark, tez) 
or using Hive 1.X releases.
Query ID = hduser_20160711213100_8dc2afae-8644-4097-ba33-c7bd3c304bf8
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1468226887011_0008, Tracking URL = 
http://rhes564:8088/proxy/application_1468226887011_0008/
Kill Command = /home/hduser/hadoop-2.6.0/bin/hadoop job  -kill 
job_1468226887011_0008
Hadoop job information for Stage-1: number of mappers: 23; number of reducers: 1
2016-07-11 21:37:00,061 Stage-1 map = 0%,  reduce = 0%
2016-07-11 21:37:06,440 Stage-1 map = 4%,  reduce = 0%, Cumulative CPU 16.48 sec
2016-07-11 21:37:14,751 Stage-1 map = 9%,  reduce = 0%, Cumulative CPU 40.63 sec
2016-07-11 21:37:22,048 Stage-1 map = 13%,  reduce = 0%, Cumulative CPU 58.88 
sec
2016-07-11 21:37:30,412 Stage-1 map = 17%,  reduce = 0%, Cumulative CPU 80.72 
sec
2016-07-11 21:37:37,707 Stage-1 map = 22%,  reduce = 0%, Cumulative CPU 103.43 
sec
2016-07-11 21:37:45,999 Stage-1 map = 26%,  reduce = 0%, Cumulative CPU 125.93 
sec
2016-07-11 21:37:54,300 Stage-1 map = 30%,  reduce = 0%, Cumulative CPU 147.17 
sec
2016-07-11 21:38:01,538 Stage-1 map = 35%,  reduce = 0%, Cumulative CPU 166.56 
sec
2016-07-11 21:38:08,807 Stage-1 map = 39%,  reduce = 0%, Cumulative CPU 189.29 
sec
2016-07-11 21:38:17,115 Stage-1 map = 43%,  reduce = 0%, Cumulative CPU 211.03 
sec
2016-07-11 21:38:24,363 Stage-1 map = 48%,  reduce = 0%, Cumulative CPU 235.68 
sec
2016-07-11 21:38:32,638 Stage-1 map = 52%,  reduce = 0%, Cumulative CPU 258.27 
sec
2016-07-11 21:38:40,916 Stage-1 map = 57%,  reduce = 0%, Cumulative CPU 278.44 
sec
2016-07-11 21:38:49,206 Stage-1 map = 61%,  reduce = 0%, Cumulative CPU 300.35 
sec
2016-07-11 21:38:58,524 Stage-1 map = 65%,  reduce = 0%, Cumulative CPU 322.89 
sec
2016-07-11 21:39:07,889 Stage-1 map = 70%,  reduce = 0%, Cumulative CPU 344.8 
sec
2016-07-11 21:39:16,151 Stage-1 map = 74%,  reduce = 0%, Cumulative CPU 367.77 
sec
2016-07-11 21:39:25,456 Stage-1 map = 78%,  reduce = 0%, Cumulative CPU 391.82 
sec
2016-07-11 

RE: RegexSerDe with Filters

2016-07-02 Thread Markovitz, Dudu
Hi Venkat

You don’t necessarily need the three views if your goal is to join them.
You can achieve the same result using a single view and an aggregated query.
Please test the following code and see if it works for you or you would like to 
get a different solution.

Dudu


create external table t
(
c1  string
   ,ts  string
   ,c3  string
   ,log_rec_level   string
   ,tid string
   ,att string
   ,val string
   ,val_num string
)
row format serde 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
with serdeproperties 
('input.regex'='(\\S+)\\s+(.{23})\\s+(\\S+)\\s+\\[([^]]+)\\]\\s+TID:\\s*(\\d+)\\s+([^:]+):?\\s*((\\d+)?.*)')
stored as textfile
location '/tmp/t'
;

create view v
as
select  c1
       ,cast (ts as timestamp)      as ts
       ,c3                          as c3
       ,log_rec_level               as log_rec_level
       ,cast (tid as bigint)        as tid
       ,att                         as att
       ,val                         as val
       ,cast (val_num as bigint)    as val_num

from    t
;


select      tid

           ,min (case when att like '%ProcessingHandler Message%'   then ts       end)   as ts_ProcessingHandler_Message
           ,min (case when att = 'Request received in writer'       then ts       end)   as ts_Request_received_in_writer
           ,min (case when att = 'Total time'                       then ts       end)   as ts_Total_time

           ,min (case when att like '%ProcessingHandler Message%'   then val_num  end)   as timestamp
           ,min (case when att = 'Total time'                       then val_num  end)   as Total_time

from        v

group by    tid
;
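
(As a possible follow-up on the single-view approach — for example, an end-to-end latency per TID computed in the same aggregated pass; unix_timestamp() is used here only as a sketch of one way to get the difference in seconds:)

select      tid

           ,unix_timestamp (min (case when att = 'Total time'                       then ts end))
          - unix_timestamp (min (case when att like '%ProcessingHandler Message%'   then ts end))   as elapsed_seconds

from        v

group by    tid
;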




From: Arun Patel [mailto:arunp.bigd...@gmail.com]
Sent: Friday, July 01, 2016 9:20 PM
To: user@hive.apache.org
Subject: Re: RegexSerDe with Filters

Dudu,

Thanks for your continued support.  I need one more quick help.  I have one 
more log file as shown below.

STD-SERV 2016-06-29 12:10:39.142 c.f.c.s.F.ProcessingHandler [INFO] 
TID:101114719017567668 cluster1 ProcessingHandler Message timestamp: 
1467216639090
STD-SERV 2016-06-29 12:10:39.143 c.f.c.s.F.ProcessingHandler [INFO] TID: 
101114719017567668 cluster1: Processed request
STD-SERV 2016-06-29 12:10:39.163 c.f.c.s.F.WritingHandler [INFO] TID: 
101114719017567668 Request received in writer
STD-SERV 2016-06-29 12:10:39.273 c.f.c.s.F.WritingHandler [INFO] TID: 
101114719017567668 Processed request
STD-SERV 2016-06-29 12:10:39.273 c.f.c.s.F.WritingHandler [INFO] TID: 
101114719017567668 Total time: 10 ms

I need to create 3 views for 3 requirements.
1) create a view to get timestamp, TID number and cluster1 for lines  
"ProcessingHandler Message timestamp".  But, for this line there is no space 
between TID: and TID number.

2) create a view to get timestamp, TID for the lines "Request received in 
writer".  There is a space between TID: and TID number.

3) Create a view to get timestamp, TID for the lines "Total time:".  There is a 
space between TID: and TID number.

How do I create base table and views?  I am planning to join these 3 views 
based on TID.  Do I need to take any special considerations?

Regards,
Venkat




On Fri, Jun 24, 2016 at 5:17 PM, Arun Patel 
<arunp.bigd...@gmail.com<mailto:arunp.bigd...@gmail.com>> wrote:
Dudu, Thanks for the clarification. Looks like I have an issue with my Hive 
installation.  I tried in a different cluster and it works.

Thanks again.


On Fri, Jun 24, 2016 at 4:59 PM, Markovitz, Dudu 
<dmarkov...@paypal.com<mailto:dmarkov...@paypal.com>> wrote:
This is a tested, working code.
If you’re using https://regex101.com ,first replace backslash pairs (\\ ) with 
a single backslash (\) and also use the ‘g’ modifier in order to find all of 
the matches.

The regular expression is -
(\S+)\s+([0-9]{4}-[0-9]{2}-[0-9]{2} 
[0-9]{2}:[0-9]{2}:[0-9]{2}),([0-9]{3})\s+(\S+)\s+\[([^]]+)\]\s+(\S+)\s+:\s+(TID:\s\d+)?\s*(.*)

I’ll send you a screen shot in private, since you don’t want to expose the data.

Dudu


From: Arun Patel 
[mailto:arunp.bigd...@gmail.com<mailto:arunp.bigd...@gmail.com>]
Sent: Friday, June 24, 2016 9:33 PM

To: user@hive.apache.org<mailto:user@hive.apache.org>
Subject: Re: RegexSerDe with Filters

Looks like Regex pattern is not working.  I tested the pattern on 
https://regex101.com/ and it does not find any match.

Any suggestions?

On Thu, Jun 23, 2016 at 3:01 PM, Markovitz, Dudu 
<dmarkov...@paypal.com<mailto:dmarkov...@paypal.com>> wrote:
My pleasure.
Please feel free to reach me if needed.

Dudu

From: Arun Patel 
[mailto:arunp.bigd...@gmail.com<mailto:arunp.bigd...@gmail.com>]
Sent: Wednesday, June 22, 2016 2:57 AM
To: user@hive.apache.org<mailto:user@hive.apache.org>
Subject: Re: RegexSerDe with Filters

Thank you very much, Dudu.  This really helps

RE: Query Performance Issue : Group By and Distinct and load on reducer

2016-07-01 Thread Markovitz, Dudu
3.
This is working code for consecutive values.
MyColumn should be a column (or list of columns) with a good, uniform
distribution.


with        group_rows
as
(
select      abs(hash(MyColumn))%1   as group_id
           ,count (*)               as cnt

from        INTER_ETL

group by    abs(hash(MyColumn))%1
)

           ,group_rows_accumulated
as
(
select      g1.group_id
           ,sum (g2.cnt) - min (g1.cnt)   as accumulated_rows

from        group_rows   as g1

            cross join  group_rows   as g2

where       g2.group_id <= g1.group_id

group by    g1.group_id
)

select      t.*
           ,row_number () over (partition by a.group_id order by null) + a.accumulated_rows   as ETL_ROW_ID

from        INTER_ETL                 as t

            join    group_rows_accumulated   as a

            on      a.group_id  =  abs(hash(MyColumn))%1
;
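
(For intuition — a toy trace of the logic above, assuming three hash groups with 3, 2 and 4 rows respectively:

    group_id 0:  cnt 3,  accumulated_rows 0  ->  ETL_ROW_ID 1..3
    group_id 1:  cnt 2,  accumulated_rows 3  ->  ETL_ROW_ID 4..5
    group_id 2:  cnt 4,  accumulated_rows 5  ->  ETL_ROW_ID 6..9

i.e. accumulated_rows counts the rows of all preceding groups, and row_number() fills the range within each group, so the ids come out consecutive.)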

From: Markovitz, Dudu [mailto:dmarkov...@paypal.com]
Sent: Thursday, June 30, 2016 12:43 PM
To: user@hive.apache.org; sanjiv.is...@gmail.com
Subject: RE: Query Performance Issue : Group By and Distinct and load on reducer

1.
This works.
I’ve recalled that the CAST is needed since FLOOR defaults to FLOAT.

select  (cast (floor(r*100) as bigint)+ 1)  + 100L * (row_number () 
over (partition by (cast (floor(r*100) as bigint) + 1) order by null) - 1)  
as ETL_ROW_ID

from        (select *,rand() as r from INTER_ETL) as t
;



Here is a test result from our dev system

select  min (ETL_ROW_ID)as min_ETL_ROW_ID
   ,count   (ETL_ROW_ID)as count_ETL_ROW_ID
   ,max (ETL_ROW_ID)as max_ETL_ROW_ID

from   (select  (cast (floor(r*100) as bigint)+ 1)  + 100L * 
(row_number () over (partition by (cast (floor(r*100) as bigint) + 1) order 
by null) - 1)  as ETL_ROW_ID

from        (select *,rand() as r from INTER_ETL) as t
)
as t
;


min_ETL_ROW_ID    count_ETL_ROW_ID    max_ETL_ROW_ID
             1         39567412227       40529759537




From: Markovitz, Dudu [mailto:dmarkov...@paypal.com]
Sent: Wednesday, June 29, 2016 11:37 PM
To: sanjiv.is...@gmail.com<mailto:sanjiv.is...@gmail.com>
Cc: user@hive.apache.org<mailto:user@hive.apache.org>
Subject: RE: Query Performance Issue : Group By and Distinct and load on reducer

1.
This is strange.
The negative numbers are due to overflow of the ‘int’ type, but for that reason 
exactly I’ve casted the expressions in my code to ‘bigint’.
I’ve tested this code before sending it to you and it worked fine, returning 
results that are beyond the range of the ‘int’ type.

Please try this:

select      *
           ,(floor(r*100) + 1)  + (100L * (row_number () over (partition by (floor(r*100) + 1) order by null) - 1))  as ETL_ROW_ID

from        (select *,rand() as r from INTER_ETL) as t
;

2.
Great

3.
Sorry, hadn’t had the time to test it (nor the change I’m going to suggest 
now…☺)
Please check if the following code works and if so, replace the ‘a’ subquery 
code with it.



select  a1.group_id

   ,sum (a2.cnt) - a1.cnt   as accum_rows



from   (select  abs(hash(MyCol1,MyCol2))%1000  as group_id

   ,count (*)  as cnt



fromINTER_ETL



group byabs(hash(MyCol1,MyCol2))%1000

)

as a1



cross join  (select abs(hash(MyCol1,MyCol2))%1000   as group_id

   ,count (*)   as cnt



fromINTER_ETL



group byabs(hash(MyCol1,MyCol2))%1000

)

as a2



where   a2.group_id <= a1.group_id



group bya1.group_id

;


From: @Sanjiv Singh [mailto:sanjiv.is...@gmail.com]
Sent: Wednesday, June 29, 2016 10:55 PM
To: Markovitz, Dudu <dmarkov...@paypal.com<mailto:dmarkov...@paypal.com>>
Cc: user@hive.apache.org<mailto:user@hive.apache.org>
Subject: Re: Query Performance Issue : Group By and Distinct and load on reducer

Hi Dudu,

I tried the same on same table which has 6357592675 rows. See response of all 
three.


I tried the 1st one; it is giving duplicate row numbers.

> CREATE TEMPORARY TABLE INTER_ETL_T AS
select  *
,cast (floor(r*100) + 1 as bigint) + (100 * (row_number () over 
(partition by cast (floor(r*100) + 1 as bigint) order by null) - 1))  as 
ROW_NUM
from (select *,rand() as r from INTER_ETL) as t ;


> select ROW_NUM, count(*) from INTER_ETL_T 

RE: Query Performance Issue : Group By and Distinct and load on reducer

2016-06-30 Thread Markovitz, Dudu
1.
This works.
I’ve recalled that the CAST is needed since FLOOR defaults to FLOAT.

select  (cast (floor(r*100) as bigint)+ 1)  + 100L * (row_number () 
over (partition by (cast (floor(r*100) as bigint) + 1) order by null) - 1)  
as ETL_ROW_ID

from(select *,rand() as r from INTER_ETL) as t
;



Here is a test result from our dev system

select  min (ETL_ROW_ID)as min_ETL_ROW_ID
   ,count   (ETL_ROW_ID)as count_ETL_ROW_ID
   ,max (ETL_ROW_ID)as max_ETL_ROW_ID

from   (select  (cast (floor(r*100) as bigint)+ 1)  + 100L * 
(row_number () over (partition by (cast (floor(r*100) as bigint) + 1) order 
by null) - 1)  as ETL_ROW_ID

from(select *,rand() as r from INTER_ETL) as t
)
as t
;


min_ETL_ROW_ID    count_ETL_ROW_ID    max_ETL_ROW_ID
             1         39567412227       40529759537




From: Markovitz, Dudu [mailto:dmarkov...@paypal.com]
Sent: Wednesday, June 29, 2016 11:37 PM
To: sanjiv.is...@gmail.com
Cc: user@hive.apache.org
Subject: RE: Query Performance Issue : Group By and Distinct and load on reducer

1.
This is strange.
The negative numbers are due to overflow of the ‘int’ type, but for that reason 
exactly I’ve casted the expressions in my code to ‘bigint’.
I’ve tested this code before sending it to you and it worked fine, returning 
results that are beyond the range of the ‘int’ type.

Please try this:

select      *
           ,(floor(r*100) + 1)  + (100L * (row_number () over (partition by (floor(r*100) + 1) order by null) - 1))  as ETL_ROW_ID

from        (select *,rand() as r from INTER_ETL) as t
;

2.
Great

3.
Sorry, hadn’t had the time to test it (nor the change I’m going to suggest 
now…☺)
Please check if the following code works and if so, replace the ‘a’ subquery 
code with it.



select  a1.group_id

   ,sum (a2.cnt) - a1.cnt   as accum_rows



from   (select  abs(hash(MyCol1,MyCol2))%1000  as group_id

   ,count (*)  as cnt



fromINTER_ETL



group byabs(hash(MyCol1,MyCol2))%1000

)

as a1



cross join  (select abs(hash(MyCol1,MyCol2))%1000   as group_id

   ,count (*)   as cnt



fromINTER_ETL



group byabs(hash(MyCol1,MyCol2))%1000

)

as a2



where   a2.group_id <= a1.group_id



group bya1.group_id

;


From: @Sanjiv Singh [mailto:sanjiv.is...@gmail.com]
Sent: Wednesday, June 29, 2016 10:55 PM
To: Markovitz, Dudu <dmarkov...@paypal.com<mailto:dmarkov...@paypal.com>>
Cc: user@hive.apache.org<mailto:user@hive.apache.org>
Subject: Re: Query Performance Issue : Group By and Distinct and load on reducer

Hi Dudu,

I tried the same on same table which has 6357592675 rows. See response of all 
three.


I tried the 1st one; it is giving duplicate row numbers.

> CREATE TEMPORARY TABLE INTER_ETL_T AS
select  *
,cast (floor(r*100) + 1 as bigint) + (100 * (row_number () over 
(partition by cast (floor(r*100) + 1 as bigint) order by null) - 1))  as 
ROW_NUM
from (select *,rand() as r from INTER_ETL) as t ;


> select ROW_NUM, count(*) from INTER_ETL_T group by ROW_NUM having count(*) > 1
> limit 10;

+--+--+--+
|ROW_NUM| _c1  |
+--+--+--+
| -2146932303  | 2|
| -2146924922  | 2|
| -2146922710  | 2|
| -2146901450  | 2|
| -2146897115  | 2|
| -2146874805  | 2|
| -2146869449  | 2|
| -2146865918  | 2|
| -2146864595  | 2|
| -2146857688  | 2|
+--+--+--+

On the 2nd one, it is not giving any duplicates, and it was much faster than
ROW_NUMBER() at least.

numRows=6357592675, totalSize=405516934422, rawDataSize=399159341747


And on the 3rd one, for consecutive numbers, the query is not compatible with Hive.

CREATE TEMPORARY TABLE INTER_ETL_T AS
select  *
       ,a.accum_rows + row_number () over (partition by abs(hash(t.m_d_key,t.s_g_key))%1 order by null) as ROW_NUM
from    INTER_ETL   as t
join    (select  abs(hash(m_d_key,s_g_key))%1   as group_id
                ,sum (count (*)) over (order by m_d_key,s_g_key rows between unbounded preceding and 1 preceding) - count(*)   as accum_rows
         from    INTER_ETL
         group by abs(hash(m_d_key,s_g_key))%1
        ) as a
on      a.group_id  = abs(hash(t.m_d_key,t.s_g_key))%1
;

Error :

Error: Error while compiling statement: FAILED: SemanticException End of a 
WindowFrame cannot be UNBOUNDED PRECEDING (state=42000,code=4)



Regards
Sanjiv Singh
Mob :  +091 9990-447-339

On Tue, Jun 28, 2016 at 6:16 PM, @Sanjiv Singh 
<sanjiv.is...@gmail.com<mailto:sanjiv.is...@gmail.com>> wrote:

RE: Query Performance Issue : Group By and Distinct and load on reducer

2016-06-29 Thread Markovitz, Dudu
1.
This is strange.
The negative numbers are due to overflow of the ‘int’ type, but for that reason 
exactly I’ve casted the expressions in my code to ‘bigint’.
I’ve tested this code before sending it to you and it worked fine, returning 
results that are beyond the range of the ‘int’ type.

Please try this:

select      *
           ,(floor(r*100) + 1)  + (100L * (row_number () over (partition by (floor(r*100) + 1) order by null) - 1))  as ETL_ROW_ID

from        (select *,rand() as r from INTER_ETL) as t
;

2.
Great

3.
Sorry, hadn’t had the time to test it (nor the change I’m going to suggest 
now…☺)
Please check if the following code works and if so, replace the ‘a’ subquery 
code with it.



select  a1.group_id

   ,sum (a2.cnt) - a1.cnt   as accum_rows



from   (select  abs(hash(MyCol1,MyCol2))%1000  as group_id

   ,count (*)  as cnt



fromINTER_ETL



group byabs(hash(MyCol1,MyCol2))%1000

)

as a1



cross join  (select abs(hash(MyCol1,MyCol2))%1000   as group_id

   ,count (*)   as cnt



fromINTER_ETL



group byabs(hash(MyCol1,MyCol2))%1000

)

as a2



where   a2.group_id <= a1.group_id



group bya1.group_id

;


From: @Sanjiv Singh [mailto:sanjiv.is...@gmail.com]
Sent: Wednesday, June 29, 2016 10:55 PM
To: Markovitz, Dudu <dmarkov...@paypal.com>
Cc: user@hive.apache.org
Subject: Re: Query Performance Issue : Group By and Distinct and load on reducer

Hi Dudu,

I tried the same on same table which has 6357592675 rows. See response of all 
three.


I tried the 1st one; it is giving duplicate row numbers.

> CREATE TEMPORARY TABLE INTER_ETL_T AS
select  *
,cast (floor(r*100) + 1 as bigint) + (100 * (row_number () over 
(partition by cast (floor(r*100) + 1 as bigint) order by null) - 1))  as 
ROW_NUM
from (select *,rand() as r from INTER_ETL) as t ;


> select ROW_NUM, count(*) from INTER_ETL_T group by ROW_NUM having count(*) > 1
> limit 10;

+--+--+--+
|ROW_NUM| _c1  |
+--+--+--+
| -2146932303  | 2|
| -2146924922  | 2|
| -2146922710  | 2|
| -2146901450  | 2|
| -2146897115  | 2|
| -2146874805  | 2|
| -2146869449  | 2|
| -2146865918  | 2|
| -2146864595  | 2|
| -2146857688  | 2|
+--+--+--+

On the 2nd one, it is not giving any duplicates, and it was much faster than
ROW_NUMBER() at least.

numRows=6357592675, totalSize=405516934422, rawDataSize=399159341747


And on the 3rd one, for consecutive numbers, the query is not compatible with Hive.

CREATE TEMPORARY TABLE INTER_ETL_T AS
select  *
       ,a.accum_rows + row_number () over (partition by abs(hash(t.m_d_key,t.s_g_key))%1 order by null) as ROW_NUM
from    INTER_ETL   as t
join    (select  abs(hash(m_d_key,s_g_key))%1   as group_id
                ,sum (count (*)) over (order by m_d_key,s_g_key rows between unbounded preceding and 1 preceding) - count(*)   as accum_rows
         from    INTER_ETL
         group by abs(hash(m_d_key,s_g_key))%1
        ) as a
on      a.group_id  = abs(hash(t.m_d_key,t.s_g_key))%1
;

Error :

Error: Error while compiling statement: FAILED: SemanticException End of a 
WindowFrame cannot be UNBOUNDED PRECEDING (state=42000,code=4)



Regards
Sanjiv Singh
Mob :  +091 9990-447-339

On Tue, Jun 28, 2016 at 6:16 PM, @Sanjiv Singh 
<sanjiv.is...@gmail.com<mailto:sanjiv.is...@gmail.com>> wrote:
thanks a lot.
let me give it a try.

Regards
Sanjiv Singh
Mob :  +091 9990-447-339

On Tue, Jun 28, 2016 at 5:32 PM, Markovitz, Dudu 
<dmarkov...@paypal.com<mailto:dmarkov...@paypal.com>> wrote:
There’s a distributed algorithm for window functions that is based on the ORDER
BY clause rather than the PARTITION BY clause.
I doubt it is implemented in Hive, but it’s worth a shot.

select  *
   ,row_number () over (order by rand()) as ETL_ROW_ID
fromINTER_ETL
;

For unique, not consecutive values you can try this:

select  *
   ,cast (floor(r*100) + 1 as bigint) + (100 * (row_number () 
over (partition by cast (floor(r*100) + 1 as bigint) order by null) - 1))  
as ETL_ROW_ID

from(select *,rand() as r from INTER_ETL) as t
;

If you have in your table a column/combination of columns with unified 
distribution you can also do something like this:

select  *
   , (abs(hash(MyCol1,MyCol2))%100 + 1) + (row_number () over 
(partition by (abs(hash(MyCol1,MyCol2))%100 + 1) order by null) - 1) * 
100L  as ETL_ROW_ID

fromINTER_ETL
;

For consecutive values you can do something (ugly…) like this:

select  *
   ,a.accum_rows + row_number () o

RE: Query Performance Issue : Group By and Distinct and load on reducer

2016-06-28 Thread Markovitz, Dudu
There’s a distributed algorithm for window functions that is based on the ORDER
BY clause rather than the PARTITION BY clause.
I doubt it is implemented in Hive, but it’s worth a shot.

select  *
   ,row_number () over (order by rand()) as ETL_ROW_ID
from    INTER_ETL
;

For unique, not consecutive values you can try this:

select  *
   ,cast (floor(r*100) + 1 as bigint) + (100 * (row_number () 
over (partition by cast (floor(r*100) + 1 as bigint) order by null) - 1))  
as ETL_ROW_ID

from    (select *,rand() as r from INTER_ETL) as t
;

If you have in your table a column/combination of columns with unified 
distribution you can also do something like this:

select  *
   , (abs(hash(MyCol1,MyCol2))%100 + 1) + (row_number () over 
(partition by (abs(hash(MyCol1,MyCol2))%100 + 1) order by null) - 1) * 
100L  as ETL_ROW_ID

from    INTER_ETL
;

For consecutive values you can do something (ugly…) like this:

select      *
           ,a.accum_rows + row_number () over (partition by abs(hash(t.MyCol1,t.MyCol2))%1 order by null)   as ETL_ROW_ID

from        INTER_ETL   as t

            join    (select     abs(hash(MyCol1,MyCol2))%1   as group_id
                               ,sum (count (*)) over (order by MyCol1,MyCol2 rows between unbounded preceding and 1 preceding) - count(*)   as accum_rows

                     from       INTER_ETL

                     group by   abs(hash(MyCol1,MyCol2))%1
                    )
                    as a

            on      a.group_id  =  abs(hash(t.MyCol1,t.MyCol2))%1

;
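
(Whichever variant is used, a quick sanity check is cheap — a sketch, assuming the output of one of the queries above was materialized into a table named INTER_ETL_WITH_ID, a hypothetical name:)

select      count (*)                      as total_rows
           ,count (distinct ETL_ROW_ID)    as distinct_ids
           ,min (ETL_ROW_ID)               as min_id
           ,max (ETL_ROW_ID)               as max_id

from        INTER_ETL_WITH_ID
;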



From: @Sanjiv Singh [mailto:sanjiv.is...@gmail.com]
Sent: Tuesday, June 28, 2016 11:52 PM
To: Markovitz, Dudu <dmarkov...@paypal.com>
Cc: user@hive.apache.org
Subject: Re: Query Performance Issue : Group By and Distinct and load on reducer

ETL_ROW_ID is supposed to be a consecutive number. I need to check whether having
merely unique numbers would break any logic.

Considering unique numbers for the ETL_ROW_ID column, what are the optimal options
available?
What if it has to be a consecutive number only?



Regards
Sanjiv Singh
Mob :  +091 9990-447-339

On Tue, Jun 28, 2016 at 4:17 PM, Markovitz, Dudu 
<dmarkov...@paypal.com<mailto:dmarkov...@paypal.com>> wrote:
I’m guessing ETL_ROW_ID should be unique but not necessarily contain only 
consecutive numbers?

From: @Sanjiv Singh 
[mailto:sanjiv.is...@gmail.com<mailto:sanjiv.is...@gmail.com>]
Sent: Tuesday, June 28, 2016 10:57 PM
To: Markovitz, Dudu <dmarkov...@paypal.com<mailto:dmarkov...@paypal.com>>
Cc: user@hive.apache.org<mailto:user@hive.apache.org>
Subject: Re: Query Performance Issue : Group By and Distinct and load on reducer

Hi Dudu,

You are correct ...ROW_NUMBER() is main culprit.

ROW_NUMBER() OVER Not Fast Enough With Large Result Set, any good solution?



Regards
Sanjiv Singh
Mob :  +091 9990-447-339

On Tue, Jun 28, 2016 at 3:42 PM, Markovitz, Dudu 
<dmarkov...@paypal.com<mailto:dmarkov...@paypal.com>> wrote:
The row_number operation seems to be skewed.

Dudu

From: @Sanjiv Singh 
[mailto:sanjiv.is...@gmail.com<mailto:sanjiv.is...@gmail.com>]
Sent: Tuesday, June 28, 2016 8:54 PM
To: user@hive.apache.org<mailto:user@hive.apache.org>
Subject: Query Performance Issue : Group By and Distinct and load on reducer

Hi All,

I am having performance issue with data skew of the distinct statement in 
Hive<http://stackoverflow.com/questions/37894023/understanding-the-data-skew-of-the-countdistinct-statement-in-hive>.
 See below query with DISTINCT operator.
Original Query :

SELECT DISTINCT
 SD.REGION
,SD.HEADEND
,SD.NETWORK
,SD.RETAILUNITCODE
,SD.LOGTIMEDATE
,SD.SPOTKEY
,SD.CRE_DT
,CASE
WHEN SD.LOGTIMEDATE IS NULL
THEN 'Y'
ELSE 'N'
END AS DROP_REASON
,ROW_NUMBER() OVER (
ORDER BY NULL
) AS ETL_ROW_ID
FROM INTER_ETL AS SD;

The table INTER_ETL used in the query is quite big.
From the logs, it seems that data skew for a specific set of values is causing
one reducer to do all the work. I tried to achieve the same through
GROUP BY and am still having the same issue.  Please help me understand the issue and
its resolution.
Query with Distinct V2 :

CREATE TEMPORARY TABLE ETL_TMP AS
SELECT DISTINCT dt.*
FROM (
SELECT S

RE: Query Performance Issue : Group By and Distinct and load on reducer

2016-06-28 Thread Markovitz, Dudu
I’m guessing ETL_ROW_ID should be unique but not necessarily contain only 
consecutive numbers?

From: @Sanjiv Singh [mailto:sanjiv.is...@gmail.com]
Sent: Tuesday, June 28, 2016 10:57 PM
To: Markovitz, Dudu <dmarkov...@paypal.com>
Cc: user@hive.apache.org
Subject: Re: Query Performance Issue : Group By and Distinct and load on reducer

Hi Dudu,

You are correct ...ROW_NUMBER() is main culprit.

ROW_NUMBER() OVER Not Fast Enough With Large Result Set, any good solution?



Regards
Sanjiv Singh
Mob :  +091 9990-447-339

On Tue, Jun 28, 2016 at 3:42 PM, Markovitz, Dudu 
<dmarkov...@paypal.com<mailto:dmarkov...@paypal.com>> wrote:
The row_number operation seems to be skewed.

Dudu

From: @Sanjiv Singh 
[mailto:sanjiv.is...@gmail.com<mailto:sanjiv.is...@gmail.com>]
Sent: Tuesday, June 28, 2016 8:54 PM
To: user@hive.apache.org<mailto:user@hive.apache.org>
Subject: Query Performance Issue : Group By and Distinct and load on reducer

Hi All,

I am having performance issue with data skew of the distinct statement in 
Hive<http://stackoverflow.com/questions/37894023/understanding-the-data-skew-of-the-countdistinct-statement-in-hive>.
 See below query with DISTINCT operator.
Original Query :

SELECT DISTINCT
 SD.REGION
,SD.HEADEND
,SD.NETWORK
,SD.RETAILUNITCODE
,SD.LOGTIMEDATE
,SD.SPOTKEY
,SD.CRE_DT
,CASE
WHEN SD.LOGTIMEDATE IS NULL
THEN 'Y'
ELSE 'N'
END AS DROP_REASON
,ROW_NUMBER() OVER (
ORDER BY NULL
) AS ETL_ROW_ID
FROM INTER_ETL AS SD;

The table INTER_ETL used in the query is quite big.
From the logs, it seems that data skew for a specific set of values is causing
one reducer to do all the work. I tried to achieve the same through
GROUP BY and am still having the same issue.  Please help me understand the issue and
its resolution.
Query with Distinct V2 :

CREATE TEMPORARY TABLE ETL_TMP AS
SELECT DISTINCT dt.*
FROM (
SELECT SD.REGION
,SD.HEADEND
,SD.NETWORK
,SD.RETAILUNITCODE
,SD.LOGTIMEDATE
,SD.SPOTKEY
,SD.CRE_DT
,CASE
WHEN SD.LOGTIMEDATE IS NULL
THEN 'Y'
ELSE 'N'
END AS DROP_REASON
,ROW_NUMBER() OVER (
ORDER BY NULL
) AS ETL_ROW_ID
FROM INTER_ETL AS SD
) AS dt;

Logs:

INFO  : Map 1: 107/107  Reducer 2: 417(+1)/418  Reducer 3: 0(+56)/418
INFO  : Map 1: 107/107  Reducer 2: 417(+1)/418  Reducer 3: 0(+56)/418
INFO  : Map 1: 107/107  Reducer 2: 417(+1)/418  Reducer 3: 0(+56)/418
INFO  : Map 1: 107/107  Reducer 2: 417(+1)/418  Reducer 3: 0(+56)/418
INFO  : Map 1: 107/107  Reducer 2: 417(+1)/418  Reducer 3: 0(+418)/418


Query With Group By:

CREATE TEMPORARY TABLE ETL_TMP AS
SELECT REGION
,HEADEND
,NETWORK
,RETAILUNITCODE
,LOGTIMEDATE
,SPOTKEY
,CRE_DT
,DROP_REASON
,ETL_ROW_ID
FROM (
SELECT SD.REGION
,SD.HEADEND
,SD.NETWORK
,SD.RETAILUNITCODE
,SD.LOGTIMEDATE
,SD.SPOTKEY
,SD.CRE_DT
,CASE
WHEN SD.LOGTIMEDATE IS NULL
THEN 'Y'
ELSE 'N'
END AS DROP_REASON
,ROW_NUMBER() OVER (
ORDER BY NULL
) AS ETL_ROW_ID
FROM INTER_ETL AS SD
) AS dt
GROUP BY
 REGION
,HEADEND
,NETWORK
,RETAILUNITCODE
,LOGTIMEDATE
,SPOTKEY
,CRE_DT
,DROP_REASON
,ETL_ROW_ID;

Logs:

INFO  : Map 1: 818/818  Reducer 2: 417(+1)/418  Reducer 3: 0(+418)/418
INFO  : Map 1: 818/818  Reducer 2: 417(+1)/418  Reducer 3: 0(+418)/418
INFO  : Map 1: 818/818  Reducer 2: 417(+1)/418  Reducer 3: 0(+418)/418
INFO  : Map 1: 818/818  Reducer 2: 417(+1)/418  Reducer 3: 0(+418)/418
INFO 

RE: Query Performance Issue : Group By and Distinct and load on reducer

2016-06-28 Thread Markovitz, Dudu
The row_number operation seems to be skewed.

Dudu

From: @Sanjiv Singh [mailto:sanjiv.is...@gmail.com]
Sent: Tuesday, June 28, 2016 8:54 PM
To: user@hive.apache.org
Subject: Query Performance Issue : Group By and Distinct and load on reducer

Hi All,

I am having performance issue with data skew of the distinct statement in 
Hive.
 See below query with DISTINCT operator.
Original Query :

SELECT DISTINCT
 SD.REGION
,SD.HEADEND
,SD.NETWORK
,SD.RETAILUNITCODE
,SD.LOGTIMEDATE
,SD.SPOTKEY
,SD.CRE_DT
,CASE
WHEN SD.LOGTIMEDATE IS NULL
THEN 'Y'
ELSE 'N'
END AS DROP_REASON
,ROW_NUMBER() OVER (
ORDER BY NULL
) AS ETL_ROW_ID
FROM INTER_ETL AS SD;

The table INTER_ETL used in the query is quite big.
From the logs, it seems that data skew for a specific set of values is causing
one reducer to do all the work. I tried to achieve the same through
GROUP BY and am still having the same issue.  Please help me understand the issue and
its resolution.
Query with Distinct V2 :

CREATE TEMPORARY TABLE ETL_TMP AS
SELECT DISTINCT dt.*
FROM (
SELECT SD.REGION
,SD.HEADEND
,SD.NETWORK
,SD.RETAILUNITCODE
,SD.LOGTIMEDATE
,SD.SPOTKEY
,SD.CRE_DT
,CASE
WHEN SD.LOGTIMEDATE IS NULL
THEN 'Y'
ELSE 'N'
END AS DROP_REASON
,ROW_NUMBER() OVER (
ORDER BY NULL
) AS ETL_ROW_ID
FROM INTER_ETL AS SD
) AS dt;

Logs:

INFO  : Map 1: 107/107  Reducer 2: 417(+1)/418  Reducer 3: 0(+56)/418
INFO  : Map 1: 107/107  Reducer 2: 417(+1)/418  Reducer 3: 0(+56)/418
INFO  : Map 1: 107/107  Reducer 2: 417(+1)/418  Reducer 3: 0(+56)/418
INFO  : Map 1: 107/107  Reducer 2: 417(+1)/418  Reducer 3: 0(+56)/418
INFO  : Map 1: 107/107  Reducer 2: 417(+1)/418  Reducer 3: 0(+418)/418


Query With Group By:

CREATE TEMPORARY TABLE ETL_TMP AS
SELECT REGION
,HEADEND
,NETWORK
,RETAILUNITCODE
,LOGTIMEDATE
,SPOTKEY
,CRE_DT
,DROP_REASON
,ETL_ROW_ID
FROM (
SELECT SD.REGION
,SD.HEADEND
,SD.NETWORK
,SD.RETAILUNITCODE
,SD.LOGTIMEDATE
,SD.SPOTKEY
,SD.CRE_DT
,CASE
WHEN SD.LOGTIMEDATE IS NULL
THEN 'Y'
ELSE 'N'
END AS DROP_REASON
,ROW_NUMBER() OVER (
ORDER BY NULL
) AS ETL_ROW_ID
FROM INTER_ETL AS SD
) AS dt
GROUP BY
 REGION
,HEADEND
,NETWORK
,RETAILUNITCODE
,LOGTIMEDATE
,SPOTKEY
,CRE_DT
,DROP_REASON
,ETL_ROW_ID;

Logs:

INFO  : Map 1: 818/818  Reducer 2: 417(+1)/418  Reducer 3: 0(+418)/418
INFO  : Map 1: 818/818  Reducer 2: 417(+1)/418  Reducer 3: 0(+418)/418
INFO  : Map 1: 818/818  Reducer 2: 417(+1)/418  Reducer 3: 0(+418)/418
INFO  : Map 1: 818/818  Reducer 2: 417(+1)/418  Reducer 3: 0(+418)/418
INFO  : Map 1: 818/818  Reducer 2: 417(+1)/418  Reducer 3: 0(+418)/418

Table details :

Beeline > dfs -ls /apps/hive/warehouse/PRD_DB.db/INTER_ETL ;
++--+
| DFS Output
 |
++--+
| Found 15 items
 |
| 

RE: Hive error : Can not convert struct<> to

2016-06-28 Thread Markovitz, Dudu
The staging table has no partitions, so no issue there.

Also, the error specifically refers to the conversion between the struct types.



Dudu





FAILED: SemanticException [Error 10044]: Line 2:23 Cannot insert into target 
table because column number/types are different ''CA'': Cannot convert column 4 
from struct to 
struct.





-Original Message-
From: Gopal Vijayaraghavan [mailto:go...@hortonworks.com] On Behalf Of Gopal 
Vijayaraghavan
Sent: Tuesday, June 28, 2016 6:17 PM
To: user@hive.apache.org
Subject: Re: Hive error : Can not convert struct<> to 



> PARTITION(state='CA')

> SELECT * WHERE se.adr.st='CA'

> FAILED: SemanticException [Error 10044]: Line 2:23 Cannot insert into

>target table because column number/types are different ''CA'':



The error is bogus, but the issue has to do with the "SELECT *".



Inserts where a partition is specified statically cannot have a partition 
column in the select.



So this is failing since it is trying to insert n-1 columns, because state='CA' 
cannot be repeated in the SELECT.
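
(A minimal illustration of that rule, with hypothetical tables — only the non-partition columns appear in the SELECT when the partition is given statically:)

-- hypothetical: target t (c1 int, c2 string) partitioned by (dt string); source src with columns c1, c2, dt
insert overwrite table t partition (dt='2016-06-28')
select  c1, c2            -- dt is not selected; it is fixed by the PARTITION clause
from    src
where   src.dt = '2016-06-28';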



Cheers,

Gopal










RE: Hive error : Can not convert struct<> to

2016-06-28 Thread Markovitz, Dudu
Hi

The fields' names are part of the struct definition.
Different names, different types of structs.

Dudu


e.g.

Setup

create table t1 (s struct<c1:int,c2:int>);
create table t2 (s struct<col1:int,col2:int>);

insert into table t1 select named_struct('c1',1,'c2',2);


--

insert into t2 select * from t1;

FAILED: SemanticException [Error 10044]: Line 1:12 Cannot insert into target
table because column number/types are different 't2': Cannot convert column 0
from struct<c1:int,c2:int> to struct<col1:int,col2:int>.

--

Solution 1 (per INSERT)

insert into t2 select named_struct('col1',s.c1,'col2',s.c2) from t1;

Solution 2 (one time)

alter table t2 change s s struct<c1:int,c2:int>;
insert into t2 select * from t1;



From: Kuldeep Chitrakar [mailto:kuldeep.chitra...@synechron.com]
Sent: Tuesday, June 28, 2016 4:03 PM
To: user@hive.apache.org
Subject: Hive error : Can not convert struct<> to 

Hi

I have staged table as

hive (revise)> desc employees_se;
OK
name            string
salary          float
subordinates  array
deductions  map
adr struct

I am trying to insert the data in partitioned table employees as

hive (revise)> desc employees;
OK
name            string
salary          float
subordinates    array
deductions      map
address         struct
state  string

# Partition Information
# col_namedata_type   comment

state  string
Time taken: 0.161 seconds, Fetched: 11 row(s)

Command

FROM employees_se se
INSERT OVERWRITE TABLE employees
PARTITION(state='CA')
SELECT * WHERE se.adr.st='CA'

But I am getting an error as

FAILED: SemanticException [Error 10044]: Line 2:23 Cannot insert into target 
table because column number/types are different ''CA'': Cannot convert column 4 
from struct to 
struct.


Any idea, as I do not see anything wrong.





RE: Querying Hive tables from Spark

2016-06-27 Thread Markovitz, Dudu
Hi Mich

I could not figure out what is the point you are trying to make.
Could you please clarify?

Thanks

Dudu

From: Mich Talebzadeh [mailto:mich.talebza...@gmail.com]
Sent: Monday, June 27, 2016 12:20 PM
To: user @spark ; user 
Subject: Querying Hive tables from Spark


Hi,

I have done some extensive tests with Spark querying Hive tables.

It appears to me that Spark does not rely on statistics that are collected by
Hive on, say, ORC tables. It seems that Spark uses its own optimization to query
the Hive tables irrespective of what Hive has collected by way of statistics etc.?

Case in point I have a FACT table bucketed on 5 dimensional foreign keys like 
below

 CREATE TABLE IF NOT EXISTS oraclehadoop.sales2
 (
  PROD_IDbigint   ,
  CUST_IDbigint   ,
  TIME_IDtimestamp,
  CHANNEL_ID bigint   ,
  PROMO_ID   bigint   ,
  QUANTITY_SOLD  decimal(10)  ,
  AMOUNT_SOLDdecimal(10)
)
CLUSTERED BY (PROD_ID,CUST_ID,TIME_ID,CHANNEL_ID,PROMO_ID) INTO 256 BUCKETS
STORED AS ORC
TBLPROPERTIES ( "orc.compress"="SNAPPY",
"orc.create.index"="true",
"orc.bloom.filter.columns"="PROD_ID,CUST_ID,TIME_ID,CHANNEL_ID,PROMO_ID",
"orc.bloom.filter.fpp"="0.05",
"orc.stripe.size"="268435456",
"orc.row.index.stride"="1")

Table is sorted in the order of prod_id, cust_id,time_id, channel_id and 
promo_id. It has 22 million rows.

A simple query like below:

val s = HiveContext.table("sales2")
  s.filter($"prod_id" ===13 && $"cust_id" === 50833 && $"time_id" === 
"2000-12-26 00:00:00" && $"channel_id" === 2 && $"promo_id" === 999 ).explain
  s.filter($"prod_id" ===13 && $"cust_id" === 50833 && $"time_id" === 
"2000-12-26 00:00:00" && $"channel_id" === 2 && $"promo_id" === 999 
).collect.foreach(println)

Shows the plan as

== Physical Plan ==
Filter (prod_id#10L = 13) && (cust_id#11L = 50833)) && (time_id#12 = 
9777888)) && (channel_id#13L = 2)) && (promo_id#14L = 999))
+- HiveTableScan 
[prod_id#10L,cust_id#11L,time_id#12,channel_id#13L,promo_id#14L,quantity_sold#15,amount_sold#16],
 MetastoreRelation oraclehadoop, sales2, None

Spark returns 24 rows pretty fast in 22 seconds.

Running the same on Hive with Spark as execution engine shows:

STAGE DEPENDENCIES:
  Stage-0 is a root stage
STAGE PLANS:
  Stage: Stage-0
Fetch Operator
  limit: -1
  Processor Tree:
TableScan
  alias: sales2
  Filter Operator
predicate: (prod_id = 13) and (cust_id = 50833)) and 
(UDFToString(time_id) = '2000-12-26 00:00:00')) and (channel_id = 2)) and 
(promo_id = 999)) (type: boolean)
Select Operator
  expressions: 13 (type: bigint), 50833 (type: bigint), 2000-12-26 
00:00:00.0 (type: timestamp), 2 (type: bigint), 999 (type: bigint), 
quantity_sold (type: decimal(10,0)), amount_sold (type: decimal(10,0))
  outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6
  ListSink

And Hive on Spark returns the same 24 rows in 30 seconds

OK, the Hive query is just slower with the Spark engine.

Assuming that the time taken is optimization time + query time, it appears that in most cases the optimization time does not really have much
impact on the overall performance?


Let me know your thoughts.


HTH


Dr Mich Talebzadeh



LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction.




RE: RegexSerDe with Filters

2016-06-24 Thread Markovitz, Dudu
This is tested, working code.
If you’re using https://regex101.com, first replace backslash pairs (\\) with
a single backslash (\) and also use the ‘g’ modifier in order to find all of
the matches.

The regular expression is -
(\S+)\s+([0-9]{4}-[0-9]{2}-[0-9]{2} 
[0-9]{2}:[0-9]{2}:[0-9]{2}),([0-9]{3})\s+(\S+)\s+\[([^]]+)\]\s+(\S+)\s+:\s+(TID:\s\d+)?\s*(.*)

I’ll send you a screen shot in private, since you don’t want to expose the data.

Dudu


From: Arun Patel [mailto:arunp.bigd...@gmail.com]
Sent: Friday, June 24, 2016 9:33 PM
To: user@hive.apache.org
Subject: Re: RegexSerDe with Filters

Looks like Regex pattern is not working.  I tested the pattern on 
https://regex101.com/ and it does not find any match.

Any suggestions?

On Thu, Jun 23, 2016 at 3:01 PM, Markovitz, Dudu 
<dmarkov...@paypal.com<mailto:dmarkov...@paypal.com>> wrote:
My pleasure.
Please feel free to reach me if needed.

Dudu

From: Arun Patel 
[mailto:arunp.bigd...@gmail.com<mailto:arunp.bigd...@gmail.com>]
Sent: Wednesday, June 22, 2016 2:57 AM
To: user@hive.apache.org<mailto:user@hive.apache.org>
Subject: Re: RegexSerDe with Filters

Thank you very much, Dudu.  This really helps.

On Tue, Jun 21, 2016 at 7:48 PM, Markovitz, Dudu 
<dmarkov...@paypal.com<mailto:dmarkov...@paypal.com>> wrote:
Hi

Here is the code (without the log data).

I’ve created some of the views using different text processing techniques.
The rest of the views can be created in similar ways.


Dudu



bash


hdfs dfs -mkdir -p /tmp/log/20160621
hdfs dfs -put logfile.txt /tmp/log/20160621


hive


/*
External table log

Defines all common columns + optional column 'tid' which appears in most 
log records + the rest of the log ('txt')

*/

drop table if exists log;

create external table log
(
c1  string
   ,ts  string
   ,ts_frac string
   ,log_rec_level   string
   ,c4  string
   ,c5  string
   ,tid string
   ,txt string
)
partitioned by (dt date)
row format serde 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
with serdeproperties ('input.regex'='(\\S+)\\s+([0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}),([0-9]{3})\\s+(\\S+)\\s+\\[([^]]+)\\]\\s+(\\S+)\\s+:\\s+(TID:\\s\\d+)?\\s*(.*)')
stored as textfile
location '/tmp/log'
;

alter table log add partition (dt=date '2016-06-21') location 
'/tmp/log/20160621';

select * from log;


/*
View log_v

Base view for all other views

*/

drop view if exists log_v;

create view log_v
as
select  c1
   ,cast (concat_ws ('.',ts,ts_frac) as timestamp)  as ts
   ,log_rec_level
   ,c4
   ,c5
   ,cast (ltrim(substr (tid,5)) as bigint)  as tid
   ,txt

fromlog
;

select * from log_v;



drop view if exists log_v_reaping_path;

create view log_v_reaping_path
as
select  c1
   ,ts
   ,log_rec_level
   ,c4
   ,c5
   ,substr (txt,15) as reaping_path

fromlog_V

where   txt like 'Reaping path: %'
;

select * from log_v_reaping_path;



drop view if exists log_v_published_to_kafka;

create view log_v_published_to_kafka
as
select  c1
   ,ts
   ,log_rec_level
   ,c4
   ,c5
   ,tid

   ,  ltrim (kv [' Key']  ) as key
   ,cast (ltrim (kv [' size'] ) as bigint ) as size
   ,  ltrim (kv [' topic']) as topic
   ,cast (ltrim (kv [' partition']) as int) as partition
   ,cast (ltrim (kv [' offset']   ) as bigint ) as offset

from   (select  c1
   ,ts
   ,log_rec_level
   ,c4
   ,c5
   ,tid
   ,str_to_map (substr (txt ,locate ('.',txt)+1),',',':')   
as kv

fromlog_V

where   txt like 'Published to Kafka. %'
)
as t
;

select * from log_v_published_to_kafka;



drop view if exists log_v_get_request;

create view log_v_get_request
as
select  c1
   ,ts
   ,log_rec_level
   ,c4
   ,c5
   ,tid
   ,substr (txt,31) as path

fromlog_V

where   txt like 'GET request received for path %'
;

RE: Optimize Hive Query

2016-06-23 Thread Markovitz, Dudu
Thanks, I wanted to rule out skew over m_d_key, sb_gu_key

Dudu

From: @Sanjiv Singh [mailto:sanjiv.is...@gmail.com]
Sent: Thursday, June 23, 2016 11:55 PM
To: user@hive.apache.org; Markovitz, Dudu <dmarkov...@paypal.com>; sanjiv singh 
(ME) <sanjiv.is...@gmail.com>
Subject: Re: Optimize Hive Query

Hi Dudu,

Please find the query response below.

Query :
select  m_d_key,sb_gu_key   ,count (*)   as cnt
fromtuning_dd_key
group bym_d_key,sb_gu_key
order bycnt desc
limit   100;

Output :

169042668  1361
168063808  1361
168569864  1361
168909889  1361
169864785  1361
168269717  1361
16101802821361
168913062  1361
168418183  1361
168003791  1361
16102010841361
168470942  1361
169234223  1361
168330286  1361
16129661921361
169008767  1361
168902598  1361
169878885  1361
168741214  1361
168732856  1361
169692696  1361
168072042  1361
168802681  1361
16140875581361
169027186  1361
169587342  1361
169699202  1361
168542344  1361
169680544  1361
168903570  1361
169542542  1361
4  3576041  1361
169126774  1361
169957826  1361
168345331  1361
169756883  1361
169399702  1361
189403442  1361
169746288  1361
169435202  1361
169069894  1361
169920826  1361
168765877  1361
168813448  1361
189635460  1361
168463714  1361
168166965  1361
169597903  1361
169432100  1361
168847857  1361
16139530681361
168744451  1361
168089463  1361
169674902  1361
168418200  1361
168028509  1361
169243086  1361
168892184  1361
168801594  1361
169849079  1361
168556753  1361
168979232  1361
168081946  1361
168724046  1361
169984434  1361
168651659  1361
169116866  1361
1  178700721361
168860630  1361
169888398  1361
169463782  1361
169602127  1361
169353325  1361
167991816  1361
169920420  1361
168497624  1361
168987980  1361
168234751  1361
168389490  1361
189975575  1361
168026536  1361
168790618  1361
169846791  1361
168363833  1361
169025525  1361
169241297  1361
168712487  1361
168692003  1361
169316523  1361
168124338  1361
169941027  1361
169547973  1361
168007742  1361
168418425  1361
168944940  1361
168890232  1361
169248984  1361
169784461  1361
169009374  1361
168395861  1361


Regards
Sanjiv Singh
Mob :  +091 9990-447-339

On Thu, Jun 23, 2016 at 4:01 AM, Markovitz, Dudu 
<dmarkov...@paypal.com<mailto:dmarkov...@paypal.com>> wrote:

Could you also add the results of the following query?



Thanks



Dudu





select  m_d_key

   ,sb_gu_key

   ,count (*)   as cnt



fromtuning_dd_key



group bym_d_key

   ,sb_gu_key



order bycnt desc



limit   100

;



-Original Message-
From: Gopal Vijayaraghavan 
[mailto:go...@hortonworks.com<mailto:go...@hortonworks.com>] On Behalf Of Gopal 
Vijayaraghavan
Sent: Thursday, June 23, 2016 9:45 AM
To: user@hive.apache.org<mailto:user@hive.apache.org>
Subject: Re: Optimize Hive Query





> Long running query :



Are you running this on MapReduce or Tez?



Please post the output of explain - if you are seeing > 1 shuffle edge in your 
query while having only one window for OVER(), that might be the reason.



OVER ( PARTITION BY  m_d_key , sb_gu_key  ORDER BY  t_ev_st_dt)



The multiple PTF operators should have been collapsed by the reduce 
sink-deduplication.



Cheers,

Gopal
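As an aside not taken from this thread: whether reduce-sink deduplication is enabled can be checked from the Hive CLI in the same way as other properties (the property name below is taken from Hive's defaults):

hive> set hive.optimize.reducededuplication;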







RE: Optimized Hive query

2016-06-23 Thread Markovitz, Dudu
Any progress on this one?

Dudu

From: Aviral Agarwal [mailto:aviral12...@gmail.com]
Sent: Wednesday, June 15, 2016 1:04 PM
To: user@hive.apache.org
Subject: Re: Optimized Hive query

I am OK with digging down to the ASTBuilder class. Can you guys point me to the
right class?

Meanwhile, "explain (rewrite | logical | extended)" are all unable to
flatten even a basic query of the form:

select * from ( select * from ( select c from d) alias_1 ) alias_2

into

select c from d

Thanks,
Aviral Agarwal

On Wed, Jun 15, 2016 at 6:24 AM, Gopal Vijayaraghavan 
> wrote:

> So I was hoping of using internal Hive CBO to somehow change the AST
>generated for the query somehow.

Hive does have an "explain rewrite" but that prints out the query before
CBO runs.

For CBO, you need to dig all the way down to the ASTBuilder class and work
upwards from there.

Perhaps add it as an "explain optimized" (there exists "explain logical",
"explain extended" and 2 versions of regular "explain").

Cheers,
Gopal




RE: RegexSerDe with Filters

2016-06-23 Thread Markovitz, Dudu
My pleasure.
Please feel free to reach me if needed.

Dudu

From: Arun Patel [mailto:arunp.bigd...@gmail.com]
Sent: Wednesday, June 22, 2016 2:57 AM
To: user@hive.apache.org
Subject: Re: RegexSerDe with Filters

Thank you very much, Dudu.  This really helps.

On Tue, Jun 21, 2016 at 7:48 PM, Markovitz, Dudu 
<dmarkov...@paypal.com<mailto:dmarkov...@paypal.com>> wrote:
Hi

Here is the code (without the log data).

I’ve created some of the views using different text processing techniques.
The rest of the views can be created in similar ways.


Dudu



bash


hdfs dfs -mkdir -p /tmp/log/20160621
hdfs dfs -put logfile.txt /tmp/log/20160621


hive


/*
External table log

Defines all common columns + optional column 'tid' which appears in most 
log records + the rest of the log ('txt')

*/

drop table if exists log;

create external table log
(
c1  string
   ,ts  string
   ,ts_frac string
   ,log_rec_level   string
   ,c4  string
   ,c5  string
   ,tid string
   ,txt string
)
partitioned by (dt date)
row format serde 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
with serdeproperties ('input.regex'='(\\S+)\\s+([0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}),([0-9]{3})\\s+(\\S+)\\s+\\[([^]]+)\\]\\s+(\\S+)\\s+:\\s+(TID:\\s\\d+)?\\s*(.*)')
stored as textfile
location '/tmp/log'
;

alter table log add partition (dt=date '2016-06-21') location 
'/tmp/log/20160621';

select * from log;


/*
View log_v

Base view for all other views

*/

drop view if exists log_v;

create view log_v
as
select  c1
   ,cast (concat_ws ('.',ts,ts_frac) as timestamp)  as ts
   ,log_rec_level
   ,c4
   ,c5
   ,cast (ltrim(substr (tid,5)) as bigint)  as tid
   ,txt

fromlog
;

select * from log_v;



drop view if exists log_v_reaping_path;

create view log_v_reaping_path
as
select  c1
   ,ts
   ,log_rec_level
   ,c4
   ,c5
   ,substr (txt,15) as reaping_path

fromlog_V

where   txt like 'Reaping path: %'
;

select * from log_v_reaping_path;



drop view if exists log_v_published_to_kafka;

create view log_v_published_to_kafka
as
select  c1
   ,ts
   ,log_rec_level
   ,c4
   ,c5
   ,tid

   ,  ltrim (kv [' Key']  ) as key
   ,cast (ltrim (kv [' size'] ) as bigint ) as size
   ,  ltrim (kv [' topic']) as topic
   ,cast (ltrim (kv [' partition']) as int) as partition
   ,cast (ltrim (kv [' offset']   ) as bigint ) as offset

from   (select  c1
   ,ts
   ,log_rec_level
   ,c4
   ,c5
   ,tid
   ,str_to_map (substr (txt ,locate ('.',txt)+1),',',':')   
as kv

fromlog_V

where   txt like 'Published to Kafka. %'
)
as t
;

select * from log_v_published_to_kafka;



drop view if exists log_v_get_request;

create view log_v_get_request
as
select  c1
   ,ts
   ,log_rec_level
   ,c4
   ,c5
   ,tid
   ,substr (txt,31) as path

fromlog_V

where   txt like 'GET request received for path %'
;

select * from log_v_get_request;



drop view if exists log_v_unlock_request;

create view log_v_unlock_request
as
select  c1
   ,ts
   ,log_rec_level
   ,c4
   ,c5
   ,tid
   ,regexp_extract (txt,'rowkey (\\S+)',1)  as 
rowkey
   ,regexp_extract (txt,'lock id (\\S+)',1) as 
lock_id

fromlog_V

where   txt like 'Unlock request for schema DU %'
;


From: Markovitz, Dudu 
[mailto:dmarkov...@paypal.com<mailto:dmarkov...@paypal.com>]
Sent: Tuesday, June 21, 2016 2:26 PM
To: user@hive.apache.org<mailto:user@hive.apache.org>
Subject: RE: RegexSerDe with Filters

Hi

I would suggest creating a single external table with daily partitions and 
multiple views each with the appropriate filtering.
If you’ll send me a log sample (~100 rows) I’ll send you an example.

RE: Optimize Hive Query

2016-06-23 Thread Markovitz, Dudu
Could you also add the results of the following query?



Thanks



Dudu





select  m_d_key

   ,sb_gu_key

   ,count (*)   as cnt



fromtuning_dd_key



group bym_d_key

   ,sb_gu_key



order bycnt desc



limit   100

;



-Original Message-
From: Gopal Vijayaraghavan [mailto:go...@hortonworks.com] On Behalf Of Gopal 
Vijayaraghavan
Sent: Thursday, June 23, 2016 9:45 AM
To: user@hive.apache.org
Subject: Re: Optimize Hive Query





> Long running query :



Are you running this on MapReduce or Tez?



Please post the output of explain - if you are seeing > 1 shuffle edge in your 
query while having only one window for OVER(), that might be the reason.



OVER ( PARTITION BY  m_d_key , sb_gu_key  ORDER BY  t_ev_st_dt)



The multiple PTF operators should have been collapsed by the reduce 
sink-deduplication.



Cheers,

Gopal






RE: RegexSerDe with Filters

2016-06-21 Thread Markovitz, Dudu
Hi

Here is the code (without the log data).

I’ve created some of the views using different text processing techniques.
The rest of the views can be created in similar ways.


Dudu



bash


hdfs dfs -mkdir -p /tmp/log/20160621
hdfs dfs -put logfile.txt /tmp/log/20160621


hive


/*
External table log

Defines all common columns + optional column 'tid' which appears in most 
log records + the rest of the log ('txt')

*/

drop table if exists log;

create external table log
(
c1  string
   ,ts  string
   ,ts_frac string
   ,log_rec_level   string
   ,c4  string
   ,c5  string
   ,tid string
   ,txt string
)
partitioned by (dt date)
row format serde 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
with serdeproperties ('input.regex'='(\\S+)\\s+([0-9]{4}-[0-9]{2}-[0-9]{2} 
[0-9]{2}:[0-9]{2}:[0-9]{2}),([0-9]{3})\\s+(\\S+)\\s+\\[([^]]+)\\]\\s+(\\S+)\\s+:\\s+(TID:\\s\\d+)?\\s*(.*)')
stored as textfile
location '/tmp/log'
;

alter table log add partition (dt=date '2016-06-21') location 
'/tmp/log/20160621';

select * from log;


/*
View log_v

Base view for all other views

*/

drop view if exists log_v;

create view log_v
as
select  c1
   ,cast (concat_ws ('.',ts,ts_frac) as timestamp)  as ts
   ,log_rec_level
   ,c4
   ,c5
   ,cast (ltrim(substr (tid,5)) as bigint)  as tid
   ,txt

fromlog
;

select * from log_v;



drop view if exists log_v_reaping_path;

create view log_v_reaping_path
as
select  c1
   ,ts
   ,log_rec_level
   ,c4
   ,c5
   ,substr (txt,15) as reaping_path

fromlog_V

where   txt like 'Reaping path: %'
;

select * from log_v_reaping_path;



drop view if exists log_v_published_to_kafka;

create view log_v_published_to_kafka
as
select  c1
   ,ts
   ,log_rec_level
   ,c4
   ,c5
   ,tid

   ,  ltrim (kv [' Key']  ) as key
   ,cast (ltrim (kv [' size'] ) as bigint ) as size
   ,  ltrim (kv [' topic']) as topic
   ,cast (ltrim (kv [' partition']) as int) as partition
   ,cast (ltrim (kv [' offset']   ) as bigint ) as offset

from   (select  c1
   ,ts
   ,log_rec_level
   ,c4
   ,c5
   ,tid
   ,str_to_map (substr (txt ,locate ('.',txt)+1),',',':')   
as kv

fromlog_V

where   txt like 'Published to Kafka. %'
)
as t
;

select * from log_v_published_to_kafka;



drop view if exists log_v_get_request;

create view log_v_get_request
as
select  c1
   ,ts
   ,log_rec_level
   ,c4
   ,c5
   ,tid
   ,substr (txt,31) as path

fromlog_V

where   txt like 'GET request received for path %'
;

select * from log_v_get_request;



drop view if exists log_v_unlock_request;

create view log_v_unlock_request
as
select  c1
   ,ts
   ,log_rec_level
   ,c4
   ,c5
   ,tid
   ,regexp_extract (txt,'rowkey (\\S+)',1)  as rowkey
   ,regexp_extract (txt,'lock id (\\S+)',1) as lock_id

fromlog_V

where   txt like 'Unlock request for schema DU %'
;
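
For completeness, the same quick check as after the other views (not part of the original message):

select * from log_v_unlock_request;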


From: Markovitz, Dudu [mailto:dmarkov...@paypal.com]
Sent: Tuesday, June 21, 2016 2:26 PM
To: user@hive.apache.org
Subject: RE: RegexSerDe with Filters

Hi

I would suggest creating a single external table with daily partitions and 
multiple views each with the appropriate filtering.
If you’ll send me log sample (~100 rows) I’ll send you an example.

Dudu

From: Arun Patel [mailto:arunp.bigd...@gmail.com]
Sent: Tuesday, June 21, 2016 1:51 AM
To: user@hive.apache.org<mailto:user@hive.apache.org>
Subject: RegexSerDe with Filters

Hello Hive Experts,

I use flume to ingest application specific logs from Syslog to HDFS.  
Currently, I grep the HDFS directory for specific patterns (for multiple types 
of requests) and then create reports.  However, generating reports for 

RE: if else condition in hive

2016-06-21 Thread Markovitz, Dudu
I understand that you’re looking for the functionality of the MERGE statement.

1)
MERGE is currently an open issue.
https://issues.apache.org/jira/browse/HIVE-10924

2)
UPDATE and DELETE (and MERGE in the future) work under a bunch of limitations, 
e.g. –
Currently only ORC tables are supported
https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions

3)
If we’re not working with transactional tables, we have no choice but to create a
temporary target table (‘trg_tmp’) that will hold the new (updated and
inserted) data and then replace the original table/content (‘trg’) with the new
one, in one of the following ways:

· 1

o   Drop table trg;

o   Alter table trg_tmp rename to trg;

· 2

o   Drop table trg_bck;

o   Alter table trg rename to trg_bck;

o   Alter table trg_tmp rename to trg;

· 3

o   Truncate table trg;

o   Insert into trg select * from trg_tmp;


I would recommend (2).

· We keep the old table as a backup in case something goes wrong (unlike (1)).

· We have the minimum downtime (unlike (3)).
The downsides are –

· Renaming the ‘trg’ table requires that no one touches the table at that time

· We hold on to the storage of ‘trg’, ‘trg_bck’ and, for some of the time,
‘trg_tmp’
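
A minimal HiveQL sketch of option (2), assuming a non-transactional target ‘trg’, a source table ‘src’, a join key ‘id’ and identical column lists in both tables (src and id are illustrative names, not from the thread):

-- build the new content: rows coming from the source plus target rows the source does not touch
create table trg_tmp
as
select  *
from   (select  s.*  from src s                     -- updated / inserted rows
        union all
        select  t.*  from trg t                     -- rows with no match in the source
        left join src s on t.id = s.id
        where   s.id is null
       ) u
;

-- swap, keeping the previous content as a backup
drop table if exists trg_bck;
alter table trg     rename to trg_bck;
alter table trg_tmp rename to trg;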

One question regarding your specific case –
For matching rows (update operation), do we need any data from the target table 
or can we take all the required columns from the source table?


Dudu



From: raj hive [mailto:raj.hiv...@gmail.com]
Sent: Tuesday, June 21, 2016 2:22 PM
To: user@hive.apache.org
Subject: if else condition in hive

Hi friends,
INSERT, UPDATE and DELETE commands are working fine in my Hive environment after
changing the configuration. Now I have to execute a query like the SQL below
in Hive.
If exists(select * from tablename where columnname=something)
  update table set column1=something where columnname=something
 else
  insert into tablename values ...
Can anyone help me with how to do it in Hive?
Thanks
Raj


RE: RegexSerDe with Filters

2016-06-21 Thread Markovitz, Dudu
Hi

I would suggest creating a single external table with daily partitions and 
multiple views each with the appropriate filtering.
If you’ll send me log sample (~100 rows) I’ll send you an example.

Dudu

From: Arun Patel [mailto:arunp.bigd...@gmail.com]
Sent: Tuesday, June 21, 2016 1:51 AM
To: user@hive.apache.org
Subject: RegexSerDe with Filters

Hello Hive Experts,

I use flume to ingest application specific logs from Syslog to HDFS.  
Currently, I grep the HDFS directory for specific patterns (for multiple types 
of requests) and then create reports.  However, generating reports for weekly
and monthly periods is not scalable.

I would like to create multiple external tables on the daily HDFS directory,
partitioned by date, with RegexSerDe, and then create separate Parquet tables for
every kind of request.

The question is - how do I create multiple (about 20) RegexSerDe tables on the same
data, applying filters?  This will be just like the 20 grep commands I am running
today.

Example:  hadoop fs -cat /user/prod/2016-06-20/* | grep 'STORE Request 
Received for APP' | awk '{print $4, $13, $14, $17, $20}'
hadoop fs -cat /user/prod/2016-06-20/* | grep 'SCAN Request 
Received for APP' | awk '{print $4, $14, $19, $21, $22}'
hadoop fs -cat /user/prod/2016-06-20/* | grep 'TOTAL TIME' 
| awk '{print $4, $24}'

I would like to create tables which do this kind of job and then write the
output to Parquet tables.

Please let me know how this can be done.  Thank you!

Regards,
Arun


RE: Is there any GROUP_CONCAT Function in Hive

2016-06-15 Thread Markovitz, Dudu
Have you tried to increase the heap size (worked for me)?

E.g. -

bash
mkdir t
awk 'BEGIN{OFS=",";for(i=0;i<1000;++i){print i,i}}' > t/t.csv
hdfs dfs -put t /tmp
export HADOOP_OPTS="$HADOOP_OPTS -Xmx1024m"

hive
create external table t (i int,s string) row format delimited fields terminated 
by ',' location '/tmp/t';
select i%10,collect_list(s) from t group by i%10;
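
If the goal is MySQL-style GROUP_CONCAT output (a single comma-separated string per group rather than an array), collect_list can be wrapped in concat_ws; a small sketch against the same table:

select i%10, concat_ws(',', collect_list(s)) as grouped from t group by i%10;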


Dudu

From: Mahender Sarangam [mailto:mahender.bigd...@outlook.com]
Sent: Thursday, June 16, 2016 1:47 AM
To: user@hive.apache.org
Subject: Is there any GROUP_CONCAT Function in Hive


Hi,

We have Hive table with 3 GB of data like 100 rows. We are looking for any 
functionality in hive, which can perform GROUP_CONCAT Function.

We tried to implement the GROUP_CONCAT function using collect_list and collect_set,
but we are getting a heap space error, because for each group key around 10
rows are present, and these rows need to be concatenated.

Any direct way to concat row data into single string column by GROUP BY.


RE: Is there any GROUP_CONCAT Function in Hive

2016-06-15 Thread Markovitz, Dudu
Hi

Out of curiosity, could you share what is the motivation for that?

Thanks

Dudu

From: Mahender Sarangam [mailto:mahender.bigd...@outlook.com]
Sent: Thursday, June 16, 2016 1:47 AM
To: user@hive.apache.org
Subject: Is there any GROUP_CONCAT Function in Hive


Hi,

We have Hive table with 3 GB of data like 100 rows. We are looking for any 
functionality in hive, which can perform GROUP_CONCAT Function.

We tried to implement the GROUP_CONCAT function using collect_list and collect_set,
but we are getting a heap space error, because for each group key around 10
rows are present, and these rows need to be concatenated.

Any direct way to concat row data into single string column by GROUP BY.


RE: column statistics for non-primitive types

2016-06-15 Thread Markovitz, Dudu
Hi Michael

Case ‘b’ (“answer query directly”) seems to be risky in an open system.
Files/directories can be deleted directly in the filesystem without Hive having
any knowledge of it, which will lead to wrong query results.

Dudu

From: Michael Häusler [mailto:mich...@akatose.de]
Sent: Tuesday, June 14, 2016 11:43 PM
To: user@hive.apache.org
Subject: Re: column statistics for non-primitive types

Hi Pengcheng,

(1)
statistics on non-primitive columns can be just as useful as on primitive 
columns, e.g.,
DROP TABLE IF EXISTS foo;
CREATE TABLE foo (id BIGINT, someArray ARRAY, someStruct 
STRUCT);

a) query optimization
Let foo be a huge table that needs to be joined with another huge table bar 
like this

SELECT
f.id
FROM
foo f
JOIN
bar b
ON
f.id = b.id
WHERE
f.someArray IS NOT NULL

If statistics tell us that #nulls in someArray is small, we could apply a 
different join strategy (e.g., map-side join, bar main table, filtered foo as 
hash table)

b) answer query directly

SELECT
COUNT(DISTINCT someStruct)
FROM
foo;

Such a query can easily be answered directly from stats.



(2)

Do you happen to know, whether HIVE-11160 also works for CTAS?
Because a quick test of the configuration property did not work for me:

hive> SET hive.stats.fetch.column.stats=true;
hive> DROP TABLE IF EXISTS foo;
OK
Time taken: 6.585 seconds
hive> CREATE TABLE foo AS
> SELECT
> 1 AS foo;
Query ID = haeusler_20160614203002_7a47459d-349b-4012-ac7f-b2cc867b87ef
Total jobs = 1
Launching Job 1 out of 1


Status: Running (Executing on YARN cluster with App id 
application_1465334589772_15920)


VERTICES  STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED

Map 1 ..........   SUCCEEDED      1          1        0        0       0       0

VERTICES: 01/01  [==>>] 100%  ELAPSED TIME: 2.89 s

Moving data to: hdfs://invcluster/user/hive/warehouse/haeusler.db/foo
Table haeusler.foo stats: [numFiles=1, numRows=1, totalSize=194, rawDataSize=4]
OK
Time taken: 8.088 seconds
hive> DESCRIBE FORMATTED foo.foo;
OK
# col_name  data_type   min max 
num_nulls   distinct_count  avg_col_len 
max_col_len num_trues   num_falses  
comment

foo int 


from deserializer
Time taken: 0.197 seconds, Fetched: 3 row(s)

^^^ the table creation works, but I don't get any column stats.
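
For comparison, a minimal sketch of collecting the column statistics explicitly after the CTAS (standard ANALYZE syntax; not something tried above):

hive> ANALYZE TABLE foo COMPUTE STATISTICS FOR COLUMNS;
hive> DESCRIBE FORMATTED foo.foo;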


Best regards
Michael



On 2016-06-14, at 22:23, Pengcheng Xiong 
> wrote:

Hi Michael,

(1) We collect column stats for the following purposes: (a) query
optimization, esp. join reordering and big/small table size estimation. More
recently, we also use them to remove filters. You can refer to the Calcite rules. (b)
Answering queries directly through the metastore. You can refer to the configuration
HIVEOPTIMIZEMETADATAQUERIES ("hive.compute.query.using.stats").

We can do stats for non-primitive columns, but we need to know the 
motivation to do so before we do it. If you can, could you please list some?

   (2) There is a configuration "hive.stats.fetch.column.stats". If you set it 
to true, it will automatically collect column stats for you when you insert 
into/overwrite a new table. You can refer to HIVE-11160 for more details.

   Hope my answers help.

Thanks

Best.
Pengcheng


On Tue, Jun 14, 2016 at 1:03 PM, Michael Häusler 
> wrote:
Hi there,

there might be two topics here:

1) feasibility of stats for non-primitive columns
2) ease of use


1) feasibility of stats for non-primitive columns:

Hive currently collects different kind of statistics for different kind of 
types:
numeric values: min, max, #nulls, #distincts
boolean values: #nulls, #trues, #falses
string values: #nulls, #distincts, avgLength, maxLength

So, it seems quite possible to also collect at least partial stats for 
top-level non-primitive columns, e.g.:
array values: #nulls, #distincts, avgLength, maxLength
map values: #nulls, #distincts, avgLength, maxLength
struct values: #nulls, #distincts
union values: #nulls, #distincts


2) ease of use

The presence of a single non-primitive column currently breaks the use of the 
convenience shorthand to gather statistics for all columns 

RE: Optimized Hive query

2016-06-14 Thread Markovitz, Dudu
1)
Cost-based optimization in 
Hive<https://cwiki.apache.org/confluence/display/Hive/Cost-based+optimization+in+Hive>
https://cwiki.apache.org/confluence/display/Hive/Cost-based+optimization+in+Hive

Calcite is an open source, Apache Licensed, query planning and execution 
framework. Many pieces of Calcite are derived from Eigenbase Project. Calcite 
has optional JDBC server, query parser and validator, query optimizer and 
pluggable data source adapters. One of the available Calcite optimizer is a 
cost based optimizer based on volcano paper.

2)
The Volcano Optimizer Generator: Extensibility and Efficient Search
Goetz Graefe, Portland State University
William J. McKenna, University of Colorado at Boulder
From Proc. IEEE Conf. on Data Eng., Vienna, April 1993, p. 209.

2.2. Optimizer Generator Input and Optimizer Operation
…
The user queries to be optimized by a generated optimizer are specified as an 
algebra
expression (tree) of logical operators. The translation from a user interface 
into a logical algebra
expression must be performed by the parser and is not discussed here.
…

3)
Abstract syntax tree
From Wikipedia, the free encyclopedia
https://en.wikipedia.org/wiki/Abstract_syntax_tree

In computer science<https://en.wikipedia.org/wiki/Computer_science>, an 
abstract syntax tree (AST), or just syntax tree, is a 
tree<https://en.wikipedia.org/wiki/Directed_tree> representation of the 
abstract syntactic<https://en.wikipedia.org/wiki/Abstract_syntax> structure of 
source code<https://en.wikipedia.org/wiki/Source_code> written in a programming 
language<https://en.wikipedia.org/wiki/Programming_language>.


From: Mich Talebzadeh [mailto:mich.talebza...@gmail.com]
Sent: Tuesday, June 14, 2016 7:58 PM
To: user <user@hive.apache.org>
Subject: Re: Optimized Hive query

Amazing. That is the first time I have heard that an optimizer does not have
the concept of a flattened query.

So what is the definition of a syntax tree? Are you referring to the industry
notation "access path"? This is the first time I have heard of such a notation
being called a syntax tree. Are you stating that there is some explanation of the
optimizer's "access path" that comes out independent of the optimizer and is
called a syntax tree?




Dr Mich Talebzadeh



LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com<http://talebzadehmich.wordpress.com/>



On 14 June 2016 at 17:46, Markovitz, Dudu 
<dmarkov...@paypal.com<mailto:dmarkov...@paypal.com>> wrote:
It’s not the query that is being optimized but the syntax tree that is created
upon the query (execute “explain extended select …”).
At no point do we have a “flattened query”.

Dudu

From: Aviral Agarwal 
[mailto:aviral12...@gmail.com<mailto:aviral12...@gmail.com>]
Sent: Tuesday, June 14, 2016 10:37 AM
To: user@hive.apache.org<mailto:user@hive.apache.org>
Subject: Re: Optimized Hive query

Hi,
Thanks for the replies.
I already knew that the optimizer already does that.
My usecase is a bit different though.
I want to display the flattened query back to the user.
So I was hoping of using internal Hive CBO to somehow change the AST generated 
for the query somehow.

Thanks,
Aviral

On Tue, Jun 14, 2016 at 12:42 PM, Gopal Vijayaraghavan 
<gop...@apache.org<mailto:gop...@apache.org>> wrote:

> You can see that you get identical execution plans for the nested query
>and the flatten one.

Wasn't that always though. Back when I started with Hive, before Stinger,
it didn't have the identity project remover.

To know if your version has this fix, try looking at

hive> set hive.optimize.remove.identity.project;


Cheers,
Gopal






RE: Optimized Hive query

2016-06-14 Thread Markovitz, Dudu
It’s not the query that is being optimized but the syntax tree that is created
upon the query (execute “explain extended select …”).
At no point do we have a “flattened query”.

Dudu

From: Aviral Agarwal [mailto:aviral12...@gmail.com]
Sent: Tuesday, June 14, 2016 10:37 AM
To: user@hive.apache.org
Subject: Re: Optimized Hive query

Hi,
Thanks for the replies.
I already knew that the optimizer already does that.
My usecase is a bit different though.
I want to display the flattened query back to the user.
So I was hoping of using internal Hive CBO to somehow change the AST generated 
for the query somehow.

Thanks,
Aviral

On Tue, Jun 14, 2016 at 12:42 PM, Gopal Vijayaraghavan 
> wrote:

> You can see that you get identical execution plans for the nested query
>and the flatten one.

Wasn't that always though. Back when I started with Hive, before Stinger,
it didn't have the identity project remover.

To know if your version has this fix, try looking at

hive> set hive.optimize.remove.identity.project;


Cheers,
Gopal






RE: Issue in Insert Overwrite directory operation

2016-06-14 Thread Markovitz, Dudu
There seems to be a known bug, fixed in version 1.3:

https://issues.apache.org/jira/browse/HIVE-12364

Dudu

From: Udit Mehta [mailto:ume...@groupon.com]
Sent: Tuesday, June 14, 2016 2:55 AM
To: user@hive.apache.org
Subject: Issue in Insert Overwrite directory operation

Hi All,
I see a weird issue when trying to do a "INSERT OVERWRITE DIRECTORY" operation. 
The query seems to work when I limit the data set but fails with the following 
exception if the data set is larger:

Failed with exception Unable to move source 
hdfs://namenode/user/grp_admin/external_test1/output/.hive-staging_hive_2016-06-13_21-34-36_449_7074605
 to destination /user/grp_admin/external_test1/output
I ensured that the directory has enough space, so there are no disk quota issues
here.
Does anyone know what is happening here?
Running Hive on Tez. Hive version is 1.2.1. Fails even with Hive on MR.

Run 1 with smaller data set:

> insert overwrite directory '/user/grp_admin/external_test1/output' row 
format delimited fields terminated by '\t'

> select * from test_table limit 1000;

Query ID = hive_20160613213624_d9d54ef0-0b28-4e98-b49e-197043f67c43

Total jobs = 3

Launching Job 1 out of 3





Status: Running (Executing on YARN cluster with App id 
application_1464825277140_26149)





VERTICES  STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED



Map 1 ..........   SUCCEEDED     12         12        0        0       0       0

Reducer 2 ......   SUCCEEDED      1          1        0        0       0       0



VERTICES: 02/02  [==>>] 100%  ELAPSED TIME: 21.03 s



Stage-4 is selected by condition resolver.

Stage-3 is filtered out by condition resolver.

Stage-5 is filtered out by condition resolver.

Moving data to: 
hdfs://namenode/user/grp_admin/external_test1/output/.hive-staging_hive_2016-06-13_21-36-24_620_4270199609063911787-1/-ext-1

Moving data to: /user/grp_admin/external_test1/output

OK

Time taken: 21.501 seconds



Run 2 with larger data set:

> insert overwrite directory '/user/grp_admin/external_test1/output' row 
format delimited fields terminated by '\t'


> select * from test_table;


Query ID = hive_20160613213436_a1b0087a-84ff-48a0-ac76-25811aaafe28


Total jobs = 3


Launching Job 1 out of 3


Tez session was closed. Reopening...


Session re-established.








Status: Running (Executing on YARN cluster with App id 
application_1464825277140_26149)








VERTICES  STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED





Map 1 ..........   SUCCEEDED     12         12        0        0       0       0





VERTICES: 01/01  [==>>] 100%  ELAPSED TIME: 72.69 s





Stage-4 is selected by condition resolver.


Stage-3 is filtered out by condition resolver.


Stage-5 is filtered out by condition resolver.


Moving data to: 
hdfs://namenode/user/grp_admin/external_test1/output/.hive-staging_hive_2016-06-13_21-34-36_449_7074605303086037347-1/-ext-1


Moving data to: /user/grp_admin/external_test1/output


Failed with exception Unable to move source 
hdfs://namenode/user/grp_admin/external_test1/output/.hive-staging_hive_2016-06-13_21-34-36_449_7074605303086037347-1/-ext-1/00_0
 to destination /user/grp_admin/external_test1/output


FAILED: Execution Error, return code 1 from 
org.apache.hadoop.hive.ql.exec.MoveTask






RE: same hdfs location with different schema exception

2016-06-14 Thread Markovitz, Dudu
Hi

Can you please share the query?

Thanks

Dudu

From: 赵升/赵荣生 [mailto:roncenz...@qq.com]
Sent: Tuesday, June 14, 2016 5:26 AM
To: user 
Subject: same hdfs location with different schema exception

Hi all:
  I have a question when using hive. It's described as follows:

  Firstly, I create two table:
CREATE TABLE `roncen_tmp`(
`a` bigint,
`b` bigint,
`c` string);

CREATE EXTERNAL TABLE `ext_roncen`(
`aaa` bigint)
LOCATION 'hdfs://xxx/user/hive/warehouse/roncen_tmp'

  You see, the two tables have the same HDFS location, but they have different
schemas.

  Then:
  When I run a sql which includes the two tables, the exception occur.


2016-06-14 10:16:56,807 INFO [main] 
org.apache.hadoop.hive.ql.exec.MapJoinOperator: Initializing child 2 MAPJOIN

2016-06-14 10:16:56,807 INFO [main] 
org.apache.hadoop.hive.ql.exec.MapJoinOperator: Initializing Self MAPJOIN[2]

2016-06-14 10:16:56,815 ERROR [main] 
org.apache.hadoop.hive.ql.exec.HashTableDummyOperator: Generating output obj 
inspector from dummy object error

java.lang.RuntimeException: cannot find field aaa from [0:a, 1:b, 2:c]

at 
org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils.getStandardStructFieldRef(ObjectInspectorUtils.java:410)

at 
org.apache.hadoop.hive.serde2.BaseStructObjectInspector.getStructFieldRef(BaseStructObjectInspector.java:133)

at 
org.apache.hadoop.hive.ql.exec.ExprNodeColumnEvaluator.initialize(ExprNodeColumnEvaluator.java:55)

at 
org.apache.hadoop.hive.ql.exec.JoinUtil.getObjectInspectorsFromEvaluators(JoinUtil.java:68)

at 
org.apache.hadoop.hive.ql.exec.AbstractMapJoinOperator.initializeOp(AbstractMapJoinOperator.java:68)

at 
org.apache.hadoop.hive.ql.exec.MapJoinOperator.initializeOp(MapJoinOperator.java:95)

at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:385)

at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:469)

at 
org.apache.hadoop.hive.ql.exec.Operator.initializeChildren(Operator.java:425)

at 
org.apache.hadoop.hive.ql.exec.HashTableDummyOperator.initializeOp(HashTableDummyOperator.java:40)

at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:385)

at 
org.apache.hadoop.hive.ql.exec.mr.ExecMapper.configure(ExecMapper.java:144)

at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)

at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

at java.lang.reflect.Method.invoke(Method.java:606)

at 
org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:106)

at 
org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:75)

at 
org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)

at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:38)

at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)

at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

at java.lang.reflect.Method.invoke(Method.java:606)

at 
org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:106)

at 
org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:75)

at 
org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)

at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)

at org.apache.hadoop.mapred.MapTask.run(MapTask.java:352)

at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:162)

at java.security.AccessController.doPrivileged(Native Method)

at javax.security.auth.Subject.doAs(Subject.java:415)

at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1680)

at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:157)

2016-06-14 10:16:56,820 INFO [main] 
org.apache.hadoop.hive.ql.exec.HashTableDummyOperator: Initialization Done 5 
HASHTABLEDUMMY

2016-06-14 10:16:56,876 INFO [main] org.apache.hadoop.hive.ql.log.PerfLogger: 


2016-06-14 10:16:56,876 FATAL [main] 
org.apache.hadoop.hive.ql.exec.mr.ExecMapper: java.lang.NullPointerException

at 
org.apache.hadoop.hive.ql.exec.MapJoinOperator.loadHashTable(MapJoinOperator.java:189)

at 
org.apache.hadoop.hive.ql.exec.MapJoinOperator.cleanUpInputFileChangedOp(MapJoinOperator.java:216)

at 
org.apache.hadoop.hive.ql.exec.Operator.cleanUpInputFileChanged(Operator.java:1051)

at 
org.apache.hadoop.hive.ql.exec.Operator.cleanUpInputFileChanged(Operator.java:1055)

at 
org.apache.hadoop.hive.ql.exec.Operator.cleanUpInputFileChanged(Operator.java:1055)

at 

RE: Optimized Hive query

2016-06-14 Thread Markovitz, Dudu
Hi

You don’t need to do anything, the optimizer does it for you.
You can see that you get identical execution plans for the nested query and the
flattened one.

Dudu


> create table t (i int);

> explain select * from t;
+---+--+
|  Explain  
|
+---+--+
| STAGE DEPENDENCIES:   
|
|   Stage-0 is a root stage 
|
|   
|
| STAGE PLANS:  
|
|   Stage: Stage-0  
|
| Fetch Operator
|
|   limit: -1   
|
|   Processor Tree: 
|
| TableScan 
|
|   alias: t
|
|   Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column 
stats: NONE|
|   Select Operator 
|
| expressions: i (type: int)
|
| outputColumnNames: _col0  
|
| Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column 
stats: NONE  |
| ListSink  
|
|   
|
+---+--+

> explain select * from (select * from (select * from (select * from (select * 
> from t) t) t) t) t;
+---+--+
|  Explain  
|
+---+--+
| STAGE DEPENDENCIES:   
|
|   Stage-0 is a root stage 
|
|   
|
| STAGE PLANS:  
|
|   Stage: Stage-0  
|
| Fetch Operator
|
|   limit: -1   
|
|   Processor Tree: 
|
| TableScan 
|
|   alias: t
|
|   Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column 
stats: NONE|
|   Select Operator 
|
| expressions: i (type: int)
|
| outputColumnNames: _col0  
|
| Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column 
stats: NONE  |
| ListSink  
|
|   
|
+---+--+

From: Aviral Agarwal [mailto:aviral12...@gmail.com]
Sent: Monday, June 13, 2016 7:55 PM
To: user@hive.apache.org
Subject: Optimized Hive query

Hi,
I would like to know if there is a way to convert nested hive sub-queries into 
optimized queries.

For example :
INSERT INTO TABLE a.b SELECT * FROM ( SELECT c FROM d)

into

INSERT INTO TABLE a.b SELECT c FROM D

This is a simple example, but the solution should apply if there were deeper
nesting levels present.

Thanks,
Aviral Agarwal



RE: Get 100 items in Comma Separated strings from Hive Column.

2016-06-11 Thread Markovitz, Dudu
You are welcomed

Dudu

From: Mahender Sarangam [mailto:mahender.bigd...@outlook.com]
Sent: Friday, June 10, 2016 8:55 PM
To: user@hive.apache.org
Subject: Re: Get 100 items in Comma Separated strings from Hive Column.


Thanks Dudu. This is a wonderful explanation. I'm very thankful.

On 6/10/2016 7:24 AM, Markovitz, Dudu wrote:
regexp_extract (…, '(,?[^,]*){0,10}', 0)

(...){0,10}

The expression surrounded by brackets repeats 0 to 10 times.


(,?[…]*)

Optional comma followed by sequence (0 or more) of characters


[^,]

Any character which is not comma


regexp_extract (...,0)

0 stands for the whole expression
1 stands for the 1st expression which is surrounded by brackets (ordered by the 
opening brackets)
2 stands for the 2nd expression which is surrounded by brackets (ordered by the 
opening brackets)
3 stands for the 3rd expression which is surrounded by brackets (ordered by the 
opening brackets)
Etc.



regexp_replace (…, '((,?[^,]*){0,10}).*', '$1')

Similar to regexp_extract but this time we’re not extracting the first 10 
tokens but replacing the whole expression with the first 10 tokens.
The expression that stands for the first 10 tokens is identical to the one we 
used in regexp_extract
.* stands for any character that repeats 0 or more times which represent 
anything following the first 10 tokens
$1 stands for the 1st expression which is surrounded by brackets (ordered by 
the opening brackets)


From: Mahender Sarangam [mailto:mahender.bigd...@outlook.com]
Sent: Friday, June 10, 2016 2:54 PM
To: user@hive.apache.org<mailto:user@hive.apache.org>
Subject: Re: Get 100 items in Comma Separated strings from Hive Column.


Thanks Dudu. I will check. Can you please throw some light on regexp_replace 
(((,?[^,]*){0,10}).*','$1')  regexp_extract ('(,?[^,]*){0,10}',0),

On 6/9/2016 11:33 PM, Markovitz, Dudu wrote:
+ Improvement

The “Count” can be done in a cleaner way
(The previous way works also with simple ‘replace’)

hive> select RowID,length(regexp_replace(stringColumn,'[^,]',''))+1 as count 
from t;

1  2
2  5
3  24
4  17
5  8
6  11
7  26
8  18
9  9


From: Markovitz, Dudu [mailto:dmarkov...@paypal.com]
Sent: Thursday, June 09, 2016 11:30 PM
To: user@hive.apache.org<mailto:user@hive.apache.org>
Subject: RE: Get 100 items in Comma Separated strings from Hive Column.


--  bash

mkdir t

cat>t/data.txt
1|44,85
2|56,37,83,68,43
3|33,48,42,18,23,80,31,86,48,42,37,52,99,55,93,1,63,67,32,75,44,57,70,2
4|77,26,95,53,11,99,74,82,7,55,75,6,32,87,75,99,80
5|48,78,39,62,16,44,43,63
6|35,97,99,19,22,50,29,84,82,25,77
7|80,43,82,94,81,58,70,8,70,6,62,100,60,84,55,24,100,75,84,15,53,5,19,45,61,73
8|66,44,66,4,80,72,81,63,51,24,51,77,87,85,10,36,43,2
9|39,64,29,14,9,42,66,56,33

hdfs dfs -put t /tmp


--  hive


create external table t
(
RowID   int
   ,stringColumnstring
)
row format delimited
fields terminated by '|'
location '/tmp/t'
;

select RowID,regexp_extract (stringColumn,'(,?[^,]*){0,10}',0) as 
string10,length(stringColumn)-length(regexp_replace(stringColumn,',',''))+1 as 
count from t;

1    44,85                              2
2    56,37,83,68,43                     5
3    33,48,42,18,23,80,31,86,48,42      24
4    77,26,95,53,11,99,74,82,7,55       17
5    48,78,39,62,16,44,43,63            8
6    35,97,99,19,22,50,29,84,82,25      11
7    80,43,82,94,81,58,70,8,70,6        26
8    66,44,66,4,80,72,81,63,51,24       18
9    39,64,29,14,9,42,66,56,33          9


Extracting the first 100 (10 in my example) tokens can be done with 
regexp_extract or regexp_replace

hive> select regexp_extract 
('1,2,3,4,5,6,7,8,9,10,11,12,13,14,15','(,?[^,]*){0,10}',0);

1,2,3,4,5,6,7,8,9,10

hive> select regexp_replace 
('1,2,3,4,5,6,7,8,9,10,11,12,13,14,15','((,?[^,]*){0,10}).*','$1');

1,2,3,4,5,6,7,8,9,10


From: Mahender Sarangam [mailto:mahender.bigd...@outlook.com]
Sent: Thursday, June 09, 2016 7:13 PM
To: user@hive.apache.org<mailto:user@hive.apache.org>
Subject: Get 100 items in Comma Separated strings from Hive Column.


Hi,

We have hive table which has a single column with more than 1000 comma 
separated string items.  Is there a way to retrieve only 100 string items from 
that Column. Also we need to capture number of comma separated string items. We 
are looking for more of   "substring_index" functionality, since we are using 
Hive 1.2 version, we couldn't find "substring_index" UDF function, Is there a 
way to achieve the same functionality with  "regexp_extract" and I also see 
there is

RE: Get 100 items in Comma Separated strings from Hive Column.

2016-06-10 Thread Markovitz, Dudu
regexp_extract (…, '(,?[^,]*){0,10}', 0)

(...){0,10}

The expression surrounded by brackets repeats 0 to 10 times.


(,?[…]*)

Optional comma followed by sequence (0 or more) of characters


[^,]

Any character which is not comma


regexp_extract (...,0)

0 stands for the whole expression
1 stands for the 1st expression which is surrounded by brackets (ordered by the 
opening brackets)
2 stands for the 2nd expression which is surrounded by brackets (ordered by the 
opening brackets)
3 stands for the 3rd expression which is surrounded by brackets (ordered by the 
opening brackets)
Etc.



regexp_replace (…, '((,?[^,]*){0,10}).*', '$1')

Similar to regexp_extract but this time we’re not extracting the first 10 
tokens but replacing the whole expression with the first 10 tokens.
The expression that stands for the first 10 tokens is identical to the one we 
used in regexp_extract
.* stands for any character that repeats 0 or more times which represent 
anything following the first 10 tokens
$1 stands for the 1st expression which is surrounded by brackets (ordered by 
the opening brackets)


From: Mahender Sarangam [mailto:mahender.bigd...@outlook.com]
Sent: Friday, June 10, 2016 2:54 PM
To: user@hive.apache.org
Subject: Re: Get 100 items in Comma Separated strings from Hive Column.


Thanks Dudu. I will check. Can you please throw some light on regexp_replace 
(((,?[^,]*){0,10}).*','$1')  regexp_extract ('(,?[^,]*){0,10}',0),

On 6/9/2016 11:33 PM, Markovitz, Dudu wrote:
+ Improvement

The “Count” can be done in a cleaner way
(The previous way works also with simple ‘replace’)

hive> select RowID,length(regexp_replace(stringColumn,'[^,]',''))+1 as count 
from t;

1  2
2  5
3  24
4  17
5  8
6  11
7  26
8  18
9  9


From: Markovitz, Dudu [mailto:dmarkov...@paypal.com]
Sent: Thursday, June 09, 2016 11:30 PM
To: user@hive.apache.org<mailto:user@hive.apache.org>
Subject: RE: Get 100 items in Comma Separated strings from Hive Column.


--  bash

mkdir t

cat>t/data.txt
1|44,85
2|56,37,83,68,43
3|33,48,42,18,23,80,31,86,48,42,37,52,99,55,93,1,63,67,32,75,44,57,70,2
4|77,26,95,53,11,99,74,82,7,55,75,6,32,87,75,99,80
5|48,78,39,62,16,44,43,63
6|35,97,99,19,22,50,29,84,82,25,77
7|80,43,82,94,81,58,70,8,70,6,62,100,60,84,55,24,100,75,84,15,53,5,19,45,61,73
8|66,44,66,4,80,72,81,63,51,24,51,77,87,85,10,36,43,2
9|39,64,29,14,9,42,66,56,33

hdfs dfs -put t /tmp


--  hive


create external table t
(
RowID   int
   ,stringColumnstring
)
row format delimited
fields terminated by '|'
location '/tmp/t'
;

select RowID,regexp_extract (stringColumn,'(,?[^,]*){0,10}',0) as 
string10,length(stringColumn)-length(regexp_replace(stringColumn,',',''))+1 as 
count from t;

1    44,85                              2
2    56,37,83,68,43                     5
3    33,48,42,18,23,80,31,86,48,42      24
4    77,26,95,53,11,99,74,82,7,55       17
5    48,78,39,62,16,44,43,63            8
6    35,97,99,19,22,50,29,84,82,25      11
7    80,43,82,94,81,58,70,8,70,6        26
8    66,44,66,4,80,72,81,63,51,24       18
9    39,64,29,14,9,42,66,56,33          9


Extracting the first 100 (10 in my example) tokens can be done with 
regexp_extract or regexp_replace

hive> select regexp_extract 
('1,2,3,4,5,6,7,8,9,10,11,12,13,14,15','(,?[^,]*){0,10}',0);

1,2,3,4,5,6,7,8,9,10

hive> select regexp_replace 
('1,2,3,4,5,6,7,8,9,10,11,12,13,14,15','((,?[^,]*){0,10}).*','$1');

1,2,3,4,5,6,7,8,9,10


From: Mahender Sarangam [mailto:mahender.bigd...@outlook.com]
Sent: Thursday, June 09, 2016 7:13 PM
To: user@hive.apache.org<mailto:user@hive.apache.org>
Subject: Get 100 items in Comma Separated strings from Hive Column.


Hi,

We have hive table which has a single column with more than 1000 comma 
separated string items.  Is there a way to retrieve only 100 string items from 
that Column. Also we need to capture number of comma separated string items. We 
are looking for more of   "substring_index" functionality, since we are using 
Hive 1.2 version, we couldn't find "substring_index" UDF function, Is there a 
way to achieve the same functionality with  "regexp_extract" and I also see 
there is UDF available not sure whether this helps us achieving same 
functionality. 
https://github.com/brndnmtthws/facebook-hive-udfs/blob/master/src/main/java/com/facebook/hive/udf/UDFRegexpExtractAll.java

Scenario : Table1 (Source Table)

RowID stringColumn

1 1,2,3,4...1

2 2,4,5,8,4

3 10,11,98,100

Now i Would like to show table result structure like below

Row ID 100String count

1 1,2,3...100 1

2 2,4,5,8,4 5



RE: Get 100 items in Comma Separated strings from Hive Column.

2016-06-10 Thread Markovitz, Dudu
+ bug fix
This version will differentiate between empty strings and strings with a single 
token (both have no commas)

hive> select 
RowID,length(regexp_replace(stringColumn,'[^,]',''))+if(length(stringColumn)=0,0,1)
 as count from t;


From: Markovitz, Dudu [mailto:dmarkov...@paypal.com]
Sent: Friday, June 10, 2016 9:34 AM
To: user@hive.apache.org
Subject: RE: Get 100 items in Comma Separated strings from Hive Column.

+ Improvement

The “Count” can be done in a cleaner way
(The previous way works also with simple ‘replace’)

hive> select RowID,length(regexp_replace(stringColumn,'[^,]',''))+1 as count 
from t;

1  2
2  5
3  24
4  17
5  8
6  11
7  26
8  18
9  9


From: Markovitz, Dudu [mailto:dmarkov...@paypal.com]
Sent: Thursday, June 09, 2016 11:30 PM
To: user@hive.apache.org<mailto:user@hive.apache.org>
Subject: RE: Get 100 items in Comma Separated strings from Hive Column.


--  bash

mkdir t

cat>t/data.txt
1|44,85
2|56,37,83,68,43
3|33,48,42,18,23,80,31,86,48,42,37,52,99,55,93,1,63,67,32,75,44,57,70,2
4|77,26,95,53,11,99,74,82,7,55,75,6,32,87,75,99,80
5|48,78,39,62,16,44,43,63
6|35,97,99,19,22,50,29,84,82,25,77
7|80,43,82,94,81,58,70,8,70,6,62,100,60,84,55,24,100,75,84,15,53,5,19,45,61,73
8|66,44,66,4,80,72,81,63,51,24,51,77,87,85,10,36,43,2
9|39,64,29,14,9,42,66,56,33

hdfs dfs -put t /tmp


--  hive


create external table t
(
RowID   int
   ,stringColumnstring
)
row format delimited
fields terminated by '|'
location '/tmp/t'
;

select RowID,regexp_extract (stringColumn,'(,?[^,]*){0,10}',0) as 
string10,length(stringColumn)-length(regexp_replace(stringColumn,',',''))+1 as 
count from t;

1    44,85                              2
2    56,37,83,68,43                     5
3    33,48,42,18,23,80,31,86,48,42      24
4    77,26,95,53,11,99,74,82,7,55       17
5    48,78,39,62,16,44,43,63            8
6    35,97,99,19,22,50,29,84,82,25      11
7    80,43,82,94,81,58,70,8,70,6        26
8    66,44,66,4,80,72,81,63,51,24       18
9    39,64,29,14,9,42,66,56,33          9


Extracting the first 100 (10 in my example) tokens can be done with 
regexp_extract or regexp_replace

hive> select regexp_extract 
('1,2,3,4,5,6,7,8,9,10,11,12,13,14,15','(,?[^,]*){0,10}',0);

1,2,3,4,5,6,7,8,9,10

hive> select regexp_replace 
('1,2,3,4,5,6,7,8,9,10,11,12,13,14,15','((,?[^,]*){0,10}).*','$1');

1,2,3,4,5,6,7,8,9,10


From: Mahender Sarangam [mailto:mahender.bigd...@outlook.com]
Sent: Thursday, June 09, 2016 7:13 PM
To: user@hive.apache.org<mailto:user@hive.apache.org>
Subject: Get 100 items in Comma Separated strings from Hive Column.


Hi,

We have hive table which has a single column with more than 1000 comma 
separated string items.  Is there a way to retrieve only 100 string items from 
that Column. Also we need to capture number of comma separated string items. We 
are looking for more of   "substring_index" functionality, since we are using 
Hive 1.2 version, we couldn't find "substring_index" UDF function, Is there a 
way to achieve the same functionality with  "regexp_extract" and I also see 
there is UDF available not sure whether this helps us achieving same 
functionality. 
https://github.com/brndnmtthws/facebook-hive-udfs/blob/master/src/main/java/com/facebook/hive/udf/UDFRegexpExtractAll.java

Scenario : Table1 (Source Table)

RowID stringColumn

1 1,2,3,4...1

2 2,4,5,8,4

3 10,11,98,100

Now i Would like to show table result structure like below

Row ID 100String count

1 1,2,3...100 1

2 2,4,5,8,4 5


RE: Get 100 items in Comma Separated strings from Hive Column.

2016-06-10 Thread Markovitz, Dudu
+ Improvement

The “Count” can be done in a cleaner way
(The previous way works also with simple ‘replace’)

hive> select RowID,length(regexp_replace(stringColumn,'[^,]',''))+1 as count 
from t;

1  2
2  5
3  24
4  17
5  8
6  11
7  26
8  18
9  9


From: Markovitz, Dudu [mailto:dmarkov...@paypal.com]
Sent: Thursday, June 09, 2016 11:30 PM
To: user@hive.apache.org
Subject: RE: Get 100 items in Comma Separated strings from Hive Column.


--  bash

mkdir t

cat>t/data.txt
1|44,85
2|56,37,83,68,43
3|33,48,42,18,23,80,31,86,48,42,37,52,99,55,93,1,63,67,32,75,44,57,70,2
4|77,26,95,53,11,99,74,82,7,55,75,6,32,87,75,99,80
5|48,78,39,62,16,44,43,63
6|35,97,99,19,22,50,29,84,82,25,77
7|80,43,82,94,81,58,70,8,70,6,62,100,60,84,55,24,100,75,84,15,53,5,19,45,61,73
8|66,44,66,4,80,72,81,63,51,24,51,77,87,85,10,36,43,2
9|39,64,29,14,9,42,66,56,33

hdfs dfs -put t /tmp


--  hive


create external table t
(
RowID   int
   ,stringColumnstring
)
row format delimited
fields terminated by '|'
location '/tmp/t'
;

select RowID,regexp_extract (stringColumn,'(,?[^,]*){0,10}',0) as 
string10,length(stringColumn)-length(regexp_replace(stringColumn,',',''))+1 as 
count from t;

1    44,85                              2
2    56,37,83,68,43                     5
3    33,48,42,18,23,80,31,86,48,42      24
4    77,26,95,53,11,99,74,82,7,55       17
5    48,78,39,62,16,44,43,63            8
6    35,97,99,19,22,50,29,84,82,25      11
7    80,43,82,94,81,58,70,8,70,6        26
8    66,44,66,4,80,72,81,63,51,24       18
9    39,64,29,14,9,42,66,56,33          9


Extracting the first 100 (10 in my example) tokens can be done with 
regexp_extract or regexp_replace

hive> select regexp_extract 
('1,2,3,4,5,6,7,8,9,10,11,12,13,14,15','(,?[^,]*){0,10}',0);

1,2,3,4,5,6,7,8,9,10

hive> select regexp_replace 
('1,2,3,4,5,6,7,8,9,10,11,12,13,14,15','((,?[^,]*){0,10}).*','$1');

1,2,3,4,5,6,7,8,9,10


From: Mahender Sarangam [mailto:mahender.bigd...@outlook.com]
Sent: Thursday, June 09, 2016 7:13 PM
To: user@hive.apache.org<mailto:user@hive.apache.org>
Subject: Get 100 items in Comma Separated strings from Hive Column.


Hi,

We have hive table which has a single column with more than 1000 comma 
separated string items.  Is there a way to retrieve only 100 string items from 
that Column. Also we need to capture number of comma separated string items. We 
are looking for more of   "substring_index" functionality, since we are using 
Hive 1.2 version, we couldn't find "substring_index" UDF function, Is there a 
way to achieve the same functionality with  "regexp_extract" and I also see 
there is UDF available not sure whether this helps us achieving same 
functionality. 
https://github.com/brndnmtthws/facebook-hive-udfs/blob/master/src/main/java/com/facebook/hive/udf/UDFRegexpExtractAll.java

Scenario : Table1 (Source Table)

RowID stringColumn

1 1,2,3,4...1

2 2,4,5,8,4

3 10,11,98,100

Now i Would like to show table result structure like below

Row ID 100String count

1 1,2,3...100 1

2 2,4,5,8,4 5


RE: Get 100 items in Comma Separated strings from Hive Column.

2016-06-09 Thread Markovitz, Dudu

--  bash

mkdir t

cat>t/data.txt
1|44,85
2|56,37,83,68,43
3|33,48,42,18,23,80,31,86,48,42,37,52,99,55,93,1,63,67,32,75,44,57,70,2
4|77,26,95,53,11,99,74,82,7,55,75,6,32,87,75,99,80
5|48,78,39,62,16,44,43,63
6|35,97,99,19,22,50,29,84,82,25,77
7|80,43,82,94,81,58,70,8,70,6,62,100,60,84,55,24,100,75,84,15,53,5,19,45,61,73
8|66,44,66,4,80,72,81,63,51,24,51,77,87,85,10,36,43,2
9|39,64,29,14,9,42,66,56,33

hdfs dfs -put t /tmp


--  hive


create external table t
(
RowID   int
   ,stringColumn    string
)
row format delimited
fields terminated by '|'
location '/tmp/t'
;

select RowID,regexp_extract (stringColumn,'(,?[^,]*){0,10}',0) as 
string10,length(stringColumn)-length(regexp_replace(stringColumn,',',''))+1 as 
count from t;

1    44,85    2
2    56,37,83,68,43    5
3    33,48,42,18,23,80,31,86,48,42    24
4    77,26,95,53,11,99,74,82,7,55    17
5    48,78,39,62,16,44,43,63    8
6    35,97,99,19,22,50,29,84,82,25    11
7    80,43,82,94,81,58,70,8,70,6    26
8    66,44,66,4,80,72,81,63,51,24    18
9    39,64,29,14,9,42,66,56,33    9


Extracting the first 100 (10 in my example) tokens can be done with 
regexp_extract or regexp_replace

hive> select regexp_extract 
('1,2,3,4,5,6,7,8,9,10,11,12,13,14,15','(,?[^,]*){0,10}',0);

1,2,3,4,5,6,7,8,9,10

hive> select regexp_replace 
('1,2,3,4,5,6,7,8,9,10,11,12,13,14,15','((,?[^,]*){0,10}).*','$1');

1,2,3,4,5,6,7,8,9,10


From: Mahender Sarangam [mailto:mahender.bigd...@outlook.com]
Sent: Thursday, June 09, 2016 7:13 PM
To: user@hive.apache.org
Subject: Get 100 items in Comma Separated strings from Hive Column.


Hi,

We have hive table which has a single column with more than 1000 comma 
separated string items.  Is there a way to retrieve only 100 string items from 
that Column. Also we need to capture number of comma separated string items. We 
are looking for more of   "substring_index" functionality, since we are using 
Hive 1.2 version, we couldn't find "substring_index" UDF function, Is there a 
way to achieve the same functionality with  "regexp_extract" and I also see 
there is UDF available not sure whether this helps us achieving same 
functionality. 
https://github.com/brndnmtthws/facebook-hive-udfs/blob/master/src/main/java/com/facebook/hive/udf/UDFRegexpExtractAll.java

Scenario : Table1 (Source Table)

RowID stringColumn

1 1,2,3,4...1

2 2,4,5,8,4

3 10,11,98,100

Now i Would like to show table result structure like below

Row ID 100String count

1 1,2,3...100 1

2 2,4,5,8,4 5


RE: LINES TERMINATED BY only supports newline '\n' right now

2016-06-09 Thread Markovitz, Dudu
I’ve checked “sentences” source code.
It turns out it is using BreakIterator.getSentenceInstance to break the text to 
sentences.
Apparently ‘\n’ is not considered a sentence separator, nor is ‘.’, but ‘?’ and 
‘!’ are.

Dudu


hive> select id,name,sentences(regexp_replace (lyrics,'\n','?')) from songs;

1  All For Leyna
[["She","stood","on","the","tracks"],["Waving","her","arms"],["Leading","me","to","that","third","rail","shock"],["Quick","as","a","wink"],["She","changed","her","mind"]]
2  Goodnight Saigon
[["We","met","as","soul","mates"],["On","Parris","Island"],["We","left","as","inmates"],["From","an","asylum"],["And","we","were","sharp"],["As","sharp","as","knives"],["And","we","were","so","gung","ho"],["To","lay","down","our","lives"]]

From: Markovitz, Dudu [mailto:dmarkov...@paypal.com]
Sent: Thursday, June 09, 2016 10:58 AM
To: user@hive.apache.org
Subject: RE: LINES TERMINATED BY only supports newline '\n' right now

Partial success after few more trials and errors –

1.
“insert into … values (),(),…,()” doesn’t work right in any case
“insert into … values (); insert into … values ();…;insert into … values();” 
works only with textinputformat.record.delimiter changed.
Insert into … select … union all select … works fine (no need to touch 
textinputformat.record.delimiter)

2.
No bugs around aggregative functions

3.
“sentences” still doesn’t work as expected.
We can see that “split” works correctly.

hive> select id,name,split(lyrics,'\n') from songs;

1  All For Leyna  ["She stood on the tracks","Waving her 
arms","Leading me to that third rail shock","Quick as a wink","She changed her 
mind"]
2  Goodnight Saigon["We met as soul mates","On Parris 
Island","We left as inmates","From an asylum","And we were sharp","As sharp as 
knives","And we were so gung ho","To lay down our lives"]

hive> select id,name,sentences(lyrics) from songs;

1  All For Leyna
[["She","stood","on","the","tracks","Waving","her","arms","Leading","me","to","that","third","rail","shock","Quick","as","a","wink","She","changed","her","mind"]]
2  Goodnight Saigon
[["We","met","as","soul","mates","On","Parris","Island","We","left","as","inmates","From","an","asylum","And","we","were","sharp","As","sharp","as","knives","And","we","were","so","gung","ho","To","lay","down","our","lives"]]





From: Markovitz, Dudu [mailto:dmarkov...@paypal.com]
Sent: Thursday, June 09, 2016 10:23 AM
To: user@hive.apache.org<mailto:user@hive.apache.org>
Subject: RE: LINES TERMINATED BY only supports newline '\n' right now

Same issues.

Dudu

From: abhishek [mailto:ec.abhis...@gmail.com]
Sent: Thursday, June 09, 2016 9:23 AM
To: user@hive.apache.org<mailto:user@hive.apache.org>
Subject: Re: LINES TERMINATED BY only supports newline '\n' right now


Did you try defining the table with hive In built SerDe. 'Stored as ORC'
This should resolve your issue. Plz try and let me know if it works.

Abhi
Sent from my iPhone

On Jun 3, 2016, at 3:33 AM, Markovitz, Dudu 
<dmarkov...@paypal.com<mailto:dmarkov...@paypal.com>> wrote:
Here is an example, but first – some warnings:


· You should set textinputformat.record.delimiter not only for the 
populating of the table but also for querying it



· There seems to be many issues around this area –

o   When I tried to insert multiple values in a single statement (“insert into 
table … values (…),(…),(…)”) only the first set of values was inserted correctly

RE: Need Your Inputs For Below Scenario

2016-06-09 Thread Markovitz, Dudu
Explode + joins


--  bash


mkdir t1
mkdir t2

cat>t1/data.txt
A	B1	B2		B4	B5	B6

cat>t2/data.txt
B1 D1
B2 D2
B3 D3
B4 D4
B5 D5
B6 D6

hdfs dfs -put t1 t2 /tmp


--  hive


create external table t1
(
Column1 string
   ,Column2 string
   ,Column3 string
   ,Column4 string
   ,Column5 string
   ,Column6 string
   ,Column7 string
)
row format delimited
fields terminated by '\t'
location '/tmp/t1'
;

create external table t2
(
Column1 string
   ,Column2 string
)
row format delimited
fields terminated by '\t'
location '/tmp/t2'
;

Theoretically I would have written the query like this -

select  t1.Column1
   ,t1_unpivot.val
   ,t2.Column2

fromt1

lateral viewexplode 
(array(Column2,Column3,Column4,Column5,Column6,Column7)) t1_unpivot as val

joint2

on  t2.Column1  =
t1_unpivot.val
;

Unfortunately, this syntax is not supported

FAILED: SemanticException [Error 10085]: Line 7:32 JOIN with a LATERAL VIEW is 
not supported 'val'


As a work-around I'm nesting the "lateral view"


select  t1.Column1
   ,t1.val
   ,t2.Column2

from   (select  t1.Column1
   ,t1_unpivot.val

fromt1

lateral viewexplode 
(array(Column2,Column3,Column4,Column5,Column6,Column7)) t1_unpivot as val
)
as t1

joint2

on  t2.Column1  =
t1.val
;

A  B1 D1
A  B2 D2
A  B4 D4
A  B5 D5
A  B6 D6
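
If the empty cell should be filtered explicitly instead of relying on the join to drop it, a where clause inside the nested query does it (same tables as above, untested sketch):

select  t1.Column1
   ,t1.val
   ,t2.Column2

from   (select  t1.Column1
   ,t1_unpivot.val

fromt1

lateral viewexplode 
(array(Column2,Column3,Column4,Column5,Column6,Column7)) t1_unpivot as val

where   t1_unpivot.val is not null
and     t1_unpivot.val <> ''
)
as t1

joint2

on  t2.Column1  =
t1.val
;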

From: Lunagariya, Dhaval [mailto:dhaval.lunagar...@citi.com]
Sent: Wednesday, June 08, 2016 6:25 PM
To: 'user@hive.apache.org' 
Cc: 'er.dcpa...@gmail.com' 
Subject: RE: Need Your Inputs For Below Scenario

Here Table2 is very large table and contains lakhs of rows.

From: Lunagariya, Dhaval [CCC-OT]
Sent: Wednesday, June 08, 2016 5:52 PM
To: user@hive.apache.org
Subject: Need Your Inputs For Below Scenario

Hey folks,

Need your help.

Input Table1:

Column1  Column2  Column3  Column4   Column5  Column6  Column7
A        B1       B2       B3(NULL)  B4       B5       B6


Input Table2:

Column1  Column2
B1       D1
B2       D2
B3       D3
B4       D4
B5       D5
B6       D6


Output:

Column1  Column2  Column3
A        B1       D1
A        B2       D2
A        B4       D4
A        B5       D5
A        B6       D6


Here B3 is skipped because B3 is NULL.

What is the efficient way to get above result using Hive?



Regards,
Dhaval



RE: LINES TERMINATED BY only supports newline '\n' right now

2016-06-09 Thread Markovitz, Dudu
Partial success after few more trials and errors –

1.
“insert into … values (),(),…,()” doesn’t work right in any case
“insert into … values (); insert into … values ();…;insert into … values();” 
works only with textinputformat.record.delimiter changed.
Insert into … select … union all select … works fine (no need to touch 
textinputformat.record.delimiter)

2.
No bugs around aggregative functions

3.
“sentences” still doesn’t work as expected.
We can see that “split” works correctly.

hive> select id,name,split(lyrics,'\n') from songs;

1  All For Leyna  ["She stood on the tracks","Waving her 
arms","Leading me to that third rail shock","Quick as a wink","She changed her 
mind"]
2  Goodnight Saigon["We met as soul mates","On Parris 
Island","We left as inmates","From an asylum","And we were sharp","As sharp as 
knives","And we were so gung ho","To lay down our lives"]

hive> select id,name,sentences(lyrics) from songs;

1  All For Leyna
[["She","stood","on","the","tracks","Waving","her","arms","Leading","me","to","that","third","rail","shock","Quick","as","a","wink","She","changed","her","mind"]]
2  Goodnight Saigon
[["We","met","as","soul","mates","On","Parris","Island","We","left","as","inmates","From","an","asylum","And","we","were","sharp","As","sharp","as","knives","And","we","were","so","gung","ho","To","lay","down","our","lives"]]





From: Markovitz, Dudu [mailto:dmarkov...@paypal.com]
Sent: Thursday, June 09, 2016 10:23 AM
To: user@hive.apache.org
Subject: RE: LINES TERMINATED BY only supports newline '\n' right now

Same issues.

Dudu

From: abhishek [mailto:ec.abhis...@gmail.com]
Sent: Thursday, June 09, 2016 9:23 AM
To: user@hive.apache.org<mailto:user@hive.apache.org>
Subject: Re: LINES TERMINATED BY only supports newline '\n' right now


Did you try defining the table with hive In built SerDe. 'Stored as ORC'
This should resolve your issue. Plz try and let me know if it works.

Abhi
Sent from my iPhone

On Jun 3, 2016, at 3:33 AM, Markovitz, Dudu 
<dmarkov...@paypal.com<mailto:dmarkov...@paypal.com>> wrote:
Here is an example, but first – some warnings:


· You should set textinputformat.record.delimiter not only for the 
populating of the table but also for querying it



· There seems to be many issues around this area –

o   When I tried to insert multiple values in a single statement (“insert into 
table … values (…),(…),(…)”) only the first set of values was inserted correctly

o   2 new lines were added to the end of each text (‘lyrics’) although there 
should be none.

o   Aggregative queries seems to return null for the last column. Sometimes.

o   The function ‘sentences’ does not work as expected. It treated the whole 
text as a single line.



Dudu





Example


hive> create table songs (id int,name string,lyrics string)
;

hive> set textinputformat.record.delimiter='\0'
;

hive> insert into table songs values
(
1
,'All For Leyna'
,'She stood on the tracks
Waving her arms
Leading me to that third rail shock
Quick as a wink
She changed her mind'
)
;

hive> insert into table songs values
(
2
,'Goodnight Saigon'
,'We met as soul mates
On Parris Island
We left as inmates
From an asylum
And we were sharp
As sharp as knives
And we were so gung ho
To lay down our lives'
)
;

hive> select id,name,length(lyrics) from songs;

1  All For Leyna  114
2  Goodnight Saigon155


hive> select id,name,hex(lyrics) from songs;

1  All For Leyna
5368652073746F6F64206F6E2074686520747261636B730A576176696E67206865722061726D730A4C656164696E67206D6520746F2074686174207468697264207261696C2073686F636B0A517569636B20617320612077696E6B0A536865206368616E67656420686572206D696E640A0A
2  Goodnight Saigon
5765206D657420617320736F756C206D617465730A4F6E205061727269732049736C616E640A5765206C65667420617320696E6D617465730A46726F6D20616E206173796C756D0A416E6420776520776572652073686172700A4173207368617270206173206B6E697665730A416E64207765207765726520736F2067756E6720686F0A546F206C617920646F776E206F7572206C697665730A0A

hive> select id,name,regexp_replace(lyrics,'\n','<<>>')

RE: LINES TERMINATED BY only supports newline '\n' right now

2016-06-09 Thread Markovitz, Dudu
Same issues.

Dudu

From: abhishek [mailto:ec.abhis...@gmail.com]
Sent: Thursday, June 09, 2016 9:23 AM
To: user@hive.apache.org
Subject: Re: LINES TERMINATED BY only supports newline '\n' right now


Did you try defining the table with hive In built SerDe. 'Stored as ORC'
This should resolve your issue. Plz try and let me know if it works.

Abhi
Sent from my iPhone

On Jun 3, 2016, at 3:33 AM, Markovitz, Dudu 
<dmarkov...@paypal.com<mailto:dmarkov...@paypal.com>> wrote:
Here is an example, but first – some warnings:


· You should set textinputformat.record.delimiter not only for the 
populating of the table but also for querying it



· There seems to be many issues around this area –

o   When I tried to insert multiple values in a single statement (“insert into 
table … values (…),(…),(…)”) only the first set of values was inserted correctly

o   2 new lines were added to the end of each text (‘lyrics’) although there 
should be none.

o   Aggregative queries seems to return null for the last column. Sometimes.

o   The function ‘sentences’ does not work as expected. It treated the whole 
text as a single line.



Dudu





Example


hive> create table songs (id int,name string,lyrics string)
;

hive> set textinputformat.record.delimiter='\0'
;

hive> insert into table songs values
(
1
,'All For Leyna'
,'She stood on the tracks
Waving her arms
Leading me to that third rail shock
Quick as a wink
She changed her mind'
)
;

hive> insert into table songs values
(
2
,'Goodnight Saigon'
,'We met as soul mates
On Parris Island
We left as inmates
From an asylum
And we were sharp
As sharp as knives
And we were so gung ho
To lay down our lives'
)
;

hive> select id,name,length(lyrics) from songs;

1  All For Leyna  114
2  Goodnight Saigon155


hive> select id,name,hex(lyrics) from songs;

1  All For Leyna
5368652073746F6F64206F6E2074686520747261636B730A576176696E67206865722061726D730A4C656164696E67206D6520746F2074686174207468697264207261696C2073686F636B0A517569636B20617320612077696E6B0A536865206368616E67656420686572206D696E640A0A
2  Goodnight Saigon
5765206D657420617320736F756C206D617465730A4F6E205061727269732049736C616E640A5765206C65667420617320696E6D617465730A46726F6D20616E206173796C756D0A416E6420776520776572652073686172700A4173207368617270206173206B6E697665730A416E64207765207765726520736F2067756E6720686F0A546F206C617920646F776E206F7572206C697665730A0A

hive> select id,name,regexp_replace(lyrics,'\n','<<>>') from songs;

1  All For Leyna  She stood on the tracks<<>>Waving 
her arms<<>>Leading me to that third rail shock<<>>Quick as a 
wink<<>>She changed her mind<<>><<>>
2  Goodnight SaigonWe met as soul mates<<>>On 
Parris Island<<>>We left as inmates<<>>From an 
asylum<<>>And we were sharp<<>>As sharp as 
knives<<>>And we were so gung ho<<>>To lay down our 
lives<<>><<>>

hive> select id,name,split(lyrics,'\n') from songs;

1  All For Leyna  ["She stood on the tracks","Waving her 
arms","Leading me to that third rail shock","Quick as a wink","She changed her 
mind","",""]
2  Goodnight Saigon["We met as soul mates","On Parris 
Island","We left as inmates","From an asylum","And we were sharp","As sharp as 
knives","And we were so gung ho","To lay down our lives","",""]

hive> select id,name,sentences(lyrics) from songs;

1  All For Leyna
[["She","stood","on","the","tracks","Waving","her","arms","Leading","me","to","that","third","rail","shock","Quick","as","a","wink","She","changed","her","mind"]]
2  Goodnight Saigon
[["We","met","as","soul","mates","On","Parris","Island","We","left","as","inmates","From","an","asylum","And","we","were","sharp","As","sharp","as","knives","And","we","were","so","gung","ho","To","lay","down","our","lives"]]

hive> select count (*) from songs;

NULL

hive> select count (*),123,

RE: Why does the user need write permission on the location of external hive table?

2016-06-06 Thread Markovitz, Dudu
P.s.

There are some risky data manipulations going there.
I’m not sure this is a desired result… ☺

hive> select CAST(REGEXP_REPLACE('And the Lord spake, saying, "First shalt thou 
take out the Holy Pin * Then shalt thou count to 3, no more, no less * 3 shall 
be the number thou shalt count, and the number of the counting shall be 3 * 4 
shalt thou not count, neither count thou 2, excepting that thou then proceed to 
3 * 5 is right out * Once the number 3, being the third number, be reached, 
then lobbest thou thy Holy Hand Grenade of Antioch towards thy foe, who, being 
naughty in My sight, shall snuff it *','[^\\d\\.]','') AS DECIMAL(20,2));
OK
33342353
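
A narrower character class is a bit safer when the input is known to be a currency-formatted amount; stripping only the currency sign, grouping commas and spaces leaves genuinely malformed strings to fail the cast instead of being silently glued into one number (sketch):

hive> select CAST(REGEXP_REPLACE('£1,234,567.89','[£$, ]','') AS DECIMAL(20,2));
1234567.89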

From: Mich Talebzadeh [mailto:mich.talebza...@gmail.com]
Sent: Tuesday, June 07, 2016 2:23 AM
To: user 
Subject: Re: Why does the user need write permission on the location of 
external hive table?

Hi Igor,

Hive can read from zipped files. If you are getting a lot of external files it 
makes sense to zip them and store on staging hdfs directory

1) download say these csv files into your local file system and use bzip2 to 
zip them as part of ETL

 ls -l
total 68
-rw-r--r-- 1 hduser hadoop 7334 Apr 25 11:29 nw_2011.csv.bz2
-rw-r--r-- 1 hduser hadoop 6235 Apr 25 11:29 nw_2012.csv.bz2
-rw-r--r-- 1 hduser hadoop 5476 Apr 25 11:29 nw_2013.csv.bz2
-rw-r--r-- 1 hduser hadoop 2725 Apr 25 11:29 nw_2014.csv.bz2
-rw-r--r-- 1 hduser hadoop 1868 Apr 25 11:29 nw_2015.csv.bz2
-rw-r--r-- 1 hduser hadoop  693 Apr 25 11:29 nw_2016.csv.bz2

Then put these files in a staging directory on hdfs usinh a shell script


for FILE in `ls *.*|grep -v .ksh`
do
  echo "Bzipping ${FILE}"
  /usr/bin/bzip2 ${FILE}
   hdfs dfs -copyFromLocal ${FILE}.bz2 ${TARGETDIR}
done

OK now the files are saved in ${TARGETDIR}

Now create the external table looking at this staging directory. No need to 
tell hive that these files are compressed. It knows how to handle it. They are 
stored as textfiles


DROP TABLE IF EXISTS stg_t2;
CREATE EXTERNAL TABLE stg_t2 (
 INVOICENUMBER string
,PAYMENTDATE string
,NET string
,VAT string
,TOTAL string
)
COMMENT 'from csv file from excel sheet nw_10124772'
ROW FORMAT serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE
LOCATION '/data/stg/accounts/nw/10124772'
TBLPROPERTIES ("skip.header.line.count"="1")

Now create the Hive table internally. Note that I want this data to be 
compressed. You will tell it to compress the table with ZLIB or SNAPPY


DROP TABLE IF EXISTS t2;
CREATE TABLE t2 (
 INVOICENUMBER  INT
,PAYMENTDATE    date
,NET            DECIMAL(20,2)
,VAT            DECIMAL(20,2)
,TOTAL  DECIMAL(20,2)
)
COMMENT 'from csv file from excel sheet nw_10124772'
CLUSTERED BY (INVOICENUMBER) INTO 256 BUCKETS
STORED AS ORC
TBLPROPERTIES ( "orc.compress"="ZLIB" )

Put data in target table. do the conversion and ignore empty rows

INSERT INTO TABLE t2
SELECT
  INVOICENUMBER
, 
TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP(paymentdate,'dd/MM/yyyy'),'yyyy-MM-dd')) 
AS paymentdate
, CAST(REGEXP_REPLACE(net,'[^\\d\\.]','') AS DECIMAL(20,2))
, CAST(REGEXP_REPLACE(vat,'[^\\d\\.]','') AS DECIMAL(20,2))
, CAST(REGEXP_REPLACE(total,'[^\\d\\.]','') AS DECIMAL(20,2))
FROM
stg_t2
WHERE
--INVOICENUMBER > 0 AND
CAST(REGEXP_REPLACE(total,'[^\\d\\.]','') AS DECIMAL(20,2)) > 0.0 -- 
Exclude empty rows
;

So pretty straight forward.

Now to your question

"it will affect performance, correct?"


Compression is a well established algorithm. It has been around in databases. 
Almost all RDBMS (Oracle, Sybase etc) do compress the data at database and 
backups through an option. Compression is more CPU intensive than without it. 
However, the database will handle the conversion of data from compressed to 
none when you read it or whatever. So yes there is a performance price to pay 
albeit small using more CPU to uncompress the data and present it. However, 
that is a small price to pay to reduce the storage cost for data.

HTH













Dr Mich Talebzadeh



LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 6 June 2016 at 23:18, Igor Kravzov 
> wrote:
Mich, will Hive automatically detect and unzip zipped files? Ir there is 
special option in table configuration?
it will affect performance, correct?

On Mon, Jun 6, 2016 at 4:14 PM, Mich Talebzadeh 
> wrote:
Hi Sandeep.

I tend to use Hive external tables as staging tables but still I will require 
access rights to hdfs.

Zip files work OK as well. For example our CSV files are zipped using bzip2 to 
save space

However, you may request a temporary solution by disabling permission in 
$HADOOP_HOME/etc/Hadoop/hdfs-site.xml


<property>
  <name>dfs.permissions</name>
  <value>false</value>
</property>


There 

RE: Why does the user need write permission on the location of external hive table?

2016-06-06 Thread Markovitz, Dudu
Hi guys

I would strongly recommend not to work with zipped files.

“Hadoop will not be able to split your file into chunks/blocks and run multiple 
maps in parallel”
https://cwiki.apache.org/confluence/display/Hive/CompressedStorage

Dudu

From: Mich Talebzadeh [mailto:mich.talebza...@gmail.com]
Sent: Tuesday, June 07, 2016 2:23 AM
To: user 
Subject: Re: Why does the user need write permission on the location of 
external hive table?

Hi Igor,

Hive can read from zipped files. If you are getting a lot of external files it 
makes sense to zip them and store on staging hdfs directory

1) download say these csv files into your local file system and use bzip2 to 
zip them as part of ETL

 ls -l
total 68
-rw-r--r-- 1 hduser hadoop 7334 Apr 25 11:29 nw_2011.csv.bz2
-rw-r--r-- 1 hduser hadoop 6235 Apr 25 11:29 nw_2012.csv.bz2
-rw-r--r-- 1 hduser hadoop 5476 Apr 25 11:29 nw_2013.csv.bz2
-rw-r--r-- 1 hduser hadoop 2725 Apr 25 11:29 nw_2014.csv.bz2
-rw-r--r-- 1 hduser hadoop 1868 Apr 25 11:29 nw_2015.csv.bz2
-rw-r--r-- 1 hduser hadoop  693 Apr 25 11:29 nw_2016.csv.bz2

Then put these files in a staging directory on hdfs usinh a shell script


for FILE in `ls *.*|grep -v .ksh`
do
  echo "Bzipping ${FILE}"
  /usr/bin/bzip2 ${FILE}
   hdfs dfs -copyFromLocal ${FILE}.bz2 ${TARGETDIR}
done

OK now the files are saved in ${TARGETDIR}

Now create the external table looking at this staging directory. No need to 
tell hive that these files are compressed. It knows how to handle it. They are 
stored as textfiles


DROP TABLE IF EXISTS stg_t2;
CREATE EXTERNAL TABLE stg_t2 (
 INVOICENUMBER string
,PAYMENTDATE string
,NET string
,VAT string
,TOTAL string
)
COMMENT 'from csv file from excel sheet nw_10124772'
ROW FORMAT serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE
LOCATION '/data/stg/accounts/nw/10124772'
TBLPROPERTIES ("skip.header.line.count"="1")

Now create the Hive table internally. Note that I want this data to be 
compressed. You will tell it to compress the table with ZLIB or SNAPPY


DROP TABLE IF EXISTS t2;
CREATE TABLE t2 (
 INVOICENUMBER  INT
,PAYMENTDATE    date
,NET            DECIMAL(20,2)
,VAT            DECIMAL(20,2)
,TOTAL  DECIMAL(20,2)
)
COMMENT 'from csv file from excel sheet nw_10124772'
CLUSTERED BY (INVOICENUMBER) INTO 256 BUCKETS
STORED AS ORC
TBLPROPERTIES ( "orc.compress"="ZLIB" )

Put data in target table. do the conversion and ignore empty rows

INSERT INTO TABLE t2
SELECT
  INVOICENUMBER
, 
TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP(paymentdate,'dd/MM/yyyy'),'yyyy-MM-dd')) 
AS paymentdate
, CAST(REGEXP_REPLACE(net,'[^\\d\\.]','') AS DECIMAL(20,2))
, CAST(REGEXP_REPLACE(vat,'[^\\d\\.]','') AS DECIMAL(20,2))
, CAST(REGEXP_REPLACE(total,'[^\\d\\.]','') AS DECIMAL(20,2))
FROM
stg_t2
WHERE
--INVOICENUMBER > 0 AND
CAST(REGEXP_REPLACE(total,'[^\\d\\.]','') AS DECIMAL(20,2)) > 0.0 -- 
Exclude empty rows
;

So pretty straight forward.

Now to your question

"it will affect performance, correct?"


Compression is a well established algorithm. It has been around in databases. 
Almost all RDBMS (Oracle, Sybase etc) do compress the data at database and 
backups through an option. Compression is more CPU intensive than without it. 
However, the database will handle the conversion of data from compressed to 
none when you read it or whatever. So yes there is a performance price to pay 
albeit small using more CPU to uncompress the data and present it. However, 
that is a small price to pay to reduce the storage cost for data.

HTH













Dr Mich Talebzadeh



LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 6 June 2016 at 23:18, Igor Kravzov 
> wrote:
Mich, will Hive automatically detect and unzip zipped files? Ir there is 
special option in table configuration?
it will affect performance, correct?

On Mon, Jun 6, 2016 at 4:14 PM, Mich Talebzadeh 
> wrote:
Hi Sandeep.

I tend to use Hive external tables as staging tables but still I will require 
access rights to hdfs.

Zip files work OK as well. For example our CSV files are zipped using bzip2 to 
save space

However, you may request a temporary solution by disabling permission in 
$HADOOP_HOME/etc/Hadoop/hdfs-site.xml


<property>
  <name>dfs.permissions</name>
  <value>false</value>
</property>


There are other ways as well.

Check this

http://stackoverflow.com/questions/11593374/permission-denied-at-hdfs

HTH







Dr Mich Talebzadeh



LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 6 June 2016 at 21:00, Igor Kravzov 
> wrote:
I see file 

RE: alter partitions on hive external table

2016-06-06 Thread Markovitz, Dudu
And here is a full example


-- bash


mkdir -p t
mkdir -p t/20150122/dudu/cust1
mkdir -p t/cust2/aaa/2015/01/23/bbb/ccc/dudu/ddd
mkdir -p t/raj/20150204/cust1
mkdir -p t/raj/cust2/yyy/20150204/zzz
mkdir -p t/bla/bla/bla/raj/yada/yada/yada/cust3/yyy/20150204/zzz

echo -e "1\n2\n3" > t/20150122/dudu/cust1/data.txt
echo -e "4"   > t/cust2/aaa/2015/01/23/bbb/ccc/dudu/ddd/data.txt
echo -e "5\n6"> t/raj/20150204/cust1/data.txt
echo -e "7\n8\n9" > t/raj/cust2/yyy/20150204/zzz/data.txt
echo -e "10"  > 
t/bla/bla/bla/raj/yada/yada/yada/cust3/yyy/20150204/zzz/data.txt

hdfs dfs -put t /tmp


-- hive



·  We’re creating the external table with the requested partition columns


create external table t (i int) partitioned by (user string,cust string,dt 
date) location '/tmp/t';


·  We’re choosing each partition’s values according to the full path of the relevant 
directory

alter table t add partition (user='dudu',cust='cust1',dt=date '2015-01-22') 
location '/tmp/t/20150122/dudu/cust1';
alter table t add partition (user='dudu',cust='cust2',dt=date '2015-01-23') 
location '/tmp/t/cust2/aaa/2015/01/23/bbb/ccc/dudu/ddd';
alter table t add partition (user='raj' ,cust='cust1',dt=date '2015-02-04') 
location '/tmp/t/raj/20150204/cust1';
alter table t add partition (user='raj' ,cust='cust2',dt=date '2015-02-04') 
location '/tmp/t/raj/cust2/yyy/20150204/zzz';
alter table t add partition (user='raj' ,cust='cust3',dt=date '2015-02-04') 
location '/tmp/t/bla/bla/bla/raj/yada/yada/yada/cust3/yyy/20150204/zzz';


·  The partitions’ values and their corresponding locations are all saved in 
the metastore

·  The metastore is being queried based on our query predicates, returning the 
list of relevant partitions/locations


explain dependency select * from t where (cust like '%1' and dt < date 
'2015-02-01') or (user='raj' and substr(cust,-1) = 3) ;

{"input_partitions":[{"partitionName":"default@t@user=dudu/cust=cust1/dt=2015-01-22"},{"partitionName":"default@t@user=raj/cust=cust3/dt=2015-02-04"}],"input_tables":[{"tablename":"default@t","tabletype":"EXTERNAL_TABLE"}]}

select *,input__file__name from t where (cust like '%1' and dt < date 
'2015-02-01') or (user='raj' and substr(cust,-1) = 3) ;

1 dudu  cust1 2015-01-22 
hdfs://quickstart.cloudera:8020/tmp/t/20150122/dudu/cust1/data.txt
2 dudu  cust1 2015-01-22 
hdfs://quickstart.cloudera:8020/tmp/t/20150122/dudu/cust1/data.txt
3 dudu  cust1 2015-01-22 
hdfs://quickstart.cloudera:8020/tmp/t/20150122/dudu/cust1/data.txt
10    raj   cust3 2015-02-04  
hdfs://quickstart.cloudera:8020/tmp/t/bla/bla/bla/raj/yada/yada/yada/cust3/yyy/20150204/zzz/data.txt
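
The registered partitions can also be double-checked from the metastore afterwards; the listing should look roughly like this for the example above:

hive> show partitions t;

user=dudu/cust=cust1/dt=2015-01-22
user=dudu/cust=cust2/dt=2015-01-23
user=raj/cust=cust1/dt=2015-02-04
user=raj/cust=cust2/dt=2015-02-04
user=raj/cust=cust3/dt=2015-02-04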



From: Markovitz, Dudu [mailto:dmarkov...@paypal.com]
Sent: Monday, June 06, 2016 6:10 PM
To: user@hive.apache.org
Subject: RE: alter partitions on hive external table

… are just logical connections between certain values and specific directories …

From: Markovitz, Dudu [mailto:dmarkov...@paypal.com]
Sent: Monday, June 06, 2016 6:07 PM
To: user@hive.apache.org<mailto:user@hive.apache.org>
Subject: RE: alter partitions on hive external table

Hi Raj


1.   I don’t understand the reason for this change, can you please 
elaborate?



2.   External table is just an interface. Instructions for how to read 
existing data.

Partitions of external table are just a logical connections between certain 
values and a specific directories.

You can connect any set of values to any directory no matter what the 
directories structure is and then query the external table filtering on this 
values and by that eliminating the query only to the directories you are 
interested in.



3.   By all means, don’t duplicate data without a good reason (unless you 
don’t care about wasting storage, time, CPU etc.)

It seems to me that all you need to do is to retrieve a list of the directories 
and generate “alter table … add partition…” statements based on that.


Dudu

From: raj hive [mailto:raj.hiv...@gmail.com]
Sent: Monday, June 06, 2016 6:02 AM
To: user@hive.apache.org<mailto:user@hive.apache.org>
Subject: alter partitions on hive external table

Hi friends,
I have created partitions on hive external tables. partitions on 
datetime/userid/customerId.
now i have to change the order of the partitions for the existing data for all 
the dates.
order 

RE: alter partitions on hive external table

2016-06-06 Thread Markovitz, Dudu
… are just logical connections between certain values and specific directories …

From: Markovitz, Dudu [mailto:dmarkov...@paypal.com]
Sent: Monday, June 06, 2016 6:07 PM
To: user@hive.apache.org
Subject: RE: alter partitions on hive external table

Hi Raj


1.   I don’t understand the reason for this change, can you please 
elaborate?



2.   External table is just an interface. Instructions for how to read 
existing data.

Partitions of external table are just a logical connections between certain 
values and a specific directories.

You can connect any set of values to any directory no matter what the 
directories structure is and then query the external table filtering on this 
values and by that eliminating the query only to the directories you are 
interested in.



3.   By all means, don’t duplicate data without a good reason (unless you 
don’t care about wasting storage, time, CPU etc.)

It seems to me that all you need to do is to retrieve a list of the directories 
and generate “alter table … add partition…” statements based on that.


Dudu

From: raj hive [mailto:raj.hiv...@gmail.com]
Sent: Monday, June 06, 2016 6:02 AM
To: user@hive.apache.org<mailto:user@hive.apache.org>
Subject: alter partitions on hive external table

Hi friends,
I have created partitions on hive external tables. partitions on 
datetime/userid/customerId.
now i have to change the order of the partitions for the existing data for all 
the dates.
order of the partition is custerid/userid/datetime.
Anyone can help me, how to alter the partitions for the existing table. Need a 
help to write a script to change the partions on existing data. almost 3 months 
data is there to modify as per new partition so changing each date is 
difficult. Any expert can help me.
Thanks
Raj


RE: alter partitions on hive external table

2016-06-06 Thread Markovitz, Dudu
Hi Raj


1.   I don’t understand the reason for this change, can you please 
elaborate?



2.   External table is just an interface. Instructions for how to read 
existing data.

Partitions of external table are just a logical connections between certain 
values and a specific directories.

You can connect any set of values to any directory no matter what the 
directories structure is and then query the external table filtering on this 
values and by that eliminating the query only to the directories you are 
interested in.



3.   By all means, don’t duplicate data without a good reason (unless you 
don’t care about wasting storage, time, CPU etc.)

It seems to me that all you need to do is to retrieve a list of the directories 
and generate “alter table … add partition…” statements based on that.


Dudu
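
For example, something along these lines could generate the statements from the directory listing (a rough sketch only – it assumes a fixed layout like /data/t/<datetime>/<userid>/<customerid>, so the cut field numbers, table name and column names must be adapted; dt stands in for the datetime value):

for d in $(hdfs dfs -ls -d /data/t/*/*/* | awk '{print $NF}')
do
  dt=$(echo ${d}   | cut -d/ -f4)
  usr=$(echo ${d}  | cut -d/ -f5)
  cust=$(echo ${d} | cut -d/ -f6)
  echo "alter table t add if not exists partition (customerid='${cust}',userid='${usr}',dt='${dt}') location '${d}';"
done > add_partitions.hql

hive -f add_partitions.hql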

From: raj hive [mailto:raj.hiv...@gmail.com]
Sent: Monday, June 06, 2016 6:02 AM
To: user@hive.apache.org
Subject: alter partitions on hive external table

Hi friends,
I have created partitions on hive external tables. partitions on 
datetime/userid/customerId.
now i have to change the order of the partitions for the existing data for all 
the dates.
order of the partition is custerid/userid/datetime.
Anyone can help me, how to alter the partitions for the existing table. Need a 
help to write a script to change the partions on existing data. almost 3 months 
data is there to modify as per new partition so changing each date is 
difficult. Any expert can help me.
Thanks
Raj


RE: Convert date in string format to timestamp in table definition

2016-06-05 Thread Markovitz, Dudu
‘Never’ is a strong word.


1.   We’re talking about the metadata so –



a.   The data format is irrelevant



b.  The records number is small (scale of thousands)

I would have sacrificed 1 second of metadata processing for a better user 
experience



2.   Partitions values are being held in the metastore (at least with 
MySQL)  as strings

Dudu
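
The raw values can be seen in the metastore database itself, e.g. against a MySQL metastore (table and column names per the stock schema – worth double-checking on your version):

select p.PART_NAME, v.INTEGER_IDX, v.PART_KEY_VAL
from   PARTITIONS p
join   PARTITION_KEY_VALS v
on     v.PART_ID = p.PART_ID;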

From: Jörn Franke [mailto:jornfra...@gmail.com]
Sent: Sunday, June 05, 2016 11:38 AM
To: user@hive.apache.org
Subject: Re: Convert date in string format to timestamp in table definition

Never use string when you can use int - the performance will be much better - 
especially for tables in Orc / parquet format

On 04 Jun 2016, at 22:31, Igor Kravzov 
<igork.ine...@gmail.com<mailto:igork.ine...@gmail.com>> wrote:
Thanks Dudu.
So if I need actual date I will use view.
Regarding partition column:  I can create 2 external tables based on the same 
data with integer or string column partition and see which one is more 
convenient for our use.

On Sat, Jun 4, 2016 at 2:20 PM, Markovitz, Dudu 
<dmarkov...@paypal.com<mailto:dmarkov...@paypal.com>> wrote:
I’m not aware of an option to do what you request in the external table 
definition but you might want to that using a view.

P.s.
It seems to me that defining the partition column as a string would be more user 
friendly than an integer, e.g. –

select * from threads_test where yyyymmdd like ‘2016%’ -- year 2016;
select * from threads_test where yyyymmdd like ‘201603%’ -- March 2016;
select * from threads_test where yyyymmdd like ‘______01’ -- first of every 
month;





$ hdfs dfs -ls -R /tmp/threads_test
drwxr-xr-x   - cloudera supergroup  0 2016-06-04 10:45 
/tmp/threads_test/20160604
-rw-r--r--   1 cloudera supergroup136 2016-06-04 10:45 
/tmp/threads_test/20160604/data.txt

$ hdfs dfs -cat /tmp/threads_test/20160604/data.txt
{"url":"www.blablabla.com<http://www.blablabla.com>","pageType":"pg1","addDate":"2016-05-17T02:10:44.527","postDate":"2016-05-16T02:08:55","postText":"YadaYada"}




hive> add jar /usr/lib/hive-hcatalog/share/hcatalog/hive-hcatalog-core.jar;

hive>
create external table threads_test
(
url string
   ,pagetype    string
   ,adddate     string
   ,postdate    string
   ,posttext    string
)
partitioned by (yyyymmdd string)
row format serde 'org.apache.hive.hcatalog.data.JsonSerDe'
location '/tmp/threads_test'
;

hive> alter table threads_test add partition (yyyymmdd=20160604) location 
'/tmp/threads_test/20160604';

hive> select * from threads_test;

www.blablabla.com    pg1    2016-05-17T02:10:44.527    2016-05-16T02:08:55    YadaYada    20160604

hive>
create view threads_test_v
as
select  url
   ,pagetype
   ,cast (concat_ws(' ',substr (adddate ,1,10),substr (adddate ,12)) as 
timestamp)  as adddate
   ,cast (concat_ws(' ',substr (postdate,1,10),substr (postdate,12)) as 
timestamp)  as postdate
   ,posttext

fromthreads_test
;

hive> select * from threads_test_v;

www.blablabla.com    pg1    2016-05-17 02:10:44.527    2016-05-16 02:08:55    YadaYada


From: Igor Kravzov 
[mailto:igork.ine...@gmail.com<mailto:igork.ine...@gmail.com>]
Sent: Saturday, June 04, 2016 8:13 PM
To: user@hive.apache.org<mailto:user@hive.apache.org>
Subject: Convert date in string format to timestamp in table definition

Hi,

I have 2 dates in Json file defined like this
"addDate": "2016-05-17T02:10:44.527",
  "postDate": "2016-05-16T02:08:55",

Right now I define external table based on this file like this:
CREATE external TABLE threads_test
(url string,
 pagetype string,
 adddate string,
 postdate string,
 posttext string)
partitioned by (yyyymmdd int)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
location 'my location';

is it possible to define these 2 dates as timestamp?
Do I need to change date format in the file? is it possible to specify date 
format in table definition?
Or I better off with string?

Thanks in advance.



RE: Convert date in string format to timestamp in table definition

2016-06-04 Thread Markovitz, Dudu
I’m not aware of an option to do what you request in the external table 
definition but you might want to that using a view.

P.s.
It seems to me that defining the partition column as a string would be more user 
friendly than an integer, e.g. –

select * from threads_test where yyyymmdd like ‘2016%’ -- year 2016;
select * from threads_test where yyyymmdd like ‘201603%’ -- March 2016;
select * from threads_test where yyyymmdd like ‘______01’ -- first of every 
month;





$ hdfs dfs -ls -R /tmp/threads_test
drwxr-xr-x   - cloudera supergroup  0 2016-06-04 10:45 
/tmp/threads_test/20160604
-rw-r--r--   1 cloudera supergroup136 2016-06-04 10:45 
/tmp/threads_test/20160604/data.txt

$ hdfs dfs -cat /tmp/threads_test/20160604/data.txt
{"url":"www.blablabla.com","pageType":"pg1","addDate":"2016-05-17T02:10:44.527","postDate":"2016-05-16T02:08:55","postText":"YadaYada"}




hive> add jar /usr/lib/hive-hcatalog/share/hcatalog/hive-hcatalog-core.jar;

hive>
create external table threads_test
(
url string
   ,pagetype    string
   ,adddate     string
   ,postdate    string
   ,posttext    string
)
partitioned by (yyyymmdd string)
row format serde 'org.apache.hive.hcatalog.data.JsonSerDe'
location '/tmp/threads_test'
;

hive> alter table threads_test add partition (yyyymmdd=20160604) location 
'/tmp/threads_test/20160604';

hive> select * from threads_test;

www.blablabla.com    pg1    2016-05-17T02:10:44.527    2016-05-16T02:08:55    YadaYada    20160604

hive>
create view threads_test_v
as
select  url
   ,pagetype
   ,cast (concat_ws(' ',substr (adddate ,1,10),substr (adddate ,12)) as 
timestamp)  as adddate
   ,cast (concat_ws(' ',substr (postdate,1,10),substr (postdate,12)) as 
timestamp)  as postdate
   ,posttext

fromthreads_test
;

hive> select * from threads_test_v;

www.blablabla.com    pg1    2016-05-17 02:10:44.527    2016-05-16 02:08:55    YadaYada


From: Igor Kravzov [mailto:igork.ine...@gmail.com]
Sent: Saturday, June 04, 2016 8:13 PM
To: user@hive.apache.org
Subject: Convert date in string format to timestamp in table definition

Hi,

I have 2 dates in Json file defined like this
"addDate": "2016-05-17T02:10:44.527",
  "postDate": "2016-05-16T02:08:55",

Right now I define external table based on this file like this:
CREATE external TABLE threads_test
(url string,
 pagetype string,
 adddate string,
 postdate string,
 posttext string)
partitioned by (yyyymmdd int)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
location 'my location';

is it possible to define these 2 dates as timestamp?
Do I need to change date format in the file? is it possible to specify date 
format in table definition?
Or I better off with string?

Thanks in advance.


RE: External partitoned table based on yyyy/mm/dd HDFS structure

2016-06-03 Thread Markovitz, Dudu
Can you do something similar to this (shell commands)?

dt=$(date +"%Y%m%d")
cmd="alter table t3 add if not exists partition (mmdd='${dt}') location 
'/user/dmarkovitz/t/${dt}'"
hive -e "${cmd}"

From: Igor Kravzov [mailto:igork.ine...@gmail.com]
Sent: Friday, June 03, 2016 6:41 PM
To: user@hive.apache.org
Subject: Re: External partitoned table based on yyyy/mm/dd HDFS structure

Thank you Dudu.
regarding #2.
I am planning to ingest data using Apache NiFi PutHDFS processor and it will be 
able to create a directory but not execute 'Alter  Table...'.
Does Hive have function similar to EXEC in MS SQL? I am thinking to construct 
'alter table...' string dynamically every day and execute it somehow.

On Fri, Jun 3, 2016 at 5:22 AM, Markovitz, Dudu 
<dmarkov...@paypal.com<mailto:dmarkov...@paypal.com>> wrote:
1.
If the directory name is in the format of key=val
The partition column name should be key

e.g.
/user/igor/data/dt=2016-06-02
create external table t (i int) partitioned by (dt date) location 
'/user/igor/data/';

2.
I would have used msck repair table t for ad-hoc operations.
It scans the whole table HDFS tree and if you have a lot of directories it 
might be costly.
I would suggest to add “Alter table t add partition …” to the process that 
creates the new directories and adds the data.

3.
Partitioning:


• Metadata performance wise, you should strive to create the minimum 
number of partitions.



• Query performance wise, you should strive to partitions` granularity 
that matches your common queries

o   If you usually select whole years, create a yearly partitions

o   If you usually select whole months, create monthly partitions

o   If you usually select few days, create daily partitions



In addition, partitions should be big enough to have a performance advantage.

Don’t partition small tables.



• Maintenance performance wise, your partitions should be small enough 
to be handled by operations such as

ALTER TABLE table_name [PARTITION partition_spec] SET FILEFORMAT file_format;

in a reasonable time.



Common practice is a daily partition
~365 days * 10 years = 3,650 partitions, which is O.K.
Try not to generate more than few thousands partitions




From: Igor Kravzov 
[mailto:igork.ine...@gmail.com<mailto:igork.ine...@gmail.com>]
Sent: Thursday, June 02, 2016 5:55 PM
To: user@hive.apache.org<mailto:user@hive.apache.org>
Subject: Re: External partitoned table based on yyyy/mm/dd HDFS structure

Thanks Dudu for the great explanation.
I was doing some reading and thinking instead of complicated hierarchical 
structure to have flat one.
Like

/user/igor/data/date=2016-06-02
create external table t (i int) partitioned by (yyyymmdd date) location 
'/user/igor/data/';
 or
 /user/igor/date=20160602
create external table t (i int) partitioned by (yyyymmdd int) location 
'/user/igor/data/';
Will it work?

Also I will need to schedule msck repair table t; if I want partitions 
automatically picked up. Hive does not have this feature. Correct?

What is the optimal directory size for a partition? Is about 2GB OK?


On Wed, Jun 1, 2016 at 4:38 PM, Markovitz, Dudu 
<dmarkov...@paypal.com<mailto:dmarkov...@paypal.com>> wrote:
The short answer:
In this naming convention it will require to specifically define each partition.
If the naming convention was yyyy=2016/mm=11/dd=28 instead of 2016/11/28 it 
would have been straight forward.

Dudu

+ The long answer:


-- bash


mkdir t
mkdir t/2015
mkdir t/2015/01
mkdir t/2015/01/22
mkdir t/2015/01/23
mkdir t/2015/02
mkdir t/2015/02/17
mkdir t/2015/03
mkdir t/2015/03/04
mkdir t/2015/03/05
mkdir t/2015/03/06
mkdir t/2016
mkdir t/2016/10
mkdir t/2016/10/01
mkdir t/2016/10/02
mkdir t/2016/10/03
mkdir t/2016/11
mkdir t/2016/11/27
mkdir t/2016/11/28


echo -e "1\n2\n3"   > t/2015/01/22/data.txt
echo -e "4" > t/2015/01/23/data.txt
echo -e "5\n6"  > t/2015/02/17/data.txt
echo -e "7\n8\n9"   > t/2015/03/04/data.txt
echo -e "10"> t/2015/03/05/data.txt
echo -e "11\n12"> t/2015/03/06/data.txt
echo -e "13\n14"> t/2016/10/01/data.txt
echo -e "15\n16\n17\n18\n19"> t/2016/10/02/data.txt
echo -e "20\n21"> t/2016/10/03/data.txt
echo -e "22"> t/2016/11/27/data.txt
echo -e "23\n24\n25"> t/2016/11/28/data.txt

hdfs dfs -put t /user/dmarkovitz/t


t
├── 2015
│   ├── 01
│   │   ├── 22
│   │   │   └── data.txt
│   │   └── 23
│   │   └── d

RE: LINES TERMINATED BY only supports newline '\n' right now

2016-06-03 Thread Markovitz, Dudu
Here is an example, but first – some warnings:


· You should set textinputformat.record.delimiter not only for the 
populating of the table but also for querying it



· There seems to be many issues around this area –

o   When I tried to insert multiple values in a single statement (“insert into 
table … values (…),(…),(…)”) only the first set of values was inserted correctly

o   2 new lines were added to the end of each text (‘lyrics’) although there 
should be none.

o   Aggregative queries seems to return null for the last column. Sometimes.

o   The function ‘sentences’ does not work as expected. It treated the whole 
text as a single line.



Dudu





Example


hive> create table songs (id int,name string,lyrics string)
;

hive> set textinputformat.record.delimiter='\0'
;

hive> insert into table songs values
(
1
,'All For Leyna'
,'She stood on the tracks
Waving her arms
Leading me to that third rail shock
Quick as a wink
She changed her mind'
)
;

hive> insert into table songs values
(
2
,'Goodnight Saigon'
,'We met as soul mates
On Parris Island
We left as inmates
From an asylum
And we were sharp
As sharp as knives
And we were so gung ho
To lay down our lives'
)
;

hive> select id,name,length(lyrics) from songs;

1  All For Leyna  114
2  Goodnight Saigon155


hive> select id,name,hex(lyrics) from songs;

1  All For Leyna
5368652073746F6F64206F6E2074686520747261636B730A576176696E67206865722061726D730A4C656164696E67206D6520746F2074686174207468697264207261696C2073686F636B0A517569636B20617320612077696E6B0A536865206368616E67656420686572206D696E640A0A
2  Goodnight Saigon
5765206D657420617320736F756C206D617465730A4F6E205061727269732049736C616E640A5765206C65667420617320696E6D617465730A46726F6D20616E206173796C756D0A416E6420776520776572652073686172700A4173207368617270206173206B6E697665730A416E64207765207765726520736F2067756E6720686F0A546F206C617920646F776E206F7572206C697665730A0A

hive> select id,name,regexp_replace(lyrics,'\n','<<>>') from songs;

1  All For Leyna  She stood on the tracks<<>>Waving 
her arms<<>>Leading me to that third rail shock<<>>Quick as a 
wink<<>>She changed her mind<<>><<>>
2  Goodnight SaigonWe met as soul mates<<>>On 
Parris Island<<>>We left as inmates<<>>From an 
asylum<<>>And we were sharp<<>>As sharp as 
knives<<>>And we were so gung ho<<>>To lay down our 
lives<<>><<>>

hive> select id,name,split(lyrics,'\n') from songs;

1  All For Leyna  ["She stood on the tracks","Waving her 
arms","Leading me to that third rail shock","Quick as a wink","She changed her 
mind","",""]
2  Goodnight Saigon["We met as soul mates","On Parris 
Island","We left as inmates","From an asylum","And we were sharp","As sharp as 
knives","And we were so gung ho","To lay down our lives","",""]

hive> select id,name,sentences(lyrics) from songs;

1  All For Leyna
[["She","stood","on","the","tracks","Waving","her","arms","Leading","me","to","that","third","rail","shock","Quick","as","a","wink","She","changed","her","mind"]]
2  Goodnight Saigon
[["We","met","as","soul","mates","On","Parris","Island","We","left","as","inmates","From","an","asylum","And","we","were","sharp","As","sharp","as","knives","And","we","were","so","gung","ho","To","lay","down","our","lives"]]

hive> select count (*) from songs;

NULL

hive> select count (*),123,456,789 from songs;

2  123 456 NULL

hive> select count (*),'A','B','C' from songs;

2  A B C


From: Radha krishna [mailto:grkmc...@gmail.com]
Sent: Thursday, June 02, 2016 12:42 PM
To: user@hive.apache.org
Subject: LINES TERMINATED BY only supports newline '\n' right now

For some of the columns '\n' character is there as part of value, i want to 
create a hive table for this data i tried by creating the hive table with US as 
the line separator but it showing the below message

Ex:
CREATE EXTERNAL TABLE IF NOT EXISTS emp (name String,id int)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '28'
LINES TERMINATED BY '31'
LOCATION 'emp.txt file path';

FAILED: SemanticException 4:20 LINES TERMINATED BY only supports newline '\n' 
right now. Error encountered near token ''31''

how can we create hive table with out removing the \n character as part of 
column value ( requirement data need to maintain as it is)

can any one implemented hive tables with line separated other than \n

Thanks & Regards
   Radha krishna



RE: External partitoned table based on yyyy/mm/dd HDFS structure

2016-06-03 Thread Markovitz, Dudu
1.
If the directory name is in the format of key=val
The partition column name should be key

e.g.
/user/igor/data/dt=2016-06-02
create external table t (i int) partitioned by (dt date) location 
'/user/igor/data/';

2.
I would have used msck repair table t for ad-hoc operations.
It scans the whole table HDFS tree and if you have a lot of directories it 
might be costly.
I would suggest to add “Alter table t add partition …” to the process that 
creates the new directories and adds the data.

3.
Partitioning:


· Metadata performance wise, you should strive to create the minimum 
number of partitions.



· Query performance wise, you should strive to partitions` granularity 
that matches your common queries

o   If you usually select whole years, create a yearly partitions

o   If you usually select whole months, create monthly partitions

o   If you usually select few days, create daily partitions



In addition, partitions should be big enough to have a performance advantage.

Don’t partition small tables.



· Maintenance performance wise, your partitions should be small enough 
to be handled by operations such as

ALTER TABLE table_name [PARTITION partition_spec] SET FILEFORMAT file_format;

in a reasonable time.



Common practice is a daily partition
~365 days * 10 years = 3,650 partitions, which is O.K.
Try not to generate more than few thousands partitions
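
As an illustration of the daily case, a minimal sketch (table and column names are made up here):

create external table events (msg string)
partitioned by (dt date)
location '/tmp/events';

-- a filter on the partition column prunes the scan to the matching daily directories only
select count(*) from events where dt between date '2016-05-01' and date '2016-05-07';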




From: Igor Kravzov [mailto:igork.ine...@gmail.com]
Sent: Thursday, June 02, 2016 5:55 PM
To: user@hive.apache.org
Subject: Re: External partitoned table based on yyyy/mm/dd HDFS structure

Thanks Dudu for the great explanation.
I was doing some reading and thinking instead of complicated hierarchical 
structure to have flat one.
Like

/user/igor/data/date=2016-06-02
create external table t (i int) partitioned by (yyyymmdd date) location 
'/user/igor/data/';
 or
 /user/igor/date=20160602
create external table t (i int) partitioned by (yyyymmdd int) location 
'/user/igor/data/';
Will it work?

Also I will need to schedule msck repair table t; if I want partitions 
automatically picked up. Hive does not have this feature. Correct?

What is the optimal directory size for a partition? Is about 2GB OK?


On Wed, Jun 1, 2016 at 4:38 PM, Markovitz, Dudu 
<dmarkov...@paypal.com<mailto:dmarkov...@paypal.com>> wrote:
The short answer:
In this naming convention it will require to specifically define each partition.
If the naming convention was yyyy=2016/mm=11/dd=28 instead of 2016/11/28 it 
would have been straight forward.

Dudu

+ The long answer:


-- bash


mkdir t
mkdir t/2015
mkdir t/2015/01
mkdir t/2015/01/22
mkdir t/2015/01/23
mkdir t/2015/02
mkdir t/2015/02/17
mkdir t/2015/03
mkdir t/2015/03/04
mkdir t/2015/03/05
mkdir t/2015/03/06
mkdir t/2016
mkdir t/2016/10
mkdir t/2016/10/01
mkdir t/2016/10/02
mkdir t/2016/10/03
mkdir t/2016/11
mkdir t/2016/11/27
mkdir t/2016/11/28


echo -e "1\n2\n3"   > t/2015/01/22/data.txt
echo -e "4" > t/2015/01/23/data.txt
echo -e "5\n6"  > t/2015/02/17/data.txt
echo -e "7\n8\n9"   > t/2015/03/04/data.txt
echo -e "10"> t/2015/03/05/data.txt
echo -e "11\n12"> t/2015/03/06/data.txt
echo -e "13\n14"> t/2016/10/01/data.txt
echo -e "15\n16\n17\n18\n19"> t/2016/10/02/data.txt
echo -e "20\n21"> t/2016/10/03/data.txt
echo -e "22"> t/2016/11/27/data.txt
echo -e "23\n24\n25"> t/2016/11/28/data.txt

hdfs dfs -put t /user/dmarkovitz/t


t
├── 2015
│   ├── 01
│   │   ├── 22
│   │   │   └── data.txt
│   │   └── 23
│   │   └── data.txt
│   ├── 02
│   │   └── 17
│   │   └── data.txt
│   └── 03
│   ├── 04
│   │   └── data.txt
│   ├── 05
│   │   └── data.txt
│   └── 06
│   └── data.txt
└── 2016
├── 10
│   ├── 01
│   │   └── data.txt
│   ├── 02
│   │   └── data.txt
│   └── 03
│   └── data.txt
└── 11
├── 27
│   └── data.txt
└── 28
└── data.txt



-- hive


set hive.mapred.supports.subdirectories=true;
set mapred.input.dir.recursive=true;


-- t1: no partitions

RE: How to disable SMB join?

2016-05-31 Thread Markovitz, Dudu
Hi

The documentation describes a scenario where SMB join leads to the same error 
you’ve got.
It claims that changing the order of the tables solves the problem.

Dudu


https://cwiki.apache.org/confluence/display/Hive/LanguageManual+JoinOptimization#LanguageManualJoinOptimization-SMBJoinacrossTableswithDifferentKeys
SMB Join across Tables with Different Keys
If the tables have differing number of keys, for example Table A has 2 SORT 
columns and Table B has 1 SORT column, then you might get an index out of 
bounds exception.
The following query results in an index out of bounds exception because 
emp_person let us say for example has 1 sort column while emp_pay_history has 2 
sort columns.
Error Hive 0.11
SELECT p.*, py.*
FROM emp_person p INNER JOIN emp_pay_history py
ON   p.empid = py.empid

This works fine.
Working query Hive 0.11
SELECT p.*, py.*
FROM emp_pay_history py INNER JOIN emp_person p
ON   p.empid = py.empid
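
For what it's worth, the switches that appear to govern the sort-merge conversion are the following; whether they are enough to work around HIVE-13282 on Hive 1.2 / Tez 0.7 would need to be verified:

set hive.auto.convert.sortmerge.join=false;
set hive.optimize.bucketmapjoin.sortedmerge=false;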




From: Banias H [mailto:banias4sp...@gmail.com]
Sent: Tuesday, May 31, 2016 8:09 PM
To: user@hive.apache.org
Subject: How to disable SMB join?

Hi,

Does anybody know if there a config setting to disable SMB join?

One of our Hive queries failed with ArrayIndexOutOfBoundsException when Tez is 
the execution engine. The error seems to be addressed by 
https://issues.apache.org/jira/browse/HIVE-13282

We have Hive 1.2 and Tez 0.7 in our cluster and the workaround suggested in the 
ticket is to disable SMB join. I searched around and only found the setting to 
convert to SMB MapJoin. Any help on disabling SMB join altogether would be 
appreciated. Thanks.

-B





RE: Does hive need exact schema in Hive Export/Import?

2016-05-30 Thread Markovitz, Dudu
Hi

1)
I was able to do the import by doing the following manipulation:


· Export table dev101

· Create an empty table dev102

· Export table dev102

· replace the _metadata file of dev101 with the _metadata file of dev102

· import table dev101 to table dev102

2)
Another option is not to create dev102 in advance but let the import from 
dev101 to create it.
After the import you can alter the table, e.g.:

Alter table dev102 change column col2 col2 varchar(10);


Dudu

From: Devender Yadav [mailto:devender.ya...@impetus.co.in]
Sent: Monday, May 30, 2016 2:38 PM
To: user@hive.apache.org
Subject: Does hive need exact schema in Hive Export/Import?


Hi All,


I am using HDP 2.3

- Hadoop version - 2.7.1

- Hive version - 1.2.1


I created a table dev101 in hive using

create table dev101 (col1 int, col2 char(10));

I inserted two records using

insert into dev101 values (1, 'value1');
insert into dev101 values (2, 'value2');

I exported data to HDFS using

export table dev101 to '/tmp/dev101';


Then, I created a new table dev102 using

create table dev102 (col1 int, col2 String);


I imported data from `/tmp/dev101` into `dev102` using

import table dev102 from '/tmp/dev101';

I got error:

>FAILED: SemanticException [Error 10120]: The existing table is not compatible 
>with the import spec.   Column Schema does not match


Then I created another table `dev103` using

create table dev103 (col1 int, col2 char(50));

Again imported:

import table dev103 from '/tmp/dev101';

Same error:

>FAILED: SemanticException [Error 10120]: The existing table is not compatible 
>with the import spec.   Column Schema does not match

Finally, I create table with **exactly same schema**

create table dev104 (col1 int, col2 char(10));

And imported

import table dev104 from '/tmp/dev101';

Imported Successfully.

Does hive need exact schema in Hive Export/Import?




Regards,
Devender








NOTE: This message may contain information that is confidential, proprietary, 
privileged or otherwise protected by law. The message is intended solely for 
the named addressee. If received in error, please destroy and notify the 
sender. Any use of this email is prohibited when received in error. Impetus 
does not represent, warrant and/or guarantee, that the integrity of this 
communication has been maintained nor that the communication is free of errors, 
virus, interception or interference.


RE: Test

2016-05-29 Thread Markovitz, Dudu
Hi ☺

From: Igor Kravzov [mailto:igork.ine...@gmail.com]
Sent: Sunday, May 29, 2016 8:02 PM
To: user@hive.apache.org
Subject: Test

Please someone reply. Not sure if subscribed properly


RE: Any way in hive to have functionality like SQL Server collation on Case sensitivity

2016-05-25 Thread Markovitz, Dudu
It will not be suitable for JOIN operation since it will cause a Cartesian 
product.
Any chosen solution should determine a single representation for any given 
string.
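For example, a minimal sketch of the single-representation approach, normalizing both join 
keys with lower() (table and column names are illustrative):

select  t1.*, t2.*
from    t1
join    t2
on      lower(t1.name) = lower(t2.name);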

Dudu

From: Mich Talebzadeh [mailto:mich.talebza...@gmail.com]
Sent: Wednesday, May 25, 2016 1:31 AM
To: user 
Subject: Re: Any way in hive to have functionality like SQL Server collation on 
Case sensitivity

I would rather go for something like compare(), which allows one to directly compare two 
character strings based on alternate collation rules.

Hive does not have it. This is from SAP ASE

1> select compare ("aaa","bbb")
2> go
 ---
  -1
(1 row affected)
1> select compare ("aaa","Aaa")
2> go
 ---
   1
(1 row affected)

1> select compare ("aaa","AAA")
2> go
 ---
   1

•  The compare function returns the following values, based on the collation 
rules that you chose:

· 1 – indicates that char_expression1 or uchar_expression1 is greater 
than char_expression2 or uchar_expression2.

· 0 – indicates that char_expression1 or uchar_expression1 is equal to 
char_expression2 or uchar_expression2.

· -1 – indicates that char_expression1 or uchar_expression1 is less 
than char_expression2 or uchar expression2.

hive> select compare("aaa", "bbb");
FAILED: SemanticException [Error 10011]: Line 1:7 Invalid function 'compare'

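For what it is worth, a hedged stand-in using plain string comparison (binary ordering only, 
with no alternate collation rules):

select case when 'aaa' < 'bbb' then -1 when 'aaa' = 'bbb' then 0 else 1 end;   -- expected: -1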

HTH




Dr Mich Talebzadeh



LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 24 May 2016 at 21:15, mahender bigdata 
> wrote:
Hi,

We would like to have feature in Hive where string comparison should ignore 
case sensitivity while joining on String Columns in hive. This feature helps us 
in reducing code of calling Upper or Lower function on Join columns. If it is 
already there, please let me know settings to enable this feature.

/MS



RE: Hive 2 database Entity-Relationship Diagram

2016-05-19 Thread Markovitz, Dudu
Thanks Mich

I’m afraid the current format is not completely user friendly.
I would suggest dividing the tables into multiple sets by subject / graph 
connectivity (BTW, it seems odd that most of the tables are disconnected)

Also –

· HIVEUSER.PARTITION_KEY_VALS is partially covering another table

· The PDF is upside-down

Dudu

From: Mich Talebzadeh [mailto:mich.talebza...@gmail.com]
Sent: Thursday, May 19, 2016 8:04 PM
To: user ; user @spark 
Subject: Re: Hive 2 database Entity-Relationship Diagram

Attachement


Dr Mich Talebzadeh



LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 19 May 2016 at 18:02, Mich Talebzadeh 
> wrote:
Hi All,

I use Hive 2 with metastore created for Oracle Database with 
hive-txn-schema-2.0.0.oracle.sql.

It already includes concurrency stuff added into metastore

The RDBMS is Oracle Database 12c Enterprise Edition Release 12.1.0.2.0.

 I created an Entity-Relationship (ER) diagram from the physical model. There 
are 194 tables, 127 views and 38 relationships. The relationship notation is 
Bachman

Fairly big diagram in PDF format. However, you can zoom into it.


Please have a look; comments to me are appreciated, and if it is useful we can 
load it into the wiki.


HTH


Dr Mich Talebzadeh



LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com





RE: Could i use Hive SQL parser in our application?

2016-05-18 Thread Markovitz, Dudu
Hi

Can you please share what was the problem?

Thanks

Dudu

From: Heng Chen [mailto:heng.chen.1...@gmail.com]
Sent: Thursday, May 19, 2016 7:07 AM
To: user@hive.apache.org
Subject: Re: Could i use Hive SQL parser in our application?

Got it now!  Thanks again for your help! guys!

2016-05-19 11:09 GMT+08:00 Heng Chen 
>:
Hi, guys.

I write one example as @furcy said like this.


public static void main(String[] args) throws SemanticException, 
ParseException, IOException {
  String sql = "select * from table1 where a > 100";
  Context context = new Context(new HiveConf());
  ParseDriver pd = new ParseDriver();
  ASTNode tree = pd.parse(sql, context);
  System.out.println(tree);
}

When I run it, an exception is thrown; did I miss something?


Exception in thread "main" java.lang.NullPointerException: Conf non-local 
session path expected to be non-null
at com.google.common.base.Preconditions.checkNotNull(Preconditions.java:204)
at 
org.apache.hadoop.hive.ql.session.SessionState.getHDFSSessionPath(SessionState.java:669)
at org.apache.hadoop.hive.ql.Context.<init>(Context.java:133)
at org.apache.hadoop.hive.ql.Context.<init>(Context.java:120)
at com.fenbi.pipe.utils.LineageInfoUtils.main(LineageInfoUtils.java:24)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:144)








2016-05-19 10:20 GMT+08:00 Heng Chen 
>:
Thanks guys!  Let me try it firstly.

2016-05-19 1:44 GMT+08:00 Pengcheng Xiong 
>:
Hi Heng,

Sure you can. The Hive SQL parser is based on ANTLR and you can do that by 
taking that part out of Hive and integrating it into your application. Please let 
me know if you need any further help. Thanks.

Best
Pengcheng Xiong

On Wed, May 18, 2016 at 3:43 AM, Heng Chen 
> wrote:
Hi, guys.

  Recently,  we need to integrate Hive SQL parser in our application.  Is 
there any way to do it?

Thanks!






RE: Would like to be a user

2016-05-17 Thread Markovitz, Dudu
Thanks ☺

From: Lefty Leverenz [mailto:leftylever...@gmail.com]
Sent: Tuesday, May 17, 2016 10:06 PM
To: user@hive.apache.org
Subject: Re: Would like to be a user

Done.  Welcome to the Hive wiki team, Dudu!

-- Lefty


On Tue, May 17, 2016 at 3:04 AM, Markovitz, Dudu 
<dmarkov...@paypal.com<mailto:dmarkov...@paypal.com>> wrote:
Hi Lefty

Can you add me as well?
My user is dudu.markovitz

Thanks

Dudu

From: Lefty Leverenz 
[mailto:leftylever...@gmail.com<mailto:leftylever...@gmail.com>]
Sent: Monday, May 16, 2016 11:44 PM
To: user@hive.apache.org<mailto:user@hive.apache.org>
Subject: Re: Would like to be a user

Welcome Gail, I've given you write access to the Hive wiki.  Thanks in advance 
for your contributions!

-- Lefty

On Mon, May 16, 2016 at 4:39 PM, gail haspert 
<gkhasp...@gmail.com<mailto:gkhasp...@gmail.com>> wrote:
Hi
I would like to start working on (write to) apache hive. My confluence user 
name is ghaspert.

thanks
-- Forwarded message --
From: gail haspert <gkhasp...@gmail.com<mailto:gkhasp...@gmail.com>>
Date: Mon, May 16, 2016 at 12:40 PM
Subject: Would like to be a user
To: user-subscr...@hive.apache.org<mailto:user-subscr...@hive.apache.org>
I just signed up as ghaspert for a confluence account.
Thanks,
Gail





RE: Would like to be a user

2016-05-17 Thread Markovitz, Dudu
Hi Lefty

Can you add me as well?
My user is dudu.markovitz

Thanks

Dudu

From: Lefty Leverenz [mailto:leftylever...@gmail.com]
Sent: Monday, May 16, 2016 11:44 PM
To: user@hive.apache.org
Subject: Re: Would like to be a user

Welcome Gail, I've given you write access to the Hive wiki.  Thanks in advance 
for your contributions!

-- Lefty

On Mon, May 16, 2016 at 4:39 PM, gail haspert 
> wrote:
Hi
I would like to start working on (write to) apache hive. My confluence user 
name is ghaspert.

thanks
-- Forwarded message --
From: gail haspert >
Date: Mon, May 16, 2016 at 12:40 PM
Subject: Would like to be a user
To: user-subscr...@hive.apache.org

I just signed up as ghaspert for a confluence account.
Thanks,
Gail




RE: Hive cte Alias problem

2016-05-11 Thread Markovitz, Dudu
Hi

It seems that you are right and it is a bug in the CTE when there’s an “IS NULL” 
predicate involved.
I’ve opened a bug for this.
https://issues.apache.org/jira/browse/HIVE-13733

Dudu


hive> create table t (i int,a string,b string);
hive> insert into t values (1,'hello','world'),(2,'bye',null);
hive> select * from t where t.b is null;
2       bye     NULL

This is wrong, all 3 columns should return the same value - t.a:

hive> with cte as (select t.a as a,t.a as b,t.a as c from t where t.b is null) 
select * from cte;
bye     NULL    bye


However, these are right:

hive> select t.a as a,t.a as b,t.a as c from t where t.b is null;
bye     bye     bye


hive> with cte as (select t.a as a,t.a as b,t.a as c from t where t.b is not null) select * from cte;
OK
hello  hello  hello


From: mahender bigdata [mailto:mahender.bigd...@outlook.com]
Sent: Wednesday, May 11, 2016 4:03 AM
To: user@hive.apache.org
Subject: Hive cte Alias problem


Hi,

I see a peculiar difference when querying using a CTE where I'm aliasing one column in a 
table to another column name in the same table. Instead of returning the values of the 
source column, Hive returns NULLs, i.e. the column 8 values.
with cte_temp as
(
select  a.COLUMN1, a.Column2,a.Column2 as Column8,ID
 from  a where Coalesce(ltrim(rtrim(a.COLUMN1)), '') <> ''
  AND  Coalesce(ltrim(rtrim(a.COLUMN2)), '') <> ''
  AND  Coalesce(ltrim(rtrim(a.COLUMN2)), '') <> ''
  AND  a.COLUMN8 IS NULL
  and  a.ID = 100
)
select * from cte_temp ;

Results

cte_temp.column1,cte_temp.column2,cte_temp.column8,ID
Row1,UK,,49
Row5,UP,,49

In the above query, Col2 has non-null values and I'm filtering on Col8 = null; I'm aliasing 
Col2 as Col8. Whenever I perform SELECT * from the CTE, instead of showing the Col2 values 
it shows the Col8 values. Is it a bug with Hive?

When I run the query as a plain SQL SELECT only, it works fine.


select  a.COLUMN1, a.Column2,a.Column2 as Column8,ID
 from  a where Coalesce(ltrim(rtrim(a.COLUMN1)), '') <> ''
  AND  Coalesce(ltrim(rtrim(a.COLUMN2)), '') <> ''
  AND  Coalesce(ltrim(rtrim(a.COLUMN2)), '') <> ''
  AND  a.COLUMN8 IS NULL
  and  a.ID = 100

Results

cte_temp.column1,cte_temp.column2,cte_temp.column8,ID

Row1,UK,test,49

Row5,UP,test,49

Please let me know whether it is problem with CTE.

/Mahender


RE: Create external table

2016-05-11 Thread Markovitz, Dudu
Could not reproduce that issue on the Cloudera quickstart VM.

I’ve created an HDFS directory with 10,000 files.
I’ve created an external table from within beeline.
The creation was immediate.

Dudu

---
bash
---
mkdir files_10k
awk 'BEGIN{for (i=1;i<=10000;++i){print i>"./files_10k/f"i".txt"}}'
hdfs dfs -put files_10k /tmp

---
beeline
---
> create external table files_10k (i int) row format delimited fields 
> terminated by '\t' location '/tmp/files_10k';
No rows affected (0.282 seconds)
> select * from files_10k;
10,000 rows selected (27.986 seconds)

From: Margus Roo [mailto:mar...@roo.ee]
Sent: Tuesday, May 10, 2016 11:26 PM
To: user@hive.apache.org
Subject: Re: Create external table


Hi again

I opened hive (an old client)

And exactly the same create external table … location [path in HDFS to a place 
where there are loads of files] works, while the same DDL does not work via beeline.

Margus (margusja) Roo

http://margus.roo.ee

skype: margusja

+372 51 48 780
On 10/05/16 23:03, Margus Roo wrote:

Hi

Can someone explain or provide documentation on how Hive creates external tables?

I have a problem with creating an external table when I point the location to an HDFS 
directory that contains loads of files. Beeline just hangs or there are other errors.

When I point the location to an empty directory, Hive creates the table.



So does Hive look into the files while creating an external table?

I cannot find any documentation explaining it.

--

Margus (margusja) Roo

http://margus.roo.ee

skype: margusja

+372 51 48 780



RE: Any difference between LOWER and LCASE

2016-05-10 Thread Markovitz, Dudu
Hi

According to the documentation, LCASE is a synonym for LOWER.
From what I've seen in the source code, that seems right.

https://github.com/apache/hive/blob/f089f2e64241592ecf8144d044bec8a0659ff422/ql/src/java/org/apache/hadoop/hive/ql/exec/FunctionRegistry.java

system.registerGenericUDF("lower", GenericUDFLower.class);
system.registerGenericUDF("lcase", GenericUDFLower.class);


Please verify that you've run the exact same queries.
If you still see an issue, please share the relevant DDL (table/tables 
definition) and a small subset of data so I would be able to reproduce it.
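A minimal sanity check you could run (table and column names are illustrative):

select count(*) from t where lower(col) <> lcase(col);   -- expected: 0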

Thanks

Dudu

-Original Message-
From: mahender bigdata [mailto:mahender.bigd...@outlook.com] 
Sent: Wednesday, May 11, 2016 1:55 AM
To: user@hive.apache.org
Subject: Any difference between LOWER and LCASE

Hi Team,

Is there any difference between the LOWER and LCASE functions in Hive? For one of the 
queries, when we use LOWER in the WHERE condition it fails to match a record. When we 
changed it to LCASE, it started matching.
I was surprised to see differences between LOWER and LCASE. Does anyone know why there are 
2 functions for the same functionality? Does it have anything to do with special or 
Unicode characters where LOWER and LCASE differ in functionality?


/MS



RE: Unsupported SubQuery Expression '1': Only SubQuery expressions that are top level conjuncts are allowed

2016-05-10 Thread Markovitz, Dudu
You’re welcome

Dudu

From: mahender bigdata [mailto:mahender.bigd...@outlook.com]
Sent: Wednesday, May 11, 2016 1:43 AM
To: user@hive.apache.org
Subject: Re: Unsupported SubQuery Expression '1': Only SubQuery expressions 
that are top level conjuncts are allowed


Thanks Dudu, I made the modification as per our requirement. Your query helped me to 
modify it as per our requirement.

On 5/4/2016 10:57 AM, Markovitz, Dudu wrote:
Hi

The syntax is not Hive specific but SQL ANSI/ISO.
In a series of “JOIN … ON …” any “ON” can (but not necessarily have to) refer 
any of its preceding tables, e.g. –

select … from t1 join t2 on … *1 … join t3 on … *2 … join t4 on … *3 …
*1  The 1st “ON” can refer tables t1 & t2
*2  The 2nd “ON” can refer tables t1, t2 & t3
*3  The 3rd “ON” can refer tables t1, t2, t3 & t4

In our query the “… group by … > 1” combined with “b2.col1 is null” implements 
the functionality of the “not exists” from the original query.
The rest of the query stays quite the same.

Dudu

From: mahender bigdata [mailto:mahender.bigd...@outlook.com]
Sent: Wednesday, May 04, 2016 7:39 PM
To: user@hive.apache.org<mailto:user@hive.apache.org>
Subject: Re: Unsupported SubQuery Expression '1': Only SubQuery expressions 
that are top level conjuncts are allowed


Thanks Dudu,

Can you help me in parsing the below logic? I see that first you are joining table1 with 
the result set of the GROUP BY ... HAVING COUNT(*) > 1 query and then performing a left 
join with table2; how can we reference the "a" alias of the joined result, or will Hive 
pick up the "a" column from table1 and the 3 columns in table2?



thanks in advance



On 5/3/2016 11:24 AM, Markovitz, Dudu wrote:
Forget about the BTW…
Apparently hive behaves like sqlite in that matter and not like other databases

hive> select 1 from table1 having 1=1;
FAILED: SemanticException HAVING specified without GROUP BY

From: Markovitz, Dudu [mailto:dmarkov...@paypal.com]
Sent: Tuesday, May 03, 2016 8:36 PM
To: user@hive.apache.org<mailto:user@hive.apache.org>
Subject: RE: Unsupported SubQuery Expression '1': Only SubQuery expressions 
that are top level conjuncts are allowed

I left out the filter on column Col2 in order to simplify the test case.
The following query is logically equal to your original query.

BTW –
You don’t need the GROUP BY A.Col1 part in your original query

Dudu

create table Table1 (Col1 int,Col3 int);
create table Table2 (Col1 int,Col3 int);

insert into Table1 values (10,1),(20,2),(40,4),(60,7),(80,8);
insert into Table2 values (10,1),(30,2),(20,3),(50,4),(40,5),(40,6),(70,7);


select      *
from        table1 a
left join  (select      col1
            from        table2
            group by    col1
            having      count(*) > 1
           ) b2
        on  b2.col1 = a.col1
left join   table2 b
        on  a.col3  = b.col3
        and b2.col1 is null
;

10   1  NULL 10   1
20   2  NULL 30   2
40   4  40   NULL NULL
60   7  NULL 70   7
80   8  NULL NULL NULL

From: mahender bigdata [mailto:mahender.bigd...@outlook.com]
Sent: Tuesday, May 03, 2016 4:02 PM
To: user@hive.apache.org<mailto:user@hive.apache.org>
Subject: Re: Unsupported SubQuery Expression '1': Only SubQuery expressions 
that are top level conjuncts are allowed


Updated..

select A.Col1,A.Col2B.Col3

From Table1 A
LEFT OUTER JOIN Table2 B
ON  A.Col3= B.Col3
AND NOT EXISTS(SELECT 1 FROM Table2 B WHERE B.Col1= A.Col1 GROUP BY A.Col1 
HAVING COUNT(*)>1 )
 AND (CASE WHEN ISNULL(A.Col2,'\;')  = '\;' THEN 'NOT-NULL' ELSE 'NULL' 
END) = B.Col2)
On 5/2/2016 10:52 PM, Markovitz, Dudu wrote:
Hi

Before dealing the issue itself, can you please fix the query?
There are 3 aliased tables - Table1 (A), Table2 (B)  & Table2 (mb) but you’re 
using additional 2 aliases – ma & adi1.

Thanks

Dudu

select A.Col1,A.Col2B.Col3

From Table1 A

LEFT OUTER JOIN Table2 B
ON  A.Col3= B.Col3
AND NOT EXISTS(SELECT 1 FROM Table2 B WHERE B.Col1= A.Col1 GROUP BY A.Col1 
HAVING COUNT(*)>1 )
 AND (CASE WHEN ISNULL(A.Col2,'\;')  = '\;' THEN 'NOT-NULL' ELSE 'NULL' 
END) = B.Col2)





From: mahender bigdata [mailto:mahender.bigd...@outlook.com]
Sent: Tuesday, May 03, 2016 4:22 AM
To: user@hive.apache.org<mailto:user@hive.apache.org>
Subject: Unsupported SubQuery Expression '1': Only SubQuery expressions that 
are top level conjuncts are allowed


Hi,

Is there a way to implement  not exists in Hive. I'm using Hive 1.2. I'm 
getting below error

"Unsupported SubQuery Expression '1': Only SubQuery expressions that are top 
level conjuncts are allowed"

Query:



sele

RE: Unsupported SubQuery Expression '1': Only SubQuery expressions that are top level conjuncts are allowed

2016-05-04 Thread Markovitz, Dudu
Hi

The syntax is not Hive specific but SQL ANSI/ISO.
In a series of “JOIN … ON …” any “ON” can (but not necessarily have to) refer 
any of its preceding tables, e.g. –

select … from t1 join t2 on … *1 … join t3 on … *2 … join t4 on … *3 …
*1  The 1st “ON” can refer tables t1 & t2
*2  The 2nd “ON” can refer tables t1, t2 & t3
*3  The 3rd “ON” can refer tables t1, t2, t3 & t4

In our query the “… group by … > 1” combined with “b2.col1 is null” implements 
the functionality of the “not exists” from the original query.
The rest of the query stays quite the same.
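A concrete, illustrative example of the same rule (table and column names are made up):

select  *
from    t1
join    t2 on t2.id  = t1.id
join    t3 on t3.id  = t1.id
          and t3.grp = t2.grp   -- this ON refers to both t1 and t2
;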

Dudu

From: mahender bigdata [mailto:mahender.bigd...@outlook.com]
Sent: Wednesday, May 04, 2016 7:39 PM
To: user@hive.apache.org
Subject: Re: Unsupported SubQuery Expression '1': Only SubQuery expressions 
that are top level conjuncts are allowed


Thanks Dudu,

Can you help me in parsing the below logic? I see that first you are joining table1 with 
the result set of the GROUP BY ... HAVING COUNT(*) > 1 query and then performing a left 
join with table2; how can we reference the "a" alias of the joined result, or will Hive 
pick up the "a" column from table1 and the 3 columns in table2?



thanks in advance



On 5/3/2016 11:24 AM, Markovitz, Dudu wrote:
Forget about the BTW…
Apparently hive behaves like sqlite in that matter and not like other databases

hive> select 1 from table1 having 1=1;
FAILED: SemanticException HAVING specified without GROUP BY

From: Markovitz, Dudu [mailto:dmarkov...@paypal.com]
Sent: Tuesday, May 03, 2016 8:36 PM
To: user@hive.apache.org<mailto:user@hive.apache.org>
Subject: RE: Unsupported SubQuery Expression '1': Only SubQuery expressions 
that are top level conjuncts are allowed

I left out the filter on column Col2 in order to simplify the test case.
The following query is logically equal to your original query.

BTW –
You don’t need the GROUP BY A.Col1 part in your original query

Dudu

create table Table1 (Col1 int,Col3 int);
create table Table2 (Col1 int,Col3 int);

insert into Table1 values (10,1),(20,2),(40,4),(60,7),(80,8);
insert into Table2 values (10,1),(30,2),(20,3),(50,4),(40,5),(40,6),(70,7);


select      *
from        table1 a
left join  (select      col1
            from        table2
            group by    col1
            having      count(*) > 1
           ) b2
        on  b2.col1 = a.col1
left join   table2 b
        on  a.col3  = b.col3
        and b2.col1 is null
;

10   1  NULL 10   1
20   2  NULL 30   2
40   4  40   NULL NULL
60   7  NULL 70   7
80   8  NULL NULL NULL

From: mahender bigdata [mailto:mahender.bigd...@outlook.com]
Sent: Tuesday, May 03, 2016 4:02 PM
To: user@hive.apache.org<mailto:user@hive.apache.org>
Subject: Re: Unsupported SubQuery Expression '1': Only SubQuery expressions 
that are top level conjuncts are allowed


Updated..

select A.Col1,A.Col2B.Col3

From Table1 A
LEFT OUTER JOIN Table2 B
ON  A.Col3= B.Col3
AND NOT EXISTS(SELECT 1 FROM Table2 B WHERE B.Col1= A.Col1 GROUP BY A.Col1 
HAVING COUNT(*)>1 )
 AND (CASE WHEN ISNULL(A.Col2,'\;')  = '\;' THEN 'NOT-NULL' ELSE 'NULL' 
END) = B.Col2)
On 5/2/2016 10:52 PM, Markovitz, Dudu wrote:
Hi

Before dealing the issue itself, can you please fix the query?
There are 3 aliased tables - Table1 (A), Table2 (B)  & Table2 (mb) but you’re 
using additional 2 aliases – ma & adi1.

Thanks

Dudu

select A.Col1,A.Col2B.Col3

From Table1 A

LEFT OUTER JOIN Table2 B
ON  A.Col3= B.Col3
AND NOT EXISTS(SELECT 1 FROM Table2 B WHERE B.Col1= A.Col1 GROUP BY A.Col1 
HAVING COUNT(*)>1 )
 AND (CASE WHEN ISNULL(A.Col2,'\;')  = '\;' THEN 'NOT-NULL' ELSE 'NULL' 
END) = B.Col2)





From: mahender bigdata [mailto:mahender.bigd...@outlook.com]
Sent: Tuesday, May 03, 2016 4:22 AM
To: user@hive.apache.org<mailto:user@hive.apache.org>
Subject: Unsupported SubQuery Expression '1': Only SubQuery expressions that 
are top level conjuncts are allowed


Hi,

Is there a way to implement  not exists in Hive. I'm using Hive 1.2. I'm 
getting below error

"Unsupported SubQuery Expression '1': Only SubQuery expressions that are top 
level conjuncts are allowed"

Query:



select A.Col1,A.Col2B.Col3

From Table1 A

LEFT OUTER JOIN Table2 B
ON  A.Col3= B.Col3
AND NOT EXISTS(SELECT 1 FROM Table2 mb WHERE ma.Col1= adi1.Col1 GROUP BY 
ma.Col1 HAVING COUNT(*)>1 )
 AND (CASE WHEN ISNULL(A.Col2,'\;')  = '\;' THEN 'NOT-NULL' ELSE 'NULL' 
END) = B.Col2)



I Would like to have OR Condition in LEFT Join hive statement. or alternative 
way by splitting.



thanks








RE: multiple selects on a left join give incorrect result

2016-05-03 Thread Markovitz, Dudu
There is no issue on Cloudera VM

Dudu


[cloudera@quickstart ~]$ hadoop version
Hadoop 2.6.0-cdh5.5.0
Subversion http://github.com/cloudera/hadoop -r 
fd21232cef7b8c1f536965897ce20f50b83ee7b2
Compiled by jenkins on 2015-11-09T20:37Z
Compiled with protoc 2.5.0
From source with checksum 98e07176d1787150a6a9c087627562c
This command was run using /usr/jars/hadoop-common-2.6.0-cdh5.5.0.jar

[cloudera@quickstart ~]$ hive --version
Hive 1.1.0-cdh5.5.0
Subversion 
file:///data/jenkins/workspace/generic-package-rhel64-6-0/topdir/BUILD/hive-1.1.0-cdh5.5.0
 -r Unknown
Compiled by jenkins on Mon Nov 9 12:37:34 PST 2015
From source with checksum 8dfc2aac3731e4e5f0e8bd1b442be0e2

From: Frank Luo [mailto:j...@merkleinc.com]
Sent: Wednesday, May 04, 2016 1:58 AM
To: user@hive.apache.org
Cc: Rebecca Yang 
Subject: multiple selects on a left join give incorrect result

All,

I have found that when doing multiple selects on a left join, the “on” clause 
seems to be ignored!!! (It is hard to believe).

Below is a very simple test case and please tell me I am crazy. I am on hdp 
2.3.4.7.


CREATE TABLE T_A ( id STRING, val STRING );
CREATE TABLE T_B ( id STRING, val STRING );
CREATE TABLE join_result_1 ( ida STRING, vala STRING, idb STRING, valb STRING );
CREATE TABLE join_result_2 ( ida STRING, vala STRING, idb STRING, valb STRING );
CREATE TABLE join_result_3 ( ida STRING, vala STRING, idb STRING, valb STRING );

INSERT INTO TABLE T_A
VALUES ('Id_1', 'val_101'), ('Id_2', 'val_102'), ('Id_3', 'val_103');

INSERT INTO TABLE T_B
VALUES ('Id_1', 'val_103'), ('Id_2', 'val_104');

FROM T_A a LEFT JOIN T_B b ON a.id = b.id
INSERT OVERWRITE TABLE join_result_1
   SELECT a.*, b.*
WHERE b.id = 'Id_1' AND b.val = 'val_103'
INSERT OVERWRITE TABLE join_result_2
   SELECT a.*, b.*
WHERE b.val IS NULL OR (b.id = 'Id_3' AND b.val = 'val_101')
INSERT OVERWRITE TABLE join_result_3
   SELECT a.*, b.*
WHERE b.val = 'val_104' AND b.id = 'Id_2' AND a.val <> b.val;


And here is the result:

0: jdbc:hive2 > select * from join_result_1;
+---------------------+---------------------+---------------------+---------------------+
| join_result_1.ida   | join_result_1.vala  | join_result_1.idb   | join_result_1.valb  |
+---------------------+---------------------+---------------------+---------------------+
| Id_1                | val_101             | Id_1                | val_103             |
| Id_2                | val_102             | Id_1                | val_103             |
| Id_3                | val_103             | Id_1                | val_103             |
+---------------------+---------------------+---------------------+---------------------+
3 rows selected (0.057 seconds)




I am expecting join_result_1 to have one row, but got three!!!

Has other people run into the same thing?

Join us at Merkle’s 2016 annual Performance Marketer Executive Summit – June 7 
– 9 in Memphis, TN


Download the latest installment of our annual Marketing Imperatives, “Winning 
with People-Based 
Marketing”

This email and any attachments transmitted with it are intended for use by the 
intended recipient(s) only. If you have received this email in error, please 
notify the sender immediately and then delete it. If you are not the intended 
recipient, you must not keep, use, disclose, copy or distribute this email 
without the author’s prior permission. We take precautions to minimize the risk 
of transmitting software viruses, but we advise you to perform your own virus 
checks on any attachment to this message. We cannot accept liability for any 
loss or damage caused by software viruses. The information contained in this 
communication may be confidential and may be subject to the attorney-client 
privilege.


RE: Query fails if condition placed on Parquet struct field

2016-05-03 Thread Markovitz, Dudu
Hi

Can you send the execution plans of both versions?
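A hedged way to produce them is to prefix each statement with EXPLAIN, for example:

explain select device.user_agent from sometable where ds >= '2016-03-30 00' and ds <= '2016-03-30 01' and device.user_agent like 'Mozilla%' limit 1;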

Thanks

Dudu

From: Jose Rozanec [mailto:jose.roza...@mercadolibre.com]
Sent: Tuesday, May 03, 2016 11:13 PM
To: Haas, Nichole 
Cc: user@hive.apache.org
Subject: Re: Query fails if condition placed on Parquet struct field

Hi!

It is not due to memory allocation. I found that I am able to perform the query OK 
if I rewrite it as:

select a.user_agent from (SELECT device.user_agent as user_agent FROM sometable 
WHERE ds >= '2016-03-30 00' AND ds <= '2016-03-30 01')a where a.user_agent LIKE 
'Mozilla%'  LIMIT 1;

I see the number of mappers and the execution time are almost the same, but this way 
we are able to execute OK and get the results.
Any ideas why may this happen?



2016-05-03 17:02 GMT-03:00 Haas, Nichole 
>:
What are your memory allocations set to?  When using something as expensive as 
LIKE and a date range together, I often have to increase my standard memory 
allocation.

Try changing your memory allocation settings to:
Key: mapreduce.map.memory.mb  Value: 2048  and  Key: mapreduce.map.java.opts  Value: -Xmx1500m

In HUE, this is the settings tab and you enter them manually.  I’m unsure about 
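From the Hive CLI or beeline, a hedged sketch of setting the same values per session 
(assuming the MapReduce engine honors per-query overrides):

set mapreduce.map.memory.mb=2048;
set mapreduce.map.java.opts=-Xmx1500m;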
command line.


From: Jose Rozanec 
>
Reply-To: "user@hive.apache.org" 
>
Date: Tuesday, May 3, 2016 at 12:45 PM
To: "user@hive.apache.org" 
>
Subject: Query fails if condition placed on Parquet struct field

Hello,

We are running queries on Hive against parquet files.
In the schema definition, we have a parquet struct called device with a string 
field user_agent.

If we run query from Example 1, it returns results as expected.
If we run query from Example 2, execution fails and exits with error.

Did anyone face a similar case?

Thanks!

Example 1:
SELECT device.user_agent FROM sometable WHERE ds >= '2016-03-30 00' AND ds <= 
'2016-03-30 01' LIMIT 1;

Example 2:
SELECT device.user_agent FROM sometable WHERE ds >= '2016-03-30 00' AND ds <= 
'2016-03-30 01' AND device.user_agent LIKE 'Mozilla%'  LIMIT 1;


The error and trace we get is:

Exception from container-launch.
FAILED: Execution Error, return code 2 from 
org.apache.hadoop.hive.ql.exec.mr.MapRedTask
Container exited with a non-zero exit code 1

Stack trace: ExitCodeException exitCode=1:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:545)
at org.apache.hadoop.util.Shell.run(Shell.java:456)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722)
at 
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)




This e-mail message is authorized for use by the intended recipient only and 
may contain information that is privileged and confidential. If you received 
this message in error, please call us immediately at (425) 590-5000 and ask to 
speak to the message sender. Please do not copy, disseminate, or retain this 
message unless you are the intended recipient. In addition, to ensure the 
security of your data, please do not send any unencrypted credit card or 
personally identifiable information to this email address. Thank you.



RE: Unsupported SubQuery Expression '1': Only SubQuery expressions that are top level conjuncts are allowed

2016-05-03 Thread Markovitz, Dudu
Forget about the BTW…
Apparently hive behaves like sqlite in that matter and not like other databases

hive> select 1 from table1 having 1=1;
FAILED: SemanticException HAVING specified without GROUP BY

From: Markovitz, Dudu [mailto:dmarkov...@paypal.com]
Sent: Tuesday, May 03, 2016 8:36 PM
To: user@hive.apache.org
Subject: RE: Unsupported SubQuery Expression '1': Only SubQuery expressions 
that are top level conjuncts are allowed

I left out the filter on column Col2 in order to simplify the test case.
The following query is logically equal to your original query.

BTW –
You don’t need the GROUP BY A.Col1 part in your original query

Dudu

create table Table1 (Col1 int,Col3 int);
create table Table2 (Col1 int,Col3 int);

insert into Table1 values (10,1),(20,2),(40,4),(60,7),(80,8);
insert into Table2 values (10,1),(30,2),(20,3),(50,4),(40,5),(40,6),(70,7);


select      *
from        table1 a
left join  (select      col1
            from        table2
            group by    col1
            having      count(*) > 1
           ) b2
        on  b2.col1 = a.col1
left join   table2 b
        on  a.col3  = b.col3
        and b2.col1 is null
;

10   1  NULL 10   1
20   2  NULL 30   2
40   4  40   NULL NULL
60   7  NULL 70   7
80   8  NULL NULL NULL

From: mahender bigdata [mailto:mahender.bigd...@outlook.com]
Sent: Tuesday, May 03, 2016 4:02 PM
To: user@hive.apache.org<mailto:user@hive.apache.org>
Subject: Re: Unsupported SubQuery Expression '1': Only SubQuery expressions 
that are top level conjuncts are allowed


Updated..

select A.Col1,A.Col2B.Col3

From Table1 A
LEFT OUTER JOIN Table2 B
ON  A.Col3= B.Col3
AND NOT EXISTS(SELECT 1 FROM Table2 B WHERE B.Col1= A.Col1 GROUP BY A.Col1 
HAVING COUNT(*)>1 )
 AND (CASE WHEN ISNULL(A.Col2,'\;')  = '\;' THEN 'NOT-NULL' ELSE 'NULL' 
END) = B.Col2)
On 5/2/2016 10:52 PM, Markovitz, Dudu wrote:
Hi

Before dealing the issue itself, can you please fix the query?
There are 3 aliased tables - Table1 (A), Table2 (B)  & Table2 (mb) but you’re 
using additional 2 aliases – ma & adi1.

Thanks

Dudu

select A.Col1,A.Col2B.Col3

From Table1 A

LEFT OUTER JOIN Table2 B
ON  A.Col3= B.Col3
AND NOT EXISTS(SELECT 1 FROM Table2 B WHERE B.Col1= A.Col1 GROUP BY A.Col1 
HAVING COUNT(*)>1 )
 AND (CASE WHEN ISNULL(A.Col2,'\;')  = '\;' THEN 'NOT-NULL' ELSE 'NULL' 
END) = B.Col2)





From: mahender bigdata [mailto:mahender.bigd...@outlook.com]
Sent: Tuesday, May 03, 2016 4:22 AM
To: user@hive.apache.org<mailto:user@hive.apache.org>
Subject: Unsupported SubQuery Expression '1': Only SubQuery expressions that 
are top level conjuncts are allowed


Hi,

Is there a way to implement  not exists in Hive. I'm using Hive 1.2. I'm 
getting below error

"Unsupported SubQuery Expression '1': Only SubQuery expressions that are top 
level conjuncts are allowed"

Query:



select A.Col1,A.Col2B.Col3

From Table1 A

LEFT OUTER JOIN Table2 B
ON  A.Col3= B.Col3
AND NOT EXISTS(SELECT 1 FROM Table2 mb WHERE ma.Col1= adi1.Col1 GROUP BY 
ma.Col1 HAVING COUNT(*)>1 )
 AND (CASE WHEN ISNULL(A.Col2,'\;')  = '\;' THEN 'NOT-NULL' ELSE 'NULL' 
END) = B.Col2)



I Would like to have OR Condition in LEFT Join hive statement. or alternative 
way by splitting.



thanks







RE: Unsupported SubQuery Expression '1': Only SubQuery expressions that are top level conjuncts are allowed

2016-05-03 Thread Markovitz, Dudu
I left out the filter on column Col2 in order to simplify the test case.
The following query is logically equal to your original query.

BTW –
You don’t need the GROUP BY A.Col1 part in your original query

Dudu

create table Table1 (Col1 int,Col3 int);
create table Table2 (Col1 int,Col3 int);

insert into Table1 values (10,1),(20,2),(40,4),(60,7),(80,8);
insert into Table2 values (10,1),(30,2),(20,3),(50,4),(40,5),(40,6),(70,7);


select      *
from        table1 a
left join  (select      col1
            from        table2
            group by    col1
            having      count(*) > 1
           ) b2
        on  b2.col1 = a.col1
left join   table2 b
        on  a.col3  = b.col3
        and b2.col1 is null
;

10   1  NULL 10   1
20   2  NULL 30   2
40   4  40   NULL NULL
60   7  NULL 70   7
80   8  NULL NULL NULL

From: mahender bigdata [mailto:mahender.bigd...@outlook.com]
Sent: Tuesday, May 03, 2016 4:02 PM
To: user@hive.apache.org
Subject: Re: Unsupported SubQuery Expression '1': Only SubQuery expressions 
that are top level conjuncts are allowed


Updated..

select A.Col1,A.Col2B.Col3

From Table1 A
LEFT OUTER JOIN Table2 B
ON  A.Col3= B.Col3
AND NOT EXISTS(SELECT 1 FROM Table2 B WHERE B.Col1= A.Col1 GROUP BY A.Col1 
HAVING COUNT(*)>1 )
 AND (CASE WHEN ISNULL(A.Col2,'\;')  = '\;' THEN 'NOT-NULL' ELSE 'NULL' 
END) = B.Col2)

On 5/2/2016 10:52 PM, Markovitz, Dudu wrote:
Hi

Before dealing the issue itself, can you please fix the query?
There are 3 aliased tables - Table1 (A), Table2 (B)  & Table2 (mb) but you’re 
using additional 2 aliases – ma & adi1.

Thanks

Dudu

select A.Col1,A.Col2B.Col3

From Table1 A

LEFT OUTER JOIN Table2 B
ON  A.Col3= B.Col3
AND NOT EXISTS(SELECT 1 FROM Table2 B WHERE B.Col1= A.Col1 GROUP BY A.Col1 
HAVING COUNT(*)>1 )
 AND (CASE WHEN ISNULL(A.Col2,'\;')  = '\;' THEN 'NOT-NULL' ELSE 'NULL' 
END) = B.Col2)





From: mahender bigdata [mailto:mahender.bigd...@outlook.com]
Sent: Tuesday, May 03, 2016 4:22 AM
To: user@hive.apache.org<mailto:user@hive.apache.org>
Subject: Unsupported SubQuery Expression '1': Only SubQuery expressions that 
are top level conjuncts are allowed


Hi,

Is there a way to implement  not exists in Hive. I'm using Hive 1.2. I'm 
getting below error

"Unsupported SubQuery Expression '1': Only SubQuery expressions that are top 
level conjuncts are allowed"

Query:



select A.Col1,A.Col2B.Col3

From Table1 A

LEFT OUTER JOIN Table2 B
ON  A.Col3= B.Col3
AND NOT EXISTS(SELECT 1 FROM Table2 mb WHERE ma.Col1= adi1.Col1 GROUP BY 
ma.Col1 HAVING COUNT(*)>1 )
 AND (CASE WHEN ISNULL(A.Col2,'\;')  = '\;' THEN 'NOT-NULL' ELSE 'NULL' 
END) = B.Col2)



I Would like to have OR Condition in LEFT Join hive statement. or alternative 
way by splitting.



thanks







RE: Unsupported SubQuery Expression '1': Only SubQuery expressions that are top level conjuncts are allowed

2016-05-02 Thread Markovitz, Dudu
Hi

Before dealing with the issue itself, can you please fix the query?
There are 3 aliased tables - Table1 (A), Table2 (B)  & Table2 (mb) but you’re 
using additional 2 aliases – ma & adi1.

Thanks

Dudu

select A.Col1,A.Col2B.Col3

From Table1 A

LEFT OUTER JOIN Table2 B
ON  A.Col3= B.Col3
AND NOT EXISTS(SELECT 1 FROM Table2 mb WHERE ma.Col1= adi1.Col1 GROUP BY 
ma.Col1 HAVING COUNT(*)>1 )
 AND (CASE WHEN ISNULL(A.Col2,'\;')  = '\;' THEN 'NOT-NULL' ELSE 'NULL' 
END) = B.Col2)





From: mahender bigdata [mailto:mahender.bigd...@outlook.com]
Sent: Tuesday, May 03, 2016 4:22 AM
To: user@hive.apache.org
Subject: Unsupported SubQuery Expression '1': Only SubQuery expressions that 
are top level conjuncts are allowed


Hi,

Is there a way to implement  not exists in Hive. I'm using Hive 1.2. I'm 
getting below error

"Unsupported SubQuery Expression '1': Only SubQuery expressions that are top 
level conjuncts are allowed"

Query:



select A.Col1,A.Col2B.Col3

From Table1 A

LEFT OUTER JOIN Table2 B
ON  A.Col3= B.Col3
AND NOT EXISTS(SELECT 1 FROM Table2 mb WHERE ma.Col1= adi1.Col1 GROUP BY 
ma.Col1 HAVING COUNT(*)>1 )
 AND (CASE WHEN ISNULL(A.Col2,'\;')  = '\;' THEN 'NOT-NULL' ELSE 'NULL' 
END) = B.Col2)



I Would like to have OR Condition in LEFT Join hive statement. or alternative 
way by splitting.



thanks






RE: Hive query to split one row into many rows such that Row 1 will have col 1 Name, col 1 Value and Row 2 will have col 2 Name and col 2 value

2016-04-26 Thread Markovitz, Dudu
You are welcome ☺
I’ve tried to guess the requested result for your last question.
It can be very helpful if you can create a small example containing your 
original data and the requested result.

Dudu


Given the following table, ‘t’:

i    c1   c2   c3
1    1    12   15
2    1    13   11
3    3    11   13
4    1    12   13
5    3    12   15
6    1    14   13
7    1    11   13
8    2    15   11
9    1    14   11
10   3    11   13


collect_list contains all values.
collect_set remove duplicates.

Option 1
Collect c2 and c3 together

select c1,collect_list (value),collect_set (value) from t lateral view explode 
(map('c2',c2,'c3',c3)) t group by c1;


1   [12,15,13,11,12,13,14,13,11,13,14,11] [12,15,13,11,14]

2   [15,11] [15,11]

3   [11,13,12,15,11,13] [11,13,12,15]

Option 2
Collect c2 and c3 separately

select c1,collect_list (c2),collect_set (c2),collect_list (c3),collect_set (c3) 
from t group by c1;

1 [12,13,12,14,11,14] [12,13,14,11] [15,11,13,13,13,11] [15,11,13]
2 [15]  [15]  [11]  [11]
3 [11,12,11]  [11,12] [13,15,13]  [13,15]

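If a single delimited string per key is wanted (as in the 'appended' result sketched in the 
follow-up question), a hedged option is to combine concat_ws with collect_list; the 
delimiter and layout here are assumptions:

select    key
         ,concat_ws(' | ', collect_list(concat_ws(' | ', value, value2))) as appended
from      t
group by  key;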

From: Deepak Khandelwal [mailto:dkhandelwal@gmail.com]
Sent: Tuesday, April 26, 2016 8:35 PM
To: user@hive.apache.org
Subject: Re: Hive query to split one row into many rows such that Row 1 will 
have col 1 Name, col 1 Value and Row 2 will have col 2 Name and col 2 value

Thanks a lot Dudu.

Could you also tell me how I can use concat with a group by clause in Hive? I have 
n rows with col1, col2, col3 and I want a result grouped by col1, concatenating all 
values of col2 and col3.

Id,key,value, value2
__
1,fname,Dudu, m1
1,lname,Markowitz, m2
2,fname, Andrew, m3
2,lname, Sears,m4
And I need result like below

Id, appended (group by key)
__
Fname ,  Dudu | m1 | Andrew | m3
Lname,   Markowitz| m2 | Sears | m4
Thanks a lot for your help.


On Saturday, April 23, 2016, Markovitz, Dudu 
<dmarkov...@paypal.com<mailto:dmarkov...@paypal.com>> wrote:
Another example (with first name and last name), same principal

Dudu


Given the following table:

id, first_name,last_name
__
1,Dudu,Markovitz
2,Andrew,Sears

select id,key,value from my_table lateral view explode 
(map('fname',first_name,'lname',last_name)) t;

The result will look like:

Id,key,value
__
1,fname,Dudu
1,lname,Markovitz
2,fname, Andrew
2,lname, Sears


From: Deepak Khandelwal 
[mailto:dkhandelwal@gmail.com<javascript:_e(%7B%7D,'cvml','dkhandelwal@gmail.com');>]
Sent: Saturday, April 23, 2016 9:04 AM
To: user@hive.apache.org<javascript:_e(%7B%7D,'cvml','user@hive.apache.org');>
Subject: Hive query to split one row into many rows such that Row 1 will have 
col 1 Name, col 1 Value and Row 2 will have col 2 Name and col 2 value

Hi All,

I am new to Hive and I am trying to create a query for below aituation. Would 
appreciate if someone could guide on same. Thans a lot in advance.

I have two TABLES shown below

TABLE1 (USER_dETAILS)
**USER_ID**  |  **USER_NAME**  |   **USER_ADDRESS**
 +--+
1  USER1   ADDRESS111
2  USER2 ADDRESS222

TABLE2 (USER_PARAMETERS)
**USER_ID**  |  **PARAM_NAME**  |   **PARAM_VALUE**
 +--+--
1   USER_NAMEUSER1
1   USER_ADDRESS  ADDRESS111
2   USER_NAMEUSER2
2USER_ADDRESS  ADDRESS222

I need to insert data in table2(USER_PARAMETERS) FROM table1(USER_DETAILS) in 
the format shown above. I can do this using UNION ALL but I want to avoid it as 
there are like 10 such columns that i need to split like above.

Can someone suggest a efficient hive query so that i can achieve the results 
shown in table 2 from data in table 1 (Hive query to split one row of data into 
multiple rows like such that Row 1 will have column1 Name, column1 Value and 
Row 2 will have column 2 Name and column 2 value...).

Thanks a lot
Deepak



RE: Hive query to split one row into many rows such that Row 1 will have col 1 Name, col 1 Value and Row 2 will have col 2 Name and col 2 value

2016-04-23 Thread Markovitz, Dudu
Hi Mich, it seems the request was for unpivot.

Dudu

From: Mich Talebzadeh [mailto:mich.talebza...@gmail.com]
Sent: Saturday, April 23, 2016 10:04 AM
To: user 
Subject: Re: Hive query to split one row into many rows such that Row 1 will 
have col 1 Name, col 1 Value and Row 2 will have col 2 Name and col 2 value

try this

-- populate table user_parameters with user_id values (unique)from user_details
INSERT user_parameters
SELECT user_id, null, null FROM user_details

-- Update remaining columns
UPDATE user_parameters
SET
param_name = t1.user_name
param_value = t1.user_address
FROM
user_parameters t2 JOIN user_details t1
ON t2.user_id = t1.user_id;




Dr Mich Talebzadeh



LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 23 April 2016 at 07:04, Deepak Khandelwal 
> wrote:
Hi All,

I am new to Hive and I am trying to create a query for below aituation. Would 
appreciate if someone could guide on same. Thans a lot in advance.

I have two TABLES shown below

TABLE1 (USER_dETAILS)
**USER_ID**  |  **USER_NAME**  |   **USER_ADDRESS**
 +--+
1  USER1   ADDRESS111
2  USER2 ADDRESS222

TABLE2 (USER_PARAMETERS)
**USER_ID**  |  **PARAM_NAME**  |   **PARAM_VALUE**
 +--+--
1   USER_NAMEUSER1
1   USER_ADDRESS  ADDRESS111
2   USER_NAMEUSER2
2USER_ADDRESS  ADDRESS222

I need to insert data in table2(USER_PARAMETERS) FROM table1(USER_DETAILS) in 
the format shown above. I can do this using UNION ALL but I want to avoid it as 
there are like 10 such columns that i need to split like above.

Can someone suggest a efficient hive query so that i can achieve the results 
shown in table 2 from data in table 1 (Hive query to split one row of data into 
multiple rows like such that Row 1 will have column1 Name, column1 Value and 
Row 2 will have column 2 Name and column 2 value...).

Thanks a lot
Deepak




RE: Hive query to split one row into many rows such that Row 1 will have col 1 Name, col 1 Value and Row 2 will have col 2 Name and col 2 value

2016-04-23 Thread Markovitz, Dudu
Another example (with first name and last name), same principal

Dudu


Given the following table:

id, first_name,last_name
__
1,Dudu,Markovitz
2,Andrew,Sears

select id,key,value from my_table lateral view explode 
(map('fname',first_name,'lname',last_name)) t;

The result will look like:

Id,key,value
__
1,fname,Dudu
1,lname,Markovitz
2,fname, Andrew
2,lname, Sears

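A hedged sketch applying the same principle to the tables from the question (assuming an 
INSERT ... SELECT is acceptable in your Hive version; the exploded column aliases are 
illustrative):

insert into table user_parameters
select  user_id, param_name, param_value
from    user_details
lateral view explode (map('USER_NAME', user_name, 'USER_ADDRESS', user_address)) t as param_name, param_value;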

From: Deepak Khandelwal [mailto:dkhandelwal@gmail.com]
Sent: Saturday, April 23, 2016 9:04 AM
To: user@hive.apache.org
Subject: Hive query to split one row into many rows such that Row 1 will have 
col 1 Name, col 1 Value and Row 2 will have col 2 Name and col 2 value

Hi All,

I am new to Hive and I am trying to create a query for below aituation. Would 
appreciate if someone could guide on same. Thans a lot in advance.

I have two TABLES shown below

TABLE1 (USER_dETAILS)
**USER_ID**  |  **USER_NAME**  |   **USER_ADDRESS**
 +--+
1  USER1   ADDRESS111
2  USER2 ADDRESS222

TABLE2 (USER_PARAMETERS)
**USER_ID**  |  **PARAM_NAME**  |   **PARAM_VALUE**
 +--+--
1   USER_NAMEUSER1
1   USER_ADDRESS  ADDRESS111
2   USER_NAMEUSER2
2USER_ADDRESS  ADDRESS222

I need to insert data in table2(USER_PARAMETERS) FROM table1(USER_DETAILS) in 
the format shown above. I can do this using UNION ALL but I want to avoid it as 
there are like 10 such columns that i need to split like above.

Can someone suggest a efficient hive query so that i can achieve the results 
shown in table 2 from data in table 1 (Hive query to split one row of data into 
multiple rows like such that Row 1 will have column1 Name, column1 Value and 
Row 2 will have column 2 Name and column 2 value...).

Thanks a lot
Deepak



RE: Question on Implementing CASE in Hive Join

2016-04-20 Thread Markovitz, Dudu
The second version works as expected (after fixing a typo in the word 
‘indicator’).
If you don’t get any results you should check your data (maybe the fields 
contain trailing spaces or control characters, etc.).

If you’re willing to replace the ‘OUTER’ with ‘INNER’, there’s another option -

select      *
from        b
cross join  a
where       a.type      =    b.type
and         a.code      like case b.code      when 'ALL' then '%' else b.code      end
and         a.indicator like case b.indicator when 'ALL' then '%' else b.indicator end
;

Dudu


From: Kishore A [mailto:kishore.atmak...@gmail.com]
Sent: Wednesday, April 20, 2016 5:04 PM
To: user@hive.apache.org
Subject: Re: Question on Implementing CASE in Hive Join

Hi Dudu,

Thank you for sending queries around this.

I have run these queries and below are the observations

1. It did return the same error as before" SemanticException [Error 10017]: 
Line 4:4 Both left and right aliases encountered in JOIN 'code'"

2. Query execution is successful but not retrieving any results out of it.

I am clueless and not able to proceed to next step until this is resolved. Do 
you have any other suggestions please?

Kishore

On Tue, Apr 19, 2016 at 6:08 AM, Markovitz, Dudu 
<dmarkov...@paypal.com<mailto:dmarkov...@paypal.com>> wrote:
Please try the following two options.
Option 2 might be better performance-wise (depending on the data volume and 
characteristics).

P.s.
I didn’t understand the explanation about the LEFT JOIN


Dudu

1.

select      b.code
           ,b.value
from        b
left join   a
        on  a.type      =    b.type
        and a.code      like case b.code       when 'ALL' then '%' else b.code       end
        and a.indicator like case b.indicatior when 'ALL' then '%' else b.indicatior end
;



2.

select      b.code
           ,b.value
from        b
left join   a
        on  a.type      = b.type
        and a.code      = b.code
        and a.indicator = b.indicatior
where       b.code       != 'ALL'
and         b.indicatior != 'ALL'

union all

select      b.code
           ,b.value
from        b
left join   a
        on  a.type      = b.type
        and a.indicator = b.indicatior
where       b.code        = 'ALL'
and         b.indicatior != 'ALL'

union all

select      b.code
           ,b.value
from        b
left join   a
        on  a.type      = b.type
        and a.code      = b.code
where       b.code       != 'ALL'
and         b.indicatior  = 'ALL'

union all

select      b.code
           ,b.value
from        b
left join   a
        on  a.type      = b.type
where       b.code       = 'ALL'
and         b.indicatior = 'ALL'
;


From: Kishore A 
[mailto:kishore.atmak...@gmail.com<mailto:kishore.atmak...@gmail.com>]
Sent: Tuesday, April 19, 2016 3:51 PM
To: user@hive.apache.org<mailto:user@hive.apache.org>
Subject: Re: Question on Implementing CASE in Hive Join

Hi Dudu,

Actually we use both fields from left and right tables, I mentioned right table 
just for my convenience to check whether ALL from right table can be pulled as 
per join condition match.

One more reason why we use left join is we should not have extra columns after 
join.

Kishore



On Tue, Apr 19, 2016 at 5:46 AM, Markovitz, Dudu 
<dmarkov...@paypal.com<mailto:dmarkov...@paypal.com>> wrote:
Before dealing with the technical aspect, can you please explain what is the 
point of using LEFT JOIN without selecting any field from table A?

Thanks

Dudu

From: Kishore A 
[mailto:kishore.atmak...@gmail.com<mailto:kishore.atmak...@gmail.com>]
Sent: Tuesday, April 19, 2016 2:29 PM
To: user@hive.apache.org<mailto:user@hive.apache.org>
Subject: Question on Implementing CASE in Hive Join

Hi,

I have a scenario to implement cases in Hive joins. I need to implement a case 
on the value on which the join condition is to be applied.

Table A
Code// Type// Indicator// Value//
A  1  XYZ John
B  1  PQR Smith
C  2  XYZ John
C  2  PQR Smith
D  3  PQR Smith
E  3  XYZ Smith
F  4  MNO Smith
G  3  MNO Smith
D  1  XYZ John
N  3  STR Smith


Table B
Code// Type// Indicator// Value//
ALL1  XYZ John
D3  ALL Smith
ALL1  PQR Smith

I need to stamp Value from TableB by joining TableA and I am writing join 
condition as below.
Note : No instance of ALL for Type column, a value for Type will be provided.

Select b.Code,b.Value from B
LEFT JOIN A a ON
a.Code = (case when b.Code = 'ALL' then a.Code else b.Code END)
AND
a.Type = b.Type
AND
a.Ind

RE: Question on Implementing CASE in Hive Join

2016-04-19 Thread Markovitz, Dudu
Please try the following two options.
Option 2 might be better performance-wise (depending on the data volume and 
characteristics).

P.s.
I didn’t understand the explanation about the LEFT JOIN


Dudu

1.

select      b.code
           ,b.value
from        b
left join   a
        on  a.type      =    b.type
        and a.code      like case b.code       when 'ALL' then '%' else b.code       end
        and a.indicator like case b.indicatior when 'ALL' then '%' else b.indicatior end
;



2.

select      b.code
           ,b.value
from        b
left join   a
        on  a.type      = b.type
        and a.code      = b.code
        and a.indicator = b.indicatior
where       b.code       != 'ALL'
and         b.indicatior != 'ALL'

union all

select      b.code
           ,b.value
from        b
left join   a
        on  a.type      = b.type
        and a.indicator = b.indicatior
where       b.code        = 'ALL'
and         b.indicatior != 'ALL'

union all

select      b.code
           ,b.value
from        b
left join   a
        on  a.type      = b.type
        and a.code      = b.code
where       b.code       != 'ALL'
and         b.indicatior  = 'ALL'

union all

select      b.code
           ,b.value
from        b
left join   a
        on  a.type      = b.type
where       b.code       = 'ALL'
and         b.indicatior = 'ALL'
;


From: Kishore A [mailto:kishore.atmak...@gmail.com]
Sent: Tuesday, April 19, 2016 3:51 PM
To: user@hive.apache.org
Subject: Re: Question on Implementing CASE in Hive Join

Hi Dudu,

Actually we use both fields from left and right tables, I mentioned right table 
just for my convenience to check whether ALL from right table can be pulled as 
per join condition match.

One more reason why we use left join is we should not have extra columns after 
join.

Kishore



On Tue, Apr 19, 2016 at 5:46 AM, Markovitz, Dudu 
<dmarkov...@paypal.com<mailto:dmarkov...@paypal.com>> wrote:
Before dealing with the technical aspect, can you please explain what is the 
point of using LEFT JOIN without selecting any field from table A?

Thanks

Dudu

From: Kishore A 
[mailto:kishore.atmak...@gmail.com<mailto:kishore.atmak...@gmail.com>]
Sent: Tuesday, April 19, 2016 2:29 PM
To: user@hive.apache.org<mailto:user@hive.apache.org>
Subject: Question on Implementing CASE in Hive Join

Hi,

I have a scenario to implement cases in Hive joins. I need to implement a case 
on the value on which the join condition is to be applied.

Table A
Code// Type// Indicator// Value//
A  1  XYZ John
B  1  PQR Smith
C  2  XYZ John
C  2  PQR Smith
D  3  PQR Smith
E  3  XYZ Smith
F  4  MNO Smith
G  3  MNO Smith
D  1  XYZ John
N  3  STR Smith


Table B
Code// Type// Indicator// Value//
ALL1  XYZ John
D3  ALL Smith
ALL1  PQR Smith

I need to stamp Value from TableB by joining TableA and I am writing join 
condition as below.
Note : No instance of ALL for Type column, a value for Type will be provided.

Select b.Code,b.Value from B
LEFT JOIN A a ON
a.Code = (case when b.Code = 'ALL' then a.Code else b.Code END)
AND
a.Type = b.Type
AND
a.Indicator = (case when b.Indicatior = 'ALL' then a.Inidicator else 
b.Inidicator END)

When I run this in hive this query is failing with below error
Error while compiling statement: FAILED: SemanticException [Error 10017]: Line 
4:0 Both left and right aliases encountered in JOIN 'Code'.


Please let me know if more details are needed

Thanks,
Kishore



