Re: Fw: Hive update operation

2016-08-25 Thread Mich Talebzadeh
OK, this is what you have in MSSQL (COLLATE does not come into it in Hive).

UPDATE table1
SET

address=regexp_replace(t2.cout_event_description,,)
, latitude=t2.latitude
, longitude=t2.longitude
, speed =t2.speed
, dtimestamp =mv.dtimestamp
, reg_no=t2.registration
, gpsstate = t2.bgps
FROM
  default.maxvalues  mv
, table2 t2
INNER JOIN table2 t2 on  mv.dtimestamp=t2.dtimestamp AND
mv.acqnum=t2.acqnum INNER JOIN table1 t1 on mv.acqnum=t1.deal_number
where t1.deal_number=mv.acqnum;

Simplify this in Hive and test

CREATE TEMPORARY TABLE tmp1
AS
SELECT

Dr Mich Talebzadeh

LinkedIn
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.
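
A hedged sketch of that temp-table route (Hive's UPDATE cannot take a FROM
clause or reference another table, so the rewrite goes through INSERT
OVERWRITE; the column list and join keys for table1 are assumed from this
thread, and older releases reject INSERT OVERWRITE on transactional tables,
in which case write to a staging copy instead):

CREATE TEMPORARY TABLE tmp1 AS
SELECT t2.acqnum,
       t2.cout_event_description,
       t2.latitude,
       t2.longitude,
       t2.speed,
       t2.registration,
       t2.bgps,
       mv.dtimestamp
FROM default.maxvalues mv
INNER JOIN table2 t2
  ON mv.dtimestamp = t2.dtimestamp
 AND mv.acqnum     = t2.acqnum;

INSERT OVERWRITE TABLE table1
SELECT t1.deal_number,
       COALESCE(tmp1.cout_event_description, t1.address)   AS address,
       COALESCE(tmp1.latitude,   t1.latitude)              AS latitude,
       COALESCE(tmp1.longitude,  t1.longitude)             AS longitude,
       COALESCE(tmp1.speed,      t1.speed)                 AS speed,
       COALESCE(tmp1.dtimestamp, t1.dtimestamp)            AS dtimestamp,
       COALESCE(tmp1.registration, t1.reg_no)              AS reg_no,
       COALESCE(tmp1.bgps,       t1.gpsstate)              AS gpsstate
FROM table1 t1
LEFT JOIN tmp1
  ON tmp1.acqnum = t1.deal_number;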



On 26 August 2016 at 06:38, Priyanka Raghuvanshi 
wrote:

> Current RDBMS:  SQL Server 2012
>
>
> Yes, I tried the one below.
>
>
> UPDATE table1 set address=regexp_replace(t2.cout_event_description,,),
> latitude=t2.latitude, longitude=t2.longitude, speed=t2.speed,
> dtimestamp=mv.dtimestamp, reg_no=t2.registration, gpsstate=t2.bgps
> FROM default.maxvalues mv, table2 t2
> INNER JOIN table2 t2 on mv.dtimestamp=t2.dtimestamp AND mv.acqnum=t2.acqnum
> INNER JOIN table1 t1 on mv.acqnum=t1.deal_number
> where t1.deal_number=mv.acqnum;
>
> OUTPUT:
>
> " FAILED: ParseException line 1:221 missing EOF at 'FROM' near 'bgps' "
>
>
> --
> *From:* Mich Talebzadeh 
> *Sent:* 25 August 2016 21:41:51
>
> *To:* user
> *Subject:* Re: Fw: Hive update operation
>
> Hi,
>
> What is your current RDBMS, and are these the SQL statements used in that RDBMS?
>
> Have you tried them on Hive?
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 25 August 2016 at 06:56, Priyanka Raghuvanshi 
> wrote:
>
>> Hi  Dr Mich,
>>
>>
>> Thank you for replying.
>>
>>
>> Yes, while creating the table the transactional property was set to
>> true; the same applies to the other tables.
>>
>>
>> Following are SQL update query examples; I want to achieve the same
>> through HQL:
>>
>>
>> 1)
>>
>> UPDATE table1
>> SET FAging=t2.FAging,
>> PaymentR=t2.PaymentR,
>> ArrearsO=t2.ArrearsO ,
>> IRemaining=t2.IRemaining,
>> Taxi_Association=t2.TaxiAssociation
>> From table2 t2
>> Left JOIN table1  t1
>> ON t2.AccNum COLLATE DATABASE_DEFAULT= t1.AccNo COLLATE DATABASE_DEFAULT
>>
>> 2)
>>
>> UPDATE table1
>> SET Img_String=CASE WHEN convert(nvarchar,T1.dTimeStamp,103)=Convert(nvarchar,getdate(),103) AND T1.Speed>0 then
>> isnull(T2.clmn1,'Other') +';Moving;'+ISNULL(T2.PinsStatus,'Performing')+';'+CASE WHEN ISNULL(T2.SupplierName,'New') LIKE '%Repo%' THEN 'Refurbished' ELSE 'New' END
>> ELSE
>> isnull(T2.clmn1,'Other') +';Non Moving;'+ISNULL(VEH.PinsStatus,'Performing')+';'+CASE WHEN ISNULL(T2.SupplierName,'New') LIKE '%Repo%' THEN 'Refurbished' ELSE 'New' END
>> END,
>> Moving_or_NonMoving =CASE WHEN convert(nvarchar,T1.dTimeStamp,103)=Convert(nvarchar,getdate(),103) AND T1.Speed>0 then
>> 'Moving'
>> ELSE
>> 'Non Moving'
>> END,
>> InfoWindow= CAST(T1.dTimeStamp as NVARCHAR(20)) + '  ' + [Address] +' @ ' + cast(T1.Speed as nvarchar) + ' KMPH ['+ isnull(T1.AccountNo,'') +'];'+ isnull(T2.ChassisNumber,'') +';'+ isnull(replace(T2.OwnerName,,'') ,'') +';'+ isnull(T2.clmn1,''),
>> Status=T2.PinsStatus
>>
>> FROM table2 T2
>> INNER JOIN table1 T1 on T2.AcqNum=T1.AccoNo
>>
>> Priyanka Raghuvanshi
>> --
>> *From:* Mich Talebzadeh 
>> *Sent:* 24 August 2016 22:02:58
>> *To:* user
>> *Subject:* Re: Fw: Hive update operation
>>
>> Has the underlying table to be updated been defined as transactional?
>>
>> Can you give the update example?
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
>> loss, damage or destruction of data or any other property which may arise
>> from relying on this email's technical content is explicitly disclaimed.
>> The author will in no case be liable for any monetary damages arising from
>> such loss, damage or destruction.

Re: Fw: Hive update operation

2016-08-25 Thread Priyanka Raghuvanshi
Current RDBMS:  SQL Server 2012


Yes, I tried the one below.


UPDATE table1 set address=regexp_replace(t2.cout_event_description,,),
latitude=t2.latitude, longitude=t2.longitude, speed=t2.speed,
dtimestamp=mv.dtimestamp, reg_no=t2.registration, gpsstate=t2.bgps
FROM default.maxvalues mv, table2 t2
INNER JOIN table2 t2 on mv.dtimestamp=t2.dtimestamp AND mv.acqnum=t2.acqnum
INNER JOIN table1 t1 on mv.acqnum=t1.deal_number
where t1.deal_number=mv.acqnum;


OUTPUT:

" FAILED: ParseException line 1:221 missing EOF at 'FROM' near 'bgps' "



From: Mich Talebzadeh 
Sent: 25 August 2016 21:41:51
To: user
Subject: Re: Fw: Hive update operation

Hi,

What is your current RDBMS, and are these the SQL statements used in that RDBMS?

Have you tried them on Hive?

HTH


Dr Mich Talebzadeh



LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction.



On 25 August 2016 at 06:56, Priyanka Raghuvanshi <priyan...@winjit.com> wrote:

Hi  Dr Mich,


Thank you for replying.


Yes, while creating the table the transactional property was set to true;
the same applies to the other tables.


Following are SQL update query examples; I want to achieve the same through HQL:


1)

UPDATE table1
SET FAging=t2.FAging,
PaymentR=t2.PaymentR,
ArrearsO=t2.ArrearsO ,
IRemaining=t2.IRemaining,
Taxi_Association=t2.TaxiAssociation
From table2 t2
Left JOIN table1  t1
ON t2.AccNum COLLATE DATABASE_DEFAULT= t1.AccNo COLLATE DATABASE_DEFAULT


2)

UPDATE table1
SET Img_String=CASE WHEN convert(nvarchar,T1.dTimeStamp,103)=Convert(nvarchar,getdate(),103) AND T1.Speed>0 then
isnull(T2.clmn1,'Other') +';Moving;'+ISNULL(T2.PinsStatus,'Performing')+';'+CASE WHEN ISNULL(T2.SupplierName,'New') LIKE '%Repo%' THEN 'Refurbished' ELSE 'New' END
ELSE
isnull(T2.clmn1,'Other') +';Non Moving;'+ISNULL(VEH.PinsStatus,'Performing')+';'+CASE WHEN ISNULL(T2.SupplierName,'New') LIKE '%Repo%' THEN 'Refurbished' ELSE 'New' END
END,
Moving_or_NonMoving =CASE WHEN convert(nvarchar,T1.dTimeStamp,103)=Convert(nvarchar,getdate(),103) AND T1.Speed>0 then
'Moving'
ELSE
'Non Moving'
END,
InfoWindow= CAST(T1.dTimeStamp as NVARCHAR(20)) + '  ' + [Address] +' @ ' + cast(T1.Speed as nvarchar) + ' KMPH ['+ isnull(T1.AccountNo,'') +'];'+ isnull(T2.ChassisNumber,'') +';'+ isnull(replace(T2.OwnerName,,'') ,'') +';'+ isnull(T2.clmn1,''),
Status=T2.PinsStatus

FROM table2 T2
INNER JOIN table1 T1 on T2.AcqNum=T1.AccoNo


Priyanka Raghuvanshi


From: Mich Talebzadeh <mich.talebza...@gmail.com>
Sent: 24 August 2016 22:02:58
To: user
Subject: Re: Fw: Hive update operation

Has the underlying table to be updated been defined as transactional?

Can you give the update example?


Dr Mich Talebzadeh



LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction.



On 24 August 2016 at 14:50, Priyanka Raghuvanshi <priyan...@winjit.com> wrote:




From: Priyanka Raghuvanshi
Sent: 24 August 2016 19:12
To: user@hive.apache.org
Subject: Hive update operation


Hi,


Is it possible to update a table (ORC) using a join or a column of another
table (ORC)?


Kind Regards,

Priyanka Raghuvanshi




Re: hive 2.1.0 and "NOT IN ( list )" and column is a partition_key

2016-08-25 Thread Gopal Vijayaraghavan

>   not array_contains(array('foo'), partition_key)

And this is why that works.

https://issues.apache.org/jira/browse/HIVE-13951 :(


Cheers,
Gopal

Re: hive 2.1.0 and "NOT IN ( list )" and column is a partition_key

2016-08-25 Thread Stephen Sprague
awesome!  i'm not crazy after all!

so the workaround we devised here for this pattern:

   partition_key not in ('foo')

is:

   not array_contains(array('foo'), partition_key)

thanks,
Stephen.
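
Applied to the query from the original post (table and column names taken
from that post; an untested sketch), the rewrite looks like:

EXPLAIN
SELECT count(*)
FROM bi.fact_email_funnel
WHERE event_date_key = 20160824
  AND NOT array_contains(array('foo'), etl_source_database);

Presumably this works because array_contains is evaluated as an ordinary
per-row function, so it never reaches the partition-expression path that
HIVE-13951 describes.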

On Thu, Aug 25, 2016 at 6:31 PM, Sergey Shelukhin 
wrote:

> I can repro this on master. I’ll file a bug...
>
> From: Stephen Sprague 
> Reply-To: "user@hive.apache.org" 
> Date: Thursday, August 25, 2016 at 13:34
> To: "user@hive.apache.org" 
> Subject: Re: hive 2.1.0 and "NOT IN ( list )" and column is a
> partition_key
>
> Hi Gopal,
> Thank you for this insight.  good stuff.   The thing is there is no 'foo'
> for etl_database_source so that filter if anything should be
> short-circuited to 'true'.  ie. double nots.   1. not in  2. and foo not
> present.
>
> it doesn't matter what what i put in that "not in" clause the filter
> always comes back false if the column is a partition_key of course.
>
> thanks for the tip on explain extended that's some crazy output so i'm
> sifting for clues in that now.   i hear you though - something in there
> with the metastore is at play.
>
> Cheers,
> Stephen.
>
> On Thu, Aug 25, 2016 at 1:12 PM, Gopal Vijayaraghavan 
> wrote:
>
>>
>> > anybody run up against this one?  hive 2.1.0 + using a  "not in" on a
>> >list + the column is a partition key participant.
>>
>> The partition filters are run before the plan is generated.
>>
>> >AND etl_source_database not in ('foo')
>>
>> Is there a 'foo' in etl_source_database?
>>
>> > predicate: false (type: boolean)   this kills any hope
>> >of the query returning anything.
>> ...
>> >  Select Operator###doesn't even mention a filter
>>
>> This is probably good news, because that's an optimization.
>>
>> PrunedPartitionList getPartitionsFromServer(Table tab, final
>> ExprNodeGenericFuncDesc compactExpr ...) {
>> ...
>>   hasUnknownPartitions = Hive.get().getPartitionsByExpr(
>>   tab, compactExpr, conf, partitions);
>> }
>>
>>
>> goes into the metastore and evaluates the IN and NOT IN for partitions
>> ahead of time.
>>
>>
>> So, this could mean that the partition pruning evaluation returned no
>> partitions at all (or just exactly matched partitions only, skipping the
>> filter per-row).
>>
>> In 2.x, you might notice it does a bit fancier things there as well, like
>>
> select count(1) from table where year*10000 + month*100 + day >= 20160101;
>>
>> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/ppr/PartitionPruner.java#L468
>>
>>
>> You can try "explain extended" and see which partitions are selected (&
>> validate that the filter removed was applied already).
>>
>> Cheers,
>> Gopal
>>
>>
>>
>>
>>
>


Re: hive 2.1.0 and "NOT IN ( list )" and column is a partition_key

2016-08-25 Thread Sergey Shelukhin
I can repro this on master. I’ll file a bug...

From: Stephen Sprague <sprag...@gmail.com>
Reply-To: "user@hive.apache.org" <user@hive.apache.org>
Date: Thursday, August 25, 2016 at 13:34
To: "user@hive.apache.org" <user@hive.apache.org>
Subject: Re: hive 2.1.0 and "NOT IN ( list )" and column is a partition_key

Hi Gopal,
Thank you for this insight.  good stuff.   The thing is there is no 'foo' for 
etl_database_source so that filter if anything should be short-circuited to 
'true'.  ie. double nots.   1. not in  2. and foo not present.

it doesn't matter what i put in that "not in" clause the filter always 
comes back false if the column is a partition_key of course.

thanks for the tip on explain extended that's some crazy output so i'm 
sifting for clues in that now.   i hear you though - something in there with 
the metastore is at play.

Cheers,
Stephen.

On Thu, Aug 25, 2016 at 1:12 PM, Gopal Vijayaraghavan <gop...@apache.org> wrote:

> anybody run up against this one?  hive 2.1.0 + using a  "not in" on a
>list + the column is a partition key participant.

The partition filters are run before the plan is generated.

>AND etl_source_database not in ('foo')

Is there a 'foo' in etl_source_database?

> predicate: false (type: boolean)   this kills any hope
>of the query returning anything.
...
>  Select Operator###doesn't even mention a filter

This is probably good news, because that's an optimization.

PrunedPartitionList getPartitionsFromServer(Table tab, final
ExprNodeGenericFuncDesc compactExpr ...) {
...
  hasUnknownPartitions = Hive.get().getPartitionsByExpr(
  tab, compactExpr, conf, partitions);
}


goes into the metastore and evaluates the IN and NOT IN for partitions
ahead of time.


So, this could mean that the partition pruning evaluation returned no
partitions at all (or just exactly matched partitions only, skipping the
filter per-row).

In 2.x, you might notice it does a bit fancier things there as well, like

select count(1) from table where year*10000 + month*100 + day >= 20160101;

https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/ppr/PartitionPruner.java#L468


You can try "explain extended" and see which partitions are selected (&
validate that the filter removed was applied already).

Cheers,
Gopal







How to do such a join of Pairing in Hive?

2016-08-25 Thread Rex X
1. First, we create a Hive table by loading the following CSV file:

> $cat data.csv
>
> ID,City,Zip,Flag
> 1,A,95126,0
> 2,A,95126,1
> 3,A,95126,1
> 4,B,95124,0
> 5,B,95124,1
> 6,C,95124,0
> 7,C,95127,1
> 8,C,95127,0
> 9,C,95127,1


(a) where "ID" above is a primary key (unique),

(b) for each "City" and "Zip" combination, there is at most one ID with
Flag=0, while there can be multiple IDs with Flag=1 for that "City" and
"Zip" combination.

(c) Flag can be 0 or 1


2. Load the CSV file above into a Hive table:

> create table test(ID string, City String, Zip String, Flag int)
> ROW FORMAT DELIMITED FIELDS TERMINATED BY "," ;
>
> LOAD DATA LOCAL INPATH "./data.csv" OVERWRITE INTO TABLE test;


3.
For each ID with Flag=0, we want to pair it with another ID with Flag=1 but
with the same City and Zip. If no paired ID with Flag=1 and matching City and
Zip can be found, we just delete that record.

Here is the Expected result:

> ID,City,Zip,Flag
> 1,A,95126,0
> 2,A,95126,1
> 4,B,95124,0
> 5,B,95124,1
> 7,C,95127,1
> 8,C,95127,0


Any valuable tips on how to do this pairing in Hive or Python?

Great thanks!

Rex
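
One possible approach (an untested sketch using Hive windowing functions,
available in Hive 0.13+; it assumes the `test` table from step 2 and that
the single Flag=1 partner is chosen as the lowest ID, which matches the
expected output above):

WITH ranked AS (
  SELECT id, city, zip, flag,
         ROW_NUMBER() OVER (PARTITION BY city, zip, flag ORDER BY id) AS rn,
         SUM(CASE WHEN flag = 0 THEN 1 ELSE 0 END)
             OVER (PARTITION BY city, zip) AS n_flag0,
         SUM(CASE WHEN flag = 1 THEN 1 ELSE 0 END)
             OVER (PARTITION BY city, zip) AS n_flag1
  FROM test
)
SELECT id, city, zip, flag
FROM ranked
WHERE n_flag0 > 0 AND n_flag1 > 0    -- keep only pairable City/Zip groups
  AND (flag = 0 OR rn = 1);          -- the Flag=0 row plus one Flag=1 row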


Re: hive 2.1.0 and "NOT IN ( list )" and column is a partition_key

2016-08-25 Thread Stephen Sprague
Hi Gopal,
Thank you for this insight.  good stuff.   The thing is there is no 'foo'
for etl_database_source so that filter if anything should be
short-circuited to 'true'.  ie. double nots.   1. not in  2. and foo not
present.

it doesn't matter what i put in that "not in" clause the filter always
comes back false if the column is a partition_key of course.

thanks for the tip on explain extended that's some crazy output so i'm
sifting for clues in that now.   i hear you though - something in there
with the metastore is at play.

Cheers,
Stephen.

On Thu, Aug 25, 2016 at 1:12 PM, Gopal Vijayaraghavan 
wrote:

>
> > anybody run up against this one?  hive 2.1.0 + using a  "not in" on a
> >list + the column is a partition key participant.
>
> The partition filters are run before the plan is generated.
>
> >AND etl_source_database not in ('foo')
>
> Is there a 'foo' in etl_source_database?
>
> > predicate: false (type: boolean)   this kills any hope
> >of the query returning anything.
> ...
> >  Select Operator###doesn't even mention a filter
>
> This is probably good news, because that's an optimization.
>
> PrunedPartitionList getPartitionsFromServer(Table tab, final
> ExprNodeGenericFuncDesc compactExpr ...) {
> ...
>   hasUnknownPartitions = Hive.get().getPartitionsByExpr(
>   tab, compactExpr, conf, partitions);
> }
>
>
> goes into the metastore and evaluates the IN and NOT IN for partitions
> ahead of time.
>
>
> So, this could mean that the partition pruning evaluation returned no
> partitions at all (or just exactly matched partitions only, skipping the
> filter per-row).
>
> In 2.x, you might notice it does a bit fancier things there as well, like
>
> select count(1) from table where year*10000 + month*100 + day >= 20160101;
>
> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/ppr/PartitionPruner.java#L468
>
>
> You can try "explain extended" and see which partitions are selected (&
> validate that the filter removed was applied already).
>
> Cheers,
> Gopal
>
>
>
>
>


Re: hive 2.1.0 and "NOT IN ( list )" and column is a partition_key

2016-08-25 Thread Gopal Vijayaraghavan

> anybody run up against this one?  hive 2.1.0 + using a  "not in" on a
>list + the column is a partition key participant.

The partition filters are run before the plan is generated.

>AND etl_source_database not in ('foo')

Is there a 'foo' in etl_source_database?

> predicate: false (type: boolean)   this kills any hope
>of the query returning anything.
...
>  Select Operator###doesn't even mention a filter

This is probably good news, because that's an optimization.

PrunedPartitionList getPartitionsFromServer(Table tab, final
ExprNodeGenericFuncDesc compactExpr ...) {
...
  hasUnknownPartitions = Hive.get().getPartitionsByExpr(
  tab, compactExpr, conf, partitions);
}


goes into the metastore and evaluates the IN and NOT IN for partitions
ahead of time.


So, this could mean that the partition pruning evaluation returned no
partitions at all (or just exactly matched partitions only, skipping the
filter per-row).

In 2.x, you might notice it does a bit fancier things there as well, like

select count(1) from table where year*10000 + month*100 + day >= 20160101;

https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/ppr/PartitionPruner.java#L468


You can try "explain extended" and see which partitions are selected (&
validate that the filter removed was applied already).

Cheers,
Gopal






hive 2.1.0 and "NOT IN ( list )" and column is a partition_key

2016-08-25 Thread Stephen Sprague
anybody run up against this one?  hive 2.1.0 + using a  "not in" on a list
+ the column is a partition key participant.


* using not

query:

explain
SELECT count(*)
FROM bi.fact_email_funnel
WHERE
event_date_key = 20160824
AND etl_source_database *not* in ('foo')


output frag:

Map Operator Tree:
  TableScan
alias: fact_email_funnel
Statistics: Num rows: 33723530 Data size: 12154626048 Basic
stats: COMPLETE Column stats: NONE
Filter Operator
  *predicate: false (type: boolean)   this kills any hope
of the query returning anything.*
  Statistics: Num rows: 1 Data size: 360 Basic stats: COMPLETE
Column stats: NONE
  Group By Operator
aggregations: count()




* not using not

query:

  explain
  SELECT count(*)
  FROM bi.fact_email_funnel
  WHERE
event_date_key = 20160824
AND etl_source_database in ('foo')




output frag:
 Map Operator Tree:
  TableScan
alias: fact_email_funnel
Statistics: Num rows: 33744659 Data size: 14562180096 Basic
stats: COMPLETE Column stats: NONE
Select Operator*###doesn't even mention a filter*
  Statistics: Num rows: 33744659 Data size: 14562180096 Basic
stats: COMPLETE Column stats: NONE
  Group By Operator
aggregations: count()



* show create table output:

CREATE TABLE `bi.fact_email_funnel`(
  `user_id` string,
  `email` string,
  `ipaddress` string,
  `email_type_key` string,
  `email_type_id` string,
  `email_type_name` string,
  `tracking_code` string,
  `action_type` string,
  `email_guid` string,
  `mailing_id` string,
  `message_id` string,
  `delivered_status` string,
  `delivered_error_code` string,
  `click_link_id` string,
  `click_link` string,
  `click_property_id` string,
  `platform` string,
  `etl_source_id_column` string,
  `etl_source_id` string,
  `event_datetime` timestamp,
  `event_time_key` bigint,
  `id_job` bigint,
  `load_date` timestamp,
  `version` string,
  `event_datetime_utc` timestamp,
  `num_leads` int COMMENT 'number of leads',
  `lead_product_type` string COMMENT 'product lead type ',
  `property_impression` array<struct<...>>
COMMENT 'sent property attributes',
  `search_impression` map<...> COMMENT 'search attributes')
PARTITIONED BY (
  `event_date_key` bigint,
  `etl_source_database` string,
  `etl_source_table` string)


Re: Hive transaction doesn't release lock.

2016-08-25 Thread Eugene Koifman
What is your hive.txn.timeout set to, and does the value as seen by the 
metastore process agree with what is seen by the Storm client?

The good news is that this race condition is no longer possible since 
https://issues.apache.org/jira/browse/HIVE-13013

Eugene

From: Igor Kuzmenko <f1she...@gmail.com>
Reply-To: "user@hive.apache.org" <user@hive.apache.org>
Date: Wednesday, August 24, 2016 at 3:32 AM
To: "user@hive.apache.org" <user@hive.apache.org>
Subject: Re: Hive transaction doesn't release lock.

Oh, my copy paste skills failed me.
Here's output from HIVE_LOCKS:

# HL_LOCK_EXT_ID, HL_LOCK_INT_ID, HL_TXNID, HL_DB, HL_TABLE, HL_PARTITION, 
HL_LOCK_STATE, HL_LOCK_TYPE, HL_LAST_HEARTBEAT, HL_ACQUIRED_AT, HL_USER, HL_HOST
'8496355', '1', '78461824', 'default', 'data_http', 'dt=20160821', 'a', 'r', 
'0', '1471768489000', 'storm', 
'sorm-data03.msk.mts.ru'

TXNS, TXN_COMPONENTS:
# TXN_ID, TXN_STATE, TXN_STARTED, TXN_LAST_HEARTBEAT, TXN_USER, TXN_HOST, 
TC_TXNID, TC_DATABASE, TC_TABLE, TC_PARTITION
78461824, a, 1471762974000, 1471768488000, storm, 
sorm-data03.msk.mts.ru, 78461824, default, 
data_http, dt=20160821


Good news, I've found a log from when it all started:

Here comes heartbeat:
2016-08-21 11:34:48,934 DEBUG [pool-3-thread-155]: txn.TxnHandler 
(TxnHandler.java:heartbeatTxn(1882)) - Going to execute update 

Concurrently we are trying to abort transaction
2016-08-21 11:34:48,907 DEBUG [DeadTxnReaper-0]: txn.TxnHandler 
(TxnHandler.java:abortTxns(1429)) - Going to execute update 
2016-08-21 11:34:48,943 DEBUG [DeadTxnReaper-0]: txn.TxnHandler 
(TxnHandler.java:abortTxns(1446)) - Going to execute update 
2016-08-21 11:34:48,957 DEBUG [pool-3-thread-11]: txn.TxnHandler 
(TxnHandler.java:abortTxns(1429)) - Going to execute update 
2016-08-21 11:34:48,979 DEBUG [pool-3-thread-11]: txn.TxnHandler 
(TxnHandler.java:abortTxns(1446)) - Going to execute update 

At this point the transaction is aborted and there's no lock.
But the heartbeat thread is still working, and:
2016-08-21 11:34:49,025 DEBUG [pool-3-thread-155]: txn.TxnHandler 
(TxnHandler.java:lock(1546)) - Going to execute update 
2016-08-21 11:34:49,027 DEBUG [pool-3-thread-155]: txn.TxnHandler 
(TxnHandler.java:lock(1576)) - Going to execute update 
2016-08-21 11:34:49,029 DEBUG [pool-3-thread-155]: txn.TxnHandler 
(TxnHandler.java:checkLock(1696)) -   lockid:8496355 intLockId:1 txnid:78461824 
db:default table:data_http partition:dt=20160821 state:WAITING type:SHARED_READ


So I guess that it's a race condition between the heartbeat thread and the TxnReaper 
thread. The last-heartbeat information in the HIVE_LOCKS table differs from the TXNS table.
Full log here.

On Tue, Aug 23, 2016 at 8:20 PM, Eugene Koifman <ekoif...@hortonworks.com> wrote:
your query is “select * from HIVE_LOCKS” but the output is not from HIVE_LOCKS.
What entries do you have in HIVE_LOCKS for this txn_id?

If all you see is an entry in the TXNS table in ‘a’ state – that is OK; that just 
means that this transaction was aborted.

Eugene

From: Igor Kuzmenko <f1she...@gmail.com>
Reply-To: "user@hive.apache.org" <user@hive.apache.org>
Date: Monday, August 22, 2016 at 8:27 AM
To: "user@hive.apache.org" <user@hive.apache.org>
Subject: Hive transaction doesn't release lock.

Hello, I'm using Apache Hive 1.2.1 and Apache Storm to stream data into a Hive 
table.
After making some tests I tried to truncate my table, but the SQL execution 
doesn't complete because of a lock on the table:

select * from HIVE_LOCKS;

# TXN_ID, TXN_STATE, TXN_STARTED, TXN_LAST_HEARTBEAT, TXN_USER, TXN_HOST, 
TC_TXNID, TC_DATABASE, TC_TABLE, TC_PARTITION
'78461824', 'a', '1471762974000', '1471768488000', 'storm', 
'sorm-data03.msk.mts.ru', '78461824', 'default', 'data_http', 'dt=20160821'



At the same time:

select * from TXNS, TXN_COMPONENTS
where 1=1
and TXNS.TXN_ID = TXN_COMPONENTS.TC_TXNID
and TXNS.TXN_ID = 78461824
;


# TXN_ID, TXN_STATE, TXN_STARTED, TXN_LAST_HEARTBEAT, TXN_USER, TXN_HOST, 
TC_TXNID, TC_DATABASE, TC_TABLE, TC_PARTITION
78461824, a, 1471762974000, 1471768488000, storm, sorm-data03.msk.mts.ru, 
78461824, default, data_http, dt=20160821


Txn 78461824 is in aborted state, but still keeps a lock on table 'data_http'. 
How can I avoid this? What should I do to get rid of that lock?
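
Until the fixed heartbeat/reaper code is in place, one manual workaround (a
sketch to run directly against the metastore RDBMS after backing it up; the
lock id is the one from this thread) is to delete the lock whose owning
transaction is already aborted:

-- locks whose transaction is already in aborted ('a') state are orphans
SELECT hl.HL_LOCK_EXT_ID, hl.HL_TXNID, t.TXN_STATE
FROM HIVE_LOCKS hl
JOIN TXNS t ON t.TXN_ID = hl.HL_TXNID
WHERE t.TXN_STATE = 'a';

-- remove the orphaned lock found above
DELETE FROM HIVE_LOCKS WHERE HL_LOCK_EXT_ID = 8496355;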




Re: Concurrency support of Apache Hive for streaming data ingest at 7K RPS into multiple tables

2016-08-25 Thread Eugene Koifman
The Hive streaming API (which is what Storm uses) inserts multiple events into a 
table per transaction.  It has been designed for this but was not quite ready for 
prime time in 0.14.

Hive 1.3 has these metastore issues fixed as well as many others. 
HIVE-11948, 
HIVE-13013 include the bulk 
of the improvements regarding the metastore but not all.  (HDP 2.5 will include 
all of them)
Backporting individual patches may prove difficult and error-prone.  Using HDP 
2.5 would be much safer.
(If you are using Storm from HDP 2.2 - it also has had a number of important 
fixes since 2.2 in the module that uses Hive Streaming API)

It looks like you have (roughly) 100k events per transaction.  How many 
transactions do you have per batch?

The only thing that can help in 0.14 is to reduce the load on the metastore.
You can do that by adjusting your ingest process to reduce the number of 
concurrent (Hive) transactions by making each transaction larger (most 
effective) and by making transaction batches larger.

The Streaming API requires a heartbeat (also a metastore call) to be sent, which 
Storm does.  The frequency is controlled by hive.txn.timeout.  You may want to 
set it to a larger value, but make sure it's set to the same value for the 
metastore process and in the Storm topology.
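
A quick way to compare the two sides (a sketch; 300 is an illustrative value,
not a recommendation):

SET hive.txn.timeout;        -- in beeline/Hive CLI, prints the value this client session sees
SET hive.txn.timeout=300;    -- session-level only; the metastore reads hive-site.xml at
                             -- startup, so align both hive-site.xml files instead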

If you are writing to partitioned tables, you may consider building a "shuffle" 
process in the Storm app so that all events for a given partition end up on the 
same bolt instance.  This would reduce the number of writers to Hive (and thus 
concurrent transactions) but of course may create unbalanced workload.


Hive's transactional tables do support concurrent inserts from multiple 
clients.  To avoid serialization errors you'd have to use 1.3/2.1.x

thanks,
Eugene

From: Jörn Franke <jornfra...@gmail.com>
Reply-To: "user@hive.apache.org" <user@hive.apache.org>
Date: Wednesday, August 24, 2016 at 2:27 PM
To: "user@hive.apache.org" <user@hive.apache.org>
Subject: Re: Concurrency support of Apache Hive for streaming data ingest at 7K 
RPS into multiple tables

This is also a good option.

With respect to Hive transactional tables: I do not think they have been 
designed for massive inserts of single items. On the other hand, you would not 
insert a lot of events using single inserts in a relational database. The same 
restrictions apply; it is not the use case you want to implement.


On 24 Aug 2016, at 13:55, Kit Menke <kitme...@gmail.com> wrote:


Joel,
Another option which you have is to use the Storm HDFS bolt to stream data into 
Hive external tables. The external tables then get loaded into ORC history 
tables for long-term storage. We use this in an HDP cluster with similar load so 
I know it works. :)

I'm with Jörn on this one. My impression of hive transactions is that it is a 
new feature not totally ready for production.
Thanks,
Kit

On Aug 24, 2016 3:07 AM, "Joel Victor" <joelsvic...@gmail.com> wrote:
@Jörn: If I understood correctly, even later versions of Hive won't be able to 
handle these kinds of workloads?

On Wed, Aug 24, 2016 at 1:26 PM, Jörn Franke <jornfra...@gmail.com> wrote:
I think Hive, especially these old versions, has not been designed for this. Why 
not store them in HBase and run an Oozie job regularly that puts them all into 
Hive/ORC or Parquet in a bulk job?

On 24 Aug 2016, at 09:35, Joel Victor <joelsvic...@gmail.com> wrote:

Currently I am using Apache Hive 0.14, which ships with HDP 2.2. We are trying to 
perform streaming ingestion with it.
We are using the Storm Hive bolt and we have 7 tables into which we are trying to 
insert. The RPS (requests per second) of our bolts ranges from 5000 to 7000, and 
our commit policies are configured accordingly, i.e. 100k events or 15 seconds.

We see that there are many commitTxn exceptions due to serialization errors in 
the metastore (we are using PostgreSQL 9.5 as the metastore).
The serialization errors will cause the topology to start lagging in terms of 
events processed, as it will try to reprocess the batches that have failed.

I have already backported this 
HIVE-10500 to 0.14 and there 
isn't much improvement.
I went through most of the JIRAs about transactions and I found the following 
HIVE-11948, 
HIVE-13013. I would like to 
backport them to 0.14.
Going through the patches gives me an impression that I need to mostly update 
the queries and transaction levels.
Do these patches also require me to update the schema in the metastore? Please 
also let me know if there are any other patches that I missed.

I wo

SERVICE In hive.distro

2016-08-25 Thread Rasp Berry
Would it make sense to either export the SERVICE parameter in hive.distro
or maybe pass it as input to hive-env.sh so that we can set environment
parameters based on the service (hiveserver2, metastore, etc.)? If this is
already happening I must be missing it in the code.


Re: Fw: Hive update operation

2016-08-25 Thread Mich Talebzadeh
Hi,

What is your current RDBMS, and are these the SQL statements used in that RDBMS?

Have you tried them on Hive?

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 25 August 2016 at 06:56, Priyanka Raghuvanshi 
wrote:

> Hi  Dr Mich,
>
>
> Thank you for replying.
>
>
> Yes, while creating the table the transactional property was set to
> true; the same applies to the other tables.
>
>
> Following are SQL update query examples; I want to achieve the same
> through HQL:
>
>
> 1)
>
> UPDATE table1
> SET FAging=t2.FAging,
> PaymentR=t2.PaymentR,
> ArrearsO=t2.ArrearsO ,
> IRemaining=t2.IRemaining,
> Taxi_Association=t2.TaxiAssociation
> From table2 t2
> Left JOIN table1  t1
> ON t2.AccNum COLLATE DATABASE_DEFAULT= t1.AccNo COLLATE DATABASE_DEFAULT
>
> 2)
>
> UPDATE table1
> SET Img_String=CASE WHEN convert(nvarchar,T1.dTimeStamp,103)=Convert(nvarchar,getdate(),103) AND T1.Speed>0 then
> isnull(T2.clmn1,'Other') +';Moving;'+ISNULL(T2.PinsStatus,'Performing')+';'+CASE WHEN ISNULL(T2.SupplierName,'New') LIKE '%Repo%' THEN 'Refurbished' ELSE 'New' END
> ELSE
> isnull(T2.clmn1,'Other') +';Non Moving;'+ISNULL(VEH.PinsStatus,'Performing')+';'+CASE WHEN ISNULL(T2.SupplierName,'New') LIKE '%Repo%' THEN 'Refurbished' ELSE 'New' END
> END,
> Moving_or_NonMoving =CASE WHEN convert(nvarchar,T1.dTimeStamp,103)=Convert(nvarchar,getdate(),103) AND T1.Speed>0 then
> 'Moving'
> ELSE
> 'Non Moving'
> END,
> InfoWindow= CAST(T1.dTimeStamp as NVARCHAR(20)) + '  ' + [Address] +' @ ' + cast(T1.Speed as nvarchar) + ' KMPH ['+ isnull(T1.AccountNo,'') +'];'+ isnull(T2.ChassisNumber,'') +';'+ isnull(replace(T2.OwnerName,,'') ,'') +';'+ isnull(T2.clmn1,''),
> Status=T2.PinsStatus
>
> FROM table2 T2
> INNER JOIN table1 T1 on T2.AcqNum=T1.AccoNo
>
> Priyanka Raghuvanshi
> --
> *From:* Mich Talebzadeh 
> *Sent:* 24 August 2016 22:02:58
> *To:* user
> *Subject:* Re: Fw: Hive update operation
>
> Has the underlying table to be updated been defined as transactional?
>
> Can you give the update example?
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 24 August 2016 at 14:50, Priyanka Raghuvanshi 
> wrote:
>
>>
>>
>>
>> --
>> *From:* Priyanka Raghuvanshi
>> *Sent:* 24 August 2016 19:12
>> *To:* user@hive.apache.org
>> *Subject:* Hive update operation
>>
>>
>> Hi,
>>
>>
>> Is it possible to update a table (ORC) using a join or a column of another
>> table (ORC)?
>>
>>
>> Kind Regards,
>>
>> Priyanka Raghuvanshi
>>
>
>


Re: Table substitution

2016-08-25 Thread Furcy Pin
Hi Rakesh,

this is an interesting question,

Did you look at the cube and rollup possibilities?

https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation,+Cube,+Grouping+and+Rollup

I believe that you could use this feature to merge your table 1 and table 2
into one single table with a
GROUP BY business_unit, balance, account WITH ROLLUP

partition the result table with GROUPING__ID

and then your select query would have to be

select business_unit, balance from new_table where business_unit='mybu' and
GROUPING__ID = 1

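Put together (an untested sketch; the table and column names are illustrative,
and the exact GROUPING__ID value to filter on depends on the Hive version, so
check it first with a SELECT DISTINCT GROUPING__ID):

SET hive.exec.dynamic.partition.mode=nonstrict;

CREATE TABLE agg_balances (business_unit STRING, account STRING, balance DOUBLE)
PARTITIONED BY (gid INT)
STORED AS ORC;

INSERT OVERWRITE TABLE agg_balances PARTITION (gid)
SELECT business_unit, account, SUM(balance) AS balance, GROUPING__ID AS gid
FROM detail_table
GROUP BY business_unit, account WITH ROLLUP;

-- the coarser query then reads a single partition:
SELECT business_unit, balance
FROM agg_balances
WHERE business_unit = 'mybu' AND gid = 1;
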
If you want to get something more automatic after that, I guess there must
be tools
out there capable of providing a GUI on top of a cube, automatically
generating the SQL
and running it via JDBC connectors to Hive, Spark or Impala...

After googling a little, I believe that Apache Lens matches the description.

https://lens.apache.org/user/olap-cube.html

I also found this old presentation related to it :
http://events.linuxfoundation.org/sites/events/files/slides/ApacheCon-Datacubes_in_Apache_Hive_0.pdf

Hope this helps,

Furcy





On Thu, Aug 25, 2016 at 1:27 PM, Rakesh Kant  wrote:

> I have multiple aggregate tables created at different levels of
> aggregation grouped by smaller subsets of keys. As an example:
> Table 1 : business_unit, account, balance
> Table 2 : business_unit, balance
>
> When a query is written as select business_unit, balance from table1 where
> business_unit='mybu'; it would be faster to use table 2 instead, but how do
> I substitute table 2 for table 1 in the query?
>
> Is there a way to add custom handlers either in the driver, optimizer or
> other points ?
>
> Any help or pointers will be appreciated.
>
> RK
>
> Get Outlook for iOS 
>
>


Re: Loading Sybase to hive using sqoop

2016-08-25 Thread Rahul Channe
Thank you all for valuable inputs

On Wednesday, August 24, 2016, Mich Talebzadeh 
wrote:

> If this is a one-off then Spark will do OK.
>
> Sybase IQ provides bcp, which creates a tab- or comma-separated flat file;
> you can use that to extract the IQ table, put it on HDFS, and create an
> external table.
>
> This is of course a one-off.
>
> You can also use SRS (SAP Replication Server) to get the data out the first
> time and sync the Hive table with the Sybase IQ table in real time. You will
> need SRS SP 204 or above to make this work.
>
> Talk to your DBA if they can get SRS SP from Sybase for this purpose. I
> have done it many times. I think it is stable enough for this purpose.
>
> HTH
>
>
>
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 24 August 2016 at 22:35, Gopal Vijayaraghavan wrote:
>
>>
>>
>> > val d = HiveContext.read.format("jdbc").options(
>> ...
>> >> The sqoop job takes 7 hours to load 15 days of data, even while setting
>> >>the direct load option to 6. Hive is using MR framework.
>>
>> In general, the JDBC implementations tend to react rather badly to large
>> extracts like this - the throttling usually happens on the operational
>> database end rather than being a problem on the MR side.
>>
>>
>> Sqoop is good enough for a one-shot import, but doing it frequently is
>> best done by the database's own dump protocols, which are generally not
>> throttled similarly.
>>
>> Pinterest recently put out a document on how they do this
>>
>> https://engineering.pinterest.com/blog/tracker-ingesting-mysql-data-scale-part-1
>>
>> +
>> https://engineering.pinterest.com/blog/tracker-ingesting-mysql-data-scale-part-2
>>
>> More interesting continous ingestion reads directly off the replication
>> protocol write-ahead logs.
>>
>> https://github.com/Flipkart/MySQL-replication-listener/tree/master/examples/mysql2hdfs
>>
>> +
>> https://github.com/flipkart-incubator/storm-mysql
>>
>>
>> But all of these tend to be optimized to a database engine, while the JDBC
>> pipe tends to work slowly for all engines.
>>
>> Cheers,
>> Gopal
>>
>>
>>
>
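
The bcp route Mich describes above might look like this (an untested sketch;
every path, delimiter, and column is illustrative):

-- (1) on the IQ host, export with bcp (shell step shown as a comment):
--       bcp mydb..mytable out /tmp/mytable.tsv -c -t '\t' -U user -P pass -S server
-- (2) stage the file:  hdfs dfs -put /tmp/mytable.tsv /staging/mytable/
-- (3) expose it to Hive:
CREATE EXTERNAL TABLE staging_mytable (
  id      INT,
  payload STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/staging/mytable/';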


Table substitution

2016-08-25 Thread Rakesh Kant
I have multiple aggregate tables created at different levels of aggregation, 
grouped by smaller subsets of keys. As an example:
Table 1 : business_unit, account, balance
Table 2 : business_unit, balance
When a query is written as select business_unit, balance from table1 where 
business_unit='mybu'; it would be faster to use table 2 instead, but how do I 
substitute table 2 for table 1 in the query?
Is there a way to add custom handlers either in the driver, optimizer, or other 
points?
Any help or pointers will be appreciated.
RK

Get Outlook for iOS


Realm is required in beeline connection for users if display and logon name differ, even when LDAP authentication is ensured

2016-08-25 Thread mathes waran
*Problem*:
Realm is required in the beeline connection for some users when the display
name and logon name differ, even though LDAP authentication is configured.

While connecting with beeline, with LDAP authentication configured, some
users connect without a realm while others must supply one. This is because,
*when the users were created in Active Directory*, *the display name differs
from the logon name.* During authentication it should validate by logon name
only, but here some users seem to be validated by display name instead.

*Error* I got when logging in without a realm in beeline:
Error: Could not open client transport with JDBC Uri:
jdbc:hive2://mylapn2215:1/default;: Peer indicated failure: PLAIN auth
failed: LDAP Authentication failed for user (state=08S01,code=0)

Please refer to the attached images and share your ideas if you have overcome this.

Thanks,
Matheswaran.S


Restricting Avro File size/records in Hive

2016-08-25 Thread Pallav Jakhotiya
Hi,

We have data in an ORC-formatted table; we filter certain records and then create 
an Avro-format Hive table using the "insert into" clause.

Our use case is to create smaller Avro data files in a Hive table that can be 
passed on to consumers as a Kafka message.
Can we restrict the file size in an Avro-backed Hive table while we execute the 
insert into command?

One solution we had was to use clustered by, but since the number of 
records/size is not known beforehand it becomes difficult to choose the number 
of buckets.

Anything else we can try to restrict this?
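
One thing that may help (a sketch; the sizes are illustrative, and these
settings steer a merge target rather than enforcing a hard per-file maximum)
is Hive's small-file merge settings on the insert:

SET hive.merge.mapfiles=true;
SET hive.merge.mapredfiles=true;
SET hive.merge.smallfiles.avgsize=16000000;  -- merge when average output file < ~16 MB
SET hive.merge.size.per.task=64000000;       -- aim for ~64 MB files after the merge

INSERT INTO TABLE avro_target
SELECT * FROM orc_source WHERE some_filter = true;  -- table names are hypothetical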