Will partition pruning happen on HIVE views?

2018-01-23 Thread Ramasubramanian Narayanan
Hi,

I have a scenario.

Hive table "Table1" is created with a partition on txn_date.

A view "VIEW_TABLE1" is created on top of "Table1" with the query 'select *
from table1'.

If I query the view with a WHERE clause on txn_date, will partition
pruning happen? Please throw some light on how it behaves in HIVE.
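
For illustration, the setup is roughly like this (table, column and
partition names below are just placeholders), and EXPLAIN can be used to
check whether the filter is pushed down through the view:

CREATE TABLE table1 (acct_id STRING, amt DOUBLE)
PARTITIONED BY (txn_date STRING);

CREATE VIEW view_table1 AS SELECT * FROM table1;

-- If pruning happens, the plan should show only the matching partition
-- being scanned.
EXPLAIN
SELECT * FROM view_table1 WHERE txn_date = '2018-01-01';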


regards,
Rams


Re: LEFT JOIN and WHERE CLAUSE - How to handle

2017-08-24 Thread Ramasubramanian Narayanan
Hi,

My bad, you are right. I intended to give the query as below (TXN left
joining CURRENCY, not the other way around):

*Query : *Select ROW_NUM,CCY_CD,TXN_DT,CNTRY_DESC
from TXN LEFT JOIN CURRENCY  on (CURRENCY.CCY_CD = TXN.CCY_CD)
where
TXN_DT between EFF_ST_DT and EFF_END_DT;

Thanks and regards,
Rams


On Wed, Aug 23, 2017 at 9:17 PM, Furcy Pin <furcy@flaminem.com> wrote:

> Ho, in that case...
>
> (First I notice that you say you want all records in TXN but in the query
> you give, you perform your join the other way round.)
>
> This is a typical use case that SQL is not very good at handling...
>
> The solutions I see are:
>
> - use RANK as you suggested.
>   Note that Hive is smart enough to optimize it correctly:
>   if you use a filter WHERE Rank() < K, it will keep the first K values in
> RAM and ignore the rest, rather than ranking everything and filtering
> afterwards.
>
> - perform a GROUP BY TXN.ROW_NUM so you have only one line per
> transaction, and aggregate the rest the way you like (max, max_by or
> arbitrary); see the sketch after this list.
>   To handle the missing EUR line, just add (OR CURRENCY.CCY_CD IS NULL)
> to your where clause (and reverse the left join).
>
> - Most complicated way to implement, but simplest to use afterwards:
>   replace your CURRENCY table with a table where you have only one row per
> currency, and store the multiple start/end dates in an array of structs,
>   then write a UDF that, given such an array and a date, returns the correct
> description. It only works if you don't have too many descriptions per
> currency though.
>   This is where Spark comes in handy: it is simpler to define UDFs with it.
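>
>   A rough sketch of that GROUP BY approach, using the column names from the
> original post (only an illustration; note that a transaction whose currency
> exists but has no covering date range would still be dropped by this filter,
> so the date handling may need more tuning):
>
> SELECT
>   TXN.ROW_NUM,
>   TXN.CCY_CD,
>   TXN.TXN_DT,
>   MAX(CURRENCY.CNTRY_DESC) AS CNTRY_DESC
> FROM TXN
> LEFT JOIN CURRENCY
>   ON CURRENCY.CCY_CD = TXN.CCY_CD
> WHERE TXN.TXN_DT BETWEEN CURRENCY.EFF_ST_DT AND CURRENCY.EFF_END_DT
>    OR CURRENCY.CCY_CD IS NULL
> GROUP BY TXN.ROW_NUM, TXN.CCY_CD, TXN.TXN_DT;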
>
>
>
> On Wed, Aug 23, 2017 at 5:21 PM, Ramasubramanian Narayanan <
> ramasubramanian.naraya...@gmail.com> wrote:
>
>> Hi,
>>
>> TXN.TXN_DT should be between CURRENCY.EFF_ST_DT and CURRENCY.EFF_END_DT.
>> It needs to be equated.
>>
>>
>> regards,
>> Rams
>>
>> On Wed, Aug 23, 2017 at 7:55 PM, Furcy Pin <furcy@flaminem.com>
>> wrote:
>>
>>> I would suggest to use a subquery
>>>
>>> WITH unique_currency AS (
>>>   SELECT
>>> CCY_CD,
>>> MAX(CNTRY_DESC) as CNTRY_DESC
>>>   FROM CURRENCY
>>>   GROUP BY CCY_CD
>>> )
>>>
>>> and then perform your left join on it.
>>>
>>> Some SQL engines (e.g. Presto) have aggregation functions like
>>> arbitrary(col) that take any value and are a little less costly than a
>>> max.
>>> Sometimes, they also have functions like max_by(x, y)
>>> <https://prestodb.io/docs/current/functions/aggregate.html#max_by> that
>>> would allow you to get the most recent description.
>>>
>>> It is a shame that this function is not included in Hive yet, but still
>>> you can find some UDAF implementations on github
>>> <https://github.com/dataiku/dataiku-hive-udf#first_of_group-last_of_group>
>>> .
>>>
>>>
>>>
>>> On Wed, Aug 23, 2017 at 3:37 PM, Ramasubramanian Narayanan <
>>> ramasubramanian.naraya...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> Need your suggestion on the below.
>>>>
>>>> Have two tables TXN and CURRENCY.
>>>>
>>>> Need all records in TXN and hence doing Left Join with CURRENCY.
>>>>
>>>> *Two problems :*
>>>> 1. The CURRENCY table may contain duplicate records, hence they need to be
>>>> handled through RANK or some other function.
>>>> 2. If we equate TXN_DT between EFF_ST_DT and EFF_END_DT in the 'where
>>>> clause' then we will lose the EUR records, which should not happen.
>>>>
>>>> Please suggest a solution to overcome both problems. For duplicated
>>>> records it is fine if we select any of the CNTRY_DESC.
>>>>
>>>> *Table : CURRENCY*
>>>>
>>>> CCY_CD | CNTRY_DESC              | EFF_ST_DT | EFF_END_DT
>>>> INR    | Indian Rupee            | 1-Jan-15  | 20-Feb-16
>>>> INR    | Indian Rupee New        | 21-Feb-16 | 20-Feb-99
>>>> USD    | US Dollar               | 1-Jan-15  | 20-Feb-16
>>>> SGD    | Singapore Dollar        | 1-Jan-15  | 20-Feb-17
>>>> SGD    | Singapore Dollar New    | 15-Feb-17 | 20-Feb-99
>>>> SGD    | Singapore Dollar Latest | 16-Feb-17 | 16-Feb-17
>>>>
>>>> *Table : TXN*
>>>>
>>>> ROW_NUM | CCY_CD | TXN_DT
>>>> 1       | INR    | 16-Feb-17
>>>> 2       | USD    | 16-Feb-17
>>>> 3       | SGD    | 16-Feb-17
>>>> 4       | EUR    | 16-Feb-17

Re: LEFT JOIN and WHERE CLAUSE - How to handle

2017-08-23 Thread Ramasubramanian Narayanan
Hi,

TXN.TXN_DT should be between CURRENCY.EFF_ST_DT and CURRENCY.EFF_END_DT. It
needs to be equated.


regards,
Rams

On Wed, Aug 23, 2017 at 7:55 PM, Furcy Pin <furcy@flaminem.com> wrote:

> I would suggest to use a subquery
>
> WITH unique_currency AS (
>   SELECT
> CCY_CD,
> MAX(CNTRY_DESC) as CNTRY_DESC
>   FROM CURRENCY
>   GROUP BY CCY_CD
> )
>
> and then perform your left join on it.
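>
> A sketch of the full query with that subquery (column names as in the
> original post; only an illustration):
>
> WITH unique_currency AS (
>   SELECT
>     CCY_CD,
>     MAX(CNTRY_DESC) AS CNTRY_DESC
>   FROM CURRENCY
>   GROUP BY CCY_CD
> )
> SELECT TXN.ROW_NUM, TXN.CCY_CD, TXN.TXN_DT, unique_currency.CNTRY_DESC
> FROM TXN
> LEFT JOIN unique_currency
>   ON unique_currency.CCY_CD = TXN.CCY_CD;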
>
> Some SQL engines (e.g. Presto) have aggregation functions like
> arbitrary(col) that take any value and are a little less costly than a
> max.
> Sometimes, they also have functions like max_by(x, y)
> <https://prestodb.io/docs/current/functions/aggregate.html#max_by> that
> would allow you to get the most recent description.
>
> It is a shame that this function is not included in Hive yet, but still
> you can find some UDAF implementations on github
> <https://github.com/dataiku/dataiku-hive-udf#first_of_group-last_of_group>
> .
>
>
>
> On Wed, Aug 23, 2017 at 3:37 PM, Ramasubramanian Narayanan <
> ramasubramanian.naraya...@gmail.com> wrote:
>
>> Hi,
>>
>> Need your suggestion on the below.
>>
>> Have two tables TXN and CURRENCY.
>>
>> Need all records in TXN and hence doing Left Join with CURRENCY.
>>
>> *Two problems :*
>> 1. The CURRENCY table may contain duplicate records, hence they need to be
>> handled through RANK or some other function.
>> 2. If we equate TXN_DT between EFF_ST_DT and EFF_END_DT in the 'where
>> clause' then we will lose the EUR records, which should not happen.
>>
>> Please suggest a solution to overcome both problems. For duplicated
>> records it is fine if we select any of the CNTRY_DESC.
>>
>> *Table : CURRENCY*
>>
>> CCY_CD | CNTRY_DESC              | EFF_ST_DT | EFF_END_DT
>> INR    | Indian Rupee            | 1-Jan-15  | 20-Feb-16
>> INR    | Indian Rupee New        | 21-Feb-16 | 20-Feb-99
>> USD    | US Dollar               | 1-Jan-15  | 20-Feb-16
>> SGD    | Singapore Dollar        | 1-Jan-15  | 20-Feb-17
>> SGD    | Singapore Dollar New    | 15-Feb-17 | 20-Feb-99
>> SGD    | Singapore Dollar Latest | 16-Feb-17 | 16-Feb-17
>>
>> *Table : TXN*
>>
>> ROW_NUM | CCY_CD | TXN_DT
>> 1       | INR    | 16-Feb-17
>> 2       | USD    | 16-Feb-17
>> 3       | SGD    | 16-Feb-17
>> 4       | EUR    | 16-Feb-17
>>
>> *Expected Output*
>>
>> ROW_NUM | CCY_CD | TXN_DT    | CNTRY_DESC
>> 1       | INR    | 16-Feb-17 | Indian Rupee
>> 2       | USD    | 16-Feb-17 | US Dollar
>> 3       | SGD    | 16-Feb-17 | Singapore Dollar Latest (any of the three valid values is fine)
>> 4       | EUR    | 16-Feb-17 |
>>
>>
>>
>> *Query : *Select ROW_NUM,CCY_CD,TXN_DT,CNTRY_DESC
>> from CURRENCY LEFT JOIN TXN on (CURRENCY.CCY_CD = TXN.CCY_CD)
>> where
>> TXN_DT between EFF_ST_DT and EFF_END_DT;
>>
>>
>>
>> This query will drop the "EUR" record because of the where clause used.
>> It cannot be handled with a case statement instead of the 'where clause' as
>> we have more than one record for 'SGD' when TXN_DT is 16-Feb.
>>
>> regards,
>> Rams
>>
>
>


LEFT JOIN and WHERE CLAUSE - How to handle

2017-08-23 Thread Ramasubramanian Narayanan
Hi,

Need your suggestion on the below.

Have two tables TXN and CURRENCY.

Need all records in TXN and hence doing Left Join with CURRENCY.

*Two problems :*
1. The CURRENCY table may contain duplicate records, hence they need to be
handled through RANK or some other function.
2. If we equate TXN_DT between EFF_ST_DT and EFF_END_DT in the 'where
clause' then we will lose the EUR records, which should not happen.

Please suggest a solution to overcome both problems. For duplicated
records it is fine if we select any of the CNTRY_DESC.

*Table : CURRENCY*

CCY_CD | CNTRY_DESC              | EFF_ST_DT | EFF_END_DT
INR    | Indian Rupee            | 1-Jan-15  | 20-Feb-16
INR    | Indian Rupee New        | 21-Feb-16 | 20-Feb-99
USD    | US Dollar               | 1-Jan-15  | 20-Feb-16
SGD    | Singapore Dollar        | 1-Jan-15  | 20-Feb-17
SGD    | Singapore Dollar New    | 15-Feb-17 | 20-Feb-99
SGD    | Singapore Dollar Latest | 16-Feb-17 | 16-Feb-17

*Table : TXN*

ROW_NUM | CCY_CD | TXN_DT
1       | INR    | 16-Feb-17
2       | USD    | 16-Feb-17
3       | SGD    | 16-Feb-17
4       | EUR    | 16-Feb-17

*Expected Output*

ROW_NUM | CCY_CD | TXN_DT    | CNTRY_DESC
1       | INR    | 16-Feb-17 | Indian Rupee
2       | USD    | 16-Feb-17 | US Dollar
3       | SGD    | 16-Feb-17 | Singapore Dollar Latest (any of the three valid values is fine)
4       | EUR    | 16-Feb-17 |



*Query : *Select ROW_NUM,CCY_CD,TXN_DT,CNTRY_DESC
from CURRENCY LEFT JOIN TXN on (CURRENCY.CCY_CD = TXN.CCY_CD)
where
TXN_DT between EFF_ST_DT and EFF_END_DT;



This query will drop the "EUR" record because of the where clause used.
It cannot be handled with a case statement instead of the 'where clause' as
we have more than one record for 'SGD' when TXN_DT is 16-Feb.

regards,
Rams


HIVE or PIG - For building DQ framework

2017-07-06 Thread Ramasubramanian Narayanan
Hi All,

Please help me with the below.

*Use Case:*
Trying to develop a framework to do data profiling and data quality checks.
Data is stored in a HIVE table in RC format.
No joins; only considering DQ checks that can be done on a single table.

*Need suggestion:*
Thinking of using either PIG or HIVE for performing data quality and
profiling. Need your suggestion on the same. Have listed a few high-level
points which came to mind.

*Performance*:
- Will HIVE perform better, or PIG? In PIG we can load the data set into a
variable and perform many operations on that data set. Will that improve
performance?
- In HIVE, we can have almost 70% of the checks in the same query: null,
count, distinct count, duplicate count (total count - distinct count),
length, etc. (see the sketch below). Even in this case, will PIG perform
better or HIVE?
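
For illustration, a single HIVE query along these lines could cover several
of those checks at once (the table and column names here are only
placeholders):

SELECT
  COUNT(*)                                 AS total_cnt,
  COUNT(txn_amt)                           AS non_null_cnt,
  COUNT(*) - COUNT(txn_amt)                AS null_cnt,
  COUNT(DISTINCT txn_amt)                  AS distinct_cnt,
  COUNT(txn_amt) - COUNT(DISTINCT txn_amt) AS duplicate_cnt,
  MIN(LENGTH(CAST(txn_amt AS STRING)))     AS min_len,
  MAX(LENGTH(CAST(txn_amt AS STRING)))     AS max_len
FROM txn_table;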

*Coding*:
- Though HIVE is easier to code in than PIG, which one is more suitable for
performing data quality and profiling?
*Open source tools:*
- Please suggest any open source tools built on Java or some other
technologies which can be integrated with Hadoop without any installation.



regards,
Rams


Pro and Cons of using HBase table as an external table in HIVE

2017-06-07 Thread Ramasubramanian Narayanan
Hi,

Can you please let us know the pros and cons of using an HBase table as an
external table in HIVE.

Will there be any performance degradation when using Hive over HBase instead
of a direct HIVE table?

The table that I am planning to keep in HBase will be a master table like
account or customer. I want to achieve a Slowly Changing Dimension. Please
throw some light on that too if you have done any such implementations.
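
For reference, a Hive-over-HBase mapping for such a master table would be
declared roughly like this (the HBase table name, column family and columns
below are assumptions, not an actual design):

CREATE EXTERNAL TABLE customer_hbase (
  cust_id   STRING,
  cust_name STRING,
  eff_dt    STRING
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  "hbase.columns.mapping" = ":key,info:cust_name,info:eff_dt"
)
TBLPROPERTIES ("hbase.table.name" = "customer");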

Thanks and Regards,
Rams


Is it ok to build an entire ETL/ELT data flow using HIVE queries?

2016-02-15 Thread Ramasubramanian
Hi,

Is it ok to build an entire ETL/ELT data flow using HIVE queries?

Data is stored in HIVE. We have transactional and reference data. We need to 
build a small warehouse. 

Need suggestion on alternatives too. 

Regards,
Rams

How to overwrite Key in RecordReader function

2014-06-12 Thread Ramasubramanian Narayanan
DA,

We are trying to write a UDF to read an XML which contains some unbounded
tags.

For repeated tags, a new row has to be generated.

Please let us know how to overwrite the default key with the new key in the
RecordReader function (where we do a for loop to emit multiple rows).

*Sample XML:*
<students>
  <student>
    <name>ABC</name>
    <Addresses>
      <Address>address1</Address>
      <Address>address2</Address>
    </Addresses>
  </student>
</students>

*Expected Output* (using a custom input format for the HIVE table and querying
through a view using xpath):

ABC | address1
ABC | address2
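
For comparison, the same row-per-Address expansion can also be sketched
purely at the query level with Hive's xpath UDFs and LATERAL VIEW explode
(this is not the RecordReader-level answer asked for above; xml_staging and
xml_doc are assumed names, with one whole XML document per row):

SELECT
  xpath_string(xml_doc, 'students/student/name') AS name,
  addr
FROM xml_staging
LATERAL VIEW explode(
  xpath(xml_doc, 'students/student/Addresses/Address/text()')
) a AS addr;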

Thanks and Regards,
Rams


White papers/Solution implemented to use HIVE/HBASE as OLTP

2014-06-08 Thread Ramasubramanian Narayanan
Dear All,

Request your help in providing links to white papers/solutions implemented
to use HIVE/HBASE as OLTP.

regards,
Rams


Storing and reading XML files in HIVE

2014-06-06 Thread Ramasubramanian Narayanan
Dear All,

Request your help and guidance on how to store and read XML data in HIVE.

While querying, it should look as if we have a text format file under
HIVE (it is fine if we use a view to parse the XML and present it).

Have gone through some sites but was not able to figure it out; a few
mention that we need to use some JARs to achieve it...
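
For illustration, one simple pattern (no extra JARs) is to keep each XML
document on a single line in a plain text table and expose the parsed fields
through a view built on Hive's xpath UDFs; the table, column and element
names below are invented:

CREATE TABLE xml_raw (xml_doc STRING)
STORED AS TEXTFILE;

CREATE VIEW emp_from_xml AS
SELECT
  xpath_string(xml_doc, 'employee/id')   AS emp_id,
  xpath_string(xml_doc, 'employee/name') AS emp_name
FROM xml_raw;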


Thanks in advance,
Rams


Need help in joining 2 tables

2013-07-26 Thread Ramasubramanian Narayanan
Hi,

Need to create a view in HIVE as per the below query (written in Oracle).
Need help with the equivalent query in HIVE to create the view.


*EMP_ADDRESS*

Emp_Id
Address1
Address2
Address3
Address4
City
State
Pin
Country
Alternate_Address1
Alternate_Address2
Alternate_Address3
Alternate_Address4
Alternate_City
Alternate_State
Alternate_Pin
Alternate_Country
row_create_date


*EMP_MASTER*

Emp_Id
Name
DOB
Department
Sex
FathersName
row_create_date


*View Query*

select
MAST.Emp_Id,
MAST.Name,
MAST.DOB,
ADDR.Address1,
ADDR.Address2,
ADDR.Address3,
ADDR.Address4,
ADDR.City,
ADDR.State,
ADDR.Pin,
ADDR.Country
from EMP_MASTER MAST, EMP_ADDRESS ADDR
where
MAST.row_create_date = (select max(row_create_date) from EMP_MASTER where
Emp_Id = MAST.Emp_Id)
and ADDR.row_create_date = (select max(row_create_date) from EMP_ADDRESS
where Emp_Id = ADDR.Emp_Id)
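
A possible HIVE rewrite of the above, as a sketch only: it assumes Hive
0.11+ for window functions, that the two tables are meant to be joined on
Emp_Id, and an arbitrary view name.

CREATE VIEW EMP_LATEST_VW AS
SELECT
  MAST.Emp_Id,
  MAST.Name,
  MAST.DOB,
  ADDR.Address1,
  ADDR.Address2,
  ADDR.Address3,
  ADDR.Address4,
  ADDR.City,
  ADDR.State,
  ADDR.Pin,
  ADDR.Country
FROM (
  SELECT m.*,
         ROW_NUMBER() OVER (PARTITION BY Emp_Id
                            ORDER BY row_create_date DESC) AS rn
  FROM EMP_MASTER m
) MAST
JOIN (
  SELECT a.*,
         ROW_NUMBER() OVER (PARTITION BY Emp_Id
                            ORDER BY row_create_date DESC) AS rn
  FROM EMP_ADDRESS a
) ADDR
  ON MAST.Emp_Id = ADDR.Emp_Id
WHERE MAST.rn = 1
  AND ADDR.rn = 1;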


regards,
Rams


Merging different HDFS file for HIVE

2013-07-26 Thread Ramasubramanian Narayanan
Hi,

Please help in providing a solution for the below problem... this scenario is
applicable in banking at least...

I have a HIVE table with the below structure...

Hive Table:
Field1
...
Field 10


For the above table, I will get the values for each feed in a different file.
You can imagine that these files belong to the same branch and will arrive at
any time interval. I have to load into the table only if I get all 3 files for
the same branch (assume that we have a common field in all the files to join
on).

*Feed file 1 :*
EMP ID
Field 1
Field 2
Field 6
Field 9

*Feed File2 :*
EMP ID
Field 5
Field 7
Field 10

*Feed File3 :*
EMP ID
Field 3
Field 4
Field 8

Now the question is: what is the best way to combine all these files into a
single file so that it can be placed under the HIVE structure?
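
For what it is worth, one way to do the combining in HIVE itself is to load
each feed file into its own staging table and join them on the common EMP ID
once all three have arrived; a sketch, with feed1/feed2/feed3 and hive_table
as placeholder names:

INSERT OVERWRITE TABLE hive_table
SELECT
  f1.field1, f1.field2,
  f3.field3, f3.field4,
  f2.field5,
  f1.field6,
  f2.field7,
  f3.field8,
  f1.field9,
  f2.field10
FROM feed1 f1
JOIN feed2 f2 ON f1.emp_id = f2.emp_id
JOIN feed3 f3 ON f1.emp_id = f3.emp_id;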

regards,
Rams


Regarding Backup of MySQL if we have that as Metadata store for HIVE

2013-01-02 Thread Ramasubramanian Narayanan
Hi,

If we have MySQL as the metadata store for HIVE,

1) Do we need to back it up every day?

2) Is there any automatic way in Hadoop to have a replication copy of MySQL
too?

3) Is there any way to rebuild the metadata if we lose the information in
MySQL?


regards,
Rams


Re: Join not working in HIVE

2012-12-17 Thread Ramasubramanian Narayanan
Hi,

But it is working fine in MySql...

mysql> select count(A1.id) as LVL, A2.id, A2.code, A2.short_name, A2.lft,
A2.rgt from product A1 join product A2 on (A1.lft = A2.lft and A1.rgt =
A2.rgt) group by A2.id, A2.code, A2.short_name, A2.lft, A2.rgt;
+-----+------+------+----------------------+------+------+
| LVL | id   | code | short_name           | lft  | rgt  |
+-----+------+------+----------------------+------+------+
|   1 |    1 | 4    | Treasury Service     |    1 |  100 |
|   2 |    2 | 2    | Root                 |    2 | 1000 |
|   2 |    3 | Z    | CKC                  | 1001 | 2000 |
|   2 |    4 | A    | Treasury Service     | 2001 | 3000 |
|   3 |    5 | OOAQ | CODE CASH MANAGEMENT |    3 |  100 |
|   3 |    6 | YP00 | JPMC Treasury        |  101 |  200 |
|   3 |    7 | 432  | Treasury Service     | 1002 | 1100 |
+-----+------+------+----------------------+------+------+


regards,
Rams

On Mon, Dec 17, 2012 at 6:33 PM, Nitin Pawar nitinpawar...@gmail.com wrote:

 select count(A1.id) as LVL, A2.id, A2.code, A2.short_name, A2.lft, A2.rgt
 from vprd A1 join vprd A2 on (A1.lft = A2.lft and A1.rgt = A2.rgt) where
 A2.rgt=1 and A2.lft=1 group by A2.id, A2.code, A2.short_name, A2.lft, A2.rgt


Re: Join not working in HIVE

2012-12-17 Thread Ramasubramanian Narayanan
Hi,

We are trying to build a tree structure in a table, hence we have the left
and right limits...
Can't use a where clause for that.

regards,
Rams

On Mon, Dec 17, 2012 at 6:53 PM, Nitin Pawar nitinpawar...@gmail.com wrote:

 hive is not mysql  :)


 On Mon, Dec 17, 2012 at 6:50 PM, Ramasubramanian Narayanan 
 ramasubramanian.naraya...@gmail.com wrote:

 Hi,

 But it is working fine in MySql...

 mysql> select count(A1.id) as LVL, A2.id, A2.code, A2.short_name, A2.lft,
 A2.rgt from product A1 join product A2 on (A1.lft = A2.lft and A1.rgt =
 A2.rgt) group by A2.id, A2.code, A2.short_name, A2.lft, A2.rgt;
 +-----+------+------+----------------------+------+------+
 | LVL | id   | code | short_name           | lft  | rgt  |
 +-----+------+------+----------------------+------+------+
 |   1 |    1 | 4    | Treasury Service     |    1 |  100 |
 |   2 |    2 | 2    | Root                 |    2 | 1000 |
 |   2 |    3 | Z    | CKC                  | 1001 | 2000 |
 |   2 |    4 | A    | Treasury Service     | 2001 | 3000 |
 |   3 |    5 | OOAQ | CODE CASH MANAGEMENT |    3 |  100 |
 |   3 |    6 | YP00 | JPMC Treasury        |  101 |  200 |
 |   3 |    7 | 432  | Treasury Service     | 1002 | 1100 |
 +-----+------+------+----------------------+------+------+


 regards,
 Rams

 On Mon, Dec 17, 2012 at 6:33 PM, Nitin Pawar nitinpawar...@gmail.com wrote:

 select count(A1.id) as LVL, A2.id, A2.code, A2.short_name, A2.lft,
 A2.rgt from vprd A1 join vprd A2 on (A1.lft = A2.lft and A1.rgt = A2.rgt)
 where A2.rgt=1 and A2.lft=1 group by A2.id, A2.code, A2.short_name, A2.lft,
 A2.rgt





 --
 Nitin Pawar



How to overwrite a file inside hive table folder

2012-09-15 Thread Ramasubramanian
Hi,

We could not overwrite a file inside a hive table folder. What is the command
to overwrite it?
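
One HIVE-level way to do a replacement is LOAD DATA with OVERWRITE, though
note that it replaces all existing files under the table's directory rather
than a single file (the path and table name below are placeholders):

LOAD DATA LOCAL INPATH '/tmp/new_data.txt' OVERWRITE INTO TABLE my_table;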

Regards,
Rams

How to update and delete a row in hive

2012-09-11 Thread Ramasubramanian
Hi,

What is the command to update/delete a row in hive? If that is not directly
possible, what is the workaround for it?
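
A common workaround at the time (before Hive had direct UPDATE/DELETE
support) is to rewrite the table with INSERT OVERWRITE; a sketch, where
my_table, id and col1 are placeholders:

-- "delete": rewrite the table keeping only the rows you want
INSERT OVERWRITE TABLE my_table
SELECT * FROM my_table WHERE id <> 42;

-- "update": rewrite with the changed value computed in the SELECT
INSERT OVERWRITE TABLE my_table
SELECT id,
       CASE WHEN id = 42 THEN 'new_value' ELSE col1 END AS col1
FROM my_table;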

Regards,
Rams