Will partition pruning happen on Hive views?
Hi,

I have a scenario. A Hive table "Table1" is created, partitioned by txn_date. A view "VIEW_TABLE1" is created on top of "Table1" with the query 'select * from table1'. If I query the view with a WHERE clause on txn_date, will partition pruning happen? Please shed some light on how it behaves in Hive.

regards,
Rams
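As a sketch of the scenario (table, column, and partition names are assumptions for illustration): Hive expands a view's definition into the outer query during planning, so a filter on the partition column applied to a simple pass-through view is pushed down to the base table, and pruning should still occur:

```sql
-- Hypothetical partitioned table and pass-through view
CREATE TABLE table1 (txn_id STRING, amount DOUBLE)
PARTITIONED BY (txn_date STRING);

CREATE VIEW view_table1 AS SELECT * FROM table1;

-- Verify pruning: the plan should reference only the matching partition
EXPLAIN EXTENDED
SELECT * FROM view_table1 WHERE txn_date = '2017-02-16';
```

Checking the EXPLAIN output for the list of scanned partitions is the most reliable way to confirm the behaviour on a given Hive version.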
Re: LEFT JOIN and WHERE CLAUSE - How to handle
Hi,

My bad, you are right. I intended to give the query as below (TXN left joining CURRENCY, not the other way around):

*Query:*
Select ROW_NUM, CCY_CD, TXN_DT, CNTRY_DESC
from TXN LEFT JOIN CURRENCY on (CURRENCY.CCY_CD = TXN.CCY_CD)
where TXN_DT between EFF_ST_DT and EFF_END_DT;

Thanks and regards,
Rams

On Wed, Aug 23, 2017 at 9:17 PM, Furcy Pin <furcy@flaminem.com> wrote:
> Oh, in that case...
>
> (First, I notice that you say you want all records in TXN, but in the query you give, you perform your join the other way round.)
>
> This is a typical use case that SQL is not very good at handling...
>
> The solutions I see are:
>
> - Use RANK as you suggested. Note that Hive is smart enough to optimize it correctly: if you use a filter WHERE rank() < K, it will take the K first values in RAM and ignore the rest, rather than ranking everything and filtering afterwards.
>
> - Perform a GROUP BY TXN.ROW_NUM so you have only one line per transaction, and aggregate the rest the way you like (max, max_by or arbitrary). To handle the missing EUR line, just add (OR CURRENCY.CCY_CD IS NULL) to your where clause (and reverse the left join).
>
> - Most complicated to implement, but simplest to use afterwards: replace your CURRENCY table with a table that has only one row per currency, store the multiple start/end dates in an array of structs, and use a UDF that, given such an array and a date, returns the correct description. It only works if you don't have too many descriptions per currency, though. This is where Spark comes in handy: it is simpler to define UDFs with it.
>
> On Wed, Aug 23, 2017 at 5:21 PM, Ramasubramanian Narayanan <ramasubramanian.naraya...@gmail.com> wrote:
>> Hi,
>>
>> TXN.TXN_DT should be between CURRENCY.EFF_ST_DT and CURRENCY.EFF_END_DT. It needs to be equated.
>>
>> regards,
>> Rams
>>
>> On Wed, Aug 23, 2017 at 7:55 PM, Furcy Pin <furcy@flaminem.com> wrote:
>>> I would suggest using a subquery:
>>>
>>> WITH unique_currency AS (
>>>   SELECT CCY_CD, MAX(CNTRY_DESC) as CNTRY_DESC
>>>   FROM CURRENCY
>>>   GROUP BY CCY_CD
>>> )
>>>
>>> and then perform your left join on it.
>>>
>>> Some SQL engines (e.g. Presto) have aggregation functions like arbitrary(col) that take any value and are a little less costly than a max. Sometimes they also have functions like max_by(x, y) <https://prestodb.io/docs/current/functions/aggregate.html#max_by> that would allow you to get the most recent description.
>>>
>>> It is a shame that this function is not included in Hive yet, but you can still find some UDAF implementations on GitHub <https://github.com/dataiku/dataiku-hive-udf#first_of_group-last_of_group>.
>>>
>>> On Wed, Aug 23, 2017 at 3:37 PM, Ramasubramanian Narayanan <ramasubramanian.naraya...@gmail.com> wrote:
>>>> Hi,
>>>>
>>>> Need your suggestion on the below.
>>>>
>>>> Have two tables, TXN and CURRENCY. Need all records in TXN, hence doing a left join with CURRENCY.
>>>>
>>>> *Two problems:*
>>>> 1. The CURRENCY table may contain duplicate records, so they need to be handled through RANK or some other function.
>>>> 2. If we equate TXN_DT between EFF_ST_DT and EFF_END_DT in the where clause, then we will lose the EUR records, which should not happen.
>>>>
>>>> Please suggest a solution to overcome both problems. For duplicated records it is fine if we select any of the CNTRY_DESC.
>>>>
>>>> *Table : CURRENCY*
>>>> | CCY_CD | CNTRY_DESC       | EFF_ST_DT | EFF_END_DT |
>>>> | INR    | Indian Rupee     | 1-Jan-15  | 20-Feb-16  |
>>>> | INR    | Indian Rupee New | 21-Feb-16 | 20-Feb-99  |
>>>>
>>>> *Table : TXN*
>>>> | ROW_NUM | CCY_CD | TXN_DT    |
>>>> | 1       | INR    | 16-Feb-17 |
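The RANK-based option mentioned in the thread could be sketched roughly as follows (a sketch only, assuming the table and column names from this thread and proper DATE typing of the date columns; `row_number` is used instead of `rank` to guarantee exactly one row per transaction):

```sql
SELECT ROW_NUM, CCY_CD, TXN_DT, CNTRY_DESC
FROM (
  SELECT t.ROW_NUM, t.CCY_CD, t.TXN_DT, c.CNTRY_DESC,
         row_number() OVER (PARTITION BY t.ROW_NUM
                            ORDER BY c.EFF_ST_DT DESC) AS rn
  FROM TXN t
  LEFT JOIN CURRENCY c ON (c.CCY_CD = t.CCY_CD)
  -- keep unmatched rows (e.g. EUR) despite the date filter
  WHERE c.CCY_CD IS NULL
     OR t.TXN_DT BETWEEN c.EFF_ST_DT AND c.EFF_END_DT
) ranked
WHERE rn = 1;
```

Ordering by EFF_ST_DT DESC picks the most recently effective description when several match, which satisfies the "any of the CNTRY_DESC is fine" requirement.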
Re: LEFT JOIN and WHERE CLAUSE - How to handle
Hi,

TXN.TXN_DT should be between CURRENCY.EFF_ST_DT and CURRENCY.EFF_END_DT. It needs to be equated.

regards,
Rams

On Wed, Aug 23, 2017 at 7:55 PM, Furcy Pin <furcy@flaminem.com> wrote:
> I would suggest using a subquery:
>
> WITH unique_currency AS (
>   SELECT CCY_CD, MAX(CNTRY_DESC) as CNTRY_DESC
>   FROM CURRENCY
>   GROUP BY CCY_CD
> )
>
> and then perform your left join on it.
>
> Some SQL engines (e.g. Presto) have aggregation functions like arbitrary(col) that take any value and are a little less costly than a max. Sometimes they also have functions like max_by(x, y) <https://prestodb.io/docs/current/functions/aggregate.html#max_by> that would allow you to get the most recent description.
>
> It is a shame that this function is not included in Hive yet, but you can still find some UDAF implementations on GitHub <https://github.com/dataiku/dataiku-hive-udf#first_of_group-last_of_group>.
>
> On Wed, Aug 23, 2017 at 3:37 PM, Ramasubramanian Narayanan <ramasubramanian.naraya...@gmail.com> wrote:
>> Hi,
>>
>> Need your suggestion on the below.
>>
>> Have two tables, TXN and CURRENCY. Need all records in TXN, hence doing a left join with CURRENCY.
>>
>> *Two problems:*
>> 1. The CURRENCY table may contain duplicate records, so they need to be handled through RANK or some other function.
>> 2. If we equate TXN_DT between EFF_ST_DT and EFF_END_DT in the where clause, then we will lose the EUR records, which should not happen.
>>
>> Please suggest a solution to overcome both problems. For duplicated records it is fine if we select any of the CNTRY_DESC.
>>
>> *Table : CURRENCY*
>> | CCY_CD | CNTRY_DESC              | EFF_ST_DT | EFF_END_DT |
>> | INR    | Indian Rupee            | 1-Jan-15  | 20-Feb-16  |
>> | INR    | Indian Rupee New        | 21-Feb-16 | 20-Feb-99  |
>> | USD    | US Dollar               | 1-Jan-15  | 20-Feb-16  |
>> | SGD    | Singapore Dollar        | 1-Jan-15  | 20-Feb-17  |
>> | SGD    | Singapore Dollar New    | 15-Feb-17 | 20-Feb-99  |
>> | SGD    | Singapore Dollar Latest | 16-Feb-17 | 16-Feb-17  |
>>
>> *Table : TXN*
>> | ROW_NUM | CCY_CD | TXN_DT    |
>> | 1       | INR    | 16-Feb-17 |
>> | 2       | USD    | 16-Feb-17 |
>> | 3       | SGD    | 16-Feb-17 |
>> | 4       | EUR    | 16-Feb-17 |
>>
>> *Expected Output*
>> | ROW_NUM | CCY_CD | TXN_DT    | CNTRY_DESC                                                      |
>> | 1       | INR    | 16-Feb-17 | Indian Rupee                                                    |
>> | 2       | USD    | 16-Feb-17 | US Dollar                                                       |
>> | 3       | SGD    | 16-Feb-17 | Singapore Dollar Latest (any of the three valid values is fine) |
>> | 4       | EUR    | 16-Feb-17 |                                                                 |
>>
>> *Query:*
>> Select ROW_NUM, CCY_CD, TXN_DT, CNTRY_DESC
>> from CURRENCY LEFT JOIN TXN on (CURRENCY.CCY_CD = TXN.CCY_CD)
>> where TXN_DT between EFF_ST_DT and EFF_END_DT;
>>
>> This query will drop the "EUR" record because of the where clause used. It cannot be handled with a CASE statement instead of the where clause, as we have more than one record for 'SGD' when TXN_DT is 16-Feb.
>>
>> regards,
>> Rams
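Completing the `unique_currency` suggestion from this message (a sketch, assuming the thread's table and column names): the CTE simply ignores the effective-date ranges, which is acceptable here since any CNTRY_DESC will do, and the left join then keeps the EUR row:

```sql
WITH unique_currency AS (
  SELECT CCY_CD, MAX(CNTRY_DESC) AS CNTRY_DESC
  FROM CURRENCY
  GROUP BY CCY_CD
)
SELECT t.ROW_NUM, t.CCY_CD, t.TXN_DT, u.CNTRY_DESC
FROM TXN t
LEFT JOIN unique_currency u ON (u.CCY_CD = t.CCY_CD);
```

Because unique_currency has exactly one row per CCY_CD, the join can no longer multiply SGD rows, and EUR simply gets a NULL description.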
LEFT JOIN and WHERE CLAUSE - How to handle
Hi,

Need your suggestion on the below.

Have two tables, TXN and CURRENCY. Need all records in TXN, hence doing a left join with CURRENCY.

*Two problems:*
1. The CURRENCY table may contain duplicate records, so they need to be handled through RANK or some other function.
2. If we equate TXN_DT between EFF_ST_DT and EFF_END_DT in the where clause, then we will lose the EUR records, which should not happen.

Please suggest a solution to overcome both problems. For duplicated records it is fine if we select any of the CNTRY_DESC.

*Table : CURRENCY*
| CCY_CD | CNTRY_DESC              | EFF_ST_DT | EFF_END_DT |
| INR    | Indian Rupee            | 1-Jan-15  | 20-Feb-16  |
| INR    | Indian Rupee New        | 21-Feb-16 | 20-Feb-99  |
| USD    | US Dollar               | 1-Jan-15  | 20-Feb-16  |
| SGD    | Singapore Dollar        | 1-Jan-15  | 20-Feb-17  |
| SGD    | Singapore Dollar New    | 15-Feb-17 | 20-Feb-99  |
| SGD    | Singapore Dollar Latest | 16-Feb-17 | 16-Feb-17  |

*Table : TXN*
| ROW_NUM | CCY_CD | TXN_DT    |
| 1       | INR    | 16-Feb-17 |
| 2       | USD    | 16-Feb-17 |
| 3       | SGD    | 16-Feb-17 |
| 4       | EUR    | 16-Feb-17 |

*Expected Output*
| ROW_NUM | CCY_CD | TXN_DT    | CNTRY_DESC                                                      |
| 1       | INR    | 16-Feb-17 | Indian Rupee                                                    |
| 2       | USD    | 16-Feb-17 | US Dollar                                                       |
| 3       | SGD    | 16-Feb-17 | Singapore Dollar Latest (any of the three valid values is fine) |
| 4       | EUR    | 16-Feb-17 |                                                                 |

*Query:*
Select ROW_NUM, CCY_CD, TXN_DT, CNTRY_DESC
from CURRENCY LEFT JOIN TXN on (CURRENCY.CCY_CD = TXN.CCY_CD)
where TXN_DT between EFF_ST_DT and EFF_END_DT;

This query will drop the "EUR" record because of the where clause used. It cannot be handled with a CASE statement instead of the where clause, as we have more than one record for 'SGD' when TXN_DT is 16-Feb.

regards,
Rams
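One further option worth noting (a sketch using this thread's table names): in a LEFT JOIN, predicates placed in the ON clause restrict only the matching, not the left-side rows, so moving the date condition out of WHERE and into ON keeps the EUR row. Two caveats: older Hive versions rejected non-equality conditions in JOIN ... ON, so this may only work on recent releases, and SGD would still produce multiple rows that need deduplication:

```sql
SELECT t.ROW_NUM, t.CCY_CD, t.TXN_DT, c.CNTRY_DESC
FROM TXN t
LEFT JOIN CURRENCY c
  ON (c.CCY_CD = t.CCY_CD
      AND t.TXN_DT BETWEEN c.EFF_ST_DT AND c.EFF_END_DT);
```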
HIVE or PIG - For building DQ framework
Hi All,

Pls help me on the below.

*Use Case:*
Trying to develop a framework to do data profiling and data quality checks. Data is stored in a Hive table in RC format. No joins; only considering DQ checks that can be done on a single table.

*Need suggestion:*
Thinking of using either Pig or Hive for performing data quality and profiling. Need your suggestion on the same. Have listed a few high-level points which came to my mind.

*Performance:*
- Will Hive perform better, or Pig? In Pig we can load the data set into a variable and perform many operations on that data set. Will that improve performance?
- In Hive, we can have almost 70% of the checks in the same query: null, count, distinct count, duplicate count (total count - distinct count), length, etc. Even in this case, will Pig perform better, or Hive?

*Coding:*
- Though Hive is easier to code than Pig, which one is most suitable for performing data quality and profiling?

*Open source tools:*
- Pls suggest any open source tools built on Java or other technologies which can be integrated with Hadoop without any installation.

regards,
Rams
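To illustrate the "many checks in one query" point above, a single-table profiling pass in Hive can be sketched like this (table and column names are made up for illustration):

```sql
SELECT
  COUNT(*)                            AS total_cnt,
  COUNT(*) - COUNT(col1)              AS col1_null_cnt,
  COUNT(DISTINCT col1)                AS col1_distinct_cnt,
  COUNT(col1) - COUNT(DISTINCT col1)  AS col1_duplicate_cnt,
  MIN(LENGTH(col1))                   AS col1_min_len,
  MAX(LENGTH(col1))                   AS col1_max_len
FROM my_table;
```

All of these metrics come out of a single scan of the table, which is usually the deciding performance factor for DQ jobs regardless of whether Hive or Pig drives the scan.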
Pros and Cons of using an HBase table as an external table in HIVE
Hi,

Can you please let us know the pros and cons of using an HBase table as an external table in Hive? Will there be any performance degradation when using Hive over HBase instead of a direct Hive table?

The tables I am planning to keep in HBase are master tables like account and customer. Wanting to achieve Slowly Changing Dimensions. Please throw some light on that too if you have done any such implementations.

Thanks and Regards,
Rams
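A minimal sketch of mapping an HBase master table into Hive via the HBase storage handler (the table, column family, and column names here are assumptions):

```sql
CREATE EXTERNAL TABLE account_master (
  acct_id   STRING,
  cust_name STRING,
  status    STRING
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  "hbase.columns.mapping" = ":key,cf:cust_name,cf:status"
)
TBLPROPERTIES ("hbase.table.name" = "account_master");
```

On the performance question: full scans through the storage handler go via the HBase region servers rather than reading HDFS files directly, so analytical queries are generally slower than on a native Hive table, while row-key lookups and updates (useful for SCD-style master data) are where HBase helps.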
Is it ok to build an entire ETL/ELT data flow using HIVE queries?
Hi, Is it ok to build an entire ETL/ELT data flow using HIVE queries? Data is stored in HIVE. We have transactional and reference data. We need to build a small warehouse. Need suggestion on alternatives too. Regards, Rams
How to overwrite Key in RecordReader function
Dear All,

We are trying to write a UDF to read an XML which contains some unbounded tags. For repeated tags, a new row has to be generated. Please let us know how to overwrite the default key with the new key in the RecordReader function (where we do a for loop to make multiple rows).

*Sample XML:*
<students>
  <student>
    <name>ABC</name>
    <Addresses>
      <Address>address1</Address>
      <Address>address2</Address>
    </Addresses>
  </student>
</students>

*Expected Output* (using a custom input format in a Hive table and querying through a view using xpath):
ABC | address1
ABC | address2

Thanks and Regards,
Rams
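If the goal is just one output row per repeated tag, a Hive-side alternative to a custom RecordReader (a sketch, assuming a hypothetical table `raw_xml` that holds each XML document in a string column `xml_col`) is to combine the built-in xpath UDFs with LATERAL VIEW explode:

```sql
SELECT name, address
FROM (
  SELECT xpath_string(xml_col, 'students/student/name') AS name,
         xpath(xml_col, 'students/student/Addresses/Address/text()') AS addresses
  FROM raw_xml
) parsed
LATERAL VIEW explode(addresses) a AS address;
```

Here `xpath()` returns an array of the repeated Address values, and explode turns each array element into its own row alongside the name.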
White papers/Solution implemented to use HIVE/HBASE as OLTP
Dear All, Request your help in providing link to the White papers/Solution implemented to use HIVE/HBASE as OLTP. regards, Rams
Storing and reading XML files in HIVE
Dear All,

Request your help on how to store and read XML data in Hive. While querying, it should look as if we have a text-format file in Hive (it is fine if we use a view to parse the XML for display). Have gone through some sites but am not able to figure it out correctly; a few mention that we need to use some JARs to achieve it.

Thanks in advance,
Rams
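One common pattern (a sketch; the table, column, and xpath expressions are assumptions, and it presumes each XML document fits on a single line) is to load the raw XML into a plain text table and expose parsed fields through a view, which gives exactly the "looks like a text table" behaviour described above:

```sql
CREATE TABLE xml_raw (xml_doc STRING)
STORED AS TEXTFILE;

CREATE VIEW xml_view AS
SELECT xpath_string(xml_doc, 'record/id')   AS id,
       xpath_string(xml_doc, 'record/name') AS name
FROM xml_raw;
```

For multi-line XML documents, a custom input format or an XML SerDe JAR (as some of those sites suggest) is indeed needed to present each document as one row.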
Need help in joining 2 tables
Hi,

Need to create a view in Hive as per the below query (written for Oracle). Need help with the equivalent Hive query to create the view.

*EMP_ADDRESS*
Emp_Id, Address1, Address2, Address3, Address4, City, State, Pin, Country, Alternate_Address1, Alternate_Address2, Alternate_Address3, Alternate_Address4, Alternate_City, Alternate_State, Alternate_Pin, Alternate_Country, row_create_date

*EMP_MASTER*
Emp_Id, Name, DOB, Department, Sex, FathersName, row_create_date

*View Query*
select MAST.Emp_Id, MAST.Name, MAST.DOB,
       ADDR.Address1, ADDR.Address2, ADDR.Address3, ADDR.Address4,
       ADDR.City, ADDR.State, ADDR.Pin, ADDR.Country
from EMP_MASTER MAST, EMP_ADDRESS ADDR
where MAST.row_create_date = (select max(row_create_date) from EMP_MASTER where Emp_Id = MAST.Emp_Id)
  and ADDR.row_create_date = (select max(row_create_date) from EMP_ADDRESS where Emp_Id = ADDR.Emp_Id)

regards,
Rams
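Since older Hive does not support correlated subqueries in the WHERE clause, a hedged rewrite (a sketch only; note it also adds the MAST.Emp_Id = ADDR.Emp_Id equality that the Oracle join presumably intends) uses window functions to pick the latest row per Emp_Id in each table:

```sql
CREATE VIEW emp_latest AS
SELECT MAST.Emp_Id, MAST.Name, MAST.DOB,
       ADDR.Address1, ADDR.Address2, ADDR.Address3, ADDR.Address4,
       ADDR.City, ADDR.State, ADDR.Pin, ADDR.Country
FROM (
  SELECT m.*, row_number() OVER (PARTITION BY Emp_Id
                                 ORDER BY row_create_date DESC) AS rn
  FROM EMP_MASTER m
) MAST
JOIN (
  SELECT a.*, row_number() OVER (PARTITION BY Emp_Id
                                 ORDER BY row_create_date DESC) AS rn
  FROM EMP_ADDRESS a
) ADDR ON (MAST.Emp_Id = ADDR.Emp_Id)
WHERE MAST.rn = 1 AND ADDR.rn = 1;
```

row_number() ties are broken arbitrarily, which matches the max(row_create_date) semantics as long as there is one row per date per employee.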
Merging different HDFS file for HIVE
Hi,

Please help in providing a solution for the below problem; this scenario is applicable in banking at least.

I have a Hive table with the below structure:

Hive table: Field1 ... Field10

For the above table, I will get the values for each feed in a different file. You can imagine that these files belong to the same branch and will arrive at any time interval. I have to load into the table only if I get all 3 files for the same branch (assume that we have a common field in all the files to join on).

*Feed file 1:* EMP ID, Field1, Field2, Field6, Field9
*Feed file 2:* EMP ID, Field5, Field7, Field10
*Feed file 3:* EMP ID, Field3, Field4, Field8

Now the question is: what is the best way to merge all these files into a single file so that it can be placed under the Hive structure?

regards,
Rams
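Rather than merging the physical files, one sketch (assuming hypothetical staging tables feed1/feed2/feed3, one per feed file, with EMP ID as the common key) is to load each feed into its own staging table and let an inner join produce the combined rows; the inner join also enforces "all 3 feeds present" for each key:

```sql
INSERT OVERWRITE TABLE hive_table
SELECT f1.emp_id,
       f1.field1, f1.field2, f3.field3, f3.field4, f2.field5,
       f1.field6, f2.field7, f3.field8, f1.field9, f2.field10
FROM feed1 f1
JOIN feed2 f2 ON (f1.emp_id = f2.emp_id)
JOIN feed3 f3 ON (f1.emp_id = f3.emp_id);
```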
Regarding Backup of MySQL if we have that as Metadata store for HIVE
Hi,

If we have MySQL as the metadata store for Hive:
1) Do we need to back it up every day?
2) Is there any automatic way in Hadoop to have a replication copy for MySQL too?
3) Is there any way to rebuild the metadata if we lose the information in MySQL?

regards,
Rams
Re: Join not working in HIVE
Hi,

But it is working fine in MySQL...

mysql> select count(A1.id) as LVL, A2.id, A2.code, A2.short_name, A2.lft, A2.rgt
       from product A1 join product A2 on (A1.lft = A2.lft and A1.rgt = A2.rgt)
       group by A2.id, A2.code, A2.short_name, A2.lft, A2.rgt;

| LVL | id | code | short_name           | lft  | rgt  |
|   1 |  1 | 4    | Treasury Service     |    1 |  100 |
|   2 |  2 | 2    | Root                 |    2 | 1000 |
|   2 |  3 | Z    | CKC                  | 1001 | 2000 |
|   2 |  4 | A    | Treasury Service     | 2001 | 3000 |
|   3 |  5 | OOAQ | CODE CASH MANAGEMENT |    3 |  100 |
|   3 |  6 | YP00 | JPMC Treasury        |  101 |  200 |
|   3 |  7 | 432  | Treasury Service     | 1002 | 1100 |

regards,
Rams

On Mon, Dec 17, 2012 at 6:33 PM, Nitin Pawar <nitinpawar...@gmail.com> wrote:
> select count(A1.id) as LVL, A2.id, A2.code, A2.short_name, A2.lft, A2.rgt
> from vprd A1 join vprd A2 on (A1.lft = A2.lft and A1.rgt = A2.rgt)
> where A2.rgt=1 and A2.lft=1
> group by A2.id, A2.code, A2.short_name, A2.lft, A2.rgt
Re: Join not working in HIVE
Hi,

We are trying to build a tree structure in a table, hence we have the left and right limits. Can't use a where clause like that.

regards,
Rams

On Mon, Dec 17, 2012 at 6:53 PM, Nitin Pawar <nitinpawar...@gmail.com> wrote:
> hive is not mysql :)
>
> On Mon, Dec 17, 2012 at 6:50 PM, Ramasubramanian Narayanan <ramasubramanian.naraya...@gmail.com> wrote:
>> Hi,
>>
>> But it is working fine in MySQL...
>>
>> mysql> select count(A1.id) as LVL, A2.id, A2.code, A2.short_name, A2.lft, A2.rgt
>>        from product A1 join product A2 on (A1.lft = A2.lft and A1.rgt = A2.rgt)
>>        group by A2.id, A2.code, A2.short_name, A2.lft, A2.rgt;
>>
>> | LVL | id | code | short_name           | lft  | rgt  |
>> |   1 |  1 | 4    | Treasury Service     |    1 |  100 |
>> |   2 |  2 | 2    | Root                 |    2 | 1000 |
>> |   2 |  3 | Z    | CKC                  | 1001 | 2000 |
>> |   2 |  4 | A    | Treasury Service     | 2001 | 3000 |
>> |   3 |  5 | OOAQ | CODE CASH MANAGEMENT |    3 |  100 |
>> |   3 |  6 | YP00 | JPMC Treasury        |  101 |  200 |
>> |   3 |  7 | 432  | Treasury Service     | 1002 | 1100 |
>>
>> regards,
>> Rams
>>
>> On Mon, Dec 17, 2012 at 6:33 PM, Nitin Pawar <nitinpawar...@gmail.com> wrote:
>>> select count(A1.id) as LVL, A2.id, A2.code, A2.short_name, A2.lft, A2.rgt
>>> from vprd A1 join vprd A2 on (A1.lft = A2.lft and A1.rgt = A2.rgt)
>>> where A2.rgt=1 and A2.lft=1
>>> group by A2.id, A2.code, A2.short_name, A2.lft, A2.rgt
>>
>
> --
> Nitin Pawar
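If the tree query actually needs the nested-set range conditions (A1.lft <= A2.lft and so on), Hive at the time only accepted equality predicates in JOIN ... ON. A hedged workaround (a sketch, assuming that the nested-set depth query is what was intended) is a cross join with the range predicate moved to WHERE, which is expensive but functional:

```sql
SELECT COUNT(A1.id) AS LVL,
       A2.id, A2.code, A2.short_name, A2.lft, A2.rgt
FROM product A1
CROSS JOIN product A2
WHERE A1.lft <= A2.lft AND A1.rgt >= A2.rgt
GROUP BY A2.id, A2.code, A2.short_name, A2.lft, A2.rgt;
```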
How to overwrite a file inside hive table folder
Hi,

We could not overwrite a file inside a Hive table folder. What is the command to overwrite it?

Regards,
Rams
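One sketch (the table name and path are placeholders): Hive itself does not replace individual files, but LOAD DATA ... OVERWRITE replaces the entire contents of the table (or partition) directory:

```sql
-- Replaces all existing files under my_table's directory
LOAD DATA INPATH '/user/hive/staging/newdata.txt'
OVERWRITE INTO TABLE my_table;
```

To swap just one file while leaving the others in place, the usual approach is HDFS-level commands (`hadoop fs`) directly on the table directory rather than HiveQL.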
How to update and delete a row in hive
Hi,

What is the command to update/delete a row in Hive? If it is not directly possible, what is the workaround?

Regards,
Rams
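A sketch of both routes (my_table and the predicates are placeholders): Hive 0.14+ supports UPDATE and DELETE, but only on transactional (ACID) tables stored as ORC with bucketing; on older versions the standard workaround is rewriting the table without the unwanted rows:

```sql
-- Hive 0.14+ on an ACID transactional table
UPDATE my_table SET status = 'CLOSED' WHERE id = 42;
DELETE FROM my_table WHERE id = 42;

-- Older Hive: rewrite the table (or the affected partition) minus the row
INSERT OVERWRITE TABLE my_table
SELECT * FROM my_table WHERE id <> 42;
```

The INSERT OVERWRITE route rewrites every surviving row, so for large tables it is worth partitioning the data so only the affected partition needs rewriting.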