Hi Guys,

I am trying to migrate one of the tables from Postgres database to Hive.

Schema of Postgres:
db=> \d table
                Table "public.table"
 Column  |        Type        | Collation | Nullable | Default 
---------+--------------------+-----------+----------+---------
 cell_id | integer            |           |          | 
 yr2010  | double precision[] |           |          | 

The yr2010 column holds 365 days of data per row, e.g.:
select yr2010[1:365] from table where cell_id=502070;

{0,0,0,0,0.07,0.05,0,0,0.04,0,0,0.02,0.09,0.06,0,0,0,0,0,0,0,0,0,0,0.06,0.01,0.4,0.03,0.01,0,0,0,0.01,0,0,0,0,0,0,0.09,1.83,1.76,0,0,0,0,0,0.02,0.02,0,0.01,0.08,0,0,0,0,0.89,0,0,0,0,0,0,0,0,0,0.47,0.07,0.43,0,0,0,0,0,0,0,0.45,0,0,0,0,0,0.08,0,0,0,0.58,0,0,0.4,0,0.78,0,0,1.69,0.09,0,0,0.46,0,0,0.38,0.6,0,0,0,0,0.18,0.21,0.1,0.14,0,0,0,0,0,0.78,0.11,0.57,0.75,0.14,0,0,0,0,0.26,0.77,0.04,0,0,0,0.1,2.05,0,1.26,0,0,0,0,0,0,0,0,0,0.52,0.01,0.65,0.03,0.56,0,0,0,0.94,0.59,0,0,0,0.01,0.08,0.58,0.48,1.37,0,0.26,0,0,0.31,0,0.47,0.72,0,1.09,0.03,0,0,0.02,0,0,0,0,0,0,0,0,1.33,0,0,0.19,0.05,0,0,0.74,0,0,0.14,0.11,0.01,0,0,0.13,0,0,0.02,0,0.76,0,0,0,0.51,0,0,0,0.08,0,0,0.83,0,0,0.07,0,0,0,0.19,0,0,0,0,0.21,0,0,0.69,0,0.14,0,0,0,0,0,0,1.29,0,0,0,0,0,0,0,0,0,0,0.04,0,0,0,0,0,0,0,0,0,0,0,0.01,0,0.02,0.02,0.4,0.04,0.54,0.05,0,0,0,0.45,0,0,0,0,0.48,0,2.21,0.23,0,0,0,0.2,0.23,0.01,0.05,0,0,0,0,0,1.38,0.09,0.01,0,0,0.53,0,0.27,0.67,0,0,0,0.02,0,0,0,0,0,0,0,0,0,0,0,0.66,0.84,0.44,0,0.1,0,0,0,0,0.26,0.08,0,0,0,0.05,0,0,0.97,0,0,0,0.02,0,0.96,0.07,0,0,0.84,0.02,0,0,0,0,0,0.04,0.01,0.02,0.09,0.12,0.28,0.25,0.08,0.16,0,0,0.09,0}
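If the goal is to load all 365 days at once instead of one import per day, the array could be flattened on the Postgres side first. A sketch (requires PostgreSQL 9.4+ for `WITH ORDINALITY`; the aliases `val`, `day_idx` and the start date are illustrative):

```sql
-- Flatten the 365-element array into one row per (cell_id, day).
-- day_idx runs 1..365, so it can be turned into a date string
-- suitable for a time_stamp partition column.
SELECT t.cell_id,
       u.val AS data,
       (DATE '2010-01-01' + (u.day_idx - 1))::text AS time_stamp
FROM   "table" t,
       unnest(t.yr2010) WITH ORDINALITY AS u(val, day_idx);
```

This produces the long (cell_id, data, time_stamp) shape that the Hive table below expects, in a single query.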


My Sqoop queries are:
sqoop import --connect 'jdbc:postgresql://path-dev/db?ssl=true&sslfactory=org.postgresql.ssl.NonValidatingFactory' \
  --query "select cell_id, yr2010[1] as data from table where \$CONDITIONS" \
  --username 'name' -P \
  --hcatalog-database 'db_in_hive' --hcatalog-table 'table_in_hive' \
  --hive-partition-key 'time_stamp' --hive-partition-value '2010-01-01' -m 1

sqoop import --connect 'jdbc:postgresql://path-dev/db?ssl=true&sslfactory=org.postgresql.ssl.NonValidatingFactory' \
  --query "select cell_id, yr2010[2] as data from table where \$CONDITIONS" \
  --username 'name' -P \
  --hcatalog-database 'db_in_hive' --hcatalog-table 'table_in_hive' \
  --hive-partition-key 'time_stamp' --hive-partition-value '2010-01-02' -m 1

My Hive table is:
create table table_in_hive(cell_id int, data double) partitioned by (time_stamp 
string) stored as orc;
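If the flattened data from a single Sqoop import lands in a staging table first, one dynamic-partition insert can populate all 365 partitions in one statement. A sketch, where `staging_table` and its column layout are assumed names:

```sql
-- Dynamic partitioning must be enabled in nonstrict mode,
-- since no static partition value is supplied.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- The partition column (time_stamp) must come last in the select list.
INSERT OVERWRITE TABLE table_in_hive PARTITION (time_stamp)
SELECT cell_id, data, time_stamp
FROM   staging_table;
```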

My use case after migrating the data to Hive is:
to update any data by cell_id,
and
to fetch a range of days, like:
select data from table_in_hive where cell_id=411298 and time_stamp between '2006-01-01' and '2006-01-03';
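One caveat on the update use case: a plain ORC table as defined above is not updatable; Hive's `UPDATE` requires an ACID (transactional) table. A hedged sketch, noting that pre-3.0 Hive versions also require bucketing (the bucket count 8 and the value 0.42 are arbitrary):

```sql
-- Transactional table definition needed for UPDATE/DELETE support.
CREATE TABLE table_in_hive (cell_id INT, data DOUBLE)
PARTITIONED BY (time_stamp STRING)
CLUSTERED BY (cell_id) INTO 8 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');

-- With that in place, per-cell updates become possible:
UPDATE table_in_hive
SET    data = 0.42
WHERE  cell_id = 411298 AND time_stamp = '2006-01-01';
```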

Here, the use case is satisfied, but each import writes the cell_id rows again
under a new partition, so the row count per cell_id grows with every import.
Like:
select distinct(cell_id),time_stamp from table_in_hive where cell_id=411298;

411298  2006-01-01
411298  2006-01-02
411298  2006-01-03
411298  2006-01-04
411298  2006-01-05

Here, in the output, the cell_id count should be 1, but five imports produced
five rows.

So, is there a way in Hive to select a particular column and import only into
that column, matching on the existing cell_id records?
Or how should I map a cell_id that is already inserted to the same cell_id in
the next import?

Any help/suggestion is appreciated.

Thanks
Sowjanya

