Thanks a lot Venkat for the info.

I will keep you posted on the Hortonworks forum.

-Nirmal

________________________________
From: Venkat Ranganathan <[email protected]>
Sent: Tuesday, October 7, 2014 12:25 PM
To: [email protected]
Subject: Re: Sqoop Incremental Import with lastmodified mode giving Duplicate 
rows for updated rows

Those can be ignored (it looks like they are environment variables that are 
set elsewhere to control the job). In any case, I have followed up on the 
Hortonworks forum.

Venkat

On Mon, Oct 6, 2014 at 9:53 PM, Nirmal Kumar <[email protected]> wrote:
Thanks Venkat,

I have posted this in the Hortonworks Forums as well:
http://hortonworks.com/community/forums/topic/sqoop-incremental-import-lastmodified-giving-duplicate-rows-for-updated-rows/

I tried the HDP documentation for incremental import, but the information 
given there is very limited.

The text below is all that bk_HortonworksConnectorForTeradata.pdf gives for 
incremental import.

Incremental Import
Teradata incremental import emulates the check-column and last-value options. 
Here is an example for a table that has 'hire_date' as the date column to 
check against and 'name' as the column that can be used to partition the data.

export USER=dbc
export PASS=dbc
export HOST=<dbhost>
export DB=<dbuser>
export TABLE=<dbtable>
export JDBCURL=jdbc:teradata://$HOST/DATABASE=$DB
export IMPORT_DIR=<hdfs-dir to import>
export VERBOSE=--verbose
export MANAGER=org.apache.sqoop.teradata.TeradataConnManager
export CONN_MANAGER="--connection-manager $MANAGER"
export CONNECT="--connect $JDBCURL"
MAPPERS="--num-mappers 4"
DATE="'1990-12-31'"
FORMAT="'yyyy-mm-dd'"
LASTDATE="cast($DATE as date format $FORMAT)"
SQOOPQUERY="select * from employees where hire_date < $LASTDATE AND \$CONDITIONS"
$SQOOP_HOME/bin/sqoop import $TDQUERY $TDSPLITBY $INPUTMETHOD $VERBOSE \
$CONN_MANAGER $CONNECT --query "$SQOOPQUERY" --username $USER --password $PASS \
--target-dir $IMPORT_DIR --split-by name

The values of $TDQUERY, $TDSPLITBY and $INPUTMETHOD are confusing: they are 
used in the command but never defined anywhere in the example.

Is there any more information about incremental imports in HDP?

Thanks,
-Nirmal


________________________________
From: Venkat Ranganathan <[email protected]>
Sent: Monday, October 6, 2014 10:35 PM
To: [email protected]
Subject: Re: Sqoop Incremental Import with lastmodified mode giving Duplicate 
rows for updated rows

Hi Nirmal

The HDP connector for Teradata is HDP-specific work; please use the vendor 
forums for this.
The lastmodified mode is not supported as specified in Sqoop. The HDP 
documentation has an example of doing this using queries; please look at that 
documentation.

Thanks

Venkat

On Mon, Oct 6, 2014 at 5:29 AM, Nirmal Kumar <[email protected]> wrote:
Hi All,

I’m trying to do an Incremental Import using Sqoop from Teradata to Hive tables.

I’m using:
-Apache Hadoop 2.4.0
-Apache Hive 0.13.1
-Apache Sqoop 1.4.4
-hdp-connector-for-teradata-1.3.2.2.1.5.0-695-distro
-Teradata 15.0.0.8

From the Sqoop documentation:
An alternate table update strategy supported by Sqoop is called lastmodified 
mode. You should use this when rows of the source table may be updated, and 
each such update will set the value of a last-modified column to the current 
timestamp. Rows where the check column holds a timestamp more recent than the 
timestamp specified with --last-value are imported.
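
As I read that description, each incremental run should pull roughly the rows 
returned by a check like the following (just a sanity-check sketch using sqoop 
eval with the connection details from my job below; the timestamp is the 
stored last-value):

sqoop eval \
  --connection-manager org.apache.sqoop.teradata.TeradataConnManager \
  --connect jdbc:teradata://192.168.199.137/testdb123 \
  --username testdb123 --password testdb123 \
  --query "select * from Paper_STAGE where last_modified_col > '2014-10-03 15:29:48.66'"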

I followed these steps:

STEP 1: One-time activity
I do a full import of the table into a Hive table.

STEP 2: One-time activity
Created a Sqoop job for the incremental import:
sqoop job --create incr1 -- import \
  --connection-manager org.apache.sqoop.teradata.TeradataConnManager \
  --connect jdbc:teradata://192.168.199.137/testdb123 \
  --username testdb123 --password testdb123 --table Paper_STAGE \
  --incremental lastmodified --check-column last_modified_col \
  --last-value "2014-10-03 15:29:48.66" \
  --split-by id --hive-table paper_stage --hive-import

STEP 3: To be run on a timely basis from a scheduler or Oozie
Executing the Sqoop job for the incremental import every time I need the 
updated or newly added rows:
sqoop job --exec incr1
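
To double-check which last-value the saved job will use on its next run, I 
inspect the stored job definition (standard sqoop job option; the exact 
property name in the output may differ across versions):

sqoop job --show incr1 | grep -i last.value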

The source table has a unique primary key and a last-modified column holding 
the current timestamp.
Newly added rows are imported fine, but for updated rows I am getting 
duplicates: Sqoop does not update the existing row, it adds a new one with the 
same id and the new timestamp.

Is this something that is currently not supported in Sqoop?
I ask because I found these:

http://stackoverflow.com/questions/19093417/sqoop-import-lastmodified-gives-duplicate-records-it-doesnt-merger
http://grokbase.com/p/cloudera/cdh-user/13a4n03jrh/sqoop-import-lastmodified-gives-duplicate-records-merger-does-not-happen
https://groups.google.com/a/cloudera.org/forum/#!topic/cdh-user/xAbXEduvahU
https://issues.cloudera.org/browse/DISTRO-464
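
Those threads mention Sqoop's merge tool for collapsing such duplicates. A 
rough sketch of what I think that would look like in my case (the HDFS paths 
and the generated jar/class names are placeholders, and I am assuming 'id' as 
the merge key):

sqoop merge \
  --new-data /user/nirmal/paper_stage_incr \
  --onto /user/nirmal/paper_stage_base \
  --target-dir /user/nirmal/paper_stage_merged \
  --jar-file Paper_STAGE.jar --class-name Paper_STAGE \
  --merge-key id

As far as I can tell, though, that only works on HDFS directories and not 
directly on a Hive import like mine.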

Is there a way to avoid the duplicate rows for updated rows and instead get a 
single merged row for each updated row in the source table?
Kindly advise me of any alternatives to handle this.

Thanks,
-Nirmal
