Hi all,
I've imported some data from Oracle onto HDFS with Sqoop 1.4.3, storing it as SequenceFiles, and I'm having problems loading it with Pig: the field value comes back as 0 instead of the expected string. The steps I used are listed below (a simplified example using the DUAL table).
Any ideas why it loads incorrectly? Am I missing a step?
Thanks,
Andre
*1. Imported data from the table onto HDFS (the DUAL table has only 1 row with 1 field containing the string "X"):*
sqoop import -D mapred.child.java.opts="$JDBC_JAVA_OPTS" --connect $CONNSTR \
  -m 1 --query "select DUMMY from dual where \$CONDITIONS" --target-dir test \
  --as-sequencefile --class-name com.acme.Dual
The Dual.java file is attached.
*2. Generated the Dual.jar file:*
javac -cp /opt/cloudera/parcels/CDH-4.3.0-1.cdh4.3.0.p0.22/lib/sqoop/sqoop-1.4.3-cdh4.3.0.jar:/opt/cloudera/parcels/CDH-4.3.0-1.cdh4.3.0.p0.22/lib/hadoop/client-0.20/hadoop-core-2.0.0-mr1-cdh4.3.0.jar:/opt/cloudera/parcels/CDH-4.3.0-1.cdh4.3.0.p0.22/lib/hadoop/hadoop-common.jar:. com/acme/Dual.java
jar cf /tmp/Dual.jar com/acme/Dual.class
*3. Tried to load the data with Pig; however, the field value is read as 0 (zero) instead of the string "X":*
REGISTER /opt/cloudera/parcels/CDH-4.3.0-1.cdh4.3.0.p0.22/lib/pig/piggybank.jar;
REGISTER /opt/cloudera/parcels/CDH-4.3.0-1.cdh4.3.0.p0.22/lib/sqoop/sqoop-1.4.3-cdh4.3.0.jar;
REGISTER /opt/cloudera/parcels/CDH-4.3.0-1.cdh4.3.0.p0.22/lib/hadoop/client-0.20/hadoop-core-2.0.0-mr1-cdh4.3.0.jar;
REGISTER /opt/cloudera/parcels/CDH-4.3.0-1.cdh4.3.0.p0.22/lib/hadoop/hadoop-common.jar;
REGISTER /tmp/Dual.jar;
DEFINE SequenceFileLoader org.apache.pig.piggybank.storage.SequenceFileLoader();
log = LOAD 'test' USING SequenceFileLoader AS (DUMMY:chararray);
DUMP log;
...
2013-11-04 03:21:32,325 [main] INFO org.apache.pig.tools.pigstats.SimplePigStats - Script Statistics:
HadoopVersion PigVersion UserId StartedAt FinishedAt Features
2.0.0-cdh4.3.0 0.11.0-cdh4.3.0 araujo 2013-11-04 03:21:12 2013-11-04 03:21:32 UNKNOWN
Success!
Job Stats (time in seconds):
JobId Maps Reduces MaxMapTime MinMapTIme AvgMapTime MedianMapTime MaxReduceTime MinReduceTime AvgReduceTime MedianReducetime Alias Feature Outputs
job_201310230912_0065 1 0 6 6 6 6 0 0 0 0 log MAP_ONLY hdfs://n1.hadoop.cto.pythian.com:8020/tmp/temp-805635901/tmp-702886222,
Input(s):
Successfully read 1 records (479 bytes) from: "hdfs://n1.hadoop.cto.pythian.com:8020/user/araujo/test"
Output(s):
Successfully stored 1 records (8 bytes) in: "hdfs://n1.hadoop.cto.pythian.com:8020/tmp/temp-805635901/tmp-702886222"
Counters:
Total records written : 1
Total bytes written : 8
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
Job DAG:
job_201310230912_0065
2013-11-04 03:21:32,338 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
2013-11-04 03:21:32,342 [main] INFO org.apache.pig.data.SchemaTupleBackend - Key [pig.schematuple] was not set... will not generate code.
2013-11-04 03:21:32,350 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2013-11-04 03:21:32,350 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
*(0) <--- THIS SHOULD SHOW "X"*
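For anyone trying to reproduce this, one check that may help narrow things down is to look at the key and value class names recorded in the SequenceFile header, since those are the types the loader has to deal with. The sketch below is not part of the steps I ran. It is a minimal reader that does not use the Hadoop libraries (the class name SeqHeaderPeek is made up for illustration) and it assumes the header layout written by recent Hadoop versions: the bytes "SEQ", one version byte, then the key and value class names, each stored as a length-prefixed UTF-8 string. It only handles class names shorter than 128 bytes and expects a local copy of one of the part files under the target directory, for example one fetched with hadoop fs -get.

```java
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Illustrative sketch only: peeks at the start of a SequenceFile header.
// Assumption: header is "SEQ", a version byte, then the key and value class
// names, each written as a one-byte length followed by UTF-8 bytes.
public class SeqHeaderPeek {

    /** Returns {version, keyClassName, valueClassName} read from the header. */
    static String[] readHeader(InputStream raw) throws IOException {
        DataInputStream in = new DataInputStream(raw);
        byte[] magic = new byte[3];
        in.readFully(magic);
        if (magic[0] != 'S' || magic[1] != 'E' || magic[2] != 'Q') {
            throw new IOException("not a SequenceFile: missing SEQ magic");
        }
        int version = in.readUnsignedByte();
        String keyClass = readShortString(in);
        String valueClass = readShortString(in);
        return new String[] { Integer.toString(version), keyClass, valueClass };
    }

    /** Reads a string whose length prefix fits in a single non-negative byte. */
    static String readShortString(DataInputStream in) throws IOException {
        int len = in.readByte();
        if (len < 0) {
            // Longer strings use a multi-byte length, which this sketch skips.
            throw new IOException("multi-byte length prefix is not handled");
        }
        byte[] buf = new byte[len];
        in.readFully(buf);
        return new String(buf, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java SeqHeaderPeek <local part file>");
            return;
        }
        try (InputStream in = new FileInputStream(args[0])) {
            String[] header = readHeader(in);
            System.out.println("version:     " + header[0]);
            System.out.println("key class:   " + header[1]);
            System.out.println("value class: " + header[2]);
        }
    }
}
```

Compile it with javac SeqHeaderPeek.java and run it as java SeqHeaderPeek followed by the path of the local copy; it prints the format version and the two class names, which can then be compared with the single chararray field declared in the LOAD statement.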
--
André Araújo
Database Administrator / SDM
The Pythian Group - Australia - www.pythian.com
