Hello,
I'm trying to import a full table from MySQL to Hadoop/Hive. It works with
certain parameters, but when I try to do an ETL that's somewhat more
complex, I start getting bogus rows in my resulting table.
This works:
sqoop import \
--connect 'jdbc:mysql://backup.general.db/general?tinyInt1isBit=false&zeroDateTimeBehavior=convertToNull' \
--username xxxxx \
--password xxxxx \
--hive-import \
--hive-overwrite \
-m 23 \
--direct \
--hive-table profile_felix_test17 \
--split-by id \
--table Profile
But if I use --query instead of --table, then I start getting bogus
records (and by that, I mean rows with a nonsensically high primary key
that doesn't exist in my source database and NULL for all the other
cells).
The output I get with the command above isn't exactly in the shape I want.
Using --query, I can get the data in the format I want (by doing some of
the transformation on the MySQL side), but then I also get the bogus rows,
which pretty much makes the Hive table unusable.
I've tried various combinations of parameters and it's hard to pinpoint
exactly what causes the problem, so it could be more intricate than the
simplistic description above. That being said, removing --table and adding
the following params definitely breaks it:
--target-dir /tests/sqoop/general/profile_felix_test \
--query "select * from Profile WHERE \$CONDITIONS"
(Ultimately, I want to use a query that's more complex than this, but even
a simple query like this breaks...)
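To give an idea of where I eventually want to go, the query would be
something along these lines (the column names here are made up, just to
illustrate the kind of MySQL-side transformation I mean):
--query "SELECT p.id, LOWER(p.email) AS email, IFNULL(p.country, 'unknown') AS country FROM Profile p WHERE \$CONDITIONS"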
Any ideas why this would happen and how to solve it?
Is this the kind of problem that Sqoop2's cleaner architecture intends to
solve?
I use CDH 4.2, BTW.
Thanks :) !
--
Felix