On Sat, Nov 10, 2018 at 05:10:01PM +0100, Nicolas Paris wrote:
> I have been able to load data from **csv only** format with sqoop export
> in **direct mode**. While very fast (parallel copy statements!), the
> method is not robust in case the data do have varchar columns. In
> particular a varchar column may contain **newlines** and this breaks the
> mapper job, since it splits the csv by newlines.
> That's too bad, because the *copy* statement can handle *newlined csv*.
>
> 1) Is there any way to only send a whole hdfs file per mapper instead of
> splitting them? That would work well.
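To make the failure mode concrete before the workaround: a single logical record whose varchar value contains a newline ends up as two physical lines in the exported file, e.g. (made-up sample, using the id, text_column layout of the command below):

    42,first half of the comment
    second half of the comment

A mapper that splits the file on '\n' treats these as two separate, broken records, which is what breaks the parallel export.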
Finally I found a workaround. Setting the number of mappers to one (-m 1) means the data is not split at all, and COPY handles the multi-line CSV produced by Hive well.

    sqoop export --connect "jdbc:postgresql://<postgres_host>/<postgres_db>" \
      --username <postgres_user> --password-file file:///home/$USER/.password \
      --export-dir /apps/hive/warehouse/<my_db>/<my_table_path> \
      --table <my_hive_table> --columns "id, text_column" -m 1 --direct \
      --lines-terminated-by '\n' --fields-terminated-by ',' \
      --input-null-string "\\\\N" --input-null-non-string "\\\\N" \
      -- --schema <my_schema>

This should work in any situation (excluding binary columns). In the end I am not sure that a single COPY instead of parallel COPYs is worse: it is less aggressive on the database while still giving great performance.

regards,

--
nicolas
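PS: for reference, the COPY behaviour mentioned above is that in CSV mode a quoted field may contain embedded newlines. A minimal psql sketch of that (hypothetical table and data, not necessarily the exact format the Hive export produces):

    CREATE TABLE my_schema.my_table (id integer, text_column text);

    -- The quoted field spans two physical lines but loads as one row.
    COPY my_schema.my_table (id, text_column) FROM STDIN WITH (FORMAT csv);
    1,"a comment that
    spans two lines"
    \.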
