We will mostly want to bring in a single partition at a time, but there will also be occasions where we would need to pull down the whole table.
sqoop import -Doraoop.import.hint="no_parallel" -Doraoop.chunk.method=PARTITION \
  -Doraoop.timestamp.string=false -Doraoop.import.partitions=partition_name \
  --connect connect_string --table "WAREHOUSE.BIG_TABLE" --fetch-size 100000 -m 20 \
  --target-dir /user/hive/warehouse/database/partition --as-parquetfile \
  --username user --password password

On Mon, Nov 3, 2014 at 9:40 PM, Gwen Shapira <[email protected]> wrote:

> Do you need to get just one partition, or is the ultimate goal to use all
> partitions?
>
> Also, can you share the exact Oraoop command you used?
>
> On Mon, Nov 3, 2014 at 1:32 PM, Joshua Baxter <[email protected]>
> wrote:
>
>> Apologies if this question has been asked before.
>>
>> I have a very large table in Oracle with hundreds of partitions, and we
>> want to be able to import it to Parquet in HDFS a partition at a time as
>> part of an ETL process. The table has evolved over time and there is no
>> column without significant skew, meaning that mappers get very uneven
>> shares when using the standard Sqoop connector and split-by. Impala is
>> the target platform for the data, so we also want to keep the file sizes
>> under the cluster block size to prevent remote streaming when we use the
>> data. I've just discovered OraOop and it sounds like it would be exactly
>> the tool we need to import the data in an efficient and predictable way.
>>
>> Unfortunately, the problem I'm now having is that if I use the partition
>> option to choose just a single partition, this always equates to exactly
>> one mapper. The sort of speed and output file sizes we are looking for
>> would equate to something like 40.
>>
>> Are there any options I can set to increase the number of mappers when
>> pulling data from a single table partition?
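
One workaround that may be worth testing (a sketch only, not something confirmed in this thread): Sqoop's free-form query import can restrict the read to one partition using Oracle's PARTITION clause and then fan the rows out across mappers with --split-by. It only helps if some column is reasonably uniform within that single partition, even if it is skewed across the table as a whole. The connection string, partition name, split column and paths below are placeholders carried over from the commands above.

sqoop import \
  --connect connect_string --username user --password password \
  --query 'SELECT * FROM WAREHOUSE.BIG_TABLE PARTITION (partition_name) WHERE $CONDITIONS' \
  --split-by SOME_NUMERIC_COLUMN \
  -m 40 \
  --target-dir /user/hive/warehouse/database/partition \
  --as-parquetfile

Note that --query requires the literal $CONDITIONS token (kept unexpanded by the single quotes) plus an explicit --split-by and --target-dir. Another angle that might be worth checking, though I haven't verified it against a partition list, is whether OraOop's ROWID-based chunking (-Doraoop.chunk.method=ROWID, its default) still applies when -Doraoop.import.partitions names a single partition, since ROWID ranges are not affected by column skew.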
