For importing a single partition, you should be able to set the chunk method to ROWID and then set oraoop.import.partitions to the partition you are importing. This will split that one partition by ROWID across as many mappers as you like.
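As a rough, untested sketch of that change: this is the command from the quoted message further down the thread, with oraoop.chunk.method switched to ROWID and the no_parallel hint dropped. The partition name, connect string and credentials are the placeholders from that command, not real values. It is written as a dry run that only prints the command, so nothing is launched until you swap the last step for a real sqoop call.

```shell
# Dry-run sketch: assembles the import command and prints it instead of running it.
# partition_name, connect_string, user and password are placeholders carried
# over from the original command in this thread.
set -- import \
  -Doraoop.chunk.method=ROWID \
  -Doraoop.timestamp.string=false \
  -Doraoop.import.partitions=partition_name \
  --connect connect_string \
  --table "WAREHOUSE.BIG_TABLE" \
  --fetch-size 100000 \
  -m 20 \
  --target-dir /user/hive/warehouse/database/partition \
  --as-parquetfile \
  --username user \
  --password password

# Print the assembled command; to run it for real, use: sqoop "$@"
cmd="sqoop $*"
echo "$cmd"
```

The -m 20 is kept from the original command. With ROWID chunking it could be raised toward the 40 or so mappers mentioned in the original question.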
Also, you shouldn't need any no-parallel hints. The direct connector disables parallel query when it first connects, so that shouldn't be a problem. In your command below, can you change oraoop.chunk.method to ROWID (or just leave it out, since it is the default) and let me know if that works for you?

From: Joshua Baxter [mailto:[email protected]]
Sent: Tuesday, 4 November 2014 8:53 AM
To: [email protected]
Subject: Re: Using more than a single mapper per partition with OraOop

We will mostly be wanting to bring in a single partition at a time, but there will also be occasions where we would need to pull down the whole table.

sqoop import -Doraoop.import.hint="no_parallel" -Doraoop.chunk.method=PARTITION -Doraoop.timestamp.string=false -Doraoop.import.partitions=partition_name --connect connect_string --table "WAREHOUSE.BIG_TABLE" --fetch-size 100000 -m 20 --target-dir /user/hive/warehouse/database/partition --as-parquetfile --username user --password password

On Mon, Nov 3, 2014 at 9:40 PM, Gwen Shapira <[email protected]> wrote:

Do you need to get just one partition, or is the ultimate goal to use all partitions? Also, can you share the exact OraOop command you used?

On Mon, Nov 3, 2014 at 1:32 PM, Joshua Baxter <[email protected]> wrote:

Apologies if this question has been asked before. I have a very large table in Oracle with hundreds of partitions, and we want to be able to import it to Parquet in HDFS a partition at a time as part of an ETL process. The table has evolved over time and every column has significant skew, meaning that mappers get very uneven numbers of rows when using the standard Sqoop connector and split-by. Impala is the target platform for the data, so we also want to keep the file sizes under the cluster block size to prevent remote streaming when we use the data.

I've just discovered OraOop and it sounds like it would be exactly the tool we need to import the data in an efficient and predictable way. Unfortunately, the problem I'm now having is that if I use the partition option to choose just a single partition, this always equates to exactly one mapper. The sort of speed and output file sizes we are looking at would equate to something like 40 mappers. Are there any options I can set to increase the number of mappers when pulling data from a single table partition?
