So it queries dba_extents to get the extent map – I'm presuming your database 
has a lot of extents, which is why the query takes a long time. If you edit 
oraoop-site.xml and look at “oraoop.oracle.session.initialization.statements”, 
you can get rid of the statement that disables parallel query. You could then add 
NOPARALLEL(t) to “oraoop.import.hint” to ensure the actual mappers don’t run in 
parallel. There is no guarantee the dba_extents query will then run in parallel, 
but you could give it a shot.
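
To give you an idea, the edited entries in oraoop-site.xml would look something 
like the sketch below. Check your copy of the file for the exact default list of 
session initialization statements; the point is just to drop the disable parallel 
query entry and keep the rest:

<property>
  <name>oraoop.oracle.session.initialization.statements</name>
  <value>...the existing default statements, minus "alter session disable parallel query"...</value>
</property>

<property>
  <name>oraoop.import.hint</name>
  <value>NOPARALLEL(t)</value>
</property>

You may also be able to override both properties with -D on the sqoop command 
line (the same way you are already setting oraoop.import.hint below), though 
editing the file is the sure way.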

As for the error – it looks related: you have so many blocks that the count no 
longer fits into an int. It should be a simple fix to change the datatype to a 
long throughout the code. I’ll log an issue next week and look into this.
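
To give you an idea of the kind of change I mean, here is a sketch of the 
pattern only, not the actual OraOop source (the class, method and column index 
are just illustrative):

import java.sql.ResultSet;
import java.sql.SQLException;

class BlockCountSketch {
    // The Oracle driver throws "Numeric Overflow" when getInt() is used on a
    // value that no longer fits in an int, which is what happens in
    // getOracleDataChunksExtent() once the block count gets big enough.
    static long readTotalBlocks(ResultSet rs) throws SQLException {
        // Reading and carrying the count as a long avoids the overflow.
        return rs.getLong(1);
    }
}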

From: Joshua Baxter [mailto:[email protected]]
Sent: Friday, 7 November 2014 6:24 AM
To: [email protected]
Subject: Re: Using more than a single mapper per partition with OraOop

Hi David,

Thanks for your reply. I think the incorrect chunking is exactly my problem. 
I've now changed that; however, I have hit a couple more issues. Firstly, after 
launching the job I am now getting the following error after the query to fetch 
the block information.

14/11/06 14:17:58 ERROR tool.ImportTool: Encountered IOException running import 
job: java.io.IOException: java.sql.SQLException: Numeric Overflow
        at 
com.quest.oraoop.OraOopDataDrivenDBInputFormat.getSplits(OraOopDataDrivenDBInputFormat.java:120)
        at 
org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:498)
        at 
org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:515)
        at 
org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:399)
        at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1295)
        at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1292)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
        at org.apache.hadoop.mapreduce.Job.submit(Job.java:1292)
        at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1313)
        at 
org.apache.sqoop.mapreduce.ImportJobBase.doSubmitJob(ImportJobBase.java:198)
        at 
org.apache.sqoop.mapreduce.ImportJobBase.runJob(ImportJobBase.java:171)
        at 
org.apache.sqoop.mapreduce.ImportJobBase.runImport(ImportJobBase.java:268)
        at org.apache.sqoop.manager.SqlManager.importTable(SqlManager.java:665)
        at 
com.quest.oraoop.OraOopConnManager.importTable(OraOopConnManager.java:260)
        at org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:497)
        at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:605)
        at org.apache.sqoop.Sqoop.run(Sqoop.java:143)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
        at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:179)
        at org.apache.sqoop.Sqoop.runTool(Sqoop.java:218)
        at org.apache.sqoop.Sqoop.runTool(Sqoop.java:227)
        at org.apache.sqoop.Sqoop.main(Sqoop.java:236)
Caused by: java.sql.SQLException: Numeric Overflow
        at 
oracle.jdbc.driver.NumberCommonAccessor.throwOverflow(NumberCommonAccessor.java:4170)
        at 
oracle.jdbc.driver.NumberCommonAccessor.getInt(NumberCommonAccessor.java:119)
        at 
oracle.jdbc.driver.GeneratedStatement.getInt(GeneratedStatement.java:217)
        at 
oracle.jdbc.driver.GeneratedScrollableResultSet.getInt(GeneratedScrollableResultSet.java:522)
        at 
oracle.jdbc.driver.GeneratedResultSet.getInt(GeneratedResultSet.java:1350)
        at 
com.quest.oraoop.OraOopOracleQueries.getOracleDataChunksExtent(OraOopOracleQueries.java:271)
        at 
com.quest.oraoop.OraOopDataDrivenDBInputFormat.getSplits(OraOopDataDrivenDBInputFormat.java:74)
        ... 23 more


Secondly, the query to pull out the block data is taking a really long time: 
about 10 minutes. This query is also running with parallel query disabled. 
I was wondering if there is any way to enable it for just this step to help 
reduce the latency?

Thanks

Josh

On Wed, Nov 5, 2014 at 2:00 AM, David Robson 
<[email protected]> wrote:
For importing a single partition – you should be able to set the chunk method 
to ROWID, then set oraoop.import.partitions to the partition you are 
importing. This will split that one partition by ROWID across as many mappers as 
you like.

Also – you shouldn’t need the no_parallel hint – the direct connector disables 
parallel query when it first connects, so that shouldn’t be a problem.

So in your command below can you change oraoop.chunk.method to ROWID (or just 
leave it out – it is the default) and let me know if that works for you?
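
In other words, something like this (adapted from your command below with the 
same placeholders, and with the no_parallel hint dropped since it isn’t needed):

sqoop import -Doraoop.chunk.method=ROWID -Doraoop.timestamp.string=false \
  -Doraoop.import.partitions=partition_name --connect connect_string \
  --table "WAREHOUSE.BIG_TABLE" --fetch-size 100000 -m 20 \
  --target-dir /user/hive/warehouse/database/partition \
  --as-parquetfile --username user --password password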

From: Joshua Baxter [mailto:[email protected]]
Sent: Tuesday, 4 November 2014 8:53 AM
To: [email protected]
Subject: Re: Using more than a single mapper per partition with OraOop

We will mostly be wanting to bring in a single partition at a time, but there 
will also be occasions where we would need to pull down the whole table.

sqoop import  -Doraoop.import.hint="no_parallel" 
-Doraoop.chunk.method=PARTITION -Doraoop.timestamp.string=false 
-Doraoop.import.partitions=partition_name  --connect connect_string  --table 
"WAREHOUSE.BIG_TABLE" --fetch-size  100000 -m 20 --target-dir 
/user/hive/warehouse/database/partition   --as-parquetfile --username user 
--password password

On Mon, Nov 3, 2014 at 9:40 PM, Gwen Shapira 
<[email protected]> wrote:
Do you need to get just one partition, or is the ultimate goal to use all 
partitions?

Also, can you share the exact Oraoop command you used?

On Mon, Nov 3, 2014 at 1:32 PM, Joshua Baxter 
<[email protected]> wrote:
Apologies if this question has been asked before.

I have a very large table in Oracle with hundreds of partitions, and we want to 
be able to import it to Parquet in HDFS a partition at a time as part of an ETL 
process. The table has evolved over time and there isn't a single column without 
significant skew, which means the mappers get very uneven row counts when using 
the standard Sqoop connector with split-by. Impala is the target platform for 
the data, so we also want to keep the file sizes under the cluster block size to 
prevent remote streaming when we use the data. I've just discovered OraOop and 
it sounds like it would be exactly the tool we need to import the data in an 
efficient and predictable way.

Unfortunately the problem I'm now having is that if I use the partition option 
to choose just a single partition, this always equates to exactly one mapper. 
The sort of speed and output file sizes we are looking for would equate to 
something like 40 mappers.

Are there any options I can set to increase the number of mappers when pulling 
data from a single table partition?


