[
https://issues.apache.org/jira/browse/SQOOP-2981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15365060#comment-15365060
]
Ruslan Dautkhanov edited comment on SQOOP-2981 at 7/6/16 9:06 PM:
------------------------------------------------------------------
Hi [~maugli],
Thanks a lot for the prompt response.
Sure, I will rerun Sqoop with Joeri's tool. I'm not sure yet how to run it, so
I created
https://github.com/cerndb/Hadoop-Profiler/issues/1
{quote}It would require JDK8_u60 or above. Is it a doable scenario on your
side?{quote}
It should be doable, as long as it's only the Sqoop client side that needs
adjustment.
Is adding JAVA_HOME=/usr/java/JDK8_u60 to sqoop-conf/sqoop-env.sh enough?
We can't switch the whole cluster to Java 8 yet (we're still on
jdk1.7.0_67-cloudera), but we can temporarily switch the sqoop1 client to JDK 8
through sqoop-env.sh.
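Concretely, I was thinking of something along these lines in
sqoop-conf/sqoop-env.sh (the JDK install path below is just an example for our
hosts, not a known-good value):
{quote}
# Point only the Sqoop client at a JDK 8u60+ install;
# the rest of the cluster stays on JDK 7.
export JAVA_HOME=/usr/java/jdk1.8.0_60
export PATH=$JAVA_HOME/bin:$PATH
{quote}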
{quote}The reason why I'm asking for that is the following:
JdbcWritableBridge#readBigDecimal is just a shorthand for
java.sql.ResultSet#getBigDecimal(int), so the implementation is within the
Oracle JDBC driver, and thus Sqoop does not have much control over that
method.{quote}
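If I read that correctly, the bridge method is essentially just the following
(paraphrasing your description above, not the actual Sqoop source):
{quote}
import java.math.BigDecimal;
import java.sql.ResultSet;
import java.sql.SQLException;

// A thin wrapper that delegates straight to the JDBC driver,
// so the decimal materialization cost sits inside the Oracle driver.
public static BigDecimal readBigDecimal(int colNum, ResultSet r)
    throws SQLException {
  return r.getBigDecimal(colNum);
}
{quote}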
I can create an Oracle SR as well; we have database support from Oracle.
{quote}Meanwhile I'll try to investigate your scenario on my side, and check if
I can find any clue what could cause your performance issue (or if we could
find a workaround, e.g. mapping to a different type instead of BigDecimal or
so).{quote}
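If mapping to a different type turns out to be the workaround, I assume it
would look something like Sqoop's --map-column-java option, e.g. (column list
abbreviated, and whether Double is an acceptable substitute for BigDecimal
would need checking per column):
{quote}
sqoop import \
  --connect "jdbc:oracle:thin:@jdbc_tns" \
  --query "SELECT * FROM someschema.sometable WHERE \$CONDITIONS" \
  --map-column-java RECENCY=Double,FREQUENCY=Double,SPEND=Double \
  --as-parquetfile \
  --target-dir hdfs_dir \
  --num-mappers num_mappers
{quote}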
Thank you. Strangely enough, Sqoop didn't import the data types correctly from
Oracle:
{quote}
$ parquet-tools schema 154ff6db-0c3d-4424-928a-c5ceed95de70.parquet
message AutoGeneratedSchema {
  optional binary ID (UTF8);
  optional binary PERSON_SEQ_NO (UTF8);
  optional binary MERCHANTID (UTF8);
  optional binary RECENCY (UTF8);
  optional binary FREQUENCY (UTF8);
  optional binary SPEND (UTF8);
  optional binary SPENDING (UTF8);
  optional binary FREQ (UTF8);
}
{quote}
Notice that every column came out as a string. Compare with the CREATE TABLE
DDL on the Oracle side that I posted yesterday:
all columns are NUMBER on the Oracle side, except PERSON_SEQ_NO, which is
VARCHAR2(50).
Sqoop created the Parquet schema with all columns as strings; I had to fix
this manually.
It might be worth creating a separate Sqoop JIRA for that - let me know and
I'll post one.
I'm not sure whether this data type mapping problem is related to the high CPU
utilization in JdbcWritableBridge#readBigDecimal?
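What I'm wondering is whether we end up paying for a double conversion per
value. A hypothetical sketch of the hot path, assuming the wrong string
mapping is involved (this is my speculation, not traced code):
{quote}
// Every Oracle NUMBER gets materialized as a BigDecimal by the driver
// (the readBigDecimal hotspot in the profile), then converted again
// to a String for the string-typed Parquet column.
BigDecimal d = resultSet.getBigDecimal(colNum); // driver-side conversion
String s = (d == null) ? null : d.toString();   // second, avoidable conversion
{quote}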
Thanks,
Ruslan
> sqoop import from jdbc: JdbcWritableBridge.readBigDecimal() takes a ton of cpu
> ------------------------------------------------------------------------------
>
> Key: SQOOP-2981
> URL: https://issues.apache.org/jira/browse/SQOOP-2981
> Project: Sqoop
> Issue Type: Bug
> Components: codegen, connectors/oracle, sqoop2-jdbc-connector
> Affects Versions: 1.4.5, 1.4.6
> Environment: sqoop import from Oracle; saves as parquet file
> Reporter: Ruslan Dautkhanov
> Labels: decimal, import, jdbc, oracle, parquet
>
> The majority of the CPU time in a Sqoop import from Oracle was spent
> converting Decimal values: 2.5x more than the next biggest CPU consumer
> (Snappy compression). Sqoop was at 100% CPU overall, while the Oracle side
> was mostly idle.
> {quote}
> JvmTop 0.8.0 alpha - 11:56:45, amd64, 48 cpus, Linux 2.6.32-57, load avg 0.92
> http://code.google.com/p/jvmtop
> Profiling PID 25489: org.apache.hadoop.mapred.YarnChild 10.20
> 38.78% ( 7.68s) com.cloudera.sqoop.lib.JdbcWritableBridge.readBigDecimal()
> 14.27% ( 2.82s) org.xerial.snappy.SnappyNative.rawCompress()
> 12.67% ( 2.51s) parquet.io.api.Binary$FromStringBinary.encodeUTF8()
> 10.28% ( 2.04s) ...quet.column.values.dictionary.DictionaryValuesWriter$()
>  4.80% ( 0.95s) ...quet.column.values.fallback.FallbackValuesWriter.writ()
>  3.69% ( 0.73s) com.cloudera.sqoop.lib.JdbcWritableBridge.readString()
>  2.51% ( 0.50s) parquet.avro.AvroWriteSupport.writeRecordFields()
>  2.30% ( 0.46s) parquet.column.impl.ColumnWriterV1.write()
>  1.90% ( 0.38s) ...quet.it.unimi.dsi.fastutil.objects.Object2IntLinkedOp()
>  1.31% ( 0.26s) ...quet.column.values.rle.RunLengthBitPackingHybridEncod()
>  1.27% ( 0.25s) ...quet.column.values.dictionary.DictionaryValuesWriter$()
>  1.22% ( 0.24s) parquet.hadoop.CodecFactory$BytesCompressor.compress()
>  0.65% ( 0.13s) ...quet.column.values.dictionary.DictionaryValuesWriter$()
>  0.64% ( 0.13s) ...quet.it.unimi.dsi.fastutil.objects.Object2IntLinkedOp()
>  0.64% ( 0.13s) parquet.bytes.CapacityByteArrayOutputStream.addSlab()
>  0.63% ( 0.12s) parquet.io.api.Binary$ByteArrayBackedBinary.getBytes()
>  0.62% ( 0.12s) ...quet.column.values.dictionary.DictionaryValuesWriter.()
>  0.58% ( 0.12s) parquet.bytes.CapacityByteArrayOutputStream.setByte()
>  0.49% ( 0.10s) parquet.hadoop.codec.SnappyUtil.validateBuffer()
>  0.44% ( 0.09s) parquet.hadoop.InternalParquetRecordWriter.write()
> {quote}
> DDL of the table on the Oracle side:
> {quote}
> CREATE TABLE someschema.sometable
> (
>   ID        NUMBER NOT NULL,
>   psn       VARCHAR2(50 BYTE),
>   MERCHID   NUMBER,
>   RECENCY   NUMBER,
>   FREQUENCY NUMBER,
>   SPEND     NUMBER,
>   SPENDING  NUMBER,
>   FREQ      NUMBER
> )
> {quote}
> Sqoop parameters:
> {quote}
> sqoop import
> -Dmapred.job.name="sqoop import into out_table"
> --connect "jdbc:oracle:thin:@jdbc_tns"
> --username username --password password
> --direct
> --compress --compression-codec snappy
> --as-parquetfile
> --target-dir hdfs_dir
> --num-mappers num_mappers
> --query "SELECT * FROM someschema.sometable WHERE \$CONDITIONS"
> {quote}