I think there might still be something messed up with the classpath. The logs complain about duplicate jars (multiple SLF4J bindings) and about deprecated configuration properties.
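One quick way to confirm which of those duplicate bindings actually wins is to ask the JVM where it loaded the class from. This is only a sketch, not from the original run; it can be pasted into spark-shell or any Scala REPL started with the same classpath:

    // Prints the jar that supplied org.slf4j.impl.StaticLoggerBinder to this JVM.
    // (getCodeSource can be null for bootstrap classes, but not for jar-loaded ones.)
    val binder = Class.forName("org.slf4j.impl.StaticLoggerBinder")
    println(binder.getProtectionDomain.getCodeSource.getLocation)

Whichever jar it prints is the binding in effect; the other jars listed in the SLF4J warning below are the ones to prune from the classpath.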
> On 21 Sep 2016, at 22:21, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>
> Well, I am left using Spark for importing data from an RDBMS table to Hadoop.
>
> You may ask why, and it is because Spark does it in one process and with no errors.
>
> With Sqoop I get the error message below, which leaves the RDBMS table data in a file on HDFS but stops there.
>
> 2016-09-21 21:00:15,084 [myid:] - INFO [main:OraOopLog@103] - Data Connector for Oracle and Hadoop is disabled.
> 2016-09-21 21:00:15,095 [myid:] - INFO [main:SqlManager@98] - Using default fetchSize of 1000
> 2016-09-21 21:00:15,095 [myid:] - INFO [main:CodeGenTool@92] - Beginning code generation
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in [jar:file:/data6/hduser/hbase-0.98.21-hadoop2/lib/phoenix-4.8.0-HBase-0.98-client.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in [jar:file:/data6/hduser/hbase-0.98.21-hadoop2/lib/phoenix-4.8.0-HBase-0.98-hive.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in [jar:file:/data6/hduser/hbase-0.98.21-hadoop2/lib/phoenix-4.8.0-HBase-0.98-thin-client.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in [jar:file:/data6/hduser/hbase-0.98.21-hadoop2/lib/slf4j-log4j12-1.6.4.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in [jar:file:/home/hduser/hadoop-2.7.3/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
> 2016-09-21 21:00:15,681 [myid:] - INFO [main:OracleManager@417] - Time zone has been set to GMT
> 2016-09-21 21:00:15,717 [myid:] - INFO [main:SqlManager@757] - Executing SQL statement: select * from sh.sales where (1 = 0)
> 2016-09-21 21:00:15,727 [myid:] - INFO [main:SqlManager@757] - Executing SQL statement: select * from sh.sales where (1 = 0)
> 2016-09-21 21:00:15,748 [myid:] - INFO [main:CompilationManager@94] - HADOOP_MAPRED_HOME is /home/hduser/hadoop-2.7.3/share/hadoop/mapreduce
> Note: /tmp/sqoop-hduser/compile/82dcf5975118b5e271b442e547201fdf/QueryResult.java uses or overrides a deprecated API.
> Note: Recompile with -Xlint:deprecation for details.
> 2016-09-21 21:00:17,354 [myid:] - INFO [main:CompilationManager@330] - Writing jar file: /tmp/sqoop-hduser/compile/82dcf5975118b5e271b442e547201fdf/QueryResult.jar
> 2016-09-21 21:00:17,366 [myid:] - INFO [main:ImportJobBase@237] - Beginning query import.
> 2016-09-21 21:00:17,511 [myid:] - WARN [main:NativeCodeLoader@62] - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 2016-09-21 21:00:17,516 [myid:] - INFO [main:Configuration@840] - mapred.jar is deprecated. Instead, use mapreduce.job.jar
> 2016-09-21 21:00:17,993 [myid:] - INFO [main:Configuration@840] - mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
> 2016-09-21 21:00:18,094 [myid:] - INFO [main:RMProxy@56] - Connecting to ResourceManager at rhes564/50.140.197.217:8032
> 2016-09-21 21:00:23,441 [myid:] - INFO [main:DBInputFormat@192] - Using read commited transaction isolation
> 2016-09-21 21:00:23,442 [myid:] - INFO [main:DataDrivenDBInputFormat@147] - BoundingValsQuery: SELECT MIN(prod_id), MAX(prod_id) FROM (select * from sh.sales where (1 = 1) ) t1
> 2016-09-21 21:00:23,540 [myid:] - INFO [main:JobSubmitter@394] - number of splits:4
> 2016-09-21 21:00:23,547 [myid:] - INFO [main:Configuration@840] - mapred.job.name is deprecated. Instead, use mapreduce.job.name
> 2016-09-21 21:00:23,547 [myid:] - INFO [main:Configuration@840] - mapred.cache.files.timestamps is deprecated. Instead, use mapreduce.job.cache.files.timestamps
> 2016-09-21 21:00:23,547 [myid:] - INFO [main:Configuration@840] - mapreduce.map.class is deprecated. Instead, use mapreduce.job.map.class
> 2016-09-21 21:00:23,547 [myid:] - INFO [main:Configuration@840] - mapreduce.inputformat.class is deprecated. Instead, use mapreduce.job.inputformat.class
> 2016-09-21 21:00:23,547 [myid:] - INFO [main:Configuration@840] - mapreduce.outputformat.class is deprecated. Instead, use mapreduce.job.outputformat.class
> 2016-09-21 21:00:23,548 [myid:] - INFO [main:Configuration@840] - mapred.output.value.class is deprecated. Instead, use mapreduce.job.output.value.class
> 2016-09-21 21:00:23,548 [myid:] - INFO [main:Configuration@840] - mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
> 2016-09-21 21:00:23,548 [myid:] - INFO [main:Configuration@840] - mapred.cache.files is deprecated. Instead, use mapreduce.job.cache.files
> 2016-09-21 21:00:23,548 [myid:] - INFO [main:Configuration@840] - mapred.working.dir is deprecated. Instead, use mapreduce.job.working.dir
> 2016-09-21 21:00:23,548 [myid:] - INFO [main:Configuration@840] - mapred.job.classpath.files is deprecated. Instead, use mapreduce.job.classpath.files
> 2016-09-21 21:00:23,548 [myid:] - INFO [main:Configuration@840] - user.name is deprecated. Instead, use mapreduce.job.user.name
> 2016-09-21 21:00:23,548 [myid:] - INFO [main:Configuration@840] - mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
> 2016-09-21 21:00:23,549 [myid:] - INFO [main:Configuration@840] - mapred.cache.files.filesizes is deprecated. Instead, use mapreduce.job.cache.files.filesizes
> 2016-09-21 21:00:23,549 [myid:] - INFO [main:Configuration@840] - mapred.output.key.class is deprecated. Instead, use mapreduce.job.output.key.class
> 2016-09-21 21:00:23,656 [myid:] - INFO [main:JobSubmitter@477] - Submitting tokens for job: job_1474455325627_0045
> 2016-09-21 21:00:23,955 [myid:] - INFO [main:YarnClientImpl@174] - Submitted application application_1474455325627_0045 to ResourceManager at rhes564/50.140.197.217:8032
> 2016-09-21 21:00:23,980 [myid:] - INFO [main:Job@1272] - The url to track the job: http://http://rhes564:8088/proxy/application_1474455325627_0045/
> 2016-09-21 21:00:23,981 [myid:] - INFO [main:Job@1317] - Running job: job_1474455325627_0045
> 2016-09-21 21:00:31,180 [myid:] - INFO [main:Job@1338] - Job job_1474455325627_0045 running in uber mode : false
> 2016-09-21 21:00:31,182 [myid:] - INFO [main:Job@1345] - map 0% reduce 0%
> 2016-09-21 21:00:40,260 [myid:] - INFO [main:Job@1345] - map 25% reduce 0%
> 2016-09-21 21:00:44,283 [myid:] - INFO [main:Job@1345] - map 50% reduce 0%
> 2016-09-21 21:00:48,308 [myid:] - INFO [main:Job@1345] - map 75% reduce 0%
> 2016-09-21 21:00:55,346 [myid:] - INFO [main:Job@1345] - map 100% reduce 0%
> 2016-09-21 21:00:56,359 [myid:] - INFO [main:Job@1356] - Job job_1474455325627_0045 completed successfully
> 2016-09-21 21:00:56,501 [myid:] - ERROR [main:ImportTool@607] - Imported Failed: No enum constant org.apache.hadoop.mapreduce.JobCounter.MB_MILLIS_MAPS
>
> Dr Mich Talebzadeh
>
> LinkedIn https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
>
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
>
>> On 21 September 2016 at 20:56, Michael Segel <michael_se...@hotmail.com> wrote:
>> Uhmmm…
>>
>> A bit of a longer-ish answer…
>>
>> Spark may or may not be faster than Sqoop. The standard caveats apply… YMMV.
>>
>> The reason I say this is that you have a couple of limiting factors, the main one being the number of connections allowed by the target RDBMS.
>>
>> Then there’s the data distribution within the partitions / ranges in the database. By this I mean that with any parallel solution you need to run copies of your query in parallel over different ranges within the database. If the data is evenly distributed across those ranges, fine; if not, one thread will run longer than the others. Note that this is a problem both solutions face.
>>
>> Then there’s the cluster itself. Again, YMMV on your Spark job vs a MapReduce job.
>>
>> In terms of launching the job, setup, etc., the Spark job could take longer to set up, but on long-running queries that becomes noise.
>>
>> The real question is what makes the most sense to you: where you have the most experience and what you feel most comfortable using.
>>
>> The other question is what you do with the data (RDDs, Datasets, DataFrames, etc.) once you have read it.
>>
>> HTH
>>
>> -Mike
>>
>> PS. I know that I’m responding to an earlier message in the thread, but this is something that I’ve heard lots of questions about, and it’s not a simple thing to answer. Since this is a batch process, the performance issues are moot.
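To make the range-splitting point concrete: Spark's JDBC reader exposes the same mechanism Sqoop uses in the log above (the BoundingValsQuery on prod_id followed by four splits). A minimal sketch for spark-shell, where `spark` is predefined, assuming an Oracle source; the URL, credentials, bounds, and output path are placeholders, not values from this thread:

    import java.util.Properties

    val props = new Properties()
    props.setProperty("user", "sh")       // placeholder credentials
    props.setProperty("password", "***")
    props.setProperty("driver", "oracle.jdbc.OracleDriver")

    // Spark does what DataDrivenDBInputFormat does in the log above: it slices
    // [lowerBound, upperBound] of the partition column into numPartitions ranges
    // and runs one ranged copy of the query per partition.
    val sales = spark.read.jdbc(
      "jdbc:oracle:thin:@//rhes564:1521/ORCL",  // placeholder URL
      "sh.sales",
      "prod_id",  // partition column, as in the BoundingValsQuery
      1L,         // lowerBound -- in practice SELECT MIN(prod_id); placeholder
      200L,       // upperBound -- in practice SELECT MAX(prod_id); placeholder
      4,          // numPartitions, matching the four splits in the log
      props
    )

    sales.write.mode("overwrite").parquet("/tmp/sh_sales")  // placeholder output path

Each partition opens its own connection and issues its own ranged query, so the connection-count and skew caveats Mike raises apply to Spark exactly as they do to the four Sqoop mappers.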
>>> On Aug 24, 2016, at 5:07 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>
>>> Personally I prefer Spark JDBC.
>>>
>>> Both Sqoop and Spark rely on the same JDBC drivers.
>>>
>>> I think Spark is faster, and if you have many nodes you can partition your incoming data and take advantage of Spark's DAG plus its in-memory offering.
>>>
>>> By default Sqoop uses MapReduce, which is pretty slow.
>>>
>>> Remember that for Spark you will need sufficient memory.
>>>
>>> HTH
>>>
>>> Dr Mich Talebzadeh
>>>
>>> LinkedIn https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
>>>
>>>> On 24 August 2016 at 22:39, Venkata Penikalapati <mail.venkatakart...@gmail.com> wrote:
>>>> Team,
>>>> Please help me choose between Sqoop and Spark JDBC for fetching data from an RDBMS. Sqoop has a lot of optimizations for fetching data; does Spark JDBC have those as well?
>>>>
>>>> I'm running some analytics in Spark on data that resides in an RDBMS.
>>>>
>>>> Please guide me on this.
>>>>
>>>> Thanks
>>>> Venkata Karthik P
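On Mich's partitioning advice and Mike's skew caveat (one thread running longer than the others): Spark's JDBC reader also accepts an explicit list of predicates, one per partition, so uneven key ranges can be balanced by hand instead of being sliced evenly between MIN and MAX. Again only a sketch for spark-shell; the boundaries and connection details are illustrative, not taken from the actual table:

    import java.util.Properties

    val props = new Properties()
    props.setProperty("user", "sh")       // placeholder credentials
    props.setProperty("password", "***")
    props.setProperty("driver", "oracle.jdbc.OracleDriver")

    // One WHERE clause per partition. These uneven ranges are made up for the
    // sketch; in practice they would be sized from the actual key histogram.
    val predicates = Array(
      "prod_id < 20",
      "prod_id >= 20 AND prod_id < 40",
      "prod_id >= 40 AND prod_id < 120",
      "prod_id >= 120"
    )

    val sales = spark.read.jdbc(
      "jdbc:oracle:thin:@//rhes564:1521/ORCL",  // placeholder URL
      "sh.sales",
      predicates,
      props
    )

    println(sales.rdd.getNumPartitions)  // 4: one partition per predicate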