I do not know why this is happening.

Trying to load an HBase table at the command line:

hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator=',' \
  -Dimporttsv.columns="HBASE_ROW_KEY,c1,c2" t2 \
  hdfs://rhes564:9000/tmp/crap.txt

It comes back with this error:


2016-09-22 00:12:46,576 INFO  [main] mapreduce.JobSubmitter: Submitting tokens for job: job_1474455325627_0052
2016-09-22 00:12:46,755 INFO  [main] impl.YarnClientImpl: Submitted application application_1474455325627_0052 to ResourceManager at rhes564/50.140.197.217:8032
2016-09-22 00:12:46,783 INFO  [main] mapreduce.Job: The url to track the job: http://http://rhes564:8088/proxy/application_1474455325627_0052/
2016-09-22 00:12:46,783 INFO  [main] mapreduce.Job: Running job: job_1474455325627_0052
2016-09-22 00:12:55,913 INFO  [main] mapreduce.Job: Job job_1474455325627_0052 running in uber mode : false
2016-09-22 00:12:55,915 INFO  [main] mapreduce.Job:  map 0% reduce 0%
2016-09-22 00:13:01,994 INFO  [main] mapreduce.Job:  map 100% reduce 0%
2016-09-22 00:13:03,008 INFO  [main] mapreduce.Job: Job job_1474455325627_0052 completed successfully
Exception in thread "main" java.lang.IllegalArgumentException: No enum constant org.apache.hadoop.mapreduce.JobCounter.MB_MILLIS_MAPS
        at java.lang.Enum.valueOf(Enum.java:238)
        at org.apache.hadoop.mapreduce.counters.FrameworkCounterGroup.valueOf(FrameworkCounterGroup.java:148)
        at org.apache.hadoop.mapreduce.counters.FrameworkCounterGroup.findCounter(FrameworkCounterGroup.java:182)
        at org.apache.hadoop.mapreduce.counters.AbstractCounters.findCounter(AbstractCounters.java:154)
        at org.apache.hadoop.mapreduce.TypeConverter.fromYarn(TypeConverter.java:240)
        at org.apache.hadoop.mapred.ClientServiceDelegate.getJobCounters(ClientServiceDelegate.java:370)
        at org.apache.hadoop.mapred.YARNRunner.getJobCounters(YARNRunner.java:511)
        at org.apache.hadoop.mapreduce.Job$7.run(Job.java:756)
        at org.apache.hadoop.mapreduce.Job$7.run(Job.java:753)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
        at org.apache.hadoop.mapreduce.Job.getCounters(Job.java:753)
        at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1361)
        at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1289)
        at org.apache.hadoop.hbase.mapreduce.ImportTsv.run(ImportTsv.java:680)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
        at org.apache.hadoop.hbase.mapreduce.ImportTsv.main(ImportTsv.java:684)




Dr Mich Talebzadeh






On 21 September 2016 at 21:47, Jörn Franke <jornfra...@gmail.com> wrote:

> I think there might still be something messed up with the classpath. It
> complains in the logs about deprecated jars and deprecated configuration
> files.
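>
> One quick way to check which version of the MapReduce client classes
> actually wins on that classpath (a minimal sketch, assuming you can start a
> JVM REPL such as spark-shell or the Scala REPL with the same classpath the
> hbase launcher builds) is to ask where JobCounter is loaded from and
> whether the version found even defines MB_MILLIS_MAPS:
>
>   import org.apache.hadoop.mapreduce.JobCounter
>
>   // Which jar supplies the JobCounter enum on this classpath?
>   // (getCodeSource can be null for bootstrap classes, but not for a jar.)
>   println(classOf[JobCounter].getProtectionDomain.getCodeSource.getLocation)
>
>   // Does that version of the enum define MB_MILLIS_MAPS at all?
>   println(JobCounter.values().map(_.name).contains("MB_MILLIS_MAPS"))
>
> If the jar reported is an older hadoop-mapreduce-client-core than the one
> shipped under /home/hduser/hadoop-2.7.3, that could explain the missing
> enum constant.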
>
> On 21 Sep 2016, at 22:21, Mich Talebzadeh <mich.talebza...@gmail.com>
> wrote:
>
> Well, I am left with using Spark for importing data from an RDBMS table to Hadoop.
>
> You may ask why: it is because Spark does it in one process and with no
> errors.
>
> With Sqoop I am getting this error message, which leaves the RDBMS table
> data in an HDFS file but then stops there.
>
> 2016-09-21 21:00:15,084 [myid:] - INFO  [main:OraOopLog@103] - Data
> Connector for Oracle and Hadoop is disabled.
> 2016-09-21 21:00:15,095 [myid:] - INFO  [main:SqlManager@98] - Using
> default fetchSize of 1000
> 2016-09-21 21:00:15,095 [myid:] - INFO  [main:CodeGenTool@92] - Beginning
> code generation
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in [jar:file:/data6/hduser/hbase-
> 0.98.21-hadoop2/lib/phoenix-4.8.0-HBase-0.98-client.jar!/org/slf4j/impl/
> StaticLoggerBinder.class]
> SLF4J: Found binding in [jar:file:/data6/hduser/hbase-
> 0.98.21-hadoop2/lib/phoenix-4.8.0-HBase-0.98-hive.jar!/org/
> slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in [jar:file:/data6/hduser/hbase-
> 0.98.21-hadoop2/lib/phoenix-4.8.0-HBase-0.98-thin-client.
> jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in [jar:file:/data6/hduser/hbase-
> 0.98.21-hadoop2/lib/slf4j-log4j12-1.6.4.jar!/org/slf4j/
> impl/StaticLoggerBinder.class]
> SLF4J: Found binding in [jar:file:/home/hduser/hadoop-
> 2.7.3/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/
> slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an
> explanation.
> 2016-09-21 21:00:15,681 [myid:] - INFO  [main:OracleManager@417] - Time
> zone has been set to GMT
> 2016-09-21 21:00:15,717 [myid:] - INFO  [main:SqlManager@757] - Executing
> SQL statement: select * from sh.sales where            (1 = 0)
> 2016-09-21 21:00:15,727 [myid:] - INFO  [main:SqlManager@757] - Executing
> SQL statement: select * from sh.sales where            (1 = 0)
> 2016-09-21 21:00:15,748 [myid:] - INFO  [main:CompilationManager@94] -
> HADOOP_MAPRED_HOME is /home/hduser/hadoop-2.7.3/share/hadoop/mapreduce
> Note: 
> /tmp/sqoop-hduser/compile/82dcf5975118b5e271b442e547201fdf/QueryResult.java
> uses or overrides a deprecated API.
> Note: Recompile with -Xlint:deprecation for details.
>
> 2016-09-21 21:00:17,354 [myid:] - INFO  [main:CompilationManager@330] - Writing jar file: /tmp/sqoop-hduser/compile/82dcf5975118b5e271b442e547201fdf/QueryResult.jar
> 2016-09-21 21:00:17,366 [myid:] - INFO  [main:ImportJobBase@237] - Beginning query import.
> 2016-09-21 21:00:17,511 [myid:] - WARN  [main:NativeCodeLoader@62] -
> Unable to load native-hadoop library for your platform... using
> builtin-java classes where applicable
> 2016-09-21 21:00:17,516 [myid:] - INFO  [main:Configuration@840] -
> mapred.jar is deprecated. Instead, use mapreduce.job.jar
> 2016-09-21 21:00:17,993 [myid:] - INFO  [main:Configuration@840] -
> mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
> 2016-09-21 21:00:18,094 [myid:] - INFO  [main:RMProxy@56] - Connecting to
> ResourceManager at rhes564/50.140.197.217:8032
> 2016-09-21 21:00:23,441 [myid:] - INFO  [main:DBInputFormat@192] - Using
> read commited transaction isolation
> 2016-09-21 21:00:23,442 [myid:] - INFO  [main:DataDrivenDBInputFormat@147]
> - BoundingValsQuery: SELECT MIN(prod_id), MAX(prod_id) FROM (select * from
> sh.sales where            (1 = 1) ) t1
> 2016-09-21 21:00:23,540 [myid:] - INFO  [main:JobSubmitter@394] - number
> of splits:4
> 2016-09-21 21:00:23,547 [myid:] - INFO  [main:Configuration@840] -
> mapred.job.name is deprecated. Instead, use mapreduce.job.name
> 2016-09-21 21:00:23,547 [myid:] - INFO  [main:Configuration@840] -
> mapred.cache.files.timestamps is deprecated. Instead, use
> mapreduce.job.cache.files.timestamps
> 2016-09-21 21:00:23,547 [myid:] - INFO  [main:Configuration@840] -
> mapreduce.map.class is deprecated. Instead, use mapreduce.job.map.class
> 2016-09-21 21:00:23,547 [myid:] - INFO  [main:Configuration@840] -
> mapreduce.inputformat.class is deprecated. Instead, use
> mapreduce.job.inputformat.class
> 2016-09-21 21:00:23,547 [myid:] - INFO  [main:Configuration@840] -
> mapreduce.outputformat.class is deprecated. Instead, use
> mapreduce.job.outputformat.class
> 2016-09-21 21:00:23,548 [myid:] - INFO  [main:Configuration@840] -
> mapred.output.value.class is deprecated. Instead, use
> mapreduce.job.output.value.class
> 2016-09-21 21:00:23,548 [myid:] - INFO  [main:Configuration@840] -
> mapred.output.dir is deprecated. Instead, use mapreduce.output.
> fileoutputformat.outputdir
> 2016-09-21 21:00:23,548 [myid:] - INFO  [main:Configuration@840] -
> mapred.cache.files is deprecated. Instead, use mapreduce.job.cache.files
> 2016-09-21 21:00:23,548 [myid:] - INFO  [main:Configuration@840] -
> mapred.working.dir is deprecated. Instead, use mapreduce.job.working.dir
> 2016-09-21 21:00:23,548 [myid:] - INFO  [main:Configuration@840] -
> mapred.job.classpath.files is deprecated. Instead, use
> mapreduce.job.classpath.files
> 2016-09-21 21:00:23,548 [myid:] - INFO  [main:Configuration@840] -
> user.name is deprecated. Instead, use mapreduce.job.user.name
> 2016-09-21 21:00:23,548 [myid:] - INFO  [main:Configuration@840] -
> mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
> 2016-09-21 21:00:23,549 [myid:] - INFO  [main:Configuration@840] -
> mapred.cache.files.filesizes is deprecated. Instead, use
> mapreduce.job.cache.files.filesizes
> 2016-09-21 21:00:23,549 [myid:] - INFO  [main:Configuration@840] -
> mapred.output.key.class is deprecated. Instead, use
> mapreduce.job.output.key.class
> 2016-09-21 21:00:23,656 [myid:] - INFO  [main:JobSubmitter@477] -
> Submitting tokens for job: job_1474455325627_0045
> 2016-09-21 21:00:23,955 [myid:] - INFO  [main:YarnClientImpl@174] -
> Submitted application application_1474455325627_0045 to ResourceManager at
> rhes564/50.140.197.217:8032
> 2016-09-21 21:00:23,980 [myid:] - INFO  [main:Job@1272] - The url to
> track the job: http://http://rhes564:8088/proxy/application_
> 1474455325627_0045/
> 2016-09-21 21:00:23,981 [myid:] - INFO  [main:Job@1317] - Running job:
> job_1474455325627_0045
> 2016-09-21 21:00:31,180 [myid:] - INFO  [main:Job@1338] - Job
> job_1474455325627_0045 running in uber mode : false
> 2016-09-21 21:00:31,182 [myid:] - INFO  [main:Job@1345] -  map 0% reduce
> 0%
> 2016-09-21 21:00:40,260 [myid:] - INFO  [main:Job@1345] -  map 25% reduce
> 0%
> 2016-09-21 21:00:44,283 [myid:] - INFO  [main:Job@1345] -  map 50% reduce
> 0%
> 2016-09-21 21:00:48,308 [myid:] - INFO  [main:Job@1345] -  map 75% reduce
> 0%
> 2016-09-21 21:00:55,346 [myid:] - INFO  [main:Job@1345] -  map 100%
> reduce 0%
>
> 2016-09-21 21:00:56,359 [myid:] - INFO  [main:Job@1356] - Job job_1474455325627_0045 completed successfully
> 2016-09-21 21:00:56,501 [myid:] - ERROR [main:ImportTool@607] - Imported Failed: No enum constant org.apache.hadoop.mapreduce.JobCounter.MB_MILLIS_MAPS
>
>
>
>
>
>
>
>
>
>
> On 21 September 2016 at 20:56, Michael Segel <michael_se...@hotmail.com>
> wrote:
>
>> Uhmmm…
>>
>> A bit of a longer-ish answer…
>>
>> Spark may or may not be faster than sqoop. The standard caveats apply…
>> YMMV.
>>
>> The reason I say this… you have a couple of limiting factors. The main
>> one is the number of connections allowed with the target RDBMS.
>>
>> Then there’s the data distribution within the partitions / ranges in the
>> database.
>> By this, I mean that using any parallel solution, you need to run copies
>> of your query in parallel over different ranges within the database. Most
>> of the time you may run the query over a database where there is even
>> distribution… if not, then you will have one thread run longer than the
>> others.  Note that this is a problem that both solutions would face.
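>>
>> To make that concrete, here is a minimal sketch (assuming Spark 2.x, a
>> hypothetical JDBC URL, user and bounds; prod_id stands in for whatever
>> numeric split column the table has) of how a Spark JDBC read fans the same
>> query out over ranges:
>>
>>   import org.apache.spark.sql.SparkSession
>>
>>   val spark = SparkSession.builder().appName("jdbc-ranged-read").getOrCreate()
>>
>>   // Spark issues one query per partition, each restricted to a sub-range of
>>   // the split column, so a skewed column means one task runs much longer.
>>   // numPartitions is also roughly the number of concurrent connections
>>   // opened against the RDBMS.
>>   val sales = spark.read
>>     .format("jdbc")
>>     .option("url", "jdbc:oracle:thin:@//dbhost:1521/mydb")  // hypothetical
>>     .option("dbtable", "sh.sales")
>>     .option("user", "someuser")                             // hypothetical
>>     .option("password", "***")
>>     .option("partitionColumn", "prod_id")  // numeric split column
>>     .option("lowerBound", "1")             // hypothetical range
>>     .option("upperBound", "10000")
>>     .option("numPartitions", "4")
>>     .load()
>>
>> Sqoop's split-by logic does essentially the same thing, so both tools run
>> into the same skew and connection-count limits.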
>>
>> Then there’s the cluster itself.
>> Again YMMV on your spark job vs a Map/Reduce job.
>>
>> In terms of launching the job, setup, etc. … the Spark job could take
>> longer to set up.  But on long-running queries, that becomes noise.
>>
>> The issue is what makes the most sense to you, where you have the most
>> experience, and what you feel most comfortable using.
>>
>> The other issue is what you do with the data (RDDs, Datasets, DataFrames,
>> etc.) once you have read it.
>>
>>
>> HTH
>>
>> -Mike
>>
>> PS. I know that I’m responding to an earlier message in the thread, but
>> this is something that I’ve heard lots of questions about… and it’s not a
>> simple thing to answer. Since this is a batch process, the performance
>> issues are moot.
>>
>> On Aug 24, 2016, at 5:07 PM, Mich Talebzadeh <mich.talebza...@gmail.com>
>> wrote:
>>
>> Personally I prefer Spark JDBC.
>>
>> Both Sqoop and Spark rely on the same drivers.
>>
>> I think Spark is faster, and if you have many nodes you can partition your
>> incoming data and take advantage of Spark's DAG and in-memory processing.
>>
>> By default Sqoop uses MapReduce, which is pretty slow.
>>
>> Remember that for Spark you will need to have sufficient memory.
>>
>> HTH
>>
>>
>>
>>
>> On 24 August 2016 at 22:39, Venkata Penikalapati <
>> mail.venkatakart...@gmail.com> wrote:
>>
>>> Team,
>>> Please help me in choosing Sqoop or Spark JDBC to fetch data from an RDBMS.
>>> Sqoop has a lot of optimizations for fetching data; does Spark JDBC also
>>> have those?
>>>
>>> I'm performing a few analytics using Spark, for which the data is residing
>>> in an RDBMS.
>>>
>>> Please guide me with this.
>>>
>>>
>>> Thanks
>>> Venkata Karthik P
>>>
>>>
>>
>>
>
