Re: Sqoop on Spark

2016-04-05 Thread Takeshi Yamamuro
Hi, it depends on your use case for Sqoop. What does it look like? // maropu On Wed, Apr 6, 2016 at 1:26 PM, ayan guha wrote: > Hi All > Asking for opinions: is it possible/advisable to use Spark to replace what Sqoop does? Any existing projects done along similar lines? > Best Regards, > Ayan Guha

Re: Sqoop on Spark

2016-04-05 Thread ayan guha
Hi, thanks for the reply. My use case is to query ~40 tables from Oracle (index-based and incremental reads only) and add the data to existing Hive tables. It would also be good to have an option to create the Hive tables, driven by job-specific configuration. What do you think? Best, Ayan
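
For illustration, a minimal Spark sketch of the kind of incremental pull described above (Spark 2.x API with Hive support enabled; the connection details, schema, table, and key column are hypothetical placeholders, not anything from the thread):

    // Hypothetical incremental load: read only rows above the current
    // high-water mark in the Hive target, then append them.
    // Assumes object_id is a BIGINT key and the target table is non-empty.
    val lastMax = spark.sql("SELECT MAX(object_id) FROM mydb.t").first().getLong(0)

    val increment = spark.read.format("jdbc")
      .option("url", "jdbc:oracle:thin:@host:1521:mydb")
      .option("dbtable", s"(SELECT * FROM src.t WHERE object_id > $lastMax) tmp")
      .option("user", "user")
      .option("password", "...")
      .load()

    increment.write.mode("append").insertInto("mydb.t")

Note the subquery alias (tmp, with no AS, as Oracle requires): Spark substitutes the dbtable string into its own generated SELECT.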

Re: Sqoop on Spark

2016-04-05 Thread Takeshi Yamamuro
It's worth trying, though you might hit many technical issues; for example, I'm not sure that Spark completely supports all of the Oracle dialect (e.g., its types) in JDBCRelation. I'm also not sure Spark can handle the queries you have used in this use case. At the least, you would have to check whether Spark covers them.
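
One quick way to check the type-mapping concern raised here is to load a single row and inspect the schema that JDBCRelation infers (a sketch; the connection details and table are placeholders):

    // Pull one row and print how Spark's Oracle dialect mapped each column.
    // Oracle NUMBER columns without an explicit precision have historically
    // been a trouble spot, so verify them before running a full import.
    val probe = spark.read.format("jdbc")
      .option("url", "jdbc:oracle:thin:@host:1521:mydb")
      .option("dbtable", "(SELECT * FROM src.t WHERE ROWNUM <= 1) tmp")
      .option("user", "user")
      .option("password", "...")
      .load()

    probe.printSchema()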

Re: Sqoop on Spark

2016-04-05 Thread Jörn Franke
Why do you want to reimplement something which is already there? > On 06 Apr 2016, at 06:47, ayan guha wrote: [quotes Ayan's earlier message about the ~40 Oracle tables]

Re: Sqoop on Spark

2016-04-05 Thread ayan guha
One of the reasons in my mind is to avoid a MapReduce application completely during ingestion, if possible. Also, I can then use a Spark standalone cluster to ingest even if my Hadoop cluster is heavily loaded. What do you guys think?

Re: Sqoop on Spark

2016-04-05 Thread Mich Talebzadeh
Pretty simple, assuming that you have ojdbc6.jar in $SQOOP_HOME/lib:

    sqoop import --connect "jdbc:oracle:thin:@rhes564:1521:mydb" --username hddtester -P \
      --query "select * from hddtester.t where \$CONDITIONS" \
      --split-by object_id \
      --hive-import --create-hive-table ...
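
For comparison, a rough Spark analogue of the same parallel import, where partitionColumn plays the role of Sqoop's --split-by (a sketch only; the bounds and partition count are invented and should really come from MIN/MAX of the key):

    // Spark issues numPartitions range queries over object_id,
    // much as Sqoop derives splits from --split-by.
    val df = spark.read.format("jdbc")
      .option("url", "jdbc:oracle:thin:@rhes564:1521:mydb")
      .option("user", "hddtester")
      .option("password", "...")
      .option("dbtable", "hddtester.t")
      .option("partitionColumn", "object_id")
      .option("lowerBound", "1")        // assumed MIN(object_id)
      .option("upperBound", "1000000")  // assumed MAX(object_id)
      .option("numPartitions", "8")
      .load()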

Re: Sqoop on Spark

2016-04-05 Thread Jörn Franke
I do not think you can be more resource-efficient. In the end you have to store the data on HDFS anyway. You face a lot of development effort for doing something like Sqoop, especially around error handling. You may create a ticket with the Sqoop developers to support Spark as an execution engine …

Re: Sqoop on Spark

2016-04-05 Thread ayan guha
Thanks, guys, for the feedback. On Wed, Apr 6, 2016 at 3:44 PM, Jörn Franke wrote: [quotes his message above]

Re: Sqoop on Spark

2016-04-06 Thread Jorge Sánchez
Ayan, there was a talk at Spark Summit: https://spark-summit.org/2015/events/Sqoop-on-Spark-for-Data-Ingestion/ Apparently they had a lot of problems, and the project seems abandoned. If you just have to do simple ingestion of a full table or a simple query, just use Sqoop as suggested by Mich, but …

Re: Sqoop on Spark

2016-04-06 Thread Mich Talebzadeh
Yes, JDBC is another option. You need to be aware of some conversion issues; for example, Spark does not like CHAR types, etc. Your best bet is to do the conversion when fetching the data from Oracle itself.

    var _ORACLEserver : String = "jdbc:oracle:thin:@rhes564:1521:mydb"
    var _username : String = "sh"
    var _password : String = "..."
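
Continuing that fragment, a sketch of pushing the conversion down to Oracle in the query itself, along the lines of the TO_CHAR/CAST advice given later in this thread (the sh.sales table and its columns are just an example):

    // Let Oracle render awkward NUMBER/CHAR columns as strings before
    // Spark ever sees them, sidestepping dialect mapping issues.
    val sqlText = "(SELECT TO_CHAR(amount_sold) AS amount_sold, " +
                  "CAST(prod_id AS VARCHAR2(20)) AS prod_id FROM sh.sales) tmp"

    val df = spark.read.format("jdbc")
      .option("url", _ORACLEserver)
      .option("dbtable", sqlText)
      .option("user", _username)
      .option("password", _password)
      .load()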

Re: Sqoop on Spark

2016-04-06 Thread Michael Segel
I don’t think it’s necessarily a bad idea. Sqoop is an ugly tool, and it requires you to make some assumptions as a way to gain parallelism. (Not that most of the assumptions are not valid for most of the use cases…) Depending on what you want to do, your data may not be persisted on HDFS …

Re: Sqoop on Spark

2016-04-06 Thread Ranadip Chatterjee
I know of projects that have done this, but I have never seen any advantage in "using Spark to do what Sqoop does" - at least in a YARN cluster. Both frameworks will have similar overheads of getting the containers allocated by YARN and creating new JVMs to do the work. Probably Spark will have a slight …

Re: Sqoop on Spark

2016-04-06 Thread Mich Talebzadeh
I just created an example of how to use Spark JDBC to get Oracle data into a Hive table. Please see the thread below: How to use Spark JDBC to read from RDBMS table, create Hive ORC table and save RDBMS data in it. HTH, Mich Talebzadeh
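
The referenced thread is not reproduced in this archive, but the flow it names (read over JDBC, then land the result in a Hive ORC table) looks roughly like the following sketch, which is not the actual code from that thread:

    // Register the JDBC DataFrame and materialize it as an ORC-backed Hive
    // table; requires a Hive-enabled SparkSession (or HiveContext on 1.x).
    df.createOrReplaceTempView("staging")
    spark.sql(
      "CREATE TABLE IF NOT EXISTS mydb.t_orc STORED AS ORC AS SELECT * FROM staging")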

Re: Sqoop on Spark

2016-04-06 Thread Jörn Franke
Well, I am not sure, but using a database as storage for Spark, such as a relational database or certain NoSQL databases (e.g. MongoDB), is generally a bad idea: there is no data locality, it cannot handle really big data volumes for compute, and you may potentially overload an operational database. And if you …

Re: Sqoop on Spark

2016-04-06 Thread Mich Talebzadeh
Sorry, are you referring to Hive as a relational data warehouse in this scenario? The assumption here is that the data is coming from a relational database (Oracle), so IMO the best storage for it in the big data world is another SQL-friendly DW. Spark is a powerful query tool, and together with Hive as a …

Re: Sqoop on Spark

2016-04-06 Thread Peyman Mohajerian
For some MPP relational stores (not operational ones) it may be feasible to run Spark jobs and also have data locality. I know QueryGrid (Teradata) and PolyBase (Microsoft) use data locality to move data between their MPP systems and Hadoop. I would guess (though I have no idea) that someone like IBM is already doing that for …

Re: Sqoop on Spark

2016-04-06 Thread Mich Talebzadeh
SAP Sybase IQ does that, and I believe SAP HANA does as well.

RE: Sqoop on Spark

2016-04-06 Thread Yong Zhang
If they do that, they must provide a customized InputFormat, instead of going through JDBC. Yong

Re: Sqoop on Spark

2016-04-06 Thread Peyman Mohajerian
QueryGrid is for interactive data movement. On Wed, Apr 6, 2016 at 4:05 PM, Yong Zhang wrote: [quotes his message above]

RE: Sqoop on Spark

2016-04-06 Thread Yong Zhang
(quoting Peyman Mohajerian) > It is using the JDBC driver, I know that …

Re: Sqoop on Spark

2016-04-08 Thread Gourav Sengupta
(quoting Yong Zhang) > …I still think it makes sense that vendors provide some kind of InputFormat, or a data source in Spark, so the Hadoop eco-system can integrate with them more natively. > Yong

Re: Sqoop on Spark

2016-04-08 Thread Mich Talebzadeh
(quotes Yong Zhang's message above)

Re: Sqoop on Spark

2016-04-10 Thread Michael Segel
On Thu, Apr 7, 2016 at 12:26 AM, Yong Zhang wrote: > Good to know that. > That is why Sqoop has this "direct" mode, to utilize the vendor-specific feature. > But for MPP, I still think it makes sense that vendors provide some kind of …

Re: Sqoop on Spark

2016-04-10 Thread Mich Talebzadeh
(quotes Gourav Sengupta's and Yong Zhang's messages above)

Re: Sqoop on Spark

2016-04-10 Thread Jörn Franke
(quoting Gourav Sengupta) > …(similar to SQOOP, difficult parallelization) to load 500 million records - manually killed after 8 hours. (Both the above studies were done on a system of the same capacity, with 32 GB RAM, dual hexa-core Xeon processors and SSD. SPARK was running locally, and SQOOP ran on HADOOP2 and extracted data to the local file system.)

Re: Sqoop on Spark

2016-04-11 Thread Michael Segel
(quotes Gourav Sengupta's benchmark message above)

Re: Sqoop on Spark

2016-04-11 Thread Jörn Franke
(quotes Gourav Sengupta's metrics message of 8 April 2016; see below)

Re: Sqoop on Spark

2016-04-14 Thread Gourav Sengupta
(quotes his earlier benchmark message; see above)

Re: Sqoop on Spark

2016-04-14 Thread Jörn Franke
(quoting Mich Talebzadeh) > …views take care of NUMBER and CHAR columns in Oracle and convert them to TO_CHAR(NUMBER_COLUMN) and varchar via CAST(coln AS VARCHAR2(n)) AS coln, etc. > HTH

Re: Sqoop on Spark

2016-04-14 Thread Mich Talebzadeh
On 8 April 2016 at 10:07, Gourav Sengupta wrote: > Hi, > Some metrics thrown around the discussion: > SQOOP: extracted …

Re: Sqoop On Spark

2016-08-01 Thread Takeshi Yamamuro
Hi, Have you seen this previous thread? https://www.mail-archive.com/user@spark.apache.org/msg49025.html I'm not sure this is what you want, though. // maropu On Tue, Aug 2, 2016 at 1:52 PM, Selvam Raman wrote: > Hi Team, > How can I use Spark as the execution engine in Sqoop2? I see the patch (…

Re: Sqoop on Spark

2016-09-14 Thread Mich Talebzadeh
Sqoop is a standalone product (a utility) that is used to get data out of JDBC-compliant database tables into HDFS, and into Hive if specified. Spark can also use JDBC to get data out of such tables. However, I have not come across a situation where Sqoop is invoked from Spark. Have a look at the Sqoop documentation …
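
For reference, the Spark side of that comparison is a single DataFrameReader.jdbc call; everything else Sqoop provides (splits, Hive DDL, error handling) is left to you. A minimal sketch with placeholder connection details:

    import java.util.Properties

    val props = new Properties()
    props.setProperty("user", "scott")   // placeholder credentials
    props.setProperty("password", "...")

    // One call pulls a JDBC-compliant table into a DataFrame.
    val df = spark.read.jdbc("jdbc:oracle:thin:@host:1521:mydb", "scott.emp", props)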