Re: Sqoop vs spark jdbc

2016-09-21 Thread Don Drake
…standard caveats apply… YMMV. The reason I say this… you have a couple of limiting factors. The main one being the number of connections allowed with the target RDBMS. Then there's the data distribution within the partition…

Re: Sqoop vs spark jdbc

2016-09-21 Thread Mich Talebzadeh
…job could take longer to set up. But on long-running queries, that becomes noise. The issue is what makes the most sense to you, where do you have the most experience, and what do you feel the most comfortable in using. The other issue is what do you…

Re: Sqoop vs spark jdbc

2016-09-21 Thread Jörn Franke
…job could take longer to set up. But on long-running queries, that becomes noise. The issue is what makes the most sense to you, where do you have the most experience, and what do you feel the most comfortable in using. The other issue is what do…

Re: Sqoop vs spark jdbc

2016-09-21 Thread Mich Talebzadeh
…know that I'm responding to an earlier message in the thread, but this is something that I've heard lots of questions about… and it's not a simple thing to answer… Since this is a batch process, the performance issues are moot. On Aug 24, 2016, at 5:07 PM, Mich Talebzadeh…

Re: Sqoop vs spark jdbc

2016-09-21 Thread Mich Talebzadeh
…damages arising from such loss, damage or destruction. On 25 August 2016 at 09:07, Bhaskar Dutta <bhas...@gmail.com> wrote: Which RDBMS are you using here, and what is the data volume and frequency of…

Re: Sqoop on Spark

2016-09-14 Thread Mich Talebzadeh
…any monetary damages arising from such loss, damage or destruction. On 14 September 2016 at 15:31, KhajaAsmath Mohammed <mdkhajaasm...@gmail.com> wrote: Hi Experts, Good morning. I am looking for some references on how to use Sqoop with Spark. Could you please let…

Sqoop on Spark

2016-09-14 Thread KhajaAsmath Mohammed
Hi Experts, Good morning. I am looking for some references on how to use Sqoop with Spark. Could you please let me know if there are any references on how to use it. Thanks, Asmath.
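For readers landing on this question: a common pattern is to run a plain Sqoop import into HDFS and then read the resulting files from Spark. A minimal command-line sketch, assuming a MySQL source; the host, database, table, credentials, and paths are all hypothetical placeholders:

```shell
# Import a table into HDFS with 4 parallel mappers (placeholder values).
sqoop import \
  --connect jdbc:mysql://dbhost:3306/sales \
  --username etl_user -P \
  --table orders \
  --num-mappers 4 \
  --target-dir /data/raw/orders

# The resulting files can then be read from Spark, e.g. in spark-shell:
#   val df = spark.read.csv("/data/raw/orders")
```

The alternative, discussed at length elsewhere in this archive, is to skip Sqoop entirely and pull straight from the RDBMS with Spark's own JDBC data source.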

Re: Sqoop vs spark jdbc

2016-08-25 Thread Mich Talebzadeh
..@gmail.com> wrote: Which RDBMS are you using here, and what is the data volume and frequency of pulling data off the RDBMS? Specifying these would help in giving better answers. Sqoop has a direct mode (non-JDBC) support for Postgres…

Re: Sqoop vs spark jdbc

2016-08-25 Thread Bhaskar Dutta
…the RDBMS? Specifying these would help in giving better answers. Sqoop has a direct mode (non-JDBC) support for Postgres, MySQL and Oracle, so you can use that for better performance if using one of these databases. And don't forget that…

Re: Sqoop vs spark jdbc

2016-08-25 Thread Mich Talebzadeh
…so you can use that for better performance if using one of these databases. And don't forget that Sqoop can load data directly into Parquet or Avro (I think direct mode is not supported in this case). Also you can use Kite SDK with Sqoop to manage/transform datasets, perform sc…
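The direct-mode and Parquet options mentioned here can be sketched as follows; the connection strings, credentials, and paths are hypothetical placeholders, and, as the message above notes, direct mode and Parquet output are not combined:

```shell
# Direct mode: Sqoop shells out to the database's native dump tooling
# (e.g. psql COPY for Postgres) instead of going through JDBC.
sqoop import \
  --connect jdbc:postgresql://dbhost/sales \
  --username etl_user -P \
  --table orders \
  --direct \
  --target-dir /data/raw/orders

# Parquet output goes through the regular JDBC path instead:
sqoop import \
  --connect jdbc:postgresql://dbhost/sales \
  --username etl_user -P \
  --table orders \
  --as-parquetfile \
  --target-dir /data/parquet/orders
```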

Re: Sqoop vs spark jdbc

2016-08-25 Thread Bhaskar Dutta
…Penikalapati <mail.venkatakart...@gmail.com> wrote: Team, Please help me in choosing Sqoop or Spark JDBC to fetch data from an RDBMS. Sqoop has a lot of optimizations to fetch data; does Spark JDBC also have those? I'm performing a few analytics using Spark, for which the data is residing in an RDBMS…

Re: Sqoop vs spark jdbc

2016-08-25 Thread Sean Owen
…I am not sure what Spark really does for you. MapReduce won't matter here. The bottleneck is reading from the RDBMS in general. On Wed, Aug 24, 2016 at 11:07 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote: Personally I prefer Spark JDBC. Both Sqoop and Spark rely on t…

Re: Sqoop vs spark jdbc

2016-08-24 Thread ayan guha
…Ranadip Chatterjee <ranadi...@gmail.com> wrote: This will depend on multiple factors. Assuming we are talking significant volumes of data, I'd prefer Sqoop compared to Spark on YARN, if ingestion performance is the sole consideration (which is true in many production use…

Re: Sqoop vs spark jdbc

2016-08-24 Thread Ranadip Chatterjee
This will depend on multiple factors. Assuming we are talking significant volumes of data, I'd prefer Sqoop compared to Spark on YARN, if ingestion performance is the sole consideration (which is true in many production use cases). Sqoop provides some potential optimisations, specially around using…

Re: Sqoop vs spark jdbc

2016-08-24 Thread Mich Talebzadeh
Personally I prefer Spark JDBC. Both Sqoop and Spark rely on the same drivers. I think Spark is faster, and if you have many nodes you can partition your incoming data and take advantage of Spark's DAG + in-memory offering. By default Sqoop will use MapReduce, which is pretty slow. Remember…
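The partitioned read Mich describes is what Spark's JDBC source does when given `partitionColumn`, `lowerBound`, `upperBound`, and `numPartitions`: it turns the key range into one WHERE clause per task so the reads run concurrently. A small self-contained Python sketch of that predicate generation; this illustrates the idea, it is not Spark's actual implementation:

```python
def jdbc_partition_predicates(column, lower, upper, num_partitions):
    """Split a numeric key range into per-partition WHERE clauses,
    mimicking how a partitioned JDBC read assigns work to tasks.
    The first and last partitions are left open-ended so rows
    outside [lower, upper) are not silently dropped."""
    stride = (upper - lower) // num_partitions
    predicates = []
    for i in range(num_partitions):
        lo = lower + i * stride
        hi = lo + stride
        if num_partitions == 1:
            predicates.append("1=1")  # one partition reads the whole table
        elif i == 0:
            predicates.append(f"{column} < {hi}")
        elif i == num_partitions - 1:
            predicates.append(f"{column} >= {lo}")
        else:
            predicates.append(f"{column} >= {lo} AND {column} < {hi}")
    return predicates

# Four concurrent queries over an integer key in [0, 1000):
print(jdbc_partition_predicates("id", 0, 1000, 4))
# → ['id < 250', 'id >= 250 AND id < 500', 'id >= 500 AND id < 750', 'id >= 750']
```

Each predicate becomes a separate SELECT against the source database, which is why the useful number of partitions is bounded in practice by the connection limit of the target RDBMS, as noted elsewhere in the thread.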

Sqoop vs spark jdbc

2016-08-24 Thread Venkata Penikalapati
Team, Please help me in choosing Sqoop or Spark JDBC to fetch data from an RDBMS. Sqoop has a lot of optimizations to fetch data; does Spark JDBC also have those? I'm performing a few analytics using Spark, for which the data is residing in an RDBMS. Please guide me with this. Thanks, Venkata Karthik P

Re: Spark 1.6.2 can read hive tables created with sqoop, but Spark 2.0.0 cannot

2016-08-11 Thread cdecleene
…View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-6-2-can-read-hive-tables-created-with-sqoop-but-Spark-2-0-0-cannot-tp27502p27516.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Spark 1.6.2 can read hive tables created with sqoop, but Spark 2.0.0 cannot

2016-08-10 Thread cdecleene
Using the Scala API instead of the Python API yields the same results. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-6-2-can-read-hive-tables-created-with-sqoop-but-Spark-2-0-0-cannot-tp27502p27506.html

Re: Spark 1.6.2 can read hive tables created with sqoop, but Spark 2.0.0 cannot

2016-08-09 Thread Mich Talebzadeh
…enableHiveSupport().getOrCreate() df = spark.sql("SELECT * FROM dra_agency_analytics.raw_ewt_agcy_dim") … the dataframe df has the correct number of rows and the correct columns, but all values read as "None". -- View this message in context…

Re: Spark 1.6.2 can read hive tables created with sqoop, but Spark 2.0.0 cannot

2016-08-09 Thread Davies Liu
…the correct number of rows and the correct columns, but all values read as "None". … http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-6-2-can-read-hi…

Spark 1.6.2 can read hive tables created with sqoop, but Spark 2.0.0 cannot

2016-08-09 Thread cdecleene
…("SELECT * FROM dra_agency_analytics.raw_ewt_agcy_dim") … the dataframe df has the correct number of rows and the correct columns, but all values read as "None". -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-6-2-can-read-hive-tables-crea…
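One workaround often suggested for Hive Parquet tables written by non-Spark tools is to make Spark fall back to the Hive SerDe instead of its native Parquet reader. This is offered as an assumption about the likely cause, not a confirmed diagnosis of this report: Sqoop's Parquet output can carry field names/metadata that Spark 2.0's built-in reader resolves differently from the Hive metastore schema, yielding all-null columns.

```
# spark-defaults.conf (or set at runtime via spark.conf.set):
# read Hive metastore Parquet tables through the Hive SerDe rather
# than Spark's native Parquet reader.
spark.sql.hive.convertMetastoreParquet  false
```

`spark.sql.hive.convertMetastoreParquet` is a real Spark SQL option; whether it resolves this particular table depends on how Sqoop wrote the files.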

Re: Sqoop On Spark

2016-08-01 Thread Takeshi Yamamuro
…execution engine in sqoop2. I see the patch (SQOOP-1532 <https://issues.apache.org/jira/browse/SQOOP-1532>) but it shows in progress. So can we not use Sqoop on Spark? Please help me if you have any idea. -- Selvam Raman "லஞ்சம் தவிர்த்து நெஞ்சம் நிமிர்த்து" -- --- Takeshi Yamamuro

Sqoop On Spark

2016-08-01 Thread Selvam Raman
Hi Team, how can I use Spark as the execution engine in Sqoop2? I see the patch (SQOOP-1532 <https://issues.apache.org/jira/browse/SQOOP-1532>) but it shows in progress. So can we not use Sqoop on Spark? Please help me if you have any idea. -- Selvam Raman "லஞ்சம் தவிர்த்து நெஞ்சம் நிமிர்த்து"

Re: Sqoop on Spark

2016-04-14 Thread Mich Talebzadeh
…On 8 April 2016 at 10:07, Gourav Sengupta <gourav.sengu...@gmail.com> wrote: Hi, Some metrics thrown around the discussion: SQOOP: extract 500 million rows…

Re: Sqoop on Spark

2016-04-14 Thread Jörn Franke
…convert them to TO_CHAR(NUMBER_COLUMN) and varchar CAST(coln AS VARCHAR2(n)) AS coln etc. HTH…

Re: Sqoop on Spark

2016-04-14 Thread Gourav Sengupta
…records - manually killed after 8 hours. (Both the above studies were done on a system of the same capacity, with 32 GB RAM, dual hexacore Xeon processors and SSD. SPARK was running locally, and SQOOP ran on HADOOP2 and extracted data to the local file…

Re: Sqoop on Spark

2016-04-11 Thread Jörn Franke
…Gourav Sengupta <gourav.sengu...@gmail.com> wrote: Hi, Some metrics thrown around the discussion: SQOOP: extract 500 million rows (in single thread) 20 mins (data…

Re: Sqoop on Spark

2016-04-11 Thread Michael Segel
…data into memory (15 mins). SPARK: use JDBC (and, similar to SQOOP, difficult parallelization) to load 500 million records - manually killed after 8 hours. (Both the above studies were done on a system of the same capacity, with 32 GB…

Re: Sqoop on Spark

2016-04-10 Thread Jörn Franke
…Xeon processors and SSD. SPARK was running locally, and SQOOP ran on HADOOP2 and extracted data to the local file system.) In case anyone needs to know what needs to be done to access both the CSV and JDBC modules in SPARK Local Server…

Re: Sqoop on Spark

2016-04-10 Thread Mich Talebzadeh
…to access both the CSV and JDBC modules in SPARK Local Server mode, please let me know. Regards, Gourav Sengupta. On Thu, Apr 7, 2016 at 12:26 AM, Yong Zhang <java8...@hotmail.com> wrote: Good to know that.…

Re: Sqoop on Spark

2016-04-10 Thread Michael Segel
…Regards, Gourav Sengupta. On Thu, Apr 7, 2016 at 12:26 AM, Yong Zhang <java8...@hotmail.com> wrote: Good to know that. That is why Sqoop has this "direct" mode, to utilize the vendor specific…

Re: Sqoop on Spark

2016-04-08 Thread Mich Talebzadeh
…makes sense that vendor provide some kind of InputFormat, or data source in Spark, so Hadoop eco-system can integrate with them more natively. Yong -- Date: Wed, 6 Apr 2016 16:12:30 -0700 Subject: Re: Sqoo…

Re: Sqoop on Spark

2016-04-08 Thread Gourav Sengupta
…But for MPP, I still think it makes sense that vendor provide some kind of InputFormat, or data source in Spark, so Hadoop eco-system can integrate with them more natively. Yong -- Date: Wed, 6 Apr 2016 16:12:30 -0700 Subject:…

RE: Sqoop on Spark

2016-04-06 Thread Yong Zhang
…Date: Wed, 6 Apr 2016 16:12:30 -0700 Subject: Re: Sqoop on Spark From: mohaj...@gmail.com To: java8...@hotmail.com CC: mich.talebza...@gmail.com; jornfra...@gmail.com; msegel_had...@hotmail.com; guha.a...@gmail.com; linguin@gmail.com; user@spark.apache.org It is using the JDBC driver, I know that…

Re: Sqoop on Spark

2016-04-06 Thread Peyman Mohajerian
…is for interactive data movement. On Wed, Apr 6, 2016 at 4:05 PM, Yong Zhang <java8...@hotmail.com> wrote: If they do that, they must provide a customized input format, instead of through JDBC. Yong -- Date: Wed, 6 Apr 2016 23:56:54…

RE: Sqoop on Spark

2016-04-06 Thread Yong Zhang
If they do that, they must provide a customized input format, instead of through JDBC. Yong Date: Wed, 6 Apr 2016 23:56:54 +0100 Subject: Re: Sqoop on Spark From: mich.talebza...@gmail.com To: mohaj...@gmail.com CC: jornfra...@gmail.com; msegel_had...@hotmail.com; guha.a...@gmail.com; linguin…

Re: Sqoop on Spark

2016-04-06 Thread Mich Talebzadeh
SAP Sybase IQ does that and I believe SAP Hana as well. Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw * http://talebzadehmich.wordpress.com

Re: Sqoop on Spark

2016-04-06 Thread Peyman Mohajerian
For some MPP relational stores (not operational) it may be feasible to run Spark jobs and also have data locality. I know QueryGrid (Teradata) and PolyBase (Microsoft) use data locality to move data between their MPP and Hadoop. I would guess (have no idea) someone like IBM is already doing that…

Re: Sqoop on Spark

2016-04-06 Thread Mich Talebzadeh
Sorry, are you referring to Hive as a relational Data Warehouse in this scenario? The assumption here is that the data is coming from a relational database (Oracle), so IMO the best storage for it in the Big Data world is another DW adaptive to SQL. Spark is a powerful query tool and together with Hive as…

Re: Sqoop on Spark

2016-04-06 Thread Jörn Franke
Well I am not sure, but using a database as storage, such as relational databases or certain NoSQL databases (e.g. MongoDB), for Spark is generally a bad idea - no data locality, it cannot handle really big data volumes for compute, and you may potentially overload an operational database. And if…

Re: Sqoop on Spark

2016-04-06 Thread Mich Talebzadeh
…overhead due to creation of an RDD before writing the data to HDFS - something that the Sqoop mapper need not do. (So what am I overlooking here?) In cases where a data pipeline is being built with the sqooped data being the only trigger, there is a justification for using…

Re: Sqoop on Spark

2016-04-06 Thread Ranadip Chatterjee
…will have a slightly higher overhead due to creation of an RDD before writing the data to HDFS - something that the Sqoop mapper need not do. (So what am I overlooking here?) In cases where a data pipeline is being built with the sqooped data being the only trigger, there is a justification for using Spark…

Re: Sqoop on Spark

2016-04-06 Thread Michael Segel
I don't think it's necessarily a bad idea. Sqoop is an ugly tool, and it requires you to make some assumptions as a way to gain parallelism. (Not that most of the assumptions are not valid for most of the use cases…) Depending on what you want to do… your data may not be persisted on HDFS.

Re: Sqoop on Spark

2016-04-06 Thread Mich Talebzadeh
…http://talebzadehmich.wordpress.com On 6 April 2016 at 10:34, Jorge Sánchez <jorgesg1...@gmail.com> wrote: Ayan, there was a talk at Spark Summit: https://spark-summit.org/2015/events/Sqoop-on-Spark-for-Data-Ingestion/ Apparently they had a…

Re: Sqoop on Spark

2016-04-06 Thread Jorge Sánchez
Ayan, there was a talk at Spark Summit: https://spark-summit.org/2015/events/Sqoop-on-Spark-for-Data-Ingestion/ Apparently they had a lot of problems and the project seems abandoned. If you just have to do simple ingestion of a full table or a simple query, just use Sqoop as suggested by Mich…

Re: Sqoop on Spark

2016-04-05 Thread ayan guha
Thanks guys for the feedback. On Wed, Apr 6, 2016 at 3:44 PM, Jörn Franke wrote: I do not think you can be more resource efficient. In the end you have to store the data anyway on HDFS. You have a lot of development effort for doing something like Sqoop. Especially with…

Sqoop on Spark

2016-04-05 Thread ayan guha
Hi All, Asking for opinions: is it possible/advisable to use Spark to replace what Sqoop does? Any existing projects done along similar lines? -- Best Regards, Ayan Guha