I know of projects that have done this, but I have never seen any advantage in "using Spark to do what Sqoop does" - at least in a YARN cluster. Both frameworks will have similar overheads of getting containers allocated by YARN and creating new JVMs to do the work. Spark will probably have a slightly higher overhead due to the creation of an RDD before writing the data to HDFS - something that a Sqoop mapper need not do. (So what am I overlooking here?)
In cases where a data pipeline is being built with the sqooped data being the only trigger, there is a justification for using Spark instead of Sqoop, to short-circuit the data directly into the transformation pipeline.

Regards
Ranadip

On 6 Apr 2016 7:05 p.m., "Michael Segel" <msegel_had...@hotmail.com> wrote:

> I don’t think it’s necessarily a bad idea.
>
> Sqoop is an ugly tool and it requires you to make some assumptions as a
> way to gain parallelism. (Not that most of the assumptions are not valid
> for most of the use cases…)
>
> Depending on what you want to do… your data may not be persisted on HDFS.
> There are use cases where your cluster is used for compute and not storage.
>
> I’d say that spending time re-inventing the wheel can be a good thing.
> It would be a good idea for many to rethink their ingestion process so
> that they can have a nice ‘data lake’ and not a ‘data sewer’. (Stealing
> that term from Dean Wampler. ;-)
>
> Just saying. ;-)
>
> -Mike
>
> On Apr 5, 2016, at 10:44 PM, Jörn Franke <jornfra...@gmail.com> wrote:
>
> I do not think you can be more resource-efficient. In the end you have to
> store the data on HDFS anyway. You have a lot of development effort for
> doing something like Sqoop, especially with error handling.
> You may create a ticket with the Sqoop guys to support Spark as an
> execution engine, and maybe it is less effort to plug it in there.
> Maybe if your cluster is loaded then you may want to add more machines or
> improve the existing programs.
>
> On 06 Apr 2016, at 07:33, ayan guha <guha.a...@gmail.com> wrote:
>
> One of the reasons in my mind is to avoid a Map-Reduce application
> completely during ingestion, if possible. Also, I can then use a Spark
> standalone cluster to ingest, even if my Hadoop cluster is heavily loaded.
> What do you guys think?
>
> On Wed, Apr 6, 2016 at 3:13 PM, Jörn Franke <jornfra...@gmail.com> wrote:
>
>> Why do you want to reimplement something which is already there?
>>
>> On 06 Apr 2016, at 06:47, ayan guha <guha.a...@gmail.com> wrote:
>>
>> Hi
>>
>> Thanks for the reply. My use case is to query ~40 tables from Oracle
>> (using index and incremental only) and add data to existing Hive tables.
>> Also, it would be good to have an option to create Hive tables, driven
>> by job-specific configuration.
>>
>> What do you think?
>>
>> Best
>> Ayan
>>
>> On Wed, Apr 6, 2016 at 2:30 PM, Takeshi Yamamuro <linguin....@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> It depends on your use case for Sqoop.
>>> What's it like?
>>>
>>> // maropu
>>>
>>> On Wed, Apr 6, 2016 at 1:26 PM, ayan guha <guha.a...@gmail.com> wrote:
>>>
>>>> Hi All
>>>>
>>>> Asking for opinions: is it possible/advisable to use Spark to replace
>>>> what Sqoop does? Any existing projects done along similar lines?
>>>>
>>>> --
>>>> Best Regards,
>>>> Ayan Guha
>>>
>>> --
>>> ---
>>> Takeshi Yamamuro
>>
>> --
>> Best Regards,
>> Ayan Guha

> --
> Best Regards,
> Ayan Guha
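For reference, the parallelism discussed in the thread is the same trick in both tools: pick a split column, take its min and max, and carve the range into even strides, one per Sqoop mapper or Spark JDBC partition (Sqoop's `--split-by`; Spark's `partitionColumn`/`lowerBound`/`upperBound`/`numPartitions`). A minimal sketch of that range-splitting; the column name and bounds below are made up for illustration:

```python
def split_predicates(column, lower, upper, num_splits):
    """Carve [lower, upper] into num_splits WHERE clauses, one per parallel
    reader; this mirrors the even-stride scheme Sqoop uses for --split-by
    and Spark uses for partitionColumn-based JDBC reads."""
    stride = (upper - lower) // num_splits
    preds = []
    lo = lower
    for i in range(num_splits):
        last = (i == num_splits - 1)
        hi = upper if last else lo + stride       # last stride absorbs any remainder
        op = "<=" if last else "<"                # close the range on the final split
        preds.append(f"{column} >= {lo} AND {column} {op} {hi}")
        lo = hi
    return preds

# e.g. splitting an id range [0, 100] across 4 parallel readers:
for p in split_predicates("id", 0, 100, 4):
    print(p)

# These predicates map one-to-one onto Spark JDBC partitions, e.g. (PySpark;
# url/credentials hypothetical, requires a running SparkSession):
#   spark.read.jdbc(url, "ORDERS",
#                   predicates=split_predicates("ID", 0, 100, 4),
#                   properties={"user": "...", "password": "..."})
```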
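For the incremental part of Ayan's use case, Sqoop's `--incremental lastmodified` mode boils down to one watermark predicate per table; the same derived-table query can be handed to Spark's JDBC reader instead. A sketch, with hypothetical table names, watermark columns, and timestamps (an Oracle deployment would additionally need proper date literals, e.g. `TO_DATE`):

```python
def incremental_query(table, watermark_col, last_value):
    """Build an incremental-extract query: only rows modified since the
    last recorded watermark.  Returned as a parenthesised derived table,
    the shape Spark's JDBC reader accepts in place of a table name."""
    return f"(SELECT * FROM {table} WHERE {watermark_col} > {last_value!r}) src"

# Hypothetical watermark registry for a multi-table ingestion job:
tables = {
    "ORDERS":    ("UPDATED_AT",  "2016-04-05 00:00:00"),
    "CUSTOMERS": ("MODIFIED_TS", "2016-04-05 00:00:00"),
}

for name, (col, watermark) in tables.items():
    q = incremental_query(name, col, watermark)
    print(q)
    # In PySpark (url/properties hypothetical) this would drive the read,
    # then land in an existing Hive table:
    #   df = spark.read.jdbc(url, q, properties=props)
    #   df.write.mode("append").insertInto(f"staging.{name.lower()}")
```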