Hi,

Sqoop just extracted 1,253,015,160 records (246 GB of data) for me in 30 minutes, running with 4 threads. Why is the discussion about using anything other than Sqoop still so wonderfully on?

Regards,
Gourav
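For reference, a four-way parallel extract of this kind is typically launched along the following lines. This is only a sketch; the connection string, credentials, table and split column are placeholders rather than the actual job described above:

    sqoop import \
      --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
      --username scott -P \
      --table BIG_TABLE \
      --split-by ID \
      --num-mappers 4 \
      --target-dir /staging/big_table \
      --as-textfile

The --split-by column should be indexed and reasonably evenly distributed, otherwise the four mappers will end up with very unequal ranges.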
On Mon, Apr 11, 2016 at 6:26 PM, Jörn Franke <jornfra...@gmail.com> wrote:

> Actually I was referring to having an external table in Oracle, which is used to export to CSV (insert into). Then you have a CSV on the database server which needs to be moved to HDFS.
>
> On 11 Apr 2016, at 17:50, Michael Segel <msegel_had...@hotmail.com> wrote:
>
> Depending on the Oracle release…
>
> You could use WebHDFS to gain access to the cluster and see the CSV file as an external table.
>
> However, you would need to have an application that will read each block of the file in parallel. This works for loading into the RDBMS itself. Actually, you could use Sqoop in reverse to push data to the RDBMS, provided that the block file is splittable. This is a classic M/R problem.
>
> But I don’t think this is what the OP wants to do. They want to pull data from the RDBMS. If you could drop the table’s underlying file and can read directly from it… you can do a very simple bulk load/unload process. However, you need to know the file’s format.
>
> Not sure what IBM or Oracle has done to tie their RDBMSs to Big Data.
>
> As I and other posters to this thread have alluded to… this would be a block bulk load/unload tool.
>
> On Apr 10, 2016, at 11:31 AM, Jörn Franke <jornfra...@gmail.com> wrote:
>
> I am not 100% sure, but you could export to CSV in Oracle using external tables.
>
> Oracle also has the Hadoop Loader, which seems to support Avro. However, I think you need to buy the Big Data solution.
>
> On 10 Apr 2016, at 16:12, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>
> Yes, I meant MR.
>
> Again, one cannot beat the RDBMS export utility. I was specifically referring to Oracle in the above case, which does not provide any specific text-based export, only the binary ones (exp, Data Pump, etc.).
>
> In the case of SAP ASE, Sybase IQ and MSSQL, one can use BCP (bulk copy), which can be parallelised either through range partitioning or simple round-robin partitioning to get data out to files in parallel. Then one can get the data into a Hive table through import etc.
>
> In general, if the source table is very large you can use either SAP Replication Server (SRS) or Oracle GoldenGate to get data to Hive. Both of these replication tools provide connectors to Hive and they do a good job. If one has something like Oracle in prod then there is likely a GoldenGate there. For bulk population of Hive tables and data migration, Replication Server is a good option.
>
> HTH
>
> Dr Mich Talebzadeh
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
>
> On 10 April 2016 at 14:24, Michael Segel <msegel_had...@hotmail.com> wrote:
>
>> Sqoop doesn’t use MapR… unless you meant to say M/R (MapReduce).
>>
>> The largest problem with Sqoop is that in order to gain parallelism you need to know how your underlying table is partitioned and to do multiple range queries. This may not be known, or your data may or may not be equally distributed across the ranges.
>>
>> If you’re bringing over the entire table, you may find dropping it, moving it to HDFS and then doing a bulk load to be more efficient. (This is less flexible than Sqoop, but also stresses the database servers less.)
>>
>> Again, YMMV
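The range-query parallelism Michael describes above is the same trade-off you face with Spark's JDBC data source: you have to supply a split column and bounds yourself. A minimal sketch using the Spark 1.x DataFrameReader.jdbc API and a HiveContext; the URL, credentials, table, split column and bounds below are placeholders:

    // Connection properties for the source database (values are placeholders).
    val props = new java.util.Properties()
    props.setProperty("user", "scott")
    props.setProperty("password", "tiger")
    props.setProperty("driver", "oracle.jdbc.OracleDriver")

    // Parallel read: Spark issues one range query per partition on the split column,
    // so the column should be indexed and reasonably evenly distributed.
    val df = sqlContext.read.jdbc(
      "jdbc:oracle:thin:@//dbhost:1521/ORCL",  // JDBC URL
      "BIG_TABLE",                             // source table
      "ID",                                    // numeric split column
      1L,                                      // lower bound of ID
      1000000000L,                             // upper bound of ID
      100,                                     // number of partitions / concurrent range queries
      props)

    df.write.mode("append").saveAsTable("staging.big_table")

The same caveat applies as with Sqoop: if the split column is skewed, most of the work lands in a few partitions.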
>> On Apr 8, 2016, at 9:17 AM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>
>> Well, unless you have plenty of memory you are going to have certain issues with Spark.
>>
>> I tried to load a billion-row table from Oracle through Spark using JDBC and ended up with a "Caused by: java.lang.OutOfMemoryError: Java heap space" error.
>>
>> Sqoop uses MapR and does it in serial mode, which takes time, and you can also tell it to create the Hive table. However, it will import the data into that Hive table.
>>
>> In any case the mechanism of data import is through JDBC; Spark uses memory and a DAG, whereas Sqoop relies on MapR.
>>
>> There is of course another alternative.
>>
>> Assuming that your Oracle table has a primary key, say "ID" (it would be easier if it were a monotonically increasing number), or is already partitioned:
>>
>> 1. You can create views based on the range of ID, or one for each partition. You can then SELECT columns col1, col2, ..., coln from the view and spool it to a text file on the OS (locally, say a backup directory, would be fastest).
>> 2. bzip2 those files and scp them to a local directory on the Hadoop side.
>> 3. You can then use Spark/Hive to load the target table from those files in parallel.
>> 4. When creating the views, take care of NUMBER and CHAR columns in Oracle and convert them, e.g. TO_CHAR(NUMBER_COLUMN) and CAST(coln AS VARCHAR2(n)) AS coln, etc.
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>
>> http://talebzadehmich.wordpress.com
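As an illustration of step 3 of Mich's plan above, loading the bzip2-compressed exports into a Hive table from Spark might look roughly like this. It is only a sketch, assuming the spark-csv package and a HiveContext; the path, column names and table name are placeholders (use a file:// path if the files stay on the local file system):

    // bzip2 is a splittable codec, so the compressed text files can still be read in parallel.
    val raw = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "false")
      .load("hdfs:///staging/oracle_export/*.bz2")

    // Column names are placeholders; with the views from step 1, everything arrives as text.
    raw.toDF("id", "col1", "col2")
       .write.mode("append")
       .saveAsTable("target_db.target_table")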
>> On 8 April 2016 at 10:07, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> Some metrics thrown around the discussion:
>>>
>>> SQOOP: extract 500 million rows (single thread): 20 mins (data size 21 GB)
>>> SPARK: load the data into memory: 15 mins
>>> SPARK: use JDBC (with parallelization as difficult as in SQOOP) to load 500 million records: manually killed after 8 hours
>>>
>>> (Both of the above studies were done on a system of the same capacity, with 32 GB RAM, dual hexa-core Xeon processors and SSD. Spark was running locally, and Sqoop ran on Hadoop 2 and extracted data to the local file system.)
>>>
>>> In case anyone needs to know what needs to be done to access both the CSV and JDBC modules in Spark local server mode, please let me know.
>>>
>>> Regards,
>>> Gourav Sengupta
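For what it is worth, one common way to make both the CSV and JDBC paths available in a local Spark 1.x shell is simply to put the spark-csv package and the Oracle driver on the classpath at launch. This is only an assumed setup, not necessarily the one used for the numbers above; the package version and jar location are placeholders:

    spark-shell --master local[*] \
      --packages com.databricks:spark-csv_2.10:1.4.0 \
      --jars /path/to/ojdbc6.jar \
      --driver-class-path /path/to/ojdbc6.jar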
>>> On Thu, Apr 7, 2016 at 12:26 AM, Yong Zhang <java8...@hotmail.com> wrote:
>>>
>>>> Good to know that.
>>>>
>>>> That is why Sqoop has this "direct" mode, to utilize the vendor-specific features.
>>>>
>>>> But for MPP, I still think it makes sense that the vendor provide some kind of InputFormat, or data source in Spark, so the Hadoop ecosystem can integrate with them more natively.
>>>>
>>>> Yong
>>>>
>>>> ------------------------------
>>>> Date: Wed, 6 Apr 2016 16:12:30 -0700
>>>> Subject: Re: Sqoop on Spark
>>>> From: mohaj...@gmail.com
>>>> To: java8...@hotmail.com
>>>> CC: mich.talebza...@gmail.com; jornfra...@gmail.com; msegel_had...@hotmail.com; guha.a...@gmail.com; linguin....@gmail.com; user@spark.apache.org
>>>>
>>>> It is using the JDBC driver; I know that's the case for Teradata:
>>>> http://developer.teradata.com/connectivity/articles/teradata-connector-for-hadoop-now-available
>>>>
>>>> The Teradata Connector (which is used by Cloudera and Hortonworks) for doing Sqoop is parallelized and works with ORC and probably other formats as well. It uses JDBC for each connection between data nodes and their AMP (compute) nodes; there is an additional layer that coordinates all of it. I know Oracle has a similar technology; I've used it and had to supply the JDBC driver.
>>>>
>>>> The Teradata Connector is for batch data copy; QueryGrid is for interactive data movement.
>>>>
>>>> On Wed, Apr 6, 2016 at 4:05 PM, Yong Zhang <java8...@hotmail.com> wrote:
>>>>
>>>> If they do that, they must provide a customized input format, instead of going through JDBC.
>>>>
>>>> Yong
>>>>
>>>> ------------------------------
>>>> Date: Wed, 6 Apr 2016 23:56:54 +0100
>>>> Subject: Re: Sqoop on Spark
>>>> From: mich.talebza...@gmail.com
>>>> To: mohaj...@gmail.com
>>>> CC: jornfra...@gmail.com; msegel_had...@hotmail.com; guha.a...@gmail.com; linguin....@gmail.com; user@spark.apache.org
>>>>
>>>> SAP Sybase IQ does that, and I believe SAP HANA as well.
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>
>>>> http://talebzadehmich.wordpress.com
>>>>
>>>> On 6 April 2016 at 23:49, Peyman Mohajerian <mohaj...@gmail.com> wrote:
>>>>
>>>> For some MPP relational stores (not operational ones) it may be feasible to run Spark jobs and also have data locality. I know QueryGrid (Teradata) and PolyBase (Microsoft) use data locality to move data between their MPP and Hadoop. I would guess (I have no idea) someone like IBM is already doing that for Spark; maybe a bit off topic!
>>>>
>>>> On Wed, Apr 6, 2016 at 3:29 PM, Jörn Franke <jornfra...@gmail.com> wrote:
>>>>
>>>> Well, I am not sure, but using a database as storage for Spark, such as relational databases or certain NoSQL databases (e.g. MongoDB), is generally a bad idea: no data locality, it cannot handle really big data volumes for compute, and you may potentially overload an operational database. And if your job fails for whatever reason (e.g. scheduling) then you have to pull everything out again. Sqoop and HDFS seem to me the more elegant solution together with Spark. These "assumptions" on parallelism have to be made with any solution anyway.
>>>>
>>>> Of course you can always redo things, but why? What benefit do you expect? A real big data platform has to support many different tools anyway, otherwise people doing analytics will be limited.
>>>>
>>>> On 06 Apr 2016, at 20:05, Michael Segel <msegel_had...@hotmail.com> wrote:
>>>>
>>>> I don’t think it's necessarily a bad idea.
>>>>
>>>> Sqoop is an ugly tool and it requires you to make some assumptions as a way to gain parallelism. (Not that most of the assumptions are not valid for most of the use cases…)
>>>>
>>>> Depending on what you want to do… your data may not be persisted on HDFS. There are use cases where your cluster is used for compute and not storage.
>>>>
>>>> I’d say that spending time re-inventing the wheel can be a good thing. It would be a good idea for many to rethink their ingestion process so that they can have a nice ‘data lake’ and not a ‘data sewer’. (Stealing that term from Dean Wampler. ;-)
>>>>
>>>> Just saying. ;-)
>>>>
>>>> -Mike
>>>>
>>>> On Apr 5, 2016, at 10:44 PM, Jörn Franke <jornfra...@gmail.com> wrote:
>>>>
>>>> I do not think you can be more resource efficient. In the end you have to store the data on HDFS anyway. Doing something like Sqoop yourself takes a lot of development effort, especially for error handling. You may create a ticket with the Sqoop guys to support Spark as an execution engine; maybe it is less effort to plug it in there. And if your cluster is loaded, then maybe you want to add more machines or improve the existing programs.
>>>>
>>>> On 06 Apr 2016, at 07:33, ayan guha <guha.a...@gmail.com> wrote:
>>>>
>>>> One of the reasons in my mind is to avoid MapReduce applications completely during ingestion, if possible. Also, I can then use a Spark standalone cluster to ingest, even if my Hadoop cluster is heavily loaded. What do you guys think?
>>>>
>>>> On Wed, Apr 6, 2016 at 3:13 PM, Jörn Franke <jornfra...@gmail.com> wrote:
>>>>
>>>> Why do you want to reimplement something which is already there?
>>>>
>>>> On 06 Apr 2016, at 06:47, ayan guha <guha.a...@gmail.com> wrote:
>>>>
>>>> Hi,
>>>>
>>>> Thanks for the reply. My use case is to query ~40 tables from Oracle (using indexes and incremental loads only) and add the data to existing Hive tables. Also, it would be good to have an option to create the Hive table, driven by job-specific configuration.
>>>>
>>>> What do you think?
>>>>
>>>> Best,
>>>> Ayan
>>>>
>>>> On Wed, Apr 6, 2016 at 2:30 PM, Takeshi Yamamuro <linguin....@gmail.com> wrote:
>>>>
>>>> Hi,
>>>>
>>>> It depends on your use case for Sqoop. What's it like?
>>>>
>>>> // maropu
>>>>
>>>> On Wed, Apr 6, 2016 at 1:26 PM, ayan guha <guha.a...@gmail.com> wrote:
>>>>
>>>> Hi all,
>>>>
>>>> Asking for opinions: is it possible/advisable to use Spark to replace what Sqoop does? Any existing projects along similar lines?
>>>>
>>>> --
>>>> Best Regards,
>>>> Ayan Guha
>>>>
>>>> --
>>>> ---
>>>> Takeshi Yamamuro
>>>>
>>>> --
>>>> Best Regards,
>>>> Ayan Guha
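For the multi-table, incremental use case Ayan describes above, a configuration-driven Spark job is certainly feasible. The sketch below is only illustrative; the table list, key columns and watermark values are hypothetical and would in practice come from the job-specific configuration he mentions (it assumes a HiveContext bound to sqlContext, as in spark-shell):

    // Hypothetical per-table configuration: source table, indexed key column,
    // and the highest key value already loaded into Hive.
    case class TableConf(table: String, keyCol: String, lastValue: Long)

    val oracleUrl = "jdbc:oracle:thin:@//dbhost:1521/ORCL"   // placeholder URL
    val props = new java.util.Properties()
    props.setProperty("user", "scott")
    props.setProperty("password", "tiger")
    props.setProperty("driver", "oracle.jdbc.OracleDriver")

    // In practice this list would be read from a config file rather than hard-coded.
    val tables = Seq(
      TableConf("ORDERS",    "ORDER_ID",  1234567L),
      TableConf("CUSTOMERS", "CUSTOMER_ID", 98765L)
    )

    tables.foreach { t =>
      // Push the incremental filter down to Oracle as a subquery, so only new rows move.
      val src = s"(SELECT * FROM ${t.table} WHERE ${t.keyCol} > ${t.lastValue}) src"
      val df  = sqlContext.read.jdbc(oracleUrl, src, props)
      df.write.mode("append").saveAsTable(s"staging.${t.table.toLowerCase}")
    }

Error handling, bookkeeping of the watermark values, and optional table creation are exactly the parts that take the development effort Jörn warns about earlier in the thread.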