Hi,

Sqoop just extracted 1,253,015,160 records (246 GB of data) for me in 30 minutes, running with 4 threads. Why is the discussion about using anything other than Sqoop still so wonderfully on?

Regards,
Gourav
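For reference, a four-way parallel extract of this kind is typically launched along the following lines. This is only a sketch; the connection string, credentials, table and split column are placeholders rather than the actual job described above:

    sqoop import \
      --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
      --username scott -P \
      --table BIG_TABLE \
      --split-by ID \
      --num-mappers 4 \
      --target-dir /staging/big_table \
      --as-textfile

The --split-by column should be indexed and reasonably evenly distributed, otherwise the four mappers will end up with very unequal ranges.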
On Mon, Apr 11, 2016 at 6:26 PM, Jörn Franke <jornfra...@gmail.com> wrote:

> Actually I was referring to having an external table in Oracle, which is used to export to CSV (insert into). Then you have a CSV on the database server which needs to be moved to HDFS.
>
> On 11 Apr 2016, at 17:50, Michael Segel <msegel_had...@hotmail.com> wrote:
>
> Depending on the Oracle release…
>
> You could use WebHDFS to gain access to the cluster and see the CSV file as an external table.
>
> However, you would need to have an application that will read each block of the file in parallel. This works for loading into the RDBMS itself. Actually, you could use Sqoop in reverse to push data to the RDBMS, provided that the block file is splittable. This is a classic M/R problem.
>
> But I don’t think this is what the OP wants to do. They want to pull data from the RDBMS. If you could drop the table’s underlying file and can read directly from it… you can do a very simple bulk load/unload process. However, you need to know the file’s format.
>
> Not sure what IBM or Oracle has done to tie their RDBMSs to Big Data.
>
> As I and other posters to this thread have alluded to… this would be a block bulk load/unload tool.
>
> On Apr 10, 2016, at 11:31 AM, Jörn Franke <jornfra...@gmail.com> wrote:
>
> I am not 100% sure, but you could export to CSV in Oracle using external tables.
>
> Oracle also has the Hadoop Loader, which seems to support Avro. However, I think you need to buy the Big Data solution.
>
> On 10 Apr 2016, at 16:12, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>
> Yes, I meant MR.
>
> Again, one cannot beat the RDBMS export utility. I was specifically referring to Oracle in the above case, which does not provide any specific text-based export, only the binary ones (exp, Data Pump, etc.).
>
> In the case of SAP ASE, Sybase IQ and MSSQL, one can use BCP (bulk copy), which can be parallelised either through range partitioning or simple round-robin partitioning to get data out to files in parallel. Then one can get the data into a Hive table through import etc.
>
> In general, if the source table is very large you can use either SAP Replication Server (SRS) or Oracle GoldenGate to get data to Hive. Both of these replication tools provide connectors to Hive and they do a good job. If one has something like Oracle in prod then there is likely a GoldenGate there. For bulk population of Hive tables and data migration, Replication Server is a good option.
>
> HTH
>
> Dr Mich Talebzadeh
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
>
> On 10 April 2016 at 14:24, Michael Segel <msegel_had...@hotmail.com> wrote:
>
>> Sqoop doesn’t use MapR… unless you meant to say M/R (MapReduce).
>>
>> The largest problem with Sqoop is that in order to gain parallelism you need to know how your underlying table is partitioned and to do multiple range queries. This may not be known, or your data may or may not be equally distributed across the ranges.
>>
>> If you’re bringing over the entire table, you may find dropping it, moving it to HDFS and then doing a bulk load to be more efficient. (This is less flexible than Sqoop, but also stresses the database servers less.)
>>
>> Again, YMMV
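The range-query parallelism Michael describes above is the same trade-off you face with Spark's JDBC data source: you have to supply a split column and bounds yourself. A minimal sketch using the Spark 1.x DataFrameReader.jdbc API and a HiveContext; the URL, credentials, table, split column and bounds below are placeholders:

    // Connection properties for the source database (values are placeholders).
    val props = new java.util.Properties()
    props.setProperty("user", "scott")
    props.setProperty("password", "tiger")
    props.setProperty("driver", "oracle.jdbc.OracleDriver")

    // Parallel read: Spark issues one range query per partition on the split column,
    // so the column should be indexed and reasonably evenly distributed.
    val df = sqlContext.read.jdbc(
      "jdbc:oracle:thin:@//dbhost:1521/ORCL",  // JDBC URL
      "BIG_TABLE",                             // source table
      "ID",                                    // numeric split column
      1L,                                      // lower bound of ID
      1000000000L,                             // upper bound of ID
      100,                                     // number of partitions / concurrent range queries
      props)

    df.write.mode("append").saveAsTable("staging.big_table")

The same caveat applies as with Sqoop: if the split column is skewed, most of the work lands in a few partitions.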
>> On Apr 8, 2016, at 9:17 AM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>
>> Well, unless you have plenty of memory you are going to have certain issues with Spark.
>>
>> I tried to load a billion-row table from Oracle through Spark using JDBC and ended up with a "Caused by: java.lang.OutOfMemoryError: Java heap space" error.
>>
>> Sqoop uses MapR and does it in serial mode, which takes time, and you can also tell it to create the Hive table. However, it will import the data into that Hive table.
>>
>> In any case the mechanism of data import is through JDBC; Spark uses memory and a DAG, whereas Sqoop relies on MapR.
>>
>> There is of course another alternative.
>>
>> Assuming that your Oracle table has a primary key, say "ID" (it would be easier if it were a monotonically increasing number), or is already partitioned:
>>
>> 1. You can create views based on the range of ID, or one for each partition. You can then SELECT columns col1, col2, ..., coln from the view and spool it to a text file on the OS (locally, say a backup directory, would be fastest).
>> 2. bzip2 those files and scp them to a local directory on the Hadoop side.
>> 3. You can then use Spark/Hive to load the target table from those files in parallel.
>> 4. When creating the views, take care of NUMBER and CHAR columns in Oracle and convert them, e.g. TO_CHAR(NUMBER_COLUMN) and CAST(coln AS VARCHAR2(n)) AS coln, etc.
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>
>> http://talebzadehmich.wordpress.com
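As an illustration of step 3 of Mich's plan above, loading the bzip2-compressed exports into a Hive table from Spark might look roughly like this. It is only a sketch, assuming the spark-csv package and a HiveContext; the path, column names and table name are placeholders (use a file:// path if the files stay on the local file system):

    // bzip2 is a splittable codec, so the compressed text files can still be read in parallel.
    val raw = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "false")
      .load("hdfs:///staging/oracle_export/*.bz2")

    // Column names are placeholders; with the views from step 1, everything arrives as text.
    raw.toDF("id", "col1", "col2")
       .write.mode("append")
       .saveAsTable("target_db.target_table")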
>> On 8 April 2016 at 10:07, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> Some metrics thrown around the discussion:
>>>
>>> SQOOP: extract 500 million rows (single thread): 20 mins (data size 21 GB)
>>> SPARK: load the data into memory: 15 mins
>>> SPARK: use JDBC (with parallelization as difficult as in SQOOP) to load 500 million records: manually killed after 8 hours
>>>
>>> (Both of the above studies were done on a system of the same capacity, with 32 GB RAM, dual hexa-core Xeon processors and SSD. Spark was running locally, and Sqoop ran on Hadoop 2 and extracted data to the local file system.)
>>>
>>> In case anyone needs to know what needs to be done to access both the CSV and JDBC modules in Spark local server mode, please let me know.
>>>
>>> Regards,
>>> Gourav Sengupta
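For what it is worth, one common way to make both the CSV and JDBC paths available in a local Spark 1.x shell is simply to put the spark-csv package and the Oracle driver on the classpath at launch. This is only an assumed setup, not necessarily the one used for the numbers above; the package version and jar location are placeholders:

    spark-shell --master local[*] \
      --packages com.databricks:spark-csv_2.10:1.4.0 \
      --jars /path/to/ojdbc6.jar \
      --driver-class-path /path/to/ojdbc6.jar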
>>> On Thu, Apr 7, 2016 at 12:26 AM, Yong Zhang <java8...@hotmail.com> wrote:
>>>
>>>> Good to know that.
>>>>
>>>> That is why Sqoop has this "direct" mode, to utilize the vendor-specific features.
>>>>
>>>> But for MPP, I still think it makes sense that the vendor provide some kind of InputFormat, or data source in Spark, so the Hadoop ecosystem can integrate with them more natively.
>>>>
>>>> Yong
>>>>
>>>> ------------------------------
>>>> Date: Wed, 6 Apr 2016 16:12:30 -0700
>>>> Subject: Re: Sqoop on Spark
>>>> From: mohaj...@gmail.com
>>>> To: java8...@hotmail.com
>>>> CC: mich.talebza...@gmail.com; jornfra...@gmail.com; msegel_had...@hotmail.com; guha.a...@gmail.com; linguin....@gmail.com; user@spark.apache.org
>>>>
>>>> It is using the JDBC driver; I know that's the case for Teradata:
>>>> http://developer.teradata.com/connectivity/articles/teradata-connector-for-hadoop-now-available
>>>>
>>>> The Teradata Connector (which is used by Cloudera and Hortonworks) for doing Sqoop is parallelized and works with ORC and probably other formats as well. It uses JDBC for each connection between data nodes and their AMP (compute) nodes; there is an additional layer that coordinates all of it. I know Oracle has a similar technology; I've used it and had to supply the JDBC driver.
>>>>
>>>> The Teradata Connector is for batch data copy; QueryGrid is for interactive data movement.
>>>>
>>>> On Wed, Apr 6, 2016 at 4:05 PM, Yong Zhang <java8...@hotmail.com> wrote:
>>>>
>>>> If they do that, they must provide a customized input format, instead of going through JDBC.
>>>>
>>>> Yong
>>>>
>>>> ------------------------------
>>>> Date: Wed, 6 Apr 2016 23:56:54 +0100
>>>> Subject: Re: Sqoop on Spark
>>>> From: mich.talebza...@gmail.com
>>>> To: mohaj...@gmail.com
>>>> CC: jornfra...@gmail.com; msegel_had...@hotmail.com; guha.a...@gmail.com; linguin....@gmail.com; user@spark.apache.org
>>>>
>>>> SAP Sybase IQ does that, and I believe SAP HANA as well.
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>
>>>> http://talebzadehmich.wordpress.com
>>>>
>>>> On 6 April 2016 at 23:49, Peyman Mohajerian <mohaj...@gmail.com> wrote:
>>>>
>>>> For some MPP relational stores (not operational ones) it may be feasible to run Spark jobs and also have data locality. I know QueryGrid (Teradata) and PolyBase (Microsoft) use data locality to move data between their MPP and Hadoop. I would guess (I have no idea) someone like IBM is already doing that for Spark; maybe a bit off topic!
>>>>
>>>> On Wed, Apr 6, 2016 at 3:29 PM, Jörn Franke <jornfra...@gmail.com> wrote:
>>>>
>>>> Well, I am not sure, but using a database as storage for Spark, such as relational databases or certain NoSQL databases (e.g. MongoDB), is generally a bad idea: no data locality, it cannot handle really big data volumes for compute, and you may potentially overload an operational database. And if your job fails for whatever reason (e.g. scheduling) then you have to pull everything out again. Sqoop and HDFS seem to me the more elegant solution together with Spark. These "assumptions" on parallelism have to be made with any solution anyway.
>>>>
>>>> Of course you can always redo things, but why? What benefit do you expect? A real big data platform has to support many different tools anyway, otherwise people doing analytics will be limited.
>>>>
>>>> On 06 Apr 2016, at 20:05, Michael Segel <msegel_had...@hotmail.com> wrote:
>>>>
>>>> I don’t think it's necessarily a bad idea.
>>>>
>>>> Sqoop is an ugly tool and it requires you to make some assumptions as a way to gain parallelism. (Not that most of the assumptions are not valid for most of the use cases…)
>>>>
>>>> Depending on what you want to do… your data may not be persisted on HDFS. There are use cases where your cluster is used for compute and not storage.
>>>>
>>>> I’d say that spending time re-inventing the wheel can be a good thing. It would be a good idea for many to rethink their ingestion process so that they can have a nice ‘data lake’ and not a ‘data sewer’. (Stealing that term from Dean Wampler. ;-)
>>>>
>>>> Just saying. ;-)
>>>>
>>>> -Mike
>>>>
>>>> On Apr 5, 2016, at 10:44 PM, Jörn Franke <jornfra...@gmail.com> wrote:
>>>>
>>>> I do not think you can be more resource efficient. In the end you have to store the data on HDFS anyway. Doing something like Sqoop yourself takes a lot of development effort, especially for error handling. You may create a ticket with the Sqoop guys to support Spark as an execution engine; maybe it is less effort to plug it in there. And if your cluster is loaded, then maybe you want to add more machines or improve the existing programs.
>>>>
>>>> On 06 Apr 2016, at 07:33, ayan guha <guha.a...@gmail.com> wrote:
>>>>
>>>> One of the reasons in my mind is to avoid MapReduce applications completely during ingestion, if possible. Also, I can then use a Spark standalone cluster to ingest, even if my Hadoop cluster is heavily loaded. What do you guys think?
>>>>
>>>> On Wed, Apr 6, 2016 at 3:13 PM, Jörn Franke <jornfra...@gmail.com> wrote:
>>>>
>>>> Why do you want to reimplement something which is already there?
>>>>
>>>> On 06 Apr 2016, at 06:47, ayan guha <guha.a...@gmail.com> wrote:
>>>>
>>>> Hi,
>>>>
>>>> Thanks for the reply. My use case is to query ~40 tables from Oracle (using indexes and incremental loads only) and add the data to existing Hive tables. Also, it would be good to have an option to create the Hive table, driven by job-specific configuration.
>>>>
>>>> What do you think?
>>>>
>>>> Best,
>>>> Ayan
>>>>
>>>> On Wed, Apr 6, 2016 at 2:30 PM, Takeshi Yamamuro <linguin....@gmail.com> wrote:
>>>>
>>>> Hi,
>>>>
>>>> It depends on your use case for Sqoop. What's it like?
>>>>
>>>> // maropu
>>>>
>>>> On Wed, Apr 6, 2016 at 1:26 PM, ayan guha <guha.a...@gmail.com> wrote:
>>>>
>>>> Hi all,
>>>>
>>>> Asking for opinions: is it possible/advisable to use Spark to replace what Sqoop does? Any existing projects along similar lines?
>>>>
>>>> --
>>>> Best Regards,
>>>> Ayan Guha
>>>>
>>>> --
>>>> ---
>>>> Takeshi Yamamuro
>>>>
>>>> --
>>>> Best Regards,
>>>> Ayan Guha
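For the multi-table, incremental use case Ayan describes above, a configuration-driven Spark job is certainly feasible. The sketch below is only illustrative; the table list, key columns and watermark values are hypothetical and would in practice come from the job-specific configuration he mentions (it assumes a HiveContext bound to sqlContext, as in spark-shell):

    // Hypothetical per-table configuration: source table, indexed key column,
    // and the highest key value already loaded into Hive.
    case class TableConf(table: String, keyCol: String, lastValue: Long)

    val oracleUrl = "jdbc:oracle:thin:@//dbhost:1521/ORCL"   // placeholder URL
    val props = new java.util.Properties()
    props.setProperty("user", "scott")
    props.setProperty("password", "tiger")
    props.setProperty("driver", "oracle.jdbc.OracleDriver")

    // In practice this list would be read from a config file rather than hard-coded.
    val tables = Seq(
      TableConf("ORDERS",    "ORDER_ID",  1234567L),
      TableConf("CUSTOMERS", "CUSTOMER_ID", 98765L)
    )

    tables.foreach { t =>
      // Push the incremental filter down to Oracle as a subquery, so only new rows move.
      val src = s"(SELECT * FROM ${t.table} WHERE ${t.keyCol} > ${t.lastValue}) src"
      val df  = sqlContext.read.jdbc(oracleUrl, src, props)
      df.write.mode("append").saveAsTable(s"staging.${t.table.toLowerCase}")
    }

Error handling, bookkeeping of the watermark values, and optional table creation are exactly the parts that take the development effort Jörn warns about earlier in the thread.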