Apologies in advance for injecting an Oracle product into this discussion, but I
thought it might help address the requirements (as far as I understood
them).
We are looking into providing a new connector for Spark, similar to the
Oracle Datasource for Hadoop
<http://www.oracle.com/technetwork/database/database-technologies/bdc/oracle-datasource-4-hadoop-3158076.pdf>,
which will implement the Spark DataSource interfaces for Oracle Database.
In summary, it will:
* allow parallel and direct access to the Oracle database (with an
option to control the number of concurrent connections)
* introspect the Oracle table, then dynamically generate partitions of
Spark JDBCRDDs based on the split pattern, and rewrite Spark SQL
queries into Oracle SQL queries for each partition; the typical use
case is joining fact data (Big Data) with master data in Oracle
* hook into the Oracle JDBC driver for faster type conversions
* implement predicate pushdown, partition pruning and column projection
against the Oracle database, thereby reducing the amount of data to be
processed in Spark
* write the result of Spark SQL processing back to an Oracle table
(through parallel insert) for further mining by traditional BI tools
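A connector like this would plug into the standard Spark DataSource API, so usage would look roughly like the sketch below. Note the format name and option keys are hypothetical, shown only to illustrate the interface; only the DataFrame reader API itself is standard Spark.

```scala
// Hypothetical usage of an Oracle DataSource connector through Spark's
// standard DataSource API (format name and option keys are illustrative).
val oracleDF = spark.read
  .format("oracle.spark.datasource")        // hypothetical format name
  .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCLPDB1")
  .option("dbtable", "SALES.FACT_ORDERS")
  .option("maxConcurrentConnections", "8")  // hypothetical option
  .load()

// With predicate pushdown and column pruning, only the filtered rows
// and selected columns would actually be fetched from Oracle.
oracleDF.select("order_id", "amount")
  .filter("amount > 1000")
  .show()
```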
You may reach out to me offline for more details if interested,
Kuassi
On 1/29/2017 3:39 AM, Mich Talebzadeh wrote:
This is classic, nothing special about it.
1. Your source is Oracle schema tables.
2. You can use an Oracle JDBC connection with DIRECT CONNECT and
   parallel processing to read your data from the Oracle table into
   Spark via JDBC. Ensure that you pull data from the Oracle DB at
   a time when the DB is not busy and the network between Spark and
   Oracle is reasonable. You will be creating multiple connections to
   your Oracle database from Spark.
3. Create a DataFrame from the RDD and ingest your data into Hive
   staging tables. This should be pretty fast. If you are using a
   recent version of Spark (> 1.5) you can watch this in the Spark GUI.
4. Once data is ingested into the Hive table (frequency: discrete,
   recurring or cumulative), you have your source data in Hive.
5. Do your work in the Hive staging tables; your enriched data will
   then go into Hive enriched tables (different from your staging
   tables). You can use Spark to enrich (transform) the data in the
   Hive staging tables.
6. Then use Spark to send that data into the Oracle table. Again, bear
   in mind that the application has to handle consistency from Big
   Data into the RDBMS, for example what you are going to do with
   failed transactions in Oracle.
7. From my experience you also need some staging tables in Oracle to
   handle the inserts from Hive via Spark into the Oracle table.
8. Finally, run a PL/SQL job to load the Oracle target tables from the
   Oracle staging tables.
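The Spark side of the steps above (the parallel JDBC read, the Hive staging ingest and the write back to an Oracle staging table) can be sketched as follows. This assumes Spark 2.x with Hive support; the connection URL, credentials, table names, partition column and its bounds are all placeholders for illustration.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("OracleToHiveStaging")
  .enableHiveSupport()
  .getOrCreate()

// Step 2: parallel JDBC read from Oracle. numPartitions controls how many
// concurrent connections Spark opens; lowerBound/upperBound split the
// partition column's range across them (values here are illustrative).
val stagingDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCLPDB1")
  .option("driver", "oracle.jdbc.OracleDriver")
  .option("dbtable", "SCOTT.ORDERS")
  .option("user", "scott")
  .option("password", "tiger")
  .option("partitionColumn", "ORDER_ID")
  .option("lowerBound", "1")
  .option("upperBound", "10000000")
  .option("numPartitions", "8")
  .load()

// Step 3: ingest into a Hive staging table.
stagingDF.write.mode("overwrite").saveAsTable("staging.orders")

// Step 6: after enrichment in Hive/Spark, send the result to an Oracle
// staging table; steps 7-8 then load the target tables via PL/SQL.
spark.table("enriched.orders").write
  .format("jdbc")
  .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCLPDB1")
  .option("driver", "oracle.jdbc.OracleDriver")
  .option("dbtable", "SCOTT.ORDERS_STAGING")
  .option("user", "scott")
  .option("password", "tiger")
  .mode("append")
  .save()
```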
Notes:
Oracle column types are not 100% compatible with Spark. For example,
Spark does not recognize the CHAR column type; it has to be converted
to VARCHAR or STRING.
Hive does not have the concept of Oracle's "WITH" clause inline table,
so a script that works in Oracle may not work in Hive. Windowing
functions should be fine.
I tend to do all this via shell scripts that give control at each
layer and raise alarms.
HTH
Dr Mich Talebzadeh
LinkedIn
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
http://talebzadehmich.wordpress.com
*Disclaimer:* Use it at your own risk. Any and all responsibility for
any loss, damage or destruction of data or any other property which
may arise from relying on this email's technical content is explicitly
disclaimed. The author will in no case be liable for any monetary
damages arising from such loss, damage or destruction.
On 29 January 2017 at 10:18, Alex <siri8...@gmail.com> wrote:
Hi All,
Thanks for your responses. Please find the flow diagram below.
Please help me simplify this architecture using Spark.
1) Can I skip steps 1 to 4 and directly store the data in Spark?
If I am storing it in Spark, where is it actually getting stored?
Do I need to retain Hadoop to store the data,
or can I store it directly in Spark and remove Hadoop as well?
I want to remove Informatica for preprocessing and load the files
coming from the server directly into Hadoop/Spark.
So my question is: can I load the files' data directly into Spark?
Then where exactly will the data get stored? Do I need to have Spark
installed on top of HDFS?
2) If I am retaining the architecture below, can I store the output
from Spark directly back to Oracle (step 5 to step 7),
and will Spark's way of storing it back to Oracle be better
than using Sqoop, performance-wise?
3) Can I use Spark Scala UDFs to process data from Hive and retain
the entire architecture?
Which among the above would be optimal?
[Inline image 1: flow diagram]
On Sat, Jan 28, 2017 at 10:38 PM, Sachin Naik
<sachin.u.n...@gmail.com> wrote:
I strongly agree with Jorn and Russell. There are different
solutions for data movement depending upon your needs:
frequency, bi-directional drivers, workflow, handling
duplicate records. This space is known as "Change Data
Capture", or CDC for short. I built some products in this space
that extensively used connection pooling over ODBC/JDBC.
Happy to chat if you need more information.
-Sachin Naik
>> Hard to tell. Can you give more insights on what you are trying
>> to achieve and what the data is about?
>> For example, depending on your use case, Sqoop can make sense
>> or not.
Sent from my iPhone
On Jan 27, 2017, at 11:22 PM, Russell Spitzer
<russell.spit...@gmail.com> wrote:
You can treat Oracle as a JDBC source
(http://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases),
skip Sqoop and the Hive tables, and go straight to queries. Then
you can skip Hive on the way back out (see the same link) and
write directly to Oracle. I'll leave the performance
questions for someone else.
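In code, the skip-Sqoop round trip described here would look roughly like this (Spark 2.x `read.jdbc`/`write.jdbc` API; the connection URL, credentials and table names are placeholders):

```scala
// Read straight from Oracle as a JDBC source, query in Spark, write back.
val props = new java.util.Properties()
props.setProperty("user", "scott")          // placeholder credentials
props.setProperty("password", "tiger")
props.setProperty("driver", "oracle.jdbc.OracleDriver")

val url = "jdbc:oracle:thin:@//dbhost:1521/ORCLPDB1"  // placeholder URL

val df = spark.read.jdbc(url, "SCOTT.ORDERS", props)
val result = df.groupBy("CUSTOMER_ID").count()

// Write the result directly back to Oracle: no Sqoop, no Hive tables.
result.write.mode("append").jdbc(url, "SCOTT.ORDER_COUNTS", props)
```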
On Fri, Jan 27, 2017 at 11:06 PM Sirisha Cheruvu
<siri8...@gmail.com> wrote:
On Sat, Jan 28, 2017 at 6:44 AM, Sirisha Cheruvu
<siri8...@gmail.com> wrote:
Hi Team,
Right now our existing flow is:
Oracle --> Sqoop --> Hive --> Hive queries on Spark SQL
(HiveContext) --> destination Hive table --> Sqoop
export to Oracle
Half of the Hive UDFs required are developed as Java UDFs.
So now I want to know: if I run native Scala UDFs
rather than running the Hive Java UDFs in Spark SQL, will
there be any performance difference?
Can we skip the Sqoop import and export, and
instead load data directly from Oracle into Spark,
code Scala UDFs for the transformations, and export the
output data back to Oracle?
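For reference, registering a native Scala UDF for use from Spark SQL looks like the sketch below (the table, column and function names are made up for illustration). A native UDF avoids the Hive UDF wrapping layer, though whether that yields a measurable speedup for your workload would need benchmarking:

```scala
import org.apache.spark.sql.functions.udf
import spark.implicits._

// A native Scala UDF (illustrative example: normalise a code column).
val normaliseCode =
  udf((s: String) => if (s == null) null else s.trim.toUpperCase)

// Usable from the DataFrame API...
val df = spark.table("staging.orders")
  .withColumn("code_norm", normaliseCode($"code"))

// ...or registered for Spark SQL, replacing a Hive Java UDF call site.
spark.udf.register("normalise_code",
  (s: String) => if (s == null) null else s.trim.toUpperCase)
spark.sql("SELECT normalise_code(code) FROM staging.orders")
```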
Right now the architecture we are using is:
Oracle --> Sqoop (import) --> Hive tables --> Hive queries
--> Spark SQL --> Hive --> Oracle
What would be the optimal architecture to process data
from Oracle using Spark? Can I improve this process in
any way?
Regards,
Sirisha