Apologies in advance for injecting an Oracle product into this discussion, but I
thought it might help address the requirements (as far as I understood
them).
We are looking into providing a new connector for Spark, similar to the
Oracle Datasource for Hadoop
<http://www.oracle.com/technetwork/database/database-technologies/bdc/oracle-datasource-4-hadoop-3158076.pdf>,
which will implement the Spark DataSource interfaces for Oracle Database.
In summary, it will:
* allow parallel and direct access to the Oracle database (with an
option to control the number of concurrent connections)
* introspect the Oracle table, then dynamically generate partitions of
Spark JDBCRDDs based on the split pattern, and rewrite Spark SQL
queries into Oracle SQL queries for each partition; the typical use
case is joining fact data (Big Data) with master data in Oracle
* hook into the Oracle JDBC driver for faster type conversions
* implement predicate pushdown, partition pruning and column projection
against the Oracle database, thereby reducing the amount of data to be
processed in Spark
* write the result of Spark SQL processing back to an Oracle table
(through parallel insert) for further mining by traditional BI tools
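A connector like this would plug into the standard Spark DataSource API, so usage would look roughly like the sketch below. Note the format name and option keys are hypothetical, shown only to illustrate the interface; only the DataFrame reader API itself is standard Spark.

```scala
// Hypothetical usage of an Oracle DataSource connector through Spark's
// standard DataSource API (format name and option keys are illustrative).
val oracleDF = spark.read
  .format("oracle.spark.datasource")        // hypothetical format name
  .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCLPDB1")
  .option("dbtable", "SALES.FACT_ORDERS")
  .option("maxConcurrentConnections", "8")  // hypothetical option
  .load()

// With predicate pushdown and column pruning, only the filtered rows
// and selected columns would actually be fetched from Oracle.
oracleDF.select("order_id", "amount")
  .filter("amount > 1000")
  .show()
```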
You may reach out to me offline for more details if interested,
Kuassi
On 1/29/2017 3:39 AM, Mich Talebzadeh wrote:
This is classic, nothing special about it.
1. Your source is Oracle schema tables.
2. You can use an Oracle JDBC connection with DIRECT CONNECT and
   parallel processing to read your data from the Oracle table into
   Spark via JDBC. Ensure that you pull data from the Oracle DB at
   a time when the DB is not busy and the network between Spark and
   Oracle is reasonable. You will be creating multiple connections to
   your Oracle database from Spark.
3. Create a DataFrame from the RDD and ingest your data into Hive
   staging tables. This should be pretty fast. If you are using a
   recent version of Spark (> 1.5) you can watch this in the Spark GUI.
4. Once data is ingested into the Hive table (frequency: discrete,
   recurring or cumulative), you have your source data in Hive.
5. Do your work in the Hive staging tables; your enriched data will
   then go into Hive enriched tables (different from your staging
   tables). You can use Spark to enrich (transform) the data in the
   Hive staging tables.
6. Then use Spark to send that data into the Oracle table. Again, bear
   in mind that the application has to handle consistency from Big
   Data into the RDBMS, for example what you are going to do with
   failed transactions in Oracle.
7. From my experience you also need some staging tables in Oracle to
   handle the inserts from Hive via Spark into the Oracle table.
8. Finally, run a PL/SQL job to load the Oracle target tables from the
   Oracle staging tables.
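The Spark side of the steps above (the parallel JDBC read, the Hive staging ingest and the write back to an Oracle staging table) can be sketched as follows. This assumes Spark 2.x with Hive support; the connection URL, credentials, table names, partition column and its bounds are all placeholders for illustration.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("OracleToHiveStaging")
  .enableHiveSupport()
  .getOrCreate()

// Step 2: parallel JDBC read from Oracle. numPartitions controls how many
// concurrent connections Spark opens; lowerBound/upperBound split the
// partition column's range across them (values here are illustrative).
val stagingDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCLPDB1")
  .option("driver", "oracle.jdbc.OracleDriver")
  .option("dbtable", "SCOTT.ORDERS")
  .option("user", "scott")
  .option("password", "tiger")
  .option("partitionColumn", "ORDER_ID")
  .option("lowerBound", "1")
  .option("upperBound", "10000000")
  .option("numPartitions", "8")
  .load()

// Step 3: ingest into a Hive staging table.
stagingDF.write.mode("overwrite").saveAsTable("staging.orders")

// Step 6: after enrichment in Hive/Spark, send the result to an Oracle
// staging table; steps 7-8 then load the target tables via PL/SQL.
spark.table("enriched.orders").write
  .format("jdbc")
  .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCLPDB1")
  .option("driver", "oracle.jdbc.OracleDriver")
  .option("dbtable", "SCOTT.ORDERS_STAGING")
  .option("user", "scott")
  .option("password", "tiger")
  .mode("append")
  .save()
```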
Notes:
Oracle column types are not 100% compatible with Spark. For example,
Spark does not recognize the CHAR column type; it has to be converted
to VARCHAR or STRING.
Hive does not have the concept of Oracle's "WITH" clause inline table,
so a script that works in Oracle may not work in Hive. Windowing
functions should be fine.
I tend to do all this via shell scripts that give control at each
layer and raise alarms.
HTH
Dr Mich Talebzadeh
LinkedIn
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
http://talebzadehmich.wordpress.com
*Disclaimer:* Use it at your own risk. Any and all responsibility for
any loss, damage or destruction of data or any other property which
may arise from relying on this email's technical content is explicitly
disclaimed. The author will in no case be liable for any monetary
damages arising from such loss, damage or destruction.
On 29 January 2017 at 10:18, Alex <siri8...@gmail.com> wrote:
Hi All,
Thanks for your responses. Please find the flow diagram below.
Please help me simplify this architecture using Spark.
1) Can I skip steps 1 to 4 and directly store the data in Spark?
If I am storing it in Spark, where is it actually getting stored?
Do I need to retain Hadoop to store the data,
or can I store it directly in Spark and remove Hadoop as well?
I want to remove Informatica for preprocessing and load the files
coming from the server directly into Hadoop/Spark.
So my question is: can I load the files' data directly into Spark?
Then where exactly will the data get stored? Do I need to have Spark
installed on top of HDFS?
2) If I am retaining the architecture below, can I store the output
from Spark directly back to Oracle (step 5 to step 7),
and will Spark's way of storing it back to Oracle be better
than using Sqoop, performance-wise?
3) Can I use Spark Scala UDFs to process data from Hive and retain
the entire architecture?
Which among the above would be optimal?
[Inline image 1: flow diagram]
On Sat, Jan 28, 2017 at 10:38 PM, Sachin Naik
<sachin.u.n...@gmail.com> wrote:
I strongly agree with Jorn and Russell. There are different
solutions for data movement depending upon your needs:
frequency, bi-directional drivers, workflow, handling
duplicate records. This space is known as "Change Data
Capture", or CDC for short. I built some products in this space
that extensively used connection pooling over ODBC/JDBC.
Happy to chat if you need more information.
-Sachin Naik
>> Hard to tell. Can you give more insights on what you are trying
>> to achieve and what the data is about?
>> For example, depending on your use case, Sqoop can make sense
>> or not.
Sent from my iPhone
On Jan 27, 2017, at 11:22 PM, Russell Spitzer
<russell.spit...@gmail.com> wrote:
You can treat Oracle as a JDBC source
(http://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases),
skip Sqoop and the Hive tables, and go straight to queries. Then
you can skip Hive on the way back out (see the same link) and
write directly to Oracle. I'll leave the performance
questions for someone else.
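In code, the skip-Sqoop round trip described here would look roughly like this (Spark 2.x `read.jdbc`/`write.jdbc` API; the connection URL, credentials and table names are placeholders):

```scala
// Read straight from Oracle as a JDBC source, query in Spark, write back.
val props = new java.util.Properties()
props.setProperty("user", "scott")          // placeholder credentials
props.setProperty("password", "tiger")
props.setProperty("driver", "oracle.jdbc.OracleDriver")

val url = "jdbc:oracle:thin:@//dbhost:1521/ORCLPDB1"  // placeholder URL

val df = spark.read.jdbc(url, "SCOTT.ORDERS", props)
val result = df.groupBy("CUSTOMER_ID").count()

// Write the result directly back to Oracle: no Sqoop, no Hive tables.
result.write.mode("append").jdbc(url, "SCOTT.ORDER_COUNTS", props)
```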
On Fri, Jan 27, 2017 at 11:06 PM Sirisha Cheruvu
<siri8...@gmail.com> wrote:
On Sat, Jan 28, 2017 at 6:44 AM, Sirisha Cheruvu
<siri8...@gmail.com> wrote:
Hi Team,
Right now our existing flow is:
Oracle --> Sqoop --> Hive --> Hive queries on Spark SQL
(HiveContext) --> destination Hive table --> Sqoop
export to Oracle
Half of the Hive UDFs required are developed as Java UDFs.
So now I want to know: if I run native Scala UDFs
rather than running the Hive Java UDFs in Spark SQL, will
there be any performance difference?
Can we skip the Sqoop import and export, and
instead load data directly from Oracle into Spark,
code Scala UDFs for the transformations, and export the
output data back to Oracle?
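For reference, registering a native Scala UDF for use from Spark SQL looks like the sketch below (the table, column and function names are made up for illustration). A native UDF avoids the Hive UDF wrapping layer, though whether that yields a measurable speedup for your workload would need benchmarking:

```scala
import org.apache.spark.sql.functions.udf
import spark.implicits._

// A native Scala UDF (illustrative example: normalise a code column).
val normaliseCode =
  udf((s: String) => if (s == null) null else s.trim.toUpperCase)

// Usable from the DataFrame API...
val df = spark.table("staging.orders")
  .withColumn("code_norm", normaliseCode($"code"))

// ...or registered for Spark SQL, replacing a Hive Java UDF call site.
spark.udf.register("normalise_code",
  (s: String) => if (s == null) null else s.trim.toUpperCase)
spark.sql("SELECT normalise_code(code) FROM staging.orders")
```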
Right now the architecture we are using is:
Oracle --> Sqoop (import) --> Hive tables --> Hive queries
--> Spark SQL --> Hive --> Oracle
What would be the optimal architecture to process data
from Oracle using Spark? Can I improve this process in
any way?
Regards,
Sirisha