Re: spark architecture question -- Pleas Read

2017-02-07 Thread Alex
Hi All,

So will there be any performance difference if we recode the entire logic into
Spark SQL, instead of running the Hive native Java UDFs in spark-shell using
HiveContext?

Or does Spark anyway convert the Hive Java UDFs into Spark SQL code, so that we
don't need to rewrite the entire logic in Spark SQL?

On Mon, Feb 6, 2017 at 2:40 AM, kuassi mensah 
wrote:

> Apology in advance for injecting Oracle product in this discussion but I
> thought it might help address the requirements (as far as I understood
> these).
> We are looking into furnishing for Spark a new connector similar to the
> Oracle Datasource for Hadoop,
>
> which
> will implement the Spark DataSource interfaces for Oracle Database.
>
> In summary, it will:
>
>- allow parallel and direct access to the Oracle database (with option
>to control the number of concurrent connections)
>- introspect the Oracle table then dynamically generate partitions of
>Spark JDBCRDDs based on the split pattern and rewrite Spark SQL queries
>into Oracle SQL queries for each partition. The typical use case consists
>in joining fact data (or Big Data) with master data in Oracle.
>- hooks in Oracle JDBC driver for faster type conversions
>- Implement predicate pushdown, partition pruning, column projections
>to the Oracle database, thereby reducing the amount of data to be processed
>on Spark
>    - write back to Oracle table (through parallel insert) the result of
>SparkSQL processing for further mining by traditional BI tools.
>
> You may reach out to me offline for more details if interested,
>
> Kuassi
>
>
> On 1/29/2017 3:39 AM, Mich Talebzadeh wrote:
>
> This is classic; nothing special about it.
>
>
>    1. Your source is Oracle schema tables
>2. You can use Oracle JDBC connection with DIRECT CONNECT and parallel
>processing to read your data from Oracle table into Spark FP using JDBC.
>Ensure that you are getting data from Oracle DB at a time when the DB is
>not busy and network between your Spark and Oracle is reasonable. You will
>be creating multiple connections to your Oracle database from Spark
>3. Create a DF from RDD and ingest your data into Hive staging tables.
>This should be pretty fast. If you are using a recent version of Spark >
>1.5 you can see this in Spark GUI
>4. Once data is ingested into Hive table (frequency Discrete,
>Recurring or Cumulative), then you have your source data in Hive
>5. Do your work in Hive staging tables and then your enriched data
>will go into Hive enriched tables (different from your staging tables). You
>can use Spark to enrich (transform) your data on Hive staging tables
>6. Then use Spark to send that data into Oracle table. Again bear in
>mind that the application has to handle consistency from Big Data into
>RDBMS. For example what you are going to do with failed transactions in
>Oracle
>7. From my experience you also need some  staging tables in Oracle to
>handle inserts from Hive via Spark into Oracle table
>8. Finally run a job in PL/SQL to load Oracle target tables from
>Oracle staging tables
>
> Notes:
>
> Oracle column types are not 100% compatible with Spark. For example, Spark
> does not recognize the CHAR column type, which has to be converted to VARCHAR or
> STRING.
> Hive does not have the concept of Oracle "WITH CLAUSE" inline table. So
> that script that works in Oracle may not work in Hive. Windowing functions
> should be fine.
>
> I tend to do all this via shell script that gives control at each layer
> and creates alarms.
>
> HTH
>
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 29 January 2017 at 10:18, Alex  wrote:
>
>> Hi All,
>>
>> Thanks for your response .. Please find below flow diagram
>>
>> Please help me out simplifying this architecture using Spark
>>
>> 1) Can i skip step 1 to step 4 and directly store it in spark
>> if I am storing it in spark where actually it is getting stored
>> Do i need to retain HAdoop to store data
>> or can i directly store it in spark and remove hadoop also?
>>
>> I want to remove informatica for preprocessing and directly load the
>> files data coming from server to Hadoop/Spark
>>
>> So My Question is Can i directly 

Re: spark architecture question -- Pleas Read

2017-02-05 Thread kuassi mensah
Apologies in advance for injecting an Oracle product into this discussion, but I
thought it might help address the requirements (as far as I understood these).
We are looking into furnishing a new connector for Spark, similar to the
Oracle Datasource for Hadoop, which will implement the Spark DataSource
interfaces for Oracle Database.


In summary, it will:

 * allow parallel and direct access to the Oracle database (with an option
   to control the number of concurrent connections)
 * introspect the Oracle table, then dynamically generate partitions of
   Spark JDBCRDDs based on the split pattern, and rewrite Spark SQL
   queries into Oracle SQL queries for each partition. The typical use
   case is joining fact data (or Big Data) with master data in Oracle.
 * hook into the Oracle JDBC driver for faster type conversions
 * implement predicate pushdown, partition pruning, and column projection
   in the Oracle database, thereby reducing the amount of data to be
   processed in Spark
 * write the result of Spark SQL processing back to an Oracle table
   (through parallel insert) for further mining by traditional BI tools.
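For reference, a partitioned read along these lines can already be sketched with the generic JDBC source that ships with Spark (assuming a Spark 2.x spark-shell session; the connection string, credentials, table and split column are hypothetical):

// Hypothetical connection details; spark is the spark-shell session.
val oracleUrl = "jdbc:oracle:thin:@//dbhost:1521/ORCLPDB"

val orders = spark.read
  .format("jdbc")
  .option("url", oracleUrl)
  .option("dbtable", "SCOTT.ORDERS")
  .option("user", "scott")
  .option("password", "tiger")
  .option("driver", "oracle.jdbc.OracleDriver")
  // Split the table into 8 concurrent JDBC partitions on a numeric column.
  .option("partitionColumn", "ORDER_ID")
  .option("lowerBound", "1")
  .option("upperBound", "10000000")
  .option("numPartitions", "8")
  .load()

// Column pruning and simple comparison filters are pushed down to Oracle
// by the JDBC source, so only the needed columns and rows cross the wire.
val bigOrders = orders
  .select("ORDER_ID", "CUST_ID", "AMOUNT")
  .filter("AMOUNT > 1000")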

You may reach out to me offline for more details if interested,

Kuassi

On 1/29/2017 3:39 AM, Mich Talebzadeh wrote:

This is classic; nothing special about it.

 1. Your source is Oracle schema tables
 2. You can use Oracle JDBC connection with DIRECT CONNECT and
parallel processing to read your data from Oracle table into Spark
FP using JDBC. Ensure that you are getting data from Oracle DB at
a time when the DB is not busy and network between your Spark and
Oracle is reasonable. You will be creating multiple connections to
your Oracle database from Spark
 3. Create a DF from RDD and ingest your data into Hive staging
tables. This should be pretty fast. If you are using a recent
version of Spark > 1.5 you can see this in Spark GUI
 4. Once data is ingested into Hive table (frequency Discrete,
Recurring or Cumulative), then you have your source data in Hive
 5. Do your work in Hive staging tables and then your enriched data
will go into Hive enriched tables (different from your staging
tables). You can use Spark to enrich (transform) your data on Hive
staging tables
 6. Then use Spark to send that data into Oracle table. Again bear in
mind that the application has to handle consistency from Big Data
into RDBMS. For example what you are going to do with failed
transactions in Oracle
 7. From my experience you also need some  staging tables in Oracle to
handle inserts from Hive via Spark into Oracle table
 8. Finally run a job in PL/SQL to load Oracle target tables from
Oracle staging tables

Notes:

Oracle column types are not 100% compatible with Spark. For example, Spark 
does not recognize the CHAR column type, which has to be converted to 
VARCHAR or STRING.
Hive does not have the Oracle "WITH clause" inline-table concept. 
So a script that works in Oracle may not work in Hive. Windowing 
functions should be fine.


I tend to do all this via shell script that gives control at each 
layer and creates alarms.


HTH







Dr Mich Talebzadeh

LinkedIn 
/https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw/


http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk.Any and all responsibility for 
any loss, damage or destruction of data or any other property which 
may arise from relying on this email's technical content is explicitly 
disclaimed. The author will in no case be liable for any monetary 
damages arising from such loss, damage or destruction.



On 29 January 2017 at 10:18, Alex > wrote:


Hi All,

Thanks for your response .. Please find below flow diagram

Please help me out simplifying this architecture using Spark

1) Can i skip step 1 to step 4 and directly store it in spark
if I am storing it in spark where actually it is getting stored
Do i need to retain HAdoop to store data
or can i directly store it in spark and remove hadoop also?

I want to remove informatica for preprocessing and directly load
the files data coming from server to Hadoop/Spark

So My Question is Can i directly load files data to spark ? Then
where exactly the data will get stored.. Do I need to have Spark
installed on Top of HDFS?

2) if I am retaining below architecture Can I store back output
from spark directly to oracle from step 5 to step 7

and will spark way of storing it back to oracle will be better
than using sqoop performance wise
3)Can I use SPark scala UDF to process data from hive and retain
entire architecture

which among the above would be optimal

Inline image 1

On Sat, Jan 28, 2017 at 10:38 PM, Sachin Naik


Re: spark architecture question -- Pleas Read

2017-02-05 Thread Mich Talebzadeh
Agreed.

The best option is to ingest into staging tables in Oracle. Many people
ingest straight into the main Oracle table, which is the wrong design in my opinion.
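As a rough illustration of that staging-table pattern (the DataFrame, connection details, and table names are hypothetical), Spark only ever touches the staging table; a PL/SQL job merges it into the target afterwards:

import java.util.Properties
import org.apache.spark.sql.SaveMode

// Hypothetical: the enriched DataFrame coming out of the Spark transformations.
val enrichedDf = spark.table("enriched.sales")

val oracleUrl = "jdbc:oracle:thin:@//dbhost:1521/ORCLPDB"   // hypothetical
val props = new Properties()
props.setProperty("user", "etl_user")
props.setProperty("password", "etl_pwd")
props.setProperty("driver", "oracle.jdbc.OracleDriver")

// Spark appends only into the dedicated staging table ...
enrichedDf.write
  .mode(SaveMode.Append)
  .jdbc(oracleUrl, "ETL_OWNER.SALES_STG", props)

// ... and a downstream PL/SQL job (outside Spark) merges the staging table
// into the real target table.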

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 5 February 2017 at 08:18, Jörn Franke  wrote:

> You should see an exception and your job fails by default after I think 4
> attempts. If you see an exception you may want to clean the staging table
> for loading and reload again.
>
> On 4 Feb 2017, at 09:06, Mich Talebzadeh 
> wrote:
>
> Ingesting from Hive tables back into Oracle. What mechanisms are in place
> to ensure that data ends up consistently into Oracle table and Spark is
> notified when Oracle has issues with data ingested (say rollback)?
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 29 January 2017 at 22:22, Jörn Franke  wrote:
>
>> You can use HDFS, S3, Azure, GlusterFS, Ceph, Ignite (in-memory), etc. A
>> Spark cluster itself does not store anything; it just processes.
>>
>> On 29 Jan 2017, at 15:37, Alex  wrote:
>>
>> But for persistance after intermediate processing can i use spark cluster
>> itself or i have to use hadoop cluster?!
>>
>> On Jan 29, 2017 7:36 PM, "Deepak Sharma"  wrote:
>>
>> The better way is to read the data directly into spark using spark sql
>> read jdbc .
>> Apply the udf's locally .
>> Then save the data frame back to Oracle using dataframe's write jdbc.
>>
>> Thanks
>> Deepak
>>
>> On Jan 29, 2017 7:15 PM, "Jörn Franke"  wrote:
>>
>>> One alternative could be the oracle Hadoop loader and other Oracle
>>> products, but you have to invest some money and probably buy their Hadoop
>>> Appliance, which you have to evaluate if it make sense (can get expensive
>>> with large clusters etc).
>>>
>>> Another alternative would be to get rid of Oracle alltogether and use
>>> other databases.
>>>
>>> However, can you elaborate a little bit on your use case and the
>>> business logic as well as SLA requires. Otherwise all recommendations are
>>> right because the requirements you presented are very generic.
>>>
>>> About get rid of Hadoop - this depends! You will need some resource
>>> manager (yarn, mesos, kubernetes etc) and most likely also a distributed
>>> file system. Spark supports through the Hadoop apis a wide range of file
>>> systems, but does not need HDFS for persistence. You can have local
>>> filesystem (ie any file system mounted to a node, so also distributed ones,
>>> such as zfs), cloud file systems (s3, azure blob etc).
>>>
>>>
>>>
>>> On 29 Jan 2017, at 11:18, Alex  wrote:
>>>
>>> Hi All,
>>>
>>> Thanks for your response .. Please find below flow diagram
>>>
>>> Please help me out simplifying this architecture using Spark
>>>
>>> 1) Can i skip step 1 to step 4 and directly store it in spark
>>> if I am storing it in spark where actually it is getting stored
>>> Do i need to retain HAdoop to store data
>>> or can i directly store it in spark and remove hadoop also?
>>>
>>> I want to remove informatica for preprocessing and directly load the
>>> files data coming from server to Hadoop/Spark
>>>
>>> So My Question is Can i directly load files data to spark ? Then where
>>> exactly the data will get stored.. Do I need to have Spark installed on Top
>>> of HDFS?
>>>
>>> 2) if I am retaining below architecture Can I store back output from
>>> spark directly to oracle from step 5 to step 7
>>>
>>> and will spark way of storing it back to oracle will be better than
>>> using sqoop performance wise
>>> 3)Can I use SPark scala UDF to process data from hive and retain entire
>>> architecture
>>>
>>> which among the above would be optimal
>>>
>>> [image: Inline image 1]
>>>
>>> On Sat, Jan 28, 2017 at 10:38 PM, Sachin Naik 
>>> wrote:
>>>
 I strongly agree with Jorn and 

Re: spark architecture question -- Pleas Read

2017-02-05 Thread Jörn Franke
You should see an exception, and your job fails by default after (I think) 4 
attempts. If you see an exception, you may want to clean the staging table used for 
loading and then reload.
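A minimal sketch of that clean-and-reload step on the Spark side (the truncate option is available in the JDBC writer from Spark 2.1 onwards; the DataFrame, connection details, and table name are hypothetical):

import java.util.Properties
import org.apache.spark.sql.SaveMode

// Hypothetical: the DataFrame produced by the Spark/Hive processing.
val resultDf = spark.table("enriched.sales")

val props = new Properties()
props.setProperty("user", "etl_user")
props.setProperty("password", "etl_pwd")
props.setProperty("driver", "oracle.jdbc.OracleDriver")

// Overwrite with truncate=true empties the staging table and reloads it
// (rather than dropping and recreating it), so a failed load can simply be rerun.
resultDf.write
  .mode(SaveMode.Overwrite)
  .option("truncate", "true")
  .jdbc("jdbc:oracle:thin:@//dbhost:1521/ORCLPDB", "ETL_OWNER.SALES_STG", props)

The 4 attempts mentioned presumably correspond to spark.task.maxFailures, which defaults to 4.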

> On 4 Feb 2017, at 09:06, Mich Talebzadeh  wrote:
> 
> Ingesting from Hive tables back into Oracle. What mechanisms are in place to 
> ensure that data ends up consistently into Oracle table and Spark is notified 
> when Oracle has issues with data ingested (say rollback)?
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  
> 
>> On 29 January 2017 at 22:22, Jörn Franke  wrote:
>> You can use HDFS, S3, Azure, GlusterFS, Ceph, Ignite (in-memory), etc. A 
>> Spark cluster itself does not store anything; it just processes. 
>> 
>>> On 29 Jan 2017, at 15:37, Alex  wrote:
>>> 
>>> But for persistance after intermediate processing can i use spark cluster 
>>> itself or i have to use hadoop cluster?!
>>> 
>>> On Jan 29, 2017 7:36 PM, "Deepak Sharma"  wrote:
>>> The better way is to read the data directly into spark using spark sql read 
>>> jdbc .
>>> Apply the udf's locally .
>>> Then save the data frame back to Oracle using dataframe's write jdbc.
>>> 
>>> Thanks
>>> Deepak
>>> 
 On Jan 29, 2017 7:15 PM, "Jörn Franke"  wrote:
 One alternative could be the oracle Hadoop loader and other Oracle 
 products, but you have to invest some money and probably buy their Hadoop 
 Appliance, which you have to evaluate if it make sense (can get expensive 
 with large clusters etc).
 
 Another alternative would be to get rid of Oracle alltogether and use 
 other databases.
 
 However, can you elaborate a little bit on your use case and the business 
 logic as well as SLA requires. Otherwise all recommendations are right 
 because the requirements you presented are very generic.
 
 About get rid of Hadoop - this depends! You will need some resource 
 manager (yarn, mesos, kubernetes etc) and most likely also a distributed 
 file system. Spark supports through the Hadoop apis a wide range of file 
 systems, but does not need HDFS for persistence. You can have local 
 filesystem (ie any file system mounted to a node, so also distributed 
 ones, such as zfs), cloud file systems (s3, azure blob etc).
 
 
 
> On 29 Jan 2017, at 11:18, Alex  wrote:
> 
> Hi All,
> 
> Thanks for your response .. Please find below flow diagram
> 
> Please help me out simplifying this architecture using Spark
> 
> 1) Can i skip step 1 to step 4 and directly store it in spark
> if I am storing it in spark where actually it is getting stored
> Do i need to retain HAdoop to store data
> or can i directly store it in spark and remove hadoop also?
> 
> I want to remove informatica for preprocessing and directly load the 
> files data coming from server to Hadoop/Spark
> 
> So My Question is Can i directly load files data to spark ? Then where 
> exactly the data will get stored.. Do I need to have Spark installed on 
> Top of HDFS?
> 
> 2) if I am retaining below architecture Can I store back output from 
> spark directly to oracle from step 5 to step 7 
> 
> and will spark way of storing it back to oracle will be better than using 
> sqoop performance wise
> 3)Can I use SPark scala UDF to process data from hive and retain entire 
> architecture 
> 
> which among the above would be optimal
> 
> 
> 
>> On Sat, Jan 28, 2017 at 10:38 PM, Sachin Naik  
>> wrote:
>> I strongly agree with Jorn and Russell. There are different solutions 
>> for data movement depending upon your needs frequency, bi-directional 
>> drivers. workflow, handling duplicate records. This is a space is known 
>> as " Change Data Capture - CDC" for short. If you need more information, 
>> I would be happy to chat with you.  I built some products in this space 
>> that extensively used connection pooling over ODBC/JDBC. 
>> 
>> Happy to chat if you need more information. 
>> 
>> -Sachin Naik
>> 
>> >>Hard to tell. Can you give more insights >>on what you try to achieve 
>> >>and what the data is about?
>> >>For example, depending on your use case sqoop can make sense or not.
>> Sent from my iPhone
>> 
>>> On Jan 27, 

Re: spark architecture question -- Pleas Read

2017-02-04 Thread Mich Talebzadeh
Ingesting from Hive tables back into Oracle: what mechanisms are in place
to ensure that the data ends up consistently in the Oracle table, and that Spark is
notified when Oracle has issues with the ingested data (say, a rollback)?

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 29 January 2017 at 22:22, Jörn Franke  wrote:

> You can use HDFS, S3, Azure, GlusterFS, Ceph, Ignite (in-memory), etc. A
> Spark cluster itself does not store anything; it just processes.
>
> On 29 Jan 2017, at 15:37, Alex  wrote:
>
> But for persistance after intermediate processing can i use spark cluster
> itself or i have to use hadoop cluster?!
>
> On Jan 29, 2017 7:36 PM, "Deepak Sharma"  wrote:
>
> The better way is to read the data directly into spark using spark sql
> read jdbc .
> Apply the udf's locally .
> Then save the data frame back to Oracle using dataframe's write jdbc.
>
> Thanks
> Deepak
>
> On Jan 29, 2017 7:15 PM, "Jörn Franke"  wrote:
>
>> One alternative could be the oracle Hadoop loader and other Oracle
>> products, but you have to invest some money and probably buy their Hadoop
>> Appliance, which you have to evaluate if it make sense (can get expensive
>> with large clusters etc).
>>
>> Another alternative would be to get rid of Oracle alltogether and use
>> other databases.
>>
>> However, can you elaborate a little bit on your use case and the business
>> logic as well as SLA requires. Otherwise all recommendations are right
>> because the requirements you presented are very generic.
>>
>> About get rid of Hadoop - this depends! You will need some resource
>> manager (yarn, mesos, kubernetes etc) and most likely also a distributed
>> file system. Spark supports through the Hadoop apis a wide range of file
>> systems, but does not need HDFS for persistence. You can have local
>> filesystem (ie any file system mounted to a node, so also distributed ones,
>> such as zfs), cloud file systems (s3, azure blob etc).
>>
>>
>>
>> On 29 Jan 2017, at 11:18, Alex  wrote:
>>
>> Hi All,
>>
>> Thanks for your response .. Please find below flow diagram
>>
>> Please help me out simplifying this architecture using Spark
>>
>> 1) Can i skip step 1 to step 4 and directly store it in spark
>> if I am storing it in spark where actually it is getting stored
>> Do i need to retain HAdoop to store data
>> or can i directly store it in spark and remove hadoop also?
>>
>> I want to remove informatica for preprocessing and directly load the
>> files data coming from server to Hadoop/Spark
>>
>> So My Question is Can i directly load files data to spark ? Then where
>> exactly the data will get stored.. Do I need to have Spark installed on Top
>> of HDFS?
>>
>> 2) if I am retaining below architecture Can I store back output from
>> spark directly to oracle from step 5 to step 7
>>
>> and will spark way of storing it back to oracle will be better than using
>> sqoop performance wise
>> 3)Can I use SPark scala UDF to process data from hive and retain entire
>> architecture
>>
>> which among the above would be optimal
>>
>> [image: Inline image 1]
>>
>> On Sat, Jan 28, 2017 at 10:38 PM, Sachin Naik 
>> wrote:
>>
>>> I strongly agree with Jorn and Russell. There are different solutions
>>> for data movement depending upon your needs frequency, bi-directional
>>> drivers. workflow, handling duplicate records. This is a space is known as
>>> " Change Data Capture - CDC" for short. If you need more information, I
>>> would be happy to chat with you.  I built some products in this space that
>>> extensively used connection pooling over ODBC/JDBC.
>>>
>>> Happy to chat if you need more information.
>>>
>>> -Sachin Naik
>>>
>>> >>Hard to tell. Can you give more insights >>on what you try to achieve
>>> and what the data is about?
>>> >>For example, depending on your use case sqoop can make sense or not.
>>> Sent from my iPhone
>>>
>>> On Jan 27, 2017, at 11:22 PM, Russell Spitzer 
>>> wrote:
>>>
>>> You can treat Oracle as a JDBC source (http://spark.apache.org/docs/
>>> latest/sql-programming-guide.html#jdbc-to-other-databases) and skip
>>> Sqoop, HiveTables and go straight to Queries. Then you can skip hive on the
>>> way back out (see the same link) and write directly to Oracle. I'll leave
>>> the performance questions for someone else.
>>>
>>> On Fri, Jan 27, 2017 at 11:06 PM Sirisha 

Re: spark architecture question -- Pleas Read

2017-01-29 Thread Mich Talebzadeh
Sorry, a mistake in my previous mail; corrected below.


   1. Put the CSV files into HDFS under /apps//data/staging/
   2. Multiple CSV files for the same table can co-exist
   3. e.g. val df1 = spark.read.option("header", false).csv(location)
   4. Once the CSV files are read into a DataFrame you can do loads of things. The
   CSV files have to reside in HDFS


HTH
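Spelled out, a minimal sketch of that flow (the HDFS path, database and view names are hypothetical; assumes a spark-shell session):

// Read every CSV file staged for one table; all files in the directory
// are picked up by the reader.
val location = "hdfs:///apps/myapp/data/staging/SALES"   // hypothetical path
val df1 = spark.read
  .option("header", "false")
  .option("inferSchema", "true")
  .csv(location)

// From here the DataFrame can be transformed, registered as a view,
// or written to a Hive staging table.
df1.createOrReplaceTempView("sales_staging")
spark.sql("SELECT COUNT(*) FROM sales_staging").show()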





Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 29 January 2017 at 15:16, Mich Talebzadeh 
wrote:

> you can use Spark directly on csv file.
>
>
>1. Put the csv files into HDFS /apps//data/staging/<
>TABLE_NAME>
>2. Multiple csv files for the same table can co-exist
>3. like df1 = spark.read.option("header", false).csv(location)
>4.
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 29 January 2017 at 14:37, Alex  wrote:
>
>> But for persistance after intermediate processing can i use spark cluster
>> itself or i have to use hadoop cluster?!
>>
>> On Jan 29, 2017 7:36 PM, "Deepak Sharma"  wrote:
>>
>> The better way is to read the data directly into spark using spark sql
>> read jdbc .
>> Apply the udf's locally .
>> Then save the data frame back to Oracle using dataframe's write jdbc.
>>
>> Thanks
>> Deepak
>>
>> On Jan 29, 2017 7:15 PM, "Jörn Franke"  wrote:
>>
>>> One alternative could be the oracle Hadoop loader and other Oracle
>>> products, but you have to invest some money and probably buy their Hadoop
>>> Appliance, which you have to evaluate if it make sense (can get expensive
>>> with large clusters etc).
>>>
>>> Another alternative would be to get rid of Oracle alltogether and use
>>> other databases.
>>>
>>> However, can you elaborate a little bit on your use case and the
>>> business logic as well as SLA requires. Otherwise all recommendations are
>>> right because the requirements you presented are very generic.
>>>
>>> About get rid of Hadoop - this depends! You will need some resource
>>> manager (yarn, mesos, kubernetes etc) and most likely also a distributed
>>> file system. Spark supports through the Hadoop apis a wide range of file
>>> systems, but does not need HDFS for persistence. You can have local
>>> filesystem (ie any file system mounted to a node, so also distributed ones,
>>> such as zfs), cloud file systems (s3, azure blob etc).
>>>
>>>
>>>
>>> On 29 Jan 2017, at 11:18, Alex  wrote:
>>>
>>> Hi All,
>>>
>>> Thanks for your response .. Please find below flow diagram
>>>
>>> Please help me out simplifying this architecture using Spark
>>>
>>> 1) Can i skip step 1 to step 4 and directly store it in spark
>>> if I am storing it in spark where actually it is getting stored
>>> Do i need to retain HAdoop to store data
>>> or can i directly store it in spark and remove hadoop also?
>>>
>>> I want to remove informatica for preprocessing and directly load the
>>> files data coming from server to Hadoop/Spark
>>>
>>> So My Question is Can i directly load files data to spark ? Then where
>>> exactly the data will get stored.. Do I need to have Spark installed on Top
>>> of HDFS?
>>>
>>> 2) if I am retaining below architecture Can I store back output from
>>> spark directly to oracle from step 5 to step 7
>>>
>>> and will spark way of storing it back to oracle will be better than
>>> using sqoop performance wise
>>> 3)Can I use SPark scala UDF to process data from hive and retain entire
>>> architecture
>>>
>>> which among the above would be optimal
>>>
>>> [image: Inline image 1]
>>>
>>> On Sat, Jan 28, 2017 at 10:38 PM, Sachin Naik 
>>> wrote:
>>>
 I strongly agree with Jorn and Russell. There are different solutions
 for data movement depending upon your needs frequency, bi-directional
 drivers. workflow, handling duplicate records. This is a space is known as
 " Change Data Capture - CDC" for short. If you need more information, I

Re: spark architecture question -- Pleas Read

2017-01-29 Thread Mich Talebzadeh
You can use Spark directly on the CSV files.


   1. Put the csv files into HDFS /apps//data/staging/
   2. Multiple csv files for the same table can co-exist
   3. like df1 = spark.read.option("header", false).csv(location)
   4.


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 29 January 2017 at 14:37, Alex  wrote:

> But for persistance after intermediate processing can i use spark cluster
> itself or i have to use hadoop cluster?!
>
> On Jan 29, 2017 7:36 PM, "Deepak Sharma"  wrote:
>
> The better way is to read the data directly into spark using spark sql
> read jdbc .
> Apply the udf's locally .
> Then save the data frame back to Oracle using dataframe's write jdbc.
>
> Thanks
> Deepak
>
> On Jan 29, 2017 7:15 PM, "Jörn Franke"  wrote:
>
>> One alternative could be the oracle Hadoop loader and other Oracle
>> products, but you have to invest some money and probably buy their Hadoop
>> Appliance, which you have to evaluate if it make sense (can get expensive
>> with large clusters etc).
>>
>> Another alternative would be to get rid of Oracle alltogether and use
>> other databases.
>>
>> However, can you elaborate a little bit on your use case and the business
>> logic as well as SLA requires. Otherwise all recommendations are right
>> because the requirements you presented are very generic.
>>
>> About get rid of Hadoop - this depends! You will need some resource
>> manager (yarn, mesos, kubernetes etc) and most likely also a distributed
>> file system. Spark supports through the Hadoop apis a wide range of file
>> systems, but does not need HDFS for persistence. You can have local
>> filesystem (ie any file system mounted to a node, so also distributed ones,
>> such as zfs), cloud file systems (s3, azure blob etc).
>>
>>
>>
>> On 29 Jan 2017, at 11:18, Alex  wrote:
>>
>> Hi All,
>>
>> Thanks for your response .. Please find below flow diagram
>>
>> Please help me out simplifying this architecture using Spark
>>
>> 1) Can i skip step 1 to step 4 and directly store it in spark
>> if I am storing it in spark where actually it is getting stored
>> Do i need to retain HAdoop to store data
>> or can i directly store it in spark and remove hadoop also?
>>
>> I want to remove informatica for preprocessing and directly load the
>> files data coming from server to Hadoop/Spark
>>
>> So My Question is Can i directly load files data to spark ? Then where
>> exactly the data will get stored.. Do I need to have Spark installed on Top
>> of HDFS?
>>
>> 2) if I am retaining below architecture Can I store back output from
>> spark directly to oracle from step 5 to step 7
>>
>> and will spark way of storing it back to oracle will be better than using
>> sqoop performance wise
>> 3)Can I use SPark scala UDF to process data from hive and retain entire
>> architecture
>>
>> which among the above would be optimal
>>
>> [image: Inline image 1]
>>
>> On Sat, Jan 28, 2017 at 10:38 PM, Sachin Naik 
>> wrote:
>>
>>> I strongly agree with Jorn and Russell. There are different solutions
>>> for data movement depending upon your needs frequency, bi-directional
>>> drivers. workflow, handling duplicate records. This is a space is known as
>>> " Change Data Capture - CDC" for short. If you need more information, I
>>> would be happy to chat with you.  I built some products in this space that
>>> extensively used connection pooling over ODBC/JDBC.
>>>
>>> Happy to chat if you need more information.
>>>
>>> -Sachin Naik
>>>
>>> >>Hard to tell. Can you give more insights >>on what you try to achieve
>>> and what the data is about?
>>> >>For example, depending on your use case sqoop can make sense or not.
>>> Sent from my iPhone
>>>
>>> On Jan 27, 2017, at 11:22 PM, Russell Spitzer 
>>> wrote:
>>>
>>> You can treat Oracle as a JDBC source (http://spark.apache.org/docs/
>>> latest/sql-programming-guide.html#jdbc-to-other-databases) and skip
>>> Sqoop, HiveTables and go straight to Queries. Then you can skip hive on the
>>> way back out (see the same link) and write directly to Oracle. I'll leave
>>> the performance questions for someone else.
>>>
>>> On Fri, Jan 27, 2017 at 11:06 PM Sirisha Cheruvu 
>>> wrote:
>>>

 On Sat, Jan 28, 2017 at 6:44 AM, Sirisha Cheruvu 
 wrote:

 Hi Team,

 RIght now our existing flow is


Re: spark architecture question -- Pleas Read

2017-01-29 Thread Alex
But for persistence after intermediate processing, can I use the Spark cluster
itself, or do I have to use a Hadoop cluster?!

On Jan 29, 2017 7:36 PM, "Deepak Sharma"  wrote:

The better way is to read the data directly into spark using spark sql read
jdbc .
Apply the udf's locally .
Then save the data frame back to Oracle using dataframe's write jdbc.

Thanks
Deepak

On Jan 29, 2017 7:15 PM, "Jörn Franke"  wrote:

> One alternative could be the oracle Hadoop loader and other Oracle
> products, but you have to invest some money and probably buy their Hadoop
> Appliance, which you have to evaluate if it make sense (can get expensive
> with large clusters etc).
>
> Another alternative would be to get rid of Oracle alltogether and use
> other databases.
>
> However, can you elaborate a little bit on your use case and the business
> logic as well as SLA requires. Otherwise all recommendations are right
> because the requirements you presented are very generic.
>
> About get rid of Hadoop - this depends! You will need some resource
> manager (yarn, mesos, kubernetes etc) and most likely also a distributed
> file system. Spark supports through the Hadoop apis a wide range of file
> systems, but does not need HDFS for persistence. You can have local
> filesystem (ie any file system mounted to a node, so also distributed ones,
> such as zfs), cloud file systems (s3, azure blob etc).
>
>
>
> On 29 Jan 2017, at 11:18, Alex  wrote:
>
> Hi All,
>
> Thanks for your response .. Please find below flow diagram
>
> Please help me out simplifying this architecture using Spark
>
> 1) Can i skip step 1 to step 4 and directly store it in spark
> if I am storing it in spark where actually it is getting stored
> Do i need to retain HAdoop to store data
> or can i directly store it in spark and remove hadoop also?
>
> I want to remove informatica for preprocessing and directly load the files
> data coming from server to Hadoop/Spark
>
> So My Question is Can i directly load files data to spark ? Then where
> exactly the data will get stored.. Do I need to have Spark installed on Top
> of HDFS?
>
> 2) if I am retaining below architecture Can I store back output from spark
> directly to oracle from step 5 to step 7
>
> and will spark way of storing it back to oracle will be better than using
> sqoop performance wise
> 3)Can I use SPark scala UDF to process data from hive and retain entire
> architecture
>
> which among the above would be optimal
>
> [image: Inline image 1]
>
> On Sat, Jan 28, 2017 at 10:38 PM, Sachin Naik 
> wrote:
>
>> I strongly agree with Jorn and Russell. There are different solutions for
>> data movement depending upon your needs frequency, bi-directional drivers.
>> workflow, handling duplicate records. This is a space is known as " Change
>> Data Capture - CDC" for short. If you need more information, I would be
>> happy to chat with you.  I built some products in this space that
>> extensively used connection pooling over ODBC/JDBC.
>>
>> Happy to chat if you need more information.
>>
>> -Sachin Naik
>>
>> >>Hard to tell. Can you give more insights >>on what you try to achieve
>> and what the data is about?
>> >>For example, depending on your use case sqoop can make sense or not.
>> Sent from my iPhone
>>
>> On Jan 27, 2017, at 11:22 PM, Russell Spitzer 
>> wrote:
>>
>> You can treat Oracle as a JDBC source (http://spark.apache.org/docs/
>> latest/sql-programming-guide.html#jdbc-to-other-databases) and skip
>> Sqoop, HiveTables and go straight to Queries. Then you can skip hive on the
>> way back out (see the same link) and write directly to Oracle. I'll leave
>> the performance questions for someone else.
>>
>> On Fri, Jan 27, 2017 at 11:06 PM Sirisha Cheruvu 
>> wrote:
>>
>>>
>>> On Sat, Jan 28, 2017 at 6:44 AM, Sirisha Cheruvu 
>>> wrote:
>>>
>>> Hi Team,
>>>
>>> RIght now our existing flow is
>>>
>>> Oracle-->Sqoop --> Hive--> Hive Queries on Spark-sql (Hive
>>> Context)-->Destination Hive table -->sqoop export to Oracle
>>>
>>> Half of the Hive UDFS required is developed in Java UDF..
>>>
>>> SO Now I want to know if I run the native scala UDF's than runninng hive
>>> java udfs in spark-sql will there be any performance difference
>>>
>>>
>>> Can we skip the Sqoop Import and export part and
>>>
>>> Instead directly load data from oracle to spark and code Scala UDF's for
>>> transformations and export output data back to oracle?
>>>
>>> RIght now the architecture we are using is
>>>
>>> oracle-->Sqoop (Import)-->Hive Tables--> Hive Queries --> Spark-SQL-->
>>> Hive --> Oracle
>>> what would be optimal architecture to process data from oracle using
>>> spark ?? can i anyway better this process ?
>>>
>>>
>>>
>>>
>>> Regards,
>>> Sirisha
>>>
>>>
>>>
>


Re: spark architecture question -- Pleas Read

2017-01-29 Thread Jörn Franke
I meant a distributed file system such as Ceph, Gluster, etc.

> On 29 Jan 2017, at 14:45, Jörn Franke  wrote:
> 
> One alternative could be the oracle Hadoop loader and other Oracle products, 
> but you have to invest some money and probably buy their Hadoop Appliance, 
> which you have to evaluate if it make sense (can get expensive with large 
> clusters etc).
> 
> Another alternative would be to get rid of Oracle alltogether and use other 
> databases.
> 
> However, can you elaborate a little bit on your use case and the business 
> logic as well as SLA requires. Otherwise all recommendations are right 
> because the requirements you presented are very generic.
> 
> About get rid of Hadoop - this depends! You will need some resource manager 
> (yarn, mesos, kubernetes etc) and most likely also a distributed file system. 
> Spark supports through the Hadoop apis a wide range of file systems, but does 
> not need HDFS for persistence. You can have local filesystem (ie any file 
> system mounted to a node, so also distributed ones, such as zfs), cloud file 
> systems (s3, azure blob etc).
> 
> 
> 
>> On 29 Jan 2017, at 11:18, Alex  wrote:
>> 
>> Hi All,
>> 
>> Thanks for your response .. Please find below flow diagram
>> 
>> Please help me out simplifying this architecture using Spark
>> 
>> 1) Can i skip step 1 to step 4 and directly store it in spark
>> if I am storing it in spark where actually it is getting stored
>> Do i need to retain HAdoop to store data
>> or can i directly store it in spark and remove hadoop also?
>> 
>> I want to remove informatica for preprocessing and directly load the files 
>> data coming from server to Hadoop/Spark
>> 
>> So My Question is Can i directly load files data to spark ? Then where 
>> exactly the data will get stored.. Do I need to have Spark installed on Top 
>> of HDFS?
>> 
>> 2) if I am retaining below architecture Can I store back output from spark 
>> directly to oracle from step 5 to step 7 
>> 
>> and will spark way of storing it back to oracle will be better than using 
>> sqoop performance wise
>> 3)Can I use SPark scala UDF to process data from hive and retain entire 
>> architecture 
>> 
>> which among the above would be optimal
>> 
>> 
>> 
>>> On Sat, Jan 28, 2017 at 10:38 PM, Sachin Naik  
>>> wrote:
>>> I strongly agree with Jorn and Russell. There are different solutions for 
>>> data movement depending upon your needs frequency, bi-directional drivers. 
>>> workflow, handling duplicate records. This is a space is known as " Change 
>>> Data Capture - CDC" for short. If you need more information, I would be 
>>> happy to chat with you.  I built some products in this space that 
>>> extensively used connection pooling over ODBC/JDBC. 
>>> 
>>> Happy to chat if you need more information. 
>>> 
>>> -Sachin Naik
>>> 
>>> >>Hard to tell. Can you give more insights >>on what you try to achieve and 
>>> >>what the data is about?
>>> >>For example, depending on your use case sqoop can make sense or not.
>>> Sent from my iPhone
>>> 
 On Jan 27, 2017, at 11:22 PM, Russell Spitzer  
 wrote:
 
 You can treat Oracle as a JDBC source 
 (http://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases)
  and skip Sqoop, HiveTables and go straight to Queries. Then you can skip 
 hive on the way back out (see the same link) and write directly to Oracle. 
 I'll leave the performance questions for someone else. 
 
> On Fri, Jan 27, 2017 at 11:06 PM Sirisha Cheruvu  
> wrote:
> 
> On Sat, Jan 28, 2017 at 6:44 AM, Sirisha Cheruvu  
> wrote:
> Hi Team,
> 
> RIght now our existing flow is
> 
> Oracle-->Sqoop --> Hive--> Hive Queries on Spark-sql (Hive 
> Context)-->Destination Hive table -->sqoop export to Oracle
> 
> Half of the Hive UDFS required is developed in Java UDF..
> 
> SO Now I want to know if I run the native scala UDF's than runninng hive 
> java udfs in spark-sql will there be any performance difference
> 
> 
> Can we skip the Sqoop Import and export part and 
> 
> Instead directly load data from oracle to spark and code Scala UDF's for 
> transformations and export output data back to oracle?
> 
> RIght now the architecture we are using is
> 
> oracle-->Sqoop (Import)-->Hive Tables--> Hive Queries --> Spark-SQL--> 
> Hive --> Oracle 
> what would be optimal architecture to process data from oracle using 
> spark ?? can i anyway better this process ?
> 
> 
> 
> 
> Regards,
> Sirisha 
> 
>> 


Re: spark architecture question -- Pleas Read

2017-01-29 Thread Deepak Sharma
The better way is to read the data directly into Spark using the Spark SQL JDBC
read.
Apply the UDFs locally.
Then save the DataFrame back to Oracle using the DataFrame's JDBC write.

Thanks
Deepak
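A rough end-to-end sketch of that approach (assuming a Spark 2.x session; the connection details, table names, and UDF logic are hypothetical):

import java.util.Properties
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.udf

val url = "jdbc:oracle:thin:@//dbhost:1521/ORCLPDB"   // hypothetical
val props = new Properties()
props.setProperty("user", "scott")
props.setProperty("password", "tiger")
props.setProperty("driver", "oracle.jdbc.OracleDriver")

// 1. Read straight from Oracle, with no Sqoop or Hive in between.
val src = spark.read.jdbc(url, "SCOTT.ORDERS", props)

// 2. Apply the transformation logic as Spark-side UDFs.
val normalize = udf((s: String) => if (s == null) null else s.trim.toUpperCase)
val transformed = src.withColumn("CUST_NAME", normalize(src("CUST_NAME")))

// 3. Write the result back to Oracle.
transformed.write.mode(SaveMode.Append).jdbc(url, "SCOTT.ORDERS_OUT", props)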

On Jan 29, 2017 7:15 PM, "Jörn Franke"  wrote:

> One alternative could be the oracle Hadoop loader and other Oracle
> products, but you have to invest some money and probably buy their Hadoop
> Appliance, which you have to evaluate if it make sense (can get expensive
> with large clusters etc).
>
> Another alternative would be to get rid of Oracle alltogether and use
> other databases.
>
> However, can you elaborate a little bit on your use case and the business
> logic as well as SLA requires. Otherwise all recommendations are right
> because the requirements you presented are very generic.
>
> About get rid of Hadoop - this depends! You will need some resource
> manager (yarn, mesos, kubernetes etc) and most likely also a distributed
> file system. Spark supports through the Hadoop apis a wide range of file
> systems, but does not need HDFS for persistence. You can have local
> filesystem (ie any file system mounted to a node, so also distributed ones,
> such as zfs), cloud file systems (s3, azure blob etc).
>
>
>
> On 29 Jan 2017, at 11:18, Alex  wrote:
>
> Hi All,
>
> Thanks for your response .. Please find below flow diagram
>
> Please help me out simplifying this architecture using Spark
>
> 1) Can i skip step 1 to step 4 and directly store it in spark
> if I am storing it in spark where actually it is getting stored
> Do i need to retain HAdoop to store data
> or can i directly store it in spark and remove hadoop also?
>
> I want to remove informatica for preprocessing and directly load the files
> data coming from server to Hadoop/Spark
>
> So My Question is Can i directly load files data to spark ? Then where
> exactly the data will get stored.. Do I need to have Spark installed on Top
> of HDFS?
>
> 2) if I am retaining below architecture Can I store back output from spark
> directly to oracle from step 5 to step 7
>
> and will spark way of storing it back to oracle will be better than using
> sqoop performance wise
> 3)Can I use SPark scala UDF to process data from hive and retain entire
> architecture
>
> which among the above would be optimal
>
> [image: Inline image 1]
>
> On Sat, Jan 28, 2017 at 10:38 PM, Sachin Naik 
> wrote:
>
>> I strongly agree with Jorn and Russell. There are different solutions for
>> data movement depending upon your needs frequency, bi-directional drivers.
>> workflow, handling duplicate records. This is a space is known as " Change
>> Data Capture - CDC" for short. If you need more information, I would be
>> happy to chat with you.  I built some products in this space that
>> extensively used connection pooling over ODBC/JDBC.
>>
>> Happy to chat if you need more information.
>>
>> -Sachin Naik
>>
>> >>Hard to tell. Can you give more insights >>on what you try to achieve
>> and what the data is about?
>> >>For example, depending on your use case sqoop can make sense or not.
>> Sent from my iPhone
>>
>> On Jan 27, 2017, at 11:22 PM, Russell Spitzer 
>> wrote:
>>
>> You can treat Oracle as a JDBC source (http://spark.apache.org/docs/
>> latest/sql-programming-guide.html#jdbc-to-other-databases) and skip
>> Sqoop, HiveTables and go straight to Queries. Then you can skip hive on the
>> way back out (see the same link) and write directly to Oracle. I'll leave
>> the performance questions for someone else.
>>
>> On Fri, Jan 27, 2017 at 11:06 PM Sirisha Cheruvu 
>> wrote:
>>
>>>
>>> On Sat, Jan 28, 2017 at 6:44 AM, Sirisha Cheruvu 
>>> wrote:
>>>
>>> Hi Team,
>>>
>>> RIght now our existing flow is
>>>
>>> Oracle-->Sqoop --> Hive--> Hive Queries on Spark-sql (Hive
>>> Context)-->Destination Hive table -->sqoop export to Oracle
>>>
>>> Half of the Hive UDFS required is developed in Java UDF..
>>>
>>> SO Now I want to know if I run the native scala UDF's than runninng hive
>>> java udfs in spark-sql will there be any performance difference
>>>
>>>
>>> Can we skip the Sqoop Import and export part and
>>>
>>> Instead directly load data from oracle to spark and code Scala UDF's for
>>> transformations and export output data back to oracle?
>>>
>>> RIght now the architecture we are using is
>>>
>>> oracle-->Sqoop (Import)-->Hive Tables--> Hive Queries --> Spark-SQL-->
>>> Hive --> Oracle
>>> what would be optimal architecture to process data from oracle using
>>> spark ?? can i anyway better this process ?
>>>
>>>
>>>
>>>
>>> Regards,
>>> Sirisha
>>>
>>>
>>>
>


Re: spark architecture question -- Pleas Read

2017-01-29 Thread Jörn Franke
One alternative could be the Oracle Hadoop loader and other Oracle products, 
but you have to invest some money and probably buy their Hadoop Appliance, 
which you have to evaluate to see whether it makes sense (it can get expensive with 
large clusters etc).

Another alternative would be to get rid of Oracle altogether and use other 
databases.

However, can you elaborate a little on your use case and the business logic, 
as well as the SLA requirements? Otherwise all recommendations are right, because the 
requirements you presented are very generic.

About getting rid of Hadoop - this depends! You will need some resource manager 
(YARN, Mesos, Kubernetes etc) and most likely also a distributed file system. 
Spark supports a wide range of file systems through the Hadoop APIs, but does 
not need HDFS for persistence. You can have a local filesystem (i.e. any file 
system mounted to a node, so also distributed ones such as ZFS), or cloud file 
systems (S3, Azure Blob etc).
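To make the "no HDFS required" point concrete, a minimal sketch of reading from and writing to S3 through the s3a connector instead of HDFS (the bucket and credentials are hypothetical, and the hadoop-aws module must be on the classpath):

// Hypothetical credentials for the s3a filesystem connector.
spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key", "AKIA...")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", "...")

// Read the source files directly from S3: no HDFS involved.
val events = spark.read.option("header", "true").csv("s3a://my-bucket/staging/events/")

// Persist the intermediate results back to S3 as Parquet.
events.write.mode("overwrite").parquet("s3a://my-bucket/enriched/events/")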



> On 29 Jan 2017, at 11:18, Alex  wrote:
> 
> Hi All,
> 
> Thanks for your response .. Please find below flow diagram
> 
> Please help me out simplifying this architecture using Spark
> 
> 1) Can i skip step 1 to step 4 and directly store it in spark
> if I am storing it in spark where actually it is getting stored
> Do i need to retain HAdoop to store data
> or can i directly store it in spark and remove hadoop also?
> 
> I want to remove informatica for preprocessing and directly load the files 
> data coming from server to Hadoop/Spark
> 
> So My Question is Can i directly load files data to spark ? Then where 
> exactly the data will get stored.. Do I need to have Spark installed on Top 
> of HDFS?
> 
> 2) if I am retaining below architecture Can I store back output from spark 
> directly to oracle from step 5 to step 7 
> 
> and will spark way of storing it back to oracle will be better than using 
> sqoop performance wise
> 3)Can I use SPark scala UDF to process data from hive and retain entire 
> architecture 
> 
> which among the above would be optimal
> 
> 
> 
>> On Sat, Jan 28, 2017 at 10:38 PM, Sachin Naik  
>> wrote:
>> I strongly agree with Jorn and Russell. There are different solutions for 
>> data movement depending upon your needs frequency, bi-directional drivers. 
>> workflow, handling duplicate records. This is a space is known as " Change 
>> Data Capture - CDC" for short. If you need more information, I would be 
>> happy to chat with you.  I built some products in this space that 
>> extensively used connection pooling over ODBC/JDBC. 
>> 
>> Happy to chat if you need more information. 
>> 
>> -Sachin Naik
>> 
>> >>Hard to tell. Can you give more insights >>on what you try to achieve and 
>> >>what the data is about?
>> >>For example, depending on your use case sqoop can make sense or not.
>> Sent from my iPhone
>> 
>>> On Jan 27, 2017, at 11:22 PM, Russell Spitzer  
>>> wrote:
>>> 
>>> You can treat Oracle as a JDBC source 
>>> (http://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases)
>>>  and skip Sqoop, HiveTables and go straight to Queries. Then you can skip 
>>> hive on the way back out (see the same link) and write directly to Oracle. 
>>> I'll leave the performance questions for someone else. 
>>> 
 On Fri, Jan 27, 2017 at 11:06 PM Sirisha Cheruvu  
 wrote:
 
 On Sat, Jan 28, 2017 at 6:44 AM, Sirisha Cheruvu  
 wrote:
 Hi Team,
 
 RIght now our existing flow is
 
 Oracle-->Sqoop --> Hive--> Hive Queries on Spark-sql (Hive 
 Context)-->Destination Hive table -->sqoop export to Oracle
 
 Half of the Hive UDFS required is developed in Java UDF..
 
 SO Now I want to know if I run the native scala UDF's than runninng hive 
 java udfs in spark-sql will there be any performance difference
 
 
 Can we skip the Sqoop Import and export part and 
 
 Instead directly load data from oracle to spark and code Scala UDF's for 
 transformations and export output data back to oracle?
 
 RIght now the architecture we are using is
 
 oracle-->Sqoop (Import)-->Hive Tables--> Hive Queries --> Spark-SQL--> 
 Hive --> Oracle 
 what would be optimal architecture to process data from oracle using spark 
 ?? can i anyway better this process ?
 
 
 
 
 Regards,
 Sirisha 
 
> 


Re: spark architecture question -- Pleas Read

2017-01-29 Thread Mich Talebzadeh
This is classic; nothing special about it.


   1. Your source is Oracle schema tables.
   2. You can use an Oracle JDBC connection with DIRECT CONNECT and parallel
   processing to read your data from the Oracle table into Spark FP using JDBC.
   Ensure that you are getting data from the Oracle DB at a time when the DB is
   not busy and the network between Spark and Oracle is reasonable. You will
   be creating multiple connections to your Oracle database from Spark.
   3. Create a DF from the RDD and ingest your data into Hive staging tables.
   This should be pretty fast. If you are using a recent version of Spark (>
   1.5) you can see this in the Spark GUI.
   4. Once data is ingested into the Hive table (frequency Discrete, Recurring
   or Cumulative), you have your source data in Hive.
   5. Do your work in the Hive staging tables; your enriched data will then
   go into Hive enriched tables (different from your staging tables). You can
   use Spark to enrich (transform) your data on the Hive staging tables.
   6. Then use Spark to send that data into the Oracle table. Again, bear in
   mind that the application has to handle consistency from Big Data into the
   RDBMS, for example what you are going to do with failed transactions in
   Oracle.
   7. From my experience you also need some staging tables in Oracle to
   handle inserts from Hive via Spark into the Oracle table.
   8. Finally, run a PL/SQL job to load the Oracle target tables from the Oracle
   staging tables.
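A rough sketch of steps 2 and 3 (connection details, table names and the split column are hypothetical; written against the Spark 2.x API with Hive support enabled):

import java.util.Properties

val url = "jdbc:oracle:thin:@//dbhost:1521/ORCLPDB"   // hypothetical
val props = new Properties()
props.setProperty("user", "scott")
props.setProperty("password", "tiger")
props.setProperty("driver", "oracle.jdbc.OracleDriver")

// Step 2: parallel JDBC read, split into 8 concurrent partitions on a numeric key.
val src = spark.read.jdbc(url, "SCOTT.SALES", "SALE_ID", 1L, 50000000L, 8, props)

// Step 3: land the data as-is in a Hive staging table.
src.write.mode("overwrite").saveAsTable("staging.sales_stg")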

Notes:

Oracle column types are not 100% compatible with Spark. For example, Spark does
not recognize the CHAR column type, which has to be converted to VARCHAR or
STRING.
Hive does not have the Oracle "WITH clause" inline-table concept, so a
script that works in Oracle may not work in Hive. Windowing functions
should be fine.
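For the CHAR point, one way to handle it on the Spark side, continuing from the sketch above (column names are hypothetical):

import org.apache.spark.sql.functions.{col, trim}

// CHAR columns come back as blank-padded strings over JDBC; trim them
// before landing the data in Hive.
val cleaned = src
  .withColumn("CUST_CODE", trim(col("CUST_CODE")))
  .withColumn("REGION", trim(col("REGION")))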

I tend to drive all of this via a shell script, which gives control at each layer and
raises alarms.

HTH





Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 29 January 2017 at 10:18, Alex  wrote:

> Hi All,
>
> Thanks for your response .. Please find below flow diagram
>
> Please help me out simplifying this architecture using Spark
>
> 1) Can i skip step 1 to step 4 and directly store it in spark
> if I am storing it in spark where actually it is getting stored
> Do i need to retain HAdoop to store data
> or can i directly store it in spark and remove hadoop also?
>
> I want to remove informatica for preprocessing and directly load the files
> data coming from server to Hadoop/Spark
>
> So My Question is Can i directly load files data to spark ? Then where
> exactly the data will get stored.. Do I need to have Spark installed on Top
> of HDFS?
>
> 2) if I am retaining below architecture Can I store back output from spark
> directly to oracle from step 5 to step 7
>
> and will spark way of storing it back to oracle will be better than using
> sqoop performance wise
> 3)Can I use SPark scala UDF to process data from hive and retain entire
> architecture
>
> which among the above would be optimal
>
> [image: Inline image 1]
>
> On Sat, Jan 28, 2017 at 10:38 PM, Sachin Naik 
> wrote:
>
>> I strongly agree with Jorn and Russell. There are different solutions for
>> data movement depending upon your needs frequency, bi-directional drivers.
>> workflow, handling duplicate records. This is a space is known as " Change
>> Data Capture - CDC" for short. If you need more information, I would be
>> happy to chat with you.  I built some products in this space that
>> extensively used connection pooling over ODBC/JDBC.
>>
>> Happy to chat if you need more information.
>>
>> -Sachin Naik
>>
>> >>Hard to tell. Can you give more insights >>on what you try to achieve
>> and what the data is about?
>> >>For example, depending on your use case sqoop can make sense or not.
>> Sent from my iPhone
>>
>> On Jan 27, 2017, at 11:22 PM, Russell Spitzer 
>> wrote:
>>
>> You can treat Oracle as a JDBC source (http://spark.apache.org/docs/
>> latest/sql-programming-guide.html#jdbc-to-other-databases) and skip
>> Sqoop, HiveTables and go straight to Queries. Then you can skip hive on the
>> way back out (see the same link) and write directly to Oracle. I'll leave
>> the performance questions for someone else.
>>
>> On Fri, Jan 27, 2017 at 11:06 PM Sirisha Cheruvu 
>> wrote:
>>
>>>
>>> On Sat, Jan 28, 2017 at 6:44 AM, Sirisha Cheruvu 
>>> wrote:
>>>
>>> Hi 

Re: spark architecture question -- Pleas Read

2017-01-29 Thread Alex
Hi All,

Thanks for your responses. Please find the flow diagram below.

Please help me simplify this architecture using Spark.

1) Can I skip steps 1 to 4 and directly store the data in Spark?
If I am storing it in Spark, where is it actually getting stored?
Do I need to retain Hadoop to store the data,
or can I store it directly in Spark and remove Hadoop as well?

I want to remove Informatica for preprocessing and directly load the file
data coming from the server into Hadoop/Spark.

So my question is: can I load file data directly into Spark? Then where
exactly will the data get stored? Do I need to have Spark installed on top
of HDFS?

2) If I retain the architecture below, can I store the output from Spark
directly back to Oracle, going from step 5 to step 7?

And will Spark's way of storing it back to Oracle be better than using
Sqoop, performance-wise?
3) Can I use Spark Scala UDFs to process data from Hive and retain the entire
architecture?

Which of the above would be optimal?

[image: Inline image 1]
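
As a minimal sketch of question 1 (assuming Spark 2.x, source files landing
on HDFS, and placeholder paths and column names), loading the files directly
into Spark could look roughly like this; note that Spark itself does not
store data, so the files still need to sit on HDFS, S3, or another
filesystem:

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder()
    .appName("DirectFileLoad")
    .enableHiveSupport()   // only needed if Hive tables stay in the picture
    .getOrCreate()

  // Load the raw server files straight into Spark, skipping the
  // preprocessing step. The landing path is a placeholder.
  val raw = spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("hdfs:///landing/server_files/*.csv")

  // Placeholder transformation ("amount" is an assumed column).
  val enriched = raw.filter("amount > 0")

  // Persist to a Hive table only if downstream steps still need Hive.
  enriched.write.mode("overwrite").saveAsTable("staging.enriched")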



Re: spark architecture question -- Please Read

2017-01-28 Thread Sachin Naik
I strongly agree with Jörn and Russell. There are different solutions for data
movement depending upon your needs: frequency, bi-directional drivers, workflow,
handling duplicate records. This space is known as "Change Data Capture" (CDC
for short). If you need more information, I would be happy to chat with you.
I built some products in this space that extensively used connection pooling
over ODBC/JDBC.

Happy to chat if you need more information. 

-Sachin Naik

>> Hard to tell. Can you give more insights on what you try to achieve and
>> what the data is about?
>> For example, depending on your use case sqoop can make sense or not.
Sent from my iPhone



Re: spark architecture question -- Please Read

2017-01-28 Thread Jörn Franke
Hard to tell. Can you give more insight into what you are trying to achieve and
what the data is about?
For example, depending on your use case, Sqoop may or may not make sense.





Re: spark architecture question -- Please Read

2017-01-27 Thread Russell Spitzer
You can treat Oracle as a JDBC source
(http://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases)
and skip Sqoop and the Hive tables and go straight to queries. Then you can skip
Hive on the way back out (see the same link) and write directly to Oracle.
I'll leave the performance questions for someone else.
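
As a minimal sketch of this approach (assuming Spark 2.x, the Oracle JDBC
driver jar on the classpath, and placeholder connection details, table names,
and partitioning column), it could look roughly like this:

  import java.util.Properties
  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().appName("OracleViaJdbc").getOrCreate()

  // Placeholder connection details; the ojdbc jar must be on the classpath.
  val jdbcUrl = "jdbc:oracle:thin:@//dbhost:1521/ORCL"
  val props = new Properties()
  props.setProperty("user", "app_user")
  props.setProperty("password", "app_password")
  props.setProperty("driver", "oracle.jdbc.OracleDriver")

  // Parallel read: Spark issues one range query per partition on SALE_ID
  // (placeholder table, column, and bounds).
  val sales = spark.read.jdbc(jdbcUrl, "SALES", "SALE_ID", 1L, 10000000L, 8, props)

  // Query directly; no Sqoop import or Hive staging table needed.
  val summary = sales.groupBy("REGION").count()

  // Skip Hive/Sqoop on the way back out: write the result straight to Oracle.
  summary.write.mode("append").jdbc(jdbcUrl, "SALES_SUMMARY", props)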



Re: spark architecture question -- Please Read

2017-01-27 Thread Sirisha Cheruvu


spark architecture question -- Please Read

2017-01-27 Thread Sirisha Cheruvu
Hi Team,

Right now our existing flow is:

Oracle --> Sqoop --> Hive --> Hive queries on Spark SQL (HiveContext) -->
destination Hive table --> Sqoop export to Oracle

Half of the required Hive UDFs are developed as Java UDFs.

So now I want to know: if I run native Scala UDFs instead of running the
Hive Java UDFs in Spark SQL, will there be any performance difference?


Can we skip the Sqoop import and export steps and instead load data directly
from Oracle into Spark, code Scala UDFs for the transformations, and export
the output data back to Oracle?

Right now the architecture we are using is:

Oracle --> Sqoop (import) --> Hive tables --> Hive queries --> Spark SQL -->
Hive --> Oracle
What would be the optimal architecture for processing data from Oracle using
Spark? Can I improve this process in any way?




Regards,
Sirisha
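
As a minimal sketch of the two UDF options being compared (assuming Spark 2.x
with Hive support enabled, and placeholder jar paths, class names, column
names, and table names):

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder()
    .appName("UdfComparison")
    .enableHiveSupport()
    .getOrCreate()

  // Option A: keep the existing Hive Java UDF and call it from Spark SQL.
  // The jar path and class name are placeholders.
  spark.sql("ADD JAR hdfs:///udfs/hive-udfs.jar")
  spark.sql("CREATE TEMPORARY FUNCTION clean_str AS 'com.example.udf.CleanString'")
  val viaHiveUdf = spark.sql("SELECT clean_str(name) AS name FROM staging.customers")

  // Option B: re-code the same logic as a native Scala UDF.
  val cleanStr = (s: String) => if (s == null) null else s.trim.toUpperCase
  spark.udf.register("clean_str_scala", cleanStr)
  val viaScalaUdf = spark.sql("SELECT clean_str_scala(name) AS name FROM staging.customers")

  // Both run as black-box functions from the optimizer's point of view; when a
  // built-in Spark SQL function covers the logic (trim, upper, ...), that is
  // usually the faster choice.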