Re: Spark Architecture Question

2021-07-29 Thread Pasha Finkelshteyn
Hi Renganathan, Not quite. It strongly depends on your usage of UDFs, defined in any manner, whether as a UDF object or just as lambdas. If you have any, they may and will be called on the executors too. On 21/07/29 05:17, Renganathan Mutthiah wrote: > Hi, > > I have read in many materials (including from the
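
A minimal Scala sketch of that point, assuming the SparkSession-based API (names here are illustrative, not from the thread): whether the UDF is built with udf() or passed as a bare lambda, its body is serialized and runs on the executors for each row, while only the plan construction happens on the driver.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.udf

    val spark = SparkSession.builder().appName("udf-on-executors").getOrCreate()
    import spark.implicits._

    // This lambda is shipped to and executed on the executors, not the driver.
    val toUpper = udf((s: String) => if (s == null) null else s.toUpperCase)

    val df = Seq("alpha", "beta").toDF("word")
    df.select(toUpper($"word").as("upper")).show()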

Spark Architecture Question

2021-07-29 Thread Renganathan Mutthiah
Hi, I have read in many materials (including the book Spark: The Definitive Guide) that Spark is a compiler. In my understanding, our program is used only up to the point of DAG generation. This portion can be written in any language - Java, Scala, R, Python. After that (executing the DAG), the
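
A small Scala sketch of that split, under the usual lazy-evaluation model (illustrative only): the transformations merely build the logical plan (the DAG) in whatever driver language you use; Catalyst compiles it, and nothing runs on the executors until an action is called.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("lazy-dag").getOrCreate()
    import spark.implicits._

    val df = Seq(1, 2, 3, 4).toDF("n")
    val plan = df.filter($"n" % 2 === 0).selectExpr("n * 10 AS tenfold") // lazy: only builds the DAG

    plan.explain(true)      // prints the logical and physical plans Catalyst generated
    val rows = plan.count() // the action: this is what actually executes on the cluster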

Re: spark architecture question -- Please Read

2017-02-07 Thread Alex
Hi All, so will there be any performance difference if we recode the entire logic into Spark SQL, instead of running the Hive native Java UDFs in spark-shell using a Hive context? Or does Spark anyway convert the Hive Java UDF into Spark SQL code, so we don't need to rewrite the entire logic in Spark SQL?
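
For context, a hedged Scala sketch of the two routes being compared (the jar path, class name, and table are hypothetical): an existing Hive Java UDF can be registered and called from Spark SQL as-is, whereas the rewrite would use a built-in function or native Spark function that Catalyst can optimize directly.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("hive-udf-vs-native")
      .enableHiveSupport()
      .getOrCreate()

    // Route 1: reuse the existing Hive Java UDF unchanged.
    spark.sql("ADD JAR /tmp/my-hive-udfs.jar")
    spark.sql("CREATE TEMPORARY FUNCTION my_upper AS 'com.example.hive.MyUpperUDF'")
    val viaHiveUdf = spark.sql("SELECT my_upper(name) FROM people")

    // Route 2: rewrite with a built-in Spark SQL function (no UDF wrapper at all).
    val viaBuiltin = spark.sql("SELECT upper(name) FROM people")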

Re: spark architecture question -- Please Read

2017-02-05 Thread kuassi mensah
Apologies in advance for injecting an Oracle product into this discussion, but I thought it might help address the requirements (as far as I understood them). We are looking into furnishing a new connector for Spark, similar to the Oracle Datasource for Hadoop,

Re: spark architecture question -- Please Read

2017-02-05 Thread Mich Talebzadeh
Agreed. The best option is to ingest into dedicated ingestion (staging) tables in Oracle. Many people ingest into the main Oracle table, which is a wrong design in my opinion.

Re: spark architecture question -- Please Read

2017-02-05 Thread Jörn Franke
You should see an exception, and your job fails by default after (I think) 4 attempts. If you see an exception, you may want to clean the staging table used for loading and reload again. > On 4 Feb 2017, at 09:06, Mich Talebzadeh wrote: > > Ingesting from Hive tables back
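
A hedged Scala sketch of that clean-and-reload step (JDBC URL, credentials, and table names are hypothetical): write to an Oracle staging table and, if the write throws, truncate the staging table so the next attempt starts from a known state.

    import java.util.Properties
    import scala.util.control.NonFatal
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("staging-reload").getOrCreate()
    import spark.implicits._

    val url = "jdbc:oracle:thin:@//dbhost:1521/ORCLPDB1"
    val props = new Properties()
    props.setProperty("user", "scott")
    props.setProperty("password", "tiger")
    props.setProperty("driver", "oracle.jdbc.OracleDriver")

    val df = Seq((1, "a"), (2, "b")).toDF("id", "val")

    try {
      df.write.mode("append").jdbc(url, "STAGING_SALES", props)
    } catch {
      case NonFatal(e) =>
        // Clean the staging table used for loading so the job can be rerun.
        val conn = java.sql.DriverManager.getConnection(url, props)
        try conn.createStatement().execute("TRUNCATE TABLE STAGING_SALES")
        finally conn.close()
        throw e
    }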

Re: spark architecture question -- Please Read

2017-02-04 Thread Mich Talebzadeh
Ingesting from Hive tables back into Oracle: what mechanisms are in place to ensure that the data ends up consistently in the Oracle table, and that Spark is notified when Oracle has issues with the ingested data (say, a rollback)?

Re: spark architecture question -- Please Read

2017-01-29 Thread Mich Talebzadeh
Sorry, a correction: 1. Put the csv files into HDFS /apps//data/staging/ 2. Multiple csv files for the same table can co-exist 3. like val df1 = spark.read.option("header", false).csv(location) 4. once the csv file is read into a df you can do loads of things. The csv files have to
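
A short Scala sketch of steps 1-4 (the HDFS path below is a placeholder for the elided one above): point the reader at the staging directory and every csv file for that table is loaded into a single DataFrame.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("csv-staging").getOrCreate()

    // Placeholder path; multiple csv files for the same table can sit in this directory.
    val location = "hdfs:///apps/myapp/data/staging/sales/"

    val df1 = spark.read
      .option("header", "false")
      .option("inferSchema", "true")
      .csv(location)

    df1.printSchema()
    df1.show(5)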

Re: spark architecture question -- Please Read

2017-01-29 Thread Mich Talebzadeh
You can use Spark directly on the csv files. 1. Put the csv files into HDFS /apps//data/staging/ 2. Multiple csv files for the same table can co-exist 3. like df1 = spark.read.option("header", false).csv(location) 4.

Re: spark architecture question -- Please Read

2017-01-29 Thread Alex
But for persistence after intermediate processing, can I use the Spark cluster itself, or do I have to use a Hadoop cluster? On Jan 29, 2017 7:36 PM, "Deepak Sharma" wrote: The better way is to read the data directly into Spark using the Spark SQL JDBC read. Apply the UDFs locally.

Re: spark architecture question -- Please Read

2017-01-29 Thread Jörn Franke
I meant with a distributed file system such as Ceph, Gluster, etc. > On 29 Jan 2017, at 14:45, Jörn Franke wrote: > > One alternative could be the Oracle Hadoop loader and other Oracle products, > but you have to invest some money and probably buy their Hadoop Appliance,

Re: spark architecture question -- Please Read

2017-01-29 Thread Deepak Sharma
The better way is to read the data directly into Spark using the Spark SQL JDBC read. Apply the UDFs locally. Then save the DataFrame back to Oracle using the DataFrame's JDBC write. Thanks Deepak On Jan 29, 2017 7:15 PM, "Jörn Franke" wrote: > One alternative could be the
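
A hedged Scala sketch of that flow (connection details, tables, and the UDF are all hypothetical): read from Oracle over JDBC, apply the logic as a Spark UDF, and write the resulting DataFrame back to Oracle over JDBC.

    import java.util.Properties
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.udf

    val spark = SparkSession.builder().appName("oracle-roundtrip").getOrCreate()

    val url = "jdbc:oracle:thin:@//dbhost:1521/ORCLPDB1"
    val props = new Properties()
    props.setProperty("user", "scott")
    props.setProperty("password", "tiger")
    props.setProperty("driver", "oracle.jdbc.OracleDriver")

    // Read the source table directly from Oracle.
    val src = spark.read.jdbc(url, "SALES", props)

    // Apply the transformation logic as a Spark UDF.
    val normalize = udf((s: String) => if (s == null) null else s.trim.toUpperCase)
    val out = src.withColumn("REGION", normalize(src("REGION")))

    // Write the result back to Oracle.
    out.write.mode("append").jdbc(url, "SALES_PROCESSED", props)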

Re: spark architecture question -- Please Read

2017-01-29 Thread Jörn Franke
One alternative could be the Oracle Hadoop loader and other Oracle products, but you have to invest some money and probably buy their Hadoop Appliance, and you have to evaluate whether that makes sense (it can get expensive with large clusters etc). Another alternative would be to get rid of Oracle

Re: spark architecture question -- Please Read

2017-01-29 Thread Mich Talebzadeh
This is classic, nothing special about it. 1. Your source is Oracle schema tables 2. You can use an Oracle JDBC connection with DIRECT CONNECT and parallel processing to read your data from the Oracle table into Spark FP using JDBC. Ensure that you are getting data from the Oracle DB at a time
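
A hedged Scala sketch of such a parallel JDBC read (URL, table, and partitioning column are hypothetical): Spark issues numPartitions concurrent queries against Oracle, splitting the numeric partitionColumn between lowerBound and upperBound.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("parallel-jdbc-read").getOrCreate()

    val src = spark.read
      .format("jdbc")
      .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCLPDB1")
      .option("dbtable", "SALES")
      .option("user", "scott")
      .option("password", "tiger")
      .option("driver", "oracle.jdbc.OracleDriver")
      .option("partitionColumn", "SALE_ID") // numeric column to split on
      .option("lowerBound", "1")
      .option("upperBound", "10000000")
      .option("numPartitions", "8")         // 8 parallel reads against Oracle
      .load()

    println(src.rdd.getNumPartitions)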

Re: spark architecture question -- Please Read

2017-01-29 Thread Alex
Hi All, thanks for your response. Please find the flow diagram below. Please help me simplify this architecture using Spark. 1) Can I skip steps 1 to 4 and directly store it in Spark? If I am storing it in Spark, where is it actually getting stored? Do I need to retain Hadoop to store data

Re: spark architecture question -- Please Read

2017-01-28 Thread Sachin Naik
I strongly agree with Jorn and Russell. There are different solutions for data movement depending upon your needs: frequency, bi-directional drivers, workflow, handling duplicate records. This space is known as "Change Data Capture", CDC for short. If you need more information, I would be

Re: spark architecture question -- Please Read

2017-01-28 Thread Jörn Franke
Hard to tell. Can you give more insight into what you are trying to achieve and what the data is about? For example, depending on your use case, Sqoop may or may not make sense. > On 28 Jan 2017, at 02:14, Sirisha Cheruvu wrote: > > Hi Team, > > Right now our existing flow is > >

Re: spark architecture question -- Please Read

2017-01-27 Thread Russell Spitzer
You can treat Oracle as a JDBC source ( http://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases) and skip Sqoop, HiveTables and go straight to Queries. Then you can skip hive on the way back out (see the same link) and write directly to Oracle. I'll leave the

Re: spark architecture question -- Please Read

2017-01-27 Thread Sirisha Cheruvu
On Sat, Jan 28, 2017 at 6:44 AM, Sirisha Cheruvu wrote: > Hi Team, > > Right now our existing flow is > > Oracle-->Sqoop --> Hive--> Hive Queries on Spark-sql (Hive > Context)-->Destination Hive table -->sqoop export to Oracle > > Half of the Hive UDFs required is developed

spark architecture question -- Please Read

2017-01-27 Thread Sirisha Cheruvu
Hi Team, right now our existing flow is Oracle --> Sqoop --> Hive --> Hive queries on Spark SQL (Hive Context) --> destination Hive table --> Sqoop export to Oracle. Half of the Hive UDFs required are developed as Java UDFs. So now I want to know whether running the native Scala UDFs instead of running the Hive Java
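
For reference, a hedged Scala sketch of the "native Scala UDF" route the question asks about (function name, column, and table are hypothetical; on Spark 1.x the registration would go through hiveContext.udf.register instead): register the Scala function once and call it from the same SQL that currently calls the Hive Java UDF.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("native-scala-udf")
      .enableHiveSupport()
      .getOrCreate()

    // Register a Scala function under a SQL-callable name.
    spark.udf.register("clean_name", (s: String) => if (s == null) null else s.trim.toLowerCase)

    // Call it exactly where the Hive Java UDF used to be called.
    spark.sql("SELECT clean_name(customer_name) FROM staging_customers").show(5)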