Hi All,
Will there be any performance difference if, instead of running Hive Java
native UDFs in spark-shell using a Hive context, we recode the entire logic
in Spark SQL?
Or does Spark convert Hive Java UDFs to Spark SQL code anyway, so that we
don't need to rewrite the entire logic in Spark SQL?
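As far as I know, Spark does not rewrite a Hive Java UDF into Spark SQL code: with Hive support enabled it executes the UDF through a wrapper, and the function stays a black box to the Catalyst optimizer, much like a native Scala UDF does. Only built-in Spark SQL functions are transparent to the optimizer. A minimal spark-shell sketch; the class name, jar path and the customers table are made-up placeholders:

// Reuse the existing Hive Java UDF as-is (requires Hive support)
spark.sql(
  """CREATE TEMPORARY FUNCTION my_upper AS 'com.example.udf.MyUpper'
    |USING JAR 'hdfs:///apps/jars/my-udfs.jar'""".stripMargin)
spark.sql("SELECT my_upper(name) FROM customers").show()

// A native Scala UDF: also row-at-a-time and opaque to the optimizer,
// so rewriting buys little unless you can move to built-in functions
spark.udf.register("my_upper_scala",
  (s: String) => if (s == null) null else s.toUpperCase)
spark.sql("SELECT my_upper_scala(name) FROM customers").show()

// A built-in function, which Catalyst can optimize
spark.sql("SELECT upper(name) FROM customers").show()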
Apologies in advance for injecting an Oracle product into this discussion,
but I thought it might help address the requirements (as far as I understood
them).
We are looking into furnishing a new connector for Spark, similar to the
Oracle Datasource for Hadoop,
Agreed.
The best option is to ingest into staging tables in Oracle. Many people
ingest into the main Oracle table directly, which is a wrong design in my
opinion.
Dr Mich Talebzadeh
LinkedIn *
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
You should see an exception, and your job fails by default after (I think) 4
attempts. If you see an exception you may want to clean the staging table used
for loading and reload again.
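A rough sketch of that clean-and-reload step, assuming df is the DataFrame being loaded and using made-up connection details and a made-up staging table STG_SALES:

import java.sql.DriverManager
import scala.util.{Failure, Success, Try}

val url = "jdbc:oracle:thin:@//dbhost:1521/ORCLPDB1"  // placeholder
val props = new java.util.Properties()
props.setProperty("user", "scott")
props.setProperty("password", "tiger")

Try(df.write.mode("append").jdbc(url, "STG_SALES", props)) match {
  case Success(_) =>
    println("load into staging table succeeded")
  case Failure(e) =>
    // the write failed part-way: clear the staging table, then retry the load
    println(s"load failed: ${e.getMessage}; truncating staging table")
    val conn = DriverManager.getConnection(url, "scott", "tiger")
    try conn.createStatement().execute("TRUNCATE TABLE STG_SALES")
    finally conn.close()
}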
> On 4 Feb 2017, at 09:06, Mich Talebzadeh wrote:
>
> Ingesting from Hive tables back
Ingesting from Hive tables back into Oracle: what mechanisms are in place
to ensure that data ends up consistently in the Oracle table, and that Spark
is notified when Oracle has issues with the ingested data (say, a rollback)?
Dr Mich Talebzadeh
Sorry, a mistake. Corrected:
1. Put the csv files into HDFS /apps//data/staging/
2. Multiple csv files for the same table can co-exist
3. like val df1 = spark.read.option("header", false).csv(location)
4. Once the csv file is read into a df you can do loads of things (a sketch
follows below). The csv files have to
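A minimal sketch of steps 1-4 in spark-shell; the staging location, column names and output path are placeholders:

import org.apache.spark.sql.functions._

// Step 3: read every csv file in the staging directory into one DataFrame
val location = "hdfs:///apps/data/staging/mytable/"
val df1 = spark.read
  .option("header", false)
  .option("inferSchema", true)
  .csv(location)

// Step 4: name the columns, then filter, aggregate or persist as needed
val named = df1.toDF("id", "name", "amount")
named.filter(col("amount") > 0)
  .write.mode("overwrite")
  .parquet("hdfs:///apps/data/processed/mytable/")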
You can use Spark directly on the csv files.
1. Put the csv files into HDFS /apps//data/staging/
2. Multiple csv files for the same table can co-exist
3. like df1 = spark.read.option("header", false).csv(location)
4.
Dr Mich Talebzadeh
But for persistence after intermediate processing, can I use the Spark
cluster itself, or do I have to use a Hadoop cluster?
On Jan 29, 2017 7:36 PM, "Deepak Sharma" wrote:
The better way is to read the data directly into Spark using Spark SQL's
JDBC read.
Apply the UDFs locally.
I meant with a distributed file system such as Ceph, Gluster, etc...
> On 29 Jan 2017, at 14:45, Jörn Franke wrote:
>
> One alternative could be the Oracle Hadoop Loader and other Oracle products,
> but you have to invest some money and probably buy their Hadoop Appliance,
The better way is to read the data directly into Spark using Spark SQL's
JDBC read.
Apply the UDFs locally.
Then save the DataFrame back to Oracle using the DataFrame's JDBC write.
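A rough sketch of those three steps; the connection details, table names and the cleanup UDF are made up:

import org.apache.spark.sql.functions.{col, udf}

val url = "jdbc:oracle:thin:@//dbhost:1521/ORCLPDB1"  // placeholder
val props = new java.util.Properties()
props.setProperty("user", "scott")
props.setProperty("password", "tiger")
props.setProperty("driver", "oracle.jdbc.OracleDriver")

// 1. read the source table straight from Oracle over JDBC
val src = spark.read.jdbc(url, "SALES", props)

// 2. apply the UDF locally in Spark
val cleanAmount = udf((x: java.math.BigDecimal) =>
  if (x == null) java.math.BigDecimal.ZERO else x.abs)
val transformed = src.withColumn("AMOUNT", cleanAmount(col("AMOUNT")))

// 3. write the result back to Oracle over JDBC
transformed.write.mode("append").jdbc(url, "SALES_CLEANED", props)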
Thanks
Deepak
On Jan 29, 2017 7:15 PM, "Jörn Franke" wrote:
> One alternative could be the Oracle Hadoop Loader and other Oracle products, ...
One alternative could be the Oracle Hadoop Loader and other Oracle products,
but you have to invest some money and probably buy their Hadoop Appliance,
which you have to evaluate to see whether it makes sense (it can get expensive
with large clusters etc.).
Another alternative would be to get rid of Oracle
This is classic, nothing special about it.
1. Your source is Oracle schema tables.
2. You can use an Oracle JDBC connection with DIRECT CONNECT and parallel
processing to read your data from the Oracle table into Spark using JDBC (see
the sketch below). Ensure that you are getting data from the Oracle DB at a
time ...
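A minimal sketch of such a parallel JDBC read in spark-shell; the table, partition column and bounds are placeholders:

// Spark opens numPartitions connections, each scanning one slice of the
// partitionColumn range, so the Oracle table is read in parallel
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCLPDB1")
  .option("dbtable", "SCHEMA1.ORDERS")
  .option("user", "scott")
  .option("password", "tiger")
  .option("partitionColumn", "ORDER_ID")  // a numeric key column works best
  .option("lowerBound", "1")
  .option("upperBound", "10000000")
  .option("numPartitions", "8")
  .load()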
Hi All,
Thanks for your response. Please find the flow diagram below.
Please help me simplify this architecture using Spark.
1) Can I skip step 1 to step 4 and directly store the data in Spark?
2) If I am storing it in Spark, where is it actually getting stored?
3) Do I need to retain Hadoop to store the data?
I strongly agree with Jorn and Russell. There are different solutions for data
movement depending upon your needs: frequency, bi-directional drivers, workflow,
handling duplicate records. This space is known as "Change Data Capture", or
CDC for short. If you need more information, I would be
Hard to tell. Can you give more insights on what you are trying to achieve and
what the data is about?
For example, depending on your use case sqoop can make sense or not.
> On 28 Jan 2017, at 02:14, Sirisha Cheruvu wrote:
>
> Hi Team,
>
> Right now our existing flow is
>
>
You can treat Oracle as a JDBC source (
http://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases)
and skip Sqoop and Hive tables, going straight to queries. Then you can skip
Hive on the way back out (see the same link) and write directly to Oracle (a
minimal sketch follows below). I'll leave the
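A minimal sketch of that shortened pipeline; the connection details, table names and query are made up:

val url = "jdbc:oracle:thin:@//dbhost:1521/ORCLPDB1"  // placeholder
val props = new java.util.Properties()
props.setProperty("user", "scott")
props.setProperty("password", "tiger")

// read straight from Oracle: no Sqoop or Hive hop
val orders = spark.read.jdbc(url, "SCHEMA1.ORDERS", props)
orders.createOrReplaceTempView("orders")

// the queries that used to run against Hive tables run unchanged
val totals = spark.sql(
  "SELECT customer_id, sum(amount) AS total FROM orders GROUP BY customer_id")

// and the result goes straight back to Oracle
totals.write.mode("overwrite").jdbc(url, "SCHEMA1.ORDER_TOTALS", props)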
On Sat, Jan 28, 2017 at 6:44 AM, Sirisha Cheruvu wrote:
> Hi Team,
>
> Right now our existing flow is
>
> Oracle --> Sqoop --> Hive --> Hive queries on Spark SQL (Hive Context) -->
> destination Hive table --> Sqoop export to Oracle
>
> Half of the Hive UDFS required is developed
Hi Team,
Right now our existing flow is:
Oracle --> Sqoop --> Hive --> Hive queries on Spark SQL (Hive Context) -->
destination Hive table --> Sqoop export to Oracle
Half of the Hive UDFs required are developed as Java UDFs.
So now I want to know whether, if I run native Scala UDFs rather than running
Hive Java