You should see an exception, and by default the job fails after (I think) 4 attempts. If you see an exception you may want to clean the staging table used for loading and reload again.
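As a rough sketch of that staging-table pattern (the staging table name, JDBC URL and credentials below are placeholders, not from this thread), the write into Oracle could look something like this in Scala:

    import java.util.Properties
    import org.apache.spark.sql.{DataFrame, SaveMode}

    def loadViaStaging(df: DataFrame, url: String, props: Properties): Unit = {
      // Overwrite clears the staging table first, so a failed, partially
      // loaded attempt can be cleaned up simply by re-running the job.
      df.write
        .mode(SaveMode.Overwrite)
        .jdbc(url, "STG_RESULTS", props)   // hypothetical staging table
      // A MERGE/INSERT from STG_RESULTS into the final table would then run
      // inside Oracle (e.g. via a plain JDBC statement), so consistency and
      // rollback are handled on the Oracle side.
    }

The final MERGE step executing inside Oracle is where the transactional guarantees Mich asks about below would come from; Spark itself only sees success or failure of the staging load.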
> On 4 Feb 2017, at 09:06, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>
> Ingesting from Hive tables back into Oracle. What mechanisms are in place to
> ensure that data ends up consistently in the Oracle table, and that Spark is
> notified when Oracle has issues with the ingested data (say a rollback)?
>
> Dr Mich Talebzadeh
>
> LinkedIn
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
>
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss,
> damage or destruction of data or any other property which may arise from
> relying on this email's technical content is explicitly disclaimed. The
> author will in no case be liable for any monetary damages arising from such
> loss, damage or destruction.
>
>> On 29 January 2017 at 22:22, Jörn Franke <jornfra...@gmail.com> wrote:
>> You can use HDFS, S3, Azure, glusterfs, Ceph, Ignite (in-memory), etc. A
>> Spark cluster itself does not store anything; it just processes.
>>
>>> On 29 Jan 2017, at 15:37, Alex <siri8...@gmail.com> wrote:
>>>
>>> But for persistence after intermediate processing, can I use the Spark
>>> cluster itself, or do I have to use a Hadoop cluster?!
>>>
>>> On Jan 29, 2017 7:36 PM, "Deepak Sharma" <deepakmc...@gmail.com> wrote:
>>> The better way is to read the data directly into Spark using Spark SQL's
>>> JDBC read, apply the UDFs locally, then save the DataFrame back to Oracle
>>> using the DataFrame's JDBC write.
>>>
>>> Thanks
>>> Deepak
>>>
>>>> On Jan 29, 2017 7:15 PM, "Jörn Franke" <jornfra...@gmail.com> wrote:
>>>> One alternative could be the Oracle Hadoop loader and other Oracle
>>>> products, but you have to invest some money and probably buy their Hadoop
>>>> appliance, which you have to evaluate whether it makes sense (it can get
>>>> expensive with large clusters etc.).
>>>>
>>>> Another alternative would be to get rid of Oracle altogether and use
>>>> other databases.
>>>>
>>>> However, can you elaborate a little bit on your use case and the business
>>>> logic, as well as the SLA requirements? Otherwise all recommendations are
>>>> right, because the requirements you presented are very generic.
>>>>
>>>> About getting rid of Hadoop - this depends! You will need some resource
>>>> manager (YARN, Mesos, Kubernetes etc.) and most likely also a distributed
>>>> file system. Spark supports, through the Hadoop APIs, a wide range of
>>>> file systems, but it does not need HDFS for persistence. You can have a
>>>> local filesystem (i.e. any file system mounted to a node, so also
>>>> distributed ones, such as ZFS) or cloud file systems (S3, Azure Blob etc.).
>>>>
>>>>> On 29 Jan 2017, at 11:18, Alex <siri8...@gmail.com> wrote:
>>>>>
>>>>> Hi All,
>>>>>
>>>>> Thanks for your response. Please find the flow diagram below.
>>>>>
>>>>> Please help me simplify this architecture using Spark.
>>>>>
>>>>> 1) Can I skip steps 1 to 4 and directly store the data in Spark? If I am
>>>>> storing it in Spark, where does it actually get stored? Do I need to
>>>>> retain Hadoop to store data, or can I store it directly in Spark and
>>>>> remove Hadoop as well?
>>>>>
>>>>> I want to remove Informatica for preprocessing and directly load the
>>>>> file data coming from the server into Hadoop/Spark.
>>>>>
>>>>> So my question is: can I directly load file data into Spark? Then where
>>>>> exactly will the data get stored? Do I need to have Spark installed on
>>>>> top of HDFS?
>>>>>
>>>>> 2) If I retain the architecture below, can I store the output from
>>>>> Spark directly back to Oracle (step 5 to step 7)? And will Spark's way
>>>>> of storing it back to Oracle be better than using Sqoop,
>>>>> performance-wise?
>>>>>
>>>>> 3) Can I use Spark Scala UDFs to process data from Hive and retain the
>>>>> entire architecture?
>>>>>
>>>>> Which among the above would be optimal?
>>>>>
>>>>>> On Sat, Jan 28, 2017 at 10:38 PM, Sachin Naik <sachin.u.n...@gmail.com>
>>>>>> wrote:
>>>>>> I strongly agree with Jorn and Russell. There are different solutions
>>>>>> for data movement depending upon your needs: frequency, bi-directional
>>>>>> drivers, workflow, handling duplicate records. This space is known as
>>>>>> "Change Data Capture" - CDC for short. If you need more information, I
>>>>>> would be happy to chat with you. I built some products in this space
>>>>>> that extensively used connection pooling over ODBC/JDBC.
>>>>>>
>>>>>> Happy to chat if you need more information.
>>>>>>
>>>>>> -Sachin Naik
>>>>>>
>>>>>> >> Hard to tell. Can you give more insights on what you try to achieve
>>>>>> >> and what the data is about?
>>>>>> >> For example, depending on your use case Sqoop can make sense or not.
>>>>>> Sent from my iPhone
>>>>>>
>>>>>>> On Jan 27, 2017, at 11:22 PM, Russell Spitzer
>>>>>>> <russell.spit...@gmail.com> wrote:
>>>>>>>
>>>>>>> You can treat Oracle as a JDBC source
>>>>>>> (http://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases)
>>>>>>> and skip Sqoop and Hive tables and go straight to queries. Then you
>>>>>>> can skip Hive on the way back out (see the same link) and write
>>>>>>> directly to Oracle. I'll leave the performance questions for someone
>>>>>>> else.
>>>>>>>
>>>>>>>> On Fri, Jan 27, 2017 at 11:06 PM Sirisha Cheruvu <siri8...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> On Sat, Jan 28, 2017 at 6:44 AM, Sirisha Cheruvu <siri8...@gmail.com>
>>>>>>>> wrote:
>>>>>>>> Hi Team,
>>>>>>>>
>>>>>>>> Right now our existing flow is:
>>>>>>>>
>>>>>>>> Oracle --> Sqoop --> Hive --> Hive queries on Spark SQL (HiveContext)
>>>>>>>> --> destination Hive table --> Sqoop export to Oracle
>>>>>>>>
>>>>>>>> Half of the Hive UDFs required are developed as Java UDFs.
>>>>>>>>
>>>>>>>> So now I want to know: if I run native Scala UDFs rather than Hive
>>>>>>>> Java UDFs in Spark SQL, will there be any performance difference?
>>>>>>>>
>>>>>>>> Can we skip the Sqoop import and export part and instead directly
>>>>>>>> load data from Oracle into Spark, code Scala UDFs for the
>>>>>>>> transformations, and export the output data back to Oracle?
>>>>>>>>
>>>>>>>> Right now the architecture we are using is:
>>>>>>>>
>>>>>>>> Oracle --> Sqoop (import) --> Hive tables --> Hive queries -->
>>>>>>>> Spark SQL --> Hive --> Oracle
>>>>>>>>
>>>>>>>> What would be the optimal architecture to process data from Oracle
>>>>>>>> using Spark? Can I improve this process in any way?
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Sirisha
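For reference, a minimal sketch of the Sqoop-free round trip Deepak and Russell describe above, assuming Spark 2.x with the Oracle JDBC driver on the classpath; the URL, credentials, table names and the UDF are placeholders, not from this thread:

    import java.util.Properties
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.udf

    object OracleRoundTrip {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("oracle-roundtrip").getOrCreate()

        val url = "jdbc:oracle:thin:@//dbhost:1521/ORCL"   // placeholder URL
        val props = new Properties()
        props.setProperty("user", "app_user")              // placeholder credentials
        props.setProperty("password", "app_password")
        props.setProperty("driver", "oracle.jdbc.OracleDriver")

        // 1) Read straight from Oracle as a DataFrame -- no Sqoop import, no Hive hop.
        val src = spark.read.jdbc(url, "SOURCE_TABLE", props)

        // 2) Apply the transformation as a native Scala UDF instead of a Hive Java UDF.
        val cleanName = udf((s: String) => if (s == null) null else s.trim.toUpperCase)
        val out = src.withColumn("NAME", cleanName(src("NAME")))

        // 3) Write the result directly back to Oracle.
        out.write.mode("append").jdbc(url, "TARGET_TABLE", props)

        spark.stop()
      }
    }

Whether a native Scala UDF beats a registered Hive Java UDF will depend on the workload; both execute in the executor JVM, so any difference is likely smaller than the cost of the extra Sqoop and Hive hops being removed here.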