You can use HDFS, S3, Azure Blob Storage, GlusterFS, Ceph, Ignite (in-memory), and 
so on. A Spark cluster itself does not store anything; it just processes.
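
For example, a minimal sketch of persisting a DataFrame to storage outside the 
cluster (the paths and bucket name below are placeholders; writing to S3 assumes 
the hadoop-aws connector and credentials are configured):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("persist-sketch").getOrCreate()
val df = spark.read.parquet("hdfs:///warehouse/input")  // any supported source

// Local or mounted file system on the worker nodes
df.write.mode("overwrite").parquet("file:///mnt/shared/output")

// Object storage via the s3a connector
df.write.mode("overwrite").parquet("s3a://my-bucket/output")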

> On 29 Jan 2017, at 15:37, Alex <siri8...@gmail.com> wrote:
> 
> But for persistence after intermediate processing, can I use the Spark cluster 
> itself, or do I have to use a Hadoop cluster?
> 
> On Jan 29, 2017 7:36 PM, "Deepak Sharma" <deepakmc...@gmail.com> wrote:
> The better way is to read the data directly into Spark using the Spark SQL JDBC 
> read.
> Apply the UDFs there.
> Then save the DataFrame back to Oracle using the DataFrame's JDBC write.
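> 
> A rough sketch of that flow (the JDBC URL, credentials, table and column names 
> below are placeholders, not taken from your environment):
> 
> import org.apache.spark.sql.SparkSession
> import org.apache.spark.sql.functions.udf
> 
> val spark = SparkSession.builder().appName("oracle-jdbc-flow").getOrCreate()
> import spark.implicits._
> 
> val url = "jdbc:oracle:thin:@//dbhost:1521/ORCL"          // placeholder
> val props = new java.util.Properties()
> props.setProperty("user", "app_user")                     // placeholder
> props.setProperty("password", "app_password")             // placeholder
> props.setProperty("driver", "oracle.jdbc.OracleDriver")
> 
> // 1) Read the source table directly from Oracle
> val src = spark.read.jdbc(url, "SRC_TABLE", props)
> 
> // 2) Apply the transformation as a native Spark UDF
> val cleanName = udf((s: String) => if (s == null) null else s.trim.toUpperCase)
> val out = src.withColumn("NAME", cleanName($"NAME"))
> 
> // 3) Write the result back to Oracle
> out.write.mode("append").jdbc(url, "DEST_TABLE", props)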
> 
> Thanks
> Deepak
> 
>> On Jan 29, 2017 7:15 PM, "Jörn Franke" <jornfra...@gmail.com> wrote:
>> One alternative could be the Oracle Hadoop loader and other Oracle products, 
>> but you have to invest some money and probably buy their Hadoop appliance; 
>> you have to evaluate whether that makes sense (it can get expensive with large 
>> clusters, etc.).
>> 
>> Another alternative would be to get rid of Oracle altogether and use other 
>> databases.
>> 
>> However, can you elaborate a little on your use case and the business logic, 
>> as well as the SLA requirements? Otherwise any recommendation is equally valid, 
>> because the requirements you presented are very generic.
>> 
>> About getting rid of Hadoop: it depends! You will need some resource manager 
>> (YARN, Mesos, Kubernetes, etc.) and most likely also a distributed file 
>> system. Through the Hadoop APIs, Spark supports a wide range of file systems, 
>> but it does not need HDFS for persistence. You can use a local file system 
>> (i.e. any file system mounted on a node, so also distributed ones such as ZFS) 
>> or cloud file systems (S3, Azure Blob, etc.).
>> 
>> 
>> 
>>> On 29 Jan 2017, at 11:18, Alex <siri8...@gmail.com> wrote:
>>> 
>>> Hi All,
>>> 
>>> Thanks for your response. Please find the flow diagram below.
>>> 
>>> Please help me simplify this architecture using Spark.
>>> 
>>> 1) Can I skip steps 1 to 4 and store the data directly in Spark?
>>> If I am storing it in Spark, where does it actually get stored?
>>> Do I need to retain Hadoop to store the data,
>>> or can I store it directly in Spark and remove Hadoop as well?
>>> 
>>> I want to remove Informatica for preprocessing and directly load the file 
>>> data coming from the server into Hadoop/Spark.
>>> 
>>> So my question is: can I load the file data directly into Spark? Then where 
>>> exactly will the data get stored? Do I need to have Spark installed on top 
>>> of HDFS?
>>> 
>>> 2) If I retain the architecture below, can I store the output from Spark 
>>> directly to Oracle, going from step 5 to step 7?
>>> 
>>> And will Spark's way of storing it back to Oracle be better than using 
>>> Sqoop, performance-wise?
>>> 3) Can I use Spark Scala UDFs to process data from Hive and retain the entire 
>>> architecture?
>>> 
>>> Which of the above would be optimal?
>>> 
>>> 
>>> 
>>>> On Sat, Jan 28, 2017 at 10:38 PM, Sachin Naik <sachin.u.n...@gmail.com> 
>>>> wrote:
>>>> I strongly agree with Jörn and Russell. There are different solutions for 
>>>> data movement depending on your needs: frequency, bi-directional drivers, 
>>>> workflow, handling duplicate records. This space is known as "Change Data 
>>>> Capture", or CDC for short. If you need more information, I would be happy 
>>>> to chat with you. I built some products in this space that extensively used 
>>>> connection pooling over ODBC/JDBC.
>>>> 
>>>> Happy to chat if you need more information. 
>>>> 
>>>> -Sachin Naik
>>>> 
>>>> >> Hard to tell. Can you give more insights on what you are trying to achieve 
>>>> >> and what the data is about?
>>>> >> For example, depending on your use case, Sqoop can make sense or not.
>>>> 
>>>>> On Jan 27, 2017, at 11:22 PM, Russell Spitzer <russell.spit...@gmail.com> 
>>>>> wrote:
>>>>> 
>>>>> You can treat Oracle as a JDBC source 
>>>>> (http://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases),
>>>>>  skip Sqoop and the Hive tables, and go straight to queries. Then you can 
>>>>> skip Hive on the way back out (see the same link) and write directly to 
>>>>> Oracle. I'll leave the performance questions to someone else.
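>>>>> 
>>>>> A minimal sketch of such a direct read (the URL, credentials, table and ID 
>>>>> column are placeholders; the partitioning options assume a numeric key 
>>>>> column with roughly known min/max values):
>>>>> 
>>>>> val df = spark.read
>>>>>   .format("jdbc")
>>>>>   .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCL")
>>>>>   .option("dbtable", "SRC_TABLE")
>>>>>   .option("user", "app_user")
>>>>>   .option("password", "app_password")
>>>>>   .option("driver", "oracle.jdbc.OracleDriver")
>>>>>   .option("partitionColumn", "ID")   // split the read across executors
>>>>>   .option("lowerBound", "1")
>>>>>   .option("upperBound", "1000000")
>>>>>   .option("numPartitions", "8")
>>>>>   .load()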
>>>>> 
>>>>>> On Fri, Jan 27, 2017 at 11:06 PM Sirisha Cheruvu <siri8...@gmail.com> 
>>>>>> wrote:
>>>>>> 
>>>>>> On Sat, Jan 28, 2017 at 6:44 AM, Sirisha Cheruvu <siri8...@gmail.com> 
>>>>>> wrote:
>>>>>> Hi Team,
>>>>>> 
>>>>>> Right now our existing flow is:
>>>>>> 
>>>>>> Oracle --> Sqoop --> Hive --> Hive queries on Spark SQL (HiveContext) --> 
>>>>>> destination Hive table --> Sqoop export to Oracle
>>>>>> 
>>>>>> Half of the required Hive UDFs are developed as Java UDFs.
>>>>>> 
>>>>>> So now I want to know: if I run native Scala UDFs instead of running the 
>>>>>> Hive Java UDFs in Spark SQL, will there be any performance difference?
>>>>>> 
>>>>>> 
>>>>>> Can we skip the Sqoop import and export part and instead load data directly 
>>>>>> from Oracle into Spark, code Scala UDFs for the transformations, and export 
>>>>>> the output data back to Oracle?
>>>>>> 
>>>>>> Right now the architecture we are using is:
>>>>>> 
>>>>>> Oracle --> Sqoop (import) --> Hive tables --> Hive queries --> Spark SQL --> 
>>>>>> Hive --> Oracle
>>>>>> 
>>>>>> What would be the optimal architecture for processing data from Oracle using 
>>>>>> Spark? Can I improve this process in any way?
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> Regards,
>>>>>> Sirisha 
>>>>>> 
>>> 
> 
