The better way is to read the data directly into Spark using Spark SQL's JDBC read, apply the UDFs in Spark, and then save the DataFrame back to Oracle using the DataFrame's JDBC write.
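A minimal sketch of that flow, assuming Spark 2.x with the Oracle JDBC driver on the classpath; the connection URL, credentials, table names and the normalize UDF are placeholders:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

val spark = SparkSession.builder()
  .appName("oracle-jdbc-pipeline")
  .getOrCreate()

// Read the source table straight from Oracle over JDBC (no Sqoop import).
val src = spark.read
  .format("jdbc")
  .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCLPDB")
  .option("dbtable", "SCHEMA.SOURCE_TABLE")
  .option("user", "spark_user")
  .option("password", "secret")
  .option("driver", "oracle.jdbc.OracleDriver")
  .load()

// Apply the transformation as a native Scala UDF inside Spark.
val normalize = udf((s: String) => if (s == null) null else s.trim.toUpperCase)
val transformed = src.withColumn("NAME", normalize(src("NAME")))

// Write the result back to Oracle over JDBC (no Sqoop export).
transformed.write
  .format("jdbc")
  .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCLPDB")
  .option("dbtable", "SCHEMA.TARGET_TABLE")
  .option("user", "spark_user")
  .option("password", "secret")
  .option("driver", "oracle.jdbc.OracleDriver")
  .mode("append")
  .save()
```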
Thanks,
Deepak

On Jan 29, 2017 7:15 PM, "Jörn Franke" <jornfra...@gmail.com> wrote:

> One alternative could be the Oracle Hadoop loader and other Oracle products, but you have to invest some money and probably buy their Hadoop Appliance, so you have to evaluate whether that makes sense (it can get expensive with large clusters etc.).
>
> Another alternative would be to get rid of Oracle altogether and use other databases.
>
> However, can you elaborate a little bit on your use case and the business logic, as well as the SLA requirements? Otherwise all recommendations are equally valid, because the requirements you presented are very generic.
>
> About getting rid of Hadoop - it depends! You will need some resource manager (YARN, Mesos, Kubernetes etc.) and most likely also a distributed file system. Through the Hadoop APIs Spark supports a wide range of file systems, but it does not need HDFS for persistence. You can use a local filesystem (i.e. any file system mounted on a node, including distributed ones such as ZFS) or cloud file systems (S3, Azure Blob etc.).
>
> On 29 Jan 2017, at 11:18, Alex <siri8...@gmail.com> wrote:
>
> Hi All,
>
> Thanks for your response. Please find the flow diagram below.
>
> Please help me simplify this architecture using Spark.
>
> 1) Can I skip steps 1 to 4 and store the data directly in Spark? If I store it in Spark, where does it actually get stored? Do I need to retain Hadoop to store the data, or can I store it directly in Spark and remove Hadoop as well?
>
> I want to remove Informatica for preprocessing and load the file data coming from the server directly into Hadoop/Spark.
>
> So my question is: can I load file data directly into Spark? If so, where exactly will the data be stored? Do I need to have Spark installed on top of HDFS?
>
> 2) If I retain the architecture below, can I store the output from Spark directly back to Oracle (steps 5 to 7), and will Spark's way of storing it back to Oracle be better than using Sqoop, performance-wise?
>
> 3) Can I use Spark Scala UDFs to process the data from Hive and retain the entire architecture?
>
> Which of the above would be optimal?
>
> [image: Inline image 1]
>
> On Sat, Jan 28, 2017 at 10:38 PM, Sachin Naik <sachin.u.n...@gmail.com> wrote:
>
>> I strongly agree with Jörn and Russell. There are different solutions for data movement depending upon your needs: frequency, bi-directional drivers, workflow, handling duplicate records. This space is known as "Change Data Capture", or CDC for short. I built some products in this space that extensively used connection pooling over ODBC/JDBC.
>>
>> Happy to chat if you need more information.
>>
>> -Sachin Naik
>>
>> >> Hard to tell. Can you give more insight into what you are trying to achieve and what the data is about?
>> >> For example, depending on your use case Sqoop can make sense or not.
>>
>> Sent from my iPhone
>>
>> On Jan 27, 2017, at 11:22 PM, Russell Spitzer <russell.spit...@gmail.com> wrote:
>>
>> You can treat Oracle as a JDBC source (http://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases) and skip Sqoop and Hive tables, going straight to queries. Then you can skip Hive on the way back out (see the same link) and write directly to Oracle. I'll leave the performance questions for someone else.
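On the performance question: Spark's JDBC source can also parallelize the read if you give it partitioning options, which plays a role similar to Sqoop's --num-mappers. A rough sketch, assuming the same SparkSession as above and a numeric ID column; the column name, bounds and table are placeholders:

```scala
val partitionedSrc = spark.read
  .format("jdbc")
  .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCLPDB")
  .option("dbtable", "SCHEMA.SOURCE_TABLE")
  .option("user", "spark_user")
  .option("password", "secret")
  .option("partitionColumn", "ID") // numeric column used to split the read
  .option("lowerBound", "1")
  .option("upperBound", "1000000")
  .option("numPartitions", "8")    // eight concurrent JDBC connections
  .load()
```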
>>
>> On Fri, Jan 27, 2017 at 11:06 PM Sirisha Cheruvu <siri8...@gmail.com> wrote:
>>
>>> On Sat, Jan 28, 2017 at 6:44 AM, Sirisha Cheruvu <siri8...@gmail.com> wrote:
>>>
>>> Hi Team,
>>>
>>> Right now our existing flow is:
>>>
>>> Oracle --> Sqoop --> Hive --> Hive queries on Spark SQL (HiveContext) --> destination Hive table --> Sqoop export to Oracle
>>>
>>> Half of the required Hive UDFs are developed as Java UDFs.
>>>
>>> So now I want to know: if I run native Scala UDFs instead of the Hive Java UDFs in Spark SQL, will there be any performance difference?
>>>
>>> Can we skip the Sqoop import and export part and instead load data directly from Oracle into Spark, code Scala UDFs for the transformations, and export the output data back to Oracle?
>>>
>>> Right now the architecture we are using is:
>>>
>>> Oracle --> Sqoop (import) --> Hive tables --> Hive queries --> Spark SQL --> Hive --> Oracle
>>>
>>> What would be the optimal architecture to process data from Oracle using Spark? Can I improve this process in any way?
>>>
>>> Regards,
>>> Sirisha
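For the native-Scala-UDF versus Hive-Java-UDF question, a minimal sketch of how both can be registered side by side in Spark SQL so they can be compared on the same data; it assumes a Hive-enabled SparkSession named spark, and the function names, the Hive UDF class and the table are hypothetical:

```scala
// Native Scala UDF registered directly with Spark SQL.
spark.udf.register("clean_name", (s: String) => if (s == null) null else s.trim.toUpperCase)

// Existing Hive Java UDF reused as-is (requires Hive support and the UDF jar on the classpath).
spark.sql("CREATE TEMPORARY FUNCTION clean_name_hive AS 'com.example.udf.CleanName'")

// Both are callable from the same query, so they can be benchmarked against each other.
spark.sql("SELECT clean_name(name), clean_name_hive(name) FROM staging_table").show()
```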