You can use HDFS, S3, Azure, GlusterFS, Ceph, Ignite (in-memory), and so on. A Spark cluster itself does not store anything; it just processes.
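For illustration, a minimal Scala sketch of how the storage backend is chosen purely by the path URI (the paths and bucket name are placeholders, and the matching connector jars, e.g. hadoop-aws for s3a, are assumed to be on the classpath):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("storage-sketch").getOrCreate()

    // Spark only processes; the path URI decides where the data actually lives
    val df = spark.read.parquet("hdfs:///data/input")                 // HDFS
    df.write.mode("overwrite").parquet("s3a://some-bucket/output")    // S3 through the s3a connector
    df.write.mode("overwrite").parquet("file:///mnt/shared/output")   // any file system mounted on the nodes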
> On 29 Jan 2017, at 15:37, Alex <siri8...@gmail.com> wrote:
>
> But for persistence after intermediate processing, can I use the Spark cluster itself or do I have to use a Hadoop cluster?
>
> On Jan 29, 2017 7:36 PM, "Deepak Sharma" <deepakmc...@gmail.com> wrote:
> The better way is to read the data directly into Spark using the Spark SQL read JDBC.
> Apply the UDFs locally.
> Then save the DataFrame back to Oracle using the DataFrame's write JDBC.
>
> Thanks
> Deepak
>
>> On Jan 29, 2017 7:15 PM, "Jörn Franke" <jornfra...@gmail.com> wrote:
>> One alternative could be the Oracle Hadoop loader and other Oracle products, but you have to invest some money and probably buy their Hadoop Appliance, which you have to evaluate to see whether it makes sense (it can get expensive with large clusters etc.).
>>
>> Another alternative would be to get rid of Oracle altogether and use other databases.
>>
>> However, can you elaborate a little bit on your use case and the business logic, as well as the SLA requirements? Otherwise all recommendations are equally valid, because the requirements you presented are very generic.
>>
>> About getting rid of Hadoop: this depends! You will need some resource manager (YARN, Mesos, Kubernetes etc.) and most likely also a distributed file system. Spark supports, through the Hadoop APIs, a wide range of file systems, but it does not need HDFS for persistence. You can use a local file system (i.e. any file system mounted on a node, so also distributed ones such as ZFS) or cloud file systems (S3, Azure Blob etc.).
>>
>>> On 29 Jan 2017, at 11:18, Alex <siri8...@gmail.com> wrote:
>>>
>>> Hi All,
>>>
>>> Thanks for your response. Please find below the flow diagram.
>>>
>>> Please help me simplify this architecture using Spark.
>>>
>>> 1) Can I skip steps 1 to 4 and store the data directly in Spark? If I am storing it in Spark, where is it actually stored? Do I need to retain Hadoop to store data, or can I store it directly in Spark and remove Hadoop as well?
>>>
>>> I want to remove Informatica for preprocessing and directly load the file data coming from the server into Hadoop/Spark.
>>>
>>> So my question is: can I directly load file data into Spark? Then where exactly will the data be stored? Do I need to have Spark installed on top of HDFS?
>>>
>>> 2) If I am retaining the architecture below, can I store the output from Spark directly back to Oracle, from step 5 to step 7? And will Spark's way of writing it back to Oracle be better than Sqoop, performance-wise?
>>>
>>> 3) Can I use Spark Scala UDFs to process data from Hive and retain the entire architecture?
>>>
>>> Which among the above would be optimal?
>>>
>>>> On Sat, Jan 28, 2017 at 10:38 PM, Sachin Naik <sachin.u.n...@gmail.com> wrote:
>>>> I strongly agree with Jörn and Russell. There are different solutions for data movement depending upon your needs: frequency, bi-directional drivers, workflow, handling of duplicate records. This space is known as "Change Data Capture" (CDC for short). If you need more information, I would be happy to chat with you. I built some products in this space that extensively used connection pooling over ODBC/JDBC.
>>>>
>>>> Happy to chat if you need more information.
>>>>
>>>> -Sachin Naik
>>>>
>>>> >> Hard to tell. Can you give more insights on what you try to achieve and what the data is about?
>>>> >> For example, depending on your use case Sqoop can make sense or not.
>>>>
>>>> Sent from my iPhone
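For illustration, a rough Scala sketch of the read-over-JDBC / transform / write-over-JDBC flow Deepak describes above. The connection URL, credentials and table names are placeholders, and the Oracle JDBC driver jar is assumed to be on the Spark classpath:

    import java.util.Properties
    import org.apache.spark.sql.{SaveMode, SparkSession}

    val spark = SparkSession.builder().appName("oracle-jdbc-sketch").getOrCreate()

    val url = "jdbc:oracle:thin:@//dbhost:1521/ORCL"          // placeholder
    val props = new Properties()
    props.setProperty("user", "app_user")                     // placeholder
    props.setProperty("password", "app_password")             // placeholder
    props.setProperty("driver", "oracle.jdbc.OracleDriver")

    // read the source table straight into a DataFrame, with no Sqoop or Hive staging
    val src = spark.read.jdbc(url, "SOURCE_TABLE", props)

    // ...apply transformations / UDFs here...
    val result = src                                          // placeholder for the transformed DataFrame

    // write the result straight back to Oracle
    result.write.mode(SaveMode.Append).jdbc(url, "TARGET_TABLE", props)

For large tables, the read.jdbc overload that takes a partition column, bounds and a number of partitions is worth considering, so the extract runs over several connections in parallel rather than through a single one.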
>>>>> On Jan 27, 2017, at 11:22 PM, Russell Spitzer <russell.spit...@gmail.com> wrote:
>>>>>
>>>>> You can treat Oracle as a JDBC source (http://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases) and skip Sqoop and Hive tables and go straight to queries. Then you can skip Hive on the way back out (see the same link) and write directly to Oracle. I'll leave the performance questions for someone else.
>>>>>
>>>>>> On Fri, Jan 27, 2017 at 11:06 PM Sirisha Cheruvu <siri8...@gmail.com> wrote:
>>>>>>
>>>>>> On Sat, Jan 28, 2017 at 6:44 AM, Sirisha Cheruvu <siri8...@gmail.com> wrote:
>>>>>> Hi Team,
>>>>>>
>>>>>> Right now our existing flow is:
>>>>>>
>>>>>> Oracle --> Sqoop --> Hive --> Hive queries on Spark SQL (HiveContext) --> destination Hive table --> Sqoop export to Oracle
>>>>>>
>>>>>> Half of the Hive UDFs required are developed as Java UDFs.
>>>>>>
>>>>>> So now I want to know: if I run native Scala UDFs instead of running the Hive Java UDFs in Spark SQL, will there be any performance difference?
>>>>>>
>>>>>> Can we skip the Sqoop import and export part and instead directly load data from Oracle into Spark, code Scala UDFs for the transformations, and export the output data back to Oracle?
>>>>>>
>>>>>> Right now the architecture we are using is:
>>>>>>
>>>>>> Oracle --> Sqoop (import) --> Hive tables --> Hive queries --> Spark SQL --> Hive --> Oracle
>>>>>>
>>>>>> What would be the optimal architecture to process data from Oracle using Spark? Can I improve this process in any way?
>>>>>>
>>>>>> Regards,
>>>>>> Sirisha
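To illustrate the UDF part of the question, a minimal Scala sketch of registering a native Scala UDF and calling it from SQL, instead of shipping a Hive Java UDF jar. The function logic, table name, class name and jar path are made-up placeholders:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("native-udf-sketch")
      .enableHiveSupport()              // keeps the existing Hive tables reachable
      .getOrCreate()

    // native Scala UDF (placeholder logic) registered for use in SQL
    spark.udf.register("normalize", (s: String) => if (s == null) null else s.trim.toUpperCase)

    // usable from the same Hive-style SQL the current flow already runs
    val out = spark.sql("SELECT normalize(cust_name) AS cust_name FROM staging_table")
    out.show()

    // the Hive Java UDF route would instead be registered roughly as:
    //   CREATE TEMPORARY FUNCTION normalize AS 'com.example.NormalizeUDF' USING JAR 'hdfs:///jars/udfs.jar'
    // and runs through Hive's ObjectInspector bridge inside Spark

Performance-wise, both native Scala UDFs and Hive Java UDFs are opaque to the Catalyst optimizer; the native route mainly avoids Hive's ObjectInspector conversion layer, so any gain is usually modest and worth measuring on real data before reworking the pipeline.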