I meant a distributed file system such as Ceph, Gluster, etc.
> On 29 Jan 2017, at 14:45, Jörn Franke <jornfra...@gmail.com> wrote:
>
> One alternative could be the Oracle Hadoop loader and other Oracle products,
> but you have to invest some money and probably buy their Hadoop Appliance,
> which you have to evaluate whether it makes sense (it can get expensive with
> large clusters etc.).
>
> Another alternative would be to get rid of Oracle altogether and use other
> databases.
>
> However, can you elaborate a little bit on your use case and the business
> logic, as well as the SLA requirements? Otherwise all recommendations are
> equally valid, because the requirements you presented are very generic.
>
> About getting rid of Hadoop - this depends! You will need some resource
> manager (YARN, Mesos, Kubernetes etc.) and most likely also a distributed
> file system. Spark supports, through the Hadoop APIs, a wide range of file
> systems, but does not need HDFS for persistence. You can have a local file
> system (i.e. any file system mounted to a node, so also distributed ones,
> such as ZFS) or cloud file systems (S3, Azure Blob etc.).
>
>
>> On 29 Jan 2017, at 11:18, Alex <siri8...@gmail.com> wrote:
>>
>> Hi All,
>>
>> Thanks for your response. Please find the flow diagram below.
>>
>> Please help me simplify this architecture using Spark.
>>
>> 1) Can I skip steps 1 to 4 and directly store the data in Spark? If I am
>> storing it in Spark, where actually is it getting stored? Do I need to
>> retain Hadoop to store data, or can I store it directly in Spark and
>> remove Hadoop as well?
>>
>> I want to remove Informatica for preprocessing and directly load the file
>> data coming from the server into Hadoop/Spark.
>>
>> So my question is: can I directly load file data into Spark? Then where
>> exactly will the data get stored? Do I need to have Spark installed on top
>> of HDFS?
>>
>> 2) If I am retaining the architecture below, can I store the output from
>> Spark directly back to Oracle, from step 5 to step 7? And will Spark's way
>> of storing it back to Oracle be better than using Sqoop, performance-wise?
>>
>> 3) Can I use Spark Scala UDFs to process data from Hive and retain the
>> entire architecture?
>>
>> Which among the above would be optimal?
>>
>>> On Sat, Jan 28, 2017 at 10:38 PM, Sachin Naik <sachin.u.n...@gmail.com>
>>> wrote:
>>> I strongly agree with Jorn and Russell. There are different solutions for
>>> data movement depending upon your needs: frequency, bi-directional
>>> drivers, workflow, handling duplicate records. This space is known as
>>> "Change Data Capture" - CDC for short. If you need more information, I
>>> would be happy to chat with you. I built some products in this space that
>>> extensively used connection pooling over ODBC/JDBC.
>>>
>>> Happy to chat if you need more information.
>>>
>>> -Sachin Naik
>>>
>>> >> Hard to tell. Can you give more insights on what you try to achieve
>>> >> and what the data is about?
>>> >> For example, depending on your use case sqoop can make sense or not.
>>> Sent from my iPhone
>>>
>>>> On Jan 27, 2017, at 11:22 PM, Russell Spitzer
>>>> <russell.spit...@gmail.com> wrote:
>>>>
>>>> You can treat Oracle as a JDBC source
>>>> (http://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases)
>>>> and skip Sqoop and Hive tables and go straight to queries. Then you can
>>>> skip Hive on the way back out (see the same link) and write directly to
>>>> Oracle. I'll leave the performance questions for someone else.
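A minimal sketch of the JDBC approach Russell describes above, assuming Spark 2.x with the Oracle JDBC driver on the classpath; the host, schema, table, and credential values are placeholders, not details taken from this thread:

```scala
import java.util.Properties

import org.apache.spark.sql.SparkSession

object OracleDirectSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("oracle-direct-sketch").getOrCreate()

    // Placeholder connection details -- replace with your own host, service and credentials.
    val jdbcUrl = "jdbc:oracle:thin:@//dbhost:1521/ORCL"
    val props = new Properties()
    props.setProperty("user", "app_user")
    props.setProperty("password", "app_password")
    props.setProperty("driver", "oracle.jdbc.OracleDriver")

    // Read the source table straight from Oracle, skipping the Sqoop import and Hive staging.
    val src = spark.read.jdbc(jdbcUrl, "SRC_SCHEMA.SRC_TABLE", props)

    // Stand-in for the transformations currently expressed as Hive queries.
    val transformed = src.filter("STATUS = 'ACTIVE'")

    // Write the result straight back to Oracle, skipping the Sqoop export.
    transformed.write.mode("append").jdbc(jdbcUrl, "TGT_SCHEMA.TGT_TABLE", props)

    spark.stop()
  }
}
```

For large tables you would normally also set the JDBC partitioning options (partitionColumn, lowerBound, upperBound, numPartitions) so the read is spread across executors instead of issued as a single query.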
>>>>
>>>>> On Fri, Jan 27, 2017 at 11:06 PM Sirisha Cheruvu <siri8...@gmail.com>
>>>>> wrote:
>>>>>
>>>>> On Sat, Jan 28, 2017 at 6:44 AM, Sirisha Cheruvu <siri8...@gmail.com>
>>>>> wrote:
>>>>> Hi Team,
>>>>>
>>>>> Right now our existing flow is:
>>>>>
>>>>> Oracle --> Sqoop --> Hive --> Hive queries on Spark-SQL (HiveContext)
>>>>> --> destination Hive table --> Sqoop export to Oracle
>>>>>
>>>>> Half of the Hive UDFs required are developed as Java UDFs.
>>>>>
>>>>> So now I want to know: if I run native Scala UDFs rather than the Hive
>>>>> Java UDFs in spark-sql, will there be any performance difference?
>>>>>
>>>>> Can we skip the Sqoop import and export part and instead directly load
>>>>> data from Oracle into Spark, code Scala UDFs for the transformations,
>>>>> and export the output data back to Oracle?
>>>>>
>>>>> Right now the architecture we are using is:
>>>>>
>>>>> Oracle --> Sqoop (import) --> Hive tables --> Hive queries --> Spark-SQL
>>>>> --> Hive --> Oracle
>>>>>
>>>>> What would be the optimal architecture to process data from Oracle using
>>>>> Spark? Can I improve this process in any way?
>>>>>
>>>>> Regards,
>>>>> Sirisha
>>>>>
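On the UDF question in the original message, a rough sketch of how a native Scala UDF and an existing Hive Java UDF can sit side by side in spark-sql, so both can be benchmarked on the same data; the function logic, the com.example class name, and the column/table names are illustrative only:

```scala
import org.apache.spark.sql.SparkSession

object UdfComparisonSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("udf-comparison-sketch")
      .enableHiveSupport() // needed to keep calling the existing Hive Java UDFs
      .getOrCreate()

    // Native Scala UDF registered directly with Spark.
    val normalize = (s: String) => if (s == null) null else s.trim.toUpperCase
    spark.udf.register("normalize_scala", normalize)

    // Existing Hive Java UDF registered through HiveQL; the class name is a placeholder.
    spark.sql("CREATE TEMPORARY FUNCTION normalize_hive AS 'com.example.udf.NormalizeUDF'")

    // Both can be called from the same query, which makes a like-for-like comparison easy.
    spark.sql(
      """SELECT normalize_scala(cust_name) AS scala_result,
        |       normalize_hive(cust_name)  AS hive_result
        |FROM src_db.src_table""".stripMargin).show()

    spark.stop()
  }
}
```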