Hi Team,

I am using Apache Spark to build a scalable analytics engine. My setup is as follows.
The processing flow is: raw files > store to HDFS > process with Spark and store to Postgres-XL > R reads the data from Postgres-XL and processes it in distributed mode.

I have a 6-node cluster set up for ETL operations, with:

1. Spark slaves installed on all 6 nodes.
2. HDFS DataNodes on each of the 6 nodes, with replication factor 2.
3. A Postgres-XL 9.5 coordinator on each of the 6 nodes.
4. R installed on all nodes, processing data from Postgres-XL in a distributed manner.

Could you please advise on the pros and cons of this setup? Is installing every component on every machine recommended, or are there drawbacks? Should R run on the Spark cluster?

Thanks & Regards,

Saurabh Kumar
R&D Engineer, T&I TED, Technology Exploration & Disruption
Nokia Networks
L5, Manyata Embassy Business Park, Nagavara, Bangalore, India 560045
Mobile: +91-8861012418
http://networks.nokia.com/
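For reference, the "process by Spark and store to Postgres-XL" step in the flow above could be sketched roughly as below. This is a minimal PySpark illustration, not the poster's actual job: the coordinator hostname (`coord1`), database and table names (`analytics`, `metrics`), HDFS path, transform, and user are all placeholder assumptions. Since Postgres-XL coordinators speak the standard PostgreSQL wire protocol, Spark's generic JDBC sink with the stock PostgreSQL driver should work against any coordinator node.

```python
def jdbc_options(coordinator_host, database, user="etl", port=5432):
    """Build JDBC options for one Postgres-XL coordinator.

    Postgres-XL is wire-compatible with PostgreSQL, so the ordinary
    org.postgresql.Driver is used. All values here are placeholders.
    """
    return {
        "url": f"jdbc:postgresql://{coordinator_host}:{port}/{database}",
        "driver": "org.postgresql.Driver",
        "user": user,
    }


if __name__ == "__main__":
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("etl-to-pgxl").getOrCreate()

    # Read raw files from HDFS (placeholder path), apply a placeholder
    # transform, then append the result into Postgres-XL over JDBC.
    df = spark.read.json("hdfs:///data/raw/")
    result = df.groupBy("key").count()

    (result.write
        .format("jdbc")
        .options(**jdbc_options("coord1", "analytics"), dbtable="metrics")
        .mode("append")
        .save())
```

Writing through a single coordinator can become a bottleneck; one common variant is to round-robin executors across the coordinators, since any coordinator accepts writes for the whole cluster.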