Well, if you require R then you need to install it (including all additional packages) on each node. I am not sure why you store the data in Postgres-XL, though. Storing it in Parquet or ORC on HDFS (sorted on the relevant columns) is sufficient, and you can use the SparkR libraries to access it directly, without the extra database hop.
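
For illustration, a minimal SparkR sketch of that approach (the master URL, HDFS paths and the event_date column are placeholders, and it assumes the Spark 1.6-style SparkR API):

  library(SparkR)  # assumes SPARK_HOME is set and SparkR is on the library path

  sc <- sparkR.init(master = "spark://master:7077", appName = "AnalyticsETL")
  sqlContext <- sparkRSQL.init(sc)

  # Read the processed data straight from HDFS as Parquet (path is a placeholder)
  df <- read.df(sqlContext, "hdfs:///data/processed", source = "parquet")

  # Transformations like filter run distributed on the Spark executors,
  # so R itself only needs to drive the job (event_date is a made-up column)
  recent <- filter(df, df$event_date == "2016-05-30")
  head(recent)

  # Writing results back to HDFS is just as direct, no database needed
  write.df(recent, path = "hdfs:///data/filtered", source = "parquet",
           mode = "overwrite")

  sparkR.stop()

This way the data stays in HDFS end to end and R only orchestrates distributed work, instead of pulling rows out of Postgres-XL.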
> On 30 May 2016, at 08:38, Kumar, Saurabh 5. (Nokia - IN/Bangalore)
> <saurabh.5.ku...@nokia.com> wrote:
>
> Hi Team,
>
> I am using Apache Spark to build a scalable analytics engine. My setup is
> as follows.
>
> Flow of processing:
>
> Raw files -> stored to HDFS -> processed by Spark and stored to Postgres-XL
> database -> R processes data from Postgres-XL in distributed mode.
>
> I have a 6-node cluster setup for ETL operations, which has:
>
> Spark slaves installed on all 6 nodes.
> HDFS data nodes on each of the 6 nodes, with replication factor 2.
> Postgres-XL 9.5 database coordinator on each of the 6 nodes.
> R installed on all nodes; it processes data from Postgres-XL in a
> distributed manner.
>
> Can you please guide me about the pros and cons of this setup?
> Is installing all components on every machine recommended, or is there any
> drawback?
> Should R run on the Spark cluster?
>
> Thanks & Regards
> Saurabh Kumar
> R&D Engineer, T&I TED Technology Exploration & Disruption
> Nokia Networks
> L5, Manyata Embassy Business Park, Nagavara, Bangalore, India 560045
> Mobile: +91-8861012418
> http://networks.nokia.com/