Well, if you require R then you need to install it (including all additional 
packages) on each node. I am not sure why you store the data in Postgres. 
Storing it as Parquet or ORC in HDFS (sorted on the relevant columns) is 
sufficient, and you can use the SparkR libraries to access it.
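For illustration, here is a minimal SparkR sketch of that approach (Spark 1.6-era API; the master URL, HDFS path and column names are placeholders, not taken from your setup):

library(SparkR)

# Attach to the existing Spark cluster and create a SQL context.
sc <- sparkR.init(master = "spark://master:7077", appName = "AnalyticEngine")
sqlContext <- sparkRSQL.init(sc)

# Read the Parquet files the Spark ETL job wrote to HDFS,
# instead of pulling the data back out of Postgres-XL.
events <- read.parquet(sqlContext, "hdfs:///data/events.parquet")

# Filters and projections run distributed on the cluster; only the
# small result is returned to the R driver.
recent <- filter(events, events$value > 100)
head(select(recent, "device_id", "value"))

sparkR.stop()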

> On 30 May 2016, at 08:38, Kumar, Saurabh 5. (Nokia - IN/Bangalore) 
> <saurabh.5.ku...@nokia.com> wrote:
> 
> Hi Team,
>  
> I am using Apache Spark to build a scalable analytics engine. My setup is as 
> follows.
>  
> Flow of processing is as follows:
>  
> Raw files > store to HDFS > process with Spark and store to Postgres-XL database 
> > R reads data from Postgres-XL and processes it in distributed mode.
>  
> I have a 6-node cluster set up for ETL operations, which has:
>  
> Spark slaves installed on all 6 of them.
> HDFS data nodes on each of the 6 nodes with replication factor 2.
> Postgres-XL 9.5 database coordinator on each of the 6 nodes.
> R installed on all nodes, processing data from Postgres-XL in a distributed 
> manner.
>  
> Can you please guide me on the pros and cons of this setup?
> Is installing all components on every machine recommended, or are there any 
> drawbacks?
> Should the R software run on the Spark cluster?
>  
>  
>  
> Thanks & Regards
> Saurabh Kumar
> R&D Engineer, T&I TED Technology Exploration & Disruption
> Nokia Networks
> L5, Manyata Embassy Business Park, Nagavara, Bangalore, India 560045
> Mobile: +91-8861012418
> http://networks.nokia.com/
