I use CIFS and it works reasonably well: it's easy to set up cross platform and well 
documented...

> On Aug 4, 2017, at 6:50 AM, Steve Loughran <ste...@hortonworks.com> wrote:
> 
> 
>> On 3 Aug 2017, at 19:59, Marco Mistroni <mmistr...@gmail.com> wrote:
>> 
>> Hello
>> my 2 cents here, hope it helps
>> If you want to just play around with Spark, I'd leave Hadoop out; it's an 
>> unnecessary dependency that you don't need for just running a Python script.
>> Instead, do the following:
>> - go to the root of your master / slave node and create a directory 
>> /root/pyscripts 
>> - place your CSV file there as well as the Python script
>> - run the script that replicates the whole directory across the cluster (I 
>> believe it's called copy-script.sh)
>> - then run your spark-submit; it will be something like
>>    ./spark-submit --master 
>> spark://ip-172-31-44-155.us-west-2.compute.internal:7077 
>> /root/pyscripts/mysparkscripts.py file:///root/pyscripts/tree_addhealth.csv 10
>> - in your Python script, as part of your processing, write the Parquet file 
>> in directory /root/pyscripts (a minimal sketch of such a script follows below)
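>> 
>> A minimal sketch of what mysparkscripts.py could look like (the names, paths 
>> and read options are only illustrative, following the example above):
>> 
>>     # mysparkscripts.py - illustrative sketch: read a CSV, write Parquet
>>     import sys
>>     from pyspark.sql import SparkSession
>> 
>>     if __name__ == "__main__":
>>         csv_path = sys.argv[1]  # e.g. file:///root/pyscripts/tree_addhealth.csv
>>         spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()
>> 
>>         # read the CSV, letting Spark infer the schema from the header row
>>         df = spark.read.csv(csv_path, header=True, inferSchema=True)
>> 
>>         # ... your processing here ...
>> 
>>         # write the result as Parquet into the replicated script directory
>>         df.write.mode("overwrite").parquet("file:///root/pyscripts/output.parquet")
>>         spark.stop()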
>> 
> 
> That's going to hit the commit problem discussed: only the Spark driver 
> executes the final commit process; the output from the other servers doesn't 
> get picked up and promoted. You need a shared store (NFS is the easy one).
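> 
> For instance, if every node mounts the same NFS export at /mnt/shared (a 
> path invented here purely for illustration), driver and executors all see 
> one filesystem and the commit works:
> 
>     from pyspark.sql import SparkSession
> 
>     spark = SparkSession.builder.getOrCreate()
>     df = spark.read.csv("file:///mnt/shared/tree_addhealth.csv", header=True)
> 
>     # /mnt/shared must be mounted at the same path on every node, so the
>     # task outputs and the final committed files land on shared storage
>     df.write.mode("overwrite").parquet("file:///mnt/shared/output.parquet")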
> 
> 
>> If you have an AWS account and you are versatile with that - you need to 
>> set up bucket permissions etc. - you can just:
>> - store your file in one of your S3 buckets
>> - create an EMR cluster
>> - connect to the master or a slave
>> - run your script that reads from the S3 bucket and writes to the same S3 
>> bucket (see the sketch below)
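>> 
>> On EMR the script only changes in the paths - something along these lines 
>> (the bucket name and keys are placeholders):
>> 
>>     from pyspark.sql import SparkSession
>> 
>>     spark = SparkSession.builder.appName("s3-example").getOrCreate()
>> 
>>     # EMR ships with an S3 connector, so s3:// URIs work directly
>>     df = spark.read.csv("s3://your-bucket/tree_addhealth.csv",
>>                         header=True, inferSchema=True)
>> 
>>     # write back to the same bucket under a different prefix
>>     df.write.mode("overwrite").parquet("s3://your-bucket/output/")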
> 
> 
> Aah, and now we are into the problem of implementing a safe commit protocol 
> for an inconsistent filesystem....
> 
> My current stance there is that, out of the box, S3 isn't safe to use as the 
> direct output of work; Azure is. It mostly works for a small experiment, but I 
> wouldn't use it in production.
> 
> Simplest: work on one machine; if you go to 2-3 machines for exploratory work, use NFS.
> 
> 


---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
