I use CIFS; it works reasonably well, is easy to use cross-platform, and is well documented...
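A toy model of why the shared mount matters for the commit problem quoted below. This is plain Python, not Spark: `run_job`, `node_roots` and the `_temporary` layout are my own simplifications for illustration, not Spark's actual committer code. Each node writes a part file under a temporary directory, then only the driver promotes what it can see.

```python
import os
import shutil
import tempfile


def run_job(node_roots, driver):
    """Toy two-phase file commit (hypothetical helper, not a Spark API).

    node_roots maps a node name to the directory that node sees as the
    output path. With a shared mount every node maps to the same
    directory; with node-local disks each node has its own.
    """
    # Phase 1: every node writes its part file under a temporary directory.
    for i, node in enumerate(sorted(node_roots)):
        tmp = os.path.join(node_roots[node], "_temporary")
        os.makedirs(tmp, exist_ok=True)
        with open(os.path.join(tmp, f"part-{i:05d}"), "w") as f:
            f.write(f"rows written by {node}\n")

    # Phase 2: only the driver commits, and it only sees its own view.
    out = node_roots[driver]
    tmp = os.path.join(out, "_temporary")
    for name in sorted(os.listdir(tmp)):
        os.rename(os.path.join(tmp, name), os.path.join(out, name))
    shutil.rmtree(tmp)
    return sorted(os.listdir(out))


nodes = ["driver", "worker1", "worker2"]

with tempfile.TemporaryDirectory() as base:
    # Node-local disks: each node has its own copy of the output path.
    local = {n: os.path.join(base, "local", n) for n in nodes}
    local_result = run_job(local, "driver")

    # Shared mount (CIFS, NFS): every node sees the same directory.
    shared_dir = os.path.join(base, "shared")
    shared = {n: shared_dir for n in nodes}
    shared_result = run_job(shared, "driver")

print("node-local:", local_result)
print("shared:    ", shared_result)
```

In the node-local run the driver's output directory ends up with only the driver's own part file; in the shared run all three part files are promoted. The real thing has more moving parts, but that is the shape of it.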
> On Aug 4, 2017, at 6:50 AM, Steve Loughran <ste...@hortonworks.com> wrote:
>
>
>> On 3 Aug 2017, at 19:59, Marco Mistroni <mmistr...@gmail.com> wrote:
>>
>> Hello
>> my 2 cents here, hope it helps
>> If you just want to play around with Spark, I'd leave Hadoop out, it's an
>> unnecessary dependency that you don't need for just running a python script
>> Instead do the following:
>> - go to the root of our master / slave node. create a directory
>> /root/pyscripts
>> - place your csv file there as well as the python script
>> - run the script to replicate the whole directory across the cluster (i
>> believe it's called copy-script.sh)
>> - then run your spark-submit, it will be something like
>> ./spark-submit --master
>> spark://ip-172-31-44-155.us-west-2.compute.internal:7077
>> /root/pyscripts/mysparkscripts.py
>> file:///root/pyscripts/tree_addhealth.csv 10
>> - in your python script, as part of your processing, write the parquet file
>> in directory /root/pyscripts
>>
>
> That's going to hit the commit problem discussed: only the spark driver
> executes the final commit process; the output from the other servers doesn't
> get picked up and promoted. You need a shared store (NFS is the easy one)
>
>
>> If you have an AWS account and you are comfortable with it - you need to
>> set up bucket permissions etc - you can just
>> - store your file in one of your S3 buckets
>> - create an EMR cluster
>> - connect to master or slave
>> - run your script that reads from the s3 bucket and writes to the same s3
>> bucket
>
>
> Aah, and now we are into the problem of implementing a safe commit protocol
> for an inconsistent filesystem....
>
> My current stance there is that out-of-the-box S3 isn't safe to use as the
> direct output of work; Azure is. S3 mostly works for a small experiment, but I
> wouldn't use it in production.
>
> Simplest: work on one machine, if you go to 2-3 for exploratory work: NFS
>
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org