Hey,
I am using Spark to distribute the execution of a binary tool and to do some further calculations downstream. I want to distribute the binary tool using either the --files option or SparkContext.addFile so that it is available on each worker node. The log tells me that the file was added:
2018-05-09 07:42:19 INFO SparkContext:54 - Added file
s3a://executables/blastp at s3a://executables/foo with timestamp
1525851739972
2018-05-09 07:42:20 INFO Utils:54 - Fetching s3a://executables/foo to
/tmp/spark-54931ea6-b3d6-419b-997b-a498da898b77/userFiles-5e4b66e5-de4a-4420-a641-4453b9ea2ead/fetchFileTemp3437582648265876247.tmp
However, when I try to execute the tool via pipe(), it does not work. My current assumption is that the file is only downloaded to the master node, but I am not sure whether I misunderstood the concept of adding files in Spark or whether I did something wrong.
I resolve the path with SparkFiles.get(). The call itself succeeds, but the binary is not at the returned path.
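
For context, the relevant driver code looks roughly like this (a simplified sketch; the actual input data and processing differ, and the input path below is hypothetical):

import org.apache.spark.SparkFiles
import org.apache.spark.sql.SparkSession

object Run {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("sparkBlast").getOrCreate()
    val sc = spark.sparkContext

    // The binary is shipped via --files; equivalently one could call
    // sc.addFile("s3a://executables/tool") here.
    // SparkFiles.get resolves the file name against the local download
    // directory of whichever JVM evaluates it.
    val toolPath = SparkFiles.get("tool")

    // Pipe each partition through the external binary.
    val input = sc.textFile("s3a://database/input.fasta") // hypothetical input
    input.pipe(toolPath).collect().foreach(println)
  }
}
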
This is my call:
spark-submit \
  --class de.jlu.bioinfsys.sparkBlast.run.Run \
  --master $master \
  --jars ${awsPath},${awsJavaSDK} \
  --files s3a://database/a.a.z,s3a://database/a.a.y,s3a://database/a.a.x,s3a://executables/tool \
  --conf spark.executor.extraClassPath=${awsPath}:${awsJavaSDK} \
  --conf spark.driver.extraClassPath=${awsPath}:${awsJavaSDK} \
  --conf spark.hadoop.fs.s3a.endpoint=https://s3.computational.bio.uni-giessen.de/ \
  --conf spark.hadoop.fs.s3a.access.key=$s3Access \
  --conf spark.hadoop.fs.s3a.secret.key=$s3Secret \
  --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
  ${execJarPath}
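
To test my assumption that the file only lands on the driver, I could run a small probe on the executors, something like the following sketch (assuming sc is the SparkContext from above; the partition count is arbitrary):

import java.io.File
import java.net.InetAddress
import org.apache.spark.SparkFiles

// One task per partition; each reports, for its executor host, whether
// the shipped file exists at the path SparkFiles.get resolves to there.
val report = sc.parallelize(1 to 12, 12).mapPartitions { _ =>
  val path = SparkFiles.get("tool")
  val host = InetAddress.getLocalHost.getHostName
  Iterator(s"$host: $path exists=${new File(path).exists()}")
}.distinct().collect()

report.foreach(println)
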
I am using Spark v2.3.0 with Scala in standalone cluster mode with three workers.
Cheers
Marius