Hi, I have a cluster of 4 machines and I call:
sc.wholeTextFiles("/x/*/*.txt")
Folder x contains subfolders, each holding thousands of files, with ~1 million
files in total matching the path expression.
My Spark job starts processing the files, but single-threaded: in the Spark UI
I can see that only 1 executor out of 4 is used, and only 1 thread out of the
configured 24:
spark-submit --class com.stratified.articleids.NxmlExtractorJob \
--driver-memory 8g \
--executor-memory 8g \
--num-executors 4 \
--executor-cores 16 \
--master yarn-cluster \
--conf spark.akka.frameSize=128 \
$JAR
My actual code is:
val rdd=extractIds(sc.wholeTextFiles(xmlDir))
rdd.saveAsObjectFile(serDir)
Is saveAsObjectFile causing this, and are there any workarounds?
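For reference, one workaround I have been considering is to give wholeTextFiles a minPartitions hint, or to repartition before the save. This is only a sketch, assuming a SparkContext `sc` is in scope and that `extractIds`, `xmlDir`, and `serDir` are as in the job above; I haven't confirmed it fixes the issue:

```scala
// Sketch of a possible workaround (assumes SparkContext `sc`, plus the
// xmlDir/serDir values and extractIds function from the job above).
// wholeTextFiles accepts an optional minPartitions hint; requesting at
// least one partition per configured core may spread ~1M small files
// across all executors instead of packing them into one partition.
val minPartitions = 4 * 16  // num-executors * executor-cores from spark-submit

val rdd = extractIds(sc.wholeTextFiles(xmlDir, minPartitions))

// Alternatively, repartition explicitly before saving, so that
// saveAsObjectFile writes from many tasks in parallel:
rdd.repartition(minPartitions).saveAsObjectFile(serDir)
```

The repartition adds a shuffle, so the minPartitions hint would be the cheaper option if it is honored.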
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/wholeTextFiles-x-txt-runs-single-threaded-tp23591.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org