Hi, I have a cluster of 4 machines, and I call:

    sc.wholeTextFiles("/x/*/*.txt")

Folder x contains subfolders, and each subfolder contains thousands of files, with a total of ~1 million files matching the path expression. My Spark job starts processing the files, but single-threaded: in the Spark UI I can see that only 1 executor out of 4 is used, and only 1 thread out of the configured 24. I submit the job with:

    spark-submit --class com.stratified.articleids.NxmlExtractorJob \
      --driver-memory 8g \
      --executor-memory 8g \
      --num-executors 4 \
      --executor-cores 16 \
      --master yarn-cluster \
      --conf spark.akka.frameSize=128 \
      $JAR

My actual code is:

    val rdd = extractIds(sc.wholeTextFiles(xmlDir))
    rdd.saveAsObjectFile(serDir)

Is saveAsObjectFile causing this, and are there any workarounds?

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/wholeTextFiles-x-txt-runs-single-threaded-tp23591.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
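(For what it's worth, `wholeTextFiles` accepts an optional `minPartitions` argument, and the resulting RDD can also be repartitioned before the expensive stage. The sketch below is only a hedged illustration of that idea, assuming the default partitioning is what collapses the work onto one task; `extractIds`, the directories, and the partition count 64 are placeholders, not a tested implementation.)

    import org.apache.spark.rdd.RDD
    import org.apache.spark.{SparkConf, SparkContext}

    // Hypothetical job skeleton based on the post; not the poster's actual code.
    object NxmlExtractorJob {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("NxmlExtractorJob"))

        val xmlDir = "/x/*/*.txt"   // placeholder input path
        val serDir = "/out/ids"     // placeholder output path

        // Ask for more input partitions up front: with many small files,
        // wholeTextFiles can otherwise produce very few partitions.
        val files = sc.wholeTextFiles(xmlDir, minPartitions = 64)

        // If the partition count is still too low, repartition explicitly
        // so every configured executor core gets work.
        val rdd = extractIds(files.repartition(64))
        rdd.saveAsObjectFile(serDir)
      }

      // Placeholder for the poster's extractIds; signature assumed.
      def extractIds(files: RDD[(String, String)]): RDD[String] =
        files.map { case (path, _) => path }
    }
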