On 7 Apr 2017, at 15:32, Alvaro Brandon <alvarobran...@gmail.com> wrote:
> I was going through SparkContext.textFile() and I was wondering at what point Spark communicates with HDFS. Since you specify the Hadoop version when you download the Spark binaries, I'm guessing it has its own client that calls HDFS wherever you specify it in the configuration files.

It uses the hadoop-hdfs JAR in the spark-assembly JAR or the lib dir under SPARK_HOME. Nobody would ever want to write their own HDFS client, not if you look at the bit of the code related to Kerberos. For webhdfs:// you could, though it's not done here.

> The goal is to instrument and log all the calls that Spark makes to HDFS. Which class or classes perform these operations?

org.apache.hadoop.hdfs.DistributedFileSystem

Take a look at HTrace here: https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-common/Tracing.html
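
If full tracing is more than you need, a cheaper first step might be to raise the log level for the HDFS client classes. The fragment below is only a sketch. It assumes a Spark build that uses log4j 1.x and reads conf/log4j.properties, and that the same file reaches the executors as well as the driver. DEBUG output from the client is whatever the Hadoop code chooses to log, so it will not necessarily record every call the way a tracer would.

```
# conf/log4j.properties (assumed location; adjust to your deployment)
# Raise logging for the HDFS client package, which contains
# org.apache.hadoop.hdfs.DistributedFileSystem.
log4j.logger.org.apache.hadoop.hdfs=DEBUG
```

This is noisy on a busy job, so it is probably best enabled only while investigating.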