On 7 Apr 2017, at 15:32, Alvaro Brandon <alvarobran...@gmail.com> wrote:

I was going through SparkContext.textFile() and I was wondering at which point 
Spark communicates with HDFS. Since when you download the Spark binaries you 
also specify the Hadoop version you will use, I'm guessing it has its own 
client that calls HDFS wherever you point it in the configuration files.



It uses the hadoop-hdfs JAR, either inside the spark-assembly JAR or in the lib 
dir under SPARK_HOME. Nobody would ever want to write their own HDFS client, not 
if you look at the bits of the code related to Kerberos. webhdfs:// you could, 
though it's not done here.
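For context, a minimal sketch of the kind of call being discussed: Spark picks 
the filesystem implementation from the URI scheme of the path you pass to 
textFile(), using the Hadoop client JARs on its classpath. The namenode 
host/port and path below are placeholders, not anything from this thread.

  import org.apache.spark.{SparkConf, SparkContext}

  object HdfsReadSketch {
    def main(args: Array[String]): Unit = {
      val conf = new SparkConf().setAppName("hdfs-read-sketch")
      val sc   = new SparkContext(conf)

      // The hdfs:// scheme is what routes this call through the Hadoop HDFS
      // client bundled with the Spark distribution. Host, port and path here
      // are made up for illustration.
      val lines = sc.textFile("hdfs://namenode:8020/user/example/input.txt")
      println(s"line count: ${lines.count()}")

      sc.stop()
    }
  }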


The goal is to instrument and log all the calls that Spark makes to HDFS. Which 
class or classes perform these operations?



org.apache.hadoop.hdfs.DistributedFileSystem

Take a look at HTrace here: 
https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-common/Tracing.html
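To see that class in action (and the object you would instrument), you can ask 
the Hadoop FileSystem API which implementation backs an hdfs:// URI; this is the 
same resolution Spark's Hadoop input formats go through. A small sketch, with a 
placeholder namenode address:

  import java.net.URI
  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.fs.FileSystem

  object WhichFileSystem {
    def main(args: Array[String]): Unit = {
      val conf = new Configuration()
      // Placeholder namenode address; in practice the default filesystem comes
      // from core-site.xml / hdfs-site.xml on the classpath.
      val fs = FileSystem.get(new URI("hdfs://namenode:8020/"), conf)
      // Prints org.apache.hadoop.hdfs.DistributedFileSystem for hdfs:// URIs
      println(fs.getClass.getName)
      fs.close()
    }
  }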



