I am running a grep-style application on Spark 2.3.4 with Scala 2.11. The input is an 813 MB text file stored on a remote HDFS cluster (not part of the Spark infrastructure). The application reads the file line by line from the HDFS server, filters each line for a given keyword, and prints the matching lines, like grep on Linux. HDFS splits the file into 128 MB blocks, so the job runs as 7 tasks in a single stage (stage 0).

I want to analyze the time Spark spends per task in the compute function of HadoopRDD. To that end, I record and log every time HadoopRDD's compute, read, updateRecords, or updateBytesRead is called, and also whenever the compute of the filter RDD (a MapPartitionsRDD) and the filter function Spark builds are called.

What I observe is that the MapPartitionsRDD, which is the child RDD, has its compute and filter function called first; after HadoopRDD's compute is called, I never again see a compute or filter log entry from the MapPartitionsRDD. But Spark cannot filter data before reading it, so the filtering must happen after a read operation.

Does the filter run on each record as it is read, or only once the whole text-file chunk has been read? Also, how can I separate the timing information of the two, i.e. know exactly when the first MapPartitionsRDD operation was performed? Any help is appreciated.
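For context, the application is essentially the following (a minimal sketch, not my exact code; the path and keyword are placeholders):

    import org.apache.spark.{SparkConf, SparkContext}

    object SparkGrep {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("SparkGrep"))

        // textFile() creates a HadoopRDD of (offset, line) records and maps it to Strings
        val lines = sc.textFile("hdfs://namenode:8020/data/input.txt")

        // filter() wraps that in a MapPartitionsRDD; nothing executes yet (lazy)
        val matches = lines.filter(_.contains("keyword"))

        // The action triggers stage 0 with one task per 128 MB HDFS split (7 here)
        matches.foreach(println)

        sc.stop()
      }
    }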
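My current reading of the Spark 2.3 source is that MapPartitionsRDD.compute does not materialize anything itself; it only wraps the parent's iterator:

    // org.apache.spark.rdd.MapPartitionsRDD (Spark 2.3.x), abbreviated
    override def compute(split: Partition, context: TaskContext): Iterator[U] =
      f(context, split.index, firstParent[T].iterator(split, context))

If that is right, compute is called once per partition, returns a lazy iterator, and each record is then read from the HadoopRDD and filtered one at a time as the task consumes the iterator, which would explain why I see the child's compute logged first and never again. Is that correct?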
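As a workaround for separating the two, I am considering timestamping the boundary myself with mapPartitionsWithIndex (again a sketch, names are mine):

    // Log when the partition's iterator is created and when the first record
    // is actually pulled through it by the downstream filter
    val instrumented = lines.mapPartitionsWithIndex { (idx, iter) =>
      System.err.println(s"partition $idx: iterator created at ${System.nanoTime()}")
      var first = true
      iter.map { line =>
        if (first) {
          System.err.println(s"partition $idx: first record pulled at ${System.nanoTime()}")
          first = false
        }
        line
      }
    }
    val timedMatches = instrumented.filter(_.contains("keyword"))

Would that reliably mark when the first MapPartitionsRDD operation happens, or is there a better hook?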
Thanks