I am trying to read logs that contain many irrelevant lines, where related lines share a thread number.
Firstly, if I read a text file using the textFile function and then call multiple filter functions on the resulting RDD, will Spark apply all of the filters in one read pass? E.g., will the second filter incur another read of log.txt?

    val file = sc.textFile("log.txt")
    val test = file.filter(some condition)
    val test1 = file.filter(some other condition)

Secondly, if there are multiple reads, I was thinking I could apply one filter that removes all of the lines I do not need and cache the result in a PairRDD. From that PairRDD I would then need to remove keys that appear only once; is there a recommended strategy for this? I was thinking about using distinct to create another PairRDD and then using subtract, but this seems inefficient.

Thanks

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Relational-Log-Data-tp24696.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
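On the first question: RDD transformations are lazy, so filter by itself reads nothing, but each *action* (count, collect, saveAsTextFile, ...) re-runs the whole lineage of an uncached RDD, including the textFile scan. So without cache(), acting on test and then on test1 will scan log.txt twice. A plain-Scala sketch of that behavior (no Spark here; readLog and the sample lines are hypothetical stand-ins for reading log.txt) that counts source reads:

```scala
object ReadPassDemo {
  // Count how many times the "source" is consulted for a given access pattern.
  def passes(useCache: Boolean): Int = {
    var reads = 0
    // Hypothetical in-memory log; stands in for scanning log.txt.
    def readLog(): Seq[String] = {
      reads += 1
      Seq("thread-1 ERROR disk full", "thread-2 WARN slow fetch", "thread-1 INFO done")
    }
    if (useCache) {
      // Like sc.textFile("log.txt").cache(): materialize once, filter in memory.
      val cached = readLog()
      cached.filter(_.contains("ERROR"))
      cached.filter(_.contains("WARN"))
    } else {
      // Like running one action per filtered RDD on an uncached file:
      // each pipeline re-reads the source.
      readLog().filter(_.contains("ERROR"))
      readLog().filter(_.contains("WARN"))
    }
    reads
  }

  def main(args: Array[String]): Unit = {
    println(s"uncached read passes: ${passes(useCache = false)}")  // 2
    println(s"cached read passes: ${passes(useCache = true)}")     // 1
  }
}
```

In Spark terms, val file = sc.textFile("log.txt").cache() makes the second and later actions filter the in-memory copy instead of re-reading the file.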
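On the second question: rather than distinct followed by subtract, the usual strategy is to count occurrences per key with reduceByKey, filter the counts, and join back. A sketch, with the Spark version in the comments (pairs is a hypothetical RDD[(String, String)]) and a runnable plain-Scala-collections mirror of the same idea:

```scala
object RepeatedKeysDemo {
  // Keep only pairs whose key occurs more than once.
  // Spark sketch, assuming pairs: RDD[(String, String)]:
  //   val counts   = pairs.mapValues(_ => 1).reduceByKey(_ + _)
  //   val repeated = pairs.join(counts.filter(_._2 > 1)).mapValues { case (v, _) => v }
  // The same idea on Scala collections:
  def keepRepeated[K, V](pairs: Seq[(K, V)]): Seq[(K, V)] = {
    // reduceByKey analogue: occurrences per key
    val counts = pairs.groupBy(_._1).map { case (k, vs) => k -> vs.size }
    // join-and-filter analogue: drop pairs whose key appears only once
    pairs.filter { case (k, _) => counts(k) > 1 }
  }

  def main(args: Array[String]): Unit = {
    val pairs = Seq("t1" -> "start", "t2" -> "lone line", "t1" -> "end")
    println(keepRepeated(pairs))  // List((t1,start), (t1,end))
  }
}
```

This makes two passes over the cached PairRDD (one to count, one to join) instead of the distinct-plus-subtract shuffle chain, and it never materializes a second copy of the full data set.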