Spark transformations are lazy, so none of them actually do anything until an
action is encountered. And no, your snippet will NOT read the file multiple
times; as written it triggers no read at all. Each action does re-evaluate its
lineage from the file, though, so if you plan to run actions on both filtered
RDDs, call cache() on the base RDD and the first read will be reused.
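
A minimal sketch (the predicates and counts are just placeholders):

val file = sc.textFile("log.txt")               // lazy: builds a plan, reads nothing
val errors = file.filter(_.contains("ERROR"))   // still lazy, no I/O yet
val warns  = file.filter(_.contains("WARN"))    // likewise

file.cache()                  // keep the base RDD in memory after its first scan
val nErrors = errors.count()  // first action: log.txt is read once and cached
val nWarns  = warns.count()   // computed from the cache, no second read of the file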

On Tue, Sep 15, 2015 at 7:33 PM, 328d95 <20500...@student.uwa.edu.au> wrote:

> I am trying to read logs that contain many irrelevant lines and whose
> remaining lines are related to one another by a thread number appearing in
> each line.
>
> Firstly, if I read from a text file using the textFile function and then
> call multiple filter functions on the resulting RDD, will Spark apply all
> of the filters in one pass over the file?
>
> E.g. will the second filter incur another read of log.txt?
> val file = sc.textFile("log.txt")
> val test = file.filter(someCondition)
> val test1 = file.filter(someOtherCondition)
>
> Secondly, if there are multiple reads, I was thinking that I could apply a
> filter that gets rid of all of the lines that I do not need and cache the
> result in a PairRDD. From that PairRDD I would need to remove keys that
> appear only once; is there a recommended strategy for this? I was thinking
> about using distinct to create another PairRDD and then using subtract, but
> this seems inefficient.
>
> Thanks
>
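
For your second question: rather than building a second RDD with distinct and
subtracting it, you can count occurrences per key with reduceByKey and join
the surviving keys back. A rough sketch, assuming (threadId, line) pairs; the
parsing below is only a placeholder:

val pairs = sc.textFile("log.txt")
  .filter(_.contains("thread"))              // placeholder relevance filter
  .map(line => (line.split(" ")(0), line))   // placeholder: first token as the thread id
  .cache()

val counts  = pairs.mapValues(_ => 1L).reduceByKey(_ + _)  // (threadId, occurrences)
val repeats = counts.filter { case (_, n) => n > 1 }       // keys seen at least twice
val kept    = pairs.join(repeats)                          // (threadId, (line, count))
  .mapValues { case (line, _) => line }                    // back to (threadId, line)

This costs two shuffles (the reduceByKey and the join) but never buffers all
of a key's lines in memory the way groupByKey would, and it avoids the
distinct plus subtract round trip.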


-- 
Best Regards,
Ayan Guha
