Hello, I have a very specific question about how to search between particular lines of a log file. I did some research, and what I learned is that once a shuffle operation has been applied to an RDD, there is no way to "reconstruct" the sequence of lines (except by zipping with an id). I'm looking for any useful approaches/workarounds other developers use to solve this problem.
Here is a sample: I have log4j log files where, for each request/transaction, specific BEGIN and END transaction markers are printed. Somewhere in between, other classes may report useful statistics that need to be parsed, and unfortunately there is no way to keep the transaction ID with those records. What is the best approach to link a transaction with a particular line between the BEGIN and END markers? Assume only the timestamp and thread name are available:

2015-01-01 20:00:00 DEBUG className [Thread-0] - BEGIN TransactionID=AA000000001
2015-01-01 20:00:00 DEBUG className [Thread-0] - ... {some other logs}
2015-01-01 20:00:01 DEBUG className [Thread-0] - SQL execution time: 500ms
2015-01-01 20:00:02 DEBUG className [Thread-0] - ... {some other logs}
2015-01-01 20:00:05 DEBUG className [Thread-0] - END

In the end, I want to get the result: transaction ID AA000000001 with SQL execution time 500ms. Another good example would be extracting a Java stack trace from the logs, where the stack trace lines have no key strings (timestamp, thread id) at all to parse by.

So far I've come up with one "idea" and one approach:

1) Find the file and position of the BEGIN line and run a separate non-Spark process to parse it line by line. Here the question is: what is the best way to know which file a line belongs to, and at what position? Is zipWithUniqueId helpful for that? I'm not sure it's really effective, or whether it can help find the file name (or maybe the Hadoop partition).

2) Use the thread id as a key and map the BEGIN/END lines to that key. Then create another RDD with the same key, but for the "SQL execution time" line. Then left-join the two RDDs by thread id and filter by the timestamps coming from both sides, keeping only the SQL line that precedes the END line (i.e. the SQL line's timestamp is before the END line's timestamp). An approach like this becomes very confusing when more information (more lines) must be extracted between BEGIN and END.
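A variant of approach 2 that avoids the multi-RDD join: key every line by thread name, then scan each thread's lines in timestamp order, carrying the currently open transaction ID forward so that every line between BEGIN and END inherits it. Below is a minimal sketch in plain Python; in PySpark the grouping step would become rdd.groupBy(...) followed by a flatMap over each group (with the caveat that grouping pulls a whole thread's lines onto one executor). All names here (parse_line, tag_transactions) are illustrative, not from any library.

```python
import re
from itertools import groupby

# Assumed log4j layout: "<date> <time> LEVEL class [thread] - message"
LINE_RE = re.compile(r"^(\S+ \S+) \S+ \S+ \[([^\]]+)\] - (.*)$")
BEGIN_RE = re.compile(r"BEGIN TransactionID=(\S+)")

def parse_line(line):
    """Split a log line into (timestamp, thread, message), or None."""
    m = LINE_RE.match(line)
    return m.groups() if m else None

def tag_transactions(lines):
    """Scan one thread's lines in timestamp order, carrying the open
    transaction ID forward so lines between BEGIN and END inherit it."""
    open_tx = None
    for ts, thread, msg in sorted(lines):  # tuples sort by timestamp first
        begin = BEGIN_RE.search(msg)
        if begin:
            open_tx = begin.group(1)       # BEGIN opens a transaction
        elif msg.startswith("END"):
            open_tx = None                 # END closes it
        elif open_tx is not None:
            yield open_tx, msg             # inner line inherits the ID

raw = [
    "2015-01-01 20:00:00 DEBUG className [Thread-0] - BEGIN TransactionID=AA000000001",
    "2015-01-01 20:00:01 DEBUG className [Thread-0] - SQL execution time: 500ms",
    "2015-01-01 20:00:05 DEBUG className [Thread-0] - END",
]

parsed = [p for p in map(parse_line, raw) if p]
parsed.sort(key=lambda t: t[1])  # itertools.groupby needs sorted input
result = [
    (tx, msg)
    for _, grp in groupby(parsed, key=lambda t: t[1])  # per-thread groups
    for tx, msg in tag_transactions(list(grp))
    if "SQL execution time" in msg
]
# result == [("AA000000001", "SQL execution time: 500ms")]
```

The per-thread scan generalizes cleanly: extracting several kinds of lines per transaction is just more yield cases, rather than one extra RDD and join per line type.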
Are there any recommendations for handling cases like this? Thank you, Sergey

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Linear-search-between-particular-log4j-log-lines-tp23773.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.