Hello, I have a very specific question about how to search between particular lines of a log file. I did some research, and what I learned is that once a shuffle operation has been applied to an RDD, there is no way to "reconstruct" the sequence of lines (except by zipping with an id). I'm looking for any useful approaches/workarounds other developers use to solve this problem.
Here is a sample: I have log4j log files where, for each request/transaction, specific BEGIN and END transaction markers are printed. Somewhere in between, other classes may report useful statistics that need to be parsed, and unfortunately there is no way to keep the transaction ID with those records. What is the best approach to link a transaction with a particular line between the BEGIN and END markers? Assume only the timestamp and thread name are available:

2015-01-01 20:00:00 DEBUG className [Thread-0] - BEGIN TransactionID=AA000000001
2015-01-01 20:00:00 DEBUG className [Thread-0] - ... {some other logs}
2015-01-01 20:00:01 DEBUG className [Thread-0] - SQL execution time: 500ms
2015-01-01 20:00:02 DEBUG className [Thread-0] - ... {some other logs}
2015-01-01 20:00:05 DEBUG className [Thread-0] - END

In the end, I want to get the result: transaction ID AA000000001 with SQL execution time 500ms. Another good example would be extracting a Java stack trace from the logs, where the stack trace lines have no key strings (timestamp, thread id) at all to parse by.

So far I've come up with one "idea" and one approach:

1) Find the file and position of the BEGIN line and run a separate non-Spark process to parse it line by line. Here the question is: what is the best way to know which file a line belongs to, and at what position? Is zipWithUniqueId helpful for that? I'm not sure it's really effective, or whether it can help find the file name (or maybe the Hadoop partition).

2) Use the thread id as a key and map the BEGIN/END lines to that key. Then create another RDD with the same key, but for the "SQL execution time" line. Then left-join the two RDDs by thread id and filter by the timestamps coming from both sides, keeping only the SQL line that precedes the END line (i.e. the SQL line's timestamp is before the END line's timestamp). An approach like this becomes very confusing when more information (more lines) must be extracted between BEGIN and END.
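A variant of approach 2 that avoids the multi-RDD join: key every line by thread name, then scan each thread's lines in timestamp order, carrying the currently open transaction ID forward so that every line between BEGIN and END inherits it. Below is a minimal sketch in plain Python; in PySpark the grouping step would become rdd.groupBy(...) followed by a flatMap over each group (with the caveat that grouping pulls a whole thread's lines onto one executor). All names here (parse_line, tag_transactions) are illustrative, not from any library.

```python
import re
from itertools import groupby

# Assumed log4j layout: "<date> <time> LEVEL class [thread] - message"
LINE_RE = re.compile(r"^(\S+ \S+) \S+ \S+ \[([^\]]+)\] - (.*)$")
BEGIN_RE = re.compile(r"BEGIN TransactionID=(\S+)")

def parse_line(line):
    """Split a log line into (timestamp, thread, message), or None."""
    m = LINE_RE.match(line)
    return m.groups() if m else None

def tag_transactions(lines):
    """Scan one thread's lines in timestamp order, carrying the open
    transaction ID forward so lines between BEGIN and END inherit it."""
    open_tx = None
    for ts, thread, msg in sorted(lines):  # tuples sort by timestamp first
        begin = BEGIN_RE.search(msg)
        if begin:
            open_tx = begin.group(1)       # BEGIN opens a transaction
        elif msg.startswith("END"):
            open_tx = None                 # END closes it
        elif open_tx is not None:
            yield open_tx, msg             # inner line inherits the ID

raw = [
    "2015-01-01 20:00:00 DEBUG className [Thread-0] - BEGIN TransactionID=AA000000001",
    "2015-01-01 20:00:01 DEBUG className [Thread-0] - SQL execution time: 500ms",
    "2015-01-01 20:00:05 DEBUG className [Thread-0] - END",
]

parsed = [p for p in map(parse_line, raw) if p]
parsed.sort(key=lambda t: t[1])  # itertools.groupby needs sorted input
result = [
    (tx, msg)
    for _, grp in groupby(parsed, key=lambda t: t[1])  # per-thread groups
    for tx, msg in tag_transactions(list(grp))
    if "SQL execution time" in msg
]
# result == [("AA000000001", "SQL execution time: 500ms")]
```

The per-thread scan generalizes cleanly: extracting several kinds of lines per transaction is just more yield cases, rather than one extra RDD and join per line type.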
Are there any recommendations for handling cases like this? Thank you, Sergey

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Linear-search-between-particular-log4j-log-lines-tp23773.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.