#22983: Add a Descriptor subinterface and implementation for Tor web server logs -----------------------------+----------------------------------- Reporter: iwakeh | Owner: metrics-team Type: enhancement | Status: needs_review Priority: High | Milestone: metrics-lib 2.2.0 Component: Metrics/Library | Version: Severity: Normal | Resolution: Keywords: metrics-2018 | Actual Points: Parent ID: | Points: Reviewer: | Sponsor: -----------------------------+-----------------------------------
Comment (by iwakeh): Replying to [comment:58 karsten]: > ... > > You only need to look at the latest four commits. The others are reviewed already (comment:47). > > Oh. That information would have saved some time this morning. That also explains why I have just one remark on the earlier commits and a couple more on the first four! Hrmm, in my request for review I did mention: ... The additional commits should address all topics from the comments above starting at comment:47. ... Maybe, too subtle for 'please re-read the comments starting at comment:47 unless you memorized them earlier' ;-) > > ... > There are two types of changes here: > > 1. Changes in the review process that require knowing the branch under review. Ideally, we'd use "fixup" or "squash" commits for all commits made during the review process. Alternatively, we can agree that commits will be squashed prior to merge, which works for me, too. > 2. Changes in the merge process that only require knowing master before the merge. Somebody who didn't follow the review process, including our future selfs, should see changes where we're not going forth and back throughout commit series to achieve something that we could do in one step. Sounds ok. > > ... > Anyway, here's what I found in your branch: > > - `AccessLogLineSanVal` contains an explicit mention of `"FTP"`. Does this mean we're considering FTP log lines to be valid? Why FTP in particular and no other protocols? Or should we take that out and restrict ourselves to HTTP? Not feeling strongly about this one. Please also refer to comments 50 to 52. The main issue was to not only declare lines as valid that resemble sanitized lines, but be more general. Thus, allowing different protocols could be part of that, but I don't insist. Which log lines should be considered valid? How much can log lines differ from sanitized log lines and still be considered valid? The current code only sketches one possibility, e.g. allowing different protocols, various ip addresses, and other request types. Before making any more code changes there should be at least a list of valid and invalid lines or a description of either state. > - Further down below we're comparing `referenceDate` to `extractedDate` and discarding lines where the two don't match. Not sure if this is problematic, but I'm at least flagging it here as potential issue. If this is a non-issue, feel free to ignore this comment. The variable `extractedDate` refers to the date extracted from the log line and reference date is the log files date (see [https://metrics.torproject.org/web-server-logs.html#n-re-assembling-log- files spec section 4.3]). As discarding is only performed during sanitization this is ok. The lines would be considered valid. > - A few lines further down we're formatting a date using `DateTimeFormatter.ofPattern()`. Is that expensive? Would we be able to create the formatter just once and re-use it here? Not trying to optimize prematurely, but trying to avoid shooting ourselves in the foot. Makes sense. (1) > - Logging an error in case of catching a `Throwable` might become very noisy. After all, we're processing third-party input here, and we can typically proceed without parsing that line. The debug level seemed like the better choice here. Oh, that's an oversight. (2) > - In `WebServerAccessLog`, what exactly is the log date? Is it the date of contained requests, is it the date when the file was written to disk, or what is it? This is still unclear in the docs. [https://gitweb.torproject.org/user/iwakeh/metrics- lib.git/tree/src/main/java/org/torproject/descriptor/WebServerAccessLog.java?h=task-22983-4&id=5dbbeb0ac38728f09718c80e295e8524d92d2e68#n19 Javadoc] says: `Returns the date of the log as <code>LocalDate</code>.` and the [https://metrics.torproject.org/web-server-logs.html#n-re- assembling-log-files spec 4.3] says: `Rewritten log lines are re-assembled into sanitized log files based on physical host, virtual host, and request start date.` Where would a clarification be needed? > > When these remaining issues are clearer to me, do you mind me editing your branch and preparing one or more "fixup"/"squash" commits? There are still a few things that I'd like to document differently or where I'd like to make the new code more similar to existing code (in the good sense). You could still go through them and talk me out of making those changes. Sound okay? Sure, go ahead. Would you also make the tiny adaptions from (1) and (2), or should I add these to the current branch first? > > Oh, and here's another remark for future reviews: let's try to reduce the time for reviewing and for revising a branch to a few days. Neither reviewing nor revising is fun, but it's even less fun when doing it several weeks later. I know this is a special case with the holiday break, and I'm not blaming you any more than I'm blaming myself. Just an idea to make reviews a little more fun. :) The holidays were not the problem here. I think tasks like adding a new module especially those spanning two or more products should be treated in a more formal/structured approach in future. There were several iterations that were not coding or (technical) design questions, but actually only concerned the specification (not only the one published on metrics.tp.o, but the definition of the wanted functionality for the webstats module itself). Even now we still have an iteration on wanted functionality, i.e., which log lines should be considered valid. These discussions for clarifying wanted functionality should happen before writing code. The current approach produces code that is written, reviewed, tossed, and rewritten only because of not having a clear picture of the wanted functionality beforehand (also thinking of CollecTor's changes here). Something we should have in mind when adding other modules or making major changes in future. -- Ticket URL: <https://trac.torproject.org/projects/tor/ticket/22983#comment:59> Tor Bug Tracker & Wiki <https://trac.torproject.org/> The Tor Project: anonymity online
_______________________________________________ tor-bugs mailing list tor-bugs@lists.torproject.org https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs