This is an automated email from the ASF dual-hosted git repository. mergebot-role pushed a commit to branch mergebot in repository https://gitbox.apache.org/repos/asf/beam-site.git
commit 5e7f1b2ccc3a04ccad148d25ea6204badd1c2e85
Author: Niels Basjes <ni...@basjes.nl>
AuthorDate: Thu Apr 19 13:44:09 2018 +0200

    Document Java extensions for parsing Apache HTTPD logfiles and Useragent strings
---
 src/documentation/sdks/java-extensions.md | 182 ++++++++++++++++++++++++++++++
 1 file changed, 182 insertions(+)

diff --git a/src/documentation/sdks/java-extensions.md b/src/documentation/sdks/java-extensions.md
index 7742345..3b1524f 100644
--- a/src/documentation/sdks/java-extensions.md
+++ b/src/documentation/sdks/java-extensions.md
@@ -58,3 +58,185 @@ PCollection<KV<String, Iterable<KV<String, Integer>>>> groupedAndSorted =
     grouped.apply(
         SortValues.<String, String, Integer>create(BufferedExternalSorter.options()));
 ```
+
+## Parsing Apache HTTPD and NGINX Access log files.
+
+The Apache HTTPD webserver creates logfiles that contain valuable information about the requests made to
+this webserver. The format of these logfiles is a configuration option in the Apache HTTPD server, so parsing them
+into useful data elements is normally very hard to do.
+
+To solve this problem in an easy way, a library was created that works in combination with Apache Beam.
+
+The basic idea is that you should be able to construct a parser by simply
+telling it which configuration options were used to write the line.
+
+### Basic usage
+Full documentation can be found at [https://github.com/nielsbasjes/logparser](https://github.com/nielsbasjes/logparser).
+
+First, put something like this in your pom.xml file:
+
+    <dependency>
+        <groupId>nl.basjes.parse.httpdlog</groupId>
+        <artifactId>httpdlog-parser</artifactId>
+        <version>5.0</version>
+    </dependency>
+
+Check [https://github.com/nielsbasjes/logparser](https://github.com/nielsbasjes/logparser) for the latest version.
+
+Assume we have a logformat variable that looks something like this:
+
+    String logformat = "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"";
+
+**Step 1: What CAN we get from this line?**
+
+To figure out what values we CAN get from this line we instantiate the parser with a dummy class
+that does not have ANY @Field annotations or setters. The "Object" class will do just fine for this purpose.
+
+    Parser<Object> dummyParser = new HttpdLoglineParser<Object>(Object.class, logformat);
+    List<String> possiblePaths = dummyParser.getPossiblePaths();
+    for (String path: possiblePaths) {
+        System.out.println(path);
+    }
+
+You will get a list that looks something like this:
+
+    IP:connection.client.host
+    NUMBER:connection.client.logname
+    STRING:connection.client.user
+    TIME.STAMP:request.receive.time
+    TIME.DAY:request.receive.time.day
+    TIME.MONTHNAME:request.receive.time.monthname
+    TIME.MONTH:request.receive.time.month
+    TIME.YEAR:request.receive.time.year
+    TIME.HOUR:request.receive.time.hour
+    TIME.MINUTE:request.receive.time.minute
+    TIME.SECOND:request.receive.time.second
+    TIME.MILLISECOND:request.receive.time.millisecond
+    TIME.ZONE:request.receive.time.timezone
+    HTTP.FIRSTLINE:request.firstline
+    HTTP.METHOD:request.firstline.method
+    HTTP.URI:request.firstline.uri
+    HTTP.QUERYSTRING:request.firstline.uri.query
+    STRING:request.firstline.uri.query.*
+    HTTP.PROTOCOL:request.firstline.protocol
+    HTTP.PROTOCOL.VERSION:request.firstline.protocol.version
+    STRING:request.status.last
+    BYTESCLF:response.body.bytes
+    HTTP.URI:request.referer
+    HTTP.QUERYSTRING:request.referer.query
+    STRING:request.referer.query.*
+    HTTP.USERAGENT:request.user-agent
+
+Now some of these lines contain a * .
+This is a wildcard that can be replaced with any 'name' if you need a specific value.
+You can also leave the '*' and get everything that is found in the actual log line.
+
+**Step 2: Create the receiving POJO**
+
+We need to create the receiving record class, which is simply a POJO that does not need any interface or inheritance.
+In this class we create setters that will be called when the specified field has been found in the line.
+
+So we can now add to this class a setter that simply receives a single value as specified using the @Field annotation:
+
+    @Field("IP:connection.client.host")
+    public void setIP(final String value) {
+        ip = value;
+    }
+
+If we really want the name of the field we can also do this:
+
+    @Field("STRING:request.firstline.uri.query.img")
+    public void setQueryImg(final String name, final String value) {
+        results.put(name, value);
+    }
+
+This latter form is very handy because this way we can obtain all values for a wildcard field:
+
+    @Field("STRING:request.firstline.uri.query.*")
+    public void setQueryStringValues(final String name, final String value) {
+        results.put(name, value);
+    }
+
+Instead of using the annotations on the setters we can also simply tell the parser the name of the setter that must be
+called when an element is found:
+
+    parser.addParseTarget("setIP",                  "IP:connection.client.host");
+    parser.addParseTarget("setQueryImg",            "STRING:request.firstline.uri.query.img");
+    parser.addParseTarget("setQueryStringValues",   "STRING:request.firstline.uri.query.*");
+
+### Using this in Apache Beam
+
+Assuming we have a String (the full log line) coming in and an instance of the WebEvent class coming out
+(where WebEvent already has the needed setters), the final code when using this in an Apache Beam project
+will end up looking something like this:
+```
+    PCollection<WebEvent> filledWebEvents = input
+      .apply("Extract Elements from logline",
+        ParDo.of(new DoFn<String, WebEvent>() {
+          private Parser<WebEvent> parser;
+
+          @Setup
+          public void setup() throws NoSuchMethodException {
+            parser = new HttpdLoglineParser<>(WebEvent.class, getLogFormat());
+            parser.addParseTarget("setIP",                  "IP:connection.client.host");
+            parser.addParseTarget("setQueryImg",            "STRING:request.firstline.uri.query.img");
+            parser.addParseTarget("setQueryStringValues",   "STRING:request.firstline.uri.query.*");
+          }
+
+          @ProcessElement
+          public void processElement(ProcessContext c) throws InvalidDissectorException, MissingDissectorsException, DissectionFailure {
+            c.output(parser.parse(c.element()));
+          }
+        }));
+
+```
+
+
+## Analyzing the Useragent string
+
+This is a Java library that tries to parse and analyze the useragent string and extract as many relevant attributes as possible.
+
+### Getting the Beam UDF
+You can get the prebuilt UDF from Maven Central.
+If you use a Maven-based project, simply add this dependency to your Apache Beam application.
+
+    <dependency>
+        <groupId>nl.basjes.parse.useragent</groupId>
+        <artifactId>yauaa-beam</artifactId>
+        <version>4.2</version>
+    </dependency>
+
+Check https://github.com/nielsbasjes/yauaa for the latest version.
+
+### Example usage
+Assume you have a PCollection with your records.
+In most cases I see (clickstream data) these records (in this example this class is called "WebEvent")
+contain the useragent string in a field, and the parsed results must be added to these records.
+
+Now you must do two things:
+
+  1) Determine the names of the fields you need.
+  2) Add an instance of the (abstract) UserAgentAnalysisDoFn function and implement the functions as shown in the example below. Use the YauaaField annotation to mark the setter for each requested field.
+
+Note that the names of the two setters are not important; the system looks at the annotations.
+
+    .apply("Extract Elements from Useragent",
+        ParDo.of(new UserAgentAnalysisDoFn<WebEvent>() {
+            @Override
+            public String getUserAgentString(WebEvent record) {
+                return record.useragent;
+            }
+
+            @SuppressWarnings("unused")
+            @YauaaField("DeviceClass")
+            public void setDC(WebEvent record, String value) {
+                record.deviceClass = value;
+            }
+
+            @SuppressWarnings("unused")
+            @YauaaField("AgentNameVersion")
+            public void setANV(WebEvent record, String value) {
+                record.agentNameVersion = value;
+            }
+        }));
+
--
To stop receiving notification emails like this one, please contact
mergebot-r...@apache.org.