Thank you Brock, Devin and Jimmy for the great information. Dropping null values in the EventSerializer write method looks really easy to do, but I think using a custom interceptor to validate and then tag the event for good/bad routing sounds like a great idea, and it seems to fit the Flume way of doing things better.
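As a rough illustration of the validate-and-tag step (class, method, and header names below are made up, and the brace-balancing check is only a stand-in for a real JSON parser such as Jackson -- a real Flume interceptor would also implement org.apache.flume.interceptor.Interceptor):

```java
import java.util.Map;

/**
 * Sketch of the header-tagging idea from this thread. A selector or the
 * HDFS sink can later route on the "validation" header this sets.
 */
public class ValidatingTagger {

    static final String HEADER = "validation"; // hypothetical header key

    /** Crude stand-in for real JSON validation (use Jackson in practice). */
    static boolean isLikelyValidJson(String body) {
        if (body == null) return false;
        String s = body.trim();
        if (!(s.startsWith("{") && s.endsWith("}"))) return false;
        int depth = 0;
        boolean inString = false;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (inString) {
                if (c == '\\') i++;             // skip the escaped character
                else if (c == '"') inString = false;
            } else if (c == '"') {
                inString = true;
            } else if (c == '{' || c == '[') {
                depth++;
            } else if (c == '}' || c == ']') {
                if (--depth < 0) return false;  // unbalanced close
            }
        }
        return depth == 0 && !inString;
    }

    /** Tag the event's headers so routing can key off them downstream. */
    static Map<String, String> tag(Map<String, String> headers, String body) {
        headers.put(HEADER, isLikelyValidJson(body) ? "good" : "bad");
        return headers;
    }
}
```

The event body itself is left untouched; only the header changes, so the bad flow still receives the original malformed payload for inspection.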
Thank you again!

~Ed

On Fri, Jan 3, 2014 at 2:40 AM, Devin Suiter RDX <[email protected]> wrote:

> Yes, the regex interceptors and selectors can be very powerful - experimenting with them was really exciting.
>
> Brock, thanks for validating the ML idea - as with most things, the simplest solution is probably the way to go, and in this use case, the morphlines might be overkill.
>
> *Devin Suiter*
> Jr. Data Solutions Software Engineer
> 100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
> Google Voice: 412-256-8556 | www.rdx.com
>
> On Thu, Jan 2, 2014 at 12:27 PM, Brock Noland <[email protected]> wrote:
>
>> Jimmy, great to hear that method is working for you!
>>
>> Devin, regarding the morphlines question: since ML can have arbitrary Java plugins it *can* do just about anything. I generally think of ML as the T in ETL. Doing the validation in ML might make sense. In general, though, I think adding the custom header field is probably the best option for dealing with bad data.
>>
>> Once again, thank you everyone for using our software!
>>
>> On Thu, Jan 2, 2014 at 10:10 AM, Jimmy <[email protected]> wrote:
>>
>>> We are doing a similar thing to what Brock mentioned - a simple interceptor for JSON validation that updates a custom field in the header; the Flume HDFS sink then pushes the data to a good/bad target directory based on this custom field... then a separate process watches the bad directory.
>>>
>>> You could add notification to the Flume flow; we wanted to keep it very simple.
>>>
>>> ---------- Forwarded message ----------
>>> From: Devin Suiter RDX <[email protected]>
>>> Date: Thu, Jan 2, 2014 at 7:40 AM
>>> Subject: Re: Handling malformed data when using custom AvroEventSerializer and HDFS Sink
>>> To: [email protected]
>>>
>>> Just throwing this out there, since I haven't had time to dig into the API with a big fork, but can morphlines offer any assistance here?
>>> Some kind of an interceptor that would parse for malformed data, package the offending data and send it somewhere (email it, log it), and then project a valid "there was something wrong here" piece of data into the field, then allow your channel to carry on? Or skip the projection piece and just move along? I was just thinking that projecting known data into a field that previously held malformed data would let you easily locate those records later, while keeping your data shape consistent.
>>>
>>> Kind of looking to Brock as a sounding board on the appropriateness of this as a potential solution, since morphlines takes some time to really understand well...
>>>
>>> *Devin Suiter*
>>> Jr. Data Solutions Software Engineer
>>> 100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
>>> Google Voice: 412-256-8556 | www.rdx.com
>>>
>>> On Thu, Jan 2, 2014 at 10:25 AM, Brock Noland <[email protected]> wrote:
>>>
>>>> On Tue, Dec 31, 2013 at 8:34 PM, ed <[email protected]> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> We are using Flume v1.4 to load JSON formatted log data into HDFS as Avro. Our Flume setup looks like this:
>>>>>
>>>>> NXLog ==> (FlumeHTTPSource -> HDFSSink w/ custom EventSerializer)
>>>>>
>>>>> Right now our custom EventSerializer (which extends AbstractAvroEventSerializer) takes the JSON input from the HTTPSource and converts it into an Avro record of the appropriate type for the incoming log file. This is working great, and we use the serializer to add some additional "synthetic" fields to the Avro record that don't exist in the original JSON log data.
>>>>>
>>>>> My question concerns how to handle malformed JSON data (or really any error inside of the custom EventSerializer).
>>>>> It's very likely that as we parse the JSON there will be records where something is malformed (either the JSON itself, or a field of the wrong type, etc.).
>>>>>
>>>>> For example, a "port" field which should always be an Integer might for some reason have some ASCII text in it. I'd like to catch these errors in the EventSerializer and then write out the bad JSON to a log file somewhere that we can monitor.
>>>>
>>>> Yeah, it would be nice to have a better story about this in Flume.
>>>>
>>>>> What is the best way to do this?
>>>>
>>>> Typically people will either log it to a file or send it through another "flow" to a different HDFS sink.
>>>>
>>>>> Right now, all the logic for catching bad JSON would be inside of the "convert" function of the EventSerializer. Should the convert function itself throw an exception that will be gracefully handled upstream,
>>>>
>>>> The exception will be logged, but that is it.
>>>>
>>>>> or do I just return a "null" value if there was an error? Would it be appropriate to log errors directly to a database from inside the EventSerializer convert method, or would this be too slow?
>>>>
>>>> That might be too slow to do directly. If I did that, I'd have a separate thread doing it and an in-memory queue between the serializer and that thread.
>>>>
>>>>> What are the best practices for this type of error handling?
>>>>
>>>> It looks to me like we'd need to change AbstractAvroEventSerializer to filter out nulls:
>>>>
>>>> https://github.com/apache/flume/blob/trunk/flume-ng-core/src/main/java/org/apache/flume/serialization/AbstractAvroEventSerializer.java#L106
>>>>
>>>> which we could easily do. Since you don't want to wait for that, you could override the write method to do this.
>>>>
>>>>> Thank you for any assistance!
>>>>>
>>>>> Best Regards,
>>>>>
>>>>> Ed
>>>>
>>>> --
>>>> Apache MRUnit - Unit testing MapReduce - http://mrunit.apache.org
>>
>> --
>> Apache MRUnit - Unit testing MapReduce - http://mrunit.apache.org
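Brock's override-the-write-method suggestion above can be modeled outside Flume. The class below is a simplified stand-in, not the real API (a real implementation would extend AbstractAvroEventSerializer and write Avro datums); only the control flow is the point: convert() returns null for malformed input instead of throwing, and the overridden write() skips nulls rather than passing them on.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Simplified stand-in for an event serializer whose convert() may
 * return null for malformed input, and whose write() must then skip
 * the event instead of emitting a null datum.
 */
public class NullSkippingSerializer {

    private final List<String> written = new ArrayList<>();
    private final List<String> rejected = new ArrayList<>();

    /** Stand-in for convert(Event): null signals "malformed, skip me". */
    protected String convert(String body) {
        return body != null && body.trim().startsWith("{") ? body.trim() : null;
    }

    /** Overridden write(): filter out nulls instead of propagating them. */
    public void write(String body) {
        String record = convert(body);
        if (record == null) {
            rejected.add(body); // e.g. append to a bad-records log file
            return;             // skip instead of writing a null datum
        }
        written.add(record);
    }

    public List<String> written()  { return written; }
    public List<String> rejected() { return rejected; }
}
```

Collecting the rejects in memory here is just for illustration; per the thread, logging them to a file, or handing them to a separate thread via an in-memory queue, avoids slowing down the sink.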
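The good/bad routing Jimmy describes above can be done in the HDFS sink alone via header substitution in the path. A minimal agent-config sketch (the agent, source, and sink names, the com.example interceptor class, and the `validation` header key are all illustrative assumptions):

```properties
# Interceptor on the source tags each event's headers good/bad.
agent.sources.http.interceptors = v
agent.sources.http.interceptors.v.type = com.example.JsonValidatingInterceptor$Builder

# The HDFS sink substitutes the header value into the target directory,
# so "good" and "bad" events land under separate paths.
agent.sinks.hdfs1.type = hdfs
agent.sinks.hdfs1.hdfs.path = hdfs://namenode/events/%{validation}
```

A multiplexing channel selector keyed on the same header would achieve the separate-flow variant Brock mentions, at the cost of a second channel and sink.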
