Yes, the regex interceptors and selectors can be very powerful - experimenting with them was really exciting.
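A sketch of the kind of config I was playing with, for the archives - all
agent and component names here are made up. regex_extractor pulls a value
out of the body into a header, and a multiplexing selector steers each
event by that header:

# pull the "type" field out of the JSON body into a header named "route"
agent.sources.http1.interceptors = ext1
agent.sources.http1.interceptors.ext1.type = regex_extractor
agent.sources.http1.interceptors.ext1.regex = "type"\\s*:\\s*"(\\w+)"
agent.sources.http1.interceptors.ext1.serializers = s1
agent.sources.http1.interceptors.ext1.serializers.s1.name = route

# route on the header: known values to one channel, the rest elsewhere
agent.sources.http1.selector.type = multiplexing
agent.sources.http1.selector.header = route
agent.sources.http1.selector.mapping.access = goodChannel
agent.sources.http1.selector.default = catchallChannel

(regex_filter is the blunter cousin - it keeps or drops matching events
outright, where the selector approach keeps everything and just steers it.)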
Brock, thanks for validating the ML idea - as with most things, the
simplest solution is probably the way to go, and in this use case,
morphlines might be overkill.

*Devin Suiter*
Jr. Data Solutions Software Engineer
100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
Google Voice: 412-256-8556 | www.rdx.com


On Thu, Jan 2, 2014 at 12:27 PM, Brock Noland <[email protected]> wrote:

> Jimmy, great to hear that method is working for you!
>
> Devin, regarding the morphlines question: since ML can have arbitrary
> Java plugins, it *can* do just about anything. I generally think of ML
> as the T in ETL, so doing the validation in ML might make sense. In
> general, though, I think adding the custom header field is probably the
> best option for dealing with bad data.
>
> Once again, thank you everyone for using our software!
>
>
> On Thu, Jan 2, 2014 at 10:10 AM, Jimmy <[email protected]> wrote:
>
>> We are doing something similar to what Brock mentioned - a simple
>> interceptor for JSON validation that updates a custom field in the
>> header; the Flume HDFS sink then pushes the data to a good/bad target
>> directory based on this custom field... then a separate process watches
>> the bad directory.
>>
>> You could add notification to the Flume flow, but we wanted to keep it
>> very simple.
>>
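>> The interceptor itself is tiny. Roughly like this - not our exact code,
>> class and header names simplified, and Jackson is just what we happen
>> to use for the test-parse:
>>
>> import java.io.IOException;
>> import java.util.List;
>>
>> import org.apache.flume.Context;
>> import org.apache.flume.Event;
>> import org.apache.flume.interceptor.Interceptor;
>> import org.codehaus.jackson.map.ObjectMapper;
>>
>> public class JsonValidatingInterceptor implements Interceptor {
>>
>>   private final ObjectMapper mapper = new ObjectMapper();
>>
>>   @Override
>>   public void initialize() {}
>>
>>   @Override
>>   public Event intercept(Event event) {
>>     String status = "good";
>>     try {
>>       mapper.readTree(event.getBody()); // throws if body is not valid JSON
>>     } catch (IOException e) {
>>       status = "bad";
>>     }
>>     event.getHeaders().put("valid", status);
>>     return event; // never drop the event, just tag it
>>   }
>>
>>   @Override
>>   public List<Event> intercept(List<Event> events) {
>>     for (Event event : events) {
>>       intercept(event);
>>     }
>>     return events;
>>   }
>>
>>   @Override
>>   public void close() {}
>>
>>   public static class Builder implements Interceptor.Builder {
>>     @Override
>>     public Interceptor build() {
>>       return new JsonValidatingInterceptor();
>>     }
>>
>>     @Override
>>     public void configure(Context context) {}
>>   }
>> }
>>
>> The HDFS sink then just interpolates the header into the path, something
>> like hdfs.path = hdfs://namenode/flume/events/%{valid}, so good and bad
>> records land in separate directories.
>>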
>>
>>
>> ---------- Forwarded message ----------
>> From: Devin Suiter RDX <[email protected]>
>> Date: Thu, Jan 2, 2014 at 7:40 AM
>> Subject: Re: Handling malformed data when using custom
>> AvroEventSerializer and HDFS Sink
>> To: [email protected]
>>
>>
>> Just throwing this out there, since I haven't had time to dig into the
>> API with a big fork, but can morphlines offer any assistance here?
>>
>> Some kind of interceptor that would parse for malformed data, package up
>> the offending data and send it somewhere (email it, log it), and then
>> project a valid "there was something wrong here" piece of data into the
>> field, then allow your channel to carry on? Or skip the projection piece
>> and just move along? I was thinking that projecting known data into a
>> field that previously held malformed data would let you easily locate
>> those records later, while keeping your data shape consistent.
>>
>> Kind of looking to Brock as a sounding board on the appropriateness of
>> this as a potential solution, since morphlines takes some time to really
>> understand well...
>>
>> *Devin Suiter*
>> Jr. Data Solutions Software Engineer
>> 100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
>> Google Voice: 412-256-8556 | www.rdx.com
>>
>>
>> On Thu, Jan 2, 2014 at 10:25 AM, Brock Noland <[email protected]> wrote:
>>
>>>
>>> On Tue, Dec 31, 2013 at 8:34 PM, ed <[email protected]> wrote:
>>>
>>>> Hello,
>>>>
>>>> We are using Flume v1.4 to load JSON-formatted log data into HDFS as
>>>> Avro. Our Flume setup looks like this:
>>>>
>>>> NXLog ==> (FlumeHTTPSource -> HDFSSink w/ custom EventSerializer)
>>>>
>>>> Right now our custom EventSerializer (which extends
>>>> AbstractAvroEventSerializer) takes the JSON input from the HTTPSource
>>>> and converts it into an Avro record of the appropriate type for the
>>>> incoming log file. This is working great, and we use the serializer
>>>> to add some additional "synthetic" fields to the Avro record that
>>>> don't exist in the original JSON log data.
>>>>
>>>> My question concerns how to handle malformed JSON data (or really any
>>>> error inside of the custom EventSerializer). It's very likely that as
>>>> we parse the JSON there will be records where something is malformed
>>>> (either the JSON itself, or a field of the wrong type, etc.).
>>>>
>>>> For example, a "port" field which should always be an integer might
>>>> for some reason have some ASCII text in it. I'd like to catch these
>>>> errors in the EventSerializer and then write out the bad JSON to a
>>>> log file somewhere that we can monitor.
>>>>
>>>
>>> Yeah, it would be nice to have a better story about this in Flume.
>>>
>>>
>>>>
>>>> What is the best way to do this?
>>>>
>>>
>>> Typically people will either log it to a file or send it through
>>> another "flow" to a different HDFS sink.
>>>
>>>
>>>
>>>> Right now, all the logic for catching bad JSON would be inside of the
>>>> "convert" function of the EventSerializer. Should the convert
>>>> function itself throw an exception that will be gracefully handled
>>>> upstream
>>>>
>>>
>>> The exception will be logged, but that is it.
>>>
>>>
>>>> or do I just return a "null" value if there was an error? Would it be
>>>> appropriate to log errors directly to a database from inside the
>>>> EventSerializer convert method, or would this be too slow?
>>>>
>>>
>>> That might be too slow to do directly. If I did that, I'd have a
>>> separate thread doing the database writes, with an in-memory queue
>>> between the serializer and that thread.
>>>
>>>
>>>> What are the best practices for this type of error handling?
>>>>
>>>
>>> It looks to me like we'd need to change AbstractAvroEventSerializer to
>>> filter out nulls:
>>>
>>> https://github.com/apache/flume/blob/trunk/flume-ng-core/src/main/java/org/apache/flume/serialization/AbstractAvroEventSerializer.java#L106
>>>
>>> which we could easily do. Since you don't want to wait for that, you
>>> could override the write method to do this yourself.
>>>
>>>
>>>>
>>>> Thank you for any assistance!
>>>>
>>>> Best Regards,
>>>>
>>>> Ed
>>>>
>>>
>>>
>>>
>>> --
>>> Apache MRUnit - Unit testing MapReduce - http://mrunit.apache.org
>>
>>
>
>
> --
> Apache MRUnit - Unit testing MapReduce - http://mrunit.apache.org
>
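
P.S. Ed - until null filtering lands in AbstractAvroEventSerializer itself,
the override Brock describes could look roughly like this. This is an
untested sketch: it assumes your convert() returns null for malformed
input, and "MyRecord" stands in for whatever your Avro record type is.
Note that convert() ends up running twice for every good event, since
super.write() calls it again.

import java.io.IOException;

import org.apache.flume.Event;
import org.apache.flume.serialization.AbstractAvroEventSerializer;

public class MyEventSerializer extends AbstractAvroEventSerializer<MyRecord> {

  // ... getSchema(), convert(), builder, etc. as you already have them ...

  @Override
  public void write(Event event) throws IOException {
    if (convert(event) == null) {
      // Malformed record: log it or hand it off to a side channel, then
      // skip it instead of letting a null reach the Avro writer.
      return;
    }
    super.write(event); // re-runs convert() and appends the record
  }
}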
