On Tue, Dec 31, 2013 at 8:34 PM, ed <[email protected]> wrote:

> Hello,
>
> We are using Flume v1.4 to load JSON formatted log data into HDFS as Avro.
> Our flume setup looks like this:
>
> NXLog ==> (FlumeHTTPSource -> HDFSSink w/ custom EventSerializer)
>
> Right now our custom EventSerializer (which extends
> AbstractAvroEventSerializer) takes the JSON input from the HTTPSource and
> converts it into an avro record of the appropriate type for the incoming
> log file. This is working great and we use the serializer to add some
> additional "synthetic" fields to the avro record that don't exist in the
> original JSON log data.
>
> My question concerns how to handle malformed JSON data (or really any
> error inside of the custom EventSerializer). It's very likely that as we
> parse the JSON there will be records where something is malformed (either
> the JSON itself, or a field is of the wrong type etc.).
>
> For example, a "port" field which should always be an Integer might for
> some reason have some ASCII text in it. I'd like to catch these errors in
> the EventSerializer and then write out the bad JSON to a log file somewhere
> that we can monitor.
>

Yeah, it would be nice to have a better story about this in Flume.

> What is the best way to do this?

Typically people will either log it to a file or send it through another
"flow" to a different HDFS sink.

> Right now, all the logic for catching bad JSON would be inside of the
> "convert" function of the EventSerializer. Should the convert function
> itself throw an exception that will be gracefully handled upstream

The exception will be logged, but that is it.

> or do I just return a "null" value if there was an error? Would it be
> appropriate to log errors directly to a database from inside the
> EventSerializer convert method or would this be too slow?

That might be too slow to do directly. If I did that, I'd have a separate
thread doing the database writes, with an in-memory queue between the
serializer and that thread.

> What are the best practices for this type of error handling?

It looks to me like we'd need to change AbstractAvroEventSerializer to filter
out nulls:

https://github.com/apache/flume/blob/trunk/flume-ng-core/src/main/java/org/apache/flume/serialization/AbstractAvroEventSerializer.java#L106

which we could easily do. Since you don't want to wait for that, you could
override the write method to do this.

> Thank you for any assistance!
>
> Best Regards,
>
> Ed

-- 
Apache MRUnit - Unit testing MapReduce - http://mrunit.apache.org
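
To make the two suggestions above a bit more concrete (skip null results in write, and hand bad records to a separate thread through an in-memory queue), here is a rough sketch of the pattern. It uses only the JDK and deliberately does not extend any Flume class: BadRecordReporter, SketchSerializer, and the sink callback are hypothetical names made up for illustration, and the plain list stands in for the Avro file writer. Treat it as one possible shape under those assumptions, not as how Flume itself does it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/** Hypothetical helper: passes bad records to a background thread via a bounded queue. */
final class BadRecordReporter implements AutoCloseable {
    private final BlockingQueue<String> queue;
    private final Consumer<String> sink; // e.g. a log-file or database writer (assumption)
    private final Thread worker;
    private volatile boolean running = true;

    BadRecordReporter(int capacity, Consumer<String> sink) {
        this.queue = new ArrayBlockingQueue<>(capacity);
        this.sink = sink;
        this.worker = new Thread(this::drain, "bad-record-reporter");
        this.worker.setDaemon(true);
        this.worker.start();
    }

    /** Never blocks the caller; returns false if the queue is full and the record is dropped. */
    boolean report(String rawRecord) {
        return queue.offer(rawRecord);
    }

    private void drain() {
        try {
            // Keep going until close() has been called and the queue is empty.
            while (running || !queue.isEmpty()) {
                String record = queue.poll(100, TimeUnit.MILLISECONDS);
                if (record == null) {
                    continue;
                }
                try {
                    sink.accept(record);
                } catch (RuntimeException e) {
                    // A failing sink should not kill the worker; the record is lost here.
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** Stops accepting work and waits for the queued records to be handed to the sink. */
    @Override
    public void close() {
        running = false;
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}

/** Stand-in for the custom serializer; it does not extend any Flume class. */
final class SketchSerializer {
    private final BadRecordReporter reporter;
    private final List<Integer> written = new ArrayList<>(); // stands in for the Avro writer

    SketchSerializer(BadRecordReporter reporter) {
        this.reporter = reporter;
    }

    /** Returns null instead of throwing when the "port" value is not an integer. */
    Integer convert(String rawPort) {
        if (rawPort == null) {
            return null;
        }
        try {
            return Integer.valueOf(rawPort.trim());
        } catch (NumberFormatException e) {
            return null;
        }
    }

    /** Skips null results so a bad record is reported rather than appended. */
    void write(String rawPort) {
        Integer record = convert(rawPort);
        if (record == null) {
            reporter.report(String.valueOf(rawPort));
            return;
        }
        written.add(record);
    }

    List<Integer> written() {
        return written;
    }
}
```

The queue is bounded and report() uses offer(), so a slow database or disk only costs you dropped error reports, never a stalled sink. If losing error reports is not acceptable, you would need a different trade-off (for example a larger queue, or blocking with a timeout).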
