Yes, the regex interceptors and selectors can be very powerful - experimenting with them was really exciting.
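A sketch of the kind of config I was playing with, for the archives - all
agent and component names here are made up. regex_extractor pulls a value
out of the body into a header, and a multiplexing selector steers each
event by that header:

# pull the "type" field out of the JSON body into a header named "route"
agent.sources.http1.interceptors = ext1
agent.sources.http1.interceptors.ext1.type = regex_extractor
agent.sources.http1.interceptors.ext1.regex = "type"\\s*:\\s*"(\\w+)"
agent.sources.http1.interceptors.ext1.serializers = s1
agent.sources.http1.interceptors.ext1.serializers.s1.name = route

# route on the header: known values to one channel, the rest elsewhere
agent.sources.http1.selector.type = multiplexing
agent.sources.http1.selector.header = route
agent.sources.http1.selector.mapping.access = goodChannel
agent.sources.http1.selector.default = catchallChannel

(regex_filter is the blunter cousin - it keeps or drops matching events
outright, where the selector approach keeps everything and just steers it.)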
Brock, thanks for validating the ML idea - as with most things, the
simplest solution is probably the way to go, and in this use case,
morphlines might be overkill.

*Devin Suiter*
Jr. Data Solutions Software Engineer
100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
Google Voice: 412-256-8556 | www.rdx.com


On Thu, Jan 2, 2014 at 12:27 PM, Brock Noland <[email protected]> wrote:

> Jimmy, great to hear that method is working for you!
>
> Devin, regarding the morphlines question: since ML can have arbitrary
> Java plugins, it *can* do just about anything. I generally think of ML
> as the T in ETL, so doing the validation in ML might make sense. In
> general, though, I think adding the custom header field is probably the
> best option for dealing with bad data.
>
> Once again, thank you everyone for using our software!
>
>
> On Thu, Jan 2, 2014 at 10:10 AM, Jimmy <[email protected]> wrote:
>
>> We are doing something similar to what Brock mentioned - a simple
>> interceptor for JSON validation that updates a custom field in the
>> header; the Flume HDFS sink then pushes the data to a good/bad target
>> directory based on this custom field... then a separate process watches
>> the bad directory.
>>
>> You could add notification to the Flume flow, but we wanted to keep it
>> very simple.
>>
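>> The interceptor itself is tiny. Roughly like this - not our exact code,
>> class and header names simplified, and Jackson is just what we happen
>> to use for the test-parse:
>>
>> import java.io.IOException;
>> import java.util.List;
>>
>> import org.apache.flume.Context;
>> import org.apache.flume.Event;
>> import org.apache.flume.interceptor.Interceptor;
>> import org.codehaus.jackson.map.ObjectMapper;
>>
>> public class JsonValidatingInterceptor implements Interceptor {
>>
>>   private final ObjectMapper mapper = new ObjectMapper();
>>
>>   @Override
>>   public void initialize() {}
>>
>>   @Override
>>   public Event intercept(Event event) {
>>     String status = "good";
>>     try {
>>       mapper.readTree(event.getBody()); // throws if body is not valid JSON
>>     } catch (IOException e) {
>>       status = "bad";
>>     }
>>     event.getHeaders().put("valid", status);
>>     return event; // never drop the event, just tag it
>>   }
>>
>>   @Override
>>   public List<Event> intercept(List<Event> events) {
>>     for (Event event : events) {
>>       intercept(event);
>>     }
>>     return events;
>>   }
>>
>>   @Override
>>   public void close() {}
>>
>>   public static class Builder implements Interceptor.Builder {
>>     @Override
>>     public Interceptor build() {
>>       return new JsonValidatingInterceptor();
>>     }
>>
>>     @Override
>>     public void configure(Context context) {}
>>   }
>> }
>>
>> The HDFS sink then just interpolates the header into the path, something
>> like hdfs.path = hdfs://namenode/flume/events/%{valid}, so good and bad
>> records land in separate directories.
>>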
>>
>>
>> ---------- Forwarded message ----------
>> From: Devin Suiter RDX <[email protected]>
>> Date: Thu, Jan 2, 2014 at 7:40 AM
>> Subject: Re: Handling malformed data when using custom
>> AvroEventSerializer and HDFS Sink
>> To: [email protected]
>>
>>
>> Just throwing this out there, since I haven't had time to dig into the
>> API with a big fork, but can morphlines offer any assistance here?
>>
>> Some kind of interceptor that would parse for malformed data, package up
>> the offending data and send it somewhere (email it, log it), and then
>> project a valid "there was something wrong here" piece of data into the
>> field, then allow your channel to carry on? Or skip the projection piece
>> and just move along? I was thinking that projecting known data into a
>> field that previously held malformed data would let you easily locate
>> those records later, while keeping your data shape consistent.
>>
>> Kind of looking to Brock as a sounding board on the appropriateness of
>> this as a potential solution, since morphlines takes some time to really
>> understand well...
>>
>> *Devin Suiter*
>> Jr. Data Solutions Software Engineer
>> 100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
>> Google Voice: 412-256-8556 | www.rdx.com
>>
>>
>> On Thu, Jan 2, 2014 at 10:25 AM, Brock Noland <[email protected]> wrote:
>>
>>>
>>> On Tue, Dec 31, 2013 at 8:34 PM, ed <[email protected]> wrote:
>>>
>>>> Hello,
>>>>
>>>> We are using Flume v1.4 to load JSON-formatted log data into HDFS as
>>>> Avro. Our Flume setup looks like this:
>>>>
>>>> NXLog ==> (FlumeHTTPSource -> HDFSSink w/ custom EventSerializer)
>>>>
>>>> Right now our custom EventSerializer (which extends
>>>> AbstractAvroEventSerializer) takes the JSON input from the HTTPSource
>>>> and converts it into an Avro record of the appropriate type for the
>>>> incoming log file. This is working great, and we use the serializer
>>>> to add some additional "synthetic" fields to the Avro record that
>>>> don't exist in the original JSON log data.
>>>>
>>>> My question concerns how to handle malformed JSON data (or really any
>>>> error inside of the custom EventSerializer). It's very likely that as
>>>> we parse the JSON there will be records where something is malformed
>>>> (either the JSON itself, or a field of the wrong type, etc.).
>>>>
>>>> For example, a "port" field which should always be an integer might
>>>> for some reason have some ASCII text in it. I'd like to catch these
>>>> errors in the EventSerializer and then write out the bad JSON to a
>>>> log file somewhere that we can monitor.
>>>>
>>>
>>> Yeah, it would be nice to have a better story about this in Flume.
>>>
>>>
>>>>
>>>> What is the best way to do this?
>>>>
>>>
>>> Typically people will either log it to a file or send it through
>>> another "flow" to a different HDFS sink.
>>>
>>>
>>>
>>>> Right now, all the logic for catching bad JSON would be inside of the
>>>> "convert" function of the EventSerializer. Should the convert
>>>> function itself throw an exception that will be gracefully handled
>>>> upstream
>>>>
>>>
>>> The exception will be logged, but that is it.
>>>
>>>
>>>> or do I just return a "null" value if there was an error? Would it be
>>>> appropriate to log errors directly to a database from inside the
>>>> EventSerializer convert method, or would this be too slow?
>>>>
>>>
>>> That might be too slow to do directly. If I did that, I'd have a
>>> separate thread doing the database writes, with an in-memory queue
>>> between the serializer and that thread.
>>>
>>>
>>>> What are the best practices for this type of error handling?
>>>>
>>>
>>> It looks to me like we'd need to change AbstractAvroEventSerializer to
>>> filter out nulls:
>>>
>>> https://github.com/apache/flume/blob/trunk/flume-ng-core/src/main/java/org/apache/flume/serialization/AbstractAvroEventSerializer.java#L106
>>>
>>> which we could easily do. Since you don't want to wait for that, you
>>> could override the write method to do this yourself.
>>>
>>>
>>>>
>>>> Thank you for any assistance!
>>>>
>>>> Best Regards,
>>>>
>>>> Ed
>>>>
>>>
>>>
>>>
>>> --
>>> Apache MRUnit - Unit testing MapReduce - http://mrunit.apache.org
>>
>>
>
>
> --
> Apache MRUnit - Unit testing MapReduce - http://mrunit.apache.org
>
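
P.S. Ed - until null filtering lands in AbstractAvroEventSerializer itself,
the override Brock describes could look roughly like this. This is an
untested sketch: it assumes your convert() returns null for malformed
input, and "MyRecord" stands in for whatever your Avro record type is.
Note that convert() ends up running twice for every good event, since
super.write() calls it again.

import java.io.IOException;

import org.apache.flume.Event;
import org.apache.flume.serialization.AbstractAvroEventSerializer;

public class MyEventSerializer extends AbstractAvroEventSerializer<MyRecord> {

  // ... getSchema(), convert(), builder, etc. as you already have them ...

  @Override
  public void write(Event event) throws IOException {
    if (convert(event) == null) {
      // Malformed record: log it or hand it off to a side channel, then
      // skip it instead of letting a null reach the Avro writer.
      return;
    }
    super.write(event); // re-runs convert() and appends the record
  }
}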
