Re: bad data output

Darshan Singh Tue, 03 Apr 2018 11:01:35 -0700

Thanks,

As of now I have decided to write it to hdfs from within the function.


Thanks

On Tue, Apr 3, 2018 at 10:58 AM, Kostas Kloudas <k.klou...@data-artisans.com
> wrote:

> Hi Darshan,
>
> You can use side outputs [1] and a process function to split the data in
> as many streams as you want,
> e.g. correct, fixable and wrong. Each side output will be a separate
> stream that your can process individually.
>
> You can always send the “bad data” directly from your process function to
> Kafka or wherever. You just need
> to override the open() method, create a connection to the outside storage
> system, and use that connection
> to store the data whenever you see them. Keep in mind though, that your
> process function is executed on a single thread, so it may be beneficial to
> split your computation in multiple functions (although this is up to
> you to benchmark and see if it fits you).
>
> Thanks,
> Kostas
>
> [1] https://ci.apache.org/projects/flink/flink-docs-
> release-1.4/dev/stream/side_output.html
>
>
> On Mar 29, 2018, at 8:53 PM, Darshan Singh <darshan.m...@gmail.com> wrote:
>
> Hi
>
> I have a dataset which has almost 99% of correct data. As of now if say
> some data is bad I just ignore it and log it and return only correct data.
> I do this inside a map function.
>
> The part which decides whether data is correct or not is expensive one.
>
> Now I want to store the bad data somewhere so that I could analyze that
> data in future.
>
> So I can run the same calc 2 times and get the correct data in first go
> and bad data in 2nd go.
>
> Is there a better way where I can somehow store the bad data from inside
> of map function like send to kafka, file etc?
>
> Also, is there a way I could create a datastream which can get the data
> from inside map function(not sure this is feasible as of now)?
>
> Thanks
>
>
>

Re: bad data output

Reply via email to