Thanks, As of now I have decided to write it to hdfs from within the function.
Thanks On Tue, Apr 3, 2018 at 10:58 AM, Kostas Kloudas <k.klou...@data-artisans.com > wrote: > Hi Darshan, > > You can use side outputs [1] and a process function to split the data in > as many streams as you want, > e.g. correct, fixable and wrong. Each side output will be a separate > stream that your can process individually. > > You can always send the “bad data” directly from your process function to > Kafka or wherever. You just need > to override the open() method, create a connection to the outside storage > system, and use that connection > to store the data whenever you see them. Keep in mind though, that your > process function is executed on a single thread, so it may be beneficial to > split your computation in multiple functions (although this is up to > you to benchmark and see if it fits you). > > Thanks, > Kostas > > [1] https://ci.apache.org/projects/flink/flink-docs- > release-1.4/dev/stream/side_output.html > > > On Mar 29, 2018, at 8:53 PM, Darshan Singh <darshan.m...@gmail.com> wrote: > > Hi > > I have a dataset which has almost 99% of correct data. As of now if say > some data is bad I just ignore it and log it and return only correct data. > I do this inside a map function. > > The part which decides whether data is correct or not is expensive one. > > Now I want to store the bad data somewhere so that I could analyze that > data in future. > > So I can run the same calc 2 times and get the correct data in first go > and bad data in 2nd go. > > Is there a better way where I can somehow store the bad data from inside > of map function like send to kafka, file etc? > > Also, is there a way I could create a datastream which can get the data > from inside map function(not sure this is feasible as of now)? > > Thanks > > >