bad data output

2018-03-29 Thread Darshan Singh
Hi

I have a dataset which has almost 99% of correct data. As of now if say
some data is bad I just ignore it and log it and return only correct data.
I do this inside a map function.

The part which decides whether data is correct or not is expensive one.

Now I want to store the bad data somewhere so that I could analyze that
data in future.

So I can run the same calc 2 times and get the correct data in first go and
bad data in 2nd go.

Is there a better way where I can somehow store the bad data from inside of
map function like send to kafka, file etc?

Also, is there a way I could create a datastream which can get the data
from inside map function(not sure this is feasible as of now)?

Thanks


Re: bad data output

2018-03-29 Thread 杨力
You can use a split operator, generating 2 streams.

Darshan Singh  于 2018年3月30日周五 上午2:53写道:

> Hi
>
> I have a dataset which has almost 99% of correct data. As of now if say
> some data is bad I just ignore it and log it and return only correct data.
> I do this inside a map function.
>
> The part which decides whether data is correct or not is expensive one.
>
> Now I want to store the bad data somewhere so that I could analyze that
> data in future.
>
> So I can run the same calc 2 times and get the correct data in first go
> and bad data in 2nd go.
>
> Is there a better way where I can somehow store the bad data from inside
> of map function like send to kafka, file etc?
>
> Also, is there a way I could create a datastream which can get the data
> from inside map function(not sure this is feasible as of now)?
>
> Thanks
>


Re: bad data output

2018-04-03 Thread Kostas Kloudas
Hi Darshan,

You can use side outputs [1] and a process function to split the data in as 
many streams as you want,
e.g. correct, fixable and wrong. Each side output will be a separate stream 
that your can process individually.

You can always send the “bad data” directly from your process function to Kafka 
or wherever. You just need
to override the open() method, create a connection to the outside storage 
system, and use that connection
to store the data whenever you see them. Keep in mind though, that your process 
function is executed on a single thread, so it may be beneficial to split your 
computation in multiple functions (although this is up to
you to benchmark and see if it fits you).

Thanks,
Kostas
 
[1] 
https://ci.apache.org/projects/flink/flink-docs-release-1.4/dev/stream/side_output.html
 



> On Mar 29, 2018, at 8:53 PM, Darshan Singh  wrote:
> 
> Hi
> 
> I have a dataset which has almost 99% of correct data. As of now if say some 
> data is bad I just ignore it and log it and return only correct data. I do 
> this inside a map function.
> 
> The part which decides whether data is correct or not is expensive one.
> 
> Now I want to store the bad data somewhere so that I could analyze that data 
> in future.
> 
> So I can run the same calc 2 times and get the correct data in first go and 
> bad data in 2nd go.
> 
> Is there a better way where I can somehow store the bad data from inside of 
> map function like send to kafka, file etc?
> 
> Also, is there a way I could create a datastream which can get the data from 
> inside map function(not sure this is feasible as of now)?
> 
> Thanks



Re: bad data output

2018-04-03 Thread Darshan Singh
Thanks,

As of now I have decided to write it to hdfs from within the function.

Thanks

On Tue, Apr 3, 2018 at 10:58 AM, Kostas Kloudas  wrote:

> Hi Darshan,
>
> You can use side outputs [1] and a process function to split the data in
> as many streams as you want,
> e.g. correct, fixable and wrong. Each side output will be a separate
> stream that your can process individually.
>
> You can always send the “bad data” directly from your process function to
> Kafka or wherever. You just need
> to override the open() method, create a connection to the outside storage
> system, and use that connection
> to store the data whenever you see them. Keep in mind though, that your
> process function is executed on a single thread, so it may be beneficial to
> split your computation in multiple functions (although this is up to
> you to benchmark and see if it fits you).
>
> Thanks,
> Kostas
>
> [1] https://ci.apache.org/projects/flink/flink-docs-
> release-1.4/dev/stream/side_output.html
>
>
> On Mar 29, 2018, at 8:53 PM, Darshan Singh  wrote:
>
> Hi
>
> I have a dataset which has almost 99% of correct data. As of now if say
> some data is bad I just ignore it and log it and return only correct data.
> I do this inside a map function.
>
> The part which decides whether data is correct or not is expensive one.
>
> Now I want to store the bad data somewhere so that I could analyze that
> data in future.
>
> So I can run the same calc 2 times and get the correct data in first go
> and bad data in 2nd go.
>
> Is there a better way where I can somehow store the bad data from inside
> of map function like send to kafka, file etc?
>
> Also, is there a way I could create a datastream which can get the data
> from inside map function(not sure this is feasible as of now)?
>
> Thanks
>
>
>