Re: counting words (not frequency)

Sameer Wadkar Fri, 22 Jul 2016 19:52:26 -0700

It is complicated:
1. If you have a file, you should consider using the DataSet API. It is more 
complicated to use DataStream with files, as you have to simulate a stream from 
a file. 
2. You need a tokenizer for a map operator unless you have one word per line. 
3. The sum operator is fine and will count, but unless you groupBy you will get 
as many counts as your default parallelism. So you need to explicitly set 
parallelism to 1. And that is something you cannot do with a map-only job 
unless your file is small. But if it is small, why use Flink?
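
To illustrate points 2 and 3, here is a rough sketch in plain Java. It is not 
Flink code; the class and method names (TotalWordCount, tokenize, 
partialCounts, total) are made up for illustration, and the round-robin split 
of lines only stands in for however the lines would really be distributed 
across parallel subtasks. It shows why you end up with one count per subtask 
and need a final single-threaded step to get one number.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only: simulates parallel partial counts in plain Java.
final class TotalWordCount {

    // Tokenizer (point 2): split one line into words on whitespace.
    static List<String> tokenize(String line) {
        List<String> words = new ArrayList<>();
        for (String w : line.trim().split("\\s+")) {
            if (!w.isEmpty()) {
                words.add(w);
            }
        }
        return words;
    }

    // Each simulated subtask counts only the lines it receives (round-robin here),
    // so the result is one partial count per subtask, not a single total.
    static long[] partialCounts(List<String> lines, int parallelism) {
        long[] counts = new long[parallelism];
        for (int i = 0; i < lines.size(); i++) {
            counts[i % parallelism] += tokenize(lines.get(i)).size();
        }
        return counts;
    }

    // Final step (point 3): combine the partial counts in one place,
    // which corresponds to running the last operator with parallelism 1.
    static long total(long[] partials) {
        long sum = 0;
        for (long c : partials) {
            sum += c;
        }
        return sum;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("to be or not to be", "that is the question", "");
        long[] partials = partialCounts(lines, 2);
        // Prints: [6, 4] -> 10
        System.out.println(Arrays.toString(partials) + " -> " + total(partials));
    }
}
```

With a parallelism of 2 you get two partial counts (6 and 4); only the last 
step, which sees all of them, produces the single total of 10.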



> On Jul 22, 2016, at 6:35 PM, Roshan Naik <ros...@hortonworks.com> wrote:
> 
> Seems a bit convoluted for such a simple problem. I am thinking a custom
> streaming count() operator will simplify. Wasn't able to find examples for
> custom Streaming operators.
> -roshan
> 
> 
>> On 7/21/16, 8:00 PM, "hrajaram" <hraja...@arcogent.com> wrote:
>> 
>> Can't you use a KeyedStream, I mean keyBy with the same key? Something
>> like this:
>> source.flatMap(new Tokenizer()).keyBy(0).sum(2).project(2).print();
>> 
>> Assuming the tokenizer emits Tuple3<String,String,Integer> (fields are
>> listed 1-based here; the code above uses 0-based indices):
>> 
>> 1-> is always the same key, say "test"
>> 2->the actual word
>> 3-> 1
>> 
>> 
>> 
>> There might be other good choices, but this is the first thing that
>> came to my mind :-)
>> 
>> Hari
>> 
>> 
>> 
>> 
>
