Re: Streaming use case: Row enrichment

2017-06-16 Thread Flavio Pompermaier
Understood. Thanks anyway, Aljoscha!
On Fri, Jun 16, 2017 at 11:55 AM, Aljoscha Krettek wrote:
> Hi,
>
> The problem with that is that the file is being read by (possibly, very
> likely) multiple operators in parallel. The file source works like this:
> there is a ContinuousFileMonitoringFunction

Re: Streaming use case: Row enrichment

2017-06-16 Thread Aljoscha Krettek
Hi, The problem with that is that the file is being read by (possibly, very likely) multiple operators in parallel. The file source works like this: there is a ContinuousFileMonitoringFunction (this is an actual Flink source) that monitors a directory and when a new file appears sends several (

Re: Streaming use case: Row enrichment

2017-06-16 Thread Flavio Pompermaier
Is it really necessary to wait for the file to reach the end of the pipeline? Isn't it sufficient to know that it has been read and the source operator has been checkpointed (I don't know if I'm using this word correctly... I mean that all the file splits have been processed and Flink won't reprocess th

Re: Streaming use case: Row enrichment

2017-06-16 Thread Aljoscha Krettek
Hi, I was referring to StreamExecutionEnvironment.readFile(FileInputFormat inputFormat, String filePath, FileProcessingMode watchType, long interval), where you can specify whether the source should shut down once all files have been read (PROCESS_ONCE) or whether the source should contin

Re: Streaming use case: Row enrichment

2017-06-15 Thread Flavio Pompermaier
Hi Aljoscha, thanks for the great suggestions, I wasn't aware of AsyncDataStream.unorderedWait and BucketingSink setBucketer(). Most probably that's exactly what I was looking for (I just need to find the time to test it). Just one last question: what are you referring to with "you could use a diffe

Re: Streaming use case: Row enrichment

2017-06-15 Thread Aljoscha Krettek
Ok, just trying to make sure I understand everything: You have this:
1. A bunch of data in HDFS that you want to enrich
2. An external service (Solr/ES) that you query for enriching the data rows stored in 1.
3. You need to store the enriched rows in HDFS again
I think you could just do this (ro
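
The three steps listed in this message (rows read from HDFS, a lookup against an external service, enriched rows written back) can be illustrated without Flink. What follows is only a plain-Java sketch of the idea behind unordered asynchronous enrichment, the approach the follow-up message attributes to AsyncDataStream.unorderedWait. It is not Flink code and not what was posted in the thread: the class RowEnrichmentSketch, the methods lookup and enrichUnordered, the comma-separated row layout, and the in-memory list standing in for HDFS are all hypothetical names and assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Queue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Plain-Java illustration only; none of these names are Flink APIs.
class RowEnrichmentSketch {

    // Hypothetical stand-in for the Solr/Elasticsearch query (step 2).
    static String lookup(String key) {
        return "enriched:" + key.toUpperCase(Locale.ROOT);
    }

    // Starts one asynchronous lookup per row and emits each enriched row as
    // soon as its lookup completes, so output order may differ from input order.
    static List<String> enrichUnordered(List<String> rows, int parallelism) {
        ExecutorService pool = Executors.newFixedThreadPool(parallelism);
        Queue<String> emitted = new ConcurrentLinkedQueue<>();
        try {
            CompletableFuture<?>[] pending = rows.stream()
                .map(row -> CompletableFuture
                    .supplyAsync(() -> row + "," + lookup(row), pool)
                    .thenAccept(emitted::add))
                .toArray(CompletableFuture[]::new);
            // Wait until every row has been enriched and emitted.
            CompletableFuture.allOf(pending).join();
            return new ArrayList<>(emitted);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // In-memory rows stand in for the data read from HDFS (step 1).
        List<String> rows = Arrays.asList("alpha", "beta", "gamma");
        // Printing stands in for writing the enriched rows back (step 3).
        for (String row : enrichUnordered(rows, 2)) {
            System.out.println(row);
        }
    }
}
```

Because rows are emitted in completion order, a slow lookup does not hold back rows whose lookups have already finished. The trade-off is that downstream consumers cannot rely on the input order.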

Re: Streaming use case: Row enrichment

2017-06-06 Thread Flavio Pompermaier
Hi Aljoscha, thanks for getting back to me on this! I'll try to simplify the thread starting from what we want to achieve. At the moment we execute some queries against a db and we store the data in Parquet directories (one for each query). Let's say we then create a DataStream from each dir, what we

Re: Streaming use case: Row enrichment

2017-06-06 Thread Aljoscha Krettek
Hi Flavio, I’ll try to answer your questions. Regarding 1: why do you need to first read the data from HDFS into Kafka (or another queue)? Using StreamExecutionEnvironment.readFile(FileInputFormat, String, FileProcessingMode, long) you can monitor a directory in HDFS and process the files tha

Streaming use case: Row enrichment

2017-05-11 Thread Flavio Pompermaier
Hi all, we have a particular use case where we have a tabular dataset on HDFS (e.g. a CSV) that we want to enrich by filling some cells with the content returned by a query to a reverse index (e.g. Solr/Elasticsearch). Since we want to be able to make this process resilient and scalable, we thought