Re: Stream in loop and not getting to sink (Parquet writer )

2018-12-03 Thread Kostas Kloudas
Hi Avi, If Parquet is not a requirement, you can use the StreamingFileSink and write plain text, if that is OK for you. In that case you can set the batch size and, more generally, specify a custom RollingPolicy. For example, I would recommend checking [1], where you have, of course, to adjust
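A minimal sketch of the row-format setup described above, assuming the Flink 1.7-era API (later releases changed the DefaultRollingPolicy builder). The object name, helper name, output directory parameter and the interval/size values are illustrative assumptions, not taken from the thread:

```scala
import org.apache.flink.api.common.serialization.SimpleStringEncoder
import org.apache.flink.core.fs.Path
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink
import org.apache.flink.streaming.api.functions.sink.filesystem.rollingpolicies.DefaultRollingPolicy
import org.apache.flink.streaming.api.scala._

object RowFormatSinkSketch {

  // Hypothetical helper: attaches a plain-text StreamingFileSink to a stream of lines.
  def attachTextSink(lines: DataStream[String], outputDir: String): Unit = {
    // Roll a part file when it is 15 minutes old, has been idle for 5 minutes,
    // or has reached 128 MiB. All three thresholds are assumed example values.
    val rollingPolicy = DefaultRollingPolicy.create()
      .withRolloverInterval(15L * 60 * 1000)
      .withInactivityInterval(5L * 60 * 1000)
      .withMaxPartSize(128L * 1024 * 1024)
      .build[String, String]()

    val sink: StreamingFileSink[String] = StreamingFileSink
      .forRowFormat(new Path(outputDir), new SimpleStringEncoder[String]("UTF-8"))
      .withRollingPolicy(rollingPolicy)
      .build()

    lines.addSink(sink)
  }
}
```

Checkpointing still has to be enabled on the environment, since the sink relies on completed checkpoints to finalise part files.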

Re: Stream in loop and not getting to sink (Parquet writer )

2018-12-03 Thread Avi Levi
Thanks Kostas, OK, got it, so the BucketingSink might not be a good choice here. Can you please advise what the best approach would be? I have a heavy load of data that I consume from Kafka, which I want to process and write to a file (it doesn't have to be Parquet). I thought that StreamingFileSink might

Re: Stream in loop and not getting to sink (Parquet writer )

2018-12-03 Thread Kostas Kloudas
Hi Avi, For Bulk Formats like Parquet, unfortunately, we do not support setting the batch size. The part-files roll on every checkpoint. This is a known limitation and there are plans to alleviate it in the future. Setting the batch size (among other things) is supported for RowWise formats.

Re: Stream in loop and not getting to sink (Parquet writer )

2018-12-02 Thread Avi Levi
Thanks Kostas. I will definitely look into that. But does the StreamingFileSink also support setting the batch size by size and/or by time interval, like the BucketingSink? On Sun, Dec 2, 2018 at 5:09 PM Kostas Kloudas wrote: > Hi Avi, > > The ParquetAvroWriters cannot be used with the

Re: Stream in loop and not getting to sink (Parquet writer )

2018-12-02 Thread Kostas Kloudas
Hi Avi, The ParquetAvroWriters cannot be used with the BucketingSink. In fact the StreamingFileSink is the "evolution" of the BucketingSink and it supports all the functionality that the BucketingSink supports. Given this, why not use the StreamingFileSink? On Sat, Dec 1, 2018 at 7:56 AM Avi

Re: Stream in loop and not getting to sink (Parquet writer )

2018-11-30 Thread Avi Levi
Thanks, looks good. Do you know a way to use ParquetWriter or ParquetAvroWriters with a BucketingSink? Something like: val bucketingSink = new BucketingSink[String]("/base/path")

Re: Stream in loop and not getting to sink (Parquet writer )

2018-11-30 Thread Kostas Kloudas
And for a Java example which is actually similar to your pipeline, you can check the ParquetStreamingFileSinkITCase. On Fri, Nov 30, 2018 at 2:39 PM Kostas Kloudas wrote: > Hi Avi, > > At first glance I do not see anything wrong with your code. > Did you verify that there are elements

Re: Stream in loop and not getting to sink (Parquet writer )

2018-11-30 Thread Kostas Kloudas
Hi Avi, At first glance I do not see anything wrong with your code. Did you verify that there are elements flowing in your pipeline and that checkpoints are actually completed? Also, can you check the Job Manager and Task Manager logs for anything suspicious? Unfortunately, we do not allow

Re: Stream in loop and not getting to sink (Parquet writer )

2018-11-29 Thread Avi Levi
Thanks a lot Kostas, but the file is not created. What am I doing wrong? BTW, how can you set the encoding etc. in Flink's Avro - Parquet writer? object Tester extends App { val env = StreamExecutionEnvironment.getExecutionEnvironment def now = System.currentTimeMillis() val path = new

Re: Stream in loop and not getting to sink (Parquet writer )

2018-11-29 Thread Kostas Kloudas
Sorry, previously I got confused and I assumed you were using Flink's StreamingFileSink. Could you try to use Flink's Avro - Parquet writer? StreamingFileSink.forBulkFormat( Path...(MY_PATH), ParquetAvroWriters.forGenericRecord(MY_SCHEMA)) .build() Cheers, Kostas On Thu, Nov 29,

Re: Stream in loop and not getting to sink (Parquet writer )

2018-11-29 Thread Avi Levi
Thanks. Yes, *env.execute* is called and checkpoints are enabled. I think the problem is where to place the *writer.close* to flush the cache. If I place it in the sink after the write event, e.g. addSink{ writer.write writer.close }, then only the first record will be included in the file, but

Re: Stream in loop and not getting to sink (Parquet writer )

2018-11-29 Thread Kostas Kloudas
Hi again Avi, In the first example that you posted (the one with the Kafka source), do you call env.execute()? Cheers, Kostas On Thu, Nov 29, 2018 at 10:01 AM Kostas Kloudas wrote: > Hi Avi, > > In the last snippet that you posted, you have not activated checkpoints. > > Checkpoints are

Re: Stream in loop and not getting to sink (Parquet writer )

2018-11-29 Thread Kostas Kloudas
Hi Avi, In the last snippet that you posted, you have not activated checkpoints. Checkpoints are needed for the StreamingFileSink to produce results, especially in the case of BulkWriters (like Parquet) where the part file is rolled upon reception of a checkpoint and the part is finalised (i.e.
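As a rough illustration of the point about checkpoints, the following sketch enables checkpointing and attaches a Parquet bulk-format sink. It assumes the Flink 1.7-era API with the flink-parquet module on the classpath; the object name, helper name, its parameters and the interval are hypothetical and not from the thread:

```scala
import org.apache.avro.Schema
import org.apache.avro.generic.GenericRecord
import org.apache.flink.core.fs.Path
import org.apache.flink.formats.parquet.avro.ParquetAvroWriters
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink
import org.apache.flink.streaming.api.scala._

object ParquetSinkSketch {

  // Hypothetical helper: enables checkpoints and attaches a Parquet sink.
  def attachParquetSink(
      env: StreamExecutionEnvironment,
      records: DataStream[GenericRecord],
      schema: Schema,
      outputDir: String,
      checkpointIntervalMs: Long): Unit = {

    // Without this call no checkpoints are triggered, so bulk part files
    // are never rolled and finalised.
    env.enableCheckpointing(checkpointIntervalMs)

    // Bulk formats roll the part file on every checkpoint; there is no
    // batch-size setting to configure here.
    val sink: StreamingFileSink[GenericRecord] = StreamingFileSink
      .forBulkFormat(new Path(outputDir), ParquetAvroWriters.forGenericRecord(schema))
      .build()

    records.addSink(sink)
  }
}
```

The checkpoint interval then effectively decides how often a new Parquet part file appears, so a very short interval would produce many small files.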

Re: Stream in loop and not getting to sink (Parquet writer )

2018-11-28 Thread Avi Levi
Check out this little app. You can see that the file is created but no data is written, even for a single record. import io.eels.component.parquet.ParquetWriterConfig import org.apache.avro.Schema import org.apache.avro.generic.{ GenericData, GenericRecord } import

Re: Stream in loop and not getting to sink (Parquet writer )

2018-11-28 Thread vipul singh
Can you try closing the writer? AvroParquetWriter has an internal buffer. Try calling .close() in snapshot() (since you are checkpointing, this method will be called). On Wed, Nov 28, 2018 at 7:33 PM Avi Levi wrote: > Thanks Rafi, > I am actually not using assignTimestampsAndWatermarks , I

Re: Stream in loop and not getting to sink (Parquet writer )

2018-11-28 Thread Avi Levi
Thanks Rafi, I am actually not using assignTimestampsAndWatermarks; I will try to add it as you suggested. However, it seems that the messages are repeating in the stream over and over: even if I push a single message manually to the queue, that message repeats indefinitely. Cheers Avi On Wed,

Re: Stream in loop and not getting to sink (Parquet writer )

2018-11-28 Thread Rafi Aroch
Hi Avi, I can't see the part where you use assignTimestampsAndWatermarks. If this part is not set properly, it's possible that watermarks are not sent and nothing will be written to your sink. See here for more details:

Stream in loop and not getting to sink (Parquet writer )

2018-11-28 Thread Avi Levi
Hi, I am trying to implement a Parquet writer as a SinkFunction. The pipeline consists of Kafka as the source and a Parquet file as the sink; however, it seems like the stream is repeating itself in an endless loop and the Parquet file is not written. Can someone please help me with this? object