Re: TextIO. Writing late files

2020-05-15 Thread Jozef Vilcek
FilenamePolicy than before was used (window + timing + shards). >> Then, you can find files that contain the original filenames in >> windowing-textio/pipe_with_lateness_60s/files-after-distinct. This is the >> interesting part, because you will find several files with LATE
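For reference, a minimal sketch of a naming scheme along these lines, encoding window, pane timing and shard into the file name so LATE panes are visible in the output (the input collection `lines` and the paths are illustrative, not from the thread):

    import org.apache.beam.sdk.io.FileIO;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.transforms.windowing.IntervalWindow;

    FileIO.Write.FileNaming naming =
        (window, pane, numShards, shardIndex, compression) -> {
          IntervalWindow w = (IntervalWindow) window; // assumes fixed/sliding windows
          return String.format(
              "out-%s-%s-%s-%05d-of-%05d%s",
              w.start(), w.end(), pane.getTiming(), shardIndex, numShards,
              compression.getSuggestedSuffix());
        };

    lines.apply(
        FileIO.<String>write()
            .via(TextIO.sink())
            .to("windowing-textio/output")
            .withNaming(naming)
            .withNumShards(5));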

Re: TextIO. Writing late files

2020-05-10 Thread Jozef Vilcek
I am using FileIO and I do observe pane info being dropped on the Flink runner too. It was mentioned in this thread: https://www.mail-archive.com/dev@beam.apache.org/msg20186.html It is a result of a different reshuffle expansion done for optimisation reasons. However, I did not observe a data

Re: Apache Beam with Hive

2020-02-16 Thread Jozef Vilcek
Problem seems to be an incompatibility of Hive's HCatalog versions ... what HCatalogIO expects versus what you have on the classpath. Beam's HCatalogIO is compiled against Hive 2.1; you are packing 1.2.0. On Mon, Feb 17, 2020 at 7:47 AM Gershi, Noam wrote: > 2.19.0 did not work also...

Re: SQL massively more resource-intensive? Memory leak?

2019-06-06 Thread Jozef Vilcek
Since you are using FlinkRunner, can you try running the pipeline with `--objectReuse=true` and see whether there is any noticeable difference? If you are not using the option already, of course. On Tue, Jun 4, 2019 at 9:24 PM Rui Wang wrote: > Sorry I couldn't be more helpful at this moment. Created a JIRA for
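For completeness, a sketch of setting the same option programmatically (assumes the Flink runner dependency is on the classpath):

    import org.apache.beam.runners.flink.FlinkPipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    FlinkPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(FlinkPipelineOptions.class);
    options.setObjectReuse(true); // equivalent to passing --objectReuse=true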

Re: "Dynamic" read / expand of PCollection element by IOs

2019-03-15 Thread Jozef Vilcek
>> 1: https://beam.apache.org/blog/2017/08/16/splittable-do-fn.html >> 2: https://github.com/apache/beam/blob/2ac5b764e3450798661a97f2b51f2d602feafb23/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileIO.java#L133 >> On Thu, Mar 14, 2019

"Dynamic" read / expand of PCollection element by IOs

2019-03-14 Thread Jozef Vilcek
Hello, I wanted to write Beam code which expands an incoming `PCollection`, element-wise, by use of existing IO components. An example could be a `PCollection` holding arbitrary paths to data which I want to load via `HadoopFormatIO.Read`, which is a `PTransform<PBegin, PCollection<KV<K, V>>>`. Problem is, I
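For the file-based case, the composable building blocks referenced above ([2] FileIO) already allow exactly this kind of element-wise expansion; a minimal sketch of the file-based analogue (HadoopFormatIO itself does not compose this way; the pattern string is illustrative):

    import org.apache.beam.sdk.io.FileIO;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.transforms.Create;
    import org.apache.beam.sdk.values.PCollection;

    PCollection<String> patterns = pipeline.apply(Create.of("hdfs:///data/2019/03/*.gz"));
    PCollection<String> lines =
        patterns
            .apply(FileIO.matchAll())      // expand each pattern into matched files
            .apply(FileIO.readMatches())   // open each match as a ReadableFile
            .apply(TextIO.readFiles());    // read contents, here as text lines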

KafkaIO and added partitions

2018-11-22 Thread Jozef Vilcek
Hello, just wanted to check how Beam's KafkaIO behaves when partitions are added to the topic. Will they be picked up or ignored at runtime? Will they be picked up on restart with state restore? Thanks, Jozef

Re: Editor for BeamSQL

2018-11-18 Thread Jozef Vilcek
ensions/sql/jdbc/BeamSqlLine.java > [4]: https://github.com/julianhyde/sqlline > Regards, > Anton > On Fri, Nov 16, 2018 at 2:03 AM Jozef Vilcek wrote: >> Hello, >> does anyone use or is aware of some kind of editor integration options for

Editor for BeamSQL

2018-11-16 Thread Jozef Vilcek
Hello, does anyone use or is aware of some kind of editor integration for BeamSQL? The goal would be to enable less technical people to execute SQL or run data analysis queries conveniently, e.g. like the HUE integration for SparkSQL or similar. Thanks, Jozef

Re: Unbalanced FileIO writes on Flink

2018-10-26 Thread Jozef Vilcek
workers). Probably we should do something > > similar for the Flink runner. > > This needs to be done by the runner, as # of workers is a runner > > concept; the SDK itself has no concept of workers. > > On Thu, Oct 25, 2018 at 3:28 AM Jozef Vilcek

Re: Unbalanced FileIO writes on Flink

2018-10-25 Thread Jozef Vilcek
specify the number of shards in streaming mode? > -Max > On 25.10.18 10:12, Jozef Vilcek wrote: > > Hm, yes, this makes sense now, but what can be done for my case? I do > > not want to end up with too many files on disk. > > I think what I am looking fo

Re: Unbalanced FileIO writes on Flink

2018-10-25 Thread Jozef Vilcek
enough keys, the chance > increases they are equally spread. > This should be similar to what the other Runners do. > On 24.10.18 10:58, Jozef Vilcek wrote: > > So if I run 5 workers with 50 shards, I end up with: > > Duration Bytes rec

Re: KafkaIO - Deadletter output

2018-10-25 Thread Jozef Vilcek
What I ended up doing, when I could not for any reason rely on Kafka timestamps but needed to parse them from the message, is: * have a custom Kafka deserializer which never throws but returns a message which is either a success with the parsed data structure plus timestamp, or a failure with the original kafka
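A minimal sketch of such a deserializer; `ParseResult` and `MyEvent` are hypothetical types standing in for the success/failure structure described above, and it assumes a recent kafka-clients where configure()/close() have default implementations:

    import org.apache.kafka.common.serialization.Deserializer;

    public class SafeEventDeserializer implements Deserializer<ParseResult> {
      @Override
      public ParseResult deserialize(String topic, byte[] data) {
        try {
          MyEvent event = MyEvent.parseFrom(data);                   // hypothetical parser
          return ParseResult.success(event, event.getTimestampMs()); // parsed data + timestamp
        } catch (Exception e) {
          return ParseResult.failure(data, e); // never throw; keep raw bytes for the dead-letter output
        }
      }
    }

Downstream, a ParDo with two TupleTags can then route successes to the main output and failures to a dead-letter sink.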

Re: Unbalanced FileIO writes on Flink

2018-10-24 Thread Jozef Vilcek
Oct 24, 2018 at 12:28 AM Jozef Vilcek wrote: >> cc (dev) >> I tried to run the example with FlinkRunner in batch mode and received >> again bad data spread among the workers. >> When I tried to remove number of shards for batch mode i

Re: Unbalanced FileIO writes on Flink

2018-10-24 Thread Jozef Vilcek
AfterWatermark.pastEndOfWindow().withEarlyFirings(AfterPane.elementCountAtLeast(1)).withLateFirings(AfterFirst.of(Repeatedly.forever(AfterPane.elementCountAtLeast(1)), Repeatedly.forever(AfterSynchronizedProcessingTime.pastFirstElementInPane()))) On Tue, Oct 23, 2018 at 12:01 PM Jozef Vilcek wrote: > Hi

Re: Unbalanced FileIO writes on Flink

2018-10-23 Thread Jozef Vilcek
)`? > Thanks, > Max > On 22.10.18 11:57, Jozef Vilcek wrote: > > Hello, > > I am having some trouble getting a balanced write via FileIO. Workers at > > the shuffle side, where data per window fire are written to the > > filesystem, receive
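For context, the knob under discussion is the explicit shard count on the write; a sketch with illustrative values. Since shards are distributed by key hash, using noticeably more shards than workers raises the chance of an even spread:

    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.transforms.windowing.FixedWindows;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.joda.time.Duration;

    events // an assumed PCollection<String>
        .apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(10))))
        .apply(TextIO.write()
            .to("hdfs:///archive/out")
            .withWindowedWrites()
            .withNumShards(50)); // e.g. 50 shards over 5 workers, rather than 1:1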

Re: Minimum time between checkpoints in Flink runner

2018-09-17 Thread Jozef Vilcek
I can see it was very recently added... https://issues.apache.org/jira/browse/BEAM-5372 On Wed, Sep 12, 2018 at 12:54 PM Encho Mishinev wrote: > Hello, > I am using the Flink runner with Apache Beam 2.6.0. An important configuration > of Flink is the 'minimum time between checkpoints' parameter
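For reference, once on a Beam version that contains BEAM-5372, the option can be set like any other Flink pipeline option; a sketch (values illustrative):

    import org.apache.beam.runners.flink.FlinkPipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    FlinkPipelineOptions options = PipelineOptionsFactory.create().as(FlinkPipelineOptions.class);
    options.setCheckpointingInterval(60_000L);      // checkpoint every 60s
    options.setMinPauseBetweenCheckpoints(30_000L); // at least 30s between checkpoints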

Re: FileBasedSink.WriteOperation copy instead of move?

2018-07-26 Thread Jozef Vilcek
>> we would need to make sure that all Filesystems support cross-directory >> rename. >> On Thu, Jul 26, 2018 at 9:58 AM Lukasz Cwik wrote: >>> +dev >>> On Thu, Jul 26, 2018 at 2:40 AM Jozef Vilcek wrote:

FileBasedSink.WriteOperation copy instead of move?

2018-07-26 Thread Jozef Vilcek
Hello, just came across the FileBasedSink.WriteOperation class, which has a moveToOutput() method. The implementation does a Filesystem.copy() instead of a "move". With large files I find this quite inefficient if the underlying FS supports more efficient ways, so I wonder what is the story behind it? Must
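For what it's worth, the FileSystems API does expose a rename entry point that delegates to the underlying filesystem; a sketch with illustrative paths (whether rename is actually cheap still depends on the FS):

    import java.io.IOException;
    import java.util.Collections;
    import org.apache.beam.sdk.io.FileSystems;
    import org.apache.beam.sdk.io.fs.MoveOptions.StandardMoveOptions;
    import org.apache.beam.sdk.io.fs.ResourceId;

    ResourceId src = FileSystems.matchNewResource("hdfs:///tmp/work/part-00000", false);
    ResourceId dst = FileSystems.matchNewResource("hdfs:///output/part-00000", false);
    try {
      FileSystems.rename(
          Collections.singletonList(src),
          Collections.singletonList(dst),
          StandardMoveOptions.IGNORE_MISSING_FILES);
    } catch (IOException e) {
      throw new RuntimeException("rename failed", e);
    }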

Re: Write bulks files from streaming app

2018-07-24 Thread Jozef Vilcek
Instead, the solution for me is to use a custom function with state and timers, similar to what GroupIntoBatches.ofSize() does. On Sun, Jul 22, 2018 at 6:42 PM Jozef Vilcek wrote: > Here is a pseudocode (sorry) of what I am doing right now: > PCollection> writtenFiles
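A compact sketch of that state-and-timers approach: buffer values per key and flush either at a size limit or when a processing-time timer fires (batch size and timer duration are illustrative):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.function.Consumer;
    import org.apache.beam.sdk.coders.StringUtf8Coder;
    import org.apache.beam.sdk.coders.VarIntCoder;
    import org.apache.beam.sdk.state.*;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.values.KV;
    import org.joda.time.Duration;

    public class BatchByKeyFn extends DoFn<KV<String, String>, List<String>> {
      private static final int MAX_BATCH = 1000;

      @StateId("buffer")
      private final StateSpec<BagState<String>> bufferSpec = StateSpecs.bag(StringUtf8Coder.of());
      @StateId("count")
      private final StateSpec<ValueState<Integer>> countSpec = StateSpecs.value(VarIntCoder.of());
      @TimerId("flush")
      private final TimerSpec flushSpec = TimerSpecs.timer(TimeDomain.PROCESSING_TIME);

      @ProcessElement
      public void process(ProcessContext ctx,
          @StateId("buffer") BagState<String> buffer,
          @StateId("count") ValueState<Integer> count,
          @TimerId("flush") Timer flush) {
        buffer.add(ctx.element().getValue());
        Integer n = count.read();
        int newCount = (n == null ? 0 : n) + 1;
        count.write(newCount);
        if (newCount == 1) {
          flush.offset(Duration.standardMinutes(5)).setRelative(); // age limit for a partial batch
        }
        if (newCount >= MAX_BATCH) {
          flushBatch(buffer, count, ctx::output);
        }
      }

      @OnTimer("flush")
      public void onFlush(OnTimerContext ctx,
          @StateId("buffer") BagState<String> buffer,
          @StateId("count") ValueState<Integer> count) {
        flushBatch(buffer, count, ctx::output);
      }

      private void flushBatch(BagState<String> buffer, ValueState<Integer> count,
          Consumer<List<String>> out) {
        List<String> batch = new ArrayList<>();
        buffer.read().forEach(batch::add);
        if (!batch.isEmpty()) {
          out.accept(batch);
        }
        buffer.clear();
        count.clear();
      }
    }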

Re: Write bulks files from streaming app

2018-07-22 Thread Jozef Vilcek
like parquet), one needs an additional step to > encode/compress when the specific destination file is done (if you think in > Hadoop terms, that would be in the "commit" step). > On Sun, Jul 22, 2018 at 2:10 PM, Jozef Vilcek wrote: >> I looked into Wait.on() b

Re: Write bulks files from streaming app

2018-07-22 Thread Jozef Vilcek
I tried to window it again with different triggers (no early trigger) and a groupBy key, but so far no luck, as it never yields a collection of the files which were emitted as EARLY in the first window. On Fri, Jul 20, 2018 at 9:06 PM Raghu Angadi wrote: > On Fri, Jul 20, 2018 at 2:58 AM Jozef Vil

Re: Write bulks files from streaming app

2018-07-20 Thread Jozef Vilcek
> files to a single file in your own DoFn. This is certainly more code on > your part, but might be worth it. You can use the Wait.on() transform to run > your finalizer DoFn right after the window that writes smaller files closes. > On Thu, Jul 19, 2018 at 2:43 AM Jozef Vilcek
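A sketch of that Wait.on() wiring; `toCompact`, `writtenFiles`, and `CompactFilesFn` are hypothetical placeholders for the files to merge, the first write's output, and the finalizer:

    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.transforms.Wait;
    import org.apache.beam.sdk.values.PCollection;

    PCollection<String> compacted =
        toCompact
            .apply(Wait.on(writtenFiles))           // held back until writtenFiles is complete for the window
            .apply(ParDo.of(new CompactFilesFn())); // hypothetical DoFn merging the small files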

Write bulks files from streaming app

2018-07-19 Thread Jozef Vilcek
Hey, I am looking for advice. I am trying to do stream processing with Beam on the Flink runtime: reading data from Kafka, doing some processing with it (which is not important here) and at the same time storing the consumed data to history storage (HDFS) for archive and reprocessing.
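A minimal sketch of the shape of such a pipeline; the broker, topic, path, window size and shard count are all illustrative:

    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.io.kafka.KafkaIO;
    import org.apache.beam.sdk.transforms.Values;
    import org.apache.beam.sdk.transforms.windowing.FixedWindows;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.apache.kafka.common.serialization.StringDeserializer;
    import org.joda.time.Duration;

    pipeline
        .apply(KafkaIO.<String, String>read()
            .withBootstrapServers("broker:9092")
            .withTopic("events")
            .withKeyDeserializer(StringDeserializer.class)
            .withValueDeserializer(StringDeserializer.class)
            .withoutMetadata())
        .apply(Values.create())
        .apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(15))))
        .apply(TextIO.write()
            .to("hdfs:///archive/events")
            .withWindowedWrites()
            .withNumShards(10));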

Pass complex type to FlinkPipelineOptions

2018-07-19 Thread Jozef Vilcek
Hey, I am using Beam with Flink and want to set `stateBackend` via the pipeline options available here: https://github.com/apache/beam/blob/master/runners/flink/src/main/java/org/apache/beam/runners/flink/FlinkPipelineOptions.java#L120 The property is an abstract class. I was not able to figure out so
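Since the property's type is a Flink state backend class rather than a string, it can only be set programmatically, not from the command line; a sketch under those assumptions (requires flink-statebackend-rocksdb on the classpath, and the exact setter type has varied across Beam versions):

    import java.io.IOException;
    import org.apache.beam.runners.flink.FlinkPipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;

    FlinkPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).as(FlinkPipelineOptions.class);
    try {
      options.setStateBackend(new RocksDBStateBackend("hdfs:///flink/checkpoints"));
    } catch (IOException e) {
      throw new RuntimeException(e);
    }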

Re: Metrics: Non-cumulative values for Distribution

2018-06-19 Thread Jozef Vilcek
to treat the > metric as if it was an element and compute it downstream so that it could > be bound to a window. > Etienne > On Saturday, 02 June 2018 at 08:01 +0300, Jozef Vilcek wrote: > > Hi Scott, > > nothing special about the use-case. Just want

Re: Metrics: Non-cumulative values for Distribution

2018-06-01 Thread Jozef Vilcek
trigger strategy which > captures the report interval you're looking for. > [1] https://s.apache.org/runner_independent_metrics_extraction > On Fri, Jun 1, 2018 at 3:39 AM Jozef Vilcek wrote: >> Hi, >> I am running a streaming job on flink a

Metrics: Non-cumulative values for Distribution

2018-06-01 Thread Jozef Vilcek
Hi, I am running a streaming job on Flink and want to monitor the MIN and MAX ranges of a metric flowing through an operator. I did it via org.apache.beam.sdk.metrics.Distribution. The problem is that it seems to report only cumulative values. What I would want instead is a discrete report for MIN / MAX
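For reference, the usage in question looks roughly like this (a sketch; the cumulative MIN/MAX/SUM/COUNT semantics is exactly the limitation described):

    import org.apache.beam.sdk.metrics.Distribution;
    import org.apache.beam.sdk.metrics.Metrics;
    import org.apache.beam.sdk.transforms.DoFn;

    public class TrackValueFn extends DoFn<Long, Long> {
      private final Distribution valueDist = Metrics.distribution(TrackValueFn.class, "value");

      @ProcessElement
      public void process(ProcessContext ctx) {
        valueDist.update(ctx.element()); // aggregated cumulatively over the job, not per interval
        ctx.output(ctx.element());
      }
    }

Per-interval MIN/MAX, as the replies suggest, has to be computed downstream instead, e.g. with windowed Min/Max combines.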

Access to current watermark

2018-05-17 Thread Jozef Vilcek
Hello, is there a way to observe the `current watermark` when processing elements of a triggered window? What I am trying to achieve is to have, on one window, quite a large allowance for late-comer events, and I want to log how far behind the watermark an event came in.
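Beam does not expose the watermark directly to a plain DoFn; one hedged workaround is to use the pane timing plus the window bound as a lower-bound proxy. A sketch:

    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.windowing.BoundedWindow;
    import org.apache.beam.sdk.transforms.windowing.PaneInfo;

    public class LogLatenessFn<T> extends DoFn<T, T> {
      @ProcessElement
      public void process(ProcessContext ctx, BoundedWindow window) {
        if (ctx.pane().getTiming() == PaneInfo.Timing.LATE) {
          // A LATE pane means the watermark has already passed window.maxTimestamp(),
          // so this difference is a lower bound on how far behind the watermark
          // the element arrived.
          long lateByAtLeastMs =
              window.maxTimestamp().getMillis() - ctx.timestamp().getMillis();
          System.out.printf("late element: at least %d ms behind watermark%n", lateByAtLeastMs);
        }
        ctx.output(ctx.element());
      }
    }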