Re: Avro SortedKeyValueFile

2019-09-27 Thread Lukasz Cwik
Not to my knowledge. On Fri, Sep 27, 2019 at 7:48 AM Shannon Duncan wrote: > Does Beam support the Avro SortedKeyValueFile? > I'm not able to find any documentation in the Beam area on this. > I need

Re: Multiple iterations after GroupByKey with SparkRunner

2019-09-26 Thread Lukasz Cwik
Jan, in Beam, users expect to be able to iterate the GBK output multiple times, even from within the same ParDo. Is this something that Beam on the Spark runner never supported? On Thu, Sep 26, 2019 at 6:50 AM Jan Lukavský wrote: > Hi Gershi, > could you please outline the pipeline you are trying

Re: Python Portable Runner Issues

2019-09-25 Thread Lukasz Cwik
Google Dataflow currently uses a JSON representation of the pipeline graph as well as the pipeline proto. Representing the graph in two different ways leads to some wonderful *features*. Google Dataflow also sidesteps the Beam job service since Dataflow has its own Job API. Supporting the

Re: Where is /opt/apache/beam/boot?

2019-09-18 Thread Lukasz Cwik
It is embedded inside the docker container that corresponds to the SDK you're using. Python container boot src: https://github.com/apache/beam/blob/master/sdks/python/container/boot.go Java container boot src: https://github.com/apache/beam/blob/master/sdks/java/container/boot.go Go container

Re: How to debug dataflow locally

2019-09-13 Thread Lukasz Cwik
In general there is no generic source/sink fake/mock/emulator that runs for all sources/sinks locally. Configuration and data are handled on a case-by-case basis. Most of the Apache Beam integration tests either launch a local implementation that is specific to the source/sink (such as a DB for JdbcIO) or

Re: How do you write portable runner pipeline on separate python code ?

2019-09-13 Thread Lukasz Cwik
…Flink/Spark cluster or Dataflow). It would be very convenient if we could automatically stage local files to be read as artifacts that could be consumed by any worker (possibly via external directory mounting in the

Re: How can I work with multiple pcollections?

2019-09-12 Thread Lukasz Cwik
Yes, you can create multiple output PCollections using a ParDo with multiple outputs instead of inserting them into Mongo. It could be useful to read through the programming guide sections on PCollections[1] and PTransforms with multiple outputs[2]; feel free to return with more questions. 1:
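A minimal Java sketch of that multi-output pattern, along the lines of the programming guide referenced above (the tag names, element types, and the isValid check are illustrative, not from the thread):

    final TupleTag<String> validTag = new TupleTag<String>() {};
    final TupleTag<String> invalidTag = new TupleTag<String>() {};

    PCollectionTuple results = input.apply("Split", ParDo
        .of(new DoFn<String, String>() {
          @ProcessElement
          public void process(ProcessContext c) {
            if (isValid(c.element())) {
              c.output(c.element());             // main output (validTag)
            } else {
              c.output(invalidTag, c.element()); // additional output
            }
          }
        })
        .withOutputTags(validTag, TupleTagList.of(invalidTag)));

    PCollection<String> valid = results.get(validTag);
    PCollection<String> invalid = results.get(invalidTag);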

Re: How do you write portable runner pipeline on separate python code ?

2019-09-12 Thread Lukasz Cwik
When you use a local filesystem path and a docker environment, "/tmp" is written inside the container. You can solve this issue by:
* Using a "remote" filesystem such as HDFS/S3/GCS/...
* Mounting an external directory into the container so that any "local" writes appear outside the container
*

Re: How do I run Beam Python pipelines using Flink deployed on Kubernetes?

2019-09-11 Thread Lukasz Cwik
…in the Python container (source file boot.go). Is that right? > Best, > Andrea > On Tuesday, 10 September 2019 at 23:19, Lukasz Cwik wrote:

Re: How do I run Beam Python pipelines using Flink deployed on Kubernetes?

2019-09-10 Thread Lukasz Cwik
External environments are mainly used for testing since they represent environments that are already running. There are other clever uses of them as well, but likely not what you're looking for. The docker way will be nice once there is integration between the FlinkRunner and its supported cluster

Re: How to buffer events using spark portable runner ?

2019-09-08 Thread Lukasz Cwik
Try using Apache Flink. On Sun, Sep 8, 2019 at 6:23 AM Yu Watanabe wrote: > Hello. > I would like to ask a question related to timely processing, as described on the page below. > https://beam.apache.org/blog/2017/08/28/timely-processing.html > Python version: 3.7.4 > Apache Beam version: 2.15.0

Re: installing Apache Beam on Pycharm with Python 3.7

2019-09-05 Thread Lukasz Cwik
+user -dev Can you provide any error messages or details as to what failed? On Thu, Sep 5, 2019 at 9:43 AM Priti Badami < pbadami.srdataengin...@gmail.com> wrote: > Hi Dev Team, > > I am trying to install Apache Beam. I have pip 19.2.3 but I am facing > issues while installing Beam. > > please

Re: [Java] Compressed SequenceFile

2019-09-05 Thread Lukasz Cwik
Sorry for the poor experience and thanks for sharing a solution with others. On Thu, Sep 5, 2019 at 6:34 AM Shannon Duncan wrote: > FYI this was due to hadoop version. 3.2.0 was throwing this error, but > rolled back to version in googles pom.xml 2.7.4 and it is working fine now. > > Kindof

Re: Setting environment and system properties on Dataflow workers

2019-09-03 Thread Lukasz Cwik
…Google's docs, so perhaps that would be the best place for this. > On Fri, Aug 30, 2019 at 1:53 PM Lukasz Cwik wrote: >> There is a way to run arbitrary code on JVM startup via a JVM initializer[1] in the Dataflow worker and in the portable Java worker as well

Re: Save state on tear down

2019-09-03 Thread Lukasz Cwik
…execution graph, causing windows to close, timers to fire, state to be emitted and then garbage collected, and so forth. > thanks, > -chad > On Fri, Aug 16, 2019 at 2:47 PM Jose Delgado wrote: >> I see, thank you Lukasz.

Re: Setting environment and system properties on Dataflow workers

2019-08-30 Thread Lukasz Cwik
There is a way to run arbitrary code on JVM startup via a JVM initializer[1] in the Dataflow worker and in the portable Java worker as well. You should be able to mutate system properties at that point in time since Java allows for system properties to be mutated. The standard Java runtime
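A minimal sketch of such an initializer, assuming the JvmInitializer hook referenced as [1] above (registered via ServiceLoader/AutoService; the property name is illustrative):

    import com.google.auto.service.AutoService;
    import org.apache.beam.sdk.harness.JvmInitializer;

    @AutoService(JvmInitializer.class)
    public class SystemPropertyInitializer implements JvmInitializer {
      @Override
      public void onStartup() {
        // Runs once when the worker JVM starts, before any elements are processed.
        System.setProperty("example.property", "example-value");
      }
    }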

Re: Limit log files count with Dataflow Runner Logging

2019-08-29 Thread Lukasz Cwik
The only logging options that Dataflow exposes today limit what gets logged and not anything about how many rotated logs there are or how big they are. All Dataflow logging options are available here:

Re: NoSuchElementException after Adding SplittableDoFn

2019-08-27 Thread Lukasz Cwik
…glad that this will be fixed in a future release. > Is there any way that I can avoid hitting this problem before 2.16 is released? > On Tue, Aug 27, 2019 at 12:57 PM Lukasz Cwik wrote: >> This is a known issue and was fixed with http

Re: NoSuchElementException after Adding SplittableDoFn

2019-08-27 Thread Lukasz Cwik
This is a known issue and was fixed with https://github.com/apache/beam/commit/5d9bb4595c763025a369a959e18c6dd288e72314#diff-f149847d2c06f56ea591cab8d862c960 It is meant to be released as part of 2.16.0 On Tue, Aug 27, 2019 at 11:41 AM Zhiheng Huang wrote: > Hi all, > > Looking for help to

Re: Design question regarding streaming and sorting

2019-08-26 Thread Lukasz Cwik
Once you choose to start processing a row, can a different row preempt that work, or will you always finish a row once you've started it? So far, I'm agreeing with what Jan has said. I believe there is a way to get what you want working, but with a lot of unneeded complexity, since not much in your problem

Re: Design question regarding streaming and sorting

2019-08-24 Thread Lukasz Cwik
Side inputs don't change for the lifetime of a bundle. Only on new bundles would you get a possibly updated new view of the side input so you may not see the changes to priority as quickly as you may expect. How quickly this happens is all dependent on the runner's internal implementation details.

Re: Multiple file systems configuration

2019-08-20 Thread Lukasz Cwik
…Usually, in other IOs, we can do this easily by having specific methods, like "withConfiguration()", "withCredentialsProvider()", etc. for Read and Write, but FileSystems-based IO could be configured only with PipelineOptions afaik. There was a thread about that a while ago [1]

Re: Late data handling in Python SDK

2019-08-09 Thread Lukasz Cwik
+dev Related JIRAs I found are BEAM-3759 and BEAM-7825. This has been a priority as the community has been trying to get streaming Python execution working on multiple Beam runners. On Wed, Aug 7, 2019 at 2:31 AM Sam Stephens wrote: > Hi all, > I’ve been reading into, and

Re: BigQueryIO - insert retry policy in Apache Beam

2019-08-08 Thread Lukasz Cwik
On Wed, Aug 7, 2019 at 10:55 PM Yohei Onishi wrote: > Hi, > If you are familiar with BigQuery insert retry policies in the Apache Beam API (BigQueryIO), please help me understand the following behavior. I am using the Dataflow runner. > - How does a Dataflow job behave if I specify

Re: Caused by: java.lang.IllegalArgumentException: URI is not hierarchical

2019-08-08 Thread Lukasz Cwik
+user Can you supply the full stacktrace for the exception? On Tue, Aug 6, 2019 at 3:22 PM Jayanth Kolli wrote: > Getting the following error while trying to run the DataflowRunner example in self-executing JAR mode. > I have an example word count Spring Boot application running fine with

Re: Latency of Google Dataflow with Pubsub

2019-08-05 Thread Lukasz Cwik
+dev On Mon, Aug 5, 2019 at 12:49 PM Dmitry Minaev wrote: > Hi there, > > I'm building streaming pipelines in Beam (using Google Dataflow runner) > and using Google Pubsub as a message broker. I've made a couple of > experiments with a very simple pipeline: consume events from Pubsub >

Re: Save state on tear down

2019-08-05 Thread Lukasz Cwik
This is not possible today. There have been discussions about pipeline drain, snapshot and update [1, 2] which may provide additional details of what is planned and could use your feedback. 1: https://docs.google.com/document/d/1NExwHlj-2q2WUGhSO4jTu8XGhDPmm3cllSN8IMmWci8 2:

Re: How to deal with failed Checkpoint? What is current behavior for subsequent checkpoints?

2019-08-05 Thread Lukasz Cwik
https://s.apache.org/beam-finalizing-bundles should give you a bunch more details, but I replied inline to your questions as well. On Fri, Jul 19, 2019 at 10:40 AM Ken Barr wrote: > Reading the two statements below, I conclude that CheckpointMark.finalizeCheckpoint() will be called in order,

Re: Beam python pipeline on spark

2019-08-05 Thread Lukasz Cwik
I added a new stackoverflow answer pointing to the link about getting started. Please upvote the answer to increase visibility. On Sun, Aug 4, 2019 at 12:46 AM Chad Dombrova wrote: > It's in the doc that Kyle sent, but it's worth mentioning here that > streaming is not yet supported. > > -chad

Re: [Java] Using a complex datastructure as Key for KV

2019-07-12 Thread Lukasz Cwik
…Shannon Duncan wrote: > Aha, makes sense. Thanks! > On Fri, Jul 12, 2019 at 9:26 AM Lukasz Cwik wrote: >> TreeMapCoder.of(StringUtf8Coder.of(), ListCoder.of(VarIntCoder.of())); >> On Fri, Jul 12, 2019 at 10:22 AM Shannon D

Re: [Java] Using a complex datastructure as Key for KV

2019-07-12 Thread Lukasz Cwik
TreeMapCoder.of(StringUtf8Coder.of(), ListCoder.of(VarIntCoder.of())); On Fri, Jul 12, 2019 at 10:22 AM Shannon Duncan wrote: > So I have my custom coder created for TreeMap and I'm ready to set it... > So my type is "TreeMap<String, List<Integer>>" > What do I put for ".setCoder(TreeMapCoder.of(???, ???))"?
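Putting the two messages together, a minimal sketch of attaching the coder (TreeMapCoder is the custom coder from this thread, not a built-in Beam coder, and BuildTreeMapFn is a placeholder for whatever DoFn produces the TreeMap):

    PCollection<TreeMap<String, List<Integer>>> sorted = input
        .apply(ParDo.of(new BuildTreeMapFn()))
        .setCoder(TreeMapCoder.of(StringUtf8Coder.of(), ListCoder.of(VarIntCoder.of())));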

Re: [Opinion] [Question] Python SDK & Java SDK

2019-07-10 Thread Lukasz Cwik
Age is the largest consideration, since the Python SDK was started a few years after the Java one. Another consideration is that the Python SDK only worked on Dataflow; only recently, due to the work on portability, have a few other runners been able to execute Python pipelines.

Re: [Python SDK] Avro read/write & Indexing

2019-07-09 Thread Lukasz Cwik
Typically this would be done by reading in the contents of the entire file into a map side input and then consuming that side input within a DoFn. Unfortunately, only Dataflow supports really large side inputs with an efficient access pattern and only when using Beam Java for bounded pipelines.
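A minimal Java sketch of that pattern, reading a file into a map side input and looking it up from a DoFn (the file path, CSV parsing, and key/value types are illustrative):

    PCollectionView<Map<String, String>> indexView = pipeline
        .apply(TextIO.read().from("gs://my-bucket/index.csv"))
        .apply(MapElements
            .into(TypeDescriptors.kvs(TypeDescriptors.strings(), TypeDescriptors.strings()))
            .via(line -> KV.of(line.split(",")[0], line.split(",")[1])))
        .apply(View.asMap());

    records.apply(ParDo.of(new DoFn<String, String>() {
      @ProcessElement
      public void process(ProcessContext c) {
        Map<String, String> index = c.sideInput(indexView);  // random-access lookup
        c.output(index.getOrDefault(c.element(), "missing"));
      }
    }).withSideInputs(indexView));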

Re: Apache Beam issue | Reading Avro files and pushing to Bigquery

2019-07-09 Thread Lukasz Cwik
+user (please use user@ for questions about using the product and restrict dev@ to questions related to developing the product). Can you provide the rest of the failure reason (and any stacktraces from the workers related to the failures)? On Tue, Jul 9, 2019 at 11:04 AM Dhiraj Sardana

Re: gRPC method to get a pipeline definition?

2019-06-28 Thread Lukasz Cwik
+dev On Fri, Jun 28, 2019 at 8:20 AM Chad Dombrova wrote: > > I think the simplest solution would be to have some kind of override/hook >> that allows Flink/Spark/... to provide storage. They already have a concept >> of a job and know how to store them so can we piggyback the Beam pipeline >>

Re: gRPC method to get a pipeline definition?

2019-06-28 Thread Lukasz Cwik
I think the simplest solution would be to have some kind of override/hook that allows Flink/Spark/... to provide storage. They already have a concept of a job and know how to store them, so we can piggyback the Beam pipeline there. On Fri, Jun 28, 2019 at 7:49 AM Chad Dombrova wrote: > In

Re: gRPC method to get a pipeline definition?

2019-06-28 Thread Lukasz Cwik
The InMemoryJobService is meant to be a simple implementation. Adding a job expiration N minutes after the job completes might make sense. In reality, a more complex job service is needed that is backed by some kind of persistent storage or stateful service. On Thu, Jun 27, 2019 at 10:45 PM Chad

Re: [ANNOUNCE] Spark portable runner (batch) now available for Java, Python, Go

2019-06-21 Thread Lukasz Cwik
This is great news. Can't wait to see more. On Fri, Jun 21, 2019 at 8:56 AM Alexey Romanenko wrote: > Amazing job! Thank you, Kyle! > > On 19 Jun 2019, at 18:10, David Morávek wrote: > > Great job Kyle, thanks for pushing this forward! > > Sent from my iPhone > > On 18 Jun 2019, at 12:58,

Re: Can we do pardo inside a pardo?

2019-06-17 Thread Lukasz Cwik
Typically you would apply your first ParDo, getting back a PCollection, and then apply your second ParDo to the returned PCollection. You can get a lot more details in the programming guide[1]. For example: PCollection<String> input = ...; input.apply("ParDo1", ParDo.of(myDoFn1)).apply("ParDo2", ParDo.of(myDoFn2));

Re: Why is my RabbitMq message never acknowledged ?

2019-06-14 Thread Lukasz Cwik
In the future, you will be able to check and give a hard error if checkpointing is disabled yet finalization is requested for portable pipelines: https://github.com/apache/beam/blob/2be7457a4c0b311c3bd784b3f00b425596adeb06/model/pipeline/src/main/proto/beam_runner_api.proto#L382 On Fri, Jun 14,

Re: Why Apache beam can't infer the default coder when using KV?

2019-06-05 Thread Lukasz Cwik
A large cause of coder inference issues is type erasure in Java[1]. For your example, I would suspect that it should have worked since your ConcatWordsCombineFn doesn't have any type variables declared. Can you add the message and stacktrace for the exception that is failing here[2] to your

Re: Question about --environment_type argument

2019-05-28 Thread Lukasz Cwik
Are you losing the META-INF/ ServiceLoader entries related to binding the FileSystem via the FileSystemRegistrar when building the uber jar[1]? It does look like the Flink JobServer driver is registering the file systems[2]. 1:

Re: Problem with gzip

2019-05-14 Thread Lukasz Cwik
…Dataflow-specific features with Python streaming execution: * Streaming autoscaling. > I doubt whether this approach can solve my issue. > Thanks so much! > Allie > On Tue, May 14, 2019 at 11:16 AM, Lukasz Cwik wrote:

Re: Problem with gzip

2019-05-14 Thread Lukasz Cwik
…GroupByKey won't wait until all data has been read? > Thanks! > Allie > On Fri, May 10, 2019 at 5:36 PM Lukasz Cwik wrote: >> There is no such flag to turn off fusion. >> Writing 100s of GiBs of uncompressed data to r

Re: How to configure TLS connections for Dataflow job that use Kafka and Schema Registry

2019-05-13 Thread Lukasz Cwik
…Confluent Schema Registry client internally? > Yohei Onishi > On Thu, May 9, 2019 at 1:12 PM Vishwas Bm wrote: >> Hi Yohei, >> I had tried some time back wi

Re: Problem with gzip

2019-05-10 Thread Lukasz Cwik
…The file format for our users is mostly gzip, since uncompressed files would be too costly to store (it could be hundreds of GB). > Thanks, > Allie

Re: Problem with gzip

2019-05-10 Thread Lukasz Cwik
…at 1:25 PM Allie Chen wrote: > Yes, that is correct. > On Fri, May 10, 2019 at 4:21 PM Allie Chen wrote: >> Yes. >> On Fri, May 10, 2019 at 4:19 PM Lukasz Cwik wrote:

Re: Problem with gzip

2019-05-10 Thread Lukasz Cwik
…to run, according to one test (24 gzip files, 17 million lines in total) I did. > The file format for our users is mostly gzip, since uncompressed files would be too costly to store (it could be hundreds of GB). > Thanks, > Allie

Re: Problem with gzip

2019-05-10 Thread Lukasz Cwik
+user@beam.apache.org Reshuffle on Google Cloud Dataflow for a bounded pipeline waits till all the data has been read before the next transforms can run. After the reshuffle, the data should have been processed in parallel across the workers. Did you see this? Are you able to change the input

Re: How to configure TLS connections for Dataflow job that use Kafka and Schema Registry

2019-05-08 Thread Lukasz Cwik
I replied to the SO question with more details. The issue is that you're trying to load a truststore (file) on the VM which doesn't exist. You need to make that file accessible in some way. The other SO question (

Re: Is it safe to cache the value of a singleton view (with a global window) in a DoFn?

2019-05-07 Thread Lukasz Cwik
Keep your code simple and rely on the runner caching the value locally so it should be very cheap to access. If you have a performance issue due to a runner lacking caching, it would be good to hear about it so we could file a JIRA about it. On Mon, May 6, 2019 at 4:24 PM Kenneth Knowles wrote:

Re: kafka client interoperability

2019-05-02 Thread Lukasz Cwik
+dev On Thu, May 2, 2019 at 10:34 AM Moorhead,Richard < richard.moorhe...@cerner.com> wrote: > In Beam 2.9.0, this check was made:

Re: Will Beam add any overhead or lack certain API/functions available in Spark/Flink?

2019-05-02 Thread Lukasz Cwik
On Thu, May 2, 2019 at 6:29 AM Maximilian Michels wrote: > Couple of comments: > * Flink transforms > It wouldn't be hard to add a way to run arbitrary Flink operators through the Beam API. Like you said, once you go down that road, you lose the ability to run the pipeline on a

Re: cancel job

2019-05-02 Thread Lukasz Cwik
+user@beam.apache.org On Thu, May 2, 2019 at 9:51 AM Lukasz Cwik wrote:
> ... build pipeline ...
> pipeline_result = p.run()
> if job_taking_too_long:
>   pipeline_result.cancel()
> Python: https://github.com/apache/beam/blob/95d0ac5e5cb59fd0c6a8a4861a38a7087a6c46b5/sd

Re: AvroUtils converting generic record to Beam Row causes class cast exception

2019-04-15 Thread Lukasz Cwik
+dev On Sun, Apr 14, 2019 at 10:29 PM Vishwas Bm wrote: > Hi, > > Below is my pipeline: > > KafkaSource (KafkaIO.read) --> Pardo ---> BeamSql > ---> KafkaSink(KafkaIO.write) > > > The avro schema of the topic has a field of logical type > timestamp-millis.

Re: Side Inputs size

2019-04-08 Thread Lukasz Cwik
Side input performance and scaling is runner dependent. Runners should attempt to provide support for efficient random access lookup in the maps. Side inputs should also be cached across elements if the map hasn't changed which runners should also be capable of doing. So yes, side input size can

Re: Implementation an S3 file system for python SDK - Updated

2019-04-08 Thread Lukasz Cwik
…SDF-based improvements to split when many files are being matched. > Best > -P. > On Mon, Apr 8, 2019 at 10:00 AM Alex Amato wrote: >> +Lukasz Cwik, +Boyuan Zhang, +Lara Schmidt >> Should splittable DoFn

Re: "Dynamic" read / expand of PCollection element by IOs

2019-03-14 Thread Lukasz Cwik
1) You're looking for SplittableDoFn[1]. It is still in development, and a conversion of all the current IO connectors that exist today to be able to consume a PCollection of resources is yet to come. There are some limited use cases that exist already, like FileIO.match[2], and if these fit your use case

Re: Join PCollection Data with HBase Large Data - Suggestion Requested

2018-12-04 Thread Lukasz Cwik
…The table size is like 1TB. The output of the processing is stored in the same HBase table. Please let me know if you need more context. > Thanks, > Chandan > On Tue, Dec 4, 2018 at 11:32 AM Steve Niemitz wrote:

Re: Generic Type PTransform

2018-12-04 Thread Lukasz Cwik
…of K and V. > So, what coder should I set? Can you please give a code example of how to do that? > Really appreciate your help, > Eran > On Monday, December 03, 2018 at 7:10 PM, Lukasz Cwik wrote:

Re: Generic Type PTransform

2018-12-03 Thread Lukasz Cwik
Apache Beam attempts to propagate coders by looking at any typing information available, but Java has a lot of type erasure and there are many scenarios where coders can't be propagated forward from a previous transform. Take the following two examples (note that there are

Re: Rolling File Query

2018-11-26 Thread Lukasz Cwik
…Flink, where it rolls to a new file after the batch size is reached. > Regards, > Vinay Patil > On Mon, Nov 26, 2018 at 1:04 PM Lukasz Cwik wrote: >> You could use a StatefulDoFn to buffer up 10k worth of data and write it. >> There is an example fo

Re: Rolling File Query

2018-11-26 Thread Lukasz Cwik
You could use a StatefulDoFn to buffer up 10k worth of data and write it. There is an example for batched RPC[1] that could be re-used using the FileSystems API to just create files. You'll want to use a reshuffle + a fixed number of keys where the fixed number of keys controls the write
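A minimal sketch of the buffering half of that suggestion, using a stateful DoFn keyed by a fixed set of shard keys (the 10k threshold and the writeFile helper over the FileSystems API are illustrative; the batched-RPC example linked as [1] adds the timer needed to flush leftovers):

    class BufferingWriteFn extends DoFn<KV<Integer, String>, Void> {
      private static final int MAX_BUFFER_SIZE = 10_000;

      @StateId("buffer")
      private final StateSpec<BagState<String>> bufferSpec = StateSpecs.bag(StringUtf8Coder.of());

      @StateId("count")
      private final StateSpec<ValueState<Integer>> countSpec = StateSpecs.value(VarIntCoder.of());

      @ProcessElement
      public void process(
          ProcessContext c,
          @StateId("buffer") BagState<String> buffer,
          @StateId("count") ValueState<Integer> count) throws IOException {
        buffer.add(c.element().getValue());
        Integer current = count.read();
        int newCount = (current == null ? 0 : current) + 1;
        if (newCount >= MAX_BUFFER_SIZE) {
          writeFile(buffer.read());  // hypothetical helper that creates a file via FileSystems
          buffer.clear();
          count.write(0);
        } else {
          count.write(newCount);
        }
      }
    }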

Re: ValueState for Dataflow runner and MapState for others

2018-11-12 Thread Lukasz Cwik
Could you write two different implementations of the DoFn and put your processing logic in another function that both DoFn's would invoke after doing any accessing of the state? Then during pipeline construction you could choose to apply the Map one or the Value one based upon which runner your
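A minimal sketch of that construction-time selection (the DoFn class names are placeholders for the two state implementations, and the runner check is one possible way to branch):

    DoFn<KV<String, Long>, String> stateFn =
        options.getRunner().getSimpleName().equals("DataflowRunner")
            ? new ValueStateVariantFn()  // uses ValueState internally
            : new MapStateVariantFn();   // uses MapState internally
    input.apply("Process", ParDo.of(stateFn));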

Re: Unable to provide a Coder for org.apache.hadoop.hbase.client.Mutation.

2018-11-09 Thread Lukasz Cwik
…formats list. > And there is another way to solve this question, by adding HBaseMutationCoder in code manually. > Thanks. > On Fri, Nov 9, 2018 at 3:16 AM Lukasz Cwik wrote: >> Posted my answer on SO, reposting here for visibility: >> Note that t

Re: Unable to provide a Coder for org.apache.hadoop.hbase.client.Mutation.

2018-11-08 Thread Lukasz Cwik
Posted my answer on SO, reposting here for visibility: Note that the HBaseCoderProviderRegistrar[1] already registers the HBaseMutationCoder for the Mutation type automatically. Using the maven-shade plugin without handling service files contained in META-INF/ inside your output jar is a

Re: Experience with KafkaIO -> BigQueryIO

2018-11-06 Thread Lukasz Cwik
Since you're using FILE_LOADS as the BigQueryIO write method, did you see a file being created for each partitioned table based upon the requested triggering frequency? Figuring out whether the problem was upstream from creating the file or downstream after the file was created would help debug

Re: Flink 1.6 Support

2018-10-30 Thread Lukasz Cwik
+dev On Tue, Oct 30, 2018 at 10:30 AM Jins George wrote: > Hi Community, > > Noticed that the Beam 2.8 release comes with flink 1.5.x dependency. > Are there any plans to upgrade flink to 1.6.x in next beam release. ( > I am looking for the better k8s support in Flink 1.6) > > Thanks, > >

Re: Watermark Propagation using SDF

2018-10-30 Thread Lukasz Cwik
Yes, reporting watermarks is a key concept within SDFs and needed for downstream windowing operations. On Mon, Oct 29, 2018 at 11:00 PM Abdul Qadeer wrote: > Hi! > > I would like to know if watermarks can be updated from SDFs and propagated > to other operators for windowing? As per the design

Re: KafkaIO - Deadletter output

2018-10-24 Thread Lukasz Cwik
…records in the timestamp handler passed to KafkaIO itself, and the correct timestamp for parsable records. That should work too, right? > On Wed, Oct 24, 2018 at 4:21 PM Lukasz Cwik wrote: >> Yes, that would be fine. >> The user could then use a ParDo whi

Re: KafkaIO - Deadletter output

2018-10-24 Thread Lukasz Cwik
…#outputWithTimestamp-org.apache.beam.sdk.values.TupleTag-T-org.joda.time.Instant- On Wed, Oct 24, 2018 at 1:21 PM Raghu Angadi wrote: > Thanks. So returning min timestamp is OK, right (assuming the application is fine with what it means)? > On Wed, Oct 24, 2018 at 1:17 PM Lukasz Cwik wrote:

Re: KafkaIO - Deadletter output

2018-10-24 Thread Lukasz Cwik
…at 12:51 PM Raghu Angadi wrote: > On Wed, Oct 24, 2018 at 12:33 PM Lukasz Cwik wrote: >> You would have to return the min timestamp for all records, otherwise the watermark may have advanced and you would be outputting records that are droppably late.

Re: KafkaIO - Deadletter output

2018-10-24 Thread Lukasz Cwik
…pipeline defined under the kafkaio package? > On Wed, Oct 24, 2018 at 10:19 AM Lukasz Cwik wrote: >> In this case, the user is attempting to handle errors when parsing the timestamp. The timestamp controls the watermark for the UnboundedSource,

Re: KafkaIO - Deadletter output

2018-10-24 Thread Lukasz Cwik
In this case, the user is attempting to handle errors when parsing the timestamp. The timestamp controls the watermark for the UnboundedSource; how would they control the watermark in a downstream ParDo? On Wed, Oct 24, 2018 at 9:48 AM Raghu Angadi wrote: > On Wed, Oct 24, 2018 at 7:19 AM

Re: write to a kafka topic that is set in data

2018-10-19 Thread Lukasz Cwik
…until the functionality is added, thank you! > Thanks, > Dmitry > On Fri, Oct 19, 2018 at 2:53 PM Lukasz Cwik wrote: >> If there are a fixed number of topics, you could partition your write by structuring your pipeline as such:

Re: write to a kafka topic that is set in data

2018-10-19 Thread Lukasz Cwik
If there are a fixed number of topics, you could partition your write by structuring your pipeline as such:
ParDo(PartitionByTopic) ----> KafkaIO.write(topicA)
                        \---> KafkaIO.write(topicB)
                        \---> KafkaIO.write(...)
There is no support currently for
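A minimal Java sketch of that shape with the Partition transform and KafkaIO (the broker address, topic names, and the topicIndexFor helper are illustrative):

    PCollectionList<KV<String, String>> byTopic = input.apply(
        Partition.of(2, new PartitionFn<KV<String, String>>() {
          @Override
          public int partitionFor(KV<String, String> record, int numPartitions) {
            return topicIndexFor(record);  // hypothetical record -> topic index mapping
          }
        }));

    byTopic.get(0).apply(KafkaIO.<String, String>write()
        .withBootstrapServers("broker:9092")
        .withTopic("topicA")
        .withKeySerializer(StringSerializer.class)
        .withValueSerializer(StringSerializer.class));
    byTopic.get(1).apply(KafkaIO.<String, String>write()
        .withBootstrapServers("broker:9092")
        .withTopic("topicB")
        .withKeySerializer(StringSerializer.class)
        .withValueSerializer(StringSerializer.class));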

Re: Pass-through transform for accessing runner-specific functionality from Beam?

2018-10-17 Thread Lukasz Cwik
+t...@apache.org has been doing something very similar, using it to support native Flink IO within Apache Beam at the company he works for. Note that the community had a discussion about runner-specific extensions and is currently leaning[1] towards having support for them for internal use

Re: joining two streams of the same type

2018-10-17 Thread Lukasz Cwik
Yes. On Tue, Oct 16, 2018 at 4:34 PM mina...@gmail.com wrote: > Ok, thanks, but still I have to manually specify how to extract messages of a specific stream (previously flattened), e.g. via "instanceof". Right? > On 2018/10/16 22:41:36, Lukasz Cwik wrote:

Re: joining two streams of the same type

2018-10-16 Thread Lukasz Cwik
Use Flatten to "merge" two PCollections and then GBK to group all the records by key. On Tue, Oct 16, 2018 at 3:27 PM mina...@gmail.com wrote: > I realized I even can't follow that approach with the same type of > messages merged since I'm getting an exception: Multiple entries with same > key.
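A minimal Java sketch of that combination (key and value types are illustrative):

    PCollection<KV<String, String>> merged = PCollectionList
        .of(streamA)
        .and(streamB)
        .apply(Flatten.pCollections());

    PCollection<KV<String, Iterable<String>>> grouped =
        merged.apply(GroupByKey.create());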

Re: Ingesting from KafkaIO into HDFS, problems with checkpoint and data lost on top of yarn + spark

2018-10-16 Thread Lukasz Cwik
There has been some discussion in the past about adding a "drain" feature to Apache Beam which would allow this intermediate data to be output so it isn't lost. The caveat is that you'll be outputting partial results. The design doc was shared here:

Re: FYI on Slack Channels

2018-10-15 Thread Lukasz Cwik
…#beam-sql #beam-testing On Tue, Jun 26, 2018 at 1:18 AM Romain Manni-Bucau < rmannibu...@gmail.com> wrote:

Re: Why does Beam write to Spanner twice on REPORT_FAILURES mode?

2018-09-24 Thread Lukasz Cwik
+Mairbek Khadikov , do you know why the writes are first attempted as a batch and then re-attempted individually if the batch call failed? (Might be worth a code comment to explain the retry behavior) On Mon, Sep 24, 2018 at 7:12 AM NaHeon Kim wrote: > Hi community! > > I found interesting

Re: [Proposal] Add a static PTransform.compose() method for composing transforms in a lambda expression

2018-09-19 Thread Lukasz Cwik
…would presumably expose the ability to specify a coder. > I don't have enough context yet to comment on whether display data might be an issue, so I do hope the user list can provide input there. > On Wed, Sep 19, 2018 at 4:52 PM Lukasz Cwik wrote: >> Thanks for the p

Re: [Proposal] Add a static PTransform.compose() method for composing transforms in a lambda expression

2018-09-19 Thread Lukasz Cwik
Thanks for the proposal and it does seem to make the API cleaner to build anonymous composite transforms. In your experience have you had issues where the API doesn't work out well because the PTransform: * is not able to override how the output coder is inferred? * can't supply display data?

Re: BEAM-2059 out of sync with docs?

2018-09-19 Thread Lukasz Cwik
Only attempted metrics are supported in Dataflow streaming. Committed metrics aren't supported. On Wed, Sep 19, 2018 at 5:14 AM Vince Gonzalez wrote: > Hi, > > The runner compatibility matrix says the following for Metrics / Dataflow: > > Metrics are not yet supported at all in streaming mode,

Re: watermark

2018-09-17 Thread Lukasz Cwik
a) Do you have a screenshot you can share? b) Yes, and it depends on your trigger definition; see this video presentation[1] that goes into some detail on the topic, and this entire section about windowing[2]. c) Typically no. Your sources control the watermark based upon the watermark your source

Re: cores and partitions in DataFlow

2018-09-14 Thread Lukasz Cwik
Dataflow has logical partitions of work and relies on auto-scaling and dynamic work rebalancing to distribute and redistribute work. Typically machine size vs number of machines shouldn't matter unless you run really small or very large jobs since there is no point in running a job with a machine

Re: windowing -> groupby

2018-09-13 Thread Lukasz Cwik
You can even change windowing strategies between group-bys with Window.into. On Thu, Sep 13, 2018 at 3:29 PM Lukasz Cwik wrote: > Multiple group-bys are supported. > On Thu, Sep 13, 2018 at 2:36 PM asharma...@gmail.com wrote: >> Hi >> from docume
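A minimal Java sketch of re-windowing between two group-bys (the window sizes and the SummarizeFn that re-keys the grouped output are illustrative):

    input
        .apply(Window.<KV<String, Long>>into(FixedWindows.of(Duration.standardMinutes(1))))
        .apply(GroupByKey.create())
        .apply(ParDo.of(new SummarizeFn()))  // emits KV<String, Long> again
        .apply(Window.<KV<String, Long>>into(FixedWindows.of(Duration.standardMinutes(10))))
        .apply(GroupByKey.create());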

Re: windowing -> groupby

2018-09-13 Thread Lukasz Cwik
Multiple group-bys are supported. On Thu, Sep 13, 2018 at 2:36 PM asharma...@gmail.com wrote: > Hi > from the documentation, group-by is applied on a key and window basis. > If my source is Pubsub (unbounded) - does Beam support applying multiple group-by transformations, and all of the applied group-by

Re: writing a single record to Kafka ...

2018-09-12 Thread Lukasz Cwik
…appreciate your advice. > -- > Mahesh Vangala > (Ph) 443-326-1957 > (web) mvangala.com > On Tue, Sep 11, 2018 at 6:19 PM Lukasz Cwik wrote: >> A PCollection is a bag of elements. PCollections ca

Re: Plans for GcsIO or another mechanism to read metadata about GCS objects?

2018-09-10 Thread Lukasz Cwik
There are currently no plans for a GcsIO or extensions to the FileSystems APIs to expose file system specific metadata. Your best short term bet is to invoke the GCS API directly. On Mon, Sep 10, 2018 at 10:03 AM Eric Beach wrote: > tl;dr - Use case is reading metadata such as ACLs, owner,

Re: How to update side inputs.

2018-09-06 Thread Lukasz Cwik
…StreamingDataflowWorker.access$1000(StreamingDataflowWorker.java:135)
> com.google.cloud.dataflow.worker.StreamingDataflowWorker$6.run(StreamingDataflowWorker.java:966)
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)

Re: How to update side inputs.

2018-09-04 Thread Lukasz Cwik
…Lukasz Cwik wrote: > Bart, the error you are hitting is because the other part of the pipeline is operating on a global window. > Every time a side input is looked up in the DoFn, the main window (global window in your case) is mapped onto the side input window

Re: How to update side inputs.

2018-09-04 Thread Lukasz Cwik
…the Map. I've tried removing one or both of them, but I keep getting the same issue. The other part of my pipeline is operating on a global window at that point. So it seems there is a mismatch, but I'm not sure how to resolve it. > On Tue, Sep 4, 2018 at 19:11, Lukasz Cwik wrote:

Re: How to update side inputs.

2018-09-04 Thread Lukasz Cwik
Jose, what Bart is recommending is a path that should work. Bart, what do you mean by conflicting windows? On Mon, Sep 3, 2018 at 11:29 PM Bart Aelterman wrote: > Hi Jose, > > > You could generate a sequence of "ticks" and use that as input to > continuously update your side input. This is
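A minimal Java sketch of the tick-driven refresh Bart describes, following the slowly-updating side input pattern (the refresh rate and the fetchLatestLookupTable helper are illustrative):

    PCollectionView<Map<String, String>> sideInput = pipeline
        .apply(GenerateSequence.from(0).withRate(1, Duration.standardMinutes(5)))
        .apply(Window.<Long>into(new GlobalWindows())
            .triggering(Repeatedly.forever(AfterProcessingTime.pastFirstElementInPane()))
            .discardingFiredPanes())
        .apply(ParDo.of(new DoFn<Long, Map<String, String>>() {
          @ProcessElement
          public void process(ProcessContext c) {
            c.output(fetchLatestLookupTable());  // hypothetical external lookup
          }
        }))
        .apply(Latest.globally())
        .apply(View.asSingleton());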

Re: staging location errors while kicking a dataflow job

2018-08-29 Thread Lukasz Cwik
Then the suggestion by Sameer was the issue that you were facing. Sameer, thanks for helping out. On Wed, Aug 29, 2018 at 8:32 AM asharma...@gmail.com wrote: > On 2018/08/29 00:37:30, Lukasz Cwik wrote: >> It seems like you specified gs:/ and not gs://

Re: BigqueryIO field clustering

2018-08-29 Thread Lukasz Cwik
+d...@beam.apache.org Wout, I assigned this task to you since it seems like you're interested in contributing. The Apache Beam contribution guide[1] is a good place to start for answering questions on how to contribute. If you need help getting stuff reviewed or have questions, feel free to

Re: staging location errors while kicking a dataflow job

2018-08-28 Thread Lukasz Cwik
It seems like you specified gs:/ and not gs:// Typo? On Mon, Aug 27, 2018 at 2:02 PM Sameer Abhyankar wrote: > See this thread to see if it is related to the way the executable jar is > being created: > >

Re: How to test session windows

2018-08-27 Thread Lukasz Cwik
…every 10 seconds), there are potentially a lot of early panes. So I assume there is currently no good way to assert the output of those different panes? > On Fri, Aug 24, 2018 at 23:05, Lukasz Cwik wrote: >> Actually, it seems like someone has already beat you to

Re: Beam and ParquetIO

2018-08-27 Thread Lukasz Cwik
…Is this documented somewhere? > On Friday, August 24, 2018 at 22:04, Lukasz Cwik wrote: >> Does it work if you define the schema manually instead of using ReflectData?
