Re: Large public Beam projects?

2020-04-21 Thread Tim Robertson
My apologies, I missed the link: [1] https://github.com/gbif/pipelines On Tue, Apr 21, 2020 at 5:58 PM Tim Robertson wrote: > Hi Jordan > > I don't know if we qualify as a large Beam project but at GBIF.org we > bring together datasets from 1600+ institutions documenting 1.4B

Re: Large public Beam projects?

2020-04-21 Thread Tim Robertson
t advanced features. It's batch processing of data into Avro files stored on HDFS then into HBase / Elasticsearch. All our data and code [1] are open and I'm happy to discuss any aspect of it if it is helpful to you. Best wishes, Tim On Tue, Apr 21, 2020 at 3:48 PM Jeff Klukas wrote: > Moz

Re: Beam discarding massive amount of events due to Window object or inner processing

2019-10-16 Thread Tim Sell
I think everyone who followed this thread learned something! I know I did. Thanks for asking these questions. The summary and code snippets were just the right length to be accessible and focussed. On Wed, 16 Oct 2019, 06:04 Eddy G, wrote: > Thank you so so much guys for the amazing feedback yo

Re: Beam discarding massive amount of events due to Window object or inner processing

2019-10-14 Thread Tim Sell
You're getting 1 shard per pane, and you get a pane every time it's triggered on an early firing. And then another one in the final on-time pane. To have 1 file with 1 shard for every 15 minute window you need to only fire on window close, i.e. AfterWatermark.pastEndOfWindow() without early firing. O
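The trigger configuration described above can be sketched as follows (a minimal sketch, not the poster's actual code; `events` is an assumed `PCollection<String>`, and the lateness and accumulation settings are illustrative):

```java
import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

// One on-time pane per 15-minute window: fire only when the watermark
// passes the end of the window, with no early (speculative) firings.
PCollection<String> windowed =
    events.apply(
        Window.<String>into(FixedWindows.of(Duration.standardMinutes(15)))
            .triggering(AfterWatermark.pastEndOfWindow())
            .withAllowedLateness(Duration.ZERO)
            .discardingFiredPanes());
```

With a single on-time pane per window, a downstream file write with one shard produces exactly one file per 15-minute window.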

Re: Live fixing of a Beam bug on July 25 at 3:30pm-4:30pm PST

2019-07-18 Thread Tim Sell
+1, I'd love to see this as a recording. Will you stick it up on youtube afterwards? On Thu, Jul 18, 2019 at 4:00 AM sridhar inuog wrote: > Thanks, Pablo! Looking forward to it! Hopefully, it will also be recorded > as well. > > On Wed, Jul 17, 2019 at 2:50 PM Pablo Estrada wrote: > >> Yes! So

Re: gRPC method to get a pipeline definition?

2019-06-26 Thread Tim Robertson
Another +1 to support your research into this Chad. Thank you. Trying to understand where a beam process is in the Spark DAG is... not easy. A UI that helped would be a great addition. On Wed, Jun 26, 2019 at 3:30 PM Ismaël Mejía wrote: > +1 don't hesitate to create a JIRA + PR. You may be in

Re: [ANNOUNCEMENT] Common Pipeline Patterns - new section in the documentation + contributions welcome

2019-06-07 Thread Tim Robertson
This is great. Thanks Pablo and all I've seen several folk struggle with writing avro to dynamic locations which I think might be a good addition. If you agree I'll offer a PR unless someone gets there first - I have an example here: https://github.com/gbif/pipelines/blob/master/pipelines/export-
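One way to write Avro to dynamic locations is FileIO.writeDynamic (a sketch under assumptions, not the GBIF example linked above: `records`, `schema`, the `datasetKey` field, and the output paths are all placeholders):

```java
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.io.AvroIO;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.values.PCollection;

// Route each GenericRecord to a directory derived from one of its
// fields; "datasetKey" is an assumed field name for illustration.
records.apply(
    FileIO.<String, GenericRecord>writeDynamic()
        .by(r -> r.get("datasetKey").toString())      // destination key per record
        .via(AvroIO.sink(schema))                     // schema: the records' Avro schema
        .to("hdfs://ha-nn/export")                    // assumed base output directory
        .withNaming(key -> FileIO.Write.defaultNaming(key + "/part", ".avro"))
        .withDestinationCoder(StringUtf8Coder.of())); // coder for the destination key
```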

Re: PubSubIO watermark not advancing for low volumes

2019-05-15 Thread Tim Sell
't work at all with those > MovingFunction params. But I may not understand all the subtleties of the > connector. > > Kenn > > *From: *Tim Sell > *Date: *Mon, May 13, 2019 at 8:06 AM > *To: * > > Thanks for the feedback, I did some more investigating after you

Re: ElasticsearchIO Write Batching Problems

2018-12-07 Thread Tim
Great. Thanks for sharing Evan. Tim > On 7 Dec 2018, at 20:06, Evan Galpin wrote: > > I've actually found that this was just a matter of pipeline processing speed. > I removed many layers of transforms such that entities flowed through the > pipeline faster, and saw the

Re: Moving to spark 2.4

2018-12-07 Thread Tim Robertson
To clarify Ismaël's comment Cloudera repo indicates Cloudera 6.1 will have spark 2.4 but CDH is currently still on 6.0. https://repository.cloudera.com/artifactory/cloudera-repos/org/apache/spark/spark-core_2.11/2.4.0-cdh6.1.0/ With the HWX / Cloudera merger the release cycle is not announced bu

Re: ElasticsearchIO Write Batching Problems

2018-12-06 Thread Tim
. I’ll verify ES IO behaviour when I get to a computer too. Tim (on phone) > On 6 Dec 2018, at 22:00, e...@calabs.ca wrote: > > Hi all, > > I’m having a bit of trouble with ElasticsearchIO Write transform. I’m able to > successfully index documents into my elasticsearch cluster,
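For context on the batching behaviour being discussed: ElasticsearchIO exposes batch-size knobs, but bulk requests also flush at the end of each bundle, so small bundles (common in the direct runner) can still yield small bulk requests. A hedged sketch with placeholder host, index, and type names:

```java
import org.apache.beam.sdk.io.elasticsearch.ElasticsearchIO;

// Batches flush when either limit is reached, or when the bundle
// finishes -- whichever comes first. All names here are placeholders.
docs.apply(
    ElasticsearchIO.write()
        .withConnectionConfiguration(
            ElasticsearchIO.ConnectionConfiguration.create(
                new String[] {"http://localhost:9200"}, "my-index", "_doc"))
        .withMaxBatchSize(1000)                       // max docs per bulk request
        .withMaxBatchSizeBytes(5L * 1024 * 1024));    // max bytes per bulk request
```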

Re: bean elasticsearch connector for dataflow

2018-12-04 Thread Tim
Beam 2.8.0 brought in support for ES 6.3.x I’m not sure if that works against a 6.5.x server but I could imagine it does. Tim, Sent from my iPhone > On 4 Dec 2018, at 18:28, Adeel Ahmad wrote: > > Hello, > > I am trying to use gcp dataflow for indexing data from pubsub into

Re: No filesystem found for scheme hdfs - with the FlinkRunner

2018-10-26 Thread Tim
Thanks for sharing that Tim, > On 26 Oct 2018, at 17:50, Juan Carlos Garcia wrote: > > Just for everyone to know we figure it out, it was an environment problem. > > In our case we have our cluster in a network that is not accessible directly, > so to

Re: No filesystem found for scheme hdfs - with the FlinkRunner

2018-10-26 Thread Tim Robertson
): transform.apply("Write", AvroIO.writeGenericRecords(schema) .to(FileSystems.matchNewResource(options.getTarget(),true)) // BEAM-2277 workaround .withTempDirectory(FileSystems.matchNewResource("hdfs://ha-nn/tmp/beam-avro", true))); Thanks
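The BEAM-2277 workaround quoted above, expanded into a fuller sketch (assumptions: `records` is a `PCollection<GenericRecord>`, `options.getTarget()` returns an hdfs:// path, and the temp directory is taken from the thread):

```java
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.io.AvroIO;
import org.apache.beam.sdk.io.FileSystems;
import org.apache.beam.sdk.values.PCollection;

// BEAM-2277 workaround: resolve both the target and the temp directory
// through FileSystems.matchNewResource so the hdfs:// scheme is used
// consistently when moving files from temp to their final destination.
records.apply(
    "Write",
    AvroIO.writeGenericRecords(schema)
        .to(FileSystems.matchNewResource(options.getTarget(), true /* isDirectory */))
        .withTempDirectory(
            FileSystems.matchNewResource("hdfs://ha-nn/tmp/beam-avro", true)));
```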

Re: ElasticIO retry configuration exception

2018-10-11 Thread Tim
Great! Thank you. Feel free to add me as reviewer if you open a PR. Tim > On 12 Oct 2018, at 08:28, Wout Scheepers > wrote: > > Hey Tim, Romain, > > I created the ticket (BEAM-5725). I’ll try to fix it, as it’s time I made my > first PR. > First will focus on get

Re: ElasticIO retry configuration exception

2018-10-11 Thread Tim Robertson
orting the issue, Tim On Thu, Oct 11, 2018 at 4:19 PM Romain Manni-Bucau wrote: > It looks more like a client issue where the stream is already read, maybe > give a try to reproduce it in a unit test in beam ES module? This will > enable us to help you more accurately. >

Re: AvroIO - failure using direct runner with java.nio.file.FileAlreadyExistsException when moving from temp to destination

2018-09-26 Thread Tim Robertson
is helps a little, and thanks again Tim On Wed, Sep 26, 2018 at 11:24 AM Juan Carlos Garcia wrote: > Hi Guys, after days of bumping my head against the monitor i found why it > was not working. > > One key element when using *DynamicAvroDestinations *that is not > describe

Re: Regression of (BEAM-2277) - IllegalArgumentException when using Hadoop file system for WordCount example.

2018-08-20 Thread Tim Robertson
Hi Juan, You are correct that BEAM-2277 seems to be recurring. I have today stumbled upon that myself in my own pipeline (not word count). I have just posted a workaround at the bottom of the issue, and will reopen the issue. Thank you for reminding us on this, Tim On Mon, Aug 20, 2018 at 4:44

Re: ElasticsearchIO bulk delete

2018-07-30 Thread Tim Robertson
tions or improvement requests on that jira. The offer to assist in your first PR remains open for the future - please don't hesitate to ask. Thanks, Tim [1] https://github.com/jsteggink/beam/tree/BEAM-3199/sdks/java/io/elasticsearch-6/src/main/java/org/apache/beam/sdk/io/elasticsea

Re: ElasticsearchIO bulk delete

2018-07-27 Thread Tim Robertson
is a branch and work underway for ES6 already which is rather different on the write() path so this may become redundant rather quickly. Thanks, Tim @timrobertson100 on the Beam slack channel On Fri, Jul 27, 2018 at 2:53 PM, Wout Scheepers < wout.scheep...@vente-exclusive.com> wrote: >

Re: Pipeline error handling

2018-07-26 Thread Tim Robertson
a cause HTH, Tim [1] https://beam.apache.org/documentation/programming-guide/#additional-outputs On Thu, Jul 26, 2018 at 9:44 AM, Kelsey RIDER wrote: > I’m trying to figure out how to handle errors in my Pipeline. > > > > Right now, my main transform is a DoFn. I > ha
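The additional-outputs pattern referenced in [1] looks roughly like this (a sketch; `input`, the element types, and `transform()` are assumptions for illustration):

```java
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionTuple;
import org.apache.beam.sdk.values.TupleTag;
import org.apache.beam.sdk.values.TupleTagList;

// One tag per output stream; the anonymous subclasses capture the
// element type for coder inference.
final TupleTag<String> okTag = new TupleTag<String>() {};
final TupleTag<String> errorTag = new TupleTag<String>() {};

PCollectionTuple results =
    input.apply(
        ParDo.of(
                new DoFn<String, String>() {
                  @ProcessElement
                  public void process(ProcessContext c) {
                    try {
                      c.output(transform(c.element())); // transform(): your logic
                    } catch (Exception e) {
                      c.output(errorTag, c.element()); // route failures aside
                    }
                  }
                })
            .withOutputTags(okTag, TupleTagList.of(errorTag)));

PCollection<String> failed = results.get(errorTag); // handle or dead-letter these
```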

Re: Beam and Hive

2018-07-20 Thread Tim Robertson
I answered on SO, Kelsey. You should be able to add this, I believe, to explicitly declare the coder to use: p.getCoderRegistry().registerCoderForClass(HCatRecord.class, WritableCoder.of(DefaultHCatRecord.class)); On Fri, Jul 20, 2018 at 5:05 PM, Kelsey RIDER wrote: > Hello, > > > > I wrote

withHotKeyFanout Question

2018-02-16 Thread Tim Ross
My pipeline utilizes the Combine.perKey transform and I would like to add withHotKeyFanout to prevent the combine from being a bottleneck. To test I made sure that my mergeAccumulators function was correct and added withHotKeyFanout(2) to my Combine transform. When I launch the pipeline with fli
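The setup described above can be sketched as follows (assumptions: `String` keys, `Long` values, a `Sum` combine, and a fanout factor of 10 rather than the poster's 2):

```java
import org.apache.beam.sdk.transforms.Combine;
import org.apache.beam.sdk.transforms.Sum;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

// Fan each key's input out 10 ways so hot keys are pre-combined in
// parallel before a final merge of the intermediate accumulators.
PCollection<KV<String, Long>> sums =
    counts.apply(
        Combine.<String, Long, Long>perKey(Sum.ofLongs()).withHotKeyFanout(10));
```

The fanout relies on mergeAccumulators being correct, which is why the poster verified it first: partial accumulators from the fanned-out shards are merged to produce the final value per key.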

Re: Windowing question

2018-01-25 Thread Tim Ross
There are many thousands of inputs per window. I only need to calculate certain statistics based on a new element, i.e. present in the current window but not the previous window, and the last element, i.e. was in the previous window but not the current window. Thanks, Tim From: Kenneth Knowles Reply-To

Re: Windowing question

2018-01-24 Thread Tim Ross
Yes that is what the pipeline I came up with looks like. However the next step in the pipeline is new/expired logic. I have tried a variety of things but none have gotten me close to what I want. Hence my questioning to this mailing list. Thanks, Tim From: Kenneth Knowles Reply-To: "

Re: Windowing question

2018-01-24 Thread Tim Ross
expired. Since Beam doesn’t have the concept of new and expired data in a window I’m trying to figure out how one would accomplish this. Thanks, Tim From: Kenneth Knowles Reply-To: "user@beam.apache.org" Date: Wednesday, January 24, 2018 at 1:45 PM To: "user@beam.apache.or

Re: Windowing question

2018-01-24 Thread Tim Ross
I am just trying to do certain processing on edge triggers, i.e. new or expired data, to reduce the overall processing of a very large stream. How would I go about doing that with state? As I understand it, state is tied to key and window. Thanks, Tim From: Kenneth Knowles Reply-To: "
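One possible approach, not taken from this thread: since state is scoped to key and window, collect each window's distinct elements first, re-window into the global window, and diff consecutive window-sets in a stateful DoFn. A heavily hedged sketch (element ordering at the stateful DoFn is not guaranteed, so this assumes window-sets arrive in window order):

```java
import java.util.HashSet;
import java.util.Set;
import org.apache.beam.sdk.coders.SetCoder;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.state.StateSpec;
import org.apache.beam.sdk.state.StateSpecs;
import org.apache.beam.sdk.state.ValueState;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

// Input: KV<key, window-set> -- one Set per (key, window), already
// re-windowed into GlobalWindows so state survives across windows.
class DiffConsecutiveWindows extends DoFn<KV<String, Set<String>>, String> {
  @StateId("previous")
  private final StateSpec<ValueState<Set<String>>> previousSpec =
      StateSpecs.value(SetCoder.of(StringUtf8Coder.of()));

  @ProcessElement
  public void process(
      ProcessContext c, @StateId("previous") ValueState<Set<String>> previous) {
    Set<String> prev = previous.read() == null ? new HashSet<>() : previous.read();
    Set<String> current = new HashSet<>(c.element().getValue());
    for (String e : current) {
      if (!prev.contains(e)) c.output("new: " + e);       // present now, absent before
    }
    for (String e : prev) {
      if (!current.contains(e)) c.output("expired: " + e); // absent now, present before
    }
    previous.write(current); // current window becomes "previous" next time
  }
}
```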

Windowing question

2018-01-24 Thread Tim Ross
Is there anything comparable to Apache Storm’s Window.getNew and Window.getExpired in Apache Beam? How would I determine if an element is new or expired in consecutive windows? Thanks, Tim

Re: [DISCUSS] Drop Spark 1.x support to focus on Spark 2.x

2017-11-20 Thread Tim
Thanks JB From which release will Spark 1.x be dropped please? Is this slated for 2.3.0 at the end of the year? Thanks, Tim > On 20 Nov 2017, at 21:21, Jean-Baptiste Onofré wrote: > > Hi, > > it seems we have a consensus to upgrade to Spark 2.x, droppi

Re: Does ElasticsearchIO in the latest RC support adding document IDs?

2017-11-15 Thread Tim Robertson
Hi Chet, I'll be a user of this, so thank you. It seems reasonable although - did you consider letting folk name the document ID field explicitly? It would avoid an unnecessary transformation and might be simpler: // instruct the writer to use a provided document ID ElasticsearchIO.write
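For reference, the document-ID support that eventually landed in ElasticsearchIO uses an extract function rather than a field name (a sketch; the connection configuration and the "id" field in the JSON documents are placeholders):

```java
import org.apache.beam.sdk.io.elasticsearch.ElasticsearchIO;

// withIdFn receives the document as a Jackson JsonNode and returns
// the _id to use, so writes become idempotent upserts by that ID.
docs.apply(
    ElasticsearchIO.write()
        .withConnectionConfiguration(
            ElasticsearchIO.ConnectionConfiguration.create(
                new String[] {"http://localhost:9200"}, "my-index", "_doc"))
        .withIdFn(node -> node.path("id").asText())); // "id": assumed field
```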

Re: Does ElasticsearchIO in the latest RC support adding document IDs?

2017-11-15 Thread Tim Robertson
Hi Chet, +1 for interest in this from me too. If it helps, I'd have expected a) to be the implementation (e.g. something like "_id" being used if present) and handing multiple delivery being a responsibility of the developer. Thanks, Tim On Wed, Nov 15, 2017 at 10:08 AM

Re: [DISCUSS] Drop Spark 1.x support to focus on Spark 2.x

2017-11-13 Thread Tim Robertson
headaches in packaging, robustness etc. I'd recommend putting it in a 6 month timeframe which might align with 2.3? Hope this helps, Tim On Mon, Nov 13, 2017 at 10:07 AM, Neville Dipale wrote: > Hi JB, > > > [X ] +1 to drop Spark 1.x support and upgrade to Spark 2.x only &

Re: ElasticSearch with RestHighLevelClient

2017-10-23 Thread Tim Robertson
Hi Ryan, I can confirm 2.2.0-SNAPSHOT works fine with an ES 5.6 cluster. I am told 2.2.0 is expected within a couple weeks. My work is only a proof of concept for now, but I put in 300M fairly small docs at around 100,000/sec on a 3 node cluster without any issue [1]. Hope this helps, Tim [1

Re: KafkaIO and Avro

2017-10-19 Thread Tim Robertson
roCoder.of(Envelope.class)) >> >> >> On Thu, Oct 19, 2017 at 12:21 PM Raghu Angadi wrote: >> >>> Same for me. It does not look like there is an annotation to suppress >>> the error. >>> >>> >>> On Thu, Oct 19, 2017 at 12:18 PM, Tim R

Re: KafkaIO and Avro

2017-10-19 Thread Tim Robertson
erializerAndCoder((Deserializer) > KafkaAvroDeserializer.class, > AvroCoder.of(Envelope.class)) > > On Thu, Oct 19, 2017 at 11:45 AM Tim Robertson > wrote: > >> Hi Raghu >> >> I tried that but with KafkaAvroDeserializer already implementing >> Deserializer I couldn't get

Re: KafkaIO and Avro

2017-10-19 Thread Tim Robertson
Hi Raghu I tried that but with KafkaAvroDeserializer already implementing Deserializer I couldn't get it to work... I didn't spend too much time though and agree something like that would be cleaner. Cheers, Tim On Thu, Oct 19, 2017 at 7:54 PM, Raghu Angadi wrote: > Thanks Tim.

Re: KafkaIO and Avro

2017-10-19 Thread Tim Robertson
); } } On Thu, Oct 19, 2017 at 12:03 PM, Andrew Jones wrote: > Thanks Tim, that works! > > Full code is: > > public class EnvelopeKafkaAvroDeserializer extends > AbstractKafkaAvroDeserializer implements Deserializer { > @Override > public void configure(Ma

Re: KafkaIO and Avro

2017-10-19 Thread Tim Robertson
ments Deserializer { ... public Envelope deserialize(String s, byte[] bytes) { return (Envelope) this.deserialize(bytes); } public Envelope deserialize(String s, byte[] bytes, Schema readerSchema) { return (Envelope) this.deserialize(bytes, readerSchema); } ... } Tim On Thu, Oct 19,
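Reconstructed from the snippets in this thread, the wrapper looks roughly like this (assumptions: `Envelope` is a generated Avro class, and the Confluent schema-registry deserializer base class is on the classpath):

```java
import io.confluent.kafka.serializers.AbstractKafkaAvroDeserializer;
import io.confluent.kafka.serializers.KafkaAvroDeserializerConfig;
import java.util.Map;
import org.apache.kafka.common.serialization.Deserializer;

// A concrete Deserializer<Envelope> so KafkaIO can pair it with
// AvroCoder.of(Envelope.class); KafkaAvroDeserializer itself only
// implements Deserializer<Object>, which is why this wrapper is needed.
public class EnvelopeKafkaAvroDeserializer extends AbstractKafkaAvroDeserializer
    implements Deserializer<Envelope> {

  @Override
  public void configure(Map<String, ?> configs, boolean isKey) {
    configure(new KafkaAvroDeserializerConfig(configs));
  }

  @Override
  public Envelope deserialize(String topic, byte[] bytes) {
    return (Envelope) this.deserialize(bytes); // delegate to the Confluent base class
  }

  @Override
  public void close() {}
}
```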

Re: KafkaIO and Avro

2017-10-18 Thread Tim Robertson
I just tried quickly and see the same as you Andrew. We're missing something obvious or else extending KafkaAvroDeserializer seems necessary right? On Wed, Oct 18, 2017 at 3:14 PM, Andrew Jones wrote: > Hi, > > I'm trying to read Avro data from a Kafka stream using KafkaIO. I think > it should b

Re: JavaSparkContext Wrapper

2017-10-08 Thread Tim Robertson
quick-start [1], looking at the Spark word count example [2] and reading the programming guide [3] - you should be able to run some quick tests in your favourite IDE and then run the same code on Spark by using the Spark runner. I hope this helps, Tim [1] https://beam.apache.org/get-started/quick

Re: Slack invitation request

2017-10-03 Thread Tim
May I also have one please? Tim, > On 3 Oct 2017, at 19:22, Lukasz Cwik wrote: > > Invitation sent, welcome. > >> On Tue, Oct 3, 2017 at 9:14 AM, Jon Brasted wrote: >> Hello, >> >> Please may I have an invitation to the Apache B

Re: Run a ParDo after a Partition

2017-10-03 Thread Tim
ase I'm reading zip files residing on HDFS which is inherently a single threaded op but then need to run expensive operations on the resulting dataset in parallel. Thanks, Tim, > On 3 Oct 2017, at 18:16, Thomas Groh wrote: > > Just to make sure I'm un

Re: Run a ParDo after a Partition

2017-10-03 Thread Tim Robertson
t 3, 2017 at 2:48 PM, Tim Robertson wrote: > Hi folks, > > I feel a little daft asking this, and suspect I am missing the obvious... > > Can someone please tell me how I can do a ParDo following a Partition? > > In spark I'd just repartition(...) and then a map() but I don't

Run a ParDo after a Partition

2017-10-03 Thread Tim Robertson
tition, and need to split it to run expensive op in parallel] Thanks, Tim
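Partition returns a PCollectionList, and a ParDo can be applied to each partition individually (a sketch; the element type, partition count, hash-based partition function, and `ExpensiveFn` are assumptions):

```java
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.Partition;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionList;

// Split the input 8 ways by element hash, then run the expensive
// DoFn on each resulting PCollection.
PCollectionList<String> parts =
    input.apply(
        Partition.of(8, (element, numPartitions) ->
            Math.abs(element.hashCode()) % numPartitions));

for (PCollection<String> part : parts.getAll()) {
  part.apply(ParDo.of(new ExpensiveFn())); // ExpensiveFn: your DoFn
}
```

For the underlying goal here (breaking fusion so an expensive ParDo after a single-threaded read runs in parallel), applying `Reshuffle.viaRandomKey()` before the ParDo is the commonly suggested alternative to an explicit Partition.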