Re: New Design Doc for Cost Based Optimization

2019-07-12 Thread Kenneth Knowles
Thanks for these thorough docs. I feel I am learning a lot from what you
are sharing. I've commented w/ my questions on the doc.

Kenn

On Wed, Jul 10, 2019 at 2:54 PM Alireza Samadian 
wrote:

> Dear Members of Beam Community,
>
> Previously I had shared a document discussing row count estimation for the
> source tables in a query.
>
> https://docs.google.com/document/d/1vi1PBBu5IqSy-qZl1Gk-49CcANOpbNs1UAud6LnOaiY/edit
>
> I wrote another document that discusses the Cost Model and Join
> Reordering, and I am probably going to polish and merge the previous one as
> a section of this document. The following is the link to the new document.
>
> https://docs.google.com/document/d/1DM_bcfFbIoc_vEoqQxhC7AvHBUDVCAwToC8TYGukkII/edit?usp=sharing
> I would appreciate your comments and suggestions.
>
> Best,
> Alireza Samadian
>


Re: [PROPOSAL] Prepare for LTS bugfix release 2.7.1

2019-07-12 Thread Kenneth Knowles
I went ahead and took over all the bugs and did the cherry-picks for the
remaining backports targeting 2.7.1:
https://issues.apache.org/jira/issues/?jql=statusCategory%20%3D%20new%20AND%20project%20%3D%2012319527%20AND%20fixVersion%20%3D%2012344458

The tests are not healthy. I have not had time to look into the issues. I
would appreciate some help reviewing and/or manually running tests and
publishing Gradle build scans.

Kenn

On Fri, Jun 7, 2019 at 3:08 AM Maximilian Michels  wrote:

> Created an up-to-date version of the Flink backports for 2.7.1:
> https://github.com/apache/beam/pull/8787
>
> Some of the Gradle task names have changed, which makes testing via Jenkins
> hard. I will have to run them manually before merging.
>
> -Max
>
> On 06.06.19 17:41, Kenneth Knowles wrote:
> > Hi all,
> >
> > Re-raising this thread. I got busy for the last month, and also did not
> > want to overlap with the 2.13.0 release process. Now I want to pick up 2.7.1
> > again.
> >
> > Can everyone check on any bug they have targeted to 2.7.1 [1] and get
> > the backports merged to release-2.7.1 and the tickets resolved?
> >
> > Kenn
> >
> > [1]
> >
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20BEAM%20AND%20resolution%20%3D%20Unresolved%20AND%20fixVersion%20%3D%202.7.1%20ORDER%20BY%20priority%20DESC%2C%20updated%20DESC
> >
> > On Fri, Apr 26, 2019 at 11:19 AM Ahmet Altay  > > wrote:
> >
> > I agree with both keeping 2.7.x going until a new LTS is declared
> > and declaring the LTS post-release after some use. 2.12 might actually
> > be a good candidate; with multiple RCs/validations it presumably is
> > well tested. We can consider that after it gets some real-world use.
> >
> > On Fri, Apr 26, 2019 at 6:29 AM Robert Bradshaw  > > wrote:
> >
> > IIRC, there was some talk of making 2.12 the next LTS, but the
> > consensus is to decide on an LTS after having had some experience
> > with it, not at or before the release itself.
> >
> >
> > On Fri, Apr 26, 2019 at 3:04 PM Alexey Romanenko
> > mailto:aromanenko@gmail.com>>
> wrote:
> >  >
> >  > Thanks for working on this, Kenn.
> >  >
> >  > Perhaps I missed this, but has it already been
> >  > discussed/decided what the next LTS release will be?
> >  >
> >  > On 26 Apr 2019, at 08:02, Kenneth Knowles  > > wrote:
> >  >
> >  > Since it is all trivially reversible if there is some other
> > feeling about this thread, I have gone ahead and started the
> work:
> >  >
> >  >  - I made the release-2.7.1 branch point to the same commit as
> >  > release-2.7.0 so there is something to target PRs at
> >  >  - I have opened the first PR, cherry-picking the set_version
> >  > script and using it to set the version on the branch to 2.7.1:
> >  > https://github.com/apache/beam/pull/8407 (found a bug in the new
> >  > script right away :-)
> >  >
> >  > Here is the release with list of issues:
> > https://issues.apache.org/jira/projects/BEAM/versions/12344458.
> > So anyone can grab a ticket and volunteer to open a backport PR
> > to the release-2.7.1 branch.
> >  >
> >  > I don't have a strong opinion about how long we should
> > support the 2.7.x line. I am curious about different
> > perspectives on user / vendor needs. I have two very basic
> > thoughts: (1) we surely need to keep it going until some time
> > after we have another LTS designated, to make sure there is a
> > clear path for anyone only using LTS releases and (2) if we
> > decide to end support of 2.7.x but then someone volunteers to
> > backport and release, of course I would not expect anyone to
> > block them, so it has no maximum lifetime, but we just need
> > consensus on a minimum. And of course that consensus cannot
> > force anyone to do the work, but is just a resolution of the
> > community.
> >  >
> >  > Kenn
> >  >
> >  > On Thu, Apr 25, 2019 at 10:29 PM Jean-Baptiste Onofré
> > mailto:j...@nanthrax.net>> wrote:
> >  >>
> >  >> +1 it sounds good to me.
> >  >>
> >  >> Thanks !
> >  >>
> >  >> Regards
> >  >> JB
> >  >>
> >  >> On 26/04/2019 02:42, Kenneth Knowles wrote:
> >  >> > Hi all,
> >  >> >
> >  >> > Since the release of 2.7.0 we have identified some serious
> > bugs:
> >  >> >
> >  >> >  - There are 8 (non-dupe) issues* tagged with Fix Version
> > 2.7.1
> >  >> >  - 2 are rated "Blocker" (aka P0) but I think the others
> > may be underrated
> >  >> >  - If you know of a critical bug that is not on 

Re: Discussion/Proposal: support Sort Merge Bucket joins in Beam

2019-07-12 Thread Kenneth Knowles
I've seen some discussion on the doc. I cannot tell whether the questions
are resolved or what the status of the review is. Would you mind looping back
to this thread with a quick summary? This is such a major piece of work that
I don't want it to sit with everyone thinking they are waiting on someone
else, or any such thing. (Not saying this is happening, just pinging to be sure.)

Kenn

On Mon, Jul 1, 2019 at 1:09 PM Neville Li  wrote:

> Updated the doc a bit with more future work (appendix). IMO most of them
> are non-breaking and better done in separate PRs later, since some involve
> pretty big refactoring and are outside the scope of the MVP.
>
> For now we'd really like to get feedback on some fundamental design
> decisions and find a way to move forward.
>
> On Thu, Jun 27, 2019 at 4:39 PM Neville Li  wrote:
>
>> Thanks. I responded to comments in the doc. More inline.
>>
>> On Thu, Jun 27, 2019 at 2:44 PM Chamikara Jayalath 
>> wrote:
>>
>>> Thanks, I added a few comments.
>>>
>>> If I understood correctly, you basically assign elements with keys to
>>> different buckets, which are written to unique files, and merge files for
>>> the same key while reading?
>>>
>>> Some of my concerns are:
>>>
>>> (1) It seems like you rely on an in-memory sorting of buckets. Will this
>>> end up limiting the size of a PCollection you can process?
>>>
>> The sorter transform we're using supports spilling and external sort. We
>> can break up large key groups further by sharding, similar to fan out in
>> some GBK transforms.
>>
>> (2) Seems like you rely on Reshuffle.viaRandomKey() which is actually
>>> implemented using a shuffle (which you try to replace with this proposal).
>>>
>> That's for distributing task metadata, so that each DoFn thread picks up
>> a random bucket and sort-merges key-values. It's not shuffling actual data.
>>
>>
>>> (3) I think (at least some of the) shuffle implementations are
>>> implemented in ways similar to this (writing to files and merging). So I'm
>>> wondering if the performance benefits you see are for a very specific case
>>> and may limit the functionality in other ways.
>>>
>> This is for the common pattern of few core data producer pipelines and
>> many downstream consumer pipelines. It's not intended to replace
>> shuffle/join within a single pipeline. On the producer side, by
>> pre-grouping/sorting data and writing to bucket/shard output files, the
>> consumer can sort/merge matching ones without a CoGBK. Essentially we're
>> paying the shuffle cost upfront to avoid them repeatedly in each consumer
>> pipeline that wants to join data.
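>>
>> To make that concrete, a rough sketch of the bucketing invariant
>> (illustrative only, not the proposed API): both sides of a join must use
>> the same key hash and bucket count, so matching keys land in the same
>> bucket file of each dataset and can be joined by a sorted merge.
>>
>> public class SmbBucketing {
>>   static final int NUM_BUCKETS = 128;
>>
>>   static int bucketFor(String joinKey) {
>>     // Mask the sign bit so the bucket index is non-negative.
>>     return (joinKey.hashCode() & Integer.MAX_VALUE) % NUM_BUCKETS;
>>   }
>>
>>   public static void main(String[] args) {
>>     // A record with this key goes to the same bucket in every dataset
>>     // written this way, so a consumer can merge bucket i of dataset A
>>     // with bucket i of dataset B sequentially -- no shuffle or CoGBK.
>>     System.out.println(bucketFor("user-42"));
>>   }
>> }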
>>
>>
>>> Thanks,
>>> Cham
>>>
>>>
>>> On Thu, Jun 27, 2019 at 8:12 AM Neville Li 
>>> wrote:
>>>
 Ping again. Any chance someone could take a look to get this thing going?
 It's just a design doc and a basic metadata/IO impl. We're not talking about
 actual source/sink code yet (already done but saved for future PRs).

 On Fri, Jun 21, 2019 at 1:38 PM Ahmet Altay  wrote:

> Thank you Claire, this looks promising. Explicitly adding a few folks
> that might have feedback: +Ismaël Mejía  +Robert
> Bradshaw  +Lukasz Cwik  +Chamikara
> Jayalath 
>
> On Mon, Jun 17, 2019 at 2:12 PM Claire McGinty <
> claire.d.mcgi...@gmail.com> wrote:
>
>> Hey dev@!
>>
>> A few other Spotify data engineers and I have put together a design doc
>> for SMB Join support in Beam, and have a working Java implementation we've
>> started to put up for PR ([0], [1], [2]). There's more detailed
>> information in the document, but the tl;dr is that SMB is a strategy to
>> optimize joins for file-based sources by modifying the initial write
>> operation to write records in sorted buckets based on the desired join key.
>> This means that subsequent joins of datasets written in this way are only
>> sequential file reads, no shuffling involved. We've seen some pretty
>> substantial performance speedups with our implementation and would love 
>> to
>> get it checked in to Beam's Java SDK.
>>
>> We'd appreciate any suggestions or feedback on our proposal--the
>> design doc should be public to comment on.
>>
>> Thanks!
>> Claire / Neville
>>
>


Re: python precommits failing at head

2019-07-12 Thread Tanay Tummalapalli
Thank you, Valentyn!

I'll retest it.
Hopefully, it's a transient issue.

Regards,
- Tanay Tummalapalli

On Sat, Jul 13, 2019 at 2:39 AM Valentyn Tymofieiev 
wrote:

> No, we did not reduce the timeout recently. Looking at the console logs,
> nothing happened for an hour or so:
>
> 06:57:50 py27-cython: commands succeeded
> 06:57:50 congratulations :)
> 06:57:50
> 06:57:50 > Task :sdks:python:preCommitPy2
> 08:22:33 Build timed out (after 120 minutes). Marking the build as aborted.
>
> However, we can also see in the logs that the py36-cython suite never
> started, not sure why. I assume Gradle waited for this suite to finish.
> Try "retest this please"; hopefully this is a transient Gradle issue. I
> did not observe it before.
>
> On Fri, Jul 12, 2019 at 1:22 PM Tanay Tummalapalli 
> wrote:
>
>> Hi Udi,
>>
>> I rebased another PR [1] onto the fix mentioned above. The lint error is
>> fixed, but the "beam_PreCommit_Python_Commit" Jenkins job is failing
>> because of a timeout at 120 minutes [2].
>> The log says "Build timed out (after 120 minutes). Marking the build as
>> aborted."
>> Another PR's Python PreCommit job aborted with the same error[3].
>>
>> I found this issue - "[BEAM-3040] Python precommit timed out after 150
>> minutes"[4].
>> Was the timeout reduced recently?
>>
>> Regards,
>> - Tanay Tummalapalli
>>
>> [1] https://github.com/apache/beam/pull/8871
>> [2]
>> https://builds.apache.org/job/beam_PreCommit_Python_Commit/7412/consoleFull
>>
>> [3] https://github.com/apache/beam/pull/9050
>> [4] https://issues.apache.org/jira/browse/BEAM-3040
>>
>> On Fri, Jul 12, 2019 at 5:42 AM Udi Meiri  wrote:
>>
>>> This is due to
>>> https://github.com/apache/beam/pull/8969
>>> and
>>> https://github.com/apache/beam/pull/8934
>>> being merged today.
>>>
>>> Fix is here: https://github.com/apache/beam/pull/9044
>>>
>>


Re: python precommits failing at head

2019-07-12 Thread Valentyn Tymofieiev
No, we did not reduce the timeout recently. Looking at the console logs,
nothing happened for an hour or so:

06:57:50 py27-cython: commands succeeded
06:57:50 congratulations :)
06:57:50
06:57:50 > Task :sdks:python:preCommitPy2
08:22:33 Build timed out (after 120 minutes). Marking the build as aborted.

However, we can also see in the logs that the py36-cython suite never started,
not sure why. I assume Gradle waited for this suite to finish.
Try "retest this please"; hopefully this is a transient Gradle issue. I did
not observe it before.

On Fri, Jul 12, 2019 at 1:22 PM Tanay Tummalapalli 
wrote:

> Hi Udi,
>
> I rebased another PR [1] onto the fix mentioned above. The lint error is
> fixed, but the "beam_PreCommit_Python_Commit" Jenkins job is failing
> because of a timeout at 120 minutes [2].
> The log says "Build timed out (after 120 minutes). Marking the build as
> aborted."
> Another PR's Python PreCommit job aborted with the same error[3].
>
> I found this issue - "[BEAM-3040] Python precommit timed out after 150
> minutes"[4].
> Was the timeout reduced recently?
>
> Regards,
> - Tanay Tummalapalli
>
> [1] https://github.com/apache/beam/pull/8871
> [2]
> https://builds.apache.org/job/beam_PreCommit_Python_Commit/7412/consoleFull
>
> [3] https://github.com/apache/beam/pull/9050
> [4] https://issues.apache.org/jira/browse/BEAM-3040
>
> On Fri, Jul 12, 2019 at 5:42 AM Udi Meiri  wrote:
>
>> This is due to
>> https://github.com/apache/beam/pull/8969
>> and
>> https://github.com/apache/beam/pull/8934
>> being merged today.
>>
>> Fix is here: https://github.com/apache/beam/pull/9044
>>
>


Re: python precommits failing at head

2019-07-12 Thread Tanay Tummalapalli
Hi Udi,

I rebased another PR [1] onto the fix mentioned above. The lint error is
fixed, but the "beam_PreCommit_Python_Commit" Jenkins job is failing
because of a timeout at 120 minutes [2].
The log says "Build timed out (after 120 minutes). Marking the build as
aborted."
Another PR's Python PreCommit job aborted with the same error[3].

I found this issue - "[BEAM-3040] Python precommit timed out after 150
minutes"[4].
Was the timeout reduced recently?

Regards,
- Tanay Tummalapalli

[1] https://github.com/apache/beam/pull/8871
[2]
https://builds.apache.org/job/beam_PreCommit_Python_Commit/7412/consoleFull
[3] https://github.com/apache/beam/pull/9050
[4] https://issues.apache.org/jira/browse/BEAM-3040

On Fri, Jul 12, 2019 at 5:42 AM Udi Meiri  wrote:

> This is due to
> https://github.com/apache/beam/pull/8969
> and
> https://github.com/apache/beam/pull/8934
> being merged today.
>
> Fix is here: https://github.com/apache/beam/pull/9044
>


Re: [Python] Read Hadoop Sequence File?

2019-07-12 Thread Shannon Duncan
Clarification on my previous message: this only happens on the local file
system, where it is unable to match a pattern string. Via a `gs://` path it
is able to do multiple-file matching.
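
To see what a pattern expands to on a given filesystem, here is a minimal
sketch using Beam's FileSystems API (the paths and bucket name are
placeholders):

import java.util.Collections;
import org.apache.beam.sdk.io.FileSystems;
import org.apache.beam.sdk.io.fs.MatchResult;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class MatchDebug {
  public static void main(String[] args) throws Exception {
    // Register the available filesystems (local always; gs:// requires the
    // google-cloud-platform-core extension on the classpath).
    FileSystems.setDefaultPipelineOptions(PipelineOptionsFactory.create());

    for (String pattern :
        new String[] {"/tmp/seq/part-*/data", "gs://my-bucket/seq/part-*/data"}) {
      MatchResult result = FileSystems.match(Collections.singletonList(pattern)).get(0);
      if (result.status() == MatchResult.Status.OK) {
        System.out.println(pattern + " matched " + result.metadata().size() + " file(s)");
      } else {
        System.out.println(pattern + " -> " + result.status()); // e.g. NOT_FOUND
      }
    }
  }
}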

On Fri, Jul 12, 2019 at 1:36 PM Shannon Duncan 
wrote:

> Awesome. I got it working for a single file, but for a structure of:
>
> /part-0001/index
> /part-0001/data
> /part-0002/index
> /part-0002/data
>
> I tried to do /part-*  and /part-*/data
>
> It does not find the multi-part files. However, if I just do /part-0001/data
> it will find it and read it.
>
> Any ideas why?
>
> I am using this to generate the source:
>
> static SequenceFileSource<Text, Text> createSource(
>     ValueProvider<String> sourcePattern) {
>   return new SequenceFileSource<>(
>       sourcePattern,
>       Text.class,
>       WritableSerialization.class,
>       Text.class,
>       WritableSerialization.class,
>       SequenceFile.SYNC_INTERVAL);
> }
>
> On Wed, Jul 10, 2019 at 10:52 AM Igor Bernstein 
> wrote:
>
>> It should be fairly straightforward:
>> 1. Copy SequenceFileSource.java to your project
>> 2. Add the source to your pipeline, configuring it with appropriate
>> serializers. See here for an example for HBase Results.
>>
>> On Wed, Jul 10, 2019 at 10:58 AM Shannon Duncan <
>> joseph.dun...@liveramp.com> wrote:
>>
>>> If I wanted to go ahead and include this within a new Java pipeline,
>>> what level of work would I be looking at to integrate it?
>>>
>>> On Wed, Jul 3, 2019 at 3:54 AM Ismaël Mejía  wrote:
>>>
 That's great. I can help whenever you need. We just need to choose its
 destination. Both the `hadoop-format` and `hadoop-file-system` modules
 are good candidates; I would even feel inclined to put it in its own
 module `sdks/java/extensions/sequencefile` to make it easier for end
 users to discover.

 A thing to consider is the SeekableByteChannel adapters; we can move
 those into hadoop-common if needed and refactor the modules to share
 code. It's worth taking a look at
 org.apache.beam.sdk.io.hdfs.HadoopFileSystem.HadoopSeekableByteChannel#HadoopSeekableByteChannel
 to see if some of it could be useful.

 On Tue, Jul 2, 2019 at 11:46 PM Igor Bernstein <
 igorbernst...@google.com> wrote:
 >
 > Hi all,
 >
 > I wrote those classes with the intention of upstreaming them to Beam.
 I can try to make some time this quarter to clean them up. I would need a
 bit of guidance from a Beam expert on how to make them coexist with
 HadoopFormatIO, though.
 >
 >
 > On Tue, Jul 2, 2019 at 10:55 AM Solomon Duskis 
 wrote:
 >>
 >> +Igor Bernstein who wrote the Cloud Bigtable Sequence File classes.
 >>
 >> Solomon Duskis | Google Cloud clients | sdus...@google.com |
 914-462-0531
 >>
 >>
 >> On Tue, Jul 2, 2019 at 4:57 AM Ismaël Mejía 
 wrote:
 >>>
 >>> (Adding dev@ and Solomon Duskis to the discussion)
 >>>
 >>> I was not aware of these, thanks for sharing David. It would definitely
 >>> be a great addition if we could have those donated as an extension on
 >>> the Beam side. We can even evolve them in the future to be more
 >>> FileIO-like. Any chance this can happen? Maybe Solomon and his team?
 >>>
 >>>
 >>>
 >>> On Tue, Jul 2, 2019 at 9:39 AM David Morávek 
 wrote:
 >>> >
 >>> > Hi, you can use SequenceFileSink and Source from the BigTable
 client. Those work nicely with FileIO.
 >>> >
 >>> >
 https://github.com/googleapis/cloud-bigtable-client/blob/master/bigtable-dataflow-parent/bigtable-beam-import/src/main/java/com/google/cloud/bigtable/beam/sequencefiles/SequenceFileSink.java
 >>> >
 https://github.com/googleapis/cloud-bigtable-client/blob/master/bigtable-dataflow-parent/bigtable-beam-import/src/main/java/com/google/cloud/bigtable/beam/sequencefiles/SequenceFileSource.java
 >>> >
 >>> > It would be really cool to move these into Beam, but that's up to
 the Googlers to decide whether they want to donate this.
 >>> >
 >>> > D.
 >>> >
 >>> > On Tue, Jul 2, 2019 at 2:07 AM Shannon Duncan <
 joseph.dun...@liveramp.com> wrote:
 >>> >>
 >>> >> It's not outside the realm of possibility. For now I've
 created an intermediate step: a Hadoop job that converts from sequence
 files to text files.
 >>> >>
 >>> >> Looking into better options.
 >>> >>
 >>> >> On Mon, Jul 1, 2019, 5:50 PM Chamikara Jayalath <
 chamik...@google.com> wrote:
 >>> >>>
 >>> >>> Java SDK has a HadoopInputFormatIO using which you should be
 able 

Re: Bucketed histogram metrics in beam. Anyone currently looking into this?

2019-07-12 Thread Steve Niemitz
I've been doing some experiments in my own fork of the Dataflow worker
using HdrHistogram [1] to record histograms. I export them to our own
stats collector, not Stackdriver, but have been having good success with
them.

The problem is that the Dataflow worker metrics implementation is totally
different from the Beam metrics implementation, but the concept would
translate pretty easily, I imagine.

[1] https://github.com/HdrHistogram/HdrHistogram
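
As a minimal sketch of the recording side (the value range and units are
assumptions, not what my worker fork actually uses):

import org.HdrHistogram.Histogram;

public class LatencyHistogramDemo {
  public static void main(String[] args) {
    // Track values up to 10 minutes, in microseconds, at 3 significant digits.
    Histogram histogram = new Histogram(600_000_000L, 3);
    for (long latencyMicros : new long[] {120, 340, 95, 4_200, 880}) {
      histogram.recordValue(latencyMicros);
    }
    // Percentile snapshots are what get exported to the stats collector.
    System.out.println("p50=" + histogram.getValueAtPercentile(50.0)
        + "us p99=" + histogram.getValueAtPercentile(99.0)
        + "us max=" + histogram.getMaxValue() + "us");
  }
}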

On Fri, Jul 12, 2019 at 1:33 PM Pablo Estrada  wrote:

> I am not aware of anyone working on this. I do recall a couple of things:
>
> - These metrics can be very large in terms of space. Users may cause
> themselves trouble if they define too many of them.
> - Not enough reason not to do it, but certainly worth considering.
> - There is some code added by Boyuan to develop highly efficient
> histogram-type metrics.
>
> Best
> -P.
>
> On Fri, Jul 12, 2019 at 10:21 AM Alex Amato  wrote:
>
>> Hi,
>>
>> I was wondering if anyone has any plans to introduce bucketed
>> histograms to Beam (different from Distribution, which is just min, max,
>> sum and count values)? I have some thoughts about how it could be done so
>> that it integrates with Stackdriver.
>>
>> Essentially I am referring to a timeseries of histograms, displaying
>> buckets of values at fixed windows in time.
>>
>


Re: [Python] Read Hadoop Sequence File?

2019-07-12 Thread Shannon Duncan
Awesome. I got it working for a single file, but for a structure of:

/part-0001/index
/part-0001/data
/part-0002/index
/part-0002/data

I tried to do /part-*  and /part-*/data

It does not find the multi-part files. However, if I just do /part-0001/data
it will find it and read it.

Any ideas why?

I am using this to generate the source:

static SequenceFileSource<Text, Text> createSource(
    ValueProvider<String> sourcePattern) {
  return new SequenceFileSource<>(
      sourcePattern,
      Text.class,
      WritableSerialization.class,
      Text.class,
      WritableSerialization.class,
      SequenceFile.SYNC_INTERVAL);
}
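
For context, here is roughly how I apply it (Read.from is the standard way
to apply a bounded source; the pattern string is just an example):

PCollection<KV<Text, Text>> records =
    p.apply(Read.from(createSource(
        StaticValueProvider.of("/data/part-*/data"))));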

On Wed, Jul 10, 2019 at 10:52 AM Igor Bernstein 
wrote:

> It should be fairly straightforward:
> 1. Copy SequenceFileSource.java to your project
> 2. Add the source to your pipeline, configuring it with appropriate
> serializers. See here for an example for HBase Results.
>
> On Wed, Jul 10, 2019 at 10:58 AM Shannon Duncan <
> joseph.dun...@liveramp.com> wrote:
>
>> If I wanted to go ahead and include this within a new Java pipeline, what
>> level of work would I be looking at to integrate it?
>>
>> On Wed, Jul 3, 2019 at 3:54 AM Ismaël Mejía  wrote:
>>
>>> That's great. I can help whenever you need. We just need to choose its
>>> destination. Both the `hadoop-format` and `hadoop-file-system` modules
>>> are good candidates; I would even feel inclined to put it in its own
>>> module `sdks/java/extensions/sequencefile` to make it easier for end
>>> users to discover.
>>>
>>> A thing to consider is the SeekableByteChannel adapters; we can move
>>> those into hadoop-common if needed and refactor the modules to share
>>> code. It's worth taking a look at
>>> org.apache.beam.sdk.io.hdfs.HadoopFileSystem.HadoopSeekableByteChannel#HadoopSeekableByteChannel
>>> to see if some of it could be useful.
>>>
>>> On Tue, Jul 2, 2019 at 11:46 PM Igor Bernstein 
>>> wrote:
>>> >
>>> > Hi all,
>>> >
>>> > I wrote those classes with the intention of upstreaming them to Beam.
>>> I can try to make some time this quarter to clean them up. I would need a
>>> bit of guidance from a Beam expert on how to make them coexist with
>>> HadoopFormatIO, though.
>>> >
>>> >
>>> > On Tue, Jul 2, 2019 at 10:55 AM Solomon Duskis 
>>> wrote:
>>> >>
>>> >> +Igor Bernstein who wrote the Cloud Bigtable Sequence File classes.
>>> >>
>>> >> Solomon Duskis | Google Cloud clients | sdus...@google.com |
>>> 914-462-0531
>>> >>
>>> >>
>>> >> On Tue, Jul 2, 2019 at 4:57 AM Ismaël Mejía 
>>> wrote:
>>> >>>
>>> >>> (Adding dev@ and Solomon Duskis to the discussion)
>>> >>>
>>> >>> I was not aware of these, thanks for sharing David. It would definitely
>>> >>> be a great addition if we could have those donated as an extension on
>>> >>> the Beam side. We can even evolve them in the future to be more
>>> >>> FileIO-like. Any chance this can happen? Maybe Solomon and his team?
>>> >>>
>>> >>>
>>> >>>
>>> >>> On Tue, Jul 2, 2019 at 9:39 AM David Morávek 
>>> wrote:
>>> >>> >
>>> >>> > Hi, you can use SequenceFileSink and Source from the BigTable
>>> client. Those work nicely with FileIO.
>>> >>> >
>>> >>> >
>>> https://github.com/googleapis/cloud-bigtable-client/blob/master/bigtable-dataflow-parent/bigtable-beam-import/src/main/java/com/google/cloud/bigtable/beam/sequencefiles/SequenceFileSink.java
>>> >>> >
>>> https://github.com/googleapis/cloud-bigtable-client/blob/master/bigtable-dataflow-parent/bigtable-beam-import/src/main/java/com/google/cloud/bigtable/beam/sequencefiles/SequenceFileSource.java
>>> >>> >
>>> >>> > It would be really cool to move these into Beam, but that's up to
>>> the Googlers to decide whether they want to donate this.
>>> >>> >
>>> >>> > D.
>>> >>> >
>>> >>> > On Tue, Jul 2, 2019 at 2:07 AM Shannon Duncan <
>>> joseph.dun...@liveramp.com> wrote:
>>> >>> >>
>>> >>> >> It's not outside the realm of possibility. For now I've created
>>> an intermediate step: a Hadoop job that converts from sequence files to
>>> text files.
>>> >>> >>
>>> >>> >> Looking into better options.
>>> >>> >>
>>> >>> >> On Mon, Jul 1, 2019, 5:50 PM Chamikara Jayalath <
>>> chamik...@google.com> wrote:
>>> >>> >>>
>>> >>> >>> Java SDK has a HadoopInputFormatIO using which you should be
>>> able to read Sequence files:
>>> https://github.com/apache/beam/blob/master/sdks/java/io/hadoop-format/src/main/java/org/apache/beam/sdk/io/hadoop/format/HadoopFormatIO.java
>>> >>> >>> I don't think there's a direct alternative for this for Python.
>>> >>> >>>
>>> >>> >>> Is it possible to write to a well-known format such as Avro
>>> instead of a Hadoop specific format which will allow you to read from both

Re: Beam/Samza Ensuring At Least Once semantics

2019-07-12 Thread Lukasz Cwik
That seems to be an issue with how the commit is being restarted in Samza
and not with the Kafka source.

On Thu, Jul 11, 2019 at 4:44 PM Deshpande, Omkar 
wrote:

> Yes, we are resuming from Samza's last commit. But the problem is that the
> last commit was done for data in the window that was not completely
> processed.
>
>
>
> *From: *Lukasz Cwik 
> *Date: *Wednesday, July 10, 2019 at 11:07 AM
> *To: *dev 
> *Cc: *"LeVeck, Matt" , "Deshpande, Omkar" <
> omkar_deshpa...@intuit.com>, Xinyu Liu , Xinyu Liu
> , Samarth Shetty , "Audo,
> Nicholas" , "Cesar, Scott" <
> scott_ce...@intuit.com>, "Ho, Tom" , "
> d...@samza.apache.org" 
> *Subject: *Re: Beam/Samza Ensuring At Least Once semantics
>
>
>
> This email is from an external sender.
>
>
>
> When you restart the application, are you resuming it from Samza's last
> commit?
>
>
>
> Since the exception is thrown after the GBK, all the data could be read
> from Kafka and forwarded to the GBK operator inside of Samza and
> checkpointed in Kafka before the exception is ever thrown.
>
>
>
> On Tue, Jul 9, 2019 at 8:34 PM Benenson, Mikhail <
> mikhail_benen...@intuit.com> wrote:
>
> Hi
>
>
>
> I have run a few experiments to verify whether 'at least once' processing
> is guaranteed on Beam 2.13.0 with Samza Runner 1.1.0.
>
>
>
> The Beam application is a slightly modified streaming word count from the
> Beam examples:
>
>- read strings from input Kafka topic, print (topic, partition,
>offset, value)
>- convert values to pairs (value, 1)
>- grouping in Fixed Windows with duration 30 sec
>- sum per key
>- throw exception, if key starts with 'm'
>- write (key, sum) to output Kafka topic
>
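> In outline, the core of the pipeline looks roughly like this (topic names,
> serializers, and options wiring are placeholders, not the exact attached
> code):
>
> import org.apache.beam.sdk.Pipeline;
> import org.apache.beam.sdk.io.kafka.KafkaIO;
> import org.apache.beam.sdk.options.PipelineOptionsFactory;
> import org.apache.beam.sdk.transforms.DoFn;
> import org.apache.beam.sdk.transforms.MapElements;
> import org.apache.beam.sdk.transforms.ParDo;
> import org.apache.beam.sdk.transforms.Sum;
> import org.apache.beam.sdk.transforms.Values;
> import org.apache.beam.sdk.transforms.windowing.FixedWindows;
> import org.apache.beam.sdk.transforms.windowing.Window;
> import org.apache.beam.sdk.values.KV;
> import org.apache.beam.sdk.values.TypeDescriptors;
> import org.apache.kafka.common.serialization.StringDeserializer;
> import org.joda.time.Duration;
>
> public class AtLeastOnceExperiment {
>   public static void main(String[] args) {
>     Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
>     p.apply(KafkaIO.<String, String>read()
>             .withBootstrapServers("localhost:9092")
>             .withTopic("input")
>             .withKeyDeserializer(StringDeserializer.class)
>             .withValueDeserializer(StringDeserializer.class)
>             .commitOffsetsInFinalize()
>             .withoutMetadata())
>         .apply(Values.create())
>         .apply(MapElements
>             .into(TypeDescriptors.kvs(
>                 TypeDescriptors.strings(), TypeDescriptors.integers()))
>             .via((String v) -> KV.of(v, 1)))
>         .apply(Window.<KV<String, Integer>>into(
>             FixedWindows.of(Duration.standardSeconds(30))))
>         .apply(Sum.integersPerKey())
>         .apply(ParDo.of(new DoFn<KV<String, Integer>, KV<String, Integer>>() {
>           @ProcessElement
>           public void process(ProcessContext c) {
>             // Simulate an application bug for keys starting with 'm'.
>             if (c.element().getKey().startsWith("m")) {
>               throw new RuntimeException("failing on: " + c.element().getKey());
>             }
>             c.output(c.element());
>           }
>         }));
>     // ...followed by a KafkaIO.write() to the output topic.
>     p.run().waitUntilFinish();
>   }
> }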
>
>
> I tried KafkaIO.read() with and without commitOffsetsInFinalize(); there is
> no difference in the results.
>
>
>
> Please see the source code attached.
>
>
>
> Environment:
>
>- Run with local ZK & Kafka; pre-create the input & output topics with 1
>partition.
>- samza.properties contains "task.commit.ms=2000". According to the Samza
>docs, "this property determines how often a checkpoint is written. The value
>is the time between checkpoints, in milliseconds". See
>
> http://samza.apache.org/learn/documentation/latest/jobs/samza-configurations.html#checkpointing.
>Please see the Samza config file and run script attached.
>
>
>
>
>
> *Scenario 1: Exception in transformation*
>
>
>
> Run
>
>- Write 'a', 'b', 'c', 'm', 'd', 'e' into input topic
>- start Beam app
>- verify that the app log contains "read from topic=XXX, part=0,
>offset=100, val: e". Because the input topic has only one partition, this
>means all data has been read from Kafka.
>- wait until the app terminates because of the exception while
>processing 'm'
>
>
>
> Expectation
>
> The order of processing after grouping is not specified, so some data
> could be written to the output topic before the application terminates, but
> I expect that value=m with offset 98 and all later records must NOT be
> marked as processed, so if I restart the Beam app, I expect it to again
> throw the exception when processing value=m.
>
> Comment: throwing an exception in a transformation is not a good idea, but
> such an exception could be the result of an application error. So the
> expectation is that after fixing the error and restarting the Beam app, it
> should process the record that caused the error.
>
>
>
> Results
>
> After I restarted the app, it did NOT re-process value 'm' and did not
> throw an exception. If I add a new value 'f' to the input topic, I see "read
> from topic=XXX, part=0, offset=101, val: f", and after some time I see 'm'
> in the output topic. So the record with value 'm' is NOT processed.
>
>
>
>
>
> *Scenario 2: App termination*
>
>
>
> Run
>
>- Write 'g', 'h', 'i', 'j' into input topic
>- start Beam app
>- verify that the app log contains "read from topic=XXX, part=0,
>offset=105, val: j". Because the input topic has only one partition, this
>means that all data has been read from Kafka.
>- wait about 10 sec, then terminate the Beam app. The idea is to terminate
>the app when 'g', 'h', 'i', 'j' are waiting in the 30 sec fixed windows, but
>after task.commit.ms=2000 has passed, so the offsets are committed.
>
>
>
> Expectation
>
> As records 'g', 'h', 'i', 'j' are NOT processed, I expect that after the
> app is restarted, it again reads 'g', 'h', 'i', 'j' from the input topic
> and processes these records.
>
>
>
> Results
>
> After I restarted the app, it did NOT re-process the 'g', 'h', 'i', 'j'
> values. If I add a new value 'k' to the input topic, I see "read from
> topic=XXX, part=0, offset=106, val: k", and after some time I see 'k' in
> the output topic. So the records with values 'g', 'h', 'i', 'j' are NOT
> processed.
>
>
>
>
>
> Based on these results I'm inclined to conclude that Beam with the Samza
> runner does NOT provide an 'at least once' guarantee for processing.
>
>
>
> Have I missed something?
>
>
>
> --
>
> Michael Benenson
>
>
>
>
>
> *From: *"LeVeck, Matt" 
> *Date: *Monday, July 1, 

Re: Jira Contributors List

2019-07-12 Thread sridhar inuog
Thank you, guys!

On Fri, Jul 12, 2019 at 12:30 PM Rui Wang  wrote:

> Indeed! Welcome!
>
>
> -Rui
>
> On Fri, Jul 12, 2019 at 10:16 AM Pablo Estrada  wrote:
>
>> It seems that both have been added. Welcome!
>>
>> On Fri, Jul 12, 2019 at 10:12 AM Rui Wang  wrote:
>>
>>> Hi Francesco,
>>>
>>> What's your JIRA ID?
>>>
>>>
>>> -Rui
>>>
>>> On Thu, Jul 11, 2019 at 9:17 AM Francesco Perera 
>>> wrote:
>>>
 Hi,
 I am new to the Beam community but I am eager to contribute back. I am
 going to work on this issue:
 https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=245&view=detail&selectedIssue=BEAM-7198
 but can someone here add me to the contributors list in Jira?

 Thanks,
 Francesco
 --
 Francesco Perera Kuranapatabendige
 fperer...@gmail.com | 646-719-6970 | www.linkedin.com/in/fperera




>>>


Re: Bucketed histogram metrics in beam. Anyone currently looking into this?

2019-07-12 Thread Pablo Estrada
I am not aware of anyone working on this. I do recall a couple of things:

- These metrics can be very large in terms of space. Users may cause
themselves trouble if they define too many of them.
- Not enough reason not to do it, but certainly worth considering.
- There is some code added by Boyuan to develop highly efficient
histogram-type metrics.

Best
-P.

On Fri, Jul 12, 2019 at 10:21 AM Alex Amato  wrote:

> Hi,
>
> I was wondering if anyone has any plans to introduce bucketed histograms to
> Beam (different from Distribution, which is just min, max, sum and count
> values)? I have some thoughts about how it could be done so that it
> integrates with Stackdriver.
>
> Essentially I am referring to a timeseries of histograms, displaying
> buckets of values at fixed windows in time.
>


Re: Jira Contributors List

2019-07-12 Thread Rui Wang
Indeed! Welcome!


-Rui

On Fri, Jul 12, 2019 at 10:16 AM Pablo Estrada  wrote:

> It seems that both have been added. Welcome!
>
> On Fri, Jul 12, 2019 at 10:12 AM Rui Wang  wrote:
>
>> Hi Francesco,
>>
>> What's your JIRA ID?
>>
>>
>> -Rui
>>
>> On Thu, Jul 11, 2019 at 9:17 AM Francesco Perera 
>> wrote:
>>
>>> Hi,
>>> I am new to the Beam community but I am eager to contribute back. I am
>>> going to work on this issue:
>>> https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=245&view=detail&selectedIssue=BEAM-7198
>>> but can someone here add me to the contributors list in Jira?
>>>
>>> Thanks,
>>> Francesco
>>> --
>>> Francesco Perera Kuranapatabendige
>>> fperer...@gmail.com | 646-719-6970 | www.linkedin.com/in/fperera
>>>
>>>
>>>
>>>
>>


Bucketed histogram metrics in beam. Anyone currently looking into this?

2019-07-12 Thread Alex Amato
Hi,

I was wondering if anyone has any plans to introduce bucketed histograms to
Beam (different from Distribution, which is just min, max, sum and count
values)? I have some thoughts about how it could be done so that it
integrates with Stackdriver.

Essentially I am referring to a timeseries of histograms, displaying
buckets of values at fixed windows in time.
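
In case it helps the discussion, here is a minimal sketch of the shape I
mean (illustrative only, not an existing Beam API): fixed bucket
boundaries, with per-bucket counts reported for each time window.

import java.util.Arrays;

public class BucketedHistogram {
  private final long[] boundaries; // upper bounds, sorted ascending
  private final long[] counts;     // counts[i] = values <= boundaries[i]; last slot is overflow

  public BucketedHistogram(long[] boundaries) {
    this.boundaries = boundaries.clone();
    this.counts = new long[boundaries.length + 1];
  }

  public void record(long value) {
    // Linear scan is fine for a handful of buckets; binary search also works.
    int i = 0;
    while (i < boundaries.length && value > boundaries[i]) {
      i++;
    }
    counts[i]++;
  }

  public long[] snapshotAndReset() {
    // Called once per reporting window to produce one point of the timeseries.
    long[] snapshot = counts.clone();
    Arrays.fill(counts, 0L);
    return snapshot;
  }
}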


Re: Jira Contributors List

2019-07-12 Thread Pablo Estrada
It seems that both have been added. Welcome!

On Fri, Jul 12, 2019 at 10:12 AM Rui Wang  wrote:

> Hi Francesco,
>
> What's your JIRA ID?
>
>
> -Rui
>
> On Thu, Jul 11, 2019 at 9:17 AM Francesco Perera 
> wrote:
>
>> Hi,
>> I am new to the Beam community but I am eager to contribute back. I am
>> going to work on this issue:
>> https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=245&view=detail&selectedIssue=BEAM-7198
>> but can someone here add me to the contributors list in Jira?
>>
>> Thanks,
>> Francesco
>> --
>> Francesco Perera Kuranapatabendige
>> fperer...@gmail.com | 646-719-6970 | www.linkedin.com/in/fperera
>>
>>
>>
>>
>


Re: Jira Contributors List

2019-07-12 Thread Rui Wang
Hi Francesco,

What's your JIRA ID?


-Rui

On Thu, Jul 11, 2019 at 9:17 AM Francesco Perera 
wrote:

> Hi,
> I am new to the Beam community but I am eager to contribute back. I am
> going to work on this issue:
> https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=245&view=detail&selectedIssue=BEAM-7198
> but can someone here add me to the contributors list in Jira?
>
> Thanks,
> Francesco
> --
> Francesco Perera Kuranapatabendige
> fperer...@gmail.com | 646-719-6970 | www.linkedin.com/in/fperera
>
>
>
>


Re: [VOTE] Vendored Dependencies Release

2019-07-12 Thread Kai Jiang
+1 (non-binding)

On Thu, Jul 11, 2019 at 8:27 PM Lukasz Cwik  wrote:

> Please review the release of the following artifacts that we vendor:
>  * beam-vendor-grpc_1_21_0
>  * beam-vendor-guava-26_0-jre
>  * beam-vendor-bytebuddy-1_9_3
>
> Hi everyone,
> Please review and vote on the release candidate #3 for the
> org.apache.beam:beam-vendor-grpc_1_21_0:0.1,
> org.apache.beam:beam-vendor-guava-26_0-jre:0.1, and
> org.apache.beam:beam-vendor-bytebuddy-1_9_3:0.1 as follows:
> [ ] +1, Approve the release
> [ ] -1, Do not approve the release (please provide specific comments)
>
>
> The complete staging area is available for your review, which includes:
> * the official Apache source release to be deployed to dist.apache.org
> [1], which is signed with the key with fingerprint
> EAD5DE293F4A03DD2E77565589E68A56E371CCA2 [2],
> * all artifacts to be deployed to the Maven Central Repository [3],
> * commit hash "0fce2b88660f52dae638697e1472aa108c982ae6" [4],
>
> The vote will be open for at least 72 hours. It is adopted by majority
> approval, with at least 3 PMC affirmative votes.
>
> Thanks,
> Luke
>
> [1] https://dist.apache.org/repos/dist/dev/beam/vendor/
> 
> [2] https://dist.apache.org/repos/dist/release/beam/KEYS
> 
> [3] https://repository.apache.org/content/repositories/orgapachebeam-1078/
> 
> [4]
> https://github.com/apache/beam/commit/0fce2b88660f52dae638697e1472aa108c982ae6
> 
>


Re: [Java] Using a complex datastructure as Key for KV

2019-07-12 Thread Shannon Duncan
I tried to pass an ArrayList in and it wouldn't generalize it to List. It
required me to convert my ArrayLists to Lists.

On Fri, Jul 12, 2019 at 10:20 AM Lukasz Cwik  wrote:

> Additional coders would be useful. Note that we usually don't have coders
> for specific collection types like ArrayList, but prefer to have Coders for
> their general counterparts like List, Map, Iterable, ...
>
> There has been discussion in the past to make the MapCoder a deterministic
> coder when a coder is required to be deterministic. There are a few people
> working on schema support within Apache Beam that might be able to provide
> guidance (+Reuven Lax  +Brian Hulette
> ).
>
> On Fri, Jul 12, 2019 at 11:05 AM Shannon Duncan <
> joseph.dun...@liveramp.com> wrote:
>
>> I have a working TreeMapCoder now. Got it all set up and done, and the
>> GroupByKey is accepting it.
>>
>> Thanks for all the help. I need to read up more on the contributing
>> guidelines, then I'll PR the coder into the SDK. I'm also willing to write
>> coders for things such as ArrayList if people want them.
>>
>> On Fri, Jul 12, 2019 at 9:31 AM Shannon Duncan <
>> joseph.dun...@liveramp.com> wrote:
>>
>>> Aha, makes sense. Thanks!
>>>
>>> On Fri, Jul 12, 2019 at 9:26 AM Lukasz Cwik  wrote:
>>>
 TreeMapCoder.of(StringUtf8Coder.of(), ListCoder.of(VarIntCoder.of()));

 On Fri, Jul 12, 2019 at 10:22 AM Shannon Duncan <
 joseph.dun...@liveramp.com> wrote:

> So I have my custom coder created for TreeMap and I'm ready to set
> it...
>
> So my type is "TreeMap<String, List<Integer>>"
>
> What do I put for ".setCoder(TreeMapCoder.of(???, ???))"
>
> On Thu, Jul 11, 2019 at 8:21 PM Rui Wang  wrote:
>
>> Hi Shannon,  [1] will be a good start on coder in Java SDK.
>>
>>
>> [1]
>> https://beam.apache.org/documentation/programming-guide/#data-encoding-and-type-safety
>>
>> Rui
>>
>> On Thu, Jul 11, 2019 at 3:08 PM Shannon Duncan <
>> joseph.dun...@liveramp.com> wrote:
>>
>>> Was able to get it to use ArrayList by doing List<List<Integer>>
>>> result = new ArrayList<List<Integer>>();
>>>
>>> Then storing my keys in a separate array that I'll pass in as a side
>>> input to key for the list of lists.
>>>
>>> Thanks for the help, lemme know more in the future about how coders
>>> work and instantiate and I'd love to help contribute by adding some new
>>> coders.
>>>
>>> - Shannon
>>>
>>> On Thu, Jul 11, 2019 at 4:59 PM Shannon Duncan <
>>> joseph.dun...@liveramp.com> wrote:
>>>
 Will do. Thanks. A new coder for deterministic Maps would be great
 in the future. Thank you!

 On Thu, Jul 11, 2019 at 4:58 PM Rui Wang  wrote:

> I think Mike refers to ListCoder, which
> is deterministic if its element coder is. Maybe you can search the repo
> for examples of ListCoder?
>
>
> -Rui
>
> On Thu, Jul 11, 2019 at 2:55 PM Shannon Duncan <
> joseph.dun...@liveramp.com> wrote:
>
>> So ArrayList doesn't work either, so just a standard List?
>>
>> On Thu, Jul 11, 2019 at 4:53 PM Rui Wang 
>> wrote:
>>
>>> Shannon, I agree with Mike that List is a good workaround if the
>>> elements within the list are deterministic and you are eager to get
>>> your new pipeline working.
>>>
>>>
>>> Let me send back some pointers on adding a new coder later.
>>>
>>>
>>> -Rui
>>>
>>> On Thu, Jul 11, 2019 at 2:45 PM Shannon Duncan <
>>> joseph.dun...@liveramp.com> wrote:
>>>
 I just started learning Java today to attempt to convert our
 Python pipelines to Java to take advantage of key features that Java has. I
 have no idea how I would create a new coder and include it for Beam to
 recognize.

 If you can point me in the right direction of where it hooks
 together, I might be able to figure that out. I can duplicate MapCoder and
 try to make changes, but how will Beam know to pick up that coder for a
 GroupByKey?

 Thanks!
 Shannon

 On Thu, Jul 11, 2019 at 4:42 PM Rui Wang 
 wrote:

> It should be fairly straightforward to create a SortedMapCoder
> for TreeMap: just add checks on map instances and then change
> verifyDeterministic.
>
> If this is a common need we could just submit it into the Beam
> repo.
>
> [1]:
> 

Re: [Java] Using a complex datastructure as Key for KV

2019-07-12 Thread Lukasz Cwik
Additional coders would be useful. Note that we usually don't have coders
for specific collection types like ArrayList, but prefer to have Coders for
their general counterparts like List, Map, Iterable, ...

There has been discussion in the past to make the MapCoder a deterministic
coder when a coder is required to be deterministic. There are a few people
working on schema support within Apache Beam that might be able to provide
guidance (+Reuven Lax  +Brian Hulette
).
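
For reference, a minimal sketch of what such a coder could look like,
modeled on MapCoder (the class name and structure are illustrative, not a
final API). It would be used as in the example below, e.g.
.setCoder(TreeMapCoder.of(StringUtf8Coder.of(), ListCoder.of(VarIntCoder.of()))).

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import org.apache.beam.sdk.coders.Coder;
import org.apache.beam.sdk.coders.StructuredCoder;
import org.apache.beam.sdk.coders.VarIntCoder;

public class TreeMapCoder<K extends Comparable<K>, V>
    extends StructuredCoder<TreeMap<K, V>> {
  private static final VarIntCoder SIZE_CODER = VarIntCoder.of();
  private final Coder<K> keyCoder;
  private final Coder<V> valueCoder;

  public static <K extends Comparable<K>, V> TreeMapCoder<K, V> of(
      Coder<K> keyCoder, Coder<V> valueCoder) {
    return new TreeMapCoder<>(keyCoder, valueCoder);
  }

  private TreeMapCoder(Coder<K> keyCoder, Coder<V> valueCoder) {
    this.keyCoder = keyCoder;
    this.valueCoder = valueCoder;
  }

  @Override
  public void encode(TreeMap<K, V> map, OutputStream out) throws IOException {
    // Size first, then entries; TreeMap iterates in sorted key order, which
    // is what makes the encoding deterministic.
    SIZE_CODER.encode(map.size(), out);
    for (Map.Entry<K, V> entry : map.entrySet()) {
      keyCoder.encode(entry.getKey(), out);
      valueCoder.encode(entry.getValue(), out);
    }
  }

  @Override
  public TreeMap<K, V> decode(InputStream in) throws IOException {
    int size = SIZE_CODER.decode(in);
    TreeMap<K, V> map = new TreeMap<>();
    for (int i = 0; i < size; i++) {
      map.put(keyCoder.decode(in), valueCoder.decode(in));
    }
    return map;
  }

  @Override
  public List<? extends Coder<?>> getCoderArguments() {
    return Arrays.asList(keyCoder, valueCoder);
  }

  @Override
  public void verifyDeterministic() throws NonDeterministicException {
    // Unlike MapCoder, a sorted map can be deterministic when its
    // component coders are.
    verifyDeterministic(this, "Key coder must be deterministic", keyCoder);
    verifyDeterministic(this, "Value coder must be deterministic", valueCoder);
  }
}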

On Fri, Jul 12, 2019 at 11:05 AM Shannon Duncan 
wrote:

> I have a working TreeMapCoder now. Got it all set up and done, and the
> GroupByKey is accepting it.
>
> Thanks for all the help. I need to read up more on the contributing
> guidelines, then I'll PR the coder into the SDK. I'm also willing to write
> coders for things such as ArrayList if people want them.
>
> On Fri, Jul 12, 2019 at 9:31 AM Shannon Duncan 
> wrote:
>
>> Aha, makes sense. Thanks!
>>
>> On Fri, Jul 12, 2019 at 9:26 AM Lukasz Cwik  wrote:
>>
>>> TreeMapCoder.of(StringUtf8Coder.of(), ListCoder.of(VarIntCoder.of()));
>>>
>>> On Fri, Jul 12, 2019 at 10:22 AM Shannon Duncan <
>>> joseph.dun...@liveramp.com> wrote:
>>>
 So I have my custom coder created for TreeMap and I'm ready to set it...

 So my type is "TreeMap<String, List<Integer>>"

 What do I put for ".setCoder(TreeMapCoder.of(???, ???))"

 On Thu, Jul 11, 2019 at 8:21 PM Rui Wang  wrote:

> Hi Shannon,  [1] will be a good start on coder in Java SDK.
>
>
> [1]
> https://beam.apache.org/documentation/programming-guide/#data-encoding-and-type-safety
>
> Rui
>
> On Thu, Jul 11, 2019 at 3:08 PM Shannon Duncan <
> joseph.dun...@liveramp.com> wrote:
>
>> Was able to get it to use ArrayList by doing List<List<Integer>>
>> result = new ArrayList<List<Integer>>();
>>
>> Then storing my keys in a separate array that I'll pass in as a side
>> input to key for the list of lists.
>>
>> Thanks for the help, lemme know more in the future about how coders
>> work and instantiate and I'd love to help contribute by adding some new
>> coders.
>>
>> - Shannon
>>
>> On Thu, Jul 11, 2019 at 4:59 PM Shannon Duncan <
>> joseph.dun...@liveramp.com> wrote:
>>
>>> Will do. Thanks. A new coder for deterministic Maps would be great
>>> in the future. Thank you!
>>>
>>> On Thu, Jul 11, 2019 at 4:58 PM Rui Wang  wrote:
>>>
 I think Mike refers to ListCoder, which
 is deterministic if its element coder is. Maybe you can search the repo
 for examples of ListCoder?


 -Rui

 On Thu, Jul 11, 2019 at 2:55 PM Shannon Duncan <
 joseph.dun...@liveramp.com> wrote:

> So ArrayList doesn't work either, so just a standard List?
>
> On Thu, Jul 11, 2019 at 4:53 PM Rui Wang 
> wrote:
>
>> Shannon, I agree with Mike that List is a good workaround if the
>> elements within the list are deterministic and you are eager to get
>> your new pipeline working.
>>
>>
>> Let me send back some pointers on adding a new coder later.
>>
>>
>> -Rui
>>
>> On Thu, Jul 11, 2019 at 2:45 PM Shannon Duncan <
>> joseph.dun...@liveramp.com> wrote:
>>
>>> I just started learning Java today to attempt to convert our
>>> Python pipelines to Java to take advantage of key features that Java has.
>>> I have no idea how I would create a new coder and include it for Beam to
>>> recognize.
>>>
>>> If you can point me in the right direction of where it hooks together, I
>>> might be able to figure that out. I can duplicate MapCoder and try to make
>>> changes, but how will Beam know to pick up that coder for a GroupByKey?
>>>
>>> Thanks!
>>> Shannon
>>>
>>> On Thu, Jul 11, 2019 at 4:42 PM Rui Wang 
>>> wrote:
>>>
 It should be fairly straightforward to create a SortedMapCoder
 for TreeMap: just add checks on map instances and then change
 verifyDeterministic.

 If this is a common need we could just submit it into the Beam
 repo.
 [1]:
 https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/MapCoder.java#L146

 On Thu, Jul 11, 2019 at 2:26 PM Mike Pedersen <
 m...@mikepedersen.dk> wrote:

> There isn't a coder for deterministic maps in Beam, so even if
> your datastructure is deterministic, Beam will assume the 
> serialized 

Re: Phrase triggering jobs problem

2019-07-12 Thread Michał Walenia
Thanks for the heads up, I'll get in touch with him so that I don't
duplicate the research.


On Fri, Jul 12, 2019 at 3:55 PM Lukasz Cwik  wrote:

> I believe Scott Wegner investigated the new plugin (about 10 months ago)
> because it seemed like it could filter out running tests based upon paths
> but it was lacking some other feature(s) that the old plugin had that we
> were already using.
>
> On Fri, Jul 12, 2019 at 4:52 AM Katarzyna Kucharczyk <
> ka.kucharc...@gmail.com> wrote:
>
>> Just for knowledge-sharing purposes, here is a link to the conversation
>> about the new plugin.
>>
>> Kasia
>>
>> On Fri, Jul 12, 2019 at 10:47 AM Michał Walenia <
>> michal.wale...@polidea.com> wrote:
>>
>>> Hi,
>>> I think I'd like to take a look at it. I'll assign the issue to myself
>>> and I'll keep you posted on my findings.
>>>
>>> Have a good day
>>>
>>> Michal
>>>
>>> On Thu, Jul 11, 2019 at 8:10 PM Udi Meiri  wrote:
>>>
 Opened https://issues.apache.org/jira/browse/BEAM-7725 for migration
 off the old plugin onto the new (already deprecated I might add) plugin.
 Any takers?

 On Thu, Jul 11, 2019 at 10:53 AM Udi Meiri  wrote:

> Okay, phrase triggering is working again (they re-enabled the plugin).
> See notes in bug for details.
>
> On Thu, Jul 11, 2019 at 10:04 AM Udi Meiri  wrote:
>
>> I've opened a bug: https://issues.apache.org/jira/browse/BEAM-7723
>> If anyone is working on this please assign yourself
>>
>> On Wed, Jul 10, 2019 at 5:57 PM Udi Meiri  wrote:
>>
>>> Thanks Kenn.
>>>
>>> On Wed, Jul 10, 2019 at 3:31 PM Kenneth Knowles 
>>> wrote:
>>>
 Just noticed this thread. Infra turned off one of the GitHub
 plugins - the one we use. I forwarded the announcement. I'll see if we 
 can
 get it back on for a bit so we can migrate off. I'm not sure if they 
 have
 identical job DSL or not.

 On Wed, Jul 10, 2019 at 12:32 PM Udi Meiri 
 wrote:

> Still happening for me too.
>
> On Wed, Jul 10, 2019 at 10:40 AM Lukasz Cwik 
> wrote:
>
>> This has happened in the past. Usually there is some issue where
>> Jenkins isn't notified of new PRs by Github or doesn't see the PR 
>> phrases
>> and hence Jenkins sits around idle. This is usually fixed after a 
>> few hours
>> without any action on our part.
>>
>> On Wed, Jul 10, 2019 at 10:28 AM Katarzyna Kucharczyk <
>> ka.kucharc...@gmail.com> wrote:
>>
>>> Hi all,
>>>
>>> Hope it's not a duplicate, but I couldn't find whether any issue with
>>> phrase triggering in Jenkins was already raised here.
>>> I've now opened a third PR and no tests were triggered there.
>>> I tried to trigger some tests manually, but with no effect.
>>>
>>> Am I missing something?
>>>
>>> Here are links to my problematic PRs:
>>> https://github.com/apache/beam/pull/9033
>>> https://github.com/apache/beam/pull/9034
>>> https://github.com/apache/beam/pull/9035
>>>
>>> Thanks,
>>> Kasia
>>>
>>
>>>
>>> --
>>>
>>> Michał Walenia
>>> Polidea  | Software Engineer
>>>
>>> M: +48 791 432 002 <+48791432002>
>>> E: michal.wale...@polidea.com
>>>
>>> Unique Tech
>>> Check out our projects! 
>>>
>>

-- 

Michał Walenia
Polidea  | Software Engineer

M: +48 791 432 002 <+48791432002>
E: michal.wale...@polidea.com

Unique Tech
Check out our projects! 


Re: Circular dependencies between DataflowRunner and google cloud IO

2019-07-12 Thread Lukasz Cwik
Yes, there is a dependency from Dataflow to the GCP IOs, and this is expected
since Dataflow depends on parts of those implementations for its own
execution purposes. We definitely don't want GCP IOs depending on Dataflow,
since we would like users of other runners to still be able to use GCP IOs
without bringing in Dataflow-specific dependencies.

There is already a task defined inside the Dataflow runner package,
googleCloudPlatformLegacyWorkerIntegrationTest [1], that is meant to run
integration tests defined in the GCP IO package; does this fit your
needs?

1:
https://github.com/apache/beam/blob/0fce2b88660f52dae638697e1472aa108c982ae6/runners/google-cloud-dataflow-java/build.gradle#L318

On Fri, Jul 12, 2019 at 5:17 AM Michał Walenia 
wrote:

> Hi all,
> recently when I was trying to implement a performance test of BigQueryIO,
> I ran into an issue when trying to run the test on Dataflow.
> The problem was that I encountered a circular dependency when compiling
> the tests. I added the test in org.apache.beam.sdk.io.gcp.bigquery package,
> so I also needed to add DataflowRunner as a dependency in order to launch
> the test. The error was that DataflowRunner package depends on
> org.apache.beam.sdk.io.gcp.bigquery package (for example in [1]). Should
> it be like that?
> For now, in order to solve the problem, I intend to move the performance
> test to its own package in my PR [2] I am wondering about the right
> approach to this - shouldn’t we decouple the DataflowRunner code from IOs?
> If not, what’s the reason behind the way the modules are organized?
> I noticed that 5 tests are excluded from the integrationTest task in
> io.google-cloud-platform.bigquery build.gradle file [3]. Are they
> launched on Dataflow anywhere? I couldn’t find their usage except for the
> exclusions.
>
> [1] PubSubIO translations section in DataflowRunner.java
> 
> [2] My PR 
> [3] DefaultCoderCloudObjectTranslatorRegistrar
> 
>
> Best regards
> Michal
>
> --
>
> Michał Walenia
> Polidea  | Software Engineer
>
> M: +48 791 432 002 <+48791432002>
> E: michal.wale...@polidea.com
>
> Unique Tech
> Check out our projects! 
>


Circular dependencies between DataflowRunner and google cloud IO

2019-07-12 Thread Michał Walenia
Hi all,
recently when I was trying to implement a performance test of BigQueryIO, I
ran into an issue when trying to run the test on Dataflow.
The problem was that I encountered a circular dependency when compiling the
tests. I added the test in the org.apache.beam.sdk.io.gcp.bigquery package,
so I also needed to add DataflowRunner as a dependency in order to launch the
test. The error was that the DataflowRunner package depends on the
org.apache.beam.sdk.io.gcp.bigquery package (for example in [1]). Should it
be like that?
For now, in order to solve the problem, I intend to move the performance
test to its own package in my PR [2]. I am wondering about the right
approach to this: shouldn't we decouple the DataflowRunner code from the IOs?
If not, what's the reason behind the way the modules are organized?
I noticed that 5 tests are excluded from the integrationTest task in the
io.google-cloud-platform.bigquery build.gradle file [3]. Are they launched
on Dataflow anywhere? I couldn't find their usage except for the exclusions.

[1] PubSubIO translations section in DataflowRunner.java

[2] My PR 
[3] DefaultCoderCloudObjectTranslatorRegistrar


Best regards
Michal

-- 

Michał Walenia
Polidea  | Software Engineer

M: +48 791 432 002 <+48791432002>
E: michal.wale...@polidea.com

Unique Tech
Check out our projects! 


Re: Phrase triggering jobs problem

2019-07-12 Thread Katarzyna Kucharczyk
Just for knowledge-sharing purposes, here is a link to the conversation
about the new plugin.

Kasia

On Fri, Jul 12, 2019 at 10:47 AM Michał Walenia 
wrote:

> Hi,
> I think I'd like to take a look at it. I'll assign the issue to myself and
> I'll keep you posted on my findings.
>
> Have a good day
>
> Michal
>
> On Thu, Jul 11, 2019 at 8:10 PM Udi Meiri  wrote:
>
>> Opened https://issues.apache.org/jira/browse/BEAM-7725 for migration off
>> the old plugin onto the new (already deprecated I might add) plugin.
>> Any takers?
>>
>> On Thu, Jul 11, 2019 at 10:53 AM Udi Meiri  wrote:
>>
>>> Okay, phrase triggering is working again (they re-enabled the plugin).
>>> See notes in bug for details.
>>>
>>> On Thu, Jul 11, 2019 at 10:04 AM Udi Meiri  wrote:
>>>
 I've opened a bug: https://issues.apache.org/jira/browse/BEAM-7723
 If anyone is working on this please assign yourself

 On Wed, Jul 10, 2019 at 5:57 PM Udi Meiri  wrote:

> Thanks Kenn.
>
> On Wed, Jul 10, 2019 at 3:31 PM Kenneth Knowles 
> wrote:
>
>> Just noticed this thread. Infra turned off one of the GitHub plugins
>> - the one we use. I forwarded the announcement. I'll see if we can get it
>> back on for a bit so we can migrate off. I'm not sure if they have
>> identical job DSL or not.
>>
>> On Wed, Jul 10, 2019 at 12:32 PM Udi Meiri  wrote:
>>
>>> Still happening for me too.
>>>
>>> On Wed, Jul 10, 2019 at 10:40 AM Lukasz Cwik 
>>> wrote:
>>>
 This has happened in the past. Usually there is some issue where
 Jenkins isn't notified of new PRs by Github or doesn't see the PR 
 phrases
 and hence Jenkins sits around idle. This is usually fixed after a few 
 hours
 without any action on our part.

 On Wed, Jul 10, 2019 at 10:28 AM Katarzyna Kucharczyk <
 ka.kucharc...@gmail.com> wrote:

> Hi all,
>
> Hope it's not a duplicate, but I couldn't find whether any issue with phrase
> triggering in Jenkins was already raised here.
> I've now opened a third PR and no tests were triggered there. I
> tried to trigger some tests manually, but with no effect.
>
> Am I missing something?
>
> Here are links to my problematic PRs:
> https://github.com/apache/beam/pull/9033
> https://github.com/apache/beam/pull/9034
> https://github.com/apache/beam/pull/9035
>
> Thanks,
> Kasia
>

>
> --
>
> Michał Walenia
> Polidea  | Software Engineer
>
> M: +48 791 432 002 <+48791432002>
> E: michal.wale...@polidea.com
>
> Unique Tech
> Check out our projects! 
>


Re: Phrase triggering jobs problem

2019-07-12 Thread Michał Walenia
Hi,
I think I'd like to take a look at it. I'll assign the issue to myself and
I'll keep you posted on my findings.

Have a good day

Michal

On Thu, Jul 11, 2019 at 8:10 PM Udi Meiri  wrote:

> Opened https://issues.apache.org/jira/browse/BEAM-7725 for migration off
> the old plugin onto the new (already deprecated I might add) plugin.
> Any takers?
>
> On Thu, Jul 11, 2019 at 10:53 AM Udi Meiri  wrote:
>
>> Okay, phrase triggering is working again (they re-enabled the plugin).
>> See notes in bug for details.
>>
>> On Thu, Jul 11, 2019 at 10:04 AM Udi Meiri  wrote:
>>
>>> I've opened a bug: https://issues.apache.org/jira/browse/BEAM-7723
>>> If anyone is working on this please assign yourself
>>>
>>> On Wed, Jul 10, 2019 at 5:57 PM Udi Meiri  wrote:
>>>
 Thanks Kenn.

 On Wed, Jul 10, 2019 at 3:31 PM Kenneth Knowles 
 wrote:

> Just noticed this thread. Infra turned off one of the GitHub plugins -
> the one we use. I forwarded the announcement. I'll see if we can get it
> back on for a bit so we can migrate off. I'm not sure if they have
> identical job DSL or not.
>
> On Wed, Jul 10, 2019 at 12:32 PM Udi Meiri  wrote:
>
>> Still happening for me too.
>>
>> On Wed, Jul 10, 2019 at 10:40 AM Lukasz Cwik 
>> wrote:
>>
>>> This has happened in the past. Usually there is some issue where
>>> Jenkins isn't notified of new PRs by Github or doesn't see the PR 
>>> phrases
>>> and hence Jenkins sits around idle. This is usually fixed after a few 
>>> hours
>>> without any action on our part.
>>>
>>> On Wed, Jul 10, 2019 at 10:28 AM Katarzyna Kucharczyk <
>>> ka.kucharc...@gmail.com> wrote:
>>>
 Hi all,

 Hope it's not a duplicate, but I couldn't find whether any issue with phrase
 triggering in Jenkins was already raised here.
 I've now opened a third PR and no tests were triggered there. I
 tried to trigger some tests manually, but with no effect.

 Am I missing something?

 Here are links to my problematic PRs:
 https://github.com/apache/beam/pull/9033
 https://github.com/apache/beam/pull/9034
 https://github.com/apache/beam/pull/9035

 Thanks,
 Kasia

>>>

-- 

Michał Walenia
Polidea  | Software Engineer

M: +48 791 432 002 <+48791432002>
E: michal.wale...@polidea.com

Unique Tech
Check out our projects!