Re: Returning multiple PCollections from a PTransform

2020-08-04 Thread Robert Bradshaw
Yes, this is explicitly supported. You can return named tuples and
dictionaries (with PCollections as values) as well.
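
For example, a minimal, untested sketch of a composite transform that returns
a dictionary of PCollections, with a caller that indexes into the result (the
transform and step names below are made up for illustration):

---
import apache_beam as beam

@beam.ptransform_fn
def split_parities(pcoll):
    # Return a dict whose values are PCollections; callers index by key.
    evens = pcoll | 'evens' >> beam.Filter(lambda x: x % 2 == 0)
    odds = pcoll | 'odds' >> beam.Filter(lambda x: x % 2 == 1)
    return {'evens': evens, 'odds': odds}

with beam.Pipeline() as p:
    nums = p | beam.Create(list(range(10)))
    outputs = nums | split_parities()
    outputs['evens'] | 'print evens' >> beam.Map(lambda x: print('even %d' % x))
    outputs['odds'] | 'print odds' >> beam.Map(lambda x: print('odd %d' % x))
---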

On Tue, Aug 4, 2020 at 5:00 PM Harrison Green  wrote:
>
> Hi all,
>
> I've run into a situation where I would like to return two PCollections 
> during a PTransform. I am aware of the ParDo.with_outputs construct but in 
> this case, the PCollections are the flattened results of several other 
> transforms and it would be cleaner to just return multiple PCollections in a 
> tuple.
>
> I've tested this out with the following snippet and it seems to work (at 
> least on the direct runner):
>
> ---
> import apache_beam as beam
>
> @beam.ptransform_fn
> def test(pcoll):
>     a = pcoll | '1' >> beam.Map(lambda x: x+1)
>     b = pcoll | '2' >> beam.Map(lambda x: x+10)
>
>     return (a, b)
>
> with beam.Pipeline() as p:
>     c = p | beam.Create(list(range(10)))
>
>     a, b = c | test()
>
>     a | 'a' >> beam.Map(lambda x: print('a %d' % x))
>     b | 'b' >> beam.Map(lambda x: print('b %d' % x))
> ---
>
> I'm curious if this type of pipeline construction is well-supported and if I 
> will run into any issues on other runners.
>
> Thanks!
> - Harrison


Returning multiple PCollections from a PTransform

2020-08-04 Thread Harrison Green
Hi all,

I've run into a situation where I would like to return two PCollections
from a PTransform. I am aware of the ParDo.with_outputs construct, but in
this case the PCollections are the flattened results of several other
transforms, and it would be cleaner to just return multiple PCollections in
a tuple.

I've tested this out with the following snippet and it seems to work (at
least on the direct runner):

---
import apache_beam as beam

@beam.ptransform_fn
def test(pcoll):
    a = pcoll | '1' >> beam.Map(lambda x: x+1)
    b = pcoll | '2' >> beam.Map(lambda x: x+10)

    return (a, b)

with beam.Pipeline() as p:
    c = p | beam.Create(list(range(10)))

    a, b = c | test()

    a | 'a' >> beam.Map(lambda x: print('a %d' % x))
    b | 'b' >> beam.Map(lambda x: print('b %d' % x))
---

I'm curious if this type of pipeline construction is well-supported and if
I will run into any issues on other runners.

Thanks!
- Harrison


Re: Request Throttling in OSSIO

2020-08-04 Thread Praveen K Viswanathan
Thanks Luke. I will go through them and come back if I have any questions.

Regards,
Praveen

On Tue, Aug 4, 2020 at 3:55 PM Luke Cwik  wrote:

> Take a look at the WatchGrowthFn[1] and also the in-progress Kafka PR[2].
>
> 1:
> https://github.com/apache/beam/blob/6612b24ada9382706373db547b5606d6e0496b02/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/Watch.java#L787
> 2: https://github.com/apache/beam/pull/11749
>
> On Tue, Aug 4, 2020 at 3:33 PM Praveen K Viswanathan <
> harish.prav...@gmail.com> wrote:
>
>> Thanks for the suggestions Luke. As you know, we are just starting and
>> should be able to switch to SplittableDoFn, if that's the future of Beam IO
>> Connectors. The SplittableDoFn page has the design details but it would be
>> great if we can look into an IO connector built using SplittableDoFn
>> for reference and to map the design details with actual implementation.
>> Could you please suggest any such IO for reference.
>>
>> I will also parallely try your suggestion in advance() and checkpoint
>> mark coder to close that issue.
>>
>> Thanks,
>> Praveen
>>
>> On Mon, Aug 3, 2020 at 3:28 PM Luke Cwik  wrote:
>>
>>> Since you are working on a new connector I would very strongly
>>> suggest writing it as a splittable DoFn instead of an UnboundedSource. See
>>> this thread[1] about additional details and some caveats on the
>>> recommendation.
>>>
>>> 1) You can return false from advance and the runner will execute advance
>>> at some point in time instead of sleeping. This is also the correct thing
>>> to do if you hit a throttling error. With a splittable DoFn you can return
>>> a process continuation allowing you to suggest an amount of time to wait
>>> before being resumed.
>>>
>>> 2) It looks like null was returned as the checkpoint mark coder[2].
>>>
>>> 1:
>>> https://lists.apache.org/thread.html/r76bac40fd22ebf96f379efbaef36fc27c65bdb859f504e19da76ff01%40%3Cdev.beam.apache.org%3E
>>> 2:
>>> https://github.com/apache/beam/blob/fa3ca2b11e2ca031232245814389d29c805f79e7/runners/direct-java/src/main/java/org/apache/beam/runners/direct/UnboundedReadEvaluatorFactory.java#L223
>>>
>>> On Thu, Jul 30, 2020 at 3:41 PM Praveen K Viswanathan <
>>> harish.prav...@gmail.com> wrote:
>>>
 Hello Dev team,

 We are taking our first shot at writing a Beam IO connector for Oracle
 Streaming Service (OSS). The plan is to first implement it for enterprise
 use and, based on the feedback and stability, make it available open source.
 This is our first attempt at developing a Beam IO connector, and so far we
 have progressed with the help of the Beam documentation and other related IOs
 like KafkaIO and KinesisIO. Thanks to the community on that front.

 Now OSS *has a request rate limit (at most one read request per 200 ms)*, so
 when we read the data as shown below in our UnboundedReader's *advance()*
 method

 // Get Messages

 GetMessagesResponse getResponse =
 this.streamClient.getMessages(getRequest);

 We are able to read around five messages, but after that we get a *request
 throttling error*:

 Request was throttled because requests limit exhausted, next request
 can be made in 200 ms

 We tried an initial solution of introducing *Thread.sleep(200)* before the
 getMessages call to see how it behaves, and this time we are *able to read
 around 300+ messages*. With the expertise available in this forum, I would
 like to hear your input on two points.

1.

How to implement this feature in a proper way rather than just with
a one-line Thread.sleep(200)
2.

    After adding Thread.sleep(200) and reading 300+ messages, the pipeline
    encountered the error below. I do not see any implementation-specific
    detail in the stack trace. Can I get insight into what this error could
    be and how to handle it?

java.lang.NullPointerException
at org.apache.beam.sdk.util.CoderUtils.encodeToSafeStream 
 (CoderUtils.java:82)
at org.apache.beam.sdk.util.CoderUtils.encodeToByteArray 
 (CoderUtils.java:66)
at org.apache.beam.sdk.util.CoderUtils.encodeToByteArray 
 (CoderUtils.java:51)
at org.apache.beam.sdk.util.CoderUtils.clone (CoderUtils.java:141)
at 
 org.apache.beam.runners.direct.UnboundedReadEvaluatorFactory$UnboundedReadEvaluator.getReader
  (UnboundedReadEvaluatorFactory.java:224)
at 
 org.apache.beam.runners.direct.UnboundedReadEvaluatorFactory$UnboundedReadEvaluator.processElement
  (UnboundedReadEvaluatorFactory.java:132)
at 
 org.apache.beam.runners.direct.DirectTransformExecutor.processElements 
 (DirectTransformExecutor.java:160)
at org.apache.beam.runners.direct.DirectTransformExecutor.run 
 (DirectTransformExecutor.java:124)
at java.util.concurrent.Executors$RunnableAdapter.call 
 (Executors.java:511)
   

Re: Request Throttling in OSSIO

2020-08-04 Thread Luke Cwik
Take a look at the WatchGrowthFn[1] and also the in-progress Kafka PR[2].

1:
https://github.com/apache/beam/blob/6612b24ada9382706373db547b5606d6e0496b02/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/Watch.java#L787
2: https://github.com/apache/beam/pull/11749
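
Both references above illustrate the splittable DoFn resume/deferral pattern
(in Java). A rough, untested Python-SDK sketch of the same idea, assuming the
restriction tracker's defer_remainder() API; the provider, client call, and
error type below are hypothetical placeholders:

---
import apache_beam as beam
from apache_beam.utils import timestamp

class ReadFromOssFn(beam.DoFn):
  def process(
      self,
      partition,
      restriction_tracker=beam.DoFn.RestrictionParam(
          OffsetRestrictionProvider())):  # hypothetical restriction provider
    position = restriction_tracker.current_restriction().start
    while True:
      try:
        messages = fetch_messages(partition, position)  # hypothetical client call
      except ThrottlingError:                            # hypothetical error type
        # Hand the remaining work back to the runner and ask to be resumed
        # later, instead of calling Thread.sleep()/time.sleep() in place.
        restriction_tracker.defer_remainder(timestamp.Duration(seconds=1))
        return
      if not messages:
        restriction_tracker.defer_remainder(timestamp.Duration(seconds=1))
        return
      for message in messages:
        if not restriction_tracker.try_claim(position):
          return
        yield message
        position += 1
---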

On Tue, Aug 4, 2020 at 3:33 PM Praveen K Viswanathan <
harish.prav...@gmail.com> wrote:

> Thanks for the suggestions Luke. As you know, we are just starting and
> should be able to switch to SplittableDoFn, if that's the future of Beam IO
> Connectors. The SplittableDoFn page has the design details but it would be
> great if we can look into an IO connector built using SplittableDoFn
> for reference and to map the design details with actual implementation.
> Could you please suggest any such IO for reference.
>
> I will also parallely try your suggestion in advance() and checkpoint mark
> coder to close that issue.
>
> Thanks,
> Praveen
>
> On Mon, Aug 3, 2020 at 3:28 PM Luke Cwik  wrote:
>
>> Since you are working on a new connector I would very strongly
>> suggest writing it as a splittable DoFn instead of an UnboundedSource. See
>> this thread[1] about additional details and some caveats on the
>> recommendation.
>>
>> 1) You can return false from advance and the runner will execute advance
>> at some point in time instead of sleeping. This is also the correct thing
>> to do if you hit a throttling error. With a splittable DoFn you can return
>> a process continuation allowing you to suggest an amount of time to wait
>> before being resumed.
>>
>> 2) It looks like null was returned as the checkpoint mark coder[2].
>>
>> 1:
>> https://lists.apache.org/thread.html/r76bac40fd22ebf96f379efbaef36fc27c65bdb859f504e19da76ff01%40%3Cdev.beam.apache.org%3E
>> 2:
>> https://github.com/apache/beam/blob/fa3ca2b11e2ca031232245814389d29c805f79e7/runners/direct-java/src/main/java/org/apache/beam/runners/direct/UnboundedReadEvaluatorFactory.java#L223
>>
>> On Thu, Jul 30, 2020 at 3:41 PM Praveen K Viswanathan <
>> harish.prav...@gmail.com> wrote:
>>
>>> Hello Dev team,
>>>
>>> We are giving our first shot in writing Beam IO connector for Oracle
>>> Streaming Service (OSS). The plan is to first implement it for enterprise
>>> use and based on the feedback and stability make it available open source.
>>> This is our first attempt in developing a Beam IO connector and so far we
>>> have progressed with the help of Beam documentation and other related IOs
>>> like KafkaIO, KinesisIO. Thanks to the community on that front.
>>>
>>> Now OSS *has a read limit of 200ms* so when we read the data as shown
>>> below in our UnboundedReaders *advance()* method
>>>
>>> // Get Messages
>>>
>>> GetMessagesResponse getResponse =
>>> this.streamClient.getMessages(getRequest);
>>>
>>> We are able to read around five message but after that we are getting 
>>> *request
>>> throttling error*
>>>
>>> Request was throttled because requests limit exhausted, next request can
>>> be made in 200 ms
>>>
>>> We tried with an initial solution of introducing *Thread.sleep(200)*
>>> before the getMessages to see how it is behaving and this time we are *able
>>> to read around 300+ messages*. With the expertise available in this
>>> forum, I would like to hear inputs on two points.
>>>
>>>1.
>>>
>>>How to implement this feature in a proper way rather than just with
>>>a one-line Thread.sleep(200)
>>>2.
>>>
>>>After adding Thread.sleep(200) and reading 300+ messages the
>>>pipeline encountered below error. I do not see any implementation 
>>> specific
>>>detail in the stack trace. Can I get an insight what this error could be
>>>and how to handle.
>>>
>>>java.lang.NullPointerException
>>>at org.apache.beam.sdk.util.CoderUtils.encodeToSafeStream 
>>> (CoderUtils.java:82)
>>>at org.apache.beam.sdk.util.CoderUtils.encodeToByteArray 
>>> (CoderUtils.java:66)
>>>at org.apache.beam.sdk.util.CoderUtils.encodeToByteArray 
>>> (CoderUtils.java:51)
>>>at org.apache.beam.sdk.util.CoderUtils.clone (CoderUtils.java:141)
>>>at 
>>> org.apache.beam.runners.direct.UnboundedReadEvaluatorFactory$UnboundedReadEvaluator.getReader
>>>  (UnboundedReadEvaluatorFactory.java:224)
>>>at 
>>> org.apache.beam.runners.direct.UnboundedReadEvaluatorFactory$UnboundedReadEvaluator.processElement
>>>  (UnboundedReadEvaluatorFactory.java:132)
>>>at 
>>> org.apache.beam.runners.direct.DirectTransformExecutor.processElements 
>>> (DirectTransformExecutor.java:160)
>>>at org.apache.beam.runners.direct.DirectTransformExecutor.run 
>>> (DirectTransformExecutor.java:124)
>>>at java.util.concurrent.Executors$RunnableAdapter.call 
>>> (Executors.java:511)
>>>at java.util.concurrent.FutureTask.run (FutureTask.java:266)
>>>at java.util.concurrent.ThreadPoolExecutor.runWorker 
>>> (ThreadPoolExecutor.java:1149)
>>>at java.util.concurrent.ThreadPoolExecutor$Worker.run 
>>> 

Re: Request Throttling in OSSIO

2020-08-04 Thread Praveen K Viswanathan
Thanks for the suggestions, Luke. As you know, we are just starting and
should be able to switch to SplittableDoFn if that's the future of Beam IO
connectors. The SplittableDoFn page has the design details, but it would be
great if we could look at an IO connector built using SplittableDoFn for
reference, to map the design details to an actual implementation. Could you
please suggest such an IO for reference?

In parallel, I will also try your suggestions on advance() and the
checkpoint mark coder to close that issue.

Thanks,
Praveen

On Mon, Aug 3, 2020 at 3:28 PM Luke Cwik  wrote:

> Since you are working on a new connector I would very strongly
> suggest writing it as a splittable DoFn instead of an UnboundedSource. See
> this thread[1] about additional details and some caveats on the
> recommendation.
>
> 1) You can return false from advance and the runner will execute advance
> at some point in time instead of sleeping. This is also the correct thing
> to do if you hit a throttling error. With a splittable DoFn you can return
> a process continuation allowing you to suggest an amount of time to wait
> before being resumed.
>
> 2) It looks like null was returned as the checkpoint mark coder[2].
>
> 1:
> https://lists.apache.org/thread.html/r76bac40fd22ebf96f379efbaef36fc27c65bdb859f504e19da76ff01%40%3Cdev.beam.apache.org%3E
> 2:
> https://github.com/apache/beam/blob/fa3ca2b11e2ca031232245814389d29c805f79e7/runners/direct-java/src/main/java/org/apache/beam/runners/direct/UnboundedReadEvaluatorFactory.java#L223
>
> On Thu, Jul 30, 2020 at 3:41 PM Praveen K Viswanathan <
> harish.prav...@gmail.com> wrote:
>
>> Hello Dev team,
>>
>> We are giving our first shot in writing Beam IO connector for Oracle
>> Streaming Service (OSS). The plan is to first implement it for enterprise
>> use and based on the feedback and stability make it available open source.
>> This is our first attempt in developing a Beam IO connector and so far we
>> have progressed with the help of Beam documentation and other related IOs
>> like KafkaIO, KinesisIO. Thanks to the community on that front.
>>
>> Now OSS *has a read limit of 200ms* so when we read the data as shown
>> below in our UnboundedReaders *advance()* method
>>
>> // Get Messages
>>
>> GetMessagesResponse getResponse =
>> this.streamClient.getMessages(getRequest);
>>
>> We are able to read around five message but after that we are getting 
>> *request
>> throttling error*
>>
>> Request was throttled because requests limit exhausted, next request can
>> be made in 200 ms
>>
>> We tried with an initial solution of introducing *Thread.sleep(200)*
>> before the getMessages to see how it is behaving and this time we are *able
>> to read around 300+ messages*. With the expertise available in this
>> forum, I would like to hear inputs on two points.
>>
>>1.
>>
>>How to implement this feature in a proper way rather than just with a
>>one-line Thread.sleep(200)
>>2.
>>
>>After adding Thread.sleep(200) and reading 300+ messages the pipeline
>>encountered below error. I do not see any implementation specific detail 
>> in
>>the stack trace. Can I get an insight what this error could be and how to
>>handle.
>>
>>java.lang.NullPointerException
>>at org.apache.beam.sdk.util.CoderUtils.encodeToSafeStream 
>> (CoderUtils.java:82)
>>at org.apache.beam.sdk.util.CoderUtils.encodeToByteArray 
>> (CoderUtils.java:66)
>>at org.apache.beam.sdk.util.CoderUtils.encodeToByteArray 
>> (CoderUtils.java:51)
>>at org.apache.beam.sdk.util.CoderUtils.clone (CoderUtils.java:141)
>>at 
>> org.apache.beam.runners.direct.UnboundedReadEvaluatorFactory$UnboundedReadEvaluator.getReader
>>  (UnboundedReadEvaluatorFactory.java:224)
>>at 
>> org.apache.beam.runners.direct.UnboundedReadEvaluatorFactory$UnboundedReadEvaluator.processElement
>>  (UnboundedReadEvaluatorFactory.java:132)
>>at 
>> org.apache.beam.runners.direct.DirectTransformExecutor.processElements 
>> (DirectTransformExecutor.java:160)
>>at org.apache.beam.runners.direct.DirectTransformExecutor.run 
>> (DirectTransformExecutor.java:124)
>>at java.util.concurrent.Executors$RunnableAdapter.call 
>> (Executors.java:511)
>>at java.util.concurrent.FutureTask.run (FutureTask.java:266)
>>at java.util.concurrent.ThreadPoolExecutor.runWorker 
>> (ThreadPoolExecutor.java:1149)
>>at java.util.concurrent.ThreadPoolExecutor$Worker.run 
>> (ThreadPoolExecutor.java:624)
>>at java.lang.Thread.run (Thread.java:748)
>>
>>
>>
>> --
>> Thanks,
>> Praveen K Viswanathan
>>
>

-- 
Thanks,
Praveen K Viswanathan


Re: Stateful Pardo Question

2020-08-04 Thread jmac...@godaddy.com
So, after some additional digging, it appears that Beam does not consistently
check for timer expiry before calling process(). The result is that the
watermark may have moved beyond your timer's expiry, and if you're counting on
the timer callback happening at the time you set it for, that simply may NOT
have happened yet when you are in DoFn.process(). You can “fix” the behavior
by checking the watermark manually in process() and doing what you would
normally do on timer expiry before proceeding. See my latest updated code
reproducing the issue and showing the fix at
https://github.com/randomsamples/pardo_repro.

I would argue that users of this API will naturally expect timer callback
semantics to guarantee that, when they are in process(), if the current
watermark is past a timer's expiry then that timer's callback will already
have been called. Is there any reason why this isn’t happening? Am I
misunderstanding something?

From: "jmac...@godaddy.com" 
Reply-To: "dev@beam.apache.org" 
Date: Monday, August 3, 2020 at 10:51 AM
To: "dev@beam.apache.org" 
Subject: Re: Stateful Pardo Question



Yeah, unless I am misunderstanding something. The output from my repro code
shows the event timestamp and the context timestamp every time we process an
event.

Receiving event at: 2000-01-01T00:00:00.000Z
Resetting timer to : 2000-01-01T00:15:00.000Z
Receiving event at: 2000-01-01T00:05:00.000Z
Resetting timer to : 2000-01-01T00:20:00.000Z <-- Shouldn’t the timer have 
fired before we processed the next event?
Receiving event at: 2000-01-01T00:40:00.000Z
Why didn't the timer fire?
Resetting timer to : 2000-01-01T00:55:00.000Z
Receiving event at: 2000-01-01T00:45:00.000Z
Resetting timer to : 2000-01-01T01:00:00.000Z
Receiving event at: 2000-01-01T00:50:00.000Z
Resetting timer to : 2000-01-01T01:05:00.000Z
Timer firing at: 2000-01-01T01:05:00.000Z

From: Reuven Lax 
Reply-To: "dev@beam.apache.org" 
Date: Monday, August 3, 2020 at 10:02 AM
To: dev 
Subject: Re: Stateful Pardo Question



Are you sure that there is a 15 minute gap in your data?

On Mon, Aug 3, 2020 at 6:20 AM jmac...@godaddy.com wrote:
I am confused about the behavior of timers on a simple stateful ParDo. I have
put together a little repro here: https://github.com/randomsamples/pardo_repro

I basically want to build something like a session window, accumulating events
until the stream is quiescent for a given key and gap time, then output the
results. But it appears that the timer is not firing when the watermark passes
its expiration time, so the event stream is not being split as I would have
expected. I would love some help getting this to work; the behavior is for a
project I’m working on.
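
For reference, a minimal, untested sketch of the gap-timer pattern described
above, written against the Python SDK's user state and timer API (the repro
itself may use a different SDK; the keyed-element shape, coder, and 15-minute
gap are assumptions):

---
import apache_beam as beam
from apache_beam.coders import VarIntCoder
from apache_beam.transforms.timeutil import TimeDomain
from apache_beam.transforms.userstate import BagStateSpec, TimerSpec, on_timer
from apache_beam.utils.timestamp import Duration

class GapSessionFn(beam.DoFn):
  # Buffers values per key and emits them after a 15-minute event-time gap.
  BUFFER = BagStateSpec('buffer', VarIntCoder())
  GAP_TIMER = TimerSpec('gap', TimeDomain.WATERMARK)

  def process(self,
              element,  # assumed to be a (key, int_value) pair
              ts=beam.DoFn.TimestampParam,
              buffer=beam.DoFn.StateParam(BUFFER),
              gap_timer=beam.DoFn.TimerParam(GAP_TIMER)):
    _, value = element
    buffer.add(value)
    # Push the timer out to 15 minutes past this element's timestamp; the
    # callback is expected to fire once the watermark passes that point.
    gap_timer.set(ts + Duration(seconds=15 * 60))

  @on_timer(GAP_TIMER)
  def on_gap(self, buffer=beam.DoFn.StateParam(BUFFER)):
    yield list(buffer.read())
    buffer.clear()
---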


Re: Unknown accumulator coder error when running cross-language SpannerIO Write

2020-08-04 Thread Boyuan Zhang
Hi Piotr,

Are you developing against Beam master head? Can you share your code? The
x-lang transform can be tested with the Flink runner, where SDF is also
supported; see, for example,
https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/portability/flink_runner_test.py#L205-L261

On Tue, Aug 4, 2020 at 9:42 AM Piotr Szuberski 
wrote:

> Is there a simple way to register the splittable dofn for cross-language
> usage? It's a bit a black box to me right now.
>
> The most meaningful logs for Flink are the ones I pasted and the following:
>
> apache_beam.utils.subprocess_server: INFO: b'[grpc-default-executor-0]
> WARN org.apache.beam.runners.jobsubmission.InMemoryJobService - Encountered
> Unexpected Exception during validation'
> apache_beam.utils.subprocess_server: INFO: b'java.lang.RuntimeException:
> Failed to validate transform ref_AppliedPTransform_Write to Spanner/Write
> mutations to Cloud Spanner/Schema
> View/Combine.GloballyAsSingletonView/Combine.globally(Singleton)/Combine.perKey(Singleton)_31'
>
> and a shortened oneline message:
> [...] DEBUG: Stages: ['ref_AppliedPTransform_Generate input/Impulse_3\n
> Generate input/Impulse:beam:transform:impulse:v1\n  must follow: \n
> downstream_side_inputs: ', 'ref_AppliedPTransform_Generate
> input/FlatMap()_4\n  Generate input/FlatMap( at core.py:2826>):beam:transform:pardo:v1\n  must follow: \n
> downstream_side_inputs: ', 'ref_AppliedPTransform_Generate
> input/Map(decode)_6\n [...]
>
> On 2020/08/03 23:40:42, Brian Hulette  wrote:
> > The DirectRunner error looks like it's because the FnApiRunner doesn't
> > support SDF.
> >
> > What is the coder id for the Flink error? It looks like the full stack
> > trace should contain it.
> >
> > On Mon, Aug 3, 2020 at 10:09 AM Piotr Szuberski <
> piotr.szuber...@polidea.com>
> > wrote:
> >
> > > I'm Writing SpannerIO.Write cross-language transform and when I try to
> run
> > > it from python I receive errors:
> > >
> > > On Flink:
> > > apache_beam.utils.subprocess_server: INFO: b'Caused by:
> > > java.lang.IllegalArgumentException: Transform external_1HolderCoder
> uses
> > > unknown accumulator coder id %s'
> > > apache_beam.utils.subprocess_server: INFO: b'\tat
> > >
> org.apache.beam.vendor.guava.v26_0_jre.com.google.common.base.Preconditions.checkArgument(Preconditions.java:216)'
> > > apache_beam.utils.subprocess_server: INFO: b'\tat
> > >
> org.apache.beam.runners.core.construction.graph.PipelineValidator.validateCombine(PipelineValidator.java:273)'
> > >
> > > On DirectRunner:
> > >   File
> > >
> "/Users/piotr/beam/sdks/python/apache_beam/runners/portability/fn_api_runner/fn_runner.py",
> > > line 181, in run_via_runner_api
> > > self._validate_requirements(pipeline_proto)
> > >   File
> > >
> "/Users/piotr/beam/sdks/python/apache_beam/runners/portability/fn_api_runner/fn_runner.py",
> > > line 264, in _validate_requirements
> > > raise ValueError(
> > > ValueError: Missing requirement declaration:
> > > {'beam:requirement:pardo:splittable_dofn:v1'}
> > >
> > > I suppose that SpannerIO.Write uses a transform that cannot be
> translated
> > > in cross-language usage? I'm not sure whether there is something I can
> do
> > > about it.
> > >
> > >
> >
>


Re: Needed help identifying a error in running a SDF

2020-08-04 Thread Boyuan Zhang
Hi Mayank,

Which runner do you want to run your pipeline on? You should add the
'beam_fn_api' experiment when you launch the pipeline: --experiments=beam_fn_api.
In your code:

class TestDoFn(beam.DoFn):
  def process(
      self,
      element,
      restriction_tracker=beam.DoFn.RestrictionParam(
          TestProvider())):
    import pdb; pdb.set_trace()
    cur = restriction_tracker.current_restriction().start
    while restriction_tracker.try_claim(cur):
      yield element  # changed from `return element`: yield each element...
      cur += 1       # ...and advance the claimed position



On Tue, Aug 4, 2020 at 11:07 AM Mayank Ketkar  wrote:

> Hello Team,
>
> I was hoping to get anyone's help with an error I'm encountering while
> running an SDF.
>
> I posted the question on Stack Overflow (includes code):
>
> https://stackoverflow.com/questions/63252327/error-in-running-apache-beam-python-splittabledofn
>
> However, I am receiving an error:
> RuntimeError: Transform node
> AppliedPTransform(ParDo(TestDoFn)/ProcessKeyedElements/GroupByKey/GroupByKey,
> _GroupByKeyOnly) was not replaced as expected.
>
> when trying to apply an SDF to a PubsubIO source
>
> Thanks in advance!! Really!!
>
> Mayank
>


Re: Making reporting bugs/feature request easier

2020-08-04 Thread Alex Amato
May I suggest we print a URL (and a message) for filing bugs on the command
line when you run a Beam pipeline (and in any other user interface we use for
Beam; some of the runner-specific UIs may want to link to this as well).

On Tue, Aug 4, 2020 at 9:16 AM Alexey Romanenko 
wrote:

> Great topic, thanks Griselda for raising this question.
>
> I’d prefer to keep Jira as the only one main issue tracker and use other
> suggested ways, like emails, Git issues, web form or dedicated Slack
> channel, as different interfaces designed to simplify a way how users can
> submit an issue. But in any case it will require an attention of Beam
> contributors to properly create Jira issue and send back a link that can be
> followed for updates.
>
> On 31 Jul 2020, at 20:22, Robert Burke  wrote:
>
> I do like the idea of the "wrong" ways to raise issues pointing to the
> correct ways.
>
> On Fri, Jul 31, 2020, 10:57 AM Brian Hulette  wrote:
>
>> I think I'd prefer continuing to use jira, but GitHub issues are
>> certainly much more discoverable for our users. The Arrow project uses
>> GitHub issues as a way to funnel users to the mailing lists and JIRA. When
>> users go to file an issue they're first given two options [1]:
>>
>> - Ask a question -> Please ask questions at u...@arrow.apache.org
>> - Report an issue -> Please report bugs and request features on JIRA.
>>
>> With accompanying links for each option. The user@ link actually
>> takes you to the new issue page, with a template strongly encouraging you
>> to file a jira or subscribe to the mailing lists.
>> Despite all these barriers people do still file github issues, and they
>> need to be triaged (usually they just receive a comment asking the reporter
>> to file a jira or linking to an existing jira), but the volume isn't that
>> high.
>>
>> Maybe we could consider something like that?
>>
>> Brian
>>
>> [1] https://github.com/apache/arrow/issues/new/choose
>>
>> On Thu, Jul 30, 2020 at 2:45 PM Robert Bradshaw 
>> wrote:
>>
>>> On Wed, Jul 29, 2020 at 7:12 PM Kenneth Knowles  wrote:
>>>

 On Wed, Jul 29, 2020 at 11:08 AM Robert Bradshaw 
 wrote:

> +1 to a simple link that fills in most of the fields in JIRA, though
> this doesn't solve the issue of having to sign up just to submit a bug
> report. Just using the users list isn't a bad idea either--we could easily
> create a script that ensures all threads that have a message like "we
> should file a JIRA for this" are followed up with a message like "JIRA
> filed at ...". (That doesn't mean it won't language on the tracker.)
>
> I think it's worth seriously considering just using Github's issue
> tracker, since that's where our users are. Is there anything in we 
> actually
> use in JIRA that'd be missing?
>

 Pretty big question. Just noting to start that Apache projects
 certainly can and do use GitHub issues. Here is a quick inventory of things
 that are used in a meaningful way:

  - Priorities (with GitHub Issues I think you roll your own with labels)
  - Issue types (with GitHub Issues I think you roll your own with
 labels)
  - Distinct "Triage Needed" state (also labels; anything lacking the
 "triaged" label)
  - Distinguishing "Open" and "In Progress" (also labels? can use
 Projects/Milestones - I forget exactly which - to keep a kanban-ish status)

>>>
>>> Yes, basically everything can be done with labels. Whether having one
>>> hammer is good, well, there are pros and cons.
>>>
>>>
  - Our new automation: "stale-assigned" and subsequent unassign;
 "stale-P2" and subsequent downgrade

>>>
>>> Github has a very nice ReST API, making things like this very easy.
>>>
>>>
  - Fix Version for associating fixes with releases

>>>
>>> This information is typically intrinsic with when the commits were
>>> applied and the bug closed. It's pretty typical to use milestones for a
>>> release, and then tag "blockers" to it. (IMHO, this is better than having
>>> the default always be the next release, and bumping all open bugs every
>>> release that comes out.) Milestones can be used to track other work as
>>> well.
>>>
>>>
  - Affect Version, while not used much, is still helpful to have
  - Components, since our repo is really a mini mono repo. Again, labels.
  - Kanban boards (milestones/projects maybe kinda)
  - Reports (not really same level, but maybe OK?)

 Fairly recently I worked on a project that tried to use GitHub Issues
 and Projects and Milestones and whatnot and it was OK but not great. Jira's
 complexity is largely essential / not really complex but just visually
 busy. The two are not really even comparable offerings. There may be third
 party integrations that add some of what you'd want.

>>>
>>> Yeah, I agree Github issues is not as full featured. One thing I miss
>>> from other products is dependencies 

Needed help identifying a error in running a SDF

2020-08-04 Thread Mayank Ketkar
Hello Team,

I was hoping to get anyone's help with an error I'm encountering while
running an SDF.

I posted the question on Stack Overflow (includes code):
https://stackoverflow.com/questions/63252327/error-in-running-apache-beam-python-splittabledofn

However, I am receiving an error:
RuntimeError: Transform node
AppliedPTransform(ParDo(TestDoFn)/ProcessKeyedElements/GroupByKey/GroupByKey,
_GroupByKeyOnly) was not replaced as expected.

when trying to apply an SDF to a PubsubIO source

Thanks in advance!! Really!!

Mayank


Re: Chronically flaky tests

2020-08-04 Thread Robert Bradshaw
I'm in favor of a quarantine job whose tests are called out
prominently as "possibly broken" in the release notes. As a follow-up,
+1 to exploring better tooling to track, at a fine-grained level,
exactly how flaky these tests are (and hopefully detect if/when they go
from flaky to just plain broken).

On Tue, Aug 4, 2020 at 7:25 AM Etienne Chauchot  wrote:
>
> Hi all,
>
> +1 on ping the assigned person.
>
> For the flakes I know of (ESIO and CassandraIO), they are due to the load of 
> the CI server. These IOs are tested using real embedded backends because 
> those backends are complex and we need relevant tests.
>
> Countermeasures have been taken (retries inside the tests that are sensitive
> to load, ranges of acceptable numbers, calls to internal backend mechanisms
> to force a refresh in case load prevented the backend from doing so, ...).

Yes, certain tests with external dependencies should do their own
internal retries. If that is not sufficient, they should probably be
quarantined.

> I recently got pinged by Ahmet (thanks to him!) about a flakiness that I did
> not see. This seems to me the correct way to go. Systematically retrying
> tests with a CI mechanism or disabling tests seems to me a risky workaround
> that just lets us get the problem off our minds.
>
> Etienne
>
> On 20/07/2020 20:58, Brian Hulette wrote:
>
> > I think we are missing a way for checking that we are making progress on P1 
> > issues. For example, P0 issues block releases and this obviously results in 
> > fixing/triaging/addressing P0 issues at least every 6 weeks. We do not have 
> > a similar process for flaky tests. I do not know what would be a good 
> > policy. One suggestion is to ping (email/slack) assignees of issues. I 
> > recently missed a flaky issue that was assigned to me. A ping like that 
> > would have reminded me. And if an assignee cannot help/does not have the 
> > time, we can try to find a new assignee.
>
> Yeah I think this is something we should address. With the new jira 
> automation at least assignees should get an email notification after 30 days 
> because of a jira comment like [1], but that's too long to let a test 
> continue to flake. Could Beam Jira Bot ping every N days for P1s that aren't 
> making progress?
>
> That wouldn't help us with P1s that have no assignee, or are assigned to 
> overloaded people. It seems we'd need some kind of dashboard or report to 
> capture those.
>
> [1] 
> https://issues.apache.org/jira/browse/BEAM-8101?focusedCommentId=17121918=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17121918
>
> On Fri, Jul 17, 2020 at 1:09 PM Ahmet Altay  wrote:
>>
>> Another idea, could we change our "Retest X" phrases with "Retest X 
>> (Reason)" phrases? With this change a PR author will have to look at failed 
>> test logs. They could catch new flakiness introduced by their PR, file a 
>> JIRA for a flakiness that was not noted before, or ping an existing JIRA 
>> issue/raise its severity. On the downside this will require PR authors to do 
>> more.
>>
>> On Fri, Jul 17, 2020 at 6:46 AM Tyson Hamilton  wrote:
>>>
>>> Adding retries can be beneficial in two ways, unblocking a PR, and 
>>> collecting metrics about the flakes.
>>
>>
>> Makes sense. I think we will still need to have a plan to remove retries 
>> similar to re-enabling disabled tests.
>>
>>>
>>>
>>> If we also had a flaky test leaderboard that showed which tests are the 
>>> most flaky, then we could take action on them. Encouraging someone from the 
>>> community to fix the flaky test is another issue.
>>>
>>> The test status matrix of tests that is on the GitHub landing page could 
>>> show flake level to communicate to users which modules are losing a 
>>> trustable test signal. Maybe this shows up as a flake % or a code coverage 
>>> % that decreases due to disabled flaky tests.
>>
>>
>> +1 to a dashboard that will show a "leaderboard" of flaky tests.
>>
>>>
>>>
>>> I didn't look for plugins, just dreaming up some options.
>>>
>>>
>>>
>>>
>>> On Thu, Jul 16, 2020, 5:58 PM Luke Cwik  wrote:

 What do other Apache projects do to address this issue?

 On Thu, Jul 16, 2020 at 5:51 PM Ahmet Altay  wrote:
>
> I agree with the comments in this thread.
> - If we are not re-enabling tests back again or we do not have a plan to 
> re-enable them again, disabling tests only provides us temporary relief 
> until eventually users find issues instead of disabled tests.
> - I feel similarly about retries. It is reasonable to add retries for 
> reasons we understand. Adding retries to avoid flakes is similar to 
> disabling tests. They might hide real issues.
>
> I think we are missing a way for checking that we are making progress on 
> P1 issues. For example, P0 issues block releases and this obviously 
> results in fixing/triaging/addressing P0 issues at least every 6 weeks. 
> We do not have a similar process for flaky tests. I do not 

Re: Git commit history: "fixup" commits

2020-08-04 Thread Udi Meiri
https://github.com/marketplace/actions/gs-commit-message-checker

On Tue, Aug 4, 2020 at 10:25 AM Robert Bradshaw  wrote:

> +1, thanks for the reminder.
>
> This should be really easy to automate, using
> https://developer.github.com/webhooks/event-payloads/#pull_request to
> give a warning when the change history is not sufficiently "clean."
> I'm not sure where to host this though (or if it could be integrated
> into jenkins--basically I'd just want to run a Python script with the
> PR number (or better, just point to the local git repo and have the
> master's commit handy) as another precommit).
>
> On Tue, Aug 4, 2020 at 10:10 AM Rui Wang  wrote:
> >
> > +1 thanks Alexey.
> >
> > My apologies that I merged such a case recently (but not intentionally).
> I tried to use the "squash and merge" button with a consolidated commit
> message. After clicking the button, github showed "failed to merge" and
> gave a retry button, and after clicking that retry button, github magically
> switched to "create merge commit" approach thus merged some fixup commits
> to the main branch.
> >
> > This is a rare case (I only encountered once). But I will pay more
> attention next time. I could ask PR authors to squash their commits before
> merging when it is possible.
> >
> >
> > -Rui
> >
> > On Tue, Aug 4, 2020 at 9:40 AM Alexey Romanenko <
> aromanenko@gmail.com> wrote:
> >>
> >> Yes, good point, thanks Valentyn.
> >>
> >> On 4 Aug 2020, at 18:29, Valentyn Tymofieiev 
> wrote:
> >>
> >> +1, thanks, Alexey.
> >>
> >> Also a reminder from the contributor guide: do not use the default
> GitHub commit message for merge commits, which looks like:
> >>
> >> Merge pull request #1234 from some_user/transient_branch_name
> >>
> >> Instead, add the commit message into the subject line, for example:
> "Merge pull request #1234: [BEAM-7873] Fix the foo bizzle bazzle".
> >>
> >> On Tue, Aug 4, 2020 at 7:13 AM Alexey Romanenko <
> aromanenko@gmail.com> wrote:
> >>>
> >>> Hi all,
> >>>
> >>> I’d like to attract your attention regarding our Git commit history
> and related issue. A while ago I noticed that it started getting not very
> clear and quite verbose comparing to how it was before. We have quite
> significant amount of recent commits like “fix”, “address comments”,
> “typo”, “spotless”, etc. Most of them also doesn’t contain Jira Tag as a
> prefix and actually is just supplementary commits to “main” and initial
> commit of PR, added after several PRs review rounds.
> >>>
> >>> AFAIR, we already had several discussion in the past about this topic
> and we agreed that we should avoid such commits in a final merge and have
> only one (in most cases) or several (if necessary) logical commits that
> should be atomic and properly explain what they do.
> >>>
> >>> Why these “tiny" commits are bad practice? Just several main reasons:
> >>> - They pollute our git repository history and don’t give any
> additional and useful further information;
> >>> - They are not atomic and we can’t easily revert (rollback) this
> supplementary commit since the state of the build before was likely broken
> or had incorrect behaviour. So, in this case, the whole set of PRs commits
> should be reverted which is not convenient and error-prone. It’s also
> expected that all checks were green before merging a PR (take a part flaky
> tests).
> >>> - They are not informative in terms of commit message. So it makes
> more hard to identify Git annotated code and how the lines of code are
> related together.
> >>>
> >>> Following this, I just want to briefly remind our Committers rules
> regarding PR merging [1].
> >>> Every commit:
> >>> - should do one thing and reflect it in commit message;
> >>> - should contain Jira Tag;
> >>> - all “fixup” and “address comments” type of commits should be
> squashed by author or committer before merging.
> >>>
> >>> Please, pay attention on what is finally committed and merged into our
> repository and it should help to keep our commit history clear, which will
> be transferred to saving a time of other developers in the end.
> >>>
> >>> [1]
> https://beam.apache.org/contribute/committer-guide/#finishing-touches
> >>>
> >>> Regards,
> >>> Alexey
> >>
> >>
>




Re: Git commit history: "fixup" commits

2020-08-04 Thread Robert Bradshaw
+1, thanks for the reminder.

This should be really easy to automate, using
https://developer.github.com/webhooks/event-payloads/#pull_request to
give a warning when the change history is not sufficiently "clean."
I'm not sure where to host this though (or if it could be integrated
into jenkins--basically I'd just want to run a Python script with the
PR number (or better, just point to the local git repo and have the
master's commit handy) as another precommit).
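
As a rough illustration of the kind of precommit script described above
(untested and hypothetical, not an existing Beam check; it only uses GitHub's
public endpoint for listing a pull request's commits):

---
import re
import sys

import requests

FIXUP = re.compile(r'^(fixup!|fix\b|address comments|typo|spotless)', re.I)
JIRA_TAG = re.compile(r'^\[BEAM-\d+\]')

def check_pr(pr_number):
    # First page only; pagination is omitted for brevity.
    url = 'https://api.github.com/repos/apache/beam/pulls/%d/commits' % pr_number
    problems = []
    for c in requests.get(url).json():
        subject = c['commit']['message'].splitlines()[0]
        if FIXUP.match(subject):
            problems.append('fixup-style commit: %r' % subject)
        elif not JIRA_TAG.match(subject):
            problems.append('missing JIRA tag: %r' % subject)
    return problems

if __name__ == '__main__':
    issues = check_pr(int(sys.argv[1]))
    for issue in issues:
        print('WARNING: %s' % issue)
    sys.exit(1 if issues else 0)
---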

On Tue, Aug 4, 2020 at 10:10 AM Rui Wang  wrote:
>
> +1 thanks Alexey.
>
> My apologies that I merged such a case recently (but not intentionally). I 
> tried to use the "squash and merge" button with a consolidated commit 
> message. After clicking the button, github showed "failed to merge" and gave 
> a retry button, and after clicking that retry button, github magically 
> switched to "create merge commit" approach thus merged some fixup commits to 
> the main branch.
>
> This is a rare case (I only encountered once). But I will pay more attention 
> next time. I could ask PR authors to squash their commits before merging when 
> it is possible.
>
>
> -Rui
>
> On Tue, Aug 4, 2020 at 9:40 AM Alexey Romanenko  
> wrote:
>>
>> Yes, good point, thanks Valentyn.
>>
>> On 4 Aug 2020, at 18:29, Valentyn Tymofieiev  wrote:
>>
>> +1, thanks, Alexey.
>>
>> Also a reminder from the contributor guide: do not use the default GitHub 
>> commit message for merge commits, which looks like:
>>
>> Merge pull request #1234 from some_user/transient_branch_name
>>
>> Instead, add the commit message into the subject line, for example: "Merge 
>> pull request #1234: [BEAM-7873] Fix the foo bizzle bazzle".
>>
>> On Tue, Aug 4, 2020 at 7:13 AM Alexey Romanenko  
>> wrote:
>>>
>>> Hi all,
>>>
>>> I’d like to attract your attention regarding our Git commit history and 
>>> related issue. A while ago I noticed that it started getting not very clear 
>>> and quite verbose comparing to how it was before. We have quite significant 
>>> amount of recent commits like “fix”, “address comments”, “typo”, 
>>> “spotless”, etc. Most of them also doesn’t contain Jira Tag as a prefix and 
>>> actually is just supplementary commits to “main” and initial commit of PR, 
>>> added after several PRs review rounds.
>>>
>>> AFAIR, we already had several discussion in the past about this topic and 
>>> we agreed that we should avoid such commits in a final merge and have only 
>>> one (in most cases) or several (if necessary) logical commits that should 
>>> be atomic and properly explain what they do.
>>>
>>> Why these “tiny" commits are bad practice? Just several main reasons:
>>> - They pollute our git repository history and don’t give any additional and 
>>> useful further information;
>>> - They are not atomic and we can’t easily revert (rollback) this 
>>> supplementary commit since the state of the build before was likely broken 
>>> or had incorrect behaviour. So, in this case, the whole set of PRs commits 
>>> should be reverted which is not convenient and error-prone. It’s also 
>>> expected that all checks were green before merging a PR (take a part flaky 
>>> tests).
>>> - They are not informative in terms of commit message. So it makes more 
>>> hard to identify Git annotated code and how the lines of code are related 
>>> together.
>>>
>>> Following this, I just want to briefly remind our Committers rules 
>>> regarding PR merging [1].
>>> Every commit:
>>> - should do one thing and reflect it in commit message;
>>> - should contain Jira Tag;
>>> - all “fixup” and “address comments” type of commits should be squashed by 
>>> author or committer before merging.
>>>
>>> Please, pay attention on what is finally committed and merged into our 
>>> repository and it should help to keep our commit history clear, which will 
>>> be transferred to saving a time of other developers in the end.
>>>
>>> [1] https://beam.apache.org/contribute/committer-guide/#finishing-touches
>>>
>>> Regards,
>>> Alexey
>>
>>


Re: Git commit history: "fixup" commits

2020-08-04 Thread Rui Wang
+1 thanks Alexey.

My apologies that I merged such a case recently (but not intentionally). I
tried to use the "squash and merge" button with a consolidated commit
message. After clicking the button, GitHub showed "failed to merge" and
gave a retry button; after clicking that retry button, GitHub magically
switched to the "create merge commit" approach and thus merged some fixup
commits into the main branch.

This is a rare case (I only encountered once). But I will pay more
attention next time. I could ask PR authors to squash their commits before
merging when it is possible.


-Rui

On Tue, Aug 4, 2020 at 9:40 AM Alexey Romanenko 
wrote:

> Yes, good point, thanks Valentyn.
>
> On 4 Aug 2020, at 18:29, Valentyn Tymofieiev  wrote:
>
> +1, thanks, Alexey.
>
> Also a reminder from the contributor guide: do not use the default GitHub
> commit message for merge commits, which looks like:
>
> Merge pull request #1234 from some_user/transient_branch_name
>
> Instead, add the commit message into the subject line, for example: "Merge
> pull request #1234: [BEAM-7873] Fix the foo bizzle bazzle".
>
> On Tue, Aug 4, 2020 at 7:13 AM Alexey Romanenko 
> wrote:
>
>> Hi all,
>>
>> I’d like to attract your attention regarding our Git commit history and
>> related issue. A while ago I noticed that it started getting not very clear
>> and quite verbose comparing to how it was before. We have quite significant
>> amount of recent commits like “fix”, “address comments”, “typo”,
>> “spotless”, etc. Most of them also doesn’t contain Jira Tag as a prefix and
>> actually is just supplementary commits to “main” and initial commit of PR,
>> added after several PRs review rounds.
>>
>> AFAIR, we already had several discussion in the past about this topic and
>> we agreed that we should avoid such commits in a final merge and have only
>> one (in most cases) or several (if necessary) logical commits that should
>> be atomic and properly explain what they do.
>>
>> Why these “tiny" commits are bad practice? Just several main reasons:
>> - They pollute our git repository history and don’t give any additional
>> and useful further information;
>> - They are not atomic and we can’t easily revert (rollback) this
>> supplementary commit since the state of the build before was likely broken
>> or had incorrect behaviour. So, in this case, the whole set of PRs commits
>> should be reverted which is not convenient and error-prone. It’s also
>> expected that all checks were green before merging a PR (take a part flaky
>> tests).
>> - They are not informative in terms of commit message. So it makes more
>> hard to identify Git annotated code and how the lines of code are related
>> together.
>>
>> Following this, I just want to briefly remind our Committers rules
>> regarding PR merging [1].
>> Every commit:
>> - should do one thing and reflect it in commit message;
>> - should contain Jira Tag;
>> - all “fixup” and “address comments” type of commits should be squashed
>> by author or committer before merging.
>>
>> Please, pay attention on what is finally committed and merged into our
>> repository and it should help to keep our commit history clear, which will
>> be transferred to saving a time of other developers in the end.
>>
>> [1] https://beam.apache.org/contribute/committer-guide/#finishing-touches
>>
>> Regards,
>> Alexey
>>
>
>


Re: Unknown accumulator coder error when running cross-language SpannerIO Write

2020-08-04 Thread Piotr Szuberski
Is there a simple way to register the splittable DoFn for cross-language usage?
It's a bit of a black box to me right now.

The most meaningful logs for Flink are the ones I pasted and the following:

apache_beam.utils.subprocess_server: INFO: b'[grpc-default-executor-0] WARN 
org.apache.beam.runners.jobsubmission.InMemoryJobService - Encountered 
Unexpected Exception during validation'
apache_beam.utils.subprocess_server: INFO: b'java.lang.RuntimeException: Failed 
to validate transform ref_AppliedPTransform_Write to Spanner/Write mutations to 
Cloud Spanner/Schema 
View/Combine.GloballyAsSingletonView/Combine.globally(Singleton)/Combine.perKey(Singleton)_31'

and a shortened oneline message:
[...] DEBUG: Stages: ['ref_AppliedPTransform_Generate input/Impulse_3\n  
Generate input/Impulse:beam:transform:impulse:v1\n  must follow: \n  
downstream_side_inputs: ', 'ref_AppliedPTransform_Generate 
input/FlatMap()_4\n  Generate input/FlatMap():beam:transform:pardo:v1\n  must follow: \n  
downstream_side_inputs: ', 'ref_AppliedPTransform_Generate 
input/Map(decode)_6\n [...]

On 2020/08/03 23:40:42, Brian Hulette  wrote: 
> The DirectRunner error looks like it's because the FnApiRunner doesn't
> support SDF.
> 
> What is the coder id for the Flink error? It looks like the full stack
> trace should contain it.
> 
> On Mon, Aug 3, 2020 at 10:09 AM Piotr Szuberski 
> wrote:
> 
> > I'm Writing SpannerIO.Write cross-language transform and when I try to run
> > it from python I receive errors:
> >
> > On Flink:
> > apache_beam.utils.subprocess_server: INFO: b'Caused by:
> > java.lang.IllegalArgumentException: Transform external_1HolderCoder uses
> > unknown accumulator coder id %s'
> > apache_beam.utils.subprocess_server: INFO: b'\tat
> > org.apache.beam.vendor.guava.v26_0_jre.com.google.common.base.Preconditions.checkArgument(Preconditions.java:216)'
> > apache_beam.utils.subprocess_server: INFO: b'\tat
> > org.apache.beam.runners.core.construction.graph.PipelineValidator.validateCombine(PipelineValidator.java:273)'
> >
> > On DirectRunner:
> >   File
> > "/Users/piotr/beam/sdks/python/apache_beam/runners/portability/fn_api_runner/fn_runner.py",
> > line 181, in run_via_runner_api
> > self._validate_requirements(pipeline_proto)
> >   File
> > "/Users/piotr/beam/sdks/python/apache_beam/runners/portability/fn_api_runner/fn_runner.py",
> > line 264, in _validate_requirements
> > raise ValueError(
> > ValueError: Missing requirement declaration:
> > {'beam:requirement:pardo:splittable_dofn:v1'}
> >
> > I suppose that SpannerIO.Write uses a transform that cannot be translated
> > in cross-language usage? I'm not sure whether there is something I can do
> > about it.
> >
> >
> 


Re: Git commit history: "fixup" commits

2020-08-04 Thread Alexey Romanenko
Yes, good point, thanks Valentyn.

> On 4 Aug 2020, at 18:29, Valentyn Tymofieiev  wrote:
> 
> +1, thanks, Alexey.
> 
> Also a reminder from the contributor guide: do not use the default GitHub 
> commit message for merge commits, which looks like:
> 
> Merge pull request #1234 from some_user/transient_branch_name
> 
> Instead, add the commit message into the subject line, for example: "Merge 
> pull request #1234: [BEAM-7873] Fix the foo bizzle bazzle".
> 
> On Tue, Aug 4, 2020 at 7:13 AM Alexey Romanenko  > wrote:
> Hi all,
> 
> I’d like to draw your attention to our Git commit history and a related
> issue. A while ago I noticed that it started getting less clear and quite
> verbose compared to how it was before. We have a significant amount of
> recent commits like “fix”, “address comments”, “typo”, “spotless”, etc.
> Most of them don’t contain a Jira tag as a prefix and are actually just
> supplementary commits to the “main”, initial commit of a PR, added after
> several rounds of PR review.
> 
> AFAIR, we already had several discussions in the past about this topic and we
> agreed that we should avoid such commits in a final merge and have only one 
> (in most cases) or several (if necessary) logical commits that should be 
> atomic and properly explain what they do. 
> 
> Why these “tiny" commits are bad practice? Just several main reasons:
> - They pollute our git repository history and don’t give any additional and 
> useful further information;
> - They are not atomic, and we can’t easily revert (roll back) such a
> supplementary commit since the state of the build before it was likely broken
> or had incorrect behaviour. So, in this case, the whole set of the PR’s
> commits has to be reverted, which is inconvenient and error-prone. It’s also
> expected that all checks were green before merging a PR (apart from flaky
> tests).
> - They are not informative in terms of commit message, which makes it harder
> to read Git-annotated code and see how the lines of code are related.
> 
> Following this, I just want to briefly remind everyone of our committer rules
> regarding PR merging [1].
> Every commit:
> - should do one thing and reflect it in the commit message;
> - should contain a Jira tag;
> - all “fixup” and “address comments” type commits should be squashed by the
> author or committer before merging.
> 
> Please pay attention to what is finally committed and merged into our
> repository; this will help keep our commit history clean, which in the end
> saves other developers’ time.
> 
> [1] https://beam.apache.org/contribute/committer-guide/#finishing-touches 
> 
> 
> Regards,
> Alexey
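
On the squashing rule above, one common workflow (shown purely as an
illustration, not as a mandated procedure) is to mark review-feedback commits
as fixups and autosquash them before merging:

---
# During review: attach each follow-up fix to the commit it amends.
git commit --fixup=<sha-of-the-commit-being-fixed>

# Before merging: squash all fixup! commits into their targets.
git fetch origin
git rebase -i --autosquash origin/master
git push --force-with-lease
---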



Could not run side_input in a streaming pipeline

2020-08-04 Thread Chuong Nguyen
Hi dev team,
I want to add a side input to a ParDo transform. The side input is a table
from BigQuery.
However, I am facing a weird issue. I could run this pipeline with the
*DirectRunner* on a local machine, but I am not able to run it with the
*DataflowRunner*. The error message is:
```
Job did not reach to a terminal state after waiting indefinitely
```
And it traced back to my *sales_transaction* table.
The attached file contains the pipeline with the issue.

I am looking for help from the community.
Thanks and best regards.


data.py
Description: Binary data
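
For context, a rough, untested sketch of the pattern being attempted (the
table spec, topic, options, and EnrichFn are hypothetical placeholders, and
this is only an outline of the approach, not a verified fix for the Dataflow
issue):

---
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

class EnrichFn(beam.DoFn):  # hypothetical enrichment DoFn
    def process(self, event, transactions):
        # `transactions` is the materialized side input (a list of table rows).
        yield (event, len(transactions))

options = PipelineOptions(streaming=True)  # plus runner/project/etc. as needed

with beam.Pipeline(options=options) as p:
    transactions = p | 'ReadTransactions' >> beam.io.ReadFromBigQuery(
        table='my-project:my_dataset.sales_transaction')  # hypothetical table

    events = p | 'ReadEvents' >> beam.io.ReadFromPubSub(
        topic='projects/my-project/topics/events')  # hypothetical topic

    enriched = events | 'Enrich' >> beam.ParDo(
        EnrichFn(), transactions=beam.pvalue.AsList(transactions))
---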


Re: Git commit history: "fixup" commits

2020-08-04 Thread Valentyn Tymofieiev
+1, thanks, Alexey.

Also a reminder from the contributor guide: do not use the default GitHub
commit message for merge commits, which looks like:

Merge pull request #1234 from some_user/transient_branch_name

Instead, add the commit message into the subject line, for example: "Merge
pull request #1234: [BEAM-7873] Fix the foo bizzle bazzle".

On Tue, Aug 4, 2020 at 7:13 AM Alexey Romanenko 
wrote:

> Hi all,
>
> I’d like to attract your attention regarding our Git commit history and
> related issue. A while ago I noticed that it started getting not very clear
> and quite verbose comparing to how it was before. We have quite significant
> amount of recent commits like “fix”, “address comments”, “typo”,
> “spotless”, etc. Most of them also doesn’t contain Jira Tag as a prefix and
> actually is just supplementary commits to “main” and initial commit of PR,
> added after several PRs review rounds.
>
> AFAIR, we already had several discussion in the past about this topic and
> we agreed that we should avoid such commits in a final merge and have only
> one (in most cases) or several (if necessary) logical commits that should
> be atomic and properly explain what they do.
>
> Why these “tiny" commits are bad practice? Just several main reasons:
> - They pollute our git repository history and don’t give any additional
> and useful further information;
> - They are not atomic and we can’t easily revert (rollback) this
> supplementary commit since the state of the build before was likely broken
> or had incorrect behaviour. So, in this case, the whole set of PRs commits
> should be reverted which is not convenient and error-prone. It’s also
> expected that all checks were green before merging a PR (take a part flaky
> tests).
> - They are not informative in terms of commit message. So it makes more
> hard to identify Git annotated code and how the lines of code are related
> together.
>
> Following this, I just want to briefly remind our Committers rules
> regarding PR merging [1].
> Every commit:
> - should do one thing and reflect it in commit message;
> - should contain Jira Tag;
> - all “fixup” and “address comments” type of commits should be squashed by
> author or committer before merging.
>
> Please, pay attention on what is finally committed and merged into our
> repository and it should help to keep our commit history clear, which will
> be transferred to saving a time of other developers in the end.
>
> [1] https://beam.apache.org/contribute/committer-guide/#finishing-touches
>
> Regards,
> Alexey
>


Re: Making reporting bugs/feature request easier

2020-08-04 Thread Alexey Romanenko
Great topic, thanks Griselda for raising this question.

I’d prefer to keep Jira as the single main issue tracker and use the other
suggested ways, like email, GitHub issues, a web form, or a dedicated Slack
channel, as different interfaces designed to simplify how users can submit an
issue. But in any case it will require attention from Beam contributors to
properly create the Jira issue and send back a link that can be followed for
updates.

> On 31 Jul 2020, at 20:22, Robert Burke  wrote:
> 
> I do like the idea of the "wrong" ways to raise issues pointing to the
> correct ways.
> 
> On Fri, Jul 31, 2020, 10:57 AM Brian Hulette  > wrote:
> I think I'd prefer continuing to use jira, but GitHub issues are certainly 
> much more discoverable for our users. The Arrow project uses GitHub issues as 
> a way to funnel users to the mailing lists and JIRA. When users go to file an 
> issue they're first given two options [1]:
> 
> - Ask a question -> Please ask questions at u...@arrow.apache.org 
> 
> - Report an issue -> Please report bugs and request features on JIRA.
> 
> With accompanying links for each option. The user@ link actually takes you to 
> the new issue page, with a template strongly encouraging you to file a jira 
> or subscribe to the mailing lists.
> Despite all these barriers people do still file github issues, and they need 
> to be triaged (usually they just receive a comment asking the reporter to 
> file a jira or linking to an existing jira), but the volume isn't that high.
> 
> Maybe we could consider something like that?
> 
> Brian
> 
> [1] https://github.com/apache/arrow/issues/new/choose 
> 
> On Thu, Jul 30, 2020 at 2:45 PM Robert Bradshaw  > wrote:
> On Wed, Jul 29, 2020 at 7:12 PM Kenneth Knowles  > wrote:
> 
> On Wed, Jul 29, 2020 at 11:08 AM Robert Bradshaw wrote:
> +1 to a simple link that fills in most of the fields in JIRA, though this 
> doesn't solve the issue of having to sign up just to submit a bug report. 
> Just using the users list isn't a bad idea either--we could easily create a 
> script that ensures all threads that have a message like "we should file a 
> JIRA for this" are followed up with a message like "JIRA filed at ...". (That 
> doesn't mean it won't languish on the tracker.)
> 
> I think it's worth seriously considering just using Github's issue tracker, 
> since that's where our users are. Is there anything in we actually use in 
> JIRA that'd be missing? 
> 
> Pretty big question. Just noting to start that Apache projects certainly can 
> and do use GitHub issues. Here is a quick inventory of things that are used 
> in a meaningful way:
> 
>  - Priorities (with GitHub Issues I think you roll your own with labels)
>  - Issue types (with GitHub Issues I think you roll your own with labels) 
>  - Distinct "Triage Needed" state (also labels; anything lacking the 
> "triaged" label)
>  - Distinguishing "Open" and "In Progress" (also labels? can use 
> Projects/Milestones - I forget exactly which - to keep a kanban-ish status)
> 
> Yes, basically everything can be done with labels. Whether having one hammer 
> is good, well, there are pros and cons. 
>  
>  - Our new automation: "stale-assigned" and subsequent unassign; "stale-P2" 
> and subsequent downgrade
> 
> Github has a very nice ReST API, making things like this very easy. 
>  
>  - Fix Version for associating fixes with releases
> 
> This information is typically implicit in when the commits were applied 
> and the bug closed. It's pretty typical to use milestones for a release, and 
> then tag "blockers" to it. (IMHO, this is better than having the default 
> always be the next release, and bumping all open bugs every release that 
> comes out.) Milestones can be used to track other work as well. 
>  
>  - Affect Version, while not used much, is still helpful to have
>  - Components, since our repo is really a mini mono repo. Again, labels.
>  - Kanban boards (milestones/projects maybe kinda)
>  - Reports (not really same level, but maybe OK?)
> 
> Fairly recently I worked on a project that tried to use GitHub Issues and 
> Projects and Milestones and whatnot and it was OK but not great. Jira's 
> complexity is largely essential / not really complex but just visually busy. 
> The two are not really even comparable offerings. There may be third party 
> integrations that add some of what you'd want.
> 
> Yeah, I agree Github issues is not as full featured. One thing I miss from 
> other products is dependencies (though Jira's one level of subtasks isn't the 
> best either.) But maybe I am just not a power user of JIRA, mostly using it 
> as a (collective) TODO list and place to record state on bugs/features. It'd 
> be useful to understand what others actually use/would really miss. 
> 
> The primary advantage of 

Re: Chronically flaky tests

2020-08-04 Thread Tyson Hamilton
On Thu, Jul 30, 2020 at 6:24 PM Ahmet Altay  wrote:

> I like:
> *Include ignored or quarantined tests in the release notes*
> *Run flaky tests only in postcommit* (related? *Separate flaky tests into
> quarantine job*)
>

The quarantine job would allow them to run in presubmit still, we would
just not use it to determine the health of a PR or block submission.
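
For the Python suites, one way to express that split is a dedicated pytest
marker, so quarantined tests still run but don't gate PRs. A minimal sketch
(the `quarantine` marker name is hypothetical, nothing like it exists in the
repo today):

---
# Minimal sketch of a hypothetical "quarantine" marker for known-flaky tests.
# The marker would be registered in pytest.ini / setup.cfg, e.g.:
#   [pytest]
#   markers =
#       quarantine: known-flaky tests, excluded from the presubmit signal
import random

import pytest


@pytest.mark.quarantine
def test_known_flaky_behavior():
    # Stand-in for a test that intermittently fails under CI load.
    assert random.random() > 0.01


def test_healthy_signal():
    assert 1 + 1 == 2
---

Presubmit would then run `pytest -m "not quarantine"` while a separate
quarantine job runs `pytest -m quarantine`, so the flaky tests keep producing
a signal without blocking PRs.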


> *Require link to Jira to rerun a test*
>
> I am concerned about:
> *Add Gradle or Jenkins plugin to retry flaky tests* - because it is a
> convenient place for real bugs to hide.
>

This concern has come up a few times now so I feel like this is a route we
shouldn't pursue further.


>
> I do not know much about:
> *Consider Gradle Enterprise*
> https://testautonation.com/analyse-test-results-deflake-flaky-tests/
>

There is a subscription fee for Gradle Enterprise but it offers a lot of
support for flaky tests and other metrics. I have a meeting to talk with
them on August 7th about the pricing model for open source projects. From
what I understand, last time we spoke with them they didn't have a good
model for open source projects and the fee was tied into the number of
developers in the project.


>
>
> Thank you for putting this list together! I believe even if we can commit to doing
> some of these we would have a much healthier project. If we can build
> consensus on implementing, I will be happy to work on some of them.
>
> On Fri, Jul 24, 2020 at 1:54 PM Kenneth Knowles  wrote:
>
>> Adding
>> https://testautonation.com/analyse-test-results-deflake-flaky-tests/ to
>> the list which seems a more powerful test history tool.
>>
>> On Fri, Jul 24, 2020 at 1:51 PM Kenneth Knowles  wrote:
>>
>>> Had some off-list chats to brainstorm and I wanted to bring ideas back
>>> to the dev@ list for consideration. A lot can be combined. I would
>>> really like to have a section in the release notes. I like the idea of
>>> banishing flakes from pre-commit (since you can't tell easily if it was a
>>> real failure caused by the PR) and auto-retrying in post-commit (so we can
>>> gather data on exactly what is flaking without a lot of manual
>>> investigation).
>>>
>>> *Include ignored or quarantined tests in the release notes*
>>> Pro:
>>>  - Users are aware of what is not being tested so may be silently broken
>>>  - It forces discussion of ignored tests to be part of our community
>>> processes
>>> Con:
>>>  - It may look bad if the list is large (this is actually also a Pro
>>> because if it looks bad, it is bad)
>>>
>>> *Run flaky tests only in postcommit*
>>> Pro:
>>>  - isolates the bad signal so pre-commit is not affected
>>>  - saves pointless re-runs in pre-commit
>>>  - keeps a signal in post-commit that we can watch, instead of losing it
>>> completely when we disable a test
>>>  - maybe keeps the flaky tests in job related to what they are testing
>>> Con:
>>>  - we have to really watch post-commit or flakes can turn into failures
>>>
>>> *Separate flaky tests into quarantine job*
>>> Pro:
>>>  - gain signal for healthy tests, as with disabling or running in
>>> post-commit
>>>  - also saves pointless re-runs
>>> Con:
>>>  - may collect bad tests so that we never look at it so it is the same
>>> as disabling the test
>>>  - lots of unrelated tests grouped into signal instead of focused on
>>> health of a particular component
>>>
>>> *Add Gradle or Jenkins plugin to retry flaky tests*
>>> https://blog.gradle.org/gradle-flaky-test-retry-plugin
>>> https://plugins.jenkins.io/flaky-test-handler/
>>> Pro:
>>>  - easier than Jiras with humans pasting links; works with moving flakes
>>> to post-commit
>>>  - get a somewhat automated view of flakiness, whether in pre-commit or
>>> post-commit
>>>  - don't get stopped by flakiness
>>> Con:
>>>  - maybe too easy to ignore flakes; we should add all flakes (not just
>>> disabled or quarantined) to the release notes
>>>  - sometimes flakes are actual bugs (like concurrency) so treating this
>>> as OK is not desirable
>>>  - without Jiras, no automated release notes
>>>  - Jenkins: retry only will work at job level because it needs Maven to
>>> retry only failed (I think)
>>>  - Jenkins: some of our jobs may have duplicate test names (but might
>>> already be fixed)
>>>
>>> *Consider Gradle Enterprise*
>>> Pro:
>>>  - get Gradle scan granularity of flake data (and other stuff)
>>>  - also gives module-level health which we do not have today
>>> Con:
>>>  - cost and administrative burden unknown
>>>  - we probably have to do some small work to make our jobs compatible
>>> with their history tracking
>>>
>>> *Require link to Jira to rerun a test*
>>> Instead of saying "Run Java PreCommit" you have to link to the bug
>>> relating to the failure.
>>> Pro:
>>>  - forces investigation
>>>  - helps others find out about issues
>>> Con:
>>>  - adds a lot of manual work, or requires automation (which will
>>> probably be ad hoc and fragile)
>>>
>>> Kenn
>>>
>>> On Mon, Jul 20, 2020 at 11:59 AM Brian Hulette wrote:
>>>

Re: Chronically flaky tests

2020-08-04 Thread Etienne Chauchot

Hi all,

+1 on pinging the assigned person.

The flakes I know of (ESIO and CassandraIO) are due to the load of the 
CI server. These IOs are tested using real embedded backends because 
those backends are complex and we need relevant tests.


Countermeasures have been taken (retries inside the tests that are 
sensitive to load, accepting ranges of values, calling internal backend 
mechanisms to force a refresh when load prevented the backend from doing so ...).


I recently got pinged by Ahmet (thanks to him!) about a flakiness that I 
had not seen. This seems to me the correct way to go. Systematically 
retrying tests with a CI mechanism or disabling tests seems to me a risky 
workaround that just gets the problem off our minds.


Etienne

On 20/07/2020 20:58, Brian Hulette wrote:
> I think we are missing a way for checking that we are making 
progress on P1 issues. For example, P0 issues block releases and this 
obviously results in fixing/triaging/addressing P0 issues at least 
every 6 weeks. We do not have a similar process for flaky tests. I do 
not know what would be a good policy. One suggestion is to ping 
(email/slack) assignees of issues. I recently missed a flaky issue 
that was assigned to me. A ping like that would have reminded me. And 
if an assignee cannot help/does not have the time, we can try to find 
a new assignee.


Yeah I think this is something we should address. With the new jira 
automation at least assignees should get an email notification after 
30 days because of a jira comment like [1], but that's too long to let 
a test continue to flake. Could Beam Jira Bot ping every N days for 
P1s that aren't making progress?


That wouldn't help us with P1s that have no assignee, or are assigned 
to overloaded people. It seems we'd need some kind of dashboard or 
report to capture those.


[1] 
https://issues.apache.org/jira/browse/BEAM-8101?focusedCommentId=17121918=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17121918
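
As a rough sketch of what such a periodic ping could look like (the JQL 
filter, the `flake` label, the 14-day threshold and the basic-auth credentials 
are all assumptions for illustration, not what Beam Jira Bot actually does):

---
# Hedged sketch: find P1 flake issues with no recent update and ping the assignee.
import os
import requests

JIRA_API = "https://issues.apache.org/jira/rest/api/2"
AUTH = (os.environ["JIRA_USER"], os.environ["JIRA_PASS"])

# JQL for BEAM P1 flakes with no update in 14 days (label and threshold are assumptions).
JQL = ("project = BEAM AND priority = P1 AND labels = flake "
       "AND updated <= -14d AND resolution = Unresolved")


def ping_stale_p1s():
    results = requests.get(
        f"{JIRA_API}/search",
        auth=AUTH,
        params={"jql": JQL, "fields": "assignee,summary"},
    ).json()
    for issue in results.get("issues", []):
        assignee = (issue["fields"].get("assignee") or {}).get("displayName", "unassigned")
        # Leave a comment so the assignee gets a notification email.
        requests.post(
            f"{JIRA_API}/issue/{issue['key']}/comment",
            auth=AUTH,
            json={"body": f"Friendly ping ({assignee}): this P1 flake has had no update in 14 days."},
        )
---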


On Fri, Jul 17, 2020 at 1:09 PM Ahmet Altay wrote:


Another idea: could we change our "Retest X" phrases to "Retest
X (Reason)" phrases? With this change a PR author will have to
look at failed test logs. They could catch new flakiness
introduced by their PR, file a JIRA for a flakiness that was not
noted before, or ping an existing JIRA issue/raise its severity.
On the downside this will require PR authors to do more.

On Fri, Jul 17, 2020 at 6:46 AM Tyson Hamilton wrote:

Adding retries can be beneficial in two ways, unblocking a PR,
and collecting metrics about the flakes.


Makes sense. I think we will still need to have a plan to remove
retries similar to re-enabling disabled tests.


If we also had a flaky test leaderboard that showed which
tests are the most flaky, then we could take action on them.
Encouraging someone from the community to fix the flaky test
is another issue.

The test status matrix on the GitHub landing page could show
the flake level to communicate to users which modules are losing
a trustworthy test signal. Maybe this shows up as a flake % or a
code coverage % that decreases due to disabled flaky tests.


+1 to a dashboard that will show a "leaderboard" of flaky tests.


I didn't look for plugins, just dreaming up some options.




On Thu, Jul 16, 2020, 5:58 PM Luke Cwik wrote:

What do other Apache projects do to address this issue?

On Thu, Jul 16, 2020 at 5:51 PM Ahmet Altay wrote:

I agree with the comments in this thread.
- If we are not re-enabling disabled tests, or do not
have a plan to re-enable them, disabling tests only
provides temporary relief until eventually users find
the issues instead of the disabled tests.
- I feel similarly about retries. It is reasonable to
add retries for reasons we understand. Adding retries
to avoid flakes is similar to disabling tests. They
might hide real issues.

I think we are missing a way for checking that we are
making progress on P1 issues. For example, P0 issues
block releases and this obviously results in
fixing/triaging/addressing P0 issues at least every 6
weeks. We do not have a similar process for flaky
tests. I do not know what would be a good policy. One
suggestion is to ping (email/slack) assignees of
issues. I recently missed a flaky issue that was
assigned to me. A ping like that would have reminded
me. And if an assignee cannot help/does not have 

Git commit history: "fixup" commits

2020-08-04 Thread Alexey Romanenko
Hi all,

I’d like to draw your attention to our Git commit history and a related issue. 
A while ago I noticed that it started getting less clear and quite verbose 
compared to how it was before. We have a quite significant amount of recent 
commits like “fix”, “address comments”, “typo”, “spotless”, etc. Most of them 
also don’t contain a Jira tag as a prefix and are actually just supplementary 
commits on top of the “main”, initial commit of the PR, added after several 
rounds of PR review.

AFAIR, we already had several discussions in the past about this topic and we 
agreed that we should avoid such commits in a final merge and have only one (in 
most cases) or several (if necessary) logical commits that should be atomic and 
properly explain what they do.

Why are these “tiny” commits bad practice? A few main reasons:
- They pollute our git repository history and don’t add any useful information;
- They are not atomic, so we can’t easily revert (roll back) a single 
supplementary commit, since the state of the build before it was likely broken 
or behaving incorrectly. In that case the whole set of the PR’s commits has to 
be reverted, which is inconvenient and error-prone. It’s also expected that all 
checks were green before merging a PR (flaky tests aside).
- They are not informative in terms of commit message, which makes it harder to 
read Git-annotated (blame) output and to see how lines of code are related.

Following this, I just want to briefly remind everyone of our committer rules 
regarding PR merging [1].
Every commit:
- should do one thing and reflect it in its commit message;
- should contain a Jira tag;
- all “fixup” and “address comments” type commits should be squashed by the 
author or committer before merging.

Please pay attention to what is finally committed and merged into our 
repository. This will help keep our commit history clean, which in the end 
saves other developers’ time.

[1] https://beam.apache.org/contribute/committer-guide/#finishing-touches 


Regards,
Alexey

Re: No space left on device - beam-jenkins 1 and 7

2020-08-04 Thread Damian Gadomski
I did some research on the temporary directories. It seems there's no single
unified way of telling applications to use a specific path, nor any guarantee
that all of them will use a dedicated custom directory. Badly behaving apps
can always hardcode `/tmp`, e.g. Java ;)

But we should be able to handle most of the cases by setting the `TMPDIR`
env variable (and also the less popular `TMP` and `TEMP`) and passing the Java
property `java.io.tmpdir` to the builds.
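
As a quick illustration of why the environment variables cover most non-JVM
tooling: Python's `tempfile`, for example, consults `TMPDIR`, `TEMP` and `TMP`
before falling back to `/tmp`. A small sketch (the `/home/jenkins/tmp` path is
just an example location, not a decision):

---
# Demonstrates that Python's tempfile honors TMPDIR (then TEMP, TMP) before /tmp.
import os
import tempfile

os.environ["TMPDIR"] = "/home/jenkins/tmp"
os.makedirs("/home/jenkins/tmp", exist_ok=True)

tempfile.tempdir = None  # drop the cached value so gettempdir() re-reads the env
print(tempfile.gettempdir())  # -> /home/jenkins/tmp

with tempfile.NamedTemporaryFile(prefix="beam-build-") as f:
    print(f.name)  # created under /home/jenkins/tmp rather than /tmp
---

JVM processes don't read those variables, which is why passing
`-Djava.io.tmpdir=...` to the Gradle builds is still needed.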

There's even a plugin [1] that perfectly fits our needs, but it has been
unmaintained for 3 years and is not available in the Jenkins plugin repository,
so I'm not sure we want to use it. Alternatively, we can add the env variables,
the Java property, and the creation of the directory manually to the DSL scripts.

[1] https://github.com/acrolinx/tmpdir-jenkins-plugin

On Wed, Jul 29, 2020 at 12:58 AM Kenneth Knowles  wrote:

> Cool. If it is /home/jenkins it should be just fine. Thanks for checking!
>
> Kenn
>
> On Tue, Jul 28, 2020 at 10:23 AM Damian Gadomski <
> damian.gadom...@polidea.com> wrote:
>
>> Sorry, mistake while copying, [1] should be:
>> [1]
>> https://github.com/apache/beam/blob/8aca8ccc7f1a14516ad769b63845ddd4dc163d92/.test-infra/jenkins/CommonJobProperties.groovy#L63
>>
>>
>> On Tue, Jul 28, 2020 at 7:21 PM Damian Gadomski <
>> damian.gadom...@polidea.com> wrote:
>>
>>> That's interesting. I didn't check that myself but all the Jenkins jobs
>>> are configured to wipe the workspace just before the actual build happens
>>> [1].
>>> Git SCM plugin is used for that and it enables the option called "Wipe out
>>> repository and force clone". Docs state that it "deletes the contents of
>>> the workspace before build and before checkout" [2]. Therefore I assume
>>> that removing
>>> workspace just after the build won't change anything.
>>>
>>> The ./.gradle/caches/modules-2/files-2.1 dir is indeed present on the
>>> worker machines but it's rather in /home/jenkins dir.
>>>
>>> damgad@apache-ci-beam-jenkins-13:/home/jenkins/.gradle$ sudo du -sh
>>> 11G .
>>> damgad@apache-ci-beam-jenkins-13:/home/jenkins/.gradle$ sudo du -sh
>>> caches/modules-2/files-2.1
>>> 2.3G caches/modules-2/files-2.1
>>>
>>> I can't find that directory structure inside workspaces.
>>>
>>> damgad@apache-ci-beam-jenkins-13:/home/jenkins/jenkins-slave/workspace$
>>> sudo find -name "files-2.1"
>>> damgad@apache-ci-beam-jenkins-13:/home/jenkins/jenkins-slave/workspace$
>>>
>>> [1]
>>> https://github.com/apache/beam/blob/8aca8ccc7f1a14516ad769b63845ddd4dc163d92/.test-infra/jenkins/CommonJobProperties.groovy#L6
>>> [2] https://plugins.jenkins.io/git/
>>>
>>> On Tue, Jul 28, 2020 at 5:47 PM Kenneth Knowles  wrote:
>>>
 Just checking - will this wipe out dependency cache? That will slow
 things down and significantly increase flakiness. If I recall correctly,
 the default Jenkins layout was:

 /home/jenkins/jenkins-slave/workspace/$jobname
 /home/jenkins/jenkins-slave/workspace/$jobname/.m2
 /home/jenkins/jenkins-slave/workspace/$jobname/.git

 Where you can see that it did a `git clone` right into the root
 workspace directory, adjacent to .m2. This was not hygienic. One important
 thing was that `git clean` would wipe the maven cache with every build. So
 in https://github.com/apache/beam/pull/3976 we changed it to

 /home/jenkins/jenkins-slave/workspace/$jobname
 /home/jenkins/jenkins-slave/workspace/$jobname/.m2
 /home/jenkins/jenkins-slave/workspace/$jobname/src/.git

 Now the .m2 directory survives and we do not constantly see flakes
 re-downloading deps that are immutable. This does, of course, use disk
 space.

 That was in the maven days. Gradle is the same except for $HOME/.m2 is
 replaced by $HOME/.gradle/caches/modules-2/files-2.1. Is Jenkins configured
 the same way so we will be wiping out the dependencies? If so, can you
 address this issue? Everything in that directory should be immutable and
 just a cache to avoid pointless re-download.

 Kenn

 On Tue, Jul 28, 2020 at 2:25 AM Damian Gadomski <
 damian.gadom...@polidea.com> wrote:

> Agree with Udi, workspaces seem to be the third culprit, not yet
> addressed in any way (until PR#12326 is merged). I feel that
> it'll solve the issue of filling up the disks for a long time ;)
>
> I'm also OK with moving /tmp cleanup to option B, and will happily
> investigate on proper TMPDIR config.
>
>
>
> On Tue, Jul 28, 2020 at 3:07 AM Udi Meiri  wrote:
>
>> What about the workspaces, which can take up 175GB in some cases (see
>> above)?
>> I'm working on getting them cleaned up automatically:
>>