Re: Python 3: final step

2019-01-04 Thread Manu Zhang
Guys,

Happy New Year!!!
I haven't had much time to contribute to Python 3 support. What is the
progress now? It seems there are quite a few open issues under
https://issues.apache.org/jira/browse/BEAM-1251. People have kept asking
about Python 3 support in tf.transform (
https://github.com/tensorflow/transform/issues/1), which is blocked by
BEAM-1251.

Thanks,
Manu Zhang


On Fri, Oct 12, 2018 at 3:17 AM Valentyn Tymofieiev 
wrote:

> I cc'ed a few folks who are familiar with Jenkins setup on
> https://issues.apache.org/jira/browse/BEAM-5663, I think we can continue
> the discussion there or start a separate thread.
>
> On Wed, Oct 10, 2018 at 8:54 PM Manu Zhang 
> wrote:
>
>> Does anyone know how to set the Python version on Jenkins? It’s Python
>> 3.5.2 now.
>>
>> Thanks,
>> Manu Zhang
>> On Oct 5, 2018, 9:24 AM +0800, Valentyn Tymofieiev ,
>> wrote:
>>
>> I have put together a guide [1] to help get started with investigating
>> Python 3-related test failures; it may be helpful for new folks joining
>> the effort.
>>
>> Comments and improvements welcome!
>>
>> Thanks,
>> Valentyn
>>
>> [1]
>> https://docs.google.com/document/d/1s1BJVCY65LB_SYK1SU1u7NbZiFANoq-nEYaEvzRbYlA
>>
>>
>> On Thu, Oct 4, 2018 at 11:26 AM Valentyn Tymofieiev 
>> wrote:
>>
>>> I agree there is some overlap between JIRAs that track individual
>>> failures and module-level JIRAs. We originally wanted to do the conversion
>>> on a module-by-module basis; however, we learned that test failures in some
>>> modules require changes in other modules, and it may be a little easier to
>>> slice the problem if we focus on classes of failures.
>>>
>>> Module-level JIRAs can still be useful for tracking the end result: the tox
>>> suites cover all tests in the module in a Py3 environment, and there are no
>>> disabled tests in the module that don't have individual JIRAs tracking them.
>>>
>>> I suggest that folks who are working on module-level JIRAs assign to
>>> themselves the JIRAs that track individual failures if/when they are
>>> actively addressing them. This way, unassigned problem-specific JIRAs can
>>> use help from the community.
>>>
>>> Thanks,
>>> Valentyn
>>>
>>>
>>> On Wed, Oct 3, 2018 at 8:14 PM Manu Zhang 
>>> wrote:
>>>
 Thanks Valentyn. Note that some test-failure issues are covered by “Finish
 Python 3 porting for *** module”, e.g.
 https://issues.apache.org/jira/browse/BEAM-5315.

 Manu
 On Oct 3, 2018, 4:18 PM +0800, Valentyn Tymofieiev wrote:

 Hi Rakesh and Manu,

 Thanks to both of you for offering help (in different threads). It's
 great to see that more and more people get involved with helping to make
 Beam Python 3 compatible!

 There are a few PRs in flight, and several people in the community are
 actively working on Python 3 support now. I would be happy to coordinate the
 work so that we don't step on each other's toes and can avoid duplicating
 effort.

 I recently looked at unit tests that are still failing in the Python 3
 environment and filed a few issues (in the range BEAM-5615–BEAM-5629)
 to track similar classes of errors. You can also find them on the Kanban
 board [1].
 In particular, BEAM-5620 and BEAM-5627 should be easy issues to get
 started with.

 There are multiple ways you can help:
 - Helping to root-cause errors. Even a comment on why a test is failing, and
 a suggestion for how to fix it, will be helpful to others when you don't have
 time to do the fix yourself.
 - Helping with code reviews.
 - Reporting new issues (as subtasks of BEAM-1251), and deduplicating or
 splitting the existing issues. We probably don't want to file a Jira for
 each of the 250+ currently failing tests at this point, but it may make sense
 to track errors that occur repeatedly and share a root cause.
 - Fixing the issues. Feel free to assign an issue to yourself if you
 have a fix in mind and plan to actively work on it. Due to the nature of
 the problem, it may occasionally happen that two issues share a root cause,
 or that fixing one issue is a prerequisite for fixing another, so sync to
 master often to make sure the issue you are working on is not already
 fixed.

 I'll also keep an eye on the PRs and will try to keep the list of open
 issues up to date.

 Thanks,
 Valentyn

 [1]:
 https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=245&view=detail


 On Tue, Oct 2, 2018 at 9:38 AM Pablo Estrada 
 wrote:

> Very cool : ) I'm also available to review / merge if you need help
> from my side.
> Best
> -P.
>
> On Tue, Oct 2, 2018 at 7:45 AM Rakesh Kumar 
> wrote:
>
>> Hi Rob,
>>
>> I am Rakesh Kumar, and I am using the Beam SDK for one of my projects at
>> Lyft. I have been working closely with Thomas Weise, and I have already met
>> a couple of Python SDK developers in person.
>> I am interested to help 

Beam Contribution

2019-01-04 Thread David Rieber
Hello,
My name is David Rieber. I work on the Google Cloud Dataflow service. I would
like to be added as a contributor to Beam. My Jira username is drieber.
Thanks!


Re: Schemas in the Go SDK

2019-01-04 Thread Robert Burke
Having slept on it, here are my thoughts. Granted, AFAICT there is no
spec for schemas, so my understanding is based on what I've learned in the
last 18-ish hours. If there is a spec, I'd love to see it.

*1.* Default behavior to support Schemas in some way doesn't remove the
need for certain specific uses of an atomic coder for a type, e.g.
specifying that Beam shouldn't look further into this type.

TBH the interaction between schemas and coders is the least interesting
part about schemas, and it matters in precious few circumstances. In
particular, when grouping by key, it seems like the schema coder should be
used by default, but otherwise not. Further, there's always the option to
try the schema encoding and, should that fail, fall back to any existing
atomic coder by default, though this risks data corruption in some
situations.
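To illustrate that fallback in Go (schemaEncode and atomicEncode here are
hypothetical stand-ins for the SDK's schema-derived and registered atomic
coders, not real SDK functions):

package coderfallback

import (
	"errors"
	"fmt"
	"reflect"
)

var errNoSchema = errors.New("no schema derivable for type")

// schemaEncode stands in for a schema-derived encoding; it only
// handles structs here, as a placeholder for real schema logic.
func schemaEncode(v interface{}) ([]byte, error) {
	if reflect.TypeOf(v).Kind() != reflect.Struct {
		return nil, errNoSchema
	}
	return []byte(fmt.Sprintf("%+v", v)), nil // placeholder encoding
}

// atomicEncode stands in for a registered atomic coder.
func atomicEncode(v interface{}) ([]byte, error) {
	return []byte(fmt.Sprint(v)), nil // placeholder encoding
}

// encode tries the schema encoding first and, should that fail,
// falls back to the atomic coder, as described above.
func encode(v interface{}) ([]byte, error) {
	b, err := schemaEncode(v)
	if errors.Is(err, errNoSchema) {
		return atomicEncode(v)
	}
	return b, err
}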

*1.a* In a later Beam version, it could be true that there's no need for
such uses. There's always the option to work around anything by writing a
DoFn that accepts a []byte and then produces a given type. However,
decoding []byte and encoding it back again seems like a common enough
operation for some domains that having direct Beam support in some capacity
is desirable for performance reasons.
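For example, a minimal sketch of that workaround (the Purchase type and its
JSON payload are made up for illustration):

package bytedecode

import "encoding/json"

// Purchase is a hypothetical element type.
type Purchase struct {
	User string  `json:"user"`
	Cost float64 `json:"cost"`
}

// decodePurchase is a plain-function DoFn that turns raw bytes into
// the real type; returning an error fails the bundle on bad input.
// It would be applied with something like:
//   decoded := beam.ParDo(s, decodePurchase, rawBytes)
func decodePurchase(data []byte) (Purchase, error) {
	var p Purchase
	err := json.Unmarshal(data, &p)
	return p, err
}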

*2.* It would be easy enough to have a pipeline fail at construction time
should a type be unable to derive a schema for itself when it's put into a
schema-required scenario.

*3.* The Go SDK does recursive type analysis to be able to encode types
for coders anyway, as Go has no native concept of "serializable types" or
"serializable functions". It wouldn't be too much of a stretch to convert
this representation to a portable Schema representation.

When materializing types, Go has extensively defined type conversion rules,
which are accessible via the reflect package. This means that we can always
synthetically create an instance of a real type from something like a
schema, assuming they match field for field. E.g., if a user declares a
PCollection with a given Schema, then in principle it would be possible to
use that PCollection as an input with a field-for-field-compatible real
struct type, and have this verified at construction time. The "extra sauce"
would be to have this happen for a subset of fields for convenient
extraction, à la the annotation use in Java.
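A rough sketch of that field-for-field materialization via the reflect
package, modeling a schema row as a map from field name to value purely for
illustration (this is not an SDK type):

package rowconv

import (
	"fmt"
	"reflect"
)

// fromRow populates the exported fields of the struct pointed to by out
// from a schema-like row, using reflect's assignability rules to verify
// field-for-field compatibility. A real implementation would perform this
// verification once at construction time, not per element.
func fromRow(row map[string]interface{}, out interface{}) error {
	v := reflect.ValueOf(out).Elem()
	t := v.Type()
	for i := 0; i < t.NumField(); i++ {
		f := t.Field(i)
		if f.PkgPath != "" {
			continue // unexported fields can't be set via reflection
		}
		val, ok := row[f.Name]
		if !ok {
			return fmt.Errorf("row is missing field %q", f.Name)
		}
		rv := reflect.ValueOf(val)
		if !rv.Type().AssignableTo(f.Type) {
			return fmt.Errorf("field %q: %v is not assignable to %v", f.Name, rv.Type(), f.Type)
		}
		v.Field(i).Set(rv)
	}
	return nil
}

For a User struct with a Name string field, fromRow(map[string]interface{}{"Name": "rebo"}, &u)
fills u and errors on any mismatch; the same check could run once against the
declared Schema at construction time.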

In particular, this means that whenever the Go SDK is in a scenario where it
doesn't have a schema, *it could probably create one ad hoc* for that
context, and use the atomic coder the rest of the time if available.
Whether we want it to do so is another matter, and probably
situation-specific.

*4.* It seems, long term (in that it will eventually be done, not that it
will necessarily take a long time to get there), that Schemas are likely
the interchange format for cross-language pipeline support. That is, when
an SDK is invoking a transform in a different language (say, Beam Go
calling on Beam SQL), the values could be specified and returned in the
schema format to ensure compatibility. The trick here is that the expected
return schema still needs to be explicitly specified by the user in some
circumstances. (E.g., going from a SQL statement -> Schema doesn't seem like
a natural fit, and won't necessarily be available at pipeline construction
time in the remote language.)

*5.* An interesting aspect of schemas is that they fundamentally enable
SDKs to start with a light DSL layer with "known" types and
transforms/combines/joins, which then never need to be invoked on the SDK
layer. Runners could each implement schemas directly and avoid unnecessary
FnAPI hops for improved performance, largely because they know the type's
structure. No need for any of it to be implemented SDK side to start.

Overall this is a noble goal, in that it enables more languages more
easily, but it's concerning from my view, in that the other goal is to
enable data processing in the SDK language, and this moves us farther away
from the more general, if verbose, approaches to doing the same thing.

I'm on the side of Scalable Data Processing in Go, which ideally entails
writing Go, rather than an abstract DSL.


I don't speak for all Go users, and welcome hearing from others.

On Thu, 3 Jan 2019 at 17:52 Robert Burke  wrote:

> At this point I feel like the schema discussion should be a separate
> thread from having a Coder Registry in Go, which was the original topic, so
> I'm forking it.
>
> It does sound like adding Schemas to the Go SDK would be a much larger
> extension than the registry.
>
> I'm not convinced that going without a convenient registry would serve Go
> SDK users (such as they exist).
>
> The concern I have isn't so much for Ints or Doubles, but for user types
> such as Protocol Buffers, and not just those. There will be some users who
> prize efficiency first, and readability second. The Go SDK presently 

Re: [Go SDK] User Defined Coders

2019-01-04 Thread Robert Burke
I think you're right, Kenn.

Reuven alluded to the difficulty of inferring what to use between
AtomicType and the rest, in particular Struct.

Go has the additional concern of pointer vs. non-pointer types, which
isn't a concern that either Python or Java has, but which has implications
for pipeline efficiency that need addressing; in particular, being able to
use them in a useful fashion in the Go SDK.

I agree that, long term, having schemas as a default codec would be hugely
beneficial for readability and composability, and would allow more
processing to happen on the runner-harness side of a worker. (I'll save the
rest of my thoughts on Schemas in Go for the other thread, and say no more
of it here.)

*Regarding my proposal for User Defined Coders:*

To avoid users accidentally preventing themselves from using Schemas in the
future, I need to remove the ability to override the default coder *(4)*.
Then, instead of JSON coding by default *(5)*, the SDK should do Schema
coding. The SDK already does the recursive type analysis on types at
pipeline construction time, so it's not a huge stretch to support Schemas
using that information in the future, once Runner & FnAPI support begins to
exist (see the sketch below).

*(1)* doesn't seem to need changing, as this is the existing AtomicType
definition Kenn pointed out.

*(2)* is the specific AtomicType override.

*(3)* is the broader Go-specific override for Go's unique interface
semantics. This covers most of the cases *(4)* would have handled anyway,
but in a targeted way.

This should still allow Go users to better control their pipelines, and the
associated performance implications (which is my goal in this change),
while not making an overall incompatible choice that would block powerful
Beam features for the common case in the future.
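For concreteness, here's a sketch of the sort of construction-time analysis
that implies; the Field and FieldType types are illustrative only, not the
SDK's:

package schemainfer

import (
	"fmt"
	"reflect"
)

// FieldType describes a field's type; Fields is populated for nested structs.
type FieldType struct {
	Kind   string
	Fields []Field
}

// Field pairs a name with a type, mirroring the (fieldname, Type) grammar
// discussed later in this thread.
type Field struct {
	Name string
	Type FieldType
}

// inferSchema walks a struct type's exported fields, recursing into nested
// structs, much like the recursive type analysis the SDK already performs
// at pipeline construction time.
func inferSchema(t reflect.Type) ([]Field, error) {
	if t.Kind() != reflect.Struct {
		return nil, fmt.Errorf("cannot infer a schema for non-struct type %v", t)
	}
	var fields []Field
	for i := 0; i < t.NumField(); i++ {
		sf := t.Field(i)
		if sf.PkgPath != "" {
			continue // unexported fields aren't part of the schema
		}
		ft := FieldType{Kind: sf.Type.Kind().String()}
		if sf.Type.Kind() == reflect.Struct {
			nested, err := inferSchema(sf.Type)
			if err != nil {
				return nil, err
			}
			ft.Fields = nested
		}
		fields = append(fields, Field{Name: sf.Name, Type: ft})
	}
	return fields, nil
}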

Does that sound right?

On Fri, 4 Jan 2019 at 10:05 Kenneth Knowles  wrote:

> On Thu, Jan 3, 2019 at 4:33 PM Reuven Lax  wrote:
>
>> If a user wants custom encoding for a primitive type, they can create a
>> byte-array field and wrap that field with a Coder
>>
>
> This is the crux of the issue, right?
>
> Roughly, today, we've got:
>
> Schema ::= [ (fieldname, Type) ]
>
> Type ::= AtomicType | Array | Map | Struct
>
> AtomicType ::= bytes | int{16, 32, 64} | datetime | string | ...
>
> To fully replace custom encodings as they exist, you need:
>
> AtomicType ::= bytes | ...
>
> At this point, an SDK need not surface the concept of "Coder" to a user at
> all outside the bytes field concept, and the wire encoding and efficiency
> should be identical, or nearly so, to what we do with coders today. PCollections
> in such an SDK have schemas, not coders, so we have successfully turned it
> completely inside-out relative to how the Java SDK does it. Is that what
> you have in mind?
>
> I really like this, but I agree with Robert that this is a major change
> that takes a bunch of work and a lot more collaborative thinking in design
> docs if we hope to get it right/stable.
>
> Kenn
>
>
>> (this is why I said that today's Coders are simply special cases); this
>> should be very rare though, as users rarely should care how Beam encodes a
>> long or a double.
>>
>>>
>>> Offhand, Schemas seem to be an alternative to pipeline construction,
>>> rather than coders for value serialization, allowing manual field
>>> extraction code to be omitted. They do not appear to be a fundamental
>>> approach to achieve it. For example, the grouping operation still needs to
>>> encode the whole of the object as a value.
>>>
>>
>> Schemas are properties of the data - essentially a Schema is the data
>> type of a PCollection. In Java Schemas are also understood by ParDo, so you
>> can write a ParDo like this:
>>
>> @ProcessElement
>> public void process(@Field("user") String userId,  @Field("country")
>> String countryCode) {
>> }
>>
>> These extra functionalities are part of the graph, but they are enabled
>> by schemas.
>>
>>>
>>> As mentioned, I'm hoping to have a solution for existing coders by
>>> January's end, so waiting for your documentation doesn't work on that
>>> timeline.
>>>
>>
>> I don't think we need to wait for all the documentation to be written.
>>
>>
>>>
>>> That said, they aren't incompatible ideas as demonstrated by the Java
>>> implementation. The Go SDK remains in an experimental state. We can change
>>> things should the need arise in the next few months. Further, whenever
>>> Generics in Go crop up, the existing user surface and execution stack will need to be
>>> re-written to take advantage of them anyway. That provides an opportunity
>>> to invert Coder vs Schema dependence while getting a nice performance
>>> boost, and cleaner code (and deleting much of my code generator).
>>>
>>> 
>>>
>>> Were I to implement schemas to get the same syntactic benefits as the
>>> Java API, I'd be leveraging the field annotations Go has. This satisfies
>>> the protocol 

Re: [Go SDK] User Defined Coders

2019-01-04 Thread Reuven Lax
Maybe a good first step would be to write a doc explaining how this would
work in the Go SDK and share with the dev list. It's possible we will
decide to just implement Coders first, however that way this will be done
with everyone fully understanding the design tradeoffs.

Reuven

On Fri, Jan 4, 2019 at 7:05 PM Kenneth Knowles  wrote:

> On Thu, Jan 3, 2019 at 4:33 PM Reuven Lax  wrote:
>
>> If a user wants custom encoding for a primitive type, they can create a
>> byte-array field and wrap that field with a Coder
>>
>
> This is the crux of the issue, right?
>
> Roughly, today, we've got:
>
> Schema ::= [ (fieldname, Type) ]
>
> Type ::= AtomicType | Array | Map | Struct
>
> AtomicType ::= bytes | int{16, 32, 64} | datetime | string | ...
>
> To fully replace custom encodings as they exist, you need:
>
> AtomicType ::= bytes | ...
>
> At this point, an SDK need not surface the concept of "Coder" to a user at
> all outside the bytes field concept, and the wire encoding and efficiency
> should be identical, or nearly so, to what we do with coders today. PCollections
> in such an SDK have schemas, not coders, so we have successfully turned it
> completely inside-out relative to how the Java SDK does it. Is that what
> you have in mind?
>
> I really like this, but I agree with Robert that this is a major change
> that takes a bunch of work and a lot more collaborative thinking in design
> docs if we hope to get it right/stable.
>
> Kenn
>
>
>> (this is why I said that today's Coders are simply special cases); this
>> should be very rare though, as users rarely should care how Beam encodes a
>> long or a double.
>>
>>>
>>> Offhand, Schemas seem to be an alternative to pipeline construction,
>>> rather than coders for value serialization, allowing manual field
>>> extraction code to be omitted. They do not appear to be a fundamental
>>> approach to achieve it. For example, the grouping operation still needs to
>>> encode the whole of the object as a value.
>>>
>>
>> Schemas are properties of the data - essentially a Schema is the data
>> type of a PCollection. In Java Schemas are also understood by ParDo, so you
>> can write a ParDo like this:
>>
>> @ProcessElement
>> public void process(@Field("user") String userId,  @Field("country")
>> String countryCode) {
>> }
>>
>> These extra functionalities are part of the graph, but they are enabled
>> by schemas.
>>
>>>
>>> As mentioned, I'm hoping to have a solution for existing coders by
>>> January's end, so waiting for your documentation doesn't work on that
>>> timeline.
>>>
>>
>> I don't think we need to wait for all the documentation to be written.
>>
>>
>>>
>>> That said, they aren't incompatible ideas as demonstrated by the Java
>>> implementation. The Go SDK remains in an experimental state. We can change
>>> things should the need arise in the next few months. Further, whenever
>>> Generics in Go crop up, the existing user surface and execution stack will need to be
>>> re-written to take advantage of them anyway. That provides an opportunity
>>> to invert Coder vs Schema dependence while getting a nice performance
>>> boost, and cleaner code (and deleting much of my code generator).
>>>
>>> 
>>>
>>> Were I to implement schemas to get the same syntactic benefits as the
>>> Java API, I'd be leveraging the field annotations Go has. This satisfies
>>> the protocol buffer issue as well, since generated go protos have name &
>>> json annotations. Schemas could be extracted that way. These are also
>>> available to anything using static analysis for more direct generation of
>>> accessors. The reflective approach would also work, which is excellent for
>>> development purposes.
>>>
>>> The rote code that the schemas were replacing would be able to be
>>> cobbled together into efficient DoFn and CombineFns for serialization. At
>>> present, it seems like it could be implemented as a side package that uses
>>> beam, rather than changing portions of the core Beam Go packages. The real
>>> trick would be to do so without "apply" since that's not how the Go SDK is
>>> shaped.
>>>
>>>
>>>
>>>
>>> On Thu, 3 Jan 2019 at 15:34 Gleb Kanterov  wrote:
>>>
 Reuven, it sounds great. I see there is something similar to Row coders
 happening in Apache Arrow, and there is a similarity between Apache Arrow
 Flight and the data exchange service in portability. How do you see these
 two things relating to each other in the long term?

 On Fri, Jan 4, 2019 at 12:13 AM Reuven Lax  wrote:

> The biggest advantage is actually readability and usability. A
> secondary advantage is that it means that Go will be able to interact
> seamlessly with BeamSQL, which would be a big win for Go.
>
> A schema is 

Re: [Go SDK] User Defined Coders

2019-01-04 Thread Kenneth Knowles
On Thu, Jan 3, 2019 at 4:33 PM Reuven Lax  wrote:

> If a user wants custom encoding for a primitive type, they can create a
> byte-array field and wrap that field with a Coder
>

This is the crux of the issue, right?

Roughly, today, we've got:

Schema ::= [ (fieldname, Type) ]

Type ::= AtomicType | Array | Map | Struct

AtomicType ::= bytes | int{16, 32, 64} | datetime | string | ...

To fully replace custom encodings as they exist, you need:

AtomicType ::= bytes | ...

At this point, an SDK need not surface the concept of "Coder" to a user at
all outside the bytes field concept, and the wire encoding and efficiency
should be identical, or nearly so, to what we do with coders today. PCollections
in such an SDK have schemas, not coders, so we have successfully turned it
completely inside-out relative to how the Java SDK does it. Is that what
you have in mind?

I really like this, but I agree with Robert that this is a major change
that takes a bunch of work and a lot more collaborative thinking in design
docs if we hope to get it right/stable.

Kenn


> (this is why I said that today's Coders are simply special cases); this
> should be very rare though, as users rarely should care how Beam encodes a
> long or a double.
>
>>
>> Offhand, Schemas seem to be an alternative to pipeline construction,
>> rather than coders for value serialization, allowing manual field
>> extraction code to be omitted. They do not appear to be a fundamental
>> approach to achieve it. For example, the grouping operation still needs to
>> encode the whole of the object as a value.
>>
>
> Schemas are properties of the data - essentially a Schema is the data type
> of a PCollection. In Java Schemas are also understood by ParDo, so you can
> write a ParDo like this:
>
> @ProcessElement
> public void process(@Field("user") String userId,  @Field("country")
> String countryCode) {
> }
>
> These extra functionalities are part of the graph, but they are enabled by
> schemas.
>
>>
>> As mentioned, I'm hoping to have a solution for existing coders by
>> January's end, so waiting for your documentation doesn't work on that
>> timeline.
>>
>
> I don't think we need to wait for all the documentation to be written.
>
>
>>
>> That said, they aren't incompatible ideas as demonstrated by the Java
>> implementation. The Go SDK remains in an experimental state. We can change
>> things should the need arise in the next few months. Further, whenever
>> Generics in Go crop up, the existing user surface and execution stack will need to be
>> re-written to take advantage of them anyway. That provides an opportunity
>> to invert Coder vs Schema dependence while getting a nice performance
>> boost, and cleaner code (and deleting much of my code generator).
>>
>> 
>>
>> Were I to implement schemas to get the same syntactic benefits as the Java
>> API, I'd be leveraging the field annotations Go has. This satisfies the
>> protocol buffer issue as well, since generated go protos have name & json
>> annotations. Schemas could be extracted that way. These are also available
>> to anything using static analysis for more direct generation of accessors.
>> The reflective approach would also work, which is excellent for development
>> purposes.
>>
>> The rote code that the schemas were replacing would be able to be cobbled
>> together into efficient DoFn and CombineFns for serialization. At present,
>> it seems like it could be implemented as a side package that uses beam,
>> rather than changing portions of the core Beam Go packages. The real trick
>> would be to do so without "apply" since that's not how the Go SDK is shaped.
>>
>>
>>
>>
>> On Thu, 3 Jan 2019 at 15:34 Gleb Kanterov  wrote:
>>
>>> Reuven, it sounds great. I see there is something similar to Row coders
>>> happening in Apache Arrow, and there is a similarity between Apache Arrow
>>> Flight and the data exchange service in portability. How do you see these
>>> two things relating to each other in the long term?
>>>
>>> On Fri, Jan 4, 2019 at 12:13 AM Reuven Lax  wrote:
>>>
 The biggest advantage is actually readability and usability. A
 secondary advantage is that it means that Go will be able to interact
 seamlessly with BeamSQL, which would be a big win for Go.

 A schema is basically a way of saying that a record has a specific set
 of (possibly nested, possibly repeated) fields. So for instance let's say
 that the user's type is a struct with fields named user, country,
 purchaseCost. This allows us to provide transforms that operate on field
 names. Some example (using the Java API):

 PCollection<Row> users = events.apply(Select.fields("user")); // Select out
 only the user field.

 PCollection joinedEvents =
 

Re: excessive java precommit logging

2019-01-04 Thread Udi Meiri
To follow up, I did some research yesterday on removing --info and my
findings are:
- Gradle Test tasks generate HTML and JUnit XML reports. Both contain a
stacktrace, the STDOUT, and the STDERR of the failed test (example).
So even if --info isn't specified, the output is not lost.
- Python SDK tests don't use Test tasks (they exec tox), and thus are not
affected by --info. Python tests aren't excessively verbose, however.
- Go tests should also generate reports (via gogradle), but I haven't found
any and I can't seem to run ./gradlew :beam-sdks-go:test on my workstation.

Suggestions:
- Remove --info (https://github.com/apache/beam/pull/7409)
- If we find Gradle tasks that aren't reporting or logging to the console
on failure, that's a bug and the task should be fixed.

On Thu, Dec 20, 2018 at 6:09 AM Kenneth Knowles  wrote:

> I support lowering the log level. The default is `lifecycle`.
>
> For reference, here's where it was increased to `info`:
> https://github.com/apache/beam/pull/4644
>
> The reason was to see more details about Gradle's dependency management.
> We were seeing dependency download flakes on things that should not require
> re-downloading. No longer an issue.
>
> To easily tweak it on a one-off basis without having to change a Jenkins
> job, you can edit gradle.properties in a commit on your PR:
>
> org.gradle.logging.level=(quiet,warn,lifecycle,info,debug)
> org.gradle.warning.mode=(all,none,summary)
> org.gradle.console=(auto,plain,rich,verbose)
> org.gradle.caching.debug=(true,false)
>
> Kenn
>
> On Thu, Dec 20, 2018 at 6:49 AM Robert Bradshaw 
> wrote:
>
>> Interestingly, I was thinking exactly the same thing the other day.
>>
>> If we could drop the info logs for passing tests, that would be ideal.
>> Regardless, tests should fail (when possible) with actionable
>> messages. In the rare case where an error can't be reproduced locally
>> and info logs are needed, I think it's OK to go and add logging to
>> Jenkins as a one-off. (If it's about Jenkins build errors,
>> perhaps we could build with higher verbosity before testing with a
>> lower one.)
>> On Thu, Dec 20, 2018 at 11:24 AM Maximilian Michels 
>> wrote:
>> >
>> > Thanks Udi for bringing this up. I'm also for dropping INFO. It's just
>> > incredibly verbose. More importantly, in my experience the INFO level
>> > doesn't help with debugging problems, but it makes finding error
>> > messages or warnings harder.
>> >
>> > That said, here's what I do to search through the log:
>> >
>> > 1) curl /consoleText | less
>> >
>> > This is when I just want to quickly look for something.
>> >
>> > 2) curl /consoleText > log.txt
>> > less log.txt
>> >
>> > Here we store the log to a file first, then use 'less' or 'grep' to
>> > search it.
>> >
>> > When in 'less', I use '/' to grep through the lines. Pressing 'n' or
>> > 'N' gets you forward and back in the search results.
>> >
>> > That works pretty well, but I think we would do ourselves a favor by
>> > dropping the log level. Shall we try it out?
>> >
>> > -Max
>> >
>> > On 19.12.18 23:27, Udi Meiri wrote:
>> > > The Gradle scan doesn't pinpoint the error message, and it doesn't
>> > > contain all the lines: https://scans.gradle.com/s/ckhjrjdexpuzm/console-log
>> > >
>> > > The logs might be useful, but usually not from passing tests. Doesn't
>> > > Gradle log output from failed tests by default?
>> > >
>> > > On Wed, Dec 19, 2018 at 1:22 PM Thomas Weise wrote:
>> > >
>> > > I usually follow the download procedure outlined by Scott to look
>> > > at the logs.
>> > >
>> > > These logs are big, but when there is a problem it is sometimes
>> > > essential to have the extra output, especially for less frequent flakes.
>> > >
>> > > Reducing logs would then require the author to add extra logging to
>> > > the PR (and attempt to reproduce), which is also not nice.
>> > >
>> > > Thomas
>> > >
>> > >
>> > > On Wed, Dec 19, 2018 at 11:47 AM Scott Wegner wrote:
>> > >
>> > > I'm not sure what we lose by dropping the --info flag, but I
>> > > generally worry about reducing log output, since logs are the main
>> > > resource for diagnosing Jenkins build errors.
>> > >
>> > >
>> > > It seems the issue is that Chrome doesn't scale well to large log
>> > > files. A few alternative solutions:
>> > >
>> > > 1. Use the produced Build Scan (example: [1]) instead of the raw
>> > > console log. The build scan is quite useful at pointing to what
>> > > actually failed, and filtering log output for only that task.
>> > > 2. Instead of consoleFull, use consoleText ("View as plain text" link
>> > > in Jenkins), which seems to 

Re: [Go SDK] User Defined Coders

2019-01-04 Thread Robert Burke
That's an interesting idea. I must confess I don't rightly know the
difference between a schema and a coder, but here's what I've got after a
bit of searching through memory and the mailing list. Please let me know if
I'm off track.

As near as I can tell, a schema, as far as Beam takes it, is a mechanism to
define what data is extracted from a given row of data. So in principle,
there's an opportunity to be more efficient with data that has many columns
that aren't being used, by only extracting the data that's meaningful to
the pipeline.
The trick then is how to apply the schema to a given serialization format,
which is something I'm missing in my mental model (and then how to do it
efficiently in Go).

I do know that the Go client package for BigQuery does something like that,
using field tags. Similarly, the "encoding/json" package in the Go standard
library permits annotating fields, and it will read out and deserialize
only the JSON fields named by those annotations.
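For example (standard library only; the Event type and payload are made up):

package main

import (
	"encoding/json"
	"fmt"
)

// Event declares only the fields this pipeline cares about; the json
// field tags play the role a schema's field names would.
type Event struct {
	User    string  `json:"user"`
	Country string  `json:"country"`
	Cost    float64 `json:"purchase_cost"`
}

func main() {
	// The raw row carries more columns than the struct extracts;
	// json.Unmarshal silently skips the rest.
	raw := []byte(`{"user":"u1","country":"DE","purchase_cost":9.5,"unused":"x"}`)
	var e Event
	if err := json.Unmarshal(raw, &e); err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", e) // {User:u1 Country:DE Cost:9.5}
}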

A concern I have is that Go (at present) would require pre-compile-time
code generation for schemas to be efficient, and they would still mostly
boil down to turning []bytes into real structs; Go reflection doesn't keep
up. Go has no mechanism I'm aware of to just-in-time compile more efficient
processing of values.
It's also not 100% clear how Schemas would play with protocol buffers or
similar.
BigQuery has a mechanism for generating a JSON schema from a proto file,
but that's only the specification half, not the usage half.

As it stands, the code generator I've been building these last months could
(in principle) statically analyze a user's struct, and then generate an
efficient dedicated coder for it. It just has nowhere to put those coders
such that the Go SDK would use them.
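As a sketch, the kind of code such a generator might emit for a hypothetical
Purchase struct (the encoding layout is an assumption for illustration):

package gencoder

import (
	"encoding/binary"
	"fmt"
	"math"
)

// Purchase is a hypothetical user struct the generator analyzed.
type Purchase struct {
	User string
	Cost float64
}

// encodePurchase is the rote, reflection-free code a generator could
// emit: a length-prefixed string followed by a fixed-width float.
func encodePurchase(p Purchase) []byte {
	buf := make([]byte, 4+len(p.User)+8)
	binary.BigEndian.PutUint32(buf[:4], uint32(len(p.User)))
	copy(buf[4:], p.User)
	binary.BigEndian.PutUint64(buf[4+len(p.User):], math.Float64bits(p.Cost))
	return buf
}

// decodePurchase is the matching decoder, validating lengths as it goes.
func decodePurchase(data []byte) (Purchase, error) {
	if len(data) < 4 {
		return Purchase{}, fmt.Errorf("buffer too short for length prefix")
	}
	n := int(binary.BigEndian.Uint32(data[:4]))
	if len(data) < 4+n+8 {
		return Purchase{}, fmt.Errorf("buffer too short for payload")
	}
	user := string(data[4 : 4+n])
	cost := math.Float64frombits(binary.BigEndian.Uint64(data[4+n:]))
	return Purchase{User: user, Cost: cost}, nil
}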


On Thu, Jan 3, 2019 at 1:39 PM Reuven Lax  wrote:

> I'll make a different suggestion. There's been some chatter that schemas
> are a better tool than coders, and that in Beam 3.0 we should make schemas
> the basic semantics instead of coders. Schemas provide everything a coder
> provides, but also allows for far more readable code. We can't make such a
> change in Beam Java 2.X for compatibility reasons, but maybe in Go we're
> better off starting with schemas instead of coders?
>
> Reuven
>
> On Thu, Jan 3, 2019 at 8:45 PM Robert Burke  wrote:
>
>> One area that the Go SDK currently lacks is the ability for users to
>> specify their own coders for types.
>>
>> I've written a proposal document, and while I'm confident about the core,
>> there are certainly some edge cases that require discussion before getting
>> on with the implementation.
>>
>> At present, the SDK only permits primitive value types (all numeric
>> types but complex, plus strings and []bytes), which are coded with Beam
>> coders, and structs whose exported fields are of those types, which are
>> then encoded as JSON. Protocol buffer support is hacked in to avoid the
>> type analyzer, and represents the current workaround for this issue.
>>
>> The high-level proposal is to catch up with Python and Java and have a
>> coder registry. In addition, arrays and maps should be permitted as well.
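A hedged illustration of the shape such a registry might take (none of
these names are from the proposal doc):

package coderregistry

import (
	"fmt"
	"reflect"
	"sync"
)

// Coder pairs user-supplied encode and decode functions for one type.
type Coder struct {
	Enc func(interface{}) ([]byte, error)
	Dec func([]byte) (interface{}, error)
}

// Registry maps concrete types to Coders; the SDK would consult it
// before falling back to its built-in defaults.
type Registry struct {
	mu     sync.RWMutex
	coders map[reflect.Type]Coder
}

func NewRegistry() *Registry {
	return &Registry{coders: make(map[reflect.Type]Coder)}
}

// Register installs a Coder for t, rejecting duplicate registrations.
func (r *Registry) Register(t reflect.Type, c Coder) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	if _, dup := r.coders[t]; dup {
		return fmt.Errorf("coder already registered for %v", t)
	}
	r.coders[t] = c
	return nil
}

// Lookup returns the Coder registered for t, if any.
func (r *Registry) Lookup(t reflect.Type) (Coder, bool) {
	r.mu.RLock()
	defer r.mu.RUnlock()
	c, ok := r.coders[t]
	return c, ok
}

Registration would typically happen in an init() function, keyed by
reflect.TypeOf(MyType{}).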
>>
>> If you have alternatives, or other suggestions and opinions, I'd love to
>> hear them! Otherwise my intent is to get a PR ready by the end of January.
>>
>> Thanks!
>> Robert Burke
>>
>

-- 
http://go/where-is-rebo