Re: Meet up at Strata+Hadoop World in Singapore

2016-11-29 Thread Aljoscha Krettek
Hi, I'll also be there to give a talk (and also at the Beam tutorial). Cheers, Aljoscha On Wed, Nov 30, 2016, 00:51 Dan Halperin wrote: > Hey folks, > > Who will be attending Strata+Hadoop World next week in Singapore? Tyler and > I will be there, giving a Beam tutorial [0] and some talks [2,3]

Re: PCollection to PCollection Conversion

2016-11-29 Thread Jean-Baptiste Onofré
Hi Jesse, yes, I started something there (using JAXB and Jackson). Let me polish and push. Regards JB On 11/29/2016 10:00 PM, Jesse Anderson wrote: I went through the string conversions. Do you have an example of writing out XML/JSON/etc too? On Tue, Nov 29, 2016 at 3:46 PM Jean-Baptiste On
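
A minimal sketch of what a Jackson-based JSON converter fn could look like; the class name and details are illustrative and not taken from JB's branch:

    import com.fasterxml.jackson.databind.ObjectMapper;
    import org.apache.beam.sdk.transforms.DoFn;

    /** Serializes each element to a JSON string using Jackson. */
    public class ToJsonFn<T> extends DoFn<T, String> {
      private transient ObjectMapper mapper;

      @Setup
      public void setup() {
        mapper = new ObjectMapper();
      }

      @ProcessElement
      public void processElement(ProcessContext c) throws Exception {
        c.output(mapper.writeValueAsString(c.element()));
      }
    }

It would be applied with ParDo.of(new ToJsonFn<MyType>()), where MyType stands for any Jackson-serializable element type.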

Meet up at Strata+Hadoop World in Singapore

2016-11-29 Thread Dan Halperin
Hey folks, Who will be attending Strata+Hadoop World next week in Singapore? Tyler and I will be there, giving a Beam tutorial [0] and some talks [2,3]. I'd love to sync in person with anyone who wants to talk Beam. Please reach out to me directly if you'd like to meet. Thanks! Dan [0] http://c

Re: PCollection to PCollection Conversion

2016-11-29 Thread Jesse Anderson
I went through the string conversions. Do you have an example of writing out XML/JSON/etc too? On Tue, Nov 29, 2016 at 3:46 PM Jean-Baptiste Onofré wrote: > Hi Jesse, > > > https://github.com/jbonofre/incubator-beam/tree/DATAFORMAT/sdks/java/extensions/dataformat > > it's very simple and stupid

Re: PCollection to PCollection Conversion

2016-11-29 Thread Jean-Baptiste Onofré
Hi Jesse, https://github.com/jbonofre/incubator-beam/tree/DATAFORMAT/sdks/java/extensions/dataformat it's very simple and stupid and of course not complete at all (I have other commits, not yet merged, as they need some polishing), but as I said, it's a basis for discussion. Regards JB On 11/29

Re: PCollection to PCollection Conversion

2016-11-29 Thread Jesse Anderson
@jb Sounds good. Just let us know once you've pushed. On Tue, Nov 29, 2016 at 2:54 PM Jean-Baptiste Onofré wrote: > Good point Eugene. > > Right now, it's a DoFn collection to experiment a bit (a pure > extension). It's pretty stupid ;) > > But, you are right, depending the direction of such ext

Re: PCollection to PCollection Conversion

2016-11-29 Thread Jean-Baptiste Onofré
Good point Eugene. Right now, it's a DoFn collection to experiment a bit (a pure extension). It's pretty stupid ;) But, you are right, depending on the direction of such an extension, it could cover more use cases (even if it's not my first intention ;)). Let me push the branch (pretty small) as

Re: PCollection to PCollection Conversion

2016-11-29 Thread Eugene Kirpichov
Hi JB, Depending on the scope of what you want to ultimately accomplish with this extension, I think it may make sense to write a proposal document and discuss it. If it's just a collection of utility DoFn's for various well-defined source/target format pairs, then that's probably not needed, but i

Re: PCollection to PCollection Conversion

2016-11-29 Thread Jean-Baptiste Onofré
By the way Jesse, I'm going to push my DATAFORMAT branch to my GitHub and I will post on the dev mailing list when done. Regards JB On 11/29/2016 07:01 PM, Jesse Anderson wrote: I want to bring this thread back up since we've had time to think about it more and make a plan. I think a format-specif

Re: [PROPOSAL] "IOChannelFactory" Redesign and Make it Configurable

2016-11-29 Thread Jean-Baptiste Onofré
Hi Pei, rethinking about that, I understand that the purpose of the Beam filesystem is to avoid bringing a bunch of dependencies into the core. That makes perfect sense. So, I agree that a Beam filesystem abstraction is fine. My point is that we should provide a HadoopFilesystem extension/plugi

Re: PCollection to PCollection Conversion

2016-11-29 Thread Jean-Baptiste Onofré
Hi Jesse, actually, I started an extension as a PoC: sdks/java/extensions/dataformat The purpose is to have a base for discussion and show "actual" use cases. The dataformat extension contains ready-to-use converter fns. Right now, I implemented (bi-directional): 1. JmsRecord / Avro IndexedRec
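
One possible shape for such ready-to-use converter fns is a generic DoFn parameterized by a conversion function; this is only a sketch, not code from the branch:

    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.SerializableFunction;

    /** Applies a user-supplied conversion to every element of a PCollection. */
    public class ConvertFn<InputT, OutputT> extends DoFn<InputT, OutputT> {
      private final SerializableFunction<InputT, OutputT> converter;

      public ConvertFn(SerializableFunction<InputT, OutputT> converter) {
        this.converter = converter;
      }

      @ProcessElement
      public void processElement(ProcessContext c) {
        c.output(converter.apply(c.element()));
      }
    }

Concrete format pairs (JmsRecord to Avro IndexedRecord, etc.) would then be packaged as prebuilt conversion functions on top of this shape.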

Re: How to create a Pipeline with Cycles

2016-11-29 Thread Shen LI
Hi Maria, Bobby, Thanks for the explanation. Regards, Shen On Tue, Nov 29, 2016 at 12:37 PM, Bobby Evans wrote: > In my experience almost all of the time cycles are bad and cause a lot of > debugging problems. Most of the time you can implement what you want by > using a windowed join or grou

Re: PCollection to PCollection Conversion

2016-11-29 Thread Jesse Anderson
I want to bring this thread back up since we've had time to think about it more and make a plan. I think a format-specific converter will be a more time-consuming task than we originally thought. It'd have to be a writer that takes another writer as a parameter. I think a string converter can be do
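
For the string case, a sketch of the kind of trivial converter Jesse describes could be as small as this (the class name is illustrative):

    import org.apache.beam.sdk.transforms.DoFn;

    /** Formats every element as a String via toString(). */
    public class ToStringFn<T> extends DoFn<T, String> {
      @ProcessElement
      public void processElement(ProcessContext c) {
        c.output(String.valueOf(c.element()));
      }
    }

Applied with ParDo.of(new ToStringFn<MyType>()) it turns any PCollection<MyType> into a PCollection<String>; the format-specific writers discussed above would need considerably more machinery.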

Re: How to create a Pipeline with Cycles

2016-11-29 Thread Bobby Evans
In my experience, cycles are bad almost all of the time and cause a lot of debugging problems. Most of the time you can implement what you want by using a windowed join or group by instead. - Bobby On Tuesday, November 29, 2016, 11:06:44 AM CST, María García Herrero wrote: Hi Shen, No. Beam pi
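
As an illustration of the windowed alternative Bobby mentions, a minimal sketch (the key and value types are hypothetical) that counts events per key over fixed windows rather than feeding results back upstream:

    import org.apache.beam.sdk.transforms.Count;
    import org.apache.beam.sdk.transforms.windowing.FixedWindows;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;
    import org.joda.time.Duration;

    public class WindowedCounts {
      /** Counts events per key over 10-second fixed windows. */
      public static PCollection<KV<String, Long>> countPerWindow(
          PCollection<KV<String, Long>> events) {
        return events
            .apply(Window.<KV<String, Long>>into(
                FixedWindows.of(Duration.standardSeconds(10))))
            .apply(Count.<String, Long>perKey());
      }
    }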

Re: Flink runner. Optimization for sideOutput with tags

2016-11-29 Thread Aljoscha Krettek
Hi Alexey, I think it should be possible to optimise this particular transformation by using a split/select pattern in Flink. (See split and select here: https://ci.apache.org/projects/flink/flink-docs-release-1.2/dev/datastream_api.html#datastream-transformations). The current implementation is no
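
For reference, the split/select pattern Aljoscha refers to looks roughly like this in the Flink DataStream API; the tag names and element type here are made up for illustration:

    import java.util.Collections;
    import org.apache.flink.streaming.api.collector.selector.OutputSelector;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.datastream.SplitStream;

    public class TaggedSplit {
      /** Routes each element to a named output, similar to Beam's sideOutput tags. */
      public static SplitStream<String> splitByTag(DataStream<String> input) {
        return input.split(new OutputSelector<String>() {
          @Override
          public Iterable<String> select(String value) {
            return Collections.singletonList(
                value.startsWith("error") ? "errors" : "main");
          }
        });
      }
    }

Downstream operators then pick a branch with splitByTag(stream).select("errors").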

Re: UnboundedSource backlog num events

2016-11-29 Thread Dan Halperin
Hi Aviem, Another good question. There's no strong reason not to have Count in addition to Bytes. Practically, in the Dataflow runner we found bytes to be the best signal here. I won't go deeply into why, but two intuitions: * Beam is designed to enable runners to minimize the per-element overhe

Re: Question regarding UnboundedReader#getTotalBacklogBytes

2016-11-29 Thread Dan Halperin
Hi Aviem, A great question! The two backlog methods (getSplitBacklogBytes() and getTotalBacklogBytes
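
For context, both methods live on UnboundedSource.UnboundedReader and can be overridden along these lines; only the backlog-reporting part is shown, the reader's other required methods are omitted, and the helper name is hypothetical:

    import org.apache.beam.sdk.io.UnboundedSource;

    /** Sketch of the backlog-reporting part of an unbounded reader. */
    public abstract class BacklogAwareReader
        extends UnboundedSource.UnboundedReader<String> {
      /** Hypothetical helper returning the unread bytes behind this split's current offset. */
      protected abstract long unreadBytesForThisSplit();

      @Override
      public long getSplitBacklogBytes() {
        // Backlog for the split owned by this reader only.
        return unreadBytesForThisSplit();
      }

      @Override
      public long getTotalBacklogBytes() {
        // Backlog across all splits; BACKLOG_UNKNOWN when it cannot be computed.
        return BACKLOG_UNKNOWN;
      }
    }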

Re: How to create a Pipeline with Cycles

2016-11-29 Thread María García Herrero
Hi Shen, No. Beam pipelines are DAGs: http://beam.incubator.apache.org/documentation/sdks/javadoc/0.3.0-incubating/org/apache/beam/sdk/Pipeline.html Best, María On Tue, Nov 29, 2016 at 7:44 AM, Shen LI wrote: > Hi, > > Can I use Beam to create a pipeline with cycles? For example, to implement

How to create a Pipeline with Cycles

2016-11-29 Thread Shen LI
Hi, Can I use Beam to create a pipeline with cycles? For example, to implement the Yahoo! Streaming benchmark (https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at), can a ParDo transform consume a downstream output as a side input? Thanks, Shen

Question regarding UnboundedReader#getTotalBacklogBytes

2016-11-29 Thread Aviem Zur
UnboundedReader exposes the method getTotalBacklogBytes(), which promises: * Returns the size of the backlog of unread data in the underlying data source represented by split of this source. But there are no implementations of this method in any source; I'm assuming that this is because splits are inst

UnboundedSource backlog num events

2016-11-29 Thread Aviem Zur
Hi, Today UnboundedSource exposes split backlog in bytes via getSplitBacklogBytes(). I think there is much value in exposing backlog as a number of events as well, since that number can be easier for a human to comprehend than bytes. Something like getSplitBacklogEvents() or getSplitBacklogCount(). Thought