Re: Podling Report Reminder - May 2016

2016-05-26 Thread Jean-Baptiste Onofré

Hi James,

There's an issue in the e-mail.

The next podling report is for June:

https://wiki.apache.org/incubator/June2016

But we don't have to report as we already did it for May:

https://wiki.apache.org/incubator/May2016

So, no need for a report this month. As we reported last month, the next 
report is due for August.


Regards
JB

On 05/26/2016 06:49 PM, James Malone wrote:

Hi everyone,

I am happy to take on drafting this report. Would anyone object if I did?

Cheers!

James

On Wed, May 25, 2016 at 7:33 PM,  wrote:


Dear podling,

This email was sent by an automated system on behalf of the Apache
Incubator PMC. It is an initial reminder to give you plenty of time to
prepare your quarterly board report.

The board meeting is scheduled for Wed, 15 June 2016, 10:30 am PDT.
The report for your podling will form a part of the Incubator PMC
report. The Incubator PMC requires your report to be submitted 2 weeks
before the board meeting, to allow sufficient time for review and
submission (Wed, June 1st).

Please submit your report with sufficient time to allow the Incubator
PMC, and subsequently board members to review and digest. Again, the
very latest you should submit your report is 2 weeks prior to the board
meeting.

Thanks,

The Apache Incubator PMC

Submitting your Report
----------------------

Your report should contain the following:

*   Your project name
*   A brief description of your project, which assumes no knowledge of
 the project or necessarily of its field
*   A list of the three most important issues to address in the move
 towards graduation.
*   Any issues that the Incubator PMC or ASF Board might wish/need to be
 aware of
*   How has the community developed since the last report
*   How has the project developed since the last report.

This should be appended to the Incubator Wiki page at:

http://wiki.apache.org/incubator/May2016

Note: This is manually populated. You may need to wait a little before
this page is created from a template.

Mentors
---

Mentors should review reports for their project(s) and sign them off on
the Incubator wiki page. Signing off reports shows that you are
following the project - projects that are not signed may raise alarms
for the Incubator PMC.

Incubator PMC





--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: Add Sorting Class?

2016-05-26 Thread Jesse Anderson
Another perspective is to look at other projects in the Hadoop ecosystem.

Impala used to require a LIMIT any time you did an ORDER BY. They've since
removed this limitation.

Hive has two sorting options. ORDER BY does a global order. SORT BY orders
everything in that partition.
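
As a rough illustration of the difference between those two scopes, the sketch below models partitions as in-memory lists. The class and method names are made up for this example and none of it is Hive code; it only contrasts one total order with independent per-partition orders.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Illustrative only: models the two sort scopes on in-memory lists, not Hive itself. */
final class SortScopes {

    /** Global order: a single sorted sequence over all rows, as with ORDER BY. */
    static List<Integer> orderBy(List<List<Integer>> partitions) {
        List<Integer> all = new ArrayList<>();
        for (List<Integer> partition : partitions) {
            all.addAll(partition);
        }
        all.sort(Comparator.naturalOrder());
        return all;
    }

    /** Per-partition order: every partition is sorted on its own, as with SORT BY. */
    static List<List<Integer>> sortBy(List<List<Integer>> partitions) {
        List<List<Integer>> sorted = new ArrayList<>();
        for (List<Integer> partition : partitions) {
            List<Integer> copy = new ArrayList<>(partition);
            copy.sort(Comparator.naturalOrder());
            sorted.add(copy);
        }
        return sorted;
    }

    public static void main(String[] args) {
        List<List<Integer>> partitions =
                Arrays.asList(Arrays.asList(5, 1, 9), Arrays.asList(4, 8, 2));
        System.out.println(orderBy(partitions)); // [1, 2, 4, 5, 8, 9]
        System.out.println(sortBy(partitions));  // [[1, 5, 9], [2, 4, 8]]
    }
}
```

Note that concatenating the per-partition results gives 1, 5, 9, 2, 4, 8, which is not globally sorted.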

On Thu, May 26, 2016 at 12:35 PM Jesse Anderson 
wrote:

> I had a similar thought, but wasn't sure if that violated a tenet of Beam.
>
> I'm thinking an ordered sink could wrap around another sink. I could see
> something like:
> collection.apply(OrderedSink.Timestamp.write(TextIO.Write.to(...)));

Re: Add Sorting Class?

2016-05-26 Thread Jesse Anderson
I had a similar thought, but wasn't sure if that violated a tenet of Beam.

I'm thinking an ordered sink could wrap around another sink. I could see
something like:
collection.apply(OrderedSink.Timestamp.write(TextIO.Write.to(...)));
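
To make the wrapping idea a little more concrete, here is a minimal plain-Java sketch of a sink that decorates another sink. Everything in it (RecordSink, OrderedSinkSketch, ListSink, TimestampedLine) is hypothetical and invented for illustration; none of it is Beam API. It also buffers all records in memory before sorting, so it does not by itself solve the scaling problem discussed in this thread.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sink interface for this sketch; it is not a Beam API. */
interface RecordSink<T> {
    void write(T record);
    void close();
}

/** Wraps another sink: buffers every record, then forwards them in sorted order on close. */
final class OrderedSinkSketch<T> implements RecordSink<T> {
    private final RecordSink<T> delegate;
    private final Comparator<? super T> order;
    private final List<T> buffer = new ArrayList<>();

    OrderedSinkSketch(RecordSink<T> delegate, Comparator<? super T> order) {
        this.delegate = delegate;
        this.order = order;
    }

    @Override
    public void write(T record) {
        buffer.add(record); // assumption: all records fit in memory
    }

    @Override
    public void close() {
        buffer.sort(order);
        for (T record : buffer) {
            delegate.write(record);
        }
        buffer.clear();
        delegate.close();
    }
}

/** Stand-in for a real sink such as a text file writer; it only collects records. */
final class ListSink<T> implements RecordSink<T> {
    final List<T> written = new ArrayList<>();

    @Override
    public void write(T record) {
        written.add(record);
    }

    @Override
    public void close() {
        // nothing to release in this sketch
    }
}

/** A record that carries the timestamp to order by. */
final class TimestampedLine {
    final long timestampMillis;
    final String line;

    TimestampedLine(long timestampMillis, String line) {
        this.timestampMillis = timestampMillis;
        this.line = line;
    }
}

final class OrderedSinkDemo {
    public static void main(String[] args) {
        ListSink<TimestampedLine> inner = new ListSink<>();
        RecordSink<TimestampedLine> sink = new OrderedSinkSketch<>(
                inner, Comparator.comparingLong((TimestampedLine t) -> t.timestampMillis));
        sink.write(new TimestampedLine(30L, "third"));
        sink.write(new TimestampedLine(10L, "first"));
        sink.write(new TimestampedLine(20L, "second"));
        sink.close();
        for (TimestampedLine t : inner.written) {
            System.out.println(t.timestampMillis + " " + t.line);
        }
    }
}
```

The wrapped sink never sees records out of order, which is the property an ordered sink would promise to its consumers.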

On Thu, May 26, 2016 at 12:26 PM Robert Bradshaw
 wrote:

> As Frances alluded to, it's also really hard to reconcile the notion
> of a globally ordered PCollection in the context of a streaming
> pipeline. Sorting also imposes conditions on partitioning, which we
> intentionally leave unspecified for maximum flexibility in the
> runtime. One also gets into the question of whether particular
> operations are order-creating, order-preserving, or order-destroying
> and how much extra overhead is required to maintain these properties
> for intermediate collections.
>
> Your mention of sorting by time is interesting, as this is the
> inherent sort dimension in streaming (and we use features like
> windowing and triggering to do correct time-based grouping despite
> real-time skew). Other than that, all the uses of sorting I've seen
> have been limited to portions of data small enough to be produced by
> (and consumed by) a single machine (so tops GBs, not TBs or PBs).
>
> All that aside, I could see a more tractable case being made for
> ordering (partitioning, etc.) a particular materialization of a
> PCollection, i.e. being sorted would not be a property of a
> PCollection itself, but could be provided by a sink (e.g. one could
> have a sink that promises to write its records in a particular order
> within and across shards). It's not inconceivable that this could be
> done in a way that is composable with (a large class of) existing
> sinks, e.g. given a FileBasedSink and intra/inter-shard-sorting
> specifications, one could produce a bounded sink that writes "sorted"
> files. Lots of design work TBD...
>
> - Robert

Re: Add Sorting Class?

2016-05-26 Thread Robert Bradshaw
As Frances alluded to, it's also really hard to reconcile the notion
of a globally ordered PCollection in the context of a streaming
pipeline. Sorting also imposes conditions on partitioning, which we
intentionally leave unspecified for maximum flexibility in the
runtime. One also gets into the question of whether particular
operations are order-creating, order-preserving, or order-destroying
and how much extra overhead is required to maintain these properties
for intermediate collections.

Your mention of sorting by time is interesting, as this is the
inherent sort dimension in streaming (and we use features like
windowing and triggering to do correct time-based grouping despite
real-time skew). Other than that, all the uses of sorting I've seen
have been limited to portions of data small enough to be produced by
(and consumed by) a single machine (so tops GBs, not TBs or PBs).

All that aside, I could see a more tractable case being made for
ordering (partitioning, etc.) a particular materialization of a
PCollection, i.e. being sorted would not be a property of a
PCollection itself, but could be provided by a sink (e.g. one could
have a sink that promises to write its records in a particular order
within and across shards). It's not inconceivable that this could be
done in a way that is composable with (a large class of) existing
sinks, e.g. given a FileBasedSink and intra/inter-shard-sorting
specifications, one could produce a bounded sink that writes "sorted"
files. Lots of design work TBD...
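
One way to picture "sorted within and across shards" is range partitioning followed by a per-shard sort. The sketch below is plain Java with hypothetical names (ShardedSortSketch, rangePartition, sortedShards) and hand-picked split points; it is not a FileBasedSink and makes no claim about how Beam would implement this.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class ShardedSortSketch {

    /**
     * Routes each value to the shard whose range contains it. splitPoints must be
     * ascending; shard i receives values v with splitPoints[i-1] <= v < splitPoints[i].
     */
    static List<List<Integer>> rangePartition(List<Integer> values, int[] splitPoints) {
        List<List<Integer>> shards = new ArrayList<>();
        for (int i = 0; i <= splitPoints.length; i++) {
            shards.add(new ArrayList<>());
        }
        for (int value : values) {
            int shard = 0;
            while (shard < splitPoints.length && value >= splitPoints[shard]) {
                shard++;
            }
            shards.get(shard).add(value);
        }
        return shards;
    }

    /** Inter-shard order comes from the ranges, intra-shard order from the local sort. */
    static List<List<Integer>> sortedShards(List<Integer> values, int[] splitPoints) {
        List<List<Integer>> shards = rangePartition(values, splitPoints);
        for (List<Integer> shard : shards) {
            Collections.sort(shard);
        }
        return shards;
    }

    public static void main(String[] args) {
        List<Integer> values = Arrays.asList(42, 7, 19, 3, 25, 11);
        // Reading the shards in order yields a globally sorted sequence.
        System.out.println(sortedShards(values, new int[] {10, 20}));
        // [[3, 7], [11, 19], [25, 42]]
    }
}
```

The hard part that this sketch skips is choosing split points that keep the shards balanced, which is exactly where the partitioning concerns above come in.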

- Robert




On Thu, May 26, 2016 at 11:32 AM, Jesse Anderson  wrote:
> @frances great analysis. I'm hoping this serves as the starting point for
> the discussion.
>
> It really comes down to: is this a nice to have or a show stopping
> requirement? As you mention, it comes down to the use case. I've taught at
> large financial companies where (global) sorting was a real and show
> stopping use case. Theirs was for a large end of day report that had to be
> globally sorted and consumed by many other groups. Sorry, I can't be more
> specific.
>
> Thanks,
>
> Jesse

Re: Add Sorting Class?

2016-05-26 Thread Jesse Anderson
@frances great analysis. I'm hoping this serves as the starting point for
the discussion.

It really comes down to: is this a nice-to-have or a show-stopping
requirement? As you mention, it comes down to the use case. I've taught at
large financial companies where (global) sorting was a real, show-stopping
use case. Theirs was for a large end-of-day report that had to be
globally sorted and consumed by many other groups. Sorry, I can't be more
specific.

Thanks,

Jesse

On Thu, May 26, 2016 at 10:19 AM Frances Perry 
wrote:

> Currently the Beam model doesn't provide the functionality to do sorting,
> so this is a pretty deep feature request. Let's separate the discussion
> into value sorting and global sorting.
>
> For value sorting, you need to be able to specify some property of the
> value (often called a secondary key) and have the GroupByKey/shuffle
> implementation sort values for a given key by the secondary key. This is a
> pretty common use case, and I think exposing this in Beam would make a lot
> of sense. The Hadoop and Cloud Dataflow shuffle implementations support
> this, for example. So it may just be a matter of figuring out how best to
> expose it to users. In FlumeJava we had you explicitly ParDo to pair values
> with a string "sort key" so you'd GroupByKey on a
> PCollection<KV<Key, KV<String, Value>>> and get back the Values sorted lexicographically by
> String. It's a bit gross for users to think about a way to order things
> that sorts lexicographically. Looks like Crunch[1] uses a general sort key
> -- but that likely won't interact cleanly with Beam's use of encoded keys
> for comparisons. Would be nice to think about if there's a cleaner way.
>
> For global sorting, you need to be able to generate and maintain
> orderedness across the elements in a PCollection and have a way to know how
> to partition the PCollection into balanced, sorted subchunks. This would
> have a pretty large impact on the Beam model and potentially on many of the
> runners. Looking at the Crunch sort [1], it requires users to provide the
> partitioning function if they want it to scale beyond a single reduce. I'd
> love to see if there's a way to do better. It also can have a pretty big
> impact on the ability to efficiently parallelize execution (things like
> dynamic work rebalancing [2] become trickier). Within Google [3], we've
> found that this tends to be something that users ask for, but don't really
> have a strong use case for. It's usually the case that Top suffices or that
> they would rather redo the algorithm into something that can parallelize
> more efficiently without relying on a global sort. Though of course,
> without this, we can't actually do the TeraSort benchmark in Beam. ;-)
>
> And of course there's the impact of the unified model on all this ;-) I
> think these ideas would translate to windowed PCollections OK, but would
> want to think carefully about it.
>
> [1] https://crunch.apache.org/user-guide.html#sorting
> [2] https://cloud.google.com/blog/big-data/2016/05/no-shard-left-behind-dynamic-work-rebalancing-in-google-cloud-dataflow
> [3] https://cloud.google.com/blog/big-data/2016/02/history-of-massive-scale-sorting-experiments-at-google
>
>
> On Thu, May 26, 2016 at 8:56 AM, Jesse Anderson 
> wrote:
>
> > This is somewhat the continuation of my thread "Writing Out List."
> >
> > Right now, the only way to do sorting is with the Top class. This works
> > well, but has the constraint of fitting in memory.
> >
> > A common batch use case is to take a large file and sort it. For example,
> > this would be sorting a large report (several GB) file by timestamp. As of
> > right now, this isn't built into Beam. I think it should be.
> >
> > I'll hold out Crunch's Sort
> > <https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/lib/Sort.html>
> > class as an example of what this class could look like.
> >
> > Thanks,
> >
> > Jesse
> >
>
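
The value-sorting scheme quoted above, where each value is paired with a string sort key and comes back ordered lexicographically, can be sketched in plain Java as follows. The names SecondaryKeySketch, paddedSortKey and groupAndSortValues are hypothetical, and the grouping is done with ordinary collections rather than GroupByKey. The sketch also shows why lexicographic keys are awkward: the raw strings "9" and "10" sort the wrong way round unless the key is zero-padded.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

final class SecondaryKeySketch {

    /** Zero-pads a non-negative number so that string order matches numeric order. */
    static String paddedSortKey(long n) {
        if (n < 0) {
            throw new IllegalArgumentException("n must be non-negative");
        }
        return String.format(Locale.ROOT, "%019d", n);
    }

    /**
     * Each record is {key, sortKey, value}. Groups records by key and returns, for
     * each key, the values ordered lexicographically by their sort key.
     */
    static Map<String, List<String>> groupAndSortValues(List<String[]> records) {
        Map<String, List<String[]>> byKey = new TreeMap<>();
        for (String[] record : records) {
            byKey.computeIfAbsent(record[0], k -> new ArrayList<>()).add(record);
        }
        Map<String, List<String>> result = new TreeMap<>();
        for (Map.Entry<String, List<String[]>> entry : byKey.entrySet()) {
            List<String[]> rows = entry.getValue();
            rows.sort(Comparator.comparing((String[] row) -> row[1]));
            List<String> values = new ArrayList<>();
            for (String[] row : rows) {
                values.add(row[2]);
            }
            result.put(entry.getKey(), values);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String[]> raw = new ArrayList<>();
        raw.add(new String[] {"user", "9", "nine"});
        raw.add(new String[] {"user", "10", "ten"});
        System.out.println(groupAndSortValues(raw)); // {user=[ten, nine]}

        List<String[]> padded = new ArrayList<>();
        padded.add(new String[] {"user", paddedSortKey(9), "nine"});
        padded.add(new String[] {"user", paddedSortKey(10), "ten"});
        System.out.println(groupAndSortValues(padded)); // {user=[nine, ten]}
    }
}
```

Having to remember an encoding like this for every key type is the burden on users that the quoted message calls out.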


One more streaming engine in OSS

2016-05-26 Thread Seetharam Venkatesh
https://blog.twitter.com/2016/open-sourcing-twitter-heron

The more the merrier for Beam? :-)

Venkatesh


Add Sorting Class?

2016-05-26 Thread Jesse Anderson
This is somewhat the continuation of my thread "Writing Out List."

Right now, the only way to do sorting is with the Top class. This works
well, but has the constraint of fitting in memory.

A common batch use case is to take a large file and sort it. For example,
this would be sorting a large report file (several GB) by timestamp. As of
right now, this isn't built into Beam. I think it should be.
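
For context, the classic way to sort data that does not fit in memory is an external merge sort: sort bounded-size runs, then merge them. The plain-Java sketch below shows that shape with in-memory lists standing in for spilled run files. The names ExternalSortSketch, sortedRuns and merge are invented for this example, and it says nothing about how Beam or Crunch actually implement sorting.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

final class ExternalSortSketch {

    /** Cuts the input into runs of at most runSize elements and sorts each run. */
    static List<List<Long>> sortedRuns(List<Long> input, int runSize) {
        if (runSize <= 0) {
            throw new IllegalArgumentException("runSize must be positive");
        }
        List<List<Long>> runs = new ArrayList<>();
        for (int start = 0; start < input.size(); start += runSize) {
            int end = Math.min(start + runSize, input.size());
            List<Long> run = new ArrayList<>(input.subList(start, end));
            Collections.sort(run);
            runs.add(run);
        }
        return runs;
    }

    /** K-way merge. Heap entries are {value, runIndex, offset}, at most one per run. */
    static List<Long> merge(List<List<Long>> runs) {
        PriorityQueue<long[]> heap =
                new PriorityQueue<>((a, b) -> Long.compare(a[0], b[0]));
        for (int r = 0; r < runs.size(); r++) {
            if (!runs.get(r).isEmpty()) {
                heap.add(new long[] {runs.get(r).get(0), r, 0});
            }
        }
        List<Long> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            long[] head = heap.poll();
            merged.add(head[0]);
            int run = (int) head[1];
            int next = (int) head[2] + 1;
            if (next < runs.get(run).size()) {
                heap.add(new long[] {runs.get(run).get(next), run, next});
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<Long> timestamps = Arrays.asList(5L, 3L, 9L, 1L, 7L, 2L, 8L);
        List<List<Long>> runs = sortedRuns(timestamps, 3);
        System.out.println(runs);        // [[3, 5, 9], [1, 2, 7], [8]]
        System.out.println(merge(runs)); // [1, 2, 3, 5, 7, 8, 9]
    }
}
```

Only one run has to be held in memory while sorting, and the merge step needs just one element per run at a time, which is what removes the fits-in-memory constraint.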

I'll hold out Crunch's Sort
<https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/lib/Sort.html>
class as an example of what this class could look like.

Thanks,

Jesse