Community over Code EU 2024: The countdown has started!

2024-05-14 Thread Ryan Skraba
[Note: You're receiving this email because you are subscribed to one
or more project dev@ mailing lists at the Apache Software Foundation.]

We are very close to Community Over Code EU -- check out the amazing
program and the special discounts that we have for you.

Special discounts

You still have the opportunity to secure your ticket for Community
Over Code EU. Explore the various options available, including the
regular pass, the committer and group passes, and the newly introduced
one-day pass tailored for locals in Bratislava.

We also have a special discount for you to attend both Community Over
Code and Berlin Buzzwords from June 9th to 11th. Visit our website to
find out more about this opportunity and contact te...@sg.com.mx to
get the discount code.

Take advantage of the discounts and register now!
https://eu.communityovercode.org/tickets/

Check out the full program!

This year Community Over Code Europe will bring you three days of
keynotes and sessions covering topics of interest for ASF projects
and the greater open source ecosystem, including data engineering,
performance engineering, search, and the Internet of Things (IoT), as
well as sessions with tips and lessons learned on building a healthy
open source community.

Check out the program: https://eu.communityovercode.org/program/

Keynote speaker highlights for Community Over Code Europe include:

* Dirk-Willem Van Gulik, VP of Public Policy at the Apache Software
Foundation, will discuss the Cyber Resilience Act and its impact on
open source (All your code belongs to Policy Makers, Politicians, and
the Law).

* Dr. Sherae Daniel will share the results of her study on the impact
of self-promotion for open source software developers (To Toot or not
to Toot, that is the question).

* Asim Hussain, Executive Director of the Green Software Foundation,
will present a framework they have developed for quantifying the
environmental impact of software (Doing for Sustainability what Open
Source did for Software).

* Ruth Ikegah will discuss the growth of the open source movement in
Africa (From Local Roots to Global Impact: Building an Inclusive Open
Source Community in Africa).

* A discussion panel on EU policies and regulations affecting
specialists working in Open Source Program Offices

Additional activities

* Poster sessions: We invite you to stop by our poster area and see if
the ideas presented ignite a conversation within your team.

* BOF time: Don't miss the opportunity to discuss your shared
interests in person with your open source colleagues.

* Participants reception: At the end of the first day, we will have a
reception at the event venue. All participants are welcome to attend!

* Spontaneous talks: There is a dedicated room and social space for
having spontaneous talks and sessions. Get ready to share with your
peers.

* Lightning talks: At the end of the event we will have the awaited
Lightning talks, where every participant is welcome to share and
enlighten us.

Please remember: if you haven't applied for your visa yet, we will
provide the necessary letter for the process. In the unfortunate case
of a visa rejection, your ticket will be reimbursed.

See you in Bratislava,

Community Over Code EU Team


Community over Code EU 2024: Start planning your trip!

2024-04-03 Thread Ryan Skraba
[Note: You're receiving this email because you are subscribed to one
or more project dev@ mailing lists at the Apache Software Foundation.]

Dear community,

We hope you are doing great. Are you ready for Community Over Code EU?
Check out the featured sessions, get your tickets with special
discounts and start planning your trip.

Save your spot! Take a look at our lineup of sessions, panelists and
featured speakers and make your final choice:

* EU policies and regulations affecting open source specialists working in OSPOs

The panel will discuss how EU legislation affects the daily work of
open source operations. Panelists will cover some recent policy
updates, the challenges of staying compliant when managing open source
contribution and usage within organizations, and their personal
experiences in adapting to the changing European regulatory
environment.

* Doing for sustainability what open source did for software

In this keynote Asim Hussain will explain the history of the Impact
Framework, the work of a coalition of hundreds of software
practitioners, which offers tangible solutions that directly foster
meaningful change by measuring the environmental impact of software.

Don’t forget that we have special discounts for groups, students and
Apache committers. Visit the website to discover more about these
rates.[1]

It's time for you to start planning your trip. Remember that we have
prepared a "How to get there" guide to help you find the best
transportation to Bratislava, whether by train, bus, flight or boat,
from wherever you are coming from. Take a look at the different
options and please reach out to us if you have any questions.

We have rooms available, with a special rate, at the Radisson Blu
Carlton Hotel, where the event will take place, and at the Park Inn
Hotel, which is only a 5-minute walk from the venue. [2] However, you
are free to choose any other accommodation options around the city.

See you in Bratislava,
Community Over Code EU Team

[1]: https://eu.communityovercode.org/tickets/ "Register"
[2]: https://eu.communityovercode.org/venue/ "Where to stay"


Meet our keynote speakers and register to Community Over Code EU!

2023-12-22 Thread Ryan Skraba
[Note: You're receiving this email because you are subscribed to one or
more project dev@ mailing lists at the Apache Software Foundation.]

Merge with the ASF EUniverse!

The registration for Community Over Code Europe is finally open! Get
your tickets now and save your spot!

We are happy to announce that we have confirmed the first featured
speakers:

- Asim Hussain, Executive Director at Green Software Foundation
- Dirk-Willem Van Gulik, VP of Public Policy at The Apache Software Foundation
- Ruth Ikegah, Community Lead at CHAOSS Africa

Visit our website to learn more about this amazing lineup.

CFP is open

We are looking forward to hearing all you have to share with the
Apache Community. Please submit your talk proposal before January 12,
2024.

Interested in boosting your brand?

Take a look at our prospectus and find out the opportunities we have
for you.

Be one step ahead and book your room at the hotel venue

We have a special rate for you at the Radisson Blu Carlton, the hotel
that will hold Community Over Code EU. Learn more about the location
and venue and book your accommodation.

Should you have any questions, please do not hesitate to contact us.
We wish you Happy Holidays in the company of your loved ones!

See you in Bratislava next year!

Community Over Code EU Organizer Committee


Call for Presentations now open: Community over Code EU 2024

2023-10-30 Thread Ryan Skraba
(Note: You are receiving this because you are subscribed to the dev@
list for one or more projects of the Apache Software Foundation.)

It's back *and* it's new!

We're excited to announce that the first edition of Community over
Code Europe (formerly known as ApacheCon EU) will be held at the
Radisson Blu Carlton Hotel in Bratislava, Slovakia from June 03-05,
2024! This eagerly anticipated event will be our first live EU
conference since 2019.

The Call for Presentations (CFP) for Community Over Code EU 2024 is
now open at https://eu.communityovercode.org/blog/cfp-open/,
and will close 2024/01/12 23:59:59 GMT.

We welcome submissions on any topic related to the Apache Software
Foundation, Apache projects, or the communities around those projects.
We are specifically looking for presentations in the following
categories:

* API & Microservices
* Big Data Compute
* Big Data Storage
* Cassandra
* CloudStack
* Community
* Data Engineering
* Fintech
* Groovy
* Incubator
* IoT
* Performance Engineering
* Search
* Tomcat, Httpd and other servers

Additionally, we are thrilled to introduce a new feature this year: a
poster session. This addition will provide an excellent platform for
showcasing high-level projects and incubator initiatives in a visually
engaging manner. We believe this will foster lively discussions and
facilitate networking opportunities among participants.

All my best, and thanks so much for your participation,

Ryan Skraba (on behalf of the program committee)

[Countdown]: https://www.timeanddate.com/countdown/to?iso=20240112T2359=1440


Re: [RESULT] [VOTE] Release 2.33.0, release candidate #1

2021-09-29 Thread Ryan Skraba
Just to follow up -- the cherry-pick of
https://github.com/apache/beam/pull/15616 changes the default value of
a configuration option that appears for the first time in 2.33.0.

I think it's a strong argument for making the change now, before
unwary developers start using the wrong default.  I understand that
it's *extremely* late in the release cycle!

All my best, Ryan

On Wed, Sep 29, 2021 at 6:50 PM Alexey Romanenko
 wrote:
>
> Is it still possible to cherry-pick this fix [1][2] since it’s a recent 
> regression that touches pipelines in production?
>
> [1] https://github.com/apache/beam/pull/15616
> [2] https://issues.apache.org/jira/browse/BEAM-12628
>
> On 28 Sep 2021, at 03:31, Udi Meiri  wrote:
>
> I spoke too soon. We will be doing an rc2
>
> On Mon, Sep 27, 2021 at 1:29 PM Udi Meiri  wrote:
>>
>> I'm happy to announce that we have unanimously approved this release.
>>
>> There are 8 approving votes, 4 of which are binding:
>> * Ahmet Altay
>> * Alexey Romanenko
>> * Robert Bradshaw
>> * Chamikara Jayalath
>>
>> There are no disapproving votes.
>>
>> Thanks everyone!
>>
>


Re: Avro String decoding changes in Beam 2.30.0

2021-07-27 Thread Ryan Skraba
Hello!  I took a quick look -- I think there's some potential
confusion on this line[1] and the reflectData argument being passed
into the new constructor.

If I'm reading correctly, the argument passed in is never actually
used in the eventual ReflectDatumReader/Writer, and it's a different
type than the "this.reflectData" member in the instance.

To restore the original behaviour, I'd probably recommend just passing
in a boolean argument instead, something very explicit along the lines
of "useReflectionOnSpecificData", or "alwaysUseAvroReflect".  That's
also the reason to consider a very simple AvroReflectCoder.of(...)
instead of an AvroCoder.of(x, y, true) factory method for readability,
like what was done with AvroGenericCoder.

It would be easier to comment on a PR, don't hesitate!

All my best, Ryan

[1]https://github.com/apache/beam/compare/master...clairemcginty:avro_reflect_coder_option?expand=1#diff-e875a9933286d97dd3d3d21a61e6f11c0e35624e97411c1b98f1ac672c21045dR311


On Mon, Jul 26, 2021 at 6:42 PM Claire McGinty
 wrote:
>
> Thanks! I put up a branch with a possible solution for adding the Reflect 
> option to AvroCoder with as minimal a code change as possible [1] - would 
> love to get anyone's thoughts on this.
>
> - Claire
>
> On Wed, Jul 21, 2021 at 7:00 PM Ahmet Altay  wrote:
>>
>>
>>
>> On Wed, Jul 21, 2021 at 9:37 AM Claire McGinty  
>> wrote:
>>>
>>> Hi Ahmet! Yes, I think it should be documented in the release notes.
>>
>>
>> Great. +Vitaly, do you want to add the breaking change to the release notes, 
>> since this was related to your change.
>>
>>>
>>> What do you think of Ryan’s suggestion to add a ReflectAvroCoder or a 
>>> configuration option to the existing AvroCoder?
>>
>>
>> I am not sure I am the best person to answer this. Second option, of adding 
>> a configuration to the existing AvroCoder, rather than creating a new coder 
>> makes more sense to me.
>>
>> That said, people who might have an opinion: /cc @Ismaël Mejía @Kenneth 
>> Knowles @Lukasz Cwik +Vitaly
>>
>>>
>>>
>>> Thanks,
>>> Claire
>>>
>>> On Tue, Jul 20, 2021 at 4:15 PM Ahmet Altay  wrote:
>>>>
>>>> Is this something we need to add to the 2.30.0 release notes 
>>>> (https://beam.apache.org/blog/beam-2.30.0/) as a breaking change?
>>>>
>>>> On Fri, Jul 16, 2021 at 7:11 AM Ryan Skraba  wrote:
>>>>>
>>>>> Hello!  Good catch, I'm taking a look, but it looks like you're
>>>>> entirely correct and there isn't any obvious workaround.  I guess you
>>>>> could regenerate every SpecificRecord class in order to add the
>>>>> "java-class" or "avro.java.string" annotation, but that shouldn't be
>>>>> necessary.
>>>>>
>>>>> From the Avro perspective, we should always have been using
>>>>> SpecificDatumReader/Writer for all generated SpecificRecords...  We
>>>>> would still have the same Utf8 and .toString problems, but at least
>>>>> there would be no change in behaviour during migration :/
>>>>>
>>>>> As a side note, the Apache Avro project should probably reconsider
>>>>> whether the Utf8 class still adds any value with modern JVMs!  If I
>>>>> understand correctly, it was originally in place because Hadoop had a
>>>>> performance boost when it could reuse mutable data containers.
>>>>>
>>>>> Moving forward, I think your suggestion is the most pragmatic: either
>>>>> add a configuration option to AvroCoder to always drop to ReflectData,
>>>>> or explicitly provide a ReflectAvroCoder that only uses reflection.
>>>>>
>>>>> I took the liberty of creating the JIRA
>>>>> https://issues.apache.org/jira/browse/BEAM-12628, so I could
>>>>> create and link an Avro issue!  Please feel free to update if I missed
>>>>> anything.
>>>>>
>>>>> Best regards, Ryan
>>>>>
>>>>> On Thu, Jul 15, 2021 at 10:53 PM Claire McGinty
>>>>>  wrote:
>>>>> >
>>>>> > Hi all,
>>>>> >
>>>>> > When upgrading from Beam 2.29.0 to 2.30.0, we encountered some 
>>>>> > unexpected runtime issues due to changes from BEAM-2303. This PR 
>>>>> > updated AvroCoder to use SpecificDatum{Reader,Writer} instead
>>>>> > of ReflectDatum{Reader,Writer} in its impl

Re: Avro String decoding changes in Beam 2.30.0

2021-07-16 Thread Ryan Skraba
Hello!  Good catch, I'm taking a look, but it looks like you're
entirely correct and there isn't any obvious workaround.  I guess you
could regenerate every SpecificRecord class in order to add the
"java-class" or "avro.java.string" annotation, but that shouldn't be
necessary.

From the Avro perspective, we should always have been using
SpecificDatumReader/Writer for all generated SpecificRecords...  We
would still have the same Utf8 and .toString problems, but at least
there would be no change in behaviour during migration :/

As a side note, the Apache Avro project should probably reconsider
whether the Utf8 class still adds any value with modern JVMs!  If I
understand correctly, it was originally in place because Hadoop had a
performance boost when it could reuse mutable data containers.

Moving forward, I think your suggestion is the most pragmatic: either
add a configuration option to AvroCoder to always drop to ReflectData,
or explicitly provide a ReflectAvroCoder that only uses reflection.

I took the liberty of creating the JIRA
https://issues.apache.org/jira/browse/BEAM-12628, so I could
create and link an Avro issue!  Please feel free to update if I missed
anything.

Best regards, Ryan

On Thu, Jul 15, 2021 at 10:53 PM Claire McGinty
 wrote:
>
> Hi all,
>
> When upgrading from Beam 2.29.0 to 2.30.0, we encountered some unexpected 
> runtime issues due to changes from BEAM-2303. This PR updated AvroCoder to
> use SpecificDatum{Reader,Writer} instead of ReflectDatum{Reader,Writer} in its
> implementation.
>
> When using the Reflect* suite, Avro string fields have getters/setters 
> defined with a CharSequence signature, but are by default decoded as 
> java.lang.Strings [1]. But the Specific* suite has a different default
> behavior for decoding Avro string fields: unless the Avro schema property 
> "java-class" is set to "java.lang.String", the decoded CharSequences will by 
> default be implemented as org.apache.avro.util.Utf8 objects [2].
>
> This is causing some migration pain for us as we're having to either add the 
> java-class property to all string field schemas, or call .toString on a lot 
> of fields we could just cast before. Additionally, Utf8 isn't Serializable 
> and there's no default Coder representation for it. Beam's 
> AvroSink/AvroSource still use the Reflect* reader/writer, as well. I created a
> quick Gist to demonstrate the issue: [3].
>
> I'm wondering if there's any possibility of making the use of Reflect* vs 
> Specific* configurable in AvroCoder, or maybe setting a default String type 
> in the coder constructor.  If not, maybe this change should be documented in 
> the release notes?
>
> Thanks,
> Claire
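
As a rough, self-contained sketch of the default Claire describes, at
the plain Avro level (GenericDatumReader resolves string types the same
way SpecificDatumReader does for generated classes, and is used here as
a stand-in; the record and field names are made up): the same encoded
bytes come back as Utf8 or as java.lang.String depending on whether the
reader schema carries the "avro.java.string" hint.

import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.generic.GenericRecordBuilder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class AvroStringTypeDemo {
  public static void main(String[] args) throws Exception {
    // A record with a plain string field: decoded as org.apache.avro.util.Utf8 by default.
    Schema plain = SchemaBuilder.record("R").fields().requiredString("s").endRecord();

    // The same field carrying the hint that stringType=String code generation would add.
    Schema hintedString = Schema.create(Schema.Type.STRING);
    hintedString.addProp(GenericData.STRING_PROP, "String"); // "avro.java.string": "String"
    Schema hinted =
        SchemaBuilder.record("R").fields().name("s").type(hintedString).noDefault().endRecord();

    // Encode one record with the plain schema.
    GenericRecord record = new GenericRecordBuilder(plain).set("s", "hello").build();
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(bytes, null);
    new GenericDatumWriter<GenericRecord>(plain).write(record, encoder);
    encoder.flush();

    // Decode the same bytes twice, with and without the hint on the reader schema.
    Object asUtf8 = new GenericDatumReader<GenericRecord>(plain)
        .read(null, DecoderFactory.get().binaryDecoder(bytes.toByteArray(), null)).get("s");
    Object asString = new GenericDatumReader<GenericRecord>(plain, hinted)
        .read(null, DecoderFactory.get().binaryDecoder(bytes.toByteArray(), null)).get("s");

    System.out.println(asUtf8.getClass());   // class org.apache.avro.util.Utf8
    System.out.println(asString.getClass()); // class java.lang.String
  }
}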


Re: [ANNOUNCE] New Committer: Kamil Wasilewski

2020-03-02 Thread Ryan Skraba
Congratulations Kamil!

On Mon, Mar 2, 2020 at 8:06 AM Michał Walenia 
wrote:

> Congratulations!
>
> On Sun, Mar 1, 2020 at 2:55 AM Reza Rokni  wrote:
>
>> Congratulations Kamil
>>
>> On Sat, 29 Feb 2020, 06:18 Udi Meiri,  wrote:
>>
>>> Welcome Kamil!
>>>
>>> On Fri, Feb 28, 2020 at 12:53 PM Mark Liu  wrote:
>>>
 Congrats, Kamil!

 On Fri, Feb 28, 2020 at 12:23 PM Ismaël Mejía 
 wrote:

> Congratulations Kamil!
>
> On Fri, Feb 28, 2020 at 7:09 PM Yichi Zhang  wrote:
>
>> Congrats, Kamil!
>>
>> On Fri, Feb 28, 2020 at 9:53 AM Valentyn Tymofieiev <
>> valen...@google.com> wrote:
>>
>>> Congratulations, Kamil!
>>>
>>> On Fri, Feb 28, 2020 at 9:34 AM Pablo Estrada 
>>> wrote:
>>>
 Hi everyone,

 Please join me and the rest of the Beam PMC in welcoming a new
 committer: Kamil Wasilewski

 Kamil has contributed to Beam in many ways, including the
 performance testing infrastructure, and a custom BQ source, along with
 other contributions.

 In consideration of his contributions, the Beam PMC trusts him with
 the responsibilities of a Beam committer[1].

 Thanks for your contributions Kamil!

 Pablo, on behalf of the Apache Beam PMC.

 [1] https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer


>
> --
>
> Michał Walenia
> Polidea  | Software Engineer
>
> M: +48 791 432 002 <+48791432002>
> E: michal.wale...@polidea.com
>
> Unique Tech
> Check out our projects! 
>


Re: [ANNOUNCE] New committer: Alex Van Boxel

2020-02-19 Thread Ryan Skraba
Congratulations Alex!

On Wed, Feb 19, 2020 at 9:52 AM Katarzyna Kucharczyk <
ka.kucharc...@gmail.com> wrote:

> Great news! Congratulations, Alex! 
>
> On Wed, Feb 19, 2020 at 9:14 AM Reza Rokni  wrote:
>
>> Fantastic news! Congratulations :-)
>>
>> On Wed, 19 Feb 2020 at 07:54, jincheng sun 
>> wrote:
>>
>>> Congratulations!
>>> Best,
>>> Jincheng
>>>
>>>
>>> Robin Qiu 于2020年2月19日 周三05:52写道:
>>>
 Congratulations, Alex!

 On Tue, Feb 18, 2020 at 1:48 PM Valentyn Tymofieiev <
 valen...@google.com> wrote:

> Congratulations!
>
> On Tue, Feb 18, 2020 at 10:38 AM Alex Van Boxel 
> wrote:
>
>> Thank you everyone!
>>
>>  _/
>> _/ Alex Van Boxel
>>
>>
>> On Tue, Feb 18, 2020 at 7:05 PM  wrote:
>>
>>> Congrats Alex!
>>> Jan
>>>
>>>
>>> Dne 18. 2. 2020 18:46 napsal uživatel Thomas Weise :
>>>
>>> Congratulations!
>>>
>>>
>>> On Tue, Feb 18, 2020 at 8:33 AM Ismaël Mejía 
>>> wrote:
>>>
>>> Congrats Alex! Well done!
>>>
>>> On Tue, Feb 18, 2020 at 5:10 PM Gleb Kanterov 
>>> wrote:
>>>
>>> Congratulations!
>>>
>>> On Tue, Feb 18, 2020 at 5:02 PM Brian Hulette 
>>> wrote:
>>>
>>> Congratulations Alex! Well deserved!
>>>
>>> On Tue, Feb 18, 2020 at 7:49 AM Pablo Estrada 
>>> wrote:
>>>
>>> Hi everyone,
>>>
>>> Please join me and the rest of the Beam PMC in welcoming
>>> a new committer: Alex Van Boxel
>>>
>>> Alex has contributed to Beam in many ways - as an organizer for Beam
>>> Summit, and meetups - and also with the Protobuf extensions for schemas.
>>>
>>> In consideration of his contributions, the Beam PMC trusts him with
>>> the responsibilities of a Beam committer[1].
>>>
>>> Thanks for your contributions Alex!
>>>
>>> Pablo, on behalf of the Apache Beam PMC.
>>>
>>> [1]
>>> https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer
>>>
>>>
>>> --
>>>
>>> Best,
>>> Jincheng
>>> -
>>> Twitter: https://twitter.com/sunjincheng121
>>> -
>>>
>>


Re: [ANNOUNCE] New committer: Michał Walenia

2020-01-28 Thread Ryan Skraba
Congratulations!

On Tue, Jan 28, 2020 at 11:26 AM Jan Lukavský  wrote:

> Congrats Michał!
> On 1/28/20 11:16 AM, Katarzyna Kucharczyk wrote:
>
> Congratulations Michał!  
>
> On Tue, Jan 28, 2020 at 9:29 AM Alexey Romanenko 
> wrote:
>
>> Congrats, Michał!
>>
>> On 28 Jan 2020, at 09:20, Ismaël Mejía  wrote:
>>
>> Congratulations Michał, well deserved!
>>
>> On Tue, Jan 28, 2020 at 8:54 AM Kamil Wasilewski <
>> kamil.wasilew...@polidea.com> wrote:
>>
>>> Congrats, Michał!
>>>
>>> On Tue, Jan 28, 2020 at 3:03 AM Udi Meiri  wrote:
>>>
 Congratulations Michał!

 On Mon, Jan 27, 2020 at 3:49 PM Chamikara Jayalath <
 chamik...@google.com> wrote:

> Congrats Michał!
>
> On Mon, Jan 27, 2020 at 2:59 PM Reza Rokni  wrote:
>
>> Congratulations buddy!
>>
>> On Tue, 28 Jan 2020, 06:52 Valentyn Tymofieiev, 
>> wrote:
>>
>>> Congratulations, Michał!
>>>
>>> On Mon, Jan 27, 2020 at 2:24 PM Austin Bennett <
>>> whatwouldausti...@gmail.com> wrote:
>>>
 Nice -- keep up the good work!

 On Mon, Jan 27, 2020 at 2:02 PM Mikhail Gryzykhin <
 mig...@google.com> wrote:
 >
 > Congratulations Michal!
 >
 > --Mikhail
 >
 > On Mon, Jan 27, 2020 at 1:01 PM Kyle Weaver 
 wrote:
 >>
 >> Congratulations Michał! Looking forward to your future
 contributions :)
 >>
 >> Thanks,
 >> Kyle
 >>
 >> On Mon, Jan 27, 2020 at 12:47 PM Pablo Estrada <
 pabl...@google.com> wrote:
 >>>
 >>> Hi everyone,
 >>>
 >>> Please join me and the rest of the Beam PMC in welcoming a new
 committer: Michał Walenia
 >>>
 >>> Michał has contributed to Beam in many ways, including the
 performance testing infrastructure, and has even spoken at events about
 Beam.
 >>>
 >>> In consideration of his contributions, the Beam PMC trusts him
 with the responsibilities of a Beam committer[1].
 >>>
 >>> Thanks for your contributions Michał!
 >>>
 >>> Pablo, on behalf of the Apache Beam PMC.
 >>>
 >>> [1]
 https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer

>>>
>>


Re: [VOTE] Beam Mascot animal choice: vote for as many as you want

2019-11-20 Thread Ryan Skraba
*** Vote for as many as you like, using this checklist as a template 

[X] Beaver
[ ] Hedgehog
[ ] Lemur
[ ] Owl
[ ] Salmon
[X] Trout
[ ] Robot dinosaur
[ ] Firefly
[ ] Cuttlefish
[ ] Dumbo Octopus
[ ] Angler fish


Re: Library to Parse Thrift Files for ThriftIO

2019-11-19 Thread Ryan Skraba
For info: https://github.com/airlift/drift has forked and maintained
the code over the last few years.

On Fri, Nov 15, 2019 at 7:23 PM Reuven Lax  wrote:
>
> At a quick glance, the license is Apache which is fine (though we'd have to 
> check dependencies as well). I do notice that git repro is no longer 
> maintained; is there a different one that is better supported?
>
> Reuven
>
> On Wed, Nov 13, 2019 at 7:04 AM Christopher Larsen 
>  wrote:
>>
>> Hey everyone,
>>
>> In regards to the library that will be used to parse the .thrift files for 
>> ThriftIO we are thinking about using some of the code from the library found 
>> here and updating it to use Beam friendly packages. We would love to get the 
>> community's feedback on this approach.
>>
>> Best,
>> Chris
>>


Re: JdbcIO read needs to fit in memory

2019-10-29 Thread Ryan Skraba
I didn't get a chance to try this out -- it sounds like a bug with the
SparkRunner, if you've tested it with FlinkRunner and it succeeded.

From your description, it should be reproducible by reading any large
database table with the SparkRunner where the entire dataset is
greater than the memory available to a single executor?  Do you have
any other tips to reproduce?

Especially worrisome is "as past JDBC load job runs fine with 4GB
heap" -- did this happen with the same volumes of data and a different
version of Beam?  Or the same version and a pipeline with different
characteristics? This does sound like a regression, so details would
help to confirm and track it down!

All my best, Ryan




On Tue, Oct 29, 2019 at 9:48 AM Jozef Vilcek  wrote:
>
> I can not find anything in docs about expected behavior of DoFn emitting 
> arbitrary large number elements on one processElement().
>
> I wonder if Spark Runner behavior is a bug or just a difference (and 
> disadvantage in this case) in execution more towards runner capability matrix 
> differences.
>
> Also, in such cases, what is an opinion about BoundedSource vs DoFn as a 
> source. What is a recommendation to IO developer if one want's to achieve 
> equivalent execution scalability across runners?
>
>
> On Sun, Oct 27, 2019 at 6:02 PM Jozef Vilcek  wrote:
>>
>> typo in my previous message. I meant to say => JDBC is `not` the main data 
>> set, just metadata
>>
>> On Sun, Oct 27, 2019 at 6:00 PM Jozef Vilcek  wrote:
>>>
>>> Result of my query can fit the memory if I use 12GB heap per spark 
>>> executor. This makes the job quite inefficient as past JDBC load job runs 
>>> fine with 4GB heap to do the main heavy lifting - JDBC is the main data 
>>> set, just metadata.
>>>
>>> I just did run the same JdbcIO read code on Spark and Flink runner. Flink 
>>> did not blow up on memory. So it seems like this is a limitation of 
>>> SparkRunner.
>>>
>>> On Fri, Oct 25, 2019 at 5:28 PM Ryan Skraba  wrote:
>>>>
>>>> One more thing to try -- depending on your pipeline, you can disable
>>>> the "auto-reshuffle" of JdbcIO.Read by setting
>>>> withOutputParallelization(false)
>>>>
>>>> This is particularly useful if (1) you do aggressive and cheap
>>>> filtering immediately after the read or (2) you do your own
>>>> repartitioning action like GroupByKey after the read.
>>>>
>>>> Given your investigation into the heap, I doubt this will help!  I'll
>>>> take a closer look at the DoFnOutputManager.  In the meantime, is
>>>> there anything particularly about your job that might help
>>>> investigate?
>>>>
>>>> All my best, Ryan
>>>>
>>>> On Fri, Oct 25, 2019 at 2:47 PM Jozef Vilcek  wrote:
>>>> >
>>>> > I agree I might be too quick to call DoFn output need to fit in memory. 
>>>> > Actually I am not sure what Beam model say on this matter and what 
>>>> > output managers of particular runners do about it.
>>>> >
>>>> > But SparkRunner definitely has an issue here. I did try set small 
>>>> > `fetchSize` for JdbcIO as well as change `storageLevel` to 
>>>> > MEMORY_AND_DISK. All fails on OOM.
>>>> > When looking at the heap, most of it is used by linked list multi-map of 
>>>> > DoFnOutputManager here:
>>>> > https://github.com/apache/beam/blob/v2.15.0/runners/spark/src/main/java/org/apache/beam/runners/spark/translation/MultiDoFnFunction.java#L234
>>>> >
>>>> >


Re: JdbcIO read needs to fit in memory

2019-10-25 Thread Ryan Skraba
One more thing to try -- depending on your pipeline, you can disable
the "auto-reshuffle" of JdbcIO.Read by setting
withOutputParallelization(false)

This is particularly useful if (1) you do aggressive and cheap
filtering immediately after the read or (2) you do your own
repartitioning action like GroupByKey after the read.
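
A minimal sketch of what that looks like on a JdbcIO read (the driver,
URL, query, mapper and pipeline variable below are placeholders, not
taken from this thread; the fetch-size knob from the earlier discussion
is included for completeness):

import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.io.jdbc.JdbcIO;
import org.apache.beam.sdk.values.PCollection;

PCollection<String> rows = pipeline.apply(
    JdbcIO.<String>read()
        .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
            "org.postgresql.Driver", "jdbc:postgresql://host/db"))
        .withQuery("SELECT payload FROM my_table")
        .withRowMapper(resultSet -> resultSet.getString(1))
        .withCoder(StringUtf8Coder.of())
        .withOutputParallelization(false) // skip the post-read reshuffle
        .withFetchSize(1_000));           // rows buffered per JDBC fetch (50K by default)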

Given your investigation into the heap, I doubt this will help!  I'll
take a closer look at the DoFnOutputManager.  In the meantime, is
there anything particularly about your job that might help
investigate?

All my best, Ryan

On Fri, Oct 25, 2019 at 2:47 PM Jozef Vilcek  wrote:
>
> I agree I might be too quick to call DoFn output need to fit in memory. 
> Actually I am not sure what Beam model say on this matter and what output 
> managers of particular runners do about it.
>
> But SparkRunner definitely has an issue here. I did try set small `fetchSize` 
> for JdbcIO as well as change `storageLevel` to MEMORY_AND_DISK. All fails on 
> OOM.
> When looking at the heap, most of it is used by linked list multi-map of 
> DoFnOutputManager here:
> https://github.com/apache/beam/blob/v2.15.0/runners/spark/src/main/java/org/apache/beam/runners/spark/translation/MultiDoFnFunction.java#L234
>
>


Re: JdbcIO read needs to fit in memory

2019-10-24 Thread Ryan Skraba
Hello!

If I remember correctly -- the JdbcIO will use *one* DoFn instance to
read all of the rows, but that instance is not required to hold all of
the rows in memory.

The fetch size will, however, read 50K rows at a time by default and
those will all be held in memory in that single worker until they are
emitted.  You can adjust this setting with the setFetchSize(...)
method.

By default, the JdbcIO.Read transform adds a "reshuffle", which will
repartition the records among all of the nodes in the cluster.  This
means that all of the rows need to fit into total available memory of
the cluster (not just that one node), especially if the RDD underneath
the PCollection is reused/persisted.  You can change the persistence
level to "MEMORY_AND_DISK" in this case if you want to spill data to
disk instead of failing your job:
https://github.com/apache/beam/blob/416f62bdd7fa092257921e4835a48094ebe1dda4/runners/spark/src/main/java/org/apache/beam/runners/spark/SparkPipelineOptions.java#L56
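
A minimal sketch of flipping that persistence level on the SparkRunner
(the storageLevel option is the one in the SparkPipelineOptions linked
above; the rest of the pipeline is left out):

import org.apache.beam.runners.spark.SparkPipelineOptions;
import org.apache.beam.runners.spark.SparkRunner;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class SpillToDiskExample {
  public static void main(String[] args) {
    SparkPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(SparkPipelineOptions.class);
    options.setRunner(SparkRunner.class);
    // Default is MEMORY_ONLY; MEMORY_AND_DISK spills persisted RDDs to disk instead of failing.
    options.setStorageLevel("MEMORY_AND_DISK");

    Pipeline pipeline = Pipeline.create(options);
    // ... JdbcIO.read() and the rest of the job go here ...
    pipeline.run().waitUntilFinish();
  }
}

The same option can also be passed on the command line as
--storageLevel=MEMORY_AND_DISK.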

I hope this helps!  Ryan




On Thu, Oct 24, 2019 at 4:26 PM Jean-Baptiste Onofré  wrote:
>
> Hi
>
> JdbcIO is basically a DoFn. So it could load all on a single executor 
> (there's no obvious way to split).
>
> It's what you mean ?
>
> Regards
> JB
>
> Le 24 oct. 2019 15:26, Jozef Vilcek  a écrit :
>
> Hi,
>
> I am in a need to read a big-ish data set via JdbcIO. This forced me to bump 
> up memory for my executor (right now using SparkRunner). It seems that JdbcIO 
> has a requirement to fit all data in memory as it is using DoFn to unfold 
> query to list of elements.
>
> BoundedSource would not face the need to fit result in memory, but JdbcIO is 
> using DoFn. Also, in recent discussion [1] it was suggested that 
> BoudnedSource should not be used as it is obsolete.
>
> Does anyone faced this issue? What would be the best way to solve it? If DoFn 
> should be kept, then I can only think of splitting the query to ranges and 
> try to find most fitting number of rows to read at once.
>
> I appreciate any thoughts.
>
> [1] 
> https://lists.apache.org/list.html?dev@beam.apache.org:lte=1M:Reading%20from%20RDB%2C%20ParDo%20or%20BoundedSource
>
>


Re: Question related to running unit tests in IDE

2019-10-23 Thread Ryan Skraba
Just for info -- I managed to get a pretty good state using IntelliJ
2019.2.3 (Fedora) and a plain gradle import!

There's a slack channel at https://s.apache.org/beam-slack-channel
(see https://beam.apache.org/community/contact-us/) called
#beam-intellij.  It's pretty low-traffic, but you might be able to get
some real-time help there if you need it.

Ryan

On Wed, Oct 23, 2019 at 2:11 AM Saikat Maitra  wrote:
>
> Hi Michal, Alexey
>
> Thank you for your email. I am using macOS Catalina and JDK 8 with IntelliJ 
> IDEA 2019.1
>
> I will try to setup IntelliJ from scratch and see if the error resolves.
>
> Regards,
> Saikat
>
>
>
>
> On Tue, Oct 22, 2019 at 7:05 AM Alexey Romanenko  
> wrote:
>>
>> Hi,
>>
>> Thank you for your interest to contribute!
>>
>> Did you properly imported a project (as explained on page [1]) and all deps 
>> were resolved successfully?
>>
>> [1] 
>> https://cwiki.apache.org/confluence/display/BEAM/Set+up+IntelliJ+from+scratch
>>
>> On 22 Oct 2019, at 02:28, Saikat Maitra  wrote:
>>
>> Hi,
>>
>> I am interested to contribute to this issue
>>
>> https://issues.apache.org/jira/browse/BEAM-3658
>>
>> I have followed the contribution guide and was able to build the project 
>> locally using gradlew commands.
>>
>> I wanted to debug and trace the issue further by running the tests locally 
>> using Intellij Idea but I am getting following errors. I looked up the docs 
>> related to running tests 
>> (https://cwiki.apache.org/confluence/display/BEAM/Run+a+single+unit+test) 
>> and common IDE errors 
>> (https://cwiki.apache.org/confluence/display/BEAM/%28FAQ%29+Recovering+from+common+IDE+errors)
>>  but have not found similar errors.
>>
>> Error:(632, 17) java: cannot find symbol
>>   symbol:   method 
>> apply(org.apache.beam.sdk.transforms.Values)
>>   location: interface org.apache.beam.sdk.values.POutput
>>
>> Error:(169, 26) java: cannot find symbol
>>   symbol:   class PCollection
>>   location: class org.apache.beam.sdk.transforms.Watch
>>
>> Error:(169, 59) java: cannot find symbol
>>   symbol:   class KV
>>   location: class org.apache.beam.sdk.transforms.Watch
>>
>> Please let me know if you have feedback.
>>
>> Regards,
>> Saikat
>>
>>


Re: DoFn and Source sequence diagrams

2019-10-17 Thread Ryan Skraba
All is well, PlantUML has an Apache Licensed distribution as well, AND
the diagrams are explicitly not covered by a license:
http://plantuml.com/faq

The UML diagrams in the Beam Fn API doc are almost certainly PlantUML!

On Thu, Oct 17, 2019 at 4:07 PM Ismaël Mejía  wrote:
>
> In previous documents they have used a textual representation for UML
> that renders quite nice looking diagrams, see for example the UML
> diagram in this doc:
> https://s.apache.org/beam-fn-api-processing-a-bundle
>
> I think that they use a tool (StartUML?) for that. In any case even if
> the tool used to produce the diagram is GPL licensed the resulting
> diagram is not so that should not be a problem.
>
> On Thu, Oct 17, 2019 at 3:39 PM Etienne Chauchot  wrote:
> >
> > Yes maybe convert to PlantUML as the diagram was done with a commercial 
> > tool for rapidity reasons.
> >
> > ouch plantUML is GPL which is incompatible with Apachev2 license. But can 
> > we still embed docs that were done with that tool?
> >
> > Etienne
> >
> >
> > On 16/10/2019 20:01, Valentyn Tymofieiev wrote:
> >
> > It may be useful to have the UML  source available & linked as well to make 
> > it possible to maintain these charts going forward.
> >
> > On Wed, Oct 16, 2019 at 10:36 AM Pablo Estrada  wrote:
> >>
> >> Maybe add a link to it from Javadocs as well? So source implementers / 
> >> users may have an idea of how these things work generally? : )
> >>
> >> On Wed, Oct 16, 2019 at 1:15 AM Etienne Chauchot  
> >> wrote:
> >>>
> >>> Hi Kenn,
> >>>
> >>> Thanks for the comment. I agree, current representation is unclear. I'll 
> >>> find a way to keep both sequence diagram shape and represent nested loops 
> >>> and submit a PR on the website.
> >>>
> >>> Etienne
> >>>
> >>> On 15/10/2019 18:04, Kenneth Knowles wrote:
> >>>
> >>> Content seems useful illustration. I think for DoFn I would clarify:
> >>>
> >>> Currently for DoFn:
> >>>
> >>> > For each bundle: call startBundle
> >>> > For each element: call processElement
> >>>
> >>> But this makes it seem like these are two loops, one completed before the 
> >>> other. And it makes it not quite so clear that start/process/finish 
> >>> bundle happens many times during setup/teardown lifecycle. Maybe try 
> >>> other ways to illustrate the pattern, which (is (Setup (StartBundle 
> >>> (ProcessElement|OnTimer)* FinishBundle)* Teardown?) and emphasize that a 
> >>> DoFn instance is never used in parallel.
> >>>
> >>> Kenn
> >>>
> >>>
> >>> On Tue, Oct 15, 2019 at 8:49 AM Etienne Chauchot  
> >>> wrote:
> 
>  Hi all,
> 
>  I did 2 sequence diagrams for internal training purposes, one for 
>  source, the other for DoFn. What do you think about adding them to the 
>  programming guide ?
> 
>  Here they are:
> 
>  Best
> 
>  Etienne


Re: [spark structured streaming runner] merge to master?

2019-10-10 Thread Ryan Skraba
Merging to master sounds like a really good idea, even if it is not
feature-complete yet.

It's already a pretty big accomplishment getting it to the current
state (great job all!).  Merging it into master would give it a pretty
good boost for visibility and encourage some discussion about where
it's going.

I don't think there's any question about removing the RDD-based
(a.k.a. old/legacy/stable) spark runner yet!

All my best, Ryan


On Thu, Oct 10, 2019 at 2:47 PM Jean-Baptiste Onofré  wrote:
>
> +1
>
> As the runner seems almost "equivalent" to the one we have, it makes sense.
>
> Question is: do we keep the "old" spark runner for a while or not (or
> just keep on previous version/tag on git) ?
>
> Regards
> JB
>
> On 10/10/2019 09:39, Etienne Chauchot wrote:
> > Hi guys,
> >
> > You probably know that for several months there has been work
> > developing a new Spark runner based on Spark Structured Streaming
> > framework. This work is located in a feature branch here:
> > https://github.com/apache/beam/tree/spark-runner_structured-streaming
> >
> > To attract more contributors and get some user feedback, we think it is
> > time to merge it to master. Before doing so, some steps need to be
> > achieved:
> >
> > - finish the work on spark Encoders (that allow to call Beam coders)
> > because, right now, the runner is in an unstable state (some transforms
> > use the new way of doing ser/de and some use the old one, making a
> > pipeline incoherent toward serialization)
> >
> > - clean history: The history contains commits from November 2018, so
> > there is a good amount of work, thus a consequent number of commits.
> > They were already squashed but not from September 2019
> >
> > Regarding status:
> >
> > - the runner passes 89% of the validates runner tests in batch mode. We
> > hope to pass more with the new Encoders
> >
> > - Streaming mode is barely started (waiting for the multi-aggregations
> > support in spark SS framework from the Spark community)
> >
> > - Runner can execute Nexmark
> >
> > - Some things are not wired up yet
> >
> > - Beam Schemas not wired with Spark Schemas
> >
> > - Optional features of the model not implemented:  state api, timer
> > api, splittable doFn api, …
> >
> > WDYT, can we merge it to master once the 2 steps are done ?
> >
> > Best
> >
> > Etienne
> >
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com


Re: [ANNOUNCE] New committer: Jan Lukavský

2019-07-31 Thread Ryan Skraba
Congratulations Jan!

On Wed, Jul 31, 2019 at 10:10 AM Ismaël Mejía  wrote:
>
> Hi,
>
> Please join me and the rest of the Beam PMC in welcoming a new
> committer: Jan Lukavský.
>
> Jan has been contributing to Beam for a while, he was part of the team
> that contributed the Euphoria DSL extension, and he has done
> interesting improvements for the Spark and Direct runner. He has also
> been active in the community discussions around the Beam model and
> other subjects.
>
> In consideration of Jan's contributions, the Beam PMC trusts him with
> the responsibilities of a Beam committer [1].
>
> Thank you, Jan, for your contributions and looking forward to many more!
>
> Ismaël, on behalf of the Apache Beam PMC
>
> [1] https://beam.apache.org/committer/committer


Re: [DISCUSS] Moving FakeBigQueryServices to main/ rather than test/

2019-07-31 Thread Ryan Skraba
Hello!  No objection to the move :/  But what do you think about
publishing the test jar created in google-cloud-platform to be reused
without moving the code to the main artifact jar?

I admit that I'm familiar with this technique with maven, and not at
all with gradle, but it's described here:
https://maven.apache.org/guides/mini/guide-attached-tests.html  It
relies on the classifier to distinguish between test classes and main
classes.  Is this possible and/or easy with gradle?

I've used this in the past to create "re-usable test artifacts" that
remain independent from "real" code but tightly associated with the
main artifact.

I noticed that elasticsearch publishes main artifacts with `-test` in
the artifact name as an alternative strategy (i.e. a separate
project).

All my best, Ryan

On Wed, Jul 31, 2019 at 5:55 AM Mikhail Gryzykhin
 wrote:
>
> +1
> It is completely worth it.
>
> On Tue, Jul 30, 2019 at 8:50 PM Rui Wang  wrote:
>>
>> +1.
>>
>> I did something similar before: move TestBoundedTable to BeamSQL main to 
>> allow another module tests use it.
>>
>>
>> -Rui
>>
>> On Tue, Jul 30, 2019 at 6:13 PM Pablo Estrada  wrote:
>>>
>>> Hello all,
>>> I found some test utilities that we use to write unit tests for transforms 
>>> that read/write to/from BigQuery. These are all the 
>>> non-(*IT.java/*Test.java) classes in [1].
>>>
>>> I believe that users may want to write tests for their own pipelines that 
>>> may rely on complex DynamicDestination logic (imagine streaming, or side 
>>> inputs for on-the-fly schema computation, or other tricky issues).
>>>
>>> I think it makes sense to move these classes to 
>>> org.apache.beam.io.gcp.bigquery.testing, and publish them in the release. 
>>> Thoughts?
>>>
>>> -P.
>>>
>>> [1] 
>>> https://github.com/apache/beam/tree/master/sdks/java/io/google-cloud-platform/src/test/java/org/apache/beam/sdk/io/gcp/bigquery


Re: Choosing a coder for a class that contains a Row?

2019-07-24 Thread Ryan Skraba
I'm also really interested in the question of evolving schemas... It's
something I've also put off figuring out :D

With all its warts, the LazyAvroCoder technique (a coder backed by
some sort of schema registry) _could_ work with "homogeneish" data
(i.e. if the number of schemas in play for a single coder is much,
much smaller than the number of elements), even if none of the
schemas are known at Pipeline construction.  The portability job
server (which already stores and serves artifacts for running jobs)
might be the right place to put a schema registry... but I'm not
entirely convinced it's the right way to go either.

At the same time, "simply" bumping a known schema to a new version is
roughly equivalent to updating a pipeline in place.

Sending the data as Java-serialized Rows will be equivalent to sending
the entire schema with every record, so it _would_ work without
involving a new, distributed state between one coder's encode and
another's decode (at the cost of message size, of course).
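
For illustration, a minimal sketch of a coder backed by a schema
registry, with a static in-memory map standing in for the ad hoc
registry (so it only works when encode and decode run in the same JVM;
the class name and everything else here is hypothetical, not the
original LazyAvroCoder):

import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import org.apache.avro.Schema;
import org.apache.avro.SchemaNormalization;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;
import org.apache.beam.sdk.coders.CustomCoder;

public class LazySchemaRegistryCoder extends CustomCoder<GenericRecord> {

  // Stand-in for a real shared registry (socket server, known file location, job server...).
  private static final ConcurrentMap<Long, Schema> REGISTRY = new ConcurrentHashMap<>();

  @Override
  public void encode(GenericRecord value, OutputStream out) throws IOException {
    Schema schema = value.getSchema();
    long fingerprint = SchemaNormalization.parsingFingerprint64(schema);
    REGISTRY.putIfAbsent(fingerprint, schema); // save the schema in encode
    DataOutputStream dataOut = new DataOutputStream(out);
    dataOut.writeLong(fingerprint); // only the fingerprint travels with each record
    BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(dataOut, null);
    new GenericDatumWriter<GenericRecord>(schema).write(value, encoder);
    encoder.flush();
  }

  @Override
  public GenericRecord decode(InputStream in) throws IOException {
    DataInputStream dataIn = new DataInputStream(in);
    Schema schema = REGISTRY.get(dataIn.readLong()); // load it lazily in decode
    BinaryDecoder decoder = DecoderFactory.get().directBinaryDecoder(dataIn, null);
    return new GenericDatumReader<GenericRecord>(schema).read(null, decoder);
  }
}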

Ryan


On Wed, Jul 24, 2019 at 1:40 AM Pablo Estrada  wrote:
>
> +dev
> Thanks Ryan! This is quite helpful. Still not what I need : ) - but useful.
>
> The data is change data capture from databases, and I'm putting it into a 
> Beam Row. The schema for the Row is generally homogeneous, but subject to 
> change at some point in the future if the schema in the database changes. 
> It's unusual and unlikely, but possible. I have no idea how Beam deals with 
> evolving schemas. +Reuven Lax is there documentation / examples / anything 
> around this? : )
>
> I think evolving schemas is an interesting question
>
> For now, I am going to Java-serialize the objects, and delay figuring this 
> out. But I reckon I'll have to come back to this...
>
> Best
> -P.
>
> On Tue, Jul 23, 2019 at 1:07 AM Ryan Skraba  wrote:
>>
>> Hello Pablo!  Just to clarify -- the Row schemas aren't known at
>> pipeline construction time, but can be discovered from the instance of
>> MyData?
>>
>> Once discovered, is the schema "homogeneous" for all instance of
>> MyData?  (i.e. someRow will always have the same schema for all
>> instances afterwards, and there won't be another someRow with a
>> different schema).
>>
>> We've encountered a parallel "problem" with pure Avro data, where the
>> instance is a GenericRecord containing it's own Avro schema but
>> *without* knowing the schema until the pipeline is run.  The solution
>> that we've been using is a bit hacky, but we're using an ad hoc
>> per-job schema registry and a custom coder where each worker saves the
>> schema in the `encode` before writing the record, and loads it lazily
>> in the `decode` before reading.
>>
>> The original code is available[1] (be gentle, it was written with Beam
>> 0.4.0-incubating... and has continued to work until now).
>>
>> In practice, the ad hoc schema registry is just a server socket in the
>> Spark driver, in-memory for DirectRunner / local mode, and a a
>> read/write to a known location in other runners.  There are definitely
>> other solutions with side-inputs and providers, and the job server in
>> portability looks like an exciting candidate for per-job schema
>> registry story...
>>
>> I'm super eager to see if there are other ideas or a contribution we
>> can make in this area that's "Beam Row" oriented!
>>
>> Ryan
>>
>> [1] 
>> https://github.com/Talend/components/blob/master/core/components-adapter-beam/src/main/java/org/talend/components/adapter/beam/coders/LazyAvroCoder.java
>>
>> On Tue, Jul 23, 2019 at 12:49 AM Pablo Estrada  wrote:
>> >
>> > Hello all,
>> > I am writing a utility to push data to PubSub. My data class looks 
>> > something like so:
>> > ==
>> > class MyData {
>> >   String someId;
>> >   Row someRow;
>> >   Row someOtherRow;
>> > }
>> > ==
>> > The schema for the Rows is not known a-priori. It is contained by the Row. 
>> > I am then pushing this data to pubsub:
>> > ===
>> > MyData pushingData = 
>> > WhatCoder? coder = 
>> >
>> > ByteArrayOutputStream os = new ByteArrayOutputStream();
>> > coder.encode(this, os);
>> >
>> > pubsubClient.connect();
>> > pubsubClient.push(PubSubMessage.newBuilder().setData(os.toByteArray()).build());
>> > pubsubClient.close();
>> > =
>> > What's the right coder to use in this case? I don't know if SchemaCoder 
>> > will work, because it seems that it requires the Row's schema a priori. I 
>> > have not been able to make AvroCoder work.
>> >
>> > Any tips?
>> > Best
>> > -P.


Re: [Python] Read Hadoop Sequence File?

2019-07-17 Thread Ryan Skraba
Hello!

I dug a bit into this (not a FileIO expert), and it looks like
LocalFileSystem only matches globs in file names (not directories):
https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/LocalFileSystem.java#L251

Perhaps related: https://issues.apache.org/jira/browse/BEAM-1309

There's a note in the FileSystem javadoc that makes me suspect that globs
aren't expected to expand everywhere in the "paths" for all filesystems,
but *should* work in the last hierarchical element:
https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileSystem.java#L59
(noting that the last hierarchical element doesn't necessarily mean "files
only", in my opinion.)

It kind of makes sense -- wildcards at the top of a hierarchy in a large
filesystem can end up creating a huge internal "query" walking the entire
tree.

I gave a quick try to make a composable pipeline that matched "part-*/data"
using the FileIO.matchAll() technique for TextIO, but didn't succeed.  It's
a bit surprising to me, so I'm interested if this could be a feature
improvement...

It seems reasonable that we could construct something like:

PCollection<String> lines = p.apply(Create.of("/tmp/input/"))
  .apply(FileIO.matchResolveDirectory("part-*"))
  .apply(FileIO.matchResolveFile("data"))
  .apply(FileIO.readMatches().withCompression(AUTO))
  .apply(TextIO.readFiles());

Does anybody have a bit more experience how to correctly construct
something like that?
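
For what it's worth, the closest equivalent with the existing
FileIO.match() API would be something like the following (a minimal
sketch; it only works if the underlying FileSystem expands the
directory wildcard, which is exactly the open question above):

import org.apache.beam.sdk.io.Compression;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.values.PCollection;

PCollection<String> lines = p
    .apply(FileIO.match().filepattern("/tmp/input/part-*/data"))
    .apply(FileIO.readMatches().withCompression(Compression.AUTO))
    .apply(TextIO.readFiles());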

Best regards, Ryan



On Tue, Jul 16, 2019 at 4:25 PM Shannon Duncan 
wrote:

> I am still having the problem that local file system (DirectRunner) will
> not allow a local GLOB string to be passed as a file source. I have tried
> both relative path and fully qualified paths.
>
> I can confirm the same inputFile source GLOB returns data on a simple cat
> command. So I know the GLOB is good.
>
> Error: "java.io.FileNotFoundException: No files matched spec:
> /Users//github//io/sequenceFile/part-*/data
>
> Any assistance would be greatly appreciated. This is on the Java SDK.
>
> I tested this with TextIO.read().from(ValueProvider); Still the
> same.
>
> Thanks,
> Shannon
>
> On Fri, Jul 12, 2019 at 2:14 PM Igor Bernstein 
> wrote:
>
>> I'm not sure to be honest. The pattern expansion happens in
>> FileBasedSource via FileSystems.match(), so it should follow the same
>> expansion rules other file based sinks like TextIO. Maybe someone with more
>> beam experience can help?
>>
>> On Fri, Jul 12, 2019 at 2:55 PM Shannon Duncan <
>> joseph.dun...@liveramp.com> wrote:
>>
>>> Clarification on previous message. Only happens on local file system
>>> where it is unable to match a pattern string. Via a `gs://` link it
>>> is able to do multiple file matching.
>>>
>>> On Fri, Jul 12, 2019 at 1:36 PM Shannon Duncan <
>>> joseph.dun...@liveramp.com> wrote:
>>>
 Awesome. I got it working for a single file, but for a structure of:

 /part-0001/index
 /part-0001/data
 /part-0002/index
 /part-0002/data

 I tried to do /part-*  and /part-*/data

 It does not find the multipart files. However if I just do
 /part-0001/data it will find it and read it.

 Any ideas why?

 I am using this to generate the source:

 static SequenceFileSource<Text, Text> createSource(
 ValueProvider<String> sourcePattern) {
 return new SequenceFileSource<>(
 sourcePattern,
 Text.class,
 WritableSerialization.class,
 Text.class,
 WritableSerialization.class,
 SequenceFile.SYNC_INTERVAL);
 }

 On Wed, Jul 10, 2019 at 10:52 AM Igor Bernstein <
 igorbernst...@google.com> wrote:

> It should be fairly straight forward:
> 1. Copy SequenceFileSource.java
> 
>  to
> your project
> 2. Add the source to your pipeline, configuring it with appropriate
> serializers. See here
> 
> for an example for hbase Results
>
> On Wed, Jul 10, 2019 at 10:58 AM Shannon Duncan <
> joseph.dun...@liveramp.com> wrote:
>
>> If I wanted to go ahead and include this within a new Java Pipeline,
>> what would I be looking at for level of work to integrate?
>>
>> On Wed, Jul 3, 2019 at 3:54 AM Ismaël Mejía 
>> wrote:
>>
>>> That's great. I can help whenever you need. We just need to choose
>>> its
>>> destination. Both the `hadoop-format` and `hadoop-file-system`
>>> modules
>>> are good candidates, I would even feel inclined to put it in its own
>>> module `sdks/java/extensions/sequencefile` to make it more easy to
>>> 

Re: pubsub -> IO

2019-07-17 Thread Ryan Skraba
Hello!  To clarify, you want to do something like this?

PubSubIO.read() -> extract mongodb collection and range ->
MongoDbIO.read(collection, range) -> ...

If I'm not mistaken, it isn't possible with the implementation of MongoDbIO
(based on BoundedSource interface, requiring the collection to be specified
once at pipeline construction time).

BUT -- this is a good candidate for an improvement in composability, and
the ongoing work to prefer the SDF for these types of use cases.   Maybe
raise a JIRA for an improvement?

All my best, Ryan


On Wed, Jul 17, 2019 at 9:35 AM Chaim Turkel  wrote:

> any ideas?
>
> On Mon, Jul 15, 2019 at 11:04 PM Rui Wang  wrote:
> >
> > +u...@beam.apache.org
> >
> >
> > -Rui
> >
> > On Mon, Jul 15, 2019 at 6:55 AM Chaim Turkel  wrote:
> >>
> >> Hi,
> >>   I am looking to write a pipeline that read from a mongo collection.
> >>   I would like to listen to a pubsub that will have a object that will
> >> tell me which collection and which time frame.
> >>   Is there a way to do this?
> >>
> >> Chaim
> >>
>
>


Re: Wiki access?

2019-07-03 Thread Ryan Skraba
Oof, sorry: ryanskraba

Thanks in advance!  There's a lot of great info in there.

On Wed, Jul 3, 2019 at 5:03 PM Lukasz Cwik  wrote:

> Can you share your login id for cwiki.apache.org?
>
> On Wed, Jul 3, 2019 at 7:21 AM Ryan Skraba  wrote:
>
>> Hello -- I've been reading through a lot of Beam documentation recently,
>> and noting minor typos here and there... Is it possible to get Wiki access
>> to make fixes on the spot?
>>
>> Best regards, Ryan
>>
>


Wiki access?

2019-07-03 Thread Ryan Skraba
Hello -- I've been reading through a lot of Beam documentation recently,
and noting minor typos here and there... Is it possible to get Wiki access
to make fixes on the spot?

Best regards, Ryan