Re: [YAML] Fileio sink parameterization (streaming, sharding, and naming)

2023-10-12 Thread Robert Bradshaw via dev
OK, so how about this for a concrete proposal:

sink:
  type: WriteToParquet
  config:
    path: "/beam/filesystem/{record.my_col}-{timestamp.year}{timestamp.month}{timestamp.day}"
    suffix: ".parquet"

The eventual path would be . The suffix
would be optional, and there could be a default for the specific file
format. A file format could inspect a provided suffix like ".csv.gz" to
infer compression as well.
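
As a rough illustration of the suffix inspection mentioned above, here is a minimal Python sketch. The extension-to-codec table and the function name `infer_compression` are hypothetical and only meant to show the idea; they are not Beam's actual configuration values or API.

```python
# Hypothetical mapping from a trailing extension to a compression codec.
# The codec names are illustrative, not Beam's actual configuration values.
COMPRESSION_BY_EXTENSION = {
    ".gz": "gzip",
    ".bz2": "bzip2",
    ".zst": "zstd",
}


def infer_compression(suffix):
    """Return the codec implied by the last extension of `suffix`, or None."""
    for extension, codec in COMPRESSION_BY_EXTENSION.items():
        if suffix.endswith(extension):
            return codec
    return None


print(infer_compression(".csv.gz"))   # gzip
print(infer_compression(".parquet"))  # None
```

Only the trailing extension is inspected here, so ".csv.gz" would be treated as gzip-compressed CSV while a bare ".parquet" implies no extra compression layer.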

Note that this doesn't have any special indicators for being dynamic other
than the {}'s. Also, my_col would be written as part of the data (but we
could add an extra "elide" config parameter that takes a list of columns to
exclude if desired).
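
To make the "{}" convention concrete, here is a minimal Python sketch of how such a path could be detected as dynamic, expanded per record, and combined with an "elide" list. It leans on Python's own `str.format` field syntax, which happens to accept `record.my_col`-style attribute access; the helper names (`referenced_fields`, `is_dynamic`, `expand`, `elide`) are hypothetical and this is not how Beam YAML implements it.

```python
from datetime import datetime
from string import Formatter
from types import SimpleNamespace

PATH = "/beam/filesystem/{record.my_col}-{timestamp.year}{timestamp.month}{timestamp.day}"


def referenced_fields(template):
    """List the replacement fields in a template, e.g. 'record.my_col'."""
    return [name for _, name, _, _ in Formatter().parse(template) if name]


def is_dynamic(template):
    """A path is dynamic only if it contains at least one {} field."""
    return bool(referenced_fields(template))


def expand(template, record, timestamp, suffix=""):
    """Fill the template from one record (a dict) and its timestamp."""
    return template.format(record=SimpleNamespace(**record), timestamp=timestamp) + suffix


def elide(record, columns):
    """Drop the listed columns before the record is written."""
    return {key: value for key, value in record.items() if key not in columns}


row = {"my_col": "eu", "value": 1}
print(referenced_fields(PATH))
print(expand(PATH, row, datetime(2023, 10, 12), ".parquet"))
print(elide(row, ["my_col"]))
```

Note that plain attribute access does not zero-pad, so a single-digit month or day would produce a shorter file name; a real implementation would presumably want explicit timestamp formatting.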

We could call this "prefix" rather than path. (Path is symmetric with
reading, but prefix is a bit more direct.) Anyone want to voice
their opinion here?




On Wed, Oct 11, 2023 at 9:01 AM Chamikara Jayalath 
wrote:

>
>
> On Wed, Oct 11, 2023 at 6:55 AM Kenneth Knowles  wrote:
>
>> So, top-posting because the threading got to be a lot for me and I think
>> it forked a bit too... I may even be restating something someone said, so
>> apologies for that.
>>
>> Very very good point about *required* parameters where if you don't use
>> them then you will end up with two writers writing to the same file. The
>> easiest example to work with might be if you omitted SHARD_NUM so all
>> shards end up clobbering the same file.
>>
>> I think there's a unifying perspective between prefix/suffix and the need
>> to be sure to include critical sharding variables. Essentially it is my
>> point about it being a "big data fileset". It is perhaps unrealistic but
>> ideally the user names the big data fileset and then the mandatory other
>> pieces are added outside of their control. For example if I name my big
>> data fileset "foo" then that implicitly means that "foo" consists of all
>> the files named "foo/${SHARD_NUM}-of-${SHARD_TOTAL}". And yes now that I
>> re-read I see you basically said the same thing. In some cases the required
>> fields will include $WINDOW, $KEY, and $PANE_INDEX, yes? Even though the
>> user can think of it as a textual template, if we can use a library that
>> yields an abstract syntax tree for the expression we can easily check these
>> requirements in a robust way - or we could do it in a non-robust way by
>> string-scraping ourselves.
>>
>
> Yes. I think we are talking about the same thing. Users should not have
> full control over the filename since that could lead to conflicts and data
> loss when data is being written in parallel from multiple workers. Users
> can refer to the big data fileset being written using the glob "/**".
> In addition users have control over the filename  and 
> (file extension, for example) which can be useful for some downstream
> use-cases. The rest of the filename (window, pane, etc.) will be filled out
> by the SDK to make sure that the files written by different workers do not
> conflict.
>
> Thanks,
> Cham
>
>
>>
>> We actually are very close to this in FileIO. I think the interpretation
>> of "prefix" is that it is the filename "foo" as above, and "suffix" is
>> really something like ".txt" that you stick on the end of everything for
>> whatever reason.
>>
>> Kenn
>>
>> On Tue, Oct 10, 2023 at 7:12 PM Robert Bradshaw via dev <
>> dev@beam.apache.org> wrote:
>>
>>> On Tue, Oct 10, 2023 at 4:05 PM Chamikara Jayalath 
>>> wrote:
>>>
>>>>
>>>> On Tue, Oct 10, 2023 at 4:02 PM Robert Bradshaw 
>>>> wrote:
>>>>
>>>>> On Tue, Oct 10, 2023 at 3:53 PM Chamikara Jayalath <
>>>>> chamik...@google.com> wrote:
>>>>>
>>>>>>
>>>>>> On Tue, Oct 10, 2023 at 3:41 PM Reuven Lax  wrote:
>>>>>>
>>>>>>> I suspect some simple pattern templating would solve most use cases.
>>>>>>> We probably would want to support timestamp formatting (e.g. $ $M $D)
>>>>>>> as well.
>>>>>>>
>>>>>>> On Tue, Oct 10, 2023 at 3:35 PM Robert Bradshaw 
>>>>>>> wrote:
>>>>>>>
>>>>>>>> On Mon, Oct 9, 2023 at 3:09 PM Chamikara Jayalath <
>>>>>>>> chamik...@google.com> wrote:
>>>>>>>>
>>>>>>>>> I would say:
>>>>>>>>>
>>>>>>>>> sink:
>>>>>>>>>   type: WriteToParquet
>>>>>>>>>   config:
>>>>>>>>>     path: /beam/filesystem/dest
>>>>>>>>>     prefix: 
>>>>>>>>>     suffix: 
>>>>>>>>>
>>>>>>>>> The underlying SDK will add the middle part of the file names to make
>>>>>>>>> sure that files generated by various bundles/windows/shards do not
>>>>>>>>> conflict.
>>>>>>>>>
>>>>>>>>
>>>>>>>> What's the relationship between path and prefix? Is path the
>>>>>>>> directory part of the full path, or does prefix precede it?
>>>>>>>>
>>>>>>>
>>>>>> prefix would be the first part of the file name so each shard will be
>>>>>> named.
>>>>>> /--
>>>>>>
>>>>>> This is similar to what we do in existing SDKs. For example, Java
>>>>>> FileIO,
>>>>>>
>>>>>>
>>>>>> https://github.com/apache/beam/blob/65eaf45026e9eeb61a9e05412488e5858faec6de/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileIO.java#L187
>>>>>>
>>>>>
>>>>> Yeah, 

Re: [Question] Read Parquet Schema from S3 Directory

2023-10-12 Thread Robert Bradshaw via dev
You'll probably need to resolve "s3a:///*.parquet" out into a
concrete non-glob filepattern to inspect it this way. Presumably any
individual shard will do. match and open from
https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/io/FileSystems.html
may be useful.
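
The idea of resolving the glob first and then inspecting any single shard can be sketched as follows. This is only a local-filesystem stand-in written with Python's standard `glob` module; it does not talk to S3 and it is not the Beam `FileSystems.match` API from the Javadoc linked above. The helper name `first_match` is hypothetical.

```python
import glob
import os
import tempfile


def first_match(pattern):
    """Expand a glob and return one concrete path (any shard will do)."""
    matches = sorted(glob.glob(pattern))
    if not matches:
        raise FileNotFoundError(f"no files match {pattern!r}")
    return matches[0]


# Demonstration against a throwaway directory with two empty "shards".
with tempfile.TemporaryDirectory() as directory:
    for name in ("part-0.parquet", "part-1.parquet"):
        open(os.path.join(directory, name), "wb").close()
    shard = first_match(os.path.join(directory, "*.parquet"))
    print(os.path.basename(shard))  # part-0.parquet
```

The concrete path returned this way is what would then be handed to the schema reader, instead of the original pattern.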

On Wed, Oct 11, 2023 at 10:29 AM Ramya Prasad via dev 
wrote:

> Hello,
>
> I am a developer trying to use Apache Beam in my Java application, and I'm
> running into an issue with reading multiple Parquet files from a directory
> in S3. I'm able to successfully run this line of code, where tempPath  =
> "s3:///*.parquet":
> PCollection<GenericRecord> records = pipeline.apply(
>     "Read parquet file in as Generic Records",
>     ParquetIO.read(schema).from(tempPath));
>
> My problem is reading the schema beforehand. At runtime, I only have the
> name of the S3 bucket, which has all the Parquet files I need underneath
> it. However, I am unable to use that same tempPath above to retrieve my
> schema. Because the path is not pointing to a singular parquet file, the
> ParquetFileReader class from Apache Hadoop throws an error: No such file or
> directory: s3a:///*.parquet.
>
> To read my schema, I'm using this chunk of code:
>
> Configuration configuration = new Configuration();
> configuration.set("fs.s3a.access.key", "");
> configuration.set("fs.s3a.secret.key", "");
> configuration.set("fs.s3a.session.token", "");
> configuration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem");
> configuration.set("fs.s3a.server-side-encryption-algorithm", "");
> configuration.set("fs.s3a.proxy.host", "");
> configuration.set("fs.s3a.proxy.port", "");
> configuration.set("fs.s3a.aws.credentials.provider",
>     "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider");
>
> Path hadoopFilePath = new Path("s3a:///*.parquet");
> ParquetFileReader r =
>     ParquetFileReader.open(HadoopInputFile.fromPath(hadoopFilePath, configuration));
> MessageType messageType = r.getFooter().getFileMetaData().getSchema();
> AvroSchemaConverter converter = new AvroSchemaConverter();
> Schema schema = converter.convert(messageType);
>
> The ParquetFileReader.open(...) call (marked in red in the original message)
> is where the code is failing. Is there maybe a Hadoop Configuration I can set
> to force Hadoop to read recursively?
>
> I realize this is kind of a Beam-adjacent problem, but I've been
> struggling with this for a while, so any help would be appreciated!
>
> Thanks and sincerely,
> Ramya


Apache Beam 2.50.0 - org.apache.beam.sdk.options.MemoryMonitorOptions ClassDef Not found

2023-10-12 Thread Deliwala, Jaymik H. via dev
Hello Team

Greetings!!

As part of upgrading our Dataflow / Apache Beam version from 2.46.0 to
2.49.0/2.50.0, we are able to compile the mvn package successfully.
However, while running the compile exec command, we get the error below:
java.lang.NoClassDefFoundError for org.apache.beam.sdk.options.MemoryMonitorOptions

2023-10-12T10:39:17.9662373Z java.lang.NoClassDefFoundError: org/apache/beam/sdk/options/MemoryMonitorOptions
2023-10-12T10:39:17.9665631Z at java.lang.ClassLoader.defineClass1 (Native Method)
2023-10-12T10:39:17.9668538Z at java.lang.ClassLoader.defineClass (ClassLoader.java:1022)
2023-10-12T10:39:17.9670605Z at java.security.SecureClassLoader.defineClass (SecureClassLoader.java:174)
2023-10-12T10:39:17.9672586Z at java.net.URLClassLoader.defineClass (URLClassLoader.java:555)
2023-10-12T10:39:17.9674516Z at java.net.URLClassLoader$1.run (URLClassLoader.java:458)
2023-10-12T10:39:17.9676939Z at java.net.URLClassLoader$1.run (URLClassLoader.java:452)
2023-10-12T10:39:17.9678816Z at java.security.AccessController.doPrivileged (Native Method)
2023-10-12T10:39:17.9690248Z at java.net.URLClassLoader.findClass (URLClassLoader.java:451)
2023-10-12T10:39:17.9692252Z at java.lang.ClassLoader.loadClass (ClassLoader.java:594)
2023-10-12T10:39:17.9694175Z at java.lang.ClassLoader.loadClass (ClassLoader.java:527)
2023-10-12T10:39:17.9696026Z at java.lang.ClassLoader.defineClass1 (Native Method)
2023-10-12T10:39:17.961Z at java.lang.ClassLoader.defineClass (ClassLoader.java:1022)
2023-10-12T10:39:17.9704965Z at java.security.SecureClassLoader.defineClass (SecureClassLoader.java:174)
2023-10-12T10:39:17.9706908Z at java.net.URLClassLoader.defineClass (URLClassLoader.java:555)
2023-10-12T10:39:17.9708814Z at java.net.URLClassLoader$1.run (URLClassLoader.java:458)
2023-10-12T10:39:17.9710696Z at java.net.URLClassLoader$1.run (URLClassLoader.java:452)
2023-10-12T10:39:17.9712552Z at java.security.AccessController.doPrivileged (Native Method)
2023-10-12T10:39:17.9714469Z at java.net.URLClassLoader.findClass (URLClassLoader.java:451)
2023-10-12T10:39:17.9716698Z at java.lang.ClassLoader.loadClass (ClassLoader.java:594)
2023-10-12T10:39:17.9718622Z at java.lang.ClassLoader.loadClass (ClassLoader.java:527)
2023-10-12T10:39:17.9720635Z at org.apache.beam.runners.dataflow.DataflowPipelineRegistrar$Options.getPipelineOptions (DataflowPipelineRegistrar.java:40)
2023-10-12T10:39:17.9722611Z at org.apache.beam.sdk.options.PipelineOptionsFactory$Cache.initializeRegistry (PipelineOptionsFactory.java:2090)
2023-10-12T10:39:17.9724487Z at org.apache.beam.sdk.options.PipelineOptionsFactory$Cache. (PipelineOptionsFactory.java:2083)
2023-10-12T10:39:17.9726421Z at org.apache.beam.sdk.options.PipelineOptionsFactory$Cache. (PipelineOptionsFactory.java:2047)
2023-10-12T10:39:17.9728327Z at org.apache.beam.sdk.options.PipelineOptionsFactory.resetCache (PipelineOptionsFactory.java:581)
2023-10-12T10:39:17.9730230Z at org.apache.beam.sdk.options.PipelineOptionsFactory. (PipelineOptionsFactory.java:547)
2023-10-12T10:39:17.9732113Z at cio.mmt.PubSubTopicsToBigQuery.main (PubSubTopicsToBigQuery.java:55)
2023-10-12T10:39:17.9734009Z at org.codehaus.mojo.exec.ExecJavaMojo$1.run (ExecJavaMojo.java:254)
2023-10-12T10:39:17.9743608Z at java.lang.Thread.run (Thread.java:829)
2023-10-12T10:39:17.9745720Z Caused by: java.lang.ClassNotFoundException: org.apache.beam.sdk.options.MemoryMonitorOptions
2023-10-12T10:39:17.9748376Z at java.net.URLClassLoader.findClass (URLClassLoader.java:476)
2023-10-12T10:39:17.9750298Z at java.lang.ClassLoader.loadClass (ClassLoader.java:594)
2023-10-12T10:39:17.9752205Z at java.lang.ClassLoader.loadClass (ClassLoader.java:527)
2023-10-12T10:39:17.9754041Z at java.lang.ClassLoader.defineClass1 (Native Method)
2023-10-12T10:39:17.9756293Z at java.lang.ClassLoader.defineClass (ClassLoader.java:1022)
2023-10-12T10:39:17.9758225Z at java.security.SecureClassLoader.defineClass (SecureClassLoader.java:174)
2023-10-12T10:39:17.9760115Z at java.net.URLClassLoader.defineClass (URLClassLoader.java:555)
2023-10-12T10:39:17.9761982Z at java.net.URLClassLoader$1.run (URLClassLoader.java:458)
2023-10-12T10:39:17.9763915Z at java.net.URLClassLoader$1.run (URLClassLoader.java:452)
2023-10-12T10:39:17.9765721Z at java.security.AccessController.doPrivileged (Native Method)
2023-10-12T10:39:17.9767732Z at java.net.URLClassLoader.findClass (URLClassLoader.java:451)
2023-10-12T10:39:17.9769643Z at java.lang.ClassLoader.loadClass (ClassLoader.java:594)
2023-10-12T10:39:17.9771552Z at java.lang.ClassLoader.loadClass (ClassLoader.java:527)
2023-10-12T10:39:17.9821607Z at java.lang.ClassLoader.defineClass1 (Native Method)
2023-10-12T10:39:17.9824027Z at java.lang.ClassLoader.defineClass 

Re: Proposal for pyproject.toml Support in Apache Beam Python

2023-10-12 Thread Anand Inguva via dev
gen_protos.py is called whenever an sdist, a wheel, or an editable
installation is built. We run pytest through tox, and gen_protos.py is called
during the tox build step, when it creates either the wheel or the sdist.

For building an sdist, the command has changed from `python setup.py sdist` to
`python -m build --sdist`. Editable installation still happens with `pip
install -e .`, and the rest of the development process follows more or less
the same behavior as before.


On Thu, Oct 12, 2023 at 5:06 PM Robert Bradshaw  wrote:

> On Thu, Oct 12, 2023 at 2:04 PM Anand Inguva 
> wrote:
>
>> I am in the process of updating the documentation at
>> https://cwiki.apache.org/confluence/display/BEAM/Python+Tips related to
>> the setup.py/pyproject.toml changes, but yes, you can't call setup.py
>> directly because it might fail when Beam Python's build-time dependencies
>> are missing.
>>
>> With regard to other files (e.g. protos), we will follow similar
>> behavior as before (generating protos using `gen_protos.py`).
>>
>
> Meaning this will be called automatically when needed (e.g. from pytest)?
>
>
>> On Thu, Oct 12, 2023 at 4:01 PM Robert Bradshaw 
>> wrote:
>>
>>> Does this change any development practices? E.g. if I clone the repo,
>>> I'm assuming I couldn't run "setup.py test" anymore. What about the
>>> generated files (like protos, or the yaml definitions copied from other
>>> parts of the repo)?
>>>
>>> On Thu, Oct 12, 2023 at 12:27 PM Anand Inguva via dev <
>>> dev@beam.apache.org> wrote:
>>>
>>>> The PR https://github.com/apache/beam/pull/28385 is merged today. If
>>>> there are any observed failures, please comment on the PR and I will follow
>>>> up with a forward fix. Thanks.
>>>>
>>>> On Fri, Sep 1, 2023 at 2:30 PM Anand Inguva 
>>>> wrote:
>>>>
>>>>> Since there is positive feedback from the dev community, I am going
>>>>> ahead and implementing this proposal for the Python SDK.
>>>>>
>>>>> @aus...@apache.org   Initially let's move forward
>>>>> with setuptools as the backend for building the package; as part of
>>>>> future work, we can find a better backend than setuptools.
>>>>>
>>>>> Thanks for the feedback.
>>>>> Anand
>>>>>
>>>>> On Mon, Aug 28, 2023 at 12:00 PM Austin Bennett 
>>>>> wrote:
>>>>>
>>>>>> I've thought about this a ton, but haven't been in a position to
>>>>>> undertake the work.  Thanks for bringing this up, @Anand Inguva!
>>>>>>
>>>>>> I'd point us to https://python-poetry.org/  ... [ which is where I'd
>>>>>> look to take us, but I'm also not able to do all the work, so my
>>>>>> suggestion/preference doesn't matter that much ]
>>>>>>
>>>>>> https://python-poetry.org/docs/pyproject#the-pyprojecttoml-file <-
>>>>>> for info on pyproject.toml file.
>>>>>>
>>>>>> Notice the use of a 'lock' file is very valuable, ex:
>>>>>> https://python-poetry.org/docs/basic-usage/#committing-your-poetrylock-file-to-version-control
>>>>>>
>>>>>> I haven't come across `build`, that might be great too.  I'd
>>>>>> highlight that Poetry is pretty common across industry these days,
>>>>>> rock-solid, ecosystem of interoperability, users, etc...   If not
>>>>>> familiar, PLEASE have a look at that.
>>>>>>
>>>>>> On Mon, Aug 28, 2023 at 8:04 AM Kerry Donny-Clark via dev <
>>>>>> dev@beam.apache.org> wrote:
>>>>>>
>>>>>>> +1
>>>>>>> Hi Anand,
>>>>>>> I appreciate this effort. Managing python dependencies has been a
>>>>>>> major pain point for me, and I think this approach would help.
>>>>>>> Kerry
>>>>>>>
>>>>>>> On Mon, Aug 28, 2023 at 10:14 AM Anand Inguva via dev <
>>>>>>> dev@beam.apache.org> wrote:
>>>>>>>
>>>>>>>> Hello Beam Dev Team,
>>>>>>>>
>>>>>>>> I've compiled a design document [1] proposing the integration of
>>>>>>>> pyproject.toml into Apache Beam's Python build process. Your insights
>>>>>>>> and feedback would be invaluable.
>>>>>>>>
>>>>>>>> What is pyproject.toml?
>>>>>>>> pyproject.toml is a configuration file that specifies a project's
>>>>>>>> build dependencies and other project-related metadata in a standardized
>>>>>>>> format. Before pyproject.toml, Python projects often had multiple
>>>>>>>> configuration files (like setup.py, setup.cfg, and requirements.txt).
>>>>>>>> pyproject.toml aims to centralize these configurations into one place,
>>>>>>>> making project setups more organized and straightforward. One of the
>>>>>>>> significant features enabled by pyproject.toml is the ability to
>>>>>>>> perform isolated builds. This ensures that build dependencies are
>>>>>>>> separated from the project's runtime dependencies, leading to more
>>>>>>>> consistent and reproducible builds.
>>>>>>>>
>>>>>>>> [1]
>>>>>>>> https://docs.google.com/document/d/17-y48WW25-VGBWZNyTdoN0WUN03k9ZhJjLp9wtyG1Wc/edit#heading=h.wskna8eurvjv
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Anand
>>>>>>>>
>>>>>>>


Re: Proposal for pyproject.toml Support in Apache Beam Python

2023-10-12 Thread Robert Bradshaw via dev
On Thu, Oct 12, 2023 at 2:04 PM Anand Inguva  wrote:

> I am in the process of updating the documentation at
> https://cwiki.apache.org/confluence/display/BEAM/Python+Tips related to
> the setup.py/pyproject.toml changes, but yes, you can't call setup.py
> directly because it might fail when Beam Python's build-time dependencies
> are missing.
>
> With regard to other files (e.g. protos), we will follow similar
> behavior as before (generating protos using `gen_protos.py`).
>

Meaning this will be called automatically when needed (e.g. from pytest)?




Re: Proposal for pyproject.toml Support in Apache Beam Python

2023-10-12 Thread Anand Inguva via dev
I am in the process of updating the documentation at
https://cwiki.apache.org/confluence/display/BEAM/Python+Tips related to
the setup.py/pyproject.toml changes, but yes, you can't call setup.py directly
because it might fail when Beam Python's build-time dependencies are missing.

With regard to other files (e.g. protos), we will follow similar behavior
as before (generating protos using `gen_protos.py`).



Re: Proposal for pyproject.toml Support in Apache Beam Python

2023-10-12 Thread Robert Bradshaw via dev
Does this change any development practices? E.g. if I clone the repo, I'm
assuming I couldn't run "setup.py test" anymore. What about the generated
files (like protos, or the yaml definitions copied from other parts of the
repo)?



Re: Proposal for pyproject.toml Support in Apache Beam Python

2023-10-12 Thread Anand Inguva via dev
The PR https://github.com/apache/beam/pull/28385 is merged today. If there
are any observed failures, please comment on the PR and I will follow up
with a forward fix. Thanks.



Beam High Priority Issue Report (43)

2023-10-12 Thread beamactions
This is your daily summary of Beam's current high priority issues that may need attention.

See https://beam.apache.org/contribute/issue-priorities for the meaning and expectations around issue priorities.

Unassigned P1 Issues:

https://github.com/apache/beam/issues/28909 [Stuck Test]: GitHub Action issue_comment trigger not scalable
https://github.com/apache/beam/issues/28760 [Bug]: EFO Kinesis IO reader provided by apache beam does not pick the event time for watermarking
https://github.com/apache/beam/issues/28703 [Failing Test]: Building a wheel for integration tests sometimes times out
https://github.com/apache/beam/issues/28383 [Failing Test]: org.apache.beam.runners.dataflow.worker.StreamingDataflowWorkerTest.testMaxThreadMetric
https://github.com/apache/beam/issues/28339 Fix failing "beam_PostCommit_XVR_GoUsingJava_Dataflow" job
https://github.com/apache/beam/issues/28326 Bug: apache_beam.io.gcp.pubsublite.ReadFromPubSubLite not working
https://github.com/apache/beam/issues/28142 [Bug]: [Go SDK] Memory seems to be leaking on 2.49.0 with Dataflow
https://github.com/apache/beam/issues/27892 [Bug]: ignoreUnknownValues not working when using CreateDisposition.CREATE_IF_NEEDED
https://github.com/apache/beam/issues/27648 [Bug]: Python SDFs (e.g. PeriodicImpulse) running in Flink and polling using tracker.defer_remainder have checkpoint size growing indefinitely
https://github.com/apache/beam/issues/27616 [Bug]: Unable to use applyRowMutations() in bigquery IO apache beam java
https://github.com/apache/beam/issues/27486 [Bug]: Read from datastore with inequality filters
https://github.com/apache/beam/issues/27314 [Failing Test]: bigquery.StorageApiSinkCreateIfNeededIT.testCreateManyTables[1]
https://github.com/apache/beam/issues/27238 [Bug]: Window trigger has lag when using Kafka and GroupByKey on Dataflow Runner
https://github.com/apache/beam/issues/26981 [Bug]: Getting an error related to SchemaCoder after upgrading to 2.48
https://github.com/apache/beam/issues/26911 [Bug]: UNNEST ARRAY with a nested ROW (described below)
https://github.com/apache/beam/issues/26343 [Bug]: apache_beam.io.gcp.bigquery_read_it_test.ReadAllBQTests.test_read_queries is flaky
https://github.com/apache/beam/issues/26329 [Bug]: BigQuerySourceBase does not propagate a Coder to AvroSource
https://github.com/apache/beam/issues/26041 [Bug]: Unable to create 
exactly-once Flink pipeline with stream source and file sink
https://github.com/apache/beam/issues/25975 [Bug]: Reducing parallelism in 
FlinkRunner leads to a data loss
https://github.com/apache/beam/issues/24776 [Bug]: Race condition in Python SDK 
Harness ProcessBundleProgress
https://github.com/apache/beam/issues/24389 [Failing Test]: 
HadoopFormatIOElasticTest.classMethod ExceptionInInitializerError 
ContainerFetchException
https://github.com/apache/beam/issues/24313 [Flaky]: 
apache_beam/runners/portability/portable_runner_test.py::PortableRunnerTestWithSubprocesses::test_pardo_state_with_custom_key_coder
https://github.com/apache/beam/issues/23944  beam_PreCommit_Python_Cron 
regularily failing - test_pardo_large_input flaky
https://github.com/apache/beam/issues/23709 [Flake]: Spark batch flakes in 
ParDoLifecycleTest.testTeardownCalledAfterExceptionInProcessElement and 
ParDoLifecycleTest.testTeardownCalledAfterExceptionInStartBundle
https://github.com/apache/beam/issues/23525 [Bug]: Default PubsubMessage coder 
will drop message id and orderingKey
https://github.com/apache/beam/issues/22913 [Bug]: 
beam_PostCommit_Java_ValidatesRunner_Flink is flakes in 
org.apache.beam.sdk.transforms.GroupByKeyTest$BasicTests.testAfterProcessingTimeContinuationTriggerUsingState
https://github.com/apache/beam/issues/22605 [Bug]: Beam Python failure for 
dataflow_exercise_metrics_pipeline_test.ExerciseMetricsPipelineTest.test_metrics_it
https://github.com/apache/beam/issues/21714 
PulsarIOTest.testReadFromSimpleTopic is very flaky
https://github.com/apache/beam/issues/21706 Flaky timeout in github Python unit 
test action 
StatefulDoFnOnDirectRunnerTest.test_dynamic_timer_clear_then_set_timer
https://github.com/apache/beam/issues/21643 FnRunnerTest with non-trivial 
(order 1000 elements) numpy input flakes in non-cython environment
https://github.com/apache/beam/issues/21476 WriteToBigQuery Dynamic table 
destinations returns wrong tableId
https://github.com/apache/beam/issues/21469 beam_PostCommit_XVR_Flink flaky: 
Connection refused
https://github.com/apache/beam/issues/21424 Java VR (Dataflow, V2, Streaming) 
failing: ParDoTest$TimestampTests/OnWindowExpirationTests
https://github.com/apache/beam/issues/21262 Python AfterAny, AfterAll do not 
follow spec
https://github.com/apache/beam/issues/21260 Python DirectRunner does not emit 
data at GC time
https://github.com/apache/beam/issues/21121 
apache_beam.examples.streaming_wordcount_it_test.StreamingWordCountIT.test_streaming_wordcount_it
 flakey
https://github.com/apache/beam/issues/21104 Flaky: