Is the following workflow suitable for Apache Beam?

2019-11-28 Thread Marco Mistroni
Hi all,
I am currently getting acquainted with Apache Beam to replace my
current workflow, and was wondering if Beam can handle it.
Currently, my workflow is based entirely on Python asyncio plus some
groupby operations, and it consists of the following:

- I have a list of remote directories, from each of which I need to download
a file; the file has the same name across directories
- for each of the files above, I need to scan the content (which is itself a
list of remote file paths)
- for each of the file paths above, I need to extract the content to a list
of strings
- I need to do a reduceByKey operation over all the lists extracted above
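
For illustration only, the four steps above might map onto Beam transforms
roughly as in the sketch below. Several things here are assumptions that are
not stated in this message: the shared file name ("index.txt"), the record
format of the final files (one "key,value" pair per line, with an integer
value), the use of sum as the reduction, and the helper names index_path,
to_key_value and build_pipeline, which are all hypothetical. The paths also
need to be on a filesystem that Beam's file IO can read. In this shape the
runner, rather than asyncio, is responsible for running the reads in parallel.

```python
def index_path(directory, index_name):
    """Build the path of the same-named index file inside a directory."""
    return f"{directory.rstrip('/')}/{index_name}"


def to_key_value(line):
    """Parse one line as 'key,value' (assumed format) into (str, int)."""
    key, value = line.split(",", 1)
    return key.strip(), int(value)


def build_pipeline(pipeline, directories, index_name="index.txt"):
    """Attach the four workflow steps to an existing beam.Pipeline."""
    # Imported here so the two helpers above can be used without Beam installed.
    import apache_beam as beam

    return (
        pipeline
        # Step 1: one element per remote directory, turned into an index path.
        | "Directories" >> beam.Create(directories)
        | "IndexPaths" >> beam.Map(index_path, index_name)
        # Step 2: each line of an index file is itself a remote file path.
        | "ReadIndexFiles" >> beam.io.ReadAllFromText()
        | "StripPaths" >> beam.Map(str.strip)
        | "DropBlankPaths" >> beam.Filter(lambda path: path != "")
        # Step 3: read every listed file, one element per line.
        | "ReadListedFiles" >> beam.io.ReadAllFromText()
        | "DropBlankLines" >> beam.Filter(lambda line: bool(line.strip()))
        # Step 4: the reduceByKey equivalent (sum is an assumed reduction).
        | "ToKeyValue" >> beam.Map(to_key_value)
        | "ReduceByKey" >> beam.CombinePerKey(sum)
    )
```

To try it locally, something like "with beam.Pipeline() as p:
build_pipeline(p, directories) | beam.Map(print)" should run on the default
direct runner, given a real list of directories.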

To me, it seems suitable. The only thing that concerns me is that I
probably have to drop asyncio.
Could anyone advise?

kind regards
 Marco


Re: Installing system dependencies in a DataFlow worker - how

2019-11-28 Thread Carl Thomé
Thanks!

I tried following
https://beam.apache.org/documentation/runtime/environments/ but get a "Custom
images are not yet supported" error message from DataFlow. Perhaps I did
something wrong?

"error": {
  "code": 400,
  "message": "(24f8c9b6e647d55d): The workflow could not be created. Causes: (24f8c9b6e647de48): Invalid worker harness container image: my_image. Custom images are not yet supported.",
  "status": "INVALID_ARGUMENT"
}

On Wed, 27 Nov 2019 at 18:34, Kyle Weaver wrote:

> You can also configure your own Docker images if you like, instructions
> here: https://beam.apache.org/documentation/runtime/environments/
>
> On Wed, Nov 27, 2019 at 12:38 AM Carl Thomé wrote:
>
>> Hi,
>>
>> I have a Beam pipeline written in the Python SDK that decodes audio files
>> into TFRecords. I'd like to run it on DataFlow but I'm missing libsndfile1
>> in the workers.
>>
>> Is there any way of configuring the base image for the DataFlow workers
>> (e.g. Dockerfile + apt install) to get audio decoding working?
>>
>> On a similar note, when it comes to Python dependencies in the DataFlow
>> runtime (like librosa), is there a wish list somewhere on which we can
>> upvote missing Python libraries?
>>
>> Cheers,
>> Carl Thomé
>>
>