[ 
https://issues.apache.org/jira/browse/BEAM-7998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16937092#comment-16937092
 ] 

Pablo Estrada commented on BEAM-7998:
-------------------------------------

I ran the following pipeline locally and had no problems, but if the problem
you describe is happening, it would be pretty serious. Are you running only
locally, or also on Dataflow? If you're using Dataflow, it makes sense to
file a support ticket to get this resolved.
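{code:python}
import json

import apache_beam as beam
from apache_beam.io import fileio


def run():
  with beam.Pipeline() as p:
    # Match the files, open each one, and pair its path with its parsed JSON.
    pairs = (p
        | fileio.MatchFiles('gs://my-bucket/*.json')
        | fileio.ReadMatches()
        | beam.Map(lambda f: (f.metadata.path,
                              json.loads(f.read_utf8()))))

    # Print each (path, content) pair so that repeated paths are easy to spot.
    pairs | beam.Map(lambda x: print(x))
{code}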
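
To check whether the same file really is emitted more than once, one option
is to collect the paths that the pipeline prints and count them. This is only
a minimal sketch in plain Python: the helper name find_duplicate_paths and the
sample paths are hypothetical and are not part of Beam.

```python
from collections import Counter


def find_duplicate_paths(paths):
    """Return {path: count} for every path that appears more than once."""
    counts = Counter(paths)
    return {path: n for path, n in counts.items() if n > 1}


# Hypothetical sample: paths as they might be collected from the pipeline output.
sample = [
    'gs://my-bucket/a.json',
    'gs://my-bucket/b.json',
    'gs://my-bucket/a.json',
]
print(find_duplicate_paths(sample))
```

An empty result would suggest that each file was matched only once in that run.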

> MatchFiles or MatchAll seems to return the same element several times
> ---------------------------------------------------------------------
>
>                 Key: BEAM-7998
>                 URL: https://issues.apache.org/jira/browse/BEAM-7998
>             Project: Beam
>          Issue Type: Bug
>          Components: io-py-files
>    Affects Versions: 2.14.0
>         Environment: GCP for storage, DirectRunner and DataflowRunner both 
> have the problem. PyCharm on Win10 for IDE and dev environment.
>            Reporter: Jerome MASSOT
>            Assignee: Pablo Estrada
>            Priority: Major
>              Labels: ccoss2019
>
> Hi team,
> when I use MatchFiles with a wildcard on files located in a GCP bucket, the
> MatchFiles transform returns the same file several times (at least twice).
> I have tried to follow the stack, and I can see that MatchAll is called
> twice when I run the pipeline on a debug project where a single element is
> present in the bucket.
> But I don't know enough to say more than that. Sorry.
> Best regards
> Jerome



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
