Hello! I dug a bit into this (not a FileIO expert), and it looks like LocalFileSystem only matches globs in file names (not directories): https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/LocalFileSystem.java#L251
Perhaps related: https://issues.apache.org/jira/browse/BEAM-1309

There's a note in the FileSystem javadoc that makes me suspect that globs aren't expected to expand everywhere in the "paths" for all filesystems, but *should* work in the last hierarchical element: https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileSystem.java#L59 (noting that the last hierarchical element doesn't necessarily mean "files only", in my opinion). It kind of makes sense -- wildcards at the top of a hierarchy in a large filesystem can end up creating a huge internal "query" that walks the entire tree.

I gave a quick try at making a composable pipeline that matched "part-*/data" using the FileIO.matchAll() technique for TextIO, but didn't succeed. It's a bit surprising to me, so I'm interested in whether this could be a feature improvement... It seems reasonable that we could construct something like:

PCollection<String> lines = p
    .apply(Create.of("/tmp/input/"))
    .apply(FileIO.matchResolveDirectory("part-*"))
    .apply(FileIO.matchResolveFile("data"))
    .apply(FileIO.readMatches().withCompression(AUTO))
    .apply(TextIO.readFiles());

Does anybody have a bit more experience with how to construct something like that correctly?
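For concreteness, here is roughly the shape of the FileIO.matchAll() attempt I made with the existing API (just a sketch, untested beyond my quick try -- the class and step names are only illustrative, and whether the first match actually returns the part-* directories on the local filesystem is exactly the open question):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.fs.MatchResult;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;
import static org.apache.beam.sdk.io.Compression.AUTO;

public class TwoStageGlobSketch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create();

    // Stage 1: expand the directory-level glob. Whether this hands back the
    // part-* directories themselves on LocalFileSystem is the open question,
    // since it only seems to glob the last path element.
    PCollection<MatchResult.Metadata> dirs =
        p.apply(Create.of("/tmp/input/part-*"))
            .apply("MatchDirs", FileIO.matchAll());

    // Stage 2: resolve the fixed child name under each matched directory,
    // then match again to get the actual data files.
    PCollection<String> lines =
        dirs.apply(
                MapElements.into(TypeDescriptors.strings())
                    // (may need to normalize a trailing '/' depending on the filesystem)
                    .via((MatchResult.Metadata m) -> m.resourceId().toString() + "/data"))
            .apply("MatchDataFiles", FileIO.matchAll())
            .apply(FileIO.readMatches().withCompression(AUTO))
            .apply(TextIO.readFiles());

    p.run().waitUntilFinish();
  }
}

If the first match does return the part-* directories, the second matchAll() should behave like matching each /part-NNNN/data individually, which reportedly works; if it doesn't, that would explain the failure and maybe point at the feature gap.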
Best regards, Ryan

On Tue, Jul 16, 2019 at 4:25 PM Shannon Duncan <[email protected]> wrote:

> I am still having the problem that the local file system (DirectRunner) will not allow a local GLOB string to be passed as a file source. I have tried both relative and fully qualified paths.
>
> I can confirm the same inputFile source GLOB returns data on a simple cat command, so I know the GLOB is good.
>
> Error: "java.io.FileNotFoundException: No files matched spec: /Users/<user>/github/<repo>/io/sequenceFile/part-*/data
>
> Any assistance would be greatly appreciated. This is on the Java SDK.
>
> I tested this with TextIO.read().from(ValueProvider<String>); still the same.
>
> Thanks,
> Shannon
>
> On Fri, Jul 12, 2019 at 2:14 PM Igor Bernstein <[email protected]> wrote:
>
>> I'm not sure, to be honest. The pattern expansion happens in FileBasedSource via FileSystems.match(), so it should follow the same expansion rules as other file-based sources like TextIO. Maybe someone with more Beam experience can help?
>>
>> On Fri, Jul 12, 2019 at 2:55 PM Shannon Duncan <[email protected]> wrote:
>>
>>> Clarification on my previous message: this only happens on the local file system, where it is unable to match a pattern string. Via a `gs://<bucket>` link it is able to do multiple-file matching.
>>>
>>> On Fri, Jul 12, 2019 at 1:36 PM Shannon Duncan <[email protected]> wrote:
>>>
>>>> Awesome. I got it working for a single file, but for a structure of:
>>>>
>>>> /part-0001/index
>>>> /part-0001/data
>>>> /part-0002/index
>>>> /part-0002/data
>>>>
>>>> I tried to do /part-* and /part-*/data.
>>>>
>>>> It does not find the multipart files. However, if I just do /part-0001/data it will find it and read it.
>>>>
>>>> Any ideas why?
>>>>
>>>> I am using this to generate the source:
>>>>
>>>> static SequenceFileSource<Text, Text> createSource(
>>>>     ValueProvider<String> sourcePattern) {
>>>>   return new SequenceFileSource<Text, Text>(
>>>>       sourcePattern,
>>>>       Text.class,
>>>>       WritableSerialization.class,
>>>>       Text.class,
>>>>       WritableSerialization.class,
>>>>       SequenceFile.SYNC_INTERVAL);
>>>> }
>>>>
>>>> On Wed, Jul 10, 2019 at 10:52 AM Igor Bernstein <[email protected]> wrote:
>>>>
>>>>> It should be fairly straightforward:
>>>>> 1. Copy SequenceFileSource.java <https://github.com/googleapis/cloud-bigtable-client/blob/master/bigtable-dataflow-parent/bigtable-beam-import/src/main/java/com/google/cloud/bigtable/beam/sequencefiles/SequenceFileSource.java> to your project.
>>>>> 2. Add the source to your pipeline, configuring it with appropriate serializers. See here <https://github.com/googleapis/cloud-bigtable-client/blob/master/bigtable-dataflow-parent/bigtable-beam-import/src/main/java/com/google/cloud/bigtable/beam/sequencefiles/ImportJob.java#L159-L173> for an example for HBase Results.
>>>>>
>>>>> On Wed, Jul 10, 2019 at 10:58 AM Shannon Duncan <[email protected]> wrote:
>>>>>
>>>>>> If I wanted to go ahead and include this within a new Java pipeline, what level of work would I be looking at to integrate it?
>>>>>>
>>>>>> On Wed, Jul 3, 2019 at 3:54 AM Ismaël Mejía <[email protected]> wrote:
>>>>>>
>>>>>>> That's great. I can help whenever you need. We just need to choose its destination. Both the `hadoop-format` and `hadoop-file-system` modules are good candidates; I would even feel inclined to put it in its own module, `sdks/java/extensions/sequencefile`, to make it easier for end users to discover.
>>>>>>>
>>>>>>> A thing to consider is the SeekableByteChannel adapters; we could move those into hadoop-common if needed and refactor the modules to share code. It's worth taking a look at org.apache.beam.sdk.io.hdfs.HadoopFileSystem.HadoopSeekableByteChannel#HadoopSeekableByteChannel to see if some of it could be useful.
>>>>>>>
>>>>>>> On Tue, Jul 2, 2019 at 11:46 PM Igor Bernstein <[email protected]> wrote:
>>>>>>> >
>>>>>>> > Hi all,
>>>>>>> >
>>>>>>> > I wrote those classes with the intention of upstreaming them to Beam. I can try to make some time this quarter to clean them up. I would need a bit of guidance from a Beam expert on how to make them coexist with HadoopFormatIO though.
>>>>>>> >
>>>>>>> > On Tue, Jul 2, 2019 at 10:55 AM Solomon Duskis <[email protected]> wrote:
>>>>>>> >>
>>>>>>> >> +Igor Bernstein who wrote the Cloud Bigtable Sequence File classes.
>>>>>>> >>
>>>>>>> >> Solomon Duskis | Google Cloud clients | [email protected] | 914-462-0531
>>>>>>> >>
>>>>>>> >> On Tue, Jul 2, 2019 at 4:57 AM Ismaël Mejía <[email protected]> wrote:
>>>>>>> >>>
>>>>>>> >>> (Adding dev@ and Solomon Duskis to the discussion)
>>>>>>> >>>
>>>>>>> >>> I was not aware of these, thanks for sharing David. It would definitely be a great addition if we could have those donated as an extension on the Beam side. We can even evolve them in the future to be more FileIO-like. Any chance this can happen? Maybe Solomon and his team?
>>>>>>> >>>
>>>>>>> >>> On Tue, Jul 2, 2019 at 9:39 AM David Morávek <[email protected]> wrote:
>>>>>>> >>> >
>>>>>>> >>> > Hi, you can use SequenceFileSink and Source from the BigTable client. Those work nicely with FileIO:
>>>>>>> >>> > https://github.com/googleapis/cloud-bigtable-client/blob/master/bigtable-dataflow-parent/bigtable-beam-import/src/main/java/com/google/cloud/bigtable/beam/sequencefiles/SequenceFileSink.java
>>>>>>> >>> > https://github.com/googleapis/cloud-bigtable-client/blob/master/bigtable-dataflow-parent/bigtable-beam-import/src/main/java/com/google/cloud/bigtable/beam/sequencefiles/SequenceFileSource.java
>>>>>>> >>> >
>>>>>>> >>> > It would be really cool to move these into Beam, but it's up to the Googlers to decide whether they want to donate this.
>>>>>>> >>> >
>>>>>>> >>> > D.
>>>>>>> >>> >
>>>>>>> >>> > On Tue, Jul 2, 2019 at 2:07 AM Shannon Duncan <[email protected]> wrote:
>>>>>>> >>> >>
>>>>>>> >>> >> It's not outside the realm of possibilities. For now I've created an intermediary step: a Hadoop job that converts from sequence to text files.
>>>>>>> >>> >>
>>>>>>> >>> >> Looking into better options.
>>>>>>> >>> >>
>>>>>>> >>> >> On Mon, Jul 1, 2019, 5:50 PM Chamikara Jayalath <[email protected]> wrote:
>>>>>>> >>> >>>
>>>>>>> >>> >>> The Java SDK has a HadoopInputFormatIO with which you should be able to read Sequence files: https://github.com/apache/beam/blob/master/sdks/java/io/hadoop-format/src/main/java/org/apache/beam/sdk/io/hadoop/format/HadoopFormatIO.java
>>>>>>> >>> >>> I don't think there's a direct alternative to this for Python.
>>>>>>> >>> >>>
>>>>>>> >>> >>> Is it possible to write to a well-known format such as Avro instead of a Hadoop-specific format, which would allow you to read from both Dataproc/Hadoop and the Beam Python SDK?
>>>>>>> >>> >>>
>>>>>>> >>> >>> Thanks,
>>>>>>> >>> >>> Cham
>>>>>>> >>> >>>
>>>>>>> >>> >>> On Mon, Jul 1, 2019 at 3:37 PM Shannon Duncan <[email protected]> wrote:
>>>>>>> >>> >>>>
>>>>>>> >>> >>>> That's a pretty big hole for a missing source/sink when looking at transitioning from Dataproc to Dataflow using GCS as a storage buffer instead of a traditional HDFS.
>>>>>>> >>> >>>>
>>>>>>> >>> >>>> From what I've been able to tell from the source code and documentation, Java is able to but Python is not?
>>>>>>> >>> >>>>
>>>>>>> >>> >>>> Thanks,
>>>>>>> >>> >>>> Shannon
>>>>>>> >>> >>>>
>>>>>>> >>> >>>> On Mon, Jul 1, 2019 at 5:29 PM Chamikara Jayalath <[email protected]> wrote:
>>>>>>> >>> >>>>>
>>>>>>> >>> >>>>> I don't think we have a source/sink for reading Hadoop sequence files. Your best bet currently will probably be to use the FileSystem abstraction to create a file from a ParDo and read directly from there using a library that can read sequence files.
>>>>>>> >>> >>>>>
>>>>>>> >>> >>>>> Thanks,
>>>>>>> >>> >>>>> Cham
>>>>>>> >>> >>>>>
>>>>>>> >>> >>>>> On Mon, Jul 1, 2019 at 8:42 AM Shannon Duncan <[email protected]> wrote:
>>>>>>> >>> >>>>>>
>>>>>>> >>> >>>>>> I'm wanting to read a Sequence/Map file from Hadoop stored on Google Cloud Storage via a "gs://bucket/link/SequenceFile-*" pattern via the Python SDK.
>>>>>>> >>> >>>>>>
>>>>>>> >>> >>>>>> I cannot locate any good adapters for this, and the one Hadoop Filesystem reader seems to only read from an "hdfs://" URL.
>>>>>>> >>> >>>>>>
>>>>>>> >>> >>>>>> I'm wanting to use Dataflow and GCS exclusively to start mixing Beam pipelines in with our current Hadoop pipelines.
>>>>>>> >>> >>>>>>
>>>>>>> >>> >>>>>> Is this a feature that is supported or will be supported in the future? Does anyone have any good, performant suggestions for doing this?
>>>>>>> >>> >>>>>>
>>>>>>> >>> >>>>>> I'd also like to be able to write back out to a SequenceFile if possible.
>>>>>>> >>> >>>>>>
>>>>>>> >>> >>>>>> Thanks!