I am still having the problem that the local file system (DirectRunner) will not accept a local glob pattern as a file source. I have tried both relative and fully qualified paths.
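For reference, here is a minimal, self-contained check that should exercise the same matching path FileBasedSource uses. It is only a sketch assuming the Beam Java SDK's FileSystems/MatchResult API, with the same local glob from above:

import java.io.IOException;
import org.apache.beam.sdk.io.FileSystems;
import org.apache.beam.sdk.io.fs.MatchResult;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class MatchCheck {
  public static void main(String[] args) throws IOException {
    // Register file systems the same way a pipeline run would.
    FileSystems.setDefaultPipelineOptions(PipelineOptionsFactory.create());
    // Ask the matcher directly for the glob that the DirectRunner rejects.
    MatchResult result =
        FileSystems.match("/Users/<user>/github/<repo>/io/sequenceFile/part-*/data");
    System.out.println("Match status: " + result.status());
    if (result.status() == MatchResult.Status.OK) {
      for (MatchResult.Metadata m : result.metadata()) {
        System.out.println(m.resourceId());
      }
    }
  }
}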
I can confirm that the same inputFile glob returns data with a simple cat command, so I know the glob is good.

Error: "java.io.FileNotFoundException: No files matched spec: /Users/<user>/github/<repo>/io/sequenceFile/part-*/data"

Any assistance would be greatly appreciated. This is on the Java SDK. I also tested this with TextIO.read().from(ValueProvider<String>); still the same.

Thanks,
Shannon

On Fri, Jul 12, 2019 at 2:14 PM Igor Bernstein <[email protected]> wrote:

> I'm not sure, to be honest. The pattern expansion happens in FileBasedSource via FileSystems.match(), so it should follow the same expansion rules as other file-based sources like TextIO. Maybe someone with more Beam experience can help?
>
> On Fri, Jul 12, 2019 at 2:55 PM Shannon Duncan <[email protected]> wrote:
>
>> Clarification on the previous message: this only happens on the local file system, where it is unable to match a pattern string. Via a `gs://<bucket>` link it is able to match multiple files.
>>
>> On Fri, Jul 12, 2019 at 1:36 PM Shannon Duncan <[email protected]> wrote:
>>
>>> Awesome. I got it working for a single file, but for a structure of:
>>>
>>> /part-0001/index
>>> /part-0001/data
>>> /part-0002/index
>>> /part-0002/data
>>>
>>> I tried /part-* and /part-*/data.
>>>
>>> It does not find the multipart files. However, if I just use /part-0001/data it will find it and read it.
>>>
>>> Any ideas why?
>>>
>>> I am using this to generate the source:
>>>
>>> static SequenceFileSource<Text, Text> createSource(
>>>     ValueProvider<String> sourcePattern) {
>>>   return new SequenceFileSource<Text, Text>(
>>>       sourcePattern,
>>>       Text.class,
>>>       WritableSerialization.class,
>>>       Text.class,
>>>       WritableSerialization.class,
>>>       SequenceFile.SYNC_INTERVAL);
>>> }
>>>
>>> On Wed, Jul 10, 2019 at 10:52 AM Igor Bernstein <[email protected]> wrote:
>>>
>>>> It should be fairly straightforward:
>>>> 1. Copy SequenceFileSource.java <https://github.com/googleapis/cloud-bigtable-client/blob/master/bigtable-dataflow-parent/bigtable-beam-import/src/main/java/com/google/cloud/bigtable/beam/sequencefiles/SequenceFileSource.java> to your project.
>>>> 2. Add the source to your pipeline, configuring it with the appropriate serializers. See here <https://github.com/googleapis/cloud-bigtable-client/blob/master/bigtable-dataflow-parent/bigtable-beam-import/src/main/java/com/google/cloud/bigtable/beam/sequencefiles/ImportJob.java#L159-L173> for an example with HBase Results.
>>>>
>>>> On Wed, Jul 10, 2019 at 10:58 AM Shannon Duncan <[email protected]> wrote:
>>>>
>>>>> If I wanted to go ahead and include this within a new Java pipeline, what level of work would I be looking at to integrate it?
>>>>>
>>>>> On Wed, Jul 3, 2019 at 3:54 AM Ismaël Mejía <[email protected]> wrote:
>>>>>
>>>>>> That's great. I can help whenever you need; we just need to choose its destination. Both the `hadoop-format` and `hadoop-file-system` modules are good candidates, but I would even be inclined to put it in its own module, `sdks/java/extensions/sequencefile`, to make it easier for end users to discover.
>>>>>>
>>>>>> A thing to consider is the SeekableByteChannel adapters; we can move those into hadoop-common if needed and refactor the modules to share code. It is worth taking a look at org.apache.beam.sdk.io.hdfs.HadoopFileSystem.HadoopSeekableByteChannel#HadoopSeekableByteChannel to see if some of it could be useful.
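For what it's worth, here is a minimal sketch of Igor's step 2 above, wiring the copied SequenceFileSource into a pipeline through the generic Read.from adapter. It assumes the createSource helper quoted earlier in this thread and reuses the same local glob; treat it as a sketch of the wiring, not the exact ImportJob code:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.Read;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.ValueProvider.StaticValueProvider;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.hadoop.io.Text;

Pipeline p = Pipeline.create(PipelineOptionsFactory.create());

// SequenceFileSource is a FileBasedSource, so the standard Read.from adapter applies.
// createSource is the helper quoted earlier; the source is expected to provide its own output coder.
PCollection<KV<Text, Text>> records =
    p.apply(
        "ReadSequenceFiles",
        Read.from(createSource(
            StaticValueProvider.of(
                "/Users/<user>/github/<repo>/io/sequenceFile/part-*/data"))));

p.run().waitUntilFinish();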
>>>>>>
>>>>>> On Tue, Jul 2, 2019 at 11:46 PM Igor Bernstein <[email protected]> wrote:
>>>>>> >
>>>>>> > Hi all,
>>>>>> >
>>>>>> > I wrote those classes with the intention of upstreaming them to Beam. I can try to make some time this quarter to clean them up. I would need a bit of guidance from a Beam expert on how to make them coexist with HadoopFormatIO, though.
>>>>>> >
>>>>>> > On Tue, Jul 2, 2019 at 10:55 AM Solomon Duskis <[email protected]> wrote:
>>>>>> >>
>>>>>> >> +Igor Bernstein, who wrote the Cloud Bigtable SequenceFile classes.
>>>>>> >>
>>>>>> >> Solomon Duskis | Google Cloud clients | [email protected] | 914-462-0531
>>>>>> >>
>>>>>> >> On Tue, Jul 2, 2019 at 4:57 AM Ismaël Mejía <[email protected]> wrote:
>>>>>> >>>
>>>>>> >>> (Adding dev@ and Solomon Duskis to the discussion)
>>>>>> >>>
>>>>>> >>> I was not aware of these; thanks for sharing, David. It would definitely be a great addition if we could have those donated as an extension on the Beam side. We can even evolve them in the future to be more FileIO-like. Any chance this can happen? Maybe Solomon and his team?
>>>>>> >>>
>>>>>> >>> On Tue, Jul 2, 2019 at 9:39 AM David Morávek <[email protected]> wrote:
>>>>>> >>> >
>>>>>> >>> > Hi, you can use SequenceFileSink and SequenceFileSource from the Bigtable client. Those work nicely with FileIO.
>>>>>> >>> >
>>>>>> >>> > https://github.com/googleapis/cloud-bigtable-client/blob/master/bigtable-dataflow-parent/bigtable-beam-import/src/main/java/com/google/cloud/bigtable/beam/sequencefiles/SequenceFileSink.java
>>>>>> >>> > https://github.com/googleapis/cloud-bigtable-client/blob/master/bigtable-dataflow-parent/bigtable-beam-import/src/main/java/com/google/cloud/bigtable/beam/sequencefiles/SequenceFileSource.java
>>>>>> >>> >
>>>>>> >>> > It would be really cool to move these into Beam, but it's up to the Googlers to decide whether they want to donate them.
>>>>>> >>> >
>>>>>> >>> > D.
>>>>>> >>> >
>>>>>> >>> > On Tue, Jul 2, 2019 at 2:07 AM Shannon Duncan <[email protected]> wrote:
>>>>>> >>> >>
>>>>>> >>> >> It's not outside the realm of possibilities. For now I've created an intermediary step: a Hadoop job that converts the sequence files to text files.
>>>>>> >>> >>
>>>>>> >>> >> Looking into better options.
>>>>>> >>> >>
>>>>>> >>> >> On Mon, Jul 1, 2019, 5:50 PM Chamikara Jayalath <[email protected]> wrote:
>>>>>> >>> >>>
>>>>>> >>> >>> The Java SDK has HadoopFormatIO, which you should be able to use to read sequence files: https://github.com/apache/beam/blob/master/sdks/java/io/hadoop-format/src/main/java/org/apache/beam/sdk/io/hadoop/format/HadoopFormatIO.java
>>>>>> >>> >>> I don't think there's a direct alternative for this in Python.
>>>>>> >>> >>>
>>>>>> >>> >>> Is it possible to write to a well-known format such as Avro instead of a Hadoop-specific format? That would allow you to read from both Dataproc/Hadoop and the Beam Python SDK.
>>>>>> >>> >>>
>>>>>> >>> >>> Thanks,
>>>>>> >>> >>> Cham
>>>>>> >>> >>>
>>>>>> >>> >>> On Mon, Jul 1, 2019 at 3:37 PM Shannon Duncan <[email protected]> wrote:
>>>>>> >>> >>>>
>>>>>> >>> >>>> That's a pretty big hole for a missing source/sink when looking at transitioning from Dataproc to Dataflow using GCS as the storage buffer instead of traditional HDFS.
>>>>>> >>> >>>>
>>>>>> >>> >>>> From what I've been able to tell from the source code and documentation, Java can do this but Python cannot?
>>>>>> >>> >>>>
>>>>>> >>> >>>> Thanks,
>>>>>> >>> >>>> Shannon
>>>>>> >>> >>>>
>>>>>> >>> >>>> On Mon, Jul 1, 2019 at 5:29 PM Chamikara Jayalath <[email protected]> wrote:
>>>>>> >>> >>>>>
>>>>>> >>> >>>>> I don't think we have a source/sink for reading Hadoop sequence files. Your best bet currently will probably be to use the FileSystems abstraction to open the file from a ParDo and read it directly using a library that can read sequence files.
>>>>>> >>> >>>>>
>>>>>> >>> >>>>> Thanks,
>>>>>> >>> >>>>> Cham
>>>>>> >>> >>>>>
>>>>>> >>> >>>>> On Mon, Jul 1, 2019 at 8:42 AM Shannon Duncan <[email protected]> wrote:
>>>>>> >>> >>>>>>
>>>>>> >>> >>>>>> I want to read a Hadoop Sequence/Map file stored on Google Cloud Storage via a "gs://bucket/link/SequenceFile-*" pattern using the Python SDK.
>>>>>> >>> >>>>>>
>>>>>> >>> >>>>>> I cannot locate any good adapters for this, and the one Hadoop filesystem reader seems to only read from an "hdfs://" URL.
>>>>>> >>> >>>>>>
>>>>>> >>> >>>>>> I want to use Dataflow and GCS exclusively to start mixing Beam pipelines in with our current Hadoop pipelines.
>>>>>> >>> >>>>>>
>>>>>> >>> >>>>>> Is this a feature that is supported or will be supported in the future? Does anyone have any good, performant suggestions for this?
>>>>>> >>> >>>>>>
>>>>>> >>> >>>>>> I'd also like to be able to write back out to a SequenceFile if possible.
>>>>>> >>> >>>>>>
>>>>>> >>> >>>>>> Thanks!
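Coming back to Cham's HadoopFormatIO pointer above, here is a minimal sketch of what reading a SequenceFile through it might look like on the Java side. This is assumed wiring only, not something verified in this thread: the configuration keys follow the HadoopFormatIO javadoc and standard Hadoop MapReduce settings, and reading gs:// paths would additionally require the GCS connector on the classpath:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.hadoop.WritableCoder;
import org.apache.beam.sdk.io.hadoop.format.HadoopFormatIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;

Configuration conf = new Configuration(false);
// The InputFormat to run plus the key/value types it produces.
conf.setClass("mapreduce.job.inputformat.class", SequenceFileInputFormat.class, InputFormat.class);
conf.setClass("key.class", Text.class, Object.class);
conf.setClass("value.class", Text.class, Object.class);
// Standard Hadoop input path/pattern setting; the glob is the one from the original question.
conf.set("mapreduce.input.fileinputformat.inputdir", "gs://bucket/link/SequenceFile-*");

Pipeline p = Pipeline.create(PipelineOptionsFactory.create());
// Hadoop Writable types need a Beam coder; WritableCoder from hadoop-common covers Text.
p.getCoderRegistry().registerCoderForClass(Text.class, WritableCoder.of(Text.class));

PCollection<KV<Text, Text>> records =
    p.apply("ReadSequenceFile", HadoopFormatIO.<Text, Text>read().withConfiguration(conf));

p.run().waitUntilFinish();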
