Awesome. I got it working for a single file, but not for a structure of:

/part-0001/index
/part-0001/data
/part-0002/index
/part-0002/data

I tried /part-* and /part-*/data, but neither finds the multipart files. However, if I just use /part-0001/data it finds and reads the file. Any ideas why?

I am using this to generate the source:

    static SequenceFileSource<Text, Text> createSource(
        ValueProvider<String> sourcePattern) {
      return new SequenceFileSource<Text, Text>(
          sourcePattern,
          Text.class, WritableSerialization.class,
          Text.class, WritableSerialization.class,
          SequenceFile.SYNC_INTERVAL);
    }
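One way to narrow down where the matching goes wrong is to expand the pattern with Beam's FileSystems API directly, outside the source. A minimal sketch, assuming the GCS filesystem is registered via the usual ServiceLoader mechanism (the bucket path is a placeholder):

    import org.apache.beam.sdk.io.FileSystems;
    import org.apache.beam.sdk.io.fs.MatchResult;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    public class GlobCheck {
      public static void main(String[] args) throws Exception {
        // Registers available filesystems; gs:// requires the Beam
        // google-cloud-platform module on the classpath.
        FileSystems.setDefaultPipelineOptions(PipelineOptionsFactory.create());
        MatchResult result = FileSystems.match("gs://my-bucket/exports/part-*/data");
        for (MatchResult.Metadata m : result.metadata()) {
          System.out.println(m.resourceId() + " (" + m.sizeBytes() + " bytes)");
        }
      }
    }

If the glob expands correctly here but the source still reads nothing, the problem is in the source; if it expands to nothing, the problem is the pattern itself.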
On Wed, Jul 10, 2019 at 10:52 AM Igor Bernstein <igorbernst...@google.com> wrote:

> It should be fairly straightforward:
> 1. Copy SequenceFileSource.java
> <https://github.com/googleapis/cloud-bigtable-client/blob/master/bigtable-dataflow-parent/bigtable-beam-import/src/main/java/com/google/cloud/bigtable/beam/sequencefiles/SequenceFileSource.java>
> to your project
> 2. Add the source to your pipeline, configuring it with appropriate
> serializers (a rough sketch of this wiring follows below). See here
> <https://github.com/googleapis/cloud-bigtable-client/blob/master/bigtable-dataflow-parent/bigtable-beam-import/src/main/java/com/google/cloud/bigtable/beam/sequencefiles/ImportJob.java#L159-L173>
> for an example for HBase Results.
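For concreteness, a minimal sketch of step 2: wiring the copied source into a pipeline with the same Text/WritableSerialization configuration as the createSource() snippet at the top of this thread. It assumes the source's element type is KV<Text, Text>; the path and the downstream step are illustrative:

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.Read;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.options.ValueProvider.StaticValueProvider;
    import org.apache.beam.sdk.transforms.MapElements;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.TypeDescriptors;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.serializer.WritableSerialization;
    // SequenceFileSource is the class copied in step 1; adjust the import
    // to wherever you placed it in your project.

    public class ReadSeqFiles {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        p.apply("ReadSequenceFiles",
                Read.from(new SequenceFileSource<Text, Text>(
                    StaticValueProvider.of("gs://my-bucket/exports/part-0001/data"),
                    Text.class, WritableSerialization.class,
                    Text.class, WritableSerialization.class,
                    SequenceFile.SYNC_INTERVAL)))
            // The source emits KV<Text, Text>; convert to Strings downstream.
            .apply(MapElements
                .into(TypeDescriptors.kvs(
                    TypeDescriptors.strings(), TypeDescriptors.strings()))
                .via(kv -> KV.of(kv.getKey().toString(), kv.getValue().toString())));

        p.run().waitUntilFinish();
      }
    }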
> On Wed, Jul 10, 2019 at 10:58 AM Shannon Duncan <joseph.dun...@liveramp.com> wrote:
>
>> If I wanted to go ahead and include this within a new Java pipeline, what
>> would I be looking at in terms of the level of work to integrate?
>>
>> On Wed, Jul 3, 2019 at 3:54 AM Ismaël Mejía <ieme...@gmail.com> wrote:
>>
>>> That's great. I can help whenever you need. We just need to choose its
>>> destination. Both the `hadoop-format` and `hadoop-file-system` modules
>>> are good candidates; I would even be inclined to put it in its own
>>> module, `sdks/java/extensions/sequencefile`, to make it easier for end
>>> users to discover.
>>>
>>> A thing to consider is the SeekableByteChannel adapters; we can move
>>> those into hadoop-common if needed and refactor the modules to share
>>> code. It's worth taking a look at
>>> org.apache.beam.sdk.io.hdfs.HadoopFileSystem.HadoopSeekableByteChannel#HadoopSeekableByteChannel
>>> to see if some of it could be useful.
>>>
>>> On Tue, Jul 2, 2019 at 11:46 PM Igor Bernstein <igorbernst...@google.com> wrote:
>>> >
>>> > Hi all,
>>> >
>>> > I wrote those classes with the intention of upstreaming them to Beam.
>>> > I can try to make some time this quarter to clean them up. I would
>>> > need a bit of guidance from a Beam expert on how to make them coexist
>>> > with HadoopFormatIO though.
>>> >
>>> > On Tue, Jul 2, 2019 at 10:55 AM Solomon Duskis <sdus...@google.com> wrote:
>>> >>
>>> >> +Igor Bernstein, who wrote the Cloud Bigtable SequenceFile classes.
>>> >>
>>> >> Solomon Duskis | Google Cloud clients | sdus...@google.com | 914-462-0531
>>> >>
>>> >> On Tue, Jul 2, 2019 at 4:57 AM Ismaël Mejía <ieme...@gmail.com> wrote:
>>> >>>
>>> >>> (Adding dev@ and Solomon Duskis to the discussion.)
>>> >>>
>>> >>> I was not aware of these, thanks for sharing, David. It would
>>> >>> definitely be a great addition if we could have those donated as an
>>> >>> extension on the Beam side. We can even evolve them in the future to
>>> >>> be more FileIO-like. Any chance this can happen? Maybe Solomon and
>>> >>> his team?
>>> >>>
>>> >>> On Tue, Jul 2, 2019 at 9:39 AM David Morávek <d...@apache.org> wrote:
>>> >>> >
>>> >>> > Hi, you can use SequenceFileSink and SequenceFileSource from the
>>> >>> > Cloud Bigtable client. Those work nicely with FileIO.
>>> >>> >
>>> >>> > https://github.com/googleapis/cloud-bigtable-client/blob/master/bigtable-dataflow-parent/bigtable-beam-import/src/main/java/com/google/cloud/bigtable/beam/sequencefiles/SequenceFileSink.java
>>> >>> > https://github.com/googleapis/cloud-bigtable-client/blob/master/bigtable-dataflow-parent/bigtable-beam-import/src/main/java/com/google/cloud/bigtable/beam/sequencefiles/SequenceFileSource.java
>>> >>> >
>>> >>> > It would be really cool to move these into Beam, but it's up to the
>>> >>> > Googlers to decide whether they want to donate this.
>>> >>> >
>>> >>> > D.
>>> >>> >
>>> >>> > On Tue, Jul 2, 2019 at 2:07 AM Shannon Duncan <joseph.dun...@liveramp.com> wrote:
>>> >>> >>
>>> >>> >> It's not outside the realm of possibility. For now I've created an
>>> >>> >> intermediate step: a Hadoop job that converts the sequence files
>>> >>> >> to text files.
>>> >>> >>
>>> >>> >> Looking into better options.
>>> >>> >>
>>> >>> >> On Mon, Jul 1, 2019, 5:50 PM Chamikara Jayalath <chamik...@google.com> wrote:
>>> >>> >>>
>>> >>> >>> The Java SDK has HadoopFormatIO, which you should be able to use
>>> >>> >>> to read sequence files:
>>> >>> >>> https://github.com/apache/beam/blob/master/sdks/java/io/hadoop-format/src/main/java/org/apache/beam/sdk/io/hadoop/format/HadoopFormatIO.java
>>> >>> >>> I don't think there's a direct alternative for this in Python.
>>> >>> >>>
>>> >>> >>> Is it possible to write to a well-known format such as Avro
>>> >>> >>> instead of a Hadoop-specific format? That would allow you to read
>>> >>> >>> from both Dataproc/Hadoop and the Beam Python SDK.
>>> >>> >>>
>>> >>> >>> Thanks,
>>> >>> >>> Cham
>>> >>> >>>
>>> >>> >>> On Mon, Jul 1, 2019 at 3:37 PM Shannon Duncan <joseph.dun...@liveramp.com> wrote:
>>> >>> >>>>
>>> >>> >>>> That's a pretty big hole for a source/sink to be missing when
>>> >>> >>>> looking at transitioning from Dataproc to Dataflow using GCS as
>>> >>> >>>> a storage buffer instead of a traditional HDFS.
>>> >>> >>>>
>>> >>> >>>> From what I've been able to tell from the source code and
>>> >>> >>>> documentation, Java is able to do this but Python is not?
>>> >>> >>>>
>>> >>> >>>> Thanks,
>>> >>> >>>> Shannon
>>> >>> >>>>
>>> >>> >>>> On Mon, Jul 1, 2019 at 5:29 PM Chamikara Jayalath <chamik...@google.com> wrote:
>>> >>> >>>>>
>>> >>> >>>>> I don't think we have a source/sink for reading Hadoop sequence
>>> >>> >>>>> files. Your best bet currently will probably be to use the
>>> >>> >>>>> FileSystem abstraction to open the file from a ParDo and read
>>> >>> >>>>> it directly using a library that can read sequence files (a
>>> >>> >>>>> rough sketch of this appears at the end of this message).
>>> >>> >>>>>
>>> >>> >>>>> Thanks,
>>> >>> >>>>> Cham
>>> >>> >>>>>
>>> >>> >>>>> On Mon, Jul 1, 2019 at 8:42 AM Shannon Duncan <joseph.dun...@liveramp.com> wrote:
>>> >>> >>>>>>
>>> >>> >>>>>> I want to read a Hadoop Sequence/Map file stored on Google
>>> >>> >>>>>> Cloud Storage (e.g. "gs://bucket/link/SequenceFile-*") via the
>>> >>> >>>>>> Python SDK.
>>> >>> >>>>>>
>>> >>> >>>>>> I cannot locate any good adapters for this, and the one Hadoop
>>> >>> >>>>>> filesystem reader I found seems to only read from an "hdfs://"
>>> >>> >>>>>> URL.
>>> >>> >>>>>>
>>> >>> >>>>>> I want to use Dataflow and GCS exclusively so we can start
>>> >>> >>>>>> mixing Beam pipelines in with our current Hadoop pipelines.
>>> >>> >>>>>>
>>> >>> >>>>>> Is this a feature that is supported or will be supported in
>>> >>> >>>>>> the future? Does anyone have suggestions for a performant way
>>> >>> >>>>>> to do this?
>>> >>> >>>>>>
>>> >>> >>>>>> I'd also like to be able to write back out to a SequenceFile
>>> >>> >>>>>> if possible.
>>> >>> >>>>>>
>>> >>> >>>>>> Thanks!
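For reference, a minimal sketch of the ParDo workaround Cham describes above: open each matched file with Hadoop's SequenceFile.Reader inside a DoFn. It assumes Text keys and values, and that the GCS connector (GoogleHadoopFileSystem) is on the worker classpath so gs:// paths resolve; treat it as untested:

    import java.io.IOException;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.values.KV;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    /** Emits every key/value pair of the sequence file at each input path. */
    class ReadSequenceFileFn extends DoFn<String, KV<String, String>> {
      @ProcessElement
      public void processElement(@Element String path,
          OutputReceiver<KV<String, String>> out) throws IOException {
        Configuration conf = new Configuration();
        // gs:// paths only resolve if the GCS connector is configured;
        // otherwise substitute hdfs:// or file:// paths.
        try (SequenceFile.Reader reader =
            new SequenceFile.Reader(conf, SequenceFile.Reader.file(new Path(path)))) {
          Text key = new Text();
          Text value = new Text();
          while (reader.next(key, value)) {
            out.output(KV.of(key.toString(), value.toString()));
          }
        }
      }
    }

Note this reads each file on a single thread with no splitting, which is exactly what the SequenceFileSource discussed above does better.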