A clarification on my previous message: this only happens on the local file system, where it is unable to match a pattern string. Via a `gs://<bucket>` link it is able to match multiple files.
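For what it's worth, the pattern expansion can be checked outside of any pipeline by asking Beam's FileSystems API what a given spec matches. A minimal sketch (the local and GCS patterns below are hypothetical placeholders):

import java.io.IOException;
import org.apache.beam.sdk.io.FileSystems;
import org.apache.beam.sdk.io.fs.MatchResult;
import org.apache.beam.sdk.io.fs.MatchResult.Metadata;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class GlobCheck {
  public static void main(String[] args) throws IOException {
    // Register the available filesystems (local, gs://, ...) with default options.
    FileSystems.setDefaultPipelineOptions(PipelineOptionsFactory.create());

    String[] patterns = {"/tmp/seq/part-*/data", "gs://my-bucket/seq/part-*/data"};
    for (String pattern : patterns) {
      MatchResult result = FileSystems.match(pattern);
      System.out.println(pattern + " -> " + result.status());
      if (result.status() == MatchResult.Status.OK) {
        for (Metadata m : result.metadata()) {
          System.out.println("  " + m.resourceId());
        }
      }
    }
  }
}

If the local pattern comes back empty here as well, the problem is in the filesystem's glob expansion rather than in SequenceFileSource itself.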
On Fri, Jul 12, 2019 at 1:36 PM Shannon Duncan <joseph.dun...@liveramp.com> wrote:

> Awesome. I got it working for a single file, but for a structure of:
>
> /part-0001/index
> /part-0001/data
> /part-0002/index
> /part-0002/data
>
> I tried /part-* and /part-*/data.
>
> It does not find the multipart files. However, if I just do /part-0001/data it will find it and read it.
>
> Any ideas why?
>
> I am using this to generate the source:
>
> static SequenceFileSource<Text, Text> createSource(
>     ValueProvider<String> sourcePattern) {
>   return new SequenceFileSource<Text, Text>(
>       sourcePattern,
>       Text.class,
>       WritableSerialization.class,
>       Text.class,
>       WritableSerialization.class,
>       SequenceFile.SYNC_INTERVAL);
> }
>
> On Wed, Jul 10, 2019 at 10:52 AM Igor Bernstein <igorbernst...@google.com> wrote:
>
>> It should be fairly straightforward:
>> 1. Copy SequenceFileSource.java
>> <https://github.com/googleapis/cloud-bigtable-client/blob/master/bigtable-dataflow-parent/bigtable-beam-import/src/main/java/com/google/cloud/bigtable/beam/sequencefiles/SequenceFileSource.java>
>> to your project.
>> 2. Add the source to your pipeline, configuring it with appropriate serializers. See here
>> <https://github.com/googleapis/cloud-bigtable-client/blob/master/bigtable-dataflow-parent/bigtable-beam-import/src/main/java/com/google/cloud/bigtable/beam/sequencefiles/ImportJob.java#L159-L173>
>> for an example for HBase Results.
>>
>> On Wed, Jul 10, 2019 at 10:58 AM Shannon Duncan <joseph.dun...@liveramp.com> wrote:
>>
>>> If I wanted to go ahead and include this within a new Java pipeline, what level of work would I be looking at to integrate it?
>>>
>>> On Wed, Jul 3, 2019 at 3:54 AM Ismaël Mejía <ieme...@gmail.com> wrote:
>>>
>>>> That's great. I can help whenever you need. We just need to choose its destination. Both the `hadoop-format` and `hadoop-file-system` modules are good candidates; I would even feel inclined to put it in its own module, `sdks/java/extensions/sequencefile`, to make it easier for end users to discover.
>>>>
>>>> A thing to consider is the SeekableByteChannel adapters. We can move those into hadoop-common if needed and refactor the modules to share code. It is worth taking a look at org.apache.beam.sdk.io.hdfs.HadoopFileSystem.HadoopSeekableByteChannel#HadoopSeekableByteChannel to see if some of it could be useful.
>>>>
>>>> On Tue, Jul 2, 2019 at 11:46 PM Igor Bernstein <igorbernst...@google.com> wrote:
>>>> >
>>>> > Hi all,
>>>> >
>>>> > I wrote those classes with the intention of upstreaming them to Beam. I can try to make some time this quarter to clean them up. I would need a bit of guidance from a Beam expert on how to make them coexist with HadoopFormatIO, though.
>>>> >
>>>> > On Tue, Jul 2, 2019 at 10:55 AM Solomon Duskis <sdus...@google.com> wrote:
>>>> >>
>>>> >> +Igor Bernstein who wrote the Cloud Bigtable Sequence File classes.
>>>> >>
>>>> >> Solomon Duskis | Google Cloud clients | sdus...@google.com | 914-462-0531
>>>> >>
>>>> >> On Tue, Jul 2, 2019 at 4:57 AM Ismaël Mejía <ieme...@gmail.com> wrote:
>>>> >>>
>>>> >>> (Adding dev@ and Solomon Duskis to the discussion)
>>>> >>>
>>>> >>> I was not aware of these; thanks for sharing, David. It would definitely be a great addition if we could have those donated as an extension on the Beam side. We can even evolve them in the future to be more FileIO-like. Any chance this can happen? Maybe Solomon and his team?
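Putting Igor's two steps above together with the createSource helper from the top of the thread, the wiring into a pipeline looks roughly like this. A minimal sketch, assuming the copied SequenceFileSource keeps the constructor shown in that helper (the pattern is a hypothetical placeholder):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.Read;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.ValueProvider.StaticValueProvider;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.serializer.WritableSerialization;

public class ReadSequenceFiles {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // SequenceFileSource is the class copied from cloud-bigtable-client (step 1).
    SequenceFileSource<Text, Text> source =
        new SequenceFileSource<>(
            StaticValueProvider.of("gs://my-bucket/seq/part-*/data"),
            Text.class, WritableSerialization.class,
            Text.class, WritableSerialization.class,
            SequenceFile.SYNC_INTERVAL);

    // The source emits key/value pairs, so this yields PCollection<KV<Text, Text>>
    // (step 2); downstream transforms attach here.
    PCollection<KV<Text, Text>> records = p.apply("ReadSequenceFile", Read.from(source));

    p.run().waitUntilFinish();
  }
}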
>>>> >>>
>>>> >>> On Tue, Jul 2, 2019 at 9:39 AM David Morávek <d...@apache.org> wrote:
>>>> >>> >
>>>> >>> > Hi, you can use SequenceFileSink and Source from the BigTable client. Those work nicely with FileIO.
>>>> >>> >
>>>> >>> > https://github.com/googleapis/cloud-bigtable-client/blob/master/bigtable-dataflow-parent/bigtable-beam-import/src/main/java/com/google/cloud/bigtable/beam/sequencefiles/SequenceFileSink.java
>>>> >>> > https://github.com/googleapis/cloud-bigtable-client/blob/master/bigtable-dataflow-parent/bigtable-beam-import/src/main/java/com/google/cloud/bigtable/beam/sequencefiles/SequenceFileSource.java
>>>> >>> >
>>>> >>> > It would be really cool to move these into Beam, but it's up to the Googlers to decide whether they want to donate them.
>>>> >>> >
>>>> >>> > D.
>>>> >>> >
>>>> >>> > On Tue, Jul 2, 2019 at 2:07 AM Shannon Duncan <joseph.dun...@liveramp.com> wrote:
>>>> >>> >>
>>>> >>> >> It's not outside the realm of possibilities. For now I've created an intermediary step: a Hadoop job that converts from sequence files to text files.
>>>> >>> >>
>>>> >>> >> Looking into better options.
>>>> >>> >>
>>>> >>> >> On Mon, Jul 1, 2019, 5:50 PM Chamikara Jayalath <chamik...@google.com> wrote:
>>>> >>> >>>
>>>> >>> >>> The Java SDK has a HadoopInputFormatIO with which you should be able to read sequence files: https://github.com/apache/beam/blob/master/sdks/java/io/hadoop-format/src/main/java/org/apache/beam/sdk/io/hadoop/format/HadoopFormatIO.java
>>>> >>> >>> I don't think there's a direct alternative for this in Python.
>>>> >>> >>>
>>>> >>> >>> Is it possible to write to a well-known format such as Avro instead of a Hadoop-specific format, which would allow you to read from both Dataproc/Hadoop and the Beam Python SDK?
>>>> >>> >>>
>>>> >>> >>> Thanks,
>>>> >>> >>> Cham
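For the HadoopFormatIO route Cham describes, the read side is driven by a Hadoop Configuration naming the input format, key, and value classes. A rough sketch, assuming Text keys and values and a hypothetical bucket path (reading gs:// paths through Hadoop classes also assumes the GCS connector is on the classpath):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.hadoop.format.HadoopFormatIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;

public class ReadViaHadoopFormatIO {
  public static void main(String[] args) {
    // Tell HadoopFormatIO which InputFormat to instantiate and what it produces.
    Configuration conf = new Configuration(false);
    conf.setClass("mapreduce.job.inputformat.class",
        SequenceFileInputFormat.class, InputFormat.class);
    conf.setClass("key.class", Text.class, Object.class);
    conf.setClass("value.class", Text.class, Object.class);
    conf.set("mapreduce.input.fileinputformat.inputdir", "gs://my-bucket/seq/part-*/data");

    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
    PCollection<KV<Text, Text>> records =
        p.apply(HadoopFormatIO.<Text, Text>read().withConfiguration(conf));
    p.run().waitUntilFinish();
  }
}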
>>>> >>> >>>
>>>> >>> >>> On Mon, Jul 1, 2019 at 3:37 PM Shannon Duncan <joseph.dun...@liveramp.com> wrote:
>>>> >>> >>>>
>>>> >>> >>>> That's a pretty big hole for a missing source/sink when looking at transitioning from Dataproc to Dataflow using GCS as a storage buffer instead of a traditional HDFS.
>>>> >>> >>>>
>>>> >>> >>>> From what I've been able to tell from the source code and documentation, Java is able to, but not Python?
>>>> >>> >>>>
>>>> >>> >>>> Thanks,
>>>> >>> >>>> Shannon
>>>> >>> >>>>
>>>> >>> >>>> On Mon, Jul 1, 2019 at 5:29 PM Chamikara Jayalath <chamik...@google.com> wrote:
>>>> >>> >>>>>
>>>> >>> >>>>> I don't think we have a source/sink for reading Hadoop sequence files. Your best bet currently will probably be to use the FileSystem abstraction to create a file from a ParDo and read directly from there using a library that can read sequence files.
>>>> >>> >>>>>
>>>> >>> >>>>> Thanks,
>>>> >>> >>>>> Cham
>>>> >>> >>>>>
>>>> >>> >>>>> On Mon, Jul 1, 2019 at 8:42 AM Shannon Duncan <joseph.dun...@liveramp.com> wrote:
>>>> >>> >>>>>>
>>>> >>> >>>>>> I'm wanting to read a Sequence/Map file from Hadoop stored on Google Cloud Storage via a "gs://bucket/link/SequenceFile-*" pattern via the Python SDK.
>>>> >>> >>>>>>
>>>> >>> >>>>>> I cannot locate any good adapters for this, and the one Hadoop Filesystem reader seems to only read from an "hdfs://" URL.
>>>> >>> >>>>>>
>>>> >>> >>>>>> I'm wanting to use Dataflow and GCS exclusively to start mixing Beam pipelines in with our current Hadoop pipelines.
>>>> >>> >>>>>>
>>>> >>> >>>>>> Is this a feature that is supported or will be supported in the future?
>>>> >>> >>>>>> Does anyone have any good suggestions for this that are performant?
>>>> >>> >>>>>>
>>>> >>> >>>>>> I'd also like to be able to write back out to a SequenceFile if possible.
>>>> >>> >>>>>>
>>>> >>> >>>>>> Thanks!
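On writing back out: SequenceFileSink from the same bigtable-beam-import package implements FileIO.Sink, so it plugs into FileIO.write(). A rough sketch, assuming the sink's constructor mirrors the source's (key class, key serialization, value class, value serialization); the output path and suffix are hypothetical:

import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.serializer.WritableSerialization;

public class WriteSequenceFiles {
  // records is the PCollection<KV<Text, Text>> produced by one of the read sketches above.
  static void writeBack(PCollection<KV<Text, Text>> records) {
    records.apply(
        FileIO.<KV<Text, Text>>write()
            // SequenceFileSink is copied from cloud-bigtable-client alongside
            // SequenceFileSource; the constructor shape here is an assumption.
            .via(new SequenceFileSink<>(
                Text.class, WritableSerialization.class,
                Text.class, WritableSerialization.class))
            .to("gs://my-bucket/seq-out")
            .withSuffix(".seq"));
  }
}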