A clarification on my previous message: this only happens on the local file system, where it is unable to match a pattern string. Via a `gs://<bucket>` link it is able to match multiple files.
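For what it's worth, the pattern expansion can be checked outside of any pipeline by asking Beam's FileSystems API what a given spec matches. A minimal sketch (the local and GCS patterns below are hypothetical placeholders):

import java.io.IOException;
import org.apache.beam.sdk.io.FileSystems;
import org.apache.beam.sdk.io.fs.MatchResult;
import org.apache.beam.sdk.io.fs.MatchResult.Metadata;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class GlobCheck {
  public static void main(String[] args) throws IOException {
    // Register the available filesystems (local, gs://, ...) with default options.
    FileSystems.setDefaultPipelineOptions(PipelineOptionsFactory.create());

    String[] patterns = {"/tmp/seq/part-*/data", "gs://my-bucket/seq/part-*/data"};
    for (String pattern : patterns) {
      MatchResult result = FileSystems.match(pattern);
      System.out.println(pattern + " -> " + result.status());
      if (result.status() == MatchResult.Status.OK) {
        for (Metadata m : result.metadata()) {
          System.out.println("  " + m.resourceId());
        }
      }
    }
  }
}

If the local pattern comes back empty here as well, the problem is in the filesystem's glob expansion rather than in SequenceFileSource itself.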
On Fri, Jul 12, 2019 at 1:36 PM Shannon Duncan <joseph.dun...@liveramp.com> wrote:

> Awesome. I got it working for a single file, but for a structure of:
>
> /part-0001/index
> /part-0001/data
> /part-0002/index
> /part-0002/data
>
> I tried /part-* and /part-*/data.
>
> It does not find the multipart files. However, if I just do /part-0001/data it will find it and read it.
>
> Any ideas why?
>
> I am using this to generate the source:
>
> static SequenceFileSource<Text, Text> createSource(
>     ValueProvider<String> sourcePattern) {
>   return new SequenceFileSource<Text, Text>(
>       sourcePattern,
>       Text.class,
>       WritableSerialization.class,
>       Text.class,
>       WritableSerialization.class,
>       SequenceFile.SYNC_INTERVAL);
> }
>
> On Wed, Jul 10, 2019 at 10:52 AM Igor Bernstein <igorbernst...@google.com> wrote:
>
>> It should be fairly straightforward:
>> 1. Copy SequenceFileSource.java
>> <https://github.com/googleapis/cloud-bigtable-client/blob/master/bigtable-dataflow-parent/bigtable-beam-import/src/main/java/com/google/cloud/bigtable/beam/sequencefiles/SequenceFileSource.java>
>> to your project.
>> 2. Add the source to your pipeline, configuring it with appropriate serializers. See here
>> <https://github.com/googleapis/cloud-bigtable-client/blob/master/bigtable-dataflow-parent/bigtable-beam-import/src/main/java/com/google/cloud/bigtable/beam/sequencefiles/ImportJob.java#L159-L173>
>> for an example for HBase Results.
>>
>> On Wed, Jul 10, 2019 at 10:58 AM Shannon Duncan <joseph.dun...@liveramp.com> wrote:
>>
>>> If I wanted to go ahead and include this within a new Java pipeline, what level of work would I be looking at to integrate it?
>>>
>>> On Wed, Jul 3, 2019 at 3:54 AM Ismaël Mejía <ieme...@gmail.com> wrote:
>>>
>>>> That's great. I can help whenever you need. We just need to choose its destination. Both the `hadoop-format` and `hadoop-file-system` modules are good candidates; I would even feel inclined to put it in its own module, `sdks/java/extensions/sequencefile`, to make it easier for end users to discover.
>>>>
>>>> A thing to consider is the SeekableByteChannel adapters. We can move those into hadoop-common if needed and refactor the modules to share code. It is worth taking a look at org.apache.beam.sdk.io.hdfs.HadoopFileSystem.HadoopSeekableByteChannel#HadoopSeekableByteChannel to see if some of it could be useful.
>>>>
>>>> On Tue, Jul 2, 2019 at 11:46 PM Igor Bernstein <igorbernst...@google.com> wrote:
>>>> >
>>>> > Hi all,
>>>> >
>>>> > I wrote those classes with the intention of upstreaming them to Beam. I can try to make some time this quarter to clean them up. I would need a bit of guidance from a Beam expert on how to make them coexist with HadoopFormatIO, though.
>>>> >
>>>> > On Tue, Jul 2, 2019 at 10:55 AM Solomon Duskis <sdus...@google.com> wrote:
>>>> >>
>>>> >> +Igor Bernstein who wrote the Cloud Bigtable Sequence File classes.
>>>> >>
>>>> >> Solomon Duskis | Google Cloud clients | sdus...@google.com | 914-462-0531
>>>> >>
>>>> >> On Tue, Jul 2, 2019 at 4:57 AM Ismaël Mejía <ieme...@gmail.com> wrote:
>>>> >>>
>>>> >>> (Adding dev@ and Solomon Duskis to the discussion)
>>>> >>>
>>>> >>> I was not aware of these; thanks for sharing, David. It would definitely be a great addition if we could have those donated as an extension on the Beam side. We can even evolve them in the future to be more FileIO-like. Any chance this can happen? Maybe Solomon and his team?
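Putting Igor's two steps above together with the createSource helper from the top of the thread, the wiring into a pipeline looks roughly like this. A minimal sketch, assuming the copied SequenceFileSource keeps the constructor shown in that helper (the pattern is a hypothetical placeholder):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.Read;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.ValueProvider.StaticValueProvider;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.serializer.WritableSerialization;

public class ReadSequenceFiles {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // SequenceFileSource is the class copied from cloud-bigtable-client (step 1).
    SequenceFileSource<Text, Text> source =
        new SequenceFileSource<>(
            StaticValueProvider.of("gs://my-bucket/seq/part-*/data"),
            Text.class, WritableSerialization.class,
            Text.class, WritableSerialization.class,
            SequenceFile.SYNC_INTERVAL);

    // The source emits key/value pairs, so this yields PCollection<KV<Text, Text>>
    // (step 2); downstream transforms attach here.
    PCollection<KV<Text, Text>> records = p.apply("ReadSequenceFile", Read.from(source));

    p.run().waitUntilFinish();
  }
}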
>>>> >>>
>>>> >>> On Tue, Jul 2, 2019 at 9:39 AM David Morávek <d...@apache.org> wrote:
>>>> >>> >
>>>> >>> > Hi, you can use SequenceFileSink and Source from the BigTable client. Those work nicely with FileIO.
>>>> >>> >
>>>> >>> > https://github.com/googleapis/cloud-bigtable-client/blob/master/bigtable-dataflow-parent/bigtable-beam-import/src/main/java/com/google/cloud/bigtable/beam/sequencefiles/SequenceFileSink.java
>>>> >>> > https://github.com/googleapis/cloud-bigtable-client/blob/master/bigtable-dataflow-parent/bigtable-beam-import/src/main/java/com/google/cloud/bigtable/beam/sequencefiles/SequenceFileSource.java
>>>> >>> >
>>>> >>> > It would be really cool to move these into Beam, but it's up to the Googlers to decide whether they want to donate them.
>>>> >>> >
>>>> >>> > D.
>>>> >>> >
>>>> >>> > On Tue, Jul 2, 2019 at 2:07 AM Shannon Duncan <joseph.dun...@liveramp.com> wrote:
>>>> >>> >>
>>>> >>> >> It's not outside the realm of possibilities. For now I've created an intermediary step: a Hadoop job that converts from sequence files to text files.
>>>> >>> >>
>>>> >>> >> Looking into better options.
>>>> >>> >>
>>>> >>> >> On Mon, Jul 1, 2019, 5:50 PM Chamikara Jayalath <chamik...@google.com> wrote:
>>>> >>> >>>
>>>> >>> >>> The Java SDK has a HadoopInputFormatIO with which you should be able to read sequence files: https://github.com/apache/beam/blob/master/sdks/java/io/hadoop-format/src/main/java/org/apache/beam/sdk/io/hadoop/format/HadoopFormatIO.java
>>>> >>> >>> I don't think there's a direct alternative for this in Python.
>>>> >>> >>>
>>>> >>> >>> Is it possible to write to a well-known format such as Avro instead of a Hadoop-specific format, which would allow you to read from both Dataproc/Hadoop and the Beam Python SDK?
>>>> >>> >>>
>>>> >>> >>> Thanks,
>>>> >>> >>> Cham
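For the HadoopFormatIO route Cham describes, the read side is driven by a Hadoop Configuration naming the input format, key, and value classes. A rough sketch, assuming Text keys and values and a hypothetical bucket path (reading gs:// paths through Hadoop classes also assumes the GCS connector is on the classpath):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.hadoop.format.HadoopFormatIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;

public class ReadViaHadoopFormatIO {
  public static void main(String[] args) {
    // Tell HadoopFormatIO which InputFormat to instantiate and what it produces.
    Configuration conf = new Configuration(false);
    conf.setClass("mapreduce.job.inputformat.class",
        SequenceFileInputFormat.class, InputFormat.class);
    conf.setClass("key.class", Text.class, Object.class);
    conf.setClass("value.class", Text.class, Object.class);
    conf.set("mapreduce.input.fileinputformat.inputdir", "gs://my-bucket/seq/part-*/data");

    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
    PCollection<KV<Text, Text>> records =
        p.apply(HadoopFormatIO.<Text, Text>read().withConfiguration(conf));
    p.run().waitUntilFinish();
  }
}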
>>>> >>> >>>
>>>> >>> >>> On Mon, Jul 1, 2019 at 3:37 PM Shannon Duncan <joseph.dun...@liveramp.com> wrote:
>>>> >>> >>>>
>>>> >>> >>>> That's a pretty big hole for a missing source/sink when looking at transitioning from Dataproc to Dataflow using GCS as a storage buffer instead of a traditional HDFS.
>>>> >>> >>>>
>>>> >>> >>>> From what I've been able to tell from the source code and documentation, Java is able to, but not Python?
>>>> >>> >>>>
>>>> >>> >>>> Thanks,
>>>> >>> >>>> Shannon
>>>> >>> >>>>
>>>> >>> >>>> On Mon, Jul 1, 2019 at 5:29 PM Chamikara Jayalath <chamik...@google.com> wrote:
>>>> >>> >>>>>
>>>> >>> >>>>> I don't think we have a source/sink for reading Hadoop sequence files. Your best bet currently will probably be to use the FileSystem abstraction to create a file from a ParDo and read directly from there using a library that can read sequence files.
>>>> >>> >>>>>
>>>> >>> >>>>> Thanks,
>>>> >>> >>>>> Cham
>>>> >>> >>>>>
>>>> >>> >>>>> On Mon, Jul 1, 2019 at 8:42 AM Shannon Duncan <joseph.dun...@liveramp.com> wrote:
>>>> >>> >>>>>>
>>>> >>> >>>>>> I'm wanting to read a Sequence/Map file from Hadoop stored on Google Cloud Storage via a "gs://bucket/link/SequenceFile-*" pattern via the Python SDK.
>>>> >>> >>>>>>
>>>> >>> >>>>>> I cannot locate any good adapters for this, and the one Hadoop Filesystem reader seems to only read from an "hdfs://" URL.
>>>> >>> >>>>>>
>>>> >>> >>>>>> I'm wanting to use Dataflow and GCS exclusively to start mixing Beam pipelines in with our current Hadoop pipelines.
>>>> >>> >>>>>>
>>>> >>> >>>>>> Is this a feature that is supported or will be supported in the future?
>>>> >>> >>>>>> Does anyone have any good suggestions for this that are performant?
>>>> >>> >>>>>>
>>>> >>> >>>>>> I'd also like to be able to write back out to a SequenceFile if possible.
>>>> >>> >>>>>>
>>>> >>> >>>>>> Thanks!
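On writing back out: SequenceFileSink from the same bigtable-beam-import package implements FileIO.Sink, so it plugs into FileIO.write(). A rough sketch, assuming the sink's constructor mirrors the source's (key class, key serialization, value class, value serialization); the output path and suffix are hypothetical:

import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.serializer.WritableSerialization;

public class WriteSequenceFiles {
  // records is the PCollection<KV<Text, Text>> produced by one of the read sketches above.
  static void writeBack(PCollection<KV<Text, Text>> records) {
    records.apply(
        FileIO.<KV<Text, Text>>write()
            // SequenceFileSink is copied from cloud-bigtable-client alongside
            // SequenceFileSource; the constructor shape here is an assumption.
            .via(new SequenceFileSink<>(
                Text.class, WritableSerialization.class,
                Text.class, WritableSerialization.class))
            .to("gs://my-bucket/seq-out")
            .withSuffix(".seq"));
  }
}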