I am still having the problem that the local file system (DirectRunner) will not accept a local glob pattern as a file source. I have tried both relative and fully qualified paths.
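For reference, here is a minimal, self-contained check that should exercise the same matching path FileBasedSource uses. It is only a sketch assuming the Beam Java SDK's FileSystems/MatchResult API, with the same local glob from above:

import java.io.IOException;
import org.apache.beam.sdk.io.FileSystems;
import org.apache.beam.sdk.io.fs.MatchResult;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class MatchCheck {
  public static void main(String[] args) throws IOException {
    // Register file systems the same way a pipeline run would.
    FileSystems.setDefaultPipelineOptions(PipelineOptionsFactory.create());
    // Ask the matcher directly for the glob that the DirectRunner rejects.
    MatchResult result =
        FileSystems.match("/Users/<user>/github/<repo>/io/sequenceFile/part-*/data");
    System.out.println("Match status: " + result.status());
    if (result.status() == MatchResult.Status.OK) {
      for (MatchResult.Metadata m : result.metadata()) {
        System.out.println(m.resourceId());
      }
    }
  }
}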
I can confirm that the same inputFile glob returns data with a simple cat command, so I know the glob is good.

Error: "java.io.FileNotFoundException: No files matched spec: /Users/<user>/github/<repo>/io/sequenceFile/part-*/data"

Any assistance would be greatly appreciated. This is on the Java SDK. I also tested this with TextIO.read().from(ValueProvider<String>); still the same.

Thanks,
Shannon

On Fri, Jul 12, 2019 at 2:14 PM Igor Bernstein <[email protected]> wrote:

> I'm not sure, to be honest. The pattern expansion happens in FileBasedSource via FileSystems.match(), so it should follow the same expansion rules as other file-based sources like TextIO. Maybe someone with more Beam experience can help?
>
> On Fri, Jul 12, 2019 at 2:55 PM Shannon Duncan <[email protected]> wrote:
>
>> Clarification on the previous message: this only happens on the local file system, where it is unable to match a pattern string. Via a `gs://<bucket>` link it is able to match multiple files.
>>
>> On Fri, Jul 12, 2019 at 1:36 PM Shannon Duncan <[email protected]> wrote:
>>
>>> Awesome. I got it working for a single file, but for a structure of:
>>>
>>> /part-0001/index
>>> /part-0001/data
>>> /part-0002/index
>>> /part-0002/data
>>>
>>> I tried /part-* and /part-*/data.
>>>
>>> It does not find the multipart files. However, if I just use /part-0001/data it will find it and read it.
>>>
>>> Any ideas why?
>>>
>>> I am using this to generate the source:
>>>
>>> static SequenceFileSource<Text, Text> createSource(
>>>     ValueProvider<String> sourcePattern) {
>>>   return new SequenceFileSource<Text, Text>(
>>>       sourcePattern,
>>>       Text.class,
>>>       WritableSerialization.class,
>>>       Text.class,
>>>       WritableSerialization.class,
>>>       SequenceFile.SYNC_INTERVAL);
>>> }
>>>
>>> On Wed, Jul 10, 2019 at 10:52 AM Igor Bernstein <[email protected]> wrote:
>>>
>>>> It should be fairly straightforward:
>>>> 1. Copy SequenceFileSource.java <https://github.com/googleapis/cloud-bigtable-client/blob/master/bigtable-dataflow-parent/bigtable-beam-import/src/main/java/com/google/cloud/bigtable/beam/sequencefiles/SequenceFileSource.java> to your project.
>>>> 2. Add the source to your pipeline, configuring it with the appropriate serializers. See here <https://github.com/googleapis/cloud-bigtable-client/blob/master/bigtable-dataflow-parent/bigtable-beam-import/src/main/java/com/google/cloud/bigtable/beam/sequencefiles/ImportJob.java#L159-L173> for an example with HBase Results.
>>>>
>>>> On Wed, Jul 10, 2019 at 10:58 AM Shannon Duncan <[email protected]> wrote:
>>>>
>>>>> If I wanted to go ahead and include this within a new Java pipeline, what level of work would I be looking at to integrate it?
>>>>>
>>>>> On Wed, Jul 3, 2019 at 3:54 AM Ismaël Mejía <[email protected]> wrote:
>>>>>
>>>>>> That's great. I can help whenever you need; we just need to choose its destination. Both the `hadoop-format` and `hadoop-file-system` modules are good candidates, but I would even be inclined to put it in its own module, `sdks/java/extensions/sequencefile`, to make it easier for end users to discover.
>>>>>>
>>>>>> A thing to consider is the SeekableByteChannel adapters; we can move those into hadoop-common if needed and refactor the modules to share code. It is worth taking a look at org.apache.beam.sdk.io.hdfs.HadoopFileSystem.HadoopSeekableByteChannel#HadoopSeekableByteChannel to see if some of it could be useful.
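For what it's worth, here is a minimal sketch of Igor's step 2 above, wiring the copied SequenceFileSource into a pipeline through the generic Read.from adapter. It assumes the createSource helper quoted earlier in this thread and reuses the same local glob; treat it as a sketch of the wiring, not the exact ImportJob code:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.Read;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.ValueProvider.StaticValueProvider;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.hadoop.io.Text;

Pipeline p = Pipeline.create(PipelineOptionsFactory.create());

// SequenceFileSource is a FileBasedSource, so the standard Read.from adapter applies.
// createSource is the helper quoted earlier; the source is expected to provide its own output coder.
PCollection<KV<Text, Text>> records =
    p.apply(
        "ReadSequenceFiles",
        Read.from(createSource(
            StaticValueProvider.of(
                "/Users/<user>/github/<repo>/io/sequenceFile/part-*/data"))));

p.run().waitUntilFinish();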
>>>>>>
>>>>>> On Tue, Jul 2, 2019 at 11:46 PM Igor Bernstein <[email protected]> wrote:
>>>>>> >
>>>>>> > Hi all,
>>>>>> >
>>>>>> > I wrote those classes with the intention of upstreaming them to Beam. I can try to make some time this quarter to clean them up. I would need a bit of guidance from a Beam expert on how to make them coexist with HadoopFormatIO, though.
>>>>>> >
>>>>>> > On Tue, Jul 2, 2019 at 10:55 AM Solomon Duskis <[email protected]> wrote:
>>>>>> >>
>>>>>> >> +Igor Bernstein, who wrote the Cloud Bigtable SequenceFile classes.
>>>>>> >>
>>>>>> >> Solomon Duskis | Google Cloud clients | [email protected] | 914-462-0531
>>>>>> >>
>>>>>> >> On Tue, Jul 2, 2019 at 4:57 AM Ismaël Mejía <[email protected]> wrote:
>>>>>> >>>
>>>>>> >>> (Adding dev@ and Solomon Duskis to the discussion)
>>>>>> >>>
>>>>>> >>> I was not aware of these; thanks for sharing, David. It would definitely be a great addition if we could have those donated as an extension on the Beam side. We can even evolve them in the future to be more FileIO-like. Any chance this can happen? Maybe Solomon and his team?
>>>>>> >>>
>>>>>> >>> On Tue, Jul 2, 2019 at 9:39 AM David Morávek <[email protected]> wrote:
>>>>>> >>> >
>>>>>> >>> > Hi, you can use SequenceFileSink and SequenceFileSource from the Bigtable client. Those work nicely with FileIO.
>>>>>> >>> >
>>>>>> >>> > https://github.com/googleapis/cloud-bigtable-client/blob/master/bigtable-dataflow-parent/bigtable-beam-import/src/main/java/com/google/cloud/bigtable/beam/sequencefiles/SequenceFileSink.java
>>>>>> >>> > https://github.com/googleapis/cloud-bigtable-client/blob/master/bigtable-dataflow-parent/bigtable-beam-import/src/main/java/com/google/cloud/bigtable/beam/sequencefiles/SequenceFileSource.java
>>>>>> >>> >
>>>>>> >>> > It would be really cool to move these into Beam, but it's up to the Googlers to decide whether they want to donate them.
>>>>>> >>> >
>>>>>> >>> > D.
>>>>>> >>> >
>>>>>> >>> > On Tue, Jul 2, 2019 at 2:07 AM Shannon Duncan <[email protected]> wrote:
>>>>>> >>> >>
>>>>>> >>> >> It's not outside the realm of possibilities. For now I've created an intermediary step: a Hadoop job that converts the sequence files to text files.
>>>>>> >>> >>
>>>>>> >>> >> Looking into better options.
>>>>>> >>> >>
>>>>>> >>> >> On Mon, Jul 1, 2019, 5:50 PM Chamikara Jayalath <[email protected]> wrote:
>>>>>> >>> >>>
>>>>>> >>> >>> The Java SDK has HadoopFormatIO, which you should be able to use to read sequence files: https://github.com/apache/beam/blob/master/sdks/java/io/hadoop-format/src/main/java/org/apache/beam/sdk/io/hadoop/format/HadoopFormatIO.java
>>>>>> >>> >>> I don't think there's a direct alternative for this in Python.
>>>>>> >>> >>>
>>>>>> >>> >>> Is it possible to write to a well-known format such as Avro instead of a Hadoop-specific format? That would allow you to read from both Dataproc/Hadoop and the Beam Python SDK.
>>>>>> >>> >>>
>>>>>> >>> >>> Thanks,
>>>>>> >>> >>> Cham
>>>>>> >>> >>>
>>>>>> >>> >>> On Mon, Jul 1, 2019 at 3:37 PM Shannon Duncan <[email protected]> wrote:
>>>>>> >>> >>>>
>>>>>> >>> >>>> That's a pretty big hole for a missing source/sink when looking at transitioning from Dataproc to Dataflow using GCS as the storage buffer instead of traditional HDFS.
>>>>>> >>> >>>>
>>>>>> >>> >>>> From what I've been able to tell from the source code and documentation, Java can do this but Python cannot?
>>>>>> >>> >>>>
>>>>>> >>> >>>> Thanks,
>>>>>> >>> >>>> Shannon
>>>>>> >>> >>>>
>>>>>> >>> >>>> On Mon, Jul 1, 2019 at 5:29 PM Chamikara Jayalath <[email protected]> wrote:
>>>>>> >>> >>>>>
>>>>>> >>> >>>>> I don't think we have a source/sink for reading Hadoop sequence files. Your best bet currently will probably be to use the FileSystems abstraction to open the file from a ParDo and read it directly using a library that can read sequence files.
>>>>>> >>> >>>>>
>>>>>> >>> >>>>> Thanks,
>>>>>> >>> >>>>> Cham
>>>>>> >>> >>>>>
>>>>>> >>> >>>>> On Mon, Jul 1, 2019 at 8:42 AM Shannon Duncan <[email protected]> wrote:
>>>>>> >>> >>>>>>
>>>>>> >>> >>>>>> I want to read a Hadoop Sequence/Map file stored on Google Cloud Storage via a "gs://bucket/link/SequenceFile-*" pattern using the Python SDK.
>>>>>> >>> >>>>>>
>>>>>> >>> >>>>>> I cannot locate any good adapters for this, and the one Hadoop filesystem reader seems to only read from an "hdfs://" URL.
>>>>>> >>> >>>>>>
>>>>>> >>> >>>>>> I want to use Dataflow and GCS exclusively to start mixing Beam pipelines in with our current Hadoop pipelines.
>>>>>> >>> >>>>>>
>>>>>> >>> >>>>>> Is this a feature that is supported or will be supported in the future? Does anyone have any good, performant suggestions for this?
>>>>>> >>> >>>>>>
>>>>>> >>> >>>>>> I'd also like to be able to write back out to a SequenceFile if possible.
>>>>>> >>> >>>>>>
>>>>>> >>> >>>>>> Thanks!
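Coming back to Cham's HadoopFormatIO pointer above, here is a minimal sketch of what reading a SequenceFile through it might look like on the Java side. This is assumed wiring only, not something verified in this thread: the configuration keys follow the HadoopFormatIO javadoc and standard Hadoop MapReduce settings, and reading gs:// paths would additionally require the GCS connector on the classpath:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.hadoop.WritableCoder;
import org.apache.beam.sdk.io.hadoop.format.HadoopFormatIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;

Configuration conf = new Configuration(false);
// The InputFormat to run plus the key/value types it produces.
conf.setClass("mapreduce.job.inputformat.class", SequenceFileInputFormat.class, InputFormat.class);
conf.setClass("key.class", Text.class, Object.class);
conf.setClass("value.class", Text.class, Object.class);
// Standard Hadoop input path/pattern setting; the glob is the one from the original question.
conf.set("mapreduce.input.fileinputformat.inputdir", "gs://bucket/link/SequenceFile-*");

Pipeline p = Pipeline.create(PipelineOptionsFactory.create());
// Hadoop Writable types need a Beam coder; WritableCoder from hadoop-common covers Text.
p.getCoderRegistry().registerCoderForClass(Text.class, WritableCoder.of(Text.class));

PCollection<KV<Text, Text>> records =
    p.apply("ReadSequenceFile", HadoopFormatIO.<Text, Text>read().withConfiguration(conf));

p.run().waitUntilFinish();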
