Hello!

I dug a bit into this (not a FileIO expert), and it looks like
LocalFileSystem only matches globs in file names (not directories):
https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/LocalFileSystem.java#L251
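
To reproduce this outside a pipeline, here's a rough standalone check I'd
expect to show the difference (an untested sketch; the paths are
placeholders, and I'm assuming the explicit options call is needed to
register the filesystems when running outside a pipeline):

import java.util.Collections;
import org.apache.beam.sdk.io.FileSystems;
import org.apache.beam.sdk.io.fs.MatchResult;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class LocalGlobCheck {
  public static void main(String[] args) throws Exception {
    // Assumed necessary: registers the available FileSystems
    // (LocalFileSystem handles schemeless paths).
    FileSystems.setDefaultPipelineOptions(PipelineOptionsFactory.create());

    // Glob in the last path segment: expected to expand.
    report("/tmp/input/part-0001/*");
    // Glob in a directory segment: expected to come back empty locally.
    report("/tmp/input/part-*/data");
  }

  private static void report(String spec) throws Exception {
    MatchResult result = FileSystems.match(Collections.singletonList(spec)).get(0);
    System.out.print(spec + " -> " + result.status());
    if (result.status() == MatchResult.Status.OK) {
      System.out.print(", " + result.metadata().size() + " file(s)");
    }
    System.out.println();
  }
}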

Perhaps related: https://issues.apache.org/jira/browse/BEAM-1309

There's a note in the FileSystem javadoc that makes me suspect globs
aren't expected to expand everywhere in the path for all filesystems,
but *should* work in the last hierarchical element:
https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileSystem.java#L59
(noting that, in my opinion, the last hierarchical element doesn't
necessarily mean "files only").

It kind of makes sense -- wildcards at the top of a hierarchy in a large
filesystem can end up creating a huge internal "query" walking the entire
tree.

I made a quick attempt at a composable pipeline that matched "part-*/data"
using the FileIO.matchAll() technique for TextIO (roughly the sketch
below), but didn't succeed. That's a bit surprising to me, so I'm
interested in whether this could be a feature improvement...
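
Concretely, the attempt looked roughly like this (a sketch from memory;
the pattern string is a placeholder, and on the local filesystem it fails
with the same "No files matched spec" error reported below):

PCollection<String> lines = p
  .apply(Create.of("/tmp/input/part-*/data"))
  .apply(FileIO.matchAll())
  .apply(FileIO.readMatches().withCompression(Compression.AUTO))
  .apply(TextIO.readFiles());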

It seems reasonable that we could construct something like:

PCollection<String> lines = p.apply(Create.of("/tmp/input/"))
  .apply(FileIO.matchResolveDirectory("part-*"))
  .apply(FileIO.matchResolveFile("data"))
  .apply(FileIO.readMatches().withCompression(AUTO))
  .apply(TextIO.readFiles());
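
In the meantime, the closest I can see with the existing API is a two-stage
expansion along these lines (an untested sketch; it assumes FileIO.matchAll()
will return the part-* directories in the first stage, which may run into
the same LocalFileSystem limitation, and the "/data" string concatenation
is deliberately naive):

PCollection<String> lines = p
  .apply(Create.of("/tmp/input/part-*"))
  .apply(FileIO.matchAll())
  .apply(MapElements.via(
      new SimpleFunction<MatchResult.Metadata, String>() {
        @Override
        public String apply(MatchResult.Metadata dir) {
          // Point at the "data" file inside each matched entry.
          return dir.resourceId().toString() + "/data";
        }
      }))
  .apply(FileIO.matchAll())
  .apply(FileIO.readMatches().withCompression(Compression.AUTO))
  .apply(TextIO.readFiles());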

Does anybody have a bit more experience with how to correctly construct
something like that?

Best regards, Ryan



On Tue, Jul 16, 2019 at 4:25 PM Shannon Duncan <[email protected]>
wrote:

> I am still having the problem that the local file system (DirectRunner)
> will not allow a local glob string to be passed as a file source. I have
> tried both relative and fully qualified paths.
>
> I can confirm the same inputFile source glob returns data with a simple
> cat command, so I know the glob is good.
>
> Error: "java.io.FileNotFoundException: No files matched spec:
> /Users/<user>/github/<repo>/io/sequenceFile/part-*/data
>
> Any assistance would be greatly appreciated. This is on the Java SDK.
>
> I tested this with TextIO.read().from(ValueProvider<String>); Still the
> same.
>
> Thanks,
> Shannon
>
> On Fri, Jul 12, 2019 at 2:14 PM Igor Bernstein <[email protected]>
> wrote:
>
>> I'm not sure, to be honest. The pattern expansion happens in
>> FileBasedSource via FileSystems.match(), so it should follow the same
>> expansion rules as other file-based sinks like TextIO. Maybe someone with
>> more Beam experience can help?
>>
>> On Fri, Jul 12, 2019 at 2:55 PM Shannon Duncan <
>> [email protected]> wrote:
>>
>>> Clarification on my previous message: this only happens on the local
>>> file system, where it is unable to match a pattern string. Via a
>>> `gs://<bucket>` path it is able to do multiple-file matching.
>>>
>>> On Fri, Jul 12, 2019 at 1:36 PM Shannon Duncan <
>>> [email protected]> wrote:
>>>
>>>> Awesome. I got it working for a single file, but for a structure of:
>>>>
>>>> /part-0001/index
>>>> /part-0001/data
>>>> /part-0002/index
>>>> /part-0002/data
>>>>
>>>> I tried to do /part-*  and /part-*/data
>>>>
>>>> It does not find the multipart files. However if I just do
>>>> /part-0001/data it will find it and read it.
>>>>
>>>> Any ideas why?
>>>>
>>>> I am using this to generate the source:
>>>>
>>>> static SequenceFileSource<Text, Text> createSource(
>>>>     ValueProvider<String> sourcePattern) {
>>>>   return new SequenceFileSource<Text, Text>(
>>>>       sourcePattern,
>>>>       Text.class,
>>>>       WritableSerialization.class,
>>>>       Text.class,
>>>>       WritableSerialization.class,
>>>>       SequenceFile.SYNC_INTERVAL);
>>>> }
>>>>
>>>> On Wed, Jul 10, 2019 at 10:52 AM Igor Bernstein <
>>>> [email protected]> wrote:
>>>>
>>>>> It should be fairly straightforward:
>>>>> 1. Copy SequenceFileSource.java
>>>>> <https://github.com/googleapis/cloud-bigtable-client/blob/master/bigtable-dataflow-parent/bigtable-beam-import/src/main/java/com/google/cloud/bigtable/beam/sequencefiles/SequenceFileSource.java>
>>>>> to your project.
>>>>> 2. Add the source to your pipeline, configuring it with appropriate
>>>>> serializers. See here
>>>>> <https://github.com/googleapis/cloud-bigtable-client/blob/master/bigtable-dataflow-parent/bigtable-beam-import/src/main/java/com/google/cloud/bigtable/beam/sequencefiles/ImportJob.java#L159-L173>
>>>>> for an example with HBase Results.
>>>>>
>>>>> On Wed, Jul 10, 2019 at 10:58 AM Shannon Duncan <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> If I wanted to go ahead and include this within a new Java pipeline,
>>>>>> what level of work would I be looking at to integrate it?
>>>>>>
>>>>>> On Wed, Jul 3, 2019 at 3:54 AM Ismaël Mejía <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> That's great. I can help whenever you need. We just need to choose
>>>>>>> its destination. Both the `hadoop-format` and `hadoop-file-system`
>>>>>>> modules are good candidates, but I would feel inclined to put it in
>>>>>>> its own module, `sdks/java/extensions/sequencefile`, to make it
>>>>>>> easier for end users to discover.
>>>>>>>
>>>>>>> A thing to consider is the SeekableByteChannel adapters; we can move
>>>>>>> those into hadoop-common if needed and refactor the modules to share
>>>>>>> code. It's worth taking a look at
>>>>>>> org.apache.beam.sdk.io.hdfs.HadoopFileSystem.HadoopSeekableByteChannel#HadoopSeekableByteChannel
>>>>>>> to see if some of it could be useful.
>>>>>>>
>>>>>>> On Tue, Jul 2, 2019 at 11:46 PM Igor Bernstein <
>>>>>>> [email protected]> wrote:
>>>>>>> >
>>>>>>> > Hi all,
>>>>>>> >
>>>>>>> > I wrote those classes with the intention of upstreaming them to
>>>>>>> Beam. I can try to make some time this quarter to clean them up. I would
>>>>>>> need a bit of guidance from a Beam expert on how to make them coexist
>>>>>>> with HadoopFormatIO though.
>>>>>>> >
>>>>>>> >
>>>>>>> > On Tue, Jul 2, 2019 at 10:55 AM Solomon Duskis <[email protected]>
>>>>>>> wrote:
>>>>>>> >>
>>>>>>> >> +Igor Bernstein who wrote the Cloud Bigtable Sequence File
>>>>>>> classes.
>>>>>>> >>
>>>>>>> >> Solomon Duskis | Google Cloud clients | [email protected] |
>>>>>>> 914-462-0531
>>>>>>> >>
>>>>>>> >>
>>>>>>> >> On Tue, Jul 2, 2019 at 4:57 AM Ismaël Mejía <[email protected]>
>>>>>>> wrote:
>>>>>>> >>>
>>>>>>> >>> (Adding dev@ and Solomon Duskis to the discussion)
>>>>>>> >>>
>>>>>>> >>> I was not aware of these, thanks for sharing, David. It would
>>>>>>> >>> definitely be a great addition if we could have those donated as
>>>>>>> >>> an extension on the Beam side. We can even evolve them in the
>>>>>>> >>> future to be more FileIO-like. Any chance this can happen? Maybe
>>>>>>> >>> Solomon and his team?
>>>>>>> >>>
>>>>>>> >>>
>>>>>>> >>>
>>>>>>> >>> On Tue, Jul 2, 2019 at 9:39 AM David Morávek <[email protected]>
>>>>>>> wrote:
>>>>>>> >>> >
>>>>>>> >>> > Hi, you can use SequenceFileSink and Source from the Bigtable
>>>>>>> >>> > client. Those work nicely with FileIO.
>>>>>>> >>> >
>>>>>>> >>> >
>>>>>>> https://github.com/googleapis/cloud-bigtable-client/blob/master/bigtable-dataflow-parent/bigtable-beam-import/src/main/java/com/google/cloud/bigtable/beam/sequencefiles/SequenceFileSink.java
>>>>>>> >>> >
>>>>>>> https://github.com/googleapis/cloud-bigtable-client/blob/master/bigtable-dataflow-parent/bigtable-beam-import/src/main/java/com/google/cloud/bigtable/beam/sequencefiles/SequenceFileSource.java
>>>>>>> >>> >
>>>>>>> >>> > It would be really cool to move these into Beam, but it's up
>>>>>>> >>> > to the Googlers to decide whether they want to donate this.
>>>>>>> >>> >
>>>>>>> >>> > D.
>>>>>>> >>> >
>>>>>>> >>> > On Tue, Jul 2, 2019 at 2:07 AM Shannon Duncan <
>>>>>>> [email protected]> wrote:
>>>>>>> >>> >>
>>>>>>> >>> >> It's not outside the realm of possibility. For now I've
>>>>>>> >>> >> created an intermediate step: a Hadoop job that converts from
>>>>>>> >>> >> sequence files to text files.
>>>>>>> >>> >>
>>>>>>> >>> >> Looking into better options.
>>>>>>> >>> >>
>>>>>>> >>> >> On Mon, Jul 1, 2019, 5:50 PM Chamikara Jayalath <
>>>>>>> [email protected]> wrote:
>>>>>>> >>> >>>
>>>>>>> >>> >>> The Java SDK has HadoopFormatIO, which you should be able to
>>>>>>> >>> >>> use to read sequence files:
>>>>>>> >>> >>> https://github.com/apache/beam/blob/master/sdks/java/io/hadoop-format/src/main/java/org/apache/beam/sdk/io/hadoop/format/HadoopFormatIO.java
>>>>>>> >>> >>> I don't think there's a direct alternative for this for
>>>>>>> Python.
>>>>>>> >>> >>>
>>>>>>> >>> >>> Is it possible to write to a well-known format such as Avro
>>>>>>> >>> >>> instead of a Hadoop-specific format? That would allow you to
>>>>>>> >>> >>> read from both Dataproc/Hadoop and the Beam Python SDK.
>>>>>>> >>> >>>
>>>>>>> >>> >>> Thanks,
>>>>>>> >>> >>> Cham
>>>>>>> >>> >>>
>>>>>>> >>> >>> On Mon, Jul 1, 2019 at 3:37 PM Shannon Duncan <
>>>>>>> [email protected]> wrote:
>>>>>>> >>> >>>>
>>>>>>> >>> >>>> That's a pretty big source/sink to be missing when looking
>>>>>>> >>> >>>> at transitioning from Dataproc to Dataflow, using GCS as a
>>>>>>> >>> >>>> storage buffer instead of a traditional HDFS.
>>>>>>> >>> >>>>
>>>>>>> >>> >>>> From what I've been able to tell from source code and
>>>>>>> documentation, Java is able to but not Python?
>>>>>>> >>> >>>>
>>>>>>> >>> >>>> Thanks,
>>>>>>> >>> >>>> Shannon
>>>>>>> >>> >>>>
>>>>>>> >>> >>>> On Mon, Jul 1, 2019 at 5:29 PM Chamikara Jayalath <
>>>>>>> [email protected]> wrote:
>>>>>>> >>> >>>>>
>>>>>>> >>> >>>>> I don't think we have a source/sink for reading Hadoop
>>>>>>> sequence files. Your best bet currently will probably be to use 
>>>>>>> FileSystem
>>>>>>> abstraction to create a file from a ParDo and read directly from there
>>>>>>> using a library that can read sequence files.
>>>>>>> >>> >>>>>
>>>>>>> >>> >>>>> Thanks,
>>>>>>> >>> >>>>> Cham
>>>>>>> >>> >>>>>
>>>>>>> >>> >>>>> On Mon, Jul 1, 2019 at 8:42 AM Shannon Duncan <
>>>>>>> [email protected]> wrote:
>>>>>>> >>> >>>>>>
>>>>>>> >>> >>>>>> I'm wanting to read a Hadoop Sequence/Map file stored on
>>>>>>> >>> >>>>>> Google Cloud Storage via a "gs://bucket/link/SequenceFile-*"
>>>>>>> >>> >>>>>> pattern, using the Python SDK.
>>>>>>> >>> >>>>>>
>>>>>>> >>> >>>>>> I cannot locate any good adapters for this, and the one
>>>>>>> Hadoop Filesystem reader seems to only read from a "hdfs://" url.
>>>>>>> >>> >>>>>>
>>>>>>> >>> >>>>>> I'm wanting to use Dataflow and GCS exclusively to start
>>>>>>> mixing in Beam pipelines with our current Hadoop Pipelines.
>>>>>>> >>> >>>>>>
>>>>>>> >>> >>>>>> Is this a feature that is supported, or will it be
>>>>>>> >>> >>>>>> supported in the future?
>>>>>>> >>> >>>>>> Does anyone have any good, performant suggestions for this?
>>>>>>> >>> >>>>>>
>>>>>>> >>> >>>>>> I'd also like to be able to write back out to a
>>>>>>> SequenceFile if possible.
>>>>>>> >>> >>>>>>
>>>>>>> >>> >>>>>> Thanks!
>>>>>>> >>> >>>>>>
>>>>>>>
>>>>>>
