On Thu, Oct 5, 2017 at 10:15 AM Sergey Beryozkin <sberyoz...@gmail.com>
wrote:

> Hi Eugene
>
> I've done an initial commit to do with removing TikaSource, more work is
> needed and I see 3 tasks remaining:
> 1) provide a shortcut which can let users avoid using FileIO directly,
> as you suggested earlier, at the moment I do:
>
>
> https://github.com/sberyozkin/beam/blob/tikaio/sdks/java/io/tika/src/test/java/org/apache/beam/sdk/io/tika/TikaIOTest.java#L99
>
> but would love to be able to type something like this in the simple cases
>
> PCollection<ParseResult> output =
>          p.apply(TikaIO.parseAll().from(filePattern));
>
Yup, makes sense.


>
> (note I hope to convince you to keep it as parseAll() as opposed to
> parseAllToString() :-) but it is a minor and separate issue).
>
> What I don't understand here is how to do this shortcut without a
> Pipeline instance, i.e, with explicit FileIO use it looks easy, one
> creates a pipeline and then one applies to it FileIO and then connects
> TikaIO via another apply(), but how to implement
> TikaIO.parseAll().from(filePattern) such that TikaIO links to FileIO
> internally without .apply ?
>
It is a composite transform - it can construct arbitrary complex stuff
inside expand(). TikaIO.parseAll()'s expand() method might look something
like:

ParseAll.expand(PCollection<String> filepatterns) {
  return input.apply(FileIO.matchAll())
      .apply(FileIO.readMatches().withCompression(UNCOMPRESSED))
      .apply(TikaIO.parseFiles());
}

ParseFiles.expand(PCollection<ReadableFile> files) {
  return files.apply(ParDo.of(... whatever you have there now ...))
}


>
> 2) Optimize ParseResult coder as you noted in the review
>
> 3) Finish it all with finalizing the configuration options (and enabling
> and enhancing display tests)
>
> Have a look please, I wonder if it makes sense to merge to the master
> now for me to do a follow up (and hopefully final) PR next
>
> Cheers, Sergey
>
> On 28/09/17 17:09, Eugene Kirpichov wrote:
> > Hi! Glad the refactoring is happening, thanks!
> > It was auto-assigned to Reuven as formal owner of the component. I
> > reassigned it to you.
> >
> > On Thu, Sep 28, 2017 at 7:57 AM Sergey Beryozkin <sberyoz...@gmail.com>
> > wrote:
> >
> >> Hi
> >>
> >> I started looking at
> >> https://issues.apache.org/jira/browse/BEAM-2994
> >>
> >> and pushed some initial code to my tikaio branch introducing ParseResult
> >> and updating the tests but keeping the BounderSource/Reader, dropping
> >> the asynchronous parsing code, and few other bits.
> >>
> >> Just noticed it is assigned to Reuven - does it mean Reuven is looking
> >> into it too or was it auto-assigned ?
> >>
> >> I don't mind, would it make sense for me to do an 'interim' PR on
> >> what've done so far before completely removing BoundedSource/Reader
> >> based code ?
> >>
> > Yes :)
> >
> >
> >>
> >> I have another question anyway,
> >>
> >>
> >>> E.g. TikaIO could:
> >>> - take as input a PCollection<ReadableFile>
> >>> - return a PCollection<KV<String, TikaIO.ParseResult>>, where
> ParseResult
> >>> is a class with properties { String content, Metadata metadata }
> >>> - be configured by: a Parser (it implements Serializable so can be
> >>> specified at pipeline construction time) and a ContentHandler whose
> >>> toString() will go into "content". ContentHandler does not implement
> >>> Serializable, so you can not specify it at construction time - however,
> >> you
> >>> can let the user specify either its class (if it's a simple handler
> like
> >> a
> >>> BodyContentHandler) or specify a lambda for creating the handler
> >>> (SerializableFunction<Void, ContentHandler>), and potentially you can
> >> have
> >>> a simpler facade for Tika.parseAsString() - e.g. call it
> >>> TikaIO.parseAllAsStrings().
> >>>
> >>> Example usage would look like:
> >>>
> >>>     PCollection<KV<String, ParseResult>> parseResults =
> >>> p.apply(FileIO.match().filepattern(...))
> >>>       .apply(FileIO.readMatches())
> >>>       .apply(TikaIO.parseAllAsStrings())
> >>>
> >>> or:
> >>>
> >>>       .apply(TikaIO.parseAll()
> >>>           .withParser(new AutoDetectParser())
> >>>           .withContentHandler(() -> new BodyContentHandler(new
> >>> ToXMLContentHandler())))
> >>>
> >>> You could also have shorthands for letting the user avoid using FileIO
> >>> directly in simple cases, for example:
> >>>       p.apply(TikaIO.parseAsStrings().from(filepattern))
> >>>
> >>> This would of course be implemented as a ParDo or even MapElements, and
> >>> you'll be able to share the code between parseAll and regular parse.
> >>>
> >> I'd like to understand how to do
> >>
> >> TikaIO.parse().from(filepattern)
> >>
> >> Right now I have TikaIO.Read extending
> >> PTransform<PBegin, PCollection<ParseResult>
> >>
> >> and then the boilerplate code which builds Read when I do something like
> >>
> >> TikaIO.read().from(filepattern).
> >>
> >> What is the convention for supporting something like
> >> TikaIO.parse().from(filepattern) to be implemented as a ParDo, can I see
> >> some example ?
> >>
> > There are a number of IOs that don't use Source - e.g. DatastoreIO and
> > JdbcIO. TextIO.readMatches() might be an even better transform to mimic.
> > Note that in TikaIO you probably won't need a fusion break after the
> ParDo
> > since there's 1 result per input file.
> >
> >
> >>
> >> Many thanks, Sergey
> >>
> >
>

Reply via email to