On Thu, Oct 5, 2017 at 10:15 AM Sergey Beryozkin <sberyoz...@gmail.com> wrote:
> Hi Eugene > > I've done an initial commit to do with removing TikaSource, more work is > needed and I see 3 tasks remaining: > 1) provide a shortcut which can let users avoid using FileIO directly, > as you suggested earlier, at the moment I do: > > > https://github.com/sberyozkin/beam/blob/tikaio/sdks/java/io/tika/src/test/java/org/apache/beam/sdk/io/tika/TikaIOTest.java#L99 > > but would love to be able to type something like this in the simple cases > > PCollection<ParseResult> output = > p.apply(TikaIO.parseAll().from(filePattern)); > Yup, makes sense. > > (note I hope to convince you to keep it as parseAll() as opposed to > parseAllToString() :-) but it is a minor and separate issue). > > What I don't understand here is how to do this shortcut without a > Pipeline instance, i.e, with explicit FileIO use it looks easy, one > creates a pipeline and then one applies to it FileIO and then connects > TikaIO via another apply(), but how to implement > TikaIO.parseAll().from(filePattern) such that TikaIO links to FileIO > internally without .apply ? > It is a composite transform - it can construct arbitrary complex stuff inside expand(). TikaIO.parseAll()'s expand() method might look something like: ParseAll.expand(PCollection<String> filepatterns) { return input.apply(FileIO.matchAll()) .apply(FileIO.readMatches().withCompression(UNCOMPRESSED)) .apply(TikaIO.parseFiles()); } ParseFiles.expand(PCollection<ReadableFile> files) { return files.apply(ParDo.of(... whatever you have there now ...)) } > > 2) Optimize ParseResult coder as you noted in the review > > 3) Finish it all with finalizing the configuration options (and enabling > and enhancing display tests) > > Have a look please, I wonder if it makes sense to merge to the master > now for me to do a follow up (and hopefully final) PR next > > Cheers, Sergey > > On 28/09/17 17:09, Eugene Kirpichov wrote: > > Hi! Glad the refactoring is happening, thanks! > > It was auto-assigned to Reuven as formal owner of the component. I > > reassigned it to you. > > > > On Thu, Sep 28, 2017 at 7:57 AM Sergey Beryozkin <sberyoz...@gmail.com> > > wrote: > > > >> Hi > >> > >> I started looking at > >> https://issues.apache.org/jira/browse/BEAM-2994 > >> > >> and pushed some initial code to my tikaio branch introducing ParseResult > >> and updating the tests but keeping the BounderSource/Reader, dropping > >> the asynchronous parsing code, and few other bits. > >> > >> Just noticed it is assigned to Reuven - does it mean Reuven is looking > >> into it too or was it auto-assigned ? > >> > >> I don't mind, would it make sense for me to do an 'interim' PR on > >> what've done so far before completely removing BoundedSource/Reader > >> based code ? > >> > > Yes :) > > > > > >> > >> I have another question anyway, > >> > >> > >>> E.g. TikaIO could: > >>> - take as input a PCollection<ReadableFile> > >>> - return a PCollection<KV<String, TikaIO.ParseResult>>, where > ParseResult > >>> is a class with properties { String content, Metadata metadata } > >>> - be configured by: a Parser (it implements Serializable so can be > >>> specified at pipeline construction time) and a ContentHandler whose > >>> toString() will go into "content". ContentHandler does not implement > >>> Serializable, so you can not specify it at construction time - however, > >> you > >>> can let the user specify either its class (if it's a simple handler > like > >> a > >>> BodyContentHandler) or specify a lambda for creating the handler > >>> (SerializableFunction<Void, ContentHandler>), and potentially you can > >> have > >>> a simpler facade for Tika.parseAsString() - e.g. call it > >>> TikaIO.parseAllAsStrings(). > >>> > >>> Example usage would look like: > >>> > >>> PCollection<KV<String, ParseResult>> parseResults = > >>> p.apply(FileIO.match().filepattern(...)) > >>> .apply(FileIO.readMatches()) > >>> .apply(TikaIO.parseAllAsStrings()) > >>> > >>> or: > >>> > >>> .apply(TikaIO.parseAll() > >>> .withParser(new AutoDetectParser()) > >>> .withContentHandler(() -> new BodyContentHandler(new > >>> ToXMLContentHandler()))) > >>> > >>> You could also have shorthands for letting the user avoid using FileIO > >>> directly in simple cases, for example: > >>> p.apply(TikaIO.parseAsStrings().from(filepattern)) > >>> > >>> This would of course be implemented as a ParDo or even MapElements, and > >>> you'll be able to share the code between parseAll and regular parse. > >>> > >> I'd like to understand how to do > >> > >> TikaIO.parse().from(filepattern) > >> > >> Right now I have TikaIO.Read extending > >> PTransform<PBegin, PCollection<ParseResult> > >> > >> and then the boilerplate code which builds Read when I do something like > >> > >> TikaIO.read().from(filepattern). > >> > >> What is the convention for supporting something like > >> TikaIO.parse().from(filepattern) to be implemented as a ParDo, can I see > >> some example ? > >> > > There are a number of IOs that don't use Source - e.g. DatastoreIO and > > JdbcIO. TextIO.readMatches() might be an even better transform to mimic. > > Note that in TikaIO you probably won't need a fusion break after the > ParDo > > since there's 1 result per input file. > > > > > >> > >> Many thanks, Sergey > >> > > >