Hi

Yes, using .withCompression(UNCOMPRESSED) works, but the test code looks funny:

p.apply("ParseFiles", FileIO.match().filepattern(resourcePath))
        .apply(FileIO.readMatches().withCompression(
          compressed ? Compression.UNCOMPRESSED : Compression.AUTO))
        .apply(TikaIO.read())

One can obviously always set UNCOMPRESSED for Tika and avoid the if/else, and it will work fine (with some minor confusion expected when someone attempts to process compressed files), but IMHO a more typed approach for getting the raw content would be useful to have as well.

Thanks, Sergey
On 05/10/17 12:56, Sergey Beryozkin wrote:
Hi Eugene

On 04/10/17 22:52, Eugene Kirpichov wrote:
You can avoid automatic decompression by using
FileIO.readMatches().withCompression(UNCOMPRESSED) (default is AUTO).

This is nice - it was already anticipated earlier that the auto-decompression would not always be needed, and it would help the readZippedPdfFile test pass :-). But I'd rather have TikaIO users not worry about setting .withCompression(UNCOMPRESSED) for it to work correctly, given that Tika always needs the raw content - having the users assert it may be a bit suboptimal.

Auto-decompression is only useful when it is indeed what a user wants (a single file, with no risk of some malicious compression); otherwise Tika should do it itself.

So ideally, after the initial refactoring is complete, we'd only have

p.apply(FileIO.readMatches()).apply(TikaIO.parseAll())

for reading compressed or uncompressed files, where Tika would just do ReadableFile.openRaw as suggested by Ben - or at least have another shortcut at the FileIO level, something like

FileIO.readRawMatches()

which would be equivalent to specifying the UNCOMPRESSED option explicitly.
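
For illustration only - readRawMatches() is a hypothetical name, not an existing FileIO method - such a shortcut could be a one-line wrapper:

    // Hypothetical FileIO shortcut: read matched files without any
    // automatic decompression, leaving the raw bytes for Tika.
    public static ReadMatches readRawMatches() {
      return readMatches().withCompression(Compression.UNCOMPRESSED);
    }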


Thanks, Sergey

On Wed, Oct 4, 2017 at 2:42 PM Sergey Beryozkin <[email protected]>
wrote:

Wait, but what about Tika doing checks for things like zip bombs, etc.? Tika is expected to do the decompression itself, while ReadableFile hands over the content already decompressed.

The other point is that Tika also reports the names of the zipped files in the content, as you can see from TikaIOTest#readZippedPdfFile.

Can we assume that if Metadata does not point to a local file then it can be opened as a URL stream? The same issue affects TikaConfig, so I'd rather have a solution which will work for both MatchResult.Metadata and TikaConfig.

Thanks, Sergey
On 04/10/17 22:02, Sergey Beryozkin wrote:
Good point...

Sergey

On 04/10/17 18:24, Eugene Kirpichov wrote:
Can TikaInputStream consume a regular InputStream? If so, you can apply it to Channels.newInputStream(channel). If not, applying it to the filename extracted from Metadata won't work either, because it can point to a file that's not on the local disk.
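
For reference, a minimal sketch of the adaptation Eugene suggests, assuming the wrapped channel is handed to TikaInputStream.get(InputStream) ('file' stands for a FileIO.ReadableFile; note Sergey reports below that this path does not fully work yet):

    import java.io.InputStream;
    import java.nio.channels.Channels;
    import org.apache.tika.io.TikaInputStream;

    // open() returns a ReadableByteChannel; wrap it as a plain InputStream
    InputStream stream = Channels.newInputStream(file.open());
    try (TikaInputStream tikaStream = TikaInputStream.get(stream)) {
        // hand tikaStream to the Tika parser
    }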

On Wed, Oct 4, 2017, 10:08 AM Sergey Beryozkin <[email protected]>
wrote:

I'm starting to move toward

class TikaIO {
    public static ParseAllToString parseAllToString() {..}

    class ParseAllToString
        extends PTransform<PCollection<ReadableFile>, PCollection<ParseResult>> {
      ...configuration properties...
      expand {
        return input.apply(ParDo.of(new ParseToStringFn))
      }
      class ParseToStringFn extends DoFn<...> {...}
    }
}

as suggested by Eugene

The initial migration seems to work fine, except that ReadableFile, and in particular ReadableByteChannel, can not be consumed by TikaInputStream yet (I'll open an enhancement request); besides, it's better to let Tika unzip if needed, given that a lot of effort has gone into Tika's detection of zip security issues...

So I'm typing it as

class ParseAllToString
    extends PTransform<PCollection<MatchResult.Metadata>, PCollection<ParseResult>>
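
A minimal sketch of what the corresponding DoFn could do, with assumed names (building the ParseResult is elided): open the raw resource via FileSystems and let Tika detect and unpack the content itself.

    import java.io.InputStream;
    import java.nio.channels.Channels;
    import org.apache.beam.sdk.io.FileSystems;
    import org.apache.beam.sdk.io.fs.MatchResult;
    import org.apache.beam.sdk.transforms.DoFn;

    class ParseToStringFn extends DoFn<MatchResult.Metadata, ParseResult> {
        @ProcessElement
        public void processElement(ProcessContext c) throws Exception {
            // Open the raw bytes - no Beam-side decompression happens here
            try (InputStream is = Channels.newInputStream(
                    FileSystems.open(c.element().resourceId()))) {
                // feed 'is' to the Tika parser
            }
        }
    }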

Cheers, Sergey

On 02/10/17 12:03, Sergey Beryozkin wrote:
Thanks for the review, please see the last comment:

https://github.com/apache/beam/pull/3835#issuecomment-333502388

(sorry for the possible duplication - but I'm not sure that GitHub will propagate it, as I can not see a comment there that I left on Saturday).

Cheers, Sergey
On 29/09/17 10:21, Sergey Beryozkin wrote:
Hi
On 28/09/17 17:09, Eugene Kirpichov wrote:
Hi! Glad the refactoring is happening, thanks!

Thanks for getting me focused on having TikaIO support the simpler (and practical) cases first :-)
It was auto-assigned to Reuven as formal owner of the component. I
reassigned it to you.
OK, thanks...

On Thu, Sep 28, 2017 at 7:57 AM Sergey Beryozkin <[email protected]> wrote:

Hi

I started looking at
https://issues.apache.org/jira/browse/BEAM-2994

and pushed some initial code to my tikaio branch, introducing ParseResult and updating the tests, but keeping the BoundedSource/Reader, dropping the asynchronous parsing code, and a few other bits.

Just noticed it is assigned to Reuven - does it mean Reuven is looking into it too, or was it auto-assigned?

I don't mind, but would it make sense for me to do an 'interim' PR on what I've done so far, before completely removing the BoundedSource/Reader-based code?

Yes :)

I did commit yesterday to my branch, and it made its way to the pending PR (which I forgot about) where I only tweaked a couple of doc typos, so I renamed that PR:

https://github.com/apache/beam/pull/3835

(The build failures are apparently due to the build timeouts)

As I mentioned, in this PR I updated the existing TikaIO test to work with ParseResult, which at the moment has a file location as its property. A file name alone could easily be saved instead, but I thought it might be important to know where on the network the file is - e.g. to copy it afterwards if needed, etc. I'd also have no problem with having it typed as a K key; I was only trying to make it a bit simpler at the start.

I'll deal with the new configurations after the switch. TikaConfig would most likely still need to be supported, but I recall you mentioned that the way it's done now will make it work only with the direct runner. I guess I can load it as a URL resource... The other bits, like providing custom content handlers, parsers, input metadata, maybe setting the max size of the files, etc., can all be added after the switch.
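
A sketch of that URL-resource idea, assuming a configUrl value supplied at pipeline construction time (TikaConfig does have a URL constructor):

    import java.net.URL;
    import org.apache.tika.config.TikaConfig;

    // Load the config from a location every worker can reach, rather than
    // a local path that only exists where the pipeline was constructed.
    TikaConfig config = new TikaConfig(new URL(configUrl));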

Note I haven't dealt with a number of your comments on the original code which could still be addressed in the current code, given that most of that code will go away with the next PR anyway.

Please review, or merge if it looks like a step in the right direction...



I have another question anyway,


E.g. TikaIO could:
- take as input a PCollection<ReadableFile>
- return a PCollection<KV<String, TikaIO.ParseResult>>, where ParseResult is a class with properties { String content, Metadata metadata } (see the sketch after this list)
- be configured by: a Parser (it implements Serializable, so it can be specified at pipeline construction time) and a ContentHandler whose toString() will go into "content". ContentHandler does not implement Serializable, so you can not specify it at construction time - however, you can let the user specify either its class (if it's a simple handler like a BodyContentHandler) or a lambda for creating the handler (SerializableFunction<Void, ContentHandler>), and potentially you can have a simpler facade for Tika.parseAsString() - e.g. call it TikaIO.parseAllAsStrings().
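
For reference, a minimal sketch of the ParseResult value class described above - the field names follow the description, everything else is assumed:

    import java.io.Serializable;
    import org.apache.tika.metadata.Metadata;

    // Value type pairing the extracted text with the Tika metadata.
    public class ParseResult implements Serializable {
        private final String content;
        private final Metadata metadata;

        public ParseResult(String content, Metadata metadata) {
            this.content = content;
            this.metadata = metadata;
        }

        public String getContent() { return content; }
        public Metadata getMetadata() { return metadata; }
    }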

Example usage would look like:

    PCollection<KV<String, ParseResult>> parseResults =
        p.apply(FileIO.match().filepattern(...))
         .apply(FileIO.readMatches())
         .apply(TikaIO.parseAllAsStrings())

or:

        .apply(TikaIO.parseAll()
            .withParser(new AutoDetectParser())
            .withContentHandler(
                () -> new BodyContentHandler(new ToXMLContentHandler())))

You could also have shorthands for letting the user avoid using FileIO directly in simple cases, for example:

        p.apply(TikaIO.parseAsStrings().from(filepattern))

This would of course be implemented as a ParDo or even MapElements, and you'll be able to share the code between parseAll and regular parse.

I'd like to understand how to do

TikaIO.parse().from(filepattern)

Right now I have TikaIO.Read extending

PTransform<PBegin, PCollection<ParseResult>>

and then the boilerplate code which builds Read when I do something like

TikaIO.read().from(filepattern)

What is the convention for supporting something like TikaIO.parse().from(filepattern) implemented as a ParDo? Can I see some example?

There are a number of IOs that don't use Source - e.g. DatastoreIO and JdbcIO. TextIO.readMatches() might be an even better transform to mimic. Note that in TikaIO you probably won't need a fusion break after the ParDo, since there's 1 result per input file.
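
To illustrate the convention being asked about, a sketch of a ParDo-based TikaIO.parse() with a from(filepattern) setter; the names and builder details are assumed, and ParseFn (a DoFn<ReadableFile, ParseResult>) is elided:

    public static class Parse extends PTransform<PBegin, PCollection<ParseResult>> {
        private final String filepattern;

        private Parse(String filepattern) { this.filepattern = filepattern; }

        public Parse from(String filepattern) { return new Parse(filepattern); }

        @Override
        public PCollection<ParseResult> expand(PBegin input) {
            // No BoundedSource needed: match the files, then parse each in a ParDo
            return input
                .apply(FileIO.match().filepattern(filepattern))
                .apply(FileIO.readMatches())
                .apply(ParDo.of(new ParseFn()));
        }
    }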


OK, I'll have a look

Cheers, Sergey



Many thanks, Sergey







