[ 
https://issues.apache.org/jira/browse/BEAM-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16022676#comment-16022676
 ] 

Sergey Beryozkin edited comment on BEAM-2328 at 5/24/17 10:37 AM:
------------------------------------------------------------------

Apache Tika parsers report the content via SAX events; see 
https://tika.apache.org/1.14/.

I'm implementing a TikaReader that adapts the sequence of SAX events to 
the streaming BoundedReader API by using an internal ExecutorService and a 
ConcurrentLinkedQueue. Thus, when the Beam thread comes in and calls start() 
and then advance(), it does not have to parse the given file content 
immediately. A good number of Tika parsers can report the data in chunks, so 
the proposed TikaReader implementation should be fairly efficient.
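
To illustrate the idea only, here is a minimal sketch of that adaptation. It 
is not the TikaReader code from the upcoming PR: the class name SaxQueueReader 
and its methods are hypothetical, the JDK's own SAX parser stands in for a Tika 
parser, and the reader only mimics the start()/advance()/getCurrent() shape 
rather than extending Beam's BoundedReader.

```java
import java.io.StringReader;
import java.util.NoSuchElementException;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical pull-style reader fed by push-style SAX callbacks. */
class SaxQueueReader implements AutoCloseable {
  private final Queue<String> chunks = new ConcurrentLinkedQueue<>();
  private final ExecutorService executor = Executors.newSingleThreadExecutor();
  private final String xml; // stand-in for the file content (assumption)
  private volatile boolean parsingDone;
  private volatile Exception failure;
  private String current;

  SaxQueueReader(String xml) {
    this.xml = xml;
  }

  /** Starts parsing on a background thread, then waits for the first chunk. */
  boolean start() {
    executor.submit(() -> {
      try {
        SAXParserFactory.newInstance().newSAXParser().parse(
            new InputSource(new StringReader(xml)),
            new DefaultHandler() {
              @Override
              public void characters(char[] ch, int offset, int length) {
                // Each non-empty text event becomes one queued chunk.
                String text = new String(ch, offset, length).trim();
                if (!text.isEmpty()) {
                  chunks.add(text);
                }
              }
            });
      } catch (Exception e) {
        failure = e;
      } finally {
        parsingDone = true;
      }
    });
    return advance();
  }

  /** Moves to the next chunk; returns false once the parser has finished. */
  boolean advance() {
    while (true) {
      // Read the flag before polling so a chunk queued just before the
      // parser finished is never missed.
      boolean done = parsingDone;
      current = chunks.poll();
      if (current != null) {
        return true;
      }
      if (done) {
        if (failure != null) {
          throw new IllegalStateException("Parsing failed", failure);
        }
        return false;
      }
      try {
        Thread.sleep(1); // simple back-off; a BlockingQueue would avoid polling
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new IllegalStateException("Interrupted while waiting", e);
      }
    }
  }

  String getCurrent() {
    if (current == null) {
      throw new NoSuchElementException("No current chunk");
    }
    return current;
  }

  @Override
  public void close() {
    executor.shutdownNow();
  }
}
```

A caller would loop with start() once and advance() afterwards, reading 
getCurrent() after each call that returns true. The sleep-based wait is a 
simplification kept to stay with ConcurrentLinkedQueue; note also that a SAX 
parser may split one text node across several characters() calls, so chunk 
boundaries are not guaranteed to match element boundaries.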

Unfortunately, I cannot extend the FileBasedSource/Reader helpers, given that 
Tika parsers need full control of the InputStream. However, should the PR be 
accepted, I would definitely see some scope for reusing some of the currently 
private FileBasedSource/Reader helpers, such as the composite reader that is 
used when multiple files are picked up.

Right now I have reasonably good starting code, IMHO, with TikaInputTest 
covering the reading of PDF, zipped PDF, ODT and two ODT files, with the 
content and, optionally, the parsed-out metadata also being streamed. 

Some of the code I copied from FileBasedSource might be suboptimal when applied 
to the Tika case. I hope that, if the PR eventually gets accepted, then with 
the help of Tika experts there will no doubt be more improvements coming in.

Planning to work on creating a branch and PR soon, cheers  







> Introduce Apache Tika Input component
> -------------------------------------
>
>                 Key: BEAM-2328
>                 URL: https://issues.apache.org/jira/browse/BEAM-2328
>             Project: Beam
>          Issue Type: New Feature
>          Components: sdk-ideas
>            Reporter: Sergey Beryozkin
>            Assignee: Davor Bonaci
>             Fix For: 2.1.0
>
>
> Apache Tika is a popular project that offers extensive support for parsing 
> a variety of file formats. It is used in many projects, including Lucene and 
> Elasticsearch. 
> Supporting a Tika Input (Read) at the Beam level would be of major interest 
> to many users.
> A PR is to follow.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
