[jira] [Commented] (BEAM-3004) TikaIOTest#testReadPdfFile is flaky.
[ https://issues.apache.org/jira/browse/BEAM-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16237606#comment-16237606 ] Sergey Beryozkin commented on BEAM-3004: This can now be resolved, the current TikaIOTest does not have a dedicated testReadPdfFile; the test parsing PDF and ODT files is OK. > TikaIOTest#testReadPdfFile is flaky. > > > Key: BEAM-3004 > URL: https://issues.apache.org/jira/browse/BEAM-3004 > Project: Beam > Issue Type: Bug > Components: sdk-java-extensions >Reporter: Jason Kuster >Assignee: Sergey Beryozkin >Priority: Major > > testReadPdfFile has been sporadically failing on Jenkins. > https://builds.apache.org/view/A-D/view/Beam/job/beam_PreCommit_Java_MavenInstall/14691/org.apache.beam$beam-sdks-java-io-tika/testReport/org.apache.beam.sdk.io.tika/TikaIOTest/testReadPdfFile/history/ -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (BEAM-2994) Refactor TikaIO
[ https://issues.apache.org/jira/browse/BEAM-2994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16225703#comment-16225703 ] Sergey Beryozkin commented on BEAM-2994: Thanks for merging this PR > Refactor TikaIO > --- > > Key: BEAM-2994 > URL: https://issues.apache.org/jira/browse/BEAM-2994 > Project: Beam > Issue Type: Task > Components: sdk-java-extensions >Affects Versions: 2.2.0 >Reporter: Sergey Beryozkin >Assignee: Sergey Beryozkin > Fix For: 2.3.0 > > > TikaIO is currently implemented as a BoundedSource and asynchronous > BoundedReader returning individual document's text chunks as Strings, > eventually passed unordered (and not linked to the original documents) to the > pipeline functions. > It was decided in the recent beam-dev thread that initially TikaIO should > support the cases where only a single composite bean per file, capturing the > file content, location (or name) and metadata, should flow to the pipeline, > and thus avoiding the need to implement TikaIO as a BoundedSource/Reader. > Enhancing TikaIO to support the streaming of the content into the pipelines > may be considered in the next phase, based on the specific use-cases... -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Comment Edited] (BEAM-3004) TikaIOTest#testReadPdfFile is flaky.
[ https://issues.apache.org/jira/browse/BEAM-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16187099#comment-16187099 ] Sergey Beryozkin edited comment on BEAM-3004 at 9/30/17 1:49 PM: - Thanks, looks like the asynchronous TikaReader implementation is weak somewhere when it comes to processing the PDF file(s), but that code has already gone from the pending https://github.com/apache/beam/pull/3835, so I'm hoping this issue will be resolved after PR 3835 gets approved... was (Author: sergey_beryozkin): Thanks, looks like the asynchronous TikaReader implementation is weak somewhere when it comes to processing the PDF file(s), but that code ha already gone from the pending https://github.com/apache/beam/pull/3835, so I'm hoping this issue will be resolved after PR 3835 gets approved... > TikaIOTest#testReadPdfFile is flaky. > > > Key: BEAM-3004 > URL: https://issues.apache.org/jira/browse/BEAM-3004 > Project: Beam > Issue Type: Bug > Components: sdk-java-extensions >Reporter: Jason Kuster >Assignee: Sergey Beryozkin > > testReadPdfFile has been sporadically failing on Jenkins. > https://builds.apache.org/view/A-D/view/Beam/job/beam_PreCommit_Java_MavenInstall/14691/org.apache.beam$beam-sdks-java-io-tika/testReport/org.apache.beam.sdk.io.tika/TikaIOTest/testReadPdfFile/history/ -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (BEAM-3004) TikaIOTest#testReadPdfFile is flaky.
[ https://issues.apache.org/jira/browse/BEAM-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16187099#comment-16187099 ] Sergey Beryozkin commented on BEAM-3004: Thanks, looks like the asynchronous TikaReader implementation is weak somewhere when it comes to processing the PDF file(s), but that code has already gone from the pending https://github.com/apache/beam/pull/3835, so I'm hoping this issue will be resolved after PR 3835 gets approved... > TikaIOTest#testReadPdfFile is flaky. > > > Key: BEAM-3004 > URL: https://issues.apache.org/jira/browse/BEAM-3004 > Project: Beam > Issue Type: Bug > Components: sdk-java-extensions >Reporter: Jason Kuster >Assignee: Sergey Beryozkin > > testReadPdfFile has been sporadically failing on Jenkins. > https://builds.apache.org/view/A-D/view/Beam/job/beam_PreCommit_Java_MavenInstall/14691/org.apache.beam$beam-sdks-java-io-tika/testReport/org.apache.beam.sdk.io.tika/TikaIOTest/testReadPdfFile/history/ -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (BEAM-2994) Refactor TikaIO
Sergey Beryozkin created BEAM-2994: -- Summary: Refactor TikaIO Key: BEAM-2994 URL: https://issues.apache.org/jira/browse/BEAM-2994 Project: Beam Issue Type: Task Components: sdk-java-extensions Affects Versions: 2.2.0 Reporter: Sergey Beryozkin Assignee: Reuven Lax Fix For: 2.2.0 TikaIO is currently implemented as a BoundedSource and asynchronous BoundedReader returning individual document's text chunks as Strings, eventually passed unordered (and not linked to the original documents) to the pipeline functions. It was decided in the recent beam-dev thread that initially TikaIO should support the cases where only a single composite bean per file, capturing the file content, location (or name) and metadata, should flow to the pipeline, and thus avoiding the need to implement TikaIO as a BoundedSource/Reader. Enhancing TikaIO to support the streaming of the content into the pipelines may be considered in the next phase, based on the specific use-cases... -- This message was sent by Atlassian JIRA (v6.4.14#64029)
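The "single composite bean per file" idea in the BEAM-2994 description can be pictured with a minimal Java sketch. This is only an illustration under stated assumptions: the class name `ParsedDocument` and its accessors are hypothetical and are not taken from the TikaIO code, and the metadata is modelled as a plain string map rather than a Tika metadata object.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Hypothetical per-file result (not the actual TikaIO type): one immutable
 * value per parsed document, carrying the three pieces named in the issue:
 * the file location (or name), the extracted content and the metadata.
 */
final class ParsedDocument {
    private final String location;
    private final String content;
    private final Map<String, String> metadata;

    ParsedDocument(String location, String content, Map<String, String> metadata) {
        this.location = location;
        this.content = content;
        // Defensive copy, so later changes to the caller's map cannot leak in.
        this.metadata = Collections.unmodifiableMap(new LinkedHashMap<>(metadata));
    }

    String getLocation() {
        return location;
    }

    String getContent() {
        return content;
    }

    Map<String, String> getMetadata() {
        return metadata;
    }

    @Override
    public String toString() {
        return "ParsedDocument{location=" + location
            + ", contentLength=" + content.length()
            + ", metadata=" + metadata + "}";
    }
}
```

Because each element already carries its own location and metadata, the content stays linked to its source document even when the pipeline processes elements in no particular order, which is the problem the issue describes with unordered text chunks.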
[jira] [Resolved] (BEAM-2874) TikaIO JavaDocs have minor typos
[ https://issues.apache.org/jira/browse/BEAM-2874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sergey Beryozkin resolved BEAM-2874. Resolution: Invalid Have just read that the doc typos do not require opening JIRA issues :-) > TikaIO JavaDocs have minor typos > > > Key: BEAM-2874 > URL: https://issues.apache.org/jira/browse/BEAM-2874 > Project: Beam > Issue Type: Bug > Components: sdk-java-extensions >Affects Versions: 2.2.0 >Reporter: Sergey Beryozkin >Assignee: Sergey Beryozkin >Priority: Trivial > Fix For: 2.2.0 > > > Some of TikaIO sources have the minor doc typos -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (BEAM-2874) TikaIO JavaDocs have minor typos
[ https://issues.apache.org/jira/browse/BEAM-2874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sergey Beryozkin updated BEAM-2874: --- Description: Some of TikaIO sources have the minor doc typos > TikaIO JavaDocs have minor typos > > > Key: BEAM-2874 > URL: https://issues.apache.org/jira/browse/BEAM-2874 > Project: Beam > Issue Type: Bug > Components: sdk-java-extensions >Affects Versions: 2.2.0 >Reporter: Sergey Beryozkin >Assignee: Sergey Beryozkin >Priority: Trivial > Fix For: 2.2.0 > > > Some of TikaIO sources have the minor doc typos -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (BEAM-2874) TikaIO JavaDocs have minor typos
Sergey Beryozkin created BEAM-2874: -- Summary: TikaIO JavaDocs have minor typos Key: BEAM-2874 URL: https://issues.apache.org/jira/browse/BEAM-2874 Project: Beam Issue Type: Bug Components: sdk-java-extensions Affects Versions: 2.2.0 Reporter: Sergey Beryozkin Assignee: Sergey Beryozkin Priority: Trivial Fix For: 2.2.0 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (BEAM-2328) Introduce Apache Tika Input component
[ https://issues.apache.org/jira/browse/BEAM-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16085343#comment-16085343 ] Sergey Beryozkin commented on BEAM-2328: [~talli...@mitre.org] Hi Tim - the PR has been updated to pull in Tika 1.16, thanks. > Introduce Apache Tika Input component > - > > Key: BEAM-2328 > URL: https://issues.apache.org/jira/browse/BEAM-2328 > Project: Beam > Issue Type: New Feature > Components: sdk-ideas, sdk-java-extensions >Reporter: Sergey Beryozkin >Assignee: Sergey Beryozkin > Fix For: 2.2.0 > > > Apache Tika is a popular project that offers an extensive support for parsing > the variety of file formats. It is used in many projects including Lucene and > Elastic Search. > Supporting a Tika Input (Read) at the Beam level would be of major interest > to many users. > PR is to follow -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (BEAM-2328) Introduce Apache Tika Input component
[ https://issues.apache.org/jira/browse/BEAM-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sergey Beryozkin updated BEAM-2328: --- Fix Version/s: 2.2.0 > Introduce Apache Tika Input component > - > > Key: BEAM-2328 > URL: https://issues.apache.org/jira/browse/BEAM-2328 > Project: Beam > Issue Type: New Feature > Components: sdk-ideas, sdk-java-extensions >Reporter: Sergey Beryozkin >Assignee: Sergey Beryozkin > Fix For: 2.2.0 > > > Apache Tika is a popular project that offers an extensive support for parsing > the variety of file formats. It is used in many projects including Lucene and > Elastic Search. > Supporting a Tika Input (Read) at the Beam level would be of major interest > to many users. > PR is to follow -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (BEAM-2328) Introduce Apache Tika Input component
[ https://issues.apache.org/jira/browse/BEAM-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16051834#comment-16051834 ] Sergey Beryozkin commented on BEAM-2328: Hi All, the initial cleanup of the 'tikaio' branch is now complete (with thanks to JB) and the commits are squashed; I'm now proceeding to create the first PR. I'd like to ask JB to review it; feedback from the rest of the team is also welcome. [~talli...@mitre.org] Hi Tim, I hope that if the team accepts this PR then we can get TikaReader improved further :-). (I'm not sure whether more work will be needed to better report the embedded attachments inside a given PDF/etc, or whether further ParserContext customizations may be needed; the input metadata and TikaConfig are covered though.) Concatenating multiple SAX content bits into minimum-length fragments can optionally be supported later on if needed. Thanks > Introduce Apache Tika Input component > - > > Key: BEAM-2328 > URL: https://issues.apache.org/jira/browse/BEAM-2328 > Project: Beam > Issue Type: New Feature > Components: sdk-ideas, sdk-java-extensions >Reporter: Sergey Beryozkin >Assignee: Sergey Beryozkin > Fix For: 2.1.0 > > > Apache Tika is a popular project that offers an extensive support for parsing > the variety of file formats. It is used in many projects including Lucene and > Elastic Search. > Supporting a Tika Input (Read) at the Beam level would be of major interest > to many users. > PR is to follow -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Comment Edited] (BEAM-2328) Introduce Apache Tika Input component
[ https://issues.apache.org/jira/browse/BEAM-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16049057#comment-16049057 ] Sergey Beryozkin edited comment on BEAM-2328 at 6/14/17 11:09 AM: -- Hi JB, All, I'm now ready to create the initial PR. As I said earlier I realize it won't be perfect from the start and I have some tasks to do next once the PR gets accepted (making commons-compress 1.14 managed, a couple of possible refactorings which would affect the outer Beam source and help minimize the duplication of FileBased related utility code inside the Tika component) but for now I'm just trying to keep this initial contribution as simple as possible and also self-contained. The only immediate question I have is how this artifact should really be named: at the moment it is "beam-sdks-java-io-tika" but I wonder whether it should really be "beam-sdks-java-input-tika" given that the output cannot be supported? Thanks was (Author: sergey_beryozkin): Hi JB, All, I'm now ready to create the initial PR. As I said earlier I realize it won't be perfect from the start and I have some tasks to do next once the PR gets accepted (making commons-compress 1.14 managed, a couple of possible refactorings which would affect the outer Beam source and help to minimize the duplication of FileBased related utility code inside the Tika component) but for now I'm just trying to keep this initial contribution as simple as possible and also self-contained. The only immediate question I have is how this artifact should really be named: at the moment it is "beam-sdks-java-io-tika" but I wonder whether it should really be "beam-sdks-java-input-tika" given that the output cannot be supported? Thanks > Introduce Apache Tika Input component > - > > Key: BEAM-2328 > URL: https://issues.apache.org/jira/browse/BEAM-2328 > Project: Beam > Issue Type: New Feature > Components: sdk-ideas, sdk-java-extensions >Reporter: Sergey Beryozkin >Assignee: Sergey Beryozkin > Fix For: 2.1.0 > > > Apache Tika is a popular project that offers an extensive support for parsing > the variety of file formats. It is used in many projects including Lucene and > Elastic Search. > Supporting a Tika Input (Read) at the Beam level would be of major interest > to many users. > PR is to follow -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (BEAM-2328) Introduce Apache Tika Input component
[ https://issues.apache.org/jira/browse/BEAM-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16049057#comment-16049057 ] Sergey Beryozkin commented on BEAM-2328: Hi JB, All, I'm now ready to create the initial PR. As I said earlier I realize it won't be perfect from the start and I have some tasks to do next once the PR gets accepted (making commons-compress 1.14 managed, a couple of possible refactorings which would affect the outer Beam source and help to minimize the duplication of FileBased related utility code inside the Tika component) but for now I'm just trying to keep this initial contribution as simple as possible and also self-contained. The only immediate question I have is how this artifact should really be named: at the moment it is "beam-sdks-java-io-tika" but I wonder whether it should really be "beam-sdks-java-input-tika" given that the output cannot be supported? Thanks > Introduce Apache Tika Input component > - > > Key: BEAM-2328 > URL: https://issues.apache.org/jira/browse/BEAM-2328 > Project: Beam > Issue Type: New Feature > Components: sdk-ideas, sdk-java-extensions >Reporter: Sergey Beryozkin >Assignee: Sergey Beryozkin > Fix For: 2.1.0 > > > Apache Tika is a popular project that offers an extensive support for parsing > the variety of file formats. It is used in many projects including Lucene and > Elastic Search. > Supporting a Tika Input (Read) at the Beam level would be of major interest > to many users. > PR is to follow -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (BEAM-2328) Introduce Apache Tika Input component
[ https://issues.apache.org/jira/browse/BEAM-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16034412#comment-16034412 ] Sergey Beryozkin commented on BEAM-2328: Hi JB, Tim, re the org.json dependencies: FYI, at the moment the only strong Tika dependency is tika-core. tika-parsers is a test dependency and is not needed to compile; the current expectation is that the users of the future Tika Input component will add a tika-parsers dependency themselves, and as such the Tika parsers (including those that may depend on org.json) will not make it into the Beam distro. I reckon that can make it easier to align with the Tika 2.0-SNAPSHOT effort, where a number of mainstream parsers (PDF, etc) are represented by individual modules. I guess an option to ship all of the tika-bundle with tika-io can also be considered, but for a start having only a tika-core dependency seems workable to me... In this (current) case, if tika-core itself is org.json free then it should not be an issue. > Introduce Apache Tika Input component > - > > Key: BEAM-2328 > URL: https://issues.apache.org/jira/browse/BEAM-2328 > Project: Beam > Issue Type: New Feature > Components: sdk-ideas, sdk-java-extensions >Reporter: Sergey Beryozkin >Assignee: Sergey Beryozkin > Fix For: 2.1.0 > > > Apache Tika is a popular project that offers an extensive support for parsing > the variety of file formats. It is used in many projects including Lucene and > Elastic Search. > Supporting a Tika Input (Read) at the Beam level would be of major interest > to many users. > PR is to follow -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Comment Edited] (BEAM-2328) Introduce Apache Tika Input component
[ https://issues.apache.org/jira/browse/BEAM-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16032904#comment-16032904 ] Sergey Beryozkin edited comment on BEAM-2328 at 6/1/17 1:05 PM: Sorry, Tika already reports the characters, I got confused for a moment that the default output coder was not used there but of course that output coder is for converting String to the output... As far as Tika is concerned it is already possible to pass the custom Metadata to TikaInput.Read, I'll just update that to also accept TikaConfig was (Author: sergey_beryozkin): Sorry, Tika already reports the characters... > Introduce Apache Tika Input component > - > > Key: BEAM-2328 > URL: https://issues.apache.org/jira/browse/BEAM-2328 > Project: Beam > Issue Type: New Feature > Components: sdk-ideas, sdk-java-extensions >Reporter: Sergey Beryozkin >Assignee: Sergey Beryozkin > Fix For: 2.1.0 > > > Apache Tika is a popular project that offers an extensive support for parsing > the variety of file formats. It is used in many projects including Lucene and > Elastic Search. > Supporting a Tika Input (Read) at the Beam level would be of major interest > to many users. > PR is to follow -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (BEAM-2328) Introduce Apache Tika Input component
[ https://issues.apache.org/jira/browse/BEAM-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16032904#comment-16032904 ] Sergey Beryozkin commented on BEAM-2328: Sorry, Tika already reports the characters... > Introduce Apache Tika Input component > - > > Key: BEAM-2328 > URL: https://issues.apache.org/jira/browse/BEAM-2328 > Project: Beam > Issue Type: New Feature > Components: sdk-ideas, sdk-java-extensions >Reporter: Sergey Beryozkin >Assignee: Sergey Beryozkin > Fix For: 2.1.0 > > > Apache Tika is a popular project that offers an extensive support for parsing > the variety of file formats. It is used in many projects including Lucene and > Elastic Search. > Supporting a Tika Input (Read) at the Beam level would be of major interest > to many users. > PR is to follow -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (BEAM-2328) Introduce Apache Tika Input component
[ https://issues.apache.org/jira/browse/BEAM-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16032881#comment-16032881 ] Sergey Beryozkin commented on BEAM-2328: Hi JB, Tim Yes, TikaReader returns Strings, but as JB just pointed out the default coder is not used, so I'll fix it, thanks JB :-). Tim, the reason I mentioned that I do not expect 'anything but Strings' is because in many cases, as far as I can see, Beam readers can be typed for different types and custom Beam coders can support such conversions, but I agree in case of Tika it is really only about String as it is impossible to predict at the generic Tika API level what a given format parser can produce, etc... Tim - I also updated the reader to use TikaInputStream, thanks > Introduce Apache Tika Input component > - > > Key: BEAM-2328 > URL: https://issues.apache.org/jira/browse/BEAM-2328 > Project: Beam > Issue Type: New Feature > Components: sdk-ideas, sdk-java-extensions >Reporter: Sergey Beryozkin >Assignee: Sergey Beryozkin > Fix For: 2.1.0 > > > Apache Tika is a popular project that offers an extensive support for parsing > the variety of file formats. It is used in many projects including Lucene and > Elastic Search. > Supporting a Tika Input (Read) at the Beam level would be of major interest > to many users. > PR is to follow -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (BEAM-2328) Introduce Apache Tika Input component
[ https://issues.apache.org/jira/browse/BEAM-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16032825#comment-16032825 ] Sergey Beryozkin commented on BEAM-2328: I've added some TikaReader and TikaSource tests. The Tika version was updated to 1.15 (released by [~talli...@mitre.org]) and commons-compress to 1.14 (see TIKA-2099 for example). In general I'd like to keep the initial contribution very much isolated, and then later on follow up with some optimizations which would affect some other Beam modules. Specifically, the two most immediate follow-up PRs would be about updating the managed Beam commons-compress dependency to 1.14 and removing the version from tika/pom.xml, and attempting to refactor the FileBasedSource composite reader a bit such that its code can be reused by TikaSource. The last thing I'd like to investigate for a start is what may need to be done around non-UTF-8 charsets. I don't expect TikaReader to produce anything but Strings though. I'm away next week, and will start preparing the initial PR shortly afterwards > Introduce Apache Tika Input component > - > > Key: BEAM-2328 > URL: https://issues.apache.org/jira/browse/BEAM-2328 > Project: Beam > Issue Type: New Feature > Components: sdk-ideas, sdk-java-extensions >Reporter: Sergey Beryozkin >Assignee: Sergey Beryozkin > Fix For: 2.1.0 > > > Apache Tika is a popular project that offers an extensive support for parsing > the variety of file formats. It is used in many projects including Lucene and > Elastic Search. > Supporting a Tika Input (Read) at the Beam level would be of major interest > to many users. > PR is to follow -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Resolved] (BEAM-2361) Add TikaIO to the list of in-progress transforms
[ https://issues.apache.org/jira/browse/BEAM-2361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sergey Beryozkin resolved BEAM-2361. Resolution: Fixed Thanks for applying the patch > Add TikaIO to the list of in-progress transforms > > > Key: BEAM-2361 > URL: https://issues.apache.org/jira/browse/BEAM-2361 > Project: Beam > Issue Type: Task > Components: website >Reporter: Sergey Beryozkin >Assignee: Sergey Beryozkin >Priority: Minor > Fix For: 2.1.0 > > -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (BEAM-2328) Introduce Apache Tika Input component
[ https://issues.apache.org/jira/browse/BEAM-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16024981#comment-16024981 ] Sergey Beryozkin commented on BEAM-2328: The initial code is here: https://github.com/sberyozkin/beam/tree/tikaio/sdks/java/io/tika It is a work in progress and it will take me some time to get to the PR stage; I just wanted to share a link to what is already available. Perhaps the most interesting code at this stage is the initial test code: https://github.com/sberyozkin/beam/blob/tikaio/sdks/java/io/tika/src/test/java/org/apache/beam/sdk/io/tika/TikaInputTest.java and TikaReader: https://github.com/sberyozkin/beam/blob/tikaio/sdks/java/io/tika/src/main/java/org/apache/beam/sdk/io/tika/TikaReader.java For the moment the focus is on getting the overall component structure into a good enough initial state (adding a few more tests and docs); TikaReader (etc) optimizations/enhancements can definitely follow after the initial PR. I'd appreciate it if my colleague [~jbonofre] could help next with the initial branch clean up. Thanks > Introduce Apache Tika Input component > - > > Key: BEAM-2328 > URL: https://issues.apache.org/jira/browse/BEAM-2328 > Project: Beam > Issue Type: New Feature > Components: sdk-ideas, sdk-java-extensions >Reporter: Sergey Beryozkin >Assignee: Sergey Beryozkin > Fix For: 2.1.0 > > > Apache Tika is a popular project that offers an extensive support for parsing > the variety of file formats. It is used in many projects including Lucene and > Elastic Search. > Supporting a Tika Input (Read) at the Beam level would be of major interest > to many users. > PR is to follow -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (BEAM-2328) Introduce Apache Tika Input component
[ https://issues.apache.org/jira/browse/BEAM-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16023826#comment-16023826 ] Sergey Beryozkin commented on BEAM-2328: Sorry for a bit of a noise, I spotted in the docs that the site updates should be assigned to a different category, hence I opened BEAM-2361 and made this one related to it, hopefully I've made it nearly right this time :-) cheers > Introduce Apache Tika Input component > - > > Key: BEAM-2328 > URL: https://issues.apache.org/jira/browse/BEAM-2328 > Project: Beam > Issue Type: New Feature > Components: sdk-ideas, sdk-java-extensions >Reporter: Sergey Beryozkin >Assignee: Sergey Beryozkin > Fix For: 2.1.0 > > > Apache Tika is a popular project that offers an extensive support for parsing > the variety of file formats. It is used in many projects including Lucene and > Elastic Search. > Supporting a Tika Input (Read) at the Beam level would be of major interest > to many users. > PR is to follow -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Issue Comment Deleted] (BEAM-2328) Introduce Apache Tika Input component
[ https://issues.apache.org/jira/browse/BEAM-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sergey Beryozkin updated BEAM-2328: --- Comment: was deleted (was: Hi, pull request #250 has been created. thanks) > Introduce Apache Tika Input component > - > > Key: BEAM-2328 > URL: https://issues.apache.org/jira/browse/BEAM-2328 > Project: Beam > Issue Type: New Feature > Components: sdk-ideas, sdk-java-extensions >Reporter: Sergey Beryozkin >Assignee: Sergey Beryozkin > Fix For: 2.1.0 > > > Apache Tika is a popular project that offers an extensive support for parsing > the variety of file formats. It is used in many projects including Lucene and > Elastic Search. > Supporting a Tika Input (Read) at the Beam level would be of major interest > to many users. > PR is to follow -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (BEAM-2361) Add TikaIO to the list of in-progress transforms
Sergey Beryozkin created BEAM-2361: -- Summary: Add TikaIO to the list of in-progress transforms Key: BEAM-2361 URL: https://issues.apache.org/jira/browse/BEAM-2361 Project: Beam Issue Type: Task Components: website Reporter: Sergey Beryozkin Assignee: Sergey Beryozkin Priority: Minor Fix For: 2.1.0 -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Comment Edited] (BEAM-2328) Introduce Apache Tika Input component
[ https://issues.apache.org/jira/browse/BEAM-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16022676#comment-16022676 ] Sergey Beryozkin edited comment on BEAM-2328 at 5/24/17 10:37 AM: -- Apache Tika Parsers report the content via the SAX events, https://tika.apache.org/1.14/. I'm implementing a TikaReader such that it adapts the sequence of SAX events to the streaming BoundedReader API by using the internal ExecutorService and the ConcurrentLinkedQueue. Thus when the Beam thread comes in and calls start() and then advance(), it won't have to immediately parse the given file content. A good number of Tika parsers can report the data in chunks thus the proposed TikaReader implementation should be quite optimal. Unfortunately I cannot extend FileBasedSource/Reader helpers given that Tika Parsers will need to get the full control of the InputStream. However, should the PR be accepted, then I would definitely see some scope for reusing some of currently private FileBasedSource/Reader helpers such as for example the composite reader which is used when multiple files are picked up. Right now I have a reasonably good starting code IMHO with the TikaInputTest testing reading PDF, Zipped PDF, ODT and two ODT files, with the content and optionally the parsed out metadata also being streamed. Some of the code I copied from FileBasedSource might be suboptimal when applied to the Tika case. I hope that if PR gets eventually accepted then, with the help of Tika experts, there would be no doubt more improvements coming in. Planning to work on creating a branch and PR soon, cheers was (Author: sergey_beryozkin): Apache Tika Parsers report the content via the SAX events, https://tika.apache.org/1.14/. I'm implementing a TikaReader such that it adapts the sequence of SAX events to the streaming BoundedReader API by using the internal ExecutorService and the ConcurrentLinkedQueue. Thus when the Beam thread comes in and calls start() and then advance(), it won't have to immediately parse the given file content. A good number of Tika parsers can report the data in chunks thus the proposed TikaReader implementation should be quite optimal. Unfortunately I cannot extend FileBasedSource/Reader helpers given that Tika Parsers will need to get the full control of the InputStream. However, should the PR be accepted, then I would definitely see some scope for reusing some of currently private FileBasedSource/Reader helpers such as for example the composite reader which is used when multiple files are picked up. Right now I have a reasonably good starting code IMHO with the TikaInputTest testing reading PDF, Zipped PDF, ODT and two ODT files, with the content and optionally the parsed out metadata also being streamed. Some of the code I copied from FileBasedSource might be suboptimal when applied to the Tika case. I hope that if PR gets eventually accepted then, with the help of Tika experts, there would be no doubt be more improvements coming in. Planning to work in creating a branch and PR soon, cheers > Introduce Apache Tika Input component > - > > Key: BEAM-2328 > URL: https://issues.apache.org/jira/browse/BEAM-2328 > Project: Beam > Issue Type: New Feature > Components: sdk-ideas >Reporter: Sergey Beryozkin >Assignee: Davor Bonaci > Fix For: 2.1.0 > > > Apache Tika is a popular project that offers an extensive support for parsing > the variety of file formats. It is used in many projects including Lucene and > Elastic Search. > Supporting a Tika Input (Read) at the Beam level would be of major interest > to many users. > PR is to follow -- This message was sent by Atlassian JIRA (v6.3.15#6346)
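The approach described in the comment above (a background parser pushing SAX `characters` events into a `ConcurrentLinkedQueue`, with the caller pulling chunks through `start()` and `advance()`) can be sketched roughly as follows. This is not the TikaReader code: the class `SaxChunkReader` is hypothetical, the JDK's own SAX parser reading a small XML string stands in for a Tika parser, and the busy-wait in `advance()` is a simplification chosen to keep the sketch short.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.NoSuchElementException;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical reader: adapts push-style SAX events to a pull-style start()/advance() API. */
final class SaxChunkReader implements AutoCloseable {
    private final Queue<String> chunks = new ConcurrentLinkedQueue<>();
    private final ExecutorService executor = Executors.newSingleThreadExecutor();
    private final byte[] xml;
    private Future<?> parsing;
    private String current;

    SaxChunkReader(String xml) {
        this.xml = xml.getBytes(StandardCharsets.UTF_8);
    }

    /** Starts parsing on a background thread, then waits for the first chunk. */
    boolean start() {
        parsing = executor.submit(() -> {
            SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(xml),
                new DefaultHandler() {
                    @Override
                    public void characters(char[] ch, int start, int length) {
                        // Each non-blank text event becomes one queued chunk.
                        String text = new String(ch, start, length).trim();
                        if (!text.isEmpty()) {
                            chunks.add(text);
                        }
                    }
                });
            return null; // Returning a value makes this a Callable, so checked exceptions are allowed.
        });
        return advance();
    }

    /** Moves to the next chunk; returns false once the parser is done and the queue is empty. */
    boolean advance() {
        while (true) {
            String next = chunks.poll();
            if (next == null && parsing.isDone()) {
                // The parser may have queued a last chunk just before finishing.
                next = chunks.poll();
                if (next == null) {
                    rethrowIfFailed();
                    current = null;
                    return false;
                }
            }
            if (next != null) {
                current = next;
                return true;
            }
            Thread.yield(); // Simplification: spin until the parser produces more data.
        }
    }

    String getCurrent() {
        if (current == null) {
            throw new NoSuchElementException("no current chunk");
        }
        return current;
    }

    private void rethrowIfFailed() {
        try {
            parsing.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("parsing failed", e.getCause());
        }
    }

    @Override
    public void close() {
        executor.shutdownNow();
    }
}
```

A caller would loop with `for (boolean more = reader.start(); more; more = reader.advance())` and read `getCurrent()` on each pass. The sketch also hints at why ordering and completion detection need care in this design: the consumer has to re-check the queue after seeing that the parser has finished, otherwise a chunk queued at the last moment could be lost.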
[jira] [Comment Edited] (BEAM-2328) Introduce Apache Tika Input component
[ https://issues.apache.org/jira/browse/BEAM-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16022676#comment-16022676 ] Sergey Beryozkin edited comment on BEAM-2328 at 5/24/17 10:36 AM: -- Apache Tika Parsers report the content via the SAX events, https://tika.apache.org/1.14/. I'm implementing a TikaReader such that it adapts the sequence of SAX events to the streaming BoundedReader API by using the internal ExecutorService and the ConcurrentLinkedQueue. Thus when the Beam thread comes in and calls start() and then advance(), it won't have to immediately parse the given file content. A good number of Tika parsers can report the data in chunks thus the proposed TikaReader implementation should be quite optimal. Unfortunately I cannot extend FileBasedSource/Reader helpers given that Tika Parsers will need to get the full control of the InputStream. However, should the PR be accepted, then I would definitely see some scope for reusing some of currently private FileBasedSource/Reader helpers such as for example the composite reader which is used when multiple files are picked up. Right now I have a reasonably good starting code IMHO with the TikaInputTest testing reading PDF, Zipped PDF, ODT and two ODT files, with the content and optionally the parsed out metadata also being streamed. Some of the code I copied from FileBasedSource might be suboptimal when applied to the Tika case. I hope that if PR gets eventually accepted then, with the help of Tika experts, there would be no doubt be more improvements coming in. Planning to work in creating a branch and PR soon, cheers was (Author: sergey_beryozkin): Apache Tika Parsers report the content via the SAX events, https://tika.apache.org/1.14/. I'm implementing a TikaReader such that it adapts the sequence of SAX events to the streaming BoundedReader API by using the internal ExecutorService and the ConcurrentLinkedQueue. Thus when the Beam thread comes in and calls start() and then advance(), it won't have to immediately parse the given file content. A good number of Tika parsers can report the data in chunks thus the proposed TikaReader implementation should be quite optimal. Unfortunately I cannot extend FileBasedSource/Reader helpers given that Tika Parsers will need to get the full control of the InputStream. However, should the PR be accepted, then I would definitely see some scope for reusing some of currently private FileBasedSource/Reader helpers such as for example the composite reader which is used when a multiple files are picked up. Right now I have a reasonably good starting code IMHO with the TikaInputTest testing reading PDF, Zipped PDF, ODT and two ODT files, with the content and optionally the parsed out metadata also being streamed. Some of the code I copied from FileBasedSource might be suboptimal when applied to the Tika case. I hope that if PR gets eventually accepted then, with the help of Tika experts, there would be no doubt be more improvements coming in. Planning to work in creating a branch and PR soon, cheers > Introduce Apache Tika Input component > - > > Key: BEAM-2328 > URL: https://issues.apache.org/jira/browse/BEAM-2328 > Project: Beam > Issue Type: New Feature > Components: sdk-ideas >Reporter: Sergey Beryozkin >Assignee: Davor Bonaci > Fix For: 2.1.0 > > > Apache Tika is a popular project that offers an extensive support for parsing > the variety of file formats. It is used in many projects including Lucene and > Elastic Search. > Supporting a Tika Input (Read) at the Beam level would be of major interest > to many users. > PR is to follow -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (BEAM-2328) Introduce Apache Tika Input component
Sergey Beryozkin created BEAM-2328: -- Summary: Introduce Apache Tika Input component Key: BEAM-2328 URL: https://issues.apache.org/jira/browse/BEAM-2328 Project: Beam Issue Type: New Feature Components: sdk-ideas Reporter: Sergey Beryozkin Assignee: Davor Bonaci Fix For: 2.1.0 Apache Tika is a popular project that offers extensive support for parsing a variety of file formats. It is used in many projects, including Lucene and Elasticsearch. Supporting a Tika Input (Read) at the Beam level would be of major interest to many users. PR is to follow -- This message was sent by Atlassian JIRA (v6.3.15#6346)