[jira] [Commented] (BEAM-3004) TikaIOTest#testReadPdfFile is flaky.

2017-11-03 Thread Sergey Beryozkin (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16237606#comment-16237606
 ] 

Sergey Beryozkin commented on BEAM-3004:


This can now be resolved: the current TikaIOTest does not have a dedicated 
testReadPdfFile, and the test that parses PDF and ODT files is OK.

> TikaIOTest#testReadPdfFile is flaky.
> 
>
> Key: BEAM-3004
> URL: https://issues.apache.org/jira/browse/BEAM-3004
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-extensions
>Reporter: Jason Kuster
>Assignee: Sergey Beryozkin
>Priority: Major
>
> testReadPdfFile has been sporadically failing on Jenkins.
> https://builds.apache.org/view/A-D/view/Beam/job/beam_PreCommit_Java_MavenInstall/14691/org.apache.beam$beam-sdks-java-io-tika/testReport/org.apache.beam.sdk.io.tika/TikaIOTest/testReadPdfFile/history/



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (BEAM-2994) Refactor TikaIO

2017-10-30 Thread Sergey Beryozkin (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16225703#comment-16225703
 ] 

Sergey Beryozkin commented on BEAM-2994:


Thanks for merging this PR

> Refactor TikaIO
> ---
>
> Key: BEAM-2994
> URL: https://issues.apache.org/jira/browse/BEAM-2994
> Project: Beam
>  Issue Type: Task
>  Components: sdk-java-extensions
>Affects Versions: 2.2.0
>Reporter: Sergey Beryozkin
>Assignee: Sergey Beryozkin
> Fix For: 2.3.0
>
>
> TikaIO is currently implemented as a BoundedSource and asynchronous 
> BoundedReader returning individual document's text chunks as Strings, 
> eventually passed unordered (and not linked to the original documents) to the 
> pipeline functions.
> It was decided in the recent beam-dev thread that initially TikaIO should 
> support the cases where only a single composite bean per file, capturing the 
> file content, location (or name) and metadata, should flow to the pipeline, 
> and thus avoiding the need to implement TikaIO as a BoundedSource/Reader.
> Enhancing  TikaIO to support the streaming of the content into the pipelines 
> may be considered in the next phase, based on the specific use-cases... 





[jira] [Comment Edited] (BEAM-3004) TikaIOTest#testReadPdfFile is flaky.

2017-09-30 Thread Sergey Beryozkin (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16187099#comment-16187099
 ] 

Sergey Beryozkin edited comment on BEAM-3004 at 9/30/17 1:49 PM:
-

Thanks, it looks like the asynchronous TikaReader implementation is weak 
somewhere when it comes to processing PDF files, but that code has already 
been removed in the pending https://github.com/apache/beam/pull/3835, so I'm 
hoping this issue will be resolved once PR 3835 gets approved... 


was (Author: sergey_beryozkin):
Thanks, looks like the asynchronous TikaReader implementation is weak somewhere 
when it comes to processing the PDF file(s), but that code ha already gone from 
the pending https://github.com/apache/beam/pull/3835, so I'm hoping this issue 
will be resolved after PR 3835 gets approved... 

> TikaIOTest#testReadPdfFile is flaky.
> 
>
> Key: BEAM-3004
> URL: https://issues.apache.org/jira/browse/BEAM-3004
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-extensions
>Reporter: Jason Kuster
>Assignee: Sergey Beryozkin
>
> testReadPdfFile has been sporadically failing on Jenkins.
> https://builds.apache.org/view/A-D/view/Beam/job/beam_PreCommit_Java_MavenInstall/14691/org.apache.beam$beam-sdks-java-io-tika/testReport/org.apache.beam.sdk.io.tika/TikaIOTest/testReadPdfFile/history/





[jira] [Commented] (BEAM-3004) TikaIOTest#testReadPdfFile is flaky.

2017-09-30 Thread Sergey Beryozkin (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16187099#comment-16187099
 ] 

Sergey Beryozkin commented on BEAM-3004:


Thanks, looks like the asynchronous TikaReader implementation is weak somewhere 
when it comes to processing the PDF file(s), but that code has already gone from 
the pending https://github.com/apache/beam/pull/3835, so I'm hoping this issue 
will be resolved after PR 3835 gets approved... 

> TikaIOTest#testReadPdfFile is flaky.
> 
>
> Key: BEAM-3004
> URL: https://issues.apache.org/jira/browse/BEAM-3004
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-extensions
>Reporter: Jason Kuster
>Assignee: Sergey Beryozkin
>
> testReadPdfFile has been sporadically failing on Jenkins.
> https://builds.apache.org/view/A-D/view/Beam/job/beam_PreCommit_Java_MavenInstall/14691/org.apache.beam$beam-sdks-java-io-tika/testReport/org.apache.beam.sdk.io.tika/TikaIOTest/testReadPdfFile/history/





[jira] [Created] (BEAM-2994) Refactor TikaIO

2017-09-27 Thread Sergey Beryozkin (JIRA)
Sergey Beryozkin created BEAM-2994:
--

 Summary: Refactor TikaIO
 Key: BEAM-2994
 URL: https://issues.apache.org/jira/browse/BEAM-2994
 Project: Beam
  Issue Type: Task
  Components: sdk-java-extensions
Affects Versions: 2.2.0
Reporter: Sergey Beryozkin
Assignee: Reuven Lax
 Fix For: 2.2.0


TikaIO is currently implemented as a BoundedSource with an asynchronous 
BoundedReader that returns each document's text chunks as Strings; the chunks 
are eventually passed to the pipeline functions unordered and not linked to 
the original documents.

It was decided in the recent beam-dev thread that TikaIO should initially 
support the cases where only a single composite bean per file, capturing the 
file content, location (or name) and metadata, flows to the pipeline, thus 
avoiding the need to implement TikaIO as a BoundedSource/Reader.

Enhancing TikaIO to support streaming the content into the pipelines 
may be considered in the next phase, based on the specific use-cases... 
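
To make the "single composite bean per file" idea concrete, here is a minimal 
sketch of what such a bean could look like. The class name ParsedFile and its 
accessors are hypothetical and are not taken from the TikaIO code; the sketch 
only assumes that the bean carries the file location, the extracted text and 
the metadata as simple key/value pairs.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical per-file result: one immutable object per parsed file. */
final class ParsedFile {
    private final String location;
    private final String content;
    private final Map<String, String> metadata;

    ParsedFile(String location, String content, Map<String, String> metadata) {
        this.location = location;
        this.content = content;
        // Defensive copy, so later changes to the caller's map do not leak in.
        this.metadata = Collections.unmodifiableMap(new LinkedHashMap<>(metadata));
    }

    String getLocation() {
        return location;
    }

    String getContent() {
        return content;
    }

    Map<String, String> getMetadata() {
        return metadata;
    }

    @Override
    public String toString() {
        return "ParsedFile{location=" + location
            + ", contentLength=" + content.length()
            + ", metadataKeys=" + metadata.keySet() + "}";
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put("Content-Type", "application/pdf");
        // The path and the text are made-up sample values.
        ParsedFile file = new ParsedFile("docs/report.pdf", "Hello Beam", metadata);
        System.out.println(file);
    }
}
```

Because each element keeps its own location next to its content, downstream 
functions no longer need to guess which document a piece of text came from.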





[jira] [Resolved] (BEAM-2874) TikaIO JavaDocs have minor typos

2017-09-11 Thread Sergey Beryozkin (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-2874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Beryozkin resolved BEAM-2874.

Resolution: Invalid

I have just read that doc typos do not require opening JIRA issues :-)

> TikaIO JavaDocs have minor typos
> 
>
> Key: BEAM-2874
> URL: https://issues.apache.org/jira/browse/BEAM-2874
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-extensions
>Affects Versions: 2.2.0
>Reporter: Sergey Beryozkin
>Assignee: Sergey Beryozkin
>Priority: Trivial
> Fix For: 2.2.0
>
>
> Some of TikaIO sources have the minor doc typos





[jira] [Updated] (BEAM-2874) TikaIO JavaDocs have minor typos

2017-09-11 Thread Sergey Beryozkin (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-2874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Beryozkin updated BEAM-2874:
---
Description: Some of the TikaIO sources have minor doc typos

> TikaIO JavaDocs have minor typos
> 
>
> Key: BEAM-2874
> URL: https://issues.apache.org/jira/browse/BEAM-2874
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-extensions
>Affects Versions: 2.2.0
>Reporter: Sergey Beryozkin
>Assignee: Sergey Beryozkin
>Priority: Trivial
> Fix For: 2.2.0
>
>
> Some of TikaIO sources have the minor doc typos





[jira] [Created] (BEAM-2874) TikaIO JavaDocs have minor typos

2017-09-11 Thread Sergey Beryozkin (JIRA)
Sergey Beryozkin created BEAM-2874:
--

 Summary: TikaIO JavaDocs have minor typos
 Key: BEAM-2874
 URL: https://issues.apache.org/jira/browse/BEAM-2874
 Project: Beam
  Issue Type: Bug
  Components: sdk-java-extensions
Affects Versions: 2.2.0
Reporter: Sergey Beryozkin
Assignee: Sergey Beryozkin
Priority: Trivial
 Fix For: 2.2.0








[jira] [Commented] (BEAM-2328) Introduce Apache Tika Input component

2017-07-13 Thread Sergey Beryozkin (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16085343#comment-16085343
 ] 

Sergey Beryozkin commented on BEAM-2328:


[~talli...@mitre.org] Hi Tim - the PR has been updated to pull in Tika 1.16, 
thanks. 

> Introduce Apache Tika Input component
> -
>
> Key: BEAM-2328
> URL: https://issues.apache.org/jira/browse/BEAM-2328
> Project: Beam
>  Issue Type: New Feature
>  Components: sdk-ideas, sdk-java-extensions
>Reporter: Sergey Beryozkin
>Assignee: Sergey Beryozkin
> Fix For: 2.2.0
>
>
> Apache Tika is a popular project that offers an extensive support for parsing 
> the variety of file formats. It is used in many projects including Lucene and 
> Elastic Search. 
> Supporting a Tika Input (Read) at the Beam level would be of major interest 
> to many users.
> PR is to follow





[jira] [Updated] (BEAM-2328) Introduce Apache Tika Input component

2017-07-13 Thread Sergey Beryozkin (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Beryozkin updated BEAM-2328:
---
Fix Version/s: 2.2.0

> Introduce Apache Tika Input component
> -
>
> Key: BEAM-2328
> URL: https://issues.apache.org/jira/browse/BEAM-2328
> Project: Beam
>  Issue Type: New Feature
>  Components: sdk-ideas, sdk-java-extensions
>Reporter: Sergey Beryozkin
>Assignee: Sergey Beryozkin
> Fix For: 2.2.0
>
>
> Apache Tika is a popular project that offers an extensive support for parsing 
> the variety of file formats. It is used in many projects including Lucene and 
> Elastic Search. 
> Supporting a Tika Input (Read) at the Beam level would be of major interest 
> to many users.
> PR is to follow





[jira] [Commented] (BEAM-2328) Introduce Apache Tika Input component

2017-06-16 Thread Sergey Beryozkin (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16051834#comment-16051834
 ] 

Sergey Beryozkin commented on BEAM-2328:


Hi All,
The initial cleanup of the 'tikaio' branch is now complete (with thanks to JB) 
and the commits are squashed, so I'm now proceeding to create the first PR. 
I'd like to ask JB to review it; feedback from the rest of the team is also 
welcome.
[~talli...@mitre.org] Hi Tim, I hope that if the team accepts this PR then we 
can get TikaReader improved further :-). I'm not sure whether more work will 
be needed to better report the embedded attachments inside a given PDF (etc.), 
or whether further ParseContext customizations may be needed; the input 
metadata and TikaConfig are covered though. Concatenating multiple SAX content 
bits into fragments of a minimum length can optionally be supported later on 
if needed.

thanks 
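
As an illustration of concatenating small SAX content bits into fragments of 
a minimum length, here is a rough sketch. FragmentBuffer and its methods are 
hypothetical names and are not part of the 'tikaio' branch; the sketch 
assumes that chunks arrive one at a time (as they would from a SAX 
characters() callback) and that a fragment is emitted as soon as the buffered 
text reaches the minimum length.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that merges small text chunks into larger fragments. */
final class FragmentBuffer {
    private final int minLength;
    private final StringBuilder pending = new StringBuilder();
    private final List<String> fragments = new ArrayList<>();

    FragmentBuffer(int minLength) {
        if (minLength < 1) {
            throw new IllegalArgumentException("minLength must be positive");
        }
        this.minLength = minLength;
    }

    /** Called for each small chunk; emits a fragment once enough text is buffered. */
    void append(String chunk) {
        pending.append(chunk);
        if (pending.length() >= minLength) {
            flush();
        }
    }

    /** Emits whatever is left; call once at the end of the document. */
    void finish() {
        if (pending.length() > 0) {
            flush();
        }
    }

    List<String> fragments() {
        return fragments;
    }

    private void flush() {
        fragments.add(pending.toString());
        pending.setLength(0);
    }

    public static void main(String[] args) {
        FragmentBuffer buffer = new FragmentBuffer(8);
        for (String chunk : new String[] {"Hello", ", ", "Tika", " and", " Beam"}) {
            buffer.append(chunk);
        }
        buffer.finish();
        System.out.println(buffer.fragments());
    }
}
```

The last fragment may be shorter than the minimum, since finish() emits 
whatever remains at the end of the document.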

> Introduce Apache Tika Input component
> -
>
> Key: BEAM-2328
> URL: https://issues.apache.org/jira/browse/BEAM-2328
> Project: Beam
>  Issue Type: New Feature
>  Components: sdk-ideas, sdk-java-extensions
>Reporter: Sergey Beryozkin
>Assignee: Sergey Beryozkin
> Fix For: 2.1.0
>
>
> Apache Tika is a popular project that offers an extensive support for parsing 
> the variety of file formats. It is used in many projects including Lucene and 
> Elastic Search. 
> Supporting a Tika Input (Read) at the Beam level would be of major interest 
> to many users.
> PR is to follow





[jira] [Comment Edited] (BEAM-2328) Introduce Apache Tika Input component

2017-06-14 Thread Sergey Beryozkin (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16049057#comment-16049057
 ] 

Sergey Beryozkin edited comment on BEAM-2328 at 6/14/17 11:09 AM:
--

Hi JB, All,
I'm now ready to create the initial PR. As I said earlier, I realize it won't 
be perfect from the start, and I have some tasks to do once the PR gets 
accepted (making commons-compress 1.14 a managed dependency, plus a couple of 
possible refactorings which would affect the outer Beam source and help 
minimize the duplication of FileBased-related utility code inside the Tika 
component), but for now I'm just trying to keep this initial contribution as 
simple as possible and self-contained.
The only immediate question I have is how this artifact should be named: at 
the moment it is "beam-sdks-java-io-tika", but I wonder whether it should 
really be "beam-sdks-java-input-tika", given that output cannot be supported?

Thanks 


was (Author: sergey_beryozkin):
Hi JB, All,
I'm now ready to create the initial PR. As I said earlier I realize it won't be 
perfect from a start and I have some tasks to do next once PR gets accepted 
(making common-compress 1.14 managed, a couple of possible refactorings which 
would affect the outer Beam source and help to minimize the duplication of 
FileBased related utility code inside the Tika component) but for now I'm just 
trying to keep this initial contribution as simple as possible and also self 
contained.
The only immediate question I have is how should this artifact be really named, 
at the moment it is "beam-sdks-java-io-tika" but I wonder should it really be 
"beam-sdks-java-input-tika" given that the output can not be supported ?

Thanks 

> Introduce Apache Tika Input component
> -
>
> Key: BEAM-2328
> URL: https://issues.apache.org/jira/browse/BEAM-2328
> Project: Beam
>  Issue Type: New Feature
>  Components: sdk-ideas, sdk-java-extensions
>Reporter: Sergey Beryozkin
>Assignee: Sergey Beryozkin
> Fix For: 2.1.0
>
>
> Apache Tika is a popular project that offers an extensive support for parsing 
> the variety of file formats. It is used in many projects including Lucene and 
> Elastic Search. 
> Supporting a Tika Input (Read) at the Beam level would be of major interest 
> to many users.
> PR is to follow





[jira] [Commented] (BEAM-2328) Introduce Apache Tika Input component

2017-06-14 Thread Sergey Beryozkin (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16049057#comment-16049057
 ] 

Sergey Beryozkin commented on BEAM-2328:


Hi JB, All,
I'm now ready to create the initial PR. As I said earlier, I realize it won't 
be perfect from the start, and I have some tasks to do once the PR gets 
accepted (making commons-compress 1.14 a managed dependency, plus a couple of 
possible refactorings which would affect the outer Beam source and help to 
minimize the duplication of FileBased-related utility code inside the Tika 
component), but for now I'm just trying to keep this initial contribution as 
simple as possible and self-contained.
The only immediate question I have is how this artifact should be named: at 
the moment it is "beam-sdks-java-io-tika", but I wonder whether it should 
really be "beam-sdks-java-input-tika", given that output cannot be supported?

Thanks 

> Introduce Apache Tika Input component
> -
>
> Key: BEAM-2328
> URL: https://issues.apache.org/jira/browse/BEAM-2328
> Project: Beam
>  Issue Type: New Feature
>  Components: sdk-ideas, sdk-java-extensions
>Reporter: Sergey Beryozkin
>Assignee: Sergey Beryozkin
> Fix For: 2.1.0
>
>
> Apache Tika is a popular project that offers an extensive support for parsing 
> the variety of file formats. It is used in many projects including Lucene and 
> Elastic Search. 
> Supporting a Tika Input (Read) at the Beam level would be of major interest 
> to many users.
> PR is to follow





[jira] [Commented] (BEAM-2328) Introduce Apache Tika Input component

2017-06-02 Thread Sergey Beryozkin (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16034412#comment-16034412
 ] 

Sergey Beryozkin commented on BEAM-2328:


Hi JB, Tim

Re the org.json dependencies: FYI, at the moment the only strong Tika 
dependency is tika-core. tika-parsers is a test dependency and is not needed 
to compile. The current expectation is that users of the future Tika Input 
component will add a tika-parsers dependency themselves, and as such the Tika 
parsers (including those that may depend on org.json) will not make it into 
the Beam distro. I reckon that can also make it easier to align with the Tika 
2.0-SNAPSHOT effort, where a number of mainstream parsers (PDF, etc.) are 
represented by individual modules. 

I guess an option to ship all of the tika-bundle with tika-io can also be 
considered, but for a start having only a tika-core dependency seems workable 
to me. In this (current) case, if tika-core itself is org.json free then it 
should not be an issue.



  



> Introduce Apache Tika Input component
> -
>
> Key: BEAM-2328
> URL: https://issues.apache.org/jira/browse/BEAM-2328
> Project: Beam
>  Issue Type: New Feature
>  Components: sdk-ideas, sdk-java-extensions
>Reporter: Sergey Beryozkin
>Assignee: Sergey Beryozkin
> Fix For: 2.1.0
>
>
> Apache Tika is a popular project that offers an extensive support for parsing 
> the variety of file formats. It is used in many projects including Lucene and 
> Elastic Search. 
> Supporting a Tika Input (Read) at the Beam level would be of major interest 
> to many users.
> PR is to follow



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Comment Edited] (BEAM-2328) Introduce Apache Tika Input component

2017-06-01 Thread Sergey Beryozkin (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16032904#comment-16032904
 ] 

Sergey Beryozkin edited comment on BEAM-2328 at 6/1/17 1:05 PM:


Sorry, Tika already reports the characters. I got confused for a moment that 
the default output coder was not used there, but of course that output coder 
is for converting the String to the output...
As far as Tika is concerned, it is already possible to pass custom Metadata 
to TikaInput.Read; I'll just update that to also accept TikaConfig. 


was (Author: sergey_beryozkin):
Sorry, Tika already reports the characters...

> Introduce Apache Tika Input component
> -
>
> Key: BEAM-2328
> URL: https://issues.apache.org/jira/browse/BEAM-2328
> Project: Beam
>  Issue Type: New Feature
>  Components: sdk-ideas, sdk-java-extensions
>Reporter: Sergey Beryozkin
>Assignee: Sergey Beryozkin
> Fix For: 2.1.0
>
>
> Apache Tika is a popular project that offers an extensive support for parsing 
> the variety of file formats. It is used in many projects including Lucene and 
> Elastic Search. 
> Supporting a Tika Input (Read) at the Beam level would be of major interest 
> to many users.
> PR is to follow





[jira] [Commented] (BEAM-2328) Introduce Apache Tika Input component

2017-06-01 Thread Sergey Beryozkin (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16032904#comment-16032904
 ] 

Sergey Beryozkin commented on BEAM-2328:


Sorry, Tika already reports the characters...

> Introduce Apache Tika Input component
> -
>
> Key: BEAM-2328
> URL: https://issues.apache.org/jira/browse/BEAM-2328
> Project: Beam
>  Issue Type: New Feature
>  Components: sdk-ideas, sdk-java-extensions
>Reporter: Sergey Beryozkin
>Assignee: Sergey Beryozkin
> Fix For: 2.1.0
>
>
> Apache Tika is a popular project that offers an extensive support for parsing 
> the variety of file formats. It is used in many projects including Lucene and 
> Elastic Search. 
> Supporting a Tika Input (Read) at the Beam level would be of major interest 
> to many users.
> PR is to follow





[jira] [Commented] (BEAM-2328) Introduce Apache Tika Input component

2017-06-01 Thread Sergey Beryozkin (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16032881#comment-16032881
 ] 

Sergey Beryozkin commented on BEAM-2328:


Hi JB, Tim

Yes, TikaReader returns Strings, but as JB just pointed out the default coder 
is not used, so I'll fix it, thanks JB :-).
Tim, the reason I mentioned that I do not expect 'anything but Strings' is 
that in many cases, as far as I can see, Beam readers can be typed for 
different types and custom Beam coders can support such conversions. But I 
agree that in the case of Tika it is really only about Strings, as it is 
impossible to predict at the generic Tika API level what a given format 
parser can produce, etc...
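
For readers less familiar with coders, the sketch below shows the general 
idea of turning a String into bytes and back. It is only an illustration: 
Utf8StringCoderSketch is a hypothetical class, the length-prefixed layout is 
an assumption made for this example, and it is not meant to describe how 
Beam's own String coder lays out its bytes.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Hypothetical String coder: a 4-byte length prefix followed by UTF-8 bytes. */
final class Utf8StringCoderSketch {

    /** Encodes the value as [length][UTF-8 bytes]. */
    static byte[] encode(String value) {
        byte[] utf8 = value.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(Integer.BYTES + utf8.length)
            .putInt(utf8.length)
            .put(utf8)
            .array();
    }

    /** Reads the length prefix, then decodes that many bytes as UTF-8. */
    static String decode(byte[] encoded) {
        ByteBuffer buffer = ByteBuffer.wrap(encoded);
        byte[] utf8 = new byte[buffer.getInt()];
        buffer.get(utf8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] encoded = encode("Tika text chunk");
        System.out.println(encoded.length + " bytes, decoded: " + decode(encoded));
    }
}
```

Whatever charset the source document uses, the reader hands over Java 
Strings, so the coder only has to agree with itself on how those Strings are 
written and read back.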

Tim - I also updated the reader to use TikaInputStream, thanks 


> Introduce Apache Tika Input component
> -
>
> Key: BEAM-2328
> URL: https://issues.apache.org/jira/browse/BEAM-2328
> Project: Beam
>  Issue Type: New Feature
>  Components: sdk-ideas, sdk-java-extensions
>Reporter: Sergey Beryozkin
>Assignee: Sergey Beryozkin
> Fix For: 2.1.0
>
>
> Apache Tika is a popular project that offers an extensive support for parsing 
> the variety of file formats. It is used in many projects including Lucene and 
> Elastic Search. 
> Supporting a Tika Input (Read) at the Beam level would be of major interest 
> to many users.
> PR is to follow





[jira] [Commented] (BEAM-2328) Introduce Apache Tika Input component

2017-06-01 Thread Sergey Beryozkin (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16032825#comment-16032825
 ] 

Sergey Beryozkin commented on BEAM-2328:


I've added some TikaReader and TikaSource tests. The Tika version was updated 
to 1.15 (released by [~talli...@mitre.org]) and commons-compress to 1.14 (see 
TIKA-2099 for example).

In general I'd like to keep the initial contribution very much isolated, and 
then follow up later with some optimizations which would affect some other 
Beam modules. Specifically, the two most immediate follow-up PRs would be 
about updating the managed Beam commons-compress dependency to 1.14 and 
removing the version from tika/pom.xml, and attempting to refactor the 
FileBasedSource composite reader a bit so that its code can be reused by 
TikaSource.

The last thing I'd like to investigate for a start is what may need to be 
done around non-UTF-8 charsets. I don't expect TikaReader to produce anything 
other than Strings though.

I'm away next week and will start preparing the initial PR shortly afterwards. 



  


> Introduce Apache Tika Input component
> -
>
> Key: BEAM-2328
> URL: https://issues.apache.org/jira/browse/BEAM-2328
> Project: Beam
>  Issue Type: New Feature
>  Components: sdk-ideas, sdk-java-extensions
>Reporter: Sergey Beryozkin
>Assignee: Sergey Beryozkin
> Fix For: 2.1.0
>
>
> Apache Tika is a popular project that offers an extensive support for parsing 
> the variety of file formats. It is used in many projects including Lucene and 
> Elastic Search. 
> Supporting a Tika Input (Read) at the Beam level would be of major interest 
> to many users.
> PR is to follow





[jira] [Resolved] (BEAM-2361) Add TikaIO to the list of in-progress transforms

2017-05-26 Thread Sergey Beryozkin (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-2361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Beryozkin resolved BEAM-2361.

Resolution: Fixed

Thanks for applying the patch

> Add TikaIO to the list of in-progress transforms
> 
>
> Key: BEAM-2361
> URL: https://issues.apache.org/jira/browse/BEAM-2361
> Project: Beam
>  Issue Type: Task
>  Components: website
>Reporter: Sergey Beryozkin
>Assignee: Sergey Beryozkin
>Priority: Minor
> Fix For: 2.1.0
>
>






[jira] [Commented] (BEAM-2328) Introduce Apache Tika Input component

2017-05-25 Thread Sergey Beryozkin (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16024981#comment-16024981
 ] 

Sergey Beryozkin commented on BEAM-2328:


The initial code is here:
https://github.com/sberyozkin/beam/tree/tikaio/sdks/java/io/tika

It is a work in progress and it will take me some time to get to the PR 
stage; I just wanted to share a link to what is already available. Perhaps 
the most interesting code at this stage is the initial test code:

https://github.com/sberyozkin/beam/blob/tikaio/sdks/java/io/tika/src/test/java/org/apache/beam/sdk/io/tika/TikaInputTest.java

and TikaReader:

https://github.com/sberyozkin/beam/blob/tikaio/sdks/java/io/tika/src/main/java/org/apache/beam/sdk/io/tika/TikaReader.java

For the moment the focus is on getting the overall component structure into a 
good enough initial state (adding a few more tests and docs); TikaReader 
(etc.) optimizations/enhancements can definitely follow after the initial PR.

I'd appreciate it if my colleague [~jbonofre] could help next with the initial 
branch clean-up.
thanks

> Introduce Apache Tika Input component
> -
>
> Key: BEAM-2328
> URL: https://issues.apache.org/jira/browse/BEAM-2328
> Project: Beam
>  Issue Type: New Feature
>  Components: sdk-ideas, sdk-java-extensions
>Reporter: Sergey Beryozkin
>Assignee: Sergey Beryozkin
> Fix For: 2.1.0
>
>
> Apache Tika is a popular project that offers an extensive support for parsing 
> the variety of file formats. It is used in many projects including Lucene and 
> Elastic Search. 
> Supporting a Tika Input (Read) at the Beam level would be of major interest 
> to many users.
> PR is to follow





[jira] [Commented] (BEAM-2328) Introduce Apache Tika Input component

2017-05-24 Thread Sergey Beryozkin (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16023826#comment-16023826
 ] 

Sergey Beryozkin commented on BEAM-2328:


Sorry for a bit of noise. I spotted in the docs that site updates should be 
assigned to a different category, hence I opened BEAM-2361 and linked this 
issue to it. Hopefully I've got it nearly right this time :-) cheers

> Introduce Apache Tika Input component
> -
>
> Key: BEAM-2328
> URL: https://issues.apache.org/jira/browse/BEAM-2328
> Project: Beam
>  Issue Type: New Feature
>  Components: sdk-ideas, sdk-java-extensions
>Reporter: Sergey Beryozkin
>Assignee: Sergey Beryozkin
> Fix For: 2.1.0
>
>
> Apache Tika is a popular project that offers an extensive support for parsing 
> the variety of file formats. It is used in many projects including Lucene and 
> Elastic Search. 
> Supporting a Tika Input (Read) at the Beam level would be of major interest 
> to many users.
> PR is to follow





[jira] [Issue Comment Deleted] (BEAM-2328) Introduce Apache Tika Input component

2017-05-24 Thread Sergey Beryozkin (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Beryozkin updated BEAM-2328:
---
Comment: was deleted

(was: Hi, pull request #250 has been created. thanks)

> Introduce Apache Tika Input component
> -
>
> Key: BEAM-2328
> URL: https://issues.apache.org/jira/browse/BEAM-2328
> Project: Beam
>  Issue Type: New Feature
>  Components: sdk-ideas, sdk-java-extensions
>Reporter: Sergey Beryozkin
>Assignee: Sergey Beryozkin
> Fix For: 2.1.0
>
>
> Apache Tika is a popular project that offers an extensive support for parsing 
> the variety of file formats. It is used in many projects including Lucene and 
> Elastic Search. 
> Supporting a Tika Input (Read) at the Beam level would be of major interest 
> to many users.
> PR is to follow





[jira] [Created] (BEAM-2361) Add TikaIO to the list of in-progress transforms

2017-05-24 Thread Sergey Beryozkin (JIRA)
Sergey Beryozkin created BEAM-2361:
--

 Summary: Add TikaIO to the list of in-progress transforms
 Key: BEAM-2361
 URL: https://issues.apache.org/jira/browse/BEAM-2361
 Project: Beam
  Issue Type: Task
  Components: website
Reporter: Sergey Beryozkin
Assignee: Sergey Beryozkin
Priority: Minor
 Fix For: 2.1.0








[jira] [Comment Edited] (BEAM-2328) Introduce Apache Tika Input component

2017-05-24 Thread Sergey Beryozkin (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16022676#comment-16022676
 ] 

Sergey Beryozkin edited comment on BEAM-2328 at 5/24/17 10:37 AM:
--

Apache Tika Parsers report the content via the SAX events, 
https://tika.apache.org/1.14/.

I'm implementing a TikaReader such that it adapts the sequence of SAX events to 
the streaming BoundedReader API by using an internal ExecutorService and a 
ConcurrentLinkedQueue. Thus when the Beam thread comes in and calls start() and 
then advance(), it won't have to parse the whole file content immediately. A 
good number of Tika parsers can report the data in chunks, so the proposed 
TikaReader implementation should be quite efficient.
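
A minimal sketch of this push-to-pull adaptation is below. It is not the 
TikaReader code: QueueBackedReader is a hypothetical class, the "parse" 
argument is a stand-in for a Tika parser pushing text through SAX callbacks, 
and the start()/advance()/getCurrent() methods only mimic the shape of the 
reader API described above. Errors from the background parse are not 
propagated in this sketch.

```java
import java.util.NoSuchElementException;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Consumer;

/** Hypothetical reader: a background parse pushes chunks, the caller pulls them. */
final class QueueBackedReader implements AutoCloseable {
    // Stand-in for a parser: it is handed a sink and pushes text chunks into it,
    // much like a SAX ContentHandler receiving characters() callbacks.
    private final Consumer<Consumer<String>> parse;
    private final Queue<String> chunks = new ConcurrentLinkedQueue<>();
    private final AtomicBoolean done = new AtomicBoolean(false);
    private final ExecutorService executor = Executors.newSingleThreadExecutor();
    private String current;

    QueueBackedReader(Consumer<Consumer<String>> parse) {
        this.parse = parse;
    }

    /** Starts the background parse and moves to the first chunk, if any. */
    boolean start() {
        executor.execute(() -> {
            try {
                parse.accept(chunks::add);
            } finally {
                done.set(true); // set even if the parse fails, so advance() can return
            }
        });
        return advance();
    }

    /** Moves to the next chunk; returns false once the parse is finished and drained. */
    boolean advance() {
        while (true) {
            String next = chunks.poll();
            if (next == null && done.get()) {
                // A chunk may have been queued just before 'done' was set.
                next = chunks.poll();
                if (next == null) {
                    current = null;
                    return false;
                }
            }
            if (next != null) {
                current = next;
                return true;
            }
            try {
                Thread.sleep(1); // simple polling; a BlockingQueue would avoid this
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted while waiting for content", e);
            }
        }
    }

    String getCurrent() {
        if (current == null) {
            throw new NoSuchElementException();
        }
        return current;
    }

    @Override
    public void close() {
        executor.shutdownNow();
    }

    public static void main(String[] args) {
        try (QueueBackedReader reader = new QueueBackedReader(sink -> {
            sink.accept("Hello");
            sink.accept("Tika");
        })) {
            for (boolean more = reader.start(); more; more = reader.advance()) {
                System.out.println(reader.getCurrent());
            }
        }
    }
}
```

The second poll after seeing the 'done' flag matters: without it, a chunk 
queued just before the parse finished could be missed. This kind of hand-off 
between threads is easy to get subtly wrong, so it deserves careful testing.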

Unfortunately I cannot extend the FileBasedSource/Reader helpers, given that 
Tika parsers need to get full control of the InputStream. However, should the 
PR be accepted, I would definitely see some scope for reusing some of the 
currently private FileBasedSource/Reader helpers, for example the composite 
reader which is used when multiple files are picked up.

Right now I have reasonably good starting code IMHO, with TikaInputTest 
testing the reading of PDF, zipped PDF, ODT and two ODT files, with the 
content and optionally the parsed-out metadata also being streamed. 

Some of the code I copied from FileBasedSource might be suboptimal when applied 
to the Tika case. I hope that if the PR eventually gets accepted then, with 
the help of Tika experts, there will no doubt be more improvements coming in.

Planning to work on creating a branch and PR soon, cheers  




was (Author: sergey_beryozkin):
Apache Tika Parsers report the content via SAX events; see
https://tika.apache.org/1.14/.

I'm implementing a TikaReader that adapts the sequence of SAX events to the
streaming BoundedReader API by using an internal ExecutorService and a
ConcurrentLinkedQueue. Thus, when the Beam thread comes in and calls start()
and then advance(), it won't have to parse the given file content immediately.
A good number of Tika parsers can report the data in chunks, so the proposed
TikaReader implementation should be quite efficient.

Unfortunately, I cannot extend the FileBasedSource/Reader helpers, given that
Tika Parsers will need to get full control of the InputStream. However, should
the PR be accepted, I would definitely see some scope for reusing some of the
currently private FileBasedSource/Reader helpers, such as the composite reader
which is used when multiple files are picked up.

Right now I have reasonably good starting code, IMHO, with the TikaInputTest
testing the reading of PDF, zipped PDF, ODT and two ODT files, with the content
and, optionally, the parsed-out metadata also being streamed.

Some of the code I copied from FileBasedSource might be suboptimal when applied
to the Tika case. I hope that if the PR eventually gets accepted then, with the
help of Tika experts, there will no doubt be more improvements coming in.

Planning to work on creating a branch and PR soon, cheers



> Introduce Apache Tika Input component
> -
>
> Key: BEAM-2328
> URL: https://issues.apache.org/jira/browse/BEAM-2328
> Project: Beam
>  Issue Type: New Feature
>  Components: sdk-ideas
> Reporter: Sergey Beryozkin
> Assignee: Davor Bonaci
> Fix For: 2.1.0
>
>
> Apache Tika is a popular project that offers extensive support for parsing
> a variety of file formats. It is used in many projects, including Lucene and
> Elasticsearch.
> Supporting a Tika Input (Read) at the Beam level would be of major interest
> to many users.
> A PR is to follow.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Comment Edited] (BEAM-2328) Introduce Apache Tika Input component

2017-05-24 Thread Sergey Beryozkin (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16022676#comment-16022676
 ] 

Sergey Beryozkin edited comment on BEAM-2328 at 5/24/17 10:36 AM:
--

Apache Tika Parsers report the content via SAX events; see
https://tika.apache.org/1.14/.

I'm implementing a TikaReader that adapts the sequence of SAX events to the
streaming BoundedReader API by using an internal ExecutorService and a
ConcurrentLinkedQueue. Thus, when the Beam thread comes in and calls start()
and then advance(), it won't have to parse the given file content immediately.
A good number of Tika parsers can report the data in chunks, so the proposed
TikaReader implementation should be quite efficient.
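
To illustrate the idea, here is a minimal sketch of adapting push-style SAX
events to a pull-style start()/advance()/getCurrent() reader. It is not the
actual TikaReader: the class name SaxPullReader is hypothetical, the JDK's own
SAX parser stands in for a Tika Parser, and the reader only mirrors the method
names of Beam's BoundedReader without extending it.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.NoSuchElementException;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical sketch: a pull-style reader fed by SAX events from a background thread. */
public class SaxPullReader implements AutoCloseable {
  private final byte[] input;
  // SAX callbacks (producer thread) add chunks; the caller's thread polls them.
  private final Queue<String> chunks = new ConcurrentLinkedQueue<>();
  private final ExecutorService executor = Executors.newSingleThreadExecutor();
  private Future<?> parseTask;
  private String current;

  public SaxPullReader(byte[] input) {
    this.input = input;
  }

  /** Starts parsing in the background, then positions the reader on the first chunk. */
  public boolean start() throws Exception {
    parseTask = executor.submit(() -> {
      // Assumption: the JDK SAX parser is used here in place of a Tika Parser.
      SAXParserFactory.newInstance().newSAXParser().parse(
          new ByteArrayInputStream(input),
          new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
              String text = new String(ch, start, length).trim();
              if (!text.isEmpty()) {
                chunks.add(text);
              }
            }
          });
      return null;
    });
    return advance();
  }

  /** Moves to the next chunk; returns false once parsing is done and the queue is empty. */
  public boolean advance() throws Exception {
    while (true) {
      // Check completion BEFORE polling: if the parser had already finished,
      // every chunk is in the queue, so an empty poll really means end of input.
      boolean done = parseTask.isDone();
      current = chunks.poll();
      if (current != null) {
        return true;
      }
      if (done) {
        parseTask.get(); // rethrows a parse failure as ExecutionException
        return false;
      }
      Thread.sleep(1); // simple polling; a BlockingQueue would avoid this
    }
  }

  public String getCurrent() {
    if (current == null) {
      throw new NoSuchElementException();
    }
    return current;
  }

  @Override
  public void close() {
    executor.shutdownNow();
  }

  public static void main(String[] args) throws Exception {
    byte[] xml = "<doc><p>first chunk</p><p>second chunk</p></doc>"
        .getBytes(StandardCharsets.UTF_8);
    try (SaxPullReader reader = new SaxPullReader(xml)) {
      for (boolean more = reader.start(); more; more = reader.advance()) {
        System.out.println(reader.getCurrent());
      }
    }
  }
}
```

Note that a SAX parser is free to split one text node across several
characters() calls, so the number of chunks is not guaranteed; only their
order is.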

Unfortunately, I cannot extend the FileBasedSource/Reader helpers, given that
Tika Parsers will need to get full control of the InputStream. However, should
the PR be accepted, I would definitely see some scope for reusing some of the
currently private FileBasedSource/Reader helpers, such as the composite reader
which is used when multiple files are picked up.

Right now I have reasonably good starting code, IMHO, with the TikaInputTest
testing the reading of PDF, zipped PDF, ODT and two ODT files, with the content
and, optionally, the parsed-out metadata also being streamed.

Some of the code I copied from FileBasedSource might be suboptimal when applied
to the Tika case. I hope that if the PR eventually gets accepted then, with the
help of Tika experts, there will no doubt be more improvements coming in.

Planning to work on creating a branch and PR soon, cheers




was (Author: sergey_beryozkin):
Apache Tika Parsers report the content via SAX events; see
https://tika.apache.org/1.14/.

I'm implementing a TikaReader that adapts the sequence of SAX events to the
streaming BoundedReader API by using an internal ExecutorService and a
ConcurrentLinkedQueue. Thus, when the Beam thread comes in and calls start()
and then advance(), it won't have to parse the given file content immediately.
A good number of Tika parsers can report the data in chunks, so the proposed
TikaReader implementation should be quite efficient.

Unfortunately, I cannot extend the FileBasedSource/Reader helpers, given that
Tika Parsers will need to get full control of the InputStream. However, should
the PR be accepted, I would definitely see some scope for reusing some of the
currently private FileBasedSource/Reader helpers, such as the composite reader
which is used when multiple files are picked up.

Right now I have reasonably good starting code, IMHO, with the TikaInputTest
testing the reading of PDF, zipped PDF, ODT and two ODT files, with the content
and, optionally, the parsed-out metadata also being streamed.

Some of the code I copied from FileBasedSource might be suboptimal when applied
to the Tika case. I hope that if the PR eventually gets accepted then, with the
help of Tika experts, there will no doubt be more improvements coming in.

Planning to work on creating a branch and PR soon, cheers



> Introduce Apache Tika Input component
> -
>
> Key: BEAM-2328
> URL: https://issues.apache.org/jira/browse/BEAM-2328
> Project: Beam
>  Issue Type: New Feature
>  Components: sdk-ideas
> Reporter: Sergey Beryozkin
> Assignee: Davor Bonaci
> Fix For: 2.1.0
>
>
> Apache Tika is a popular project that offers extensive support for parsing
> a variety of file formats. It is used in many projects, including Lucene and
> Elasticsearch.
> Supporting a Tika Input (Read) at the Beam level would be of major interest
> to many users.
> A PR is to follow.





[jira] [Created] (BEAM-2328) Introduce Apache Tika Input component

2017-05-19 Thread Sergey Beryozkin (JIRA)
Sergey Beryozkin created BEAM-2328:
--

 Summary: Introduce Apache Tika Input component
 Key: BEAM-2328
 URL: https://issues.apache.org/jira/browse/BEAM-2328
 Project: Beam
  Issue Type: New Feature
  Components: sdk-ideas
Reporter: Sergey Beryozkin
Assignee: Davor Bonaci
 Fix For: 2.1.0


Apache Tika is a popular project that offers extensive support for parsing
a variety of file formats. It is used in many projects, including Lucene and
Elasticsearch.
Supporting a Tika Input (Read) at the Beam level would be of major interest to
many users.

A PR is to follow.


