[
https://issues.apache.org/jira/browse/TIKA-31?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12530265
]
Jukka Zitting commented on TIKA-31:
-----------------------------------
Keith:
> I have some comments and questions.
Cool, thanks for the review!
> In the new parse method:
>
> protected String parse(InputStream stream, Iterable<Content> contents)
>
> This means that contents needs to be populated with the metadata keys before
> being passed to parse(), right? Is that the intention? Wouldn't we want the
> parser to be responsible to know what metadata should go into contents
> and put it there itself?
Exactly. The current solution is based on the Lius code where the metadata keys
are specified in the configuration file instead of the parser class. While this
approach does make the parser classes more configurable, I do agree that it
would probably make more sense to just hardcode the classes to always extract
all available metadata. After all, as you mentioned, the parser class probably
knows best what metadata it can extract and what metadata is available in the
current document.
> Currently, even with the changes, the parser is stateful. I'm aware you said
> that
> this was a step *on the way to* making it stateless. However, the current
> code
> has a serious bug that I think we should fix, even if it's a minimal fix
> (such as
> my second suggestion below).
Agreed. For this issue, like TIKA-26, I wanted to avoid too big changes in one
step to keep the patches and changes somewhat intelligible, so I explicitly
didn't want to remove all the statefullnes in one step.
> Parser.java has an input stream nonstatic member named 'is'. This is set via
> setInputStream(). The parser can then be called on to return various kinds of
> content via getStrContent(), getContents(), and getContent(). It is possible
> to
> parse methods will reuse the parsed content of the previous stream, because
> the boolean 'parsed' will not have been reset to false.
Agreed, I think this is a problem with the current design. I'd like to get rid
of the "is" member variable as well as all the getStrContent(), getContents()
and getContent() methods in favor of a parse() method that takes the input
stream and produces both the full text content and any metadata as results.
Effectively I'd make the proposed parse() method public (and change the
contents variable to an extensible metadata object as discussed above). In
effect I agree with your alternative 1.
> In XMLParser, Intellij Idea is reporting that:
>
> } else if (obj instanceof CDATA) {
>
> will never be evaluated to true (since CDATA extends Text, which comes before
> it
> in the list of conditions).
> [...]
> I can enter a Jira issue and submit a patch if you like.
Yep, this is an issue with the existing code (I tried to avoid changing .
Please file a new issue (and patch :-) for fixing that.
> Minor stuff:
>
> In MSExcelParser, "a Excel document" should be "an Excel document".
> In RTFParser, "a Excel document" should be "an RTF document".
Ah, thanks! I'll fix those before committing the changes.
> protected Parser.parse(InputStream stream, Iterable<Content> contents)
> ----------------------------------------------------------------------
>
> Key: TIKA-31
> URL: https://issues.apache.org/jira/browse/TIKA-31
> Project: Tika
> Issue Type: Improvement
> Components: general
> Reporter: Jukka Zitting
> Assignee: Jukka Zitting
> Attachments: TIKA-31.patch
>
>
> In order to push towards a stateless Parser interface, I'd like to propose
> implementing the current Parser.getContents() method (as it exists after
> TIKA-26) in terms of a stateless abstract method with the following signature:
> protected abstract String parse(InputStream stream, Iterable<Content>
> contents) throws IOException, TikaException;
> This method would return the fulltext content of the given stream as the
> String return value and place any extra metadata into the given set of
> Content instances. With this information the current Parser.getContents()
> method could populate a fulltext (and summary) Content entry and any regexp
> Content entries universally for any Parser classes. Also, we could centralize
> state handling and exception processing to a single class.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.