[ https://issues.apache.org/jira/browse/TIKA-26?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jukka Zitting updated TIKA-26:
------------------------------
Attachment: TIKA-26.patch
This patch replaces the List<Content> collection in ParserConfig and Parser
with a Map<String, Content> map as described above.
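As a rough illustration of the idea (not the code in the attached patch), the map can be built once from the configured list and then handed to the parsers as-is. The Content class below is a hypothetical stand-in with only a name, and ContentIndex and byName() are names made up for this sketch.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for Tika's Content class; only the name matters here.
class Content {
    private final String name;

    Content(String name) {
        this.name = name;
    }

    String getName() {
        return name;
    }
}

// Hypothetical helper, not part of the patch.
class ContentIndex {

    // Build the name-keyed map once, so that individual parsers no longer
    // need to re-index a List<Content> on their own.
    static Map<String, Content> byName(List<Content> contents) {
        Map<String, Content> map = new LinkedHashMap<>();
        for (Content content : contents) {
            // A later entry with the same name replaces the earlier one.
            map.put(content.getName(), content);
        }
        return map;
    }

    public static void main(String[] args) {
        Map<String, Content> contents =
                byName(Arrays.asList(new Content("title"), new Content("author")));
        System.out.println(contents.keySet());
    }
}
```

A LinkedHashMap is used here only to keep the configured order when iterating; whether order matters to the parsers is an assumption of this sketch.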
In addition, the patch makes some minor cleanups, such as using class-specific
logger instances and tracking the state of the parser instances more explicitly
(with a separate "parsed" flag). The patch should not, however, introduce any
functional changes.
This patch probably conflicts a bit with Keith's recent work on TIKA-17 and
other issues. I'll give those a look and come up with an updated patch once his
changes are committed.
After this patch the basic structure of a parser class is:
    public class SomeParser extends Parser {

        private static final Logger logger = Logger.getLogger(SomeParser.class);

        private boolean parsed = false;

        private String contentStr;

        public Map<String,Content> getContents() {
            Map<String,Content> contents = super.getContents();
            if (!parsed) {
                // fill in contents and contentStr with parsed content from
                // getInputStream()
                parsed = true;
            }
            return contents;
        }

        public String getStrContent() {
            getContents();
            return contentStr;
        }

    }
What I'd like to do as a follow-up step is to pass the InputStream as an
argument to getContents() and to include the full text content as part of the
Content map, which would make the parser instances stateless.
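A minimal sketch of what such a stateless parser could look like is below. It is an illustration under assumptions, not a proposed API: the nested Content class is a hypothetical stand-in, the "fulltext" key and the PlainTextParser name are made up, the input is assumed to be UTF-8 text, and the map is created fresh instead of coming from ParserConfig.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

class StatelessParserSketch {

    // Hypothetical stand-in for Tika's Content class.
    static class Content {
        private final String name;
        private String value;

        Content(String name) {
            this.name = name;
        }

        String getName() {
            return name;
        }

        String getValue() {
            return value;
        }

        void setValue(String value) {
            this.value = value;
        }
    }

    // Assumed key for the full text entry; not defined by the patch.
    static final String FULL_TEXT = "fulltext";

    // The parser has no fields, so one instance can serve many streams
    // and no "parsed" flag or contentStr member is needed.
    static class PlainTextParser {

        Map<String, Content> getContents(InputStream stream) throws IOException {
            Map<String, Content> contents = new HashMap<>();
            Content fullText = new Content(FULL_TEXT);
            fullText.setValue(new String(readAll(stream), StandardCharsets.UTF_8));
            contents.put(FULL_TEXT, fullText);
            return contents;
        }

        // Read the whole stream into memory (assumes small inputs).
        private static byte[] readAll(InputStream stream) throws IOException {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            int n;
            while ((n = stream.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            return out.toByteArray();
        }
    }

    public static void main(String[] args) throws IOException {
        PlainTextParser parser = new PlainTextParser();
        InputStream in = new ByteArrayInputStream("hello".getBytes(StandardCharsets.UTF_8));
        System.out.println(parser.getContents(in).get(FULL_TEXT).getValue());
    }
}
```

With this shape getStrContent() would no longer be needed, since callers read the full text from the returned map like any other Content entry.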
> Use Map<String, Content> instead of List<Content>
> -------------------------------------------------
>
> Key: TIKA-26
> URL: https://issues.apache.org/jira/browse/TIKA-26
> Project: Tika
> Issue Type: Improvement
> Components: general
> Reporter: Jukka Zitting
> Assignee: Jukka Zitting
> Fix For: 0.1-incubator
>
> Attachments: TIKA-26.patch
>
>
> The current Parser classes take a List<Content> collection from ParserConfig,
> and explicitly reformat that collection into an internal Map<String,Content>
> map keyed by the Content names. I don't see any place where using a list of
> Content instances is better than a Map keyed by the Content names, so I'd
> like to simplify things by creating the map already in ParserConfig and using
> it directly from then on.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.