Jukka -

Two things:

1) I suggest we create a class to store the parsed document content, rather
than just a Map.  The class could have convenience methods such as
getStringContent(), and possibly hold onto a resource identifier that could
be set.  We might also want to make the parsed values immutable.

We might even want this new DocumentContent thing to be an interface or
abstract class.  I'm not very familiar with the data, but for example, it
would be nice (and safer) to be able to call content.getAuthor() rather than
contentMap.get("author").  Also, each implementation (OpenOfficeContent,
MSWordContent, etc.) could have methods for each of their known properties.
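
To make the idea concrete, here is a rough sketch of what I have in mind.
All of the names below (DocumentContent, MSWordContent, getResourceId(),
getLastAuthor(), the "lastAuthor" key) are made up for illustration; none of
this exists in Tika today. I also assume the resource identifier is supplied
at construction time, so that the whole object can stay immutable.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical immutable holder for the parsed content of one document. */
abstract class DocumentContent {
    private final String resourceId;
    private final String stringContent;
    private final Map<String, String> properties;

    protected DocumentContent(String resourceId, String stringContent,
                              Map<String, String> properties) {
        this.resourceId = resourceId;
        this.stringContent = stringContent;
        // Defensive copy: later changes to the caller's map cannot leak in.
        this.properties =
            Collections.unmodifiableMap(new HashMap<String, String>(properties));
    }

    public String getResourceId() { return resourceId; }

    public String getStringContent() { return stringContent; }

    /** Typed accessor instead of contentMap.get("author"). */
    public String getAuthor() { return properties.get("author"); }

    /** Lets subclasses expose their own format-specific properties. */
    protected String getProperty(String name) { return properties.get(name); }
}

/** Hypothetical format-specific implementation with an extra property. */
class MSWordContent extends DocumentContent {
    MSWordContent(String resourceId, String stringContent,
                  Map<String, String> properties) {
        super(resourceId, stringContent, properties);
    }

    public String getLastAuthor() { return getProperty("lastAuthor"); }
}
```

A caller would then write content.getAuthor(), and a typo in the property
name becomes a compile error instead of a silent null.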

2) If we make the Parser stateless, how will we deal with the chunking of
large documents? 
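
One option I can imagine, purely as a sketch: the parser keeps no fields at
all and pushes text to a caller-supplied handler as it reads the stream, so a
large document never has to be held in memory in one piece. ChunkHandler,
PlainTextParser and the chunkSize parameter are hypothetical names, and I am
assuming UTF-8 plain text just to keep the example short.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

/** Hypothetical callback that receives the document text piece by piece. */
interface ChunkHandler {
    void chunk(String text);
}

/** Hypothetical stateless parser: no instance fields, so it can be shared. */
class PlainTextParser {

    /**
     * Reads the stream and hands the text to the handler in pieces of at
     * most chunkSize characters. A piece may be shorter than chunkSize.
     */
    public void parse(InputStream in, int chunkSize, ChunkHandler handler)
            throws IOException {
        Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8);
        char[] buffer = new char[chunkSize];
        int n;
        while ((n = reader.read(buffer)) != -1) {
            if (n > 0) {
                handler.chunk(new String(buffer, 0, n));
            }
        }
    }
}
```

All per-document state lives in the handler, not in the parser, which is
where the chunking state would have to move if getContents() takes the
InputStream as an argument.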

- Keith 



JIRA [EMAIL PROTECTED] wrote:
> 
> 
>      [ https://issues.apache.org/jira/browse/TIKA-26?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
> 
> Jukka Zitting updated TIKA-26:
> ------------------------------
> 
>     Attachment: TIKA-26.patch
> 
> This patch replaces the List<Content> collection in ParserConfig and
> Parser with a Map<String, Content> map as described above.
> 
> In addition the patch makes some minor cleanups like using class-specific
> logger instances, more explicitly tracking state of the parser instances
> (added a separate "parsed" flag), etc. The patch should however not
> introduce any functional changes.
> 
> This patch probably conflicts a bit with Keith's recent work on TIKA-17
> and other issues. I'll give those a look and come up with an updated patch
> once his changes are committed.
> 
> After this patch the basic structure of a parser class is:
> 
>     public class SomeParser extends Parser {
>         private static final Logger logger =
>             Logger.getLogger(SomeParser.class);
>         private boolean parsed = false;
>         private String contentStr;
>         public Map<String,Content> getContents() {
>             Map<String,Content> contents = super.getContents();
>             if (!parsed) {
>                 // fill in contents and contentStr with parsed content
>                 // from getInputStream()
>                 parsed = true;
>             }
>             return contents;
>         }
>         public String getStrContent() {
>             getContents();
>             return contentStr;
>         }
>     }
> 
> What I'd like to do as a followup step is to pass the InputStream as an
> argument to getContents() and to include the full text content as a part
> of the Content map to make the parser instances stateless.
> 
> 
>> Use Map<String, Content> instead of List<Content>
>> -------------------------------------------------
>>
>>                 Key: TIKA-26
>>                 URL: https://issues.apache.org/jira/browse/TIKA-26
>>             Project: Tika
>>          Issue Type: Improvement
>>          Components: general
>>            Reporter: Jukka Zitting
>>            Assignee: Jukka Zitting
>>             Fix For: 0.1-incubator
>>
>>         Attachments: TIKA-26.patch
>>
>>
>> The current Parser classes take a List<Content> collection from
>> ParserConfig, and explicitly reformat that collection into an internal
>> Map<String,Content> map keyed by the Content names. I don't see any place
>> where using a list of Content instances is better than a Map keyed by the
>> Content names, so I'd like to simplify things by creating the map already
>> in ParserConfig and using it directly from then on.
> 
> -- 
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
> 
> 
> 

-- 
View this message in context: 
http://www.nabble.com/-jira--Created%3A-%28TIKA-26%29-Use-Map%3CString%2C-Content%3E-instead-of-List%3CContent%3E-tf4503686.html#a12849590
Sent from the Apache Tika - Development mailing list archive at Nabble.com.
