[ 
https://issues.apache.org/jira/browse/TIKA-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14280553#comment-14280553
 ] 

Luis Filipe Nassif commented on TIKA-1511:
------------------------------------------

I think it will fail if someone passes in a custom EmbeddedDocExtractor (EDE), 
because it will probably try to read the table from the empty 
ByteArrayInputStream. The StatementTablePair will be there, but a custom EDE 
might not look it up in the parseContext.

1) I would prefer to handle each table as an embedded doc too, if that is 
possible. If not, let's go back.

2) Is it possible to generate an HTML representation of the tables and pass it 
to the EDE? Could it be handled by HtmlParser by default? Does HtmlParser 
currently extract embedded docs, such as images? Can we insert the BLOBs into 
that HTML so that HtmlParser will extract them?

If this approach is possible, we can use PipedWriter and PipedReader to avoid 
holding the entire HTML for the tables, which may be huge, in memory.
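
A minimal sketch of the piped idea, using only the JDK and no Tika classes. 
The class name PipedTableSketch, the helpers writeTable, escape and roundTrip, 
and the in-memory list of rows are all hypothetical stand-ins; a real parser 
would iterate a JDBC ResultSet and hand the reading end to the EDE or 
HtmlParser (probably via PipedOutputStream/PipedInputStream, since those take 
an InputStream). The producer runs on its own thread so that only the pipe's 
small buffer is held in memory at any time.

```java
import java.io.IOException;
import java.io.PipedReader;
import java.io.PipedWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.List;

public class PipedTableSketch {

    // Hypothetical helper: writes one table as HTML, one row at a time.
    static void writeTable(List<String[]> rows, Writer out) throws IOException {
        out.write("<table>");
        for (String[] row : rows) {
            out.write("<tr>");
            for (String cell : row) {
                out.write("<td>" + escape(cell) + "</td>");
            }
            out.write("</tr>");
        }
        out.write("</table>");
    }

    // Minimal escaping so that cell text cannot break the markup.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Producer thread writes into the pipe; the calling thread consumes it.
    // The consumer here just collects the text so the result can be checked;
    // a real consumer would parse the stream incrementally instead.
    static String roundTrip(List<String[]> rows) {
        try (PipedReader reader = new PipedReader()) {
            PipedWriter writer = new PipedWriter(reader);
            Thread producer = new Thread(() -> {
                // Closing the writer is what signals end-of-stream to the reader.
                try (Writer w = writer) {
                    writeTable(rows, w);
                } catch (IOException e) {
                    // Reader went away early; nothing more to do in this sketch.
                }
            });
            producer.start();

            StringBuilder sb = new StringBuilder();
            char[] buf = new char[1024];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
            producer.join();
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(List.of(new String[] {"1", "a<b"})));
    }
}
```

The writer and reader must live on different threads; using both ends of a 
pipe from a single thread can deadlock once the pipe's buffer fills up.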

> Create a parser for SQLite3
> ---------------------------
>
>                 Key: TIKA-1511
>                 URL: https://issues.apache.org/jira/browse/TIKA-1511
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>    Affects Versions: 1.6
>            Reporter: Luis Filipe Nassif
>             Fix For: 1.8
>
>         Attachments: TIKA-1511v1.patch, testSQLLite3b.db
>
>
> I think it would be very useful, as sqlite is used as data storage by a wide 
> range of applications. Opening the ticket to track it. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
