On Wed, 1 Sep 2010, Nick Burch wrote:
I've been thinking about extracting files from container formats (e.g. images in a .docx, PDFs in a zip file, etc.).

I've been pondering the various feedback over the weekend, and hopefully now have a more detailed idea.

Firstly, the new service needs to work for both people who have the container file locally, and those streaming it remotely. Some container parsers may work better with input streams, some with files, so making the input contract be a TikaInputStream would seem to be the right way around this?

Next, how to control which child elements are returned. The container will usually know the embedded file name, but not always, and will often know its path details (e.g. /foo/bar.txt in a zip file). It may sometimes know the mime type. This seems to me too awkward to represent as a wish-list filter. So, I now think that probably the only way to make it work is to offer all the details of every file to the consumer, and let them decide if they're interested or not. Ideally, the amount of work done by the container parser before the consumer decides they want a file and asks for its contents will be minimised. (A filter wrapper can always be put around it as required.)
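To make the "filter wrapper" idea concrete, here is a minimal sketch rather than a proposed API. EmbeddedEntry and FilteringIterator are hypothetical names invented for illustration, and the Supplier stands in for whatever deferred-stream mechanism the container parser would really use. The point is only that the wrapper looks at metadata and never opens a stream for an entry the consumer has not asked for.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.Predicate;
import java.util.function.Supplier;

// Hypothetical description of one embedded file; any metadata field may be null.
final class EmbeddedEntry {
    final String filename;
    final String path;
    final String mimeType;
    private final Supplier<InputStream> contents; // invoked only on demand

    EmbeddedEntry(String filename, String path, String mimeType,
                  Supplier<InputStream> contents) {
        this.filename = filename;
        this.path = path;
        this.mimeType = mimeType;
        this.contents = contents;
    }

    InputStream openStream() {
        return contents.get();
    }
}

// Hypothetical filter wrapper: passes through only the entries the predicate accepts.
final class FilteringIterator implements Iterator<EmbeddedEntry> {
    private final Iterator<EmbeddedEntry> source;
    private final Predicate<EmbeddedEntry> wanted;
    private EmbeddedEntry next; // look-ahead slot, null when empty

    FilteringIterator(Iterator<EmbeddedEntry> source, Predicate<EmbeddedEntry> wanted) {
        this.source = source;
        this.wanted = wanted;
    }

    @Override
    public boolean hasNext() {
        // Skip rejected entries using metadata only; no stream is opened here.
        while (next == null && source.hasNext()) {
            EmbeddedEntry candidate = source.next();
            if (wanted.test(candidate)) {
                next = candidate;
            }
        }
        return next != null;
    }

    @Override
    public EmbeddedEntry next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        EmbeddedEntry result = next;
        next = null;
        return result;
    }

    public static void main(String[] args) {
        List<EmbeddedEntry> all = new ArrayList<>();
        all.add(new EmbeddedEntry("a.pdf", "/a.pdf", "application/pdf",
                () -> new ByteArrayInputStream(new byte[] {1})));
        all.add(new EmbeddedEntry("README.txt", "/README.txt", null,
                () -> new ByteArrayInputStream(new byte[] {2})));
        Iterator<EmbeddedEntry> pdfs = new FilteringIterator(all.iterator(),
                e -> "application/pdf".equals(e.mimeType));
        while (pdfs.hasNext()) {
            System.out.println(pdfs.next().filename);
        }
    }
}
```

The null-safe comparison (constant on the left of equals) matters here, since the container will not always know the mime type or file name.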

Nested embedded files - do we have a boolean flag for descend / don't descend, or do we pass that choice back to the consumer on a per-embedded-file basis, similar to above? I worry that the latter would make things too complicated and heavy-weight, so I'm leaning towards the simple boolean flag.
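As a rough illustration of the simple boolean flag, the sketch below uses only java.util.zip and has nothing to do with how the real container parsers would be written. DescendSketch and its methods are hypothetical names. It assumes a nested container can be recognised by a ".zip" suffix, which a real implementation would replace with proper type detection.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

final class DescendSketch {
    // Lists the paths of all entries; nested zips are opened only if descend is true.
    static List<String> list(InputStream in, boolean descend) {
        List<String> found = new ArrayList<>();
        try {
            collect(in, "", descend, found);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return found;
    }

    private static void collect(InputStream in, String prefix, boolean descend,
                                List<String> found) throws IOException {
        // Deliberately not closed: closing would also close the caller's stream.
        ZipInputStream zip = new ZipInputStream(in);
        ZipEntry entry;
        while ((entry = zip.getNextEntry()) != null) {
            String path = prefix + "/" + entry.getName();
            found.add(path);
            // Assumption: a ".zip" suffix marks a nested container.
            if (descend && entry.getName().endsWith(".zip")) {
                // The nested archive is read directly from the current entry's data.
                collect(zip, path, true, found);
            }
        }
    }

    // Test helper: builds a zip file in memory from name -> content pairs.
    static byte[] zip(Map<String, byte[]> entries) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(bytes)) {
            for (Map.Entry<String, byte[]> item : entries.entrySet()) {
                out.putNextEntry(new ZipEntry(item.getKey()));
                out.write(item.getValue());
                out.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        Map<String, byte[]> inner = new LinkedHashMap<>();
        inner.put("doc.pdf", new byte[] {1, 2, 3});
        Map<String, byte[]> outer = new LinkedHashMap<>();
        outer.put("readme.txt", new byte[] {104, 105});
        outer.put("inner.zip", zip(inner));
        byte[] container = zip(outer);
        System.out.println(list(new ByteArrayInputStream(container), false));
        System.out.println(list(new ByteArrayInputStream(container), true));
    }
}
```

With the flag, the consumer makes one decision up front; the per-entry alternative would need a callback or return value for each nested container.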

Finally, pull vs push for the consumer. The two forms would probably look something like:
====
// Pull form: the consumer iterates and decides per entry.
// extract() needs to return an Iterable for the for-each loop to compile.
Iterable<Embeded> embeded = containerExtractor.extract(inp, false);
for (Embeded details : embeded) {
  if ("application/pdf".equals(details.getMimeType()) ||
      "pdf".equals(details.getSuffix())) {
    handlePDF(details.getInputStream());
  }
  if ("/README.txt".equals(details.getFilename())) {
    handleREADME(details.getInputStream());
  }
}
====
// Push form: the extractor calls the handler once per embedded file.
containerExtractor.extract(inp, false, new EmbededHandler() {
   public void handle(String filename, String mimetype,
                      InputStreamSource futureInputStream) {
       // filename and mimetype may be null if the container doesn't know them
       if ("application/pdf".equals(mimetype) ||
              (filename != null && filename.endsWith(".pdf"))) {
           handlePDF(futureInputStream.getInputStream());
       }
   }
});
====

I think the former would be a little bit more work for us, but is likely to lead to cleaner and simpler code for consumers. What do people think?
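One thing worth noting when weighing the two: a push-style entry point can be layered on a pull-style iterator in a few lines, so offering the pull form need not rule out the callback form. The sketch below shows this under assumptions; PulledItem, ItemHandler and PushOverPull are hypothetical names for illustration, not the proposed interfaces, and the stream argument is left out for brevity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical pull-side record; either field may be null.
final class PulledItem {
    final String filename;
    final String mimeType;

    PulledItem(String filename, String mimeType) {
        this.filename = filename;
        this.mimeType = mimeType;
    }
}

// Hypothetical push-side callback.
interface ItemHandler {
    void handle(String filename, String mimeType);
}

final class PushOverPull {
    // Drives a push-style handler from a pull-style iterator.
    static void push(Iterator<PulledItem> items, ItemHandler handler) {
        while (items.hasNext()) {
            PulledItem item = items.next();
            handler.handle(item.filename, item.mimeType);
        }
    }

    public static void main(String[] args) {
        List<PulledItem> items = Arrays.asList(
                new PulledItem("a.pdf", "application/pdf"),
                new PulledItem("README.txt", null));
        List<String> seen = new ArrayList<>();
        push(items.iterator(), (filename, mimeType) -> seen.add(filename));
        System.out.println(seen);
    }
}
```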

Nick
