[ 
https://issues.apache.org/jira/browse/TIKA-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15955205#comment-15955205
 ] 

ASF GitHub Bot commented on TIKA-2309:
--------------------------------------

tballison commented on issue #161: fix for TIKA-2309 contributed by Shinobi@75
URL: https://github.com/apache/tika/pull/161#issuecomment-291520466
 
 
   If I understand correctly, a TSD is an envelope file that wraps another, 
actual file.  For example, your first test file had the TSD envelope, and 
inside it was an XML file:
   
   `<?xml version="1.0" encoding="UTF-8"?>
   <blocco>
       <manifest ID="9570">
           <man:manifestConservazione Id="ManifestConservazioneCNN"`
   
   I see that your updated .pdf file also has an envelope and then the raw 
bytes for a PDF file.
   
   You'll probably want to cache those bytes in a byte[] and then call the 
embedded parser, something like:
   
   `
   embeddedDocumentExtractor = EmbeddedDocumentUtil.getEmbeddedDocumentExtractor(context);
   if (!embeddedDocumentExtractor.shouldParseEmbedded(embeddedMetadata)) {
       return;
   }
   TikaInputStream stream = TikaInputStream.get(cachedBytes);
   try {
       embeddedDocumentExtractor.parseEmbedded(
               stream,
               new EmbeddedContentHandler(xhtml),
               embeddedMetadata, false);
   } finally {
       IOUtils.closeQuietly(stream);
   }
   `
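
   As a rough sketch of the "cache those bytes" step, the helper below drains 
an InputStream into a byte[] using only the JDK. It assumes the inner payload 
has already been located inside the TSD envelope and exposed as a stream; how 
to unwrap the envelope is not shown. The names EmbeddedBytesCache and readAll 
are hypothetical and are not part of Tika.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper (not a Tika class): buffers an embedded payload in memory
// so it can later be handed to the embedded document extractor.
final class EmbeddedBytesCache {

    private EmbeddedBytesCache() {
    }

    // Reads the stream to its end and returns everything that was read.
    // The caller remains responsible for closing the stream.
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return out.toByteArray();
    }
}
```

   The resulting array would play the role of cachedBytes in the snippet 
above. Note that this holds the whole payload in memory, which may not suit 
very large embedded files.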
 


> New Detector and Parser classes for Time Stamped Data Envelope file format
> --------------------------------------------------------------------------
>
>                 Key: TIKA-2309
>                 URL: https://issues.apache.org/jira/browse/TIKA-2309
>             Project: Tika
>          Issue Type: Improvement
>          Components: detector, parser
>    Affects Versions: 1.13, 1.14
>            Reporter: Fabio
>            Priority: Minor
>         Attachments: MANIFEST.XML.TSD
>
>
> Hello,
> I'm Fabio Evangelista from Rome. I work for an Italian public 
> administration company and use Apache Tika in my Java applications to 
> detect and parse a wide range of file formats. While doing so, and after 
> following your good guide on the Tika project page, I successfully wrote new 
> Detector and Parser classes for a particular cryptographic timestamp format 
> with these characteristics:
> Format name:               Time Stamped Data Envelope
> Mime Type:                   application/timestamped-data
> File extension:              .tsd
> TSD file hex magic code at the start of the file:   30 80 06 0B 2A 86 48 86 F7
> I have integrated those new classes into the Tika 1.13 tika-core.jar and 
> tika-parsers.jar and tested them successfully with my applications. What 
> should I do to submit my new classes to you? Should I push them to a 
> particular git branch, or is there a particular process to follow?
> Thank you for your patience and best regards.
> Fabio.
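
The magic-byte check described in the issue can be illustrated with a minimal, 
JDK-only sketch. It only compares the first nine bytes of a file against the 
sequence quoted above (30 80 06 0B 2A 86 48 86 F7). The class and method names 
are hypothetical; this is neither the contributed Detector nor Tika's detector 
API.

```java
// Hypothetical illustration: prefix check against the TSD magic bytes
// quoted in the issue description.
final class TsdMagic {

    // 30 80 06 0B 2A 86 48 86 F7, as listed in the issue.
    private static final byte[] MAGIC = {
        0x30, (byte) 0x80, 0x06, 0x0B, 0x2A,
        (byte) 0x86, 0x48, (byte) 0x86, (byte) 0xF7
    };

    private TsdMagic() {
    }

    // Returns true when the given bytes start with the TSD magic sequence.
    static boolean looksLikeTsd(byte[] prefix) {
        if (prefix == null || prefix.length < MAGIC.length) {
            return false;
        }
        for (int i = 0; i < MAGIC.length; i++) {
            if (prefix[i] != MAGIC[i]) {
                return false;
            }
        }
        return true;
    }
}
```

A caller would pass in the first bytes read from the file; anything shorter 
than nine bytes is treated as not matching.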



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
