[ 
https://issues.apache.org/jira/browse/TIKA-1657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14729573#comment-14729573
 ] 

Nick Burch edited comment on TIKA-1657 at 9/3/15 6:54 PM:
----------------------------------------------------------

Let's consider this config file:
{code}
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <mime-exclude>image/jpeg</mime-exclude>
      <mime-exclude>application/pdf</mime-exclude>
      <parser-exclude class="org.apache.tika.parser.executable.ExecutableParser"/>
      <parser-exclu class="org.apache.tika.parser.executable.ExecutableParser2"/>
    </parser>
    <parser class="org.apache.tika.parser.EmptyParser">
      <mime>application/pdf</mime>
      <no-mime>hello/world</no-mime>
    </parser>
  </parsers>
</properties>
{code}

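With {{--dump-active-config}} you'd get back only the parts of that file which 
Tika was actually using, allowing you to spot what was and wasn't picked up 
(here the misspelt {{parser-exclu}} and the unrecognised {{no-mime}} entries 
would be gone), e.g.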
{code}
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <mime-exclude>image/jpeg</mime-exclude>
      <mime-exclude>application/pdf</mime-exclude>
      <parser-exclude class="org.apache.tika.parser.executable.ExecutableParser"/>
    </parser>
    <parser class="org.apache.tika.parser.EmptyParser">
      <mime>application/pdf</mime>
    </parser>
  </parsers>
</properties>
{code}
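
As a rough illustration of how the two files above could be compared, here is a minimal sketch using only the JDK's DOM parser. The class {{ConfigDiff}} and its methods are hypothetical names invented for this example and are not part of Tika; it just lists the elements of the original config that have no counterpart in the dumped one.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not a Tika class: compares a config with its active dump.
class ConfigDiff {
    /** Parses an XML string into a DOM document. */
    static Document parse(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Not well-formed XML", e);
        }
    }

    /** Records one key per element: its path, its class attribute, and any leaf text. */
    static void collect(Element element, String prefix, List<String> out) {
        String key = prefix + "/" + element.getTagName();
        String cls = element.getAttribute("class"); // empty string when absent
        if (!cls.isEmpty()) {
            key += "[" + cls + "]";
        }
        List<Element> children = new ArrayList<>();
        NodeList nodes = element.getChildNodes();
        for (int i = 0; i < nodes.getLength(); i++) {
            Node node = nodes.item(i);
            if (node instanceof Element) {
                children.add((Element) node);
            }
        }
        if (children.isEmpty()) {
            String text = element.getTextContent().trim();
            out.add(text.isEmpty() ? key : key + "=" + text);
        } else {
            out.add(key);
            for (Element child : children) {
                collect(child, key, out);
            }
        }
    }

    /** Returns the keys present in the original config but missing from the active dump. */
    static List<String> droppedElements(String originalXml, String activeXml) {
        List<String> original = new ArrayList<>();
        collect(parse(originalXml).getDocumentElement(), "", original);
        List<String> active = new ArrayList<>();
        collect(parse(activeXml).getDocumentElement(), "", active);
        original.removeAll(active);
        return original;
    }

    public static void main(String[] args) {
        // Cut-down versions of the two config files shown above.
        String original = "<properties><parsers>"
            + "<parser class=\"org.apache.tika.parser.DefaultParser\">"
            + "<parser-exclu class=\"org.apache.tika.parser.executable.ExecutableParser2\"/>"
            + "</parser>"
            + "<parser class=\"org.apache.tika.parser.EmptyParser\">"
            + "<mime>application/pdf</mime><no-mime>hello/world</no-mime>"
            + "</parser></parsers></properties>";
        String active = "<properties><parsers>"
            + "<parser class=\"org.apache.tika.parser.DefaultParser\"/>"
            + "<parser class=\"org.apache.tika.parser.EmptyParser\">"
            + "<mime>application/pdf</mime>"
            + "</parser></parsers></properties>";
        for (String key : droppedElements(original, active)) {
            System.out.println("Not used: " + key);
        }
    }
}
```

Run against the cut-down configs in {{main}}, this reports the {{parser-exclu}} and {{no-mime}} entries as not used. Matching on path, class attribute and text is a deliberately crude choice for the sketch; a real comparison would want to follow whatever rules Tika applies when reading the file.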

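Or, with {{--dump-static-config}} you'd get the fully expanded form, with 
dynamic service loading switched off and every detector and parser listed 
explicitly, something like: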
{code}
<properties>
  <service-loader dynamic="false" />
  <translators/>
  <detectors>
   <detector class="org.apache.tika.parser.microsoft.POIFSContainerDetector"/>
   <detector class="org.apache.tika.parser.pkg.ZipContainerDetector"/>
   <detector class="org.gagravarr.tika.OggDetector"/>
   <detector class="org.apache.tika.mime.MimeTypes"/>
  </detectors>
  <parsers>
    <parser class="org.apache.tika.parser.CompositeParser">
      <mime-exclude>image/jpeg</mime-exclude>
      <mime-exclude>application/pdf</mime-exclude>
      <parser class="org.apache.tika.parser.asm.ClassParser"/>
      <parser class="org.apache.tika.parser.audio.AudioParser"/>
      <parser class="org.apache.tika.parser.audio.MidiParser"/>
      <parser class="org.apache.tika.parser.chm.ChmParser"/>
      <parser class="org.apache.tika.parser.code.SourceCodeParser"/>
      <!-- ... everything except executable ... -->
    </parser>
    <parser class="org.apache.tika.parser.EmptyParser">
      <mime>application/pdf</mime>
    </parser>
  </parsers>
</properties>
{code}
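
To make the expansion step concrete, here is a small sketch of how a "default parser minus excludes" entry could be flattened into an explicit list like the one above. {{StaticConfigSketch}} and {{expand}} are hypothetical names for illustration only. It also assumes the list of available parsers is simply handed in, whereas a real dump would have to obtain it from whatever Tika had loaded.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not Tika code: flattens "all parsers minus excludes"
// into an explicit CompositeParser entry.
class StaticConfigSketch {
    /**
     * Builds the XML for one composite parser entry.
     *
     * @param availableParsers class names of every parser that would be loaded (assumed input)
     * @param excludedParsers  class names listed in parser-exclude elements
     * @param excludedMimes    types listed in mime-exclude elements
     */
    static String expand(List<String> availableParsers,
                         List<String> excludedParsers,
                         List<String> excludedMimes) {
        StringBuilder xml = new StringBuilder();
        xml.append("<parser class=\"org.apache.tika.parser.CompositeParser\">\n");
        // The mime excludes carry over unchanged.
        for (String mime : excludedMimes) {
            xml.append("  <mime-exclude>").append(mime).append("</mime-exclude>\n");
        }
        // Excluded parsers are simply left out of the explicit list.
        for (String parser : availableParsers) {
            if (!excludedParsers.contains(parser)) {
                xml.append("  <parser class=\"").append(parser).append("\"/>\n");
            }
        }
        xml.append("</parser>\n");
        return xml.toString();
    }

    public static void main(String[] args) {
        System.out.print(expand(
            Arrays.asList(
                "org.apache.tika.parser.asm.ClassParser",
                "org.apache.tika.parser.audio.AudioParser",
                "org.apache.tika.parser.executable.ExecutableParser"),
            Arrays.asList("org.apache.tika.parser.executable.ExecutableParser"),
            Arrays.asList("image/jpeg", "application/pdf")));
    }
}
```

The point of the static form is that nothing is left to be discovered at load time, so the output names each remaining parser rather than saying what to leave out.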


was (Author: gagravarr):
The same comment, with the code blocks wrapped in {{{ ... }}} rather than 
{code}.

> Allow easier XML serialization of TikaConfig
> --------------------------------------------
>
>                 Key: TIKA-1657
>                 URL: https://issues.apache.org/jira/browse/TIKA-1657
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Priority: Minor
>             Fix For: 1.11
>
>         Attachments: TIKA-1558-blacklist-effective.xml
>
>
> In TIKA-1418, we added an example for how to dump the config file so that 
> users could easily modify it.  I think we should go further and make this an 
> option at the tika-core level with hooks for tika-app and tika-server.  I 
> propose adding a main() to TikaConfig that will print the xml config file 
> that Tika is currently using to stdout.
> I'd like to put this into core so that e.g. Solr's DIH users can get by 
> without having to download tika-app separately.  
> There's every chance that I've not accounted for issues with dynamic loading 
> etc.  Also, I'd be ok with only having this available in tika-app and 
> tika-server if there are good reasons.
> Feedback?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
