Tim Allison created TIKA-1443:
---------------------------------

             Summary: Add a junk text detector to Tika
                 Key: TIKA-1443
                 URL: https://issues.apache.org/jira/browse/TIKA-1443
             Project: Tika
          Issue Type: Wish
            Reporter: Tim Allison
            Priority: Minor


It would be helpful to have a detector that flags documents whose extracted 
text is junk.  This could be used as a component of TIKA-1332 or as a 
standalone detector.  See TIKA-1332 for some initial ideas about which 
statistics we might use for such a detector.

Two use cases:
* Parser developers could quickly see whether code changes lead to fewer 
or more "junky" documents.  This would also help in prioritizing manual 
review of output comparisons (see discussion in TIKA-1419).
* Search system integrators could use that information to set 
document-specific relevancy rankings or to avoid indexing a document.
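
To make the idea concrete, here is a minimal sketch of what such a score 
might look like.  Everything in it is an assumption for illustration: the 
class name JunkTextScorer, the method names, and the statistic itself (the 
share of non-whitespace code points that are replacement characters, 
control characters, private-use, unpaired surrogates or unassigned).  It 
does not use any Tika API, and it is not necessarily one of the statistics 
discussed in TIKA-1332.

```java
/** Hypothetical helper for illustration only; not part of Tika. */
final class JunkTextScorer {

    private JunkTextScorer() {
    }

    /**
     * Returns a score in [0, 1]; higher means the text looks more like junk.
     * Assumed statistic: fraction of non-whitespace code points that are
     * U+FFFD, control characters, private-use, unpaired surrogates or
     * unassigned. Empty or whitespace-only text scores 0.0.
     */
    static double junkScore(String text) {
        int total = 0;
        int suspicious = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.isWhitespace(cp)) {
                continue; // whitespace is ignored entirely
            }
            total++;
            int type = Character.getType(cp);
            if (cp == 0xFFFD
                    || Character.isISOControl(cp)
                    || !Character.isDefined(cp)
                    || type == Character.PRIVATE_USE
                    || type == Character.SURROGATE) {
                suspicious++;
            }
        }
        return total == 0 ? 0.0 : (double) suspicious / total;
    }

    /** True if the score exceeds a caller-chosen threshold (an assumed knob). */
    static boolean looksLikeJunk(String text, double threshold) {
        return junkScore(text) > threshold;
    }
}
```

A parser developer might compare the average junkScore over a test corpus 
before and after a code change, while a search integrator might skip 
indexing when looksLikeJunk(text, threshold) returns true.  A real detector 
would likely combine several statistics and need a threshold tuned on real 
data.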



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)