[ https://issues.apache.org/jira/browse/TIKA-3044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17204912#comment-17204912 ]
Tim Allison edited comment on TIKA-3044 at 9/30/20, 5:38 PM:
-------------------------------------------------------------

It has been a while since I looked at this part of the codebase. To confirm...

-t --text uses the BodyContentHandler, which should only include the <body/> content.

-T --text-main uses the BoilerpipeContentHandler, which relies on heuristics to guess what the main content of a page is and to remove the boilerplate navigational sections, ads, etc.

So, to confirm, --text-main will return only the title for some specific html pages that BoilerpipeContentHandler fails on. It _should_ return the main content of an html page if it works correctly.

The proposal is to add a feature that writes out all of the text, not only the body, using just the simple WriteOutContentHandler. This makes sense to me.

The current proposal is {{-C}} and {{--content}}. What would people think of {{-A}} and {{--text-all}}?

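To make the difference between "body only" and "all text" concrete, here is a minimal sketch that uses only the JDK's SAX classes. It does not call Tika. AllTextHandler and BodyOnlyHandler are hypothetical stand-ins that only approximate what a plain write-out handler and a body-restricted handler do with the same stream of SAX events; the real Tika handlers may behave differently in detail.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class TextHandlerSketch {

    /** Hypothetical stand-in: appends every character event it receives. */
    static class AllTextHandler extends DefaultHandler {
        final StringBuilder out = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            out.append(ch, start, length);
        }
    }

    /** Hypothetical stand-in: appends character events only inside <body>. */
    static class BodyOnlyHandler extends DefaultHandler {
        final StringBuilder out = new StringBuilder();
        private int bodyDepth = 0; // > 0 while we are inside <body>

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if (bodyDepth > 0 || "body".equals(qName)) {
                bodyDepth++;
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (bodyDepth > 0) {
                bodyDepth--;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (bodyDepth > 0) {
                out.append(ch, start, length);
            }
        }
    }

    /** Parses well-formed XHTML and returns either the body text or all text. */
    static String extract(String xhtml, boolean bodyOnly) {
        try {
            InputSource src = new InputSource(new StringReader(xhtml));
            if (bodyOnly) {
                BodyOnlyHandler h = new BodyOnlyHandler();
                SAXParserFactory.newInstance().newSAXParser().parse(src, h);
                return h.out.toString();
            }
            AllTextHandler h = new AllTextHandler();
            SAXParserFactory.newInstance().newSAXParser().parse(src, h);
            return h.out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse input", e);
        }
    }

    public static void main(String[] args) {
        String doc = "<html><head><title>Page Title</title></head>"
                + "<body><p>Body text</p></body></html>";
        System.out.println(extract(doc, true));  // prints "Body text"
        System.out.println(extract(doc, false)); // prints "Page TitleBody text"
    }
}
```

With the body-only handler the title is dropped; with the write-everything handler the title and the body text both appear, which is the behaviour the proposed option is meant to expose on the command line.
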
> add -C/--content cli option using WriteOutContentHandler
> --------------------------------------------------------
>
>                 Key: TIKA-3044
>                 URL: https://issues.apache.org/jira/browse/TIKA-3044
>             Project: Tika
>          Issue Type: New Feature
>          Components: cli
>            Reporter: Alexander Klimetschek
>            Priority: Major
>
> For text extraction, the cli currently provides both --text and --text-main
> options. For html files, --text will return the body, while --text-main will
> only return the title. There is currently no cli option that gives all text
> content. However, the Tika API has the WriteOutContentHandler which does the
> trick.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)