[ 
https://issues.apache.org/jira/browse/TIKA-3044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17204912#comment-17204912
 ] 

Tim Allison edited comment on TIKA-3044 at 9/30/20, 5:38 PM:
-------------------------------------------------------------

It has been a while since I looked at this part of the codebase.  To confirm...

-t/--text uses the BodyContentHandler, which should include only the <body/> 
content.
-T/--text-main uses the BoilerpipeContentHandler, which relies on heuristics to 
guess what the main content of a page is and to remove boilerplate such as 
navigational sections and ads.  So, to confirm, --text-main returns only the 
title just for those specific HTML pages where the BoilerpipeContentHandler 
fails.  It _should_ return the main content of an HTML page when it works 
correctly.

The proposal is to add an option that writes out all of the text, not only the 
body, via the simple WriteOutContentHandler.
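
To make the body-only versus all-text distinction concrete, here is a minimal 
sketch that uses only the JDK's SAX API.  It is not Tika code: the class 
TextHandlerSketch, the TextCollector handler and the extract method are 
hypothetical stand-ins that only mimic the idea of filtering character events 
to <body/> versus passing all of them through, and the sample XHTML is made up.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration only; these are NOT Tika's handler classes.
class TextHandlerSketch {

    /** Collects SAX character events, optionally only those inside <body>. */
    static final class TextCollector extends DefaultHandler {
        private final boolean bodyOnly;
        private final StringBuilder out = new StringBuilder();
        private int bodyDepth = 0;

        TextCollector(boolean bodyOnly) {
            this.bodyOnly = bodyOnly;
        }

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts) {
            if ("body".equals(qName)) {
                bodyDepth++;
            }
        }

        @Override
        public void endElement(String uri, String local, String qName) {
            if ("body".equals(qName)) {
                bodyDepth--;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // Body-only mode drops text outside <body>, such as the <title>.
            if (!bodyOnly || bodyDepth > 0) {
                out.append(ch, start, length);
            }
        }

        /** Returns the collected text with runs of whitespace collapsed. */
        String text() {
            return out.toString().trim().replaceAll("\\s+", " ");
        }
    }

    /** Parses well-formed XHTML and returns either the body text or all text. */
    static String extract(String xhtml, boolean bodyOnly) {
        TextCollector handler = new TextCollector(bodyOnly);
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample XHTML", e);
        }
        return handler.text();
    }

    public static void main(String[] args) {
        String xhtml = "<html><head><title>Page title</title></head>\n"
                     + "<body><p>Body text</p></body></html>";
        System.out.println(extract(xhtml, true));   // prints "Body text"
        System.out.println(extract(xhtml, false));  // prints "Page title Body text"
    }
}
```

In this sketch the body-only mode loses the title, while the pass-through mode 
keeps both title and body, which is the behavior the proposed option is after.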

This makes sense to me.

The current proposal is {{-C}} and {{--content}}.  What would people think of 
{{-A}} and {{--text-all}}?


> add -C/--content cli option using WriteOutContentHandler
> --------------------------------------------------------
>
>                 Key: TIKA-3044
>                 URL: https://issues.apache.org/jira/browse/TIKA-3044
>             Project: Tika
>          Issue Type: New Feature
>          Components: cli
>            Reporter: Alexander Klimetschek
>            Priority: Major
>
> For text extraction, the cli currently provides both --text and --text-main 
> options. For html files, --text will return the body, while --text-main will 
> only return the title. There is currently no cli option that gives all text 
> content. However, the Tika API has the WriteOutContentHandler which does the 
> trick.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
