Benoit Tellier created JAMES-2910:
-------------------------------------

             Summary: HTML could be indexed directly in ElasticSearch
                 Key: JAMES-2910
                 URL: https://issues.apache.org/jira/browse/JAMES-2910
             Project: James Server
          Issue Type: Improvement
          Components: elasticsearch, guice
            Reporter: Benoit Tellier

When Tika is disabled, the DefaultTextExtractor is used, which does not perform HTML text extraction. In that configuration the raw HTML markup is indexed along with the message text. This pollutes the index, which decreases search precision and makes the index much larger than it needs to be.

Proposal: CassandraGuice should default to JsoupTextExtractor when Tika is disabled. This will allow HTML text extraction to actually happen.
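
To illustrate why indexing raw HTML pollutes the index, here is a minimal, standard-library-only sketch. It is not the James code and not how JsoupTextExtractor works internally: the class `HtmlIndexingSketch` and the method `extractText` are hypothetical names, and the regex-based tag stripping is only a crude stand-in for a real HTML parser such as jsoup.

```java
import java.util.regex.Pattern;

// Hypothetical illustration only: compares what would be indexed with and
// without an HTML text-extraction step. Not part of Apache James.
final class HtmlIndexingSketch {
    // Drops <script> and <style> elements together with their content.
    private static final Pattern SCRIPT_OR_STYLE =
        Pattern.compile("(?is)<(script|style)\\b.*?</\\1\\s*>");
    // Matches any remaining tag, including its attributes.
    private static final Pattern TAG = Pattern.compile("(?s)<[^>]*>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    // Crude stand-in for a real extractor: removes markup and collapses
    // whitespace. HTML entities are NOT decoded here (a real parser would).
    static String extractText(String html) {
        String withoutScripts = SCRIPT_OR_STYLE.matcher(html).replaceAll(" ");
        String withoutTags = TAG.matcher(withoutScripts).replaceAll(" ");
        return WHITESPACE.matcher(withoutTags).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        String html = "<html><head><style>p { color: red; }</style></head>"
            + "<body><p class=\"intro\">Hello <b>world</b></p></body></html>";
        String text = extractText(html);

        // Without extraction, tag names, attributes and CSS all become
        // searchable terms and inflate the index.
        System.out.println("raw HTML length:       " + html.length());
        System.out.println("extracted text:        " + text);
        System.out.println("extracted text length: " + text.length());
    }
}
```

With raw HTML, terms such as `class`, `intro`, `style` and `color` would end up in the index and match unrelated searches. After extraction only the visible words remain.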