[jira] [Commented] (TIKA-1815) Text content from parser is empty when NamedEntityParser is enabled
[ https://issues.apache.org/jira/browse/TIKA-1815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15065535#comment-15065535 ]

ASF GitHub Bot commented on TIKA-1815:
--------------------------------------

GitHub user thammegowda closed the pull request at:

    https://github.com/apache/tika/pull/66

> Text content from parser is empty when NamedEntityParser is enabled
> -------------------------------------------------------------------
>
>                 Key: TIKA-1815
>                 URL: https://issues.apache.org/jira/browse/TIKA-1815
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Thamme Gowda N
>             Fix For: 1.12
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> When the NamedEntityParser is enabled, the Tika#parseToString() and other
> parse() methods produce an empty string.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] tika pull request: Fix for TIKA-1815 contributed by Thamme Gowda
Github user thammegowda closed the pull request at: https://github.com/apache/tika/pull/66 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[jira] [Commented] (TIKA-1815) Text content from parser is empty when NamedEntityParser is enabled
[ https://issues.apache.org/jira/browse/TIKA-1815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15065533#comment-15065533 ]

ASF GitHub Bot commented on TIKA-1815:
--------------------------------------

GitHub user thammegowda opened a pull request:

    https://github.com/apache/tika/pull/67

    FIX for TIKA-1815 contributed by Thamme Gowda

     + Writing the text content to XML Document
     + Added Regex recogniser to default NER chain

    Closes #66 (this is a simpler version of the same).
    Fixes #TIKA-1815

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/thammegowda/tika TIKA-1815

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/tika/pull/67.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #67

commit a40a18e2f61f2152fa065bda193ceb74e7e60c97
Author: Thamme Gowda
Date:   2015-12-19T20:56:21Z

    FIX for TIKA-1815 contributed by Thamme Gowda

     + Writing the text content to XML Document
     + Added Regex recogniser to default NER chain
[jira] [Commented] (TIKA-1815) Text content from parser is empty when NamedEntityParser is enabled
[ https://issues.apache.org/jira/browse/TIKA-1815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15065485#comment-15065485 ]

ASF GitHub Bot commented on TIKA-1815:
--------------------------------------

GitHub user thammegowda opened a pull request:

    https://github.com/apache/tika/pull/66

    Fix for TIKA-1815 contributed by Thamme Gowda

     + Outputting the text content to XMLDocumentHandler

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/thammegowda/tika fix-TIKA-1815

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/tika/pull/66.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #66

commit e96da2bc28d5eef81d034e39eb05099ed5d38ac1
Author: Thamme Gowda
Date:   2015-10-30T21:47:45Z

    Add NamedEntityParser Add OpenNLPNERecogniser as default

commit a720507a1c1906a501470a7d5c5cec335412fcd3
Author: Thamme Gowda
Date:   2015-10-30T22:16:11Z

    Set charset for converting text to stream

commit 6b1a20e681a5d319886464ec147967c876b7e60d
Author: Thamme Gowda
Date:   2015-10-31T04:23:43Z

    Automated OpenNLP NER model downloader

commit e381ea88ebd2bb8f5adfe36d710acfce673e30aa
Author: Thamme Gowda
Date:   2015-11-04T00:31:40Z

    using a secondary parser to convert non-text streams

commit ea7871bd4afae7d18e500ffc285e58afd08f5e86
Author: Thamme Gowda
Date:   2015-11-08T07:36:48Z

    Add regex based NER

commit 084985b3612438e9ca7107fecdffd67757d04d10
Author: Thamme Gowda
Date:   2015-11-08T07:38:17Z

    Add CoreNLP NER with runtime binding

commit e4d74218ece77143d1e5245a3ef64ddf5578c310
Author: Thamme Gowda
Date:   2015-11-08T23:41:15Z

    Added support for chaining NER implementations

commit 7e6b43c83ec6cdd35ea258f52c0110ba986c82b3
Author: Thamme Gowda
Date:   2015-11-09T05:58:58Z

    charset specified

commit caba68773a287752dea43f3366e6d4309fde861c
Author: Thamme Gowda
Date:   2015-11-10T01:34:04Z

    Merge branch 'trunk' of github.com:apache/tika into trunk

commit 08b916790b279cda0201f2529ca58646dea4b2f9
Author: Thamme Gowda
Date:   2015-11-10T19:06:29Z

    Resolved Code formatting issues

     + Removed star imports
     + Removed dead code / commented code
     + Added License header to missing files

commit e07ac630d54cc79d9a7bfc9ac82332474d07434b
Author: Thamme Gowda
Date:   2015-11-16T09:05:07Z

    Add missing doc strings, fix code formatting issues

commit 96d4d7cc29d4bcd8ac0cf7a595c39b6ed64d4d19
Author: Thamme Gowda
Date:   2015-11-18T03:03:41Z

    Fix: build phase for model downloader

commit 6d0b121b8b321e8a31257fc608bb001d3fe7afb5
Author: Thamme Gowda
Date:   2015-12-11T14:33:36Z

    Merge branch 'trunk' of github.com:apache/tika into trunk

commit 66d3a10ffabf1f54cff384ce1c7325c2a3c16279
Author: Thamme Gowda
Date:   2015-12-19T18:59:26Z

    Fix : TIKA-1815 by Thamme Gowda N.

    1. Writing text content to XMLContentHandler
    2. Added RegexNERParser to Default parser chain
Re: looking to contribute
Regarding TIKA-1329, I found the tika-site in the Subversion repository and checked it out with:

    svn checkout https://svn.apache.org/repos/asf/tika/site/publish/1.11/

Since this isn't part of the main tika/trunk repository, I was wondering if I should still follow the same protocol and svn commit my changes to the site folder. In case I shouldn't, I've attached my changes to the usage examples page of the website below. I basically added how to parse documents with embedded docs using the RecursiveParserWrapper class, and how to serialize the returned Metadata list to JSON, with some description.

Thanks,
Joey

Title: Apache Tika – Tika API Usage Examples

Apache Tika API Usage Examples

This page provides a number of examples on how to use the various Tika APIs. All of the examples shown are also available in the Tika Example module in SVN.

  Parsing
    Parsing using the Tika Facade
    Parsing using the Auto-Detect Parser
    Parsing using the Recursive Parser Wrapper
  Picking different output formats
    Parsing to Plain Text
    Parsing to XHTML
    Fetching just certain bits of the XHTML
  Custom Content Handlers
    Extract Phone Numbers from Content into the Metadata
    Streaming the plain text in chunks
  Translation
    Translation using the Microsoft Translation API
  Language Identification
  Additional Examples

Parsing

Tika provides a number of different ways to parse a file. These provide different levels of control, flexibility, and complexity.

Parsing using the Tika Facade

The Tika facade provides a number of very quick and easy ways to have your content parsed by Tika, and return the resulting plain text.

    public String parseToStringExample() throws IOException, SAXException, TikaException {
        Tika tika = new Tika();
        try (InputStream stream = ParsingExample.class.getResourceAsStream("test.doc")) {
            return tika.parseToString(stream);
        }
    }

Parsing using the Auto-Detect Parser

For more control, you can call the Tika Parsers directly. Most likely, you'll want to start out using the Auto-Detect Parser, which automatically figures out what kind of content you have, then calls the appropriate parser for you.

    public String parseExample() throws IOException, SAXException, TikaException {
        AutoDetectParser parser = new AutoDetectParser();
        BodyContentHandler handler = new BodyContentHandler();
        Metadata metadata = new Metadata();
        try (InputStream stream = ParsingExample.class.getResourceAsStream("test.doc")) {
            parser.parse(stream, handler, metadata);
            return handler.toString();
        }
    }

Parsing using the Recursive Parser Wrapper

When you want to parse embedded documents, you can extract content from both the enclosing document and all embedded ones by passing the parser into the ParseContext instance.

    public String parseEmbeddedExample() throws IOException, SAXException, TikaException {
        AutoDetectParser parser = new AutoDetectParser();
        BodyContentHandler handler = new BodyContentHandler();
        Metadata metadata = new Metadata();
        // Registering the parser in the context lets it recurse into embedded documents.
        ParseContext context = new ParseContext();
        context.set(Parser.class, parser);
        try (InputStream stream = ParsingExample.class.getResourceAsStream("test_recursive_embedded.docx")) {
            // The context must be passed to parse(), otherwise it has no effect.
            parser.parse(stream, handler, metadata, context);
            return handler.toString();
        }
    }

Alternatively, you can use the RecursiveParserWrapper, which handles passing the parser into ParseContext. This wrapper class returns a list of Metadata objects, where the first element holds the metadata and content for the container document, and each remaining element holds them for one embedded document.

    public List<Metadata> recursiveParserWrapperExample() throws IOException, SAXException, TikaException {
        Parser p = new AutoDetectParser();
        ContentHandlerFactory factory = new BasicContentHandlerFactory(
                BasicContentHandlerFactory.HANDLER_TYPE.HTML, -1);
        RecursiveParserWrapper wrapper = new RecursiveParserWrapper(p, factory);
        Metadata metadata = new Metadata();
        metadata.set(Metadata.RESOURCE_NAME_KEY, "test_recursive_embedded.docx");
        ParseContext context = new ParseContext();
        try (InputStream stream = ParsingExample.class.getResourceAsStream("test_recursive_embedded.docx")) {
            wrapper.parse(stream, new DefaultHandler(), metadata, context);
        }
        return wrapper.getMetadata();
    }

The JsonMetadataList class can serialize the metadata list into JSON, and deserialize it back into the list.

    public String serializedRecursiveParserWrapperExample() throws IOException, SAXException, TikaException {
        List<Metadata> metadataList = recursiveParserWrapperExample();
        StringWriter writer = new StringWriter();
        JsonMetadataList.toJson(metadataList, writer);
        return writer.toString();
    }

Picking different output formats

With Tika, you can get the textual content of your files returned in a number of different formats. These can be plain text, html, xhtml,
[jira] [Created] (TIKA-1815) Text content from parser is empty when NamedEntityParser is enabled
Thamme Gowda N created TIKA-1815:
------------------------------------

             Summary: Text content from parser is empty when NamedEntityParser is enabled
                 Key: TIKA-1815
                 URL: https://issues.apache.org/jira/browse/TIKA-1815
             Project: Tika
          Issue Type: Bug
          Components: parser
            Reporter: Thamme Gowda N
             Fix For: 1.12

When the NamedEntityParser is enabled, the Tika#parseToString() and other parse() methods produce an empty string.
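
For readers following the thread, here is a minimal JDK-only sketch of the kind of failure the pull request descriptions point at ("Writing the text content to XML Document"). It assumes the empty output came from a parser step that consumed the text without forwarding it as SAX character events to the content handler; that reading is an inference from the commit messages, not something confirmed in this thread. The class and method names below (EmptyTextSketch, TextCollector, parseWithoutForwarding, parseWithForwarding, extract) are hypothetical and are not Tika's code.

```java
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration only; none of these names come from Tika.
final class EmptyTextSketch {

    /** Collects SAX character events, as a text-extracting handler would. */
    static final class TextCollector extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        @Override
        public String toString() {
            return text.toString();
        }
    }

    /** Assumed buggy shape: the text is analysed but never sent to the handler. */
    static void parseWithoutForwarding(String text, ContentHandler handler) throws SAXException {
        handler.startDocument();
        // Entity recognition on `text` would happen here; no characters() call follows.
        handler.endDocument();
    }

    /** Assumed fixed shape: the same text is also written to the handler. */
    static void parseWithForwarding(String text, ContentHandler handler) throws SAXException {
        handler.startDocument();
        char[] chars = text.toCharArray();
        handler.characters(chars, 0, chars.length);
        handler.endDocument();
    }

    /** Runs one of the two variants and returns whatever text the handler received. */
    static String extract(String text, boolean forward) {
        TextCollector collector = new TextCollector();
        try {
            if (forward) {
                parseWithForwarding(text, collector);
            } else {
                parseWithoutForwarding(text, collector);
            }
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return collector.toString();
    }

    public static void main(String[] args) {
        String input = "Apache Tika 1.12";
        System.out.println("without forwarding: [" + extract(input, false) + "]");
        System.out.println("with forwarding:    [" + extract(input, true) + "]");
    }
}
```

In this sketch the only difference between the two variants is the characters() call, which is why a caller that reads the handler's text afterwards sees an empty string in the first case.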