[jira] [Commented] (TIKA-1815) Text content from parser is empty when NamedEntityParser is enabled

2015-12-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15065535#comment-15065535
 ] 

ASF GitHub Bot commented on TIKA-1815:
--

Github user thammegowda closed the pull request at:

https://github.com/apache/tika/pull/66


> Text content from parser is empty when NamedEntityParser is enabled
> ---
>
> Key: TIKA-1815
> URL: https://issues.apache.org/jira/browse/TIKA-1815
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Reporter: Thamme Gowda N
> Fix For: 1.12
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> When the NamedEntityParser is enabled, the Tika#parseToString() and other 
> parse() methods produce an empty string.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] tika pull request: Fix for TIKA-1815 contributed by Thamme Gowda

2015-12-19 Thread thammegowda
Github user thammegowda closed the pull request at:

https://github.com/apache/tika/pull/66


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Commented] (TIKA-1815) Text content from parser is empty when NamedEntityParser is enabled

2015-12-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15065533#comment-15065533
 ] 

ASF GitHub Bot commented on TIKA-1815:
--

GitHub user thammegowda opened a pull request:

https://github.com/apache/tika/pull/67

FIX for TIKA-1815 contributed by Thamme Gowda

+ Writing the text content to XML Document
+ Added Regex recogniser to default NER chain

Closes #66  (this is a simpler version of the same). Fixes #TIKA-1815

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/thammegowda/tika TIKA-1815

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/tika/pull/67.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #67


commit a40a18e2f61f2152fa065bda193ceb74e7e60c97
Author: Thamme Gowda 
Date:   2015-12-19T20:56:21Z

FIX for TIKA-1815 contributed by Thamme Gowda

+ Writing the text content to XML Document
+ Added Regex recogniser to default NER chain








[jira] [Commented] (TIKA-1815) Text content from parser is empty when NamedEntityParser is enabled

2015-12-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15065485#comment-15065485
 ] 

ASF GitHub Bot commented on TIKA-1815:
--

GitHub user thammegowda opened a pull request:

https://github.com/apache/tika/pull/66

Fix for TIKA-1815 contributed by Thamme Gowda

+ Outputting the text content to XMLDocumentHandler

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/thammegowda/tika fix-TIKA-1815

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/tika/pull/66.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #66


commit e96da2bc28d5eef81d034e39eb05099ed5d38ac1
Author: Thamme Gowda 
Date:   2015-10-30T21:47:45Z

Add NamedEntityParser

Add OpenNLPNERecogniser as default

commit a720507a1c1906a501470a7d5c5cec335412fcd3
Author: Thamme Gowda 
Date:   2015-10-30T22:16:11Z

Set charset for converting text to stream

commit 6b1a20e681a5d319886464ec147967c876b7e60d
Author: Thamme Gowda 
Date:   2015-10-31T04:23:43Z

Automated OpenNLP NER model downloader

commit e381ea88ebd2bb8f5adfe36d710acfce673e30aa
Author: Thamme Gowda 
Date:   2015-11-04T00:31:40Z

using a secondary parser to convert non-text streams

commit ea7871bd4afae7d18e500ffc285e58afd08f5e86
Author: Thamme Gowda 
Date:   2015-11-08T07:36:48Z

Add regex based NER

commit 084985b3612438e9ca7107fecdffd67757d04d10
Author: Thamme Gowda 
Date:   2015-11-08T07:38:17Z

Add CoreNLP NER with runtime binding

commit e4d74218ece77143d1e5245a3ef64ddf5578c310
Author: Thamme Gowda 
Date:   2015-11-08T23:41:15Z

Added support for chaining NER implementations

commit 7e6b43c83ec6cdd35ea258f52c0110ba986c82b3
Author: Thamme Gowda 
Date:   2015-11-09T05:58:58Z

charset specified

commit caba68773a287752dea43f3366e6d4309fde861c
Author: Thamme Gowda 
Date:   2015-11-10T01:34:04Z

Merge branch 'trunk' of github.com:apache/tika into trunk

commit 08b916790b279cda0201f2529ca58646dea4b2f9
Author: Thamme Gowda 
Date:   2015-11-10T19:06:29Z

Resolved Code formatting issues

+ Removed star imports
+ Removed dead code / commented code
+ Added License header to missing files

commit e07ac630d54cc79d9a7bfc9ac82332474d07434b
Author: Thamme Gowda 
Date:   2015-11-16T09:05:07Z

Add missing doc strings, fix code formatting issues

commit 96d4d7cc29d4bcd8ac0cf7a595c39b6ed64d4d19
Author: Thamme Gowda 
Date:   2015-11-18T03:03:41Z

Fix: build phase for model downloader

commit 6d0b121b8b321e8a31257fc608bb001d3fe7afb5
Author: Thamme Gowda 
Date:   2015-12-11T14:33:36Z

Merge branch 'trunk' of github.com:apache/tika into trunk

commit 66d3a10ffabf1f54cff384ce1c7325c2a3c16279
Author: Thamme Gowda 
Date:   2015-12-19T18:59:26Z

Fix : TIKA-1815 by Thamme Gowda N.

1. Writing text content to XMLContentHandler
2. Added RegexNERParser to Default parser chain








Re: looking to contribute

2015-12-19 Thread Joey Hong
Regarding TIKA-1329, I found the Tika site in the Subversion repository and ran:

    svn checkout https://svn.apache.org/repos/asf/tika/site/publish/1.11/

Since this isn't part of the main tika/trunk repository, I was wondering if I should still follow the same protocol and svn commit my changes to the site folder. In case I shouldn't, I've attached my changes to the usage examples page of the website below. I basically added how to parse documents with embedded docs using the RecursiveParserWrapper class, and how to serialize the returned Metadata list to JSON, with some description.

Thanks,
Joey

Title: Apache Tika – Tika API Usage Examples

Apache Tika API Usage Examples
This page provides a number of examples on how to use the various Tika APIs. All of the examples shown are also available in the Tika Example module in SVN.

Apache Tika API Usage Examples
  Parsing
    Parsing using the Tika Facade
    Parsing using the Auto-Detect Parser
    Parsing using the Recursive Parser Wrapper
  Picking different output formats
    Parsing to Plain Text
    Parsing to XHTML
    Fetching just certain bits of the XHTML
  Custom Content Handlers
    Extract Phone Numbers from Content into the Metadata
    Streaming the plain text in chunks
  Translation
    Translation using the Microsoft Translation API
  Language Identification
  Additional Examples

Parsing
Tika provides a number of different ways to parse a file. These provide different levels of control, flexibility, and complexity.

Parsing using the Tika Facade
The Tika facade provides a number of very quick and easy ways to have your content parsed by Tika and to get back the resulting plain text.

public String parseToStringExample() throws IOException, SAXException, TikaException {
    Tika tika = new Tika();
    try (InputStream stream = ParsingExample.class.getResourceAsStream("test.doc")) {
        return tika.parseToString(stream);
    }
}


Parsing using the Auto-Detect Parser
For more control, you can call the Tika Parsers directly. Most likely, you'll want to start out using the Auto-Detect Parser, which automatically figures out what kind of content you have, then calls the appropriate parser for you.


public String parseExample() throws IOException, SAXException, TikaException {
    AutoDetectParser parser = new AutoDetectParser();
    BodyContentHandler handler = new BodyContentHandler();
    Metadata metadata = new Metadata();
    try (InputStream stream = ParsingExample.class.getResourceAsStream("test.doc")) {
        parser.parse(stream, handler, metadata);
        return handler.toString();
    }
}
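
To make the "detect, then delegate" idea concrete, here is a minimal sketch that uses only the JDK. It is not how AutoDetectParser is implemented: the class DetectThenDispatch, its two magic-byte rules, and the toy "parsers" in the registry are all hypothetical stand-ins for Tika's much richer detection and parser lookup.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical illustration only: NOT Tika's AutoDetectParser.
final class DetectThenDispatch {

    // Guess a media type from the leading bytes (two toy rules, then a fallback).
    static String detect(byte[] data) {
        if (startsWith(data, "%PDF".getBytes(StandardCharsets.US_ASCII))) {
            return "application/pdf";
        }
        if (startsWith(data, new byte[] {'P', 'K', 3, 4})) {
            return "application/zip";
        }
        return "text/plain";
    }

    static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    // Registry mapping each detected type to the toy "parser" that handles it.
    static final Map<String, Function<byte[], String>> PARSERS = new LinkedHashMap<>();
    static {
        PARSERS.put("application/pdf", d -> "[pdf parser] " + d.length + " bytes");
        PARSERS.put("application/zip", d -> "[zip parser] " + d.length + " bytes");
        PARSERS.put("text/plain", d -> new String(d, StandardCharsets.UTF_8));
    }

    // Detect first, then delegate to the parser registered for that type.
    static String parse(byte[] data) {
        return PARSERS.get(detect(data)).apply(data);
    }

    public static void main(String[] args) {
        System.out.println(parse("hello".getBytes(StandardCharsets.UTF_8)));
        System.out.println(parse("%PDF-1.4".getBytes(StandardCharsets.US_ASCII)));
    }
}
```

The caller only ever invokes parse(); which concrete handler runs is decided from the content itself, which is the convenience the Auto-Detect Parser offers.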



Parsing using the Recursive Parser Wrapper
 When you want to parse embedded documents, you can extract content from both the enclosing document and all embedded ones by passing the parser into the ParseContext instance. 


public String parseEmbeddedExample() throws IOException, SAXException, TikaException {
    AutoDetectParser parser = new AutoDetectParser();
    BodyContentHandler handler = new BodyContentHandler();
    Metadata metadata = new Metadata();
    // Register the parser in the context so it is also used for embedded documents
    ParseContext context = new ParseContext();
    context.set(Parser.class, parser);
    try (InputStream stream = ParsingExample.class.getResourceAsStream("test_recursive_embedded.docx")) {
        // The context must be passed to parse(), otherwise it has no effect
        parser.parse(stream, handler, metadata, context);
        return handler.toString();
    }
}
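
The role the context plays here can be sketched with the JDK alone. The following is a simplified, hypothetical stand-in, not Tika's ParseContext: the Context class, the toy Parser interface, and the bracket syntax for "embedded" content are all invented for illustration. It shows a type-keyed lookup and a parser that recurses into embedded content only when a Parser has been registered.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the idea behind ParseContext; not Tika's class.
final class TypedContextSketch {

    // Stores at most one object per class or interface key.
    static final class Context {
        private final Map<Class<?>, Object> entries = new HashMap<>();

        <T> void set(Class<T> key, T value) {
            entries.put(key, value);
        }

        <T> T get(Class<T> key) {
            return key.cast(entries.get(key)); // null when nothing was registered
        }
    }

    // Toy parser interface for this sketch.
    interface Parser {
        String parse(String doc, Context context);
    }

    // Parses "outer[inner]" input; recurses into the bracketed part only if
    // the context supplies a Parser for embedded content.
    static final class BracketParser implements Parser {
        @Override
        public String parse(String doc, Context context) {
            int open = doc.indexOf('[');
            int close = doc.lastIndexOf(']');
            if (open < 0 || close < open) {
                return doc; // no embedded part
            }
            String outer = doc.substring(0, open);
            Parser embedded = context.get(Parser.class);
            if (embedded == null) {
                return outer; // embedded content is skipped
            }
            return outer + " " + embedded.parse(doc.substring(open + 1, close), context);
        }
    }

    public static void main(String[] args) {
        Parser parser = new BracketParser();
        Context empty = new Context();
        Context recursive = new Context();
        recursive.set(Parser.class, parser);
        System.out.println(parser.parse("report[attachment[image]]", empty));
        System.out.println(parser.parse("report[attachment[image]]", recursive));
    }
}
```

Keying the lookup by Class keeps the context open-ended: new kinds of helpers can be registered without changing the parse() signature.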

Alternatively, you can use the RecursiveParserWrapper, which handles passing the parser into ParseContext. This wrapper class returns a list of Metadata objects, where the first element is the metadata and content for the container document, and the rest for each embedded document. 


public List<Metadata> recursiveParserWrapperExample() throws IOException, SAXException, TikaException {
    Parser p = new AutoDetectParser();
    // HTML content for every document; -1 means no write limit
    ContentHandlerFactory factory = new BasicContentHandlerFactory(
            BasicContentHandlerFactory.HANDLER_TYPE.HTML, -1);
    RecursiveParserWrapper wrapper = new RecursiveParserWrapper(p, factory);

    Metadata metadata = new Metadata();
    metadata.set(Metadata.RESOURCE_NAME_KEY, "test_recursive_embedded.docx");
    ParseContext context = new ParseContext();
    try (InputStream stream = ParsingExample.class.getResourceAsStream("test_recursive_embedded.docx")) {
        wrapper.parse(stream, new DefaultHandler(), metadata, context);
    }
    return wrapper.getMetadata();
}
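
The shape of the returned list, one record per document with the container first, can be pictured with a small JDK-only sketch. Everything here is hypothetical and is not the RecursiveParserWrapper implementation: the Doc type, the "name" and "content" keys, and the pre-order ordering of the embedded documents are assumptions made purely for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of "one metadata record per document, container first".
final class RecursiveCollectSketch {

    // Invented document model: a name, some text, and embedded children.
    static final class Doc {
        final String name;
        final String text;
        final List<Doc> embedded;

        Doc(String name, String text, Doc... embedded) {
            this.name = name;
            this.text = text;
            this.embedded = Arrays.asList(embedded);
        }
    }

    // Returns one record per document; index 0 is always the container.
    static List<Map<String, String>> collect(Doc root) {
        List<Map<String, String>> out = new ArrayList<>();
        walk(root, out);
        return out;
    }

    // Pre-order walk: record the document itself, then each embedded document.
    private static void walk(Doc doc, List<Map<String, String>> out) {
        Map<String, String> record = new LinkedHashMap<>();
        record.put("name", doc.name);
        record.put("content", doc.text);
        out.add(record);
        for (Doc child : doc.embedded) {
            walk(child, out);
        }
    }

    public static void main(String[] args) {
        Doc root = new Doc("container.docx", "outer text",
                new Doc("embedded1.txt", "first"),
                new Doc("embedded2.zip", "", new Doc("nested.txt", "deep")));
        for (Map<String, String> record : collect(root)) {
            System.out.println(record);
        }
    }
}
```

A flat list like this is convenient to serialize, since nesting does not have to be represented in the output format.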

 The JsonMetadataList class can serialize the metadata list into JSON, and deserialize back into the list. 

public String serializedRecursiveParserWrapperExample() throws IOException, SAXException, TikaException {
    List<Metadata> metadataList = recursiveParserWrapperExample();
    StringWriter writer = new StringWriter();
    // Write the whole list as JSON into the in-memory writer
    JsonMetadataList.toJson(metadataList, writer);
    return writer.toString();
}



Picking different output formats
With Tika, you can get the textual content of your files returned in a number of different formats. These can be plain text, html, xhtml, 

[jira] [Created] (TIKA-1815) Text content from parser is empty when NamedEntityParser is enabled

2015-12-19 Thread Thamme Gowda N (JIRA)
Thamme Gowda N created TIKA-1815:


 Summary: Text content from parser is empty when NamedEntityParser 
is enabled
 Key: TIKA-1815
 URL: https://issues.apache.org/jira/browse/TIKA-1815
 Project: Tika
  Issue Type: Bug
  Components: parser
Reporter: Thamme Gowda N
 Fix For: 1.12


When the NamedEntityParser is enabled, the Tika#parseToString() and other 
parse() methods produce an empty string.
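
A minimal sketch of how this kind of symptom can arise, using plain SAX from the JDK rather than Tika code. It assumes, based only on the pull request note "Writing the text content to XML Document", that the text was analysed but never forwarded to the content handler. The class EmptyTextSketch and its methods are hypothetical.

```java
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch of the symptom using plain SAX; not Tika code.
final class EmptyTextSketch {

    // Collects whatever character data a "parser" emits.
    static final class TextCollector extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        @Override
        public String toString() {
            return text.toString();
        }
    }

    // Inspects the input but never forwards it: the handler stays empty.
    static void analyseOnly(String input, ContentHandler handler) throws SAXException {
        handler.startDocument();
        // (entity extraction would happen here; nothing is written to the handler)
        handler.endDocument();
    }

    // Inspects the input and also writes the text to the handler.
    static void analyseAndEmit(String input, ContentHandler handler) throws SAXException {
        handler.startDocument();
        char[] chars = input.toCharArray();
        handler.characters(chars, 0, chars.length);
        handler.endDocument();
    }

    // Convenience wrapper: returns the text the caller would see.
    static String extract(String input, boolean emit) {
        TextCollector collector = new TextCollector();
        try {
            if (emit) {
                analyseAndEmit(input, collector);
            } else {
                analyseOnly(input, collector);
            }
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return collector.toString();
    }

    public static void main(String[] args) {
        System.out.println("[" + extract("Paris is in France", false) + "]");
        System.out.println("[" + extract("Paris is in France", true) + "]");
    }
}
```

In this sketch the only difference between the two variants is the characters() call, which is enough to turn an empty result into the full text.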


