[jira] [Commented] (TIKA-1343) Create a Tika Translator implementation that uses JoshuaDecoder

2016-04-27 Thread ASF GitHub Bot (JIRA)

[ https://issues.apache.org/jira/browse/TIKA-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15261039#comment-15261039 ]

ASF GitHub Bot commented on TIKA-1343:
--

GitHub user lewismc opened a pull request:

https://github.com/apache/tika/pull/112

TIKA-1343 Create a Tika Translator implementation that uses JoshuaDecoder

This pull request is this afternoon's first attempt at addressing the long-overdue 
https://issues.apache.org/jira/browse/TIKA-1343

It also removes unused imports and other unneeded material from the 
other translator implementations. 
This has not been extensively tested. I will be testing it more tomorrow, in 
particular debugging the JSON response message and the REST API request. 

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/lewismc/tika TIKA-1343

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/tika/pull/112.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #112


commit d4fb28f91d77458b15557942438f874b9f564e88
Author: Lewis John McGibbney 
Date:   2016-04-27T22:06:42Z

TIKA-1343 Create a Tika Translator implementation that uses JoshuaDecoder




> Create a Tika Translator implementation that uses JoshuaDecoder
> ---
>
>                 Key: TIKA-1343
>                 URL: https://issues.apache.org/jira/browse/TIKA-1343
>             Project: Tika
>          Issue Type: New Feature
>          Components: translation
>            Reporter: Chris A. Mattmann
>            Assignee: Chris A. Mattmann
>             Fix For: 1.14
>
>
> The Joshua Decoder toolkit is a BSD-licensed, Java-based statistical machine 
> translation system hosted on GitHub:
> http://joshua-decoder.org/
> Joshua takes in corpora and trains models that can then be used for 
> language translation. It currently supports, e.g., Spanish->English, 
> Indian dialects->English, Chinese->English, and a few others. 
> https://github.com/joshua-decoder/joshua/
> It would be nice to build a Tika Translator on top of Joshua. There are of 
> course several issues with this:
> * the models are huge, so we'll need a separate package or Maven module 
> (maybe tika-translate-joshua or something) to release the models, and we'll 
> need to build the models. I just went through the process of building the 
> Spanish->English one, which took over a day, and it still needs to be 
> rebuilt b/c I did it wrong
> * there is a configuration for Joshua, and so we need some way of passing 
> that config into the Translator. Not sure of the best way to do this.
> * Joshua isn't in the Central repository. I've started a discussion on the 
> Joshua lists about this: 
> https://groups.google.com/forum/#!topic/joshua_support/9Y04miboUj0
> Anyhoo, I've got a working patch right now with hard-coded values, and a manual 
> install into my Maven repo for brave souls out there that want to try it.
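
On the open question of how to pass the Joshua configuration into the Translator: one possible approach is to read the settings from a properties file on the classpath. The sketch below is only an illustration under stated assumptions. The class name JoshuaTranslatorConfig, the property keys joshua.server.url and joshua.model.dir, and the resource name used in the usage note are all hypothetical; none of them come from the patch, from Tika, or from Joshua.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Hypothetical settings holder; not part of Tika or Joshua. */
class JoshuaTranslatorConfig {
    // Hypothetical property names, chosen for illustration only.
    static final String SERVER_URL_KEY = "joshua.server.url";
    static final String MODEL_DIR_KEY = "joshua.model.dir";

    private final String serverUrl;
    private final String modelDir;

    JoshuaTranslatorConfig(Properties props) {
        // Missing keys fall back to an empty string so callers never see null.
        this.serverUrl = props.getProperty(SERVER_URL_KEY, "").trim();
        this.modelDir = props.getProperty(MODEL_DIR_KEY, "").trim();
    }

    /** Loads settings from a classpath resource; a missing resource yields an empty config. */
    static JoshuaTranslatorConfig fromClasspath(String resourceName) {
        Properties props = new Properties();
        try (InputStream in = JoshuaTranslatorConfig.class
                .getClassLoader().getResourceAsStream(resourceName)) {
            if (in != null) {
                props.load(in);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new JoshuaTranslatorConfig(props);
    }

    /** A translator could report itself unavailable when nothing is configured. */
    boolean isAvailable() {
        return !serverUrl.isEmpty() || !modelDir.isEmpty();
    }

    String getServerUrl() {
        return serverUrl;
    }

    String getModelDir() {
        return modelDir;
    }
}
```

A translator built this way could call JoshuaTranslatorConfig.fromClasspath("translator.joshua.properties") in its constructor (the file name is again an assumption) and return the result of isAvailable() from its own availability check, so that a missing configuration disables the translator instead of failing at translation time.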



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (TIKA-1962) Support Topic Modeling in Tika

2016-04-27 Thread Madhawa Gunasekara (JIRA)
Madhawa Gunasekara created TIKA-1962:


             Summary: Support Topic Modeling in Tika
                 Key: TIKA-1962
                 URL: https://issues.apache.org/jira/browse/TIKA-1962
             Project: Tika
          Issue Type: New Feature
            Reporter: Madhawa Gunasekara








[jira] [Commented] (TIKA-1938) HtmlParser drops

2016-04-27 Thread ASF GitHub Bot (JIRA)

[ https://issues.apache.org/jira/browse/TIKA-1938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15260771#comment-15260771 ]

ASF GitHub Bot commented on TIKA-1938:
--

GitHub user naegelejd opened a pull request:

https://github.com/apache/tika/pull/111

fix for TIKA-1938 contributed by naegelejd

Adds HtmlParser support for  tags within 

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/naegelejd/tika TIKA-1938

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/tika/pull/111.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #111


commit b6d23c189e852fa2e41b441c18bfe3e66e3f67c4
Author: Joseph Naegele 
Date:   2016-04-27T18:35:11Z

fix for TIKA-1938 contributed by naegelejd

add HtmlParser support for