[jira] [Commented] (JOSHUA-299) Move regression tests to proper unit tests

2016-09-07 Thread lewis john mcgibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/JOSHUA-299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15471850#comment-15471850
 ] 

lewis john mcgibbney commented on JOSHUA-299:
-

Nope, I did not, sorry. Please go ahead!




-- 
http://home.apache.org/~lewismc/
@hectorMcSpector
http://www.linkedin.com/in/lmcgibbney


> Move regression tests to proper unit tests
> ------------------------------------------
>
> Key: JOSHUA-299
> URL: https://issues.apache.org/jira/browse/JOSHUA-299
> Project: Joshua
> Issue Type: Bug
> Reporter: Matt Post
> Assignee: Lewis John McGibbney
> Fix For: 6.1
>
>
> Many of the regression tests (test*.sh under src/test/resources) have been 
> moved to proper unit tests, but this move should be completed, and the 
> regression tests should be deleted. This should be done for 6.1



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (JOSHUA-299) Move regression tests to proper unit tests

2016-09-07 Thread Matt Post (JIRA)

[ 
https://issues.apache.org/jira/browse/JOSHUA-299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15471790#comment-15471790
 ] 

Matt Post commented on JOSHUA-299:
--

[~lewismc], did you ever look at this? [~maxthomas] or I are thinking of doing 
this this week; we just want to make sure we don't duplicate work.



Re: Language Pack English-Japanese

2016-09-07 Thread Matt Post
Hi IGA,

That would be great. 

There is also this collection of data for English/Japanese translation. If you 
collect and prepare all of this, I can then either help you build a model, or 
build it myself.

http://www.phontron.com/japanese-translation-data.php

Sincerely,
Matt



> On Aug 5, 2016, at 5:22 AM, IGA Tosiki  wrote:
> 
> Hi Matt,
> 
> I can convert those XML en-ja pairs into another format of your choice, if
> you think the pairs are useful and you want me to do so.
> 
> Regards,
> Toshiki
> 
> 2016-08-05 17:53 GMT+09:00 IGA Tosiki :
>> Hi Matt,
>> 
>> I can share my en-ja parallel data.
>> 
>> https://osdn.jp/projects/blancofw/releases/52952
>> 
>> It consists of en-to-ja translation pairs for Eclipse IDE menus and
>> messages. It was translated by humans and also checked by humans.
>> 
>> Toshiki
>> 
>> 2016-08-04 22:02 GMT+09:00 Matt Post :
>>> Hi Toshiki,
>>> 
>>> Have you been able to gather any parallel data?
>>> 
>>> matt
>>> 
>>> 
>>>> On Jul 22, 2016, at 3:50 PM, Henry Saputra wrote:
>>>> 
>>>> Hi Toshiki,
>>>> 
>>>> For this kind of discussion, let's have it in the dev@ list.
>>>> 
>>>> You can ask the question to dev@joshua.incubator.apache.org.
>>>> 
>>>> Thanks,
>>>> 
>>>> Henry
>>>> 
>>>> On Thu, Jul 21, 2016 at 9:46 PM, IGA Tosiki wrote:
>>>> 
>>>> > Hi Matt,
>>>> > 
>>>> > Thanks for your reply!
>>>> > 
>>>> > I'm happy to read your mail. I want to help you with a Japanese-English
>>>> > language pack.
>>>> > And yes, I mean translation memories in TMS/XLIFF. I could convert the
>>>> > TMS data to whatever format you specify.
>>>> > 
>>>> > I also know that English-to-Japanese translation is very difficult, but
>>>> > I believe a sample English-Japanese language pack will attract many
>>>> > Japanese people to use Joshua.
>>>> > 
>>>> > Regards,
>>>> > Toshiki
>>>> > 
>>>> > 2016-07-22 12:42 GMT+09:00 Matt Post :
>>>> >> Hi,
>>>> >> 
>>>> >> There is no Japanese--English language pack, but I would be happy to
>>>> >> build one if you could help by pointing me to data. What we need is
>>>> >> parallel data in the form of sentences that are translations of each
>>>> >> other. If you have access to this or pointers to where I could find
>>>> >> some, I would be happy to build it. There are likely standard datasets
>>>> >> available; people like Graham Neubig (http://www.phontron.com) have
>>>> >> been working on this for a while.
>>>> >> 
>>>> >> What are TMS and LTIFF? Are you talking about translation memories?
>>>> >> 
>>>> >> As a side note, translation between English and Japanese is very
>>>> >> difficult and tends not to be very good. One approach that helps is
>>>> >> translating from trees and forests. Joshua does not have this
>>>> >> capability at the moment.
>>>> >> 
>>>> >> Sincerely,
>>>> >> matt
>>>> >> 
>>>> >> 
>>>> >>> On Jul 21, 2016, at 11:28 PM, IGA Tosiki wrote:
>>>> >>> 
>>>> >>> Hi team,
>>>> >>> 
>>>> >>> I have become interested in Joshua and its language packs. I am
>>>> >>> Japanese, and I would like to know about a Japanese language pack.
>>>> >>> 
>>>> >>> Is there any plan to build a Japanese-English language pack?
>>>> >>> I believe TMS or LTIFF will be useful for building such a language
>>>> >>> pack. I have many OSS-based TMS for English-Japanese. Is there any way
>>>> >>> to use TMX or LTIFF as input for a Joshua language pack?
>>>> >>> 
>>>> >>> Best regards,
>>>> >>> Toshiki Iga
>>>> >> 
>>>> > 
>>> 
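
The thread above asks for parallel data "in the form of sentences that are
translations of each other" and mentions translation memories as a possible
source. A minimal sketch of that conversion follows, assuming the memories are
stored as TMX (tu/tuv/seg elements with an xml:lang attribute). The class and
method names are hypothetical and are not part of Joshua, and the sketch does
not handle a DOCTYPE that points at an external DTD.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not a Joshua class: pulls aligned sentence pairs out of TMX.
public class TmxToParallel {

    /** Returns {source, target} pairs for every <tu> that contains both languages. */
    public static List<String[]> extractPairs(String tmxXml, String srcLang, String tgtLang) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(tmxXml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse TMX input", e);
        }
        List<String[]> pairs = new ArrayList<>();
        NodeList units = doc.getElementsByTagName("tu");
        for (int i = 0; i < units.getLength(); i++) {
            Element tu = (Element) units.item(i);
            String src = null;
            String tgt = null;
            NodeList variants = tu.getElementsByTagName("tuv");
            for (int j = 0; j < variants.getLength(); j++) {
                Element tuv = (Element) variants.item(j);
                NodeList segs = tuv.getElementsByTagName("seg");
                if (segs.getLength() == 0) {
                    continue;
                }
                // Collapse whitespace so that each segment fits on a single line.
                String text = segs.item(0).getTextContent().replaceAll("\\s+", " ").trim();
                String lang = tuv.getAttribute("xml:lang");
                if (lang.equalsIgnoreCase(srcLang)) {
                    src = text;
                } else if (lang.equalsIgnoreCase(tgtLang)) {
                    tgt = text;
                }
            }
            // Keep only complete pairs so the two sides stay line-aligned.
            if (src != null && tgt != null && !src.isEmpty() && !tgt.isEmpty()) {
                pairs.add(new String[] {src, tgt});
            }
        }
        return pairs;
    }
}
```

Writing element 0 of each pair to one file and element 1 to another, one pair
per line, would give two line-aligned text files, which is the usual shape of
parallel training data.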



[jira] [Commented] (JOSHUA-268) Phrase-based model error (NullPointerException)

2016-09-07 Thread Matt Post (JIRA)

[ 
https://issues.apache.org/jira/browse/JOSHUA-268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15471398#comment-15471398
 ] 

Matt Post commented on JOSHUA-268:
--

Update: I have removed the (deprecated) --optimizer-runs flags from the example 
runs. I suspect that this issue will be gone with the changes that have taken 
place on the phrase-based decoder, but am testing now and will know soon.

Do you have any updates on your end?

> Phrase-based model error (NullPointerException)
> -----------------------------------------------
>
> Key: JOSHUA-268
> URL: https://issues.apache.org/jira/browse/JOSHUA-268
> Project: Joshua
> Issue Type: Bug
> Components: decoders
> Affects Versions: 6.0.5
> Environment: fedora 23
> Reporter: Kyle Richardson
> Assignee: Matt Post
> Priority: Minor
> Fix For: 6.1
>
>
> I'm trying to run the phrase.sh example script (the only modification I made 
> was to take out the --optimizer-runs option, because the system says that 
> this is an "Unknown option"). 
> The error comes at the tuning stage (specifically, it fails at some point in 
> the tuning then complains that it cannot find the "joshua.config.final" 
> file). 
> Looking into the log file (tune/joshua.log), it seems to translate and tune a 
> number of sentences, then it raises the following NullPointerException: 
> Memory used after sentence 7 is 42.5 MB
> Translation 7: -30.617 good how is fine
> Input 2: Collecting options took 0.000 seconds
> Input 8: Collecting options took 0.000 seconds
> Input 2: FATAL UNCAUGHT EXCEPTION: null
> java.lang.NullPointerException
>     at joshua.decoder.phrase.Candidate.score(Candidate.java:214)
>     at joshua.decoder.phrase.Candidate.compareTo(Candidate.java:136)
>     at joshua.decoder.phrase.Candidate.compareTo(Candidate.java:19)
>     at java.util.HashMap.compareComparables(HashMap.java:371)
>     at java.util.HashMap$TreeNode.treeify(HashMap.java:1920)
>     at java.util.HashMap.treeifyBin(HashMap.java:771)
>     at java.util.HashMap.putVal(HashMap.java:643)
>     at java.util.HashMap.put(HashMap.java:611)
>     at java.util.HashSet.add(HashSet.java:219)
>     at joshua.decoder.phrase.Stack.addCandidate(Stack.java:125)
>     at joshua.decoder.phrase.Stacks.search(Stacks.java:166)
>     at joshua.decoder.DecoderThread.translate(DecoderThread.java:113)
>     at joshua.decoder.Decoder$DecoderThreadRunner.run(Decoder.java:218)
> There's nothing informative in the tune/mert.log; it just says that it exited 
> prematurely. The other processes seem to work as expected (although in the 
> giza.log, there are a number of "Sentence mismatch error! Line " warnings). 
> I'm running this on Fedora 23 with Moses. I had no problems training the 
> hiero model.
> Note:
> There appears to be an open ticket for more or less the same problem 
> (JOSHUA-267). The difference, however, is that in that ticket, it 
> appears that the tuner fails on the first input, whereas here, it already 
> decodes/tunes several inputs before failing (see above). 
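
The stack trace in the description shows java.util.HashMap calling
Candidate.compareTo while it converts a crowded hash bucket into a tree
(treeifyBin, treeify, compareComparables). The sketch below illustrates only
that general mechanism; it is not a diagnosis of the Joshua bug. The Key class
is a hypothetical stand-in for a Comparable hash key and is not Joshua's
Candidate. It assumes Java 8 or later, where a bucket is treeified once enough
colliding keys accumulate, so a compareTo that dereferences a null field can
fail only after several successful insertions.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical demo, not Joshua code: a compareTo that is reached via HashMap treeification.
public class TreeifyNpeDemo {

    static final class Key implements Comparable<Key> {
        final Float score;        // may be null, which is the hazard
        final boolean nullSafe;   // selects the guarded comparison

        Key(Float score, boolean nullSafe) {
            this.score = score;
            this.nullSafe = nullSafe;
        }

        @Override
        public int hashCode() {
            return 42;            // force every key into the same bucket
        }

        @Override
        public int compareTo(Key other) {
            if (!nullSafe) {
                // Unboxing a null Float here throws NullPointerException.
                return Float.compare(other.score, this.score);
            }
            // Guarded variant: treat a missing score as the worst possible score.
            float mine = (score == null) ? Float.NEGATIVE_INFINITY : score;
            float theirs = (other.score == null) ? Float.NEGATIVE_INFINITY : other.score;
            return Float.compare(theirs, mine);
        }
    }

    /** Returns true if all insertions succeed, false if a NullPointerException escapes. */
    public static boolean insertAll(boolean nullSafe) {
        // Capacity 64 lets HashMap treeify a crowded bucket instead of resizing.
        Set<Key> set = new HashSet<>(64);
        try {
            for (int i = 0; i < 20; i++) {
                Float score = (i == 3) ? null : Float.valueOf(i);
                set.add(new Key(score, nullSafe));
            }
            return true;
        } catch (NullPointerException e) {
            return false;
        }
    }
}
```

In this sketch the key with the null score is the fourth one inserted, yet the
unguarded variant fails only on a later add(), once the bucket is converted to
a tree and compareTo is first invoked.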





[jira] [Commented] (JOSHUA-289) Fix output formatting

2016-09-07 Thread Matt Post (JIRA)

[ 
https://issues.apache.org/jira/browse/JOSHUA-289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15471376#comment-15471376
 ] 

Matt Post commented on JOSHUA-289:
--

This is a larger sub-issue, and I've decided to move it to the 7 release.

> Fix output formatting
> ---------------------
>
> Key: JOSHUA-289
> URL: https://issues.apache.org/jira/browse/JOSHUA-289
> Project: Joshua
> Issue Type: Improvement
> Reporter: Matt Post
> Assignee: Matt Post
> Fix For: 6.2
>
>
> This is a sub ticket of JOSHUA-273.
> Joshua output formatting is a mess. The StructuredTranslation piece is a good 
> step in the right direction, but many problems remain. Here is a list of 
> problems and corrections.
> - There are currently four variables that define separate paths for 
> formatting the output: server mode (two different types) versus regular 
> mode, whether use_structured_translations is set, whether topN == 0 (i.e., 
> whether we are outputting k-best or just the quick Viterbi best), and whether 
> we are projecting case or denormalizing the output.
> - In TCP mode, ServerThread.java.run() iterates over Translation objects 
> returned by Translations. Translation.toString() is then called. %S and 
> recasing are applied.
> - In HTTP mode, ServerThread.java.handle() builds a JSONMessage, which in 
> turn calls 
> translation.getStructuredTranslations.get(0).getTranslationString(). No 
> recasing or %S formatting are applied.
> - In regular mode, we call Translation.toString(), whose output is formatted 
> in a complicated way in the constructor, using different methods depending on 
> whether (a) use_structured_translations is set and (b) topN == 0. This is a 
> veritable mess of nested, redundant output formatting. Some of these methods 
> in turn use separate formatting applied in KBestExtractor's constructor.
> Suggestions:
> - Get rid of topN==0. Viterbi extraction should be quicker than k-best and is 
> used automatically if possible. The same output formatting should apply in 
> either case.
> - We should always use structured outputs, even collapsing 
> StructuredTranslation into Translation.
> - Move all output formatting out of KBestExtractor. This should just return 
> k-best items.
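
The suggestions above amount to one formatting path that every mode shares,
with Viterbi output treated as a k-best list of size one. A rough sketch of
that shape follows, under stated assumptions: SingleFormatPath, Hyp, and
format are hypothetical names, and the " ||| " delimited line layout is chosen
only for illustration. None of this is the existing Joshua API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.UnaryOperator;

// Hypothetical sketch, not Joshua code: a single formatting routine for all output modes.
public class SingleFormatPath {

    /** Stand-in for a merged Translation/StructuredTranslation result. */
    public static final class Hyp {
        final int sentenceId;
        final String text;
        final double score;

        public Hyp(int sentenceId, String text, double score) {
            this.sentenceId = sentenceId;
            this.text = text;
            this.score = score;
        }
    }

    /**
     * Formats k-best items (k = 1 for Viterbi) the same way regardless of mode.
     * postProcess stands for steps such as recasing or denormalization.
     */
    public static List<String> format(List<Hyp> kbest, UnaryOperator<String> postProcess) {
        List<String> lines = new ArrayList<>();
        for (Hyp h : kbest) {
            lines.add(h.sentenceId + " ||| " + postProcess.apply(h.text)
                    + " ||| " + String.format(Locale.ROOT, "%.3f", h.score));
        }
        return lines;
    }
}
```

With this arrangement the extractor only returns items, and regular, TCP, and
HTTP modes differ only in how they deliver the lines that format() produces.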





[jira] [Updated] (JOSHUA-289) Fix output formatting

2016-09-07 Thread Matt Post (JIRA)

 [ 
https://issues.apache.org/jira/browse/JOSHUA-289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matt Post updated JOSHUA-289:
-----------------------------
Fix Version/s: 6.2 (was: 6.1)
