[jira] [Commented] (JOSHUA-299) Move regression tests to proper unit tests
[ https://issues.apache.org/jira/browse/JOSHUA-299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15471850#comment-15471850 ]

lewis john mcgibbney commented on JOSHUA-299:
---------------------------------------------

Nope, did not, sorry. Please progress!

--
http://home.apache.org/~lewismc/
@hectorMcSpector
http://www.linkedin.com/in/lmcgibbney

> Move regression tests to proper unit tests
> ------------------------------------------
>
>                 Key: JOSHUA-299
>                 URL: https://issues.apache.org/jira/browse/JOSHUA-299
>             Project: Joshua
>          Issue Type: Bug
>            Reporter: Matt Post
>            Assignee: Lewis John McGibbney
>             Fix For: 6.1
>
>
> Many of the regression tests (test*.sh under src/test/resources) have been
> moved to proper unit tests, but this move should be completed, and the
> regression tests should be deleted. This should be done for 6.1.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (JOSHUA-299) Move regression tests to proper unit tests
[ https://issues.apache.org/jira/browse/JOSHUA-299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15471790#comment-15471790 ]

Matt Post commented on JOSHUA-299:
----------------------------------

[~lewismc], did you ever look at this? Me or [~maxthomas] are thinking of doing this this week, just want to make sure we don't duplicate work.
Re: Language Pack English-Japanese
Hi IGA,

That would be great. There is also this collection of data for English/Japanese translation. If you collect and prepare all of this, I can then either help you build a model, or build it myself.

http://www.phontron.com/japanese-translation-data.php

Sincerely,
Matt

> On Aug 5, 2016, at 5:22 AM, IGA Tosiki wrote:
>
> Hi Matt,
>
> I can convert those XML en-ja pair into other format as you point, if
> you think the pairs are useful, and if you want to do so.
>
> Regards,
> Toshiki
>
> 2016-08-05 17:53 GMT+09:00 IGA Tosiki:
>> Hi Matt,
>>
>> I can share my en-ja parallel data.
>>
>> https://osdn.jp/projects/blancofw/releases/52952
>>
>> It is pair that translation en to ja for Eclipse IDE menu and
>> messages. It is translated by human and also checked by human.
>>
>> Toshiki
>>
>> 2016-08-04 22:02 GMT+09:00 Matt Post:
>>> Hi Toshiki,
>>>
>>> Have you been able to gather any parallel data?
>>>
>>> matt
>>>
>>> On Jul 22, 2016, at 3:50 PM, Henry Saputra wrote:
>>>>
>>>> Hi Toshiki,
>>>>
>>>> For this kind of discussion, let's have it in the dev@ list.
>>>> You can ask the question to dev@joshua.incubator.apache.org.
>>>>
>>>> Thanks,
>>>> Henry
>>>>
>>>> On Thu, Jul 21, 2016 at 9:46 PM, IGA Tosiki wrote:
>>>>> Hi Matt,
>>>>>
>>>>> Thanks for your reply!
>>>>>
>>>>> I'm happy to read your mail, I want to help you Japanese-English language
>>>>> pack.
>>>>> And YES, I mean translation memories by TMS/XLIFF. But I may convert
>>>>> TMS to what you specified format.
>>>>>
>>>>> And also I knew English to Japanese is very difficult, but also I
>>>>> believe sample of English-Japanese language pack will attract many
>>>>> Japanese people to use Joshua.
>>>>>
>>>>> Regards,
>>>>> Toshiki
>>>>>
>>>>> 2016-07-22 12:42 GMT+09:00 Matt Post:
>>>>>> Hi,
>>>>>>
>>>>>> There is no Japanese--English language pack, but I would be happy to
>>>>>> build one if you could help by pointing me to data. What we need is
>>>>>> parallel data in the form of sentences that are translations of each
>>>>>> other. If you have access to this or pointers to where I could find
>>>>>> some, I would be happy to build it. There are likely standard datasets
>>>>>> available; people like Graham Neubig (http://www.phontron.com) have
>>>>>> been working on this for a while.
>>>>>>
>>>>>> What are TMS and LTIFF? Are you talking about translation memories?
>>>>>>
>>>>>> As a side note, translation between English and Japanese is very
>>>>>> difficult and tends not to be very good. One approach that helps is
>>>>>> translating from trees and forests. Joshua does not have this
>>>>>> capability at the moment.
>>>>>>
>>>>>> Sincerely,
>>>>>> matt
>>>>>>
>>>>>>
>>>>>>> On Jul 21, 2016, at 11:28 PM, IGA Tosiki wrote:
>>>>>>>
>>>>>>> Hi team,
>>>>>>>
>>>>>>> I got interest about Joshua, and language pack. I am Japanese, and I
>>>>>>> want to know around Japanese language pack.
>>>>>>>
>>>>>>> Is there any plan about building Japanese-English language pack?
>>>>>>> I believe TMS or LTIFF will useful to building such language pack. I
>>>>>>> have many OSS based TMS between English-Japanese. Is there any path
>>>>>>> using TMX or LTIFF for input of Joshua language pack?
>>>>>>>
>>>>>>> Best regards,
>>>>>>> Toshiki Iga
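The conversion discussed in this thread (translation-memory XML into sentence-aligned parallel data) could look roughly like the sketch below. It is only an illustration under several assumptions: that the input is TMX-style XML with <tu>/<tuv>/<seg> elements and a language attribute on each <tuv> (the actual layout of the blancofw release is not shown in the thread), and that the desired output is two line-aligned plain-text files. The class name TmxToParallel and both method names are made up for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Joshua.
public class TmxToParallel {

    /** Returns {source, target} pairs for each <tu> that has a segment in both languages. */
    public static List<String[]> extractPairs(String tmxXml, String srcLang, String tgtLang) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(tmxXml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse TMX input", e);
        }
        List<String[]> pairs = new ArrayList<>();
        NodeList units = doc.getElementsByTagName("tu");
        for (int i = 0; i < units.getLength(); i++) {
            String src = null;
            String tgt = null;
            NodeList variants = ((Element) units.item(i)).getElementsByTagName("tuv");
            for (int j = 0; j < variants.getLength(); j++) {
                Element tuv = (Element) variants.item(j);
                NodeList segs = tuv.getElementsByTagName("seg");
                if (segs.getLength() == 0) {
                    continue;
                }
                // Collapse whitespace so each segment fits on exactly one output line.
                String text = segs.item(0).getTextContent().replaceAll("\\s+", " ").trim();
                // Assumption: the language is in xml:lang, with plain "lang" as a fallback.
                String lang = tuv.getAttribute("xml:lang");
                if (lang.isEmpty()) {
                    lang = tuv.getAttribute("lang");
                }
                if (src == null && matches(lang, srcLang)) {
                    src = text;
                } else if (tgt == null && matches(lang, tgtLang)) {
                    tgt = text;
                }
            }
            // Skip units that lack either side; the two outputs must stay aligned.
            if (src != null && tgt != null && !src.isEmpty() && !tgt.isEmpty()) {
                pairs.add(new String[] {src, tgt});
            }
        }
        return pairs;
    }

    /** Writes the pairs to two files so that line N of one translates line N of the other. */
    public static void write(List<String[]> pairs, Path srcOut, Path tgtOut) throws IOException {
        List<String> srcLines = new ArrayList<>();
        List<String> tgtLines = new ArrayList<>();
        for (String[] pair : pairs) {
            srcLines.add(pair[0]);
            tgtLines.add(pair[1]);
        }
        Files.write(srcOut, srcLines, StandardCharsets.UTF_8);
        Files.write(tgtOut, tgtLines, StandardCharsets.UTF_8);
    }

    // Treats "en" as matching "en", "EN" and regional variants such as "en-US".
    private static boolean matches(String lang, String wanted) {
        String l = lang.toLowerCase(Locale.ROOT);
        String w = wanted.toLowerCase(Locale.ROOT);
        return l.equals(w) || l.startsWith(w + "-");
    }
}
```

Dropping incomplete units rather than writing an empty line keeps the two files the same length, which is what sentence-aligned training data needs.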
[jira] [Commented] (JOSHUA-268) Phrase-based model error (NullPointerException)
[ https://issues.apache.org/jira/browse/JOSHUA-268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15471398#comment-15471398 ]

Matt Post commented on JOSHUA-268:
----------------------------------

Update: I have removed the (deprecated) --optimizer-runs flags from the example runs. I suspect that this issue will be gone with the changes that have taken place on the phrase-based decoder, but am testing now and will know soon.

Do you have any updates on your end?

> Phrase-based model error (NullPointerException)
> -----------------------------------------------
>
>                 Key: JOSHUA-268
>                 URL: https://issues.apache.org/jira/browse/JOSHUA-268
>             Project: Joshua
>          Issue Type: Bug
>          Components: decoders
>    Affects Versions: 6.0.5
>         Environment: fedora 23
>            Reporter: Kyle Richardson
>            Assignee: Matt Post
>            Priority: Minor
>             Fix For: 6.1
>
>
> I'm trying to run the phrase.sh example script (the only modification I made
> was to take out the --optimizer-runs option, because the system says that
> this is an "Unknown option").
> The error comes at the tuning stage (specifically, it fails at some point in
> the tuning then complains that it cannot find the "joshua.config.final"
> file).
> Looking into the log file (tune/joshua.log), it seems to translate and tune a
> number of sentences, then it raises the following NullPointerException:
> Memory used after sentence 7 is 42.5 MB
> Translation 7: -30.617 good how is fine
> Input 2: Collecting options took 0.000 seconds
> Input 8: Collecting options took 0.000 seconds
> Input 2: FATAL UNCAUGHT EXCEPTION: null
> java.lang.NullPointerException
>     at joshua.decoder.phrase.Candidate.score(Candidate.java:214)
>     at joshua.decoder.phrase.Candidate.compareTo(Candidate.java:136)
>     at joshua.decoder.phrase.Candidate.compareTo(Candidate.java:19)
>     at java.util.HashMap.compareComparables(HashMap.java:371)
>     at java.util.HashMap$TreeNode.treeify(HashMap.java:1920)
>     at java.util.HashMap.treeifyBin(HashMap.java:771)
>     at java.util.HashMap.putVal(HashMap.java:643)
>     at java.util.HashMap.put(HashMap.java:611)
>     at java.util.HashSet.add(HashSet.java:219)
>     at joshua.decoder.phrase.Stack.addCandidate(Stack.java:125)
>     at joshua.decoder.phrase.Stacks.search(Stacks.java:166)
>     at joshua.decoder.DecoderThread.translate(DecoderThread.java:113)
>     at joshua.decoder.Decoder$DecoderThreadRunner.run(Decoder.java:218)
> There's nothing informative in the tune/mert.log, it just says that it exited
> prematurely. The other processes seem to work as expected (although in the
> giza.log, there are a number of "Sentence mismatch error! Line " warnings).
> I'm running this on Fedora 23 with Moses. I had no problems training the
> hiero model.
> note---
> There appears to be an open ticket for more or less the same problem
> (JOSHUA-267); the difference, however, is that in that ticket it
> appears that the tuner fails on the first input, whereas here, it already
> decodes/tunes several inputs before failing (see above).
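One possible reading of the stack trace in JOSHUA-268: HashSet.add reaches Candidate.compareTo only through HashMap.treeifyBin, which OpenJDK's HashMap runs once a single bucket has collected enough entries, and it calls compareTo on Comparable keys whose hash codes are equal. That would be consistent with the failure showing up only after several inputs have been decoded. The sketch below reproduces the mechanism in isolation. Item is a hypothetical stand-in and not Joshua's Candidate class (whose source is not shown in the ticket); the constant hashCode and the null score field are assumptions chosen to force the behaviour.

```java
import java.util.HashSet;
import java.util.Set;

public class TreeifyNpeDemo {

    /** Hypothetical key type; mimics a comparable object with a field that may be null. */
    static final class Item implements Comparable<Item> {
        private final Float score; // null stands in for state that was never initialised

        Item(Float score) {
            this.score = score;
        }

        @Override
        public int hashCode() {
            return 42; // every Item collides, so all of them share one bucket
        }

        @Override
        public int compareTo(Item other) {
            // Unboxing a null Float throws NullPointerException.
            return Float.compare(other.score, this.score);
        }
    }

    /** Adds count colliding items; returns the index of the first add that throws, or -1. */
    public static int firstFailingInsert(int count) {
        Set<Item> set = new HashSet<>();
        for (int i = 0; i < count; i++) {
            try {
                set.add(new Item(null));
            } catch (NullPointerException e) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // A small bucket stays a linked list, so compareTo is never called.
        System.out.println("5 items:  " + firstFailingInsert(5));
        // A large bucket is converted to a tree, which calls compareTo and fails.
        System.out.println("64 items: " + firstFailingInsert(64));
    }
}
```

If the real cause is similar, the add itself is not at fault: the comparison simply runs later than expected, and any compareTo used by a hashed collection has to tolerate whatever state its operands can be in at insertion time.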
[jira] [Commented] (JOSHUA-289) Fix output formatting
[ https://issues.apache.org/jira/browse/JOSHUA-289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15471376#comment-15471376 ]

Matt Post commented on JOSHUA-289:
----------------------------------

This is a larger sub-issue, and I've decided to move it to the 7 release.

> Fix output formatting
> ---------------------
>
>                 Key: JOSHUA-289
>                 URL: https://issues.apache.org/jira/browse/JOSHUA-289
>             Project: Joshua
>          Issue Type: Improvement
>            Reporter: Matt Post
>            Assignee: Matt Post
>             Fix For: 6.2
>
>
> This is a sub ticket of JOSHUA-273.
> Joshua output formatting is a mess. The StructuredTranslation piece is a good
> step in the right direction, but many problems remain. Here is a list of
> problems and corrections.
> - There are currently four variables that contribute to defining separate
> paths for formatting the output: server mode (two different types) or regular
> mode, whether use_structured_translations is set, whether topN == 0 (i.e.,
> whether we are outputting k-best or just quick viterbi best), and whether we
> are projecting case or denormalizing the output.
> - In TCP mode, ServerThread.java.run() iterates over Translation objects
> returned by Translations. Translation.toString() is then called. %S and
> recasing are applied.
> - In HTTP mode, ServerThread.java.handle() builds a JSONMessage, which in
> turn calls
> translation.getStructuredTranslations.get(0).getTranslationString(). No
> recasing or %S formatting is applied.
> - In regular mode, we call Translation.toString(), which formats output in a
> complicated way in the constructor, using different methods depending on
> whether (a) use_structured_translations is set (b) topN == 0. This is a
> veritable mess of nested redundant output formatting. Some of these in turn
> use separate formatting applied in KBestExtractor's constructor.
> Suggestions:
> - Get rid of topN==0. Viterbi extraction should be quicker than k-best and is
> used automatically if possible. The same output formatting should apply in
> either case.
> - We should always use structured outputs, even collapsing
> StructuredTranslation into Translation
> - Move all output formatting out of KBestExtractor. This should just return
> k-best items.
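The suggestions in JOSHUA-289 amount to having one formatting path that is used whether the decoder produced a single Viterbi hypothesis or a k-best list. A minimal sketch of that idea follows. None of these names exist in Joshua: SingleFormatPath, Hypothesis, and the "id ||| text ||| score" layout are assumptions made up for illustration, and capitalising the first letter merely stands in for whatever recasing step the real output would apply.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch; not Joshua's Translation or KBestExtractor.
public class SingleFormatPath {

    /** Plain data: the extractor would return these and do no formatting of its own. */
    public static final class Hypothesis {
        final String text;
        final double score;

        public Hypothesis(String text, double score) {
            this.text = text;
            this.score = score;
        }
    }

    /** Formats one line per hypothesis; a Viterbi result is just a list of size one. */
    public static List<String> format(int sentenceId, List<Hypothesis> hyps, boolean recase) {
        List<String> lines = new ArrayList<>();
        for (Hypothesis h : hyps) {
            String text = recase ? capitalizeFirst(h.text) : h.text;
            // Locale.ROOT keeps the decimal separator independent of the machine's locale.
            lines.add(String.format(Locale.ROOT, "%d ||| %s ||| %.3f", sentenceId, text, h.score));
        }
        return lines;
    }

    private static String capitalizeFirst(String s) {
        return s.isEmpty() ? s : Character.toUpperCase(s.charAt(0)) + s.substring(1);
    }

    public static void main(String[] args) {
        List<Hypothesis> best = new ArrayList<>();
        best.add(new Hypothesis("good how is fine", -30.617));
        for (String line : format(7, best, true)) {
            System.out.println(line);
        }
    }
}
```

With this shape, a TCP handler, an HTTP handler, and the command-line path could all call the same format method, so options such as recasing would behave identically in every mode.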
[jira] [Updated] (JOSHUA-289) Fix output formatting
[ https://issues.apache.org/jira/browse/JOSHUA-289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matt Post updated JOSHUA-289:
-----------------------------
    Fix Version/s:     (was: 6.1)
                   6.2