[Moses-support] Fwd: TweetMT @ SEPLN 2015
Apologies for multiple postings.

TweetMT 2015 -- Tweet Translation Workshop at SEPLN 2015

TweetMT is a workshop and shared task on machine translation applied to tweets. It will take place in September 2015 in Alicante, co-located with SEPLN 2015 (to be confirmed). The objective of the task is to bring together interested researchers to join forces to experiment with and compare different approaches to tweet MT. This workshop is a follow-up to two workshops previously organized at SEPLN: TweetNorm 2013 and TweetLID 2014.

The machine translation of tweets is a complex task that depends greatly on the type of data we work with. Translating tweets is very different from translating well-formed texts posted, for instance, through a content manager. Tweets are often written on mobile devices, which exacerbates poor spelling, and they include errors, symbols and diacritics. The texts also vary in structure, featuring tweet-specific elements such as hashtags, user mentions and retweets, among others.

The translation of tweets can be tackled as a direct translation (tweet-to-tweet) or as an indirect translation (tweet normalization to standard text (Kaufmann & Kalita, 2011), text translation and, if needed, tweet generation). Although the first approach looks attractive, the lack of parallel or comparable tweets for the working languages (Petrovic et al., 2010) tends to lead us towards an indirect approach. Some authors also try to gather similar tweets in other languages (CLIR). Work in this area is scarce in the literature, but a growing interest is evident (Gotti et al., 2013). An important point of reference is the work done to translate SMS texts during the Haiti earthquake (Munro, 2010).

The current task will focus on MT of tweets between languages of the Iberian Peninsula (Basque, Catalan, Galician, Portuguese and Spanish), as well as English.
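The indirect route mentioned above (normalize, translate, regenerate) typically starts by shielding tweet-specific tokens from the MT system. The following is a minimal sketch of that masking step; the function and placeholder names are our own illustration, not part of any shared-task tooling:

```python
import re

# Hypothetical pre-translation step for the indirect approach: mask user
# mentions, hashtags and URLs with numbered placeholders so a standard MT
# system never sees them, then restore them after translation.
def mask_entities(tweet):
    """Replace mentions, hashtags and URLs with __ENTn__ placeholders."""
    pattern = re.compile(r"https?://\S+|[@#]\w+")
    entities = []
    def repl(match):
        entities.append(match.group(0))
        return f"__ENT{len(entities) - 1}__"
    return pattern.sub(repl, tweet), entities

def unmask_entities(text, entities):
    """Restore the original tokens after translation."""
    for i, entity in enumerate(entities):
        text = text.replace(f"__ENT{i}__", entity)
    return text

masked, entities = mask_entities("@maria mira esto #MT http://t.co/abc")
# masked == "__ENT0__ mira esto __ENT1__ __ENT2__"
```

The masked text can then go through normalization and a conventional MT system, with the placeholders re-inserted in the output.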
The organizing committee will release development data including parallel tweets to enable participants to train their systems. For the final evaluation, participants will have to submit automatic translations of a number of tweet corpora within a short period of time. The evaluation will be carried out using automatic distances to the reference corpora. These corpora are not meant to be representative of all types of messages observed in informal communication; this is an initial attempt that starts by addressing one of the simplest parts of the task. We are planning to use more informal and varied corpora in future tasks as we make progress on these initial issues. The workshop aims to be a forum where researchers can compare their methods, systems and results.

Important dates
- March 1: Registration opens
- April 17: Release of the development set
- May 12: Registration deadline
- May 19: Release of the test set
- May 21: Result submission deadline
- May 22 - June 12: Manual evaluation. Publication of results
- July 3: Short paper submission deadline
- July 31: Camera-ready version of papers
- September 14 or 15: Workshop

Organizing Committee
Iñaki Alegria (UPV/EHU)
Nora Aranberri (UPV/EHU)
Cristina España-Bonet (UPC)
Pablo Gamallo (USC)
Eva Martínez (UPC)
Hugo Oliveira (Universidade de Coimbra)
Iñaki San Vicente (Elhuyar)
Antonio Toral (DCU, Dublin)
Arkaitz Zubiaga (University of Warwick)

Proceedings
The papers of the workshop will be published in the proceedings of "XXXI Congreso de la Sociedad Española de Procesamiento de Lenguaje Natural". Proceedings of the workshop will also be published in the CEUR Workshop Proceedings digital publication service.

Additional information
http://komunitatea.elhuyar.org/tweetmt

___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support
Re: [Moses-support] Problem
The xmlrpc-c problems have been solved, so please git pull and run bjam again.

On 20/03/2015 13:32, Hieu Hoang wrote:
> There's a problem with compiling with xmlrpc-c at the moment:
> https://github.com/moses-smt/mosesdecoder/issues/99
> It's being looked at, but in the meantime, try compiling without xmlrpc-c.

On 20/03/2015 03:31, qinmaoyuan wrote:
> hi everyone, I am new to mosesdecoder and am confused by a problem.
>
> 1. I installed Ubuntu 14.04.02 Kylin in VMware Workstation, plus these packages: g++ git subversion automake libtool zlib1g-dev libboost-all-dev libbz2-dev liblzma-dev python-dev libtcmalloc-minimal4
>
> 2. Then I installed boost_1_56_0.tar.gz, which I downloaded myself, unzipped into my directory and installed successfully with:
>
>   cd boost_1_56_0/
>   ./bootstrap.sh
>   ./b2 -j2 --prefix=$PWD --libdir=$PWD/lib64 --layout=tagged link=static threading=multi,single install || echo FAILURE
>
> 3. Installed xmlrpc-c, which I downloaded instead of using apt-get:
>
>   wget http://svn.code.sf.net/p/xmlrpc-c/code
>   REPOS=http://svn.code.sf.net/p/xmlrpc-c/code/stable
>   svn checkout $REPOS xmlrpc-c
>   ./configure --prefix=/usr/local/lib/xml-rpc
>   make
>   make install
>
> 4. Installed mosesdecoder, cloned from GitHub:
>
>   git clone https://github.com/moses-smt/mosesdecoder.git
>   cd /usr/local/lib/mosesdecoder/
>   ./bjam --with-xmlrpc-c=/usr/local/lib/xml-rpc
>
> Here the system always reports a build error. Terminal output:
>
> Please help me. Kind regards, Qin
Re: [Moses-support] Chinese segmentation/tokenization
We’ve had reasonable luck with the Stanford Chinese segmenter - I think the ctb model did better than the pku one for our use case.

On Fri, 20 Mar 2015 13:19:02 +0100, Marcin Junczys-Dowmunt wrote:
> Hi, questions appear from time to time on the list concerning Chinese segmentation/tokenization. I saw Barry mention Lingpipe and other tools. Is there a favourite tool you guys prefer to use over others?
> Thanks, Marcin
Re: [Moses-support] Problem
There's a problem with compiling with xmlrpc-c at the moment:
https://github.com/moses-smt/mosesdecoder/issues/99
It's being looked at, but in the meantime, try compiling without xmlrpc-c.

On 20/03/2015 03:31, qinmaoyuan wrote:
> hi everyone, I am new to mosesdecoder and am confused by a problem.
>
> 1. I installed Ubuntu 14.04.02 Kylin in VMware Workstation, plus these packages: g++ git subversion automake libtool zlib1g-dev libboost-all-dev libbz2-dev liblzma-dev python-dev libtcmalloc-minimal4
>
> 2. Then I installed boost_1_56_0.tar.gz, which I downloaded myself, unzipped into my directory and installed successfully with:
>
>   cd boost_1_56_0/
>   ./bootstrap.sh
>   ./b2 -j2 --prefix=$PWD --libdir=$PWD/lib64 --layout=tagged link=static threading=multi,single install || echo FAILURE
>
> 3. Installed xmlrpc-c, which I downloaded instead of using apt-get:
>
>   wget http://svn.code.sf.net/p/xmlrpc-c/code
>   REPOS=http://svn.code.sf.net/p/xmlrpc-c/code/stable
>   svn checkout $REPOS xmlrpc-c
>   ./configure --prefix=/usr/local/lib/xml-rpc
>   make
>   make install
>
> 4. Installed mosesdecoder, cloned from GitHub:
>
>   git clone https://github.com/moses-smt/mosesdecoder.git
>   cd /usr/local/lib/mosesdecoder/
>   ./bjam --with-xmlrpc-c=/usr/local/lib/xml-rpc
>
> Here the system always reports a build error. Terminal output:
>
> Please help me. Kind regards, Qin
Re: [Moses-support] Chinese segmentation/tokenization
Hi Marcin,

At Autodesk we’ve been successfully using KyTea since 2011. The main reason we chose this specific tool is that it has readily available models for both Chinese and Japanese, which simplified the integration into our workflows. At least for Japanese, we also evaluated MeCab in 2011, but found KyTea to serve us better.

Keep in mind, though, that we are not very interested in the quality of the segmentation per se; instead, we need the MT to be of sufficient quality, regardless of whether what the segmentation tool does makes sense on its own.

Cheers,
Ventzi

–––
Dr. Ventsislav Zhechev
Computational Linguist, Certified ScrumMaster®
Platform Architecture and Technologies, Localisation Services
MAIN +41 32 723 91 22  FAX +41 32 723 93 99
http://VentsislavZhechev.eu
Autodesk, Inc., Rue de Puits-Godet 6, 2000 Neuchâtel, Switzerland
www.autodesk.com

On 20.03.2015 at 14:32, Marcin Junczys-Dowmunt wrote:
> Hi, questions appear from time to time on the list concerning Chinese segmentation/tokenization. I saw Barry mention Lingpipe and other tools. Is there a favourite tool you guys prefer to use over others?
> Thanks, Marcin
[Moses-support] CALL FOR BOOK PROPOSALS
[apologies for cross-posting]

CALL FOR BOOK PROPOSALS

John Benjamins' NATURAL LANGUAGE PROCESSING book series invites new book proposals in response to the growing demand for Natural Language Processing (NLP) literature. Three general types of books are considered for publication:

MONOGRAPHS
- original, leading and cutting-edge research (the monograph could be based on an outstanding PhD thesis)
- surveys of the state of the art in specific NLP tasks or applications

COLLECTIONS
- books focusing on a particular NLP area (e.g. emerging from successful NLP workshops or as a result of editors' calls for papers)
- books which include papers covering a wide range of topics (e.g. emerging from competitive NLP conferences or as a result of proposals for books of the "Readings in NLP" type)

COURSE BOOKS
- general NLP course books
- books on a particular key area of NLP (e.g. Speech Processing, Computational Syntax/Parsing)

Authors are encouraged to append supplementary materials such as demonstration programs, NLP software and corpora where applicable, and to indicate websites and computational language resources where appropriate.

This call invites proposals from potential authors of the types of books described above. Proposals on any topic related to Natural Language Processing are welcome. Interested authors should submit proposals by email (plain text or PDF) to the series editor, Prof. Dr. Ruslan Mitkov (r.mit...@wlv.ac.uk), with a copy to Emma Franklin (emma.frank...@wlv.ac.uk), the series editorial assistant.

Proposals should include an outline of the book (1-2 pages), a preliminary table of contents, the target readership, related publications, how the book will differ from other similar books in the area (if applicable), a time-scale, and information about the prospective author (relevant experience in the field, publications, etc.). Each proposal will be reviewed by members of the advisory board or additional reviewers.
For more information on the series, visit: https://benjamins.com/#catalog/books/nlp/main

Rohit Gupta
Marie Curie Early Stage Researcher, EXPERT Project
Research Group in Computational Linguistics
Research Institute of Information and Language Processing
University of Wolverhampton
[Moses-support] Chinese segmentation/tokenization
Hi, questions appear from time to time on the list concerning Chinese segmentation/tokenization. I saw Barry mention Lingpipe and other tools. Is there a favourite tool you guys prefer to use over others?

Thanks, Marcin
Re: [Moses-support] Chinese segmentation/tokenization
We also use the Stanford Segmenter most of the time, but have also used many others. Surprisingly, LDC's manseg also gives very good results with SMT, and it's much faster to load than Stanford's.

As Ventzi commented, a segmenter's absolute accuracy relative to the human interpretation of what is a word is not the most important factor when using it as a tokenizer for SMT. It's much more important for the tool to give consistent co-occurrence results relative to the paired language's tokens. In from-ZH environments, the segmented/tokenized form is never seen by humans. In to-ZH environments, the recaser/detokenizer method(s) can actually repair errors and restore the string to what it should be.

@Ventzi, thanks for mentioning KyTea. We'll test and compare.

Tom

On 03/20/2015 08:43 PM, Ventsislav Zhechev wrote:
> At Autodesk we’ve been successfully using KyTea since 2011. The main reason we chose this specific tool is that it has readily available models for both Chinese and Japanese, which simplified the integration into our workflows. At least for Japanese, we also evaluated MeCab in 2011, but found KyTea to serve us better.
> Keep in mind, though, that we are not very interested in the quality of the segmentation per se; instead, we need the MT to be of sufficient quality, regardless of whether what the segmentation tool does makes sense on its own.
> Cheers, Ventzi
>
> On 20.03.2015 at 14:32, Marcin Junczys-Dowmunt wrote:
>> Hi, questions appear from time to time on the list concerning Chinese segmentation/tokenization. I saw Barry mention Lingpipe and other tools. Is there a favourite tool you guys prefer to use over others?
>> Thanks, Marcin
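The to-ZH detokenization step mentioned in this thread can be made concrete with a small sketch. This is our own illustration, not a Moses component: it removes the spaces segmentation introduced between CJK characters while keeping spaces around Latin-script tokens.

```python
import re

# Our own illustrative sketch (not Moses code): MT output into Chinese is a
# line of space-separated tokens. Detokenization deletes any space whose
# neighbours are both CJK characters, leaving spaces around Latin tokens.
CJK = r"[\u4e00-\u9fff]"

def detokenize_zh(line):
    """Join space-separated CJK tokens; keep spacing around Latin tokens."""
    # The lookahead keeps the right-hand character available for the next match,
    # so runs of CJK tokens collapse in a single pass.
    return re.sub(f"({CJK}) (?={CJK})", r"\1", line)

detokenize_zh("我 爱 北京 Moses 系统")
# -> "我爱北京 Moses 系统"
```

A production detokenizer would also handle punctuation width and mixed-script edge cases; this only shows the core idea of restoring the string after segmentation-based tokenization.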
Re: [Moses-support] Chinese segmentation/tokenization
Hi all, thank you all for the tips. I am going with Stanford then. I am currently producing a language model from Christian's raw Chinese CommonCrawl data (www.statmt.org/ngrams). Once I am done I will be happy to share it back.

Best, Marcin

On 20.03.2015 at 15:43, Tom Hoar wrote:
> We also use the Stanford Segmenter most of the time, but have also used many others. Surprisingly, LDC's manseg also gives very good results with SMT, and it's much faster to load than Stanford's.
> As Ventzi commented, a segmenter's absolute accuracy relative to the human interpretation of what is a word is not the most important factor when using it as a tokenizer for SMT. It's much more important for the tool to give consistent co-occurrence results relative to the paired language's tokens. In from-ZH environments, the segmented/tokenized form is never seen by humans. In to-ZH environments, the recaser/detokenizer method(s) can actually repair errors and restore the string to what it should be.
> @Ventzi, thanks for mentioning KyTea. We'll test and compare.
> Tom
>
> On 03/20/2015 08:43 PM, Ventsislav Zhechev wrote:
>> At Autodesk we’ve been successfully using KyTea since 2011. The main reason we chose this specific tool is that it has readily available models for both Chinese and Japanese, which simplified the integration into our workflows. At least for Japanese, we also evaluated MeCab in 2011, but found KyTea to serve us better.
>> Keep in mind, though, that we are not very interested in the quality of the segmentation per se; instead, we need the MT to be of sufficient quality, regardless of whether what the segmentation tool does makes sense on its own.
>> Cheers, Ventzi
[Moses-support] Translator Model Parameter Clarification
Hi there,

As seen in the training references on the website, the --lm parameter accepts three inputs: factor, order, and filename:

  --lm -- language model: factor:order:filename (option can be repeated)

But the baseline system uses 4 inputs:

  -lm 0:3:$HOME/lm/news-commentary-v8.fr-en.blm.en:8

I would like to find out what the factor input is, and what the fourth input denotes.

Thanks!
Re: [Moses-support] Forbidden link to binaries
Hi Nick,

Thank you!

Yours, Per Tunedal

On Thu, Mar 19, 2015, at 15:46, Nikolay Bogoychev wrote:
> Hey Per,
> The link seems to be outdated, as it points to RELEASE-1.0. You can find the current binaries here:
> http://www.statmt.org/moses/RELEASE-3.0/binaries/
> Cheers, Nick

On Thu, Mar 19, 2015 at 2:22 PM, Per Tunedal wrote:
> Hi, I just read the page http://www.statmt.org/moses/?n=Moses.Releases and tried the link to the binaries: "All the binary executables are made available for download for users who do not wish to compile their own version." Clicking on download takes me to http://www.statmt.org/moses/RELEASE-1.0/binaries/, which shows the message:
> Forbidden: You don't have permission to access /moses/RELEASE-1.0/binaries/ on this server.
> Yours, Per Tunedal
Re: [Moses-support] LMs for factors unused make the decoder fail
Your output only has factors 0 and 2, so an LM over factor 1 or 3 will result in a segfault.

Hieu Hoang
Research Associate (until March 2015)
University of Edinburgh
http://www.hoang.co.uk/hieu

On 19 March 2015 at 08:01, Stanislav Kuřík wrote:
> Hello, when I train a model with 0,2-0,2 translation factors and also attach LMs for a different factor (1 or 3 in this case), running the decoder fails. Commenting these other LMs out in the INI file fixes this. It's not a critical issue; it just strikes me as odd that LMs which should not be used in the decoding process at all (yes, they are loaded, but they should not be consulted at all, if I am correct) make it fail.
> Regards, Stanislav K.
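The mismatch Hieu describes can be caught before decoding. The following is a small sanity-check sketch of our own (these helpers are not part of Moses): parse the output side of a factor mapping such as 0,2-0,2 and flag any LM whose factor the translation model will never produce.

```python
# Our own illustrative helpers, not part of the Moses toolkit: check that
# every factor an LM scores is actually produced by the translation mapping.
def output_factors(mapping):
    """Output-side factors of a Moses factor mapping such as '0,2-0,2'."""
    return {int(f) for f in mapping.split("-")[1].split(",")}

def unusable_lm_factors(mapping, lm_factors):
    """LM factors absent from the output, which would crash the decoder."""
    available = output_factors(mapping)
    return [f for f in lm_factors if f not in available]

unusable_lm_factors("0,2-0,2", [0, 1, 2, 3])
# -> [1, 3]
```

Running such a check over the INI file's LM entries would turn the segfault into an understandable configuration error.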
Re: [Moses-support] Translator Model Parameter Clarification
The 4th input (8) is the LM implementation you would like the decoder to use: 0 = SRILM, 1 = IRSTLM, 8 = KENLM.

The factor (0) is the factor in the output sentence you want the LM to use. If you don't use factors, then it's always 0.

Hieu Hoang
Research Associate (until March 2015)
University of Edinburgh
http://www.hoang.co.uk/hieu

On 20 March 2015 at 16:32, Jer Yango wrote:
> Hi there, as seen in the training references on the website, the --lm parameter accepts three inputs: factor, order, and filename:
>
>   --lm -- language model: factor:order:filename (option can be repeated)
>
> But the baseline system uses 4 inputs:
>
>   -lm 0:3:$HOME/lm/news-commentary-v8.fr-en.blm.en:8
>
> I would like to find out what the factor input is, and what the fourth input denotes. Thanks!
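Putting this answer together, the full spec is factor:order:filename with an optional trailing implementation id. A small parsing sketch, our own illustration rather than Moses code, makes the fields concrete (it assumes the filename itself contains no colon):

```python
# Our own illustrative parser for Moses --lm specs of the form
# factor:order:filename[:implementation]; not part of the Moses toolkit.
IMPLEMENTATIONS = {0: "SRILM", 1: "IRSTLM", 8: "KENLM"}

def parse_lm_spec(spec):
    """Split an --lm spec into (factor, order, filename, implementation)."""
    factor, order, rest = spec.split(":", 2)
    if ":" in rest:
        # Trailing numeric field is the implementation id (e.g. 8 = KENLM).
        filename, impl = rest.rsplit(":", 1)
        impl = int(impl)
    else:
        filename, impl = rest, None
    return int(factor), int(order), filename, impl

parse_lm_spec("0:3:/home/user/lm/news-commentary.blm.en:8")
# -> (0, 3, '/home/user/lm/news-commentary.blm.en', 8)
```

With the baseline's 0:3:...:8, this reads as: score output factor 0 with a 3-gram model loaded through KenLM.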