Hi Mariusz,

Sorry for the delay.
If your problem is so dynamic that it cannot be described with rules, then you cannot extract such a list of terms fully automatically.

A semi-automatic method would be: define rules with low precision and high recall. This gets you an overly long list of terms that will include false positives. Then look through this list manually, e.g. by looking at each term together with its sentence context. Inspecting the data in this way might even suggest patterns you did not see before.

Another option is to still extract terms fully automatically, with rules that work most of the time (probably more precision-oriented rules), and live with the margin of error.

If the terms to be extracted are a finite set (i.e. one that can be enumerated) that changes infrequently, consider taking the time to simply list all of the terms, for highest precision.

(We still don’t know what you will use the exported list for. Intended use also dictates the approach to a certain extent.)

Regards
Mathias

> On 4 Jul 2017, at 10:41, Mariusz Hawryłkiewicz
> <mariusz.hawrylkiew...@gmail.com> wrote:
>
> Hi Mathias, thank you for getting back - let me give you an example from a
> monolingual EN corpora:
>
> Acoustic measurement precision and uncertainty.
> Each press of the Acoustic Output – key decreases the transmission power
> setting (TX) displayed in the monitor display.
>
> In the first sentence the word Acoustic should not be exported. In the second
> sentence Acoustic Output should.
> Now I have written a program in Java that exports all the terms or group of
> terms with first capital letter, but this obviously includes the words like
> from the first example and it should not.
>
> The purpose is that the proper names only should be exported to a separate
> file.
>
> Best regards
> Mariusz
>
> 2017-07-04 10:02 GMT+02:00 Mathias Müller <mmuel...@ifi.uzh.ch>:
> Hi Mariusz
>
> What do you mean by “extracting” this content? What do you need the list of
> proper names for? What are the languages involved?
>
> Regards,
> Mathias
>
> --
> Mathias Müller
> AND-2-20
> Institute of Computational Linguistics
> University of Zurich
> Switzerland
> +41 44 635 75 81
> mmuel...@cl.uzh.ch
>
>> On 4 Jul 2017, at 09:39, Mariusz Hawryłkiewicz
>> <mariusz.hawrylkiew...@gmail.com> wrote:
>>
>> Dear all,
>>
>> I have been searching for the most efficient way to extract untranslatable
>> content from the corpora that always begin from the capital letter (product
>> names etc.), the problem is that all the segments begin with the capital
>> letter and what's obvious, the sentence may also begin with the
>> untranslatable content (product name) :-).
>>
>> I want to avoid using common dictionaries to eliminate common words.
>>
>> Would you have any other suggestions?
>>
>> Thank you very much!
>> Mariusz
>> _______________________________________________
>> Moses-support mailing list
>> Moses-support@mit.edu
>> http://mailman.mit.edu/mailman/listinfo/moses-support

--
Mathias Müller
AND-2-20
Institute of Computational Linguistics
University of Zurich
Switzerland
+41 44 635 75 81
mathias.muel...@uzh.ch
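P.S. In case it helps, here is a rough sketch of the semi-automatic method in Java, since that is what your program is written in. It is only an illustration under my own assumptions: the corpus is already split into one sentence per string, and the class name CandidateTerms, the Candidate type and the regular expression are all made up for this example. The rule is deliberately recall-oriented (every run of capitalized words is a candidate). Each candidate keeps its sentence and a flag saying whether it starts the sentence, so a reviewer can look at the uncertain ones first.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Moses or any other toolkit.
final class CandidateTerms {

    /** A candidate term plus what a human reviewer needs to judge it. */
    static final class Candidate {
        public final String term;
        public final boolean sentenceInitial; // capitalization is uninformative here
        public final String sentence;         // context for manual review

        Candidate(String term, boolean sentenceInitial, String sentence) {
            this.term = term;
            this.sentenceInitial = sentenceInitial;
            this.sentence = sentence;
        }
    }

    // High-recall rule (an assumption): one or more capitalized words
    // separated by single spaces, not starting in the middle of a word.
    private static final Pattern CAPITALIZED_RUN = Pattern.compile(
            "(?<![\\p{L}\\p{Nd}])\\p{Lu}[\\p{L}\\p{Nd}]*(?: \\p{Lu}[\\p{L}\\p{Nd}]*)*");

    /** Collects every capitalized run; false positives are expected. */
    static List<Candidate> extract(List<String> sentences) {
        List<Candidate> candidates = new ArrayList<>();
        for (String sentence : sentences) {
            Matcher m = CAPITALIZED_RUN.matcher(sentence);
            while (m.find()) {
                candidates.add(new Candidate(m.group(), m.start() == 0, sentence));
            }
        }
        return candidates;
    }

    public static void main(String[] args) {
        List<String> corpus = Arrays.asList(
                "Acoustic measurement precision and uncertainty.",
                "Each press of the Acoustic Output \u2013 key decreases the "
                        + "transmission power setting (TX) displayed in the monitor display.");
        // Tab-separated review list: term, position flag, sentence context.
        for (Candidate c : extract(corpus)) {
            System.out.println(c.term + "\t"
                    + (c.sentenceInitial ? "sentence-initial" : "mid-sentence")
                    + "\t" + c.sentence);
        }
    }
}
```

On your two example sentences this lists "Acoustic" and "Each" as sentence-initial, and "Acoustic Output" and "TX" as mid-sentence. The sentence-initial ones are the ones that most need a manual look. A stricter, precision-oriented variant could simply drop sentence-initial candidates that never occur mid-sentence elsewhere in the corpus, at the cost of missing some product names.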
_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support