Re: [Moses-support] a tool for extracting specific terms from the corpora

Mathias Müller Mon, 31 Jul 2017 01:48:51 -0700

Hi Mariusz

Sorry for the delay.


If your problem is so dynamic that it cannot be described with rules, then you 
cannot extract such a list of terms automatically.

A semi-automatic method would be: define rules with low precision and high 
recall. This gets you an overly long list of terms that includes false 
positives. Then go through the list manually, e.g. by looking at each term in 
its sentence context. Inspecting the data in this way might even suggest 
patterns you did not see before.
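
As a rough illustration of such a high-recall rule, here is a minimal Python 
sketch. The regular expression ("any run of capitalised words") and the 
function name are my own assumptions, not an existing tool, and the two 
sentences are the ones from your example. It prints every candidate next to 
its sentence so the list can be reviewed by hand.

```python
import re
from collections import defaultdict

# Assumed high-recall rule: any run of one or more capitalised words.
CANDIDATE = re.compile(r"\b[A-Z][\w-]*(?:\s+[A-Z][\w-]*)*")


def candidates_with_context(sentences):
    """Map each candidate term to the list of sentences it occurs in."""
    contexts = defaultdict(list)
    for sentence in sentences:
        for match in CANDIDATE.finditer(sentence):
            contexts[match.group()].append(sentence)
    return contexts


corpus = [
    "Acoustic measurement precision and uncertainty.",
    "Each press of the Acoustic Output – key decreases the transmission power "
    "setting (TX) displayed in the monitor display.",
]

# One tab-separated line per (term, sentence) pair, ready for manual review.
for term, sentences in sorted(candidates_with_context(corpus).items()):
    for sentence in sentences:
        print(f"{term}\t{sentence}")
```

On these two sentences the candidates include the false positives "Acoustic" 
and "Each" alongside "Acoustic Output" and "TX", which is exactly the kind of 
noise the manual pass is meant to remove.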

Another option is to keep the extraction fully automatic, with rules that 
work most of the time (probably more precision-oriented rules), and live with 
the margin of error.
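
One possible precision-oriented rule, sketched in Python under my own 
assumptions (the rule itself and the function name are hypothetical): ignore 
any capitalised run that starts the segment, since every segment begins with 
a capital letter anyway, and keep only runs that occur mid-sentence.

```python
import re

# Same assumed pattern as a plain "capitalised words" rule.
CAPITALISED_RUN = re.compile(r"\b[A-Z][\w-]*(?:\s+[A-Z][\w-]*)*")


def mid_sentence_terms(sentences):
    """Collect capitalised runs that do not start at the beginning of a sentence."""
    terms = set()
    for sentence in sentences:
        stripped = sentence.lstrip()
        for match in CAPITALISED_RUN.finditer(stripped):
            # Position 0 is capitalised by convention, so it carries no signal.
            if match.start() > 0:
                terms.add(match.group())
    return terms


corpus = [
    "Acoustic measurement precision and uncertainty.",
    "Each press of the Acoustic Output – key decreases the transmission power "
    "setting (TX) displayed in the monitor display.",
]

print(sorted(mid_sentence_terms(corpus)))
```

The margin of error here is a product name that only ever appears at the 
start of a segment: this rule will miss it.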

If the terms to be extracted are a finite set (i.e. one that can be enumerated) 
that changes infrequently, consider taking the time to simply list all of the 
terms, for highest precision.
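
A minimal sketch of the list-based approach in Python. The term list below is 
a made-up example and the function names are hypothetical; in practice the 
list would be maintained by hand.

```python
import re

# Hypothetical hand-maintained list of terms to export.
KNOWN_TERMS = ["Acoustic Output", "TX"]


def build_matcher(terms):
    """Compile one pattern that matches any listed term as a whole word."""
    # Longest first, so a multi-word term wins over a shorter term it contains.
    ordered = sorted(terms, key=len, reverse=True)
    alternatives = "|".join(re.escape(term) for term in ordered)
    return re.compile(r"\b(?:" + alternatives + r")\b")


def find_known_terms(sentences, terms):
    """Return (term, sentence) pairs for every listed term found."""
    matcher = build_matcher(terms)
    return [(m.group(), s) for s in sentences for m in matcher.finditer(s)]


corpus = [
    "Acoustic measurement precision and uncertainty.",
    "Each press of the Acoustic Output – key decreases the transmission power "
    "setting (TX) displayed in the monitor display.",
]

for term, sentence in find_known_terms(corpus, KNOWN_TERMS):
    print(f"{term}\t{sentence}")
```

Because only listed terms can match, "Acoustic" in the first sentence is not 
exported, while "Acoustic Output" in the second one is.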

(We still don’t know what you will use the exported list for. Intended use also 
dictates the approach to a certain extent.)

Regards
Mathias

> On 4 Jul 2017, at 10:41, Mariusz Hawryłkiewicz 
> <mariusz.hawrylkiew...@gmail.com> wrote:
> 
> Hi Mathias, thank you for getting back - let me give you an example from a 
> monolingual EN corpus:
> 
> Acoustic measurement precision and uncertainty.
> Each press of the Acoustic Output – key decreases the transmission power 
> setting (TX) displayed in the monitor display.
> 
> In the first sentence the word Acoustic should not be exported. In the second 
> sentence Acoustic Output should.
> Now I have written a program in Java that exports all the terms or groups of 
> terms that start with a capital letter, but this obviously also includes 
> words like the one in the first example, and it should not.
> 
> The purpose is to export only the proper names to a separate file.
> 
> Best regards
> Mariusz
> 
> 
> 
> 2017-07-04 10:02 GMT+02:00 Mathias Müller <mmuel...@ifi.uzh.ch>:
> Hi Mariusz
> 
> What do you mean by “extracting” this content? What do you need the list of 
> proper names for? What are the languages involved?
> 
> Regards,
> Mathias
> 
> —
> 
> Mathias Müller
> AND-2-20
> Institute of Computational Linguistics
> University of Zurich
> Switzerland
> +41 44 635 75 81
> mmuel...@cl.uzh.ch
>> On 4 Jul 2017, at 09:39, Mariusz Hawryłkiewicz 
>> <mariusz.hawrylkiew...@gmail.com> wrote:
>> 
>> Dear all,
>> 
>> I have been searching for the most efficient way to extract untranslatable 
>> content from the corpora that always begins with a capital letter (product 
>> names etc.). The problem is that all the segments begin with a capital 
>> letter and, obviously, a sentence may also begin with untranslatable 
>> content (a product name) :-). 
>> 
>> I want to avoid using common dictionaries to eliminate common words.
>> 
>> Would you have any other suggestions?
>> 
>> Thank you very much!
>> Mariusz
>> _______________________________________________
>> Moses-support mailing list
>> Moses-support@mit.edu
>> http://mailman.mit.edu/mailman/listinfo/moses-support
> 
> 


—

Mathias Müller
AND-2-20
Institute of Computational Linguistics
University of Zurich
Switzerland
+41 44 635 75 81
mathias.muel...@uzh.ch

_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support
