[Corpora-List]Re: Complex Word Identification in French

Ada Wan Mon, 20 Jun 2022 16:25:55 -0700

Hi Linas,

As a native EN speaker myself, I know "w*rds" is a very colloquial term
that gets used often. It got "cemented" via computational implementation*
and hence reinforced people's idea of what grammar is or ought to be. I am
not "undoing" EN writing (that'd be nonsense --- writing is a thing, and
what has already been written is history; what is more negotiable is our
action/reaction/attitude towards it). Rather, I am trying to get people to
be more open-minded and flexible about usage --- in terms of "w*rds" and
"grammar", in the context of computing and beyond (as all these are
connected with each other), and of input representation when
modelling/processing.
*A few decades back, computing was more EN-dominant, and most of those
working on text representation then were primarily EN monolinguals. Hence
there could have been a bit of X-centrism (where X can be any language, but
in this case, for the relevant historical context, it was EN), with less
sensitivity towards other languages and towards how w*rd segmentation (with
a totally unintentional "I am just doing language processing, where
language involves words" mindset) can effect linguistic hegemony in ways
that some have not considered. But now our systems and processing power are
there to give us that fine qualitative difference. So hopefully, instead of
"so what?", we can get the community to be more sensitive towards the
values of other communities --- in which the notions of w*rds may be
different from those in EN, or in which there is no native concept of a
"w*rd" (and we don't have to impose one on anyone).


I understand speech processing (both recognition and generation in terms of
TTS) is more fine-grained than text processing. Any step towards finer
granularity is better than none. I don't know if you are aware of the
vocabulary hacking practice in text processing... that's primarily what I
was getting at (also how this vocab hacking relates to structural
linguistics in some ways).

Statistical methods have always been around. They are not "new" methods. In
the tradition of lang sci/tech/eng, they've been somewhat "suppressed" bc
ppl kept arguing about grammar and what surface text representations ought
to look like.
The concept of w*rd defined by whitespace tokenisation is also not
satisfactory for EN: think contractions, abbreviations, and tons of other
stuff from intro NLP textbooks. :)
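
A minimal sketch of the whitespace point, in Python. The sample sentence
and both helper names are made up for illustration; they are not taken from
the paper or from any toolkit. It only shows that two equally naive
segmenters disagree on contractions and abbreviations:

```python
import re


def whitespace_tokens(text: str) -> list[str]:
    # Split on runs of whitespace only; punctuation stays glued to its neighbour.
    return text.split()


def regex_tokens(text: str) -> list[str]:
    # Runs of letters/digits/underscore, or any single other non-space character.
    return re.findall(r"\w+|[^\w\s]", text)


# Hypothetical sample with a title, a contraction, a clitic and an abbreviation.
sample = "Dr. Smith can't attend; she's in the U.S. until Friday."

print(whitespace_tokens(sample))
print(regex_tokens(sample))
```

Whitespace splitting keeps "attend;" and "Friday." as single units, while
the regex breaks "can't" and "U.S." into pieces. Neither output is "the"
segmentation; the choice of rule is a modelling decision.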

Re "What, exactly, is being proposed here?": if you have already read the
paper <https://openreview.net/forum?id=-llS6TiOew>, then: more empathy, and
more awareness of and sensitivity to inter-cultural/personal values. Our
downstream results are good enough. We can switch our concern to more
qualitative matters.

Best,
Ada


On Tue, Jun 21, 2022 at 12:13 AM Linas Vepstas <linasveps...@gmail.com>
wrote:

> Hi Ada,
>
> In the English language, "words" are a thing. Children are taught to place
> spaces between "words". You're not going to undo a millennium-worth of
> English writing by discouraging the use of words.
>
> Much of Latin was written without blank spaces to denote word boundaries.
> In Chinese writing, there are no blank spaces to denote word-boundaries.
> There's assorted NLP software that attempts to guess where those blanks may
> be, so that Chinese could be segmented and passed into other NLP pipeline
> stages.
>
> When we speak, verbally, we don't put in "blanks" between words, although
> there are sometimes pauses. Realistic text-to-speech software NEVER
> vocalizes words individually, and instead ALWAYS vocalizes the transition
> between words, and places the break within a single phoneme (I hope it's
> clear what I am saying here). Thus, from the point of view of
> text-to-speech software, words don't exist, because that is a fundamental
> requirement for normal-sounding speech. (For English.)
>
> Now that we live in the world of statistics and deep learning and whatnot,
> it's become clear that an audio stream of human speech has some parts that
> are "highly conserved" (require certain sounds to follow) and other regions
> which are flexible (just about any other sound can follow). And plenty of
> stuff in the middle between these two extremes.   Surprisingly (or not
> surprisingly, depending on who you are) the highly variable regions are not
> word boundaries. Except when ... there are ... well, exceptions.
>
> However, right now, I am not communicating verbally, and so I am faced
> with the task of converting thoughts into sequences of (discrete) symbols.
> As I learned in first grade, I do this by placing typed spaces between
> words.
>
> Sure, the concept of "word" may be quite inappropriate for some obscure
> languages.  This is entirely plausible, as any "synthetic" language defies
> the concept of "word" (Finnish and Lithuanian consist of "words", many of
> which are like "antidisestablishmentarianism", and it's a children's
> playground game to create the longest possible such expression. Creating
> new words in these languages is like creating new sentences in English.
> It's just something you do, and there are no "word boundaries" involved.)
>
> Great. So now what?  I assume everything I wrote is 100% mainstream, known
> to any and every linguist, half of whom could amplify and correct all the
> mistakes I've made in the above.  Sure, but so what? You can't get rid of
> the concept of "word". It's a thing.  What, exactly, is being proposed here?
>
> -- linas
>
>
> On Mon, Jun 20, 2022 at 10:33 AM Ada Wan <adawan...@gmail.com> wrote:
>
>> Hi Christopher,
>>
>> It is in the best interest of the community to discontinue the usage of
>> "word". The term is not only very shaky in its foundation (if any), but it
>> can also effect disparity in performance in computational processing and
>> robustness when human evaluation is involved.
>> Although the term has been casually adopted by many in the past, like many
>> un-PC terms that may have an inappropriate undertone, it needs to be
>> discouraged and abandoned.
>> Last but not least, I noticed that you are located in Canada. In the
>> event that you work with any indigenous communities, you MUST be
>> careful with the usage of such a term --- you could be imposing
>> your own (EN- / FR- / dominant language-centric) view onto another
>> individual/community. There is an element of cultural and
>> linguistic hegemony in the usage of such a term (including but not limited
>> to making applications with it).
>> Please also consult recent work in this area:
>> https://openreview.net/forum?id=-llS6TiOew.
>>
>> Feel free to get in touch if you should have any questions.
>>
>> Best,
>> Ada
>>
>>
>> On Mon, Jun 20, 2022 at 4:53 PM Christopher Collins <
>> christopher.coll...@ontariotechu.ca> wrote:
>>
>>> Hello,
>>>
>>>
>>>
>>> I’m looking for any open source or cloud-hosted solution for complex
>>> word identification or word difficulty rating in French for a reading
>>> application.
>>>
>>>
>>>
>>> As a backup plan we can use measures like corpus frequency, length,
>>> number of senses, but we’re hoping someone has already made a tool
>>> available.
>>>
>>>
>>>
>>> We found this but that’s it: https://github.com/sheffieldnlp/cwi
>>>
>>>
>>>
>>> Would appreciate any tips!
>>>
>>>
>>>
>>> Thanks,
>>>
>>>
>>>
>>> Chris
>>>
>>>
>>>
>>> *Christopher Collins *[he/him
>>> <https://medium.com/gender-inclusivit/why-i-put-pronouns-on-my-email-signature-and-linkedin-profile-and-you-should-too-d3dc942c8743>
>>> ]
>>> Associate Professor - Faculty of Science
>>> Canada Research Chair in Linguistic Information Visualization
>>> Ontario Tech University
>>> vialab.ca
>>> _______________________________________________
>>> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
>>> Corpora mailing list -- corpora@list.elra.info
>>> To unsubscribe send an email to corpora-le...@list.elra.info
>>>
>>
>
>
> --
> Patrick: Are they laughing at us?
> Sponge Bob: No, Patrick, they are laughing next to us.
>
>
>

