[Corpora-List] Re: Any literature about tensors-based corpora NLP research with actual examples (and homework ;-)) you would suggest? ...

Ada Wan via Corpora Sat, 05 Aug 2023 06:27:11 -0700

Hi Anil

Thanks for your comments. (And thanks for reading my work.)

Yeah, there is a lot that one has to pay attention to when it comes to what
"textual computing" entails (and to what extent it "exists"). Beyond
"grammar" definitely. But experienced CL folks should know that. (Is this
you btw: https://scholar.google.com/citations?user=QKnpUbgAAAAJ? If not, do
you have a webpage for your work? Nice to e-meet you either way!)

Re "I know first hand the problems in doing NLP for low resource languages
which are related to text encodings":
which specific languages/varieties are you referring to here? If the issue
lies in the script not having been encoded, one can contact SEI about it (
https://linguistics.berkeley.edu/sei/)? I'm always interested in knowing
what hasn't been encoded. Are the scripts on this list (
https://linguistics.berkeley.edu/sei/scripts-not-encoded.html)?

Re the unpublished paper (on a computational typology of writing systems?):
when and to where (as in, which venues/publications) did you submit it?
I remember one of my first term papers from the 90s being on the
phonological system of written Cantonese (or sth like that --- don't
remember my wild days), the prof told me it wasn't "exactly linguistics"...

Re "on building an encoding converter that will work for all 'encodings'
used for Indian languages":
this sounds interesting!

Re "I too wish there was a good comprehensive history text encodings,
including non-standard ad-hoc encodings":
what do you mean by that --- history of text encodings or historical text
encodings?
After my discoveries from recent years, when my "mental model" towards
what's been practiced in the language space (esp. in CL/NLP) finally
*completely* shifted, I had wanted to host (or co-host) a tutorial on
character encoding for those who might be under-informed on the matter
(including but not limited to the "grammaroholics" (esp. the CL/NLP
practitioners who seem to be stuck doing grammar, even in the context of
computing) --- there are so many of them! :) )

Re "word level language identification":
I don't do "words" anymore. In that 2016 TBLID paper of mine, I
(regrettably) was still going with the flow in under-reporting on
tokenization procedures (like what many "cool" ML papers did). But "words"
do certainly shape the results! I'm really looking forward to everyone
working with full-vocabulary, pure character or byte formats (depending on
the task), while being 100% aware of statistics. Things can be much more
transparent and easily replicable/reproducible that way anyway.

Re "We have to be tolerant of what you call bad research for various
unavoidable reasons. Research is not what it used to be":
No, I think one should just call out bad research and stop doing it. I
wouldn't want students to burn their midnight oil working hard for nothing.
Bad research also warps expectations and standards, in other sectors as
well (education, healthcare, commerce... etc.). Science, as in the pursuit
of truth and clarity, is and should be the number 1 priority of any decent
research. (In my opinion, market research or research for marketing
purposes should be all consolidated into one track/venue if they lack
scientific quality.) I agree research is not what it used to be --- but in
the sense that the quality is much worse in general, much hacking around
with minor, incremental improvements. Like in the case of "textual
computing", people are "grammar"-hacking.

Re "better ... gender representation":
hhmm... I'm not so sure about that.

Re "About grammar, I have come to think of it as a kind of language model
for describing some linguistic phenomenon":
nah, grammar is not necessary.

Re grammaroholic reviewers:
yeah, there are tons of those in the CL/NLP space. I think many of them are
only willing and/or able to critique grammar. The explicit bit is that it shows
that they don't want to check one's math and code --- besides, when most
work on "words" anyway, there is a limit to how things are
replicable/reproducible, esp. if on a different dataset. The implicit bit,
however, is that I think there is some latent intent to introduce/reinforce
the influence of "grammar" into the computing space. That, I do not agree
with at all.

Re "magic":
yes, once one gets over the hype, it's just work.

Re "I have no experience of field work at all and that I regret, but it is
partly because I am not a social creature":
one can be doing implicit and unofficial "fieldwork" every day if one pays
attention to how language is used.

Best
Ada

On Sat, Aug 5, 2023 at 8:51 AM Anil Singh <anil.ph...@gmail.com> wrote:

> I forgot the main reason for writing the last email. Most importantly, I
> share your view that orthography is underrepresented in NLP/CL. I had once
> tried to build a computational typology of writing systems. The paper was
> not published, but I still believe that is something worth doing. Perhaps
> one day I will complete that work.
>
> Also, I am conscious that, technically, I used the term category mistake
> in a wrong way, but I hope I was understood correctly.
>
> On Sat, Aug 5, 2023 at 12:47 AM Hesham Haroon <heshamharoo...@gmail.com>
> wrote:
>
>> Hi Ada and Anil,
>>
>> I'm enjoying reading your discussion. It's been very informative and
>> thought-provoking. Thanks for sharing your insights!
>>
>> Best,
>> Hesham
>>
>>
>> On Fri, Aug 4, 2023, 8:51 PM Anil Singh via Corpora <
>> corpora@list.elra.info> wrote:
>>
>>> I have been enjoying the discussion. I hope it will continue. I have
>>> learnt some new things. I was also confused about the tensor thing,
>>> although not in the same way.
>>>
>>> I hope I am not one of the scare-quoted NLP practitioners, because
>>> that's exactly what I like to call myself. I certainly don't think I am
>>> qualified to work on language just because I can speak one.
>>>
>>> I am currently reading your thesis and trying to digest it.
>>>
>>> I also glanced through the syllabus you are preparing. I share your
>>> interest in text encodings, among other things. I can't resist talking
>>> about text encodings, whether I am teaching NLP or Computer Programming,
>>> because I know first hand the problems in doing NLP for low resource
>>> languages which are related to text encodings.
>>>
>>> If you can actually teach that syllabus, I envy you as I am unable to
>>> get people interested in the very basics of language/linguistics.
>>>
>>> About the importance of granularities, I had, in my (very badly written)
>>> PhD thesis, explicitly talked about NLP problem formulation in terms of
>>> granularities. In my second research paper, I had used byte n-grams for
>>> language identification. I use byte n-grams whenever I can. Actually, I
>>> used it for language-encoding pair identification, as there are so many
>>> non-standard 'encodings' which were used and perhaps are still used for
>>> South Asian languages. My very first -- unsuccessful or you may say
>>> unfinished -- attempt at doing some kind of NLP even before knowing that a
>>> field called NLP or CL existed, was on building an encoding converter that
>>> will work for all 'encodings' used for Indian languages. I too wish there
>>> was a good comprehensive history text encodings, including non-standard
>>> ad-hoc encodings.
>>>
>>> I also share your interest in word level language identification. In
>>> 2007 I had published one of the earliest papers on what I called language
>>> identification in a multilingual document, where I had tried word level
>>> language identification, and what is now called language identification for
>>> code switched data.
>>>
>>> About gender, I had actually made a kind of category assumption. I
>>> didn't pay attention to the name, which you share with no less than Ada
>>> Byron.
>>>
>>> We have to be tolerant of what you call bad research for various
>>> unavoidable reasons. Research is not what it used to be. At least that's my
>>> opinion. Still, in some ways it is better, perhaps like in the case of
>>> gender representation.
>>>
>>> About grammar, I have come to think of it as a kind of language model
>>> for describing some linguistic phenomenon. I once received a review in
>>> which the reviewer mentioned some grammatical mistakes and wrote that you
>>> don't have to just see how the sentence/phrase sounds, you have to
>>> explicitly check the grammar according to the rules. Thank you very much,
>>> but I learnt English without paying any explicit attention to grammar. I am
>>> pretty sure I didn't learn much from explicit teaching of grammar, whether
>>> of English, or of Sanskrit, or of French. That doesn't necessarily mean I
>>> don't believe in grammar, but I guess I am moving towards the language
>>> games view of language.
>>>
>>> As to language being magical, well, that depends on what you mean by
>>> magical. To me, it seems it is magical in the same sense as life itself is
>>> magical. Nothing more, nothing less. Even computer programming I have been
>>> known to call magical in a certain sense.
>>>
>>> I also completely agree that we can only hope that we are communicating
>>> as we intended, but we rarely, if ever, actually attain that goal.
>>>
>>> I can't match your background, but I did have -- what can be called --
>>> four rounds of graduate training in different disciplines. I am still
>>> trying to learn new things about language. However, I have no experience of
>>> field work at all and that I regret, but it is partly because I am not a
>>> social creature, or, to be more precise (as if one can be precise with
>>> language), I am socially totally incompetent. I wouldn't know how to
>>> approach anyone for fieldwork in Linguistics.
>>>
>>> On Fri, Aug 4, 2023 at 9:03 PM Ada Wan via Corpora <
>>> corpora@list.elra.info> wrote:
>>>
>>>> @Toms:
>>>> for completeness' sake: would you mind please sharing your background?
>>>> Thanks.
>>>>
>>>> On Fri, Aug 4, 2023 at 5:31 PM Ada Wan <adawan...@gmail.com> wrote:
>>>>
>>>>> Thanks x2, Ibrtchx.
>>>>>
>>>>> On Fri, Aug 4, 2023 at 3:30 AM Albretch Mueller <lbrt...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> On 8/3/23, Toms Bergmanis <toms.bergma...@tilde.lv> wrote:
>>>>>>  ...
>>>>>>
>>>>>>  I, for one, have benefited from Ada's, as well as other members'
>>>>>> suggestions and comments as I hope they have somehow benefited from
>>>>>> mine.
>>>>>>  lbrtchx
>>>>>>
>>>>> _______________________________________________
>>>> Corpora mailing list -- corpora@list.elra.info
>>>> https://list.elra.info/mailman3/postorius/lists/corpora.list.elra.info/
>>>> To unsubscribe send an email to corpora-le...@list.elra.info
>>>>
>>>
>>>
>>> --
>>> - Anil
>>>
>>
>
> --
> - Anil
>
