On 10/11/2023 16:23, Aaron Conole wrote: > Roi Dayan <r...@nvidia.com> writes: > >> On 08/11/2023 21:04, Aaron Conole wrote: >>> Roi Dayan via dev <ovs-dev@openvswitch.org> writes: >>> >>>> On 02/11/2023 16:11, Eelco Chaudron wrote: >>>>> >>>>> On 2 Nov 2023, at 14:20, Roi Dayan via dev wrote: >>>>> >>>>>> Add personal words list as spellcheck.txt and load it >>>>>> into enchant spell checker. This file is generated from >>>>>> codespell dictionary.txt and contains words users use >>>>>> but enchant spell checker failed on like >>>>>> refcount, pthread, enqueuing, etc. >>>>>> >>>>>> Signed-off-by: Roi Dayan <r...@nvidia.com> >>>>> Thanks for the patch, but it doesn’t look right to add the full list >>>>> of words to the OVS repository. >>>>> >>>>> Maybe we can update the extra_keywords list with the most common >>>>> missing ones, and add a command line option to include a >>>>> user-defined file for people who want this? >>>>> >>>>> What do you think? >>>>> >>>> >>>> I think it is needed. It's a dictionary of most commonly used words >>>> and the enchant spell check does not seem to be enough. >>>> Some examples enchant fails as I remember are: >>>> lacp, dereferenced, valgrind, priv, syscall,.. >>> >>> Well, we do add some of those - BUT I see that codespell probably has a >>> more complete dictionary. >>> >>> The original implementation using enchant was due to there not being a >>> clear winner at the time. Enchant is a spell checking frontend intended >>> for lots of tools, so seemed like a good idea (for example, used by >>> libreoffice, AbiWord, and others). >>> >>> It may be that codepspell is more appropriate since it is targeted at >>> development spell checking. OTOH, codespell will miss lots of >>> misspellings where enchant will have a larger lexicon. I have no real >>> opinion on which framework makes sense - but we shouldn't include a >>> dictionary. After all, even linux checkpatch.pl will find the codespell >>> dictionary and just use that as it exists. >>> >>> However, I will point out that there *is* a difference between the two. >>> Here's a simple example: >>> >>> 02:01:19 aconole@RHTPC1VM0NT {(594d145410...)} ~/git/ovs$ echo vailgrind >>> | codespell - >>> 02:01:24 aconole@RHTPC1VM0NT {(594d145410...)} ~/git/ovs$ >>> >>> vs. >>> >>> 02:00:18 aconole@RHTPC1VM0NT {(594d145410...)} ~/git/ovs$ >>> ./utilities/checkpatch.py -S test.patch >>> == Checking "test.patch" == >>> WARNING: Possible misspelled word: "vaigrind" >>> Did you mean: ['grinder', 'valgrind'] >>> >>> So I guess we can probably do something better - and maybe that >>> something is to find the codespell dictionary cut it up and merge the >>> values with the current session (instead of add(), ie: 2/2 in this >>> series). >>> >>> But I don't think we need to copy the entire codespell dict. >>> >> >> In linux checkpatch find codespell dictonary.txt, loads it into memory and >> cut on '>' to get the word needed and looks like internal perl code does >> the spell check. >> >> I can do the same here to find codespell dictionary.txt and cut on '>' >> and I get the same dictionary I tried to add here but I need to create >> a temporary file to load it with enchant.DictWithPWL() >> >> As doing a loop of calls to spell_check_dict.add_to_session() seems >> to take a lot longer for the library by far as shown. >> >> Will that be ok? >> >> This is the example diff. I can send as an update patch if its ok. >> Looking for the codespell dictionary as it can be different folder between >> fedora and ubuntu for example. >> Then loading it and creating temporary file to load into enchant which >> is being removed when file is closed. > > I think it's a good idea to use the codespell dict, but it turns out we > don't actually need the temp file. I've tested with the add_to_session > patch, and here is my timing output: > > 09:19:43 aconole@RHTPC1VM0NT {(594d145410...)} ~/git/ovs$ time > ./utilities/checkpatch.py -S -1 test.patch > == Checking 594d145410c5 ("readthedocs: Use dirhtml builder.") == > WARNING: Possible misspelled word: "readthedocs" > Did you mean: ['headteachers'] > WARNING: Possible misspelled word: "dirhtml" > Did you mean: ['dirtily'] > Subject: readthedocs: Use dirhtml builder. > WARNING: Possible misspelled word: "ReadTheDocs" > Did you mean: ['Headteachers'] > Lines checked: 44, Warnings: 3, Errors: 0 > > > real 0m0.369s > user 0m0.344s > sys 0m0.024s > > The difference being that I don't use a tempfile - just your delta above > as: > > if codespell_file != '': > with open(codespell_file) as f: > for line in f.readlines(): > words = line.strip().split('>')[1].strip(', ').split(',') > for word in words: > spell_check_dict.add_to_session(word) > > > Of course, looks like we would want to add other things like readthedocs, > dirhtml, etc. Or possibly skip checking words that are proper nouns > (like ReadTheDocs would be). Codespell ignores words it doesn't have a > correction / suggestion for, while enchant will assume something it > doesn't know is misspelled. This makes it so we will flag more often. > > WDYT?
This looks good. We shouldn't lag more often than now as we only add more familiar words to enchant session. Do you want me to create a proper patch and send it or you will do it when taking the 2nd patch? > >> --- a/utilities/checkpatch.py >> +++ b/utilities/checkpatch.py >> @@ -37,8 +37,19 @@ spell_check_dict = None >> >> >> def open_spell_check_dict(): >> + import tempfile >> import enchant >> >> + try: >> + import codespell_lib >> + codespell_dir = os.path.dirname(codespell_lib.__file__) >> + codespell_file = os.path.join(codespell_dir, 'data', >> 'dictionary.txt') >> + if not os.path.exists(codespell_file): >> + codespell_file = '' >> + except: >> + codespell_file = '' >> + >> + >> try: >> extra_keywords = ['ovs', 'vswitch', 'vswitchd', 'ovs-vswitchd', >> 'netdev', 'selinux', 'ovs-ctl', 'dpctl', 'ofctl', >> @@ -93,7 +104,19 @@ def open_spell_check_dict(): >> global spell_check_dict >> word_dic = os.path.join(os.path.dirname(os.path.abspath(__file__)), >> 'dictionary.txt') >> - spell_check_dict = enchant.DictWithPWL('en_US', word_dic) >> + >> + if codespell_file: >> + with open(codespell_file) as f: >> + with tempfile.NamedTemporaryFile(mode='w') as f2: >> + for line in f.readlines(): >> + words = line.strip().split('>')[1].strip(', >> ').split(',') >> + for word in words: >> + f2.write(word.strip()+'\n') >> + >> + spell_check_dict = enchant.DictWithPWL('en_US', f2.name) >> + else: >> + spell_check_dict = enchant.Dict("en_US") >> + >> for kw in extra_keywords: >> spell_check_dict.add_to_session(kw) >> >> >> >> >> Example run with time, ignore the warning >> >> $ time python3 utilities/checkpatch.py -1 -S >> == Checking dcfe1402d26e ("checkpatch: Fix personal word list storage.") == >> ERROR: Committer Roi Dayan <r...@nvidia.com> needs to sign off. >> Lines checked: 31, Warnings: 0, Errors: 1 >> >> >> real 0m0.517s >> user 0m0.304s >> sys 0m0.067s >> >> >> >>>> Also adding the entire dictionary the script is even faster than >>>> adding word by word as done now. >>>> >>>> I think maybe removing the add word by word part at all but checking and >>>> doing >>>> in steps. >>>> >>>> Just by adding the dictionary and having the words being added through >>>> python >>>> already exists seems to be faster. >>>> >>>> Checking small commit before loading dictionry.txt: >>>> >>>> $ time ./utilities/checkpatch.py -S -1 >>>> >>>> real 0m28.379s >>>> user 0m0.272s >>>> sys 0m0.223s >>>> >>>> >>>> and after: >>>> >>>> $ time ./utilities/checkpatch.py -S -1 >>>> >>>> real 0m0.238s >>>> user 0m0.138s >>>> sys 0m0.038s >>>> >>>>> Cheers, >>>>> >>>>> Eelco >>>>> >>>>>> --- >>>>>> utilities/automake.mk | 1 + >>>>>> utilities/checkpatch.py | 4 +- >>>>>> utilities/dictionary.txt | 16161 +++++++++++++++++++++++++++++++++++++ >>>>>> 3 files changed, 16165 insertions(+), 1 deletion(-) >>>>>> create mode 100644 utilities/dictionary.txt >>>> >>>> _______________________________________________ >>>> dev mailing list >>>> d...@openvswitch.org >>>> https://mail.openvswitch.org/mailman/listinfo/ovs-dev >>> > _______________________________________________ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev