On 13/11/2023 11:46, Eelco Chaudron wrote: > > > On 12 Nov 2023, at 8:22, Roi Dayan wrote: > >> On 10/11/2023 16:23, Aaron Conole wrote: >>> Roi Dayan <r...@nvidia.com> writes: >>> >>>> On 08/11/2023 21:04, Aaron Conole wrote: >>>>> Roi Dayan via dev <ovs-dev@openvswitch.org> writes: >>>>> >>>>>> On 02/11/2023 16:11, Eelco Chaudron wrote: >>>>>>> >>>>>>> On 2 Nov 2023, at 14:20, Roi Dayan via dev wrote: >>>>>>> >>>>>>>> Add personal words list as spellcheck.txt and load it >>>>>>>> into enchant spell checker. This file is generated from >>>>>>>> codespell dictionary.txt and contains words users use >>>>>>>> but enchant spell checker failed on like >>>>>>>> refcount, pthread, enqueuing, etc. >>>>>>>> >>>>>>>> Signed-off-by: Roi Dayan <r...@nvidia.com> >>>>>>> Thanks for the patch, but it doesn’t look right to add the full list >>>>>>> of words to the OVS repository. >>>>>>> >>>>>>> Maybe we can update the extra_keywords list with the most common >>>>>>> missing ones, and add a command line option to include a >>>>>>> user-defined file for people who want this? >>>>>>> >>>>>>> What do you think? >>>>>>> >>>>>> >>>>>> I think it is needed. It's a dictionary of most commonly used words >>>>>> and the enchant spell check does not seem to be enough. >>>>>> Some examples enchant fails as I remember are: >>>>>> lacp, dereferenced, valgrind, priv, syscall,.. >>>>> >>>>> Well, we do add some of those - BUT I see that codespell probably has a >>>>> more complete dictionary. >>>>> >>>>> The original implementation using enchant was due to there not being a >>>>> clear winner at the time. Enchant is a spell checking frontend intended >>>>> for lots of tools, so seemed like a good idea (for example, used by >>>>> libreoffice, AbiWord, and others). >>>>> >>>>> It may be that codepspell is more appropriate since it is targeted at >>>>> development spell checking. OTOH, codespell will miss lots of >>>>> misspellings where enchant will have a larger lexicon. I have no real >>>>> opinion on which framework makes sense - but we shouldn't include a >>>>> dictionary. After all, even linux checkpatch.pl will find the codespell >>>>> dictionary and just use that as it exists. >>>>> >>>>> However, I will point out that there *is* a difference between the two. >>>>> Here's a simple example: >>>>> >>>>> 02:01:19 aconole@RHTPC1VM0NT {(594d145410...)} ~/git/ovs$ echo >>>>> vailgrind | codespell - >>>>> 02:01:24 aconole@RHTPC1VM0NT {(594d145410...)} ~/git/ovs$ >>>>> >>>>> vs. >>>>> >>>>> 02:00:18 aconole@RHTPC1VM0NT {(594d145410...)} ~/git/ovs$ >>>>> ./utilities/checkpatch.py -S test.patch >>>>> == Checking "test.patch" == >>>>> WARNING: Possible misspelled word: "vaigrind" >>>>> Did you mean: ['grinder', 'valgrind'] >>>>> >>>>> So I guess we can probably do something better - and maybe that >>>>> something is to find the codespell dictionary cut it up and merge the >>>>> values with the current session (instead of add(), ie: 2/2 in this >>>>> series). >>>>> >>>>> But I don't think we need to copy the entire codespell dict. >>>>> >>>> >>>> In linux checkpatch find codespell dictonary.txt, loads it into memory and >>>> cut on '>' to get the word needed and looks like internal perl code does >>>> the spell check. >>>> >>>> I can do the same here to find codespell dictionary.txt and cut on '>' >>>> and I get the same dictionary I tried to add here but I need to create >>>> a temporary file to load it with enchant.DictWithPWL() >>>> >>>> As doing a loop of calls to spell_check_dict.add_to_session() seems >>>> to take a lot longer for the library by far as shown. >>>> >>>> Will that be ok? >>>> >>>> This is the example diff. I can send as an update patch if its ok. >>>> Looking for the codespell dictionary as it can be different folder between >>>> fedora and ubuntu for example. >>>> Then loading it and creating temporary file to load into enchant which >>>> is being removed when file is closed. >>> >>> I think it's a good idea to use the codespell dict, but it turns out we >>> don't actually need the temp file. I've tested with the add_to_session >>> patch, and here is my timing output: >>> >>> 09:19:43 aconole@RHTPC1VM0NT {(594d145410...)} ~/git/ovs$ time >>> ./utilities/checkpatch.py -S -1 test.patch >>> == Checking 594d145410c5 ("readthedocs: Use dirhtml builder.") == >>> WARNING: Possible misspelled word: "readthedocs" >>> Did you mean: ['headteachers'] >>> WARNING: Possible misspelled word: "dirhtml" >>> Did you mean: ['dirtily'] >>> Subject: readthedocs: Use dirhtml builder. >>> WARNING: Possible misspelled word: "ReadTheDocs" >>> Did you mean: ['Headteachers'] >>> Lines checked: 44, Warnings: 3, Errors: 0 >>> >>> >>> real 0m0.369s >>> user 0m0.344s >>> sys 0m0.024s >>> >>> The difference being that I don't use a tempfile - just your delta above >>> as: >>> >>> if codespell_file != '': >>> with open(codespell_file) as f: >>> for line in f.readlines(): >>> words = line.strip().split('>')[1].strip(', >>> ').split(',') >>> for word in words: >>> spell_check_dict.add_to_session(word) >>> >>> >>> Of course, looks like we would want to add other things like readthedocs, >>> dirhtml, etc. Or possibly skip checking words that are proper nouns >>> (like ReadTheDocs would be). Codespell ignores words it doesn't have a >>> correction / suggestion for, while enchant will assume something it >>> doesn't know is misspelled. This makes it so we will flag more often. >>> >>> WDYT? >> >> >> This looks good. >> We shouldn't lag more often than now as we only add more familiar words >> to enchant session. >> >> Do you want me to create a proper patch and send it or you will do it >> when taking the 2nd patch? > > I guess, you sending a new patch would be preferred as it can be reviewed by > others. > > //Eelco >
sure. will send v2 the two patches. thanks. >>> >>>> --- a/utilities/checkpatch.py >>>> +++ b/utilities/checkpatch.py >>>> @@ -37,8 +37,19 @@ spell_check_dict = None >>>> >>>> >>>> def open_spell_check_dict(): >>>> + import tempfile >>>> import enchant >>>> >>>> + try: >>>> + import codespell_lib >>>> + codespell_dir = os.path.dirname(codespell_lib.__file__) >>>> + codespell_file = os.path.join(codespell_dir, 'data', >>>> 'dictionary.txt') >>>> + if not os.path.exists(codespell_file): >>>> + codespell_file = '' >>>> + except: >>>> + codespell_file = '' >>>> + >>>> + >>>> try: >>>> extra_keywords = ['ovs', 'vswitch', 'vswitchd', 'ovs-vswitchd', >>>> 'netdev', 'selinux', 'ovs-ctl', 'dpctl', >>>> 'ofctl', >>>> @@ -93,7 +104,19 @@ def open_spell_check_dict(): >>>> global spell_check_dict >>>> word_dic = >>>> os.path.join(os.path.dirname(os.path.abspath(__file__)), >>>> 'dictionary.txt') >>>> - spell_check_dict = enchant.DictWithPWL('en_US', word_dic) >>>> + >>>> + if codespell_file: >>>> + with open(codespell_file) as f: >>>> + with tempfile.NamedTemporaryFile(mode='w') as f2: >>>> + for line in f.readlines(): >>>> + words = line.strip().split('>')[1].strip(', >>>> ').split(',') >>>> + for word in words: >>>> + f2.write(word.strip()+'\n') >>>> + >>>> + spell_check_dict = enchant.DictWithPWL('en_US', >>>> f2.name) >>>> + else: >>>> + spell_check_dict = enchant.Dict("en_US") >>>> + >>>> for kw in extra_keywords: >>>> spell_check_dict.add_to_session(kw) >>>> >>>> >>>> >>>> >>>> Example run with time, ignore the warning >>>> >>>> $ time python3 utilities/checkpatch.py -1 -S >>>> == Checking dcfe1402d26e ("checkpatch: Fix personal word list storage.") == >>>> ERROR: Committer Roi Dayan <r...@nvidia.com> needs to sign off. >>>> Lines checked: 31, Warnings: 0, Errors: 1 >>>> >>>> >>>> real 0m0.517s >>>> user 0m0.304s >>>> sys 0m0.067s >>>> >>>> >>>> >>>>>> Also adding the entire dictionary the script is even faster than >>>>>> adding word by word as done now. >>>>>> >>>>>> I think maybe removing the add word by word part at all but checking and >>>>>> doing >>>>>> in steps. >>>>>> >>>>>> Just by adding the dictionary and having the words being added through >>>>>> python >>>>>> already exists seems to be faster. >>>>>> >>>>>> Checking small commit before loading dictionry.txt: >>>>>> >>>>>> $ time ./utilities/checkpatch.py -S -1 >>>>>> >>>>>> real 0m28.379s >>>>>> user 0m0.272s >>>>>> sys 0m0.223s >>>>>> >>>>>> >>>>>> and after: >>>>>> >>>>>> $ time ./utilities/checkpatch.py -S -1 >>>>>> >>>>>> real 0m0.238s >>>>>> user 0m0.138s >>>>>> sys 0m0.038s >>>>>> >>>>>>> Cheers, >>>>>>> >>>>>>> Eelco >>>>>>> >>>>>>>> --- >>>>>>>> utilities/automake.mk | 1 + >>>>>>>> utilities/checkpatch.py | 4 +- >>>>>>>> utilities/dictionary.txt | 16161 +++++++++++++++++++++++++++++++++++++ >>>>>>>> 3 files changed, 16165 insertions(+), 1 deletion(-) >>>>>>>> create mode 100644 utilities/dictionary.txt >>>>>> >>>>>> _______________________________________________ >>>>>> dev mailing list >>>>>> d...@openvswitch.org >>>>>> https://mail.openvswitch.org/mailman/listinfo/ovs-dev >>>>> >>> > _______________________________________________ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev