This is an automated email from the git hooks/post-receive script.

rene pushed a commit to branch master in repository hunspell.
commit fc2b5bae843e42d0888303ed7b43b4393362d7a2 Author: Rene Engelhard <[email protected]> Date: Thu Apr 21 14:45:22 2016 +0200 Imported Upstream version 1.2.8 --- ChangeLog | 80 ++++++++ NEWS | 33 ++++ README | 1 + THANKS | 7 + aclocal.m4 | 28 ++- configure | 124 ++++++++---- configure.ac | 4 +- man/hunspell.4 | 66 +++++-- src/hunspell/Makefile.am | 4 +- src/hunspell/Makefile.in | 7 +- src/hunspell/affentry.cxx | 15 +- src/hunspell/affixmgr.cxx | 392 ++++++++++++++++++++++++++++++-------- src/hunspell/affixmgr.hxx | 13 +- src/hunspell/atypes.hxx | 8 + src/hunspell/csutil.cxx | 28 +-- src/hunspell/filemgr.cxx | 11 +- src/hunspell/hunspell.cxx | 345 ++++++++++++++++++--------------- src/hunspell/replist.cxx | 95 +++++++++ src/hunspell/replist.hxx | 24 +++ src/hunspell/w_char.hxx | 2 +- src/tools/Makefile.am | 2 +- src/tools/Makefile.in | 2 +- src/tools/affixcompress | 10 +- src/tools/hunspell.cxx | 2 + src/tools/wordforms | 35 ++++ src/win_api/Hunspell.rc | 8 +- src/win_api/config.h | 6 +- tests/Makefile.am | 47 ++++- tests/Makefile.in | 47 ++++- tests/break.wrong | 2 + tests/breakdefault.aff | 6 + tests/breakdefault.dic | 6 + tests/breakdefault.good | 7 + tests/breakdefault.sug | 3 + tests/breakdefault.test | 4 + tests/breakdefault.wrong | 3 + tests/checkcompoundpattern2.aff | 7 + tests/checkcompoundpattern2.dic | 3 + tests/checkcompoundpattern2.good | 3 + tests/checkcompoundpattern2.test | 4 + tests/checkcompoundpattern2.wrong | 1 + tests/checkcompoundpattern3.aff | 6 + tests/checkcompoundpattern3.dic | 5 + tests/checkcompoundpattern3.good | 9 + tests/checkcompoundpattern3.test | 4 + tests/checkcompoundpattern3.wrong | 8 + tests/checkcompoundpattern4.aff | 8 + tests/checkcompoundpattern4.dic | 6 + tests/checkcompoundpattern4.good | 2 + tests/checkcompoundpattern4.test | 4 + tests/checkcompoundpattern4.wrong | 2 + tests/condition.aff | 9 +- tests/condition.wrong | 4 + tests/condition_utf.aff | 8 +- tests/condition_utf.wrong | 4 + tests/iconv.aff | 10 + 
tests/iconv.dic | 5 + tests/iconv.good | 6 + tests/iconv.test | 4 + tests/oconv.aff | 12 ++ tests/oconv.dic | 4 + tests/oconv.good | 2 + tests/oconv.sug | 3 + tests/oconv.test | 4 + tests/oconv.wrong | 3 + tests/simplifiedtriple.aff | 8 + tests/simplifiedtriple.dic | 3 + tests/simplifiedtriple.good | 3 + tests/simplifiedtriple.test | 4 + tests/simplifiedtriple.wrong | 1 + 70 files changed, 1300 insertions(+), 346 deletions(-) diff --git a/ChangeLog b/ChangeLog index 267305c..5f78744 100644 --- a/ChangeLog +++ b/ChangeLog @@ -1,3 +1,83 @@ +2008-11-01 Németh László <nemeth at OOo>: + * replist.*, hunspell.cxx, affixmgr.cxx: new input and output + conversion support, see ICONV and OCONV keywords in the Hunspell(4) + manual page and the test examples. The input/output conversion + problem of syllabic languages reported by Daniel Yacob and + Shewangizaw Gulilat. + - tests/{iconv,oconv}.*: test examples + + * tools/wordforms: word generation script for dictionary developers + (Hunspell version of the unmunch program) + + * hunspell/hunspell.cxx: extended BREAK feature: ^ and $ mean in break + patterns the beginning and end of the word. + - tests/BREAK.*: modified examples. + + * hunspell/hunspell.cxx: set default break at hyphen characters. + The associated problem reported by S Page in Hunspell Bug 2174061. + See Mozilla Bug ID 355178 and OOo Issue 64400, too. + - tests/breakdefault.*: test data + The following definition is equivalent of the default word break: + + BREAK 3 + BREAK - + BREAK ^- + BREAK -$ + + * affixmgr.cxx: SIMPLIFIEDTRIPLE is a new affix file keyword to allow + simplified forms of the compound words with triple repeating letters. + It is useful for Swedish and Norwegian languages. + + * affixmgr.cxx: extend CHECKCOMPOUNDPATTERN to support + alternations of compound words for example by sandhi + feature of Indian and other languages. 
The problem reported + by Kiran Chittella associated with Telugu writing system + (see Telugu example in tests/checkcompoundpattern4.test). + The new optional field of CHECKCOMPOUNDPATTERN definition is the + replacement of the compound boundary defined by the previous fields: + CHECKCOMPOUNDPATTERN ff f ff + means ff|f compound boundary has been replaced by "ff", like in + the (prereform) German Schiffahrt (Schiff+fahrt). + - CHECKCOMPOUNDPATTERN supports also optional flag conditions now: + CHECKCOMPOUNDPATTERN ff/A f/B ff + means that the first word of the compound needs flag "A" and + the second word of the compound needs flag "B" to the operation. + + * tools/hunspell.cxx: add empty lines as separators to the output of + the stemming and morphological analysis. + + * affixmgr.cxx: fix condition checking algorithm. Bad suggestion + generation reported by Mehmet Akin in SF.net Bug 2124186 with help of + Eleonora Goldman. + + * affixmgr,cxx: fix COMPOUNDWORDMAX feature. The problem and its + code details reported by Göran Andersson under SF.net Bug ID 2138001. + + * csutil.cxx: fix bad conditional code for Mozilla compilation. + Patch by Serge Gautherie. The problem reported by Ryan VanderMeulen. + + * hunspell/hunspell.cxx: add missing ngram suggestion for HUHINITCAP + (capitalized mixed case) words. + + * w_char.hxx: use GCC conditions for GCC related code. Patch by + Ryan VanderMeulen. + + * affixmgr.cxx: check morphological description in morphgen() + (fix potential program fault by incomplete morphological + description of affix rules) + + * src/win_api: config.h: switch on warning messages on Windows + + * tools/affixcompress: extended help for -h (use LC_ALL=C sort + for input word list) + + * man/hunspell.4: updated manual: + - new and modified features (SIMPLIFIEDTRIPLE, ICONV, OCONV, + BREAK, CHECKCOMPOUNDPATTERN). + - note about costs of zero affixes, suggested by Olivier Ronez. + + * hunspell/hunspell.cxx: remove deprecated word breaking codes. 
+ 2008-08-15 Németh László <nemeth at OOo>: * affentry.cxx: add FULLSTRIP option. With FULLSTRIP, affix rules can strip full words, not only one less characters. Suggested by diff --git a/NEWS b/NEWS index 0db5c2d..602d62b 100644 --- a/NEWS +++ b/NEWS @@ -1,3 +1,36 @@ +2008-11-01: Hunspell 1.2.8 release: + - Default BREAK feature and better hyphenated word suggestion to accept + and fix (compound) words with hyphen characters by spell checker + instead of by work breaking code of OpenOffice.org. With this feature + it's possible to accept hyphenated compound words, such as "scot-free", + where "scot" is not a correct English word. + + - ICONV & OCONV: input and output conversion tables for optional character + handling or using special inner format. Example: + + # Accepting de facto replacements of the Romanian comma acuted letters + SET UTF-8 + ICONV 4 + ICONV ş ș + ICONV ţ ț + ICONV Ş Ș + ICONV Ţ Ț + + Typical usage of ICONV/OCONV is to manage an inner format for a segmental + writing system, like the Ethiopic script of the Amharic language. + + - Extended CHECKCOMPOUNDPATTERN to handle conpound word alternations, like + sandhi feature of Telugu and other writing systems. + + - SIMPLIFIEDTRIPLE compound word feature: allow simplified Swedish and + Norwegian compound word forms, like tillåta (till|låta) and + bussjåfør (buss|sjåfør) + + - wordforms: word generator script for dictionary developers (Hunspell + version of unmunch). + + - bug fixes + 2008-08-15: Hunspell 1.2.7 release: - FULLSTRIP: new option for affix handling. With FULLSTRIP, affix rules can strip full words, not only one less characters. diff --git a/README b/README index f97273f..ee34e26 100644 --- a/README +++ b/README @@ -140,6 +140,7 @@ affixcompress: dictionary generation from large (millions of words) vocabularies makealias: alias compression (Hunspell only, not back compatible with MySpell) munch: dictionary generation from vocabularies (it needs an affix file, too). 
unmunch: list all recognized words of a MySpell dictionary +wordforms: word generation (Hunspell version of unmunch) After compiling and installing (see INSTALL) you can run the Hunspell spell checker (compiled with user interface) diff --git a/THANKS b/THANKS index 24e5a18..8691cd9 100644 --- a/THANKS +++ b/THANKS @@ -1,5 +1,7 @@ Many thanks to the following contributors and supporters: +Mehmet Akin +Göran Andersson Lars Aronsson Ruud Baars Bartkó Zoltán @@ -17,6 +19,7 @@ David Einstein Rene Engelhard Frederik Fouvry Flemming Frandsen +Serge Gautherie Gavins at OOo Gefferth András Godó Ferenc @@ -37,6 +40,7 @@ Pavel Janík John Winters Mohamed Kebdani Kelemen Gábor +Shewangizaw Gulilat Kéménczy Kálmán Dan Kenigsberg Pham Ngoc Khanh @@ -61,6 +65,7 @@ Daniel Naber Nagy Viktor John Nisly Noll János +S Page Christophe Paris Malcolm Parsons Sylvain Paschein @@ -70,6 +75,7 @@ Harri Pitkänen Davide Prina Kevin F. Quinn Erdal Ronahi +Olivier Ronez Bernhard Rosenkraenzer Sarlós Tamás Thobias Schlemmer @@ -90,6 +96,7 @@ Martijn Wargers Michel Weimerskirch Brett Wilson Friedel Wolff +Daniel Yacob Gábor Zahemszky Taha Zerrouki and others (see also AUTHORS.myspell) diff --git a/aclocal.m4 b/aclocal.m4 index 7ac8da4..5b4252d 100644 --- a/aclocal.m4 +++ b/aclocal.m4 @@ -21,7 +21,7 @@ To do so, use the procedure documented by the package, typically `autoreconf'.]) # libtool.m4 - Configure libtool for the host system. 
-*-Autoconf-*- -# serial 52 AC_PROG_LIBTOOL +# serial 52 Debian 1.5.26-1ubuntu1 AC_PROG_LIBTOOL # AC_PROVIDE_IFELSE(MACRO-NAME, IF-PROVIDED, IF-NOT-PROVIDED) @@ -1723,6 +1723,18 @@ linux* | k*bsd*-gnu) dynamic_linker='GNU/Linux ld.so' ;; +netbsdelf*-gnu) + version_type=linux + need_lib_prefix=no + need_version=no + library_names_spec='${libname}${release}${shared_ext}$versuffix ${libname}${release}${shared_ext}$major ${libname}${shared_ext}' + soname_spec='${libname}${release}${shared_ext}$major' + shlibpath_var=LD_LIBRARY_PATH + shlibpath_overrides_runpath=no + hardcode_into_libs=yes + dynamic_linker='NetBSD ld.elf_so' + ;; + netbsd*) version_type=sunos need_lib_prefix=no @@ -2504,7 +2516,7 @@ linux* | k*bsd*-gnu) lt_cv_deplibs_check_method=pass_all ;; -netbsd*) +netbsd* | netbsdelf*-gnu) if echo __ELF__ | $CC -E - | grep __ELF__ > /dev/null; then lt_cv_deplibs_check_method='match_pattern /lib[[^/]]+(\.so\.[[0-9]]+\.[[0-9]]+|_pic\.a)$' else @@ -3511,7 +3523,7 @@ case $host_os in ;; esac ;; - netbsd*) + netbsd* | netbsdelf*-gnu) if echo __ELF__ | $CC -E - | grep __ELF__ >/dev/null; then _LT_AC_TAGVAR(archive_cmds, $1)='$LD -Bshareable -o $lib $predep_objects $libobjs $deplibs $postdep_objects $linker_flags' wlarc= @@ -5203,7 +5215,7 @@ AC_MSG_CHECKING([for $compiler option to produce PIC]) ;; esac ;; - netbsd*) + netbsd* | netbsdelf*-gnu) ;; osf3* | osf4* | osf5*) case $cc_basename in @@ -5580,6 +5592,9 @@ ifelse([$1],[CXX],[ cygwin* | mingw*) _LT_AC_TAGVAR(export_symbols_cmds, $1)='$NM $libobjs $convenience | $global_symbol_pipe | $SED -e '\''/^[[BCDGRS]][[ ]]/s/.*[[ ]]\([[^ ]]*\)/\1 DATA/;/^.*[[ ]]__nm__/s/^.*[[ ]]__nm__\([[^ ]]*\)[[ ]][[^ ]]*/\1 DATA/;/^I[[ ]]/d;/^[[AITW]][[ ]]/s/.*[[ ]]//'\'' | sort | uniq > $export_symbols' ;; + linux* | k*bsd*-gnu) + _LT_AC_TAGVAR(link_all_deplibs, $1)=no + ;; *) _LT_AC_TAGVAR(export_symbols_cmds, $1)='$NM $libobjs $convenience | $global_symbol_pipe | $SED '\''s/.* //'\'' | sort | uniq > $export_symbols' ;; @@ -5788,12 
+5803,13 @@ EOF $echo "local: *; };" >> $output_objdir/$libname.ver~ $CC '"$tmp_sharedflag""$tmp_addflag"' $libobjs $deplibs $compiler_flags ${wl}-soname $wl$soname ${wl}-version-script ${wl}$output_objdir/$libname.ver -o $lib' fi + _LT_AC_TAGVAR(link_all_deplibs, $1)=no else _LT_AC_TAGVAR(ld_shlibs, $1)=no fi ;; - netbsd*) + netbsd* | netbsdelf*-gnu) if echo __ELF__ | $CC -E - | grep __ELF__ >/dev/null; then _LT_AC_TAGVAR(archive_cmds, $1)='$LD -Bshareable $libobjs $deplibs $linker_flags -o $lib' wlarc= @@ -6224,7 +6240,7 @@ _LT_EOF _LT_AC_TAGVAR(link_all_deplibs, $1)=yes ;; - netbsd*) + netbsd* | netbsdelf*-gnu) if echo __ELF__ | $CC -E - | grep __ELF__ >/dev/null; then _LT_AC_TAGVAR(archive_cmds, $1)='$LD -Bshareable -o $lib $libobjs $deplibs $linker_flags' # a.out else diff --git a/configure b/configure index 1320607..7fe8a0d 100755 --- a/configure +++ b/configure @@ -1,6 +1,6 @@ #! /bin/sh # Guess values for system-dependent variables and create Makefiles. -# Generated by GNU Autoconf 2.61 for hunspell 1.2.7. +# Generated by GNU Autoconf 2.61 for hunspell 1.2.8. # # Report bugs to <[email protected]>. # @@ -728,8 +728,8 @@ SHELL=${CONFIG_SHELL-/bin/sh} # Identity of this package. PACKAGE_NAME='hunspell' PACKAGE_TARNAME='hunspell' -PACKAGE_VERSION='1.2.7' -PACKAGE_STRING='hunspell 1.2.7' +PACKAGE_VERSION='1.2.8' +PACKAGE_STRING='hunspell 1.2.8' PACKAGE_BUGREPORT='[email protected]' ac_unique_file="config.h.in" @@ -1425,7 +1425,7 @@ if test "$ac_init_help" = "long"; then # Omit some internal or obsolete options to make the list less imposing. # This message is too long to be a string in the A/UX 3.1 sh. cat <<_ACEOF -\`configure' configures hunspell 1.2.7 to adapt to many kinds of systems. +\`configure' configures hunspell 1.2.8 to adapt to many kinds of systems. Usage: $0 [OPTION]... [VAR=VALUE]... 
@@ -1496,7 +1496,7 @@ fi if test -n "$ac_init_help"; then case $ac_init_help in - short | recursive ) echo "Configuration of hunspell 1.2.7:";; + short | recursive ) echo "Configuration of hunspell 1.2.8:";; esac cat <<\_ACEOF @@ -1609,7 +1609,7 @@ fi test -n "$ac_init_help" && exit $ac_status if $ac_init_version; then cat <<\_ACEOF -hunspell configure 1.2.7 +hunspell configure 1.2.8 generated by GNU Autoconf 2.61 Copyright (C) 1992, 1993, 1994, 1995, 1996, 1998, 1999, 2000, 2001, @@ -1623,7 +1623,7 @@ cat >config.log <<_ACEOF This file contains any messages produced by compilers while running configure, to aid debugging if configure makes a mistake. -It was created by hunspell $as_me 1.2.7, which was +It was created by hunspell $as_me 1.2.8, which was generated by GNU Autoconf 2.61. Invocation command line was $ $0 $@ @@ -2442,7 +2442,7 @@ fi # Define the identity of the package. PACKAGE=hunspell - VERSION=1.2.7 + VERSION=1.2.8 cat >>confdefs.h <<_ACEOF @@ -4763,7 +4763,7 @@ linux* | k*bsd*-gnu) lt_cv_deplibs_check_method=pass_all ;; -netbsd*) +netbsd* | netbsdelf*-gnu) if echo __ELF__ | $CC -E - | grep __ELF__ > /dev/null; then lt_cv_deplibs_check_method='match_pattern /lib[^/]+(\.so\.[0-9]+\.[0-9]+|_pic\.a)$' else @@ -8128,12 +8128,13 @@ EOF $echo "local: *; };" >> $output_objdir/$libname.ver~ $CC '"$tmp_sharedflag""$tmp_addflag"' $libobjs $deplibs $compiler_flags ${wl}-soname $wl$soname ${wl}-version-script ${wl}$output_objdir/$libname.ver -o $lib' fi + link_all_deplibs=no else ld_shlibs=no fi ;; - netbsd*) + netbsd* | netbsdelf*-gnu) if echo __ELF__ | $CC -E - | grep __ELF__ >/dev/null; then archive_cmds='$LD -Bshareable $libobjs $deplibs $linker_flags -o $lib' wlarc= @@ -8676,7 +8677,7 @@ if test -z "$aix_libpath"; then aix_libpath="/usr/lib:/lib"; fi link_all_deplibs=yes ;; - netbsd*) + netbsd* | netbsdelf*-gnu) if echo __ELF__ | $CC -E - | grep __ELF__ >/dev/null; then archive_cmds='$LD -Bshareable -o $lib $libobjs $deplibs $linker_flags' # a.out else @@ 
-9387,6 +9388,18 @@ linux* | k*bsd*-gnu) dynamic_linker='GNU/Linux ld.so' ;; +netbsdelf*-gnu) + version_type=linux + need_lib_prefix=no + need_version=no + library_names_spec='${libname}${release}${shared_ext}$versuffix ${libname}${release}${shared_ext}$major ${libname}${shared_ext}' + soname_spec='${libname}${release}${shared_ext}$major' + shlibpath_var=LD_LIBRARY_PATH + shlibpath_overrides_runpath=no + hardcode_into_libs=yes + dynamic_linker='NetBSD ld.elf_so' + ;; + netbsd*) version_type=sunos need_lib_prefix=no @@ -10227,7 +10240,7 @@ else lt_dlunknown=0; lt_dlno_uscore=1; lt_dlneed_uscore=2 lt_status=$lt_dlunknown cat > conftest.$ac_ext <<EOF -#line 10230 "configure" +#line 10243 "configure" #include "confdefs.h" #if HAVE_DLFCN_H @@ -10327,7 +10340,7 @@ else lt_dlunknown=0; lt_dlno_uscore=1; lt_dlneed_uscore=2 lt_status=$lt_dlunknown cat > conftest.$ac_ext <<EOF -#line 10330 "configure" +#line 10343 "configure" #include "confdefs.h" #if HAVE_DLFCN_H @@ -11915,7 +11928,7 @@ if test -z "$aix_libpath"; then aix_libpath="/usr/lib:/lib"; fi ;; esac ;; - netbsd*) + netbsd* | netbsdelf*-gnu) if echo __ELF__ | $CC -E - | grep __ELF__ >/dev/null; then archive_cmds_CXX='$LD -Bshareable -o $lib $predep_objects $libobjs $deplibs $postdep_objects $linker_flags' wlarc= @@ -12619,7 +12632,7 @@ echo $ECHO_N "checking for $compiler option to produce PIC... $ECHO_C" >&6; } ;; esac ;; - netbsd*) + netbsd* | netbsdelf*-gnu) ;; osf3* | osf4* | osf5*) case $cc_basename in @@ -12728,11 +12741,11 @@ else -e 's:.*FLAGS}\{0,1\} :&$lt_compiler_flag :; t' \ -e 's: [^ ]*conftest\.: $lt_compiler_flag&:; t' \ -e 's:$: $lt_compiler_flag:'` - (eval echo "\"\$as_me:12731: $lt_compile\"" >&5) + (eval echo "\"\$as_me:12744: $lt_compile\"" >&5) (eval "$lt_compile" 2>conftest.err) ac_status=$? cat conftest.err >&5 - echo "$as_me:12735: \$? = $ac_status" >&5 + echo "$as_me:12748: \$? 
= $ac_status" >&5 if (exit $ac_status) && test -s "$ac_outfile"; then # The compiler can only warn and ignore the option if not recognized # So say no if there are warnings other than the usual output. @@ -12832,11 +12845,11 @@ else -e 's:.*FLAGS}\{0,1\} :&$lt_compiler_flag :; t' \ -e 's: [^ ]*conftest\.: $lt_compiler_flag&:; t' \ -e 's:$: $lt_compiler_flag:'` - (eval echo "\"\$as_me:12835: $lt_compile\"" >&5) + (eval echo "\"\$as_me:12848: $lt_compile\"" >&5) (eval "$lt_compile" 2>out/conftest.err) ac_status=$? cat out/conftest.err >&5 - echo "$as_me:12839: \$? = $ac_status" >&5 + echo "$as_me:12852: \$? = $ac_status" >&5 if (exit $ac_status) && test -s out/conftest2.$ac_objext then # The compiler can only warn and ignore the option if not recognized @@ -12904,6 +12917,9 @@ echo $ECHO_N "checking whether the $compiler linker ($LD) supports shared librar cygwin* | mingw*) export_symbols_cmds_CXX='$NM $libobjs $convenience | $global_symbol_pipe | $SED -e '\''/^[BCDGRS][ ]/s/.*[ ]\([^ ]*\)/\1 DATA/;/^.*[ ]__nm__/s/^.*[ ]__nm__\([^ ]*\)[ ][^ ]*/\1 DATA/;/^I[ ]/d;/^[AITW][ ]/s/.*[ ]//'\'' | sort | uniq > $export_symbols' ;; + linux* | k*bsd*-gnu) + link_all_deplibs_CXX=no + ;; *) export_symbols_cmds_CXX='$NM $libobjs $convenience | $global_symbol_pipe | $SED '\''s/.* //'\'' | sort | uniq > $export_symbols' ;; @@ -13350,6 +13366,18 @@ linux* | k*bsd*-gnu) dynamic_linker='GNU/Linux ld.so' ;; +netbsdelf*-gnu) + version_type=linux + need_lib_prefix=no + need_version=no + library_names_spec='${libname}${release}${shared_ext}$versuffix ${libname}${release}${shared_ext}$major ${libname}${shared_ext}' + soname_spec='${libname}${release}${shared_ext}$major' + shlibpath_var=LD_LIBRARY_PATH + shlibpath_overrides_runpath=no + hardcode_into_libs=yes + dynamic_linker='NetBSD ld.elf_so' + ;; + netbsd*) version_type=sunos need_lib_prefix=no @@ -14415,11 +14443,11 @@ else -e 's:.*FLAGS}\{0,1\} :&$lt_compiler_flag :; t' \ -e 's: [^ ]*conftest\.: $lt_compiler_flag&:; t' \ -e 's:$: 
$lt_compiler_flag:'` - (eval echo "\"\$as_me:14418: $lt_compile\"" >&5) + (eval echo "\"\$as_me:14446: $lt_compile\"" >&5) (eval "$lt_compile" 2>conftest.err) ac_status=$? cat conftest.err >&5 - echo "$as_me:14422: \$? = $ac_status" >&5 + echo "$as_me:14450: \$? = $ac_status" >&5 if (exit $ac_status) && test -s "$ac_outfile"; then # The compiler can only warn and ignore the option if not recognized # So say no if there are warnings other than the usual output. @@ -14519,11 +14547,11 @@ else -e 's:.*FLAGS}\{0,1\} :&$lt_compiler_flag :; t' \ -e 's: [^ ]*conftest\.: $lt_compiler_flag&:; t' \ -e 's:$: $lt_compiler_flag:'` - (eval echo "\"\$as_me:14522: $lt_compile\"" >&5) + (eval echo "\"\$as_me:14550: $lt_compile\"" >&5) (eval "$lt_compile" 2>out/conftest.err) ac_status=$? cat out/conftest.err >&5 - echo "$as_me:14526: \$? = $ac_status" >&5 + echo "$as_me:14554: \$? = $ac_status" >&5 if (exit $ac_status) && test -s out/conftest2.$ac_objext then # The compiler can only warn and ignore the option if not recognized @@ -14784,12 +14812,13 @@ EOF $echo "local: *; };" >> $output_objdir/$libname.ver~ $CC '"$tmp_sharedflag""$tmp_addflag"' $libobjs $deplibs $compiler_flags ${wl}-soname $wl$soname ${wl}-version-script ${wl}$output_objdir/$libname.ver -o $lib' fi + link_all_deplibs_F77=no else ld_shlibs_F77=no fi ;; - netbsd*) + netbsd* | netbsdelf*-gnu) if echo __ELF__ | $CC -E - | grep __ELF__ >/dev/null; then archive_cmds_F77='$LD -Bshareable $libobjs $deplibs $linker_flags -o $lib' wlarc= @@ -15312,7 +15341,7 @@ if test -z "$aix_libpath"; then aix_libpath="/usr/lib:/lib"; fi link_all_deplibs_F77=yes ;; - netbsd*) + netbsd* | netbsdelf*-gnu) if echo __ELF__ | $CC -E - | grep __ELF__ >/dev/null; then archive_cmds_F77='$LD -Bshareable -o $lib $libobjs $deplibs $linker_flags' # a.out else @@ -15971,6 +16000,18 @@ linux* | k*bsd*-gnu) dynamic_linker='GNU/Linux ld.so' ;; +netbsdelf*-gnu) + version_type=linux + need_lib_prefix=no + need_version=no + 
library_names_spec='${libname}${release}${shared_ext}$versuffix ${libname}${release}${shared_ext}$major ${libname}${shared_ext}' + soname_spec='${libname}${release}${shared_ext}$major' + shlibpath_var=LD_LIBRARY_PATH + shlibpath_overrides_runpath=no + hardcode_into_libs=yes + dynamic_linker='NetBSD ld.elf_so' + ;; + netbsd*) version_type=sunos need_lib_prefix=no @@ -16726,11 +16767,11 @@ else -e 's:.*FLAGS}\{0,1\} :&$lt_compiler_flag :; t' \ -e 's: [^ ]*conftest\.: $lt_compiler_flag&:; t' \ -e 's:$: $lt_compiler_flag:'` - (eval echo "\"\$as_me:16729: $lt_compile\"" >&5) + (eval echo "\"\$as_me:16770: $lt_compile\"" >&5) (eval "$lt_compile" 2>conftest.err) ac_status=$? cat conftest.err >&5 - echo "$as_me:16733: \$? = $ac_status" >&5 + echo "$as_me:16774: \$? = $ac_status" >&5 if (exit $ac_status) && test -s "$ac_outfile"; then # The compiler can only warn and ignore the option if not recognized # So say no if there are warnings other than the usual output. @@ -17016,11 +17057,11 @@ else -e 's:.*FLAGS}\{0,1\} :&$lt_compiler_flag :; t' \ -e 's: [^ ]*conftest\.: $lt_compiler_flag&:; t' \ -e 's:$: $lt_compiler_flag:'` - (eval echo "\"\$as_me:17019: $lt_compile\"" >&5) + (eval echo "\"\$as_me:17060: $lt_compile\"" >&5) (eval "$lt_compile" 2>conftest.err) ac_status=$? cat conftest.err >&5 - echo "$as_me:17023: \$? = $ac_status" >&5 + echo "$as_me:17064: \$? = $ac_status" >&5 if (exit $ac_status) && test -s "$ac_outfile"; then # The compiler can only warn and ignore the option if not recognized # So say no if there are warnings other than the usual output. @@ -17120,11 +17161,11 @@ else -e 's:.*FLAGS}\{0,1\} :&$lt_compiler_flag :; t' \ -e 's: [^ ]*conftest\.: $lt_compiler_flag&:; t' \ -e 's:$: $lt_compiler_flag:'` - (eval echo "\"\$as_me:17123: $lt_compile\"" >&5) + (eval echo "\"\$as_me:17164: $lt_compile\"" >&5) (eval "$lt_compile" 2>out/conftest.err) ac_status=$? cat out/conftest.err >&5 - echo "$as_me:17127: \$? = $ac_status" >&5 + echo "$as_me:17168: \$? 
= $ac_status" >&5 if (exit $ac_status) && test -s out/conftest2.$ac_objext then # The compiler can only warn and ignore the option if not recognized @@ -17385,12 +17426,13 @@ EOF $echo "local: *; };" >> $output_objdir/$libname.ver~ $CC '"$tmp_sharedflag""$tmp_addflag"' $libobjs $deplibs $compiler_flags ${wl}-soname $wl$soname ${wl}-version-script ${wl}$output_objdir/$libname.ver -o $lib' fi + link_all_deplibs_GCJ=no else ld_shlibs_GCJ=no fi ;; - netbsd*) + netbsd* | netbsdelf*-gnu) if echo __ELF__ | $CC -E - | grep __ELF__ >/dev/null; then archive_cmds_GCJ='$LD -Bshareable $libobjs $deplibs $linker_flags -o $lib' wlarc= @@ -17933,7 +17975,7 @@ if test -z "$aix_libpath"; then aix_libpath="/usr/lib:/lib"; fi link_all_deplibs_GCJ=yes ;; - netbsd*) + netbsd* | netbsdelf*-gnu) if echo __ELF__ | $CC -E - | grep __ELF__ >/dev/null; then archive_cmds_GCJ='$LD -Bshareable -o $lib $libobjs $deplibs $linker_flags' # a.out else @@ -18592,6 +18634,18 @@ linux* | k*bsd*-gnu) dynamic_linker='GNU/Linux ld.so' ;; +netbsdelf*-gnu) + version_type=linux + need_lib_prefix=no + need_version=no + library_names_spec='${libname}${release}${shared_ext}$versuffix ${libname}${release}${shared_ext}$major ${libname}${shared_ext}' + soname_spec='${libname}${release}${shared_ext}$major' + shlibpath_var=LD_LIBRARY_PATH + shlibpath_overrides_runpath=no + hardcode_into_libs=yes + dynamic_linker='NetBSD ld.elf_so' + ;; + netbsd*) version_type=sunos need_lib_prefix=no @@ -24994,7 +25048,7 @@ exec 6>&1 # report actual input values of CONFIG_FILES etc. instead of their # values after options handling. ac_log=" -This file was extended by hunspell $as_me 1.2.7, which was +This file was extended by hunspell $as_me 1.2.8, which was generated by GNU Autoconf 2.61. Invocation command line was CONFIG_FILES = $CONFIG_FILES @@ -25047,7 +25101,7 @@ Report bugs to <[email protected]>." 
_ACEOF cat >>$CONFIG_STATUS <<_ACEOF ac_cs_version="\\ -hunspell config.status 1.2.7 +hunspell config.status 1.2.8 configured by $0, generated by GNU Autoconf 2.61, with options \\"`echo "$ac_configure_args" | sed 's/^ //; s/[\\""\`\$]/\\\\&/g'`\\" diff --git a/configure.ac b/configure.ac index 55c7797..3a0e1f9 100644 --- a/configure.ac +++ b/configure.ac @@ -4,12 +4,12 @@ m4_pattern_allow AC_PREREQ(2.59) -AC_INIT([hunspell],[1.2.7],[[email protected]]) +AC_INIT([hunspell],[1.2.8],[[email protected]]) AC_CANONICAL_SYSTEM AC_SUBST(XFAILED) -AM_INIT_AUTOMAKE(hunspell, 1.2.7) +AM_INIT_AUTOMAKE(hunspell, 1.2.8) HUNSPELL_VERSION_MAJOR=`echo $VERSION | cut -d"." -f1` HUNSPELL_VERSION_MINOR=`echo $VERSION | cut -d"." -f2` AC_SUBST(HUNSPELL_VERSION_MAJOR) diff --git a/man/hunspell.4 b/man/hunspell.4 index 401db45..cd99f4b 100644 --- a/man/hunspell.4 +++ b/man/hunspell.4 @@ -300,11 +300,9 @@ UTF-8 characters yet. .SH "OPTIONS FOR COMPOUNDING" .IP "BREAK number_of_break_definitions" .IP "BREAK character_or_character_sequence" -Define break points for breaking words and checking -word parts separately. -Rationale: useful for compounding with joining character or strings (for -example, hyphen in English and German or hyphen and n-dash in Hungarian). -Dashes are often bad break points for tokenization, because compounds with +Define new break points for breaking words and checking +word parts separately. Use ^ and $ to delete characters at end and +start of the word. Rationale: useful for compounding with joining character or strings (for example, hyphen in English and German or hyphen and n-dash in Hungarian). Dashes are often bad break points for tokenization, because compounds with dashes may contain not valid parts, too.) With BREAK, Hunspell can check both side of these compounds, breaking the words at dashes and n-dashes: @@ -319,16 +317,37 @@ BREAK \fB--\fR # n-dash .PP Breaking are recursive, so foo-bar, bar-foo and foo-foo\fB--\fRbar-bar would be valid compounds. 
- -Note: COMPOUNDRULE is better (or will be better) for handling dashes and +Note: The default word break of Hunspell is equivalent of the following BREAK +definition: +.PP +.RS +.nf +BREAK 3 +BREAK - +BREAK ^- +BREAK -$ +.fi +.RE +.PP +Hunspell doesn't accept the "-word" and "word-" forms by this BREAK definition: +.PP +.RS +.nf +BREAK 1 +BREAK - +.fi +.RE +.PP +W +Note II: COMPOUNDRULE is better (or will be better) for handling dashes and other compound joining characters or character strings. Use BREAK, if you want check words with dashes or other joining characters and there is no time or possibility to describe precise compound rules with COMPOUNDRULE (COMPOUNDRULE has handled only the last suffixation of the compound word yet). -Note II: For command line spell checking, set WORDCHARS parameters: -WORDCHARS -\fB--\fR (see tests/break.*) example +Note III: For command line spell checking of words with extra characters, +set WORDCHARS parameters: WORDCHARS -\fB--\fR (see tests/break.*) example .IP "COMPOUNDRULE number_of_compound_definitions" .IP "COMPOUNDRULE compound_pattern" Define custom compound patterns with a regex-like syntax. @@ -388,14 +407,20 @@ a non compound word with a REP fault. Useful for languages with .IP "CHECKCOMPOUNDCASE" Forbid upper case characters at word bound in compounds. .IP "CHECKCOMPOUNDTRIPLE" -Forbid compounding, if compound word contains triple letters -(e.g. foo|ox or xo|oof). -Bug: missing multi-byte character support in UTF-8 encoding -(works only for 7-bit ASCII characters). +Forbid compounding, if compound word contains triple repeating letters +(e.g. foo|ox or xo|oof). Bug: missing multi-byte character support +in UTF-8 encoding (works only for 7-bit ASCII characters). +.IP "SIMPLIFIEDTRIPLE" +Allow simplified 2-letter forms of the compounds forbidden by CHECKCOMPOUNDTRIPLE. +It's useful for Swedish and Norwegian (and for +the old German orthography: Schiff|fahrt -> Schiffahrt). 
.IP "CHECKCOMPOUNDPATTERN number_of_checkcompoundpattern_definitions" -.IP "CHECKCOMPOUNDPATTERN endchars beginchars" -Forbid compounding, if first word in compound ends with endchars, and -next word begins with beginchars. +.IP "CHECKCOMPOUNDPATTERN endchars[/flag] beginchars[/flag] [replacement]" +Forbid compounding, if the first word in the compound ends with endchars, and +next word begins with beginchars and (optionally) they have the requested flags. +The optional replacement parameter allows simplified compound form. +Note: COMPOUNDMIN doesn't work correctly with the compound word alternation, +so it may need to set COMPOUNDMIN to lower value. .IP "COMPOUNDSYLLABLE max_syllable vowels" Need for special compounding rules in Hungarian. First parameter is the maximum syllable number, that may be in a @@ -465,6 +490,15 @@ Note: With CHECKSHARPS declaration, words with sharp s and KEEPCASE flag may be capitalized and uppercased, but uppercased forms of these words may not contain sharp s, only SS. See germancompounding example in the tests directory of the Hunspell distribution. + +Note: Using lot of zero affixes may have a big cost, because every +zero affix is checked under affix analysis before the other affixes. +.IP "ICONV number_of_ICONV_definitions" +.IP "ICONV pattern pattern2" +Define input conversion table. +.IP "OCONV number_of_OCONV_definitions" +.IP "OCONV pattern pattern2" +Define output conversion table. .IP "LEMMA_PRESENT flag" Not used in Hunspell 1.2. Use "st:" field instead of LEMMA_PRESENT. 
 .IP "NEEDAFFIX flag"
diff --git a/src/hunspell/Makefile.am b/src/hunspell/Makefile.am
index bbb720d..0172a5d 100644
--- a/src/hunspell/Makefile.am
+++ b/src/hunspell/Makefile.am
@@ -5,11 +5,11 @@ libhunspell_1_2_includedir = $(includedir)/hunspell
 libhunspell_1_2_la_SOURCES=affentry.cxx affixmgr.cxx csutil.cxx \
     dictmgr.cxx hashmgr.cxx hunspell.cxx utf_info.cxx \
     suggestmgr.cxx license.myspell license.hunspell \
-    phonet.cxx filemgr.cxx hunzip.cxx
+    phonet.cxx filemgr.cxx hunzip.cxx replist.cxx
 
 libhunspell_1_2_include_HEADERS=affentry.hxx htypes.hxx affixmgr.hxx \
     csutil.hxx hunspell.hxx atypes.hxx dictmgr.hxx hunspell.h \
     suggestmgr.hxx baseaffix.hxx hashmgr.hxx langnum.hxx \
-    phonet.hxx filemgr.hxx hunzip.hxx w_char.hxx
+    phonet.hxx filemgr.hxx hunzip.hxx w_char.hxx replist.hxx
 
 EXTRA_DIST=hunspell.dsp makefile.mk README
diff --git a/src/hunspell/Makefile.in b/src/hunspell/Makefile.in
index e3fef01..b7ae938 100644
--- a/src/hunspell/Makefile.in
+++ b/src/hunspell/Makefile.in
@@ -64,7 +64,7 @@ LTLIBRARIES = $(lib_LTLIBRARIES)
 libhunspell_1_2_la_LIBADD =
 am_libhunspell_1_2_la_OBJECTS = affentry.lo affixmgr.lo csutil.lo \
     dictmgr.lo hashmgr.lo hunspell.lo utf_info.lo suggestmgr.lo \
-    phonet.lo filemgr.lo hunzip.lo
+    phonet.lo filemgr.lo hunzip.lo replist.lo
 libhunspell_1_2_la_OBJECTS = $(am_libhunspell_1_2_la_OBJECTS)
 DEFAULT_INCLUDES = -I.@am__isrc@ -I$(top_builddir)
 depcomp = $(SHELL) $(top_srcdir)/depcomp
@@ -230,12 +230,12 @@ libhunspell_1_2_includedir = $(includedir)/hunspell
 libhunspell_1_2_la_SOURCES = affentry.cxx affixmgr.cxx csutil.cxx \
     dictmgr.cxx hashmgr.cxx hunspell.cxx utf_info.cxx \
     suggestmgr.cxx license.myspell license.hunspell \
-    phonet.cxx filemgr.cxx hunzip.cxx
+    phonet.cxx filemgr.cxx hunzip.cxx replist.cxx
 
 libhunspell_1_2_include_HEADERS = affentry.hxx htypes.hxx affixmgr.hxx \
     csutil.hxx hunspell.hxx atypes.hxx dictmgr.hxx hunspell.h \
     suggestmgr.hxx baseaffix.hxx hashmgr.hxx langnum.hxx \
-    phonet.hxx filemgr.hxx hunzip.hxx w_char.hxx
+    phonet.hxx filemgr.hxx hunzip.hxx w_char.hxx replist.hxx
 
 EXTRA_DIST = hunspell.dsp makefile.mk README
 all: all-am
@@ -316,6 +316,7 @@ distclean-compile:
 @AMDEP_TRUE@@am__include@ @am__quote@./$(DEPDIR)/hunspell.Plo@am__quote@
 @AMDEP_TRUE@@am__include@ @am__quote@./$(DEPDIR)/hunzip.Plo@am__quote@
 @AMDEP_TRUE@@am__include@ @am__quote@./$(DEPDIR)/phonet.Plo@am__quote@
+@AMDEP_TRUE@@am__include@ @am__quote@./$(DEPDIR)/replist.Plo@am__quote@
 @AMDEP_TRUE@@am__include@ @am__quote@./$(DEPDIR)/suggestmgr.Plo@am__quote@
 @AMDEP_TRUE@@am__include@ @am__quote@./$(DEPDIR)/utf_info.Plo@am__quote@
 
diff --git a/src/hunspell/affentry.cxx b/src/hunspell/affentry.cxx
index 4fea920..7c2dab4 100644
--- a/src/hunspell/affentry.cxx
+++ b/src/hunspell/affentry.cxx
@@ -486,22 +486,14 @@ inline char * SfxEntry::nextchar(char * p) {
 
 inline int SfxEntry::test_condition(const char * st, const char * beg)
 {
-//    fprintf(stderr, "ENTER: %s, %s\n", st, beg);
     const char * pos = NULL; // group with pos input position
     bool neg = false; // complementer
     bool ingroup = false; // character in the group
     if (numconds == 0) return 1;
     char * p = c.conds;
-//    while (p && *p) {
-//        fprintf(stderr, "%c", *p);
-//        p = nextchar(p);
-//    }
-//    fprintf(stderr, "\n");
-//    p = c.conds;
     st--;
     int i = 1;
     while (1) {
-//      if (p) fprintf(stderr, "POS: %c, %s\n", *p, st);
       switch (*p) {
         case '\0': return 1;
         case '[': { p = nextchar(p); pos = st; break; }
@@ -556,21 +548,22 @@ inline int SfxEntry::test_condition(const char * st, const char * beg)
                         else if (i == numconds) return 1;
                         ingroup = true;
                         while (p && *p != ']' && (p = nextchar(p)));
+                        st--;
                     }
-                    if (p) p = nextchar(p);
+                    if (p && *p != ']') p = nextchar(p);
                 } else if (pos) {
                     if (neg) return 0;
                     else if (i == numconds) return 1;
                     ingroup = true;
                     while (p && *p != ']' && (p = nextchar(p)));
-                    if (p) p = nextchar(p);
+//                  if (p && *p != ']') p = nextchar(p);
                     st--;
                 }
                 if (!pos) {
                     i++;
                     st--;
                 }
-                if (st < beg && p) return 0; // word <= condition
+                if (st < beg && p && *p != ']')
return 0; // word <= condition } else if (pos) { // group p = nextchar(p); } else return 0; diff --git a/src/hunspell/affixmgr.cxx b/src/hunspell/affixmgr.cxx index 4fe7170..b625ae9 100644 --- a/src/hunspell/affixmgr.cxx +++ b/src/hunspell/affixmgr.cxx @@ -42,7 +42,11 @@ AffixMgr::AffixMgr(const char * affpath, HashMgr** ptr, int * md, const char * k numbreak = 0; reptable = NULL; numrep = 0; + iconvtable = NULL; + oconvtable = NULL; checkcpdtable = NULL; + // allow simplified compound forms (see 3rd field of CHECKCOMPOUNDPATTERN) + simplifiedcpd = 0; numcheckcpd = 0; defcpdtable = NULL; numdefcpd = 0; @@ -58,6 +62,7 @@ AffixMgr::AffixMgr(const char * affpath, HashMgr** ptr, int * md, const char * k checkcompoundrep = 0; // forbid bad compounds (may be non compound word with a REP substitution) checkcompoundcase = 0; // forbid upper and lowercase combinations at word bounds checkcompoundtriple = 0; // forbid compounds with triple letters + simplifiedtriple = 0; // allow simplified triple letters in compounds (Schiff+fahrt -> Schiffahrt) forbiddenword = FORBIDDENWORD; // forbidden word signing flag nosuggest = FLAG_NULL; // don't suggest words signed with NOSUGGEST flag lang = NULL; // language @@ -181,6 +186,8 @@ AffixMgr::~AffixMgr() free(reptable); reptable = NULL; } + if (iconvtable) delete iconvtable; + if (oconvtable) delete oconvtable; if (phone && phone->rules) { for (int j=0; j < phone->num + 1; j++) { free(phone->rules[j * 2]); @@ -204,8 +211,10 @@ AffixMgr::~AffixMgr() for (int j=0; j < numcheckcpd; j++) { free(checkcpdtable[j].pattern); free(checkcpdtable[j].pattern2); + free(checkcpdtable[j].pattern3); checkcpdtable[j].pattern = NULL; checkcpdtable[j].pattern2 = NULL; + checkcpdtable[j].pattern3 = NULL; } free(checkcpdtable); checkcpdtable = NULL; @@ -405,6 +414,10 @@ int AffixMgr::parse_file(const char * affpath, const char * key) checkcompoundtriple = 1; } + if (strncmp(line,"SIMPLIFIEDTRIPLE",16) == 0) { + simplifiedtriple = 1; + } + if 
(strncmp(line,"CHECKCOMPOUNDCASE",17) == 0) { checkcompoundcase = 1; } @@ -518,6 +531,22 @@ int AffixMgr::parse_file(const char * affpath, const char * key) } } + /* parse in the input conversion table */ + if (strncmp(line,"ICONV",5) == 0) { + if (parse_convtable(line, afflst, &iconvtable, "ICONV")) { + delete afflst; + return 1; + } + } + + /* parse in the input conversion table */ + if (strncmp(line,"OCONV",5) == 0) { + if (parse_convtable(line, afflst, &oconvtable, "OCONV")) { + delete afflst; + return 1; + } + } + /* parse in the phonetic translation table */ if (strncmp(line,"PHONE",5) == 0) { if (parse_phonetable(line, afflst)) { @@ -685,12 +714,14 @@ int AffixMgr::parse_file(const char * affpath, const char * key) wordchars = mystrdup(expw); } - // temporary BREAK definition for German dash handling (OOo issue 64400) - if ((langnum == LANG_de) && (!breaktable)) { - breaktable = (char **) malloc(sizeof(char *)); + // default BREAK definition + if (!breaktable) { + breaktable = (char **) malloc(sizeof(char *) * 3); if (!breaktable) return 1; breaktable[0] = mystrdup("-"); - if (breaktable[0]) numbreak = 1; + breaktable[1] = mystrdup("^-"); + breaktable[2] = mystrdup("-$"); + if (breaktable[0] && breaktable[1] && breaktable[2]) numbreak = 3; } return 0; } @@ -1242,11 +1273,15 @@ int AffixMgr::cpdrep_check(const char * word, int wl) } // forbid compoundings when there are special patterns at word bound -int AffixMgr::cpdpat_check(const char * word, int pos) +int AffixMgr::cpdpat_check(const char * word, int pos, hentry * r1, hentry * r2) { int len; for (int i = 0; i < numcheckcpd; i++) { if (isSubset(checkcpdtable[i].pattern2, word + pos) && + (!r1 || !checkcpdtable[i].cond || + (r1->astr && TESTAFF(r1->astr, checkcpdtable[i].cond, r1->alen))) && + (!r2 || !checkcpdtable[i].cond2 || + (r2->astr && TESTAFF(r2->astr, checkcpdtable[i].cond2, r2->alen))) && (len = strlen(checkcpdtable[i].pattern)) && (pos > len) && (strncmp(word + pos - len, 
checkcpdtable[i].pattern, len) == 0)) return 1; } @@ -1292,7 +1327,11 @@ int AffixMgr::defcpd_check(hentry *** words, short wnum, hentry * rv, hentry ** (*words)[wnum] = rv; // has the last word COMPOUNDRULE flag? - if (rv->alen == 0) return 0; + if (rv->alen == 0) { + (*words)[wnum] = NULL; + if (w) *words = NULL; + return 0; + } ok = 0; for (i = 0; i < numdefcpd; i++) { for (j = 0; j < defcpdtable[i].len; j++) { @@ -1300,7 +1339,11 @@ int AffixMgr::defcpd_check(hentry *** words, short wnum, hentry * rv, hentry ** TESTAFF(rv->astr, defcpdtable[i].def[j], rv->alen)) ok = 1; } } - if (ok == 0) return 0; + if (ok == 0) { + (*words)[wnum] = NULL; + if (w) *words = NULL; + return 0; + } for (i = 0; i < numdefcpd; i++) { signed short pp = 0; // pattern position @@ -1405,6 +1448,21 @@ short AffixMgr::get_syllable(const char * word, int wlen) return num; } +void AffixMgr::setcminmax(int * cmin, int * cmax, const char * word, int len) { + if (utf8) { + int i; + for (*cmin = 0, i = 0; (i < cpdmin) && word[*cmin]; i++) { + for ((*cmin)++; (word[*cmin] & 0xc0) == 0x80; (*cmin)++); + } + for (*cmax = len, i = 0; (i < (cpdmin - 1)) && *cmax; i++) { + for ((*cmax)--; (word[*cmax] & 0xc0) == 0x80; (*cmax)--); + } + } else { + *cmin = cpdmin; + *cmax = len - cpdmin + 1; + } +} + // check if compound word is correctly spelled // hu_mov_rule = spec. 
Hungarian rule (XXX) struct hentry * AffixMgr::compound_check(const char * word, int len, @@ -1420,22 +1478,17 @@ struct hentry * AffixMgr::compound_check(const char * word, int len, char ch; int cmin; int cmax; + int striple = 0; + int scpd = 0; + int soldi = 0; + int oldcmin = 0; + int oldcmax = 0; + int oldlen = 0; + int checkedstriple = 0; int checked_prefix; - if (utf8) { - for (cmin = 0, i = 0; (i < cpdmin) && word[cmin]; i++) { - cmin++; - for (; (word[cmin] & 0xc0) == 0x80; cmin++); - } - for (cmax = len, i = 0; (i < (cpdmin - 1)) && cmax; i++) { - cmax--; - for (; (word[cmax] & 0xc0) == 0x80; cmax--); - } - } else { - cmin = cpdmin; - cmax = len - cpdmin + 1; - } + setcminmax(&cmin, &cmax, word, len); strcpy(st, word); @@ -1451,6 +1504,29 @@ struct hentry * AffixMgr::compound_check(const char * word, int len, if (i >= cmax) return NULL; } + do { // simplified checkcompoundpattern loop + + if (scpd > 0) { + for (; scpd <= numcheckcpd && (!checkcpdtable[scpd-1].pattern3 || + strncmp(word + i, checkcpdtable[scpd-1].pattern3, strlen(checkcpdtable[scpd-1].pattern3)) != 0); scpd++); + + if (scpd > numcheckcpd) break; // break simplified checkcompoundpattern loop + strcpy(st + i, checkcpdtable[scpd-1].pattern); + soldi = i; + i += strlen(checkcpdtable[scpd-1].pattern); + strcpy(st + i, checkcpdtable[scpd-1].pattern2); + strcpy(st + i + strlen(checkcpdtable[scpd-1].pattern2), word + soldi + strlen(checkcpdtable[scpd-1].pattern3)); + + oldlen = len; + len += strlen(checkcpdtable[scpd-1].pattern) + strlen(checkcpdtable[scpd-1].pattern2) - strlen(checkcpdtable[scpd-1].pattern3); + oldcmin = cmin; + oldcmax = cmax; + setcminmax(&cmin, &cmax, st, len); + + cmax = len - cpdmin + 1; + } + + ch = st[i]; st[i] = '\0'; @@ -1471,8 +1547,10 @@ struct hentry * AffixMgr::compound_check(const char * word, int len, TESTAFF(rv->astr, compoundmiddle, rv->alen)) || (numdefcpd && ((!words && !wordnum && defcpd_check(&words, wnum, rv, (hentry **) &rwords, 0)) || - (words && 
defcpd_check(&words, wnum, rv, (hentry **) &rwords, 0)))) - ))) { + (words && defcpd_check(&words, wnum, rv, (hentry **) &rwords, 0))))) || + (scpd != 0 && checkcpdtable[scpd-1].cond != FLAG_NULL && + !TESTAFF(rv->astr, checkcpdtable[scpd-1].cond, rv->alen))) + ) { rv = rv->next_homonym; } @@ -1489,6 +1567,7 @@ struct hentry * AffixMgr::compound_check(const char * word, int len, rv = NULL; } } + if (rv || (((wordnum == 0) && compoundbegin && ((rv = suffix_check(st, i, 0, NULL, NULL, 0, NULL, FLAG_NULL, compoundbegin, hu_mov_rule ? IN_CPD_OTHER : IN_CPD_BEGIN)) || @@ -1567,19 +1646,20 @@ struct hentry * AffixMgr::compound_check(const char * word, int len, ) ) // END of LANG_hu section + ) && + ( + // test CHECKCOMPOUNDPATTERN conditions + scpd == 0 || checkcpdtable[scpd-1].cond == FLAG_NULL || + TESTAFF(rv->astr, checkcpdtable[scpd-1].cond, rv->alen) ) - && ! (( checkcompoundtriple && // test triple letters + && ! (( checkcompoundtriple && scpd == 0 && !words && // test triple letters (word[i-1]==word[i]) && ( - ((i>1) && (word[i-1]==word[i-2])) || + ((i>1) && (word[i-1]==word[i-2])) || ((word[i-1]==word[i+1])) // may be word[i+1] == '\0' ) ) || - ( - // test CHECKCOMPOUNDPATTERN - numcheckcpd && cpdpat_check(word, i) - ) || - ( - checkcompoundcase && cpdcase_check(word, i) + ( + checkcompoundcase && scpd == 0 && !words && cpdcase_check(word, i) )) ) // LANG_hu section: spec. Hungarian rule @@ -1587,15 +1667,14 @@ struct hentry * AffixMgr::compound_check(const char * word, int len, (sfx && ((SfxEntry*)sfx)->getCont() && ( // XXX hardwired Hungarian dic. codes TESTAFF(((SfxEntry*)sfx)->getCont(), (unsigned short) 'x', ((SfxEntry*)sfx)->getContLen()) || TESTAFF(((SfxEntry*)sfx)->getCont(), (unsigned short) '%', ((SfxEntry*)sfx)->getContLen()) - ) + ) ) ) -// END of LANG_hu section - ) { + ) { // first word is ok condition // LANG_hu section: spec. 
Hungarian rule if (langnum == LANG_hu) { - // calculate syllable number of the word + // calculate syllable number of the word numsyllable += get_syllable(st, i); // + 1 word, if syllable number of the prefix > 1 (hungarian convention) @@ -1603,19 +1682,35 @@ struct hentry * AffixMgr::compound_check(const char * word, int len, } // END of LANG_hu section + // NEXT WORD(S) rv_first = rv; - rv = lookup((word+i)); // perhaps without prefix + st[i] = ch; + + do { // striple loop + + // check simplifiedtriple + if (simplifiedtriple) { + if (striple) { + checkedstriple = 1; + i--; // check "fahrt" instead of "ahrt" in "Schiffahrt" + } else if (i > 2 && *(word+i - 1) == *(word + i - 2)) striple = 1; + } + + rv = lookup((st+i)); // perhaps without prefix // search homonym with compound flag while ((rv) && ((needaffix && TESTAFF(rv->astr, needaffix, rv->alen)) || !((compoundflag && !words && TESTAFF(rv->astr, compoundflag, rv->alen)) || (compoundend && !words && TESTAFF(rv->astr, compoundend, rv->alen)) || - (numdefcpd && words && defcpd_check(&words, wnum + 1, rv, NULL,1))))) { + (numdefcpd && words && defcpd_check(&words, wnum + 1, rv, NULL,1))) || + (scpd != 0 && checkcpdtable[scpd-1].cond2 != FLAG_NULL && + !TESTAFF(rv->astr, checkcpdtable[scpd-1].cond2, rv->alen)) + )) { rv = rv->next_homonym; } - if (rv && words && words[wnum + 1]) return rv; + if (rv && words && words[wnum + 1]) return rv_first; oldnumsyllable2 = numsyllable; oldwordnum2 = wordnum; @@ -1647,17 +1742,24 @@ struct hentry * AffixMgr::compound_check(const char * word, int len, ) && ( ((cpdwordmax==-1) || (wordnum+1<cpdwordmax)) || - ((cpdmaxsyllable==0) || + ((cpdmaxsyllable!=0) && (numsyllable + get_syllable(HENTRY_WORD(rv), rv->clen)<=cpdmaxsyllable)) - ) - && ( + ) && + ( + // test CHECKCOMPOUNDPATTERN + !numcheckcpd || scpd != 0 || !cpdpat_check(word, i, rv_first, rv) + ) && + ( (!checkcompounddup || (rv != rv_first)) ) + // test CHECKCOMPOUNDPATTERN conditions + && (scpd == 0 || 
checkcpdtable[scpd-1].cond2 == FLAG_NULL || + TESTAFF(rv->astr, checkcpdtable[scpd-1].cond2, rv->alen)) ) { // forbid compound word, if it is a non compound word with typical fault if (checkcompoundrep && cpdrep_check(word,len)) return NULL; - return rv; + return rv_first; } numsyllable = oldnumsyllable2; @@ -1672,13 +1774,20 @@ struct hentry * AffixMgr::compound_check(const char * word, int len, pfx = NULL; rv = affix_check((word+i),strlen(word+i), compoundend, IN_CPD_END); } - + if (!rv && numdefcpd && words) { rv = affix_check((word+i),strlen(word+i), 0, IN_CPD_END); - if (rv && defcpd_check(&words, wnum + 1, rv, NULL, 1)) return rv; + if (rv && defcpd_check(&words, wnum + 1, rv, NULL, 1)) return rv_first; rv = NULL; } + // test CHECKCOMPOUNDPATTERN conditions (allowed forms) + if (rv && !(scpd == 0 || checkcpdtable[scpd-1].cond2 == FLAG_NULL || + TESTAFF(rv->astr, checkcpdtable[scpd-1].cond2, rv->alen))) rv = NULL; + + // test CHECKCOMPOUNDPATTERN conditions (forbidden compounds) + if (rv && numcheckcpd && scpd == 0 && cpdpat_check(word, i, rv_first, rv)) rv = NULL; + // check non_compound flag in suffix and prefix if ((rv) && ((pfx && ((PfxEntry*)pfx)->getCont() && @@ -1702,7 +1811,7 @@ struct hentry * AffixMgr::compound_check(const char * word, int len, if (langnum == LANG_hu) { // calculate syllable number of the word numsyllable += get_syllable(word + i, strlen(word + i)); - + // - affix syllable num. 
// XXX only second suffix (inflections, not derivations) if (sfxappnd) { @@ -1710,13 +1819,13 @@ struct hentry * AffixMgr::compound_check(const char * word, int len, numsyllable -= get_syllable(tmp, strlen(tmp)); free(tmp); } - + // + 1 word, if syllable number of the prefix > 1 (hungarian convention) if (pfx && (get_syllable(((PfxEntry *)pfx)->getKey(),strlen(((PfxEntry *)pfx)->getKey())) > 1)) wordnum++; // increment syllable num, if last word has a SYLLABLENUM flag // and the suffix is beginning `s' - + if (cpdsyllablenum) { switch (sfxflag) { case 'c': { numsyllable+=2; break; } @@ -1725,7 +1834,7 @@ struct hentry * AffixMgr::compound_check(const char * word, int len, } } } - + // increment word number, if the second word has a compoundroot flag if ((rv) && (compoundroot) && (TESTAFF(rv->astr, compoundroot, rv->alen))) { @@ -1739,7 +1848,7 @@ struct hentry * AffixMgr::compound_check(const char * word, int len, if ((rv) && ( ((cpdwordmax == -1) || (wordnum + 1 < cpdwordmax)) || - ((cpdmaxsyllable == 0) || + ((cpdmaxsyllable != 0) && (numsyllable <= cpdmaxsyllable)) ) && ( @@ -1747,7 +1856,7 @@ struct hentry * AffixMgr::compound_check(const char * word, int len, )) { // forbid compound word, if it is a non compound word with typical fault if (checkcompoundrep && cpdrep_check(word, len)) return NULL; - return rv; + return rv_first; } numsyllable = oldnumsyllable2; @@ -1755,24 +1864,52 @@ struct hentry * AffixMgr::compound_check(const char * word, int len, // perhaps second word is a compound word (recursive call) if (wordnum < maxwordnum) { - rv = compound_check((word+i),strlen(word+i), wordnum+1, + rv = compound_check((st+i),strlen(st+i), wordnum+1, numsyllable, maxwordnum, wnum + 1, words, 0, is_sug); + if (rv && numcheckcpd && (scpd == 0 && cpdpat_check(word, i, rv_first, rv) || + scpd != 0 && !cpdpat_check(word, i, rv_first, rv))) rv = NULL; } else { rv=NULL; } if (rv) { // forbid compound word, if it is a non compound word with typical fault if 
(checkcompoundrep && cpdrep_check(word, len)) return NULL; - return rv; + return rv_first; } + } while (striple && !checkedstriple); // end of striple loop + + if (checkedstriple) { + i++; + checkedstriple = 0; + striple = 0; + } + + } // first word is ok condition + + if (soldi != 0) { + i = soldi; + soldi = 0; + len = oldlen; + cmin = oldcmin; + cmax = oldcmax; } - st[i] = ch; + scpd++; + + } while (simplifiedcpd && scpd <= numcheckcpd); // end of simplifiedcpd loop + + if (soldi != 0) { + i = soldi; + strcpy(st, word); // XXX add more optim. + soldi = 0; + } else st[i] = ch; + + scpd = 0; wordnum = oldwordnum; numsyllable = oldnumsyllable; } - + return NULL; -} +} // check if compound word is correctly spelled // hu_mov_rule = spec. Hungarian rule (XXX) @@ -1796,19 +1933,7 @@ int AffixMgr::compound_check_morph(const char * word, int len, int cmin; int cmax; - if (utf8) { - for (cmin = 0, i = 0; (i < cpdmin) && word[cmin]; i++) { - cmin++; - for (; (word[cmin] & 0xc0) == 0x80; cmin++); - } - for (cmax = len, i = 0; (i < (cpdmin - 1)) && cmax; i++) { - cmax--; - for (; (word[cmax] & 0xc0) == 0x80; cmax--); - } - } else { - cmin = cpdmin; - cmax = len - cpdmin + 1; - } + setcminmax(&cmin, &cmax, word, len); strcpy(st, word); @@ -1965,7 +2090,7 @@ int AffixMgr::compound_check_morph(const char * word, int len, ) // END of LANG_hu section ) - && ! (( checkcompoundtriple && // test triple letters + && ! (( checkcompoundtriple && !words && // test triple letters (word[i-1]==word[i]) && ( ((i>1) && (word[i-1]==word[i-2])) || ((word[i-1]==word[i+1])) // may be word[i+1] == '\0' @@ -1973,10 +2098,10 @@ int AffixMgr::compound_check_morph(const char * word, int len, ) || ( // test CHECKCOMPOUNDPATTERN - numcheckcpd && cpdpat_check(word, i) + numcheckcpd && !words && cpdpat_check(word, i, rv, NULL) ) || ( - checkcompoundcase && cpdcase_check(word, i) + checkcompoundcase && !words && cpdcase_check(word, i) )) ) // LANG_hu section: spec. 
Hungarian rule @@ -2064,7 +2189,7 @@ int AffixMgr::compound_check_morph(const char * word, int len, ) && ( ((cpdwordmax==-1) || (wordnum+1<cpdwordmax)) || - ((cpdmaxsyllable==0) || + ((cpdmaxsyllable!=0) && (numsyllable+get_syllable(HENTRY_WORD(rv),rv->blen)<=cpdmaxsyllable)) ) && ( @@ -2188,7 +2313,7 @@ int AffixMgr::compound_check_morph(const char * word, int len, if ((rv) && ( ((cpdwordmax==-1) || (wordnum+1<cpdwordmax)) || - ((cpdmaxsyllable==0) || + ((cpdmaxsyllable!=0) && (numsyllable <= cpdmaxsyllable)) ) && ( @@ -2699,7 +2824,7 @@ char * AffixMgr::morphgen(char * ts, int wl, const unsigned short * ap, const unsigned char c = (unsigned char) (ap[i] & 0x00FF); SfxEntry * sptr = (SfxEntry *)sFlag[c]; while (sptr) { - if (sptr->getFlag() == ap[i] && ((sptr->getContLen() == 0) || + if (sptr->getFlag() == ap[i] && sptr->getMorph() && ((sptr->getContLen() == 0) || // don't generate forms with substandard affixes !TESTAFF(sptr->getCont(), substandard, sptr->getContLen()))) { @@ -2890,6 +3015,20 @@ struct replentry * AffixMgr::get_reptable() return reptable; } +// return iconv table +RepList * AffixMgr::get_iconvtable() +{ + if (! iconvtable ) return NULL; + return iconvtable; +} + +// return oconv table +RepList * AffixMgr::get_oconvtable() +{ + if (! 
oconvtable ) return NULL; + return oconvtable; +} + // return replacing table struct phonetable * AffixMgr::get_phonetable() { @@ -3271,6 +3410,88 @@ int AffixMgr::parse_reptable(char * line, FileMgr * af) } /* parse in the typical fault correcting table */ +int AffixMgr::parse_convtable(char * line, FileMgr * af, RepList ** rl, const char * keyword) +{ + if (*rl) { + HUNSPELL_WARNING(stderr, "error: line %d: multiple table definitions\n", af->getlinenum()); + return 1; + } + char * tp = line; + char * piece; + int i = 0; + int np = 0; + int numrl = 0; + piece = mystrsep(&tp, 0); + while (piece) { + if (*piece != '\0') { + switch(i) { + case 0: { np++; break; } + case 1: { + numrl = atoi(piece); + if (numrl < 1) { + HUNSPELL_WARNING(stderr, "error: line %d: incorrect entry number\n", af->getlinenum()); + return 1; + } + *rl = new RepList(numrl); + if (!rl) return 1; + np++; + break; + } + default: break; + } + i++; + } + piece = mystrsep(&tp, 0); + } + if (np != 2) { + HUNSPELL_WARNING(stderr, "error: line %d: missing data\n", af->getlinenum()); + return 1; + } + + /* now parse the num lines to read in the remainder of the table */ + char * nl; + for (int j=0; j < numrl; j++) { + if (!(nl = af->getline())) return 1; + mychomp(nl); + tp = nl; + i = 0; + char * pattern = NULL; + char * pattern2 = NULL; + piece = mystrsep(&tp, 0); + while (piece) { + if (*piece != '\0') { + switch(i) { + case 0: { + if (strncmp(piece, keyword, sizeof(keyword)) != 0) { + HUNSPELL_WARNING(stderr, "error: line %d: table is corrupt\n", af->getlinenum()); + delete *rl; + *rl = NULL; + return 1; + } + break; + } + case 1: { pattern = mystrrep(mystrdup(piece),"_"," "); break; } + case 2: { + pattern2 = mystrrep(mystrdup(piece),"_"," "); + break; + } + default: break; + } + i++; + } + piece = mystrsep(&tp, 0); + } + if (!pattern || !pattern2) { + HUNSPELL_WARNING(stderr, "error: line %d: table is corrupt\n", af->getlinenum()); + return 1; + } + (*rl)->add(pattern, pattern2); + } + return 0; 
+} + + +/* parse in the typical fault correcting table */ int AffixMgr::parse_phonetable(char * line, FileMgr * af) { if (phone) { @@ -3375,7 +3596,7 @@ int AffixMgr::parse_checkcpdtable(char * line, FileMgr * af) HUNSPELL_WARNING(stderr, "error: line %d: bad entry number\n", af->getlinenum()); return 1; } - checkcpdtable = (replentry *) malloc(numcheckcpd * sizeof(struct replentry)); + checkcpdtable = (patentry *) malloc(numcheckcpd * sizeof(struct patentry)); if (!checkcpdtable) return 1; np++; break; @@ -3400,6 +3621,9 @@ int AffixMgr::parse_checkcpdtable(char * line, FileMgr * af) i = 0; checkcpdtable[j].pattern = NULL; checkcpdtable[j].pattern2 = NULL; + checkcpdtable[j].pattern3 = NULL; + checkcpdtable[j].cond = FLAG_NULL; + checkcpdtable[j].cond2 = FLAG_NULL; piece = mystrsep(&tp, 0); while (piece) { if (*piece != '\0') { @@ -3412,8 +3636,24 @@ int AffixMgr::parse_checkcpdtable(char * line, FileMgr * af) } break; } - case 1: { checkcpdtable[j].pattern = mystrdup(piece); break; } - case 2: { checkcpdtable[j].pattern2 = mystrdup(piece); break; } + case 1: { + checkcpdtable[j].pattern = mystrdup(piece); + char * p = strchr(checkcpdtable[j].pattern, '/'); + if (p) { + *p = '\0'; + checkcpdtable[j].cond = pHMgr->decode_flag(p + 1); + } + break; } + case 2: { + checkcpdtable[j].pattern2 = mystrdup(piece); + char * p = strchr(checkcpdtable[j].pattern2, '/'); + if (p) { + *p = '\0'; + checkcpdtable[j].cond2 = pHMgr->decode_flag(p + 1); + } + break; + } + case 3: { checkcpdtable[j].pattern3 = mystrdup(piece); simplifiedcpd = 1; break; } default: break; } i++; diff --git a/src/hunspell/affixmgr.hxx b/src/hunspell/affixmgr.hxx index 541ab91..1c33191 100644 --- a/src/hunspell/affixmgr.hxx +++ b/src/hunspell/affixmgr.hxx @@ -14,6 +14,7 @@ using namespace std; #include "baseaffix.hxx" #include "hashmgr.hxx" #include "phonet.hxx" +#include "replist.hxx" // check flag duplication #define dupSFX (1 << 0) @@ -46,18 +47,22 @@ class AffixMgr int checkcompoundrep; int 
checkcompoundcase; int checkcompoundtriple; + int simplifiedtriple; FLAG forbiddenword; FLAG nosuggest; FLAG needaffix; int cpdmin; int numrep; replentry * reptable; + RepList * iconvtable; + RepList * oconvtable; int nummap; mapentry * maptable; int numbreak; char ** breaktable; int numcheckcpd; - replentry * checkcpdtable; + patentry * checkcpdtable; + int simplifiedcpd; int numdefcpd; flagentry * defcpdtable; phonetable * phone; @@ -140,11 +145,12 @@ public: short get_syllable (const char * word, int wlen); int cpdrep_check(const char * word, int len); - int cpdpat_check(const char * word, int len); + int cpdpat_check(const char * word, int len, hentry * r1, hentry * r2); int defcpd_check(hentry *** words, short wnum, hentry * rv, hentry ** rwords, char all); int cpdcase_check(const char * word, int len); inline int candidate_check(const char * word, int len); + void setcminmax(int * cmin, int * cmax, const char * word, int len); struct hentry * compound_check(const char * word, int len, short wordnum, short numsyllable, short maxwordnum, short wnum, hentry ** words, char hu_mov_rule, char is_sug); @@ -156,6 +162,8 @@ public: struct hentry * lookup(const char * word); int get_numrep(); struct replentry * get_reptable(); + RepList * get_iconvtable(); + RepList * get_oconvtable(); struct phonetable * get_phonetable(); int get_nummap(); struct mapentry * get_maptable(); @@ -202,6 +210,7 @@ private: int parse_num(char * line, int * out, FileMgr * af); int parse_cpdsyllable(char * line, FileMgr * af); int parse_reptable(char * line, FileMgr * af); + int parse_convtable(char * line, FileMgr * af, RepList ** rl, const char * keyword); int parse_phonetable(char * line, FileMgr * af); int parse_maptable(char * line, FileMgr * af); int parse_breaktable(char * line, FileMgr * af); diff --git a/src/hunspell/atypes.hxx b/src/hunspell/atypes.hxx index 0d4db14..4753f9c 100644 --- a/src/hunspell/atypes.hxx +++ b/src/hunspell/atypes.hxx @@ -87,4 +87,12 @@ struct flagentry { int 
len; }; +struct patentry { + char * pattern; + char * pattern2; + char * pattern3; + FLAG cond; + FLAG cond2; +}; + #endif diff --git a/src/hunspell/csutil.cxx b/src/hunspell/csutil.cxx index 6264bfc..7b5eb83 100644 --- a/src/hunspell/csutil.cxx +++ b/src/hunspell/csutil.cxx @@ -674,6 +674,20 @@ void mkallcap_utf(w_char * u, int nc, int langnum) { if (*p != '\0') *p = csconv[((unsigned char)*p)].cupper; } + // conversion function for protected memory + void store_pointer(char * dest, char * source) + { + memcpy(dest, &source, sizeof(char *)); + } + + // conversion function for protected memory + char * get_stored_pointer(char * s) + { + char * p; + memcpy(&p, s, sizeof(char *)); + return p; + } + #ifndef MOZILLA_CLIENT // convert null terminated string to all caps using encoding void enmkallcap(char * d, const char * p, const char * encoding) @@ -706,20 +720,6 @@ void mkallcap_utf(w_char * u, int nc, int langnum) { if (*p != '\0') *d= csconv[((unsigned char)*p)].cupper; } - // conversion function for protected memory - void store_pointer(char * dest, char * source) - { - memcpy(dest, &source, sizeof(char *)); - } - - // conversion function for protected memory - char * get_stored_pointer(char * s) - { - char * p; - memcpy(&p, s, sizeof(char *)); - return p; - } - // these are simple character mappings for the // encodings supported // supplying isupper, tolower, and toupper diff --git a/src/hunspell/filemgr.cxx b/src/hunspell/filemgr.cxx index f2f4360..4150ce6 100644 --- a/src/hunspell/filemgr.cxx +++ b/src/hunspell/filemgr.cxx @@ -1,6 +1,15 @@ -#include <stdio.h> +#include "license.hunspell" +#include "license.myspell" + +#ifndef MOZILLA_CLIENT +#include <cstdlib> +#include <cstring> +#include <cstdio> +#else #include <stdlib.h> #include <string.h> +#include <stdio.h> +#endif #include "filemgr.hxx" diff --git a/src/hunspell/hunspell.cxx b/src/hunspell/hunspell.cxx index 4219a6d..c35f4b5 100644 --- a/src/hunspell/hunspell.cxx +++ b/src/hunspell/hunspell.cxx @@ -6,9 
+6,9 @@ #include <cstring> #include <cstdio> #else -#include <stdlib.h> +#include <stdlib.h> #include <string.h> -#include <stdio.h> +#include <stdio.h> #endif #include "hunspell.hxx" @@ -83,19 +83,19 @@ int Hunspell::add_dic(const char * dpath, const char * key) { // make a copy of src at destination while removing all leading // blanks and removing any trailing periods after recording // their presence with the abbreviation flag -// also since already going through character by character, +// also since already going through character by character, // set the capitalization type // return the length of the "cleaned" (and UTF-8 encoded) word -int Hunspell::cleanword2(char * dest, const char * src, +int Hunspell::cleanword2(char * dest, const char * src, w_char * dest_utf, int * nc, int * pcaptype, int * pabbrev) -{ +{ unsigned char * p = (unsigned char *) dest; const unsigned char * q = (const unsigned char * ) src; // first skip over any leading blanks while ((*q != '\0') && (*q == ' ')) q++; - + // now strip off any trailing periods (recording their presence) *pabbrev = 0; int nl = strlen((const char *)q); @@ -103,14 +103,14 @@ int Hunspell::cleanword2(char * dest, const char * src, nl--; (*pabbrev)++; } - + // if no characters are left it can't be capitalized - if (nl <= 0) { + if (nl <= 0) { *pcaptype = NOCAP; *p = '\0'; return 0; } - + strncpy(dest, (char *) q, nl); *(dest + nl) = '\0'; nl = strlen(dest); @@ -128,18 +128,18 @@ int Hunspell::cleanword2(char * dest, const char * src, *nc = nl; } return nl; -} +} -int Hunspell::cleanword(char * dest, const char * src, +int Hunspell::cleanword(char * dest, const char * src, int * pcaptype, int * pabbrev) -{ +{ unsigned char * p = (unsigned char *) dest; const unsigned char * q = (const unsigned char * ) src; int firstcap = 0; // first skip over any leading blanks while ((*q != '\0') && (*q == ' ')) q++; - + // now strip off any trailing periods (recording their presence) *pabbrev = 0; int nl = strlen((const char 
*)q); @@ -147,9 +147,9 @@ int Hunspell::cleanword(char * dest, const char * src, nl--; (*pabbrev)++; } - + // if no characters are left it can't be capitalized - if (nl <= 0) { + if (nl <= 0) { *pcaptype = NOCAP; *p = '\0'; return 0; @@ -201,7 +201,7 @@ int Hunspell::cleanword(char * dest, const char * src, *pcaptype = HUHCAP; } return strlen(dest); -} +} void Hunspell::mkallcap(char * p) { @@ -218,7 +218,7 @@ void Hunspell::mkallcap(char * p) } u16_u8(p, MAXWORDUTF8LEN, u, nc); } else { - while (*p != '\0') { + while (*p != '\0') { *p = csconv[((unsigned char) *p)].cupper; p++; } @@ -238,9 +238,9 @@ int Hunspell::mkallcap2(char * p, w_char * u, int nc) } } u16_u8(p, MAXWORDUTF8LEN, u, nc); - return strlen(p); + return strlen(p); } else { - while (*p != '\0') { + while (*p != '\0') { *p = csconv[((unsigned char) *p)].cupper; p++; } @@ -251,7 +251,7 @@ int Hunspell::mkallcap2(char * p, w_char * u, int nc) void Hunspell::mkallsmall(char * p) { - while (*p != '\0') { + while (*p != '\0') { *p = csconv[((unsigned char) *p)].clower; p++; } @@ -272,7 +272,7 @@ int Hunspell::mkallsmall2(char * p, w_char * u, int nc) u16_u8(p, MAXWORDUTF8LEN, u, nc); return strlen(p); } else { - while (*p != '\0') { + while (*p != '\0') { *p = csconv[((unsigned char) *p)].clower; p++; } @@ -316,7 +316,7 @@ int Hunspell::is_keepcase(const hentry * rv) { TESTAFF(rv->astr, pAMgr->get_keepcase(), rv->alen); } -/* insert a word to beginning of the suggestion array and return ns */ +/* insert a word to the beginning of the suggestion array and return ns */ int Hunspell::insert_sug(char ***slst, char * word, int ns) { char * dup = mystrdup(word); if (!dup) return ns; @@ -348,12 +348,18 @@ int Hunspell::spell(const char * word, int * info, char ** root) } int captype = 0; int abbv = 0; - int wl = cleanword2(cw, word, unicw, &nc, &captype, &abbv); + int wl = 0; + + // input conversion + RepList * rl = (pAMgr) ? 
pAMgr->get_iconvtable() : NULL; + if (rl && rl->conv(word, wspace)) wl = cleanword2(cw, wspace, unicw, &nc, &captype, &abbv); + else wl = cleanword2(cw, word, unicw, &nc, &captype, &abbv); + int info2 = 0; if (wl == 0 || maxdic == 0) return 1; if (root) *root = NULL; - // allow numbers with dots and commas (but forbid double separators: "..", ",," etc.) + // allow numbers with dots, dashes and commas (but forbid double separators: "..", "--" etc.) enum { NBEGIN, NNUM, NSEP }; int nstate = NBEGIN; int i; @@ -369,19 +375,10 @@ int Hunspell::spell(const char * word, int * info, char ** root) if ((i == wl) && (nstate == NNUM)) return 1; if (!info) info = &info2; else *info = 0; - // LANG_hu section: number(s) + (percent or degree) with suffixes - if (langnum == LANG_hu) { - if ((nstate == NNUM) && ((cw[i] == '%') || ((!utf8 && (cw[i] == '\xB0')) || - (utf8 && (strncmp(cw + i, "\xC2\xB0", 2)==0 || // UTF-8 degree - strncmp(cw + i, "\xE2\x80\xB0", 3)==0)))) // UTF-8 per mille - && checkword(cw + i, info, root)) return 1; - } - // END of LANG_hu section - switch(captype) { - case HUHCAP: - case HUHINITCAP: - case NOCAP: { + case HUHCAP: + case HUHINITCAP: + case NOCAP: { rv = checkword(cw, info, root); if ((abbv) && !(rv)) { memcpy(wspace,cw,wl); @@ -448,7 +445,7 @@ int Hunspell::spell(const char * word, int * info, char ** root) if (rv) break; } } - case INITCAP: { + case INITCAP: { wl = mkallsmall2(cw, unicw, nc); memcpy(wspace,cw,(wl+1)); wl2 = mkinitcap2(cw, unicw, nc); @@ -461,7 +458,7 @@ int Hunspell::spell(const char * word, int * info, char ** root) if (*info & SPELL_FORBIDDEN) { rv = NULL; break; - } + } if (rv && is_keepcase(rv) && (captype == ALLCAP)) rv = NULL; if (rv) break; @@ -488,88 +485,60 @@ int Hunspell::spell(const char * word, int * info, char ** root) // in INITCAP form, too. 
!(pAMgr->get_checksharps() && ((utf8 && strstr(wspace, "\xC3\x9F")) || - (!utf8 && strchr(wspace, '\xDF')))))) rv = NULL; + (!utf8 && strchr(wspace, '\xDF')))))) rv = NULL; break; - } + } } - + if (rv) return 1; - // recursive breaking at break points (not good for morphological analysis) + // recursive breaking at break points if (wordbreak) { char * s; char r; int corr = 0; - // German words beginning with "-" are not accepted - if (langnum == LANG_de) corr = 1; + wl = strlen(cw); int numbreak = pAMgr ? pAMgr->get_numbreak() : 0; + // check boundary patterns (^begin and end$) for (int j = 0; j < numbreak; j++) { - s=(char *) strstr(cw + corr, wordbreak[j]); - if (s) { + int plen = strlen(wordbreak[j]); + if (plen == 1 || plen > wl) continue; + if (wordbreak[j][0] == '^' && strncmp(cw, wordbreak[j] + 1, plen - 1) == 0 + && spell(cw + plen - 1)) return 1; + if (wordbreak[j][plen - 1] == '$' && + strncmp(cw + wl - plen + 1, wordbreak[j], plen - 1) == 0) { + r = cw[wl - plen + 1]; + cw[wl - plen + 1] = '\0'; + if (spell(cw)) return 1; + cw[wl - plen + 1] = r; + } + } + // other patterns + for (int j = 0; j < numbreak; j++) { + int result = 0; + int plen = strlen(wordbreak[j]); + s=(char *) strstr(cw, wordbreak[j]); + if (s && (s > cw) && (s < cw + wl - plen)) { + if (!spell(s + plen)) continue; r = *s; *s = '\0'; // examine 2 sides of the break point - if (spell(cw) && spell(s + strlen(wordbreak[j]))) { - *s = r; - return 1; - } + if (spell(cw)) return 1; *s = r; + + // LANG_hu: spec. dash rule + if (langnum == LANG_hu && strcmp(wordbreak[j], "-") == 0) { + r = s[1]; + s[1] = '\0'; + if (spell(cw)) return 1; // check the first part with dash + s[1] = r; + } + // end of LANG speficic region + } } } - // LANG_hu: compoundings with dashes and n-dashes XXX deprecated! 
- if (langnum == LANG_hu) { - int n; - // compound word with dash (HU) I18n - char * dash; - int result = 0; - // n-dash - dash = (char *) strstr(cw,"\xE2\x80\x93"); - if (dash && !wordbreak) { - *dash = '\0'; - // examine 2 sides of the dash - if (spell(cw) && spell(dash + 3)) { - *dash = '\xE2'; - return 1; - } - *dash = '\xE2'; - } - dash = (char *) strchr(cw,'-'); - if (dash) { - *dash='\0'; - // examine 2 sides of the dash - if (dash[1] == '\0') { // base word ending with dash - if (spell(cw)) return 1; - } else { - // first word ending with dash: word- - char r2 = *(dash + 1); - dash[0]='-'; - dash[1]='\0'; - result = spell(cw); - dash[1] = r2; - dash[0]='\0'; - if (result && spell(dash+1) && ((strlen(dash+1) > 1) || (dash[1] == 'e') || - ((dash[1] > '0') && (dash[1] < '9')))) return 1; - } - // affixed number in correct word - if (result && (dash > cw) && (((*(dash-1)<='9') && (*(dash-1)>='0')) || (*(dash-1)>='.'))) { - *dash='-'; - n = 1; - if (*(dash - n) == '.') n++; - // search first not a number character to left from dash - while (((dash - n)>=cw) && ((*(dash - n)=='0') || (n < 3)) && (n < 6)) { - n++; - } - if ((dash - n) < cw) n--; - // numbers: deprecated - for(; n >= 1; n--) { - if ((*(dash - n) >= '0') && (*(dash - n) <= '9') && - checkword(dash - n, info, root)) return 1; - } - } - } - } return 0; } @@ -635,8 +604,8 @@ struct hentry * Hunspell::checkword(const char * w, int * info, char ** root) // check compound restriction and onlyupcase if (he && he->astr && ( - (pAMgr->get_onlyincompound() && - TESTAFF(he->astr, pAMgr->get_onlyincompound(), he->alen)) || + (pAMgr->get_onlyincompound() && + TESTAFF(he->astr, pAMgr->get_onlyincompound(), he->alen)) || (info && (*info & SPELL_INITCAP) && TESTAFF(he->astr, ONLYUPCASEFLAG, he->alen)))) { he = NULL; @@ -664,7 +633,7 @@ struct hentry * Hunspell::checkword(const char * w, int * info, char ** root) he = pAMgr->compound_check(dup, len-1, -5, 0, 100, 0, NULL, 1, 0); free(dup); } - // end of LANG 
speficic region + // end of LANG speficic region if (he) { if (root) { *root = mystrdup(&(he->word)); @@ -701,18 +670,24 @@ int Hunspell::suggest(char*** slst, const char * word) } int captype = 0; int abbv = 0; - int wl = cleanword2(cw, word, unicw, &nc, &captype, &abbv); + int wl = 0; + + // input conversion + RepList * rl = (pAMgr) ? pAMgr->get_iconvtable() : NULL; + if (rl && rl->conv(word, wspace)) wl = cleanword2(cw, wspace, unicw, &nc, &captype, &abbv); + else wl = cleanword2(cw, word, unicw, &nc, &captype, &abbv); + if (wl == 0) return 0; int ns = 0; int capwords = 0; switch(captype) { - case NOCAP: { + case NOCAP: { ns = pSMgr->suggest(slst, cw, ns, &onlycmpdsug); break; } - case INITCAP: { + case INITCAP: { capwords = 1; ns = pSMgr->suggest(slst, cw, ns, &onlycmpdsug); if (ns == -1) break; @@ -723,7 +698,7 @@ int Hunspell::suggest(char*** slst, const char * word) } case HUHINITCAP: capwords = 1; - case HUHCAP: { + case HUHCAP: { ns = pSMgr->suggest(slst, cw, ns, &onlycmpdsug); if (ns != -1) { int prevns; @@ -785,7 +760,7 @@ int Hunspell::suggest(char*** slst, const char * word) break; } - case ALLCAP: { + case ALLCAP: { memcpy(wspace, cw, (wl+1)); mkallsmall2(wspace, unicw, nc); ns = pSMgr->suggest(slst, wspace, ns, &onlycmpdsug); @@ -837,7 +812,7 @@ int Hunspell::suggest(char*** slst, const char * word) } } // END OF LANG_hu section - + // try ngram approach since found nothing if ((ns == 0 || onlycmpdsug) && pAMgr && (pAMgr->get_maxngramsugs() != 0)) { switch(captype) { @@ -845,13 +820,15 @@ int Hunspell::suggest(char*** slst, const char * word) ns = pSMgr->ngsuggest(*slst, cw, ns, pHMgr, maxdic); break; } + case HUHINITCAP: + capwords = 1; case HUHCAP: { memcpy(wspace,cw,(wl+1)); mkallsmall2(wspace, unicw, nc); ns = pSMgr->ngsuggest(*slst, wspace, ns, pHMgr, maxdic); - break; + break; } - case INITCAP: { + case INITCAP: { capwords = 1; memcpy(wspace,cw,(wl+1)); mkallsmall2(wspace, unicw, nc); @@ -863,13 +840,50 @@ int Hunspell::suggest(char*** slst, 
const char * word) mkallsmall2(wspace, unicw, nc); int oldns = ns; ns = pSMgr->ngsuggest(*slst, wspace, ns, pHMgr, maxdic); - for (int j = oldns; j < ns; j++) + for (int j = oldns; j < ns; j++) mkallcap((*slst)[j]); break; } } } + // try dash suggestion (Afo-American -> Afro-American) + if (strchr(cw, '-')) { + char * pos = strchr(cw, '-'); + char * ppos = cw; + int nodashsug = 1; + char ** nlst = NULL; + int nn = 0; + int last = 0; + for (int j = 0; j < ns && nodashsug == 1; j++) { + if (strchr((*slst)[j], '-')) nodashsug = 0; + } + while (nodashsug && !last) { + if (*pos == '\0') last = 1; else *pos = '\0'; + if (!spell(ppos)) { + nn = suggest(&nlst, ppos); + for (int j = nn - 1; j >= 0; j--) { + strncpy(wspace, cw, ppos - cw); + strcpy(wspace + (ppos - cw), nlst[j]); + if (!last) { + strcat(wspace, "-"); + strcat(wspace, pos + 1); + } + ns = insert_sug(slst, wspace, ns); + free(nlst[j]); + } + if (nlst != NULL) free(nlst); + nodashsug = 0; + } + if (!last) { + *pos = '-'; + ppos = pos + 1; + pos = strchr(ppos, '-'); + } + if (!pos) pos = cw + strlen(cw); + } + } + // word reversing wrapper for complex prefixes if (complexprefixes) { for (int j = 0; j < ns; j++) { @@ -908,7 +922,7 @@ int Hunspell::suggest(char*** slst, const char * word) len = strlen(s); } mkallsmall2(s, w, len); - free((*slst)[j]); + free((*slst)[j]); if (spell(s)) { (*slst)[l] = mystrdup(s); if ((*slst)[l]) l++; @@ -922,7 +936,7 @@ int Hunspell::suggest(char*** slst, const char * word) } else { (*slst)[l] = (*slst)[j]; l++; - } + } } ns = l; } @@ -942,6 +956,15 @@ int Hunspell::suggest(char*** slst, const char * word) l++; } + // output conversion + rl = (pAMgr) ? 
pAMgr->get_oconvtable() : NULL; + for (int j = 0; rl && j < ns; j++) { + if (rl->conv((*slst)[j], wspace)) { + free((*slst)[j]); + (*slst)[j] = mystrdup(wspace); + } + } + // if suggestions removed by nosuggest, onlyincompound parameters if (l == 0 && *slst) { free(*slst); @@ -978,15 +1001,15 @@ int Hunspell::suggest_auto(char*** slst, const char * word) if (wl == 0) return 0; int ns = 0; *slst = NULL; // HU, nsug in pSMgr->suggest - + switch(captype) { - case NOCAP: { + case NOCAP: { ns = pSMgr->suggest_auto(slst, cw, ns); if (ns>0) break; break; } - case INITCAP: { + case INITCAP: { memcpy(wspace,cw,(wl+1)); mkallsmall(wspace); ns = pSMgr->suggest_auto(slst, wspace, ns); @@ -994,10 +1017,11 @@ int Hunspell::suggest_auto(char*** slst, const char * word) mkinitcap((*slst)[j]); ns = pSMgr->suggest_auto(slst, cw, ns); break; - + } - case HUHCAP: { + case HUHINITCAP: + case HUHCAP: { ns = pSMgr->suggest_auto(slst, cw, ns); if (ns == 0) { memcpy(wspace,cw,(wl+1)); @@ -1007,7 +1031,7 @@ int Hunspell::suggest_auto(char*** slst, const char * word) break; } - case ALLCAP: { + case ALLCAP: { memcpy(wspace,cw,(wl+1)); mkallsmall(wspace); ns = pSMgr->suggest_auto(slst, wspace, ns); @@ -1053,7 +1077,7 @@ int Hunspell::suggest_auto(char*** slst, const char * word) } } } - // END OF LANG_hu section + // END OF LANG_hu section return ns; } #endif @@ -1111,14 +1135,14 @@ int Hunspell::stem(char*** slst, char ** desc, int n) if (strstr(pl[k], MORPH_SURF_PFX)) { copy_field(result2 + strlen(result2), pl[k], MORPH_SURF_PFX); } - copy_field(result2 + strlen(result2), pl[k], MORPH_STEM); + copy_field(result2 + strlen(result2), pl[k], MORPH_STEM); } } freelist(&pl, pln); } int sln = line_tok(result2, slst, MSEP_REC); return uniqlist(*slst, sln); - + } int Hunspell::stem(char*** slst, const char * word) @@ -1146,14 +1170,14 @@ int Hunspell::suggest_pos_stems(char*** slst, const char * word) int abbv = 0; wl = cleanword(cw, word, &captype, &abbv); if (wl == 0) return 0; - + int ns = 0; // 
ns=0 = normalized input *slst = NULL; // HU, nsug in pSMgr->suggest - + switch(captype) { case HUHCAP: - case NOCAP: { + case NOCAP: { ns = pSMgr->suggest_pos_stems(slst, cw, ns); if ((abbv) && (ns == 0)) { @@ -1166,7 +1190,7 @@ int Hunspell::suggest_pos_stems(char*** slst, const char * word) break; } - case INITCAP: { + case INITCAP: { ns = pSMgr->suggest_pos_stems(slst, cw, ns); @@ -1175,15 +1199,15 @@ int Hunspell::suggest_pos_stems(char*** slst, const char * word) mkallsmall(wspace); ns = pSMgr->suggest_pos_stems(slst, wspace, ns); } - + break; - + } - case ALLCAP: { + case ALLCAP: { ns = pSMgr->suggest_pos_stems(slst, cw, ns); if (ns != 0) break; - + memcpy(wspace,cw,(wl+1)); mkallsmall(wspace); ns = pSMgr->suggest_pos_stems(slst, wspace, ns); @@ -1306,7 +1330,12 @@ int Hunspell::analyze(char*** slst, const char * word) } int captype = 0; int abbv = 0; - int wl = cleanword2(cw, word, unicw, &nc, &captype, &abbv); + int wl = 0; + + // input conversion + RepList * rl = (pAMgr) ? pAMgr->get_iconvtable() : NULL; + if (rl && rl->conv(word, wspace)) wl = cleanword2(cw, wspace, unicw, &nc, &captype, &abbv); + else wl = cleanword2(cw, word, unicw, &nc, &captype, &abbv); if (wl == 0) { if (abbv) { @@ -1318,7 +1347,7 @@ int Hunspell::analyze(char*** slst, const char * word) char result[MAXLNLEN]; char * st = NULL; - + *result = '\0'; int n = 0; @@ -1328,11 +1357,11 @@ int Hunspell::analyze(char*** slst, const char * word) // test numbers // LANG_hu section: set dash information for suggestions if (langnum == LANG_hu) { - while ((n < wl) && + while ((n < wl) && (((cw[n] <= '9') && (cw[n] >= '0')) || (((cw[n] == '.') || (cw[n] == ',')) && (n > 0)))) { n++; if ((cw[n] == '.') || (cw[n] == ',')) { - if (((n2 == 0) && (n > 3)) || + if (((n2 == 0) && (n > 3)) || ((n2 > 0) && ((cw[n-1] == '.') || (cw[n-1] == ',')))) break; n2++; n3 = n; @@ -1356,11 +1385,11 @@ int Hunspell::analyze(char*** slst, const char * word) } } // END OF LANG_hu section - + switch(captype) { case 
HUHCAP: case HUHINITCAP: - case NOCAP: { + case NOCAP: { cat_result(result, pSMgr->suggest_morph(cw)); if (abbv) { memcpy(wspace,cw,wl); @@ -1370,7 +1399,7 @@ int Hunspell::analyze(char*** slst, const char * word) } break; } - case INITCAP: { + case INITCAP: { wl = mkallsmall2(cw, unicw, nc); memcpy(wspace,cw,(wl+1)); wl2 = mkinitcap2(cw, unicw, nc); @@ -1389,7 +1418,7 @@ int Hunspell::analyze(char*** slst, const char * word) } break; } - case ALLCAP: { + case ALLCAP: { cat_result(result, pSMgr->suggest_morph(cw)); if (abbv) { memcpy(wspace,cw,wl); @@ -1433,7 +1462,7 @@ int Hunspell::analyze(char*** slst, const char * word) // LANG_hu section: set dash information for suggestions if (langnum == LANG_hu) dash = (char *) strchr(cw,'-'); if ((langnum == LANG_hu) && dash) { - *dash='\0'; + *dash='\0'; // examine 2 sides of the dash if (dash[1] == '\0') { // base word ending with dash if (spell(cw)) return line_tok(pSMgr->suggest_morph(cw), slst, MSEP_REC); @@ -1477,7 +1506,7 @@ int Hunspell::analyze(char*** slst, const char * word) } } // affixed number in correct word - if (nresult && (dash > cw) && (((*(dash-1)<='9') && + if (nresult && (dash > cw) && (((*(dash-1)<='9') && (*(dash-1)>='0')) || (*(dash-1)=='.'))) { *dash='-'; n = 1; @@ -1519,7 +1548,7 @@ int Hunspell::generate(char*** slst, const char * word, char ** pl, int pln) cleanword(cw, word, &captype, &abbv); char result[MAXLNLEN]; *result = '\0'; - + for (int i = 0; i < pln; i++) { cat_result(result, pSMgr->suggest_gen(pl2, pl2n, pl[i])); } @@ -1536,7 +1565,7 @@ int Hunspell::generate(char*** slst, const char * word, char ** pl, int pln) if (captype == INITCAP || captype == HUHINITCAP) { for (int j=0; j < linenum; j++) mkinitcap((*slst)[j]); } - + // temporary filtering of prefix related errors (eg. 
// generate("undrinkable", "eats") --> "undrinkables" and "*undrinks") @@ -1597,7 +1626,7 @@ const char * Hunspell::get_xml_pos(const char * s, const char * attr) int Hunspell::check_xml_par(const char * q, const char * attr, const char * value) { char cw[MAXWORDUTF8LEN]; - if (get_xml_par(cw, get_xml_pos(q, attr), MAXWORDUTF8LEN - 1) && + if (get_xml_par(cw, get_xml_pos(q, attr), MAXWORDUTF8LEN - 1) && strcmp(cw, value) == 0) return 1; return 0; } @@ -1695,12 +1724,12 @@ char * Hunspell::morph_with_correction(const char * word) char result[MAXLNLEN]; char * st = NULL; - + *result = '\0'; - - + + switch(captype) { - case NOCAP: { + case NOCAP: { st = pSMgr->suggest_morph_for_spelling_error(cw); if (st) { mystrcat(result, st, MAXLNLEN); @@ -1719,14 +1748,14 @@ char * Hunspell::morph_with_correction(const char * word) } break; } - case INITCAP: { + case INITCAP: { memcpy(wspace,cw,(wl+1)); mkallsmall(wspace); st = pSMgr->suggest_morph_for_spelling_error(wspace); if (st) { mystrcat(result, st, MAXLNLEN); free(st); - } + } st = pSMgr->suggest_morph_for_spelling_error(cw); if (st) { if (*result) mystrcat(result, "\n", MAXLNLEN); @@ -1754,7 +1783,7 @@ char * Hunspell::morph_with_correction(const char * word) } break; } - case HUHCAP: { + case HUHCAP: { st = pSMgr->suggest_morph_for_spelling_error(cw); if (st) { mystrcat(result, st, MAXLNLEN); @@ -1767,16 +1796,16 @@ char * Hunspell::morph_with_correction(const char * word) if (*result) mystrcat(result, "\n", MAXLNLEN); mystrcat(result, st, MAXLNLEN); free(st); - } + } break; } - case ALLCAP: { + case ALLCAP: { memcpy(wspace,cw,(wl+1)); st = pSMgr->suggest_morph_for_spelling_error(wspace); if (st) { mystrcat(result, st, MAXLNLEN); free(st); - } + } mkallsmall(wspace); st = pSMgr->suggest_morph_for_spelling_error(wspace); if (st) { @@ -1800,7 +1829,7 @@ char * Hunspell::morph_with_correction(const char * word) if (st) { mystrcat(result, st, MAXLNLEN); free(st); - } + } mkallsmall(wspace); st = 
pSMgr->suggest_morph_for_spelling_error(wspace); if (st) { @@ -1887,7 +1916,7 @@ int Hunspell_generate(Hunhandle *pHunspell, char*** slst, const char * word, /* functions for run-time modification of the dictionary */ /* add word to the run-time dictionary */ - + int Hunspell_add(Hunhandle *pHunspell, const char * word) { return ((Hunspell*)pHunspell)->add(word); } diff --git a/src/hunspell/replist.cxx b/src/hunspell/replist.cxx new file mode 100644 index 0000000..7846470 --- /dev/null +++ b/src/hunspell/replist.cxx @@ -0,0 +1,95 @@ +#include "license.hunspell" +#include "license.myspell" + +#ifndef MOZILLA_CLIENT +#include <cstdlib> +#include <cstring> +#include <cstdio> +#else +#include <stdlib.h> +#include <string.h> +#include <stdio.h> +#endif + +#include "replist.hxx" +#include "csutil.hxx" + +RepList::RepList(int n) { + dat = (replentry **) malloc(sizeof(replentry *) * n); + if (dat == 0) size = 0; else size = n; + pos = 0; +} + +RepList::~RepList() +{ + for (int i = 0; i < pos; i++) { + free(dat[i]->pattern); + free(dat[i]->pattern2); + free(dat[i]); + } + free(dat); +} + +int RepList::get_pos() { + return pos; +} + +replentry * RepList::item(int n) { + return dat[n]; +} + +int RepList::near(const char * word) { + int p1 = 0; + int p2 = pos; + while ((p2 - p1) > 1) { + int m = (p1 + p2) / 2; +// fprintf(stderr, "m: %d p1: %d p2: %d dat: %s\n", m, p1, p2, dat[m]->pattern); + int c = strcmp(word, dat[m]->pattern); + if (c <= 0) { + if (c < 0) p2 = m; else p1 = p2 = m; + } else p1 = m; + } +// fprintf(stderr, "NEAR: %s (word: %s)\n", dat[p1]->pattern, word); + return p1; +} + +int RepList::match(const char * word, int n) { + if (strncmp(word, dat[n]->pattern, strlen(dat[n]->pattern)) == 0) return strlen(dat[n]->pattern); + return 0; +} + +int RepList::add(char * pat1, char * pat2) { + if (pos >= size || pat1 == NULL || pat2 == NULL) return 1; + replentry * r = (replentry *) malloc(sizeof(replentry)); + if (r == NULL) return 1; + r->pattern = mystrrep(pat1, "_", 
" "); + r->pattern2 = mystrrep(pat2, "_", " "); + dat[pos++] = r; + for (int i = pos - 1; i > 0; i--) { + r = dat[i]; + if (strcmp(r->pattern, dat[i - 1]->pattern) < 0) { + dat[i] = dat[i - 1]; + dat[i - 1] = r; + } else break; + } + return 0; +} + +int RepList::conv(const char * word, char * dest) { + int stl = 0; + int change = 0; +// for (int i = 0; i < pos; i++) fprintf(stderr, "%d. %s\n", i, dat[i]->pattern); + for (int i = 0; i < strlen(word); i++) { + int n = near(word + i); + int l = match(word + i, n); + if (l) { + strcpy(dest + stl, dat[n]->pattern2); + stl += strlen(dat[n]->pattern2); + i += l - 1; + change = 1; + } else dest[stl++] = word[i]; + } + dest[stl] = '\0'; +// fprintf(stderr, "i: %s o: %s change: %d\n", word, dest, change); + return change; +} diff --git a/src/hunspell/replist.hxx b/src/hunspell/replist.hxx new file mode 100644 index 0000000..d366cf9 --- /dev/null +++ b/src/hunspell/replist.hxx @@ -0,0 +1,24 @@ +/* string replacement list class */ +#ifndef _REPLIST_HXX_ +#define _REPLIST_HXX_ +#include "w_char.hxx" + +class RepList +{ +protected: + replentry ** dat; + int size; + int pos; + +public: + RepList(int n); + ~RepList(); + + int get_pos(); + int add(char * pat1, char * pat2); + replentry * item(int n); + int near(const char * word); + int match(const char * word, int n); + int conv(const char * word, char * dest); +}; +#endif diff --git a/src/hunspell/w_char.hxx b/src/hunspell/w_char.hxx index a3d11c3..99cfe63 100644 --- a/src/hunspell/w_char.hxx +++ b/src/hunspell/w_char.hxx @@ -1,7 +1,7 @@ #ifndef __WCHARHXX__ #define __WCHARHXX__ -#ifdef WIN32 +#ifndef GCC typedef struct { #else typedef struct __attribute__ ((packed)) { diff --git a/src/tools/Makefile.am b/src/tools/Makefile.am index 445a8fd..d33c9a7 100644 --- a/src/tools/Makefile.am +++ b/src/tools/Makefile.am @@ -25,4 +25,4 @@ chmorph_LDADD = ../hunspell/libhunspell-1.2.la ../parsers/libparsers.a noinst_PROGRAMS=example -EXTRA_DIST=makealias affixcompress +EXTRA_DIST=makealias 
affixcompress wordforms diff --git a/src/tools/Makefile.in b/src/tools/Makefile.in index 7e132f1..71d8d80 100644 --- a/src/tools/Makefile.in +++ b/src/tools/Makefile.in @@ -277,7 +277,7 @@ analyze_SOURCES = analyze.cxx analyze_LDADD = ../hunspell/libhunspell-1.2.la chmorph_SOURCES = chmorph.cxx chmorph_LDADD = ../hunspell/libhunspell-1.2.la ../parsers/libparsers.a -EXTRA_DIST = makealias affixcompress +EXTRA_DIST = makealias affixcompress wordforms all: all-am .SUFFIXES: diff --git a/src/tools/affixcompress b/src/tools/affixcompress index c2e174b..9fc2989 100755 --- a/src/tools/affixcompress +++ b/src/tools/affixcompress @@ -4,8 +4,14 @@ # usage: affixcompress sorted_word_list_file [max_affix_rules] case $# in 0) echo \ -"affixcompress - compress a huge sorted word list to Hunspell aff and dic file -Usage: affixcompress sorted_word_list_file [max_affix_rules] +"affixcompress - compress a huge sorted word list to Hunspell format +Usage: + +LC_ALL=C sort word_list >sorted_word_list +affixcompress sorted_word_list [max_affix_rules] + +Default value of max_affix_rules = 5000 + Note: output may need manually added affix parameters (SET character_encoding, TRY suggestion_characters etc., see man(4) hunspell)" exit 0;; diff --git a/src/tools/hunspell.cxx b/src/tools/hunspell.cxx index a902a06..fb5a7d2 100644 --- a/src/tools/hunspell.cxx +++ b/src/tools/hunspell.cxx @@ -651,6 +651,7 @@ if (pos >= 0) { pMS[d]->free_list(&result, n); } if (n == 0) fprintf(stdout, "%s\n", chenc(token, dic_enc[d], ui_enc)); + fprintf(stdout, "\n"); free(token); continue; } @@ -671,6 +672,7 @@ if (pos >= 0) { pMS[d]->free_list(&result, n); } if (n == 0) fprintf(stdout, "%s\n", chenc(token, dic_enc[d], ui_enc)); + fprintf(stdout, "\n"); free(token); continue; } diff --git a/src/tools/wordforms b/src/tools/wordforms new file mode 100755 index 0000000..dabc346 --- /dev/null +++ b/src/tools/wordforms @@ -0,0 +1,35 @@ +#!/bin/sh +case $# in +0|1|2) echo "Usage: wordforms [-s | -p] dictionary.aff 
dictionary.dic word +-s: print only suffixed forms +-p: print only prefixed forms +"; exit 1;; +esac +fx=0 +case $1 in +-s) fx=1; shift;; +-p) fx=2; shift;; +esac +test -h /tmp/wordforms.aff && rm /tmp/wordforms.aff +ln -s $PWD/$1 /tmp/wordforms.aff +# prepared dic only with the query word +echo 1 >/tmp/wordforms.dic +grep "^$3/" $2 >>/tmp/wordforms.dic +echo $3 | awk -v "fx=$fx" ' +fx!=2 && FILENAME!="-" && /^SFX/ && NF > 4{split($4,a,"/");clen=($3=="0") ? 0 : length($3);sfx[a[1],clen]=a[1];sfxc[a[1],clen]=clen;next} +fx!=1 && FILENAME!="-" && /^PFX/ && NF > 4{split($4,a,"/");clen=($3=="0") ? 0 : length($3);pfx[a[1],clen]=a[1];pfxc[a[1],clen]=clen;next} +FILENAME=="-"{ +wlen=length($1) +if (fx==0 || fx==2) { + for (j in pfx) {if (wlen<=pfxc[j]) continue; print (pfx[j]=="0" ? "" : pfx[j]) substr($1, pfxc[j]+1)} +} +if (fx==0 || fx==1) { + for(i in sfx){clen=sfxc[i];if (wlen<=clen) continue; print substr($1, 1, wlen-clen) (sfx[i]=="0" ? "": sfx[i]) } +} +if (fx==0) { +for (j in pfx) {if (wlen<=pfxc[j]) continue; + for(i in sfx){clen=sfxc[i];if (wlen<=clen || wlen <= (clen + pfxc[j]))continue; + print (pfx[j]=="0" ? "" : pfx[j]) substr($1, pfxc[j]+1, wlen-clen-pfxc[j]) (sfx[i]=="0" ? 
"": sfx[i]) }} +} +} +' /tmp/wordforms.aff - | hunspell -d /tmp/wordforms -G -l diff --git a/src/win_api/Hunspell.rc b/src/win_api/Hunspell.rc index d0af810..d1202c5 100644 --- a/src/win_api/Hunspell.rc +++ b/src/win_api/Hunspell.rc @@ -2,8 +2,8 @@ #include <windows.h> VS_VERSION_INFO VERSIONINFO -FILEVERSION 1,2,7,0 -PRODUCTVERSION 1,2,7,0 +FILEVERSION 1,2,8,0 +PRODUCTVERSION 1,2,8,0 FILEFLAGSMASK 0x17L FILEFLAGS 0 FILEOS VOS_NT_WINDOWS32 @@ -21,12 +21,12 @@ BEGIN VALUE "Comments", "Hunspell (http://hunspell.sourceforge.net/) by L�szl� N�meth" VALUE "CompanyName", "http://hunspell.sourceforge.net/" VALUE "FileDescription", "libhunspell" - VALUE "FileVersion", "1.2.7" + VALUE "FileVersion", "1.2.8" VALUE "InternalName", "libhunspell" VALUE "LegalCopyright", "Copyright (c) 2007-2008" VALUE "OriginalFilename", "libhunspell.dll" VALUE "ProductName", "Hunspell Dynamic Link Library" - VALUE "ProductVersion", "1.2.7" + VALUE "ProductVersion", "1.2.8" END END END diff --git a/src/win_api/config.h b/src/win_api/config.h index 4415d4c..5f52f2c 100644 --- a/src/win_api/config.h +++ b/src/win_api/config.h @@ -190,7 +190,7 @@ #undef HUNSPELL_EXPERIMENTAL /* "Define if you need warning messages" */ -#undef HUNSPELL_WARNING_ON +#define HUNSPELL_WARNING_ON /* Define as const if the declaration of iconv() needs const. */ #define ICONV_CONST 1 @@ -211,5 +211,5 @@ #define PACKAGE_TARNAME /* Define to the version of this package. 
*/ -#define PACKAGE_VERSION "1.2.7" -#define VERSION "1.2.7" +#define PACKAGE_VERSION "1.2.8" +#define VERSION "1.2.8" diff --git a/tests/Makefile.am b/tests/Makefile.am index b610e8a..978d7b5 100644 --- a/tests/Makefile.am +++ b/tests/Makefile.am @@ -31,6 +31,7 @@ nosuggest.test \ alias.test \ alias2.test \ alias3.test \ +breakdefault.test \ break.test \ needaffix.test \ needaffix2.test \ @@ -63,10 +64,14 @@ compoundaffix2.test \ compoundaffix3.test \ checkcompounddup.test \ checkcompoundtriple.test \ +simplifiedtriple.test \ checkcompoundrep.test \ checkcompoundcase2.test \ checkcompoundcaseutf.test \ checkcompoundpattern.test \ +checkcompoundpattern2.test \ +checkcompoundpattern3.test \ +checkcompoundpattern4.test \ utfcompound.test \ checksharps.test \ checksharpsutf.test \ @@ -93,7 +98,9 @@ colons_in_words.test \ ngram_utf_fix.test \ morph.test \ 1975530.test \ -fullstrip.test +fullstrip.test \ +iconv.test \ +oconv.test # infixes.test distclean-local: @@ -200,6 +207,12 @@ break.dic \ break.good \ break.test \ break.wrong \ +breakdefault.aff \ +breakdefault.dic \ +breakdefault.good \ +breakdefault.sug \ +breakdefault.test \ +breakdefault.wrong \ circumfix.aff \ circumfix.dic \ circumfix.good \ @@ -379,11 +392,31 @@ checkcompoundtriple.dic \ checkcompoundtriple.good \ checkcompoundtriple.test \ checkcompoundtriple.wrong \ +simplifiedtriple.aff \ +simplifiedtriple.dic \ +simplifiedtriple.good \ +simplifiedtriple.test \ +simplifiedtriple.wrong \ checkcompoundpattern.aff \ checkcompoundpattern.dic \ checkcompoundpattern.good \ checkcompoundpattern.test \ checkcompoundpattern.wrong \ +checkcompoundpattern2.aff \ +checkcompoundpattern2.dic \ +checkcompoundpattern2.good \ +checkcompoundpattern2.test \ +checkcompoundpattern2.wrong \ +checkcompoundpattern3.aff \ +checkcompoundpattern3.dic \ +checkcompoundpattern3.good \ +checkcompoundpattern3.test \ +checkcompoundpattern3.wrong \ +checkcompoundpattern4.aff \ +checkcompoundpattern4.dic \ +checkcompoundpattern4.good \ 
+checkcompoundpattern4.test \ +checkcompoundpattern4.wrong \ checksharps.aff \ checksharps.dic \ checksharps.good \ @@ -544,7 +577,17 @@ morph.test \ fullstrip.aff \ fullstrip.dic \ fullstrip.good \ -fullstrip.test +fullstrip.test \ +iconv.aff \ +iconv.dic \ +iconv.good \ +iconv.test \ +oconv.aff \ +oconv.dic \ +oconv.good \ +oconv.sug \ +oconv.test \ +oconv.wrong # infixes.aff # infixes.dic # infixes.good diff --git a/tests/Makefile.in b/tests/Makefile.in index bc6a12f..460d93f 100644 --- a/tests/Makefile.in +++ b/tests/Makefile.in @@ -230,6 +230,7 @@ nosuggest.test \ alias.test \ alias2.test \ alias3.test \ +breakdefault.test \ break.test \ needaffix.test \ needaffix2.test \ @@ -262,10 +263,14 @@ compoundaffix2.test \ compoundaffix3.test \ checkcompounddup.test \ checkcompoundtriple.test \ +simplifiedtriple.test \ checkcompoundrep.test \ checkcompoundcase2.test \ checkcompoundcaseutf.test \ checkcompoundpattern.test \ +checkcompoundpattern2.test \ +checkcompoundpattern3.test \ +checkcompoundpattern4.test \ utfcompound.test \ checksharps.test \ checksharpsutf.test \ @@ -292,7 +297,9 @@ colons_in_words.test \ ngram_utf_fix.test \ morph.test \ 1975530.test \ -fullstrip.test +fullstrip.test \ +iconv.test \ +oconv.test EXTRA_DIST = \ test.sh \ @@ -395,6 +402,12 @@ break.dic \ break.good \ break.test \ break.wrong \ +breakdefault.aff \ +breakdefault.dic \ +breakdefault.good \ +breakdefault.sug \ +breakdefault.test \ +breakdefault.wrong \ circumfix.aff \ circumfix.dic \ circumfix.good \ @@ -574,11 +587,31 @@ checkcompoundtriple.dic \ checkcompoundtriple.good \ checkcompoundtriple.test \ checkcompoundtriple.wrong \ +simplifiedtriple.aff \ +simplifiedtriple.dic \ +simplifiedtriple.good \ +simplifiedtriple.test \ +simplifiedtriple.wrong \ checkcompoundpattern.aff \ checkcompoundpattern.dic \ checkcompoundpattern.good \ checkcompoundpattern.test \ checkcompoundpattern.wrong \ +checkcompoundpattern2.aff \ +checkcompoundpattern2.dic \ +checkcompoundpattern2.good \ 
+checkcompoundpattern2.test \ +checkcompoundpattern2.wrong \ +checkcompoundpattern3.aff \ +checkcompoundpattern3.dic \ +checkcompoundpattern3.good \ +checkcompoundpattern3.test \ +checkcompoundpattern3.wrong \ +checkcompoundpattern4.aff \ +checkcompoundpattern4.dic \ +checkcompoundpattern4.good \ +checkcompoundpattern4.test \ +checkcompoundpattern4.wrong \ checksharps.aff \ checksharps.dic \ checksharps.good \ @@ -739,7 +772,17 @@ morph.test \ fullstrip.aff \ fullstrip.dic \ fullstrip.good \ -fullstrip.test +fullstrip.test \ +iconv.aff \ +iconv.dic \ +iconv.good \ +iconv.test \ +oconv.aff \ +oconv.dic \ +oconv.good \ +oconv.sug \ +oconv.test \ +oconv.wrong all: all-recursive diff --git a/tests/break.wrong b/tests/break.wrong index c783ee4..599ed9f 100644 --- a/tests/break.wrong +++ b/tests/break.wrong @@ -1,5 +1,7 @@ fox bax +-foo +bar- fox-bar foo-bax foo–bax diff --git a/tests/breakdefault.aff b/tests/breakdefault.aff new file mode 100644 index 0000000..a13f464 --- /dev/null +++ b/tests/breakdefault.aff @@ -0,0 +1,6 @@ +# default word break at hyphens and n-dashes + +SET UTF-8 +MAXNGRAMSUGS 0 +WORDCHARS - +TRY ot diff --git a/tests/breakdefault.dic b/tests/breakdefault.dic new file mode 100644 index 0000000..bf29960 --- /dev/null +++ b/tests/breakdefault.dic @@ -0,0 +1,6 @@ +3 +foo +bar +free +scott +scot-free diff --git a/tests/breakdefault.good b/tests/breakdefault.good new file mode 100644 index 0000000..8d81254 --- /dev/null +++ b/tests/breakdefault.good @@ -0,0 +1,7 @@ +foo +bar +foo- +-foo +scot-free +foo-bar +foo-bar-foo-bar diff --git a/tests/breakdefault.sug b/tests/breakdefault.sug new file mode 100644 index 0000000..8bfc69d --- /dev/null +++ b/tests/breakdefault.sug @@ -0,0 +1,3 @@ +scott +scot-free +foo-bar diff --git a/tests/breakdefault.test b/tests/breakdefault.test new file mode 100755 index 0000000..cde7c54 --- /dev/null +++ b/tests/breakdefault.test @@ -0,0 +1,4 @@ +#!/bin/sh +DIR="`dirname $0`" +NAME="`basename $0 .test`" +$DIR/test.sh $NAME -i 
utf-8
diff --git a/tests/breakdefault.wrong b/tests/breakdefault.wrong
new file mode 100644
index 0000000..c3b203a
--- /dev/null
+++ b/tests/breakdefault.wrong
@@ -0,0 +1,3 @@
+scot
+sco-free
+fo-bar
diff --git a/tests/checkcompoundpattern2.aff b/tests/checkcompoundpattern2.aff
new file mode 100644
index 0000000..fdf6560
--- /dev/null
+++ b/tests/checkcompoundpattern2.aff
@@ -0,0 +1,7 @@
+# forbid compounds with spec. pattern at word bound and allow modificated form
+# (for German and Indian languages)
+COMPOUNDFLAG A
+CHECKCOMPOUNDPATTERN 2
+CHECKCOMPOUNDPATTERN o b z
+CHECKCOMPOUNDPATTERN oo ba u
+COMPOUNDMIN 1
diff --git a/tests/checkcompoundpattern2.dic b/tests/checkcompoundpattern2.dic
new file mode 100644
index 0000000..8ac75f4
--- /dev/null
+++ b/tests/checkcompoundpattern2.dic
@@ -0,0 +1,3 @@
+2
+foo/A
+bar/A
diff --git a/tests/checkcompoundpattern2.good b/tests/checkcompoundpattern2.good
new file mode 100644
index 0000000..eaad4f9
--- /dev/null
+++ b/tests/checkcompoundpattern2.good
@@ -0,0 +1,3 @@
+barfoo
+fozar
+fur
diff --git a/tests/checkcompoundpattern2.test b/tests/checkcompoundpattern2.test
new file mode 100755
index 0000000..dc29507
--- /dev/null
+++ b/tests/checkcompoundpattern2.test
@@ -0,0 +1,4 @@
+#!/bin/sh
+DIR="`dirname $0`"
+NAME="`basename $0 .test`"
+$DIR/test.sh $NAME -i ISO8859-1
diff --git a/tests/checkcompoundpattern2.wrong b/tests/checkcompoundpattern2.wrong
new file mode 100644
index 0000000..323fae0
--- /dev/null
+++ b/tests/checkcompoundpattern2.wrong
@@ -0,0 +1 @@
+foobar
diff --git a/tests/checkcompoundpattern3.aff b/tests/checkcompoundpattern3.aff
new file mode 100644
index 0000000..6c2cfa4
--- /dev/null
+++ b/tests/checkcompoundpattern3.aff
@@ -0,0 +1,6 @@
+# forbid compounds with spec. pattern at word bound and allow modificated form
+# (for Indian languages)
+COMPOUNDFLAG A
+CHECKCOMPOUNDPATTERN 1
+CHECKCOMPOUNDPATTERN o/X b/Y z
+COMPOUNDMIN 1
diff --git a/tests/checkcompoundpattern3.dic b/tests/checkcompoundpattern3.dic
new file mode 100644
index 0000000..6bd1b7f
--- /dev/null
+++ b/tests/checkcompoundpattern3.dic
@@ -0,0 +1,5 @@
+4
+foo/A
+boo/AX
+bar/A
+ban/AY
diff --git a/tests/checkcompoundpattern3.good b/tests/checkcompoundpattern3.good
new file mode 100644
index 0000000..6070eff
--- /dev/null
+++ b/tests/checkcompoundpattern3.good
@@ -0,0 +1,9 @@
+bozan
+barfoo
+banfoo
+banbar
+foobar
+fooban
+foobanbar
+boobar
+boobarfoo
diff --git a/tests/checkcompoundpattern3.test b/tests/checkcompoundpattern3.test
new file mode 100755
index 0000000..dc29507
--- /dev/null
+++ b/tests/checkcompoundpattern3.test
@@ -0,0 +1,4 @@
+#!/bin/sh
+DIR="`dirname $0`"
+NAME="`basename $0 .test`"
+$DIR/test.sh $NAME -i ISO8859-1
diff --git a/tests/checkcompoundpattern3.wrong b/tests/checkcompoundpattern3.wrong
new file mode 100644
index 0000000..41d8d37
--- /dev/null
+++ b/tests/checkcompoundpattern3.wrong
@@ -0,0 +1,8 @@
+booban
+boobanfoo
+fozar
+fozarfoo
+fozan
+fozanfoo
+bozar
+bozarfoo
diff --git a/tests/checkcompoundpattern4.aff b/tests/checkcompoundpattern4.aff
new file mode 100644
index 0000000..ef25663
--- /dev/null
+++ b/tests/checkcompoundpattern4.aff
@@ -0,0 +1,8 @@
+# sandhi in Telugu writing system, based on the Kiran Chittella's example
+
+COMPOUNDFLAG x
+COMPOUNDMIN 1
+CHECKCOMPOUNDPATTERN 2
+CHECKCOMPOUNDPATTERN a/A u/A O
+CHECKCOMPOUNDPATTERN u/B u/B u
+
diff --git a/tests/checkcompoundpattern4.dic b/tests/checkcompoundpattern4.dic
new file mode 100644
index 0000000..d245ef0
--- /dev/null
+++ b/tests/checkcompoundpattern4.dic
@@ -0,0 +1,6 @@
+4
+sUrya/Ax
+udayaM/Ax
+pEru/Bx
+unna/Bx
+
diff --git a/tests/checkcompoundpattern4.good b/tests/checkcompoundpattern4.good
new file mode 100644
index 0000000..48761b6
--- /dev/null
+++ b/tests/checkcompoundpattern4.good
@@ -0,0 +1,2 @@
+sUryOdayaM
+pErunna
diff --git a/tests/checkcompoundpattern4.test b/tests/checkcompoundpattern4.test
new file mode 100755
index 0000000..dc29507
--- /dev/null
+++ b/tests/checkcompoundpattern4.test
@@ -0,0 +1,4 @@
+#!/bin/sh
+DIR="`dirname $0`"
+NAME="`basename $0 .test`"
+$DIR/test.sh $NAME -i ISO8859-1
diff --git a/tests/checkcompoundpattern4.wrong b/tests/checkcompoundpattern4.wrong
new file mode 100644
index 0000000..a357fec
--- /dev/null
+++ b/tests/checkcompoundpattern4.wrong
@@ -0,0 +1,2 @@
+sUryaudayaM
+pEruunna
diff --git a/tests/condition.aff b/tests/condition.aff
index f27f0b2..6215742 100644
--- a/tests/condition.aff
+++ b/tests/condition.aff
@@ -1,7 +1,7 @@
 SET ISO8859-2
 WORDCHARS 0123456789
 
-SFX S N 16
+SFX S N 18
 SFX S 0 suf1 .
 SFX S 0 suf2 o
 SFX S 0 suf3 [aeou]
@@ -18,6 +18,8 @@ SFX S 0 suf13 [aefu][^aefu]
 SFX S 0 suf14 [^aeou][aeou]
 SFX S 0 suf15 [aeou][^aefu]
 SFX S 0 suf16 [^aeou][^aefu]
+SFX S 0 suf17 [aeou][bcdfgkmnoprstvz]
+SFX S 0 suf18 [aeou]o
 
 SFX Q N 2
 SFX Q 0 ning [^aeio][aeiou]n
@@ -34,7 +36,7 @@ SFX Z 0 ch [
 SFX Z 0 m [��������].a
 SFX Z a 0 [��������].a
 
-PFX P N 16
+PFX P N 18
 PFX P 0 pre1 .
 PFX P 0 pre2 o
 PFX P 0 pre3 [aeou]
@@ -51,6 +53,9 @@ PFX P 0 pre13 [aeou][aefu]
 PFX P 0 pre14 [aeou][^aeou]
 PFX P 0 pre15 [aeou][^aefu]
 PFX P 0 pre16 [^aefu][^aeou]
+PFX P 0 pre17 [bcdfgkmnoprstvz][aeou]
+PFX P 0 pre18 o[aeou]
+
 
 PFX R N 2
 PFX R 0 gnin n[aeiou][^aeio]
diff --git a/tests/condition.wrong b/tests/condition.wrong
index 16443ef..7b83d82 100644
--- a/tests/condition.wrong
+++ b/tests/condition.wrong
@@ -13,5 +13,9 @@ ofosuf12
 pre12ofo
 ofosuf15
 pre15ofo
+ofosuf17
+pre17ofo
+ofosuf18
+pre18ofo
 entertainning
 gninnianretne
diff --git a/tests/condition_utf.aff b/tests/condition_utf.aff
index f716dd9..62a1ce5 100644
--- a/tests/condition_utf.aff
+++ b/tests/condition_utf.aff
@@ -1,7 +1,7 @@
 SET UTF-8
 WORDCHARS 0123456789
 
-SFX S N 16
+SFX S N 18
 SFX S 0 suf1 .
 SFX S 0 suf2 ó
 SFX S 0 suf3 [áéóú]
@@ -18,8 +18,10 @@ SFX S 0 suf13 [áéőú][^ú]
 SFX S 0 suf14 [^ú][áéóú]
 SFX S 0 suf15 [áéóú][^áéőú]
 SFX S 0 suf16 [^áéóú][^áéőú]
+SFX S 0 suf17 [áéóú][bcdfgkmnóprstvz]
+SFX S 0 suf18 [áéóú]ó
 
-PFX P N 16
+PFX P N 18
 PFX P 0 pre1 .
 PFX P 0 pre2 ó
 PFX P 0 pre3 [áéóú]
@@ -36,3 +38,5 @@ PFX P 0 pre13 [áéóú][áéőú]
 PFX P 0 pre14 [áéóú][^áéóú]
 PFX P 0 pre15 [áéóú][^áéőú]
 PFX P 0 pre16 [^áéőú][^áéóú]
+PFX P 0 pre17 [bcdfgkmnóprstvz][áéóú]
+PFX P 0 pre18 ó[áéóú]
diff --git a/tests/condition_utf.wrong b/tests/condition_utf.wrong
index 4040213..f102213 100644
--- a/tests/condition_utf.wrong
+++ b/tests/condition_utf.wrong
@@ -12,3 +12,7 @@ pre11óőó
 pre12óőó
 óőósuf15
 pre15óőó
+óőósuf17
+óőósuf18
+pre17óőó
+pre18óőó
diff --git a/tests/iconv.aff b/tests/iconv.aff
new file mode 100644
index 0000000..36cf7a2
--- /dev/null
+++ b/tests/iconv.aff
@@ -0,0 +1,10 @@
+# input conversion (accept comma acuted letters also with cedilla,
+# as de facto replacement of the Romanian standard)
+SET UTF-8
+
+ICONV 4
+ICONV ş ș
+ICONV ţ ț
+ICONV Ş Ș
+ICONV Ţ Ț
+
diff --git a/tests/iconv.dic b/tests/iconv.dic
new file mode 100644
index 0000000..8326eee
--- /dev/null
+++ b/tests/iconv.dic
@@ -0,0 +1,5 @@
+4
+Chișinău
+Țepes
+ț
+Ș
diff --git a/tests/iconv.good b/tests/iconv.good
new file mode 100644
index 0000000..746cf1e
--- /dev/null
+++ b/tests/iconv.good
@@ -0,0 +1,6 @@
+Chișinău
+Chişinău
+Țepes
+Ţepes
+Ş
+ţ
diff --git a/tests/iconv.test b/tests/iconv.test
new file mode 100755
index 0000000..cde7c54
--- /dev/null
+++ b/tests/iconv.test
@@ -0,0 +1,4 @@
+#!/bin/sh
+DIR="`dirname $0`"
+NAME="`basename $0 .test`"
+$DIR/test.sh $NAME -i utf-8
diff --git a/tests/oconv.aff b/tests/oconv.aff
new file mode 100644
index 0000000..13a3d9b
--- /dev/null
+++ b/tests/oconv.aff
@@ -0,0 +1,12 @@
+# output conversion
+SET UTF-8
+
+OCONV 7
+OCONV a A
+OCONV á Á
+OCONV b B
+OCONV c C
+OCONV d D
+OCONV e E
+OCONV é É
+
diff --git a/tests/oconv.dic b/tests/oconv.dic
new file mode 100644
index 0000000..359186c
--- /dev/null
+++ b/tests/oconv.dic
@@ -0,0 +1,4 @@
+3
+bébé
+dádá
+aábcdeé
diff --git a/tests/oconv.good b/tests/oconv.good
new file mode 100644
index 0000000..6cdaab1
--- /dev/null
+++ b/tests/oconv.good
@@ -0,0 +1,2 @@
+bébé
+dádá
diff --git a/tests/oconv.sug b/tests/oconv.sug
new file mode 100644
index 0000000..a191c62
--- /dev/null
+++ b/tests/oconv.sug
@@ -0,0 +1,3 @@
+BÉBÉ
+DÁDÁ
+AÁBCDEÉ
diff --git a/tests/oconv.test b/tests/oconv.test
new file mode 100755
index 0000000..cde7c54
--- /dev/null
+++ b/tests/oconv.test
@@ -0,0 +1,4 @@
+#!/bin/sh
+DIR="`dirname $0`"
+NAME="`basename $0 .test`"
+$DIR/test.sh $NAME -i utf-8
diff --git a/tests/oconv.wrong b/tests/oconv.wrong
new file mode 100644
index 0000000..73dcc89
--- /dev/null
+++ b/tests/oconv.wrong
@@ -0,0 +1,3 @@
+béb
+dád
+aábcde
diff --git a/tests/simplifiedtriple.aff b/tests/simplifiedtriple.aff
new file mode 100644
index 0000000..3ab3473
--- /dev/null
+++ b/tests/simplifiedtriple.aff
@@ -0,0 +1,8 @@
+# Forbid compound word with triple letters
+CHECKCOMPOUNDTRIPLE
+# Allow simplified forms
+SIMPLIFIEDTRIPLE
+
+COMPOUNDMIN 2
+
+COMPOUNDFLAG A
diff --git a/tests/simplifiedtriple.dic b/tests/simplifiedtriple.dic
new file mode 100644
index 0000000..cfe7a35
--- /dev/null
+++ b/tests/simplifiedtriple.dic
@@ -0,0 +1,3 @@
+2
+glass/A
+sko/A
diff --git a/tests/simplifiedtriple.good b/tests/simplifiedtriple.good
new file mode 100644
index 0000000..23a4815
--- /dev/null
+++ b/tests/simplifiedtriple.good
@@ -0,0 +1,3 @@
+glass
+sko
+glassko
diff --git a/tests/simplifiedtriple.test b/tests/simplifiedtriple.test
new file mode 100755
index 0000000..7f44369
--- /dev/null
+++ b/tests/simplifiedtriple.test
@@ -0,0 +1,4 @@
+#!/bin/sh
+DIR="`dirname $0`"
+NAME="`basename $0 .test`"
+$DIR/test.sh $NAME
diff --git a/tests/simplifiedtriple.wrong b/tests/simplifiedtriple.wrong
new file mode 100644
index 0000000..2811287
--- /dev/null
+++ b/tests/simplifiedtriple.wrong
@@ -0,0 +1 @@
+glasssko

-- 
Alioth's /usr/local/bin/git-commit-notice on /srv/git.debian.org/git/pkg-openoffice/hunspell.git

