Re: SV: [htdig] Foreign chars (Swedish)

1999-11-30 Thread Philippe Ramkvist-Henry



On Mon, 29 Nov 1999, Gilles Detillieux wrote:

> Just a hunch, but you wouldn't happen to have an ä in valid_punctuation,
> would you?  In any case, could you run htsearch -vvv twice, searching
> first for ANLÄNDE, and then for anlände?  How do the initial debugging
> messages differ?  What's happening to the ä - is it getting stripped
> out or changed to another character?  Is the upper case Ä getting changed
> to an ä, or to another character?  Are you using the exact same config
> file for htdig, htmerge and htsearch?

I use the default for "valid_punctuation". I even tried adding the
character with "extra_word_characters: ä".

Here's the debugging info for the second (237th! :) try.
 
su10-2 <74> htsearch -vvv
Enter value for words: anlände
tempWords: 'anlände:0 '
Boolean: 'anlände:0 '
initial: ''
Add: anlände
searchWords: 'anlände:0 '
LogicalWords: anlände
Pattern: 
Enter value for format:

su10-2 <75> htsearch -vvv
Enter value for words: ANLÄNDE
tempWords: 'anlände:0 '
Boolean: 'anlände:0 '
initial: ''
Fuzzy on: anlände
   (null) anlände
   (null) word=anlände prefix_suffix=* prefix_suffix_length=1
minimum_prefix_length=1

   endings anlända anländandet anländandets anländande anländ- anländer
anlänt anländs anländes anlänts anländes
   synonyms
searchWords: '(:0 anlände:0 |:0 anlända:0 |:0 anländandet:0 |:0
anländandets:0 |:0 anländande:0 |:0 anländ-:0 |:0 anländer:0 |:0 anlänt:0
|:0 anländs:0 |:0 anländes:0 |:0 anlänts:0 |:0 anländes:0 ):0 '
LogicalWords: (anlände or anlända or anländandet or anländandets or
anländande or anländ- or anländer or anlänt or anländs or anländes or
anlänts or anländes)
Pattern: anlände
Enter value for format: 

Looks OK to me... what do you say?

> Not that I know of, but you could put an originalWords.uppercase(); right
> after the originalWords.chop(" \t\r\n"); in htsearch/htsearch.cc.  If the
> htsearch -vvv above doesn't get to the root of the problem, it might be
> interesting to see if this hack has any effect.

I'll try this too, if the above looks OK.

I got a mail from another Swedish subscriber of this list and according to
him everything worked well using sv_SE (which I don't have) and indexing
using an English dictionary (which shouldn't change anything).

I'll try to get hold of that locale and try it...

/Philippe



To unsubscribe from the htdig mailing list, send a message to
[EMAIL PROTECTED]
You will receive a message to confirm this.



Re: SV: [htdig] Foreign chars (Swedish)

1999-11-29 Thread Philippe Ramkvist-Henry



On Fri, 26 Nov 1999, Gilles Detillieux wrote:

> That all looks the way it should, as far as I'm concerned.  I guess we
> need to focus on htsearch, as it appears to be the culprit.  (Either that
> or htmerge.)  Could you try running htsearch from the command line,
> and searching first for ANLÄNDE, and then for anlände?  I'd like to see
> what it finds in both cases.

I get exactly the same problem when searching directly from the shell as
when searching from an HTML form.

> OK, your ctype info for the sv locale looks fine.  Again, I suspect
> htsearch, or possibly a corrupt database.  If we can't nail down something
> specific in htsearch, or if it's not too difficult to reindex everything
> from scratch, I'd suggest you do just that.

Well reindexing doesn't help... I've done it SEVERAL times now... :(

> Yes, and your db.wordlist looks fine (at least what you showed me),
> so it should work, as long as you're also feeding Latin 1 characters
> into htsearch.  If you are, then it's a bug or a corrupt database (the
> index, not the word list).

Hmm, maybe. Is there any way to force the input in the HTML form to be
"uppercase"? It's a dirty solution but it would work.

Thanx for all the help!
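
A minimal sketch of the "force uppercase" workaround done on the C side
rather than in the HTML form. It assumes the query arrives as ISO 8859-1
(Latin 1) bytes, as discussed above, and it deliberately avoids the locale's
ctype tables. The function names latin1_upper and latin1_upper_str are
hypothetical and are not part of ht://Dig.

```c
#include <stddef.h>

/* Upper-case one ISO 8859-1 byte without consulting the locale.
   Hypothetical helper, not an ht://Dig function. */
unsigned char latin1_upper(unsigned char c)
{
    /* Plain ASCII letters. */
    if (c >= 'a' && c <= 'z')
        return (unsigned char)(c - 0x20);

    /* 0xE0..0xFE are the accented lower-case letters of Latin 1,
       except 0xF7, which is the division sign. */
    if (c >= 0xE0 && c <= 0xFE && c != 0xF7)
        return (unsigned char)(c - 0x20);

    /* Everything else is returned unchanged, including 0xDF and 0xFF,
       which have no upper-case counterpart in Latin 1. */
    return c;
}

/* Upper-case a NUL-terminated Latin 1 string in place. The cast to
   unsigned char keeps bytes above 0x7F from being seen as negative. */
void latin1_upper_str(char *s)
{
    for (size_t i = 0; s[i] != '\0'; i++)
        s[i] = (char)latin1_upper((unsigned char)s[i]);
}
```

With this, the byte 0xE4 (ä) becomes 0xC4 (Ä), so "anlände" would be
submitted as "ANLÄNDE", which is the form that gave hits in the tests above.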






Re: SV: [htdig] Foreign chars (Swedish)

1999-11-26 Thread Philippe Ramkvist-Henry



On Thu, 25 Nov 1999, Gilles Detillieux wrote:
> 
> OK, so the word Ättestupan appears in there as ättestupan, correct?
> Very strange.  So searches for words containing Ä will find words with
> ä in its place, as expected, but searches for words containing ä will
> match neither ä nor Ä, is that right?  I'm at a bit of a loss to explain
> it, but at some point it would seem that htsearch is mangling the lower
> case ä.  Do you have any documents containing a lower case ä somewhere
> in a word, and if so, does that word make it into db.wordlist correctly?

All correct and the words make it into the db.wordlist correctly. 
Example:

anlände     i:269   l:150   w:1652  c:2     a:4
anlände     i:475   l:285   w:715
anlände     i:581   l:295   w:705   a:1
anlände     i:586   l:394   w:606
anländer    i:146   l:466   w:534
anländer    i:282   l:466   w:534
 
and

äter        i:576   l:606   w:394   a:14
ätit        i:531   l:603   w:397
ätit        i:586   l:636   w:364
ättestupan  i:109   l:558   w:442
ättestupan  i:126   l:465   w:535

> I still suspect a problem with ctype for your locale.  Could you compile
> and run the following C program on your system, and send me the output?
> (Run it with the name of your locale, "sv", as an argument.)

Ok, here you go:

su10-6 <6> cc test.c
su10-6 <7> a.out sv
  0 0x00: ^@  A-c-?
  1 0x01: ^A  A-c-?
  2 0x02: ^B  A-c-?
  3 0x03: ^C  A-c-?
  4 0x04: ^D  A-c-?
  5 0x05: ^E  A-c-?
  6 0x06: ^F  A-c-?
  7 0x07: ^G  A-c-?
  8 0x08: ^H  A-c-?
  9 0x09: ^I  A-cs?
 10 0x0A: ^J  A-cs?
 11 0x0B: ^K  A-cs?
 12 0x0C: ^L  A-cs?
 13 0x0D: ^M  A-cs?
 14 0x0E: ^N  A-c-?
 15 0x0F: ^O  A-c-?
 16 0x10: ^P  A-c-?
 17 0x11: ^Q  A-c-?
 18 0x12: ^R  A-c-?
 19 0x13: ^S  A-c-?
 20 0x14: ^T  A-c-?
 21 0x15: ^U  A-c-?
 22 0x16: ^V  A-c-?
 23 0x17: ^W  A-c-?
 24 0x18: ^X  A-c-?
 25 0x19: ^Y  A-c-?
 26 0x1A: ^Z  A-c-?
 27 0x1B: ^[  A-c-?
 28 0x1C: ^\  A-c-?
 29 0x1D: ^]  A-c-?
 30 0x1E: ^^  A-c-?
 31 0x1F: ^_  A-c-?
 32 0x20: A---t--s?
 33 0x21:  !  A--gtp--?
 34 0x22:  "  A--gtp--?
 35 0x23:  #  A--gtp--?
 36 0x24:  $  A--gtp--?
 37 0x25:  %  A--gtp--?
 38 0x26:  &  A--gtp--?
 39 0x27:  '  A--gtp--?
 40 0x28:  (  A--gtp--?
 41 0x29:  )  A--gtp--?
 42 0x2A:  *  A--gtp--?
 43 0x2B:  +  A--gtp--?
 44 0x2C:  ,  A--gtp--?
 45 0x2D:  -  A--gtp--?
 46 0x2E:  .  A--gtp--?
 47 0x2F:  /  A--gtp--?
 48 0x30:  0  A---ndxgt---?
 49 0x31:  1  A---ndxgt---?
 50 0x32:  2  A---ndxgt---?
 51 0x33:  3  A---ndxgt---?
 52 0x34:  4  A---ndxgt---?
 53 0x35:  5  A---ndxgt---?
 54 0x36:  6  A---ndxgt---?
 55 0x37:  7  A---ndxgt---?
 56 0x38:  8  A---ndxgt---?
 57 0x39:  9  A---ndxgt---?
 58 0x3A:  :  A--gtp--?
 59 0x3B:  ;  A--gtp--?
 60 0x3C:  <  A--gtp--?
 61 0x3D:  =  A--gtp--?
 62 0x3E:  >  A--gtp--?
 63 0x3F:  ?  A--gtp--?
 64 0x40:  @  A--gtp--?
 65 0x41:  A  Aa-un-xgt---?
 66 0x42:  B  Aa-un-xgt---?
 67 0x43:  C  Aa-un-xgt---?
 68 0x44:  D  Aa-un-xgt---?
 69 0x45:  E  Aa-un-xgt---?
 70 0x46:  F  Aa-un-xgt---?
 71 0x47:  G  Aa-un--gt---?
 72 0x48:  H  Aa-un--gt---?
 73 0x49:  I  Aa-un--gt---?
 74 0x4A:  J  Aa-un--gt---?
 75 0x4B:  K  Aa-un--gt---?
 76 0x4C:  L  Aa-un--gt---?
 77 0x4D:  M  Aa-un--gt---?
 78 0x4E:  N  Aa-un--gt---?
 79 0x4F:  O  Aa-un--gt---?
 80 0x50:  P  Aa-un--gt---?
 81 0x51:  Q  Aa-un--gt---?
 82 0x52:  R  Aa-un--gt---?
 83 0x53:  S  Aa-un--gt---?
 84 0x54:  T  Aa-un--gt---?
 85 0x55:  U  Aa-un--gt---?
 86 0x56:  V  Aa-un--gt---?
 87 0x57:  W  Aa-un--gt---?
 88 0x58:  X  Aa-un--gt---?
 89 0x59:  Y  Aa-un--gt---?
 90 0x5A:  Z  Aa-un--gt---?
 91 0x5B:  [  A--gtp--?
 92 0x5C:  \  A--gtp--?
 93 0x5D:  ]  A--gtp--?
 94 0x5E:  ^  A--gtp--?
 95 0x5F:  _  A--gtp--?
 96 0x60:  `  A--gtp--?
 97 0x61:  a  Aal-n-xgt---?
 98 0x62:  b  Aal-n-xgt---?
 99 0x63:  c  Aal-n-xgt---?
100 0x64:  d  Aal-n-xgt---?
101 0x65:  e  Aal-n-xgt---?
102 0x66:  f  Aal-n-xgt---?
103 0x67:  g  Aal-n--gt---?
104 0x68:  h  Aal-n--gt---?
105 0x69:  i  Aal-n--gt---?
106 0x6A:  j  Aal-n--gt---?
107 0x6B:  k  Aal-n--gt---?
108 0x6C:  l  Aal-n--gt---?
109 0x6D:  m  Aal-n--gt---?
110 0x6E:  n  Aal-n--gt---?
111 0x6F:  o  Aal-n--gt---?
112 0x70:  p  Aal-n--gt---?
113 0x71:  q  Aal-n--gt---?
114 0x72:  r  Aal-n--gt---?
115 0x73:  s  Aal-n--gt---?
116 0x74:  t  Aal-n--gt---?
117 0x75:  u  Aal-n--gt---?
118 0x76:  v  Aal-n--gt---?
119 0x77:  w  Aal-n--gt---?
120 0x78:  x  Aal-n--gt---?
121 0x79:  y  Aal-n--gt---?
122 0x7A:  z  Aal-n--gt---?
123 0x7B:  {  A--gtp--?
124 0x7C:  |  A--gtp--?
125 0x7D:  }  A--gtp--?
126 0x7E:  ~  A--gtp--?
127 0x7F: ^?  A-c-?
128 0x80: ~@  --c-?
129 0x81: ~A  --c-?
130 0x82: ~B  --c-?
131 0x83: ~C  --
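
The test program itself is not included in the thread. A minimal sketch of
that kind of check, assuming only the standard C setlocale() and <ctype.h>
calls, might look like the following. The function names and flag constants
are made up for illustration; they are not taken from Gilles's program, and
the flag letters printed in the output above are not reproduced.

```c
#include <ctype.h>
#include <locale.h>

/* Hypothetical flag bits, one per ctype class that is checked. */
enum {
    F_ALPHA = 1,  F_UPPER = 2,  F_LOWER = 4,  F_DIGIT = 8,
    F_SPACE = 16, F_PUNCT = 32, F_CNTRL = 64, F_PRINT = 128
};

/* Select the ctype tables of the named locale (for example "sv").
   Returns 0 when the locale is not available on the system. */
int use_ctype_locale(const char *name)
{
    return setlocale(LC_CTYPE, name) != NULL;
}

/* Classify one byte under the currently selected locale. The value is
   forced into 0..255 first, because the <ctype.h> functions are only
   defined for unsigned char values and EOF. */
int ctype_flags(int byte)
{
    int c = (unsigned char)byte;
    int f = 0;

    if (isalpha(c)) f |= F_ALPHA;
    if (isupper(c)) f |= F_UPPER;
    if (islower(c)) f |= F_LOWER;
    if (isdigit(c)) f |= F_DIGIT;
    if (isspace(c)) f |= F_SPACE;
    if (ispunct(c)) f |= F_PUNCT;
    if (iscntrl(c)) f |= F_CNTRL;
    if (isprint(c)) f |= F_PRINT;
    return f;
}

/* Lower-case mapping of one byte under the current locale. */
int fold_lower(int byte)
{
    return tolower((unsigned char)byte);
}
```

For the problem in this thread, the interesting checks would be whether
ctype_flags(0xC4) reports an upper-case letter and whether fold_lower(0xC4)
returns 0xE4 once the "sv" locale has been selected.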

SV: [htdig] Foreign chars (Swedish)

1999-11-25 Thread Philippe Ramkvist-Henry


> Are the hits all capitalized, or do some of them have the lowercase ä?
> Does this problem happen consistently with certain accented letters, and
> not others?  Do you have certain uppercase letters appearing in db.wordlist?

By hits you mean the actual words from the document, I guess. Well, only
those that are supposed to be capitalized are. For example: a search for
"ättestupan" renders 0 hits, while a search for "Ättestupan" renders 18.
The word is always written as "Ättestupan" in the documents, so this would
be natural if the search were case sensitive.
The problem is that "Åsa" and "åsa" give the exact same hits, and that name
is also always written as "Åsa" in the documents. The problem only exists
(as far as I can test) for "äÄ".

The db.wordlist only contains lowercase letters.

> > I asked a guy here at the University and he said that there might be
> > complications with "unsigned char" and "char". He gave me the example
> > below. Please answer at a novice level, my C++ and Unix knowledge is very
> > limited.  
> 
> Good hunch, but given that some accented letters work and some give
> problems, I wouldn't expect that it's a problem with sign extension.
> This seems to point to a problem with the ctype tables for your locale,
> but there could be something else that I'm missing here.  Please keep
> us posted.

I'm also looking for a synonym wordlist in Swedish... If anyone has one,
please send me a copy.






[htdig] Foreign chars (Swedish)

1999-11-25 Thread Philippe Ramkvist-Henry


Hello!

I'm having problems with some foreign chars when using htdig to index and
search a Swedish site. The locale is set right (sv) and is working in
other applications. The problem I have is somewhat weird, maybe it has
something to do with "uppercase" "lowercase"?

Well, I can search words like "Åsa,åsa,Öl,öl" and get the same matches.
But when I try to search "bäst" I get no hits. With "bÄst" I get several
hits...

I asked a guy here at the University and he said that there might be
complications with "unsigned char" and "char". He gave me the example
below. Please answer at a novice level, my C++ and Unix knowledge is very
limited.  

Thanks
Philippe Ramkvist-Henry



 htlib/StringMatch.cc
 
 while ((unsigned char)string[pos])
 {
 new_state = table[trans[string[pos]]][state];
 
Should be? or? 
 
 while (string[pos])
 {
 new_state = table[trans[(unsigned char)string[pos]]][state];
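
A minimal sketch of why the position of the cast matters, assuming a
platform where plain char is signed and the text is Latin 1. On such a
platform a byte like 0xE4 (ä) is a negative value, so using it directly as
a subscript reads in front of the table. The cast in the while test does
not change that; only a cast inside the subscript does. The names
demo_trans, naive_index, safe_index and lookup are hypothetical stand-ins,
not code from htlib/StringMatch.cc.

```c
#include <limits.h>

/* Hypothetical table standing in for "trans" in the snippet above,
   with one entry per possible byte value. */
static unsigned char demo_trans[UCHAR_MAX + 1];

/* The subscript as it would be computed from a signed char: bytes above
   0x7F come out negative on typical platforms. */
int naive_index(signed char c)
{
    return c;
}

/* The subscript after the cast: never negative, 0..255 for 8-bit bytes. */
int safe_index(signed char c)
{
    return (unsigned char)c;
}

/* Table lookup for one byte of a string that never uses a negative
   subscript, whatever the signedness of plain char. */
unsigned char lookup(const char *s, int pos)
{
    return demo_trans[(unsigned char)s[pos]];
}
```

For the byte 0xE4, naive_index gives a negative number while safe_index
gives 228, which is the slot a 256-entry table expects.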
  
   






[htdig] Swedish files...

1999-11-12 Thread Philippe Ramkvist-Henry



I'm wondering if someone has built the:

svenska.datalista svenska.aff svenska.hash

files. It seems like my Unix system doesn't support the "buildhash"
command. If you've already built these files please send me a copy! I
would be very grateful.

(my last problem is solved! "help a newbie", just some Roxen parameters)





To unsubscribe from the htdig mailing list, send a message to
[EMAIL PROTECTED] containing the single word unsubscribe in
the SUBJECT of the message.



[htdig] Help a newbie

1999-11-10 Thread Philippe Ramkvist-Henry



I've installed the latest version of ht://Dig and I'm trying to get it to
work with my Roxen server. Everything is running smoothly: ht://Dig
installed nicely and rundig created a db, but when I try to make a search I
get "the document contained no data". My htsearch is working, because when
I run it from a shell it answers with some HTML code. I'm wondering if it
might be my server that I've set up wrong? Probably it's me and CGI
scripts...

Thankful for help,

Philippe Ramkvist-Henry


