Re: SV: [htdig] Foreign chars (Swedish)
On Mon, 29 Nov 1999, Gilles Detillieux wrote:

> Just a hunch, but you wouldn't happen to have a ä in valid_punctuation,
> would you? In any case, could you run htdig -vvv twice, searching
> first for ANLÄNDE, and then for anlände? How do the initial debugging
> messages differ? What's happening to the ä - is it getting stripped
> out or changed to another character? Is the upper case Ä getting changed
> to a ä, or to another character? Are you using the exact same config
> file for htdig, htmerge and htsearch?

I use the default for "valid_punctuation", and I even tried adding it as "extra_word_characters: ä". Here's the debugging info for the second (237th! :) try.

su10-2 <74> htsearch -vvv
Enter value for words: anlände
tempWords: 'anlände:0 '
Boolean: 'anlände:0 '
initial: ''
Add: anlände
searchWords: 'anlände:0 '
LogicalWords: anlände
Pattern:
Enter value for format:

su10-2 <75> htsearch -vvv
Enter value for words: ANLÄNDE
tempWords: 'anlände:0 '
Boolean: 'anlände:0 '
initial: ''
Fuzzy on: anlände
(null) anlände (null) word=anlände prefix_suffix=* prefix_suffix_length=1 minimum_prefix_length=1
endings anlända anländandet anländandets anländande anländ- anländer anlänt anländs anländes anlänts anländes
synonyms
searchWords: '(:0 anlände:0 |:0 anlända:0 |:0 anländandet:0 |:0 anländandets:0 |:0 anländande:0 |:0 anländ-:0 |:0 anländer:0 |:0 anlänt:0 |:0 anländs:0 |:0 anländes:0 |:0 anlänts:0 |:0 anländes:0 ):0 '
LogicalWords: (anlände or anlända or anländandet or anländandets or anländande or anländ- or anländer or anlänt or anländs or anländes or anlänts or anländes)
Pattern: anlände
Enter value for format:

looks ok to me... what do you say?

> Not that I know of, but you could put a originalWords.uppercase(); right
> after the originalWords.chop(" \t\r\n"); in htsearch/htsearch.cc. If the
> htsearch -vvv above doesn't get to the root of the problem, it might be
> interesting to see if this hack has any effect.

I'll try this too. If the above looks ok.

I got a mail from another Swedish subscriber of this list and according to him everything worked well using sv_SE (which I don't have) and indexing using an English dictionary (which shouldn't change anything). I'll try to get hold of that locale and try it...

/Philippe

To unsubscribe from the htdig mailing list, send a message to [EMAIL PROTECTED] You will receive a message to confirm this.
Re: SV: [htdig] Foreign chars (Swedish)
On Fri, 26 Nov 1999, Gilles Detillieux wrote:

> That all looks the way it should, as far as I'm concerned. I guess we
> need to focus on htsearch, as it appears to be the culprit. (Either that
> or htmerge.) Could you try running htsearch from the command line,
> and searching first for ANLÄNDE, and then for anlände? I'd like to see
> what it finds in both cases.

I get exactly the same problem when searching directly from the shell as when searching from an HTML form.

> OK, your ctype info for the sv locale looks fine. Again, I suspect
> htsearch, or possibly a corrupt database. If we can't nail down something
> specific in htsearch, or if it's not too difficult to reindex everything
> from scratch, I'd suggest you do just that.

Well, reindexing doesn't help... I've done it SEVERAL times now... :(

> Yes, and your db.wordlist looks fine (at least what you showed me),
> so it should work, as long as you're also feeding Latin 1 characters
> into htsearch. If you are, then it's a bug or a corrupt database (the
> index, not the word list).

Hmm, maybe. Is there any way to force the input in the HTML form to be "uppercase"? It's a dirty solution but it would work.

Thanx for all the help!
Re: SV: [htdig] Foreign chars (Swedish)
On Thu, 25 Nov 1999, Gilles Detillieux wrote:

> OK, so the word Ättestupan appears in there as ättestupan, correct?
> Very strange. So searches for words containing Ä will find words with
> ä in its place, as expected, but searches for words containing ä will
> match neither ä nor Ä, is that right? I'm at a bit of a loss to explain
> it, but at some point it would seem that htsearch is mangling the lower
> case ä. Do you have any documents containing a lower case ä somewhere
> in a word, and if so, does that word make it into db.wordlist correctly?

All correct, and the words make it into the db.wordlist correctly. Example:

anlände i:269 l:150 w:1652 c:2 a:4
anlände i:475 l:285 w:715
anlände i:581 l:295 w:705 a:1
anlände i:586 l:394 w:606
anländer i:146 l:466 w:534
anländer i:282 l:466 w:534

and

äter i:576 l:606 w:394 a:14
ätit i:531 l:603 w:397
ätit i:586 l:636 w:364
ättestupan i:109 l:558 w:442
ättestupan i:126 l:465 w:535

> I still suspect a problem with ctype for your locale. Could you compile
> and run the following C program on your system, and send me the output?
> (Run it with the name of your locale, "sv", as an argument.)

Ok, here you go:

su10-6 <6> cc test.c
su10-6 <7> a.out sv
0 0x00: ^@ A-c-?
1 0x01: ^A A-c-?
2 0x02: ^B A-c-?
3 0x03: ^C A-c-?
4 0x04: ^D A-c-?
5 0x05: ^E A-c-?
6 0x06: ^F A-c-?
7 0x07: ^G A-c-?
8 0x08: ^H A-c-?
9 0x09: ^I A-cs?
10 0x0A: ^J A-cs?
11 0x0B: ^K A-cs?
12 0x0C: ^L A-cs?
13 0x0D: ^M A-cs?
14 0x0E: ^N A-c-?
15 0x0F: ^O A-c-?
16 0x10: ^P A-c-?
17 0x11: ^Q A-c-?
18 0x12: ^R A-c-?
19 0x13: ^S A-c-?
20 0x14: ^T A-c-?
21 0x15: ^U A-c-?
22 0x16: ^V A-c-?
23 0x17: ^W A-c-?
24 0x18: ^X A-c-?
25 0x19: ^Y A-c-?
26 0x1A: ^Z A-c-?
27 0x1B: ^[ A-c-?
28 0x1C: ^\ A-c-?
29 0x1D: ^] A-c-?
30 0x1E: ^^ A-c-?
31 0x1F: ^_ A-c-?
32 0x20:   A---t--s?
33 0x21: ! A--gtp--?
34 0x22: " A--gtp--?
35 0x23: # A--gtp--?
36 0x24: $ A--gtp--?
37 0x25: % A--gtp--?
38 0x26: & A--gtp--?
39 0x27: ' A--gtp--?
40 0x28: ( A--gtp--?
41 0x29: ) A--gtp--?
42 0x2A: * A--gtp--?
43 0x2B: + A--gtp--?
44 0x2C: , A--gtp--?
45 0x2D: - A--gtp--?
46 0x2E: . A--gtp--?
47 0x2F: / A--gtp--?
48 0x30: 0 A---ndxgt---?
49 0x31: 1 A---ndxgt---?
50 0x32: 2 A---ndxgt---?
51 0x33: 3 A---ndxgt---?
52 0x34: 4 A---ndxgt---?
53 0x35: 5 A---ndxgt---?
54 0x36: 6 A---ndxgt---?
55 0x37: 7 A---ndxgt---?
56 0x38: 8 A---ndxgt---?
57 0x39: 9 A---ndxgt---?
58 0x3A: : A--gtp--?
59 0x3B: ; A--gtp--?
60 0x3C: < A--gtp--?
61 0x3D: = A--gtp--?
62 0x3E: > A--gtp--?
63 0x3F: ? A--gtp--?
64 0x40: @ A--gtp--?
65 0x41: A Aa-un-xgt---?
66 0x42: B Aa-un-xgt---?
67 0x43: C Aa-un-xgt---?
68 0x44: D Aa-un-xgt---?
69 0x45: E Aa-un-xgt---?
70 0x46: F Aa-un-xgt---?
71 0x47: G Aa-un--gt---?
72 0x48: H Aa-un--gt---?
73 0x49: I Aa-un--gt---?
74 0x4A: J Aa-un--gt---?
75 0x4B: K Aa-un--gt---?
76 0x4C: L Aa-un--gt---?
77 0x4D: M Aa-un--gt---?
78 0x4E: N Aa-un--gt---?
79 0x4F: O Aa-un--gt---?
80 0x50: P Aa-un--gt---?
81 0x51: Q Aa-un--gt---?
82 0x52: R Aa-un--gt---?
83 0x53: S Aa-un--gt---?
84 0x54: T Aa-un--gt---?
85 0x55: U Aa-un--gt---?
86 0x56: V Aa-un--gt---?
87 0x57: W Aa-un--gt---?
88 0x58: X Aa-un--gt---?
89 0x59: Y Aa-un--gt---?
90 0x5A: Z Aa-un--gt---?
91 0x5B: [ A--gtp--?
92 0x5C: \ A--gtp--?
93 0x5D: ] A--gtp--?
94 0x5E: ^ A--gtp--?
95 0x5F: _ A--gtp--?
96 0x60: ` A--gtp--?
97 0x61: a Aal-n-xgt---?
98 0x62: b Aal-n-xgt---?
99 0x63: c Aal-n-xgt---?
100 0x64: d Aal-n-xgt---?
101 0x65: e Aal-n-xgt---?
102 0x66: f Aal-n-xgt---?
103 0x67: g Aal-n--gt---?
104 0x68: h Aal-n--gt---?
105 0x69: i Aal-n--gt---?
106 0x6A: j Aal-n--gt---?
107 0x6B: k Aal-n--gt---?
108 0x6C: l Aal-n--gt---?
109 0x6D: m Aal-n--gt---?
110 0x6E: n Aal-n--gt---?
111 0x6F: o Aal-n--gt---?
112 0x70: p Aal-n--gt---?
113 0x71: q Aal-n--gt---?
114 0x72: r Aal-n--gt---?
115 0x73: s Aal-n--gt---?
116 0x74: t Aal-n--gt---?
117 0x75: u Aal-n--gt---?
118 0x76: v Aal-n--gt---?
119 0x77: w Aal-n--gt---?
120 0x78: x Aal-n--gt---?
121 0x79: y Aal-n--gt---?
122 0x7A: z Aal-n--gt---?
123 0x7B: { A--gtp--?
124 0x7C: | A--gtp--?
125 0x7D: } A--gtp--?
126 0x7E: ~ A--gtp--?
127 0x7F: ^? A-c-?
128 0x80: ~@ --c-?
129 0x81: ~A --c-?
130 0x82: ~B --c-?
131 0x83: ~C --
SV: [htdig] Foreign chars (Swedish)
> Are the hits all capitalized, or do some of them have the lowercase ä?
> Does this problem happen consistently with certain accented letters, and
> not others? Do you have certain uppercase letters appearing in db.wordlist?

By hits you mean the actual words from the document, I guess. Well, only those which are supposed to be capitalized are. For example: a search for "ättestupan" renders 0 hits while a search for "Ättestupan" renders 18. The word is always written as "Ättestupan" in the documents, so this would be natural if the search was case sensitive. The problem is that "Åsa" and "åsa" give the exact same hits, and it's also always referred to as "Åsa". The problem only exists (as far as I can test) for "äÄ". The db.wordlist only contains lowercase letters.

> > I asked a guy here at the University and he said that there might be
> > complications with "unsigned char" and "char". He gave me the example
> > below. Please answer at a novice level, my C++ and Unix knowledge is very
> > limited.
>
> Good hunch, but given that some accented letters work and some give
> problems, I wouldn't expect that it's a problem with sign extension.
> This seems to point to a problem with the ctype tables for your locale,
> but there could be something else that I'm missing here. Please keep
> us posted.

I'm also looking for a synonym wordlist in Swedish... If anyone has one, please send me a copy.
[htdig] Foreign chars (Swedish)
Hello!

I'm having problems with some foreign chars when using htdig to index and search a Swedish site. The locale is set right (sv) and is working in other applications. The problem I have is somewhat weird, maybe it has something to do with "uppercase" and "lowercase"? Well, I can search for words like "Åsa,åsa,Öl,öl" and get the same matches. But when I try to search for "bäst" I get no hits. With "bÄst" I get several hits...

I asked a guy here at the University and he said that there might be complications with "unsigned char" and "char". He gave me the example below. Please answer at a novice level, my C++ and Unix knowledge is very limited.

Thanks
Philippe Ramkvist-Henry

htlib/StringMatch.cc

    while ((unsigned char)string[pos]) {
        new_state = table[trans[string[pos]]][state];

Should be? or?

    while (string[pos]) {
        new_state = table[trans[(unsigned char)string[pos]]][state];
[htdig] Swedish files...
I'm wondering if someone has built the following files:

svenska.datalista
svenska.aff
svenska.hash

It seems like my Unix system doesn't support the "buildhash" command. If you've already built these files, please send me a copy! I would be very grateful.

(My last problem, "Help a newbie", is solved! It was just some Roxen parameters.)
[htdig] Help a newbie
I've installed the latest version of ht://Dig and I'm trying to get it to work with my Roxen server. Everything is running smoothly: ht://Dig installed nicely and rundig created a db, but when I try to make a search I get "the document contained no data". My htsearch is working, because when I run it from a shell it answers with some HTML code. I'm wondering if it might be my server that I've set up wrong? Probably it's me and CGI scripts...

Thankful for help,
Philippe Ramkvist-Henry