Re: +Idx problems maybe?
Hi Henrik,

> I tested the
>
>    (rel key (+Fold +Ref +String))
>    ...
>    (fold @Str @Cls key)
>
> version and rebuilt the index but I still can't get the search to work
> in a case insensitive fashion. Did I miss something?

Just to be sure: You re-imported the data, or at least re-built the
index, didn't you?

BTW, with the above pattern, you get 'fold'ed comparisons, which imply
case-insensitivity. But before you said you also wanted substring
indexing. For that, you might take

   (rel key (+Fold +Idx +String))
   ...
   (part @Str @Cls key)

though at the cost of a larger index.

Cheers,
- Alex
--
UNSUBSCRIBE: mailto:picol...@software-lab.de?subject=unsubscribe
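As a quick illustration of the 'fold'ed comparison mentioned above, here is a minimal sketch for the PicoLisp REPL. The sample strings are made up, and no database is involved; it only shows what the built-in 'fold' function does to a key before it is compared.

```picolisp
# 'fold' keeps only letters (converted to lower case) and digits
(println (fold " 1A 2-b/3"))   # prints "1a2b3"

# Two spellings of the same made-up key fold to the same string,
# which is why a '+Fold' index compares case-insensitively
(println (= (fold "Henrik Sarvell") (fold "HENRIK-SARVELL")))   # prints T
```

Because blanks and punctuation are dropped as well, a folded search is somewhat more lenient than a plain case-insensitive comparison.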
Re: +Idx problems maybe?
I tested the

   (rel key (+Fold +Ref +String))
   ...
   (fold @Str @Cls key)

version and rebuilt the index but I still can't get the search to work
in a case insensitive fashion. Did I miss something?

On Tue, Nov 3, 2009 at 11:02 AM, Henrik Sarvell wrote:
> I'll try with the one you suggested, thanks for the clarifications!
>
> /Henrik
>
> On Tue, Nov 3, 2009 at 8:38 AM, Alexander Burger wrote:
>> Hi Henrik,
>>
>>> I took a look at the pilog file, I already get what same and range are
>>> doing but what are part, head and fold doing?
>>
>> You are on the right track. You used 'tolr', but this actually makes
>> sense only in combination with the '+Sn' (Soundex) prefix. The whole
>> matter is rather complicated, because there are so many combinations of
>> index types and Pilog comparison functions possible.
>>
>>
>> I would say that we have the following typical use cases for string
>> searches (I'll leave out numerical searches, which usually combine with
>> 'same' or 'range').
>>
>> 1. "Exact" searches. You have either a unique index
>>
>>       (rel key (+Key +String))
>>
>>    or a non-unique index
>>
>>       (rel key (+Ref +String))
>>
>>    and you can compare results in Pilog with
>>
>>       (same @Str @Cls key)
>>
>>    for exact matches, or with
>>
>>       (head @Str @Cls key)
>>
>>    for "dictionary" searches (searching only for the beginning of
>>    strings). These are case-sensitive searches.
>>
>>
>> 2. "Folded" searches. They make use of the 'fold' function which keeps
>>    only letters, converted to lower case, and digits.
>>
>>       (rel key (+Fold +Ref +String))
>>       ...
>>       (fold @Str @Cls key)
>>
>>    This searches only for the beginning of strings. We use it typically
>>    for telephone numbers.
>>
>>
>>    If a search for individual words in a key is desired, we can use
>>
>>       (rel key (+List +Fold +Ref +String))
>>       ...
>>       (fold @Str @Cls key)
>>
>>    This stores only the strings in the list (not the substrings) in
>>    'fold'ed representation. So each word can be found by "dictionary"
>>    search. This requires changes to the GUI and import functions,
>>    though, as 'key' is not a string but a list of strings.
>>
>>
>>    Finally, we can also index folded substrings:
>>
>>       (rel key (+Fold +Idx +String))
>>       ...
>>       (part @Str @Cls key)
>>
>>    This is perhaps what you need. If you go for it, I'd recommend you
>>    download once more the latest testing release, as the 'part' function
>>    was changed recently.
>>
>>
>> 3. "Tolerant" searches. They return first all exact (case-sensitive)
>>    matches of partial strings, and then the matches according to the
>>    soundex algorithm (the first letter is compared exactly
>>    (case-sensitive), the rest checks for similarity). This mainly makes
>>    sense for personal names.
>>
>>       (rel key (+Sn +Idx +String))
>>       ...
>>       (tolr @Str @Cls key)
>>
>>
>> Concerning space consumption, the '+Key' and '+Ref' indexes are the most
>> economical ones. They create only a single entry in the index tree per
>> key.
>>
>> Then follow the '+List +Ref +String' indexes, which create an entry per
>> word.
>>
>> Most space-hungry are the '+Idx' indexes, as they create an entry for
>> each substring down to a length of three, and '+Sn' adds one more for
>> the soundex key.
>>
>> Cheers,
>> - Alex
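To make the space remark about '+Idx' more concrete, the following is a rough sketch only. It assumes that "each substring down to a length of three" means the tails of the folded key, and 'tails3' is a hypothetical helper written for this illustration, not a PicoLisp function. The real '+Idx' class may differ in detail (for example in how it treats separate words).

```picolisp
# Hypothetical helper (not part of PicoLisp): list the tails of a
# folded key that are at least three characters long
(de tails3 (Str)
   (make
      (for (L (chop (fold Str)) (cddr L) (cdr L))
         (link (pack L)) ) ) )

(println (tails3 "Alex B."))   # prints ("alexb" "lexb" "exb")
```

Under that assumption a key of n folded characters needs about n - 2 tree entries, compared with a single entry for '+Key' or '+Ref'.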
Re: 64bit segmentation fault when matching on long lists
Great, thanks! I'll try it out tonight.

On Tue, Nov 3, 2009 at 8:09 PM, Alexander Burger wrote:
> That's easy. Try two 'from's in succession:
> [...]
Re: 64bit segmentation fault when matching on long lists
On Tue, Nov 03, 2009 at 07:27:47PM +0100, Henrik Sarvell wrote:
> What we need is a test for <item [...]
> discarding/ignoring the first ">" when we do subsequent tills. That's
> what I tried to do with an initial (till ">") but it didn't work, like
> this:
>
> (in "rss.xml"
>    (while
>       (from "<item")
>       (till ">")

That's easy. Try two 'from's in succession:

(in "rss.xml"
   (while
      (from "<item")
      (from ">")

This works both for

   <item>Content</item>

and

   <item ...>Content</item>

'from' is the main working tool. As you know, you can also pass several
patterns to 'from' (implicit OR, and the return value can be checked in
a 'case' statement), so this is more flexible.

Cheers,
- Alex
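The two-'from' idea above can be tried out with a small self-contained sketch. The sample feed, the temporary file and the function name 'itemTexts' are all made up for this illustration, and the text of each item is read with a plain (till "<"), which is a simplification compared to the 'make'/'loop' code discussed in this thread.

```picolisp
# Write a tiny made-up feed to a temporary file
(out (tmp "rss.xml")
   (prinl "<rss><item>One</item><item id='2'>Two</item></rss>") )

# Hypothetical helper: collect the text of every item, with or
# without attributes in the opening tag
(de itemTexts (File)
   (in File
      (make
         (while (from "<item")           # Find the start of an opening tag
            (from ">")                   # Skip attributes and the closing '>'
            (link (pack (till "<"))) ) ) ) )   # Text up to the next tag

(println (itemTexts (tmp "rss.xml")))
```

Note that "<item" would also match a tag whose name merely starts with "item", so a real feed may need a stricter check.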
Re: 64bit segmentation fault when matching on long lists
On Tue, Nov 03, 2009 at 05:58:20PM +0100, Henrik Sarvell wrote:
> Am I missing something? Won't the first (from) basically find the
> first instance of "<item [...]

Yes.

> match here am I not matching on the whole of the rest of the document?

No, I was talking about the return value of 'make'

(while (from "<item>")
   (println                               # Instead of printing
      (make                               # do further matching
         (loop
            (NIL (chain (till ">")))      # Collect until next tag
            (char)                        # Skip '>'
            (T (tail '`(chop "item") @)) ) ) ) )  # See if we got </item>

'make' returns everything that was collected by 'till' and 'chain' in a
list.

The tail of that list will be ("i" "t" "e" "m") if 'loop' was terminated
by the 'T' clause, or something unexpected when end of file was hit (the
'NIL' clause).

So how about such a structure:

(while (from "<item>")
   (let Lst
      (make
         (loop
            (NIL (chain (till ">")))
            (char)
            (T (tail '`(chop "item") @)) ) )
      (cond
         ((match ... Lst)
            ... )
         ((match ... Lst)
            ...)

You could also immediately check for the trailing ("i" "t" "e" "m") and
discard results which do not match it:

(use @X
   (while (from "<item>")
      (when
         (match '(@X "i" "t" "e" "m")
            (make
               (loop
                  (NIL (chain (till ">")))
                  (char)
                  (T (tail '`(chop "item") @)) ) ) )
         (got something in @X without trailing "item")
         ... ) ) )

Again, just ideas, not tested ;-)

Cheers,
- Alex
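A self-contained variant of the last idea might look like the sketch below. It is not the code from the message: the sample feed and the name 'itemBodies' are invented, the last chunk is kept in an explicit variable 'Chunk' instead of relying on '@', and the 'match' pattern also strips the "</" in front of "item".

```picolisp
# Write a tiny made-up feed to a temporary file
(out (tmp "rss.xml")
   (prinl "<rss><item>One</item><item>Two</item></rss>") )

# Hypothetical helper: return the body of every <item> as a string
(de itemBodies (File)
   (use (@X Chunk)
      (in File
         (make
            (while (from "<item>")
               (when
                  (match '(@X "<" "/" "i" "t" "e" "m")
                     (make
                        (loop
                           (NIL (setq Chunk (till ">")))   # End of file
                           (chain Chunk)                   # Collect until next '>'
                           (char)                          # Skip '>'
                           (T (tail '`(chop "item") Chunk)) ) ) )  # Chunk ends in "item"
                  (link (pack @X)) ) ) ) ) ) )

(println (itemBodies (tmp "rss.xml")))
```

Since every '>' is skipped with (char), nested tags inside an item would end up in the body without their closing '>'.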
Re: 64bit segmentation fault when matching on long lists
But the problem is that we can't use <item> in the (from) since some
feeds will contain <item ...> only, what do we do about that? In most
cases it will still be <item> and then the subsequent test for item (in
the terminating tag) will of course return true and then we get nothing.

I mean I don't understand how the above code would work with something
looking like this:

   <item ...>Content1</item>
   <item ...>Content2</item>

While still being able to handle

   <item>Content1</item>
   <item>Content2</item>

What we need is a test for <item [...] discarding/ignoring the first ">"
when we do subsequent tills. That's what I tried to do with an initial
(till ">") but it didn't work, like this:

(in "rss.xml"
   (while
      (from "<item")
      (till ">")
      (println
         (make
            (loop
               (NIL (chain (till ">")))
               (char)
               (T (tail '`(chop "item") @)) ) ) ) ))

/Henrik

On Tue, Nov 3, 2009 at 6:37 PM, Alexander Burger wrote:
> On Tue, Nov 03, 2009 at 05:58:20PM +0100, Henrik Sarvell wrote:
>> Am I missing something? Won't the first (from) basically find the
>> first instance of "<item [...]
>
> Yes.
>
>> match here am I not matching on the whole of the rest of the document?
>
> No, I was talking about the return value of 'make'
> [...]
Re: 64bit segmentation fault when matching on long lists
Am I missing something? Won't the first (from) basically find the
first instance of "<item [...] match here am I not matching on the whole
of the rest of the document?

[...]

(in "rss.xml"
   (while
      (from "<item[^>]*>")
      (println
         (make
            (loop
               (NIL (chain (till ">")))
               (char)
               (T (tail '`(chop "item") @)) ) ) ) ))

Maybe this is clearer. Obviously we can't use the above, but what can we
do in order to replace it with something that is equivalent and legal?

/Henrik

On Tue, Nov 3, 2009 at 5:41 PM, Alexander Burger wrote:
> I think that from this point on 'match' is the easiest and most general.
> [...]
Re: 64bit segmentation fault when matching on long lists
On Tue, Nov 03, 2009 at 04:24:55PM +0100, Henrik Sarvell wrote:
> (in "rss.xml"
>    (while
>       (from "<item")
>       (println
>          (make
>             (loop
>                (NIL (chain (till ">")))
>                (char)
>                (T (tail '`(chop "item") @)) ) ) ) ))
>
> This will accurately capture the <item> tag all the time I think but
> then we need some way of discarding the attributes and the closing >.

I think that from this point on 'match' is the easiest and most general.
What 'make' returns is not so big any more, and has also perhaps more
predictable patterns.

Cheers,
- Alex
Re: 64bit segmentation fault when matching on long lists
I started with this approach yesterday, first in order to capture feed
type which I am now able to do. I noticed that some rss feeds have
attributes in their <item> tags, therefore the above won't work 100% of
the time.

(in "rss.xml"
   (while
      (from "<item")
      (println
         (make
            (loop
               (NIL (chain (till ">")))
               (char)
               (T (tail '`(chop "item") @)) ) ) ) ))

This will accurately capture the <item> tag all the time I think but
then we need some way of discarding the attributes and the closing >.

I tried with an immediate (till ">") after the (from) but it didn't have
the intended result, any suggestions here?

/Henrik

On Sun, Nov 1, 2009 at 6:26 PM, Alexander Burger wrote:
> On Sun, Nov 01, 2009 at 01:49:59PM +0100, Henrik Sarvell wrote:
>> It's a good question with a very simple answer, many many feeds out
>> there are completely broken, sometimes they don't conform to
>> standards, that's a good scenario but often they have unmatched tags
>> or unclosed attributes.
>
> Ouch. I see.
>
> So what do you think about the following:
>
> (while (from "<item>")
>    (println                               # Instead of printing
>       (make                               # do further matching
>          (loop
>             (NIL (chain (till ">")))      # Collect until next tag
>             (char)                        # Skip '>'
>             (T (tail '`(chop "item") @)) ) ) ) )  # See if we got </item>
>
> The 'make' will give you smaller chunks of data, which are easier to
> 'match'.
>
> Cheers,
> - Alex
Re: +Idx problems maybe?
I'll try with the one you suggested, thanks for the clarifications!

/Henrik

On Tue, Nov 3, 2009 at 8:38 AM, Alexander Burger wrote:
> Hi Henrik,
>
>> I took a look at the pilog file, I already get what same and range are
>> doing but what are part, head and fold doing?
>
> You are on the right track. You used 'tolr', but this actually makes
> sense only in combination with the '+Sn' (Soundex) prefix.
> [...]