Re: +Idx problems maybe?

2009-11-03 Thread Alexander Burger
Hi Henrik,

> I tested the
> 
>  (rel key (+Fold +Ref +String))
>  ...
>  (fold @Str @Cls key)
> 
> version and rebuilt the index but I still can't get the search to work
> in a case-insensitive fashion. Did I miss something?

Just to be sure: You re-imported the data, or at least re-built the
index, didn't you?


BTW, with the above pattern, you get 'fold'ed comparisons, which imply
case-insensitiveness. But before you said you also wanted substring
indexing. For that, you might take

  (rel key (+Fold +Idx +String))
  ...
  (part @Str @Cls key)

though at the cost of a considerably larger index.
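
To illustrate why 'fold'ed keys compare case-insensitively, here is a
minimal sketch of what the built-in 'fold' function does to a string.
The sample strings are made up for illustration, and the remark below
about the search side reflects my understanding rather than a statement
from this thread.

```picolisp
# 'fold' keeps only letters (converted to lower case) and digits
(println (fold "Henrik Sarvell"))                   # "henriksarvell"
(println (fold " 1A 2-b/3"))                        # "1a2b3"
(println (= (fold "PicoLisp") (fold "picolisp")))   # T
```

As far as I understand, both the stored key and the search string pass
through this folding, so differences in case, blanks and punctuation no
longer matter for the comparison.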

Cheers,
- Alex
-- 
UNSUBSCRIBE: mailto:picol...@software-lab.de?subject=unsubscribe


Re: +Idx problems maybe?

2009-11-03 Thread Henrik Sarvell
I tested the

 (rel key (+Fold +Ref +String))
 ...
 (fold @Str @Cls key)

version and rebuilt the index but I still can't get the search to work
in a case-insensitive fashion. Did I miss something?




Re: 64bit segmentation fault when matching on long lists

2009-11-03 Thread Henrik Sarvell
Great, thanks! I'll try it out tonight.




Re: 64bit segmentation fault when matching on long lists

2009-11-03 Thread Alexander Burger
On Tue, Nov 03, 2009 at 07:27:47PM +0100, Henrik Sarvell wrote:
> What we need is a test for  discarding/ignoring the first ">" when we do subsequent tills. That's
> what I tried to do with an initial (till ">") but it didn't work, like
> this:
> 
> (in "rss.xml"
>   (while
>  (from "  (till ">")

That's easy. Try two 'from's in succession:

(in "rss.xml"
   (while
      (from "
      (from ">")

This works both for

   Content

and

   Content


'from' is the main working tool. As you know, you can also pass several
patterns to 'from' (implicit OR, and the return value can be checked in
a 'case' statement), so this is more flexible.
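
As a rough, self-contained sketch of both ideas (two 'from's in
succession, and several patterns checked with 'case'): the sample feed
text and the tag names "<item" and "<entry" are assumptions made up for
this illustration, and 'pipe' merely stands in for reading "rss.xml".

```picolisp
# Hypothetical sample feed: one plain tag, one tag carrying an attribute
(pipe
   (prin "<rss><item>One</item><item id='2'>Two</item></rss>")
   (while (from "<item")            # Find the start of the next opening tag
      (from ">")                    # Skip any attributes up to the closing '>'
      (println (till "<" T)) ) )    # Read the content as a single string

# Several patterns: 'from' returns the pattern that matched
(pipe
   (prin "<feed><entry>A</entry><item>B</item></feed>")
   (use Tag
      (while (setq Tag (from "<item" "<entry"))
         (from ">")
         (case Tag
            ("<item" (prinl "item: " (till "<" T)))
            ("<entry" (prinl "entry: " (till "<" T))) ) ) ) )
```

The second 'from' simply consumes whatever lies between the tag name and
the next ">", so it does not matter whether attributes are present.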

Cheers,
- Alex


Re: 64bit segmentation fault when matching on long lists

2009-11-03 Thread Alexander Burger
On Tue, Nov 03, 2009 at 05:58:20PM +0100, Henrik Sarvell wrote:
> Am I missing something? Won't the first (from) basically find the
> first instance of "

Yes.

> match here am I not matching on the whole of the rest of the document?

No, I was talking about the return value of 'make'

(while (from "")
   (println                                # Instead of printing
      (make                                # do further matching
         (loop
            (NIL (chain (till ">")))       # Collect until next tag
            (char)                         # Skip '>'
            (T (tail '`(chop "item") @)) ) ) ) )  # See if we got

'make' returns everything that was collected by 'till' and 'chain' in a
list.

The tail of that list will be ("i" "t" "e" "m") if 'loop' was terminated
by the 'T' clause, or something unexpected when end of file was hit (the
'NIL' clause).
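
To make the behaviour of that loop concrete, here is a small sketch that
runs it on made-up input. The sample text and the pattern "<item>" are
assumptions for illustration; 'pipe' stands in for the input file and
'pack' only turns the collected character list into a readable string.

```picolisp
# Hypothetical sample input with two items
(pipe
   (prin "<item><title>One</title></item><item><title>Two</title></item>")
   (while (from "<item>")
      (println
         (pack
            (make
               (loop
                  (NIL (chain (till ">")))    # Collect up to the next '>'
                  (char)                      # Skip '>'
                  (T (tail '`(chop "item") @)) ) ) ) ) ) )  # Stop after "</item"
```

Note that the collected text contains no ">" characters, because 'till'
stops in front of each one and 'char' then discards it.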

So how about such a structure:

(while (from "")
   (let Lst
      (make
         (loop
            (NIL (chain (till ">")))
            (char)
            (T (tail '`(chop "item") @)) ) )
      (cond
         ((match ... Lst)
            ... )
         ((match ... Lst)
            ...)


You could also immediately check for the trailing ("i" "t" "e" "m") and
discard results which do not match it:

(use @X
   (while (from "")
      (when
         (match '(@X "i" "t" "e" "m")
            (make
               (loop
                  (NIL (chain (till ">")))
                  (char)
                  (T (tail '`(chop "item") @)) ) ) )
         (got something in @X without trailing "item")
         ... ) ) )
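
The 'match' step on its own can be tried on a plain character list. This
is only a sketch: the sample strings are made up, and they mimic what the
loop would collect (without the ">" characters).

```picolisp
(use @X
   # Ends in "item": @X is bound to everything in front of it
   (when (match '(@X "i" "t" "e" "m") (chop "<titleOne</title</item"))
      (println (pack @X)) )          # "<titleOne</title</"
   # Does not end in "item" (e.g. end of file was hit): no match
   (println
      (match '(@X "i" "t" "e" "m") (chop "<titleBroken")) ) )   # NIL
```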

Again, just ideas, not tested ;-)

Cheers,
- Alex


Re: 64bit segmentation fault when matching on long lists

2009-11-03 Thread Henrik Sarvell
But the problem is that we can't use  in the (from) since some
feeds will contain  only, what do we do about that?

In most cases it will still be  and then the subsequent test for
item (in the terminating tag) will of course return true and then we
get nothing.

I mean I don't understand how the above code would work with something
looking like this:

Content1
Content2

While still being able to handle

Content1
Content2

What we need is a test for " when we do subsequent tills. That's
what I tried to do with an initial (till ">") but it didn't work, like
this:

(in "rss.xml"
   (while
      (from "
      (till ">")
      (println
         (make
            (loop
               (NIL (chain (till ">")))
               (char)
               (T (tail '`(chop "item") @)) ) ) ) ))

/Henrik




Re: 64bit segmentation fault when matching on long lists

2009-11-03 Thread Henrik Sarvell
Am I missing something? Won't the first (from) basically find the
first instance of "]*>")
      (println
         (make
            (loop
               (NIL (chain (till ">")))
               (char)
               (T (tail '`(chop "item") @)) ) ) ) ))

Maybe this is clearer, so obviously we can't use the above but what to
do in order to replace it with something that is equivalent and legal?

/Henrik



Re: 64bit segmentation fault when matching on long lists

2009-11-03 Thread Alexander Burger
On Tue, Nov 03, 2009 at 04:24:55PM +0100, Henrik Sarvell wrote:
> (in "rss.xml"
>    (while
>       (from "
>       (println
>          (make
>             (loop
>                (NIL (chain (till ">")))
>                (char)
>                (T (tail '`(chop "item") @)) ) ) ) ))
> 
> This will accurately capture the  tag all the time I think but
> then we need some way of discarding the attributes and the closing >.

I think that from this point on 'match' is the easiest and most general.
What 'make' returns is not so big any more, and has also perhaps more
predictable patterns.

Cheers,
- Alex


Re: 64bit segmentation fault when matching on long lists

2009-11-03 Thread Henrik Sarvell
I started with this approach yesterday, first in order to capture feed
type which I am now able to do.

I noticed that some rss feeds have attributes in their  tags,
therefore the above won't work 100% of the time.

(in "rss.xml"
   (while
      (from "
      (println
         (make
            (loop
               (NIL (chain (till ">")))
               (char)
               (T (tail '`(chop "item") @)) ) ) ) ))

This will accurately capture the  tag all the time I think but
then we need some way of discarding the attributes and the closing >.
I tried with an immediate (till ">") after the (from) but it didn't
have the intended result, any suggestions here?

/Henrik


On Sun, Nov 1, 2009 at 6:26 PM, Alexander Burger wrote:
> On Sun, Nov 01, 2009 at 01:49:59PM +0100, Henrik Sarvell wrote:
>> It's a good question with a very simple answer, many many feeds out
>> there are completely broken, sometimes they don't conform to
>> standards, that's a good scenario but often they have unmatched tags
>> or unclosed attributes.
>
> Ouch. I see.
>
> So what do you think about the following:
>
> (while (from "")
>    (println                                # Instead of printing
>       (make                                # do further matching
>          (loop
>             (NIL (chain (till ">")))       # Collect until next tag
>             (char)                         # Skip '>'
>             (T (tail '`(chop "item") @)) ) ) ) )  # See if we got
>
> The 'make' will give you smaller chunks of data, which are easier to
> 'match'.
>
> Cheers,
> - Alex


Re: +Idx problems maybe?

2009-11-03 Thread Henrik Sarvell
I'll try with the one you suggested, thanks for the clarifications!

/Henrik

On Tue, Nov 3, 2009 at 8:38 AM, Alexander Burger wrote:
> Hi Henrik,
>
>> I took a look at the pilog file, I already get what same and range are
>> doing but what are part, head and fold doing?
>
> You are on the right track. You used 'tolr', but this actually makes
> sense only in combination with the '+Sn' (Soundex) prefix. The whole
> matter is rather complicated, because there are so many combinations of
> index types and Pilog comparison functions possible.
>
>
> I would say that we have the following typical use cases for string
> searches (I'll leave out numerical searches, which usually combine with
> 'same' or 'range').
>
> 1. "Exact" searches. You have either a unique index
>
>      (rel key (+Key +String))
>
>   or a non-unique index
>
>      (rel key (+Ref +String))
>
>   and you can compare results in Pilog with
>
>      (same @Str @Cls key)
>
>   for exact matches, or with
>
>      (head @Str @Cls key)
>
>   for "dictionary" searches (searching only for the beginning of
>   strings). These are case-sensitive searches.
>
>
> 2. "Folded" searches. They make use of the 'fold' function which keeps
>   only letters, converted to lower case, and digits.
>
>      (rel key (+Fold +Ref +String))
>      ...
>      (fold @Str @Cls key)
>
>   This searches only for the beginning of strings. We use it typically
>   for telephone numbers.
>
>
>   If a search for individual words in a key is desired, we can use
>
>      (rel key (+List +Fold +Ref +String))
>      ...
>      (fold @Str @Cls key)
>
>   This stores only the strings in the list (not the substrings) in
>   'fold'ed representation. So each word can be found by "dictionary"
>   search. This requires changes to the GUI and import functions,
>   though, as 'key' is not a string but a list of strings.
>
>
>   Finally, we can also index folded substrings:
>
>      (rel key (+Fold +Idx +String))
>      ...
>      (part @Str @Cls key)
>
>   This is perhaps what you need. If you go for it, I'd recommend you
>   download once more the latest testing release, as the 'part' function
>   was changed recently.
>
>
> 3. "Tolerant" searches. They return first all exact (case-sensitive)
>   matches of partial strings, and then the matches according to the
>   soundex algorithm (the first letter is compared exactly
>   (case-sensitive), the rest checks for similarity). This makes mainly
>   sense for personal names.
>
>      (rel key (+Sn +Idx +String))
>      ...
>      (tolr @Str @Cls key)
>
>
> Concerning space consumption, the '+Key' and '+Ref' indexes are the most
> economical ones. They create only a single entry in the index tree per
> key.
>
> Then follow the '+List +Ref +String' indexes, which create an entry per
> word.
>
> Most space-hungry are the '+Idx' indexes, as they create an entry for
> each substring down to a length of three, and '+Sn' adds one more for
> the soundex key.
>
> Cheers,
> - Alex