RE: Collation API not complete for search

Shawn Steele Mon, 28 Mar 2011 14:31:47 -0700

My initial thinking was that “type” would indicate either a more restrictive 
or a fuzzier match. When known strings are sorted (like records from a 
database being sorted for display), a detailed sort is appropriate. However, 
when querying the database to see if there’s a record for “nebojsa”, it makes 
sense in many languages to also find “Nebojša”.

That’s not “string search”; it’s comparing a complete string within a database 
record against another complete string, and that’s what my original idea was.
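
For illustration, the whole-string case might look roughly like the sketch below. It assumes the `Intl.Collator` API that was standardized after this thread, with its `sensitivity` option standing in for the "type" discussed here; `findRecords` and the sample records are hypothetical. The locale `'en'` is used on purpose, since some locale tailorings may treat š as a separate base letter.

```javascript
// Assumption: Intl.Collator (ECMA-402) plays the role of the proposed collator.
// 'variant' distinguishes case and accents; 'base' ignores both.
const detailed = new Intl.Collator('en', { sensitivity: 'variant' });
const fuzzy = new Intl.Collator('en', { sensitivity: 'base' });

// Hypothetical helper: compares complete strings only, no substring search.
function findRecords(records, query, collator) {
  return records.filter((record) => collator.compare(record, query) === 0);
}

const records = ['Nebojša', 'Mark', 'Addison'];

// Fuzzy lookup treats "nebojsa" and "Nebojša" as the same record key.
const fuzzyHits = findRecords(records, 'nebojsa', fuzzy);

// Detailed lookup keeps them apart, which is what a display sort wants.
const exactHits = findRecords(records, 'nebojsa', detailed);

console.log(fuzzyHits, exactHits);
```

The same detailed collator could be passed to `Array.prototype.sort` for display ordering, while the fuzzy one is reserved for lookups.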

Searching within a string is a more complicated problem, but the fuzzy and 
exact ideas are likely still useful there.

- Shawn

From: Nebojša Ćirić [mailto:c...@google.com]
Sent: Monday, March 28, 2011 2:26 PM
To: Phillips, Addison
Cc: Mark Davis ☕; es-discuss@mozilla.org; Shawn Steele
Subject: Re: Collation API not complete for search

How do you solve search like this in offline mode?

- Text is in Serbian and contains the word "šnala"
- User searches for "snala" since he can't input š easily (Android phone or 
Nook or Kindle keyboards).

Without StringSearch you won't get a match...
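
To make the failure concrete, here is a small sketch of what happens today and of one crude workaround. The workaround (decompose, then strip combining marks) is an assumption of this example, not something proposed in the thread; it is not locale-aware, it misses letters such as đ that have no decomposition, and the indices it reports refer to the folded text rather than the original.

```javascript
// Hypothetical helper: crude diacritic folding via NFD + removal of combining marks.
function foldDiacritics(text) {
  return text.normalize('NFD').replace(/\p{M}/gu, '');
}

const text = 'Ovo je šnala.';
const query = 'snala';

// Plain code-unit search: no match, because "š" is not "s".
const exactIndex = text.indexOf(query);

// Folded search: both sides lose their diacritics before comparing.
const foldedIndex = foldDiacritics(text).indexOf(foldDiacritics(query));

console.log(exactIndex, foldedIndex);
```

A collation-based search would avoid the folding step and could report positions in the original text, which is the gap being discussed here.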

I do agree this may complicate things for now and if we decide to postpone it 
we should also remove collationType option from the collator since it's pretty 
useless on its own.

On March 28, 2011, at 14:04, Phillips, Addison <addi...@lab126.com> wrote:
This discussion has had me pretty confused. I never understood why you would 
*want* string search inside collator: the APIs and usage models are completely 
different. While there is some underlying relation, it’s just confusing to try 
to jam them into the same API.

StringSearch is modestly useful, but really I don’t see it as a particularly 
high priority for us.

Addison

From: Nebojša Ćirić [mailto:c...@google.com]
Sent: Monday, March 28, 2011 1:36 PM
To: Mark Davis ☕
Cc: es-discuss@mozilla.org; Shawn Steele; Phillips, Addison
Subject: Re: Collation API not complete for search

Shawn, would you be ok with adding this new API to the list for 0.5 so we can 
support collation search?

I'll edit the strawman in case nobody objects to this addition.
On March 25, 2011, at 16:34, Nebojša Ćirić <c...@google.com> wrote:
In that case I wouldn't put this new functionality in the Collator object. A 
new StringSearch or StringIterator object would make more sense:

// options:
//   collator - optional; default collator, collatorType=search
//   source   - required
//   pattern  - required
LocaleInfo.StringIterator = function(options) {};
// Find the first occurrence.
LocaleInfo.StringIterator.prototype.first = function() {};
// Get the next occurrence of pattern in source.
LocaleInfo.StringIterator.prototype.next = function() {};
// Length of the match.
LocaleInfo.StringIterator.prototype.matchLength = function() {};
// ... (reset, setPosition...)
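
A minimal runnable sketch of this shape is below, under several assumptions: `Intl.Collator` with `sensitivity: 'base'` stands in for collatorType=search, the class is free-standing rather than hung off `LocaleInfo`, and the matching is a naive quadratic scan that compares candidate substrings. It ignores the context-sensitivity concern raised elsewhere in this thread, so it illustrates the interface rather than a correct search algorithm.

```javascript
// Hypothetical sketch of the proposed iterator; not a shipped API.
class StringIterator {
  constructor({ collator, source, pattern }) {
    if (typeof source !== 'string' || typeof pattern !== 'string') {
      throw new TypeError('source and pattern are required strings');
    }
    // Assumption: base sensitivity approximates a "search" collator.
    this.collator = collator || new Intl.Collator(undefined, { sensitivity: 'base' });
    this.source = source;
    this.pattern = pattern;
    this.reset();
  }

  // Forget any previous match and start again from the beginning.
  reset() {
    this.position = 0;
    this.length = 0;
  }

  // Index of the first occurrence, or -1.
  first() {
    this.reset();
    return this.next();
  }

  // Index of the next occurrence at or after the current position, or -1.
  next() {
    for (let start = this.position; start < this.source.length; start++) {
      for (let end = start + 1; end <= this.source.length; end++) {
        const candidate = this.source.slice(start, end);
        if (this.collator.compare(candidate, this.pattern) === 0) {
          this.length = end - start; // may differ from pattern.length
          this.position = end;
          return start;
        }
      }
    }
    this.length = 0;
    this.position = this.source.length;
    return -1;
  }

  // Length of the most recent match, in code units of the source.
  matchLength() {
    return this.length;
  }
}

const iterator = new StringIterator({
  collator: new Intl.Collator('en', { sensitivity: 'base' }),
  source: 'šnala i snala',
  pattern: 'snala',
});
for (let at = iterator.first(); at !== -1; at = iterator.next()) {
  console.log(at, iterator.matchLength());
}
```

Keeping the match length on the iterator is what lets a caller handle cases where the matched text and the pattern differ in length.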
On March 25, 2011, at 15:14, Mark Davis ☕ <m...@macchiato.com> wrote:

I think an iterator is a cleaner interface; we were just trying to minimize new 
API.

In general, collation is context sensitive, so searching on substrings isn't a 
good idea. You want to search from a location, but have the rest of the text 
available to you.

For the iterator, you would need to be able to reset to a location, but the 
context beforehand could affect what happens.

Mark

— Il meglio è l’inimico del bene —

On Fri, Mar 25, 2011 at 14:22, Mike Samuel <mikesam...@gmail.com> wrote:
2011/3/25 Mike Samuel <mikesam...@gmail.com>:
> 2011/3/25 Nebojša Ćirić <c...@google.com>:
>> find method wouldn't return boolean but an array of two values:
>
> Sorry if I wasn't clear.  The !! at the beginning of the call to find
> is important.
> The undefined value you mentioned below as possible no match result is
> falsey because !!undefined === false.
>
>> myCollator.find('gaard', 'ard', 2) -> [2, 5]  // 4 or 5 as a bound
>> myCollator.find('ard', 'ard', 0) -> [0, 3]  // 2 or 3 as a bound
>> I guess [2, 5] !== [0, 3]
>
> True, but also [2, 5] !== [2, 5].
>
>> We could return [-1, undefined] for not found state, or just undefined.
>
>> I agree that returning a boolean makes for easier tests in loops.
>
>
>> On March 25, 2011, at 14:00, Mike Samuel <mikesam...@gmail.com> wrote:
>>>
>>> 2011/3/25 Nebojša Ćirić <c...@google.com>:
>>> > Looking through the notes from the meeting I also found some problems
>>> > with the collator. We did specify the collatorType: search, but we
>>> > didn't offer a function that would make use of it. Mark and I are
>>> > thinking about:
>>> > /**
>>> >  * string - string to search over.
>>> >  * substring - string to look for in "string"
>>> >  * index - start search from index
>>> >  * @return {Array} [first, last] - first is index of the match or -1,
>>> >  *     last is end of the match or undefined.
>>> >  */
>>> > LocaleInfo.Collator.prototype.find(string, substring, index)
>>> > We could also opt for iterator solution where we keep the state.
>>>
>>> Assuming find returns a falsey value when nothing is found, is it the
>>> case that for all (string, index) pairs,
>>>
>>> !!myCollator.find(string, substring, index) ===
>>> !!myCollator.find(string.substring(index), substring, 0)
Maybe a better way to phrase this relation is:

Will any collator ever look at a code unit to the left of index when
trying to determine whether there is a match at or after index?

E.g., it might if the code unit at index could be a strict suffix of a
substring that can be represented as a single-code-point ligature.
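
One way to see why the left context can matter is the sketch below. It uses a combining mark rather than a ligature, which is a different but analogous case, and it assumes `Intl.Collator` with base sensitivity as the search collator. Searching only the tail of the string reports a match that begins in the middle of a letter.

```javascript
// Assumption: base sensitivity stands in for a "search" collator.
const baseCollator = new Intl.Collator('en', { sensitivity: 'base' });

// "šnala" in decomposed form: 's' followed by U+030C COMBINING CARON.
const decomposed = 's\u030Cnala';
const index = 1; // points at the combining caron, inside the letter š

// Searching the tail alone: the orphaned caron is ignored at base strength,
// so the tail compares equal to "nala" and a match would be reported at index.
const tail = decomposed.substring(index);
const tailMatches = baseCollator.compare(tail, 'nala') === 0;

// With the full text available, an implementation can see that index falls
// on a combining mark, i.e. the match would split a character.
const startsOnMark = /^\p{M}/u.test(decomposed.slice(index));

console.log(tailMatches, startsOnMark);
```

Whether such a position should count as a match is exactly the kind of decision that needs the text before index, which substring(index) throws away.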

>>> This would be false if the substring 'ard' should be found in 'gard',
>>> but not 'gaard' because then
>>>
>>>     !!myCollator.find('gaard', 'ard', 2) !==
>>>         !!myCollator.find('ard', 'ard', 0)
>>>
>>>
>>> If that relation does not hold, then exposing find as an iterator
>>> might help prevent a profusion of subtly wrong loops.
>>>
>>>
>>> > The reason we need to return both begin and end part of the found
>>> > string is: Look for gaard and we find gård - which may be equivalent
>>> > in Danish, but substring lengths don't match (5 vs. 4) so we need to
>>> > tell user the next index position.
>>> > The other problem Jungshik found is that there is a combinatorial
>>> > explosion with all ignoreXXX options we defined. My proposal is to
>>> > define only N that make sense (and can be supported by all
>>> > implementors) and fall back the rest to some predefined default.
>>>
>>>
>>>
>>> > --
>>> > Nebojša Ćirić
>>> >
>>> > _______________________________________________
>>> > es-discuss mailing list
>>> > es-discuss@mozilla.org
>>> > https://mail.mozilla.org/listinfo/es-discuss
>>> >
>>> >
--
Nebojša Ćirić
