Re: Words Indexing strategies

2010-02-22 Thread Alejandro Tejada
Some time ago, I posted a message asking for
volunteers to create a Wikipedia CD/DVD.

Since then, I have been working on this project
and have made some progress, which I will
publish as soon as it works as expected.

These are the steps that I am following to
process the XML databases:

The Wikipedia XML databases are huge, and you
get the best results when you download them
directly from:

http://download.wikipedia.org/

The download address for the English Wikipedia
XML database from February 3, 2010 is:
http://download.wikimedia.org/enwiki/20100130/enwiki-20100130-pages-articles.xml.bz2

The download address for the Spanish Wikipedia
XML database from February 21, 2010 is:
http://download.wikimedia.org/eswiki/20100221/eswiki-20100221-pages-meta-current.xml.bz2

After downloading the compressed XML database, you should
put it inside a folder (not in the disk root) and split
the file into small bz2 files using bzip2recover:
http://www.bzip.org/downloads.html
http://www.bzip.org/1.0.5/bzip2recover-105-x86-win32.exe

It is easier to deal with many small compressed files than with
one humongous text file of more than 25 GB (English XML database)
or 5.3 GB (Spanish XML database).

After using bzip2recover to split the English XML database,
I get more than 28,000 small (~250 KB) bz2 files, or
6,800 small bz2 files for the Spanish XML database.
Each of these files holds (more or less) a 1 MB segment
of the database.
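
For reference, here is a rough sketch of how that splitting step could
be driven from Rev (untested; it assumes bzip2recover is installed and
on the PATH, and the handler name is just a placeholder):

function splitDump pDumpFile
  -- bzip2recover writes rec00001<name>.bz2, rec00002<name>.bz2, ...
  -- next to the original file, one recovered file per compressed block
  get shell("bzip2recover" && quote & pDumpFile & quote)
  -- collect the recovered segment names so they can be numbered later
  set the itemDelimiter to slash
  set the defaultFolder to item 1 to -2 of pDumpFile
  put the files into tSegments
  filter tSegments with "rec*.bz2"
  sort lines of tSegments
  return tSegments
end splitDump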

Notice that I chose a different file for the Spanish XML database
than for the English one. That is because Wikipedia has been
unable to solve a problem with their backup of the Spanish
XML database:

https://bugzilla.wikimedia.org/show_bug.cgi?id=18694

Alejandro


Re: Words Indexing strategies

2010-02-13 Thread Alejandro Tejada

Hi Bernard,


Bernard Devlin-2 wrote:
 
 [snip]
 Am I right that given these search terms: baboon OR monkey AND fruit
 and
 index file b.tgz contains a line like this: baboon: 1,5,9
 index file m.tgz contains a line like this: monkey: 2,7,17
 index file f.tgz contains a line like this: fruit: 3,7,23
 you would want the result of your search to be: 7 i.e. the number of
 the article that matches the boolean search?  Unless I've
 misunderstood, what you want to do is combine indexes in order to
 satisfy boolean combinations of search terms.
 

Yes, this is correct.
With larger sets of data, it should be easier to convert
these results into arrays and merge them, following
the explanation provided by Ken Ray on his website:
http://www.sonsothunder.com/devres/revolution/tips/arry002.htm
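
For example, a rough sketch of how that merge could look in Rev
(untested; the function names are just placeholders, and the index
lines are assumed to use the "word: id,id,id" format quoted above):

function idsToSet pIndexLine
  -- turn "baboon: 1,5,9" into an array whose keys are the article IDs
  put empty into tSet
  put char (offset(":", pIndexLine) + 1) to -1 of pIndexLine into tIDs
  replace space with empty in tIDs
  split tIDs by comma
  repeat for each element tID in tIDs
    put true into tSet[tID]
  end repeat
  return tSet
end idsToSet

function setAND pSetA, pSetB
  put empty into tResult
  repeat for each key tID in pSetA
    if tID is among the keys of pSetB then put true into tResult[tID]
  end repeat
  return tResult
end setAND

function setOR pSetA, pSetB
  put pSetA into tResult
  repeat for each key tID in pSetB
    put true into tResult[tID]
  end repeat
  return tResult
end setOR

With the three index lines quoted above, "baboon OR monkey AND fruit"
would become:

  put idsToSet("baboon: 1,5,9") into tBaboon
  put idsToSet("monkey: 2,7,17") into tMonkey
  put idsToSet("fruit: 3,7,23") into tFruit
  put setOR(tBaboon, tMonkey) into tEither
  put setAND(tEither, tFruit) into tBoth
  put the keys of tBoth into tMatches   -- 7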

Now, you have made a valid point with the following comment:


Bernard Devlin-2 wrote:
 
 However, it looks to me like the existing indexes don't contain enough
 information for you to calculate frequency of occurrence (a measure of
 relevance).  And depending on how these pre-existing indexes have been
 constructed they may not have any stemming information in them.  You
 might be able to build some kind of rough stemming algorithm in Rev
 (by doing rough pluralization like 'baboon*', but as Richard pointed
 out more complex plurals like 'children' will be where the work
 comes).
 
 Are you looking for an approximate solution?  Or do you need greater
 flexibility of scope and relevance scores, etc. ?
 

To start this project, an approximate solution would be fine, and
eventually I will keep working to refine this search algorithm.

After taking a close look at the 580,000-article index (and the
450,000-redirection index), I understand that effectively employing
a stemming algorithm could save many megabytes.

You are right about the lack of information in the current simple
index format. Relevance should be a function of the number of times
a word appears in an article. An index that includes this information
could look like this:

monkey:3827#12|15,1#4|3,2131#18|3,34#3|2,4567#2|2,3456#22|1

In this new example, compressed data file number 3827, article #12,
contains 15 instances of the word "monkey"...

Surely there is some better notation to handle this data;
feel free to send your ideas and comments.
Again, converting this data into an array makes it easier to work with.
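
For example, a rough sketch of parsing that notation into a nested
array (untested; it assumes Rev 3.x nested arrays, and the function
name is just a placeholder):

function parseFrequencyLine pLine
  -- pLine looks like: monkey:3827#12|15,1#4|3,...
  put char (offset(":", pLine) + 1) to -1 of pLine into tEntries
  repeat for each item tEntry in tEntries
    -- each entry is: <data file> # <article> | <word count>
    put char 1 to (offset("#", tEntry) - 1) of tEntry into tFile
    put char (offset("#", tEntry) + 1) to (offset("|", tEntry) - 1) \
        of tEntry into tArticle
    put char (offset("|", tEntry) + 1) to -1 of tEntry into tCount
    put tCount into tFreq[tFile][tArticle]
  end repeat
  return tFreq
end parseFrequencyLine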


Richard Gaskin wrote:
 
 Once again, MetaCard to the rescue! :)
 
 Raney included this little gem in MC's Examples stack, and using repeat
 for each and arrays it's blazing fast, able to make a frequency table
 for even large files in almost no time at all:
 

Yes, this works great! Many thanks for digging up this handler. :-)

Have a nice weekend!

Alejandro


Re: Words Indexing strategies

2010-02-12 Thread Bernard Devlin
On Fri, Feb 12, 2010 at 3:24 AM, Alejandro Tejada
capellan2...@gmail.com wrote:
 I have a dll named: dbsqlite.dll (452 K) in my Rev Studio instalation.
 If an experienced database developer could lend a hand, i would be
 really grateful.

Hi Alejandro

OK, since you don't know anything about databases, I'm assuming that
it is also going to be quite a lot of work to compile the sqlite
full-text plugin for the different platforms too. (I've no idea how
hard that would be or what problems you might run into - but in
general these things are always harder than they are supposed to be.)
So, let's consider a Rev-only solution, and if that looks like a
hopeless case then it will make the work required to deal with
databases and compilation more worthwhile.

As I'm slow on the uptake, I am still not entirely sure what I think
it is you are trying to do.

Am I right that given these search terms: baboon OR monkey AND fruit

and

index file b.tgz contains a line like this: baboon: 1,5,9
index file m.tgz contains a line like this: monkey: 2,7,17
index file f.tgz contains a line like this: fruit: 3,7,23

you would want the result of your search to be 7, i.e. the number of
the article that matches the boolean search?  Unless I've
misunderstood, what you want to do is combine indexes in order to
satisfy boolean combinations of search terms.

However, it looks to me like the existing indexes don't contain enough
information for you to calculate frequency of occurrence (a measure of
relevance).  And depending on how these pre-existing indexes have been
constructed they may not have any stemming information in them.  You
might be able to build some kind of rough stemming algorithm in Rev
(by doing rough pluralization like 'baboon*', but as Richard pointed
out, more complex plurals like 'children' will be where the work
comes in).

Are you looking for an approximate solution?  Or do you need greater
flexibility of scope and relevance scores, etc. ?

Bernard


Re: Words Indexing strategies

2010-02-12 Thread Richard Gaskin

Bernard Devlin wrote:


However, it looks to me like the existing indexes don't contain enough
information for you to calculate frequency of occurrence (a measure of
relevance).


Once again, MetaCard to the rescue! :)

Raney included this little gem in MC's Examples stack, and using repeat 
for each and arrays it's blazing fast, able to make a frequency table 
for even large files in almost no time at all:


on mouseUp
  put empty into field "result"
  answer file "Select a text file for input:"
  if it is empty then exit mouseUp
  # let user know we're working on it
  set the cursor to watch
  put it into inputFile
  open file inputFile for read
  read from file inputFile until eof
  put it into fileContent
  close file inputFile
  # wordCount is an associative array; its indexes are words,
  # with the contents of each element being the number of times
  # that word appears
  repeat for each word w in fileContent
    add 1 to wordCount[w]
  end repeat
  # copy all the indexes that are in the wordCount associative array
  put keys(wordCount) into keyWords
  # sort the indexes -- keyWords contains a list of the elements in the array
  sort keyWords
  repeat for each line l in keyWords
    put l & tab & wordCount[l] & return after displayResult
  end repeat
  put displayResult into field "result"
end mouseUp


--
 Richard Gaskin
 Fourth World
 Rev training and consulting: http://www.fourthworld.com
 Webzine for Rev developers: http://www.revjournal.com
 revJournal blog: http://revjournal.com/blog.irv



Re: Words Indexing strategies

2010-02-11 Thread Bernard Devlin
On Wed, Feb 10, 2010 at 10:30 PM, Alejandro Tejada
capellan2...@gmail.com wrote:
 Yes, each one of these 28 text files will be compressed
 in gz format. When users look for a word, or many words,
 only these file(s) are decompressed and searched.

Like Brian, I was going to suggest existing search technologies like
Lucene.  Why re-invent the wheel?  I understand you not wanting to
ship Java and get the user to install it.  However there may be other
pre-existing solutions to your problem.

Is it imperative to your solution that these 28 compressed text
files are part of it?  I mean, are you trying to maintain
the structure of the solution such that someone who comes along and
looks at it can see where Rev fits in?  Or can your indexes
be stored in a database?

The reason I say this is because a) Valentina already has two forms of
text searches (one form is very fast but only looks up single words,
the other form can search an entire database using regex but is
slower, and probably not fast enough for your requirements).  Unless
you already have Valentina for each platform, this solution will
involve you in the cost of buying licenses.

The other thing to consider is that sqlite already has a full-text
search facility (although I think you may have to compile it as a
sqlite plug-in and distribute it with your application).   It does
things like word-stemming, stop lists, frequencies, etc.  You would
have to distribute this sqlite add-on with your solution.

http://ft3.sourceforge.net/v0.3/userguide.html
http://michaeltrier.com/2008/7/13/full-text-search-on-sqlite
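
For what it's worth, here is a rough sketch of what using that facility
from Rev might look like (untested; it assumes the SQLite library that
ships with Rev was built with the FTS3 extension, which, as noted
above, it may not be, and the table and column names are placeholders):

on testFullTextSearch
  -- open (or create) a local SQLite database file
  put revOpenDatabase("sqlite", "wikipedia.sqlite", empty, empty, empty) into tConn
  -- a virtual table whose contents are indexed by FTS3
  revExecuteSQL tConn, "CREATE VIRTUAL TABLE articles USING fts3(title, body)"
  -- FTS3 understands AND / OR / NOT inside MATCH queries
  put revDataFromQuery(tab, return, tConn, \
      "SELECT title FROM articles WHERE body MATCH 'monkey AND fruit'") into tHits
  revCloseDatabase tConn
end testFullTextSearch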

I suspect that the sqlite option may be the best.  I can't testify to
the speed of its searching, but I seem to remember that the indexing
is asynchronous (not that that would be a consideration in your
application).

Using a pre-existing solution (even one that requires some
modification on your part) has got to be easier than building your own
full-text search system.  The above suggestions could still provide
you with a cross-platform solution.

If you still want a Rev-only solution let us know.  Maybe someone else
will chip in with suggestions :-)

Hope that helps.

Bernard


Re: Words Indexing strategies

2010-02-11 Thread Alejandro Tejada
Hi Bernard,

on Thu, 11 Feb 2010 09:13:46 +
Bernard Devlin wrote:

 Like Brian, I was going to suggest existing search technologies like
 Lucene.  Why re-invent the wheel?  I understand you not wanting to
 ship Java and get the user to install it.  However there may be other
 pre-existing solutions to your problem.

Hopefully.

 Is it imperative to your solution that these 28 text compressed text
 files are part of the solution?  I mean, are you trying to maintain
 the structure of the solution such that someone who comes along and
 looks at your solution can see where Rev fits in.  Or can your indexes
 be stored in a database?

Sure, these indexes could be stored in a database.

 The reason I say this is because a) Valentina already has two forms of
 text searches (one form is very fast but only looks up single words,
 the other form can search an entire database using regex but is
 slower, and probably not fast enough for your requirements).  Unless
 you already have Valentina for each platform, this solution will
 involve you in the cost of buying licenses.
 The other thing to consider is that sqlite already has a full-text
 search facility (although I think you may have to compile it as a
 sqlite plug-in and distribute it with your application).   It does
 things like word-stemming, stop lists, frequencies, etc.  You would
 have to distribute this sqlite add-on with your solution.

I have a DLL named dbsqlite.dll (452 KB) in my Rev Studio installation.
If an experienced database developer could lend a hand, I would be
really grateful.

 If you still want a Rev-only solution let us know.  Maybe someone else
 will chip in with suggestions :-)

Yes, a Rev-only solution would be ideal.
I remember that Rob Cozens wrote a Rev-only database,
but my inexperience with databases keeps me from
fully appreciating that work.

Thanks a lot for your advice! :-)

Alejandro


Re: Words Indexing strategies

2010-02-10 Thread Alejandro Tejada
Hi Bernard,

on Wed, 10 Feb 2010 07:20:36 +
Bernard Devlin wrote:
 Can I just clarify your problem?  You want to be able to search for
 phrases (partial sentences, possibly with boolean logic) inside the
 text stored in the xml nodes of the article, once the article is found
 in the index?

No, it's not a search inside the displayed article.
It's a global search, within a general index created
using all words from all articles of Wikipedia.
(I do not believe that it's necessary to load this full
index into memory; instead, just open specific parts
of this index when users start searching.)

For this reason, I am looking for advice on creating an
index structure that allows me to implement a fast search
algorithm using multiple words (and boolean logic, if
possible), similar to Wikipedia's own search engine or
(better yet) just like Google. :-)

Many thanks for your interest in this question!

Alejandro


Re: Words Indexing strategies

2010-02-10 Thread Bernard Devlin
On Wed, Feb 10, 2010 at 2:56 PM, Alejandro Tejada
capellan2...@gmail.com wrote:
 No, it's not a search inside the displayed article.
 It's a global search, within a general index created
 using all words from all articles of Wikipedia.
 (I do not believe that it's necessary to load this full
 index in memory, instead just open specific parts
 of this index when users start searching)

OK, so that's why you mention the different files for each letter of
the alphabet.

I'm still a bit confused.  Normally an index would indicate a location
for an indexed term.  That's what I assume your general index files are
doing.  What are the key terms like in this index, and what do they
point to?  Can you give us some examples?

 For this reason, i am looking for advice to create an
 index structure that allows to implement a fast search
 algorithm, using multiple words (and boolean logic, if
 possible), similar to Wikipedia's own search engine or
 (better yet) just like google. :-)

To my confused and befuddled mind, it sounds like you are wanting to
create an index of the index.  That can't be right :-)

I'm no expert in search algorithms.  I have been hoping someone else
would jump in who has done this kind of thing before.

Are you wanting a pure, Rev-only solution, i.e. are you doing this to
demonstrate what can be done using Rev alone?

Bernard


Re: Words Indexing strategies

2010-02-10 Thread Brian Yennie
Alejandro,

The first step for this would likely include creating an inverted index. This 
means you store something like:

monkey:1,34,3827,21314

Where the word being indexed is "monkey" and the numbers that follow are 
article IDs. Using this information it is pretty trivial to implement AND / OR: 
just merge the article IDs as needed. You can use a mix of this inverted index 
and your regular index in combination to do other types of queries. You can 
even store word frequency information to find similar articles.
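
For illustration, a rough sketch of building such an inverted index in
Rev (untested; it assumes Rev 3.x nested arrays and that pArticles is
an array keyed by article ID holding each article's text; the function
name is just a placeholder):

function buildInvertedIndex pArticles
  -- collect, for each word, the set of article IDs it appears in
  repeat for each key tID in pArticles
    repeat for each word tWord in pArticles[tID]
      put true into tPostings[toLower(tWord)][tID]
    end repeat
  end repeat
  -- flatten to the "word:id,id,id" text format shown above
  repeat for each key tWord in tPostings
    put the keys of tPostings[tWord] into tIDs
    sort lines of tIDs numeric
    replace return with comma in tIDs
    put tWord & ":" & tIDs & return after tIndex
  end repeat
  return tIndex
end buildInvertedIndex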

With that said, I think you will have a very hard time crafting a competitive 
algorithm for this in Rev. A lot of the engines out there are very mature and 
there are many open source ones that you might consider bundling with your 
project instead of writing it yourself. Some things to watch out for:

1) File size. Indexing every word takes a lot of space, especially if your 
indices are in plain text and not a compact binary format.
2) Common words. You might consider a stop word list, or a threshold. For 
example, if a word is in > 20% of all articles, don't index it. Or if it's in 
your stop word list.
3) Root words / plurals. Can you detect that "monkeys" is the plural of 
"monkey" (or more complex cases)?

An example of an open source engine that is very mature is Lucene, which can be 
run from the command-line in Java.

 Hi Bernard,
 
 on Wed, 10 Feb 2010 07:20:36 +
 Bernard Devlin wrote:
 Can I just clarify your problem?  You want to be able to search for
 phrases (partial sentences, possibly with boolean logic) inside the
 text stored in the xml nodes of the article, once the article is found
 in the index?
 
 No, it's not a search inside the displayed article.
 It's a global search, within a general index created
 using all words from all articles of Wikipedia.
 (I do not believe that it's necessary to load this full
 index in memory, instead just open specific parts
 of this index when users start searching)
 
 For this reason, i am looking for advice to create an
 index structure that allows to implement a fast search
 algorithm, using multiple words (and boolean logic, if
 possible), similar to Wikipedia's own search engine or
 (better yet) just like google. :-)
 
 Many thanks for your interest in this question!
 
 Alejandro


Re: Words Indexing strategies

2010-02-10 Thread Richard Gaskin

The ambitious Alejandro Tejada wrote:


It's a global search, within a general index created
using all words from all articles of Wikipedia.
(I do not believe that it's necessary to load this full
index in memory, instead just open specific parts
of this index when users start searching)

For this reason, i am looking for advice to create an
index structure that allows to implement a fast search
algorithm, using multiple words (and boolean logic, if
possible), similar to Wikipedia's own search engine or
(better yet) just like google. :-)


A good place to start on that is the seminal paper describing the 
initial Google implementation, written by the founders:


The Anatomy of a Large-Scale Hypertextual Web Search Engine
Sergey Brin and Lawrence Page
http://infolab.stanford.edu/~backrub/google.html

But be warned:  indexing is a deep topic, and may become a consuming 
passion.  Roulette is rumored to have been invented by a monk who came 
to believe he could find a way to predict its outcomes, and eventually 
went mad trying.  Indexing is a bit like that. :)


A couple of the longer-term projects I work on need to incorporate good 
indexing of large corpuses, and my own code to that end has advanced 
only in small baby steps as I learn more about it.


So while I have little in the way of applied code to share at the 
moment, I can offer a few theoretical pointers:


From what I'm reading, my first advice would be to not bother unless 
you absolutely need to.  If there's any way you can use an existing 
index you'll be a happier man to do so.


But if you have to make your own index, you may find it helpful (or 
maddening) to consider the challenges involved with the various tenses 
and inflections of words, and how to determine the root word in its 
native form (the lemma) for your lookups.


Currently I'm looking into the various lemmatization schemes available, 
since they can help tremendously both to keep the index small enough to 
be practical and to return more relevant search results.


Lemmatization attempts this through linguistic rules; stemming could be 
said to be a form of cheating by using simpler algos less dependent on 
the nuances of a given language to attempt the same result.  The 
differences are explained well here:


Stemming and lemmatization
http://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html

Links to specific stemming algos are at the bottom of this article:
http://www.comp.lancs.ac.uk/computing/research/stemming/general/

Without lemmatization or stemming, searches for "children" will fail to 
find "child", which could well be relevant to the searcher.  And some 
stemming methods won't be able to transform "children" to "child" since 
it's an uncommon transformation linguistically, much less so than more 
common plural forms like just adding "s" or "es" to the end.
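
For illustration, a rough sketch of the "cheating" end of that spectrum
in Rev (untested; the function name is just a placeholder):

function roughStem pWord
  -- a deliberately crude stemmer: it only strips common plural endings,
  -- so irregular plurals like "children" are left alone, which is
  -- exactly the limitation described above
  put toLower(pWord) into tWord
  if char -3 to -1 of tWord is "ies" then
    return char 1 to -4 of tWord & "y"    -- "stories" -> "story"
  end if
  if char -1 of tWord is "s" and char -2 to -1 of tWord is not "ss" then
    return char 1 to -2 of tWord          -- "monkeys" -> "monkey"
  end if
  return tWord
end roughStem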


There's a wealth of info available by searching the web for "indexing 
algorithms", "stemming algorithms", etc.  Doing it well may take a 
lifetime; fortunately it seems there are some cheating methods which 
will do most of the job well enough in less time.



All that said, I have to wonder: if Wikipedia's content is available, 
isn't their index also available?


Porting it from MySQL to SQLite seems a far less daunting task than 
writing an index from scratch.


--
 Richard Gaskin
 Fourth World
 Rev training and consulting: http://www.fourthworld.com
 Webzine for Rev developers: http://www.revjournal.com
 revJournal blog: http://revjournal.com/blog.irv


Re: Words Indexing strategies

2010-02-10 Thread Alejandro Tejada
Many thanks for replying to this question. :-)

on Wed, 10 Feb 2010 15:30:23 +
Bernard Devlin wrote:

 OK, so that's why you mention the different files for each letter of
 the alphabet.

Yes, each one of these 28 text files will be compressed
in gz format. When users look for a word, or several words,
only the relevant file(s) are decompressed and searched.
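
For example, a rough sketch of that lookup step (untested; it assumes
the per-letter index files are standard gzip files that Rev's
decompress() function can read, and the file naming is a placeholder):

function searchLetterIndex pWord, pIndexFolder
  put char 1 of pWord into tLetter
  put pIndexFolder & "/" & tLetter & ".txt.gz" into tFile
  -- decompress only the one index file that can contain the word
  put decompress(URL ("binfile:" & tFile)) into tIndexLines
  -- keep just the line for this word, e.g. "monkey:1,34,3827,21314"
  filter tIndexLines with (pWord & ":*")
  return tIndexLines
end searchLetterIndex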

 I'm still a bit confused.  Normally an index would indicate a location
 for an indexed term  That's what I assume your general index files are
 doing.  What are the key terms like in this index, and what do they
 point to?  Can you give us some examples?

Sure, look at the example that Brian Yennie wrote:
monkey:1,34,3827,21314

The numbers (after the word) are the names of the compressed gz files.
These files are located automatically or (if not found) manually
when the program starts.

 To my confused and befuddled mind, it sounds like you are wanting to
 create an index of the index.  That can't be right :-)

No, this program uses only three kinds of indexes:
1) articles, 2) article name redirections, 3) words.

 I'm no expert in search algorithms.  I have been hoping someone else
 would jump in who has done this kind of thing before.
 Are you wanting a pure, rev-only solution i.e. are you doing this to
 demonstrate what can be done using Rev alone?

Ideally, this should be a Rev-only solution, for cross-platform
porting.

on Wed, 10 Feb 2010 11:17:08 -0500
Brian Yennie wrote:

 The first step for this would likely include creating an inverted index.
 This means you store something like: monkey:1,34,3827,2131
 Where the word being indexed in monkey and the numbers
 that follow are article IDs.
 Using this information it is pretty trivial to implement AND / OR.
 Just merge the article IDs as needed. You can use a mix of this inverted
 index and your regular index in combination to do other types of queries.
 You can even store word frequency information to find similar articles.

Yes, this is correct and should work fine, but how could I write in the
word index a range of articles where a word appears consecutively?
baboon:1934,2345,2346,2347,2348,2349,2350,2351,2352,2567,3578

 With that said, I think you will have a very hard time crafting a competitive
 algorithm for this in Rev. A lot of the engines out there are very mature and
 there are many open source ones that you might consider bundling with
 your project instead of writing it yourself.

Actually, I am looking for good performance and cross-platform
portability, so Rev is my first choice.

 Some things to watch out for:
 1) File size. Indexing every word takes a lot of space, especially if you 
 indices
 are in plain text and not a compact binary format

How could I convert this index format into a compact binary format?

baboon:1934,2345,2346,2347,2348,2349,2350,2351,2352,2567,3578
monkey:1,34,3827,2131,3456,4567,5678,5789,6123,6234,6456

 2) Common words. You might consider a stop word list, or a threshold.
 For example, if a word is in  20% of all articles, don't index it.
 Or if it's in your stop word list.

Previously, I believed that stop words were words that appear in all articles.

 3) Root words / plurals. Can you detect that monkeys
 is the plural of monkey (or more complex cases)?

Richard wrote about a similar concern in his answer.
I suppose that this feature is useful for recommending similar
terms when users start a new search.

 An example of an open source engine that is very mature
 is Lucene, which can be run from the command-line in Java.

How could I run Java applications from RunRev
without asking users to install Java first?

on Wed, 10 Feb 2010 08:26:30 -0800
Richard Gaskin wrote:

 The ambitious Alejandro Tejada wrote:

This is only a really modest search.
:-D

 A good place to start on that is the seminal paper describing the
 initial Google implementation, written by the founders:

 The Anatomy of a Large-Scale Hypertextual Web Search Engine
 Sergey Brin and Lawrence Page
 http://infolab.stanford.edu/~backrub/google.html

 But be warned:  indexing is a deep topic, and may become a consuming
 passion.  Roulette is rumored to have been invented by a monk who came
 to believe he could find a way to predict its outcomes, and eventually
 went mad trying.  Indexing is a bit like that. :)

Many thanks for the pointers in this direction.

 A couple of the longer-term projects I work on need to incorporate good
 indexing of large corpuses, and my own code to that end has advanced
 only in small baby steps as I learn more about it.

[snip of really useful information]


 All that said, I have to wonder: if Wikipedia's content is available,
 isn't their index also available?

An article index is available, but a word index is not.

 Porting it from MySQL to SQLite seems a far less daunting task than
 writing an index from scratch.

SQLite is way over my head at the moment. :-)

Thanks again for answering this request!

Alejandro

Re: Words Indexing strategies

2010-02-10 Thread Brian Yennie
 Yes, this is correct and should work fine, but how could i write in the
 word index a range of article where a word appears consecutively:
 baboon:1934,2345,2346,2347,2348,2349,2350,2351,2352,2567,3578

If this were your format, you could compact to something like:
baboon:1934,2345-2352,2567,3578
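
For illustration, a rough sketch of that compaction in Rev (untested;
the function name is just a placeholder, and it assumes the ID list is
already sorted in ascending order):

function compactRanges pIDList
  put empty into tRunStart
  put empty into tPrev
  put empty into tResult
  repeat for each item tID in pIDList
    if tRunStart is empty then
      put tID into tRunStart
    else if tID <> tPrev + 1 then
      -- the previous run just ended, so emit it
      if tPrev > tRunStart then
        put tRunStart & "-" & tPrev & comma after tResult
      else
        put tRunStart & comma after tResult
      end if
      put tID into tRunStart
    end if
    put tID into tPrev
  end repeat
  -- emit the final run
  if tPrev > tRunStart then
    put tRunStart & "-" & tPrev after tResult
  else
    put tRunStart after tResult
  end if
  return tResult
end compactRanges

Fed the ID list above (minus the "baboon:" prefix), this returns
1934,2345-2352,2567,3578.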

 How could i convert this index format in a compact binary format?
 
 baboon:1934,2345,2346,2347,2348,2349,2350,2351,2352,2567,3578
 monkey:1,34,3827,2131, 3456,4567,5678,5789,6123,6234,6456

Well, there are a lot of possibilities that are probably way beyond the scope 
of this discussion; however, for starters you could convert each number from 
text to binary. You could also go for a BTree structure, but that is going to 
be awfully difficult in Rev.
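
For illustration, a rough sketch of that first step in Rev, packing
each article ID into a fixed 4-byte unsigned integer with binaryEncode
(untested; the function names are just placeholders):

function packIDs pIDList
  repeat for each item tID in pIDList
    put binaryEncode("I", tID) after tPacked
  end repeat
  return tPacked  -- 4 bytes per ID instead of up to 6 digits plus a comma
end packIDs

function unpackIDs pPacked
  repeat with i = 1 to the length of pPacked step 4
    get binaryDecode("I", char i to (i + 3) of pPacked, tID)
    put tID & comma after tIDs
  end repeat
  return char 1 to -2 of tIDs  -- drop the trailing comma
end unpackIDs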

 Previously i believed that stop words should appear in all articles.

I would go with a threshold for sure. Think about what it means for index size 
if a word is in 50% of all articles or more. And why would you want to search 
for that word anyway?
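
For illustration, a rough sketch of applying such a threshold to the
"word:id,id,..." index format used earlier in the thread (untested;
the function name is a placeholder, and the 20% cutoff is just the
example figure mentioned earlier):

function dropCommonWords pIndex, pTotalArticles
  put pTotalArticles * 0.20 into tLimit
  put empty into tKept
  repeat for each line tLine in pIndex
    -- the posting list is everything after the first colon
    put char (offset(":", tLine) + 1) to -1 of tLine into tPostings
    -- the default itemDelimiter is comma, so this counts article IDs
    if the number of items of tPostings <= tLimit then
      put tLine & return after tKept
    end if
  end repeat
  return tKept
end dropCommonWords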

 Richard wrote about a similar concern in his answer.
 I suppose that this feature is useful to recommend similar
 terms, when users start a new search.

Yes, but it's also built into the results in most modern search engines. It 
will help you return better results. Think of the simple case where someone 
searches for "monkeys" but doesn't find an article named "Monkey". Although it 
seems obvious that these are not the same word, your users can easily be 
frustrated.

 How could i run Java applications from Runrev,
 without asking users to install Java first?

You would have to find a way to bundle it with your app. The upside is that this 
would be much easier than trying to write something equivalent in Rev. You may 
very well be able to craft something that meets your needs, but in terms of 
performance and accuracy you'll have a nearly impossible time matching some of 
the more mature search engines out there.



Re: Words Indexing strategies

2010-02-09 Thread Bernard Devlin
On Tue, Feb 9, 2010 at 12:00 AM, Alejandro Tejada
capellan2...@gmail.com wrote:
 Now, I am looking for advice on creating an index structure for searching
 for specific words inside the articles' text. I have been unable to implement
 a fast search algorithm, using multiple words, similar to Wikipedia's
 own search engine. Any idea or advice is welcome.

Can I just clarify your problem?  You want to be able to search for
phrases (partial sentences, possibly with boolean logic) inside the
text stored in the xml nodes of the article, once the article is found
in the index?

Bernard