Index and Field.Text

2003-12-05 Thread Grant Ingersoll
Hi,

I have seen the example SAX based XML processing in the Lucene sandbox (thanks to the 
authors for contributing!) and have successfully adapted this approach for my 
application.  The one thing that does not sit well with me is the fact that I am using 
the method Field.Text(String, String) instead of the Field.Text(String, Reader) 
version, which means I am storing the contents in the index.

Some questions:

1. Should I care?  What is the cost of storing the contents of these files versus 
using the Reader-based method?  Presumably, the index size is going to be larger, but 
will it adversely affect search time?  If yes, how much so (relatively speaking)?

2. If storing the content is going to adversely affect searching, has anyone written 
an XMLReader that extends java.io.Reader?  I guess it would need to take in the name 
of the tag(s) that you want the reader to retrieve, and then override the 
java.io.Reader methods to return content based on just the tag values that I am 
interested in.  Has anyone taken this approach?  If not, does it at least seem like a 
valid approach?
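
For what it's worth, a rough sketch of the simpler (non-streaming) variant I could live 
with: collect the target tag's text with SAX and wrap it in a StringReader for 
Field.Text(String, Reader).  The class and tag names here are illustrative, not from 
the sandbox code:

import java.io.StringReader;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Collects the character data of one target tag so the result can be
// handed to Field.Text(String, Reader) as a StringReader.
public class TagTextHandler extends DefaultHandler {
  private final String targetTag;
  private final StringBuffer text = new StringBuffer();
  private boolean inTarget = false;

  public TagTextHandler(String targetTag) {
    this.targetTag = targetTag;
  }

  public void startElement(String uri, String local, String qName, Attributes atts) {
    if (qName.equals(targetTag)) inTarget = true;
  }

  public void endElement(String uri, String local, String qName) {
    if (qName.equals(targetTag)) inTarget = false;
  }

  public void characters(char[] ch, int start, int length) {
    if (inTarget) text.append(ch, start, length);
  }

  public StringReader getReader() {
    return new StringReader(text.toString());
  }
}

Parsing with SAXParserFactory.newInstance().newSAXParser().parse(file, handler) and then 
calling doc.add(Field.Text("contents", handler.getReader())) would keep the content out 
of the index.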

Thanks for your help!

-Grant Ingersoll






Re: Retrieving the content from hits...

2004-01-05 Thread Grant Ingersoll
I believe since you created the field using a Reader, you have to use the 
Field.readerValue() method instead of the stringValue() method and then handle the 
reader appropriately.  I don't know if there is any way to determine which one is used 
for a given field other than to test for null on the readerValue().
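
For example, a quick sketch of that null test (it assumes the field actually comes 
back on the retrieved Document; the field name is illustrative):

Field f = hits.doc(0).getField("contents");
if (f != null && f.readerValue() != null) {
  // field was created from a Reader; consume the reader
  java.io.Reader content = f.readerValue();
} else if (f != null) {
  // field was created from a String
  String content = f.stringValue();
}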

-Grant

>>> [EMAIL PROTECTED] 01/05/04 11:27AM >>>
Hi Group,

I have a little problem which can probably be solved easily by the
expertise within this group. 

An index has been generated. The document used looks like this:

Document doc = new Document();
doc.add(Field.Text("contents", new FileReader(file)));
doc.add(Field.Keyword("filename", file.getCanonicalPath())); 


When I now search, I get a correct hit. However, it seems the "contents"
field does not exist. When I enumerate the fields, only "filename" exists...

Here is some code showing how I parse the hits object:

Document d = hits.doc(0);
Enumeration enum = d.fields();
while (enum.hasMoreElements()){
  Field f = (Field)enum.nextElement();
  System.out.println("Field value = " + f.stringValue()); 
}

Where is the problem? 

Ralf











Using Explain and fieldNorm

2004-02-05 Thread Grant Ingersoll
Hi,

I was wondering what the fieldNorm section means when using the Explain functionality.  
How does it relate to the scoring algorithm given in the Similarity javadocs?

Thanks,
Grant





Re: 'Sponsored' links

2004-02-15 Thread Grant Ingersoll
Does the sponsored information have to be in the index?  Couldn't you look up the 
sponsor info in a database (or something else) after getting back your
initial results and then re-sort the hit list, moving the sponsored elements up while 
keeping the rest of the results as is?  If your list of sponsors is truly that 
small, you could just put 'em in a file and load the list into memory.

It seems then you don't have to re-index when your sponsorships change, and you 
have no dependencies on Lucene for trying to get boost values right, etc.

I guess this resembles #2.
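
Something like this (a rough sketch; the sponsor set and the "url" field name are made 
up for illustration):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import org.apache.lucene.document.Document;
import org.apache.lucene.search.Hits;

// Move sponsored documents to the front, preserving the original
// (score) order within each group.
public static List resort(Hits hits, Set sponsoredUrls) throws IOException {
  List sponsored = new ArrayList();
  List regular = new ArrayList();
  for (int i = 0; i < hits.length(); i++) {
    Document doc = hits.doc(i);
    if (sponsoredUrls.contains(doc.get("url"))) {
      sponsored.add(doc);
    } else {
      regular.add(doc);
    }
  }
  sponsored.addAll(regular);
  return sponsored;
}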


>>> [EMAIL PROTECTED] 02/15/04 03:49PM >>>
I am a newbie to Lucene, and this is my first serious posting
to Lucene-user.

This is to solicit comment upon the problem of supplying
a "sponsored links" capability within Lucene. This capability
would not affect at all which documents are returned by a query,
but would cause any 'sponsored' documents present among the
results to be displayed before other documents in the list
returned.

I have looked over the correspondence in Lucene-user, but
not found anything addressing this topic; if I have missed it,
please tell me where and when, and ignore the rest of this.

It seems to me that there are three ways to achieve the
capability:

1. Preset boost values for 'sponsored' documents, with an
implied burden of reindexing when sponsors are modified.

2. Post-qualify documents present in the hit list for their
sponsorship status, building a new hit list.

3. Modify the query to search using both the full query as
an unsponsored boolean clause with the default boost value,
and for each sponsor, to repeat the full query ANDed with
that sponsor with the appropriate boost value.

Are there other strategies not considered?

Assuming a small list of sponsors (10 or fewer), and low
volatility amongst the sponsors (1 change / month or less),
which method is best?

I have been pursuing method #1, almost to the exclusion of
the others, but have encountered an unknown difficulty in the
implementation (separate posting).  In particular, while it is clear
that #3 is doable, I know nothing about the searching burden
added by multiplying the user's query by one plus the count of
sponsors.

Regarding #3, if my understanding is right, then:
Sponsors name: s1, s2, s3 ...
 words or phrases: s1w1, s1w2, ... , s2w1, s2w2, ... , s3w1 
 boost values: s1v, s2v, s3v

then given query q as user input, form:
 q
 or (q and (s1w1 | s1w2 | s1w3 | ...)^s1v)
 or (q and (s2w1 | s2w2 ...)^s2v)
 or (q and (s3w1 ...)^s3v)
Is this correct?

Does the search strategy identify any kind of intermediate
sublist to speed up searching? (But then it would start to
resemble #2.)

Rolling one's own for #2 would run query q, and get the
HitCollector. Separately running queries for each of:
 s1w1 | s1w2 | s1w3 | ...,
 s2w1 | s2w2 ...
 s3w1 ...
and merge each hit collector with the one from query q.
(Just AND the bitsets???) Lastly adjust scores and form
a new composite HitCollector.  By this time I have told
everyone much more than I know.

Stray thought:-- can HitCollectors be cached at application init?

There are many other questions regarding details of implementation,
but their proper place is another communication.

Just preparing this document for dissemination has helped
greatly.  Any and all comments are much appreciated.

Thank you all.









Re: Term Vector support

2004-03-02 Thread Grant Ingersoll
>>> [EMAIL PROTECTED] 02/27/04 12:09PM >>>
Hi folks,

I'm trying to get a better understanding of term vector support. Looking
at lucene-dev I'm understanding that with each document you store the
list of terms and their frequencies. Is this correct? 
What uses are there for term vector other than "more like this"?


>
You can do more formal relevance feedback models and other more advanced IR 
techniques.  Presumably you could implement some other scoring capabilities that 
require the term vector.  You can access the frequency information on a per-document 
basis (kind of like TermDocs, etc., which are term-based across the index).

Some of these require some imagination to get to, but I think they can be done.  Pick 
up a good book on IR and you can see where the formulas use term (sometimes called 
document) vectors.  

I am sure there are other uses as well.


-Grant





Re: Search in all fields

2004-03-16 Thread Grant Ingersoll

You can use the MultiFieldQueryParser, which will generate a query against all of the 
fields you specify, or you could index all of your documents into one or two common 
fields and search against them.  Since you have a lot of fields, I would guess the 
latter is the better choice.  
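
For example, the first option looks like this (a quick sketch; the field names are 
illustrative):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.MultiFieldQueryParser;
import org.apache.lucene.search.Query;

String[] fields = {"title", "author", "contents"};
// generates the query against every field in the array (throws ParseException)
Query query = MultiFieldQueryParser.parse("your query here", fields,
    new StandardAnalyzer());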


>>> [EMAIL PROTECTED] 03/16/04 07:56AM >>>
In the QueryParser.parse method I must give the default field.

Does this mean that non-addressed queries are executed only over
this field?

The main question is:
How can I search in all fields in all documents in the index?
Note that I don't know the field names; there can be thousands of field 
names across all documents.

Thanks in advance.






Boosting groups

2004-03-31 Thread Grant Ingersoll
Hi,

I was wondering if boosting a grouping of terms has any meaning?
For example, (A^2  OR B^3 OR C^1.5)^5.0

I didn't see it in the query syntax documentation (but the query parser seems to 
accept it).  If it is meaningful, what are the semantics of it?  Is the boost factor 
distributive?  That is, is the above query equivalent to (A^10.0 OR B^15.0 OR C^7.5) ?

Thanks for the help,
Grant





Re: Problems From the Word Go

2004-04-29 Thread Grant Ingersoll
Alex, 

What kind of errors are you getting?  Is the Lucene JAR in your classpath?  Have you 
read http://jakarta.apache.org/lucene/docs/gettingstarted.html?

-Grant

>>> [EMAIL PROTECTED] 04/29/04 11:53AM >>>
I'm sorry if this is not the correct place to post this, but I'm very
confused, and getting towards the end of my tether.

I need to install/compile and run Lucene on a Windows XP Pro based machine,
running J2SE 1.4.2, with ANT.

I downloaded both the source code and the pre-compiled versions, and as yet
have not been able to get either running. I've been through the
documentation, and still I can find little to help me set it up properly.

All I want to do (to start with) is compile and run the demo version.

I'm sorry to ask such a newbie question, but I'm really stuck.

So if anyone can point me to an idiot's guide, or offer me some help, I would
be most grateful.

Once I get past this stage, I'll have all sorts of juicier questions for you,
but at the minute, I can't even get past stage 1.

Thank you in advance
Alex








Lucene 1.4 RC 3 issue with temp directory

2004-05-17 Thread Grant Ingersoll
Hi All,

I just upgraded to 1.4 RC 3 and am now unable to open my index.

I am getting: 
java.io.IOException: The system cannot find the path specified
at java.io.WinNTFileSystem.createFileExclusively(Native Method)
at java.io.File.createNewFile(File.java:828)
at org.apache.lucene.store.FSDirectory$1.obtain(FSDirectory.java:297)
at org.apache.lucene.store.Lock.obtain(Lock.java:53)
at org.apache.lucene.store.Lock$With.run(Lock.java:108)
at org.apache.lucene.index.IndexReader.open(IndexReader.java:111)
at org.apache.lucene.index.IndexReader.open(IndexReader.java:95)
at org.apache.lucene.search.IndexSearcher.<init>(IndexSearcher.java:38)


I _have_ reindexed using the new lucene jar.  I am positive the path is correct as I 
can open an index in the same directory with the old Lucene with no problems.  I 
notice that the problem only occurs when I am deployed inside of Tomcat.  If I run 
searches on the command line or through JUnit everything functions correctly.  

When I print out the lockDir location for the lock being obtained above, it looks 
like C:\ENG\index\LDC\trec-ar-dar\..\temp, which is the directory my index resides in, 
except that ..\temp does not exist.  When I create the directory, it works.  I suppose I 
could create the temp directory for every index, but I didn't know that was a 
requirement.  I do notice that Tomcat has a temp directory at the top, so it is 
probably setting the "java.io.tmpdir" system property to "..\temp", 
which is then picked up by Lucene.  The question is, what changed in RC 3 that would 
cause this to be used when it wasn't before? 
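
For now I am working around it by creating the directory up front before opening the 
index (a sketch; it assumes the lock directory really is taken from java.io.tmpdir):

import java.io.File;
import java.io.IOException;

File lockDir = new File(System.getProperty("java.io.tmpdir"));
// make sure the directory Lucene will use for its lock files exists
if (!lockDir.exists() && !lockDir.mkdirs()) {
  throw new IOException("could not create lock directory: " + lockDir);
}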

On a side note, would it be useful to create the lock directory if it doesn't exist?  
If the developers think so, I can submit the patch for it.

Thanks,
Grant





Re: similarity of two texts

2004-06-01 Thread Grant Ingersoll
Hey Eric,

What did you do to calc similarity?  I haven't had time, but was thinking of ways to 
add the ability to get the similarity score (as calculated when doing a search) given 
a term vector (or just a document id).  Any ideas on how to approach this would be 
appreciated.  The scoring in Lucene has always been a bit confusing to me, despite 
looking at the code several times, especially once you get into boolean queries, etc.

Thanks,
Grant

>>> [EMAIL PROTECTED] 06/01/04 06:01AM >>>
On May 31, 2004, at 2:17 PM, Stefan Groschupf wrote:
> Lucene can't help you.

What about using term vectors though?  I've been able to do rudimentary 
document similarity calculations using the new support in Lucene 1.4.  
Search the 'net for more info on term vectors and the formulas needed 
(elementary vector angle calculation, actually).

Erik

> Am 31.05.2004 um 20:10 schrieb uddam chukmol:
>
>> Hi,
>>
>> I'm a newbie to Lucene and heard that it helps in the information 
>> retrieval process. However, my problem is not really related to the 
>> information retrieval but to the comparison of two texts. I think 
>> Lucene may help resolving it.
>>
>> I would like to have a clue on how to compare two given texts and 
>> finally say how much they are similar.
>>
>> Has anyone had this kind of experience? I will be very grateful to 
>> hear your ideas and your recommendations.
>>
>> Thanks before hand!
>>
>> Uddam CHUKMOL
>>
>>
>>
>>  
>> -
>> Do you Yahoo!?
>> Friends.  Fun. Try the all-new Yahoo! Messenger
> ---
> open technology:   http://www.media-style.com 
> open source:   http://www.weta-group.net 
> open discussion:http://www.text-mining.org 
>
>








Re: similarity of two texts

2004-06-01 Thread Grant Ingersoll
Sorry about the misspelling, Erik!

Thanks for the insight.

Explain is my friend as an end user, but it, too, is confusing at the code level!  At 
some point I will have time to dig deeper and step through the scoring code.

>>> [EMAIL PROTECTED] 06/01/04 09:39AM >>>
On Jun 1, 2004, at 9:24 AM, Grant Ingersoll wrote:
> Hey Eric,

Eri*K*  :)

> What did you do to calc similarity?

I computed the angle between two vectors.  The vectors are obtained 
from IndexReader.getTermFreqVector(docId, "field").
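
Something along these lines (a rough sketch of that calculation; it assumes both 
documents were indexed with term vectors stored for the field):

import java.io.IOException;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermFreqVector;

public class VectorAngle {
  // Cosine of the angle between the term vectors of two documents.
  public static double cosine(IndexReader reader, int docA, int docB,
                              String field) throws IOException {
    Map a = toMap(reader.getTermFreqVector(docA, field));
    Map b = toMap(reader.getTermFreqVector(docB, field));
    double dot = 0.0, normA = 0.0, normB = 0.0;
    for (Iterator it = a.entrySet().iterator(); it.hasNext();) {
      Map.Entry e = (Map.Entry) it.next();
      int fa = ((Integer) e.getValue()).intValue();
      normA += fa * fa;
      Integer fb = (Integer) b.get(e.getKey());
      if (fb != null) dot += fa * fb.intValue();  // term occurs in both docs
    }
    for (Iterator it = b.values().iterator(); it.hasNext();) {
      int fb = ((Integer) it.next()).intValue();
      normB += fb * fb;
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
  }

  private static Map toMap(TermFreqVector v) {
    Map m = new HashMap();
    String[] terms = v.getTerms();
    int[] freqs = v.getTermFrequencies();
    for (int i = 0; i < terms.length; i++)
      m.put(terms[i], new Integer(freqs[i]));
    return m;
  }
}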

>   I haven't had time, but was thinking of ways to add the ability to 
> get the similarity score (as calculated when doing a search) given a 
> term vector (or just a document id).

It would be quite compute-intensive to do something like this.  This 
could be done through a custom sort as well, if applying it at the 
scoring level doesn't work.  I haven't given any thought to how this 
could work for scoring or sorting before, but does sound quite 
interesting.

>   Any ideas on how to approach this would be appreciated.  The scoring 
> in Lucene has always been a bit confusing to me, despite looking at 
> the code several times, especially once you get into boolean queries, 
> etc.

No doubt that it is confusing - to me also.  But Explanation is your 
friend.

Erik








Re: Writing a stemmer

2004-06-03 Thread Grant Ingersoll
Anil,

I suppose it depends on how complex the language is and what is acceptable for your 
program.  I have written a couple of stemmers that are fairly straightforward, based on 
papers that I have read, and they work well for the languages we are using.  Your best bet is 
probably to do a literature search for the languages you are interested in and go from 
there.  

I am, of course, assuming stemmers for your languages don't already exist.  If your 
languages are common, there probably is a stemmer available in some form that you can 
use or adapt. You'd be surprised at what you get by doing a simple Google search for 
"<lang X> stemmer", where <lang X> is the language you are interested in (and no quotes).

Hooking them into Lucene is straightforward and there are several examples of this 
available in the docs and code.

-Grant

>>> [EMAIL PROTECTED] 06/03/04 04:09PM >>>

Hi,

Can anyone provide some help on writing a stemmer for non-english languages?
How proficient must I be in a language for which I wish to write the stemmer?

Regards,
Anil







Re: "No tvx reader"

2004-06-07 Thread Grant Ingersoll
It can safely be removed.  I am pretty sure it happens during merging on fields that 
don't have term vector information included.  Of course, the real fix is to not 
even call this method when there is no term vector info.

It is normal, for now.  Should probably be removed, though.

>>> [EMAIL PROTECTED] 06/05/04 05:37PM >>>

Using 1.4rc3.

Running an app that indexes 50k documents (thus it just uses an 
IndexWriter).
One field has that boolean set for it to have a term vector stored for 
it, while other 11 fields don't.

On stdout I see "No tvx file" 13 times.

Glancing thru the src it seems this comes from TermVectorReader.

The generated index seeems fine.

What could be causing this and is this normal?

thx,
 Dave








Re: Setting Similarity in IndexWriter and IndexSearcher

2004-06-08 Thread Grant Ingersoll
I do these kinds of things as part of a layer between Lucene and my application, but 
have often thought it would be nice to have a metadata layer available that wasn't 
part of the Lucene core, but was packaged w/ Lucene.  It could provide the information 
necessary and have tools for updating without messing w/ the index.

For instance, one of the things I store is the name of the field that is the "true" 
document identifier (a unique String that won't change across Indexing) for that 
Document (which, as Doug pointed out, can vary from Document to Document w/in the 
Index).

Cheers,
Grant


>>> [EMAIL PROTECTED] 06/08/04 03:44PM >>>
David Spencer wrote:
> Does it ever make sense to set the Similartity obj in either (only one 
> of..) IndexWriter or IndexSearcher? i.e. If I set it in IndexWriter can 
> I avoid setting it in IndexSearcher? Also, can I avoid setting it in 
> IndexWriter and only set it in IndexSearcher? I noticed Nutch sets it in 
> both places and was wondering about what's going on behind the scenes...

No, it probably doesn't make sense to use a different Similarity 
implementation when indexing than when searching.  Ideally perhaps we'd 
have a LuceneConfiguration object, which encapsulates the Similarity, 
Analysis and Directory implementations, as well as perhaps other 
parameters.  And perhaps this could even be stored with the index, using 
Java object serialization.  However I worry that this could cause more 
confusion than it solves.  For example, one might not easily be able to 
search and index if a class used when it was indexed is no longer 
available when searching.  Tools like Luke could become more difficult 
to write and use.

By design, one does not have to declare things up-front with Lucene. 
For example, one never has to declare the set of fields and their types. 
  Different documents in the same index can use different fields, or 
even use the same field name differently.  Saving analyzers and 
similarity implementations with the index reduces this sort of 
flexibility somewhat.  If you rename your analysis or similarity class, 
does your index become invalid?  Lucene currently avoids such issues, at 
the expense of potential confusion about using different analyzers and 
similarity at index and search time.  But I don't think the latter is in 
practice a problem that needs more than a little documentation.

Sorry for the long-winded answer!

Doug









Re: Devnagari Search?

2004-06-10 Thread Grant Ingersoll
Don't have experience with those particular languages, but I can tell you that dealing 
with UNICODE is just a matter of making sure you read in the input using the correct 
encoding.  Java will take care of the rest.  If you are using a Reader for your Field, 
you probably have to do something like:

new InputStreamReader(new FileInputStream(file), "UTF-8")

assuming your files are stored in UTF-8.  If they are a different encoding, then you 
will have to pass that in place of UTF-8.
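
Putting it together for indexing, it would look something like this (untested sketch):

import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

// read the file as UTF-8 and let Lucene analyze it from the Reader
Reader reader = new InputStreamReader(new FileInputStream(file), "UTF-8");
Document doc = new Document();
doc.add(Field.Text("contents", reader));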

I would do a google search for stemmers and tokenizers for the languages you are 
interested in.  I also believe someone had a "generic" stemmer that performed very 
well.  I believe they posted to this list a week or so ago w/ a topic of "Writing a 
stemmer" or something along those lines.

>>> [EMAIL PROTECTED] 06/10/04 01:34AM >>>

Has anyone built Lucene for Devnagari UNICODE search? Please help me with what 
kind of changes I have to make in Lucene.

Also, if anyone has built a StandardTokenizer, Analyzer, Stemmer, Indexer,
or QueryParser for Hindi & Marathi, please let me know.

Thanks,
Satish.








Re: RE : Analyzers

2004-06-11 Thread Grant Ingersoll
I thought I had responded...

The gist of it is that it isn't thread-safe yet.  Although I don't think it is too much of 
a leap to make it thread-safe.  I just haven't had time to do so.  It can be done 
through reflection or perhaps by requiring a "deep" copy/reset of filter states.

-Grant

>>> [EMAIL PROTECTED] 06/11/04 06:10AM >>>
On Jun 11, 2004, at 5:30 AM, Rasik Pandey wrote:
> A CustomAnalyzer in which Tokenizers, TokenFilters (StopFilters, 
> StemFilters, etc.) could be dynamically set at runtime for creating a 
> TokenStream would be nice as well. Has anyone done any research along 
> these lines with respect to the Lucene API?

You mean like this?

http://issues.apache.org/bugzilla/show_bug.cgi?id=28182 

Nice idea, but there has been no response to Doug's questions on it.

Erik









Re: RE : RE : Analyzers

2004-06-11 Thread Grant Ingersoll
Will try to get to it this weekend.  What is missing?  Is it the new files?

>>> [EMAIL PROTECTED] 06/11/04 10:56AM >>>
Grant,

> I thought I had responded...
> 
> The gist of it is that it isn't thread-safe yet.  Although I don't
> think it is too much of a leap to make it thread-safe.  I just
> haven't had time to do so.  It can be done through reflection
> or perhaps by requiring a "deep" copy/reset of filter states.

If you have time to correct the attachment in Bugzilla, I'll look into it...

Regards,
RBP









Re: RE : RE : RE : Analyzers

2004-06-14 Thread Grant Ingersoll
I was able to download them.  There are 2 sets of attachments on that bug.

>>> [EMAIL PROTECTED] 06/13/04 09:05AM >>>
Grant,

> Will try to get to it this weekend.  What is missing?  Is it
> the new files?


Exactly.

>From : http://issues.apache.org/bugzilla/show_bug.cgi?id=28182 
---
04/03/04 21:51 The 4 new files needed   (application/octet-stream)
---

Regards,
RBP









Re: RE : RE : Analyzers

2004-06-15 Thread Grant Ingersoll
For the reflection way, I use a configuration file that specifies the initial state 
and then use a no-argument constructor.  Since I don't think that is very 
generalizable, I thought maybe you could do a copy() and then a reset() method 
(similar to the JSP tag release() method).  The copy() method would create a new 
in-memory object, and reset() would put it back into a clean state.  Many of the 
filters have read-only information (such as the Stop filter), so reset wouldn't have 
to do anything.  Others may require more, such as preserving the initialization state 
(which will take up extra memory).  I don't know if this is a generalizable process or 
if it is worth the effort.  The reflection way works well for me b/c I am already 
using a configuration file, so a few more properties aren't a big deal.  One of the 
things that is nice about Lucene is it doesn't require a configuration file.

Not sure if this is enough to go on, so let me know.

-Grant

>>> [EMAIL PROTECTED] 06/15/04 05:04AM >>>
> The gist of it is that it isn't thread-safe yet.  Although I don't
> think it is too much of a leap to make it thread-safe.  I just
> haven't had time to do so.  It can be done through reflection
> or perhaps by requiring a "deep" copy/reset of filter states.

I had a quick look and found that using reflection would be complicated as some 
TokenFilters need extra objects at construction time like a 
charset[](RussianStemFilter) or a HashSet of stopWords (StopFilter). Do you see a 
simple way around this? What were your thoughts for the "deep" copy/reset you mention?

Regards,
RBP 









Re: RE : Analyzers

2004-06-17 Thread Grant Ingersoll
Seems reasonable.  My only hesitation is the use of clone().  Not sure if this is 
strong enough, as there won't be any compiler errors if someone does not override 
clone() (just potentially weird behavior due to improper copying of internal Objects). 
 I don't know whether this is a big deal or not.  Seems like some of the time the base 
clone() method will be sufficient and other times we will need to do a deep copy.  
Should we force the developer to implement it or not?

I think the best thing may be to get it working and see what the others think.

>>> [EMAIL PROTECTED] 06/17/04 06:13AM >>>
So I gave this a little thought...

AbstractTokenizer could become 
CloneableTokenizer implements Tokenizer, Cloneable 

AbstractTokenFilter could become 
CloneableTokenFilter implements TokenFilter, Cloneable

in both of which the clone() method would return a new object allowing implementations 
like BaseAnalyzer to take advantage of your init() methods and setters 
(AbstractTokenFilter .setTokenStream and AbstractTokenizer.setReader) OR allow each 
CloneableTokenizer, CloneableTokenFilter implementation to generate its new object 
using its own constructor based dependency injection.

We could also remove the need for the init(), and setter methods in AbstractTokenizer 
and AbstractTokenFilter and create two abstract factory methods 
CloneableTokenizer.clone(reader) and CloneableTokenFilter.clone(TokenStream) which 
would handle TokenStream construction using the argument and any configured class 
member objects (stopWords, charsets, etc).

Your thoughts...

Regards,
RBP









Re: TermFreqVector based highlighter?

2004-06-21 Thread Grant Ingersoll
Space will vary based on the content (number of unique terms), obviously, but I did 
submit some rough numbers that I saw for my implementation.  Here they are (from my 
original patch submission):

I also tested by indexing 12,598 documents (88,362 terms) using both term vectors and 
no term vectors.
Index size w/o term vectors: 42 MB
Index size w/ term vectors: 71.3 MB

Time for the first test was 5 minutes 30 seconds, time for the second test was 6 
minutes 2 seconds.


The term vector you get back is a list of the terms and, for each term, its 
frequency in the given document.  I also submitted a term vector representation for 
the query (see QueryTermVector), so I suppose you could loop over the two vectors and 
compare.
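
For instance, something like this (a sketch; here the analyzed query terms are passed 
in as a plain String[], which is the sort of thing QueryTermVector can give you):

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import org.apache.lucene.index.TermFreqVector;

// Total frequency, within the document, of the terms that also
// occur in the query.
public static int overlap(TermFreqVector docVector, String[] queryTerms) {
  Set wanted = new HashSet(Arrays.asList(queryTerms));
  String[] terms = docVector.getTerms();
  int[] freqs = docVector.getTermFrequencies();
  int total = 0;
  for (int i = 0; i < terms.length; i++) {
    if (wanted.contains(terms[i]))
      total += freqs[i];
  }
  return total;
}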

Don't know if that solves your problem, but I hope it helps.

-Grant

>>> [EMAIL PROTECTED] 06/21/04 06:28AM >>>
Hi,

I have managed to extract the relevant information to highlight the
search results out of an index that does not store field's content.

The result is a list of matching terms, with their relative weights.

This solves my problem, but it is very expensive, like I was
expecting, as it uses the explain feature of the IndexSearcher.

Since Lucene 1.4 I have seen that a new option is available for
fields: storeTermVector.

Now the questions:
- how much space do storeTermVector uses on the index (compared to
just indexed and fully stored fields)?
- if I "storeTermVector" the fields can I get back the list of
matching terms for a query in a more efficient way compared to a full
explain computation?

I am willing to drop weights altogether, if this could allow a more
efficient computation.

Thanks for your attention.

Regards,

Giulio Cesare Solaroli







Re: how to get all terms of a given field of a certain docId?

2004-06-30 Thread Grant Ingersoll
I think, if I understand correctly, you are interested in the term vector for a 
document, which is available in 1.4.

See IndexReader.getTermFreqVector in the API.
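
For example (a quick sketch; it assumes the "contents" field was indexed with term 
vectors turned on):

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermFreqVector;

IndexReader reader = IndexReader.open("/path/to/index");
int docId = 0;  // e.g. obtained from Hits.id(i)
TermFreqVector vector = reader.getTermFreqVector(docId, "contents");
if (vector != null) {  // null if no term vector was stored for the field
  String[] terms = vector.getTerms();
  int[] freqs = vector.getTermFrequencies();
  for (int i = 0; i < terms.length; i++)
    System.out.println(terms[i] + " : " + freqs[i]);
}
reader.close();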

>>> [EMAIL PROTECTED] 06/30/04 08:06AM >>>
Hello Lucene users.

I've been using Lucene for the last 3 years and want to say "Great thanks" to
the developers and community for a perfect product and support.

I'm wondering, what is the best practice to extract all terms of a
given document field ("contents") for a certain document id?

Thanks in advance.

Max








Re: Problems indexing Japanese with CJKAnalyzer

2004-07-06 Thread Grant Ingersoll
Jon,

Java expects your files to be in the encoding of the native locale.  In most cases in 
the U.S., this will be an English-locale encoding such as ISO-8859-1/Cp1252.  If you 
want to read files that are in a different encoding, you have to tell Java what that 
encoding is, in this case Shift-JIS.  See the javadocs for java.io.InputStreamReader.

Thus, you will most likely have to alter the Lucene demo to load your document in a 
different way.  If you look at HTMLDocument, my guess is you can replace the 
construction of the HTMLParser with something like (somewhere around line 63):

HTMLParser parser = new HTMLParser(
    new InputStreamReader(new FileInputStream(file), "SJIS"));
// Not sure if "SJIS" is the right moniker for Shift-JIS.

I have not tested this, but I recall doing this when I first started, to be able to index 
my UTF-8 docs.

I am not exactly sure if SJIS is the abbreviated form for Shift-JIS.  It will throw an 
UnsupportedEncodingException if it is not right.  In that case, you can look in the 
Java documentation to see what is supported.

Btw, do this with your original files, which are in Shift-JIS.

>>> [EMAIL PROTECTED] 07/06/04 12:53PM >>>
Hi Jon,

It sounds to me like you have a character encoding problem.  The 
native2ascii tool is designed to produce input for the Java compiler; 
the "\u7aef" notation you're seeing is understood by Java string 
interpreters to mean the corresponding hexadecimal Unicode code point. 
Other Java programs, however, depending on their implementation, may not 
understand this notation.  Alternatively, maybe the notation is 
understood, but the conversion from Shift-JIS to Java Unicode format is 
not being performed properly; if you don't tell native2ascii the source 
encoding, it will assume the "native" encoding for the platform--on 
Windows, depending on which localized version you've got, this is likely 
to be the so-called code page 1252 (ISO-8859-1 with a few 
modifications).  Converting from one character encoding to another with 
incorrect assumptions about the source encoding can only lead to sorrow 
and confusion.

I think you can use the native2ascii tool to do what you want 
(untested), but it will take two passes:

1. Use native2ascii to convert your file(s) to Java Unicode format, but 
tell it the source encoding:

native2ascii -encoding SJIS inputfile outputfile1

2. Tell it to convert from Java Unicode format to UTF-8:

native2ascii -reverse -encoding UTF8 outputfile1 finaloutput

Here's a web page with more information on native2ascii:

<http://java.sun.com/j2se/1.4.2/docs/tooldocs/windows/native2ascii.html>

Hope it helps,
Steve Rowe

Jon Schuster wrote:
> I've gone through all of the past messages regarding the CJKAnalyzer but I
> still must be doing something wrong because my searches don't work.
> 
> I'm using the IndexHTML application from the org.apache.lucene.demo package
> to do the indexing, and I've changed the analyzer to use the CJKAnalyzer.
> I've also tried with and without setting the file.encoding to Shift-JIS.
> I've tried indexing the HTML files, which contain Shift-JIS, without
> conversion to Unicode and I get assorted "Parse Aborted: Lexical error..."
> messages. I've also tried converting the Shift-JIS HTML files to Unicode by
> first running them through the native2ascii tool.
> 
> When the files are converted via native2ascii, they index without errors,
> but the index appears to contain the Unicode characters as literal strings
> such as "u7aef", "u7af6", etc. Searching for an English word produces
> results that have text like "code \u5c5e\u6027".
> 
> Since others have gotten Japanese indexing to work, what's the secret I'm
> missing?
> 
> Thanks,
> Jon








Re: indexing help

2004-07-08 Thread Grant Ingersoll
Hi John,

The source code is available from CVS, make it non-final and do what you need to do.  
Of course, you may have a hard time finding help later if you aren't using something 
everyone else is and your solution doesn't work...  :-)

If I understand correctly what you are trying to do, you already know all of the 
answers for indexing, you just want Lucene to do the retrieval side of the coin, 
correct?  I suppose a crazy idea might be to write a program that took your info and 
output it in the Lucene file format, but that seems a bit like overkill.

-Grant

>>> [EMAIL PROTECTED] 07/07/04 07:37PM >>>
Hi Doug:
 Thanks for the response!

 The solution you proposed is still a derivative of creating a
dummy document stream. Taking the same example, java (5), lucene (6),
VectorTokenStream would create a total of 11 Tokens whereas only 2 are
necessary.

Given many documents with many terms and frequencies, it would
create many extra Token instances.

   The reason I was looking at deriving the Field class is because I
can directly manipulate the FieldInfo by setting the frequency. But
the class is final...

   Any other suggestions?

Thanks

-John

On Wed, 07 Jul 2004 14:20:24 -0700, Doug Cutting <[EMAIL PROTECTED]> wrote:
> John Wang wrote:
> >  While lucene tokenizes the words in the document, it counts the
> > frequency and figures out the position, we are trying to bypass this
> > stage: For each document, I have a set of words with a know frequency,
> > e.g. java (5), lucene (6) etc. (I don't care about the position, so it
> > can always be 0.)
> >
> >  What I can do now is to create a dummy document, e.g. "java java
> > java java java lucene lucene lucene lucene lucene" and pass it to
> > lucene.
> >
> >  This seems hacky and cumbersome. Is there a better alternative? I
> > browsed around in the source code, but couldn't find anything.
> 
> Write an analyzer that returns terms with the appropriate distribution.
> 
> For example:
> 
> public class VectorTokenStream extends TokenStream {
>   private String[] terms;
>   private int[] freqs;
>   private int term = -1;  // starts before the first term
>   private int freq = 0;
>   public VectorTokenStream(String[] terms, int[] freqs) {
>     this.terms = terms;
>     this.freqs = freqs;
>   }
>   public Token next() {
>     while (freq == 0) {   // advance past terms with zero frequency
>       term++;
>       if (term >= terms.length)
>         return null;
>       freq = freqs[term];
>     }
>     freq--;
>     return new Token(terms[term], 0, 0);
>   }
> }
> 
> Document doc = new Document();
> doc.add(Field.Text("content", ""));
> indexWriter.addDocument(doc, new Analyzer() {
>   public TokenStream tokenStream(String field, Reader reader) {
> return new VectorTokenStream(new String[] {"java","lucene"},
>  new int[] {5,6});
>   }
> });
> 
> >   Too bad the Field class is final, otherwise I can derive from it
> > and do something on that line...
> 
> Extending Field would not help.  That's why it's final.
> 
> Doug
> 







Re: indexing help

2004-07-08 Thread Grant Ingersoll
Hey John,

Those are just options, didn't say they were good ones!  :-)

I guess the real question is, what is the background of what you are trying to do?  
Presumably you have some other program that is generating frequencies for you; do you 
really need that in the current form?  Can't the Lucene indexing engine act as a 
stand-in for this process, since your end result _should_ be the same?  The Lucene 
Analyzer process is quite flexible; I bet you could even find a way to hook your 
existing tools into the Analyzer process.

-Grant

>>> [EMAIL PROTECTED] 07/08/04 10:42AM >>>
Hi Grant:
 Thanks for the options. How likely is it that the Lucene file formats will change?

 Are there really no more options? :(...

Thanks

-John

On Thu, 08 Jul 2004 08:50:44 -0400, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
> Hi John,
> 
> The source code is available from CVS, make it non-final and do what you need to do. 
>  Of course, you may have a hard time finding help later if you aren't using 
> something everyone else is and your solution doesn't work...  :-)
> 
> If I understand correctly what you are trying to do, you already know all of the 
> answers for indexing, you just want Lucene to do the retrieval side of the coin, 
> correct?  I suppose a crazy idea might be to write a program that took your info and 
> output it in the Lucene file format, but that seems a bit like overkill.
> 
> -Grant
> 
> >>> [EMAIL PROTECTED] 07/07/04 07:37PM >>>
> 
> 
> Hi Doug:
> Thanks for the response!
> 
> The solution you proposed is still a derivative of creating a
> dummy document stream. Taking the same example, java (5), lucene (6),
> VectorTokenStream would create a total of 11 Tokens whereas only 2 is
> neccessary.
> 
>Given many documents with many terms and frequencies, it would
> create many extra Token instances.
> 
>   The reason I was looking to derving the Field class is because I
> can directly manipulate the FieldInfo by setting the frequency. But
> the class is final...
> 
>   Any other suggestions?
> 
> Thanks
> 
> -John
> 
> On Wed, 07 Jul 2004 14:20:24 -0700, Doug Cutting <[EMAIL PROTECTED]> wrote:
> > John Wang wrote:
> > >  While lucene tokenizes the words in the document, it counts the
> > > frequency and figures out the position, we are trying to bypass this
> > > stage: For each document, I have a set of words with a know frequency,
> > > e.g. java (5), lucene (6) etc. (I don't care about the position, so it
> > > can always be 0.)
> > >
> > >  What I can do now is to create a dummy document, e.g. "java java
> > > java java java lucene lucene lucene lucene lucene" and pass it to
> > > lucene.
> > >
> > >  This seems hacky and cumbersome. Is there a better alternative? I
> > > browsed around in the source code, but couldn't find anything.
> >
> > Write an analyzer that returns terms with the appropriate distribution.
> >
> > For example:
> >
> > public class VectorTokenStream extends TokenStream {
> >   private String[] terms;
> >   private int[] freqs;
> >   private int term = -1;  // starts before the first term
> >   private int freq = 0;
> >   public VectorTokenStream(String[] terms, int[] freqs) {
> >     this.terms = terms;
> >     this.freqs = freqs;
> >   }
> >   public Token next() {
> >     while (freq == 0) {   // advance past terms with zero frequency
> >       term++;
> >       if (term >= terms.length)
> >         return null;
> >       freq = freqs[term];
> >     }
> >     freq--;
> >     return new Token(terms[term], 0, 0);
> >   }
> > }
> >
> > Document doc = new Document();
> > doc.add(Field.Text("content", ""));
> > indexWriter.addDocument(doc, new Analyzer() {
> >   public TokenStream tokenStream(String field, Reader reader) {
> > return new VectorTokenStream(new String[] {"java","lucene"},
> >  new int[] {5,6});
> >   }
> > });
> >
> > >   Too bad the Field class is final, otherwise I can derive from it
> > > and do something on that line...
> >
> > Extending Field would not help.  That's why it's final.
> >
> > Doug
> >
> 
> 
>






Re: Could search results give an idea of which field matched

2004-07-12 Thread Grant Ingersoll
See the explain functionality in the Javadocs and previous threads.  You can ask 
Lucene to explain why it got the results it did for a given hit.
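
For example (a quick sketch, given an IndexSearcher and a Query):

import org.apache.lucene.search.Explanation;
import org.apache.lucene.search.Hits;

Hits hits = searcher.search(query);
for (int i = 0; i < Math.min(10, hits.length()); i++) {
  // the explanation shows, per field, how the score was computed
  Explanation explanation = searcher.explain(query, hits.id(i));
  System.out.println(explanation.toString());
}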

>>> [EMAIL PROTECTED] 07/12/04 04:52PM >>>
I search the index on multiple fields. Could the search results also
tell me which field matched so that the document was selected? From what
I can tell, only the document number and a score are returned; is there
a way to also find out which field(s) of the document matched the
query?

 

Sildy

 






Tokenizers and java.text.BreakIterator

2004-07-20 Thread Grant Ingersoll
Hi,

Was wondering if anyone uses java.text.BreakIterator#getWordInstance(Locale) as a 
tokenizer for various languages?  Does it do a good job?  It seems like it does, at 
least for languages where words are separated by spaces or punctuation, but I have 
only done simple tests.

Anyone have any thoughts on this?  What am I missing?  Does this seem like a valid 
approach?
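
For reference, the kind of usage I mean (plain JDK, no Lucene involved):

import java.text.BreakIterator;
import java.util.Locale;

String text = "Une phrase de test, avec quelques mots.";
BreakIterator words = BreakIterator.getWordInstance(Locale.FRENCH);
words.setText(text);
int start = words.first();
for (int end = words.next(); end != BreakIterator.DONE;
     start = end, end = words.next()) {
  String chunk = text.substring(start, end).trim();
  // skip the whitespace/punctuation "words" the iterator also returns
  if (chunk.length() > 0 && Character.isLetterOrDigit(chunk.charAt(0)))
    System.out.println(chunk);  // a would-be token
}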

Thanks,
Grant





Re: lucene cutomized indexing

2004-07-20 Thread Grant Ingersoll
It seems to me the answer to this is not necessarily to open up the API, but to 
provide a mechanism for adding Writers and Readers to the indexing/searching process 
at the application level.  These readers and writers could be passed to Lucene and 
used to read and write to separate files (thus, not harming the index file format).  
They could be used to read/write an arbitrary amount of metadata at the term, document 
and/or index level w/o affecting the core Lucene index.  Furthermore, previous 
versions could still work b/c they would just ignore the new files and the indexes 
could be used by other applications as well.

This is just a thought in its infancy, but it seems like it would solve the 
problem.  Of course, the trick is figuring out how it fits into the API (or maybe it 
becomes a part of 2.0).  Not sure if it is even feasible, but it seems like you could 
define interfaces for Readers and Writers that meet the requirements to do this.
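
To make that concrete, the interfaces might look something like this (entirely 
hypothetical; nothing like this exists in Lucene today):

import java.io.IOException;
import java.util.Map;
import org.apache.lucene.store.Directory;

// Hypothetical hook: writes application metadata for each document to a
// side file in the same Directory, leaving the core index format alone.
public interface MetadataWriter {
  void open(Directory dir) throws IOException;
  void writeDocument(int docId, Map metadata) throws IOException;
  void close() throws IOException;
}

// Hypothetical counterpart for reading the side file back (its own file).
public interface MetadataReader {
  void open(Directory dir) throws IOException;
  Map readDocument(int docId) throws IOException;
  void close() throws IOException;
}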

This may be better discussed on the dev list.

>>> [EMAIL PROTECTED] 07/20/04 11:28AM >>>
Hi:
   I am trying to store some database-like field values in Lucene.
I have my own way of storing field values in a customized format.

   I guess my question is whether we can make the Reader/Writer
classes, e.g. the FieldReader, FieldWriter, DocumentReader/Writer classes,
non-final?

   I have asked to make the Lucene API less restrictive many, many, many
times but got no replies. Is this request feasible?

Thanks

-John







Re: Tokenizers and java.text.BreakIterator

2004-07-20 Thread Grant Ingersoll
Answering my own question: I think it is b/c Tokenizers work with a Reader, and you 
would have to read in the whole document in order to use the BreakIterator, which 
operates on a String...

>>> [EMAIL PROTECTED] 07/20/04 03:23PM >>>
Hi,

Was wondering if anyone uses java.text.BreakIterator#getWordInstance(Locale) as a 
tokenizer for various languages?  Does it do a good job?  It seems like it does, at 
least for languages where words are separated by spaces or punctuation, but I have 
only done simple tests.

Anyone have any thoughts on this?  What am I missing?  Does this seem like a valid 
approach?

Thanks,
Grant








Re: authentication support in lucene

2004-07-22 Thread Grant Ingersoll
Maybe you could use the HitCollector mechanism for search.  Do your ACL check as you 
add each item to the HitCollector.  If the check fails, then don't add it to the 
Collector.
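
A sketch (ACLCheck is your hypothetical module from the message below; user and 
password would need to be final so the anonymous class can see them):

import java.util.BitSet;
import org.apache.lucene.search.HitCollector;

final BitSet allowed = new BitSet();
searcher.search(query, new HitCollector() {
  public void collect(int doc, float score) {
    // keep only the documents this user may see
    if (ACLCheck(doc, user, password)) {
      allowed.set(doc);
    }
  }
});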

>>> [EMAIL PROTECTED] 07/22/04 01:59PM >>>
Hi:

Maybe this has been asked before.

Is there a plan to support ACL check on the documents in lucene?
Say I have a customized ACL check module, e.g.:

 boolean ACLCheck(int docID,String user,String password);

 And have some sort of framework to plug in something like that.

I was looking at the Filter class. I guess I can read the entire
index and, for each document, feed it to the authentication module; if
authenticated, set the bit for the docID and return the BitSet instance. It
sounds very slow for large hits. I guess I can play with caching
etc.

 Any other ideas?

Thanks

-John







Re: Can I retrieve token offsets from Hits?

2004-07-22 Thread Grant Ingersoll
I am sensing a common theme throughout a variety of threads here:  namely, a need for 
a pluggable set of Readers and Writers (think Interface) that can write metadata 
about an Index/Document/Field/Term (which I see the TermVector stuff as being an 
instance of) and can be given to Lucene from the application level (or at least the 
application specifies which ones to use).

I proposed something like this a bit earlier, but didn't see any interest.  I suppose 
I should implement it, as this is how things get going, but it would be nice to have some 
input on requirements, and on whether the people who know Lucene better than I do think this 
is possible.

Just my two cents on this one.  It doesn't help you w/ an immediate solution, but I think 
it would help us all in the long run.  If this existed, one could easily implement a 
token position store and ask it for all of this information, I think.  :-)

-Grant

>>> [EMAIL PROTECTED] 07/22/04 03:19PM >>>
> I wonder if the information in termPositions or termVector can be used
> to restore token position from indices?

TermFreqVector gives you term frequencies (not positions). This can be of use in 
computing document 
similarities.
TermPositions gives you the sequence number . eg in the last sentence the word 
"sequence" was 
token number 5,  (not character position 5). This is used for PhraseQueries to 
determine proximity.

Character position is what is required to do highlighting and this isnt stored 
anywhere currently. 
The requirements for such a store would be indexed access by doc number, and a compact 
means
of storing term/character position info. This could add considerable size to the index.

Previously we concluded that highlighting is only typically done on the first 10 or so 
records in a result set 
anyway and that re-analyzing the text shouldnt add too much of an overhead. If you 
want to limit the size of
an individual document's text to be tokenized use 
highlighter.setMaxDocBytesToAnalyze().
If you find tokenizing slow check you arent using StandardAnalyzer - I have found that 
to be slow
(see http://marc.theaimsgroup.com/?l=lucene-dev&m=108080820315779&w=2 )

Cheers
Mark




 







Re: TermFreqVector Beginner Question

2004-07-28 Thread Grant Ingersoll
Can you post the whole section of related code?  Sounds like you are doing things 
right.  

In the Lucene source code, there is a file called TestTermVectors.java, take a look at 
that and see how your stuff compares.  I ran the test against the HEAD and it worked.

>>> [EMAIL PROTECTED] 07/28/04 04:51PM >>>

Howdy,

I am new to Lucene and thus far I am very impressed.  Thanks to all who have
worked on this project!

I am working on a project where I want to do the following:

1.) Index a bunch of documents.
2.) Pluck out one of the documents by Lucene document number
3.) Get a term frequency vector for that document

After some digging and playing I came across this method...

   IndexReader.getTermFreqVector(int docNumber, String field)

This is exactly what I want.  So I ran the IndexFiles demo program with some
test documents and started poking at the index with an IndexReader. But when I
called

   IndexReader.getTermFreqVector(someDocNumber,"contents")

I get NULL back.  After a little more digging I find that for a TermVector to
exist, the Field has to have the TermVector flag set.  So I changed some lines
in the demo FileDocument.Document method to:

FileInputStream is = new FileInputStream(f);
Reader reader = new BufferedReader(new InputStreamReader(is));
doc.add(Field.Text("contents", reader.toString(),true));

with the "true" parameter causing the new Field to turn on the storeTermVector
flag, right? So then I reindex and get the same results - getTermFreqVector
returns NULL.  So I inspect the field list of the Document from the index:

   Document d = ir.document(td.doc());
   System.out.println("  Path: "+d.get("path"));
   for (Enumeration e = d.fields() ; e.hasMoreElements() ;) 
   {
  System.out.println(((Field)e.nextElement()).toString());
   }

and I discover that there is now NO "contents" Field.  If I change the parameter
in Field.Text to false, I get a "contents" Field but no TermVector.  To date I
haven't been able to figure out how to get a TermFreqVector at all.

What am I missing?

I have looked at the documents - all the tutorials I have found just cover the
basics.

I have read the news group postings related to "TermVectors" and
"TermFreqVectors" and everybody says stuff like "the new 1.4 Vector stuff is
great".  So how do they know?  Where can I learn about this?  Are there any more
complete user tutorials/references that cover TermVector features?

Oh, I am using the 1.4 Lucene release in case it matters.

Thanks in advance,

Matt Galloway
Tulsa, Oklahoma


(BTW, I also tried Field.UnStored with the same results.)









Re: TermFreqVector Beginner Question

2004-07-29 Thread Grant Ingersoll
Matt,

Perhaps you could add this to the Wiki somewhere?  May want to also add a bug report 
on this, so that it is captured, especially the stuff in 2.).

>>> [EMAIL PROTECTED] 07/29/04 11:31AM >>>
Well, as one would expect, most of the problems were mine.  Here is what I
learned... (please comment on the accuracy of these statements).

1.) Setting storeTermVector to true does nothing if store is false, i.e. 
   you must store the contents of a field in order to retrieve TermVectors 
   for it later.  This may seem obvious to everyone else, but to a new user 
   it is anything but obvious, as it is not documented anywhere that I have
   seen.  I think it would be very helpful to include this tidbit in the 
   Field class JavaDoc if not in the FAQ or some other place.

   I also think it would be helpful to prevent the user from combinations 
   of store and storeTermVector that don't make sense, namely store = false 
   and storeTermVector = true.  Maybe an exception or something.

2.) The following methods...

   Field.Text(String name, Reader value, boolean storeTermVector) 
   Field.UnStored(String name, String value, boolean storeTermVector) 

   DO NOT store the contents of the field and (based on my assumption in 
   point 1 and through observation) consequently DO NOT store TermVectors
   despite the value of their storeTermVector parameter.  If this is accurate,
   why do these methods exist?  This is very misleading to the new user.

3.) I am also new to Java so if you look at my earlier sample code you
   will see that I used "reader.toString()" where reader is a buffered file
   reader.  This of course is not the desired effect.  I have since rewritten
   the code to use a string that contains the content of the file
   instead of some vector address thing.  This doesn't affect Lucene or
   term vectors, just my ego.

Once you understand that store=true is ALSO a prerequisite for TermVectors (in
addition to storeTermVector=true), then everything works great.
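
In code, the combination that finally worked for me looks like this (a sketch; note 
that reader.toString() does NOT read the file):

import java.io.BufferedReader;
import java.io.FileReader;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

// read the whole file into a String first
StringBuffer sb = new StringBuffer();
BufferedReader in = new BufferedReader(new FileReader(f));
for (String line = in.readLine(); line != null; line = in.readLine())
  sb.append(line).append('\n');
in.close();

Document doc = new Document();
// Field.Text(String, String, ...) stores the value; true turns on term vectors
doc.add(Field.Text("contents", sb.toString(), true));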

Thanks for the help,

Matt Galloway


Quoting Grant Ingersoll <[EMAIL PROTECTED]>:

> Can you post the whole section of related code?  Sounds like you are doing
> things right.  
> 
> In the Lucene source code, there is a file called TestTermVectors.java, take
> a look at that and see how your stuff compares.  I ran the test against the
> HEAD and it worked.
> 
> >>> [EMAIL PROTECTED] 07/28/04 04:51PM >>>
> 
> Howdy,
> 
> I am new to Lucene and thus far I am very impressed.  Thanks to all who
> have
> worked on this project!
> 
> I am working on a project where I want to do the following:
> 
> 1.) Index a bunch of document.
> 2.) Pluck out one of the doucments by Lucene document number
> 3.) Get a term frequency for that document
> 
> After some digging and playing I came across this method...
> 
>IndexReader.getTermFreqVector(int docNumber, String field)
> 
> This is exactly what I want.  So I ran the IndexFiles demo program with
> some
> test documents and started poking at the index with an IndexReader. But when
> I
> called
> 
>IndexReader.getTermFreqVector(someDocNumber,"contents")
> 
> I get NULL back.  After a little more digging I find that for a TermVector
> to
> exist the Field has to have the TermVector flag set.  So I changes some
> lines
> in the demo FileDocument.Document method to:
> 
> FileInputStream is = new FileInputStream(f);
> Reader reader = new BufferedReader(new InputStreamReader(is));
> doc.add(Field.Text("contents", reader.toString(),true));
> 
> with the "true" parameter causing the new Field to turn on the
> storeTermVector
> flag, right? So then I reindex and get the same results - getTermFreqVector
> returns NULL.  So I inspect the field list of the Document from the index:
> 
>Document d = ir.document(td.doc());
>System.out.println("  Path: "+d.get("path"));
>for (Enumeration e = d.fields() ; e.hasMoreElements() ;) 
>{
>   System.out.println(((Field)e.nextElement()).toString());
>}
> 
> and I discover that there is now NO "contents" Field.  If I change the
> parameter
> in Field.Text to false, I get a "contents" Field but no TermVector.  To date
> I
> haven't been able to figure out how to get a TermFreqVector at all.
> 
> What am I missing?
> 
> I have looked at the documents - all the tutorials I have found just cover
> the
> basics.
> 
> I have read the news group postings related to "TermVectors" and
> "TermFreqVectors" and everybody says stuff like "the new 1.4 Vector stuff
> is
> great".  So how do they know?  Where can I learn about this? 

Re: Weighted queries

2004-08-06 Thread Grant Ingersoll
Btw, MultiFieldQueryParser extends QueryParser, which has the
setOperator method that allows you to set the default operator.
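
If the parser still doesn't cooperate, you could also build the expansion
by hand.  A rough sketch, assuming Lucene 1.4's BooleanQuery.add(Query,
required, prohibited) signature (field names and boosts taken from your
example):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;

    public class BoostedExpansion {
      // expands e.g. {"foo", "bar"} into
      // (title:foo^4 OR abstract:foo^2 OR content:foo) AND (same for bar)
      public static Query expand(String[] terms) {
        String[] fields = {"title", "abstract", "content"};
        float[] boosts = {4f, 2f, 1f};
        BooleanQuery query = new BooleanQuery();
        for (int i = 0; i < terms.length; i++) {
          BooleanQuery perTerm = new BooleanQuery();
          for (int j = 0; j < fields.length; j++) {
            TermQuery tq = new TermQuery(new Term(fields[j], terms[i]));
            tq.setBoost(boosts[j]);
            perTerm.add(tq, false, false);   // optional -> OR across fields
          }
          query.add(perTerm, true, false);   // required -> AND across terms
        }
        return query;
      }
    }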

>>> [EMAIL PROTECTED] 8/6/2004 10:54:55 AM >>>
Is it possible to expand a query such as

   foo bar

into

   (title:foo^4 OR abstract:foo^2 OR content:foo) AND
   (title:bar^4 OR abstract:bar^2 OR content:bar)

?

I can assign weights to individual fields when indexing, and could use
the MultiFieldQueryParser - but it seems this parser can't be configured
to use AND as default!




Re: Index Size

2004-08-19 Thread Grant Ingersoll
How many fields do you have and what analyzer are you using?

>>> [EMAIL PROTECTED] 8/19/2004 11:54:25 AM >>>
Otis
I upgraded to 1.4.1.  I deleted all of my old indexes and started from
scratch.  I indexed 2 MB worth of text files and my index size is 8
MB.
Would it be better if I stopped using the
IndexWriter.addIndexes(IndexReader) method and instead traverse the
IndexReader on the temp index and use
IndexWriter.addDocument(Document)
method?
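
Something like this sketch is what I have in mind (assuming the 1.3/1.4
IndexReader API; the paths and analyzer are just stand-ins for what I
actually use).  One caveat I can see already: IndexReader.document(i)
returns only the stored fields, so anything indexed but not stored would
be lost in the copy:

    IndexReader tempReader = IndexReader.open("/path/to/tempIndex");
    IndexWriter mainWriter = new IndexWriter("/path/to/mainIndex",
                                             new StandardAnalyzer(), false);
    for (int i = 0; i < tempReader.maxDoc(); i++) {
      if (!tempReader.isDeleted(i)) {        // skip deleted docs
        mainWriter.addDocument(tempReader.document(i));
      }
    }
    mainWriter.optimize();
    mainWriter.close();
    tempReader.close();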

Thanks again for your input, I appreciate it.

Rob
- Original Message - 
From: "Otis Gospodnetic" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Thursday, August 19, 2004 8:00 AM
Subject: Re: Index Size


Just go for 1.4.1 and look at the CHANGES.txt file to see if there
were
any index format changes.  If there were, you'll need to re-index.

Otis

--- Rob Jose <[EMAIL PROTECTED]> wrote:

> Otis
> I am using Lucene 1.3 final.  Would it help if I moved to Lucene 1.4
> final?
>
> Rob
> - Original Message - 
> From: "Otis Gospodnetic" <[EMAIL PROTECTED]>
> To: "Lucene Users List" <[EMAIL PROTECTED]>
> Sent: Thursday, August 19, 2004 7:13 AM
> Subject: Re: Index Size
>
>
> I thought this was the case.  I believe there was a bug in one of the
> recent Lucene releases that caused old CFS files not to be removed when
> they should be removed.  This resulted in your index directory
> containing a bunch of old CFS files consuming your disk space.
>
> Try getting a recent nightly build and see if using that takes care
> of your problem.
>
> Otis
>
> --- Rob Jose <[EMAIL PROTECTED]> wrote:
>
> > Hey George
> > Thanks for responding.  I am using Windows and I don't see any
> > hidden files.
> > I have a ton of CFS files (1366/1405).  I have 22 F# (F1, F2, etc.)
> > files.
> > I have two FDT files and two FDX files. And three FNM files.  Add
> > these
> > files to the deletable and segments file and that is all of the
> files
> > that I
> > have.   The CFS files are approximately 11 MB each.  The totals I gave
> > you before were for all of my indexes together.  This particular index
> > has a
> > size of 21.6 GB.  The files that it indexed have a size of 89 MB.
> >
> > OK - I just removed all of the CFS files from the directory and I can
> > still read my indexes.  So now I have to ask: what are these CFS files?
> > Why are they created?  And how can I get rid of them if I don't need
> > them?  I will also take a look at the Lucene website to see if I can
> > find any information.
> >
> > Thanks
> > Rob
> >
> > - Original Message - 
> > From: "Honey George" <[EMAIL PROTECTED]>
> > To: "Lucene Users List" <[EMAIL PROTECTED]>
> > Sent: Thursday, August 19, 2004 12:29 AM
> > Subject: Re: Index Size
> >
> >
> > Hi,
> >  Please check for hidden files in the index folder. If
> > you are using Linux, do something like
> >
> > ls -al 
> >
> > I am also facing a similar problem where the index
> > size is greater than the data size. In my case there
> > were some hidden temporary files which Lucene
> > creates. That was taking half of the total size.
> >
> > My problem is that after deleting the temporary files,
> > the index size is the same as that of the data size. That
> > again seems to be a problem. I am yet to find out the
> > reason...
> >
> > Thanks,
> >george
> >
> >
> >  --- Rob Jose <[EMAIL PROTECTED]> wrote:
> > > Hello
> > > I have indexed several thousand (52 to be exact)
> > > text files and I keep running out of disk space to
> > > store the indexes.  The size of the documents I have
> > > indexed is around 2.5 GB.  The size of the Lucene
> > > indexes is around 287 GB.  Does this seem correct?
> > > I am not storing the contents of the file, just
> > > indexing and tokenizing.  I am using Lucene 1.3
> > > final.  Can you guys let me know what you are
> > > experiencing?  I don't want to go into production
> > > with something that I should be configuring better.
> > >
> > >
> > > I am not sure if this helps, but I have a temp index
> > > and a real index.  I index the file into the temp
> > > index, and then merge the temp index into the real
> > > index using the addIndexes method on the
> > > IndexWriter.  I have also set the production writer
> > > setUseCompoundFile to true.  I did not set this on
> > > the temp index.  The last thing that I do before
> > > closing the production writer is to call the
> > > optimize method.
> > >
> > > I would really appreciate any ideas to get the index
> > > size smaller if it is at all possible.
> > >
> > > Thanks
> > > Rob





Re: Lucene with English and Spanish Best Practice?

2004-08-21 Thread Grant Ingersoll
 I think the Snowball stuff works well, although I have only used the
English Porter stemmer implementation.

As for indexes, do you anticipate adding more fields later in Spanish? 
Is the content just a translation of the English, or do you have
separate content in Spanish?  Are your users querying in only one
language (cross-lingual) or are the Spanish speakers only querying
against Spanish content?

I am doing Arabic and English (and have done Spanish, French, and
Japanese in the past), although our cross-lingual system supports any
languages that you have resources for.  We lean towards separate
indexes, but mostly b/c they are based on separate content.  The key is
you have to be able to match up the analysis of the query with the
analysis of the index.  Having a mixed index may make this more
difficult.  If you have a mixed index, would you filter out Spanish
results that had hits from an English query?  For instance, what if the
query was a term that is common to both languages (banana, mosquito,
etc.)?  Or are you requiring the user to specify which fields they are
searching against?  I guess we really need to know more about how your
users are going to be interacting.
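
If you do end up with one index per language, matching the query analysis
to the index analysis can stay simple.  A sketch, assuming the sandbox
SnowballAnalyzer and Lucene 1.4's static QueryParser.parse (the index
paths, the "contents" field name, and the lang flag are placeholders):

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.snowball.SnowballAnalyzer;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;

    public class LangSearch {
      public static Hits search(String lang, String userQuery) throws Exception {
        // use the same analyzer family at query time as at index time
        Analyzer analyzer =
            new SnowballAnalyzer("es".equals(lang) ? "Spanish" : "English");
        String indexDir = "es".equals(lang) ? "/idx/spanish" : "/idx/english";
        Query query = QueryParser.parse(userQuery, "contents", analyzer);
        IndexSearcher searcher = new IndexSearcher(indexDir);
        return searcher.search(query);
      }
    }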

-Grant

>>> [EMAIL PROTECTED] 8/20/2004 5:27:40 PM >>>
Hello,

I'm interested in any feedback from anyone who has worked through
implementing Internationalization (I18N) search with Lucene or has ideas
for this requirement.  Currently, we're using Lucene with straight
English and are looking to add Spanish to the mix (with maybe more
languages to follow).  

This is our current IndexWriter setup utilizing the
PerFieldAnalyzerWrapper:

   PerFieldAnalyzerWrapper analyzer =
       new PerFieldAnalyzerWrapper(new StandardAnalyzer());
   analyzer.addAnalyzer(FIELD_TITLE_STARTS_WITH, new WhitespaceAnalyzer());
   analyzer.addAnalyzer(FIELD_CATEGORY, new WhitespaceAnalyzer());
   IndexWriter writer = new IndexWriter(indexDir, analyzer, create);

Would people suggest we switch this over to Snowball so there are
English and Spanish Analyzers and IndexWriters?  Something like this:

PerFieldAnalyzerWrapper analyzerEnglish =
    new PerFieldAnalyzerWrapper(new SnowballAnalyzer("English"));
analyzerEnglish.addAnalyzer(FIELD_TITLE_STARTS_WITH, new WhitespaceAnalyzer());
analyzerEnglish.addAnalyzer(FIELD_CATEGORY, new WhitespaceAnalyzer());
IndexWriter writerEnglish =
    new IndexWriter(indexDir, analyzerEnglish, create);

PerFieldAnalyzerWrapper analyzerSpanish =
    new PerFieldAnalyzerWrapper(new SnowballAnalyzer("Spanish"));
analyzerSpanish.addAnalyzer(FIELD_TITLE_STARTS_WITH, new WhitespaceAnalyzer());
analyzerSpanish.addAnalyzer(FIELD_CATEGORY, new WhitespaceAnalyzer());
IndexWriter writerSpanish =
    new IndexWriter(indexDir, analyzerSpanish, create);


Are multiple indexes or mirrors of each index then usually created for
every language?  We currently have 4 indexes that are all English. 
Would we then create 4 more that are Spanish?  Then at search time we
would determine the language and which set of indexes to search against,
English or Spanish.

Or another approach could be to add a Spanish field to the existing 4
indexes since most of the indexes have only one field that will be
translated from English to Spanish.


thanks a bunch,
chad.





Re: spanish stemmer

2004-08-23 Thread Grant Ingersoll
Ernesto,


http://snowball.tartarus.org/texts/introduction.html might help w/ your understanding. 
 The link provides basic info on why stemmers are valuable (not necessarily any 
insight on how the Spanish version works).  Of course, they don't solve every problem 
and in some cases may make things worse.

A stemmer is not required to return a whole word.  

Hope this helps.
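
If you want to see exactly what the stemmer produces for the words you
mention, a quick sketch, assuming the generated net.sf.snowball.ext
classes that ship in the snowball jar:

    import net.sf.snowball.ext.SpanishStemmer;

    public class StemCheck {
      public static void main(String[] args) {
        SpanishStemmer stemmer = new SpanishStemmer();
        String[] words = {"basquet", "basquetbol", "voley", "voleybol", "futbol"};
        for (int i = 0; i < words.length; i++) {
          stemmer.setCurrent(words[i]);
          stemmer.stem();
          // prints word -> stem, so you can check the 'bol' endings yourself
          System.out.println(words[i] + " -> " + stemmer.getCurrent());
        }
      }
    }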

>>> [EMAIL PROTECTED] 8/23/2004 9:29:30 AM >>>
Hello

I use the Snowball jar to implement my SpanishAnalyzer. I found that
words ending in 'bol' are not stripped.
For example:

In Spanish, for basketball you can say basquet or basquetbol, but to the
SpanishStemmer these are different words.
The same goes for voley and voleybol.

Not so with futbol (football): we do not say fut for futbol, and 'fut' does not
exist in Spanish.

Do you think I am correct?

Can you change this?

Ernesto.







Re: Advanced timestamp usage (or global value storage)

2004-08-25 Thread Grant Ingersoll


>>> [EMAIL PROTECTED] 8/25/2004 11:50:01 AM >>>
On Aug 25, 2004, at 11:39 AM, Bernhard Messer wrote:

> If you already store the date time when the doc was indexed, you could
> use the following trick to get the last document added to the index:
>
>    while (--maxDoc > 0) {

Yes, but that's a linear search :(

>>>
You are right, in the worst case, this would be linear, but that would
require you to delete a lot of documents.  I would bet that on average,
and arguably in nearly all cases, you would go through very few iterations
before finding the doc you are interested in.




Re: Advanced timestamp usage (or global value storage)

2004-08-25 Thread Grant Ingersoll
Avi,

I may be confused; as I understand it, you said you were interested in
the last document indexed, and Bernhard's code does that.  Lucene adds
documents sequentially, so counting backwards from maxDoc() should
get you the last indexed document pretty quickly.  If all documents were
deleted, this would go through all documents; otherwise, it is
going to find it pretty quickly.  It doesn't have to traverse through
all of the documents, it just has to find the "first" document that is
not deleted (since we are starting at the end of the list and going
backward).
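
In code, Bernhard's trick amounts to something like this sketch (assuming
the 1.4 IndexReader API; it returns -1 only if every document has been
deleted):

    // scan backwards from the end; the first non-deleted doc is the
    // most recently added surviving document
    static int lastLiveDoc(IndexReader reader) {
      for (int i = reader.maxDoc() - 1; i >= 0; i--) {
        if (!reader.isDeleted(i)) {
          return i;
        }
      }
      return -1;   // everything was deleted
    }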

>>> [EMAIL PROTECTED] 8/25/2004 12:01:50 PM >>>
On Aug 25, 2004, at 11:57 AM, Grant Ingersoll wrote:

> You are right, in the worst case, this would be linear,

No, in _all_ cases this would be linear.

> I would bet that on average,
> and arguably in nearly all cases, you would go through very few iterations
> before finding the doc you are interested in.

Then you don't understand what I'm trying to do. I'm trying to find the
document with the biggest value for the field. That would involve
checking the field's value in every document to ensure this.

Avi

-- 
Avi 'rlwimi' Drissman
[EMAIL PROTECTED] 
Argh! This darn mail server is trunca





Re: Term highlighting and Term vector patch

2004-09-17 Thread Grant Ingersoll
We use them to do relevance feedback.  We _will_ be using the offset
info, etc. for doing some more sophisticated collection management things
that allow the user to interact with a collection of documents.

>>> [EMAIL PROTECTED] 9/16/2004 9:23:54 PM >>>
If you look at this:
  http://www.simpy.com/simpy/Search.do?op=user&username=otis&q=lucene 

You will see a 'similar' link next to each search result item.  Finding
similar web pages can be implemented using term vectors.

Otis

--- Terry Steichen <[EMAIL PROTECTED]> wrote:

> Christoph,
> 
> Just curious - how are you currently using Term Vectors?  They seem
> to be a neat feature with lots of future promise, but I'm not sure
> how to best use them now.
> 
> Regards,
> 
> Terry
>   - Original Message - 
>   From: Christoph Goller 
>   To: Lucene Developers List 
>   Sent: Thursday, September 16, 2004 5:01 AM
>   Subject: Re: Term highlighting and Term vector patch
> 
> 
>   Hi Grant,
> 
>   I will try to look into your latest code by the end of September, but I
>   probably won't find time earlier. I am using the current TermVectors
>   very successfully.
>   Thanks for the excellent code.
> 
> 





Re: accessing Term Vector info

2004-10-04 Thread Grant Ingersoll
See IndexReader#getTermFreqVector() in the javadocs
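
For example, a minimal retrieval sketch (the field name is taken from your
message; the index path and doc number are placeholders):

    IndexReader reader = IndexReader.open("/path/to/index");
    TermFreqVector tfv = reader.getTermFreqVector(docNumber, "FULL_TEXT");
    if (tfv != null) {                          // null if no vector was stored
      String[] terms = tfv.getTerms();          // the distinct terms
      int[] freqs = tfv.getTermFrequencies();   // parallel frequency array
    }
    reader.close();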

>>> [EMAIL PROTECTED] 10/4/2004 10:29:30 AM >>>
hi all


  I am indexing documents consisting of fields for a database id and text.
  The text field is created as new Field("FULL_TEXT", text, false, true,
  true, true) in order to store the Term Vector info.  How do I access it?


  regards

  Rupinder





Re: Arabic analyzer

2004-10-07 Thread Grant Ingersoll
Someone posted an Arabic analyzer about a year ago; however, I don't
think the licensing was very friendly, and we no longer use it.

We have a cross language system that works w/ Arabic (among other
languages).  We have written several stemmers based on the literature
that perform pretty well
and were not too difficult to implement (but are not available as open
source at this point).  Light stemming seems to work much better in IR
applications than aggressive stemmers due to the problems with roots
discussed earlier.

-Grant

--
Grant Ingersoll
Sr. Software Engineer
Center for Natural Language Processing
Syracuse University
School of Information Studies
http://www.cnlp.org 



>>> [EMAIL PROTECTED] 10/7/2004 8:45:42 AM >>>
Dawid Weiss wrote:
> 
>> nothing to do with each other; furthermore, Arabic uses phonetic
>> indicators on each letter called diacritics that change the way you
>> pronounce the word, which in turn changes the word's meaning, so two
>> words spelled exactly the same way with different diacritics will mean
>> two separate things,
> 
> 
> Just to point out the fact: most Slavic languages also use diacritic
> marks (above, like 'acute', or 'dot' marks, or below, like the Polish
> 'ogonek' mark). Some people argue that they can be stripped off the
> text upon indexing and that the queries usually disambiguate the
> context of the word.

Hmm. This brings up a question: the algorithmic stemmer package from 
Egothor works quite well for Polish (http://www.getopt.org/stempel); 
wouldn't it work well for Arabic, too?

I lack the necessary expertise to evaluate results (knowing only two or
three Arabic words ;-) ), but I can certainly help someone to get 
started with testing...

-- 
Best regards,
Andrzej Bialecki

-
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-
FreeBSD developer (http://www.freebsd.org)





Re: query term frequency

2005-01-28 Thread Grant Ingersoll
I implemented a Query version of the TermVector:

org.apache.lucene.search.QueryTermVector

Works off of an array of Strings or a String and an Analyzer.  Is this
what you are looking for?
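
Usage is just a few lines; a sketch, assuming the 1.4.x class (the query
string is a placeholder):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.search.QueryTermVector;

    QueryTermVector qtv = new QueryTermVector("foo bar foo", new StandardAnalyzer());
    String[] terms = qtv.getTerms();              // distinct query terms
    int[] freqs = qtv.getTermFrequencies();       // parallel frequencies
    for (int i = 0; i < terms.length; i++) {
      System.out.println(terms[i] + " occurs " + freqs[i] + " time(s) in the query");
    }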


>>> [EMAIL PROTECTED] 1/28/2005 6:33:18 AM >>>
On Jan 27, 2005, at 10:24 PM, Jonathan Lasko wrote:
> No, the number of occurrences of a term in a Query.

Nothing built-in gives you this.  You'd have to dissect the Query  
clause-by-clause and cast each clause to the proper type to pull the  
terms from them.  The Highlighter code does this.

If there is a better way, I'd like to know.

Erik


>
> Jonathan
>
> Quoting David Spencer <[EMAIL PROTECTED]>:
>
>> Jonathan Lasko wrote:
>>
>>> What do I call to get the term frequencies for terms in the Query?  I
>>> can't seem to find it in the Javadoc...
>>
>> Do you mean the # of docs that have a term?
>>
>>
> http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexReader.html#docFreq(org.apache.lucene.index.Term)
>>> Thanks.
>>>
>>> Jonathan
>>>
>>>