Re: Clustering question: searching two diferent indexes

2004-06-23 Thread Albert Vila
Thanks Otis, but can I merge two indexes with different fields?
My big index has these fields: code, title, content, language and date. I 
add the new documents incrementally.

The clustering index only contains the fields code and cluster. Will merging 
the big index with the clustering one preserve the order of the big 
one? For example, if I have the following indexes:
Big index
code_1, title_1, content_1, language_1, date_1
code_2, title_2, content_2, language_2, date_2
...

Clustering index
code_1, cluster_1
code_2, cluster_2
...
then the new merged index will be:
Merged index
code_1, title_1, content_1, language_1, date_1, cluster_1
code_2, title_2, content_2, language_2, date_2, cluster_2
...
If I can do that then fine, but I think the merging process uses the 
Lucene internal ID to match the documents. I want to use the code field to 
do that matching; is that possible? I cannot be sure the Lucene 
internal IDs are the same for the same codes in both indexes.

Thanks again,
Albert
Otis Gospodnetic wrote:
(re-directing to lucene-user list)
Albert,
If I understand your question correctly... You could run a query like
the one you gave on both indices, but if one of them contains documents
that have only one of those fields (cluster), then there will never be
any matches in the second index.
However, why not leave your big index alone, add documents to a new,
smaller index, and then merge them periodically?  I may be off with
this; it sounds like this is what you want to do, but I'm not certain I
understood you fully.
Otis
--- Albert Vila [EMAIL PROTECTED] wrote:

Hi all,
I was wondering if I can search using the MultiSearcher over two 
different indexes at the same time (with different fields).
I've got one big index, with the code, title, content, language, etc. 
fields (new documents are added incrementally). Now I have to introduce 
a clustering field. The problem is that I have to update the whole index 
each time the clusters change, and I don't have enough time to do it (I 
want to check for new clusters every 10 minutes, and it takes 25 minutes 
to reindex the whole index).
A query example could be: language:0 and title:java and cluster:0

Can I leave the big index without any changes and create a new index 
with only the fields code and cluster, and perform the 
searches using these two indexes? I think I cannot do that without 
changing the code. It would need a post-processing step, matching all 
codes returned from index 1 against index 2.

Does anyone have a solution for this problem? I would appreciate it.
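
The post-processing step described above could look roughly like the following Java sketch. It assumes each index is searched separately and that the application collects the code values of the hits from each search. The names CodeJoin and intersect are hypothetical, and plain collections stand in for the Lucene search results.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: not part of Lucene.
class CodeJoin {
    // Keeps the codes from the big-index search that were also returned
    // by the cluster-index search, preserving the big-index order.
    static List<String> intersect(List<String> bigIndexCodes,
                                  Collection<String> clusterIndexCodes) {
        Set<String> inCluster = new HashSet<>(clusterIndexCodes);
        List<String> result = new ArrayList<>();
        for (String code : bigIndexCodes) {
            if (inCluster.contains(code)) {
                result.add(code);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Codes matching "language:0 and title:java" in the big index (made-up data).
        List<String> big = Arrays.asList("code_7", "code_2", "code_9");
        // Codes matching "cluster:0" in the clustering index (made-up data).
        List<String> cluster = Arrays.asList("code_9", "code_7", "code_4");
        System.out.println(intersect(big, cluster)); // prints [code_7, code_9]
    }
}
```

The cost of this approach grows with the number of hits in each index, since every code from both result sets has to be read before the intersection is known.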
   


 

--
Albert Vila
Director de proyectos I+D
http://www.imente.com
902 933 242
[iMente “La información con más beneficios”]
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Clustering question: searching two diferent indexes

2004-06-23 Thread Otis Gospodnetic
Albert,

--- Albert Vila [EMAIL PROTECTED] wrote:
 Thanks Otis, but can I merge two indexes with different fields?

Yes.  Documents with different Fields can be stored in the same index.
Not every Document has to have all fields, and it can even have a
completely different set of Fields.

 My big index has these fields: code, title, content, language and
 date. I add the new documents incrementally.
 
 The clustering index only contains the fields code and cluster.
 Will merging the big index with the clustering one preserve the
 order of the big one?

I don't fully understand what you mean by 'order'.  If you are asking
whether internal document Ids will remain the same, the answer is
negative.  If you have deleted some documents, there will be gaps in
document Id sequence, which Lucene will fill, thus re-assigning
internal document Ids.
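
A toy model of the renumbering described above, with a plain Java list standing in for an index. No Lucene classes are used, and DocIdCompaction and compact are made-up names. A document's position in the list plays the role of its internal Id, and dropping deleted entries shifts every later document down.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Set;

// Hypothetical illustration: not Lucene code.
class DocIdCompaction {
    // Returns the surviving documents in order; the position in the
    // returned list is the document's new "internal Id".
    static List<String> compact(List<String> docs, Set<Integer> deletedIds) {
        List<String> survivors = new ArrayList<>();
        for (int id = 0; id < docs.size(); id++) {
            if (!deletedIds.contains(id)) {
                survivors.add(docs.get(id));
            }
        }
        return survivors;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("code_a", "code_b", "code_c");
        List<String> merged = compact(docs, Collections.singleton(1));
        // prints: code_c: old id 2, new id 1
        System.out.println("code_c: old id " + docs.indexOf("code_c")
                + ", new id " + merged.indexOf("code_c"));
    }
}
```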

 For example, if I have the following indexes:
 Big index
 code_1, title_1, content_1, language_1, date_1
 code_2, title_2, content_2, language_2, date_2
 
 
 Clustering index
 code_1, cluster_1
 code_2, cluster_2
 
 
 then the new merged index will be:
 
 Merged index
 code_1, title_1, content_1, language_1, date_1, cluster_1
 code_2, title_2, content_2, language_2, date_2, cluster_2
 
 
 If I can do that then fine, but I think the merging process uses the 
 Lucene internal ID to match the documents. I want to use the code field
 to do that matching; is that possible? I cannot be sure the Lucene 
 internal IDs are the same for the same codes in both indexes.

Are you storing the internal Lucene Document Id in the 'code' field? 
If you are, I suggest you change your application to use its own set of
unique Ids to serve as 'primary keys' in your indices.

Otis





Re: Clustering question: searching two diferent indexes

2004-06-23 Thread Albert Vila
By 'order', I mean that I'm adding the documents to the big index sorted 
by date (in order to speed up sorting). I want to preserve 
this ordering after the merging process.

I'm not using the internal Lucene ID in the code field. The code field 
contains my own IDs. I was asking whether I can do the merge using my own 
IDs (the code field) and not the Lucene internal IDs. For example:

luceneID_0, code_x, title_x, content_x, language_x, date_x
luceneID_1, code_y, title_y, content_y, language_y, date_y

luceneID_0, code_y, cluster_y
luceneID_1, code_x, cluster_x

Will the previous index structure produce an inconsistent merged index?
I want to achieve the following merged index:
luceneID_0, code_x, title_x, content_x, language_x, date_x, cluster_x
luceneID_1, code_y, title_y, content_y, language_y, date_y, cluster_y
Thanks

Re: Clustering question: searching two diferent indexes

2004-06-23 Thread Otis Gospodnetic
Aha, now I see what you mean.  You didn't mention 'date' before. :)
So, dates will get preserved, and you will be able to keep using them
for sorting.  However, Lucene will not automatically recognize your 'PK
fields' and merge fields from two Documents with the same PK into a
single Document.  You can think of 'merge' as 'add' (well, the method
name is addIndexes, actually :)), so Lucene will simply make a
cumulative index from your two separate indices:

luceneID_0, code_x, title_x, content_x, language_x, date_x
luceneID_1, code_y, title_y, content_y, language_y, date_y
luceneID_2, code_y, cluster_y
luceneID_3, code_x, cluster_x
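
A minimal sketch of the 'merge as add' behaviour, using plain Java maps as stand-ins for Documents. CumulativeMerge, doc, concatenate and countWithBoth are hypothetical names and none of this is Lucene code. The point is only that concatenating two indexes never produces a single Document carrying both title and cluster.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration: plain maps stand in for Lucene Documents.
class CumulativeMerge {
    // Builds a "document" from alternating field names and values.
    static Map<String, String> doc(String... keyValues) {
        Map<String, String> d = new LinkedHashMap<>();
        for (int i = 0; i + 1 < keyValues.length; i += 2) {
            d.put(keyValues[i], keyValues[i + 1]);
        }
        return d;
    }

    // 'Merge' in the cumulative sense: all documents of both indexes, one after the other.
    static List<Map<String, String>> concatenate(List<Map<String, String>> first,
                                                 List<Map<String, String>> second) {
        List<Map<String, String>> merged = new ArrayList<>(first);
        merged.addAll(second);
        return merged;
    }

    // Counts documents that carry both fields.
    static int countWithBoth(List<Map<String, String>> index, String a, String b) {
        int n = 0;
        for (Map<String, String> d : index) {
            if (d.containsKey(a) && d.containsKey(b)) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        List<Map<String, String>> big = Arrays.asList(
                doc("code", "x", "title", "java"),
                doc("code", "y", "title", "lucene"));
        List<Map<String, String>> clusters = Arrays.asList(
                doc("code", "y", "cluster", "0"),
                doc("code", "x", "cluster", "1"));
        List<Map<String, String>> merged = concatenate(big, clusters);
        System.out.println(merged.size());                              // prints 4
        System.out.println(countWithBoth(merged, "title", "cluster")); // prints 0
    }
}
```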

Otis



Re: Storing data in Lucene or Xindice

2004-06-23 Thread Otis Gospodnetic
(redirecting to lucene-user list)

Hello Rob,

I think you will end up with a simpler final result if you try saving
everything in a single data source.  I have not used Xindice, so I
cannot comment on its features, performance, etc., but judging from
your description, you could simply use Lucene to index the textual
information from XML feeds or HTML.

For XML parsing and indexing, you can see the article I wrote for IBM
developerWorks:
http://www-106.ibm.com/developerworks/java/library/j-lucene/

If you will be doing a lot of parsing, you will want to use something
faster than Digester, though.  Maybe Electric XML parser.

For HTML you can use NekoHTML, JTidy, htmlparser (sf.net), or Brian
Goetz's HTMLParser.

Now that I think about it, I seem to recall that Xindice uses Lucene
under the hood, but I can't find any information that confirms this
now.  Maybe I'm mixing something up.

Otis



--- Rob Clews [EMAIL PROTECTED] wrote:
 Hi,
 
 I'm currently looking at using Lucene to index some XML feeds we
 receive for content. However, some of the feeds contain the articles'
 contents and some don't. The feeds that do contain the contents are in
 XML; for the others we must retrieve them in HTML.
 
 I was originally going to store the XML contents from the feed in
 Xindice and retrieve them for each result from a Lucene query, but I
 guess I could store them in Lucene. We expect to build up a lot of
 content from shortish articles on the web and our main focus is
 speed, so would I be best storing the contents in Lucene or Xindice?
 
 Would storing more data (non-indexable) in Lucene slow it down on
 queries?
 
 Thanks,
 Rob Clews
 
 



Re: ANN: Luke v. 0.5 released

2004-06-23 Thread Andrzej Bialecki
Vladimir Yuryev wrote:
 Hi Andrzej!
 Congratulations on the successful release. RussianAnalyzer works with my 
 indexes, but there are problems with some words. These problem words are 
 found only by the wildcard method.

I don't quite understand what you are saying... Do you suspect there is 
a bug in Luke somewhere on the Search tab? If that's the case, please 
provide an example.

 Besides, AnalyzerTool works with these 
 words without problems.

 There is one more small discrepancy on the webpage http://www.getopt.org/luke/
 - Remember to put both JARs on your classpath, e.g.: java -classpath 
 luke.jar;lucene.jar org.getopt.luke.Luke
 + Remember to put both JARs on your classpath, e.g.: java -classpath 
 luke.jar:lucene.jar org.getopt.luke.Luke

Well, both versions are correct - just the platform is different :-). 
I'll make a clarification. Thank you!
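
The platform difference can be sketched in Java: java.io.File.pathSeparator is ";" on Windows and ":" on Unix-like systems, so building the classpath from it gives the right form on either. LukeClasspath and classpath are made-up names for this illustration, and the JAR names are the ones from the webpage text above.

```java
import java.io.File;

// Hypothetical helper for illustration only.
class LukeClasspath {
    // Joins JAR names with the given path separator.
    static String classpath(String separator, String... jars) {
        return String.join(separator, jars);
    }

    public static void main(String[] args) {
        // File.pathSeparator is chosen by the JVM for the current platform.
        String cp = classpath(File.pathSeparator, "luke.jar", "lucene.jar");
        System.out.println("java -classpath " + cp + " org.getopt.luke.Luke");
    }
}
```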

--
Best regards,
Andrzej Bialecki
-
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-
FreeBSD developer (http://www.freebsd.org)


Re: Clustering question: searching two diferent indexes

2004-06-23 Thread Albert Vila
OK, but with this solution, I cannot perform queries like:
get all codes that match title:java and language:english and cluster:0
Albert

Re: Clustering question: searching two diferent indexes

2004-06-23 Thread Otis Gospodnetic
Correct, that is what I meant when I said your application will have to
handle your particular merge.  Instead of using the addIndexes method, your
application will have to go through all Documents in the smaller index
(the one with cluster fields), get the PK of each Doc, look up that Doc
by PK in the big index, delete it from there if it exists, and re-add
it to the big index with the cluster field included.
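
The steps above can be sketched in Java with plain collections standing in for the two indexes. This is only an outline of the loop, not Lucene code: ClusterUpdate, doc and applyClusters are hypothetical names, and a real application would replace the list operations with its own Lucene lookup, delete and add calls.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration: lists of maps stand in for Lucene indexes.
class ClusterUpdate {
    // Builds a "document" from alternating field names and values.
    static Map<String, String> doc(String... keyValues) {
        Map<String, String> d = new LinkedHashMap<>();
        for (int i = 0; i + 1 < keyValues.length; i += 2) {
            d.put(keyValues[i], keyValues[i + 1]);
        }
        return d;
    }

    // For each cluster document: find the big-index document with the same
    // code (the PK), delete it, and re-add it with the cluster field set.
    static void applyClusters(List<Map<String, String>> bigIndex,
                              List<Map<String, String>> clusterIndex) {
        for (Map<String, String> clusterDoc : clusterIndex) {
            String code = clusterDoc.get("code");
            Map<String, String> existing = null;
            for (Iterator<Map<String, String>> it = bigIndex.iterator(); it.hasNext();) {
                Map<String, String> candidate = it.next();
                if (code.equals(candidate.get("code"))) {
                    existing = candidate;
                    it.remove(); // "delete it from there if it exists"
                    break;
                }
            }
            if (existing == null) {
                continue; // no document with this code in the big index
            }
            Map<String, String> updated = new LinkedHashMap<>(existing);
            updated.put("cluster", clusterDoc.get("cluster"));
            bigIndex.add(updated); // re-add; the document now sits at the end
        }
    }

    public static void main(String[] args) {
        List<Map<String, String>> big = new ArrayList<>();
        big.add(doc("code", "x", "title", "java"));
        big.add(doc("code", "y", "title", "lucene"));
        List<Map<String, String>> clusters = new ArrayList<>();
        clusters.add(doc("code", "y", "cluster", "0"));
        clusters.add(doc("code", "x", "cluster", "1"));
        applyClusters(big, clusters);
        System.out.println(big);
    }
}
```

In this sketch each updated document is appended at the end, so the original insertion order is not kept. If the application relies on insertion order (for example, documents added sorted by date), that is worth checking against the real index.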

Otis




RE: using boost factor

2004-06-23 Thread Anson Lau
Hi guys,

It seems that to really customise the scoring in Lucene, one has to go
into the Lucene source.

I spent a fair bit of time looking into this, and it seems to me that not all
of the scoring API is exposed.  The formula documented on the Similarity class
seems to explain how a term is scored, but not, for example, how the final
score on a Boolean query is computed from each individual component. (Please
correct me if I'm wrong.)  Normalisation is another part where the API is
not exposed.

Anson

-Original Message-
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, June 23, 2004 3:51 AM
To: Lucene Users List
Subject: Re: using boost factor

Hello Anson,

I would look at IndexSearcher's explain method:
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/IndexSearcher.html#explain(org.apache.lucene.search.Query,%20int)

This should give you insight into what's contributing to the high/low
scores, thus telling you what you can tweak.  Perhaps it's just the
boost, perhaps some other similarity factors.

Using explain should provide you information such as this, for example:
http://www.mozdex.com/explain.jsp?idx=2id=2067257query=goober

I hope this helps.  Somebody else will probably be able to give more
information, but this should get you started while you wait.

Otis

--- Anson Lau [EMAIL PROTECTED] wrote:
 Hi guys,
 
 Let's say I want to search for the term "hello world" over 3 fields with
 different boosts:
 
 ((field1:hello field1:world)^0.001 (field2:hello field2:world)^100
 (field3:hello field3:world)^2)
 
 Note I've given field1 a really low boost, a heavy boost to field2
 and a REALLY heavy boost to field3.
 
 What is happening to me is that a document that matches both field1 and
 field2 will have a higher score than a document that matches field3 only,
 even though field3's boost is WAY higher.
 
 Can I change this behaviour such that the match in field3 only will
 actually have a higher score because of the boost?
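
A toy model of why a two-field match can outscore a single-field match. It assumes, as a simplification, that a Boolean query's score is the sum of the boosted scores of the matching clauses multiplied by a coordination factor (matching clauses divided by total clauses). Term frequency, idf, length norms and query normalisation are deliberately left out, so this is not Lucene's actual formula, and BoostToy and score are hypothetical names. With the boosts exactly as written in the query above (0.001, 100 and 2), field2 carries the largest boost, so the field1+field2 match wins in this model.

```java
// Hypothetical illustration: a simplified scoring model, not Lucene's.
class BoostToy {
    // score = (sum of boost * rawClauseScore over matching clauses)
    //         * (number of matching clauses / total clauses)
    static double score(double[] boosts, boolean[] matches, double rawClauseScore) {
        double sum = 0.0;
        int matched = 0;
        for (int i = 0; i < boosts.length; i++) {
            if (matches[i]) {
                sum += boosts[i] * rawClauseScore;
                matched++;
            }
        }
        return sum * matched / boosts.length;
    }

    public static void main(String[] args) {
        double[] boosts = {0.001, 100, 2}; // field1, field2, field3 as in the query
        double twoFields = score(boosts, new boolean[] {true, true, false}, 1.0);
        double oneField = score(boosts, new boolean[] {false, false, true}, 1.0);
        System.out.println("field1+field2 match: " + twoFields);
        System.out.println("field3-only match:   " + oneField);
    }
}
```

The explain method mentioned above is the way to see which factors actually dominate for a real query and index.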
 
 Thanks,
 
 Anson





Delete Indexed from Merged Document

2004-06-23 Thread Karthik N S
Guys,

Has anybody out there tried deleting or updating indexed files in a
merged index? If so, please explain how to do this.


with regards
Karthik




-Original Message-
From: Karthik N S [mailto:[EMAIL PROTECTED]
Sent: Wednesday, June 23, 2004 9:24 AM
To: Lucene Users List
Subject: RE: Delete Indexed from Merged Document


Hi Otis,

The link you specified shows how to update an indexed file
(deleting the old one and then adding the new one).

But my question, to be more specific, is:

When we have merged more than 2 indexes [using
writer.addIndexes(luceneDirs)], how do we delete one of the
indexed files from the merged index in order to insert a new,
updated one?

Please provide a sample code snippet for this.


with regards
Karthik

-Original Message-
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
Sent: Tuesday, June 22, 2004 12:52 PM
To: Lucene Users List
Subject: Re: Delete Indexed from Merged Document


Hello Karthik,

Here is the answer: http://www.jguru.com/faq/view.jsp?EID=492423

Otis

--- Karthik N S [EMAIL PROTECTED] wrote:
> Dev Guys,
>
> Apologies, please.
>
> How do I DELETE an indexed document from a MERGED index file?
>
> Can somebody write me some code snippets on this, please?
>
> With Regards
> Karthik








RE: Delete Indexed from Merged Document

2004-06-23 Thread Karthik N S

Hi Mr Wolf,

What is this:

// remove the document from index
int docID = hits.id(0);

and can I increment the 0 in the brackets for deletion?


Thx in advance

Karthik

-Original Message-
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED]
Sent: Wednesday, June 23, 2004 5:33 PM
To: [EMAIL PROTECTED]
Subject: AW: Delete Indexed from Merged Document


Hello,
Karthik N S [mailto:[EMAIL PROTECTED] wrote:
> Has Somebody out there tried DELETING/UPDATION  of
> INDEXED Files from a
> MERGED Index Format,
> If HowTo do this Please Explain
Of course you can delete or update a document in a merged index.
It works the same way as for any other index. You need a
unique key (e.g. the file name or URI), indexed so that it can be
searched, to find the right document, because the internal
document numbers change after merging indexes or after deleting
documents and optimizing an index. Using this key you can search
for the document and remove it. It doesn't matter whether your index
was created by merging several indexes or not.
Example:
import org.apache.lucene.document.*;
import org.apache.lucene.index.*;
import org.apache.lucene.search.*;
import org.apache.lucene.store.FSDirectory;

/* Create index: */
Document document = new Document();
// the key must be unique for each document!
document.add(Field.Keyword("filename", file_name));
document.add(Field.Text("content", file_content));
writer.addDocument(document);
/* ... */
writer.close();

/* Update or remove document: use the file name to find the original
   document and remove it from the index */
FSDirectory indexDirectory = FSDirectory.getDirectory(indexPath, false);
IndexReader indexReader = IndexReader.open(indexDirectory);
IndexSearcher indexSearcher = new IndexSearcher(indexReader);
// create query and search for the document using its filename
TermQuery query = new TermQuery(new Term("filename", file_name));
Hits hits = indexSearcher.search(query);
if ( hits.length() > 0 ) {
    // remove the document from index
    int docID = hits.id(0);
    indexReader.delete( docID );
}
// else: this is a new file or already removed, so we can simply add it.
indexSearcher.close();
indexReader.close();
indexDirectory.close();
// now open an IndexWriter for the same index and add the updated file
// as new document
/* done */
Hope it helps. Regards,
Wolf-Dietrich Materna




AW: Delete Indexed from Merged Document

2004-06-23 Thread Wolf-Dietrich . Materna
Hello,
Karthik N S [mailto:[EMAIL PROTECTED] wrote:
> Hi
> Mr Wolf
Wolf-Dietrich is my first name, so leave out "Mr." or use
my family name (which is uncommon here).

> What is this
>
> // remove the document from index
> int docID = hits.id(0);
>
> and can I increment the 0 factor  in the bracket ...for deletion
Yes, but there is no reason to do so in this case.
You search for documents by their file name (including the full path!)
and get a result (a kind of list); please read the Javadoc for the Hits
class.
hits.id(0) returns the (internal) ID of the first hit in your result.
This is the document that you want to remove (using
indexReader.delete(...)).
There are no further documents in your result unless your key is not
unique: with a unique key, hits.length() returns 0 or 1.
Regards,
Wolf-Dietrich Materna
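
The reason for looking documents up by a unique key rather than by a remembered internal number can be illustrated with a toy model. This is a plain-Java sketch, not Lucene code. It assumes a document's internal number is simply its position in the index, and it collapses "delete, then optimize" into a single step. The class DocNumberSketch and the file names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of an index: a document's internal number is just its position.
final class DocNumberSketch {
    private final List<String> keys = new ArrayList<>();

    /** Add a document identified by its unique key (e.g. the file name). */
    void add(String key) {
        keys.add(key);
    }

    /** Append every document of another index, as a merge would. */
    void addIndex(DocNumberSketch other) {
        keys.addAll(other.keys);
    }

    /** Internal number for a unique key, or -1 if the key is absent. */
    int docNumber(String key) {
        return keys.indexOf(key);
    }

    /** Delete by key; later documents are renumbered, as after optimizing. */
    boolean deleteByKey(String key) {
        return keys.remove(key);
    }

    public static void main(String[] args) {
        DocNumberSketch small = new DocNumberSketch();
        small.add("/docs/a.txt");
        small.add("/docs/b.txt");

        DocNumberSketch merged = new DocNumberSketch();
        merged.add("/docs/x.txt");
        merged.add("/docs/y.txt");
        merged.addIndex(small);

        // The same document has a different number in each index.
        System.out.println(small.docNumber("/docs/b.txt"));
        System.out.println(merged.docNumber("/docs/b.txt"));

        // Deleting another document shifts the number again.
        merged.deleteByKey("/docs/x.txt");
        System.out.println(merged.docNumber("/docs/b.txt"));
    }
}
```

The key "/docs/b.txt" finds the document in every state of the index, while its number changes after the merge and again after the deletion.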




Categorization

2004-06-23 Thread William W
Hi,
How can I do a categorization of the results ? Is it possible with the 
Lucene API ?
Thanks,
William.



How to use Highlighter concretly ?

2004-06-23 Thread Olivier Catteau
Hi !

I'd like to use the Highlighter class to show a highlighted summary after a search.
But I don't know how to use the Highlighter class correctly.
I found this piece of code, which works well.




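import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryScorer;

public class TestHighlighter {

    public static void main(String[] args) {
        try {
            Analyzer a = new StandardAnalyzer();
            // parse the query against the default field "cached"
            Query q = QueryParser.parse("jennifer lopez", "cached", a);
            String s =
                "the unofficial home page Britney Spears Elizabeth Hurley Kirsten Dunst "
                + "Anna Kournikova Katie Holmes Katherine Heigl Jessica Alba Alyson Hannigan Jennifer "
                + "Lopez Sarah Michelle Gellar";
            Highlighter highlighter = new Highlighter(new QueryScorer(q));
            TokenStream tokenstream =
                a.tokenStream("cached", new java.io.StringReader(s));
            // return up to 2 of the best fragments, joined by "..."
            String summary = highlighter.getBestFragments(tokenstream, s, 2, "...");

            System.out.println("summary : " + summary);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}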





But I don't know how to adapt it. I've run a search and got a Hits instance.
Now I want to give a highlighted summary of each document in the hits.
So it should look like this:



Highlighter highlighter;
TokenStream tokenstream;

for (int i = 0; i < hits.length(); i++) {
    Document doc = hits.doc(i);

    String contents = "I DON'T KNOW HOW TO GET THE CONTENTS OF MY DOC";

    highlighter = new Highlighter(new QueryScorer(query));
    tokenstream = analyzer.tokenStream("contents", new java.io.StringReader(contents));
    String summary = highlighter.getBestFragments(tokenstream, contents, 2, "...");
    System.out.println("summary : " + summary);
}





Here are my questions. First, is this a good way to get a highlighted summary? And
if it is, what is the best way to get the contents of my document (the same way I
used to index their contents, or another way)?

(To be more precise, I use Lucene to index PDF, DOC and TXT files. These documents
can be about 5 MB in size.)

Thanks.


Re: How to use Highlighter concretly ?

2004-06-23 Thread Otis Gospodnetic
Hello Olivier,

You already have your Document instance from Hits.
Use Document's get(String) method to get the content of the given
field.  Note that the Field MUST be stored, not just indexed.
For example:

String text = doc.get("myStoredTextField");

See:
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/document/Document.html#get(java.lang.String)

I suggest you comment out all Highlighter-related code in your
application, and first make sure you know how to get, and really can get,
the contents of the fields whose content you later want to highlight.

Add the Highlighter code only after you get this first step working.

Otis
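
The two steps above (first retrieve the stored text, then highlight it) can be illustrated with a small plain-Java stand-in. This is a sketch only: it does not use Lucene or its Highlighter, and the class StoredFieldSketch, the addField helper and the field names are hypothetical. It assumes that get returns null for a field that was not stored, and it marks matches with simple bold tags.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Plain-Java stand-in for a retrieved document: only stored fields are kept.
final class StoredFieldSketch {
    private final Map<String, String> stored = new HashMap<>();

    /** Hypothetical helper: a field's text is retrievable only if it was stored. */
    void addField(String name, String value, boolean store) {
        if (store) {
            stored.put(name, value);
        }
    }

    /** Returns the stored text, or null when the field was not stored. */
    String get(String name) {
        return stored.get(name);
    }

    /** Wraps each space-separated word that matches a query term in bold tags. */
    static String highlight(String text, Set<String> terms) {
        StringBuilder out = new StringBuilder();
        for (String word : text.split(" ")) {
            if (out.length() > 0) {
                out.append(' ');
            }
            if (terms.contains(word.toLowerCase(Locale.ROOT))) {
                out.append("<B>").append(word).append("</B>");
            } else {
                out.append(word);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        StoredFieldSketch doc = new StoredFieldSketch();
        doc.addField("contents", "Alyson Hannigan Jennifer Lopez Sarah Michelle Gellar", true);
        doc.addField("body", "indexed only", false);

        Set<String> terms = new HashSet<>(Arrays.asList("jennifer", "lopez"));

        // Step 1: make sure the text can be retrieved at all.
        String text = doc.get("contents");
        // Step 2: highlight only when there is text to work on.
        if (text != null) {
            System.out.println(highlight(text, terms));
        }
        System.out.println(doc.get("body")); // null: nothing to highlight
    }
}
```

Checking for null before highlighting mirrors the advice to verify field retrieval first: a field that was only indexed gives the highlighter nothing to work with.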




JavaOne and Lucene

2004-06-23 Thread Erik Hatcher
I'm presenting Lucene in Action Tuesday morning next week at JavaOne 
(TS-2994).

Any other Luceners going to JavaOne?
Erik


carrot2 - Re: Categorization

2004-06-23 Thread David Spencer
Otis Gospodnetic wrote:
> Hello William,
> Lucene does not have a categorization engine, but you may want to look
> at Carrot2 (http://sourceforge.net/projects/carrot2/)

This may be getting off topic, but maybe not. I can't find an example of how
to use Carrot2. It builds easily enough, but there's no obvious example of
what it takes as input (documents?) and what it returns as output
(some list of clustered docs?). I want to use the local interface to
it and hook it into Lucene.

thx,
 Dave

> Otis
> --- William W [EMAIL PROTECTED] wrote:
>> Hi,
>> How can I do a categorization of the results ? Is it possible with
>> the Lucene API ?
>> Thanks,
>> William.



RE: carrot2 - Re: Categorization

2004-06-23 Thread William W
Hi,
Carrot seems to be very interesting but I didn't find a simple example :(
I will try to use it ! :)
Thx,
william.



Re: carrot2 - Re: Categorization

2004-06-23 Thread David Spencer
William W wrote:
> Hi,
> Carrot seems to be very interesting but I didn't find a simple example :(
> I will try to use it ! :)

I can't find an example either, but after going through their source I
think the heart of it is

com.dawidweiss.carrot.filter.stc.algorithm.STCEngine

and com.dawidweiss.carrot.filter.stc.Processor is a class that drives it.

As for the Lucene hook: I'm trying to integrate the two. I think this is how
it would be done: get the search results from Lucene, then set up STCEngine
the way Processor does.

> Thx,
> william.
