RE: Preventing merging by IndexWriter

2006-10-18 Thread Johan Stuyts
> Why go through all this effort when it's easy to make your 
> own unique ID?
> Add a new field to each document "myuniqueid" and fill it in 
> yourself. It'll
> never change then.

I am sorry I did not mention in my post that I am aware of this solution
but that it cannot be used for my purposes. I need the stable ID during
filtering and afterwards for counting for faceted browsing. My tests
show, and from the documentation/mailing lists I conclude, that
retrieving a stable ID from a field during filtering and for each hit in
a query result is too expensive.

After your post I put some more thought into storing a stable ID in a
field. I figured I could read all the stable IDs once and create a map
from Lucene IDs to stable IDs. But this takes too long (> 600 ms on my
laptop) for a small set of documents (< 150,000). Another problem is
that I have to do millions of additional lookups during filtering and
counting.

> Of course, I may misunderstand your problem space enough that this is
> useless. If so, please tell us the problem you're trying to 
> solve and maybe
> wiser heads than mine will have better suggestions

Here is a description of our problem. We want to build a repository that
can handle a number of documents that is in the low millions (we are
designing the repository for 10 million documents initially). Almost all
navigation through this repository will be faceted. For this we need to
be able to filter based on the facet values selected by the user, and we
have to count how many documents in the search result have a particular
facet value for multiple (estimation: 25-40) facet values. The documents
in the repository are constantly changed and we want the faceted
navigation to be updated in near real-time: if the user refreshes a
search page after making changes to a document, the changes should be
visible. I estimate we have about 250-500 ms, the time it takes to go to
another page and refresh it, to update the index(es).

My idea is to use Lucene for regular searching, and use a custom index
for filtering based on facets and for counting the number of matches for
facet values. For this to work, (reasonably) stable IDs are needed so that
updating the facet value index is simply a matter of changing values in a
number of arrays.
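
For illustration only, a rough sketch of the kind of array-based counting this
implies (class and field names are invented; single-valued facets and a
precomputed docId-to-ordinal array are assumed):

import java.util.BitSet;

/** Hypothetical sketch of the array-based facet index described above: one
 *  int[] per facet field, indexed by Lucene document number, holding the
 *  ordinal of that document's facet value (single-valued facets assumed).
 *  Counting for a result set is then a single pass over the filter bits. */
public class FacetCounter {
    private final int[] facetOrdinals;   // docId -> ordinal of the facet value
    private final int numFacetValues;

    public FacetCounter(int[] facetOrdinals, int numFacetValues) {
        this.facetOrdinals = facetOrdinals;
        this.numFacetValues = numFacetValues;
    }

    /** Count facet values over the documents set in 'matches',
     *  e.g. the BitSet produced by a Lucene Filter for the current query. */
    public int[] count(BitSet matches) {
        int[] counts = new int[numFacetValues];
        for (int doc = matches.nextSetBit(0); doc >= 0; doc = matches.nextSetBit(doc + 1)) {
            counts[facetOrdinals[doc]]++;
        }
        return counts;
    }
}

The whole scheme stands or falls with the document numbers staying put, which
is why merging is the problem.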

I am willing to sacrifice search performance for stable IDs if it gains
performance in faceted filtering and counting.

Johan


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Preventing merging by IndexWriter

2006-10-18 Thread Johan Stuyts
> > So my questions are: is there a way to prevent the IndexWriter from
> > merging, forcing it to create a new segment for each indexing batch?
> 
> Already done in the Lucene trunk:
> http://issues.apache.org/jira/browse/LUCENE-672
> 
> Background:
> http://www.gossamer-threads.com/lists/lucene/java-dev/39799#39799
> 
> > And
> > if so, will it still be possible to merge the disk segments 
> when I want
> > to?
> 
> call optimize()

Thanks, I have got it working now. But I think it is not a viable
solution either. These are the problems that I see:
- The number of segments will probably increase too fast, requiring
regular optimizations.
- Given the size of the external data, making it consistent with the
Lucene index will, in the worst case, require processing and writing
hundreds of megabytes to disk.

The biggest problem is of course the combination of the two: having to
process too much data too many times.

Johan

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: index architectures

2006-10-18 Thread Erick Erickson

No, you've got that right. But there's something I think you might be able
to try. Fair warning, I'm remembering things I've read on this list and my
memory isn't what it used to be...

I *think* that if you reduce your result set by, say, a filter, you might
drastically reduce what gets sorted. I'm thinking of something like this
BooleanQuery bq = new BooleanQuery();
bq.add(Filter for the last N days wrapped in a ConstantScoreQuery, MUST)
bq.add(all the rest of your stuff).

RangeFilter might work for you here.
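
A rough sketch of that combination (assuming a hypothetical "date" field
indexed with DateTools at DAY resolution; Lucene 2.0-era API):

import org.apache.lucene.document.DateTools;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.ConstantScoreQuery;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.RangeFilter;

import java.util.Date;

public class RecentDocsQuery {

    /** Restrict userQuery to documents whose "date" field falls in the last
     *  'days' days.  The filter is wrapped in a ConstantScoreQuery so that it
     *  can be added to a BooleanQuery as a required clause. */
    public static Query restrictToLastDays(Query userQuery, int days) {
        long now = System.currentTimeMillis();
        String lower = DateTools.dateToString(
                new Date(now - days * 24L * 60L * 60L * 1000L), DateTools.Resolution.DAY);
        String upper = DateTools.dateToString(new Date(now), DateTools.Resolution.DAY);

        Filter recent = new RangeFilter("date", lower, upper, true, true);

        BooleanQuery bq = new BooleanQuery();
        bq.add(new ConstantScoreQuery(recent), BooleanClause.Occur.MUST);
        bq.add(userQuery, BooleanClause.Occur.MUST);
        return bq;
    }
}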

Even if this works, you'll still have to deal with making the range big
enough to do what you want. Perhaps an iterative approach, say the first
time you run the query and you don't get your 25 (or whatever) results,
increase the range and try again.

Again, I'm not entirely sure when the filter gets applied, before or after
the sort. Nor am I sure how to tell. I'd sure like you to do the work and
tell me how... I *am* sure that this has been discussed in this mailing
list, so a search there might settle this.

C'mon Chris, Erik and Yonik, can't you recognize a plea for help when you
read it?

Although here's yet another thing that flitted through my mind. Is date
order really the same as doc ID order? And would you be able to sort on doc ID
instead? And would it matter? If you're adding your documents as they
come in, this might work. Doc IDs change, but I *believe* that if doc A is added
after doc B, the doc ID for A will always be greater than the doc ID for B,
although neither of them is guaranteed to stay the same across index
optimizations. Again, not sure if this helps at all.

Good luck!
Erick

On 10/18/06, Paul Waite <[EMAIL PROTECTED]> wrote:


Many thanks to Erik and Ollie for responding - a lot of ideas and I'll have
my work cut out grokking them properly and thinking about what to do.
I'll respond further as that develops.
I'll respond further as that develops.

One quick thing though - Erik wrote:

> So, I wonder if your out of memory issue is really related to the number
> of requests you're servicing. But only you will be able to figure that
> out . These problems are...er...unpleasant to track down...

Indeed!

> I guess I wonder a bit about what large result sets is all about. That
> is, do your users really care about results 100-10,000 or do they just
> want to page through them on demand?

No they don't want that. They just want a small number. What happens is
they enter some silly query, like searching for all stories with a single
common non-stop-word in them, and with the usual sort criterion of date
(i.e. a field) descending, and a limit of, say, 25.

So Lucene then presumably has to haul out a massive result set, sort it, and
return the top 25 (out of 500,000 or whatever).

Isn't that how it goes? Or am I missing something horribly obvious?


Cheers,
Paul.

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




Re: Preventing merging by IndexWriter

2006-10-18 Thread Erick Erickson

Your problem is out of my experience, so all I can suggest is that you
search the list archive. I know the idea of faceted searching has been
discussed by people with waaay more experience in that realm than I have
and, as I remember, there were some links provided...

I just searched for 'faceted' on the e-mails I've seen since I subscribed to
the list, and there are certainly discussions out there...

This thread might be particularly useful, started 15-May-2006:
"Aggregating category hits"
Best of luck
Erick

On 10/18/06, Johan Stuyts <[EMAIL PROTECTED]> wrote:


> > So my questions are: is there a way to prevent the IndexWriter from
> > merging, forcing it to create a new segment for each indexing batch?
>
> Already done in the Lucene trunk:
> http://issues.apache.org/jira/browse/LUCENE-672
>
> Background:
> http://www.gossamer-threads.com/lists/lucene/java-dev/39799#39799
>
> > And
> > if so, will it still be possible to merge the disk segments
> when I want
> > to?
>
> call optimize()

Thanks, I have got it working now. But I think it is not a viable
solution either. These are the problems that I see:
- the number of segments will probably increase too fast, requiring
optimizations regularly.
- Given the size of the external data, making it consistent with the
Lucene index will, in the worst case, require processing and writing to
disk of hundreds of megabytes.

The biggest problem is of course the combination of the two: having to
process too much data too many times.

Johan

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




Re: index architectures

2006-10-18 Thread Chris Hostetter

: I *think* that if you reduce your result set by, say, a filter, you might
: drastically reduce what gets sorted. I'm thinking of something like this
: BooleanQuery bq = new BooleanQuery();
: bq.add(Filter for the last N days wrapped in a ConstantScoreQuery, MUST)
: bq.add(all the rest of your stuff).
...
: Again, I'm note entirely sure when the filter gets applied, before or after
: the sort. Nor am I sure how to tell. I'd sure like you to do the work and

The memory required for sorting on a field is independent of the size of
the result -- so a Filter wouldn't help you here.  The reason is because
sorting builds/uses the FieldCache, which contains all the values for all
docs, so that it can be reused for sorting future queries as well.

That said: if you are seeing OOM errors when you sort by a field (but
not when you use the docId ordering, or sort by score) then it sounds like
you are keeping references to IndexReaders around after you've stopped
using them -- the FieldCache is kept in a WeakHashMap keyed off of the
IndexReader, so it should get garbage collected as soon as you let go of
it.  Another possibility is that you are sorting on too many fields for it
to be able to build the FieldCache for all of them in the RAM you have
available.

There was some discussion recently on java-dev about an approach to
sorting that took advantage of lazy field loading instead of the
FieldCache to sort on the *stored* value of a field, the goal being to
make sorting small result sets possible with a small amount of RAM ... but I
don't remember if the person working on it ever submitted a patch.





-Hoss


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: index architectures

2006-10-18 Thread Joe Shaw
Hi,

On Wed, 2006-10-18 at 19:05 +1300, Paul Waite wrote:
> No they don't want that. They just want a small number. What happens is
> they enter some silly query, like searching for all stories with a single
> common non-stop-word in them, and with the usual sort criterion of by date
> (ie. a field) descending, and a limit of, say 25.
> 
> So Lucene then presumably has to haul out a massive resultset, sort it, and
> return the top 25 (out of 500,000 or whatever).

I had a similar issue recently: users only want the 100 (or whatever)
most recently updated documents which match, and our documents aren't
stored in date-order.

Originally, we would walk the result set, instantiate a Document
instance, pull out the timestamp field, and keep around the top 100
documents.  Obviously this is extremely slow for large result sets.

What I initially did to address this was store a reverse timestamp and
walk the list of terms in the reverse timestamp field (they're sorted
lexicographically), and return the 100 most recent matching documents.

In most cases this was a lot faster (for a search which returned 153,142
matches, I only had to walk 288 documents to find the 100 most recent),
but in some cases it was a lot slower (for another search which returned
339 matches, I had to walk 292,911 documents to find the 100 most
recent).

In the end I found that I could walk about 5 terms in the time it took to
instantiate 2 documents, and tuned a heuristic so that in the worst case (my
second example) searches are 50% slower, but in almost all other cases
they're quite a bit faster.
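
A stripped-down sketch of that reverse-timestamp walk (hypothetical field name
"rts", Lucene 2.0-era TermEnum/TermDocs API, error handling trimmed; not the
original code):

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.QueryFilter;

import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

public class MostRecentFirst {
    private static final String FIELD = "rts";   // hypothetical reverse-timestamp field

    /** Index-time value: zero-padded so lexicographic order == reverse chronological. */
    public static String reverseTimestamp(long millis) {
        String s = Long.toString(Long.MAX_VALUE - millis);
        StringBuffer sb = new StringBuffer();
        for (int i = s.length(); i < 19; i++) sb.append('0');
        return sb.append(s).toString();
    }

    /** Walk the (sorted) terms of the reverse-timestamp field and collect the
     *  first 'max' documents that are also in the query's result set.
     *  Returned doc ids are most-recent first. */
    public static List firstMatches(IndexReader reader, Query query, int max) throws Exception {
        BitSet matches = new QueryFilter(query).bits(reader);
        List result = new ArrayList();
        TermEnum terms = reader.terms(new Term(FIELD, ""));
        try {
            do {
                Term t = terms.term();
                if (t == null || !FIELD.equals(t.field())) break;
                TermDocs docs = reader.termDocs(t);
                while (docs.next() && result.size() < max) {
                    if (matches.get(docs.doc())) result.add(new Integer(docs.doc()));
                }
                docs.close();
            } while (terms.next() && result.size() < max);
        } finally {
            terms.close();
        }
        return result;
    }
}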

Hope this helps,
Joe


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: index architectures

2006-10-18 Thread Michael D. Curtin

On Wed, 2006-10-18 at 19:05 +1300, Paul Waite wrote:


No they don't want that. They just want a small number. What happens is
they enter some silly query, like searching for all stories with a single
common non-stop-word in them, and with the usual sort criterion of by date
(ie. a field) descending, and a limit of, say 25.

So Lucene then presumably has to haul out a massive resultset, sort it, and
return the top 25 (out of 500,000 or whatever).


I had a similar requirement on a project last year.  I implemented a
two-pronged approach:

1.  Index (i.e. addDocument()) the documents in the order I wanted the final
sort to be.

2.  Modify a few classes to give a "first N" results capability.

We had 4 or 5 sort orders, so I had multiple indexes on the same data, 2 per
sort order (forward and reverse), and used the appropriate one at search time.
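
Point 2 might look roughly like this (a guess at the idea, not the original
code): a HitCollector that keeps only the first N document numbers, relying on
collect() being called in increasing doc-number order while the index itself
was built in the desired sort order.

import org.apache.lucene.search.HitCollector;

/** Hypothetical "first N" collector: the first N collected docs are already
 *  the top N when documents were added in the desired sort order.  Further
 *  hits are simply ignored; aborting the search early is the part that needs
 *  the "modify a few classes" mentioned above. */
public class FirstNCollector extends HitCollector {
    private final int[] docs;
    private int count = 0;

    public FirstNCollector(int n) { this.docs = new int[n]; }

    public void collect(int doc, float score) {
        if (count < docs.length) docs[count++] = doc;
    }

    public int[] topDocs() {
        int[] result = new int[count];
        System.arraycopy(docs, 0, result, 0, count);
        return result;
    }
}

searcher.search(query, new FirstNCollector(25)) would then hand back the first
25 documents in index order.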


It wasn't a one-hour change, but it didn't take a man-year either.  Hope it 
helps!

--MDC

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Error using Luke

2006-10-18 Thread vasu shah
Hi,

I am getting this error when accessing my index with Luke:

  No sub-file with id _1.f0 found

Does anyone have an idea about this?

Any help would be appreciated.

Thanks,
-Vasu



Re: Error using Luke

2006-10-18 Thread Volodymyr Bychkoviak
It seems that you created your index with norms turned off and are trying to
open it with Luke, which may bundle an older version of Lucene.


vasu shah wrote:

Hi,
   
  I am getting this error when accessing my index with Luke.
   
  No sub-file with id _1.f0 found
   
  Does any one have idea about this??
   
  Any help would be appreciated.
   
  Thanks,

  -Vasu




--
regards,
Volodymyr Bychkoviak


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Error using Luke

2006-10-18 Thread vasu shah
Thank you very much. I have indeed turned off the norms.

Is there any new version of Luke that I can use?

Thanks,
-Vasu

Volodymyr Bychkoviak <[EMAIL PROTECTED]> wrote:
  seems that you created your index with norms turned off and trying to 
open with luke which can contain older version of lucene.

vasu shah wrote:
> Hi,
> 
> I am getting this error when accessing my index with Luke.
> 
> No sub-file with id _1.f0 found
> 
> Does any one have idea about this??
> 
> Any help would be appreciated.
> 
> Thanks,
> -Vasu
>
> 
> 

-- 
regards,
Volodymyr Bychkoviak


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]





Re: near duplicates

2006-10-18 Thread karl wettin


On 17 Oct 2006, at 18:55, Andrzej Bialecki wrote:
You need to create a fuzzy signature of the document, based on term
histogram or shingles - take a look at the Signature framework in
Nutch.


There is a substantial literature on this subject - go to Citeseer  
and run a search for "near duplicate detection".


Interesting. I'll have to check this out a bit more some day(tm).

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Error using Luke

2006-10-18 Thread Volodymyr Bychkoviak
You can get Lucene 1.9.1 and make Luke use this version (you need
luke.jar, not luke-all.jar). Version 1.9.1 contains API which was removed in
Lucene 2.0 (as deprecated) and should still be able to read indexes created
by Lucene 2.0 (correct me if I'm wrong).

You can then run Luke with a command line like this:

java -classpath luke.jar;lucene-1.9.1.jar org.getopt.luke.Luke

vasu shah wrote:
Thank you very much. I have indeed turned off the norms. 
   
  Is there any new version of Luke that I can use?
   
  Thanks,

  -Vasu

Volodymyr Bychkoviak <[EMAIL PROTECTED]> wrote:
  seems that you created your index with norms turned off and trying to 
open with luke which can contain older version of lucene.


vasu shah wrote:
  

Hi,

I am getting this error when accessing my index with Luke.

No sub-file with id _1.f0 found

Does any one have idea about this??

Any help would be appreciated.

Thanks,
-Vasu






  


--
regards,
Volodymyr Bychkoviak



Re: Lucene 2.0.1 release date

2006-10-18 Thread Peter Keegan

> This makes it relatively safe for people to grab a snapshot of the trunk
> with less concern about latent bugs.
>
> I think the concern is that if we start doing this stuff on trunk now,
> people that are accustomed to snapping from the trunk might be surprised,
> and not in a good way.

+1 on this. There are some great performance improvements in 2.0.1.

Peter

On 10/17/06, Steven Parkes <[EMAIL PROTECTED]> wrote:


I think the idea is that 2.0.1 would be a patch-fix release from the
branch created at 2.0 release. This release would incorporate only
back-ported high-impact patches, where "high-impact" is defined by the
community. Certainly security vulnerabilities would be included. As Otis
said, to date, nobody seems to have raised any issues to that level.

2.1 will include all the patches and new features that have been
committed since 2.0; there've been a number of these. But releases are
done pretty ad hoc at this point and there hasn't been anyone that has
expressed strong interest in (i.e., lobbied for) a release.

There was a little discussion on this topic at the ApacheCon BOF. For a
number of reasons, the Lucene Java trunk has been kept "pretty stable",
with a relatively few number of large changes. This makes it relatively
safe for people to grab a snapshot of the trunk with less concern about
latent bugs. I don't know how many people/projects are doing this rather
than sticking with 2.0.

Keeping the trunk stable doesn't provide an obvious place to start
working on things that people may want to work on and share but at the
same time want to allow to percolate for a while. I think the concern is
that if we start doing this stuff on trunk now, people that are
accustomed to snapping from the trunk might be surprised, and not in a
good way. Nobody wants that.

So releases can be about both what people want (getting features out)
and allowing a bit more instability in trunk. That is, if the community
wants that.

Food for thought and/or discussion?

-Original Message-
From: George Aroush [mailto:[EMAIL PROTECTED]
Sent: Sunday, October 15, 2006 5:15 PM
To: java-user@lucene.apache.org
Subject: RE: Lucene 2.0.1 release date

Thanks for the reply Otis.

I looked at the CHANGES.txt file and saw quite a few changes.  For my port
from Java to C#, I can't rely on the trunk code as it (to my knowledge)
changes on a monthly basis if not weekly.  What I need is an official
release that I can use as the port point.

Regards,

-- George Aroush


-Original Message-
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
Sent: Sunday, October 15, 2006 12:41 AM
To: java-user@lucene.apache.org
Subject: Re: Lucene 2.0.1 release date

I'd have to check CHANGES.txt, but I don't think that many bugs have been
fixed, or that many new features added, that anyone is itching for a new
release.

Otis

- Original Message 
From: George Aroush <[EMAIL PROTECTED]>
To: java-dev@lucene.apache.org; java-user@lucene.apache.org
Sent: Saturday, October 14, 2006 10:32:47 AM
Subject: RE: Lucene 2.0.1 release date

Hi folks,

Sorry for reposting this question (see original email below), this time
to both mailing lists.

If anyone can tell me what is the plan for Lucene 2.0.1 release, I would
appreciate it very much.

As some of you may know, I am the porter of Lucene to Lucene.Net; knowing
when 2.0.1 will be released will help me plan things out.

Regards,

-- George Aroush


-Original Message-
From: George Aroush [mailto:[EMAIL PROTECTED]
Sent: Thursday, October 12, 2006 12:07 AM
To: java-dev@lucene.apache.org
Subject: Lucene 2.0.1 release date

Hi folks,

What's the plan for Lucene 2.0.1 release date?

Thanks!

-- George Aroush




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




Scalability Questions

2006-10-18 Thread Guerre Bear

Hello All,

Lucene looks very interesting to me.  I was wondering if any of you could
comment on a few questions:

1) Assuming I use a typical server such as a dual-core dual-processor Dell
2950, about how many files can Lucene index and still have a sub-two-second
search speed for a simple search string such as "invoice 2005 mitsubishi"?
For the sake of argument, I figure that a typical file will have about 30KB
of text in it.

2) How many of these servers would it take to manage an index of one billion
such files?

3) Are there any HOWTO's on constructing a large Lucene search cluster?

4) Roughly how large is the index file in comparison to the size of the
input files?

5) How does Lucene's search performance/scalability compare to some of the
expensive commercial search products such as Fast?  (www.fastsearch.com)

Thank you all for any comments or suggestions!

Guerre


DateTools oddity....

2006-10-18 Thread Michael J. Prichard

When I run this java code:

   Long dates = new Long("1154481345000");
   Date dada = new Date(dates.longValue());
   System.out.println(dada.toString());
   System.out.println(DateTools.dateToString(dada, DateTools.Resolution.DAY));


I get this output:

   Tue Aug 01 21:15:45 EDT 2006
   20060802

Huh?!  Should it be:
   20060801

??

Any ideas?


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: DateTools oddity....

2006-10-18 Thread Emmanuel Bernard

DateTools uses GMT as the timezone:
   Tue Aug 01 21:15:45 EDT 2006 is
   Wed Aug 02 01:15:45 GMT 2006

Michael J. Prichard wrote:


When I run this java code:

Long dates = new Long("1154481345000");
Date dada = new Date(dates.longValue());
System.out.println(dada.toString());
System.out.println(DateTools.dateToString(dada,
DateTools.Resolution.DAY));

I get this output:

Tue Aug 01 21:15:45 EDT 2006
20060802

Huh?!  Should it be:
20060801

??

Any ideas?


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: DateTools oddity....

2006-10-18 Thread Doug Cutting

Michael J. Prichard wrote:

I get this output:

   Tue Aug 01 21:15:45 EDT 2006


That's August 2, 2006 at 01:15:45 GMT.


   20060802

Huh?!  Should it be:
   20060801


DateTools uses GMT.

Doug

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: DateTools oddity....

2006-10-18 Thread Michael J. Prichard

Dang it :)

Any way to set the timezone?

Emmanuel Bernard wrote:


DateTools use GMT as a timezone
   Tue Aug 01 21:15:45 EDT 2006
   Wed Aug 02 02:15:45 EDT 2006

Michael J. Prichard wrote:



When I run this java code:

Long dates = new Long("1154481345000");
Date dada = new Date(dates.longValue());
System.out.println(dada.toString());
System.out.println(DateTools.dateToString(dada,
DateTools.Resolution.DAY));

I get this output:

Tue Aug 01 21:15:45 EDT 2006
20060802

Huh?!  Should it be:
20060801

??

Any ideas?


To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: DateTools oddity....

2006-10-18 Thread Emmanuel Bernard
No, but using a constant timezone is a good thing anyway, since the index
does not keep track of that info and will not really care, as long as you
always use DateTools at both index and search time.
You can always rewrite DateTools with your own timezone, but EDT is a bad
choice since it is vulnerable to the daylight-saving mess.
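
If you do want local-day strings, one option (outside DateTools, plain JDK,
shown only as a sketch) is to format the day term yourself with a fixed
timezone at both index and search time:

import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class LocalDayFormat {
    /** Produce a "yyyyMMdd" term in a fixed timezone of your choosing.
     *  Using the same formatter at index and search time keeps things
     *  symmetrical; the timezone below is just an illustration. */
    public static String dayString(Date date, TimeZone tz) {
        SimpleDateFormat df = new SimpleDateFormat("yyyyMMdd");
        df.setTimeZone(tz);
        return df.format(date);
    }

    public static void main(String[] args) {
        Date d = new Date(1154481345000L);   // the value from the original post
        System.out.println(dayString(d, TimeZone.getTimeZone("GMT")));              // 20060802
        System.out.println(dayString(d, TimeZone.getTimeZone("America/New_York"))); // 20060801
    }
}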


Michael J. Prichard wrote:

Dang it :)

Anyway to set timezone?

Emmanuel Bernard wrote:


DateTools use GMT as a timezone
   Tue Aug 01 21:15:45 EDT 2006
   Wed Aug 02 02:15:45 EDT 2006

Michael J. Prichard wrote:



When I run this java code:

Long dates = new Long("1154481345000");
Date dada = new Date(dates.longValue());
System.out.println(dada.toString());
System.out.println(DateTools.dateToString(dada,
DateTools.Resolution.DAY));

I get this output:

Tue Aug 01 21:15:45 EDT 2006
20060802

Huh?!  Should it be:
20060801

??

Any ideas?


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: DateTools oddity....

2006-10-18 Thread Paul Snyder
DITTO !!!

I like date truncation, but when I store a truncated date, I don't want to
retrieve the time in Greenwich, England at midnight of the date I'm
truncating in the local machine's time zone.  Nothing against the Brits, it
just doesn't do me any good to know what time it was over there on the day
in question.  What I want back is midnight of the correct day in the time
zone of the local machine.  In other words, when I specify DAY resolution,
I'm saying TIME ZONE IS IRRELEVANT, ALWAYS GIVE ME THE CORRECT DATE IN THE
LOCAL TIME ZONE.

Understanding the need for backwards-compatibility, I vote that there ought
to be some kind of parameter I can set when converting Date-To-String and
String-To-Date to force TRUNCATION of a date (as in Oracle, for example) so
that it will return the correct date in local time when retrieved.

Without the ability to force symmetrical storage and retrieval, I think any
DateTools Resolution coarser than HOUR has little practical value and only
serves to mislead people.

-Original Message-
From: Michael J. Prichard [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, October 18, 2006 2:39 PM
To: java-user@lucene.apache.org
Subject: Re: DateTools oddity

Dang it :)

Anyway to set timezone?

Emmanuel Bernard wrote:

> DateTools use GMT as a timezone
>Tue Aug 01 21:15:45 EDT 2006
>Wed Aug 02 02:15:45 EDT 2006
>
> Michael J. Prichard wrote:
>
>>
>> When I run this java code:
>>
>> Long dates = new Long("1154481345000");
>> Date dada = new Date(dates.longValue());
>> System.out.println(dada.toString());
>> System.out.println(DateTools.dateToString(dada,
>> DateTools.Resolution.DAY));
>>
>> I get this output:
>>
>> Tue Aug 01 21:15:45 EDT 2006
>> 20060802
>>
>> Huh?!  Should it be:
>> 20060801
>>
>> ??
>>
>> Any ideas?
>>
>>
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



termpositions at index time...

2006-10-18 Thread Erick Erickson

Here's my problem:

We're indexing books. I need to
a> return books ordered by relevancy
b> for any single book, return the number of hits in each chapter (which, of
course, may be many pages).

1>If I index each page as a document, creating the relevance on a book basis
is interesting, but collecting page hits per book is easy.
2>If I index each book as a document, returning the books by relevance is
easy but aggregating hits per chapter is interesting.

No, creating two indexes is not an option at present, although that would be
the least work for me.

I can make <2> work if, for a particular field, I can determine what the
last termposition on each page is *at index time*. Oh, we don't want
searches to span pages. Pages are added to the doc with multiple calls like
so

doc.add('"field", first page text);
doc.add('"field", second page text);
doc.add('"field", third page text);


The only approach I've really managed to come up with so far is to make my
own Analyzer that has the following characteristics...
1> override getPositionIncrementGap for this field and return, say, 100.
This should keep us from spanning pages, and provide a convenient trigger
for me to know we've finished (or are starting to) index a new page.
2> record the last token position and provide a mechanism for me to retrieve
that number. I can then keep a record in this document of what offset each
page starts at, and then accomplish my aggregation by storing, with the
document, the termpositions of the start (or end) of each page.
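
The getPositionIncrementGap part (point 1 above) could look something like
this sketch (using StandardAnalyzer as the contained analyzer and the field
name from the example; not a tested implementation):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

import java.io.Reader;

/** Hypothetical wrapper: delegates analysis but puts a gap of 100 positions
 *  between successive values of the page field, so phrase/span matches cannot
 *  cross a page boundary. */
public class PageGapAnalyzer extends Analyzer {
    private final Analyzer delegate = new StandardAnalyzer();

    public TokenStream tokenStream(String fieldName, Reader reader) {
        return delegate.tokenStream(fieldName, reader);
    }

    public int getPositionIncrementGap(String fieldName) {
        return "field".equals(fieldName) ? 100 : 0;   // "field" as in the example above
    }
}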

Note, I'm rolling my own counter for where terms hit. It'll be a degenerate
case of only ANDing things together, so it should be pretty simple even in
the wildcard case.

I'm using the Srnd* classes to do my spans, since they may include wildcards,
and I don't see a way to get a Spans object from that, but it's late in the
day...

Last time I appealed to y'all, you wrote back that it was already done. My
hope is that it's already done again, but I've spent a couple of hours
looking and it isn't obvious to me. What I want is a way to do something
like this

doc.add('"field", first page text);
int pos = XXX.getLastTermPosition("field");
doc.add('"field", second page text);
pos = XXX.getLastTermPosition("field");
doc.add('"field", third page text);
pos = XXX.getLastTermPosition("field");

But if I understand what's happening, the text doesn't get analyzed until
the doc is added to the index; all the doc.add(field, value) calls are just
set-up work without any position information really being available yet. I'd
be happy to be wrong about that...

Thanks
Erick


Re: termpositions at index time...

2006-10-18 Thread Michael D. Curtin

Erick Erickson wrote:


Here's my problem:

We're indexing books. I need to
a> return books ordered by relevancy
b> for any single book, return the number of hits in each chapter (which, of
course, may be many pages).

1>If I index each page as a document, creating the relevance on a book basis
is interesting, but collecting page hits per book is easy.
2>If I index each book as a document, returning the books by relevance is
easy but aggregating hits per chapter is interesting.

No, creating two indexes is not an option at present, although that would be
the least work for me.


Could you elaborate on why this approach isn't an option?

--MDC

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: termpositions at index time...

2006-10-18 Thread Erick Erickson

Arbitrary restrictions by IT on the space the indexes can take up.

Actually, I won't categorically say I *can't* make this happen, but in order to
use this option, I need to be able to present a convincing case. And I can't
do that until I've exhausted my options/creativity.

And this way it keeps folks on the list from suggesting it when I've already
thought of it.

Erick

On 10/18/06, Michael D. Curtin <[EMAIL PROTECTED]> wrote:


Erick Erickson wrote:

> Here's my problem:
>
> We're indexing books. I need to
> a> return books ordered by relevancy
> b> for any single book, return the number of hits in each chapter
> (which, of
> course, may be many pages).
>
> 1>If I index each page as a document, creating the relevance on a book
> basis
> is interesting, but collecting page hits per book is easy.
> 2>If I index each book as a document, returning the books by relevance
is
> easy but aggregating hits per chapter is interesting.
>
> No, creating two indexes is not an option at present, although that
> would be
> the least work for me.

Could you elaborate on why this approach isn't an option?

--MDC

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




Re: termpositions at index time...

2006-10-18 Thread Michael D. Curtin

Erick Erickson wrote:


Arbitrary restrictions by IT on the space the indexes can take up.

Actually, I won't categorically I *can't* make this happen, but in order to
use this option, I need to be able to present a convincing case. And I 
can't

do that until I've exhausted my options/creativity.


Disk space is a LOT cheaper than engineering time.  Any manager worth his/her 
salt should be able to evaluate that tradeoff in a millisecond, and any IT 
professional unable to do so should be reprimanded.  Maybe your boss can fix 
it.  If not, yours is probably not the only such situation in the world ...


If you can retrieve the pre-index content at search time, maybe this would work:

1.  Create the "real" index in the form that lets you get the top N books by 
relevance, on IT's disks.


2.  Create a temporary index on those books in the form that gives you the 
chapter counts in RAM, search it, then discard it.


If N is sufficiently small, #2 could be pretty darn fast.
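
A sketch of what #2 could look like (field names and inputs are invented; one
document per page, tagged with its chapter, and the query is assumed to
target the "text" field):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.RAMDirectory;

/** Hypothetical per-request chapter counter: build a throw-away RAM index
 *  over the top-N books' pages, count matching pages per chapter, then let
 *  the whole thing be garbage collected. */
public class ChapterHitCounts {
    public static int[] countPagesPerChapter(String[] pageTexts, int[] chapterOfPage,
                                             int numChapters, Query query) throws Exception {
        RAMDirectory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
        for (int i = 0; i < pageTexts.length; i++) {
            Document doc = new Document();
            doc.add(new Field("chapter", Integer.toString(chapterOfPage[i]),
                              Field.Store.YES, Field.Index.UN_TOKENIZED));
            doc.add(new Field("text", pageTexts[i], Field.Store.NO, Field.Index.TOKENIZED));
            writer.addDocument(doc);
        }
        writer.close();

        int[] counts = new int[numChapters];
        IndexSearcher searcher = new IndexSearcher(dir);
        Hits hits = searcher.search(query);
        for (int i = 0; i < hits.length(); i++) {
            counts[Integer.parseInt(hits.doc(i).get("chapter"))]++;
        }
        searcher.close();
        return counts;
    }
}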


If that wouldn't work, here's another idea.  I'm not clear on how your 
solution with getLastTermPosition() would work, but how about just counting 
words in the pages as you document.add() them (instead of relying on 
getLastTermPosition())?  It would mean two passes of parsing, but you wouldn't 
have to modify any Lucene code ...


--MDC

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



question regarding usage of IndexWriter.setMaxFieldLength()

2006-10-18 Thread d rj

Hello-

I was wondering about the usage of IndexWriter.setMaxFieldLength().
It is limited, by default, to 10k terms per field.  Can anyone tell me if
this is a "per field" limit or a "per uniquely named field" limit?
I.e. in the following snippet I add many words to different Fields all w/
the same name.  Will all words be indexed w/ no problem allowing me to
conduct a search across the "text" field for any word occurring in any these
long strings?

String longString1 = <~9k words in string>;
String longString2 = <~9k words in string>;
String longString3 = <~9k words in string>;

Document doc = new Document();
doc.add(new Field("text", longString1, Field.Store.YES,
Field.Index.UN_TOKENIZED));
doc.add(new Field("text", longString2, Field.Store.YES,
Field.Index.UN_TOKENIZED));
doc.add(new Field("text", longString3, Field.Store.YES,
Field.Index.UN_TOKENIZED));


thanks.
-david


Re: near duplicates

2006-10-18 Thread John Casey

On 10/18/06, Isabel Drost <[EMAIL PROTECTED]> wrote:


Find Me wrote:
> How to eliminate near duplicates from the index? Someone suggested that
I
> could look at the TermVectors and do a comparision to remove the
> duplicates.

As an alternative you could also have a look at the paper "Detecting
Phrase-Level Duplication on the World Wide Web" by Dennis Fetterly, Mark
Manasse, Marc Najork.



Another good reference would be Soumen Chakrabarti's reference book, "Mining
the Web - Discovering Knowledge from Hypertext Data",2003 and the section on
shingling and the elimination of near duplicates. Of course I think this
works at the document level rather than at the term vector level but it
might be useful to prevent duplicate documents from being indexed
altogether.


One major problem with this is the structure of the document is
> no longer important. Are there any obvious pitfalls? For example:
Document
> A being a subset of Document B but in no particular order.

I think this case is pretty unlikely. But I am not sure whether you can
detect
near duplicates by only comparing term-document vectors. There might be
problems with documents with slightly changed words, words that were
replaced
with synonyms...

However, if you want to keep some information on the word order, you might
consider comparing n-gram document vectors. That is, each dimension in the

vector does not only represent one word but a sequence of 2, 3, 4, 5...
words.




would this involve something like a window of 2-5 words around a particular
term in a document?

Cheers,

Isabel
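
For reference, a toy sketch of the shingling idea discussed above (plain Java,
naive whitespace tokenization; two documents would be flagged as near
duplicates when the Jaccard overlap of their shingle sets is high):

import java.util.HashSet;
import java.util.Set;
import java.util.StringTokenizer;

/** Every run of n consecutive words becomes one "shingle"; the window slides
 *  one word at a time.  This is only an illustration of the idea, not a
 *  production near-duplicate detector. */
public class Shingles {
    public static Set shingles(String text, int n) {
        String[] words = split(text);
        Set result = new HashSet();
        for (int i = 0; i + n <= words.length; i++) {
            StringBuffer sb = new StringBuffer();
            for (int j = 0; j < n; j++) {
                if (j > 0) sb.append(' ');
                sb.append(words[i + j].toLowerCase());
            }
            result.add(sb.toString());
        }
        return result;
    }

    /** Jaccard similarity: |intersection| / |union| of the two shingle sets. */
    public static double jaccard(Set a, Set b) {
        Set intersection = new HashSet(a);
        intersection.retainAll(b);
        Set union = new HashSet(a);
        union.addAll(b);
        return union.isEmpty() ? 0.0 : (double) intersection.size() / union.size();
    }

    private static String[] split(String text) {
        StringTokenizer st = new StringTokenizer(text);
        String[] words = new String[st.countTokens()];
        for (int i = 0; i < words.length; i++) words[i] = st.nextToken();
        return words;
    }
}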



Re: question regarding usage of IndexWriter.setMaxFieldLength()

2006-10-18 Thread Erick Erickson

I had a similar question a while ago and the answer is "you can't cheat".
According to what the guys said, this

doc.add("field", )
doc.add("field", )
doc.add("field", )

is just the same as this

doc.add("field", )

But go ahead and increase the maxfieldlength. I'm successfully indexing
(unstored) a 7,500 page book with all the text as a single field. I think I
set the maxfieldlength at something like 10,000,000.

Had to bump the max memory in the JVM to do it, but it worked.
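
For reference, a minimal sketch of bumping the limit (path and numbers are
only examples):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class BigFieldIndexing {
    public static void main(String[] args) throws Exception {
        // The default limit is 10,000 terms per field name, counted across
        // all values added under that name for a document.
        IndexWriter writer = new IndexWriter("/tmp/index", new StandardAnalyzer(), true);
        writer.setMaxFieldLength(10000000);
        // ... writer.addDocument(doc); ...
        writer.close();
    }
}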

Erick


On 10/18/06, d rj <[EMAIL PROTECTED]> wrote:


Hello-

I was wondering about the usage of IndexWriter.setMaxFieldLength()
it is limited, by default, to 10k terms per field.  Can anyone tell me if
this is this a "per field" limit or a "per uniquely named field" limit?
I.e. in the following snippet I add many words to different Fields all w/
the same name.  Will all words be indexed w/ no problem allowing me to
conduct a search across the "text" field for any word occurring in any
these
long strings?

string longString1 = <~9k words in string>;
string longString2 = <~9k words in string>;
string longString3 = <~9k words in string>;

Document doc = new Document();
doc.add(new Field("text", longString1, Field.Store.YES,
Field.Index.UN_TOKENIZED));
doc.add(new Field("text", longString2, Field.Store.YES,
Field.Index.UN_TOKENIZED));
doc.add(new Field("text", longString3, Field.Store.YES,
Field.Index.UN_TOKENIZED));


thanks.
-david




Re: termpositions at index time...

2006-10-18 Thread Erick Erickson

I tried the notion of a temporary RAMDirectory already, and the documents
parse unacceptably slowly, 8-10 seconds. Great minds think alike. Believe
it or not, I have to deal with a 7,500 page book that details Civil War
records of Michigan volunteers. The XML form is 24M, probably 16M of text
exclusive of tags.

About your second suggestion, I'm trying to figure out how to do essentially
that. But a word count isn't very straightforward with stop words and dirty
ascii (OCR) data. I'm trying to hook that process into the tokenizer so the
counts have a better chance of being accurate, which is the essence of the
scheme. I'd far rather get the term offset from the same place the indexer
will than try to do a similar-but-not-quite-identical algorithm that failed
miserably on, say, the 3,000th and subsequent pages... I'm sure you've been
somewhere similar

OK, you've just caused me to think a bit, for which I thank you. I think
it's actually pretty simple. Just instantiate a class that is a thin wrapper
around the Lucene analyzer that implements the tokenstream (or whatever)
interface by calling a contained analyzer (has-a). Return the token and do
any recording I want to. And provide any additional data to my process as
necessary. I'll have to look at that in the morning.
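
A rough, untested sketch of that has-a wrapper (Lucene 2.0-era TokenStream
API; it would be combined with the getPositionIncrementGap override sketched
earlier in the thread). Note the bookkeeping is approximate and the recorded
positions only become known once addDocument() runs the analysis:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;

import java.io.IOException;
import java.io.Reader;

/** Delegates analysis to a contained analyzer and adds up the position
 *  increments of the tokens it hands to the indexer for one chosen field. */
public class RecordingAnalyzer extends Analyzer {
    private final Analyzer delegate;
    private final String trackedField;     // e.g. "field" from the earlier example
    private int lastPosition = 0;

    public RecordingAnalyzer(Analyzer delegate, String trackedField) {
        this.delegate = delegate;
        this.trackedField = trackedField;
    }

    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream wrapped = delegate.tokenStream(fieldName, reader);
        if (!trackedField.equals(fieldName)) return wrapped;
        return new CountingStream(wrapped);
    }

    public int getLastPosition() { return lastPosition; }

    private class CountingStream extends TokenStream {
        private final TokenStream in;
        CountingStream(TokenStream in) { this.in = in; }
        public Token next() throws IOException {
            Token t = in.next();
            if (t != null) lastPosition += t.getPositionIncrement();
            return t;
        }
        public void close() throws IOException { in.close(); }
    }
}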

All in all, I'm probably going to make your exact argument about disk space
being waaay cheaper than engineering time. That said, exploring this serves
two purposes; first, it lets me back my recommendation with data. Second, and
longer term, we're using Lucene on more and more products, and exploring the
nooks and crannies involved in exotic schemes vastly increases my ability to
quickly triage ways of doing things. The *other* thing my boss is good at is
being OK with a reasonable amount of time "wasted" in order to increase my
toolkit. So it isn't as frustrating as it might have appeared from my rather
off-hand blaming of IT.

Thanks for the suggestions,
Erick

On 10/18/06, Michael D. Curtin <[EMAIL PROTECTED]> wrote:


Erick Erickson wrote:

> Arbitrary restrictions by IT on the space the indexes can take up.
>
> Actually, I won't categorically I *can't* make this happen, but in order
to
> use this option, I need to be able to present a convincing case. And I
> can't
> do that until I've exhausted my options/creativity.

Disk space is a LOT cheaper than engineering time.  Any manager worth
his/her
salt should be able to evaluate that tradeoff in a millisecond, and any IT
professional unable to do so should be reprimanded.  Maybe your boss can
fix
it.  If not, yours is probably not the only such situation in the world
...

If you can retrieve the pre-index content at search time, maybe this would
work:

1.  Create the "real" index in the form that lets you get the top N books
by
relevance, on IT's disks.

2.  Create a temporary index on those books in the form that gives you the
chapter counts in RAM, search it, then discard it.

If N is sufficiently small, #2 could be pretty darn fast.


If that wouldn't work, here's another idea.  I'm not clear on how your
solution with getLastTermPosition() would work, but how about just
counting
words in the pages as you document.add() them (instead of relying on
getLastTermPosition())?  It would mean two passes of parsing, but you
wouldn't
have to modify any Lucene code ...

--MDC

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




Re: index architectures

2006-10-18 Thread Paul Waite
Some excellent feedback guys - thanks heaps.


On my OOM issue, I think Hoss has nailed it here:

> That said: if you are seeing OOM errors when you sort by a field (but
> not when you use the docId ordering, or sort by score) then it sounds
> like you are keeping references to IndexReaders around after you've
> stopped using them -- the FieldCache is kept in a WeakHashMap keyed off of
> the IndexReader, so it should get garbage collected as soon as you let go
> of it.  Another possibility is that you are sorting on too many fields
> for it to be able to build the FieldCache for all of them in the RAM you
> have available.


I'm using a piece of code written by Peter Halacsy which implements a
SearcherCache class. When we do a search we request a searcher, and this
class looks after giving us one.

It checks whether the index has been updated since the most recent Searcher
was created. If so it creates a new one.

At the same time it 'retires' outdated Searchers, once they have no queries
busy with them.

Looking at that code, if the system gets busy indexing new stuff, and doing
complex searches this is all rather open-ended as to the potential number
of fresh Searchers being created, each with the overhead of building its
FieldCache for the first time. No wonder I'm having problems as the archive
has grown! Looking at it in this light, my OOM's all seem to come just
after a bout of articles have been indexed, and querying is being done
simultaneously, so it does fit.

I guess a solution is probably to cap this process with a maximum number
of active Searchers, meaning potentially some queries might be fobbed off
with slightly out of date versions of the index, in extremis, but it would
right itself once everything settles down again.

Obviously the index partitioning would probably make this a non-issue, but
it seems better to sort the basic problem out anyway, and make it 100%
stable.

Thanks Hoss!

Cheers,
Paul.

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: index architectures

2006-10-18 Thread Doron Cohen
Not sure if this is the case, but you said "searchers", so it might be -
you can (and should) reuse searchers for multiple/concurrent queries.
IndexSearcher is thread-safe, so there is no need to have a different
searcher for each query. Keep using this searcher until you decide to open a
new searcher - actually until you have 'warmed up' the new (updated) searcher -
then switch to using the new searcher and close the old one.
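
A simplified sketch of that pattern (the caller decides when to reopen;
making sure no in-flight query still uses the old searcher when it is closed,
e.g. via reference counting, is left out):

import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Sort;

public class SearcherHolder {
    private final String indexPath;
    private IndexSearcher current;

    public SearcherHolder(String indexPath) throws Exception {
        this.indexPath = indexPath;
        this.current = new IndexSearcher(indexPath);
    }

    /** All queries share the same searcher; IndexSearcher is thread-safe. */
    public synchronized IndexSearcher getSearcher() {
        return current;
    }

    /** Open a new searcher after the index has changed, warm it so the
     *  FieldCache for the sort field is built before real traffic hits it,
     *  then swap it in and close the old one. */
    public void reopen(Query warmupQuery, Sort sort) throws Exception {
        IndexSearcher fresh = new IndexSearcher(indexPath);
        fresh.search(warmupQuery, sort);          // builds the FieldCache up front
        IndexSearcher old;
        synchronized (this) {
            old = current;
            current = fresh;
        }
        old.close();
    }
}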

- Doron

Paul Waite <[EMAIL PROTECTED]> wrote on 18/10/2006 18:22:30:

> Some excellent feedback guys - thanks heaps.
>
>
> On my OOM issue, I think Hoss has nailed it here:
>
> > That said: if you are seeing OOM errors when you sort by a field (but
> > not when you use the docId ordering, or sort by score) then it sounds
> > like  you are keeping refrences to IndexReaders arround after you've
> > stoped using them -- the FieldCache is kept in a WeakHashMap keyed off
of
> > hte IndexReader, so it should get garbage collected as soon sa you let
go
> > of it.  Another possibility is that you are sorting on too many fields
> > for it to be able to build the FieldCache for all of them in the RAM
you
> > have available.
>
>
> I'm using a piece of code written by Peter Halacsy which implements a
> SearcherCache class. When we do a search we request a searcher, and this
> class looks after giving us one.
>
> It checks whether the index has been updated since the most recent
Searcher
> was created. If so it creates a new one.
>
> At the same time it 'retires' outdated Searchers, once they have no
queries
> busy with them.
>
> Looking at that code, if the system gets busy indexing new stuff, and
doing
> complex searches this is all rather open-ended as to the potential number
> of fresh Searchers being created, each with the overhead of building its
> FieldCache for the first time. No wonder I'm having problems as the
archive
> has grown! Looking at it in this light, my OOM's all seem to come just
> after a bout of articles have been indexed, and querying is being done
> simultaneously, so it does fit.
>
> I guess a solution is probably to cap this process with a maximum number
> of active Searchers, meaning potentially some queries might be fobbed off
> with slightly out of date versions of the index, in extremis, but it
would
> right itself once everything settles down again.
>
> Obviously the index partitioning would probably make this a non-issue,
but
> it seems better to sort the basic problem out anyway, and make it 100%
> stable.
>
> Thanks Hoss!
>
> Cheers,
> Paul.
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: constructing smaller phrase queries given a multi-word query

2006-10-18 Thread Mekin Maheshwari

Resending, with the hope that the search gurus missed this.

Would really appreciate any advice on this.
I would not want to reinvent the wheel, and I am sure this is something
that has already been done.

Thanks,
mek

On 10/16/06, Mek <[EMAIL PROTECTED]> wrote:

Has anyone dealt with the problem of constructing sub-queries given a
multi-word query?

Here is an example to illustrate what I mean:

The user queries for -> A B C D
Right now I change that query to "A B C D" A B C D to give phrase
matches higher weightage.

What might happen, though, is that the user is looking for a document
where "A B" is in Field1 and "C D" is in Field2.

So I should ideally be constructing the query as :

"A B C D"^20 "A B"^10 "C D"^10 "B C D"^15  "A B C"^15 A B C D

Has someone solved this problem, or are there other ways to handle it?


Thanks,
mek.
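
For reference, a sketch of generating that kind of query mechanically (field
name and boost formula are invented; every contiguous sub-phrase of two or
more words becomes an optional PhraseQuery whose boost grows with its length,
plus the individual terms):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class SubPhraseQueryBuilder {
    public static Query build(String field, String[] words) {
        BooleanQuery bq = new BooleanQuery();
        for (int start = 0; start < words.length; start++) {
            // the single term itself
            bq.add(new TermQuery(new Term(field, words[start])), BooleanClause.Occur.SHOULD);
            // every contiguous sub-phrase starting here, length >= 2
            for (int end = start + 2; end <= words.length; end++) {
                PhraseQuery phrase = new PhraseQuery();
                for (int i = start; i < end; i++) {
                    phrase.add(new Term(field, words[i]));
                }
                phrase.setBoost(5.0f * (end - start));   // longer phrases score higher
                bq.add(phrase, BooleanClause.Occur.SHOULD);
            }
        }
        return bq;
    }
}

With the boost formula above, {"A", "B", "C", "D"} produces "A B C D"^20,
"A B C"^15, "B C D"^15, "A B"^10, "B C"^10, "C D"^10 plus the bare terms,
matching the shape of the query in the message above.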




--
http://mekin.livejournal.com/

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



NativeFSLockFactory problem

2006-10-18 Thread Frank Kunemann
 
Hi all,

I'm trying to use the new class NativeFSLockFactory, but as you can guess I
have a problem using it.
Don't know what I'm doing wrong, so here is the code:


FSDirectory dir = FSDirectory.getDirectory(indexDir, create,
        NativeFSLockFactory.getLockFactory());
logger.info("Index: " + indexDir.getAbsolutePath()
        + " Lock file: " + dir.getLockID());
this.writer = new IndexWriter(dir, new StandardAnalyzer(), create);


Just to explain: indexDir is a File, create is set to false. The 2nd line is
just there to see what is going on.


My problem is that there are many indices; for testing purposes, just 4 of
them. The first one starts and works like it should, but from the 2nd one
on I get those "Lock obtain timed out" exceptions.

08:38:05,199 INFO  [IndexerManager] No indexer found for directory
D:\[mydir]\index1- starting new one!
08:38:05,199 INFO  [Indexer] Index: D:\[mydir]\index1 Lock file:
lucene-0ca7838f9396a636d1feda5aabb9b8db
08:38:05,215 INFO  [IndexerManager] New amount of Indexers: 1
08:38:05,215 INFO  [IndexerManager] No indexer found for directory
D:\[mydir]\index2- starting new one!
08:38:05,215 INFO  [Indexer] Index: D:\[mydir]\index2 Lock file:
lucene-cc9dfaabbf7ad61c4bb3af007b88288c
08:38:06,213 ERROR [IndexerManager] Lock obtain timed out:
[EMAIL PROTECTED]:\Dokumente und Einstellungen\[user]\Lokale
Einstellungen\Temp\lucene-fd415060ae453638d69faa9fa07fbe95-n-write.lock
java.io.IOException: Lock obtain timed out: [EMAIL PROTECTED]:\Dokumente und
Einstellungen\[user]\Lokale
Einstellungen\Temp\lucene-fd415060ae453638d69faa9fa07fbe95-n-write.lock
at org.apache.lucene.store.Lock.obtain(Lock.java:68)
at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:257)
at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:247)
at de.innosystec.iar.indexing.Indexer.setUp(Indexer.java:101)
at de.innosystec.iar.indexing.Indexer.<init>(Indexer.java:80)
at
de.innosystec.iar.indexing.IndexerManager.addDocumentElement(IndexerManager.
java:228)
at
de.innosystec.iar.parsing.ParserManager.indexDocumentElement(ParserManager.j
ava:286)
at
de.innosystec.iar.parsing.ParserThread.startWorking(ParserThread.java:378)
at de.innosystec.iar.parsing.ParserThread.run(ParserThread.java:175)
at java.lang.Thread.run(Unknown Source)


The lock file mentioned in the exception is really created and used by the
first index. Seems like the FSDirectory.getLockID method doesn't work for
NativeFSLockFactory?
I'm using Win XP on my test platform.


Regards,
Frank


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]