Re: indexing numeric entities?

2004-10-12 Thread Damian Gajda
Yes, you need to parse the entities yourself. I implemented an HTML
entity parser as part of the http://objectledge.org project. You may use
it if it fits your needs. It is in the ledge-components project
module. See http://objectledge.org/modules/ledge-components/index.html
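
For readers who only need numeric character references handled before
indexing, here is a minimal sketch in plain Java. It is not the
ObjectLedge parser; the class name NumericEntities and its decode method
are hypothetical, and named entities such as &amp; are deliberately left
alone.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NumericEntities {
    // Matches decimal (&#233;) and hexadecimal (&#xE9;) character references.
    private static final Pattern NUMERIC_ENTITY =
            Pattern.compile("&#(?:([0-9]{1,7})|[xX]([0-9a-fA-F]{1,6}));");

    /** Replaces numeric character references with the characters they denote. */
    public static String decode(String text) {
        Matcher m = NUMERIC_ENTITY.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int codePoint = (m.group(1) != null)
                    ? Integer.parseInt(m.group(1), 10)
                    : Integer.parseInt(m.group(2), 16);
            // Leave out-of-range references untouched instead of guessing.
            String replacement = Character.isValidCodePoint(codePoint)
                    ? new String(Character.toChars(codePoint))
                    : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("Gda&#324;sk, caf&#xE9;"));
    }
}
```

The decoded text can then be handed to the analyzer in place of the raw
markup.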

Have fun,
-- 
Damian Gajda
Caltha Sp. j.
http://www.caltha.pl/




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: BooleanQuery - Too Many Clauses on date range.

2004-10-01 Thread Damian Gajda
On Fri, 01-10-2004 at 07:57 -0500, Scott Ganyo wrote:
 You can use:
 
 BooleanQuery.setMaxClauseCount(int maxClauseCount);

I had a similar problem with date ranges. Someone on the list suggested
a solution that is more clever than the one above. Raising the clause
limit helps, but it makes searches slower and is memory hungry (many
terms are loaded into memory and then searched).

The suggested solution was to split dates into sub-fields during
indexing and to search on those fields. This is more efficient, but it
makes queries harder to write (personally I prefer working with queries
built using the Lucene API over ones parsed by QueryParser).

For instance, the timestamp 2004-10-01 15:34:26.001 may be split into
the following fields:
some-date_year: 2004
some-date_month: 10
some-date_day: 01
some-date_time: 153426001

The above fields should be indexed so they can be searched. They open up
some nice possibilities, for instance fast and easy querying for all
documents that have a date in a particular year, month or day of month.
For convenience one could also store weekdays.
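
The splitting described above might be sketched as follows. This is only
an illustration in plain Java: the class DateFields, its split method and
the use of java.time are assumptions, and the numeric weekday field
(1 = Monday) is one possible way to store weekdays. Each map entry would
become one indexed field of the document.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.LinkedHashMap;
import java.util.Map;

public class DateFields {
    private static final DateTimeFormatter YEAR = DateTimeFormatter.ofPattern("yyyy");
    private static final DateTimeFormatter MONTH = DateTimeFormatter.ofPattern("MM");
    private static final DateTimeFormatter DAY = DateTimeFormatter.ofPattern("dd");
    private static final DateTimeFormatter TIME = DateTimeFormatter.ofPattern("HHmmssSSS");

    /** Splits a timestamp into zero-padded sub-fields named prefix_year, prefix_month, ... */
    public static Map<String, String> split(String prefix, LocalDateTime ts) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put(prefix + "_year", ts.format(YEAR));
        fields.put(prefix + "_month", ts.format(MONTH));
        fields.put(prefix + "_day", ts.format(DAY));
        fields.put(prefix + "_time", ts.format(TIME));
        // Optional convenience field: ISO weekday number, 1 = Monday ... 7 = Sunday.
        fields.put(prefix + "_weekday", String.valueOf(ts.getDayOfWeek().getValue()));
        return fields;
    }

    public static void main(String[] args) {
        // 2004-10-01 15:34:26.001 (1 ms = 1,000,000 ns)
        LocalDateTime ts = LocalDateTime.of(2004, 10, 1, 15, 34, 26, 1_000_000);
        split("some-date", ts).forEach((name, value) -> System.out.println(name + ": " + value));
    }
}
```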

A query for the date range from 15 August to 10 October 2004 (in no
particular query language - this just gives an idea):
some-date_year = 2004 AND (
   (some-date_month = 08 AND some-date_day >= 15) OR
   (some-date_month = 09) OR
   (some-date_month = 10 AND some-date_day <= 10)
)

As you can see, it is easy to build such a query with the Lucene API. The
equalities are Term queries, the inequalities are Range queries, and the
AND and OR operators are provided by Boolean queries.
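
To show how the clauses of such a range query can be derived
mechanically, here is a small sketch that prints the pseudo-query for any
range lying within one calendar year. It is plain Java and does not call
Lucene; the class SplitDateRange and its build method are hypothetical.
When translating it to the Lucene API, each "=" would become a Term query
and each ">=" or "<=" a Range query, as described above. Values are
zero-padded on the assumption that range comparisons on terms are done as
strings.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

public class SplitDateRange {
    /** Builds a pseudo-query for an inclusive date range within one calendar year. */
    public static String build(String prefix, LocalDate from, LocalDate to) {
        if (from.getYear() != to.getYear() || from.isAfter(to)) {
            throw new IllegalArgumentException("sketch handles only ordered ranges within one year");
        }
        String month = prefix + "_month";
        String day = prefix + "_day";
        int m1 = from.getMonthValue();
        int m2 = to.getMonthValue();
        List<String> clauses = new ArrayList<>();
        if (m1 == m2) {
            // Both bounds fall in the same month: constrain the day from both sides.
            clauses.add("(" + month + " = " + pad(m1) + " AND " + day + " >= " + pad(from.getDayOfMonth())
                    + " AND " + day + " <= " + pad(to.getDayOfMonth()) + ")");
        } else {
            // First month: only a lower bound on the day.
            clauses.add("(" + month + " = " + pad(m1) + " AND " + day + " >= " + pad(from.getDayOfMonth()) + ")");
            // Months strictly in between match on every day.
            for (int m = m1 + 1; m < m2; m++) {
                clauses.add("(" + month + " = " + pad(m) + ")");
            }
            // Last month: only an upper bound on the day.
            clauses.add("(" + month + " = " + pad(m2) + " AND " + day + " <= " + pad(to.getDayOfMonth()) + ")");
        }
        return prefix + "_year = " + from.getYear() + " AND (" + String.join(" OR ", clauses) + ")";
    }

    // Zero-pads to two digits so values match the indexed form.
    private static String pad(int n) {
        return (n < 10 ? "0" : "") + n;
    }

    public static void main(String[] args) {
        System.out.println(build("some-date", LocalDate.of(2004, 8, 15), LocalDate.of(2004, 10, 10)));
    }
}
```

A range that spans several years would need an extra level of the same
pattern on the year field; the sketch rejects such input rather than
guessing.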

Have fun implementing the solution. It has only one disadvantage: it
makes sorting results harder. The fix is to use multiple sort fields, or
another stored field containing the full date (you will almost surely
need to store a date for each hit anyway, unless you want to write some
baroque code to reconstruct the date from the split field values).
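
One way to build the extra full-date field mentioned above is a
fixed-width string whose alphabetical order equals chronological order.
The sketch below is plain Java; the class SortableDate, the method
sortKey and the chosen pattern are assumptions, not something prescribed
by Lucene, and the ordering only holds for four-digit years.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class SortableDate {
    // Most significant unit first, every unit zero-padded to a fixed width.
    private static final DateTimeFormatter SORT_KEY =
            DateTimeFormatter.ofPattern("yyyyMMddHHmmssSSS");

    /** Returns a 17-character key that sorts as a string in chronological order. */
    public static String sortKey(LocalDateTime ts) {
        return ts.format(SORT_KEY);
    }

    public static void main(String[] args) {
        List<String> keys = new ArrayList<>();
        keys.add(sortKey(LocalDateTime.of(2004, 10, 1, 15, 34, 26, 1_000_000)));
        keys.add(sortKey(LocalDateTime.of(2004, 8, 15, 0, 0, 0)));
        keys.add(sortKey(LocalDateTime.of(2003, 12, 8, 19, 21, 0)));
        Collections.sort(keys); // plain string sort, oldest first
        keys.forEach(System.out::println);
    }
}
```

The same value can be stored with each document so the date of a hit can
be displayed without recombining the split fields.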

Have fun,
-- 
Damian Gajda
Caltha Sp. j.





Re: Memory usage: IndexSearcher Sort

2004-09-29 Thread Damian Gajda
 Most helpful in this search was the following thread from Bugzilla:
  
 http://issues.apache.org/bugzilla/show_bug.cgi?id=30628
  

We had a similar problem in our webapp.

Please look at the bug
http://issues.apache.org/bugzilla/show_bug.cgi?id=31240

My co-worker Rafał fixed this bug and submitted a patch today.

Have fun ;)
-- 
Damian Gajda
Caltha Sp. j.
Warszawa 02-807
ul. Kukułki 2
tel. +48 22 643 20 20
mobile: +48 501 032 506
http://www.caltha.pl/





Re: term vector or document vector

2003-12-08 Thread Damian Gajda
 I have to do some work for Nutch, but since I need the feature vector
 stuff for a commercial project I will try to implement it.
 Does anyone wish to join me? ;)
 
 Stefan

Hello, I already have some experience with Dmitry's implementation.
Feel free to contact me.

-- 
Damian






Re: term vector or document vector

2003-12-08 Thread Damian Gajda
In a message of Mon, 08-12-2003, 19:21, Stefan Groschupf writes:
 Damian Gajda wrote:
 
 Hello, I already have some experience with Dmitry's implementation.
 
 Can you point me to Dmitry's code so that I can take a look? I have
 only read about it.


Here are some links for your consideration:
http://issues.apache.org/bugzilla/show_bug.cgi?id=18927
And links from the bug page:
http://nagoya.apache.org/eyebrowse/[EMAIL PROTECTED]msgId=114748
http://nagoya.apache.org/eyebrowse/[EMAIL PROTECTED]msgId=114861
http://nagoya.apache.org/eyebrowse/[EMAIL PROTECTED]msgId=114862
http://nagoya.apache.org/eyebrowse/[EMAIL PROTECTED]msgId=433778

Dmitry's code works with Lucene 1.2, although not quite: one class needs
some hand fixing after applying those patches.

-- 
Damian





Re: term vector or document vector

2003-12-08 Thread Damian Gajda
BTW, I can send you the partly working Lucene with Dmitry's code patched
in.

-- 
Damian


