Re: indexing numeric entities?
Yes, you need to parse the entities yourself. I implemented an HTML entity parser as part of the http://objectledge.org project. You may use it if it fits your needs. It is in the ledge-components module. See http://objectledge.org/modules/ledge-components/index.html
Have fun,
-- Damian Gajda Caltha Sp. j. http://www.caltha.pl/
- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
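For readers who only need numeric entities decoded before indexing, here is a minimal sketch in plain Java. It is not the objectledge parser mentioned above; the class name NumericEntityDecoder and the method decode are made up for illustration, and named entities such as &amp; are deliberately left alone.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene or objectledge.
class NumericEntityDecoder {
    // Matches decimal (&#233;) and hexadecimal (&#xE9;) character references.
    private static final Pattern NUMERIC =
            Pattern.compile("&#(?:[xX]([0-9a-fA-F]{1,6})|([0-9]{1,7}));");

    // Replaces every numeric character reference with the character it names.
    // References outside the Unicode range are left untouched.
    static String decode(String text) {
        Matcher m = NUMERIC.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int codePoint = m.group(1) != null
                    ? Integer.parseInt(m.group(1), 16)
                    : Integer.parseInt(m.group(2));
            String replacement = Character.isValidCodePoint(codePoint)
                    ? new String(Character.toChars(codePoint))
                    : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

The decoded string can then be handed to the analyzer as usual, so that a term written with an entity matches the same term written with the literal character.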
Re: BooleanQuery - Too Many Clauses on date range.
On Fri, 01-10-2004 at 07:57 -0500, Scott Ganyo wrote:
You can use: BooleanQuery.setMaxClauseCount(int maxClauseCount);

I had a similar problem with date ranges. Someone on the list suggested a cleverer solution than the one above. Raising the clause limit helps, but it makes searches slower and is memory hungry (many terms are loaded into memory and then searched).

The suggested solution was to split dates into sub-fields during indexing and to use those fields while searching. This is more efficient, but it makes queries harder to create (personally I prefer working with queries built using the Lucene API rather than ones parsed by QueryParser). For instance the time stamp 2004-10-01 15:34:26.001 may be split into the following fields:

    some-date_year: 2004
    some-date_month: 10
    some-date_day: 01
    some-date_time: 153426001

The above fields should be indexed so they can be searched. They give some nice possibilities, for instance fast and easy querying for all documents that have a date in a particular year, month or day of month. For convenience one could also store weekdays.

A query for a date range from 15th August to 10th October 2004 (in no particular query language, this just gives the idea):

    some-date_year = 2004 AND (
        (some-date_month = 08 AND some-date_day >= 15) OR
        (some-date_month = 09) OR
        (some-date_month = 10 AND some-date_day <= 10)
    )

As you can see, it is easy to build such a query with the Lucene API. The equalities are Term queries. The inequalities are Range queries. The AND and OR operators are provided by Boolean queries.

Have fun implementing the solution. It has only one disadvantage: it makes sorting the results harder. The remedy is to use multiple sort fields, or another stored field containing the full date (one will almost surely need to store a date for each hit anyway, unless you want to write some baroque code to calculate the date from the split field values).

Have fun,
-- Damian Gajda Caltha Sp. j.
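A minimal sketch of the field splitting and of the range expression described above, in plain Java with no Lucene dependency. The class SplitDateFields and its methods are made up for illustration. The string it builds uses the same informal notation as the example query, not QueryParser syntax; in real code each comparison would become a Term or Range query combined in a Boolean query. The sketch only handles ranges within one year where the start month is earlier than the end month.

```java
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper illustrating the split-field idea; not a Lucene class.
class SplitDateFields {
    private static final DateTimeFormatter TIME = DateTimeFormatter.ofPattern("HHmmssSSS");

    // Splits a time stamp into zero-padded year, month, day and time fields.
    static Map<String, String> split(String prefix, LocalDateTime ts) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put(prefix + "_year", String.format(Locale.ROOT, "%04d", ts.getYear()));
        fields.put(prefix + "_month", String.format(Locale.ROOT, "%02d", ts.getMonthValue()));
        fields.put(prefix + "_day", String.format(Locale.ROOT, "%02d", ts.getDayOfMonth()));
        fields.put(prefix + "_time", ts.format(TIME));
        return fields;
    }

    // Builds the informal boolean expression for an inclusive date range.
    // Assumption: both dates are in the same year and from.month < to.month.
    static String sameYearRange(String prefix, LocalDate from, LocalDate to) {
        if (from.getYear() != to.getYear() || from.getMonthValue() >= to.getMonthValue()) {
            throw new IllegalArgumentException("sketch needs from.month < to.month in one year");
        }
        String month = prefix + "_month";
        String day = prefix + "_day";
        StringBuilder q = new StringBuilder();
        q.append(String.format(Locale.ROOT, "%s_year = %04d AND ( ", prefix, from.getYear()));
        // Partial first month: from the start day onwards.
        q.append(String.format(Locale.ROOT, "(%s = %02d AND %s >= %02d)",
                month, from.getMonthValue(), day, from.getDayOfMonth()));
        // Whole months strictly between the two end months, if any.
        int firstWhole = from.getMonthValue() + 1;
        int lastWhole = to.getMonthValue() - 1;
        if (firstWhole == lastWhole) {
            q.append(String.format(Locale.ROOT, " OR (%s = %02d)", month, firstWhole));
        } else if (firstWhole < lastWhole) {
            q.append(String.format(Locale.ROOT, " OR (%s >= %02d AND %s <= %02d)",
                    month, firstWhole, month, lastWhole));
        }
        // Partial last month: up to and including the end day.
        q.append(String.format(Locale.ROOT, " OR (%s = %02d AND %s <= %02d)",
                month, to.getMonthValue(), day, to.getDayOfMonth()));
        q.append(" )");
        return q.toString();
    }
}
```

Zero-padding matters here: the field values are compared as strings, so "08" must sort before "10" for the range comparisons to behave.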
Re: Memory usage: IndexSearcher Sort
Most helpful in this search was the following thread from Bugzilla: http://issues.apache.org/bugzilla/show_bug.cgi?id=30628

We had a similar problem in our webapp. Please look at the bug http://issues.apache.org/bugzilla/show_bug.cgi?id=31240
My co-worker Rafał has fixed this bug and submitted a patch today.
Have fun ;)
-- Damian Gajda Caltha Sp. j. Warszawa 02-807 ul. Kukułki 2 tel. +48 22 643 20 20 mobile: +48 501 032 506 http://www.caltha.pl/
Re: term vector or document vector
I have to do some work for nutch, but since I need the feature vector stuff for a commercial project I will try to implement it. Does anyone wish to join me? ;) Stefan

Hello, I already have some experience with Dmitry's implementation. Feel free to contact me.
-- Damian
Re: term vector or document vector
On Mon, 08-12-2003 at 19:21, Stefan Groschupf wrote:
Damian Gajda wrote: Hello I already have some experience with Dmitry's implementation.
Can you point me to Dmitry's code, so that I can take a look? I have only read about it.

Here are some links for your consideration:
http://issues.apache.org/bugzilla/show_bug.cgi?id=18927
And links from the bug page:
http://nagoya.apache.org/eyebrowse/[EMAIL PROTECTED]msgId=114748
http://nagoya.apache.org/eyebrowse/[EMAIL PROTECTED]msgId=114861
http://nagoya.apache.org/eyebrowse/[EMAIL PROTECTED]msgId=114862
http://nagoya.apache.org/eyebrowse/[EMAIL PROTECTED]msgId=433778
Dmitry's code works with Lucene 1.2, although not quite out of the box: one class needs some hand fixing after applying those patches.
-- Damian
Re: term vector or document vector
BTW, I can send you the partly working Lucene with Dmitry's code patched in.
-- Damian