Re: [ANNOUNCE] Wiki editing change

2013-03-25 Thread Andrzej Bialecki
AndrzejBialecki to the ContributorsGroup. Thanks! -- Best regards, Andrzej Bialecki http://www.sigram.com, blog http://www.sigram.com/blog ___.,___,___,___,_._. __<>< [___||.__|__/|__||\/|: Information Retrieval, System In

Lucene 4 architecture - paper available

2012-10-09 Thread Andrzej Bialecki
e you enjoy the reading. :)

Re: Getting terms from unstored fields, doc-wise

2012-07-27 Thread Andrzej Bialecki
re it, either using stored fields or in an external system.

Re: Problem with TermVector offsets and positions not being preserved

2012-07-27 Thread Andrzej Bialecki
at shows a term vector correctly shows positions and offsets if available (or blanks if not available).

Re: Lucene 4.0 .FDT

2012-07-19 Thread Andrzej Bialecki
ds. The question is whether the space savings would be worth the complication?

[ANNOUNCE] Luke 4.0.0-ALPHA released

2012-07-17 Thread Andrzej Bialecki
g.Stires) * Issue 19: Custom directory implementation must be inherited from FSDirectory (mitja.lenic) * Issue 21: luke tarball needs to extract to a "luke" directory (bevan.koopman, Photodeus) * Issue 27: Cannot add or edit documents using StandardAnalyzer (dean.thrasher) Thanks to

Re: Index pruning

2012-06-13 Thread Andrzej Bialecki
of field:term pairs.

Re: lucene algorithm ?

2012-04-26 Thread Andrzej Bialecki
ct is lower than the current lowest score.

Re: Re-indexing a particular field only without re-indexing the entire enclosing document in the index

2012-04-26 Thread Andrzej Bialecki
age as far as I know. LUCENE-3837, to be specific. But as you said, it's still early and there is no code yet to speak of...

Re: delete entries from posting list Lucene 4.0

2012-04-02 Thread Andrzej Bialecki
On 29/03/2012 11:14, Andrzej Bialecki wrote: The problem in our implementation is that we use a within-document term frequency (the number of occurrences of t in the current document) and not a collection-wide term frequency... so, it looks to me that the fix would be to first fully traverse

Re: delete entries from posting list Lucene 4.0

2012-03-29 Thread Andrzej Bialecki
se the doc enumeration and calculate the total number of term occurrences in all documents (e.g. in RIDFTermPruningPolicy.initPositionsTerm(..) ), and use this value in the formula in place of termPositions.freq().

Re: delete entries from posting list Lucene 4.0

2012-03-19 Thread Andrzej Bialecki

Re: Tamper resistant index

2012-01-09 Thread Andrzej Bialecki
ed approach first, because it's easy to implement, and then see if it's good enough.

[ANN] Luke 3.5.0 released

2011-12-28 Thread Andrzej Bialecki
s and a happy New Year to you all! :)

Re: luke and chinese text

2011-12-22 Thread Andrzej Bialecki
the Settings menu a font that supports Unicode characters, the default platform font often doesn't support them, which results in '?' or other strange characters.

Re: Bet you didn't know Lucene can...

2011-10-31 Thread Andrzej Bialecki
On 31/10/2011 21:42, Petite Abeille wrote: On Oct 31, 2011, at 9:32 PM, Andrzej Bialecki wrote: similarity-preserving hash function was calculated on each sentence, and the hash was added as a field. The property of the hash was that similar documents (sentences) would produce a similar

Re: Bet you didn't know Lucene can...

2011-10-31 Thread Andrzej Bialecki
find a ranked list of documents that have the smallest bit-level distance in their hashes from the query hash. The solution is described in SOLR-1918 - Bit-wise scoring field type.
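
The bit-level distance ranking mentioned in this message can be illustrated outside of Lucene. The following is only a minimal sketch, assuming hashes are plain integers held in memory; the function names are hypothetical and this is not the SOLR-1918 implementation.

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two integer hashes differ."""
    return bin(a ^ b).count("1")


def rank_by_hash_distance(query_hash: int, doc_hashes: dict) -> list:
    """Return (doc_id, distance) pairs, smallest bit-level distance first.

    doc_hashes maps a document id to its integer hash (an assumed layout).
    """
    scored = [(doc_id, hamming_distance(query_hash, h))
              for doc_id, h in doc_hashes.items()]
    # Sort by distance, then by doc id so that ties are deterministic.
    return sorted(scored, key=lambda pair: (pair[1], pair[0]))
```

A linear scan like this only shows the ranking criterion; a real index would need a scoring mechanism such as the one discussed in the referenced issue.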

Re: [ANN] Luke 3.4.0 release

2011-10-03 Thread Andrzej Bialecki
some lesson to learn from this situation... I committed a fix, and the updated release is marked as 3.4.0_1. Sorry!

[ANN] Luke 3.4.0 release

2011-10-03 Thread Andrzej Bialecki
APIs. * Rearranged "field flags" so that they are more logical and cover index options added in 3.4.0. E.g. omitNorms is represented as "with Norms" and marked by "N", IndexOptions are expanded to "Idfp" to mark indexed fields with docs, freqs and

[ANN] Luke 3.3.0 released.

2011-07-06 Thread Andrzej Bialecki
Hi all, Luke 3.3.0 has been released and is available for download here: http://code.google.com/p/luke/ Apart from the updated Lucene libraries there were no changes in functionality.

Re: Coloring search results based on score?

2011-06-16 Thread Andrzej Bialecki
more details: http://people.ischool.berkeley.edu/~hearst/research/tilebars.html

Re: Changing Boosting that was set at indexing time

2011-06-16 Thread Andrzej Bialecki
ectly using IndexReader.setNorm(...) but you need to remember that this method uses raw byte values, that is the result of encoding a floating point value with Similarity.encodeNormValue(..).

[ANN] Luke 3.1.0 released

2011-04-29 Thread Andrzej Bialecki
ributing bug reports, patches and comments. Happy Luke-ing!

Re: Search one index but use IDF from another?

2011-03-10 Thread Andrzej Bialecki
->DF map with values obtained from the full index, and then you use this map to calculate IDF.
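
The idea of computing IDF from a document-frequency map taken from another index can be sketched as follows. This is a hedged illustration only: the map layout, the function names, and the choice of the `1 + ln(N / (df + 1))` form are assumptions, not code from the original message.

```python
import math


def idf(doc_freq: int, num_docs: int) -> float:
    """One common IDF form: 1 + ln(numDocs / (docFreq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def idf_from_full_index(term: str, df_map: dict, num_docs: int) -> float:
    """Look up the term's document frequency in a map built from the full
    index (hypothetical structure: term -> DF) and derive IDF from it.
    Terms missing from the map are treated as having DF 0."""
    return idf(df_map.get(term, 0), num_docs)
```

Here `num_docs` would be the document count of the full index, so that the smaller index being searched scores terms as if it saw the whole collection.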

Re: Adding a new field to existing Index

2010-07-07 Thread Andrzej Bialecki
On 2010-07-07 14:49, Naveen Kumar wrote: Hi Andrzej Bialecki When you suggested - "There are some other low-level ways to do this, but the easiest is to use a FilterIndexReader, especially since you just want to add a stored field - implement a subclass of FilterIndexR

Re: Document Order in IndexWriter.addIndexes

2010-06-30 Thread Andrzej Bialecki
recorded in the output index.

Re: Adding a new field to existing Index

2010-06-30 Thread Andrzej Bialecki
tent can be recovered. See the "Reconstruct & Edit" functionality in Luke (http://www.getopt.org/luke).

Re: Document Order in IndexWriter.addIndexes

2010-06-29 Thread Andrzej Bialecki
it doesn't rely on this behavior. You have been warned :)

Re: Question about Field.setOmitTermFreqAndPositions(true)

2010-05-31 Thread Andrzej Bialecki
On 2010-05-31 10:54, Uwe Schindler wrote: > No. See also LUCENE-2048 (nice round number ;) ).

Re: Access indexed terms

2010-05-14 Thread Andrzej Bialecki
Is there an alternative way to do > that? Yes, see the discussion here: https://issues.apache.org/jira/browse/LUCENE-2393

Re: Access indexed terms

2010-05-14 Thread Andrzej Bialecki
ke/DocReconstructor.java If you really need such kind of access in your application then add your documents with term vectors with offsets and positions. Even then, depending on the Analyzer you used, the process is lossy - some input data that was discarded by Analyzer is simply no longer available

Re: How to avoid sharing docStore files?

2010-05-12 Thread Andrzej Bialecki
ks. However, even this new tool will make a copy of the original index, so you will need twice as much space. But in this case perhaps you could put the original index on a network FS, and split it into the target partition - the

[ANN] Luke - The Lucene Index Toolbox - 1.0.1 release

2010-04-01 Thread Andrzej Bialecki
lyzer plugin (and analyzers) don't work. * Issue 4 : Compress flag no longer available. * Issue 14 : Error while using custom similarity. Enjoy!

Re: SpanQueries in Luke

2010-03-05 Thread Andrzej Bialecki
ally one could store such information in IndexCommit.getUserData(). The lack of standardized metadata is an issue, of course - we could start experimenting with this in Luke, to see whether we can squeeze a subset of Solr schema there.

Re: SpanQueries in Luke

2010-03-05 Thread Andrzej Bialecki
textarea. I'll commit the current mostly-working state today, you can take a look - you've written some cool Luke plugins before .. ;)

Re: SpanQueries in Luke

2010-03-04 Thread Andrzej Bialecki

Re: SpanQueries in Luke

2010-03-04 Thread Andrzej Bialecki
this parser out of the box. I expect to make a release within a few days. Watch the commits on the Google code project ...

Re: Do deleted documents affect scores?

2010-02-11 Thread Andrzej Bialecki
merge segments).

Re: Field creation with TokenStream and stored value

2010-01-13 Thread Andrzej Bialecki
t your own Fieldable, and return what you want from its methods. You can also use Field constructor that takes the stored value, and then use Field.setTokenStream(TokenStream) - it doesn't override the stored value.

Re: [ANN] Luke 1.0.0 for Lucene 3.0

2009-12-26 Thread Andrzej Bialecki
existent fall back to the zero-arg ctor. > > I'll open an issue. Indeed - thanks!

[ANN] Luke 1.0.0 for Lucene 3.0

2009-12-26 Thread Andrzej Bialecki
tween Lucene 2.9.1 and 3.0. Your feedback is welcome - please use the Google Issue tracker to report issues. Merry Christmas!

Re: document with different index time boost returns same score

2009-12-18 Thread Andrzej Bialecki
that this encoding causes (and what input values effectively come out the same, once encoded).

[ANN] Luke 0.9.9.1 release

2009-11-20 Thread Andrzej Bialecki
ability to edit per-commit user data Map Bug fixes - * Term frequency vectors were not displayed for selected field. Enjoy!

Re: Split single string into several fields?

2009-10-28 Thread Andrzej Bialecki
d you can even create other fields in the document (or split this token stream into several fields).

Re: [ANN] Luke 0.9.9 release

2009-10-23 Thread Andrzej Bialecki
yourself ;) Keyboard shortcuts are hardcoded somewhere deep in Thinlet, but likely they could be made configurable. You can find an EPS version of the Lucene logo here: http://lucene.apache.org/images/logo.eps

Re: Question about how to speed up custom scoring

2009-10-08 Thread Andrzej Bialecki
while) that if the terms you load are indexed that'll help. But this is mostly a guess. Just to clarify: IndexReader.document(doc) and .document(doc, selector) load _only_ stored fields, they don't interact at all with the terms-related part of Lucene.

Re: [ANN] Luke 0.9.9 release

2009-10-01 Thread Andrzej Bialecki
Andrzej Bialecki wrote: Hi all, I'm happy to announce the new release of Luke - the Lucene Index Toolbox. There's a bug in this version in that it doesn't show TermVectors for a field. I'll fix it in a few days - I'm waiting for other potential bugs to show up. So i

[ANN] Luke 0.9.9 release

2009-09-29 Thread Andrzej Bialecki
Chris Pimlott and others. Enjoy! :)

Re: Lucene gobbling file descriptors

2009-08-27 Thread Andrzej Bialecki
e-for-Lucene

Lucene Search Performance Analysis Workshop

2009-08-26 Thread Andrzej Bialecki
rsday, September 3rd 2009 11:00-11:30AM PDT / 14:00-14:30 EDT Follow this link to sign up: http://www2.eventsvc.com/lucidimagination/event/ff97623d-3fd5-43ba-a69d-650dcb1d6bbc?trk=WR-SEP2009-AP About: Lucene Performance Workshop: Understanding Lucene Search Performance with Andrzej Bialecki Experi

Re: Why does this search succeed with web app, but not Luke?

2009-08-07 Thread Andrzej Bialecki
nized" version of the field. At this point any potential mismatch in query terms vs. analyzed tokens in the field should become apparent.

Re: Weird discrepancy with term counts vs. terms (off by 1)

2009-08-02 Thread Andrzej Bialecki
atch to Andrzej, the author of Luke. Thank you Phil for spotting this bug - this fix will be included in the next release of Luke.

[ANN] Luke + Hadoop, alpha version

2009-07-10 Thread Andrzej Bialecki
welcome - please keep in mind that this is an early preview. Also, various UI glitches are probably related to the Thinlet toolkit - again, one day I may re-write Luke using something else, but for now I don't have the strength to do it.

Re: Lucene Index Encryption

2009-05-10 Thread Andrzej Bialecki
rg/jira/browse/LUCENE-532

Re: Index in text format

2009-04-24 Thread Andrzej Bialecki
(http://www.getopt.org/luke) can export all stored fields from all documents into an XML file.

Re: Help to determine why an optimized index is proportionaly too big.

2009-04-10 Thread Andrzej Bialecki
space. (Actually: does CheckIndex warn about unused files in the index directory so people can clean them up? i'm not sure) It doesn't. But Luke has a function to do this.

Re: Index Partitioning

2009-03-22 Thread Andrzej Bialecki
. * repeat the cycle as many times as needed A more elegant version of this algorithm can be implemented using FilterIndexReader.

Re: [ANN] Luke 0.9.2 release

2009-03-20 Thread Andrzej Bialecki
Andrzej Bialecki wrote: (sorry for cross-posting) Hi all, I'm happy to announce a new release of Luke, the Lucene Index Toolbox. As usually, you can obtain it from here: http://www.getopt.org/luke If you tried to access this url during last couple hours the site was down. It s

[ANN] Luke 0.9.2 release

2009-03-19 Thread Andrzej Bialecki
ounts per field in Overview - contributed by Mark Harwood. o Improved the Analysis plugin to show all token information, and highlight whenever a token is selected from the list. * Bug fixes: o (None)

Re: boosting query

2009-03-19 Thread Andrzej Bialecki
plement an arbitrary re-sorting of top-N results, according to your rules of preference (business rules, or heuristics). This way you can avoid the overfitting or doing endless tweaking, and still get the ranking that makes sense to your users.
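
Re-sorting only the top-N results by a preference rule, as suggested in this message, might look like the sketch below. It assumes hits arrive as `(doc_id, score)` pairs already sorted by descending score; `rerank_top_n` and the `preference` callback are hypothetical names, not part of any Lucene API.

```python
def rerank_top_n(hits, n, preference):
    """Re-sort the first n hits by a preference key, leaving the tail as is.

    hits: list of (doc_id, score) pairs sorted by descending score.
    preference: function mapping a hit to a sortable value; higher wins.
    """
    top, rest = hits[:n], hits[n:]
    # sorted() is stable, so hits with equal preference keep their
    # original relevance order.
    reranked = sorted(top, key=preference, reverse=True)
    return reranked + rest
```

For example, a rule that promotes one particular document within the top three would leave the relative order of the other hits untouched.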

Re: IndexSearcher

2009-03-09 Thread Andrzej Bialecki
the classpath when you start Luke.

Re: IndexSearcher

2009-03-08 Thread Andrzej Bialecki
liat oren wrote: Ok, thanks. I will have to edit the code of Luke in order to add another analyzer, right? No - if your analyzer is already on the classpath, then it's enough to type in the fully qualified class name in the drop down box (it's editable).

Re: Luke site is down?

2009-03-04 Thread Andrzej Bialecki
Hi all, I apologize for the inconvenience - the site went down without any prior notice from the ISP. I'm investigating the issue ...

Re: Determining index term count

2009-01-07 Thread Andrzej Bialecki
formation this way would be messy - it's better to propose that this information should be added to API.

Re: Document.getBinaryValue returning null after upgrading to 2.4 for the data which was indexed using 2.3.1

2008-12-16 Thread Andrzej Bialecki
versions of Lucene involved.

Re: Document.getBinaryValue returning null after upgrading to 2.4 for the data which was indexed using 2.3.1

2008-12-16 Thread Andrzej Bialecki
g 2.4 , the search worked fine using 2.4. Any ideas why this is happening. No idea - but perhaps this is somehow related: https://issues.apache.org/jira/browse/LUCENE-1452

Re: StandardAnalyzer vs KeywordAnalyzer in Luke

2008-12-02 Thread Andrzej Bialecki
n turn _require_ the presence of a common-grams.utf8 resource on the classpath. To summarize: unless you want to get your hands dirty with Luke internals it can't be done.

[ANN] Luke 0.9.1 - bugfix release

2008-11-23 Thread Andrzej Bialecki
ll commits" option was specified. Reported by Mark Harwood. o Empty index with no fields was reported as invalid. Discovered by Andrew Zhang and Michael McCandless (LUCENE-1454). Thank you!

Re: [ot] a reverse lucene

2008-11-23 Thread Andrzej Bialecki
. (with a score etc) I can see the case for this would be a news-article and several people writing queries to get alerted if it matched a certain condition. http://www.seas.upenn.edu/~svilen/publications/subscribe.pdf

Re: [ANN] Luke 0.9 released

2008-11-14 Thread Andrzej Bialecki
but in practice Luke directly accesses the underlying Directory in many other places ... I forgot about the use of IndexFileDeleter - and indeed passing the read-only flag here can solve this, because then I can always use KeepAllDeletionPolicy when opening read-

[ANN] Luke 0.9 released

2008-11-13 Thread Andrzej Bialecki
, although I tested all functionality to make sure that there is no data loss. HOWEVER, if you work with precious data, it's always a good idea to use the "Read-only" option. As usually, bug reports or suggestions for improvements, or even better patches, are welcome!

Re: Read all the data from an index

2008-10-31 Thread Andrzej Bialecki
ntent of these deleted documents, call first IndexReader.undeleteAll().

Re: Luke is coming .. not there yet.

2008-10-30 Thread Andrzej Bialecki
- we can include this in the proposals for the next summer.

Re: Luke is coming .. not there yet.

2008-10-30 Thread Andrzej Bialecki
ss someone else does it it's simply not going to happen. All code in Luke except for the Thinlet class is under Apache License, so feel free to start coding :)

Re: Luke is coming .. not there yet.

2008-10-30 Thread Andrzej Bialecki
Andrzej Bialecki wrote: 1) Luke 2.4 release. This has the advantage of being an official stable [...] 2) Luke 2.9-dev snapshot. This has the advantage that you get the [...] Of course I meant Lucene 2.4 and Lucene 2.9-dev ... sorry for the confusion.

Luke is coming .. not there yet.

2008-10-30 Thread Andrzej Bialecki
index to a new format, incompatible with earlier versions of Lucene (including 2.4 release).

Re: Sorting posting lists before intersection

2008-10-13 Thread Andrzej Bialecki
n it won't cause any IO, otherwise it needs to read this info from the .ti file.

Re: Case Sensitivity

2008-09-19 Thread Andrzej Bialecki
vide static methods on Fieldable that test the validity of flag combinations with particular version of Lucene?

Re: Sorting posting lists before intersection

2008-09-17 Thread Andrzej Bialecki
: ConjunctionScorer, lines 85-103 - pay attention to the comments there, it's not strictly a sort by frequency, rather by the sampled "sparseness".

Re: Pre-filtering for expensive query

2008-09-04 Thread Andrzej Bialecki
Grant Ingersoll wrote: On Aug 30, 2008, at 3:14 PM, Andrzej Bialecki wrote: I think you can use a FilteredQuery in a BooleanClause. This may be faster than the filtering code in the Searcher, because the evaluation is done during scoring and not afterwards. FilteredQuery internally makes

Re: Pre-filtering for expensive query

2008-08-30 Thread Andrzej Bialecki
one during scoring and not afterwards. FilteredQuery internally makes use of skipTo(), which should help to limit the number of evaluated docs.

Re: Case Sensitivity

2008-08-28 Thread Andrzej Bialecki

Re: boost freshness instead of sorting

2008-08-28 Thread Andrzej Bialecki
or each "1".) We are discussing the same thing in "Case sensitivity" thread - it's possible to have a tokenized field and omit its norms.

Re: Case Sensitivity

2008-08-28 Thread Andrzej Bialecki
Otis Gospodnetic wrote: So in other words, it *is* possible to have the field both tokenized and its norms omitted? Yes. Probably this is an unintended side-effect of adding setOmitNorms, but I think it's useful and IMHO we should keep it.

Re: Case Sensitivity

2008-08-28 Thread Andrzej Bialecki
ts only omitNorms. So the flags are set now like this: isIndexed = true; isTokenized = true; omitNorms = true; The end result of processing such a field is (I believe) conceptually equivalent to adding as many Fields as there are tokens, each with omitNorms=true.

Re: updating existing field values

2008-08-07 Thread Andrzej Bialecki
s too, which requires overriding other methods.

Re: Stop search process when a given number of hits is reached

2008-08-07 Thread Andrzej Bialecki
t in Nutch (org.apache.nutch.indexer.IndexSorter).

Re: Copying a part of index and index structure

2008-06-20 Thread Andrzej Bialecki
with Segmented Indices ... and quite a few other papers that I don't remember now ... please do a search for "distributed IR" on ACM or Citeseer.

Re: Copying a part of index and index structure

2008-06-20 Thread Andrzej Bialecki
than splitting by doc.

Re: Rebuilding parallel indexes

2008-06-09 Thread Andrzej Bialecki
s I have a thought ;) Perhaps you could use a FilteredIndexReader to maintain a map between new IDs and old IDs, and remap on the fly. Although I think that some parts of Lucene depend on the fact that in a normal index the IDs are monotonically increasing ... this would complicate the issue.

Re: text extraction from pdf

2008-05-14 Thread Andrzej Bialecki
something better, AFAIK, PDFBox has a lower-level API that allows you to get hold of text positions.

Re: Can POI provide reliable text extraction results for production search engine for Word, Excel and PowerPoint formats?

2008-05-13 Thread Andrzej Bialecki
er with extracting the content _and_ formatting from any documents that could be normally opened with MS Office - however, performance was an issue, ie. it was slow, CPU/memory hog, and occasionally it would get stuck in a weird state when only complete reboot would help.

Re: Does lucene support distributed indexing?

2008-04-29 Thread Andrzej Bialecki
that executes in a distributed fashion (not sure if map-reduce is the best model here), but first copies the indexes to LocalFileSystem.

Re: How to reconstruct field value from index ?

2008-04-02 Thread Andrzej Bialecki
e is to it :)

Re: Stored Field vs "offset plus external file"?

2008-02-13 Thread Andrzej Bialecki
be configured to use mmap / ram / fs for specific index file types (e.g. tis, tii, fdt, prx and so on) - you should be able to cut & paste large chunks of each directory code to start the implementation.

Re: Lukes document hitlist display

2008-02-12 Thread Andrzej Bialecki
? It should, but this information is not available ... Luke populates this screen using Document.getFields(). If a field is unstored then it's not returned in this list, so it's not possible to get i

Re: TermPositionVector

2008-02-12 Thread Andrzej Bialecki
rhaps I'll include it in a minor update.

[ANN] Luke 0.8 released

2008-02-04 Thread Andrzej Bialecki
m an index. Instead this column now reads "Norms" and shows the fieldNorm value of a field. Have fun!

Re: Performance guarantees and index format

2008-02-04 Thread Andrzej Bialecki
lass applicable to various scenarios.

Re: appending field to an existing index

2008-02-04 Thread Andrzej Bialecki
emented (yet?).
