Re: Include BM25 in Lucene?
Chuck Williams wrote:
> Vic Bancroft wrote on 10/17/2006 02:44 AM:
>> In some of my group's usage of Lucene over large document collections, we have split the documents across several machines. This has led to a concern about whether the inverse document frequency is appropriate, since the score seems to depend on the partitioning of documents over indexing hosts. We have not formulated an experiment to determine whether it seriously affects our results, though it has been discussed.
>
> What version of Lucene are you using?

The current systems are based on 1.9.1, though I suspect we should clean up the deprecation warnings and move to 2.0.0.

> Are you using ParallelMultiSearcher to manage the distributed indexes or have you implemented your own mechanism?

We had started with the ParallelMultiSearcher, but did not see appropriate scalability with high numbers of concurrent requests. The bottleneck was on the reduce side, folding results back together. The first-cut mechanism we implemented allows for a configurable distribution of front-end processors and is extremely efficient, at the cost of (over)simplification. Perhaps it is time to investigate the Hadoop path . . .

> There was a bug a couple years ago, in the 1.4.3 version as I recall, where ParallelMultiSearcher was not computing df's appropriately, but that has been fixed for a long time now. The df's are the sum of the df's from each distributed index and thus are independent of the partitioning.

Interesting. We randomly spray the documents across the leaf node indexers and rely on the tendency of large numbers of documents to smooth out the probability distributions. Hence my interest in participating in an effort to implement and evaluate the impact of using a different method, such as BM25 or perhaps even some DFR approach [1].

more, l8r,
v

--
The future is here. It's just not evenly distributed yet.
-- William Gibson, quoted by Whitfield Diffie

[1] http://ir.dcs.gla.ac.uk/terrier/

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
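The point in the message above about summed df's can be illustrated with a small sketch. This is not Lucene code: the shard counts are invented, and the idf formula (1 + ln(numDocs / (docFreq + 1))) is assumed here purely for illustration. It shows that an idf computed per shard varies with the partitioning, while an idf computed from df's and document counts summed over all shards does not.

```python
import math

def idf(doc_freq, num_docs):
    """Illustrative idf: 1 + ln(numDocs / (docFreq + 1)). Assumed form, not taken from the thread."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))

# Hypothetical shards: (documents containing the term, total documents).
shards = [(5, 1000), (40, 1000), (15, 1000)]

# An idf computed locally on each shard depends on which documents landed there.
local_idfs = [idf(df, n) for df, n in shards]

# Summing df and document counts first yields one collection-wide idf.
global_df = sum(df for df, _ in shards)
global_n = sum(n for _, n in shards)
global_idf = idf(global_df, global_n)

# A different (also hypothetical) partitioning of the same 3000 documents.
reshuffled = [(20, 1500), (40, 1500)]
reshuffled_idf = idf(sum(df for df, _ in reshuffled),
                     sum(n for _, n in reshuffled))

print("per-shard idf:", [round(x, 3) for x in local_idfs])
print("global idf, either partitioning:", round(global_idf, 3), round(reshuffled_idf, 3))
```

With random distribution over many documents the per-shard values tend to converge, which is the smoothing effect relied on in the message; summing the df's removes the dependence altogether.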
Re: Include BM25 in Lucene?
J.Zhu wrote:
> If I would like to contribute, what should I do? I am not a good Java developer myself though. Can I work with someone also interested?

In some of my group's usage of Lucene over large document collections, we have split the documents across several machines. This has led to a concern about whether the inverse document frequency is appropriate, since the score seems to depend on the partitioning of documents over indexing hosts. We have not formulated an experiment to determine whether it seriously affects our results, though it has been discussed.

If someone could elaborate on how BM25 or some DFR algorithm would differ from what (TF/IDF) is implemented in Lucene, I would be willing to help translate that into Java as an indexing/searching option . . .

more, l8r,
v
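As a rough answer to the question above, here is a minimal sketch of the textbook BM25 term weight next to a square-root term frequency of the kind used in classic TF/IDF scoring. It is not Lucene's implementation: the function names are hypothetical, and the parameter defaults k1 = 1.2 and b = 0.75 are the commonly quoted values, assumed here rather than taken from the thread.

```python
import math

def classic_tf(freq):
    """Square-root term frequency (classic TF/IDF style); grows without bound."""
    return math.sqrt(freq)

def bm25_tf(freq, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """BM25 term-frequency component: saturates below k1 + 1 and
    penalises documents longer than the collection average."""
    norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    return freq * (k1 + 1.0) / (freq + norm)

def bm25_idf(doc_freq, num_docs):
    """Robertson/Sparck Jones idf as used in textbook BM25."""
    return math.log((num_docs - doc_freq + 0.5) / (doc_freq + 0.5))

def bm25_score(freq, doc_freq, num_docs, doc_len, avg_doc_len):
    """Score contribution of a single query term in a single document."""
    return bm25_idf(doc_freq, num_docs) * bm25_tf(freq, doc_len, avg_doc_len)

# Compare how the two tf components grow for an average-length document.
for freq in (1, 10, 100):
    print(freq, round(classic_tf(freq), 3), round(bm25_tf(freq, 100, 100), 3))
```

The main practical differences are that the BM25 term-frequency component levels off as the count grows, and that length normalisation is tunable through b instead of being a fixed function of field length.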
Re: Clustering IndexWriter?
adasal wrote:
> Don't be coy, what's your company?

This URL is derivable from the text, with a little search engine help . . .

http://www.terracottatech.com/terracotta_spring.shtml

more, l8r,
v

> On 21/09/06, Steve Harris [EMAIL PROTECTED] wrote:
>> Warning, I'm a vendor dude but this isn't really a vendor message.
>>
>> My IT guy had mentioned to me that a bunch of the open source products we use (JIRA, JForum etc.) have Lucene inside, and in the name of eating our own dog food I tried to cluster IndexWriter (with a RAMDirectory) using our (Terracotta) clustering technology. It took me about half an hour to get the basics working from download time.
>>
>> I was wondering, do people in the real world want to be able to cluster this stuff? Is clustering the IndexWriter really all I need to do?
>>
>> If it is interesting, how do I feed a small code change back into the project? We don't yet support subclasses of collections, and SegmentInfos subclasses Vector. I just turned it into aggregation (that took 10 of the 30 minutes). We will support this in a future release so it isn't a huge deal, but I could get something out sooner if the change was made.
>>
>> Cheers,
>> Steve
Re: Java 1.5 (was Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided))
Andi Vajda wrote:
> On Mon, 10 Jul 2006, Doug Cutting wrote:
>> Andi Vajda wrote:
>>> On Sat, 8 Jul 2006, Doug Cutting wrote:
>>>> Since GCJ is effectively available on all platforms, we could say that we will start accepting 1.5 features when a GCJ release supports those features. Does that seem reasonable?
>>>
>>> +1
>>
>> If we use this criterion, then we should probably officially support GCJ. Ideally we should run nightly unit tests with GCJ. Andi, would you be interested in helping to set this up?

This is interesting to me. Is the nightly build environment difficult to replicate?

> I'd be interested in doing this but what is it that we're after in 'supporting gcj' actually ?

There is some advantage in using gcj as a measure of usability in the context of a free (as in beer) Java, such that for a given target platform, one can deliver executables and shared libraries without requiring virtual machine runtimes. The second advantage is that it gives a simple method to test contributions using new features nightly. The third advantage seems to be a reduction in computational load on servers running native code.

> - running a fully compiled program linked against a lucene.so ? if so, which platforms ? the gcj story is very different on each and every platform, including different linuxes, and gcj is not well supported on some platforms at all.

This seems to be the case, since on an updated Fedora Core 5 with gcj (GCC) 4.1.1 20060525 (Red Hat 4.1.1-1), the Makefile modifications required are trivial.

> - running java bytecode with the gcj VM (gij, I believe) ? if the .java code needs to be compiled with gcj then a number of patches still need to be applied against the Java lucene sources. PyLucene is built by compiling .java -> .jar using a regular JDK (Apple's or Blackdown) and using gcj to compile from .jar -> .so, thereby working around all the gcj java front-end bugs. Even when only compiling .jar -> .so with gcj, a number of patches still need to be applied: http://svn.osafoundation.org/pylucene/trunk/patches.lucene

The last time I checked src/gcj/Makefile (revision 420696), all that was required was to fix the name of the Lucene archive file to match what is actually generated, e.g., $(BUILD)/lucene-core-[0-9].*.jar, and to add FieldCache* to the names to skip . . .

Not having contributed to Lucene yet, is it required to generate a 'patch' to add to JIRA, or is the following output from a simple `svn diff` sufficient for experimentation ?

Index: src/gcj/Makefile
===================================================================
--- src/gcj/Makefile (revision 420696)
+++ src/gcj/Makefile (working copy)
@@ -8,7 +8,7 @@
 CORE=$(BUILD)/classes/java
 SRC=.
 
-CORE_OBJ:=$(subst .jar,.a,$(wildcard $(BUILD)/lucene-[0-9]*.jar))
+CORE_OBJ:=$(subst .jar,.a,$(wildcard $(BUILD)/lucene-core-[0-9]*.jar))
 CORE_JAVA:=$(shell find $(ROOT)/src/java -name '*.java')
 
 CORE_HEADERS=\
@@ -55,7 +55,7 @@
 # yet accept from .class files.
 # NOTE: Change when http://gcc.gnu.org/bugzilla/show_bug.cgi?id=15501 is fixed.
 $(CORE_OBJ) : $(CORE_JAVA)
-	$(GCJ) $(GCJFLAGS) -c -I $(CORE) -o $@ `find $(ROOT)/src/java -name '*.java' -not -name '*Sort*' -not -name 'Span*'` `find $(CORE) -name '*.class' -name '*Sort*' -or -name 'Span*'`
+	$(GCJ) $(GCJFLAGS) -c -I $(CORE) -o $@ `find $(ROOT)/src/java -name '*.java' -not -name '*Sort*' -not -name 'Span*' -not -name 'FieldCache*'` `find $(CORE) -name '*.class' -name '*Sort*' -or -name 'Span*' -or -name 'FieldCache*'`
 
 # generate object code from jar files using gcj
 %.a : %.jar

more, l8r,
v
Re: Java 1.5 (was Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided))
robert engels wrote:
> Seems silly to support 1.5 and not do it this way.

Sometimes a little silliness is some serious fun! Just give me a rubber nose, since I am just clowning around trying to build Andi's kewly contrib/db using gcj on the slightly stylish db-4.4.20 and je-3.0.12 . . .

> On Jul 10, 2006, at 11:17 PM, Daniel John Debrunner wrote:
>> Doug Cutting wrote:
>>> Since GCJ is effectively available on all platforms, we could say that we will start accepting 1.5 features when a GCJ release supports those features. Does that seem reasonable?
>>
>> Seems potentially a little strange to me. Does this mean Lucene would be limited to the set of 1.5 features actually implemented by GCJ? So if there is a 1.5 feature that is not supported by GCJ (while others are) it cannot be used? Seems more natural to support the complete 1.5 as defined by Sun/Java, not the subset implemented by one open source compiler.

Do you have a different favorite open source Java compiler for 1.5 ?

more, l8r,
v
Re: Results (Re: Survey: Lucene and Java 1.4 vs. 1.5)
Robert Engels wrote:
> Do you have any hard numbers to support this? The last time I checked, gcj had minimal improvement over JVM 1.5.

In terms of speed, there is not much difference between native code and classes (see sample timings). However, the practical availability of a Java 5 environment for even somewhat _exotic_ platforms is sadly limited. My current environment is Linux on a dual-core x86_64. One can only ride a JRockit into 1.5 land and still address 64 bits of goodness !

more, l8r,
v

BTW, given a native compile and link,

[EMAIL PROTECTED] lucene-415145]$ ldd build/indexFiles
        libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x003f0040)
        libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x003efec0)
        libgcj.so.7 => /usr/lib64/libgcj.so.7 (0x2aac2000)
        libm.so.6 => /lib64/libm.so.6 (0x003ef910)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x003efa50)
        libz.so.1 => /usr/lib64/libz.so.1 (0x003ef950)
        libdl.so.2 => /lib64/libdl.so.2 (0x003ef930)
        libc.so.6 => /lib64/libc.so.6 (0x003ef8e0)
        /lib64/ld-linux-x86-64.so.2 (0x003ef8c0)

The native indexing,

[EMAIL PROTECTED] lucene-415145]$ time build/indexFiles . 2>&1 > /dev/null

real    0m22.932s
user    0m16.581s
sys     0m6.224s

The virtual machine indexing,

[EMAIL PROTECTED] lucene-415145]$ time java -d64 -Xmx8192m -cp build/lucene-demos-2.0-rc1-dev.jar:build/lucene-core-2.0-rc1-dev.jar org.apache.lucene.demo.IndexFiles . 2>&1 > /dev/null

real    0m23.224s
user    0m33.238s
sys     0m5.184s

Side note: JRockit seems to start using both processors about 1/3 of the way through, whereas gcj doesn't . . .
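The side note about processor usage can be checked against the `time` figures above. The sketch below divides CPU time (user + sys) by wall-clock time (real) for each run. The helper names are hypothetical and the inputs are the values reported in the message.

```python
def parse_time(text):
    """Convert a `time` field such as '0m22.932s' to seconds."""
    minutes, seconds = text.rstrip("s").split("m")
    return int(minutes) * 60 + float(seconds)

def cpu_utilisation(real, user, system):
    """CPU seconds consumed per wall-clock second (roughly, cores kept busy)."""
    return (parse_time(user) + parse_time(system)) / parse_time(real)

# Figures copied from the two runs reported in the message.
native = cpu_utilisation("0m22.932s", "0m16.581s", "0m6.224s")
jvm = cpu_utilisation("0m23.224s", "0m33.238s", "0m5.184s")

print(f"native: {native:.2f} cores busy, JVM: {jvm:.2f} cores busy")
```

The native binary comes out at just under one core and the JVM run at roughly 1.65 cores. That is consistent with the observation that the virtual machine uses the second processor for part of the run while finishing in about the same wall-clock time.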
Re: Results (Re: Survey: Lucene and Java 1.4 vs. 1.5)
Until there is a free Java 5 alternative, it would be nice to have a clean compile in 1.4. We might also consider waiting until gcj makes the move to 1.5, since some of us are loving the native binaries, particularly on x86_64. How else can you index billions of documents (aside from expensive big blue boxes) . . .

more, l8r,
v
gcj compile
The following diff seemed to help build a nice native binary on my Fedora system. The first modification uses the new core archive file name and the second avoids a problematic class . . .

[EMAIL PROTECTED] lucene-trunk]$ svn diff
Index: src/gcj/Makefile
===================================================================
--- src/gcj/Makefile (revision 410910)
+++ src/gcj/Makefile (working copy)
@@ -8,7 +8,7 @@
 CORE=$(BUILD)/classes/java
 SRC=.
 
-CORE_OBJ:=$(subst .jar,.a,$(wildcard $(BUILD)/lucene-[0-9]*.jar))
+CORE_OBJ:=$(subst .jar,.a,$(wildcard $(BUILD)/lucene-core-[0-9].*.jar))
 CORE_JAVA:=$(shell find $(ROOT)/src/java -name '*.java')
 
 CORE_HEADERS=\
@@ -55,7 +55,7 @@
 # yet accept from .class files.
 # NOTE: Change when http://gcc.gnu.org/bugzilla/show_bug.cgi?id=15501 is fixed.
 $(CORE_OBJ) : $(CORE_JAVA)
-	$(GCJ) $(GCJFLAGS) -c -I $(CORE) -o $@ `find $(ROOT)/src/java -name '*.java' -not -name '*Sort*' -not -name 'Span*'` `find $(CORE) -name '*.class' -name '*Sort*' -or -name 'Span*'`
+	$(GCJ) $(GCJFLAGS) -c -I $(CORE) -o $@ `find $(ROOT)/src/java -name '*.java' -not -name '*Sort*' -not -name 'Span*' -not -name 'FieldCache*'` `find $(CORE) -name '*.class' -name '*Sort*' -or -name 'Span*' -or -name 'FieldCache*'`
 
 # generate object code from jar files using gcj
 %.a : %.jar
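The reason for the first Makefile change can be seen by testing the old and new wildcard patterns against the core archive name. The sketch below uses Python's fnmatch as a stand-in for make's $(wildcard ...) globbing, which is an assumption that holds for simple patterns like these. The jar name is the one that appears in the build output quoted earlier in the thread.

```python
from fnmatch import fnmatchcase

# Core archive name as produced by the build (taken from the thread).
jar = "lucene-core-2.0-rc1-dev.jar"

old_pattern = "lucene-[0-9]*.jar"        # expects a digit right after "lucene-"
new_pattern = "lucene-core-[0-9].*.jar"  # expects "lucene-core-", a digit, then a dot

old_matches = fnmatchcase(jar, old_pattern)
new_matches = fnmatchcase(jar, new_pattern)

print("old pattern matches:", old_matches)
print("new pattern matches:", new_matches)
```

The old pattern fails because the character after "lucene-" is now "c" and not a digit, so CORE_OBJ would expand to nothing. The new pattern matches the renamed archive.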