Re: Include BM25 in Lucene?

2006-10-19 Thread Vic Bancroft

Chuck Williams wrote:


Vic Bancroft wrote on 10/17/2006 02:44 AM:
 


In some of my group's usage of lucene over large document collections,
we have split the documents across several machines.  This has led to
a concern about whether the inverse document frequency is appropriate,
since the score seems to be dependent on the partitioning of documents
over indexing hosts.  We have not formulated an experiment to
determine whether it seriously affects our results, though it has been
discussed.
   

What version of Lucene are you using?  

The current systems are based on 1.9.1, though I suspect we should clean 
up the deprecation warnings and move to 2.0.0.



Are you using ParallelMultiSearcher to manage the distributed indexes or
have you implemented your own mechanism?

We had started with the ParallelMultiSearcher, but did not see 
appropriate scalability with high numbers of concurrent requests.  The 
bottleneck was on the reduce side, folding results back together.  The 
first cut mechanism we implemented allows for a configurable 
distribution of front end processors and is extremely efficient at the 
cost of (over) simplification.


Perhaps it is time to investigate the hadoop path . . .
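
As a rough illustration of the reduce-side step described above, here is
a minimal sketch of folding per-shard results into one global top-k list.
It is not the mechanism used in the system described here and not Lucene
code; the class TopKMergeSketch, the method mergeTopK, and the assumption
that each shard returns its scores already sorted in descending order are
all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch: k-way merge of per-shard score lists.
final class TopKMergeSketch {

    /** Merges per-shard score arrays (each sorted descending) into the global top k. */
    static List<Double> mergeTopK(List<double[]> shardScores, int k) {
        // Heap entries are {shardIndex, positionInShard}; the best score comes out first.
        PriorityQueue<int[]> heap = new PriorityQueue<>(
            (a, b) -> Double.compare(shardScores.get(b[0])[b[1]],
                                     shardScores.get(a[0])[a[1]]));
        for (int s = 0; s < shardScores.size(); s++) {
            if (shardScores.get(s).length > 0) {
                heap.add(new int[] {s, 0});
            }
        }
        List<Double> merged = new ArrayList<>();
        while (merged.size() < k && !heap.isEmpty()) {
            int[] top = heap.poll();
            merged.add(shardScores.get(top[0])[top[1]]);
            // Advance within the shard that just contributed a hit.
            if (top[1] + 1 < shardScores.get(top[0]).length) {
                heap.add(new int[] {top[0], top[1] + 1});
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<double[]> shards = Arrays.asList(
            new double[] {0.9, 0.4},
            new double[] {0.8, 0.7, 0.1});
        System.out.println(mergeTopK(shards, 3));
    }
}
```

The heap never holds more than one entry per shard, so the merge cost
depends on k and the number of shards rather than on the total number of
hits returned.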

There was a bug a couple years ago, in the 1.4.3 version as I recall, where 
ParallelMultiSearcher was not computing df's appropriately, but that has been
fixed for a long time now.  The df's are the sum of the df's from each 
distributed index and thus are independent of the partitioning.
 

Interesting, we randomly spray the documents across the leaf node 
indexers and rely on the tendency of large numbers of documents to smooth 
out the probability distributions.   Hence my interest in participating 
in an effort to implement and evaluate the impact of using a different 
method, such as BM25 or perhaps even some DFR approach [1].
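
To make the partitioning point concrete, here is a minimal sketch of why
summing df's across shards removes the dependence on how documents are
split. It is not Lucene code: the class GlobalIdfSketch and its methods
are hypothetical, and it assumes an idf of the classic form
1 + ln(numDocs / (docFreq + 1)).

```java
// Hypothetical sketch: per-shard idf versus idf computed from summed statistics.
final class GlobalIdfSketch {

    // Assumed classic idf form: 1 + ln(numDocs / (docFreq + 1)).
    static double idf(long docFreq, long numDocs) {
        return 1.0 + Math.log((double) numDocs / (double) (docFreq + 1));
    }

    // Each row is {docFreq, numDocs} reported by one shard for the same term.
    static double globalIdf(long[][] shards) {
        long df = 0;
        long n = 0;
        for (long[] s : shards) {
            df += s[0];
            n += s[1];
        }
        return idf(df, n);
    }

    public static void main(String[] args) {
        // The same 2000 documents and 100 matches, partitioned two different ways.
        long[][] even = {{50, 1000}, {50, 1000}};
        long[][] skewed = {{99, 500}, {1, 1500}};
        System.out.println("global idf, even split:   " + globalIdf(even));
        System.out.println("global idf, skewed split: " + globalIdf(skewed));
        // Shard-local idf differs from shard to shard under the skewed split.
        System.out.println("local idf, shard 0:       " + idf(99, 500));
        System.out.println("local idf, shard 1:       " + idf(1, 1500));
    }
}
```

With summed statistics both partitionings yield the same idf, whereas
shard-local idf values diverge whenever the term is unevenly distributed.
Random spraying of documents makes that divergence small on average, but
does not eliminate it.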


more,
l8r,
v

--
The future is here. It's just not evenly distributed yet.
-- William Gibson, quoted by Whitfield Diffie

[1] http://ir.dcs.gla.ac.uk/terrier/


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Include BM25 in Lucene?

2006-10-17 Thread Vic Bancroft

J.Zhu wrote:


If I would like to contribute, what should I do? I am not a good Java
developer myself though. Can I work with someone also interested?
 

In some of my group's usage of lucene over large document collections, 
we have split the documents across several machines.  This has led to a 
concern about whether the inverse document frequency is appropriate, since 
the score seems to be dependent on the partitioning of documents over 
indexing hosts.  We have not formulated an experiment to determine whether 
it seriously affects our results, though it has been discussed.


If someone could elaborate on how BM25 or some DFR algorithm would differ 
from what (TF/IDF) is implemented in lucene, I would be willing to help 
translate that into java as an indexing/searching option . . .
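
As a starting point for that comparison, here is a minimal sketch of the
two per-term weights side by side. It is an illustration under stated
assumptions, not Lucene's implementation: the class Bm25VsTfIdfSketch and
its method names are hypothetical, the TF/IDF side is simplified to
sqrt(tf) * idf^2 / sqrt(docLen) (query norm, coord and boosts omitted),
and the BM25 side uses the non-negative idf variant
ln(1 + (N - df + 0.5) / (df + 0.5)) with the usual k1 and b parameters.

```java
// Hypothetical sketch: simplified classic TF/IDF weight versus BM25 weight for one term.
final class Bm25VsTfIdfSketch {

    // Assumed classic idf form: 1 + ln(numDocs / (docFreq + 1)).
    static double classicIdf(long docFreq, long numDocs) {
        return 1.0 + Math.log((double) numDocs / (double) (docFreq + 1));
    }

    // Simplified TF/IDF: tf grows as sqrt(freq), length norm is 1/sqrt(docLen).
    static double classicTfIdf(double freq, long docFreq, long numDocs, double docLen) {
        double idf = classicIdf(docFreq, numDocs);
        return Math.sqrt(freq) * idf * idf / Math.sqrt(docLen);
    }

    // BM25 idf, non-negative variant (an assumption; other variants exist).
    static double bm25Idf(long docFreq, long numDocs) {
        return Math.log(1.0 + (numDocs - docFreq + 0.5) / (docFreq + 0.5));
    }

    // BM25: tf saturates toward (k1 + 1); b controls document-length normalization.
    static double bm25(double freq, long docFreq, long numDocs,
                       double docLen, double avgDocLen, double k1, double b) {
        double lengthFactor = 1.0 - b + b * docLen / avgDocLen;
        return bm25Idf(docFreq, numDocs) * freq * (k1 + 1.0) / (freq + k1 * lengthFactor);
    }

    public static void main(String[] args) {
        long df = 10;
        long n = 1000;
        for (double freq : new double[] {1, 4, 16, 64}) {
            System.out.println("freq=" + freq
                + " tfidf=" + classicTfIdf(freq, df, n, 100)
                + " bm25=" + bm25(freq, df, n, 100, 100, 1.2, 0.75));
        }
    }
}
```

The main practical differences visible here are that the BM25 term
frequency component saturates (it can never exceed k1 + 1 times the idf),
while sqrt(tf) keeps growing, and that BM25 normalizes length relative to
the average document length, which is a collection-wide statistic just
like df.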


more,
l8r,
v


--
The future is here. It's just not evenly distributed yet.
-- William Gibson, quoted by Whitfield Diffie





Re: Clustering IndexWriter?

2006-09-21 Thread Vic Bancroft

adasal wrote:


Don't be coy, what's your company?


This URL is derivable from the text, with a little search engine help . . .

http://www.terracottatech.com/terracotta_spring.shtml

more,
l8r,
v


On 21/09/06, Steve Harris [EMAIL PROTECTED] wrote:



Warning, I'm a vendor dude but this isn't really a vendor message.

My IT guy had mentioned to me that a bunch of the open source products
we use (JIRA, JForum etc) have Lucene inside, and in the name of eating
our own dog food I tried to cluster IndexWriter (with a RAMDirectory)
using our (terracotta) clustering technology.

Took me about a half hour to get the basics working from download
time. I was wondering, do people in the real world want to be able to
cluster this stuff? Is clustering the IndexWriter really all I need to
do?


If it is interesting, how do I feed a small code change back into the
project? We don't yet support subclasses of collections, and
SegmentInfos subclasses Vector. I just turned it into aggregation
(that took 10 of the 30 minutes). We will support this in a future
release so it isn't a huge deal, but I could get something out sooner
if the change was made.

Cheers,
Steve








--
The future is here. It's just not evenly distributed yet.
-- William Gibson, quoted by Whitfield Diffie





Re: Java 1.5 (was Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided))

2006-07-10 Thread Vic Bancroft

Andi Vajda wrote:


On Mon, 10 Jul 2006, Doug Cutting wrote:


Andi Vajda wrote:


On Sat, 8 Jul 2006, Doug Cutting wrote:

Since GCJ is effectively available on all platforms, we could say 
that we will start accepting 1.5 features when a GCJ release 
supports those features. Does that seem reasonable?


+1


If we use this criterion, then we should probably officially support 
GCJ. Ideally we should run nightly unit tests with GCJ. Andi, would 
you be interested in helping to set this up?


This is interesting to me, is the nightly build environment difficult to 
replicate ?


I'd be interested in doing this but what is it that we're after in 
'supporting gcj' actually ?


There is some advantage in using gcj as a measure of usability in the 
context of a free (as in beer) java, such that for a given target 
platform, one can deliver executables and shared libraries without 
requiring virtual machine runtimes. The second advantage is to give a 
simple method to nightly test contributions using new features. The 
third advantage seems to be a reduction in computational load on servers 
running native code.



- running a fully compiled program linked against a lucene.so ?
if so, which platforms ? the gcj story is very different on each and
every platform, including different linuxes and gcj is not well
supported on some platforms at all.


This seems to be the case, since on an updated fedora core 5 with gcj 
(GCC) 4.1.1 20060525 (Red Hat 4.1.1-1), the Makefile modifications 
required are trivial.



- running java bytecode with the gcj VM (gij, I believe) ?
if the .java code needs to be compiled with gcj then a number of patches
still need to be applied against the Java lucene sources.
PyLucene is built by compiling .java -> .jar using a regular JDK (Apple's
or Blackdown) and using gcj to compile from .jar -> .so thereby working
around all the gcj java front-end bugs
Even when only compiling .jar -> .so with gcj, a number of patches still
need to be applied:
http://svn.osafoundation.org/pylucene/trunk/patches.lucene


The last time I checked for src/gcj/Makefile (revision 420696), all that 
was required was to fix the name of the lucene archive file to match 
what is actually generated, e.g., $(BUILD)/lucene-core-[0-9].*.jar and 
add the FieldCache* to the names to skip . . .


Not having contributed to lucene yet, is it required to generate a 
'patch' to add to jira, or is the following output from a simple `svn 
diff` sufficient for experimentation ?


   Index: src/gcj/Makefile
   ===
   --- src/gcj/Makefile (revision 420696)
   +++ src/gcj/Makefile (working copy)
   @@ -8,7 +8,7 @@
   CORE=$(BUILD)/classes/java
   SRC=.

   -CORE_OBJ:=$(subst .jar,.a,$(wildcard $(BUILD)/lucene-[0-9]*.jar))
   +CORE_OBJ:=$(subst .jar,.a,$(wildcard $(BUILD)/lucene-core-[0-9]*.jar))
   CORE_JAVA:=$(shell find $(ROOT)/src/java -name '*.java')

   CORE_HEADERS=\
   @@ -55,7 +55,7 @@
   # yet accept from .class files.
   # NOTE: Change when http://gcc.gnu.org/bugzilla/show_bug.cgi?id=15501 is fixed.
   $(CORE_OBJ) : $(CORE_JAVA)
   - $(GCJ) $(GCJFLAGS) -c -I $(CORE) -o $@ `find $(ROOT)/src/java -name '*.java' -not -name '*Sort*' -not -name 'Span*'` `find $(CORE) -name '*.class' -name '*Sort*' -or -name 'Span*'`
   + $(GCJ) $(GCJFLAGS) -c -I $(CORE) -o $@ `find $(ROOT)/src/java -name '*.java' -not -name '*Sort*' -not -name 'Span*' -not -name 'FieldCache*'` `find $(CORE) -name '*.class' -name '*Sort*' -or -name 'Span*' -or -name 'FieldCache*'`

   # generate object code from jar files using gcj
   %.a : %.jar

more,
l8r,
v

--
The future is here. It's just not evenly distributed yet.
-- William Gibson, quoted by Whitfield Diffie





Re: Java 1.5 (was Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided))

2006-07-10 Thread Vic Bancroft

robert engels wrote:


Seems  silly to support 1.5 and not do it this way.


Sometimes a little silliness is some serious fun!  Just give me a rubber 
nose, since I am just clowning around trying to build Andi's kewly 
contrib/db using gcj on the slightly stylish db-4.4.20 and je-3.0.12 . . .



On Jul 10, 2006, at 11:17 PM, Daniel John Debrunner wrote:


Doug Cutting wrote:


Since GCJ is effectively available on all platforms, we could say  that
we will start accepting 1.5 features when a GCJ release supports  those
features.  Does that seem reasonable?


Seems potentially a little strange to me. Does this mean Lucene would be
limited to the set of 1.5 features actually implemented by GCJ? So if
there is a 1.5 feature that is not supported by GCJ (while others are)
it cannot be used?

Seems more natural to support the complete 1.5 as defined by Sun/Java,
not the subset implemented by one open source compiler.



Do you have a different favorite open source java compiler for 1.5 ?

more,
l8r,
v

--
The future is here. It's just not evenly distributed yet.
-- William Gibson, quoted by Whitfield Diffie





Re: Results (Re: Survey: Lucene and Java 1.4 vs. 1.5)

2006-06-18 Thread Vic Bancroft

Robert Engels wrote:


Do you have any hard numbers to support this? The last time I checked, gcj
had minimal improvement over JVM 1.5. 
 

In terms of speed, there is not much difference between native code and 
classes (see sample timings).  However, the practical availability of a 
Java 5 environment for even somewhat _exotic_ platforms is sadly 
limited.  My current environment is linux on a dual core x86_64. 

One can only ride a JRockit into 1.5 land and still address 64 bits of 
goodness !


more,
l8r,
v

BTW, given a native compile and link,

   [EMAIL PROTECTED] lucene-415145]$ ldd  build/indexFiles
   libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x003f0040)
   libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x003efec0)
   libgcj.so.7 => /usr/lib64/libgcj.so.7 (0x2aac2000)
   libm.so.6 => /lib64/libm.so.6 (0x003ef910)
   libpthread.so.0 => /lib64/libpthread.so.0 (0x003efa50)
   libz.so.1 => /usr/lib64/libz.so.1 (0x003ef950)
   libdl.so.2 => /lib64/libdl.so.2 (0x003ef930)
   libc.so.6 => /lib64/libc.so.6 (0x003ef8e0)
   /lib64/ld-linux-x86-64.so.2 (0x003ef8c0)

The native indexing,

   [EMAIL PROTECTED] lucene-415145]$ time build/indexFiles . 2>&1 > /dev/null

   real    0m22.932s
   user    0m16.581s
   sys     0m6.224s

The virtual machine indexing,

   [EMAIL PROTECTED] lucene-415145]$ time java -d64 -Xmx8192m -cp \
       build/lucene-demos-2.0-rc1-dev.jar:build/lucene-core-2.0-rc1-dev.jar \
       org.apache.lucene.demo.IndexFiles . 2>&1 > /dev/null

   real    0m23.224s
   user    0m33.238s
   sys     0m5.184s
 

Side note, JRockit seems to use both processors starting about 1/3 of 
the way through, whereas gcj doesn't . . .


--
The future is here. It's just not evenly distributed yet.
-- William Gibson, quoted by Whitfield Diffie





Re: Results (Re: Survey: Lucene and Java 1.4 vs. 1.5)

2006-06-17 Thread Vic Bancroft
Until there is a free java 5 alternative, it would be nice to have a 
clean compile in 1.4.  We might also consider waiting until gcj does the 
1.5 move, since some of us are loving the native binaries, particularly 
on x86_64.


How else can you index billions of documents (aside from expensive big 
blue boxes) . . .


more,
l8r,
v

--
The future is here. It's just not evenly distributed yet.
-- William Gibson, quoted by Whitfield Diffie





gcj compile

2006-06-02 Thread Vic Bancroft
The following diff seemed to help build a nice native binary on my 
fedora system. The first modification uses the new core archive file 
name and the second avoids a problematic class . . .


[EMAIL PROTECTED] lucene-trunk]$ svn diff
Index: src/gcj/Makefile
===
--- src/gcj/Makefile (revision 410910)
+++ src/gcj/Makefile (working copy)
@@ -8,7 +8,7 @@
CORE=$(BUILD)/classes/java
SRC=.

-CORE_OBJ:=$(subst .jar,.a,$(wildcard $(BUILD)/lucene-[0-9]*.jar))
+CORE_OBJ:=$(subst .jar,.a,$(wildcard $(BUILD)/lucene-core-[0-9].*.jar))
CORE_JAVA:=$(shell find $(ROOT)/src/java -name '*.java')

CORE_HEADERS=\
@@ -55,7 +55,7 @@
# yet accept from .class files.
# NOTE: Change when http://gcc.gnu.org/bugzilla/show_bug.cgi?id=15501 is fixed.
$(CORE_OBJ) : $(CORE_JAVA)
-   $(GCJ) $(GCJFLAGS) -c -I $(CORE) -o $@ `find $(ROOT)/src/java -name '*.java' -not -name '*Sort*' -not -name 'Span*'` `find $(CORE) -name '*.class' -name '*Sort*' -or -name 'Span*'`
+   $(GCJ) $(GCJFLAGS) -c -I $(CORE) -o $@ `find $(ROOT)/src/java -name '*.java' -not -name '*Sort*' -not -name 'Span*'  -not -name 'FieldCache*'` `find $(CORE) -name '*.class' -name '*Sort*' -or -name 'Span*' -or -name 'FieldCache*'`

# generate object code from jar files using gcj
%.a : %.jar


# -- 
# The future is here. It's just not evenly distributed yet.
# -- William Gibson, quoted by Whitfield Diffie


