Hudson build is back to normal: Lucene-trunk #533

2008-07-21 Thread Apache Hudson Server
See http://hudson.zones.apache.org/hudson/job/Lucene-trunk/533/changes



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: [jira] Commented: (LUCENE-1278) Add optional storing of document numbers in term dictionary

2008-07-21 Thread Doug Cutting

This also reminds me of the "pulsing" technique described in:

http://citeseer.ist.psu.edu/cutting90optimizations.html

Doug

eks dev wrote:

It seems someone else had the same idea: "inline" very short postings into the 
term dictionary (even for an in-memory index) and save one pointer (and a seek, in a 
disk setup)... nice reading

http://www.siam.org/proceedings/alenex/2008/alx08_01transierf.pdf




- Original Message 

From: Eks Dev (JIRA) <[EMAIL PROTECTED]>
To: java-dev@lucene.apache.org
Sent: Sunday, 20 July, 2008 1:02:31 PM
Subject: [jira] Commented: (LUCENE-1278) Add optional storing of document 
numbers in term dictionary


[ 
https://issues.apache.org/jira/browse/LUCENE-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12615077#action_12615077 
] 


Eks Dev commented on LUCENE-1278:
-

In light of Mike's comments here (Michael McCandless - 05/May/08 05:33 AM), I 
think it is worth mentioning that I am working on LUCENE-1340, which stores 
postings without the additional frq info. 

Correct me if I am wrong: the only difference is that this approach with *.frq 
needs one more seek... At the same time, this could potentially increase term 
dict size, so we lose some locality.


Your last proposal, to "inline short postings" into the term dict, sounds 
interesting: for short postings (about the size of the offset pointer into *.frq) 
with tf==1 (which is always the case if you use omitTf(true) from LUCENE-1340) we 
save one seek()... this could be a lot. Also, there is no need to store those 
postings in *.frq (this complicates maintenance, I guess).
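
A rough sketch of the "inline short postings" idea, not code from any patch: a term dictionary entry holds either a pointer into the postings file or, when the doc list is short enough, the doc IDs themselves. The class name, field names, and the cut-off value are all hypothetical.

```java
import java.util.Arrays;

// Hypothetical term dictionary entry; none of these names exist in Lucene.
final class InlineTermEntry {
    // Assumed cut-off: postings at or below this length live in the dictionary.
    static final int MAX_INLINE_DOCS = 2;

    final long freqPointer;  // offset into the postings file, or -1 if inlined
    final int[] inlineDocs;  // doc IDs kept in the entry, or null if not inlined

    private InlineTermEntry(long freqPointer, int[] inlineDocs) {
        this.freqPointer = freqPointer;
        this.inlineDocs = inlineDocs;
    }

    // Decide at write time where the postings for a term go.
    static InlineTermEntry create(int[] docs, long nextFreqPointer) {
        if (docs.length <= MAX_INLINE_DOCS) {
            return new InlineTermEntry(-1L, Arrays.copyOf(docs, docs.length));
        }
        return new InlineTermEntry(nextFreqPointer, null);
    }

    // True when a reader can answer from the dictionary without a second seek.
    boolean isInlined() {
        return inlineDocs != null;
    }
}
```

The trade-off mentioned in the comment shows up directly: every inlined entry makes the dictionary a little larger, in exchange for skipping the seek into the postings file.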


Add optional storing of document numbers in term dictionary
---

Key: LUCENE-1278
URL: https://issues.apache.org/jira/browse/LUCENE-1278
Project: Lucene - Java
 Issue Type: New Feature
 Components: Index
   Affects Versions: 2.3.1
   Reporter: Jason Rutherglen
   Priority: Minor
Attachments: lucene.1278.5.4.2008.patch, lucene.1278.5.5.2008.2.patch, 
lucene.1278.5.5.2008.patch, lucene.1278.5.7.2008.patch, 
lucene.1278.5.7.2008.test.patch, TestTermEnumDocs.java


Add optional storing of document numbers in term dictionary.  String index 
field cache and range filter creation will be faster.  

Example read code:
{noformat}
TermEnum termEnum = indexReader.terms(TermEnum.LOAD_DOCS);
do {
  Term term = termEnum.term();
  if (term == null || term.field() != field) break;
  int[] docs = termEnum.docs();
} while (termEnum.next());
{noformat}
Example write code:
{noformat}
Document document = new Document();
document.add(new Field("tag", "dog", Field.Store.YES,
    Field.Index.UN_TOKENIZED, Field.Term.STORE_DOCS));

indexWriter.addDocument(document);
{noformat}

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.





[jira] Commented: (LUCENE-1340) Make it posible not to include TF information in index

2008-07-21 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12615446#action_12615446
 ] 

Michael McCandless commented on LUCENE-1340:


bq. About this one, it would be nice not to store this as well, but I think the 
pointers are already reduced to one byte, as they are 0 for these cases (are 
they?) So we have this benefit without expecting it

Ahh, right.  The deltas between the proxPointers are written as VLongs.  Since 
the delta will be zero, each now takes only 1 byte; only a bit worse than 0 bytes ;)
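
To make the "1 byte" point concrete, here is a standalone sketch of a variable-length long encoding of the kind described (7 data bits per byte, high bit as a continuation flag). It does not call Lucene; the class and method names are made up for illustration.

```java
import java.io.ByteArrayOutputStream;

// Standalone sketch of a variable-length long encoding; it does not call Lucene.
final class VLongSketch {
    // Writes 7 bits per byte, low-order bits first; a set high bit means
    // "more bytes follow".
    static byte[] encode(long value) {
        if (value < 0) {
            throw new IllegalArgumentException("negative values are not handled in this sketch");
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7FL) != 0) {
            out.write((int) ((value & 0x7F) | 0x80));
            value >>>= 7;
        }
        out.write((int) value);
        return out.toByteArray();
    }
}
```

A zero delta never enters the loop, so it is written as a single zero byte; values up to 127 also fit in one byte, and 128 needs two.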

bq. that would mean we could easily "inline" very short postings into term dict 
(here I expect huge performance benefit, as skip() on another large file is 
going to be saved independent from omitTf(true))

Yes, this looks like it would be a win for cases that need to visit the 
postings for many small terms.

> Make it posible not to include TF information in index
> --
>
> Key: LUCENE-1340
> URL: https://issues.apache.org/jira/browse/LUCENE-1340
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Index
>Reporter: Eks Dev
>Priority: Minor
> Attachments: LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch, 
> LUCENE-1340.patch, LUCENE-1340.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Term frequency is typically not needed for all fields. Some CPU (reading one 
> VInt less and one X>>>1...) and IO can be saved by making pure boolean fields 
> possible in Lucene. This topic has already been discussed and accepted as 
> part of Flexible Indexing... This issue tries to push things forward a bit 
> faster, as I have some concrete customer demands.
> Benefits can be expected for fields that are typical candidates for Filters: 
> enumerations, user rights, IDs, or very short "texts" such as phone numbers, zip 
> codes, names...
> Status: just passed the standard tests (compatibility), committed for early review. 
> I have not tried the new feature; some asserts and one or two unit tests are missing.
> Complexity: simpler than expected
> Can be used via omitTf() (whoever has used omitNorms() will know where to find it :)
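
The "one VInt less and one X>>>1" remark can be illustrated with a small sketch of the values written per posting. The with-frequency branch follows the usual layout in which the low bit of the shifted doc delta flags a frequency of one; the without-frequency layout is an assumption about what the patch does, and all names here are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the values written per posting; each list element stands for one VInt.
final class FreqEncodingSketch {
    // With term frequencies: the low bit of the shifted doc delta flags freq == 1,
    // so a reader has to test that bit and shift (X >>> 1) on every posting.
    static List<Integer> withTf(int[] docs, int[] freqs) {
        List<Integer> out = new ArrayList<>();
        int lastDoc = 0;
        for (int i = 0; i < docs.length; i++) {
            int delta = docs[i] - lastDoc;
            lastDoc = docs[i];
            if (freqs[i] == 1) {
                out.add((delta << 1) | 1);  // freq folded into the low bit
            } else {
                out.add(delta << 1);        // low bit clear: freq follows
                out.add(freqs[i]);
            }
        }
        return out;
    }

    // Assumed omitTf layout: plain doc deltas, no shift and no freq value.
    static List<Integer> withoutTf(int[] docs) {
        List<Integer> out = new ArrayList<>();
        int lastDoc = 0;
        for (int doc : docs) {
            out.add(doc - lastDoc);
            lastDoc = doc;
        }
        return out;
    }
}
```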




[jira] Commented: (LUCENE-1340) Make it posible not to include TF information in index

2008-07-21 Thread Eks Dev (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12615357#action_12615357
 ] 

Eks Dev commented on LUCENE-1340:
-

Great, it is already more than I expected; even indexing is going to be 
somewhat faster.

I have tried your patch on a smallish index with 8 million documents, and it 
passed our regression test without problems. 
It worked fine with and without omitTf(true), with no performance drop or bad 
surprises when we do not use it. A real test with production data, around 
80 million very small documents, is scheduled for tomorrow, along with some very 
extensive tests. I will report back.

"The one place I know of that will still waste bytes is the term dict
(TermInfo): it stores a long proxPointer on disk (in .tii,.tis) and
also in memory because we load *.tii into RAM "

About this one, it would be nice not to store this as well, but I think the 
pointers are already reduced to one byte, as they are 0 for these cases (are 
they?) So we have this benefit without expecting it :)

And yes, more "column stride" is great. If you followed my comments on 
LUCENE-1278, that would mean we could easily "inline" very short postings into 
the term dict (here I expect a huge performance benefit, as a skip() on another 
large file is going to be saved independent of omitTf(true)), with no (or 
minimal) increase in the size of tii (no locality penalty). If we follow a Zipfian 
distribution, there are *a lot* of terms with postings shorter than e.g. 16 ... 

Thanks again for your support, without you this patch would be just another 
nice idea :)











[jira] Updated: (LUCENE-1336) Distributed Lucene using Hadoop RPC based RMI with dynamic classloading

2008-07-21 Thread Jason Rutherglen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Rutherglen updated LUCENE-1336:
-

Attachment: lucene-1336.patch

lucene-1336.patch

- HMAC-based security authentication between client and server.  This was 
chosen because it is fairly simple to use and is more secure than username/password. 
Public/private key signing and encryption can be used as well via the 
RMISecurity interface.  SSL may also be used at the socket layer, though that 
would require work in the Hadoop RPC NIO socket code.
- LuceneMultiClient class that allows searching over multiple remote indexes 
via a MultiSearcher.  The class also manages obtaining the latest Searchables via 
the registered IndexListener.
- Distributed events for new Searchables when a remote LuceneServer reopens.  
LuceneClient always has the most up-to-date Searchable automatically.  Added 
the IndexService.registerIndexListener method.
- Apache License headers
- IndexService.flushAndReopen method flushes index changes from the 
IndexWriter, reopens, and returns the latest Searchable.  
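
For readers unfamiliar with HMAC authentication, a minimal sketch using the standard javax.crypto API follows. It is not taken from the patch: the class name, the choice of HmacSHA256, and the idea of signing the raw request bytes with a shared key are all assumptions made for illustration.

```java
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical helper; the patch's RMISecurity interface may look quite different.
final class HmacAuthSketch {
    private static final String ALGORITHM = "HmacSHA256"; // assumed algorithm

    // Client side: compute a tag over the request bytes with the shared key.
    static byte[] sign(byte[] sharedKey, byte[] message) {
        try {
            Mac mac = Mac.getInstance(ALGORITHM);
            mac.init(new SecretKeySpec(sharedKey, ALGORITHM));
            return mac.doFinal(message);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("HMAC not available", e);
        }
    }

    // Server side: recompute the tag and compare the two byte arrays.
    static boolean verify(byte[] sharedKey, byte[] message, byte[] tag) {
        return MessageDigest.isEqual(sign(sharedKey, message), tag);
    }
}
```

Unlike a username and password, the shared key itself never travels over the wire; only the per-message tag does.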

Future:
- Facet interface with default Term and Query implementations.  

> Distributed Lucene using Hadoop RPC based RMI with dynamic classloading
> ---
>
> Key: LUCENE-1336
> URL: https://issues.apache.org/jira/browse/LUCENE-1336
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/*
>Affects Versions: 2.3.1
>Reporter: Jason Rutherglen
>Priority: Minor
> Attachments: lucene-1336.patch, lucene-1336.patch, lucene-1336.patch
>
>
> Hadoop RPC based RMI system for use with Lucene Searchable.  Keeps the 
> application logic on the client side, removing the need to deploy 
> application logic to the Lucene servers and to provision new 
> code to potentially hundreds of servers for every application logic change.  
> The use case is any deployment requiring Lucene on many servers.  This system 
> provides the added advantage of allowing custom Query and Filter classes (or 
> other classes) to be defined on, for example, a development machine and 
> executed on the server without deploying the custom classes to the servers 
> first.  This can save a lot of time and effort in provisioning and restarting 
> processes.  In the future this patch will include an IndexWriterService 
> interface which will enable document indexing.  This will allow subclasses of 
> Analyzer to be dynamically loaded onto a server as documents are added by the 
> client.
> Hadoop RPC is more scalable than Sun's RMI implementation because it uses 
> non-blocking sockets.  Hadoop RPC is also far easier to understand and customize 
> if needed, as it is embodied in 2 main class files: 
> org.apache.hadoop.ipc.Client and org.apache.hadoop.ipc.Server.  
> Features include automatic dynamic classloading.  The dynamic classloading 
> enables newly compiled client classes inheriting core objects such as Query 
> or Filter to be used to query the server without first deploying the code to 
> the server.  
> RMI dynamic classloading is not used in practice because it is hard to 
> set up: it requires placing the new code in jar files on a web server on the 
> client, then setting up custom system properties as well as Java 
> security manager configuration.  
> The dynamic classloading in Hadoop RMI for Lucene uses RMI to load the 
> classes.  Custom serialization and deserialization manages the classes and 
> the class versions on the server and client side.  New class files are 
> automatically detected and loaded using ClassLoader.getResourceAsStream, 
> so this system does not require creating a JAR file.  The same 
> networking system used for the remote method invocation is also used for 
> loading classes over the network.  This removes the necessity of a separate 
> web server dedicated to the task and makes deployment a few lines of code.
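
The classloading mechanism described above can be sketched in two halves: the sender reads a class's compiled bytes with ClassLoader.getResourceAsStream, and the receiver turns those bytes back into a class with defineClass. This is only an illustration under assumptions; the class names are hypothetical, the network transfer is left out, and the patch's own serialization and versioning logic is not shown.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Sender side (hypothetical): read compiled class bytes from the classpath, no JAR needed.
final class ClassBytesSketch {
    static byte[] readClassBytes(Class<?> type) {
        String resource = type.getName().replace('.', '/') + ".class";
        ClassLoader loader = type.getClassLoader();
        try (InputStream in = loader != null
                ? loader.getResourceAsStream(resource)
                : ClassLoader.getSystemResourceAsStream(resource)) {
            if (in == null) {
                throw new IllegalStateException("class file not found: " + resource);
            }
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            for (int n; (n = in.read(buffer)) != -1; ) {
                out.write(buffer, 0, n);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}

// Receiver side (hypothetical): define a class from bytes that arrived over the wire.
final class ReceivedClassLoader extends ClassLoader {
    Class<?> define(String name, byte[] bytes) {
        return defineClass(name, bytes, 0, bytes.length);
    }
}
```

A real deployment would also have to decide which classes a server is willing to accept, since defining arbitrary client-supplied bytecode is a security-sensitive operation.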




[jira] Updated: (LUCENE-1341) BoostingNearQuery class (prototype)

2008-07-21 Thread Peter Keegan (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Keegan updated LUCENE-1341:
-

Attachment: BoostingNearQuery.java
bnq.patch

Here is a version of the patch for Java 1.4

> BoostingNearQuery class (prototype)
> ---
>
> Key: LUCENE-1341
> URL: https://issues.apache.org/jira/browse/LUCENE-1341
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Query/Scoring
>Affects Versions: 2.3.1
>Reporter: Peter Keegan
>Priority: Minor
> Fix For: 2.3.2
>
> Attachments: bnq.patch, bnq.patch, BoostingNearQuery.java, 
> BoostingNearQuery.java
>
>
> This patch implements term boosting for SpanNearQuery. Refer to: 
> http://www.gossamer-threads.com/lists/lucene/java-user/62779
> This patch works but probably needs more work. I don't like the use of 
> 'instanceof', but I didn't want to touch Spans or TermSpans. Also, the 
> payload code is mostly a copy of what's in BoostingTermQuery and could be 
> common-sourced somewhere. Feel free to throw darts at it :)




[jira] Updated: (LUCENE-1340) Make it posible not to include TF information in index

2008-07-21 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-1340:
---

Attachment: LUCENE-1340.patch

OK good progress eks!

I started from your latest patch and made some further changes:

  * Fixed DW to not consume RAM writing prx if omitTf==true

  * Fixed FreqProxTermsWriter to not create *.prx file if all fields
omit term freq.  I added hasProx to SegmentInfo, and changed the
index file format to store this new boolean.

  * Fixed FreqProxTermsWriterPerField to not write prox into the RAM
buffer if we will omitTf on flushing the segment to disk.  This
makes the RAM buffer efficient (no bytes wasted on prox when
omitTf==true for a field).

  * Added more test cases to TestOmitTf

  * Small whitespace, comment changes

The one place I know of that will still waste bytes is the term dict
(TermInfo): it stores a long proxPointer on disk (in *.tii,*.tis) and
also in memory because we load *.tii into RAM.  For fields with
omitTf==true this will always be unused, and we could save a lot of
disk/RAM if we didn't waste it.

Unfortunately, I think it's too big a change to try to fix this now; I
think we should wait until flex indexing is online.  I wonder how we
can solve it at that point: maybe we should change TermInfo to be
"column stride", meaning, there are separate arrays storing the values
for all terms (ie long[] proxPointers, long[] freqPointers, etc.).
This would also fit the "pluggable" model better, meaning any plugin
can store new stuff (its own arrays) per-term.
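
A small sketch of what a "column stride" layout might look like, compared with one object per term. These are hypothetical classes, not Lucene's TermInfo; the point is only that a column which is never used (here proxPointers) can be left out entirely.

```java
// Row layout (hypothetical): every term pays for every field, used or not.
final class TermInfoRow {
    int docFreq;
    long freqPointer;
    long proxPointer;
}

// Column-stride layout (hypothetical): one array per attribute, indexed by term ordinal.
final class ColumnStrideTermInfos {
    final int[] docFreqs;
    final long[] freqPointers;
    final long[] proxPointers; // null when the segment has no prox data

    ColumnStrideTermInfos(int termCount, boolean hasProx) {
        docFreqs = new int[termCount];
        freqPointers = new long[termCount];
        proxPointers = hasProx ? new long[termCount] : null;
    }

    long proxPointer(int termOrd) {
        if (proxPointers == null) {
            throw new IllegalStateException("no prox data in this segment");
        }
        return proxPointers[termOrd];
    }

    // Rough payload size in bytes, ignoring array headers.
    long approxBytes() {
        long perTerm = Integer.BYTES + Long.BYTES + (proxPointers != null ? Long.BYTES : 0);
        return perTerm * docFreqs.length;
    }
}
```

A plugin that wants to store its own per-term values would, in this picture, simply add another array alongside the existing ones.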




[jira] Commented: (LUCENE-1330) 0 position increment not properly supported for the first token

2008-07-21 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12615191#action_12615191
 ] 

Michael McCandless commented on LUCENE-1330:


In LUCENE-1255, we tried to correct DocumentsWriter to write absolute position 
as 0 not -1 in this case, but unfortunately this broke backwards compatibility 
so we decided to leave it be.

But: in this case the absolute position should read back later as -1, not 
Integer.MIN_VALUE -- where are you seeing Integer.MIN_VALUE?

> 0 position increment not properly supported for the first token
> ---
>
> Key: LUCENE-1330
> URL: https://issues.apache.org/jira/browse/LUCENE-1330
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Javadocs, Store
>Reporter: Moti Nisenson
>
> Setting a position increment of 0 for the first token in a field results in 
> its "absolute position" (as well as increment) being read back later as 
> Integer.MIN_VALUE.
> This is a result of how the information gets written out in DocumentsWriter: 
> position should not be updated using += t.getPositionIncrement() - 1; and 
> then always ++'ed in addPosition. It would be much simpler just to update it 
> using t.getPositionIncrement().
> While this is fairly easy to fix in DocumentsWriter, one could just update 
> the documentation for Token.setPositionIncrement() to indicate that for 
> indexing purposes the first Token in a field must have a positive position 
> increment.
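
The update order described in the report can be traced with a tiny standalone sketch. It is not DocumentsWriter code: the class and method names are hypothetical, and how the recorded value is later serialized and read back is outside its scope.

```java
import java.util.ArrayList;
import java.util.List;

// Traces the position bookkeeping described in the report; not DocumentsWriter itself.
final class PositionSketch {
    // For each token: add (increment - 1), record the position, then add one.
    static List<Integer> recordedPositions(int[] increments) {
        List<Integer> recorded = new ArrayList<>();
        int position = 0;
        for (int increment : increments) {
            position += increment - 1;
            recorded.add(position);
            position++;
        }
        return recorded;
    }
}
```

With increments 1, 1, 1 the recorded positions start at 0 as expected; with a first increment of 0 the first recorded position is -1, which is the case under discussion.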
