[jira] Updated: (LUCENE-1340) Make it posible not to include TF information in index

2008-07-24 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-1340:
---

Attachment: LUCENE-1340.patch

Attached patch that also includes fixes to fileformat.{xml,html,pdf}.

> Make it posible not to include TF information in index
> --
>
> Key: LUCENE-1340
> URL: https://issues.apache.org/jira/browse/LUCENE-1340
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Index
>Reporter: Eks Dev
>Priority: Minor
> Attachments: LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch, 
> LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Term Frequency is typically not needed  for all fields, some CPU (reading one 
> VInt less and one X>>>1...) and IO can be spared by making pure boolen fields 
> possible in Lucene. This topic has already been discussed and accepted as a 
> part of Flexible Indexing... This issue tries to push things a bit faster 
> forward as I have some concrete customer demands.
> benefits can be expected for fields that are typical candidates for Filters, 
> enumerations, user rights, IDs or very short "texts", phone  numbers, zip 
> codes, names...
> Status: just passed standard test (compatibility), commited for early review, 
> I have not tried new feature, missing some asserts and one two unit tests
> Complexity: simpler than expected
> can be used via omitTf() (who used omitNorms() will know where to find it :)  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-1340) Make it posible not to include TF information in index

2008-07-24 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-1340:
---

Attachment: LUCENE-1340.patch

I attached a new rev of the patch:

  * Use less RAM if field omits tf's (don't write the tf's into the RAM 
buffer), so we flush less often

  * Added another test case to TestOmitTf

As a test, I indexed full wikipedia (~3.2 million docs) with this alg:

{code}
analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer

doc.maker=org.apache.lucene.benchmark.byTask.feeds.LineDocMaker

docs.file=/Volumes/External/lucene/wiki.txt
doc.stored = false
doc.term.vector = false
doc.add.log.step=1
max.field.length=2147483647

directory=FSDirectory
autocommit=false
compound=false
doc.maker.forever = false

work.dir=/lucene/work2
ram.flush.mb=64

- CreateIndex
{ "AddDocs" AddDoc > : *
- CloseIndex

RepSumByPrefRound AddDoc

{code}

With tf's it takes 970 seconds and index size is 2.5 GB.  Without tf's
it takes 834 seconds (14% faster) and index size is 1.1 GB (56%
smaller).


> Make it posible not to include TF information in index
> --
>
> Key: LUCENE-1340
> URL: https://issues.apache.org/jira/browse/LUCENE-1340
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Index
>Reporter: Eks Dev
>Priority: Minor
> Attachments: LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch, 
> LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Term Frequency is typically not needed  for all fields, some CPU (reading one 
> VInt less and one X>>>1...) and IO can be spared by making pure boolen fields 
> possible in Lucene. This topic has already been discussed and accepted as a 
> part of Flexible Indexing... This issue tries to push things a bit faster 
> forward as I have some concrete customer demands.
> benefits can be expected for fields that are typical candidates for Filters, 
> enumerations, user rights, IDs or very short "texts", phone  numbers, zip 
> codes, names...
> Status: just passed standard test (compatibility), commited for early review, 
> I have not tried new feature, missing some asserts and one two unit tests
> Complexity: simpler than expected
> can be used via omitTf() (who used omitNorms() will know where to find it :)  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-1340) Make it posible not to include TF information in index

2008-07-21 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-1340:
---

Attachment: LUCENE-1340.patch

OK good progress eks!

I started from your latest patch and made some further changes:

  * Fixed DW to not consume RAM writing prx if omitTf==true

  * Fixed FreqProxTermsWriter to not create *.prx file if all fields
omit term freq.  I added hasProx to SegmentInfo, and changed the
index file format to store this new boolean.

  * Fixed FreqProxTermsWriterPerField to not write prox into the RAM
buffer if we will omitTf on flushing the segment to disk.  This
makes the RAM buffer efficient (no bytes wasted on prox when
omitTf==true for a field).

  * Added more test cases to TestOmitTf

  * Small whitespace, comment changes

The one place I know of that will still waste bytes is the term dict
(TermInfo): it stores a long proxPointer on disk (in *.tii,*.tis) and
also in memory because we load *.tii into RAM.  For fields with
omitTf==true this will always be unused, and we could save alot of
disk/RAM if we didn't waste it.

Unfortunately, I think it's too big a change to try to fix this now; I
think we should wait until flex indexing is online.  I wonder how we
can solve it at that point: maybe should we change TermInfo to be
"column stride", meaning, there are separate arrays storing the values
for all terms (ie long[] proxPointers, long[] freqPointers, etc.).
This would also fit the "pluggable" model better, meaning any plugin
can store new stuff (its own arrays) per-term.

> Make it posible not to include TF information in index
> --
>
> Key: LUCENE-1340
> URL: https://issues.apache.org/jira/browse/LUCENE-1340
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Index
>Reporter: Eks Dev
>Priority: Minor
> Attachments: LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch, 
> LUCENE-1340.patch, LUCENE-1340.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Term Frequency is typically not needed  for all fields, some CPU (reading one 
> VInt less and one X>>>1...) and IO can be spared by making pure boolen fields 
> possible in Lucene. This topic has already been discussed and accepted as a 
> part of Flexible Indexing... This issue tries to push things a bit faster 
> forward as I have some concrete customer demands.
> benefits can be expected for fields that are typical candidates for Filters, 
> enumerations, user rights, IDs or very short "texts", phone  numbers, zip 
> codes, names...
> Status: just passed standard test (compatibility), commited for early review, 
> I have not tried new feature, missing some asserts and one two unit tests
> Complexity: simpler than expected
> can be used via omitTf() (who used omitNorms() will know where to find it :)  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-1340) Make it posible not to include TF information in index

2008-07-20 Thread Eks Dev (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eks Dev updated LUCENE-1340:


Attachment: LUCENE-1340.patch

- fixed stupid bug in SegmentTermDocs (was doc = docCode; instead of doc += 
docCode;)
- TestOmitTf extended a bit 


> Make it posible not to include TF information in index
> --
>
> Key: LUCENE-1340
> URL: https://issues.apache.org/jira/browse/LUCENE-1340
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Index
>Reporter: Eks Dev
>Priority: Minor
> Attachments: LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch, 
> LUCENE-1340.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Term Frequency is typically not needed  for all fields, some CPU (reading one 
> VInt less and one X>>>1...) and IO can be spared by making pure boolen fields 
> possible in Lucene. This topic has already been discussed and accepted as a 
> part of Flexible Indexing... This issue tries to push things a bit faster 
> forward as I have some concrete customer demands.
> benefits can be expected for fields that are typical candidates for Filters, 
> enumerations, user rights, IDs or very short "texts", phone  numbers, zip 
> codes, names...
> Status: just passed standard test (compatibility), commited for early review, 
> I have not tried new feature, missing some asserts and one two unit tests
> Complexity: simpler than expected
> can be used via omitTf() (who used omitNorms() will know where to find it :)  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-1340) Make it posible not to include TF information in index

2008-07-19 Thread Eks Dev (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eks Dev updated LUCENE-1340:


Attachment: LUCENE-1340.patch

Thanks Mike, with just a little bit more hand-holding we are going to be there 
:)
 
I *think* I have *.prx IO excluded in case omitTf==true, please have a look, 
this part is really not an easy one (*Merger).

Also, now if a single field has mixed true/false for omitTf, I set it to true.

One unit test is already there, basic use case works, but the test has to cover 
a bit more



> Make it posible not to include TF information in index
> --
>
> Key: LUCENE-1340
> URL: https://issues.apache.org/jira/browse/LUCENE-1340
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Index
>Reporter: Eks Dev
>Priority: Minor
> Attachments: LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Term Frequency is typically not needed  for all fields, some CPU (reading one 
> VInt less and one X>>>1...) and IO can be spared by making pure boolen fields 
> possible in Lucene. This topic has already been discussed and accepted as a 
> part of Flexible Indexing... This issue tries to push things a bit faster 
> forward as I have some concrete customer demands.
> benefits can be expected for fields that are typical candidates for Filters, 
> enumerations, user rights, IDs or very short "texts", phone  numbers, zip 
> codes, names...
> Status: just passed standard test (compatibility), commited for early review, 
> I have not tried new feature, missing some asserts and one two unit tests
> Complexity: simpler than expected
> can be used via omitTf() (who used omitNorms() will know where to find it :)  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-1340) Make it posible not to include TF information in index

2008-07-19 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-1340:
---

Attachment: LUCENE-1340.patch

Thanks eks, that was fast -- I think you set a new record!

The patch looks good, though we definitely need some solid unit tests
here.  I made some small (whitespace, spelling, naming) corrections &
attached a new rev of the patch.

One question I have: right now if a single field has mixed true/false
for omitTf, you set it to false, meaning we start storing the term
freq, pos, payloads again.  Can/should we do the reverse instead?  If
we did, we could make some further optimizations, eg right now we
consume RAM storing all positions/payloads on a field that has omitTF=true
on the possibility that we may stll see omitTf=false in the same session.

With this patch we still store the *.prx bytes for a field with
omitTf=true.  Can you fix that?  I think in FreqProxTermsWriter you
can simply not write any bytes to the proxOut; likewise in
SegmentMerger and SegmentTermPositions, don't try to read bytes from
the prx file if omitTf==true.

I'd also be curious about what gains in index size & filter
performance we see with these new boolean fields.


> Make it posible not to include TF information in index
> --
>
> Key: LUCENE-1340
> URL: https://issues.apache.org/jira/browse/LUCENE-1340
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Index
>Reporter: Eks Dev
>Priority: Minor
> Attachments: LUCENE-1340.patch, LUCENE-1340.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Term Frequency is typically not needed  for all fields, some CPU (reading one 
> VInt less and one X>>>1...) and IO can be spared by making pure boolen fields 
> possible in Lucene. This topic has already been discussed and accepted as a 
> part of Flexible Indexing... This issue tries to push things a bit faster 
> forward as I have some concrete customer demands.
> benefits can be expected for fields that are typical candidates for Filters, 
> enumerations, user rights, IDs or very short "texts", phone  numbers, zip 
> codes, names...
> Status: just passed standard test (compatibility), commited for early review, 
> I have not tried new feature, missing some asserts and one two unit tests
> Complexity: simpler than expected
> can be used via omitTf() (who used omitNorms() will know where to find it :)  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-1340) Make it posible not to include TF information in index

2008-07-18 Thread Eks Dev (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eks Dev updated LUCENE-1340:


Attachment: LUCENE-1340.patch

first cut

> Make it posible not to include TF information in index
> --
>
> Key: LUCENE-1340
> URL: https://issues.apache.org/jira/browse/LUCENE-1340
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Index
>Reporter: Eks Dev
>Priority: Minor
> Attachments: LUCENE-1340.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Term Frequency is typically not needed  for all fields, some CPU (reading one 
> VInt less and one X>>>1...) and IO can be spared by making pure boolen fields 
> possible in Lucene. This topic has already been discussed and accepted as a 
> part of Flexible Indexing... This issue tries to push things a bit faster 
> forward as I have some concrete customer demands.
> benefits can be expected for fields that are typical candidates for Filters, 
> enumerations, user rights, IDs or very short "texts", phone  numbers, zip 
> codes, names...
> Status: just passed standard test (compatibility), commited for early review, 
> I have not tried new feature, missing some asserts and one two unit tests
> Complexity: simpler than expected
> can be used via omitTf() (who used omitNorms() will know where to find it :)  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]