[jira] Updated: (LUCENE-1340) Make it posible not to include TF information in index
[ https://issues.apache.org/jira/browse/LUCENE-1340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-1340:
---

    Attachment: LUCENE-1340.patch

Attached patch that also includes fixes to fileformat.{xml,html,pdf}.

> Make it posible not to include TF information in index
> --
>
>                 Key: LUCENE-1340
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1340
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>            Reporter: Eks Dev
>            Priority: Minor
>         Attachments: LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch,
>                      LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Term frequency is typically not needed for all fields; some CPU (reading one
> VInt less and one X>>>1...) and IO can be saved by making pure boolean fields
> possible in Lucene. This topic has already been discussed and accepted as
> part of Flexible Indexing... This issue tries to push things forward a bit
> faster, as I have some concrete customer demands.
> Benefits can be expected for fields that are typical candidates for Filters:
> enumerations, user rights, IDs or very short "texts", phone numbers, zip
> codes, names...
> Status: just passed the standard tests (compatibility), committed for early review;
> I have not tried the new feature; some asserts and one or two unit tests are missing.
> Complexity: simpler than expected.
> Can be used via omitTf() (whoever has used omitNorms() will know where to find it :)

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Updated: (LUCENE-1340) Make it posible not to include TF information in index
[ https://issues.apache.org/jira/browse/LUCENE-1340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-1340:
---

    Attachment: LUCENE-1340.patch

I attached a new rev of the patch:

  * Use less RAM if a field omits tf's (don't write the tf's into the RAM buffer), so we flush less often
  * Added another test case to TestOmitTf

As a test, I indexed full Wikipedia (~3.2 million docs) with this alg:

{code}
analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer
doc.maker=org.apache.lucene.benchmark.byTask.feeds.LineDocMaker
docs.file=/Volumes/External/lucene/wiki.txt
doc.stored = false
doc.term.vector = false
doc.add.log.step=1
max.field.length=2147483647
directory=FSDirectory
autocommit=false
compound=false
doc.maker.forever = false
work.dir=/lucene/work2
ram.flush.mb=64

- CreateIndex
{ "AddDocs" AddDoc > : *
- CloseIndex

RepSumByPrefRound AddDoc
{code}

With tf's it takes 970 seconds and the index size is 2.5 GB. Without tf's it takes 834 seconds (14% faster) and the index size is 1.1 GB (56% smaller).
[jira] Updated: (LUCENE-1340) Make it posible not to include TF information in index
[ https://issues.apache.org/jira/browse/LUCENE-1340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-1340:
---

    Attachment: LUCENE-1340.patch

OK, good progress eks! I started from your latest patch and made some further changes:

  * Fixed DW to not consume RAM writing prx if omitTf==true
  * Fixed FreqProxTermsWriter to not create the *.prx file if all fields omit term freq. I added hasProx to SegmentInfo, and changed the index file format to store this new boolean.
  * Fixed FreqProxTermsWriterPerField to not write prox into the RAM buffer if we will omitTf on flushing the segment to disk. This makes the RAM buffer efficient (no bytes wasted on prox when omitTf==true for a field).
  * Added more test cases to TestOmitTf
  * Small whitespace and comment changes

The one place I know of that will still waste bytes is the term dict (TermInfo): it stores a long proxPointer on disk (in *.tii, *.tis) and also in memory, because we load *.tii into RAM. For fields with omitTf==true this will always be unused, and we could save a lot of disk/RAM if we didn't waste it.

Unfortunately, I think it's too big a change to try to fix this now; I think we should wait until flex indexing is online. I wonder how we can solve it at that point: maybe we should change TermInfo to be "column stride", meaning there are separate arrays storing the values for all terms (i.e. long[] proxPointers, long[] freqPointers, etc.). This would also fit the "pluggable" model better, meaning any plugin can store new stuff (its own arrays) per term.
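The "column stride" idea in the message above can be pictured with a small Java sketch. This is only an illustration under stated assumptions, not Lucene code: the class `ColumnStrideTermInfos`, its fields, and `payloadBytes()` are hypothetical names, and the per-term attributes are reduced to a doc freq plus the two pointers mentioned in the message.

```java
/** Toy "column stride" term dictionary: one array per attribute, indexed by term ordinal. */
final class ColumnStrideTermInfos {
    final int[] docFreqs;
    final long[] freqPointers;
    final long[] proxPointers;   // null when the field omits tf: no per-term cost at all

    ColumnStrideTermInfos(int termCount, boolean omitTf) {
        docFreqs = new int[termCount];
        freqPointers = new long[termCount];
        // Hypothetical policy: skip the whole column instead of storing an unused long per term.
        proxPointers = omitTf ? null : new long[termCount];
    }

    /** Bytes used by the array payloads (ignores object and array headers). */
    long payloadBytes() {
        long n = docFreqs.length;
        long bytes = n * Integer.BYTES + n * Long.BYTES;
        if (proxPointers != null) {
            bytes += n * Long.BYTES;
        }
        return bytes;
    }
}
```

With a per-term object, the unused proxPointer costs 8 bytes for every term; with one array per attribute, the column can simply be absent, and a plugin could add further columns of its own.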
[jira] Updated: (LUCENE-1340) Make it posible not to include TF information in index
[ https://issues.apache.org/jira/browse/LUCENE-1340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eks Dev updated LUCENE-1340:

    Attachment: LUCENE-1340.patch

- fixed stupid bug in SegmentTermDocs (was doc = docCode; instead of doc += docCode;)
- TestOmitTf extended a bit
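The SegmentTermDocs fix in the message above is about delta-coded doc IDs: each entry stores the gap to the previous doc, so the reader has to accumulate (doc += docCode) rather than assign. Below is a minimal Java sketch of that idea, assuming the layout hinted at in the issue description ("one VInt less and one X>>>1"): with tf, the low bit of docCode flags freq == 1 and the gap is docCode >>> 1; with omitTf, docCode is the gap itself. `DocDeltaDecoder` and `decodeDocs` are hypothetical names, and a plain int array stands in for the VInt stream.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy decoder for delta-coded postings; not Lucene's SegmentTermDocs. */
final class DocDeltaDecoder {

    /**
     * Decodes doc IDs from a stream of codes.
     * With omitTf, each code is the gap to the previous doc ID.
     * Otherwise the gap is code >>> 1; if the low bit is 0, the next
     * value in the stream is the term frequency, which is skipped here.
     */
    static List<Integer> decodeDocs(int[] codes, boolean omitTf) {
        List<Integer> docs = new ArrayList<>();
        int doc = 0;
        int i = 0;
        while (i < codes.length) {
            int docCode = codes[i++];
            if (omitTf) {
                doc += docCode;           // accumulate the gap (the bug was: doc = docCode)
            } else {
                doc += docCode >>> 1;     // shift off the freq flag bit
                if ((docCode & 1) == 0) {
                    i++;                  // an explicit freq follows; skip it
                }
            }
            docs.add(doc);
        }
        return docs;
    }
}
```

With plain assignment the first doc still decodes correctly, which is why a bug like this can slip past a test that only checks a single-document term.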
[jira] Updated: (LUCENE-1340) Make it posible not to include TF information in index
[ https://issues.apache.org/jira/browse/LUCENE-1340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eks Dev updated LUCENE-1340:

    Attachment: LUCENE-1340.patch

Thanks Mike, with just a little bit more hand-holding we are going to be there :)

I *think* I have *.prx IO excluded in case omitTf==true; please have a look, this part is really not an easy one (*Merger).
Also, now if a single field has mixed true/false for omitTf, I set it to true.

One unit test is already there and the basic use case works, but the test has to cover a bit more.
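The rule described in the message above (a field seen with mixed true/false for omitTf ends up as true) amounts to a sticky OR over all occurrences of the field. A minimal Java sketch of that rule follows; `ToyFieldFlags` and its methods are hypothetical names for illustration and are not the patch's actual code.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy per-field flag registry illustrating a sticky omitTf flag. */
final class ToyFieldFlags {
    private final Map<String, Boolean> omitTfByField = new HashMap<>();

    /** Records one occurrence of a field; once any occurrence omits tf, the field stays omitTf. */
    void add(String field, boolean omitTf) {
        omitTfByField.merge(field, omitTf, (old, cur) -> old || cur);
    }

    /** Returns the resolved flag; unknown fields default to keeping tf. */
    boolean omitTf(String field) {
        return omitTfByField.getOrDefault(field, false);
    }
}
```

Resolving towards true means the writer never has to keep positions around "just in case" a later occurrence of the same field asks for them.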
[jira] Updated: (LUCENE-1340) Make it posible not to include TF information in index
[ https://issues.apache.org/jira/browse/LUCENE-1340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-1340:
---

    Attachment: LUCENE-1340.patch

Thanks eks, that was fast -- I think you set a new record!

The patch looks good, though we definitely need some solid unit tests here. I made some small (whitespace, spelling, naming) corrections & attached a new rev of the patch.

One question I have: right now if a single field has mixed true/false for omitTf, you set it to false, meaning we start storing the term freq, pos, payloads again. Can/should we do the reverse instead? If we did, we could make some further optimizations, e.g. right now we consume RAM storing all positions/payloads on a field that has omitTf=true on the possibility that we may still see omitTf=false in the same session.

With this patch we still store the *.prx bytes for a field with omitTf=true. Can you fix that? I think in FreqProxTermsWriter you can simply not write any bytes to the proxOut; likewise in SegmentMerger and SegmentTermPositions, don't try to read bytes from the prx file if omitTf==true.

I'd also be curious about what gains in index size & filter performance we see with these new boolean fields.
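The request in the message above (write nothing to proxOut when omitTf==true) can be illustrated with a toy writer. This is a sketch under assumptions, not the FreqProxTermsWriter code: `ToyPostingsWriter`, its fields, and `addPosting` are hypothetical, ByteArrayOutputStreams stand in for the *.frq and *.prx files, and the VInt helper is the usual 7-bits-per-byte variable-length encoding.

```java
import java.io.ByteArrayOutputStream;

/** Toy postings writer showing that omitTf leaves the prox stream empty. */
final class ToyPostingsWriter {
    final ByteArrayOutputStream freqOut = new ByteArrayOutputStream();
    final ByteArrayOutputStream proxOut = new ByteArrayOutputStream();
    private final boolean omitTf;
    private int lastDoc = 0;

    ToyPostingsWriter(boolean omitTf) {
        this.omitTf = omitTf;
    }

    /** Variable-length int: 7 data bits per byte, high bit set on all but the last byte. */
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    /** Adds one posting; doc IDs must ascend across calls, positions must ascend within a call. */
    void addPosting(int doc, int[] positions) {
        int delta = doc - lastDoc;
        lastDoc = doc;
        if (omitTf) {
            writeVInt(freqOut, delta);              // plain gap: no freq, no positions
            return;
        }
        int freq = positions.length;
        if (freq == 1) {
            writeVInt(freqOut, (delta << 1) | 1);   // low bit set means freq == 1
        } else {
            writeVInt(freqOut, delta << 1);
            writeVInt(freqOut, freq);
        }
        int lastPos = 0;
        for (int pos : positions) {
            writeVInt(proxOut, pos - lastPos);      // positions are delta-coded too
            lastPos = pos;
        }
    }
}
```

Feeding the same two postings to a writer with omitTf=true and one with omitTf=false shows the difference directly: the first leaves proxOut at zero bytes and writes one VInt per doc to freqOut.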
[jira] Updated: (LUCENE-1340) Make it posible not to include TF information in index
[ https://issues.apache.org/jira/browse/LUCENE-1340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eks Dev updated LUCENE-1340:

    Attachment: LUCENE-1340.patch

first cut