[jira] Updated: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-1458:
---

    Attachment: LUCENE-1458.tar.bz2

I attached a .tar.bz2 of src/* with my current state -- too hard to keep svn in sync / patchable right now.

Changes:

  * Factored out the terms dict index, so it's now "pluggable" (though I've only created one impl so far).

  * Cut over SegmentMerger to the flex API.

  * Changed terms to be stored in RAM as byte[] (not char[]) when reading. These are UTF8 bytes, but in theory we could eventually allow generic bytes here (there are not that many places that try to decode them as UTF8). I think this is a good step towards allowing generic terms. It also saves 50% RAM for simple ASCII terms w/ the terms index.

  * Changed terms index to use shared byte[] blocks.

  * Broke sources out into a "codecs" subdir of oal.index. Right now I have "preflex" (only provides a reader, to read the old index format), "standard" (new terms dict & index, but otherwise the same freq/prox/skip/payloads encoding), "pulsing" (inlines low-freq terms directly into the terms dict) and "sep" (separately stores docs, frq, prox, skip, payloads, as a precursor to using pfor to encode doc/frq/prox).

The patch is very rough... core & core-test compile, but most tests fail. It's very much still a work in progress...

> Further steps towards flexible indexing
> ---
>
> Key: LUCENE-1458
> URL: https://issues.apache.org/jira/browse/LUCENE-1458
> Project: Lucene - Java
> Issue Type: New Feature
> Components: Index
> Affects Versions: 2.9
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Priority: Minor
> Attachments: LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch,
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.tar.bz2
>
> I attached a very rough checkpoint of my current patch, to get early
> feedback.
> All tests pass, though back compat tests don't pass due to
> changes to package-private APIs plus certain bugs in tests that
> happened to work (eg call TermPositions.nextPosition() too many times,
> which the new API asserts against).
>
> [Aside: I think, when we commit changes to package-private APIs such
> that back-compat tests don't pass, we could go back, make a branch on
> the back-compat tag, commit changes to the tests to use the new
> package private APIs on that branch, then fix nightly build to use the
> tip of that branch?]
>
> There's still plenty to do before this is committable! This is a
> rather large change:
>
>   * Switches to a new more efficient terms dict format. This still
>     uses tii/tis files, but the tii only stores term & long offset
>     (not a TermInfo). At seek points, tis encodes term & freq/prox
>     offsets absolutely instead of with deltas. Also, tis/tii
>     are structured by field, so we don't have to record field number
>     in every term.
>
>     On first 1 M docs of Wikipedia, tii file is 36% smaller (0.99 MB
>     -> 0.64 MB) and tis file is 9% smaller (75.5 MB -> 68.5 MB).
>
>     RAM usage when loading terms dict index is significantly less
>     since we only load an array of offsets and an array of String (no
>     more TermInfo array). It should be faster to init too.
>
>     This part is basically done.
>
>   * Introduces modular reader codec that strongly decouples terms dict
>     from docs/positions readers. EG there is no more TermInfo used
>     when reading the new format.
>
>     There's nice symmetry now between reading & writing in the codec
>     chain -- the current docs/prox format is captured in:
> {code}
> FormatPostingsTermsDictWriter/Reader
> FormatPostingsDocsWriter/Reader (.frq file) and
> FormatPostingsPositionsWriter/Reader (.prx file).
> {code}
>     This part is basically done.
>
>   * Introduces a new "flex" API for iterating through the fields,
>     terms, docs and positions:
> {code}
> FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum
> {code}
>     This replaces TermEnum/Docs/Positions. SegmentReader emulates the
>     old API on top of the new API to keep back-compat.
>
> Next steps:
>
>   * Plug in new codecs (pulsing, pfor) to exercise the modularity /
>     fix any hidden assumptions.
>
>   * Expose new API out of IndexReader, deprecate old API but emulate
>     old API on top of new one, switch all core/contrib users to the
>     new API.
>
>   * Maybe switch to AttributeSources as the base class for TermsEnum,
>     DocsEnum, PostingsEnum -- this would give readers API flexibility
>     (not just index-file-format flexibility). EG if someone wanted
>     to store payload at the term-doc level instead of
>     term-doc-position level, you could just add a new attribute.
>
>   * T
MoreLikeThis Extension for documents that have tags
Hi,

I would like to contribute a class based on the MoreLikeThis class in contrib/queries that generates a query based on the tags associated with a document. The class assumes that documents are tagged with a set of tags (which are stored in the index in a separate Field). The class determines the top document terms associated with a given tag using the information gain metric. While generating a MoreLikeThis query for a document, the tags associated with the document are used to determine the terms in the query. This class is useful for finding documents similar to one that does not have many relevant terms but was tagged.

I have attached the class and a test class and would appreciate any feedback.

Thank you,
Thomas

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
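The message above does not say how the information gain of a term for a tag is computed, so the following is only a minimal sketch using the textbook definition (entropy of the tag minus the entropy of the tag once the term's presence is known), driven by document counts. The class name, method names and parameters are hypothetical and are not taken from the attached class.

```java
/** Hypothetical helper for illustration; not the contributed class. */
final class InformationGain {

    static double log2(double x) {
        return Math.log(x) / Math.log(2);
    }

    /** Entropy in bits of a yes/no event that occurs with probability p. */
    static double entropy(double p) {
        if (p <= 0.0 || p >= 1.0) {
            return 0.0; // a certain outcome carries no uncertainty
        }
        return -(p * log2(p) + (1 - p) * log2(1 - p));
    }

    /**
     * Information gain of a term with respect to a tag.
     *
     * @param n        total number of documents
     * @param tagged   documents carrying the tag
     * @param withTerm documents containing the term
     * @param both     documents carrying the tag and containing the term
     */
    static double gain(long n, long tagged, long withTerm, long both) {
        long withoutTerm = n - withTerm;
        double pTerm = (double) withTerm / n;
        double hTag = entropy((double) tagged / n);
        double hGivenTerm = withTerm == 0 ? 0.0 : entropy((double) both / withTerm);
        double hGivenNoTerm = withoutTerm == 0 ? 0.0
                : entropy((double) (tagged - both) / withoutTerm);
        return hTag - (pTerm * hGivenTerm + (1 - pTerm) * hGivenNoTerm);
    }
}
```

Under this definition a term that appears in exactly the tagged documents scores highest, while a term spread evenly across tagged and untagged documents scores zero, so ranking terms by `gain` would be one way to pick the "top document terms" for a tag.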
[jira] Resolved: (LUCENE-1877) Use NativeFSLockFactory as default for new API (direct ctors & FSDir.open)
[ https://issues.apache.org/jira/browse/LUCENE-1877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe Schindler resolved LUCENE-1877.
---

    Resolution: Fixed

Committed revision: 811157

> Use NativeFSLockFactory as default for new API (direct ctors & FSDir.open)
> --
>
> Key: LUCENE-1877
> URL: https://issues.apache.org/jira/browse/LUCENE-1877
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Javadocs
> Reporter: Mark Miller
> Assignee: Uwe Schindler
> Fix For: 2.9
> Attachments: LUCENE-1877.patch, LUCENE-1877.patch, LUCENE-1877.patch,
> LUCENE-1877.patch
>
> A user requested we add a note in IndexWriter alerting users to the availability of
> NativeFSLockFactory (allowing you to avoid retaining locks on abnormal jvm
> exit). Seems reasonable to me - we want users to be able to easily stumble
> upon this class. The below code looks like a good spot to add a note - could
> also improve what's there a bit - opening an IndexWriter does not necessarily
> create a lock file - that would depend on the LockFactory used.
> {code} Opening an IndexWriter creates a lock file for the
> directory in use. Trying to open
> another IndexWriter on the same directory will lead to a
> {@link LockObtainFailedException}. The {@link LockObtainFailedException}
> is also thrown if an IndexReader on the same directory is used to delete
> documents from the index.{code}
> Anyone remember why NativeFSLockFactory is not the default over
> SimpleFSLockFactory?

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
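The difference the issue alludes to (a lock that survives an abnormal JVM exit versus one that does not) can be sketched with plain JDK calls. This is an illustration only: `LockStyles` and its methods are hypothetical names, and the code is not how SimpleFSLockFactory or NativeFSLockFactory are implemented. It contrasts a lock that is "held" because a marker file exists with a lock that is held by the operating system on an open channel.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;

/** Hypothetical helper for illustration; not part of Lucene. */
final class LockStyles {

    /** Marker-file style: the lock counts as held for as long as the file exists. */
    static boolean obtainMarkerLock(File lockFile) {
        try {
            return lockFile.createNewFile(); // false if the file is already there
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** OS-level style: the lock is tied to the open channel, not to the file's existence. */
    static FileLock obtainNativeLock(File lockFile) {
        try {
            FileChannel channel = new RandomAccessFile(lockFile, "rw").getChannel();
            FileLock lock = channel.tryLock(); // null if another process holds the lock
            if (lock == null) {
                channel.close();
            }
            return lock;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Releases the lock and closes its channel; the lock file itself stays on disk. */
    static void releaseNativeLock(FileLock lock) {
        try {
            lock.release();
            lock.channel().close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With the marker-file style, a file left behind by a crashed process keeps blocking `obtainMarkerLock` until someone deletes it. With the OS-level style, the operating system drops the lock when the owning process dies, so a leftover lock file does not prevent the next `obtainNativeLock` from succeeding.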
[jira] Resolved: (LUCENE-1885) NativeFSLockFactory.makeLock(...).isLocked() does not work
[ https://issues.apache.org/jira/browse/LUCENE-1885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe Schindler resolved LUCENE-1885.
---

    Resolution: Fixed

Committed revision: 811157

> NativeFSLockFactory.makeLock(...).isLocked() does not work
> --
>
> Key: LUCENE-1885
> URL: https://issues.apache.org/jira/browse/LUCENE-1885
> Project: Lucene - Java
> Issue Type: Bug
> Reporter: Uwe Schindler
> Assignee: Uwe Schindler
> Priority: Blocker
> Fix For: 2.9
>
> IndexWriter.isLocked() or IndexReader.isLocked() do not work with
> NativeFSLockFactory.
> The problem is that the method NativeFSLock.isLocked() just checks if the
> same lock instance was locked before (lock != null). If the LockFactory
> created a new lock instance, this always returns false, even if it's locked.
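The failure mode described in the issue can be reproduced in a small in-memory model. This is a sketch under stated assumptions, not Lucene's NativeFSLock and not the committed fix: `ModelLockFactory`, `ModelLock` and both `isLocked...` methods are hypothetical, and the `heldByOs` set merely stands in for lock state that the operating system would keep outside any single lock object.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical model; the set stands in for lock state kept by the operating system. */
final class ModelLockFactory {
    private final Set<String> heldByOs = ConcurrentHashMap.newKeySet();

    /** Every call returns a fresh instance, as the issue describes. */
    ModelLock makeLock(String name) {
        return new ModelLock(name);
    }

    final class ModelLock {
        private final String name;
        private boolean heldByThisInstance;

        ModelLock(String name) {
            this.name = name;
        }

        boolean obtain() {
            if (!heldByThisInstance && heldByOs.add(name)) {
                heldByThisInstance = true;
            }
            return heldByThisInstance;
        }

        void release() {
            if (heldByThisInstance) {
                heldByOs.remove(name);
                heldByThisInstance = false;
            }
        }

        /** The reported behaviour: only this instance's own state is consulted. */
        boolean isLockedInstanceOnly() {
            return heldByThisInstance;
        }

        /** What callers expect: the answer reflects the shared lock state. */
        boolean isLockedShared() {
            return heldByOs.contains(name);
        }
    }
}
```

A second instance created by `makeLock` for the same name reports `false` from `isLockedInstanceOnly()` while the first instance still holds the lock, which is the symptom in the report; `isLockedShared()` gives the answer a caller of `IndexWriter.isLocked()` would want.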
[jira] Commented: (LUCENE-1877) Use NativeFSLockFactory as default for new API (direct ctors & FSDir.open)
[ https://issues.apache.org/jira/browse/LUCENE-1877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12751190#action_12751190 ]

Uwe Schindler commented on LUCENE-1877:
---

I will commit soon!

> Use NativeFSLockFactory as default for new API (direct ctors & FSDir.open)
> --
>
> Key: LUCENE-1877
> URL: https://issues.apache.org/jira/browse/LUCENE-1877
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Javadocs
> Reporter: Mark Miller
> Assignee: Uwe Schindler
> Fix For: 2.9
> Attachments: LUCENE-1877.patch, LUCENE-1877.patch, LUCENE-1877.patch,
> LUCENE-1877.patch
>
> A user requested we add a note in IndexWriter alerting users to the availability of
> NativeFSLockFactory (allowing you to avoid retaining locks on abnormal jvm
> exit). Seems reasonable to me - we want users to be able to easily stumble
> upon this class. The below code looks like a good spot to add a note - could
> also improve what's there a bit - opening an IndexWriter does not necessarily
> create a lock file - that would depend on the LockFactory used.
> {code} Opening an IndexWriter creates a lock file for the
> directory in use. Trying to open
> another IndexWriter on the same directory will lead to a
> {@link LockObtainFailedException}. The {@link LockObtainFailedException}
> is also thrown if an IndexReader on the same directory is used to delete
> documents from the index.{code}
> Anyone remember why NativeFSLockFactory is not the default over
> SimpleFSLockFactory?
[jira] Resolved: (LUCENE-1884) javadocs cleanup
[ https://issues.apache.org/jira/browse/LUCENE-1884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hoss Man resolved LUCENE-1884.
--

    Resolution: Fixed
      Assignee: Robert Muir

I'm the last person in the world that should be reviewing a spell correction patch -- but nothing jumped out at me as being bad about the patch...

Committed revision 811070.

> javadocs cleanup
> 
>
> Key: LUCENE-1884
> URL: https://issues.apache.org/jira/browse/LUCENE-1884
> Project: Lucene - Java
> Issue Type: Task
> Components: Javadocs
> Reporter: Robert Muir
> Assignee: Robert Muir
> Priority: Minor
> Fix For: 2.9
> Attachments: LUCENE-1884.patch
>
> basic cleanup in core/contrib: typos, apache license header as javadoc,
> missing periods that screw up package summary, etc.
[jira] Resolved: (LUCENE-1886) Improve Javadoc
[ https://issues.apache.org/jira/browse/LUCENE-1886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hoss Man resolved LUCENE-1886.
--

    Resolution: Fixed
      Assignee: Hoss Man

Committed revision 811060.

Thanks Bernd

> Improve Javadoc
> ---
>
> Key: LUCENE-1886
> URL: https://issues.apache.org/jira/browse/LUCENE-1886
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Analysis
> Reporter: Bernd Fondermann
> Assignee: Hoss Man
> Priority: Trivial
> Fix For: 2.9
> Attachments: javadoc.patch
[jira] Commented: (LUCENE-1879) Parallel incremental indexing
[ https://issues.apache.org/jira/browse/LUCENE-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12751077#action_12751077 ]

Grant Ingersoll commented on LUCENE-1879:
-

Yes on the soft. grant.

> Parallel incremental indexing
> -
>
> Key: LUCENE-1879
> URL: https://issues.apache.org/jira/browse/LUCENE-1879
> Project: Lucene - Java
> Issue Type: New Feature
> Components: Index
> Reporter: Michael Busch
> Assignee: Michael Busch
> Fix For: 3.1
>
> A new feature that allows building parallel indexes and keeping them in sync
> on a docID level, independent of the choice of the MergePolicy/MergeScheduler.
> Find details on the wiki page for this feature:
> http://wiki.apache.org/lucene-java/ParallelIncrementalIndexing
> Discussion on java-dev:
> http://markmail.org/thread/ql3oxzkob7aqf3jd
[jira] Created: (LUCENE-1889) FastVectorHighlighter: support for additional queries
FastVectorHighlighter: support for additional queries
-

Key: LUCENE-1889
URL: https://issues.apache.org/jira/browse/LUCENE-1889
Project: Lucene - Java
Issue Type: Wish
Components: contrib/*
Reporter: Robert Muir
Priority: Minor

I am using FastVectorHighlighter for some strange languages and it is working well!

One thing I noticed immediately is that many query types are not highlighted (MultiTermQuery, MultiPhraseQuery, etc.).

Here is one thing Michael M posted in the original ticket:

{quote}
I think a nice [eventual] model would be if we could simply re-run the scorer on the single document (using InstantiatedIndex maybe, or simply some sort of wrapper on the term vectors which are already a mini-inverted-index for a single doc), but extend the scorer API to tell us the exact term occurrences that participated in a match (which I don't think is exposed today).
{quote}

Due to strange requirements I am using something similar to this (but specialized to our case). I am doing strange things like forcing MultiTermQueries to rewrite into boolean queries so they will be highlighted, and flattening MultiPhraseQueries into boolean OR'ed PhraseQueries.

I do not think these things would be 'fast', but I had a few ideas that might help:

* looking at contrib/highlighter, you can support FilteredQuery in flatten() by calling getQuery(), right?
* maybe as a last resort, try Query.extractTerms()?
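The "flattening multiphrasequeries into boolean or'ed phrasequeries" step mentioned above amounts to expanding the alternatives at each phrase position into every concrete phrase. The sketch below shows that expansion on plain strings only; `PhraseFlattener` is a hypothetical name, it does not touch Lucene's query classes, and it is not the reporter's code or FastVectorHighlighter's `flatten()`.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; works on plain term strings. */
final class PhraseFlattener {

    /**
     * Expands a multi-phrase (one list of alternative terms per position)
     * into every plain phrase it can match.
     */
    static List<List<String>> flatten(List<List<String>> alternativesPerPosition) {
        List<List<String>> phrases = new ArrayList<>();
        phrases.add(new ArrayList<>()); // start from a single empty phrase
        for (List<String> alternatives : alternativesPerPosition) {
            List<List<String>> extended = new ArrayList<>();
            for (List<String> prefix : phrases) {
                for (String term : alternatives) {
                    List<String> phrase = new ArrayList<>(prefix);
                    phrase.add(term);
                    extended.add(phrase);
                }
            }
            phrases = extended;
        }
        return phrases;
    }
}
```

Each resulting list would become one phrase clause of an OR'ed boolean query. The number of phrases is the product of the alternative counts at every position, which is consistent with the reporter's remark that this approach is unlikely to be fast.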
[jira] Resolved: (LUCENE-1878) remove deprecated classes from spatial
[ https://issues.apache.org/jira/browse/LUCENE-1878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless resolved LUCENE-1878.

    Resolution: Fixed

> remove deprecated classes from spatial
> --
>
> Key: LUCENE-1878
> URL: https://issues.apache.org/jira/browse/LUCENE-1878
> Project: Lucene - Java
> Issue Type: Task
> Components: contrib/spatial
> Reporter: Mark Miller
> Assignee: Michael McCandless
> Priority: Minor
> Fix For: 2.9
> Attachments: LUCENE-1878.patch, LUCENE-1878.patch
>
> spatial has not been released, so we can remove the deprecated classes
[jira] Created: (LUCENE-1888) Provide Option to Store Payloads on the Term Vector
Provide Option to Store Payloads on the Term Vector
---

Key: LUCENE-1888
URL: https://issues.apache.org/jira/browse/LUCENE-1888
Project: Lucene - Java
Issue Type: Improvement
Components: Index
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Fix For: 3.0, 3.1

Would be nice to have the option to access the payloads in a document-centric way by adding them to the Term Vectors. Naturally, this makes the Term Vectors bigger, but it may be just what one needs.
[jira] Resolved: (LUCENE-1876) Some contrib packages are missing a package.html
[ https://issues.apache.org/jira/browse/LUCENE-1876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir resolved LUCENE-1876.
-

    Resolution: Fixed

Committed revision 810923.

thanks Steven!

> Some contrib packages are missing a package.html
> 
>
> Key: LUCENE-1876
> URL: https://issues.apache.org/jira/browse/LUCENE-1876
> Project: Lucene - Java
> Issue Type: Improvement
> Components: contrib/*
> Reporter: Mark Miller
> Assignee: Robert Muir
> Priority: Trivial
> Fix For: 2.9
> Attachments: collation-package.html, LUCENE-1876.patch,
> LUCENE-1876.patch
>
> Dunno if we will get to this one this release, but a few contribs don't have
> a package.html (or a good overview that would work as a replacement) - I
> don't think this is hugely important, but I think it is important - you
> should be able to easily and quickly read a quick overview for each contrib I
> think.
> So far I have identified collation and spatial.