[jira] Updated: (LUCENE-1458) Further steps towards flexible indexing

2009-09-03 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-1458:
---

Attachment: LUCENE-1458.tar.bz2

I attached a .tar.bz2 of src/* with my current state -- too hard to
keep svn in sync / patchable right now.  Changes:

  * Factored out the terms dict index, so it's now "pluggable" (though
I've only created one impl, so far)

  * Cutover SegmentMerger to flex API

  * Changed terms to be stored in RAM as byte[] (not char[]), when
reading.  These are UTF8 bytes, but in theory eventually we could
allow generic bytes here (there are not that many places that try
to decode them as UTF8).  I think this is a good step towards
allowing generic terms.  It also saves 50% RAM for simple ASCII
terms w/ the terms index.

  * Changed terms index to use shared byte[] blocks

  * Broke sources out into "codecs" subdir of oal.index.  Right now I
have "preflex" (only provides reader, to read old index format),
"standard" (new terms dict & index, but otherwise same
freq/prox/skip/payloads encoding), "pulsing" (inlines low-freq
terms directly into terms dict) and "sep" (separately stores docs,
frq, prox, skip, payloads, as a precursor to using pfor to encode
doc/frq/prox).
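
The 50% figure for simple ASCII terms follows from a Java char being 16 bits while UTF-8 encodes each ASCII character in one byte. A minimal sketch of that arithmetic, assuming only the term payload is counted (array headers are ignored); the class and method names are made up for illustration and are not taken from the patch:

```java
import java.nio.charset.StandardCharsets;

// Illustration only: compares the payload size of a term held as char[]
// (UTF-16, 2 bytes per char) with the same term held as UTF-8 bytes.
final class TermBytesSketch {

    // Payload bytes when the term is kept as a Java char[].
    static int charArrayBytes(String term) {
        return term.length() * Character.BYTES;
    }

    // Payload bytes when the term is kept as a UTF-8 encoded byte[].
    static int utf8Bytes(String term) {
        return term.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        for (String term : new String[] {"lucene", "na\u00efve"}) {
            System.out.println(term + ": char[]=" + charArrayBytes(term)
                + " bytes, UTF-8=" + utf8Bytes(term) + " bytes");
        }
    }
}
```

The saving shrinks for non-ASCII text, and for characters that need three UTF-8 bytes the byte[] form is larger than char[], which is why the 50% claim is limited to simple ASCII terms.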

The patch is very rough... core & core-test compile, but most tests
fail.  It's very much still a work in progress...
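
For readers new to the terms dict index: the issue description quoted below says the index holds only terms and long offsets. A rough sketch of how such an index could be consulted, assuming a sorted array of index terms with a parallel array of file offsets. The class, its fields and the use of String.compareTo ordering are assumptions for illustration, not the code in the attachment:

```java
import java.util.Arrays;

// Hypothetical in-memory terms index: a sample of terms from the terms
// dict plus, for each, the file offset where scanning can start.
final class TermsIndexSketch {
    private final String[] indexTerms; // sorted ascending
    private final long[] offsets;      // offsets[i] belongs to indexTerms[i]

    TermsIndexSketch(String[] indexTerms, long[] offsets) {
        if (indexTerms.length != offsets.length) {
            throw new IllegalArgumentException("arrays must have equal length");
        }
        this.indexTerms = indexTerms;
        this.offsets = offsets;
    }

    // Returns the offset of the last index term <= target, or -1 if the
    // target sorts before every index term.
    long seekOffset(String target) {
        int pos = Arrays.binarySearch(indexTerms, target);
        if (pos < 0) {
            // binarySearch returns -(insertionPoint) - 1 on a miss;
            // the entry just before the insertion point is the one we want.
            pos = -pos - 2;
        }
        return pos < 0 ? -1L : offsets[pos];
    }

    public static void main(String[] args) {
        TermsIndexSketch index = new TermsIndexSketch(
            new String[] {"apple", "mango", "zebra"},
            new long[] {0L, 1024L, 4096L});
        System.out.println("seek 'banana' from offset " + index.seekOffset("banana"));
    }
}
```

A reader would then scan the terms dict forward from the returned offset until it reaches or passes the target term.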


> Further steps towards flexible indexing
> ---
>
> Key: LUCENE-1458
> URL: https://issues.apache.org/jira/browse/LUCENE-1458
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Index
>Affects Versions: 2.9
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Attachments: LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.tar.bz2
>
>
> I attached a very rough checkpoint of my current patch, to get early
> feedback.  All tests pass, though back compat tests don't pass due to
> changes to package-private APIs plus certain bugs in tests that
> happened to work (eg call TermPositions.nextPosition() too many times,
> which the new API asserts against).
> [Aside: I think, when we commit changes to package-private APIs such
> that back-compat tests don't pass, we could go back, make a branch on
> the back-compat tag, commit changes to the tests to use the new
> package private APIs on that branch, then fix nightly build to use the
> tip of that branch?]
> There's still plenty to do before this is committable! This is a
> rather large change:
>   * Switches to a new more efficient terms dict format.  This still
> uses tii/tis files, but the tii only stores term & long offset
> (not a TermInfo).  At seek points, tis encodes term & freq/prox
> offsets absolutely instead of with deltas.  Also, tis/tii
> are structured by field, so we don't have to record field number
> in every term.
> .
> On first 1 M docs of Wikipedia, tii file is 36% smaller (0.99 MB
> -> 0.64 MB) and tis file is 9% smaller (75.5 MB -> 68.5 MB).
> .
> RAM usage when loading terms dict index is significantly less
> since we only load an array of offsets and an array of String (no
> more TermInfo array).  It should be faster to init too.
> .
> This part is basically done.
>   * Introduces modular reader codec that strongly decouples terms dict
> from docs/positions readers.  EG there is no more TermInfo used
> when reading the new format.
> .
> There's nice symmetry now between reading & writing in the codec
> chain -- the current docs/prox format is captured in:
> {code}
> FormatPostingsTermsDictWriter/Reader
> FormatPostingsDocsWriter/Reader (.frq file) and
> FormatPostingsPositionsWriter/Reader (.prx file).
> {code}
> This part is basically done.
>   * Introduces a new "flex" API for iterating through the fields,
> terms, docs and positions:
> {code}
> FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum
> {code}
> This replaces TermEnum/Docs/Positions.  SegmentReader emulates the
> old API on top of the new API to keep back-compat.
> 
> Next steps:
>   * Plug in new codecs (pulsing, pfor) to exercise the modularity /
> fix any hidden assumptions.
>   * Expose new API out of IndexReader, deprecate old API but emulate
> old API on top of new one, switch all core/contrib users to the
> new API.
>   * Maybe switch to AttributeSources as the base class for TermsEnum,
> DocsEnum, PostingsEnum -- this would give readers API flexibility
> (not just index-file-format flexibility).  EG if someone wanted
> to store payload at the term-doc level instead of
> term-doc-position level, you could just add a new attribute.
>   * T
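
The nesting of the flex enumeration chain in the quoted description (fields, then terms, then docs, then positions) can be pictured with a toy in-memory structure. This is only a sketch: it uses nested sorted maps instead of the FieldProducer/TermsEnum/DocsEnum/PostingsEnum classes, whose actual signatures are not shown in this message, and every name in it is hypothetical:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy postings store: field -> term -> docID -> positions.
final class FlexWalkSketch {
    private final Map<String, Map<String, Map<Integer, List<Integer>>>> postings =
        new TreeMap<>();

    void add(String field, String term, int docID, int position) {
        postings.computeIfAbsent(field, f -> new TreeMap<>())
                .computeIfAbsent(term, t -> new TreeMap<>())
                .computeIfAbsent(docID, d -> new ArrayList<>())
                .add(position);
    }

    // Walks the structure in the same order as the enum chain:
    // each field, then each term, then each doc, then each position.
    List<String> walk() {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Map<String, Map<Integer, List<Integer>>>> field
                : postings.entrySet()) {
            for (Map.Entry<String, Map<Integer, List<Integer>>> term
                    : field.getValue().entrySet()) {
                for (Map.Entry<Integer, List<Integer>> doc
                        : term.getValue().entrySet()) {
                    for (int pos : doc.getValue()) {
                        out.add(field.getKey() + ":" + term.getKey()
                            + " doc=" + doc.getKey() + " pos=" + pos);
                    }
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        FlexWalkSketch sketch = new FlexWalkSketch();
        sketch.add("body", "lucene", 2, 7);
        sketch.add("body", "index", 1, 0);
        sketch.add("title", "lucene", 1, 0);
        sketch.walk().forEach(System.out::println);
    }
}
```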

MoreLikeThis Extension for documents that have tags

2009-09-03 Thread Thomas D'Silva
Hi,

I would like to contribute a class based on the MoreLikeThis class in
contrib/queries that generates a query based on the tags associated
with a document. The class assumes that documents are tagged with a
set of tags (which are stored in the index in a separate Field). The
class determines the top document terms associated with a given tag
using the information gain metric.

While generating a MoreLikeThis query for a document, the tags
associated with the document are used to determine the terms in the query.
This class is useful for finding similar documents to a document that
does not have many relevant terms but was tagged.
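
The attached class is not reproduced in this message, so as a point of reference here is a minimal sketch of the textbook information gain of a term with respect to a tag, computed from four document counts. It is an assumption that the contributed class uses this binary presence/absence form; the class and parameter names are hypothetical:

```java
// Information gain of "document contains term" about "document has tag",
// computed from a 2x2 table of document counts.
final class InfoGainSketch {

    // Entropy in bits of a two-outcome distribution given raw counts.
    static double entropy(double a, double b) {
        double total = a + b;
        if (total == 0) {
            return 0.0;
        }
        double h = 0.0;
        for (double count : new double[] {a, b}) {
            if (count > 0) {
                double p = count / total;
                h -= p * Math.log(p) / Math.log(2);
            }
        }
        return h;
    }

    // IG = H(tag) - [P(term) * H(tag | term) + P(no term) * H(tag | no term)]
    static double informationGain(long tagTerm, long tagNoTerm,
                                  long noTagTerm, long noTagNoTerm) {
        double n = tagTerm + tagNoTerm + noTagTerm + noTagNoTerm;
        if (n == 0) {
            return 0.0;
        }
        double hTag = entropy(tagTerm + tagNoTerm, noTagTerm + noTagNoTerm);
        double withTerm = tagTerm + noTagTerm;
        double withoutTerm = tagNoTerm + noTagNoTerm;
        double hTagGivenTerm =
            (withTerm / n) * entropy(tagTerm, noTagTerm)
            + (withoutTerm / n) * entropy(tagNoTerm, noTagNoTerm);
        return hTag - hTagGivenTerm;
    }

    public static void main(String[] args) {
        // A term that appears in exactly the tagged documents is maximally informative.
        System.out.println(informationGain(5, 0, 0, 5));
        // A term spread evenly across tagged and untagged documents tells us nothing.
        System.out.println(informationGain(5, 5, 5, 5));
    }
}
```

Ranking a tag's candidate terms by this score and keeping the top few would give the "top document terms associated with a given tag" that the post describes.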

I have attached the class and a test class and would appreciate any feedback.

Thank you,
Thomas

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Resolved: (LUCENE-1877) Use NativeFSLockFactory as default for new API (direct ctors & FSDir.open)

2009-09-03 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler resolved LUCENE-1877.
---

Resolution: Fixed

Committed revision: 811157

> Use NativeFSLockFactory as default for new API (direct ctors & FSDir.open)
> --
>
> Key: LUCENE-1877
> URL: https://issues.apache.org/jira/browse/LUCENE-1877
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Javadocs
>Reporter: Mark Miller
>Assignee: Uwe Schindler
> Fix For: 2.9
>
> Attachments: LUCENE-1877.patch, LUCENE-1877.patch, LUCENE-1877.patch, 
> LUCENE-1877.patch
>
>
> A user requested we add a note in IndexWriter alerting users to the availability of 
> NativeFSLockFactory (allowing you to avoid retaining locks on abnormal JVM 
> exit). Seems reasonable to me - we want users to be able to easily stumble 
> upon this class. The code below looks like a good spot to add a note - we could 
> also improve what's there a bit - opening an IndexWriter does not necessarily 
> create a lock file - that would depend on the LockFactory used.
> {code}
>   Opening an IndexWriter creates a lock file for the directory in use. Trying to open
>   another IndexWriter on the same directory will lead to a
>   {@link LockObtainFailedException}. The {@link LockObtainFailedException}
>   is also thrown if an IndexReader on the same directory is used to delete documents
>   from the index.
> {code}
> Anyone remember why NativeFSLockFactory is not the default over 
> SimpleFSLockFactory?
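
As background on why a native lock is not left behind after an abnormal JVM exit: NativeFSLockFactory relies on OS-level file locks obtained through java.nio, and the operating system drops such a lock when the owning process dies. The sketch below shows only that underlying JDK mechanism, not Lucene code; the class name, method name and lock file name are hypothetical:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Demonstrates an OS-level exclusive file lock via java.nio.
final class NativeLockSketch {

    // Tries to take and then release an exclusive lock on lockFile.
    // Returns false if another process currently holds the lock.
    // (If this same JVM already holds it, tryLock() throws
    // OverlappingFileLockException instead of returning null.)
    static boolean obtainAndRelease(Path lockFile) {
        try (FileChannel channel = FileChannel.open(
                lockFile, StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            FileLock lock = channel.tryLock();
            if (lock == null) {
                return false;
            }
            lock.release();
            return true;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path lockFile = Files.createTempFile("write", ".lock");
        try {
            System.out.println(obtainAndRelease(lockFile) ? "obtained" : "held elsewhere");
        } finally {
            Files.deleteIfExists(lockFile);
        }
    }
}
```

By contrast, a scheme that signals the lock purely by the existence of a file leaves that file on disk if the process is killed before it can delete it.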

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.





[jira] Resolved: (LUCENE-1885) NativeFSLockFactory.makeLock(...).isLocked() does not work

2009-09-03 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler resolved LUCENE-1885.
---

Resolution: Fixed

Committed revision: 811157

> NativeFSLockFactory.makeLock(...).isLocked() does not work
> --
>
> Key: LUCENE-1885
> URL: https://issues.apache.org/jira/browse/LUCENE-1885
> Project: Lucene - Java
>  Issue Type: Bug
>Reporter: Uwe Schindler
>Assignee: Uwe Schindler
>Priority: Blocker
> Fix For: 2.9
>
>
> IndexWriter.isLocked() or IndexReader.isLocked() do not work with 
> NativeFSLockFactory.
> The problem is that the method NativeFSLock.isLocked() just checks if the 
> same lock instance was locked before (lock != null). If the LockFactory 
> created a new lock instance, this always returns false, even if it's locked.
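
The failure mode in the report can be reproduced with a toy model. This is not Lucene's NativeFSLock and not the committed fix; it is a hypothetical class in which a static set stands in for state that every lock instance can see, so the two ways of answering "is it locked?" can be compared side by side:

```java
import java.util.HashSet;
import java.util.Set;

// Toy lock: contrasts an instance-local isLocked() check with one that
// consults state shared by all instances.
final class IsLockedSketch {
    // Stand-in for shared state (for a real file lock: the OS / file system).
    private static final Set<String> HELD = new HashSet<>();

    private final String name;
    private boolean obtainedHere; // true only on the instance that called obtain()

    IsLockedSketch(String name) {
        this.name = name;
    }

    boolean obtain() {
        synchronized (HELD) {
            if (!HELD.add(name)) {
                return false; // some other instance holds it
            }
            obtainedHere = true;
            return true;
        }
    }

    void release() {
        synchronized (HELD) {
            if (obtainedHere) {
                HELD.remove(name);
                obtainedHere = false;
            }
        }
    }

    // The behaviour described in the report: only this instance's own
    // history is checked, so a freshly created instance always says false.
    boolean isLockedInstanceOnly() {
        return obtainedHere;
    }

    // Checks the shared state, so a fresh instance sees a lock held elsewhere.
    boolean isLockedShared() {
        synchronized (HELD) {
            return HELD.contains(name);
        }
    }

    public static void main(String[] args) {
        IsLockedSketch writer = new IsLockedSketch("write.lock");
        writer.obtain();
        IsLockedSketch probe = new IsLockedSketch("write.lock");
        System.out.println("instance-only check: " + probe.isLockedInstanceOnly());
        System.out.println("shared-state check:  " + probe.isLockedShared());
        writer.release();
    }
}
```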




[jira] Commented: (LUCENE-1877) Use NativeFSLockFactory as default for new API (direct ctors & FSDir.open)

2009-09-03 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12751190#action_12751190
 ] 

Uwe Schindler commented on LUCENE-1877:
---

I will commit soon!

> Use NativeFSLockFactory as default for new API (direct ctors & FSDir.open)
> --
>
> Key: LUCENE-1877
> URL: https://issues.apache.org/jira/browse/LUCENE-1877
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Javadocs
>Reporter: Mark Miller
>Assignee: Uwe Schindler
> Fix For: 2.9
>
> Attachments: LUCENE-1877.patch, LUCENE-1877.patch, LUCENE-1877.patch, 
> LUCENE-1877.patch
>
>
> A user requested we add a note in IndexWriter alerting users to the availability of 
> NativeFSLockFactory (allowing you to avoid retaining locks on abnormal JVM 
> exit). Seems reasonable to me - we want users to be able to easily stumble 
> upon this class. The code below looks like a good spot to add a note - we could 
> also improve what's there a bit - opening an IndexWriter does not necessarily 
> create a lock file - that would depend on the LockFactory used.
> {code}
>   Opening an IndexWriter creates a lock file for the directory in use. Trying to open
>   another IndexWriter on the same directory will lead to a
>   {@link LockObtainFailedException}. The {@link LockObtainFailedException}
>   is also thrown if an IndexReader on the same directory is used to delete documents
>   from the index.
> {code}
> Anyone remember why NativeFSLockFactory is not the default over 
> SimpleFSLockFactory?




[jira] Resolved: (LUCENE-1884) javadocs cleanup

2009-09-03 Thread Hoss Man (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hoss Man resolved LUCENE-1884.
--

Resolution: Fixed
  Assignee: Robert Muir

i'm the last person in the world that should be reviewing a spell correction 
patch -- but nothing jumped out at me as being bad about the patch...

Committed revision 811070.

> javadocs cleanup
> 
>
> Key: LUCENE-1884
> URL: https://issues.apache.org/jira/browse/LUCENE-1884
> Project: Lucene - Java
>  Issue Type: Task
>  Components: Javadocs
>Reporter: Robert Muir
>Assignee: Robert Muir
>Priority: Minor
> Fix For: 2.9
>
> Attachments: LUCENE-1884.patch
>
>
> basic cleanup in core/contrib: typos, apache license header as javadoc, 
> missing periods that screw up package summary, etc.




[jira] Resolved: (LUCENE-1886) Improve Javadoc

2009-09-03 Thread Hoss Man (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hoss Man resolved LUCENE-1886.
--

Resolution: Fixed
  Assignee: Hoss Man

Committed revision 811060.

Thanks Bernd

> Improve Javadoc
> ---
>
> Key: LUCENE-1886
> URL: https://issues.apache.org/jira/browse/LUCENE-1886
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Analysis
>Reporter: Bernd Fondermann
>Assignee: Hoss Man
>Priority: Trivial
> Fix For: 2.9
>
> Attachments: javadoc.patch
>
>





[jira] Commented: (LUCENE-1879) Parallel incremental indexing

2009-09-03 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12751077#action_12751077
 ] 

Grant Ingersoll commented on LUCENE-1879:
-

Yes on the soft. grant.

> Parallel incremental indexing
> -
>
> Key: LUCENE-1879
> URL: https://issues.apache.org/jira/browse/LUCENE-1879
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Index
>Reporter: Michael Busch
>Assignee: Michael Busch
> Fix For: 3.1
>
>
> A new feature that allows building parallel indexes and keeping them in sync 
> on a docID level, independent of the choice of the MergePolicy/MergeScheduler.
> Find details on the wiki page for this feature:
> http://wiki.apache.org/lucene-java/ParallelIncrementalIndexing 
> Discussion on java-dev:
> http://markmail.org/thread/ql3oxzkob7aqf3jd




[jira] Created: (LUCENE-1889) FastVectorHighlighter: support for additional queries

2009-09-03 Thread Robert Muir (JIRA)
FastVectorHighlighter: support for additional queries
-

 Key: LUCENE-1889
 URL: https://issues.apache.org/jira/browse/LUCENE-1889
 Project: Lucene - Java
  Issue Type: Wish
  Components: contrib/*
Reporter: Robert Muir
Priority: Minor


I am using fastvectorhighlighter for some strange languages and it is working 
well! 

One thing i noticed immediately is that many query types are not highlighted 
(multitermquery, multiphrasequery, etc)
Here is one thing Michael M posted in the original ticket:

{quote}
I think a nice [eventual] model would be if we could simply re-run the
scorer on the single document (using InstantiatedIndex maybe, or
simply some sort of wrapper on the term vectors which are already a
mini-inverted-index for a single doc), but extend the scorer API to
tell us the exact term occurrences that participated in a match (which
I don't think is exposed today).
{quote}

Due to strange requirements I am using something similar to this (but 
specialized to our case).
I am doing strange things like forcing multitermqueries to rewrite into boolean 
queries so they will be highlighted,
and flattening multiphrasequeries into boolean or'ed phrasequeries.
I do not think these things would be 'fast', but i had a few ideas that might 
help:

* looking at contrib/highlighter, you can support FilteredQuery in flatten() by 
calling getQuery() right?
* maybe as a last resort, try Query.extractTerms() ?
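
The flattening idea in the first bullet can be sketched with stand-in query types. None of the classes below are Lucene classes: TermQ, BoolQ and FilteredQ are hypothetical miniatures, and only the getQuery() accessor name is borrowed from the suggestion above. The point is simply that a wrapper query is handled by recursing into the query it wraps:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy query tree and a flatten() that collects the terms to highlight.
final class FlattenSketch {
    interface Q {}

    static final class TermQ implements Q {
        final String text;
        TermQ(String text) { this.text = text; }
    }

    static final class BoolQ implements Q {
        final List<Q> clauses;
        BoolQ(Q... clauses) { this.clauses = Arrays.asList(clauses); }
    }

    // Wraps another query (the filter itself is omitted in this sketch).
    static final class FilteredQ implements Q {
        private final Q inner;
        FilteredQ(Q inner) { this.inner = inner; }
        Q getQuery() { return inner; }
    }

    static List<String> flatten(Q query) {
        List<String> out = new ArrayList<>();
        flatten(query, out);
        return out;
    }

    private static void flatten(Q query, List<String> out) {
        if (query instanceof TermQ) {
            out.add(((TermQ) query).text);
        } else if (query instanceof BoolQ) {
            for (Q clause : ((BoolQ) query).clauses) {
                flatten(clause, out);
            }
        } else if (query instanceof FilteredQ) {
            // Unwrap and keep going, as suggested for FilteredQuery.
            flatten(((FilteredQ) query).getQuery(), out);
        }
        // Any other query type is skipped, i.e. left unhighlighted.
    }

    public static void main(String[] args) {
        Q query = new BoolQ(
            new TermQ("fast"),
            new FilteredQ(new BoolQ(new TermQ("vector"), new TermQ("highlighter"))));
        System.out.println(flatten(query));
    }
}
```

The final else-less fall-through is where a last-resort step such as the extractTerms() idea in the second bullet would go.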





[jira] Resolved: (LUCENE-1878) remove deprecated classes from spatial

2009-09-03 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved LUCENE-1878.


Resolution: Fixed

> remove deprecated classes from spatial
> --
>
> Key: LUCENE-1878
> URL: https://issues.apache.org/jira/browse/LUCENE-1878
> Project: Lucene - Java
>  Issue Type: Task
>  Components: contrib/spatial
>Reporter: Mark Miller
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.9
>
> Attachments: LUCENE-1878.patch, LUCENE-1878.patch
>
>
> spatial has not been released, so we can remove the deprecated classes




[jira] Created: (LUCENE-1888) Provide Option to Store Payloads on the Term Vector

2009-09-03 Thread Grant Ingersoll (JIRA)
Provide Option to Store Payloads on the Term Vector
---

 Key: LUCENE-1888
 URL: https://issues.apache.org/jira/browse/LUCENE-1888
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
 Fix For: 3.0, 3.1


Would be nice to have the option to access the payloads in a document-centric 
way by adding them to the Term Vectors.  Naturally, this makes the Term Vectors 
bigger, but it may be just what one needs.




[jira] Resolved: (LUCENE-1876) Some contrib packages are missing a package.html

2009-09-03 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir resolved LUCENE-1876.
-

Resolution: Fixed

Committed revision 810923.

thanks Steven!


> Some contrib packages are missing a package.html
> 
>
> Key: LUCENE-1876
> URL: https://issues.apache.org/jira/browse/LUCENE-1876
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/*
>Reporter: Mark Miller
>Assignee: Robert Muir
>Priority: Trivial
> Fix For: 2.9
>
> Attachments: collation-package.html, LUCENE-1876.patch, 
> LUCENE-1876.patch
>
>
> Dunno if we will get to this one this release, but a few contribs don't have 
> a package.html (or a good overview that would work as a replacement) - I 
> don't think this is hugely important, but I think it is important - you 
> should be able to easily and quickly read a quick overview for each contrib I 
> think.
> So far I have identified collation and spatial.
