JDBC access to a Lucene index

2009-10-16 Thread Jukka Zitting
Hi,

Some while ago I implemented a simple JDBC to JCR bridge [1] that
allows one to query a JCR repository from any JDBC client, most
notably various reporting tools.

Now I'm wondering if something similar already exists for a normal
Lucene index. Something that would treat your entire index as one huge
table (or perhaps a set of tables based on some document type field)
and would allow you to use simple SQL SELECTs to query data.
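For concreteness, a minimal sketch of that mapping (hypothetical code, not an
existing bridge, Lucene 2.9-era API): each document is a row, each stored
field a column, and an equality WHERE clause becomes a TermQuery.

import java.io.File;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.FSDirectory;

public class SqlishLuceneBridge {
    // Emulates: SELECT <column> FROM index WHERE <field> = '<value>' LIMIT <n>
    public static void select(String indexDir, String column,
                              String field, String value, int n) throws Exception {
        IndexSearcher searcher = new IndexSearcher(FSDirectory.open(new File(indexDir)));
        ScoreDoc[] rows = searcher.search(new TermQuery(new Term(field, value)), n).scoreDocs;
        for (ScoreDoc row : rows) {
            Document doc = searcher.doc(row.doc);
            System.out.println(doc.get(column)); // a "column" is a stored field
        }
        searcher.close();
    }
}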

Any pointers would be welcome.

[1] http://dev.day.com/microsling/content/blogs/main/jdbc2jcr.html

BR,

Jukka Zitting




[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing

2009-10-16 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12766481#action_12766481
 ] 

Michael McCandless commented on LUCENE-1458:


OK, thanks for addressing the new nocommits -- you wanna remove them & commit as 
you find/comment on them?  Can be our means of communicating through the branch 
:)

For now, I don't think we need to explore improvements to the TermInfo cache 
(starting @ smaller size, simplistic double barrel LRU cache) -- we can simply 
mimic trunk for now; such improvements are orthogonal here.  Maybe switch those 
nocommits to TODOs instead?
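
(For readers who haven't seen it, a minimal single-threaded sketch of the
"double barrel" idea mentioned above -- two generations of entries, swapped
when the young one fills, approximating LRU with no per-access bookkeeping.
An illustration only, not Lucene's actual concurrent implementation.)

{code}
import java.util.HashMap;
import java.util.Map;

public class DoubleBarrelCache<K, V> {
  private final int maxSize;
  private Map<K, V> primary = new HashMap<K, V>();   // young barrel
  private Map<K, V> secondary = new HashMap<K, V>(); // old barrel

  public DoubleBarrelCache(int maxSize) { this.maxSize = maxSize; }

  public V get(K key) {
    V value = primary.get(key);
    if (value == null) {
      value = secondary.get(key);
      if (value != null) put(key, value); // promote into the young barrel
    }
    return value;
  }

  public void put(K key, V value) {
    if (primary.size() >= maxSize) {  // young barrel full: it becomes the old
      secondary = primary;            // barrel; the previous old barrel (the
      primary = new HashMap<K, V>();  // least recently touched entries) is dropped
    }
    primary.put(key, value);
  }
}
{code}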

bq. Hmm - I'm still getting the heap space issue I think

Sigh.  I think we have more work to do to scale down RAM used by IndexReader 
for a smallish index.

 Further steps towards flexible indexing
 ---

 Key: LUCENE-1458
 URL: https://issues.apache.org/jira/browse/LUCENE-1458
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Affects Versions: 2.9
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Attachments: LUCENE-1458-back-compat.patch, 
 LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, 
 LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, 
 LUCENE-1458-back-compat.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
 LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
 LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
 LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.tar.bz2, 
 LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, 
 LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2


 I attached a very rough checkpoint of my current patch, to get early
 feedback.  All tests pass, though back compat tests don't pass due to
 changes to package-private APIs plus certain bugs in tests that
 happened to work (eg call TermPositions.nextPosition() too many times,
 which the new API asserts against).
 [Aside: I think, when we commit changes to package-private APIs such
 that back-compat tests don't pass, we could go back, make a branch on
 the back-compat tag, commit changes to the tests to use the new
 package private APIs on that branch, then fix nightly build to use the
 tip of that branch?]
 There's still plenty to do before this is committable! This is a
 rather large change:
   * Switches to a new more efficient terms dict format.  This still
 uses tii/tis files, but the tii only stores term & long offset
 (not a TermInfo).  At seek points, tis encodes term & freq/prox
 offsets absolutely instead of with deltas.  Also, tis/tii
 are structured by field, so we don't have to record field number
 in every term.
 .
 On first 1 M docs of Wikipedia, tii file is 36% smaller (0.99 MB
 -> 0.64 MB) and tis file is 9% smaller (75.5 MB -> 68.5 MB).
 .
 RAM usage when loading terms dict index is significantly less
 since we only load an array of offsets and an array of String (no
 more TermInfo array).  It should be faster to init too.
 .
 This part is basically done.
   * Introduces modular reader codec that strongly decouples terms dict
 from docs/positions readers.  EG there is no more TermInfo used
 when reading the new format.
 .
 There's nice symmetry now between reading & writing in the codec
 chain -- the current docs/prox format is captured in:
 {code}
 FormatPostingsTermsDictWriter/Reader
 FormatPostingsDocsWriter/Reader (.frq file) and
 FormatPostingsPositionsWriter/Reader (.prx file).
 {code}
 This part is basically done.
   * Introduces a new flex API for iterating through the fields,
 terms, docs and positions:
 {code}
 FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum
 {code}
 This replaces TermEnum/Docs/Positions.  SegmentReader emulates the
 old API on top of the new API to keep back-compat.
 
 Next steps:
   * Plug in new codecs (pulsing, pfor) to exercise the modularity /
 fix any hidden assumptions.
   * Expose new API out of IndexReader, deprecate old API but emulate
 old API on top of new one, switch all core/contrib users to the
 new API.
   * Maybe switch to AttributeSources as the base class for TermsEnum,
 DocsEnum, PostingsEnum -- this would give readers API flexibility
 (not just index-file-format flexibility).  EG if someone wanted
 to store payload at the term-doc level instead of
 term-doc-position level, you could just add a new attribute.
   * Test performance & iterate.
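
(For contrast, a sketch of the pre-flex iteration the new chain replaces --
Lucene 2.9 API: one TermEnum across all fields, with TermDocs re-seeked per
term. The flex chain restructures this per field.)

{code}
import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.index.TermEnum;

class PreFlexScan {
  static void scan(IndexReader reader) throws IOException {
    TermEnum terms = reader.terms();
    TermDocs docs = reader.termDocs();
    try {
      while (terms.next()) {
        Term term = terms.term();
        docs.seek(term);            // re-position the postings for each term
        while (docs.next()) {
          int docId = docs.doc();   // matching document
          int freq = docs.freq();   // within-document term frequency
        }
      }
    } finally {
      terms.close();
      docs.close();
    }
  }
}
{code}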


[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing

2009-10-16 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12766482#action_12766482
 ] 

Michael McCandless commented on LUCENE-1458:


bq. you wanna remove them & commit as you find/comment on them?

Woops, I see you already did!  Thanks.




Re: lucene 2.9 sorting algorithm

2009-10-16 Thread Michael McCandless
Thanks John; I'll have a look.

Mike
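
(An illustrative sketch of the multi-PQ idea discussed in the quoted thread
below -- one small priority queue per segment, comparing raw per-segment
field values, merged after collection. These are not John's actual classes;
ints are used because raw int values compare directly across segments, which
is what makes the per-segment queues easy to merge -- string sorts, which
compare per-segment ordinals, are the harder case.)

import java.io.IOException;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.FieldCache;
import org.apache.lucene.search.Scorer;

class MultiPQIntCollector extends Collector {
  // Max-heap on the field value, so peek() is the worst (largest) kept entry.
  static final Comparator<int[]> BY_VALUE_DESC = new Comparator<int[]>() {
    public int compare(int[] a, int[] b) {
      return a[1] < b[1] ? 1 : (a[1] > b[1] ? -1 : 0);
    }
  };

  private final String field;
  private final int numHits;
  private final List<PriorityQueue<int[]>> perSegment = new ArrayList<PriorityQueue<int[]>>();
  private PriorityQueue<int[]> queue; // queue for the current segment
  private int[] values;               // FieldCache values for the current segment
  private int docBase;

  MultiPQIntCollector(String field, int numHits) {
    this.field = field;
    this.numHits = numHits;
  }

  public void setScorer(Scorer scorer) {}         // field sort ignores score
  public boolean acceptsDocsOutOfOrder() { return true; }

  public void setNextReader(IndexReader reader, int docBase) throws IOException {
    this.docBase = docBase;
    this.values = FieldCache.DEFAULT.getInts(reader, field);
    this.queue = new PriorityQueue<int[]>(numHits, BY_VALUE_DESC);
    perSegment.add(queue);
  }

  public void collect(int doc) {
    // Keep the numHits smallest values per segment (ascending sort).
    if (queue.size() < numHits) {
      queue.add(new int[] { docBase + doc, values[doc] });
    } else if (values[doc] < queue.peek()[1]) {
      queue.poll();
      queue.add(new int[] { docBase + doc, values[doc] });
    }
  }

  // After collection: merge the per-segment queues into the global top
  // numHits, returned as {docId, value} pairs in ascending value order.
  public int[][] topDocs() {
    PriorityQueue<int[]> merged = new PriorityQueue<int[]>(numHits, BY_VALUE_DESC);
    for (PriorityQueue<int[]> q : perSegment) {
      for (int[] e : q) {
        if (merged.size() < numHits) merged.add(e);
        else if (e[1] < merged.peek()[1]) { merged.poll(); merged.add(e); }
      }
    }
    int[][] result = new int[merged.size()][];
    for (int i = result.length - 1; i >= 0; i--) result[i] = merged.poll();
    return result;
  }
}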

On Fri, Oct 16, 2009 at 12:57 AM, John Wang john.w...@gmail.com wrote:
 Hi Michael:
     I added classes: ScoreDocComparatorQueue and OneSortNoScoreCollector as
 a more general case. I think keeping the old api for ScoreDocComparator and
 SortComparatorSource would work.
   Please take a look.
 Thanks
 -John

 On Thu, Oct 15, 2009 at 6:52 PM, John Wang john.w...@gmail.com wrote:

 Hi Michael:
      It is open, http://code.google.com/p/lucene-book/source/checkout
      I think I sent the https url instead, sorry.
     The multi PQ sorting is fairly self-contained, I have 2 versions, 1
 for string and 1 for int, each are Collector impls.
       I shouldn't say the Multi Q is faster on int sort, it is within the
 error boundary. The diff is very very small, I would say they are more
 equal.
      If you think it is a good thing to go this way, (if not for the perf,
 just for the simpler api) I'd be happy to work on a patch.
 Thanks
 -John
 On Thu, Oct 15, 2009 at 5:18 PM, Michael McCandless
 luc...@mikemccandless.com wrote:

 John, looks like this requires login -- any plans to open that up, or,
 post the code on an issue?

 How self-contained is your Multi PQ sorting?  EG is it a standalone
 Collector impl that I can test?

 Mike

 On Thu, Oct 15, 2009 at 6:33 PM, John Wang john.w...@gmail.com wrote:
  BTW, we have a little sandbox for these experiments. All my
  test code is at the URL below. It's not very polished.
 
  https://lucene-book.googlecode.com/svn/trunk
 
  -John
 
  On Thu, Oct 15, 2009 at 3:29 PM, John Wang john.w...@gmail.com wrote:
 
  Numbers Mike requested for Int types:
 
  only the time/cputime are posted, others are all the same since the
  algorithm is the same.
 
  Lucene 2.9:
  numhits: 10
  time: 14619495
  cpu: 146126
 
  numhits: 20
  time: 14550568
  cpu: 163242
 
  numhits: 100
  time: 16467647
  cpu: 178379
 
 
  my test:
  numHits: 10
  time: 14101094
  cpu: 144715
 
  numHits: 20
  time: 14804821
  cpu: 151305
 
  numHits: 100
  time: 15372157
  cpu time: 158842
 
  Conclusions:
  They are very similar; the differences are all within error bounds,
  especially at lower PQ sizes, where the second sort alg is again slightly
  faster.
 
  Hope this helps.
 
  -John
 
 
  On Thu, Oct 15, 2009 at 3:04 PM, Yonik Seeley
  yo...@lucidimagination.com
  wrote:
 
  On Thu, Oct 15, 2009 at 5:33 PM, Michael McCandless
  luc...@mikemccandless.com wrote:
   Though it'd be odd if the switch to searching by segment
   really was most of the gains here.
 
  I had assumed that much of the improvement was due to ditching
  MultiTermEnum/MultiTermDocs.
  Note that LUCENE-1483 was before LUCENE-1596... but that only helps
  with queries that use a TermEnum (range, prefix, etc).
 
  -Yonik
  http://www.lucidimagination.com
 
 
 
 
 









Re: search through single pdf document - return page number

2009-10-16 Thread IvanDrago

Hey! I did it! Eric and Robert, you helped a lot. Thanks!

I didn't use LucenePDFDocument. I created a new document for every page in a
PDF document and added page number info for every page.

// (Imports reconstructed for context; the PDFBox package prefix depends on
// your PDFBox version -- org.pdfbox.* for the pre-Apache releases.)
import java.util.Date;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;

// assumes: File f = the PDF to index; String index_dir = path to the index
PDDocument pddDocument = PDDocument.load(f);
PDFTextStripper textStripper = new PDFTextStripper();

IndexWriter iwriter = new IndexWriter(index_dir, new StandardAnalyzer(), true);

long start = new Date().getTime();

// 350 pages just for test
for (int i = 1; i < 350; i++) {
    textStripper.setStartPage(i);
    textStripper.setEndPage(i);

    // fetch one page
    String pagecontent = textStripper.getText(pddDocument);
    System.out.println("pagecontent: " + pagecontent);

    if (pagecontent != null) {
        System.out.println("i= " + i);
        Document doc = new Document();

        // Add the page number (stored, so it can be read back from hits)
        doc.add(new Field("pagenumber", Integer.toString(i),
                Field.Store.YES, Field.Index.ANALYZED));
        // Page text is indexed for search but not stored
        doc.add(new Field("content", pagecontent,
                Field.Store.NO, Field.Index.ANALYZED));

        iwriter.addDocument(doc);
    }
}

// Optimize and close the writer to finish building the index
iwriter.optimize();
iwriter.close();

long end = new Date().getTime();

System.out.println("Indexing files took " + (end - start) + " milliseconds");

// just for test I searched for the string "cryptography"
String q = "cryptography";

Directory fsDir = FSDirectory.getDirectory(index_dir, false);
IndexSearcher ind_searcher = new IndexSearcher(fsDir);

// Build a Query object
QueryParser parser = new QueryParser("content", new StandardAnalyzer());
Query query = parser.parse(q);

// Search for the query
Hits hits = ind_searcher.search(query);

// Examine the Hits object to see if there were any matches
int hitCount = hits.length();
if (hitCount == 0) {
    System.out.println("No matches were found for \"" + q + "\"");
} else {
    System.out.println("Hits for \"" + q + "\" were found in pages:");

    // Iterate over the Documents in the Hits object and print the stored
    // "pagenumber" field of each match
    for (int i = 0; i < hitCount; i++) {
        Document doc = hits.doc(i);
        System.out.println("  " + (i + 1) + ". " + doc.get("pagenumber"));
    }
}
ind_searcher.close();


I'm using Lucene version 2.9.0.
You said that Hits is deprecated. Should I use HitCollector instead?
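
(A note on that: HitCollector is deprecated in 2.9 as well; the non-deprecated
paths are the TopDocs-returning search methods or a custom Collector. A
minimal sketch, reusing ind_searcher and query from the code above; TopDocs
and ScoreDoc live in org.apache.lucene.search:

TopDocs top = ind_searcher.search(query, 10);   // top 10 hits
for (ScoreDoc sd : top.scoreDocs) {
    Document doc = ind_searcher.doc(sd.doc);
    System.out.println(doc.get("pagenumber"));
}
)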

Another question came to my mind... What if I want to add another PDF
document to the search pool? Before searching I would like to specify which
PDF document to search, and then return the page number for the searched
string. I could create an index for every document that I add to the search
pool, but that doesn't sound good to me. Can you think of a better way to do that?


Erick Erickson wrote:
 
 Your search would be on the contents field if you use LucenePDFDocument.
 
 But on a quick look, LucenePDFDocument doesn't give you any page
 information. So, you'd have to collect that somehow, but I don't see a
 clear
 way to.
 
 Doing it manually, you could do something like:
 
 Document doc = new Document();
 for (each page in the document) {
   doc.add(contents, text for page);
   record the offset of the last term in the page you just indexed);
 }
 doc.add(metadata, string representation of the page offsets);
 iw.addDocument(doc);
 
 Now, when you search you can get the offsets of the matching term,
 then look in your metadata field for the page number.
 
 Perhaps you could use the LucenePDFDocument in conjunction with this
 somehow, but I confess that I've never used it so it's not clear to me how
 you'd do this.
 
 Incidentally, the Hits object is deprecated, what version of Lucene are
 you intending to use?
 
 Best
 Erick
 
 On Thu, Oct 15, 2009 at 10:43 AM, IvanDrago idrag...@gmail.com wrote:
 

 Thanks for the reply Erick.

 I would like to permanently index this content and search it
 multiple times so I would like a permanent copy and I want to search for
 different terms multiple
 times.

 My problem is that I don't know how to retrieve a page number where the
 searched string was found so
 if you 

Re: search through single pdf document - return page number

2009-10-16 Thread Erick Erickson
Glad things are progressing. The only problem here will be proximity queries
that span pages. Say, the last word on page 10 is
"salmon" and the first word on page 11 is "fishing". Structuring
your index this way won't find a proximity search for "salmon fishing".

If that's not a concern, then there's no reason to complexify the
situation.

FWIW
Erick
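
(To make that concrete, a small sketch using the example's words:

PhraseQuery phrase = new PhraseQuery();
phrase.add(new Term("content", "salmon"));
phrase.add(new Term("content", "fishing"));
phrase.setSlop(2);
// With one document per page, "salmon" ends the page-10 document and
// "fishing" starts the page-11 one, so the two terms never co-occur in a
// single document and this query matches nothing.
)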


Re: search through single pdf document - return page number

2009-10-16 Thread IvanDrago

Proximity queries that span pages are not a concern in my case.

I asked another question at the bottom of my last post. Could you comment on
it if you have some ideas?



Re: search through single pdf document - return page number

2009-10-16 Thread Erick Erickson
Well, you have to add another field to each document identifying the PDF it
came from. From there, restricting to that doc just becomes
adding an AND clause. Of course, how you specify these is an
exercise left to the reader <G>.

Erick
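
(A sketch of the index-time half of that suggestion -- the field name and the
file name here are made-up examples. Inside the per-page loop from the code
earlier in this thread, tag each page-document with its source PDF;
NOT_ANALYZED keeps the name as one exact token:

doc.add(new Field("filename", "crypto-book.pdf",
        Field.Store.YES, Field.Index.NOT_ANALYZED));

The AND clause at search time then matches on this field.)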


Re: search through single pdf document - return page number

2009-10-16 Thread IvanDrago

Yes, I thought of that too, but I didn't know if I could restrict the search
to index documents that have a specific field value. After some research I
found a way to do that:

String q = "title:ant";
Query query = parser.parse(q);

title:ant - matches documents containing the term "ant" in the title field

Regards,
Ivan
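
(One refinement: the query-string form works for analyzed fields, but for an
exact identifier field a programmatically built query avoids the parser
re-analyzing the value. A sketch, assuming the NOT_ANALYZED "filename" field
suggested earlier in this thread, and reusing parser and ind_searcher from
the earlier code:

BooleanQuery restricted = new BooleanQuery();
restricted.add(parser.parse("cryptography"), BooleanClause.Occur.MUST);
restricted.add(new TermQuery(new Term("filename", "crypto-book.pdf")),
        BooleanClause.Occur.MUST);
TopDocs top = ind_searcher.search(restricted, 10);

TermQuery bypasses the analyzer, so the value must exactly match the indexed
token.)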



[jira] Updated: (LUCENE-1984) DisjunctionMaxQuery - Type safety

2009-10-16 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-1984:
--

Attachment: LUCENE-1984.patch

Small updates in Patch (also implemented Iterable). I also generified the other 
Disjunction classes.

Will commit soon. Thanks Kay Kay!
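
(For illustration, what the generified class lets callers write -- assuming,
per the comment above, that DisjunctionMaxQuery now implements Iterable over
its typed disjuncts:)

{code}
DisjunctionMaxQuery dmq = new DisjunctionMaxQuery(0.1f); // 0.1 tie-breaker
dmq.add(new TermQuery(new Term("title", "lucene")));
dmq.add(new TermQuery(new Term("body", "lucene")));
for (Query disjunct : dmq) {  // typed iteration, no casts needed
  System.out.println(disjunct);
}
{code}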

 DisjunctionMaxQuery - Type safety  
 ---

 Key: LUCENE-1984
 URL: https://issues.apache.org/jira/browse/LUCENE-1984
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Query/Scoring
Affects Versions: 2.9
Reporter: Kay Kay
Assignee: Uwe Schindler
 Fix For: 3.0

 Attachments: LUCENE-1984.patch, LUCENE-1984.patch


 DisjunctionMaxQuery code has containers that are not type-safe. The comments
 indicate type-safety though.
 Better to express in the API and the internals the explicit type as opposed
 to type-less containers.
 Patch attached.
 Comments / backward compatibility concerns welcome.




[jira] Resolved: (LUCENE-1984) DisjunctionMaxQuery - Type safety

2009-10-16 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler resolved LUCENE-1984.
---

Resolution: Fixed

Committed revision: 825881

Thanks Kay Kay!




[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing

2009-10-16 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12766562#action_12766562
 ] 

Mark Miller commented on LUCENE-1458:
-

just committed an initial stab at pulsing cache support - could prob use your 
love again ;)

Oddly, the reopen test passed no problem and this adds more to the cache - 
perhaps I was seeing a ghost last night ...

I'll know before too long.




[jira] Closed: (LUCENE-1984) DisjunctionMaxQuery - Type safety

2009-10-16 Thread Kay Kay (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Kay closed LUCENE-1984.
---


Thanks Uwe. The revised patch looks good as well, with better code readability. 




[jira] Created: (LUCENE-1985) DisjunctionMaxQuery - Iterator code to for (A a : container) construct

2009-10-16 Thread Kay Kay (JIRA)
DisjunctionMaxQuery - Iterator code to for (A a : container) construct
---

 Key: LUCENE-1985
 URL: https://issues.apache.org/jira/browse/LUCENE-1985
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Kay Kay
Priority: Trivial


For better readability - converting the Iterable<T> iterator code to
for (A a : container) constructs, which are more intuitive to read.
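
(An illustration of the conversion; variable names are made up:)

{code}
// Before: explicit Iterator
for (Iterator<Query> it = disjuncts.iterator(); it.hasNext();) {
  Query q = it.next();
  System.out.println(q);
}

// After: Java 5 for-each over the same Iterable
for (Query q : disjuncts) {
  System.out.println(q);
}
{code}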




[jira] Created: (LUCENE-1986) NPE in NearSpansUnordered from PayloadNearQuery

2009-10-16 Thread Peter Keegan (JIRA)
NPE in NearSpansUnordered from PayloadNearQuery
---

 Key: LUCENE-1986
 URL: https://issues.apache.org/jira/browse/LUCENE-1986
 Project: Lucene - Java
  Issue Type: Bug
  Components: Search
Affects Versions: 2.9
Reporter: Peter Keegan
 Attachments: TestPayloadNearQuery1.java

The following query causes an NPE in NearSpansUnordered, and is reproducible
with the attached unit test. The failure occurs on the last document scored.





[jira] Updated: (LUCENE-1985) DisjunctionMaxQuery - Iterator code to for (A a : container) construct

2009-10-16 Thread Kay Kay (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Kay updated LUCENE-1985:


Attachment: LUCENE-1985.patch




[jira] Updated: (LUCENE-1986) NPE in NearSpansUnordered from PayloadNearQuery

2009-10-16 Thread Peter Keegan (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Keegan updated LUCENE-1986:
-

Attachment: TestPayloadNearQuery1.java

Unit test that causes NPE




[jira] Commented: (LUCENE-1984) DisjunctionMaxQuery - Type safety

2009-10-16 Thread Kay Kay (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12766587#action_12766587
 ] 

Kay Kay commented on LUCENE-1984:
-

As a related patch - LUCENE-1985 added to improve readability, converting
Iterable<?> iterator statements to the for-each loops introduced in Java 5.




[jira] Updated: (LUCENE-1124) short circuit FuzzyQuery.rewrite when input token length is small compared to minSimilarity

2009-10-16 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-1124:
---

Attachment: LUCENE-1124.patch

Attached patch (based on 2.9) showing the bug, along with the fix.  Instead of
rewriting to an empty BooleanQuery when the prefix term is not long enough, I
rewrite to a TermQuery with that prefix.  This way the exact term matches.

I'll commit shortly to trunk & 2.9.x.
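
(The shape of the fix as described -- a simplified sketch that ignores
prefixLength, not the committed code:)

{code}
import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.FuzzyQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

class ShortCircuitFuzzyQuery extends FuzzyQuery {
  ShortCircuitFuzzyQuery(Term term, float minSimilarity) {
    super(term, minSimilarity);
  }

  public Query rewrite(IndexReader reader) throws IOException {
    Term term = getTerm();
    // One edit on a token of length n costs 1/n of similarity, so when
    // n <= 1/(1 - minSimilarity) no non-exact term can qualify: skip the
    // expensive enumeration, but still match the exact term itself.
    if (term.text().length() <= 1f / (1f - getMinSimilarity())) {
      return new TermQuery(term);
    }
    return super.rewrite(reader);
  }
}
{code}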

 short circuit FuzzyQuery.rewrite when input token length is small compared to 
 minSimilarity
 ---

 Key: LUCENE-1124
 URL: https://issues.apache.org/jira/browse/LUCENE-1124
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Query/Scoring
Reporter: Hoss Man
Assignee: Mark Miller
Priority: Trivial
 Fix For: 2.9

 Attachments: LUCENE-1124.patch, LUCENE-1124.patch, LUCENE-1124.patch, 
 LUCENE-1124.patch


 I found this (unreplied to) email floating around in my Lucene folder from 
 during the holidays...
 {noformat}
 From: Timo Nentwig
 To: java-dev
 Subject: Fuzzy makes no sense for short tokens
 Date: Mon, 31 Dec 2007 16:01:11 +0100
 Message-Id: 200712311601.12255.luc...@nitwit.de
 Hi!
 it generally makes no sense to search fuzzy for short tokens because changing
 even only a single character of course already results in a high edit
 distance. So it actually only makes sense in this case:
 if( token.length() > 1f / (1f - minSimilarity) )
 E.g. changing one character in a 3-letter token ("foo") results in an edit
 distance of 0.6. And if minSimilarity (which is by default: 0.5 :-) is higher
 we can save all the expensive rewrite() logic.
 {noformat}
 I don't know much about FuzzyQueries, but this reasoning seems sound ...
 FuzzyQuery.rewrite should be able to completely skip all TermEnumeration in
 the event that the input token is shorter than some simple math on the
 minSimilarity.  (i'm not smart enough to be certain that the math above is
 right however ... it's been a while since i looked at Levenshtein distances
 ... tests needed)




[jira] Reopened: (LUCENE-1124) short circuit FuzzyQuery.rewrite when input token length is small compared to minSimilarity

2009-10-16 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless reopened LUCENE-1124:



This fix breaks the case when the exact term is present in the index.




[jira] Resolved: (LUCENE-1985) DisjunctionMaxQuery - Iterator code to for ( A a : container ) construct

2009-10-16 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler resolved LUCENE-1985.
---

   Resolution: Fixed
Fix Version/s: 3.0
 Assignee: Uwe Schindler

Committed revision: 825989

Thanks Kay Kay! For further Java5 fixes, just add it to LUCENE-1257.




Re: lucene 2.9 sorting algorithm

2009-10-16 Thread John Wang
Mike, just a clarification on my first perf report email:
the first section's numHits is incorrectly labeled; it should be 20 instead
of 50. Sorry about the possible confusion.

Thanks

-John




Re: lucene 2.9 sorting algorithm

2009-10-16 Thread Michael McCandless
Oh, no problem...

Mike

On Fri, Oct 16, 2009 at 12:33 PM, John Wang john.w...@gmail.com wrote:
 Mike, just a clarification on my first perf report email.
 In the first section, numHits is incorrectly labeled; it should be 20
 instead of 50. Sorry about the possible confusion.
 Thanks
 -John




-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



ant build-contrib fails on trunk?

2009-10-16 Thread Michael McCandless
When I run ant build-contrib on current trunk, I hit this:

compile-core:
[javac] Compiling 1 source file to
/lucene/tmp2/build/contrib/instantiated/classes/java
[javac] 
/lucene/tmp2/contrib/instantiated/src/java/org/apache/lucene/store/instantiated/InstantiatedTermDocumentInformation.java:48:
compareTo(org.apache.lucene.index.Term) in
org.apache.lucene.index.Term cannot be applied to
(org.apache.lucene.store.instantiated.InstantiatedTerm)
[javac]   return
instantiatedTermDocumentInformation.getTerm().getTerm().compareTo(instantiatedTermDocumentInformation1.getTerm());
[javac] ^
[javac] 1 error


Is anyone else seeing this?

Mike

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1124) short circuit FuzzyQuery.rewrite when input token length is small compared to minSimilarity

2009-10-16 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-1124:
---

Fix Version/s: (was: 2.9)
   3.0
   2.9.1

 short circuit FuzzyQuery.rewrite when input token length is small compared to 
 minSimilarity
 ---

 Key: LUCENE-1124
 URL: https://issues.apache.org/jira/browse/LUCENE-1124
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Query/Scoring
Reporter: Hoss Man
Assignee: Mark Miller
Priority: Trivial
 Fix For: 2.9.1, 3.0

 Attachments: LUCENE-1124.patch, LUCENE-1124.patch, LUCENE-1124.patch, 
 LUCENE-1124.patch


 I found this (unreplied to) email floating around in my Lucene folder from 
 during the holidays...
 {noformat}
 From: Timo Nentwig
 To: java-dev
 Subject: Fuzzy makes no sense for short tokens
 Date: Mon, 31 Dec 2007 16:01:11 +0100
 Message-Id: 200712311601.12255.luc...@nitwit.de
 Hi!
 it generally makes no sense to search fuzzy for short tokens because changing
 even only a single character of course already results in a high edit
 distance. So it actually only makes sense in this case:
if( token.length() > 1f / (1f - minSimilarity) )
 E.g. changing one character in a 3-letter token (foo) results in an edit
 distance of 0.6. And if minSimilarity (which is by default: 0.5 :-) is higher
 we can save all the expensive rewrite() logic.
 {noformat}
 I don't know much about FuzzyQueries, but this reasoning seems sound ...
 FuzzyQuery.rewrite should be able to completely skip all TermEnumeration in
 the event that the input token is shorter than some simple math on the
 minSimilarity.  (I'm not smart enough to be certain that the math above is
 right, however ... it's been a while since I looked at Levenshtein distances
 ... tests needed)
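
A minimal sketch of the short circuit being proposed (assuming the usual
normalized similarity, similarity = 1 - editDistance/length; the helper name
is hypothetical and this is not the attached patch):

{code}
class FuzzyShortCircuitSketch {
  // One edit on a token of length n leaves similarity 1 - 1/n, which can
  // only stay above minSimilarity when n > 1 / (1 - minSimilarity).
  // Below that length, rewrite() can skip term enumeration entirely.
  static boolean worthRewriting(String token, float minSimilarity) {
    return token.length() > 1f / (1f - minSimilarity);
  }
}
{code}

With the default minSimilarity of 0.5 this would skip the rewrite for one- and
two-character tokens, since 1 / (1 - 0.5) = 2.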

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



RE: ant build-contrib fails on trunk?

2009-10-16 Thread Uwe Schindler
I'll fix it; this is because of generics and compareTo(). I'll revert the change.

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de


 -Original Message-
 From: Michael McCandless [mailto:luc...@mikemccandless.com]
 Sent: Friday, October 16, 2009 7:01 PM
 To: java-dev@lucene.apache.org
 Subject: ant build-contrib fails on trunk?
 
 When I run ant build-contrib on current trunk, I hit this:
 
 compile-core:
 [javac] Compiling 1 source file to
 /lucene/tmp2/build/contrib/instantiated/classes/java
 [javac]
 /lucene/tmp2/contrib/instantiated/src/java/org/apache/lucene/store/instant
 iated/InstantiatedTermDocumentInformation.java:48:
 compareTo(org.apache.lucene.index.Term) in
 org.apache.lucene.index.Term cannot be applied to
 (org.apache.lucene.store.instantiated.InstantiatedTerm)
 [javac]   return
 instantiatedTermDocumentInformation.getTerm().getTerm().compareTo(instanti
 atedTermDocumentInformation1.getTerm());
 [javac] ^
 [javac] 1 error
 
 
 Is anyone else seeing this?
 
 Mike
 
 -
 To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-dev-h...@lucene.apache.org



-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: ant build-contrib fails on trunk?

2009-10-16 Thread Robert Muir
yes, not just you

On Fri, Oct 16, 2009 at 1:00 PM, Michael McCandless 
luc...@mikemccandless.com wrote:

 When I run ant build-contrib on current trunk, I hit this:

 compile-core:
[javac] Compiling 1 source file to
 /lucene/tmp2/build/contrib/instantiated/classes/java
[javac]
 /lucene/tmp2/contrib/instantiated/src/java/org/apache/lucene/store/instantiated/InstantiatedTermDocumentInformation.java:48:
 compareTo(org.apache.lucene.index.Term) in
 org.apache.lucene.index.Term cannot be applied to
 (org.apache.lucene.store.instantiated.InstantiatedTerm)
[javac]   return

 instantiatedTermDocumentInformation.getTerm().getTerm().compareTo(instantiatedTermDocumentInformation1.getTerm());
[javac] ^
[javac] 1 error


 Is anyone else seeing this?

 Mike

 -
 To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-dev-h...@lucene.apache.org




-- 
Robert Muir
rcm...@gmail.com


RE: ant build-contrib fails on trunk?

2009-10-16 Thread Uwe Schindler
It was not the generics change; it was a bug in the comparator: one
getTerm() call was missing. I'll add it.

The compiler found the error because of generics: the signature didn't match
correctly (in 1.4 the parameter was just Object without a generics hint; now
it's Object and Term, but InstantiatedTerm does not match).

Committed revision: 826011
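
An illustrative reconstruction of the fix (identifiers shortened; the real
ones appear in the javac output quoted below):

{code}
// before: the right-hand side is an InstantiatedTerm, which under generics
// Term.compareTo(Term) no longer accepts:
//   return a.getTerm().getTerm().compareTo(b.getTerm());
// after: the missing getTerm() unwraps the right-hand side to a Term as well:
//   return a.getTerm().getTerm().compareTo(b.getTerm().getTerm());
{code}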

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de


 -Original Message-
 From: Uwe Schindler [mailto:u...@thetaphi.de]
 Sent: Friday, October 16, 2009 7:10 PM
 To: java-dev@lucene.apache.org
 Subject: RE: ant build-contrib fails on trunk?
 
  I'll fix it; this is because of generics and compareTo(). I'll revert the
  change.
 
 -
 Uwe Schindler
 H.-H.-Meier-Allee 63, D-28213 Bremen
 http://www.thetaphi.de
 eMail: u...@thetaphi.de
 
 
  -Original Message-
  From: Michael McCandless [mailto:luc...@mikemccandless.com]
  Sent: Friday, October 16, 2009 7:01 PM
  To: java-dev@lucene.apache.org
  Subject: ant build-contrib fails on trunk?
 
  When I run ant build-contrib on current trunk, I hit this:
 
  compile-core:
  [javac] Compiling 1 source file to
  /lucene/tmp2/build/contrib/instantiated/classes/java
  [javac]
 
 /lucene/tmp2/contrib/instantiated/src/java/org/apache/lucene/store/instant
  iated/InstantiatedTermDocumentInformation.java:48:
  compareTo(org.apache.lucene.index.Term) in
  org.apache.lucene.index.Term cannot be applied to
  (org.apache.lucene.store.instantiated.InstantiatedTerm)
  [javac]   return
 
 instantiatedTermDocumentInformation.getTerm().getTerm().compareTo(instanti
  atedTermDocumentInformation1.getTerm());
  [javac] ^
  [javac] 1 error
 
 
  Is anyone else seeing this?
 
  Mike
 
  -
  To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
  For additional commands, e-mail: java-dev-h...@lucene.apache.org
 
 
 
 -
 To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-dev-h...@lucene.apache.org



-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: ant build-contrib fails on trunk?

2009-10-16 Thread Michael McCandless
OK thanks!

Mike

On Fri, Oct 16, 2009 at 1:09 PM, Uwe Schindler u...@thetaphi.de wrote:
 I'll fix it; this is because of generics and compareTo(). I'll revert the change.

 -
 Uwe Schindler
 H.-H.-Meier-Allee 63, D-28213 Bremen
 http://www.thetaphi.de
 eMail: u...@thetaphi.de


 -Original Message-
 From: Michael McCandless [mailto:luc...@mikemccandless.com]
 Sent: Friday, October 16, 2009 7:01 PM
 To: java-dev@lucene.apache.org
 Subject: ant build-contrib fails on trunk?

 When I run ant build-contrib on current trunk, I hit this:

 compile-core:
     [javac] Compiling 1 source file to
 /lucene/tmp2/build/contrib/instantiated/classes/java
     [javac]
 /lucene/tmp2/contrib/instantiated/src/java/org/apache/lucene/store/instant
 iated/InstantiatedTermDocumentInformation.java:48:
 compareTo(org.apache.lucene.index.Term) in
 org.apache.lucene.index.Term cannot be applied to
 (org.apache.lucene.store.instantiated.InstantiatedTerm)
     [javac]       return
 instantiatedTermDocumentInformation.getTerm().getTerm().compareTo(instanti
 atedTermDocumentInformation1.getTerm());
     [javac]                                                         ^
     [javac] 1 error


 Is anyone else seeing this?

 Mike

 -
 To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-dev-h...@lucene.apache.org



 -
 To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-dev-h...@lucene.apache.org



-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Resolved: (LUCENE-1124) short circuit FuzzyQuery.rewrite when input token length is small compared to minSimilarity

2009-10-16 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved LUCENE-1124.


Resolution: Fixed

 short circuit FuzzyQuery.rewrite when input token length is small compared to 
 minSimilarity
 ---

 Key: LUCENE-1124
 URL: https://issues.apache.org/jira/browse/LUCENE-1124
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Query/Scoring
Reporter: Hoss Man
Assignee: Mark Miller
Priority: Trivial
 Fix For: 2.9.1, 3.0

 Attachments: LUCENE-1124.patch, LUCENE-1124.patch, LUCENE-1124.patch, 
 LUCENE-1124.patch


 I found this (unreplied to) email floating around in my Lucene folder from 
 during the holidays...
 {noformat}
 From: Timo Nentwig
 To: java-dev
 Subject: Fuzzy makes no sense for short tokens
 Date: Mon, 31 Dec 2007 16:01:11 +0100
 Message-Id: 200712311601.12255.luc...@nitwit.de
 Hi!
 it generally makes no sense to search fuzzy for short tokens because changing
 even only a single character of course already results in a high edit
 distance. So it actually only makes sense in this case:
if( token.length() > 1f / (1f - minSimilarity) )
 E.g. changing one character in a 3-letter token (foo) results in an edit
 distance of 0.6. And if minSimilarity (which is by default: 0.5 :-) is higher
 we can save all the expensive rewrite() logic.
 {noformat}
 I don't know much about FuzzyQueries, but this reasoning seems sound ...
 FuzzyQuery.rewrite should be able to completely skip all TermEnumeration in
 the event that the input token is shorter than some simple math on the
 minSimilarity.  (I'm not smart enough to be certain that the math above is
 right, however ... it's been a while since I looked at Levenshtein distances
 ... tests needed)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1257) Port to Java5

2009-10-16 Thread Kay Kay (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Kay updated LUCENE-1257:


Attachment: LUCENE-1257-DisjunctionMaxQuery-more_type_safety.patch

* DisjunctionMaxQuery.java - some of the casts are not necessary now that the 
members are made type-safe. 
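
A self-contained illustration of the kind of cast this removes (String stands
in for Query so the snippet compiles on its own; this is not the actual patch):

{code}
import java.util.ArrayList;
import java.util.List;

public class CastSketch {
  public static void main(String[] args) {
    // pre-generics member: the list holds Objects, so callers must cast
    List raw = new ArrayList();
    raw.add("disjunct");
    String viaCast = (String) raw.get(0);   // explicit cast required

    // generified member: the element type is declared, no cast needed
    List<String> typed = new ArrayList<String>();
    typed.add("disjunct");
    String direct = typed.get(0);

    System.out.println(viaCast.equals(direct));
  }
}
{code}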

 Port to Java5
 -

 Key: LUCENE-1257
 URL: https://issues.apache.org/jira/browse/LUCENE-1257
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis, Examples, Index, Other, Query/Scoring, 
 QueryParser, Search, Store, Term Vectors
Affects Versions: 2.3.1
Reporter: Cédric Champeau
Assignee: Uwe Schindler
Priority: Minor
 Fix For: 3.0

 Attachments: instantiated_fieldable.patch, java5.patch, 
 LUCENE-1257-DisjunctionMaxQuery-more_type_safety.patch, 
 LUCENE-1257-Document.patch, LUCENE-1257-StringBuffer.patch, 
 LUCENE-1257-StringBuffer.patch, LUCENE-1257-StringBuffer.patch, 
 LUCENE-1257_messages.patch, lucene1257surround1.patch, 
 lucene1257surround1.patch, shinglematrixfilter_generified.patch


 For my needs I've updated Lucene so that it uses Java 5 constructs. I know 
 Java 5 migration had been planned for 2.1 someday in the past, but don't know 
 when it is planned now. This patch against the trunk includes :
 - most obvious generics usage (there are tons of usages of sets, ... Those 
 which are commonly used have been generified)
 - PriorityQueue generification
 - replacement of indexed for loops with for each constructs
 - removal of unnecessary unboxing
 The code is in my opinion much more readable with those features (you
 actually *know* what is stored in collections reading the code, without the
 need to look up field definitions every time) and it simplifies many
 algorithms.
 Note that this patch also includes an interface for the Query class. This has 
 been done for my company's needs for building custom Query classes which add 
 some behaviour to the base Lucene queries. It prevents multiple unnecessary
 casts. I know this introduction is not wanted by the team, but it really 
 makes our developments easier to maintain. If you don't want to use this, 
 replace all /Queriable/ calls with standard /Query/.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1985) DisjunctionMaxQuery - Iterator code to for ( A a : container ) construct

2009-10-16 Thread Kay Kay (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12766652#action_12766652
 ] 

Kay Kay commented on LUCENE-1985:
-

Thanks Uwe. 

Added another patch to LUCENE-1257 to get away from some of the casting that
is no longer necessary now that LUCENE-1984 and LUCENE-1985 are in (with
generics).

 DisjunctionMaxQuery -  Iterator code to  for ( A  a : container ) construct
 ---

 Key: LUCENE-1985
 URL: https://issues.apache.org/jira/browse/LUCENE-1985
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Kay Kay
Assignee: Uwe Schindler
Priority: Trivial
 Fix For: 3.0

 Attachments: LUCENE-1985.patch


 For better readability - converting the Iterable<T> iterator code to
 for (A a : container) constructs, which are more intuitive to read.
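
A minimal before/after sketch of the conversion described above (hypothetical
container, not the committed DisjunctionMaxQuery code):

{code}
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class ForEachSketch {
  public static void main(String[] args) {
    List<String> container = new ArrayList<String>();
    container.add("a");
    container.add("b");

    // before: explicit Iterator boilerplate
    for (Iterator<String> it = container.iterator(); it.hasNext();) {
      String a = it.next();
      System.out.println(a);
    }

    // after: Java 5 for-each -- same iteration, easier to read
    for (String a : container) {
      System.out.println(a);
    }
  }
}
{code}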

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1976) isCurrent() and getVersion() on an NRT reader are broken

2009-10-16 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12766654#action_12766654
 ] 

Michael McCandless commented on LUCENE-1976:


I plan to back-port this to 2.9.x, since we're doing a 2.9.1 shortly...

 isCurrent() and getVersion() on an NRT reader are broken
 

 Key: LUCENE-1976
 URL: https://issues.apache.org/jira/browse/LUCENE-1976
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Affects Versions: 2.9
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Fix For: 3.1

 Attachments: LUCENE-1976.patch


 Right now isCurrent() will always return true for an NRT reader and 
 getVersion() will always return the version of the last commit.  This is 
 because the NRT reader holds the live segmentInfos.
 I think isCurrent() should return false when any further changes have 
 occurred with the writer, else true.   This is actually fairly easy to 
 determine, since the writer tracks how many docs & deletions are buffered in
 RAM and these counters only increase with each change.
 getVersion should return the version as of when the reader was created.
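
A rough sketch of the counter idea, with invented field and type names (not
Lucene's internals):

{code}
// Snapshot the writer's monotonically increasing change counters when the
// NRT reader is opened; any later increase means the reader is stale.
class WriterCounters {
  volatile long bufferedDocs;     // grows with every added document
  volatile long bufferedDeletes;  // grows with every buffered deletion
}

class NrtReaderSketch {
  private final WriterCounters writer;
  private final long docsAtOpen;
  private final long deletesAtOpen;

  NrtReaderSketch(WriterCounters writer) {
    this.writer = writer;
    this.docsAtOpen = writer.bufferedDocs;
    this.deletesAtOpen = writer.bufferedDeletes;
  }

  boolean isCurrent() {
    // the counters only grow, so equality means no changes since open
    return writer.bufferedDocs == docsAtOpen
        && writer.bufferedDeletes == deletesAtOpen;
  }
}
{code}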

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1257) Port to Java5

2009-10-16 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12766657#action_12766657
 ] 

Uwe Schindler commented on LUCENE-1257:
---

Committed revision: 826035

 Port to Java5
 -

 Key: LUCENE-1257
 URL: https://issues.apache.org/jira/browse/LUCENE-1257
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis, Examples, Index, Other, Query/Scoring, 
 QueryParser, Search, Store, Term Vectors
Affects Versions: 2.3.1
Reporter: Cédric Champeau
Assignee: Uwe Schindler
Priority: Minor
 Fix For: 3.0

 Attachments: instantiated_fieldable.patch, java5.patch, 
 LUCENE-1257-DisjunctionMaxQuery-more_type_safety.patch, 
 LUCENE-1257-Document.patch, LUCENE-1257-StringBuffer.patch, 
 LUCENE-1257-StringBuffer.patch, LUCENE-1257-StringBuffer.patch, 
 LUCENE-1257_messages.patch, lucene1257surround1.patch, 
 lucene1257surround1.patch, shinglematrixfilter_generified.patch


 For my needs I've updated Lucene so that it uses Java 5 constructs. I know 
 Java 5 migration had been planned for 2.1 someday in the past, but don't know 
 when it is planned now. This patch against the trunk includes :
 - most obvious generics usage (there are tons of usages of sets, ... Those 
 which are commonly used have been generified)
 - PriorityQueue generification
 - replacement of indexed for loops with for each constructs
 - removal of unnecessary unboxing
 The code is in my opinion much more readable with those features (you
 actually *know* what is stored in collections reading the code, without the
 need to look up field definitions every time) and it simplifies many
 algorithms.
 Note that this patch also includes an interface for the Query class. This has 
 been done for my company's needs for building custom Query classes which add 
 some behaviour to the base Lucene queries. It prevents multiple unnecessary
 casts. I know this introduction is not wanted by the team, but it really 
 makes our developments easier to maintain. If you don't want to use this, 
 replace all /Queriable/ calls with standard /Query/.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: JDBC access to a Lucene index

2009-10-16 Thread Grant Ingersoll
I'm not aware of any, but you might get more mileage asking on java-user.


On Oct 16, 2009, at 3:54 AM, Jukka Zitting wrote:


Hi,

Some while ago I implemented a simple JDBC to JCR bridge [1] that
allows one to query a JCR repository from any JDBC client, most
notably various reporting tools.

Now I'm wondering if something similar already exists for a normal
Lucene index. Something that would treat your entire index as one huge
table (or perhaps a set of tables based on some document type field)
and would allow you to use simple SQL SELECTs to query data.

Any pointers would be welcome.

[1] http://dev.day.com/microsling/content/blogs/main/jdbc2jcr.html

BR,

Jukka Zitting

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org




-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Resolved: (LUCENE-1976) isCurrent() and getVersion() on an NRT reader are broken

2009-10-16 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved LUCENE-1976.


   Resolution: Fixed
Fix Version/s: (was: 3.1)
   3.0
   2.9.1

 isCurrent() and getVersion() on an NRT reader are broken
 

 Key: LUCENE-1976
 URL: https://issues.apache.org/jira/browse/LUCENE-1976
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Affects Versions: 2.9
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.9.1, 3.0

 Attachments: LUCENE-1976.patch


 Right now isCurrent() will always return true for an NRT reader and 
 getVersion() will always return the version of the last commit.  This is 
 because the NRT reader holds the live segmentInfos.
 I think isCurrent() should return false when any further changes have 
 occurred with the writer, else true.   This is actually fairly easy to 
 determine, since the writer tracks how many docs & deletions are buffered in
 RAM and these counters only increase with each change.
 getVersion should return the version as of when the reader was created.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Created: (LUCENE-1987) Remove rest of analysis deprecations (Token, CharacterCache)

2009-10-16 Thread Uwe Schindler (JIRA)
Remove rest of analysis deprecations (Token, CharacterCache)


 Key: LUCENE-1987
 URL: https://issues.apache.org/jira/browse/LUCENE-1987
 Project: Lucene - Java
  Issue Type: Task
  Components: Analysis
Reporter: Uwe Schindler
Assignee: Uwe Schindler
 Fix For: 3.0


This removes the rest of the deprecations in the analysis package:
- Token's termText field
- possibly un-deprecate ctors of Token taking Strings (they are still
  useful); if so, remove the deprecation in 2.9.1
- remove CharacterCache and use Character.valueOf() from Java 5 (see the
  sketch below)
- Some Analyzers have stopword lists in the wrong format (HashMaps)
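
For the CharacterCache point, the substitution is essentially the following
(assuming CharacterCache.valueOf(char) was the old entry point; illustrative
only):

{code}
// Java 5's Character.valueOf caches boxed chars in the ASCII range,
// which makes a hand-rolled cache redundant:
Character c = Character.valueOf('a');   // instead of CharacterCache.valueOf('a')
{code}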

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1987) Remove rest of analysis deprecations (Token, CharacterCache)

2009-10-16 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-1987:
--

Attachment: LUCENE-1987.patch

Patch with the first three points. The three deprecated methods should stay
alive in my opinion: copying the string to the term buffer in the ctor is the
same as copying an initial term buffer. If we remove these ctors, we should
also remove the setTermBuffer(String) method; keeping one but not the other
is inconsistent.

If the others agree to keep these three ctors alive I will apply the
un-deprecation in the 2.9 branch.
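
A small sketch of the equivalence being argued (Token API as of 2.9;
illustrative, not taken from the patch):

{code}
import org.apache.lucene.analysis.Token;

class TokenCtorSketch {
  static void demo() {
    Token t1 = new Token("foo", 0, 3);  // deprecated ctor: copies the String
    Token t2 = new Token(0, 3);
    t2.setTermBuffer("foo");            // the same copy, done afterwards
  }
}
{code}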

 Remove rest of analysis deprecations (Token, CharacterCache)
 

 Key: LUCENE-1987
 URL: https://issues.apache.org/jira/browse/LUCENE-1987
 Project: Lucene - Java
  Issue Type: Task
  Components: Analysis
Reporter: Uwe Schindler
Assignee: Uwe Schindler
 Fix For: 3.0

 Attachments: LUCENE-1987.patch


 This removes the rest of the deprecations in the analysis package:
 - Token's termText field
 - possibly un-deprecate ctors of Token taking Strings (they are still
 useful); if so, remove the deprecation in 2.9.1
 - remove CharacterCache and use Character.valueOf() from Java 5
 - Some Analyzers have stopword lists in the wrong format (HashMaps)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing

2009-10-16 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12766832#action_12766832
 ] 

Mark Miller commented on LUCENE-1458:
-

Almost got an initial rough stab at the sep codec cache done - just have to get 
two more tests to pass involving the payload's state.

 Further steps towards flexible indexing
 ---

 Key: LUCENE-1458
 URL: https://issues.apache.org/jira/browse/LUCENE-1458
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Affects Versions: 2.9
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Attachments: LUCENE-1458-back-compat.patch, 
 LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, 
 LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, 
 LUCENE-1458-back-compat.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
 LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
 LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
 LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.tar.bz2, 
 LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, 
 LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2


 I attached a very rough checkpoint of my current patch, to get early
 feedback.  All tests pass, though back compat tests don't pass due to
 changes to package-private APIs plus certain bugs in tests that
 happened to work (eg call TermPositions.nextPosition() too many times,
 which the new API asserts against).
 [Aside: I think, when we commit changes to package-private APIs such
 that back-compat tests don't pass, we could go back, make a branch on
 the back-compat tag, commit changes to the tests to use the new
 package private APIs on that branch, then fix nightly build to use the
 tip of that branch?]
 There's still plenty to do before this is committable! This is a
 rather large change:
   * Switches to a new more efficient terms dict format.  This still
 uses tii/tis files, but the tii only stores term & long offset
 (not a TermInfo).  At seek points, tis encodes term & freq/prox
 offsets absolutely instead of with deltas.  Also, tis/tii
 are structured by field, so we don't have to record field number
 in every term.
 .
 On first 1 M docs of Wikipedia, tii file is 36% smaller (0.99 MB
 -> 0.64 MB) and tis file is 9% smaller (75.5 MB -> 68.5 MB).
 .
 RAM usage when loading terms dict index is significantly less
 since we only load an array of offsets and an array of String (no
 more TermInfo array).  It should be faster to init too.
 .
 This part is basically done.
   * Introduces modular reader codec that strongly decouples terms dict
 from docs/positions readers.  EG there is no more TermInfo used
 when reading the new format.
 .
 There's nice symmetry now between reading & writing in the codec
 chain -- the current docs/prox format is captured in:
 {code}
 FormatPostingsTermsDictWriter/Reader
 FormatPostingsDocsWriter/Reader (.frq file) and
 FormatPostingsPositionsWriter/Reader (.prx file).
 {code}
 This part is basically done.
   * Introduces a new flex API for iterating through the fields,
 terms, docs and positions:
 {code}
 FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum
 {code}
 This replaces TermEnum/Docs/Positions.  SegmentReader emulates the
 old API on top of the new API to keep back-compat (a consumption sketch
 of the new enums follows this issue description).
 
 Next steps:
   * Plug in new codecs (pulsing, pfor) to exercise the modularity /
 fix any hidden assumptions.
   * Expose new API out of IndexReader, deprecate old API but emulate
 old API on top of new one, switch all core/contrib users to the
 new API.
   * Maybe switch to AttributeSources as the base class for TermsEnum,
 DocsEnum, PostingsEnum -- this would give readers API flexibility
 (not just index-file-format flexibility).  EG if someone wanted
 to store payload at the term-doc level instead of
 term-doc-position level, you could just add a new attribute.
   * Test performance & iterate.
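
A consumption sketch of the enum chain named above, with the interfaces mocked
so the snippet stands alone (these are not the committed flex APIs):

{code}
// Mock of FieldProducer -> TermsEnum -> DocsEnum; illustrative only.
interface DocsEnum { int NO_MORE_DOCS = Integer.MAX_VALUE; int nextDoc(); }
interface TermsEnum { String next(); DocsEnum docs(); }  // null next() = done
interface FieldProducer { TermsEnum terms(String field); }

class FlexSketch {
  static void dump(FieldProducer fields, String field) {
    TermsEnum terms = fields.terms(field);
    for (String term = terms.next(); term != null; term = terms.next()) {
      DocsEnum docs = terms.docs();
      for (int d = docs.nextDoc(); d != DocsEnum.NO_MORE_DOCS; d = docs.nextDoc()) {
        // process (field, term, d) here
      }
    }
  }
}
{code}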

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org