JDBC access to a Lucene index
Hi,

A while ago I implemented a simple JDBC-to-JCR bridge [1] that allows one to query a JCR repository from any JDBC client, most notably various reporting tools. Now I'm wondering whether something similar already exists for a plain Lucene index: something that would treat your entire index as one huge table (or perhaps a set of tables based on some document type field) and would let you use simple SQL SELECTs to query data. Any pointers would be welcome.

[1] http://dev.day.com/microsling/content/blogs/main/jdbc2jcr.html

BR, Jukka Zitting

- To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
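To make the "index as one huge table" idea concrete, here is a minimal sketch (illustrative only, no such bridge is assumed to exist) of the table-scan semantics such a JDBC layer would have to emulate, with documents modeled as plain field-to-value maps rather than real Lucene documents:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: models "SELECT col FROM index WHERE field = value"
 *  over documents represented as field->value maps. */
public class IndexAsTable {

    /** Returns the values of wantedColumn from every document whose
     *  whereField equals whereValue -- the table scan a JDBC bridge
     *  would have to emulate over the index. */
    public static List<String> select(List<Map<String, String>> docs,
                                      String wantedColumn,
                                      String whereField, String whereValue) {
        List<String> rows = new ArrayList<>();
        for (Map<String, String> doc : docs) {
            if (whereValue.equals(doc.get(whereField))) {
                rows.add(doc.get(wantedColumn));
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        List<Map<String, String>> docs = new ArrayList<>();
        Map<String, String> d1 = new LinkedHashMap<>();
        d1.put("type", "book"); d1.put("title", "Lucene in Action");
        Map<String, String> d2 = new LinkedHashMap<>();
        d2.put("type", "article"); d2.put("title", "Flexible indexing");
        docs.add(d1); docs.add(d2);
        // SELECT title FROM index WHERE type = 'book'
        System.out.println(select(docs, "title", "type", "book"));
    }
}
```

A real bridge would of course push the WHERE clause down into a Lucene query rather than scanning, but the relational view it exposes is the same.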
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12766481#action_12766481 ] Michael McCandless commented on LUCENE-1458:

OK, thanks for addressing the new nocommits -- you wanna remove them & commit as you find/comment on them? Can be our means of communicating through the branch :)

For now, I don't think we need to explore improvements to the TermInfo cache (starting @ smaller size, simplistic double barrel LRU cache) -- we can simply mimic trunk for now; such improvements are orthogonal here. Maybe switch those nocommits to TODOs instead?

bq. Hmm - I'm still getting the heap space issue I think

Sigh. I think we have more work to do to scale down the RAM used by IndexReader for a smallish index.

Further steps towards flexible indexing
---
Key: LUCENE-1458
URL: https://issues.apache.org/jira/browse/LUCENE-1458
Project: Lucene - Java
Issue Type: New Feature
Components: Index
Affects Versions: 2.9
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
Attachments: LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2

I attached a very rough checkpoint of my current patch, to get early feedback. All tests pass, though back-compat tests don't pass due to changes to package-private APIs plus certain bugs in tests that happened to work (eg calling TermPositions.nextPosition() too many times, which the new API asserts against).

[Aside: I think, when we commit changes to package-private APIs such that back-compat tests don't pass, we could go back, make a branch on the back-compat tag, commit changes to the tests to use the new package-private APIs on that branch, then fix the nightly build to use the tip of that branch?]

There's still plenty to do before this is committable! This is a rather large change:

* Switches to a new, more efficient terms dict format. This still uses tii/tis files, but the tii only stores the term's long offset (not a TermInfo). At seek points, tis encodes term freq/prox offsets absolutely instead of with deltas. Also, tis/tii are structured by field, so we don't have to record the field number in every term.
  . On the first 1 M docs of Wikipedia, the tii file is 36% smaller (0.99 MB -> 0.64 MB) and the tis file is 9% smaller (75.5 MB -> 68.5 MB).
  . RAM usage when loading the terms dict index is significantly less, since we only load an array of offsets and an array of String (no more TermInfo array). It should be faster to init too.
  . This part is basically done.
* Introduces a modular reader codec that strongly decouples the terms dict from the docs/positions readers. EG there is no more TermInfo used when reading the new format.
  . There's nice symmetry now between reading and writing in the codec chain -- the current docs/prox format is captured in:
{code}
FormatPostingsTermsDictWriter/Reader
FormatPostingsDocsWriter/Reader (.frq file)
FormatPostingsPositionsWriter/Reader (.prx file)
{code}
  . This part is basically done.
* Introduces a new flex API for iterating through the fields, terms, docs and positions:
{code}
FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum
{code}
  This replaces TermEnum/Docs/Positions. SegmentReader emulates the old API on top of the new API to keep back-compat.

Next steps:

* Plug in new codecs (pulsing, pfor) to exercise the modularity / fix any hidden assumptions.
* Expose the new API out of IndexReader, deprecate the old API but emulate it on top of the new one, switch all core/contrib users to the new API.
* Maybe switch to AttributeSources as the base class for TermsEnum, DocsEnum, PostingsEnum -- this would give readers API flexibility (not just index-file-format flexibility). EG if someone wanted to store a payload at the term-doc level instead of the term-doc-position level, you could just add a new attribute.
* Test performance, iterate.

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
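The "absolute at seek points, deltas elsewhere" encoding described above can be sketched in plain Java. This is a toy model of the idea only, not Lucene's actual tis/tii file format: writing an absolute value at every Nth entry lets a reader jump to the nearest seek point and decode forward, instead of summing deltas from the start of the file.

```java
/** Toy model (not Lucene's actual file format) of the terms-dict
 *  encoding idea: offsets are written absolutely at seek points and
 *  as deltas in between, so a reader can jump to the nearest seek
 *  point and decode forward from there. */
public class SeekPointEncoding {
    static final int INTERVAL = 4; // a seek point every 4 entries

    /** Encode sorted offsets: absolute at seek points, delta elsewhere. */
    public static long[] encode(long[] offsets) {
        long[] out = new long[offsets.length];
        for (int i = 0; i < offsets.length; i++) {
            out[i] = (i % INTERVAL == 0)
                    ? offsets[i]                   // absolute at seek point
                    : offsets[i] - offsets[i - 1]; // delta otherwise
        }
        return out;
    }

    /** Decode the entry at index without scanning from the start:
     *  begin at the preceding seek point and add deltas. */
    public static long decodeAt(long[] encoded, int index) {
        int seek = (index / INTERVAL) * INTERVAL;
        long value = encoded[seek];
        for (int i = seek + 1; i <= index; i++) {
            value += encoded[i];
        }
        return value;
    }
}
```

The deltas between nearby file offsets are small numbers, which is what makes them cheap to store with a variable-length encoding in a real index format.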
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12766482#action_12766482 ] Michael McCandless commented on LUCENE-1458:

bq. you wanna remove them commit as you find/comment on them?

Woops, I see you already did! Thanks.
Re: lucene 2.9 sorting algorithm
Thanks John; I'll have a look.

Mike

On Fri, Oct 16, 2009 at 12:57 AM, John Wang john.w...@gmail.com wrote:

Hi Michael: I added the classes ScoreDocComparatorQueue and OneSortNoScoreCollector as a more general case. I think keeping the old api for ScoreDocComparator and SortComparatorSource would work. Please take a look. Thanks -John

On Thu, Oct 15, 2009 at 6:52 PM, John Wang john.w...@gmail.com wrote:

Hi Michael: It is open, http://code.google.com/p/lucene-book/source/checkout - I think I sent the https url instead, sorry. The multi PQ sorting is fairly self-contained; I have 2 versions, 1 for string and 1 for int, each are Collector impls. I shouldn't say the multi PQ is faster on int sort, it is within the error boundary. The diff is very very small; I would say they are more or less equal. If you think it is a good thing to go this way (if not for the perf, just for the simpler api), I'd be happy to work on a patch. Thanks -John

On Thu, Oct 15, 2009 at 5:18 PM, Michael McCandless luc...@mikemccandless.com wrote:

John, looks like this requires login -- any plans to open that up, or post the code on an issue? How self-contained is your multi PQ sorting? EG is it a standalone Collector impl that I can test? Mike

On Thu, Oct 15, 2009 at 6:33 PM, John Wang john.w...@gmail.com wrote:

BTW, we have a little sandbox for these experiments, and all my test code is at https://lucene-book.googlecode.com/svn/trunk -- it's not very polished. -John

On Thu, Oct 15, 2009 at 3:29 PM, John Wang john.w...@gmail.com wrote:

Numbers Mike requested for int types (only time/cpu time are posted; the others are all the same since the algorithm is the same):

Lucene 2.9:
numHits: 10  time: 14619495  cpu: 146126
numHits: 20  time: 14550568  cpu: 163242
numHits: 100 time: 16467647  cpu: 178379

my test:
numHits: 10  time: 14101094  cpu: 144715
numHits: 20  time: 14804821  cpu: 151305
numHits: 100 time: 15372157  cpu: 158842

Conclusions: They are very similar; the differences are all within error bounds, especially with lower PQ sizes, where the second sort alg is again slightly faster. Hope this helps. -John

On Thu, Oct 15, 2009 at 3:04 PM, Yonik Seeley yo...@lucidimagination.com wrote:

On Thu, Oct 15, 2009 at 5:33 PM, Michael McCandless luc...@mikemccandless.com wrote:

Though it'd be odd if the switch to searching by segment really was most of the gains here.

I had assumed that much of the improvement was due to ditching MultiTermEnum/MultiTermDocs. Note that LUCENE-1483 was before LUCENE-1596... but that only helps with queries that use a TermEnum (range, prefix, etc). -Yonik http://www.lucidimagination.com
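The collection strategy being benchmarked above, keeping only the top numHits values in a bounded priority queue while streaming over all hits, can be sketched in plain Java. This is a minimal model of what a PQ-backed Collector does, not the actual Lucene 2.9 or multi-PQ code:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

/** Minimal model of bounded-PQ top-N collection, the pattern the
 *  Collectors benchmarked above implement (not the actual code). */
public class TopNByValue {

    /** Collect the numHits smallest values, returned in ascending order. */
    public static List<Integer> topN(int[] values, int numHits) {
        // max-heap holding the best (smallest) numHits values seen so far,
        // so the current worst kept hit is always at the root
        PriorityQueue<Integer> pq =
                new PriorityQueue<>(numHits, Collections.reverseOrder());
        for (int v : values) {
            if (pq.size() < numHits) {
                pq.add(v);
            } else if (v < pq.peek()) { // beats the current worst hit
                pq.poll();
                pq.add(v);
            }
        }
        List<Integer> out = new ArrayList<>(pq);
        Collections.sort(out);
        return out;
    }
}
```

The per-hit cost is O(log numHits) only when a hit beats the current worst entry, which is why the numbers above barely move as numHits grows from 10 to 100.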
Re: search through single pdf document - return page number
Hey! I did it! Erick and Robert, you helped a lot. Thanks!

I didn't use LucenePDFDocument. I created a new document for every page in a PDF document and added page number info for every page.

PDDocument pddDocument = PDDocument.load(f);
PDFTextStripper textStripper = new PDFTextStripper();
IndexWriter iwriter = new IndexWriter(index_dir, new StandardAnalyzer(), true);
long start = new Date().getTime();
// 350 pages just for test
for (int i = 1; i < 350; i++) {
    //System.out.println("i= " + i);
    textStripper.setStartPage(i);
    textStripper.setEndPage(i);
    // fetch one page
    pagecontent = textStripper.getText(pddDocument);
    System.out.println("pagecontent: " + pagecontent);
    if (pagecontent != null) {
        System.out.println("i= " + i);
        Document doc = new Document();
        // Add the page number
        doc.add(new Field("pagenumber", Integer.toString(i),
                Field.Store.YES, Field.Index.ANALYZED));
        doc.add(new Field("content", pagecontent,
                Field.Store.NO, Field.Index.ANALYZED));
        iwriter.addDocument(doc);
    }
}
// Optimize and close the writer to finish building the index
iwriter.optimize();
iwriter.close();
long end = new Date().getTime();
System.out.println("Indexing files took " + (end - start) + " milliseconds");

// just for test I searched for the string "cryptography"
String q = "cryptography";
Directory fsDir = FSDirectory.getDirectory(index_dir, false);
IndexSearcher ind_searcher = new IndexSearcher(fsDir);
// Build a Query object
QueryParser parser = new QueryParser("content", new StandardAnalyzer());
Query query = parser.parse(q);
// Search for the query
Hits hits = ind_searcher.search(query);
// Examine the Hits object to see if there were any matches
int hitCount = hits.length();
if (hitCount == 0) {
    System.out.println("No matches were found for \"" + q + "\"");
} else {
    System.out.println("Hits for \"" + q + "\" were found in pages:");
    // Iterate over the Documents in the Hits object
    for (int i = 0; i < hitCount; i++) {
        Document doc = hits.doc(i);
        // Print the stored pagenumber field
        //System.out.println((i + 1) + ". " + doc.get("title"));
        System.out.println((i + 1) + ". " + doc.get("pagenumber"));
    }
}
ind_searcher.close();

I'm using lucene version 2.9.0. You said that Hits is deprecated. Should I use HitCollector instead?

Another question came to my mind... What if I want to add another PDF document to the search pool? Before searching I would like to specify the PDF document to search and then return the page number for the searched String. I could create an index for every document that I add to the search pool, but that doesn't sound good to me. Can you think of a better way to do that?

Erick Erickson wrote:

Your search would be on the contents field if you use LucenePDFDocument. But on a quick look, LucenePDFDocument doesn't give you any page information. So you'd have to collect that somehow, but I don't see a clear way to. Doing it manually, you could do something like:

Document doc = new Document();
for (each page in the document) {
    doc.add("contents", text for page);
    // record the offset of the last term in the page you just indexed
}
doc.add("metadata", string representation of the page offsets);
iw.addDocument(doc);

Now, when you search, you can get the offsets of the matching term, then look in your metadata field for the page number. Perhaps you could use LucenePDFDocument in conjunction with this somehow, but I confess that I've never used it, so it's not clear to me how you'd do this. Incidentally, the Hits object is deprecated; what version of Lucene are you intending to use? Best, Erick

On Thu, Oct 15, 2009 at 10:43 AM, IvanDrago idrag...@gmail.com wrote:

Thanks for the reply Erick. I would like to permanently index this content and search it multiple times, so I would like a permanent copy and I want to search for different terms multiple times. My problem is that I don't know how to retrieve the page number where the searched string was found so if you
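Erick's metadata suggestion above, store each page's starting term offset, then map a matching term's offset back to a page, reduces to a binary search. A self-contained sketch (names and layout are illustrative, not any Lucene API):

```java
import java.util.Arrays;

/** Sketch of the page-offset metadata lookup suggested above:
 *  pageStarts[i] holds the term offset of the first term on page i+1.
 *  Given a matching term's offset, find its 1-based page number. */
public class PageLookup {

    /** pageStarts must be sorted ascending and begin at 0. */
    public static int pageForOffset(int[] pageStarts, int termOffset) {
        int idx = Arrays.binarySearch(pageStarts, termOffset);
        // exact hit: the term is the first one on that page;
        // otherwise binarySearch returns -(insertionPoint) - 1,
        // and the page is the one starting just before the offset
        return (idx >= 0) ? idx + 1 : -idx - 1;
    }
}
```

With this scheme the whole PDF stays one Lucene document (so proximity queries can span pages), and the page number is recovered after the search instead of being baked into separate per-page documents.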
Re: search through single pdf document - return page number
Glad things are progressing. The only problem here will be proximity queries that span pages. Say the last word on page 10 is "salmon" and the first word on page 11 is "fishing". Structuring your index this way won't find a proximity search for "salmon fishing". If that's not a concern, then there's no reason to complexify the situation.

FWIW, Erick
Re: search through single pdf document - return page number
Proximity queries that span pages are not a concern in my case. I asked another question at the bottom of my last post; could you comment on that if you have some ideas?
Re: search through single pdf document - return page number
Well, you have to add another field to each document identifying the PDF it came from. From there, restricting to that doc just becomes adding an AND clause. Of course, how you specify these is an exercise left to the reader <G>.

Erick
Re: search through single pdf document - return page number
Yes, I tough of that too but i didn't know if I could search trough index only documents that have specific field name. After some researching I found a way to do that: String q = title:ant; Query query = parser.parse(q); title:ant - Contain the term ant in the title field Regards, Ivan Erick Erickson wrote: Well, you have to add another field to each document identifying thePDF it came from. From there, restricting to that doc just becomes adding an AND clause. Of course how you specify these is an exercise left to the reader G. Erick On Fri, Oct 16, 2009 at 8:01 AM, IvanDrago idrag...@gmail.com wrote: Hey! I did it! Eric and Robert, you helped a lot. Thanks! I didn't use LucenePDFDocument. I created a new document for every page in a PDF document and added paga number info for every page. PDDocument pddDocument=PDDocument.load(f); PDFTextStripper textStripper=new PDFTextStripper(); IndexWriter iwriter = new IndexWriter(index_dir, new StandardAnalyzer(), true); long start = new Date().getTime(); // 350 pages just for test for(int i=1; i350; i++){ //System.out.println(i= + i); textStripper.setStartPage(i); textStripper.setEndPage(i); //fetch one page pagecontent = textStripper.getText(pddDocument); System.out.println(pagecontent: + pagecontent); if (pagecontent != null){ System.out.println(i= + i); Document doc = new Document(); // Add the pagenumber doc.add(new Field(pagenumber, Integer.toString(i) , Field.Store.YES, Field.Index.ANALYZED)); doc.add(new Field(content, pagecontent , Field.Store.NO, Field.Index.ANALYZED)); iwriter.addDocument(doc); } } // Optimize and close the writer to finish building the index iwriter.optimize(); iwriter.close(); long end = new Date().getTime(); System.out.println(Indexing files took + (end - start) + milliseconds); //just for test I searched for a string cryptography String q = cryptography; Directory fsDir = FSDirectory.getDirectory(index_dir, false); IndexSearcher ind_searcher = new IndexSearcher(fsDir); // Build a Query object 
QueryParser parser = new QueryParser(content, new StandardAnalyzer()); Query query = parser.parse(q); // Search for the query Hits hits = ind_searcher.search(query); // Examine the Hits object to see if there were any matches int hitCount = hits.length(); if (hitCount == 0) { System.out.println( No matches were found for \ + q + \); } else { System.out.println(Hits for \ + q + \ were found in pages:); // Iterate over the Documents in the Hits object for (int i = 0; i hitCount; i++) { Document doc = hits.doc(i); // Print the value that we stored in the title field. Note // that this Field was not indexed, but (unlike the // contents field) was stored verbatim and can be // retrieved. //System.out.println( + (i + 1) + . + doc.get(title)); System.out.println( + (i + 1) + . + doc.get(pagenumber)); } } ind_searcher.close(); I'm using lucene version 2.9.0 You said that Hits are deprecated. Should I use HitCollector instead? Another question came into my mind... What if I want do add another PDF document to the search pool. Before search I would like to specify the PDF document I would like to search and then return page number for searched String. I could create index for every document that I add to search pool but that doesn't sound good to me? Can you think of a better way to do that? Erick Erickson wrote: Your search would be on the contents field if you use LucenePDFDocument. But on a quick look, LucenePDFDocument doesn't give you any page information. So, you'd have to collect that somehow, but I don't see a clear way to. Doing it manually, you could do something like: Document doc = new Document(); for (each page in the document) { doc.add(contents, text for page); record the offset of the last term in the page you just indexed); } doc.add(metadata, string representation of the page offsets); iw.addDocument(doc); Now, when you search you can get the offsets of the matching term, then look in your metadata field for the page number. 
Perhaps you could use the LucenePDFDocument in conjunction with this somehow, but I confess that I've never used it so it's not clear to me how you'd do this.
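The only non-obvious step in Erick's sketch is mapping a matching term's offset back to a page number. A minimal, Lucene-free illustration of that lookup (the class and method names here are made up for this example, not Lucene API): record the offset at which each page starts, then binary-search the hit's offset against those page starts.

```java
import java.util.Arrays;

// Sketch of the page-offset idea from the thread: while indexing, record the
// term offset where each page begins; at search time, map a hit's term offset
// back to a 1-based page number. Hypothetical names, not Lucene classes.
public class PageOffsets {
    // pageStarts[i] = term offset where page (i+1) begins; must be sorted ascending.
    static int pageForOffset(int[] pageStarts, int termOffset) {
        int idx = Arrays.binarySearch(pageStarts, termOffset);
        if (idx >= 0) return idx + 1;  // offset is exactly a page boundary
        return -idx - 1;               // insertion point maps back to the enclosing page
    }

    public static void main(String[] args) {
        int[] pageStarts = {0, 120, 305, 560};               // a 4-page document
        System.out.println(pageForOffset(pageStarts, 0));    // 1
        System.out.println(pageForOffset(pageStarts, 119));  // 1
        System.out.println(pageForOffset(pageStarts, 120));  // 2
        System.out.println(pageForOffset(pageStarts, 999));  // 4
    }
}
```

On the Hits question in the same thread: in 2.9 the Hits class is deprecated; IndexSearcher.search(Query, int) returning TopDocs is the replacement, and per-page Documents (as Ivan ended up doing) sidestep the offset bookkeeping entirely.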
[jira] Updated: (LUCENE-1984) DisjunctionMaxQuery - Type safety
[ https://issues.apache.org/jira/browse/LUCENE-1984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-1984: -- Attachment: LUCENE-1984.patch Small updates in Patch (also implemented Iterable). I also generified the other Disjunction classes. Will commit soon. Thanks Kay Kay! DisjunctionMaxQuery - Type safety --- Key: LUCENE-1984 URL: https://issues.apache.org/jira/browse/LUCENE-1984 Project: Lucene - Java Issue Type: Improvement Components: Query/Scoring Affects Versions: 2.9 Reporter: Kay Kay Assignee: Uwe Schindler Fix For: 3.0 Attachments: LUCENE-1984.patch, LUCENE-1984.patch DisjunctionMaxQuery code has containers that are not type-safe . The comments indicate type-safety though. Better to express in the API and the internals the explicit type as opposed to type-less containers. Patch attached. Comments / backward compatibility concerns welcome. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Resolved: (LUCENE-1984) DisjunctionMaxQuery - Type safety
[ https://issues.apache.org/jira/browse/LUCENE-1984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler resolved LUCENE-1984. --- Resolution: Fixed Committed revision: 825881 Thanks Kay Kay! DisjunctionMaxQuery - Type safety --- Key: LUCENE-1984 URL: https://issues.apache.org/jira/browse/LUCENE-1984 Project: Lucene - Java Issue Type: Improvement Components: Query/Scoring Affects Versions: 2.9 Reporter: Kay Kay Assignee: Uwe Schindler Fix For: 3.0 Attachments: LUCENE-1984.patch, LUCENE-1984.patch DisjunctionMaxQuery code has containers that are not type-safe . The comments indicate type-safety though. Better to express in the API and the internals the explicit type as opposed to type-less containers. Patch attached. Comments / backward compatibility concerns welcome. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12766562#action_12766562 ] Mark Miller commented on LUCENE-1458: - just committed an initial stab at pulsing cache support - could prob use your love again ;) Oddly, the reopen test passed no problem and this adds more to the cache - perhaps I was seeing a ghost last night ... I'll know before too long. Further steps towards flexible indexing --- Key: LUCENE-1458 URL: https://issues.apache.org/jira/browse/LUCENE-1458 Project: Lucene - Java Issue Type: New Feature Components: Index Affects Versions: 2.9 Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Attachments: LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2 I attached a very rough checkpoint of my current patch, to get early feedback. All tests pass, though back compat tests don't pass due to changes to package-private APIs plus certain bugs in tests that happened to work (eg call TermPostions.nextPosition() too many times, which the new API asserts against). [Aside: I think, when we commit changes to package-private APIs such that back-compat tests don't pass, we could go back, make a branch on the back-compat tag, commit changes to the tests to use the new package private APIs on that branch, then fix nightly build to use the tip of that branch?o] There's still plenty to do before this is committable! 
This is a rather large change: * Switches to a new, more efficient terms dict format. This still uses tii/tis files, but the tii only stores the term plus its long offset (not a TermInfo). At seek points, tis encodes term freq/prox offsets absolutely instead of as deltas. Also, tis/tii are structured by field, so we don't have to record the field number in every term. . On the first 1 M docs of Wikipedia, the tii file is 36% smaller (0.99 MB -> 0.64 MB) and the tis file is 9% smaller (75.5 MB -> 68.5 MB). . RAM usage when loading the terms dict index is significantly less since we only load an array of offsets and an array of String (no more TermInfo array). It should be faster to init too. . This part is basically done. * Introduces a modular reader codec that strongly decouples the terms dict from the docs/positions readers. EG there is no more TermInfo used when reading the new format. . There's nice symmetry now between reading and writing in the codec chain -- the current docs/prox format is captured in: {code} FormatPostingsTermsDictWriter/Reader FormatPostingsDocsWriter/Reader (.frq file) and FormatPostingsPositionsWriter/Reader (.prx file). {code} This part is basically done. * Introduces a new flex API for iterating through the fields, terms, docs and positions: {code} FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum {code} This replaces TermEnum/Docs/Positions. SegmentReader emulates the old API on top of the new API to keep back-compat. Next steps: * Plug in new codecs (pulsing, pfor) to exercise the modularity / fix any hidden assumptions. * Expose the new API out of IndexReader, deprecate the old API but emulate it on top of the new one, switch all core/contrib users to the new API. * Maybe switch to AttributeSources as the base class for TermsEnum, DocsEnum, PostingsEnum -- this would give readers API flexibility (not just index-file-format flexibility). EG if someone wanted to store payloads at the term-doc level instead of the term-doc-position level, you could just add a new attribute.
* Test performance and iterate. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
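The FieldProducer -> TermsEnum -> DocsEnum chain described above is still in flux on the branch, but its intended iteration shape can be mocked with plain interfaces. Everything below is a self-contained sketch of that shape over an in-memory postings map; the tiny interfaces are stand-ins, not the actual branch classes.

```java
import java.util.Iterator;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Mock of the proposed flex-API iteration shape: enumerate terms, and for each
// term enumerate its docs. Stand-in interfaces, not the LUCENE-1458 branch code.
public class FlexApiSketch {
    interface TermsEnum { String next(); DocsEnum docs(); }
    interface DocsEnum { int nextDoc(); }  // returns -1 when exhausted

    // Wrap an in-memory term -> doc-ids map in the enum-chain style.
    static TermsEnum termsFor(SortedMap<String, int[]> postings) {
        Iterator<Map.Entry<String, int[]>> it = postings.entrySet().iterator();
        return new TermsEnum() {
            int[] cur;
            public String next() {
                if (!it.hasNext()) return null;
                Map.Entry<String, int[]> e = it.next();
                cur = e.getValue();
                return e.getKey();
            }
            public DocsEnum docs() {
                int[] docs = cur;
                return new DocsEnum() {
                    int i = 0;
                    public int nextDoc() { return i < docs.length ? docs[i++] : -1; }
                };
            }
        };
    }

    public static void main(String[] args) {
        SortedMap<String, int[]> field = new TreeMap<>();
        field.put("ant", new int[]{1, 4});
        field.put("bee", new int[]{2});
        TermsEnum terms = termsFor(field);
        for (String t = terms.next(); t != null; t = terms.next()) {
            DocsEnum docs = terms.docs();
            StringBuilder sb = new StringBuilder(t + ":");
            for (int d = docs.nextDoc(); d != -1; d = docs.nextDoc()) sb.append(' ').append(d);
            System.out.println(sb);  // "ant: 1 4" then "bee: 2"
        }
    }
}
```

The point of the layering is that each level can be swapped independently: a pulsing or pfor codec only changes what sits behind DocsEnum, not how callers iterate.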
[jira] Closed: (LUCENE-1984) DisjunctionMaxQuery - Type safety
[ https://issues.apache.org/jira/browse/LUCENE-1984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Kay closed LUCENE-1984. --- Thanks Uwe. The revised patch looks good as well, with better code readability. DisjunctionMaxQuery - Type safety --- Key: LUCENE-1984 URL: https://issues.apache.org/jira/browse/LUCENE-1984 Project: Lucene - Java Issue Type: Improvement Components: Query/Scoring Affects Versions: 2.9 Reporter: Kay Kay Assignee: Uwe Schindler Fix For: 3.0 Attachments: LUCENE-1984.patch, LUCENE-1984.patch DisjunctionMaxQuery code has containers that are not type-safe . The comments indicate type-safety though. Better to express in the API and the internals the explicit type as opposed to type-less containers. Patch attached. Comments / backward compatibility concerns welcome. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Created: (LUCENE-1985) DisjunctionMaxQuery - Iterator code to for ( A a : container ) construct
DisjunctionMaxQuery - Iterator code to for ( A a : container ) construct --- Key: LUCENE-1985 URL: https://issues.apache.org/jira/browse/LUCENE-1985 Project: Lucene - Java Issue Type: Improvement Reporter: Kay Kay Priority: Trivial For better readability - converting the Iterable<T> iteration to for ( A a : container ) constructs, which are more intuitive to read. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (LUCENE-1986) NPE in NearSpansUnordered from PayloadNearQuery
NPE in NearSpansUnordered from PayloadNearQuery --- Key: LUCENE-1986 URL: https://issues.apache.org/jira/browse/LUCENE-1986 Project: Lucene - Java Issue Type: Bug Components: Search Affects Versions: 2.9 Reporter: Peter Keegan Attachments: TestPayloadNearQuery1.java The following query causes an NPE in NearSpansUnordered, and is reproducible with the attached unit test. The failure occurs on the last document scored. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (LUCENE-1985) DisjunctionMaxQuery - Iterator code to for ( A a : container ) construct
[ https://issues.apache.org/jira/browse/LUCENE-1985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Kay updated LUCENE-1985: Attachment: LUCENE-1985.patch DisjunctionMaxQuery - Iterator code to for ( A a : container ) construct --- Key: LUCENE-1985 URL: https://issues.apache.org/jira/browse/LUCENE-1985 Project: Lucene - Java Issue Type: Improvement Reporter: Kay Kay Priority: Trivial Attachments: LUCENE-1985.patch For better readability - converting the IterableT to for ( A a : container ) constructs that is more intuitive to read. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-1986) NPE in NearSpansUnordered from PayloadNearQuery
[ https://issues.apache.org/jira/browse/LUCENE-1986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Keegan updated LUCENE-1986: - Attachment: TestPayloadNearQuery1.java Unit test that causes the NPE NPE in NearSpansUnordered from PayloadNearQuery --- Key: LUCENE-1986 URL: https://issues.apache.org/jira/browse/LUCENE-1986 Project: Lucene - Java Issue Type: Bug Components: Search Affects Versions: 2.9 Reporter: Peter Keegan Attachments: TestPayloadNearQuery1.java The following query causes an NPE in NearSpansUnordered, and is reproducible with the attached unit test. The failure occurs on the last document scored. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (LUCENE-1984) DisjunctionMaxQuery - Type safety
[ https://issues.apache.org/jira/browse/LUCENE-1984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12766587#action_12766587 ] Kay Kay commented on LUCENE-1984: - As a related patch, LUCENE-1985 was added to improve readability by converting Iterable<?> iteration to the for loops introduced in Java 5. DisjunctionMaxQuery - Type safety --- Key: LUCENE-1984 URL: https://issues.apache.org/jira/browse/LUCENE-1984 Project: Lucene - Java Issue Type: Improvement Components: Query/Scoring Affects Versions: 2.9 Reporter: Kay Kay Assignee: Uwe Schindler Fix For: 3.0 Attachments: LUCENE-1984.patch, LUCENE-1984.patch DisjunctionMaxQuery code has containers that are not type-safe, though the comments indicate type-safety. Better to express the explicit type in the API and the internals, as opposed to type-less containers. Patch attached. Comments / backward compatibility concerns welcome. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (LUCENE-1124) short circuit FuzzyQuery.rewrite when input token length is small compared to minSimilarity
[ https://issues.apache.org/jira/browse/LUCENE-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated LUCENE-1124: --- Attachment: LUCENE-1124.patch Attached patch (based on 2.9) showing the bug, along with the fix. Instead of rewriting to an empty BooleanQuery when the prefix term is not long enough, I rewrite to a TermQuery with that prefix. This way the exact term matches. I'll commit shortly to trunk and 2.9.x. short circuit FuzzyQuery.rewrite when input token length is small compared to minSimilarity --- Key: LUCENE-1124 URL: https://issues.apache.org/jira/browse/LUCENE-1124 Project: Lucene - Java Issue Type: Improvement Components: Query/Scoring Reporter: Hoss Man Assignee: Mark Miller Priority: Trivial Fix For: 2.9 Attachments: LUCENE-1124.patch, LUCENE-1124.patch, LUCENE-1124.patch, LUCENE-1124.patch I found this (unreplied to) email floating around in my Lucene folder from during the holidays... {noformat} From: Timo Nentwig To: java-dev Subject: Fuzzy makes no sense for short tokens Date: Mon, 31 Dec 2007 16:01:11 +0100 Message-Id: 200712311601.12255.luc...@nitwit.de Hi! it generally makes no sense to search fuzzy for short tokens because changing even only a single character of course already results in a high edit distance. So it actually only makes sense in this case: if( token.length() > 1f / (1f - minSimilarity) ) E.g. changing one character in a 3-letter token (foo) results in an edit distance of 0.6. And if minSimilarity (which is by default: 0.5 :-) is higher we can save all the expensive rewrite() logic. {noformat} I don't know much about FuzzyQueries, but this reasoning seems sound ... FuzzyQuery.rewrite should be able to completely skip all TermEnumeration in the event that the input token is shorter than some simple math on the minSimilarity. (i'm not smart enough to be certain that the math above is right however ... it's been a while since i looked at Levenshtein distances ...
tests needed) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
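Timo's inequality above is just arithmetic over the usual fuzzy similarity definition (similarity = 1 - editDistance/length): a token shorter than 1/(1 - minSimilarity) cannot stay above minSimilarity after even a single edit. The helper below only illustrates that arithmetic; it is not the actual FuzzyQuery.rewrite code, and whether the boundary case counts is a detail the patch would have to pin down.

```java
// Arithmetic behind the proposed short circuit. With similarity defined as
// 1 - edits/length, a single edit leaves the token above minSimilarity only
// when length > 1/(1 - minSimilarity) (boundary handling is a choice).
public class FuzzyShortCircuit {
    // Similarity after `edits` single-character changes on a token of `length`.
    static float similarity(int edits, int length) {
        return 1f - (float) edits / length;
    }

    // Timo's condition: is term enumeration worth doing at all for this length?
    static boolean worthEnumerating(int length, float minSimilarity) {
        return length > 1f / (1f - minSimilarity);
    }

    public static void main(String[] args) {
        System.out.println(similarity(1, 3));           // ~0.667, Timo's "foo" example
        System.out.println(worthEnumerating(3, 0.5f));  // true with the 0.5 default
        System.out.println(worthEnumerating(3, 0.7f));  // false: 1/(1-0.7) is about 3.33
    }
}
```

So with the default minSimilarity of 0.5 only 1- and 2-letter tokens are skippable, but any higher minSimilarity widens the short circuit quickly.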
[jira] Reopened: (LUCENE-1124) short circuit FuzzyQuery.rewrite when input token length is small compared to minSimilarity
[ https://issues.apache.org/jira/browse/LUCENE-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless reopened LUCENE-1124: This fix breaks the case when the exact term is present in the index. short circuit FuzzyQuery.rewrite when input token length is small compared to minSimilarity --- Key: LUCENE-1124 URL: https://issues.apache.org/jira/browse/LUCENE-1124 Project: Lucene - Java Issue Type: Improvement Components: Query/Scoring Reporter: Hoss Man Assignee: Mark Miller Priority: Trivial Fix For: 2.9 Attachments: LUCENE-1124.patch, LUCENE-1124.patch, LUCENE-1124.patch, LUCENE-1124.patch I found this (unreplied to) email floating around in my Lucene folder from during the holidays... {noformat} From: Timo Nentwig To: java-dev Subject: Fuzzy makes no sense for short tokens Date: Mon, 31 Dec 2007 16:01:11 +0100 Message-Id: 200712311601.12255.luc...@nitwit.de Hi! it generally makes no sense to search fuzzy for short tokens because changing even only a single character of course already results in a high edit distance. So it actually only makes sense in this case: if( token.length() 1f / (1f - minSimilarity) ) E.g. changing one character in a 3-letter token (foo) results in an edit distance of 0.6. And if minSimilarity (which is by default: 0.5 :-) is higher we can save all the expensive rewrite() logic. {noformat} I don't know much about FuzzyQueries, but this reasoning seems sound ... FuzzyQuery.rewrite should be able to completely skip all TermEnumeration in the event that the input token is shorter then some simple math on the minSimilarity. (i'm not smart enough to be certain that the math above is right however ... it's been a while since i looked at Levenstein distances ... tests needed) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. 
[jira] Resolved: (LUCENE-1985) DisjunctionMaxQuery - Iterator code to for ( A a : container ) construct
[ https://issues.apache.org/jira/browse/LUCENE-1985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler resolved LUCENE-1985. --- Resolution: Fixed Fix Version/s: 3.0 Assignee: Uwe Schindler Committed revision: 825989 Thanks Kay Kay! For further Java5 fixes, just add it to LUCENE-1257. DisjunctionMaxQuery - Iterator code to for ( A a : container ) construct --- Key: LUCENE-1985 URL: https://issues.apache.org/jira/browse/LUCENE-1985 Project: Lucene - Java Issue Type: Improvement Reporter: Kay Kay Assignee: Uwe Schindler Priority: Trivial Fix For: 3.0 Attachments: LUCENE-1985.patch For better readability - converting the IterableT to for ( A a : container ) constructs that is more intuitive to read. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: lucene 2.9 sorting algorithm
Mike, just a clarification on my first perf report email. The first section, numHits, is incorrectly labeled; it should be 20 instead of 50. Sorry about the possible confusion. Thanks -John On Fri, Oct 16, 2009 at 3:21 AM, Michael McCandless luc...@mikemccandless.com wrote: Thanks John; I'll have a look. Mike On Fri, Oct 16, 2009 at 12:57 AM, John Wang john.w...@gmail.com wrote: Hi Michael: I added classes: ScoreDocComparatorQueue and OneSortNoScoreCollector as a more general case. I think keeping the old api for ScoreDocComparator and SortComparatorSource would work. Please take a look. Thanks -John On Thu, Oct 15, 2009 at 6:52 PM, John Wang john.w...@gmail.com wrote: Hi Michael: It is open, http://code.google.com/p/lucene-book/source/checkout I think I sent the https url instead, sorry. The multi PQ sorting is fairly self-contained; I have 2 versions, 1 for string and 1 for int, each are Collector impls. I shouldn't say the multi PQ is faster on int sort, it is within the error boundary. The diff is very very small; I would say they are more or less equal. If you think it is a good thing to go this way (if not for the perf, just for the simpler api) I'd be happy to work on a patch. Thanks -John On Thu, Oct 15, 2009 at 5:18 PM, Michael McCandless luc...@mikemccandless.com wrote: John, looks like this requires login -- any plans to open that up, or, post the code on an issue? How self-contained is your multi PQ sorting? EG is it a standalone Collector impl that I can test? Mike On Thu, Oct 15, 2009 at 6:33 PM, John Wang john.w...@gmail.com wrote: BTW, we have a little sandbox for these experiments, and all my test code is at https://lucene-book.googlecode.com/svn/trunk (not very polished). -John On Thu, Oct 15, 2009 at 3:29 PM, John Wang john.w...@gmail.com wrote: Numbers Mike requested for int types: only the time/cputime are posted; the others are all the same since the algorithm is the same.
Lucene 2.9:
numHits: 10 time: 14619495 cpu: 146126
numHits: 20 time: 14550568 cpu: 163242
numHits: 100 time: 16467647 cpu: 178379
my test:
numHits: 10 time: 14101094 cpu: 144715
numHits: 20 time: 14804821 cpu: 151305
numHits: 100 time: 15372157 cpu: 158842
Conclusions: They are very similar; the differences are all within error bounds, especially at lower PQ sizes, where the second sort alg is again slightly faster. Hope this helps. -John On Thu, Oct 15, 2009 at 3:04 PM, Yonik Seeley yo...@lucidimagination.com wrote: On Thu, Oct 15, 2009 at 5:33 PM, Michael McCandless luc...@mikemccandless.com wrote: Though it'd be odd if the switch to searching by segment really was most of the gains here. I had assumed that much of the improvement was due to ditching MultiTermEnum/MultiTermDocs. Note that LUCENE-1483 was before LUCENE-1596... but that only helps with queries that use a TermEnum (range, prefix, etc). -Yonik http://www.lucidimagination.com - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
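For readers following the single-PQ vs multi-PQ comparison: the step both approaches share is bounded top-N collection with a priority queue of size numHits. A minimal, Lucene-free sketch of that core (plain ints standing in for scored docs; this is not the actual Collector code from either patch):

```java
import java.util.Arrays;
import java.util.PriorityQueue;

// Bounded top-N collection: keep a min-heap of the n largest values seen,
// evicting the smallest whenever a better candidate arrives. This is the
// common core of the PQ-based sort collectors discussed in the thread.
public class TopN {
    static int[] topN(int[] values, int n) {
        PriorityQueue<Integer> pq = new PriorityQueue<>();  // min-heap
        for (int v : values) {
            if (pq.size() < n) pq.add(v);
            else if (v > pq.peek()) { pq.poll(); pq.add(v); }  // beat the current worst
        }
        int[] out = new int[pq.size()];
        // Polls come out ascending, so fill from the back for descending order.
        for (int i = out.length - 1; i >= 0; i--) out[i] = pq.poll();
        return out;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(topN(new int[]{5, 1, 9, 3, 7, 2}, 3)));  // [9, 7, 5]
    }
}
```

The perf differences being measured above come from what surrounds this loop (one queue with a composite comparator vs one queue per sort field), not from the queue discipline itself.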
Re: lucene 2.9 sorting algorithm
Oh, no problem... Mike On Fri, Oct 16, 2009 at 12:33 PM, John Wang john.w...@gmail.com wrote: Mike, just a clarification on my first perf report email. The first section, numHits, is incorrectly labeled; it should be 20 instead of 50. Sorry about the possible confusion. Thanks -John
ant build-contrib fails on trunk?
When I run ant build-contrib on current trunk, I hit this: compile-core: [javac] Compiling 1 source file to /lucene/tmp2/build/contrib/instantiated/classes/java [javac] /lucene/tmp2/contrib/instantiated/src/java/org/apache/lucene/store/instantiated/InstantiatedTermDocumentInformation.java:48: compareTo(org.apache.lucene.index.Term) in org.apache.lucene.index.Term cannot be applied to (org.apache.lucene.store.instantiated.InstantiatedTerm) [javac] return instantiatedTermDocumentInformation.getTerm().getTerm().compareTo(instantiatedTermDocumentInformation1.getTerm()); [javac] ^ [javac] 1 error Is anyone else seeing this? Mike - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-1124) short circuit FuzzyQuery.rewrite when input token length is small compared to minSimilarity
[ https://issues.apache.org/jira/browse/LUCENE-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated LUCENE-1124: --- Fix Version/s: (was: 2.9) 3.0 2.9.1 short circuit FuzzyQuery.rewrite when input token length is small compared to minSimilarity --- Key: LUCENE-1124 URL: https://issues.apache.org/jira/browse/LUCENE-1124 Project: Lucene - Java Issue Type: Improvement Components: Query/Scoring Reporter: Hoss Man Assignee: Mark Miller Priority: Trivial Fix For: 2.9.1, 3.0 Attachments: LUCENE-1124.patch, LUCENE-1124.patch, LUCENE-1124.patch, LUCENE-1124.patch I found this (unreplied to) email floating around in my Lucene folder from during the holidays... {noformat} From: Timo Nentwig To: java-dev Subject: Fuzzy makes no sense for short tokens Date: Mon, 31 Dec 2007 16:01:11 +0100 Message-Id: 200712311601.12255.luc...@nitwit.de Hi! it generally makes no sense to search fuzzy for short tokens because changing even only a single character of course already results in a high edit distance. So it actually only makes sense in this case: if( token.length() 1f / (1f - minSimilarity) ) E.g. changing one character in a 3-letter token (foo) results in an edit distance of 0.6. And if minSimilarity (which is by default: 0.5 :-) is higher we can save all the expensive rewrite() logic. {noformat} I don't know much about FuzzyQueries, but this reasoning seems sound ... FuzzyQuery.rewrite should be able to completely skip all TermEnumeration in the event that the input token is shorter then some simple math on the minSimilarity. (i'm not smart enough to be certain that the math above is right however ... it's been a while since i looked at Levenstein distances ... tests needed) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
RE: ant build-contrib fails on trunk?
I'll fix, this is because of generics and compareTo(). I revert the change. - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: Michael McCandless [mailto:luc...@mikemccandless.com] Sent: Friday, October 16, 2009 7:01 PM To: java-dev@lucene.apache.org Subject: ant build-contrib fails on trunk? When I run ant build-contrib on current trunk, I hit this: compile-core: [javac] Compiling 1 source file to /lucene/tmp2/build/contrib/instantiated/classes/java [javac] /lucene/tmp2/contrib/instantiated/src/java/org/apache/lucene/store/instant iated/InstantiatedTermDocumentInformation.java:48: compareTo(org.apache.lucene.index.Term) in org.apache.lucene.index.Term cannot be applied to (org.apache.lucene.store.instantiated.InstantiatedTerm) [javac] return instantiatedTermDocumentInformation.getTerm().getTerm().compareTo(instanti atedTermDocumentInformation1.getTerm()); [javac] ^ [javac] 1 error Is anyone else seeing this? Mike - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: ant build-contrib fails on trunk?
yes, not just you On Fri, Oct 16, 2009 at 1:00 PM, Michael McCandless luc...@mikemccandless.com wrote: When I run ant build-contrib on current trunk, I hit this: compile-core: [javac] Compiling 1 source file to /lucene/tmp2/build/contrib/instantiated/classes/java [javac] /lucene/tmp2/contrib/instantiated/src/java/org/apache/lucene/store/instantiated/InstantiatedTermDocumentInformation.java:48: compareTo(org.apache.lucene.index.Term) in org.apache.lucene.index.Term cannot be applied to (org.apache.lucene.store.instantiated.InstantiatedTerm) [javac] return instantiatedTermDocumentInformation.getTerm().getTerm().compareTo(instantiatedTermDocumentInformation1.getTerm()); [javac] ^ [javac] 1 error Is anyone else seeing this? Mike - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org -- Robert Muir rcm...@gmail.com
RE: ant build-contrib fails on trunk?
It was not the generics change, it was a bug in the comparator: there was one getTerm() missing. I'll add it. The compiler found the error because of generics; the signature didn't match correctly (in 1.4 it was just Object without a generics hint, now there are compareTo(Object) and compareTo(Term), and InstantiatedTerm matches neither). Committed revision: 826011 - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: Uwe Schindler [mailto:u...@thetaphi.de] Sent: Friday, October 16, 2009 7:10 PM To: java-dev@lucene.apache.org Subject: RE: ant build-contrib fails on trunk? I'll fix, this is because of generics and compareTo(). I revert the change.
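The build break above is a nice illustration of why the generics cleanup pays off: with a raw Comparable, passing the wrong argument type compiles and only fails at run time, while with Comparable<T> javac rejects it outright, as it just did for InstantiatedTerm. A self-contained illustration with plain String/Integer stand-ins (not the Lucene classes):

```java
// Why the generified compareTo caught this bug: raw types accept any Object
// and blow up at run time; Comparable<T> turns the same mistake into a
// compile error. Plain stand-in types here, not Term/InstantiatedTerm.
public class RawVsGenericCompare {
    public static void main(String[] args) {
        Comparable raw = "abc";  // raw type, pre-generics style
        try {
            raw.compareTo(Integer.valueOf(42));  // compiles, but the cast fails
        } catch (ClassCastException e) {
            System.out.println("runtime failure: " + e.getClass().getSimpleName());
        }

        Comparable<String> typed = "abc";
        // typed.compareTo(Integer.valueOf(42));  // would not compile: wrong argument type
        System.out.println(typed.compareTo("abd") < 0);  // true: well-typed comparison
    }
}
```

That is exactly the trade seen on trunk: the 1.4 code compiled silently with the missing getTerm(), and generification surfaced it as the javac error quoted in this thread.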
Re: ant build-contrib fails on trunk?
OK, thanks! Mike On Fri, Oct 16, 2009 at 1:09 PM, Uwe Schindler u...@thetaphi.de wrote: I'll fix; this is because of generics and compareTo(). I'll revert the change. - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: Michael McCandless [mailto:luc...@mikemccandless.com] Sent: Friday, October 16, 2009 7:01 PM To: java-dev@lucene.apache.org Subject: ant build-contrib fails on trunk? When I run ant build-contrib on current trunk, I hit this: compile-core: [javac] Compiling 1 source file to /lucene/tmp2/build/contrib/instantiated/classes/java [javac] /lucene/tmp2/contrib/instantiated/src/java/org/apache/lucene/store/instantiated/InstantiatedTermDocumentInformation.java:48: compareTo(org.apache.lucene.index.Term) in org.apache.lucene.index.Term cannot be applied to (org.apache.lucene.store.instantiated.InstantiatedTerm) [javac] return instantiatedTermDocumentInformation.getTerm().getTerm().compareTo(instantiatedTermDocumentInformation1.getTerm()); [javac] ^ [javac] 1 error Is anyone else seeing this? Mike - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Resolved: (LUCENE-1124) short circuit FuzzyQuery.rewrite when input token length is small compared to minSimilarity
[ https://issues.apache.org/jira/browse/LUCENE-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless resolved LUCENE-1124. Resolution: Fixed short circuit FuzzyQuery.rewrite when input token length is small compared to minSimilarity --- Key: LUCENE-1124 URL: https://issues.apache.org/jira/browse/LUCENE-1124 Project: Lucene - Java Issue Type: Improvement Components: Query/Scoring Reporter: Hoss Man Assignee: Mark Miller Priority: Trivial Fix For: 2.9.1, 3.0 Attachments: LUCENE-1124.patch, LUCENE-1124.patch, LUCENE-1124.patch, LUCENE-1124.patch I found this (unreplied to) email floating around in my Lucene folder from during the holidays... {noformat} From: Timo Nentwig To: java-dev Subject: Fuzzy makes no sense for short tokens Date: Mon, 31 Dec 2007 16:01:11 +0100 Message-Id: 200712311601.12255.luc...@nitwit.de Hi! it generally makes no sense to search fuzzy for short tokens because changing even only a single character of course already results in a high edit distance. So it actually only makes sense in this case: if( token.length() > 1f / (1f - minSimilarity) ) E.g. changing one character in a 3-letter token (foo) results in an edit distance of 0.6. And if minSimilarity (which is by default: 0.5 :-) is higher we can save all the expensive rewrite() logic. {noformat} I don't know much about FuzzyQueries, but this reasoning seems sound ... FuzzyQuery.rewrite should be able to completely skip all TermEnumeration in the event that the input token is shorter than some simple math on the minSimilarity. (i'm not smart enough to be certain that the math above is right however ... it's been a while since i looked at Levenshtein distances ... tests needed) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
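The check Timo proposes can be sketched as a standalone predicate. This is a hedged illustration of the math in the email, not the committed patch: a single edit on a token of length n leaves at best similarity 1 - 1/n, so if even that best case falls below minSimilarity, rewrite() can skip term enumeration entirely (the method name is invented):

```java
// Hedged sketch of the proposed short-circuit, not the committed patch.
public class FuzzyShortCircuit {
    // A single edit on a token of length n gives similarity 1 - 1/n;
    // if even that falls below minSimilarity, skip the rewrite entirely.
    public static boolean canPossiblyMatch(String token, float minSimilarity) {
        return token.length() > 1f / (1f - minSimilarity);
    }

    public static void main(String[] args) {
        // Default minSimilarity 0.5 gives threshold 2.0, so "foo"
        // (length 3) is worth rewriting:
        System.out.println(canPossiblyMatch("foo", 0.5f)); // true
        // minSimilarity 0.7 gives threshold ~3.33, so "foo" is skipped:
        System.out.println(canPossiblyMatch("foo", 0.7f)); // false
    }
}
```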
[jira] Updated: (LUCENE-1257) Port to Java5
[ https://issues.apache.org/jira/browse/LUCENE-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Kay updated LUCENE-1257: Attachment: LUCENE-1257-DisjunctionMaxQuery-more_type_safety.patch * DisjunctionMaxQuery.java - some of the casts are not necessary now that the members are made type-safe. Port to Java5 - Key: LUCENE-1257 URL: https://issues.apache.org/jira/browse/LUCENE-1257 Project: Lucene - Java Issue Type: Improvement Components: Analysis, Examples, Index, Other, Query/Scoring, QueryParser, Search, Store, Term Vectors Affects Versions: 2.3.1 Reporter: Cédric Champeau Assignee: Uwe Schindler Priority: Minor Fix For: 3.0 Attachments: instantiated_fieldable.patch, java5.patch, LUCENE-1257-DisjunctionMaxQuery-more_type_safety.patch, LUCENE-1257-Document.patch, LUCENE-1257-StringBuffer.patch, LUCENE-1257-StringBuffer.patch, LUCENE-1257-StringBuffer.patch, LUCENE-1257_messages.patch, lucene1257surround1.patch, lucene1257surround1.patch, shinglematrixfilter_generified.patch For my needs I've updated Lucene so that it uses Java 5 constructs. I know Java 5 migration had been planned for 2.1 someday in the past, but don't know when it is planned now. This patch against the trunk includes : - most obvious generics usage (there are tons of usages of sets, ... Those which are commonly used have been generified) - PriorityQueue generification - replacement of indexed for loops with for-each constructs - removal of unnecessary unboxing The code is, in my opinion, much more readable with those features (you actually *know* what is stored in collections reading the code, without the need to look up field definitions every time) and it simplifies many algorithms. Note that this patch also includes an interface for the Query class. This has been done for my company's needs for building custom Query classes which add some behaviour to the base Lucene queries. It prevents multiple unnecessary casts. 
I know this introduction is not wanted by the team, but it really makes our developments easier to maintain. If you don't want to use this, replace all /Queriable/ calls with standard /Query/. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
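The kinds of changes the patch describes (generics, for-each, unboxing) can be sketched on an invented example; none of the names below come from the actual patch:

```java
import java.util.HashMap;
import java.util.Map;

// Hedged before/after sketch of the Java 5 changes the patch describes;
// the class and field names are invented for illustration.
public class Java5Port {
    public static Map<String, Integer> buildFreqs() {
        // Before (raw types): Map freqs = new HashMap();
        // After: the element types are visible at the declaration site.
        Map<String, Integer> freqs = new HashMap<String, Integer>();
        freqs.put("lucene", 42);
        freqs.put("index", 7);
        return freqs;
    }

    public static void main(String[] args) {
        Map<String, Integer> freqs = buildFreqs();
        // Before: int f = ((Integer) freqs.get("lucene")).intValue();
        // After: auto-unboxing, no cast:
        int f = freqs.get("lucene");
        System.out.println(f); // 42
    }
}
```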
[jira] Commented: (LUCENE-1985) DisjunctionMaxQuery - Iterator code to for ( A a : container ) construct
[ https://issues.apache.org/jira/browse/LUCENE-1985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12766652#action_12766652 ] Kay Kay commented on LUCENE-1985: - Thanks Uwe. Added another patch to LUCENE-1257 to get away from some of the casting that is not necessary given that LUCENE-1984 and LUCENE-1985 are in now ( with generics ). DisjunctionMaxQuery - Iterator code to for ( A a : container ) construct --- Key: LUCENE-1985 URL: https://issues.apache.org/jira/browse/LUCENE-1985 Project: Lucene - Java Issue Type: Improvement Reporter: Kay Kay Assignee: Uwe Schindler Priority: Trivial Fix For: 3.0 Attachments: LUCENE-1985.patch For better readability - converting the Iterable<T> iteration to for ( A a : container ) constructs, which are more intuitive to read. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
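The mechanical rewrite the issue describes looks like the following hedged sketch. The container and method names are invented; DisjunctionMaxQuery's actual internals are not shown:

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hedged sketch of the iterator-to-for-each rewrite on an invented
// container; not the DisjunctionMaxQuery code itself.
public class ForEachRewrite {
    // Before: explicit Iterator plumbing.
    public static String joinWithIterator(List<String> items) {
        StringBuilder sb = new StringBuilder();
        for (Iterator<String> it = items.iterator(); it.hasNext();) {
            sb.append(it.next()).append(' ');
        }
        return sb.toString();
    }

    // After: for ( A a : container ) - same traversal, less noise.
    public static String joinWithForEach(List<String> items) {
        StringBuilder sb = new StringBuilder();
        for (String s : items) {
            sb.append(s).append(' ');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> disjuncts = Arrays.asList("title:foo", "body:foo");
        System.out.println(
            joinWithIterator(disjuncts).equals(joinWithForEach(disjuncts))); // true
    }
}
```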
[jira] Commented: (LUCENE-1976) isCurrent() and getVersion() on an NRT reader are broken
[ https://issues.apache.org/jira/browse/LUCENE-1976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12766654#action_12766654 ] Michael McCandless commented on LUCENE-1976: I plan to back-port this to 2.9.x, since we're doing a 2.9.1 shortly... isCurrent() and getVersion() on an NRT reader are broken Key: LUCENE-1976 URL: https://issues.apache.org/jira/browse/LUCENE-1976 Project: Lucene - Java Issue Type: Bug Components: Index Affects Versions: 2.9 Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Fix For: 3.1 Attachments: LUCENE-1976.patch Right now isCurrent() will always return true for an NRT reader and getVersion() will always return the version of the last commit. This is because the NRT reader holds the live segmentInfos. I think isCurrent() should return false when any further changes have occurred with the writer, else true. This is actually fairly easy to determine, since the writer tracks how many docs and deletions are buffered in RAM and these counters only increase with each change. getVersion should return the version as of when the reader was created. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
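The counter-snapshot idea in the issue description can be sketched as follows. All names here are invented (the real IndexWriter/IndexReader fields differ); it only illustrates how monotonically increasing buffered-docs and buffered-deletes counters let a reader answer isCurrent():

```java
// Hedged sketch (all names invented): an NRT reader snapshots the
// writer's change counters at open time and is current iff they
// have not moved since.
public class NrtCurrentCheck {
    public static class WriterCounters {
        public long bufferedDocs;
        public long bufferedDeletes;
    }

    public static class ReaderSnapshot {
        final long docs, deletes;
        public ReaderSnapshot(WriterCounters w) {
            this.docs = w.bufferedDocs;
            this.deletes = w.bufferedDeletes;
        }
        // Current iff no further adds or deletes since the snapshot.
        public boolean isCurrent(WriterCounters w) {
            return w.bufferedDocs == docs && w.bufferedDeletes == deletes;
        }
    }

    public static void main(String[] args) {
        WriterCounters w = new WriterCounters();
        w.bufferedDocs = 10;
        ReaderSnapshot r = new ReaderSnapshot(w);
        System.out.println(r.isCurrent(w)); // true
        w.bufferedDeletes++;                // a delete arrives
        System.out.println(r.isCurrent(w)); // false
    }
}
```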
[jira] Commented: (LUCENE-1257) Port to Java5
[ https://issues.apache.org/jira/browse/LUCENE-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12766657#action_12766657 ] Uwe Schindler commented on LUCENE-1257: --- Committed revision: 826035 Port to Java5 - Key: LUCENE-1257 URL: https://issues.apache.org/jira/browse/LUCENE-1257 Project: Lucene - Java Issue Type: Improvement Components: Analysis, Examples, Index, Other, Query/Scoring, QueryParser, Search, Store, Term Vectors Affects Versions: 2.3.1 Reporter: Cédric Champeau Assignee: Uwe Schindler Priority: Minor Fix For: 3.0 Attachments: instantiated_fieldable.patch, java5.patch, LUCENE-1257-DisjunctionMaxQuery-more_type_safety.patch, LUCENE-1257-Document.patch, LUCENE-1257-StringBuffer.patch, LUCENE-1257-StringBuffer.patch, LUCENE-1257-StringBuffer.patch, LUCENE-1257_messages.patch, lucene1257surround1.patch, lucene1257surround1.patch, shinglematrixfilter_generified.patch For my needs I've updated Lucene so that it uses Java 5 constructs. I know Java 5 migration had been planned for 2.1 someday in the past, but don't know when it is planned now. This patch against the trunk includes : - most obvious generics usage (there are tons of usages of sets, ... Those which are commonly used have been generified) - PriorityQueue generification - replacement of indexed for loops with for-each constructs - removal of unnecessary unboxing The code is, in my opinion, much more readable with those features (you actually *know* what is stored in collections reading the code, without the need to look up field definitions every time) and it simplifies many algorithms. Note that this patch also includes an interface for the Query class. This has been done for my company's needs for building custom Query classes which add some behaviour to the base Lucene queries. It prevents multiple unnecessary casts. I know this introduction is not wanted by the team, but it really makes our developments easier to maintain. 
If you don't want to use this, replace all /Queriable/ calls with standard /Query/. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: JDBC access to a Lucene index
I'm not aware of any, but you might get more mileage asking on java-user. On Oct 16, 2009, at 3:54 AM, Jukka Zitting wrote: Hi, Some while ago I implemented a simple JDBC to JCR bridge [1] that allows one to query a JCR repository from any JDBC client, most notably various reporting tools. Now I'm wondering if something similar already exists for a normal Lucene index. Something that would treat your entire index as one huge table (or perhaps a set of tables based on some document type field) and would allow you to use simple SQL SELECTs to query data. Any pointers would be welcome. [1] http://dev.day.com/microsling/content/blogs/main/jdbc2jcr.html BR, Jukka Zitting - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Resolved: (LUCENE-1976) isCurrent() and getVersion() on an NRT reader are broken
[ https://issues.apache.org/jira/browse/LUCENE-1976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless resolved LUCENE-1976. Resolution: Fixed Fix Version/s: (was: 3.1) 3.0 2.9.1 isCurrent() and getVersion() on an NRT reader are broken Key: LUCENE-1976 URL: https://issues.apache.org/jira/browse/LUCENE-1976 Project: Lucene - Java Issue Type: Bug Components: Index Affects Versions: 2.9 Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Fix For: 2.9.1, 3.0 Attachments: LUCENE-1976.patch Right now isCurrent() will always return true for an NRT reader and getVersion() will always return the version of the last commit. This is because the NRT reader holds the live segmentInfos. I think isCurrent() should return false when any further changes have occurred with the writer, else true. This is actually fairly easy to determine, since the writer tracks how many docs and deletions are buffered in RAM and these counters only increase with each change. getVersion should return the version as of when the reader was created. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Created: (LUCENE-1987) Remove rest of analysis deprecations (Token, CharacterCache)
Remove rest of analysis deprecations (Token, CharacterCache) Key: LUCENE-1987 URL: https://issues.apache.org/jira/browse/LUCENE-1987 Project: Lucene - Java Issue Type: Task Components: Analysis Reporter: Uwe Schindler Assignee: Uwe Schindler Fix For: 3.0 This removes the rest of the deprecations in the analysis package: - Token's termText field - eventually un-deprecate ctors of Token taking Strings (they are still useful) - if yes remove deprec in 2.9.1 - remove CharacterCache and use Character.valueOf() from Java5 - Some Analyzers have stopword lists in the wrong format (HashMaps) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
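The CharacterCache removal rests on a JDK guarantee worth spelling out: Character.valueOf() is documented to cache and return the same Character instance for values in the range \u0000 to \u007F, which is what a hand-rolled cache existed to do. A minimal illustration:

```java
// Hedged sketch: the Java 5 replacement the issue mentions. A custom
// CharacterCache avoided allocating a new Character per char;
// Character.valueOf() provides the same caching in the JDK (its javadoc
// guarantees instance reuse for chars <= \u007F).
public class ValueOfCache {
    public static void main(String[] args) {
        Character a1 = Character.valueOf('a');
        Character a2 = Character.valueOf('a');
        // Within the cached range, the same instance comes back:
        System.out.println(a1 == a2); // true
    }
}
```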
[jira] Updated: (LUCENE-1987) Remove rest of analysis deprecations (Token, CharacterCache)
[ https://issues.apache.org/jira/browse/LUCENE-1987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-1987: -- Attachment: LUCENE-1987.patch Patch with the first three points. The three deprecated methods should stay alive in my opinion. Copying the string to the term buffer in the ctor is the same as copying an initial term buffer. If we remove these ctors, we should also remove the setTermBuffer(String) method; otherwise it is inconsistent. If the others agree to keep these three ctors alive I will apply an undeprecation in 2.9 branch. Remove rest of analysis deprecations (Token, CharacterCache) Key: LUCENE-1987 URL: https://issues.apache.org/jira/browse/LUCENE-1987 Project: Lucene - Java Issue Type: Task Components: Analysis Reporter: Uwe Schindler Assignee: Uwe Schindler Fix For: 3.0 Attachments: LUCENE-1987.patch This removes the rest of the deprecations in the analysis package: - Token's termText field - eventually un-deprecate ctors of Token taking Strings (they are still useful) - if yes remove deprec in 2.9.1 - remove CharacterCache and use Character.valueOf() from Java5 - Some Analyzers have stopword lists in the wrong format (HashMaps) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
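Uwe's consistency argument can be made concrete with a toy Token (this is an invented sketch, not Lucene's Token class): a String-taking ctor and setTermBuffer(String) perform the same character copy, so deprecating one but keeping the other is arbitrary.

```java
// Hedged sketch with an invented toy Token, only to illustrate that
// the String ctor and setTermBuffer(String) do the same copy.
public class TokenSketch {
    public static class Token {
        char[] buf = new char[0];
        int len;
        public Token() {}
        public Token(String text) { setTermBuffer(text); } // same copy as below
        public void setTermBuffer(String text) {
            len = text.length();
            if (buf.length < len) buf = new char[len];
            text.getChars(0, len, buf, 0);
        }
        public String term() { return new String(buf, 0, len); }
    }

    public static void main(String[] args) {
        Token viaCtor = new Token("foo");
        Token viaSetter = new Token();
        viaSetter.setTermBuffer("foo");
        System.out.println(viaCtor.term().equals(viaSetter.term())); // true
    }
}
```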
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12766832#action_12766832 ] Mark Miller commented on LUCENE-1458: - Almost got an initial rough stab at the sep codec cache done - just have to get two more tests to pass involving the payload's state. Further steps towards flexible indexing --- Key: LUCENE-1458 URL: https://issues.apache.org/jira/browse/LUCENE-1458 Project: Lucene - Java Issue Type: New Feature Components: Index Affects Versions: 2.9 Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Attachments: LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2 I attached a very rough checkpoint of my current patch, to get early feedback. All tests pass, though back compat tests don't pass due to changes to package-private APIs plus certain bugs in tests that happened to work (eg call TermPostions.nextPosition() too many times, which the new API asserts against). [Aside: I think, when we commit changes to package-private APIs such that back-compat tests don't pass, we could go back, make a branch on the back-compat tag, commit changes to the tests to use the new package private APIs on that branch, then fix nightly build to use the tip of that branch?] There's still plenty to do before this is committable! This is a rather large change: * Switches to a new more efficient terms dict format. 
This still uses tii/tis files, but the tii only stores the term and its long offset (not a TermInfo). At seek points, tis encodes term freq/prox offsets absolutely instead of with deltas. Also, tis/tii are structured by field, so we don't have to record the field number in every term. . On first 1 M docs of Wikipedia, tii file is 36% smaller (0.99 MB -> 0.64 MB) and tis file is 9% smaller (75.5 MB -> 68.5 MB). . RAM usage when loading terms dict index is significantly less since we only load an array of offsets and an array of String (no more TermInfo array). It should be faster to init too. . This part is basically done. * Introduces modular reader codec that strongly decouples terms dict from docs/positions readers. EG there is no more TermInfo used when reading the new format. . There's nice symmetry now between reading and writing in the codec chain -- the current docs/prox format is captured in: {code} FormatPostingsTermsDictWriter/Reader FormatPostingsDocsWriter/Reader (.frq file) and FormatPostingsPositionsWriter/Reader (.prx file). {code} This part is basically done. * Introduces a new flex API for iterating through the fields, terms, docs and positions: {code} FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum {code} This replaces TermEnum/Docs/Positions. SegmentReader emulates the old API on top of the new API to keep back-compat. Next steps: * Plug in new codecs (pulsing, pfor) to exercise the modularity / fix any hidden assumptions. * Expose new API out of IndexReader, deprecate old API but emulate old API on top of new one, switch all core/contrib users to the new API. * Maybe switch to AttributeSources as the base class for TermsEnum, DocsEnum, PostingsEnum -- this would give readers API flexibility (not just index-file-format flexibility). EG if someone wanted to store payload at the term-doc level instead of term-doc-position level, you could just add a new attribute. * Test performance, iterate. -- This message is automatically generated by JIRA. 
- You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
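The nested enumeration chain the issue describes (fields, then terms, then docs) can be sketched with invented stand-in types; the real flex API classes are not shown in this thread:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hedged sketch: nested maps stand in for a real inverted index, and
// the three loops mirror the FieldProducer -> TermsEnum -> DocsEnum
// enumeration levels named in the issue.
public class FlexApiSketch {
    public static long sumDocIds(Map<String, Map<String, int[]>> index) {
        long total = 0;
        for (Map.Entry<String, Map<String, int[]>> field : index.entrySet()) {   // fields level
            for (Map.Entry<String, int[]> term : field.getValue().entrySet()) {  // terms level
                for (int doc : term.getValue()) {                                // docs level
                    total += doc;
                }
            }
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, Map<String, int[]>> index = new LinkedHashMap<>();
        Map<String, int[]> bodyTerms = new LinkedHashMap<>();
        bodyTerms.put("lucene", new int[] {0, 3, 7});
        index.put("body", bodyTerms);
        System.out.println(sumDocIds(index)); // 10
    }
}
```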