Re: character escapes in source? ... was: Re: Eclipse: Invalid character constant
On Fri, Apr 8, 2011 at 03:01, Robert Muir rcm...@gmail.com wrote:

On Thu, Apr 7, 2011 at 6:48 PM, Chris Hostetter hossman_luc...@fucit.org wrote:
: -1. These files should be readable, for maintaining, debugging and
: knowing what's going on.

Readability is my main concern ... I don't know (and frequently can't tell) the difference between a lot of non-ASCII characters -- and I'm guessing I'm not alone. When it's spelled out explicitly using the character name or escape code, there is no ambiguity about what character was intended, or whether it got screwed up by some tool along the way (i.e. the svn server, an svn client, the patch command, a text editor, an IDE, ant's fixcrlf task, etc...)

Please take the time, just 5 or 10 minutes, to look through some of this source code and tests. Imagine if you couldn't just look at the code to see what it does, but had to decode it from some crazy numeric encoding scheme. Imagine if it were this way for things like stopword lists too. It would be basically impossible for you to look at the code and figure out what it does! For example, try looking at the Thai analyzer tests; if these were all numbers, how would you know wtf is going on?

Although this comes up from time to time, I stand firm on my -1 because it's important to me for the source code to be readable. I'm not willing to give this up just because some people cannot read writing system XYZ. I have said before, I'm willing to change my -1 vote on this if *ALL* string constants (including English ones) are changed to character escapes. If you imagine what the code would look like if English string constants were instead codes, then I think you will understand my point of view! It's really, really important to source code readability to be able to open a file and understand what it does, not to have to use some decoder because it uses characters other people don't understand.

I think having both raw characters /and/ the encoded representation is best? (one of them in comments)

I'm all for Unicode sources, but at least two things hit me repeatedly:

1. Tools do screw up, and you have to recover somehow. E.g. IntelliJ IDEA's 'shelve' function uses the platform default encoding (MacRoman in my case) and I've lost some text on things I shelved but never committed anywhere.

2. There are characters that look all the same, e.g. different whitespace/dashes. Or (if you have Cyrillic in your fonts), I dare you to discern between a/а, c/с, e/е, o/о. These are different characters from the Latin and Cyrillic charsets (left Latin/right Cyrillic), but in 99% of fonts they are visually identical. I had a filter that folded up similar-looking characters, and it was documented in exactly this way - raw char + code.

--
Kirill Zakharenko/Кирилл Захаренко
E-Mail/Jabber: ear...@gmail.com
Phone: +7 (495) 683-567-4
ICQ: 104465785
Lucene Search with MoreLikeThis
Hi, I need some ideas about Lucene MoreLikeThis. I want to search records conditioned on more than one field name using MoreLikeThis. Right now I have code like this:

IndexReader indexreader = IndexReader.open(PropertyManager.getStringProperty("FAW.Lucene.index_path"));
IndexSearcher searcher = new IndexSearcher(indexreader);
MoreLikeThis mlt = new MoreLikeThis(indexreader);
mlt.setMinDocFreq(0);
mlt.setMinTermFreq(0);
mlt.setFieldNames(new String[]{"serviceNumber", "env"});
Query query = mlt.like(hits.id(0)); // configure mlt before building the query
Hits list = searcher.search(query);
Iterator itr1 = list.iterator();
while (itr1.hasNext()) { ... }

But it fetches all records that match on either serviceNumber or env. I need the records where both fields match. For example, with data like:

serviceNumber | env  | id
1             | env1 | 2
1             | env2 | 3
2             | env1 | 4
1             | env1 | 5

I want to fetch:

serviceNumber | env  | id
1             | env1 | 2
1             | env1 | 5

This is my requirement. Can anyone help me, or share any ideas? Thanks in advance.

-- View this message in context: http://lucene.472066.n3.nabble.com/Lucene-Search-with-MoreLiktThis-tp2794419p2794419.html Sent from the Lucene - Java Developer mailing list archive at Nabble.com.
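One possible direction (an untested sketch against the same deprecated Hits API used above; field names follow the example data): MoreLikeThis ORs the interesting terms from all configured fields together, so to get only records where *both* fields match the source document, you can AND one required clause per field:

import org.apache.lucene.document.Document;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.TermQuery;

// Take the field values from the source document...
int docId = hits.id(0);
Document src = searcher.doc(docId);
// ...and require BOTH of them to match:
BooleanQuery both = new BooleanQuery();
both.add(new TermQuery(new Term("serviceNumber", src.get("serviceNumber"))), BooleanClause.Occur.MUST);
both.add(new TermQuery(new Term("env", src.get("env"))), BooleanClause.Occur.MUST);
Hits common = searcher.search(both);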
Re: [HUDSON] Lucene-trunk - Build # 1523 - Failure
OOME on this one... we really need the dump-heap-on-OOME JRE command-line option set...

Mike
http://blog.mikemccandless.com

On Thu, Apr 7, 2011 at 10:34 PM, Apache Hudson Server hud...@hudson.apache.org wrote:

Build: https://hudson.apache.org/hudson/job/Lucene-trunk/1523/

1 tests failed.

REGRESSION: org.apache.lucene.index.TestNRTThreads.testNRTThreads

Error Message: Some threads threw uncaught exceptions!

Stack Trace:
junit.framework.AssertionFailedError: Some threads threw uncaught exceptions!
at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1232)
at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1160)
at org.apache.lucene.util.LuceneTestCase.tearDown(LuceneTestCase.java:521)

Build Log (for compile errors): [...truncated 11839 lines...]
Re: Google Summer of Code 2011 participation
Anyone can participate in Lucene/Solr! You don't need to be a GSoC student to do so... Browse the issues in Jira (maybe focusing on the ones marked for GSoC and not already taken), or open your own issues, discuss, post patches, iterate, etc. Find your itch and scratch it ;) And there are a great many itches out there that need scratching...

Mike
http://blog.mikemccandless.com

On Thu, Apr 7, 2011 at 9:34 PM, Minh Doan daywed...@gmail.com wrote:

Hi folks, Having received a bunch of emails recently about GSoC, I really want to join, but it seems I'm not eligible to, even though I used to be a PhD student and am currently on leave (I will probably be back soon). I really want to contribute to Lucene to implement some of my ideas. Can I have a Lucene mentor, like those expert mentors who are excited about GSoC?

Best, Minh

On Tue, Apr 5, 2011 at 7:06 AM, Steven A Rowe sar...@syr.edu wrote:

Hi Jayendra,

From http://www.google-melange.com/document/show/gsoc_program/google/gsoc2011/faqs#who:

In order to participate in the program, you must be a student. Google defines a student as an individual enrolled in or accepted into an accredited institution including (but not necessarily limited to) colleges, universities, masters programs, PhD programs and undergraduate programs. You are eligible to apply if you are enrolled in an accredited university educational program provided you meet all of the other eligibility requirements. You should be prepared, upon request, to provide Google with transcripts or other documentation from your accredited institution as proof of enrollment or admission status. Computer Science does not need to be your field of study in order to participate in the program. You may be enrolled as a full-time or part-time student. You must also be eligible to work in the country in which you'll reside throughout the duration of the program, e.g. if you are in the United States on an F-1 visa, you are welcome to apply to Google Summer of Code as long as you have U.S. work authorization. For F-1 students applying for CPT, Google will furnish you with a letter you can provide to your university to get CPT established once your application to the program has been accepted.

-Original Message-
From: Jayendra Patil [mailto:jayendra.patil@gmail.com]
Sent: Tuesday, April 05, 2011 9:56 AM
To: dev@lucene.apache.org
Subject: Google Summer of Code 2011 participation

Hi, Does Google Summer of Code 2011 apply only to students? I have been working on Solr for quite some time now and would like to start contributing back. I have been using it to index structured and unstructured data and have a fair bit of knowledge of the internals as well (I have a few JIRAs and patches submitted). I don't have a specific proposal in mind yet, but would like to start with any specific area or issues. Let me know if and how I can participate.

Regards, Jayendra
Re: character escapes in source? ... was: Re: Eclipse: Invalid character constant
On Fri, Apr 8, 2011 at 2:49 AM, Earwin Burrfoot ear...@gmail.com wrote:

On Fri, Apr 8, 2011 at 03:01, Robert Muir rcm...@gmail.com wrote:

On Thu, Apr 7, 2011 at 6:48 PM, Chris Hostetter hossman_luc...@fucit.org wrote:
: -1. These files should be readable, for maintaining, debugging and
: knowing what's going on.

Readability is my main concern ... I don't know (and frequently can't tell) the difference between a lot of non-ASCII characters -- and I'm guessing I'm not alone. When it's spelled out explicitly using the character name or escape code, there is no ambiguity about what character was intended, or whether it got screwed up by some tool along the way (i.e. the svn server, an svn client, the patch command, a text editor, an IDE, ant's fixcrlf task, etc...)

Please take the time, just 5 or 10 minutes, to look through some of this source code and tests. Imagine if you couldn't just look at the code to see what it does, but had to decode it from some crazy numeric encoding scheme. Imagine if it were this way for things like stopword lists too. It would be basically impossible for you to look at the code and figure out what it does! For example, try looking at the Thai analyzer tests; if these were all numbers, how would you know wtf is going on?

Although this comes up from time to time, I stand firm on my -1 because it's important to me for the source code to be readable. I'm not willing to give this up just because some people cannot read writing system XYZ. I have said before, I'm willing to change my -1 vote on this if *ALL* string constants (including English ones) are changed to character escapes. If you imagine what the code would look like if English string constants were instead codes, then I think you will understand my point of view! It's really, really important to source code readability to be able to open a file and understand what it does, not to have to use some decoder because it uses characters other people don't understand.

I think having both raw characters /and/ the encoded representation is best? (one of them in comments)

I'm all for Unicode sources, but at least two things hit me repeatedly:

1. Tools do screw up, and you have to recover somehow. E.g. IntelliJ IDEA's 'shelve' function uses the platform default encoding (MacRoman in my case) and I've lost some text on things I shelved but never committed anywhere.

2. There are characters that look all the same, e.g. different whitespace/dashes. Or (if you have Cyrillic in your fonts), I dare you to discern between a/а, c/с, e/е, o/о. These are different characters from the Latin and Cyrillic charsets (left Latin/right Cyrillic), but in 99% of fonts they are visually identical. I had a filter that folded up similar-looking characters, and it was documented in exactly this way - raw char + code.

I've worked with a lot of characters in Eclipse, and the ones that confuse my eyes the most are l/1 and O/0. So again: if we do this, then we must do it for all English text, too.
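For readers following the thread, here is what the "raw char + code" documentation style might look like in Java. This is a hypothetical sketch, not code from Lucene; the folding pairs come from Earwin's Cyrillic/Latin example:

/** Illustrative sketch (not from Lucene): document confusable characters
 *  with both the raw glyph and its Unicode escape, as suggested above. */
public class ConfusableFolding {
  /** Fold Cyrillic look-alikes onto their Latin twins: а->a, с->c, е->e, о->o. */
  public static char fold(char c) {
    switch (c) {
      case '\u0430': return 'a'; // 'а' CYRILLIC SMALL LETTER A
      case '\u0441': return 'c'; // 'с' CYRILLIC SMALL LETTER ES
      case '\u0435': return 'e'; // 'е' CYRILLIC SMALL LETTER IE
      case '\u043E': return 'o'; // 'о' CYRILLIC SMALL LETTER O
      default: return c;
    }
  }
}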
Re: My GSOC proposal
I have refined my proposal here: http://goo.gl/uYXrV Are there any suggestions for which I should update my proposal before today's deadline?

On Thu, Apr 7, 2011 at 9:28 AM, Varun Thacker varunthacker1...@gmail.com wrote:

I have updated my proposal online to mention the time I would be able to dedicate to the project.

On Thu, Apr 7, 2011 at 7:05 AM, Adriano Crestani adrianocrest...@gmail.com wrote:

Hi Varun, Nice proposal, very complete. Only one thing is missing: you should mention somewhere how many hours a week you are willing to spend working on the project, and whether there are any holidays when you won't be able to work. Good luck ;)

On Wed, Apr 6, 2011 at 5:57 PM, Varun Thacker varunthacker1...@gmail.com wrote:

I have drafted the proposal on the official GSoC website. This is the link to my proposal: http://goo.gl/uYXrV . Please do let me know if anything needs to be changed, added, or removed. I will keep working on it until the deadline on the 8th.

On Wed, Apr 6, 2011 at 11:41 PM, Michael McCandless luc...@mikemccandless.com wrote:

That test code looks good -- you really should have seen awful performance had you used O_DIRECT, since you read byte by byte. A more realistic test is to read a whole buffer (e.g. 4 KB is what Lucene now uses during merging, but we'd probably up this to something like 1 MB when using O_DIRECT).

Linus does hate O_DIRECT (see http://kerneltrap.org/node/7563), and for good reason: its existence means projects like ours can use it to work around limitations in the Linux IO APIs that control the buffer cache when, otherwise, we might conceivably make patches to fix Linux correctly. It's an escape hatch, and we all use the escape hatch instead of trying to fix Linux for real... For example, the NOREUSE flag is a no-op now in Linux, which is a shame, because that's precisely the flag we'd want to use for merging (along with SEQUENTIAL). Had that flag been implemented well, it'd give better results than our workaround using O_DIRECT.

Anyway, given how things are, until we can get more control (way up in Javaland) over the buffer cache, O_DIRECT (via a native directory impl through JNI) is our only real option today. More details here: http://blog.mikemccandless.com/2010/06/lucene-and-fadvisemadvise.html

Note that other OSs likely do a better job and actually implement NOREUSE and similar APIs, so the generic Unix/WindowsNativeDirectory would simply use NOREUSE on those platforms for I/O during segment merging.

Mike
http://blog.mikemccandless.com

On Wed, Apr 6, 2011 at 11:56 AM, Varun Thacker varunthacker1...@gmail.com wrote:

Hi. I wrote sample code to test the speed difference between SEQUENTIAL and O_DIRECT reads (I used the madvise flag MADV_DONTNEED). This is the link to the code: http://pastebin.com/8QywKGyS There was a speed difference when I switched between the two flags. I have not used the O_DIRECT flag itself because Linus had criticized it. Is this what the flags are intended to be used for? This is just sample code with a test file.

On Wed, Apr 6, 2011 at 12:11 PM, Simon Willnauer simon.willna...@googlemail.com wrote:

Hey Varun,

On Tue, Apr 5, 2011 at 11:07 PM, Michael McCandless luc...@mikemccandless.com wrote: Hi Varun, Those two issues would make a great GSoC! Comments below...
+1

On Tue, Apr 5, 2011 at 1:56 PM, Varun Thacker varunthacker1...@gmail.com wrote:

I would like to combine two tasks as part of my project, namely "Directory createOutput and openInput should take an IOContext" (LUCENE-2793), and complement it with "Generalize DirectIOLinuxDir to UnixDir" (LUCENE-2795). The first part of the project is aimed at significantly reducing the time taken to search during indexing, by adding an IOContext that would store the buffer size and have options to bypass the OS's buffer cache (this is what causes the slowdown in search) and other hints. Once completed, I would move on to LUCENE-2795 and generalize the Directory implementation to make a UnixDirectory.

So, the first part (LUCENE-2793) should cause no change at all to performance, functionality, etc., because it's merely installing the plumbing (IOContext threaded throughout the low-level store APIs in Lucene) so that higher levels can send important details down to the Directory. We'd fix IndexWriter/IndexReader to fill out this IOContext with the details (merging, flushing, new reader, etc.). There's some fun/freedom here in figuring out just what details should be included in IOContext... (e.g., is it low level "set buffer size to 4 KB", or is it high level "I am opening a new near-real-time reader"?). This first step is a rote cutover, just changing APIs but in no way taking advantage of the new APIs. The 2nd step (LUCENE-2795) would then take advantage of this plumbing, by creating a UnixDir impl that, using JNI (C code), passes advanced
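For context, here is one hypothetical shape the LUCENE-2793 plumbing could take. All names below are illustrative, since the real API was still under discussion at this point:

// Hypothetical sketch of the IOContext plumbing described above.
public class IOContext {
  public enum Context { FLUSH, MERGE, READ, DEFAULT }

  public final Context context;
  public final boolean sequential; // hint: access will be sequential
  public final boolean noCache;    // hint: avoid polluting the OS buffer cache

  public IOContext(Context context, boolean sequential, boolean noCache) {
    this.context = context;
    this.sequential = sequential;
    this.noCache = noCache;
  }
}

// Directory methods would then accept it, e.g.:
//   IndexOutput out = dir.createOutput("_1.frq",
//       new IOContext(IOContext.Context.MERGE, true, true));
// letting a native UnixDirectory pick O_DIRECT / fadvise flags per call.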
[HUDSON] Lucene-Solr-tests-only-trunk - Build # 6867 - Failure
Build: https://hudson.apache.org/hudson/job/Lucene-Solr-tests-only-trunk/6867/

1 tests failed.

REGRESSION: org.apache.solr.cloud.ZkControllerTest.testUploadToCloud

Error Message: KeeperErrorCode = ConnectionLoss for /configs/config1/schema-reversed.xml

Stack Trace:
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /configs/config1/schema-reversed.xml
at org.apache.zookeeper.KeeperException.create(KeeperException.java:90)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1038)
at org.apache.solr.common.cloud.SolrZkClient.setData(SolrZkClient.java:224)
at org.apache.solr.common.cloud.SolrZkClient.setData(SolrZkClient.java:388)
at org.apache.solr.common.cloud.SolrZkClient.setData(SolrZkClient.java:410)
at org.apache.solr.cloud.ZkController.uploadToZK(ZkController.java:520)
at org.apache.solr.cloud.ZkControllerTest.testUploadToCloud(ZkControllerTest.java:191)
at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1232)
at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1160)

Build Log (for compile errors): [...truncated 9073 lines...]
Re: Google Summer of Code 2011 participation
On Fri, Apr 8, 2011 at 12:11 PM, Michael McCandless luc...@mikemccandless.com wrote:

Anyone can participate in Lucene/Solr! You don't need to be a GSoC student to do so... Browse the issues in Jira (maybe focusing on the ones marked for GSoC and not already taken), or open your own issues, discuss, post patches, iterate, etc. Find your itch and scratch it ;)

+1 — we are all around and will jump on the issue to guide you. Find one, ask questions if you have any, and start discussions / coding!

simon

And there are a great many itches out there that need scratching...

Mike
http://blog.mikemccandless.com

On Thu, Apr 7, 2011 at 9:34 PM, Minh Doan daywed...@gmail.com wrote:

Hi folks, Having received a bunch of emails recently about GSoC, I really want to join, but it seems I'm not eligible to, even though I used to be a PhD student and am currently on leave (I will probably be back soon). I really want to contribute to Lucene to implement some of my ideas. Can I have a Lucene mentor, like those expert mentors who are excited about GSoC?

Best, Minh

On Tue, Apr 5, 2011 at 7:06 AM, Steven A Rowe sar...@syr.edu wrote:

Hi Jayendra,

From http://www.google-melange.com/document/show/gsoc_program/google/gsoc2011/faqs#who:

In order to participate in the program, you must be a student. Google defines a student as an individual enrolled in or accepted into an accredited institution including (but not necessarily limited to) colleges, universities, masters programs, PhD programs and undergraduate programs. You are eligible to apply if you are enrolled in an accredited university educational program provided you meet all of the other eligibility requirements. You should be prepared, upon request, to provide Google with transcripts or other documentation from your accredited institution as proof of enrollment or admission status. Computer Science does not need to be your field of study in order to participate in the program. You may be enrolled as a full-time or part-time student. You must also be eligible to work in the country in which you'll reside throughout the duration of the program, e.g. if you are in the United States on an F-1 visa, you are welcome to apply to Google Summer of Code as long as you have U.S. work authorization. For F-1 students applying for CPT, Google will furnish you with a letter you can provide to your university to get CPT established once your application to the program has been accepted.

-Original Message-
From: Jayendra Patil [mailto:jayendra.patil@gmail.com]
Sent: Tuesday, April 05, 2011 9:56 AM
To: dev@lucene.apache.org
Subject: Google Summer of Code 2011 participation

Hi, Does Google Summer of Code 2011 apply only to students? I have been working on Solr for quite some time now and would like to start contributing back. I have been using it to index structured and unstructured data and have a fair bit of knowledge of the internals as well (I have a few JIRAs and patches submitted). I don't have a specific proposal in mind yet, but would like to start with any specific area or issues. Let me know if and how I can participate.

Regards, Jayendra

--
--- Minh
[jira] [Updated] (SOLR-2459) LogLevelSelection Servlet outputs plain HTML
[ https://issues.apache.org/jira/browse/SOLR-2459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stefan Matheis (steffkes) updated SOLR-2459:
--------------------------------------------

Description:
The currently available output of the LogLevelSelection servlet is plain HTML, which makes it impossible to integrate the logging information into the new Admin UI. Format-agnostic output (like every [?] other servlet offers) would be really nice! Just as an idea for a future structure, the new admin UI is [actually based on that json-structure|https://github.com/steffkes/solr-admin/blob/master/logging.json] :)

was:
The currently available output of the LogLevelSelection servlet is plain HTML, which makes it impossible to integrate the logging information into the new Admin UI. Format-agnostic output (like every [?] other servlet offers) would be really nice! Just as an idea for a future structure, the new admin UI is [https://github.com/steffkes/solr-admin/blob/master/logging.json|actually based on that json-structure] :)

LogLevelSelection Servlet outputs plain HTML
--------------------------------------------
Key: SOLR-2459
URL: https://issues.apache.org/jira/browse/SOLR-2459
Project: Solr
Issue Type: Wish
Components: web gui
Reporter: Stefan Matheis (steffkes)
Priority: Trivial

The currently available output of the LogLevelSelection servlet is plain HTML, which makes it impossible to integrate the logging information into the new Admin UI. Format-agnostic output (like every [?] other servlet offers) would be really nice! Just as an idea for a future structure, the new admin UI is [actually based on that json-structure|https://github.com/steffkes/solr-admin/blob/master/logging.json] :)

-- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (SOLR-2193) Re-architect Update Handler
[ https://issues.apache.org/jira/browse/SOLR-2193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13017461#comment-13017461 ]

Mark Miller commented on SOLR-2193:
-----------------------------------

{quote}
I wonder how this should work with autocommit? Someone may want a soft/reopen autocommit once every x seconds, but still may want a hard "flush to stable storage in case I crash" commit at some other duration.
{quote}

Right - I agree. How about another simple start? Simply add another commitTracker that does soft commits - then you can schedule a mix of soft and hard commits.

{quote}
The other thing that might be cool is a client-specified freshness per request. For example, when they pass in a query, they specify that they need data that's no more than 1 second old... and if it's too old, that will trigger a reopen (and block that specific request until the new searcher can be used). The benefit here is that big bulk uploads won't be interrupted if there is no time-sensitive query traffic. The downside is that high latency may be exposed to those requests if they depend on stuff that can take a lot of time the first time (like faceting).
{quote}

Yeah - I remember you mentioning this before - I definitely think this would be cool - perhaps as a follow-on issue - though hopefully the effect on bulk updates will be minimized once Lucene takes care of the 'flush blocks the world' issue.

Re-architect Update Handler
---------------------------
Key: SOLR-2193
URL: https://issues.apache.org/jira/browse/SOLR-2193
Project: Solr
Issue Type: Improvement
Reporter: Mark Miller
Assignee: Mark Miller
Fix For: 4.0
Attachments: SOLR-2193.patch

The update handler needs an overhaul. A few goals I think we might want to look at:
1. Cleanup - drop the DirectUpdateHandler(2) line - move to something like UpdateHandler, DefaultUpdateHandler
2. Expose the SolrIndexWriter in the API, or add the proper abstractions to get done what we now do with special casing: if (directupdatehandler2) success else failish
3. Stop closing the IndexWriter and start using commit (still lazy IW init though).
4. Drop the iwAccess/iwCommit locks and sync mostly at the Lucene level.
5. Keep NRT support in mind.
6. Keep microsharding in mind (maintain a logical index as multiple physical indexes)

-- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
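To make the "mix of soft and hard commits" idea concrete, here is a minimal scheduling sketch; softCommit()/hardCommit() are stand-ins for whatever the reworked update handler will expose, so treat this as an illustration of the scheduling only, not Solr code:

{code:java}
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class CommitSchedulerSketch {
  // Stand-ins for whatever the reworked update handler will expose:
  static void softCommit() { /* reopen searcher; cheap, not durable */ }
  static void hardCommit() { /* flush + fsync to stable storage */ }

  public static void main(String[] args) {
    ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(2);
    // Frequent soft commits for near-real-time visibility...
    scheduler.scheduleAtFixedRate(new Runnable() {
      public void run() { softCommit(); }
    }, 1, 1, TimeUnit.SECONDS);
    // ...and rare hard commits to bound data loss on a crash.
    scheduler.scheduleAtFixedRate(new Runnable() {
      public void run() { hardCommit(); }
    }, 5, 5, TimeUnit.MINUTES);
  }
}
{code}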
Want to learn Solr development
Hi, I want to learn Solr development. I have used Solr 1.4.1 for full-text search of PDF and DOC files, and for database search using Solr's multicore feature. Now I want to do development on Solr itself. How should I start? Please help.
[jira] [Updated] (SOLR-1922) DocBuilder onImportError/Abort EventListener
[ https://issues.apache.org/jira/browse/SOLR-1922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Zotter updated SOLR-1922:
--------------------------------
Affects Version/s: 4.0
                   3.1

DocBuilder onImportError/Abort EventListener
--------------------------------------------
Key: SOLR-1922
URL: https://issues.apache.org/jira/browse/SOLR-1922
Project: Solr
Issue Type: Improvement
Components: contrib - DataImportHandler
Affects Versions: 1.4, 3.1, 3.1.1, 4.0
Reporter: Robert Zotter
Priority: Trivial
Labels: DIH, DataImportHandler, DocBuilder, EventListener
Attachments: SOLR-1922.patch

The onImportEnd EventListener only fires off after a successful import. It would be useful to know when an import fails via an onImportError/Abort EventListener.

-- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (SOLR-1922) DocBuilder onImportError/Abort EventListener
[ https://issues.apache.org/jira/browse/SOLR-1922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Zotter updated SOLR-1922:
--------------------------------
Affects Version/s: 3.1.1

DocBuilder onImportError/Abort EventListener
--------------------------------------------
Key: SOLR-1922
URL: https://issues.apache.org/jira/browse/SOLR-1922
Project: Solr
Issue Type: Improvement
Components: contrib - DataImportHandler
Affects Versions: 1.4, 3.1, 3.1.1, 4.0
Reporter: Robert Zotter
Priority: Trivial
Labels: DIH, DataImportHandler, DocBuilder, EventListener
Attachments: SOLR-1922.patch

The onImportEnd EventListener only fires off after a successful import. It would be useful to know when an import fails via an onImportError/Abort EventListener.

-- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (LUCENE-1888) Provide Option to Store Payloads on the Term Vector
[ https://issues.apache.org/jira/browse/LUCENE-1888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13017483#comment-13017483 ]

Peter Wilkins commented on LUCENE-1888:
---------------------------------------

As someone new to Lucene with a specific problem to solve, it is difficult to identify the appropriate Lucene feature to use. Reading various online posts, I see I'm not alone. I have a use case that I think this JIRA issue addresses; perhaps it will help refine what the issue resolution should do.

I'm indexing a lecture video transcript. I want to store the text of the transcript along with the timecode at which each word occurs. I want to search the text of the transcript and get back the timecode, so I can play the lecture video from that spot.

Provide Option to Store Payloads on the Term Vector
---------------------------------------------------
Key: LUCENE-1888
URL: https://issues.apache.org/jira/browse/LUCENE-1888
Project: Lucene - Java
Issue Type: Improvement
Components: Index
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Fix For: 4.0

Would be nice to have the option to access the payloads in a document-centric way by adding them to the term vectors. Naturally, this makes the term vectors bigger, but it may be just what one needs.

-- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
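For this use case, payloads on postings can already carry the timecodes today; what this issue would add is reading them back document-centrically via term vectors. Below is a 3.x-flavored, untested sketch (the per-token timecode array is an assumed input) that attaches a timecode to each token via PayloadAttribute:

{code:java}
import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
import org.apache.lucene.index.Payload;

public final class TimecodeFilter extends TokenFilter {
  private final PayloadAttribute payloadAtt = addAttribute(PayloadAttribute.class);
  private final int[] timecodesMs; // one entry per token, from the transcript
  private int pos = 0;

  public TimecodeFilter(TokenStream in, int[] timecodesMs) {
    super(in);
    this.timecodesMs = timecodesMs;
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) return false;
    int ms = timecodesMs[pos++];
    byte[] bytes = new byte[] { // encode the timecode as a big-endian int
      (byte) (ms >>> 24), (byte) (ms >>> 16), (byte) (ms >>> 8), (byte) ms
    };
    payloadAtt.setPayload(new Payload(bytes));
    return true;
  }
}
{code}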
[jira] [Created] (SOLR-2462) Using spellcheck.collate can result in extremely high memory usage
Using spellcheck.collate can result in extremely high memory usage
-------------------------------------------------------------------
Key: SOLR-2462
URL: https://issues.apache.org/jira/browse/SOLR-2462
Project: Solr
Issue Type: Bug
Components: spellchecker
Affects Versions: 3.1, 4.0
Reporter: James Dyer
Priority: Critical

When using spellcheck.collate, class SpellPossibilityIterator creates a ranked list of *every* possible correction combination. But if returning several corrections per term, and if several words are misspelled, the existing algorithm uses a huge amount of memory.

This bug was introduced with SOLR-2010. However, it is triggered anytime spellcheck.collate is used. It is not necessary to use any features that were added with SOLR-2010.

We were in production with Solr for 1 1/2 days when this bug started taking our Solr servers down with infinite GC loops. It was pretty easy for this to happen, as occasionally a user will accidentally paste a URL into the search box on our app. Such a URL results in a search with ~12 misspelled words. We have spellcheck.count set to 15.

-- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (SOLR-2462) Using spellcheck.collate can result in extremely high memory usage
[ https://issues.apache.org/jira/browse/SOLR-2462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

James Dyer updated SOLR-2462:
-----------------------------
Attachment: SOLR-2462.patch

This sets the maximum limit to 1000 possibilities. When this limit is reached, the list is sorted by rank and then reduced to the top 100. From then on, only collations with a rank equal to or better than the 100th are added. This process repeats until finished, or until it has taken 50ms, at which time it quits. I also added a maxTimeAllowed setting of 50ms to the collation test queries as an additional performance safeguard.

Using spellcheck.collate can result in extremely high memory usage
-------------------------------------------------------------------
Key: SOLR-2462
URL: https://issues.apache.org/jira/browse/SOLR-2462
Project: Solr
Issue Type: Bug
Components: spellchecker
Affects Versions: 3.1, 4.0
Reporter: James Dyer
Priority: Critical
Attachments: SOLR-2462.patch

When using spellcheck.collate, class SpellPossibilityIterator creates a ranked list of *every* possible correction combination. But if returning several corrections per term, and if several words are misspelled, the existing algorithm uses a huge amount of memory.

This bug was introduced with SOLR-2010. However, it is triggered anytime spellcheck.collate is used. It is not necessary to use any features that were added with SOLR-2010.

We were in production with Solr for 1 1/2 days when this bug started taking our Solr servers down with infinite GC loops. It was pretty easy for this to happen, as occasionally a user will accidentally paste a URL into the search box on our app. Such a URL results in a search with ~12 misspelled words. We have spellcheck.count set to 15.

-- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
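An illustrative rendering of the pruning strategy described above (not the actual patch code; Candidate and the rank ordering are assumptions made for the sketch):

{code:java}
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class CollationPruner {
  static class Candidate {
    final String collation;
    final int rank; // lower = better
    Candidate(String collation, int rank) { this.collation = collation; this.rank = rank; }
  }

  static final Comparator<Candidate> BY_RANK = new Comparator<Candidate>() {
    public int compare(Candidate a, Candidate b) { return a.rank - b.rank; }
  };

  List<Candidate> prune(Iterable<Candidate> possibilities) {
    List<Candidate> pool = new ArrayList<Candidate>();
    int cutoffRank = Integer.MAX_VALUE;
    long deadline = System.currentTimeMillis() + 50; // 50 ms budget
    for (Candidate c : possibilities) {
      if (System.currentTimeMillis() > deadline) break; // quit, keep best so far
      if (c.rank > cutoffRank) continue;                // worse than current floor
      pool.add(c);
      if (pool.size() >= 1000) {                        // hit the hard cap:
        Collections.sort(pool, BY_RANK);                // rank the pool,
        pool.subList(100, pool.size()).clear();         // keep only the top 100,
        cutoffRank = pool.get(99).rank;                 // and raise the floor
      }
    }
    return pool;
  }
}
{code}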
[jira] [Updated] (SOLR-2462) Using spellcheck.collate can result in extremely high memory usage
[ https://issues.apache.org/jira/browse/SOLR-2462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Miller updated SOLR-2462:
------------------------------
Affects Version/s: (was: 4.0)
Fix Version/s: 4.0
               3.1.1

Using spellcheck.collate can result in extremely high memory usage
-------------------------------------------------------------------
Key: SOLR-2462
URL: https://issues.apache.org/jira/browse/SOLR-2462
Project: Solr
Issue Type: Bug
Components: spellchecker
Affects Versions: 3.1
Reporter: James Dyer
Priority: Critical
Fix For: 3.1.1, 4.0
Attachments: SOLR-2462.patch

When using spellcheck.collate, class SpellPossibilityIterator creates a ranked list of *every* possible correction combination. But if returning several corrections per term, and if several words are misspelled, the existing algorithm uses a huge amount of memory.

This bug was introduced with SOLR-2010. However, it is triggered anytime spellcheck.collate is used. It is not necessary to use any features that were added with SOLR-2010.

We were in production with Solr for 1 1/2 days when this bug started taking our Solr servers down with infinite GC loops. It was pretty easy for this to happen, as occasionally a user will accidentally paste a URL into the search box on our app. Such a URL results in a search with ~12 misspelled words. We have spellcheck.count set to 15.

-- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
Re: [Lucene.Net] escaping single quotes when using query parser
I'm not sure your issue is related to single quotes. The only characters that need to be escaped for the QueryParser are + - && || ! ( ) { } [ ] ^ " ~ * ? : \ and you can do that using QueryParser.Escape(string).

It's possible it might be related to the analyzer that you're using. In my experience, using a different analyzer to index than you use to search can *sometimes* cause unexpected behavior like this. Since I haven't run into this exact problem myself, to the best of my knowledge, it's tough for me to give a more specific answer without your code/test data.

Thanks, Christopher

On Thu, Apr 7, 2011 at 2:01 AM, Ben Foster b...@planetcloud.co.uk wrote:

Hi, How should we escape single quotes when working with the query parser? Currently we have a description field that may contain single quotes. While this field is correctly indexed, when we search the description no results are returned. I'm assuming it's because we need to replace the single quote in the search term with an escaped version.

Many thanks,
Ben Foster

planetcloud
The Elms, Hawton
Newark-on-Trent
Nottinghamshire NG24 3RL
http://www.planetcloud.co.uk/
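For reference, here is the escaping step in Java (Lucene.Net mirrors this API as QueryParser.Escape(string), so treat this Java-flavored snippet only as a sketch of the flow; the field name and analyzer are illustrative):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.util.Version;

Query parseUserInput(String raw) throws ParseException {
  // Single quotes are not special to the QueryParser, so escape() leaves
  // them alone; only characters like ( ) [ ] * ? : \ get a backslash.
  String escaped = QueryParser.escape(raw); // e.g. "(2nd ed.)" -> "\(2nd ed.\)"
  QueryParser qp = new QueryParser(Version.LUCENE_30, "description",
      new StandardAnalyzer(Version.LUCENE_30));
  return qp.parse(escaped);
}

If escaping doesn't change anything, the next thing to check is that the same analyzer is used at index and search time, per the point above.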
[jira] [Commented] (LUCENE-2956) Support updateDocument() with DWPTs
[ https://issues.apache.org/jira/browse/LUCENE-2956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13017514#comment-13017514 ]

Simon Willnauer commented on LUCENE-2956:
-----------------------------------------

FYI, I have a working patch for this. It needs some cleanup, so I will hopefully upload it at the beginning of next week.

Support updateDocument() with DWPTs
-----------------------------------
Key: LUCENE-2956
URL: https://issues.apache.org/jira/browse/LUCENE-2956
Project: Lucene - Java
Issue Type: Bug
Components: Index
Affects Versions: Realtime Branch
Reporter: Michael Busch
Assignee: Michael Busch
Priority: Minor
Fix For: Realtime Branch

With separate DocumentsWriterPerThreads (DWPT) it can currently happen that the delete part of an updateDocument() is flushed and committed separately from the corresponding new document. We need to make sure that updateDocument() is always an atomic operation from an IW.commit() and IW.getReader() perspective. See LUCENE-2324 for more details.

-- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
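For readers new to the issue, this is the call whose atomicity is at stake (standard IndexWriter API; the field setup is illustrative). Lucene treats updateDocument as delete-by-term plus add, and those two halves must appear atomic to commit() and getReader() even when documents land in different DWPTs:

{code:java}
import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

public class UpdateDocExample {
  /** Replace (or insert) the document whose "id" field is 42. */
  static void replaceDoc(IndexWriter writer) throws IOException {
    Document doc = new Document();
    doc.add(new Field("id", "42", Field.Store.YES, Field.Index.NOT_ANALYZED));
    doc.add(new Field("body", "updated text", Field.Store.NO, Field.Index.ANALYZED));
    writer.updateDocument(new Term("id", "42"), doc); // atomic delete + add
  }
}
{code}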
[jira] [Updated] (SOLR-2355) simple distrib update processor
[ https://issues.apache.org/jira/browse/SOLR-2355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Miller updated SOLR-2355:
------------------------------
Fix Version/s: 4.0
               3.2

One thing we should probably address is the brittle cmd cloning. I don't like clone methods in general - but if we are going to do it in core code, it's better to put the clone in the cmd and be a bit less brittle.

simple distrib update processor
-------------------------------
Key: SOLR-2355
URL: https://issues.apache.org/jira/browse/SOLR-2355
Project: Solr
Issue Type: New Feature
Reporter: Yonik Seeley
Priority: Minor
Fix For: 3.2, 4.0
Attachments: DistributedUpdateProcessorFactory.java, TestDistributedUpdate.java

Here's a simple update processor for distributed indexing that I implemented years ago. It implements a simple hash(id) MOD nservers scheme and just fails if any servers are down. Given the recent activity in distributed indexing, I thought this might be at least a good source for ideas.

-- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
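A hypothetical sketch of "put the clone in the cmd" (field names follow the era's AddUpdateCommand, but this is a sketch, not the patch): the command copies itself, so the distributing processor never has to know which fields exist.

{code:java}
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.update.AddUpdateCommand;

public class CopyableAddUpdateCommand extends AddUpdateCommand {
  /** Each shard request gets its own command instance to mutate safely. */
  public CopyableAddUpdateCommand copy() {
    CopyableAddUpdateCommand c = new CopyableAddUpdateCommand();
    c.solrDoc = solrDoc; // sharing the doc is fine if treated as read-only
    c.overwriteCommitted = overwriteCommitted;
    c.overwritePending = overwritePending;
    return c;
  }
}
{code}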
[jira] [Updated] (SOLR-2382) DIH Cache Improvements
[ https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

James Dyer updated SOLR-2382:
-----------------------------
Attachment: SOLR-2382.patch

This version fixes a bug involving the DIHCacheProcessor in the case of a many-to-[one|many] join between the parent entity and a child entity. If the child entity used a DIHCacheProcessor and the same child joined to consecutive parents, only the first parent would join to the child.

DIH Cache Improvements
----------------------
Key: SOLR-2382
URL: https://issues.apache.org/jira/browse/SOLR-2382
Project: Solr
Issue Type: New Feature
Components: contrib - DataImportHandler
Reporter: James Dyer
Priority: Minor
Attachments: SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch

Functionality:
1. Provide a pluggable caching framework for DIH so that users can choose a cache implementation that best suits their data and application.
2. Provide a means to temporarily cache a child entity's data without needing to create a special cached implementation of the entity processor (such as CachedSqlEntityProcessor).
3. Provide a means to write the final (root entity) DIH output to a cache rather than to Solr. Then provide a way for a subsequent DIH call to use the cache as an entity input. Also provide the ability to do delta updates on such persistent caches.
4. Provide the ability to partition data across multiple caches that can then be fed back into DIH and indexed either to varying Solr shards, or to the same core in parallel.

Use Cases:
1. We needed a flexible, scalable way to temporarily cache child-entity data prior to joining to parent entities.
- Using SqlEntityProcessor with child entities can cause an n+1 select problem.
- CachedSqlEntityProcessor only supports an in-memory HashMap as a caching mechanism and does not scale.
- There is no way to cache non-SQL inputs (e.g. flat files, xml, etc).
2. We needed the ability to gather data from long-running entities by a process that runs separate from our main indexing process.
3. We wanted the ability to do a delta import of only the entities that changed.
- Lucene/Solr requires entire documents to be re-indexed, even if only a few fields changed.
- Our data comes from 50+ complex sql queries and/or flat files.
- We do not want to incur overhead re-gathering all of this data if only 1 entity's data changed.
- Persistent DIH caches solve this problem.
4. We want the ability to index several documents in parallel (using 1.4.1, which did not have the threads parameter).
5. In the future, we may need to use shards, creating a need to easily partition our source data into shards.

Implementation Details:
1. De-couple EntityProcessorBase from caching.
- Created a new interface, DIHCache, and two implementations:
- SortedMapBackedCache - an in-memory cache, used as default with CachedSqlEntityProcessor (now deprecated).
- BerkleyBackedCache - a disk-backed cache, dependent on bdb-je, tested with je-4.1.6.jar
- NOTE: the existing Lucene contrib "db" project uses je-3.3.93.jar. I believe this may be incompatible due to generics usage.
- NOTE: I did not modify the ant script to automatically get this jar, so to use or evaluate this patch, download bdb-je from http://www.oracle.com/technetwork/database/berkeleydb/downloads/index.html
2. Allow entity processors to take a cacheImpl parameter to cause the entity data to be cached (see EntityProcessorBase and DIHCacheProperties).
3. Partially de-couple SolrWriter from DocBuilder.
- Created a new interface, DIHWriter, and two implementations:
- SolrWriter (refactored)
- DIHCacheWriter (allows DIH to write ultimately to a cache).
4. Create a new entity processor, DIHCacheProcessor, which reads a persistent cache as DIH entity input.
5. Support a partition parameter with both DIHCacheWriter and DIHCacheProcessor to allow for easy partitioning of source entity data.
6. Change the semantics of entity.destroy():
- Previously, it was being called on each iteration of DocBuilder.buildDocument().
- Now it does one-time cleanup tasks (like closing or deleting a disk-backed cache) once the entity processor is completed.
- The only out-of-the-box entity processor that previously implemented destroy() was LineEntityProcessor, so this is not a very invasive change.

General Notes: We are near completion in converting our search functionality from a legacy search engine to Solr. However, I found that DIH did not support caching to the level of our prior product's data import utility. In order to get our data into Solr, I created these caching enhancements. Because I believe this has
[jira] [Commented] (SOLR-2382) DIH Cache Improvements
[ https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13017567#comment-13017567 ]

James Dyer commented on SOLR-2382:
----------------------------------

In light of the recent discussion about the Spatial contrib, I wonder if seeking to get this committed is a non-starter because of its dependency on bdb-je. I thought this wouldn't be an issue because we have an existing Lucene contrib (db) with this same dependency, but then I noticed that some of the committers regret the existence of the db contrib for this reason (and others). In any case, even if the BerkleyBackedCache part of this patch could not be committed, having this framework in place so that developers can write their own persistent cache impls would be a major improvement, in my opinion. (I had originally started with a Lucene-backed cache, but switched to bdb-je because I couldn't figure out how to achieve acceptable performance for gets from the cache.)

DIH Cache Improvements
----------------------
Key: SOLR-2382
URL: https://issues.apache.org/jira/browse/SOLR-2382
Project: Solr
Issue Type: New Feature
Components: contrib - DataImportHandler
Reporter: James Dyer
Priority: Minor
Attachments: SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch

Functionality:
1. Provide a pluggable caching framework for DIH so that users can choose a cache implementation that best suits their data and application.
2. Provide a means to temporarily cache a child entity's data without needing to create a special cached implementation of the entity processor (such as CachedSqlEntityProcessor).
3. Provide a means to write the final (root entity) DIH output to a cache rather than to Solr. Then provide a way for a subsequent DIH call to use the cache as an entity input. Also provide the ability to do delta updates on such persistent caches.
4. Provide the ability to partition data across multiple caches that can then be fed back into DIH and indexed either to varying Solr shards, or to the same core in parallel.

Use Cases:
1. We needed a flexible, scalable way to temporarily cache child-entity data prior to joining to parent entities.
- Using SqlEntityProcessor with child entities can cause an n+1 select problem.
- CachedSqlEntityProcessor only supports an in-memory HashMap as a caching mechanism and does not scale.
- There is no way to cache non-SQL inputs (e.g. flat files, xml, etc).
2. We needed the ability to gather data from long-running entities by a process that runs separate from our main indexing process.
3. We wanted the ability to do a delta import of only the entities that changed.
- Lucene/Solr requires entire documents to be re-indexed, even if only a few fields changed.
- Our data comes from 50+ complex sql queries and/or flat files.
- We do not want to incur overhead re-gathering all of this data if only 1 entity's data changed.
- Persistent DIH caches solve this problem.
4. We want the ability to index several documents in parallel (using 1.4.1, which did not have the threads parameter).
5. In the future, we may need to use shards, creating a need to easily partition our source data into shards.

Implementation Details:
1. De-couple EntityProcessorBase from caching.
- Created a new interface, DIHCache, and two implementations:
- SortedMapBackedCache - an in-memory cache, used as default with CachedSqlEntityProcessor (now deprecated).
- BerkleyBackedCache - a disk-backed cache, dependent on bdb-je, tested with je-4.1.6.jar
- NOTE: the existing Lucene contrib "db" project uses je-3.3.93.jar. I believe this may be incompatible due to generics usage.
- NOTE: I did not modify the ant script to automatically get this jar, so to use or evaluate this patch, download bdb-je from http://www.oracle.com/technetwork/database/berkeleydb/downloads/index.html
2. Allow entity processors to take a cacheImpl parameter to cause the entity data to be cached (see EntityProcessorBase and DIHCacheProperties).
3. Partially de-couple SolrWriter from DocBuilder.
- Created a new interface, DIHWriter, and two implementations:
- SolrWriter (refactored)
- DIHCacheWriter (allows DIH to write ultimately to a cache).
4. Create a new entity processor, DIHCacheProcessor, which reads a persistent cache as DIH entity input.
5. Support a partition parameter with both DIHCacheWriter and DIHCacheProcessor to allow for easy partitioning of source entity data.
6. Change the semantics of entity.destroy():
- Previously, it was being called on each iteration of DocBuilder.buildDocument().
- Now it does one-time cleanup tasks (like closing or deleting a disk-backed cache)
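A hypothetical rendering of the pluggable cache contract the patch describes (the authoritative signatures are in the SOLR-2382 patch itself; the method names here are guesses from the description above):

{code:java}
import java.util.Iterator;
import java.util.Map;

public interface DIHCache extends Iterable<Map<String, Object>> {
  void add(Map<String, Object> rec);                  // cache one entity row
  Iterator<Map<String, Object>> iterator(Object key); // rows joining to a key
  void flush();                                       // persist pending writes
  void close();                                       // detach from the backing store
  void destroy();                                     // delete a disk-backed cache
}
{code}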
[HUDSON] Lucene-Solr-tests-only-3.x - Build # 6870 - Failure
Build: https://hudson.apache.org/hudson/job/Lucene-Solr-tests-only-3.x/6870/

1 tests failed.

REGRESSION: org.apache.lucene.collation.TestCollationKeyAnalyzer.testThreadSafe

Error Message: Java heap space

Stack Trace:
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2894)
at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:117)
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:589)
at java.lang.StringBuffer.append(StringBuffer.java:337)
at java.text.RuleBasedCollator.getCollationKey(RuleBasedCollator.java:617)
at org.apache.lucene.collation.CollationKeyFilter.incrementToken(CollationKeyFilter.java:93)
at org.apache.lucene.collation.CollationTestBase.assertThreadSafe(CollationTestBase.java:304)
at org.apache.lucene.collation.TestCollationKeyAnalyzer.testThreadSafe(TestCollationKeyAnalyzer.java:89)
at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1082)
at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1010)

Build Log (for compile errors): [...truncated 5265 lines...]
[HUDSON] Lucene-Solr-tests-only-trunk - Build # 6880 - Failure
Build: https://hudson.apache.org/hudson/job/Lucene-Solr-tests-only-trunk/6880/

1 tests failed.

REGRESSION: org.apache.solr.cloud.BasicDistributedZkTest.testDistribSearch

Error Message:
Severe errors in solr configuration. Check your log files for more detailed information on what may be wrong.
- org.apache.solr.common.cloud.ZooKeeperException:
at org.apache.solr.core.CoreContainer.register(CoreContainer.java:517)
at org.apache.solr.core.CoreContainer.load(CoreContainer.java:406)
at org.apache.solr.core.CoreContainer.load(CoreContainer.java:290)
at org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:239)
at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:93)
at org.mortbay.jetty.servlet.FilterHolder.doStart(FilterHolder.java:97)
at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
at org.mortbay.jetty.servlet.ServletHandler.initialize(ServletHandler.java:713)
at org.mortbay.jetty.servlet.ServletHandler.updateMappings(ServletHandler.java:1104)
at org.mortbay.jetty.servlet.ServletHandler.setFilterMappings(ServletHandler.java:1140)
at org.mortbay.jetty.servlet.ServletHandler.addFilterWithMapping(ServletHandler.java:940)
at org.mortbay.jetty.servlet.ServletHandler.addFilterWithMapping(ServletHandler.java:895)
at org.mortbay.jetty.servlet.Context.addFilter(Context.java:207)
at org.apache.solr.client.solrj.embedded.JettySolrRunner$1.lifeCycleStarted(JettySolrRunner.java:98)
at org.mortbay.component.AbstractLifeCycle.setStarted(AbstractLifeCycle.java:140)
at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:52)
at org.apache.solr.client.solrj.embedded.JettySolrRunner.start(JettySolrRunner.java:123)
at org.apache.solr.client.solrj.embedded.JettySolrRunner.start(JettySolrRunner.java:118)
at org.apache.solr.BaseDistributedSearchTestCase.createJetty(BaseDistributedSearchTestCase.java:245)
at org.apache.solr.BaseDistributedSearchTestCase.createJetty(BaseDistributedSearchTestCase.java:236)
at org.apache.solr.cloud.AbstractDistributedZkTestCase.createServers(AbstractDistributedZkTestCase.java:64)
at org.apache.solr.BaseDistributedSearch

Severe errors in solr configuration. Check your log files for more detailed information on what may be wrong.
- org.apache.solr.common.cloud.ZooKeeperException:
at org.apache.solr.core.CoreContainer.register(CoreContainer.java:517)
at org.apache.solr.core.CoreContainer.load(CoreContainer.java:406)
at org.apache.solr.core.CoreContainer.load(CoreContainer.java:290)
at org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:239)
at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:93)
at org.mortbay.jetty.servlet.FilterHolder.doStart(FilterHolder.java:97)
at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
at org.mortbay.jetty.servlet.ServletHandler.initialize(ServletHandler.java:713)
at org.mortbay.jetty.servlet.ServletHandler.updateMappings(ServletHandler.java:1104)
at org.mortbay.jetty.servlet.ServletHandler.setFilterMappings(ServletHandler.java:1140)
at org.mortbay.jetty.servlet.ServletHandler.addFilterWithMapping(ServletHandler.java:940)
at org.mortbay.jetty.servlet.ServletHandler.addFilterWithMapping(ServletHandler.java:895)
at org.mortbay.jetty.servlet.Context.addFilter(Context.java:207)
at org.apache.solr.client.solrj.embedded.JettySolrRunner$1.lifeCycleStarted(JettySolrRunner.java:98)
at org.mortbay.component.AbstractLifeCycle.setStarted(AbstractLifeCycle.java:140)
at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:52)
at org.apache.solr.client.solrj.embedded.JettySolrRunner.start(JettySolrRunner.java:123)
at org.apache.solr.client.solrj.embedded.JettySolrRunner.start(JettySolrRunner.java:118)
at org.apache.solr.BaseDistributedSearchTestCase.createJetty(BaseDistributedSearchTestCase.java:245)
at org.apache.solr.BaseDistributedSearchTestCase.createJetty(BaseDistributedSearchTestCase.java:236)
at org.apache.solr.cloud.AbstractDistributedZkTestCase.createServers(AbstractDistributedZkTestCase.java:64)
at org.apache.solr.BaseDistributedSearch

request: http://localhost:55435/solr/update?wt=javabin&version=2

Stack Trace:
request: http://localhost:55435/solr/update?wt=javabin&version=2
at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:436)
at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:245)
at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105)
at org.apache.solr.client.solrj.SolrServer.deleteByQuery(SolrServer.java:111)
at
(LUCENE-2793: Directory createOutput and openInput should take an IOContext, and LUCENE-2795: Genericize DirectIOLinuxDir - UnixDir) as GSoC
I'm moving this discussion to this thread as suggested by the Lucene mentors. This is my final proposal link: http://www.google-melange.com/gsoc/proposal/review/google/gsoc2011/varunthacker1989/1

So far I have implemented the following to get a better understanding of my project: I wrote sample code to test the speed difference between SEQUENTIAL and O_DIRECT reads (I used the madvise flag MADV_DONTNEED). This is the link to the code: http://pastebin.com/8QywKGyS. There was a noticeable speed difference when I switched between the two flags. I did not use the O_DIRECT flag itself because Linus Torvalds had criticized it.

This blog post by Michael McCandless (http://blog.mikemccandless.com/2010/06/lucene-and-fadvisemadvise.html) helped me understand the problem for which LUCENE-2793 is needed.

I am currently reading up on the Lucene documentation, taking into account a few pointers provided by Simon Willnauer. I want to keep everyone updated and also take into consideration the comments made by members of this community, which will help me understand and implement these tasks.

--
Regards,
Varun Thacker
http://varunthacker.wordpress.com
Re: [HUDSON] Lucene-trunk - Build # 1523 - Failure
On Fri, Apr 8, 2011 at 2:36 PM, Chris Hostetter hossman_luc...@fucit.org wrote:
: OOME on this one... we really need the dump heap on OOME JRE command
: line set...
Ant runs the tests in forked JVMs, right? So that should just be a build.xml change.

OK, I tried that with the patch below (applies to 3.x), then provoked an OOME, and it works great, though I think this is Sun (Oracle!) JRE specific... which is OK for now (we use the Oracle JRE on Jenkins, right?), but if we want to rotate JREs in the future this won't work...

The problem is... the resulting dump is large (mine was ~400 MB). We can specify a location for the dump (-XX:HeapDumpPath=/some/path)... I think we should somehow remove them after a few days? How much disk space can we use up?

Patch:

Index: solr/build.xml
===================================================================
--- solr/build.xml (revision 1089906)
+++ solr/build.xml (working copy)
@@ -464,6 +464,7 @@
       <jvmarg line="${dir.prop}"/>
       <jvmarg line="${args}"/>
+      <jvmarg line="-XX:+HeapDumpOnOutOfMemoryError"/>
       <formatter classname="${junit.details.formatter}" usefile="false" if="junit.details"/>
       <classpath refid="test.run.classpath"/>

Index: lucene/common-build.xml
===================================================================
--- lucene/common-build.xml (revision 1089906)
+++ lucene/common-build.xml (working copy)
@@ -488,6 +488,7 @@
       </assertions>
       <jvmarg line="${args}"/>
+      <jvmarg value="-XX:+HeapDumpOnOutOfMemoryError"/>
       <!-- allow tests to control debug prints -->
       <sysproperty key="tests.verbose" value="${tests.verbose}"/>

Mike
[jira] [Created] (SOLR-2463) Using an evaluator outside the scope of an entity results in a null context
Using an evaluator outside the scope of an entity results in a null context
----------------------------------------------------------------------------

Key: SOLR-2463
URL: https://issues.apache.org/jira/browse/SOLR-2463
Project: Solr
Issue Type: Bug
Components: contrib - DataImportHandler
Affects Versions: 3.1, 3.1.1, 4.0
Reporter: Robert Zotter
Priority: Minor

When using an Evaluator outside an entity element the Context argument is null.

public class LowerCaseFunctionEvaluator extends Evaluator {
  public String evaluate(String expression, Context context) {
    List l = EvaluatorBag.parseParams(expression, context.getVariableResolver());
    if (l.size() != 1) {
      throw new RuntimeException("'toLowerCase' must have only one parameter");
    }
    return l.get(0).toString().toLowerCase();
  }
}

<dataSource name="..." type="..." driver="..." url="..."
            user="${dataimporter.functions.toLowerCase('THIS_WILL_NOT_WORK')}"
            password="..." batchSize="..."/>

<entity name="..." dataSource="..."
        query="select * from ${dataimporter.functions.toLowerCase('THIS_WILL_WORK')}"/>

-- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (SOLR-2463) Using an evaluator outside the scope of an entity results in a null context
[ https://issues.apache.org/jira/browse/SOLR-2463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Zotter updated SOLR-2463:
--------------------------------

Description:

When using an Evaluator outside an entity element the Context argument is null.

{code:title=foo.LowerCaseFunctionEvaluator.java|borderStyle=solid}
public class LowerCaseFunctionEvaluator extends Evaluator {
  public String evaluate(String expression, Context context) {
    List l = EvaluatorBag.parseParams(expression, context.getVariableResolver());
    if (l.size() != 1) {
      throw new RuntimeException("'toLowerCase' must have only one parameter");
    }
    return l.get(0).toString().toLowerCase();
  }
}
{code}

{code:title=data-config.xml|borderStyle=solid}
<dataSource name="..." type="..." driver="..." url="..."
            user="${dataimporter.functions.toLowerCase('THIS_WILL_NOT_WORK')}"
            password="..."/>
{code}

{code:title=data-config.xml|borderStyle=solid}
<entity name="..." dataSource="..."
        query="select * from ${dataimporter.functions.toLowerCase('THIS_WILL_WORK')}"/>
{code}

was:

When using an Evaluator outside an entity element the Context argument is null.

public class LowerCaseFunctionEvaluator extends Evaluator {
  public String evaluate(String expression, Context context) {
    List l = EvaluatorBag.parseParams(expression, context.getVariableResolver());
    if (l.size() != 1) {
      throw new RuntimeException("'toLowerCase' must have only one parameter");
    }
    return l.get(0).toString().toLowerCase();
  }
}

<dataSource name="..." type="..." driver="..." url="..."
            user="${dataimporter.functions.toLowerCase('THIS_WILL_NOT_WORK')}"
            password="..." batchSize="..."/>

<entity name="..." dataSource="..."
        query="select * from ${dataimporter.functions.toLowerCase('THIS_WILL_WORK')}"/>

Using an evaluator outside the scope of an entity results in a null context
----------------------------------------------------------------------------

Key: SOLR-2463
URL: https://issues.apache.org/jira/browse/SOLR-2463
Project: Solr
Issue Type: Bug
Components: contrib - DataImportHandler
Affects Versions: 3.1, 3.1.1, 4.0
Reporter: Robert Zotter
Priority: Minor

When using an Evaluator outside an entity element the Context argument is null.

{code:title=foo.LowerCaseFunctionEvaluator.java|borderStyle=solid}
public class LowerCaseFunctionEvaluator extends Evaluator {
  public String evaluate(String expression, Context context) {
    List l = EvaluatorBag.parseParams(expression, context.getVariableResolver());
    if (l.size() != 1) {
      throw new RuntimeException("'toLowerCase' must have only one parameter");
    }
    return l.get(0).toString().toLowerCase();
  }
}
{code}

{code:title=data-config.xml|borderStyle=solid}
<dataSource name="..." type="..." driver="..." url="..."
            user="${dataimporter.functions.toLowerCase('THIS_WILL_NOT_WORK')}"
            password="..."/>
{code}

{code:title=data-config.xml|borderStyle=solid}
<entity name="..." dataSource="..."
        query="select * from ${dataimporter.functions.toLowerCase('THIS_WILL_WORK')}"/>
{code}

-- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (SOLR-2463) Using an evaluator outside the scope of an entity results in a null context
[ https://issues.apache.org/jira/browse/SOLR-2463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Zotter updated SOLR-2463:
--------------------------------

Description:

When using an Evaluator outside an entity element the Context argument is null.

{code:title=foo.LowerCaseFunctionEvaluator.java|borderStyle=solid}
public class LowerCaseFunctionEvaluator extends Evaluator {
  public String evaluate(String expression, Context context) {
    List l = EvaluatorBag.parseParams(expression, context.getVariableResolver());
    if (l.size() != 1) {
      throw new RuntimeException("'toLowerCase' must have only one parameter");
    }
    return l.get(0).toString().toLowerCase();
  }
}
{code}

{code:title=data-config.xml|borderStyle=solid}
<dataSource name="..." type="..." driver="..." url="..."
            user="${dataimporter.functions.toLowerCase('THIS_WILL_NOT_WORK')}"
            password="..."/>
{code}

{code:title=data-config.xml|borderStyle=solid}
<entity name="..." dataSource="..."
        query="select * from ${dataimporter.functions.toLowerCase('THIS_WILL_WORK')}"/>
{code}

This use case worked in 1.4

was:

When using an Evaluator outside an entity element the Context argument is null.

{code:title=foo.LowerCaseFunctionEvaluator.java|borderStyle=solid}
public class LowerCaseFunctionEvaluator extends Evaluator {
  public String evaluate(String expression, Context context) {
    List l = EvaluatorBag.parseParams(expression, context.getVariableResolver());
    if (l.size() != 1) {
      throw new RuntimeException("'toLowerCase' must have only one parameter");
    }
    return l.get(0).toString().toLowerCase();
  }
}
{code}

{code:title=data-config.xml|borderStyle=solid}
<dataSource name="..." type="..." driver="..." url="..."
            user="${dataimporter.functions.toLowerCase('THIS_WILL_NOT_WORK')}"
            password="..."/>
{code}

{code:title=data-config.xml|borderStyle=solid}
<entity name="..." dataSource="..."
        query="select * from ${dataimporter.functions.toLowerCase('THIS_WILL_WORK')}"/>
{code}

Using an evaluator outside the scope of an entity results in a null context
----------------------------------------------------------------------------

Key: SOLR-2463
URL: https://issues.apache.org/jira/browse/SOLR-2463
Project: Solr
Issue Type: Bug
Components: contrib - DataImportHandler
Affects Versions: 3.1, 3.1.1, 4.0
Reporter: Robert Zotter
Priority: Minor

When using an Evaluator outside an entity element the Context argument is null.

{code:title=foo.LowerCaseFunctionEvaluator.java|borderStyle=solid}
public class LowerCaseFunctionEvaluator extends Evaluator {
  public String evaluate(String expression, Context context) {
    List l = EvaluatorBag.parseParams(expression, context.getVariableResolver());
    if (l.size() != 1) {
      throw new RuntimeException("'toLowerCase' must have only one parameter");
    }
    return l.get(0).toString().toLowerCase();
  }
}
{code}

{code:title=data-config.xml|borderStyle=solid}
<dataSource name="..." type="..." driver="..." url="..."
            user="${dataimporter.functions.toLowerCase('THIS_WILL_NOT_WORK')}"
            password="..."/>
{code}

{code:title=data-config.xml|borderStyle=solid}
<entity name="..." dataSource="..."
        query="select * from ${dataimporter.functions.toLowerCase('THIS_WILL_WORK')}"/>
{code}

This use case worked in 1.4

-- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (SOLR-2463) Using an evaluator outside the scope of an entity results in a null context
[ https://issues.apache.org/jira/browse/SOLR-2463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Zotter updated SOLR-2463:
--------------------------------

Fix Version/s: 3.1.1

Using an evaluator outside the scope of an entity results in a null context
----------------------------------------------------------------------------

Key: SOLR-2463
URL: https://issues.apache.org/jira/browse/SOLR-2463
Project: Solr
Issue Type: Bug
Components: contrib - DataImportHandler
Affects Versions: 3.1, 3.1.1, 4.0
Reporter: Robert Zotter
Priority: Minor
Fix For: 3.1.1

When using an Evaluator outside an entity element the Context argument is null.

{code:title=foo.LowerCaseFunctionEvaluator.java|borderStyle=solid}
public class LowerCaseFunctionEvaluator extends Evaluator {
  public String evaluate(String expression, Context context) {
    List l = EvaluatorBag.parseParams(expression, context.getVariableResolver());
    if (l.size() != 1) {
      throw new RuntimeException("'toLowerCase' must have only one parameter");
    }
    return l.get(0).toString().toLowerCase();
  }
}
{code}

{code:title=data-config.xml|borderStyle=solid}
<dataSource name="..." type="..." driver="..." url="..."
            user="${dataimporter.functions.toLowerCase('THIS_WILL_NOT_WORK')}"
            password="..."/>
{code}

{code:title=data-config.xml|borderStyle=solid}
<entity name="..." dataSource="..."
        query="select * from ${dataimporter.functions.toLowerCase('THIS_WILL_WORK')}"/>
{code}

This use case worked in 1.4

-- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
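[A side note for anyone hitting SOLR-2463 before a fix lands: the immediate symptom is a NullPointerException from context.getVariableResolver(). The defensive rewrite of the example evaluator below is only my own sketch, not a patch for the underlying bug; it does not restore the 1.4 behavior, it just turns the opaque NPE into a readable error. The imports assume the 3.1 DataImportHandler package layout.]

import java.util.List;
import org.apache.solr.handler.dataimport.Context;
import org.apache.solr.handler.dataimport.Evaluator;
import org.apache.solr.handler.dataimport.EvaluatorBag;

public class LowerCaseFunctionEvaluator extends Evaluator {
  public String evaluate(String expression, Context context) {
    if (context == null) {
      // SOLR-2463: outside an entity, DIH hands the evaluator no Context,
      // so fail fast with a clear message instead of an NPE below.
      throw new RuntimeException(
          "toLowerCase: evaluator invoked outside an entity, Context is null (SOLR-2463)");
    }
    List l = EvaluatorBag.parseParams(expression, context.getVariableResolver());
    if (l.size() != 1) {
      throw new RuntimeException("'toLowerCase' must have only one parameter");
    }
    return l.get(0).toString().toLowerCase();
  }
}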
RE: [HUDSON] Lucene-trunk - Build # 1523 - Failure
AFAICT, these (Oracle/Sun-only) JVM parameters were introduced in 1.6 (that's what this parameter list says: http://blogs.sun.com/watt/resource/jvm-options-list.html) - we tell Jenkins to use 1.6 for Lucene/Solr testing, so this isn't an issue in practice, I guess. Hopefully 1.5 JVMs, and non-Oracle/Sun JVMs, won't choke on these unknown parameters.

Steve

-----Original Message-----
From: Michael McCandless [mailto:luc...@mikemccandless.com]
Sent: Friday, April 08, 2011 3:55 PM
To: dev@lucene.apache.org
Cc: Chris Hostetter; Apache Hudson Server
Subject: Re: [HUDSON] Lucene-trunk - Build # 1523 - Failure

On Fri, Apr 8, 2011 at 2:36 PM, Chris Hostetter hossman_luc...@fucit.org wrote:

: OOME on this one... we really need the dump heap on OOME JRE command
: line set... Ant runs the tests in forked JVMs right? so that should
: just be a build.xml change.

OK I tried that, with the patch below (applies to 3.x), and then provoked an OOME and it works great, though I think this is Sun (Oracle!) JRE specific... which is OK for now (we use the Oracle JRE on Jenkins, right?), but if we want to rotate JREs in the future this won't work...

The problem is... the resulting dump is large (mine was ~400 MB). We can specify a location for the dump (-XX:HeapDumpPath=/some/path)... I think we should somehow remove them after a few days? How much disk space can we use up?

Patch:

Index: solr/build.xml
===================================================================
--- solr/build.xml	(revision 1089906)
+++ solr/build.xml	(working copy)
@@ -464,6 +464,7 @@
         <jvmarg line="${dir.prop}"/> -->
         <jvmarg line="${args}"/>
+        <jvmarg line="-XX:+HeapDumpOnOutOfMemoryError"/>
         <formatter classname="${junit.details.formatter}" usefile="false" if="junit.details"/>
         <classpath refid="test.run.classpath"/>

Index: lucene/common-build.xml
===================================================================
--- lucene/common-build.xml	(revision 1089906)
+++ lucene/common-build.xml	(working copy)
@@ -488,6 +488,7 @@
       </assertions>
       <jvmarg line="${args}"/>
+      <jvmarg value="-XX:+HeapDumpOnOutOfMemoryError"/>
       <!-- allow tests to control debug prints -->
       <sysproperty key="tests.verbose" value="${tests.verbose}"/>

Mike - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-2186) First cut at column-stride fields (index values storage)
[ https://issues.apache.org/jira/browse/LUCENE-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13017679#comment-13017679 ]

Jason Rutherglen commented on LUCENE-2186:
------------------------------------------

I'm wondering if there is a limitation on whether or not we can randomly access the doc values from the underlying Directory implementation, rather than needing to load all the values directly into the main heap space. This seems doable, and if so let me know if I can provide a patch.

First cut at column-stride fields (index values storage)
---------------------------------------------------------

Key: LUCENE-2186
URL: https://issues.apache.org/jira/browse/LUCENE-2186
Project: Lucene - Java
Issue Type: New Feature
Components: Index
Reporter: Michael McCandless
Assignee: Simon Willnauer
Fix For: CSF branch, 4.0
Attachments: LUCENE-2186.patch, LUCENE-2186.patch, LUCENE-2186.patch, LUCENE-2186.patch, LUCENE-2186.patch, mem.py

I created an initial basic impl for storing index values (ie column-stride value storage). This is still a work in progress... but the approach looks compelling. I'm posting my current status/patch here to get feedback/iterate, etc.

The code is standalone now, and lives under the new package oal.index.values (plus some util changes, refactorings) -- I have yet to integrate into Lucene so eg you can mark that a given Field's value should be stored into the index values, sorting will use these values instead of field cache, etc.

It handles 3 types of values:

* Six variants of byte[] per doc, all combinations of fixed vs variable length, and stored either straight (good for eg a title field), deref (good when many docs share the same value, but you won't do any sorting) or sorted.
* Integers (variable bit precision used as necessary, ie this can store byte/short/int/long, and all precisions in between)
* Floats (4 or 8 byte precision)

String fields are stored as the UTF8 byte[]. This patch adds a BytesRef, which does the same thing as flex's TermRef (we should merge them). This patch also adds a basic initial impl of PackedInts (LUCENE-1990); we can swap that out if/when we get a better impl.

This storage is dense (like field cache), so it's appropriate when the field occurs in all/most docs. It's just like field cache, except the reading API is a get() method invocation, per document.

Next step is to do basic integration with Lucene, and then compare sort performance of this vs field cache. For the sort-by-String-value case, I think RAM usage & GC load of this index values API should be much better than field cache, since it does not create an object per document (instead it shares big long[] and byte[] across all docs), and because the values are stored in RAM as their UTF8 bytes.

There are abstract Writer/Reader classes. The current reader impls are entirely RAM resident (like field cache), but the API is (I think) agnostic, ie, one could make an MMAP impl instead.

I think this is the first baby step towards LUCENE-1231. Ie, it cannot yet update values, and the reading API is fully random-access by docID (like field cache), not like a posting list, though I do think we should add an iterator() api (to return flex's DocsEnum) -- eg I think this would be a good way to track avg doc/field length for BM25/lnu.ltc scoring.

-- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
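[What Jason describes -- serving each get(docID) straight off the Directory instead of materializing an array on the heap -- could look roughly like this for the fixed-width case. IndexInput.seek/readLong are real Lucene store APIs; the surrounding class, its name, and its file layout are purely illustrative:]

import java.io.IOException;
import org.apache.lucene.store.IndexInput;

// Illustrative fixed-width reader that leaves the values on disk and
// seeks per lookup, instead of pinning a long[] in heap like field cache.
public class DiskResidentIntsReader {
  private final IndexInput in;  // opened from the Directory
  private final long dataStart; // file offset where the values begin

  public DiskResidentIntsReader(IndexInput in, long dataStart) {
    this.in = in;
    this.dataStart = dataStart;
  }

  // One seek + one read per lookup; the OS page cache then decides what
  // actually stays in RAM. Synchronized because a single IndexInput is
  // not safe for concurrent positioning.
  public synchronized long get(int docID) throws IOException {
    in.seek(dataStart + 8L * docID); // fixed 8-byte values in this sketch
    return in.readLong();
  }
}

[Variable-length values would need an extra offsets table, and an MMapDirectory-backed input would make the per-lookup cost mostly a page-cache hit, which seems to be the trade-off the comment is asking about.]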
some questions about how Lucene works internally
(sorry for cross-posting here, but it seems this question on the internal mechanisms of Lucene cannot be answered on the user@ list, so I'm asking for more expert knowledge here. Thanks a lot.)

===

I'm new to Lucene/search engines, and have been struggling with these questions recently. I'd appreciate it a lot if you could shed some light on this.

Let's say I do a query on: dog greyhound

Note that I did not quote them, i.e. this is not a phrase search. What happens under the hood? Which term does Lucene use to look up the inverted index? I read somewhere that Lucene uses the term with the higher IDF (i.e. the more distinguishing term), in this case greyhound, but what about dog? Does Lucene traverse down the doclist of dog at all? If I provide multiple terms in my query, generally how does Lucene decide how many doclists to travel down?

I read that Lucene uses a combination of the binary model and the VSM. It seems then that in the above case, it finds the full doclist of dog, and that of greyhound (the binary model part), then finds the common docs from the two doclists, then orders them by score (the VSM part). Is it true that the FULL doclists are fetched first? Or is some pruning done on the individual doclists? I see the talk at http://www.slideshare.net/abial/eurocon2010 that discusses pruning and tiered search, but is this the default behavior of Lucene?

How are the doclists sorted? (by IDF? -- sorry, I'm just beginning to sift through a lot of docs online; somehow I got this impression but can't form a precise conclusion)

Also, generally, could you please recommend some good articles on how Lucene/search engines work? I've read "The Anatomy of a Search Engine" (the Sergey Brin / Larry Page paper), "Introduction to Information Retrieval" (Manning et al.), and "Lucene in Action".

Thanks
Yang

- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
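[On the intersection question specifically: a conjunction over postings lists is typically evaluated with a "leapfrog" walk, so neither full doclist has to be fetched up front, and the rarer (higher-IDF) term naturally drives the skipping. A sketch of the idea in Java follows -- PostingsIterator here is a stand-in interface I made up for illustration, not Lucene's actual postings API:]

// Stand-in for a real postings API: advance(target) skips to the first
// doc >= target, returning NO_MORE_DOCS when the list is exhausted.
interface PostingsIterator {
  int NO_MORE_DOCS = Integer.MAX_VALUE;
  int advance(int target);
}

final class Leapfrog {
  // Visit every doc that appears in both lists without reading either
  // list in full: each side only advances past the other's position.
  static void intersect(PostingsIterator rare, PostingsIterator common) {
    int doc = rare.advance(0);
    while (doc != PostingsIterator.NO_MORE_DOCS) {
      int other = common.advance(doc);
      if (other == doc) {
        // Both terms occur in doc: this is a hit; score it here (VSM part).
        doc = rare.advance(doc + 1);
      } else {
        // common skipped past doc; let the rarer list catch up.
        doc = rare.advance(other);
      }
    }
  }
}

[Driving the loop from the rarer term minimizes the number of advance() calls on the long list, which is where the higher-IDF intuition in the question comes from; skip structures inside the postings make each advance() cheaper than a linear scan.]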
[HUDSON] Lucene-Solr-tests-only-trunk - Build # 6894 - Failure
Build: https://hudson.apache.org/hudson/job/Lucene-Solr-tests-only-trunk/6894/

1 tests failed.

REGRESSION: org.apache.solr.cloud.CloudStateUpdateTest.testCoreRegistration

Error Message: expected:<2> but was:<3>

Stack Trace:
junit.framework.AssertionFailedError: expected:<2> but was:<3>
        at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1232)
        at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1160)
        at org.apache.solr.cloud.CloudStateUpdateTest.testCoreRegistration(CloudStateUpdateTest.java:208)

Build Log (for compile errors): [...truncated 8828 lines...]

- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org