Re: do the Java and Python garbage collectors talk to each other, with JCC?
Andi Vajda va...@apache.org wrote: On Tue, 24 Aug 2010, Bill Janssen wrote: I'm starting to see traces like the following in my UpLib (OS X 10.5.8, 32-bit Python 2.5, Java 6, JCC-2.6, PyLucene-2.9.3) that indicate an out-of-memory issue. I spawn a lot of short-lived threads in Python, and each of them is attached to Java, and detached after the run method returns. I've run test programs that do nothing but repeatedly start new threads that then invoke pylucene to index a document, and see no problems. I'm trying to come up with a hypothesis for this. One of the things I'm wondering is if my Python memory space is approaching the limit, does PyLucene arrange for the Java garbage collector to invoke the Python garbage collector if it can't allocate memory? No, not that I know of. The only fancy exchange between the Python world and the Java world is for 'extensions' of Java classes in Python. These are in a deadly embrace since they keep track of each other. A proxy object and some weak reference tricks do their work to resolve this cleanly. But this assumes the ref count on the Python side becomes 0 or that the finalize() method on the Java side is invoked (for which there is no guarantee according to the spec). I don't think that's the issue. I'm keeping an eye on the refs via _dumpRefs(), and they seem OK, no matter how many new threads I create. As I understand it, the Java GC allocates two blocks of memory (heap and stack) immediately when creating a new thread, and does its own allocations to the thread from within these blocks -- the JVM GC works exclusively within this allocated heap block. These blocks are returned to the system when the thread exits. The Python GC, in contrast, works globally, allocating memory blocks as needed and returning them to the system when possible, asynchronously respective to thread creation and completion. 
What I think's happening is that Java is attempting to create a thread, and fails because the system (malloc) can't allocate a large enough heap block. The weak-ref'ed allocations that could be freed are on the Python side of the world, not the Java side. I wonder if it would be possible to add a hook somehow to the Java GC that would call into Python and have Python run its GC, too. Though I'm not sure the Java GC is being called at all, so perhaps this hook would have to be in the part of the Java VM that calls malloc, the thread creation code. Note that the thread being unsuccessfully started isn't mine; it's being started by Java. It is generally better practice to pool threads and to reuse them instead of allocating them for short-lived tasks. Sure, but tell that to the Lucene folks. They're the ones starting a new thread here. Of course, now and then one needs to start a new thread. I have personally no confidence in the JNI thread detaching mechanism... If it works, great but... As an aside, here is what I found out about using Java-created threads in Python: When Java creates a thread, Python is not being told about it and the Python VM considers this thread dummy, that is, without a thread state object. In other words, Python doesn't have a documented 'attachCurrentThread()' call. Instead, a Python thread state object is allocated at every call entering the Python VM from the Java VM running on such a dummy thread and is freed upon return. The buggy side effect of this is that you lose your thread-local storage between such calls and pay an extra thread state allocation cost for every such call into Python when the GIL is acquired. A workaround for this is to create and increment this thread state object's ref count when the Java thread is first created and to decrement it upon thread completion. This is what the PythonVM.acquire/releaseThreadState() methods are for in jcc.cpp. 
The PythonVM class is used when embedding a Python VM in a Java VM as when running Python code in a Tomcat process, for example. Maybe these methods should move elsewhere if they have potential uses outside this scenario... Yes, that sounds useful. Bill Andi.. Bill thr1730: Running document rippers raised the following exception: thr1730: Traceback (most recent call last): thr1730:File /local/share/UpLib-1.7.9/code/uplib/newFolder.py, line 282, in _run_rippers thr1730: ripper.rip(folderpath, id) thr1730:File /local/share/UpLib-1.7.9/code/uplib/createIndexEntry.py, line 187, in rip thr1730: index_folder(location, self.repository().index_path()) thr1730:File /local/share/UpLib-1.7.9/code/uplib/createIndexEntry.py, line 82, in index_folder thr1730: c.index(folder, doc_id) thr1730:File /local/share/UpLib-1.7.9/code/uplib/indexing.py, line 813, in index thr1730: self.reopen() thr1730:File /local/share/UpLib-1.7.9/code/uplib/indexing.py, line 635, in reopen thr1730: self.current_writer.flush() thr1730:
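Andi's advice above — pool and reuse threads instead of allocating one per short-lived task — can be sketched with a plain `ExecutorService`. This is an illustrative sketch, not UpLib or PyLucene code: the task body is a placeholder for real indexing work, and the point is that thread setup (and any JNI attach/detach) happens once per long-lived worker rather than once per task. The same shape applies on the Python side via `concurrent.futures.ThreadPoolExecutor`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PooledIndexing {
    // Submit n short-lived tasks to a small pool of long-lived workers.
    // The task body is a placeholder for real indexing work; each worker
    // thread would attach to the JVM once, on first use, not per task.
    static List<String> runTasks(int n) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (int i = 0; i < n; i++) {
                final int docId = i;
                futures.add(pool.submit(() -> "indexed " + docId));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> f : futures) {
                results.add(f.get());   // collected in submit order
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(runTasks(10));
    }
}
```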
Hudson build is back to normal : Solr-3.x #85
See https://hudson.apache.org/hudson/job/Solr-3.x/85/changes - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (SOLR-2079) Expose HttpServletRequest object from SolrQueryRequest object
[ https://issues.apache.org/jira/browse/SOLR-2079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12902359#action_12902359 ] Jan Høydahl commented on SOLR-2079: --- I have been using SolrParams to convey metadata from frontends to the middleware layer, and I think it has worked really well. In addition, you get it included in the query logs! As for load balancers, most have an option to convey the client's IP in the X-Forwarded-For header. What if the dispatchFilter added all HTTP headers to the SolrQueryRequest context? Then we could map explicitly in the requestHandler config how to use them:
{code:xml}
<lst name="invariants">
  <str name="_http_remote-ip">$HTTP_HEADERS(X-Forwarded-For, Remote-Address)</str>
</lst>
{code}
This would mean that if the HTTP header X-Forwarded-For exists in the context, it will be mapped to the param _http_remote-ip; if not, it will use Remote-Address. In this way each application can choose whether to pollute the SolrParams with headers or not, and can choose the naming as well as whether it should be invariant or default. Expose HttpServletRequest object from SolrQueryRequest object - Key: SOLR-2079 URL: https://issues.apache.org/jira/browse/SOLR-2079 Project: Solr Issue Type: Improvement Components: Response Writers, search Reporter: Chris A. Mattmann Fix For: 3.1 Attachments: SOLR-2079.Quach.Mattmann.082310.patch.txt This patch adds the HttpServletRequest object to the SolrQueryRequest object. The HttpServletRequest object is needed to obtain the client's IP address for geotargetting, and is part of the patches from W. Quach and C. Mattmann. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: Should analysis.jsp honor maxFieldLength
What about an option to override this on a per field-type and/or per field basis? Then the global setting could still be the default:
<fieldType name="text" class="solr.TextField" positionIncrementGap="100" maxLength="10"/>
OR
<field name="teaser" type="text" indexed="true" stored="true" maxLength="10"/>
-- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com Training in Europe - www.solrtraining.com On 24. aug. 2010, at 20.56, Eric Pugh wrote: I did always think that the global maxFieldLength was odd. In one project I have, 10,000 is fine except for 1 field that I would like to bump up to 100,000, and there isn't (as far as I know) a way to do that. Is there any real negative effect to swapping to maxFieldLength of 100,000 (with the caveat that the auto truncation won't be working!)? The filter approach that you pointed out does make sense; the only worry I have is that it might make building analyzers more complex. One of the things I treasure about Solr is how many decisions it makes for you out of the box that are right so very often, and therefore how simple it is. If every user needs to think about maxFieldLength from day one, then that might make life more complex. Eric On Aug 24, 2010, at 2:44 PM, Robert Muir wrote: On Tue, Aug 24, 2010 at 2:29 PM, Eric Pugh ep...@opensourceconnections.com wrote: I created a patch file at https://issues.apache.org/jira/browse/SOLR-2086. I went with the simplest approach since I didn't want to confuse things by having extra filters being added to what the user created. However, either approach would work! One idea here was that this maxFieldLength might be going away: see https://issues.apache.org/jira/browse/LUCENE-2295 for more information (though I notice it's still not listed as deprecated?). But for now it's worth mentioning: the filter is more flexible; for example, it supports per-field configuration (and of course if you use the filter instead, which you can do now, it will automatically work in analysis.jsp).
-- Robert Muir rcm...@gmail.com - Eric Pugh | Principal | OpenSource Connections, LLC | 434.466.1467 | http://www.opensourceconnections.com Co-Author: Solr 1.4 Enterprise Search Server available from http://www.packtpub.com/solr-1-4-enterprise-search-server Free/Busy: http://tinyurl.com/eric-cal
[jira] Commented: (LUCENE-2095) Document not guaranteed to be found after write and commit
[ https://issues.apache.org/jira/browse/LUCENE-2095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12902371#action_12902371 ] vijaykumarraja.grandhi commented on LUCENE-2095: I am currently using Lucene.Net version 2.9.2. We have upgraded from v1.9.0 to 2.9.2, and we now want to use threading, but I am stuck on a lock. How can I overcome these locks? Can anyone provide a .NET code sample? Thank you in advance. Document not guaranteed to be found after write and commit -- Key: LUCENE-2095 URL: https://issues.apache.org/jira/browse/LUCENE-2095 Project: Lucene - Java Issue Type: Bug Components: Index Affects Versions: 2.4.1, 2.9.1 Environment: Linux 64bit Reporter: Sanne Grinovero Assignee: Michael McCandless Fix For: 2.9.2, 3.0.1, 4.0 Attachments: LUCENE-2095.patch, lucene-stresstest.patch after same email on developer list: I developed a stress test to assert that a new document containing a specific term X is always found after a commit on the IndexWriter. This works most of the time, but it fails under load on rare occasions. I'm testing with 40 threads, both with a SerialMergeScheduler and a ConcurrentMergeScheduler, all sharing a common IndexWriter. The attached testcase uses a RAMDirectory only, but I verified that an FSDirectory behaves in the same way, so I don't believe it's the Directory implementation or the MergeScheduler. This test is slow, so I don't consider it a functional or unit test. It might give false positives: it doesn't always fail; sorry, I couldn't find out how to make it more likely to happen, besides scheduling it to run for a longer time. I tested this to affect versions 2.4.1 and 2.9.1.
Re: Lucene Test Failure: org.apache.lucene.search.TestCachingWrapperFilter.testEnforceDeletions (from TestCachingWrapperFilter)
OK I just cut this test over to SMS, and took steps to make sure the reader is not GC'd. It seems to be passing now ;) Mike On Tue, Aug 24, 2010 at 6:46 PM, Michael McCandless luc...@mikemccandless.com wrote: Hmm so cms.sync() wasn't it -- I just saw it fail again. Uwe you are right -- we are failing to keep a hard ref to the old reader, for this one assert. Yet if I try to keep a ref, I still see it sometimes fail... still digging... Mike On Tue, Aug 24, 2010 at 5:49 PM, Michael McCandless luc...@mikemccandless.com wrote: Yeah the key should still have a hard ref. The key is either the SegmentReader instance, or its CoreReader instance. The test holds a hard ref to the parent reader, which then references the subs. I think it may instead be due to CMS, ie, we reopen the reader before a merge completes, then the merge completes, then the next reopen (which assumes there will be no changes) sees the completed merge as a change. I'll try inserting CMS.sync() into the test... Mike On Tue, Aug 24, 2010 at 5:44 PM, Uwe Schindler u...@thetaphi.de wrote: Right, but has the key any refs? This was my only explanation for the bug. My problem is that I had no time to look closely into the test, and I did not completely understand the new deletion modes and what the test tries to do. This changed since 3.0, when I modified the filter the last time (at ApacheCon US). - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: Michael McCandless [mailto:luc...@mikemccandless.com] Sent: Tuesday, August 24, 2010 11:38 PM To: dev@lucene.apache.org Subject: Re: Lucene Test Failure: org.apache.lucene.search.TestCachingWrapperFilter.testEnforceDeletions (from TestCachingWrapperFilter) Wait -- it's a WeakHashMap right? Entries should not be removed unless the key no longer has any hard refs? Mike On Tue, Aug 24, 2010 at 5:34 PM, Uwe Schindler u...@thetaphi.de wrote: We had the same on hudson a few days ago.
The problem is an overly active GC (if the GC is very active, it removes the entry from the cache, and then this error occurs). This is a bug in the test. To test this correctly we can either: - during the test, replace the WeakHashMap by a conventional HashMap (the map is package private, maybe we replace it in the test) - hold a reference to the cache entry during the test (that is, the DocIdSet) Uwe - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: Mark Miller [mailto:markrmil...@gmail.com] Sent: Tuesday, August 24, 2010 10:33 PM To: dev@lucene.apache.org Subject: Lucene Test Failure: org.apache.lucene.search.TestCachingWrapperFilter.testEnforceDeletions (from TestCachingWrapperFilter) Error Message expected:<2> but was:<3> Stacktrace junit.framework.AssertionFailedError: expected:<2> but was:<3> at org.apache.lucene.search.TestCachingWrapperFilter.testEnforceDeletions(TestCachingWrapperFilter.java:228) at org.apache.lucene.util.LuceneTestCase.runBare(LuceneTestCase.java:380) at org.apache.lucene.util.LuceneTestCase.run(LuceneTestCase.java:372) Standard Output NOTE: random codec of testcase 'testEnforceDeletions' was: PreFlex NOTE: random locale of testcase 'testEnforceDeletions' was: zh_CN NOTE: random timezone of testcase 'testEnforceDeletions' was: Etc/GMT+4 NOTE: random seed of testcase 'testEnforceDeletions' was: -46038615367376670
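Uwe's point about the WeakHashMap-backed cache can be demonstrated in isolation (no Lucene involved): an entry survives exactly as long as something holds a hard reference to its key, and once the key is only weakly reachable the GC may reclaim the entry at any time — which is why a test asserting on cache hits must pin the reader. A minimal sketch:

```java
import java.util.Map;
import java.util.WeakHashMap;

public class WeakCacheDemo {
    // An entry in a WeakHashMap lives exactly as long as its key is
    // strongly reachable; this mirrors why TestCachingWrapperFilter
    // must keep a hard ref to the reader used as the cache key.
    static boolean survivesWhileReferenced() {
        Map<Object, String> cache = new WeakHashMap<>();
        Object reader = new Object();        // stands in for the SegmentReader key
        cache.put(reader, "cached DocIdSet");

        System.gc();                         // must NOT touch this entry: the key is live
        boolean present = cache.containsKey(reader);

        reader = null;                       // drop the only hard reference
        System.gc();                         // the entry is now merely *eligible* for
                                             // removal, at a time of the GC's choosing,
                                             // so nothing can be asserted about it here
        return present;
    }

    public static void main(String[] args) {
        System.out.println("survived while referenced: " + survivesWhileReferenced());
    }
}
```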
[jira] Updated: (LUCENE-2598) allow tests to use different Directory impls
[ https://issues.apache.org/jira/browse/LUCENE-2598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-2598: Attachment: LUCENE-2598.patch ok, here is the previous patch, except random is now enabled by default. (but most of the time uses ramdirectory so the tests are still generally quick) allow tests to use different Directory impls Key: LUCENE-2598 URL: https://issues.apache.org/jira/browse/LUCENE-2598 Project: Lucene - Java Issue Type: Test Components: Build Affects Versions: 3.1, 4.0 Reporter: Robert Muir Assignee: Robert Muir Fix For: 3.1, 4.0 Attachments: LUCENE-2598.patch, LUCENE-2598.patch, LUCENE-2598.patch, LUCENE-2598.patch, LUCENE-2598.patch, LUCENE-2598.patch, LUCENE-2598.patch, LUCENE-2598.patch, LUCENE-2598.patch, LUCENE-2598.patch Now that all tests use MockRAMDirectory instead of RAMDirectory, they are all picky like windows and force our tests to close readers etc before closing the directory. I think we should do the following: # change new MockRAMDIrectory() in tests to .newDirectory(random) # LuceneTestCase[J4] tracks if all dirs are closed at tearDown and also cleans up temp dirs like solr. # factor out the Mockish stuff from MockRAMDirectory into MockDirectoryWrapper # allow a -Dtests.directoryImpl or simpler to specify the default Directory to use for tests: default being random i think theres a chance we might find some bugs that havent yet surfaced because they are easier to trigger with FSDir Furthermore, this would be beneficial to Directory-implementors as they could run the entire testsuite against their Directory impl, just like codec-implementors can do now. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
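The selection logic described in the issue — random by default, biased toward RAMDirectory so the suite stays fast, and overridable via a system property — might look roughly like the sketch below. The method name and the weighting are illustrative guesses, not the committed patch; only the `tests.directoryImpl` property name comes from the issue text.

```java
import java.util.List;
import java.util.Random;

public class DirectoryPicker {
    // Sketch of the LUCENE-2598 idea, not the actual implementation:
    // pick a Directory impl for a test run, mostly RAMDirectory for speed,
    // occasionally a filesystem-backed impl to surface FSDir-only bugs.
    static String pickDirectoryImpl(Random random) {
        String forced = System.getProperty("tests.directoryImpl", "random");
        if (!forced.equals("random")) {
            return forced;   // -Dtests.directoryImpl pins a specific impl
        }
        if (random.nextInt(10) < 7) {
            return "RAMDirectory";   // 70%: keep the suite quick
        }
        List<String> fsImpls = List.of("SimpleFSDirectory", "NIOFSDirectory", "MMapDirectory");
        return fsImpls.get(random.nextInt(fsImpls.size()));
    }

    public static void main(String[] args) {
        System.out.println(pickDirectoryImpl(new Random()));
    }
}
```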
[jira] Assigned: (LUCENE-2590) Enable access to the freq information in a Query's sub-scorers
[ https://issues.apache.org/jira/browse/LUCENE-2590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Simon Willnauer reassigned LUCENE-2590: --- Assignee: Simon Willnauer (was: Michael McCandless) Enable access to the freq information in a Query's sub-scorers -- Key: LUCENE-2590 URL: https://issues.apache.org/jira/browse/LUCENE-2590 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Michael McCandless Assignee: Simon Willnauer Attachments: LUCENE-2590.patch, LUCENE-2590.patch, LUCENE-2590.patch, LUCENE-2590.patch The ability to gather more details than just the score, of how a given doc matches the current query, has come up a number of times on the user's lists. (most recently in the thread Query Match Count by Ryan McV on java-user). EG if you have a simple TermQuery foo, on each hit you'd like to know how many times foo occurred in that doc; or a BooleanQuery +foo +bar, being able to separately see the freq of foo and bar for the current hit. Lucene doesn't make this possible today, which is a shame because Lucene in fact does compute exactly this information; it's just not accessible from the Collector. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2590) Enable access to the freq information in a Query's sub-scorers
[ https://issues.apache.org/jira/browse/LUCENE-2590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12902379#action_12902379 ] Simon Willnauer commented on LUCENE-2590: - bq. Oh I see we can't quite have Scorer impl this because it doesn't know the query. But maybe we can factor out a common method, that the subclass passed the query to? I had the same idea in a previous iteration, but since a Scorer doesn't know which Query it scores for, I cannot make that call there. One way of doing it would be to add the scorer's {{Weight}} as a protected final member; since {{Weight}} already has a {{#getQuery()}} method, we can easily access it, or throw an UnsupportedOperationException if the weight is null (force it via the ctor and have a default one which sets it to null). Since most of the scorers know their {{Weight}} anyway and would need to call the visitor, we can also factor it out. bq. Also, we are missing some scorers (SpanScorer, ConstantScoreQuery.ConstantScorer, probably others), but if we do the super approach, we'd get these for free (I think?). Most of them would then come for free, though! Thoughts?
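The shape being discussed — a scorer that can report per-hit frequency, with composites forwarding a visitor to each clause so that a BooleanQuery "+foo +bar" reports foo's and bar's freqs separately — can be sketched as below. All names here are illustrative stand-ins, not the actual LUCENE-2590 patch API.

```java
import java.util.ArrayList;
import java.util.List;

public class SubScorerSketch {
    interface FreqVisitor { void visit(String query, float freq); }

    // Leaf scorer: knows its query (via its weight, conceptually) and
    // the term frequency for the current document.
    static class TermScorerSketch {
        final String query; final float freq;
        TermScorerSketch(String query, float freq) { this.query = query; this.freq = freq; }
        void accept(FreqVisitor v) { v.visit(query, freq); }
    }

    // Composite scorer: forwards the visitor to each sub-scorer, so each
    // clause of a BooleanQuery reports its own frequency.
    static class BooleanScorerSketch {
        final List<TermScorerSketch> subs;
        BooleanScorerSketch(List<TermScorerSketch> subs) { this.subs = subs; }
        void accept(FreqVisitor v) { for (TermScorerSketch s : subs) s.accept(v); }
    }

    static List<String> report() {
        BooleanScorerSketch bs = new BooleanScorerSketch(List.of(
                new TermScorerSketch("foo", 3f), new TermScorerSketch("bar", 5f)));
        List<String> out = new ArrayList<>();
        bs.accept((q, f) -> out.add(q + "=" + f));
        return out;
    }

    public static void main(String[] args) {
        System.out.println(report());   // one entry per clause, e.g. foo=3.0, bar=5.0
    }
}
```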
[jira] Updated: (SOLR-2031) QueryComponent's default query parser should be configurable from solrconfig.xml
[ https://issues.apache.org/jira/browse/SOLR-2031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Simon Willnauer updated SOLR-2031: -- Attachment: SOLR-2031.patch updated patch to trunk - if nobody objects I'd commit that in a day or two QueryComponent's default query parser should be configurable from solrconfig.xml Key: SOLR-2031 URL: https://issues.apache.org/jira/browse/SOLR-2031 Project: Solr Issue Type: Improvement Components: SearchComponents - other Reporter: Karl Wright Assignee: Simon Willnauer Priority: Minor Attachments: SOLR-2031.patch, SOLR-2031.patch, SOLR-2031.patch In a multi-lucene-query environment, QueryComponent's way of selecting a default query parser must include solrconfig.xml support to be useful. It can't just get the default query parser from the request arguments. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
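A sketch of the kind of configuration being discussed — choosing the default query parser in solrconfig.xml rather than per request. The element names follow the stock requestHandler defaults-block convention; whether the patch uses exactly this shape is not shown in the thread.

```xml
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- query parser used when the request carries no defType parameter -->
    <str name="defType">dismax</str>
  </lst>
</requestHandler>
```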
[jira] Resolved: (LUCENE-2616) FastVectorHighlighter: out of alignment when the first value is empty in multiValued field
[ https://issues.apache.org/jira/browse/LUCENE-2616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Sekiguchi resolved LUCENE-2616. Fix Version/s: 3.1 4.0 Resolution: Fixed trunk: Committed revision 989035. branch_3x: Committed revision 989056. FastVectorHighlighter: out of alignment when the first value is empty in multiValued field -- Key: LUCENE-2616 URL: https://issues.apache.org/jira/browse/LUCENE-2616 Project: Lucene - Java Issue Type: Bug Components: contrib/highlighter Affects Versions: 2.9.3 Reporter: Koji Sekiguchi Assignee: Koji Sekiguchi Priority: Trivial Fix For: 3.1, 4.0 Attachments: LUCENE-2616.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2095) Document not guaranteed to be found after write and commit
[ https://issues.apache.org/jira/browse/LUCENE-2095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12902410#action_12902410 ] vijaykumarraja.grandhi commented on LUCENE-2095: Please help me; all of my leads are slowly drying up.
[jira] Issue Comment Edited: (LUCENE-2095) Document not guaranteed to be found after write and commit
[ https://issues.apache.org/jira/browse/LUCENE-2095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12902410#action_12902410 ] vijaykumarraja.grandhi edited comment on LUCENE-2095 at 8/25/10 8:41 AM: - Please help me; all of my leads are slowly drying up. I am failing to get multi-threading working with Lucene: it deadlocks, and I always see a write.lock file inside the index folder. was (Author: gvkraj23): Please help me. Slowly all my trails are getting dried out.
[jira] Commented: (LUCENE-2598) allow tests to use different Directory impls
[ https://issues.apache.org/jira/browse/LUCENE-2598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12902433#action_12902433 ] Robert Muir commented on LUCENE-2598: - the fixes to NIOFS and MMap are committed in revision 989030. On Windows all tests pass with all directory impls, but the default is still RAMDirectory, at least until we verify that Mac OS X and Linux are OK with random.
[jira] Commented: (SOLR-2088) contrib/extraction fails on a turkish computer
[ https://issues.apache.org/jira/browse/SOLR-2088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12902453#action_12902453 ] Mark Miller commented on SOLR-2088: --- I'm running into this on my hudson box - more info: Stacktrace junit.framework.AssertionFailedError: query failed XPath: //*[@numFound='1'] xml response was: <?xml version="1.0" encoding="UTF-8"?> <response> <lst name="responseHeader"><int name="status">0</int><int name="QTime">3</int></lst><result name="response" numFound="0" start="0"/> </response> request was: start=0&q=title:Welcome&qt=standard&rows=20&version=2.2 at org.apache.solr.SolrTestCaseJ4.assertQ(SolrTestCaseJ4.java:320) at org.apache.solr.SolrTestCaseJ4.assertQ(SolrTestCaseJ4.java:310) at org.apache.solr.handler.ExtractingRequestHandlerTest.testExtraction(ExtractingRequestHandlerTest.java:83) Standard Output NOTE: random codec of testcase 'testExtraction' was: MockSep NOTE: random locale of testcase 'testExtraction' was: tr NOTE: random timezone of testcase 'testExtraction' was: Africa/Dar_es_Salaam Standard Error 25.Ağu.2010 08:51:38 org.apache.solr.common.SolrException log SEVERE: org.apache.solr.common.SolrException: ERROR:unknown field 'a' at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:321) at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:60) at org.apache.solr.handler.extraction.ExtractingDocumentLoader.doAdd(ExtractingDocumentLoader.java:120) at org.apache.solr.handler.extraction.ExtractingDocumentLoader.addDoc(ExtractingDocumentLoader.java:125) at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:195) at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131) at org.apache.solr.core.SolrCore.execute(SolrCore.java:1323) at org.apache.solr.util.TestHarness.queryAndResponse(TestHarness.java:334) at org.apache.solr.handler.ExtractingRequestHandlerTest.loadLocal(ExtractingRequestHandlerTest.java:361) at org.apache.solr.handler.ExtractingRequestHandlerTest.testDefaultField(ExtractingRequestHandlerTest.java:149) contrib/extraction fails on a turkish computer -- Key: SOLR-2088 URL: https://issues.apache.org/jira/browse/SOLR-2088 Project: Solr Issue Type: Bug Components: contrib - Solr Cell (Tika extraction) Reporter: Robert Muir Fix For: 3.1, 4.0 reproduce with: ant test -Dtests.locale=tr_TR
{noformat}
test: [junit] Running org.apache.solr.handler.ExtractingRequestHandlerTest [junit] xml response was: <?xml version="1.0" encoding="UTF-8"?> [junit] <response> [junit] <lst name="responseHeader"><int name="status">0</int><int name="QTime">5</int></lst> <result name="response" numFound="0" start="0"/> [junit] </response> [junit] [junit] request was: start=0&q=title:Welcome&qt=standard&rows=20&version=2.2 [junit] Tests run: 8, Failures: 1, Errors: 0, Time elapsed: 3.968 sec [junit] Test org.apache.solr.handler.ExtractingRequestHandlerTest FAILED BUILD FAILED
{noformat}
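Failures pinned to `-Dtests.locale=tr_TR` are very often the classic "Turkish i" problem: case conversion done in the default locale instead of an invariant one. Whether that is the exact root cause of SOLR-2088 isn't shown in this thread, but the hazard itself is easy to demonstrate:

```java
import java.util.Locale;

public class TurkishI {
    public static void main(String[] args) {
        Locale turkish = new Locale("tr");
        // Turkish case rules: uppercase 'I' lowercases to dotless 'ı' (U+0131),
        // so default-locale lowercasing of ASCII names breaks under tr_TR.
        System.out.println("TITLE".toLowerCase(turkish));      // tıtle (dotless i)
        System.out.println("TITLE".toLowerCase(Locale.ROOT));  // title
    }
}
```

The usual fix is to pass `Locale.ROOT` (or an equivalent invariant locale) to any `toLowerCase`/`toUpperCase` call on programmatic identifiers such as field names.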
[jira] Commented: (SOLR-1566) Allow components to add fields to outgoing documents
[ https://issues.apache.org/jira/browse/SOLR-1566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12902456#action_12902456 ] Grant Ingersoll commented on SOLR-1566: --- I think we all have generally worked around the same issues here, between this and SOLR-1298. I guess we just need to pick some names and work it out. One thing about this last patch (and mine, I think) is that perhaps we should just put the augmenter on the Request. That way, you don't have to add the response in a bunch of places. Besides, in my mind anyway, you are requesting augmentation via the Augmenter provided. Also, I'm not sure why StdAugmenter is instantiated in SolrCore. Wouldn't we want to allow for that to be driven by some user implementations? Perhaps, since there are a few of us w/ eyes on this, we should first try to tackle the ResponseWriter mess. Allow components to add fields to outgoing documents Key: SOLR-1566 URL: https://issues.apache.org/jira/browse/SOLR-1566 Project: Solr Issue Type: New Feature Components: search Reporter: Noble Paul Assignee: Grant Ingersoll Fix For: Next Attachments: SOLR-1566-gsi.patch, SOLR-1566-rm.patch, SOLR-1566.patch, SOLR-1566.patch, SOLR-1566.patch, SOLR-1566.patch Currently it is not possible for components to add fields to outgoing documents which are not in the stored fields of the document. This makes it cumbersome to add computed fields/metadata. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Resolved: (LUCENE-2412) Architecture Diagrams needed for Lucene, Solr and Nutch
[ https://issues.apache.org/jira/browse/LUCENE-2412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll resolved LUCENE-2412. - Resolution: Fixed Architecture Diagrams needed for Lucene, Solr and Nutch --- Key: LUCENE-2412 URL: https://issues.apache.org/jira/browse/LUCENE-2412 Project: Lucene - Java Issue Type: Task Components: Other Reporter: Grant Ingersoll Assignee: Grant Ingersoll Attachments: arch.pdf, LIA2_01_04.pdf, NutchArch.pdf, solr-arch.pdf -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (SOLR-2059) Allow customizing how WordDelimiterFilter tokenizes text.
[ https://issues.apache.org/jira/browse/SOLR-2059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12902588#action_12902588 ] Peter Karich commented on SOLR-2059: Robert, thanks for this work! I have a different application for this patch: in a twitter search, # and @ shouldn't be removed. Instead I will handle them like ALPHA, I think. Would you mind updating the patch for the latest version of the trunk? I get a problem with WordDelimiterIterator at line 254 if I use https://svn.apache.org/repos/asf/lucene/dev/trunk/solr, and a missing-file problem (line 37) with http://svn.apache.org/repos/asf/solr Allow customizing how WordDelimiterFilter tokenizes text. - Key: SOLR-2059 URL: https://issues.apache.org/jira/browse/SOLR-2059 Project: Solr Issue Type: New Feature Components: Schema and Analysis Reporter: Robert Muir Priority: Minor Fix For: 3.1, 4.0 Attachments: SOLR-2059.patch By default, WordDelimiterFilter assigns 'types' to each character (computed from Unicode Properties). Based on these types and the options provided, it splits and concatenates text. In some circumstances, you might need to tweak the behavior of how this works. It seems the filter already had this in mind, since you can pass in a custom byte[] type table. But it's not exposed in the factory. I think you should be able to customize the defaults with a configuration file: {noformat} # A customized type mapping for WordDelimiterFilterFactory # the allowable types are: LOWER, UPPER, ALPHA, DIGIT, ALPHANUM, SUBWORD_DELIM # # the default for any character without a mapping is always computed from # Unicode character properties # Map the $, %, '.', and ',' characters to DIGIT # This might be useful for financial data. $ = DIGIT % = DIGIT . = DIGIT \u002C = DIGIT {noformat} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. 
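The mapping format proposed above is simple enough to sketch a parser for. The following is an illustrative Python sketch of the stated `char = TYPE` format (with '#' comments and \uXXXX escapes), not Solr's actual parser — the real factory would presumably resolve the type names to WordDelimiterFilter's byte constants. Note one limitation of this sketch: a literal '#' (as in the twitter use case) would need the \u0023 escape, since a bare '#' starts a comment.

```python
def parse_type_map(text):
    """Parse a WordDelimiterFilterFactory-style char type mapping.

    Sketch only: lines are `char = TYPE`, '#' starts a comment,
    and the left-hand side may use a \\uXXXX escape.
    """
    mapping = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith('#'):
            continue  # skip blanks and comment lines
        lhs, sep, rhs = line.partition('=')
        if not sep:
            continue  # not a mapping line
        ch, typ = lhs.strip(), rhs.strip()
        if ch.startswith('\\u') and len(ch) == 6:
            ch = chr(int(ch[2:], 16))  # decode e.g. \u002C -> ','
        mapping[ch] = typ
    return mapping
```

Running it over the example file from the issue would map '$', '%', '.', and ',' to DIGIT.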
[jira] Commented: (SOLR-2059) Allow customizing how WordDelimiterFilter tokenizes text.
[ https://issues.apache.org/jira/browse/SOLR-2059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12902593#action_12902593 ] Robert Muir commented on SOLR-2059: --- Hi Peter: that's a great example. My use case wasn't actually this example either; I was just trying to give a good general one. What do you think of the file format, is it ok for describing these categories? This format/parser is just stolen from MappingCharFilterFactory; it seemed unambiguous and is already in use. As far as applying the patch, you need to apply it to https://svn.apache.org/repos/asf/lucene/dev/trunk, not https://svn.apache.org/repos/asf/lucene/dev/trunk/solr. This is because it has to modify a file in modules, too. Allow customizing how WordDelimiterFilter tokenizes text. - Key: SOLR-2059 URL: https://issues.apache.org/jira/browse/SOLR-2059 Project: Solr Issue Type: New Feature Components: Schema and Analysis Reporter: Robert Muir Priority: Minor Fix For: 3.1, 4.0 Attachments: SOLR-2059.patch By default, WordDelimiterFilter assigns 'types' to each character (computed from Unicode Properties). Based on these types and the options provided, it splits and concatenates text. In some circumstances, you might need to tweak the behavior of how this works. It seems the filter already had this in mind, since you can pass in a custom byte[] type table. But it's not exposed in the factory. I think you should be able to customize the defaults with a configuration file: {noformat} # A customized type mapping for WordDelimiterFilterFactory # the allowable types are: LOWER, UPPER, ALPHA, DIGIT, ALPHANUM, SUBWORD_DELIM # # the default for any character without a mapping is always computed from # Unicode character properties # Map the $, %, '.', and ',' characters to DIGIT # This might be useful for financial data. $ = DIGIT % = DIGIT . = DIGIT \u002C = DIGIT {noformat} -- This message is automatically generated by JIRA. 
[jira] Commented: (SOLR-2059) Allow customizing how WordDelimiterFilter tokenizes text.
[ https://issues.apache.org/jira/browse/SOLR-2059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12902600#action_12902600 ] Peter Karich commented on SOLR-2059: Oops, my mistake ... this helped! What do you think of the file format, is it ok for describing these categories? I think it is ok. I even had a simpler patch before stumbling over yours: handleAsChar=@# which is now more powerful IMHO: @ = ALPHA # = ALPHA Allow customizing how WordDelimiterFilter tokenizes text. - Key: SOLR-2059 URL: https://issues.apache.org/jira/browse/SOLR-2059 Project: Solr Issue Type: New Feature Components: Schema and Analysis Reporter: Robert Muir Priority: Minor Fix For: 3.1, 4.0 Attachments: SOLR-2059.patch By default, WordDelimiterFilter assigns 'types' to each character (computed from Unicode Properties). Based on these types and the options provided, it splits and concatenates text. In some circumstances, you might need to tweak the behavior of how this works. It seems the filter already had this in mind, since you can pass in a custom byte[] type table. But it's not exposed in the factory. I think you should be able to customize the defaults with a configuration file: {noformat} # A customized type mapping for WordDelimiterFilterFactory # the allowable types are: LOWER, UPPER, ALPHA, DIGIT, ALPHANUM, SUBWORD_DELIM # # the default for any character without a mapping is always computed from # Unicode character properties # Map the $, %, '.', and ',' characters to DIGIT # This might be useful for financial data. $ = DIGIT % = DIGIT . = DIGIT \u002C = DIGIT {noformat} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Issue Comment Edited: (SOLR-2059) Allow customizing how WordDelimiterFilter tokenizes text.
[ https://issues.apache.org/jira/browse/SOLR-2059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12902600#action_12902600 ] Peter Karich edited comment on SOLR-2059 at 8/25/10 3:46 PM: - Oops, my mistake ... this helped! What do you think of the file format, is it ok for describing these categories? I think it is ok. I even had a simpler patch before stumbling over yours: handleAsChar=@# which is now more powerful IMHO: {code} @ = ALPHA # = ALPHA {code} was (Author: peathal): Oops, my mistake ... this helped! What do you think of the file format, is it ok for describing these categories? I think it is ok. I even had a simpler patch before stumbling over yours: handleAsChar=@# which is now more powerful IMHO: @ = ALPHA # = ALPHA Allow customizing how WordDelimiterFilter tokenizes text. - Key: SOLR-2059 URL: https://issues.apache.org/jira/browse/SOLR-2059 Project: Solr Issue Type: New Feature Components: Schema and Analysis Reporter: Robert Muir Priority: Minor Fix For: 3.1, 4.0 Attachments: SOLR-2059.patch By default, WordDelimiterFilter assigns 'types' to each character (computed from Unicode Properties). Based on these types and the options provided, it splits and concatenates text. In some circumstances, you might need to tweak the behavior of how this works. It seems the filter already had this in mind, since you can pass in a custom byte[] type table. But it's not exposed in the factory. I think you should be able to customize the defaults with a configuration file: {noformat} # A customized type mapping for WordDelimiterFilterFactory # the allowable types are: LOWER, UPPER, ALPHA, DIGIT, ALPHANUM, SUBWORD_DELIM # # the default for any character without a mapping is always computed from # Unicode character properties # Map the $, %, '.', and ',' characters to DIGIT # This might be useful for financial data. $ = DIGIT % = DIGIT . = DIGIT \u002C = DIGIT {noformat} -- This message is automatically generated by JIRA. 
[jira] Updated: (LUCENE-2239) Revise NIOFSDirectory and its usage due to NIO limitations on Thread.interrupt
[ https://issues.apache.org/jira/browse/LUCENE-2239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Simon Willnauer updated LUCENE-2239: Attachment: LUCENE-2239.patch Here is a new patch that adds the essential information to the NIOFSDirectory and MMapDirectory. I wonder if we should refer to this issue in the doc; IMO a link is not necessary. I removed the TestCase from the previous patch since it was only there to reproduce the problem in isolation. Revise NIOFSDirectory and its usage due to NIO limitations on Thread.interrupt -- Key: LUCENE-2239 URL: https://issues.apache.org/jira/browse/LUCENE-2239 Project: Lucene - Java Issue Type: Task Components: Store Affects Versions: 2.4, 2.4.1, 2.9, 2.9.1, 3.0 Reporter: Simon Willnauer Assignee: Simon Willnauer Attachments: LUCENE-2239.patch, LUCENE-2239.patch I created this issue as a spin-off from http://mail-archives.apache.org/mod_mbox/lucene-java-dev/201001.mbox/%3cf18c9dde1001280051w4af2bc50u1cfd55f85e509...@mail.gmail.com%3e We should decide what to do with NIOFSDirectory, whether we want to keep it as the default on non-Windows platforms, and how we want to document this. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (SOLR-2089) Faceting: order term ords before converting to values
[ https://issues.apache.org/jira/browse/SOLR-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12902623#action_12902623 ] Yonik Seeley commented on SOLR-2089: Results: docs=10M, docs matching query=1M, facet on field of 100,000 unique terms, facet.method=fc (multivalued)
|facet.limit|ms to facet trunk|ms to facet patch|
|100|63|63|
|1000|228|191|
|5000|722|307|
|1|1033|316|
So a decent speedup when facet.limit is very high. It will also help when facet.limit is high relative to the number of unique terms (since the speedup is due to ordering the term ords and not having to seek as often). I plan on committing soon if there are no objections. Faceting: order term ords before converting to values - Key: SOLR-2089 URL: https://issues.apache.org/jira/browse/SOLR-2089 Project: Solr Issue Type: Improvement Affects Versions: 1.4 Reporter: Yonik Seeley Attachments: SOLR-2089.patch We should be able to speed up multi-valued faceting that sorts by count and returns many values by first sorting the term ords before converting them to a string. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
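The reason ordering the ords helps can be illustrated outside Solr: the top-N entries arrive in count order, but resolving each ord to its term in that order seeks the term dictionary at random positions, while resolving in ascending ord order is a single forward scan. A toy Python model of the idea (illustrative names, not Solr's code):

```python
def resolve_top_terms(top, lookup):
    """top: list of (ord, count) pairs in count order.

    Resolve ords to terms in ascending ord order (sequential,
    seek-friendly access to the term dictionary), then return
    (term, count) pairs back in the original count order.
    """
    resolved = {}
    for ord_ in sorted(o for o, _ in top):
        resolved[ord_] = lookup(ord_)  # forward-only scan over ords
    return [(resolved[o], c) for o, c in top]
```

The output is identical to resolving in count order; only the access pattern to the term dictionary changes, which is where the reported speedup comes from.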
[jira] Updated: (SOLR-1986) Allow users to define multiple subfield types in AbstractSubTypeFieldType
[ https://issues.apache.org/jira/browse/SOLR-1986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Joiner updated SOLR-1986: Attachment: AbstractMultiSubTypeFieldType.patch Since the reason people seemed to object to the patch in the mailing list was that the AbstractSubTypeFieldType was not originally intended to be used for multiple different types, I made it a separate class. Also, the subFieldType parameter now works, and the created subFields are prepended with subtype_ so as to allow dynamicFields to be used to simulate multiValued fields. Allow users to define multiple subfield types in AbstractSubTypeFieldType - Key: SOLR-1986 URL: https://issues.apache.org/jira/browse/SOLR-1986 Project: Solr Issue Type: Improvement Components: Schema and Analysis Reporter: Mark Allan Priority: Minor Attachments: AbstractMultiSubTypeFieldType.patch, multiSubType.patch Original Estimate: 48h Remaining Estimate: 48h A few small changes to the AbstractSubTypeFieldType class to allow users to define distinct field types for each subfield. This enables us to define complex data types in the schema. For example, we have our own subclass of the CoordinateFieldType called TemporalCoverage where we store a start and end date for an event but now we can store a name for the event as well. <fieldType name="temporal" class="uk.ac.edina.solr.schema.TemporalCoverage" dimension="3" subFieldSuffix="_ti,_ti,_s"/> In this example, the start and end dates get stored as trie-coded integer subfields and the description as a string subfield. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2239) Revise NIOFSDirectory and its usage due to NIO limitations on Thread.interrupt
[ https://issues.apache.org/jira/browse/LUCENE-2239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12902641#action_12902641 ] Simon Willnauer commented on LUCENE-2239: - Good point Robert - instead of duplicating documentation we could recommend that users read the implementation-specific documentation before using FSDirectory#open(). Something like this: Currently this returns {@link NIOFSDirectory} on non-Windows JREs and {@link SimpleFSDirectory} on Windows. Since these directory implementations have slightly different behavior and limitations, it is recommended to consult the implementation-specific documentation for the platform your application is running on. simon Revise NIOFSDirectory and its usage due to NIO limitations on Thread.interrupt -- Key: LUCENE-2239 URL: https://issues.apache.org/jira/browse/LUCENE-2239 Project: Lucene - Java Issue Type: Task Components: Store Affects Versions: 2.4, 2.4.1, 2.9, 2.9.1, 3.0 Reporter: Simon Willnauer Assignee: Simon Willnauer Attachments: LUCENE-2239.patch, LUCENE-2239.patch I created this issue as a spin-off from http://mail-archives.apache.org/mod_mbox/lucene-java-dev/201001.mbox/%3cf18c9dde1001280051w4af2bc50u1cfd55f85e509...@mail.gmail.com%3e We should decide what to do with NIOFSDirectory, whether we want to keep it as the default on non-Windows platforms, and how we want to document this. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (SOLR-1301) Solr + Hadoop
[ https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12902645#action_12902645 ] Daniel Ivan Pizarro commented on SOLR-1301: --- I'm getting the following error: java.lang.IllegalStateException: Failed to initialize record writer for , attempt_local_0001_r_00_0 Where can I find instructions to run the CSV uploader? (The readme file says "Please read the original patch readme for details on the CSV bulk uploader.", and I can't find that readme file.) Solr + Hadoop - Key: SOLR-1301 URL: https://issues.apache.org/jira/browse/SOLR-1301 Project: Solr Issue Type: Improvement Affects Versions: 1.4 Reporter: Andrzej Bialecki Fix For: Next Attachments: commons-logging-1.0.4.jar, commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar, hadoop-0.20.1-core.jar, hadoop.patch, log4j-1.2.15.jar, README.txt, SOLR-1301-hadoop-0-20.patch, SOLR-1301-hadoop-0-20.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SolrRecordWriter.java This patch contains a contrib module that provides distributed indexing (using Hadoop) to Solr EmbeddedSolrServer. The idea behind this module is twofold: * provide an API that is familiar to Hadoop developers, i.e. that of OutputFormat * avoid unnecessary export and (de)serialization of data maintained on HDFS. SolrOutputFormat consumes data produced by reduce tasks directly, without storing it in intermediate files. Furthermore, by using an EmbeddedSolrServer, the indexing task is split into as many parts as there are reducers, and the data to be indexed is not sent over the network. Design -- Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, which in turn uses SolrRecordWriter to write this data. 
SolrRecordWriter instantiates an EmbeddedSolrServer, and it also instantiates an implementation of SolrDocumentConverter, which is responsible for turning a Hadoop (key, value) pair into a SolrInputDocument. This data is then added to a batch, which is periodically submitted to EmbeddedSolrServer. When the reduce task completes and the OutputFormat is closed, SolrRecordWriter calls commit() and optimize() on the EmbeddedSolrServer. The API provides facilities to specify an arbitrary existing solr.home directory, from which the conf/ and lib/ files will be taken. This process results in the creation of as many partial Solr home directories as there were reduce tasks. The output shards are placed in the output directory on the default filesystem (e.g. HDFS). Such part-N directories can be used to run N shard servers. Additionally, users can specify the number of reduce tasks, in particular 1 reduce task, in which case the output will consist of a single shard. An example application is provided that processes large CSV files and uses this API. It uses custom CSV processing to avoid (de)serialization overhead. This patch relies on hadoop-core-0.19.1.jar - I attached the jar to this issue, you should put it in contrib/hadoop/lib. Note: the development of this patch was sponsored by an anonymous contributor and approved for release under Apache License. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
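The batching behavior described in the design — SolrRecordWriter accumulating converted documents and periodically submitting the batch to the EmbeddedSolrServer — follows a standard write-batching pattern, sketched here in Python. This is a hypothetical simplification, not the patch's code: `submit` stands in for the server's add call, and the names are illustrative.

```python
class BatchWriter:
    """Accumulate documents and flush them in fixed-size batches.

    Sketch of the pattern only: a real SolrRecordWriter would pass
    each batch to EmbeddedSolrServer and call commit()/optimize()
    when the OutputFormat is closed.
    """

    def __init__(self, submit, batch_size=100):
        self.submit = submit          # callback receiving a full batch
        self.batch_size = batch_size
        self.batch = []

    def write(self, doc):
        self.batch.append(doc)
        if len(self.batch) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.batch:
            self.submit(list(self.batch))  # hand off a copy of the batch
            self.batch = []

    def close(self):
        self.flush()  # final partial batch; commit/optimize would follow
```

The batch size trades memory for per-submit overhead, which matters on the reducer side where many documents stream through one writer.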
[jira] Created: (SOLR-2090) Allow reader to be passed in SolrInputDocument.addField method
Allow reader to be passed in SolrInputDocument.addField method -- Key: SOLR-2090 URL: https://issues.apache.org/jira/browse/SOLR-2090 Project: Solr Issue Type: Improvement Components: clients - java Affects Versions: 1.4.1 Environment: Windows Vista 32 bit. JDK 1.6. Reporter: Bojan Vukojevic I am using SolrJ with embedded Solr server and some documents have a lot of text. Solr will be running on a small device with very limited memory. In my tests I cannot process more than 3MB of text (in a body) with 64MB heap. According to Java there is about 30MB free memory before I call server.add and with 5MB of text it runs out of memory. I sent an inquiry to a mailing list and was advised to create JIRA issue. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
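The requested Reader-based addField points at the usual streaming fix for this kind of out-of-memory failure: consume the body in bounded chunks instead of materializing it as one string, so peak memory is one chunk rather than the whole document. A Python sketch of the idea (not SolrJ's API; `sink` is a placeholder for whatever consumes the characters):

```python
import io

def consume_in_chunks(reader, sink, chunk_size=8192):
    """Drain a character stream into `sink` chunk by chunk.

    Peak memory held here is a single chunk, regardless of how
    large the body is — the point of accepting a Reader instead
    of a fully materialized string.
    """
    total = 0
    while True:
        chunk = reader.read(chunk_size)
        if not chunk:
            break  # end of stream
        sink(chunk)
        total += len(chunk)
    return total
```

With a 64MB heap, the difference between holding a 5MB body (plus copies made while building the request) and holding one 8KB chunk at a time is exactly the failure mode the reporter describes.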
Re: do the Java and Python garbage collectors talk to each other, with JCC?
On Wed, 25 Aug 2010, Bill Janssen wrote: Sure, but tell that to the Lucene folks. They're the ones starting a new thread here. Of course, now and then one needs to start a new thread. I forwarded your question to Mike McCandless (who is also a subscriber to this list) to see if he had something to say on this topic. Still, in your earlier message, you said: I spawn a lot of short-lived threads in Python, and each of them is attached to Java, and detached after the run method returns.. To which I'm suggesting that you pool these threads instead and reuse them. Andi..
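Andi's pooling suggestion can be sketched in pure Python: a fixed set of long-lived workers each attaches to the JVM once, drains tasks from a queue, and detaches only at shutdown, so the per-task attach/detach (and Java's per-thread heap/stack allocation) disappears. In this sketch, `attach`/`detach` are stand-ins for JCC's attachCurrentThread/detachCurrentThread and default to no-ops:

```python
import queue
import threading

def run_pool(tasks, size=4, attach=lambda: None, detach=lambda: None):
    """Run tasks on a small pool of long-lived worker threads.

    Sketch only: `attach`/`detach` stand in for JVM thread
    attachment (e.g. JCC's attachCurrentThread) and are called
    once per worker, not once per task.
    """
    jobs = queue.Queue()

    def worker():
        attach()  # attach to the JVM once, when the worker starts
        try:
            while True:
                job = jobs.get()
                if job is None:  # sentinel: shut this worker down
                    return
                job()
        finally:
            detach()  # detach once, at pool shutdown

    threads = [threading.Thread(target=worker) for _ in range(size)]
    for t in threads:
        t.start()
    for task in tasks:
        jobs.put(task)
    for _ in threads:
        jobs.put(None)  # one shutdown sentinel per worker
    for t in threads:
        t.join()
```

A long-running service would keep the pool alive across requests rather than joining it; the key property is the same either way — attach is called `size` times total, not once per task.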
Re: [jira] Commented: (LUCENE-2611) IntelliJ IDEA setup
Great, I'll give it a whirl when I see the notification come back through. Erick On Wed, Aug 25, 2010 at 7:24 AM, Steven Rowe (JIRA) j...@apache.org wrote: [ https://issues.apache.org/jira/browse/LUCENE-2611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12902459#action_12902459] Steven Rowe commented on LUCENE-2611: - Hi Erick, bq. I have to go in to each module and re-select project sdk on the dependencies tab, even though it looks like it's already selected! I removed a chunk of configuration from the *.iml files that sets this, I think - I'll post a patch shortly that puts the per-module project SDK inheritance back, and should hopefully address the problem you're seeing. IntelliJ IDEA setup --- Key: LUCENE-2611 URL: https://issues.apache.org/jira/browse/LUCENE-2611 Project: Lucene - Java Issue Type: New Feature Components: Build Affects Versions: 4.0 Reporter: Steven Rowe Priority: Minor Fix For: 4.0 Attachments: LUCENE-2611.patch Setting up Lucene/Solr in IntelliJ IDEA can be time-consuming. The attached patch adds a new top level directory {{dev-tools/}} with sub-dir {{idea/}} containing basic setup files for trunk, as well as a top-level ant target named idea that copies these files into the proper locations. This arrangement avoids the messiness attendant to in-place project configuration files directly checked into source control. The IDEA configuration includes modules for Lucene and Solr, each Lucene and Solr contrib, and each analysis module. A JUnit test run per module is included. Once {{ant idea}} has been run, the only configuration that must be performed manually is configuring the project-level JDK. If this patch is committed, Subversion svn:ignore properties should be added/modified to ignore the destination module files (*.iml) in each module's directory. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. 
[jira] Created: (LUCENE-2622) Random Test Failure org.apache.lucene.TestExternalCodecs.testPerFieldCodec (from TestExternalCodecs)
Random Test Failure org.apache.lucene.TestExternalCodecs.testPerFieldCodec (from TestExternalCodecs)
Key: LUCENE-2622 URL: https://issues.apache.org/jira/browse/LUCENE-2622 Project: Lucene - Java Issue Type: Test Reporter: Mark Miller Priority: Minor

Error Message

state.ord=54 startOrd=0 ir.isIndexTerm=true state.docFreq=1

Stacktrace

junit.framework.AssertionFailedError: state.ord=54 startOrd=0 ir.isIndexTerm=true state.docFreq=1
    at org.apache.lucene.index.codecs.standard.StandardTermsDictReader$FieldReader$SegmentTermsEnum.seek(StandardTermsDictReader.java:395)
    at org.apache.lucene.index.DocumentsWriter.applyDeletes(DocumentsWriter.java:1099)
    at org.apache.lucene.index.DocumentsWriter.applyDeletes(DocumentsWriter.java:1028)
    at org.apache.lucene.index.IndexWriter.applyDeletes(IndexWriter.java:4213)
    at org.apache.lucene.index.IndexWriter.doFlushInternal(IndexWriter.java:3381)
    at org.apache.lucene.index.IndexWriter.doFlush(IndexWriter.java:3221)
    at org.apache.lucene.index.IndexWriter.flush(IndexWriter.java:3211)
    at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:2345)
    at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:2323)
    at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:2293)
    at org.apache.lucene.TestExternalCodecs.testPerFieldCodec(TestExternalCodecs.java:645)
    at org.apache.lucene.util.LuceneTestCase.runBare(LuceneTestCase.java:381)
    at org.apache.lucene.util.LuceneTestCase.run(LuceneTestCase.java:373)

Standard Output

NOTE: random codec of testcase 'testPerFieldCodec' was: MockFixedIntBlock(blockSize=1327)
NOTE: random locale of testcase 'testPerFieldCodec' was: lt_LT
NOTE: random timezone of testcase 'testPerFieldCodec' was: Africa/Lusaka
NOTE: random seed of testcase 'testPerFieldCodec' was: 812019387131615618

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. 
[jira] Created: (LUCENE-2623) Random Test Failure org.apache.lucene.index.TestIndexWriter.testAddIndexesWithThreads (from TestIndexWriter)
Random Test Failure org.apache.lucene.index.TestIndexWriter.testAddIndexesWithThreads (from TestIndexWriter)
Key: LUCENE-2623 URL: https://issues.apache.org/jira/browse/LUCENE-2623 Project: Lucene - Java Issue Type: Bug Reporter: Mark Miller Priority: Minor

Error Message

expected:<3160> but was:<2752>

Stacktrace

junit.framework.AssertionFailedError: expected:<3160> but was:<2752>
    at org.apache.lucene.index.TestIndexWriter.testAddIndexesWithThreads(TestIndexWriter.java:3794)
    at org.apache.lucene.util.LuceneTestCase.runBare(LuceneTestCase.java:380)
    at org.apache.lucene.util.LuceneTestCase.run(LuceneTestCase.java:372)

Standard Output

java.lang.AssertionError: IndexFileDeleter doesn't know about file _8h.cfs
    at org.apache.lucene.index.IndexWriter.filesExist(IndexWriter.java:4284)
    at org.apache.lucene.index.IndexWriter.startCommit(IndexWriter.java:4331)
    at org.apache.lucene.index.IndexWriter.prepareCommit(IndexWriter.java:3088)
    at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:3161)
    at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:3132)
    at org.apache.lucene.index.TestIndexWriter$CommitAndAddIndexes.doBody(TestIndexWriter.java:3774)
    at org.apache.lucene.index.TestIndexWriter$RunAddIndexesThreads$1.run(TestIndexWriter.java:3710)

NOTE: random codec of testcase 'testAddIndexesWithThreads' was: MockSep
NOTE: random locale of testcase 'testAddIndexesWithThreads' was: ms_MY
NOTE: random timezone of testcase 'testAddIndexesWithThreads' was: Asia/Aqtau
NOTE: random seed of testcase 'testAddIndexesWithThreads' was: -5272061551011630291

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2611) IntelliJ IDEA setup
[ https://issues.apache.org/jira/browse/LUCENE-2611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12902700#action_12902700 ] Erick Erickson commented on LUCENE-2611: Steven: That worked like a champ, all I had to do was set the project-level JDK and then run tests. The only other anomaly (and it's not causing me any problems) is that on the project settings page, there are circular dependencies... 1. queries, misc, common, remote 2. solr, extraction FWIW Erick IntelliJ IDEA setup --- Key: LUCENE-2611 URL: https://issues.apache.org/jira/browse/LUCENE-2611 Project: Lucene - Java Issue Type: New Feature Components: Build Affects Versions: 4.0 Reporter: Steven Rowe Priority: Minor Fix For: 4.0 Attachments: LUCENE-2611.patch, LUCENE-2611.patch Setting up Lucene/Solr in IntelliJ IDEA can be time-consuming. The attached patch adds a new top level directory {{dev-tools/}} with sub-dir {{idea/}} containing basic setup files for trunk, as well as a top-level ant target named idea that copies these files into the proper locations. This arrangement avoids the messiness attendant to in-place project configuration files directly checked into source control. The IDEA configuration includes modules for Lucene and Solr, each Lucene and Solr contrib, and each analysis module. A JUnit test run per module is included. Once {{ant idea}} has been run, the only configuration that must be performed manually is configuring the project-level JDK. If this patch is committed, Subversion svn:ignore properties should be added/modified to ignore the destination module files (*.iml) in each module's directory. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2611) IntelliJ IDEA setup
[ https://issues.apache.org/jira/browse/LUCENE-2611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12902739#action_12902739 ] Robert Muir commented on LUCENE-2611: - bq. DIH and Solr unit test runs still don't fully pass for me, but all other modules' test runs pass. Can you provide any information on tests that are giving you trouble? We could re-open LUCENE-2398 also. really it would be nice if all tests worked from these IDEs. IntelliJ IDEA setup --- Key: LUCENE-2611 URL: https://issues.apache.org/jira/browse/LUCENE-2611 Project: Lucene - Java Issue Type: New Feature Components: Build Affects Versions: 4.0 Reporter: Steven Rowe Priority: Minor Fix For: 4.0 Attachments: LUCENE-2611.patch, LUCENE-2611.patch Setting up Lucene/Solr in IntelliJ IDEA can be time-consuming. The attached patch adds a new top level directory {{dev-tools/}} with sub-dir {{idea/}} containing basic setup files for trunk, as well as a top-level ant target named idea that copies these files into the proper locations. This arrangement avoids the messiness attendant to in-place project configuration files directly checked into source control. The IDEA configuration includes modules for Lucene and Solr, each Lucene and Solr contrib, and each analysis module. A JUnit test run per module is included. Once {{ant idea}} has been run, the only configuration that must be performed manually is configuring the project-level JDK. If this patch is committed, Subversion svn:ignore properties should be added/modified to ignore the destination module files (*.iml) in each module's directory. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Random Lucene Test Failure: org.apache.lucene.index.TestBackwardsCompatibility.testIndexOldIndex (from TestBackwardsCompatibility)
org.apache.lucene.index.TestBackwardsCompatibility.testIndexOldIndex (from TestBackwardsCompatibility)
Failing for the past 1 build (since #1353). Took 0.29 sec.

Error Message:
wrong doc count expected:<46> but was:<45>

Stacktrace:
junit.framework.AssertionFailedError: wrong doc count expected:<46> but was:<45>
	at org.apache.lucene.index.TestBackwardsCompatibility.changeIndexWithAdds(TestBackwardsCompatibility.java:388)
	at org.apache.lucene.index.TestBackwardsCompatibility.testIndexOldIndex(TestBackwardsCompatibility.java:287)
	at org.apache.lucene.util.LuceneTestCase.runBare(LuceneTestCase.java:381)
	at org.apache.lucene.util.LuceneTestCase.run(LuceneTestCase.java:373)

Standard Output:
NOTE: random codec of testcase 'testIndexOldIndex' was: MockVariableIntBlock(baseBlockSize=49)
NOTE: random locale of testcase 'testIndexOldIndex' was: es_AR
NOTE: random timezone of testcase 'testIndexOldIndex' was: Portugal
NOTE: random seed of testcase 'testIndexOldIndex' was: -724598633153762820
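Those NOTE lines are printed because Lucene's test harness randomizes the codec, locale, timezone, and random seed per test; recording the seed lets a developer replay the exact same "random" choices and reproduce the failure deterministically. A minimal sketch of the underlying idea using plain java.util.Random (not Lucene's actual test infrastructure; the class and method names here are illustrative):

```java
import java.util.Random;

public class SeedRepro {
    // A pseudo-random generator seeded with a fixed value always produces the
    // same sequence, so re-running a test with the seed recorded in the failure
    // report replays the same randomized decisions.
    static long firstValue(long seed) {
        return new Random(seed).nextLong();
    }

    public static void main(String[] args) {
        long seed = -724598633153762820L; // the seed from the failure report above
        // Same seed, same sequence: this always prints true.
        System.out.println(firstValue(seed) == firstValue(seed));
    }
}
```

This determinism is why a seemingly flaky randomized test is still debuggable: the printed seed is enough to recreate the failing run.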
[jira] Commented: (SOLR-2034) javabin should use UTF-8, not modified UTF-8
[ https://issues.apache.org/jira/browse/SOLR-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12902744#action_12902744 ]

Robert Muir commented on SOLR-2034:
-----------------------------------

If no one objects to the latest patch, I'd like to commit in a day or two.

javabin should use UTF-8, not modified UTF-8
--------------------------------------------
                Key: SOLR-2034
                URL: https://issues.apache.org/jira/browse/SOLR-2034
            Project: Solr
         Issue Type: Bug
           Reporter: Robert Muir
        Attachments: SOLR-2034.patch, SOLR-2034.patch, SOLR-2034.patch, SOLR-2034.patch

For better interoperability, javabin should use standard UTF-8 instead of modified UTF-8 (http://www.unicode.org/reports/tr26/).
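The interoperability concern is concrete: Java's DataOutputStream.writeUTF emits "modified UTF-8", which encodes U+0000 as two bytes (0xC0 0x80) and supplementary characters as a 6-byte encoded surrogate pair, whereas standard UTF-8 uses one byte and four bytes respectively, so the two wire formats disagree for exactly those characters. A small demonstration of the difference (class and method names are illustrative, not part of javabin):

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf8VsModifiedUtf8 {
    // Standard UTF-8: U+0000 -> 1 byte, supplementary chars -> 4 bytes.
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Modified UTF-8 via writeUTF: U+0000 -> 2 bytes (0xC0 0x80),
    // supplementary chars -> two 3-byte encoded surrogates (6 bytes).
    static byte[] modifiedUtf8(String s) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            new DataOutputStream(bos).writeUTF(s);
            // writeUTF prefixes a 2-byte length; skip it for comparison.
            return Arrays.copyOfRange(bos.toByteArray(), 2, bos.size());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // U+0000 followed by U+1F600 (a supplementary character).
        String s = "\u0000" + new String(Character.toChars(0x1F600));
        System.out.println("standard UTF-8 bytes: " + standardUtf8(s).length); // 5
        System.out.println("modified UTF-8 bytes: " + modifiedUtf8(s).length); // 8
    }
}
```

A non-Java javabin client decoding with a standard UTF-8 library would mis-handle exactly these byte sequences, which is the motivation for the switch.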
[jira] Created: (LUCENE-2624) add new snowball languages
add new snowball languages
--------------------------
                Key: LUCENE-2624
                URL: https://issues.apache.org/jira/browse/LUCENE-2624
            Project: Lucene - Java
         Issue Type: New Feature
         Components: contrib/analyzers
           Reporter: Robert Muir
            Fix For: 3.1, 4.0
        Attachments: LUCENE-2624.patch

Snowball added new languages. This patch adds support for them.
http://snowball.tartarus.org/algorithms/armenian/stemmer.html
http://snowball.tartarus.org/algorithms/catalan/stemmer.html
http://snowball.tartarus.org/algorithms/basque/stemmer.html
[jira] Updated: (LUCENE-2624) add new snowball languages
[ https://issues.apache.org/jira/browse/LUCENE-2624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-2624:
--------------------------------
    Attachment: LUCENE-2624.patch