Solr-3.x - Build # 162 - Failure
Build: https://hudson.apache.org/hudson/job/Solr-3.x/162/ All tests passed Build Log (for compile errors): [...truncated 18785 lines...] - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Lucene-Solr-tests-only-trunk - Build # 1203 - Failure
Build: https://hudson.apache.org/hudson/job/Lucene-Solr-tests-only-trunk/1203/ 1 tests failed. REGRESSION: org.apache.lucene.index.TestIndexWriterMergePolicy.testMaxBufferedDocsChange Error Message: maxMergeDocs=2147483647; numSegments=11; upperBound=10; mergeFactor=10; segs=_64:c5950 _5t:c10->_32 _5u:c10->_32 _5v:c10->_32 _5w:c10->_32 _5x:c10->_32 _5y:c10->_32 _5z:c10->_32 _60:c10->_32 _61:c10->_32 _62:c3->_32 _65:c7->_62 Stack Trace: junit.framework.AssertionFailedError: maxMergeDocs=2147483647; numSegments=11; upperBound=10; mergeFactor=10; segs=_64:c5950 _5t:c10->_32 _5u:c10->_32 _5v:c10->_32 _5w:c10->_32 _5x:c10->_32 _5y:c10->_32 _5z:c10->_32 _60:c10->_32 _61:c10->_32 _62:c3->_32 _65:c7->_62 at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:878) at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:844) at org.apache.lucene.index.TestIndexWriterMergePolicy.checkInvariants(TestIndexWriterMergePolicy.java:243) at org.apache.lucene.index.TestIndexWriterMergePolicy.testMaxBufferedDocsChange(TestIndexWriterMergePolicy.java:169) Build Log (for compile errors): [...truncated 3085 lines...] - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2747) Deprecate/remove language-specific tokenizers in favor of StandardTokenizer
[ https://issues.apache.org/jira/browse/LUCENE-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930470#action_12930470 ] Steven Rowe commented on LUCENE-2747: - DM, I'm a committer on the JFlex team. About the second to last point: when Robert said "(jflex doesnt have this... yet)" he meant the jflex-based implementation, not JFlex itself. About the last point, JFlex is shooting for level 1 compliance with [UTS#18 Unicode Regular Expressions|http://unicode.org/reports/tr18/], which requires conforming implementations to "handle the full range of Unicode code points, including values from U+0000 to U+10FFFF." > Deprecate/remove language-specific tokenizers in favor of StandardTokenizer > --- > > Key: LUCENE-2747 > URL: https://issues.apache.org/jira/browse/LUCENE-2747 > Project: Lucene - Java > Issue Type: Improvement > Components: Analysis >Affects Versions: 3.1, 4.0 >Reporter: Steven Rowe > Fix For: 3.1, 4.0 > > Attachments: LUCENE-2747.patch, LUCENE-2747.patch > > > As of Lucene 3.1, StandardTokenizer implements UAX#29 word boundary rules to > provide language-neutral tokenization. Lucene contains several > language-specific tokenizers that should be replaced by UAX#29-based > StandardTokenizer (deprecated in 3.1 and removed in 4.0). The > language-specific *analyzers*, by contrast, should remain, because they > contain language-specific post-tokenization filters. The language-specific > analyzers should switch to StandardTokenizer in 3.1. > Some usages of language-specific tokenizers will need additional work beyond > just replacing the tokenizer in the language-specific analyzer. > For example, PersianAnalyzer currently uses ArabicLetterTokenizer, and > depends on the fact that this tokenizer breaks tokens on the ZWNJ character > (zero-width non-joiner; U+200C), but in the UAX#29 word boundary rules, ZWNJ > is not a word boundary. 
Robert Muir has suggested using a char filter > converting ZWNJ to spaces prior to StandardTokenizer in the converted > PersianAnalyzer. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
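A minimal sketch of the char filter Robert Muir suggests above: replace ZWNJ (U+200C) with a space before the text reaches StandardTokenizer, so the UAX#29 rules still see a word boundary there. This is a plain-Java stand-in, not Lucene's actual CharFilter API; a real patch would presumably wrap the analyzer's Reader in something like MappingCharFilter (the class name and wiring are assumed here, not taken from the thread).

```java
// Illustrative stand-in for the suggested char filter: map ZWNJ to a space
// before tokenization. A single-char replacement also keeps offsets stable.
public class ZwnjNormalizer {
    private static final char ZWNJ = '\u200C'; // zero-width non-joiner

    // Replace every ZWNJ with a plain space so a UAX#29-based tokenizer
    // (e.g. StandardTokenizer) breaks tokens where ArabicLetterTokenizer did.
    public static String normalize(String input) {
        return input.replace(ZWNJ, ' ');
    }
}
```

In the converted PersianAnalyzer this replacement would run as a CharFilter ahead of the tokenizer rather than on whole Strings, but the character mapping itself is this simple.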
Lucene-trunk - Build # 1358 - Still Failing
Build: https://hudson.apache.org/hudson/job/Lucene-trunk/1358/ All tests passed Build Log (for compile errors): [...truncated 18290 lines...]
Re: svn commit: r1032995 - in /lucene/dev/trunk/solr/src/site/src/documentation/content/xdocs/images: solr.jpg solr_FC.eps
I'm with yonik on this one... why would the logo need a tm? Isn't it already trademarked? I'm in no rush to make the logo more ugly... for no need. Is the plan to add tm to all the feather images across apache too? http://www.apache.org/ but I hate to be a pain in the ass On Tue, Nov 9, 2010 at 5:16 PM, Yonik Seeley wrote: > On Tue, Nov 9, 2010 at 4:58 PM, Grant Ingersoll wrote: >> You can read it at http://www.apache.org/foundation/marks/pmcs > > I did - but just because it appears on a web page does not make it > true. There have been *many* examples (and still are many examples) > of things that are on our websites that are not strictly true. > >> , which is what I'm following and explains in the first paragraph. I don't >> understand why it is a big deal to add a little TM on the logo. > > Part of it is uglification, part of it is slippery slope. I see > increasing micro-management and rigidity and it's something I actively > fight against ;-) > >> It's standard practice for anyone wanting to protect their names/logos > > http://www.pg.com/ > http://www.pepsi.com/ > http://www.kraftfoodscompany.com > http://www.unilever.com/ > http://www.conocophillips.com/ > http://www.3m.com/ > http://www.boeing.com/ > http://www.pfizer.com/home/ > http://www.google.com/ > http://www.apple.com/ > http://www.jnj.com/ > http://www.ge.com/ > http://www.att.com/ > http://www.verizon.com > > In my quick survey of some of the biggest companies off the top of my > head, about 75% did not. > >> and as you stated, they are already trademarked items. > > meaning it shouldn't be necessary. > > -Yonik > > - > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org > For additional commands, e-mail: dev-h...@lucene.apache.org > > - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2747) Deprecate/remove language-specific tokenizers in favor of StandardTokenizer
[ https://issues.apache.org/jira/browse/LUCENE-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930448#action_12930448 ] DM Smith commented on LUCENE-2747: -- Robert, I think we are on the same wavelength. Thanks. I like the idea of declarative analyzers, too. Regarding the "last 2 points" has anyone given input to the JFlex team on these needs? > Deprecate/remove language-specific tokenizers in favor of StandardTokenizer > --- > > Key: LUCENE-2747 > URL: https://issues.apache.org/jira/browse/LUCENE-2747 > Project: Lucene - Java > Issue Type: Improvement > Components: Analysis >Affects Versions: 3.1, 4.0 >Reporter: Steven Rowe > Fix For: 3.1, 4.0 > > Attachments: LUCENE-2747.patch, LUCENE-2747.patch > > > As of Lucene 3.1, StandardTokenizer implements UAX#29 word boundary rules to > provide language-neutral tokenization. Lucene contains several > language-specific tokenizers that should be replaced by UAX#29-based > StandardTokenizer (deprecated in 3.1 and removed in 4.0). The > language-specific *analyzers*, by contrast, should remain, because they > contain language-specific post-tokenization filters. The language-specific > analyzers should switch to StandardTokenizer in 3.1. > Some usages of language-specific tokenizers will need additional work beyond > just replacing the tokenizer in the language-specific analyzer. > For example, PersianAnalyzer currently uses ArabicLetterTokenizer, and > depends on the fact that this tokenizer breaks tokens on the ZWNJ character > (zero-width non-joiner; U+200C), but in the UAX#29 word boundary rules, ZWNJ > is not a word boundary. Robert Muir has suggested using a char filter > converting ZWNJ to spaces prior to StandardTokenizer in the converted > PersianAnalyzer. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. 
[jira] Updated: (SOLR-1395) Integrate Katta
[ https://issues.apache.org/jira/browse/SOLR-1395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tom liu updated SOLR-1395: -- Attachment: solr-1395-katta-0.6.2-3.patch Fixed some bugs: # select?qt=qtname is not supported in SolrKattaServer: in the subproxy of SolrKattaServer, any queryHandler must be MultiEmbeddedSearchHandler, so in solrconfig.xml we must change solr.SearchHandler to MultiEmbeddedSearchHandler. For example: {noformat} true tvComponent {noformat} # TermVectorComponent does not return results; see https://issues.apache.org/jira/browse/SOLR-2224 > Integrate Katta > --- > > Key: SOLR-1395 > URL: https://issues.apache.org/jira/browse/SOLR-1395 > Project: Solr > Issue Type: New Feature >Affects Versions: 1.4 >Reporter: Jason Rutherglen >Priority: Minor > Fix For: Next > > Attachments: back-end.log, front-end.log, hadoop-core-0.19.0.jar, > katta-core-0.6-dev.jar, katta.node.properties, katta.zk.properties, > log4j-1.2.13.jar, solr-1395-1431-3.patch, solr-1395-1431-4.patch, > solr-1395-1431-katta0.6.patch, solr-1395-1431-katta0.6.patch, > solr-1395-1431.patch, solr-1395-katta-0.6.2-1.patch, > solr-1395-katta-0.6.2-2.patch, solr-1395-katta-0.6.2-3.patch, > solr-1395-katta-0.6.2.patch, SOLR-1395.patch, SOLR-1395.patch, > SOLR-1395.patch, test-katta-core-0.6-dev.jar, zkclient-0.1-dev.jar, > zookeeper-3.2.1.jar > > Original Estimate: 336h > Remaining Estimate: 336h > > We'll integrate Katta into Solr so that: > * Distributed search uses Hadoop RPC > * Shard/SolrCore distribution and management > * Zookeeper based failover > * Indexes may be built using Hadoop -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (SOLR-2224) TermVectorComponent did not return results when using distributedProcess in distribution envs
[ https://issues.apache.org/jira/browse/SOLR-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tom liu updated SOLR-2224: -- Attachment: TermsVectorComponent.patch In distributed query environments, use the request that QueryComponent creates. The patch uses the merge method that DebugComponent has; see https://issues.apache.org/jira/browse/SOLR-2228 > TermVectorComponent did not return results when using distributedProcess in > distribution envs > - > > Key: SOLR-2224 > URL: https://issues.apache.org/jira/browse/SOLR-2224 > Project: Solr > Issue Type: Bug > Components: SearchComponents - other >Affects Versions: 4.0 > Environment: JDK1.6/Tomcat6 >Reporter: tom liu > Attachments: TermsVectorComponent.patch > > > when using a distributed query, TVRH did not return any results. > in distributedProcess, tv creates one request that uses > TermVectorParams.DOC_IDS, for example, tv.docIds=10001 > but QueryComponent returns ids that are uniqueKeys, not docids. > so, in distributed environments, distributedProcess must not be used. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
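The docId/uniqueKey mismatch described in this issue can be sketched as follows. The class and method names are hypothetical, not Solr's API; the point is only that a distributed term-vector request would have to translate the uniqueKeys returned by the query phase back into shard-local docIds before asking a shard for term vectors by tv.docIds.

```java
import java.util.*;

// Sketch of the missing translation step: the query phase hands back
// uniqueKeys, but TermVectorParams.DOC_IDS wants internal docIds, so a
// per-shard uniqueKey -> docId lookup is needed. Illustrative names only.
public class UniqueKeyResolver {
    private final Map<String, Integer> keyToDocId; // one shard's mapping

    public UniqueKeyResolver(Map<String, Integer> keyToDocId) {
        this.keyToDocId = keyToDocId;
    }

    // Translate uniqueKeys into this shard's docIds, skipping keys the
    // shard does not hold (other shards own those documents).
    public List<Integer> docIdsFor(List<String> uniqueKeys) {
        List<Integer> ids = new ArrayList<>();
        for (String key : uniqueKeys) {
            Integer id = keyToDocId.get(key);
            if (id != null) ids.add(id);
        }
        return ids;
    }
}
```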
[jira] Updated: (SOLR-2228) refactors DebugComponent's merge method to create a class that can be used on other SeachComponents
[ https://issues.apache.org/jira/browse/SOLR-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tom liu updated SOLR-2228: -- Attachment: namedlistcolleciton.patch Created class NamedListCollection; the main method of the class is merge, taken from DebugComponent. > refactors DebugComponent's merge method to create a class that can be used on > other SeachComponents > --- > > Key: SOLR-2228 > URL: https://issues.apache.org/jira/browse/SOLR-2228 > Project: Solr > Issue Type: Improvement > Components: SearchComponents - other > Environment: JDK1.6/Tomcat6 >Reporter: tom liu > Attachments: namedlistcolleciton.patch > > > Some SearchComponents have to merge many NamedLists into one NamedList. > For example, TermVectorComponent would merge many NLs into one NL. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
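A rough sketch of what such a shared merge helper might look like, modeled here with plain java.util Maps instead of Solr's NamedList (the actual NamedListCollection API in the attached patch is not shown): entries are merged key by key, and colliding values are collected into a list, similar in spirit to how DebugComponent combines per-shard debug output.

```java
import java.util.*;

// Illustrative merge of many per-shard "named lists" into one, using Maps
// as a stand-in for Solr's NamedList. Not the API from the patch.
public class NamedListMerger {
    // Merge entries key by key; when two shards supply the same key,
    // collect the values into a single List under that key.
    public static Map<String, Object> merge(List<Map<String, Object>> shards) {
        Map<String, Object> merged = new LinkedHashMap<>(); // keep entry order
        for (Map<String, Object> shard : shards) {
            for (Map.Entry<String, Object> e : shard.entrySet()) {
                merged.merge(e.getKey(), e.getValue(), (a, b) -> {
                    List<Object> all = new ArrayList<>();
                    if (a instanceof List) {
                        all.addAll((List<?>) a); // already collided once
                    } else {
                        all.add(a);
                    }
                    all.add(b);
                    return all;
                });
            }
        }
        return merged;
    }
}
```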
Lucene-Solr-tests-only-trunk - Build # 1199 - Failure
Build: https://hudson.apache.org/hudson/job/Lucene-Solr-tests-only-trunk/1199/ 4 tests failed. FAILED: junit.framework.TestSuite.org.apache.lucene.search.TestNumericRangeQuery64 Error Message: Java heap space Stack Trace: java.lang.OutOfMemoryError: Java heap space FAILED: junit.framework.TestSuite.org.apache.lucene.search.TestNumericRangeQuery64 Error Message: null Stack Trace: java.lang.NullPointerException at org.apache.lucene.search.TestNumericRangeQuery64.afterClass(TestNumericRangeQuery64.java:97) FAILED: junit.framework.TestSuite.org.apache.lucene.search.TestNumericRangeQuery64 Error Message: directory of test was not closed, opened from: org.apache.lucene.util.LuceneTestCase.newDirectory(LuceneTestCase.java:653) Stack Trace: junit.framework.AssertionFailedError: directory of test was not closed, opened from: org.apache.lucene.util.LuceneTestCase.newDirectory(LuceneTestCase.java:653) at org.apache.lucene.util.LuceneTestCase.afterClassLuceneTestCaseJ4(LuceneTestCase.java:331) REGRESSION: org.apache.lucene.search.TestPrefixFilter.testPrefixFilter Error Message: ConcurrentMergeScheduler hit unhandled exceptions Stack Trace: junit.framework.AssertionFailedError: ConcurrentMergeScheduler hit unhandled exceptions at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:878) at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:844) at org.apache.lucene.util.LuceneTestCase.tearDown(LuceneTestCase.java:458) Build Log (for compile errors): [...truncated 3111 lines...] - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Created: (SOLR-2228) refactors DebugComponent's merge method to create a class that can be used on other SeachComponents
refactors DebugComponent's merge method to create a class that can be used on other SeachComponents --- Key: SOLR-2228 URL: https://issues.apache.org/jira/browse/SOLR-2228 Project: Solr Issue Type: Improvement Components: SearchComponents - other Environment: JDK1.6/Tomcat6 Reporter: tom liu Some SearchComponents have to merge many NamedLists to one NamedList. for example, TermVectorComponent would merge many NLS to one NL. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Lucene-3.x - Build # 176 - Still Failing
Build: https://hudson.apache.org/hudson/job/Lucene-3.x/176/ All tests passed Build Log (for compile errors): [...truncated 21372 lines...]
Re: [jira] Created: (SOLR-2226) DIH add data and after a while removes all from index
I'd be extremely surprised if you were the very first one to notice this, but all things are possible. It's more likely that there's something else going on besides 1.4 just throwing away your data. Let's see your DIH config, and any commands you used "after a while"; I suspect you've done something like having the DIH "clean" parameter set to true (I don't know if the defaults changed). How did you revert to 1.4? It would be helpful if you outlined what steps you took. Best Erick On Tue, Nov 9, 2010 at 4:18 PM, Marcin Rosinski (JIRA) wrote: > DIH add data and after a while removes all from index > - > > Key: SOLR-2226 > URL: https://issues.apache.org/jira/browse/SOLR-2226 > Project: Solr > Issue Type: Bug > Components: contrib - DataImportHandler >Affects Versions: 4.0 > Environment: centos 5.4, tomcat 7.0.2, jdbc mysql connector >Reporter: Marcin Rosinski >Priority: Critical > > > Hi guys, > > I am having weird problem. I am currently using solr 1.5-dev. Wanted to > switch to 4.0. Have dropped my indexes as they are not backward compatible > and have run DIH. All data have been indexed and were even searchable for a > short while, after that all have been suddenly dropped. > > All is working fine on 1.5 so it force me to thinking that it must be some > kind of bug in 4.0 > > > cheers, > /Marcin > > -- > This message is automatically generated by JIRA. > - > You can reply to this email to add a comment to the issue online. > >
Re: Document links
I was using within-segment doc ids stored in link files named after both the source and target segments (a link, after all, is two endpoints). For a complete solution you ultimately have to deal with the fact that doc ids could be references to: * Stable, committed docs (the easy case) * Flushed but not yet committed docs * Buffered but not yet flushed docs * Flushed/committed but currently merging docs ...all of which are happening in different threads: e.g. a reader has one view of the world while a background thread is busy merging segments to create a new view of the world, even after commits have completed. All very messy.
Re: Document links
On Monday 08 November 2010 20:52:53 mark harwood wrote: > I came to the conclusion that the transient meaning of document ids is too > deeply ingrained in Lucene's design to use them to underpin any reliable > linking. > While it might work for relatively static indexes, any index with a > reasonable > number of updates or deletes will invalidate any stored document references > in > ways which are very hard to track. Lucene's compaction shuffles IDs without > taking care to preserve identity, unlike graph DBs like Neo4j (see "recycling > IDs" here: http://goo.gl/5UbJi ) Did you try to keep the docId's as segmentId-inSegmentDocId in a tree? Somehow I think this could work, because this would not really complicate adding/deleting relations between docId's and the segment merges would become massive but straightforward deletes and inserts, and with some luck the amount of work for that would be a small proportion of the work for a normal segment merge. Regards, Paul Elschot > > > Cheers, > Mark > > > - Original Message > From: Ryan McKinley > To: dev@lucene.apache.org > Sent: Mon, 8 November, 2010 19:03:59 > Subject: Re: Document links > > Any updates/progress with this? > > I'm looking at ways to implement an RTree with lucene -- and this > discussion seems relevant > > thanks > ryan > > > On Sat, Sep 25, 2010 at 5:42 PM, mark harwood wrote: > >>>Both these on disk data structures and the ones in a B+ tree have seek > offsets > >>>into files > >>>that require disk seeks. And both could use document ids as key values. > > > > Yep. However my approach doesn't use a doc id as a key that is searched in > > any > > B+ tree index (which involves disk seeks) - it is used as direct offset > > into a > > file to get the pointer into a "links" data structure. > > > > > > > >>>But do these disk data structures support dynamic addition and deletion of > >>>(larger > >>>numbers of) document links? 
> > > > Yes, the slide deck I linked to shows how links (like documents) spend the > >early > > stages of life being merged frequently in the smaller, newer segments and > > over > > time migrate into larger, more stable segments as part of Lucene > > transactions. > > > > That's the theory - I'm currently benchmarking an early prototype. > > > > > > > > - Original Message > > From: Paul Elschot > > To: dev@lucene.apache.org > > Sent: Sat, 25 September, 2010 22:03:28 > > Subject: Re: Document links > > > > Op zaterdag 25 september 2010 15:23:39 schreef Mark Harwood: > >> My starting point in the solution I propose was to eliminate linking via > >> any > >>type of key. Key lookups mean indexes and indexes mean disk seeks. Graph > >>traversals have exponential numbers of links and so all these index disk > >>seeks > >>start to stack up. The solution I propose uses doc ids as more-or-less > >>direct > >>pointers into file structures avoiding any index lookup. > >> I've started coding up some tests using the file structures I outlined and > >will > >>compare that with a traditional key-based approach. > > > > Both these on disk data structures and the ones in a B+ tree have seek > > offsets > > into files > > that require disk seeks. And both could use document ids as key values. > > > > But do these disk data structures support dynamic addition and deletion of > > (larger > > numbers of) document links? > > > > B+ trees are a standard solution for problems like this one, and it would > > probably > > not be easy to outperform them. > > It may be possible to improve performance of B+ trees somewhat by > > specializing > > for the fairly simple keys that would be needed, and by encoding very short > > lists of links > > for a single document directly into a seek offset to avoid the actual seek, > but > > that's > > about it. 
> > > > Regards, > > Paul Elschot > > > >> > >> For reference - playing the "Kevin Bacon game" on a traditional Lucene > >> index > >of > >>IMDB data took 18 seconds to find a short path that Neo4j finds in 200 > >>milliseconds on the same data (and this was a disk based graph of 3m nodes, > 10m > >>edges). > >> Going from actor->movies->actors->movies produces a lot of key lookups and > the > >>difference between key indexes and direct node pointers becomes clear. > >> I know path finding analysis is perhaps not a typical Lucene application > >> but > >>other forms of link analysis e.g. recommendation engines require similar > >>performance. > >> > >> Cheers > >> Mark > >> > >> > >> > >> On 25 Sep 2010, at 11:41, Paul Elschot wrote: > >> > >> > Op vrijdag 24 september 2010 17:57:45 schreef mark harwood: > >> While not exactly equivalent, it reminds me of our earlier discussion > >>around > >> > >> "layered segments" for dealing with field updates > >> >> > >> >> Right. Fast discovery of document relations is a foundation on which > >> >> lots > >of > >> > >> >> things like this can build. Relations can be given types to support a > >numbe
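The "doc id as a direct pointer" layout Mark describes in this thread can be sketched in memory as a CSR-style adjacency structure: a per-document offset table indexes straight into a packed array of link targets, so traversing an edge is an array read rather than a B+-tree key lookup. Field names are illustrative; the prototype discussed above is file-based and segment-aware, which this sketch ignores.

```java
// In-memory sketch of the "doc id as direct offset" idea: no key index,
// the doc id itself selects the slice of the packed links array.
public class DocLinks {
    private final int[] offsets; // offsets[doc]..offsets[doc+1] delimit doc's links
    private final int[] links;   // target doc ids, packed end to end

    public DocLinks(int[] offsets, int[] links) {
        this.offsets = offsets; // length = maxDoc + 1, last entry = links.length
        this.links = links;
    }

    // All outgoing links of one document, reached without any key lookup.
    public int[] linksOf(int docId) {
        return java.util.Arrays.copyOfRange(links, offsets[docId], offsets[docId + 1]);
    }
}
```

On disk the offsets table becomes a fixed-width file seekable by docId * entrySize, which is what makes the transient nature of doc ids across merges the hard part of the design.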
Re: svn commit: r1032995 - in /lucene/dev/trunk/solr/src/site/src/documentation/content/xdocs/images: solr.jpg solr_FC.eps
On Tue, Nov 9, 2010 at 4:58 PM, Grant Ingersoll wrote: > You can read it at http://www.apache.org/foundation/marks/pmcs I did - but just because it appears on a web page does not make it true. There have been *many* examples (and still are many examples) of things that are on our websites that are not strictly true. > , which is what I'm following and explains in the first paragraph. I don't > understand why it is a big deal to add a little TM on the logo. Part of it is uglification, part of it is slippery slope. I see increasing micro-management and rigidity and it's something I actively fight against ;-) > It's standard practice for anyone wanting to protect their names/logos http://www.pg.com/ http://www.pepsi.com/ http://www.kraftfoodscompany.com http://www.unilever.com/ http://www.conocophillips.com/ http://www.3m.com/ http://www.boeing.com/ http://www.pfizer.com/home/ http://www.google.com/ http://www.apple.com/ http://www.jnj.com/ http://www.ge.com/ http://www.att.com/ http://www.verizon.com In my quick survey of some of the biggest companies off the top of my head, about 75% did not. > and as you stated, they are already trademarked items. meaning it shouldn't be necessary. -Yonik - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (SOLR-2223) Separate out "generic" Solr site from release specific content.
[ https://issues.apache.org/jira/browse/SOLR-2223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930314#action_12930314 ] Hoss Man commented on SOLR-2223: I don't really understand why this blocks LUCENE-2746? The solr "site" snapshot has always referred to "trunk", not the release - it's something we've talked about changing to mimic the way lucene-java had release-specific tutorials/docs, but never got around to. Adding the TMs and consistent project-name verbiage to the site where it lives in svn right now and publishing should be just fine for satisfying LUCENE-2746 - no need to revamp how we build the site if we're just going to revamp it again using the CMS. > Separate out "generic" Solr site from release specific content. > --- > > Key: SOLR-2223 > URL: https://issues.apache.org/jira/browse/SOLR-2223 > Project: Solr > Issue Type: Task >Reporter: Grant Ingersoll >Assignee: Grant Ingersoll >Priority: Minor > > It would be useful for deployment purposes if we separated out the Solr site > that is non-release specific from the release specific content. This would > make it easier to apply updates, etc. while still keeping release specific > info handy. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: svn commit: r1032995 - in /lucene/dev/trunk/solr/src/site/src/documentation/content/xdocs/images: solr.jpg solr_FC.eps
You can read it at http://www.apache.org/foundation/marks/pmcs, which is what I'm following and explains in the first paragraph. I don't understand why it is a big deal to add a little TM on the logo. It's standard practice for anyone wanting to protect their names/logos and as you stated, they are already trademarked items. FWIW, we can put the TM next to the logo too, but that just seems clunky HTML wise. On Nov 9, 2010, at 4:30 PM, Yonik Seeley wrote: > FYI, I've followed up with trademarks@ > > -Yonik > http://www.lucidimagination.com > > > > On Tue, Nov 9, 2010 at 4:06 PM, Yonik Seeley > wrote: >> On Tue, Nov 9, 2010 at 4:03 PM, Grant Ingersoll wrote: >>> Sorry, I figured the word "Requirements" in the title was pretty clear. >> >> I was seeking information on how this became an actual hard >> requirement, and if it actually was. >> It certainly wouldn't be the first time that something is listed as a >> requirement just because someone thought it was a good idea. >> >> -Yonik > > - > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org > For additional commands, e-mail: dev-h...@lucene.apache.org > - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Lucene-Solr-tests-only-trunk - Build # 1192 - Failure
Build: https://hudson.apache.org/hudson/job/Lucene-Solr-tests-only-trunk/1192/ 1 tests failed. REGRESSION: org.apache.lucene.index.TestIndexWriterMergePolicy.testMaxBufferedDocsChange Error Message: maxMergeDocs=2147483647; numSegments=11; upperBound=10; mergeFactor=10; segs=_65:c5950 _5t:c10->_32 _5u:c10->_32 _5v:c10->_32 _5w:c10->_32 _5x:c10->_32 _5y:c10->_32 _5z:c10->_32 _60:c10->_32 _61:c10->_32 _62:c8->_32 _63:c2->_62 Stack Trace: junit.framework.AssertionFailedError: maxMergeDocs=2147483647; numSegments=11; upperBound=10; mergeFactor=10; segs=_65:c5950 _5t:c10->_32 _5u:c10->_32 _5v:c10->_32 _5w:c10->_32 _5x:c10->_32 _5y:c10->_32 _5z:c10->_32 _60:c10->_32 _61:c10->_32 _62:c8->_32 _63:c2->_62 at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:878) at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:844) at org.apache.lucene.index.TestIndexWriterMergePolicy.checkInvariants(TestIndexWriterMergePolicy.java:243) at org.apache.lucene.index.TestIndexWriterMergePolicy.testMaxBufferedDocsChange(TestIndexWriterMergePolicy.java:169) Build Log (for compile errors): [...truncated 3082 lines...] - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: svn commit: r1032995 - in /lucene/dev/trunk/solr/src/site/src/documentation/content/xdocs/images: solr.jpg solr_FC.eps
And FWIW, here's my bit of civil disobedience :-P http://yonik.wordpress.com/ -Yonik
[jira] Created: (SOLR-2227) DIH add data and after a while removes all from index
DIH add data and after a while removes all from index - Key: SOLR-2227 URL: https://issues.apache.org/jira/browse/SOLR-2227 Project: Solr Issue Type: Bug Affects Versions: 4.0 Environment: centos 5.4, tomcat 7.0.2, jdbc mysql connector Reporter: Marcin Rosinski Priority: Critical Fix For: 4.0 Hi guys, I am having a weird problem. I am currently using solr 1.5-dev and wanted to switch to 4.0. I have dropped my indexes as they are not backward compatible and have run DIH. All data was indexed and even searchable for a short while; after that it was all suddenly dropped. All is working fine on 1.5, so it forces me to think that it must be some kind of bug in 4.0. P.S. Sorry for posting twice but couldn't see it under solr 4.0 section cheers, /Marcin -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: svn commit: r1032995 - in /lucene/dev/trunk/solr/src/site/src/documentation/content/xdocs/images: solr.jpg solr_FC.eps
FYI, I've followed up with trademarks@ -Yonik http://www.lucidimagination.com On Tue, Nov 9, 2010 at 4:06 PM, Yonik Seeley wrote: > On Tue, Nov 9, 2010 at 4:03 PM, Grant Ingersoll wrote: >> Sorry, I figured the word "Requirements" in the title was pretty clear. > > I was seeking information on how this became an actual hard > requirement, and if it actually was. > It certainly wouldn't be the first time that something is listed as a > requirement just because someone thought it was a good idea. > > -Yonik - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Created: (SOLR-2226) DIH add data and after a while removes all from index
DIH add data and after a while removes all from index - Key: SOLR-2226 URL: https://issues.apache.org/jira/browse/SOLR-2226 Project: Solr Issue Type: Bug Components: contrib - DataImportHandler Affects Versions: 4.0 Environment: centos 5.4, tomcat 7.0.2, jdbc mysql connector Reporter: Marcin Rosinski Priority: Critical Hi guys, I am having a weird problem. I am currently using solr 1.5-dev and wanted to switch to 4.0. I have dropped my indexes as they are not backward compatible and have run DIH. All data was indexed and even searchable for a short while; after that it was all suddenly dropped. All is working fine on 1.5, so it forces me to think that it must be some kind of bug in 4.0. cheers, /Marcin -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: svn commit: r1032995 - in /lucene/dev/trunk/solr/src/site/src/documentation/content/xdocs/images: solr.jpg solr_FC.eps
On Tue, Nov 9, 2010 at 4:03 PM, Grant Ingersoll wrote: > Sorry, I figured the word "Requirements" in the title was pretty clear. I was seeking information on how this became an actual hard requirement, and if it actually was. It certainly wouldn't be the first time that something is listed as a requirement just because someone thought it was a good idea. -Yonik - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: svn commit: r1032995 - in /lucene/dev/trunk/solr/src/site/src/documentation/content/xdocs/images: solr.jpg solr_FC.eps
On Nov 9, 2010, at 3:54 PM, Yonik Seeley wrote: > On Tue, Nov 9, 2010 at 3:51 PM, Grant Ingersoll wrote: >> See the mail to the PMC on Branding Requirements and the associated website. > > I guess I was looking for something more specific. > I followed a lot of the previous discussions on trademark stuff - I > just missed the transition point where it became absolutely required. Sorry, I figured the word "Requirements" in the title was pretty clear. In talking with Shane (who handles TM for the ASF), the goal is to have much more consistency across the ASF as regards protecting the ASF marks, so this is something all projects must implement and report progress on to the Board. -Grant - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (SOLR-2223) Separate out "generic" Solr site from release specific content.
[ https://issues.apache.org/jira/browse/SOLR-2223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930284#action_12930284 ] Grant Ingersoll commented on SOLR-2223: --- Yes, Hoss, that would make sense. I was just trying to do the least amount of work now for LUCENE-2746 to satisfy the needs there before we move to the new CMS. > Separate out "generic" Solr site from release specific content. > --- > > Key: SOLR-2223 > URL: https://issues.apache.org/jira/browse/SOLR-2223 > Project: Solr > Issue Type: Task >Reporter: Grant Ingersoll >Assignee: Grant Ingersoll >Priority: Minor > > It would be useful for deployment purposes if we separated out the Solr site > that is non-release specific from the release specific content. This would > make it easier to apply updates, etc. while still keeping release specific > info handy. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: svn commit: r1032995 - in /lucene/dev/trunk/solr/src/site/src/documentation/content/xdocs/images: solr.jpg solr_FC.eps
On Tue, Nov 9, 2010 at 3:51 PM, Grant Ingersoll wrote: > See the mail to the PMC on Branding Requirements and the associated website. I guess I was looking for something more specific. I followed a lot of the previous discussions on trademark stuff - I just missed the transition point where it became absolutely required. -Yonik - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: svn commit: r1032995 - in /lucene/dev/trunk/solr/src/site/src/documentation/content/xdocs/images: solr.jpg solr_FC.eps
Karl, yes it is something Manifold needs to do. On Nov 9, 2010, at 3:41 PM, Yonik Seeley wrote: > On Tue, Nov 9, 2010 at 3:33 PM, Grant Ingersoll wrote: >> Sorry, no choice. Comes down from the board that logos need TMs on them. > > IMO, it's a little too much micromanagement (and IIRC not everyone > agreed about logo TMs in the past). Have any pointers to more recent > discussions for me? See the mail to the PMC on Branding Requirements and the associated website. > > We can also trademark our logos w/o adding TM to them, and I disagree > that we need to add it to the Lucene/Solr logos (Lucene and Solr names > themselves are trademarked). - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
RE: svn commit: r1032995 - in /lucene/dev/trunk/solr/src/site/src/documentation/content/xdocs/images: solr.jpg solr_FC.eps
Is this something ManifoldCF needs to do also? Karl -Original Message- From: ext Grant Ingersoll [mailto:gsing...@apache.org] Sent: Tuesday, November 09, 2010 3:34 PM To: dev@lucene.apache.org Subject: Re: svn commit: r1032995 - in /lucene/dev/trunk/solr/src/site/src/documentation/content/xdocs/images: solr.jpg solr_FC.eps Sorry, no choice. Comes down from the board that logos need TMs on them. On Nov 9, 2010, at 2:53 PM, Yonik Seeley wrote: > IMO, any changes to our logos should be voted on. > > -Yonik > > - > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org > For additional commands, e-mail: dev-h...@lucene.apache.org > - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: svn commit: r1032995 - in /lucene/dev/trunk/solr/src/site/src/documentation/content/xdocs/images: solr.jpg solr_FC.eps
On Tue, Nov 9, 2010 at 3:33 PM, Grant Ingersoll wrote: > Sorry, no choice. Comes down from the board that logos need TMs on them. IMO, it's a little too much micromanagement (and IIRC not everyone agreed about logo TMs in the past). Have any pointers to more recent discussions for me? We can also trademark our logos w/o adding TM to them, and I disagree that we need to add it to the Lucene/Solr logos (Lucene and Solr names themselves are trademarked). -Yonik - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: svn commit: r1032995 - in /lucene/dev/trunk/solr/src/site/src/documentation/content/xdocs/images: solr.jpg solr_FC.eps
Sorry, no choice. Comes down from the board that logos need TMs on them. On Nov 9, 2010, at 2:53 PM, Yonik Seeley wrote: > IMO, any changes to our logos should be voted on. > > -Yonik > > - > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org > For additional commands, e-mail: dev-h...@lucene.apache.org > - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: svn commit: r1032995 - in /lucene/dev/trunk/solr/src/site/src/documentation/content/xdocs/images: solr.jpg solr_FC.eps
IMO, any changes to our logos should be voted on. -Yonik - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (SOLR-2212) NoMergePolicy class does not load
[ https://issues.apache.org/jira/browse/SOLR-2212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930243#action_12930243 ] Hoss Man commented on SOLR-2212: (and most of the SolrPlugin stuff) requires that the class specified support a no-arg constructor. "NoMergePolicy" has no public constructors at all - it seems to expect you to only ever use one of the static singletons. > NoMergePolicy class does not load > - > > Key: SOLR-2212 > URL: https://issues.apache.org/jira/browse/SOLR-2212 > Project: Solr > Issue Type: Bug > Components: multicore >Affects Versions: 3.1, 4.0 >Reporter: Lance Norskog > > Solr cannot use the Lucene NoMergePolicy class. It will not instantiate > correctly when loading the core. > Other MergePolicy classes work, including the BalancedSegmentMergePolicy. > This is in trunk and 3.x. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
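The no-arg-constructor requirement Hoss describes can be reproduced with plain reflection. The sketch below uses a hypothetical stand-in class (not the real org.apache.lucene.index.NoMergePolicy) that, like it, exposes only static singleton instances and no public constructor:

```java
import java.lang.reflect.Constructor;

// Hypothetical stand-in for a class like Lucene's NoMergePolicy:
// no public constructor, only static singleton instances.
class SingletonOnlyPolicy {
    public static final SingletonOnlyPolicy INSTANCE = new SingletonOnlyPolicy();
    private SingletonOnlyPolicy() {}
}

public class PluginLoaderSketch {
    // Mimics a plugin loader that requires a public no-arg constructor:
    // reflection fails when the class only exposes singletons.
    public static boolean canInstantiate(Class<?> clazz) {
        try {
            Constructor<?> c = clazz.getConstructor(); // public no-arg only
            c.newInstance();
            return true;
        } catch (ReflectiveOperationException e) {
            return false; // NoSuchMethodException for singleton-only classes
        }
    }

    public static void main(String[] args) {
        System.out.println(canInstantiate(SingletonOnlyPolicy.class)); // false
        System.out.println(canInstantiate(StringBuilder.class));       // true
    }
}
```

This is why a loader built on `Class.getConstructor()`/`newInstance()` cannot load a singletons-only class, whatever its other merits.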
[jira] Commented: (SOLR-2223) Separate out "generic" Solr site from release specific content.
[ https://issues.apache.org/jira/browse/SOLR-2223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930241#action_12930241 ] Hoss Man commented on SOLR-2223: Given the dev merge of lucene/solr, the reduction in subprojects due to graduation, and the consideration of using the new CMS, now is probably a good time to consider abandoning the way we've been having distinct svn paths for the "site" of each subproject. Why not move to a single "site" directory in svn for the entire TLP that the CMS updates (which can still have subdirs by project and whatnot, but can easily maintain a consistent navigation and look and feel) and then keep only the release-specific docs in the individual sub-proj trunk dirs? > Separate out "generic" Solr site from release specific content. > --- > > Key: SOLR-2223 > URL: https://issues.apache.org/jira/browse/SOLR-2223 > Project: Solr > Issue Type: Task >Reporter: Grant Ingersoll >Assignee: Grant Ingersoll >Priority: Minor > > It would be useful for deployment purposes if we separated out the Solr site > that is non-release specific from the release specific content. This would > make it easier to apply updates, etc. while still keeping release specific > info handy. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: svn commit: r1032995 - in /lucene/dev/trunk/solr/src/site/src/documentation/content/xdocs/images: solr.jpg solr_FC.eps
On Tue, Nov 9, 2010 at 2:00 PM, Chris Hostetter wrote: > > : LUCENE-2746: add TM to some of the logos, don't have an editor for the > others > : also, thanks for making all these updates... updating the website is a hassle. I was taking a look at lucene.apache.org and noticed the (TM) is a bit funky. Any objection to doing Lucene™ instead of Lucene (TM)? Using the trade mark sign (U+2122) looks better in my opinion; it's very old in Unicode and even my telephone can display it. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
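For reference, U+2122 is the Unicode TRADE MARK SIGN; in HTML it can also be written as `&trade;` or `&#8482;`. A trivial sketch (the class and method names here are just for illustration) of appending the single character rather than the "(TM)" spelling:

```java
public class TrademarkSign {
    // Appends U+2122 (TRADE MARK SIGN) to a product name.
    public static String withTm(String name) {
        return name + "\u2122";
    }

    public static void main(String[] args) {
        System.out.println(withTm("Lucene")); // Lucene™
    }
}
```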
Re: svn commit: r1032995 - in /lucene/dev/trunk/solr/src/site/src/documentation/content/xdocs/images: solr.jpg solr_FC.eps
Ah, OK. I just used my photo editor. I'll fix, using the SVG. On Nov 9, 2010, at 2:00 PM, Chris Hostetter wrote: > > : LUCENE-2746: add TM to some of the logos, don't have an editor for the > others > : > : Modified: > : > lucene/dev/trunk/solr/src/site/src/documentation/content/xdocs/images/solr.jpg > : > lucene/dev/trunk/solr/src/site/src/documentation/content/xdocs/images/solr_FC.eps > > grant: I'm not sure how you edited the solr_FC.eps file but you caused the > smoothness of the line curves to look horrific when the image is enlarged. > (note the filesize dropped about 50%). > > In general, the *.svg files are the "source" files that all of the other > files were generated from -- we should add the "TM" to the SVG files and > then generate the JPG and EPS files from that. > > > -Hoss > > - > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org > For additional commands, e-mail: dev-h...@lucene.apache.org > - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: svn commit: r1032995 - in /lucene/dev/trunk/solr/src/site/src/documentation/content/xdocs/images: solr.jpg solr_FC.eps
: LUCENE-2746: add TM to some of the logos, don't have an editor for the others : : Modified: : lucene/dev/trunk/solr/src/site/src/documentation/content/xdocs/images/solr.jpg : lucene/dev/trunk/solr/src/site/src/documentation/content/xdocs/images/solr_FC.eps grant: I'm not sure how you edited the solr_FC.eps file but you caused the smoothness of the line curves to look horrific when the image is enlarged. (note the filesize dropped about 50%). In general, the *.svg files are the "source" files that all of the other files were generated from -- we should add the "TM" to the SVG files and then generate the JPG and EPS files from that. -Hoss - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Created: (SOLR-2225) CoreContainer#register should use checkDefault to normalize the core name
CoreContainer#register should use checkDefault to normalize the core name - Key: SOLR-2225 URL: https://issues.apache.org/jira/browse/SOLR-2225 Project: Solr Issue Type: Bug Components: multicore Reporter: Mark Miller Assignee: Mark Miller Priority: Minor Fix For: 3.1, 4.0
Fail case:
* start with default collection set to collection1
* remove core collection1
* default collection on CoreContainer is still set to collection1
* add core collection1
* it doesn't act like the default core
We might do as the summary suggests, or, when the default core is removed, reset to no default core until one is again explicitly set.
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
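The second remedy in the report can be sketched with a toy registry (a plain map of names to cores; this is illustrative, not actual CoreContainer code): when the default core is removed, the default name is cleared rather than left dangling.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the proposed behavior: removing the default core resets
// the default instead of leaving a stale name behind. Class and method
// names here are hypothetical.
public class CoreRegistrySketch {
    private final Map<String, Object> cores = new HashMap<>();
    private String defaultName;

    public void register(String name, Object core, boolean asDefault) {
        cores.put(name, core);
        if (asDefault) defaultName = name;
    }

    public void remove(String name) {
        cores.remove(name);
        if (name.equals(defaultName)) defaultName = null; // reset, don't dangle
    }

    public Object getDefault() {
        return defaultName == null ? null : cores.get(defaultName);
    }

    public static void main(String[] args) {
        CoreRegistrySketch cc = new CoreRegistrySketch();
        cc.register("collection1", new Object(), true);
        cc.remove("collection1");
        cc.register("collection1", new Object(), false); // re-added, not default
        System.out.println(cc.getDefault()); // null: no silent default revival
    }
}
```

With the buggy behavior, re-adding "collection1" would still not act as the default even though the stale default name matches it; clearing the default on removal makes the state explicit.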
[jira] Commented: (LUCENE-2748) Convert all Lucene web properties to use the ASF CMS
[ https://issues.apache.org/jira/browse/LUCENE-2748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930219#action_12930219 ] Steven Rowe commented on LUCENE-2748: - Where will the subversion space for the site sources go? Under {{repos/asf/lucene/new-site-dir/}}? Will we still maintain versioned and unversioned content? > Convert all Lucene web properties to use the ASF CMS > > > Key: LUCENE-2748 > URL: https://issues.apache.org/jira/browse/LUCENE-2748 > Project: Lucene - Java > Issue Type: Bug >Reporter: Grant Ingersoll > > The new CMS has a lot of nice features (and some kinks to still work out) and > Forrest just doesn't cut it anymore, so we should move to the ASF CMS: > http://apache.org/dev/cms.html -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2729) Index corruption after 'read past EOF' under heavy update load and snapshot export
[ https://issues.apache.org/jira/browse/LUCENE-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930186#action_12930186 ] Nico Krijnen commented on LUCENE-2729: -- We really appreciate the help and understand that this is not the only thing you are working on ;-) Collecting post-it notes sounds so familiar :) We ran a test with your patch to throw a RuntimeException when an output already exists. We did get a 'read past EOF', but the additional RuntimeException was never thrown. We'll add the other log points and do another test run with those. If you have more suggestions for logging, let us know, we won't start the next run until tomorrow anyway... > Index corruption after 'read past EOF' under heavy update load and snapshot > export > -- > > Key: LUCENE-2729 > URL: https://issues.apache.org/jira/browse/LUCENE-2729 > Project: Lucene - Java > Issue Type: Bug > Components: Index >Affects Versions: 3.0.1, 3.0.2 > Environment: Happens on both OS X 10.6 and Windows 2008 Server. > Integrated with zoie (using a zoie snapshot from 2010-08-06: > zoie-2.0.0-snapshot-20100806.jar). >Reporter: Nico Krijnen > Attachments: 2010-11-02 IndexWriter infoStream log.zip, > LUCENE-2729-test1.patch > > > We have a system running lucene and zoie. We use lucene as a content store > for a CMS/DAM system. We use the hot-backup feature of zoie to make scheduled > backups of the index. This works fine for small indexes and when there are > not a lot of changes to the index when the backup is made. > On large indexes (about 5 GB to 19 GB), when a backup is made while the index > is being changed a lot (lots of document additions and/or deletions), we > almost always get a 'read past EOF' at some point, followed by lots of 'Lock > obtain timed out'. > At that point we get lots of 0 kb files in the index, data gets lost, and the > index is unusable. 
> When we stop our server, remove the 0kb files and restart our server, the > index is operational again, but data has been lost. > I'm not sure if this is a zoie or a lucene issue, so i'm posting it to both. > Hopefully someone has some ideas where to look to fix this. > Some more details... > Stack trace of the read past EOF and following Lock obtain timed out: > {code} > 78307 [proj.zoie.impl.indexing.internal.realtimeindexdataloa...@31ca5085] > ERROR proj.zoie.impl.indexing.internal.BaseSearchIndex - read past EOF > java.io.IOException: read past EOF > at > org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:154) > at > org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:39) > at > org.apache.lucene.store.ChecksumIndexInput.readByte(ChecksumIndexInput.java:37) > at org.apache.lucene.store.IndexInput.readInt(IndexInput.java:69) > at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:245) > at > org.apache.lucene.index.IndexFileDeleter.(IndexFileDeleter.java:166) > at > org.apache.lucene.index.DirectoryReader.doCommit(DirectoryReader.java:725) > at org.apache.lucene.index.IndexReader.commit(IndexReader.java:987) > at org.apache.lucene.index.IndexReader.commit(IndexReader.java:973) > at org.apache.lucene.index.IndexReader.decRef(IndexReader.java:162) > at org.apache.lucene.index.IndexReader.close(IndexReader.java:1003) > at > proj.zoie.impl.indexing.internal.BaseSearchIndex.deleteDocs(BaseSearchIndex.java:203) > at > proj.zoie.impl.indexing.internal.BaseSearchIndex.loadFromIndex(BaseSearchIndex.java:223) > at > proj.zoie.impl.indexing.internal.LuceneIndexDataLoader.loadFromIndex(LuceneIndexDataLoader.java:153) > at > proj.zoie.impl.indexing.internal.DiskLuceneIndexDataLoader.loadFromIndex(DiskLuceneIndexDataLoader.java:134) > at > proj.zoie.impl.indexing.internal.RealtimeIndexDataLoader.processBatch(RealtimeIndexDataLoader.java:171) > at > 
proj.zoie.impl.indexing.internal.BatchedIndexDataLoader$LoaderThread.run(BatchedIndexDataLoader.java:373) > 579336 [proj.zoie.impl.indexing.internal.realtimeindexdataloa...@31ca5085] > ERROR proj.zoie.impl.indexing.internal.LuceneIndexDataLoader - > Problem copying segments: Lock obtain timed out: > org.apache.lucene.store.singleinstancel...@5ad0b895: write.lock > org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: > org.apache.lucene.store.singleinstancel...@5ad0b895: write.lock > at org.apache.lucene.store.Lock.obtain(Lock.java:84) > at org.apache.lucene.index.IndexWriter.init(IndexWriter.java:1060) > at org.apache.lucene.index.IndexWriter.(IndexWriter.java:957) > at > proj.zoie.impl.indexing.internal.DiskSearchIndex.openIndexWriter(DiskSearchIndex.java:176) > at > proj.zoie.impl.indexing
[jira] Commented: (SOLR-2223) Separate out "generic" Solr site from release specific content.
[ https://issues.apache.org/jira/browse/SOLR-2223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930178#action_12930178 ] Grant Ingersoll commented on SOLR-2223: --- Here's what I propose to do: # Move main site to https://svn.apache.org/repos/asf/lucene/solr/site # Keep tutorial and release content where it is # Update the release packaging to bring in the non-release content as part of a release # Fix the docs on how to deploy it. > Separate out "generic" Solr site from release specific content. > --- > > Key: SOLR-2223 > URL: https://issues.apache.org/jira/browse/SOLR-2223 > Project: Solr > Issue Type: Task >Reporter: Grant Ingersoll >Assignee: Grant Ingersoll >Priority: Minor > > It would be useful for deployment purposes if we separated out the Solr site > that is non-release specific from the release specific content. This would > make it easier to apply updates, etc. while still keeping release specific > info handy. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (SOLR-2223) Separate out "generic" Solr site from release specific content.
[ https://issues.apache.org/jira/browse/SOLR-2223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930173#action_12930173 ] Grant Ingersoll commented on SOLR-2223: --- The problem here is there is one and only one release specific doc. It seems like a waste to move things around for one doc, but I suppose it is still worth it since there isn't a clean way to do it now. > Separate out "generic" Solr site from release specific content. > --- > > Key: SOLR-2223 > URL: https://issues.apache.org/jira/browse/SOLR-2223 > Project: Solr > Issue Type: Task >Reporter: Grant Ingersoll >Assignee: Grant Ingersoll >Priority: Minor > > It would be useful for deployment purposes if we separated out the Solr site > that is non-release specific from the release specific content. This would > make it easier to apply updates, etc. while still keeping release specific > info handy. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2729) Index corruption after 'read past EOF' under heavy update load and snapshot export
[ https://issues.apache.org/jira/browse/LUCENE-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930142#action_12930142 ] Nico Krijnen commented on LUCENE-2729: -- {quote} ...spooky "other" exceptions... {quote} These are all 'WARN' level and all of them caused by non-critical timeouts in our code. All caused by the system being under the very heavy load needed to reproduce the bug. {quote} Would it be possible to instrument the Zoie code to note as the backup process is copying each file in the snapshot, and at that point print a listing of the directory? {quote} Will do, that is a good one. Then we know which files are being 'held' by the Zoie deletion policy for the backup. {quote} Also, can you write to the log when Zoie applies deletes? (Looks like it happens in proj.zoie.impl.indexing.internal.BaseSearchIndex.deleteDocs). It's on applying deletes that the corruption is first detected, so, if we log this event we can better bracket the period of time when the corruption happened. 
{quote} Will do, but we also got the error while zoie was opening a new IndexWriter: {code} 15:25:03,453 [proj.zoie.impl.indexing.internal.realtimeindexdataloa...@3d9e7719] ERROR proj.zoie.impl.indexing.internal.LuceneIndexDataLoader - Problem copying segments: read past EOF java.io.IOException: read past EOF at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:154) at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:39) at org.apache.lucene.store.ChecksumIndexInput.readByte(ChecksumIndexInput.java:37) at org.apache.lucene.store.IndexInput.readInt(IndexInput.java:69) at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:245) at org.apache.lucene.index.IndexFileDeleter.(IndexFileDeleter.java:170) at org.apache.lucene.index.IndexWriter.init(IndexWriter.java:1127) at org.apache.lucene.index.IndexWriter.(IndexWriter.java:960) at proj.zoie.impl.indexing.internal.DiskSearchIndex.openIndexWriter(DiskSearchIndex.java:176) at proj.zoie.impl.indexing.internal.BaseSearchIndex.loadFromIndex(BaseSearchIndex.java:228) at proj.zoie.impl.indexing.internal.LuceneIndexDataLoader.loadFromIndex(LuceneIndexDataLoader.java:153) at proj.zoie.impl.indexing.internal.DiskLuceneIndexDataLoader.loadFromIndex(DiskLuceneIndexDataLoader.java:134) at proj.zoie.impl.indexing.internal.RealtimeIndexDataLoader.processBatch(RealtimeIndexDataLoader.java:172) at proj.zoie.impl.indexing.internal.BatchedIndexDataLoader$LoaderThread.run(BatchedIndexDataLoader.java:377) {code} I'll add a log there too. My guess is that the 'read past EOF' is not really specific to applying deletes, but just happens when the SegmentInfos is loaded from a 0kb file. {quote} Does Zoie ever open an IndexReader or IndexWriter passing in an existing commit point? Or does it always open the latest commit? {quote} I'll try to find out. 
{quote} The timestamps on the zero length files are particularly spooky - the earliest ones are 15:21 (when first EOF is hit), but then also 15:47 and 15:49 on the others. It seems like on 3 separate occasions something truncated the files. {quote} Indeed, I thought this was weird too. > Index corruption after 'read past EOF' under heavy update load and snapshot > export > -- > > Key: LUCENE-2729 > URL: https://issues.apache.org/jira/browse/LUCENE-2729 > Project: Lucene - Java > Issue Type: Bug > Components: Index >Affects Versions: 3.0.1, 3.0.2 > Environment: Happens on both OS X 10.6 and Windows 2008 Server. > Integrated with zoie (using a zoie snapshot from 2010-08-06: > zoie-2.0.0-snapshot-20100806.jar). >Reporter: Nico Krijnen > Attachments: 2010-11-02 IndexWriter infoStream log.zip, > LUCENE-2729-test1.patch > > > We have a system running lucene and zoie. We use lucene as a content store > for a CMS/DAM system. We use the hot-backup feature of zoie to make scheduled > backups of the index. This works fine for small indexes and when there are > not a lot of changes to the index when the backup is made. > On large indexes (about 5 GB to 19 GB), when a backup is made while the index > is being changed a lot (lots of document additions and/or deletions), we > almost always get a 'read past EOF' at some point, followed by lots of 'Lock > obtain timed out'. > At that point we get lots of 0 kb files in the index, data gets lost, and the > index is unusable. > When we stop our server, remove the 0kb files and restart our server, the > index is operational again, but data has been lost. > I'm not sure if this is a zoie or a lucene issue, so I'm posting it to both. > Hopefully someone has some ideas where to look to fix this. > Some more details.
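The guess above — that 'read past EOF' simply falls out of loading SegmentInfos from a 0 kb file — is consistent with both stack traces, which fail inside readInt() at the very start of SegmentInfos.read. A minimal JDK-only sketch (no Lucene classes; the method and class names are illustrative) of what happens when a 4-byte int is read from a zero-length file:

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;

// Reading the leading 4-byte format int from a zero-length "file":
// the very first read runs off the end, mirroring Lucene's
// "read past EOF" when SegmentInfos is loaded from a 0 kb file.
public class ZeroLengthRead {
    public static String readFormat(byte[] fileBytes) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(fileBytes))) {
            return "format=" + in.readInt(); // needs 4 bytes
        } catch (EOFException e) {
            return "read past EOF";          // mimics Lucene's message
        } catch (IOException e) {
            return "io error";
        }
    }

    public static void main(String[] args) {
        System.out.println(readFormat(new byte[0]));             // read past EOF
        System.out.println(readFormat(new byte[]{0, 0, 0, 1}));  // format=1
    }
}
```

So any code path that opens the index — applying deletes or opening an IndexWriter — will hit the same error once the segments file has been truncated to zero length; the interesting question remains what truncated it.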
[jira] Commented: (SOLR-2052) Allow for a list of filter queries and a single docset filter in QueryComponent
[ https://issues.apache.org/jira/browse/SOLR-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930165#action_12930165 ] Stephen Green commented on SOLR-2052: - Thanks for taking a look, Otis. I'm in CA this week, but I should have a chance to fix the patch when I'm back home. > Allow for a list of filter queries and a single docset filter in > QueryComponent > --- > > Key: SOLR-2052 > URL: https://issues.apache.org/jira/browse/SOLR-2052 > Project: Solr > Issue Type: Improvement > Components: search >Affects Versions: 4.0 > Environment: Mac OS X, Java 1.6 >Reporter: Stephen Green >Priority: Minor > Fix For: 1.4.2 > > Attachments: SOLR-2052-2.patch, SOLR-2052.patch > > > SolrIndexSearcher.QueryCommand allows you to specify a list of filter queries > or a single filter (as a DocSet), but not both. This restriction seems > arbitrary, and there are cases where we can have both a list of filter > queries and a DocSet generated by some other non-query process (e.g., > filtering documents according to IDs pulled from some other source like a > database.) > Fixing this requires a few small changes to SolrIndexSearcher to allow both > of these to be set for a QueryCommand and to take both into account when > evaluating the query. It also requires a modification to ResponseBuilder to > allow setting the single filter at query time. > I've run into this against 1.4, but the same holds true for the trunk. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
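The semantics SOLR-2052 asks for can be modeled with plain integer doc ids standing in for DocSets (this is a model, not actual SolrIndexSearcher code): a hit must match the main query, every filter query, and the externally supplied doc set, all intersected.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Model of combining a list of filter queries AND a single external
// doc set, as the patch proposes. Names are illustrative only.
public class CombinedFilterSketch {
    public static Set<Integer> search(Set<Integer> queryHits,
                                      List<Set<Integer>> filterQueries,
                                      Set<Integer> externalDocSet) {
        Set<Integer> result = new HashSet<>(queryHits);
        for (Set<Integer> fq : filterQueries) {
            result.retainAll(fq);            // each filter query narrows the result
        }
        if (externalDocSet != null) {
            result.retainAll(externalDocSet); // e.g. ids pulled from a database
        }
        return result;
    }

    public static void main(String[] args) {
        Set<Integer> hits = new HashSet<>(Arrays.asList(1, 2, 3, 4));
        List<Set<Integer>> fqs =
            Arrays.asList(new HashSet<>(Arrays.asList(2, 3, 4)));
        Set<Integer> external = new HashSet<>(Arrays.asList(3, 4, 5));
        System.out.println(search(hits, fqs, external)); // [3, 4]
    }
}
```

Since intersection is associative, there is no semantic reason to allow only one of the two filter kinds, which is the point of the issue.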
[jira] Commented: (LUCENE-2747) Deprecate/remove language-specific tokenizers in favor of StandardTokenizer
[ https://issues.apache.org/jira/browse/LUCENE-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930137#action_12930137 ] Robert Muir commented on LUCENE-2747: - DM, thanks, I see exactly where you are coming from. I see your point: previously it was much easier to take something like SimpleAnalyzer and 'adapt' it to a given language based on things like Unicode properties. In fact that's exactly what we did in the cases here (Arabic, Persian, Hindi, etc.) But now we can actually tokenize "correctly" for more languages with jflex, thanks to its improved Unicode support, and it's superior to these previous hacks :) To try to answer some of your questions (all my opinion): bq. Is there a point to having SimpleAnalyzer I guess so, a lot of people can use this if they have English-only content and are probably happy with discarding numbers, etc... it's not a big loss to me if it goes though. bq. Shouldn't UAX29Tokenizer be moved to core? (What is core anyway?) In trunk (4.x codeline) there is no core, contrib, or solr for analyzer components any more. They are all combined into modules/analysis. In branch_3x (3.x codeline) we did not make this rather disruptive refactor: there UAX29Tokenizer is in fact in lucene core. bq. Would there be a way to plugin ICUTokenizer as a replacement for UAX29Tokenizer into StandardTokenizer, such that all Analyzers using StandardTokenizer would get the alternate implementation? Personally, I would prefer if we move towards a factory model where things like these supplied "language analyzers" are actually xml/json/properties snippets. In other words, they are just example configurations that build your analyzer, like solr does. This is nice, because then you don't have to write code to easily customize how your analyzer works. I think we have been making slow steps towards this, just doing basic things like moving stopwords lists to .txt files. 
But I think the next step would be LUCENE-2510, where we have factories/config attribute parsers for all these analysis components already written. Then we could have support for declarative analyzer specification via xml/json/.properties/whatever, and move all these Analyzers to that. I still think you should be able to code up your own analyzer, but in my opinion this is much easier and preferred for the ones we supply. Also I think this would solve a lot of analyzer-backwards-compat problems, because then our supplied analyzers are really just configuration file examples, and we can change our examples however we want... someone can use their old config file (and hopefully old analysis module jar file!) to guarantee the exact same behavior if they want. Finally, most of the benefits of ICUTokenizer are actually in the UAX29 support... the tokenizers are pretty close with some minor differences: * the jflex-based implementation is faster, and better in my opinion. * the ICU-based implementation allows tailoring, and supplies tailored tokenization for several complex scripts (jflex doesnt have this... yet) * the ICU-based implementation works with all of Unicode; at the moment jflex is limited to the basic multilingual plane. In my opinion the last 2 points will probably be eventually resolved... 
I could see our ICUTokenizer possibly becoming obsolete down the road thanks to some better jflex support, though it would probably have to have hooks into ICU for the complex script support (so we get it for free from ICU) > Deprecate/remove language-specific tokenizers in favor of StandardTokenizer > --- > > Key: LUCENE-2747 > URL: https://issues.apache.org/jira/browse/LUCENE-2747 > Project: Lucene - Java > Issue Type: Improvement > Components: Analysis >Affects Versions: 3.1, 4.0 >Reporter: Steven Rowe > Fix For: 3.1, 4.0 > > Attachments: LUCENE-2747.patch, LUCENE-2747.patch > > > As of Lucene 3.1, StandardTokenizer implements UAX#29 word boundary rules to > provide language-neutral tokenization. Lucene contains several > language-specific tokenizers that should be replaced by UAX#29-based > StandardTokenizer (deprecated in 3.1 and removed in 4.0). The > language-specific *analyzers*, by contrast, should remain, because they > contain language-specific post-tokenization filters. The language-specific > analyzers should switch to StandardTokenizer in 3.1. > Some usages of language-specific tokenizers will need additional work beyond > just replacing the tokenizer in the language-specific analyzer. > For example, PersianAnalyzer currently uses ArabicLetterTokenizer, and > depends on the fact that this tokenizer breaks tokens on the ZWNJ character > (zero-width non-joiner; U+200C), but in the UAX#29 word boundary rules, ZWNJ > is not a word boundary
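As a concrete illustration of the declarative style Robert describes, Solr already builds analyzers from configuration in schema.xml; a typical field type (the name "text_general" here is just an example) looks like:

```xml
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- language-neutral, UAX#29-based tokenization -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

In this model, swapping the tokenizer or filters is a one-line configuration change rather than writing and compiling a new Analyzer subclass.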
Lucene-Solr-tests-only-3.x - Build # 1160 - Failure
Build: https://hudson.apache.org/hudson/job/Lucene-Solr-tests-only-3.x/1160/ 1 tests failed. REGRESSION: org.apache.lucene.index.TestIndexWriter.testCommitThreadSafety Error Message: null Stack Trace: junit.framework.AssertionFailedError: at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:779) at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:745) at org.apache.lucene.index.TestIndexWriter.testCommitThreadSafety(TestIndexWriter.java:2435) Build Log (for compile errors): [...truncated 4525 lines...] - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
New ASF CMS
Gang, I've asked infra to set up a space where we can kick the tires on the new ASF CMS. See https://issues.apache.org/jira/browse/LUCENE-2748 and http://apache.org/dev/cms.html. It seems like a real win, as we can author content either in Markdown or via a WYSIWYG editor, and we get instant publication, more dynamic sites, etc. -Grant - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2729) Index corruption after 'read past EOF' under heavy update load and snapshot export
[ https://issues.apache.org/jira/browse/LUCENE-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930130#action_12930130 ] Michael McCandless commented on LUCENE-2729: Thank you for attaching the IW infoStream output! Sorry it took so long for me to respond. Aside: it is sad but there is no master TODO list in open source. It all comes down to our own email inboxes, todo lists, post-it notes all over the place, etc., and (in my case anyway) things sometimes fall past the event horizon. So please if I don't respond in a day or two on an active issue, bump it again (put a comment on the issue)! I'd much rather people over-nag than under-nag but unfortunately under-nag is far far more common and it causes important issues to languish unnecessarily. OK back to the issue :) I looked through the infoStream but I don't see a smoking gun. Ie, the logs indicate that nowhere did Lucene try to delete/overwrite those zero-length files; I see other files being deleted, so, this is what I'd expect given that ZoieDeletionPolicy is presumably protecting the segments_3t commit point (to back up its files). I do see some spooky "other" exceptions, though... 
these are the first 2 exceptions I see in the log: {noformat} 14:27:41,290 [bigIndexBuilder_QueueProcessor_3] WARN com.ds.acm.logic.impl.AssetManagerImpl - Ignoring AssetNotFoundException trying to make sure all metadata from index is loaded before updating an existing asset Exception in thread "pool-5-thread-6" java.lang.NullPointerException at org.apache.coyote.http11.InternalNioOutputBuffer.writeToSocket(InternalNioOutputBuffer.java:430) at org.apache.coyote.http11.InternalNioOutputBuffer.flushBuffer(InternalNioOutputBuffer.java:784) at org.apache.coyote.http11.InternalNioOutputBuffer.flush(InternalNioOutputBuffer.java:300) at org.apache.coyote.http11.Http11NioProcessor.action(Http11NioProcessor.java:1060) at org.apache.coyote.Response.action(Response.java:183) at org.apache.catalina.connector.OutputBuffer.doFlush(OutputBuffer.java:314) at org.apache.catalina.connector.OutputBuffer.flush(OutputBuffer.java:288) at org.apache.catalina.connector.Response.flushBuffer(Response.java:548) at org.apache.catalina.connector.ResponseFacade.flushBuffer(ResponseFacade.java:279) at org.granite.gravity.AbstractChannel.runReceived(AbstractChannel.java:251) at org.granite.gravity.AbstractChannel.runReceive(AbstractChannel.java:199) at org.granite.gravity.AsyncReceiver.doRun(AsyncReceiver.java:34) at org.granite.gravity.AsyncChannelRunner.run(AsyncChannelRunner.java:52) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:680) {noformat} and {noformat} 14:40:18,382 [Low Memory Detector] WARN com.ds.acm.engine.search.zoieimpl.core.ZoieSystemManager - Ignoring timeout while attempting to flush zoie memory index to disk to free memory proj.zoie.api.ZoieException: sync timed out at proj.zoie.impl.indexing.AsyncDataConsumer.syncWthVersion(AsyncDataConsumer.java:177) at 
proj.zoie.impl.indexing.AsyncDataConsumer.flushEvents(AsyncDataConsumer.java:155) at proj.zoie.impl.indexing.ZoieSystem.flushEvents(ZoieSystem.java:308) at com.ds.acm.engine.search.zoieimpl.core.ZoieSystemManager.onLowMemory(ZoieSystemManager.java:220) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at com.ds.util.event.BasicEventBroadcaster$Handler.invokeMethod(BasicEventBroadcaster.java:197) at com.ds.util.event.BasicEventBroadcaster$Handler.handle(BasicEventBroadcaster.java:190) at com.ds.util.event.BasicEventBroadcaster.fire(BasicEventBroadcaster.java:108) at com.ds.util.cache.LowMemoryWarningBroadcaster$1.handleNotification(LowMemoryWarningBroadcaster.java:135) at sun.management.NotificationEmitterSupport.sendNotification(NotificationEmitterSupport.java:138) at sun.management.MemoryImpl.createNotification(MemoryImpl.java:171) at sun.management.MemoryPoolImpl$PoolSensor.triggerAction(MemoryPoolImpl.java:272) at sun.management.Sensor.trigger(Sensor.java:120) {noformat} That 2nd exception happens a total of 9 times... and is rather spooky. What does it mean? Ie, why is Zoie timing out on flushing the index to disk, and, what does it then do w/ its RAMDir? I also see a lot of these: {noformat} 15:50:18,856 [bigIndexBuilder_QueueProcessor_10] WARN
[jira] Commented: (LUCENE-2747) Deprecate/remove language-specific tokenizers in favor of StandardTokenizer
[ https://issues.apache.org/jira/browse/LUCENE-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930119#action_12930119 ] DM Smith commented on LUCENE-2747: -- bq. DM, can you elaborate here? I was a bit trigger-happy with the comment. I should have looked at the code rather than the jira comments alone. The old StandardAnalyzer had a kitchen-sink approach to tokenization, trying to do too much with *modern* constructs, e.g. URLs, email addresses, acronyms. It and SimpleAnalyzer would produce about the same stream on "old" English and some other texts, but the StandardAnalyzer was much slower. (I don't remember how slow, but it was obvious.) Both of these were weak when it came to non-English/non-Western texts. Thus I could take the language-specific tokenizers, lists of stop words, and stemmers and create variations of the SimpleAnalyzer that properly handled a particular language. (I created my own analyzers because I wanted to make stop words and stemming optional.) In looking at the code in trunk (I should have done that before making my comment), I see that UAX29Tokenizer is duplicated in the StandardAnalyzer's jflex and that ClassicAnalyzer is the old jflex. Also, the new StandardAnalyzer does a lot less. If I understand the suggestion of this and the other 2 issues, StandardAnalyzer will no longer handle modern constructs. As I see it, this is what SimpleAnalyzer should be: based on UAX#29 and doing little else. Thus my confusion. Is there a point to having SimpleAnalyzer? Shouldn't UAX29Tokenizer be moved to core? (What is core anyway?) And if I understand where this is going: would there be a way to plug in ICUTokenizer as a replacement for UAX29Tokenizer into StandardTokenizer, such that all Analyzers using StandardTokenizer would get the alternate implementation? 
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-2747) Deprecate/remove language-specific tokenizers in favor of StandardTokenizer
[ https://issues.apache.org/jira/browse/LUCENE-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-2747: Attachment: LUCENE-2747.patch Here's an updated patch. In reality the previous patch was a problem: because initReader() was in the TokenStream components, it caused code duplication in any Analyzer, as it had to specify its CharFilter twice: once in createComponents for the initial Reader, and once in the TokenStreamComponents implementation for reset(Reader). So I moved this to just be a method of ReusableAnalyzerBase. Also, I didn't apply the 'throws IOException'. After re-thinking, there is no need to do this. None of our CharFilters, for example, throw IOExceptions in their ctors. Even the Analyzer.tokenStream method cannot throw IOException. We shouldn't add 'throws X exception' just because some arbitrary user class MIGHT throw it; they might throw SQLException, or InvalidMidiDataException too. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
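The workaround Robert Muir suggested for PersianAnalyzer, a char filter converting ZWNJ (U+200C) to spaces ahead of StandardTokenizer, amounts to a one-character mapping applied before tokenization. A hedged, stdlib-only sketch of that mapping (ZwnjMapper is a hypothetical name; a real analyzer would presumably wrap the Reader in a CharFilter such as Lucene's MappingCharFilter):

```java
// Hedged sketch of the suggested ZWNJ workaround for PersianAnalyzer:
// map U+200C (zero-width non-joiner) to a space before tokenization, so
// a UAX#29 tokenizer (where ZWNJ is not a word boundary) still splits
// where ArabicLetterTokenizer used to.
public class ZwnjMapper {
    public static String mapZwnjToSpace(String input) {
        return input.replace('\u200C', ' ');
    }

    public static void main(String[] args) {
        // Persian "mi-ravam" written with a ZWNJ between prefix and stem
        String text = "\u0645\u06CC\u200C\u0631\u0648\u0645";
        System.out.println(mapZwnjToSpace(text).split(" ").length); // 2 tokens
    }
}
```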
Lucene-Solr-tests-only-trunk - Build # 1175 - Failure
Build: https://hudson.apache.org/hudson/job/Lucene-Solr-tests-only-trunk/1175/ 1 tests failed. REGRESSION: org.apache.solr.TestDistributedSearch.testDistribSearch Error Message: Error executing query Stack Trace: org.apache.solr.client.solrj.SolrServerException: Error executing query at org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:95) at org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:119) at org.apache.solr.BaseDistributedSearchTestCase.queryServer(BaseDistributedSearchTestCase.java:290) at org.apache.solr.BaseDistributedSearchTestCase.query(BaseDistributedSearchTestCase.java:305) at org.apache.solr.TestDistributedSearch.doTest(TestDistributedSearch.java:203) at org.apache.solr.BaseDistributedSearchTestCase.testDistribSearch(BaseDistributedSearchTestCase.java:568) at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:878) at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:844) Caused by: org.apache.solr.common.SolrException: org.apache.solr.client.solrj.SolrServerException: No live SolrServers available to handle this request org.apache.solr.common.SolrException: org.apache.solr.client.solrj.SolrServerException: No live SolrServers available to handle this request at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:318) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131) at org.apache.solr.core.SolrCore.execute(SolrCore.java:1359)at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:337) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:240) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1157) at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:388) at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182) at 
org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:765) at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152) at org.mortbay.jetty.Server.handle(Server.java:326) at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542) at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:923) at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:547) at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212) at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404) at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:409) at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582) Caused by: org.apache.solr.client.solrj.SolrServerException: No live SolrServers available to handle this request at org.apache.solr.client.solrj.impl.LBHttpSolrServer.request(LBHttpSolrServer.java:297) at org.apache.solr.handler.component.HttpCommComponent$1.call(SearchHandler.java:513) at org.apache.solr.handler.component.HttpCommComponent$1.call(SearchHandler.java:478) at java.util.concurrent org.apache.solr.client.solrj.SolrServerException: No live SolrServers available to handle this request org.apache.solr.common.SolrException: org.apache.solr.client.solrj.SolrServerException: No live SolrServers available to handle this requestat org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:318) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131) at org.apache.solr.core.SolrCore.execute(SolrCore.java:1359)at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:337) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:240) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1157) at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:388) at 
org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182) at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:765) at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152) at org.mortbay.jetty.Server.handle(Server.java:326) at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542) at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:923) at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:547) at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212) at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404) at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:409) at org.mortbay.thread.QueuedThreadPool$PoolThread.run(Que
[jira] Commented: (LUCENE-2729) Index corruption after 'read past EOF' under heavy update load and snapshot export
[ https://issues.apache.org/jira/browse/LUCENE-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930094#action_12930094 ] Nico Krijnen commented on LUCENE-2729: -- Thx! We will update, patch and re-run the test. > Index corruption after 'read past EOF' under heavy update load and snapshot > export > -- > > Key: LUCENE-2729 > URL: https://issues.apache.org/jira/browse/LUCENE-2729 > Project: Lucene - Java > Issue Type: Bug > Components: Index >Affects Versions: 3.0.1, 3.0.2 > Environment: Happens on both OS X 10.6 and Windows 2008 Server. > Integrated with zoie (using a zoie snapshot from 2010-08-06: > zoie-2.0.0-snapshot-20100806.jar). >Reporter: Nico Krijnen > Attachments: 2010-11-02 IndexWriter infoStream log.zip, > LUCENE-2729-test1.patch > > > We have a system running lucene and zoie. We use lucene as a content store > for a CMS/DAM system. We use the hot-backup feature of zoie to make scheduled > backups of the index. This works fine for small indexes and when there are > not a lot of changes to the index when the backup is made. > On large indexes (about 5 GB to 19 GB), when a backup is made while the index > is being changed a lot (lots of document additions and/or deletions), we > almost always get a 'read past EOF' at some point, followed by lots of 'Lock > obtain timed out'. > At that point we get lots of 0 kb files in the index, data gets lost, and the > index is unusable. > When we stop our server, remove the 0kb files and restart our server, the > index is operational again, but data has been lost. > I'm not sure if this is a zoie or a lucene issue, so I'm posting it to both. > Hopefully someone has some ideas where to look to fix this. > Some more details... 
> Stack trace of the read past EOF and following Lock obtain timed out: > {code} > 78307 [proj.zoie.impl.indexing.internal.realtimeindexdataloa...@31ca5085] > ERROR proj.zoie.impl.indexing.internal.BaseSearchIndex - read past EOF > java.io.IOException: read past EOF > at > org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:154) > at > org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:39) > at > org.apache.lucene.store.ChecksumIndexInput.readByte(ChecksumIndexInput.java:37) > at org.apache.lucene.store.IndexInput.readInt(IndexInput.java:69) > at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:245) > at > org.apache.lucene.index.IndexFileDeleter.(IndexFileDeleter.java:166) > at > org.apache.lucene.index.DirectoryReader.doCommit(DirectoryReader.java:725) > at org.apache.lucene.index.IndexReader.commit(IndexReader.java:987) > at org.apache.lucene.index.IndexReader.commit(IndexReader.java:973) > at org.apache.lucene.index.IndexReader.decRef(IndexReader.java:162) > at org.apache.lucene.index.IndexReader.close(IndexReader.java:1003) > at > proj.zoie.impl.indexing.internal.BaseSearchIndex.deleteDocs(BaseSearchIndex.java:203) > at > proj.zoie.impl.indexing.internal.BaseSearchIndex.loadFromIndex(BaseSearchIndex.java:223) > at > proj.zoie.impl.indexing.internal.LuceneIndexDataLoader.loadFromIndex(LuceneIndexDataLoader.java:153) > at > proj.zoie.impl.indexing.internal.DiskLuceneIndexDataLoader.loadFromIndex(DiskLuceneIndexDataLoader.java:134) > at > proj.zoie.impl.indexing.internal.RealtimeIndexDataLoader.processBatch(RealtimeIndexDataLoader.java:171) > at > proj.zoie.impl.indexing.internal.BatchedIndexDataLoader$LoaderThread.run(BatchedIndexDataLoader.java:373) > 579336 [proj.zoie.impl.indexing.internal.realtimeindexdataloa...@31ca5085] > ERROR proj.zoie.impl.indexing.internal.LuceneIndexDataLoader - > Problem copying segments: Lock obtain timed out: > org.apache.lucene.store.singleinstancel...@5ad0b895: write.lock 
> org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: > org.apache.lucene.store.singleinstancel...@5ad0b895: write.lock > at org.apache.lucene.store.Lock.obtain(Lock.java:84) > at org.apache.lucene.index.IndexWriter.init(IndexWriter.java:1060) > at org.apache.lucene.index.IndexWriter.(IndexWriter.java:957) > at > proj.zoie.impl.indexing.internal.DiskSearchIndex.openIndexWriter(DiskSearchIndex.java:176) > at > proj.zoie.impl.indexing.internal.BaseSearchIndex.loadFromIndex(BaseSearchIndex.java:228) > at > proj.zoie.impl.indexing.internal.LuceneIndexDataLoader.loadFromIndex(LuceneIndexDataLoader.java:153) > at > proj.zoie.impl.indexing.internal.DiskLuceneIndexDataLoader.loadFromIndex(DiskLuceneIndexDataLoader.java:134) > at > proj.zoie.impl.indexing.internal.RealtimeIndexDataLoader.processBatch(RealtimeIndexDataLoader.java:171) > at > proj.zoie.impl.indexing.in
[jira] Commented: (LUCENE-2729) Index corruption after 'read past EOF' under heavy update load and snapshot export
[ https://issues.apache.org/jira/browse/LUCENE-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930092#action_12930092 ] Simon Willnauer commented on LUCENE-2729: - ah I think I have been on the wrong branch nevermind!
[jira] Commented: (LUCENE-2729) Index corruption after 'read past EOF' under heavy update load and snapshot export
[ https://issues.apache.org/jira/browse/LUCENE-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930091#action_12930091 ] Simon Willnauer commented on LUCENE-2729: - bq. at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:245) I just had a quick look at it and I wonder what revision you use for that? SegmentInfos.java does not contain a read call in [revision 892992 | https://svn.apache.org/viewvc/lucene/dev/branches/branch_3x/lucene/src/java/org/apache/lucene/index/SegmentInfos.java?revision=892992&view=markup], can you clarify?
[jira] Commented: (LUCENE-2747) Deprecate/remove language-specific tokenizers in favor of StandardTokenizer
[ https://issues.apache.org/jira/browse/LUCENE-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930090#action_12930090 ] Robert Muir commented on LUCENE-2747: - bq. I'm not too keen on this. For classics and ancient texts the standard analyzer is not as good as the simple analyzer. DM, can you elaborate here? Are you speaking of the existing StandardAnalyzer in previous releases, that doesn't properly deal with tokenizing diacritics, etc? This is the reason these "special" tokenizers exist: to work around those bugs. but StandardTokenizer now handles this stuff fine, and they are obselete. I'm confused though, in previous releases how SimpleAnalyzer would ever be any better, since it would barf on these diacritics too, it only emits tokens that are runs of Character.isLetter Or is there something else i'm missing here? > Deprecate/remove language-specific tokenizers in favor of StandardTokenizer > --- > > Key: LUCENE-2747 > URL: https://issues.apache.org/jira/browse/LUCENE-2747 > Project: Lucene - Java > Issue Type: Improvement > Components: Analysis >Affects Versions: 3.1, 4.0 >Reporter: Steven Rowe > Fix For: 3.1, 4.0 > > Attachments: LUCENE-2747.patch > > > As of Lucene 3.1, StandardTokenizer implements UAX#29 word boundary rules to > provide language-neutral tokenization. Lucene contains several > language-specific tokenizers that should be replaced by UAX#29-based > StandardTokenizer (deprecated in 3.1 and removed in 4.0). The > language-specific *analyzers*, by contrast, should remain, because they > contain language-specific post-tokenization filters. The language-specific > analyzers should switch to StandardTokenizer in 3.1. > Some usages of language-specific tokenizers will need additional work beyond > just replacing the tokenizer in the language-specific analyzer. 
> For example, PersianAnalyzer currently uses ArabicLetterTokenizer, and > depends on the fact that this tokenizer breaks tokens on the ZWNJ character > (zero-width non-joiner; U+200C), but in the UAX#29 word boundary rules, ZWNJ > is not a word boundary. Robert Muir has suggested using a char filter > converting ZWNJ to spaces prior to StandardTokenizer in the converted > PersianAnalyzer. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
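The ZWNJ workaround suggested above can be sketched in isolation. The following is a minimal, self-contained stand-in written against plain java.io (the class name is hypothetical; a real implementation would use Lucene's MappingCharFilter rather than a FilterReader): it maps ZWNJ (U+200C) to a space before the text reaches the tokenizer, so a UAX#29-based tokenizer still breaks tokens where ArabicLetterTokenizer used to.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;

// Hypothetical stand-in for the proposed char filter: replace the
// zero-width non-joiner (U+200C) with a space so that downstream
// word-boundary tokenization still splits on it.
class ZwnjToSpaceReader extends FilterReader {
    ZwnjToSpaceReader(Reader in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int c = in.read();
        return c == '\u200C' ? ' ' : c;
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        int n = in.read(buf, off, len);
        // n may be -1 at EOF; the loop body is then skipped entirely
        for (int i = off; i < off + n; i++) {
            if (buf[i] == '\u200C') {
                buf[i] = ' ';
            }
        }
        return n;
    }
}
```

Wrapping the analyzer's Reader with such a filter before tokenization would preserve the current PersianAnalyzer token boundaries without keeping ArabicLetterTokenizer alive.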
Lucene-Solr-tests-only-trunk - Build # 1170 - Failure
Build: https://hudson.apache.org/hudson/job/Lucene-Solr-tests-only-trunk/1170/ 1 tests failed. REGRESSION: org.apache.solr.handler.component.DistributedTermsComponentTest.testDistribSearch Error Message: Some threads threw uncaught exceptions! Stack Trace: junit.framework.AssertionFailedError: Some threads threw uncaught exceptions! at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:878) at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:844) at org.apache.lucene.util.LuceneTestCase.tearDown(LuceneTestCase.java:437) at org.apache.solr.SolrTestCaseJ4.tearDown(SolrTestCaseJ4.java:78) at org.apache.solr.BaseDistributedSearchTestCase.tearDown(BaseDistributedSearchTestCase.java:144) Build Log (for compile errors): [...truncated 8776 lines...] - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-2729) Index corruption after 'read past EOF' under heavy update load and snapshot export
[ https://issues.apache.org/jira/browse/LUCENE-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated LUCENE-2729: --- Attachment: LUCENE-2729-test1.patch First, just to rule out any already-fixed-but-not-yet-released issues, can you update your Lucene JAR to the tip of the 3.0.x branch? Ie do this:

{noformat}
svn checkout https://svn.apache.org/repos/asf/lucene/java/branches/lucene_3_0 30x
cd 30x
ant jar
{noformat}

And then copy build/lucene-core-3.0.3-dev.jar to your CLASSPATH (replacing old Lucene JAR). Second, can you apply the patch I just attached (LUCENE-2729-test1.patch) and then make this corruption happen again? That patch throws an exception if ever we try to call SimpleFSDir.createOutput on a file that already exists. Lucene should never do this under non-exceptional situations, yet somehow it looks like it may be (with all your 0 length files). > Index corruption after 'read past EOF' under heavy update load and snapshot > export > -- > > Key: LUCENE-2729 > URL: https://issues.apache.org/jira/browse/LUCENE-2729 > Project: Lucene - Java > Issue Type: Bug > Components: Index >Affects Versions: 3.0.1, 3.0.2 > Environment: Happens on both OS X 10.6 and Windows 2008 Server. > Integrated with zoie (using a zoie snapshot from 2010-08-06: > zoie-2.0.0-snapshot-20100806.jar). >Reporter: Nico Krijnen > Attachments: 2010-11-02 IndexWriter infoStream log.zip, > LUCENE-2729-test1.patch > > > We have a system running lucene and zoie. We use lucene as a content store > for a CMS/DAM system. We use the hot-backup feature of zoie to make scheduled > backups of the index. This works fine for small indexes and when there are > not a lot of changes to the index when the backup is made.
> On large indexes (about 5 GB to 19 GB), when a backup is made while the index > is being changed a lot (lots of document additions and/or deletions), we > almost always get a 'read past EOF' at some point, followed by lots of 'Lock > obtain timed out'. > At that point we get lots of 0 kb files in the index, data gets lost, and the > index is unusable. > When we stop our server, remove the 0 kb files and restart our server, the > index is operational again, but data has been lost. > I'm not sure if this is a zoie or a lucene issue, so I'm posting it to both. > Hopefully someone has some ideas where to look to fix this. > Some more details... > Stack trace of the read past EOF and following Lock obtain timed out: > {code} > 78307 [proj.zoie.impl.indexing.internal.realtimeindexdataloa...@31ca5085] > ERROR proj.zoie.impl.indexing.internal.BaseSearchIndex - read past EOF > java.io.IOException: read past EOF > at > org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:154) > at > org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:39) > at > org.apache.lucene.store.ChecksumIndexInput.readByte(ChecksumIndexInput.java:37) > at org.apache.lucene.store.IndexInput.readInt(IndexInput.java:69) > at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:245) > at > org.apache.lucene.index.IndexFileDeleter.<init>(IndexFileDeleter.java:166) > at > org.apache.lucene.index.DirectoryReader.doCommit(DirectoryReader.java:725) > at org.apache.lucene.index.IndexReader.commit(IndexReader.java:987) > at org.apache.lucene.index.IndexReader.commit(IndexReader.java:973) > at org.apache.lucene.index.IndexReader.decRef(IndexReader.java:162) > at org.apache.lucene.index.IndexReader.close(IndexReader.java:1003) > at > proj.zoie.impl.indexing.internal.BaseSearchIndex.deleteDocs(BaseSearchIndex.java:203) > at > proj.zoie.impl.indexing.internal.BaseSearchIndex.loadFromIndex(BaseSearchIndex.java:223) > at >
proj.zoie.impl.indexing.internal.LuceneIndexDataLoader.loadFromIndex(LuceneIndexDataLoader.java:153) > at > proj.zoie.impl.indexing.internal.DiskLuceneIndexDataLoader.loadFromIndex(DiskLuceneIndexDataLoader.java:134) > at > proj.zoie.impl.indexing.internal.RealtimeIndexDataLoader.processBatch(RealtimeIndexDataLoader.java:171) > at > proj.zoie.impl.indexing.internal.BatchedIndexDataLoader$LoaderThread.run(BatchedIndexDataLoader.java:373) > 579336 [proj.zoie.impl.indexing.internal.realtimeindexdataloa...@31ca5085] > ERROR proj.zoie.impl.indexing.internal.LuceneIndexDataLoader - > Problem copying segments: Lock obtain timed out: > org.apache.lucene.store.singleinstancel...@5ad0b895: write.lock > org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: > org.apache.lucene.store.singleinstancel...@5ad0b895: write.lock > at org.apache.lucene.store.Lock.obtain(Lock.java:84) > at org.apache.lucene.index.IndexWriter.init(Index
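The check added by LUCENE-2729-test1.patch, as described above (throw if SimpleFSDir.createOutput is asked to create a file that already exists), can be illustrated with a standalone sketch. This is not the actual patch; the class and method names here are made up, and the snippet only demonstrates the fail-fast guard:

```java
import java.io.File;
import java.io.IOException;

// Illustrative fail-fast guard (hypothetical names): Lucene should never
// create an output file that already exists under non-exceptional
// situations, so throwing here surfaces the bug at its origin instead of
// leaving 0-byte files behind.
class CreateOutputGuard {
    static void ensureNotExists(File dir, String name) throws IOException {
        File f = new File(dir, name);
        if (f.exists()) {
            throw new IOException("cannot create output \"" + name
                + "\": file already exists in " + dir);
        }
    }
}
```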
[jira] Commented: (LUCENE-2729) Index corruption after 'read past EOF' under heavy update load and snapshot export
[ https://issues.apache.org/jira/browse/LUCENE-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930086#action_12930086 ] Nico Krijnen commented on LUCENE-2729: -- Any ideas on what could be happening? It sounds like IndexWriter is the only one that is modifying these files, zoie only seems to be reading from them to make the backup. What should we look for in the IndexWriter's infoStream?
[jira] Commented: (LUCENE-2747) Deprecate/remove language-specific tokenizers in favor of StandardTokenizer
[ https://issues.apache.org/jira/browse/LUCENE-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930075#action_12930075 ] Simon Willnauer commented on LUCENE-2747: - bq. ...alternatively we could give this a different name, wrapReader or something... not sure, i didnt have any better ideas than charStream. wrapReader seems too specific; what about initReader? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2747) Deprecate/remove language-specific tokenizers in favor of StandardTokenizer
[ https://issues.apache.org/jira/browse/LUCENE-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930073#action_12930073 ] Robert Muir commented on LUCENE-2747: - Simon: I agree with both those points; we should change the method signature. Also, I called it charStream (this is what Solr's analyzer calls it), but this is slightly confusing since the API is all Reader-based. Alternatively we could give this a different name, wrapReader or something... not sure, I didn't have any better ideas than charStream. -- This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2747) Deprecate/remove language-specific tokenizers in favor of StandardTokenizer
[ https://issues.apache.org/jira/browse/LUCENE-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930072#action_12930072 ] Simon Willnauer commented on LUCENE-2747: - I looked at the patch briefly and the charStream(Reader) extension looks good to me, though I would make it protected and have it throw an IOException. Since this API is public and folks will use it in the wild, we need to make sure we don't have to add the exception later, or people creating Readers will have to play tricks just because the interface has no IOException. About making it protected: do we need to call it in a non-protected context? Maybe I'm missing something.

{code}
public Reader charStream(Reader reader) {
  return reader;
}

// should be:
protected Reader charStream(Reader reader) throws IOException {
  return reader;
}
{code}

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Lucene-Solr-tests-only-trunk - Build # 1168 - Failure
Build: https://hudson.apache.org/hudson/job/Lucene-Solr-tests-only-trunk/1168/ 1 tests failed. REGRESSION: org.apache.lucene.index.TestIndexWriterMergePolicy.testMaxBufferedDocsChange Error Message: maxMergeDocs=2147483647; numSegments=11; upperBound=10; mergeFactor=10; segs=_64:c5950 _5t:c10->_32 _5u:c10->_32 _5v:c10->_32 _5w:c10->_32 _5x:c10->_32 _5y:c10->_32 _5z:c10->_32 _60:c10->_32 _61:c10->_32 _62:c9->_32 _65:c1->_62 Stack Trace: junit.framework.AssertionFailedError: maxMergeDocs=2147483647; numSegments=11; upperBound=10; mergeFactor=10; segs=_64:c5950 _5t:c10->_32 _5u:c10->_32 _5v:c10->_32 _5w:c10->_32 _5x:c10->_32 _5y:c10->_32 _5z:c10->_32 _60:c10->_32 _61:c10->_32 _62:c9->_32 _65:c1->_62 at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:878) at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:844) at org.apache.lucene.index.TestIndexWriterMergePolicy.checkInvariants(TestIndexWriterMergePolicy.java:243) at org.apache.lucene.index.TestIndexWriterMergePolicy.testMaxBufferedDocsChange(TestIndexWriterMergePolicy.java:169) Build Log (for compile errors): [...truncated 3082 lines...] - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-2742) Enable native per-field codec support
[ https://issues.apache.org/jira/browse/LUCENE-2742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Simon Willnauer updated LUCENE-2742: Attachment: LUCENE-2742.patch Here is a first patch - all tests pass. I changed the CodecProvider interface slightly to be able to hold per-field codecs as well as a default per-field codec. For simplicity, users cannot register their codec directly through the Fieldable interface. Internally I added a CodecInfo which handles all the ordering and registration per segment / field. For consistency I bound CodecInfo to FieldInfos since we are now operating per field. A codec can only be assigned once, the first time we see the codec during FieldInfos creation. There is a nocommit on Fieldable since it doesn't have javadoc, but let's iterate first to see if we wanna go that path - it seems close. > Enable native per-field codec support > -- > > Key: LUCENE-2742 > URL: https://issues.apache.org/jira/browse/LUCENE-2742 > Project: Lucene - Java > Issue Type: Improvement > Components: Index, Store >Affects Versions: 4.0 >Reporter: Simon Willnauer >Assignee: Simon Willnauer > Fix For: 4.0 > > Attachments: LUCENE-2742.patch > > > Currently the codec name is stored for every segment and PerFieldCodecWrapper > is used to enable codecs per field, which has recently brought up some issues > (LUCENE-2740 and LUCENE-2741). When a codec name is stored, lucene does not > respect the actual codec used to encode a field's postings but rather the > "top-level" Codec; in such a case the name of the top-level codec is > "PerField" instead of "Pulsing" or "Standard" etc. The way this composite > pattern works makes the indexing part of codecs simpler but also limits its > capabilities. By recording the top-level codec in the segments file we rely on > the user to "configure" the PerFieldCodecWrapper correctly to open a > SegmentReader. If a field's codec has changed in the meanwhile we won't be > able to open the segment.
> The issues LUCENE-2741 and LUCENE-2740 are actually closely related to the > way PFCW is implemented right now. PFCW blindly creates codecs per field on > request and at the same time doesn't have any control over the file naming, > nor whether two codec instances are created for two distinct fields even if the > codec instance is the same. If so, FieldsConsumer will throw an exception > since the files it relies on are already created. > Having PerFieldCodecWrapper AND a CodecProvider overcomplicates things IMO. > In order to use per-field codecs, a user must on the one hand register their > custom codecs AND on the other build a PFCW, which needs to be maintained in > "user-land" and must not change incompatibly once a new IW or IR is created. > What I would expect from Lucene is to enable me to register a codec in a > provider and then tell the Field which codec it should use for indexing. For > reading, lucene should determine the codec automatically once a segment is > opened; if the codec is not available in the provider, that is a different > story. Once we instantiate the composite codec in SegmentsReader we get, for > free, only the codecs which are really used in this segment, which in turn > solves LUCENE-2740. > Yet, instead of relying on the user to configure PFCW, I suggest to move the > composite codec functionality inside the core and record the distinct codecs > per segment in the segments info. We only really need the distinct codecs > used in that segment, since the codec instance should be reused to prevent > additional files from being created. Let's say we have the following codec mapping: > {noformat} > field_a:Pulsing > field_b:Standard > field_c:Pulsing > {noformat} > then we create the following mapping: > {noformat} > SegmentInfo: > [Pulsing, Standard] > PerField: > [field_a:0, field_b:1, field_c:0] > {noformat} > that way we can easily determine which codec is used for which field and build > the composite codec internally on opening SegmentsReader.
This ordering has > another advantage: if, as in this case, Pulsing and Standard use the same > types of files, we need a way to distinguish the files used per codec > within a segment. We can in turn pass the codec's ord (implicit in the > SegmentInfo) to the FieldConsumer on creation to create files named > segmentname_ord.ext (or something similar). This solves LUCENE-2741. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
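The ord bookkeeping in the mapping above can be sketched with plain collections. This is only an illustration of the scheme (the class and method names are invented, not the patch's actual CodecInfo API): the segment records the distinct codec names once, in first-seen order, and each field stores the ord of its codec within that list.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of per-segment codec ordering: distinct codecs are recorded in
// the order they are first seen, and each field maps to the ord of its
// codec, so the same codec instance is registered only once per segment.
class SegmentCodecOrds {
    final List<String> distinctCodecs = new ArrayList<String>();
    final Map<String, Integer> fieldToOrd = new HashMap<String, Integer>();

    void addField(String field, String codecName) {
        int ord = distinctCodecs.indexOf(codecName);
        if (ord == -1) {                  // first time we see this codec
            ord = distinctCodecs.size();
            distinctCodecs.add(codecName);
        }
        fieldToOrd.put(field, ord);       // e.g. field_a:0, field_b:1
    }
}
```

With the field_a/field_b/field_c example above, this yields [Pulsing, Standard] and {field_a:0, field_b:1, field_c:0}; the ord could then be used to disambiguate file names such as segmentname_ord.ext.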
[jira] Commented: (LUCENE-2747) Deprecate/remove language-specific tokenizers in favor of StandardTokenizer
[ https://issues.apache.org/jira/browse/LUCENE-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930024#action_12930024 ] Robert Muir commented on LUCENE-2747: - bq. You removed TestHindiFilters.testTokenizer(), TestIndicTokenizer.testBasics() and TestIndicTokenizer.testFormat(), but these would be useful in TestStandardAnalyzer and TestUAX29Tokenizer, wouldn't they? Oh, I just deleted everything associated with that tokenizer... bq. You did not remove ArabicLetterTokenizer and IndicTokenizer, presumably so that they can be used with Lucene 4.0+ when the supplied Version is less than 3.1 - good catch, I had forgotten this requirement - but when can we actually get rid of these? Since they will be staying, shouldn't their tests remain too, but using Version.LUCENE_30 instead of TEST_VERSION_CURRENT? I removed the IndicTokenizer (unreleased) and deleted its tests, but I kept and deprecated the Arabic one, since we have released it. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Solr-trunk - Build # 1307 - Failure
Build: https://hudson.apache.org/hudson/job/Solr-trunk/1307/ All tests passed Build Log (for compile errors): [...truncated 18450 lines...] - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org