Solr-3.x - Build # 162 - Failure

2010-11-09 Thread Apache Hudson Server
Build: https://hudson.apache.org/hudson/job/Solr-3.x/162/

All tests passed

Build Log (for compile errors):
[...truncated 18785 lines...]



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Lucene-Solr-tests-only-trunk - Build # 1203 - Failure

2010-11-09 Thread Apache Hudson Server
Build: https://hudson.apache.org/hudson/job/Lucene-Solr-tests-only-trunk/1203/

1 tests failed.
REGRESSION:  
org.apache.lucene.index.TestIndexWriterMergePolicy.testMaxBufferedDocsChange

Error Message:
maxMergeDocs=2147483647; numSegments=11; upperBound=10; mergeFactor=10; 
segs=_64:c5950 _5t:c10->_32 _5u:c10->_32 _5v:c10->_32 _5w:c10->_32 _5x:c10->_32 
_5y:c10->_32 _5z:c10->_32 _60:c10->_32 _61:c10->_32 _62:c3->_32 _65:c7->_62

Stack Trace:
junit.framework.AssertionFailedError: maxMergeDocs=2147483647; numSegments=11; 
upperBound=10; mergeFactor=10; segs=_64:c5950 _5t:c10->_32 _5u:c10->_32 
_5v:c10->_32 _5w:c10->_32 _5x:c10->_32 _5y:c10->_32 _5z:c10->_32 _60:c10->_32 
_61:c10->_32 _62:c3->_32 _65:c7->_62
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:878)
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:844)
at 
org.apache.lucene.index.TestIndexWriterMergePolicy.checkInvariants(TestIndexWriterMergePolicy.java:243)
at 
org.apache.lucene.index.TestIndexWriterMergePolicy.testMaxBufferedDocsChange(TestIndexWriterMergePolicy.java:169)




Build Log (for compile errors):
[...truncated 3085 lines...]



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2747) Deprecate/remove language-specific tokenizers in favor of StandardTokenizer

2010-11-09 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930470#action_12930470
 ] 

Steven Rowe commented on LUCENE-2747:
-

DM, I'm a committer on the JFlex team.  About the second to last point: when 
Robert said "(jflex doesnt have this... yet)" he meant the jflex-based 
implementation, not JFlex itself.

About the last point, JFlex is shooting for level 1 compliance with [UTS#18 
Unicode Regular Expressions|http://unicode.org/reports/tr18/], which requires 
conforming implementations to "handle the full range of Unicode code points, 
including values from U+0000 to U+10FFFF."

> Deprecate/remove language-specific tokenizers in favor of StandardTokenizer
> ---
>
> Key: LUCENE-2747
> URL: https://issues.apache.org/jira/browse/LUCENE-2747
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Analysis
>Affects Versions: 3.1, 4.0
>Reporter: Steven Rowe
> Fix For: 3.1, 4.0
>
> Attachments: LUCENE-2747.patch, LUCENE-2747.patch
>
>
> As of Lucene 3.1, StandardTokenizer implements UAX#29 word boundary rules to 
> provide language-neutral tokenization.  Lucene contains several 
> language-specific tokenizers that should be replaced by UAX#29-based 
> StandardTokenizer (deprecated in 3.1 and removed in 4.0).  The 
> language-specific *analyzers*, by contrast, should remain, because they 
> contain language-specific post-tokenization filters.  The language-specific 
> analyzers should switch to StandardTokenizer in 3.1.
> Some usages of language-specific tokenizers will need additional work beyond 
> just replacing the tokenizer in the language-specific analyzer.  
> For example, PersianAnalyzer currently uses ArabicLetterTokenizer, and 
> depends on the fact that this tokenizer breaks tokens on the ZWNJ character 
> (zero-width non-joiner; U+200C), but in the UAX#29 word boundary rules, ZWNJ 
> is not a word boundary.  Robert Muir has suggested using a char filter 
> converting ZWNJ to spaces prior to StandardTokenizer in the converted 
> PersianAnalyzer.
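
A minimal sketch of the char filter approach mentioned above, assuming the Lucene 3.1-era analysis API (MappingCharFilter + NormalizeCharMap in front of StandardTokenizer). This is an illustration only, not the actual PersianAnalyzer change:

{code}
// Hedged sketch: map ZWNJ (U+200C) to a space before StandardTokenizer so that
// Persian tokens still break at ZWNJ even though UAX#29 does not treat it as a
// word boundary. Class names follow the 3.1-era API; treat them as assumptions.
import java.io.Reader;
import java.io.StringReader;

import org.apache.lucene.analysis.CharReader;
import org.apache.lucene.analysis.MappingCharFilter;
import org.apache.lucene.analysis.NormalizeCharMap;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class ZwnjToSpaceSketch {

  static TokenStream tokenize(Reader input) {
    NormalizeCharMap map = new NormalizeCharMap();
    map.add("\u200C", " ");  // ZWNJ -> space
    Reader filtered = new MappingCharFilter(map, CharReader.get(input));
    return new StandardTokenizer(Version.LUCENE_31, filtered);
  }

  public static void main(String[] args) throws Exception {
    TokenStream ts = tokenize(new StringReader("\u0645\u06CC\u200C\u062E\u0648\u0631\u062F"));
    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    while (ts.incrementToken()) {
      System.out.println(term.toString());
    }
    ts.end();
    ts.close();
  }
}
{code}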

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Lucene-trunk - Build # 1358 - Still Failing

2010-11-09 Thread Apache Hudson Server
Build: https://hudson.apache.org/hudson/job/Lucene-trunk/1358/

All tests passed

Build Log (for compile errors):
[...truncated 18290 lines...]



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: svn commit: r1032995 - in /lucene/dev/trunk/solr/src/site/src/documentation/content/xdocs/images: solr.jpg solr_FC.eps

2010-11-09 Thread Ryan McKinley
I'm with yonik on this one... why would the logo need a tm?

Isn't it already trademarked?

I'm in no rush to make the logo more ugly... for no need.

Is the plan to add tm to all the feather images across apache too?
http://www.apache.org/

but I hate to be a pain in the ass


On Tue, Nov 9, 2010 at 5:16 PM, Yonik Seeley  wrote:
> On Tue, Nov 9, 2010 at 4:58 PM, Grant Ingersoll  wrote:
>> You can read it at http://www.apache.org/foundation/marks/pmcs
>
> I did - but just because it appears on a web page does not make it
> true.  There have been *many* examples (and still are many examples)
> of things that are on our websites that are not strictly true.
>
>> , which is what I'm following and explains in the first paragraph.  I don't 
>> understand why it is a big deal to add a little TM on the logo.
>
> Part of it is uglification, part of it is slippery slope.  I see
> increasing micro-management and rigidity and it's something I actively
> fight against ;-)
>
>> It's standard practice for anyone wanting to protect their names/logos
>
> http://www.pg.com/
> http://www.pepsi.com/
> http://www.kraftfoodscompany.com
> http://www.unilever.com/
> http://www.conocophillips.com/
> http://www.3m.com/
> http://www.boeing.com/
> http://www.pfizer.com/home/
> http://www.google.com/
> http://www.apple.com/
> http://www.jnj.com/
> http://www.ge.com/
> http://www.att.com/
> http://www.verizon.com
>
> In my quick survey of some of the biggest companies off the top of my
> head, about 75% did not.
>
>> and as you stated, they are already trademarked items.
>
> meaning it shouldn't be necessary.
>
> -Yonik
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>
>

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2747) Deprecate/remove language-specific tokenizers in favor of StandardTokenizer

2010-11-09 Thread DM Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930448#action_12930448
 ] 

DM Smith commented on LUCENE-2747:
--

Robert, I think we are on the same wavelength. Thanks.

I like the idea of declarative analyzers, too.

Regarding the "last 2 points" has anyone given input to the JFlex team on these 
needs?

> Deprecate/remove language-specific tokenizers in favor of StandardTokenizer
> ---
>
> Key: LUCENE-2747
> URL: https://issues.apache.org/jira/browse/LUCENE-2747
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Analysis
>Affects Versions: 3.1, 4.0
>Reporter: Steven Rowe
> Fix For: 3.1, 4.0
>
> Attachments: LUCENE-2747.patch, LUCENE-2747.patch
>
>
> As of Lucene 3.1, StandardTokenizer implements UAX#29 word boundary rules to 
> provide language-neutral tokenization.  Lucene contains several 
> language-specific tokenizers that should be replaced by UAX#29-based 
> StandardTokenizer (deprecated in 3.1 and removed in 4.0).  The 
> language-specific *analyzers*, by contrast, should remain, because they 
> contain language-specific post-tokenization filters.  The language-specific 
> analyzers should switch to StandardTokenizer in 3.1.
> Some usages of language-specific tokenizers will need additional work beyond 
> just replacing the tokenizer in the language-specific analyzer.  
> For example, PersianAnalyzer currently uses ArabicLetterTokenizer, and 
> depends on the fact that this tokenizer breaks tokens on the ZWNJ character 
> (zero-width non-joiner; U+200C), but in the UAX#29 word boundary rules, ZWNJ 
> is not a word boundary.  Robert Muir has suggested using a char filter 
> converting ZWNJ to spaces prior to StandardTokenizer in the converted 
> PersianAnalyzer.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Updated: (SOLR-1395) Integrate Katta

2010-11-09 Thread tom liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tom liu updated SOLR-1395:
--

Attachment: solr-1395-katta-0.6.2-3.patch

Fixed some bugs:
# select?qt=qtname is not supported in the SolrKattaServer subproxy.
In SolrKattaServer, any query handler must be a MultiEmbeddedSearchHandler,
so in solrconfig.xml we must change solr.SearchHandler to
MultiEmbeddedSearchHandler, for example:
{noformat}
  <!-- reconstructed layout: the mail archive stripped the original XML tags,
       so the handler name and element nesting here are assumed -->
  <requestHandler name="/tvrh" class="MultiEmbeddedSearchHandler">
    <lst name="defaults">
      <bool name="tv">true</bool>
    </lst>
    <arr name="last-components">
      <str>tvComponent</str>
    </arr>
  </requestHandler>
{noformat}
# TermVectorComponent does not return results;
see https://issues.apache.org/jira/browse/SOLR-2224

> Integrate Katta
> ---
>
> Key: SOLR-1395
> URL: https://issues.apache.org/jira/browse/SOLR-1395
> Project: Solr
>  Issue Type: New Feature
>Affects Versions: 1.4
>Reporter: Jason Rutherglen
>Priority: Minor
> Fix For: Next
>
> Attachments: back-end.log, front-end.log, hadoop-core-0.19.0.jar, 
> katta-core-0.6-dev.jar, katta.node.properties, katta.zk.properties, 
> log4j-1.2.13.jar, solr-1395-1431-3.patch, solr-1395-1431-4.patch, 
> solr-1395-1431-katta0.6.patch, solr-1395-1431-katta0.6.patch, 
> solr-1395-1431.patch, solr-1395-katta-0.6.2-1.patch, 
> solr-1395-katta-0.6.2-2.patch, solr-1395-katta-0.6.2-3.patch, 
> solr-1395-katta-0.6.2.patch, SOLR-1395.patch, SOLR-1395.patch, 
> SOLR-1395.patch, test-katta-core-0.6-dev.jar, zkclient-0.1-dev.jar, 
> zookeeper-3.2.1.jar
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> We'll integrate Katta into Solr so that:
> * Distributed search uses Hadoop RPC
> * Shard/SolrCore distribution and management
> * Zookeeper based failover
> * Indexes may be built using Hadoop

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Updated: (SOLR-2224) TermVectorComponent did not return results when using distributedProcess in distribution envs

2010-11-09 Thread tom liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tom liu updated SOLR-2224:
--

Attachment: TermsVectorComponent.patch

In distributed query environments, use the request that QueryComponent creates.

The patch uses the merge method that DebugComponent has; see
https://issues.apache.org/jira/browse/SOLR-2228

> TermVectorComponent did not return results when using distributedProcess in 
> distribution envs
> -
>
> Key: SOLR-2224
> URL: https://issues.apache.org/jira/browse/SOLR-2224
> Project: Solr
>  Issue Type: Bug
>  Components: SearchComponents - other
>Affects Versions: 4.0
> Environment: JDK1.6/Tomcat6
>Reporter: tom liu
> Attachments: TermsVectorComponent.patch
>
>
> When using a distributed query, TVRH did not return any results.
> In distributedProcess, tv creates one request that uses 
> TermVectorParams.DOC_IDS, for example tv.docIds=10001,
> but QueryComponent returns ids that are uniqueKeys, not internal doc ids.
> So, in distributed environments, distributedProcess must not be used.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Updated: (SOLR-2228) refactors DebugComponent's merge method to create a class that can be used on other SearchComponents

2010-11-09 Thread tom liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tom liu updated SOLR-2228:
--

Attachment: namedlistcolleciton.patch

Creates the class NamedListCollection.
Its main method is merge, taken from DebugComponent.
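
A rough sketch of that kind of utility (illustrative only; this is not the attached NamedListCollection class, just the general merge idea using NamedList's public API):

{code}
import org.apache.solr.common.util.NamedList;

// Hedged sketch: fold several NamedList instances into a single one by
// appending every (name, value) entry in order. Real merge logic (as in
// DebugComponent) may combine values per key instead of simply appending.
public final class NamedListMergeSketch {

  public static NamedList<Object> merge(Iterable<NamedList<Object>> sources) {
    NamedList<Object> merged = new NamedList<Object>();
    for (NamedList<Object> src : sources) {
      for (int i = 0; i < src.size(); i++) {
        merged.add(src.getName(i), src.getVal(i));
      }
    }
    return merged;
  }
}
{code}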

> refactors DebugComponent's merge method to create a class that can be used on 
> other SearchComponents
> ---
>
> Key: SOLR-2228
> URL: https://issues.apache.org/jira/browse/SOLR-2228
> Project: Solr
>  Issue Type: Improvement
>  Components: SearchComponents - other
> Environment: JDK1.6/Tomcat6
>Reporter: tom liu
> Attachments: namedlistcolleciton.patch
>
>
> Some SearchComponents have to merge many NamedLists into one NamedList.
> For example, TermVectorComponent would merge many NLs into one NL.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Lucene-Solr-tests-only-trunk - Build # 1199 - Failure

2010-11-09 Thread Apache Hudson Server
Build: https://hudson.apache.org/hudson/job/Lucene-Solr-tests-only-trunk/1199/

4 tests failed.
FAILED:  
junit.framework.TestSuite.org.apache.lucene.search.TestNumericRangeQuery64

Error Message:
Java heap space

Stack Trace:
java.lang.OutOfMemoryError: Java heap space


FAILED:  
junit.framework.TestSuite.org.apache.lucene.search.TestNumericRangeQuery64

Error Message:
null

Stack Trace:
java.lang.NullPointerException
at 
org.apache.lucene.search.TestNumericRangeQuery64.afterClass(TestNumericRangeQuery64.java:97)


FAILED:  
junit.framework.TestSuite.org.apache.lucene.search.TestNumericRangeQuery64

Error Message:
directory of test was not closed, opened from: 
org.apache.lucene.util.LuceneTestCase.newDirectory(LuceneTestCase.java:653)

Stack Trace:
junit.framework.AssertionFailedError: directory of test was not closed, opened 
from: 
org.apache.lucene.util.LuceneTestCase.newDirectory(LuceneTestCase.java:653)
at 
org.apache.lucene.util.LuceneTestCase.afterClassLuceneTestCaseJ4(LuceneTestCase.java:331)


REGRESSION:  org.apache.lucene.search.TestPrefixFilter.testPrefixFilter

Error Message:
ConcurrentMergeScheduler hit unhandled exceptions

Stack Trace:
junit.framework.AssertionFailedError: ConcurrentMergeScheduler hit unhandled 
exceptions
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:878)
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:844)
at 
org.apache.lucene.util.LuceneTestCase.tearDown(LuceneTestCase.java:458)




Build Log (for compile errors):
[...truncated 3111 lines...]



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Created: (SOLR-2228) refactors DebugComponent's merge method to create a class that can be used on other SearchComponents

2010-11-09 Thread tom liu (JIRA)
refactors DebugComponent's merge method to create a class that can be used on 
other SearchComponents
---

 Key: SOLR-2228
 URL: https://issues.apache.org/jira/browse/SOLR-2228
 Project: Solr
  Issue Type: Improvement
  Components: SearchComponents - other
 Environment: JDK1.6/Tomcat6
Reporter: tom liu


Some SearchComponents have to merge many NamedLists into one NamedList.
For example, TermVectorComponent would merge many NLs into one NL.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Lucene-3.x - Build # 176 - Still Failing

2010-11-09 Thread Apache Hudson Server
Build: https://hudson.apache.org/hudson/job/Lucene-3.x/176/

All tests passed

Build Log (for compile errors):
[...truncated 21372 lines...]



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: [jira] Created: (SOLR-2226) DIH add data and after a while removes all from index

2010-11-09 Thread Erick Erickson
I'd be extremely surprised if you were the very
first one to notice this, but all things are possible. It's
more likely that there's something else going on besides
1.4 just throwing away your data.

Let's see your DIH config, and any commands you used
"after a while". I suspect you've done something like having the
DIH "clean" parameter set to true (I don't know if the
defaults changed).  How did you revert to 1.4? It would be
helpful if you outlined what steps you took.

Best
Erick

On Tue, Nov 9, 2010 at 4:18 PM, Marcin Rosinski (JIRA) wrote:

> DIH add data and after a while removes all from index
> -
>
> Key: SOLR-2226
> URL: https://issues.apache.org/jira/browse/SOLR-2226
> Project: Solr
>  Issue Type: Bug
>  Components: contrib - DataImportHandler
>Affects Versions: 4.0
> Environment: centos 5.4, tomcat 7.0.2, jdbc mysql connector
>Reporter: Marcin Rosinski
>Priority: Critical
>
>
> Hi guys,
>
> I am having a weird problem. I am currently using solr 1.5-dev and wanted to
> switch to 4.0. I dropped my indexes as they are not backward compatible
> and ran DIH. All data was indexed and was even searchable for a
> short while; after that it was all suddenly dropped.
>
> All is working fine on 1.5, so it forces me to think that it must be some
> kind of bug in 4.0.
>
>
> cheers,
> /Marcin
>
> --
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>
>


Re: Document links

2010-11-09 Thread Mark Harwood
I was using within-segment doc ids stored in link files named after both the 
source and target segments (a link after all is 2 endpoints).
For a complete solution you ultimately have to deal with the fact that doc ids 
could be references to:
* Stable, committed docs (the easy case)
* Flushed but not yet committed docs
* Buffered but not yet flushed docs
* Flushed/committed but currently merging docs

...all of which are happening in different threads: e.g. a reader has one view of 
the world while a background thread is busy merging segments to create a new view of 
the world, even after commits have completed.

All very messy.
-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Document links

2010-11-09 Thread Paul Elschot
On Monday 08 November 2010 20:52:53 mark harwood wrote:
> I came to the conclusion that the transient meaning of document ids is too 
> deeply ingrained in Lucene's design to use them to underpin any reliable 
> linking.
> While it might work for relatively static indexes, any index with a 
> reasonable 
> number of updates or deletes will invalidate any stored document references 
> in 
> ways which are very hard to track. Lucene's compaction shuffles IDs without 
> taking care to preserve identity, unlike graph DBs like Neo4j (see "recycling 
> IDs" here: http://goo.gl/5UbJi )

Did you try to keep the docIds as segmentId-inSegmentDocId in a tree?

Somehow I think this could work: it would not really complicate
adding/deleting relations between docIds, and segment merges would
become massive but straightforward deletes and inserts. With some
luck the amount of work for that would be a small proportion of the work
for a normal segment merge.
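
A minimal sketch of that idea, purely illustrative (the class and method names below are made up, not from any patch):

// Hypothetical sketch: address links by a stable (segmentId, inSegmentDocId)
// pair kept in a sorted tree, so a segment merge becomes a bulk delete of the
// old segments' key ranges plus inserts under the new segment's id.
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

final class SegDocKey implements Comparable<SegDocKey> {
  final long segmentId;        // stable per-segment identifier
  final int inSegmentDocId;    // doc id within that segment

  SegDocKey(long segmentId, int inSegmentDocId) {
    this.segmentId = segmentId;
    this.inSegmentDocId = inSegmentDocId;
  }

  public int compareTo(SegDocKey o) {
    if (segmentId != o.segmentId) return segmentId < o.segmentId ? -1 : 1;
    if (inSegmentDocId != o.inSegmentDocId) return inSegmentDocId < o.inSegmentDocId ? -1 : 1;
    return 0;
  }
}

class LinkTree {
  // source doc -> target docs it links to, kept sorted by (segment, doc)
  private final NavigableMap<SegDocKey, List<SegDocKey>> links =
      new TreeMap<SegDocKey, List<SegDocKey>>();

  void addLink(SegDocKey from, List<SegDocKey> to) { links.put(from, to); }

  // A merge removes one contiguous key range per old segment (a massive but
  // straightforward delete), then re-adds the links under the new segment id.
  void dropSegment(long segmentId) {
    links.subMap(new SegDocKey(segmentId, 0), true,
                 new SegDocKey(segmentId, Integer.MAX_VALUE), true).clear();
  }
}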

Regards,
Paul Elschot

> 
> 
> Cheers,
> Mark
> 
> 
> - Original Message 
> From: Ryan McKinley 
> To: dev@lucene.apache.org
> Sent: Mon, 8 November, 2010 19:03:59
> Subject: Re: Document links
> 
> Any updates/progress with this?
> 
> I'm looking at ways to implement an RTree with lucene -- and this
> discussion seems relevant
> 
> thanks
> ryan
> 
> 
> On Sat, Sep 25, 2010 at 5:42 PM, mark harwood  wrote:
> >>> Both these on disk data structures and the ones in a B+ tree have seek offsets
> >>> into files that require disk seeks. And both could use document ids as key values.
> >
> > Yep. However my approach doesn't use a doc id as a key that is searched in 
> > any
> > B+ tree index (which involves disk seeks) - it is used as direct offset 
> > into a
> > file to get the pointer into a "links" data structure.
> >
> >
> >
> >>> But do these disk data structures support dynamic addition and deletion of
> >>> (larger numbers of) document links?
> >
> > Yes, the slide deck I linked to shows how links (like documents) spend the early
> > stages of life being merged frequently in the smaller, newer segments and over
> > time migrate into larger, more stable segments as part of Lucene transactions.
> >
> > That's the theory - I'm currently benchmarking an early prototype.
> >
> >
> >
> > - Original Message 
> > From: Paul Elschot 
> > To: dev@lucene.apache.org
> > Sent: Sat, 25 September, 2010 22:03:28
> > Subject: Re: Document links
> >
> > On Saturday, 25 September 2010 15:23:39, Mark Harwood wrote:
> >> My starting point in the solution I propose was to eliminate linking via any
> >> type of key. Key lookups mean indexes and indexes mean disk seeks. Graph
> >> traversals have exponential numbers of links and so all these index disk seeks
> >> start to stack up. The solution I propose uses doc ids as more-or-less direct
> >> pointers into file structures avoiding any index lookup.
> >> I've started coding up some tests using the file structures I outlined and will
> >> compare that with a traditional key-based approach.
> >
> > Both these on disk data structures and the ones in a B+ tree have seek offsets
> > into files that require disk seeks. And both could use document ids as key values.
> >
> > But do these disk data structures support dynamic addition and deletion of
> > (larger numbers of) document links?
> >
> > B+ trees are a standard solution for problems like this one, and it would
> > probably not be easy to outperform them.
> > It may be possible to improve performance of B+ trees somewhat by specializing
> > for the fairly simple keys that would be needed, and by encoding very short
> > lists of links for a single document directly into a seek offset to avoid
> > the actual seek, but that's about it.
> >
> > Regards,
> > Paul Elschot
> >
> >>
> >> For reference - playing the "Kevin Bacon game" on a traditional Lucene index of
> >> IMDB data took 18 seconds to find a short path that Neo4j finds in 200
> >> milliseconds on the same data (and this was a disk based graph of 3m nodes, 10m
> >> edges).
> >> Going from actor->movies->actors->movies produces a lot of key lookups and the
> >> difference between key indexes and direct node pointers becomes clear.
> >> I know path finding analysis is perhaps not a typical Lucene application but
> >> other forms of link analysis e.g. recommendation engines require similar
> >> performance.
> >>
> >> Cheers
> >> Mark
> >>
> >>
> >>
> >> On 25 Sep 2010, at 11:41, Paul Elschot wrote:
> >>
> >> > On Friday, 24 September 2010 17:57:45, mark harwood wrote:
> >> >>> While not exactly equivalent, it reminds me of our earlier discussion
> >> >>> around "layered segments" for dealing with field updates
> >> >>
> >> >> Right. Fast discovery of document relations is a foundation on which lots of
> >> >> things like this can build. Relations can be given types to support a numbe

Re: svn commit: r1032995 - in /lucene/dev/trunk/solr/src/site/src/documentation/content/xdocs/images: solr.jpg solr_FC.eps

2010-11-09 Thread Yonik Seeley
On Tue, Nov 9, 2010 at 4:58 PM, Grant Ingersoll  wrote:
> You can read it at http://www.apache.org/foundation/marks/pmcs

I did - but just because it appears on a web page does not make it
true.  There have been *many* examples (and still are many examples)
of things that are on our websites that are not strictly true.

> , which is what I'm following and explains in the first paragraph.  I don't 
> understand why it is a big deal to add a little TM on the logo.

Part of it is uglification, part of it is slippery slope.  I see
increasing micro-management and rigidity and it's something I actively
fight against ;-)

> It's standard practice for anyone wanting to protect their names/logos

http://www.pg.com/
http://www.pepsi.com/
http://www.kraftfoodscompany.com
http://www.unilever.com/
http://www.conocophillips.com/
http://www.3m.com/
http://www.boeing.com/
http://www.pfizer.com/home/
http://www.google.com/
http://www.apple.com/
http://www.jnj.com/
http://www.ge.com/
http://www.att.com/
http://www.verizon.com

In my quick survey of some of the biggest companies off the top of my
head, about 75% did not.

> and as you stated, they are already trademarked items.

meaning it shouldn't be necessary.

-Yonik

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (SOLR-2223) Separate out "generic" Solr site from release specific content.

2010-11-09 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930314#action_12930314
 ] 

Hoss Man commented on SOLR-2223:


I don't really understand why this blocks LUCENE-2746. The Solr "site" 
snapshot has always referred to "trunk", not the release - it's something we've 
talked about changing to mimic the way lucene-java had release-specific 
tutorials/docs, but never got around to.

Adding the TMs and consistent project-name verbiage to the site where it lives 
in svn right now and publishing should be just fine for satisfying LUCENE-2746 
- no need to revamp how we build the site if we're just going to revamp it 
again using the CMS.

> Separate out "generic" Solr site from release specific content.
> ---
>
> Key: SOLR-2223
> URL: https://issues.apache.org/jira/browse/SOLR-2223
> Project: Solr
>  Issue Type: Task
>Reporter: Grant Ingersoll
>Assignee: Grant Ingersoll
>Priority: Minor
>
> It would be useful for deployment purposes if we separated out the Solr site 
> that is non-release specific from the release specific content.  This would 
> make it easier to apply updates, etc. while still keeping release specific 
> info handy.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: svn commit: r1032995 - in /lucene/dev/trunk/solr/src/site/src/documentation/content/xdocs/images: solr.jpg solr_FC.eps

2010-11-09 Thread Grant Ingersoll
You can read it at http://www.apache.org/foundation/marks/pmcs, which is what 
I'm following and explains in the first paragraph.  I don't understand why it 
is a big deal to add a little TM on the logo.  It's standard practice for 
anyone wanting to protect their names/logos and as you stated, they are already 
trademarked items.  FWIW, we can put the TM next to the logo too, but that just 
seems clunky HTML wise.


On Nov 9, 2010, at 4:30 PM, Yonik Seeley wrote:

> FYI, I've followed up with trademarks@
> 
> -Yonik
> http://www.lucidimagination.com
> 
> 
> 
> On Tue, Nov 9, 2010 at 4:06 PM, Yonik Seeley  
> wrote:
>> On Tue, Nov 9, 2010 at 4:03 PM, Grant Ingersoll  wrote:
>>> Sorry, I figured the word "Requirements" in the title was pretty clear.
>> 
>> I was seeking information on how this became an actual hard
>> requirement, and if it actually was.
>> It certainly wouldn't be the first time that something is listed as a
>> requirement just because someone thought it was a good idea.
>> 
>> -Yonik
> 
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
> 



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Lucene-Solr-tests-only-trunk - Build # 1192 - Failure

2010-11-09 Thread Apache Hudson Server
Build: https://hudson.apache.org/hudson/job/Lucene-Solr-tests-only-trunk/1192/

1 tests failed.
REGRESSION:  
org.apache.lucene.index.TestIndexWriterMergePolicy.testMaxBufferedDocsChange

Error Message:
maxMergeDocs=2147483647; numSegments=11; upperBound=10; mergeFactor=10; 
segs=_65:c5950 _5t:c10->_32 _5u:c10->_32 _5v:c10->_32 _5w:c10->_32 _5x:c10->_32 
_5y:c10->_32 _5z:c10->_32 _60:c10->_32 _61:c10->_32 _62:c8->_32 _63:c2->_62

Stack Trace:
junit.framework.AssertionFailedError: maxMergeDocs=2147483647; numSegments=11; 
upperBound=10; mergeFactor=10; segs=_65:c5950 _5t:c10->_32 _5u:c10->_32 
_5v:c10->_32 _5w:c10->_32 _5x:c10->_32 _5y:c10->_32 _5z:c10->_32 _60:c10->_32 
_61:c10->_32 _62:c8->_32 _63:c2->_62
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:878)
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:844)
at 
org.apache.lucene.index.TestIndexWriterMergePolicy.checkInvariants(TestIndexWriterMergePolicy.java:243)
at 
org.apache.lucene.index.TestIndexWriterMergePolicy.testMaxBufferedDocsChange(TestIndexWriterMergePolicy.java:169)




Build Log (for compile errors):
[...truncated 3082 lines...]



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: svn commit: r1032995 - in /lucene/dev/trunk/solr/src/site/src/documentation/content/xdocs/images: solr.jpg solr_FC.eps

2010-11-09 Thread Yonik Seeley
And FWIW, here's my bit of civil disobedience :-P

http://yonik.wordpress.com/

-Yonik

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Created: (SOLR-2227) DIH add data and after a while removes all from index

2010-11-09 Thread Marcin Rosinski (JIRA)
DIH add data and after a while removes all from index
-

 Key: SOLR-2227
 URL: https://issues.apache.org/jira/browse/SOLR-2227
 Project: Solr
  Issue Type: Bug
Affects Versions: 4.0
 Environment: centos 5.4, tomcat 7.0.2, jdbc mysql connector

Reporter: Marcin Rosinski
Priority: Critical
 Fix For: 4.0


Hi guys,

I am having a weird problem. I am currently using solr 1.5-dev and wanted to switch 
to 4.0. I dropped my indexes as they are not backward compatible and ran DIH. 
All data was indexed and was even searchable for a short while; after that it 
was all suddenly dropped.

All is working fine on 1.5, so it forces me to think that it must be some kind 
of bug in 4.0.

P.S.
Sorry for posting twice, but I couldn't see it under the Solr 4.0 section.

cheers,
/Marcin


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: svn commit: r1032995 - in /lucene/dev/trunk/solr/src/site/src/documentation/content/xdocs/images: solr.jpg solr_FC.eps

2010-11-09 Thread Yonik Seeley
FYI, I've followed up with trademarks@

-Yonik
http://www.lucidimagination.com



On Tue, Nov 9, 2010 at 4:06 PM, Yonik Seeley  wrote:
> On Tue, Nov 9, 2010 at 4:03 PM, Grant Ingersoll  wrote:
>> Sorry, I figured the word "Requirements" in the title was pretty clear.
>
> I was seeking information on how this became an actual hard
> requirement, and if it actually was.
> It certainly wouldn't be the first time that something is listed as a
> requirement just because someone thought it was a good idea.
>
> -Yonik

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Created: (SOLR-2226) DIH add data and after a while removes all from index

2010-11-09 Thread Marcin Rosinski (JIRA)
DIH add data and after a while removes all from index
-

 Key: SOLR-2226
 URL: https://issues.apache.org/jira/browse/SOLR-2226
 Project: Solr
  Issue Type: Bug
  Components: contrib - DataImportHandler
Affects Versions: 4.0
 Environment: centos 5.4, tomcat 7.0.2, jdbc mysql connector
Reporter: Marcin Rosinski
Priority: Critical


Hi guys,

I am having a weird problem. I am currently using solr 1.5-dev and wanted to switch 
to 4.0. I dropped my indexes as they are not backward compatible and ran DIH. 
All data was indexed and was even searchable for a short while; after that it 
was all suddenly dropped.

All is working fine on 1.5, so it forces me to think that it must be some kind 
of bug in 4.0.


cheers,
/Marcin

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: svn commit: r1032995 - in /lucene/dev/trunk/solr/src/site/src/documentation/content/xdocs/images: solr.jpg solr_FC.eps

2010-11-09 Thread Yonik Seeley
On Tue, Nov 9, 2010 at 4:03 PM, Grant Ingersoll  wrote:
> Sorry, I figured the word "Requirements" in the title was pretty clear.

I was seeking information on how this became an actual hard
requirement, and if it actually was.
It certainly wouldn't be the first time that something is listed as a
requirement just because someone thought it was a good idea.

-Yonik

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: svn commit: r1032995 - in /lucene/dev/trunk/solr/src/site/src/documentation/content/xdocs/images: solr.jpg solr_FC.eps

2010-11-09 Thread Grant Ingersoll

On Nov 9, 2010, at 3:54 PM, Yonik Seeley wrote:

> On Tue, Nov 9, 2010 at 3:51 PM, Grant Ingersoll  wrote:
>> See the mail to the PMC on Branding Requirements and the associated website.
> 
> I guess I was looking for something more specific.
> I followed a lot of the previous discussions on trademark stuff - I
> just missed the transition point where it became absolutely required.

Sorry, I figured the word "Requirements" in the title was pretty clear.  In 
talking with Shane (who handles TM for the ASF), the goal is to have much more 
consistency across the ASF as regards protecting the ASF marks, so this 
is something all projects must implement and report to the Board on in terms 
of progress.  

-Grant
-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (SOLR-2223) Separate out "generic" Solr site from release specific content.

2010-11-09 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930284#action_12930284
 ] 

Grant Ingersoll commented on SOLR-2223:
---

Yes, Hoss, that would make sense.  I was just trying to do the least amount of 
work now for LUCENE-2746 to satisfy the needs there before we move to the new 
CMS.

> Separate out "generic" Solr site from release specific content.
> ---
>
> Key: SOLR-2223
> URL: https://issues.apache.org/jira/browse/SOLR-2223
> Project: Solr
>  Issue Type: Task
>Reporter: Grant Ingersoll
>Assignee: Grant Ingersoll
>Priority: Minor
>
> It would be useful for deployment purposes if we separated out the Solr site 
> that is non-release specific from the release specific content.  This would 
> make it easier to apply updates, etc. while still keeping release specific 
> info handy.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: svn commit: r1032995 - in /lucene/dev/trunk/solr/src/site/src/documentation/content/xdocs/images: solr.jpg solr_FC.eps

2010-11-09 Thread Yonik Seeley
On Tue, Nov 9, 2010 at 3:51 PM, Grant Ingersoll  wrote:
> See the mail to the PMC on Branding Requirements and the associated website.

I guess I was looking for something more specific.
I followed a lot of the previous discussions on trademark stuff - I
just missed the transition point where it became absolutely required.

-Yonik

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: svn commit: r1032995 - in /lucene/dev/trunk/solr/src/site/src/documentation/content/xdocs/images: solr.jpg solr_FC.eps

2010-11-09 Thread Grant Ingersoll
Karl, yes it is something Manifold needs to do.

On Nov 9, 2010, at 3:41 PM, Yonik Seeley wrote:

> On Tue, Nov 9, 2010 at 3:33 PM, Grant Ingersoll  wrote:
>> Sorry, no choice.  Comes down from the board that logos need TMs on them.
> 
> IMO, it's a little too much micromanagement (and IIRC not everyone
> agreed about logo TMs in the past).  Have any pointers to more recent
> discussions for me?

See the mail to the PMC on Branding Requirements and the associated website.  

> 
> We can also trademark our logos w/o adding TM to them, and I disagree
> that we need to add it to the Lucene/Solr logos (Lucene and Solr names
> themselves are trademarked).

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



RE: svn commit: r1032995 - in /lucene/dev/trunk/solr/src/site/src/documentation/content/xdocs/images: solr.jpg solr_FC.eps

2010-11-09 Thread karl.wright
Is this something ManifoldCF needs to do also?
Karl

-Original Message-
From: ext Grant Ingersoll [mailto:gsing...@apache.org] 
Sent: Tuesday, November 09, 2010 3:34 PM
To: dev@lucene.apache.org
Subject: Re: svn commit: r1032995 - in 
/lucene/dev/trunk/solr/src/site/src/documentation/content/xdocs/images: 
solr.jpg solr_FC.eps

Sorry, no choice.  Comes down from the board that logos need TMs on them.

On Nov 9, 2010, at 2:53 PM, Yonik Seeley wrote:

> IMO, any changes to our logos should be voted on.
> 
> -Yonik
> 
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
> 



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: svn commit: r1032995 - in /lucene/dev/trunk/solr/src/site/src/documentation/content/xdocs/images: solr.jpg solr_FC.eps

2010-11-09 Thread Yonik Seeley
On Tue, Nov 9, 2010 at 3:33 PM, Grant Ingersoll  wrote:
> Sorry, no choice.  Comes down from the board that logos need TMs on them.

IMO, it's a little too much micromanagement (and IIRC not everyone
agreed about logo TMs in the past).  Have any pointers to more recent
discussions for me?

We can also trademark our logos w/o adding TM to them, and I disagree
that we need to add it to the Lucene/Solr logos (Lucene and Solr names
themselves are trademarked).

-Yonik

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: svn commit: r1032995 - in /lucene/dev/trunk/solr/src/site/src/documentation/content/xdocs/images: solr.jpg solr_FC.eps

2010-11-09 Thread Grant Ingersoll
Sorry, no choice.  Comes down from the board that logos need TMs on them.

On Nov 9, 2010, at 2:53 PM, Yonik Seeley wrote:

> IMO, any changes to our logos should be voted on.
> 
> -Yonik
> 
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
> 



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: svn commit: r1032995 - in /lucene/dev/trunk/solr/src/site/src/documentation/content/xdocs/images: solr.jpg solr_FC.eps

2010-11-09 Thread Yonik Seeley
IMO, any changes to our logos should be voted on.

-Yonik

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (SOLR-2212) NoMergePolicy class does not load

2010-11-09 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930243#action_12930243
 ] 

Hoss Man commented on SOLR-2212:


The {{mergePolicy}} config element (and most of the SolrPlugin stuff) requires that the class 
specified support a no-arg constructor.

"NoMergePolicy" has no public constructors at all - it seems to expect you to 
only ever use one of the static singletons.
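
A hedged illustration of the mismatch (this is not Solr's actual plugin-loading code, just what no-arg reflective loading boils down to):

{code}
import org.apache.lucene.index.MergePolicy;
import org.apache.lucene.index.NoMergePolicy;

public class NoMergePolicyLoadSketch {
  public static void main(String[] args) throws Exception {
    // What a no-arg-constructor plugin loader effectively does:
    Class<?> clazz = Class.forName("org.apache.lucene.index.NoMergePolicy");
    MergePolicy viaReflection = (MergePolicy) clazz.newInstance(); // throws: no public ctor

    // The only way the class itself intends to be used:
    MergePolicy direct = NoMergePolicy.NO_COMPOUND_FILES;
    System.out.println(direct);
  }
}
{code}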

> NoMergePolicy class does not load
> -
>
> Key: SOLR-2212
> URL: https://issues.apache.org/jira/browse/SOLR-2212
> Project: Solr
>  Issue Type: Bug
>  Components: multicore
>Affects Versions: 3.1, 4.0
>Reporter: Lance Norskog
>
> Solr cannot use the Lucene NoMergePolicy class. It will not instantiate 
> correctly when loading the core.
> Other MergePolicy classes work, including the BalancedSegmentMergePolicy.
> This is in trunk and 3.x.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (SOLR-2223) Separate out "generic" Solr site from release specific content.

2010-11-09 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930241#action_12930241
 ] 

Hoss Man commented on SOLR-2223:


Given the dev merge of lucene/solr, the reduction in subprojects due to 
graduation, and the consideration of using the new CMS, now is probably a 
good time to consider abandoning the way we've been having distinct svn paths 
for the "site" of each subproject.

Why not move to a single "site" directory in svn for the entire TLP that the 
CMS updates (which can still have subdirs by project and whatnot, but can 
easily maintain consistent navigation and look and feel) and then keep only 
the release-specific docs in the individual sub-proj trunk dirs?

> Separate out "generic" Solr site from release specific content.
> ---
>
> Key: SOLR-2223
> URL: https://issues.apache.org/jira/browse/SOLR-2223
> Project: Solr
>  Issue Type: Task
>Reporter: Grant Ingersoll
>Assignee: Grant Ingersoll
>Priority: Minor
>
> It would be useful for deployment purposes if we separated out the Solr site 
> that is non-release specific from the release specific content.  This would 
> make it easier to apply updates, etc. while still keeping release specific 
> info handy.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: svn commit: r1032995 - in /lucene/dev/trunk/solr/src/site/src/documentation/content/xdocs/images: solr.jpg solr_FC.eps

2010-11-09 Thread Robert Muir
On Tue, Nov 9, 2010 at 2:00 PM, Chris Hostetter
 wrote:
>
> : LUCENE-2746: add TM to some of the logos, don't have an editor for the 
> others
> :

also, thanks for making all these updates... updating the website is a hassle.

i was taking a look at lucene.apache.org and noticed the (TM) is a bit funky.
any objection to instead of Lucene (TM) doing Lucene™ ?

Using the trade mark sign (U+2122) looks better in my opinion; it's
very old in Unicode and even my telephone can display it.

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: svn commit: r1032995 - in /lucene/dev/trunk/solr/src/site/src/documentation/content/xdocs/images: solr.jpg solr_FC.eps

2010-11-09 Thread Grant Ingersoll
Ah, OK.  I just used my photo editor.  I'll fix, using the SVG.


On Nov 9, 2010, at 2:00 PM, Chris Hostetter wrote:

> 
> : LUCENE-2746: add TM to some of the logos, don't have an editor for the 
> others
> : 
> : Modified:
> : 
> lucene/dev/trunk/solr/src/site/src/documentation/content/xdocs/images/solr.jpg
> : 
> lucene/dev/trunk/solr/src/site/src/documentation/content/xdocs/images/solr_FC.eps
> 
> grant: I'm not sure how you edited the solr_FC.eps file, but you caused the 
> smoothness of the line curves to look horrific when the image is enlarged 
> (note the filesize dropped about 50%).
> 
> In general, the *.svg files are the "source" files that all of the other 
> files were generated from -- we should add the "TM" to the SVG files and 
> then generate the JPG and EPS files from that.
> 
> 
> -Hoss
> 
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
> 



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: svn commit: r1032995 - in /lucene/dev/trunk/solr/src/site/src/documentation/content/xdocs/images: solr.jpg solr_FC.eps

2010-11-09 Thread Chris Hostetter

: LUCENE-2746: add TM to some of the logos, don't have an editor for the others
: 
: Modified:
: 
lucene/dev/trunk/solr/src/site/src/documentation/content/xdocs/images/solr.jpg
: 
lucene/dev/trunk/solr/src/site/src/documentation/content/xdocs/images/solr_FC.eps

grant: I'm not sure how you edited the solr_FC.eps file, but you caused the 
smoothness of the line curves to look horrific when the image is enlarged 
(note the filesize dropped about 50%).

In general, the *.svg files are the "source" files that all of the other 
files were generated from -- we should add the "TM" to the SVG files and 
then generate the JPG and EPS files from that.


-Hoss

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Created: (SOLR-2225) CoreContainer#register should use checkDefault to normalize the core name

2010-11-09 Thread Mark Miller (JIRA)
CoreContainer#register should use checkDefault to normalize the core name
-

 Key: SOLR-2225
 URL: https://issues.apache.org/jira/browse/SOLR-2225
 Project: Solr
  Issue Type: Bug
  Components: multicore
Reporter: Mark Miller
Assignee: Mark Miller
Priority: Minor
 Fix For: 3.1, 4.0


fail case:

start with default collection set to collection1
remove core collection1
default collection on CoreContainer is still set to collection1
add core collection1
it doesn't act like the default core

We might do as the summary suggests, or, when the default core is removed, 
reset to no default core until one is again explicitly set.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2748) Convert all Lucene web properties to use the ASF CMS

2010-11-09 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930219#action_12930219
 ] 

Steven Rowe commented on LUCENE-2748:
-

Where will the subversion space for the site sources go?  Under 
{{repos/asf/lucene/new-site-dir/}}?

Will we still maintain versioned and unversioned content?

> Convert all Lucene web properties to use the ASF CMS
> 
>
> Key: LUCENE-2748
> URL: https://issues.apache.org/jira/browse/LUCENE-2748
> Project: Lucene - Java
>  Issue Type: Bug
>Reporter: Grant Ingersoll
>
> The new CMS has a lot of nice features (and some kinks to still work out) and 
> Forrest just doesn't cut it anymore, so we should move to the ASF CMS: 
> http://apache.org/dev/cms.html

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2729) Index corruption after 'read past EOF' under heavy update load and snapshot export

2010-11-09 Thread Nico Krijnen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930186#action_12930186
 ] 

Nico Krijnen commented on LUCENE-2729:
--

We really appreciate the help and understand that this is not the only thing 
you are working on ;-)
Collecting post-it notes sounds so familiar :)

We ran a test with your patch to throw a RuntimeException when an output 
already exists.
We did get a 'read past EOF', but the additional RuntimeException was never 
thrown.

We'll add the other log points and do another test run with those. If you have 
more suggestions for logging, let us know; we won't start the next run until 
tomorrow anyway...

> Index corruption after 'read past EOF' under heavy update load and snapshot 
> export
> --
>
> Key: LUCENE-2729
> URL: https://issues.apache.org/jira/browse/LUCENE-2729
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Affects Versions: 3.0.1, 3.0.2
> Environment: Happens on both OS X 10.6 and Windows 2008 Server. 
> Integrated with zoie (using a zoie snapshot from 2010-08-06: 
> zoie-2.0.0-snapshot-20100806.jar).
>Reporter: Nico Krijnen
> Attachments: 2010-11-02 IndexWriter infoStream log.zip, 
> LUCENE-2729-test1.patch
>
>
> We have a system running lucene and zoie. We use lucene as a content store 
> for a CMS/DAM system. We use the hot-backup feature of zoie to make scheduled 
> backups of the index. This works fine for small indexes and when there are 
> not a lot of changes to the index when the backup is made.
> On large indexes (about 5 GB to 19 GB), when a backup is made while the index 
> is being changed a lot (lots of document additions and/or deletions), we 
> almost always get a 'read past EOF' at some point, followed by lots of 'Lock 
> obtain timed out'.
> At that point we get lots of 0 kb files in the index, data gets lost, and the 
> index is unusable.
> When we stop our server, remove the 0kb files and restart our server, the 
> index is operational again, but data has been lost.
> I'm not sure if this is a zoie or a lucene issue, so i'm posting it to both. 
> Hopefully someone has some ideas where to look to fix this.
> Some more details...
> Stack trace of the read past EOF and following Lock obtain timed out:
> {code}
> 78307 [proj.zoie.impl.indexing.internal.realtimeindexdataloa...@31ca5085] 
> ERROR proj.zoie.impl.indexing.internal.BaseSearchIndex - read past EOF
> java.io.IOException: read past EOF
> at 
> org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:154)
> at 
> org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:39)
> at 
> org.apache.lucene.store.ChecksumIndexInput.readByte(ChecksumIndexInput.java:37)
> at org.apache.lucene.store.IndexInput.readInt(IndexInput.java:69)
> at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:245)
> at 
> org.apache.lucene.index.IndexFileDeleter.<init>(IndexFileDeleter.java:166)
> at 
> org.apache.lucene.index.DirectoryReader.doCommit(DirectoryReader.java:725)
> at org.apache.lucene.index.IndexReader.commit(IndexReader.java:987)
> at org.apache.lucene.index.IndexReader.commit(IndexReader.java:973)
> at org.apache.lucene.index.IndexReader.decRef(IndexReader.java:162)
> at org.apache.lucene.index.IndexReader.close(IndexReader.java:1003)
> at 
> proj.zoie.impl.indexing.internal.BaseSearchIndex.deleteDocs(BaseSearchIndex.java:203)
> at 
> proj.zoie.impl.indexing.internal.BaseSearchIndex.loadFromIndex(BaseSearchIndex.java:223)
> at 
> proj.zoie.impl.indexing.internal.LuceneIndexDataLoader.loadFromIndex(LuceneIndexDataLoader.java:153)
> at 
> proj.zoie.impl.indexing.internal.DiskLuceneIndexDataLoader.loadFromIndex(DiskLuceneIndexDataLoader.java:134)
> at 
> proj.zoie.impl.indexing.internal.RealtimeIndexDataLoader.processBatch(RealtimeIndexDataLoader.java:171)
> at 
> proj.zoie.impl.indexing.internal.BatchedIndexDataLoader$LoaderThread.run(BatchedIndexDataLoader.java:373)
> 579336 [proj.zoie.impl.indexing.internal.realtimeindexdataloa...@31ca5085] 
> ERROR proj.zoie.impl.indexing.internal.LuceneIndexDataLoader - 
> Problem copying segments: Lock obtain timed out: 
> org.apache.lucene.store.singleinstancel...@5ad0b895: write.lock
> org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: 
> org.apache.lucene.store.singleinstancel...@5ad0b895: write.lock
> at org.apache.lucene.store.Lock.obtain(Lock.java:84)
> at org.apache.lucene.index.IndexWriter.init(IndexWriter.java:1060)
> at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:957)
> at 
> proj.zoie.impl.indexing.internal.DiskSearchIndex.openIndexWriter(DiskSearchIndex.java:176)
> at 
> proj.zoie.impl.indexing

[jira] Commented: (SOLR-2223) Separate out "generic" Solr site from release specific content.

2010-11-09 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930178#action_12930178
 ] 

Grant Ingersoll commented on SOLR-2223:
---

Here's what I propose to do:

# Move main site to https://svn.apache.org/repos/asf/lucene/solr/site
# Keep tutorial and release content where it is
# Update the release packaging to bring in the non-release content as part of a 
release
# Fix the docs on how to deploy it.

> Separate out "generic" Solr site from release specific content.
> ---
>
> Key: SOLR-2223
> URL: https://issues.apache.org/jira/browse/SOLR-2223
> Project: Solr
>  Issue Type: Task
>Reporter: Grant Ingersoll
>Assignee: Grant Ingersoll
>Priority: Minor
>
> It would be useful for deployment purposes if we separated out the Solr site 
> that is non-release specific from the release specific content.  This would 
> make it easier to apply updates, etc. while still keeping release specific 
> info handy.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (SOLR-2223) Separate out "generic" Solr site from release specific content.

2010-11-09 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930173#action_12930173
 ] 

Grant Ingersoll commented on SOLR-2223:
---

The problem here is there is one and only one release specific doc.  It seems 
like a waste to move things around for one doc, but I suppose it is still worth 
it since there isn't a clean way to do it now.

> Separate out "generic" Solr site from release specific content.
> ---
>
> Key: SOLR-2223
> URL: https://issues.apache.org/jira/browse/SOLR-2223
> Project: Solr
>  Issue Type: Task
>Reporter: Grant Ingersoll
>Assignee: Grant Ingersoll
>Priority: Minor
>
> It would be useful for deployment purposes if we separated out the Solr site 
> that is non-release specific from the release specific content.  This would 
> make it easier to apply updates, etc. while still keeping release specific 
> info handy.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2729) Index corruption after 'read past EOF' under heavy update load and snapshot export

2010-11-09 Thread Nico Krijnen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930142#action_12930142
 ] 

Nico Krijnen commented on LUCENE-2729:
--

{quote}
...spooky "other" exceptions...
{quote}

These are all 'WARN' level and all of them are caused by non-critical timeouts in 
our code - all caused by the system being under the very heavy load needed to 
reproduce the bug.


{quote}
Would it be possible to instrument to Zoie code to note as the backup
process is copying each file in the snapshot, and at that point print
a listing of the directory?
{quote}

Will do, that is a good one. Then we know which files are being 'held' by the 
Zoie deletion policy for the backup.
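
Something along these lines should do it (a rough sketch, not actual Zoie code; 
the log4j logger and the exact hook point in the backup loop are placeholders):

{code}
import java.io.IOException;

import org.apache.log4j.Logger;
import org.apache.lucene.store.Directory;

// Hypothetical helper: dump every file name and length in the index directory.
// The idea is to call this each time the backup copies a file from the snapshot,
// so we can see which files the deletion policy is holding at that moment.
public final class IndexDirectoryLogger {

    private static final Logger log = Logger.getLogger(IndexDirectoryLogger.class);

    public static void logListing(Directory dir) throws IOException {
        for (String name : dir.listAll()) {
            log.info(name + " (" + dir.fileLength(name) + " bytes)");
        }
    }
}
{code}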


{quote}
Also, can you write to the log when Zoie applies deletes? (Looks like
it happens in proj.zoie.impl.indexing.internal.BaseSearchIndex.deleteDocs).
It's on applying deletes that the corruption is first detected, so, if
we log this event we can better bracket the period of time when the
corruption happened.
{quote}

Will do, but we also got the error while zoie was opening a new IndexWriter:

{code}
15:25:03,453 
[proj.zoie.impl.indexing.internal.realtimeindexdataloa...@3d9e7719] 
ERROR proj.zoie.impl.indexing.internal.LuceneIndexDataLoader - 
Problem copying segments: read past EOF
java.io.IOException: read past EOF
at 
org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:154)
at 
org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:39)
at 
org.apache.lucene.store.ChecksumIndexInput.readByte(ChecksumIndexInput.java:37)
at org.apache.lucene.store.IndexInput.readInt(IndexInput.java:69)
at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:245)
at 
org.apache.lucene.index.IndexFileDeleter.<init>(IndexFileDeleter.java:170)
at org.apache.lucene.index.IndexWriter.init(IndexWriter.java:1127)
at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:960)
at 
proj.zoie.impl.indexing.internal.DiskSearchIndex.openIndexWriter(DiskSearchIndex.java:176)
at 
proj.zoie.impl.indexing.internal.BaseSearchIndex.loadFromIndex(BaseSearchIndex.java:228)
at 
proj.zoie.impl.indexing.internal.LuceneIndexDataLoader.loadFromIndex(LuceneIndexDataLoader.java:153)
at 
proj.zoie.impl.indexing.internal.DiskLuceneIndexDataLoader.loadFromIndex(DiskLuceneIndexDataLoader.java:134)
at 
proj.zoie.impl.indexing.internal.RealtimeIndexDataLoader.processBatch(RealtimeIndexDataLoader.java:172)
at 
proj.zoie.impl.indexing.internal.BatchedIndexDataLoader$LoaderThread.run(BatchedIndexDataLoader.java:377)
{code}

I'll add a log there too.
My guess is that the 'read past EOF' is not really specific to applying 
deletes, but just happens whenever the SegmentInfos is loaded from a 0 kb file.


{quote}
Does Zoie ever open an IndexReader or IndexWriter passing in an
existing commit point? Or does it always open the latest commit?
{quote}

I'll try to find out.


{quote}
The timestamps on the zero length files are particularly spooky - the
earliest ones are 15:21 (when first EOF is hit), but then also 15:47
and 15:49 on the others. It seems like on 3 separate occasions
something truncated the files.
{quote}

Indeed, I thought this was weird too.

> Index corruption after 'read past EOF' under heavy update load and snapshot 
> export
> --
>
> Key: LUCENE-2729
> URL: https://issues.apache.org/jira/browse/LUCENE-2729
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Affects Versions: 3.0.1, 3.0.2
> Environment: Happens on both OS X 10.6 and Windows 2008 Server. 
> Integrated with zoie (using a zoie snapshot from 2010-08-06: 
> zoie-2.0.0-snapshot-20100806.jar).
>Reporter: Nico Krijnen
> Attachments: 2010-11-02 IndexWriter infoStream log.zip, 
> LUCENE-2729-test1.patch
>
>
> We have a system running lucene and zoie. We use lucene as a content store 
> for a CMS/DAM system. We use the hot-backup feature of zoie to make scheduled 
> backups of the index. This works fine for small indexes and when there are 
> not a lot of changes to the index when the backup is made.
> On large indexes (about 5 GB to 19 GB), when a backup is made while the index 
> is being changed a lot (lots of document additions and/or deletions), we 
> almost always get a 'read past EOF' at some point, followed by lots of 'Lock 
> obtain timed out'.
> At that point we get lots of 0 kb files in the index, data gets lost, and the 
> index is unusable.
> When we stop our server, remove the 0kb files and restart our server, the 
> index is operational again, but data has been lost.
> I'm not sure if this is a zoie or a lucene issue, so i'm posting it to both. 
> Hopefully someone has some ideas where to look to fix this.
> Some more details.

[jira] Commented: (SOLR-2052) Allow for a list of filter queries and a single docset filter in QueryComponent

2010-11-09 Thread Stephen Green (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930165#action_12930165
 ] 

Stephen Green commented on SOLR-2052:
-

Thanks for taking a look, Otis.  I'm in CA this week, but I should have a 
chance to fix the patch when I'm back home.

> Allow for a list of filter queries and a single docset filter in 
> QueryComponent
> ---
>
> Key: SOLR-2052
> URL: https://issues.apache.org/jira/browse/SOLR-2052
> Project: Solr
>  Issue Type: Improvement
>  Components: search
>Affects Versions: 4.0
> Environment: Mac OS X, Java 1.6
>Reporter: Stephen Green
>Priority: Minor
> Fix For: 1.4.2
>
> Attachments: SOLR-2052-2.patch, SOLR-2052.patch
>
>
> SolrIndexSearcher.QueryCommand allows you to specify a list of filter queries 
> or a single filter (as a DocSet), but not both.  This restriction seems 
> arbitrary, and there are cases where we can have both a list of filter 
> queries and a DocSet generated by some other non-query process (e.g., 
> filtering documents according to IDs pulled from some other source like a 
> database.)
> Fixing this requires a few small changes to SolrIndexSearcher to allow both 
> of these to be set for a QueryCommand and to take both into account when 
> evaluating the query.  It also requires a modification to ResponseBuilder to 
> allow setting the single filter at query time.
> I've run into this against 1.4, but the same holds true for the trunk.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2747) Deprecate/remove language-specific tokenizers in favor of StandardTokenizer

2010-11-09 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930137#action_12930137
 ] 

Robert Muir commented on LUCENE-2747:
-

DM, thanks, I see exactly where you are coming from.

I see your point: previously it was much easier to take something like 
SimpleAnalyzer and 'adapt' it to a given language based on things like Unicode 
properties.
In fact that's exactly what we did in the cases here (Arabic, Persian, Hindi, 
etc.).

But now we can actually tokenize "correctly" for more languages with jflex, 
thanks to its improved Unicode support, and it's superior to these previous 
hacks :)

To try to answer some of your questions (all my opinion):

bq. Is there a point to having SimpleAnalyzer

I guess so; a lot of people can use this if they have English-only content and 
are probably happy with discarding numbers etc... it's not a big loss to me if 
it goes, though.

bq. Shouldn't UAX29Tokenizer be moved to core? (What is core anyway?)

In trunk (the 4.x codeline) there is no core, contrib, or Solr for analyzer 
components any more; they are all combined into modules/analysis.
In branch_3x (the 3.x codeline) we did not make this rather disruptive refactor: 
there UAX29Tokenizer is in fact in Lucene core.

bq. Would there be a way to plugin ICUTokenizer as a replacement for 
UAX29Tokenizer into StandardTokenizer, such that all Analyzers using 
StandardTokenizer would get the alternate implementation?

Personally, I would prefer if we move towards a factory model where things like 
these supplied "language analyzers" are actually xml/json/properties snippets.
In other words, they are just example configurations that build your analyzer, 
like Solr does.
This is nice, because then you don't have to write code to customize how 
your analyzer works.

I think we have been making slow steps towards this, just doing basic things 
like moving stopword lists to .txt files.
But I think the next step would be LUCENE-2510, where we have factories/config 
attribute parsers for all these analysis components already written.

Then we could have support for declarative analyzer specification via 
xml/json/.properties/whatever, and move all these Analyzers to that.
I still think you should be able to code up your own analyzer, but in my 
opinion this is much easier and preferred for the ones we supply.

Also I think this would solve a lot of analyzer-backwards-compat problems, 
because then our supplied analyzers are really just configuration file examples, 
and we can change our examples however we want... someone can use their old 
config file (and hopefully an old analysis module jar file!) to guarantee 
the exact same behavior if they want.
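
For concreteness, a config snippet like that would essentially describe a filter 
chain that today has to be hand-coded, roughly like this (just a sketch against 
the 3.x-style APIs; class names and package locations are from memory and differ 
between branch_3x and trunk, so treat it as illustrative only):

{code}
import java.io.Reader;

import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.ReusableAnalyzerBase;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

// A declarative/factory-based spec would describe exactly this chain
// ("standard tokenizer + lowercase + stop") in xml/json/properties instead
// of requiring a user-written class like this one.
public final class MyEnglishAnalyzer extends ReusableAnalyzerBase {

  private final Version matchVersion;

  public MyEnglishAnalyzer(Version matchVersion) {
    this.matchVersion = matchVersion;
  }

  @Override
  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    // tokenizer
    Tokenizer source = new StandardTokenizer(matchVersion, reader);
    // filter chain
    TokenStream sink = new LowerCaseFilter(matchVersion, source);
    sink = new StopFilter(matchVersion, sink, StopAnalyzer.ENGLISH_STOP_WORDS_SET);
    return new TokenStreamComponents(source, sink);
  }
}
{code}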

Finally, most of the benefits of ICUTokenizer are actually in the UAX#29 
support... the tokenizers are pretty close, with some minor differences:
* the jflex-based implementation is faster, and better in my opinion.
* the ICU-based implementation allows tailoring, and supplies tailored 
tokenization for several complex scripts (jflex doesn't have this... yet).
* the ICU-based implementation works with all of Unicode; at the moment jflex 
is limited to the basic multilingual plane.

In my opinion the last 2 points will probably be resolved eventually... I could 
see our ICUTokenizer possibly becoming obsolete down the road 
thanks to better jflex support, though it would probably have to have hooks into 
ICU for the complex script support (so we get it for free from ICU).


> Deprecate/remove language-specific tokenizers in favor of StandardTokenizer
> ---
>
> Key: LUCENE-2747
> URL: https://issues.apache.org/jira/browse/LUCENE-2747
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Analysis
>Affects Versions: 3.1, 4.0
>Reporter: Steven Rowe
> Fix For: 3.1, 4.0
>
> Attachments: LUCENE-2747.patch, LUCENE-2747.patch
>
>
> As of Lucene 3.1, StandardTokenizer implements UAX#29 word boundary rules to 
> provide language-neutral tokenization.  Lucene contains several 
> language-specific tokenizers that should be replaced by UAX#29-based 
> StandardTokenizer (deprecated in 3.1 and removed in 4.0).  The 
> language-specific *analyzers*, by contrast, should remain, because they 
> contain language-specific post-tokenization filters.  The language-specific 
> analyzers should switch to StandardTokenizer in 3.1.
> Some usages of language-specific tokenizers will need additional work beyond 
> just replacing the tokenizer in the language-specific analyzer.  
> For example, PersianAnalyzer currently uses ArabicLetterTokenizer, and 
> depends on the fact that this tokenizer breaks tokens on the ZWNJ character 
> (zero-width non-joiner; U+200C), but in the UAX#29 word boundary rules, ZWNJ 
> is not a word boundary

Lucene-Solr-tests-only-3.x - Build # 1160 - Failure

2010-11-09 Thread Apache Hudson Server
Build: https://hudson.apache.org/hudson/job/Lucene-Solr-tests-only-3.x/1160/

1 tests failed.
REGRESSION:  org.apache.lucene.index.TestIndexWriter.testCommitThreadSafety

Error Message:
null

Stack Trace:
junit.framework.AssertionFailedError: 
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:779)
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:745)
at 
org.apache.lucene.index.TestIndexWriter.testCommitThreadSafety(TestIndexWriter.java:2435)




Build Log (for compile errors):
[...truncated 4525 lines...]



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



New ASF CMS

2010-11-09 Thread Grant Ingersoll
Gang,

I've asked infra to set up a space where we can kick the tires on the new ASF 
CMS.   See https://issues.apache.org/jira/browse/LUCENE-2748 and 
http://apache.org/dev/cms.html.

It seems like a real win, as we can produce content either with Markdown or via a 
WYSIWYG editor, and you get instant publication, more dynamic sites, etc.

-Grant
-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2729) Index corruption after 'read past EOF' under heavy update load and snapshot export

2010-11-09 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930130#action_12930130
 ] 

Michael McCandless commented on LUCENE-2729:



Thank you for attaching the IW infoStream output!  Sorry it took so
long for me to respond.

Aside: it is sad but there is no master TODO list in open source.  It
all comes down to our own email inboxes, todo lists, post-it notes all
over the place, etc., and (in my case anyway) things sometimes fall
past the event horizon.

So please if I don't respond in a day or two on an active issue, bump
it again (put a comment on the issue)!  I'd much rather people
over-nag than under-nag but unfortunately under-nag is far far more
common and it causes important issues to languish unnecessarily.

OK back to the issue :)

I looked through the infoStream but I don't see a smoking gun.  Ie,
the logs indicate that nowhere did Lucene try to delete/overwrite
those zero-length files; I see other files being deleted, so, this is
what I'd expect given that ZoieDeletionPolicy is presumably protecting
the segments_3t commit point (to back up its files).

I do see some spooky "other" exceptions, though... these are the first
2 exceptions I see in the log:

{noformat}
14:27:41,290 [bigIndexBuilder_QueueProcessor_3] WARN  
com.ds.acm.logic.impl.AssetManagerImpl -
  Ignoring AssetNotFoundException trying to make sure all metadata from index 
is loaded before updating an existing asset
Exception in thread "pool-5-thread-6" java.lang.NullPointerException
at 
org.apache.coyote.http11.InternalNioOutputBuffer.writeToSocket(InternalNioOutputBuffer.java:430)
at 
org.apache.coyote.http11.InternalNioOutputBuffer.flushBuffer(InternalNioOutputBuffer.java:784)
at 
org.apache.coyote.http11.InternalNioOutputBuffer.flush(InternalNioOutputBuffer.java:300)
at 
org.apache.coyote.http11.Http11NioProcessor.action(Http11NioProcessor.java:1060)
at org.apache.coyote.Response.action(Response.java:183)
at 
org.apache.catalina.connector.OutputBuffer.doFlush(OutputBuffer.java:314)
at 
org.apache.catalina.connector.OutputBuffer.flush(OutputBuffer.java:288)
at org.apache.catalina.connector.Response.flushBuffer(Response.java:548)
at 
org.apache.catalina.connector.ResponseFacade.flushBuffer(ResponseFacade.java:279)
at 
org.granite.gravity.AbstractChannel.runReceived(AbstractChannel.java:251)
at 
org.granite.gravity.AbstractChannel.runReceive(AbstractChannel.java:199)
at org.granite.gravity.AsyncReceiver.doRun(AsyncReceiver.java:34)
at 
org.granite.gravity.AsyncChannelRunner.run(AsyncChannelRunner.java:52)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:680)
{noformat}

and

{noformat}
14:40:18,382 [Low Memory Detector] WARN  
com.ds.acm.engine.search.zoieimpl.core.ZoieSystemManager -
  Ignoring timeout while attempting to flush zoie memory index to disk to free 
memory
proj.zoie.api.ZoieException: sync timed out
at 
proj.zoie.impl.indexing.AsyncDataConsumer.syncWthVersion(AsyncDataConsumer.java:177)
at 
proj.zoie.impl.indexing.AsyncDataConsumer.flushEvents(AsyncDataConsumer.java:155)
at proj.zoie.impl.indexing.ZoieSystem.flushEvents(ZoieSystem.java:308)
at 
com.ds.acm.engine.search.zoieimpl.core.ZoieSystemManager.onLowMemory(ZoieSystemManager.java:220)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at 
com.ds.util.event.BasicEventBroadcaster$Handler.invokeMethod(BasicEventBroadcaster.java:197)
at 
com.ds.util.event.BasicEventBroadcaster$Handler.handle(BasicEventBroadcaster.java:190)
at 
com.ds.util.event.BasicEventBroadcaster.fire(BasicEventBroadcaster.java:108)
at 
com.ds.util.cache.LowMemoryWarningBroadcaster$1.handleNotification(LowMemoryWarningBroadcaster.java:135)
at 
sun.management.NotificationEmitterSupport.sendNotification(NotificationEmitterSupport.java:138)
at sun.management.MemoryImpl.createNotification(MemoryImpl.java:171)
at 
sun.management.MemoryPoolImpl$PoolSensor.triggerAction(MemoryPoolImpl.java:272)
at sun.management.Sensor.trigger(Sensor.java:120)
{noformat}

That 2nd exception happens a total of 9 times... and is rather
spooky.  What does it mean?  Ie, why is Zoie timing out on flushing
the index to disk, and, what does it then do w/ its RAMDir?

I also see a lot of these:

{noformat}
15:50:18,856 [bigIndexBuilder_QueueProcessor_10] WARN

[jira] Commented: (LUCENE-2747) Deprecate/remove language-specific tokenizers in favor of StandardTokenizer

2010-11-09 Thread DM Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930119#action_12930119
 ] 

DM Smith commented on LUCENE-2747:
--

bq. DM, can you elaborate here?

I was a bit trigger happy with the comment. I should have looked at the code 
rather than the jira comments alone. The old StandardAnalyzer had a 
kitchen-sink approach to tokenization, trying to do too much with *modern* 
constructs, e.g. URLs, email addresses, acronyms... It and SimpleAnalyzer would 
produce about the same stream on "old" English and some other texts, but the 
StandardAnalyzer was much slower. (I don't remember how slow, but it was 
obvious.)

Both of these were weak when it came to non-English/non-Western texts. Thus I 
could take the language-specific tokenizers, lists of stop words, and stemmers 
and create variations of the SimpleAnalyzer that properly handled a particular 
language. (I created my own analyzers because I wanted to make stop words and 
stemming optional.)

In looking at the code in trunk (should have done that before making my 
comment), I see that UAX29Tokenizer is duplicated in the StandardAnalyzer's 
jflex and that ClassicAnalyzer is the old jflex. Also, the new StandardAnalyzer 
does a lot less.

If I understand the suggestion of this and the other 2 issues, StandardAnalyzer 
will no longer handle modern constructs. As I see it, this is what 
SimpleAnalyzer should be: based on UAX#29 and doing little else. Thus my 
confusion. Is there a point to having SimpleAnalyzer? Shouldn't UAX29Tokenizer 
be moved to core? (What is core anyway?)

And if I understand where this is going: Would there be a way to plugin 
ICUTokenizer as a replacement for UAX29Tokenizer into StandardTokenizer, such 
that all Analyzers using StandardTokenizer would get the alternate 
implementation?

> Deprecate/remove language-specific tokenizers in favor of StandardTokenizer
> ---
>
> Key: LUCENE-2747
> URL: https://issues.apache.org/jira/browse/LUCENE-2747
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Analysis
>Affects Versions: 3.1, 4.0
>Reporter: Steven Rowe
> Fix For: 3.1, 4.0
>
> Attachments: LUCENE-2747.patch, LUCENE-2747.patch
>
>
> As of Lucene 3.1, StandardTokenizer implements UAX#29 word boundary rules to 
> provide language-neutral tokenization.  Lucene contains several 
> language-specific tokenizers that should be replaced by UAX#29-based 
> StandardTokenizer (deprecated in 3.1 and removed in 4.0).  The 
> language-specific *analyzers*, by contrast, should remain, because they 
> contain language-specific post-tokenization filters.  The language-specific 
> analyzers should switch to StandardTokenizer in 3.1.
> Some usages of language-specific tokenizers will need additional work beyond 
> just replacing the tokenizer in the language-specific analyzer.  
> For example, PersianAnalyzer currently uses ArabicLetterTokenizer, and 
> depends on the fact that this tokenizer breaks tokens on the ZWNJ character 
> (zero-width non-joiner; U+200C), but in the UAX#29 word boundary rules, ZWNJ 
> is not a word boundary.  Robert Muir has suggested using a char filter 
> converting ZWNJ to spaces prior to StandardTokenizer in the converted 
> PersianAnalyzer.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2747) Deprecate/remove language-specific tokenizers in favor of StandardTokenizer

2010-11-09 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-2747:


Attachment: LUCENE-2747.patch

Here's an updated patch.

In reality the previous patch was a problem: because initReader() was in the 
TokenStream components,
it caused code duplication in any Analyzer, as it had to specify its 
CharFilter twice: once in createComponents for the initial Reader,
and once in the TokenStreamComponents implementation for reset(Reader).

So I moved this to just be a method of ReusableAnalyzerBase.

Also, I didn't apply the 'throws IOException'. After re-thinking, there is no 
need to do this.
None of our CharFilters, for example, throw IOExceptions in their ctors.
Even the Analyzer.tokenStream method cannot throw IOException.

We shouldn't add 'throws X exception' just because some arbitrary user class 
MIGHT throw it;
they might throw SQLException or InvalidMidiDataException too.
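
To give an idea of how the hook is meant to be used (a rough sketch only: the 
initReader signature is the one from this patch, and the class/package names 
are the branch_3x ones, so don't treat this as settled API), the PersianAnalyzer 
case from the issue description could wrap the incoming Reader once, in one place:

{code}
import java.io.Reader;

import org.apache.lucene.analysis.CharReader;
import org.apache.lucene.analysis.MappingCharFilter;
import org.apache.lucene.analysis.NormalizeCharMap;
import org.apache.lucene.analysis.ReusableAnalyzerBase;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

// Hypothetical analyzer built on the initReader() hook from this patch:
// map ZWNJ (U+200C) to a space before StandardTokenizer sees the text,
// without repeating the CharFilter in createComponents and reset(Reader).
public final class ZwnjMappingAnalyzer extends ReusableAnalyzerBase {

  private static final NormalizeCharMap ZWNJ_TO_SPACE = new NormalizeCharMap();
  static {
    ZWNJ_TO_SPACE.add("\u200C", " ");
  }

  private final Version matchVersion;

  public ZwnjMappingAnalyzer(Version matchVersion) {
    this.matchVersion = matchVersion;
  }

  @Override
  protected Reader initReader(Reader reader) {
    // applied to the initial Reader and on reset(Reader), in one place
    return new MappingCharFilter(ZWNJ_TO_SPACE, CharReader.get(reader));
  }

  @Override
  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    Tokenizer source = new StandardTokenizer(matchVersion, reader);
    return new TokenStreamComponents(source, source);
  }
}
{code}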

> Deprecate/remove language-specific tokenizers in favor of StandardTokenizer
> ---
>
> Key: LUCENE-2747
> URL: https://issues.apache.org/jira/browse/LUCENE-2747
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Analysis
>Affects Versions: 3.1, 4.0
>Reporter: Steven Rowe
> Fix For: 3.1, 4.0
>
> Attachments: LUCENE-2747.patch, LUCENE-2747.patch
>
>
> As of Lucene 3.1, StandardTokenizer implements UAX#29 word boundary rules to 
> provide language-neutral tokenization.  Lucene contains several 
> language-specific tokenizers that should be replaced by UAX#29-based 
> StandardTokenizer (deprecated in 3.1 and removed in 4.0).  The 
> language-specific *analyzers*, by contrast, should remain, because they 
> contain language-specific post-tokenization filters.  The language-specific 
> analyzers should switch to StandardTokenizer in 3.1.
> Some usages of language-specific tokenizers will need additional work beyond 
> just replacing the tokenizer in the language-specific analyzer.  
> For example, PersianAnalyzer currently uses ArabicLetterTokenizer, and 
> depends on the fact that this tokenizer breaks tokens on the ZWNJ character 
> (zero-width non-joiner; U+200C), but in the UAX#29 word boundary rules, ZWNJ 
> is not a word boundary.  Robert Muir has suggested using a char filter 
> converting ZWNJ to spaces prior to StandardTokenizer in the converted 
> PersianAnalyzer.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Lucene-Solr-tests-only-trunk - Build # 1175 - Failure

2010-11-09 Thread Apache Hudson Server
Build: https://hudson.apache.org/hudson/job/Lucene-Solr-tests-only-trunk/1175/

1 tests failed.
REGRESSION:  org.apache.solr.TestDistributedSearch.testDistribSearch

Error Message:
Error executing query

Stack Trace:
org.apache.solr.client.solrj.SolrServerException: Error executing query
at 
org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:95)
at org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:119)
at 
org.apache.solr.BaseDistributedSearchTestCase.queryServer(BaseDistributedSearchTestCase.java:290)
at 
org.apache.solr.BaseDistributedSearchTestCase.query(BaseDistributedSearchTestCase.java:305)
at 
org.apache.solr.TestDistributedSearch.doTest(TestDistributedSearch.java:203)
at 
org.apache.solr.BaseDistributedSearchTestCase.testDistribSearch(BaseDistributedSearchTestCase.java:568)
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:878)
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:844)
Caused by: org.apache.solr.common.SolrException: 
org.apache.solr.client.solrj.SolrServerException: No live SolrServers available 
to handle this request  org.apache.solr.common.SolrException: 
org.apache.solr.client.solrj.SolrServerException: No live SolrServers available 
to handle this request   at 
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:318)
at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1359)at 
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:337) 
 at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:240)
 at 
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1157)
  at 
org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:388) at 
org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182) at 
org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:765) at 
org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152) at 
org.mortbay.jetty.Server.handle(Server.java:326) at 
org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)  at 
org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:923)
  at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:547)  at 
org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212) at 
org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404) at 
org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:409)
 at 
org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582) 
Caused by: org.apache.solr.client.solrj.SolrServerException: No live 
SolrServers available to handle this request  at 
org.apache.solr.client.solrj.impl.LBHttpSolrServer.request(LBHttpSolrServer.java:297)
at 
org.apache.solr.handler.component.HttpCommComponent$1.call(SearchHandler.java:513)
   at 
org.apache.solr.handler.component.HttpCommComponent$1.call(SearchHandler.java:478)
   at java.util.concurrent

org.apache.solr.client.solrj.SolrServerException: No live SolrServers available 
to handle this request  org.apache.solr.common.SolrException: 
org.apache.solr.client.solrj.SolrServerException: No live SolrServers available 
to handle this requestat 
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:318)
at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1359)at 
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:337) 
 at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:240)
 at 
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1157)
  at 
org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:388) at 
org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182) at 
org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:765) at 
org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152) at 
org.mortbay.jetty.Server.handle(Server.java:326) at 
org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)  at 
org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:923)
  at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:547)  at 
org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212) at 
org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404) at 
org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:409)
 at 
org.mortbay.thread.QueuedThreadPool$PoolThread.run(Que

[jira] Commented: (LUCENE-2729) Index corruption after 'read past EOF' under heavy update load and snapshot export

2010-11-09 Thread Nico Krijnen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930094#action_12930094
 ] 

Nico Krijnen commented on LUCENE-2729:
--

Thx! We will update, patch and re-run the test.

> Index corruption after 'read past EOF' under heavy update load and snapshot 
> export
> --
>
> Key: LUCENE-2729
> URL: https://issues.apache.org/jira/browse/LUCENE-2729
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Affects Versions: 3.0.1, 3.0.2
> Environment: Happens on both OS X 10.6 and Windows 2008 Server. 
> Integrated with zoie (using a zoie snapshot from 2010-08-06: 
> zoie-2.0.0-snapshot-20100806.jar).
>Reporter: Nico Krijnen
> Attachments: 2010-11-02 IndexWriter infoStream log.zip, 
> LUCENE-2729-test1.patch
>
>
> We have a system running lucene and zoie. We use lucene as a content store 
> for a CMS/DAM system. We use the hot-backup feature of zoie to make scheduled 
> backups of the index. This works fine for small indexes and when there are 
> not a lot of changes to the index when the backup is made.
> On large indexes (about 5 GB to 19 GB), when a backup is made while the index 
> is being changed a lot (lots of document additions and/or deletions), we 
> almost always get a 'read past EOF' at some point, followed by lots of 'Lock 
> obtain timed out'.
> At that point we get lots of 0 kb files in the index, data gets lost, and the 
> index is unusable.
> When we stop our server, remove the 0kb files and restart our server, the 
> index is operational again, but data has been lost.
> I'm not sure if this is a zoie or a lucene issue, so i'm posting it to both. 
> Hopefully someone has some ideas where to look to fix this.
> Some more details...
> Stack trace of the read past EOF and following Lock obtain timed out:
> {code}
> 78307 [proj.zoie.impl.indexing.internal.realtimeindexdataloa...@31ca5085] 
> ERROR proj.zoie.impl.indexing.internal.BaseSearchIndex - read past EOF
> java.io.IOException: read past EOF
> at 
> org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:154)
> at 
> org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:39)
> at 
> org.apache.lucene.store.ChecksumIndexInput.readByte(ChecksumIndexInput.java:37)
> at org.apache.lucene.store.IndexInput.readInt(IndexInput.java:69)
> at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:245)
> at 
> org.apache.lucene.index.IndexFileDeleter.(IndexFileDeleter.java:166)
> at 
> org.apache.lucene.index.DirectoryReader.doCommit(DirectoryReader.java:725)
> at org.apache.lucene.index.IndexReader.commit(IndexReader.java:987)
> at org.apache.lucene.index.IndexReader.commit(IndexReader.java:973)
> at org.apache.lucene.index.IndexReader.decRef(IndexReader.java:162)
> at org.apache.lucene.index.IndexReader.close(IndexReader.java:1003)
> at 
> proj.zoie.impl.indexing.internal.BaseSearchIndex.deleteDocs(BaseSearchIndex.java:203)
> at 
> proj.zoie.impl.indexing.internal.BaseSearchIndex.loadFromIndex(BaseSearchIndex.java:223)
> at 
> proj.zoie.impl.indexing.internal.LuceneIndexDataLoader.loadFromIndex(LuceneIndexDataLoader.java:153)
> at 
> proj.zoie.impl.indexing.internal.DiskLuceneIndexDataLoader.loadFromIndex(DiskLuceneIndexDataLoader.java:134)
> at 
> proj.zoie.impl.indexing.internal.RealtimeIndexDataLoader.processBatch(RealtimeIndexDataLoader.java:171)
> at 
> proj.zoie.impl.indexing.internal.BatchedIndexDataLoader$LoaderThread.run(BatchedIndexDataLoader.java:373)
> 579336 [proj.zoie.impl.indexing.internal.realtimeindexdataloa...@31ca5085] 
> ERROR proj.zoie.impl.indexing.internal.LuceneIndexDataLoader - 
> Problem copying segments: Lock obtain timed out: 
> org.apache.lucene.store.singleinstancel...@5ad0b895: write.lock
> org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: 
> org.apache.lucene.store.singleinstancel...@5ad0b895: write.lock
> at org.apache.lucene.store.Lock.obtain(Lock.java:84)
> at org.apache.lucene.index.IndexWriter.init(IndexWriter.java:1060)
> at org.apache.lucene.index.IndexWriter.(IndexWriter.java:957)
> at 
> proj.zoie.impl.indexing.internal.DiskSearchIndex.openIndexWriter(DiskSearchIndex.java:176)
> at 
> proj.zoie.impl.indexing.internal.BaseSearchIndex.loadFromIndex(BaseSearchIndex.java:228)
> at 
> proj.zoie.impl.indexing.internal.LuceneIndexDataLoader.loadFromIndex(LuceneIndexDataLoader.java:153)
> at 
> proj.zoie.impl.indexing.internal.DiskLuceneIndexDataLoader.loadFromIndex(DiskLuceneIndexDataLoader.java:134)
> at 
> proj.zoie.impl.indexing.internal.RealtimeIndexDataLoader.processBatch(RealtimeIndexDataLoader.java:171)
> at 
> proj.zoie.impl.indexing.in

[jira] Commented: (LUCENE-2729) Index corruption after 'read past EOF' under heavy update load and snapshot export

2010-11-09 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930092#action_12930092
 ] 

Simon Willnauer commented on LUCENE-2729:
-

Ah, I think I have been on the wrong branch, never mind!

> Index corruption after 'read past EOF' under heavy update load and snapshot 
> export
> --
>
> Key: LUCENE-2729
> URL: https://issues.apache.org/jira/browse/LUCENE-2729
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Affects Versions: 3.0.1, 3.0.2
> Environment: Happens on both OS X 10.6 and Windows 2008 Server. 
> Integrated with zoie (using a zoie snapshot from 2010-08-06: 
> zoie-2.0.0-snapshot-20100806.jar).
>Reporter: Nico Krijnen
> Attachments: 2010-11-02 IndexWriter infoStream log.zip, 
> LUCENE-2729-test1.patch
>
>
> We have a system running lucene and zoie. We use lucene as a content store 
> for a CMS/DAM system. We use the hot-backup feature of zoie to make scheduled 
> backups of the index. This works fine for small indexes and when there are 
> not a lot of changes to the index when the backup is made.
> On large indexes (about 5 GB to 19 GB), when a backup is made while the index 
> is being changed a lot (lots of document additions and/or deletions), we 
> almost always get a 'read past EOF' at some point, followed by lots of 'Lock 
> obtain timed out'.
> At that point we get lots of 0 kb files in the index, data gets lost, and the 
> index is unusable.
> When we stop our server, remove the 0kb files and restart our server, the 
> index is operational again, but data has been lost.
> I'm not sure if this is a zoie or a lucene issue, so i'm posting it to both. 
> Hopefully someone has some ideas where to look to fix this.
> Some more details...
> Stack trace of the read past EOF and following Lock obtain timed out:
> {code}
> 78307 [proj.zoie.impl.indexing.internal.realtimeindexdataloa...@31ca5085] 
> ERROR proj.zoie.impl.indexing.internal.BaseSearchIndex - read past EOF
> java.io.IOException: read past EOF
> at 
> org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:154)
> at 
> org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:39)
> at 
> org.apache.lucene.store.ChecksumIndexInput.readByte(ChecksumIndexInput.java:37)
> at org.apache.lucene.store.IndexInput.readInt(IndexInput.java:69)
> at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:245)
> at 
> org.apache.lucene.index.IndexFileDeleter.(IndexFileDeleter.java:166)
> at 
> org.apache.lucene.index.DirectoryReader.doCommit(DirectoryReader.java:725)
> at org.apache.lucene.index.IndexReader.commit(IndexReader.java:987)
> at org.apache.lucene.index.IndexReader.commit(IndexReader.java:973)
> at org.apache.lucene.index.IndexReader.decRef(IndexReader.java:162)
> at org.apache.lucene.index.IndexReader.close(IndexReader.java:1003)
> at 
> proj.zoie.impl.indexing.internal.BaseSearchIndex.deleteDocs(BaseSearchIndex.java:203)
> at 
> proj.zoie.impl.indexing.internal.BaseSearchIndex.loadFromIndex(BaseSearchIndex.java:223)
> at 
> proj.zoie.impl.indexing.internal.LuceneIndexDataLoader.loadFromIndex(LuceneIndexDataLoader.java:153)
> at 
> proj.zoie.impl.indexing.internal.DiskLuceneIndexDataLoader.loadFromIndex(DiskLuceneIndexDataLoader.java:134)
> at 
> proj.zoie.impl.indexing.internal.RealtimeIndexDataLoader.processBatch(RealtimeIndexDataLoader.java:171)
> at 
> proj.zoie.impl.indexing.internal.BatchedIndexDataLoader$LoaderThread.run(BatchedIndexDataLoader.java:373)
> 579336 [proj.zoie.impl.indexing.internal.realtimeindexdataloa...@31ca5085] 
> ERROR proj.zoie.impl.indexing.internal.LuceneIndexDataLoader - 
> Problem copying segments: Lock obtain timed out: 
> org.apache.lucene.store.singleinstancel...@5ad0b895: write.lock
> org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: 
> org.apache.lucene.store.singleinstancel...@5ad0b895: write.lock
> at org.apache.lucene.store.Lock.obtain(Lock.java:84)
> at org.apache.lucene.index.IndexWriter.init(IndexWriter.java:1060)
> at org.apache.lucene.index.IndexWriter.(IndexWriter.java:957)
> at 
> proj.zoie.impl.indexing.internal.DiskSearchIndex.openIndexWriter(DiskSearchIndex.java:176)
> at 
> proj.zoie.impl.indexing.internal.BaseSearchIndex.loadFromIndex(BaseSearchIndex.java:228)
> at 
> proj.zoie.impl.indexing.internal.LuceneIndexDataLoader.loadFromIndex(LuceneIndexDataLoader.java:153)
> at 
> proj.zoie.impl.indexing.internal.DiskLuceneIndexDataLoader.loadFromIndex(DiskLuceneIndexDataLoader.java:134)
> at 
> proj.zoie.impl.indexing.internal.RealtimeIndexDataLoader.processBatch(RealtimeIndexDataLoader.java:171)
> at 
> proj.zoie.impl

[jira] Commented: (LUCENE-2729) Index corruption after 'read past EOF' under heavy update load and snapshot export

2010-11-09 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930091#action_12930091
 ] 

Simon Willnauer commented on LUCENE-2729:
-

bq. at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:245)
I just had a quick look at it and I wonder which revision you used for that? 
SegmentInfos.java does not contain a read call at that line since [revision 892992 | 
https://svn.apache.org/viewvc/lucene/dev/branches/branch_3x/lucene/src/java/org/apache/lucene/index/SegmentInfos.java?revision=892992&view=markup]

Can you clarify?

> Index corruption after 'read past EOF' under heavy update load and snapshot 
> export
> --
>
> Key: LUCENE-2729
> URL: https://issues.apache.org/jira/browse/LUCENE-2729
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Affects Versions: 3.0.1, 3.0.2
> Environment: Happens on both OS X 10.6 and Windows 2008 Server. 
> Integrated with zoie (using a zoie snapshot from 2010-08-06: 
> zoie-2.0.0-snapshot-20100806.jar).
>Reporter: Nico Krijnen
> Attachments: 2010-11-02 IndexWriter infoStream log.zip, 
> LUCENE-2729-test1.patch
>
>
> We have a system running lucene and zoie. We use lucene as a content store 
> for a CMS/DAM system. We use the hot-backup feature of zoie to make scheduled 
> backups of the index. This works fine for small indexes and when there are 
> not a lot of changes to the index when the backup is made.
> On large indexes (about 5 GB to 19 GB), when a backup is made while the index 
> is being changed a lot (lots of document additions and/or deletions), we 
> almost always get a 'read past EOF' at some point, followed by lots of 'Lock 
> obtain timed out'.
> At that point we get lots of 0 kb files in the index, data gets lost, and the 
> index is unusable.
> When we stop our server, remove the 0kb files and restart our server, the 
> index is operational again, but data has been lost.
> I'm not sure if this is a zoie or a lucene issue, so i'm posting it to both. 
> Hopefully someone has some ideas where to look to fix this.
> Some more details...
> Stack trace of the read past EOF and following Lock obtain timed out:
> {code}
> 78307 [proj.zoie.impl.indexing.internal.realtimeindexdataloa...@31ca5085] 
> ERROR proj.zoie.impl.indexing.internal.BaseSearchIndex - read past EOF
> java.io.IOException: read past EOF
> at 
> org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:154)
> at 
> org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:39)
> at 
> org.apache.lucene.store.ChecksumIndexInput.readByte(ChecksumIndexInput.java:37)
> at org.apache.lucene.store.IndexInput.readInt(IndexInput.java:69)
> at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:245)
> at 
> org.apache.lucene.index.IndexFileDeleter.(IndexFileDeleter.java:166)
> at 
> org.apache.lucene.index.DirectoryReader.doCommit(DirectoryReader.java:725)
> at org.apache.lucene.index.IndexReader.commit(IndexReader.java:987)
> at org.apache.lucene.index.IndexReader.commit(IndexReader.java:973)
> at org.apache.lucene.index.IndexReader.decRef(IndexReader.java:162)
> at org.apache.lucene.index.IndexReader.close(IndexReader.java:1003)
> at 
> proj.zoie.impl.indexing.internal.BaseSearchIndex.deleteDocs(BaseSearchIndex.java:203)
> at 
> proj.zoie.impl.indexing.internal.BaseSearchIndex.loadFromIndex(BaseSearchIndex.java:223)
> at 
> proj.zoie.impl.indexing.internal.LuceneIndexDataLoader.loadFromIndex(LuceneIndexDataLoader.java:153)
> at 
> proj.zoie.impl.indexing.internal.DiskLuceneIndexDataLoader.loadFromIndex(DiskLuceneIndexDataLoader.java:134)
> at 
> proj.zoie.impl.indexing.internal.RealtimeIndexDataLoader.processBatch(RealtimeIndexDataLoader.java:171)
> at 
> proj.zoie.impl.indexing.internal.BatchedIndexDataLoader$LoaderThread.run(BatchedIndexDataLoader.java:373)
> 579336 [proj.zoie.impl.indexing.internal.realtimeindexdataloa...@31ca5085] 
> ERROR proj.zoie.impl.indexing.internal.LuceneIndexDataLoader - 
> Problem copying segments: Lock obtain timed out: 
> org.apache.lucene.store.singleinstancel...@5ad0b895: write.lock
> org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: 
> org.apache.lucene.store.singleinstancel...@5ad0b895: write.lock
> at org.apache.lucene.store.Lock.obtain(Lock.java:84)
> at org.apache.lucene.index.IndexWriter.init(IndexWriter.java:1060)
> at org.apache.lucene.index.IndexWriter.(IndexWriter.java:957)
> at 
> proj.zoie.impl.indexing.internal.DiskSearchIndex.openIndexWriter(DiskSearchIndex.java:176)
> at 
> proj.zoie.impl.indexing.internal.BaseSearchIndex.loadFromIndex(BaseSearchIndex.java:228)
> at 
> proj.zoie.impl.indexing.inte

[jira] Commented: (LUCENE-2747) Deprecate/remove language-specific tokenizers in favor of StandardTokenizer

2010-11-09 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930090#action_12930090
 ] 

Robert Muir commented on LUCENE-2747:
-

bq. I'm not too keen on this. For classics and ancient texts the standard 
analyzer is not as good as the simple analyzer.

DM, can you elaborate here? 

Are you speaking of the existing StandardAnalyzer in previous releases, which 
doesn't properly deal with tokenizing diacritics, etc.?
This is the reason these "special" tokenizers exist: to work around those bugs.
But StandardTokenizer now handles this stuff fine, and they are obsolete.

I'm confused, though, how in previous releases SimpleAnalyzer would ever be any 
better, since it would barf on these diacritics too:
it only emits tokens that are runs of Character.isLetter.

Or is there something else I'm missing here?


> Deprecate/remove language-specific tokenizers in favor of StandardTokenizer
> ---
>
> Key: LUCENE-2747
> URL: https://issues.apache.org/jira/browse/LUCENE-2747
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Analysis
>Affects Versions: 3.1, 4.0
>Reporter: Steven Rowe
> Fix For: 3.1, 4.0
>
> Attachments: LUCENE-2747.patch
>
>
> As of Lucene 3.1, StandardTokenizer implements UAX#29 word boundary rules to 
> provide language-neutral tokenization.  Lucene contains several 
> language-specific tokenizers that should be replaced by UAX#29-based 
> StandardTokenizer (deprecated in 3.1 and removed in 4.0).  The 
> language-specific *analyzers*, by contrast, should remain, because they 
> contain language-specific post-tokenization filters.  The language-specific 
> analyzers should switch to StandardTokenizer in 3.1.
> Some usages of language-specific tokenizers will need additional work beyond 
> just replacing the tokenizer in the language-specific analyzer.  
> For example, PersianAnalyzer currently uses ArabicLetterTokenizer, and 
> depends on the fact that this tokenizer breaks tokens on the ZWNJ character 
> (zero-width non-joiner; U+200C), but in the UAX#29 word boundary rules, ZWNJ 
> is not a word boundary.  Robert Muir has suggested using a char filter 
> converting ZWNJ to spaces prior to StandardTokenizer in the converted 
> PersianAnalyzer.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Lucene-Solr-tests-only-trunk - Build # 1170 - Failure

2010-11-09 Thread Apache Hudson Server
Build: https://hudson.apache.org/hudson/job/Lucene-Solr-tests-only-trunk/1170/

1 tests failed.
REGRESSION:  
org.apache.solr.handler.component.DistributedTermsComponentTest.testDistribSearch

Error Message:
Some threads threw uncaught exceptions!

Stack Trace:
junit.framework.AssertionFailedError: Some threads threw uncaught exceptions!
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:878)
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:844)
at 
org.apache.lucene.util.LuceneTestCase.tearDown(LuceneTestCase.java:437)
at org.apache.solr.SolrTestCaseJ4.tearDown(SolrTestCaseJ4.java:78)
at 
org.apache.solr.BaseDistributedSearchTestCase.tearDown(BaseDistributedSearchTestCase.java:144)




Build Log (for compile errors):
[...truncated 8776 lines...]



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2729) Index corruption after 'read past EOF' under heavy update load and snapshot export

2010-11-09 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-2729:
---

Attachment: LUCENE-2729-test1.patch


First, just to rule out any already-fixed-but-not-yet-released issues,
can you update your Lucene JAR to the tip of the 3.0.x branch?  Ie do
this:

{noformat}
  svn checkout https://svn.apache.org/repos/asf/lucene/java/branches/lucene_3_0 
30x
  cd 30x
  ant jar
{noformat}

And then copy build/lucene-core-3.0.3-dev.jar to your CLASSPATH
(replacing old Lucene JAR).

Second, can you apply the patch I just attached
(LUCENE-2729-test1.patch) and then make this corruption happen again?
That patch throws an exception if we ever try to call
SimpleFSDir.createOutput on a file that already exists.  Lucene should
never do this under non-exceptional situations, yet somehow it looks
like it may be doing so (given all your 0-length files).
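
(To be clear, the attached patch just changes the Lucene sources directly; purely 
as an illustration of the kind of guard it adds, the check is shaped like this:)

{code}
import java.io.File;
import java.io.IOException;

import org.apache.lucene.store.IndexOutput;
import org.apache.lucene.store.SimpleFSDirectory;

// Illustration only, not the attached patch: fail loudly if createOutput()
// is asked to create a file that already exists, which Lucene should never
// do in non-exceptional situations.
public class CreateOutputCheckingDirectory extends SimpleFSDirectory {

  public CreateOutputCheckingDirectory(File path) throws IOException {
    super(path);
  }

  @Override
  public IndexOutput createOutput(String name) throws IOException {
    File existing = new File(getDirectory(), name);
    if (existing.exists()) {
      throw new IOException("file \"" + name + "\" already exists (length=" + existing.length() + ")");
    }
    return super.createOutput(name);
  }
}
{code}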


> Index corruption after 'read past EOF' under heavy update load and snapshot 
> export
> --
>
> Key: LUCENE-2729
> URL: https://issues.apache.org/jira/browse/LUCENE-2729
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Affects Versions: 3.0.1, 3.0.2
> Environment: Happens on both OS X 10.6 and Windows 2008 Server. 
> Integrated with zoie (using a zoie snapshot from 2010-08-06: 
> zoie-2.0.0-snapshot-20100806.jar).
>Reporter: Nico Krijnen
> Attachments: 2010-11-02 IndexWriter infoStream log.zip, 
> LUCENE-2729-test1.patch
>
>
> We have a system running lucene and zoie. We use lucene as a content store 
> for a CMS/DAM system. We use the hot-backup feature of zoie to make scheduled 
> backups of the index. This works fine for small indexes and when there are 
> not a lot of changes to the index when the backup is made.
> On large indexes (about 5 GB to 19 GB), when a backup is made while the index 
> is being changed a lot (lots of document additions and/or deletions), we 
> almost always get a 'read past EOF' at some point, followed by lots of 'Lock 
> obtain timed out'.
> At that point we get lots of 0 kb files in the index, data gets lost, and the 
> index is unusable.
> When we stop our server, remove the 0kb files and restart our server, the 
> index is operational again, but data has been lost.
> I'm not sure if this is a zoie or a lucene issue, so i'm posting it to both. 
> Hopefully someone has some ideas where to look to fix this.
> Some more details...
> Stack trace of the read past EOF and following Lock obtain timed out:
> {code}
> 78307 [proj.zoie.impl.indexing.internal.realtimeindexdataloa...@31ca5085] 
> ERROR proj.zoie.impl.indexing.internal.BaseSearchIndex - read past EOF
> java.io.IOException: read past EOF
> at 
> org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:154)
> at 
> org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:39)
> at 
> org.apache.lucene.store.ChecksumIndexInput.readByte(ChecksumIndexInput.java:37)
> at org.apache.lucene.store.IndexInput.readInt(IndexInput.java:69)
> at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:245)
> at 
> org.apache.lucene.index.IndexFileDeleter.(IndexFileDeleter.java:166)
> at 
> org.apache.lucene.index.DirectoryReader.doCommit(DirectoryReader.java:725)
> at org.apache.lucene.index.IndexReader.commit(IndexReader.java:987)
> at org.apache.lucene.index.IndexReader.commit(IndexReader.java:973)
> at org.apache.lucene.index.IndexReader.decRef(IndexReader.java:162)
> at org.apache.lucene.index.IndexReader.close(IndexReader.java:1003)
> at 
> proj.zoie.impl.indexing.internal.BaseSearchIndex.deleteDocs(BaseSearchIndex.java:203)
> at 
> proj.zoie.impl.indexing.internal.BaseSearchIndex.loadFromIndex(BaseSearchIndex.java:223)
> at 
> proj.zoie.impl.indexing.internal.LuceneIndexDataLoader.loadFromIndex(LuceneIndexDataLoader.java:153)
> at 
> proj.zoie.impl.indexing.internal.DiskLuceneIndexDataLoader.loadFromIndex(DiskLuceneIndexDataLoader.java:134)
> at 
> proj.zoie.impl.indexing.internal.RealtimeIndexDataLoader.processBatch(RealtimeIndexDataLoader.java:171)
> at 
> proj.zoie.impl.indexing.internal.BatchedIndexDataLoader$LoaderThread.run(BatchedIndexDataLoader.java:373)
> 579336 [proj.zoie.impl.indexing.internal.realtimeindexdataloa...@31ca5085] 
> ERROR proj.zoie.impl.indexing.internal.LuceneIndexDataLoader - 
> Problem copying segments: Lock obtain timed out: 
> org.apache.lucene.store.singleinstancel...@5ad0b895: write.lock
> org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: 
> org.apache.lucene.store.singleinstancel...@5ad0b895: write.lock
> at org.apache.lucene.store.Lock.obtain(Lock.java:84)
> at org.apache.lucene.index.IndexWriter.init(Index

[jira] Commented: (LUCENE-2729) Index corruption after 'read past EOF' under heavy update load and snapshot export

2010-11-09 Thread Nico Krijnen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930086#action_12930086
 ] 

Nico Krijnen commented on LUCENE-2729:
--

Any ideas on what could be happening? It sounds like IndexWriter is the only 
one that is modifying these files; Zoie only seems to be reading from them to 
make the backup. What should we look for in the IndexWriter's infoStream?

> Index corruption after 'read past EOF' under heavy update load and snapshot 
> export
> --
>
> Key: LUCENE-2729
> URL: https://issues.apache.org/jira/browse/LUCENE-2729
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Affects Versions: 3.0.1, 3.0.2
> Environment: Happens on both OS X 10.6 and Windows 2008 Server. 
> Integrated with zoie (using a zoie snapshot from 2010-08-06: 
> zoie-2.0.0-snapshot-20100806.jar).
>Reporter: Nico Krijnen
> Attachments: 2010-11-02 IndexWriter infoStream log.zip
>
>
> We have a system running lucene and zoie. We use lucene as a content store 
> for a CMS/DAM system. We use the hot-backup feature of zoie to make scheduled 
> backups of the index. This works fine for small indexes and when there are 
> not a lot of changes to the index when the backup is made.
> On large indexes (about 5 GB to 19 GB), when a backup is made while the index 
> is being changed a lot (lots of document additions and/or deletions), we 
> almost always get a 'read past EOF' at some point, followed by lots of 'Lock 
> obtain timed out'.
> At that point we get lots of 0 kb files in the index, data gets lost, and the 
> index is unusable.
> When we stop our server, remove the 0kb files and restart our server, the 
> index is operational again, but data has been lost.
> I'm not sure if this is a zoie or a lucene issue, so i'm posting it to both. 
> Hopefully someone has some ideas where to look to fix this.
> Some more details...
> Stack trace of the read past EOF and following Lock obtain timed out:
> {code}
> 78307 [proj.zoie.impl.indexing.internal.realtimeindexdataloa...@31ca5085] 
> ERROR proj.zoie.impl.indexing.internal.BaseSearchIndex - read past EOF
> java.io.IOException: read past EOF
> at 
> org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:154)
> at 
> org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:39)
> at 
> org.apache.lucene.store.ChecksumIndexInput.readByte(ChecksumIndexInput.java:37)
> at org.apache.lucene.store.IndexInput.readInt(IndexInput.java:69)
> at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:245)
> at 
> org.apache.lucene.index.IndexFileDeleter.(IndexFileDeleter.java:166)
> at 
> org.apache.lucene.index.DirectoryReader.doCommit(DirectoryReader.java:725)
> at org.apache.lucene.index.IndexReader.commit(IndexReader.java:987)
> at org.apache.lucene.index.IndexReader.commit(IndexReader.java:973)
> at org.apache.lucene.index.IndexReader.decRef(IndexReader.java:162)
> at org.apache.lucene.index.IndexReader.close(IndexReader.java:1003)
> at 
> proj.zoie.impl.indexing.internal.BaseSearchIndex.deleteDocs(BaseSearchIndex.java:203)
> at 
> proj.zoie.impl.indexing.internal.BaseSearchIndex.loadFromIndex(BaseSearchIndex.java:223)
> at 
> proj.zoie.impl.indexing.internal.LuceneIndexDataLoader.loadFromIndex(LuceneIndexDataLoader.java:153)
> at 
> proj.zoie.impl.indexing.internal.DiskLuceneIndexDataLoader.loadFromIndex(DiskLuceneIndexDataLoader.java:134)
> at 
> proj.zoie.impl.indexing.internal.RealtimeIndexDataLoader.processBatch(RealtimeIndexDataLoader.java:171)
> at 
> proj.zoie.impl.indexing.internal.BatchedIndexDataLoader$LoaderThread.run(BatchedIndexDataLoader.java:373)
> 579336 [proj.zoie.impl.indexing.internal.realtimeindexdataloa...@31ca5085] 
> ERROR proj.zoie.impl.indexing.internal.LuceneIndexDataLoader - 
> Problem copying segments: Lock obtain timed out: 
> org.apache.lucene.store.singleinstancel...@5ad0b895: write.lock
> org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: 
> org.apache.lucene.store.singleinstancel...@5ad0b895: write.lock
> at org.apache.lucene.store.Lock.obtain(Lock.java:84)
> at org.apache.lucene.index.IndexWriter.init(IndexWriter.java:1060)
> at org.apache.lucene.index.IndexWriter.(IndexWriter.java:957)
> at 
> proj.zoie.impl.indexing.internal.DiskSearchIndex.openIndexWriter(DiskSearchIndex.java:176)
> at 
> proj.zoie.impl.indexing.internal.BaseSearchIndex.loadFromIndex(BaseSearchIndex.java:228)
> at 
> proj.zoie.impl.indexing.internal.LuceneIndexDataLoader.loadFromIndex(LuceneIndexDataLoader.java:153)
> at 
> proj.zoie.impl.indexing.internal.DiskLuceneIndexDataLoader.loadFromIndex(DiskLuceneIndexDataLoader.java:134)

[jira] Commented: (LUCENE-2747) Deprecate/remove language-specific tokenizers in favor of StandardTokenizer

2010-11-09 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930075#action_12930075
 ] 

Simon Willnauer commented on LUCENE-2747:
-

bq. ...alternatively we could give this a different name, wrapReader or 
something... not sure, i didnt have any better ideas than charStream.
wrapReader seems to be too specific; what about initReader?

> Deprecate/remove language-specific tokenizers in favor of StandardTokenizer
> ---
>
> Key: LUCENE-2747
> URL: https://issues.apache.org/jira/browse/LUCENE-2747
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Analysis
>Affects Versions: 3.1, 4.0
>Reporter: Steven Rowe
> Fix For: 3.1, 4.0
>
> Attachments: LUCENE-2747.patch
>
>
> As of Lucene 3.1, StandardTokenizer implements UAX#29 word boundary rules to 
> provide language-neutral tokenization.  Lucene contains several 
> language-specific tokenizers that should be replaced by UAX#29-based 
> StandardTokenizer (deprecated in 3.1 and removed in 4.0).  The 
> language-specific *analyzers*, by contrast, should remain, because they 
> contain language-specific post-tokenization filters.  The language-specific 
> analyzers should switch to StandardTokenizer in 3.1.
> Some usages of language-specific tokenizers will need additional work beyond 
> just replacing the tokenizer in the language-specific analyzer.  
> For example, PersianAnalyzer currently uses ArabicLetterTokenizer, and 
> depends on the fact that this tokenizer breaks tokens on the ZWNJ character 
> (zero-width non-joiner; U+200C), but in the UAX#29 word boundary rules, ZWNJ 
> is not a word boundary.  Robert Muir has suggested using a char filter 
> converting ZWNJ to spaces prior to StandardTokenizer in the converted 
> PersianAnalyzer.
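
As an editorial aside, a minimal sketch of that char-filter approach, assuming 
3.x-era package locations and the pre-4.x NormalizeCharMap API (the committed 
PersianAnalyzer may wire this up differently):

{code}
import java.io.Reader;
import java.io.StringReader;

import org.apache.lucene.analysis.CharReader;
import org.apache.lucene.analysis.CharStream;
import org.apache.lucene.analysis.MappingCharFilter;
import org.apache.lucene.analysis.NormalizeCharMap;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

public class ZwnjToSpaceSketch {
  // Replace ZWNJ (U+200C) with a space so the UAX#29 rules still split Persian
  // affixes the way ArabicLetterTokenizer used to.
  public static Tokenizer buildTokenizer(Version matchVersion, Reader input) {
    NormalizeCharMap normMap = new NormalizeCharMap();
    normMap.add("\u200C", " ");
    CharStream filtered = new MappingCharFilter(normMap, CharReader.get(input));
    return new StandardTokenizer(matchVersion, filtered);
  }

  public static void main(String[] args) throws Exception {
    // "mi-ravam" written with a ZWNJ between the prefix and the stem
    Tokenizer ts = buildTokenizer(Version.LUCENE_31,
        new StringReader("\u0645\u06CC\u200C\u0631\u0648\u0645"));
    // consume via incrementToken()/CharTermAttribute as usual, then close
    ts.close();
  }
}
{code}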

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2747) Deprecate/remove language-specific tokenizers in favor of StandardTokenizer

2010-11-09 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930073#action_12930073
 ] 

Robert Muir commented on LUCENE-2747:
-

Simon: I agree with both those points... we should change the method signature.

Also, I called it charStream (this is what Solr's analyzer calls it), but this 
is slightly confusing since the API is all Reader-based.
Alternatively we could give this a different name, wrapReader or something... 
not sure, I didn't have any better ideas than charStream.



> Deprecate/remove language-specific tokenizers in favor of StandardTokenizer
> ---
>
> Key: LUCENE-2747
> URL: https://issues.apache.org/jira/browse/LUCENE-2747
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Analysis
>Affects Versions: 3.1, 4.0
>Reporter: Steven Rowe
> Fix For: 3.1, 4.0
>
> Attachments: LUCENE-2747.patch
>
>
> As of Lucene 3.1, StandardTokenizer implements UAX#29 word boundary rules to 
> provide language-neutral tokenization.  Lucene contains several 
> language-specific tokenizers that should be replaced by UAX#29-based 
> StandardTokenizer (deprecated in 3.1 and removed in 4.0).  The 
> language-specific *analyzers*, by contrast, should remain, because they 
> contain language-specific post-tokenization filters.  The language-specific 
> analyzers should switch to StandardTokenizer in 3.1.
> Some usages of language-specific tokenizers will need additional work beyond 
> just replacing the tokenizer in the language-specific analyzer.  
> For example, PersianAnalyzer currently uses ArabicLetterTokenizer, and 
> depends on the fact that this tokenizer breaks tokens on the ZWNJ character 
> (zero-width non-joiner; U+200C), but in the UAX#29 word boundary rules, ZWNJ 
> is not a word boundary.  Robert Muir has suggested using a char filter 
> converting ZWNJ to spaces prior to StandardTokenizer in the converted 
> PersianAnalyzer.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2747) Deprecate/remove language-specific tokenizers in favor of StandardTokenizer

2010-11-09 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930072#action_12930072
 ] 

Simon Willnauer commented on LUCENE-2747:
-

I looked at the patch briefly and the charStream(Reader) extension looks good 
to me, though I would make it protected and have it throw an IOException. Since 
this API is public and folks will use it in the wild, we need to make sure we 
don't have to add the exception later, and that people creating Readers don't 
have to play tricks just because the interface has no IOException. About making 
it protected: do we need to call that in a non-protected context? Maybe I'm 
missing something...
{code}
public Reader charStream(Reader reader) {
  return reader;
}

// should be

protected Reader charStream(Reader reader) throws IOException {
  return reader;
}
{code}
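
For illustration, a hedged sketch of how a concrete Analyzer might use such a 
hook, assuming it lands on ReusableAnalyzerBase with the protected signature 
suggested above and keeps the charStream name (it may end up as initReader 
instead); package locations follow the 3.x line, and the ZWNJ mapping is just 
the PersianAnalyzer use case from the issue description:

{code}
import java.io.IOException;
import java.io.Reader;

import org.apache.lucene.analysis.CharReader;
import org.apache.lucene.analysis.MappingCharFilter;
import org.apache.lucene.analysis.NormalizeCharMap;
import org.apache.lucene.analysis.ReusableAnalyzerBase;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

// Hedged sketch: assumes the hook keeps the charStream name and the
// protected/IOException signature discussed above.
public final class ZwnjMappingAnalyzer extends ReusableAnalyzerBase {
  private final Version matchVersion;
  private final NormalizeCharMap normMap = new NormalizeCharMap();

  public ZwnjMappingAnalyzer(Version matchVersion) {
    this.matchVersion = matchVersion;
    normMap.add("\u200C", " "); // map ZWNJ to a space before tokenization
  }

  @Override
  protected Reader charStream(Reader reader) throws IOException {
    return new MappingCharFilter(normMap, CharReader.get(reader));
  }

  @Override
  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    Tokenizer source = new StandardTokenizer(matchVersion, reader);
    return new TokenStreamComponents(source);
  }
}
{code}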


> Deprecate/remove language-specific tokenizers in favor of StandardTokenizer
> ---
>
> Key: LUCENE-2747
> URL: https://issues.apache.org/jira/browse/LUCENE-2747
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Analysis
>Affects Versions: 3.1, 4.0
>Reporter: Steven Rowe
> Fix For: 3.1, 4.0
>
> Attachments: LUCENE-2747.patch
>
>
> As of Lucene 3.1, StandardTokenizer implements UAX#29 word boundary rules to 
> provide language-neutral tokenization.  Lucene contains several 
> language-specific tokenizers that should be replaced by UAX#29-based 
> StandardTokenizer (deprecated in 3.1 and removed in 4.0).  The 
> language-specific *analyzers*, by contrast, should remain, because they 
> contain language-specific post-tokenization filters.  The language-specific 
> analyzers should switch to StandardTokenizer in 3.1.
> Some usages of language-specific tokenizers will need additional work beyond 
> just replacing the tokenizer in the language-specific analyzer.  
> For example, PersianAnalyzer currently uses ArabicLetterTokenizer, and 
> depends on the fact that this tokenizer breaks tokens on the ZWNJ character 
> (zero-width non-joiner; U+200C), but in the UAX#29 word boundary rules, ZWNJ 
> is not a word boundary.  Robert Muir has suggested using a char filter 
> converting ZWNJ to spaces prior to StandardTokenizer in the converted 
> PersianAnalyzer.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Lucene-Solr-tests-only-trunk - Build # 1168 - Failure

2010-11-09 Thread Apache Hudson Server
Build: https://hudson.apache.org/hudson/job/Lucene-Solr-tests-only-trunk/1168/

1 tests failed.
REGRESSION:  
org.apache.lucene.index.TestIndexWriterMergePolicy.testMaxBufferedDocsChange

Error Message:
maxMergeDocs=2147483647; numSegments=11; upperBound=10; mergeFactor=10; 
segs=_64:c5950 _5t:c10->_32 _5u:c10->_32 _5v:c10->_32 _5w:c10->_32 _5x:c10->_32 
_5y:c10->_32 _5z:c10->_32 _60:c10->_32 _61:c10->_32 _62:c9->_32 _65:c1->_62

Stack Trace:
junit.framework.AssertionFailedError: maxMergeDocs=2147483647; numSegments=11; 
upperBound=10; mergeFactor=10; segs=_64:c5950 _5t:c10->_32 _5u:c10->_32 
_5v:c10->_32 _5w:c10->_32 _5x:c10->_32 _5y:c10->_32 _5z:c10->_32 _60:c10->_32 
_61:c10->_32 _62:c9->_32 _65:c1->_62
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:878)
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:844)
at 
org.apache.lucene.index.TestIndexWriterMergePolicy.checkInvariants(TestIndexWriterMergePolicy.java:243)
at 
org.apache.lucene.index.TestIndexWriterMergePolicy.testMaxBufferedDocsChange(TestIndexWriterMergePolicy.java:169)




Build Log (for compile errors):
[...truncated 3082 lines...]



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2742) Enable native per-field codec support

2010-11-09 Thread Simon Willnauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simon Willnauer updated LUCENE-2742:


Attachment: LUCENE-2742.patch

Here is a first patch - all tests pass. I changed the CodecProvider interface 
slightly so it can hold per-field codecs as well as a default per-field codec. 
For simplicity, users can now register their codec directly through the 
Fieldable interface. Internally I added a CodecInfo which handles all the 
ordering and registration per segment / field. For consistency I bound 
CodecInfo to FieldInfos, since we are now operating per field. A codec can only 
be assigned once, the first time we see it during FieldInfos creation.

There is a nocommit on Fieldable since it doesn't have javadoc, but let's 
iterate first to see if we want to go that path - it seems close.
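
As a hedged illustration of the workflow this describes (a fragment only, with 
analyzer/dir setup and imports elided): the per-field calls marked hypothetical 
below, in particular setCodec on the field, are assumptions about this patch 
rather than committed API, and the IndexWriterConfig wiring assumes the trunk 
API of the time:

{code}
// Sketch only; calls marked "hypothetical" are assumptions about this patch.
CodecProvider provider = CodecProvider.getDefault();
provider.register(new PulsingCodec(1));   // assuming "Pulsing" is not registered yet

Document doc = new Document();
Field id = new Field("id", "42", Field.Store.YES, Field.Index.NOT_ANALYZED);
id.setCodec("Pulsing");                   // hypothetical per-field hook on Fieldable
doc.add(id);
doc.add(new Field("body", "some text", Field.Store.NO, Field.Index.ANALYZED));
// "body" falls back to the provider's default per-field codec.

IndexWriterConfig conf = new IndexWriterConfig(Version.LUCENE_40, analyzer);
conf.setCodecProvider(provider);
IndexWriter writer = new IndexWriter(dir, conf);
writer.addDocument(doc);
writer.close();
{code}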


> Enable native per-field codec support 
> --
>
> Key: LUCENE-2742
> URL: https://issues.apache.org/jira/browse/LUCENE-2742
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index, Store
>Affects Versions: 4.0
>Reporter: Simon Willnauer
>Assignee: Simon Willnauer
> Fix For: 4.0
>
> Attachments: LUCENE-2742.patch
>
>
> Currently the codec name is stored for every segment, and PerFieldCodecWrapper 
> is used to enable codecs per field, which has recently brought up some issues 
> (LUCENE-2740 and LUCENE-2741). When a codec name is stored, Lucene does not 
> record the actual codec used to encode a field's postings but rather the 
> "top-level" Codec; in such a case the name of the top-level codec is 
> "PerField" instead of "Pulsing" or "Standard" etc. The way this composite 
> pattern works makes the indexing part of codecs simpler but also limits its 
> capabilities. By recording the top-level codec in the segments file we rely on 
> the user to "configure" the PerFieldCodecWrapper correctly to open a 
> SegmentReader. If a field's codec has changed in the meantime we won't be 
> able to open the segment.
> The issues LUCENE-2741 and LUCENE-2740 are actually closely related to the 
> way PFCW is implemented right now. PFCW blindly creates codecs per field on 
> request, and at the same time has no control over file naming, nor over 
> whether two codec instances are created for two distinct fields even if the 
> codec instance is the same. If so, FieldsConsumer will throw an exception 
> since the files it relies on have already been created.
> Having PerFieldCodecWrapper AND a CodecProvider overcomplicates things IMO. 
> In order to use per-field codecs a user must on the one hand register their 
> custom codecs AND on the other build a PFCW, which then needs to be maintained 
> in "user-land" and must not change incompatibly once a new IW or IR is created. 
> What I would expect from Lucene is to enable me to register a codec in a 
> provider and then tell the Field which codec it should use for indexing. For 
> reading, Lucene should determine the codec automatically once a segment is 
> opened; if the codec is not available in the provider, that is a different 
> story. Once we instantiate the composite codec in SegmentsReader we get, for 
> free, only the codecs which are really used in this segment, which in turn 
> solves LUCENE-2740. 
> Yet, instead of relying on the user to configure PFCW, I suggest moving the 
> composite codec functionality inside the core and recording the distinct 
> codecs per segment in the segments info. We only really need the distinct 
> codecs used in that segment, since the codec instance should be reused to 
> prevent additional files from being created. Let's say we have the following 
> codec mapping:
> {noformat}
> field_a:Pulsing
> field_b:Standard
> field_c:Pulsing
> {noformat}
> then we create the following mapping:
> {noformat}
> SegmentInfo:
> [Pulsing, Standard]
> PerField:
> [field_a:0, field_b:1, field_c:0]
> {noformat}
> That way we can easily determine which codec is used for which field and build 
> the composite codec internally when opening SegmentsReader. This ordering has 
> another advantage: if, as in this case, Pulsing and Standard use the very 
> same type of files, we need a way to distinguish the files used per codec 
> within a segment. We can in turn pass the codec's ord (implicit in the 
> SegmentInfo) to the FieldsConsumer on creation to create files named 
> segmentname_ord.ext (or something similar). This solves LUCENE-2741.
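
To make that bookkeeping concrete, a small illustrative sketch (not the patch 
itself, and with java.util imports elided) of how the per-segment codec list 
and the per-field ordinals from the example above could be derived:

{code}
// Illustrative sketch of the proposed bookkeeping (not the patch itself):
// record each distinct codec once per segment, then map fields to its ordinal.
Map<String, String> fieldToCodec = new LinkedHashMap<String, String>();
fieldToCodec.put("field_a", "Pulsing");
fieldToCodec.put("field_b", "Standard");
fieldToCodec.put("field_c", "Pulsing");

List<String> segmentCodecs = new ArrayList<String>();             // -> SegmentInfo
Map<String, Integer> fieldToOrd = new HashMap<String, Integer>(); // -> per field

for (Map.Entry<String, String> e : fieldToCodec.entrySet()) {
  int ord = segmentCodecs.indexOf(e.getValue());
  if (ord < 0) {
    segmentCodecs.add(e.getValue());
    ord = segmentCodecs.size() - 1;
  }
  fieldToOrd.put(e.getKey(), ord);
}
// segmentCodecs == [Pulsing, Standard]
// fieldToOrd    == {field_a=0, field_b=1, field_c=0}
// files for a codec can then be named <segmentname>_<ord>.<ext> to avoid collisions
{code}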

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2747) Deprecate/remove language-specific tokenizers in favor of StandardTokenizer

2010-11-09 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930024#action_12930024
 ] 

Robert Muir commented on LUCENE-2747:
-

bq. You removed TestHindiFilters.testTokenizer(), 
TestIndicTokenizer.testBasics() and TestIndicTokenizer.testFormat(), but these 
would be useful in TestStandardAnalyzer and TestUAX29Tokenizer, wouldn't they?

Oh, I just deleted everything associated with that tokenizer...

bq. You did not remove ArabicLetterTokenizer and IndicTokenizer, presumably so 
that they can be used with Lucene 4.0+ when the supplied Version is less than 
3.1 - good catch, I had forgotten this requirement - but when can we actually 
get rid of these? Since they will be staying, shouldn't their tests remain too, 
but using Version.LUCENE_30 instead of TEST_VERSION_CURRENT?

I removed the IndicTokenizer (unreleased) and deleted its tests,
but I kept and deprecated the Arabic one, since we have released it.
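
As an editorial illustration of the back-compat pattern the quoted question 
refers to, a hedged sketch of how an analyzer's tokenizer choice might be gated 
on the supplied Version; the exact conditional and constructors in the 
committed analyzers may differ:

{code}
// Sketch only: keep the deprecated tokenizer for pre-3.1 compatibility.
Tokenizer source;
if (matchVersion.onOrAfter(Version.LUCENE_31)) {
  source = new StandardTokenizer(matchVersion, reader);      // UAX#29 word boundaries
} else {
  source = new ArabicLetterTokenizer(matchVersion, reader);  // deprecated pre-3.1 behavior
}
{code}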


> Deprecate/remove language-specific tokenizers in favor of StandardTokenizer
> ---
>
> Key: LUCENE-2747
> URL: https://issues.apache.org/jira/browse/LUCENE-2747
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Analysis
>Affects Versions: 3.1, 4.0
>Reporter: Steven Rowe
> Fix For: 3.1, 4.0
>
> Attachments: LUCENE-2747.patch
>
>
> As of Lucene 3.1, StandardTokenizer implements UAX#29 word boundary rules to 
> provide language-neutral tokenization.  Lucene contains several 
> language-specific tokenizers that should be replaced by UAX#29-based 
> StandardTokenizer (deprecated in 3.1 and removed in 4.0).  The 
> language-specific *analyzers*, by contrast, should remain, because they 
> contain language-specific post-tokenization filters.  The language-specific 
> analyzers should switch to StandardTokenizer in 3.1.
> Some usages of language-specific tokenizers will need additional work beyond 
> just replacing the tokenizer in the language-specific analyzer.  
> For example, PersianAnalyzer currently uses ArabicLetterTokenizer, and 
> depends on the fact that this tokenizer breaks tokens on the ZWNJ character 
> (zero-width non-joiner; U+200C), but in the UAX#29 word boundary rules, ZWNJ 
> is not a word boundary.  Robert Muir has suggested using a char filter 
> converting ZWNJ to spaces prior to StandardTokenizer in the converted 
> PersianAnalyzer.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Solr-trunk - Build # 1307 - Failure

2010-11-09 Thread Apache Hudson Server
Build: https://hudson.apache.org/hudson/job/Solr-trunk/1307/

All tests passed

Build Log (for compile errors):
[...truncated 18450 lines...]



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org