RE: Umlauts as Char

2011-02-08 Thread Digy
Hi Prescott,

1- When I open the java file, I see the code as it should be. You can try to
open it with Notepad and then paste it into VS, for example.
2- There is an open issue reported by Pasha Bizhan that covers some
languages (https://issues.apache.org/jira/browse/LUCENENET-372),
but I don't know whether it is up to date or not.
3- ASCIIFoldingFilter.cs is another example of dealing with non-ASCII
chars.

DIGY

-Original Message-
From: Prescott Nasser [mailto:geobmx...@hotmail.com] 
Sent: Tuesday, February 08, 2011 3:55 AM
To: lucene-net-dev@lucene.apache.org
Subject: Umlauts as Char



Hey all, 
 
So while digging into the code a bit (prompted by Digy's Arabic conversion
yesterday), I started looking at the various other languages we are missing
from Java.

I started porting the GermanAnalyzer, but ran into an issue with the
umlauts...
 
http://svn.apache.org/viewvc/lucene/java/tags/lucene_2_9_4/contrib/analyzers/common/src/java/org/apache/lucene/analysis/de/GermanStemmer.java?revision=1040993&view=co
 
in the void substitute() function you'll see them:
 
else if ( buffer.charAt( c ) == 'ü' ) {
  buffer.setCharAt( c, 'u' );
}

This does not constitute a valid character in .NET (as far as I can figure
out) and thus it doesn't compile. The .java file says it is encoded in UTF-8.
I was thinking maybe I could do the same thing in VS2010, but I'm not finding
a way, and searching on this has been difficult.
 
Any ideas?
 
~Prescott



RE: Umlauts as Char

2011-02-08 Thread Prescott Nasser

Well - with regards to number 2: it was fine to dig into the code a bit, but I
guess we have a number of them already converted, although I guess they were
never added to source control.
 
Thanks for the heads up on 1 and 3.
 
~P






 From: digyd...@gmail.com
 To: lucene-net-dev@lucene.apache.org
 Subject: RE: Umlauts as Char
 Date: Tue, 8 Feb 2011 11:12:33 +0200

 Hi Prescott,

 1- When I open the java file, I see the code as it should be. You can try to
 open it with Notepad and then paste it into VS, for example.
 2- There is an open issue reported by Pasha Bizhan that covers some
 languages (https://issues.apache.org/jira/browse/LUCENENET-372),
 but I don't know whether it is up to date or not.
 3- ASCIIFoldingFilter.cs is another example of dealing with non-ASCII
 chars.

 DIGY

 -Original Message-
 From: Prescott Nasser [mailto:geobmx...@hotmail.com]
 Sent: Tuesday, February 08, 2011 3:55 AM
 To: lucene-net-dev@lucene.apache.org
 Subject: Umlauts as Char



 Hey all,

 So while digging into the code a bit (prompted by Digy's Arabic conversion
 yesterday), I started looking at the various other languages we are missing
 from Java.

 I started porting the GermanAnalyzer, but ran into an issue with the
 umlauts...

 http://svn.apache.org/viewvc/lucene/java/tags/lucene_2_9_4/contrib/analyzers/common/src/java/org/apache/lucene/analysis/de/GermanStemmer.java?revision=1040993&view=co

 in the void substitute() function you'll see them:

 else if ( buffer.charAt( c ) == 'ü' ) {
 buffer.setCharAt( c, 'u' );
 }

 This does not constitute a valid character in .NET (as far as I can figure
 out) and thus it doesn't compile. The .java file says it is encoded in UTF-8.
 I was thinking maybe I could do the same thing in VS2010, but I'm not finding
 a way, and searching on this has been difficult.

 Any ideas?

 ~Prescott
 

Re: Umlauts as Char

2011-02-08 Thread Stefan Bodewig
On 2011-02-08, Prescott Nasser wrote:

 I'm tempted to take the source at its word and just replace them with
 the umlaut versions (via character map - thanks Aaron), and then add
 some comment noting what it originally was in the Java source.

I'd still recommend using Unicode escape sequences since otherwise the
source code depends on your local encoding - which will only lead to
trouble later on.  The Java code circumvents this somewhat by explicitly
stating the file was UTF-8, but ASCII files are still a lot more
portable.

Stefan
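
For illustration: 'ü' is U+00FC, so the substitution from the GermanStemmer
snippet above can be written with ASCII-only Unicode escapes; the \uXXXX
escape syntax below is Java's, and C# char literals accept the same form.
A minimal sketch, not the actual stemmer code:

static void substituteUmlauts(StringBuilder buffer) {
  for (int c = 0; c < buffer.length(); c++) {
    switch (buffer.charAt(c)) {
      case '\u00E4': buffer.setCharAt(c, 'a'); break; // ä
      case '\u00F6': buffer.setCharAt(c, 'o'); break; // ö
      case '\u00FC': buffer.setCharAt(c, 'u'); break; // ü
    }
  }
}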


Re: Umlauts as Char

2011-02-08 Thread Sergey Mirvoda
+1 for unicode escape sequences.

PS
I can port RussianAnalyzer/Stemmer - having looked into the Lucene Contrib
code, it is not as hard as I thought before.

On Tue, Feb 8, 2011 at 4:33 PM, Stefan Bodewig bode...@apache.org wrote:

 On 2011-02-08, Digy wrote:

  Although Java doesn't write a BOM, VS is clever enough to open it correctly.

 OK, thank you.

 I didn't try it myself since I wasn't at a machine with VS installed
 when I responded.

 Stefan




-- 
--Regards, Sergey Mirvoda


[jira] Commented: (SOLR-1191) NullPointerException in delta import

2011-02-08 Thread Gunnlaugur Thor Briem (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12991884#comment-12991884
 ] 

Gunnlaugur Thor Briem commented on SOLR-1191:
-

bq. There seems to be TestSqlEntityProcessorDelta*.java, no?

Indeed there are, and they do seem to cover delta imports to a fair degree. I 
must have been underslept. : ) [The Hudson coverage 
report|https://hudson.apache.org/hudson/job/Solr-3.x/clover/] doesn't include 
the contrib stuff though.

 NullPointerException in delta import
 

 Key: SOLR-1191
 URL: https://issues.apache.org/jira/browse/SOLR-1191
 Project: Solr
  Issue Type: Bug
  Components: contrib - DataImportHandler
Affects Versions: 1.3, 1.4
 Environment: OS: Windows & Linux.
 Java: 1.6
 DB: MySQL & SQL Server
Reporter: Ali Syed
Assignee: Noble Paul
 Fix For: 1.4

 Attachments: SOLR-1191.patch


 Seeing few of these NullPointerException during delta imports. Once this 
 happens delta import stops working and keeps giving the same error.
 java.lang.NullPointerException
 at 
 org.apache.solr.handler.dataimport.DocBuilder.collectDelta(DocBuilder.java:622)
 at 
 org.apache.solr.handler.dataimport.DocBuilder.doDelta(DocBuilder.java:240)
 at 
 org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:159)
 at 
 org.apache.solr.handler.dataimport.DataImporter.doDeltaImport(DataImporter.java:337)
 at 
 org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:376)
 at 
 org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:355)
 Running delta import for a particular entity fixes the problem and delta 
 import start working again.
 Here is the log just before & after the exception
 05/27 11:59:29 86987686 INFO  btpool0-538 org.apache.solr.core.SolrCore  - 
 [localhost] webapp=/solr path=/dataimport 
 params={command=delta-import&optimize=false} status=0 QTime=0
 05/27 11:59:29 86987687 INFO  Thread-4162 
 org.apache.solr.handler.dataimport.SolrWriter  - Read dataimport.properties
 05/27 11:59:29 86987687 INFO  Thread-4162 
 org.apache.solr.handler.dataimport.DataImporter  - Starting Delta Import
 05/27 11:59:29 86987687 INFO  Thread-4162 
 org.apache.solr.handler.dataimport.SolrWriter  - Read dataimport.properties
 05/27 11:59:29 86987687 INFO  Thread-4162 
 org.apache.solr.handler.dataimport.DocBuilder  - Starting delta collection.
 05/27 11:59:29 86987690 INFO  Thread-4162 
 org.apache.solr.handler.dataimport.DocBuilder  - Running ModifiedRowKey() for 
 Entity: content
 05/27 11:59:29 86987690 INFO  Thread-4162 
 org.apache.solr.handler.dataimport.DocBuilder  - Completed ModifiedRowKey for 
 Entity: content rows obtained : 0
 05/27 11:59:29 86987690 INFO  Thread-4162 
 org.apache.solr.handler.dataimport.DocBuilder  - Completed DeletedRowKey for 
 Entity: content rows obtained : 0
 05/27 11:59:29 86987692 INFO  Thread-4162 
 org.apache.solr.handler.dataimport.DocBuilder  - Completed parentDeltaQuery 
 for Entity: content
 05/27 11:59:29 86987692 INFO  Thread-4162 
 org.apache.solr.handler.dataimport.DocBuilder  - Running ModifiedRowKey() for 
 Entity: job
 05/27 11:59:29 86987692 INFO  Thread-4162 
 org.apache.solr.handler.dataimport.JdbcDataSource  - Creating a connection 
 for entity job with URL: jdbc:sqlserver://localhost;databaseName=TestDB
 05/27 11:59:29 86987704 INFO  Thread-4162 
 org.apache.solr.handler.dataimport.JdbcDataSource  - Time taken for 
 getConnection(): 12
 05/27 11:59:29 86987707 INFO  Thread-4162 
 org.apache.solr.handler.dataimport.DocBuilder  - Completed ModifiedRowKey for 
 Entity: job rows obtained : 0
 05/27 11:59:29 86987707 INFO  Thread-4162 
 org.apache.solr.handler.dataimport.DocBuilder  - Completed DeletedRowKey for 
 Entity: job rows obtained : 0
 05/27 11:59:29 86987707 INFO  Thread-4162 
 org.apache.solr.handler.dataimport.DocBuilder  - Completed parentDeltaQuery 
 for Entity: job
 05/27 11:59:29 86987707 INFO  Thread-4162 
 org.apache.solr.handler.dataimport.DocBuilder  - Delta Import completed 
 successfully
 05/27 11:59:29 86987707 INFO  Thread-4162 
 org.apache.solr.handler.dataimport.DocBuilder  - Starting delta collection.
 05/27 11:59:29 86987709 INFO  Thread-4162 
 org.apache.solr.handler.dataimport.DocBuilder  - Running ModifiedRowKey() for 
 Entity: user
 05/27 11:59:29 86987709 INFO  Thread-4162 
 org.apache.solr.handler.dataimport.JdbcDataSource  - Creating a connection 
 for entity user with URL: jdbc:sqlserver://localhost;databaseName=TestDB
 05/27 11:59:29 86987716 INFO  Thread-4162 
 org.apache.solr.handler.dataimport.JdbcDataSource  - Time taken for 
 getConnection(): 7
 05/27 11:59:29 86987873 INFO  Thread-4162 
 org.apache.solr.handler.dataimport.DocBuilder  - Completed ModifiedRowKey for 
 

[jira] Updated: (LUCENE-2881) Track FieldInfo per segment instead of per-IW-session

2011-02-08 Thread Michael Busch (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Busch updated LUCENE-2881:
--

Attachment: lucene-2881.patch

  * Creates a new FieldInfos for every segment
  * Changes FieldInfos so that the FieldInfo numbers within a single 
FieldInfos don't have to be contiguous - this allows using the same numbering 
as the previous segment(s), even if not all fields are present in the new 
segment
  * Adds a global fieldName -> fieldNumber map; if possible, when a new field 
is added to the FieldInfos it tries to use an already assigned number for that 
field

All tests pass.  Though I need to verify if the global map works correctly 
(it'd probably be good to add a test for that).  Also it'd be nice to remove 
hasVectors and hasProx from SegmentInfo, but we could also do that in a 
separate issue. 
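
A minimal sketch of what such a global fieldName -> fieldNumber map could look 
like (hypothetical class and method names, not the actual patch):

{code}
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: field names keep their number across segments;
// names never seen before get the next free number.
class GlobalFieldNumbers {
  private final Map<String, Integer> nameToNumber = new HashMap<String, Integer>();
  private int nextNumber = 0;

  synchronized int addOrGet(String fieldName) {
    Integer number = nameToNumber.get(fieldName);
    if (number == null) {          // first time this field is seen
      number = nextNumber++;
      nameToNumber.put(fieldName, number);
    }
    return number;                 // stable across all segments
  }
}
{code}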

 Track FieldInfo per segment instead of per-IW-session
 -

 Key: LUCENE-2881
 URL: https://issues.apache.org/jira/browse/LUCENE-2881
 Project: Lucene - Java
  Issue Type: Improvement
Affects Versions: Realtime Branch, CSF branch, 4.0
Reporter: Simon Willnauer
Assignee: Michael Busch
 Fix For: Realtime Branch, CSF branch, 4.0

 Attachments: lucene-2881.patch


 Currently FieldInfo is tracked per IW session to guarantee consistent global 
 field-naming / ordering. IW carries FI instances over from previous segments, 
 which also carries over field properties like isIndexed etc. While having 
 consistent field ordering per IW session appears to be important due to bulk 
 merging of stored fields etc., carrying over other properties might become 
 problematic with Lucene's Codec support.  Codecs that rely on consistent 
 properties in FI will fail if FI properties are carried over.
 The DocValuesCodec (DocValuesBranch) for instance writes files per segment 
 and field (using the field id within the file name). Yet, if a segment has no 
 DocValues indexed in a particular segment but a previous segment in the same 
 IW session had DocValues, FieldInfo#docValues will be true since those 
 values are reused from previous segments. 
 We already work around this limitation in SegmentInfo with properties like 
 hasVectors or hasProx, which is really something we should manage per Codec & 
 Segment. Ideally FieldInfo would be managed per Segment and Codec such that 
 its properties are valid per segment. It also seems to be necessary to bind 
 FieldInfoS to SegmentInfo logically since it's really just per-segment 
 metadata.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: CustomScoreQueryWithSubqueries

2011-02-08 Thread Fernando Wasylyszyn
Hi Doron. Thanks for your answer. Maybe the question seems simple, but I want 
to 
be sure about the procedure.
By the way, there is a chance, if the patch is really useful, that it could be 
adapted for other versions (in this case, lucene 3.0).
Thanks.
Regards.
Fernando.





From: Doron Cohen cdor...@gmail.com
To: dev@lucene.apache.org
Sent: Tuesday, February 8, 2011 2:54:19
Subject: Re: CustomScoreQueryWithSubqueries


Hi Fernando,
The wiki indeed relates mainly to trunk development.
For creating a 2.9 patch checkout code from 
/repos/asf/lucene/java/branches/lucene_2_9
Regards,
Doron

As the wiki page says 
 Most development is done on the trunk
You can either use that, or, in order 


On Tue, Feb 8, 2011 at 4:56 AM, Fernando Wasylyszyn ferw...@yahoo.com.ar 
wrote:

Robert: I'm trying to follow the steps that are mentioned in:


http://wiki.apache.org/lucene-java/HowToContribute

in order to make a patch with my contribution. But, in the source code that I 
get from:


http://svn.apache.org/repos/asf/lucene/dev/trunk/
the class org.apache.lucene.search.Searcher is missing and the only method 
available to obtain a Scorer from a Weight object is 
scorer(IndexReader.AtomicReaderContext, ScorerContext).
I just checked, and the class Searcher still exists in Lucene 3.0.3. On which 
version is the trunk that I've checked out based? The patch that I want to 
submit is based on Lucene 2.9.1.
Thanks in advance.
Regards.
Fernando.

From: Robert Muir rcm...@gmail.com
To: dev@lucene.apache.org
Sent: Wednesday, February 2, 2011 16:52:58
Subject: Re: CustomScoreQueryWithSubqueries


On Wed, Feb 2, 2011 at 2:37 PM, Fernando Wasylyszyn
ferw...@yahoo.com.ar wrote:
 Hi everyone. My name is Fernando and I am a researcher and developer in the
 R+D lab at Snoop Consulting S.R.L. in Argentina.
 Based on the patch suggested in LUCENE-1608
 (https://issues.apache.org/jira/browse/LUCENE-1608) and in the needs of one
 of our customers, for whom we are developing a customized search engine on
 top of Lucene and Solr, we have developed the class
 CustomScoreQueryWithSubqueries, which is a variation of CustomScoreQuery
 that allows the use of arbitrary Query objects besides instances of
 ValueSourceQuery, without the need of wrapping the arbitrary query/queries
 with the QueryValueSource proposed in Jira, which has the disadvantage of
 creating an instance of an IndexSearcher in each invocation of the method
 getValues(IndexReader).
 If you think that this contribution can be useful for the Lucene community,
 please let me know the steps in order to contribute.

Hi Fernando: https://issues.apache.org/jira/browse/LUCENE-1608 is
still an open issue.

If you have a better solution, please don't hesitate to upload a patch
file to the issue!
There are some more detailed instructions here:
http://wiki.apache.org/lucene-java/HowToContribute

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org


Re: CustomScoreQueryWithSubqueries

2011-02-08 Thread Simon Willnauer
Hi Fernando,

I didn't follow this really, but in general we fix stuff in trunk and
then backport to older versions. Usually if something is useful for
2.9 it's also useful for 4.0 & 3.x, if the issue still applies.

simon

On Tue, Feb 8, 2011 at 12:34 PM, Fernando Wasylyszyn
ferw...@yahoo.com.ar wrote:
 Hi Doron. Thanks for your answer. Maybe the question seems simple, but I
 want to be sure about the procedure.
 By the way, there is a chance, if the patch is really useful, that it could
 be adapted for other versions (in this case, lucene 3.0).
 Thanks.
 Regards.
 Fernando.

 
 From: Doron Cohen cdor...@gmail.com
 To: dev@lucene.apache.org
 Sent: Tuesday, February 8, 2011 2:54:19
 Subject: Re: CustomScoreQueryWithSubqueries

 Hi Fernando,
 The wiki indeed relates mainly to trunk development.
 For creating a 2.9 patch checkout code from
 /repos/asf/lucene/java/branches/lucene_2_9
 Regards,
 Doron

 As the wiki page says
 Most development is done on the trunk
 You can either use that, or, in order

 On Tue, Feb 8, 2011 at 4:56 AM, Fernando Wasylyszyn ferw...@yahoo.com.ar
 wrote:

 Robert: I'm trying to follow the steps that are mentioned in:

 http://wiki.apache.org/lucene-java/HowToContribute

 in order to make a patch with my contribution. But, in the source code
 that I get from:

 http://svn.apache.org/repos/asf/lucene/dev/trunk/

 the class org.apache.lucene.search.Searcher is missing and the only method
 available to obtain a Scorer from a Weight object is
 scorer(IndexReader.AtomicReaderContext, ScorerContext).
 I just checked, and the class Searcher still exists in Lucene 3.0.3. On which
 version is the trunk that I've checked out based? The patch that I want to
 submit is based on Lucene 2.9.1.
 Thanks in advance.
 Regards.
 Fernando.






 
 From: Robert Muir rcm...@gmail.com
 To: dev@lucene.apache.org
 Sent: Wednesday, February 2, 2011 16:52:58
 Subject: Re: CustomScoreQueryWithSubqueries

 On Wed, Feb 2, 2011 at 2:37 PM, Fernando Wasylyszyn
 ferw...@yahoo.com.ar wrote:
  Hi everyone. My name is Fernando and I am a researcher and developer in
  the
  R+D lab at Snoop Consulting S.R.L. in Argentina.
  Based on the patch suggested in LUCENE-1608
  (https://issues.apache.org/jira/browse/LUCENE-1608) and in the needs of one
  of our customers, for whom we are developing a customized search engine on
  top of Lucene and Solr, we have developed the class
  CustomScoreQueryWithSubqueries, which is a variation of CustomScoreQuery
  that allows the use of arbitrary Query objects besides instances of
  ValueSourceQuery, without the need of wrapping the arbitrary query/queries
  with the QueryValueSource proposed in Jira, which has the disadvantage of
  creating an instance of an IndexSearcher in each invocation of the method
  getValues(IndexReader).
  If you think that this contribution can be useful for the Lucene
  community,
  please let me know the steps in order to contribute.

 Hi Fernando: https://issues.apache.org/jira/browse/LUCENE-1608 is
 still an open issue.

 If you have a better solution, please don't hesitate to upload a patch
 file to the issue!
 There are some more detailed instructions here:
 http://wiki.apache.org/lucene-java/HowToContribute

 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org






-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Should ASCIIFoldingFilter be deprecated?

2011-02-08 Thread Robert Muir
On Mon, Feb 7, 2011 at 10:51 PM, Steven A Rowe sar...@syr.edu wrote:
 I haven't done any benchmarking, but I'm pretty sure that ASCIIFoldingFilter 
 can achieve a significantly higher throughput rate than MappingCharFilter, 
 and given that, it probably makes sense to keep both, to allow people to make 
 the choice about the tradeoff between the flexibility provided by the 
 human-readable (and editable) mapping file and the speed provided by 
 ASCIIFoldingFilter.

I agree... have you seen http://bugs.icu-project.org/trac/ticket/7743 ?

Hopefully something along those lines would allow us to support the
flexibility in a factory or whatever (even better as described, when
you just want a small tweak) but still with good performance.
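
For readers comparing the two approaches, a minimal sketch using the 3.x-era
analysis API (hedged: constructor signatures changed between releases):

import java.io.Reader;
import java.io.StringReader;
import org.apache.lucene.analysis.ASCIIFoldingFilter;
import org.apache.lucene.analysis.CharReader;
import org.apache.lucene.analysis.MappingCharFilter;
import org.apache.lucene.analysis.NormalizeCharMap;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;

public class FoldingSketch {
  public static void main(String[] args) throws Exception {
    // Option 1: fold after tokenizing, with a TokenFilter.
    TokenStream ts1 = new ASCIIFoldingFilter(
        new WhitespaceTokenizer(new StringReader("\u00FCber alles")));

    // Option 2: fold before tokenizing, with a CharFilter driven by a
    // (normally file-based, human-editable) mapping.
    NormalizeCharMap map = new NormalizeCharMap();
    map.add("\u00FC", "u"); // ü -> u
    Reader folded = new MappingCharFilter(map,
        CharReader.get(new StringReader("\u00FCber alles")));
    TokenStream ts2 = new WhitespaceTokenizer(folded);
  }
}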

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Created: (LUCENE-2911) synchronize grammar/token types across StandardTokenizer, UAX29EmailURLTokenizer, ICUTokenizer, add CJK types.

2011-02-08 Thread Robert Muir (JIRA)
synchronize grammar/token types across StandardTokenizer, 
UAX29EmailURLTokenizer, ICUTokenizer, add CJK types.
--

 Key: LUCENE-2911
 URL: https://issues.apache.org/jira/browse/LUCENE-2911
 Project: Lucene - Java
  Issue Type: Sub-task
Reporter: Robert Muir
Assignee: Robert Muir
 Fix For: 3.1
 Attachments: LUCENE-2911.patch

I'd like to do LUCENE-2906 (better cjk support for these tokenizers) for a 
future target such as 3.2

But, in 3.1 I would like to do a little cleanup first, and synchronize all 
these token types, etc.


-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2911) synchronize grammar/token types across StandardTokenizer, UAX29EmailURLTokenizer, ICUTokenizer, add CJK types.

2011-02-08 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-2911:


Attachment: LUCENE-2911.patch

After applying the patch, you have to run 'ant jflex' from 
modules/analysis/common and 'ant genrbbi' from modules/analysis/icu to 
regenerate.

 synchronize grammar/token types across StandardTokenizer, 
 UAX29EmailURLTokenizer, ICUTokenizer, add CJK types.
 --

 Key: LUCENE-2911
 URL: https://issues.apache.org/jira/browse/LUCENE-2911
 Project: Lucene - Java
  Issue Type: Sub-task
  Components: Analysis
Reporter: Robert Muir
Assignee: Robert Muir
 Fix For: 3.1

 Attachments: LUCENE-2911.patch


 I'd like to do LUCENE-2906 (better cjk support for these tokenizers) for a 
 future target such as 3.2
 But, in 3.1 I would like to do a little cleanup first, and synchronize all 
 these token types, etc.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (SOLR-2342) Lock starvation can cause commit to never run when many clients are adding docs

2011-02-08 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12991965#comment-12991965
 ] 

Michael McCandless commented on SOLR-2342:
--

OK, passing true to the ReentrantReadWriteLock fixes the starvation in this 
test...  Should we commit that?

Passing true does make the read lock acquisition more costly, but I suspect 
this is in the noise for typical indexing.

And, while I think the stress test is rather unnatural (normally one thread 
should call commit in the end), I fear many apps may (for simplicity) do 
something similar to this stress test.  Also, the fact that auto-commit is also 
starved is I think a more realistic failure...
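
For reference, the flag in question is the boolean 'fair' argument of 
java.util.concurrent.locks.ReentrantReadWriteLock's constructor.  A minimal 
sketch of the change being discussed (invented class name, not Solr code):

{code}
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class FairnessSketch {
  // new ReentrantReadWriteLock() is unfair: a steady stream of readers
  // (add/update calls) can starve a thread waiting for the write lock
  // (commit).  Passing true makes acquisition roughly arrival-ordered.
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock(true);

  public void addDoc() {
    lock.readLock().lock();
    try { /* index one document */ } finally { lock.readLock().unlock(); }
  }

  public void commit() {
    lock.writeLock().lock();
    try { /* commit the index */ } finally { lock.writeLock().unlock(); }
  }
}
{code}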

 Lock starvation can cause commit to never run when many clients are adding 
 docs
 ---

 Key: SOLR-2342
 URL: https://issues.apache.org/jira/browse/SOLR-2342
 Project: Solr
  Issue Type: Bug
  Components: update
Reporter: Michael McCandless
Priority: Minor

 I have a stress test, where 100 clients add 100 1MB docs and then call commit 
 in the end.  It's a falldown test (try to make Solr fall down) and nowhere 
 near actual usage.
 But, after some initial commits that succeed, I'm seeing later commits always 
 time out (client side timeout @ 10 minutes).  Watching Solr's logging, no 
 commit ever runs.
 Looking at the stack traces in the threads, this is not deadlock: the 
 add/update calls are running, and new segments are being flushed to the index.
 Digging in the code a bit, we use ReentrantReadWriteLock, with add/update 
 acquiring the readLock and commit acquiring the writeLock.  But, according to 
 the jdocs, the writeLock isn't given any priority over the readLock (unless 
 you set fairness, which we don't).  So I think this explains the starvation?
 However, this is not a real world use case (most apps would/should call 
 commit less often, and from one client).  Also, we could set fairness, but it 
 seems to have some performance penalty, and I'm not sure we should penalize 
 the normal case for this unusual one.  EG see here (thanks Mark): 
 http://www.javaspecialists.eu/archive/Issue165.html.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



RE: Should ASCIIFoldingFilter be deprecated?

2011-02-08 Thread David Smiley (@MITRE.org)


Chris Hostetter-3 wrote:
 
 CharFilters and TokenFilters have different purposes though...
 
 http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#When_To_use_a_CharFilter_vs_a_TokenFilter
 
 (ie: If you use MappingCharFilter, you can't then tokenize on some of the 
 characters you filtered away)
 

Right, but it’s hard to imagine wanting to tokenize on an accent character
or some other modification specified in these particular mapping files.


Steven A Rowe wrote:
 
 AFAIK, ISOLatin1AccentFilter was deprecated because ASCIIFoldingFilter
 provides a superset of its mappings.
 

*If* that is the case then this file should also be removed:
solr/example/solr/conf/mapping-ISOLatin1Accent.txt


Steven A Rowe wrote:
 
 I haven't done any benchmarking, but I'm pretty sure that
 ASCIIFoldingFilter can achieve a significantly higher throughput rate than
 MappingCharFilter, and given that, it probably makes sense to keep both,
 to allow people to make the choice about the tradeoff between the
 flexibility provided by the human-readable (and editable) mapping file and
 the speed provided by ASCIIFoldingFilter.
 

I'm skeptical that whatever the difference is is relevant in the scheme of
things. The cost of keeping it is introducing confusion for users, and more
code to maintain.

~ David Smiley

-
 Author: https://www.packtpub.com/solr-1-4-enterprise-search-server/book
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Should-ASCIIFoldingFilter-be-deprecated-tp2448919p2451504.html
Sent from the Solr - Dev mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Umlauts as Char

2011-02-08 Thread Stefan Bodewig
On 2011-02-08, Prescott Nasser wrote:

 So I can take the source codes word that 'ü' is the u with dots over
 it (becuase it says replace umlauts in the source notes). But, I
 guess, is that really true? Is that perhaps u with a carrot over it
 instead?

I think the case has been settled by now, but I forgot to add that I am
German so if in doubt, feel free to ask.  We don't have any funny
characters except for äöüÄÖÜß in German.

Stefan


Re: Should ASCIIFoldingFilter be deprecated?

2011-02-08 Thread Robert Muir
On Tue, Feb 8, 2011 at 9:12 AM, David Smiley (@MITRE.org)
dsmi...@mitre.org wrote:

 I'm skeptical that whatever the difference is is relevant in the scheme of
 things. The cost to keeping it is introducing confusion on users, and more
 code to maintain.


it's pretty significant. charfilters are not reusable, and box every
character and lookup out of a hashmap (I made a patch to fix the
reusability, but no one has commented):
https://issues.apache.org/jira/browse/LUCENE-2788

asciifoldingfilter does a huge switch (which still isn't optimal), but
it's way, way faster than mappingcharfilter, especially since it's a
no-op for chars < 0x7F.

icufoldingfilter precompiles a recursively decomposed trie, so its
lookup is a unicode folded trie
(icu-project.org/docs/papers/foldedtrie_iuc21.ppt). I think it's a tad
slower than asciifoldingfilter but it also incorporates case folding
and unicode normalization: neither asciifoldingfilter nor
mappingcharfilter will properly fold
http://www.geonames.org/search.html?q=Ab%C5%AB+Z%CC%A7aby&country=,
because there is no composed form for Z + combining cedilla, but
icufoldingfilter will.
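
To make the no-op point concrete: a minimal sketch of that kind of fast path
(hypothetical method, not the actual asciifoldingfilter code, which can also
expand one char into several):

static char foldToAscii(char c) {
  if (c < 0x80) {              // plain ASCII: the common case, nothing to do
    return c;
  }
  switch (c) {                 // the real filter has a huge switch here
    case '\u00E4': return 'a'; // ä
    case '\u00F6': return 'o'; // ö
    case '\u00FC': return 'u'; // ü
    default:       return c;   // unmapped: pass through unchanged
  }
}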

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Should ASCIIFoldingFilter be deprecated?

2011-02-08 Thread David Smiley (@MITRE.org)


Robert Muir wrote:
 
 On Tue, Feb 8, 2011 at 9:12 AM, David Smiley (@MITRE.org)
 dsmi...@mitre.org wrote:
 
  I'm skeptical that whatever the difference is is relevant in the scheme of
  things. The cost of keeping it is introducing confusion for users, and more
  code to maintain.
 
  it's pretty significant. charfilters are not reusable, and box every
  character and lookup out of a hashmap (I made a patch to fix the
  reusability, but no one has commented):
  https://issues.apache.org/jira/browse/LUCENE-2788
 
  asciifoldingfilter does a huge switch (which still isn't optimal), but
  it's way, way faster than mappingcharfilter, especially since it's a
  no-op for chars < 0x7F.
 

Well then I see a path forward to speed up MappingCharFilter substantially. 
There's your LUCENE-2788, and then you could easily add the same no-op
optimization for the smallest char value in the HashMap.

-
 Author: https://www.packtpub.com/solr-1-4-enterprise-search-server/book
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Should-ASCIIFoldingFilter-be-deprecated-tp2448919p2451800.html
Sent from the Solr - Dev mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Should ASCIIFoldingFilter be deprecated?

2011-02-08 Thread Robert Muir
On Tue, Feb 8, 2011 at 10:05 AM, David Smiley (@MITRE.org)
dsmi...@mitre.org wrote:

 Well then I see a path forward to speed up MappingCharFilter substantially.
 There's your LUCENE-2788, and then you could easily add the same no-op
 optimization for the smallest char value in the HashMap.

only for the smallest starter, and still mappingcharfilter has to
maintain an array of any offset changes (this is now binary searched)
for correctOffset.

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (SOLR-2155) Geospatial search using geohash prefixes

2011-02-08 Thread David Smiley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12991999#comment-12991999
 ] 

David Smiley commented on SOLR-2155:


So Bill's talking about sorting, and Lance is talking about polygons.

Sorting: I'll try and get to it next; but this patch is low-priority for me at 
the moment.

Polygons: Lance, I figured I could already shift the coordinates off of the 
dateline, but what was mentally hurting was contemplating a snake-like 
polygon that encircles the globe, and a polygon for, say, Antarctica (a 
polygon covering a pole).  The SLERP stuff is interesting but I don't know how 
to apply it.  For someone who claims not to be a math guy, you're doing a good 
job fooling me :-)

 Geospatial search using geohash prefixes
 

 Key: SOLR-2155
 URL: https://issues.apache.org/jira/browse/SOLR-2155
 Project: Solr
  Issue Type: Improvement
Reporter: David Smiley
 Attachments: GeoHashPrefixFilter.patch, GeoHashPrefixFilter.patch, 
 GeoHashPrefixFilter.patch


 There currently isn't a solution in Solr for doing geospatial filtering on 
 documents that have a variable number of points.  This scenario occurs when 
 there is location extraction (i.e. via a gazetteer) occurring on free text.  
 None, one, or many geospatial locations might be extracted from any given 
 document and users want to limit their search results to those occurring in a 
 user-specified area.
 I've implemented this by furthering the GeoHash based work in Lucene/Solr 
 with a geohash prefix based filter.  A geohash refers to a lat-lon box on the 
 earth.  Each successive character added further subdivides the box into a 4x8 
 (or 8x4 depending on the even/odd length of the geohash) grid.  The first 
 step in this scheme is figuring out which geohash grid squares cover the 
 user's search query.  I've added various extra methods to GeoHashUtils (and 
 added tests) to assist in this purpose.  The next step is an actual Lucene 
 Filter, GeoHashPrefixFilter, that uses these geohash prefixes in 
 TermsEnum.seek() to skip to relevant grid squares in the index.  Once a 
 matching geohash grid is found, the points therein are compared against the 
 user's query to see if it matches.  I created an abstraction GeoShape 
 extended by subclasses named PointDistance... and CartesianBox to support 
 different queried shapes so that the filter need not care about these details.
 This work was presented at LuceneRevolution in Boston on October 8th.
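
A minimal sketch of the prefix matching idea described above (hypothetical 
names; the real filter walks the term index via TermsEnum.seek() instead of 
comparing strings one by one):

{code}
import java.util.Arrays;
import java.util.List;

public class GeohashPrefixSketch {
  // A query shape is first covered by a set of geohash prefixes; an
  // indexed point matches if its geohash starts with any covering prefix.
  static boolean matches(String pointGeohash, List<String> cover) {
    for (String prefix : cover) {
      if (pointGeohash.startsWith(prefix)) {
        return true;  // the point lies in a grid cell covering the shape
      }
    }
    return false;
  }

  public static void main(String[] args) {
    List<String> cover = Arrays.asList("dr5", "dr7");  // roughly the NYC area
    System.out.println(matches("dr5regw3", cover));    // true
    System.out.println(matches("9q8yyk8y", cover));    // false
  }
}
{code}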

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2911) synchronize grammar/token types across StandardTokenizer, UAX29EmailURLTokenizer, ICUTokenizer, add CJK types.

2011-02-08 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12992014#comment-12992014
 ] 

Steven Rowe commented on LUCENE-2911:
-

The generated top-level domain macro file has a bunch of new entries when I run 
this, but these are not included in your patch, and I think we should keep this 
list up-to-date.

The patch is missing HangulSupp macro generation in 
modules/icu/src/tools/.../GenerateJFlexSupplementaryMacros.java, but since the 
Hangul macro is not used in the jflex grammar, this doesn't cause a problem.

It would be nice to remove the hard-coded ranges for the intersection of Hangul 
& ALetter, but when I tried to use JFlex negation and union to produce the 
equivalent, memory usage exploded and I couldn't get JFlex to generate, so I 
guess we'll have to wait on native JFlex supplementary character support before 
we can change it.


 synchronize grammar/token types across StandardTokenizer, 
 UAX29EmailURLTokenizer, ICUTokenizer, add CJK types.
 --

 Key: LUCENE-2911
 URL: https://issues.apache.org/jira/browse/LUCENE-2911
 Project: Lucene - Java
  Issue Type: Sub-task
  Components: Analysis
Reporter: Robert Muir
Assignee: Robert Muir
 Fix For: 3.1

 Attachments: LUCENE-2911.patch


 I'd like to do LUCENE-2906 (better cjk support for these tokenizers) for a 
 future target such as 3.2
 But, in 3.1 I would like to do a little cleanup first, and synchronize all 
 these token types, etc.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Should ASCIIFoldingFilter be deprecated?

2011-02-08 Thread Robert Zotter

unsubscribe

On 2/8/11 7:05 AM, David Smiley (@MITRE.org) wrote:


Robert Muir wrote:

On Tue, Feb 8, 2011 at 9:12 AM, David Smiley (@MITRE.org)
dsmi...@mitre.org  wrote:


I'm skeptical that whatever the difference is is relevant in the scheme of
things. The cost of keeping it is introducing confusion for users, and more
code to maintain.


it's pretty significant. charfilters are not reusable, and box every
character and lookup out of a hashmap (I made a patch to fix the
reusability, but no one has commented):
https://issues.apache.org/jira/browse/LUCENE-2788

asciifoldingfilter does a huge switch (which still isn't optimal), but
it's way, way faster than mappingcharfilter, especially since it's a
no-op for chars < 0x7F.


Well then I see a path forward to speed up MappingCharFilter substantially.
There's your LUCENE-2788, and then you could easily add the same no-op
optimization for the smallest char value in the HashMap.

-
  Author: https://www.packtpub.com/solr-1-4-enterprise-search-server/book


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2881) Track FieldInfo per segment instead of per-IW-session

2011-02-08 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12992038#comment-12992038
 ] 

Simon Willnauer commented on LUCENE-2881:
-

Michael, this looks very good!

{quote}
All tests pass. Though I need to verify if the global map works correctly (it'd 
probably be good to add a test for that). Also it'd be nice to remove 
hasVectors and hasProx from SegmentInfo, but we could also do that in a 
separate issue.
{quote}
I agree we should make FieldInfos a member of SegmentInfo, remove the 
hasVectors / hasProx, and push them down. I tried to apply this patch to the 
DocValues branch but I didn't get very far - I haven't merged for a while and 
that killed me ;( 
I need to push that to after my vacation. I really hoped that I could get 
further, but it didn't work out ;(

 Track FieldInfo per segment instead of per-IW-session
 -

 Key: LUCENE-2881
 URL: https://issues.apache.org/jira/browse/LUCENE-2881
 Project: Lucene - Java
  Issue Type: Improvement
Affects Versions: Realtime Branch, CSF branch, 4.0
Reporter: Simon Willnauer
Assignee: Michael Busch
 Fix For: Realtime Branch, CSF branch, 4.0

 Attachments: lucene-2881.patch


 Currently FieldInfo is tracked per IW session to guarantee consistent global 
 field-naming / ordering. IW carries FI instances over from previous segments, 
 which also carries over field properties like isIndexed etc. While having 
 consistent field ordering per IW session appears to be important due to bulk 
 merging of stored fields etc., carrying over other properties might become 
 problematic with Lucene's Codec support.  Codecs that rely on consistent 
 properties in FI will fail if FI properties are carried over.
 The DocValuesCodec (DocValuesBranch) for instance writes files per segment 
 and field (using the field id within the file name). Yet, if a segment has no 
 DocValues indexed in a particular segment but a previous segment in the same 
 IW session had DocValues, FieldInfo#docValues will be true since those 
 values are reused from previous segments. 
 We already work around this limitation in SegmentInfo with properties like 
 hasVectors or hasProx, which is really something we should manage per Codec & 
 Segment. Ideally FieldInfo would be managed per Segment and Codec such that 
 its properties are valid per segment. It also seems to be necessary to bind 
 FieldInfoS to SegmentInfo logically since it's really just per-segment 
 metadata.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2911) synchronize grammar/token types across StandardTokenizer, UAX29EmailURLTokenizer, ICUTokenizer, add CJK types.

2011-02-08 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12992066#comment-12992066
 ] 

Robert Muir commented on LUCENE-2911:
-

{quote}
The generated top-level domain macro file has a bunch of new entries when I run 
this, but these are not included in your patch, and I think we should keep this 
list up-to-date.
{quote}

Yeah, I would re-run it before committing. In general I didn't re-generate, so 
you wouldn't see a lot of generated differences in the patch.

{quote}
The patch is missing HangulSupp macro generation in 
modules/icu/src/tools/.../GenerateJFlexSupplementaryMacros.java, but since the 
Hangul macro is not used in the jflex grammar, this doesn't cause a problem.
{quote}

Oh, I did actually mean to include this, sorry I forgot... it's a one-liner 
though, I can include it easily.


 synchronize grammar/token types across StandardTokenizer, 
 UAX29EmailURLTokenizer, ICUTokenizer, add CJK types.
 --

 Key: LUCENE-2911
 URL: https://issues.apache.org/jira/browse/LUCENE-2911
 Project: Lucene - Java
  Issue Type: Sub-task
  Components: Analysis
Reporter: Robert Muir
Assignee: Robert Muir
 Fix For: 3.1

 Attachments: LUCENE-2911.patch


 I'd like to do LUCENE-2906 (better cjk support for these tokenizers) for a 
 future target such as 3.2
 But, in 3.1 I would like to do a little cleanup first, and synchronize all 
 these token types, etc.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Welcome Dawid Weiss and Stanislaw Osinski as Lucene/Solr committers

2011-02-08 Thread Robert Muir
I'm pleased to announce that the PMC has voted in Dawid Weiss and
Stanislaw Osinski as Lucene/Solr committers!

Welcome!

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



RE: Welcome Dawid Weiss and Stanislaw Osinski as Lucene/Solr committers

2011-02-08 Thread Steven A Rowe
Welcome Stanisław and Dawid!

 -Original Message-
 From: Robert Muir [mailto:rcm...@gmail.com]
 Sent: Tuesday, February 08, 2011 1:13 PM
 To: gene...@lucene.apache.org; dev@lucene.apache.org
 Subject: Welcome Dawid Weiss and Stanislaw Osinski as Lucene/Solr
 committers
 
 I'm pleased to announce that the PMC has voted in Dawid Weiss and
 Stanislaw Osinski as Lucene/Solr committers!
 
 Welcome!


[jira] Created: (SOLR-2352) HTTP 400 Undefined Filed: * with TV component enabled.

2011-02-08 Thread Jed Glazner (JIRA)
HTTP 400 Undefined Filed: * with TV component enabled.
--

 Key: SOLR-2352
 URL: https://issues.apache.org/jira/browse/SOLR-2352
 Project: Solr
  Issue Type: Bug
  Components: SearchComponents - other
Affects Versions: 3.1
 Environment: Ubuntu 10.04/Arch solr 3.x branch r1058326
Reporter: Jed Glazner
 Fix For: 3.1


When searching using the term vector component and setting fl=*,score, the 
result is an HTTP 400 error 'undefined field: *'. If you disable the TVC, the 
search works properly.

stack trace:
 853 SEVERE: org.apache.solr.common.SolrException: undefined field: *
 854   at 
org.apache.solr.handler.component.TermVectorComponent.process(TermVectorComponent.java:142)
 855   at 
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:194)
 856   at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
 857   at org.apache.solr.core.SolrCore.execute(SolrCore.java:1357)
 858   at 
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:341)
 859   at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:244)
 860   at 
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
 861   at 
org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
 862   at 
org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
 863   at 
org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
 864   at 
org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
 865   at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
 866   at 
org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
 867   at 
org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
 868   at 
org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
 869   at org.mortbay.jetty.Server.handle(Server.java:326)
 870   at 
org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
 871   at 
org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
 872   at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
 873   at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
 874   at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
 875   at 
org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)
 876   at 
org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Welcome Dawid Weiss and Stanislaw Osinski as Lucene/Solr committers

2011-02-08 Thread Simon Willnauer
Welcome! ;)

Simon

On Tue, Feb 8, 2011 at 8:06 PM, Steven A Rowe sar...@syr.edu wrote:
 Welcome Stanisław and Dawid!

 -Original Message-
 From: Robert Muir [mailto:rcm...@gmail.com]
 Sent: Tuesday, February 08, 2011 1:13 PM
 To: gene...@lucene.apache.org; dev@lucene.apache.org
 Subject: Welcome Dawid Weiss and Stanislaw Osinski as Lucene/Solr
 committers

 I'm pleased to announce that the PMC has voted in Dawid Weiss and
 Stanislaw Osinski as Lucene/Solr committers!

 Welcome!


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Welcome Dawid Weiss and Stanislaw Osinski as Lucene/Solr committers

2011-02-08 Thread Uwe Schindler
Welcome!
--
Uwe Schindler
H.-H.-Meier-Allee 63, 28213 Bremen
http://www.thetaphi.de



Simon Willnauer simon.willna...@googlemail.com schrieb:

Welcome! ;)
Simon

On Tue, Feb 8, 2011 at 8:06 PM, Steven A Rowe sar...@syr.edu wrote:
 Welcome Stanisław and Dawid!

 -Original Message-
 From: Robert Muir [mailto:rcm...@gmail.com]
 Sent: Tuesday, February 08, 2011 1:13 PM
 To: gene...@lucene.apache.org; dev@lucene.apache.org
 Subject: Welcome Dawid Weiss and Stanislaw Osinski as Lucene/Solr
 committers

 I'm pleased to announce that the PMC has voted in Dawid Weiss and
 Stanislaw Osinski as Lucene/Solr committers!

 Welcome!

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2881) Track FieldInfo per segment instead of per-IW-session

2011-02-08 Thread Michael Busch (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Busch updated LUCENE-2881:
--

Attachment: lucene-2881.patch

New patch that removes the tracking of 'hasVectors' and 'hasProx' in 
SegmentInfo.  Instead SegmentInfo now has a reference to its corresponding 
FieldInfos.  

For backwards-compatibility reasons we can't completely remove the hasVectors 
and hasProx bytes from the serialized SegmentInfo yet.  Eg. if someone uses 
addIndexes(Directory...) to add external old pre-4.0 segments to a new index, 
we upgrade the SegmentInfo to the latest version.  However, we don't modify 
the FieldInfos of that segment, instead we just copy it over to the new dir.  
So the hasVector and hasProx bits in the FieldInfos might not be accurate and 
we have to keep those bits in the SegmentInfo instead.  Not an ideal solution, 
but we can remove it entirely in Lucene 5.0 :).  The alternative would be to
rewrite the FieldInfos instead of just copying the files, but then we have to 
rewrite the cfs files.

All core & contrib tests pass.
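
A hypothetical sketch of the kind of version-gated read being described 
(invented names and version constant; the real SegmentInfo format differs):

{code}
import java.io.DataInput;
import java.io.IOException;

class SegmentInfoReadSketch {
  static final int FORMAT_WITH_LEGACY_FLAGS = 3; // invented version number

  boolean hasVectors;
  boolean hasProx;

  void read(DataInput in, int format) throws IOException {
    if (format <= FORMAT_WITH_LEGACY_FLAGS) {
      // Old segment copied over unchanged: the flags still live in the
      // serialized SegmentInfo, because its FieldInfos were not rewritten.
      hasVectors = in.readByte() == 1;
      hasProx = in.readByte() == 1;
    }
    // Otherwise the flags are derived from the segment's own FieldInfos.
  }
}
{code}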

 Track FieldInfo per segment instead of per-IW-session
 -

 Key: LUCENE-2881
 URL: https://issues.apache.org/jira/browse/LUCENE-2881
 Project: Lucene - Java
  Issue Type: Improvement
Affects Versions: Realtime Branch, CSF branch, 4.0
Reporter: Simon Willnauer
Assignee: Michael Busch
 Fix For: Realtime Branch, CSF branch, 4.0

 Attachments: lucene-2881.patch, lucene-2881.patch


 Currently FieldInfo is tracked per IW session to guarantee consistent global 
 field-naming / ordering. IW carries FI instances over from previous segments, 
 which also carries over field properties like isIndexed etc. While having 
 consistent field ordering per IW session appears to be important due to bulk 
 merging of stored fields etc., carrying over other properties might become 
 problematic with Lucene's Codec support.  Codecs that rely on consistent 
 properties in FI will fail if FI properties are carried over.
 The DocValuesCodec (DocValuesBranch) for instance writes files per segment 
 and field (using the field id within the file name). Yet, if a segment has no 
 DocValues indexed in a particular segment but a previous segment in the same 
 IW session had DocValues, FieldInfo#docValues will be true since those 
 values are reused from previous segments. 
 We already work around this limitation in SegmentInfo with properties like 
 hasVectors or hasProx, which is really something we should manage per Codec & 
 Segment. Ideally FieldInfo would be managed per Segment and Codec such that 
 its properties are valid per segment. It also seems to be necessary to bind 
 FieldInfoS to SegmentInfo logically since it's really just per-segment 
 metadata.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2881) Track FieldInfo per segment instead of per-IW-session

2011-02-08 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12992156#comment-12992156
 ] 

Simon Willnauer commented on LUCENE-2881:
-

{quote}
New patch that removes the tracking of 'hasVectors' and 'hasProx' in 
SegmentInfo. Instead SegmentInfo now has a reference to its corresponding 
FieldInfos.
{quote}

Wow! Nice work, Michael! I like how you preserve bw compat in SegmentInfo, and 
that FieldInfos is now bound to SegmentInfo - yay! This solves two problems at 
once for the DocValues branch.

bq. The alternative would be to rewrite the FieldInfos instead of just copying 
the files, but then we have to rewrite the cfs files.

I think copying over is fine. Ideally we will move all those booleans etc. to 
the codec level so that we don't need that at all. Once stored fields and 
vectors are written by the codec we can push all that into the PreFlex codec 
(maybe!?) and get rid of the bw compat code.

I think you should commit that patch. I'll port to docvalues and run some tests 
that rely on this issue.

 Track FieldInfo per segment instead of per-IW-session
 -

 Key: LUCENE-2881
 URL: https://issues.apache.org/jira/browse/LUCENE-2881
 Project: Lucene - Java
  Issue Type: Improvement
Affects Versions: Realtime Branch, CSF branch, 4.0
Reporter: Simon Willnauer
Assignee: Michael Busch
 Fix For: Realtime Branch, CSF branch, 4.0

 Attachments: lucene-2881.patch, lucene-2881.patch


 Currently FieldInfo is tracked per IW session to guarantee consistent global 
 field-naming / ordering. IW carries FI instances over from previous segments, 
 which also carries over field properties like isIndexed etc. While having 
 consistent field ordering per IW session appears to be important due to bulk 
 merging of stored fields etc., carrying over other properties might become 
 problematic with Lucene's Codec support.  Codecs that rely on consistent 
 properties in FI will fail if FI properties are carried over.
 The DocValuesCodec (DocValuesBranch) for instance writes files per segment 
 and field (using the field id within the file name). Yet, if a segment has no 
 DocValues indexed in a particular segment but a previous segment in the same 
 IW session had DocValues, FieldInfo#docValues will be true since those 
 values are reused from previous segments. 
 We already work around this limitation in SegmentInfo with properties like 
 hasVectors or hasProx, which is really something we should manage per Codec & 
 Segment. Ideally FieldInfo would be managed per Segment and Codec such that 
 its properties are valid per segment. It also seems to be necessary to bind 
 FieldInfoS to SegmentInfo logically since it's really just per-segment 
 metadata.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2881) Track FieldInfo per segment instead of per-IW-session

2011-02-08 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12992175#comment-12992175
 ] 

Michael Busch commented on LUCENE-2881:
---

Thanks for reviewing!

bq. I think you should commit that patch. I'll port to docvalues and run some 
tests that rely on this issue.

I just want to add another test for the global fieldname -> number map; after 
that I think it'll be ready to commit.  Will do that tonight :)

 Track FieldInfo per segment instead of per-IW-session
 -

 Key: LUCENE-2881
 URL: https://issues.apache.org/jira/browse/LUCENE-2881
 Project: Lucene - Java
  Issue Type: Improvement
Affects Versions: Realtime Branch, CSF branch, 4.0
Reporter: Simon Willnauer
Assignee: Michael Busch
 Fix For: Realtime Branch, CSF branch, 4.0

 Attachments: lucene-2881.patch, lucene-2881.patch


 Currently FieldInfo is tracked per IW session to guarantee consistent global 
 field-naming / ordering. IW carries FI instances over from previous segments, 
 which also carries over field properties like isIndexed etc. While having 
 consistent field ordering per IW session appears to be important due to bulk 
 merging of stored fields etc., carrying over other properties might become 
 problematic with Lucene's Codec support.  Codecs that rely on consistent 
 properties in FI will fail if FI properties are carried over.
 The DocValuesCodec (DocValuesBranch) for instance writes files per segment 
 and field (using the field id within the file name). Yet, if a segment has no 
 DocValues indexed in a particular segment but a previous segment in the same 
 IW session had DocValues, FieldInfo#docValues will be true since those 
 values are reused from previous segments. 
 We already work around this limitation in SegmentInfo with properties like 
 hasVectors or hasProx, which is really something we should manage per Codec & 
 Segment. Ideally FieldInfo would be managed per Segment and Codec such that 
 its properties are valid per segment. It also seems to be necessary to bind 
 FieldInfoS to SegmentInfo logically since it's really just per-segment 
 metadata.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2881) Track FieldInfo per segment instead of per-IW-session

2011-02-08 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12992174#comment-12992174
 ] 

Simon Willnauer commented on LUCENE-2881:
-

I gave the patch another glance - here are a couple of very minor comments:

* maybe we should return Iterable<FieldInfo> from 
FieldInfos#getFieldInfoIterator(); this would make the iteration syntactically 
more Java-ish 
{code} 

for (FieldInfo info : getFieldInfoIterator()) {
  // do something with it
}

{code}

* If we return Iterable<FieldInfo>, should we name it getFieldInfos?
* Maybe we can simply implement Iterable<FieldInfo>? (see the sketch after 
this list)
* Maybe we can rename SI#clearFilesCache() to SI#clearCache() or simply 
SI#clear()? This would make things less coupled to the SI-internal files cache 
and more like something that is called to clear internal state after a flush. 
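
A minimal sketch of the Iterable suggestion (assumed internal field; FieldInfo 
is the existing Lucene class, the rest is not the actual patch):

{code}
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: FieldInfos implements Iterable<FieldInfo>, so
// callers can write "for (FieldInfo info : fieldInfos) { ... }" directly.
class FieldInfos implements Iterable<FieldInfo> {
  private final Map<String, FieldInfo> byName =
      new LinkedHashMap<String, FieldInfo>();

  public Iterator<FieldInfo> iterator() {
    return byName.values().iterator();
  }
}
{code}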

 Track FieldInfo per segment instead of per-IW-session
 -

 Key: LUCENE-2881
 URL: https://issues.apache.org/jira/browse/LUCENE-2881
 Project: Lucene - Java
  Issue Type: Improvement
Affects Versions: Realtime Branch, CSF branch, 4.0
Reporter: Simon Willnauer
Assignee: Michael Busch
 Fix For: Realtime Branch, CSF branch, 4.0

 Attachments: lucene-2881.patch, lucene-2881.patch


 Currently FieldInfo is tracked per IW session to guarantee consistent global 
 field-naming / ordering. IW carries FI instances over from previous segments, 
 which also carries over field properties like isIndexed etc. While having 
 consistent field ordering per IW session appears to be important due to bulk 
 merging of stored fields etc., carrying over other properties might become 
 problematic with Lucene's Codec support.  Codecs that rely on consistent 
 properties in FI will fail if FI properties are carried over.
 The DocValuesCodec (DocValuesBranch) for instance writes files per segment 
 and field (using the field id within the file name). Yet, if a segment has no 
 DocValues indexed in a particular segment but a previous segment in the same 
 IW session had DocValues, FieldInfo#docValues will be true since those 
 values are reused from previous segments. 
 We already work around this limitation in SegmentInfo with properties like 
 hasVectors or hasProx, which is really something we should manage per Codec & 
 Segment. Ideally FieldInfo would be managed per Segment and Codec such that 
 its properties are valid per segment. It also seems to be necessary to bind 
 FieldInfoS to SegmentInfo logically since it's really just per-segment 
 metadata.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Updated: (LUCENENET-392) Arabic Analyzer

2011-02-08 Thread Digy (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENENET-392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Digy updated LUCENENET-392:
---

Attachment: Analyzers.zip

I merged the Arabic analyzer and the existing Brazilian analyzer in contrib.

Since the changes are extensive, I'm posting the result as a zip file, not as a patch.

DIGY

 Arabic Analyzer
 ---

 Key: LUCENENET-392
 URL: https://issues.apache.org/jira/browse/LUCENENET-392
 Project: Lucene.Net
  Issue Type: New Feature
 Environment: Lucene.Net 2.9.2 VS2010
Reporter: Digy
Priority: Trivial
 Attachments: Analyzers.zip, Lucene.Net.Analyzers.zip


 A quick port of Lucene.Java's Arabic analyzer.
 All unit tests pass.
 DIGY

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Commented: (LUCENENET-392) Arabic Analyzer

2011-02-08 Thread Digy (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENENET-392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12992189#comment-12992189
 ] 

Digy commented on LUCENENET-392:


If there are no objections, I am going to commit it in a few days.
DIGY

 Arabic Analyzer
 ---

 Key: LUCENENET-392
 URL: https://issues.apache.org/jira/browse/LUCENENET-392
 Project: Lucene.Net
  Issue Type: New Feature
 Environment: Lucene.Net 2.9.2 VS2010
Reporter: Digy
Priority: Trivial
 Attachments: Analyzers.zip, Lucene.Net.Analyzers.zip


 A quick port of Lucene.Java's Arabic analyzer.
 All unit tests pass.
 DIGY

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




Re: Welcome Dawid Weiss and Stanislaw Osinski as Lucene/Solr committers

2011-02-08 Thread Dawid Weiss
Thank you very much, everyone! This is a great privilege and honor for me.

In the spirit of previous posters, I would like to quickly introduce
myself. I'm 32, I was born and I still live in Poznan, Poland, happily
married and with two kids on board. My computer science experience is
somewhat strange because I was a hard core assembly programmer through
primary and high school, not familiar with anything else (the internet
was not there yet, remember, and books were a scarce resource). I
learned my first high-level language during university studies... and
always thought that looked so much more complex compared to assembly
;). Nowadays I do most of my programming in Java, but am always
profoundly interested in low-level aspects of what's going on behind
the scenes. I hold a PhD in information retrieval and teach at the
local technical university in Poznan. Together with Staszek Osiński we
also develop text clustering algorithms under the Carrot2.org and
CarrotSearch.com umbrella.

Glad to be part of Lucene. I hope I will be of help to the project,
Dawid

On Tue, Feb 8, 2011 at 9:09 PM, Uwe Schindler u...@thetaphi.de wrote:
 Welcome!
 --
 Uwe Schindler
 H.-H.-Meier-Allee 63, 28213 Bremen
 http://www.thetaphi.de



 Simon Willnauer simon.willna...@googlemail.com schrieb:

 Welcome! ;)
 Simon

 On Tue, Feb 8, 2011 at 8:06 PM, Steven A Rowe sar...@syr.edu wrote:
  Welcome Stanisław and Dawid!

  -Original Message-
  From: Robert Muir [mailto:rcm...@gmail.com]
  Sent: Tuesday, February 08, 2011 1:13 PM
  To: gene...@lucene.apache.org; dev@lucene.apache.org
  Subject: Welcome Dawid Weiss and Stanislaw Osinski as Lucene/Solr committers

  I'm pleased to announce that the PMC has voted in Dawid Weiss and
  Stanislaw Osinski as Lucene/Solr committers!

  Welcome!

 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (SOLR-1711) Race condition in org/apache/solr/client/solrj/impl/StreamingUpdateSolrServer.java

2011-02-08 Thread Aakarsh Nair (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12992224#comment-12992224
 ] 

Aakarsh Nair commented on SOLR-1711:


We are still seeing this issue even after using Johannes' fix. All runners are 
exiting, and the main producer thread hangs on line 196 at queue.put(). I am 
thinking it may be because the queue is getting drained and filled fast (queue 
size 50, number of threads 20), so there might be a race condition on the queue 
capacity check: the queue appears to be below capacity to the last runner and 
then fills up via simultaneous calls to put(). I still see the issue after 
backporting what is in the 3.x branch and testing it with Solr 1.4.1. I guess a 
solution may be to use larger queue capacities for now, but the race conditions 
still seem to be present.
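
To make the failure mode concrete, here is a stripped-down sketch of the 
pattern described above (names, queue size, and the helper methods are 
illustrative assumptions, not the actual StreamingUpdateSolrServer code):

{code}
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import org.apache.solr.client.solrj.request.UpdateRequest;

// Sketch only -- sendToSolr()/nextRequest() are hypothetical helpers.
BlockingQueue<UpdateRequest> queue = new LinkedBlockingQueue<UpdateRequest>(50);

// runner thread: exits as soon as the queue looks empty
while (!queue.isEmpty()) {
  UpdateRequest req = queue.poll();
  sendToSolr(req);          // slow round trip; the queue may refill meanwhile
}

// producer thread: blocks on a full queue once no runner is left alive
queue.put(nextRequest());   // the hang reported above at line 196
{code}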

 Race condition in 
 org/apache/solr/client/solrj/impl/StreamingUpdateSolrServer.java
 --

 Key: SOLR-1711
 URL: https://issues.apache.org/jira/browse/SOLR-1711
 Project: Solr
  Issue Type: Bug
  Components: clients - java
Affects Versions: 1.4, 1.5
Reporter: Attila Babo
Assignee: Yonik Seeley
Priority: Critical
 Fix For: 1.4.1, 1.5, 3.1, 4.0

 Attachments: StreamingUpdateSolrServer.patch

   Original Estimate: 1h
  Remaining Estimate: 1h

 While inserting a large pile of documents using StreamingUpdateSolrServer, 
 there is a race condition where all Runner instances stop processing while the 
 blocking queue is full. With a high-performance client this can happen quite 
 often, and there is no way to recover from it at the client side.
 In StreamingUpdateSolrServer there is a BlockingQueue called queue to store 
 UpdateRequests, and there are up to threadCount worker threads from 
 StreamingUpdateSolrServer.Runner that read that queue and push requests to a 
 Solr instance. If at one point the BlockingQueue is empty, all workers stop 
 processing it and push the collected content to Solr, which can be a 
 time-consuming process; sometimes all worker threads are waiting for Solr. If 
 at this time the client fills the BlockingQueue to full, all worker threads 
 will quit without processing any further and the main thread will block forever.
 There is a simple, well-tested patch attached to handle this situation.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Welcome Dawid Weiss and Stanislaw Osinski as Lucene/Solr committers

2011-02-08 Thread Erick Erickson
Glad to see more committers, welcome aboard!

Erick

On Tue, Feb 8, 2011 at 5:05 PM, Dawid Weiss dawid.we...@cs.put.poznan.plwrote:

 Thank you very much, everyone! This is a great privilege and honor for me.

 In the spirit of previous posters, I would like to quickly introduce
 myself. I'm 32, I was born and I still live in Poznan, Poland, happily
 married and with two kids on board. My computer science experience is
 somewhat strange because I was a hard core assembly programmer through
 primary and high school, not familiar with anything else (the internet
 was not there yet, remember, and books were a scarce resource). I
 learned my first high-level language during university studies... and
 always thought that looked so much more complex compared to assembly
 ;). Nowadays I do most of my programming in Java, but am always
 profoundly interested in low-level aspects of what's going on behind
 the scenes. I hold a PhD in information retrieval and teach at the
 local technical university in Poznan. Together with Staszek Osiński we
 also develop text clustering algorithms under the Carrot2.org and
 CarrotSearch.com umbrella.

 Glad to be part of Lucene. I hope I will be of help to the project,
 Dawid

 On Tue, Feb 8, 2011 at 9:09 PM, Uwe Schindler u...@thetaphi.de wrote:
  Welcome!
  --
  Uwe Schindler
  H.-H.-Meier-Allee 63, 28213 Bremen
  http://www.thetaphi.de
 
 
 
  Simon Willnauer simon.willna...@googlemail.com schrieb:
 
  Welcome! ;)
  Simon

  On Tue, Feb 8, 2011 at 8:06 PM, Steven A Rowe sar...@syr.edu wrote:
   Welcome Stanisław and Dawid!

   -Original Message-
   From: Robert Muir [mailto:rcm...@gmail.com]
   Sent: Tuesday, February 08, 2011 1:13 PM
   To: gene...@lucene.apache.org; dev@lucene.apache.org
   Subject: Welcome Dawid Weiss and Stanislaw Osinski as Lucene/Solr committers

   I'm pleased to announce that the PMC has voted in Dawid Weiss and
   Stanislaw Osinski as Lucene/Solr committers!

   Welcome!

  -
  To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
  For additional commands, e-mail: dev-h...@lucene.apache.org

 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org




[jira] Commented: (LUCENE-2903) Improvement of PForDelta Codec

2011-02-08 Thread hao yan (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12992237#comment-12992237
 ] 

hao yan commented on LUCENE-2903:
-

I tried to move the memory allocation out of readBlock() into BlockReader's 
constructor. It improves the performance a little. I also tried to use 
ByteBuffer/IntBuffer to replace my manual conversion between byte[]/int[]; that 
made things worse.
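
For reference, the two conversion strategies being compared look roughly like 
this (the method names and the big-endian layout are my assumptions, not the 
patch code):

{code}
// (a) manual byte[] -> int[] conversion, big-endian
static void manualDecode(byte[] src, int[] dst) {
  for (int i = 0, j = 0; i < dst.length; i++, j += 4) {
    dst[i] = ((src[j] & 0xFF) << 24) | ((src[j + 1] & 0xFF) << 16)
           | ((src[j + 2] & 0xFF) << 8) | (src[j + 3] & 0xFF);
  }
}

// (b) the IntBuffer view that turned out to be slower here
static void bufferDecode(byte[] src, int[] dst) {
  java.nio.ByteBuffer.wrap(src).asIntBuffer().get(dst);
}
{code}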

The following is my result for 0.1M data:
(1) BulkVInt vs PatchedFrameOfRef3
Query                              QPS bulkVInt   QPS patchedFrameOfRef3   Pct diff
 united states  393.55  362.84 -7.8%
   united states~3  243.84  236.80 -2.9%
   +nebraska +states 1140.25  998.00-12.5%
 +united +states  687.76  633.31 -7.9%
doctimesecnum:[1 TO 6]  413.56  427.53  3.4%
doctitle:.*[Uu]nited.*  510.46  534.47  4.7%
  spanFirst(unit, 5) 1240.69 1108.65-10.6%
spanNear([unit, state], 10, true)  511.77  463.18 -9.5%
  states 1626.02 1483.68 -8.8%
 u*d  164.23  162.79 -0.9%
un*d  257.53  252.97 -1.8%
uni*  607.53  591.02 -2.7%
   unit* 1024.59 1043.84  1.9%
   united states  627.35  578.70 -7.8%
  united~0.6   11.51   11.36 -1.3%
 united~0.75   52.58   53.57  1.9%
unit~0.5   12.08   11.93 -1.2%
unit~0.7   50.98   51.30  0.6%

(2) FrameOfRef vs PatchedFrameOfRef3
Query                              QPS FrameOfRef   QPS patchedFrameOfRef3   Pct diff
 united states  314.76  362.71 15.2%
   united states~3  227.53  237.08  4.2%
   +nebraska +states 1075.27 1025.64 -4.6%
 +united +states  646.41  626.57 -3.1%
doctimesecnum:[1 TO 6]  412.88  429.37  4.0%
doctitle:.*[Uu]nited.*  481.70  528.82  9.8%
  spanFirst(unit, 5) 1060.45 1118.57  5.5%
spanNear([unit, state], 10, true)  409.33  467.73 14.3%
  states 1353.18 1479.29  9.3%
 u*d  158.91  165.98  4.4%
un*d  237.36  256.41  8.0%
uni*  560.22  593.12  5.9%
   unit*  946.97 1043.84 10.2%
   united states  431.22  583.09 35.2%
  united~0.6   10.91   11.37  4.2%
 united~0.75   50.30   53.30  5.9%
unit~0.5   11.54   11.94  3.5%
unit~0.7   47.38   50.38  6.3%


(3) PatchedFrameOfRef vs PatchedFrameOfRef3

Query                              QPS patchedFrameOfRef   QPS patchedFrameOfRef3   Pct diff
 united states  326.26  360.49 10.5%
   united states~3  226.50  234.69  3.6%
   +nebraska +states 1077.59 1021.45 -5.2%
 +united +states  648.51  630.52 -2.8%
doctimesecnum:[1 TO 6]  324.46  428.45 32.0%
doctitle:.*[Uu]nited.*  485.44  527.70  8.7%
  spanFirst(unit, 5) 1007.05 1111.11 10.3%
spanNear([unit, state], 10, true)  446.03  465.55  4.4%
  states 1449.28 1459.85  0.7%
 u*d  158.43  161.79  2.1%
un*d  246.37  256.28  4.0%
uni*  548.85  594.88  8.4%
   unit*  920.81 1042.75 13.2%
   united states  450.65  576.37 27.9%
  united~0.6   11.07   11.26  1.7%
 united~0.75   50.70   52.60  3.8%
unit~0.5   11.64   11.76  1.0%
unit~0.7   49.04   50.70  3.4%




 Improvement of PForDelta Codec
 --

 Key: LUCENE-2903
 URL: https://issues.apache.org/jira/browse/LUCENE-2903
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: hao yan
 Attachments: LUCENE_2903.patch, LUCENE_2903.patch


 There are 3 versions of PForDelta implementations in the Bulk Branch: 
 FrameOfRef, PatchedFrameOfRef, and PatchedFrameOfRef2.
 The FrameOfRef is a very basic one which is essentially a binary encoding 
 (and may result in a huge index size).
 The PatchedFrameOfRef is the implementation based on the original version of 
 PForDelta in the literature.
 The PatchedFrameOfRef2 is my previous implementation, which is improved this 
 time. (The Codec name is changed to NewPForDelta.)
 In particular, the changes are:
 1. I fixed the bug of my previous version (in Lucene-1410.patch), where the 
 old PForDelta did not support very large exceptions (since
 Simple16 does not support very large numbers). Now this has been fixed in 
 the new LCPForDelta.
 2. I changed the 

Re: Welcome Dawid Weiss and Stanislaw Osinski as Lucene/Solr committers

2011-02-08 Thread Yonik Seeley
On Tue, Feb 8, 2011 at 1:13 PM, Robert Muir rcm...@gmail.com wrote:
 I'm pleased to announce that the PMC has voted in Dawid Weiss and
 Stanislaw Osinski as Lucene/Solr committers!

Welcome aboard guys!

-Yonik
http://lucidimagination.com

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (SOLR-2342) Lock starvation can cause commit to never run when many clients are adding docs

2011-02-08 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12992245#comment-12992245
 ] 

Yonik Seeley commented on SOLR-2342:


bq. Passing true does make the read lock acq more costly, but I suspect this is 
in the noise for typical indexing.

I'll try to do a quick test to verify this soon.
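
For context, the fairness switch under discussion is the boolean constructor 
argument of java.util.concurrent.locks.ReentrantReadWriteLock; a minimal 
sketch of the add-vs-commit locking pattern described in the issue below (not 
the actual Solr code) might be:

{code}
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Sketch only. new ReentrantReadWriteLock() is non-fair: a steady stream of
// readers (add/update) can starve the writer (commit); passing true requests
// fair, FIFO ordering at some throughput cost.
ReentrantReadWriteLock lock = new ReentrantReadWriteLock(true);

lock.readLock().lock();     // add/update path
try {
  // index one document
} finally {
  lock.readLock().unlock();
}

lock.writeLock().lock();    // commit path
try {
  // commit the index
} finally {
  lock.writeLock().unlock();
}
{code}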

 Lock starvation can cause commit to never run when many clients are adding 
 docs
 ---

 Key: SOLR-2342
 URL: https://issues.apache.org/jira/browse/SOLR-2342
 Project: Solr
  Issue Type: Bug
  Components: update
Reporter: Michael McCandless
Priority: Minor

 I have a stress test, where 100 clients add 100 1MB docs and then call commit 
 in the end.  It's a falldown test (try to make Solr fall down) and nowhere 
 near actual usage.
 But, after some initial commits that succeed, I'm seeing later commits always 
 time out (client side timeout @ 10 minutes).  Watching Solr's logging, no 
 commit ever runs.
 Looking at the stack traces in the threads, this is not deadlock: the 
 add/update calls are running, and new segments are being flushed to the index.
 Digging in the code a bit, we use ReentrantReadWriteLock, with add/update 
 acquiring the readLock and commit acquiring the writeLock.  But, according to 
 the jdocs, the writeLock isn't given any priority over the readLock (unless 
 you set fairness, which we don't).  So I think this explains the starvation?
 However, this is not a real-world use case (most apps would/should call 
 commit less often, and from one client). Also, we could set fairness, but it 
 seems to have some performance penalty, and I'm not sure we should penalize 
 the normal case for this unusual one. E.g. see here (thanks Mark): 
 http://www.javaspecialists.eu/archive/Issue165.html.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Potential contrib module

2011-02-08 Thread Edward Drapkin

Hello all,

Pending approval (which is almost certain) from management, I have a 
potential module that I'd like to contribute if possible.  Before I go 
to management, I'd like to be able to make a case that I'm certain this 
will be approved by Lucene, although I am all but completely sure that 
there won't be a problem contributing back.


I've been working on some utilities that allow one to use Mathematica 
with Lucene - most notably a FilteredTermEnum/Query implementation that 
allow one to use arbitrary Mathematica expressions in searches.  It's 
not quite finished yet, but functional and useful.  My question is what 
do I need to do to be able to contribute this?  The dependent libraries 
(there are two, one in Java, one JNI) both ship with Mathematica and are 
largely dependent on the version of Mathematica in use, so they wouldn't 
need to ship with the module which should avoid licensing conflicts.  
Does the dependence on a proprietary piece of software, though, 
eliminate this from being contributed?  If not, what requirements are 
necessary for contributing?  Is there a guide anywhere on how to prepare 
the code (unit test requirements, documentation requirements, naming 
conventions, etc.)?


Thanks,
Eddie

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Welcome Dawid Weiss and Stanislaw Osinski as Lucene/Solr committers

2011-02-08 Thread Koji Sekiguchi

(11/02/09 3:13), Robert Muir wrote:

I'm pleased to announce that the PMC has voted in Dawid Weiss and
Stanislaw Osinski as Lucene/Solr committers!

Welcome!


Welcome!

Koji
--
http://www.rondhuit.com/en/

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Potential contrib module

2011-02-08 Thread Ryan McKinley
sounds awesome, but...  the dependency on software that is not
installed/testable in the Apache infrastructure is kind of a show
stopper for getting into the lucene code base.  In general, everyone
needs to be able to run ant test and make sure they have not broken
something.

However, check:  http://apache-extras.org

that may be a good place to host something like this.

thanks
ryan




On Tue, Feb 8, 2011 at 7:33 PM, Edward Drapkin edwa...@wolfram.com wrote:
 Hello all,

 Pending approval (which is almost certain) from management, I have a
 potential module that I'd like to contribute if possible.  Before I go to
 management, I'd like to be able to make a case that I'm certain this will be
 approved by Lucene, although I am all but completely sure that there won't
 be a problem contributing back.

 I've been working on some utilities that allow one to use Mathematica with
 Lucene - most notably a FilteredTermEnum/Query implementation that allow one
 to use arbitrary Mathematica expressions in searches.  It's not quite
 finished yet, but functional and useful.  My question is what do I need to
 do to be able to contribute this?  The dependent libraries (there are two,
 one in Java, one JNI) both ship with Mathematica and are largely dependent
 on the version of Mathematica in use, so they wouldn't need to ship with the
 module which should avoid licensing conflicts.  Does the dependence on a
 proprietary piece of software, though, eliminate this from being
 contributed?  If not, what requirements are necessary for contributing?  Is
 there a guide anywhere on how to prepare the code (unit test requirements,
 documentation requirements, naming conventions, etc.)?

 Thanks,
 Eddie

 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Welcome Dawid Weiss and Stanislaw Osinski as Lucene/Solr committers

2011-02-08 Thread Mark Miller
Welcome guys! Thanks Dawid - your turn Stanislaw Osinski ;)

- Mark

On Feb 8, 2011, at 5:05 PM, Dawid Weiss wrote:

 Thank you very much, everyone! This is a great privilege and honor for me.
 
 In the spirit of previous posters, I would like to quickly introduce
 myself. I'm 32, I was born and I still live in Poznan, Poland, happily
 married and with two kids on board. My computer science experience is
 somewhat strange because I was a hard core assembly programmer through
 primary and high school, not familiar with anything else (the internet
 was not there yet, remember, and books were a scarce resource). I
 learned my first high-level language during university studies... and
 always thought that looked so much more complex compared to assembly
 ;). Nowadays I do most of my programming in Java, but am always
 profoundly interested in low-level aspects of what's going on behind
 the scenes. I hold a PhD in information retrieval and teach at the
 local technical university in Poznan. Together with Staszek Osiński we
 also develop text clustering algorithms under the Carrot2.org and
 CarrotSearch.com umbrella.
 
 Glad to be part of Lucene. I hope I will be of help to the project,
 Dawid
 
 On Tue, Feb 8, 2011 at 9:09 PM, Uwe Schindler u...@thetaphi.de wrote:
 Welcome!
 --
 Uwe Schindler
 H.-H.-Meier-Allee 63, 28213 Bremen
 http://www.thetaphi.de
 
 
 
 Simon Willnauer simon.willna...@googlemail.com schrieb:
 
  Welcome! ;)
  Simon

  On Tue, Feb 8, 2011 at 8:06 PM, Steven A Rowe sar...@syr.edu wrote:
   Welcome Stanisław and Dawid!

   -Original Message-
   From: Robert Muir [mailto:rcm...@gmail.com]
   Sent: Tuesday, February 08, 2011 1:13 PM
   To: gene...@lucene.apache.org; dev@lucene.apache.org
   Subject: Welcome Dawid Weiss and Stanislaw Osinski as Lucene/Solr committers

   I'm pleased to announce that the PMC has voted in Dawid Weiss and
   Stanislaw Osinski as Lucene/Solr committers!

   Welcome!

  -
  To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
  For additional commands, e-mail: dev-h...@lucene.apache.org
 
 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org
 

- Mark Miller
lucidimagination.com





-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2881) Track FieldInfo per segment instead of per-IW-session

2011-02-08 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12992340#comment-12992340
 ] 

Michael Busch commented on LUCENE-2881:
---

bq. Maybe we can simply implement Iterable<FieldInfo>?

good idea - done.


bq. Maybe we can rename SI#clearFilesCache()

Actually I named it that intentionally, because all this method really does is 
clear the files cache.  SI has a separate reset() method for resetting its 
state entirely.
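
As a rough illustration of that distinction (the field names below are 
assumptions for illustration, not the actual SegmentInfo code):

{code}
// Sketch only: clearFilesCache() drops just the cached values, which are
// recomputed lazily on the next access; reset() (not shown) would
// reinitialize the whole per-segment state.
class SegmentInfoSketch {
  private java.util.List<String> files; // cached file list
  private long sizeInBytes = -1;        // cached total size

  void clearFilesCache() {
    files = null;
    sizeInBytes = -1;
  }
}
{code}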

 Track FieldInfo per segment instead of per-IW-session
 -

 Key: LUCENE-2881
 URL: https://issues.apache.org/jira/browse/LUCENE-2881
 Project: Lucene - Java
  Issue Type: Improvement
Affects Versions: Realtime Branch, CSF branch, 4.0
Reporter: Simon Willnauer
Assignee: Michael Busch
 Fix For: Realtime Branch, CSF branch, 4.0

 Attachments: lucene-2881.patch, lucene-2881.patch



-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2881) Track FieldInfo per segment instead of per-IW-session

2011-02-08 Thread Michael Busch (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Busch updated LUCENE-2881:
--

Attachment: lucene-2881.patch

New patch that adds a new JUnit test verifying that field numbering is consistent 
across segments.  It tests two cases: 1) one IW is used to write two segments; 
2) two IWs are used to write two segments.  
It also tests that addIndexes(Directory...) doesn't mess up the field 
numbering of the external segment.

All tests pass.  I'll commit this in a day or two if nobody objects.
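
As a rough illustration of case 2, the scenario exercised is along these lines 
(the analyzer choice and the assertion step are my assumptions, not the 
patch's actual test code):

{code}
Directory dir = new RAMDirectory();

// first IW session writes segment 1 with field "f1"
IndexWriter w1 = new IndexWriter(dir, new IndexWriterConfig(
    Version.LUCENE_40, new WhitespaceAnalyzer(Version.LUCENE_40)));
Document d1 = new Document();
d1.add(new Field("f1", "one", Field.Store.NO, Field.Index.ANALYZED));
w1.addDocument(d1);
w1.close();

// second IW session writes segment 2 with the same field
IndexWriter w2 = new IndexWriter(dir, new IndexWriterConfig(
    Version.LUCENE_40, new WhitespaceAnalyzer(Version.LUCENE_40)));
Document d2 = new Document();
d2.add(new Field("f1", "two", Field.Store.NO, Field.Index.ANALYZED));
w2.addDocument(d2);
w2.close();

// the test then asserts that "f1" is assigned the same field number
// in both segments' FieldInfos
{code}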

 Track FieldInfo per segment instead of per-IW-session
 -

 Key: LUCENE-2881
 URL: https://issues.apache.org/jira/browse/LUCENE-2881
 Project: Lucene - Java
  Issue Type: Improvement
Affects Versions: Realtime Branch, CSF branch, 4.0
Reporter: Simon Willnauer
Assignee: Michael Busch
 Fix For: Realtime Branch, CSF branch, 4.0

 Attachments: lucene-2881.patch, lucene-2881.patch, lucene-2881.patch



-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Welcome Dawid Weiss and Stanislaw Osinski as Lucene/Solr committers

2011-02-08 Thread Stanislaw Osinski
Hi guys, thanks for the warm welcome! It's an honor.

Like Dawid, I live in Poznan, we graduated in computer science from the same
local university. My computer science experience started from electronics,
Timex 2048 and Amiga 500/1200/PPC; I bought my first PC when I went to the
university. My present interests include information retrieval and text
mining (carrot2.org, carrotsearch.com), I'm also into UI design and
usability (csssprites.org).

Looking forward to contributing to Solr/Lucene,

Staszek

On Wed, Feb 9, 2011 at 04:01, Mark Miller markrmil...@gmail.com wrote:

 Welcome guys! Thanks Dawid - your turn Stanislaw Osinski ;)

 - Mark

 On Feb 8, 2011, at 5:05 PM, Dawid Weiss wrote:

  Thank you very much, everyone! This is a great privilege and honor for
 me.
 
  In the spirit of previous posters, I would like to quickly introduce
  myself. I'm 32, I was born and I still live in Poznan, Poland, happily
  married and with two kids on board. My computer science experience is
  somewhat strange because I was a hard core assembly programmer through
  primary and high school, not familiar with anything else (the internet
  was not there yet, remember, and books were a scarce resource). I
  learned my first high-level language during university studies... and
  always thought that looked so much more complex compared to assembly
  ;). Nowadays I do most of my programming in Java, but am always
  profoundly interested in low-level aspects of what's going on behind
  the scenes. I hold a PhD in information retrieval and teach at the
  local technical university in Poznan. Together with Staszek Osiński we
  also develop text clustering algorithms under the Carrot2.org and
  CarrotSearch.com umbrella.
 
  Glad to be part of Lucene. I hope I will be of help to the project,
  Dawid