RE: Umlauts as Char
Hi Prescott, 1- When I open the java file, I see the code as it should be. You can try to open it with Notepad and then paste into VS, for example. 2- There is an open issue reported by Pasha Bizhan that covers some languages (https://issues.apache.org/jira/browse/LUCENENET-372), but I don't know whether it is up to date or not. 3- ASCIIFoldingFilter.cs is another example of dealing with non-ASCII chars. DIGY -Original Message- From: Prescott Nasser [mailto:geobmx...@hotmail.com] Sent: Tuesday, February 08, 2011 3:55 AM To: lucene-net-dev@lucene.apache.org Subject: Umlauts as Char Hey all, While digging into the code a bit (and pushed by digy's Arabic conversion yesterday), I started looking at the various other languages we are missing from Java. I started porting the GermanAnalyzer, but ran into an issue with the umlauts... http://svn.apache.org/viewvc/lucene/java/tags/lucene_2_9_4/contrib/analyzers/common/src/java/org/apache/lucene/analysis/de/GermanStemmer.java?revision=1040993&view=co In the void substitute function you'll see them: else if ( buffer.charAt( c ) == 'ü' ) { buffer.setCharAt( c, 'u' ); } This does not constitute a character in .NET (as far as I can tell) and thus it doesn't compile. The .java file says it is encoded in UTF-8. I was thinking maybe I could do the same thing in VS2010, but I'm not finding a way, and searching on this has been difficult. Any ideas? ~Prescott
RE: Umlauts as Char
Well - with regards to number 2: it was fine to dig into the code a bit, but I guess we have a number of them already converted, although I guess they were never added to source control. Thanks for the heads up on 1 and 3. ~P From: digyd...@gmail.com To: lucene-net-dev@lucene.apache.org Subject: RE: Umlauts as Char Date: Tue, 8 Feb 2011 11:12:33 +0200
Re: Umlauts as Char
On 2011-02-08, Prescott Nasser wrote: I'm tempted to take the source at its word and just replace them with the umlaut versions (via Character Map - thanks Aaron), and then add a comment noting what it originally was in the Java source. I'd still recommend using Unicode escape sequences, since otherwise the source code depends on your local encoding - which will only lead to trouble later on. The Java code circumvents this somewhat by explicitly stating the file is UTF-8, but ASCII files are still a lot more portable. Stefan
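A minimal sketch of Stefan's suggestion, assuming a GermanStemmer-style substitution loop (the class name, method name, and the handful of mappings shown here are illustrative, not the full stemmer): the umlauts are written as `\uXXXX` escapes, so the file stays pure ASCII and compiles identically regardless of the editor's encoding.

```java
// Illustrative fragment of a GermanStemmer-style substitution loop.
// Writing '\u00FC' instead of a literal 'ü' keeps the source file ASCII-only,
// so it compiles the same way whatever encoding the editor or compiler assumes.
public class UmlautSubstitution {
    static String substituteUmlauts(String term) {
        StringBuilder buffer = new StringBuilder(term);
        for (int c = 0; c < buffer.length(); c++) {
            switch (buffer.charAt(c)) {
                case '\u00E4': buffer.setCharAt(c, 'a'); break; // ä
                case '\u00F6': buffer.setCharAt(c, 'o'); break; // ö
                case '\u00FC': buffer.setCharAt(c, 'u'); break; // ü
            }
        }
        return buffer.toString();
    }
}
```

The same `\uXXXX` escape syntax works in both Java and C#, so the escaped form ports directly to Lucene.NET.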
Re: Umlauts as Char
+1 for Unicode escape sequences. PS: I can port RussianAnalyzer/Stemmer - having looked into the Lucene contrib code, it is not as hard as I thought before. On Tue, Feb 8, 2011 at 4:33 PM, Stefan Bodewig bode...@apache.org wrote: On 2011-02-08, Digy wrote: Although Java doesn't write a BOM, VS is clever enough to open it correctly. OK, thank you. I didn't try it myself since I wasn't at a machine with VS installed when I responded. Stefan -- --Regards, Sergey Mirvoda
[jira] Commented: (SOLR-1191) NullPointerException in delta import
[ https://issues.apache.org/jira/browse/SOLR-1191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12991884#comment-12991884 ] Gunnlaugur Thor Briem commented on SOLR-1191: - bq. There seems to be TestSqlEntityProcessorDelta*.java, no? Indeed there are, and they do seem to cover delta imports to a fair degree. I must have been underslept. : ) [The Hudson coverage report|https://hudson.apache.org/hudson/job/Solr-3.x/clover/] doesn't include the contrib stuff though. NullPointerException in delta import Key: SOLR-1191 URL: https://issues.apache.org/jira/browse/SOLR-1191 Project: Solr Issue Type: Bug Components: contrib - DataImportHandler Affects Versions: 1.3, 1.4 Environment: OS: Windows & Linux. Java: 1.6 DB: MySQL & SQL Server Reporter: Ali Syed Assignee: Noble Paul Fix For: 1.4 Attachments: SOLR-1191.patch Seeing a few of these NullPointerExceptions during delta imports. Once this happens, delta import stops working and keeps giving the same error. java.lang.NullPointerException at org.apache.solr.handler.dataimport.DocBuilder.collectDelta(DocBuilder.java:622) at org.apache.solr.handler.dataimport.DocBuilder.doDelta(DocBuilder.java:240) at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:159) at org.apache.solr.handler.dataimport.DataImporter.doDeltaImport(DataImporter.java:337) at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:376) at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:355) Running delta import for a particular entity fixes the problem and delta import starts working again.
Here is the log just before and after the exception: 05/27 11:59:29 86987686 INFO btpool0-538 org.apache.solr.core.SolrCore - [localhost] webapp=/solr path=/dataimport params={command=delta-import&optimize=false} status=0 QTime=0 05/27 11:59:29 86987687 INFO Thread-4162 org.apache.solr.handler.dataimport.SolrWriter - Read dataimport.properties 05/27 11:59:29 86987687 INFO Thread-4162 org.apache.solr.handler.dataimport.DataImporter - Starting Delta Import 05/27 11:59:29 86987687 INFO Thread-4162 org.apache.solr.handler.dataimport.SolrWriter - Read dataimport.properties 05/27 11:59:29 86987687 INFO Thread-4162 org.apache.solr.handler.dataimport.DocBuilder - Starting delta collection. 05/27 11:59:29 86987690 INFO Thread-4162 org.apache.solr.handler.dataimport.DocBuilder - Running ModifiedRowKey() for Entity: content 05/27 11:59:29 86987690 INFO Thread-4162 org.apache.solr.handler.dataimport.DocBuilder - Completed ModifiedRowKey for Entity: content rows obtained : 0 05/27 11:59:29 86987690 INFO Thread-4162 org.apache.solr.handler.dataimport.DocBuilder - Completed DeletedRowKey for Entity: content rows obtained : 0 05/27 11:59:29 86987692 INFO Thread-4162 org.apache.solr.handler.dataimport.DocBuilder - Completed parentDeltaQuery for Entity: content 05/27 11:59:29 86987692 INFO Thread-4162 org.apache.solr.handler.dataimport.DocBuilder - Running ModifiedRowKey() for Entity: job 05/27 11:59:29 86987692 INFO Thread-4162 org.apache.solr.handler.dataimport.JdbcDataSource - Creating a connection for entity job with URL: jdbc:sqlserver://localhost;databaseName=TestDB 05/27 11:59:29 86987704 INFO Thread-4162 org.apache.solr.handler.dataimport.JdbcDataSource - Time taken for getConnection(): 12 05/27 11:59:29 86987707 INFO Thread-4162 org.apache.solr.handler.dataimport.DocBuilder - Completed ModifiedRowKey for Entity: job rows obtained : 0 05/27 11:59:29 86987707 INFO Thread-4162 org.apache.solr.handler.dataimport.DocBuilder - Completed DeletedRowKey for Entity: job rows obtained : 0
05/27 11:59:29 86987707 INFO Thread-4162 org.apache.solr.handler.dataimport.DocBuilder - Completed parentDeltaQuery for Entity: job 05/27 11:59:29 86987707 INFO Thread-4162 org.apache.solr.handler.dataimport.DocBuilder - Delta Import completed successfully 05/27 11:59:29 86987707 INFO Thread-4162 org.apache.solr.handler.dataimport.DocBuilder - Starting delta collection. 05/27 11:59:29 86987709 INFO Thread-4162 org.apache.solr.handler.dataimport.DocBuilder - Running ModifiedRowKey() for Entity: user 05/27 11:59:29 86987709 INFO Thread-4162 org.apache.solr.handler.dataimport.JdbcDataSource - Creating a connection for entity user with URL: jdbc:sqlserver://localhost;databaseName=TestDB 05/27 11:59:29 86987716 INFO Thread-4162 org.apache.solr.handler.dataimport.JdbcDataSource - Time taken for getConnection(): 7 05/27 11:59:29 86987873 INFO Thread-4162 org.apache.solr.handler.dataimport.DocBuilder - Completed ModifiedRowKey for
[jira] Updated: (LUCENE-2881) Track FieldInfo per segment instead of per-IW-session
[ https://issues.apache.org/jira/browse/LUCENE-2881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Busch updated LUCENE-2881: -- Attachment: lucene-2881.patch * Creates a new FieldInfos for every segment * Changes FieldInfos so that the FieldInfo numbers within a single FieldInfos don't have to be contiguous - this allows using the same numbering as the previous segment(s), even if not all fields are present in the new segment * Adds a global fieldName -> fieldNumber map; if possible, when a new field is added to a FieldInfo it tries to use an already assigned number for that field All tests pass. Though I need to verify that the global map works correctly (it'd probably be good to add a test for that). Also it'd be nice to remove hasVectors and hasProx from SegmentInfo, but we could also do that in a separate issue. Track FieldInfo per segment instead of per-IW-session - Key: LUCENE-2881 URL: https://issues.apache.org/jira/browse/LUCENE-2881 Project: Lucene - Java Issue Type: Improvement Affects Versions: Realtime Branch, CSF branch, 4.0 Reporter: Simon Willnauer Assignee: Michael Busch Fix For: Realtime Branch, CSF branch, 4.0 Attachments: lucene-2881.patch Currently FieldInfo is tracked per IW session to guarantee consistent global field-naming / ordering. IW carries FI instances over from previous segments, which also carries over field properties like isIndexed etc. While having consistent field ordering per IW session appears to be important due to bulk merging stored fields etc., carrying over other properties might become problematic with Lucene's Codec support. Codecs that rely on consistent properties in FI will fail if FI properties are carried over. The DocValuesCodec (DocValuesBranch) for instance writes files per segment and field (using the field id within the file name).
Yet, if a segment has no DocValues indexed in a particular segment but a previous segment in the same IW session had DocValues, FieldInfo#docValues will be true since those values are reused from previous segments. We already work around this limitation in SegmentInfo with properties like hasVectors or hasProx which is really something we should manage per Codec Segment. Ideally FieldInfo would be managed per Segment and Codec such that its properties are valid per segment. It also seems to be necessary to bind FieldInfoS to SegmentInfo logically since its really just per segment metadata. -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
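The global fieldName -> fieldNumber map discussed above can be sketched roughly as follows (the class and method names here are hypothetical; the real implementation is in lucene-2881.patch): the point is that a field keeps the number it was first assigned, even in later segments where other fields are absent, so per-segment FieldInfos can stay consistent without requiring contiguous numbers.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a global field-number registry: each field name is
// assigned a number once and reuses it in every subsequent segment, so
// numbering stays globally consistent even when a segment lacks some fields.
public class GlobalFieldNumbers {
    private final Map<String, Integer> numbers = new HashMap<>();
    private int nextNumber = 0;

    synchronized int numberFor(String fieldName) {
        Integer n = numbers.get(fieldName);
        if (n == null) {
            n = nextNumber++;
            numbers.put(fieldName, n);
        }
        return n;
    }
}
```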
Re: CustomScoreQueryWithSubqueries
Hi Doron. Thanks for your answer. Maybe the question seems simple, but I want to be sure about the procedure. By the way, there is a chance, if the patch is really useful, that it could be adapted for other versions (in this case, Lucene 3.0). Thanks. Regards. Fernando. From: Doron Cohen cdor...@gmail.com To: dev@lucene.apache.org Sent: Tuesday, February 8, 2011 2:54:19 Subject: Re: CustomScoreQueryWithSubqueries Hi Fernando, The wiki indeed relates mainly to trunk development. For creating a 2.9 patch, checkout code from /repos/asf/lucene/java/branches/lucene_2_9 Regards, Doron As the wiki page says, Most development is done on the trunk You can either use that, or, in order On Tue, Feb 8, 2011 at 4:56 AM, Fernando Wasylyszyn ferw...@yahoo.com.ar wrote: Robert: I'm trying to follow the steps that are mentioned in: http://wiki.apache.org/lucene-java/HowToContribute in order to make a patch with my contribution. But, in the source code that I get from: http://svn.apache.org/repos/asf/lucene/dev/trunk/ the class org.apache.lucene.search.Searcher is missing and the only method available to obtain a Scorer from a Weight object is scorer(IndexReader.AtomicReaderContext, ScorerContext). I just checked and class Searcher still exists in Lucene 3.0.3. On which version is the trunk that I've checked out based? The patch that I want to submit is based on Lucene 2.9.1. Thanks in advance. Regards. Fernando. From: Robert Muir rcm...@gmail.com To: dev@lucene.apache.org Sent: Wednesday, February 2, 2011 16:52:58 Subject: Re: CustomScoreQueryWithSubqueries On Wed, Feb 2, 2011 at 2:37 PM, Fernando Wasylyszyn ferw...@yahoo.com.ar wrote: Hi everyone. My name is Fernando and I am a researcher and developer in the R+D lab at Snoop Consulting S.R.L. in Argentina.
Based on the patch suggested in LUCENE-1608 (https://issues.apache.org/jira/browse/LUCENE-1608) and on the needs of one of our customers, for whom we are developing a customized search engine on top of Lucene and Solr, we have developed the class CustomScoreQueryWithSubqueries, a variation of CustomScoreQuery that allows the use of arbitrary Query objects besides instances of ValueSourceQuery, without the need to wrap the arbitrary query/queries in the QueryValueSource proposed in Jira, which has the disadvantage of creating an instance of an IndexSearcher in each invocation of the method getValues(IndexReader). If you think that this contribution can be useful for the Lucene community, please let me know the steps to contribute. Hi Fernando: https://issues.apache.org/jira/browse/LUCENE-1608 is still an open issue. If you have a better solution, please don't hesitate to upload a patch file to the issue! There are some more detailed instructions here: http://wiki.apache.org/lucene-java/HowToContribute
Re: CustomScoreQueryWithSubqueries
Hi Fernando, I didn't follow this really, but in general we fix stuff in trunk and then backport to older versions. Usually if something is useful for 2.9 it's also useful for 4.0/3.x, if the issue still applies. simon On Tue, Feb 8, 2011 at 12:34 PM, Fernando Wasylyszyn ferw...@yahoo.com.ar wrote: Hi Doron. Thanks for your answer. Maybe the question seems simple, but I want to be sure about the procedure. By the way, there is a chance, if the patch is really useful, that it could be adapted for other versions (in this case, Lucene 3.0). Thanks. Regards. Fernando.
Re: Should ASCIIFoldingFilter be deprecated?
On Mon, Feb 7, 2011 at 10:51 PM, Steven A Rowe sar...@syr.edu wrote: I haven't done any benchmarking, but I'm pretty sure that ASCIIFoldingFilter can achieve a significantly higher throughput rate than MappingCharFilter, and given that, it probably makes sense to keep both, to allow people to make the choice about the tradeoff between the flexibility provided by the human-readable (and editable) mapping file and the speed provided by ASCIIFoldingFilter. I agree... have you seen http://bugs.icu-project.org/trac/ticket/7743 ? Hopefully something along those lines would allow us to support the flexibility in a factory or whatever (even better as described, when you just want a small tweak) but still with good performance.
[jira] Created: (LUCENE-2911) synchronize grammar/token types across StandardTokenizer, UAX29EmailURLTokenizer, ICUTokenizer, add CJK types.
synchronize grammar/token types across StandardTokenizer, UAX29EmailURLTokenizer, ICUTokenizer, add CJK types. -- Key: LUCENE-2911 URL: https://issues.apache.org/jira/browse/LUCENE-2911 Project: Lucene - Java Issue Type: Sub-task Reporter: Robert Muir Assignee: Robert Muir Fix For: 3.1 Attachments: LUCENE-2911.patch I'd like to do LUCENE-2906 (better CJK support for these tokenizers) for a future target such as 3.2. But in 3.1 I would like to do a little cleanup first, and synchronize all these token types, etc.
[jira] Updated: (LUCENE-2911) synchronize grammar/token types across StandardTokenizer, UAX29EmailURLTokenizer, ICUTokenizer, add CJK types.
[ https://issues.apache.org/jira/browse/LUCENE-2911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-2911: Attachment: LUCENE-2911.patch After applying the patch, you have to run 'ant jflex' from modules/analysis/common and 'ant genrbbi' from modules/analysis/icu to regenerate. synchronize grammar/token types across StandardTokenizer, UAX29EmailURLTokenizer, ICUTokenizer, add CJK types. -- Key: LUCENE-2911 URL: https://issues.apache.org/jira/browse/LUCENE-2911 Project: Lucene - Java Issue Type: Sub-task Components: Analysis Reporter: Robert Muir Assignee: Robert Muir Fix For: 3.1 Attachments: LUCENE-2911.patch I'd like to do LUCENE-2906 (better CJK support for these tokenizers) for a future target such as 3.2. But in 3.1 I would like to do a little cleanup first, and synchronize all these token types, etc.
[jira] Commented: (SOLR-2342) Lock starvation can cause commit to never run when many clients are adding docs
[ https://issues.apache.org/jira/browse/SOLR-2342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12991965#comment-12991965 ] Michael McCandless commented on SOLR-2342: -- OK, passing true to the ReentrantReadWriteLock fixes the starvation in this test... Should we commit that? Passing true does make the read lock acquisition more costly, but I suspect this is in the noise for typical indexing. And, while I think the stress test is rather unnatural (normally one thread should call commit in the end), I fear many apps may (for simplicity) do something similar to this stress test. Also, the fact that auto-commit is also starved is, I think, a more realistic failure... Lock starvation can cause commit to never run when many clients are adding docs --- Key: SOLR-2342 URL: https://issues.apache.org/jira/browse/SOLR-2342 Project: Solr Issue Type: Bug Components: update Reporter: Michael McCandless Priority: Minor I have a stress test, where 100 clients add 100 1MB docs and then call commit in the end. It's a falldown test (try to make Solr fall down) and nowhere near actual usage. But, after some initial commits that succeed, I'm seeing later commits always time out (client-side timeout @ 10 minutes). Watching Solr's logging, no commit ever runs. Looking at the stack traces in the threads, this is not deadlock: the add/update calls are running, and new segments are being flushed to the index. Digging in the code a bit, we use ReentrantReadWriteLock, with add/update acquiring the readLock and commit acquiring the writeLock. But, according to the jdocs, the writeLock isn't given any priority over the readLock (unless you set fairness, which we don't). So I think this explains the starvation? However, this is not a real world use case (most apps would/should call commit less often, and from one client).
Also, we could set fairness, but it seems to have some performance penalty, and I'm not sure we should penalize the normal case for this unusual one. E.g. see here (thanks Mark): http://www.javaspecialists.eu/archive/Issue165.html.
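The fairness switch under discussion is simply the boolean constructor argument on ReentrantReadWriteLock. A non-fair lock (the default) lets a steady stream of read-lock acquisitions starve a queued writer, while the fair variant grants locks roughly in arrival order. A minimal illustration:

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class LockFairnessDemo {
    public static void main(String[] args) {
        // Default: non-fair. Readers can keep acquiring the read lock even
        // while a writer (e.g. a commit thread) is waiting, which is the
        // starvation scenario described in this issue.
        ReentrantReadWriteLock nonFair = new ReentrantReadWriteLock();

        // Fair: threads acquire roughly in arrival order, so a queued writer
        // eventually gets in, at some extra cost per acquisition.
        ReentrantReadWriteLock fair = new ReentrantReadWriteLock(true);

        System.out.println(nonFair.isFair()); // false
        System.out.println(fair.isFair());    // true
    }
}
```

Switching to the fair constructor is a one-line change; the performance question raised above is whether that per-acquisition cost matters relative to indexing work.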
RE: Should ASCIIFoldingFilter be deprecated?
Chris Hostetter-3 wrote: CharFilters and TokenFilters have different purposes though... http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#When_To_use_a_CharFilter_vs_a_TokenFilter (ie: If you use MappingCharFilter, you can't then tokenize on some of the characters you filtered away) Right, but it's hard to imagine wanting to tokenize on an accent character or some other modification specified in these particular mapping files. Steven A Rowe wrote: AFAIK, ISOLatin1AccentFilter was deprecated because ASCIIFoldingFilter provides a superset of its mappings. *If* that is the case then this file should also be removed: solr/example/solr/conf/mapping-ISOLatin1Accent.txt Steven A Rowe wrote: I haven't done any benchmarking, but I'm pretty sure that ASCIIFoldingFilter can achieve a significantly higher throughput rate than MappingCharFilter, and given that, it probably makes sense to keep both, to allow people to make the choice about the tradeoff between the flexibility provided by the human-readable (and editable) mapping file and the speed provided by ASCIIFoldingFilter. I'm skeptical that whatever the difference is is relevant in the scheme of things. The cost of keeping it is introducing confusion for users, and more code to maintain. ~ David Smiley - Author: https://www.packtpub.com/solr-1-4-enterprise-search-server/book -- View this message in context: http://lucene.472066.n3.nabble.com/Should-ASCIIFoldingFilter-be-deprecated-tp2448919p2451504.html Sent from the Solr - Dev mailing list archive at Nabble.com.
Re: Umlauts as Char
On 2011-02-08, Prescott Nasser wrote: So I can take the source code's word that 'ü' is the u with dots over it (because it says replace umlauts in the source notes). But, I guess, is that really true? Is that perhaps u with a caret over it instead? I think the case has been settled by now, but I forgot to add that I am German, so if in doubt, feel free to ask. We don't have any funny characters except for äöüÄÖÜß in German. Stefan
Re: Should ASCIIFoldingFilter be deprecated?
On Tue, Feb 8, 2011 at 9:12 AM, David Smiley (@MITRE.org) dsmi...@mitre.org wrote: I'm skeptical that whatever the difference is is relevant in the scheme of things. The cost of keeping it is introducing confusion for users, and more code to maintain. It's pretty significant. CharFilters are not reusable, and box every character and look it up out of a HashMap (I made a patch to fix the reusability, but no one has commented): https://issues.apache.org/jira/browse/LUCENE-2788 ASCIIFoldingFilter does a huge switch (which still isn't optimal), but it's way, way faster than MappingCharFilter, especially since it's a no-op for chars < 0x7F. ICUFoldingFilter precompiles a recursively decomposed trie, so its lookup is a Unicode folded trie (icu-project.org/docs/papers/foldedtrie_iuc21.ppt). I think it's a tad slower than ASCIIFoldingFilter, but it also incorporates case folding and Unicode normalization: neither ASCIIFoldingFilter nor MappingCharFilter will properly fold http://www.geonames.org/search.html?q=Ab%C5%AB+Z%CC%A7aby&country=, because there is no composed form for Z + combining cedilla, but ICUFoldingFilter will.
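The switch-based pattern Robert describes can be sketched like this (a tiny illustrative subset, nowhere near ASCIIFoldingFilter's full mapping table): plain ASCII takes the early-out path, and only characters at or above 0x80 hit the switch, avoiding any per-character boxing or HashMap lookup.

```java
// Tiny illustrative subset of switch-based folding in the style described
// for ASCIIFoldingFilter: chars below 0x80 are a no-op fast path, everything
// else is resolved by a switch rather than a per-character HashMap lookup.
public class FoldSketch {
    static char fold(char c) {
        if (c < 0x80) return c; // fast path: plain ASCII passes through
        switch (c) {
            case '\u00E0': case '\u00E1': case '\u00E4': return 'a'; // à á ä
            case '\u00F2': case '\u00F3': case '\u00F6': return 'o'; // ò ó ö
            case '\u00F9': case '\u00FA': case '\u00FC': return 'u'; // ù ú ü
            default: return c; // unmapped characters are left alone
        }
    }
}
```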
Re: Should ASCIIFoldingFilter be deprecated?
Robert Muir wrote: On Tue, Feb 8, 2011 at 9:12 AM, David Smiley (@MITRE.org) dsmi...@mitre.org wrote: I'm skeptical that whatever the difference is is relevant in the scheme of things. The cost of keeping it is introducing confusion for users, and more code to maintain. It's pretty significant. CharFilters are not reusable, and box every character and look it up out of a HashMap (I made a patch to fix the reusability, but no one has commented): https://issues.apache.org/jira/browse/LUCENE-2788 ASCIIFoldingFilter does a huge switch (which still isn't optimal), but it's way, way faster than MappingCharFilter, especially since it's a no-op for chars < 0x7F. Well then I see a path forward to speed up MappingCharFilter substantially. There's your LUCENE-2788, and then you could easily add the same no-op optimization for the smallest char value in the HashMap. - Author: https://www.packtpub.com/solr-1-4-enterprise-search-server/book
Re: Should ASCIIFoldingFilter be deprecated?
On Tue, Feb 8, 2011 at 10:05 AM, David Smiley (@MITRE.org) dsmi...@mitre.org wrote: Well then I see a path forward to speed up MappingCharFilter substantially. There's your LUCENE-2788, and then you could easily add the same no-op optimization for the smallest char value in the HashMap. Only for the smallest starter, and still MappingCharFilter has to maintain an array of any offset changes (this is now binary searched) for correctOffset.
[jira] Commented: (SOLR-2155) Geospatial search using geohash prefixes
[ https://issues.apache.org/jira/browse/SOLR-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12991999#comment-12991999 ] David Smiley commented on SOLR-2155: So Bill's talking about sorting, and Lance is talking about polygons. Sorting: I'll try and get to it next; but this patch is low-priority for me at the moment. Polygons: Lance, I figured I could do a shifting of the coordinates off of the dateline already, but what was mentally hurting is contemplating a snake-like polygon that encircled the globe. And, a polygon for, say, Antarctica (a polygon covering a pole). The SLERP stuff is interesting but I don't know how to apply it. For someone who claims not to be a math guy, you're doing a good job fooling me :-) Geospatial search using geohash prefixes Key: SOLR-2155 URL: https://issues.apache.org/jira/browse/SOLR-2155 Project: Solr Issue Type: Improvement Reporter: David Smiley Attachments: GeoHashPrefixFilter.patch, GeoHashPrefixFilter.patch, GeoHashPrefixFilter.patch There currently isn't a solution in Solr for doing geospatial filtering on documents that have a variable number of points. This scenario occurs when there is location extraction (i.e. via a gazetteer) occurring on free text. None, one, or many geospatial locations might be extracted from any given document, and users want to limit their search results to those occurring in a user-specified area. I've implemented this by furthering the GeoHash-based work in Lucene/Solr with a geohash prefix based filter. A geohash refers to a lat-lon box on the earth. Each successive character added further subdivides the box into a 4x8 (or 8x4, depending on the even/odd length of the geohash) grid. The first step in this scheme is figuring out which geohash grid squares cover the user's search query. I've added various extra methods to GeoHashUtils (and added tests) to assist in this purpose.
The next step is an actual Lucene Filter, GeoHashPrefixFilter, that uses these geohash prefixes in TermsEnum.seek() to skip to relevant grid squares in the index. Once a matching geohash grid is found, the points therein are compared against the user's query to see if it matches. I created an abstraction GeoShape extended by subclasses named PointDistance... and CartesianBox to support different queried shapes so that the filter need not care about these details. This work was presented at LuceneRevolution in Boston on October 8th.
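The geohash scheme described above (each added character subdividing the box) can be sketched with the standard textbook encoder; this is not the patch's code, but it demonstrates the prefix property the filter exploits: every geohash of a point starts with the geohash of every enclosing grid square.

```java
// Standard geohash encoding: interleave longitude/latitude bisection bits,
// emitting one base-32 character per 5 bits. A longer hash always subdivides
// the box named by its prefix, which is what the prefix filter relies on.
public class GeohashSketch {
    private static final String BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz";

    static String encode(double lat, double lon, int precision) {
        double latLo = -90, latHi = 90, lonLo = -180, lonHi = 180;
        StringBuilder hash = new StringBuilder();
        boolean evenBit = true; // even bits refine longitude, odd bits latitude
        int bits = 0, ch = 0;
        while (hash.length() < precision) {
            if (evenBit) {
                double mid = (lonLo + lonHi) / 2;
                if (lon >= mid) { ch = (ch << 1) | 1; lonLo = mid; }
                else            { ch <<= 1;           lonHi = mid; }
            } else {
                double mid = (latLo + latHi) / 2;
                if (lat >= mid) { ch = (ch << 1) | 1; latLo = mid; }
                else            { ch <<= 1;           latHi = mid; }
            }
            evenBit = !evenBit;
            if (++bits == 5) { // every 5 bits become one base-32 character
                hash.append(BASE32.charAt(ch));
                bits = 0;
                ch = 0;
            }
        }
        return hash.toString();
    }
}
```

Because truncating a hash just stops the bisection earlier, `encode(lat, lon, k)` is always a prefix of `encode(lat, lon, n)` for k < n.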
[jira] Commented: (LUCENE-2911) synchronize grammar/token types across StandardTokenizer, UAX29EmailURLTokenizer, ICUTokenizer, add CJK types.
[ https://issues.apache.org/jira/browse/LUCENE-2911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12992014#comment-12992014 ] Steven Rowe commented on LUCENE-2911: - The generated top-level domain macro file has a bunch of new entries when I run this, but these are not included in your patch, and I think we should keep this list up-to-date. The patch is missing HangulSupp macro generation in modules/icu/src/tools/.../GenerateJFlexSupplementaryMacros.java, but since the Hangul macro is not used in the JFlex grammar, this doesn't cause a problem. It would be nice to remove the hard-coded ranges for the intersection of Hangul and ALetter, but when I tried to use JFlex negation and union to produce the equivalent, memory usage exploded and I couldn't get JFlex to generate, so I guess we'll have to wait on native JFlex supplementary character support before we can change it. synchronize grammar/token types across StandardTokenizer, UAX29EmailURLTokenizer, ICUTokenizer, add CJK types. -- Key: LUCENE-2911 URL: https://issues.apache.org/jira/browse/LUCENE-2911 Project: Lucene - Java Issue Type: Sub-task Components: Analysis Reporter: Robert Muir Assignee: Robert Muir Fix For: 3.1 Attachments: LUCENE-2911.patch I'd like to do LUCENE-2906 (better CJK support for these tokenizers) for a future target such as 3.2. But in 3.1 I would like to do a little cleanup first, and synchronize all these token types, etc.
Re: Should ASCIIFoldingFilter be deprecated?
unsubscribe On 2/8/11 7:05 AM, David Smiley (@MITRE.org) wrote: Robert Muir wrote: On Tue, Feb 8, 2011 at 9:12 AM, David Smiley (@MITRE.org) dsmi...@mitre.org wrote: I'm skeptical that whatever the difference is, is relevant in the scheme of things. The cost to keeping it is introducing confusion on users, and more code to maintain. It's pretty significant: CharFilters are not reusable, and box every character and look it up in a HashMap (I made a patch to fix the reusability, but no one has commented): https://issues.apache.org/jira/browse/LUCENE-2788 ASCIIFoldingFilter does a huge switch (which still isn't optimal), but it's way, way faster than MappingCharFilter, especially since it's a no-op for chars < 0x7F. Well then I see a path forward to speed up MappingCharFilter substantially. There's your LUCENE-2788, and then you could easily add the same no-op optimization for the smallest char value in the HashMap. - Author: https://www.packtpub.com/solr-1-4-enterprise-search-server/book
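The "no-op for chars < 0x7F" fast path is easy to sketch: scan for the first non-ASCII char and return the input untouched if there is none, otherwise fall into a switch. This is a toy stand-in for ASCIIFoldingFilter's much larger switch, with only a few German mappings (class name and mapping set are illustrative, not the real filter):

```java
public class FoldSketch {
    static String fold(String s) {
        // Fast path: if every char is below 0x7F, nothing can fold - return as-is.
        int i = 0;
        while (i < s.length() && s.charAt(i) < 0x7F) i++;
        if (i == s.length()) return s;

        StringBuilder out = new StringBuilder(s.length());
        for (char c : s.toCharArray()) {
            switch (c) {                  // tiny excerpt of the real filter's big switch
                case 'ä': out.append('a'); break;
                case 'ö': out.append('o'); break;
                case 'ü': out.append('u'); break;
                case 'ß': out.append("ss"); break;
                default:  out.append(c);
            }
        }
        return out.toString();
    }
}
```

The same shape explains why the filter beats a per-character HashMap lookup: ASCII-only input never allocates or branches into the switch at all.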
[jira] Commented: (LUCENE-2881) Track FieldInfo per segment instead of per-IW-session
[ https://issues.apache.org/jira/browse/LUCENE-2881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12992038#comment-12992038 ] Simon Willnauer commented on LUCENE-2881: - Michael, this looks very good! {quote} All tests pass. Though I need to verify if the global map works correctly (it'd probably be good to add a test for that). Also it'd be nice to remove hasVectors and hasProx from SegmentInfo, but we could also do that in a separate issue. {quote} I agree we should make FieldInfos a member of SegmentInfo and remove hasVectors / hasProx and push them down. I tried to apply this patch to the DocValues branch but I didn't get very far - I haven't merged for a while and that killed me ;( I need to push that to after my vacation. I really hoped that I could get further but it didn't work out ;( Track FieldInfo per segment instead of per-IW-session - Key: LUCENE-2881 URL: https://issues.apache.org/jira/browse/LUCENE-2881 Project: Lucene - Java Issue Type: Improvement Affects Versions: Realtime Branch, CSF branch, 4.0 Reporter: Simon Willnauer Assignee: Michael Busch Fix For: Realtime Branch, CSF branch, 4.0 Attachments: lucene-2881.patch Currently FieldInfo is tracked per IW session to guarantee consistent global field-naming / ordering. IW carries FI instances over from previous segments, which also carries over field properties like isIndexed etc. While having consistent field ordering per IW session appears to be important due to bulk merging stored fields etc., carrying over other properties might become problematic with Lucene's Codec support. Codecs that rely on consistent properties in FI will fail if FI properties are carried over. The DocValuesCodec (DocValuesBranch) for instance writes files per segment and field (using the field id within the file name).
Yet, if a segment has no DocValues indexed in a particular segment but a previous segment in the same IW session had DocValues, FieldInfo#docValues will be true since those values are reused from previous segments. We already work around this limitation in SegmentInfo with properties like hasVectors or hasProx, which is really something we should manage per Codec Segment. Ideally FieldInfo would be managed per Segment and Codec such that its properties are valid per segment. It also seems to be necessary to bind FieldInfos to SegmentInfo logically since it's really just per-segment metadata.
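The "consistent global field-naming / ordering" requirement the issue describes can be met with one small shared name-to-number map per IW session, while all other field properties live per segment. A minimal standalone sketch of just that map (name and shape are hypothetical, not the actual Lucene classes):

```java
import java.util.HashMap;
import java.util.Map;

// One map per IW session hands out stable field numbers; each segment would
// keep its own per-segment properties (isIndexed, docValues, ...) separately.
public class GlobalFieldNumbers {
    private final Map<String, Integer> numbers = new HashMap<>();

    public synchronized int numberFor(String field) {
        return numbers.computeIfAbsent(field, f -> numbers.size());
    }
}
```

Two segments created in the same session then agree on the number for, say, "title", even if only one of them actually indexed DocValues for it.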
[jira] Commented: (LUCENE-2911) synchronize grammar/token types across StandardTokenizer, UAX29EmailURLTokenizer, ICUTokenizer, add CJK types.
[ https://issues.apache.org/jira/browse/LUCENE-2911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12992066#comment-12992066 ] Robert Muir commented on LUCENE-2911: - {quote} The generated top-level domain macro file has a bunch of new entries when I run this, but these are not included in your patch, and I think we should keep this list up-to-date. {quote} Yeah, I would re-run it before committing? In general I didn't re-generate, so you wouldn't see a lot of generated differences in the patch. {quote} The patch is missing HangulSupp macro generation in modules/icu/src/tools/.../GenerateJFlexSupplementaryMacros.java, but since the Hangul macro is not used in the jflex grammar, this doesn't cause a problem. {quote} Oh, I did actually mean to include this, sorry I forgot... it's a one-liner though, I can include this easily.
Welcome Dawid Weiss and Stanislaw Osinski as Lucene/Solr committers
I'm pleased to announce that the PMC has voted in Dawid Weiss and Stanislaw Osinski as Lucene/Solr committers! Welcome!
RE: Welcome Dawid Weiss and Stanislaw Osinski as Lucene/Solr committers
Welcome Stanisław and Dawid! -Original Message- From: Robert Muir [mailto:rcm...@gmail.com] Sent: Tuesday, February 08, 2011 1:13 PM To: gene...@lucene.apache.org; dev@lucene.apache.org Subject: Welcome Dawid Weiss and Stanislaw Osinski as Lucene/Solr committers I'm pleased to announce that the PMC has voted in Dawid Weiss and Stanislaw Osinski as Lucene/Solr committers! Welcome!
[jira] Created: (SOLR-2352) HTTP 400 Undefined Field: * with TV component enabled.
HTTP 400 Undefined Field: * with TV component enabled. -- Key: SOLR-2352 URL: https://issues.apache.org/jira/browse/SOLR-2352 Project: Solr Issue Type: Bug Components: SearchComponents - other Affects Versions: 3.1 Environment: Ubuntu 10.04/Arch solr 3.x branch r1058326 Reporter: Jed Glazner Fix For: 3.1 When searching using the term vector component and setting fl=*,score the result is an HTTP 400 error 'undefined field: *'. If you disable the TVC the search works properly. Stack trace:
853 SEVERE: org.apache.solr.common.SolrException: undefined field: *
854 at org.apache.solr.handler.component.TermVectorComponent.process(TermVectorComponent.java:142)
855 at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:194)
856 at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
857 at org.apache.solr.core.SolrCore.execute(SolrCore.java:1357)
858 at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:341)
859 at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:244)
860 at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
861 at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
862 at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
863 at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
864 at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
865 at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
866 at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
867 at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
868 at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
869 at org.mortbay.jetty.Server.handle(Server.java:326)
870 at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
871 at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
872 at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
873 at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
874 at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
875 at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)
876 at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
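A hypothetical guard against this class of bug: a component receiving the raw fl list should expand the * glob against the schema instead of treating it as a literal field name. This is not the actual Solr fix, only a standalone illustration of the check (all names invented):

```java
import java.util.LinkedHashSet;
import java.util.Set;

public class FieldListSketch {
    // Resolve a comma-separated fl value against known schema fields.
    // "*" expands to every field; "score" is a pseudo-field, not a schema lookup.
    static Set<String> resolve(String fl, Set<String> schemaFields) {
        Set<String> out = new LinkedHashSet<>();
        for (String raw : fl.split(",")) {
            String f = raw.trim();
            if (f.equals("*")) {
                out.addAll(schemaFields);
            } else if (f.equals("score")) {
                // skip: computed per hit, never defined in the schema
            } else if (schemaFields.contains(f)) {
                out.add(f);
            } else {
                throw new IllegalArgumentException("undefined field: " + f);
            }
        }
        return out;
    }
}
```

With such a guard, fl=*,score resolves to the concrete field set instead of reaching the schema lookup as the literal name "*".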
Re: Welcome Dawid Weiss and Stanislaw Osinski as Lucene/Solr committers
Welcome! ;) Simon On Tue, Feb 8, 2011 at 8:06 PM, Steven A Rowe sar...@syr.edu wrote: Welcome Stanisław and Dawid! -Original Message- From: Robert Muir [mailto:rcm...@gmail.com] Sent: Tuesday, February 08, 2011 1:13 PM To: gene...@lucene.apache.org; dev@lucene.apache.org Subject: Welcome Dawid Weiss and Stanislaw Osinski as Lucene/Solr committers I'm pleased to announce that the PMC has voted in Dawid Weiss and Stanislaw Osinski as Lucene/Solr committers! Welcome!
Re: Welcome Dawid Weiss and Stanislaw Osinski as Lucene/Solr committers
Welcome! -- Uwe Schindler H.-H.-Meier-Allee 63, 28213 Bremen http://www.thetaphi.de Simon Willnauer simon.willna...@googlemail.com schrieb: Welcome! ;) Simon On Tue, Feb 8, 2011 at 8:06 PM, Steven A Rowe sar...@syr.edu wrote: Welcome Stanisław and Dawid! -Original Message- From: Robert Muir [mailto:rcm...@gmail.com] Sent: Tuesday, February 08, 2011 1:13 PM To: gene...@lucene.apache.org; dev@lucene.apache.org Subject: Welcome Dawid Weiss and Stanislaw Osinski as Lucene/Solr committers I'm pleased to announce that the PMC has voted in Dawid Weiss and Stanislaw Osinski as Lucene/Solr committers! Welcome!
[jira] Updated: (LUCENE-2881) Track FieldInfo per segment instead of per-IW-session
[ https://issues.apache.org/jira/browse/LUCENE-2881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Busch updated LUCENE-2881: -- Attachment: lucene-2881.patch New patch that removes the tracking of 'hasVectors' and 'hasProx' in SegmentInfo. Instead SegmentInfo now has a reference to its corresponding FieldInfos. For backwards-compatibility reasons we can't completely remove the hasVectors and hasProx bytes from the serialized SegmentInfo yet. E.g., if someone uses addIndexes(Directory...) to add external old pre-4.0 segments to a new index, we upgrade the SegmentInfo to the latest version. However, we don't modify the FieldInfos of that segment; instead we just copy it over to the new dir. So the hasVectors and hasProx bits in the FieldInfos might not be accurate, and we have to keep those bits in the SegmentInfo instead. Not an ideal solution, but we can remove it entirely in Lucene 5.0 :). The alternative would be to rewrite the FieldInfos instead of just copying the files, but then we have to rewrite the cfs files. All core and contrib tests pass.
[jira] Commented: (LUCENE-2881) Track FieldInfo per segment instead of per-IW-session
[ https://issues.apache.org/jira/browse/LUCENE-2881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12992156#comment-12992156 ] Simon Willnauer commented on LUCENE-2881: - {quote} New patch that removes the tracking of 'hasVectors' and 'hasProx' in SegmentInfo. Instead SegmentInfo now has a reference to its corresponding FieldInfos. {quote} Wow, nice work Michael! I like how you preserve bw compat in SegmentInfo, and FieldInfos is now bound to SegmentInfo - yay! This solves two problems at once for the DocValues branch. bq. The alternative would be to rewrite the FieldInfos instead of just copying the files, but then we have to rewrite the cfs files. I think copying over is fine. Ideally we will move all those booleans etc. to the codec level so that we don't need that at all. Once stored fields and vectors are written by the codec we can push all that into the PreFlex codec (maybe!?) and get rid of the bw compat code. I think you should commit that patch. I'll port to the DocValues branch and run some tests that rely on this issue.
[jira] Commented: (LUCENE-2881) Track FieldInfo per segment instead of per-IW-session
[ https://issues.apache.org/jira/browse/LUCENE-2881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12992175#comment-12992175 ] Michael Busch commented on LUCENE-2881: --- Thanks for reviewing! bq. I think you should commit that patch. I'll port to docvalues and run some tests that rely on this issue. I just want to add another test for the global fieldname-number map; after that I think it'll be ready to commit. Will do that tonight :)
[jira] Commented: (LUCENE-2881) Track FieldInfo per segment instead of per-IW-session
[ https://issues.apache.org/jira/browse/LUCENE-2881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12992174#comment-12992174 ] Simon Willnauer commented on LUCENE-2881: - I gave the patch another glance - here are a couple of very minor comments: * Maybe we should return Iterable<FieldInfo> from FieldInfos#getFieldInfoIterator(); this would make the iteration syntactically more Java-ish: {code} for (FieldInfo info : getFieldInfoIterator()) { // do something with it } {code} * If we return Iterable<FieldInfo>, should we name it getFieldInfos? * Maybe FieldInfos can simply implement Iterable<FieldInfo>? * Maybe we can rename SI#clearFilesCache() to SI#clearCache() or simply SI#clear(); this would make things less coupled to the SI-internal cache, but rather something that is called to clear internal state after flush?
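The Iterable suggestion from the review comment looks like this in standalone form (class and field names are illustrative, not the actual Lucene types):

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

// A FieldInfos-like container that implements Iterable<FieldInfo>,
// so callers can use the enhanced for-loop directly on the container.
public class FieldInfosSketch implements Iterable<FieldInfosSketch.FieldInfo> {
    public static final class FieldInfo {
        public final String name;
        public final int number;
        FieldInfo(String name, int number) { this.name = name; this.number = number; }
    }

    private final Map<String, FieldInfo> byName = new LinkedHashMap<>();

    public void add(String name) {
        // Adding the same field twice keeps the original number.
        byName.computeIfAbsent(name, n -> new FieldInfo(n, byName.size()));
    }

    @Override
    public Iterator<FieldInfo> iterator() {
        return byName.values().iterator();
    }
}
```

With this shape, `for (FieldInfosSketch.FieldInfo info : infos) { ... }` reads exactly like the loop in the {code} snippet.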
[jira] Updated: (LUCENENET-392) Arabic Analyzer
[ https://issues.apache.org/jira/browse/LUCENENET-392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Digy updated LUCENENET-392: --- Attachment: Analyzers.zip I merged the Arabic analyzer and the existing Brazilian analyzer in contrib. Since the changes are extensive, I am posting the result as a zip file, not as a patch. DIGY Arabic Analyzer --- Key: LUCENENET-392 URL: https://issues.apache.org/jira/browse/LUCENENET-392 Project: Lucene.Net Issue Type: New Feature Environment: Lucene.Net 2.9.2 VS2010 Reporter: Digy Priority: Trivial Attachments: Analyzers.zip, Lucene.Net.Analyzers.zip A quick port of Lucene.Java's Arabic analyzer. All unit tests pass. DIGY
[jira] Commented: (LUCENENET-392) Arabic Analyzer
[ https://issues.apache.org/jira/browse/LUCENENET-392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12992189#comment-12992189 ] Digy commented on LUCENENET-392: If there are no objections, I am going to commit it in a few days. DIGY
Re: Welcome Dawid Weiss and Stanislaw Osinski as Lucene/Solr committers
Thank you very much, everyone! This is a great privilege and honor for me. In the spirit of previous posters, I would like to quickly introduce myself. I'm 32, I was born and I still live in Poznan, Poland, happily married and with two kids on board. My computer science experience is somewhat strange because I was a hard-core assembly programmer through primary and high school, not familiar with anything else (the internet was not there yet, remember, and books were a scarce resource). I learned my first high-level language during university studies... and always thought it looked so much more complex compared to assembly ;). Nowadays I do most of my programming in Java, but am always profoundly interested in the low-level aspects of what's going on behind the scenes. I hold a PhD in information retrieval and teach at the local technical university in Poznan. Together with Staszek Osiński we also develop text clustering algorithms under the Carrot2.org and CarrotSearch.com umbrella. Glad to be part of Lucene. I hope I will be of help to the project, Dawid On Tue, Feb 8, 2011 at 9:09 PM, Uwe Schindler u...@thetaphi.de wrote: Welcome! -- Uwe Schindler H.-H.-Meier-Allee 63, 28213 Bremen http://www.thetaphi.de Simon Willnauer simon.willna...@googlemail.com schrieb: Welcome! ;) Simon On Tue, Feb 8, 2011 at 8:06 PM, Steven A Rowe sar...@syr.edu wrote: Welcome Stanisław and Dawid! -Original Message- From: Robert Muir [mailto:rcm...@gmail.com] Sent: Tuesday, February 08, 2011 1:13 PM To: gene...@lucene.apache.org; dev@lucene.apache.org Subject: Welcome Dawid Weiss and Stanislaw Osinski as Lucene/Solr committers I'm pleased to announce that the PMC has voted in Dawid Weiss and Stanislaw Osinski as Lucene/Solr committers! Welcome!
[jira] Commented: (SOLR-1711) Race condition in org/apache/solr/client/solrj/impl/StreamingUpdateSolrServer.java
[ https://issues.apache.org/jira/browse/SOLR-1711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12992224#comment-12992224 ] Aakarsh Nair commented on SOLR-1711: We are still seeing this issue even after applying Johannes' fix. All runners are exiting and the main producer thread hangs on line 196, queue.put. I am thinking it may be because the queue is getting drained and filled fast (queue size 50, number of threads 20), so there might be a race condition on the queue capacity check. The queue appears to be below capacity to the last runner, then fills up by simultaneous calls to put. I still see the issue after backporting what is in the 3.x branch for testing with Solr 1.4.1. I guess a solution may be to use larger queue capacities for now, but the race conditions still seem to be present. Race condition in org/apache/solr/client/solrj/impl/StreamingUpdateSolrServer.java -- Key: SOLR-1711 URL: https://issues.apache.org/jira/browse/SOLR-1711 Project: Solr Issue Type: Bug Components: clients - java Affects Versions: 1.4, 1.5 Reporter: Attila Babo Assignee: Yonik Seeley Priority: Critical Fix For: 1.4.1, 1.5, 3.1, 4.0 Attachments: StreamingUpdateSolrServer.patch Original Estimate: 1h Remaining Estimate: 1h While inserting a large pile of documents using StreamingUpdateSolrServer there is a race condition as all Runner instances stop processing while the blocking queue is full. With a high-performance client this could happen quite often, and there is no way to recover from it at the client side. In StreamingUpdateSolrServer there is a BlockingQueue called queue to store UpdateRequests, and there are up to threadCount worker threads from StreamingUpdateSolrServer.Runner that read that queue and push requests to a Solr instance. If at one point the BlockingQueue is empty, all workers stop processing it and push the collected content to Solr, which could be a time-consuming process; sometimes all worker threads are waiting for Solr.
If at this time the client fills the BlockingQueue to capacity, all worker threads will quit without processing any further and the main thread will block forever. There is a simple, well-tested patch attached to handle this situation.
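The hang pattern above reduces to: a bounded BlockingQueue, a producer calling put() (which blocks while the queue is full), and runners that must not all exit while the producer can still block. A standalone sketch of the safe shape, where the runner only stops on an explicit shutdown marker (all names invented; this is not the SolrJ code):

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class QueueRunnerSketch {
    static final Integer POISON = Integer.MIN_VALUE; // explicit shutdown marker

    // Returns how many "requests" the runner consumed; -1 only if interrupted.
    static int produceAndConsume(int n) {
        BlockingQueue<Integer> queue = new LinkedBlockingQueue<>(5); // small bound: fills fast
        final int[] consumed = {0};
        Thread runner = new Thread(() -> {
            try {
                // The runner never exits on a momentarily empty queue;
                // it stops only when the poison value arrives.
                for (Integer doc = queue.take(); !POISON.equals(doc); doc = queue.take()) {
                    consumed[0]++; // stand-in for streaming the request to Solr
                }
            } catch (InterruptedException ignored) { }
        });
        runner.start();
        try {
            for (int i = 0; i < n; i++) {
                queue.put(i); // blocks while the queue is at capacity
            }
            queue.put(POISON);
            runner.join();
        } catch (InterruptedException e) {
            return -1;
        }
        return consumed[0];
    }
}
```

If the runner instead exited whenever the queue looked empty, a put() racing with that exit could block forever, which is the situation described in the report.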
Re: Welcome Dawid Weiss and Stanislaw Osinski as Lucene/Solr committers
Glad to see more committers, welcome aboard! Erick On Tue, Feb 8, 2011 at 5:05 PM, Dawid Weiss dawid.we...@cs.put.poznan.pl wrote: Thank you very much, everyone! This is a great privilege and honor for me. [...]
[jira] Commented: (LUCENE-2903) Improvement of PForDelta Codec
[ https://issues.apache.org/jira/browse/LUCENE-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12992237#comment-12992237 ]

hao yan commented on LUCENE-2903:
---------------------------------

I tried to move memory allocation out of readBlock() to BlockReader's constructor. It improves the performance a little. I also tried to use ByteBuffer/IntBuffer to replace my manual conversion between byte[]/int[]. It makes things worse. The following is my result for 0.1M data:

(1) BulkVInt vs PatchedFrameOfRef3

Query                               QPS BulkVInt   QPS PatchedFrameOfRef3   Pct diff
united states                             393.55                   362.84      -7.8%
united states~3                           243.84                   236.80      -2.9%
+nebraska +states                        1140.25                   998.00     -12.5%
+united +states                           687.76                   633.31      -7.9%
doctimesecnum:[1 TO 6]                    413.56                   427.53       3.4%
doctitle:.*[Uu]nited.*                    510.46                   534.47       4.7%
spanFirst(unit, 5)                       1240.69                  1108.65     -10.6%
spanNear([unit, state], 10, true)         511.77                   463.18      -9.5%
states                                   1626.02                  1483.68      -8.8%
u*d                                       164.23                   162.79      -0.9%
un*d                                      257.53                   252.97      -1.8%
uni*                                      607.53                   591.02      -2.7%
unit*                                    1024.59                  1043.84       1.9%
united states                             627.35                   578.70      -7.8%
united~0.6                                 11.51                    11.36      -1.3%
united~0.75                                52.58                    53.57       1.9%
unit~0.5                                   12.08                    11.93      -1.2%
unit~0.7                                   50.98                    51.30       0.6%

(2) FrameOfRef vs PatchedFrameOfRef3

Query                               QPS PatchedFrameOfRef   QPS PatchedFrameOfRef3   Pct diff
united states                                      314.76                   362.71      15.2%
united states~3                                    227.53                   237.08       4.2%
+nebraska +states                                 1075.27                  1025.64      -4.6%
+united +states                                    646.41                   626.57      -3.1%
doctimesecnum:[1 TO 6]                             412.88                   429.37       4.0%
doctitle:.*[Uu]nited.*                             481.70                   528.82       9.8%
spanFirst(unit, 5)                                1060.45                  1118.57       5.5%
spanNear([unit, state], 10, true)                  409.33                   467.73      14.3%
states                                            1353.18                  1479.29       9.3%
u*d                                                158.91                   165.98       4.4%
un*d                                               237.36                   256.41       8.0%
uni*                                               560.22                   593.12       5.9%
unit*                                              946.97                  1043.84      10.2%
united states                                      431.22                   583.09      35.2%
united~0.6                                          10.91                    11.37       4.2%
united~0.75                                         50.30                    53.30       5.9%
unit~0.5                                            11.54                    11.94       3.5%
unit~0.7                                            47.38                    50.38       6.3%

(3) PatchedFrameOfRef vs PatchedFrameOfRef3

Query                               QPS FrameOfRef   QPS PatchedFrameOfRef3   Pct diff
united states                               326.26                   360.49      10.5%
united states~3                             226.50                   234.69       3.6%
+nebraska +states                          1077.59                  1021.45      -5.2%
+united +states                             648.51                   630.52      -2.8%
doctimesecnum:[1 TO 6]                      324.46                   428.45      32.0%
doctitle:.*[Uu]nited.*                      485.44                   527.70       8.7%
spanFirst(unit, 5)                         1007.05                      .11      10.3%
spanNear([unit, state], 10, true)           446.03                   465.55       4.4%
states                                     1449.28                  1459.85       0.7%
u*d                                         158.43                   161.79       2.1%
un*d                                        246.37                   256.28       4.0%
uni*                                        548.85                   594.88       8.4%
unit*                                       920.81                  1042.75      13.2%
united states                               450.65                   576.37      27.9%
united~0.6                                   11.07                    11.26       1.7%
united~0.75                                  50.70                    52.60       3.8%
unit~0.5                                     11.64                    11.76       1.0%
unit~0.7                                     49.04                    50.70       3.4%

Improvement of PForDelta Codec
------------------------------

                Key: LUCENE-2903
                URL: https://issues.apache.org/jira/browse/LUCENE-2903
            Project: Lucene - Java
         Issue Type: Improvement
           Reporter: hao yan
        Attachments: LUCENE_2903.patch, LUCENE_2903.patch

There are 3 versions of PForDelta implementations in the Bulk Branch: FrameOfRef, PatchedFrameOfRef, and PatchedFrameOfRef2. FrameOfRef is a very basic one, which is essentially a binary encoding (and may result in a huge index size). PatchedFrameOfRef is the implementation based on the original version of PForDelta in the literature. PatchedFrameOfRef2 is my previous implementation, which is improved this time. (The codec name is changed to NewPForDelta.) In particular, the changes are:

1. I fixed the bug of my previous version (in Lucene-1410.patch), where the old PForDelta does not support very large exceptions (since Simple16 does not support very large numbers). Now this has been fixed in the new LCPForDelta.

2. I changed the
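The codecs compared above are all variants of frame-of-reference compression. As a rough illustration only (this is not the bulk-branch code; the class and method names below are made up), a plain FrameOfRef block stores every delta at the bit width of the largest delta. That is why one large delta can blow up the index size, and why the "patched" variants pick a smaller width and store outliers separately as exceptions (the patching step is omitted here):

```java
// Illustrative sketch, NOT the actual bulk-branch codec: frame-of-reference
// encoding stores a block of doc-ID deltas at one fixed bit width, chosen as
// the width of the largest delta in the block.
public class FrameOfRefSketch {

    // Bits needed to represent v (at least 1, so the width is never zero).
    static int bitsRequired(int v) {
        return Math.max(1, 32 - Integer.numberOfLeadingZeros(v));
    }

    // Pack each value into 'width' bits of a long[] bit buffer.
    static long[] pack(int[] vals, int width) {
        long[] out = new long[(vals.length * width + 63) / 64];
        int bit = 0;
        for (int v : vals) {
            int word = bit >>> 6, off = bit & 63;
            out[word] |= ((long) v) << off;
            if (off + width > 64) {              // value spans two words
                out[word + 1] |= ((long) v) >>> (64 - off);
            }
            bit += width;
        }
        return out;
    }

    // Read value number 'index' back out of the bit buffer.
    static int unpack(long[] packed, int width, int index) {
        int bit = index * width;
        int word = bit >>> 6, off = bit & 63;
        long v = packed[word] >>> off;
        if (off + width > 64) {
            v |= packed[word + 1] << (64 - off);
        }
        return (int) (v & ((1L << width) - 1));
    }

    public static void main(String[] args) {
        int[] docIds = {5, 12, 13, 40};          // sorted postings
        int[] deltas = new int[docIds.length - 1];
        int maxDelta = 0;
        for (int i = 1; i < docIds.length; i++) {
            deltas[i - 1] = docIds[i] - docIds[i - 1];
            maxDelta = Math.max(maxDelta, deltas[i - 1]);
        }
        int width = bitsRequired(maxDelta);      // largest delta 27 -> 5 bits
        long[] packed = pack(deltas, width);

        // Round-trip: rebuild the doc IDs from base + unpacked deltas.
        int doc = docIds[0];
        StringBuilder sb = new StringBuilder().append(doc);
        for (int i = 0; i < deltas.length; i++) {
            doc += unpack(packed, width, i);
            sb.append(' ').append(doc);
        }
        System.out.println("width=" + width + " docs=" + sb);
        // prints: width=5 docs=5 12 13 40
    }
}
```

A patched variant would choose, say, the 90th-percentile width instead of the maximum, and record the few values that don't fit in an exception list appended to the block.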
Re: Welcome Dawid Weiss and Stanislaw Osinski as Lucene/Solr committers
On Tue, Feb 8, 2011 at 1:13 PM, Robert Muir rcm...@gmail.com wrote: I'm pleased to announce that the PMC has voted in Dawid Weiss and Stanislaw Osinski as Lucene/Solr committers! Welcome aboard guys! -Yonik http://lucidimagination.com - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (SOLR-2342) Lock starvation can cause commit to never run when many clients are adding docs
[ https://issues.apache.org/jira/browse/SOLR-2342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12992245#comment-12992245 ]

Yonik Seeley commented on SOLR-2342:
------------------------------------

bq. Passing true does make the read lock acq more costly, but I suspect this is in the noise for typical indexing.

I'll try to do a quick test to verify this soon.

Lock starvation can cause commit to never run when many clients are adding docs
-------------------------------------------------------------------------------

                Key: SOLR-2342
                URL: https://issues.apache.org/jira/browse/SOLR-2342
            Project: Solr
         Issue Type: Bug
         Components: update
           Reporter: Michael McCandless
           Priority: Minor

I have a stress test, where 100 clients add 100 1MB docs and then call commit in the end. It's a falldown test (try to make Solr fall down) and nowhere near actual usage. But, after some initial commits that succeed, I'm seeing later commits always time out (client-side timeout @ 10 minutes). Watching Solr's logging, no commit ever runs. Looking at the stack traces in the threads, this is not deadlock: the add/update calls are running, and new segments are being flushed to the index.

Digging in the code a bit, we use ReentrantReadWriteLock, with add/update acquiring the readLock and commit acquiring the writeLock. But, according to the javadocs, the writeLock isn't given any priority over the readLock (unless you set fairness, which we don't). So I think this explains the starvation?

However, this is not a real-world use case (most apps would/should call commit less often, and from one client). Also, we could set fairness, but it seems to have some performance penalty, and I'm not sure we should penalize the normal case for this unusual one. E.g. see here (thanks Mark): http://www.javaspecialists.eu/archive/Issue165.html

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
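The trade-off being measured can be sketched as follows. This is not Solr's actual update handler, just a minimal illustration of the pattern described in the issue: adds share the read lock, commit takes the write lock, and passing true to the ReentrantReadWriteLock constructor enables the fair mode that keeps a queued commit from being starved by a steady stream of adds (at some cost to read-lock acquisition):

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Simplified sketch of the locking pattern in SOLR-2342 (class and method
// names are hypothetical). With the default non-fair lock, new readers can
// keep acquiring the read lock while a writer waits indefinitely; with
// fairness=true, a queued writer blocks readers that arrive after it.
public class CommitLockSketch {
    // 'true' = fair mode: lock handed out in roughly arrival order.
    final ReentrantReadWriteLock lock = new ReentrantReadWriteLock(true);

    void addDoc(Runnable work) {
        lock.readLock().lock();   // many adds may hold this concurrently
        try { work.run(); } finally { lock.readLock().unlock(); }
    }

    void commit(Runnable work) {
        lock.writeLock().lock();  // exclusive: waits for in-flight adds
        try { work.run(); } finally { lock.writeLock().unlock(); }
    }

    public static void main(String[] args) {
        CommitLockSketch s = new CommitLockSketch();
        int[] count = {0};
        s.addDoc(() -> count[0]++);
        s.addDoc(() -> count[0]++);
        s.commit(() -> System.out.println("committed " + count[0] + " docs"));
        System.out.println("fair=" + s.lock.isFair());
        // prints: committed 2 docs
        // prints: fair=true
    }
}
```

Whether the fair mode's extra cost on every read-lock acquisition matters for normal indexing is exactly what the quoted comment proposes to benchmark.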
Potential contrib module
Hello all,

Pending approval (which is almost certain) from management, I have a potential module that I'd like to contribute if possible. Before I go to management, I'd like to be able to make a case that I'm certain this will be approved by Lucene, although I am all but completely sure that there won't be a problem contributing back.

I've been working on some utilities that allow one to use Mathematica with Lucene - most notably a FilteredTermEnum/Query implementation that allows one to use arbitrary Mathematica expressions in searches. It's not quite finished yet, but functional and useful.

My question is: what do I need to do to be able to contribute this? The dependent libraries (there are two, one in Java, one JNI) both ship with Mathematica and are largely dependent on the version of Mathematica in use, so they wouldn't need to ship with the module, which should avoid licensing conflicts. Does the dependence on a proprietary piece of software, though, eliminate this from being contributed? If not, what requirements are necessary for contributing? Is there a guide anywhere on how to prepare the code (unit test requirements, documentation requirements, naming conventions, etc.)?

Thanks,
Eddie

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
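For readers unfamiliar with the FilteredTermEnum idea Eddie mentions: such a query walks the sorted term dictionary and accepts or rejects each term one at a time. The sketch below is self-contained and hypothetical (it uses none of Lucene's actual classes); a plain Predicate stands in for the callback that, in the described module, would evaluate an arbitrary Mathematica expression over each term via the Java/JNI link:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical, self-contained illustration of a FilteredTermEnum-style
// query: enumerate the sorted term dictionary and keep only terms that an
// arbitrary expression accepts. Not Lucene's API.
public class ExpressionTermFilterSketch {

    // Enumerate matching terms from a sorted term dictionary.
    static List<String> matchingTerms(List<String> termDict,
                                      Predicate<String> expr) {
        List<String> hits = new ArrayList<>();
        for (String term : termDict) {   // the enum walks terms in order
            if (expr.test(term)) {       // accept/reject test per term
                hits.add(term);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> dict = List.of("state", "states", "unit", "united");
        // Stand-in for an arbitrary expression: "term ends with s".
        List<String> hits = matchingTerms(dict, t -> t.endsWith("s"));
        System.out.println(hits);   // prints: [states]
    }
}
```

The cost profile is the caveat with this approach: an arbitrary callback has to visit every term in the dictionary, so an expensive per-term expression (or a JNI round trip per term) multiplies across the whole term index.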
Re: Welcome Dawid Weiss and Stanislaw Osinski as Lucene/Solr committers
(11/02/09 3:13), Robert Muir wrote: I'm pleased to announce that the PMC has voted in Dawid Weiss and Stanislaw Osinski as Lucene/Solr committers! Welcome! Welcome! Koji -- http://www.rondhuit.com/en/ - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: Potential contrib module
sounds awesome, but... the dependency on software that is not installed/testable in the Apache infrastructure is kind of a show stopper for getting into the lucene code base. In general, everyone needs to be able to run ant test and make sure they have not broken something. However, check: http://apache-extras.org that may be a good place to host something like this. thanks ryan On Tue, Feb 8, 2011 at 7:33 PM, Edward Drapkin edwa...@wolfram.com wrote: Hello all, Pending approval (which is almost certain) from management, I have a potential module that I'd like to contribute if possible. Before I go to management, I'd like to be able to make a case that I'm certain this will be approved by Lucene, although I am all but completely sure that there won't be a problem contributing back. I've been working on some utilities that allow one to use Mathematica with Lucene - most notably a FilteredTermEnum/Query implementation that allow one to use arbitrary Mathematica expressions in searches. It's not quite finished yet, but functional and useful. My question is what do I need to do to be able to contribute this? The dependent libraries (there are two, one in Java, one JNI) both ship with Mathematica and are largely dependent on the version of Mathematica in use, so they wouldn't need to ship with the module which should avoid licensing conflicts. Does the dependence on a proprietary piece of software, though, eliminate this from being contributed? If not, what requirements are necessary for contributing? Is there a guide anywhere on how to prepare the code (unit test requirements, documentation requirements, naming conventions, etc.)? Thanks, Eddie - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: Welcome Dawid Weiss and Stanislaw Osinski as Lucene/Solr committers
Welcome guys! Thanks Dawid - your turn Stanislaw Osinski ;)

- Mark

On Feb 8, 2011, at 5:05 PM, Dawid Weiss wrote:

Thank you very much, everyone! This is a great privilege and honor for me. In the spirit of previous posters, I would like to quickly introduce myself.

I'm 32, I was born and I still live in Poznan, Poland, happily married and with two kids on board. My computer science experience is somewhat strange because I was a hard-core assembly programmer through primary and high school, not familiar with anything else (the internet was not there yet, remember, and books were a scarce resource). I learned my first high-level language during university studies... and always thought that it looked so much more complex compared to assembly ;). Nowadays I do most of my programming in Java, but am always profoundly interested in low-level aspects of what's going on behind the scenes.

I hold a PhD in information retrieval and teach at the local technical university in Poznan. Together with Staszek Osiński we also develop text clustering algorithms under the Carrot2.org and CarrotSearch.com umbrella.

Glad to be part of Lucene. I hope I will be of help to the project,
Dawid

On Tue, Feb 8, 2011 at 9:09 PM, Uwe Schindler u...@thetaphi.de wrote:

Welcome!

--
Uwe Schindler
H.-H.-Meier-Allee 63, 28213 Bremen
http://www.thetaphi.de

Simon Willnauer simon.willna...@googlemail.com schrieb:

Welcome! ;)
Simon

On Tue, Feb 8, 2011 at 8:06 PM, Steven A Rowe sar...@syr.edu wrote:

Welcome Stanisław and Dawid!

-----Original Message-----
From: Robert Muir [mailto:rcm...@gmail.com]
Sent: Tuesday, February 08, 2011 1:13 PM
To: gene...@lucene.apache.org; dev@lucene.apache.org
Subject: Welcome Dawid Weiss and Stanislaw Osinski as Lucene/Solr committers

I'm pleased to announce that the PMC has voted in Dawid Weiss and Stanislaw Osinski as Lucene/Solr committers!

Welcome!
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org - Mark Miller lucidimagination.com - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2881) Track FieldInfo per segment instead of per-IW-session
[ https://issues.apache.org/jira/browse/LUCENE-2881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12992340#comment-12992340 ]

Michael Busch commented on LUCENE-2881:
---------------------------------------

bq. Maybe we can simply implement Iterable<FieldInfo>?

good idea - done.

bq. Maybe we can rename SI#clearFilesCache()

Actually I named it intentionally, because all this method does is really clearing the files cache. SI has a separate reset() method for resetting its state entirely.

Track FieldInfo per segment instead of per-IW-session
-----------------------------------------------------

                Key: LUCENE-2881
                URL: https://issues.apache.org/jira/browse/LUCENE-2881
            Project: Lucene - Java
         Issue Type: Improvement
   Affects Versions: Realtime Branch, CSF branch, 4.0
           Reporter: Simon Willnauer
           Assignee: Michael Busch
            Fix For: Realtime Branch, CSF branch, 4.0
        Attachments: lucene-2881.patch, lucene-2881.patch

Currently FieldInfo is tracked per IW session to guarantee consistent global field naming/ordering. IW carries FI instances over from previous segments, which also carries over field properties like isIndexed etc. While having consistent field ordering per IW session appears to be important due to bulk merging stored fields etc., carrying over other properties might become problematic with Lucene's Codec support. Codecs that rely on consistent properties in FI will fail if FI properties are carried over.

The DocValuesCodec (DocValuesBranch) for instance writes files per segment and field (using the field id within the file name). Yet, if a segment has no DocValues indexed in a particular segment but a previous segment in the same IW session had DocValues, FieldInfo#docValues will be true since those values are reused from previous segments.

We already work around this limitation in SegmentInfo with properties like hasVectors or hasProx, which is really something we should manage per Codec and Segment. Ideally FieldInfo would be managed per Segment and Codec such that its properties are valid per segment. It also seems to be necessary to bind FieldInfos to SegmentInfo logically, since it's really just per-segment metadata.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-2881) Track FieldInfo per segment instead of per-IW-session
[ https://issues.apache.org/jira/browse/LUCENE-2881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Busch updated LUCENE-2881:
----------------------------------

        Attachment: lucene-2881.patch

New patch that adds a new junit for testing that field numbering is consistent across segments. It tests two cases: 1) one IW is used to write two segments; 2) two IWs are used to write two segments. It also tests that addIndexes(Directory...) doesn't mess up the field numbering of the external segment.

All tests pass. I'll commit this in a day or two if nobody objects.

Track FieldInfo per segment instead of per-IW-session
-----------------------------------------------------

                Key: LUCENE-2881
                URL: https://issues.apache.org/jira/browse/LUCENE-2881
            Project: Lucene - Java
         Issue Type: Improvement
   Affects Versions: Realtime Branch, CSF branch, 4.0
           Reporter: Simon Willnauer
           Assignee: Michael Busch
            Fix For: Realtime Branch, CSF branch, 4.0
        Attachments: lucene-2881.patch, lucene-2881.patch, lucene-2881.patch

Currently FieldInfo is tracked per IW session to guarantee consistent global field naming/ordering. IW carries FI instances over from previous segments, which also carries over field properties like isIndexed etc. While having consistent field ordering per IW session appears to be important due to bulk merging stored fields etc., carrying over other properties might become problematic with Lucene's Codec support. Codecs that rely on consistent properties in FI will fail if FI properties are carried over.

The DocValuesCodec (DocValuesBranch) for instance writes files per segment and field (using the field id within the file name). Yet, if a segment has no DocValues indexed in a particular segment but a previous segment in the same IW session had DocValues, FieldInfo#docValues will be true since those values are reused from previous segments.

We already work around this limitation in SegmentInfo with properties like hasVectors or hasProx, which is really something we should manage per Codec and Segment. Ideally FieldInfo would be managed per Segment and Codec such that its properties are valid per segment. It also seems to be necessary to bind FieldInfos to SegmentInfo logically, since it's really just per-segment metadata.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
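The invariant the new junit checks can be sketched in a few lines (all names below are hypothetical, not Lucene's): field numbers come from one global name-to-number registry shared by every segment written into a directory, so the same field name always maps to the same number, while per-segment properties (isIndexed, docValues, ...) live in each segment's own FieldInfos:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of consistent global field numbering: one shared
// registry assigns each distinct field name a stable number, regardless of
// which segment (or which IW session) first sees the field.
public class FieldNumberRegistrySketch {
    private final Map<String, Integer> numbers = new HashMap<>();

    // Same name always yields the same number, even across segments.
    synchronized int numberFor(String fieldName) {
        // Reading size() inside the mapping function is safe; only
        // structural modification of the map is forbidden there.
        return numbers.computeIfAbsent(fieldName, n -> numbers.size());
    }

    public static void main(String[] args) {
        FieldNumberRegistrySketch global = new FieldNumberRegistrySketch();
        // Segment 1 indexes "title" then "body"; segment 2 only "body".
        int t1 = global.numberFor("title");   // 0
        int b1 = global.numberFor("body");    // 1
        int b2 = global.numberFor("body");    // still 1 in the next segment
        System.out.println(t1 + " " + b1 + " " + b2);
        // prints: 0 1 1
    }
}
```

The addIndexes(Directory...) case in the patch is the interesting one: an external segment arrives with numbers assigned by a different registry, so its fields have to be remapped into the local numbering without disturbing it.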
Re: Welcome Dawid Weiss and Stanislaw Osinski as Lucene/Solr committers
Hi guys, thanks for the warm welcome! It's an honor. Like Dawid, I live in Poznan, we graduated in computer science from the same local university. My computer science experience started from electronics, Timex 2048 and Amiga 500/1200/PPC; I bought my first PC when I went to the university. My present interests include information retrieval and text mining (carrot2.org, carrotsearch.com), I'm also into UI design and usability (csssprites.org). Looking forward to contributing to Solr/Lucene, Staszek On Wed, Feb 9, 2011 at 04:01, Mark Miller markrmil...@gmail.com wrote: Welcome guys! Thanks Dawid - your turn Stanislaw Osinksi ;) - Mark On Feb 8, 2011, at 5:05 PM, Dawid Weiss wrote: Thank you very much, everyone! This is a great privilege and honor for me. In the spirit of previous posters, I would like to quickly introduce myself. I'm 32, I was born and I still live in Poznan, Poland, happily married and with two kids on board. My computer science experience is somewhat strange because I was a hard core assembly programmer through primary and high school, not familiar with anything else (the internet was not there yet, remember, and books were a scarce resource). I learned my first high-level language during university studies... and always thought that looked so much more complex compared to assembly ;). Nowadays I do most of my programming in Java, but am always profoundly interested in low-level aspects of what's going on behind the scenes. I hold a PhD in information retrieval and teach at the local technical university in Poznan. Together with Staszek Osiński we also develop text clustering algorithms under the Carrot2.org and CarrotSearch.com umbrella. Glad to be part of Lucene. I hope I will be of help to the project, Dawid