Re: need some help =)
I need a Turkish analyzer. My Lucene book says I need to use SnowballAnalyzer, but I can't access it as Lucene.Net.Analysis.Snowball. Should I install another library to use it?

On 17.11.2010 21:12, Granroth, Neal V. wrote:
You need to pick a suitable analyzer for use during indexing and for queries. The StandardAnalyzer you are using will most likely break the words apart at the non-English characters. You might want to consider using the Luke tool to inspect the index you've created and see how the words in your documents were split and indexed.
- Neal

-----Original Message-----
From: asmcad [mailto:asm...@gmail.com]
Sent: Wednesday, November 17, 2010 3:06 PM
To: lucene-net-dev@lucene.apache.org
Subject: Re: need some help =)

I solved the problem. Now I have a non-English character problem: when I search for something with çşğuı characters (I'm not sure you can see this), I don't get any results. How can I solve this? By the way, sorry about the content getting messed up =) Thanks for the previous help =)

On 17.11.2010 20:16, Digy wrote:

using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.Linq;
using System.Text;
using System.Windows.Forms;
using Lucene.Net;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;
using System.IO;

namespace newLucene
{
    public partial class Form1 : Form
    {
        public Form1()
        {
            InitializeComponent();
        }

        private void buttonIndex_Click(object sender, EventArgs e)
        {
            IndexWriter indexwrtr = new IndexWriter(@"c:\index\", new StandardAnalyzer(), true);
            Document doc = new Document();
            string filename = @"fer.txt";
            Lucene.Net.QueryParsers.QueryParser df;

            System.IO.StreamReader local_StreamReader = new System.IO.StreamReader(@"C:\z\fer.txt");
            string file_text = local_StreamReader.ReadToEnd();

            System.Text.UTF8Encoding encoding = new System.Text.UTF8Encoding();
            doc.Add(new Field("text", encoding.GetBytes(file_text), Field.Store.YES));
            doc.Add(new Field("path", encoding.GetBytes(@"C:\z\"), Field.Store.YES));
            doc.Add(new Field("title", encoding.GetBytes(filename), Field.Store.YES));
            indexwrtr.AddDocument(doc);

            indexwrtr.Optimize();
            indexwrtr.Close();
        }

        private void buttonSearch_Click(object sender, EventArgs e)
        {
            IndexSearcher indxsearcher = new IndexSearcher(@"C:\index\");

            QueryParser parser = new QueryParser("contents", new StandardAnalyzer());
            Query query = parser.Parse(textBoxQuery.Text);

            //Lucene.Net.QueryParsers.QueryParser qp = new QueryParser(Lucene.Net.QueryParsers.CharStream s).Parse(textBoxQuery.Text);
            Hits hits = indxsearcher.Search(query);

            for (int i = 0; i < hits.Length(); i++)
            {
                Document doc = hits.Doc(i);

                string filename = doc.Get("title");
                string path = doc.Get("path");
                string folder = Path.GetDirectoryName(path);

                ListViewItem item = new ListViewItem(new string[] { null, filename, "asd", hits.Score(i).ToString() });
                item.Tag = path;

                this.listViewResults.Items.Add(item);
                Application.DoEvents();
            }

            indxsearcher.Close();
        }
    }
}

thanks
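A minimal sketch of what the SnowballAnalyzer route could look like here. It assumes the contrib Snowball assembly that ships alongside Lucene.Net is referenced separately (which is why the namespace is not visible from the core DLL alone) and that its stemmer set actually includes "Turkish"; both assumptions need to be verified for the Lucene.Net version in use. The key point is that the same analyzer must be used for indexing and for query parsing:

using Lucene.Net.Analysis.Snowball;   // contrib assembly, assumed namespace, referenced separately
using Lucene.Net.Index;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;

public static class TurkishSearchSketch
{
    public static void IndexAndSearch()
    {
        // Use the same stemming analyzer for indexing and for query parsing,
        // otherwise stemmed terms will not match at search time.
        var analyzer = new SnowballAnalyzer("Turkish");   // "Turkish" assumed to be a supported stemmer name

        var writer = new IndexWriter(@"c:\index\", analyzer, true);
        // ... add documents as in the listing above ...
        writer.Optimize();
        writer.Close();

        var searcher = new IndexSearcher(@"c:\index\");
        var parser = new QueryParser("text", analyzer);
        Hits hits = searcher.Search(parser.Parse("aranan kelime"));
        searcher.Close();
    }
}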
Re: need some help =)
I'll try, thanks =)

On 17.11.2010 21:14, Digy wrote:
Try to see what you are indexing: http://mail-archives.apache.org/mod_mbox/lucene-lucene-net-user/201011.mbox/%3caanlktim6kyuzhwb8p7g=hvqx6dy1fkarchro0hyw+...@mail.gmail.com%3e
And you can also think of using ASCIIFoldingFilter if it fits your needs.
DIGY
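For the ASCIIFoldingFilter idea, one way to wire it in is a small custom analyzer along these lines. This is only a sketch against the Lucene.Net 2.9-style analysis API (the filter classes should exist there, but the exact constructor overloads are worth double-checking), and folding ç/ş/ğ/ı down to plain ASCII is only appropriate if accent-insensitive matching fits the application:

using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;

// Tokenizes, lowercases, then folds non-ASCII characters to their closest
// ASCII equivalents, so accented and unaccented forms index to the same terms.
public class FoldingAnalyzer : Analyzer
{
    public override TokenStream TokenStream(string fieldName, TextReader reader)
    {
        TokenStream stream = new StandardTokenizer(reader);
        stream = new StandardFilter(stream);
        stream = new LowerCaseFilter(stream);
        stream = new ASCIIFoldingFilter(stream);
        return stream;
    }
}

The folding has to happen on both sides: buttonIndex_Click and buttonSearch_Click would both create a FoldingAnalyzer instead of a StandardAnalyzer, otherwise the missing-results problem comes right back.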
Re: ASF Public Mail Archives on Amazon S3
Hmmm, let me look. I don't know if I will be able to recover it On Nov 17, 2010, at 1:48 PM, Michael McCandless wrote: Grant, public_p_r.tar seems to be missing? Is that intentional? Maybe some super-secret project inside there :) Mike On Thu, Oct 14, 2010 at 12:05 PM, Grant Ingersoll gsing...@apache.org wrote: Hi ORPers, I put up the complete ASF public mail archives as of about 3 weeks ago on Amazon's S3 and have made them public (let me know if I messed up, it is the first time I've done this). I also intend, in the coming weeks, to convert them into Mahout files (if anyone wants to help let me know). There are 5 files: https://s3.amazonaws.com/asf-mail-archives/public_a_d.tar https://s3.amazonaws.com/asf-mail-archives/public_e_k.tar https://s3.amazonaws.com/asf-mail-archives/public_l_o.tar https://s3.amazonaws.com/asf-mail-archives/public_s_t.tar https://s3.amazonaws.com/asf-mail-archives/public_u_z.tar The tarballs are organized by Top Level Project name (i.e. Mahout is in the public_l_o.tar file). The tarballs contain GZIP files by date, I believe. I believe the total uncompressed file size is somewhere in the 80-100GB range. That should be sufficient to drive some semi-interesting things in terms of scale, even if it is towards the smaller end of things. As the ASF has very clear public mailing list archive policies, it is my belief that this data set is completely unencumbered. From an ORP standpoint, this might make for a first data set for evaluation once we have the evaluator framework in place. Cheers, Grant
[jira] Commented: (LUCENE-2755) Some improvements to CMS
[ https://issues.apache.org/jira/browse/LUCENE-2755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12932864#action_12932864 ] Earwin Burrfoot commented on LUCENE-2755: - {quote} If we proceed w/ your proposal, that is basically the MS/ME polling MP, and not IW doing so, how would IW know about the running merges and pending ones? Today IW tracks those two lists so that if you need to abort merges, it knows which ones to abort. We can workaround aborting the running merges by introducing a MS.abort()-like method. But what about MP? Now the lists are divided between too entities (MP and MS), and aborting a MP does not make sense (doable, but I don't think it belongs there). {quote} There are no lists at all with my approach. At least no pending list, that one gets recalculated each time we poll MP and it never gets out, neither gets stored inside. There's a kind of implicit in flight list - MS has the knowledge of its threads that are currently doing things. And if you want to go around aborting things, MS is probably the right place to do this. bq. Maybe we can have MS.abort() poll MP for next merges until it returns null, and throwing all the returned ones away - that can be done. So, just I said - that's not needed. MP is empty, it has no state. bq. Should we, in the scope of this issue, make IW a required settable parameter on MS, like we do w/ MP? For the love of God, no. I'd like to see it removed from MP too. It's only natural to pass the same instance of Policy or Scheduler to different Writers, so they have the same behaviour and share Scheduler resources (insanely important if you have fifteen indexes like I do and don't want them to rape hardware with fifteen simultaneous merges). It is against the nature to pass Writer to Policy. Does the Policy need to write anything on its own, when it decides to? No. It should advice, not act. Some improvements to CMS Key: LUCENE-2755 URL: https://issues.apache.org/jira/browse/LUCENE-2755 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Shai Erera Assignee: Shai Erera Priority: Minor Fix For: 3.1, 4.0 While running optimize on a large index, I've noticed several things that got me to read CMS code more carefully, and find these issues: * CMS may hold onto a merge if maxMergeCount is hit. That results in the MergeThreads taking merges from the IndexWriter until they are exhausted, and only then that blocked merge will run. I think it's unnecessary that that merge will be blocked. * CMS sorts merges by segments size, doc-based and not bytes-based. Since the default MP is LogByteSizeMP, and I hardly believe people care about doc-based size segments anymore, I think we should switch the default impl. There are two ways to make it extensible, if we want: ** Have an overridable member/method in CMS that you can extend and override - easy. ** Have OneMerge be comparable and let the MP determine the order (e.g. by bytes, docs, calibrate deletes etc.). Better, but will need to tap into several places in the code, so more risky and complicated. On the go, I'd like to add some documentation to CMS - it's not very easy to read and follow. I'll work on a patch. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Solr-trunk - Build # 1315 - Still Failing
Build: https://hudson.apache.org/hudson/job/Solr-trunk/1315/ All tests passed Build Log (for compile errors): [...truncated 18459 lines...] - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Lucene-Solr-tests-only-trunk - Build # 1512 - Failure
Build: https://hudson.apache.org/hudson/job/Lucene-Solr-tests-only-trunk/1512/ 18 tests failed. REGRESSION: org.apache.solr.cloud.BasicDistributedZkTest.testDistribSearch Error Message: KeeperErrorCode = ConnectionLoss for /configs/conf1/synonyms.txt Stack Trace: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /configs/conf1/synonyms.txt at org.apache.zookeeper.KeeperException.create(KeeperException.java:90) at org.apache.zookeeper.KeeperException.create(KeeperException.java:42) at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1038) at org.apache.solr.common.cloud.SolrZkClient.setData(SolrZkClient.java:225) at org.apache.solr.common.cloud.SolrZkClient.setData(SolrZkClient.java:389) at org.apache.solr.common.cloud.SolrZkClient.setData(SolrZkClient.java:411) at org.apache.solr.cloud.AbstractZkTestCase.putConfig(AbstractZkTestCase.java:97) at org.apache.solr.cloud.AbstractZkTestCase.buildZooKeeper(AbstractZkTestCase.java:90) at org.apache.solr.cloud.AbstractDistributedZkTestCase.setUp(AbstractDistributedZkTestCase.java:47) at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:881) at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:847) FAILED: junit.framework.TestSuite.org.apache.solr.cloud.BasicZkTest Error Message: org.apache.solr.common.cloud.ZooKeeperException: Stack Trace: java.lang.RuntimeException: org.apache.solr.common.cloud.ZooKeeperException: at org.apache.solr.util.TestHarness.init(TestHarness.java:152) at org.apache.solr.util.TestHarness.init(TestHarness.java:134) at org.apache.solr.util.TestHarness.init(TestHarness.java:124) at org.apache.solr.SolrTestCaseJ4.initCore(SolrTestCaseJ4.java:247) at org.apache.solr.SolrTestCaseJ4.initCore(SolrTestCaseJ4.java:110) at org.apache.solr.SolrTestCaseJ4.initCore(SolrTestCaseJ4.java:98) at org.apache.solr.cloud.AbstractZkTestCase.azt_beforeClass(AbstractZkTestCase.java:64) Caused by: org.apache.solr.common.cloud.ZooKeeperException: at org.apache.solr.core.CoreContainer.register(CoreContainer.java:530) at org.apache.solr.util.TestHarness$Initializer.initialize(TestHarness.java:191) at org.apache.solr.util.TestHarness.init(TestHarness.java:139) Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /collections at org.apache.zookeeper.KeeperException.create(KeeperException.java:90) at org.apache.zookeeper.KeeperException.create(KeeperException.java:42) at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1038) at org.apache.solr.common.cloud.SolrZkClient.setData(SolrZkClient.java:225) at org.apache.solr.common.cloud.SolrZkClient.setData(SolrZkClient.java:389) at org.apache.solr.cloud.ZkController.addZkShardsNode(ZkController.java:159) at org.apache.solr.cloud.ZkController.register(ZkController.java:481) at org.apache.solr.core.CoreContainer.register(CoreContainer.java:521) FAILED: junit.framework.TestSuite.org.apache.solr.cloud.BasicZkTest Error Message: ERROR: SolrIndexSearcher opens=1 closes=0 Stack Trace: junit.framework.AssertionFailedError: ERROR: SolrIndexSearcher opens=1 closes=0 at org.apache.solr.SolrTestCaseJ4.endTrackingSearchers(SolrTestCaseJ4.java:128) at org.apache.solr.SolrTestCaseJ4.deleteCore(SolrTestCaseJ4.java:302) at org.apache.solr.SolrTestCaseJ4.afterClassSolrTestCase(SolrTestCaseJ4.java:79) REGRESSION: org.apache.solr.cloud.CloudStateUpdateTest.testCoreRegistration Error Message: null Stack Trace: 
org.apache.solr.common.cloud.ZooKeeperException: at org.apache.solr.core.CoreContainer.load(CoreContainer.java:441) at org.apache.solr.core.CoreContainer.load(CoreContainer.java:294) at org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:243) at org.apache.solr.cloud.CloudStateUpdateTest.setUp(CloudStateUpdateTest.java:112) at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:881) at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:847) Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /collections at org.apache.zookeeper.KeeperException.create(KeeperException.java:90) at org.apache.zookeeper.KeeperException.create(KeeperException.java:42) at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1243) at org.apache.solr.common.cloud.SolrZkClient.getChildren(SolrZkClient.java:199) at org.apache.solr.common.cloud.ZkStateReader.makeShardZkNodeWatches(ZkStateReader.java:184) at
Lucene-Solr-tests-only-trunk - Build # 1513 - Still Failing
Build: https://hudson.apache.org/hudson/job/Lucene-Solr-tests-only-trunk/1513/ 12 tests failed. FAILED: org.apache.solr.cloud.BasicDistributedZkTest.testDistribSearch Error Message: KeeperErrorCode = ConnectionLoss for /solr Stack Trace: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /solr at org.apache.zookeeper.KeeperException.create(KeeperException.java:90) at org.apache.zookeeper.KeeperException.create(KeeperException.java:42) at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:637) at org.apache.solr.common.cloud.SolrZkClient.makePath(SolrZkClient.java:348) at org.apache.solr.common.cloud.SolrZkClient.makePath(SolrZkClient.java:309) at org.apache.solr.common.cloud.SolrZkClient.makePath(SolrZkClient.java:291) at org.apache.solr.common.cloud.SolrZkClient.makePath(SolrZkClient.java:256) at org.apache.solr.cloud.AbstractZkTestCase.buildZooKeeper(AbstractZkTestCase.java:71) at org.apache.solr.cloud.AbstractDistributedZkTestCase.setUp(AbstractDistributedZkTestCase.java:47) at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:881) at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:847) FAILED: org.apache.solr.cloud.CloudStateUpdateTest.testCoreRegistration Error Message: null Stack Trace: org.apache.solr.common.cloud.ZooKeeperException: at org.apache.solr.cloud.ZkController.init(ZkController.java:301) at org.apache.solr.cloud.ZkController.init(ZkController.java:133) at org.apache.solr.core.CoreContainer.initZooKeeper(CoreContainer.java:159) at org.apache.solr.core.CoreContainer.load(CoreContainer.java:338) at org.apache.solr.core.CoreContainer.load(CoreContainer.java:294) at org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:243) at org.apache.solr.cloud.CloudStateUpdateTest.setUp(CloudStateUpdateTest.java:122) at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:881) at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:847) Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /live_nodes/127.0.0.1:1662_solr at org.apache.zookeeper.KeeperException.create(KeeperException.java:90) at org.apache.zookeeper.KeeperException.create(KeeperException.java:42) at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:637) at org.apache.solr.common.cloud.SolrZkClient.makePath(SolrZkClient.java:348) at org.apache.solr.common.cloud.SolrZkClient.makePath(SolrZkClient.java:309) at org.apache.solr.common.cloud.SolrZkClient.makePath(SolrZkClient.java:291) at org.apache.solr.common.cloud.SolrZkClient.makePath(SolrZkClient.java:261) at org.apache.solr.cloud.ZkController.createEphemeralLiveNode(ZkController.java:372) at org.apache.solr.cloud.ZkController.init(ZkController.java:285) FAILED: org.apache.solr.cloud.ZkSolrClientTest.testConnect Error Message: Could not connect to ZooKeeper 127.0.0.1:42074/solr within 3 ms Stack Trace: java.util.concurrent.TimeoutException: Could not connect to ZooKeeper 127.0.0.1:42074/solr within 3 ms at org.apache.solr.common.cloud.ConnectionManager.waitForConnected(ConnectionManager.java:124) at org.apache.solr.common.cloud.SolrZkClient.init(SolrZkClient.java:122) at org.apache.solr.common.cloud.SolrZkClient.init(SolrZkClient.java:85) at org.apache.solr.common.cloud.SolrZkClient.init(SolrZkClient.java:65) at org.apache.solr.cloud.ZkSolrClientTest.testConnect(ZkSolrClientTest.java:43) at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:881) at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:847) FAILED: org.apache.solr.handler.TestReplicationHandler.testReplicateAfterWrite2Slave Error Message: http://localhost:42265/solr/replication?command=disableReplication Stack Trace: java.io.FileNotFoundException: http://localhost:42265/solr/replication?command=disableReplication at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1267) at java.net.URL.openStream(URL.java:1029) at org.apache.solr.handler.TestReplicationHandler.testReplicateAfterWrite2Slave(TestReplicationHandler.java:173) at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:881) at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:847) FAILED: org.apache.solr.handler.TestReplicationHandler.testIndexAndConfigReplication Error Message: expected:498 but was:499 Stack Trace:
Lucene-Solr-tests-only-trunk - Build # 1514 - Still Failing
Build: https://hudson.apache.org/hudson/job/Lucene-Solr-tests-only-trunk/1514/ 1 tests failed. FAILED: org.apache.solr.cloud.BasicDistributedZkTest.testDistribSearch Error Message: .response.numFound:35!=67 Stack Trace: junit.framework.AssertionFailedError: .response.numFound:35!=67 at org.apache.solr.BaseDistributedSearchTestCase.compareResponses(BaseDistributedSearchTestCase.java:553) at org.apache.solr.BaseDistributedSearchTestCase.query(BaseDistributedSearchTestCase.java:307) at org.apache.solr.cloud.BasicDistributedZkTest.doTest(BasicDistributedZkTest.java:127) at org.apache.solr.BaseDistributedSearchTestCase.testDistribSearch(BaseDistributedSearchTestCase.java:562) at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:881) at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:847) Build Log (for compile errors): [...truncated 8714 lines...] - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (SOLR-236) Field collapsing
[ https://issues.apache.org/jira/browse/SOLR-236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12932905#action_12932905 ] peterwang commented on SOLR-236: SOLR-236-1_4_1-paging-totals-working.patch patch failed with following errors: patch: malformed patch at line 3348: Index: src/test/org/apache/solr/search/fieldcollapse/DistributedFieldCollapsingIntegrationTest.java seems caused by hand edit (delete 6 lines without fix diff hunk number) patch files, possible fix: # diff -u SOLR-236-1_4_1.patch SOLR-236-1_4_1-paging-totals-working.patch --- SOLR-236-1_4_1.patch2010-11-17 18:22:25.0 +0800 +++ SOLR-236-1_4_1-paging-totals-working.patch 2010-11-17 19:17:20.0 +0800 @@ -2834,7 +2834,7 @@ === --- src/java/org/apache/solr/search/fieldcollapse/NonAdjacentDocumentCollapser.java (revision ) +++ src/java/org/apache/solr/search/fieldcollapse/NonAdjacentDocumentCollapser.java (revision ) -@@ -0,0 +1,517 @@ +@@ -0,0 +1,511 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with @@ -2939,12 +2939,6 @@ +collapseDoc = new NonAdjacentCollapseGroup(0, 0, documentComparator, collapseThreshold, currentValue); +collapsedDocs.put(currentValue, collapseDoc); +collapsedGroupPriority.add(collapseDoc); -+ -+if (collapsedGroupPriority.size() maxNumberOfGroups) { -+ NonAdjacentCollapseGroup inferiorGroup = collapsedGroupPriority.first(); -+ collapsedDocs.remove(inferiorGroup.fieldValue); -+ collapsedGroupPriority.remove(inferiorGroup); -+} + } + // dropoutId has a value smaller than the smallest value in the queue and therefore it was removed from the queue + Integer dropOutId = (Integer) collapseDoc.priorityQueue.insertWithOverflow(currentId); Field collapsing Key: SOLR-236 URL: https://issues.apache.org/jira/browse/SOLR-236 Project: Solr Issue Type: New Feature Components: search Affects Versions: 1.3 Reporter: Emmanuel Keller Assignee: Shalin Shekhar Mangar Fix For: Next Attachments: collapsing-patch-to-1.3.0-dieter.patch, collapsing-patch-to-1.3.0-ivan.patch, collapsing-patch-to-1.3.0-ivan_2.patch, collapsing-patch-to-1.3.0-ivan_3.patch, DocSetScoreCollector.java, field-collapse-3.patch, field-collapse-4-with-solrj.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-solr-236-2.patch, field-collapse-solr-236.patch, field-collapsing-extended-592129.patch, field_collapsing_1.1.0.patch, field_collapsing_1.3.patch, field_collapsing_dsteigerwald.diff, field_collapsing_dsteigerwald.diff, field_collapsing_dsteigerwald.diff, NonAdjacentDocumentCollapser.java, NonAdjacentDocumentCollapserTest.java, quasidistributed.additional.patch, SOLR-236-1_4_1-paging-totals-working.patch, SOLR-236-1_4_1.patch, SOLR-236-FieldCollapsing.patch, SOLR-236-FieldCollapsing.patch, SOLR-236-FieldCollapsing.patch, SOLR-236-trunk.patch, SOLR-236-trunk.patch, SOLR-236-trunk.patch, SOLR-236-trunk.patch, SOLR-236-trunk.patch, SOLR-236.patch, SOLR-236.patch, SOLR-236.patch, SOLR-236.patch, SOLR-236.patch, SOLR-236.patch, SOLR-236.patch, SOLR-236.patch, solr-236.patch, SOLR-236_collapsing.patch, SOLR-236_collapsing.patch This patch include a new feature called Field collapsing. 
Used in order to collapse a group of results with similar value for a given field to a single entry in the result set. Site collapsing is a special case of this, where all results for a given web site is collapsed into one or two entries in the result set, typically with an associated more documents from this site link. See also Duplicate detection. http://www.fastsearch.com/glossary.aspx?m=48amid=299 The implementation add 3 new query parameters (SolrParams): collapse.field to choose the field used to group results collapse.type normal (default value) or adjacent collapse.max to select how many continuous results are allowed before collapsing TODO (in progress): - More documentation (on source code) - Test cases Two patches: - field_collapsing.patch for current development version - field_collapsing_1.1.0.patch for Solr-1.1.0 P.S.: Feedback and misspelling correction are welcome ;-) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment
[jira] Reopened: (SOLR-1667) PatternTokenizer does not clearAttributes()
[ https://issues.apache.org/jira/browse/SOLR-1667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir reopened SOLR-1667: --- Assignee: Robert Muir (was: Shalin Shekhar Mangar) reopening to backport to solr 1.4.x branch. PatternTokenizer does not clearAttributes() --- Key: SOLR-1667 URL: https://issues.apache.org/jira/browse/SOLR-1667 Project: Solr Issue Type: Bug Components: Schema and Analysis Affects Versions: 1.4 Reporter: Robert Muir Assignee: Robert Muir Fix For: 1.5, 3.1, 4.0 Attachments: SOLR-1667.patch PatternTokenizer creates tokens, but never calls clearAttributes() because of this things like positionIncrementGap are never reset to their default value. trivial patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Issue Comment Edited: (SOLR-236) Field collapsing
[ https://issues.apache.org/jira/browse/SOLR-236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12932905#action_12932905 ] peterwang edited comment on SOLR-236 at 11/17/10 6:21 AM: -- SOLR-236-1_4_1-paging-totals-working.patch patch failed with following errors: patch: malformed patch at line 3348: Index: src/test/org/apache/solr/search/fieldcollapse/DistributedFieldCollapsingIntegrationTest.java seems caused by hand edit (delete 6 lines without fix diff hunk number) patch files, possible fix: $ diff -u SOLR-236-1_4_1.patch SOLR-236-1_4_1-paging-totals-working.patch --- SOLR-236-1_4_1.patch2010-11-17 18:22:25.0 +0800 +++ SOLR-236-1_4_1-paging-totals-working.patch 2010-11-17 19:17:20.0 +0800 @@ -2834,7 +2834,7 @@ === --- src/java/org/apache/solr/search/fieldcollapse/NonAdjacentDocumentCollapser.java (revision ) +++ src/java/org/apache/solr/search/fieldcollapse/NonAdjacentDocumentCollapser.java (revision ) -@@ -0,0 +1,517 @@ +@@ -0,0 +1,511 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with @@ -2939,12 +2939,6 @@ +collapseDoc = new NonAdjacentCollapseGroup(0, 0, documentComparator, collapseThreshold, currentValue); +collapsedDocs.put(currentValue, collapseDoc); +collapsedGroupPriority.add(collapseDoc); -+ -+if (collapsedGroupPriority.size() maxNumberOfGroups) { -+ NonAdjacentCollapseGroup inferiorGroup = collapsedGroupPriority.first(); -+ collapsedDocs.remove(inferiorGroup.fieldValue); -+ collapsedGroupPriority.remove(inferiorGroup); -+} + } + // dropoutId has a value smaller than the smallest value in the queue and therefore it was removed from the queue + Integer dropOutId = (Integer) collapseDoc.priorityQueue.insertWithOverflow(currentId); was (Author: peterwang): SOLR-236-1_4_1-paging-totals-working.patch patch failed with following errors: patch: malformed patch at line 3348: Index: src/test/org/apache/solr/search/fieldcollapse/DistributedFieldCollapsingIntegrationTest.java seems caused by hand edit (delete 6 lines without fix diff hunk number) patch files, possible fix: # diff -u SOLR-236-1_4_1.patch SOLR-236-1_4_1-paging-totals-working.patch --- SOLR-236-1_4_1.patch2010-11-17 18:22:25.0 +0800 +++ SOLR-236-1_4_1-paging-totals-working.patch 2010-11-17 19:17:20.0 +0800 @@ -2834,7 +2834,7 @@ === --- src/java/org/apache/solr/search/fieldcollapse/NonAdjacentDocumentCollapser.java (revision ) +++ src/java/org/apache/solr/search/fieldcollapse/NonAdjacentDocumentCollapser.java (revision ) -@@ -0,0 +1,517 @@ +@@ -0,0 +1,511 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. 
See the NOTICE file distributed with @@ -2939,12 +2939,6 @@ +collapseDoc = new NonAdjacentCollapseGroup(0, 0, documentComparator, collapseThreshold, currentValue); +collapsedDocs.put(currentValue, collapseDoc); +collapsedGroupPriority.add(collapseDoc); -+ -+if (collapsedGroupPriority.size() maxNumberOfGroups) { -+ NonAdjacentCollapseGroup inferiorGroup = collapsedGroupPriority.first(); -+ collapsedDocs.remove(inferiorGroup.fieldValue); -+ collapsedGroupPriority.remove(inferiorGroup); -+} + } + // dropoutId has a value smaller than the smallest value in the queue and therefore it was removed from the queue + Integer dropOutId = (Integer) collapseDoc.priorityQueue.insertWithOverflow(currentId); Field collapsing Key: SOLR-236 URL: https://issues.apache.org/jira/browse/SOLR-236 Project: Solr Issue Type: New Feature Components: search Affects Versions: 1.3 Reporter: Emmanuel Keller Assignee: Shalin Shekhar Mangar Fix For: Next Attachments: collapsing-patch-to-1.3.0-dieter.patch, collapsing-patch-to-1.3.0-ivan.patch, collapsing-patch-to-1.3.0-ivan_2.patch, collapsing-patch-to-1.3.0-ivan_3.patch, DocSetScoreCollector.java, field-collapse-3.patch, field-collapse-4-with-solrj.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-solr-236-2.patch, field-collapse-solr-236.patch,
[jira] Issue Comment Edited: (SOLR-236) Field collapsing
[ https://issues.apache.org/jira/browse/SOLR-236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12932905#action_12932905 ] peterwang edited comment on SOLR-236 at 11/17/10 6:23 AM: -- SOLR-236-1_4_1-paging-totals-working.patch patch failed with following errors: patch: malformed patch at line 3348: Index: src/test/org/apache/solr/search/fieldcollapse/DistributedFieldCollapsingIntegrationTest.java seems caused by hand edit (delete 6 lines without fix diff hunk number) patch files, possible fix: {code} $ diff -u SOLR-236-1_4_1.patch SOLR-236-1_4_1-paging-totals-working.patch --- SOLR-236-1_4_1.patch2010-11-17 18:22:25.0 +0800 +++ SOLR-236-1_4_1-paging-totals-working.patch 2010-11-17 19:17:20.0 +0800 @@ -2834,7 +2834,7 @@ === --- src/java/org/apache/solr/search/fieldcollapse/NonAdjacentDocumentCollapser.java (revision ) +++ src/java/org/apache/solr/search/fieldcollapse/NonAdjacentDocumentCollapser.java (revision ) -@@ -0,0 +1,517 @@ +@@ -0,0 +1,511 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with @@ -2939,12 +2939,6 @@ +collapseDoc = new NonAdjacentCollapseGroup(0, 0, documentComparator, collapseThreshold, currentValue); +collapsedDocs.put(currentValue, collapseDoc); +collapsedGroupPriority.add(collapseDoc); -+ -+if (collapsedGroupPriority.size() maxNumberOfGroups) { -+ NonAdjacentCollapseGroup inferiorGroup = collapsedGroupPriority.first(); -+ collapsedDocs.remove(inferiorGroup.fieldValue); -+ collapsedGroupPriority.remove(inferiorGroup); -+} + } + // dropoutId has a value smaller than the smallest value in the queue and therefore it was removed from the queue + Integer dropOutId = (Integer) collapseDoc.priorityQueue.insertWithOverflow(currentId); {code} was (Author: peterwang): SOLR-236-1_4_1-paging-totals-working.patch patch failed with following errors: patch: malformed patch at line 3348: Index: src/test/org/apache/solr/search/fieldcollapse/DistributedFieldCollapsingIntegrationTest.java seems caused by hand edit (delete 6 lines without fix diff hunk number) patch files, possible fix: $ diff -u SOLR-236-1_4_1.patch SOLR-236-1_4_1-paging-totals-working.patch --- SOLR-236-1_4_1.patch2010-11-17 18:22:25.0 +0800 +++ SOLR-236-1_4_1-paging-totals-working.patch 2010-11-17 19:17:20.0 +0800 @@ -2834,7 +2834,7 @@ === --- src/java/org/apache/solr/search/fieldcollapse/NonAdjacentDocumentCollapser.java (revision ) +++ src/java/org/apache/solr/search/fieldcollapse/NonAdjacentDocumentCollapser.java (revision ) -@@ -0,0 +1,517 @@ +@@ -0,0 +1,511 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. 
See the NOTICE file distributed with @@ -2939,12 +2939,6 @@ +collapseDoc = new NonAdjacentCollapseGroup(0, 0, documentComparator, collapseThreshold, currentValue); +collapsedDocs.put(currentValue, collapseDoc); +collapsedGroupPriority.add(collapseDoc); -+ -+if (collapsedGroupPriority.size() maxNumberOfGroups) { -+ NonAdjacentCollapseGroup inferiorGroup = collapsedGroupPriority.first(); -+ collapsedDocs.remove(inferiorGroup.fieldValue); -+ collapsedGroupPriority.remove(inferiorGroup); -+} + } + // dropoutId has a value smaller than the smallest value in the queue and therefore it was removed from the queue + Integer dropOutId = (Integer) collapseDoc.priorityQueue.insertWithOverflow(currentId); Field collapsing Key: SOLR-236 URL: https://issues.apache.org/jira/browse/SOLR-236 Project: Solr Issue Type: New Feature Components: search Affects Versions: 1.3 Reporter: Emmanuel Keller Assignee: Shalin Shekhar Mangar Fix For: Next Attachments: collapsing-patch-to-1.3.0-dieter.patch, collapsing-patch-to-1.3.0-ivan.patch, collapsing-patch-to-1.3.0-ivan_2.patch, collapsing-patch-to-1.3.0-ivan_3.patch, DocSetScoreCollector.java, field-collapse-3.patch, field-collapse-4-with-solrj.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-solr-236-2.patch,
[jira] Commented: (LUCENE-2764) Allow tests to use random codec per field
[ https://issues.apache.org/jira/browse/LUCENE-2764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12932910#action_12932910 ] Michael McCandless commented on LUCENE-2764: bq. The problem is if we have IW writing field A with codec Standard then open a new IW with field A using PreFlexRW we get problems with the comparator if those segments are merged though. Hmm this should be OK... The PreFlexRW codec has a sneaky impersonation layer (test only) that attempts to figure out which term comparator it's supposed to be using when something is reading the segment. It sounds like that layer isn't being smart enough now. I think we could fix it -- really it just needs to know which codec is writing. If it's PreFlexRW that's writing then it needs to use the legacy sort order; else, unicode. Allow tests to use random codec per field - Key: LUCENE-2764 URL: https://issues.apache.org/jira/browse/LUCENE-2764 Project: Lucene - Java Issue Type: Test Components: Tests Affects Versions: 4.0 Reporter: Simon Willnauer Priority: Minor Fix For: 4.0 Attachments: LUCENE-2764.patch, LUCENE-2764.patch Since we now have a real per field codec support we should enable to run the tests with a random codec per field. When I change something related to codecs internally I would like to ensure that whatever combination of codecs (except of preflex) I use the code works just fine. I created a RandomCodecProvider in LuceneTestCase that randomly selects the codec for fields when it sees them the first time. I disabled the test by default to leave the old randomize codec support in as it was / is. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Issue Comment Edited: (SOLR-236) Field collapsing
[ https://issues.apache.org/jira/browse/SOLR-236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12932905#action_12932905 ] peterwang edited comment on SOLR-236 at 11/17/10 6:28 AM: -- SOLR-236-1_4_1-paging-totals-working.patch patch failed with following errors: patch: malformed patch at line 3348: Index: src/test/org/apache/solr/search/fieldcollapse/DistributedFieldCollapsingIntegrationTest.java seems caused by hand edit SOLR-236-1_4_1.patch to produce SOLR-236-1_4_1-paging-totals-working.patch (delete 6 lines without fix diff hunk number) possible fix: {code} diff -u SOLR-236-1_4_1-paging-totals-working.patch.orig SOLR-236-1_4_1-paging-totals-working.patch --- SOLR-236-1_4_1-paging-totals-working.patch.orig 2010-11-17 19:26:05.0 +0800 +++ SOLR-236-1_4_1-paging-totals-working.patch 2010-11-17 19:17:20.0 +0800 @@ -2834,7 +2834,7 @@ === --- src/java/org/apache/solr/search/fieldcollapse/NonAdjacentDocumentCollapser.java (revision ) +++ src/java/org/apache/solr/search/fieldcollapse/NonAdjacentDocumentCollapser.java (revision ) -@@ -0,0 +1,517 @@ +@@ -0,0 +1,511 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with {code} was (Author: peterwang): SOLR-236-1_4_1-paging-totals-working.patch patch failed with following errors: patch: malformed patch at line 3348: Index: src/test/org/apache/solr/search/fieldcollapse/DistributedFieldCollapsingIntegrationTest.java seems caused by hand edit (delete 6 lines without fix diff hunk number) patch files, possible fix: {code} $ diff -u SOLR-236-1_4_1.patch SOLR-236-1_4_1-paging-totals-working.patch --- SOLR-236-1_4_1.patch2010-11-17 18:22:25.0 +0800 +++ SOLR-236-1_4_1-paging-totals-working.patch 2010-11-17 19:17:20.0 +0800 @@ -2834,7 +2834,7 @@ === --- src/java/org/apache/solr/search/fieldcollapse/NonAdjacentDocumentCollapser.java (revision ) +++ src/java/org/apache/solr/search/fieldcollapse/NonAdjacentDocumentCollapser.java (revision ) -@@ -0,0 +1,517 @@ +@@ -0,0 +1,511 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. 
See the NOTICE file distributed with @@ -2939,12 +2939,6 @@ +collapseDoc = new NonAdjacentCollapseGroup(0, 0, documentComparator, collapseThreshold, currentValue); +collapsedDocs.put(currentValue, collapseDoc); +collapsedGroupPriority.add(collapseDoc); -+ -+if (collapsedGroupPriority.size() maxNumberOfGroups) { -+ NonAdjacentCollapseGroup inferiorGroup = collapsedGroupPriority.first(); -+ collapsedDocs.remove(inferiorGroup.fieldValue); -+ collapsedGroupPriority.remove(inferiorGroup); -+} + } + // dropoutId has a value smaller than the smallest value in the queue and therefore it was removed from the queue + Integer dropOutId = (Integer) collapseDoc.priorityQueue.insertWithOverflow(currentId); {code} Field collapsing Key: SOLR-236 URL: https://issues.apache.org/jira/browse/SOLR-236 Project: Solr Issue Type: New Feature Components: search Affects Versions: 1.3 Reporter: Emmanuel Keller Assignee: Shalin Shekhar Mangar Fix For: Next Attachments: collapsing-patch-to-1.3.0-dieter.patch, collapsing-patch-to-1.3.0-ivan.patch, collapsing-patch-to-1.3.0-ivan_2.patch, collapsing-patch-to-1.3.0-ivan_3.patch, DocSetScoreCollector.java, field-collapse-3.patch, field-collapse-4-with-solrj.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-solr-236-2.patch, field-collapse-solr-236.patch, field-collapsing-extended-592129.patch, field_collapsing_1.1.0.patch, field_collapsing_1.3.patch, field_collapsing_dsteigerwald.diff, field_collapsing_dsteigerwald.diff, field_collapsing_dsteigerwald.diff, NonAdjacentDocumentCollapser.java, NonAdjacentDocumentCollapserTest.java, quasidistributed.additional.patch, SOLR-236-1_4_1-paging-totals-working.patch, SOLR-236-1_4_1.patch, SOLR-236-FieldCollapsing.patch, SOLR-236-FieldCollapsing.patch, SOLR-236-FieldCollapsing.patch, SOLR-236-trunk.patch, SOLR-236-trunk.patch, SOLR-236-trunk.patch, SOLR-236-trunk.patch, SOLR-236-trunk.patch,
[jira] Resolved: (SOLR-1667) PatternTokenizer does not clearAttributes()
[ https://issues.apache.org/jira/browse/SOLR-1667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir resolved SOLR-1667. --- Resolution: Fixed Fix Version/s: (was: 1.5) 1.4.2 Committed revision 1035982. PatternTokenizer does not clearAttributes() --- Key: SOLR-1667 URL: https://issues.apache.org/jira/browse/SOLR-1667 Project: Solr Issue Type: Bug Components: Schema and Analysis Affects Versions: 1.4 Reporter: Robert Muir Assignee: Robert Muir Fix For: 1.4.2, 3.1, 4.0 Attachments: SOLR-1667.patch PatternTokenizer creates tokens, but never calls clearAttributes() because of this things like positionIncrementGap are never reset to their default value. trivial patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2680) Improve how IndexWriter flushes deletes against existing segments
[ https://issues.apache.org/jira/browse/LUCENE-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12932915#action_12932915 ] Michael McCandless commented on LUCENE-2680: Why do we still have deletesFlushed? And why do we still need to remap docIDs on merge? I thought with this new approach the docIDUpto for each buffered delete Term/Query would be a local docID to that segment? On flush the deletesInRAM should be carried directly over to the segmentDeletes, and there shouldn't be a deletesFlushed? A few other small things: * You can use SegmentInfos.clone to copy the segment infos? (it makes a deep copy) * SegmentDeletes.clearAll() need not iterate through the terms/queries to subtract the RAM used? Ie just multiply by .size() instead and make one call to deduct RAM used? * The SegmentDeletes use less than BYTES_PER_DEL_TERM because it's a simple HashSet not a HashMap? Ie we are over-counting RAM used now? (Same for by query) * Can we store segment's deletes elsewhere? The SegmentInfo should be a lightweight class... eg it's used by DirectoryReader to read the index, and if it's read only DirectoryReader there's no need for it to allocate the SegmentDeletes? These data structures should only be held by IndexWriter/DocumentsWriter. * Do we really need to track appliedTerms/appliedQueries? Ie is this just an optimization so that if the caller deletes by the Term/Query again we know to skip it? Seems unnecessary if that's all... Improve how IndexWriter flushes deletes against existing segments - Key: LUCENE-2680 URL: https://issues.apache.org/jira/browse/LUCENE-2680 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Fix For: 4.0 Attachments: LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch IndexWriter buffers up all deletes (by Term and Query) and only applies them if 1) commit or NRT getReader() is called, or 2) a merge is about to kickoff. We do this because, for a large index, it's very costly to open a SegmentReader for every segment in the index. So we defer as long as we can. We do it just before merge so that the merge can eliminate the deleted docs. But, most merges are small, yet in a big index we apply deletes to all of the segments, which is really very wasteful. Instead, we should only apply the buffered deletes to the segments that are about to be merged, and keep the buffer around for the remaining segments. I think it's not so hard to do; we'd have to have generations of pending deletions, because the newly merged segment doesn't need the same buffered deletions applied again. So every time a merge kicks off, we pinch off the current set of buffered deletions, open a new set (the next generation), and record which segment was created as of which generation. This should be a very sizable gain for large indices that mix deletes, though, less so in flex since opening the terms index is much faster. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Assigned: (LUCENE-2765) Optimize scanning in DocsEnum
[ https://issues.apache.org/jira/browse/LUCENE-2765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir reassigned LUCENE-2765: --- Assignee: Robert Muir Optimize scanning in DocsEnum - Key: LUCENE-2765 URL: https://issues.apache.org/jira/browse/LUCENE-2765 Project: Lucene - Java Issue Type: Improvement Reporter: Robert Muir Assignee: Robert Muir Fix For: 4.0 Attachments: LUCENE-2765.patch, LUCENE-2765.patch Similar to LUCENE-2761: when we call advance(), after skipping it scans, but this can be optimized better than calling nextDoc() like today {noformat} // scan for the rest: do { nextDoc(); } while (target > doc); {noformat} in particular, the freq can be skipVinted and the skipDocs (deletedDocs) don't need to be checked during this scanning. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2765) Optimize scanning in DocsEnum
[ https://issues.apache.org/jira/browse/LUCENE-2765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12932920#action_12932920 ] Robert Muir commented on LUCENE-2765: - here is Mike's results on his wikipedia index (multi-segment, 5% deletions) with the patch. ||Query||QPS base||QPS spec||Pct diff|| |unit state|7.94|7.84|-1.3%| |state|36.15|35.81|-1.0%| |spanNear([unit, state], 10, true)|4.46|4.42|-0.9%| |spanFirst(unit, 5)|16.51|16.45|-0.4%| |unit state|10.76|10.78|0.1%| |unit~2.0|13.83|14.06 |1.7%| |unit~1.0|14.36|14.69 |2.3%| |uni*|15.57|16.02|2.9%| |unit*|27.29|28.26|3.5%| |+unit +state|11.73|12.31|4.9%| |united~1.0|29.01|30.86|6.4%| |un*d|66.52|70.99|6.7%| |u*d|21.29|22.98|7.9%| |united~2.0|6.48|7.07|9.1%| |+nebraska +state|169.87|188.95|11.2%| Optimize scanning in DocsEnum - Key: LUCENE-2765 URL: https://issues.apache.org/jira/browse/LUCENE-2765 Project: Lucene - Java Issue Type: Improvement Reporter: Robert Muir Fix For: 4.0 Attachments: LUCENE-2765.patch, LUCENE-2765.patch Similar to LUCENE-2761: when we call advance(), after skipping it scans, but this can be optimized better than calling nextDoc() like today {noformat} // scan for the rest: do { nextDoc(); } while (target doc); {noformat} in particular, the freq can be skipVinted and the skipDocs (deletedDocs) don't need to be checked during this scanning. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Dismax Wiki page
I was looking at a question on the users list, and there are a couple of issues... I'm running 1.4.1 on a Windows box. Trying to specify dismax via defType=dismax fails, returning 0 results and doesn't look like it hits the dismax handler at all, at least the parsed query comes back with +() +() with debugQuery=on. deftype=dismax is fine. qt=dismax is also fine. The Wiki page has qt=defType=dismax in one of the examples ( http://wiki.apache.org/solr/DisMaxQParserPlugin). and the rest of the examples have defType. Before I fix the Wiki page, what's the preferred syntax? I thought it was def[T|t]ype And is the capitalization thing really a problem or not? Thanks Erick
[jira] Commented: (LUCENE-2680) Improve how IndexWriter flushes deletes against existing segments
[ https://issues.apache.org/jira/browse/LUCENE-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12932945#action_12932945 ] Michael McCandless commented on LUCENE-2680: Also: why are we tracking the last segment info/index? Ie, this should only be necessary on cutover to DWPT right? Because effectively today we have only a single DWPT? Improve how IndexWriter flushes deletes against existing segments - Key: LUCENE-2680 URL: https://issues.apache.org/jira/browse/LUCENE-2680 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Fix For: 4.0 Attachments: LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch IndexWriter buffers up all deletes (by Term and Query) and only applies them if 1) commit or NRT getReader() is called, or 2) a merge is about to kickoff. We do this because, for a large index, it's very costly to open a SegmentReader for every segment in the index. So we defer as long as we can. We do it just before merge so that the merge can eliminate the deleted docs. But, most merges are small, yet in a big index we apply deletes to all of the segments, which is really very wasteful. Instead, we should only apply the buffered deletes to the segments that are about to be merged, and keep the buffer around for the remaining segments. I think it's not so hard to do; we'd have to have generations of pending deletions, because the newly merged segment doesn't need the same buffered deletions applied again. So every time a merge kicks off, we pinch off the current set of buffered deletions, open a new set (the next generation), and record which segment was created as of which generation. This should be a very sizable gain for large indices that mix deletes, though, less so in flex since opening the terms index is much faster. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Resolved: (SOLR-2237) add factory for stempel polish stemmer
[ https://issues.apache.org/jira/browse/SOLR-2237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir resolved SOLR-2237. --- Resolution: Fixed Committed revision 1035996, 1036035 (3x) add factory for stempel polish stemmer -- Key: SOLR-2237 URL: https://issues.apache.org/jira/browse/SOLR-2237 Project: Solr Issue Type: New Feature Components: Schema and Analysis Reporter: Robert Muir Assignee: Robert Muir Fix For: 3.1, 4.0 Attachments: SOLR-2237.patch Some users have asked how to enable polish stemming: http://www.lucidimagination.com/search/document/2581073d836cec9a/how_to_use_polish_stemmer_stempel_in_schema_xml#c67acf3dddba1164 http://www.lucidimagination.com/search/document/d115f17bd69a4dae/polish_stemmer#d115f17bd69a4dae http://www.lucidimagination.com/search/document/137d010682bb7367/polish_language_support etc. We should add the factory to make this easy. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Created: (LUCENE-2768) add infrastructure for longer running nightly test cases
add infrastructure for longer running nightly test cases Key: LUCENE-2768 URL: https://issues.apache.org/jira/browse/LUCENE-2768 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Fix For: 3.1, 4.0 I'm spinning this out of LUCENE-2762... The patch there adds initial infrastructure for tests to pull documents from a line file, and adds a longish running test case using that line file to test NRT. I'd like to see some tests run on more substantial indices based on real data... so this is just a start. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-2768) add infrastructure for longer running nightly test cases
[ https://issues.apache.org/jira/browse/LUCENE-2768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated LUCENE-2768: --- Attachment: LUCENE-2768.patch Patch. add infrastructure for longer running nightly test cases Key: LUCENE-2768 URL: https://issues.apache.org/jira/browse/LUCENE-2768 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Fix For: 3.1, 4.0 Attachments: LUCENE-2768.patch I'm spinning this out of LUCENE-2762... The patch there adds initial infrastructure for tests to pull documents from a line file, and adds a longish running test case using that line file to test NRT. I'd like to see some tests run on more substantial indices based on real data... so this is just a start. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: mergeinfo commit mails, possible solution
On Nov 17, 2010, at 1:01 AM, Steven A Rowe wrote: After looking more closely at the vanilla Subversion version of the mailer.py script, I'm 99% sure that removing propchange from the generate_diffs list will have zero effect, but I'd love to be proven wrong. Turns out Subversion's mailer.py once had a much larger set of property filtering options, but C. Mike (Pilato) thought the option set was too baroque, so he reverted the entire set - see http://subversion.tigris.org/issues/show_bug.cgi?id=2944. I've asked on #svn about re-instating the ignore_props and ignore_propdiffs regex-valued options - these would allow us to only ignore svn:mergeinfo diffs while still noting that files' properties have changed, without affecting other properties or their diffs. No responses yet, hopefully tomorrow. We might also be able to supply a patch to ASF infra to turn it on. However, I still feel like we are doing something wrong here relative to other projects. Surely other projects are doing merges and have successfully avoided all this noise. (I know, I know, we've discussed this before.) Perhaps an email to commun...@a.o might help or if we look around at other projects that merge and see what they do. In looking in the mailer conf, some projects turn off generating diffs altogether, but I don't think that is what we want. FWIW, it was announced at ApacheCon that the ASF will be supporting Read/Write Git, so maybe we just live with it until we can migrate to Git. -Grant Steve -Original Message- From: Grant Ingersoll [mailto:gsing...@apache.org] Sent: Tuesday, November 16, 2010 9:55 AM To: dev@lucene.apache.org Subject: mergeinfo commit mails, possible solution From #lucene IRC: gsingers:sarowe and I were talking about the mergeinfo commit overload [09:43]gsingers:and the asf_mailer.conf file [09:43]gsingers:In looking at the file [09:44]gsingers:it appears the one thing we have the ability to do is to turn off the generation of diffs for [09:44]gsingers:events [09:44]gsingers:The default setting is: [09:44]gsingers:generate_diffs = add copy modify propchange [09:44]gsingers:sarowe and I are proposing to change our settings to just be add/copy/modify [09:44]gsingers:and try dropping propchange [09:45]gsingers:I honestly don't know whether it will work or not [09:45]gsingers:and it will also likely mean we will miss notifications of other propchanges [09:45]gsingers:We've asked on #asfinfra if there are other options [09:45]gsingers:and sarowe is looking into the mailer.py script to see if there are other things available [09:46]gsingers:I guess the question here is, do people want to try turning off propchange? -Grant - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org -- Grant Ingersoll http://www.lucidimagination.com - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2768) add infrastructure for longer running nightly test cases
[ https://issues.apache.org/jira/browse/LUCENE-2768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12932956#action_12932956 ] Robert Muir commented on LUCENE-2768: - in revision 1036038 i set -Dtests.nightly=1 for running tests during hudson nightly, but i didnt set it for the clover portion... i think it would only cause the nightly job to take an eternity add infrastructure for longer running nightly test cases Key: LUCENE-2768 URL: https://issues.apache.org/jira/browse/LUCENE-2768 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Fix For: 3.1, 4.0 Attachments: LUCENE-2768.patch I'm spinning this out of LUCENE-2762... The patch there adds initial infrastructure for tests to pull documents from a line file, and adds a longish running test case using that line file to test NRT. I'd like to see some tests run on more substantial indices based on real data... so this is just a start. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: mergeinfo commit mails, possible solution
On Wed, Nov 17, 2010 at 8:53 AM, Grant Ingersoll gsing...@apache.org wrote: FWIW, it was announced at ApacheCon that the ASF will be supporting Read/Write Git, so maybe we just live with it until we can migrate to Git. I didn't know there was consensus that our project would migrate to Git. I surely hope we would vote on such a decision! - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
RE: mergeinfo commit mails, possible solution
The ASF does not use the vanilla mailer.py script, they are using http://opensource.perlig.de/svnmailer - and this one does fantastic work regarding this! We just need to change the config files of this tool and specify a special subtree config for /lucene project folder. See also the attached mail! - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: Steven A Rowe [mailto:sar...@syr.edu] Sent: Wednesday, November 17, 2010 7:02 AM To: dev@lucene.apache.org Subject: RE: mergeinfo commit mails, possible solution After looking more closely at the vanilla Subversion version of the mailer.py script, I'm 99% sure that removing propchange from the generate_diffs list will have zero effect, but I'd love to be proven wrong. Turns out Subversion's mailer.py once had a much larger set of property filtering options, but C. Mike (Pilato) thought the option set was too baroque, so he reverted the entire set - see http://subversion.tigris.org/issues/show_bug.cgi?id=2944. I've asked on #svn about re-instating the ignore_props and ignore_propdiffs regex-valued options - these would allow us to only ignore svn:mergeinfo diffs while still noting that files' properties have changed, without affecting other properties or their diffs. No responses yet, hopefully tomorrow. Steve -Original Message- From: Grant Ingersoll [mailto:gsing...@apache.org] Sent: Tuesday, November 16, 2010 9:55 AM To: dev@lucene.apache.org Subject: mergeinfo commit mails, possible solution From #lucene IRC: gsingers:sarowe and I were talking about the mergeinfo commit overload [09:43]gsingers:and the asf_mailer.conf file [09:43]gsingers:In looking at the file [09:44]gsingers:it appears the one thing we have the ability to do is to turn off the generation of diffs for [09:44]gsingers:events [09:44]gsingers:The default setting is: [09:44]gsingers:generate_diffs = add copy modify propchange [09:44]gsingers:sarowe and I are proposing to change our settings to just be add/copy/modify [09:44]gsingers:and try dropping propchange [09:45]gsingers:I honestly don't know whether it will work or not [09:45]gsingers:and it will also likely mean we will miss notifications of other propchanges [09:45]gsingers:We've asked on #asfinfra if there are other options [09:45]gsingers:and sarowe is looking into the mailer.py script to see if there are other things available [09:46]gsingers:I guess the question here is, do people want to try turning off propchange? -Grant - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org ---BeginMessage--- Hi Upayavira, Thanks for the hint. Indeed with changing the config file (which allows special configs for specific subtrees of the svn, so we can do it only for Lucene), we can do it very easy: http://opensource.perlig.de/svnmailer/doc-1.0/#groups-generate-diffs The generate_diffs option defines which actions diffs are generated for. It takes a space or tab separated list of one or more of the following tokens: add, modify, copy, delete, propchange and none. If the add token is given and a new file is added to the repository, the svnmailer generates a diff between an empty file and the newly added one. If the modify token is given and the content of an already existing file is changed, a diff between the old revision and the new revision of that file is generated. The copy token only worries about files, that are copied and modified during one commit. 
The delete token generates a diff between the previous revision of the file and an empty file, if a file was deleted. If the propchange token is given, the svnmailer also takes care of changes in versioned properties. Whether it should actually generate diffs for the property change action depends on the other tokens of the generate_diffs list. The same rules as for files apply, except that the svnmailer never generates property diffs for deleted files If we change that config option and remove propchange, then the diffs would not contain propchanges anymore. It would it only list as modified files, but with that we can live. Grant: Can you send me a copy of the current config file of that tool? I could create a patch! (I am allowed to see it). Uwe - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: Upayavira [mailto:u...@odoko.co.uk] Sent: Tuesday, October 19, 2010 1:07 PM To: dev@lucene.apache.org Subject: Re: possible to filter the output to commits@ list FWIW, the commit notices are just an SVN post-commit hook that uses the svn- mailer tool [http://opensource.perlig.de/svnmailer/]. I believe Grant has commit rights to that file - it is in the infra SVN
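For illustration, such a patch would amount to something like the excerpt below in the svnmailer config. Only the generate_diffs line is the setting discussed above; the [lucene] group section and the for_paths option are assumptions on my part and should be verified against the svnmailer docs.

{noformat}
[lucene]
# assumed group selector for the /lucene subtree -- verify the exact option name in the svnmailer docs
for_paths = lucene/.*
# drop propchange so svn:mergeinfo churn no longer produces diff mails
generate_diffs = add copy modify
{noformat}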
Re: mergeinfo commit mails, possible solution
On Nov 17, 2010, at 9:02 AM, Robert Muir wrote: On Wed, Nov 17, 2010 at 8:53 AM, Grant Ingersoll gsing...@apache.org wrote: FWIW, it was announced at ApacheCon that the ASF will be supporting Read/Write Git, so maybe we just live with it until we can migrate to Git. I didn't know there was consensus that our project would migrate to Git. I surely hope we would vote on such a decision! Yeah, sorry for the implication that we would move. It is definitely something we should decide together. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2768) add infrastructure for longer running nightly test cases
[ https://issues.apache.org/jira/browse/LUCENE-2768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12932957#action_12932957 ] Robert Muir commented on LUCENE-2768: - ok, i have two potential solutions, and no particular preference as to which we do: # we upgrade our Junit from 4.7 to 4.8 and use the Category support. in this case you would use @IncludeCategory(Nightly.class) to annotate your test. http://kentbeck.github.com/junit/doc/ReleaseNotes4.8.html # we add our own annotation (e.g. @Nightly) and use that. in either case we hack our runner to respect it, so its the same amount of work (junit 4.8 won't actually save us anything since we won't use its @RunWith(Categories.class), but our own runner), its just about syntax and possibly if we care about consistency with junit or envision other optional categories beyond nightly. add infrastructure for longer running nightly test cases Key: LUCENE-2768 URL: https://issues.apache.org/jira/browse/LUCENE-2768 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Fix For: 3.1, 4.0 Attachments: LUCENE-2768.patch I'm spinning this out of LUCENE-2762... The patch there adds initial infrastructure for tests to pull documents from a line file, and adds a longish running test case using that line file to test NRT. I'd like to see some tests run on more substantial indices based on real data... so this is just a start. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
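To make option #2 concrete, a minimal sketch (not the actual patch; the tests.nightly property name is the one used above, everything else is illustrative):

{code}
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Marker annotation for tests that should only run in the nightly build.
@Retention(RetentionPolicy.RUNTIME)
@Target({ElementType.METHOD, ElementType.TYPE})
public @interface Nightly {
}
{code}

and, inside the custom runner, something along these lines:

{code}
import java.lang.reflect.Method;

// Tiny utility a custom runner could call when deciding whether to run a test method:
// skip @Nightly methods unless -Dtests.nightly=true was passed.
final class NightlyFilter {
  static boolean shouldRun(Method m) {
    boolean nightly = Boolean.parseBoolean(System.getProperty("tests.nightly", "false"));
    return nightly || m.getAnnotation(Nightly.class) == null;
  }
}
{code}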
Re: mergeinfo commit mails, possible solution
OK, I will set it to not do propchange. -Grant On Nov 17, 2010, at 9:03 AM, Uwe Schindler wrote: The ASF does not use the vanilla mailer.py script, they are using http://opensource.perlig.de/svnmailer - and this one does fantastic work regarding this! We just need to change the config files of this tool and specify a special subtree config for /lucene project folder. See also the attached mail! - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: Steven A Rowe [mailto:sar...@syr.edu] Sent: Wednesday, November 17, 2010 7:02 AM To: dev@lucene.apache.org Subject: RE: mergeinfo commit mails, possible solution After looking more closely at the vanilla Subversion version of the mailer.py script, I'm 99% sure that removing propchange from the generate_diffs list will have zero effect, but I'd love to be proven wrong. Turns out Subversion's mailer.py once had a much larger set of property filtering options, but C. Mike (Pilato) thought the option set was too baroque, so he reverted the entire set - see http://subversion.tigris.org/issues/show_bug.cgi?id=2944. I've asked on #svn about re-instating the ignore_props and ignore_propdiffs regex-valued options - these would allow us to only ignore svn:mergeinfo diffs while still noting that files' properties have changed, without affecting other properties or their diffs. No responses yet, hopefully tomorrow. Steve -Original Message- From: Grant Ingersoll [mailto:gsing...@apache.org] Sent: Tuesday, November 16, 2010 9:55 AM To: dev@lucene.apache.org Subject: mergeinfo commit mails, possible solution From #lucene IRC: gsingers:sarowe and I were talking about the mergeinfo commit overload [09:43]gsingers:and the asf_mailer.conf file [09:43]gsingers:In looking at the file [09:44]gsingers:it appears the one thing we have the ability to do is to turn off the generation of diffs for [09:44]gsingers:events [09:44]gsingers:The default setting is: [09:44]gsingers:generate_diffs = add copy modify propchange [09:44]gsingers:sarowe and I are proposing to change our settings to just be add/copy/modify [09:44]gsingers:and try dropping propchange [09:45]gsingers:I honestly don't know whether it will work or not [09:45]gsingers:and it will also likely mean we will miss notifications of other propchanges [09:45]gsingers:We've asked on #asfinfra if there are other options [09:45]gsingers:and sarowe is looking into the mailer.py script to see if there are other things available [09:46]gsingers:I guess the question here is, do people want to try turning off propchange? -Grant - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org Mail Attachment.eml - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org -- Grant Ingersoll http://www.lucidimagination.com - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: mergeinfo commit mails, possible solution
OK, it is now set. Next time someone does a merge, keep an eye on the commit messages and let me know. I set it to add copy modify -Grant On Nov 17, 2010, at 9:03 AM, Uwe Schindler wrote: The ASF does not use the vanilla mailer.py script, they are using http://opensource.perlig.de/svnmailer - and this one does fantastic work regarding this! We just need to change the config files of this tool and specify a special subtree config for /lucene project folder. See also the attached mail! - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: Steven A Rowe [mailto:sar...@syr.edu] Sent: Wednesday, November 17, 2010 7:02 AM To: dev@lucene.apache.org Subject: RE: mergeinfo commit mails, possible solution After looking more closely at the vanilla Subversion version of the mailer.py script, I'm 99% sure that removing propchange from the generate_diffs list will have zero effect, but I'd love to be proven wrong. Turns out Subversion's mailer.py once had a much larger set of property filtering options, but C. Mike (Pilato) thought the option set was too baroque, so he reverted the entire set - see http://subversion.tigris.org/issues/show_bug.cgi?id=2944. I've asked on #svn about re-instating the ignore_props and ignore_propdiffs regex-valued options - these would allow us to only ignore svn:mergeinfo diffs while still noting that files' properties have changed, without affecting other properties or their diffs. No responses yet, hopefully tomorrow. Steve -Original Message- From: Grant Ingersoll [mailto:gsing...@apache.org] Sent: Tuesday, November 16, 2010 9:55 AM To: dev@lucene.apache.org Subject: mergeinfo commit mails, possible solution From #lucene IRC: gsingers:sarowe and I were talking about the mergeinfo commit overload [09:43]gsingers:and the asf_mailer.conf file [09:43]gsingers:In looking at the file [09:44]gsingers:it appears the one thing we have the ability to do is to turn off the generation of diffs for [09:44]gsingers:events [09:44]gsingers:The default setting is: [09:44]gsingers:generate_diffs = add copy modify propchange [09:44]gsingers:sarowe and I are proposing to change our settings to just be add/copy/modify [09:44]gsingers:and try dropping propchange [09:45]gsingers:I honestly don't know whether it will work or not [09:45]gsingers:and it will also likely mean we will miss notifications of other propchanges [09:45]gsingers:We've asked on #asfinfra if there are other options [09:45]gsingers:and sarowe is looking into the mailer.py script to see if there are other things available [09:46]gsingers:I guess the question here is, do people want to try turning off propchange? -Grant - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org Mail Attachment.eml - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org -- Grant Ingersoll http://www.lucidimagination.com - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2768) add infrastructure for longer running nightly test cases
[ https://issues.apache.org/jira/browse/LUCENE-2768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12932972#action_12932972 ] Uwe Schindler commented on LUCENE-2768: --- bq. in revision 1036038 i set -Dtests.nightly=1 for running tests during hudson nightly, but i didnt set it for the clover portion... i think it would only cause the nightly job to take an eternity +1, we also have no tests.multiplier for clover! add infrastructure for longer running nightly test cases Key: LUCENE-2768 URL: https://issues.apache.org/jira/browse/LUCENE-2768 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Fix For: 3.1, 4.0 Attachments: LUCENE-2768.patch I'm spinning this out of LUCENE-2762... The patch there adds initial infrastructure for tests to pull documents from a line file, and adds a longish running test case using that line file to test NRT. I'd like to see some tests run on more substantial indices based on real data... so this is just a start. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
RE: mergeinfo commit mails, possible solution
Who does the first merge? *g* Thanks Grant for taking care, I just did not take care the last weeks! Uwe - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: Grant Ingersoll [mailto:gsing...@apache.org] Sent: Wednesday, November 17, 2010 3:41 PM To: dev@lucene.apache.org Subject: Re: mergeinfo commit mails, possible solution OK, it is now set. Next time someone does a merge, keep an eye on the commit messages and let me know. I set it to add copy modify -Grant On Nov 17, 2010, at 9:03 AM, Uwe Schindler wrote: The ASF does not use the vanilla mailer.py script, they are using http://opensource.perlig.de/svnmailer - and this one does fantastic work regarding this! We just need to change the config files of this tool and specify a special subtree config for /lucene project folder. See also the attached mail! - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: Steven A Rowe [mailto:sar...@syr.edu] Sent: Wednesday, November 17, 2010 7:02 AM To: dev@lucene.apache.org Subject: RE: mergeinfo commit mails, possible solution After looking more closely at the vanilla Subversion version of the mailer.py script, I'm 99% sure that removing propchange from the generate_diffs list will have zero effect, but I'd love to be proven wrong. Turns out Subversion's mailer.py once had a much larger set of property filtering options, but C. Mike (Pilato) thought the option set was too baroque, so he reverted the entire set - see http://subversion.tigris.org/issues/show_bug.cgi?id=2944. I've asked on #svn about re-instating the ignore_props and ignore_propdiffs regex-valued options - these would allow us to only ignore svn:mergeinfo diffs while still noting that files' properties have changed, without affecting other properties or their diffs. No responses yet, hopefully tomorrow. Steve -Original Message- From: Grant Ingersoll [mailto:gsing...@apache.org] Sent: Tuesday, November 16, 2010 9:55 AM To: dev@lucene.apache.org Subject: mergeinfo commit mails, possible solution From #lucene IRC: gsingers:sarowe and I were talking about the mergeinfo commit overload [09:43]gsingers:and the asf_mailer.conf file [09:43]gsingers:In looking at the file [09:44]gsingers:it appears the one thing we have the ability to do is to turn off the generation of diffs for [09:44]gsingers:events [09:44]gsingers:The default setting is: [09:44]gsingers:generate_diffs = add copy modify propchange [09:44]gsingers:sarowe and I are proposing to change our settings to just be add/copy/modify [09:44]gsingers:and try dropping propchange [09:45]gsingers:I honestly don't know whether it will work or not [09:45]gsingers:and it will also likely mean we will miss notifications of other propchanges [09:45]gsingers:We've asked on #asfinfra if there are other options [09:45]gsingers:and sarowe is looking into the mailer.py script to see if there are other things available [09:46]gsingers:I guess the question here is, do people want to try turning off propchange? 
-Grant - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org Mail Attachment.eml - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org -- Grant Ingersoll http://www.lucidimagination.com - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: Integrate Lucene Search
Lucene can search anything you feed to it. But it doesn't know how to fetch the data itself. So: 1) You need to load the data from the database and feed it to Lucene in the format you want it indexed in. Along with the searchable data, you need to provide an identifier you can use to link back to whatever it is you are trying to index. 2) You will need to write some code to grab the data from the outside links and index it. The identifier in this case could be the url you are indexing. I'd recommend reading Lucene in Action (http://www.manning.com/hatcher3/) to get an idea of how the library thinks. The examples are in Java, but the concepts translate directly. On Wed, Nov 17, 2010 at 5:43 AM, Rahul Aneja rahula.innovami...@gmail.com wrote: Hello, I have read a lot about Lucene/Solr search and it seems to be very interesting. I want to integrate it in our application, which is built on ASP.NET (C#). I also got some of the code from the links below, but I am not able to get the particular steps or code from these links to integrate the Lucene/Solr search. I am using the Lucene.Net library, in which the functions (Indexer, Parser, Search) are defined. http://aspcode.net/c-and-lucene-to-index-and-search http://www.logiclabz.com/c/search-lucene-index-in-net-c-with-sorting-options.aspx http://www.theplancollection.com/house-plan-related-articles/search-using-asp-net-and-Lucene http://www.codeproject.com/KB/library/IntroducingLucene.aspx Firstly, I want to clarify these steps: 1. How do we communicate with the database from which the data and links corresponding to a search can be retrieved? 2. Does Lucene also provide search over outside (external) links, or does it only work within an application? I also want to clarify more steps, but our main problem for now is the points I have mentioned above. Please reply with a solution ASAP. Regards, Rahul Aneja Software Developer INDIA: +91-172-434-6890 www.InnovaMinds.com http://www.innovaminds.com/
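A minimal sketch of point 1 above, in Java since that is what the book's examples use (the JDBC URL, table and column names are made up, and the Lucene 3.0-style API is assumed):

{code}
import java.io.File;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class DbIndexerSketch {
  public static void main(String[] args) throws Exception {
    IndexWriter writer = new IndexWriter(
        FSDirectory.open(new File("/tmp/index")),
        new StandardAnalyzer(Version.LUCENE_30),
        true, IndexWriter.MaxFieldLength.UNLIMITED);

    Connection conn = DriverManager.getConnection("jdbc:mysql://localhost/mydb", "user", "pass");
    Statement stmt = conn.createStatement();
    ResultSet rs = stmt.executeQuery("SELECT id, title, body FROM articles");
    while (rs.next()) {
      Document doc = new Document();
      // The primary key is stored un-analyzed so a search hit can be linked back to its row.
      doc.add(new Field("id", rs.getString("id"), Field.Store.YES, Field.Index.NOT_ANALYZED));
      doc.add(new Field("title", rs.getString("title"), Field.Store.YES, Field.Index.ANALYZED));
      doc.add(new Field("body", rs.getString("body"), Field.Store.NO, Field.Index.ANALYZED));
      writer.addDocument(doc);
    }
    rs.close();
    stmt.close();
    conn.close();

    writer.optimize();
    writer.close();
  }
}
{code}

For point 2, the loop would read from your HTTP fetcher instead of a ResultSet, with the url stored in place of the id field.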
[jira] Commented: (LUCENE-2755) Some improvements to CMS
[ https://issues.apache.org/jira/browse/LUCENE-2755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12932982#action_12932982 ] Shai Erera commented on LUCENE-2755: Earwin, the way CMS currently handles the writer instance makes it entirely not thread-safe. If you e.g. pass different writers to merge(), the class member changes, and MTs will start merging other segments, and in the worse case attempt to merge segments of a different writer. I myself thinks it's ok to have a MP and MS per writer, but I don't have too strong feelings for/against it - so if we want to allow this, we should fix CMS. As for the other comments, I'll need to check more closely what IW does w/ those merges - as it checks all sorts of things (e.g. whether it's an optimize merge or not, see one of the latest bugs Mike resolved). So getting it entirely outside of IndexWriter and into MP/MS is risky - at least, I don't understand the code well enough (yet) to say whether it's doable at all and if we don't miss something. Some improvements to CMS Key: LUCENE-2755 URL: https://issues.apache.org/jira/browse/LUCENE-2755 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Shai Erera Assignee: Shai Erera Priority: Minor Fix For: 3.1, 4.0 While running optimize on a large index, I've noticed several things that got me to read CMS code more carefully, and find these issues: * CMS may hold onto a merge if maxMergeCount is hit. That results in the MergeThreads taking merges from the IndexWriter until they are exhausted, and only then that blocked merge will run. I think it's unnecessary that that merge will be blocked. * CMS sorts merges by segments size, doc-based and not bytes-based. Since the default MP is LogByteSizeMP, and I hardly believe people care about doc-based size segments anymore, I think we should switch the default impl. There are two ways to make it extensible, if we want: ** Have an overridable member/method in CMS that you can extend and override - easy. ** Have OneMerge be comparable and let the MP determine the order (e.g. by bytes, docs, calibrate deletes etc.). Better, but will need to tap into several places in the code, so more risky and complicated. On the go, I'd like to add some documentation to CMS - it's not very easy to read and follow. I'll work on a patch. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
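For what it's worth, the "MP and MS per writer" usage mentioned above looks like the sketch below, assuming the trunk IndexWriterConfig API (directories and analyzers are placeholders); the point is simply that each writer gets its own ConcurrentMergeScheduler instance rather than sharing one:

{code}
import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.ConcurrentMergeScheduler;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class PerWriterCmsSketch {
  public static void main(String[] args) throws Exception {
    Directory dir1 = FSDirectory.open(new File("/tmp/idx1"));
    Directory dir2 = FSDirectory.open(new File("/tmp/idx2"));

    // One scheduler per writer; sharing a single ConcurrentMergeScheduler instance
    // across writers is the unsafe case discussed above.
    IndexWriterConfig conf1 = new IndexWriterConfig(Version.LUCENE_40, new StandardAnalyzer(Version.LUCENE_40));
    conf1.setMergeScheduler(new ConcurrentMergeScheduler());
    IndexWriter w1 = new IndexWriter(dir1, conf1);

    IndexWriterConfig conf2 = new IndexWriterConfig(Version.LUCENE_40, new StandardAnalyzer(Version.LUCENE_40));
    conf2.setMergeScheduler(new ConcurrentMergeScheduler());
    IndexWriter w2 = new IndexWriter(dir2, conf2);

    w1.close();
    w2.close();
  }
}
{code}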
[jira] Updated: (LUCENE-2768) add infrastructure for longer running nightly test cases
[ https://issues.apache.org/jira/browse/LUCENE-2768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-2768: Attachment: LUCENE-2768_nightly.patch patch that adds support for annotating tests with @Nightly. you can also annotate a whole class with this (in that case, import it from LuceneTestCase). the only trick is that junit always requires a class to have at least one runnable method, or it throws an exception. in this special case that all methods or the whole class are somehow @Nightly, we add a fake @Ignored method so we get tests run: 0 and the NOTE instead of this exception. add infrastructure for longer running nightly test cases Key: LUCENE-2768 URL: https://issues.apache.org/jira/browse/LUCENE-2768 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Fix For: 3.1, 4.0 Attachments: LUCENE-2768.patch, LUCENE-2768_nightly.patch I'm spinning this out of LUCENE-2762... The patch there adds initial infrastructure for tests to pull documents from a line file, and adds a longish running test case using that line file to test NRT. I'd like to see some tests run on more substantial indices based on real data... so this is just a start. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
mergeinfo commit mails, possible solution
From #lucene IRC: gsingers:sarowe and I were talking about the mergeinfo commit overload [09:43]gsingers:and the asf_mailer.conf file [09:43]gsingers:In looking at the file [09:44]gsingers:it appears the one thing we have the ability to do is to turn off the generation of diffs for [09:44]gsingers:events [09:44]gsingers:The default setting is: [09:44]gsingers:generate_diffs = add copy modify propchange [09:44]gsingers:sarowe and I are proposing to change our settings to just be add/copy/modify [09:44]gsingers:and try dropping propchange [09:45]gsingers:I honestly don't know whether it will work or not [09:45]gsingers:and it will also likely mean we will miss notifications of other propchanges [09:45]gsingers:We've asked on #asfinfra if there are other options [09:45]gsingers:and sarowe is looking into the mailer.py script to see if there are other things available [09:46]gsingers:I guess the question here is, do people want to try turning off propchange? -Grant - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2680) Improve how IndexWriter flushes deletes against existing segments
[ https://issues.apache.org/jira/browse/LUCENE-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12932988#action_12932988 ] Jason Rutherglen commented on LUCENE-2680: -- {quote}Why do we still have deletesFlushed? And why do we still need to remap docIDs on merge? I thought with this new approach the docIDUpto for each buffered delete Term/Query would be a local docID to that segment?{quote} Deletes flushed can be removed if we store the docid-upto per segment. Then we'll go back to having a hash map of deletes. {quote}The SegmentDeletes use less than BYTES_PER_DEL_TERM because it's a simple HashSet not a HashMap? Ie we are over-counting RAM used now? (Same for by query){quote} Intuitively, yes, however here's the constructor of hash set: {code} public HashSet() { map = new HashMap<E,Object>(); } {code} bq. why are we tracking the last segment info/index? I thought last segment was supposed to be used to mark the last segment of a commit/flush. This way we save on the hash(set,map) space on the segments upto the last segment when the commit occurred. {quote}Can we store segment's deletes elsewhere?{quote} We can, however I had to minimize places in the code that were potentially causing errors (trying to reduce the problem set, which helped locate the intermittent exceptions), and syncing segment infos with the per-segment deletes was one of those places. That, and I thought it'd be worth a try to simplify (at the expense of breaking the unstated intention of the SI class). {quote}Do we really need to track appliedTerms/appliedQueries? Ie is this just an optimization so that if the caller deletes by the Term/Query again we know to skip it? {quote} Yes to the 2nd question. Why would we want to try deleting multiple times? The cost is the terms dictionary lookup which you're saying is in the noise? I think potentially cracking open a query again could be costly in cases where the query is indeed expensive. {quote}not iterate through the terms/queries to subtract the RAM used?{quote} Well, the RAM usage tracking can't be completely defined until we finish how we're storing the terms/queries. Improve how IndexWriter flushes deletes against existing segments - Key: LUCENE-2680 URL: https://issues.apache.org/jira/browse/LUCENE-2680 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Fix For: 4.0 Attachments: LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch IndexWriter buffers up all deletes (by Term and Query) and only applies them if 1) commit or NRT getReader() is called, or 2) a merge is about to kickoff. We do this because, for a large index, it's very costly to open a SegmentReader for every segment in the index. So we defer as long as we can. We do it just before merge so that the merge can eliminate the deleted docs. But, most merges are small, yet in a big index we apply deletes to all of the segments, which is really very wasteful. Instead, we should only apply the buffered deletes to the segments that are about to be merged, and keep the buffer around for the remaining segments. I think it's not so hard to do; we'd have to have generations of pending deletions, because the newly merged segment doesn't need the same buffered deletions applied again.
So every time a merge kicks off, we pinch off the current set of buffered deletions, open a new set (the next generation), and record which segment was created as of which generation. This should be a very sizable gain for large indices that mix deletes, though, less so in flex since opening the terms index is much faster. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2768) add infrastructure for longer running nightly test cases
[ https://issues.apache.org/jira/browse/LUCENE-2768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12932990#action_12932990 ] Robert Muir commented on LUCENE-2768: - Here is the output when tests.nightly is disabled (default) and a method or class is @Nightly, respectively: {noformat} [junit] Testsuite: org.apache.lucene.TestDemo [junit] Tests run: 0, Failures: 0, Errors: 0, Time elapsed: 0.17 sec [junit] [junit] - Standard Error - [junit] NOTE: Ignoring nightly-only test method 'testDemo' [junit] - --- {noformat} {noformat} [junit] Testsuite: org.apache.lucene.TestDemo [junit] Tests run: 0, Failures: 0, Errors: 0, Time elapsed: 0.171 sec [junit] [junit] - Standard Error - [junit] NOTE: Ignoring nightly-only test class 'TestDemo' [junit] - --- {noformat} add infrastructure for longer running nightly test cases Key: LUCENE-2768 URL: https://issues.apache.org/jira/browse/LUCENE-2768 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Fix For: 3.1, 4.0 Attachments: LUCENE-2768.patch, LUCENE-2768_nightly.patch I'm spinning this out of LUCENE-2762... The patch there adds initial infrastructure for tests to pull documents from a line file, and adds a longish running test case using that line file to test NRT. I'd like to see some tests run on more substantial indices based on real data... so this is just a start. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2768) add infrastructure for longer running nightly test cases
[ https://issues.apache.org/jira/browse/LUCENE-2768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12932997#action_12932997 ] Uwe Schindler commented on LUCENE-2768: --- Looks good, the hack is a hack *lol* but should work and lead to no problems. I would only change the sysprop and static var to a Boolean and add a RuntimeException to the empty catch block in the reflection part. add infrastructure for longer running nightly test cases Key: LUCENE-2768 URL: https://issues.apache.org/jira/browse/LUCENE-2768 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Fix For: 3.1, 4.0 Attachments: LUCENE-2768.patch, LUCENE-2768_nightly.patch I'm spinning this out of LUCENE-2762... The patch there adds initial infrastructure for tests to pull documents from a line file, and adds a longish running test case using that line file to test NRT. I'd like to see some tests run on more substantial indices based on real data... so this is just a start. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-2768) add infrastructure for longer running nightly test cases
[ https://issues.apache.org/jira/browse/LUCENE-2768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-2768: Attachment: LUCENE-2768_nightly.patch here is an updated patch with Uwe's suggestions, additionally i made the fake method final. I'll commit this soon, then Mike can setup his test to use it. add infrastructure for longer running nightly test cases Key: LUCENE-2768 URL: https://issues.apache.org/jira/browse/LUCENE-2768 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Fix For: 3.1, 4.0 Attachments: LUCENE-2768.patch, LUCENE-2768_nightly.patch, LUCENE-2768_nightly.patch I'm spinning this out of LUCENE-2762... The patch there adds initial infrastructure for tests to pull documents from a line file, and adds a longish running test case using that line file to test NRT. I'd like to see some tests run on more substantial indices based on real data... so this is just a start. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2768) add infrastructure for longer running nightly test cases
[ https://issues.apache.org/jira/browse/LUCENE-2768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12933007#action_12933007 ] Robert Muir commented on LUCENE-2768: - ok I committed the lucenetestcase/ant support in revision 1036088, 1036094 (3x) To make nightly-only tests, annotate the methods with @Nightly. to run tests including nightly-only tests, use -Dtests.nightly=true add infrastructure for longer running nightly test cases Key: LUCENE-2768 URL: https://issues.apache.org/jira/browse/LUCENE-2768 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Fix For: 3.1, 4.0 Attachments: LUCENE-2768.patch, LUCENE-2768_nightly.patch, LUCENE-2768_nightly.patch I'm spinning this out of LUCENE-2762... The patch there adds initial infrastructure for tests to pull documents from a line file, and adds a longish running test case using that line file to test NRT. I'd like to see some tests run on more substantial indices based on real data... so this is just a start. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
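Usage then looks like the sketch below; the test class and method names are made up, and the import location of the annotation is assumed from the patch comments (it is described as living next to LuceneTestCase):

{code}
import org.apache.lucene.util.LuceneTestCase;
import org.apache.lucene.util.LuceneTestCase.Nightly;  // assumed location, per the patch comments
import org.junit.Test;

public class TestHugeIndex extends LuceneTestCase {

  // Only executed when the build passes -Dtests.nightly=true,
  // e.g. "ant test -Dtests.nightly=true"; otherwise it is reported as ignored.
  @Nightly
  @Test
  public void testIndexManyDocs() throws Exception {
    // long-running test body goes here
  }
}
{code}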
[jira] Created: (SOLR-2240) Basic authentication for stream.url
Basic authentication for stream.url --- Key: SOLR-2240 URL: https://issues.apache.org/jira/browse/SOLR-2240 Project: Solr Issue Type: Improvement Components: update Affects Versions: 4.0 Reporter: Jayendra Patil Priority: Minor We intend to use stream.url for indexing documents from remote locations exposed through http. However, the remote urls are secured and would need basic authentication to be able to access the documents. The current implementation for stream.url in ContentStreamBase.URLStream does not support authentication. The implementation with stream.file would mean downloading the files to a local box and would cause duplication, whereas stream.body would have indexing performance issues with the huge amount of data being transferred over the network. An approach would be: 1. Passing an additional authentication parameter, e.g. stream.url.auth, with the encoded authentication value - SolrRequestParsers 2. Setting the Authorization request property for the Connection - ContentStreamBase.URLStream this.conn.setRequestProperty("Authorization", "Basic " + encodedauthentication); Any thoughts ?? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
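A small sketch of what point 2 amounts to inside the stream opener; the URL and credentials below are placeholders, and commons-codec (shipped with Solr) is used here for the Base64 step:

{code}
import java.net.URL;
import java.net.URLConnection;

import org.apache.commons.codec.binary.Base64;

public class UrlStreamAuthSketch {
  public static URLConnection openWithBasicAuth(String urlString, String user, String password) throws Exception {
    URLConnection conn = new URL(urlString).openConnection();
    // Encode "user:password" and attach it as a Basic Authorization header
    // before the connection is used to fetch the remote content stream.
    String encoded = new String(Base64.encodeBase64((user + ":" + password).getBytes("UTF-8")), "UTF-8");
    conn.setRequestProperty("Authorization", "Basic " + encoded);
    return conn;
  }
}
{code}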
[jira] Commented: (LUCENE-2762) Don't leak deleted open file handles with pooled readers
[ https://issues.apache.org/jira/browse/LUCENE-2762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12933051#action_12933051 ] Michael McCandless commented on LUCENE-2762: So with this patch, we now build the CFS for a merged segment before adding that segment to the segment infos. This is important, to prevent an NRT reader from opening the pre-CFS version, thus tying open the files, using up extra disk space, and leaking deleted-but-open files even once all NRT readers are closed. But, unfortunately, this means the worst-case temporary peak free disk space required when using CFS has gone up... this worst case is hit if you 1) open an existing index, 2) call optimize on it, 3) the index needs more than 1 merge to become optimized, and 4) on the final merge of that optimize just after it's built the CFS but hasn't yet committed it to the segment infos. At that point you have 1X due to starting segments (which cannot be deleted until commit), another 1X due to the segments created by the prior merge (now being merged), another 1X by the newly merged single segment, and a final 1X from the final CFS. In this worst case that means we require 3X of your index size in temporary space. In other cases we use less disk space (the NRT case). And of course if CFS is off there's no change to the temp disk space. I've noted this in the javadocs and will add to CHANGES... But... I think we should improve our default MP. First, maybe we should set a maxMergeMB by default? Because immense merges cause all sorts of problems, and, likely are not going to impact search perf. Second, I think if a newly merged segment will be more than X% of the index, I think we should leave it in non-compound-file format even if useCompoundFile is enabled... I think there's a separate issue open somewhere for that 2nd one. Don't leak deleted open file handles with pooled readers Key: LUCENE-2762 URL: https://issues.apache.org/jira/browse/LUCENE-2762 Project: Lucene - Java Issue Type: Bug Affects Versions: 2.9.4, 3.0.3, 3.1, 4.0 Reporter: Michael McCandless Assignee: Michael McCandless Attachments: LUCENE-2762.patch If you have CFS enabled today, and pooling is enabled (either directly or because you've pulled an NRT reader), IndexWriter will hold open SegmentReaders against the non-CFS format of each merged segment. So even if you close all NRT readers you've pulled from the writer, you'll still see file handles open against files that have been deleted. This count will not grow unbounded, since it's limited by the number of segments in the index, but it's still a serious problem since the app had turned off CFS in the first place presumably to avoid risk of too-many-open-files. It's also bad because it ties up disk space since these files would otherwise be deleted. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Lucene-Solr-tests-only-trunk - Build # 1530 - Failure
Build: https://hudson.apache.org/hudson/job/Lucene-Solr-tests-only-trunk/1530/ 1 tests failed. REGRESSION: org.apache.solr.cloud.CloudStateUpdateTest.testCoreRegistration Error Message: expected:<2> but was:<3> Stack Trace: junit.framework.AssertionFailedError: expected:<2> but was:<3> at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:923) at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:861) at org.apache.solr.cloud.CloudStateUpdateTest.testCoreRegistration(CloudStateUpdateTest.java:208) Build Log (for compile errors): [...truncated 8769 lines...] - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Created: (SOLR-2241) Upgrade to Tika 0.8
Upgrade to Tika 0.8 --- Key: SOLR-2241 URL: https://issues.apache.org/jira/browse/SOLR-2241 Project: Solr Issue Type: Improvement Reporter: Grant Ingersoll Assignee: Grant Ingersoll Fix For: 3.1, 4.0 as the title says -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2680) Improve how IndexWriter flushes deletes against existing segments
[ https://issues.apache.org/jira/browse/LUCENE-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12933067#action_12933067 ] Michael McCandless commented on LUCENE-2680: {quote} Deletes flushed can be removed if we store the docid-upto per segment. Then we'll go back to having a hash map of deletes. {quote} I think we should do this? Ie, each flushed segment stores the map of del Term/Query to docid-upto, where that docid-upto is private to the segment (no remapping on merges needed). When it's time to apply deletes to about-to-be-merged segments, we must apply all future segments' deletions unconditionally to each segment, and then conditionally (respecting the local docid-upto) apply that segment's deletions. {quote} Intuitively, yes, however here's the constructor of hash set: {noformat} public HashSet() { map = new HashMap<E,Object>(); } {noformat} {quote} Ugh I forgot about that. Is that still true? That's awful. {quote} bq. why are we tracking the last segment info/index? I thought last segment was supposed to be used to mark the last segment of a commit/flush. This way we save on the hash(set,map) space on the segments upto the last segment when the commit occurred. {quote} Hmm... I think lastSegment was needed only for the multiple DWPT case, to record the last segment already flushed in the index as of when that DWPT was created. This is so we know, going back, when we can start unconditionally applying the buffered delete term. With the single DWPT we effectively have today, isn't last segment always going to be what we just flushed? (Or null if we haven't yet done a flush in the current session). {quote} bq. Do we really need to track appliedTerms/appliedQueries? Ie is this just an optimization so that if the caller deletes by the Term/Query again we know to skip it? Yes to the 2nd question. Why would we want to try deleting multiple times? The cost is the terms dictionary lookup which you're saying is in the noise? I think potentially cracking open a query again could be costly in cases where the query is indeed expensive. {quote} I'm saying this is unlikely to be a worthwhile way to spend RAM. EG most apps wouldn't delete by the same term again, like they'd typically go and process a big batch of docs, deleting by an id field and adding the new version of the doc, where a given id is seen only once in this session, and then IW is committed/closed? Improve how IndexWriter flushes deletes against existing segments - Key: LUCENE-2680 URL: https://issues.apache.org/jira/browse/LUCENE-2680 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Fix For: 4.0 Attachments: LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch IndexWriter buffers up all deletes (by Term and Query) and only applies them if 1) commit or NRT getReader() is called, or 2) a merge is about to kickoff. We do this because, for a large index, it's very costly to open a SegmentReader for every segment in the index. So we defer as long as we can. We do it just before merge so that the merge can eliminate the deleted docs. But, most merges are small, yet in a big index we apply deletes to all of the segments, which is really very wasteful. Instead, we should only apply the buffered deletes to the segments that are about to be merged, and keep the buffer around for the remaining segments. 
I think it's not so hard to do; we'd have to have generations of pending deletions, because the newly merged segment doesn't need the same buffered deletions applied again. So every time a merge kicks off, we pinch off the current set of buffered deletions, open a new set (the next generation), and record which segment was created as of which generation. This should be a very sizable gain for large indices that mix deletes, though, less so in flex since opening the terms index is much faster. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
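A toy illustration of the per-segment bookkeeping described above (names invented; this is not the patch): each flushed segment keeps its own term-to-docIDUpto map, and because the limit is local to the segment, merges never need to remap it.

{code}
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.index.Term;

// Invented sketch of per-segment buffered deletes.
class SegmentDeletesSketch {
  // term -> docIDUpto local to this segment; docs at or beyond the limit
  // arrived after the delete and must not be removed by it.
  final Map<Term, Integer> deletedTerms = new HashMap<Term, Integer>();

  void bufferDelete(Term t, int docIDUpto) {
    Integer current = deletedTerms.get(t);
    if (current == null || current.intValue() < docIDUpto) {
      current = Integer.valueOf(docIDUpto);   // keep the largest limit seen for this term
    }
    deletedTerms.put(t, current);
  }
}
{code}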
[jira] Commented: (LUCENE-2762) Don't leak deleted open file handles with pooled readers
[ https://issues.apache.org/jira/browse/LUCENE-2762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12933070#action_12933070 ] Jason Rutherglen commented on LUCENE-2762: -- {quote}I think we should improve our default MP. First, maybe we should set a maxMergeMB by default?{quote} That's a good idea, however would we set an absolute size or a size relative to the aggregate size of the index? I'm using 5 GB in production as otherwise I'm not sure the merge cost is worth the potential performance improvement, ie, long merges adversely affects indexing performance. Don't leak deleted open file handles with pooled readers Key: LUCENE-2762 URL: https://issues.apache.org/jira/browse/LUCENE-2762 Project: Lucene - Java Issue Type: Bug Affects Versions: 2.9.4, 3.0.3, 3.1, 4.0 Reporter: Michael McCandless Assignee: Michael McCandless Attachments: LUCENE-2762.patch If you have CFS enabled today, and pooling is enabled (either directly or because you've pulled an NRT reader), IndexWriter will hold open SegmentReaders against the non-CFS format of each merged segment. So even if you close all NRT readers you've pulled from the writer, you'll still see file handles open against files that have been deleted. This count will not grow unbounded, since it's limited by the number of segments in the index, but it's still a serious problem since the app had turned off CFS in the first place presumably to avoid risk of too-many-open-files. It's also bad because it ties up disk space since these files would otherwise be deleted. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
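For reference, the 5 GB cap mentioned above is just a merge-policy setting (sketch below; wiring the policy into the writer config is assumed to happen elsewhere):

{code}
import org.apache.lucene.index.LogByteSizeMergePolicy;

public class MergeSizeCapSketch {
  public static LogByteSizeMergePolicy fiveGigCap() {
    LogByteSizeMergePolicy mp = new LogByteSizeMergePolicy();
    // Segments larger than this (in MB) are not selected for normal merges.
    mp.setMaxMergeMB(5 * 1024);
    return mp;
  }
}
{code}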
Lucene-Solr-tests-only-trunk - Build # 1531 - Still Failing
Build: https://hudson.apache.org/hudson/job/Lucene-Solr-tests-only-trunk/1531/ 1 tests failed. REGRESSION: org.apache.solr.TestDistributedSearch.testDistribSearch Error Message: Some threads threw uncaught exceptions! Stack Trace: junit.framework.AssertionFailedError: Some threads threw uncaught exceptions! at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:923) at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:861) at org.apache.lucene.util.LuceneTestCase.tearDown(LuceneTestCase.java:446) at org.apache.solr.SolrTestCaseJ4.tearDown(SolrTestCaseJ4.java:92) at org.apache.solr.BaseDistributedSearchTestCase.tearDown(BaseDistributedSearchTestCase.java:144) Build Log (for compile errors): [...truncated 8758 lines...] - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: svn commit: r1036080 [1/4] - in /lucene/dev/branches/docvalues: ./ lucene/ lucene/contrib/ lucene/contrib/highlighter/src/test/ lucene/contrib/instantiated/src/test/org/apache/lucene/store/instant
On Nov 17, 2010, at 10:55 AM, Uwe Schindler wrote: JUHU, No prop changes (only in the intro are the affected files listed, but no longer any endless pages of rev numbers)! Thanks Grant! Hey, thank you! You guys did the work to figure it out, I just flipped the switch. It's too bad we couldn't be a little more fine grained about propchanges, but I'll live with it for now. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: mergeinfo commit mails, possible solution
big +1, we can actually review backports now... this was really bad before. On Wed, Nov 17, 2010 at 11:57 AM, Steven A Rowe sar...@syr.edu wrote: Uwe, my Inbox thanks you. - Steve -Original Message- From: Uwe Schindler [mailto:u...@thetaphi.de] Sent: Wednesday, November 17, 2010 9:04 AM To: dev@lucene.apache.org Subject: RE: mergeinfo commit mails, possible solution The ASF does not use the vanilla mailer.py script, they are using http://opensource.perlig.de/svnmailer - and this one does fantastic work regarding this! We just need to change the config files of this tool and specify a special subtree config for /lucene project folder. See also the attached mail! - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: Steven A Rowe [mailto:sar...@syr.edu] Sent: Wednesday, November 17, 2010 7:02 AM To: dev@lucene.apache.org Subject: RE: mergeinfo commit mails, possible solution After looking more closely at the vanilla Subversion version of the mailer.py script, I'm 99% sure that removing propchange from the generate_diffs list will have zero effect, but I'd love to be proven wrong. Turns out Subversion's mailer.py once had a much larger set of property filtering options, but C. Mike (Pilato) thought the option set was too baroque, so he reverted the entire set - see http://subversion.tigris.org/issues/show_bug.cgi?id=2944. I've asked on #svn about re-instating the ignore_props and ignore_propdiffs regex-valued options - these would allow us to only ignore svn:mergeinfo diffs while still noting that files' properties have changed, without affecting other properties or their diffs. No responses yet, hopefully tomorrow. Steve -Original Message- From: Grant Ingersoll [mailto:gsing...@apache.org] Sent: Tuesday, November 16, 2010 9:55 AM To: dev@lucene.apache.org Subject: mergeinfo commit mails, possible solution From #lucene IRC: gsingers:sarowe and I were talking about the mergeinfo commit overload [09:43]gsingers:and the asf_mailer.conf file [09:43]gsingers:In looking at the file [09:44]gsingers:it appears the one thing we have the ability to do is to turn off the generation of diffs for [09:44]gsingers:events [09:44]gsingers:The default setting is: [09:44]gsingers:generate_diffs = add copy modify propchange [09:44]gsingers:sarowe and I are proposing to change our settings to just be add/copy/modify [09:44]gsingers:and try dropping propchange [09:45]gsingers:I honestly don't know whether it will work or not [09:45]gsingers:and it will also likely mean we will miss notifications of other propchanges [09:45]gsingers:We've asked on #asfinfra if there are other options [09:45]gsingers:and sarowe is looking into the mailer.py script to see if there are other things available [09:46]gsingers:I guess the question here is, do people want to try turning off propchange? -Grant - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: ASF Public Mail Archives on Amazon S3
Grant, public_p_r.tar seems to be missing? Is that intentional? Maybe some super-secret project inside there :) Mike On Thu, Oct 14, 2010 at 12:05 PM, Grant Ingersoll gsing...@apache.org wrote: Hi ORPers, I put up the complete ASF public mail archives as of about 3 weeks ago on Amazon's S3 and have made them public (let me know if I messed up, it is the first time I've done this). I also intend, in the coming weeks, to convert them into Mahout files (if anyone wants to help let me know). There are 5 files: https://s3.amazonaws.com/asf-mail-archives/public_a_d.tar https://s3.amazonaws.com/asf-mail-archives/public_e_k.tar https://s3.amazonaws.com/asf-mail-archives/public_l_o.tar https://s3.amazonaws.com/asf-mail-archives/public_s_t.tar https://s3.amazonaws.com/asf-mail-archives/public_u_z.tar The tarballs are organized by Top Level Project name (i.e. Mahout is in the public_l_o.tar file). The tarballs contain GZIP files by date, I believe. I believe the total uncompressed file size is somewhere in the 80-100GB range. That should be sufficient to drive some semi-interesting things in terms of scale, even if it is towards the smaller end of things. As the ASF has very clear public mailing list archive policies, it is my belief that this data set is completely unencumbered. From an ORP standpoint, this might make for a first data set for evaluation once we have the evaluator framework in place. Cheers, Grant -- Grant Ingersoll http://www.lucidimagination.com
Re: Basic authentication for stream.url
Hello Jayendra, I did not quite understand what you are aiming for. Usually you would pass basic authentication credentials along with the url. In Solr+Java you might use the following piece of code: int port = 80; String url = "http://localhost:" + port + "/solr/select"; String user = "someuser"; String password = "somepassword"; CommonsHttpSolrServer commonsHttpSolrServer = null; HttpClient httpclient = new HttpClient( new MultiThreadedHttpConnectionManager()); try { commonsHttpSolrServer = new CommonsHttpSolrServer(url, httpclient); commonsHttpSolrServer.setParser(new XMLResponseParser()); } catch (MalformedURLException e) { e.printStackTrace(); return; } if (user != null && password != null) { commonsHttpSolrServer.getHttpClient().getParams() .setAuthenticationPreemptive(true); Credentials defaultcreds = new UsernamePasswordCredentials(user, password); commonsHttpSolrServer.getHttpClient().getState().setCredentials( new AuthScope("localhost", port, AuthScope.ANY_REALM), defaultcreds); } Hope I could help a little. Kind Regards, Gregor On 11/17/2010 02:57 AM, Jayendra Patil wrote: We intend to use stream.url for indexing documents. However, the remote urls are secured and would need basic authentication to be able to access the document. The implementation with stream.file would mean downloading the files and would cause duplication, whereas stream.body would have indexing performance issues with the huge amount of data being transferred over the network. The current implementation for stream.url in ContentStreamBase.URLStream does not support authentication. But it can be easily supported by: 1. Passing an additional authentication parameter, e.g. stream.url.auth, with the encoded authentication value - SolrRequestParsers 2. Setting the Authorization request property for the Connection - ContentStreamBase.URLStream this.conn.setRequestProperty("Authorization", "Basic " + encodedauthentication); Any suggestions ??? Regards, Jayendra -- How to find files on the Internet? FindFiles.net http://findfiles.net!
[jira] Updated: (SOLR-236) Field collapsing
[ https://issues.apache.org/jira/browse/SOLR-236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bill Bell updated SOLR-236: --- Attachment: SOLR-236-distinctFacet.patch TO do distinct facet counts. Field collapsing Key: SOLR-236 URL: https://issues.apache.org/jira/browse/SOLR-236 Project: Solr Issue Type: New Feature Components: search Affects Versions: 1.3 Reporter: Emmanuel Keller Assignee: Shalin Shekhar Mangar Fix For: Next Attachments: collapsing-patch-to-1.3.0-dieter.patch, collapsing-patch-to-1.3.0-ivan.patch, collapsing-patch-to-1.3.0-ivan_2.patch, collapsing-patch-to-1.3.0-ivan_3.patch, DocSetScoreCollector.java, field-collapse-3.patch, field-collapse-4-with-solrj.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-solr-236-2.patch, field-collapse-solr-236.patch, field-collapsing-extended-592129.patch, field_collapsing_1.1.0.patch, field_collapsing_1.3.patch, field_collapsing_dsteigerwald.diff, field_collapsing_dsteigerwald.diff, field_collapsing_dsteigerwald.diff, NonAdjacentDocumentCollapser.java, NonAdjacentDocumentCollapserTest.java, quasidistributed.additional.patch, SOLR-236-1_4_1-paging-totals-working.patch, SOLR-236-1_4_1.patch, SOLR-236-distinctFacet.patch, SOLR-236-FieldCollapsing.patch, SOLR-236-FieldCollapsing.patch, SOLR-236-FieldCollapsing.patch, SOLR-236-trunk.patch, SOLR-236-trunk.patch, SOLR-236-trunk.patch, SOLR-236-trunk.patch, SOLR-236-trunk.patch, SOLR-236.patch, SOLR-236.patch, SOLR-236.patch, SOLR-236.patch, SOLR-236.patch, SOLR-236.patch, SOLR-236.patch, SOLR-236.patch, solr-236.patch, SOLR-236_collapsing.patch, SOLR-236_collapsing.patch This patch include a new feature called Field collapsing. Used in order to collapse a group of results with similar value for a given field to a single entry in the result set. Site collapsing is a special case of this, where all results for a given web site is collapsed into one or two entries in the result set, typically with an associated more documents from this site link. See also Duplicate detection. http://www.fastsearch.com/glossary.aspx?m=48amid=299 The implementation add 3 new query parameters (SolrParams): collapse.field to choose the field used to group results collapse.type normal (default value) or adjacent collapse.max to select how many continuous results are allowed before collapsing TODO (in progress): - More documentation (on source code) - Test cases Two patches: - field_collapsing.patch for current development version - field_collapsing_1.1.0.patch for Solr-1.1.0 P.S.: Feedback and misspelling correction are welcome ;-) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
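As a usage illustration of the parameters described above (collapse.field, collapse.type, collapse.max), the sketch below shows how a SolrJ client might set them on a query. The parameter names come from the patch description and only take effect on a Solr build with the SOLR-236 patch applied; the server URL, field name and query string are placeholders.

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class CollapseQueryExample {
        public static void main(String[] args) throws Exception {
            // Placeholder Solr URL; adjust to your installation.
            SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

            SolrQuery query = new SolrQuery("ipod");
            // Parameter names taken from the SOLR-236 patch description;
            // they are ignored by a stock Solr without the patch.
            query.set("collapse.field", "site");   // group results sharing the same "site" value
            query.set("collapse.type", "normal");  // "normal" (default) or "adjacent"
            query.set("collapse.max", "1");        // continuous results allowed before collapsing

            QueryResponse response = server.query(query);
            System.out.println("Results found: " + response.getResults().getNumFound());
        }
    }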
RE: need some help =)
Why not digestible? This type of question with clear short source code is most likely to be answered. - Neal -Original Message- From: Nicholas Paldino [.NET/C# MVP] [mailto:casper...@caspershouse.com] Sent: Wednesday, November 17, 2010 1:33 PM To: lucene-net-...@lucene.apache.org Subject: RE: need some help =) Why are you adding the bytes as the field value? You should add the fields as strings and you should be fine. Also, note that most people won't respond to this kind of code because it is not easily digestable. -Original Message- From: asmcad [mailto:asm...@gmail.com] Sent: Wednesday, November 17, 2010 3:02 PM To: lucene-net-dev Subject: need some help =) it's a simple index and search application but i couldn't make it work. it doesn't give any error but it doesn't give any results too. 1. using System; 2. using System.Collections.Generic; 3. using System.ComponentModel; 4. using System.Data; 5. using System.Drawing; 6. using System.Linq; 7. using System.Text; 8. using System.Windows.Forms; 9. using Lucene.Net; 10. using Lucene.Net.Analysis.Standard; 11. using Lucene.Net.Documents; 12. using Lucene.Net.Index; 13. using Lucene.Net.QueryParsers; 14. using Lucene.Net.Search; 15. using System.IO; 16. 17. namespace newLucene 18. { 19. public partial class Form1 : Form 20. { 21. public Form1() 22. { 23. InitializeComponent(); 24. } 25. 26. private void buttonIndex_Click(object sender, EventArgs e) 27. { 28. IndexWriter indexwrtr = new IndexWriter(@c:\index\,new StandardAnalyzer() , true); 29. Document doc = new Document(); 30. string filename = @fer.txt; 31. Lucene.Net.QueryParsers.QueryParser df; 32. 33. 34. 35. System.IO.StreamReader local_StreamReader = new System.IO.StreamReader(@C:\z\fer.txt); 36. string file_text = local_StreamReader.ReadToEnd(); 37. 38. System.Text.UTF8Encoding encoding = new System.Text.UTF8Encoding(); 39. doc.Add(new Field(text,encoding.GetBytes(file_text),Field.Store.YES)); 40. doc.Add(new Field(path,encoding.GetBytes(@C:\z\),Field.Store.YES)); 41. doc.Add(new Field(title, encoding.GetBytes(filename), Field.Store.YES)); 42. indexwrtr.AddDocument(doc); 43. 44. indexwrtr.Optimize(); 45. indexwrtr.Close(); 46. 47. } 48. 49. private void buttonSearch_Click(object sender, EventArgs e) 50. { 51. IndexSearcher indxsearcher = new IndexSearcher(@C:\index\); 52. 53. QueryParser parser = new QueryParser(contents, new StandardAnalyzer()); 54. Query query = parser.Parse(textBoxQuery.Text); 55. 56. //Lucene.Net.QueryParsers.QueryParser qp = new QueryParser(Lucene.Net.QueryParsers.CharStream s).Parse(textBoxQuery.Text); 57. Hits hits = indxsearcher.Search(query); 58. 59. 60. for (int i = 0; i hits.Length(); i++) 61. { 62. 63. Document doc = hits.Doc(i); 64. 65. 66. string filename = doc.Get(title); 67. string path = doc.Get(path); 68. string folder = Path.GetDirectoryName(path); 69. 70. 71. ListViewItem item = new ListViewItem(new string[] { null, filename, asd, hits.Score(i).ToString() }); 72. item.Tag = path; 73. 74. this.listViewResults.Items.Add(item); 75. Application.DoEvents(); 76. } 77. 78. indxsearcher.Close(); 79. 80. 81. 82. 83. } 84. } 85. } thanks
Re: need some help =)
=) i was about to write an answer... On 17.11.2010 20:51, Granroth, Neal V. wrote: Why not digestible? This type of question with clear short source code is most likely to be answered. - Neal -Original Message- From: Nicholas Paldino [.NET/C# MVP] [mailto:casper...@caspershouse.com] Sent: Wednesday, November 17, 2010 1:33 PM To: lucene-net-...@lucene.apache.org Subject: RE: need some help =) Why are you adding the bytes as the field value? You should add the fields as strings and you should be fine. Also, note that most people won't respond to this kind of code because it is not easily digestable. -Original Message- From: asmcad [mailto:asm...@gmail.com] Sent: Wednesday, November 17, 2010 3:02 PM To: lucene-net-dev Subject: need some help =) it's a simple index and search application but i couldn't make it work. it doesn't give any error but it doesn't give any results too. 1. using System; 2. using System.Collections.Generic; 3. using System.ComponentModel; 4. using System.Data; 5. using System.Drawing; 6. using System.Linq; 7. using System.Text; 8. using System.Windows.Forms; 9. using Lucene.Net; 10. using Lucene.Net.Analysis.Standard; 11. using Lucene.Net.Documents; 12. using Lucene.Net.Index; 13. using Lucene.Net.QueryParsers; 14. using Lucene.Net.Search; 15. using System.IO; 16. 17. namespace newLucene 18. { 19. public partial class Form1 : Form 20. { 21. public Form1() 22. { 23. InitializeComponent(); 24. } 25. 26. private void buttonIndex_Click(object sender, EventArgs e) 27. { 28. IndexWriter indexwrtr = new IndexWriter(@c:\index\,new StandardAnalyzer() , true); 29. Document doc = new Document(); 30. string filename = @fer.txt; 31. Lucene.Net.QueryParsers.QueryParser df; 32. 33. 34. 35. System.IO.StreamReader local_StreamReader = new System.IO.StreamReader(@C:\z\fer.txt); 36. string file_text = local_StreamReader.ReadToEnd(); 37. 38. System.Text.UTF8Encoding encoding = new System.Text.UTF8Encoding(); 39. doc.Add(new Field(text,encoding.GetBytes(file_text),Field.Store.YES)); 40. doc.Add(new Field(path,encoding.GetBytes(@C:\z\),Field.Store.YES)); 41. doc.Add(new Field(title, encoding.GetBytes(filename), Field.Store.YES)); 42. indexwrtr.AddDocument(doc); 43. 44. indexwrtr.Optimize(); 45. indexwrtr.Close(); 46. 47. } 48. 49. private void buttonSearch_Click(object sender, EventArgs e) 50. { 51. IndexSearcher indxsearcher = new IndexSearcher(@C:\index\); 52. 53. QueryParser parser = new QueryParser(contents, new StandardAnalyzer()); 54. Query query = parser.Parse(textBoxQuery.Text); 55. 56. //Lucene.Net.QueryParsers.QueryParser qp = new QueryParser(Lucene.Net.QueryParsers.CharStream s).Parse(textBoxQuery.Text); 57. Hits hits = indxsearcher.Search(query); 58. 59. 60. for (int i = 0; i hits.Length(); i++) 61. { 62. 63. Document doc = hits.Doc(i); 64. 65. 66. string filename = doc.Get(title); 67. string path = doc.Get(path); 68. string folder = Path.GetDirectoryName(path); 69. 70. 71. ListViewItem item = new ListViewItem(new string[] { null, filename, asd, hits.Score(i).ToString() }); 72. item.Tag = path; 73. 74. this.listViewResults.Items.Add(item); 75. Application.DoEvents(); 76. } 77. 78. indxsearcher.Close(); 79. 80. 81. 82. 83. } 84. } 85. } thanks
Stemming using automata
Folks, I had an interesting conversation with Simon a few weeks back. It occurred to me that it might be possible to build an automata that handles stemming and pluralization on searches. Just a thought... Karl
[jira] Updated: (SOLR-2240) Basic authentication for stream.url
[ https://issues.apache.org/jira/browse/SOLR-2240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jayendra Patil updated SOLR-2240: - Attachment: SOLR-2240.patch Attached the Patch for the changes. Basic authentication for stream.url --- Key: SOLR-2240 URL: https://issues.apache.org/jira/browse/SOLR-2240 Project: Solr Issue Type: Improvement Components: update Affects Versions: 4.0 Reporter: Jayendra Patil Priority: Minor Attachments: SOLR-2240.patch We intend to use stream.url for indexing documents from remote locations exposed through http. However, the remote urls are secured and would need basic authentication to be able to access the documents. The current implementation for stream.url in ContentStreamBase.URLStream does not support authentication. The implementation with stream.file would mean to download the files to a local box and would cause duplication, whereas stream.body would have indexing performance issues with the huge data being transferred over the network. An approach would be :- 1. Passing an additional authentication parameter e.g. stream.url.auth with the encoded authentication value - SolrRequestParsers 2. Setting the Authorization request property for the Connection - ContentStreamBase.URLStream this.conn.setRequestProperty("Authorization", "Basic " + encodedauthentication); Any thoughts ?? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
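For readers following the thread, the second approach above boils down to setting a Basic Authorization header on the connection before it is opened. A minimal, self-contained sketch of that idea (not the actual SOLR-2240 patch; it assumes commons-codec, which Solr already ships, for the Base64 step, and leaves out the SolrRequestParsers wiring that would pass stream.url.auth through):

    import java.io.InputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import org.apache.commons.codec.binary.Base64;

    public class BasicAuthUrlStream {
        public static InputStream open(String location, String user, String password) throws Exception {
            URL url = new URL(location);
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            // Base64-encode "user:password" and send it as a Basic Authorization header,
            // mirroring the setRequestProperty(...) call proposed in the issue.
            byte[] raw = (user + ":" + password).getBytes("UTF-8");
            String encoded = new String(Base64.encodeBase64(raw), "UTF-8");
            conn.setRequestProperty("Authorization", "Basic " + encoded);
            return conn.getInputStream();
        }
    }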
Re: Basic authentication for stream.url
JIRA - https://issues.apache.org/jira/browse/SOLR-2240 Patch attached. How does the patch make it to the trunk ??? Had submitted a couple of more patches SOLR-2156 SOLR-2029, would like them to be included in the release. Regards, Jayendra On Wed, Nov 17, 2010 at 2:15 PM, Yonik Seeley yo...@lucidimagination.comwrote: On Tue, Nov 16, 2010 at 8:57 PM, Jayendra Patil jayendra.patil@gmail.com wrote: We intend to use schema.url for indexing documents. However, the remote urls are secured and would need basic authentication to be able access the document. The implementation with stream.file would mean to download the files and would cause duplicity, whereas stream.body would have indexing performance issues with the hugh data being transferred over the network. The current implementation for stream.url in ContentStreamBase.URLStream does not support authentication. But can be easily supported by :- 1. Passing additional authentication parameter e.g. stream.url.auth with the encoded authentication value - SolrRequestParsers 2. Setting Authorization request property for the Connection - ContentStreamBase.URLStream this.conn.setRequestProperty(Authorization, Basic + encodedauthentication); Sounds like a good idea to me. Could you open a JIRA issue for this feature, and supply a patch if you get to it? -Yonik http://www.lucidimagination.com - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: Stemming using automata
Karl, you are right. this is one of the ways i originally used this thing. i've done some relevance experiments along these lines (some summary results here http://www.slideshare.net/otisg/finite-state-queries-in-lucene). in this case i compared 3 cases: index-time porter stemming, index-time plural stemming, and query-time plural stemming (with automaton). in general you can get similar results, slower query speed, but more flexibility. for instance, you could have a queryparser that implements a stem() operator without indexing everything twice. probably pretty boring for most people, but in some cases (e.g. lots of languages) query-time starts to become more attractive... On Wed, Nov 17, 2010 at 3:18 PM, karl.wri...@nokia.com wrote: Folks, I had an interesting conversation with Simon a few weeks back. It occurred to me that it might be possible to build an automata that handles stemming and pluralization on searches. Just a thought… Karl - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
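A rough sketch of the query-time plural matching described above, using the automaton query machinery on trunk (assumes the org.apache.lucene.util.automaton RegExp and org.apache.lucene.search.AutomatonQuery APIs; the field and word are placeholders, and the input word is assumed to contain no regex metacharacters):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.AutomatonQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.util.automaton.Automaton;
    import org.apache.lucene.util.automaton.RegExp;

    public class PluralStemQuery {
        // Build a query that matches both "guitar" and "guitars" at query time,
        // without having indexed a separately stemmed form.
        public static Query pluralQuery(String field, String word) {
            Automaton automaton = new RegExp(word + "s?").toAutomaton();
            return new AutomatonQuery(new Term(field, word), automaton);
        }
    }

A stem() query-parser operator, as mentioned above, could build such automata on the fly instead of relying on an index-time stemmer.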
RE: need some help =)
You can try UnaccentedWordAnalyzer in /contrib/Contrib.Net/ (you have to download the contrib code from svn). DIGY -Original Message- From: asmcad [mailto:asm...@gmail.com] Sent: Wednesday, November 17, 2010 11:24 PM To: lucene-net-...@lucene.apache.org Subject: Re: need some help =) i need turkish analyzer. my lucene book says i need to use SnowballAnalyzer but i can't access to it as Lucene.Net.Analysis.Snowball should i install another library to use it? On 17.11.2010 21:12, Granroth, Neal V. wrote: You need to pick a suitable analyzer for use during indexing and for queries. The StandardAnalyzer you are using will most likely break the words apart at the non-english characters. You might want to consider using the Luke tool to inspect the index you've created and see who the words in your documents were split and indexed. - Neal -Original Message- From: asmcad [mailto:asm...@gmail.com] Sent: Wednesday, November 17, 2010 3:06 PM To: lucene-net-...@lucene.apache.org Subject: Re: need some help =) i solved the problem . now i have non-english character problem. when i search like something çşğuı(i'm not sure you can see this) characters. i don't get any results. how can i solve this ? by the way sorry about the content messing =) thanks for the previous help =) On 17.11.2010 20:16, Digy wrote: 1. using System; 2. using System.Collections.Generic; 3. using System.ComponentModel; 4. using System.Data; 5. using System.Drawing; 6. using System.Linq; 7. using System.Text; 8. using System.Windows.Forms; 9. using Lucene.Net; 10. using Lucene.Net.Analysis.Standard; 11. using Lucene.Net.Documents; 12. using Lucene.Net.Index; 13. using Lucene.Net.QueryParsers; 14. using Lucene.Net.Search; 15. using System.IO; 16. 17. namespace newLucene 18. { 19. public partial class Form1 : Form 20. { 21. public Form1() 22. { 23. InitializeComponent(); 24. } 25. 26. private void buttonIndex_Click(object sender, EventArgs e) 27. { 28. IndexWriter indexwrtr = new IndexWriter(@c:\index\,new StandardAnalyzer() , true); 29. Document doc = new Document(); 30. string filename = @fer.txt; 31. Lucene.Net.QueryParsers.QueryParser df; 32. 33. 34. 35. System.IO.StreamReader local_StreamReader = new System.IO.StreamReader(@C:\z\fer.txt); 36. string file_text = local_StreamReader.ReadToEnd(); 37. 38. System.Text.UTF8Encoding encoding = new System.Text.UTF8Encoding(); 39. doc.Add(new Field(text,encoding.GetBytes(file_text),Field.Store.YES)); 40. doc.Add(new Field(path,encoding.GetBytes(@C:\z\),Field.Store.YES)); 41. doc.Add(new Field(title, encoding.GetBytes(filename), Field.Store.YES)); 42. indexwrtr.AddDocument(doc); 43. 44. indexwrtr.Optimize(); 45. indexwrtr.Close(); 46. 47. } 48. 49. private void buttonSearch_Click(object sender, EventArgs e) 50. { 51. IndexSearcher indxsearcher = new IndexSearcher(@C:\index\); 52. 53. QueryParser parser = new QueryParser(contents, new StandardAnalyzer()); 54. Query query = parser.Parse(textBoxQuery.Text); 55. 56. //Lucene.Net.QueryParsers.QueryParser qp = new QueryParser(Lucene.Net.QueryParsers.CharStream s).Parse(textBoxQuery.Text); 57. Hits hits = indxsearcher.Search(query); 58. 59. 60. for (int i = 0; i hits.Length(); i++) 61. { 62. 63. Document doc = hits.Doc(i); 64. 65. 66. string filename = doc.Get(title); 67. string path = doc.Get(path); 68. string folder = Path.GetDirectoryName(path); 69. 70. 71. ListViewItem item = new ListViewItem(new string[] { null, filename, asd, hits.Score(i).ToString() }); 72. item.Tag = path; 73. 74. this.listViewResults.Items.Add(item); 75. 
Application.DoEvents(); 76. } 77. 78. indxsearcher.Close(); 79. 80. 81. 82. 83. } 84.
[jira] Commented: (SOLR-236) Field collapsing
[ https://issues.apache.org/jira/browse/SOLR-236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12933153#action_12933153 ] Stephen Weiss commented on SOLR-236: Cheers peterwang, you're probably right. I didn't actually use this patch, I made the modifications by hand after applying Martijn's patch. I generally don't make my own patch files, I just let SVN do it for me, so I'm not really aware of the syntax... The point is to just delete those extra lines. Field collapsing Key: SOLR-236 URL: https://issues.apache.org/jira/browse/SOLR-236 Project: Solr Issue Type: New Feature Components: search Affects Versions: 1.3 Reporter: Emmanuel Keller Assignee: Shalin Shekhar Mangar Fix For: Next Attachments: collapsing-patch-to-1.3.0-dieter.patch, collapsing-patch-to-1.3.0-ivan.patch, collapsing-patch-to-1.3.0-ivan_2.patch, collapsing-patch-to-1.3.0-ivan_3.patch, DocSetScoreCollector.java, field-collapse-3.patch, field-collapse-4-with-solrj.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-solr-236-2.patch, field-collapse-solr-236.patch, field-collapsing-extended-592129.patch, field_collapsing_1.1.0.patch, field_collapsing_1.3.patch, field_collapsing_dsteigerwald.diff, field_collapsing_dsteigerwald.diff, field_collapsing_dsteigerwald.diff, NonAdjacentDocumentCollapser.java, NonAdjacentDocumentCollapserTest.java, quasidistributed.additional.patch, SOLR-236-1_4_1-paging-totals-working.patch, SOLR-236-1_4_1.patch, SOLR-236-distinctFacet.patch, SOLR-236-FieldCollapsing.patch, SOLR-236-FieldCollapsing.patch, SOLR-236-FieldCollapsing.patch, SOLR-236-trunk.patch, SOLR-236-trunk.patch, SOLR-236-trunk.patch, SOLR-236-trunk.patch, SOLR-236-trunk.patch, SOLR-236.patch, SOLR-236.patch, SOLR-236.patch, SOLR-236.patch, SOLR-236.patch, SOLR-236.patch, SOLR-236.patch, SOLR-236.patch, solr-236.patch, SOLR-236_collapsing.patch, SOLR-236_collapsing.patch This patch include a new feature called Field collapsing. Used in order to collapse a group of results with similar value for a given field to a single entry in the result set. Site collapsing is a special case of this, where all results for a given web site is collapsed into one or two entries in the result set, typically with an associated more documents from this site link. See also Duplicate detection. http://www.fastsearch.com/glossary.aspx?m=48amid=299 The implementation add 3 new query parameters (SolrParams): collapse.field to choose the field used to group results collapse.type normal (default value) or adjacent collapse.max to select how many continuous results are allowed before collapsing TODO (in progress): - More documentation (on source code) - Test cases Two patches: - field_collapsing.patch for current development version - field_collapsing_1.1.0.patch for Solr-1.1.0 P.S.: Feedback and misspelling correction are welcome ;-) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: Basic authentication for stream.url
How does the patch make it to the trunk You need to track it and prompt the dev list if you think it's forgotten. Basically, when a committer thinks it's ready and valuable s/he will commit it to trunk for you. But give the committers some time before prompting, they're usually up to their ears in other changes Best Erick On Wed, Nov 17, 2010 at 3:30 PM, Jayendra Patil jayendra.patil@gmail.com wrote: JIRA - https://issues.apache.org/jira/browse/SOLR-2240 Patch attached. How does the patch make it to the trunk ??? Had submitted a couple of more patches SOLR-2156 SOLR-2029, would like them to be included in the release. Regards, Jayendra On Wed, Nov 17, 2010 at 2:15 PM, Yonik Seeley yo...@lucidimagination.comwrote: On Tue, Nov 16, 2010 at 8:57 PM, Jayendra Patil jayendra.patil@gmail.com wrote: We intend to use schema.url for indexing documents. However, the remote urls are secured and would need basic authentication to be able access the document. The implementation with stream.file would mean to download the files and would cause duplicity, whereas stream.body would have indexing performance issues with the hugh data being transferred over the network. The current implementation for stream.url in ContentStreamBase.URLStream does not support authentication. But can be easily supported by :- 1. Passing additional authentication parameter e.g. stream.url.auth with the encoded authentication value - SolrRequestParsers 2. Setting Authorization request property for the Connection - ContentStreamBase.URLStream this.conn.setRequestProperty(Authorization, Basic + encodedauthentication); Sounds like a good idea to me. Could you open a JIRA issue for this feature, and supply a patch if you get to it? -Yonik http://www.lucidimagination.com - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
RE: need some help =)
UnaccentedWordAnalyzer doesn't make use of stemming. If you really need it; a) SnowballAnalyzer is not good in turkish stemming. b) It is better to write a custom analyzer using Zemberek or its .NET version NZemberek. DIGY -Original Message- From: asmcad [mailto:asm...@gmail.com] Sent: Wednesday, November 17, 2010 11:24 PM To: lucene-net-...@lucene.apache.org Subject: Re: need some help =) i need turkish analyzer. my lucene book says i need to use SnowballAnalyzer but i can't access to it as Lucene.Net.Analysis.Snowball should i install another library to use it? On 17.11.2010 21:12, Granroth, Neal V. wrote: You need to pick a suitable analyzer for use during indexing and for queries. The StandardAnalyzer you are using will most likely break the words apart at the non-english characters. You might want to consider using the Luke tool to inspect the index you've created and see who the words in your documents were split and indexed. - Neal -Original Message- From: asmcad [mailto:asm...@gmail.com] Sent: Wednesday, November 17, 2010 3:06 PM To: lucene-net-...@lucene.apache.org Subject: Re: need some help =) i solved the problem . now i have non-english character problem. when i search like something çşğuı(i'm not sure you can see this) characters. i don't get any results. how can i solve this ? by the way sorry about the content messing =) thanks for the previous help =) On 17.11.2010 20:16, Digy wrote: 1. using System; 2. using System.Collections.Generic; 3. using System.ComponentModel; 4. using System.Data; 5. using System.Drawing; 6. using System.Linq; 7. using System.Text; 8. using System.Windows.Forms; 9. using Lucene.Net; 10. using Lucene.Net.Analysis.Standard; 11. using Lucene.Net.Documents; 12. using Lucene.Net.Index; 13. using Lucene.Net.QueryParsers; 14. using Lucene.Net.Search; 15. using System.IO; 16. 17. namespace newLucene 18. { 19. public partial class Form1 : Form 20. { 21. public Form1() 22. { 23. InitializeComponent(); 24. } 25. 26. private void buttonIndex_Click(object sender, EventArgs e) 27. { 28. IndexWriter indexwrtr = new IndexWriter(@c:\index\,new StandardAnalyzer() , true); 29. Document doc = new Document(); 30. string filename = @fer.txt; 31. Lucene.Net.QueryParsers.QueryParser df; 32. 33. 34. 35. System.IO.StreamReader local_StreamReader = new System.IO.StreamReader(@C:\z\fer.txt); 36. string file_text = local_StreamReader.ReadToEnd(); 37. 38. System.Text.UTF8Encoding encoding = new System.Text.UTF8Encoding(); 39. doc.Add(new Field(text,encoding.GetBytes(file_text),Field.Store.YES)); 40. doc.Add(new Field(path,encoding.GetBytes(@C:\z\),Field.Store.YES)); 41. doc.Add(new Field(title, encoding.GetBytes(filename), Field.Store.YES)); 42. indexwrtr.AddDocument(doc); 43. 44. indexwrtr.Optimize(); 45. indexwrtr.Close(); 46. 47. } 48. 49. private void buttonSearch_Click(object sender, EventArgs e) 50. { 51. IndexSearcher indxsearcher = new IndexSearcher(@C:\index\); 52. 53. QueryParser parser = new QueryParser(contents, new StandardAnalyzer()); 54. Query query = parser.Parse(textBoxQuery.Text); 55. 56. //Lucene.Net.QueryParsers.QueryParser qp = new QueryParser(Lucene.Net.QueryParsers.CharStream s).Parse(textBoxQuery.Text); 57. Hits hits = indxsearcher.Search(query); 58. 59. 60. for (int i = 0; i hits.Length(); i++) 61. { 62. 63. Document doc = hits.Doc(i); 64. 65. 66. string filename = doc.Get(title); 67. string path = doc.Get(path); 68. string folder = Path.GetDirectoryName(path); 69. 70. 71. 
ListViewItem item = new ListViewItem(new string[] { null, filename, asd, hits.Score(i).ToString() }); 72. item.Tag = path; 73. 74. this.listViewResults.Items.Add(item); 75. Application.DoEvents(); 76. } 77. 78.
RE: Lucene project announcement
Is Java Lucene grown up? Look at how much discussion it took to determine how to get Java out of the name :) The discussion about advancing the algorithm in C#/.NET seems to be missing the point. If you're developing at the concept level, the specific language you use becomes unimportant. However, as most of the concept developers apparently find Java convenient, others wanting to participate at the concept level would find it more beneficial to join that brain-pool instead of diluting the effort by starting up elsewhere. - Neal -Original Message- From: George Aroush [mailto:geo...@aroush.net] Sent: Tuesday, November 16, 2010 10:55 PM To: lucene-net-...@lucene.apache.org Cc: dev@lucene.apache.org Subject: RE: Lucene project announcement This topic has been coming back again and again, and I have tried to address it multiple times, so let me try again. 1) Java Lucene started years before the first C# version (4+ years if I get my history right), thus it has defined, and continues to define, the technology and the API. It is the established leader, and everyone else is just a follower. 2) Lucene.Net is nowhere near as mature as Java Lucene; it never established itself or built a rich development community -- thus why we are here today. 3) If, and only if, the community of Lucene.Net (or Lucene over at codeplex.com) manages to prove itself to the level of Java Lucene, only then will such a community have the voice to influence folks over at Java Lucene. Only then will you see the two communities discussing search engine vs. port issues or the state of Lucene.Net. If you look in my previous posts, I have pointed those out. We must first: 1) Be on par with Java Lucene releases and keep up with a commit-per-commit port. 2) Prove Lucene.Net is a grownup project with followers and a healthy community (just like Java Lucene). If we don't achieve the above, folks over at Java Lucene will not take us seriously, and thus we can't influence them. -- George -Original Message- From: Nicholas Paldino [.NET/C# MVP] [mailto:casper...@caspershouse.com] Sent: Friday, November 12, 2010 10:36 AM To: lucene-net-...@lucene.apache.org Cc: dev@lucene.apache.org Subject: RE: Lucene project announcement Paul, et al, Paul, God bless you. This is probably the most rational, practical perspective I've seen on the whole matter since the debacle started. While Lucene started off as a Java project, its massive success indicates that the concepts around it are very desirable by developers in other technologies, and that the Java product isn't being translated well into those technology stacks. That's not a slight against those who have contributed to this point to try and keep the .NET version in line with the Java one (despite my thinking that the actual approach to doing so is horribly misguided). That said, there should be a serious conversation with the Java-version folk about making this happen. How can Lucene be abstracted/standardized in a non-technology-stack-specific way, so that other technology stacks can create implementations against that abstraction/standard? Is it too much to ask of the Java folk? Perhaps. After all, they haven't done it yet and it doesn't seem like they see the need for this. This isn't an unjustified position; that project has a massive user base and success which creates massive responsibilities to the project that must be fulfilled.
If such a thing proceeds, this is what I'd like to see in such an abstraction:
- Technology-agnostic concepts used, down to the class level:
  - Classes might be the one exception, they are near universal. However, this could be something like entity.
  - Properties - Java doesn't have properties, they have a property convention. .NET has the concept of a property, which translates to a named getter and/or setter which can execute additional code on either in addition to the assignment.
  - Fields - Raw exposed data points. Whether or not these ^should^ be used is a different story, but there are some places where they are used so a definition is needed.
  - Methods - Functions/methods, whatever you want to call them, we all know what they are.
- In the end, the names are not important as much as the abstractions are, I think we all have an idea on what they are.
- Right now, I don't have a problem with a class-by-class mapping, but over time, whether or not class design was done to suit the technology should be addressed, and ultimately abstracted out if this is the case.
- Things like ^what^ is returned from methods or internal constructs that are used to make guarantees about behavior and the like should be abstracted out. For example, in Lucene.NET we have the following (in order to maintain a line-by-line port in most cases):
  - A custom implementation of ReaderWriterLock. There's no reason for something like this. -
Re: need some help =)
i don't have any ide writing custom analyzer... so i'll stick with SnowballAnalyzer for now. On 17.11.2010 21:53, Digy wrote: UnaccentedWordAnalyzer doesn't make use of stemming. If you really need it; a) SnowballAnalyzer is not good in turkish stemming. b) It is better to write a custom analyzer using Zemberek or its .NET version NZemberek. DIGY -Original Message- From: asmcad [mailto:asm...@gmail.com] Sent: Wednesday, November 17, 2010 11:24 PM To: lucene-net-...@lucene.apache.org Subject: Re: need some help =) i need turkish analyzer. my lucene book says i need to use SnowballAnalyzer but i can't access to it as Lucene.Net.Analysis.Snowball should i install another library to use it? On 17.11.2010 21:12, Granroth, Neal V. wrote: You need to pick a suitable analyzer for use during indexing and for queries. The StandardAnalyzer you are using will most likely break the words apart at the non-english characters. You might want to consider using the Luke tool to inspect the index you've created and see who the words in your documents were split and indexed. - Neal -Original Message- From: asmcad [mailto:asm...@gmail.com] Sent: Wednesday, November 17, 2010 3:06 PM To: lucene-net-...@lucene.apache.org Subject: Re: need some help =) i solved the problem . now i have non-english character problem. when i search like something çşğuı(i'm not sure you can see this) characters. i don't get any results. how can i solve this ? by the way sorry about the content messing =) thanks for the previous help =) On 17.11.2010 20:16, Digy wrote: 1. using System; 2. using System.Collections.Generic; 3. using System.ComponentModel; 4. using System.Data; 5. using System.Drawing; 6. using System.Linq; 7. using System.Text; 8. using System.Windows.Forms; 9. using Lucene.Net; 10. using Lucene.Net.Analysis.Standard; 11. using Lucene.Net.Documents; 12. using Lucene.Net.Index; 13. using Lucene.Net.QueryParsers; 14. using Lucene.Net.Search; 15. using System.IO; 16. 17. namespace newLucene 18. { 19. public partial class Form1 : Form 20. { 21. public Form1() 22. { 23. InitializeComponent(); 24. } 25. 26. private void buttonIndex_Click(object sender, EventArgs e) 27. { 28. IndexWriter indexwrtr = new IndexWriter(@c:\index\,new StandardAnalyzer() , true); 29. Document doc = new Document(); 30. string filename = @fer.txt; 31. Lucene.Net.QueryParsers.QueryParser df; 32. 33. 34. 35. System.IO.StreamReader local_StreamReader = new System.IO.StreamReader(@C:\z\fer.txt); 36. string file_text = local_StreamReader.ReadToEnd(); 37. 38. System.Text.UTF8Encoding encoding = new System.Text.UTF8Encoding(); 39. doc.Add(new Field(text,encoding.GetBytes(file_text),Field.Store.YES)); 40. doc.Add(new Field(path,encoding.GetBytes(@C:\z\),Field.Store.YES)); 41. doc.Add(new Field(title, encoding.GetBytes(filename), Field.Store.YES)); 42. indexwrtr.AddDocument(doc); 43. 44. indexwrtr.Optimize(); 45. indexwrtr.Close(); 46. 47. } 48. 49. private void buttonSearch_Click(object sender, EventArgs e) 50. { 51. IndexSearcher indxsearcher = new IndexSearcher(@C:\index\); 52. 53. QueryParser parser = new QueryParser(contents, new StandardAnalyzer()); 54. Query query = parser.Parse(textBoxQuery.Text); 55. 56. //Lucene.Net.QueryParsers.QueryParser qp = new QueryParser(Lucene.Net.QueryParsers.CharStream s).Parse(textBoxQuery.Text); 57. Hits hits = indxsearcher.Search(query); 58. 59. 60. for (int i = 0; ihits.Length(); i++) 61. { 62. 63. Document doc = hits.Doc(i); 64. 65. 66. string filename = doc.Get(title); 67. string path = doc.Get(path); 68. 
string folder = Path.GetDirectoryName(path); 69. 70. 71. ListViewItem item = new ListViewItem(new string[] { null, filename, asd, hits.Score(i).ToString() }); 72. item.Tag = path; 73. 74.
Re: need some help =)
You should be able to open any of the contrib projects with the free visual studio express software or with monodevelop, also free. On Nov 17, 2010, at 1:58 PM, asmcad asm...@gmail.com wrote: i don't have any ide writing custom analyzer... so i'll stick with SnowballAnalyzer for now. On 17.11.2010 21:53, Digy wrote: UnaccentedWordAnalyzer doesn't make use of stemming. If you really need it; a) SnowballAnalyzer is not good in turkish stemming. b) It is better to write a custom analyzer using Zemberek or its .NET version NZemberek. DIGY -Original Message- From: asmcad [mailto:asm...@gmail.com] Sent: Wednesday, November 17, 2010 11:24 PM To: lucene-net-...@lucene.apache.org Subject: Re: need some help =) i need turkish analyzer. my lucene book says i need to use SnowballAnalyzer but i can't access to it as Lucene.Net.Analysis.Snowball should i install another library to use it? On 17.11.2010 21:12, Granroth, Neal V. wrote: You need to pick a suitable analyzer for use during indexing and for queries. The StandardAnalyzer you are using will most likely break the words apart at the non-english characters. You might want to consider using the Luke tool to inspect the index you've created and see who the words in your documents were split and indexed. - Neal -Original Message- From: asmcad [mailto:asm...@gmail.com] Sent: Wednesday, November 17, 2010 3:06 PM To: lucene-net-...@lucene.apache.org Subject: Re: need some help =) i solved the problem . now i have non-english character problem. when i search like something çşğuı(i'm not sure you can see this) characters. i don't get any results. how can i solve this ? by the way sorry about the content messing =) thanks for the previous help =) On 17.11.2010 20:16, Digy wrote: 1. using System; 2. using System.Collections.Generic; 3. using System.ComponentModel; 4. using System.Data; 5. using System.Drawing; 6. using System.Linq; 7. using System.Text; 8. using System.Windows.Forms; 9. using Lucene.Net; 10. using Lucene.Net.Analysis.Standard; 11. using Lucene.Net.Documents; 12. using Lucene.Net.Index; 13. using Lucene.Net.QueryParsers; 14. using Lucene.Net.Search; 15. using System.IO; 16. 17. namespace newLucene 18. { 19. public partial class Form1 : Form 20. { 21. public Form1() 22. { 23. InitializeComponent(); 24. } 25. 26. private void buttonIndex_Click(object sender, EventArgs e) 27. { 28. IndexWriter indexwrtr = new IndexWriter(@c:\index\,new StandardAnalyzer() , true); 29. Document doc = new Document(); 30. string filename = @fer.txt; 31. Lucene.Net.QueryParsers.QueryParser df; 32. 33. 34. 35. System.IO.StreamReader local_StreamReader = new System.IO.StreamReader(@C:\z\fer.txt); 36. string file_text = local_StreamReader.ReadToEnd(); 37. 38. System.Text.UTF8Encoding encoding = new System.Text.UTF8Encoding(); 39. doc.Add(new Field(text,encoding.GetBytes(file_text),Field.Store.YES)); 40. doc.Add(new Field(path,encoding.GetBytes(@C:\z\),Field.Store.YES)); 41. doc.Add(new Field(title, encoding.GetBytes(filename), Field.Store.YES)); 42. indexwrtr.AddDocument(doc); 43. 44. indexwrtr.Optimize(); 45. indexwrtr.Close(); 46. 47. } 48. 49. private void buttonSearch_Click(object sender, EventArgs e) 50. { 51. IndexSearcher indxsearcher = new IndexSearcher(@C:\index\); 52. 53. QueryParser parser = new QueryParser(contents, new StandardAnalyzer()); 54. Query query = parser.Parse(textBoxQuery.Text); 55. 56. //Lucene.Net.QueryParsers.QueryParser qp = new QueryParser(Lucene.Net.QueryParsers.CharStream s).Parse(textBoxQuery.Text); 57. 
Hits hits = indxsearcher.Search(query); 58. 59. 60. for (int i = 0; ihits.Length(); i++) 61. { 62. 63. Document doc = hits.Doc(i); 64. 65. 66. string filename = doc.Get(title); 67. string path = doc.Get(path); 68. string folder = Path.GetDirectoryName(path); 69. 70.
Lucene-Solr-tests-only-trunk - Build # 1536 - Failure
Build: https://hudson.apache.org/hudson/job/Lucene-Solr-tests-only-trunk/1536/ 3 tests failed. FAILED: junit.framework.TestSuite.org.apache.solr.cloud.BasicZkTest Error Message: KeeperErrorCode = ConnectionLoss for /configs/conf1/protwords.txt Stack Trace: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /configs/conf1/protwords.txt at org.apache.zookeeper.KeeperException.create(KeeperException.java:90) at org.apache.zookeeper.KeeperException.create(KeeperException.java:42) at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1038) at org.apache.solr.common.cloud.SolrZkClient.setData(SolrZkClient.java:225) at org.apache.solr.common.cloud.SolrZkClient.setData(SolrZkClient.java:389) at org.apache.solr.common.cloud.SolrZkClient.setData(SolrZkClient.java:411) at org.apache.solr.cloud.AbstractZkTestCase.putConfig(AbstractZkTestCase.java:97) at org.apache.solr.cloud.AbstractZkTestCase.buildZooKeeper(AbstractZkTestCase.java:87) at org.apache.solr.cloud.AbstractZkTestCase.azt_beforeClass(AbstractZkTestCase.java:61) REGRESSION: org.apache.solr.cloud.CloudStateUpdateTest.testCoreRegistration Error Message: null Stack Trace: org.apache.solr.common.cloud.ZooKeeperException: at org.apache.solr.core.CoreContainer.register(CoreContainer.java:530) at org.apache.solr.core.CoreContainer.register(CoreContainer.java:558) at org.apache.solr.cloud.CloudStateUpdateTest.testCoreRegistration(CloudStateUpdateTest.java:156) at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:923) at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:861) Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /collections/testcore/shards/127.0.0.1:1661_solr_testcore at org.apache.zookeeper.KeeperException.create(KeeperException.java:90) at org.apache.zookeeper.KeeperException.create(KeeperException.java:42) at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:637) at org.apache.solr.common.cloud.SolrZkClient.makePath(SolrZkClient.java:348) at org.apache.solr.common.cloud.SolrZkClient.makePath(SolrZkClient.java:309) at org.apache.solr.common.cloud.SolrZkClient.makePath(SolrZkClient.java:371) at org.apache.solr.cloud.ZkController.addZkShardsNode(ZkController.java:155) at org.apache.solr.cloud.ZkController.register(ZkController.java:481) at org.apache.solr.core.CoreContainer.register(CoreContainer.java:521) REGRESSION: org.apache.solr.cloud.ZkSolrClientTest.testWatchChildren Error Message: KeeperErrorCode = ConnectionLoss for /collections/collection99 Stack Trace: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /collections/collection99 at org.apache.zookeeper.KeeperException.create(KeeperException.java:90) at org.apache.zookeeper.KeeperException.create(KeeperException.java:42) at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:637) at org.apache.solr.common.cloud.SolrZkClient.makePath(SolrZkClient.java:348) at org.apache.solr.common.cloud.SolrZkClient.makePath(SolrZkClient.java:309) at org.apache.solr.common.cloud.SolrZkClient.makePath(SolrZkClient.java:291) at org.apache.solr.common.cloud.SolrZkClient.makePath(SolrZkClient.java:256) at org.apache.solr.cloud.ZkSolrClientTest.testWatchChildren(ZkSolrClientTest.java:193) at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:923) at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:861) Build Log (for compile errors): [...truncated 9283 lines...] - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: need some help =)
opps =) not ide. it was idea :S On 17.11.2010 22:02, Wyatt Barnett wrote: You should be able to open any of the contrib projects with the free visual studio express software or with monodevelop, also free. On Nov 17, 2010, at 1:58 PM, asmcadasm...@gmail.com wrote: i don't have any ide writing custom analyzer... so i'll stick with SnowballAnalyzer for now. On 17.11.2010 21:53, Digy wrote: UnaccentedWordAnalyzer doesn't make use of stemming. If you really need it; a) SnowballAnalyzer is not good in turkish stemming. b) It is better to write a custom analyzer using Zemberek or its .NET version NZemberek. DIGY -Original Message- From: asmcad [mailto:asm...@gmail.com] Sent: Wednesday, November 17, 2010 11:24 PM To: lucene-net-...@lucene.apache.org Subject: Re: need some help =) i need turkish analyzer. my lucene book says i need to use SnowballAnalyzer but i can't access to it as Lucene.Net.Analysis.Snowball should i install another library to use it? On 17.11.2010 21:12, Granroth, Neal V. wrote: You need to pick a suitable analyzer for use during indexing and for queries. The StandardAnalyzer you are using will most likely break the words apart at the non-english characters. You might want to consider using the Luke tool to inspect the index you've created and see who the words in your documents were split and indexed. - Neal -Original Message- From: asmcad [mailto:asm...@gmail.com] Sent: Wednesday, November 17, 2010 3:06 PM To: lucene-net-...@lucene.apache.org Subject: Re: need some help =) i solved the problem . now i have non-english character problem. when i search like something çşğuı(i'm not sure you can see this) characters. i don't get any results. how can i solve this ? by the way sorry about the content messing =) thanks for the previous help =) On 17.11.2010 20:16, Digy wrote: 1. using System; 2. using System.Collections.Generic; 3. using System.ComponentModel; 4. using System.Data; 5. using System.Drawing; 6. using System.Linq; 7. using System.Text; 8. using System.Windows.Forms; 9. using Lucene.Net; 10. using Lucene.Net.Analysis.Standard; 11. using Lucene.Net.Documents; 12. using Lucene.Net.Index; 13. using Lucene.Net.QueryParsers; 14. using Lucene.Net.Search; 15. using System.IO; 16. 17. namespace newLucene 18. { 19. public partial class Form1 : Form 20. { 21. public Form1() 22. { 23. InitializeComponent(); 24. } 25. 26. private void buttonIndex_Click(object sender, EventArgs e) 27. { 28. IndexWriter indexwrtr = new IndexWriter(@c:\index\,new StandardAnalyzer() , true); 29. Document doc = new Document(); 30. string filename = @fer.txt; 31. Lucene.Net.QueryParsers.QueryParser df; 32. 33. 34. 35. System.IO.StreamReader local_StreamReader = new System.IO.StreamReader(@C:\z\fer.txt); 36. string file_text = local_StreamReader.ReadToEnd(); 37. 38. System.Text.UTF8Encoding encoding = new System.Text.UTF8Encoding(); 39. doc.Add(new Field(text,encoding.GetBytes(file_text),Field.Store.YES)); 40. doc.Add(new Field(path,encoding.GetBytes(@C:\z\),Field.Store.YES)); 41. doc.Add(new Field(title, encoding.GetBytes(filename), Field.Store.YES)); 42. indexwrtr.AddDocument(doc); 43. 44. indexwrtr.Optimize(); 45. indexwrtr.Close(); 46. 47. } 48. 49. private void buttonSearch_Click(object sender, EventArgs e) 50. { 51. IndexSearcher indxsearcher = new IndexSearcher(@C:\index\); 52. 53. QueryParser parser = new QueryParser(contents, new StandardAnalyzer()); 54. Query query = parser.Parse(textBoxQuery.Text); 55. 56. 
//Lucene.Net.QueryParsers.QueryParser qp = new QueryParser(Lucene.Net.QueryParsers.CharStream s).Parse(textBoxQuery.Text); 57. Hits hits = indxsearcher.Search(query); 58. 59. 60. for (int i = 0; i hits.Length(); i++) 61. { 62. 63. Document doc = hits.Doc(i); 64. 65. 66. string filename = doc.Get(title); 67. string path = doc.Get(path); 68. string folder =
Lucene-Solr-tests-only-3.x - Build # 1512 - Failure
Build: https://hudson.apache.org/hudson/job/Lucene-Solr-tests-only-3.x/1512/ 7 tests failed. REGRESSION: org.apache.solr.handler.TestReplicationHandler.testIndexAndConfigReplication Error Message: Jetty/Solr unresponsive Stack Trace: java.lang.RuntimeException: Jetty/Solr unresponsive at org.apache.solr.client.solrj.embedded.JettySolrRunner.waitForSolr(JettySolrRunner.java:149) at org.apache.solr.client.solrj.embedded.JettySolrRunner.start(JettySolrRunner.java:111) at org.apache.solr.client.solrj.embedded.JettySolrRunner.start(JettySolrRunner.java:103) at org.apache.solr.handler.TestReplicationHandler.createJetty(TestReplicationHandler.java:110) at org.apache.solr.handler.TestReplicationHandler.testIndexAndConfigReplication(TestReplicationHandler.java:260) at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:821) at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:759) Caused by: java.io.IOException: Server returned HTTP response code: 500 for URL: http://localhost:10355/solr/select?q={!raw+f=junit_test_query}ping at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1269) at java.net.URL.openStream(URL.java:1029) at org.apache.solr.client.solrj.embedded.JettySolrRunner.waitForSolr(JettySolrRunner.java:137) REGRESSION: org.apache.solr.handler.TestReplicationHandler.testStopPoll Error Message: java.net.ConnectException: Operation timed out Stack Trace: org.apache.solr.client.solrj.SolrServerException: java.net.ConnectException: Operation timed out at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:483) at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244) at org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:89) at org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:118) at org.apache.solr.handler.TestReplicationHandler.query(TestReplicationHandler.java:142) at org.apache.solr.handler.TestReplicationHandler.clearIndexWithReplication(TestReplicationHandler.java:85) at org.apache.solr.handler.TestReplicationHandler.testStopPoll(TestReplicationHandler.java:285) at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:821) at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:759) Caused by: java.net.ConnectException: Operation timed out at java.net.PlainSocketImpl.socketConnect(Native Method) at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:310) at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:176) at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:163) at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:384) at java.net.Socket.connect(Socket.java:546) at java.net.Socket.connect(Socket.java:495) at java.net.Socket.init(Socket.java:392) at java.net.Socket.init(Socket.java:266) at org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(DefaultProtocolSocketFactory.java:80) at org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(DefaultProtocolSocketFactory.java:122) at org.apache.commons.httpclient.HttpConnection.open(HttpConnection.java:707) at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$HttpConnectionAdapter.open(MultiThreadedHttpConnectionManager.java:1361) at 
org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:387) at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:171) at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397) at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:323) at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:427) REGRESSION: org.apache.solr.handler.TestReplicationHandler.testSnapPullWithMasterUrl Error Message: Jetty/Solr unresponsive Stack Trace: java.lang.RuntimeException: Jetty/Solr unresponsive at org.apache.solr.client.solrj.embedded.JettySolrRunner.waitForSolr(JettySolrRunner.java:149) at org.apache.solr.client.solrj.embedded.JettySolrRunner.start(JettySolrRunner.java:111) at org.apache.solr.client.solrj.embedded.JettySolrRunner.start(JettySolrRunner.java:103) at org.apache.solr.handler.TestReplicationHandler.createJetty(TestReplicationHandler.java:110) at org.apache.solr.handler.TestReplicationHandler.testSnapPullWithMasterUrl(TestReplicationHandler.java:357) at
Re: Basic authentication for stream.url
sure thanks for the information ... On Wed, Nov 17, 2010 at 3:47 PM, Erick Erickson erickerick...@gmail.comwrote: How does the patch make it to the trunk You need to track it and prompt the dev list if you think it's forgotten. Basically, when a committer thinks it's ready and valuable s/he will commit it to trunk for you. But give the committers some time before prompting, they're usually up to their ears in other changes Best Erick On Wed, Nov 17, 2010 at 3:30 PM, Jayendra Patil jayendra.patil@gmail.com wrote: JIRA - https://issues.apache.org/jira/browse/SOLR-2240 Patch attached. How does the patch make it to the trunk ??? Had submitted a couple of more patches SOLR-2156 SOLR-2029, would like them to be included in the release. Regards, Jayendra On Wed, Nov 17, 2010 at 2:15 PM, Yonik Seeley yo...@lucidimagination.com wrote: On Tue, Nov 16, 2010 at 8:57 PM, Jayendra Patil jayendra.patil@gmail.com wrote: We intend to use schema.url for indexing documents. However, the remote urls are secured and would need basic authentication to be able access the document. The implementation with stream.file would mean to download the files and would cause duplicity, whereas stream.body would have indexing performance issues with the hugh data being transferred over the network. The current implementation for stream.url in ContentStreamBase.URLStream does not support authentication. But can be easily supported by :- 1. Passing additional authentication parameter e.g. stream.url.auth with the encoded authentication value - SolrRequestParsers 2. Setting Authorization request property for the Connection - ContentStreamBase.URLStream this.conn.setRequestProperty(Authorization, Basic + encodedauthentication); Sounds like a good idea to me. Could you open a JIRA issue for this feature, and supply a patch if you get to it? -Yonik http://www.lucidimagination.com - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2680) Improve how IndexWriter flushes deletes against existing segments
[ https://issues.apache.org/jira/browse/LUCENE-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12933182#action_12933182 ] Jason Rutherglen commented on LUCENE-2680: -- Additionally we need to decide how accounting'll work for maxBufferedDeleteTerms. We won't have a centralized place to keep track of the number of terms, and the unique term count in aggregate over many segments could be a little too time consuming calculate in a method like doApplyDeletes. An alternative is to maintain a global unique term count, such that when a term is added, every other per-segment deletes is checked for that term, and if it's not already been tallied, we increment the number of buffered terms. Improve how IndexWriter flushes deletes against existing segments - Key: LUCENE-2680 URL: https://issues.apache.org/jira/browse/LUCENE-2680 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Fix For: 4.0 Attachments: LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch IndexWriter buffers up all deletes (by Term and Query) and only applies them if 1) commit or NRT getReader() is called, or 2) a merge is about to kickoff. We do this because, for a large index, it's very costly to open a SegmentReader for every segment in the index. So we defer as long as we can. We do it just before merge so that the merge can eliminate the deleted docs. But, most merges are small, yet in a big index we apply deletes to all of the segments, which is really very wasteful. Instead, we should only apply the buffered deletes to the segments that are about to be merged, and keep the buffer around for the remaining segments. I think it's not so hard to do; we'd have to have generations of pending deletions, because the newly merged segment doesn't need the same buffered deletions applied again. So every time a merge kicks off, we pinch off the current set of buffered deletions, open a new set (the next generation), and record which segment was created as of which generation. This should be a very sizable gain for large indices that mix deletes, though, less so in flex since opening the terms index is much faster. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
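To make the accounting idea in the comment above concrete, here is a purely illustrative sketch (not Lucene's actual IndexWriter code): each segment keeps its own buffer of pending delete terms, and a global unique-term counter is only bumped when the term is not already buffered for any other segment, so maxBufferedDeleteTerms can still be enforced.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import org.apache.lucene.index.Term;

    // Illustrative only: names and structure are hypothetical.
    class PerSegmentDeletes {
        final Map<Term, Integer> pending = new HashMap<Term, Integer>(); // term -> docIDUpto
        boolean contains(Term t) { return pending.containsKey(t); }
    }

    class DeletesAccounting {
        private final List<PerSegmentDeletes> segments = new ArrayList<PerSegmentDeletes>();
        private int globalUniqueTerms;

        void addSegment(PerSegmentDeletes s) { segments.add(s); }

        void bufferDelete(PerSegmentDeletes current, Term term, int docIDUpto) {
            boolean seenElsewhere = false;
            for (PerSegmentDeletes s : segments) {       // check every other segment's buffer
                if (s != current && s.contains(term)) { seenElsewhere = true; break; }
            }
            if (!current.contains(term) && !seenElsewhere) {
                globalUniqueTerms++;                     // only count terms not buffered anywhere yet
            }
            current.pending.put(term, docIDUpto);
        }

        boolean shouldFlush(int maxBufferedDeleteTerms) {
            return maxBufferedDeleteTerms != -1 && globalUniqueTerms >= maxBufferedDeleteTerms;
        }
    }

As the follow-up comment below notes, simply counting the total (non-deduplicated) number of buffered terms would avoid the cross-segment scan entirely.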
[jira] Commented: (LUCENE-2348) DuplicateFilter incorrectly handles multiple calls to getDocIdSet for segment readers
[ https://issues.apache.org/jira/browse/LUCENE-2348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12933222#action_12933222 ] Trejkaz commented on LUCENE-2348: - Finally got around to checking this out today, and it looks good to me. Unfortunate how Lucene has changed so much lately that we can't backport this. :) But will just await a release where it appears. DuplicateFilter incorrectly handles multiple calls to getDocIdSet for segment readers - Key: LUCENE-2348 URL: https://issues.apache.org/jira/browse/LUCENE-2348 Project: Lucene - Java Issue Type: Bug Components: contrib/* Affects Versions: 2.9.2 Reporter: Trejkaz Attachments: LUCENE-2348.patch DuplicateFilter currently works by building a single doc ID set, without taking into account that getDocIdSet() will be called once per segment and only with each segment's local reader. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
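For context on why a single shared doc ID set breaks here: Filter.getDocIdSet is called once per segment reader, and the returned doc IDs must be local to that reader, so no state can be shared across calls. An illustrative per-segment filter skeleton against the 2.9/3.x API (not the DuplicateFilter patch itself; the term-based filter is just a stand-in) looks like this:

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermDocs;
    import org.apache.lucene.search.DocIdSet;
    import org.apache.lucene.search.Filter;
    import org.apache.lucene.util.OpenBitSet;

    public class PerSegmentTermFilter extends Filter {
        private final Term term;

        public PerSegmentTermFilter(Term term) { this.term = term; }

        @Override
        public DocIdSet getDocIdSet(IndexReader reader) throws IOException {
            OpenBitSet bits = new OpenBitSet(reader.maxDoc()); // sized to this segment only
            TermDocs termDocs = reader.termDocs(term);
            try {
                while (termDocs.next()) {
                    bits.set(termDocs.doc());                  // segment-local doc IDs
                }
            } finally {
                termDocs.close();
            }
            return bits;
        }
    }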
[jira] Commented: (LUCENE-2680) Improve how IndexWriter flushes deletes against existing segments
[ https://issues.apache.org/jira/browse/LUCENE-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12933227#action_12933227 ] Michael McCandless commented on LUCENE-2680: {quote} I think we may be backtracking here, as I had earlier proposed we simply store each term/query in a map per segment; however, I think that was nixed in favor of last segment + deletes per segment afterwards. We're not worried about the cost of storing pending deletes in a map per segment anymore? {quote} OK, sorry, now I remember. Hmm, but my objection then was to carrying all deletes backward to all segments? Whereas now I think what we can do is only record the deletions that were added when that segment was a RAM buffer, in its pending deletes map? This should be fine, since we aren't storing a single deletion in multiple places (well, until DWPTs anyway). It's just that, on applying deletes to a segment because it's about to be merged, we have to do a merge sort of the buffered deletes of all future segments. BTW, it could also be possible to not necessarily apply deletes when a segment is merged; e.g., if there are few enough deletes it may not be worthwhile. But we can leave that to another issue. {quote} Additionally, we need to decide how accounting will work for maxBufferedDeleteTerms. We won't have a centralized place to keep track of the number of terms, and the unique term count in aggregate over many segments could be a little too time consuming to calculate in a method like doApplyDeletes. An alternative is to maintain a global unique term count, such that when a term is added, every other segment's per-segment deletes are checked for that term, and if it hasn't already been tallied, we increment the number of buffered terms. {quote} Maybe we should change the definition to be the total number of pending delete terms/queries? (I.e., not dedup'd across segments.) This seems reasonable since, with this new approach, the RAM consumed is in proportion to that total number and not to the dedup'd count? Improve how IndexWriter flushes deletes against existing segments - Key: LUCENE-2680 URL: https://issues.apache.org/jira/browse/LUCENE-2680 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Fix For: 4.0 Attachments: LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch IndexWriter buffers up all deletes (by Term and Query) and only applies them if 1) commit or NRT getReader() is called, or 2) a merge is about to kick off. We do this because, for a large index, it's very costly to open a SegmentReader for every segment in the index. So we defer as long as we can. We do it just before a merge so that the merge can eliminate the deleted docs. But most merges are small, yet in a big index we apply deletes to all of the segments, which is really very wasteful. Instead, we should only apply the buffered deletes to the segments that are about to be merged, and keep the buffer around for the remaining segments. I think it's not so hard to do; we'd have to have generations of pending deletions, because the newly merged segment doesn't need the same buffered deletions applied again. So every time a merge kicks off, we pinch off the current set of buffered deletions, open a new set (the next generation), and record which segment was created as of which generation.
This should be a very sizable gain for large indices that mix deletes, though less so in flex since opening the terms index is much faster. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
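The alternative accounting suggested in the comment above is simpler to sketch: bound the raw total of buffered terms/queries across all per-segment buffers, since (under the stated assumption that each segment keeps its own pending map and nothing is stored twice) RAM usage grows roughly with that total rather than with the dedup'd unique count. Again, the class names below are illustrative, not the real IndexWriter code.
{noformat}
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: count pending deletes as a plain total across segments
// instead of maintaining a dedup'd unique-term count.
class TotalPendingDeletesSketch {

  static class SegmentDeletes {
    final Map<String, Integer> pendingTerms = new HashMap<String, Integer>();
    final Map<String, Integer> pendingQueries = new HashMap<String, Integer>();
  }

  static boolean exceeds(List<SegmentDeletes> perSegment, int maxBufferedDeleteTerms) {
    int total = 0;
    for (SegmentDeletes deletes : perSegment) {
      total += deletes.pendingTerms.size() + deletes.pendingQueries.size();
    }
    // RAM held by the buffers grows with this total, not with the unique count.
    return total >= maxBufferedDeleteTerms;
  }
}
{noformat}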
Lucene-3.x - Build # 184 - Failure
Build: https://hudson.apache.org/hudson/job/Lucene-3.x/184/

All tests passed

Build Log (for compile errors):
[...truncated 21395 lines...]

- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Created: (SOLR-2242) Get distinct count of names for a facet field
Get distinct count of names for a facet field - Key: SOLR-2242 URL: https://issues.apache.org/jira/browse/SOLR-2242 Project: Solr Issue Type: New Feature Components: Response Writers Affects Versions: 4.0 Reporter: Bill Bell Priority: Minor Fix For: 4.0 See SOLR-236. Need the ability to get back a count of the unique facet values for grouping (field collapsing) instead of returning the facet values themselves. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Lucene-Solr-tests-only-trunk - Build # 1545 - Failure
Build: https://hudson.apache.org/hudson/job/Lucene-Solr-tests-only-trunk/1545/

1 tests failed.

REGRESSION: org.apache.solr.TestDistributedSearch.testDistribSearch

Error Message: Some threads threw uncaught exceptions!

Stack Trace:
junit.framework.AssertionFailedError: Some threads threw uncaught exceptions!
	at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:923)
	at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:861)
	at org.apache.lucene.util.LuceneTestCase.tearDown(LuceneTestCase.java:446)
	at org.apache.solr.SolrTestCaseJ4.tearDown(SolrTestCaseJ4.java:92)
	at org.apache.solr.BaseDistributedSearchTestCase.tearDown(BaseDistributedSearchTestCase.java:144)

Build Log (for compile errors):
[...truncated 8749 lines...]

- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2680) Improve how IndexWriter flushes deletes against existing segments
[ https://issues.apache.org/jira/browse/LUCENE-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12933305#action_12933305 ] Jason Rutherglen commented on LUCENE-2680: -- Flush deletes equals true means that all deletes are applied; when it's false, it means we're moving the pending deletes into the newly flushed segment as-is, with no docId-upto remapping. Improve how IndexWriter flushes deletes against existing segments - Key: LUCENE-2680 URL: https://issues.apache.org/jira/browse/LUCENE-2680 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Fix For: 4.0 Attachments: LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch IndexWriter buffers up all deletes (by Term and Query) and only applies them if 1) commit or NRT getReader() is called, or 2) a merge is about to kick off. We do this because, for a large index, it's very costly to open a SegmentReader for every segment in the index. So we defer as long as we can. We do it just before a merge so that the merge can eliminate the deleted docs. But most merges are small, yet in a big index we apply deletes to all of the segments, which is really very wasteful. Instead, we should only apply the buffered deletes to the segments that are about to be merged, and keep the buffer around for the remaining segments. I think it's not so hard to do; we'd have to have generations of pending deletions, because the newly merged segment doesn't need the same buffered deletions applied again. So every time a merge kicks off, we pinch off the current set of buffered deletions, open a new set (the next generation), and record which segment was created as of which generation. This should be a very sizable gain for large indices that mix deletes, though less so in flex since opening the terms index is much faster. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
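A hypothetical illustration of those two flush modes (the field and method names here are mine, not IndexWriter's actual internals): when flushDeletes is true, everything buffered is resolved now; when it's false, the buffer is simply handed to the newly flushed segment and the docId-upto values are carried over unchanged.
{noformat}
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the two flush modes; not the real IndexWriter code.
class FlushModeSketch {

  static class SegmentDeletes {
    final Map<String, Integer> pendingTerms = new HashMap<String, Integer>();
  }

  private final Map<String, Integer> bufferedDeleteTerms = new HashMap<String, Integer>();

  void flush(boolean flushDeletes, SegmentDeletes newSegment) {
    if (flushDeletes) {
      // resolve every buffered delete against the existing segments now
      applyDeletesToAllSegments(bufferedDeleteTerms);
    } else {
      // move the pending deletes into the newly flushed segment as-is;
      // docId-upto values are not remapped here
      newSegment.pendingTerms.putAll(bufferedDeleteTerms);
    }
    bufferedDeleteTerms.clear();
  }

  private void applyDeletesToAllSegments(Map<String, Integer> deletes) {
    // omitted: open each segment's reader and delete the matching docs
  }
}
{noformat}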
[jira] Commented: (LUCENE-2680) Improve how IndexWriter flushes deletes against existing segments
[ https://issues.apache.org/jira/browse/LUCENE-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12933306#action_12933306 ] Jason Rutherglen commented on LUCENE-2680: -- We can upgrade from an ArrayList<Integer> to an int[] for the aborted docs. Improve how IndexWriter flushes deletes against existing segments - Key: LUCENE-2680 URL: https://issues.apache.org/jira/browse/LUCENE-2680 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Fix For: 4.0 Attachments: LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch IndexWriter buffers up all deletes (by Term and Query) and only applies them if 1) commit or NRT getReader() is called, or 2) a merge is about to kick off. We do this because, for a large index, it's very costly to open a SegmentReader for every segment in the index. So we defer as long as we can. We do it just before a merge so that the merge can eliminate the deleted docs. But most merges are small, yet in a big index we apply deletes to all of the segments, which is really very wasteful. Instead, we should only apply the buffered deletes to the segments that are about to be merged, and keep the buffer around for the remaining segments. I think it's not so hard to do; we'd have to have generations of pending deletions, because the newly merged segment doesn't need the same buffered deletions applied again. So every time a merge kicks off, we pinch off the current set of buffered deletions, open a new set (the next generation), and record which segment was created as of which generation. This should be a very sizable gain for large indices that mix deletes, though less so in flex since opening the terms index is much faster. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
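Purely for illustration (the helper name and surrounding context are assumptions, not the patch itself), the boxed-to-primitive change would look something like this:
{noformat}
import java.util.List;

// Illustrative only: converting a boxed list of aborted doc IDs into a
// primitive int[] to avoid per-element object overhead.
class AbortedDocsSketch {
  static int[] toIntArray(List<Integer> abortedDocs) {
    int[] out = new int[abortedDocs.size()];
    for (int i = 0; i < out.length; i++) {
      out[i] = abortedDocs.get(i).intValue();
    }
    return out;
  }
}
{noformat}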
[jira] Created: (SOLR-2243) Group Querys maybe return docList of 0 results
Group Querys maybe return docList of 0 results -- Key: SOLR-2243 URL: https://issues.apache.org/jira/browse/SOLR-2243 Project: Solr Issue Type: Wish Components: search Environment: JDK1.6/Tomcat6 Reporter: tom liu I wish to have results like the following:
{noformat}
<lst name="grouped">
  <lst name="countrycode">
    <int name="matches">1411</int>
    <arr name="groups">
      <lst>
        <str name="groupValue">unit</str>
        <result name="doclist" numFound="921" start="0"/>
      </lst>
      <lst>
        <str name="groupValue">china</str>
        <result name="doclist" numFound="139" start="0"/>
      </lst>
    </arr>
  </lst>
</lst>
{noformat}
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Updated: (SOLR-2243) Group Querys maybe return docList of 0 results
[ https://issues.apache.org/jira/browse/SOLR-2243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tom liu updated SOLR-2243: -- Attachment: SolrIndexSearcher.patch I found: # set group.limit=0 # in SolrIndexSearcher, I pass the value 1 to the collector construction when docsPerGroup is 0, for example:
{noformat}
Phase2GroupCollector collector = new Phase2GroupCollector(
    (TopGroupCollector)gc.collector,
    gc.groupBy,
    gc.context,
    collectorSort,
    gc.docsPerGroup == 0 ? 1 : groupCommand.docsPerGroup,
    needScores);
{noformat}
Group Querys maybe return docList of 0 results -- Key: SOLR-2243 URL: https://issues.apache.org/jira/browse/SOLR-2243 Project: Solr Issue Type: Wish Components: search Environment: JDK1.6/Tomcat6 Reporter: tom liu Attachments: SolrIndexSearcher.patch I wish to have results like the following:
{noformat}
<lst name="grouped">
  <lst name="countrycode">
    <int name="matches">1411</int>
    <arr name="groups">
      <lst>
        <str name="groupValue">unit</str>
        <result name="doclist" numFound="921" start="0"/>
      </lst>
      <lst>
        <str name="groupValue">china</str>
        <result name="doclist" numFound="139" start="0"/>
      </lst>
    </arr>
  </lst>
</lst>
{noformat}
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
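For context, the grouped response shown above comes from a grouping request roughly like the one below (the host, core, and field are placeholders, and the parameter names reflect my reading of the grouping patches); the behaviour being worked around appears when group.limit is set to 0:
{noformat}
http://localhost:8983/solr/select?q=*:*&group=true&group.field=countrycode&group.limit=0
{noformat}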
[jira] Commented: (SOLR-2205) Grouping performance improvements
[ https://issues.apache.org/jira/browse/SOLR-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=1295#action_1295 ] tom liu commented on SOLR-2205: --- Currently, grouped search does not support distributed queries. Has anyone else run into this? Grouping performance improvements - Key: SOLR-2205 URL: https://issues.apache.org/jira/browse/SOLR-2205 Project: Solr Issue Type: Sub-task Components: search Affects Versions: 4.0 Reporter: Martijn van Groningen Fix For: 4.0 Attachments: SOLR-2205.patch, SOLR-2205.patch This issue is dedicated to the performance of the grouping functionality. I've noticed that the code does not really perform well on large indexes. Doing a search (q=*:*) with grouping on an index of around 5M documents took around one second on my local development machine. We had to support grouping on an index that holds around 50M documents per machine, so we made some changes and were able to happily serve that number of documents. Patch will follow soon. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: Lucene project announcement
Neal, As you said, "If you're developing at the concept level the specific language you use becomes unimportant." This is exactly why we feel that working on this in C# is not a problem. We feel that the language should not impede our ability to contribute. If we develop some interesting or valuable concepts in C#, those could be ported back to Java for inclusion in the Java implementation of Lucene. From an implementation standpoint, we feel that the code should perform and integrate as effectively as possible into the runtime it's in. Unfortunately, there's no known software runtime that executes concepts. Runtimes execute code written in a specific language. The details of how that code executes and integrates into applications directly affect its performance and usability. It's a disservice to the concept of Lucene to translate it literally if doing so makes it less performant or less usable. Using human language as an example: consider the Chinese name for China, 中国 (Zhong Guo); translated literally, it means "Middle Kingdom." Imagine you were translating an important philosophical document from Chinese to English. Would you translate Zhong Guo as Middle Kingdom or as China? Suppose someone had asked the original philosopher to write all his ideas in English to start, because English is the language of philosophy; it's what all the eminent philosophers use. Perhaps he would never contribute his ideas at all, since writing them down in English is too great a barrier. Maybe he would write them down, but in a way which made them seem absurd or have less of an impact. In other words, miss the meaning, even though he'd translated literally. Either way, it would be less ideal than simply writing them in Chinese to start, as that's what would be most natural for our imaginary philosopher. The translation from Chinese to English could then be performed by an expert translator, who would, undoubtedly, translate the meaning conceptually, not the words syntactically. Thanks, Troy On Wed, Nov 17, 2010 at 12:16 PM, Granroth, Neal V. neal.granr...@thermofisher.com wrote: Is Java Lucene grown up? Look at how much discussion it took to determine how to get Java out of the name :) The discussion about advancing the algorithm in C#/.NET seems to be missing the point. If you're developing at the concept level, the specific language you use becomes unimportant. However, as most of the concept developers apparently find Java convenient, others wanting to participate at the concept level would find it more beneficial to join that brain-pool instead of diluting the effort by starting up elsewhere. - Neal -Original Message- From: George Aroush [mailto:geo...@aroush.net] Sent: Tuesday, November 16, 2010 10:55 PM To: lucene-net-...@lucene.apache.org Cc: dev@lucene.apache.org Subject: RE: Lucene project announcement This topic has come back again and again, and I have tried to address it multiple times, so let me try again. 1) Java Lucene started years before the first C# version (4+ years, if I get my history right); thus it defined, and continues to define, the technology and the API. It is the established leader, and everyone else is just a follower. 2) Lucene.Net is nowhere near as mature as Java Lucene; it never established itself or built a rich development community -- which is why we are here today.
3) If, and only if, the community of Lucene.Net (or Lucene over at codeplex.com) manages to prove itself to the level of Java Lucene will such a community have the voice to influence folks over at Java Lucene. Only then will you see the two communities discussing search engine vs. port issues or the state of Lucene.Net. If you look at my previous posts, I have pointed those out. We must first: 1) Be on par with the Java Lucene release and keep up with a commit-per-commit port. 2) Prove Lucene.Net is a grown-up project with followers and a healthy community (just like Java Lucene). If we don't achieve the above, folks over at Java Lucene will not take us seriously, and thus we can't influence them. -- George -Original Message- From: Nicholas Paldino [.NET/C# MVP] [mailto:casper...@caspershouse.com] Sent: Friday, November 12, 2010 10:36 AM To: lucene-net-...@lucene.apache.org Cc: dev@lucene.apache.org Subject: RE: Lucene project announcement Paul, et al, Paul, God bless you. This is probably the most rational, practical perspective I've seen on the whole matter since the debacle started. While Lucene started off as a Java project, its massive success indicates that the concepts around it are very desirable to developers in other technologies, and that the Java product isn't being translated well into those technology stacks. That's not a slight against those who have contributed to this point to try and keep the .NET version in line with the Java one (despite me thinking