[jira] Resolved: (SOLR-1120) Simplify EntityProcessor API
[ https://issues.apache.org/jira/browse/SOLR-1120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Noble Paul resolved SOLR-1120.
------------------------------
    Resolution: Fixed

Committed revision 781272. Thanks, Steffen Baumgart!

> Simplify EntityProcessor API
> ----------------------------
>
>                 Key: SOLR-1120
>                 URL: https://issues.apache.org/jira/browse/SOLR-1120
>             Project: Solr
>          Issue Type: Improvement
>          Components: contrib - DataImportHandler
>    Affects Versions: 1.3
>            Reporter: Shalin Shekhar Mangar
>            Assignee: Shalin Shekhar Mangar
>             Fix For: 1.4
>
>         Attachments: SOLR-1120.patch, SOLR-1120.patch, SOLR-1120.patch, SOLR-1120.patch, SOLR-1120.patch, SOLR-1120.patch, SOLR-1120.patch, SOLR-1120.patch
>
> Writing an EntityProcessor is deceptively complex; there are many gotchas. I propose the following (a sketch follows this message):
> # Extract the Transformer application logic out of EntityProcessor and move it into DocBuilder. An EntityProcessor then no longer needs to call applyTransformer or know about the rowIterator and getFromRowCache() methods.
> # Change EntityProcessor#destroy to be called at the end of the parent's row. Right now init is called once per parent row, but destroy actually means the end of the import, so there is no correct way for an entity processor to clean up. Most clean up when returning null (end of data), but with the introduction of $skipDoc a transformer can return $skipDoc, and the entity processor then never gets a chance to clean up for the current init.
> # EntityProcessor will use the EventListener API to listen for the end of the import and perform a final cleanup there.
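A minimal sketch of what an entity processor could look like under this proposal, assuming the DataImportHandler classes of that era (EntityProcessorBase, Context). The FileHandleEntityProcessor name, the "file" entity attribute, and the line-per-row logic are hypothetical illustrations, not code from the patch:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.solr.handler.dataimport.Context;
import org.apache.solr.handler.dataimport.EntityProcessorBase;

public class FileHandleEntityProcessor extends EntityProcessorBase {
  private BufferedReader reader;

  @Override
  public void init(Context context) {
    super.init(context);
    try {
      // init runs once per parent row: open per-row resources here
      reader = new BufferedReader(new FileReader(context.getEntityAttribute("file")));
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }

  @Override
  public Map<String, Object> nextRow() {
    try {
      String line = reader.readLine();
      if (line == null) return null; // end of data for this parent row
      // return the raw row only: under the proposal DocBuilder applies the
      // transformers, so no applyTransformer()/getFromRowCache() calls here
      Map<String, Object> row = new HashMap<String, Object>();
      row.put("rawLine", line);
      return row;
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }

  @Override
  public void destroy() {
    // under the proposal this runs at the end of the parent's row even when a
    // transformer returns $skipDoc, so per-init cleanup is finally safe here
    try {
      if (reader != null) reader.close();
    } catch (IOException ignored) {
    }
  }
}

For the import-wide cleanup of point 3, the processor would additionally register a listener for the import-end event through the EventListener API instead of overloading destroy for that purpose.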
[jira] Updated: (SOLR-1120) Simplify EntityProcessor API
[ https://issues.apache.org/jira/browse/SOLR-1120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Noble Paul updated SOLR-1120:
-----------------------------
    Attachment: SOLR-1120.patch
[jira] Updated: (SOLR-769) Support Document and Search Result clustering
[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Brad Giaccio updated SOLR-769:
------------------------------
    Attachment: (was: clustering-componet-shard.patch)

> Support Document and Search Result clustering
> ---------------------------------------------
>
>                 Key: SOLR-769
>                 URL: https://issues.apache.org/jira/browse/SOLR-769
>             Project: Solr
>          Issue Type: New Feature
>            Reporter: Grant Ingersoll
>            Assignee: Grant Ingersoll
>            Priority: Minor
>             Fix For: 1.4
>
>         Attachments: clustering-componet-shard.patch, clustering-libs.tar, clustering-libs.tar, SOLR-769-analyzerClass.patch, SOLR-769-lib.zip, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.tar, SOLR-769.zip
>
> Clustering is a useful tool for working with documents and search results, similar to the notion of dynamic faceting. Carrot2 (http://project.carrot2.org/) is a nice, BSD-licensed library for search results clustering. Mahout (http://lucene.apache.org/mahout) is well suited for whole-corpus clustering.
> The patch lays out a contrib module that starts off with an integration of a SearchComponent for clustering and an implementation using Carrot2. In search results mode, it uses the DocList as the input for the clusterer. While Carrot2 comes with a Solr input component, it is not the same as the SearchComponent I have: the Carrot example actually submits a query to Solr, whereas my SearchComponent is just chained into the component list and uses the ResponseBuilder to add in the cluster results.
> While not fully fleshed out yet, the collection-based mode will take in a list of ids, or just use the whole collection, and will produce clusters. Since this is a longer, typically offline task, there will need to be some type of storage mechanism (and replication??) for the clusters. I _may_ push this off to a separate JIRA issue, but I at least want to present the use case as part of the design of this component/contrib. It may even make sense to split this out, such that the building piece is something like an UpdateProcessor and the SearchComponent just acts as a lookup mechanism.
[jira] Updated: (SOLR-769) Support Document and Search Result clustering
[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Brad Giaccio updated SOLR-769:
------------------------------
    Attachment: clustering-componet-shard.patch

Okay, I've rewritten the patch as I suggested. The clustering now happens in finishStage for distributed queries and in process for non-distributed ones, both by calling the new method clusterResults. To make this happen I had to convert the interfaces and supporting code to use SolrDocumentList rather than DocList.
I've added a unit test which extends TestDistributedSearch; I had to modify TestDistributedSearch and make a bunch of things protected. This allowed me to write a very small test case (just overriding doTest) and leave all the logic for creating shards, distributing docs, and comparing responses in TestDistributedSearch. I felt this made for a very clean way to test a single distributed component.
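A rough sketch of the shape described above, assuming the Solr 1.4 SearchComponent and ResponseBuilder APIs. clusterResults is the patch's new method, but its signature here is a guess, as are reading the merged documents from rb._responseDocs and the toSolrDocumentList conversion helper:

import java.io.IOException;

import org.apache.solr.common.SolrDocumentList;
import org.apache.solr.handler.component.ResponseBuilder;
import org.apache.solr.handler.component.SearchComponent;

public class ShardAwareClusteringComponent extends SearchComponent {

  @Override
  public void prepare(ResponseBuilder rb) throws IOException {
  }

  @Override
  public void process(ResponseBuilder rb) throws IOException {
    // non-distributed queries: cluster the local results right here
    rb.rsp.add("clusters", clusterResults(rb, toSolrDocumentList(rb)));
  }

  @Override
  public void finishStage(ResponseBuilder rb) {
    // distributed queries: the merged SolrDocumentList only exists once the
    // GET_FIELDS stage has completed across all shards
    if (rb.stage == ResponseBuilder.STAGE_GET_FIELDS) {
      rb.rsp.add("clusters", clusterResults(rb, rb._responseDocs));
    }
  }

  private Object clusterResults(ResponseBuilder rb, SolrDocumentList docs) {
    // the patch's single entry point for both paths; a real implementation
    // hands the documents to Carrot2 and returns the cluster structure
    return null;
  }

  private SolrDocumentList toSolrDocumentList(ResponseBuilder rb) {
    // hypothetical: convert the local DocList into a SolrDocumentList, the
    // common type both code paths now share
    return new SolrDocumentList();
  }

  // SolrInfoMBean boilerplate required by SearchComponent
  @Override public String getDescription() { return "clustering sketch"; }
  @Override public String getSource() { return ""; }
  @Override public String getSourceId() { return ""; }
  @Override public String getVersion() { return "1.0"; }
}

Doing the work in finishStage rather than handleResponses is what lets a single clusterResults call see the fully merged result set instead of per-shard fragments.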
[jira] Commented: (SOLR-1051) Support the merge of multiple indexes
[ https://issues.apache.org/jira/browse/SOLR-1051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12715600#action_12715600 ]

Ning Li commented on SOLR-1051:
-------------------------------
In the current approach, mergeIndexes is an admin command and the target core should be online. I haven't looked into the SolrDispatchFilter logic change, but it seems that with this change the following are the two valid options:
- mergeIndexes is an update command and the target core should be online
- mergeIndexes is an admin command and the target core should be offline
The first option is close to what we have now. I like it a bit more because you can keep track of the merge by going through the UpdateProcessor (sketched below). But you seem to prefer the second option?

> Support the merge of multiple indexes
> -------------------------------------
>
>                 Key: SOLR-1051
>                 URL: https://issues.apache.org/jira/browse/SOLR-1051
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Ning Li
>            Assignee: Shalin Shekhar Mangar
>            Priority: Minor
>             Fix For: 1.4
>
>         Attachments: SOLR-1051.patch, SOLR-1051.patch, SOLR-1051.patch
>
> This is to support the merge of multiple indexes.
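To make concrete why option one keeps the merge trackable: if mergeIndexes flows through the update-processor chain, any processor in the chain can observe it, just like adds and deletes. A minimal sketch, assuming a MergeIndexesCommand and a processMergeIndexes hook along the lines the patch proposes (the names and signatures here are my guesses, not committed API):

import java.io.IOException;

import org.apache.solr.update.MergeIndexesCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;

// A processor that records merges as they pass through the chain; under
// option one every mergeIndexes call becomes visible at this layer.
public class MergeTrackingProcessor extends UpdateRequestProcessor {

  public MergeTrackingProcessor(UpdateRequestProcessor next) {
    super(next);
  }

  @Override
  public void processMergeIndexes(MergeIndexesCommand cmd) throws IOException {
    System.out.println("merge requested: " + cmd);
    super.processMergeIndexes(cmd); // hand off to the rest of the chain
  }
}

Under option two (an admin command against an offline core) there is no update chain in play, so any tracking would have to live in the core-admin handler instead.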
[jira] Reopened: (SOLR-1120) Simplify EntityProcessor API
[ https://issues.apache.org/jira/browse/SOLR-1120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shalin Shekhar Mangar reopened SOLR-1120:
-----------------------------------------

The initial commit for this issue broke the Debug functionality. Refer to http://www.lucidimagination.com/search/document/42c345a606820f9/npe_in_dataimport_debuglogger_peekstack_dih_development_console
[jira] Resolved: (SOLR-990) Add pid file to snapinstaller to skip script overruns, and recover from failure
[ https://issues.apache.org/jira/browse/SOLR-990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Otis Gospodnetic resolved SOLR-990.
-----------------------------------
    Resolution: Fixed

Thank you, Dan!
Sending src/scripts/snapinstaller
Transmitting file data .
Committed revision 781069.

> Add pid file to snapinstaller to skip script overruns, and recover from failure
> --------------------------------------------------------------------------------
>
>                 Key: SOLR-990
>                 URL: https://issues.apache.org/jira/browse/SOLR-990
>             Project: Solr
>          Issue Type: Improvement
>          Components: replication (scripts)
>            Reporter: Dan Rosher
>            Assignee: Otis Gospodnetic
>            Priority: Minor
>             Fix For: 1.4
>
>         Attachments: SOLR-990.patch, SOLR-990.patch, SOLR-990.patch, SOLR-990.patch
>
> The pid file allows snapinstaller to be run as fast as possible without overruns. It also recovers from a previously failed run, provided the older snapinstaller process is no longer running.
> Avoiding overruns means that snapinstaller can be run as fast as possible without suffering from the performance issue described here: http://wiki.apache.org/solr/SolrPerformanceFactors#head-fc7f22035c493431d58c5404ab22aef0ee1b9909
> This means that one can do the following:
> */1 * * * * /bin/snappuller && /bin/snapinstaller
> Even with a 'properly tuned' setup, there can be times when snapinstaller suffers from overruns due to a lack of resources, an unoptimized index using more resources, etc.
> Currently the pid file lives in /tmp ... perhaps it should be in the logs dir?
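The real snapinstaller is a shell script, but the pid-file guard it adds is straightforward to show in Java as well. A minimal sketch, assuming Java 11+ (for ProcessHandle and Files.readString); installSnapshot and the /tmp/snapinstaller.pid path are illustrative stand-ins:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class SnapinstallerGuard {

  public static void main(String[] args) throws IOException {
    // the issue notes this currently lives in /tmp (logs dir is debated)
    Path pidFile = Path.of("/tmp/snapinstaller.pid");

    if (Files.exists(pidFile)) {
      long oldPid = Long.parseLong(Files.readString(pidFile).trim());
      boolean stillRunning = ProcessHandle.of(oldPid)
          .map(ProcessHandle::isAlive)
          .orElse(false);
      if (stillRunning) {
        // overrun: a previous install is still going, so skip this run
        System.err.println("snapinstaller already running (pid " + oldPid + "), skipping");
        return;
      }
      // stale pid file left by a failed run: fall through and recover
    }

    Files.writeString(pidFile, Long.toString(ProcessHandle.current().pid()));
    try {
      installSnapshot(); // stand-in for the actual snapshot install work
    } finally {
      Files.deleteIfExists(pidFile); // always clean up so the next run proceeds
    }
  }

  private static void installSnapshot() {
    // hypothetical: copy the latest snapshot into place and notify Solr
  }
}

One known caveat of pid files generally: the OS can recycle a pid, so a stale file can in rare cases point at an unrelated live process; the shell script has the same limitation.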
[jira] Updated: (SOLR-1192) solr.NGramFilterFactory stops indexing the content if it finds a token smaller than the minimum ngram size
[ https://issues.apache.org/jira/browse/SOLR-1192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Otis Gospodnetic updated SOLR-1192:
-----------------------------------
    Fix Version/s: (was: 1.3)
                   1.4

> solr.NGramFilterFactory stops indexing the content if it finds a token smaller than the minimum ngram size
> -----------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-1192
>                 URL: https://issues.apache.org/jira/browse/SOLR-1192
>             Project: Solr
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 1.3
>        Environment: any
>            Reporter: viobade
>             Fix For: 1.4
>
> If a field is split into tokens (by a tokenizer) and the NGramFilterFactory is then applied to those tokens, indexing goes well as long as the length of each token is greater than or equal to the minimum ngram size (usually 3). Otherwise indexing breaks at that point and the rest of the tokens are no longer indexed. This behaviour is easy to observe with the analysis tool in the Solr admin interface.
[jira] Updated: (SOLR-1192) solr.NGramFilterFactory stops indexing the content if it finds a token smaller than the minimum ngram size
[ https://issues.apache.org/jira/browse/SOLR-1192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Otis Gospodnetic updated SOLR-1192:
-----------------------------------

That stems from Lucene; see LUCENE-1491.
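For context, the failure pattern behind LUCENE-1491, shown as an illustrative filter written against the old (pre-attribute) Lucene 2.4 TokenStream API; this is not the actual NGramTokenFilter source. Returning null for a too-short token signals end-of-stream, which silently drops every token after it; the fix is to skip such tokens and keep consuming:

import java.io.IOException;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

// Illustration of the bug pattern: a filter that must skip short tokens,
// which is the part the n-gram filter got wrong.
public class MinLengthSkippingFilter extends TokenFilter {
  private final int minGram;

  protected MinLengthSkippingFilter(TokenStream input, int minGram) {
    super(input);
    this.minGram = minGram;
  }

  @Override
  public Token next(final Token reusableToken) throws IOException {
    Token token = input.next(reusableToken);
    // Buggy pattern: returning null here when termLength() < minGram would
    // signal end-of-stream and lose every remaining token in the field.
    while (token != null && token.termLength() < minGram) {
      token = input.next(reusableToken); // fix: skip and keep consuming
    }
    return token; // null only at the genuine end of the stream
  }
}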
Solr nightly build failure
init-forrest-entities:
    [mkdir] Created dir: /tmp/apache-solr-nightly/build
    [mkdir] Created dir: /tmp/apache-solr-nightly/build/web

compile-solrj:
    [mkdir] Created dir: /tmp/apache-solr-nightly/build/solrj
    [javac] Compiling 83 source files to /tmp/apache-solr-nightly/build/solrj
    [javac] Note: Some input files use or override a deprecated API.
    [javac] Note: Recompile with -Xlint:deprecation for details.
    [javac] Note: Some input files use unchecked or unsafe operations.
    [javac] Note: Recompile with -Xlint:unchecked for details.

compile:
    [mkdir] Created dir: /tmp/apache-solr-nightly/build/solr
    [javac] Compiling 373 source files to /tmp/apache-solr-nightly/build/solr
    [javac] Note: Some input files use or override a deprecated API.
    [javac] Note: Recompile with -Xlint:deprecation for details.
    [javac] Note: Some input files use unchecked or unsafe operations.
    [javac] Note: Recompile with -Xlint:unchecked for details.

compileTests:
    [mkdir] Created dir: /tmp/apache-solr-nightly/build/tests
    [javac] Compiling 161 source files to /tmp/apache-solr-nightly/build/tests
    [javac] Note: Some input files use or override a deprecated API.
    [javac] Note: Recompile with -Xlint:deprecation for details.
    [javac] Note: Some input files use unchecked or unsafe operations.
    [javac] Note: Recompile with -Xlint:unchecked for details.

junit:
    [mkdir] Created dir: /tmp/apache-solr-nightly/build/test-results
    [junit] Running org.apache.solr.BasicFunctionalityTest
    [junit] Tests run: 19, Failures: 0, Errors: 0, Time elapsed: 17.658 sec
    [junit] Running org.apache.solr.ConvertedLegacyTest
    [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 8.611 sec
    [junit] Running org.apache.solr.DisMaxRequestHandlerTest
    [junit] Tests run: 3, Failures: 0, Errors: 0, Time elapsed: 5.135 sec
    [junit] Running org.apache.solr.EchoParamsTest
    [junit] Tests run: 4, Failures: 0, Errors: 0, Time elapsed: 2.409 sec
    [junit] Running org.apache.solr.OutputWriterTest
    [junit] Tests run: 3, Failures: 0, Errors: 0, Time elapsed: 2.75 sec
    [junit] Running org.apache.solr.SampleTest
    [junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 2.383 sec
    [junit] Running org.apache.solr.SolrInfoMBeanTest
    [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.944 sec
    [junit] Running org.apache.solr.TestDistributedSearch
    [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 26.478 sec
    [junit] Running org.apache.solr.TestTrie
    [junit] Tests run: 7, Failures: 0, Errors: 0, Time elapsed: 7.588 sec
    [junit] Running org.apache.solr.analysis.DoubleMetaphoneFilterFactoryTest
    [junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 0.611 sec
    [junit] Running org.apache.solr.analysis.DoubleMetaphoneFilterTest
    [junit] Tests run: 6, Failures: 0, Errors: 0, Time elapsed: 0.441 sec
    [junit] Running org.apache.solr.analysis.EnglishPorterFilterFactoryTest
    [junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 1.465 sec
    [junit] Running org.apache.solr.analysis.HTMLStripReaderTest
    [junit] Tests run: 9, Failures: 0, Errors: 0, Time elapsed: 1.693 sec
    [junit] Running org.apache.solr.analysis.LengthFilterTest
    [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 1.775 sec
    [junit] Running org.apache.solr.analysis.SnowballPorterFilterFactoryTest
    [junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 1.501 sec
    [junit] Running org.apache.solr.analysis.TestBufferedTokenStream
    [junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 1.448 sec
    [junit] Running org.apache.solr.analysis.TestCapitalizationFilter
    [junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 1.634 sec
    [junit] Running org.apache.solr.analysis.TestCharFilter
    [junit] Tests run: 4, Failures: 0, Errors: 0, Time elapsed: 0.533 sec
    [junit] Running org.apache.solr.analysis.TestHyphenatedWordsFilter
    [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.977 sec
    [junit] Running org.apache.solr.analysis.TestKeepFilterFactory
    [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 2.667 sec
    [junit] Running org.apache.solr.analysis.TestKeepWordFilter
    [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 1.316 sec
    [junit] Running org.apache.solr.analysis.TestMappingCharFilter
    [junit] Tests run: 11, Failures: 0, Errors: 0, Time elapsed: 0.474 sec
    [junit] Running org.apache.solr.analysis.TestMappingCharFilterFactory
    [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.477 sec
    [junit] Running org.apache.solr.analysis.TestPatternReplaceFilter
    [junit] Tests run: 5, Failures: 0, Errors: 0, Time elapsed: 2.308 sec
    [junit] Running org.apache.solr.analysis.TestPatternTokenizerFactory
    [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 1.068 sec
    [junit] Running org.apache.solr.analysis.TestPhoneticFilter