[jira] Resolved: (SOLR-1120) Simplify EntityProcessor API

2009-06-02 Thread Noble Paul (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Noble Paul resolved SOLR-1120.
--

Resolution: Fixed

Committed revision 781272.

Thanks, Steffen Baumgart!

> Simplify EntityProcessor API
> 
>
> Key: SOLR-1120
> URL: https://issues.apache.org/jira/browse/SOLR-1120
> Project: Solr
>  Issue Type: Improvement
>  Components: contrib - DataImportHandler
>Affects Versions: 1.3
>Reporter: Shalin Shekhar Mangar
>Assignee: Shalin Shekhar Mangar
> Fix For: 1.4
>
> Attachments: SOLR-1120.patch, SOLR-1120.patch, SOLR-1120.patch, 
> SOLR-1120.patch, SOLR-1120.patch, SOLR-1120.patch, SOLR-1120.patch, 
> SOLR-1120.patch
>
>
> Writing an EntityProcessor is deceptively complex; there are many gotchas.
> I propose the following:
> # Extract the Transformer application logic out of EntityProcessor and move 
> it into DocBuilder. Then an EntityProcessor does not need to call 
> applyTransformer or know about the rowIterator and getFromRowCache() methods.
> # Change the meaning of EntityProcessor#destroy so that it is called at the 
> end of the parent's row -- right now init is called once per parent row, but 
> destroy actually means the end of the import. In fact, there is no correct 
> way for an entity processor to clean up right now. Most clean up when 
> returning null (end of data), but with the introduction of $skipDoc, a 
> transformer can return $skipDoc and the entity processor never gets a chance 
> to clean up for the current init.
> # EntityProcessor will use the EventListener API to listen for the end of 
> the import, and should use it to do a final cleanup.
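
For concreteness, here is a rough sketch of what a custom processor might
look like under the proposed lifecycle -- init() per parent row, destroy() at
the end of that row, and no transformer plumbing. This is illustrative only,
not code from the attached patches; MyEntityProcessor and fetchRowsFor are
made-up names.

import java.util.Iterator;
import java.util.Map;

import org.apache.solr.handler.dataimport.Context;
import org.apache.solr.handler.dataimport.EntityProcessorBase;

public class MyEntityProcessor extends EntityProcessorBase {
  private Iterator<Map<String, Object>> rows; // per-parent-row state

  @Override
  public void init(Context context) {
    super.init(context);
    rows = fetchRowsFor(context); // hypothetical: open a cursor for this parent row
  }

  @Override
  public Map<String, Object> nextRow() {
    // Return raw rows only; under the proposal DocBuilder applies the
    // transformers, so no applyTransformer()/getFromRowCache() calls here.
    return (rows != null && rows.hasNext()) ? rows.next() : null;
  }

  @Override
  public void destroy() {
    rows = null; // proposed semantics: runs at the end of each parent row
    // Whole-import cleanup would instead hang off the EventListener API
    // (import-end event), per item 3 above.
  }

  private Iterator<Map<String, Object>> fetchRowsFor(Context context) {
    return null; // placeholder; a real processor opens its data source here
  }
}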

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (SOLR-1120) Simplify EntityProcessor API

2009-06-02 Thread Noble Paul (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Noble Paul updated SOLR-1120:
-

Attachment: SOLR-1120.patch

> Simplify EntityProcessor API
> 
>
> Key: SOLR-1120
> URL: https://issues.apache.org/jira/browse/SOLR-1120
> Project: Solr
>  Issue Type: Improvement
>  Components: contrib - DataImportHandler
>Affects Versions: 1.3
>Reporter: Shalin Shekhar Mangar
>Assignee: Shalin Shekhar Mangar
> Fix For: 1.4
>
> Attachments: SOLR-1120.patch, SOLR-1120.patch, SOLR-1120.patch, 
> SOLR-1120.patch, SOLR-1120.patch, SOLR-1120.patch, SOLR-1120.patch, 
> SOLR-1120.patch
>
>
> Writing an EntityProcessor is deceptively complex; there are many gotchas.
> I propose the following:
> # Extract the Transformer application logic out of EntityProcessor and move 
> it into DocBuilder. Then an EntityProcessor does not need to call 
> applyTransformer or know about the rowIterator and getFromRowCache() methods.
> # Change the meaning of EntityProcessor#destroy so that it is called at the 
> end of the parent's row -- right now init is called once per parent row, but 
> destroy actually means the end of the import. In fact, there is no correct 
> way for an entity processor to clean up right now. Most clean up when 
> returning null (end of data), but with the introduction of $skipDoc, a 
> transformer can return $skipDoc and the entity processor never gets a chance 
> to clean up for the current init.
> # EntityProcessor will use the EventListener API to listen for the end of 
> the import, and should use it to do a final cleanup.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (SOLR-769) Support Document and Search Result clustering

2009-06-02 Thread Brad Giaccio (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brad Giaccio updated SOLR-769:
--

Attachment: (was: clustering-componet-shard.patch)

> Support Document and Search Result clustering
> -
>
> Key: SOLR-769
> URL: https://issues.apache.org/jira/browse/SOLR-769
> Project: Solr
>  Issue Type: New Feature
>Reporter: Grant Ingersoll
>Assignee: Grant Ingersoll
>Priority: Minor
> Fix For: 1.4
>
> Attachments: clustering-componet-shard.patch, clustering-libs.tar, 
> clustering-libs.tar, SOLR-769-analyzerClass.patch, SOLR-769-lib.zip, 
> SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
> SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
> SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.tar, SOLR-769.zip
>
>
> Clustering is a useful tool for working with documents and search results, 
> similar to the notion of dynamic faceting. Carrot2 
> (http://project.carrot2.org/) is a nice, BSD-licensed library for doing 
> search-results clustering. Mahout (http://lucene.apache.org/mahout) is well 
> suited for whole-corpus clustering.
> The patch lays out a contrib module that starts off with an integration of 
> a SearchComponent for doing clustering and an implementation using Carrot2. 
> In search-results mode, it will use the DocList as the input for the cluster.
> While Carrot2 comes with a Solr input component, it is not the same as the 
> SearchComponent I have, in that the Carrot2 example actually submits a query 
> to Solr, whereas my SearchComponent is just chained into the component list 
> and uses the ResponseBuilder to add in the cluster results.
> While not fully fleshed out yet, the collection-based mode will take in a 
> list of ids or just use the whole collection, and will produce clusters. 
> Since this is a longer, typically offline task, there will need to be some 
> type of storage mechanism (and replication??) for the clusters. I _may_ 
> push this off to a separate JIRA issue, but I at least want to present the 
> use case as part of the design of this component/contrib. It may even make 
> sense to split this out, such that the building piece is something like an 
> UpdateProcessor and the SearchComponent just acts as a lookup mechanism.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (SOLR-769) Support Document and Search Result clustering

2009-06-02 Thread Brad Giaccio (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brad Giaccio updated SOLR-769:
--

Attachment: clustering-componet-shard.patch

Okay, I've rewritten the patch as I suggested. Now clustering happens in 
finishStage for distributed queries and in process for non-distributed ones, 
both by calling the new method clusterResults. To make this happen I had to 
convert the interfaces and supporting code to use SolrDocumentList rather 
than DocList.

I've added a unit test which extends TestDistributedSearch. I had to modify 
TestDistributedSearch and make a bunch of things protected. This allowed me 
to write a very small test case (just overriding doTest) and leave all the 
logic for creating shards, distributing docs, and comparing responses in 
TestDistributedSearch. I felt this made for a very clean way to test a single 
distributed component.
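
As a reader's sketch of the shape just described (not the attached patch --
the "clustering" parameter name and the helper methods are assumptions), both
paths funnel into a single clusterResults method:

import java.io.IOException;

import org.apache.solr.common.SolrDocumentList;
import org.apache.solr.handler.component.ResponseBuilder;
import org.apache.solr.handler.component.SearchComponent;

public class ClusteringComponentSketch extends SearchComponent {

  @Override
  public void prepare(ResponseBuilder rb) throws IOException {
  }

  @Override
  public void process(ResponseBuilder rb) throws IOException {
    // Non-distributed path: cluster the local results during process.
    if (rb.req.getParams().getBool("clustering", false)) {
      rb.rsp.add("clusters", clusterResults(toSolrDocumentList(rb)));
    }
  }

  @Override
  public void finishStage(ResponseBuilder rb) {
    // Distributed path: by STAGE_GET_FIELDS the merged SolrDocumentList
    // is in the response, so cluster it here instead.
    if (rb.stage == ResponseBuilder.STAGE_GET_FIELDS
        && rb.req.getParams().getBool("clustering", false)) {
      SolrDocumentList merged =
          (SolrDocumentList) rb.rsp.getValues().get("response");
      rb.rsp.add("clusters", clusterResults(merged));
    }
  }

  private Object clusterResults(SolrDocumentList docs) {
    return null; // the Carrot2 call would go here
  }

  private SolrDocumentList toSolrDocumentList(ResponseBuilder rb) {
    return new SolrDocumentList(); // DocList-to-SolrDocumentList conversion elided
  }

  @Override public String getDescription() { return "clustering sketch"; }
  @Override public String getSourceId() { return ""; }
  @Override public String getSource() { return ""; }
  @Override public String getVersion() { return "1.0"; }
}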

> Support Document and Search Result clustering
> -
>
> Key: SOLR-769
> URL: https://issues.apache.org/jira/browse/SOLR-769
> Project: Solr
>  Issue Type: New Feature
>Reporter: Grant Ingersoll
>Assignee: Grant Ingersoll
>Priority: Minor
> Fix For: 1.4
>
> Attachments: clustering-componet-shard.patch, clustering-libs.tar, 
> clustering-libs.tar, SOLR-769-analyzerClass.patch, SOLR-769-lib.zip, 
> SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
> SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
> SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.tar, SOLR-769.zip
>
>
> Clustering is a useful tool for working with documents and search results, 
> similar to the notion of dynamic faceting. Carrot2 
> (http://project.carrot2.org/) is a nice, BSD-licensed library for doing 
> search-results clustering. Mahout (http://lucene.apache.org/mahout) is well 
> suited for whole-corpus clustering.
> The patch lays out a contrib module that starts off with an integration of 
> a SearchComponent for doing clustering and an implementation using Carrot2. 
> In search-results mode, it will use the DocList as the input for the cluster.
> While Carrot2 comes with a Solr input component, it is not the same as the 
> SearchComponent I have, in that the Carrot2 example actually submits a query 
> to Solr, whereas my SearchComponent is just chained into the component list 
> and uses the ResponseBuilder to add in the cluster results.
> While not fully fleshed out yet, the collection-based mode will take in a 
> list of ids or just use the whole collection, and will produce clusters. 
> Since this is a longer, typically offline task, there will need to be some 
> type of storage mechanism (and replication??) for the clusters. I _may_ 
> push this off to a separate JIRA issue, but I at least want to present the 
> use case as part of the design of this component/contrib. It may even make 
> sense to split this out, such that the building piece is something like an 
> UpdateProcessor and the SearchComponent just acts as a lookup mechanism.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1051) Support the merge of multiple indexes

2009-06-02 Thread Ning Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12715600#action_12715600
 ] 

Ning Li commented on SOLR-1051:
---

In the current approach, mergeIndexes is an admin command and the target core 
should be online. I haven't looked into the SolrDispatchFilter logic change, 
but it seems that with this change the following are the two valid options:
  - mergeIndexes is an update command and the target core is online
  - mergeIndexes is an admin command and the target core is offline

The first option is close to what we have now. I like it a bit more because 
the merge goes through UpdateProcessor, so you can keep track of it. But you 
seem to prefer the second option?
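
For illustration only: the admin-command flavor would be invoked along these
lines (this matches the CoreAdmin mergeindexes action that eventually
shipped; at the time of this comment the exact form was still under
discussion):

http://localhost:8983/solr/admin/cores?action=mergeindexes&core=core0&indexDir=/path/to/core1/data/index&indexDir=/path/to/core2/data/index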

> Support the merge of multiple indexes
> -
>
> Key: SOLR-1051
> URL: https://issues.apache.org/jira/browse/SOLR-1051
> Project: Solr
>  Issue Type: New Feature
>  Components: update
>Reporter: Ning Li
>Assignee: Shalin Shekhar Mangar
>Priority: Minor
> Fix For: 1.4
>
> Attachments: SOLR-1051.patch, SOLR-1051.patch, SOLR-1051.patch
>
>
> This is to support the merge of multiple indexes.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Reopened: (SOLR-1120) Simplify EntityProcessor API

2009-06-02 Thread Shalin Shekhar Mangar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shalin Shekhar Mangar reopened SOLR-1120:
-


The initial commit for this issue broke the Debug functionality.

Refer to 
http://www.lucidimagination.com/search/document/42c345a606820f9/npe_in_dataimport_debuglogger_peekstack_dih_development_console

> Simplify EntityProcessor API
> 
>
> Key: SOLR-1120
> URL: https://issues.apache.org/jira/browse/SOLR-1120
> Project: Solr
>  Issue Type: Improvement
>  Components: contrib - DataImportHandler
>Affects Versions: 1.3
>Reporter: Shalin Shekhar Mangar
>Assignee: Shalin Shekhar Mangar
> Fix For: 1.4
>
> Attachments: SOLR-1120.patch, SOLR-1120.patch, SOLR-1120.patch, 
> SOLR-1120.patch, SOLR-1120.patch, SOLR-1120.patch, SOLR-1120.patch
>
>
> Writing an EntityProcessor is deceptively complex; there are many gotchas.
> I propose the following:
> # Extract the Transformer application logic out of EntityProcessor and move 
> it into DocBuilder. Then an EntityProcessor does not need to call 
> applyTransformer or know about the rowIterator and getFromRowCache() methods.
> # Change the meaning of EntityProcessor#destroy so that it is called at the 
> end of the parent's row -- right now init is called once per parent row, but 
> destroy actually means the end of the import. In fact, there is no correct 
> way for an entity processor to clean up right now. Most clean up when 
> returning null (end of data), but with the introduction of $skipDoc, a 
> transformer can return $skipDoc and the entity processor never gets a chance 
> to clean up for the current init.
> # EntityProcessor will use the EventListener API to listen for the end of 
> the import, and should use it to do a final cleanup.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (SOLR-990) Add pid file to snapinstaller to skip script overruns, and recover from failure

2009-06-02 Thread Otis Gospodnetic (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Otis Gospodnetic resolved SOLR-990.
---

Resolution: Fixed

Thank you, Dan!

Sending        src/scripts/snapinstaller
Transmitting file data .
Committed revision 781069.


> Add pid file to snapinstaller to skip script overruns, and recover from 
> failure
> ---
>
> Key: SOLR-990
> URL: https://issues.apache.org/jira/browse/SOLR-990
> Project: Solr
>  Issue Type: Improvement
>  Components: replication (scripts)
>Reporter: Dan Rosher
>Assignee: Otis Gospodnetic
>Priority: Minor
> Fix For: 1.4
>
> Attachments: SOLR-990.patch, SOLR-990.patch, SOLR-990.patch, 
> SOLR-990.patch
>
>
> The pid file allows snapinstaller to be run as fast as possible without 
> overruns. It also recovers from a previous failed run, provided the older 
> snapinstaller process is no longer running.
> Avoiding overruns means that snapinstaller can be run as fast as possible 
> without suffering from the performance issue described here:
> http://wiki.apache.org/solr/SolrPerformanceFactors#head-fc7f22035c493431d58c5404ab22aef0ee1b9909
> This means that one can do the following:
> */1 * * * * /bin/snappuller && /bin/snapinstaller
> Even with a 'properly tuned' setup, there can be times when snapinstaller 
> suffers from overruns due to a lack of resources, an unoptimized index 
> using more resources, etc.
> Currently the pid file lives in /tmp ... perhaps it should be in the logs dir?
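
The guard itself is simple. The script uses a pid file, but the same "skip if
a previous run is still alive" pattern can be sketched in Java with a file
lock instead of a pid -- a deliberate substitution for illustration, not how
snapinstaller is implemented:

import java.io.File;
import java.io.RandomAccessFile;
import java.nio.channels.FileLock;

public class SingleRunGuard {
  public static void main(String[] args) throws Exception {
    RandomAccessFile raf =
        new RandomAccessFile(new File("/tmp/snapinstaller.lock"), "rw");
    FileLock lock = raf.getChannel().tryLock();
    if (lock == null) {
      // This is the overrun case the pid file is meant to catch.
      System.out.println("previous run still in progress -- skipping");
      return;
    }
    try {
      System.out.println("installing snapshot...");
      // A lock held by a crashed process is released by the OS, which
      // gives the recover-from-failure behaviour for free.
    } finally {
      lock.release();
      raf.close();
    }
  }
}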

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (SOLR-1192) solr.NGramFilterFactory stops indexing content when it finds a token smaller than the minimum n-gram size

2009-06-02 Thread Otis Gospodnetic (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Otis Gospodnetic updated SOLR-1192:
---

Fix Version/s: 1.4 (was: 1.3)

> solr.NGramFilterFactory stops indexing content when it finds a token 
> smaller than the minimum n-gram size
> ---
>
> Key: SOLR-1192
> URL: https://issues.apache.org/jira/browse/SOLR-1192
> Project: Solr
>  Issue Type: Bug
>  Components: Analysis
>Affects Versions: 1.3
> Environment: any
>Reporter: viobade
> Fix For: 1.4
>
>
> If a field is split into tokens (by a tokenizer) and the NGramFilterFactory 
> is then applied to those tokens, indexing goes well as long as each token's 
> length is greater than or equal to the minimum n-gram size (usually 3). 
> Otherwise, indexing breaks at that point and the remaining tokens are not 
> indexed. This behaviour is easy to observe with the analysis tool in the 
> Solr admin interface.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (SOLR-1192) solr.NGramFilterFactory stops indexing content when it finds a token smaller than the minimum n-gram size

2009-06-02 Thread Otis Gospodnetic (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Otis Gospodnetic updated SOLR-1192:
---


That stems from Lucene; see LUCENE-1491.
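
A minimal repro outside Solr, sketched against the Lucene 2.9-era
token-stream API (illustrative; the class name and sample text are made up):
the short token "is" is below minGram=3, and with the bug present the stream
ends there, so "observed" is never n-grammed.

import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.analysis.ngram.NGramTokenFilter;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public class NGramShortTokenRepro {
  public static void main(String[] args) throws Exception {
    TokenStream ts = new NGramTokenFilter(
        new WhitespaceTokenizer(new StringReader("easily is observed")), 3, 5);
    TermAttribute term = ts.addAttribute(TermAttribute.class);
    while (ts.incrementToken()) {
      System.out.println(term.term()); // stops after the n-grams of "easily"
    }
  }
}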


> solr.NGramFilterFactory stops indexing content when it finds a token 
> smaller than the minimum n-gram size
> ---
>
> Key: SOLR-1192
> URL: https://issues.apache.org/jira/browse/SOLR-1192
> Project: Solr
>  Issue Type: Bug
>  Components: Analysis
>Affects Versions: 1.3
> Environment: any
>Reporter: viobade
> Fix For: 1.3
>
>
> If a field is split into tokens (by a tokenizer) and the NGramFilterFactory 
> is then applied to those tokens, indexing goes well as long as each token's 
> length is greater than or equal to the minimum n-gram size (usually 3). 
> Otherwise, indexing breaks at that point and the remaining tokens are not 
> indexed. This behaviour is easy to observe with the analysis tool in the 
> Solr admin interface.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Solr nightly build failure

2009-06-02 Thread solr-dev

init-forrest-entities:
[mkdir] Created dir: /tmp/apache-solr-nightly/build
[mkdir] Created dir: /tmp/apache-solr-nightly/build/web

compile-solrj:
[mkdir] Created dir: /tmp/apache-solr-nightly/build/solrj
[javac] Compiling 83 source files to /tmp/apache-solr-nightly/build/solrj
[javac] Note: Some input files use or override a deprecated API.
[javac] Note: Recompile with -Xlint:deprecation for details.
[javac] Note: Some input files use unchecked or unsafe operations.
[javac] Note: Recompile with -Xlint:unchecked for details.

compile:
[mkdir] Created dir: /tmp/apache-solr-nightly/build/solr
[javac] Compiling 373 source files to /tmp/apache-solr-nightly/build/solr
[javac] Note: Some input files use or override a deprecated API.
[javac] Note: Recompile with -Xlint:deprecation for details.
[javac] Note: Some input files use unchecked or unsafe operations.
[javac] Note: Recompile with -Xlint:unchecked for details.

compileTests:
[mkdir] Created dir: /tmp/apache-solr-nightly/build/tests
[javac] Compiling 161 source files to /tmp/apache-solr-nightly/build/tests
[javac] Note: Some input files use or override a deprecated API.
[javac] Note: Recompile with -Xlint:deprecation for details.
[javac] Note: Some input files use unchecked or unsafe operations.
[javac] Note: Recompile with -Xlint:unchecked for details.

junit:
[mkdir] Created dir: /tmp/apache-solr-nightly/build/test-results
[junit] Running org.apache.solr.BasicFunctionalityTest
[junit] Tests run: 19, Failures: 0, Errors: 0, Time elapsed: 17.658 sec
[junit] Running org.apache.solr.ConvertedLegacyTest
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 8.611 sec
[junit] Running org.apache.solr.DisMaxRequestHandlerTest
[junit] Tests run: 3, Failures: 0, Errors: 0, Time elapsed: 5.135 sec
[junit] Running org.apache.solr.EchoParamsTest
[junit] Tests run: 4, Failures: 0, Errors: 0, Time elapsed: 2.409 sec
[junit] Running org.apache.solr.OutputWriterTest
[junit] Tests run: 3, Failures: 0, Errors: 0, Time elapsed: 2.75 sec
[junit] Running org.apache.solr.SampleTest
[junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 2.383 sec
[junit] Running org.apache.solr.SolrInfoMBeanTest
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.944 sec
[junit] Running org.apache.solr.TestDistributedSearch
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 26.478 sec
[junit] Running org.apache.solr.TestTrie
[junit] Tests run: 7, Failures: 0, Errors: 0, Time elapsed: 7.588 sec
[junit] Running org.apache.solr.analysis.DoubleMetaphoneFilterFactoryTest
[junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 0.611 sec
[junit] Running org.apache.solr.analysis.DoubleMetaphoneFilterTest
[junit] Tests run: 6, Failures: 0, Errors: 0, Time elapsed: 0.441 sec
[junit] Running org.apache.solr.analysis.EnglishPorterFilterFactoryTest
[junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 1.465 sec
[junit] Running org.apache.solr.analysis.HTMLStripReaderTest
[junit] Tests run: 9, Failures: 0, Errors: 0, Time elapsed: 1.693 sec
[junit] Running org.apache.solr.analysis.LengthFilterTest
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 1.775 sec
[junit] Running org.apache.solr.analysis.SnowballPorterFilterFactoryTest
[junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 1.501 sec
[junit] Running org.apache.solr.analysis.TestBufferedTokenStream
[junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 1.448 sec
[junit] Running org.apache.solr.analysis.TestCapitalizationFilter
[junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 1.634 sec
[junit] Running org.apache.solr.analysis.TestCharFilter
[junit] Tests run: 4, Failures: 0, Errors: 0, Time elapsed: 0.533 sec
[junit] Running org.apache.solr.analysis.TestHyphenatedWordsFilter
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.977 sec
[junit] Running org.apache.solr.analysis.TestKeepFilterFactory
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 2.667 sec
[junit] Running org.apache.solr.analysis.TestKeepWordFilter
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 1.316 sec
[junit] Running org.apache.solr.analysis.TestMappingCharFilter
[junit] Tests run: 11, Failures: 0, Errors: 0, Time elapsed: 0.474 sec
[junit] Running org.apache.solr.analysis.TestMappingCharFilterFactory
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.477 sec
[junit] Running org.apache.solr.analysis.TestPatternReplaceFilter
[junit] Tests run: 5, Failures: 0, Errors: 0, Time elapsed: 2.308 sec
[junit] Running org.apache.solr.analysis.TestPatternTokenizerFactory
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 1.068 sec
[junit] Running org.apache.solr.analysis.TestPhoneticFilter