Re: solr 2.0 branch/sandbox?

2008-10-14 Thread Thorsten Scherler
On Thu, 2008-10-02 at 23:13 -0400, Ryan McKinley wrote:
> Hey-
> 
> Rather than continually pointing to Solr 2.0 as a far-future thing,
> I'd like to have a go at removing all configs and deprecated stuff.
> I doubt that would end up being the real direction, but as an exercise
> it would be quite valuable for figuring out what the major issues will be
> and seeing how it feels.
> 
> What do you think the best way to do this is?

Today, while preparing a talk with Santiago Gala about the ASF for [1], I
came across the following link in one of his slides:
http://incubator.apache.org/learn/rules-for-revolutionaries.html

Then I remembered this thread.

salu2

[1] http://www.opensourceworldconference.com/

> 
> How would you feel if I made a branch to experiment with stripping all
> configs out of Solr?
> 
> perhaps:
> http://svn.apache.org/repos/asf/lucene/solr/branches/sandbox/
> or
> http://svn.apache.org/repos/asf/lucene/solr/branches/sandbox/ryan/
> 
> thoughts?
> ryan
> 
-- 
Thorsten Scherler thorsten.at.apache.org
Open Source Java  consulting, training and solutions



[jira] Resolved: (SOLR-800) ConcurrentModificationException in XPathEntityProcessor while streaming

2008-10-14 Thread Shalin Shekhar Mangar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shalin Shekhar Mangar resolved SOLR-800.


   Resolution: Fixed
Fix Version/s: (was: 1.4)
 Assignee: Shalin Shekhar Mangar

Committed revision 704365.

Thanks Kyle and Noble!

> ConcurrentModificationException in XPathEntityProcessor while streaming
> -
>
> Key: SOLR-800
> URL: https://issues.apache.org/jira/browse/SOLR-800
> Project: Solr
>  Issue Type: Bug
>  Components: contrib - DataImportHandler
>Affects Versions: 1.3
>Reporter: Noble Paul
>Assignee: Shalin Shekhar Mangar
> Fix For: 1.3.1
>
> Attachments: SOLR-800.patch
>
>
> While doing an import with an XPathEntityProcessor with stream="true", the following exception is thrown.
> The stack trace:
> java.util.ConcurrentModificationException
>    at java.util.AbstractList$Itr.checkForComodification(AbstractList.java:372)
>    at java.util.AbstractList$Itr.next(AbstractList.java:343)
>    at org.apache.solr.handler.dataimport.DocBuilder.addFieldValue(DocBuilder.java:402)
>    at org.apache.solr.handler.dataimport.DocBuilder.addFields(DocBuilder.java:373)
>    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:304)
>    at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:178)
>    at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:136)
>    at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:334)
>    at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:386)
>    at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:377)
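
For context on the failure: a ConcurrentModificationException is thrown when a collection is structurally modified while it is being iterated. The sketch below is not the DocBuilder code, only a minimal, self-contained illustration of the problem and of the deep-copy style fix referenced in the commit; the class and variable names are made up.

{code:java}
import java.util.ArrayList;
import java.util.ConcurrentModificationException;
import java.util.List;

public class ComodificationDemo {
  public static void main(String[] args) {
    List<String> values = new ArrayList<String>();
    values.add("a");
    values.add("b");

    // Broken pattern: adding to 'values' while iterating over it.
    try {
      for (String v : values) {
        values.add(v + "-copy");                // structural modification during iteration
      }
    } catch (ConcurrentModificationException expected) {
      System.out.println("got expected " + expected);
    }

    // Fix in the spirit of SOLR-800: iterate over a copy of the collection so
    // that additions to the original cannot invalidate the iterator.
    List<String> snapshot = new ArrayList<String>(values);
    for (String v : snapshot) {
      values.add(v + "-copy");                  // safe: 'snapshot' is never modified
    }
    System.out.println(values);
  }
}
{code}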

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Solr nightly build failure

2008-10-14 Thread solr-dev

init-forrest-entities:
[mkdir] Created dir: /tmp/apache-solr-nightly/build
[mkdir] Created dir: /tmp/apache-solr-nightly/build/web

compile-common:
[mkdir] Created dir: /tmp/apache-solr-nightly/build/common
[javac] Compiling 36 source files to /tmp/apache-solr-nightly/build/common
[javac] Note: /tmp/apache-solr-nightly/src/java/org/apache/solr/common/util/FastInputStream.java uses or overrides a deprecated API.
[javac] Note: Recompile with -Xlint:deprecation for details.
[javac] Note: Some input files use unchecked or unsafe operations.
[javac] Note: Recompile with -Xlint:unchecked for details.

compile:
[mkdir] Created dir: /tmp/apache-solr-nightly/build/core
[javac] Compiling 342 source files to /tmp/apache-solr-nightly/build/core
[javac] /tmp/apache-solr-nightly/src/java/org/apache/solr/request/VelocityResponseWriter.java:35: package org.apache.velocity does not exist
[javac] import org.apache.velocity.Template;
[javac]                           ^
[javac] /tmp/apache-solr-nightly/src/java/org/apache/solr/request/VelocityResponseWriter.java:36: package org.apache.velocity does not exist
[javac] import org.apache.velocity.VelocityContext;
[javac]                           ^
[javac] /tmp/apache-solr-nightly/src/java/org/apache/solr/request/VelocityResponseWriter.java:37: package org.apache.velocity.app does not exist
[javac] import org.apache.velocity.app.VelocityEngine;
[javac]                               ^
[javac] /tmp/apache-solr-nightly/src/java/org/apache/solr/request/VelocityResponseWriter.java:78: cannot find symbol
[javac] symbol  : class VelocityEngine
[javac] location: class org.apache.solr.request.VelocityResponseWriter
[javac]     VelocityEngine engine = new VelocityEngine();
[javac]     ^
[javac] /tmp/apache-solr-nightly/src/java/org/apache/solr/request/VelocityResponseWriter.java:78: cannot find symbol
[javac] symbol  : class VelocityEngine
[javac] location: class org.apache.solr.request.VelocityResponseWriter
[javac]     VelocityEngine engine = new VelocityEngine();
[javac]                                 ^
[javac] /tmp/apache-solr-nightly/src/java/org/apache/solr/request/VelocityResponseWriter.java:80: cannot find symbol
[javac] symbol  : variable VelocityEngine
[javac] location: class org.apache.solr.request.VelocityResponseWriter
[javac]     engine.setProperty(VelocityEngine.FILE_RESOURCE_LOADER_PATH, baseDir.getAbsolutePath());
[javac]                        ^
[javac] /tmp/apache-solr-nightly/src/java/org/apache/solr/request/VelocityResponseWriter.java:81: cannot find symbol
[javac] symbol  : variable VelocityEngine
[javac] location: class org.apache.solr.request.VelocityResponseWriter
[javac]     engine.setProperty(VelocityEngine.RESOURCE_LOADER, "file");
[javac]                        ^
[javac] /tmp/apache-solr-nightly/src/java/org/apache/solr/request/VelocityResponseWriter.java:82: cannot find symbol
[javac] symbol  : class Template
[javac] location: class org.apache.solr.request.VelocityResponseWriter
[javac]     Template template;
[javac]     ^
[javac] /tmp/apache-solr-nightly/src/java/org/apache/solr/request/VelocityResponseWriter.java:90: cannot find symbol
[javac] symbol  : class VelocityContext
[javac] location: class org.apache.solr.request.VelocityResponseWriter
[javac]     VelocityContext context = new VelocityContext();
[javac]     ^
[javac] /tmp/apache-solr-nightly/src/java/org/apache/solr/request/VelocityResponseWriter.java:90: cannot find symbol
[javac] symbol  : class VelocityContext
[javac] location: class org.apache.solr.request.VelocityResponseWriter
[javac]     VelocityContext context = new VelocityContext();
[javac]                                   ^
[javac] Note: Some input files use or override a deprecated API.
[javac] Note: Recompile with -Xlint:deprecation for details.
[javac] Note: Some input files use unchecked or unsafe operations.
[javac] Note: Recompile with -Xlint:unchecked for details.
[javac] 10 errors

BUILD FAILED
/tmp/apache-solr-nightly/build.xml:125: The following error occurred while executing this line:
/tmp/apache-solr-nightly/common-build.xml:149: Compile failed; see the compiler error output for details.

Total time: 10 seconds




Build failed in Hudson: Solr-trunk #593

2008-10-14 Thread Apache Hudson Server
See http://hudson.zones.apache.org/hudson/job/Solr-trunk/593/changes

Changes:

[shalin] SOLR-800 -- Deep copy collections to avoid ConcurrentModificationException in XPathEntityProcessor while streaming

[hossman] error checking for mandatory param

[ryan] SOLR-793 -- removing stuff accidentally added with previous commit

[ryan] SOLR-793: Add 'commitWithin' argument to the update add command. This behaves similarly to the global autoCommit maxTime argument, except that it is set per request (see the example after this change list).

[shalin] SOLR-658 -- Allow Solr to load index from arbitrary directory in 
dataDir
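
To make the SOLR-793 entry above concrete: an update message could opt into a per-request commit deadline roughly as shown below. The attribute name 'commitWithin' comes from the commit message; the millisecond unit and the example field names are assumptions made here for illustration.

{code:xml}
<!-- Hypothetical update request: ask Solr to commit this add within ~10 seconds,
     instead of relying only on the global <autoCommit><maxTime> setting. -->
<add commitWithin="10000">
  <doc>
    <field name="id">example-1</field>
    <field name="name">commitWithin example</field>
  </doc>
</add>
{code}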

--
[...truncated 1318 lines...]
A  client/ruby/solr-ruby/solr/conf/xslt/example.xsl
A  client/ruby/solr-ruby/solr/conf/scripts.conf
A  client/ruby/solr-ruby/solr/conf/admin-extra.html
A  client/ruby/solr-ruby/solr/conf/synonyms.txt
A  client/ruby/solr-ruby/solr/lib
A  client/ruby/solr-ruby/test
A  client/ruby/solr-ruby/test/unit
A  client/ruby/solr-ruby/test/unit/standard_response_test.rb
A  client/ruby/solr-ruby/test/unit/document_test.rb
AU client/ruby/solr-ruby/test/unit/select_test.rb
AU client/ruby/solr-ruby/test/unit/delimited_file_source_test.rb
A  client/ruby/solr-ruby/test/unit/xpath_test_file.xml
AU client/ruby/solr-ruby/test/unit/array_mapper_test.rb
AU client/ruby/solr-ruby/test/unit/solr_mock_base.rb
A  client/ruby/solr-ruby/test/unit/field_test.rb
AU client/ruby/solr-ruby/test/unit/modify_document_test.rb
A  client/ruby/solr-ruby/test/unit/add_document_test.rb
AU client/ruby/solr-ruby/test/unit/xpath_mapper_test.rb
AU client/ruby/solr-ruby/test/unit/request_test.rb
A  client/ruby/solr-ruby/test/unit/commit_test.rb
AU client/ruby/solr-ruby/test/unit/suite.rb
AU client/ruby/solr-ruby/test/unit/changes_yaml_test.rb
A  client/ruby/solr-ruby/test/unit/spellcheck_response_test.rb
A  client/ruby/solr-ruby/test/unit/ping_test.rb
A  client/ruby/solr-ruby/test/unit/dismax_request_test.rb
AU client/ruby/solr-ruby/test/unit/indexer_test.rb
A  client/ruby/solr-ruby/test/unit/response_test.rb
AU client/ruby/solr-ruby/test/unit/connection_test.rb
A  client/ruby/solr-ruby/test/unit/delete_test.rb
AU client/ruby/solr-ruby/test/unit/tab_delimited.txt
A  client/ruby/solr-ruby/test/unit/hpricot_test_file.xml
AU client/ruby/solr-ruby/test/unit/standard_request_test.rb
A  client/ruby/solr-ruby/test/unit/hpricot_mapper_test.rb
A  client/ruby/solr-ruby/test/unit/spellchecker_request_test.rb
AU client/ruby/solr-ruby/test/unit/data_mapper_test.rb
AU client/ruby/solr-ruby/test/unit/util_test.rb
A  client/ruby/solr-ruby/test/functional
A  client/ruby/solr-ruby/test/functional/test_solr_server.rb
A  client/ruby/solr-ruby/test/functional/server_test.rb
A  client/ruby/solr-ruby/test/conf
AU client/ruby/solr-ruby/test/conf/schema.xml
A  client/ruby/solr-ruby/test/conf/protwords.txt
A  client/ruby/solr-ruby/test/conf/stopwords.txt
AU client/ruby/solr-ruby/test/conf/solrconfig.xml
A  client/ruby/solr-ruby/test/conf/scripts.conf
A  client/ruby/solr-ruby/test/conf/admin-extra.html
A  client/ruby/solr-ruby/test/conf/synonyms.txt
A  client/ruby/solr-ruby/LICENSE.txt
A  client/ruby/solr-ruby/Rakefile
A  client/ruby/solr-ruby/script
AU client/ruby/solr-ruby/script/setup.rb
AU client/ruby/solr-ruby/script/solrshell
A  client/ruby/solr-ruby/lib
A  client/ruby/solr-ruby/lib/solr
AU client/ruby/solr-ruby/lib/solr/util.rb
A  client/ruby/solr-ruby/lib/solr/document.rb
A  client/ruby/solr-ruby/lib/solr/exception.rb
AU client/ruby/solr-ruby/lib/solr/indexer.rb
AU client/ruby/solr-ruby/lib/solr/response.rb
AU client/ruby/solr-ruby/lib/solr/connection.rb
A  client/ruby/solr-ruby/lib/solr/importer
AU client/ruby/solr-ruby/lib/solr/importer/delimited_file_source.rb
AU client/ruby/solr-ruby/lib/solr/importer/solr_source.rb
AU client/ruby/solr-ruby/lib/solr/importer/array_mapper.rb
AU client/ruby/solr-ruby/lib/solr/importer/mapper.rb
AU client/ruby/solr-ruby/lib/solr/importer/xpath_mapper.rb
A  client/ruby/solr-ruby/lib/solr/importer/hpricot_mapper.rb
A  client/ruby/solr-ruby/lib/solr/xml.rb
AU client/ruby/solr-ruby/lib/solr/importer.rb
A  client/ruby/solr-ruby/lib/solr/field.rb
AU client/ruby/solr-ruby/lib/solr/solrtasks.rb
A  client/ruby/solr-ruby/lib/solr/request
A  client/ruby/solr-ruby/lib/solr/request/ping.rb
A  client/ruby/solr-ruby/lib/solr/request/spellcheck.rb
A  client/ruby/solr-ruby/lib/solr/request/select.rb
AU client/ruby/solr-ruby/lib/solr/request/optimize.rb
AU client/ruby/solr-ruby

[jira] Commented: (SOLR-561) Solr replication by Solr (for windows also)

2008-10-14 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12639406#action_12639406
 ] 

Yonik Seeley commented on SOLR-561:
---

Why are files downloaded to a temp directory first?  Since all index files are 
versioned, would it make sense to copy directly into the index dir (provided 
you copy segments_n last)?
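
A rough sketch of the copy order being suggested here, assuming plain file copies into the live index directory: write every versioned index file first and the segments_N file last, so a reader never sees a segments file that refers to data files that have not arrived yet. This is illustration only, not the replication patch.

{code:java}
import java.io.*;
import java.util.ArrayList;
import java.util.List;

public class CopySegmentsLast {

  /** Copy downloaded index files into indexDir, deferring segments_N (and segments.gen) to the end. */
  public static void copyIntoIndex(File downloadDir, File indexDir) throws IOException {
    File[] files = downloadDir.listFiles();
    if (files == null) return;                      // nothing downloaded
    List<File> segmentsFiles = new ArrayList<File>();
    for (File f : files) {
      if (f.getName().startsWith("segments")) {
        segmentsFiles.add(f);                       // publish the commit point last
      } else {
        copy(f, new File(indexDir, f.getName()));
      }
    }
    for (File f : segmentsFiles) {
      copy(f, new File(indexDir, f.getName()));     // only now does the new commit become visible
    }
  }

  private static void copy(File src, File dst) throws IOException {
    InputStream in = new FileInputStream(src);
    OutputStream out = new FileOutputStream(dst);
    try {
      byte[] buf = new byte[8192];
      for (int n; (n = in.read(buf)) != -1; ) {
        out.write(buf, 0, n);
      }
    } finally {
      in.close();
      out.close();
    }
  }
}
{code}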

> Solr replication by Solr (for windows also)
> ---
>
> Key: SOLR-561
> URL: https://issues.apache.org/jira/browse/SOLR-561
> Project: Solr
>  Issue Type: New Feature
>  Components: replication
>Affects Versions: 1.4
> Environment: All
>Reporter: Noble Paul
>Assignee: Shalin Shekhar Mangar
> Fix For: 1.4
>
> Attachments: deletion_policy.patch, SOLR-561-core.patch, 
> SOLR-561-full.patch, SOLR-561-full.patch, SOLR-561-full.patch, 
> SOLR-561-full.patch, SOLR-561.patch, SOLR-561.patch, SOLR-561.patch, 
> SOLR-561.patch, SOLR-561.patch, SOLR-561.patch
>
>
> The current replication strategy in Solr involves shell scripts. The drawbacks of this approach are:
> * It does not work on Windows
> * Replication works as a separate piece, not integrated with Solr
> * Replication cannot be controlled from the Solr admin UI or via JMX
> * Each operation requires a manual telnet to the host
> Doing the replication in Java has the following advantages:
> * Platform independence
> * Manual steps can be completely eliminated; everything can be driven from solrconfig.xml
> ** Adding the URL of the master to the slaves should be enough to enable replication (see the illustrative config sketch after this description). Other settings, such as the frequency of snapshoot/snappull, can also be configured; all other information can be obtained automatically
> * Start/stop can be triggered from solr/admin or JMX
> * Status/progress can be monitored while replication is going on, and an ongoing replication can be aborted
> * No need to have a login on the machine
> * From a development perspective, we can unit test it
> This issue can track the implementation of Solr replication in Java.
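
As mentioned in the description above, the intent is that a slave only needs to be pointed at its master in solrconfig.xml. The snippet below is purely an illustration of that idea; the handler name and the parameter names (masterUrl, pollInterval) are assumptions, not the syntax of the attached patches.

{code:xml}
<!-- Hypothetical slave-side configuration; names are placeholders for illustration. -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://master-host:8983/solr/replication</str>
    <!-- how often to poll the master for a newer index version -->
    <str name="pollInterval">00:00:60</str>
  </lst>
</requestHandler>
{code}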




[jira] Commented: (SOLR-799) Add support for hash based exact/near duplicate document handling

2008-10-14 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12639470#action_12639470
 ] 

Yonik Seeley commented on SOLR-799:
---

"overwriting" is implemented and supported in Lucene now (and we gain a number 
of benefits from using that).  Conditionally adding a document, or testing if a 
document already exists, is not supported.

Since we can't currently determine if something is a duplicate, it seems like 
this issue should go ahead with just a single option: whether to remove older 
documents with the same signature or not.
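
A minimal client-side illustration of the single option described above (remove older documents carrying the same signature before adding the new one), using plain SolrJ calls and an MD5 hash as a stand-in signature. The field names 'sig', 'id' and 'text' and the hashing choice are assumptions; the patch on this issue does the work server-side in an update processor.

{code:java}
import java.math.BigInteger;
import java.security.MessageDigest;

import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class DedupBySignature {
  public static void main(String[] args) throws Exception {
    CommonsHttpSolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

    String body = "some document text used for the signature";
    String sig = md5(body);

    // "Remove older documents with the same signature": delete by the signature
    // field, then add the new document carrying that signature.
    server.deleteByQuery("sig:" + sig);

    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "doc-1");
    doc.addField("text", body);
    doc.addField("sig", sig);       // assumed signature field in the schema
    server.add(doc);
    server.commit();
  }

  private static String md5(String s) throws Exception {
    byte[] digest = MessageDigest.getInstance("MD5").digest(s.getBytes("UTF-8"));
    return new BigInteger(1, digest).toString(16);
  }
}
{code}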



> Add support for hash based exact/near duplicate document handling
> -
>
> Key: SOLR-799
> URL: https://issues.apache.org/jira/browse/SOLR-799
> Project: Solr
>  Issue Type: New Feature
>  Components: update
>Reporter: Mark Miller
>Priority: Minor
> Attachments: SOLR-799.patch
>
>
> Hash-based duplicate document detection is efficient and allows for blocking as well as field collapsing. Let's put it into Solr:
> http://wiki.apache.org/solr/Deduplication




[jira] Commented: (SOLR-799) Add support for hash based exact/near duplicate document handling

2008-10-14 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12639474#action_12639474
 ] 

Yonik Seeley commented on SOLR-799:
---

Otis: this issue only handles the index side of things. The signature-generating class is pluggable. Is there anything else needed on the indexing side?
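
Since the signature-generating class is pluggable, a plug-in point could look roughly like the interface sketched below; the names and methods here are invented for illustration and are not the API of the attached patch.

{code:java}
// Hypothetical plug-in point for signature generation; names are invented.
interface SignatureGenerator {

  /** Feed one field value into the signature computation. */
  void add(String fieldValue);

  /** Return the accumulated signature, e.g. a hex-encoded hash. */
  String getSignature();
}

/** Exact-duplicate signature: an MD5 over the concatenated field values. */
class Md5SignatureGenerator implements SignatureGenerator {
  private final StringBuilder buf = new StringBuilder();

  public void add(String fieldValue) {
    buf.append(fieldValue).append('\u0000');   // separator between field values
  }

  public String getSignature() {
    try {
      byte[] d = java.security.MessageDigest.getInstance("MD5")
          .digest(buf.toString().getBytes("UTF-8"));
      return new java.math.BigInteger(1, d).toString(16);
    } catch (Exception e) {
      throw new RuntimeException(e);
    }
  }
}
{code}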





[jira] Created: (SOLR-807) UUIDField type cannot be recognized when wt=javabin is used

2008-10-14 Thread Koji Sekiguchi (JIRA)
UUIDField type cannot be recognized when wt=javabin is used
---

 Key: SOLR-807
 URL: https://issues.apache.org/jira/browse/SOLR-807
 Project: Solr
  Issue Type: Bug
  Components: clients - java, search
Affects Versions: 1.3
Reporter: Koji Sekiguchi
Priority: Trivial
 Fix For: 1.3.1


I'm using UUIDField via SolrJ in my project. When I use javabin (the default), I get:

*java.util.UUID:* 391e3214-4f8e-4abd-aa6b-4f12be79534f

as the UUID value, but if I use XML, I get:

391e3214-4f8e-4abd-aa6b-4f12be79534f

I think both of them should return the same string.

Program to reproduce the problem:
{code:java}
  public static void main(String[] args) throws Exception {
    CommonsHttpSolrServer server = new CommonsHttpSolrServer( "http://localhost:8983/solr" );
    SolrQuery query = new SolrQuery().setQuery( "*:*" );
    //server.setParser( new XMLResponseParser() );   // uncomment for wt=xml
    System.out.println( "= " + server.getParser().getClass().getSimpleName() + " =" );
    QueryResponse rsp = server.query( query );
    SolrDocumentList docs = rsp.getResults();
    for( SolrDocument doc : docs ){
      Object id = doc.getFieldValue( "id" );
      System.out.println( "type = " + id.getClass().getName() + ", id = " + id );
      Object timestamp = doc.getFieldValue( "timestamp" );
      System.out.println( "type = " + timestamp.getClass().getName() + ", timestamp = " + timestamp );
    }
  }
{code}

result for wt=javabin
{code:title=javabin}
= BinaryResponseParser =
type = java.lang.String, id = java.util.UUID:391e3214-4f8e-4abd-aa6b-4f12be79534f
type = java.util.Date, timestamp = Wed Oct 15 00:20:50 JST 2008
{code}

result for wt=xml
{code:title=xml}
= XMLResponseParser =
type = java.lang.String, id = 391e3214-4f8e-4abd-aa6b-4f12be79534f
type = java.util.Date, timestamp = Wed Oct 15 00:20:50 JST 2008
{code}





[jira] Commented: (SOLR-561) Solr replication by Solr (for windows also)

2008-10-14 Thread Noble Paul (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12639518#action_12639518
 ] 

Noble Paul commented on SOLR-561:
-

If the files are not part of any IndexCommit (which is true if the segments_N file didn't get downloaded), will it still clean them up? And when Solr restarts, ReplicationHandler will have difficulty cleaning up those files if replication kicks off before Lucene cleans them up (if it actually does that).







[jira] Commented: (SOLR-799) Add support for hash based exact/near duplicate document handling

2008-10-14 Thread Otis Gospodnetic (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12639479#action_12639479
 ] 

Otis Gospodnetic commented on SOLR-799:
---

Thanks Yonik. Good thing I asked for the clarification, since Mark's issue description does mention search-time stuff (field collapsing).

Mark: do you still plan on tackling search-time duplicate/near-duplicate/similar doc detection? In a separate issue? Thanks.






[jira] Commented: (SOLR-799) Add support for hash based exact/near duplicate document handling

2008-10-14 Thread Otis Gospodnetic (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12639456#action_12639456
 ] 

Otis Gospodnetic commented on SOLR-799:
---

I haven't looked at the patch yet. I have looked at the Deduplication wiki page (and realize the stuff I'll write below is briefly mentioned there) and skimmed the above comments.

I want to bring up a use case that seems to have been mentioned already, but only in passing. The focus of the previous comments seems to be on index-time duplicate detection. Another huge use case is search-time near-duplicate detection. Sometimes it's about straightforward field collapsing (collapsing adjacent docs with identical values in some field), but sometimes it's more complicated.

For example, sometimes multiple fields need to be compared. Sometimes they have to be identical for collapsing to happen. Sometimes they only need to be "similar". How similarity is calculated is very application-dependent. I believe this similarity computation has to be completely open/extensible/overridable, allowing one to write a custom search component, examine the returned hits, and compare them using app-specific similarity.

Ideally, one would have the option of not having to save the document/field at index time (and instead examining it at search time), since doing it at index time prevents one from experimenting with and dynamically changing the similarity computation.

Here is one example.  Imagine a field called "IDs" that can have 1 or more 
tokens in it and imagine docs with the following "IDs" get returned:

1) id:aaa
2) id:bbb
3) id:ccc ddd
4) id:aaa bbb
5) id:eee ddd
6) id:aaa

A custom similarity may look at all of the above (e.g. a page's worth of hits) 
and decide that:
1) and 4) are similar
2) and 4) are also similar
3) and 5) are similar
1) and 4) and 6) are similar

Another custom similarity may say that only 1) and 6) are similar because they 
are identical.

My point is really that we have to leave it up to the application to provide the similarity implementation, just as we make it possible for the app to provide a custom Lucene Similarity.

Is the goal of this issue to make this possible?
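
To make the "IDs" example above concrete, here is a small sketch of the kind of app-specific, post-search comparison being described: one rule treats two hits as similar if their IDs fields share any token, a stricter rule only if the token sets are identical. This is plain illustration code, not a Solr or Lucene API.

{code:java}
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class IdsSimilarityDemo {

  /** Similar if the two IDs fields share at least one token: 1) & 4), 2) & 4), 3) & 5), 1) & 6). */
  static boolean overlapSimilar(String idsA, String idsB) {
    Set<String> shared = tokens(idsA);
    shared.retainAll(tokens(idsB));
    return !shared.isEmpty();
  }

  /** Stricter rule: only identical token sets are similar, so just 1) & 6). */
  static boolean exactSimilar(String idsA, String idsB) {
    return tokens(idsA).equals(tokens(idsB));
  }

  private static Set<String> tokens(String ids) {
    return new HashSet<String>(Arrays.asList(ids.split("\\s+")));
  }

  public static void main(String[] args) {
    String[] hits = { "aaa", "bbb", "ccc ddd", "aaa bbb", "eee ddd", "aaa" };
    System.out.println(overlapSimilar(hits[0], hits[3]));  // 1) vs 4) -> true
    System.out.println(overlapSimilar(hits[2], hits[4]));  // 3) vs 5) -> true
    System.out.println(exactSimilar(hits[0], hits[5]));    // 1) vs 6) -> true
    System.out.println(exactSimilar(hits[0], hits[3]));    // 1) vs 4) -> false
  }
}
{code}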






[jira] Commented: (SOLR-561) Solr replication by Solr (for windows also)

2008-10-14 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12639493#action_12639493
 ] 

Yonik Seeley commented on SOLR-561:
---

bq. If Solr crashes while downloading that will leave unnecessary/incomplete 
files in the index directory.

If we don't want to try and pick up from where we left off, it seems like 
Lucene's deletion policy can clean up old index files that are unreferenced.







[jira] Commented: (SOLR-561) Solr replication by Solr (for windows also)

2008-10-14 Thread Noble Paul (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12639465#action_12639465
 ] 

Noble Paul commented on SOLR-561:
-

bq. Why are files downloaded to a temp directory first?
If Solr crashes while downloading, that would leave unnecessary/incomplete files in the index directory, and we did not want the index directory to be polluted. The files are 'moved' to the index directory after they are downloaded.

The segments_N file is copied last, from the temp directory to the index directory.
(OK, that patch is coming ;) )





[jira] Commented: (SOLR-807) UUIDField type cannot be recognized when wt=javabin is used

2008-10-14 Thread Noble Paul (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12639716#action_12639716
 ] 

Noble Paul commented on SOLR-807:
-

Is java.util.UUID a supported type in Lucene?

If not, then let us follow what the XML format is doing.





[jira] Created: (SOLR-808) javabin format should write map keys as EXTERN_STRING

2008-10-14 Thread Noble Paul (JIRA)
javabin format should write map keys as EXTERN_STRING
-

 Key: SOLR-808
 URL: https://issues.apache.org/jira/browse/SOLR-808
 Project: Solr
  Issue Type: Improvement
Reporter: Noble Paul
Priority: Minor


Just as NamedList keys can be externalized, Map keys can also be externalized, and this is backward compatible.

Maps are not used very commonly in Solr, but SOLR-561 uses maps for master-slave communication.
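
The externalized-key idea amounts to a per-response string dictionary: the first time a key is written it goes out as a full string and is assigned an index, and every later occurrence is written as just that index. The toy sketch below illustrates only the bookkeeping, not the actual javabin wire format.

{code:java}
import java.util.HashMap;
import java.util.Map;

/** Toy illustration of EXTERN_STRING-style key writing (not the javabin wire format). */
public class ExternStringDemo {
  private final Map<String, Integer> known = new HashMap<String, Integer>();

  /** First occurrence: emit the string and remember its index; afterwards: emit the index only. */
  String writeExternString(String key) {
    Integer idx = known.get(key);
    if (idx == null) {
      known.put(key, known.size() + 1);
      return "STR(" + key + ")";      // the full string crosses the wire once
    }
    return "REF(" + idx + ")";        // repeated map/NamedList keys become small references
  }

  public static void main(String[] args) {
    ExternStringDemo codec = new ExternStringDemo();
    for (String key : new String[] { "name", "size", "name", "size", "name" }) {
      System.out.print(codec.writeExternString(key) + " ");
    }
    // prints: STR(name) STR(size) REF(1) REF(2) REF(1)
  }
}
{code}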




[jira] Updated: (SOLR-808) javabin format should write map keys as EXTERN_STRING

2008-10-14 Thread Noble Paul (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Noble Paul updated SOLR-808:


Attachment: SOLR-808.patch





[jira] Commented: (SOLR-807) UUIDField type cannot be recognized when wt=javabin is used

2008-10-14 Thread Koji Sekiguchi (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12639722#action_12639722
 ] 

Koji Sekiguchi commented on SOLR-807:
-

bq. is java.util.UUID a supported type in Lucene?

No. Date is also NOT a supported type in Lucene, but it seems that Date is 
recognized in BinaryResponseWriter...





[jira] Commented: (SOLR-807) UUIDField type cannot be recognized when wt=javabin is used

2008-10-14 Thread Noble Paul (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12639732#action_12639732
 ] 

Noble Paul commented on SOLR-807:
-

I guess there is an {{org.apache.solr.schema.DateField}} which produces the {{java.util.Date}} type.

All other types are written out as
{code}
val.getClass().getName() + ':' + val.toString()
{code}

Now I realize that there is an {{org.apache.solr.schema.UUIDField}} as well.

Should we make an exception for UUID, or just change the code to omit the {{val.getClass().getName()}} prefix? I guess the latter is better.
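
A sketch of the two fallbacks being weighed here, using a made-up helper rather than the real BinaryResponseWriter/JavaBinCodec code: the current behavior prefixes an unknown field type with its class name, while the proposed change writes only toString(), which would make javabin match the XML output for UUIDField.

{code:java}
import java.util.UUID;

public class UnknownTypeFallbackDemo {

  /** Current fallback for types the codec does not know (as quoted above). */
  static String currentFallback(Object val) {
    return val.getClass().getName() + ':' + val.toString();
  }

  /** Proposed fallback: drop the class-name prefix and write only the string form. */
  static String proposedFallback(Object val) {
    return val.toString();
  }

  public static void main(String[] args) {
    UUID uuid = UUID.fromString("391e3214-4f8e-4abd-aa6b-4f12be79534f");
    System.out.println(currentFallback(uuid));   // java.util.UUID:391e3214-... (what javabin returns today)
    System.out.println(proposedFallback(uuid));  // 391e3214-...                (what the XML writer returns)
  }
}
{code}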




