[jira] Updated: (SOLR-1023) StatsComponent should support dates (and other non-numeric fields)

2009-08-17 Thread Chris Male (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Male updated SOLR-1023:
-

Attachment: SOLR-1023.patch

I have attached a patch that adds support for String and Date fields.  To 
support these I have also made some improvements in the underlying architecture 
so that it is more extensible.  It is now possible to easily add statistics for 
other field types if desired in the future.

I have also updated the test class to include tests for String and Date fields.
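
For context, here is a minimal sketch of the kind of abstraction such
extensibility suggests; the interface and class names are illustrative, not the
actual classes in the patch:

// Illustrative only: the names are assumptions, not the types in SOLR-1023.patch.
interface StatsCollector<T extends Comparable<T>> {
  void accumulate(T value);
  T getMin();
  T getMax();
}

class MinMaxStatsCollector<T extends Comparable<T>> implements StatsCollector<T> {
  private T min, max;

  public void accumulate(T value) {
    if (min == null || value.compareTo(min) < 0) min = value;
    if (max == null || value.compareTo(max) > 0) max = value;
  }
  public T getMin() { return min; }
  public T getMax() { return max; }
}

Anything Comparable (Date, String, numbers) can then share the same min/max
logic, which is what makes adding further field types cheap.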

 StatsComponent should support dates (and other non-numeric fields)
 --

 Key: SOLR-1023
 URL: https://issues.apache.org/jira/browse/SOLR-1023
 Project: Solr
  Issue Type: New Feature
Affects Versions: 1.4
 Environment: Mac OS 10.5, java version 1.5.0_16
Reporter: Peter Wolanin
 Fix For: 1.5

 Attachments: SOLR-1023.patch


 Currently, the StatsComponent only supports single-valued numeric fields:
 http://wiki.apache.org/solr/StatsComponent
 Trying to use it with a date field I get an exception like:
 java.lang.NumberFormatException: For input string: "2009-01-27T20:04:04Z"
 Trying to use it with a string I get an error 400 "Stats are valid for
 single valued numeric values".
 For constructing date facets it would be very useful to be able to get the 
 minimum and maximum date from a DateField within a set of documents.  In 
 general, it could be useful to get the minimum and maximum from any field 
 type that can be compared, though that's of less importance.
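
Once date support like the above is in place, a stats request over a date field
might look like this from SolrJ; the field name "manufacturedate_dt" and the
typed min/max are assumptions for illustration, assuming the 1.4-era client API:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.FieldStatsInfo;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.params.StatsParams;

// Hypothetical usage: stats=true&stats.field=manufacturedate_dt
CommonsHttpSolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
SolrQuery query = new SolrQuery("*:*");
query.set(StatsParams.STATS, true);
query.set(StatsParams.STATS_FIELD, "manufacturedate_dt");
QueryResponse rsp = server.query(query);
FieldStatsInfo stats = rsp.getFieldStatsInfo().get("manufacturedate_dt");
System.out.println("min=" + stats.getMin() + " max=" + stats.getMax());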

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Solr nightly build failure

2009-08-17 Thread solr-dev

init-forrest-entities:
[mkdir] Created dir: /tmp/apache-solr-nightly/build
[mkdir] Created dir: /tmp/apache-solr-nightly/build/web

compile-solrj:
[mkdir] Created dir: /tmp/apache-solr-nightly/build/solrj
[javac] Compiling 84 source files to /tmp/apache-solr-nightly/build/solrj
[javac] Note: Some input files use or override a deprecated API.
[javac] Note: Recompile with -Xlint:deprecation for details.
[javac] Note: Some input files use unchecked or unsafe operations.
[javac] Note: Recompile with -Xlint:unchecked for details.

compile:
[mkdir] Created dir: /tmp/apache-solr-nightly/build/solr
[javac] Compiling 371 source files to /tmp/apache-solr-nightly/build/solr
[javac] Note: Some input files use or override a deprecated API.
[javac] Note: Recompile with -Xlint:deprecation for details.
[javac] Note: Some input files use unchecked or unsafe operations.
[javac] Note: Recompile with -Xlint:unchecked for details.

compileTests:
[mkdir] Created dir: /tmp/apache-solr-nightly/build/tests
[javac] Compiling 165 source files to /tmp/apache-solr-nightly/build/tests
[javac] Note: Some input files use or override a deprecated API.
[javac] Note: Recompile with -Xlint:deprecation for details.
[javac] Note: Some input files use unchecked or unsafe operations.
[javac] Note: Recompile with -Xlint:unchecked for details.

junit:
[mkdir] Created dir: /tmp/apache-solr-nightly/build/test-results
[junit] Running org.apache.solr.BasicFunctionalityTest
[junit] Tests run: 19, Failures: 0, Errors: 0, Time elapsed: 46.172 sec
[junit] Running org.apache.solr.ConvertedLegacyTest
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 26.031 sec
[junit] Running org.apache.solr.DisMaxRequestHandlerTest
[junit] Tests run: 3, Failures: 0, Errors: 0, Time elapsed: 21.136 sec
[junit] Running org.apache.solr.EchoParamsTest
[junit] Tests run: 4, Failures: 0, Errors: 0, Time elapsed: 7.341 sec
[junit] Running org.apache.solr.OutputWriterTest
[junit] Tests run: 3, Failures: 0, Errors: 0, Time elapsed: 5.074 sec
[junit] Running org.apache.solr.SampleTest
[junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 5.093 sec
[junit] Running org.apache.solr.SolrInfoMBeanTest
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 2.234 sec
[junit] Running org.apache.solr.TestDistributedSearch
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 108.369 sec
[junit] Running org.apache.solr.TestTrie
[junit] Tests run: 7, Failures: 0, Errors: 0, Time elapsed: 16.498 sec
[junit] Running org.apache.solr.analysis.DoubleMetaphoneFilterFactoryTest
[junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 1.856 sec
[junit] Running org.apache.solr.analysis.DoubleMetaphoneFilterTest
[junit] Tests run: 6, Failures: 0, Errors: 0, Time elapsed: 1.213 sec
[junit] Running org.apache.solr.analysis.EnglishPorterFilterFactoryTest
[junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 4.403 sec
[junit] Running org.apache.solr.analysis.HTMLStripCharFilterTest
[junit] Tests run: 9, Failures: 0, Errors: 0, Time elapsed: 2.277 sec
[junit] Running org.apache.solr.analysis.LengthFilterTest
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 5.931 sec
[junit] Running org.apache.solr.analysis.SnowballPorterFilterFactoryTest
[junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 2.285 sec
[junit] Running org.apache.solr.analysis.TestBufferedTokenStream
[junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 5.442 sec
[junit] Running org.apache.solr.analysis.TestCapitalizationFilter
[junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 4.327 sec
[junit] Running org.apache.solr.analysis.TestDelimitedPayloadTokenFilterFactory
[junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 13.251 sec
[junit] Running org.apache.solr.analysis.TestHyphenatedWordsFilter
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 1.633 sec
[junit] Running org.apache.solr.analysis.TestKeepFilterFactory
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 5.842 sec
[junit] Running org.apache.solr.analysis.TestKeepWordFilter
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 2.733 sec
[junit] Running org.apache.solr.analysis.TestMappingCharFilterFactory
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.803 sec
[junit] Running org.apache.solr.analysis.TestPatternReplaceFilter
[junit] Tests run: 5, Failures: 0, Errors: 0, Time elapsed: 4.89 sec
[junit] Running org.apache.solr.analysis.TestPatternTokenizerFactory
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 1.951 sec
[junit] Running org.apache.solr.analysis.TestPhoneticFilter
[junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 2.734 sec
[junit] Running 

Indexing Categorical fields - newbie

2009-08-17 Thread nostromo

Hi all,

Hoping you guys can help as I am really new to this :-(

I am trying to import documents using the csv handler.  The process itself
works well but I get odd results when I try to search my index.

The problem is with a field which contains keywords, BUT the keywords could
contain spaces.  An example: say I have a document describing a laptop;
its categorization might need to be:
Office Equipment
Hardware
Laptop

I have thus far been unable to search for category:Office Equipment

category is the name of the field in the schema.  Searching for Hardware or
Laptop with the same query syntax will return the document.

I am guessing it is the way I define the index and query analyzers, but
could someone please give me some pointers on which I should use in this
case?

Many Thanks,
D
-- 
View this message in context: 
http://www.nabble.com/Indexing-%22Categorical-fields%22---newbie-tp25007361p25007361.html
Sent from the Solr - Dev mailing list archive at Nabble.com.


Re: Indexing Categorical fields - newbie

2009-08-17 Thread nostromo

Sorry, meant to add that the category field will also be one of my faceting
fields, which is why the full phrase is important.





Re: Indexing Categorical fields - newbie

2009-08-17 Thread nostromo

OK, feel stupid now.

Query should have been category:"Office Equipment" (the phrase needs quotes),
which worked!

Thanks,
D
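
For anyone hitting the same thing: the multi-word value has to be quoted so the
query parser treats it as a single phrase instead of category:Office plus a
default-field search for Equipment. A quick SolrJ sketch, assuming the 1.4-era
client API:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

// The quotes make "Office Equipment" one phrase query against the category field.
CommonsHttpSolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
SolrQuery q = new SolrQuery("category:\"Office Equipment\"");
QueryResponse rsp = server.query(q);
System.out.println(rsp.getResults().getNumFound());

Note that for faceting on the full phrase, the category field should also be a
non-tokenized type (e.g. solr.StrField) so that "Office Equipment" is indexed
as a single term.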





[jira] Commented: (SOLR-633) QParser for use with user-entered query which recognizes subphrases as well as allowing some other customizations on per field basis

2009-08-17 Thread Preetam Rao (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12744117#action_12744117
 ] 

Preetam Rao commented on SOLR-633:
--

Hi, sorry for such a delay.

Let me take an example of a real estate site that I tried to implement free
text search on, using the dismax query.

Also, when I say sub-phrase, I mean adjacent terms appearing in a bigger phrase.

The index has the fields below, with an example record;
let's say there are about 4 million records.

city - "New York"
state - "NY"
beds (multi-valued or synonyms) - "3 beds", "beds 3"
baths (multi-valued or synonyms) - "4 baths", "baths 4"
description - "newly built with swimming pool, new furniture, car parking", etc.
sales type - "new home"

Let's say the user enters a query like "homes in new york for price 400k with 3
beds 4 baths with swimming pool car parking".

I played with dismax for a few days trying out various boosts and factors. The
phrase options of dismax are not very useful because they expect all terms of
the phrase to appear in a given field (that's what it appeared like). Words like
"new" appearing in the description field multiple times, or cities like "york",
seemed to cause some variations.

The nature of the problem here is that sub-phrases like "new york", "3 beds",
"price 400k", "car parking" become very important and must be matched in
different fields without overlapping across fields.

This can be best solved by a SubPhraseQuery which is used by a DisMax-like 
query to combine multiple fields.

Hence, this is what I proposed:

SubPhraseQuery:
- Scores based on the longest sub-phrases matched. Also gives a factor to boost
based on match length; for example, a 4-word match gets a score of 16 vs. a
3-word match getting 9.
- Gives an option to score only one match per field. For example, a term "new
home" gets scored only once even if it occurs N times in the description field.
- Option to score only the longest match. For example, an occurrence of
"swimming pool" and some other "pool" scores only "swimming pool".
- As usual, the ability to ignore IDF, norms and any other factors, and just
use the phrase match.

And a DisMax-like query that uses the above (a rough scoring sketch follows
below):
- Each field can be configured with the above query.
- Options to ignore matches in other fields when some match.
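
A rough Java sketch of the proposed per-field scoring, as described in the
bullets above; the method and its inputs are hypothetical, not committed code:

import java.util.List;

// Each sub-phrase match contributes length^2 (so a 4-word match scores 16
// vs. 9 for a 3-word match); maxScoreOnly keeps just the longest match.
static float subPhraseScore(List<Integer> matchLengths, boolean maxScoreOnly) {
  float sum = 0f, best = 0f;
  for (int len : matchLengths) {
    float s = (float) len * len;  // quadratic length boost
    best = Math.max(best, s);
    sum += s;
  }
  return maxScoreOnly ? best : sum;
}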

I feel these kinds of use cases will be encountered when form searches are 
migrated to free text search, since we are trying to use Solr's free text 
search on structured data where different fields have different meanings.

Probably dismax is meant for that use case. I spent a few days fine-tuning
dismax for the above use case. Just that, I felt like I had to play a lot with
various factors; it looked like a lot of trial and error, and still I was not
sure what the end results would look like. I felt that I needed some more
control over individual fields and how a match would be scored in those fields
on sub-phrases.

Let me know your thoughts or alternatives and I will be glad to look at them.


 QParser for use with user-entered query which recognizes subphrases as well 
 as allowing some other customizations on per field basis
 

 Key: SOLR-633
 URL: https://issues.apache.org/jira/browse/SOLR-633
 Project: Solr
  Issue Type: New Feature
  Components: search
Affects Versions: 1.4
 Environment: All
Reporter: Preetam Rao
Priority: Minor
 Fix For: 1.5


 Create a request handler (actually a QParser) for use with user-entered 
 queries, with the following features:
 a) Take a user query string and try to match it against multiple fields, 
 while recognizing sub-phrase matches.
 b) For each field, provide the below parameters:
    1) phraseBoost - the factor which decides how much better an n-token
 sub-phrase match is compared to an (n-1)-token sub-phrase match.
    2) maxScoreOnly - if there are multiple sub-phrase matches, pick only the 
 highest.
    3) ignoreDuplicates - if the same sub-phrase query matches multiple times, 
 pick only one.
    4) disableOtherScoreFactors - ignore tf, query norm, idf and any other 
 parameters which are not relevant.
 c) Try to provide all the parameters similar to dismax. Reuse or extend 
 dismax.  
 Other suggestions and feedback appreciated :-)




Re: good performance news

2009-08-17 Thread Grant Ingersoll


On Aug 16, 2009, at 3:46 PM, Yonik Seeley wrote:


I just profiled a CSV upload, and aside from the CSV parsing, Solr
adds pretty much no overhead!
I was expecting some non-trivial overhead due to Solr's
SolrInputDocument, update processing pipeline, and update handler...
but profiling showed that it amounted to less than 1%.

85% of the time was spent in Lucene's IndexWriter
12% of the time was spent in the CSV parser


I'm curious how much overhead there is in parsing Solr XML.  I will  
try some tests on that later if I get a chance.  We really should push  
clients to use the Binary request/response formats in most cases.
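
For SolrJ users that switch is simple; a sketch assuming the 1.4-era client API:

import org.apache.solr.client.solrj.impl.BinaryRequestWriter;
import org.apache.solr.client.solrj.impl.BinaryResponseParser;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

// Send updates and receive responses in the compact javabin format.
CommonsHttpSolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
server.setRequestWriter(new BinaryRequestWriter());
server.setParser(new BinaryResponseParser());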


Re: date functions and floats

2009-08-17 Thread Grant Ingersoll


On Aug 15, 2009, at 10:11 AM, Yonik Seeley wrote:


Now that we have date fields that internally store milliseconds (and
can currently be used in function queries) we have the basis for a
good replacement for using things like ord(date)... which is now a bad
idea since it causes the FieldCache to be instantiated at the highest
level reader... doubling the usage if it's also used for faceting or
sorting.

One issue though is that our float functions don't have enough
precision to deal with dates that well.
System.currentTimeMillis() currently contains 13 digits.  A float can
capture 7.x digits of precision.


Along these lines, does the DateField FieldType only allow you to store  
at millisecond precision?  I know w/ Trie you can encode other  
precision levels, but in some cases maybe all I want is the  
hour/day/year/whatever; it would be nice not to have to think about this on  
the client side.  Perhaps I am just missing something.  In other words, do  
we support Lucene's DateTools Resolution capabilities?




This means that our 10^-3 seconds precision on the raw date field is
only accurate to 10^3 seconds (~15 minutes) when converted to a float.

We could either:
- change function queries to use doubles internally - probably a good
idea for the future in general - seems like geo might need more
precision too.
- come up with a new date scale function that uses doubles internally?

-Yonik
http://www.lucidimagination.com
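
A quick way to see the precision loss described above (plain Java, with an
illustrative epoch value):

// Round-tripping epoch milliseconds through a float loses on the order of
// a minute; a double preserves them exactly.
long millis = 1250553600000L;            // an epoch time in Aug 2009
long viaFloat  = (long) (float) millis;  // float: ~7 significant decimal digits
long viaDouble = (long) (double) millis; // double: ~15-16 significant digits
System.out.println("float error:  " + (millis - viaFloat)  + " ms");
System.out.println("double error: " + (millis - viaDouble) + " ms"); // prints 0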




Response Writers and DocLists

2009-08-17 Thread Grant Ingersoll
I'm looking a little bit at https://issues.apache.org/jira/browse/SOLR-1298 
 and some of the other pseudo-field capabilities and am curious how  
the various Response Writers are handling writing out the Docs.  The  
XMLWriter seems to have a very different approach from the others when  
it comes to dealing with multi-valued fields (it sorts first, the  
others don't.)  Does anyone know the history here?


Also, I'm thinking about having a real simple interface that would  
allow for, when materializing the Fields, to pass in something like a  
DocumentModifier, which would basically get the document right before  
it is to be returned (possibly inside the SolrIndexReader, but maybe  
this even belongs at the Lucene level similar to the FieldSelector,  
although it is likely too late for 2.9.)  Through this DocModifier,  
one could easily add fields, etc.
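
Something like this, for instance; a hypothetical shape for the idea, since none
of these names exist in Solr today:

import java.io.IOException;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.search.SolrIndexSearcher;

// Hypothetical callback invoked with each document just before it is written
// to the response, with enough context to add computed pseudo-fields.
public interface DocumentModifier {
  void modify(int docid, SolrIndexSearcher searcher, SolrDocument doc)
      throws IOException;
}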


Part of what I think needs to be addressed here is that currently, in  
order to add fields, for instance, LocalSolr does this, one needs to  
iterate over the DocList (or SolrDocList) multiple times.   
SolrPluginUtils.docListtoSolrDocList attempts to help, but it still  
requires a double loop.  The tricky part here is that one often needs  
to have context when modifying the Document that the Response Writer's  
simply do not have, so you end up writing a SearchComponent to do it  
and thus iterating multiple times.


I know this is a bit stream of consciousness, but thought I would get it  
out there a little bit to see what others thought.


-Grant


[jira] Commented: (SOLR-788) MoreLikeThis should support distributed search

2009-08-17 Thread Mike Anderson (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12744233#action_12744233
 ] 

Mike Anderson commented on SOLR-788:


What release of Solr should one apply this patch to? 


(I tried an older build of 1.4 and got
patching file org/apache/solr/handler/MoreLikeThisHandler.java
patching file org/apache/solr/handler/component/MoreLikeThisComponent.java
Hunk #2 FAILED at 51.
1 out of 2 hunks FAILED -- saving rejects to file 
org/apache/solr/handler/component/MoreLikeThisComponent.java.rej
patching file org/apache/solr/handler/component/ShardRequest.java
)

 MoreLikeThis should support distributed search
 --

 Key: SOLR-788
 URL: https://issues.apache.org/jira/browse/SOLR-788
 Project: Solr
  Issue Type: Improvement
Reporter: Grant Ingersoll
Priority: Minor
 Attachments: MoreLikeThisComponentTest.patch, 
 SolrMoreLikeThisPatch.txt


 The MoreLikeThis component should support distributed processing.
 See SOLR-303.




Re: Response Writers and DocLists

2009-08-17 Thread Ryan McKinley

Ya, I like this idea.

Adding a meta field is OK, but it may just be kicking the can.  Also,  
implementation-wise, it works well when you have a SolrDocument, but  
when directly using DocList, it gets a bit messy.

https://issues.apache.org/jira/browse/SOLR-705

Also with adding a meta field, I'm not sure I like that it is a  
double object like:

 doc.get( "_meta_" ).get( "distance" )

It would be nicer if the user does not have any idea whether it is a  
pseudo-field or a real field.  (By "user" I mean how you consume the  
response, not how you construct the URL.)


The SQL "as" command comes to mind:
 SELECT name, count(xxx) as cnt

ryan







[jira] Created: (SOLR-1365) Add configurable Sweetspot Similarity factory

2009-08-17 Thread Kevin Osborn (JIRA)
Add configurable Sweetspot Similarity factory
-

 Key: SOLR-1365
 URL: https://issues.apache.org/jira/browse/SOLR-1365
 Project: Solr
  Issue Type: New Feature
Affects Versions: 1.3
Reporter: Kevin Osborn
Priority: Minor
 Fix For: 1.4


This is some code that I wrote a while back.

Normally, if you use SweetSpotSimilarity, you are going to make it do something 
useful by extending SweetSpotSimilarity. So, instead, I made a factory class 
and a configurable SweetSpotSimilarity. There are two classes. 
SweetSpotSimilarityFactory reads the parameters from schema.xml. It then 
creates an instance of VariableSweetSpotSimilarity, which is my custom 
SweetSpotSimilarity subclass. In addition to the standard functions, it also 
handles dynamic fields.

So, in schema.xml, you could have something like this:

<similarity class="org.apache.solr.schema.SweetSpotSimilarityFactory">
  <bool name="useHyperbolicTf">true</bool>

  <float name="hyperbolicTfFactorsMin">1.0</float>
  <float name="hyperbolicTfFactorsMax">1.5</float>
  <float name="hyperbolicTfFactorsBase">1.3</float>
  <float name="hyperbolicTfFactorsXOffset">2.0</float>

  <int name="lengthNormFactorsMin">1</int>
  <int name="lengthNormFactorsMax">1</int>
  <float name="lengthNormFactorsSteepness">0.5</float>

  <int name="lengthNormFactorsMin_description">2</int>
  <int name="lengthNormFactorsMax_description">9</int>
  <float name="lengthNormFactorsSteepness_description">0.2</float>

  <int name="lengthNormFactorsMin_supplierDescription_*">2</int>
  <int name="lengthNormFactorsMax_supplierDescription_*">7</int>
  <float name="lengthNormFactorsSteepness_supplierDescription_*">0.4</float>
</similarity>

So, now everything is in a config file instead of having to create your own 
subclass.
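
For illustration, a rough sketch (not the attached patch) of how such a factory
might be wired up; the per-field parameter fallback is simplified, dynamic-field
pattern matching is omitted, and the SweetSpotSimilarity setter signature is an
assumption:

import org.apache.lucene.misc.SweetSpotSimilarity;
import org.apache.lucene.search.Similarity;
import org.apache.solr.common.params.SolrParams;
import org.apache.solr.schema.SimilarityFactory;

public class SketchSweetSpotSimilarityFactory extends SimilarityFactory {
  @Override
  public Similarity getSimilarity() {
    final SolrParams p = getParams();
    return new SweetSpotSimilarity() {
      @Override
      public float lengthNorm(String field, int numTokens) {
        // Fall back from lengthNormFactors*_<field> to the unsuffixed defaults.
        setLengthNormFactors(  // assumed 2009-era (min, max, steepness) signature
            p.getInt("lengthNormFactorsMin_" + field,
                     p.getInt("lengthNormFactorsMin", 1)),
            p.getInt("lengthNormFactorsMax_" + field,
                     p.getInt("lengthNormFactorsMax", 1)),
            p.getFloat("lengthNormFactorsSteepness_" + field,
                       p.getFloat("lengthNormFactorsSteepness", 0.5f)));
        return super.lengthNorm(field, numTokens);
      }
    };
  }
}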




Re: Response Writers and DocLists

2009-08-17 Thread Erik Hatcher


On Aug 17, 2009, at 6:59 PM, Ryan McKinley wrote:
Also with adding a meta field, I'm not sure I like that it is a  
double object like:

doc.get( "_meta_" ).get( "distance" )


It'd be more like:  doc.getMeta().get("distance"), at least.  And  
doc.get("distance") could be made to fetch first from the main document and,  
if not found, search in the meta data.


It would be nicer if the user does not have any idea whether it is a  
pseudo-field or a real field.  (By "user" I mean how you consume the  
response, not how you construct the URL.)


I'm kinda ok with the direction this is heading, with the response  
document having a pluggable way to add fields.  My main reluctance  
is really from a Lucene-legacy way of thinking of the stored values  
from the actual Document object as all that should be allowed there.


Things get trickier as we want meta-metadata... like a title field,  
the title highlighted, and then some more-like-this results for each document,  
and allowing for namespaces or some kind of way to keep different values  
that may have the same key from colliding.



The SQL "as" command comes to mind:
SELECT name, count(xxx) as cnt


Hmmm, that's an idea.

  fl=title, highlighted(title) as highlighted_title, some_function(popularity) as scaled_popularity


Erik



[jira] Commented: (SOLR-1365) Add configurable Sweetspot Similarity factory

2009-08-17 Thread Erik Hatcher (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12744306#action_12744306
 ] 

Erik Hatcher commented on SOLR-1365:


Sweet!  :)

Very nice use of the SimilarityFactory capability.  

I took a brief look at the patch; the only feedback I have is that I believe 
the dynamic field handling might be able to leverage some of Solr's 
built-in logic in IndexSchema.  But how can a SimilarityFactory get access to 
that?  Hmmm?

 Add configurable Sweetspot Similarity factory
 -

 Key: SOLR-1365
 URL: https://issues.apache.org/jira/browse/SOLR-1365
 Project: Solr
  Issue Type: New Feature
Affects Versions: 1.3
Reporter: Kevin Osborn
Priority: Minor
 Fix For: 1.4

 Attachments: SOLR-1365.patch






[jira] Commented: (SOLR-1365) Add configurable Sweetspot Similarity factory

2009-08-17 Thread Erik Hatcher (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12744309#action_12744309
 ] 

Erik Hatcher commented on SOLR-1365:


bq. I took a brief look at the patch, the only feedback I have is that I 
believe that the dynamic field handling might be able to leverage some of 
Solr's built-in logic in IndexSchema. But how can a SimilarityFactory get 
access to that? Hmmm?

Why, by implementing SolrCoreAware, of course.
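
That is, something along these lines (a sketch; whether Solr's plugin loader
will actually call inform() on a SimilarityFactory is exactly the question
raised in the follow-up below):

import org.apache.lucene.search.DefaultSimilarity;
import org.apache.lucene.search.Similarity;
import org.apache.solr.core.SolrCore;
import org.apache.solr.schema.IndexSchema;
import org.apache.solr.schema.SimilarityFactory;
import org.apache.solr.util.plugin.SolrCoreAware;

public class SchemaAwareSimilarityFactory extends SimilarityFactory
    implements SolrCoreAware {
  private IndexSchema schema;

  public void inform(SolrCore core) {
    schema = core.getSchema();  // e.g. to call schema.isDynamicField(name)
  }

  @Override
  public Similarity getSimilarity() {
    return new DefaultSimilarity();  // placeholder; real logic would use schema
  }
}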

 Add configurable Sweetspot Similarity factory
 -

 Key: SOLR-1365
 URL: https://issues.apache.org/jira/browse/SOLR-1365
 Project: Solr
  Issue Type: New Feature
Affects Versions: 1.3
Reporter: Kevin Osborn
Priority: Minor
 Fix For: 1.4

 Attachments: SOLR-1365.patch






[jira] Commented: (SOLR-1143) Return partial results when a connection to a shard is refused

2009-08-17 Thread Artem Russakovskii (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12744327#action_12744327
 ] 

Artem Russakovskii commented on SOLR-1143:
--

Any idea when this will be approved for pushing into trunk?

 Return partial results when a connection to a shard is refused
 --

 Key: SOLR-1143
 URL: https://issues.apache.org/jira/browse/SOLR-1143
 Project: Solr
  Issue Type: Improvement
  Components: search
Reporter: Nicolas Dessaigne
 Fix For: 1.4

 Attachments: SOLR-1143-2.patch, SOLR-1143.patch


 If any shard is down in a distributed search, a ConnectException is thrown.
 Here's a little patch that changes this behaviour: if we can't connect to a 
 shard (ConnectException), we get partial results from the active shards. As 
 with the TimeOut parameter (https://issues.apache.org/jira/browse/SOLR-502), we 
 set the parameter partialResults to true.
 This patch also addresses a problem expressed on the mailing list about a year 
 ago 
 (http://www.nabble.com/partialResults,-distributed-search---SOLR-502-td19002610.html).
 We have a use case that needs this behaviour, and we would like to know your 
 thoughts about it. Should it be the default behaviour for distributed search?




Re: Response Writers and DocLists

2009-08-17 Thread Yonik Seeley
On Mon, Aug 17, 2009 at 6:00 PM, Grant Ingersoll gsing...@apache.org wrote:
 I'm looking a little bit at
 https://issues.apache.org/jira/browse/SOLR-1298 and some of the other
 pseudo-field capabilities and am curious how the various Response Writers
 are handling writing out the Docs.  The XMLWriter seems to have a very
 different approach from the others when it comes to dealing with
 multi-valued fields (it sorts first, the others don't.)  Does anyone know
 the history here?

The first version of Solr didn't know whether fields were multiValued or not.
 The Lucene Document does not aggregate multiple values for the same
field.  Sorting was used to group the fields and detect if there were
multiple values for any of them.

 Also, I'm thinking about having a real simple interface that would allow
 for, when materializing the Fields, to pass in something like a
 DocumentModifier, which would basically get the document right before it is
 to be returned (possibly inside the SolrIndexReader, but maybe this even
 belongs at the Lucene level similar to the FieldSelector, although it is
 likely too late for 2.9.)  Through this DocModifier, one could easily add
 fields, etc.

Too high level for Lucene I think, and nothing is currently needed for
Lucene - a user calls doc() to get the document and then they can
modify or add fields however they want.

An interface could be useful for Solr... but getting 1.4 out the door
is top priority.

-Yonik
http://www.lucidimagination.com


CharFilter, analysis.jsp

2009-08-17 Thread Erik Hatcher

I'm interested in using a CharFilter, something like this:

<fieldType name="html_text" class="solr.TextField">
  <analyzer>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>

In hopes of being able to put in a value like
<html><body>whatever</body></html> and have "whatever" come back out.  In  
analysis.jsp, I see that happening in the verbose output but it doesn't make  
it to the tokenizer input - the original string makes it there.


I must be misunderstanding something about CharFilters and how to use  
them in Solr.  HTMLStripWhitespaceTokenizerFactory is deprecated in  
favor of the above design, I think, but does what I'm after.


Solr only seems to use CharFilters in analysis.jsp.  Is that  
correct?  Shouldn't they be factored into the analyzer for each  
field?  (like in FieldAnalysisRequestHandler)


Thanks,
Erik



Re: CharFilter, analysis.jsp

2009-08-17 Thread Yonik Seeley
I broke it with reusable token streams.  Just checked in a fix - can
you try now?

-Yonik
http://www.lucidimagination.com






Re: CharFilter, analysis.jsp

2009-08-17 Thread Yonik Seeley
On Mon, Aug 17, 2009 at 11:03 PM, Erik Hatcher erik.hatc...@gmail.com wrote:
 That fixes it with analysis.jsp, but not with FieldAnalysisRequestHandler I
 don't think.  Using that field definition below, and this request -

 http://localhost:8983/solr/analysis/field?analysis.fieldtype=html_text&analysis.fieldvalue=%3Chtml%3E%3Cbody%3Ewhatever%3C/body%3E%3C/html%3E

 I still see <str name="text"><html><body>whatever</body></html></str> come
 out of WhitespaceTokenizer.

 Does the consumer of an Analyzer from a FieldType have to do anything
 special to utilize CharFilter's?  Or it should all just work?

Normal users of the Analyzer should see it just work - but
FieldAnalysisRequestHandler doesn't use the Analyzer... it pulls it
apart and uses the parts separately.  It would be up to that code to
apply any char filters, and apparently it doesn't.

-Yonik
http://www.lucidimagination.com
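
For reference, roughly what that code would have to do (a sketch assuming the
Lucene 2.9 / Solr 1.4 era CharStream APIs; charFilterFactories and
tokenizerFactory stand in for whatever the handler pulls off the field type):

import java.io.Reader;
import java.io.StringReader;
import org.apache.lucene.analysis.CharReader;
import org.apache.lucene.analysis.CharStream;
import org.apache.lucene.analysis.TokenStream;

// Thread the raw input through each CharFilter before tokenizing, instead
// of handing the raw Reader straight to the tokenizer.
Reader input = new StringReader("<html><body>whatever</body></html>");
CharStream stream = CharReader.get(input);
for (CharFilterFactory charFilterFactory : charFilterFactories) {
  stream = charFilterFactory.create(stream);
}
TokenStream tokens = tokenizerFactory.create(stream);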


[jira] Commented: (SOLR-1365) Add configurable Sweetspot Similarity factory

2009-08-17 Thread Kevin Osborn (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12744376#action_12744376
 ] 

Kevin Osborn commented on SOLR-1365:


Thanks for the feedback. I looked at IndexSchema. It seems like the only useful 
function in my case is using isDynamicField vs. seeing if the field name ends 
with a "*".

But is SimilarityFactory allowed to implement SolrCoreAware? I'm not too 
familiar with this interface, but my initial research shows that only 
SolrRequestHandler, QueryResponseWriter, SearchComponent, and 
UpdateRequestProcessorFactory may implement SolrCoreAware. Is this correct?

 Add configurable Sweetspot Similarity factory
 -

 Key: SOLR-1365
 URL: https://issues.apache.org/jira/browse/SOLR-1365
 Project: Solr
  Issue Type: New Feature
Affects Versions: 1.3
Reporter: Kevin Osborn
Priority: Minor
 Fix For: 1.4

 Attachments: SOLR-1365.patch






Re: good performance news

2009-08-17 Thread Noble Paul നോബിള്‍ नोब्ळ्
In our internal testing, the binary request writer gave very good performance
for a large number of docs, though we did not benchmark it.





-- 
-
Noble Paul | Principal Engineer | AOL | http://aol.com