RE: default text type and stop words

2007-11-05 Thread Sundling, Paul
I don't know whether the problem is in Lucene; I didn't investigate further.
Maybe it's considered a feature rather than a bug by someone with different
expectations.

Solr and Lucene have different release schedules, so even if the problem is
in Lucene and gets addressed there, that doesn't guarantee it's solved in
Solr.  You would have to move from a known stable version of Lucene to some
nightly build that included a hypothetical patch, or to a patched custom
version, for this one little edge case.  It's unlikely that either of those
is going to happen.  Or consider changing a line of XML...

I only suggested considering it.  There is also the concept of an
anti-corruption layer in domain-driven design.  There are issues of time
frames, release schedules, and priorities, and I'm not assuming this edge
case is a high priority.  I merely pointed out an issue in the defaults.


I also didn't say not to deal with a bug that hypothetically could be in
a tightly coupled dependency.

Paul

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] 
Sent: Friday, November 02, 2007 11:02 PM
To: solr-dev@lucene.apache.org
Subject: Re: default text type and stop words



In a message dated 11/2/07 6:54:25 PM, [EMAIL PROTECTED]
writes:


 Even if
 the actual problem is at the Lucene level, perhaps it would be worth 
 considering changes to the default to get around it.
 

Newbie here.  Is this common practice: find a bug in a tightly coupled
dependency and not deal with it there?

regards,
billy




default text type and stop words

2007-11-02 Thread Sundling, Paul
I noticed very unexpected results when using stop words in AND conditions
with the default text type.
 
A normal query with a stop word returns no results, as expected.
 
For example, with 'an' being a stop word:
 
  movieName:an (results: 0, since it's a stop word)
  movieName:another (results: 237)
 
  rating:PG-13 (results: 76095)
 
 
But if I put them together with AND, then for normal non-stop words like
'another' the result count is less than or equal to the smaller of the two
counts being ANDed.  So adding an AND clause for a stop word query should
give 0 results:
 
  rating:PG-13 AND movieName:another (results: 46)
 
  rating:PG-13 AND movieName:an (results: 76095, should be 0)
  
Commenting out the stop word filter in the query analyzer of the text type
corrects this behavior, although I'm not sure that's a real solution.  It
seems that instead of ANDing the stop word clause, the query simply ignores
it.  Even if the actual problem is at the Lucene level, perhaps it would be
worth considering changes to the default to get around it.
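For reference, these were issued as ordinary select queries; the host,
port, and handler below are just the stock example setup, so treat them as
placeholders rather than anything specific to my installation:
 
  http://localhost:8983/solr/select?q=movieName:an&rows=0
  http://localhost:8983/solr/select?q=rating:PG-13+AND+movieName:an&rows=0
 
The first returns numFound=0 as expected; the second comes back with
numFound=76095 instead of 0.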
 
Workaround:
 
   <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <!-- in this example, we will only use synonyms at query time
        <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
        -->
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <!-- comment out to prevent the strange behavior:
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        -->
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>
 
Paul Sundling


RE: [Solr Wiki] Update of SolrPerformanceFactors by paulsundling

2007-08-24 Thread Sundling, Paul
Sorry, I replied to a subset of that question on the user list.  I'll
include my whole message, since it also relates to a past topic on this
list ("Time for a cleaner API"):

The embedded approach is at http://wiki.apache.org/solr/EmbeddedSolr

For my testing the number of records to submit is tunable, and I used 10
per batch.  Both approaches committed after every 1000 records, which is
also tunable.
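For the XML approach, batching just means wrapping several doc elements in
one add message; something along these lines (field names borrowed from the
stop-words thread, and the URL is the stock example update handler, so
treat both as placeholders):
 
  <add>
    <doc>
      <field name="id">1</field>
      <field name="movieName">Another Movie</field>
      <field name="rating">PG-13</field>
    </doc>
    <doc>
      <field name="id">2</field>
      <field name="movieName">An Example</field>
      <field name="rating">PG</field>
    </doc>
  </add>
 
posted to http://localhost:8983/solr/update, with a separate <commit/>
message sent after each 1000 records.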

A custom Lucene indexer I helped implement was even faster than the
embedded approach, using a RAM drive as a double buffer.  However, that did
require a much larger memory footprint.

The embedded classes have little to no documentation and almost look like
stub implementations, but they work well.

While this project will succeed in large part because of how easy it is to
integrate with non-Java clients, I would actually like to see the project
become more Java friendly, for example with a reference indexing
implementation.  There are a lot of tools, like SimplePostTool, that could
be more widely useful.

With a few API changes it could be used for the demo as well as serve as a
useful library.  Instead I extended it, then had to abandon that and resort
to cut-and-paste reuse in the end.  The functionality was 95% there; it
just needed API tweaks to make it usable.  It also seems unusual to expose
fields directly instead of using accessors in the Java code.  Accessors
give a lot of flexibility that direct field access doesn't have (see the
small sketch below).
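A minimal illustration of the accessor point; the class and field names
here are invented for the example, not taken from the Solr code:
 
  // Hypothetical example, not actual Solr code: an accessor lets the class
  // add validation or change its internal representation later without
  // breaking callers, which a directly exposed public field does not allow.
  public class PostConfig {
    private String solrUrl = "http://localhost:8983/solr/update";
 
    public String getSolrUrl() { return solrUrl; }
 
    public void setSolrUrl(String solrUrl) {
      if (solrUrl == null || solrUrl.length() == 0) {
        throw new IllegalArgumentException("solrUrl must not be empty");
      }
      this.solrUrl = solrUrl;
    }
  }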

It would also be nice to be able to get Java objects back, besides XML and
JSON, like an embedded equivalent for search.  That way you could integrate
more easily with Spring MVC, etc.  There may also be some performance gains
there.

Paul Sundling


-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Yonik
Seeley
Sent: Friday, August 24, 2007 1:07 PM
To: solr-dev@lucene.apache.org
Subject: Re: [Solr Wiki] Update of SolrPerformanceFactors by
paulsundling


On 8/24/07, Apache Wiki [EMAIL PROTECTED] wrote:
 + Using an [EmbeddedSolr] for indexing can be over 50% faster than one

 + using XML messages that are posted.

Paul, were the documents posted one-per-message, or did you try multiple
(like 50 to 100) per message?  If one per message, the best way to
increase performance is to have multiple threads adding docs.

I'd be curious to know how a single CSV file would clock in as well...

-Yonik



RE: [jira] Updated: (SOLR-326) cleanup eclipse warnings

2007-08-02 Thread Sundling, Paul
So would it be useful to keep that JIRA open after the patch is
submitted to allow ongoing patch submissions?

Paul Sundling


-Original Message-
From: Paul Sundling (JIRA) [mailto:[EMAIL PROTECTED] 
Sent: Thursday, August 02, 2007 4:24 PM
To: solr-dev@lucene.apache.org
Subject: [jira] Updated: (SOLR-326) cleanup eclipse warnings



 [ https://issues.apache.org/jira/browse/SOLR-326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Paul Sundling updated SOLR-326:
---

Attachment: remove_unused_imports_patch.txt

This should remove the unused-import Eclipse warnings.

 cleanup eclipse warnings
 ------------------------

          Key: SOLR-326
          URL: https://issues.apache.org/jira/browse/SOLR-326
      Project: Solr
   Issue Type: Improvement
     Reporter: Paul Sundling
     Priority: Minor
  Attachments: remove_unused_imports_patch.txt


 On default settings, Eclipse had 628 warnings.  This patch removes 119
 of those warnings, the ones related to unused imports.  These are the
 safest warnings to fix and shouldn't require any testing other than
 confirming that the build still works.
 The general idea of removing warnings is partly cleaner code, but also
 keeping interesting warnings from getting hidden among uninteresting
 ones.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.




FW: EdgeNGramTokenizer errors in eclipse

2007-07-26 Thread Sundling, Paul
Has anyone else noticed this?  I didn't get any response on the user
list.

Paul Sundling

-Original Message-
From: Sundling, Paul 
Sent: Tuesday, July 24, 2007 4:55 PM
To: [EMAIL PROTECTED]
Subject: EdgeNGramTokenizer errors in eclipse


I checked out the latest Solr source code from Subversion and put it in
an Eclipse project.  I used all the jars for the project (had to add
JUnit).  I get errors in Eclipse about two constants not being defined
in one of the library jars:
 
  (based on the import org.apache.lucene.analysis.ngram.EdgeNGramTokenizer)
  EdgeNGramTokenizer.DEFAULT_MAX_GRAM_SIZE
and
  EdgeNGramTokenizer.DEFAULT_MIN_GRAM_SIZE
 
are not defined.  So was a class changed that this Solr class depends
on?

 

The error happens in org.apache.solr.analysis.EdgeNGramTokenizerFactory:

  String maxArg = args.get("maxGramSize");
  maxGramSize = (maxArg != null ? Integer.parseInt(maxArg) :
      EdgeNGramTokenizer.DEFAULT_MAX_GRAM_SIZE);
  String minArg = args.get("minGramSize");
  minGramSize = (minArg != null ? Integer.parseInt(minArg) :
      EdgeNGramTokenizer.DEFAULT_MIN_GRAM_SIZE);
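One quick way to check whether the lucene-analyzers jar on the Eclipse
classpath actually defines those constants is javap (the jar name below is
a placeholder; use whatever version is sitting in the lib directory):
 
  javap -classpath lib/lucene-analyzers-<version>.jar org.apache.lucene.analysis.ngram.EdgeNGramTokenizer
 
If DEFAULT_MIN_GRAM_SIZE and DEFAULT_MAX_GRAM_SIZE don't show up among the
listed public members, the bundled jar simply predates those constants.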


Am I doing something wrong?  

Paul Sundling