Re: GIT does not support empty directories

2010-04-16 Thread Walter Underwood
"This directory intentionally left empty." --wunder

On Apr 16, 2010, at 12:33 PM, Ted Dunning wrote:

> Put a readme file in the directory and be done with it.
> 
> On Fri, Apr 16, 2010 at 8:40 AM, Robert Muir  wrote:
> 
>> I don't like the idea of complicating lucene/solr's build system any more
>> than it already is, unless it's absolutely necessary. It's already too
>> complicated.
>> 
>> Instead of adding more hacks, what is actually broken (git) is what should
>> be fixed, as the link states:
>> 
>> Currently the design of the git index (staging area) only permits *files*
>> to
>> be listed, and nobody competent enough to make the change to allow empty
>> directories has cared enough about this situation to remedy it.
>> 
>> On Fri, Apr 16, 2010 at 11:14 AM, Smiley, David W. wrote:
>> 
>>> Seriously.
>>> I sympathize with your point that git should support empty directories.
>>> But as a practical matter, it's easy to make the ant build tolerant of
>>> them.
>>> 
>>> ~ David Smiley
>>> 
>>> From: Robert Muir [rcm...@gmail.com]
>>> Sent: Friday, April 16, 2010 6:53 AM
>>> To: solr-dev@lucene.apache.org
>>> Subject: Re: GIT does not support empty directories
>>> 
>>> Seriously? We should hack our ant files around the bugs in every crappy
>>> source control system that comes out?
>>> 
>>> Fix Git.
>>> 
>>> On Thu, Apr 15, 2010 at 10:55 PM, Smiley, David W. wrote:
>>> 
>>>> I've run into this too.  I don't think this needs to be documented, I
>>> think
>>>> it needs to be *fixed* -- that is, the relevant ant tasks need to not
>>> assume
>>>> these directories exist and create them if not.
>>>> 
>>>> ~ David Smiley
>>>> 
>>>> -Original Message-
>>>> From: Lance Norskog [mailto:goks...@gmail.com]
>>>> Sent: Wednesday, April 14, 2010 11:14 PM
>>>> To: solr-dev
>>>> Subject: GIT does not support empty directories
>>>> 
>>>> There are some empty directories in the Solr source tree, both in 1.4
>>>> and the trunk.
>>>> 
>>>> example/work
>>>> example/webapp
>>>> example/logs
>>>> 
>>>> Git does not support empty directories:
>>>> 
>>>> 
>>> 
>> https://git.wiki.kernel.org/index.php/GitFaq#Can_I_add_empty_directories.3F
>>>> 
>>>> And so, when you check out from the Apache GIT repository, these empty
>>>> directories do not appear and 'ant example' and 'ant run-example'
>>>> fail. There is no 'how to use the solr git stuff' wiki page; that
>>>> seems like the right place to document this. I'm not git-smart enough
>>>> to write that page.
>>>> --
>>>> Lance Norskog
>>>> goks...@gmail.com
>>>> 
>>> 
>>> 
>>> 
>>> --
>>> Robert Muir
>>> rcm...@gmail.com
>>> 
>> 
>> 
>> 
>> --
>> Robert Muir
>> rcm...@gmail.com
>> 

--
Walter Underwood
Venture ASM, Troop 14, Palo Alto
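
As a concrete illustration of the build-side fix discussed above (a sketch, not the
actual Solr build file): Ant's <mkdir> task creates missing parent directories and is
a no-op when the directory already exists, so a hypothetical target like the one below
could run before anything that expects example/work, example/webapp, or example/logs.

    <!-- Hypothetical target: recreate empty directories a git checkout may omit. -->
    <target name="prepare-example-dirs">
      <mkdir dir="example/work"/>
      <mkdir dir="example/webapp"/>
      <mkdir dir="example/logs"/>
    </target>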





[jira] Commented: (SOLR-534) Return all query results with parameter rows=-1

2010-02-10 Thread Walter Underwood (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832351#action_12832351
 ] 

Walter Underwood commented on SOLR-534:
---

-1

This adds a denial of service vulnerability to Solr. One query can use lots of 
CPU or memory, or even crash the server.

This could also take out an entire distributed system.

If this is added, we MUST add a config option to disable it.

Let's take this back to the mailing list and find out why they believe all 
results are needed. There must be a better way to solve this.

> Return all query results with parameter rows=-1
> ---
>
> Key: SOLR-534
> URL: https://issues.apache.org/jira/browse/SOLR-534
> Project: Solr
>  Issue Type: New Feature
>  Components: search
>Affects Versions: 1.3
> Environment: Tomcat 5.5
>Reporter: Lars Kotthoff
>Priority: Minor
> Attachments: solr-all-results.patch
>
>
> The searcher should return all results matching a query when the parameter 
> rows=-1 is given.
> I know that it is a bad idea to do this in general, but as it explicitly 
> requires a special parameter, people using this feature will be aware of what 
> they are doing. The main use case for this feature is probably debugging, but 
> in some cases one might actually need to retrieve all results because they 
> e.g. are to be merged with results from different sources.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: Namespaces in response (SOLR-1586)

2009-12-09 Thread Walter Underwood
On Dec 9, 2009, at 11:11 AM, Mattmann, Chris A (388J) wrote:

>> 
>> Any parser that does that is so broken that you should stop using it
>> immediately. --wunder
> 
> Walter, totally agree here.

To elaborate my position:

1. Validation is a user option. The XML spec makes that very clear. We've had 
10 years to get that right, and anyone who auto-validates is not paying 
attention. Validation is very useful when you are creating XML, rarely useful 
when reading it.

2. XML namespaces are string prefixes that use the URL syntax. They do not 
follow URI rules for anything but syntax and there is no guarantee that they 
can be resolved. In fact, an XML parser can't do anything standard with the 
result if they do resolve. Again, we've had 10 years to figure that out.

Yes, this can be confusing, but if a parser author can't figure it out, don't 
use their parser because they are already getting the simple stuff wrong.

wunder






Re: Namespaces in response (SOLR-1586)

2009-12-09 Thread Walter Underwood
Any parser that does that is so broken that you should stop using it 
immediately. --wunder

On Dec 9, 2009, at 8:33 AM, Yonik Seeley wrote:

> My gut feeling is that we should not be introducing namespaces by default.
> It introduces a new requirement of XML parsers in clients, and some
> parsers would start validating by default, and going out to the web to
> retrieve the referenced namespace/schema, etc.



Re: Functions, floats and doubles

2009-11-13 Thread Walter Underwood
Float is often OK until you try and use it for further calculation. Maybe it is 
good enough for printing out distance, but maybe not for further use.

wunder

On Nov 13, 2009, at 10:32 AM, Yonik Seeley wrote:

> On Fri, Nov 13, 2009 at 1:01 PM, Walter Underwood  
> wrote:
>> Float is almost never good enough. The loss of precision is horrific.
> 
> Are you saying it's not good enough for this case (the final answer of
> a relative distance calculation)?
> 7 digits of precision is enough to represent a distance across the US
> down to the meter... and points closer together would have higher
> precision of course.
> 
> For storage of the points themselves, 32 bit floats may also often be
> enough (~2.4 meter resolution at the equator).  Allowing doubles as an
> option would be nice too - but I expect that doubling the fieldcache
> may not be worth it for many.
> Actually, a 32 bit fixed point representation would have a lot more
> accuracy for this (256 times the resolution at the cost of on-the-fly
> conversion to a double for calculations).
> 
> -Yonik
> http://www.lucidimagination.com
> 
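
A small standalone Java sketch of the precision trade-off discussed here (illustrative
only, not Solr code): a single float holds a cross-country distance in meters just
fine, but accumulating many float operations drifts in a way double does not at this
scale.

    public class FloatVsDouble {
        public static void main(String[] args) {
            // ~7 significant digits: a ~4,500 km distance in meters still resolves to 1 m.
            float crossCountryMeters = 4500000f;
            System.out.println(crossCountryMeters + 1f);  // prints 4500001.0

            // Further calculation: repeated float arithmetic accumulates visible error.
            float fSum = 0f;
            double dSum = 0d;
            for (int i = 0; i < 10000000; i++) {
                fSum += 0.1f;
                dSum += 0.1d;
            }
            System.out.println("float sum:  " + fSum);   // noticeably off from 1000000
            System.out.println("double sum: " + dSum);   // off only far beyond the 7th digit
        }
    }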



Re: Functions, floats and doubles

2009-11-13 Thread Walter Underwood
Float is almost never good enough. The loss of precision is horrific.

wunder

On Nov 13, 2009, at 9:58 AM, Yonik Seeley wrote:

> On Fri, Nov 13, 2009 at 12:52 PM, Grant Ingersoll  wrote:
>> Implementing my first function (distance stuff) and noticed that functions 
>> seem to have a float bent to them.  Not even sure what would be involved, 
>> but there are cases for distance that I could see wanting double precision.  
>> Thoughts?
> 
> 
> It's an issue in general.
> 
> But for something like gdist(point_a,point_b), the internal
> calculations can be done in double precision and if the result is cast
> to a float at the end, it should be good enough for most uses, right?
> 
> -Yonik
> http://www.lucidimagination.com
> 



Re: Another RC

2009-10-19 Thread Walter Underwood
Please wait for an official release of Lucene. It makes things SO much  
easier when you need to dig into the Lucene code.


It is well worth a week delay.

wunder

On Oct 19, 2009, at 10:27 AM, Yonik Seeley wrote:

On Mon, Oct 19, 2009 at 10:59 AM, Grant Ingersoll  
 wrote:

Are we ready for a release?


+1

I don't think we need to wait for Lucene 2.9.1 - we have all the fixes
in our version, and there's little point in pushing things off yet
another week.

Seems like the next RC should be a *real* one (i.e. no RC label in the
version, immediately call a VOTE).

-Yonik
http://www.lucidimagination.com


I got busy at work and haven't been able to
address things as much, but it seems like things are progressing.

Shall I generate another RC or are we waiting for Lucene 2.9.1?  If we go w/
the 2.9.1-dev, then we just need to restore the Maven stuff for them.
Hopefully, that stuff was just commented out and not completely removed so
as to make it a little easier to restore.

-Grant






Re: 8 for 1.4

2009-09-29 Thread Walter Underwood
It might not be proper to use the name "Solr", because it is really  
"Apache Solr". At a minimum, it is misleading to use an Apache project  
name on GPL'ed code.


I agree that changing to GPL is a bad idea. I've worked at eight or  
nine companies since the GPL was created, and GPL'ed code was  
forbidden at every one of them. GPL is where code goes to die.


wunder

On Sep 29, 2009, at 3:34 AM, Grant Ingersoll wrote:



On Sep 29, 2009, at 4:00 AM, Matthias Epheser wrote:


Grant Ingersoll schrieb:
Moving to GPL doesn't seem like a good solution to me, but I don't  
know what else to propose.  Why don't we just hold it from this  
release, but keep it in trunk and encourage the Drupal guys and  
others to submit their changes?  Perhaps by then Matthias or you  
or someone else will have stepped up.

concerning GPL:

The message from the drupal guys is that the code has changed so much  
from the initial solrjs that they think it's legally acceptable to get  
their new code out under GPL and "only" mention that it was  
inspired by the still existing Apache License solrjs.


Sounds reasonable to me, but I have little experience with this kind  
of legal issue. So what do you think?


Oh, it's legally fine.  The ASL lets you do pretty much whatever  
you want.  But that is pretty much the point.  You're taking code  
with no restrictions on it and putting a whole slew of them back in,  
preventing Solr from ever distributing it in the future.  Something  
about that stinks to me.   There is a pretty large reason why we do  
our work at the ASF and not under GPL.  I won't go into it here, but  
suffice it to say one can go read volumes of backstory on this  
elsewhere by searching for GPL vs ASL (or BSD).  Furthermore,  
Matthias, it may be the case in the future that all that work you  
did for SolrJS may not even be accessible to you, the original  
author, under the GPL terms, depending on the company (many, many  
companies explicitly forbid GPL), etc. that you work for.  Is that  
what you want?


Also, they can't call it SolrJS, though, as that is the name of our  
version.






Re: [PMX:FAKE_SENDER] Re: large OR-boolean query

2009-09-25 Thread Walter Underwood
This would work a lot better if you did the join at index time. For  
each paper, add a field with all the related drug names (or whatever  
you want to search for), then search on that field.


With the current design, it will never be fast and never scale. Each  
lookup has a cost, so expanding a query to a thousand terms will  
always be slow. Distributing the query to multiple shards will only  
make a bad design slightly faster.


This is fundamental to search index design. The schema is flat, fully- 
denormalized, no joins. You tag each document with the terms that you  
will use to find it. Then you search for those terms directly.
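
A rough SolrJ-style sketch of that index-time approach (field names like drug_name and
the core URL are made up for illustration; assumes the SolrJ client of this era):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class IndexTimeJoinSketch {
        public static void main(String[] args) throws Exception {
            SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");

            // Denormalize at index time: tag each paper with every related drug name.
            SolrInputDocument paper = new SolrInputDocument();
            paper.addField("id", "paper-42");
            paper.addField("title", "Example paper title");
            paper.addField("drug_name", "aspirin");    // multiValued field, one value per drug
            paper.addField("drug_name", "ibuprofen");
            solr.add(paper);
            solr.commit();

            // Query time: one term against the tag field instead of a 102400-term OR query.
            SolrQuery q = new SolrQuery("drug_name:aspirin");
            System.out.println(solr.query(q).getResults().getNumFound());
        }
    }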


wunder

On Sep 25, 2009, at 7:52 AM, Luo, Jeff wrote:

We are searching strings, not numbers. The reason we are doing this kind
of query is that we have two big indexes, say, a collection of medicine
drugs and a collection of research papers. I first run a query against
the drugs index and get 102400 unique drug names back. Then I need to
find all the research papers where one or more of the 102400 drug names
are mentioned, hence the large OR query. This is a kind of JOIN query
between 2 indexes, which an article on the lucid web site comparing
databases and search engines briefly touched on.

I was able to issue 100 parallel small queries against solr shards and
get the results back successfully (even sorted). My custom code is less
than 100 lines, mostly in my SearchHandler.handleRequestBody. But I have
a problem summing up the correct facet counts because the faceting counts
from each shard are not disjunctive.

Based on what is suggested by two other responses to my question, I
think it is possible that the master can pass the original large query
to each shard, and each shard will split the large query into 100 lower
level disjunctive lucene queries, fire them against its Lucene index in
a parallel way and merge the results. Then each shard shall only return
1 (instead of 100) result set to the master with disjunctive faceting
counts. It seems that the faceting problem can be solved in this way. I
would appreciate it if you could let me know if this approach is
feasible and correct, and what solr plug-ins are needed (my guess is a
custom parser and query-component?).

Thanks,

Jeff



-Original Message-
From: Grant Ingersoll [mailto:gsing...@apache.org]
Sent: Thursday, September 24, 2009 10:01 AM
To: solr-dev@lucene.apache.org
Subject: [PMX:FAKE_SENDER] Re: large OR-boolean query


On Sep 23, 2009, at 4:26 PM, Luo, Jeff wrote:


Hi,

We are experimenting a parallel approach to issue a large OR-Boolean
query, e.g., keywords:(1 OR 2 OR 3 OR ... OR 102400), against several
solr shards.

The way we are trying is to break the large query into smaller ones,
e.g.,
the example above can be broken into 10 small queries: keywords:(1
OR 2
OR 3 OR ... OR 1024), keywords:(1025 OR 1026 OR 1027 OR ... OR 2048),
etc

Now each shard will get 10 requests and the master will merge the
results coming back from each shard, similar to the regular
distributed
search.



Can you tell us a little bit more about the why/what of this?  Are you
really searching numbers or are those just for example?  Do you care
about the score or do you just need to know whether the result is
there or not?


--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
using Solr/Lucene:
http://www.lucidimagination.com/search





Re: Solr Slow in Unix

2009-07-16 Thread Walter Underwood
In particular, are you using local disc or network storage? --wunder

On 7/16/09 8:24 AM, "Yonik Seeley"  wrote:

> On Thu, Jul 16, 2009 at 4:18 AM, Anand Kumar
> Prabhakar wrote:
>> I'm running a Solr instance in Apache Tomcat 6 in a Solaris Box. The QTimes
>> are high when compared to the same configuration on a Windows machine. Can
>> anyone help with the configurations i can check to improve the performance?
> 
> What's the hardware actually look like on each machine?
> 
> -Yonik
> http://www.lucidimagination.com



Re: lucene releases vs trunk

2009-06-25 Thread Walter Underwood
This is an excellent idea.

When I find a problem and want to research the Lucene bugs that might
describe it, that is really hard with a trunk build. It's easy with a
release build.

wunder

On 6/25/09 4:18 AM, "Yonik Seeley"  wrote:

> For the next release cycle (presumably 1.5?) I think we should really
> try to stick to released versions of Lucene, and not use dev/trunk
> versions.
> Early in Solr's lifetime, Lucene trunk was more stable (APIs changed
> little, even on non-released versions), and Lucene releases were few
> and far between.
> Today, the pace of change in Lucene has quickened, and Lucene APIs are
> much more in flux until a release is made.  It's also now harder to
> support a Lucene dev release given the growth in complexity
> (particularly for indexing code).  Releases are made more often too,
> making using released versions more practical.
> Many of our users dislike our use of dev versions of Lucene too.
> 
> And yes, 1.4 isn't out the door yet - but people often tend to hit the
> ground running on the next release.
> 
> -Yonik
> http://www.lucidimagination.com



[jira] Commented: (SOLR-1216) disambiguate the replication command names

2009-06-15 Thread Walter Underwood (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719625#action_12719625
 ] 

Walter Underwood commented on SOLR-1216:


If we choose a name for the thing we are pulling, like "image", then we can use 
"makeimage", "pullimage", etc.


> disambiguate the replication command names
> --
>
> Key: SOLR-1216
> URL: https://issues.apache.org/jira/browse/SOLR-1216
> Project: Solr
>  Issue Type: Improvement
>  Components: replication (java)
>Reporter: Noble Paul
>Assignee: Noble Paul
> Fix For: 1.4
>
> Attachments: SOLR-1216.patch
>
>
> There is a lot of confusion in the naming of various commands such as 
> snappull, snapshot etc. This is a vestige of the script based replication we 
> currently have. The commands can be renamed to make more sense
> * 'snappull' to be renamed to 'sync'
> * 'snapshot' to be renamed to 'backup'
> thoughts?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1216) disambiguate the replication command names

2009-06-15 Thread Walter Underwood (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719609#action_12719609
 ] 

Walter Underwood commented on SOLR-1216:


"sync" is a weak name, because it doesn't say whether it is a push or pull 
synchronization.


> disambiguate the replication command names
> --
>
> Key: SOLR-1216
> URL: https://issues.apache.org/jira/browse/SOLR-1216
> Project: Solr
>  Issue Type: Improvement
>  Components: replication (java)
>Reporter: Noble Paul
>Assignee: Noble Paul
> Fix For: 1.4
>
> Attachments: SOLR-1216.patch
>
>
> There is a lot of confusion in the naming of various commands such as 
> snappull, snapshot etc. This is a vestige of the script based replication we 
> currently have. The commands can be renamed to make more sense
> * 'snappull' to be renamed to 'sync'
> * 'snapshot' to be renamed to 'backup'
> thoughts?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: Streaming Docs, Terms, TermVectors

2009-05-30 Thread Walter Underwood
Don't stream; request chunks of 10 or 100 at a time. It works fine and
you don't have to write or test any new code. In addition, it works
well with HTTP caches, so if two clients want to get the same data,
the second can get it from the cache.

We do that at Netflix. Each front-end box does a series of queries
to get all the movie titles, then loads them into a local index for
autocomplete.
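
A minimal SolrJ-style sketch of that chunked approach (the URL and field names are
placeholders; assumes the SolrJ client API): ordinary start/rows paging, so every
request is a normal, cacheable GET.

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.common.SolrDocument;
    import org.apache.solr.common.SolrDocumentList;

    public class ChunkedFetch {
        public static void main(String[] args) throws Exception {
            SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
            int rows = 100;                  // chunk size
            int start = 0;
            long numFound = Long.MAX_VALUE;
            while (start < numFound) {
                SolrQuery q = new SolrQuery("*:*");
                q.setFields("id", "title");  // fetch only the stored fields you need
                q.setStart(start);
                q.setRows(rows);
                SolrDocumentList page = solr.query(q).getResults();
                numFound = page.getNumFound();
                for (SolrDocument doc : page) {
                    // load into a local index, write to disk, etc.
                }
                start += rows;
            }
        }
    }

Note that requests with very large start values get slower the deeper you page, so
very large exports may want bigger chunks.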

wunder

On 5/30/09 11:01 AM, "Kaktu Chakarabati"  wrote:

> For a streaming-like solution, it is possible in fact to have a working
> buffer in-memory that emits chunks on an http connection which is kept alive
> by the server until the full response has been sent.
> This is quite similar for example to how video streaming protocols which can
> operate on top of HTTP work ( cf. a more general discussion on
> http://ajaxpatterns.org/HTTP_Streaming#In_A_Blink ).
> Another (non-mutually exclusive) possibility is to introduce a novel binary
> format for the transmission of such data ( i.e a new wt=<..> type ) over
> http (or any other comm. protocol) so that data can be more effectively
> compressed and made to better fit into memory.
> One such format which has been widely circulating and already has many open
> source projects implementing it is Adobe's AMF (
> http://osflash.org/documentation/amf ). It is however a proprietary format
> so i'm not sure whether it is incorporable under apache foundation terms.
> 
> -Chak
> 
> 
> On Sat, May 30, 2009 at 9:58 AM, Dietrich Featherston
> wrote:
> 
>> I was actually curious about the same thing.  Perhaps an endpoint reference
>> could be passed in the request where the documents can be sent
>> asynchronously, such as a jms topic.
>> 
>> solr/query?q=*:*&epr=/my/topic&eprtype=jms
>> 
>> Then we would need to consider how to break up the response, how to cancel
>> a running query, etc.
>> 
>> Is this along the lines of what you're looking for?  I would be interested
>> in looking at how the request/response contract changes and what types of
>> endpoint references would be supported.
>> 
>> Thanks,
>> D
>> 
>> On May 30, 2009, at 12:45 PM, Grant Ingersoll  wrote:
>> 
>>  Anyone have any thoughts on what is involved with streaming lots of
>>> results out of Solr?
>>> 
>>> For instance, if I wanted to get something like 1M docs out of Solr (or
>>> more) via *:* query, how can I tractably do this?  Likewise, if I wanted to
>>> return all the terms in the index or all the Term Vectors.
>>> 
>>> Obviously, it is impossible to load all of these things into memory and
>>> then create a response, so I was wondering if anyone had any ideas on how to
>>> stream them.
>>> 
>>> Thanks,
>>> Grant
>>> 
>> 



[jira] Commented: (SOLR-1073) StrField should allow locale sensitive sorting

2009-04-28 Thread Walter Underwood (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12703893#action_12703893
 ] 

Walter Underwood commented on SOLR-1073:


Using the locale of the JVM is very, very bad for a multilingual server. Solr 
should always use the same, simple locale. It is OK to set a Locale in 
configuration for single-language installations, but using the JVM locale is a 
recipe for disaster. You move Solr to a different server and everything breaks. 
Very, very bad.  

In a multi-lingual config, locales must be set per-request.

Ideally, requests should send an ISO language code as context for the query.




> StrField should allow locale sensitive sorting
> --
>
> Key: SOLR-1073
> URL: https://issues.apache.org/jira/browse/SOLR-1073
> Project: Solr
>  Issue Type: Improvement
> Environment: All
>Reporter: Sachin
> Attachments: LocaleStrField.java
>
>
> Currently, StrField does not take a parameter which it can pass to ctor of 
> SortField making the StrField's sorting rely on the locale of the JVM.  
> Ideally, StrField should allow setting the locale in the schema.xml and use 
> it to create a new instance of the SortField in getSortField() method, 
> something like:
> snip:
>   public SortField getSortField(SchemaField field,boolean reverse)
>   {
> ...
>   Locale locale = new Locale(lang,country);
>   return new SortField(field.getName(), locale, reverse);
>  }
> More details about this issue here:
> http://www.nabble.com/CJKAnalyzer-and-Chinese-Text-sort-td22374195.html

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1044) Use Hadoop RPC for inter Solr communication

2009-03-03 Thread Walter Underwood (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12678601#action_12678601
 ] 

Walter Underwood commented on SOLR-1044:


During the Oscars, the HTTP cache in front of our Solr farm had a 90% hit rate. 
I think a 10X reduction in server load is a testimony to the superiority of the 
HTTP approach.


> Use Hadoop RPC for inter Solr communication
> ---
>
> Key: SOLR-1044
> URL: https://issues.apache.org/jira/browse/SOLR-1044
> Project: Solr
>  Issue Type: New Feature
>  Components: search
>Reporter: Noble Paul
>
> Solr uses http for distributed search . We can make it a whole lot faster if 
> we use an RPC mechanism which is more lightweight/efficient. 
> Hadoop RPC looks like a good candidate for this.  
> The implementation should just have one protocol. It should follow the Solr's 
> idiom of making remote calls . A uri + params +[optional stream(s)] . The 
> response can be a stream of bytes.
> To make this work we must make the SolrServer implementation pluggable in 
> distributed search. Users should be able to choose between the current 
> CommonshttpSolrServer, or a HadoopRpcSolrServer . 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: Is there a built in keyword report (Tag Cloud) feature on Solr ?

2009-02-26 Thread Walter Underwood
If you want a tag cloud based on query freqency, start with your
HTTP log analysis tools. Most of those generate a list of top
queries and top words in queries.

wunder

On 2/26/09 2:54 PM, "Chris Hostetter"  wrote:

> 
> : I may have not made myself clear. When I say keyword report, I mean a kind
> : of a most popular tag cloud, showing in bigger sizes the most searched
> : terms. Therefore I need information about how many times specific terms have
> : been searched and I can't see how I could accomplish that with this
> : solution 
> 
> you have to be more explicit about what you ask for.  I've never heard
> anyone refer to a tag cloud as being based on how often a term is searched
> for -- everyone i know uses the frequency of words in the corpus,
> sometimes with a decay function to promote words mentioned in more recent
> docs.
> 
> Solr doesn't keep any record of the searches performed, so to build a tag
> cloud based on query popularity you would need to mine your logs.
> 
> if you want a tag cloud based on the frequency of words in your corpus,
> the faceting approach mentioned would work -- but a simpler way to get
> term counts for the whole index (*:*) would be the TermsComponent.  you
> only really need the facet based solution if you want a cloud based on a
> subset of documents, (ie: a cloud for all documents matching
> category:computer)
> 
> 
> 
> -Hoss
> 



Re: Is there a built in keyword report (Tag Cloud) feature on Solr ?

2009-02-26 Thread Walter Underwood
Oops, missed that you wanted it by facet. Never mind. --wunder

On 2/26/09 9:57 AM, "Walter Underwood"  wrote:

> That info is already available via Luke, right? --wunder
> 
> On 2/26/09 9:55 AM, "Robert Douglass"  wrote:
> 
>> A solution that I'd considering implementing for Drupal's ApacheSolr
>> module is to do a *:* search and then make tag clouds from all of the
>> facets. Pretty easy to sort all the facet terms into bins based on the
>> number of documents they match, and then to translate bins to font
>> sizes. Tag clouds make a nice alternate representation of facet blocks.
>> 
>> Robert Douglass
>> 
>> The RobsHouse.net Newsletter:
>> http://robshouse.net/newsletter/robshousenet-newsletter
>> Follow me on Twitter: http://twitter.com/robertDouglass
>> 
>> On Feb 26, 2009, at 6:50 PM, Emmanuel Castro Santana wrote:
>> 
>>> 
>>> I am developing a Solr based search application and need to get a
>>> kind of a
>>> keyword report for tag cloud generation. If there is anyone here who
>>> has
>>> ever had that necessity and has somehow found the way through, I would
>>> really appreciate some help.
>>> Thanks in advance
>>> -- 
>>> View this message in context:
>>> http://www.nabble.com/Is-there-a-built-in-keyword-report-%28Tag-Cloud%29-feature-on-Solr---tp9677p9677.html
>>> Sent from the Solr - Dev mailing list archive at Nabble.com.
>>> 
>> 
> 



Re: Is there a built in keyword report (Tag Cloud) feature on Solr ?

2009-02-26 Thread Walter Underwood
That info is already available via Luke, right? --wunder

On 2/26/09 9:55 AM, "Robert Douglass"  wrote:

> A solution that I'd considering implementing for Drupal's ApacheSolr
> module is to do a *:* search and then make tag clouds from all of the
> facets. Pretty easy to sort all the facet terms into bins based on the
> number of documents they match, and then to translate bins to font
> sizes. Tag clouds make a nice alternate representation of facet blocks.
> 
> Robert Douglass
> 
> The RobsHouse.net Newsletter:
> http://robshouse.net/newsletter/robshousenet-newsletter
> Follow me on Twitter: http://twitter.com/robertDouglass
> 
> On Feb 26, 2009, at 6:50 PM, Emmanuel Castro Santana wrote:
> 
>> 
>> I am developing a Solr based search application and need to get a
>> kind of a
>> keyword report for tag cloud generation. If there is anyone here who
>> has
>> ever had that necessity and has somehow found the way through, I would
>> really appreciate some help.
>> Thanks in advance
>> -- 
>> View this message in context:
>> http://www.nabble.com/Is-there-a-built-in-keyword-report-%28Tag-Cloud%29-feat
>> ure-on-Solr---tp9677p9677.html
>> Sent from the Solr - Dev mailing list archive at Nabble.com.
>> 
> 



Re: [jira] Issue Comment Edited: (SOLR-844) A SolrServer impl to front-end multiple urls

2009-01-22 Thread Walter Underwood
This would be useful if there was search-specific balancing,
like always send the same query back to the same server. That
can make your cache far more effective.

wunder

On 1/22/09 1:13 PM, "Otis Gospodnetic (JIRA)"  wrote:

> 
> [ 
> https://issues.apache.org/jira/browse/SOLR-844?page=com.atlassian.jira.plugin.
> system.issuetabpanels:comment-tabpanel&focusedCommentId=12666296#action_126662
> 96 ] 
> 
> otis edited comment on SOLR-844 at 1/22/09 1:12 PM:
> 
> 
> I'm not sure there is a clear consensus about this functionality being a good
> thing (also 0 votes).  Perhaps we can get more people's opinions?
> 
> 
>   was (Author: otis):
> I'm not sure there is a clear consensus about this functionality being a
> good thing.  Perhaps we can get more people's opinions?
> 
>   
>> A SolrServer impl to front-end multiple urls
>> 
>> 
>> Key: SOLR-844
>> URL: https://issues.apache.org/jira/browse/SOLR-844
>> Project: Solr
>>  Issue Type: New Feature
>>  Components: clients - java
>>Affects Versions: 1.3
>>Reporter: Noble Paul
>>Assignee: Shalin Shekhar Mangar
>> Fix For: 1.4
>> 
>> Attachments: SOLR-844.patch, SOLR-844.patch, SOLR-844.patch
>> 
>> 
>> Currently a {{CommonsHttpSolrServer}} can talk to only one server. This
>> demands that the user have a LoadBalancer or do the roundrobin on their own.
>> We must have a {{LBHttpSolrServer}} which must automatically do a
>> Loadbalancing between multiple hosts. This can be backed by the
>> {{CommonsHttpSolrServer}}
>> This can have the following other features
>> * Automatic failover
>> * Optionally take in  a file /url containing the the urls of servers so that
>> the server list can be automatically updated  by periodically loading the
>> config
>> * Support for adding removing servers during runtime
>> * Pluggable Loadbalancing mechanism. (round-robin, weighted round-robin,
>> random etc)
>> * Pluggable Failover mechanisms



[jira] Commented: (SOLR-822) CharFilter - normalize characters before tokenizer

2008-10-23 Thread Walter Underwood (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12642188#action_12642188
 ] 

Walter Underwood commented on SOLR-822:
---

Yes, it should be in Lucene. Like this: 
http://webui.sourcelabs.com/lucene/issues/1343

There are (at least) four kinds of character mapping:

Unicode normalization from decomposed to composed forms (always safe).

Unicode normalization from compatibility forms to standard forms (may change 
the look, like fullwidth to halfwidth Latin).

Language-specific normalization, like "oe" to "ö" (German-only).

Mappings that improve search but are linguistically dodgy, like stripping 
accents and mapping katakana to hiragana.
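
A tiny Java illustration (not part of this issue) of the first two kinds, using
java.text.Normalizer (Java 6+): NFC composes decomposed forms, NFKC additionally folds
compatibility forms such as fullwidth Latin.

    import java.text.Normalizer;

    public class NormalizeDemo {
        public static void main(String[] args) {
            // Decomposed e + combining acute -> composed U+00E9. Always safe.
            System.out.println(Normalizer.normalize("e\u0301", Normalizer.Form.NFC));

            // Compatibility folding: fullwidth A B C (U+FF21..U+FF23) -> "ABC".
            // May change how the text looks, as noted above.
            System.out.println(Normalizer.normalize("\uFF21\uFF22\uFF23", Normalizer.Form.NFKC));
        }
    }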

wunder


> CharFilter - normalize characters before tokenizer
> --
>
> Key: SOLR-822
> URL: https://issues.apache.org/jira/browse/SOLR-822
> Project: Solr
>  Issue Type: New Feature
>  Components: Analysis
>Reporter: Koji Sekiguchi
>Priority: Minor
> Attachments: character-normalization.JPG, sample_mapping_ja.txt, 
> SOLR-822.patch, SOLR-822.patch
>
>
> A new plugin which can be placed in front of the tokenizer.
> {code:xml}
>  positionIncrementGap="100" >
>   
>  mapping="mapping_ja.txt" />
> 
>  words="stopwords.txt"/>
> 
>   
> 
> {code}
> charFilter can be multiple (chained). I'll post a JPEG file to show 
> character normalization sample soon.
> MOTIVATION:
> In Japan, there are two types of tokenizers -- N-gram (CJKTokenizer) and 
> Morphological Analyzer.
> When we use morphological analyzer, because the analyzer uses Japanese 
> dictionary to detect terms,
> we need to normalize characters before tokenization.
> I'll post a patch soon, too.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-815) Add new Japanese half-width/full-width normalizaton Filter and Factory

2008-10-20 Thread Walter Underwood (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12641071#action_12641071
 ] 

Walter Underwood commented on SOLR-815:
---

I looked it up, and even found a reason to do it the right way.

Latin should be normalized to halfwidth (in the Latin-1 character space).

Kana should be normalized to fullwidth.

Normalizing Latin characters to fullwidth would mean you could not use the 
existing accent-stripping filters or probably any other filter that expected 
Latin-1, like synonyms. Normalizing to halfwidth makes the rest of Solr and 
Lucene work as expected.

See section 12.5: http://www.unicode.org/versions/Unicode5.0.0/ch12.pdf

The compatibility forms (the ones we normalize away from) are in the Unicode 
range U+FF00 to U+FFEF.
The correct mappings from those forms are in this doc: 
http://www.unicode.org/charts/PDF/UFF00.pdf

Other charts are here: http://www.unicode.org/charts/
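
A hedged sketch of the direction argued for here, covering only the fullwidth ASCII
block: code points U+FF01-U+FF5E sit at a fixed offset of 0xFEE0 from their ASCII
counterparts, and U+3000 is the ideographic space. (Kana go the other way, halfwidth
to fullwidth, and need a lookup table rather than a fixed offset.)

    public class FullwidthLatinToHalfwidth {
        static String normalize(String in) {
            StringBuilder out = new StringBuilder(in.length());
            for (int i = 0; i < in.length(); i++) {
                char c = in.charAt(i);
                if (c >= '\uFF01' && c <= '\uFF5E') {
                    out.append((char) (c - 0xFEE0));   // e.g. U+FF21 -> U+0041 'A'
                } else if (c == '\u3000') {
                    out.append(' ');                   // ideographic space -> ASCII space
                } else {
                    out.append(c);                     // leave kana and everything else alone
                }
            }
            return out.toString();
        }

        public static void main(String[] args) {
            // Fullwidth "Solr 1.3" -> halfwidth "Solr 1.3"
            System.out.println(normalize("\uFF33\uFF4F\uFF4C\uFF52\u3000\uFF11\uFF0E\uFF13"));
        }
    }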


> Add new Japanese half-width/full-width normalizaton Filter and Factory
> --
>
> Key: SOLR-815
> URL: https://issues.apache.org/jira/browse/SOLR-815
> Project: Solr
>  Issue Type: New Feature
>  Components: search
>Affects Versions: 1.3
>Reporter: Todd Feak
>Assignee: Koji Sekiguchi
>Priority: Minor
> Attachments: SOLR-815.patch
>
>
> Japanese Katakana and  Latin alphabet characters exist as both a "half-width" 
> and "full-width" version. This new Filter normalizes to the full-width 
> version to allow searching and indexing using both.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-815) Add new Japanese half-width/full-width normalizaton Filter and Factory

2008-10-17 Thread Walter Underwood (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12640609#action_12640609
 ] 

Walter Underwood commented on SOLR-815:
---

If I remember correctly, Latin characters should normalize to half-width, not 
full-width.


> Add new Japanese half-width/full-width normalizaton Filter and Factory
> --
>
> Key: SOLR-815
> URL: https://issues.apache.org/jira/browse/SOLR-815
> Project: Solr
>  Issue Type: New Feature
>  Components: search
>Affects Versions: 1.3
>Reporter: Todd Feak
>Priority: Minor
> Attachments: SOLR-815.patch
>
>
> Japanese Katakana and  Latin alphabet characters exist as both a "half-width" 
> and "full-width" version. This new Filter normalizes to the full-width 
> version to allow searching and indexing using both.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-814) Add new Japanese Hiragana Filter and Factory

2008-10-17 Thread Walter Underwood (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12640605#action_12640605
 ] 

Walter Underwood commented on SOLR-814:
---

This seems like a bad idea. Hiragana and katakana are used quite differently in 
Japanese. They are not interchangeable.

I was the engineer for Japanese support in Ultraseek for years and even visited 
our distributor there, but no one ever asked for this feature. They asked for a 
lot of things, but never this.

It is very useful, maybe essential, to normalize full-width and half-width 
versions of hiragana, katakana, and ASCII.


> Add new Japanese Hiragana Filter and Factory
> 
>
> Key: SOLR-814
> URL: https://issues.apache.org/jira/browse/SOLR-814
> Project: Solr
>  Issue Type: New Feature
>  Components: search
>Affects Versions: 1.3
>Reporter: Todd Feak
>Priority: Minor
> Attachments: SOLR-814.patch
>
>
> Japanese Hiragana and Katakana character sets can be easily translated 
> between. This filter normalizes all Hiragana characters to their Katakana 
> counterpart, allowing for indexing and searching using either.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: Offer to submit some custom enhancements

2008-10-16 Thread Walter Underwood
Python marshal format supports everything we need and is easy to implement
in Java. It is roughly equivalent to JSON, but binary.

http://docs.python.org/library/marshal.html

wunder

On 10/16/08 8:16 AM, "Shalin Shekhar Mangar" <[EMAIL PROTECTED]> wrote:

> Hi Todd,
> 
> AFAIK, protocol buffers cannot be used for Solr because it is unable to
> support the NamedList structure that all Solr components use.
> 
> The binary protocol (NamedListCodec) that SolrJ uses to communicate with
> Solr server is extremely optimized for our response format. However it is
> Java only.
> 
> There are other projects such as Apache Thrift (
> http://incubator.apache.org/thrift/) and Etch (both in incubation) which can
> be looked at. There are a few issues in Thrift which may help us in the
> future:
> 
> https://issues.apache.org/jira/browse/THRIFT-110
> https://issues.apache.org/jira/browse/THRIFT-122
> 
> On Thu, Oct 16, 2008 at 12:18 AM, Feak, Todd <[EMAIL PROTECTED]>wrote:
> 
>> Reposting, as I inadvertently thread hijacked on the first one. My bad.
>> 
>> Hi all,
>> 
>> I have a handful of custom classes that we've created for our purposes
>> here. I'd like to share them if you think they have value for the rest
>> of the community, but I wanted to check here before creating JIRA
>> tickets and patches.
>> 
>> Here's what I have:
>> 
>> 1. DoubleMetaphoneFilter and Factory. This replaces usage of the
>> PhoneticFilter and Factory allowing access to set maxCodeLength() on the
>> DoubleMetaphone encoder and access to the "alternate" encodings that the
>> encoder provides for some words.
>> 
>> 2. JapaneseHalfWidthFilter and Factory. Some Japanese characters (and
>> Latin alphabet) exist in both a FullWidth and HalfWidth form. This
>> filter normalizes by switching to the FullWidth form for all the
>> characters. I have seen at least one JIRA ticket about this issue. This
>> implementation doesn't rely on Java 1.6.
>> 
>> 3. JapaneseHiraganaFilter and Factory. Japanese Hiragana can be
>> translated to Katakana. This filter normalizes to Katakana so that data
>> and queries can come in either way and get hits.
>> 
>> 
>> Also, I have been requested to create a prototype that you may be
>> interested in. I'm to construct a QueryResponseWriter that returns
>> documents using Google's Protocol Buffers. This would rely on an
>> existing patch that exposes the OutputStream, but I would like to start
>> the work soon. Are there license concerns that would block sharing this
>> with you? Is there any interest in this?
>> 
>> Thanks for your consideration,
>> Todd Feak
>> 
> 
> 



[jira] Commented: (SOLR-777) backword match search, for domain search etc.

2008-09-18 Thread Walter Underwood (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12632489#action_12632489
 ] 

Walter Underwood commented on SOLR-777:
---

You don't need backwards matching for this, and it doesn't really do the right 
thing.

Split the string on ".", reverse the list, and join successive sublists with 
".". Don't index the length one list, since that is ".com", ".net", etc. Do the 
same processing at query time.

This is a special analyzer.
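
A rough Java sketch of that analysis (class and method names are made up, this is not
the patch): emit the reversed-label prefixes of a hostname, skipping the single-label
form, and apply the same logic at query time.

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.Collections;
    import java.util.List;

    public class DomainSuffixTokens {
        // "lucene.apache.org" -> [org.apache, org.apache.lucene]
        static List<String> tokens(String host) {
            List<String> labels = new ArrayList<String>(Arrays.asList(host.split("\\.")));
            Collections.reverse(labels);              // [org, apache, lucene]
            List<String> out = new ArrayList<String>();
            StringBuilder sb = new StringBuilder(labels.get(0));
            for (int i = 1; i < labels.size(); i++) { // skip the length-one form (.com, .org, ...)
                sb.append('.').append(labels.get(i));
                out.add(sb.toString());
            }
            return out;
        }

        public static void main(String[] args) {
            System.out.println(tokens("lucene.apache.org")); // [org.apache, org.apache.lucene]
            System.out.println(tokens("www.apache.org"));    // [org.apache, org.apache.www]
            System.out.println(tokens("apache.org"));        // [org.apache] -- the query term
        }
    }

A query for apache.org produces the single token org.apache, which matches both
indexed hosts above.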



> backword match search, for domain search etc.
> -
>
> Key: SOLR-777
> URL: https://issues.apache.org/jira/browse/SOLR-777
> Project: Solr
>  Issue Type: New Feature
>  Components: search
>Affects Versions: 1.3
>Reporter: Koji Sekiguchi
>Priority: Minor
>
> There is a requirement for searching domains with backward match. For 
> example, using "apache.org" for a query string, www.apache.org, 
> lucene.apache.org could be returned.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: replace stax API with Geronimo-stax+Woodstox

2008-09-09 Thread Walter Underwood
We've been using woodstox in production for over a year.
No problems.

wunder

On 9/9/08 8:07 AM, "Yonik Seeley" <[EMAIL PROTECTED]> wrote:

> FYI, I'm testing Solr with woodstox now and will probably do some ad
> hoc stress testing too.
> But woodstox is a quality parser.  I expect fewer problems than we had
> with the reference implementation (and it may even be faster too)
> 
> -Yonik



Re: Solr changes date format?

2008-08-12 Thread Walter Underwood
On 8/12/08 11:42 AM, "Chris Hostetter" <[EMAIL PROTECTED]> wrote:

> : by a point but, as you can see, the separator is converted to a comma when
> : is accessed
> : from Solr (i can see this too from Solr web admin)
> 
> this boggles my mind ... i can't think of *anything* in Solr that would do
> this .. 

If a European locale was used when the seconds portion of the date
was formatted, it would use a comma for the radix point.

wunder



Re: [VOTE] Set Solr 1.3 freeze and release date

2008-08-06 Thread Walter Underwood
I would strongly prefer a released version of Lucene. We made some changes
to Solr 1.1 that required tweaks inside of Lucene, and it was quite a
treasure hunt to find a suitable set of Lucene source.

It just seems wrong for Solr to release a version of Lucene.

wunder 

On 8/6/08 8:53 AM, "Chris Hostetter" <[EMAIL PROTECTED]> wrote:

> 
> : Yes, it's good that lots of Solr people are also Lucene people. But I
> : don't think that makes it alright to ship Lucene nightlies or
> : snapshots.
> 
> Apache Lucene is a TLP, Apache Solr and Apache Lucene-Java are just
> individual products/sub-projects of that TLP.
> 
> If the Apache Lucene PMC votes to release a particular bundle of source
> code as "Apache Solr 1.3" and that bundle includes source (or binary) code
> from the Lucene-Java subproject that hasn't already been released (via PMC
> vote) then it is by definition officially released Apache Lucene software.
> 
> So in a nutshell: yes it is "alright for Solr to ship Lucene nightlies" --
> because once the PMC votes on that Solr release, it doesn't matter where
> that Lucene-Java jar came from, it's officially released code.
> 
> I'm told there is even precedence for the PMC of a TLP X to vote
> and officially release code from completley seperate TLP Y because Y had
> not had a release and X was ready to go.
> 
> Where dependencies on "snapshots" in official releases causes problems is
> when those snapshots are from third parties and/or are not reproducable --
> where the specific version of the dependencies is unknown and as a result
> the "dependee" can not be reproduced.  We do not have that problem
> with any Apache codebase we have a dependency on.  We know exactly which
> svn revision the dependencies come from, and since the SVN repository is
> public, anyone can recreate it.
> 
> 
> -Hoss
> 



Re: Solr Logo thought

2008-08-01 Thread Walter Underwood
I kind of like the flaming version at http://www.solrmarc.org/
Not very fired up about the other choices.

wunder

On 8/1/08 9:45 AM, "Otis Gospodnetic" <[EMAIL PROTECTED]> wrote:

> Hola,
> 
> Yes, logo, trivial issue (hi Lance).  But logos are important, so:
> 
> I've cast my vote, but I don't really love even the logo I voted for (#2 -- a
> little too pale/shiny, not very "bold", so to speak).  Lukas (BCCed) did the
> logo for Mahout.  He made a number of variations and was very open to
> suggestions during the process.  I wonder if we could ask him to give Solr
> logo a shot if he is not on vacation.  Do we have time for another logo,
> assuming Lukas is willing to contribute?
> 
> 
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch




[jira] Commented: (SOLR-600) XML parser stops working under heavy load

2008-06-17 Thread Walter Underwood (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12605751#action_12605751
 ] 

Walter Underwood commented on SOLR-600:
---

It could also be a concurrency bug in Solr that shows up on the IBM JVM because 
the thread scheduler makes different decisions. 

> XML parser stops working under heavy load
> -
>
> Key: SOLR-600
> URL: https://issues.apache.org/jira/browse/SOLR-600
> Project: Solr
>  Issue Type: Bug
>  Components: update
>Affects Versions: 1.3
> Environment: Linux 2.6.19.7-ss0 #4 SMP Wed Mar 12 02:56:42 GMT 2008 
> x86_64 Intel(R) Xeon(R) CPU X5450 @ 3.00GHz GenuineIntel GNU/Linux
> Tomcat 6.0.16
> SOLR nightly 16 Jun 2008, and versions prior
> JRE 1.6.0
>Reporter: John Smith
>
> Under heavy load, the following is spat out for every update:
> org.apache.solr.common.SolrException log
> SEVERE: java.lang.NullPointerException
> at java.util.AbstractList$SimpleListIterator.hasNext(Unknown Source)
> at 
> org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:225)
> at 
> org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:66)
> at 
> org.apache.solr.handler.XmlUpdateRequestHandler.processUpdate(XmlUpdateRequestHandler.java:196)
> at 
> org.apache.solr.handler.XmlUpdateRequestHandler.handleRequestBody(XmlUpdateRequestHandler.java:123)
> at 
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:125)
> at org.apache.solr.core.SolrCore.execute(SolrCore.java:965)
> at 
> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
> at 
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:272)
> at 
> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
> at 
> org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
> at 
> org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
> at 
> org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:175)
> at 
> org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
> at 
> org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
> at 
> org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
> at 
> org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:286)
> at 
> org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:844)
> at 
> org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:583)
> at 
> org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:447)
> at java.lang.Thread.run(Thread.java:735)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: IDF in Distributed Search

2008-04-11 Thread Walter Underwood
Global IDF does not require another request/response.
It is nearly free if you return the right info.

Return the total number of docs and the df in the original
response. Sum the doc counts and dfs, recompute the idf,
and re-rank.
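
A small sketch of the arithmetic (not Solr code; assumes Lucene's default idf formula,
idf = 1 + ln(numDocs / (docFreq + 1))): each shard returns its numDocs and per-term df
with the original response, and the merger recomputes a collection-wide idf before
re-ranking.

    public class GlobalIdfSketch {
        public static void main(String[] args) {
            // Hypothetical stats returned by two shards for the same term.
            long[] shardNumDocs = { 2000000L, 1500000L };
            long[] shardDf      = {    1200L,    9000L };

            long totalDocs = 0, totalDf = 0;
            for (int i = 0; i < shardNumDocs.length; i++) {
                totalDocs += shardNumDocs[i];
                totalDf   += shardDf[i];
            }

            // DefaultSimilarity-style idf, now over the whole collection.
            double globalIdf = 1.0 + Math.log((double) totalDocs / (totalDf + 1));
            System.out.println("global idf = " + globalIdf);
            // Re-rank: rescore each hit with globalIdf in place of the shard-local idf
            // it was originally scored with.
        }
    }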

See this post for an efficient way to do it:

  
http://wunderwood.org/most_casual_observer/2007/04/progressive_reranking.html

This works best if you treat the results from each server as
a queue and refill just that queue when it is exhausted. All the
good results might be from one server.

wunder

On 4/11/08 8:50 PM, "Yonik Seeley" <[EMAIL PROTECTED]> wrote:

> On Fri, Apr 11, 2008 at 11:39 PM, Otis Gospodnetic
> <[EMAIL PROTECTED]> wrote:
>>  So, I'd like to see what it would take to add distributed IDF info to Solr's
>> distributed search.
>>  Here are some questions to get the discussion going:
>>  - Is anyone already working on it?
>>  - Does anyone plan on working on it in the very near future?
>>  - Does anyone already have thoughts how and where dist. idf could be plugged
>> in?
>>  - There is a mention of dist idf and performance cost up there - any idea
>> how costly dist idf would
> 
> It's relatively easy to implement, but the performance cost is not
> negligible since it adds another search "phase" (another
> request-response).  It should be optional of course (globalidf=true),
> so there is no reason not to add this feature.
> 
> I also left room for this stage (ResponseBuilder.STAGE_PARSE_QUERY),
> which is ordered before query execution.
> 
> -Yonik



[jira] Commented: (SOLR-127) Make Solr more friendly to external HTTP caches

2008-02-08 Thread Walter Underwood (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12567068#action_12567068
 ] 

Walter Underwood commented on SOLR-127:
---

Two reasons to do HTTP caching for Solr: First, Solr is HTTP and needs to 
implement that correctly. Second, caches are much harder to implement and test 
than the cache information in HTTP. HTTP caches already exist and are well 
tested, so the implementation cost is zero and deployment is very easy.

The HTTP spec already covers which responses should be cached.  A 400 response 
may only be cached if it includes explicit cache control headers which allow 
that. See RFC 2616.

We are using a caching load balancer and caching in Apache front ends to 
Tomcat. We see an increase of more than 2X in the capacity of our search farm.

I would recommend against Solr-specific cache information in the XML part of 
the responses. Distributed caching is extremely difficult to get right. Around 
25% of the HTTP 1.1 spec is devoted to caching and there are still grey areas.
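
A minimal Java sketch (not from the issue or patch; the URL is a placeholder) of the
client side of that HTTP caching: a conditional GET with If-Modified-Since, where a
304 means the cached copy of the response can be reused.

    import java.net.HttpURLConnection;
    import java.net.URL;

    public class ConditionalGet {
        public static void main(String[] args) throws Exception {
            URL url = new URL("http://localhost:8983/solr/select?q=solr&rows=10");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            // Normally taken from the Last-Modified header of the previously cached response.
            conn.setRequestProperty("If-Modified-Since", "Fri, 08 Feb 2008 00:00:00 GMT");
            int code = conn.getResponseCode();
            if (code == HttpURLConnection.HTTP_NOT_MODIFIED) {   // 304
                System.out.println("Not modified: reuse the cached response.");
            } else {
                System.out.println("Status " + code + ": read and cache the new body.");
            }
            conn.disconnect();
        }
    }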

> Make Solr more friendly to external HTTP caches
> ---
>
> Key: SOLR-127
> URL: https://issues.apache.org/jira/browse/SOLR-127
> Project: Solr
>  Issue Type: Wish
>Reporter: Hoss Man
>Assignee: Hoss Man
> Fix For: 1.3
>
> Attachments: CacheUnitTest.patch, CacheUnitTest.patch, 
> HTTPCaching.patch, HTTPCaching.patch, HTTPCaching.patch, HTTPCaching.patch, 
> HTTPCaching.patch, HTTPCaching.patch, HTTPCaching.patch, HTTPCaching.patch, 
> HTTPCaching.patch, HTTPCaching.patch, HTTPCaching.patch, HTTPCaching.patch, 
> HTTPCaching.patch, HTTPCaching.patch, HTTPCaching.patch, HTTPCaching.patch, 
> HTTPCaching.patch, HTTPCaching.patch, HTTPCaching.patch, HTTPCaching.patch, 
> HTTPCaching.patch, HTTPCaching.patch, HTTPCaching.patch
>
>
> an offhand comment I saw recently reminded me of something that really bugged 
> me about the search solution i used *before* Solr -- it didn't play nicely 
> with HTTP caches that might be sitting in front of it.
> at the moment, Solr doesn't put in particularly useful info in the HTTP 
> Response headers to aid in caching (ie: Last-Modified), responds to all HEAD 
> requests with a 400, and doesn't do anything special with If-Modified-Since.
> At the very least, we can set a Last-Modified based on when the current 
> IndexReader was open (if not the Date on the IndexReader) and use the same 
> info to determine how to respond to If-Modified-Since requests.
> (for the record, i think the reason this hasn't occurred to me in the 2+ years 
> i've been using Solr, is because with the internal caching, i've yet to need 
> to put a proxy cache in front of Solr)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: remote solrj using xml versus json

2007-11-09 Thread Walter Underwood
If you want speed, you should use Python marshal format. It handles
data structures equivalent to JSON, but in binary. Very easy to
convert to Java data types. --wunder

On 11/9/07 12:56 PM, "Erik Hatcher" <[EMAIL PROTECTED]> wrote:

> anybody compared/contrasted the two?   seems like yonik's noggit
> parser might have a performance edge on xml parsing ?!
> 
> Erik




Re: default text type and stop words

2007-11-05 Thread Walter Underwood
I also said, "Stopword removal is a reasonable default because it works
fairly well for a general text corpus." Ultraseek keeps stopwords but
most engines don't. I think it is fine as a default. I also think you
have to understand stopwords at some point.

wunder

On 11/5/07 9:59 PM, "Chris Hostetter" <[EMAIL PROTECTED]> wrote:

> 
> : This isn't a problem in Lucene or Solr. It is a result of the analyzers
> : you have chosen to use. If you choose to remove stopwords, you will not
> : be able to match stopwords.
> 
> I believe paul's point was that this use of stopwords is in the "text"
> fieldtype in the example schema.xml ... which many people use as is.
> 
> I'm personally of the mindset that it's fine like it is.  While people who
> understand that "an" is a stop word might ask "why does 'rating:PG AND
> name:an' match 40K movies, it should match 0?" there is another (probably
> larger) group of people who won't know how the search is implemented, or
> that "an" is a stop word, and they will look at the same results and ask
> "why am i getting 40K results? most of these don't have 'an' in the title?
> i should only be getting X results."
> 
> That second group of people aren't going to be any happier if you
> give them 0 results instead -- at least this way people get some results
> to work with.
> 
> -Hoss




Re: default text type and stop words

2007-11-05 Thread Walter Underwood
This isn't a problem in Lucene or Solr. It is a result of the analyzers
you have chosen to use. If you choose to remove stopwords, you will not
be able to match stopwords.

Stopword removal has benefits (smaller index, faster searches) and
drawbacks (missed matches, wrong matches). Solr and Lucene allow
you to decide.

Stopword removal is a reasonable default because it works fairly
well for a general text corpus.

wunder


On 11/5/07 1:33 PM, "Sundling, Paul" <[EMAIL PROTECTED]> wrote:

> I don't know if the problem is in Lucene, I didn't investigate further.
> Maybe it's considered a feature, not a bug for someone with different
> expectations.
> 
> Given that Solr and Lucene have different release schedules.  Even if
> the problem is in Lucene and it's addressed there, that doesn't
> guarentee it's solved with Solr.  You would have to change from using a
> known stable vresion of Lucene to some nightly release that included a
> hypothetical patch or a patched custom version for this one little edge
> case.  It's probably unlikely that either of those are going to happen.
> Or consider changing a line of XML...
> 
> I only suggested considering it.  There is also the concept of an
> anti-corruption layer in domain driven design.  There are issues of time
> frames, release schedules, priorities and I'm not assuming this edge
> case is a high priority.  I merely pointed out an issue in the defaults.
> 
> 
> I also didn't say not to deal with a bug that hypothetically could be in
> a tightly coupled dependency.
> 
> Paul
> 
> -Original Message-
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
> Sent: Friday, November 02, 2007 11:02 PM
> To: solr-dev@lucene.apache.org
> Subject: Re: default text type and stop words
> 
> 
> 
> In a message dated 11/2/07 6:54:25 PM, [EMAIL PROTECTED]
> writes:
> 
> 
>> Even if
>> the actual problem is at the Lucene level, perhaps it would be worth
>> considering changes to the default to get around it.
>> 
> 
> newbie here. is this common practice? find a bug in a tightly coupled
> dependency and not deal with it there?
> 
> regard,
> billy
> 
> 
> **
>  See what's new at
> http://www.aol.com



Re: default text type and stop words

2007-11-02 Thread Walter Underwood
Stopwords are fairly common in movie titles. There are even titles
made entirely of stopwords. The first one I noticed was "Being There".
I posted more of them here:

http://wunderwood.org/most_casual_observer/2007/05/invisible_titles.html

wunder
==
Search Guy
Netflix

On 11/2/07 3:53 PM, "Sundling, Paul" <[EMAIL PROTECTED]> wrote:

> I noticed very unexpected results when using stop words with and without
> conditions using the default text type.
>  
> A normal query with a stop word returns no results as expected:
>  
> For example with 'an' being a stopword
>  
>   movieName:an (results: 0 since it's a stop word)
>   movieName:another (results 237)
>  
>   rating:PG-13  (results: 76095)
>  
>  
> but if I put them together with AND, for normal non stop words like
> 'another' the result is less than or equal to the smaller results being
> ANDed.  So adding another AND clause with a stop word query should have
> 0 results.
>  
>   rating:PG-13 AND movieName:another (results 46)
>  
>   rating:PG-13 AND movieName:an (results 76095 should be 0)
>   
> Commenting out the stop word filter from the text type for query will
> correct this behavior, although I'm not sure that's a real solution.  So
> instead of anding the stop word clause it seems to ignore it.  Even if
> the actual problem is at the Lucene level, perhaps it would be worth
> considering changes to the default to get around it.
>  
> Workaround:
>  
> <fieldtype name="text" class="solr.TextField" positionIncrementGap="100">
>   <analyzer type="index">
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
>             generateNumberParts="1" catenateWords="1" catenateNumbers="1"
>             catenateAll="0"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
>     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>             ignoreCase="true" expand="true"/>
>     <!-- stop word filter commented out as the workaround
>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>     -->
>     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
>             generateNumberParts="1" catenateWords="0" catenateNumbers="0"
>             catenateAll="0"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
>     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>   </analyzer>
> </fieldtype>
>  
> Paul Sundling



Re: HTTP or RMI, Jini, JavaSpaces for distributed search

2007-09-21 Thread Walter Underwood
Please don't switch to RMI. We've spent the past year converting
our entire middle tier from RMI to HTTP. We are so glad that we
no longer have any RMI servers.

The big advantage of HTTP is that there are hundreds, maybe
thousands, of engineers working on making it fast, on tools for it,
on caches, etc.

If you really need more compact responses, I would recommend
coding the JSON output in Python marshal format. That is compact,
fast, and easy to parse. We used that for a Java client in Ultraseek.

wunder

On 9/21/07 11:08 AM, "Yonik Seeley" <[EMAIL PROTECTED]> wrote:

> I wanted to take a step back for a second and think about if HTTP was
> really the right choice for the transport for distributed search.
> 
> I think the high-level approach in SOLR-303 is the right way to go
> about it, but I'm unsure if HTTP is the right transport.
> 
> Pro HTTP:
>   - using HTTP allows one to use an http load-balancer to distribute
> load across multiple copies of the same shard by assigning a VIP
> (virtual IP) to each shard.
>   - because you do pretty much everything by hand, you know that there
> isn't some hidden limitation that will jump out and bite you later.
> 
> Cons HTTP:
>  - you end up doing everything by hand... connection handling, request
> serialization, response parsing, etc...
>  - goes through normal servlet channels... every sub-request will be
> logged to the access logs, slowing things down.
> - more network bandwidth used unless we come up with a new
> BinaryResponseWriter and Parser
> 
> Currently, SOLR-303 uses and parses the XML response format, which has
> some serious downsides:
> - response size limits scalability and how deep in responses you can go...
>   If you want to retrieve documents 5000 through 5009, even though the
> user only requested 10 documents, the top-level searcher needs to get
> the top 5009 documents from *each* shard... and that can quickly
> exhaust the network bandwidth of the NIC.  XML parsing on the order of
> nShards*5009 entries won't be any picnic either.
> 
> I'm thinking the load-balancing of HTTP is overrated also, because
> it's inflexible.  Adding another shard requires adding another VIP in
> the load-balancer, and changing which servers have which shards or
> adding new copies of a shard also requires load-balancer
> configuration.  Everything points to Solr being able to do the
> load-balancing itself in the future, and there wouldn't seem to be
> much benefit to using a load-balancer w/ VIPS for each shard vs having
> Solr do it.
> 
> So even if we stuck with HTTP, Solr would need
>  - a binary protocol to minimize network bandwidth use
>  - load balancing across shard copies itself
> 
> Given that, would it make sense to just go with RMI instead?
> And perhaps leverage some other higher level services (Jini? JavaSpaces?)
> 
> I'd like to hear from people with more experience with RMI & friends,
> and what the potential downsides are to using these technologies.
> 
> -Yonik



[jira] Commented: (SOLR-127) Make Solr more friendly to external HTTP caches

2007-09-14 Thread Walter Underwood (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12527694
 ] 

Walter Underwood commented on SOLR-127:
---

Last-modified does require monotonic time, but ETags are version stamps without 
any ordering. The indexVersion should be fine for an ETag.
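
As a rough illustration (not code from the patch), the whole idea fits in a
few servlet-style lines; indexVersion here just stands in for however the
current index version would be obtained:

  import javax.servlet.http.HttpServletRequest;
  import javax.servlet.http.HttpServletResponse;

  class EtagSupport {
    /** Returns true if a 304 was sent and no body should be written. */
    static boolean checkEtag(HttpServletRequest req, HttpServletResponse resp,
                             long indexVersion) {
      // The index version is an opaque version stamp; hex-encode it as the ETag.
      String etag = "\"" + Long.toHexString(indexVersion) + "\"";
      if (etag.equals(req.getHeader("If-None-Match"))) {
        resp.setStatus(HttpServletResponse.SC_NOT_MODIFIED); // 304 Not Modified
        return true;
      }
      resp.setHeader("ETag", etag);
      return false;
    }
  }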

> Make Solr more friendly to external HTTP caches
> ---
>
> Key: SOLR-127
> URL: https://issues.apache.org/jira/browse/SOLR-127
> Project: Solr
>  Issue Type: Wish
>Reporter: Hoss Man
> Attachments: HTTPCaching.patch, HTTPCaching.patch, HTTPCaching.patch, 
> HTTPCaching.patch, HTTPCaching.patch, HTTPCaching.patch
>
>
> an offhand comment I saw recently reminded me of something that really bugged 
> me about the search solution i used *before* Solr -- it didn't play nicely 
> with HTTP caches that might be sitting in front of it.
> at the moment, Solr doesn't put particularly useful info in the HTTP 
> Response headers to aid in caching (ie: Last-Modified), responds to all HEAD 
> requests with a 400, and doesn't do anything special with If-Modified-Since.
> At the very least, we can set a Last-Modified based on when the current 
> IndexReader was opened (if not the Date on the IndexReader) and use the same 
> info to determine how to respond to If-Modified-Since requests.
> (for the record, i think the reason this hasn't occurred to me in the 2+ years 
> i've been using Solr, is because with the internal caching, i've yet to need 
> to put a proxy cache in front of Solr)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: what goes in CHANGES.txt

2007-07-09 Thread Walter Underwood
> : If we added a more obscure method that didn't exist before (like
> : getFirstMatch()), that wouldn't need to be added (it's noise to most
> : users, doesn't change existing functionality, not accessible w/o
> : writing Java code, and an advanced user can pull up the javadoc).

It sure is handy to know what release that first appeared if you
are trying to work with an older version. --wunder






[jira] Commented: (SOLR-277) Character Entity of XHTML is not supported with XmlUpdateRequestHandler .

2007-06-26 Thread Walter Underwood (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508408
 ] 

Walter Underwood commented on SOLR-277:
---

This is not a bug. Solr accepts XML, not XHTML. It does not accept XHTML-only 
entities. 

The Solr update XML format is a specific Solr XML format, not XML, not DocBook, 
not
anything else.

To index XHTML, parse it and convert it to Solr XML update format.


> Character Entity of XHTML is not supported with XmlUpdateRequestHandler .
> -
>
> Key: SOLR-277
> URL: https://issues.apache.org/jira/browse/SOLR-277
> Project: Solr
>  Issue Type: Improvement
>  Components: update
>Affects Versions: 1.3
>Reporter: Toru Matsuzawa
> Attachments: XmlUpdateRequestHandler.patch
>
>
> Character Entity of XHTML is not supported with XmlUpdateRequestHandler .
> http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent
> http://www.w3.org/TR/xhtml1/DTD/xhtml-special.ent
> http://www.w3.org/TR/xhtml1/DTD/xhtml-symbol.ent
> It is necessary to correspond with XmlUpdateRequestHandler because xpp3 
> cannot use .
> I think it is necessary until StaxUpdateRequestHandler becomes "/update".

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-216) Improvements to solr.py

2007-05-29 Thread Walter Underwood (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12499923
 ] 

Walter Underwood commented on SOLR-216:
---

GET is the right semantic for a query, since it doesn't change the resource. It 
also allows HTTP caching.

If Solr has URL length limits, that's a bug.


> Improvements to solr.py
> ---
>
> Key: SOLR-216
> URL: https://issues.apache.org/jira/browse/SOLR-216
> Project: Solr
>  Issue Type: Improvement
>  Components: clients - python
>Affects Versions: 1.2
>Reporter: Jason Cater
>Assignee: Mike Klaas
>Priority: Trivial
> Attachments: solr.py
>
>
> I've taken the original solr.py code and extended it to include higher-level 
> functions.
>   * Requires python 2.3+
>   * Supports SSL (https://) schema
>   * Conforms (mostly) to PEP 8 -- the Python Style Guide
>   * Provides a high-level results object with implicit data type conversion
>   * Supports batching of update commands

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: svn commit: r541391 - in /lucene/solr/trunk: CHANGES.txt example/solr/conf/xslt/example_atom.xsl example/solr/conf/xslt/example_rss.xsl

2007-05-25 Thread Walter Underwood
On 5/25/07 10:45 AM, "Chris Hostetter" <[EMAIL PROTECTED]> wrote:
> 
> : I'd slap versions to those 2 XSL files to immediately answer "which
> : version of Atom|RSS does this produce?"
> 
> i'm comfortable calling the example_rss.xsl "RSS", since most RSS
> readers will know what to do with it, but i don't know that i'm
> comfortable calling it any specific version of RSS; people are more likely
> to get irate about claiming to be a specific version if one little thing
> is wrong than they are about not claiming to be anything in particular.

Some versions of RSS are quite incompatible, so we MUST say what
version we are implementing. RSS 1.0 is completely different from
the 0.9 series and 2.0.

Atom doesn't have a version number, but RFC 4287 Atom is informally
called 1.0. 

wunder



[jira] Commented: (SOLR-208) RSS feed XSL example

2007-05-17 Thread Walter Underwood (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12496624
 ] 

Walter Underwood commented on SOLR-208:
---

I wasn't in the RSS wars, either, but I was on the Atom working group. That was 
a bunch of volunteers making a clean, testable spec for RSS functionality 
(http://www.ietf.org/rfc/rfc4287). RSS 2.0 has some bad ambiguities, especially 
around ampersand and entities in titles. The default has changed over the years 
and clients do different, incompatible things.

GData is just a way to do search result stuff that we would need anyway. It is 
standard set of URL parameters for query, start-index, and categories, and a 
few Atom extensions for total results, items per page, and next/previous.

http://code.google.com/apis/gdata/reference.html


> RSS feed XSL example
> 
>
> Key: SOLR-208
> URL: https://issues.apache.org/jira/browse/SOLR-208
> Project: Solr
>  Issue Type: New Feature
>  Components: clients - java
>Affects Versions: 1.2
>Reporter: Brian Whitman
> Assigned To: Hoss Man
>Priority: Trivial
> Attachments: rss.xsl
>
>
> A quick .xsl file for transforming solr queries into RSS feeds. To get the 
> date and time in properly you'll need an XSL 2.0 processor, as in 
> http://wiki.apache.org/solr/XsltResponseWriter .  Tested to work with the 
> example solr distribution in the nightly.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-208) RSS feed XSL example

2007-05-17 Thread Walter Underwood (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12496608
 ] 

Walter Underwood commented on SOLR-208:
---

What kind of RSS?

-1 unless it is Atom. The nine variants of RSS have some nasty interop 
problems, even between
those that are supposed to implement the same spec.

Even better,  a GData interface returning Atom.



> RSS feed XSL example
> 
>
> Key: SOLR-208
> URL: https://issues.apache.org/jira/browse/SOLR-208
> Project: Solr
>  Issue Type: New Feature
>  Components: clients - java
>Affects Versions: 1.2
>Reporter: Brian Whitman
> Assigned To: Hoss Man
>Priority: Trivial
> Attachments: rss.xsl
>
>
> A quick .xsl file for transforming solr queries into RSS feeds. To get the 
> date and time in properly you'll need an XSL 2.0 processor, as in 
> http://wiki.apache.org/solr/XsltResponseWriter .  Tested to work with the 
> example solr distribution in the nightly.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: dynamic copyFields

2007-05-02 Thread Walter Underwood
That syntax is from the "ed" editor. I learned it in 1975
on Unix v6/PWB, running on a PDP-11/70. --wunder

On 5/2/07 5:04 PM, "Mike Klaas" <[EMAIL PROTECTED]> wrote:

> On 5/2/07, Ryan McKinley <[EMAIL PROTECTED]> wrote:
> 
>> How about Mike's other suggestion:
>>   
>> 
>> this would keep the glob style for "source" and "dest", but use "regex"
>> to transform a sorce -> dest
> 
> Wow, I didn't even remember suggesting that.  I agree (with Hoss) that
> backward compatibility is important, but I disagree (with myself) that
> the above syntax is nice.  Outside of perl, I'm not sure how common
> the s/ / / syntax is (is it used in java?)
> 
> perhaps
> 
> 
> 
> ?
> 
> -Mike



Re: Progressive Query Relaxation

2007-04-10 Thread Walter Underwood
On 4/10/07 10:38 AM, "J. Delgado" <[EMAIL PROTECTED]> wrote:

> I think you have something personal against Oracle... Hey I have no
> interest in defending Oracle, but this no hack.

It's true, I don't have much respect for Oracle's text search.
When I was working on enterprise search, we never really worried
about them because their quality and speed just wasn't competitive.
I do not look to them as a reliable source of good ideas for search.

Oracle's problem statement has a plausible strawman, but there are
lots of better ways to deal with misspellings. Heck, my dev instance
of Solr gives Michael Crichton as the first hit for "Michel Crichton".
It is not true that "hits which are a poor match will be mixed in
with hits which are a good match."

Hmmm, "Crichton" is much more likely to be misspelled than "Michael",
so maybe their strawman isn't very good.

wunder



Re: Progressive Query Relaxation

2007-04-10 Thread Walter Underwood
On 4/10/07 10:06 AM, "J. Delgado" <[EMAIL PROTECTED]> wrote:

> Progressive relaxation, at least as Oracle has defined it, is a
> flexible, developer defined series of queries that are efficiently
> executed in progression and in one trip to the engine, until minimum
> of hits required is satisfied. It is not a self adapting precision
> scheme nor it tries to guess what is the best match.

Correct. Search engines are all about the best match. Why would
you show anything else?

This is an RDBMS flavored approach, not an approach that considers
natural language text. Sets of matches, not a ranked list. It fails
as soon as one of the sets gets too big, like when someone searches
for "laserjet" at HP.com. That happens a lot.

It assumes that all keywords are the same, something that Gerry
Salton figured out was false thirty years ago. That is why we
use tf.idf instead of sets of matches.

I see a lot of design without any talk about what problem they are
solving. What queries don't work? How do we make those better?
Let's work from real logs and real data. Oracle's hack doesn't
solve any problem I've see in real query logs.

I'm doing e-commerce search, and our current engine does pretty
much what Oracle is offering. The results are not good, and we
are replacing it with Solr and DisMax. My off-line relevance testing
shows a big improvement.

wunder
--
Search Guru, Netflix




Re: Progressive Query Relaxation

2007-04-10 Thread Walter Underwood
From the name, I thought this was an adaptive precision scheme,
where the engine automatically tries broader matching if there
are no matches or just a few. We talked about doing that with
Ultraseek, but it is pretty tricky. Deciding when to adjust it is
harder than making it variable.

Instead, this is an old idea that search amateurs seem to like.
Show all exact matches, then near matches, etc. This is the
kind of thing people suggest when they don't understand that
a ranking algorithm combines that evidence in a much more
powerful way. I talked customers out of this once or twice
each year at Ultraseek.

This approach fails for:

* common words
* misspellings

Since both of those happen a lot, this idea fails for a lot
of queries.

I presume that Oracle implemented this to shut up some big customer,
since it isn't a useful feature unless it closes a sale.

DisMax gives you something somewhat similar to this, by
selecting the best matching field. That is much more powerful
and gives much better results.

wunder

On 4/9/07 12:46 AM, "J. Delgado" <[EMAIL PROTECTED]> wrote:

> Has anyone within the Lucene or Solr community attempted to code a
> progressive query relaxation technique similar to the one described
> here for Oracle Text?
> http://www.oracle.com/technology/products/text/htdocs/prog_relax.html
> 
> Thanks,
> 
> -- J.D.



[jira] Commented: (SOLR-161) Dangling dash causes stack trace

2007-02-15 Thread Walter Underwood (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12473628
 ] 

Walter Underwood commented on SOLR-161:
---

It is really a Lucene query parser bug, but it wouldn't hurt to do s/(.*)-/&/ 
as a workaround. Assuming my ed(1) syntax is still fresh. Regardless, no query 
string should ever give a stack trace. --wunder
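
For illustration, the same workaround in plain Java (a sketch, not whatever
fix actually went in): strip a dangling operator from the query string before
it reaches the parser.

  class QueryCleanup {
    /** Drop a trailing '-' or '+' so the parser never sees an operator
        with nothing after it, e.g. "digging for the truth -" becomes
        "digging for the truth". */
    static String stripDanglingOperator(String q) {
      return q.replaceAll("[+-]\\s*$", "").trim();
    }
  }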

> Dangling dash causes stack trace
> 
>
> Key: SOLR-161
> URL: https://issues.apache.org/jira/browse/SOLR-161
> Project: Solr
>  Issue Type: Bug
>  Components: search
>Affects Versions: 1.1.0
> Environment: Java 1.5, Tomcat 5.5.17, Fedora Core 4, Intel
>Reporter: Walter Underwood
>
> I'm running tests from our search logs, and we have a query that ends in a 
> dash. That caused a stack trace.
> org.apache.lucene.queryParser.ParseException: Cannot parse 'digging for the 
> truth -': Encountered "<EOF>" at line 1, column 23.
> Was expecting one of:
> "(" ...
> <QUOTED> ...
> <TERM> ...
> <PREFIXTERM> ...
> <WILDTERM> ...
> "[" ...
> "{" ...
> <NUMBER> ...
> 
>   at org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:127)
>   at 
> org.apache.solr.request.DisMaxRequestHandler.handleRequest(DisMaxRequestHandler.java:272)
>   at org.apache.solr.core.SolrCore.execute(SolrCore.java:595)
>   at org.apache.solr.servlet.SolrServlet.doGet(SolrServlet.java:92)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-161) Dangling dash causes stack trace

2007-02-15 Thread Walter Underwood (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12473625
 ] 

Walter Underwood commented on SOLR-161:
---

The parser can have a rule for this rather than exploding. A trailing dash is 
never meaningful and can be omitted, whether we're allowing +/- or not. Seems 
like a grammar bug to me. --wunder

> Dangling dash causes stack trace
> 
>
> Key: SOLR-161
> URL: https://issues.apache.org/jira/browse/SOLR-161
> Project: Solr
>  Issue Type: Bug
>  Components: search
>Affects Versions: 1.1.0
> Environment: Java 1.5, Tomcat 5.5.17, Fedora Core 4, Intel
>Reporter: Walter Underwood
>
> I'm running tests from our search logs, and we have a query that ends in a 
> dash. That caused a stack trace.
> org.apache.lucene.queryParser.ParseException: Cannot parse 'digging for the 
> truth -': Encountered "<EOF>" at line 1, column 23.
> Was expecting one of:
> "(" ...
> <QUOTED> ...
> <TERM> ...
> <PREFIXTERM> ...
> <WILDTERM> ...
> "[" ...
> "{" ...
> <NUMBER> ...
> 
>   at org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:127)
>   at 
> org.apache.solr.request.DisMaxRequestHandler.handleRequest(DisMaxRequestHandler.java:272)
>   at org.apache.solr.core.SolrCore.execute(SolrCore.java:595)
>   at org.apache.solr.servlet.SolrServlet.doGet(SolrServlet.java:92)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (SOLR-161) Dangling dash causes stack trace

2007-02-15 Thread Walter Underwood (JIRA)
Dangling dash causes stack trace


 Key: SOLR-161
 URL: https://issues.apache.org/jira/browse/SOLR-161
 Project: Solr
  Issue Type: Bug
  Components: search
Affects Versions: 1.1.0
 Environment: Java 1.5, Tomcat 5.5.17, Fedora Core 4, Intel
Reporter: Walter Underwood


I'm running tests from our search logs, and we have a query that ends in a 
dash. That caused a stack trace.

org.apache.lucene.queryParser.ParseException: Cannot parse 'digging for the 
truth -': Encountered "<EOF>" at line 1, column 23.
Was expecting one of:
"(" ...
<QUOTED> ...
<TERM> ...
<PREFIXTERM> ...
<WILDTERM> ...
"[" ...
"{" ...
<NUMBER> ...

at org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:127)
at 
org.apache.solr.request.DisMaxRequestHandler.handleRequest(DisMaxRequestHandler.java:272)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:595)
at org.apache.solr.servlet.SolrServlet.doGet(SolrServlet.java:92)


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: AutoCommitTest failing

2007-02-05 Thread Walter Underwood
On 2/5/07 11:18 AM, "Yonik Seeley" <[EMAIL PROTECTED]> wrote:
> 
> Yes, I think that's it.
> SolrCore.close() shuts down the Executor.
> From the trace, you can see SolrCore closing, then an attempt to open
> up another searcher  after that.
> 
> The close of the update handler should probably shut down it's executor too.

That is one cause, according to the docs:

  New tasks submitted in method execute(java.lang.Runnable) will be rejected
  when the Executor has been shut down, and also when the Executor uses
  finite bounds for both maximum threads and work queue capacity, and is
  saturated.

http://java.sun.com/j2se/1.5.0/docs/api/java/util/concurrent/ThreadPoolExecu
tor.html

wunder



Re: resin and UTF-8 in URLs

2007-02-02 Thread Walter Underwood
On 2/1/07 6:00 PM, "Chris Hostetter" <[EMAIL PROTECTED]> wrote:

> That may be, but Solr was only publicly available for 9 months before we
> had someone running into confusion because they were tyring to post an XML
> file that wasn't UTF-8 :)
> 
> http://www.nabble.com/wana-use-CJKAnalyzer-tf2303256.html#a6498685

But that file wasn't a legal XML file in a non-standard encoding,
it was an illegal XML file in UTF-8. I don't think we're planning
on repairing broken XML automatically.

wunder



Re: resin and UTF-8 in URLs

2007-02-01 Thread Walter Underwood
On 2/1/07 3:18 PM, "Chris Hostetter" <[EMAIL PROTECTED]> wrote:
>
> As for XML, or any other format a user might POST to solr (or ask solr
> to fetch from a remote source) what possible reason would we have to only
> supporting UTF-8? .. why do you suggest that the XML standard "specify
> UTF-8, [so] we should use UTF-8" ... doesn't the XML standard say we
> should use the charset specified in the content-type if there is one, and
> if not use the encoding specified in the xml header, ie...
> 
> 

The XML spec says that XML parsers are only required to support
UTF-8, UTF-16, ISO 8859-1, and US-ASCII. If you use a different
encoding for XML, there is no guarantee that a conforming parser
will accept it.

Ultraseek has been indexing XML for the past nine years, and
I remember a single customer that had XML in a non-standard
encoding. Effectively all real-world XML is in one of the
standard encodings.

The right spec for XML over HTTP is RFC 3023. For text/xml
with no charset spec, the XML must be interpreted as US-ASCII.
From section 8.5:

   Omitting the charset parameter is NOT RECOMMENDED for text/xml.  For
   example, even if the contents of the XML MIME entity are UTF-16 or
   UTF-8, or the XML MIME entity has an explicit encoding declaration,
   XML and MIME processors MUST assume the charset is "us-ascii".

wunder




Re: resin and UTF-8 in URLs

2007-02-01 Thread Walter Underwood
On 2/1/07 2:53 PM, "Chris Hostetter" <[EMAIL PROTECTED]> wrote:

> Solr, in my opinion, shouldn't have the string "UTF-8" hardcoded in it
> anywhere -- not even in the example config: new users shouldn't need to
> know about have any special solrconfig options that must be (un)set to get
> Solr to use their servlet container / system default charset.

I strongly disagree. When we use standards like URIs and XML which
specify UTF-8, we should use UTF-8.

If someone has intentionally set defaults which do not comply with
the standards, they can also do the extra work to make Solr behave
in a non-standard way.

I really cannot imagine a real use for that configuration, especially
in a back end server like Solr. In HTML, changing from Shift-JIS to
GB will change the shape of a few kanji characters, but there is
no need to store everything in GB or talk to the servers in GB.

wunder
 



Re: resin and UTF-8 in URLs

2007-02-01 Thread Walter Underwood
Let's not make this complicated for situations that we've never
seen in practice. Java is a Unicode language and always has been.
Anyone running a Java system with a Shift-JIS default should already
know the pitfalls, and know them much better than us (and I know a
lot about Shift-JIS).

The URI spec says UTF-8, so we can be compliant and tell people
to fix their code. If they need to add specific hacks for their
broken software, that is OK. We don't need generic design features
for a few broken clients.

RFC 3986 has been out for two years now. That is long enough for
decently-maintained software to get it right.

wunder

On 2/1/07 2:14 PM, "Chris Hostetter" <[EMAIL PROTECTED]> wrote:

> 
> : If we can do something small that makes the most normal cases work
> : even if the container is not configured, that seems good.
> 
> but how do we know the user wants what we consider a "normal cases" to
> work? ... if every servlet container lets you configure your default
> charset differently, we have no easy way to tell if/when they've
> configured the default properly, to know if we should override it.
> 
> If someone does everything in Shift-JIS, and sets up their servlet
> container with Shift-JIS as their default, and installs solr -- i don't
> want them to think Solr sucks because there is a default in Solr they
> don't know about (or know how to disable) that assumes UTF-8.
> 
> On the other hand: if someone really hasn't thought about charsets at all,
> then it doesn't seem that bad to use whatever default their servlet
> container says to use -- as I understand it some containers (tomcat
> included) pick their default based on the JVMs
> configuration (i assume from the "user.language" sysproperty) ... that
> certainly seems like a better default then for us ot asume UTF-8 -- even
> if it is "latin1" for "en", because most novice users are probably okay
> with latin1 ... if you're starting to worry about more complex characters
> that aren't in the default charset your servlet container picks for you,
> then reading a little documentation is a good idea.
> 
> 
> : At the very lease, we should change the examples in:
> : http://wiki.apache.org/solr/SolrResin etc
> 
> oh absolutely.
> 
> 
> 
> 
> -Hoss
> 



Re: loading many documents by ID

2007-02-01 Thread Walter Underwood
On 2/1/07 10:55 AM, "Ryan McKinley" <[EMAIL PROTECTED]> wrote:
> 
> Is there a better word then 'update'? It seems there is already enough
> confusion between UpdateHandlers, "Update Plugins",
> UpdateRequestHandler etc.

Try "modify". Solr uses "update" to include "add".

wunder




Re: loading many documents by ID

2007-01-31 Thread Walter Underwood
On 1/31/07 9:05 PM, "Ryan McKinley" <[EMAIL PROTECTED]> wrote:
>>> 
>>> We'd have to make it very clear that this only works if all fields are
>>> STORED.
>> 
>> Isn't there some way to do this automatically instead of relying
>> on documentation? We might need to add something, maybe a
>> "required" attribute on fields, but a runtime error would be
>> much, much better than a page on the wiki.
> 
> what about copyField?
> 
> With copyField, it is reasonable to have fields that are not stored
> and are generated from the other stored fields.  (this is what my
> setup looks like).

Mine, too. That is why I suggested explicit declarations in the
schema to say which fields are required.

wunder



Re: loading many documents by ID

2007-01-31 Thread Walter Underwood
On 1/31/07 3:39 PM, "Chris Hostetter" <[EMAIL PROTECTED]> wrote:
> 
> : Oh, and there have been numerous people interested in "updateable"
> : documents, so it would be nice if that part was in the update handler.
> 
> We'd have to make it very clear that this only works if all fields are
> STORED.

Isn't there some way to do this automatically instead of relying
on documentation? We might need to add something, maybe a
"required" attribute on fields, but a runtime error would be
much, much better than a page on the wiki.

wunder



[jira] Commented: (SOLR-129) Solrb - UTF 8 Support for add/delete

2007-01-31 Thread Walter Underwood (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12469072
 ] 

Walter Underwood commented on SOLR-129:
---

This is not a bug, unless a bad error message is a bug. It looks like the XML 
uses the HTML entity "&aring;" (å), which is not defined in XML. It has nothing 
to do with UTF-8. It really should generate an error message with line number 
instead of a stack trace.

wunder

> Solrb - UTF 8 Support for add/delete
> 
>
> Key: SOLR-129
> URL: https://issues.apache.org/jira/browse/SOLR-129
> Project: Solr
>  Issue Type: Bug
>  Components: clients - ruby - flare
> Environment: OSX
>Reporter: Antonio Eggberg
>
> Hi:
> This could be a ruby utf-8 bug. Anyway when I try to do a UTF-8 document add 
> via post.sh and then do query via Solr Admin everything works as it should. 
> However using the solrb ruby lib or flare UTF-8 doc add doesn't work as it 
> should. I am not sure what I am doing wrong and I don't think its Solr cos it 
> works as it should.
> Could this be a famous utf-8 ruby bug? I am using ruby 1.8.5 with rails 1.2.1
> Cheers

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: SOLR Improvement: Expiration

2007-01-30 Thread Walter Underwood
On 1/30/07 5:07 PM, "Fuad Efendi" <[EMAIL PROTECTED]> wrote:

> If it is not implemented/reported yet...
> I am having problems with deleting of old documents, would be nice to have
> default expiration policy!

Building in some specific policy would be hard and only
useful for people with exactly that problem.

Instead, include a date field and delete by query.
No features needed, just a tiny bit of client code.
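
For example, a minimal client sketch (the field name "expires" and the update
URL are assumptions, not Solr defaults): compute the cutoff date on the client
and post a delete-by-query followed by a commit.

  import java.io.OutputStream;
  import java.net.HttpURLConnection;
  import java.net.URL;

  class ExpireOldDocs {
    // Posts one XML command body to the Solr update handler.
    static void post(String xml) throws Exception {
      HttpURLConnection con = (HttpURLConnection)
          new URL("http://localhost:8983/solr/update").openConnection();
      con.setDoOutput(true);
      con.setRequestProperty("Content-Type", "text/xml; charset=UTF-8");
      try (OutputStream out = con.getOutputStream()) {
        out.write(xml.getBytes("UTF-8"));
      }
      con.getResponseCode(); // force the request to complete
    }

    public static void main(String[] args) throws Exception {
      String cutoff = "2007-01-01T00:00:00Z"; // cutoff computed by the client
      post("<delete><query>expires:[* TO " + cutoff + "]</query></delete>");
      post("<commit/>");
    }
  }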

wunder



Re: Can this be achieved? (Was: document support for file system crawling)

2007-01-19 Thread Walter Underwood
On 1/19/07 10:33 AM, "Chris Hostetter" <[EMAIL PROTECTED]> wrote:

> [...] but if your interest is in
> having an "enterprise search solution" that people can deploy on a box
> and haveit start working for them, then there is no reason for all of that
> code to run in a single JVM using a single code base -- i'm going to go
> out on a limb and guess that that the Google Appliances run more then a
> single process :)

Ultraseek does exactly that and is a single multi-threaded process.
A single process is much easier for the admin. A multi-process solution
is more complicated to start up, monitor, shut down, and upgrade.

There is decent demand for a spidering enterprise search engine.
Look at the Google Appliance, Ultraseek, and IBM OmniFind. The
free IBM OmniFind Yahoo! Edition uses Lucene.

I'd love to see the Ultraseek spider connected to Solr, but that
depends on Autonomy.

wunder
-- 
Walter Underwood
Search Guru, Netflix




Re: Java version for solr development (was Re: Update Plugins)

2007-01-16 Thread Walter Underwood
On 1/16/07 8:03 PM, "Yonik Seeley" <[EMAIL PROTECTED]> wrote:

> I think it's a bit soon to move to 1.6 - I don't know how many
> platforms it's available for yet.

It is still in "early release" from IBM for their PowerPC
servers, so requiring 1.6 would be a serious problem for us.

wunder
-- 
Walter Underwood
Search Guru, Netflix




Re: Handling disparate data sources in Solr

2007-01-08 Thread Walter Underwood
On 1/7/07 7:24 AM, "Erik Hatcher" <[EMAIL PROTECTED]> wrote:

> The idea of having Solr handle various document types is a good one,
> for sure.  I'm not sure what specifics would need to be implemented,
> but I at least wanted to reply and say its a good idea!

The design issue for this is to be clear about the schema and how
documents are mapped into the schema. If all document types are
mapped into the same schema, then one type of query will work
for all. If the documents have different schemas (in the search
index), then the query needs an expansion specific to each
document type.

Example: I have RFC-2822 mail messages with "Subject:" and
HTML with "". If I store those in Solr as subject and
title fields, then each query needs to search both fields.
If I put them both in a "document_title" field, then the
query can search one field.


wunder
-- 
Walter Underwood
Search Guru, Netflix




Re: [jira] Commented: (SOLR-85) [PATCH] Add update form to the admin screen

2006-12-18 Thread Walter Underwood
On 12/18/06 7:52 AM, "Thorsten Scherler"
<[EMAIL PROTECTED]> wrote:

> On Fri, 2006-12-15 at 11:16 -0800, Chris Hostetter wrote:
>> : The next thing on my list is to write a small cli based on httpclient to
>> : send the update docs as alternative of the post.sh.
>> 
>> You may want to take a look at SOLR-20 and SOLR-30 ... those issues are
>> first stabs at Java Client APIs for query/update which if cleaned up a bit
>> could become the basis for your CLI.
> 
> Hmm, I had a look at them but actually what I came up with is way
> smaller and more focused on the update part.
> 
> https://issues.apache.org/jira/browse/SOLR-86
> 
> It is a replacement of the post.sh not much more (yet).

I'll take a look at this. I also wrote my own, because
I had no idea that the Java client code existed.

wunder
-- 
Walter Underwood
Search Guru, Netflix




Heavily-populated bit sets

2006-12-12 Thread Walter Underwood
As an aside to SOLR-80, there is a standard trick for compressing a bit
set with more than half the bits set. You invert it, make it less than
half full, then store that. Basically, store the zeroes instead of the
ones. It costs one extra bit to say whether it is inverted or not.
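
A small sketch of the trick with java.util.BitSet (purely illustrative, not
the SOLR-80 code):

  import java.util.BitSet;

  class InvertedBitSet {
    final boolean inverted; // the one extra bit
    final BitSet bits;      // the sparser of the two representations

    /** If more than half of the first size bits are set, keep the complement. */
    InvertedBitSet(BitSet set, int size) {
      BitSet copy = (BitSet) set.clone();
      if (copy.cardinality() > size / 2) {
        copy.flip(0, size);   // store the zeroes instead of the ones
        inverted = true;
      } else {
        inverted = false;
      }
      bits = copy;
    }

    /** Membership test that honors the inversion flag. */
    boolean get(int docId) {
      return inverted ? !bits.get(docId) : bits.get(docId);
    }
  }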

wunder
-- 
Walter Underwood
Search Guru, Netflix




Re: Finalizing SOLR-58

2006-12-06 Thread Walter Underwood
On 12/6/06 10:54 AM, "Chris Hostetter" <[EMAIL PROTECTED]> wrote:

> P.S.
> Wunder - http://www.cafeconleche.org/books/bible3/chapters/ch15.html was
> invaluable, thanks.

No kidding. It is a complete Yoda session on XSLT. --wunder



[jira] Commented: (SOLR-73) schema.xml and solrconfig.xml use CNET-internal class names

2006-12-05 Thread Walter Underwood (JIRA)
[ 
http://issues.apache.org/jira/browse/SOLR-73?page=comments#action_12455684 ] 

Walter Underwood commented on SOLR-73:
--

Remember, this bug is only about removing aliased names from the sample files.

Note that the users in favor of having alias-free sample files are all new to 
Solr. The people in favor of keeping them are generally long-time Solr users or 
developers. From a new user point of view, they are confusing.

Adding explicit alias definitions is a separate issue.




> schema.xml and solrconfig.xml use CNET-internal class names
> ---
>
> Key: SOLR-73
> URL: http://issues.apache.org/jira/browse/SOLR-73
> Project: Solr
>  Issue Type: Bug
>  Components: search
>    Reporter: Walter Underwood
>
> The configuration files in the example directory still use the old 
> CNET-internal class names, like solr.LRUCache instead of 
> org.apache.solr.search.LRUCache.  This is confusing to new users and should 
> be fixed before the first release.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Commented: (SOLR-73) schema.xml and solrconfig.xml use CNET-internal class names

2006-11-28 Thread Walter Underwood (JIRA)
[ 
http://issues.apache.org/jira/browse/SOLR-73?page=comments#action_12454190 ] 

Walter Underwood commented on SOLR-73:
--

The context required to resolve the ambiguity is a wiki page that I didn't know 
existed. Since I didn't know about it, I tried to figure it out by reading the 
code, and then by sending e-mail to the list. In my case, I was writing two 
tiny classes, but the issue would be the same if I was a non-programmer adding 
some simple plug-ins.

With a full class name, there is no ambiguity. Again, this saves typing at the 
cost of requiring an indirection through some unspecified documentation.

I saw every customer support e-mail for eight years with Ultraseek, so I'm 
pretty familiar with the problems that search engine admins run into. 
One of the things we learned was that documentation doesn't fix an unclear 
product. You fix the product instead of documenting how to understand it.

Requiring users to edit an XML file is a separate issue, but I think it is a 
serious problem, especially because any error messages show up in the server 
logs. 


> schema.xml and solrconfig.xml use CNET-internal class names
> ---
>
> Key: SOLR-73
> URL: http://issues.apache.org/jira/browse/SOLR-73
> Project: Solr
>  Issue Type: Bug
>  Components: search
>Reporter: Walter Underwood
>
> The configuration files in the example directory still use the old 
> CNET-internal class names, like solr.LRUCache instead of 
> org.apache.solr.search.LRUCache.  This is confusing to new users and should 
> be fixed before the first release.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Commented: (SOLR-73) schema.xml and solrconfig.xml use CNET-internal class names

2006-11-28 Thread Walter Underwood (JIRA)
[ 
http://issues.apache.org/jira/browse/SOLR-73?page=comments#action_12454159 ] 

Walter Underwood commented on SOLR-73:
--

I think the aliases are harder to read. You need to go elsewhere to figure them 
out. I read documentation, but I didn't find the part of the wiki that 
explained them and I had to ask the mailing list.

The javadoc uses the full class name. Google and Yahoo searches should work 
better with the full class name (Yahoo is working much better than Google for 
that right now).

The aliases save typing, but I don't think they improve usability. Full class 
names are simple and unambiguous.

If we want usability for non-programmers, we can't have them editing an XML 
file. 


> schema.xml and solrconfig.xml use CNET-internal class names
> ---
>
> Key: SOLR-73
> URL: http://issues.apache.org/jira/browse/SOLR-73
> Project: Solr
>  Issue Type: Bug
>  Components: search
>Reporter: Walter Underwood
>
> The configuration files in the example directory still use the old 
> CNET-internal class names, like solr.LRUCache instead of 
> org.apache.solr.search.LRUCache.  This is confusing to new users and should 
> be fixed before the first release.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Commented: (SOLR-73) schema.xml and solrconfig.xml use CNET-internal class names

2006-11-28 Thread Walter Underwood (JIRA)
[ 
http://issues.apache.org/jira/browse/SOLR-73?page=comments#action_12454066 ] 

Walter Underwood commented on SOLR-73:
--

The aliasing requires documentation and using the full class names doesn't. It 
seems much simpler to me to use the real class names. Less to maintain, less to 
test, less to explain. 

> schema.xml and solrconfig.xml use CNET-internal class names
> ---
>
> Key: SOLR-73
> URL: http://issues.apache.org/jira/browse/SOLR-73
> Project: Solr
>  Issue Type: Bug
>  Components: search
>    Reporter: Walter Underwood
>
> The configuration files in the example directory still use the old 
> CNET-internal class names, like solr.LRUCache instead of 
> org.apache.solr.search.LRUCache.  This is confusing to new users and should 
> be fixed before the first release.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Created: (SOLR-73) schema.xml and solrconfig.xml use CNET-internal class names

2006-11-28 Thread Walter Underwood (JIRA)
schema.xml and solrconfig.xml use CNET-internal class names
---

 Key: SOLR-73
 URL: http://issues.apache.org/jira/browse/SOLR-73
 Project: Solr
  Issue Type: Bug
  Components: search
Reporter: Walter Underwood


The configuration files in the example directory still use the old 
CNET-internal class names, like solr.LRUCache instead of 
org.apache.solr.search.LRUCache.  This is confusing to new users and should be 
fixed before the first release.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




Re: [jira] Commented: (SOLR-58) Change Admin components to return XML like the rest of the system

2006-11-27 Thread Walter Underwood
On 11/27/06 1:52 PM, "Chris Hostetter" <[EMAIL PROTECTED]> wrote:

> 
> : Hoss, regarding your point 7) about ping - makes sense.  I think this is
> : what Walter Underwood was talking about in a recent thread, too.  So
> : what should the ping response look like in case of success and in case
> : of error?
> 
> i don't think the response body matters to most monitoring systems ... as
> long as the status is correct ... your example looks fine.

Right. The body can give additional info which might be really
handy for a monitoring script, but the important thing is to get
the response code right.

wunder
-- 
Walter Underwood
Search Guru, Netflix





Re: Cocoon-2.1.9 vs. SOLR-20 & SOLR-30

2006-11-22 Thread Walter Underwood
On 11/20/06 5:51 PM, "Yonik Seeley" <[EMAIL PROTECTED]> wrote:
>> : If you really want to handle failure in an error response, write that
>> : to a string and if that fails, send a hard-coded string.
>> 
>> Hmmm... i could definitely get on board an idea like that.
> 
> I took pains to make things streamable.. I'd hate to discard that.
> How do other servers handle streaming back a response and hitting an error?

You found the design tradeoff! We can stream the results or we can
give reliable error codes for errors that happen during result processing.
We can't do both. Ultraseek does streaming, but we were generating
HTML, so we could print reasonable errors in-line.

Streaming is very useful for HTML pages, because it allows the first
pixels to be painted as soon as possible. It isn't as important on the
back end, unless someone has gone to the considerable trouble of making
their entire front-end able to stream the back-end results to HTML.

If we aren't calling Writer.flush occasionally, then the streaming is
just filling up a buffer smoothly. The client won't see anything until
TCP decides to send it.

Does Lucene fetch information from disk while we iterate
through the search results? If that happens a few times, then
streaming might make a difference. If it is mostly CPU-bound,
then streaming probably doesn't help.

wunder
-- 
Walter Underwood
Search Guru, Netflix




Re: Cocoon-2.1.9 vs. SOLR-20 & SOLR-30

2006-11-21 Thread Walter Underwood
One way to think about this is to assume caches, proxies, and load balancing
in the HTTP path, then think about their behavior. A 500 response may make
the load balancer drop this server from the pool, for example. A 200 OK
can be cached, so temporary errors shouldn't be sent with that code.

On 11/20/06 10:51 AM, "Chris Hostetter" <[EMAIL PROTECTED]> wrote:
> 
> ...there's kind of a chicken/egg problem with this discussion ... the egg
> being "what should the HTTP response look like in an 'error' situation"
> the chicken being "what is the internal API to allow a RequestHandler to
> denote an 'error' situation" ... talking about specific cases only gets us
> so far since those cases may not be errors in all RequestHandlers.

We can get most of the benefit with a few kinds of errors: 400, 403, 404,
500, and 503. Roughly:

400 - error in the request, fix it and try again
403 - forbidden, don't try again
404 - not found, don't try again unless you think it is there now
500 - server error, don't try again
503 - server error, try again

These can be mapped from internal error types.
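
A rough sketch of one such mapping (the exception classes below are stand-ins
for whatever internal error types Solr ends up using):

  import java.io.FileNotFoundException;
  import java.util.concurrent.RejectedExecutionException;

  class HttpStatusMapper {
    /** Map broad internal error categories to HTTP status codes. */
    static int statusFor(Throwable t) {
      if (t instanceof IllegalArgumentException)   return 400; // caller error, fix and retry
      if (t instanceof SecurityException)          return 403; // forbidden, don't retry
      if (t instanceof FileNotFoundException)      return 404; // not found
      if (t instanceof RejectedExecutionException) return 503; // overloaded, retry later
      return 500;                                  // server error, don't retry
    }
  }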

> the problem gets even more complicated when you try to answer the
> question: what should Solr do if an OutputWriter encounters an error? ...
> we can't generate a valid JSON response dnoting an error if the
> JSONOutputWriter is failing :)

Write the response to a string before sending the headers. This can be
slower than writing the response out as it is computed, but the response
codes can be accurate. Also, it allows optimal buffering, so it might
scale better.

If you really want to handle failure in an error response, write that
to a string and if that fails, send a hard-coded string.
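
In servlet terms, a sketch of that buffer-first approach might look like this
(the Renderer hook is hypothetical, standing in for whatever output writer
produces the body):

  import java.io.IOException;
  import java.io.StringWriter;
  import java.io.Writer;
  import javax.servlet.http.HttpServletResponse;

  class BufferedResponder {
    /** Hook for whatever produces the response body. */
    interface Renderer { void write(Writer out) throws Exception; }

    /** Render the whole body first; only then commit status and headers. */
    static void send(HttpServletResponse resp, Renderer renderer) throws IOException {
      StringWriter buf = new StringWriter();
      try {
        renderer.write(buf);                 // may fail halfway through
      } catch (Exception e) {
        resp.setStatus(500);                 // nothing committed yet, so this is accurate
        resp.getWriter().write("Internal error"); // hard-coded fallback body
        return;
      }
      resp.setStatus(200);
      resp.getWriter().write(buf.toString());
    }
  }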

wunder
-- 
Walter Underwood
Search Guru, Netflix




Phonetic Token Filter

2006-11-21 Thread Walter Underwood
I've written a simple phonetic token filter (and factory) based
on the Double Metaphone implementation in Jakarta Codecs to
contribute. Three questions:

1. Does this sound like a generally useful addition?

2. Should we have a Jira issue first?

3. This adds a depencency on the codecs jar. How do we add that
to the distro?

The code is very simple, but I need to learn the contribution
process and build some tests, so this won't happen in one day.

wunder
-- 
Walter Underwood
Search Guru, Netflix




Re: Cocoon-2.1.9 vs. SOLR-20 & SOLR-30

2006-11-21 Thread Walter Underwood
On 11/20/06 5:51 PM, "Yonik Seeley" <[EMAIL PROTECTED]> wrote:

> Now that I think about it though, one nice change would be to get rid
> of the long stack trace for 400 exceptions... it's not needed, right?

That is correct. A client error (400) should not be reported with a
server stack trace. --wunder



Re: Cocoon-2.1.9 vs. SOLR-20 & SOLR-30

2006-11-21 Thread Walter Underwood
On 11/20/06 7:22 PM, "Fuad Efendi" <[EMAIL PROTECTED]> wrote:
> This is just a sample...
> 
> 1. What is an Error?
> 2. What is a Mistake?
> 3. What is an application bug?
> 4. What is a 'system crash'?

These are not HTTP concepts. The request on a URI can succeed or fail
or result in other codes. Mistakes and crashes are outside of the HTTP
protocol.

> Of cource, XML-over-HTTP engine is not the same as HTML-over-HTTP...
> However... Walter noticed 'crawling'... I can't imagine a company which will
> put SOLR as a front-end accessible to crawlers... (To crawl an indexing
> service instead of source documents!?)

XML-over-HTTP is exactly the same as HTML-over-HTTP. In HTML, we
could return detailed error information in a meta tag. No difference.

If something is on HTTP, a good crawler can find it. All it takes is
one link, probably to the admin URL. Once found, that crawler will
happily pound on errors returned by 200.

XSLT support means you could build the search UI natively on Solr,
so that might happen.

Even without a crawler, we must work with caches and load balancers.
I will be using Solr with a load balancer in production. If Solr is
a broken HTTP server, we will have to build something else.

> I am sure that mixing XML-based interface with HTTP status codes is not an
> attractive 'architecture', we shold separate conserns and leave HTTP code
> handling to a servlet container as much as possible...

We don't need to use HTTP response codes deep in Solr, but we do need
to separate bad parameters, retryable errors, non-retryable errors, and
so on. We can call them what ever we want internally, but we need to
report them properly over HTTP.

wunder
-- 
Walter Underwood
Search Guru, Netflix

 



Re: Cocoon-2.1.9 vs. SOLR-20 & SOLR-30

2006-11-17 Thread Walter Underwood
On 11/17/06 2:50 PM, "Fuad Efendi" <[EMAIL PROTECTED]> wrote:
>
> We should probably separate business-related end-user errors (such as when
> user submits empty query) and make it XML-like (instead of HTTP 400)

Speaking as a former web spider maintainer, it is very important to keep
the HTTP response codes accurate. Never return an error with a 200.

If we want more info, return an entity (body) with the 400 response.

wunder
-- 
Walter Underwood
Search Guru, Netflix




Re: SOLR-58

2006-11-09 Thread Walter Underwood
On 11/8/06 11:10 PM, "Otis Gospodnetic" <[EMAIL PROTECTED]> wrote:
> 
> I'd like to try writing some XSLs to convert that XML to HTML, so I need some
> additional eye on that XML output.  I've never written a single line XSL, so
> it will take me a while, and I'd love to get SOLR-58 in by the end of the week
> or so.

I learned it from Elliotte Rusty Harold's tutorial:

  http://www.cafeconleche.org/books/bible3/chapters/ch15.html

It worked for me. XSLT is a pretty odd language and takes a while
to get into. To me, it feels like an extremely verbose (and ugly)
subset of Prolog.

wunder
-- 
Walter Underwood
Search Guru, Netflix





Re: Adding Phonetic Search to Solr

2006-11-08 Thread Walter Underwood

On 11/8/06 10:30 AM, "Chris Hostetter" <[EMAIL PROTECTED]> wrote:

> : Also, the phonetic matches are ranked a bit high, so I'm trying a
> : sub-1.0 boost. I was expecting the lower idf to fix that automatically.
> : The metaphone will almost always have a lower idf because multiple
> : words are mapped to one metaphone, so the encoded term occurs in more
> : documents than the surface terms.
> 
> That all makes sense, and yet it's not what you are observing ... which
> leads me to believe you (and I since i want to agree with you) are missing
> something subtle  what does the the Explanation look like for two
> documenets where you feel like one should score higher then the other but
> they don't?

That is my next step. Maybe create some test documents in my corpus and
spend some quality time with Explain and grokking DisMax. I need to
customize Similarity anyway.

wunder
-- 
Walter Underwood
Search Guru, Netflix




Re: Adding Phonetic Search to Solr

2006-11-08 Thread Walter Underwood
On 11/7/06 5:44 PM, "Otis Gospodnetic" <[EMAIL PROTECTED]> wrote:

> Grab the code from Lucene in Action, it's got something to get you going, see:
> 
>   http://www.lucenebook.com/search?query=metaphone

Thanks. I thought about looking that up (I have the book), but the
code is really trivial inside Solr. The per-field analyzer takes
care of most of the fuss. The meat is a single line of code in the
token filter using the DoubleMetaphone class from commons codec.

  return new Token(dm.encode(token.termText()),
                   token.startOffset(),
                   token.endOffset());

Everything else is just initialization and declaration.
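
For context, a sketch of the surrounding filter class against the old Lucene
TokenStream API of that era (not the exact code that was contributed):

  import java.io.IOException;
  import org.apache.commons.codec.language.DoubleMetaphone;
  import org.apache.lucene.analysis.Token;
  import org.apache.lucene.analysis.TokenFilter;
  import org.apache.lucene.analysis.TokenStream;

  public class DoubleMetaphoneTokenFilter extends TokenFilter {
    private final DoubleMetaphone dm = new DoubleMetaphone();

    public DoubleMetaphoneTokenFilter(TokenStream input) {
      super(input);
    }

    public Token next() throws IOException {
      Token t = input.next();
      if (t == null) return null;
      // Replace the surface form with its phonetic encoding, keeping offsets.
      return new Token(dm.encode(t.termText()), t.startOffset(), t.endOffset());
    }
  }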

A naming convention question: should the class names end in
Filter or TokenFilter (and FilterFactory or TokenFilterFactory)?
I see both in org.apache.solr.analysis.

I'm a bit disappointed in the performance, though. It is half the
speed when adding two phonetic fields to search. Dropped from 300
qps to 130. On the other hand, I never thought I'd be complaining
about an engine delivering over 100 qps!

Could that be from searching extra fields? Indexing is the same
speed, so it shouldn't be the DoubleMetaphone class. I'm still
trying to get a feel for Lucene performance after years with the
Ultraseek engine.

Also, the phonetic matches are ranked a bit high, so I'm trying a
sub-1.0 boost. I was expecting the lower idf to fix that automatically.
The metaphone will almost always have a lower idf because multiple
words are mapped to one metaphone, so the encoded term occurs in more
documents than the surface terms.

One neat trick -- if regular terms are lowercased, they will never
collide with the metaphones, which are all upper case.

wunder
-- 
Walter Underwood
Search Guru, Netflix





Re: Adding Phonetic Search to Solr

2006-11-07 Thread Walter Underwood
On 11/7/06 3:26 PM, "Mike Klaas" <[EMAIL PROTECTED]> wrote:

> Is the state of the art in phonetic token generation reasonable?  I've
> been rather disappointed with some implementations (eg. SOUNDEX in
> MySQL, MSSQL).

SOUNDEX is excellent technology for its time, but its time was 1920.

Double Metaphone is far more complex and works fairly well. There is
an Apache commons codec implementation available. It is certainly
good enough for matching proper names, like Moody and Mudie or
Cathy and Kathie.

There are some commercial phonetic coders, but I don't have any
experience with those.

wunder
-- 
Walter Underwood
Search Guru, Netflix




Re: Adding Phonetic Search to Solr

2006-11-07 Thread Walter Underwood
On 11/7/06 2:30 PM, "Mike Klaas" <[EMAIL PROTECTED]> wrote:
> On 11/7/06, Walter Underwood <[EMAIL PROTECTED]> wrote:
>> 
>> 1. Adding fuzzy to the DisMax specs.
> 
> What do you envisage the implementation looking like?

Probably continue with the template-like patterns already there.

  title^2.0   (search title field with boost of 2.0)
  title~  (search title field with fuzzy matching)

>> 2. Adding a phonetic token filter and relying on the per-field analyzer
>> support.
> 
> I'm not sure why any modification to solr would be necessary.  You
> could add a field with a phonetic analyzer and use copyField to copy
> your search fields to it.  Search will use the modified analyzer
> automatically.

Ah, I missed the  example with a stock Lucene analyzer.
Oops. I still need to write an Analyzer, because there is no standard
phonetic search in Lucene today. There are some patches and addons
floating around.

Still, it seems like others might want to use a phonetic token
filter with the  specs. I'd be glad to contribute that,
if others think it would be useful.

wunder
-- 
Walter Underwood
Search Guru, Netflix




Adding Phonetic Search to Solr

2006-11-07 Thread Walter Underwood
I haven't found fuzzy or phonetic search in Solr, and I have a couple
of approaches I might try:

1. Adding fuzzy to the DisMax specs.

2. Adding a phonetic token filter and relying on the per-field analyzer
support.

Option 2 seems like it would be a lot faster in production, and
probably easier to implement. Does that seem right?

How do I specify the new token filter factory in the schema file?
I don't quite get the mapping from solr.FooFilterFactory to
org.apache.solr.analysis.FooFilterFactory.

wunder
-- 
Walter Underwood
Search Guru, Netflix




Re: [jira] Commented: (SOLR-66) bulk data loader

2006-11-07 Thread Walter Underwood
On 11/7/06 11:22 AM, "Yonik Seeley (JIRA)" <[EMAIL PROTECTED]> wrote:

> Yes, posting queries work because it's all form-data (query args).
> But, what if we want to post a complete file, *and* some extra info/parameters
> about how that file should be handled?

One approach is the Atom Publishing Protocol. That is pretty clear
about content and metainformation. It isn't designed to solve every
problem, but it handles a broad range of publishing, so it could be
a good fit for many uses of Solr.

APP is nearly finished. The latest draft is here (second URL also
has HTML versions).

 http://www.ietf.org/internet-drafts/draft-ietf-atompub-protocol-11.txt
 http://tools.ietf.org/wg/atompub/draft-ietf-atompub-protocol/

wunder
-- 
Walter Underwood
Search Guru, Netflix




Re: [jira] Created: (SOLR-60) Remove overwritePending, overwriteCommitted flags?

2006-11-01 Thread Walter Underwood
+1 as well. --wunder

On 11/1/06 11:17 AM, "Mike Klaas" <[EMAIL PROTECTED]> wrote:

> +1
> 
> On 11/1/06, Yonik Seeley (JIRA) <[EMAIL PROTECTED]> wrote:
>> Remove overwritePending, overwriteCommitted flags?
>> --
>> 
>>  Key: SOLR-60
>>  URL: http://issues.apache.org/jira/browse/SOLR-60
>>  Project: Solr
>>   Issue Type: Improvement
>>   Components: update
>> Reporter: Yonik Seeley
>> Priority: Minor
>> 
>> 
>> The overwritePending, overwriteCommitted, allowDups flags seem needlessly
>> complex and don't add much value.  Do people need/use separate control over
>> pending vs committed documents?
>> 
>> Perhaps all most people need is overwrite=true/false?
>> 
>> overwritePending, overwriteCommitted were originally added because it was a
>> (mis)feature that another internal search tool had.
>> 
>> --
>> This message is automatically generated by JIRA.
>> -
>> If you think it was sent incorrectly contact one of the administrators:
>> http://issues.apache.org/jira/secure/Administrators.jspa
>> -
>> For more information on JIRA, see: http://www.atlassian.com/software/jira
>> 
>> 
>> 



Re: Copying the request parameters to Solr's response

2006-10-24 Thread Walter Underwood
On 10/24/06 7:22 AM, "Bertrand Delacretaz" <[EMAIL PROTECTED]> wrote:
> On 10/24/06, Yonik Seeley <[EMAIL PROTECTED]> wrote:
> 
>> ...I imagine this would just be for explicitly passed parameters?...
> 
> I think so, the defaults would be re-applied anyway, if the client
> makes another request with the same parameters.
> 
> -Bertrand

The defaults can change, especially if the client saves results.
If possible, you want to return a full context for the results.

Ultraseek has had XML results for several years and a full query
context would have been useful in several situations. The Ultraseek
result format targeted a different problem, returning enough info
to calculate a global IDF across multiple collections and re-score
the combined results.

http://search.ultraseek.com/saquery.xml?qt=saquery.xml&col=usdc&col=docs
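
The combination itself is just the usual log(N/df) over the summed counts --
a toy sketch of the idea, not Ultraseek's actual scoring code:

  public class GlobalIdf {
      // Each collection reports its document count and the term's doc frequency;
      // the client merges them before re-scoring the combined result list.
      public static double globalIdf(long[] docCounts, long[] docFreqs) {
          long n = 0, df = 0;
          for (long c : docCounts) n += c;
          for (long f : docFreqs)  df += f;
          return Math.log((double) n / Math.max(1, df));
      }
  }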

The Java client library for Ultraseek (XPA) does keep a local results
cache and uses the query plus the query context as a key.

wunder
-- 
Walter Underwood
Search Guru, Netflix
Former Ultraseek Architect



Re: Copying the request parameters to Solr's response

2006-10-24 Thread Walter Underwood
Returning the query parameters is really useful. I'm not sure it
needs to be optional; they are small, and options multiply the test
cases.

It can even be useful to return the values of the defaults.

All those go into the key for any client side caching, for example.
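
A client-side cache key is easy to derive once the effective parameters come
back (a sketch, not from any existing Solr client; multi-valued parameters
would need a bit more care):

  import java.util.Map;
  import java.util.TreeMap;

  public class QueryCacheKey {
      // Sort by parameter name so the key does not depend on argument order.
      public static String keyFor(Map<String, String> params) {
          StringBuilder sb = new StringBuilder();
          for (Map.Entry<String, String> e : new TreeMap<String, String>(params).entrySet()) {
              sb.append(e.getKey()).append('=').append(e.getValue()).append('&');
          }
          return sb.toString();
      }
  }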

wunder

On 10/24/06 1:55 AM, "Erik Hatcher" <[EMAIL PROTECTED]> wrote:

> I think its a good idea, but it probably should be made optional.
> Clients can keep track of the state themselves, and keeping the
> response size as small as possible is valuable.  But it would be
> helpful in some situations for the client to get the original query
> context sent back too.
> 
> Erik
> 
> 
> On Oct 24, 2006, at 4:20 AM, Bertrand Delacretaz wrote:
> 
>> Hi,
>> 
>> I need to implement paging of Solr result sets, and (unless I have
>> overlooked something that already exists) it would be useful to copy
>> the request parameters to the output.
>> 
>> I'm thinking of adding something like this to the XML output:
>> 
>>  <responseHeader>
>>   <params>
>>    <param name="q">author:Leonardo</param>
>>    <param name="start">24</param>
>>    <param name="rows">12</param>
>>    etc...
>> 
>> I don't think the SolrParams class provides an Iterator to retrieve
>> all parameters, I'll add one to implement this.
>> 
>> WDYT?
>> 
>> -Bertrand
> 



Re: Solr NightlyBuild

2006-09-20 Thread Walter Underwood
I agree that a release would be useful for marketing, but I also
think it would help exercise the community and the release process.

I just discovered Solr on Friday and I've been telling people about
it, but every e-mail includes "you need to be OK with nightly builds."

Being OK with nightly builds means that you need to run your own
QA on the whole build every time you update. Kinda expensive.

wunder
--
Walter Underwood
Search Guru, Netflix



Re: double curl calls in post.sh?

2006-09-18 Thread Walter Underwood
On 9/18/06 10:10 AM, "Yonik Seeley" <[EMAIL PROTECTED]> wrote:
> On 9/18/06, Walter Underwood <[EMAIL PROTECTED]> wrote:
>> Instead, use a media type of application/xml, so that the server
>> is allowed to sniff the content to discover the character encoding.
> 
> Cool!  Do you know what servlet containers currently implement this
> "sniffing"?

XML parsers already do this correctly. They look at the XML declaration
for the encoding, and if that isn't there, they look for a BOM or
UTF-8 content, as described in the (non-normative) appendix to the
XML spec.

  http://www.w3.org/TR/REC-xml/#sec-guessing

The servlet container needs to hand the raw bytes to the parser,
which should be normal behavior for application/*.
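
For example, with a stock JAXP parser and no charset hint at all (a standalone
sketch, nothing Solr-specific):

  import java.io.ByteArrayInputStream;
  import javax.xml.parsers.DocumentBuilder;
  import javax.xml.parsers.DocumentBuilderFactory;
  import org.w3c.dom.Document;

  public class EncodingSniffDemo {
      public static void main(String[] args) throws Exception {
          String xml = "<?xml version=\"1.0\" encoding=\"UTF-16\"?><add><doc/></add>";
          byte[] raw = xml.getBytes("UTF-16");  // UTF-16 bytes, including a BOM
          DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
          Document doc = db.parse(new ByteArrayInputStream(raw));  // parser detects the encoding
          System.out.println(doc.getDocumentElement().getNodeName());  // prints "add"
      }
  }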

wunder
--
Walter Underwood
Search Guru, Netflix



Re: double curl calls in post.sh?

2006-09-18 Thread Walter Underwood
Also, do not use text/xml, even with a charset parameter. In a correct
implementation, the charset parameter overrides the XML declaration of
the charset, so with text/xml it must be correct. When the parameter is
omitted, the content MUST be interpreted as US-ASCII (yuk).

Instead, use a media type of application/xml, so that the server
is allowed to sniff the content to discover the character encoding.

For the gory details, see RFC 3023:

  http://www.ietf.org/rfc/rfc3023.txt
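
The same applies to any client, not just curl. A rough Java sketch (the URL
and document here are made up):

  import java.io.OutputStream;
  import java.net.HttpURLConnection;
  import java.net.URL;

  public class PostXmlDemo {
      public static void main(String[] args) throws Exception {
          byte[] body = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><add><doc/></add>"
                  .getBytes("UTF-8");
          URL url = new URL("http://localhost:8983/solr/update");
          HttpURLConnection conn = (HttpURLConnection) url.openConnection();
          conn.setDoOutput(true);
          conn.setRequestMethod("POST");
          // application/xml, no charset parameter: let the XML parser find the encoding
          conn.setRequestProperty("Content-Type", "application/xml");
          OutputStream out = conn.getOutputStream();
          out.write(body);
          out.close();
          System.out.println("HTTP " + conn.getResponseCode());
      }
  }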

wunder
==
Walter Underwood
Search Guru, Netflix

On 9/17/06 1:00 PM, "Chris Hostetter" <[EMAIL PROTECTED]> wrote:

> 
> am i smoking crack or is post.sh mistakenly sending every doc twice in a
> row? ...
> 
> for f in $FILES; do
>   echo Posting file $f to $URL
>   curl $URL --data-binary @$f
>   curl $URL --data-binary @$f -H 'Content-type:text/xml; charset=utf-8'
>   echo
> done
> 
> 
> ...is there any reason not to delete that first execution of curl?
> 
> 
> 
> -Hoss
>