Re: Simple Faceted Searching out of the box

2006-09-22 Thread Walter Underwood
going to help a user find nurse. I think part of this is that some people feel that databases like MSSQL, MYSQL should be able to provide quality search experience, but they just flat out don't. It's a separate utility. Thanks Walter. On 9/22/06, Walter Underwood [EMAIL PROTECTED] wrote

Re: wana use CJKAnalyzer

2006-09-25 Thread Walter Underwood
is not legal UTF-8. Does Solr report parsing errors? It really should. Maybe a 400 Bad Request response with a text/plain body showing the error message. wunder On 9/22/06 6:24 PM, James liu [EMAIL PROTECTED] wrote: 2006/9/23, Walter Underwood [EMAIL PROTECTED]: On 9/21/06 5:37 PM, James liu [EMAIL

Re: Extending Solr's Admin functionality

2006-09-27 Thread Walter Underwood
On 9/27/06 9:07 AM, Simon Willnauer [EMAIL PROTECTED] wrote: First I agree with yonik, the main point is to define which classes / parts / mbeans should be exposed to JMX is the hard part and should be planned carefully. That is the hard part regardless of whether we use JMX or bare-metal

Recommended Update Batch Size?

2006-10-31 Thread Walter Underwood
What is a good size for batching updates? My xml update docs are around 600-700 bytes each right now. wunder -- Walter Underwood Search Guru, Netflix

Re: Recommended Update Batch Size?

2006-10-31 Thread Walter Underwood
On 10/31/06 12:54 PM, Mike Klaas [EMAIL PROTECTED] wrote: On 10/31/06, Walter Underwood [EMAIL PROTECTED] wrote: What is a good size for batching updates? My xml update docs are around 600-700 bytes each right now. When I think of batches I think of documents sent before a commit

Re: Solr Benchmarks

2006-11-06 Thread Walter Underwood
small corpus (65K docs) I was seeing over 240 qps on my dev box (dual 3 GHz Xeon). I expect that it didn't touch the disk at all, since the index is only 50 Meg. wunder -- Walter Underwood Search Guru, Netflix

Re: Solr Benchmarks

2006-11-09 Thread Walter Underwood
list, but when I view the message, no attachment is available. Could you try sending this attachment again? Thanks --Joachim Walter Underwood wrote: I've done some testing using JMeter. I followed the instructions in the JMeter FAQ for How do I use external data files in my test scripts

Re: Index search questions; special cases

2006-11-13 Thread Walter Underwood
in other engines. Otherwise, you go nuts trying to get your analyzer to handle .NET and vitamin a. I know that AltaVista and Inktomi did this. wunder -- Walter Underwood Search Guru, Netflix

Re: MatchAllDocsQuery in solr?

2006-11-21 Thread Walter Underwood
can't just override a method of QueryParser to do : this). we could add this to the function parser, so _val_:ALL could return a MatchAllDocsQuery ? I was thinking something similar, maybe _solr:all. At Infoseek, we hardcoded url:http to match all docs. wunder -- Walter Underwood Search Guru

Re: Indexing XML files

2006-12-05 Thread Walter Underwood
At some point, it would be simpler to write a custom response handler and generate the output in your desired XML format. wunder On 12/5/06 1:52 PM, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote: Hi, the idea is to apply XSLT transformation on the result. But it seems that I would have to

Re: Handling disparate data sources in Solr

2006-12-24 Thread Walter Underwood
Analyzer before passing it along. Why won't cdata work? Some octet (byte) values are illegal in XML. Most of the ASCII control characters are not allowed. If one of those is in an XML document, it is a fatal error and must stop parsing in any conforming XML parser. wunder -- Walter Underwood

Re: Better highlighting fragmenter

2007-01-03 Thread Walter Underwood
On 1/3/07 9:33 PM, Yonik Seeley [EMAIL PROTECTED] wrote: On 1/3/07, Walter Underwood [EMAIL PROTECTED] wrote: We tried several APIs and decided that the best was an array of String with the odd elements containing the strings that needed highlighting. Good idea... the only thing I could

Re: Handling disparate data sources in Solr

2007-01-08 Thread Walter Underwood
Ultraseek and the Googlebox are about your only choice. wunder -- Walter Underwood Search Guru, Netflix Former Architect for Ultraseek

Re: Using HTTP-Post for Queries

2007-01-19 Thread Walter Underwood
On 1/19/07 10:02 AM, Brian Lucas [EMAIL PROTECTED] wrote: Walter Underwood wrote: Use GET unless it really, really, really doesn't work. POST is the wrong HTTP semantic for fetching information. Long query strings are not a good enough reason. HTTP puts no limit on the length of a URL

Re: INTERNET ARCHIVE goes SOLR!

2007-01-30 Thread Walter Underwood
it, an AND default is a very bad idea for nearly all sites. wunder -- Walter Underwood Search Guru, Netflix

Re: INTERNET ARCHIVE goes SOLR!

2007-02-01 Thread Walter Underwood
On 1/27/07 1:12 PM, Tracey Jaquith [EMAIL PROTECTED] wrote: * To be fair, Michael StAck (our greatest help for prior SE life support) has smartly pointed out that by making a smarter schema and strategy, I could reduce the number of fields searched from 677 to 5, with the same overall

Re: JOIN in Solr (was: convert custom facets to Solr facets...)

2007-02-03 Thread Walter Underwood
We would never use JOIN. We denormalize for speed. Not a big deal. wunder == Search Guru, Netflix On 2/3/07 11:16 AM, Brian Whitman [EMAIL PROTECTED] wrote: On Feb 2, 2007, at 4:46 PM, Ryan McKinley wrote: I would LOVE to see a JOIN in SOLR. I have an index of artists, albums, and

Re: non-relative scoring

2007-02-13 Thread Walter Underwood
You can declare the top result to be 100% and scale from there. Percent relevant is not a concept that really holds together. What does it mean to be 100% relevant? I'm not even sure what twice as relevant means. A tf.idf engine, like Lucene, might not have a maximum score. What if a document
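A minimal sketch of the "declare the top result to be 100%" idea, assuming raw scores are non-negative floats already sorted by the engine. The function name `scale_to_top` is made up for illustration and is not part of Solr or Lucene.

```python
def scale_to_top(scores):
    """Express each raw score as a percentage of the highest score in the list."""
    if not scores:
        return []
    top = max(scores)
    if top <= 0:
        # Nothing meaningful to scale against; report zero for every hit.
        return [0.0 for _ in scores]
    return [100.0 * s / top for s in scores]


if __name__ == "__main__":
    print(scale_to_top([8.0, 4.0, 2.0]))
```

The percentages are only relative to the best hit for this one query, which is the point made above: they say nothing about absolute relevance.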

Re: common words not stop words?? how to ??

2007-02-19 Thread Walter Underwood
Lucene/Solr does this automatically. That is how a tf.idf engine works, it boosts rare words. Do you have examples of problems or are you worrying about something that might happen? wunder On 2/19/07 1:22 AM, rubdabadub [EMAIL PROTECTED] wrote: Hi: I was wondering how are you guys dealing

Re: AW: solr performance

2007-02-20 Thread Walter Underwood
Indexing rates depend heavily on document size (text) and pre-indexing processing. Other things probably matter, too, like number of fields. My application is indexing 20X faster than Christian's, because I have small documents (a few hundred bytes) that are extracted from an RDBMS and submitted

Re: Re[2]: solr performance

2007-02-20 Thread Walter Underwood
Try running your submits while watching a CPU load meter. Do this on a multi-CPU machine. If all CPUs are busy, you are running as fast as possible. If one CPU is busy (around 50% usage on a dual-CPU system), parallel submits might help. If no CPU is 100% busy, the bottleneck is probably disk

Re: Re[4]: Starting an index...

2007-02-22 Thread Walter Underwood
On 2/22/07 1:37 PM, Jack L [EMAIL PROTECTED] wrote: I wonder what happens if I change the schema after some documents have been inserted? Is this allowed at all? Will the index become corrupted if I add/remove some fields? Or change the field properties? The schema just controls the input

Re: Problem indexing

2007-02-23 Thread Walter Underwood
It is a bug, though. That should send an error message, not a stack trace. --wunder On 2/23/07 10:39 AM, Otis Gospodnetic [EMAIL PROTECTED] wrote: Oh, look at that, adding <field name="id">1</field> took care of the bombing, nice! Thanks, Otis I tried posting that, like this: $ java -jar

Re: merely a suggestion: schema.xml validator or better schema validation logging

2007-03-03 Thread Walter Underwood
I was bitten by this, too. It made getting started a lot harder. I think I had something outside of an lst instead of inside. More recently, I got a query-time exception from a mis-formatted mm field. Right now, Solr accesses the DOM as needed (at runtime) to fetch information. There isn't much

Re: merely a suggestion: schema.xml validator or better schema validation logging

2007-03-04 Thread Walter Underwood
On 3/3/07 1:43 PM, Chris Hostetter [EMAIL PROTECTED] wrote: : Right now, Solr accesses the DOM as needed (at runtime) to fetch : information. There isn't much up-front checking beyond the XML : parser. bingo, and adding more upfront checking is hard for at least two reasons i can think

Re: merely a suggestion: schema.xml validator or better schema validation logging

2007-03-04 Thread Walter Underwood
On 3/4/07 3:01 PM, Chris Hostetter [EMAIL PROTECTED] wrote: I'm actually having a hard time thinking of what kinds of just in time DOM walking is delayed until request ... all of the field names are already known, the analyzers are built, the requesthandlers and responsewriters all exist and

Solr on Tomcat 6.0.10?

2007-03-07 Thread Walter Underwood
Is anyone running Solr on Tomcat 6.0.10? Any issues? I searched the archives and didn't see anything. wunder -- Walter Underwood Search Guru, Netflix

Re: Solr on Tomcat 6.0.10?

2007-03-08 Thread Walter Underwood
Java 1.5.0_05 on Intel and PowerPC (IBM) plus any DST changes. --wunder On 3/8/07 4:08 AM, James liu [EMAIL PROTECTED] wrote: today i use tomcat 6.0.10,,,but no time to search. tomorrow i will test it. which java version you use? 2007/3/8, Walter Underwood [EMAIL PROTECTED

Re: Adding data as UTF-8

2007-03-10 Thread Walter Underwood
It is better to use application/xml. See RFC 3023. Using text/xml; charset=UTF-8 will override the XML encoding declaration. application/xml will not. wunder On 3/10/07 12:39 PM, Bertrand Delacretaz [EMAIL PROTECTED] wrote: On 3/10/07, Morten Fangel [EMAIL PROTECTED] wrote: ...I send a
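A small sketch of posting with application/xml, so that the encoding named in the XML declaration is the one that counts. The URL and the helper name `build_update_request` are assumptions for illustration; the request is only built here, not sent.

```python
from urllib.request import Request


def build_update_request(url, xml_text):
    """Build (but do not send) a POST whose body is UTF-8 encoded XML."""
    body = xml_text.encode("utf-8")
    return Request(
        url,
        data=body,
        # No charset parameter: the XML declaration carries the encoding.
        headers={"Content-Type": "application/xml"},
        method="POST",
    )


if __name__ == "__main__":
    doc = '<?xml version="1.0" encoding="UTF-8"?><add><doc></doc></add>'
    # Hypothetical endpoint; adjust to the local installation.
    req = build_update_request("http://localhost:8983/solr/update", doc)
    print(req.get_method(), req.get_header("Content-type"))
```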

Re: Adding data as UTF-8

2007-03-10 Thread Walter Underwood
If it does something different, that is a bug. RFC 3023 is clear. --wunder On 3/10/07 1:49 PM, Bertrand Delacretaz [EMAIL PROTECTED] wrote: On 3/10/07, Walter Underwood [EMAIL PROTECTED] wrote: It is better to use application/xml. See RFC 3023. Using text/xml; charset=UTF-8 will override

Re: Question About Boosting.

2007-03-10 Thread Walter Underwood
What are you trying to achieve? Let's start with the problem instead of picking one solution which Solr doesn't support. --wunder On 3/10/07 5:08 PM, shai deljo [EMAIL PROTECTED] wrote: How can i boost some tokens over others in the same field (at Index time) ? If this is not supported

Re: Question About Boosting.

2007-03-11 Thread Walter Underwood
that have different importance. I thought boosting would be an elegant way to take this into account. Please advise, On 3/10/07, Walter Underwood [EMAIL PROTECTED] wrote: What are you trying to achieve? Let's start with the problem instead of picking one solution which Solr doesn't support

Re: How to assure a permanent index.

2007-03-21 Thread Walter Underwood
That works if you keep track of all documents that have disappeared since the last index run. Otherwise, you end up with orphans in the search index, documents that exist in search, but not in the real world, also known as serving 404's in results. wunder -- Walter Underwood Search Guru, Netflix

Re: sorting question

2007-03-23 Thread Walter Underwood
You could also promote recent results with a function query term. I've done that for news sites, where recency is an important part of relevancy. --wunder On 3/23/07 4:59 PM, Chris Hostetter [EMAIL PROTECTED] wrote: : Is there a way (in 1 query) to retrieve the best scoring X results and :

Re: How to make the search default use AND instead of OR?

2007-03-27 Thread Walter Underwood
I don't recommend defaulting to AND. This will increase the number of failed searches (no hits) for your users. If one word is misspelled in a multi-word AND query, you'll get no results. Since about 10% of queries are misspelled and about half of queries are multi-word, that will immediately
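A back-of-the-envelope version of the two figures quoted above (about 10% misspelled, about half multi-word). It assumes misspelling is independent of query length, which the message does not state, so treat the result as a rough estimate only.

```python
def estimated_and_failure_rate(misspelled_rate, multiword_rate):
    """Fraction of all queries that are both multi-word and misspelled.

    Assumes the two properties are independent (an assumption, not a
    measured fact), so the joint rate is just the product.
    """
    return misspelled_rate * multiword_rate


if __name__ == "__main__":
    rate = estimated_and_failure_rate(0.10, 0.5)
    print(f"{rate:.0%} of queries are multi-word and misspelled")
```

Under that assumption, roughly one query in twenty is a multi-word query with a misspelling, which is the group an AND default would leave with no hits.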

Re: How to make the search default use AND instead of OR?

2007-03-27 Thread Walter Underwood
On 3/27/07 10:57 AM, Mike Klaas [EMAIL PROTECTED] wrote: I agree with your point above, but I fear AND: bad! OR: good! becoming dogma--often AND+spellcheck is the better option. AND-with-spell-suggestion is better, but the spelling suggestion needs to be really, really good. That is really

Re: SEVERE: Error filterStart

2007-04-05 Thread Walter Underwood
This does seem to be a Tomcat config problem. Start with this search to find other e-mail strings on this: http://www.google.com/search?q=SEVERE%3A+Error+filterStart wunder On 4/5/07 11:43 AM, Chris Hostetter [EMAIL PROTECTED] wrote: : SEVERE: Error filterStart : Apr 5, 2007 10:11:28 AM

Re: Solr logo poll

2007-04-06 Thread Walter Underwood
A --wunder On 4/6/07 10:51 AM, Yonik Seeley [EMAIL PROTECTED] wrote: Quick poll... Solr 2.1 release planning is underway, and a new logo may be a part of that. What form of logo do you prefer, A or B? There may be further tweaks to these pictures, but I'd like to get a sense of what the

Re: Leading wildcards

2007-04-23 Thread Walter Underwood
Here is a late response, apache.org was rejecting our e-mails... Allowing leading wildcards opens up a denial of service attack. It becomes trivial to overload the search engine and take it out of service, just hammer it with leading wildcard queries. Please leave the default as disabled. If we

Re: solr utf 16 ?

2007-04-25 Thread Walter Underwood
UTF-16 support should not require any changes to the XML parsing. All XML parsers are required to support that encoding. The real change is implementing RFC 3023 (XML Media Types) so that the encoding can be specified over HTTP. wunder On 4/23/07 11:13 AM, Mike Klaas [EMAIL PROTECTED] wrote:

Re: expressing this logic

2007-04-25 Thread Walter Underwood
Enable leading wildcards and try this: type:changelog AND filename:*angel* wunder On 4/25/07 1:34 PM, Michael Kimsal [EMAIL PROTECTED] wrote: Thanks. I'm still no results with your suggestion though. I also tried type:+changelog AND ( (filename:angel) OR (filename:angel*) OR

Re: Searchproblem composite words

2007-05-03 Thread Walter Underwood
I agree that multi-word synonyms are an excellent way to do this. This may sound like a hack, but you'd end up doing this even if you had dedicated linguistic compound decomposition software. Those usually use a dictionary of common words and the dictionary rarely has all the words that are

Re: Facet only support english?

2007-05-09 Thread Walter Underwood
I didn't remember that requirement, so I looked it up. It was added in XML 1.0 2nd edition. Originally, unspecified encodings were open for auto-detection. Content type trumps encoding declarations, of course, per RFC 3023 and allowed by the XML spec. wunder On 5/9/07 4:19 PM, Mike Klaas [EMAIL

Re: Solr Sorting, merging/weighting sort fields

2007-05-09 Thread Walter Underwood
No problem. Use a boost function. In a DisMaxRequestHandler spec in solrconfig.xml, specify this: <str name="bf">popularity^0.5</str> This value will be added to the score before ranking. You will probably need to fuss with the multiplier to get the popularity to the right proportion of
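A toy model of an additive boost, to show why the multiplier needs tuning. This is a simplification and not Solr's actual scoring code: it assumes each hit has a text score and a popularity value and simply adds `weight * popularity` before sorting. The field names and `rank_with_boost` are made up for illustration.

```python
def rank_with_boost(docs, weight=0.5):
    """Return document ids ordered by text score plus weighted popularity."""
    scored = [(d["score"] + weight * d["popularity"], d["id"]) for d in docs]
    ordered = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in ordered]


if __name__ == "__main__":
    hits = [
        {"id": "a", "score": 1.0, "popularity": 0.0},
        {"id": "b", "score": 0.8, "popularity": 1.0},
    ]
    print(rank_with_boost(hits, weight=0.5))  # popularity dominates
    print(rank_with_boost(hits, weight=0.1))  # text score dominates
```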

Re: Requests per second/minute monitor?

2007-05-10 Thread Walter Underwood
access log so you can correlate the entries. wunder On 5/9/07 9:43 PM, Ian Holsman [EMAIL PROTECTED] wrote: Walter Underwood wrote: This is for monitoring -- what happened in the last 30 seconds. Log file analysis doesn't really do that. I would respectfully disagree. Log file analysis

Re: Solr Sorting, merging/weighting sort fields

2007-05-10 Thread Walter Underwood
The boost is a way to adjust the weight of that field, just like you adjust the weight of any other field. If the boost is dominating the score, reduce the weight and vice versa. wunder On 5/10/07 9:22 PM, Chris Hostetter [EMAIL PROTECTED] wrote: : Is this correct? bf is a boosting

Re: AW: SOLR Indexing/Querying

2007-05-31 Thread Walter Underwood
I solved something similar to this by creating a stemmer for part numbers. Variations like -BN on the end can be treated as inflections in the part number language, similar to plurals in English. I used a set of regexes to match and transform, in some cases generating multiple root part numbers.
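A rough sketch of a regex-driven part-number stemmer along the lines described above. The suffix patterns are invented for illustration (only the -BN suffix is mentioned in the message), and `part_number_roots` is a hypothetical name; a real rule set would come from the catalog.

```python
import re

# Hypothetical "inflection" rules; each pattern marks a suffix to strip.
SUFFIX_RULES = [
    re.compile(r"-(BN|BK|WH)$"),  # made-up finish/colour codes
    re.compile(r"/[A-Z]$"),       # made-up revision letter
]


def part_number_roots(part_number):
    """Return the normalized part number plus any roots the rules produce."""
    normalized = part_number.strip().upper()
    roots = {normalized}
    for rule in SUFFIX_RULES:
        stripped = rule.sub("", normalized)
        if stripped and stripped != normalized:
            roots.add(stripped)
    return sorted(roots)


if __name__ == "__main__":
    print(part_number_roots("ab123-bn"))
```

Indexing every returned root lets a query for the bare number match the suffixed variant, much as a stemmer lets a singular match a plural.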

Length norm on multi-valued fields

2007-06-04 Thread Walter Underwood
With a multi-valued field, is the length norm based on the individual matched value (string) or on all the tokens in the field? I'm guessing that it is the latter, and I expect I could find that in the source or explain if I looked hard enough, but maybe someone already knows. wunder -- Walter

Re: Length norm on multi-valued fields

2007-06-04 Thread Walter Underwood
On 6/4/07 11:24 AM, Chris Hostetter [EMAIL PROTECTED] wrote: : With a multi-valued field, is the length norm based on the individual : matched value (string) or on all the tokens in the field? I'm guessing : that it is the latter, and I expect I could find that in the source : or explain if I

Re: storing the document URI in the index

2007-06-12 Thread Walter Underwood
Solr doesn't have the URL of the document. The document is given to Solr in an HTTP POST. Solr is not a web spider, it is a search web service. wunder On 6/12/07 6:23 AM, Ard Schrijvers [EMAIL PROTECTED] wrote: Hello Otis, thanks for the info. Would it be an improvement to be able to

Re: Keep having error on unknown field

2007-06-14 Thread Walter Underwood
Do we have a bug filed on this? Solr really should have complained about the unknown element. --wunder On 6/14/07 4:54 PM, Tiong Jeffrey [EMAIL PROTECTED] wrote: arh! i spent 6-7 hours on this error and didnt see this! thanks! On 6/15/07, Yonik Seeley [EMAIL PROTECTED] wrote: On 6/14/07,

Re: Multiple doc types in schema

2007-06-21 Thread Walter Underwood
I used Solr with indexes on NFS and I do not recommend it. It was either 100 or 1000 times slower than local disc for indexing, I forget which. Unusable. This is not a problem with Solr/Lucene, I have seen the same NFS performance cost with other search engines. wunder On 6/21/07 3:22 AM, Otis

Re: Use Windows 1252 encoding...

2007-06-25 Thread Walter Underwood
This is proper behavior according to RFC 3023. An encoding in the XML declaration is ignored unless the content-type is application/xml. wunder On 6/25/07 8:27 AM, Yonik Seeley [EMAIL PROTECTED] wrote: On 6/23/07, Chris Hostetter [EMAIL PROTECTED] wrote: : Is it possible to use Windows

Re: Solr Injection

2007-07-03 Thread Walter Underwood
The Atom Publishing Protocol would be a good choice for a rest API to Solr. That comes with a spec, interop testing, and an active community. wunder On 7/2/07 6:22 PM, Ian Holsman [EMAIL PROTECTED] wrote: Hi. I've been playing with Kettle (http://kettle.pentaho.org/ ) as a method to inject

Re: most popular/most commonly accessed records

2007-07-06 Thread Walter Underwood
Solr doesn't have a record of what documents were accessed. The document cache shows which documents were in the parts of search result list which were served, but probably not a count of those inclusions. Luckily, this information is trivial to get from HTTP server access logs. Look for

Re: searching multiple fields

2007-08-01 Thread Walter Underwood
This caused me a certain amount of trouble, because the parser errors with ill-formed queries. Try these: foo - TO HAVE AND HAVE NOT wunder On 8/1/07 12:47 AM, Chris Hostetter [EMAIL PROTECTED] wrote: : StandardRequestHandler), but I also want to be able to use Lucene's : boolean

Re: searching multiple fields

2007-08-01 Thread Walter Underwood
You get that behavior by avoiding any extra syntax. Use this query: a:valueAlpha b:valueBeta c:valueGamma If one of the terms is very common and one is very rare, it might not sort on pure existence. This is a tf.idf engine. wunder On 8/1/07 11:00 AM, Lance Lance [EMAIL PROTECTED] wrote:

Re: searching multiple fields

2007-08-02 Thread Walter Underwood
Use the minimum match spec for a flexible version of all-terms matching. Before implementing all-terms matching, start logging the number of searches that result in no matches. All-terms can cause big problems. One wrong or misspelled word means no matches, and searchers don't know how to fix

Re: searching multiple fields

2007-08-02 Thread Walter Underwood
, Daniel Naber [EMAIL PROTECTED] wrote: On Thursday 02 August 2007 18:46, Walter Underwood wrote: Use the minimum match spec for a flexible version of all-terms matching. I think this is too difficult and unpredictable. I also don't know how I should justify a setting like 75%, just because

Re: almost realtime updates with replication

2007-08-22 Thread Walter Underwood
At Infoseek, we ran a separate search index with today's updates and merged that in once each day. It requires a little bit of federated search to prefer the new content over the big index, but the daily index can be very nimble for update. wunder On 8/22/07 7:58 AM, mike topper [EMAIL

Re: Running into problems with distributed index and search

2007-08-23 Thread Walter Underwood
How is the performance? For me, Solr got about 100 times faster for update when I moved the files from NFS to local disk. wunder On 8/22/07 2:27 PM, Kasi Sankaralingam [EMAIL PROTECTED] wrote: Instance (index server) for indexing. The index file data directory reside on a NFS partition, I am

Re: Multiple indexes

2007-08-23 Thread Walter Underwood
It should work fine to index them and search them. 13 million docs is not even close to the limits for Lucene and Solr. Have you had problems? wunder On 8/23/07 7:30 AM, Jae Joo [EMAIL PROTECTED] wrote: Is there any solution to handle 13 millions document shown as below? Each document is not

Re: Embedded about 50% faster for indexing

2007-08-28 Thread Walter Underwood
No need to run a separate web server. I actually do HTTP updates from an extra servlet configured into the Solr webserver. It might seem a little odd, but same-system TCP sockets are extremely fast and low overhead. The additional flexibility is nice, too. If I find a bug in the indexing code in

Re: performance questions

2007-08-30 Thread Walter Underwood
Sorry dude, I'm pining for Python and coding in Java. --wunder On 8/30/07 6:57 PM, Erik Hatcher [EMAIL PROTECTED] wrote: On Aug 30, 2007, at 6:31 PM, Mike Klaas wrote: Another reason why people use stored procs is to prevent multiple round-trips in a multi-stage query operation. This is

Re: Can't get 1.2 running under Tomcat 5.5

2007-09-05 Thread Walter Underwood
Not really. It is a very poor substitute for reading the release notes, and sufficiently inadequate that it might not be worth the time. Diffing the example with the previous release is probably more instructive, but might or might not help for your application. A config file checker would be

Re: Indexing very large files.

2007-09-07 Thread Walter Underwood
Legal discovery can have requirements like this. --wunder On 9/7/07 4:47 AM, Brian Carmalt [EMAIL PROTECTED] wrote: Lance Norskog wrote: Now I'm curious: what is the use case for documents this large? Thanks, Lance Norskog It is a rare use case, but could become relevant for

Re: Solr and KStem

2007-09-07 Thread Walter Underwood
Even if KStem isn't ASL, we could include the plug-in code with notes about how to get the stemmer. Or, the Solr plug-in could be contributed to the group that manages the KStem distribution: http://ciir.cs.umass.edu/cgi-bin/downloads/downloads.cgi wunder On 9/7/07 12:59 PM, Yonik Seeley

Re: NFS Stale handle in a distributed SOLR deployment

2007-09-13 Thread Walter Underwood
The straightforward solution is to not put your indexes on NFS. It is slow and it causes failures like this. I'm serious about that. I've seen several different search engines (not just Solr/Lucene) get very slow and unreliable when the indexes were on NFS. wunder On 9/13/07 10:59 AM, Kasi

Re: Query for German Special Characters (i.e., ä, ö, ß)

2007-09-14 Thread Walter Underwood
. Same for mit. In English, that is the Massachusetts Institute of Technology. wunder == Walter Underwood Search Guy, Netflix On 9/14/07 2:09 PM, Marc Bechler [EMAIL PROTECTED] wrote: Hi Tom, thanks for your professional response -- works fine and looks good :-). Since I am playing around

Re: Synchronize large number of records with Solr

2007-09-14 Thread Walter Underwood
You could MD4 the parts you care about, store that, fetch it and compare. If there is a reliable timestamp, you could use that. But that would be app-dependent. In general, you need to store some info about each source document and figure out whether it is new. This gets much hairier with a web
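A sketch of the hash-and-compare idea. It swaps in SHA-256 from Python's hashlib, since MD4 is not reliably available there; any stable digest serves the same purpose. The record layout and the names `fingerprint` and `needs_reindex` are assumptions for illustration.

```python
import hashlib


def fingerprint(record, fields):
    """Digest only the fields that matter for search, in a fixed order."""
    digest = hashlib.sha256()
    for name in fields:
        value = str(record.get(name, ""))
        # NUL separators keep ("ab", "c") distinct from ("a", "bc").
        digest.update(name.encode("utf-8"))
        digest.update(b"\x00")
        digest.update(value.encode("utf-8"))
        digest.update(b"\x00")
    return digest.hexdigest()


def needs_reindex(record, fields, stored_fingerprint):
    """True when the record is new or its indexed fields have changed."""
    return fingerprint(record, fields) != stored_fingerprint


if __name__ == "__main__":
    row = {"title": "Example", "price": 10, "last_seen": 5}
    saved = fingerprint(row, ["title", "price"])
    row["last_seen"] = 6  # a field we do not index
    print(needs_reindex(row, ["title", "price"], saved))
```

Storing the fingerprint alongside each document (or in a side table) is the "some info about each source document" mentioned above.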

Re: clarification needed for the Ranking score

2007-09-21 Thread Walter Underwood
This would probably work, but the approach has a subtle flaw. If a query has one word that matches a lot of titles, but a phrase that matches a description, the best result will be shown far too low, after all the titles. A better approach is to weight the titles a bit higher than the

Re: dataset parameters suitable for lucene application

2007-09-26 Thread Walter Underwood
That seems well within Solr's capabilities, though you should come up with a desired queries/sec figure. Solr's query rate varies widely with the configuration -- how many fields, fuzzy search, highlighting, facets, etc. Essentially, Solr uses Lucene, a modern search core. It has performance and

Re: dataset parameters suitable for lucene application

2007-09-26 Thread Walter Underwood
No one can answer that, because it depends on how you configure Solr. How many fields do you want to search? Are you using fuzzy search? Facets? Highlighting? We are searching a much smaller collection, about 250K docs, with great success. We see 80 queries/sec on each of four servers, and

Re: Converting German special characters / umlaute

2007-09-27 Thread Walter Underwood
Accent transforms are language-specific, so an accent filter should take an ISO language code as an argument. Some examples: * In French and English, a diaeresis is a hint to pronounce neighboring vowels separately, as in coöp, naïve, or Noël. * In German, ü transforms to ue. * In Swedish, ö
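A small sketch of an accent filter that takes a language code, as suggested above. Only the ü to ue rule comes from the message; the other German mappings are the conventional transliterations and are added here as an assumption. For English and French it simply strips combining marks, which is a deliberate simplification. Other languages, Swedish included, are left untouched because the message is cut off before giving the rule.

```python
import unicodedata

# German transliterations; only "ü" -> "ue" is stated in the message.
GERMAN = {
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
    "ß": "ss",
}


def fold_accents(text, lang):
    """Fold accented characters according to a two-letter language code."""
    text = unicodedata.normalize("NFC", text)
    if lang == "de":
        return "".join(GERMAN.get(ch, ch) for ch in text)
    if lang in ("en", "fr"):
        decomposed = unicodedata.normalize("NFD", text)
        return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Unknown language: leave the text alone rather than guess.
    return text


if __name__ == "__main__":
    print(fold_accents("über", "de"), fold_accents("naïve", "en"))
```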

Re: Indexing without application server

2007-09-28 Thread Walter Underwood
I do not think it will be much faster. The data transfer time is small compared to the indexing time. The indexing will probably take less than a day, so if you spend more than 30 minutes coding a faster method, the project will take longer. wunder On 9/28/07 6:06 AM, Jae Joo [EMAIL PROTECTED]

Solr live at Netflix

2007-10-02 Thread Walter Underwood
to upgrade. Thanks everyone, this is a great piece of software. wunder -- Walter Underwood Search Guy, Netflix

Re: Solr live at Netflix

2007-10-02 Thread Walter Underwood
I think Chris Harris is doing that. I'll check it and touch it up afterwards. Avoid race conditions. --wunder On 10/2/07 4:26 PM, Chris Hostetter [EMAIL PROTECTED] wrote: : Here at Netflix, we switched over our site search to Solr two weeks ago. That's great Walter ... could I persuade

Re: Solr live at Netflix

2007-10-04 Thread Walter Underwood
, Walter Underwood [EMAIL PROTECTED] wrote: Here at Netflix, we switched over our site search to Solr two weeks ago. We've seen zero problems with the server. We average 1.2 million queries/day on a 250K item index. We're running four Solr servers with simple round-robin HTTP load-sharing

Re: Real-time replication

2007-10-04 Thread Walter Underwood
We don't use Solr replication. Each server is independent and does its own indexing. This has several advantages: * all installations are identical * no single point of failure * no inter-server version or config dependencies * we can run a different version or config on one server for testing

Re: unable to figure out nutch type highlighting in solr....

2007-10-04 Thread Walter Underwood
Wow, well-formed HTML. That's a rare beast. --wunder On 10/4/07 7:08 PM, Chris Hostetter [EMAIL PROTECTED] wrote: if you have wellformed HTML documents, use an HTML parser to extract the real content.

Re: Indexing XML

2007-10-05 Thread Walter Underwood
Solr is not an XML engine (or a MARC engine). It uses XML as an input format for fielded data. It does not index or search arbitrary XML. You need to convert your XML into Solr's format. I would recommend expressing MARC in a Solr schema, then working on the input XML. The input XML depends on

Re: unable to figure out nutch type highlighting in solr....

2007-10-05 Thread Walter Underwood
That is one seriously manly regex, but I'd recommend using the Tag Soup parser instead: http://ccil.org/~cowan/XML/tagsoup/ wunder On 10/4/07 10:11 PM, J.J. Larrea [EMAIL PROTECTED] wrote: It uses a PatternTokenizerFactory with a RegEx that swallows runs of HTML- or XML-like tags:

Re: High-Availability deployment

2007-10-08 Thread Walter Underwood
We run multiple, identical, independent copies. No master/slave dependencies. Yes, we run indexing N times for N servers, but that's what CPU is for and I sleep better at night. It makes testing and deployment trivial, too. wunder == Walter Underwood Search Guy, Netflix On 10/8/07 4:05 AM

Re: getting number of stored documents via rest api

2007-10-11 Thread Walter Underwood
This even works if you request 0 results. --wunder On 10/11/07 1:56 AM, Stefan Rinner [EMAIL PROTECTED] wrote: On Oct 10, 2007, at 6:49 PM, Chris Hostetter wrote: : I think search for *:* is the optimal code to do it. I don't think you can : do anything faster. FYI: getting the data
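A sketch of the match-all query with zero rows requested, assuming a Solr instance at a hypothetical base URL. It only builds the URL; the total would then be read from the numFound attribute of the result element in the response.

```python
from urllib.parse import urlencode


def count_query_url(base_url):
    """URL for a match-all query that asks for no result rows."""
    params = urlencode({"q": "*:*", "rows": 0})
    return f"{base_url}/select?{params}"


if __name__ == "__main__":
    # Hypothetical host and port; adjust to the local installation.
    print(count_query_url("http://localhost:8983/solr"))
```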

Re: Opensearch XSLT

2007-10-12 Thread Walter Underwood
There is a request handler in 1.2 for Atom. That might be close. OpenSearch was a pretty poor design and is dead now, so I wouldn't expect any new implementations. Google's GData (based on Atom) reuses the few useful OpenSearch elements needed for things like number of hits. Solr's Atom support

Re: multilingual list of stopwords

2007-10-18 Thread Walter Underwood
Also die in German and English. --wunder On 10/18/07 4:16 AM, Andrzej Bialecki [EMAIL PROTECTED] wrote: One example that I'm familiar with: words is and by in English and in Swedish. Both words are stopwords in English, but they are content words in Swedish (ice and village, respectively).

Re: Overall performance: network v.s. SAN file system

2007-10-18 Thread Walter Underwood
The question almost doesn't make sense, because SANs are so configurable. It is like saying over a network without specifying whether the network is dial-up or fiber. A few things to note: * The automatic backups are not synchronized with consistent index states, so they are probably useless. *

Performance when indexing or cold cache

2007-10-22 Thread Walter Underwood
We've had some performance problems while Solr is indexing and also when it starts with a cold cache. I'm still digging through our own logs, but I'd like to get more info about this, so any ideas or info are welcome. We have four Solr servers on dual CPU PowerPC machines, 2G of heap, about

Re: Performance when indexing or cold cache

2007-10-22 Thread Walter Underwood
Solr 1.1. --wunder On 10/22/07 10:06 AM, Walter Underwood [EMAIL PROTECTED] wrote: We've had some performance problems while Solr is indexing and also when it starts with a cold cache. I'm still digging through our own logs, but I'd like to get more info about this, so any ideas or info

Re: Performance when indexing or cold cache

2007-10-22 Thread Walter Underwood
We do an optimize after indexing, so the number of segments isn't an issue. We have the default autowarming settings. wunder On 10/22/07 11:00 AM, Yonik Seeley [EMAIL PROTECTED] wrote: On 10/22/07, Walter Underwood [EMAIL PROTECTED] wrote: <lst name="appends"> <str name="fq">(pushstatus:A

Re: Forced Top Document

2007-10-25 Thread Walter Underwood
On 10/25/07 12:11 AM, Chris Hostetter [EMAIL PROTECTED] wrote: this type of question typically falls into two use cases: 1) targeted ads 2) sponsored results 3) Best bets (editorial results) The query house should return House, M.D. as the first hit, but that is rather hard to achieve

Re: Phrase Query Performance Question

2007-10-31 Thread Walter Underwood
hurricane katrina is a very expensive query against a collection focused on Hurricane Katrina. There will be many matches in many documents. If you want to measure worst-case, this is fine. I'd try other things, like: * ninth ward * Ray Nagin * Audubon Park * Canal Street * French Quarter * FEMA

Re: How to get number of indexed documents?

2007-11-01 Thread Walter Underwood
/solr/admin/stats.jsp is XML with a stylesheet. It contains stuff like this: <stat name="numDocs">266687</stat> wunder On 11/1/07 7:39 PM, Papalagi Pakeha [EMAIL PROTECTED] wrote: Hello, Is there any way to get XML version of statistics like how many documents are

Re: Phrase Query Performance Question

2007-11-02 Thread Walter Underwood
He means extremely frequent and I agree. --wunder On 11/2/07 1:51 AM, Haishan Chen [EMAIL PROTECTED] wrote: Thanks for the advice. You certainly have a point. I believe you mean a query term that appears in 5-10% of an index in a natural language corpus is extremely INFREQUENT?

Re: Score of exact matches

2007-11-05 Thread Walter Underwood
This is fairly straightforward and works well with the DisMax handler. Index the text into three different fields with three different sets of analyzers. Use something like this in the request handler: <requestHandler name="multimatch" class="solr.DisMaxRequestHandler"> <lst name="defaults">

Re: escaping characters and security

2007-11-06 Thread Walter Underwood
Solr queries can't do updates, so passing on raw user queries is OK. Solr errors for bad query syntax are not pretty, so you will want to catch those and print a real error message. wunder On 11/6/07 8:52 AM, Micah Wedemeyer [EMAIL PROTECTED] wrote: Are there any security risks to passing a

Re: What is the best way to index xml data preserving the mark up?

2007-11-07 Thread Walter Underwood
If you really, really need to preserve the XML structure, you'll be doing a LOT of work to make Solr do that. It might be cheaper to start with software that already does that. I recommend MarkLogic -- I know the principals there, and it is some seriously fine software. Not free or open, but very,

Re: 2GB limit on 32 bits

2007-11-09 Thread Walter Underwood
Some OSs split that 4GB into a 2GB data space and a 2GB instruction space. To get a 64bit address space, the CPU, OS, and JVM all need to support 64 bits. There have been 64 bit Xeon chips since 2004, the Linux 2.6 kernel supports 64 bit, and recent JVMs do, too. If your Xeon supports 64 bits, you

Re: Multiple uniqueKey fields

2007-11-13 Thread Walter Underwood
I had a similar problem with three sources of keys that have collisions between the values. I prefix a single letter for each source. movies: M12345 people: P12345 and so on. wunder On 11/13/07 12:37 PM, Will Johnson [EMAIL PROTECTED] wrote: key = sometimesUniqueField + _ +
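A minimal sketch of the single-letter prefix scheme described above. The M and P prefixes come from the message; the mapping table and the name `make_unique_key` are illustrative assumptions.

```python
# One letter per source, so ids that collide across sources stay distinct.
SOURCE_PREFIX = {"movies": "M", "people": "P"}


def make_unique_key(source, source_id):
    """Build the uniqueKey value for a record from a given source."""
    return f"{SOURCE_PREFIX[source]}{source_id}"


if __name__ == "__main__":
    print(make_unique_key("movies", 12345), make_unique_key("people", 12345))
```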

Re: snappuller rsync parameter error? - solr hardcoded

2007-11-14 Thread Walter Underwood
I'm not an rsync expert, but I believe that /solr/ is a virtual directory defined in the rsyncd config. It is mapped to the real directory. wunder On 11/14/07 8:43 AM, Jae Joo [EMAIL PROTECTED] wrote: In the snappuller, the solr is hardcoded. Should it be ${master_data_dir}? # rsync over
