RE: [EXTERNAL] Re: Getting rid of Master/Slave nomenclature in Solr

2020-06-18 Thread Demian Katz
Regarding people having a problem with the word "master" -- GitHub is changing 
the default branch name away from "master," even in isolation from a "slave" 
pairing... so the terminology seems to be falling out of favor in all contexts. 
See:

https://www.cnet.com/news/microsofts-github-is-removing-coding-terms-like-master-and-slave/

I'm not here to start a debate about the semantics of that, just to provide 
evidence that in some communities, the term "master" is causing concern all by 
itself. If we're going to make the change anyway, it might be best to get it 
over with and pick the most appropriate terminology we can agree upon, rather 
than trying to minimize the amount of change. It's going to be a breaking 
change anyway, so we might as well do it all now rather than risk having to 
go through two separate breaking changes at different points in time.

- Demian

-Original Message-
From: Noble Paul  
Sent: Thursday, June 18, 2020 1:51 AM
To: solr-user@lucene.apache.org
Subject: [EXTERNAL] Re: Getting rid of Master/Slave nomenclature in Solr

Looking at the code, I see 692 occurrences of the word "slave", mostly in 
variable names and ref guide docs.

The word "slave" is present in the responses as well. Any change in the request 
param/response payload is backward incompatible.

I have no objection to changing the names in the ref guide and other internal 
variables. Going ahead with backward-incompatible changes is painful, but if 
somebody has the appetite to take it up, that's OK.

If we must change, master/follower can be a good enough option.

master (noun): A man in charge of an organization or group.
master (adj.): having or showing very great skill or proficiency.
master (verb): acquire complete knowledge or skill in (a subject, technique, or 
art).
master (verb): gain control of; overcome.

I hope nobody has a problem with the term "master".

On Thu, Jun 18, 2020 at 3:19 PM Ilan Ginzburg  wrote:
>
> Would master/follower work?
>
> Half the rename work while still getting rid of the slavery connotation...
>
>
> On Thu 18 Jun 2020 at 07:13, Walter Underwood  wrote:
>
> > > On Jun 17, 2020, at 4:00 PM, Shawn Heisey  wrote:
> > >
> > > It has been interesting watching this discussion play out on 
> > > multiple
> > open source mailing lists.  On other projects, I have seen a VERY 
> > high level of resistance to these changes, which I find disturbing 
> > and surprising.
> >
> > Yes, it is nice to see everyone just pitch in and do it on this list.
> >
> > wunder
> > Walter Underwood
> > wun...@wunderwood.org
> > http://observer.wunderwood.org/  (my blog)
> >
> >



--
-
Noble Paul


RE: Help with a DIH config file

2019-03-15 Thread Demian Katz
Jörn (and anyone else with more experience with this than I have),

I've been working with Whitney on this issue. It is a PDF file, and it can be 
opened successfully in a PDF reader. Interestingly, if I try to extract data 
from it on the command line, Tika 1.3 throws a lot of warnings but does 
successfully extract data, while several newer versions, including 1.17 and 
1.20 (I haven't tested the intermediate versions), encounter a fatal error and 
extract nothing. So this seems like something that used to work but has stopped 
working. Unfortunately, we haven't been able to find a way to downgrade to an 
old enough Tika in her Solr installation to work around the problem that way.

The bigger question, though, is whether there's a way to allow the DIH to 
simply ignore errors and keep going. Whitney needs to index several terabytes 
of arbitrary documents for her project, and at this scale, she can't afford the 
time to stop and manually intervene for every strange document that happens to 
be in the collection. It would be greatly preferable for the indexing process 
to ignore exceptions and carry on rather than stopping dead at the first 
problem. (I'm also pretty sure that Whitney is already using the 
ignoreTikaException attribute in her configuration, but it doesn't seem to help 
in this instance.)
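
For what it's worth, I've seen references to an onError attribute on DIH 
entities ("abort", "skip", or "continue"), which sounds like exactly what we 
need. A sketch of how I understand it would be wired in, assuming a 
FileListEntityProcessor/TikaEntityProcessor setup like Whitney's (the entity 
names and baseDir are illustrative, not her actual config):

<dataConfig>
  <dataSource type="BinFileDataSource"/>
  <document>
    <entity name="files" processor="FileListEntityProcessor"
            baseDir="/data/docs" fileName=".*" recursive="true"
            rootEntity="false" dataSource="null">
      <!-- onError="skip" drops the offending document and keeps indexing;
           "continue" logs the error and proceeds -->
      <entity name="tika" processor="TikaEntityProcessor"
              url="${files.fileAbsolutePath}" format="text" onError="skip">
        <field column="text" name="content"/>
      </entity>
    </entity>
  </document>
</dataConfig>

If that's the right mechanism (and if it actually catches fatal Tika errors 
rather than just parse warnings), confirmation would be welcome.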

Any suggestions would be greatly appreciated!

thanks,
Demian

-Original Message-
From: Jörn Franke  
Sent: Friday, March 15, 2019 4:18 AM
To: solr-user@lucene.apache.org
Subject: Re: Help with a DIH config file

Do you have an exception?
It could be that the PDF is broken - can you open it on your computer with a 
PDF reader?

If the exception is related to Tika and PDF, then file an issue with the PDFBox 
project. If there is an issue with Tika and MS Office documents, then Apache 
POI is the right project to ask.

> Am 15.03.2019 um 03:41 schrieb wclarke :
> 
> Thank you so much.  You helped a great deal.  I am running into one 
> last issue where the Tika DIH stops at a specific language and 
> fails there (Malayalam).  Do you know of a workaround?
> 
> 
> 
> --
> Sent from: 
> http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Solr Cell, Tika and UpdateProcessorChains

2019-02-21 Thread Demian Katz
I'm posting this question on behalf of Whitney Clarke, who is a pending member 
of this list but is not able to post on her own yet. I've been working with her 
on some troubleshooting, but I'm not familiar with the components she's using 
and thought somebody here might be able to point her in the right direction 
more quickly than I can.

Here is her original inquiry:


I am pulling data from a local drive for indexing.  I am using Solr Cell and 
Tika in schemaless mode.  I am attempting to rewrite certain field information 
prior to indexing using html-strip and regex UpdateProcessorChains.  However, 
when run, the UpdateProcessorChains never appear to be invoked.

For example,

I am looking to rewrite "url":"e:\\documents\\apiscript.txt" to 
be http://apiscript.txt.  My current solrconfig is trying to rewrite id and 
put the rewritten link into url, but this is just the most recent of many 
different approaches I have tried to get it to work.

My other issue is with the content field.  I am trying to strip that field 
down to just the actual text of the document, but I am getting all the metadata 
in it as well.  Any suggestions?

Thanks,
Whitney


Whitney's latest solrconfig.xml is pasted in full below - as she notes, we've 
been through many iterations without any success. The key question is how to 
manipulate the data retrieved from Tika prior to indexing it. Is there a 
documented best practice for this type of situation, or any tips on how to 
troubleshoot when nothing appears to be happening?

Thanks,
Demian




[solrconfig.xml omitted: the list archive stripped all of the XML markup, 
leaving only bare text values. The recoverable fragments indicate, among other 
things: luceneMatchVersion 7.3.1; dataDir ${solr.solr.home:./solr}/text_test; 
autoCommit maxTime 15000 with openSearcher false; an update chain listing 
add-unknown-fields-to-the-schema, html-strip-features, and regex-replace; an 
/update/extract request handler; a regex-replace processor referencing the id 
field, the pattern ^[a-z]:\w+, the replacement http://, and a literal flag of 
true; parse-date formats of the usual yyyy-MM-dd'T'HH:mm:ss variety; and 
language detection over text,title,subject,description into language_s with 
default en.]
  
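P.S. In case it helps someone spot the problem: one thing we're double-checking 
is that custom chains only run when the /update/extract handler actually 
selects them, e.g. via update.chain in its defaults. A minimal sketch of that 
wiring (the chain name, field names, and regex here are simplified 
illustrations, not Whitney's exact config):

<updateRequestProcessorChain name="html-strip-and-regex">
  <processor class="solr.HTMLStripFieldUpdateProcessorFactory">
    <str name="fieldName">content</str>
  </processor>
  <processor class="solr.RegexReplaceProcessorFactory">
    <str name="fieldName">url</str>
    <!-- strip a leading drive-letter prefix such as "e:\" -->
    <str name="pattern">^[a-z]:\\</str>
    <str name="replacement">http://</str>
    <bool name="literalReplacement">true</bool>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

<requestHandler name="/update/extract" class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="update.chain">html-strip-and-regex</str>
  </lst>
</requestHandler>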



RE: Installing Solr with Ivy

2016-08-03 Thread Demian Katz
Dan,

In case you, or anyone else, is interested, let me share my current 
solution-in-progress:

https://github.com/vufind-org/vufind/pull/769

I've written a Phing task for my project (Phing is the PHP equivalent of Ant) 
which takes some loose inspiration from your Ant download task. The task uses a 
local directory to cache Solr distributions and only hits Apache servers if the 
cache lacks the requested version. This cache can be retained on my continuous 
integration and development servers, so I think this should get me the effect I 
desire without putting an unreasonable amount of load on the archive servers. 
I'd still love in theory to find a solution that's a little more future-proof 
than "build a URL and download from it," but for now, I think this will get me 
through.
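
To make the approach concrete, here is the gist reduced to a plain-Ant sketch 
(the actual Phing implementation in the PR differs in detail, and the property 
names and cache path here are illustrative):

<target name="fetch-solr">
  <property name="solr.version" value="5.5.0"/>
  <property name="solr.cache" value="${user.home}/.solr-cache"/>
  <mkdir dir="${solr.cache}"/>
  <!-- skipexisting means the Apache archive is only contacted when the
       cached copy is missing -->
  <get src="https://archive.apache.org/dist/lucene/solr/${solr.version}/solr-${solr.version}.tgz"
       dest="${solr.cache}/solr-${solr.version}.tgz"
       skipexisting="true"/>
</target>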

Thanks again!

- Demian

-Original Message-
From: Davis, Daniel (NIH/NLM) [C] [mailto:daniel.da...@nih.gov] 
Sent: Tuesday, August 02, 2016 11:33 AM
To: solr-user@lucene.apache.org
Subject: RE: Installing Solr with Ivy

Demian,

I've long meant to upload my own "automated installation" - it is Ant without 
Ivy, but with checksums.   I suppose GPG signatures could also be worked in.
It is only semi-automated, because our DevOps group does not have root, but 
here is a clean version - https://github.com/danizen/solr-ant-install

System administrators prepare the environment:
- create a directory for solr (/opt/solr) and logs (/var/logs/solr), maybe a 
different volume for solr data
- create an administrative user with a shell (owns the code)
- create an operational user who runs solr (no shell, cannot modify the code)
- install the initscripts
- set up sudoers rules

The installation this supports is very, very small, and I do not intend to 
support the cleaned version of this going forward.   I will update the 
README.md to make that clear.

I agree with your summary of the difference.  One more aspect of 
maturity/fullness of solution: MySQL/PostgreSQL etc. support multiple projects 
on the same server, at least administratively.  Solr is getting there, but 
until role-based access control (RBAC) is strong enough out-of-the-box, it is 
hard to set up a *shared* Solr server.  Yet it is very common to do that with 
database servers, and in fact doing so is a common way to avoid siloed 
applications.  Unfortunately, HTTP auth is not quite good enough for me; but 
it is only my own fault that I haven't contributed something more.

Dan Davis, Systems/Applications Architect (Contractor), Office of Computer and 
Communications Systems, National Library of Medicine, NIH







-Original Message-----
From: Demian Katz [mailto:demian.k...@villanova.edu]
Sent: Tuesday, August 02, 2016 8:37 AM
To: solr-user@lucene.apache.org
Subject: RE: Installing Solr with Ivy

Thanks, Shawn, for confirming my suspicions.

Regarding your question about how Solr differs from a database server, I agree 
with you in theory, but the problem is in the practice: there are very easy, 
familiar, well-established techniques for installing and maintaining database 
platforms, and these platforms are mature enough that they evolve slowly and 
most versions are closely functionally equivalent to one another. Solr is 
comparatively young (not immature, but young).

Solr still (as far as I can tell) lacks standard package support in the default 
repos of the major Linux distros, and frequently breaks backward compatibility 
between versions in large and small ways (particularly in the internal API, but 
sometimes also in the configuration files). Those are not intended as 
criticisms of Solr -- they're to a large extent positive signs of activity and 
growth -- but they are, as far as I can tell, the current realities of working 
with the software.

For a developer with the right experience and knowledge, it's no big deal to 
navigate these challenges. However, my package is designed to be friendly to a 
less experienced, more generalized non-technical audience, and bundling Solr in 
the package instead of trying to guide the user through a potentially confusing 
manual installation process greatly simplifies the task of getting things up 
and running, saving me from having to field support emails from people who 
can't figure out how to install Solr on their platform, or those who end up 
with a version that's incompatible with my project's configurations and custom 
handlers.

At this point, my main goal is to revise the bundling process so that instead 
of storing Solr in Git, I can install it on-demand with a simple automated 
process during continuous integration builds and packaging for release. In the 
longer term, if the environmental factors change, I'd certainly prefer to stop 
bundling it entirely... but I don't think that is practical for my audience at 
this stage.

In any case, sorry for the long-winded reply, but hopefully that helps clarify 
my situation.

- Demian

-Original Message-

[...snip...]


RE: Installing Solr with Ivy

2016-08-02 Thread Demian Katz
Dan,

Thanks for taking the time to share this! I'll give it a test run in the near 
future and will happily share improvements if I come up with any (though I'll 
most likely be focusing on the download steps rather than the subsequent 
configuration).

- Demian

-Original Message-
From: Davis, Daniel (NIH/NLM) [C] [mailto:daniel.da...@nih.gov] 
Sent: Tuesday, August 02, 2016 11:33 AM
To: solr-user@lucene.apache.org
Subject: RE: Installing Solr with Ivy

Demian,

I've long meant to upload my own "automated installation" - it is Ant without 
Ivy, but with checksums.   I suppose GPG signatures could also be worked in.
It is only semi-automated, because our DevOps group does not have root, but 
here is a clean version - https://github.com/danizen/solr-ant-install

System administrators prepare the environment:
- create a directory for solr (/opt/solr) and logs (/var/logs/solr), maybe a 
different volume for solr data
- create an administrative user with a shell (owns the code)
- create an operational user who runs solr (no shell, cannot modify the code)
- install the initscripts
- set up sudoers rules

The installation this supports is very, very small, and I do not intend to 
support the cleaned version of this going forward.   I will update the 
README.md to make that clear.

I agree with your summary of the difference.  One more aspect of 
maturity/fullness of solution: MySQL/PostgreSQL etc. support multiple projects 
on the same server, at least administratively.  Solr is getting there, but 
until role-based access control (RBAC) is strong enough out-of-the-box, it is 
hard to set up a *shared* Solr server.  Yet it is very common to do that with 
database servers, and in fact doing so is a common way to avoid siloed 
applications.  Unfortunately, HTTP auth is not quite good enough for me; but 
it is only my own fault that I haven't contributed something more.

Dan Davis, Systems/Applications Architect (Contractor), Office of Computer and 
Communications Systems, National Library of Medicine, NIH







-Original Message-----
From: Demian Katz [mailto:demian.k...@villanova.edu]
Sent: Tuesday, August 02, 2016 8:37 AM
To: solr-user@lucene.apache.org
Subject: RE: Installing Solr with Ivy

Thanks, Shawn, for confirming my suspicions.

Regarding your question about how Solr differs from a database server, I agree 
with you in theory, but the problem is in the practice: there are very easy, 
familiar, well-established techniques for installing and maintaining database 
platforms, and these platforms are mature enough that they evolve slowly and 
most versions are closely functionally equivalent to one another. Solr is 
comparatively young (not immature, but young).

Solr still (as far as I can tell) lacks standard package support in the default 
repos of the major Linux distros, and frequently breaks backward compatibility 
between versions in large and small ways (particularly in the internal API, but 
sometimes also in the configuration files). Those are not intended as 
criticisms of Solr -- they're to a large extent positive signs of activity and 
growth -- but they are, as far as I can tell, the current realities of working 
with the software.

For a developer with the right experience and knowledge, it's no big deal to 
navigate these challenges. However, my package is designed to be friendly to a 
less experienced, more generalized non-technical audience, and bundling Solr in 
the package instead of trying to guide the user through a potentially confusing 
manual installation process greatly simplifies the task of getting things up 
and running, saving me from having to field support emails from people who 
can't figure out how to install Solr on their platform, or those who end up 
with a version that's incompatible with my project's configurations and custom 
handlers.

At this point, my main goal is to revise the bundling process so that instead 
of storing Solr in Git, I can install it on-demand with a simple automated 
process during continuous integration builds and packaging for release. In the 
longer term, if the environmental factors change, I'd certainly prefer to stop 
bundling it entirely... but I don't think that is practical for my audience at 
this stage.

In any case, sorry for the long-winded reply, but hopefully that helps clarify 
my situation.

- Demian

-Original Message-

[...snip...]

In a theoretical situation where your program talked to an SQL database, would 
you include a database server in your project?  How much time would you invest 
in automating the download and install of MySQL, Postgres, or some other 
database?  I think what you would do in that situation is include client code 
to talk to the database and expect the user to provide the server and prepare 
it for your program.  In this respect, how is a Solr server any different from 
a database server?

Thanks,
Shawn



RE: Installing Solr with Ivy

2016-08-02 Thread Demian Katz
Thanks, Shawn, for confirming my suspicions.

Regarding your question about how Solr differs from a database server, I agree 
with you in theory, but the problem is in the practice: there are very easy, 
familiar, well-established techniques for installing and maintaining database 
platforms, and these platforms are mature enough that they evolve slowly and 
most versions are closely functionally equivalent to one another. Solr is 
comparatively young (not immature, but young).

Solr still (as far as I can tell) lacks standard package support in the default 
repos of the major Linux distros, and frequently breaks backward compatibility 
between versions in large and small ways (particularly in the internal API, but 
sometimes also in the configuration files). Those are not intended as 
criticisms of Solr -- they're to a large extent positive signs of activity and 
growth -- but they are, as far as I can tell, the current realities of working 
with the software.

For a developer with the right experience and knowledge, it's no big deal to 
navigate these challenges. However, my package is designed to be friendly to a 
less experienced, more generalized non-technical audience, and bundling Solr in 
the package instead of trying to guide the user through a potentially confusing 
manual installation process greatly simplifies the task of getting things up 
and running, saving me from having to field support emails from people who 
can't figure out how to install Solr on their platform, or those who end up 
with a version that's incompatible with my project's configurations and custom 
handlers.

At this point, my main goal is to revise the bundling process so that instead 
of storing Solr in Git, I can install it on-demand with a simple automated 
process during continuous integration builds and packaging for release. In the 
longer term, if the environmental factors change, I'd certainly prefer to stop 
bundling it entirely... but I don't think that is practical for my audience at 
this stage.

In any case, sorry for the long-winded reply, but hopefully that helps clarify 
my situation.

- Demian

-Original Message-

[...snip...]

In a theoretical situation where your program talked to an SQL database, would 
you include a database server in your project?  How much time would you invest 
in automating the download and install of MySQL, Postgres, or some other 
database?  I think what you would do in that situation is include client code 
to talk to the database and expect the user to provide the server and prepare 
it for your program.  In this respect, how is a Solr server any different from 
a database server?

Thanks,
Shawn



Installing Solr with Ivy

2016-08-01 Thread Demian Katz
As a follow-up to last week's thread about loading Solr via dependency manager, 
I started experimenting with using Ivy to install Solr. Here's what I have 
(note that I'm trying to install Solr 5.5.0 as an arbitrary example, but that 
detail should not be important):

ivy.xml:

[XML stripped by the list archive]

build.xml:

[XML stripped by the list archive]

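Since the archive stripped the XML, here is a minimal reconstruction of the 
kind of pair described above - the solr-parent/5.5.0 coordinates come from this 
message, while the surrounding boilerplate is illustrative guesswork:

ivy.xml (sketch):

<ivy-module version="2.0">
  <info organisation="org.vufind" module="solr-install"/>
  <dependencies>
    <!-- the artifact under test; swapping in solr-core changes the results,
         as described below -->
    <dependency org="org.apache.solr" name="solr-parent" rev="5.5.0"/>
  </dependencies>
</ivy-module>

build.xml (sketch):

<project name="solr-install" default="install" xmlns:ivy="antlib:org.apache.ivy.ant">
  <target name="install">
    <!-- resolve ivy.xml and copy the retrieved artifacts into lib/ -->
    <ivy:retrieve pattern="lib/[artifact]-[revision].[ext]"/>
  </target>
</project>
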
My hope, based on a quick read of some Ivy tutorials, was that simply running 
"ant" with the above configs would give me a copy of Solr in my lib directory. 
When I use example libraries from the tutorials in my ivy.xml, I do indeed get 
files installed... but when I try to substitute the Solr package,
no files are installed ("0 artifacts copied"). I'm not very experienced with 
any of these tools or repositories, so I'm not sure where I'm going wrong.

- Do I need to add some extra configuration somewhere to tell Ivy to download 
the constituent parts of the solr-parent package?
- Is the solr-parent package the wrong thing to be using? (I tried replacing 
solr-parent with solr-core and ended up with many .jar files in my lib 
directory, which was better than nothing, but the .jar files were not organized 
into a directory structure and were not accompanied by any of the non-.jar 
files like shell scripts that make Solr tick).
- Am I just completely on the wrong track? (I do realize that there may not be 
a way to pull a fully-functional Solr out of the core Maven repository... but 
it seemed worth a try!)

Any suggestions would be greatly appreciated!

thanks,
Demian


RE: Installing Solr as a dependency

2016-08-01 Thread Demian Katz
Thanks -- another interesting possibility, though I suppose the disadvantage to 
this strategy would be the dependency on Docker, which could be problematic for 
some users (especially those running Windows, where I understand that this 
could only be achieved with virtualization, which would almost certainly impact 
performance). Still, another option to put on the table!

- Demian

-Original Message-
From: Alexandre Rafalovitch [mailto:arafa...@gmail.com] 
Sent: Friday, July 29, 2016 8:02 PM
To: solr-user
Subject: Re: Installing Solr as a dependency

What about (not tried) pulling down an official Docker build and adding your 
stuff to that?
https://hub.docker.com/_/solr/

Regards,
   Alex.

Newsletter and resources for Solr beginners and intermediates:
http://www.solr-start.com/


On 30 July 2016 at 03:03, Demian Katz <demian.k...@villanova.edu> wrote:
>> I wouldn't include Solr in my own project at all.  I would probably 
>> request that the user download the binary artifact and put it in a 
>> predictable location, and configure my installation script to do the 
>> download if the file is not there.  I would strongly recommend taking 
>> advantage of Apache's mirror system for that download -- although if 
>> you need a specific version of Solr, you will find that the mirror 
>> system only has the latest version, and you must go to the Apache 
>> Archives for older versions.
>>
>> To reduce load on the Apache Archive, you could place a copy of the 
>> binary on your own download servers ... and you could probably 
>> greatly reduce the size of that download by stripping out components 
>> that your software doesn't need.  If users want to enable additional 
>> functionality, they would be free to download the full Solr binary 
>> from Apache.
>
> Yes, this is the reason I was hoping to use some sort of dependency 
> management tool. The idea of downloading from Apache's system has definitely 
> crossed my mind, but it's inherently more fragile than using a dependency 
> manager (since Apache is at least theoretically free to change their URL 
> structure, etc., at any time) and, as you say, it seemed impolite to direct 
> potentially heavy amounts of traffic to Apache servers (especially when you 
> consider that every commit to my project triggers one or more continuous 
> integration builds, each of which would need to perform the download). 
> Creating a project-specific mirror also crossed my mind, but that has its own 
> set of problems: it's work to maintain it, and the server hosting it needs to 
> be able to withstand the high traffic that would otherwise be directed at 
> Apache. The idea of a theoretical dependency management tool still feels more 
> attractive because it adds a standard, unchanging mechanism for obtaining 
> specific versions of the software and it offers the possibility of local 
> package caching across builds to significantly reduce the amount of HTTP 
> traffic back and forth. Of course, it's a lot less attractive if it proves to 
> be only theory and not in fact practically achievable -- I'll play around 
> with Maven next week and see where that gets me.
>
> Anyway, I don't say any of that to dismiss your suggestions -- you 
> present potentially viable possibilities, and I'll certainly keep 
> those ideas on the table as I plan for the future -- but I thought it 
> might be worthwhile to share my thinking. :-)
>
>> I once discovered that if optional components are removed (including 
>> some jars in the webapp), the Solr download drops from 150+ MB to 
>> about
>> 25 MB.
>>
>> https://issues.apache.org/jira/browse/SOLR-6806
>
> This could actually be a separate argument for a dependency-management-based 
> Solr structure, in that you could create a core solr package with minimum 
> content that could recommend a whole array of optional dependencies. A script 
> could then be used to build different versions of the download package from 
> these -- one with just the core, one with all the optional stuff included. 
> Those who wanted some intermediate number of files could be encouraged to 
> manually create their desired build from packages.
>
> But again, I freely admit that everything I'm saying is based on 
> experience with package managers outside the realm of Java -- I need 
> to learn more about Maven (and perhaps Ivy) before I can make any 
> particularly intelligent statements about what is really possible in 
> this context. :-)
>
> - Demian


RE: Installing Solr as a dependency

2016-07-29 Thread Demian Katz
I did think about Maven, but (probably because I'm a Maven newbie) I didn't 
find an obvious way to do it and figured that Maven was meant more for 
libraries than for complete applications. In any case, your answer gives me 
more to work with, so I'll do some experimentation. Thanks!

- Demian

From: Daniel Collins [danwcoll...@gmail.com]
Sent: Friday, July 29, 2016 11:18 AM
To: solr-user@lucene.apache.org
Subject: Re: Installing Solr as a dependency

Can't you use Maven?  I thought that was the standard dependency management
tool, and Solr is published to Maven repos.  There used to be a solr
artifact which was the WAR file, but presumably now, you'd have to pull down

  <dependency>
    <groupId>org.apache.solr</groupId>
    <artifactId>solr-parent</artifactId>
  </dependency>

and maybe then start that up.

We have an internal application which depends on solr-core (it's a
web-app; we embed bits of Solr, basically), and Maven works fine for us.  We
do patch and build Solr internally to our own corporate Maven repos,
so that helps :)  But I've done it outside the corporate environment and
found recent Solr releases on standard Maven repo sites.


On 29 July 2016 at 15:12, Shawn Heisey <apa...@elyograg.org> wrote:

> On 7/28/2016 1:29 PM, Demian Katz wrote:
> > I develop an open source project
> > (https://github.com/vufind-org/vufind) that depends on Solr, and I'm
> > trying to figure out if there is a better way to manage the Solr
> > dependency. Presently, I simply bundle Solr with my software by
> > committing the latest distribution to my Git repo. Over time, having
> > all of these large binaries is causing repository bloat and slow Git
> > performance. I'm beginning to wonder whether there's a better way.
> > With the rise in the popularity of dependency managers like NPM and
> > Composer, it seems like it might be nice to somehow be able to declare
> > Solr as a dependency and have it installed automatically on the client
> > side rather than bundling the whole gigantic application by hand...
> > however, as far as I can tell, there's no way to do this presently (at
> > least, not unless you count specialized niche projects like
> > https://github.com/projecthydra/hydra-jetty, which are not exactly
> > what I'm looking for). Just curious if others are dealing with this
> > problem in other ways, or if there are any tool-based approaches that
> > I haven't discovered on my own.
>
> I wouldn't include Solr in my own project at all.  I would probably
> request that the user download the binary artifact and put it in a
> predictable location, and configure my installation script to do the
> download if the file is not there.  I would strongly recommend taking
> advantage of Apache's mirror system for that download -- although if you
> need a specific version of Solr, you will find that the mirror system
> only has the latest version, and you must go to the Apache Archives for
> older versions.
>
> To reduce load on the Apache Archive, you could place a copy of the
> binary on your own download servers ... and you could probably greatly
> reduce the size of that download by stripping out components that your
> software doesn't need.  If users want to enable additional
> functionality, they would be free to download the full Solr binary from
> Apache.
>
> I once discovered that if optional components are removed (including
> some jars in the webapp), the Solr download drops from 150+ MB to about
> 25 MB.
>
> https://issues.apache.org/jira/browse/SOLR-6806
>
> Thanks,
> Shawn
>
>


RE: Installing Solr as a dependency

2016-07-29 Thread Demian Katz
> I wouldn't include Solr in my own project at all.  I would probably
> request that the user download the binary artifact and put it in a
> predictable location, and configure my installation script to do the
> download if the file is not there.  I would strongly recommend taking
> advantage of Apache's mirror system for that download -- although if you
> need a specific version of Solr, you will find that the mirror system
> only has the latest version, and you must go to the Apache Archives for
> older versions.
>
> To reduce load on the Apache Archive, you could place a copy of the
> binary on your own download servers ... and you could probably greatly
> reduce the size of that download by stripping out components that your
> software doesn't need.  If users want to enable additional
> functionality, they would be free to download the full Solr binary from
> Apache.

Yes, this is the reason I was hoping to use some sort of dependency management 
tool. The idea of downloading from Apache's system has definitely crossed my 
mind, but it's inherently more fragile than using a dependency manager (since 
Apache is at least theoretically free to change their URL structure, etc., at 
any time) and, as you say, it seemed impolite to direct potentially heavy 
amounts of traffic to Apache servers (especially when you consider that every 
commit to my project triggers one or more continuous integration builds, each 
of which would need to perform the download). Creating a project-specific 
mirror also crossed my mind, but that has its own set of problems: it's work to 
maintain it, and the server hosting it needs to be able to withstand the high 
traffic that would otherwise be directed at Apache. The idea of a theoretical 
dependency management tool still feels more attractive because it adds a 
standard, unchanging mechanism for obtaining specific versions of the software 
and it offers the possibility of local package caching across builds to 
significantly reduce the amount of HTTP traffic back and forth. Of course, it's 
a lot less attractive if it proves to be only theory and not in fact 
practically achievable -- I'll play around with Maven next week and see where 
that gets me.

Anyway, I don't say any of that to dismiss your suggestions -- you present 
potentially viable possibilities, and I'll certainly keep those ideas on the 
table as I plan for the future -- but I thought it might be worthwhile to share 
my thinking. :-)

> I once discovered that if optional components are removed (including
> some jars in the webapp), the Solr download drops from 150+ MB to about
> 25 MB.
>
> https://issues.apache.org/jira/browse/SOLR-6806

This could actually be a separate argument for a dependency-management-based 
Solr structure, in that you could create a core solr package with minimum 
content that could recommend a whole array of optional dependencies. A script 
could then be used to build different versions of the download package from 
these -- one with just the core, one with all the optional stuff included. 
Those who wanted some intermediate number of files could be encouraged to 
manually create their desired build from packages.

But again, I freely admit that everything I'm saying is based on experience 
with package managers outside the realm of Java -- I need to learn more about 
Maven (and perhaps Ivy) before I can make any particularly intelligent 
statements about what is really possible in this context. :-)

- Demian

Installing Solr as a dependency

2016-07-28 Thread Demian Katz
Hello,

I develop an open source project (https://github.com/vufind-org/vufind) that 
depends on Solr, and I'm trying to figure out if there is a better way to 
manage the Solr dependency.

Presently, I simply bundle Solr with my software by committing the latest 
distribution to my Git repo. Over time, having all of these large binaries is 
causing repository bloat and slow Git performance. I'm beginning to wonder 
whether there's a better way. With the rise in the popularity of dependency 
managers like NPM and Composer, it seems like it might be nice to somehow be 
able to declare Solr as a dependency and have it installed automatically on the 
client side rather than bundling the whole gigantic application by hand... 
however, as far as I can tell, there's no way to do this presently (at least, 
not unless you count specialized niche projects like 
https://github.com/projecthydra/hydra-jetty, which are not exactly what I'm 
looking for).

Just curious if others are dealing with this problem in other ways, or if there 
are any tool-based approaches that I haven't discovered on my own.

thanks,
Demian


qf boosts with MoreLikeThis query parser

2016-07-11 Thread Demian Katz
Hello,

I am currently using field-specific boosts in the qf setting of the 
MoreLikeThis request handler:

https://github.com/vufind-org/vufind/blob/master/solr/vufind/biblio/conf/solrconfig.xml#L410

I would like to accomplish the same effect using the MoreLikeThis query parser, 
so that I can take advantage of such benefits as sharding support.
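
For concreteness, here is a sketch of the two forms in question (the field 
names and boosts are illustrative, not my exact config; the second line is the 
kind of syntax I have been attempting, and the boosts are exactly the part 
that does not seem to work):

Request handler setting (works): mlt.qf=title^500 topic^300 allfields
Query parser attempt (fails):    q={!mlt qf=title^500,topic^300}RECORD-ID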

I am currently using Solr 5.5.0, and in spite of trying many syntactical 
variations, I can't seem to get it to work. Some discussion on this JIRA ticket 
seems to suggest there may have been some problems caused by parsing 
limitations:

https://issues.apache.org/jira/browse/SOLR-7143

However, I think my work on this ticket should have eliminated those 
limitations:

https://issues.apache.org/jira/browse/SOLR-2798

Anyway, this brings up a few questions:


1.) Is field-specific boosting in qf supported by the MLT query parser, and 
if so, what syntax should I use?

2.) If this functionality is supported, but not in v5.5.0, approximately 
when was it fixed?

3.) If the functionality is still not working, would it be worth my time to 
try to fix it, or is it being excluded for a specific reason?

Any and all insight is appreciated. Apologies if the answers are already out 
there somewhere, but I wasn't able to find them!

thanks,
Demian


Pull request protocol question

2016-03-01 Thread Demian Katz
Hello,

A few weeks ago, I submitted a pull request to Solr in association with a JIRA 
ticket, and it was eventually merged.

More recently, I had an almost-trivial change I wanted to share, but on GitHub, 
my Solr fork appeared to have changed upstreams. Was the whole Solr repo moved 
and regenerated or something?

In any case, I ended up submitting my proposal using a new fork of 
apache/lucene-solr. It's visible here:

https://github.com/apache/lucene-solr/pull/13

However, due to the weirdness of the switching upstreams, I thought I'd better 
check in here and make sure I put this in the right place!

thanks,
Demian


SOLR-2798 (local params parsing issue) -- how can I help?

2015-12-02 Thread Demian Katz
Hello,

I'd really love to see a resolution to SOLR-2798, since my application has a 
bug that cannot be addressed until this issue is fixed.

It occurred to me that there's a good chance that the code involved in this 
issue is relatively isolated and testable, so I might be able to help with a 
solution even though I have no prior experience with the Solr code base. I'm 
just wondering if anyone can confirm this and, if so, point me in the right 
general direction so that I can make an attempt at a patch.

I asked about this a while ago in a comment on the JIRA ticket, but I have a 
feeling that nobody actually saw that - so I'm trying again here on the mailing 
list.

Any and all help greatly appreciated - and hopefully if you help me a little, I 
can contribute a useful fix back to the project in return.

thanks,
Demian


Costs/benefits of DocValues

2015-11-09 Thread Demian Katz
Hello,

I have a legacy Solr schema that I would like to update to take advantage of 
DocValues. I understand that by adding "docValues=true" to some of my fields, I 
can improve sorting/faceting performance. However, I have a couple of questions:


1.) Will Solr always take proper advantage of docValues when it is turned 
on, or will I gain greater performance by turning off stored/indexed in 
situations where only docValues are necessary (e.g. a sort-only field)?

2.) Will adding docValues to a field introduce significant performance 
penalties for non-docValues uses of that field, beyond the obvious fact that 
the additional data will consume more disk and memory?

I'm asking this question because the existing schema has some multi-purpose 
fields, and I'm trying to determine whether I should just add "docValues=true" 
wherever it might help, or if I need to take a more thoughtful approach and 
potentially split some fields with copyFields, etc. This is particularly 
significant because my schema makes use of some dynamic field suffixes, and I'm 
not sure if I need to add new suffixes to differentiate docValues/non-docValues 
fields, or if it's okay to turn on docValues across the board "just in case."
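
To make the "split fields" option concrete, here is a minimal sketch with 
hypothetical field names - not my actual schema - of what I understand the more 
thoughtful approach would look like:

<!-- multi-purpose field, left as-is for searching and display -->
<field name="title" type="text" indexed="true" stored="true"/>
<!-- sort-only companion: docValues only, no inverted index or stored value -->
<field name="title_sort" type="string" indexed="false" stored="false" docValues="true"/>
<copyField source="title" dest="title_sort"/>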

Apologies if these questions have already been answered - I couldn't find a 
totally clear answer in the places I searched.

Thanks!

- Demian


ExternalFileField documentation problems?

2014-09-15 Thread Demian Katz
I've just been doing some experimentation with the ExternalFileField. I ran 
into obstacles due to some apparently incorrect documentation in the wiki:

https://cwiki.apache.org/confluence/display/solr/Working+with+External+Files+and+Processes

It seems that for some reason the fieldType and field definitions are 
mashed together there. It felt wrong, but I tried it since it was in the 
official docs... and, of course, it didn't work.

Fortunately, this blog post helped me out, and I was able to get everything 
working:

http://1opensourcelover.wordpress.com/2013/07/02/solr-external-file-fields/
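
For the record, the shape that ended up working looks roughly like this (field 
names are illustrative; the key point is that keyField and defVal belong on the 
fieldType, and the field merely references it):

<fieldType name="extfile" class="solr.ExternalFileField" keyField="id"
           defVal="0" stored="false" indexed="false" valType="pfloat"/>
<field name="popularity" type="extfile"/>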

Anyway, I'm not writing this to complain - I'd just like to help fix the wiki. 
However, since I'm no expert on this functionality, and I have no Confluence 
experience, so I thought I'd post here before taking any action.


1.) Am I able to edit the wiki? I signed up, but I don't see any edit 
options - just "leave a comment." I assume this means I have no rights, but it 
might just mean I'm looking in the wrong places.

2.) Is there anyone more intimately familiar with ExternalFileField who 
would be willing to give the wiki page a quick review and correct factual 
errors? The extent of my edit (if I could make it) would simply be to fix the 
broken schema.xml example, but it's possible other details also need 
adjustments.

3.) Is there a policy on external links in the wiki? Adding a comment with a 
link to the above-mentioned blog post might be helpful to others, but if it's 
going to get me flagged as a potential spammer, I'll refrain from doing it.

Thanks for your input! I'll go ahead and leave a comment if I don't hear 
anything in a few days, but it seemed worth asking for best practices first.

- Demian


Preserving punctuation tokens with ICUTokenizerFactory

2012-04-10 Thread Demian Katz
It has been brought to my attention that ICUTokenizerFactory drops tokens like 
the "++" in "The C++ Programming Language".  Is there any way to persuade it to 
preserve these types of tokens?
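
One avenue I've seen mentioned, sketched here with an illustrative rules-file 
name: ICUTokenizerFactory accepts per-script RBBI break-iterator rules via its 
rulefiles attribute, which seems to be the usual route for keeping symbol runs 
like "C++" together (though writing the rules file itself is the hard part):

<tokenizer class="solr.ICUTokenizerFactory" rulefiles="Latn:custom-rules.rbbi"/>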

thanks,
Demian


RE: sun-java6 alternatives for Solr 3.5

2012-02-27 Thread Demian Katz
For what it's worth, I run Solr 3.5 on Ubuntu using the OpenJDK packages and I 
haven't run into any problems.  I do realize that sometimes the Sun JDK has 
features that are missing from other Java implementations, but so far it hasn't 
affected my use of Solr.

- Demian

 -Original Message-
 From: ku3ia [mailto:dem...@gmail.com]
 Sent: Monday, February 27, 2012 2:25 PM
 To: solr-user@lucene.apache.org
 Subject: sun-java6 alternatives for Solr 3.5
 
 Hi all!
 I had installed an Ubuntu 10.04 LTS. I had added a 'partner' repository to
 my sources list and updated it, but I can't find a package sun-java6-*:
 root@ubuntu:~# apt-cache search java6
 default-jdk - Standard Java or Java compatible Development Kit
 default-jre - Standard Java or Java compatible Runtime
 default-jre-headless - Standard Java or Java compatible Runtime (headless)
 openjdk-6-jdk - OpenJDK Development Kit (JDK)
 openjdk-6-jre - OpenJDK Java runtime, using Hotspot JIT
 openjdk-6-jre-headless - OpenJDK Java runtime, using Hotspot JIT (headless)
 
 Then I googled and found an article:
 https://lists.ubuntu.com/archives/ubuntu-security-announce/2011-
 December/001528.html
 
 I'm using Solr 3.5 and Apache Tomcat 6.0.32.
 Please advise me what I must do in this situation, because I always used
 sun-java6-* packages for Tomcat and Solr and they worked fine.
 Thanks!
 
 --
 View this message in context: http://lucene.472066.n3.nabble.com/sun-java6-alternatives-for-Solr-3-5-tp3781792p3781792.html
 Sent from the Solr - User mailing list archive at Nabble.com.


RE: SOLR - Just for search or whole site DB?

2012-02-21 Thread Demian Katz
I would strongly recommend using Solr just for search.  Solr is designed for 
doing fast search lookups.  It is really not designed for performing all the 
functions of a relational database system.  You certainly COULD use Solr for 
everything, and the software is constantly being enhanced to make it more 
flexible, but you'll still probably find it awkward and inconvenient for 
certain tasks that are simple with MySQL.  It's also useful to be able to throw 
away and rebuild your Solr index at will, so you can upgrade to a new version 
or tweak your indexing rules.  If you store mission-critical data in Solr 
itself, this becomes more difficult.  The way I like to look at it is, as the 
name says, as an index.  You use one system for actually managing your data, 
and then you use Solr to create an index of that data for fast look-up.  

- Demian

 -Original Message-
 From: Spadez [mailto:james_will...@hotmail.com]
 Sent: Tuesday, February 21, 2012 7:45 AM
 To: solr-user@lucene.apache.org
 Subject: SOLR - Just for search or whole site DB?
 
 
 I am new to this but I wanted to pitch a setup to you. I have a website
 being coded at the moment, in the very early stages, but it is effectively a
 full-text scraper and search engine. We have decided on SOLR for the search
 system.
 
 We basically have two sets of data:
 
 One is the content for the search engine, which is around 100K records at
 any one time. The entire system is built on PHP, and the data currently
 lives in a MySQL database. We want very quick, relevant searches; this is
 critical. Our plan is to import our records into SOLR each night from the
 MySQL database.
 
 The second set of data is other parts of the site, such as our ticket
 system and stats about the number of clicks. This part is not
 performance-critical at all.
 
 So, I have two questions:
 
 Firstly, should everything be run through the SOLR search system, including
 tickets and site stats? Alternatively, is it better to keep only the main
 full-text searches on SOLR and do the ticketing etc. through normal MySQL
 queries?
 
 Secondly - and this probably depends on the first question - if everything
 should go through SOLR, should we even use a MySQL database at all? If not,
 what is the alternative? Do we use an XML file as a SQL replacement for
 content including tickets, stats, users, passwords, etc.?
 
 Sorry if these questions are basic, but I’m out of my depth here (but
 learning!)
 
 James
 
 
 --
 View this message in context: http://lucene.472066.n3.nabble.com/SOLR-Just-for-search-or-whole-site-DB-tp3763439p3763439.html
 Sent from the Solr - User mailing list archive at Nabble.com.


RE: social/collaboration features on top of solr

2011-12-13 Thread Demian Katz
VuFind (http://vufind.org) uses Solr for library catalog (or similar) 
applications and features a MySQL database which it uses for storing user tags 
and comments outside of Solr itself.  If there were a mechanism more closely 
tied to Solr for achieving this sort of effect, that would allow VuFind to do 
things with considerably more elegance!

- Demian

 -Original Message-
 From: Robert Stewart [mailto:bstewart...@gmail.com]
 Sent: Tuesday, December 13, 2011 10:28 AM
 To: solr-user@lucene.apache.org
 Subject: social/collaboration features on top of solr
 
 Has anyone implemented some social/collaboration features on top of
 SOLR?  What I am thinking is the ability to add ratings and comments to
 documents in SOLR and then be able to fetch comments and ratings for
 each document in results (and have as part of response from SOLR),
 similar in fashion to MLT results.  I think a separate index or
 separate core to store collaboration info would be needed, as well as a
 search component for fetching collaboration info for results.  I would
 think this would be a great feature and wondering if anyone has done
 something similar.
 
 Bob


Re: LocalParams, bq, and highlighting

2011-11-01 Thread Demian Katz
 This is definitely an interesting case that i don't think anyone ever
 really considered before.  It seems like a strong argument in favor of
 adding an hl.q param that the HighlightingComponent would use as an
 override for whatever the QueryComponent thinks the highlighting query
 should be, that way people expressing complex queries like the one you
 describe could do something like...

qq=solr
q=inStock:true AND _query_:"{!dismax v=$qq}"
hl.q={!v=$qq}
hl=true
fl=name
hl.fl=name
bq=server

 ...what do you think?

 wanna file a Jira requesting this as a feature?  Pretty sure the change
 would only require a few lines of code (but of course we'd also need JUnit
 tests which would probably be several dozen lines of code)

First of all, thanks for answering both of my LocalParams-related queries back 
in September.  I somehow failed to notice your responses until today - it's 
alarmingly easy to lose things in the flood of solr-user mail - but I greatly 
appreciate your input on both issues!

It looks like there's already a JIRA ticket (more than a year old) for the hl.q 
param:

https://issues.apache.org/jira/browse/SOLR-1926

This definitely sounds like it would solve my problem, so I've put in my vote!

- Demian


RE: DisMax and WordDelimiterFilterFactory (limitations of MultiPhraseQuery)

2011-10-27 Thread Demian Katz
If we change the query chain to not split on case change, then we lose half the 
benefit of that feature -- if a user types "WiFi" and the source record 
contains "wi fi", we fail to get a hit.  As you say, that may be worth 
considering if it comes down to picking the lesser evil, but I still think 
there should be a complete solution to my problem -- I'm not trying to 
compensate for every fat-fingered user behavior... just one specific one!

Ultimately, I think my problem relates to this note from the documentation 
about using phrases in the SynonymFilterFactory:

Phrase searching (ie: "sea biscit") will cause the QueryParser to pass the 
entire string to the analyzer, but if the SynonymFilter is configured to expand 
the synonyms, then when the QueryParser gets the resulting list of tokens back 
from the Analyzer, it will construct a MultiPhraseQuery that will not have the 
desired effect. This is because of the limited mechanism available for the 
Analyzer to indicate that two terms occupy the same position: there is no way 
to indicate that a phrase occupies the same position as a term. For our 
example the resulting MultiPhraseQuery would be "(sea | sea | seabiscuit) 
(biscuit | biscit)" which would not match the simple case of "seabiscuit" 
occurring in a document.

So I suppose I'm just running up against a fundamental limitation of Solr...  
but this seems like a fundamental limitation that might be worth overcoming -- 
I'm sure my use case is not the only one where this could matter.  Has anyone 
given this any thought?

- Demian

 -Original Message-
 From: Erick Erickson [mailto:erickerick...@gmail.com]
 Sent: Thursday, October 27, 2011 8:21 AM
 To: solr-user@lucene.apache.org
 Subject: Re: DisMax and WordDelimiterFilterFactory
 
 What happens if you change your WDDF definition in the query part of
 your analysis
 chain to NOT split on case change? Then your index should contain the
 right
 fragments (and combined words) and your queries would match.
 
 I admit I haven't thought this through entirely, but this would work
 for your example I
 think. Unfortunately I suspect it would break other cases... I
 suspect you're in a
 "lesser of two evils" situation.
 
 But I can't imagine a 100% solution here. You're effectively asking to
 compensate for
 any fat-fingered thing a user does. Impossible I think...
 
 Best
 Erick
 
 On Tue, Oct 25, 2011 at 1:13 PM, Demian Katz
 demian.k...@villanova.edu wrote:
  I've seen a couple of threads related to this subject (for example,
 http://www.mail-archive.com/solr-user@lucene.apache.org/msg33400.html),
 but I haven't found an answer that addresses the aspect of the problem
 that concerns me...
 
  I have a field type set up like this:
 
     <fieldType name="text" class="solr.TextField"
 positionIncrementGap="100">
       <analyzer type="index">
         <tokenizer class="solr.ICUTokenizerFactory"/>
         <filter class="solr.WordDelimiterFilterFactory"
 generateWordParts="1" generateNumberParts="1" catenateWords="1"
 catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
         <filter class="solr.StopFilterFactory" ignoreCase="true"
 words="stopwords.txt" enablePositionIncrements="true"/>
         <filter class="solr.ICUFoldingFilterFactory"/>
         <filter class="solr.KeywordMarkerFilterFactory"
 protected="protwords.txt"/>
         <filter class="solr.SnowballPorterFilterFactory"
 language="English"/>
         <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
       </analyzer>
       <analyzer type="query">
         <tokenizer class="solr.ICUTokenizerFactory"/>
         <filter class="solr.SynonymFilterFactory"
 synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
         <filter class="solr.WordDelimiterFilterFactory"
 generateWordParts="1" generateNumberParts="1" catenateWords="0"
 catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
         <filter class="solr.StopFilterFactory" ignoreCase="true"
 words="stopwords.txt" enablePositionIncrements="true"/>
         <filter class="solr.ICUFoldingFilterFactory"/>
         <filter class="solr.KeywordMarkerFilterFactory"
 protected="protwords.txt"/>
         <filter class="solr.SnowballPorterFilterFactory"
 language="English"/>
         <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
       </analyzer>
     </fieldType>
 
  The important feature here is the use of WordDelimiterFilterFactory,
 which allows a search for "WiFi" to match an indexed term of "wi fi"
 (for example).
 
  The problem, of course, is that if a user accidentally introduces a
 case change in their query, the query analyzer chain breaks it into
 multiple words and no hits are found...  so a search for "exaMple" will
 look for "exa mple" and fail.
 
  I've found two solutions that resolve this problem in the admin panel
 field analysis tool:
 
 
  1.) Turn on catenateWords and catenateNumbers in the query
 analyzer - this reassembles the user's broken word and allows a match.
 
  2.) Turn on preserveOriginal in the query analyzer - this passes
 through the user's original query, which then gets cleaned up by the

DisMax and WordDelimiterFilterFactory

2011-10-25 Thread Demian Katz
I've seen a couple of threads related to this subject (for example, 
http://www.mail-archive.com/solr-user@lucene.apache.org/msg33400.html), but I 
haven't found an answer that addresses the aspect of the problem that concerns 
me...

I have a field type set up like this:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="1" catenateNumbers="1"
            catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="0" catenateNumbers="0"
            catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

The important feature here is the use of WordDelimiterFilterFactory, which 
allows a search for "WiFi" to match an indexed term of "wi fi" (for example).

The problem, of course, is that if a user accidentally introduces a case change 
in their query, the query analyzer chain breaks it into multiple words and no 
hits are found...  so a search for "exaMple" will look for "exa mple" and fail.

I've found two solutions that resolve this problem in the admin panel field 
analysis tool:


1.) Turn on catenateWords and catenateNumbers in the query analyzer - this 
reassembles the user's broken word and allows a match.

2.) Turn on preserveOriginal in the query analyzer - this passes through the 
user's original query, which then gets cleaned up by the ICUFoldingFilterFactory 
and allows a match (see the config sketch below).
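
For reference, solution #2 as a concrete config line - this is the query 
analyzer's WordDelimiterFilterFactory entry from above with the one extra 
attribute:

<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
        generateNumberParts="1" catenateWords="0" catenateNumbers="0"
        catenateAll="0" splitOnCaseChange="1" preserveOriginal="1"/>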

The problem is that in my real-world application, which uses DisMax, neither of 
these solutions work.  It appears that even though (if I understand correctly) 
the WordDelimiterFilterFactory is returning ALTERNATIVE tokens, the DisMax 
handler is combining them a way that requires all of them to match in an 
inappropriate way...  for example, here's partial debugQuery output for the 
exaMple search using Dismax and solution #2 above:

parsedquery:"+DisjunctionMaxQuery((genre:\"(exampl exa) mple\"^300.0 | 
title_new:\"(exampl exa) mple\"^100.0 | topic:\"(exampl exa) mple\"^500.0 | 
series:\"(exampl exa) mple\"^50.0 | title_full_unstemmed:\"(example exa) 
mple\"^600.0 | geographic:\"(exampl exa) mple\"^300.0 | contents:\"(exampl exa) 
mple\"^10.0 | fulltext_unstemmed:\"(example exa) mple\"^10.0 | 
allfields_unstemmed:\"(example exa) mple\"^10.0 | title_alt:\"(exampl exa) 
mple\"^200.0 | series2:\"(exampl exa) mple\"^30.0 | title_short:\"(exampl exa) 
mple\"^750.0 | author:\"(example exa) mple\"^300.0 | title:\"(exampl exa) 
mple\"^500.0 | topic_unstemmed:\"(example exa) mple\"^550.0 | 
allfields:\"(exampl exa) mple\" | author_fuller:\"(example exa) mple\"^150.0 | 
title_full:\"(exampl exa) mple\"^400.0 | fulltext:\"(exampl exa) mple\")) ()",

Obviously, that is not what I want - ideally it would be something like 'exampl 
OR ex ample'.

I also read about the autoGeneratePhraseQueries setting, but that seems to take 
things way too far in the opposite direction - if I set that to false, then I 
get matches for any individual token; i.e. example OR ex OR ample - not good at 
all!

I have a sinking suspicion that there is not an easy solution to my problem, 
but this seems to be a fairly basic need; splitOnCaseChange is a useful feature 
to have, but it's more valuable if it serves as an ALTERNATIVE search rather 
than a necessary query munge.  Any thoughts?
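
For reference, option #2 is just the query-side WordDelimiterFilterFactory from 
above with preserveOriginal added -- a minimal sketch:

    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" 
generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" 
splitOnCaseChange="1" preserveOriginal="1"/>

As noted above, this behaves as hoped in the field analysis tool, but not under 
DisMax.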

thanks,
Demian


RE: Dismax handler - whitespace and special character behaviour

2011-10-25 Thread Demian Katz
I just sent an email to the list about DisMax interacting with 
WordDelimiterFilterFactory, and I think our problems are at least partially 
related -- I think the reason you are seeing an OR where you expect an AND is 
that you have autoGeneratePhraseQueries set to false, which changes the way 
DisMax handles the output of the WordDelimiterFilterFactory (among others).  
Unfortunately, I don't have a solution for you...  but you might want to keep 
an eye on my thread in case replies there shed any additional light.

- Demian

 -Original Message-
 From: Rohk [mailto:khor...@gmail.com]
 Sent: Tuesday, October 25, 2011 10:33 AM
 To: solr-user@lucene.apache.org
 Subject: Dismax handler - whitespace and special character behaviour
 
 Hello,
 
 I've got strange results when I have special characters in my query.
 
 Here is my request :
 
 q=histoire-france&start=0&rows=10&sort=score+desc&defType=dismax&qf=any^1.0&mm=100%
 
 Parsed query :
 
 <str name="parsedquery_toString">+((any:histoir any:franc)) ()</str>
 
 I've got 17000 results because Solr is doing an OR (should be AND).
 
 I have no problem when I'm using a whitespace instead of a special char
 :
 
 q=histoire france&start=0&rows=10&sort=score+desc&defType=dismax&qf=any^1.0&mm=100%
 
 <str name="parsedquery_toString">+(((any:histoir) (any:franc))~2) ()</str>
 
 2000 results for this query.
 
 Here is my schema.xml (relevant parts) :
 
 <fieldType name="text" class="solr.TextField"
 positionIncrementGap="100" autoGeneratePhraseQueries="false">
   <analyzer type="index">
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.WordDelimiterFilterFactory"
 generateWordParts="1" generateNumberParts="1" catenateWords="1"
 catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"
 preserveOriginal="1"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.CommonGramsFilterFactory"
 words="stopwords_french.txt" ignoreCase="true"/>
     <filter class="solr.StopFilterFactory" ignoreCase="true"
 words="stopwords_french.txt" enablePositionIncrements="true"/>
     <filter class="solr.SnowballPorterFilterFactory"
 language="French" protected="protwords.txt"/>
     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
     <filter class="solr.ASCIIFoldingFilterFactory"/>
   </analyzer>
   <analyzer type="query">
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <!--<filter class="solr.SynonymFilterFactory"
 synonyms="synonyms.txt" ignoreCase="true" expand="true"/>-->
     <filter class="solr.WordDelimiterFilterFactory"
 generateWordParts="1" generateNumberParts="1" catenateWords="0"
 catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"
 preserveOriginal="0"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.CommonGramsFilterFactory"
 words="stopwords_french.txt" ignoreCase="true"/>
     <filter class="solr.StopFilterFactory" ignoreCase="true"
 words="stopwords_french.txt" enablePositionIncrements="true"/>
     <filter class="solr.SnowballPorterFilterFactory"
 language="French" protected="protwords.txt"/>
     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
     <filter class="solr.ASCIIFoldingFilterFactory"/>
   </analyzer>
 </fieldType>
 
 I tried with a PatternTokenizerFactory to tokenize on whitespaces &
 special chars but no change...
 Even with a charFilter (PatternReplaceCharFilterFactory) to replace
 special
 characters by whitespace, it doesn't work...
 
 First line of analysis via solr admin, with verbose output, for query =
 'histoire-france' :
 
 org.apache.solr.analysis.PatternReplaceCharFilterFactory {replacement=" ",
 pattern=([,;./\\'-]), luceneMatchVersion=LUCENE_32}
 text: histoire france
 
 The '-' is replaced by ' ', then tokenized by
 WhitespaceTokenizerFactory.
 However I still have different number of results for 'histoire-france'
 and
 'histoire france'.
 
 My current workaround is to replace all special chars by whitespaces
 before
 sending query to Solr, but it is not satisfying.
 
 Did i miss something ?


LocalParams, bq, and highlighting

2011-09-21 Thread Demian Katz
I've run into another strange behavior related to LocalParams syntax in Solr 
1.4.1.  If I apply Dismax boosts using bq in LocalParams syntax, the contents 
of the boost queries get used by the highlighter.  Obviously, when I use bq as 
a separate parameter, this is not an issue.

To clarify, here are two searches that yield identical results but different 
highlighting behaviors:

http://localhost:8080/solr/biblio/select/?q=john&rows=20&start=0&indent=yes&qf=author^100&qt=dismax&bq=author%3Asmith^1000&fl=score&hl=true&hl.fl=*

http://localhost:8080/solr/biblio/select/?q=%28%28_query_%3A%22{!dismax+qf%3D\%22author^100\%22+bq%3D\%27author%3Asmith^1000\%27}john%22%29%29&rows=20&start=0&indent=yes&fl=score&hl=true&hl.fl=*

Query #1 highlights only "john" (the desired behavior), but query #2 highlights 
both "john" and "smith".

Is this a known limitation of the highlighter, or is it a bug?  Is this issue 
resolved in newer versions of Solr?

thanks,
Demian


Questions about LocalParams syntax

2011-09-20 Thread Demian Katz
I'm using the LocalParams syntax combined with the _query_ pseudo-field to 
build an advanced search screen (built on Solr 1.4.1's Dismax handler), but I'm 
running into some syntax questions that don't seem to be addressed by the wiki 
page here:

http://wiki.apache.org/solr/LocalParams


1.)How should I deal with repeating parameters?  If I use multiple boost 
queries, it seems that only the last one listed is used...  for example:

((_query_:"{!dismax qf=\"title^500 author^300 allfields\" bq=\"format:Book^50\" 
bq=\"format:Journal^150\"}test"))

boosts Journals, but not Books.  If I reverse the order of the 
two bq parameters, then Books get boosted instead of Journals.  I can work 
around this by creating one bq with the clauses OR'ed together, but I would 
rather be able to apply multiple bq's like I can elsewhere.


2.)What is the proper way to escape quotes?  Since there are multiple 
nested layers of double quotes, things get ugly and it's easy to end up with 
syntax errors.  I found that this syntax doesn't cause an error:


((_query_:"{!dismax qf=\"title^500 author^300 allfields\" 
bq=\"format:\\\"Book\\\"^50\" bq=\"format:\\\"Journal\\\"^150\"}test"))

...but it also doesn't work correctly - the boost queries are 
completely ignored in this example.  Perhaps this is more a problem related to  
_query_ than to LocalParams syntax...  but either way, a solution would be 
great!

thanks,
Demian


RE: Questions about LocalParams syntax

2011-09-20 Thread Demian Katz
Space-separation works for the qf field, but not for bq.  If I try a bq of 
"format:Book^50 format:Journal^150", I get a strange result -- I would expect 
in the case of a failed bq that either a) I would get a syntax error of some 
sort or b) I would get normal search results with no boosting applied.  
Instead, I get a successful search result containing 0 entries.  Very odd!  
Anyway, the solution that definitely works is joining the clauses with OR...  
but I'd still love to be able to specify multiple bq's separately if there's 
any way it can be done.

As for the quote issue, the problem I'm trying to solve is that my code is 
driven by configuration files, and users may specify any legal Solr bq values 
that they choose.  You're right that in some cases, I can simplify the 
situation by alternating quotes or changing the syntax...  but I don't want to 
force users into using a subset of legal Solr syntax; it would be much better 
to able to handle all legal cases in a straightforward fashion.  Admittedly, my 
example is artificial -- format:"Book"^50 works just as well as 
format:Book^50...  but suppose they wanted to boost a phrase like 
format:"Conference Proceeding"^25 -- this is a common case.  It seems like 
there should be some syntax that allows this to work in the context I am using 
it.  If not, perhaps we need to file a bug report.
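
To make the phrase case concrete, here is the sort of thing I am trying to get 
working -- an untested sketch that alternates single and double quotes at the 
two nesting levels:

((_query_:"{!dismax qf='title^500 author^300 allfields' bq='format:\"Conference 
Proceeding\"^25'}test"))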

In any case, thanks for taking the time to make some suggestions!  It surprises 
me that this very powerful feature of Solr is so little-documented.

- Demian

 -Original Message-
 From: Jonathan Rochkind [mailto:rochk...@jhu.edu]
 Sent: Tuesday, September 20, 2011 10:32 AM
 To: solr-user@lucene.apache.org
 Cc: Demian Katz
 Subject: Re: Questions about LocalParams syntax
 
 I don't have the complete answer. But I _think_ if you do one 'bq'
 param
 with multiple space-seperated directives, it will work.
 
 And escaping is a pain.  But can be made somewhat less of a pain if you
 realize that single quotes can sometimes be used instead of
 double-quotes. What I do:
 
  _query_:"{!dismax qf='title something else'}"
 
 So by switching between single and double quotes, you can avoid need to
 escape. Sometimes you still do need to escape when a single or double
 quote is actually in a value (say in a 'q'), and I do use backslash
 there. If you had more levels of nesting though... I have no idea what
 you'd do.
 
 I'm not even sure why you have the internal quotes here:
 
  bq=\"format:\\\"Book\\\"^50\"
 
 
 Shouldn't that just be bq='format:Book^50', what's the extra double
 quotes around Book?  If you don't need them, then with switching
 between single and double, this can become somewhat less crazy and
 error
 prone:
 
  _query_:"{!dismax bq='format:Book^50'}"
 
 I think. Maybe. If you really do need the double quotes in there, then
 I
 think switching between single and double you can use a single
 backslash
 there.
 
 
 On 9/20/2011 9:39 AM, Demian Katz wrote:
  I'm using the LocalParams syntax combined with the _query_ pseudo-
 field to build an advanced search screen (built on Solr 1.4.1's Dismax
 handler), but I'm running into some syntax questions that don't seem to
 be addressed by the wiki page here:
 
  http://wiki.apache.org/solr/LocalParams
 
 
  1.)How should I deal with repeating parameters?  If I use
 multiple boost queries, it seems that only the last one listed is
 used...  for example:
 
   ((_query_:"{!dismax qf=\"title^500 author^300 allfields\"
  bq=\"format:Book^50\" bq=\"format:Journal^150\"}test"))
 
   boosts Journals, but not Books.  If I reverse the
 order of the two bq parameters, then Books get boosted instead of
 Journals.  I can work around this by creating one bq with the clauses
 OR'ed together, but I would rather be able to apply multiple bq's like
 I can elsewhere.
 
 
  2.)What is the proper way to escape quotes?  Since there are
 multiple nested layers of double quotes, things get ugly and it's easy
 to end up with syntax errors.  I found that this syntax doesn't cause
 an error:
 
 
   ((_query_:"{!dismax qf=\"title^500 author^300 allfields\"
  bq=\"format:\\\"Book\\\"^50\" bq=\"format:\\\"Journal\\\"^150\"}test"))
 
   ...but it also doesn't work correctly - the boost
 queries are completely ignored in this example.  Perhaps this is more a
 problem related to  _query_ than to LocalParams syntax...  but either
 way, a solution would be great!
 
  thanks,
  Demian
 


String index out of range: -1 for hl.fl=* in Solr 1.4.1?

2011-09-09 Thread Demian Katz
I'm running into a strange problem with Solr 1.4.1 - this request:

http://localhost:8080/solr/website/select/?q=*%3A*&rows=20&start=0&indent=yes&fl=score&facet=true&facet.mincount=1&facet.limit=30&facet.field=category&facet.field=linktype&facet.field=subject&facet.prefix=&facet.sort=&fq=category%3A%22Exhibits%22&spellcheck=true&spellcheck.q=*%3A*&spellcheck.dictionary=default&hl=true&hl.fl=*&hl.simple.pre=%7B%7B%7B%7BSTART_HILITE%7D%7D%7D%7D&hl.simple.post=%7B%7B%7B%7BEND_HILITE%7D%7D%7D%7D&wt=json&json.nl=arrarr

leads to this error dump:

String index out of range: -1

java.lang.StringIndexOutOfBoundsException: String index out of range: -1
at java.lang.String.substring(String.java:1949)
at 
org.apache.lucene.search.highlight.Highlighter.getBestTextFragments(Highlighter.java:263)
at 
org.apache.solr.highlight.DefaultSolrHighlighter.doHighlighting(DefaultSolrHighlighter.java:335)
at 
org.apache.solr.handler.component.HighlightComponent.process(HighlightComponent.java:89)
at 
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:195)
at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
at 
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
at 
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1088)
at 
org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:360)
at 
org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
at 
org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
at 
org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:729)
at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
at 
org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:206)
at 
org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
at 
org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
at org.mortbay.jetty.Server.handle(Server.java:324)
at 
org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:505)
at 
org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:829)
at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:513)
at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:211)
at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:380)
at 
org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:395)
at 
org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:488)

I've managed to work around the problem by replacing the hl.fl=* parameter with 
a comma-delimited list of the fields I actually need highlighted...  but I 
don't understand why I'm encountering this error, and for peace of mind I would 
like to understand the problem in case there's a deeper problem at work here.  
I'll be happy to share schema or other details if they would help narrow down a 
potential cause!
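
For reference, the working form of the request simply swaps hl.fl=* for an 
explicit field list (the field names here are just examples, not a 
recommendation):

http://localhost:8080/solr/website/select/?q=*%3A*&hl=true&hl.fl=title,category,subject&hl.simple.pre=%7B%7B%7B%7BSTART_HILITE%7D%7D%7D%7D&hl.simple.post=%7B%7B%7B%7BEND_HILITE%7D%7D%7D%7D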

thanks,
Demian


RE: SpellCheckComponent performance

2011-06-07 Thread Demian Katz
As I may have mentioned before, VuFind is actually doing two Solr queries for 
every search -- a base query that gets basic spelling suggestions, and a 
supplemental spelling-only query that gets shingled spelling suggestions.  If 
there's a way to get two different spelling responses in a single query, I'd 
love to hear about it...  but the double-querying doesn't seem to be a huge 
problem -- the delays I'm talking about are in the spelling portion of the 
initial query.  Just for the sake of completeness, here are both of my spelling 
field types:

<!-- Basic Text Field for use with Spell Correction -->
<fieldType name="textSpell" class="solr.TextField" 
positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="schema.UnicodeNormalizationFilterFactory" 
version="icu4j" composed="false" remove_diacritics="true" 
remove_modifiers="true" fold="true"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" 
generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" 
words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
<!-- More advanced spell checking field. -->
<fieldType name="textSpellShingle" class="solr.TextField" 
positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" 
words="stopwords.txt"/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="2" 
outputUnigrams="false"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" 
words="stopwords.txt"/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="2" 
outputUnigrams="false"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

...and here are the fields:

   <field name="spelling" type="textSpell" indexed="true" stored="true"/>
   <field name="spellingShingle" type="textSpellShingle" indexed="true" 
stored="true" multiValued="true"/>

As you can probably guess, I'm using "spelling" in my main query and 
"spellingShingle" in my supplemental query.

Here are stats on the spelling field:

{field=spelling,memSize=107830314,tindexSize=249184,time=25747,phase1=25150,nTerms=1343061,bigTerms=231,termInstances=40960454,uses=1}

(I obtained these numbers by temporarily adding the spelling field as a facet 
to my warming query -- probably not a very smart way to do it, but it was the 
only way I could figure out!  If there's a more elegant and accurate approach, 
I'd be interested to know what it is.)
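
(Perhaps the LukeRequestHandler would be a more elegant way to get per-field 
term statistics -- an untested guess at the request:
http://localhost:8080/solr/biblio/admin/luke?fl=spelling&numTerms=0 )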

I should also note that my basic spelling index is 114MB and my shingled 
spelling index is 931MB -- not outrageously large.  Is there a way to persuade 
Solr to load these into memory for faster performance?
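
One thing I noticed in the documentation: if spellcheckIndexDir is left unset, 
the spelling index is supposed to be held in a RAMDirectory rather than on 
disk.  So an untested sketch of an in-memory variant of my config would be:

    <lst name="spellchecker">
      <str name="name">basicSpell</str>
      <str name="field">spelling</str>
      <str name="accuracy">0.75</str>
      <!-- no spellcheckIndexDir: the index is built in RAM -->
      <str name="queryAnalyzerFieldType">textSpell</str>
      <str name="buildOnOptimize">true</str>
    </lst>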

thanks,
Demian

 -Original Message-
 From: Erick Erickson [mailto:erickerick...@gmail.com]
 Sent: Monday, June 06, 2011 6:23 PM
 To: solr-user@lucene.apache.org
 Subject: Re: SpellCheckComponent performance
 
 Hmmm, how are you configuring you spell checker? The first-time
 slowdown
 is probably due to cache warming, but subsequent 500 ms slowdowns
 seem odd. How many unique terms are there in your spellecheck index?
 
 It'd probably be best if you showed us your fieldtype and field
 definition...
 
 Best
 Erick
 
 On Mon, Jun 6, 2011 at 4:04 PM, Demian Katz demian.k...@villanova.edu
 wrote:
  I'm continuing to work on tuning my Solr server, and now I'm noticing
 that my biggest bottleneck is the SpellCheckComponent.  This is eating
 multiple seconds on most first-time searches, and still taking around
 500ms even on cached searches.  Here is my configuration:
 
   <searchComponent name="spellcheck"
 class="org.apache.solr.handler.component.SpellCheckComponent">
     <lst name="spellchecker">
       <str name="name">basicSpell</str>
       <str name="field">spelling</str>
       <str name="accuracy">0.75</str>
       <str name="spellcheckIndexDir">./spellchecker</str>
       <str name="queryAnalyzerFieldType">textSpell</str>
       <str name="buildOnOptimize">true</str>
     </lst>
   </searchComponent>
 
  I've done a bit of searching, but the best advice I could find for
 making the search component go faster involved reducing
 spellcheck.maxCollationTries, which doesn't even seem to apply to my
 settings.
 
  Does anyone have any advice on tuning this aspect of my
 configuration?  Are there any extra debug settings that might give
 deeper insight into how the component is spending its time?
 
  thanks,
  Demian
 


RE: Solr performance tuning - disk i/o?

2011-06-06 Thread Demian Katz
Thanks once again for the helpful suggestions!

Regarding the selection of facet fields, I think publishDate (which is actually 
just a year) and callnumber-first (which is actually a very broad, high-level 
category) are okay.  authorStr is an interesting problem: it's definitely a 
useful facet (when a user searches for an author, odds are good that they want 
the one who published the most books... i.e. a search for dickens will probably 
show Charles Dickens at the top of the facet list), but it has a long tail 
since there are many minor authors who have only published one or two books...  
Is there a possibility that the facet.mincount parameter could be helpful here, 
or does that have no impact on performance/memory footprint?

Regarding polling interval for slaves, are you referring to a distributed Solr 
environment, or is this something to do with Solr's internals?  We're currently 
a single-server environment, so I don't think I have to worry if it's related 
to a multi-server setup...  but if it's something internal, could you point me 
to the right area of the admin panel to check my stats?  I'm not seeing 
anything about polling on the statistics page.  It's also a little strange that 
all of my warmupTime stats on searchers and caches are showing as 0 -- is that 
normal?

thanks,
Demian

 -Original Message-
 From: Erick Erickson [mailto:erickerick...@gmail.com]
 Sent: Friday, June 03, 2011 4:45 PM
 To: solr-user@lucene.apache.org
 Subject: Re: Solr performance tuning - disk i/o?
 
 Quick impressions:
 
 The faceting is usually best done on fields that don't have lots of
 unique
 values for three reasons:
 1 It's questionable how much use to the user to have a gazillion
 facets.
  In the case of a unique field per document, in fact, it's useless.
 2 resource requirements go up as a function of the number of unique
  terms. This is true for faceting and sorting.
 3 warmup times grow the more terms have to be read into memory.
 
 
 Glancing at your warmup stuff, things like publishDate, authorStr and
 maybe
 callnumber-first are questionable. publishDate depends on how coarse
 the
 resolution is. If it's by day, that's not really much use. authorStr..
 How many
 authors have more than one publication? Would this be better served by
 some
 kind of autosuggest rather than facets? callnumber-first... I don't
 really know, but
 if it's unique per document it's probably not something the user would
 find useful
 as a facet.
 
 The admin page will help you determine the number of unique terms per
 field,
 which may guide you whether or not to continue to facet on these
 fields.
 
 As Otis said, doing a sort on the fields during warmup will also help.
 
 Watch your polling interval for any slaves in relation to the warmup
 times.
 If your polling interval is shorter than the warmup times, you run a
 risk of
 runaway warmups.
 
 As you've figured out, measuring responses to the first few queries
 doesn't
 always measure what you really need <G>..
 
 I don't have the pages handy, but autowarming is a good topic to
 understand,
 so you might spend some time tracking it down.
 
 Best
 Erick
 
 On Fri, Jun 3, 2011 at 11:21 AM, Demian Katz
 demian.k...@villanova.edu wrote:
  Thanks to you and Otis for the suggestions!  Some more information:
 
  - Based on the Solr stats page, my caches seem to be working pretty
 well (few or no evictions, hit rates in the 75-80% range).
  - VuFind is actually doing two Solr queries per search (one initial
 search followed by a supplemental spell check search -- I believe this
 is necessary because VuFind has two separate spelling indexes, one for
 shingled terms and one for single words).  That is probably
 exaggerating the problem, though based on searches with debugQuery on,
 it looks like it's always the initial search (rather than the
 supplemental spelling search) that's consuming the bulk of the time.
  - enableLazyFieldLoading is set to true.
  - I'm retrieving 20 documents per page.
  - My JVM settings: -server -
 Xloggc:/usr/local/vufind/solr/jetty/logs/gc.log -Xms4096m -Xmx4096m -
 XX:+UseParallelGC -XX:+UseParallelOldGC -XX:NewRatio=5
 
  It appears that a large portion of my problem had to do with
 autowarming, a topic that I've never had a strong grasp on, though
 perhaps I'm finally learning (any recommended primer links would be
 welcome!).  I did have some autowarming settings in solrconfig.xml (an
 arbitrary search for a bunch of random keywords in the newSearcher and
 firstSearcher events, plus autowarmCount settings on all of my caches).
  However, when I looked at the debugQuery output, I noticed that a huge
 amount of time was being wasted loading facets on the first search
 after restarting Solr, so I changed my newSearcher and firstSearcher
 events to this:
 
       <arr name="queries">
         <lst>
           <str name="q">*:*</str>
           <str name="start">0</str>
           <str name="rows">10</str>
           <str name="facet">true</str>
           <str

RE: Solr performance tuning - disk i/o?

2011-06-06 Thread Demian Katz
All of my cache autowarmCount settings are either 1 or 5.  
maxWarmingSearchers is set to 2.  I previously shared the contents of my 
firstSearcher and newSearcher events -- just a queries array surrounded by a 
standard-looking listener tag.  The events are definitely firing -- in 
addition to the measurable performance improvement they give me, I can actually 
see them happening in the console output during startup.  That seems to cover 
every configuration option in my file that references warming in any way, and 
it all looks reasonable to me.  warmupTime remains consistently 0 in the 
statistics display.  Is there anything else I should be looking at?  In any 
case, I'm not too alarmed by this... it just seems a little strange.

thanks,
Demian

 -Original Message-
 From: Erick Erickson [mailto:erickerick...@gmail.com]
 Sent: Monday, June 06, 2011 11:59 AM
 To: solr-user@lucene.apache.org
 Subject: Re: Solr performance tuning - disk i/o?
 
 Polling interval was in reference to slaves in a multi-machine
 master/slave setup. so probably not
 a concern just at present.
 
 Warmup time of 0 is not particularly normal, I'm not quite sure what's
 going on there but you may
 want to look at firstsearcher, newsearcher and autowarm parameters in
 config.xml..
 
 Best
 Erick
 
 On Mon, Jun 6, 2011 at 9:08 AM, Demian Katz demian.k...@villanova.edu
 wrote:
  Thanks once again for the helpful suggestions!
 
  Regarding the selection of facet fields, I think publishDate (which
 is actually just a year) and callnumber-first (which is actually a very
 broad, high-level category) are okay.  authorStr is an interesting
 problem: it's definitely a useful facet (when a user searches for an
 author, odds are good that they want the one who published the most
 books... i.e. a search for dickens will probably show Charles Dickens
 at the top of the facet list), but it has a long tail since there are
 many minor authors who have only published one or two books...  Is
 there a possibility that the facet.mincount parameter could be helpful
 here, or does that have no impact on performance/memory footprint?
 
  Regarding polling interval for slaves, are you referring to a
 distributed Solr environment, or is this something to do with Solr's
 internals?  We're currently a single-server environment, so I don't
 think I have to worry if it's related to a multi-server setup...  but
 if it's something internal, could you point me to the right area of the
 admin panel to check my stats?  I'm not seeing anything about polling
 on the statistics page.  It's also a little strange that all of my
 warmupTime stats on searchers and caches are showing as 0 -- is that
 normal?
 
  thanks,
  Demian
 
  -Original Message-
  From: Erick Erickson [mailto:erickerick...@gmail.com]
  Sent: Friday, June 03, 2011 4:45 PM
  To: solr-user@lucene.apache.org
  Subject: Re: Solr performance tuning - disk i/o?
 
  Quick impressions:
 
  The faceting is usually best done on fields that don't have lots of
  unique
  values for three reasons:
  1 It's questionable how much use to the user to have a gazillion
  facets.
       In the case of a unique field per document, in fact, it's
 useless.
  2 resource requirements go up as a function of the number of unique
       terms. This is true for faceting and sorting.
  3 warmup times grow the more terms have to be read into memory.
 
 
  Glancing at your warmup stuff, things like publishDate, authorStr
 and
  maybe
  callnumber-first are questionable. publishDate depends on how coarse
  the
  resolution is. If it's by day, that's not really much use.
 authorStr..
  How many
  authors have more than one publication? Would this be better served
 by
  some
  kind of autosuggest rather than facets? callnumber-first... I don't
  really know, but
  if it's unique per document it's probably not something the user
 would
  find useful
  as a facet.
 
  The admin page will help you determine the number of unique terms
 per
  field,
  which may guide you whether or not to continue to facet on these
  fields.
 
  As Otis said, doing a sort on the fields during warmup will also
 help.
 
  Watch your polling interval for any slaves in relation to the warmup
  times.
  If your polling interval is shorter than the warmup times, you run a
  risk of
  runaway warmups.
 
  As you've figured out, measuring responses to the first few queries
  doesn't
  always measure what you really need <G>..
 
  I don't have the pages handy, but autowarming is a good topic to
  understand,
  so you might spend some time tracking it down.
 
  Best
  Erick
 
  On Fri, Jun 3, 2011 at 11:21 AM, Demian Katz
  demian.k...@villanova.edu wrote:
   Thanks to you and Otis for the suggestions!  Some more
 information:
  
   - Based on the Solr stats page, my caches seem to be working
 pretty
  well (few or no evictions, hit rates in the 75-80% range).
   - VuFind is actually doing two Solr queries per search (one
 initial
  search

SpellCheckComponent performance

2011-06-06 Thread Demian Katz
I'm continuing to work on tuning my Solr server, and now I'm noticing that my 
biggest bottleneck is the SpellCheckComponent.  This is eating multiple seconds 
on most first-time searches, and still taking around 500ms even on cached 
searches.  Here is my configuration:

  <searchComponent name="spellcheck" 
class="org.apache.solr.handler.component.SpellCheckComponent">
    <lst name="spellchecker">
      <str name="name">basicSpell</str>
      <str name="field">spelling</str>
      <str name="accuracy">0.75</str>
      <str name="spellcheckIndexDir">./spellchecker</str>
      <str name="queryAnalyzerFieldType">textSpell</str>
      <str name="buildOnOptimize">true</str>
    </lst>
  </searchComponent>

I've done a bit of searching, but the best advice I could find for making the 
search component go faster involved reducing spellcheck.maxCollationTries, 
which doesn't even seem to apply to my settings.

Does anyone have any advice on tuning this aspect of my configuration?  Are 
there any extra debug settings that might give deeper insight into how the 
component is spending its time?

thanks,
Demian


Solr performance tuning - disk i/o?

2011-06-03 Thread Demian Katz
Hello,

I'm trying to move a VuFind installation from an ailing physical server into a 
virtualized environment, and I'm running into performance problems.  VuFind is 
a Solr 1.4.1-based application with fairly large and complex records (many 
stored fields, many words per record).  My particular installation contains 
about a million records in the index, with a total index size around 6GB.

The virtual environment has more RAM and better CPUs than the old physical box, 
and I am satisfied that my Java environment is well-tuned.  My index is 
optimized.  Searches that hit the cache respond very well.  The problem is that 
non-cached searches are very slow - the more keywords I add, the slower they 
get, to the point of taking 6-12 seconds to come back with results on a quiet 
box and well over a minute under stress testing.  (The old box still took a 
while for equivalent searches, but it was about twice as fast as the new one).

My gut feeling is that disk access reading the index is the bottleneck here, 
but I know little about the specifics of Solr's internals, so it's entirely 
possible that my gut is wrong.  Outside testing does show that the virtual 
environment's disk performance is not as good as the old physical server, 
especially when multiple processes are trying to access the same file 
simultaneously.

So, two basic questions:


1.)Would you agree that I'm dealing with a disk bottleneck, or are there 
some other factors I should be considering?  Any good diagnostics I should be 
looking at?

2.)If the problem is disk access, is there anything I can tune on the Solr 
side to alleviate the problems?

Thanks,
Demian


RE: Solr performance tuning - disk i/o?

2011-06-03 Thread Demian Katz
Thanks to you and Otis for the suggestions!  Some more information:

- Based on the Solr stats page, my caches seem to be working pretty well (few 
or no evictions, hit rates in the 75-80% range).
- VuFind is actually doing two Solr queries per search (one initial search 
followed by a supplemental spell check search -- I believe this is necessary 
because VuFind has two separate spelling indexes, one for shingled terms and 
one for single words).  That is probably exaggerating the problem, though based 
on searches with debugQuery on, it looks like it's always the initial search 
(rather than the supplemental spelling search) that's consuming the bulk of the 
time.
- enableLazyFieldLoading is set to true.
- I'm retrieving 20 documents per page.
- My JVM settings: -server -Xloggc:/usr/local/vufind/solr/jetty/logs/gc.log 
-Xms4096m -Xmx4096m -XX:+UseParallelGC -XX:+UseParallelOldGC -XX:NewRatio=5

It appears that a large portion of my problem had to do with autowarming, a 
topic that I've never had a strong grasp on, though perhaps I'm finally 
learning (any recommended primer links would be welcome!).  I did have some 
autowarming settings in solrconfig.xml (an arbitrary search for a bunch of 
random keywords in the newSearcher and firstSearcher events, plus autowarmCount 
settings on all of my caches).  However, when I looked at the debugQuery 
output, I noticed that a huge amount of time was being wasted loading facets on 
the first search after restarting Solr, so I changed my newSearcher and 
firstSearcher events to this:

  <arr name="queries">
    <lst>
      <str name="q">*:*</str>
      <str name="start">0</str>
      <str name="rows">10</str>
      <str name="facet">true</str>
      <str name="facet.mincount">1</str>
      <str name="facet.field">collection</str>
      <str name="facet.field">format</str>
      <str name="facet.field">publishDate</str>
      <str name="facet.field">callnumber-first</str>
      <str name="facet.field">topic_facet</str>
      <str name="facet.field">authorStr</str>
      <str name="facet.field">language</str>
      <str name="facet.field">genre_facet</str>
      <str name="facet.field">era_facet</str>
      <str name="facet.field">geographic_facet</str>
    </lst>
  </arr>

Overall performance has now increased dramatically, and now the biggest 
bottleneck in the debug output seems to be the shingle spell checking!

Any other suggestions are welcome, since I suspect there's still room to 
squeeze more performance out of the system, and I'm still not sure I'm making 
the most of autowarming...  but this seems like a big step in the right 
direction.  Thanks again for the help!
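
For completeness, the cache-level autowarming I mentioned is just the 
autowarmCount attribute on the caches in solrconfig.xml -- a sketch with 
illustrative values, not recommendations:

    <filterCache class="solr.LRUCache" size="512" initialSize="512" 
autowarmCount="5"/>
    <queryResultCache class="solr.LRUCache" size="512" initialSize="512" 
autowarmCount="5"/>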

- Demian

 -Original Message-
 From: Erick Erickson [mailto:erickerick...@gmail.com]
 Sent: Friday, June 03, 2011 9:41 AM
 To: solr-user@lucene.apache.org
 Subject: Re: Solr performance tuning - disk i/o?
 
 This doesn't seem right. Here's a couple of things to try:
 1 attach debugQuery=on to your long-running queries. The QTime
 returned
  is the time taken to search, NOT including the time to load the
 docs. That'll
  help pinpoint whether the problem is the search itself, or
 assembling the
  documents.
 2 Are you autowarming? If so, be sure it's actually done before
 querying.
 3 Measure queries after the first few, particularly if you're sorting
 or
  faceting.
 4 What are your JVM settings? How much memory do you have?
 5 is enableLazyFieldLoading set to true in your solrconfig.xml?
 6 How many docs are you returning?
 
 
 There's more, but that'll do for a start Let us know if you gather
 more data
 and it's still slow.
 
 Best
 Erick
 
 On Fri, Jun 3, 2011 at 8:44 AM, Demian Katz demian.k...@villanova.edu
 wrote:
  Hello,
 
  I'm trying to move a VuFind installation from an ailing physical
 server into a virtualized environment, and I'm running into performance
 problems.  VuFind is a Solr 1.4.1-based application with fairly large
 and complex records (many stored fields, many words per record).  My
 particular installation contains about a million records in the index,
 with a total index size around 6GB.
 
  The virtual environment has more RAM and better CPUs than the old
 physical box, and I am satisfied that my Java environment is well-
 tuned.  My index is optimized.  Searches that hit the cache respond
 very well.  The problem is that non-cached searches are very slow - the
 more keywords I add, the slower they get, to the point of taking 6-12
 seconds to come back with results on a quiet box and well over a minute
 under stress testing.  (The old box still took a while for equivalent
 searches, but it was about twice as fast as the new one).
 
  My gut feeling is that disk access reading the index is the
 bottleneck here, but I know little about the specifics of Solr's
 internals, so it's entirely possible that my gut is wrong.  Outside
 testing does show that the virtual environment's disk performance
 is not as good as the old physical server, especially when

Bug in solr.KeywordMarkerFilterFactory?

2011-04-20 Thread Demian Katz
I've just started experimenting with the solr.KeywordMarkerFilterFactory in 
Solr 3.1, and I'm seeing some strange behavior.  It seems that every word 
subsequent to a protected word is also treated as being protected.

For testing purposes, I have put the word "spelling" in my protwords.txt.  If I 
do a test for "spelling bees" in the analyze tool, the stemmer produces 
"spelling bees" - nothing is stemmed.  But if I do a test for "bees spelling", 
I get "bee spelling", the expected result with "bees" stemmed but "spelling" 
left unstemmed.  I have tried extended examples - in every case I tried, all of 
the words prior to "spelling" get stemmed, but none of the words after 
"spelling" get stemmed.  When turning on the verbose mode of the analyze tool, 
I can see that the settings of the keyword attribute introduced by 
solr.KeywordMarkerFilterFactory correspond with the stemming behavior... so 
I think the solr.KeywordMarkerFilterFactory component is to blame, and not 
anything later in the analyze chain.

Any idea what might be going wrong?  Is this a known issue?

Here is my field type definition, in case it makes a difference:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" 
generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" 
splitOnCaseChange="1"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" 
words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" 
protected="protwords.txt"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" 
ignoreCase="true" expand="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" 
generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" 
splitOnCaseChange="1"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" 
words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" 
protected="protwords.txt"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

thanks,
Demian


RE: Bug in solr.KeywordMarkerFilterFactory?

2011-04-20 Thread Demian Katz
That's good news -- thanks for the help (not to mention the reassurance that 
Solr itself is actually working right)!  Hopefully 3.1.1 won't be too far off, 
though; when the analysis tool lies, life can get very confusing! :-)

- Demian

 -Original Message-
 From: Robert Muir [mailto:rcm...@gmail.com]
 Sent: Wednesday, April 20, 2011 2:54 PM
 To: solr-user@lucene.apache.org; yo...@lucidimagination.com
 Subject: Re: Bug in solr.KeywordMarkerFilterFactory?
 
 No, this is only a bug in analysis.jsp.
 
 you can see this by comparing analysis.jsp's "dontstems bees" to using
 the query debug interface:
 <lst name="debug">
   <str name="rawquerystring">dontstems bees</str>
   <str name="querystring">dontstems bees</str>
   <str name="parsedquery">PhraseQuery(text:"dontstems bee")</str>
   <str name="parsedquery_toString">text:"dontstems bee"</str>
 
 On Wed, Apr 20, 2011 at 2:43 PM, Yonik Seeley
 yo...@lucidimagination.com wrote:
  On Wed, Apr 20, 2011 at 2:01 PM, Demian Katz
 demian.k...@villanova.edu wrote:
  I've just started experimenting with the
 solr.KeywordMarkerFilterFactory in Solr 3.1, and I'm seeing some
 strange behavior.  It seems that every word subsequent to a protected
 word is also treated as being protected.
 
  You're right!  This was broken by LUCENE-2901 back in Jan.
  I've opened this issue:
  https://issues.apache.org/jira/browse/LUCENE-3039
 
  The easiest short-term workaround for you would probably be to create
  a custom filter that looks like KeywordMarkerFilter before the
  LUCENE-2901 change.
 
  -Yonik
  http://www.lucenerevolution.org -- Lucene/Solr User Conference, May
  25-26, San Francisco
 


Solr 3.1 ICU filters (error loading class)

2011-04-18 Thread Demian Katz
Hello,

I'm interested in trying out the new ICU features in Solr 3.1.  However, when I 
attempt to set up a field type using solr.ICUTokenizerFactory and/or 
solr.ICUFoldingFilterFactory, Solr refuses to start up, issuing "Error loading 
class" exceptions.

I did see the README.txt file that mentions lucene-libs/lucene-*.jar and 
lib/icu4j-*.jar.  I tried putting all of these files under my Solr home 
directory, but it made no difference.

Is there some other .jar that I need to add to my library folder?  Am I doing 
something wrong with the known dependencies?  (This is the first time I've seen 
a lucene-libs directory, so I wasn't sure if that required some special 
configuration).  Any general troubleshooting advice for figuring out what is 
going wrong here?

thanks,
Demian


RE: Solr 3.1 ICU filters (error loading class)

2011-04-18 Thread Demian Katz
Thanks!  apache-solr-analysis-extras-3.1.jar was the missing piece that was 
causing all of my trouble; I didn't see any mention of it in the documentation 
-- might be worth adding!
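
As an aside, it looks like the same jars could also be pulled in with explicit 
lib directives in solrconfig.xml instead of copying them into solr_home/lib -- 
an untested sketch, with paths guessed from the 3.1 distribution layout:

    <lib dir="../../contrib/analysis-extras/lib" regex=".*\.jar" />
    <lib dir="../../contrib/analysis-extras/lucene-libs" regex=".*\.jar" />
    <lib dir="../../dist/" regex="apache-solr-analysis-extras-\d.*\.jar" />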

Thanks,
Demian

 -Original Message-
 From: Robert Muir [mailto:rcm...@gmail.com]
 Sent: Monday, April 18, 2011 1:46 PM
 To: solr-user@lucene.apache.org
 Subject: Re: Solr 3.1 ICU filters (error loading class)
 
 On Mon, Apr 18, 2011 at 1:31 PM, Demian Katz
 demian.k...@villanova.edu wrote:
  Hello,
 
  I'm interested in trying out the new ICU features in Solr 3.1.
  However, when I attempt to set up a field type using
 solr.ICUTokenizerFactory and/or solr.ICUFoldingFilterFactory, Solr
 refuses to start up, issuing "Error loading class" exceptions.
 
  I did see the README.txt file that mentions lucene-libs/lucene-*.jar
 and lib/icu4j-*.jar.  I tried putting all of these files under my Solr
 home directory, but it made no difference.
 
  Is there some other .jar that I need to add to my library folder?  Am
 I doing something wrong with the known dependencies?  (This is the
 first time I've seen a lucene-libs directory, so I wasn't sure if that
 required some special configuration).  Any general troubleshooting
 advice for figuring out what is going wrong here?
 
 
 make a 'lib' directory under your solr home (e.g. example/solr/lib) :
 it should contain:
 * icu4j-4_6.jar
 * lucene-icu-3.1.jar
 * apache-solr-analysis-extras-3.1.jar


RE: Solr 3.1 ICU filters (error loading class)

2011-04-18 Thread Demian Katz
Right, I placed my files relative to solr_home, not in it -- but obviously 
having a solr_home/lucene-libs directory didn't do me any good. :-)

- Demian

 -Original Message-
 From: Jonathan Rochkind [mailto:rochk...@jhu.edu]
 Sent: Monday, April 18, 2011 1:46 PM
 To: solr-user@lucene.apache.org
 Cc: Demian Katz
 Subject: Re: Solr 3.1 ICU filters (error loading class)
 
 I don't think you want to put them in solr_home, I think you want to
 put
 them in solr_home/lib/.  Or did you mean that's where you put them?
 
 On 4/18/2011 1:31 PM, Demian Katz wrote:
  Hello,
 
  I'm interested in trying out the new ICU features in Solr 3.1.
 However, when I attempt to set up a field type using
 solr.ICUTokenizerFactory and/or solr.ICUFoldingFilterFactory, Solr
 refuses to start up, issuing "Error loading class" exceptions.
 
  I did see the README.txt file that mentions lucene-libs/lucene-*.jar
 and lib/icu4j-*.jar.  I tried putting all of these files under my Solr
 home directory, but it made no difference.
 
  Is there some other .jar that I need to add to my library folder?  Am
 I doing something wrong with the known dependencies?  (This is the
 first time I've seen a lucene-libs directory, so I wasn't sure if that
 required some special configuration).  Any general troubleshooting
 advice for figuring out what is going wrong here?
 
  thanks,
  Demian
 


RE: OAI on SOLR already done?

2011-02-02 Thread Demian Katz
I already replied to the original poster off-list, but it seems that it may be 
worth weighing in here as well...

The next release of VuFind (http://vufind.org) is going to include OAI-PMH 
server support.  As you say, there is really no way to plug OAI-PMH directly 
into Solr...  but a tool like VuFind can provide a fairly generic, extensible, 
Solr-based platform for building an OAI-PMH server.  Obviously this is helpful 
for some use cases and not others...  but I'm happy to provide more information 
if anyone needs it.

- Demian

From: Jonathan Rochkind [rochk...@jhu.edu]
Sent: Wednesday, February 02, 2011 3:38 PM
To: solr-user@lucene.apache.org
Cc: Paul Libbrecht
Subject: Re: OAI on SOLR already done?

The trick is that you can't just have a generic black box OAI-PMH
provider on top of any Solr index. How would it know where to get the
metadata elements it needs, such as title, or last-updated date, etc.
Any given solr index might not even have this in stored fields -- and a
given app might want to look them up from somewhere other than stored
fields.

If the Solr index does have them in stored fields, and you do want to
get them from the stored fields, then it's, I think (famous last words)
relatively straightforward code to write. A mapping from solr stored
fields to metadata elements needed for OAI-PMH, and then simply
outputting the XML template with those filled in.
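
To sketch the template idea (untested, via Solr's XSLT response writer with 
wt=xslt&tr=oai.xsl -- the stylesheet name and the stored-field names 'id' and 
'title' are assumptions, not a real mapping):

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <xsl:output method="xml" indent="yes"/>
  <!-- wrap each Solr result document in a minimal OAI-PMH record -->
  <xsl:template match="/response">
    <ListRecords>
      <xsl:for-each select="result/doc">
        <record>
          <metadata>
            <oai_dc:dc>
              <dc:identifier><xsl:value-of select="str[@name='id']"/></dc:identifier>
              <dc:title><xsl:value-of select="str[@name='title']"/></dc:title>
            </oai_dc:dc>
          </metadata>
        </record>
      </xsl:for-each>
    </ListRecords>
  </xsl:template>
</xsl:stylesheet>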

I am not aware of anyone that has done this in a
re-useable/configurable-for-your-solr tool. You could possibly do it
solely using the built-in Solr
JSP/XSLT/other-templating-stuff-I-am-not-familiar-with stuff, rather
than as an external Solr client app, or it could be an external Solr
client app.

This is actually a very similar problem to something someone else asked
a few days ago Does anyone have an OpenSearch add-on for Solr?  Very
very similar problem, just with a different XML template for output
(usually RSS or Atom) instead of OAI-PMH.

On 2/2/2011 3:14 PM, Paul Libbrecht wrote:
 Peter,

 I'm afraid your service is harvesting and I am trying to look at a PMH 
 provider service.

 Your project appeared early in the goolge matches.

 paul


 On 2 Feb 2011, at 20:46, Péter Király wrote:

 Hi,

 I don't know whether it fits your need, but we are building a tool
 based on Drupal (eXtensible Catalog Drupal Toolkit), which can harvest
 with OAI-PMH and index the harvested records into Solr. The records are
 harvested, processed, and stored into MySQL, then we index them into
 Solr. We created some ways to manipulate the original values before
 sending to Solr. We created it in a modular way, so you can change
 settings in an admin interface or write your own hooks (special
 Drupal functions), to tailor the application to your needs. We support
 only Dublin Core, and our own FRBR-like schema (called XC schema), but
 you can add more schemas. Since this forum is about Solr, and not
 applications using Solr, if you are interested in this tool, please write
 me a private message, or visit http://eXtensibleCatalog.org, or the
 module's page at http://drupal.org/project/xc.

 Hope this helps,

 Péter
 eXtensible Catalog

 2011/2/2 Paul Libbrechtp...@hoplahup.net:
 Hello list,

 I've met a few google matches that indicate that SOLR-based servers 
 implement the Open Archive Initiative's Metadata Harvesting Protocol.

 Is there something made to be re-usable that would be an add-on to solr?

 thanks in advance

 paul



RE: filter query from external list of Solr unique IDs

2010-10-15 Thread Demian Katz
The main problem I've encountered with the "lots of OR clauses" approach is 
that you eventually hit the limit on Boolean clauses and the whole query fails. 
 You can keep raising the limit through the Solr configuration, but there's 
still a ceiling eventually.
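
For reference, the setting in question lives in the <query> section of 
solrconfig.xml (the value here is just an example):

    <maxBooleanClauses>4096</maxBooleanClauses>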

- Demian

 -Original Message-
 From: Jonathan Rochkind [mailto:rochk...@jhu.edu]
 Sent: Friday, October 15, 2010 1:07 PM
 To: solr-user@lucene.apache.org
 Subject: RE: filter query from external list of Solr unique IDs
 
 Definitely interested in this.
 
 The naive obvious approach would be just putting all the ID's in the
 query. Like fq=(id:1 OR id:2 OR).  Or making it another clause in
 the 'q'.
 
 Can you outline what's wrong with this approach, to make it more clear
 what's needed in a solution?
 
 From: Burton-West, Tom [tburt...@umich.edu]
 Sent: Friday, October 15, 2010 11:49 AM
 To: solr-user@lucene.apache.org
 Subject: filter query from external list of Solr unique IDs
 
 At the Lucene Revolution conference I asked about efficiently building
 a filter query from an external list of Solr unique ids.
 
 Some use cases I can think of are:
 1)  personal sub-collections (in our case a user can create a small
 subset of our 6.5 million doc collection and then run filter queries
 against it)
 2)  tagging documents
 3)  access control lists
 4)  anything that needs complex relational joins
 5)  a sort of alternative to incremental field updating (i.e.
 update in an external database or kv store)
 6)  Grant's clustering cluster points and similar apps.
 
 Grant pointed to SOLR 1715, but when I looked on JIRA, there doesn't
 seem to be any work on it yet.
 
 Hoss  mentioned a couple of ideas:
 1) sub-classing query parser
 2) Having the app query a database and somehow passing
 something to Solr or lucene for the filter query
 
 Can Hoss or someone else point me to more detailed information on what
 might be involved in the two ideas listed above?
 
 Is somehow keeping an up-to-date map of unique Solr ids to internal
 Lucene ids needed to implement this or is that a separate issue?
 
 
 Tom Burton-West
 http://www.hathitrust.org/blogs/large-scale-search
 
 
 



RE: solr.WordDelimiterFilterFactory problem with hyphenated terms?

2010-04-12 Thread Demian Katz
I don't think the behavior is correct.  The first example, with just one gap, 
does NOT match.  The second example, with an extra second gap, DOES match.  It 
seems that the term collapsing (eighteenth-century -- eighteenthcentury) 
somehow throws off the position sequence, forcing you to add an extra gap in 
order to get a match.  It's good to know that slop is an option to work around 
this problem... but it still seems to me that something isn't working the way 
it is supposed to in this particular case.

- Demian

 -Original Message-
 From: Robert Muir [mailto:rcm...@gmail.com]
 Sent: Friday, April 09, 2010 12:05 PM
 To: solr-user@lucene.apache.org
 Subject: Re: solr.WordDelimiterFilterFactory problem with hyphenated
 terms?
 
 but this behavior is correct, as you have position increments enabled.
 if you want the second query (which has 2 gaps) to match, you need to
 either
 use slop, or disable these increments alltogether.
 
 On Fri, Apr 9, 2010 at 11:44 AM, Demian Katz
 demian.k...@villanova.eduwrote:
 
  I've given it a try, and it definitely seems to have improved the
  situation.  However, there is still one weird case that's clearly
 related to
  term positions.  If I do this search, it fails:
 
  title:"love customs in eighteenthcentury spain"
 
  ...but if I do this search, it succeeds:
 
  title:"love customs in in eighteenthcentury spain"
 
  (note the duplicate "in").
 
  - Demian
 
   -Original Message-
   From: Erick Erickson [mailto:erickerick...@gmail.com]
   Sent: Thursday, April 08, 2010 11:20 AM
   To: solr-user@lucene.apache.org
   Subject: Re: solr.WordDelimiterFilterFactory problem with
 hyphenated
   terms?
  
   I'm not all that familiar with the underlying issues, but of the
 two
   I'd
   pick moving the WordDelimiterFactory rather than setting increments
 =
   false.
  
   But that's at least partly a guess
  
   Best
   Erick
  
   On Thu, Apr 8, 2010 at 11:00 AM, Demian Katz
   demian.k...@villanova.eduwrote:
  
Thanks for looking into this -- I appreciate the help (and feel a
   little
better that there seems to be a bug at work here and not just my
   total
incomprehension).
   
Sorry for any confusion over the UnicodeNormalizationFactory --
   that's
actually a plug-in from the SolrMarc project (
http://code.google.com/p/solrmarc/) that slipped into my example.
   Also,
as you guessed, my default operator is indeed set to AND.
   
It sounds to me that, of your two proposed work-arounds, moving
 the
StopFilterFactory after WordDelimiterFactory is the least
 disruptive.
   I'm
guessing that disabling position increments across the board
 might
   have
implications for other types of phrase searches, while filtering
   stopwords
later in the chain should be more functionally equivalent, if
   slightly less
efficient (potentially more terms to examine).  Would you agree
 with
   this
assessment?  If not, what possible negative side effects am I
   forgetting
about?
   
thanks,
Demian
   
 -Original Message-
 From: Erick Erickson [mailto:erickerick...@gmail.com]
 Sent: Wednesday, April 07, 2010 10:04 PM
 To: solr-user@lucene.apache.org
 Subject: Re: solr.WordDelimiterFilterFactory problem with
   hyphenated
 terms?

 Well, for a quick trial using trunk, I had to remove the
 UnicodeNormalizationFactory, is that yours?

 But with that removed, I get the results you do, ASSUMING that
   you've
 set
 your default operator to AND in schema.xml...

 Believe it or not, it all changes and all your queries return a
 hit
   if
 you
 do one of two things (I did this in both index and query when
   testing
 'cause
 I'm lazy):
 1 move the inclusion of the StopFilterFactory after
 WordDelimiterFactory
 or
 2 for StopFilterFactory, set enablePositionIncrements=false

 I think either of these might work in your situation...

 On doing some more investigation, it appears that if a
 hyphenated
   word
 is
 immediately after a stopword AND the above is true (stop
 factory
 included
 before WordDelimiterFactory and
 enablePositionIncrements=true),
   then
 the
 search fails. I indexed this title:

  "Love-customs in eighteenth-century Spain for nineteenth-century"

 Searching in solr/admin/form.jsp for:
 title:(nineteenth-century)

  fails. But if I remove the "for" from the title, the above query works.
 Searching for
 title:(love-customs)
 always works.

 Finally, (and it's *really* time to go to sleep now), just
 setting
 enablePositionIncrements=false in the index portion of the
   schema
 also
 causes things to work.

 Developer folks:
 I didn't see anything in a quick look in SOLR or Lucene JIRAs,
   should I
 refine this a bit (really, sleepy time is near) and add a JIRA?

 Best

RE: solr.WordDelimiterFilterFactory problem with hyphenated terms?

2010-04-09 Thread Demian Katz
I've given it a try, and it definitely seems to have improved the situation.  
However, there is still one weird case that's clearly related to term 
positions.  If I do this search, it fails:

title:"love customs in eighteenthcentury spain"

...but if I do this search, it succeeds:

title:"love customs in in eighteenthcentury spain"

(note the duplicate "in").
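
A position-by-position sketch suggests why (this assumes the catenated
"eighteenthcentury" token from WordDelimiterFilterFactory takes the position
of its last subword, and it ignores stemming for readability -- a
reconstruction, not verified analyzer output):

indexed title: love(1) customs(2) [gap: "in" dropped] eighteenth(4) century(5) eighteenthcentury(5) spain(6)
first query:   love(1) customs(2) [gap] eighteenthcentury(4) spain(5)        -- spain lands one position early: no match
second query:  love(1) customs(2) [gap] [gap] eighteenthcentury(5) spain(6)  -- offsets line up with the index: match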

- Demian

 -Original Message-
 From: Erick Erickson [mailto:erickerick...@gmail.com]
 Sent: Thursday, April 08, 2010 11:20 AM
 To: solr-user@lucene.apache.org
 Subject: Re: solr.WordDelimiterFilterFactory problem with hyphenated
 terms?
 
 I'm not all that familiar with the underlying issues, but of the two
 I'd
 pick moving the WordDelimiterFactory rather than setting increments =
 false.
 
 But that's at least partly a guess
 
 Best
 Erick
 
 On Thu, Apr 8, 2010 at 11:00 AM, Demian Katz
  demian.k...@villanova.edu wrote:
 
  Thanks for looking into this -- I appreciate the help (and feel a
 little
  better that there seems to be a bug at work here and not just my
 total
  incomprehension).
 
  Sorry for any confusion over the UnicodeNormalizationFactory --
 that's
  actually a plug-in from the SolrMarc project (
  http://code.google.com/p/solrmarc/) that slipped into my example.
 Also,
  as you guessed, my default operator is indeed set to AND.
 
  It sounds to me that, of your two proposed work-arounds, moving the
  StopFilterFactory after WordDelimiterFactory is the least disruptive.
 I'm
  guessing that disabling position increments across the board might
 have
  implications for other types of phrase searches, while filtering
 stopwords
  later in the chain should be more functionally equivalent, if
 slightly less
  efficient (potentially more terms to examine).  Would you agree with
 this
  assessment?  If not, what possible negative side effects am I
 forgetting
  about?
 
  thanks,
  Demian
 
   -Original Message-
   From: Erick Erickson [mailto:erickerick...@gmail.com]
   Sent: Wednesday, April 07, 2010 10:04 PM
   To: solr-user@lucene.apache.org
   Subject: Re: solr.WordDelimiterFilterFactory problem with
 hyphenated
   terms?
  
   Well, for a quick trial using trunk, I had to remove the
   UnicodeNormalizationFactory, is that yours?
  
   But with that removed, I get the results you do, ASSUMING that
 you've
   set
   your default operator to AND in schema.xml...
  
   Believe it or not, it all changes and all your queries return a hit
 if
   you
   do one of two things (I did this in both index and query when
 testing
   'cause
   I'm lazy):
    1) move the inclusion of the StopFilterFactory after WordDelimiterFactory
    or
    2) for StopFilterFactory, set enablePositionIncrements=false
  
   I think either of these might work in your situation...
  
   On doing some more investigation, it appears that if a hyphenated
 word
   is
   immediately after a stopword AND the above is true (stop factory
   included
   before WordDelimiterFactory and enablePositionIncrements=true),
 then
   the
   search fails. I indexed this title:
  
    "Love-customs in eighteenth-century Spain for nineteenth-century"
  
   Searching in solr/admin/form.jsp for:
   title:(nineteenth-century)
  
    fails. But if I remove the "for" from the title, the above query works.
   Searching for
   title:(love-customs)
   always works.
  
   Finally, (and it's *really* time to go to sleep now), just setting
   enablePositionIncrements=false in the index portion of the
 schema
   also
   causes things to work.
  
   Developer folks:
   I didn't see anything in a quick look in SOLR or Lucene JIRAs,
 should I
   refine this a bit (really, sleepy time is near) and add a JIRA?
  
   Best
   Erick
  
   On Wed, Apr 7, 2010 at 10:29 AM, Demian Katz
    demian.k...@villanova.edu wrote:
  
Hello.  It has been a few weeks, and I haven't gotten any
 responses.
 Perhaps my question is too complicated -- maybe a better
 approach is
   to try
to gain enough knowledge to answer it myself.  My gut feeling is
   still that
it's something to do with the way term positions are getting
 handled
   by the
WordDelimiterFilterFactory, but I don't have a good understanding
 of
   how
term positions are calculated or factored into searching.  Can
 anyone
recommend some good reading to familiarize myself with these
 concepts
   in
better detail?
   
thanks,
Demian
   
From: Demian Katz
Sent: Tuesday, March 16, 2010 9:47 AM
To: solr-user@lucene.apache.org
Subject: solr.WordDelimiterFilterFactory problem with hyphenated
   terms?
   
This is my first post on this list -- apologies if this has been
   discussed
before; I didn't come upon anything exactly equivalent in
 searching
   the
archives via Google.
   
I'm using Solr 1.4 as part of the VuFind application, and I just
   noticed
that searches for hyphenated terms are failing in strange ways.
 I
   strongly
suspect it has something to do

RE: solr.WordDelimiterFilterFactory problem with hyphenated terms?

2010-04-08 Thread Demian Katz
Thanks for looking into this -- I appreciate the help (and feel a little better 
that there seems to be a bug at work here and not just my total 
incomprehension).

Sorry for any confusion over the UnicodeNormalizationFactory -- that's actually 
a plug-in from the SolrMarc project (http://code.google.com/p/solrmarc/) that 
slipped into my example.  Also, as you guessed, my default operator is indeed 
set to AND.

It sounds to me that, of your two proposed work-arounds, moving the 
StopFilterFactory after WordDelimiterFactory is the least disruptive.  I'm 
guessing that disabling position increments across the board might have 
implications for other types of phrase searches, while filtering stopwords 
later in the chain should be more functionally equivalent, if slightly less 
efficient (potentially more terms to examine).  Would you agree with this 
assessment?  If not, what possible negative side effects am I forgetting about?
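
To make option 1 concrete, here is a sketch of the reordered index analyzer
(untested and only illustrative -- the query analyzer would be reordered the
same way, and the remaining filters stay as in the field type quoted
elsewhere in the thread):

<analyzer type="index">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <!-- WordDelimiterFilterFactory now runs before the stop filter -->
  <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
          generateNumberParts="1" catenateWords="1" catenateNumbers="1"
          catenateAll="0" splitOnCaseChange="1"/>
  <filter class="solr.StopFilterFactory" ignoreCase="true"
          words="stopwords.txt" enablePositionIncrements="true"/>
  <!-- LowerCase, SnowballPorter, RemoveDuplicates, etc. unchanged -->
</analyzer>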

thanks,
Demian

 -Original Message-
 From: Erick Erickson [mailto:erickerick...@gmail.com]
 Sent: Wednesday, April 07, 2010 10:04 PM
 To: solr-user@lucene.apache.org
 Subject: Re: solr.WordDelimiterFilterFactory problem with hyphenated
 terms?
 
 Well, for a quick trial using trunk, I had to remove the
 UnicodeNormalizationFactory, is that yours?
 
 But with that removed, I get the results you do, ASSUMING that you've
 set
 your default operator to AND in schema.xml...
 
 Believe it or not, it all changes and all your queries return a hit if
 you
 do one of two things (I did this in both index and query when testing
 'cause
 I'm lazy):
  1) move the inclusion of the StopFilterFactory after WordDelimiterFactory
  or
  2) for StopFilterFactory, set enablePositionIncrements=false
 
 I think either of these might work in your situation...
 
 On doing some more investigation, it appears that if a hyphenated word
 is
 immediately after a stopword AND the above is true (stop factory
 included
 before WordDelimiterFactory and enablePositionIncrements=true), then
 the
 search fails. I indexed this title:
 
  "Love-customs in eighteenth-century Spain for nineteenth-century"
 
 Searching in solr/admin/form.jsp for:
 title:(nineteenth-century)
 
  fails. But if I remove the "for" from the title, the above query works.
 Searching for
 title:(love-customs)
 always works.
 
 Finally, (and it's *really* time to go to sleep now), just setting
 enablePositionIncrements=false in the index portion of the schema
 also
 causes things to work.
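
 (Concretely, that would amount to something like the following in the index
 analyzer -- a sketch, not a tested patch:

 <filter class="solr.StopFilterFactory" ignoreCase="true"
         words="stopwords.txt" enablePositionIncrements="false"/>

 so that removed stopwords no longer leave gaps in the position sequence.)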
 
 Developer folks:
 I didn't see anything in a quick look in SOLR or Lucene JIRAs, should I
 refine this a bit (really, sleepy time is near) and add a JIRA?
 
 Best
 Erick
 
 On Wed, Apr 7, 2010 at 10:29 AM, Demian Katz
  demian.k...@villanova.edu wrote:
 
  Hello.  It has been a few weeks, and I haven't gotten any responses.
   Perhaps my question is too complicated -- maybe a better approach is
 to try
  to gain enough knowledge to answer it myself.  My gut feeling is
 still that
  it's something to do with the way term positions are getting handled
 by the
  WordDelimiterFilterFactory, but I don't have a good understanding of
 how
  term positions are calculated or factored into searching.  Can anyone
  recommend some good reading to familiarize myself with these concepts
 in
  better detail?
 
  thanks,
  Demian
 
  From: Demian Katz
  Sent: Tuesday, March 16, 2010 9:47 AM
  To: solr-user@lucene.apache.org
  Subject: solr.WordDelimiterFilterFactory problem with hyphenated
 terms?
 
  This is my first post on this list -- apologies if this has been
 discussed
  before; I didn't come upon anything exactly equivalent in searching
 the
  archives via Google.
 
  I'm using Solr 1.4 as part of the VuFind application, and I just
 noticed
  that searches for hyphenated terms are failing in strange ways.  I
 strongly
  suspect it has something to do with the
 solr.WordDelimiterFilterFactory
  filter, but I'm not exactly sure what.
 
  The problem is that I have a record with the title "Love customs in
  eighteenth-century Spain".  Depending on how I search for this, I get
  successes or failures in a seemingly unpredictable pattern.
 
  Demonstration queries below were tested using the direct Solr
  administration tool, just to eliminate any VuFind-related factors
 from the
  equation while debugging.
 
   Queries that work:
   title:(Love customs in eighteenth century Spain)        // no hyphen, no phrases
   title:("Love customs in eighteenth-century Spain")      // phrase search on whole title, with hyphen

   Queries that fail:
   title:(Love customs in eighteenth-century Spain)        // hyphen, no phrases
   title:("Love customs in eighteenth century Spain")      // phrase search on whole title, without hyphen
   title:(Love customs in "eighteenth-century" Spain)      // hyphenated word as phrase
   title:(Love customs in "eighteenth century" Spain)      // hyphenated word as phrase, hyphen removed

RE: solr.WordDelimiterFilterFactory problem with hyphenated terms?

2010-04-07 Thread Demian Katz
Hello.  It has been a few weeks, and I haven't gotten any responses.  Perhaps 
my question is too complicated -- maybe a better approach is to try to gain 
enough knowledge to answer it myself.  My gut feeling is still that it's 
something to do with the way term positions are getting handled by the 
WordDelimiterFilterFactory, but I don't have a good understanding of how term 
positions are calculated or factored into searching.  Can anyone recommend some 
good reading to familiarize myself with these concepts in better detail?

thanks,
Demian

From: Demian Katz
Sent: Tuesday, March 16, 2010 9:47 AM
To: solr-user@lucene.apache.org
Subject: solr.WordDelimiterFilterFactory problem with hyphenated terms?

This is my first post on this list -- apologies if this has been discussed 
before; I didn't come upon anything exactly equivalent in searching the 
archives via Google.

I'm using Solr 1.4 as part of the VuFind application, and I just noticed that 
searches for hyphenated terms are failing in strange ways.  I strongly suspect 
it has something to do with the solr.WordDelimiterFilterFactory filter, but I'm 
not exactly sure what.

The problem is that I have a record with the title "Love customs in 
eighteenth-century Spain".  Depending on how I search for this, I get successes 
or failures in a seemingly unpredictable pattern.

Demonstration queries below were tested using the direct Solr administration 
tool, just to eliminate any VuFind-related factors from the equation while 
debugging.

Queries that work:
title:(Love customs in eighteenth century Spain)        // no hyphen, no phrases
title:("Love customs in eighteenth-century Spain")      // phrase search on whole title, with hyphen

Queries that fail:
title:(Love customs in eighteenth-century Spain)        // hyphen, no phrases
title:("Love customs in eighteenth century Spain")      // phrase search on whole title, without hyphen
title:(Love customs in "eighteenth-century" Spain)      // hyphenated word as phrase
title:(Love customs in "eighteenth century" Spain)      // hyphenated word as phrase, hyphen removed

Here is VuFind's "text" field type definition:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="1" catenateNumbers="1"
            catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"
            protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    <filter class="schema.UnicodeNormalizationFilterFactory" version="icu4j"
            composed="false" remove_diacritics="true" remove_modifiers="true"
            fold="true"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="1" catenateNumbers="1"
            catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"
            protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    <filter class="schema.UnicodeNormalizationFilterFactory" version="icu4j"
            composed="false" remove_diacritics="true" remove_modifiers="true"
            fold="true"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
  </analyzer>
</fieldType>

I did notice that the "text" field type in VuFind's schema has 
"catenateWords" and "catenateNumbers" turned on in both the index and query 
analyzer chains.  It is my understanding that these options should be disabled 
for the query chain and only enabled for the index chain.  However, this may be 
a red herring -- I have already tried changing this setting, but it didn't 
change the success/failure pattern described above.  I have also played with 
the "preserveOriginal" setting without apparent effect.

From playing with the Field Analysis tool, I notice that there is a gap in the 
term position sequence after analysis...  but I'm not sure if this is 
significant.

Has anybody else run into this sort of problem?  Any ideas on a fix?

thanks,
Demian



solr.WordDelimiterFilterFactory problem with hyphenated terms?

2010-03-16 Thread Demian Katz
This is my first post on this list -- apologies if this has been discussed 
before; I didn't come upon anything exactly equivalent in searching the 
archives via Google.

I'm using Solr 1.4 as part of the VuFind application, and I just noticed that 
searches for hyphenated terms are failing in strange ways.  I strongly suspect 
it has something to do with the solr.WordDelimiterFilterFactory filter, but I'm 
not exactly sure what.

The problem is that I have a record with the title "Love customs in 
eighteenth-century Spain".  Depending on how I search for this, I get successes 
or failures in a seemingly unpredictable pattern.

Demonstration queries below were tested using the direct Solr administration 
tool, just to eliminate any VuFind-related factors from the equation while 
debugging.

Queries that work:
title:(Love customs in eighteenth century Spain)        // no hyphen, no phrases
title:("Love customs in eighteenth-century Spain")      // phrase search on whole title, with hyphen

Queries that fail:
title:(Love customs in eighteenth-century Spain)        // hyphen, no phrases
title:("Love customs in eighteenth century Spain")      // phrase search on whole title, without hyphen
title:(Love customs in "eighteenth-century" Spain)      // hyphenated word as phrase
title:(Love customs in "eighteenth century" Spain)      // hyphenated word as phrase, hyphen removed

Here is VuFind's "text" field type definition:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="1" catenateNumbers="1"
            catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"
            protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    <filter class="schema.UnicodeNormalizationFilterFactory" version="icu4j"
            composed="false" remove_diacritics="true" remove_modifiers="true"
            fold="true"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="1" catenateNumbers="1"
            catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"
            protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    <filter class="schema.UnicodeNormalizationFilterFactory" version="icu4j"
            composed="false" remove_diacritics="true" remove_modifiers="true"
            fold="true"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
  </analyzer>
</fieldType>

I did notice that the "text" field type in VuFind's schema has 
"catenateWords" and "catenateNumbers" turned on in both the index and query 
analyzer chains.  It is my understanding that these options should be disabled 
for the query chain and only enabled for the index chain.  However, this may be 
a red herring -- I have already tried changing this setting, but it didn't 
change the success/failure pattern described above.  I have also played with 
the "preserveOriginal" setting without apparent effect.
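
For reference, a query-side WordDelimiterFilterFactory with catenation
disabled would look something like this (a sketch of that general
recommendation, not a verified fix for the problem described here):

<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
        generateNumberParts="1" catenateWords="0" catenateNumbers="0"
        catenateAll="0" splitOnCaseChange="1"/>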

From playing with the Field Analysis tool, I notice that there is a gap in the 
term position sequence after analysis...  but I'm not sure if this is 
significant.

Has anybody else run into this sort of problem?  Any ideas on a fix?

thanks,
Demian