RE: [EXTERNAL] Re: Getting rid of Master/Slave nomenclature in Solr

2020-06-18 Thread Demian Katz
Regarding people having a problem with the word "master" -- GitHub is changing 
the default branch name away from "master," even in isolation from a "slave" 
pairing... so the terminology seems to be falling out of favor in all contexts. 
See:

https://www.cnet.com/news/microsofts-github-is-removing-coding-terms-like-master-and-slave/

I'm not here to start a debate about the semantics of that, just to provide 
evidence that in some communities, the term "master" is causing concern all by 
itself. If we're going to make the change anyway, it might be best to get it 
over with and pick the most appropriate terminology we can agree upon, rather 
than trying to minimize the amount of change. It's going to break backward 
compatibility anyway, so we might as well do it all now rather than risk having to 
go through two separate breaking changes at different points in time.

- Demian

-Original Message-
From: Noble Paul  
Sent: Thursday, June 18, 2020 1:51 AM
To: solr-user@lucene.apache.org
Subject: [EXTERNAL] Re: Getting rid of Master/Slave nomenclature in Solr

Looking at the code, I see 692 occurrences of the word "slave", mostly in
variable names and ref guide docs.

The word "slave" is present in the responses as well. Any change in the request 
param/response payload is backward incompatible.

I have no objection to changing the names in the ref guide and in internal 
variables. Going ahead with backward-incompatible changes is painful, but if 
somebody has the appetite to take it up, that's OK.

If we must change, master/follower can be a good enough option.

master (noun): A man in charge of an organization or group.
master (adj): having or showing very great skill or proficiency.
master (verb): acquire complete knowledge or skill in (a subject, technique, or 
art).
master (verb): gain control of; overcome.

I hope nobody has a problem with the term "master".

On Thu, Jun 18, 2020 at 3:19 PM Ilan Ginzburg  wrote:
>
> Would master/follower work?
>
> Half the rename work while still getting rid of the slavery connotation...
>
>
> On Thu 18 Jun 2020 at 07:13, Walter Underwood  wrote:
>
> > > On Jun 17, 2020, at 4:00 PM, Shawn Heisey  wrote:
> > >
> > > It has been interesting watching this discussion play out on 
> > > multiple
> > open source mailing lists.  On other projects, I have seen a VERY 
> > high level of resistance to these changes, which I find disturbing 
> > and surprising.
> >
> > Yes, it is nice to see everyone just pitch in and do it on this list.
> >
> > wunder
> > Walter Underwood
> > wun...@wunderwood.org
> > http://observer.wunderwood.org/  (my blog)
> >
> >



--
-
Noble Paul


RE: Help with a DIH config file

2019-03-15 Thread Demian Katz
Jörn (and anyone else with more experience with this than I have),

I've been working with Whitney on this issue. It is a PDF file, and it can be 
opened successfully in a PDF reader. Interestingly, if I try to extract data 
from it on the command line, Tika version 1.3 throws a lot of warnings but does 
successfully extract data, but several newer versions, including 1.17 and 1.20 
(haven't tested other intermediate versions) encounter a fatal error and 
extract nothing. So this seems like something that used to work but has 
stopped. Unfortunately, we haven't been able to find a way to downgrade to an 
old enough Tika in her Solr installation to work around the problem that way.

The bigger question, though, is whether there's a way to allow the DIH to 
simply ignore errors and keep going. Whitney needs to index several terabytes 
of arbitrary documents for her project, and at this scale, she can't afford the 
time to stop and manually intervene for every strange document that happens to 
be in the collection. It would be greatly preferable if the indexing process 
could ignore exceptions and proceed rather than stopping dead at the first 
problem. (I'm also pretty sure that Whitney is already using the 
ignoreTikaException attribute in her configuration, but it doesn't seem to help 
in this instance.)
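
For context, the relevant knob seems to be DIH's onError entity attribute; below 
is a simplified sketch of the general shape of configuration involved (paths, 
entity, and field names here are illustrative, not her actual config):

<dataConfig>
  <dataSource type="BinFileDataSource" name="bin"/>
  <document>
    <entity name="files" processor="FileListEntityProcessor"
            baseDir="/path/to/documents" fileName=".*" recursive="true"
            rootEntity="false" dataSource="null">
      <!-- onError="skip" should discard a failing document and move on;
           the default ("abort") stops the whole import at the first error -->
      <entity name="tika" processor="TikaEntityProcessor" dataSource="bin"
              url="${files.fileAbsolutePath}" format="text" onError="skip">
        <field column="text" name="content"/>
      </entity>
    </entity>
  </document>
</dataConfig>

If onError can be made to cover the Tika failure case, that may be the missing piece.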

Any suggestions would be greatly appreciated!

thanks,
Demian

-Original Message-
From: Jörn Franke  
Sent: Friday, March 15, 2019 4:18 AM
To: solr-user@lucene.apache.org
Subject: Re: Help with a DIH config file

Do you have an exception?
It could be that the PDF is broken - can you open it on your computer with a 
PDF reader?

If the exception is related to Tika and PDF, then file an issue with the PDFBox 
project. If there is an issue with Tika and MS Office documents, then Apache POI 
is the right project to ask.

> Am 15.03.2019 um 03:41 schrieb wclarke :
> 
> Thank you so much.  You helped a great deal.  I am running into one 
> last issue where the Tika DIH stops at a specific language and 
> fails there (Malayalam).  Do you know of a workaround?
> 
> 
> 
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Solr Cell, Tika and UpdateProcessorChains

2019-02-21 Thread Demian Katz
I'm posting this question on behalf of Whitney Clarke, who is a pending member 
of this list but is not able to post on her own yet. I've been working with her 
on some troubleshooting, but I'm not familiar with the components she's using 
and thought somebody here might be able to point her in the right direction 
more quickly than I can.

Here is her original inquiry:


I am pulling data from a local drive for indexing.  I am using Solr Cell and 
Tika in schemaless mode.  I am attempting to rewrite certain field information 
prior to indexing using html-strip and regex UpdateProcessorChains.  However, 
when run, the UpdateProcessorChains never appear to get invoked.

For example,

I am looking to rewrite "url":"e:\\documents\\apiscript.txt" to 
be http://apiscript.txt .  My current solrconfig tries to rewrite id and 
put the rewritten link into url, but this is just the most recent of many 
different approaches I have tried to get it to work.

My other issue is with the content field.  I am trying to strip that field 
down to just the actual text of the document.  I am getting all of the metadata 
in it as well.  Any suggestions?

Thanks,
Whitney


Whitney's latest solrconfig.xml is pasted in full below - as she notes, we've 
been through many iterations without any success. The key question is how to 
manipulate the data retrieved from Tika prior to indexing it. Is there a 
documented best practice for this type of situation, or any tips on how to 
troubleshoot when nothing appears to be happening?
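
One detail I'd particularly appreciate a sanity check on: my understanding is 
that a custom chain only runs if the request (or the handler's defaults) 
references it by name, and that update.chain takes a single chain name -- so 
listing several chains will not combine them, and the processors may need to be 
merged into one chain. In other words, something along these lines (a sketch 
with an invented chain name, not her exact config):

<requestHandler name="/update/extract" class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <!-- a single chain that includes all of the desired processors -->
    <str name="update.chain">tika-cleanup</str>
  </lst>
</requestHandler>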

Thanks,
Demian




[Whitney's solrconfig.xml was included here, but the list archive has stripped 
all of the XML markup, leaving only bare element values. The recoverable 
details: luceneMatchVersion 7.3.1; dataDir ${solr.solr.home:./solr}/text_test; 
an updateLog (numVersionBuckets 65536); autoCommit maxTime 15000 with 
openSearcher false; standard cache/searcher settings; a /select handler with 
df=content; a /query handler returning indented JSON; an /update/extract 
(Solr Cell) handler capturing links and mapping unknown fields to an ignored_ 
prefix; a DirectSolrSpellChecker on _text_ plus a /spell handler; highlighting 
fragmenter and boundary-scanner settings; the usual schemaless pieces 
(field-name mutation pattern [^\w-\.] -> _, date-format parsing, and 
AddSchemaFieldsUpdateProcessorFactory mappings to text_general, booleans, 
pdates, plongs, and pdoubles); the chain names add-unknown-fields-to-the-schema, 
html-strip-features, and regex-replace; a regex-replace configuration acting on 
the id field with pattern ^[a-z]:\w+ and replacement http:// (literal); and 
language detection over text,title,subject,description into language_s with 
fallback en.]



RE: Installing Solr with Ivy

2016-08-03 Thread Demian Katz
Dan,

In case you, or anyone else, is interested, let me share my current 
solution-in-progress:

https://github.com/vufind-org/vufind/pull/769

I've written a Phing task for my project (Phing is the PHP equivalent of Ant) 
which takes some loose inspiration from your Ant download task. The task uses a 
local directory to cache Solr distributions and only hits Apache servers if the 
cache lacks the requested version. This cache can be retained on my continuous 
integration and development servers, so I think this should get me the effect I 
desire without putting an unreasonable amount of load on the archive servers. 
I'd still love in theory to find a solution that's a little more future-proof 
than "build a URL and download from it," but for now, I think this will get me 
through.
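
(In Ant terms -- closer to Dan's setup than my Phing port -- the caching logic 
boils down to something like the following sketch; the property names are 
invented for illustration:)

<target name="check-cache">
  <!-- sets solr.cached only if the archive is already in the local cache -->
  <available file="${cache.dir}/solr-${solr.version}.tgz" property="solr.cached"/>
</target>

<target name="fetch-solr" depends="check-cache" unless="solr.cached">
  <mkdir dir="${cache.dir}"/>
  <!-- only hit the Apache archive when the cache misses -->
  <get src="https://archive.apache.org/dist/lucene/solr/${solr.version}/solr-${solr.version}.tgz"
       dest="${cache.dir}/solr-${solr.version}.tgz"/>
</target>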

Thanks again!

- Demian

-Original Message-
From: Davis, Daniel (NIH/NLM) [C] [mailto:daniel.da...@nih.gov] 
Sent: Tuesday, August 02, 2016 11:33 AM
To: solr-user@lucene.apache.org
Subject: RE: Installing Solr with Ivy

Demian,

I've long meant to upload my own "automated installation" - it is Ant without 
Ivy, but with checksums.   I suppose GPG signatures could also be worked in.
It is only semi-automated, because our DevOps group does not have root, but 
here is a clean version - https://github.com/danizen/solr-ant-install

System administrators prepare the environment:
- create a directory for Solr (/opt/solr) and logs (/var/logs/solr), and maybe a 
different volume for Solr data
- create an administrative user with a shell (owns the code)
- create an operational user who runs solr (no shell, cannot modify the code)
- install the initscripts
- set up sudoers rules

The installation this supports is very, very small, and I do not intend to 
support the cleaned version of this going forward.   I will update the 
README.md to make that clear.

I agree with your summary of the difference.   One more aspect of 
maturity/fullness of solution - MySQL/PostgreSQL etc. support multiple projects 
on the same server, at least administratively.   Solr is getting there, but 
until role-based access control (RBAC) is strong enough out-of-the-box, it is 
hard to set up a *shared* Solr server. Yet it is very common to do that with 
database servers, and in fact doing this is a common way to avoid siloed 
applications. Unfortunately, HTTP auth is not quite good enough for me; but 
it is only my own fault that I haven't contributed something more.

Dan Davis, Systems/Applications Architect (Contractor), Office of Computer and 
Communications Systems, National Library of Medicine, NIH







-Original Message-
From: Demian Katz [mailto:demian.k...@villanova.edu]
Sent: Tuesday, August 02, 2016 8:37 AM
To: solr-user@lucene.apache.org
Subject: RE: Installing Solr with Ivy

Thanks, Shawn, for confirming my suspicions.

Regarding your question about how Solr differs from a database server, I agree 
with you in theory, but the problem is in the practice: there are very easy, 
familiar, well-established techniques for installing and maintaining database 
platforms, and these platforms are mature enough that they evolve slowly and 
most versions are closely functionally equivalent to one another. Solr is 
comparatively young (not immature, but young).

Solr still (as far as I can tell) lacks standard package support in the default 
repos of the major Linux distros, and frequently breaks backward compatibility 
between versions in large and small ways (particularly in the internal API, but 
sometimes also in the configuration files). Those are not intended as 
criticisms of Solr -- they're to a large extent positive signs of activity and 
growth -- but they are, as far as I can tell, the current realities of working 
with the software.

For a developer with the right experience and knowledge, it's no big deal to 
navigate these challenges. However, my package is designed to be friendly to a 
less experienced, more generalized non-technical audience, and bundling Solr in 
the package instead of trying to guide the user through a potentially confusing 
manual installation process greatly simplifies the task of getting things up 
and running, saving me from having to field support emails from people who 
can't figure out how to install Solr on their platform, or those who end up 
with a version that's incompatible with my project's configurations and custom 
handlers.

At this point, my main goal is to revise the bundling process so that instead 
of storing Solr in Git, I can install it on-demand with a simple automated 
process during continuous integration builds and packaging for release. In the 
longer term, if the environmental factors change, I'd certainly prefer to stop 
bundling it entirely... but I don't think that is practical for my audience at 
this stage.

In any case, sorry for the long-winded reply, but hopefully that helps clarify 
my situation.

RE: Installing Solr with Ivy

2016-08-02 Thread Demian Katz
Dan,

Thanks for taking the time to share this! I'll give it a test run in the near 
future and will happily share improvements if I come up with any (though I'll 
most likely be focusing on the download steps rather than the subsequent 
configuration).

- Demian

-Original Message-
From: Davis, Daniel (NIH/NLM) [C] [mailto:daniel.da...@nih.gov] 
Sent: Tuesday, August 02, 2016 11:33 AM
To: solr-user@lucene.apache.org
Subject: RE: Installing Solr with Ivy

Demian,

I've long meant to upload my own "automated installation" - it is Ant without 
Ivy, but with checksums.   I suppose GPG signatures could also be worked in.
It is only semi-automated, because our DevOps group does not have root, but 
here is a clean version - https://github.com/danizen/solr-ant-install

System administrators prepare the environment:
- create a directory for Solr (/opt/solr) and logs (/var/logs/solr), and maybe a 
different volume for Solr data
- create an administrative user with a shell (owns the code)
- create an operational user who runs solr (no shell, cannot modify the code)
- install the initscripts
- set up sudoers rules

The installation this supports is very, very small, and I do not intend to 
support the cleaned version of this going forward.   I will update the 
README.md to make that clear.

I agree with your summary of the difference.   One more aspect of 
maturity/fullness of solution - MySQL/PostgreSQL etc. support multiple projects 
on the same server, at least administratively.   Solr is getting there, but 
until role-based access control (RBAC) is strong enough out-of-the-box, it is 
hard to set up a *shared* Solr server. Yet it is very common to do that with 
database servers, and in fact doing this is a common way to avoid siloed 
applications. Unfortunately, HTTP auth is not quite good enough for me; but 
it is only my own fault that I haven't contributed something more.

Dan Davis, Systems/Applications Architect (Contractor), Office of Computer and 
Communications Systems, National Library of Medicine, NIH







-Original Message-
From: Demian Katz [mailto:demian.k...@villanova.edu]
Sent: Tuesday, August 02, 2016 8:37 AM
To: solr-user@lucene.apache.org
Subject: RE: Installing Solr with Ivy

Thanks, Shawn, for confirming my suspicions.

Regarding your question about how Solr differs from a database server, I agree 
with you in theory, but the problem is in the practice: there are very easy, 
familiar, well-established techniques for installing and maintaining database 
platforms, and these platforms are mature enough that they evolve slowly and 
most versions are closely functionally equivalent to one another. Solr is 
comparatively young (not immature, but young).

Solr still (as far as I can tell) lacks standard package support in the default 
repos of the major Linux distros, and frequently breaks backward compatibility 
between versions in large and small ways (particularly in the internal API, but 
sometimes also in the configuration files). Those are not intended as 
criticisms of Solr -- they're to a large extent positive signs of activity and 
growth -- but they are, as far as I can tell, the current realities of working 
with the software.

For a developer with the right experience and knowledge, it's no big deal to 
navigate these challenges. However, my package is designed to be friendly to a 
less experienced, more generalized non-technical audience, and bundling Solr in 
the package instead of trying to guide the user through a potentially confusing 
manual installation process greatly simplifies the task of getting things up 
and running, saving me from having to field support emails from people who 
can't figure out how to install Solr on their platform, or those who end up 
with a version that's incompatible with my project's configurations and custom 
handlers.

At this point, my main goal is to revise the bundling process so that instead 
of storing Solr in Git, I can install it on-demand with a simple automated 
process during continuous integration builds and packaging for release. In the 
longer term, if the environmental factors change, I'd certainly prefer to stop 
bundling it entirely... but I don't think that is practical for my audience at 
this stage.

In any case, sorry for the long-winded reply, but hopefully that helps clarify 
my situation.

- Demian

-Original Message-

[...snip...]

In a theoretical situation where your program talked an SQL database, would you 
include a database server in your project?  How much time would you invest in 
automating the download and install of MySQL, Postgres, or some other database? 
 I think what you would do in that situation is include client code to talk to 
the database and expect the user to provide the server and prepare it for your 
program.  In this respect, how is a Solr server any different than a database 
server?

Thanks,
Shawn



RE: Installing Solr with Ivy

2016-08-02 Thread Demian Katz
Thanks, Shawn, for confirming my suspicions.

Regarding your question about how Solr differs from a database server, I agree 
with you in theory, but the problem is in the practice: there are very easy, 
familiar, well-established techniques for installing and maintaining database 
platforms, and these platforms are mature enough that they evolve slowly and 
most versions are closely functionally equivalent to one another. Solr is 
comparatively young (not immature, but young).

Solr still (as far as I can tell) lacks standard package support in the default 
repos of the major Linux distros, and frequently breaks backward compatibility 
between versions in large and small ways (particularly in the internal API, but 
sometimes also in the configuration files). Those are not intended as 
criticisms of Solr -- they're to a large extent positive signs of activity and 
growth -- but they are, as far as I can tell, the current realities of working 
with the software.

For a developer with the right experience and knowledge, it's no big deal to 
navigate these challenges. However, my package is designed to be friendly to a 
less experienced, more generalized non-technical audience, and bundling Solr in 
the package instead of trying to guide the user through a potentially confusing 
manual installation process greatly simplifies the task of getting things up 
and running, saving me from having to field support emails from people who 
can't figure out how to install Solr on their platform, or those who end up 
with a version that's incompatible with my project's configurations and custom 
handlers.

At this point, my main goal is to revise the bundling process so that instead 
of storing Solr in Git, I can install it on-demand with a simple automated 
process during continuous integration builds and packaging for release. In the 
longer term, if the environmental factors change, I'd certainly prefer to stop 
bundling it entirely... but I don't think that is practical for my audience at 
this stage.

In any case, sorry for the long-winded reply, but hopefully that helps clarify 
my situation.

- Demian

-Original Message-

[...snip...]

In a theoretical situation where your program talked an SQL database, would you 
include a database server in your project?  How much time would you invest in 
automating the download and install of MySQL, Postgres, or some other database? 
 I think what you would do in that situation is include client code to talk to 
the database and expect the user to provide the server and prepare it for your 
program.  In this respect, how is a Solr server any different than a database 
server?

Thanks,
Shawn



Installing Solr with Ivy

2016-08-01 Thread Demian Katz
As a follow-up to last week's thread about loading Solr via dependency manager, 
I started experimenting with using Ivy to install Solr. Here's what I have 
(note that I'm trying to install Solr 5.5.0 as an arbitrary example, but that 
detail should not be important):

ivy.xml: [file contents stripped by the list archive]

build.xml: [file contents stripped by the list archive]

My hope, based on a quick read of some Ivy tutorials, was that simply running 
"ant" with the above configs would give me a copy of Solr in my lib directory. 
When I use example libraries from the tutorials in my ivy.xml, I do indeed get 
files installed... but when I try to substitute the Solr package,
no files are installed ("0 artifacts copied"). I'm not very experienced with 
any of these tools or repositories, so I'm not sure where I'm going wrong.
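
(For reference, the general shape of the two files above was roughly the 
following -- the markup was lost in transit, so treat the module name and exact 
attributes as approximations rather than my literal files:)

ivy.xml:

<ivy-module version="2.0">
  <info organisation="org.vufind" module="solr-install"/>
  <dependencies>
    <dependency org="org.apache.solr" name="solr-parent" rev="5.5.0"/>
  </dependencies>
</ivy-module>

build.xml:

<project xmlns:ivy="antlib:org.apache.ivy.ant" name="solr-install" default="resolve">
  <target name="resolve">
    <!-- resolve dependencies and copy them into lib/ -->
    <ivy:retrieve pattern="lib/[artifact]-[revision].[ext]"/>
  </target>
</project>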

- Do I need to add some extra configuration somewhere to tell Ivy to download 
the constituent parts of the solr-parent package?
- Is the solr-parent package the wrong thing to be using? (I tried replacing 
solr-parent with solr-core and ended up with many .jar files in my lib 
directory, which was better than nothing, but the .jar files were not organized 
into a directory structure and were not accompanied by any of the non-.jar 
files like shell scripts that make Solr tick).
- Am I just completely on the wrong track? (I do realize that there may not be 
a way to pull a fully-functional Solr out of the core Maven repository... but 
it seemed worth a try!)

Any suggestions would be greatly appreciated!

thanks,
Demian


RE: Installing Solr as a dependency

2016-08-01 Thread Demian Katz
Thanks -- another interesting possibility, though I suppose the disadvantage to 
this strategy would be the dependency on Docker, which could be problematic for 
some users (especially those running Windows, where I understand that this 
could only be achieved with virtualization, which would almost certainly impact 
performance). Still, another option to put on the table!
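
(For the record, basic usage of the official image looks simple enough -- 
something like the lines below, though I haven't actually tried layering our 
configuration on top yet, so consider this untested:)

docker pull solr:5.5
docker run -d -p 8983:8983 --name vufind-solr solr:5.5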

- Demian

-Original Message-
From: Alexandre Rafalovitch [mailto:arafa...@gmail.com] 
Sent: Friday, July 29, 2016 8:02 PM
To: solr-user
Subject: Re: Installing Solr as a dependency

What about (not tried) pulling down an official Docker build and adding your 
stuff to that?
https://hub.docker.com/_/solr/

Regards,
   Alex.

Newsletter and resources for Solr beginners and intermediates:
http://www.solr-start.com/


On 30 July 2016 at 03:03, Demian Katz  wrote:
>> I wouldn't include Solr in my own project at all.  I would probably 
>> request that the user download the binary artifact and put it in a 
>> predictable location, and configure my installation script to do the 
>> download if the file is not there.  I would strongly recommend taking 
>> advantage of Apache's mirror system for that download -- although if 
>> you need a specific version of Solr, you will find that the mirror 
>> system only has the latest version, and you must go to the Apache 
>> Archives for older versions.
>>
>> To reduce load on the Apache Archive, you could place a copy of the 
>> binary on your own download servers ... and you could probably 
>> greatly reduce the size of that download by stripping out components 
>> that your software doesn't need.  If users want to enable additional 
>> functionality, they would be free to download the full Solr binary 
>> from Apache.
>
> Yes, this is the reason I was hoping to use some sort of dependency 
> management tool. The idea of downloading from Apache's system has definitely 
> crossed my mind, but it's inherently more fragile than using a dependency 
> manager (since Apache is at least theoretically free to change their URL 
> structure, etc., at any time) and, as you say, it seemed impolite to direct 
> potentially heavy amounts of traffic to Apache servers (especially when you 
> consider that every commit to my project triggers one or more continuous 
> integration builds, each of which would need to perform the download). 
> Creating a project-specific mirror also crossed my mind, but that has its own 
> set of problems: it's work to maintain it, and the server hosting it needs to 
> be able to withstand the high traffic that would otherwise be directed at 
> Apache. The idea of a theoretical dependency management tool still feels more 
> attractive because it adds a standard, unchanging mechanism for obtaining 
> specific versions of the software and it offers the possibility of local 
> package caching across builds to significantly reduce the amount of HTTP 
> traffic back and forth. Of course, it's a lot less attractive if it proves to 
> be only theory and not in fact practically achievable -- I'll play around 
> with Maven next week and see where that gets me.
>
> Anyway, I don't say any of that to dismiss your suggestions -- you 
> present potentially viable possibilities, and I'll certainly keep 
> those ideas on the table as I plan for the future -- but I thought it 
> might be worthwhile to share my thinking. :-)
>
>> I once discovered that if optional components are removed (including 
>> some jars in the webapp), the Solr download drops from 150+ MB to 
>> about
>> 25 MB.
>>
>> https://issues.apache.org/jira/browse/SOLR-6806
>
> This could actually be a separate argument for a dependency-management-based 
> Solr structure, in that you could create a core solr package with minimum 
> content that could recommend a whole array of optional dependencies. A script 
> could then be used to build different versions of the download package from 
> these -- one with just the core, one with all the optional stuff included. 
> Those who wanted some intermediate number of files could be encouraged to 
> manually create their desired build from packages.
>
> But again, I freely admit that everything I'm saying is based on 
> experience with package managers outside the realm of Java -- I need 
> to learn more about Maven (and perhaps Ivy) before I can make any 
> particularly intelligent statements about what is really possible in 
> this context. :-)
>
> - Demian


RE: Installing Solr as a dependency

2016-07-29 Thread Demian Katz
I did think about Maven, but (probably because I'm a Maven newbie) I didn't 
find an obvious way to do it and figured that Maven was meant more for 
libraries than for complete applications. In any case, your answer gives me 
more to work with, so I'll do some experimentation. Thanks!

- Demian

From: Daniel Collins [danwcoll...@gmail.com]
Sent: Friday, July 29, 2016 11:18 AM
To: solr-user@lucene.apache.org
Subject: Re: Installing Solr as a dependency

Can't you use Maven?  I thought that was the standard dependency management
tool, and Solr is published to Maven repos.  There used to be a solr
artifact which was the WAR file, but presumably now, you'd have to pull down

  <dependency>
    <groupId>org.apache.solr</groupId>
    <artifactId>solr-parent</artifactId>
  </dependency>

and maybe then start that up.

We have an internal application which depends on solr-core (it's a
web-app; we embed bits of Solr, basically), and Maven works fine for us.  We
do patch and build Solr internally to our own corporate Maven repos, though,
so that helps :)  But I've done it outside the corporate environment and
found recent Solr releases on standard Maven repo sites.


On 29 July 2016 at 15:12, Shawn Heisey  wrote:

> On 7/28/2016 1:29 PM, Demian Katz wrote:
> > I develop an open source project
> > (https://github.com/vufind-org/vufind) that depends on Solr, and I'm
> > trying to figure out if there is a better way to manage the Solr
> > dependency. Presently, I simply bundle Solr with my software by
> > committing the latest distribution to my Git repo. Over time, having
> > all of these large binaries is causing repository bloat and slow Git
> > performance. I'm beginning to wonder whether there's a better way.
> > With the rise in the popularity of dependency managers like NPM and
> > Composer, it seems like it might be nice to somehow be able to declare
> > Solr as a dependency and have it installed automatically on the client
> > side rather than bundling the whole gigantic application by hand...
> > however, as far as I can tell, there's no way to do this presently (at
> > least, not unless you count specialized niche projects like
> > https://github.com/projecthydra/hydra-jetty, which are not exactly
> > what I'm looking for). Just curious if others are dealing with this
> > problem in other ways, or if there are any tool-based approaches that
> > I haven't discovered on my own.
>
> I wouldn't include Solr in my own project at all.  I would probably
> request that the user download the binary artifact and put it in a
> predictable location, and configure my installation script to do the
> download if the file is not there.  I would strongly recommend taking
> advantage of Apache's mirror system for that download -- although if you
> need a specific version of Solr, you will find that the mirror system
> only has the latest version, and you must go to the Apache Archives for
> older versions.
>
> To reduce load on the Apache Archive, you could place a copy of the
> binary on your own download servers ... and you could probably greatly
> reduce the size of that download by stripping out components that your
> software doesn't need.  If users want to enable additional
> functionality, they would be free to download the full Solr binary from
> Apache.
>
> I once discovered that if optional components are removed (including
> some jars in the webapp), the Solr download drops from 150+ MB to about
> 25 MB.
>
> https://issues.apache.org/jira/browse/SOLR-6806
>
> Thanks,
> Shawn
>
>


RE: Installing Solr as a dependency

2016-07-29 Thread Demian Katz
> I wouldn't include Solr in my own project at all.  I would probably
> request that the user download the binary artifact and put it in a
> predictable location, and configure my installation script to do the
> download if the file is not there.  I would strongly recommend taking
> advantage of Apache's mirror system for that download -- although if you
> need a specific version of Solr, you will find that the mirror system
> only has the latest version, and you must go to the Apache Archives for
> older versions.
>
> To reduce load on the Apache Archive, you could place a copy of the
> binary on your own download servers ... and you could probably greatly
> reduce the size of that download by stripping out components that your
> software doesn't need.  If users want to enable additional
> functionality, they would be free to download the full Solr binary from
> Apache.

Yes, this is the reason I was hoping to use some sort of dependency management 
tool. The idea of downloading from Apache's system has definitely crossed my 
mind, but it's inherently more fragile than using a dependency manager (since 
Apache is at least theoretically free to change their URL structure, etc., at 
any time) and, as you say, it seemed impolite to direct potentially heavy 
amounts of traffic to Apache servers (especially when you consider that every 
commit to my project triggers one or more continuous integration builds, each 
of which would need to perform the download). Creating a project-specific 
mirror also crossed my mind, but that has its own set of problems: it's work to 
maintain it, and the server hosting it needs to be able to withstand the high 
traffic that would otherwise be directed at Apache. The idea of a theoretical 
dependency management tool still feels more attractive because it adds a 
standard, unchanging mechanism for obtaining specific versions of the software 
and it offers the possibility of local package caching across builds to 
significantly reduce the amount of HTTP traffic back and forth. Of course, it's 
a lot less attractive if it proves to be only theory and not in fact 
practically achievable -- I'll play around with Maven next week and see where 
that gets me.

Anyway, I don't say any of that to dismiss your suggestions -- you present 
potentially viable possibilities, and I'll certainly keep those ideas on the 
table as I plan for the future -- but I thought it might be worthwhile to share 
my thinking. :-)

> I once discovered that if optional components are removed (including
> some jars in the webapp), the Solr download drops from 150+ MB to about
> 25 MB.
>
> https://issues.apache.org/jira/browse/SOLR-6806

This could actually be a separate argument for a dependency-management-based 
Solr structure, in that you could create a core solr package with minimum 
content that could recommend a whole array of optional dependencies. A script 
could then be used to build different versions of the download package from 
these -- one with just the core, one with all the optional stuff included. 
Those who wanted some intermediate number of files could be encouraged to 
manually create their desired build from packages.

But again, I freely admit that everything I'm saying is based on experience 
with package managers outside the realm of Java -- I need to learn more about 
Maven (and perhaps Ivy) before I can make any particularly intelligent 
statements about what is really possible in this context. :-)

- Demian

Installing Solr as a dependency

2016-07-28 Thread Demian Katz
Hello,

I develop an open source project (https://github.com/vufind-org/vufind) that 
depends on Solr, and I'm trying to figure out if there is a better way to 
manage the Solr dependency.

Presently, I simply bundle Solr with my software by committing the latest 
distribution to my Git repo. Over time, having all of these large binaries is 
causing repository bloat and slow Git performance. I'm beginning to wonder 
whether there's a better way. With the rise in the popularity of dependency 
managers like NPM and Composer, it seems like it might be nice to somehow be 
able to declare Solr as a dependency and have it installed automatically on the 
client side rather than bundling the whole gigantic application by hand... 
however, as far as I can tell, there's no way to do this presently (at least, 
not unless you count specialized niche projects like 
https://github.com/projecthydra/hydra-jetty, which are not exactly what I'm 
looking for).

Just curious if others are dealing with this problem in other ways, or if there 
are any tool-based approaches that I haven't discovered on my own.

thanks,
Demian


qf boosts with MoreLikeThis query parser

2016-07-11 Thread Demian Katz
Hello,

I am currently using field-specific boosts in the qf setting of the 
MoreLikeThis request handler:

https://github.com/vufind-org/vufind/blob/master/solr/vufind/biblio/conf/solrconfig.xml#L410

I would like to accomplish the same effect using the MoreLikeThis query parser, 
so that I can take advantage of such benefits as sharding support.

I am currently using Solr 5.5.0, and in spite of trying many syntactical 
variations, I can't seem to get it to work. Some discussion on this JIRA ticket 
seems to suggest there may have been some problems caused by parsing 
limitations:

https://issues.apache.org/jira/browse/SOLR-7143

However, I think my work on this ticket should have eliminated those 
limitations:

https://issues.apache.org/jira/browse/SOLR-2798
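
(For concreteness, the sort of invocation I've been attempting looks roughly 
like the line below -- the field names and boosts are illustrative only; it's 
the ^-boost suffixes inside qf that I can't get the parser to accept:)

q={!mlt qf="title^500 topic^100" mintf=2 mindf=5}my-document-id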

Anyway, this brings up a few questions:


1.) Is field-specific boosting in qf supported by the MLT query parser, and 
if so, what syntax should I use?

2.) If this functionality is supported, but not in v5.5.0, approximately 
when was it fixed?

3.) If the functionality is still not working, would it be worth my time to 
try to fix it, or is it being excluded for a specific reason?

Any and all insight is appreciated. Apologies if the answers are already out 
there somewhere, but I wasn't able to find them!

thanks,
Demian


Pull request protocol question

2016-03-01 Thread Demian Katz
Hello,

A few weeks ago, I submitted a pull request to Solr in association with a JIRA 
ticket, and it was eventually merged.

More recently, I had an almost-trivial change I wanted to share, but on GitHub, 
my Solr fork appeared to have changed upstreams. Was the whole Solr repo moved 
and regenerated or something?

In any case, I ended up submitting my proposal using a new fork of 
apache/lucene-solr. It's visible here:

https://github.com/apache/lucene-solr/pull/13

However, due to the weirdness of the switching upstreams, I thought I'd better 
check in here and make sure I put this in the right place!

thanks,
Demian


SOLR-2798 (local params parsing issue) -- how can I help?

2015-12-02 Thread Demian Katz
Hello,

I'd really love to see a resolution to SOLR-2798, since my application has a 
bug that cannot be addressed until this issue is fixed.

It occurred to me that there's a good chance that the code involved in this 
issue is relatively isolated and testable, so I might be able to help with a 
solution even though I have no prior experience with the Solr code base. I'm 
just wondering if anyone can confirm this and, if so, point me in the right 
general direction so that I can make an attempt at a patch.

I asked about this a while ago in a comment on the JIRA ticket, but I have a 
feeling that nobody actually saw that - so I'm trying again here on the mailing 
list.

Any and all help greatly appreciated - and hopefully if you help me a little, I 
can contribute a useful fix back to the project in return.

thanks,
Demian


Costs/benefits of DocValues

2015-11-09 Thread Demian Katz
Hello,

I have a legacy Solr schema that I would like to update to take advantage of 
DocValues. I understand that by adding "docValues=true" to some of my fields, I 
can improve sorting/faceting performance. However, I have a couple of questions:


1.) Will Solr always take proper advantage of docValues when it is turned 
on, or will I gain greater performance by turning off stored/indexed in 
situations where only docValues are necessary (e.g. a sort-only field)?

2.) Will adding docValues to a field introduce significant performance 
penalties for non-docValues uses of that field, beyond the obvious fact that 
the additional data will consume more disk and memory?

I'm asking this question because the existing schema has some multi-purpose 
fields, and I'm trying to determine whether I should just add "docValues=true" 
wherever it might help, or if I need to take a more thoughtful approach and 
potentially split some fields with copyFields, etc. This is particularly 
significant because my schema makes use of some dynamic field suffixes, and I'm 
not sure if I need to add new suffixes to differentiate docValues/non-docValues 
fields, or if it's okay to turn on docValues across the board "just in case."
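
(To make that concrete, the kind of split I'm weighing looks like this -- 
field names are hypothetical:)

<!-- multi-purpose field as it exists today -->
<field name="title_sort" type="string" indexed="true" stored="true"/>

<!-- possible dedicated docValues variant used only for sorting -->
<field name="title_sort_dv" type="string" indexed="false" stored="false" docValues="true"/>
<copyField source="title_sort" dest="title_sort_dv"/>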

Apologies if these questions have already been answered - I couldn't find a 
totally clear answer in the places I searched.

Thanks!

- Demian


ExternalFileField documentation problems?

2014-09-15 Thread Demian Katz
I've just been doing some experimentation with the ExternalFileField. I ran 
into obstacles due to some apparently incorrect documentation in the wiki:

https://cwiki.apache.org/confluence/display/solr/Working+with+External+Files+and+Processes

It seems that for some reason the <fieldType> and <field> definitions are 
mashed together there. It felt wrong, but I tried it since it was in the 
official docs... and, of course, it didn't work.

Fortunately, this blog post helped me out, and I was able to get everything 
working:

http://1opensourcelover.wordpress.com/2013/07/02/solr-external-file-fields/
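
(For the record, the working setup looks roughly like this -- simplified, with 
the essential point being that the fieldType and the field are two separate 
definitions rather than one merged block:)

<fieldType name="externalPopularity" class="solr.ExternalFileField"
           keyField="id" defVal="0" valType="pfloat"/>

<field name="popularity" type="externalPopularity" indexed="false" stored="false"/>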

Anyway, I'm not writing this to complain - I'd just like to help fix the wiki. 
However, since I'm no expert on this functionality and I have no Confluence 
experience, I thought I'd post here before taking any action.


1.) Am I able to edit the wiki? I signed up, but I don't see any edit 
options - just "leave a comment." I assume this means I have no rights, but it 
might just mean I'm looking in the wrong places.

2.) Is there anyone more intimately familiar with ExternalFileField who 
would be willing to give the wiki page a quick review and correct factual 
errors? The extent of my edit (if I could make it) would simply be to fix the 
broken schema.xml example, but it's possible other details also need 
adjustments.

3.) Is there a policy on external links in the wiki? Adding a comment with a 
link to the above-mentioned blog post might be helpful to others, but if it's 
going to get me flagged as a potential spammer, I'll refrain from doing it.

Thanks for your input! I'll go ahead and leave a comment if I don't hear 
anything in a few days, but it seemed worth asking for best practices first.

- Demian


Preserving punctuation tokens with ICUTokenizerFactory

2012-04-10 Thread Demian Katz
It has been brought to my attention that ICUTokenizerFactory drops tokens like 
the ++ in "The C++ Programming Language."  Is there any way to persuade it to 
preserve these types of tokens?
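
From what I've read so far, the tokenizer's behavior is governed by ICU break 
rules, and ICUTokenizerFactory appears to accept per-script rule files; 
something along these lines is what I'm planning to experiment with (untested, 
the rule file name is invented, and the rules themselves would still need to be 
written):

<tokenizer class="solr.ICUTokenizerFactory" rulefiles="Latn:my-latin-rules.rbbi"/>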

thanks,
Demian


RE: sun-java6 alternatives for Solr 3.5

2012-02-27 Thread Demian Katz
For what it's worth, I run Solr 3.5 on Ubuntu using the OpenJDK packages and I 
haven't run into any problems.  I do realize that sometimes the Sun JDK has 
features that are missing from other Java implementations, but so far it hasn't 
affected my use of Solr.

- Demian

> -Original Message-
> From: ku3ia [mailto:dem...@gmail.com]
> Sent: Monday, February 27, 2012 2:25 PM
> To: solr-user@lucene.apache.org
> Subject: sun-java6 alternatives for Solr 3.5
> 
> Hi all!
> I had installed an Ubuntu 10.04 LTS. I had added a 'partner' repository to
> my sources list and updated it, but I can't find a package sun-java6-*:
> root@ubuntu:~# apt-cache search java6
> default-jdk - Standard Java or Java compatible Development Kit
> default-jre - Standard Java or Java compatible Runtime
> default-jre-headless - Standard Java or Java compatible Runtime (headless)
> openjdk-6-jdk - OpenJDK Development Kit (JDK)
> openjdk-6-jre - OpenJDK Java runtime, using Hotspot JIT
> openjdk-6-jre-headless - OpenJDK Java runtime, using Hotspot JIT (headless)
> 
> Then I googled and found an article:
> https://lists.ubuntu.com/archives/ubuntu-security-announce/2011-
> December/001528.html
> 
> I'm using Solr 3.5 and Apache Tomcat 6.0.32.
> Please advise me what I should do in this situation, because I have always used
> sun-java6-* packages for Tomcat and Solr and they worked fine.
> Thanks!
> 
> --
> View this message in context: http://lucene.472066.n3.nabble.com/sun-java6-
> alternatives-for-Solr-3-5-tp3781792p3781792.html
> Sent from the Solr - User mailing list archive at Nabble.com.


RE: SOLR - Just for search or whole site DB?

2012-02-21 Thread Demian Katz
I would strongly recommend using Solr just for search.  Solr is designed for 
doing fast search lookups.  It is really not designed for performing all the 
functions of a relational database system.  You certainly COULD use Solr for 
everything, and the software is constantly being enhanced to make it more 
flexible, but you'll still probably find it awkward and inconvenient for 
certain tasks that are simple with MySQL.  It's also useful to be able to throw 
away and rebuild your Solr index at will, so you can upgrade to a new version 
or tweak your indexing rules.  If you store mission-critical data in Solr 
itself, this becomes more difficult.  The way I like to look at it is, as the 
name says, as an index.  You use one system for actually managing your data, 
and then you use Solr to create an index of that data for fast look-up.  

- Demian

> -Original Message-
> From: Spadez [mailto:james_will...@hotmail.com]
> Sent: Tuesday, February 21, 2012 7:45 AM
> To: solr-user@lucene.apache.org
> Subject: SOLR - Just for search or whole site DB?
> 
> 
> I am new to this but I wanted to pitch a setup to you. I have a website
> being coded at the moment, in the very early stages, but it is effectively a
> full-text scraper and search engine. We have decided on SOLR for the search
> system.
> 
> We basically have two sets of data:
> 
> One is the content for the search engine, which is around 100K records at
> any one time. The entire system is built on PHP and currently put into a
> MSQL database. We want very quick relevant searches, this is critical. Our
> plan is to import our records into SOLR each night from the MYSQL database.
> 
> The second set of data is other parts of the site, such as our ticket
> system, stats about the number of clicks, etc. This is not
> performance critical at all.
> 
> So, I have two questions:
> 
> Firstly, should everything be run through the SOLR search system, including
> tickets and site stats? Alterntively, is it better to keep only the main
> full text searches on SOLR and do the ticketing etc through normal MYSQL
> queries?
> 
> Secondly, which is probably dependant on the first question. If everything
> should go through SOLR, should we even use a MySQL database at all? If not,
> what is the alternative? Would we use an XML file as a SQL replacement for
> content including tickets, stats, users, passwords, etc.?
> 
> Sorry if these questions are basic, but I’m out of my depth here (but
> learning!)
> 
> James
> 
> 
> --
> View this message in context: http://lucene.472066.n3.nabble.com/SOLR-Just-
> for-search-or-whole-site-DB-tp3763439p3763439.html
> Sent from the Solr - User mailing list archive at Nabble.com.


RE: social/collaboration features on top of solr

2011-12-13 Thread Demian Katz
VuFind (http://vufind.org) uses Solr for library catalog (or similar) 
applications and features a MySQL database which it uses for storing user tags 
and comments outside of Solr itself.  If there were a mechanism more closely 
tied to Solr for achieving this sort of effect, that would allow VuFind to do 
things with considerably more elegance!

- Demian

> -Original Message-
> From: Robert Stewart [mailto:bstewart...@gmail.com]
> Sent: Tuesday, December 13, 2011 10:28 AM
> To: solr-user@lucene.apache.org
> Subject: social/collaboration features on top of solr
> 
> Has anyone implemented some social/collaboration features on top of
> SOLR?  What I am thinking is ability to add ratings and comments to
> documents in SOLR and then be able to fetch comments and ratings for
> each document in results (and have as part of response from SOLR),
> similar in fashion to MLT results.  I think a separate index or
> separate core to store collaboration info would be needed, as well as a
> search component for fetching collaboration info for results.  I would
> think this would be a great feature and wondering if anyone has done
> something similar.
> 
> Bob


Re: LocalParams, bq, and highlighting

2011-11-01 Thread Demian Katz
> This is definitely an interesting case that i don't think anyone ever
> really considered before.  It seems like a strong argument in favor of
> adding an "hl.q" param that the HighlightingComponent would use as an
> override for whatever the QueryComponent thinks the highlighting query
> should be, that way people expressing complex queries like you you
> describe could do something like...
>
>qq=solr
>q=inStock:true AND _query_:"{!dismax v=$qq}"
>hl.q={!v=$qq}
>hl=true
>fl=name
>hl.fl=name
>bq=server
>
> ...what do you think?
>
> wanna file a Jira requesting this as a feature?  Pretty sure the change
> would only require a few lines of code (but of course we'd also need JUnit
> tests which would probably be several dozen lines of code)

First of all, thanks for answering both of my LocalParams-related queries back 
in September.  I somehow failed to notice your responses until today - it's 
alarmingly easy to lose things in the flood of solr-user mail - but I greatly 
appreciate your input on both issues!

It looks like there's already a JIRA ticket (more than a year old) for the hl.q 
param:

https://issues.apache.org/jira/browse/SOLR-1926

This definitely sounds like it would solve my problem, so I've put in my vote!

- Demian


RE: DisMax and WordDelimiterFilterFactory (limitations of MultiPhraseQuery)

2011-10-27 Thread Demian Katz
If we change the query chain to not split on case change, then we lose half the 
benefit of that feature -- if a user types "WiFi" and the source record 
contains "wi fi," we fail to get a hit.  As you say, that may be worth 
considering if it comes down to picking the lesser evil, but I still think 
there should be a complete solution to my problem -- I'm not trying to 
compensate for every fat-fingered user behavior... just one specific one!

Ultimately, I think my problem relates to this note from the documentation 
about using phrases in the SynonymFilterFactory:

"Phrase searching (ie: "sea biscit") will cause the QueryParser to pass the 
entire string to the analyzer, but if the SynonymFilter is configured to expand 
the synonyms, then when the QueryParser gets the resulting list of tokens back 
from the Analyzer, it will construct a MultiPhraseQuery that will not have the 
desired effect. This is because of the limited mechanism available for the 
Analyzer to indicate that two terms occupy the same position: there is no way 
to indicate that a "phrase" occupies the same position as a term. For our 
example the resulting MultiPhraseQuery would be "(sea | sea | seabiscuit) 
(biscuit | biscit)" which would not match the simple case of "seabiscuit" 
occuring in a document."

So I suppose I'm just running up against a fundamental limitation of Solr...  
but it seems like a limitation that might be worth overcoming -- 
I'm sure my use case is not the only one where this could matter.  Has anyone 
given this any thought?

- Demian

> -Original Message-
> From: Erick Erickson [mailto:erickerick...@gmail.com]
> Sent: Thursday, October 27, 2011 8:21 AM
> To: solr-user@lucene.apache.org
> Subject: Re: DisMax and WordDelimiterFilterFactory
> 
> What happens if you change your WDDF definition in the query part of
> your analysis
> chain to NOT split on case change? Then your index should contain the
> right
> fragments (and combined words) and your queries would match.
> 
> I admit I haven't thought this through entirely, but this would work
> for your example I
> think. Unfortunately I suspect it would break other cases I
> suspect you're in a
> "lesser of two evils" situation.
> 
> But I can't imagine a 100% solution here. You're effectively asking to
> compensate for
> any fat-fingered thing a user does. Impossible I think...
> 
> Best
> Erick
> 
> On Tue, Oct 25, 2011 at 1:13 PM, Demian Katz
>  wrote:
> > I've seen a couple of threads related to this subject (for example,
> http://www.mail-archive.com/solr-user@lucene.apache.org/msg33400.html),
> but I haven't found an answer that addresses the aspect of the problem
> that concerns me...
> >
> > I have a field type set up like this:
> >
> >    <fieldType name="..." class="solr.TextField" positionIncrementGap="100">
> >      <analyzer type="index">
> >        <tokenizer class="..."/>
> >        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
> >        <filter class="solr.StopFilterFactory" words="stopwords.txt" enablePositionIncrements="true"/>
> >        <filter class="solr.ICUFoldingFilterFactory"/>
> >        <filter class="..." protected="protwords.txt"/>
> >        <filter class="..." language="English"/>
> >        <filter class="..."/>
> >      </analyzer>
> >      <analyzer type="query">
> >        <tokenizer class="..."/>
> >        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
> >        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
> >        <filter class="solr.StopFilterFactory" words="stopwords.txt" enablePositionIncrements="true"/>
> >        <filter class="solr.ICUFoldingFilterFactory"/>
> >        <filter class="..." protected="protwords.txt"/>
> >        <filter class="..." language="English"/>
> >        <filter class="..."/>
> >      </analyzer>
> >    </fieldType>
> >
> > The important feature here is the use of WordDelimiterFilterFactory,
> which allows a search for "WiFi" to match an indexed term of "wi fi"
> (for example).
> >
> > The problem, of course, is that if a user accidentally introduces a
> case change in their query, the query analyzer chain breaks it into
> multiple words and no hits are found...  so a search for "exaMple" will
> look for "exa mple" and fail.
> >
> > I've found two solutions that resolve this problem in the admin panel
> field analysis tool:
> >
> >
> > 1.)    Turn on catenateWords and catenateNumbers in the query
> analyzer - this reassembles the user's broken word and allows a match.
> >
>

RE: Dismax handler - whitespace and special character behaviour

2011-10-25 Thread Demian Katz
I just sent an email to the list about DisMax interacting with 
WordDelimiterFilterFactory, and I think our problems are at least partially 
related -- I think the reason you are seeing an OR where you expect an AND is 
that you have autoGeneratePhraseQueries set to false, which changes the way 
DisMax handles the output of the WordDelimiterFilterFactory (among others).  
Unfortunately, I don't have a solution for you...  but you might want to keep 
an eye on my thread in case replies there shed any additional light.

- Demian

> -Original Message-
> From: Rohk [mailto:khor...@gmail.com]
> Sent: Tuesday, October 25, 2011 10:33 AM
> To: solr-user@lucene.apache.org
> Subject: Dismax handler - whitespace and special character behaviour
> 
> Hello,
> 
> I've got strange results when I have special characters in my query.
> 
> Here is my request :
> 
> q=histoire-
> france&start=0&rows=10&sort=score+desc&defType=dismax&qf=any^1.0&mm=100
> %
> 
> Parsed query :
> 
> +((any:histoir any:franc)) ()
> 
> I've got 17000 results because Solr is doing an OR (should be AND).
> 
> I have no problem when I'm using a whitespace instead of a special char
> :
> 
> q=histoire
> france&start=0&rows=10&sort=score+desc&defType=dismax&qf=any^1.0&mm=100
> %
> 
> +(((any:histoir) (any:franc))~2)
> ()
> 
> 2000 results for this query.
> 
> Here is my schema.xml (relevant parts) :
> 
> <fieldType name="..." class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="false">
>   <analyzer type="index">
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" preserveOriginal="1"/>
>     <filter class="..." words="stopwords_french.txt" ignoreCase="true"/>
>     <filter class="..." words="stopwords_french.txt" enablePositionIncrements="true"/>
>     <filter class="..." language="French" protected="protwords.txt"/>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1" preserveOriginal="0"/>
>     <filter class="..." words="stopwords_french.txt" ignoreCase="true"/>
>     <filter class="..." words="stopwords_french.txt" enablePositionIncrements="true"/>
>     <filter class="..." language="French" protected="protwords.txt"/>
>   </analyzer>
> </fieldType>
> 
> 
> I tried with a PatternTokenizerFactory to tokenize on whitespaces &
> special
> chars but no change...
> Even with a charFilter (PatternReplaceCharFilterFactory) to replace
> special
> characters by whitespace, it doesn't work...
> 
> First line of analysis via solr admin, with verbose output, for query =
> 'histoire-france' :
> 
> org.apache.solr.analysis.PatternReplaceCharFilterFactory {replacement=
> , pattern=([,;./\\'&-]), luceneMatchVersion=LUCENE_32}
> text: histoire france
> 
> The '-' is replaced by ' ', then tokenized by
> WhitespaceTokenizerFactory.
> However I still have different number of results for 'histoire-france'
> and
> 'histoire france'.
> 
> My current workaround is to replace all special chars by whitespaces
> before
> sending query to Solr, but it is not satisfying.
> 
> Did i miss something ?


DisMax and WordDelimiterFilterFactory

2011-10-25 Thread Demian Katz
I've seen a couple of threads related to this subject (for example, 
http://www.mail-archive.com/solr-user@lucene.apache.org/msg33400.html), but I 
haven't found an answer that addresses the aspect of the problem that concerns 
me...

I have a field type set up like this:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="1" catenateNumbers="1"
            catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <filter .../>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="0" catenateNumbers="0"
            catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <filter .../>
  </analyzer>
</fieldType>

The important feature here is the use of WordDelimiterFilterFactory, which 
allows a search for "WiFi" to match an indexed term of "wi fi" (for example).

The problem, of course, is that if a user accidentally introduces a case change 
in their query, the query analyzer chain breaks it into multiple words and no 
hits are found...  so a search for "exaMple" will look for "exa mple" and fail.

I've found two solutions that resolve this problem in the admin panel field 
analysis tool:


1.) Turn on catenateWords and catenateNumbers in the query analyzer - this 
reassembles the user's broken word and allows a match.

2.) Turn on preserveOriginal in the query analyzer - this passes through the 
user's original query, which then gets cleaned up by the ICUFoldingFilterFactory 
and allows a match (see the sketch below).
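
For concreteness, option 2 amounts to a one-attribute change in the query
analyzer -- roughly the following, with the chain abridged to the parts that
matter here:

  <analyzer type="query">
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="0" catenateNumbers="0"
            catenateAll="0" splitOnCaseChange="1" preserveOriginal="1"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>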

The problem is that in my real-world application, which uses DisMax, neither of 
these solutions works.  It appears that even though (if I understand correctly) 
the WordDelimiterFilterFactory is returning ALTERNATIVE tokens, the DisMax 
handler is combining them in a way that inappropriately requires all of them to 
match...  for example, here's partial debugQuery output for the "exaMple" 
search using Dismax and solution #2 above:

"parsedquery":"+DisjunctionMaxQuery((genre:\"(exampl exa) mple\"^300.0 | 
title_new:\"(exampl exa) mple\"^100.0 | topic:\"(exampl exa) mple\"^500.0 | 
series:\"(exampl exa) mple\"^50.0 | title_full_unstemmed:\"(example exa) 
mple\"^600.0 | geographic:\"(exampl exa) mple\"^300.0 | contents:\"(exampl exa) 
mple\"^10.0 | fulltext_unstemmed:\"(example exa) mple\"^10.0 | 
allfields_unstemmed:\"(example exa) mple\"^10.0 | title_alt:\"(exampl exa) 
mple\"^200.0 | series2:\"(exampl exa) mple\"^30.0 | title_short:\"(exampl exa) 
mple\"^750.0 | author:\"(example exa) mple\"^300.0 | title:\"(exampl exa) 
mple\"^500.0 | topic_unstemmed:\"(example exa) mple\"^550.0 | 
allfields:\"(exampl exa) mple\" | author_fuller:\"(example exa) mple\"^150.0 | 
title_full:\"(exampl exa) mple\"^400.0 | fulltext:\"(exampl exa) mple\")) ()",

Obviously, that is not what I want - ideally it would be something like 'exampl 
OR "ex ample"'.

I also read about the autoGeneratePhraseQueries setting, but that seems to take 
things way too far in the opposite direction - if I set that to false, then I 
get matches for any individual token; i.e. example OR ex OR ample - not good at 
all!

I have a sinking suspicion that there is not an easy solution to my problem, 
but this seems to be a fairly basic need; splitOnCaseChange is a useful feature 
to have, but it's more valuable if it serves as an ALTERNATIVE search rather 
than a necessary query munge.  Any thoughts?

thanks,
Demian


LocalParams, bq, and highlighting

2011-09-21 Thread Demian Katz
I've run into another strange behavior related to LocalParams syntax in Solr 
1.4.1.  If I apply Dismax boosts using bq in LocalParams syntax, the contents 
of the boost queries get used by the highlighter.  Obviously, when I use bq as 
a separate parameter, this is not an issue.

To clarify, here are two searches that yield identical results but different 
highlighting behaviors:

http://localhost:8080/solr/biblio/select/?q=john&rows=20&start=0&indent=yes&qf=author^100&qt=dismax&bq=author%3Asmith^1000&fl=score&hl=true&hl.fl=*

http://localhost:8080/solr/biblio/select/?q=%28%28_query_%3A%22{!dismax+qf%3D\%22author^100\%22+bq%3D\%27author%3Asmith^1000\%27}john%22%29%29&rows=20&start=0&indent=yes&fl=score&hl=true&hl.fl=*

Query #1 highlights only "john" (the desired behavior), but query #2 highlights 
both "john" and "smith".

Is this a known limitation of the highlighter, or is it a bug?  Is this issue 
resolved in newer versions of Solr?

thanks,
Demian


RE: Questions about LocalParams syntax

2011-09-20 Thread Demian Katz
Space-separation works for the qf field, but not for bq.  If I try a bq of 
"format:Book^50 format:Journal^150", I get a strange result -- I would expect 
in the case of a failed bq that either a) I would get a syntax error of some 
sort or b) I would get normal search results with no boosting applied.  
Instead, I get a successful search result containing 0 entries.  Very odd!  
Anyway, the solution that definitely works is joining the clauses with OR...  
but I'd still love to be able to specify multiple bq's separately if there's 
any way it can be done.
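
For the record, the variant that does work for me looks roughly like this --
clauses OR'ed inside a single bq, with single quotes used to cut down on
escaping:

((_query_:"{!dismax qf='title^500 author^300 allfields' bq='format:Book^50 OR format:Journal^150'}test"))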

As for the quote issue, the problem I'm trying to solve is that my code is 
driven by configuration files, and users may specify any legal Solr bq values 
that they choose.  You're right that in some cases, I can simplify the 
situation by alternating quotes or changing the syntax...  but I don't want to 
force users into using a subset of legal Solr syntax; it would be much better 
to be able to handle all legal cases in a straightforward fashion.  Admittedly, 
my example is artificial -- format:Book^50 works just as well as 
format:"Book"^50...  but suppose they wanted to boost a phrase like 
format:"Conference Proceeding"^25 -- this is a common case.  It seems like 
there should be some syntax that allows this to work in the context I am using 
it.  If not, perhaps we need to file a bug report.
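
To make the failing case concrete, the sort of expression I ultimately need is
roughly:

((_query_:"{!dismax qf='title^500' bq='format:\"Conference Proceeding\"^25'}test"))

...and it is exactly that innermost layer of quoting that I have not been able
to get to parse cleanly.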

In any case, thanks for taking the time to make some suggestions!  It surprises 
me that this very powerful feature of Solr is so little-documented.

- Demian

> -Original Message-
> From: Jonathan Rochkind [mailto:rochk...@jhu.edu]
> Sent: Tuesday, September 20, 2011 10:32 AM
> To: solr-user@lucene.apache.org
> Cc: Demian Katz
> Subject: Re: Questions about LocalParams syntax
> 
> I don't have the complete answer. But I _think_ if you do one 'bq' param
> with multiple space-separated directives, it will work.
> 
> And escaping is a pain.  But can be made somewhat less of a pain if you
> realize that single quotes can sometimes be used instead of
> double-quotes. What I do:
> 
> _query_:"{!dismax qf='title something else'}"
> 
> So by switching between single and double quotes, you can avoid need to
> escape. Sometimes you still do need to escape when a single or double
> quote is actually in a value (say in a 'q'), and I do use backslash
> there. If you had more levels of nesting though... I have no idea what
> you'd do.
> 
> I'm not even sure why you have the internal quotes here:
> 
> bq=\"format:\\\"Book\\\"^50\"
> 
> 
> Shouldn't that just be bq='format:Book^50'? What's the extra double
> quotes around "Book" for?  If you don't need them, then with switching
> between single and double, this can become somewhat less crazy and
> error
> prone:
> 
> _query_:"{!dismax bq='format:Book^50'}"
> 
> I think. Maybe. If you really do need the double quotes in there, then
> I
> think switching between single and double you can use a single
> backslash
> there.
> 
> 
> On 9/20/2011 9:39 AM, Demian Katz wrote:
> > I'm using the LocalParams syntax combined with the _query_ pseudo-
> field to build an advanced search screen (built on Solr 1.4.1's Dismax
> handler), but I'm running into some syntax questions that don't seem to
> be addressed by the wiki page here:
> >
> > http://wiki.apache.org/solr/LocalParams
> >
> >
> > 1.)How should I deal with repeating parameters?  If I use
> multiple boost queries, it seems that only the last one listed is
> used...  for example:
> >
> > ((_query_:"{!dismax qf=\"title^500 author^300 allfields\"
> bq=\"format:Book^50\" bq=\"format:Journal^150\"}test"))
> >
> >  boosts Journals, but not Books.  If I reverse the
> order of the two bq parameters, then Books get boosted instead of
> Journals.  I can work around this by creating one bq with the clauses
> OR'ed together, but I would rather be able to apply multiple bq's like
> I can elsewhere.
> >
> >
> > 2.)What is the proper way to escape quotes?  Since there are
> multiple nested layers of double quotes, things get ugly and it's easy
> to end up with syntax errors.  I found that this syntax doesn't cause
> an error:
> >
> >
> > ((_query_:"{!dismax qf=\"title^500 author^300 allfields\"
> bq=\"format:\\\"Book\\\"^50\" bq=\"format:\\\"Journal\\\"^150\"}test"))
> >
> >  ...but it also doesn't work correctly - the boost
> queries are completely ignored in this example.  Perhaps this is more a
> problem related to  _query_ than to LocalParams syntax...  but either
> way, a solution would be great!
> >
> > thanks,
> > Demian
> >


Questions about LocalParams syntax

2011-09-20 Thread Demian Katz
I'm using the LocalParams syntax combined with the _query_ pseudo-field to 
build an advanced search screen (built on Solr 1.4.1's Dismax handler), but I'm 
running into some syntax questions that don't seem to be addressed by the wiki 
page here:

http://wiki.apache.org/solr/LocalParams


1.) How should I deal with repeating parameters?  If I use multiple boost 
queries, it seems that only the last one listed is used...  for example:

((_query_:"{!dismax qf=\"title^500 author^300 allfields\" bq=\"format:Book^50\" 
bq=\"format:Journal^150\"}test"))

boosts Journals, but not Books.  If I reverse the order of the 
two bq parameters, then Books get boosted instead of Journals.  I can work 
around this by creating one bq with the clauses OR'ed together, but I would 
rather be able to apply multiple bq's like I can elsewhere.


2.) What is the proper way to escape quotes?  Since there are multiple 
nested layers of double quotes, things get ugly and it's easy to end up with 
syntax errors.  I found that this syntax doesn't cause an error:


((_query_:"{!dismax qf=\"title^500 author^300 allfields\" 
bq=\"format:\\\"Book\\\"^50\" bq=\"format:\\\"Journal\\\"^150\"}test"))

...but it also doesn't work correctly - the boost queries are 
completely ignored in this example.  Perhaps this is more a problem related to  
_query_ than to LocalParams syntax...  but either way, a solution would be 
great!

thanks,
Demian


"String index out of range: -1" for hl.fl=* in Solr 1.4.1?

2011-09-09 Thread Demian Katz
I'm running into a strange problem with Solr 1.4.1 - this request:

http://localhost:8080/solr/website/select/?q=*%3A*&rows=20&start=0&indent=yes&fl=score&facet=true&facet.mincount=1&facet.limit=30&facet.field=category&facet.field=linktype&facet.field=subject&facet.prefix=&facet.sort=&fq=category%3A%22Exhibits%22&spellcheck=true&spellcheck.q=*%3A*&spellcheck.dictionary=default&hl=true&hl.fl=*&hl.simple.pre=START_HILITE&hl.simple.post=END_HILITE&wt=json&json.nl=arrarr

leads to this error dump:

String index out of range: -1

java.lang.StringIndexOutOfBoundsException: String index out of range: -1
at java.lang.String.substring(String.java:1949)
at 
org.apache.lucene.search.highlight.Highlighter.getBestTextFragments(Highlighter.java:263)
at 
org.apache.solr.highlight.DefaultSolrHighlighter.doHighlighting(DefaultSolrHighlighter.java:335)
at 
org.apache.solr.handler.component.HighlightComponent.process(HighlightComponent.java:89)
at 
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:195)
at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
at 
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
at 
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1088)
at 
org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:360)
at 
org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
at 
org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
at 
org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:729)
at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
at 
org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:206)
at 
org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
at 
org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
at org.mortbay.jetty.Server.handle(Server.java:324)
at 
org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:505)
at 
org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:829)
at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:513)
at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:211)
at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:380)
at 
org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:395)
at 
org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:488)

I've managed to work around the problem by replacing the hl.fl=* parameter with 
a comma-delimited list of the fields I actually need highlighted...  but I 
don't understand why I'm encountering this error, and for peace of mind I would 
like to understand the problem in case there's a deeper problem at work here.  
I'll be happy to share schema or other details if they would help narrow down a 
potential cause!

thanks,
Demian


RE: SpellCheckComponent performance

2011-06-07 Thread Demian Katz
As I may have mentioned before, VuFind is actually doing two Solr queries for 
every search -- a base query that gets basic spelling suggestions, and a 
supplemental spelling-only query that gets shingled spelling suggestions.  If 
there's a way to get two different spelling responses in a single query, I'd 
love to hear about it...  but the double-querying doesn't seem to be a huge 
problem -- the delays I'm talking about are in the spelling portion of the 
initial query.  Just for the sake of completeness, here are both of my spelling 
field types:

<fieldType name="textSpell" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="0" catenateNumbers="0"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

<fieldType name="textSpellShingle" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="2" outputUnigrams="false"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="2" outputUnigrams="false"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

...and here are the fields:

<field name="spelling" type="textSpell" indexed="true" stored="true"/>
<field name="spellingShingle" type="textSpellShingle" indexed="true" stored="true"/>

As you can probably guess, I'm using spelling in my main query and 
spellingShingle in my supplemental query.

Here are stats on the spelling field:

{field=spelling,memSize=107830314,tindexSize=249184,time=25747,phase1=25150,nTerms=1343061,bigTerms=231,termInstances=40960454,uses=1}

(I obtained these numbers by temporarily adding the spelling field as a facet 
to my warming query -- probably not a very smart way to do it, but it was the 
only way I could figure out!  If there's a more elegant and accurate approach, 
I'd be interested to know what it is.)
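
(Concretely, I just tacked an extra <str name="facet.field">spelling</str>
parameter onto the warming query from my other thread; faceting on the field
forces Solr to uninvert it, and the statistics above then show up in the log.)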

I should also note that my basic spelling index is 114MB and my shingled 
spelling index is 931MB -- not outrageously large.  Is there a way to persuade 
Solr to load these into memory for faster performance?

thanks,
Demian

> -Original Message-
> From: Erick Erickson [mailto:erickerick...@gmail.com]
> Sent: Monday, June 06, 2011 6:23 PM
> To: solr-user@lucene.apache.org
> Subject: Re: SpellCheckComponent performance
> 
> Hmmm, how are you configuring your spell checker? The first-time
> slowdown
> is probably due to cache warming, but subsequent 500 ms slowdowns
> seem odd. How many unique terms are there in your spellecheck index?
> 
> It'd probably be best if you showed us your fieldtype and field
> definition...
> 
> Best
> Erick
> 
> On Mon, Jun 6, 2011 at 4:04 PM, Demian Katz 
> wrote:
> > I'm continuing to work on tuning my Solr server, and now I'm noticing
> that my biggest bottleneck is the SpellCheckComponent.  This is eating
> multiple seconds on most first-time searches, and still taking around
> 500ms even on cached searches.  Here is my configuration:
> >
> <searchComponent name="spellcheck"
>     class="org.apache.solr.handler.component.SpellCheckComponent">
>   <lst name="spellchecker">
>     <str name="name">basicSpell</str>
>     <str name="field">spelling</str>
>     <float name="accuracy">0.75</float>
>     <str name="spellcheckIndexDir">./spellchecker</str>
>     <str name="queryAnalyzerFieldType">textSpell</str>
>     <str name="buildOnOptimize">true</str>
>   </lst>
> </searchComponent>
> >
> > I've done a bit of searching, but the best advice I could find for
> making the search component go faster involved reducing
> spellcheck.maxCollationTries, which doesn't even seem to apply to my
> settings.
> >
> > Does anyone have any advice on tuning this aspect of my
> configuration?  Are there any extra debug settings that might give
> deeper insight into how the component is spending its time?
> >
> > thanks,
> > Demian
> >


SpellCheckComponent performance

2011-06-06 Thread Demian Katz
I'm continuing to work on tuning my Solr server, and now I'm noticing that my 
biggest bottleneck is the SpellCheckComponent.  This is eating multiple seconds 
on most first-time searches, and still taking around 500ms even on cached 
searches.  Here is my configuration:

<searchComponent name="spellcheck"
    class="org.apache.solr.handler.component.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="name">basicSpell</str>
    <str name="field">spelling</str>
    <float name="accuracy">0.75</float>
    <str name="spellcheckIndexDir">./spellchecker</str>
    <str name="queryAnalyzerFieldType">textSpell</str>
    <str name="buildOnOptimize">true</str>
  </lst>
</searchComponent>

I've done a bit of searching, but the best advice I could find for making the 
search component go faster involved reducing spellcheck.maxCollationTries, 
which doesn't even seem to apply to my settings.
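
For reference, the requests that exercise this component carry parameters
along these lines (host, core, and query are placeholders):

http://localhost:8080/solr/biblio/select?q=example&spellcheck=true&spellcheck.q=example&spellcheck.dictionary=basicSpell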

Does anyone have any advice on tuning this aspect of my configuration?  Are 
there any extra debug settings that might give deeper insight into how the 
component is spending its time?

thanks,
Demian


RE: Solr performance tuning - disk i/o?

2011-06-06 Thread Demian Katz
All of my cache autowarmCount settings are either 1 or 5.  
maxWarmingSearchers is set to 2.  I previously shared the contents of my 
firstSearcher and newSearcher events -- just a "queries" array surrounded by a 
standard-looking <listener> tag.  The events are definitely firing -- in 
addition to the measurable performance improvement they give me, I can actually 
see them happening in the console output during startup.  That seems to cover 
every configuration option in my file that references warming in any way, and 
it all looks reasonable to me.  warmupTime remains consistently 0 in the 
statistics display.  Is there anything else I should be looking at?  In any 
case, I'm not too alarmed by this... it just seems a little strange.
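
For reference, the cache declarations I mean are the standard solrconfig.xml
ones, along these lines (the sizes shown are illustrative, not my real values):

  <filterCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="5"/>
  <queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="5"/>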

thanks,
Demian

> -Original Message-
> From: Erick Erickson [mailto:erickerick...@gmail.com]
> Sent: Monday, June 06, 2011 11:59 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Solr performance tuning - disk i/o?
> 
> Polling interval was in reference to slaves in a multi-machine
> master/slave setup. so probably not
> a concern just at present.
> 
> Warmup time of 0 is not particularly normal, I'm not quite sure what's
> going on there but you may
> want to look at firstsearcher, newsearcher and autowarm parameters in
> config.xml..
> 
> Best
> Erick
> 
> On Mon, Jun 6, 2011 at 9:08 AM, Demian Katz 
> wrote:
> > Thanks once again for the helpful suggestions!
> >
> > Regarding the selection of facet fields, I think publishDate (which
> is actually just a year) and callnumber-first (which is actually a very
> broad, high-level category) are okay.  authorStr is an interesting
> problem: it's definitely a useful facet (when a user searches for an
> author, odds are good that they want the one who published the most
> books... i.e. a search for dickens will probably show Charles Dickens
> at the top of the facet list), but it has a long tail since there are
> many minor authors who have only published one or two books...  Is
> there a possibility that the facet.mincount parameter could be helpful
> here, or does that have no impact on performance/memory footprint?
> >
> > Regarding polling interval for slaves, are you referring to a
> distributed Solr environment, or is this something to do with Solr's
> internals?  We're currently a single-server environment, so I don't
> think I have to worry if it's related to a multi-server setup...  but
> if it's something internal, could you point me to the right area of the
> admin panel to check my stats?  I'm not seeing anything about polling
> on the statistics page.  It's also a little strange that all of my
> warmupTime stats on searchers and caches are showing as 0 -- is that
> normal?
> >
> > thanks,
> > Demian
> >
> >> -Original Message-
> >> From: Erick Erickson [mailto:erickerick...@gmail.com]
> >> Sent: Friday, June 03, 2011 4:45 PM
> >> To: solr-user@lucene.apache.org
> >> Subject: Re: Solr performance tuning - disk i/o?
> >>
> >> Quick impressions:
> >>
> >> The faceting is usually best done on fields that don't have lots of
> >> unique
> >> values for three reasons:
> >> 1> It's questionable how much use to the user to have a gazillion
> >> facets.
> >>      In the case of a unique field per document, in fact, it's
> useless.
> >> 2> resource requirements go up as a function of the number of unique
> >>      terms. This is true for faceting and sorting.
> >> 3> warmup times grow the more terms have to be read into memory.
> >>
> >>
> >> Glancing at your warmup stuff, things like publishDate, authorStr
> and
> >> maybe
> >> callnumber-first are questionable. publishDate depends on how coarse
> >> the
> >> resolution is. If it's by day, that's not really much use.
> authorStr..
> >> How many
> >> authors have more than one publication? Would this be better served
> by
> >> some
> >> kind of autosuggest rather than facets? callnumber-first... I don't
> >> really know, but
> >> if it's unique per document it's probably not something the user
> would
> >> find useful
> >> as a facet.
> >>
> >> The admin page will help you determine the number of unique terms
> per
> >> field,
> >> which may guide you whether or not to continue to facet on these
> >> fields.
> >>
> >> As Otis said, doing a sort on the fields during warmup will also
> help.
> >>
> >> W

RE: Solr performance tuning - disk i/o?

2011-06-06 Thread Demian Katz
Thanks once again for the helpful suggestions!

Regarding the selection of facet fields, I think publishDate (which is actually 
just a year) and callnumber-first (which is actually a very broad, high-level 
category) are okay.  authorStr is an interesting problem: it's definitely a 
useful facet (when a user searches for an author, odds are good that they want 
the one who published the most books... i.e. a search for dickens will probably 
show Charles Dickens at the top of the facet list), but it has a long tail 
since there are many minor authors who have only published one or two books...  
Is there a possibility that the facet.mincount parameter could be helpful here, 
or does that have no impact on performance/memory footprint?

Regarding polling interval for slaves, are you referring to a distributed Solr 
environment, or is this something to do with Solr's internals?  We're currently 
a single-server environment, so I don't think I have to worry if it's related 
to a multi-server setup...  but if it's something internal, could you point me 
to the right area of the admin panel to check my stats?  I'm not seeing 
anything about polling on the statistics page.  It's also a little strange that 
all of my warmupTime stats on searchers and caches are showing as 0 -- is that 
normal?

thanks,
Demian

> -Original Message-
> From: Erick Erickson [mailto:erickerick...@gmail.com]
> Sent: Friday, June 03, 2011 4:45 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Solr performance tuning - disk i/o?
> 
> Quick impressions:
> 
> The faceting is usually best done on fields that don't have lots of
> unique
> values for three reasons:
> 1> It's questionable how much use to the user to have a gazillion
> facets.
>  In the case of a unique field per document, in fact, it's useless.
> 2> resource requirements go up as a function of the number of unique
>  terms. This is true for faceting and sorting.
> 3> warmup times grow the more terms have to be read into memory.
> 
> 
> Glancing at your warmup stuff, things like publishDate, authorStr and
> maybe
> callnumber-first are questionable. publishDate depends on how coarse
> the
> resolution is. If it's by day, that's not really much use. authorStr..
> How many
> authors have more than one publication? Would this be better served by
> some
> kind of autosuggest rather than facets? callnumber-first... I don't
> really know, but
> if it's unique per document it's probably not something the user would
> find useful
> as a facet.
> 
> The admin page will help you determine the number of unique terms per
> field,
> which may guide you whether or not to continue to facet on these
> fields.
> 
> As Otis said, doing a sort on the fields during warmup will also help.
> 
> Watch your polling interval for any slaves in relation to the warmup
> times.
> If your polling interval is shorter than the warmup times, you run a
> risk of
> "runaway warmups".
> 
> As you've figured out, measuring responses to the first few queries
> doesn't
> always measure what you really need ..
> 
> I don't have the pages handy, but autowarming is a good topic to
> understand,
> so you might spend some time tracking it down.
> 
> Best
> Erick
> 
> On Fri, Jun 3, 2011 at 11:21 AM, Demian Katz
>  wrote:
> > Thanks to you and Otis for the suggestions!  Some more information:
> >
> > - Based on the Solr stats page, my caches seem to be working pretty
> well (few or no evictions, hit rates in the 75-80% range).
> > - VuFind is actually doing two Solr queries per search (one initial
> search followed by a supplemental spell check search -- I believe this
> is necessary because VuFind has two separate spelling indexes, one for
> shingled terms and one for single words).  That is probably
> exaggerating the problem, though based on searches with debugQuery on,
> it looks like it's always the initial search (rather than the
> supplemental spelling search) that's consuming the bulk of the time.
> > - enableLazyFieldLoading is set to true.
> > - I'm retrieving 20 documents per page.
> > - My JVM settings: -server -
> Xloggc:/usr/local/vufind/solr/jetty/logs/gc.log -Xms4096m -Xmx4096m -
> XX:+UseParallelGC -XX:+UseParallelOldGC -XX:NewRatio=5
> >
> > It appears that a large portion of my problem had to do with
> autowarming, a topic that I've never had a strong grasp on, though
> perhaps I'm finally learning (any recommended primer links would be
> welcome!).  I did have some autowarming settings in solrconfig.xml (an
> arbitrary search for a bunch of random keywords in the newSearcher and
>

RE: Solr performance tuning - disk i/o?

2011-06-03 Thread Demian Katz
Thanks to you and Otis for the suggestions!  Some more information:

- Based on the Solr stats page, my caches seem to be working pretty well (few 
or no evictions, hit rates in the 75-80% range).
- VuFind is actually doing two Solr queries per search (one initial search 
followed by a supplemental spell check search -- I believe this is necessary 
because VuFind has two separate spelling indexes, one for shingled terms and 
one for single words).  That is probably exaggerating the problem, though based 
on searches with debugQuery on, it looks like it's always the initial search 
(rather than the supplemental spelling search) that's consuming the bulk of the 
time.
- enableLazyFieldLoading is set to true.
- I'm retrieving 20 documents per page.
- My JVM settings: -server -Xloggc:/usr/local/vufind/solr/jetty/logs/gc.log 
-Xms4096m -Xmx4096m -XX:+UseParallelGC -XX:+UseParallelOldGC -XX:NewRatio=5

It appears that a large portion of my problem had to do with autowarming, a 
topic that I've never had a strong grasp on, though perhaps I'm finally 
learning (any recommended primer links would be welcome!).  I did have some 
autowarming settings in solrconfig.xml (an arbitrary search for a bunch of 
random keywords in the newSearcher and firstSearcher events, plus autowarmCount 
settings on all of my caches).  However, when I looked at the debugQuery 
output, I noticed that a huge amount of time was being wasted loading facets on 
the first search after restarting Solr, so I changed my newSearcher and 
firstSearcher events to this:

  <listener event="newSearcher" class="solr.QuerySenderListener">
    <arr name="queries">
      <lst>
        <str name="q">*:*</str>
        <str name="start">0</str>
        <str name="rows">10</str>
        <str name="facet">true</str>
        <str name="facet.mincount">1</str>
        <str name="facet.field">collection</str>
        <str name="facet.field">format</str>
        <str name="facet.field">publishDate</str>
        <str name="facet.field">callnumber-first</str>
        <str name="facet.field">topic_facet</str>
        <str name="facet.field">authorStr</str>
        <str name="facet.field">language</str>
        <str name="facet.field">genre_facet</str>
        <str name="facet.field">era_facet</str>
        <str name="facet.field">geographic_facet</str>
      </lst>
    </arr>
  </listener>
  <!-- plus an identical listener for the firstSearcher event -->

Overall performance has increased dramatically, and now the biggest 
bottleneck in the debug output seems to be the shingle spell checking!

Any other suggestions are welcome, since I suspect there's still room to 
squeeze more performance out of the system, and I'm still not sure I'm making 
the most of autowarming...  but this seems like a big step in the right 
direction.  Thanks again for the help!

- Demian

> -Original Message-
> From: Erick Erickson [mailto:erickerick...@gmail.com]
> Sent: Friday, June 03, 2011 9:41 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Solr performance tuning - disk i/o?
> 
> This doesn't seem right. Here's a couple of things to try:
> 1> attach &debugQuery=on to your long-running queries. The QTime
> returned
>  is the time taken to search, NOT including the time to load the
> docs. That'll
>  help pinpoint whether the problem is the search itself, or
> assembling the
>  documents.
> 2> Are you autowarming? If so, be sure it's actually done before
> querying.
> 3> Measure queries after the first few, particularly if you're sorting
> or
>  faceting.
> 4> What are your JVM settings? How much memory do you have?
> 5> is <enableLazyFieldLoading> set to true in your solrconfig.xml?
> 6> How many docs are you returning?
> 
> 
> There's more, but that'll do for a start. Let us know if you gather more
> data and it's still slow.
> 
> Best
> Erick
> 
> On Fri, Jun 3, 2011 at 8:44 AM, Demian Katz 
> wrote:
> > Hello,
> >
> > I'm trying to move a VuFind installation from an ailing physical
> server into a virtualized environment, and I'm running into performance
> problems.  VuFind is a Solr 1.4.1-based application with fairly large
> and complex records (many stored fields, many words per record).  My
> particular installation contains about a million records in the index,
> with a total index size around 6GB.
> >
> > The virtual environment has more RAM and better CPUs than the old
> physical box, and I am satisfied that my Java environment is well-
> tuned.  My index is optimized.  Searches that hit the cache respond
> very well.  The problem is that non-cached searches are very slow - the
> more keywords I add, the slower they get, to the point of taking 6-12
> seconds to come back with results on a quiet box and well over a minute
> under stress testing.  (The old box still took a while for equivalent
> searches, but it was about twice as fast as the new one).
> >
> > My gut feeling is that disk access reading the index is the
> bottleneck here, but I know little about the specifics of Solr's
> internals, so it's entirely possible that my gut is wrong.  Outside
> testing does show that the virtual environment's disk performance
> is not as good as the old physical server, especially when multiple
> processes ar

Solr performance tuning - disk i/o?

2011-06-03 Thread Demian Katz
Hello,

I'm trying to move a VuFind installation from an ailing physical server into a 
virtualized environment, and I'm running into performance problems.  VuFind is 
a Solr 1.4.1-based application with fairly large and complex records (many 
stored fields, many words per record).  My particular installation contains 
about a million records in the index, with a total index size around 6GB.

The virtual environment has more RAM and better CPUs than the old physical box, 
and I am satisfied that my Java environment is well-tuned.  My index is 
optimized.  Searches that hit the cache respond very well.  The problem is that 
non-cached searches are very slow - the more keywords I add, the slower they 
get, to the point of taking 6-12 seconds to come back with results on a quiet 
box and well over a minute under stress testing.  (The old box still took a 
while for equivalent searches, but it was about twice as fast as the new one).

My gut feeling is that disk access reading the index is the bottleneck here, 
but I know little about the specifics of Solr's internals, so it's entirely 
possible that my gut is wrong.  Outside testing does show that the virtual 
environment's disk performance is not as good as the old physical server, 
especially when multiple processes are trying to access the same file 
simultaneously.

So, two basic questions:


1.) Would you agree that I'm dealing with a disk bottleneck, or are there 
some other factors I should be considering?  Any good diagnostics I should be 
looking at?

2.) If the problem is disk access, is there anything I can tune on the Solr 
side to alleviate the problems?

Thanks,
Demian


RE: Bug in solr.KeywordMarkerFilterFactory?

2011-04-20 Thread Demian Katz
That's good news -- thanks for the help (not to mention the reassurance that 
Solr itself is actually working right)!  Hopefully 3.1.1 won't be too far off, 
though; when the analysis tool lies, life can get very confusing! :-)

- Demian

> -Original Message-
> From: Robert Muir [mailto:rcm...@gmail.com]
> Sent: Wednesday, April 20, 2011 2:54 PM
> To: solr-user@lucene.apache.org; yo...@lucidimagination.com
> Subject: Re: Bug in solr.KeywordMarkerFilterFactory?
> 
> No, this is only a bug in analysis.jsp.
> 
> you can see this by comparing analysis.jsp's "dontstems bees" to using
> the query debug interface:
> 
>   "dontstems bees"
>   "dontstems bees"
>   PhraseQuery(text:"dontstems bee")
>   text:"dontstems bee"
> 
> On Wed, Apr 20, 2011 at 2:43 PM, Yonik Seeley
>  wrote:
> > On Wed, Apr 20, 2011 at 2:01 PM, Demian Katz
>  wrote:
> >> I've just started experimenting with the
> solr.KeywordMarkerFilterFactory in Solr 3.1, and I'm seeing some
> strange behavior.  It seems that every word subsequent to a protected
> word is also treated as being protected.
> >
> > You're right!  This was broken by LUCENE-2901 back in Jan.
> > I've opened this issue:
>  https://issues.apache.org/jira/browse/LUCENE-3039
> >
> > The easiest short-term workaround for you would probably be to create
> > a custom filter that looks like KeywordMarkerFilter before the
> > LUCENE-2901 change.
> >
> > -Yonik
> > http://www.lucenerevolution.org -- Lucene/Solr User Conference, May
> > 25-26, San Francisco
> >


Bug in solr.KeywordMarkerFilterFactory?

2011-04-20 Thread Demian Katz
I've just started experimenting with the solr.KeywordMarkerFilterFactory in 
Solr 3.1, and I'm seeing some strange behavior.  It seems that every word 
subsequent to a protected word is also treated as being protected.

For testing purposes, I have put the word "spelling" in my protwords.txt.  If I 
do a test for "spelling bees" in the analyze tool, the stemmer produces 
"spelling bees" - nothing is stemmed.  But if I do a test for "bees spelling", 
I get "bee spelling", the expected result with "bees" stemmed but "spelling" 
left unstemmed.  I have tried extended examples - in every case I tried, all of 
the words prior to "spelling" get stemmed, but none of the words after 
"spelling" get stemmed.  When turning on the verbose mode of the analyze tool, 
I can see that the settings of the "keyword" attribute introduced by 
solr.KeywordMarkerFilterFactory correspond with the stemming behavior... so 
I think the solr.KeywordMarkerFilterFactory component is to blame, and not 
anything later in the analyze chain.

Any idea what might be going wrong?  Is this a known issue?

Here is my field type definition, in case it makes a difference:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="1" catenateNumbers="1"
            catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="0" catenateNumbers="0"
            catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

thanks,
Demian
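
P.S. For anyone reproducing this, the relevant portion of the chain is just
the marker filter sitting ahead of the stemmer, along these lines (the stemmer
shown is a stand-in for whatever follows in your own chain):

  <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
  <filter class="solr.SnowballPorterFilterFactory" language="English"/>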


RE: Solr 3.1 ICU filters (error loading class)

2011-04-18 Thread Demian Katz
Right, I placed my files relative to solr_home, not in it -- but obviously 
having a solr_home/lucene-libs directory didn't do me any good. :-)

- Demian

> -Original Message-
> From: Jonathan Rochkind [mailto:rochk...@jhu.edu]
> Sent: Monday, April 18, 2011 1:46 PM
> To: solr-user@lucene.apache.org
> Cc: Demian Katz
> Subject: Re: Solr 3.1 ICU filters (error loading class)
> 
> I don't think you want to put them in solr_home, I think you want to
> put
> them in solr_home/lib/.  Or did you mean that's where you put them?
> 
> On 4/18/2011 1:31 PM, Demian Katz wrote:
> > Hello,
> >
> > I'm interested in trying out the new ICU features in Solr 3.1.
> However, when I attempt to set up a field type using
> solr.ICUTokenizerFactory and/or solr.ICUFoldingFilterFactory, Solr
> refuses to start up, issuing "Error loading class" exceptions.
> >
> > I did see the README.txt file that mentions lucene-libs/lucene-*.jar
> and lib/icu4j-*.jar.  I tried putting all of these files under my Solr
> home directory, but it made no difference.
> >
> > Is there some other .jar that I need to add to my library folder?  Am
> I doing something wrong with the known dependencies?  (This is the
> first time I've seen a lucene-libs directory, so I wasn't sure if that
> required some special configuration).  Any general troubleshooting
> advice for figuring out what is going wrong here?
> >
> > thanks,
> > Demian
> >


RE: Solr 3.1 ICU filters (error loading class)

2011-04-18 Thread Demian Katz
Thanks!  apache-solr-analysis-extras-3.1.jar was the missing piece that was 
causing all of my trouble; I didn't see any mention of it in the documentation 
-- might be worth adding!

Thanks,
Demian

> -Original Message-
> From: Robert Muir [mailto:rcm...@gmail.com]
> Sent: Monday, April 18, 2011 1:46 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Solr 3.1 ICU filters (error loading class)
> 
> On Mon, Apr 18, 2011 at 1:31 PM, Demian Katz
>  wrote:
> > Hello,
> >
> > I'm interested in trying out the new ICU features in Solr 3.1.
>  However, when I attempt to set up a field type using
> solr.ICUTokenizerFactory and/or solr.ICUFoldingFilterFactory, Solr
> refuses to start up, issuing "Error loading class" exceptions.
> >
> > I did see the README.txt file that mentions lucene-libs/lucene-*.jar
> and lib/icu4j-*.jar.  I tried putting all of these files under my Solr
> home directory, but it made no difference.
> >
> > Is there some other .jar that I need to add to my library folder?  Am
> I doing something wrong with the known dependencies?  (This is the
> first time I've seen a lucene-libs directory, so I wasn't sure if that
> required some special configuration).  Any general troubleshooting
> advice for figuring out what is going wrong here?
> >
> 
> make a 'lib' directory under your solr home (e.g. example/solr/lib) :
> it should contain:
> * icu4j-4_6.jar
> * lucene-icu-3.1.jar
> * apache-solr-analysis-extras-3.1.jar
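> 
> Once those are in place, a minimal field type exercising both factories
> would be something like this (the type name here is arbitrary):
> 
>   <fieldType name="text_icu" class="solr.TextField" positionIncrementGap="100">
>     <analyzer>
>       <tokenizer class="solr.ICUTokenizerFactory"/>
>       <filter class="solr.ICUFoldingFilterFactory"/>
>     </analyzer>
>   </fieldType>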


Solr 3.1 ICU filters (error loading class)

2011-04-18 Thread Demian Katz
Hello,

I'm interested in trying out the new ICU features in Solr 3.1.  However, when I 
attempt to set up a field type using solr.ICUTokenizerFactory and/or 
solr.ICUFoldingFilterFactory, Solr refuses to start up, issuing "Error loading 
class" exceptions.

I did see the README.txt file that mentions lucene-libs/lucene-*.jar and 
lib/icu4j-*.jar.  I tried putting all of these files under my Solr home 
directory, but it made no difference.

Is there some other .jar that I need to add to my library folder?  Am I doing 
something wrong with the known dependencies?  (This is the first time I've seen 
a lucene-libs directory, so I wasn't sure if that required some special 
configuration).  Any general troubleshooting advice for figuring out what is 
going wrong here?

thanks,
Demian


RE: OAI on SOLR already done?

2011-02-02 Thread Demian Katz
I already replied to the original poster off-list, but it seems that it may be 
worth weighing in here as well...

The next release of VuFind (http://vufind.org) is going to include OAI-PMH 
server support.  As you say, there is really no way to plug OAI-PMH directly 
into Solr...  but a tool like VuFind can provide a fairly generic, extensible, 
Solr-based platform for building an OAI-PMH server.  Obviously this is helpful 
for some use cases and not others...  but I'm happy to provide more information 
if anyone needs it.

- Demian

From: Jonathan Rochkind [rochk...@jhu.edu]
Sent: Wednesday, February 02, 2011 3:38 PM
To: solr-user@lucene.apache.org
Cc: Paul Libbrecht
Subject: Re: OAI on SOLR already done?

The trick is that you can't just have a generic black box OAI-PMH
provider on top of any Solr index. How would it know where to get the
metadata elements it needs, such as title, or last-updated date, etc.
Any given solr index might not even have this in stored fields -- and a
given app might want to look them up from somewhere other than stored
fields.

If the Solr index does have them in stored fields, and you do want to
get them from the stored fields, then it's, I think (famous last words)
relatively straightforward code to write. A mapping from solr stored
fields to metadata elements needed for OAI-PMH, and then simply
outputting the XML template with those filled in.

I am not aware of anyone that has done this in a
re-useable/configurable-for-your-solr tool. You could possibly do it
solely using the built-in Solr
JSP/XSLT/other-templating-stuff-I-am-not-familiar-with stuff, rather
than as an external Solr client app, or it could be an external Solr
client app.

This is actually a very similar problem to something someone else asked
a few days ago "Does anyone have an OpenSearch add-on for Solr?"  Very
very similar problem, just with a different XML template for output
(usually RSS or Atom) instead of OAI-PMH.

On 2/2/2011 3:14 PM, Paul Libbrecht wrote:
> Peter,
>
> I'm afraid your service does harvesting, whereas I am looking for a PMH
> provider service.
>
> Your project appeared early in the goolge matches.
>
> paul
>
>
> Le 2 févr. 2011 à 20:46, Péter Király a écrit :
>
>> Hi,
>>
>> I don't know whether it fits your need, but we are building a tool
>> based on Drupal (eXtensible Catalog Drupal Toolkit), which can harvest
>> with OAI-PMH and index the harvested records into Solr. The records are
>> harvested, processed, and stored in MySQL, then we index them into
>> Solr. We created some ways to manipulate the original values before
>> sending to Solr. We created it in a modular way, so you can change
>> settings in an admin interface or write your own "hooks" (special
>> Drupal functions) to tailor the application to your needs. We support
>> only Dublin Core and our own FRBR-like schema (called XC schema), but
>> you can add more schemas. Since this forum is about Solr, and not
>> applications using Solr, if you are interested in this tool, please
>> write me a private message, or visit http://eXtensibleCatalog.org, or
>> the module's page at http://drupal.org/project/xc.
>>
>> Hope this helps,
>>
>> Péter
>> eXtensible Catalog
>>
>> 2011/2/2 Paul Libbrecht:
>>> Hello list,
>>>
>>> I've come across a few Google matches indicating that SOLR-based servers
>>> implement the Open Archive Initiative's Metadata Harvesting Protocol.
>>>
>>> Is there something made to be re-usable that would be an add-on to solr?
>>>
>>> thanks in advance
>>>
>>> paul
>


RE: filter query from external list of Solr unique IDs

2010-10-15 Thread Demian Katz
The main problem I've encountered with the "lots of OR clauses" approach is 
that you eventually hit the limit on Boolean clauses and the whole query fails. 
 You can keep raising the limit through the Solr configuration, but there's 
still a ceiling eventually.
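
(For the record, the limit I mean is solrconfig.xml's
<maxBooleanClauses>1024</maxBooleanClauses> setting -- 1024 is the default.
You can raise it, but some ceiling always remains.)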

- Demian

> -Original Message-
> From: Jonathan Rochkind [mailto:rochk...@jhu.edu]
> Sent: Friday, October 15, 2010 1:07 PM
> To: solr-user@lucene.apache.org
> Subject: RE: filter query from external list of Solr unique IDs
> 
> Definitely interested in this.
> 
> The naive obvious approach would be just putting all the IDs in the
> query, like fq=(id:1 OR id:2 OR ...).  Or making it another clause in
> the 'q'.
> 
> Can you outline what's wrong with this approach, to make it more clear
> what's needed in a solution?
> 
> From: Burton-West, Tom [tburt...@umich.edu]
> Sent: Friday, October 15, 2010 11:49 AM
> To: solr-user@lucene.apache.org
> Subject: filter query from external list of Solr unique IDs
> 
> At the Lucene Revolution conference I asked about efficiently building
> a filter query from an external list of Solr unique ids.
> 
> Some use cases I can think of are:
> 1)  personal sub-collections (in our case a user can create a small
> subset of our 6.5 million doc collection and then run filter queries
> against it)
> 2)  tagging documents
> 3)  access control lists
> 4)  anything that needs complex relational joins
> 5)  a sort of alternative to incremental field updating (i.e.
> update in an external database or kv store)
> 6)  Grant's clustering cluster points and similar apps.
> 
> Grant pointed to SOLR 1715, but when I looked on JIRA, there doesn't
> seem to be any work on it yet.
> 
> Hoss  mentioned a couple of ideas:
> 1) sub-classing query parser
> 2) Having the app query a database and somehow passing
> something to Solr or lucene for the filter query
> 
> Can Hoss or someone else point me to more detailed information on what
> might be involved in the two ideas listed above?
> 
> Is somehow keeping an up-to-date map of unique Solr ids to internal
> Lucene ids needed to implement this or is that a separate issue?
> 
> 
> Tom Burton-West
> http://www.hathitrust.org/blogs/large-scale-search
> 
> 
> 



RE: solr.WordDelimiterFilterFactory problem with hyphenated terms?

2010-04-12 Thread Demian Katz
I don't think the behavior is correct.  The first example, with just one gap, 
does NOT match.  The second example, with an extra second gap, DOES match.  It 
seems that the term collapsing ("eighteenth-century" --> "eighteenthcentury") 
somehow throws off the position sequence, forcing you to add an extra gap in 
order to get a match.  It's good to know that slop is an option to work around 
this problem... but it still seems to me that something isn't working the way 
it is supposed to in this particular case.
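
(The slop workaround, for reference, means relaxing the phrase query -- e.g.
title:"love customs in eighteenthcentury spain"~2 -- so that the off-by-one
position no longer blocks the match.)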

- Demian

> -Original Message-
> From: Robert Muir [mailto:rcm...@gmail.com]
> Sent: Friday, April 09, 2010 12:05 PM
> To: solr-user@lucene.apache.org
> Subject: Re: solr.WordDelimiterFilterFactory problem with hyphenated
> terms?
> 
> but this behavior is correct, as you have position increments enabled.
> If you want the second query (which has 2 gaps) to match, you need to
> either use slop, or disable these increments altogether.
> 
> On Fri, Apr 9, 2010 at 11:44 AM, Demian Katz
> wrote:
> 
> > I've given it a try, and it definitely seems to have improved the
> > situation.  However, there is still one weird case that's clearly
> related to
> > term positions.  If I do this search, it fails:
> >
> > title:"love customs in eighteenthcentury spain"
> >
> > ...but if I do this search, it succeeds:
> >
> > title:"love customs in in eighteenthcentury spain"
> >
> > (note the duplicate "in").
> >
> > - Demian
> >
> > > -Original Message-
> > > From: Erick Erickson [mailto:erickerick...@gmail.com]
> > > Sent: Thursday, April 08, 2010 11:20 AM
> > > To: solr-user@lucene.apache.org
> > > Subject: Re: solr.WordDelimiterFilterFactory problem with
> hyphenated
> > > terms?
> > >
> > > I'm not all that familiar with the underlying issues, but of the
> two
> > > I'd
> > > pick moving the WordDelimiterFactory rather than setting increments
> =
> > > "false".
> > >
> > > But that's at least partly a guess
> > >
> > > Best
> > > Erick
> > >
> > > On Thu, Apr 8, 2010 at 11:00 AM, Demian Katz
> > > wrote:
> > >
> > > > Thanks for looking into this -- I appreciate the help (and feel a
> > > little
> > > > better that there seems to be a bug at work here and not just my
> > > total
> > > > incomprehension).
> > > >
> > > > Sorry for any confusion over the UnicodeNormalizationFactory --
> > > that's
> > > > actually a plug-in from the SolrMarc project (
> > > > http://code.google.com/p/solrmarc/) that slipped into my example.
> > > Also,
> > > > as you guessed, my default operator is indeed set to "AND."
> > > >
> > > > It sounds to me that, of your two proposed work-arounds, moving
> the
> > > > StopFilterFactory after WordDelimiterFactory is the least
> disruptive.
> > > I'm
> > > > guessing that disabling position increments across the board
> might
> > > have
> > > > implications for other types of phrase searches, while filtering
> > > stopwords
> > > > later in the chain should be more functionally equivalent, if
> > > slightly less
> > > > efficient (potentially more terms to examine).  Would you agree
> with
> > > this
> > > > assessment?  If not, what possible negative side effects am I
> > > forgetting
> > > > about?
> > > >
> > > > thanks,
> > > > Demian
> > > >
> > > > > -Original Message-
> > > > > From: Erick Erickson [mailto:erickerick...@gmail.com]
> > > > > Sent: Wednesday, April 07, 2010 10:04 PM
> > > > > To: solr-user@lucene.apache.org
> > > > > Subject: Re: solr.WordDelimiterFilterFactory problem with
> > > hyphenated
> > > > > terms?
> > > > >
> > > > > Well, for a quick trial using trunk, I had to remove the
> > > > > UnicodeNormalizationFactory, is that yours?
> > > > >
> > > > > But with that removed, I get the results you do, ASSUMING that
> > > you've
> > > > > set
> > > > > your default operator to AND in schema.xml...
> > > > >
> > > > > Believe it or not, it all changes and all your queries return a
> hit
> > > if

RE: solr.WordDelimiterFilterFactory problem with hyphenated terms?

2010-04-09 Thread Demian Katz
I've given it a try, and it definitely seems to have improved the situation.  
However, there is still one weird case that's clearly related to term 
positions.  If I do this search, it fails:

title:"love customs in eighteenthcentury spain"

...but if I do this search, it succeeds:

title:"love customs in in eighteenthcentury spain"

(note the duplicate "in").

- Demian

> -Original Message-
> From: Erick Erickson [mailto:erickerick...@gmail.com]
> Sent: Thursday, April 08, 2010 11:20 AM
> To: solr-user@lucene.apache.org
> Subject: Re: solr.WordDelimiterFilterFactory problem with hyphenated
> terms?
> 
> I'm not all that familiar with the underlying issues, but of the two
> I'd
> pick moving the WordDelimiterFactory rather than setting increments =
> "false".
> 
> But that's at least partly a guess
> 
> Best
> Erick
> 
> On Thu, Apr 8, 2010 at 11:00 AM, Demian Katz
> wrote:
> 
> > Thanks for looking into this -- I appreciate the help (and feel a
> little
> > better that there seems to be a bug at work here and not just my
> total
> > incomprehension).
> >
> > Sorry for any confusion over the UnicodeNormalizationFactory --
> that's
> > actually a plug-in from the SolrMarc project (
> > http://code.google.com/p/solrmarc/) that slipped into my example.
> Also,
> > as you guessed, my default operator is indeed set to "AND."
> >
> > It sounds to me that, of your two proposed work-arounds, moving the
> > StopFilterFactory after WordDelimiterFactory is the least disruptive.
> I'm
> > guessing that disabling position increments across the board might
> have
> > implications for other types of phrase searches, while filtering
> stopwords
> > later in the chain should be more functionally equivalent, if
> slightly less
> > efficient (potentially more terms to examine).  Would you agree with
> this
> > assessment?  If not, what possible negative side effects am I
> forgetting
> > about?
> >
> > thanks,
> > Demian
> >
> > > -Original Message-
> > > From: Erick Erickson [mailto:erickerick...@gmail.com]
> > > Sent: Wednesday, April 07, 2010 10:04 PM
> > > To: solr-user@lucene.apache.org
> > > Subject: Re: solr.WordDelimiterFilterFactory problem with
> hyphenated
> > > terms?
> > >
> > > Well, for a quick trial using trunk, I had to remove the
> > > UnicodeNormalizationFactory, is that yours?
> > >
> > > But with that removed, I get the results you do, ASSUMING that
> you've
> > > set
> > > your default operator to AND in schema.xml...
> > >
> > > Believe it or not, it all changes and all your queries return a hit
> if
> > > you
> > > do one of two things (I did this in both index and query when
> testing
> > > 'cause
> > > I'm lazy):
> > > 1> move the inclusion of the StopFilterFactory after
> > > WordDelimiterFactory
> > > or
> > > 2> for StopFilterFactory, set enablePositionIncrements="false"
> > >
> > > I think either of these might work in your situation...
> > >
> > > On doing some more investigation, it appears that if a hyphenated
> word
> > > is
> > > immediately after a stopword AND the above is true (stop factory
> > > included
> > > before WordDelimiterFactory and enablePositionIncrements="true"),
> then
> > > the
> > > search fails. I indexed this title:
> > >
> > > Love-customs in eighteenth-century Spain for nineteenth-century
> > >
> > > Searching in solr/admin/form.jsp for:
> > > title:(nineteenth-century)
> > >
> > > fails. But if I remove the "for" from the title, the above query
> works.
> > > Searching for
> > > title:(love-customs)
> > > always works.
> > >
> > > Finally, (and it's *really* time to go to sleep now), just setting
> > > enablePositionIncrements="false" in the "index" portion of the
> schema
> > > also
> > > causes things to work.
> > >
> > > Developer folks:
> > > I didn't see anything in a quick look in SOLR or Lucene JIRAs,
> should I
> > > refine this a bit (really, sleepy time is near) and add a JIRA?
> > >
> > > Best
> > > Erick
> > >
> > > On Wed, Apr 7, 2010 at 10:29 AM, Demian Katz
> > > wrote:
> > >
> > > > 

RE: solr.WordDelimiterFilterFactory problem with hyphenated terms?

2010-04-08 Thread Demian Katz
Thanks for looking into this -- I appreciate the help (and feel a little better 
that there seems to be a bug at work here and not just my total 
incomprehension).

Sorry for any confusion over the UnicodeNormalizationFactory -- that's actually 
a plug-in from the SolrMarc project (http://code.google.com/p/solrmarc/) that 
slipped into my example.  Also, as you guessed, my default operator is indeed 
set to "AND."

It sounds to me that, of your two proposed work-arounds, moving the 
StopFilterFactory after WordDelimiterFactory is the least disruptive.  I'm 
guessing that disabling position increments across the board might have 
implications for other types of phrase searches, while filtering stopwords 
later in the chain should be more functionally equivalent, if slightly less 
efficient (potentially more terms to examine).  Would you agree with this 
assessment?  If not, what possible negative side effects am I forgetting about?
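
Concretely, the reordering I have in mind looks roughly like this (abridged to
the two filters in question; the tokenizer and stopword file names stand in
for whatever the real schema uses):

  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="1" catenateNumbers="1"
            catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt"
            ignoreCase="true" enablePositionIncrements="true"/>
  </analyzer>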

thanks,
Demian

> -Original Message-
> From: Erick Erickson [mailto:erickerick...@gmail.com]
> Sent: Wednesday, April 07, 2010 10:04 PM
> To: solr-user@lucene.apache.org
> Subject: Re: solr.WordDelimiterFilterFactory problem with hyphenated
> terms?
> 
> Well, for a quick trial using trunk, I had to remove the
> UnicodeNormalizationFactory, is that yours?
> 
> But with that removed, I get the results you do, ASSUMING that you've
> set
> your default operator to AND in schema.xml...
> 
> Believe it or not, it all changes and all your queries return a hit if
> you
> do one of two things (I did this in both index and query when testing
> 'cause
> I'm lazy):
> 1> move the inclusion of the StopFilterFactory after
> WordDelimiterFactory
> or
> 2> for StopFilterFactory, set enablePositionIncrements="false"
> 
> I think either of these might work in your situation...
> 
> On doing some more investigation, it appears that if a hyphenated word
> is
> immediately after a stopword AND the above is true (stop factory
> included
> before WordDelimiterFactory and enablePositionIncrements="true"), then
> the
> search fails. I indexed this title:
> 
> Love-customs in eighteenth-century Spain for nineteenth-century
> 
> Searching in solr/admin/form.jsp for:
> title:(nineteenth-century)
> 
> fails. But if I remove the "for" from the title, the above query works.
> Searching for
> title:(love-customs)
> always works.
> 
> Finally, (and it's *really* time to go to sleep now), just setting
> enablePositionIncrements="false" in the "index" portion of the schema
> also
> causes things to work.
> 
> Developer folks:
> I didn't see anything in a quick look in SOLR or Lucene JIRAs, should I
> refine this a bit (really, sleepy time is near) and add a JIRA?
> 
> Best
> Erick
> 
> On Wed, Apr 7, 2010 at 10:29 AM, Demian Katz wrote:
>
> > Hello.  It has been a few weeks, and I haven't gotten any responses.
> > Perhaps my question is too complicated -- maybe a better approach is to
> > try to gain enough knowledge to answer it myself.  My gut feeling is
> > still that it's something to do with the way term positions are getting
> > handled by the WordDelimiterFilterFactory, but I don't have a good
> > understanding of how term positions are calculated or factored into
> > searching.  Can anyone recommend some good reading to familiarize myself
> > with these concepts in better detail?
> >
> > thanks,
> > Demian
> >
> > From: Demian Katz
> > Sent: Tuesday, March 16, 2010 9:47 AM
> > To: solr-user@lucene.apache.org
> > Subject: solr.WordDelimiterFilterFactory problem with hyphenated terms?
> >
> > This is my first post on this list -- apologies if this has been
> > discussed before; I didn't come upon anything exactly equivalent in
> > searching the archives via Google.
> >
> > I'm using Solr 1.4 as part of the VuFind application, and I just noticed
> > that searches for hyphenated terms are failing in strange ways.  I
> > strongly suspect it has something to do with the
> > solr.WordDelimiterFilterFactory filter, but I'm not exactly sure what.
> >
> > The problem is that I have a record with the title "Love customs in
> > eighteenth-century Spain."  Depending on how I search for this, I get
> > successes or failures in a seemingly unpredictable pattern.
> >
> > Demonstration queries below were tested using the direct Solr
> > administration tool, just to eliminate any VuFind-related factors from
> > the equation while debugging.

RE: solr.WordDelimiterFilterFactory problem with hyphenated terms?

2010-04-07 Thread Demian Katz
Hello.  It has been a few weeks, and I haven't gotten any responses.  Perhaps 
my question is too complicated -- maybe a better approach is to try to gain 
enough knowledge to answer it myself.  My gut feeling is still that it's 
something to do with the way term positions are getting handled by the 
WordDelimiterFilterFactory, but I don't have a good understanding of how term 
positions are calculated or factored into searching.  Can anyone recommend some 
good reading to familiarize myself with these concepts in better detail?

thanks,
Demian

From: Demian Katz
Sent: Tuesday, March 16, 2010 9:47 AM
To: solr-user@lucene.apache.org
Subject: solr.WordDelimiterFilterFactory problem with hyphenated terms?

This is my first post on this list -- apologies if this has been discussed 
before; I didn't come upon anything exactly equivalent in searching the 
archives via Google.

I'm using Solr 1.4 as part of the VuFind application, and I just noticed that 
searches for hyphenated terms are failing in strange ways.  I strongly suspect 
it has something to do with the solr.WordDelimiterFilterFactory filter, but I'm 
not exactly sure what.

The problem is that I have a record with the title "Love customs in 
eighteenth-century Spain."  Depending on how I search for this, I get successes 
or failures in a seemingly unpredictable pattern.

Demonstration queries below were tested using the direct Solr administration 
tool, just to eliminate any VuFind-related factors from the equation while 
debugging.

Queries that work:
title:(Love customs in eighteenth century Spain)      // no hyphen, no phrases
title:("Love customs in eighteenth-century Spain")    // phrase search on whole title, with hyphen

Queries that fail:
title:(Love customs in eighteenth-century Spain)      // hyphen, no phrases
title:("Love customs in eighteenth century Spain")    // phrase search on whole title, without hyphen
title:(Love customs in "eighteenth-century" Spain)    // hyphenated word as phrase
title:(Love customs in "eighteenth century" Spain)    // hyphenated word as phrase, hyphen removed

Here is VuFind's text field type definition (the XML did not survive the 
archive's HTML stripping; a rough reconstruction follows):
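
(The sketch below is pieced together from the filters named in this thread; 
the tokenizer choice and attribute values are assumptions, not a copy of the 
real file.)

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="1" catenateNumbers="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- SolrMarc plug-in discussed elsewhere in the thread; class name may differ -->
    <filter class="schema.UnicodeNormalizationFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <!-- same chain; note that catenateWords/catenateNumbers are also "1"
         here, which is what the next paragraph questions -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="1" catenateNumbers="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="schema.UnicodeNormalizationFilterFactory"/>
  </analyzer>
</fieldType>
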
I did notice that the "text" field type in VuFind's schema has 
"catenateWords" and "catenateNumbers" turned on in both the index and query 
analyzer chains.  It is my understanding that these options should be 
disabled for the query chain and only enabled for the index chain.  However, 
this may be a red herring -- I have already tried changing this setting, and 
it didn't change the success/failure pattern described above.  I have also 
played with the preserveOriginal setting without apparent effect.
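
For comparison, the conventional arrangement I understand to be recommended 
would differ between the two chains roughly like this (attribute values 
assumed; not a quote from the actual schema):

  index:  <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
                  generateNumberParts="1" catenateWords="1" catenateNumbers="1"/>
  query:  <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
                  generateNumberParts="1" catenateWords="0" catenateNumbers="0"/>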

From playing with the Field Analysis tool, I notice that there is a gap in 
the term position sequence after analysis... but I'm not sure if this is 
significant.
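
To make that gap concrete, here is roughly what I believe each stage of the 
index chain produces for the title, assuming a whitespace tokenizer and 
enablePositionIncrements="true" on the stop filter (positions in parentheses 
are illustrative):

  Input:              Love customs in eighteenth-century Spain
  After tokenizer:    Love(1) customs(2) in(3) eighteenth-century(4) Spain(5)
  After stop filter:  Love(1) customs(2) eighteenth-century(4) Spain(5)   <-- gap at 3

The WordDelimiterFilterFactory then has to split eighteenth-century(4) into 
subwords on the far side of that gap, and my suspicion is that the positions 
it assigns at index time no longer line up with what the query side 
produces -- but, as noted above, I don't understand the position rules well 
enough to confirm that.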

Has anybody else run into this sort of problem?  Any ideas on a fix?

thanks,
Demian



solr.WordDelimiterFilterFactory problem with hyphenated terms?

2010-03-16 Thread Demian Katz
This is my first post on this list -- apologies if this has been discussed 
before; I didn't come upon anything exactly equivalent in searching the 
archives via Google.

I'm using Solr 1.4 as part of the VuFind application, and I just noticed that 
searches for hyphenated terms are failing in strange ways.  I strongly suspect 
it has something to do with the solr.WordDelimiterFilterFactory filter, but I'm 
not exactly sure what.

The problem is that I have a record with the title "Love customs in 
eighteenth-century Spain."  Depending on how I search for this, I get successes 
or failures in a seemingly unpredictable pattern.

Demonstration queries below were tested using the direct Solr administration 
tool, just to eliminate any VuFind-related factors from the equation while 
debugging.

Queries that work:
title:(Love customs in eighteenth century Spain)      // no hyphen, no phrases
title:("Love customs in eighteenth-century Spain")    // phrase search on whole title, with hyphen

Queries that fail:
title:(Love customs in eighteenth-century Spain)      // hyphen, no phrases
title:("Love customs in eighteenth century Spain")    // phrase search on whole title, without hyphen
title:(Love customs in "eighteenth-century" Spain)    // hyphenated word as phrase
title:(Love customs in "eighteenth century" Spain)    // hyphenated word as phrase, hyphen removed

Here is VuFind's text field type definition (the XML did not survive the 
archive's HTML stripping; a rough reconstruction appears under the 
2010-04-07 message earlier in this archive):

I did notice that the "text" field type in VuFind's schema has 
"catenateWords" and "catenateNumbers" turned on in both the index and query 
analyzer chains.  It is my understanding that these options should be 
disabled for the query chain and only enabled for the index chain.  However, 
this may be a red herring -- I have already tried changing this setting, and 
it didn't change the success/failure pattern described above.  I have also 
played with the preserveOriginal setting without apparent effect.

From playing with the Field Analysis tool, I notice that there is a gap in 
the term position sequence after analysis... but I'm not sure if this is 
significant.

Has anybody else run into this sort of problem?  Any ideas on a fix?

thanks,
Demian