RE: [EXTERNAL] Re: Getting rid of Master/Slave nomenclature in Solr
Regarding people having a problem with the word "master" -- GitHub is changing the default branch name away from "master," even in isolation from a "slave" pairing... so the terminology seems to be falling out of favor in all contexts. See: https://www.cnet.com/news/microsofts-github-is-removing-coding-terms-like-master-and-slave/

I'm not here to start a debate about the semantics of that, just to provide evidence that in some communities, the term "master" is causing concern all by itself. If we're going to make the change anyway, it might be best to get it over with and pick the most appropriate terminology we can agree upon, rather than trying to minimize the amount of change. It's going to be backward incompatible anyway, so we might as well do it all now rather than risk having to go through two separate breaking changes at different points in time.

- Demian

-----Original Message-----
From: Noble Paul
Sent: Thursday, June 18, 2020 1:51 AM
To: solr-user@lucene.apache.org
Subject: [EXTERNAL] Re: Getting rid of Master/Slave nomenclature in Solr

Looking at the code I see 692 occurrences of the word "slave". Mostly variable names and ref guide docs. The word "slave" is present in the responses as well. Any change in the request param/response payload is backward incompatible. I have no objection to changing the names in the ref guide and other internal variables. Going ahead with backward incompatible changes is painful. If somebody has the appetite to take it up, it's OK.

If we must change, master/follower can be a good enough option.

master (noun): a man in charge of an organization or group.
master (adj.): having or showing very great skill or proficiency.
master (verb): acquire complete knowledge or skill in (a subject, technique, or art).
master (verb): gain control of; overcome.

I hope nobody has a problem with the term "master"

On Thu, Jun 18, 2020 at 3:19 PM Ilan Ginzburg wrote: > > Would master/follower work? 
> > Half the rename work while still getting rid of the slavery connotation... > > > On Thu 18 Jun 2020 at 07:13, Walter Underwood wrote: > > > > On Jun 17, 2020, at 4:00 PM, Shawn Heisey wrote: > > > > > > It has been interesting watching this discussion play out on > > > multiple > > open source mailing lists. On other projects, I have seen a VERY > > high level of resistance to these changes, which I find disturbing > > and surprising. > > > > Yes, it is nice to see everyone just pitch in and do it on this list. > > > > wunder > > Walter Underwood > > wun...@wunderwood.org > > http://observer.wunderwood.org/ (my blog) > > > > -- - Noble Paul
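As an aside, the occurrence count Noble quotes could be reproduced with something along these lines; the `--include` filters (Java sources and ref-guide `.adoc` files) are guesses at where the term lives, not a description of how he counted:

```shell
# Count case-insensitive occurrences of a term under a directory.
# Run from a Lucene/Solr checkout; file filters are illustrative.
count_term() {
  term="$1"; dir="${2:-.}"
  grep -ri --include='*.java' --include='*.adoc' -o "$term" "$dir" | wc -l
}

count_term slave
```

Note that `grep -o` emits one line per match, so `wc -l` counts occurrences rather than matching lines.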
RE: Help with a DIH config file
Jörn (and anyone else with more experience with this than I have),

I've been working with Whitney on this issue. It is a PDF file, and it can be opened successfully in a PDF reader. Interestingly, if I try to extract data from it on the command line, Tika version 1.3 throws a lot of warnings but does successfully extract data, but several newer versions, including 1.17 and 1.20 (I haven't tested other intermediate versions), encounter a fatal error and extract nothing. So this seems like something that used to work but has stopped. Unfortunately, we haven't been able to find a way to downgrade to an old enough Tika in her Solr installation to work around the problem that way.

The bigger question, though, is whether there's a way to allow the DIH to simply ignore errors and keep going. Whitney needs to index several terabytes of arbitrary documents for her project, and at this scale, she can't afford the time to stop and manually intervene for every strange document that happens to be in the collection. It would be greatly preferable for the indexing process to ignore exceptions and carry on than for it to stop dead at the first problem. (I'm also pretty sure that Whitney is already using the ignoreTikaException attribute in her configuration, but it doesn't seem to help in this instance.)

Any suggestions would be greatly appreciated!

thanks,
Demian

-----Original Message-----
From: Jörn Franke
Sent: Friday, March 15, 2019 4:18 AM
To: solr-user@lucene.apache.org
Subject: Re: Help with a DIH config file

Do you have an exception? It could be that the pdf is broken - can you open it on your computer with a pdf reader? If the exception is related to Tika and pdf then file an issue with the pdfbox project. If there is an issue with Tika and MS Office documents then Apache POI is the right project to ask.

> Am 15.03.2019 um 03:41 schrieb wclarke :
>
> Thank you so much. You helped a great deal. 
I am running into one
> last issue where the Tika DIH is stopping at a specific language and
> fails there (Malayalam). Do you know of a work around?
>
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
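On the "ignore errors and keep going" question: DIH entities accept an onError attribute ("abort", "skip", or "continue"), which is separate from ignoreTikaException. A minimal data-config sketch, with paths and field names purely illustrative:

```xml
<dataConfig>
  <dataSource type="BinFileDataSource"/>
  <document>
    <entity name="files" processor="FileListEntityProcessor"
            baseDir="/data/docs" fileName=".*" recursive="true"
            rootEntity="false" dataSource="null">
      <!-- onError="skip" drops the offending document and moves on;
           "continue" logs the error and keeps whatever was extracted. -->
      <entity name="tika" processor="TikaEntityProcessor"
              url="${files.fileAbsolutePath}" format="text"
              onError="skip">
        <field column="text" name="content"/>
      </entity>
    </entity>
  </document>
</dataConfig>
```

Whether onError helps here depends on where the fatal Tika error surfaces, so treat this as something to try rather than a known fix.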
Solr Cell, Tika and UpdateProcessorChains
I'm posting this question on behalf of Whitney Clarke, who is a pending member of this list but is not able to post on her own yet. I've been working with her on some troubleshooting, but I'm not familiar with the components she's using and thought somebody here might be able to point her in the right direction more quickly than I can. Here is her original inquiry:

I am pulling data from a local drive for indexing. I am using Solr Cell and Tika in schemaless mode. I am attempting to rewrite certain field information prior to indexing using html-strip and regex UpdateProcessorChains. However, when run, the UpdateProcessorChains never appear to get invoked. For example, I am looking to rewrite "url":"e:\\documents\\apiscript.txt" to be http://apiscript.txt . My current solrconfig is trying to rewrite id and put the rewritten link into url, but this is just the most recent of many different approaches I have tried. My other issue is with the content field. I am trying to strip that field down to just the actual text of the document. I am getting all the metadata in it as well. Any suggestions? Thanks, Whitney

Whitney's latest solrconfig.xml is pasted in full below - as she notes, we've been through many iterations without any success. The key question is how to manipulate the data retrieved from Tika prior to indexing it. Is there a documented best practice for this type of situation, or any tips on how to troubleshoot when nothing appears to be happening?

Thanks,
Demian

[The pasted solrconfig.xml was stripped of its markup by the list archive; only element values survive. The legible fragments reference Solr 7.3.1, update chains named add-unknown-fields-to-the-schema, html-strip-features, and regex-replace, spellcheck and highlighting settings, schemaless date/field-type mappings, and an id-rewrite pattern ^[a-z]:\w+ with replacement http:// .]
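One likely culprit (an assumption, since the full config is unreadable here): a named updateRequestProcessorChain is only invoked if a request handler references it, e.g. via an update.chain default; otherwise only the chain marked default="true" (in schemaless configs, typically add-unknown-fields-to-the-schema) runs. A sketch of wiring a custom chain into /update/extract, with chain, field names, and the regex all illustrative:

```xml
<updateRequestProcessorChain name="rewrite-fields">
  <processor class="solr.HTMLStripFieldUpdateProcessorFactory">
    <str name="fieldName">content</str>
  </processor>
  <processor class="solr.RegexReplaceProcessorFactory">
    <str name="fieldName">url</str>
    <!-- e:\documents\apiscript.txt -> http://apiscript.txt -->
    <str name="pattern">^[a-z]:\\.*\\([^\\]+)$</str>
    <str name="replacement">http://$1</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

<requestHandler name="/update/extract"
                class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="update.chain">rewrite-fields</str>
  </lst>
</requestHandler>
```

Note this sketch drops the schemaless processors from the custom chain; in a real schemaless setup those would need to be merged in or the chains combined.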
RE: Installing Solr with Ivy
Dan,

In case you, or anyone else, is interested, let me share my current solution-in-progress: https://github.com/vufind-org/vufind/pull/769

I've written a Phing task for my project (Phing is the PHP equivalent of Ant) which takes some loose inspiration from your Ant download task. The task uses a local directory to cache Solr distributions and only hits Apache servers if the cache lacks the requested version. This cache can be retained on my continuous integration and development servers, so I think this should get me the effect I desire without putting an unreasonable amount of load on the archive servers. I'd still love in theory to find a solution that's a little more future-proof than "build a URL and download from it," but for now, I think this will get me through. Thanks again!

- Demian

-----Original Message-----
From: Davis, Daniel (NIH/NLM) [C] [mailto:daniel.da...@nih.gov]
Sent: Tuesday, August 02, 2016 11:33 AM
To: solr-user@lucene.apache.org
Subject: RE: Installing Solr with Ivy

Demian,

I've long meant to upload my own "automated installation" - it is ant without ivy, but with checksums. I suppose gpg signatures could also be worked in. It is only semi-automated, because our DevOps group does not have root, but here is a clean version - https://github.com/danizen/solr-ant-install

System administrators prepare the environment:
- creating a directory for solr (/opt/solr) and logs (/var/logs/solr), maybe a different volume for solr data
- create an administrative user with a shell (owns the code)
- create an operational user who runs solr (no shell, cannot modify the code)
- install the initscripts
- setup sudoers rules

The installation this supports is very, very small, and I do not intend to support the cleaned version of this going forward. I will update the README.md to make that clear. I agree with your summary of the difference. One more aspect of maturity/fullness of solution - MySQL/PostgreSQL etc. 
support multiple projects on the same server, at least administratively. Solr is getting there, but until role-based access control (RBAC) is strong enough out-of-the-box, it is hard to set up a *shared* Solr server. Yet it is very common to do that with database servers, and in fact doing this is a common way to avoid siloed applications. Unfortunately, HTTP auth is not quite good enough for me; but it is only my own fault I haven't contributed something more.

Dan Davis, Systems/Applications Architect (Contractor), Office of Computer and Communications Systems, National Library of Medicine, NIH

-----Original Message-----
From: Demian Katz [mailto:demian.k...@villanova.edu]
Sent: Tuesday, August 02, 2016 8:37 AM
To: solr-user@lucene.apache.org
Subject: RE: Installing Solr with Ivy

Thanks, Shawn, for confirming my suspicions. Regarding your question about how Solr differs from a database server, I agree with you in theory, but the problem is in the practice: there are very easy, familiar, well-established techniques for installing and maintaining database platforms, and these platforms are mature enough that they evolve slowly and most versions are closely functionally equivalent to one another. Solr is comparatively young (not immature, but young). Solr still (as far as I can tell) lacks standard package support in the default repos of the major Linux distros, and frequently breaks backward compatibility between versions in large and small ways (particularly in the internal API, but sometimes also in the configuration files). Those are not intended as criticisms of Solr -- they're to a large extent positive signs of activity and growth -- but they are, as far as I can tell, the current realities of working with the software. For a developer with the right experience and knowledge, it's no big deal to navigate these challenges. 
However, my package is designed to be friendly to a less experienced, more generalized non-technical audience, and bundling Solr in the package instead of trying to guide the user through a potentially confusing manual installation process greatly simplifies the task of getting things up and running, saving me from having to field support emails from people who can't figure out how to install Solr on their platform, or those who end up with a version that's incompatible with my project's configurations and custom handlers. At this point, my main goal is to revise the bundling process so that instead of storing Solr in Git, I can install it on-demand with a simple automated process during continuous integration builds and packaging for release. In the longer term, if the environmental factors change, I'd certainly prefer to stop bundling it entirely... but I don't think that is practical for my audience at this stage. In any case, sorry for the long-winded reply, but hopefully that helps clarify my situation. - Demian -Original Message- [...snip...] In a t
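The cache-first download logic Demian describes at the top of this message might be sketched in shell as follows; the archive URL pattern reflects the current Apache archive layout, which (as discussed in the thread) could change at any time:

```shell
# Download a Solr distribution only if it is not already cached locally.
# URL layout is the Apache archive's current convention, not guaranteed stable.
fetch_solr() {
  version="$1"; cache="${2:-$HOME/.solr-cache}"
  mkdir -p "$cache"
  file="$cache/solr-$version.tgz"
  if [ ! -f "$file" ]; then
    curl -fL -o "$file" \
      "https://archive.apache.org/dist/lucene/solr/$version/solr-$version.tgz"
  fi
  # Print the cached path for the caller to unpack.
  printf '%s\n' "$file"
}
```

On a CI server the cache directory survives between builds, so only the first build for a given version touches the Apache servers.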
RE: Installing Solr with Ivy
Dan,

Thanks for taking the time to share this! I'll give it a test run in the near future and will happily share improvements if I come up with any (though I'll most likely be focusing on the download steps rather than the subsequent configuration).

- Demian

-----Original Message-----
From: Davis, Daniel (NIH/NLM) [C] [mailto:daniel.da...@nih.gov]
Sent: Tuesday, August 02, 2016 11:33 AM
To: solr-user@lucene.apache.org
Subject: RE: Installing Solr with Ivy

Demian,

I've long meant to upload my own "automated installation" - it is ant without ivy, but with checksums. I suppose gpg signatures could also be worked in. It is only semi-automated, because our DevOps group does not have root, but here is a clean version - https://github.com/danizen/solr-ant-install

System administrators prepare the environment:
- creating a directory for solr (/opt/solr) and logs (/var/logs/solr), maybe a different volume for solr data
- create an administrative user with a shell (owns the code)
- create an operational user who runs solr (no shell, cannot modify the code)
- install the initscripts
- setup sudoers rules

The installation this supports is very, very small, and I do not intend to support the cleaned version of this going forward. I will update the README.md to make that clear. I agree with your summary of the difference. One more aspect of maturity/fullness of solution - MySQL/PostgreSQL etc. support multiple projects on the same server, at least administratively. Solr is getting there, but until role-based access control (RBAC) is strong enough out-of-the-box, it is hard to set up a *shared* Solr server. Yet it is very common to do that with database servers, and in fact doing this is a common way to avoid siloed applications. Unfortunately, HTTP auth is not quite good enough for me; but it is only my own fault I haven't contributed something more. 
Dan Davis, Systems/Applications Architect (Contractor), Office of Computer and Communications Systems, National Library of Medicine, NIH -Original Message----- From: Demian Katz [mailto:demian.k...@villanova.edu] Sent: Tuesday, August 02, 2016 8:37 AM To: solr-user@lucene.apache.org Subject: RE: Installing Solr with Ivy Thanks, Shawn, for confirming my suspicions. Regarding your question about how Solr differs from a database server, I agree with you in theory, but the problem is in the practice: there are very easy, familiar, well-established techniques for installing and maintaining database platforms, and these platforms are mature enough that they evolve slowly and most versions are closely functionally equivalent to one another. Solr is comparatively young (not immature, but young). Solr still (as far as I can tell) lacks standard package support in the default repos of the major Linux distros, and frequently breaks backward compatibility between versions in large and small ways (particularly in the internal API, but sometimes also in the configuration files). Those are not intended as criticisms of Solr -- they're to a large extent positive signs of activity and growth -- but they are, as far as I can tell, the current realities of working with the software. For a developer with the right experience and knowledge, it's no big deal to navigate these challenges. However, my package is designed to be friendly to a less experienced, more generalized non-technical audience, and bundling Solr in the package instead of trying to guide the user through a potentially confusing manual installation process greatly simplifies the task of getting things up and running, saving me from having to field support emails from people who can't figure out how to install Solr on their platform, or those who end up with a version that's incompatible with my project's configurations and custom handlers. 
At this point, my main goal is to revise the bundling process so that instead of storing Solr in Git, I can install it on-demand with a simple automated process during continuous integration builds and packaging for release. In the longer term, if the environmental factors change, I'd certainly prefer to stop bundling it entirely... but I don't think that is practical for my audience at this stage. In any case, sorry for the long-winded reply, but hopefully that helps clarify my situation. - Demian -Original Message- [...snip...] In a theoretical situation where your program talked an SQL database, would you include a database server in your project? How much time would you invest in automating the download and install of MySQL, Postgres, or some other database? I think what you would do in that situation is include client code to talk to the database and expect the user to provide the server and prepare it for your program. In this respect, how is a Solr server any different than a database server? Thanks, Shawn
RE: Installing Solr with Ivy
Thanks, Shawn, for confirming my suspicions. Regarding your question about how Solr differs from a database server, I agree with you in theory, but the problem is in the practice: there are very easy, familiar, well-established techniques for installing and maintaining database platforms, and these platforms are mature enough that they evolve slowly and most versions are closely functionally equivalent to one another. Solr is comparatively young (not immature, but young). Solr still (as far as I can tell) lacks standard package support in the default repos of the major Linux distros, and frequently breaks backward compatibility between versions in large and small ways (particularly in the internal API, but sometimes also in the configuration files). Those are not intended as criticisms of Solr -- they're to a large extent positive signs of activity and growth -- but they are, as far as I can tell, the current realities of working with the software. For a developer with the right experience and knowledge, it's no big deal to navigate these challenges. However, my package is designed to be friendly to a less experienced, more generalized non-technical audience, and bundling Solr in the package instead of trying to guide the user through a potentially confusing manual installation process greatly simplifies the task of getting things up and running, saving me from having to field support emails from people who can't figure out how to install Solr on their platform, or those who end up with a version that's incompatible with my project's configurations and custom handlers. At this point, my main goal is to revise the bundling process so that instead of storing Solr in Git, I can install it on-demand with a simple automated process during continuous integration builds and packaging for release. In the longer term, if the environmental factors change, I'd certainly prefer to stop bundling it entirely... but I don't think that is practical for my audience at this stage. 
In any case, sorry for the long-winded reply, but hopefully that helps clarify my situation. - Demian -Original Message- [...snip...] In a theoretical situation where your program talked an SQL database, would you include a database server in your project? How much time would you invest in automating the download and install of MySQL, Postgres, or some other database? I think what you would do in that situation is include client code to talk to the database and expect the user to provide the server and prepare it for your program. In this respect, how is a Solr server any different than a database server? Thanks, Shawn
Installing Solr with Ivy
As a follow-up to last week's thread about loading Solr via dependency manager, I started experimenting with using Ivy to install Solr. Here's what I have (note that I'm trying to install Solr 5.5.0 as an arbitrary example, but that detail should not be important):

ivy.xml: [contents stripped by the list archive]

build.xml: [contents stripped by the list archive]

My hope, based on a quick read of some Ivy tutorials, was that simply running "ant" with the above configs would give me a copy of Solr in my lib directory. When I use example libraries from the tutorials in my ivy.xml, I do indeed get files installed... but when I try to substitute the Solr package, no files are installed ("0 artifacts copied"). I'm not very experienced with any of these tools or repositories, so I'm not sure where I'm going wrong.

- Do I need to add some extra configuration somewhere to tell Ivy to download the constituent parts of the solr-parent package?
- Is the solr-parent package the wrong thing to be using? (I tried replacing solr-parent with solr-core and ended up with many .jar files in my lib directory, which was better than nothing, but the .jar files were not organized into a directory structure and were not accompanied by any of the non-.jar files like shell scripts that make Solr tick.)
- Am I just completely on the wrong track? (I do realize that there may not be a way to pull a fully-functional Solr out of the core Maven repository... but it seemed worth a try!)

Any suggestions would be greatly appreciated!

thanks,
Demian
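For reference, an ivy.xml of the kind described above might look like the sketch below (organisation/module names are placeholders). The "0 artifacts copied" result is consistent with solr-parent being a POM-packaged aggregator with no jar artifact of its own, which would also explain why swapping in solr-core produced jars but not the full binary distribution:

```xml
<!-- Minimal sketch; the version mirrors the 5.5.0 example above. -->
<ivy-module version="2.0">
  <info organisation="org.example" module="solr-install"/>
  <dependencies>
    <!-- solr-core resolves jar dependencies, but not the distribution
         layout (bin/ scripts, server/ webapp, etc.). -->
    <dependency org="org.apache.solr" name="solr-core" rev="5.5.0"/>
  </dependencies>
</ivy-module>
```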
RE: Installing Solr as a dependency
Thanks -- another interesting possibility, though I suppose the disadvantage to this strategy would be the dependency on Docker, which could be problematic for some users (especially those running Windows, where I understand that this could only be achieved with virtualization, which would almost certainly impact performance). Still, another option to put on the table! - Demian -Original Message- From: Alexandre Rafalovitch [mailto:arafa...@gmail.com] Sent: Friday, July 29, 2016 8:02 PM To: solr-user Subject: Re: Installing Solr as a dependency What about (not tried) pulling down an official Docker build and adding your stuff to that? https://hub.docker.com/_/solr/ Regards, Alex. Newsletter and resources for Solr beginners and intermediates: http://www.solr-start.com/ On 30 July 2016 at 03:03, Demian Katz <demian.k...@villanova.edu> wrote: >> I wouldn't include Solr in my own project at all. I would probably >> request that the user download the binary artifact and put it in a >> predictable location, and configure my installation script to do the >> download if the file is not there. I would strongly recommend taking >> advantage of Apache's mirror system for that download -- although if >> you need a specific version of Solr, you will find that the mirror >> system only has the latest version, and you must go to the Apache >> Archives for older versions. >> >> To reduce load on the Apache Archive, you could place a copy of the >> binary on your own download servers ... and you could probably >> greatly reduce the size of that download by stripping out components >> that your software doesn't need. If users want to enable additional >> functionality, they would be free to download the full Solr binary >> from Apache. > > Yes, this is the reason I was hoping to use some sort of dependency > management tool. 
The idea of downloading from Apache's system has definitely > crossed my mind, but it's inherently more fragile than using a dependency > manager (since Apache is at least theoretically free to change their URL > structure, etc., at any time) and, as you say, it seemed impolite to direct > potentially heavy amounts of traffic to Apache servers (especially when you > consider that every commit to my project triggers one or more continuous > integration builds, each of which would need to perform the download). > Creating a project-specific mirror also crossed my mind, but that has its own > set of problems: it's work to maintain it, and the server hosting it needs to > be able to withstand the high traffic that would otherwise be directed at > Apache. The idea of a theoretical dependency management tool still feels more > attractive because it adds a standard, unchanging mechanism for obtaining > specific versions of the software and it offers the possibility of local > package caching across builds to significantly reduce the amount of HTTP > traffic back and forth. Of course, it's a lot less attractive if it proves to > be only theory and not in fact practically achievable -- I'll play around > with Maven next week and see where that gets me. > > Anyway, I don't say any of that to dismiss your suggestions -- you > present potentially viable possibilities, and I'll certainly keep > those ideas on the table as I plan for the future -- but I thought it > might be worthwhile to share my thinking. :-) > >> I once discovered that if optional components are removed (including >> some jars in the webapp), the Solr download drops from 150+ MB to >> about >> 25 MB. >> >> https://issues.apache.org/jira/browse/SOLR-6806 > > This could actually be a separate argument for a dependency-management-based > Solr structure, in that you could create a core solr package with minimum > content that could recommend a whole array of optional dependencies. 
A script > could then be used to build different versions of the download package from > these -- one with just the core, one with all the optional stuff included. > Those who wanted some intermediate number of files could be encouraged to > manually create their desired build from packages. > > But again, I freely admit that everything I'm saying is based on > experience with package managers outside the realm of Java -- I need > to learn more about Maven (and perhaps Ivy) before I can make any > particularly intelligent statements about what is really possible in > this context. :-) > > - Demian
RE: Installing Solr as a dependency
I did think about Maven, but (probably because I'm a Maven newbie) I didn't find an obvious way to do it and figured that Maven was meant more for libraries than for complete applications. In any case, your answer gives me more to work with, so I'll do some experimentation. Thanks!

- Demian

From: Daniel Collins [danwcoll...@gmail.com]
Sent: Friday, July 29, 2016 11:18 AM
To: solr-user@lucene.apache.org
Subject: Re: Installing Solr as a dependency

Can't you use Maven? I thought that was the standard dependency management tool, and Solr is published to Maven repos. There used to be a solr artifact which was the WAR file, but presumably now, you'd have to pull down org.apache.solr solr-parent and maybe then start that up. We have an internal application which is dependent on solr-core (it's a web-app, we embed bits of Solr basically), and Maven works fine for us. We do patch and build Solr internally to our own corporate Maven repos, so that helps :) But I've done it outside the corporate environment and found recent Solr releases on standard Maven repo sites.

On 29 July 2016 at 15:12, Shawn Heisey <apa...@elyograg.org> wrote: > On 7/28/2016 1:29 PM, Demian Katz wrote: > > I develop an open source project > > (https://github.com/vufind-org/vufind) that depends on Solr, and I'm > > trying to figure out if there is a better way to manage the Solr > > dependency. Presently, I simply bundle Solr with my software by > > committing the latest distribution to my Git repo. Over time, having > > all of these large binaries is causing repository bloat and slow Git > > performance. I'm beginning to wonder whether there's a better way. > > With the rise in the popularity of dependency managers like NPM and > > Composer, it seems like it might be nice to somehow be able to declare > > Solr as a dependency and have it installed automatically on the client > > side rather than bundling the whole gigantic application by hand... 
> > however, as far as I can tell, there's no way to do this presently (at > > least, not unless you count specialized niche projects like > > https://github.com/projecthydra/hydra-jetty, which are not exactly > > what I'm looking for). Just curious if others are dealing with this > > problem in other ways, or if there are any tool-based approaches that > > I haven't discovered on my own. > > I wouldn't include Solr in my own project at all. I would probably > request that the user download the binary artifact and put it in a > predictable location, and configure my installation script to do the > download if the file is not there. I would strongly recommend taking > advantage of Apache's mirror system for that download -- although if you > need a specific version of Solr, you will find that the mirror system > only has the latest version, and you must go to the Apache Archives for > older versions. > > To reduce load on the Apache Archive, you could place a copy of the > binary on your own download servers ... and you could probably greatly > reduce the size of that download by stripping out components that your > software doesn't need. If users want to enable additional > functionality, they would be free to download the full Solr binary from > Apache. > > I once discovered that if optional components are removed (including > some jars in the webapp), the Solr download drops from 150+ MB to about > 25 MB. > > https://issues.apache.org/jira/browse/SOLR-6806 > > Thanks, > Shawn > >
RE: Installing Solr as a dependency
> I wouldn't include Solr in my own project at all. I would probably > request that the user download the binary artifact and put it in a > predictable location, and configure my installation script to do the > download if the file is not there. I would strongly recommend taking > advantage of Apache's mirror system for that download -- although if you > need a specific version of Solr, you will find that the mirror system > only has the latest version, and you must go to the Apache Archives for > older versions. > > To reduce load on the Apache Archive, you could place a copy of the > binary on your own download servers ... and you could probably greatly > reduce the size of that download by stripping out components that your > software doesn't need. If users want to enable additional > functionality, they would be free to download the full Solr binary from > Apache. Yes, this is the reason I was hoping to use some sort of dependency management tool. The idea of downloading from Apache's system has definitely crossed my mind, but it's inherently more fragile than using a dependency manager (since Apache is at least theoretically free to change their URL structure, etc., at any time) and, as you say, it seemed impolite to direct potentially heavy amounts of traffic to Apache servers (especially when you consider that every commit to my project triggers one or more continuous integration builds, each of which would need to perform the download). Creating a project-specific mirror also crossed my mind, but that has its own set of problems: it's work to maintain it, and the server hosting it needs to be able to withstand the high traffic that would otherwise be directed at Apache. 
The idea of a theoretical dependency management tool still feels more attractive because it adds a standard, unchanging mechanism for obtaining specific versions of the software and it offers the possibility of local package caching across builds to significantly reduce the amount of HTTP traffic back and forth. Of course, it's a lot less attractive if it proves to be only theory and not in fact practically achievable -- I'll play around with Maven next week and see where that gets me. Anyway, I don't say any of that to dismiss your suggestions -- you present potentially viable possibilities, and I'll certainly keep those ideas on the table as I plan for the future -- but I thought it might be worthwhile to share my thinking. :-) > I once discovered that if optional components are removed (including > some jars in the webapp), the Solr download drops from 150+ MB to about > 25 MB. > > https://issues.apache.org/jira/browse/SOLR-6806 This could actually be a separate argument for a dependency-management-based Solr structure, in that you could create a core solr package with minimum content that could recommend a whole array of optional dependencies. A script could then be used to build different versions of the download package from these -- one with just the core, one with all the optional stuff included. Those who wanted some intermediate number of files could be encouraged to manually create their desired build from packages. But again, I freely admit that everything I'm saying is based on experience with package managers outside the realm of Java -- I need to learn more about Maven (and perhaps Ivy) before I can make any particularly intelligent statements about what is really possible in this context. :-) - Demian
Installing Solr as a dependency
Hello, I develop an open source project (https://github.com/vufind-org/vufind) that depends on Solr, and I'm trying to figure out if there is a better way to manage the Solr dependency. Presently, I simply bundle Solr with my software by committing the latest distribution to my Git repo. Over time, having all of these large binaries is causing repository bloat and slow Git performance. I'm beginning to wonder whether there's a better way. With the rise in the popularity of dependency managers like NPM and Composer, it seems like it might be nice to somehow be able to declare Solr as a dependency and have it installed automatically on the client side rather than bundling the whole gigantic application by hand... however, as far as I can tell, there's no way to do this presently (at least, not unless you count specialized niche projects like https://github.com/projecthydra/hydra-jetty, which are not exactly what I'm looking for). Just curious if others are dealing with this problem in other ways, or if there are any tool-based approaches that I haven't discovered on my own. thanks, Demian
qf boosts with MoreLikeThis query parser
Hello, I am currently using field-specific boosts in the qf setting of the MoreLikeThis request handler: https://github.com/vufind-org/vufind/blob/master/solr/vufind/biblio/conf/solrconfig.xml#L410 I would like to accomplish the same effect using the MoreLikeThis query parser, so that I can take advantage of such benefits as sharding support. I am currently using Solr 5.5.0, and in spite of trying many syntactical variations, I can't seem to get it to work. Some discussion on this JIRA ticket seems to suggest there may have been some problems caused by parsing limitations: https://issues.apache.org/jira/browse/SOLR-7143 However, I think my work on this ticket should have eliminated those limitations: https://issues.apache.org/jira/browse/SOLR-2798 Anyway, this brings up a few questions:
1.) Is field-specific boosting in qf supported by the MLT query parser, and if so, what syntax should I use?
2.) If this functionality is supported, but not in v5.5.0, approximately when was it fixed?
3.) If the functionality is still not working, would it be worth my time to try to fix it, or is it being excluded for a specific reason?
Any and all insight is appreciated. Apologies if the answers are already out there somewhere, but I wasn't able to find them! thanks, Demian
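For context, the basic (boost-free) form of the MLT query parser looks roughly like the sketch below; the field names and document ID are hypothetical, and whether qf accepts title^100-style per-field boosts here is exactly the open question:

```
q={!mlt qf=title,topic mintf=1 mindf=1}myDocumentId
```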
Pull request protocol question
Hello, A few weeks ago, I submitted a pull request to Solr in association with a JIRA ticket, and it was eventually merged. More recently, I had an almost-trivial change I wanted to share, but on GitHub, my Solr fork appeared to have changed upstreams. Was the whole Solr repo moved and regenerated or something? In any case, I ended up submitting my proposal using a new fork of apache/lucene-solr. It's visible here: https://github.com/apache/lucene-solr/pull/13 However, due to the weirdness of the switching upstreams, I thought I'd better check in here and make sure I put this in the right place! thanks, Demian
SOLR-2798 (local params parsing issue) -- how can I help?
Hello, I'd really love to see a resolution to SOLR-2798, since my application has a bug that cannot be addressed until this issue is fixed. It occurred to me that there's a good chance that the code involved in this issue is relatively isolated and testable, so I might be able to help with a solution even though I have no prior experience with the Solr code base. I'm just wondering if anyone can confirm this and, if so, point me in the right general direction so that I can make an attempt at a patch. I asked about this a while ago in a comment on the JIRA ticket, but I have a feeling that nobody actually saw that - so I'm trying again here on the mailing list. Any and all help greatly appreciated - and hopefully if you help me a little, I can contribute a useful fix back to the project in return. thanks, Demian
Costs/benefits of DocValues
Hello, I have a legacy Solr schema that I would like to update to take advantage of DocValues. I understand that by adding "docValues=true" to some of my fields, I can improve sorting/faceting performance. However, I have a couple of questions:
1.) Will Solr always take proper advantage of docValues when it is turned on, or will I gain greater performance by turning off stored/indexed in situations where only docValues are necessary (e.g. a sort-only field)?
2.) Will adding docValues to a field introduce significant performance penalties for non-docValues uses of that field, beyond the obvious fact that the additional data will consume more disk and memory?
I'm asking this question because the existing schema has some multi-purpose fields, and I'm trying to determine whether I should just add "docValues=true" wherever it might help, or if I need to take a more thoughtful approach and potentially split some fields with copyFields, etc. This is particularly significant because my schema makes use of some dynamic field suffixes, and I'm not sure if I need to add new suffixes to differentiate docValues/non-docValues fields, or if it's okay to turn on docValues across the board "just in case." Apologies if these questions have already been answered - I couldn't find a totally clear answer in the places I searched. Thanks! - Demian
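To illustrate the "more thoughtful approach" I have in mind for the sort-only case, a copyField into a docValues-only companion field might look like this sketch (field names are hypothetical, and whether indexed/stored can safely be disabled alongside docValues may depend on the Solr version in use):

```xml
<!-- Hypothetical sketch: the search field stays as-is; a companion
     field carries docValues for sorting only. -->
<field name="title" type="text" indexed="true" stored="true"/>
<field name="title_sort" type="string" indexed="false" stored="false" docValues="true"/>
<copyField source="title" dest="title_sort"/>
```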
ExternalFileField documentation problems?
I've just been doing some experimentation with the ExternalFileField. I ran into obstacles due to some apparently incorrect documentation in the wiki: https://cwiki.apache.org/confluence/display/solr/Working+with+External+Files+and+Processes It seems that for some reason the fieldType and field definitions are mashed together there. It felt wrong, but I tried it since it was in the official docs... and, of course, it didn't work. Fortunately, this blog post helped me out, and I was able to get everything working: http://1opensourcelover.wordpress.com/2013/07/02/solr-external-file-fields/ Anyway, I'm not writing this to complain - I'd just like to help fix the wiki. However, since I'm no expert on this functionality and have no Confluence experience, I thought I'd post here before taking any action.
1.) Am I able to edit the wiki? I signed up, but I don't see any edit options - just a "leave a comment" option. I assume this means I have no rights, but it might just mean I'm looking in the wrong places.
2.) Is there anyone more intimately familiar with ExternalFileField who would be willing to give the wiki page a quick review and correct factual errors? The extent of my edit (if I could make it) would simply be to fix the broken schema.xml example, but it's possible other details also need adjustments.
3.) Is there a policy on external links in the wiki? Adding a comment with a link to the above-mentioned blog post might be helpful to others, but if it's going to get me flagged as a potential spammer, I'll refrain from doing it.
Thanks for your input! I'll go ahead and leave a comment if I don't hear anything in a few days, but it seemed worth asking for best practices first. - Demian
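For anyone hitting the same wall: the shape that worked was to keep the fieldType and field definitions separate, roughly as below. The names here are hypothetical and the attribute set is based on common ExternalFileField examples, so double-check against your Solr version:

```xml
<!-- Hypothetical sketch: fieldType and field declared separately,
     not mashed into a single element as the wiki page showed. -->
<fieldType name="externalRank" class="solr.ExternalFileField"
           keyField="id" defVal="0" stored="false" indexed="false"
           valType="pfloat"/>
<field name="rank" type="externalRank"/>
```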
Preserving punctuation tokens with ICUTokenizerFactory
It has been brought to my attention that ICUTokenizerFactory drops tokens like the ++ in The C++ Programming Language. Is there any way to persuade it to preserve these types of tokens? thanks, Demian
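One possible workaround (an assumption on my part, not a feature of ICUTokenizerFactory itself) would be to rewrite such terms into protected placeholders before tokenization with a MappingCharFilterFactory; the mapping file name and its contents below are hypothetical:

```xml
<analyzer>
  <!-- mapping-specialterms.txt would contain entries such as:
       "C++" => "cplusplus"
       so the punctuation never reaches the tokenizer. -->
  <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-specialterms.txt"/>
  <tokenizer class="solr.ICUTokenizerFactory"/>
</analyzer>
```

The same mapping would have to be applied on both the index and query sides so the placeholder terms line up.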
RE: sun-java6 alternatives for Solr 3.5
For what it's worth, I run Solr 3.5 on Ubuntu using the OpenJDK packages and I haven't run into any problems. I do realize that sometimes the Sun JDK has features that are missing from other Java implementations, but so far it hasn't affected my use of Solr. - Demian -Original Message- From: ku3ia [mailto:dem...@gmail.com] Sent: Monday, February 27, 2012 2:25 PM To: solr-user@lucene.apache.org Subject: sun-java6 alternatives for Solr 3.5 Hi all! I installed Ubuntu 10.04 LTS. I added the 'partner' repository to my sources list and updated it, but I can't find a sun-java6-* package: root@ubuntu:~# apt-cache search java6 default-jdk - Standard Java or Java compatible Development Kit default-jre - Standard Java or Java compatible Runtime default-jre-headless - Standard Java or Java compatible Runtime (headless) openjdk-6-jdk - OpenJDK Development Kit (JDK) openjdk-6-jre - OpenJDK Java runtime, using Hotspot JIT openjdk-6-jre-headless - OpenJDK Java runtime, using Hotspot JIT (headless) Then I googled and found an article: https://lists.ubuntu.com/archives/ubuntu-security-announce/2011-December/001528.html I'm using Solr 3.5 and Apache Tomcat 6.0.32. Please advise me what I should do in this situation, because I have always used sun-java6-* packages for Tomcat and Solr and it worked fine. Thanks! -- View this message in context: http://lucene.472066.n3.nabble.com/sun-java6-alternatives-for-Solr-3-5-tp3781792p3781792.html Sent from the Solr - User mailing list archive at Nabble.com.
RE: SOLR - Just for search or whole site DB?
I would strongly recommend using Solr just for search. Solr is designed for doing fast search lookups. It is really not designed for performing all the functions of a relational database system. You certainly COULD use Solr for everything, and the software is constantly being enhanced to make it more flexible, but you'll still probably find it awkward and inconvenient for certain tasks that are simple with MySQL. It's also useful to be able to throw away and rebuild your Solr index at will, so you can upgrade to a new version or tweak your indexing rules. If you store mission-critical data in Solr itself, this becomes more difficult. The way I like to look at it is, as the name says, as an index. You use one system for actually managing your data, and then you use Solr to create an index of that data for fast look-up. - Demian -Original Message- From: Spadez [mailto:james_will...@hotmail.com] Sent: Tuesday, February 21, 2012 7:45 AM To: solr-user@lucene.apache.org Subject: SOLR - Just for search or whole site DB? I am new to this but I wanted to pitch a setup to you. I have a website being coded at the moment, in the very early stages, but it is effectively a full text scraper and search engine. We have decided on SOLR for the search system. We basically have two sets of data: One is the content for the search engine, which is around 100K records at any one time. The entire system is built on PHP and currently put into a MySQL database. We want very quick, relevant searches; this is critical. Our plan is to import our records into SOLR each night from the MySQL database. The second set of data is other parts of the site, such as our ticket system, stats about the number of clicks etc. This is not performance critical at all. So, I have two questions: Firstly, should everything be run through the SOLR search system, including tickets and site stats? 
Alternatively, is it better to keep only the main full text searches on SOLR and do the ticketing etc. through normal MySQL queries? Secondly, which is probably dependent on the first question: if everything should go through SOLR, should we even use a MySQL database at all? If not, what is the alternative? Do we use an XML file as a SQL replacement for content including tickets, stats, users, passwords etc.? Sorry if these questions are basic, but I'm out of my depth here (but learning!) James -- View this message in context: http://lucene.472066.n3.nabble.com/SOLR-Just-for-search-or-whole-site-DB-tp3763439p3763439.html Sent from the Solr - User mailing list archive at Nabble.com.
RE: social/collaboration features on top of solr
VuFind (http://vufind.org) uses Solr for library catalog (or similar) applications and features a MySQL database which it uses for storing user tags and comments outside of Solr itself. If there were a mechanism more closely tied to Solr for achieving this sort of effect, that would allow VuFind to do things with considerably more elegance! - Demian -Original Message- From: Robert Stewart [mailto:bstewart...@gmail.com] Sent: Tuesday, December 13, 2011 10:28 AM To: solr-user@lucene.apache.org Subject: social/collaboration features on top of solr Has anyone implemented some social/collaboration features on top of SOLR? What I am thinking is ability to add ratings and comments to documents in SOLR and then be able to fetch comments and ratings for each document in results (and have as part of response from SOLR), similar in fashion to MLT results. I think a separate index or separate core to store collaboration info would be needed, as well as a search component for fetching collaboration info for results. I would think this would be a great feature and wondering if anyone has done something similar. Bob
Re: LocalParams, bq, and highlighting
This is definitely an interesting case that I don't think anyone ever really considered before. It seems like a strong argument in favor of adding an hl.q param that the HighlightingComponent would use as an override for whatever the QueryComponent thinks the highlighting query should be; that way, people expressing complex queries like the one you describe could do something like... qq=solr q=inStock:true AND+_query_:{!dismax v=$qq} hl.q={!v=$qq} hl=true fl=name hl.fl=name bq=server ...what do you think? wanna file a Jira requesting this as a feature? Pretty sure the change would only require a few lines of code (but of course we'd also need JUnit tests, which would probably be several dozen lines of code) First of all, thanks for answering both of my LocalParams-related queries back in September. I somehow failed to notice your responses until today - it's alarmingly easy to lose things in the flood of solr-user mail - but I greatly appreciate your input on both issues! It looks like there's already a JIRA ticket (more than a year old) for the hl.q param: https://issues.apache.org/jira/browse/SOLR-1926 This definitely sounds like it would solve my problem, so I've put in my vote! - Demian
RE: DisMax and WordDelimiterFilterFactory (limitations of MultiPhraseQuery)
If we change the query chain to not split on case change, then we lose half the benefit of that feature -- if a user types WiFi and the source record contains wi fi, we fail to get a hit. As you say, that may be worth considering if it comes down to picking the lesser evil, but I still think there should be a complete solution to my problem -- I'm not trying to compensate for every fat-fingered user behavior... just one specific one! Ultimately, I think my problem relates to this note from the documentation about using phrases in the SynonymFilterFactory: Phrase searching (ie: sea biscit) will cause the QueryParser to pass the entire string to the analyzer, but if the SynonymFilter is configured to expand the synonyms, then when the QueryParser gets the resulting list of tokens back from the Analyzer, it will construct a MultiPhraseQuery that will not have the desired effect. This is because of the limited mechanism available for the Analyzer to indicate that two terms occupy the same position: there is no way to indicate that a phrase occupies the same position as a term. For our example the resulting MultiPhraseQuery would be (sea | sea | seabiscuit) (biscuit | biscit) which would not match the simple case of seabiscuit occuring in a document. So I suppose I'm just running up against a fundamental limitation of Solr... but this seems like a fundamental limitation that might be worth overcoming -- I'm sure my use case is not the only one where this could matter. Has anyone given this any thought? - Demian -Original Message- From: Erick Erickson [mailto:erickerick...@gmail.com] Sent: Thursday, October 27, 2011 8:21 AM To: solr-user@lucene.apache.org Subject: Re: DisMax and WordDelimiterFilterFactory What happens if you change your WDDF definition in the query part of your analysis chain to NOT split on case change? Then your index should contain the right fragments (and combined words) and your queries would match. 
I admit I haven't thought this through entirely, but this would work for your example I think. Unfortunately I suspect it would break other cases. I suspect you're in a lesser-of-two-evils situation. But I can't imagine a 100% solution here. You're effectively asking to compensate for any fat-fingered thing a user does. Impossible I think... Best Erick On Tue, Oct 25, 2011 at 1:13 PM, Demian Katz demian.k...@villanova.edu wrote: I've seen a couple of threads related to this subject (for example, http://www.mail-archive.com/solr-user@lucene.apache.org/msg33400.html), but I haven't found an answer that addresses the aspect of the problem that concerns me... I have a field type set up like this:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

The important feature here is the use of WordDelimiterFilterFactory, which allows a search for WiFi to match an indexed term of wi fi (for example). The problem, of course, is that if a user accidentally introduces a case change in their query, the query analyzer chain breaks it into multiple words and no hits are found... so a search for exaMple will look for exa mple and fail. I've found two solutions that resolve this problem in the admin panel field analysis tool: 1.) Turn on catenateWords and catenateNumbers in the query analyzer - this reassembles the user's broken word and allows a match. 2.) Turn on preserveOriginal in the query analyzer - this passes through the user's original query, which then gets cleaned up by the
DisMax and WordDelimiterFilterFactory
I've seen a couple of threads related to this subject (for example, http://www.mail-archive.com/solr-user@lucene.apache.org/msg33400.html), but I haven't found an answer that addresses the aspect of the problem that concerns me... I have a field type set up like this:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

The important feature here is the use of WordDelimiterFilterFactory, which allows a search for WiFi to match an indexed term of wi fi (for example). The problem, of course, is that if a user accidentally introduces a case change in their query, the query analyzer chain breaks it into multiple words and no hits are found... so a search for exaMple will look for exa mple and fail. 
I've found two solutions that resolve this problem in the admin panel field analysis tool: 1.) Turn on catenateWords and catenateNumbers in the query analyzer - this reassembles the user's broken word and allows a match. 2.) Turn on preserveOriginal in the query analyzer - this passes through the user's original query, which then gets cleaned up by the ICUFoldingFilterFactory and allows a match. The problem is that in my real-world application, which uses DisMax, neither of these solutions works. It appears that even though (if I understand correctly) the WordDelimiterFilterFactory is returning ALTERNATIVE tokens, the DisMax handler is combining them in a way that inappropriately requires all of them to match... for example, here's partial debugQuery output for the exaMple search using Dismax and solution #2 above: parsedquery: +DisjunctionMaxQuery((genre:"(exampl exa) mple"^300.0 | title_new:"(exampl exa) mple"^100.0 | topic:"(exampl exa) mple"^500.0 | series:"(exampl exa) mple"^50.0 | title_full_unstemmed:"(example exa) mple"^600.0 | geographic:"(exampl exa) mple"^300.0 | contents:"(exampl exa) mple"^10.0 | fulltext_unstemmed:"(example exa) mple"^10.0 | allfields_unstemmed:"(example exa) mple"^10.0 | title_alt:"(exampl exa) mple"^200.0 | series2:"(exampl exa) mple"^30.0 | title_short:"(exampl exa) mple"^750.0 | author:"(example exa) mple"^300.0 | title:"(exampl exa) mple"^500.0 | topic_unstemmed:"(example exa) mple"^550.0 | allfields:"(exampl exa) mple" | author_fuller:"(example exa) mple"^150.0 | title_full:"(exampl exa) mple"^400.0 | fulltext:"(exampl exa) mple")) () Obviously, that is not what I want - ideally it would be something like 'exampl OR ex ample'. I also read about the autoGeneratePhraseQueries setting, but that seems to take things way too far in the opposite direction - if I set that to false, then I get matches for any individual token; i.e. example OR ex OR ample - not good at all! 
I have a sinking suspicion that there is not an easy solution to my problem, but this seems to be a fairly basic need; splitOnCaseChange is a useful feature to have, but it's more valuable if it serves as an ALTERNATIVE search rather than a necessary query munge. Any thoughts? thanks, Demian
RE: Dismax handler - whitespace and special character behaviour
I just sent an email to the list about DisMax interacting with WordDelimiterFilterFactory, and I think our problems are at least partially related -- I think the reason you are seeing an OR where you expect an AND is that you have autoGeneratePhraseQueries set to false, which changes the way DisMax handles the output of the WordDelimiterFilterFactory (among others). Unfortunately, I don't have a solution for you... but you might want to keep an eye on my thread in case replies there shed any additional light. - Demian -Original Message- From: Rohk [mailto:khor...@gmail.com] Sent: Tuesday, October 25, 2011 10:33 AM To: solr-user@lucene.apache.org Subject: Dismax handler - whitespace and special character behaviour Hello, I've got strange results when I have special characters in my query. Here is my request: q=histoire-france&start=0&rows=10&sort=score+desc&defType=dismax&qf=any^1.0&mm=100% Parsed query: <str name="parsedquery_toString">+((any:histoir any:franc)) ()</str> I've got 17000 results because Solr is doing an OR (should be AND). I have no problem when I'm using a whitespace instead of a special char: q=histoire france&start=0&rows=10&sort=score+desc&defType=dismax&qf=any^1.0&mm=100% <str name="parsedquery_toString">+(((any:histoir) (any:franc))~2) ()</str> 2000 results for this query. 
Here is my schema.xml (relevant parts):

<fieldType name="text" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="false">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.CommonGramsFilterFactory" words="stopwords_french.txt" ignoreCase="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_french.txt" enablePositionIncrements="true"/>
    <filter class="solr.SnowballPorterFilterFactory" language="French" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!--filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/-->
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1" preserveOriginal="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.CommonGramsFilterFactory" words="stopwords_french.txt" ignoreCase="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_french.txt" enablePositionIncrements="true"/>
    <filter class="solr.SnowballPorterFilterFactory" language="French" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
  </analyzer>
</fieldType>

I tried with a PatternTokenizerFactory to tokenize on whitespace and special chars but no change... Even with a charFilter (PatternReplaceCharFilterFactory) to replace special characters by whitespace, it doesn't work... 
First line of analysis via solr admin, with verbose output, for query = 'histoire-france':
org.apache.solr.analysis.PatternReplaceCharFilterFactory {replacement= , pattern=([,;./\\'-]), luceneMatchVersion=LUCENE_32}
text: histoire france
The '-' is replaced by ' ', then tokenized by WhitespaceTokenizerFactory. However I still have different numbers of results for 'histoire-france' and 'histoire france'. My current workaround is to replace all special chars by whitespaces before sending the query to Solr, but it is not satisfying. Did I miss something?
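The client-side workaround described above (replacing special characters with whitespace before the query is sent to Solr) can be sketched as follows; this is a minimal illustration assuming a Python client, with the character class mirroring the PatternReplaceCharFilterFactory pattern:

```python
import re

# Strip the same punctuation the char filter targets, so
# 'histoire-france' and 'histoire france' reach Solr identically.
def sanitize(query: str) -> str:
    return re.sub(r"[,;./\\'\-]+", " ", query).strip()
```

With this in place, sanitize("histoire-france") and sanitize("histoire france") produce the same query string, so both requests hit the same analysis path and return the same result counts.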
LocalParams, bq, and highlighting
I've run into another strange behavior related to LocalParams syntax in Solr 1.4.1. If I apply Dismax boosts using bq in LocalParams syntax, the contents of the boost queries get used by the highlighter. Obviously, when I use bq as a separate parameter, this is not an issue. To clarify, here are two searches that yield identical results but different highlighting behaviors:
http://localhost:8080/solr/biblio/select/?q=john&rows=20&start=0&indent=yes&qf=author^100&qt=dismax&bq=author%3Asmith^1000&fl=score&hl=true&hl.fl=*
http://localhost:8080/solr/biblio/select/?q=%28%28_query_%3A%22{!dismax+qf%3D\%22author^100\%22+bq%3D\%27author%3Asmith^1000\%27}john%22%29%29&rows=20&start=0&indent=yes&fl=score&hl=true&hl.fl=*
Query #1 highlights only john (the desired behavior), but query #2 highlights both john and smith. Is this a known limitation of the highlighter, or is it a bug? Is this issue resolved in newer versions of Solr? thanks, Demian
Questions about LocalParams syntax
I'm using the LocalParams syntax combined with the _query_ pseudo-field to build an advanced search screen (built on Solr 1.4.1's Dismax handler), but I'm running into some syntax questions that don't seem to be addressed by the wiki page here: http://wiki.apache.org/solr/LocalParams
1.) How should I deal with repeating parameters? If I use multiple boost queries, it seems that only the last one listed is used... for example: ((_query_:"{!dismax qf=\"title^500 author^300 allfields\" bq=\"format:Book^50\" bq=\"format:Journal^150\"}test")) boosts Journals, but not Books. If I reverse the order of the two bq parameters, then Books get boosted instead of Journals. I can work around this by creating one bq with the clauses OR'ed together, but I would rather be able to apply multiple bq's like I can elsewhere.
2.) What is the proper way to escape quotes? Since there are multiple nested layers of double quotes, things get ugly and it's easy to end up with syntax errors. I found that this syntax doesn't cause an error: ((_query_:"{!dismax qf=\"title^500 author^300 allfields\" bq=\"format:\\\"Book\\\"^50\" bq=\"format:\\\"Journal\\\"^150\"}test")) ...but it also doesn't work correctly - the boost queries are completely ignored in this example. Perhaps this is more a problem related to _query_ than to LocalParams syntax... but either way, a solution would be great! thanks, Demian
RE: Questions about LocalParams syntax
Space-separation works for the qf field, but not for bq. If I try a bq of "format:Book^50 format:Journal^150", I get a strange result -- I would expect in the case of a failed bq that either a) I would get a syntax error of some sort or b) I would get normal search results with no boosting applied. Instead, I get a successful search result containing 0 entries. Very odd! Anyway, the solution that definitely works is joining the clauses with OR... but I'd still love to be able to specify multiple bq's separately if there's any way it can be done. As for the quote issue, the problem I'm trying to solve is that my code is driven by configuration files, and users may specify any legal Solr bq values that they choose. You're right that in some cases I can simplify the situation by alternating quotes or changing the syntax... but I don't want to force users into using a subset of legal Solr syntax; it would be much better to be able to handle all legal cases in a straightforward fashion. Admittedly, my example is artificial -- format:Book^50 works just as well as "format:Book^50"... but suppose they wanted to boost a phrase like format:"Conference Proceeding"^25 -- this is a common case. It seems like there should be some syntax that allows this to work in the context I am using it. If not, perhaps we need to file a bug report. In any case, thanks for taking the time to make some suggestions! It surprises me that this very powerful feature of Solr is so little-documented. - Demian
What I do: _query_:{!dismax qf='title something else'} So by switching between single and double quotes, you can avoid need to escape. Sometimes you still do need to escape when a single or double quote is actually in a value (say in a 'q'), and I do use backslash there. If you had more levels of nesting though... I have no idea what you'd do. I'm not even sure why you have the internal quotes here: bq=\format:\\\Book\\\^50\ Shouldn't that just be bq='format:Book^50', what's the extra double quotes around Book? If you don't need them, then with switching between single and double, this can become somewhat less crazy and error prone: _query_:{!dismax bq='format:Book^50'} I think. Maybe. If you really do need the double quotes in there, then I think switching between single and double you can use a single backslash there. On 9/20/2011 9:39 AM, Demian Katz wrote: I'm using the LocalParams syntax combined with the _query_ pseudo- field to build an advanced search screen (built on Solr 1.4.1's Dismax handler), but I'm running into some syntax questions that don't seem to be addressed by the wiki page here: http://wiki.apache.org/solr/LocalParams 1.)How should I deal with repeating parameters? If I use multiple boost queries, it seems that only the last one listed is used... for example: ((_query_:{!dismax qf=\title^500 author^300 allfields\ bq=\format:Book^50\ bq=\format:Journal^150\}test)) boosts Journals, but not Books. If I reverse the order of the two bq parameters, then Books get boosted instead of Journals. I can work around this by creating one bq with the clauses OR'ed together, but I would rather be able to apply multiple bq's like I can elsewhere. 2.)What is the proper way to escape quotes? Since there are multiple nested layers of double quotes, things get ugly and it's easy to end up with syntax errors. 
I found that this syntax doesn't cause an error: ((_query_:"{!dismax qf=\"title^500 author^300 allfields\" bq=\"format:\\\"Book\\\"^50\" bq=\"format:\\\"Journal\\\"^150\"}test")) ...but it also doesn't work correctly - the boost queries are completely ignored in this example. Perhaps this is more a problem related to _query_ than to LocalParams syntax... but either way, a solution would be great! thanks, Demian
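For what it's worth, the escaping rules discussed in this thread can be automated. Below is a rough Python sketch (not from the thread; the function name and approach are my own) that builds a nested _query_ clause, joins multiple boost queries with OR as the workaround requires, and backslash-escapes any double quotes inside the outer quoted sub-query:

```python
def dismax_subquery(terms, qf, bq_clauses):
    """Build a _query_ clause using Solr's LocalParams syntax.

    Joins multiple boost queries with OR (the workaround from this
    thread, since repeated bq keys are not honored inside LocalParams)
    and uses single quotes for the inner parameter values, so only the
    double quotes embedded in user-supplied values need escaping.
    """
    bq = " OR ".join(bq_clauses)
    local_params = f"{{!dismax qf='{qf}' bq='{bq}'}}"
    inner = local_params + terms
    # The whole sub-query is wrapped in double quotes, so any double
    # quotes inside it (e.g. phrase boosts) must be backslash-escaped.
    inner = inner.replace("\\", "\\\\").replace('"', '\\"')
    return f'_query_:"{inner}"'

query = dismax_subquery(
    "test",
    "title^500 author^300 allfields",
    ['format:"Book"^50', 'format:"Conference Proceeding"^25'],
)
print(query)
```

The output wraps the LocalParams block in one level of double quotes with single backslashes before the embedded phrase quotes, which matches the single-backslash pattern Jonathan describes.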
String index out of range: -1 for hl.fl=* in Solr 1.4.1?
I'm running into a strange problem with Solr 1.4.1 - this request:

http://localhost:8080/solr/website/select/?q=*%3A*&rows=20&start=0&indent=yes&fl=score&facet=true&facet.mincount=1&facet.limit=30&facet.field=category&facet.field=linktype&facet.field=subject&facet.prefix=&facet.sort=&fq=category%3A%22Exhibits%22&spellcheck=true&spellcheck.q=*%3A*&spellcheck.dictionary=default&hl=true&hl.fl=*&hl.simple.pre=%7b%7b%7b%7bSTART_HILITE%7d%7d%7d%7d&hl.simple.post=%7b%7b%7b%7bEND_HILITE%7d%7d%7d%7d&wt=json&json.nl=arrarr

leads to this error dump:

String index out of range: -1

java.lang.StringIndexOutOfBoundsException: String index out of range: -1
 at java.lang.String.substring(String.java:1949)
 at org.apache.lucene.search.highlight.Highlighter.getBestTextFragments(Highlighter.java:263)
 at org.apache.solr.highlight.DefaultSolrHighlighter.doHighlighting(DefaultSolrHighlighter.java:335)
 at org.apache.solr.handler.component.HighlightComponent.process(HighlightComponent.java:89)
 at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:195)
 at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
 at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
 at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
 at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
 at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1088)
 at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:360)
 at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
 at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
 at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:729)
 at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
 at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:206)
 at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
 at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
 at org.mortbay.jetty.Server.handle(Server.java:324)
 at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:505)
 at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:829)
 at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:513)
 at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:211)
 at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:380)
 at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:395)
 at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:488)

I've managed to work around the problem by replacing the hl.fl=* parameter with a comma-delimited list of the fields I actually need highlighted... but I don't understand why I'm encountering this error, and for peace of mind I would like to understand it in case something deeper is at work here. I'll be happy to share schema or other details if they would help narrow down a potential cause! thanks, Demian
RE: SpellCheckComponent performance
As I may have mentioned before, VuFind is actually doing two Solr queries for every search -- a base query that gets basic spelling suggestions, and a supplemental spelling-only query that gets shingled spelling suggestions. If there's a way to get two different spelling responses in a single query, I'd love to hear about it... but the double-querying doesn't seem to be a huge problem -- the delays I'm talking about are in the spelling portion of the initial query. Just for the sake of completeness, here are both of my spelling field types:

<!-- Basic Text Field for use with Spell Correction -->
<fieldType name="textSpell" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="schema.UnicodeNormalizationFilterFactory" version="icu4j" composed="false" remove_diacritics="true" remove_modifiers="true" fold="true"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

<!-- More advanced spell checking field. -->
<fieldType name="textSpellShingle" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="2" outputUnigrams="false"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="2" outputUnigrams="false"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

...and here are the fields:

<field name="spelling" type="textSpell" indexed="true" stored="true"/>
<field name="spellingShingle" type="textSpellShingle" indexed="true" stored="true" multiValued="true"/>

As you can probably guess, I'm using spelling in my main query and spellingShingle in my supplemental query. Here are stats on the spelling field:

{field=spelling,memSize=107830314,tindexSize=249184,time=25747,phase1=25150,nTerms=1343061,bigTerms=231,termInstances=40960454,uses=1}

(I obtained these numbers by temporarily adding the spelling field as a facet to my warming query -- probably not a very smart way to do it, but it was the only way I could figure out! If there's a more elegant and accurate approach, I'd be interested to know what it is.) I should also note that my basic spelling index is 114MB and my shingled spelling index is 931MB -- not outrageously large. Is there a way to persuade Solr to load these into memory for faster performance? thanks, Demian -Original Message- From: Erick Erickson [mailto:erickerick...@gmail.com] Sent: Monday, June 06, 2011 6:23 PM To: solr-user@lucene.apache.org Subject: Re: SpellCheckComponent performance Hmmm, how are you configuring your spell checker? 
The first-time slowdown is probably due to cache warming, but subsequent 500 ms slowdowns seem odd. How many unique terms are there in your spellcheck index? It'd probably be best if you showed us your fieldtype and field definition... Best Erick On Mon, Jun 6, 2011 at 4:04 PM, Demian Katz demian.k...@villanova.edu wrote: I'm continuing to work on tuning my Solr server, and now I'm noticing that my biggest bottleneck is the SpellCheckComponent. This is eating multiple seconds on most first-time searches, and still taking around 500ms even on cached searches. Here is my configuration:

<searchComponent name="spellcheck" class="org.apache.solr.handler.component.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="name">basicSpell</str>
    <str name="field">spelling</str>
    <str name="accuracy">0.75</str>
    <str name="spellcheckIndexDir">./spellchecker</str>
    <str name="queryAnalyzerFieldType">textSpell</str>
    <str name="buildOnOptimize">true</str>
  </lst>
</searchComponent>

I've done a bit of searching, but the best advice I could find for making the search component go faster involved reducing spellcheck.maxCollationTries, which doesn't even seem to apply to my settings. Does anyone have any advice on tuning this aspect of my configuration? Are there any extra debug settings that might give deeper insight into how the component is spending its time? thanks, Demian
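As background for readers unfamiliar with shingling: the ShingleFilterFactory settings in the field type above (maxShingleSize=2 with outputUnigrams=false) mean the shingled spelling field indexes word bigrams rather than single terms, which is part of why its index is so much larger. A rough Python illustration of the token stream it produces (assuming tokenization and stopword removal have already happened):

```python
def bigram_shingles(tokens):
    """Mimic ShingleFilter with maxShingleSize=2, outputUnigrams=false:
    emit each adjacent pair of tokens as a single two-word term."""
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]

# Each position in the document contributes one two-word term.
print(bigram_shingles(["charles", "dickens", "novels"]))
```

Because every adjacent word pair becomes its own dictionary entry, the shingled spelling index grows much faster than the basic one (931MB vs. 114MB in Demian's numbers).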
RE: Solr performance tuning - disk i/o?
Thanks once again for the helpful suggestions! Regarding the selection of facet fields, I think publishDate (which is actually just a year) and callnumber-first (which is actually a very broad, high-level category) are okay. authorStr is an interesting problem: it's definitely a useful facet (when a user searches for an author, odds are good that they want the one who published the most books... i.e. a search for dickens will probably show Charles Dickens at the top of the facet list), but it has a long tail, since there are many minor authors who have only published one or two books... Is there a possibility that the facet.mincount parameter could be helpful here, or does that have no impact on performance/memory footprint? Regarding polling interval for slaves, are you referring to a distributed Solr environment, or is this something to do with Solr's internals? We're currently a single-server environment, so I don't think I have to worry if it's related to a multi-server setup... but if it's something internal, could you point me to the right area of the admin panel to check my stats? I'm not seeing anything about polling on the statistics page. It's also a little strange that all of my warmupTime stats on searchers and caches are showing as 0 -- is that normal? thanks, Demian -Original Message- From: Erick Erickson [mailto:erickerick...@gmail.com] Sent: Friday, June 03, 2011 4:45 PM To: solr-user@lucene.apache.org Subject: Re: Solr performance tuning - disk i/o? Quick impressions: Faceting is usually best done on fields that don't have lots of unique values, for three reasons:

1) It's questionable how much use it is to the user to have a gazillion facets. In the case of a unique field per document, in fact, it's useless.
2) Resource requirements go up as a function of the number of unique terms. This is true for faceting and sorting.
3) Warmup times grow the more terms have to be read into memory. 
Glancing at your warmup stuff, things like publishDate, authorStr and maybe callnumber-first are questionable. publishDate depends on how coarse the resolution is. If it's by day, that's not really much use. authorStr... How many authors have more than one publication? Would this be better served by some kind of autosuggest rather than facets? callnumber-first... I don't really know, but if it's unique per document it's probably not something the user would find useful as a facet. The admin page will help you determine the number of unique terms per field, which may guide you whether or not to continue to facet on these fields. As Otis said, doing a sort on the fields during warmup will also help. Watch your polling interval for any slaves in relation to the warmup times. If your polling interval is shorter than the warmup times, you run a risk of runaway warmups. As you've figured out, measuring responses to the first few queries doesn't always measure what you really need <G>... I don't have the pages handy, but autowarming is a good topic to understand, so you might spend some time tracking it down. Best Erick On Fri, Jun 3, 2011 at 11:21 AM, Demian Katz demian.k...@villanova.edu wrote: Thanks to you and Otis for the suggestions! Some more information: - Based on the Solr stats page, my caches seem to be working pretty well (few or no evictions, hit rates in the 75-80% range). - VuFind is actually doing two Solr queries per search (one initial search followed by a supplemental spell check search -- I believe this is necessary because VuFind has two separate spelling indexes, one for shingled terms and one for single words). That is probably exaggerating the problem, though based on searches with debugQuery on, it looks like it's always the initial search (rather than the supplemental spelling search) that's consuming the bulk of the time. - enableLazyFieldLoading is set to true. - I'm retrieving 20 documents per page. 
- My JVM settings: -server -Xloggc:/usr/local/vufind/solr/jetty/logs/gc.log -Xms4096m -Xmx4096m -XX:+UseParallelGC -XX:+UseParallelOldGC -XX:NewRatio=5 It appears that a large portion of my problem had to do with autowarming, a topic that I've never had a strong grasp on, though perhaps I'm finally learning (any recommended primer links would be welcome!). I did have some autowarming settings in solrconfig.xml (an arbitrary search for a bunch of random keywords in the newSearcher and firstSearcher events, plus autowarmCount settings on all of my caches). However, when I looked at the debugQuery output, I noticed that a huge amount of time was being wasted loading facets on the first search after restarting Solr, so I changed my newSearcher and firstSearcher events to this:

<arr name="queries">
  <lst>
    <str name="q">*:*</str>
    <str name="start">0</str>
    <str name="rows">10</str>
    <str name="facet">true</str>
    <str
RE: Solr performance tuning - disk i/o?
All of my cache autowarmCount settings are either 1 or 5. maxWarmingSearchers is set to 2. I previously shared the contents of my firstSearcher and newSearcher events -- just a queries array surrounded by a standard-looking listener tag. The events are definitely firing -- in addition to the measurable performance improvement they give me, I can actually see them happening in the console output during startup. That seems to cover every configuration option in my file that references warming in any way, and it all looks reasonable to me. warmupTime remains consistently 0 in the statistics display. Is there anything else I should be looking at? In any case, I'm not too alarmed by this... it just seems a little strange. thanks, Demian -Original Message- From: Erick Erickson [mailto:erickerick...@gmail.com] Sent: Monday, June 06, 2011 11:59 AM To: solr-user@lucene.apache.org Subject: Re: Solr performance tuning - disk i/o? Polling interval was in reference to slaves in a multi-machine master/slave setup. so probably not a concern just at present. Warmup time of 0 is not particularly normal, I'm not quite sure what's going on there but you may want to look at firstsearcher, newsearcher and autowarm parameters in config.xml.. Best Erick On Mon, Jun 6, 2011 at 9:08 AM, Demian Katz demian.k...@villanova.edu wrote: Thanks once again for the helpful suggestions! Regarding the selection of facet fields, I think publishDate (which is actually just a year) and callnumber-first (which is actually a very broad, high-level category) are okay. authorStr is an interesting problem: it's definitely a useful facet (when a user searches for an author, odds are good that they want the one who published the most books... i.e. a search for dickens will probably show Charles Dickens at the top of the facet list), but it has a long tail since there are many minor authors who have only published one or two books... 
Is there a possibility that the facet.mincount parameter could be helpful here, or does that have no impact on performance/memory footprint? Regarding polling interval for slaves, are you referring to a distributed Solr environment, or is this something to do with Solr's internals? We're currently a single-server environment, so I don't think I have to worry if it's related to a multi-server setup... but if it's something internal, could you point me to the right area of the admin panel to check my stats? I'm not seeing anything about polling on the statistics page. It's also a little strange that all of my warmupTime stats on searchers and caches are showing as 0 -- is that normal? thanks, Demian -Original Message- From: Erick Erickson [mailto:erickerick...@gmail.com] Sent: Friday, June 03, 2011 4:45 PM To: solr-user@lucene.apache.org Subject: Re: Solr performance tuning - disk i/o? Quick impressions: Faceting is usually best done on fields that don't have lots of unique values, for three reasons:

1) It's questionable how much use it is to the user to have a gazillion facets. In the case of a unique field per document, in fact, it's useless.
2) Resource requirements go up as a function of the number of unique terms. This is true for faceting and sorting.
3) Warmup times grow the more terms have to be read into memory.

Glancing at your warmup stuff, things like publishDate, authorStr and maybe callnumber-first are questionable. publishDate depends on how coarse the resolution is. If it's by day, that's not really much use. authorStr... How many authors have more than one publication? Would this be better served by some kind of autosuggest rather than facets? callnumber-first... I don't really know, but if it's unique per document it's probably not something the user would find useful as a facet. The admin page will help you determine the number of unique terms per field, which may guide you whether or not to continue to facet on these fields. 
As Otis said, doing a sort on the fields during warmup will also help. Watch your polling interval for any slaves in relation to the warmup times. If your polling interval is shorter than the warmup times, you run a risk of runaway warmups. As you've figured out, measuring responses to the first few queries doesn't always measure what you really need <G>... I don't have the pages handy, but autowarming is a good topic to understand, so you might spend some time tracking it down. Best Erick On Fri, Jun 3, 2011 at 11:21 AM, Demian Katz demian.k...@villanova.edu wrote: Thanks to you and Otis for the suggestions! Some more information: - Based on the Solr stats page, my caches seem to be working pretty well (few or no evictions, hit rates in the 75-80% range). - VuFind is actually doing two Solr queries per search (one initial search
SpellCheckComponent performance
I'm continuing to work on tuning my Solr server, and now I'm noticing that my biggest bottleneck is the SpellCheckComponent. This is eating multiple seconds on most first-time searches, and still taking around 500ms even on cached searches. Here is my configuration:

<searchComponent name="spellcheck" class="org.apache.solr.handler.component.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="name">basicSpell</str>
    <str name="field">spelling</str>
    <str name="accuracy">0.75</str>
    <str name="spellcheckIndexDir">./spellchecker</str>
    <str name="queryAnalyzerFieldType">textSpell</str>
    <str name="buildOnOptimize">true</str>
  </lst>
</searchComponent>

I've done a bit of searching, but the best advice I could find for making the search component go faster involved reducing spellcheck.maxCollationTries, which doesn't even seem to apply to my settings. Does anyone have any advice on tuning this aspect of my configuration? Are there any extra debug settings that might give deeper insight into how the component is spending its time? thanks, Demian
Solr performance tuning - disk i/o?
Hello, I'm trying to move a VuFind installation from an ailing physical server into a virtualized environment, and I'm running into performance problems. VuFind is a Solr 1.4.1-based application with fairly large and complex records (many stored fields, many words per record). My particular installation contains about a million records in the index, with a total index size around 6GB. The virtual environment has more RAM and better CPUs than the old physical box, and I am satisfied that my Java environment is well-tuned. My index is optimized. Searches that hit the cache respond very well. The problem is that non-cached searches are very slow - the more keywords I add, the slower they get, to the point of taking 6-12 seconds to come back with results on a quiet box and well over a minute under stress testing. (The old box still took a while for equivalent searches, but it was about twice as fast as the new one). My gut feeling is that disk access reading the index is the bottleneck here, but I know little about the specifics of Solr's internals, so it's entirely possible that my gut is wrong. Outside testing does show that the virtual environment's disk performance is not as good as the old physical server's, especially when multiple processes are trying to access the same file simultaneously. So, two basic questions:

1.) Would you agree that I'm dealing with a disk bottleneck, or are there some other factors I should be considering? Any good diagnostics I should be looking at?
2.) If the problem is disk access, is there anything I can tune on the Solr side to alleviate the problems?

Thanks, Demian
RE: Solr performance tuning - disk i/o?
Thanks to you and Otis for the suggestions! Some more information: - Based on the Solr stats page, my caches seem to be working pretty well (few or no evictions, hit rates in the 75-80% range). - VuFind is actually doing two Solr queries per search (one initial search followed by a supplemental spell check search -- I believe this is necessary because VuFind has two separate spelling indexes, one for shingled terms and one for single words). That is probably exaggerating the problem, though based on searches with debugQuery on, it looks like it's always the initial search (rather than the supplemental spelling search) that's consuming the bulk of the time. - enableLazyFieldLoading is set to true. - I'm retrieving 20 documents per page. - My JVM settings: -server -Xloggc:/usr/local/vufind/solr/jetty/logs/gc.log -Xms4096m -Xmx4096m -XX:+UseParallelGC -XX:+UseParallelOldGC -XX:NewRatio=5 It appears that a large portion of my problem had to do with autowarming, a topic that I've never had a strong grasp on, though perhaps I'm finally learning (any recommended primer links would be welcome!). I did have some autowarming settings in solrconfig.xml (an arbitrary search for a bunch of random keywords in the newSearcher and firstSearcher events, plus autowarmCount settings on all of my caches). 
However, when I looked at the debugQuery output, I noticed that a huge amount of time was being wasted loading facets on the first search after restarting Solr, so I changed my newSearcher and firstSearcher events to this:

<arr name="queries">
  <lst>
    <str name="q">*:*</str>
    <str name="start">0</str>
    <str name="rows">10</str>
    <str name="facet">true</str>
    <str name="facet.mincount">1</str>
    <str name="facet.field">collection</str>
    <str name="facet.field">format</str>
    <str name="facet.field">publishDate</str>
    <str name="facet.field">callnumber-first</str>
    <str name="facet.field">topic_facet</str>
    <str name="facet.field">authorStr</str>
    <str name="facet.field">language</str>
    <str name="facet.field">genre_facet</str>
    <str name="facet.field">era_facet</str>
    <str name="facet.field">geographic_facet</str>
  </lst>
</arr>

Overall performance has now increased dramatically, and now the biggest bottleneck in the debug output seems to be the shingle spell checking! Any other suggestions are welcome, since I suspect there's still room to squeeze more performance out of the system, and I'm still not sure I'm making the most of autowarming... but this seems like a big step in the right direction. Thanks again for the help! - Demian -Original Message- From: Erick Erickson [mailto:erickerick...@gmail.com] Sent: Friday, June 03, 2011 9:41 AM To: solr-user@lucene.apache.org Subject: Re: Solr performance tuning - disk i/o? This doesn't seem right. Here's a couple of things to try:

1) Attach debugQuery=on to your long-running queries. The QTime returned is the time taken to search, NOT including the time to load the docs. That'll help pinpoint whether the problem is the search itself, or assembling the documents.
2) Are you autowarming? If so, be sure it's actually done before querying.
3) Measure queries after the first few, particularly if you're sorting or faceting.
4) What are your JVM settings? How much memory do you have?
5) Is enableLazyFieldLoading set to true in your solrconfig.xml?
6) How many docs are you returning? 
There's more, but that'll do for a start. Let us know if you gather more data and it's still slow. Best Erick On Fri, Jun 3, 2011 at 8:44 AM, Demian Katz demian.k...@villanova.edu wrote: Hello, I'm trying to move a VuFind installation from an ailing physical server into a virtualized environment, and I'm running into performance problems. VuFind is a Solr 1.4.1-based application with fairly large and complex records (many stored fields, many words per record). My particular installation contains about a million records in the index, with a total index size around 6GB. The virtual environment has more RAM and better CPUs than the old physical box, and I am satisfied that my Java environment is well-tuned. My index is optimized. Searches that hit the cache respond very well. The problem is that non-cached searches are very slow - the more keywords I add, the slower they get, to the point of taking 6-12 seconds to come back with results on a quiet box and well over a minute under stress testing. (The old box still took a while for equivalent searches, but it was about twice as fast as the new one). My gut feeling is that disk access reading the index is the bottleneck here, but I know little about the specifics of Solr's internals, so it's entirely possible that my gut is wrong. Outside testing does show that the virtual environment's disk performance is not as good as the old physical server, especially when
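Erick's debugQuery=on suggestion can be pushed a little further: with wt=json, the response's debug timing section breaks QTime down per search component, which makes it easy to spot the dominant one. A small Python sketch of ranking those components (the dictionary shape below is an illustration with invented values; verify the exact structure against your Solr version's output):

```python
def slowest_components(timing, phase="process"):
    """Given the 'timing' map from a debugQuery=on JSON response,
    return (component, ms) pairs for one phase, slowest first."""
    phase_data = timing[phase]
    # Component entries are nested dicts with a "time" key; the
    # phase's own total appears as a plain number and is skipped.
    comps = [(name, entry["time"]) for name, entry in phase_data.items()
             if isinstance(entry, dict)]
    return sorted(comps, key=lambda pair: pair[1], reverse=True)

# Example shape (values invented for illustration):
timing = {
    "time": 523.0,
    "process": {
        "time": 510.0,
        "org.apache.solr.handler.component.QueryComponent": {"time": 35.0},
        "org.apache.solr.handler.component.FacetComponent": {"time": 60.0},
        "org.apache.solr.handler.component.SpellCheckComponent": {"time": 415.0},
    },
}
print(slowest_components(timing))
```

Run against real responses before and after a warming change, this kind of breakdown shows whether the facets, the query itself, or (as in Demian's later message) the spell checker is eating the time.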
Bug in solr.KeywordMarkerFilterFactory?
I've just started experimenting with the solr.KeywordMarkerFilterFactory in Solr 3.1, and I'm seeing some strange behavior. It seems that every word subsequent to a protected word is also treated as being protected. For testing purposes, I have put the word "spelling" in my protwords.txt. If I do a test for "spelling bees" in the analyze tool, the stemmer produces "spelling bees" - nothing is stemmed. But if I do a test for "bees spelling", I get "bee spelling", the expected result with "bees" stemmed but "spelling" left unstemmed. I have tried extended examples - in every case I tried, all of the words prior to "spelling" get stemmed, but none of the words after "spelling" get stemmed. When turning on the verbose mode of the analyze tool, I can see that the settings of the keyword attribute introduced by solr.KeywordMarkerFilterFactory correspond with the stemming behavior... so I think the solr.KeywordMarkerFilterFactory component is to blame, and not anything later in the analysis chain. Any idea what might be going wrong? Is this a known issue? 
Here is my field type definition, in case it makes a difference:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

thanks, Demian
RE: Bug in solr.KeywordMarkerFilterFactory?
That's good news -- thanks for the help (not to mention the reassurance that Solr itself is actually working right)! Hopefully 3.1.1 won't be too far off, though; when the analysis tool lies, life can get very confusing! :-) - Demian -Original Message- From: Robert Muir [mailto:rcm...@gmail.com] Sent: Wednesday, April 20, 2011 2:54 PM To: solr-user@lucene.apache.org; yo...@lucidimagination.com Subject: Re: Bug in solr.KeywordMarkerFilterFactory? No, this is only a bug in analysis.jsp. You can see this by comparing analysis.jsp's "dontstems bees" to using the query debug interface:

<lst name="debug">
  <str name="rawquerystring">dontstems bees</str>
  <str name="querystring">dontstems bees</str>
  <str name="parsedquery">PhraseQuery(text:"dontstems bee")</str>
  <str name="parsedquery_toString">text:"dontstems bee"</str>
</lst>

On Wed, Apr 20, 2011 at 2:43 PM, Yonik Seeley yo...@lucidimagination.com wrote: On Wed, Apr 20, 2011 at 2:01 PM, Demian Katz demian.k...@villanova.edu wrote: I've just started experimenting with the solr.KeywordMarkerFilterFactory in Solr 3.1, and I'm seeing some strange behavior. It seems that every word subsequent to a protected word is also treated as being protected. You're right! This was broken by LUCENE-2901 back in Jan. I've opened this issue: https://issues.apache.org/jira/browse/LUCENE-3039 The easiest short-term workaround for you would probably be to create a custom filter that looks like KeywordMarkerFilter before the LUCENE-2901 change. -Yonik http://www.lucenerevolution.org -- Lucene/Solr User Conference, May 25-26, San Francisco
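For anyone trying to picture the reported symptom: it behaves as if the keyword attribute, once set for a protected word, is never cleared for the tokens that follow. A toy Python model of the difference (purely illustrative; this is not the actual Lucene code):

```python
PROTECTED = {"spelling"}  # stands in for protwords.txt

def mark_keywords(tokens, buggy=False):
    """Return (token, is_keyword) pairs. In the buggy variant the flag
    is set but never cleared, so every token after the first protected
    word is also skipped by the stemmer."""
    out, flag = [], False
    for tok in tokens:
        if buggy:
            flag = flag or (tok in PROTECTED)  # sticky: never reset
        else:
            flag = tok in PROTECTED  # correct: reset for every token
        out.append((tok, flag))
    return out

print(mark_keywords(["spelling", "bees"], buggy=True))   # "bees" wrongly protected
print(mark_keywords(["bees", "spelling"], buggy=True))   # "bees" stemmed, as observed
```

This reproduces exactly what Demian saw in the analyze tool: "spelling bees" comes through with nothing stemmed, while "bees spelling" stems "bees" because it precedes the protected word.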
Solr 3.1 ICU filters (error loading class)
Hello, I'm interested in trying out the new ICU features in Solr 3.1. However, when I attempt to set up a field type using solr.ICUTokenizerFactory and/or solr.ICUFoldingFilterFactory, Solr refuses to start up, issuing "Error loading class" exceptions. I did see the README.txt file that mentions lucene-libs/lucene-*.jar and lib/icu4j-*.jar. I tried putting all of these files under my Solr home directory, but it made no difference. Is there some other .jar that I need to add to my library folder? Am I doing something wrong with the known dependencies? (This is the first time I've seen a lucene-libs directory, so I wasn't sure if that required some special configuration). Any general troubleshooting advice for figuring out what is going wrong here? thanks, Demian
RE: Solr 3.1 ICU filters (error loading class)
Thanks! apache-solr-analysis-extras-3.1.jar was the missing piece that was causing all of my trouble; I didn't see any mention of it in the documentation -- might be worth adding! Thanks, Demian -Original Message- From: Robert Muir [mailto:rcm...@gmail.com] Sent: Monday, April 18, 2011 1:46 PM To: solr-user@lucene.apache.org Subject: Re: Solr 3.1 ICU filters (error loading class) On Mon, Apr 18, 2011 at 1:31 PM, Demian Katz demian.k...@villanova.edu wrote: Hello, I'm interested in trying out the new ICU features in Solr 3.1. However, when I attempt to set up a field type using solr.ICUTokenizerFactory and/or solr.ICUFoldingFilterFactory, Solr refuses to start up, issuing "Error loading class" exceptions. I did see the README.txt file that mentions lucene-libs/lucene-*.jar and lib/icu4j-*.jar. I tried putting all of these files under my Solr home directory, but it made no difference. Is there some other .jar that I need to add to my library folder? Am I doing something wrong with the known dependencies? (This is the first time I've seen a lucene-libs directory, so I wasn't sure if that required some special configuration). Any general troubleshooting advice for figuring out what is going wrong here? Make a 'lib' directory under your solr home (e.g. example/solr/lib); it should contain:

* icu4j-4_6.jar
* lucene-icu-3.1.jar
* apache-solr-analysis-extras-3.1.jar
RE: Solr 3.1 ICU filters (error loading class)
Right, I placed my files relative to solr_home, not in it -- but obviously having a solr_home/lucene-libs directory didn't do me any good. :-) - Demian -Original Message- From: Jonathan Rochkind [mailto:rochk...@jhu.edu] Sent: Monday, April 18, 2011 1:46 PM To: solr-user@lucene.apache.org Cc: Demian Katz Subject: Re: Solr 3.1 ICU filters (error loading class) I don't think you want to put them in solr_home, I think you want to put them in solr_home/lib/. Or did you mean that's where you put them? On 4/18/2011 1:31 PM, Demian Katz wrote: Hello, I'm interested in trying out the new ICU features in Solr 3.1. However, when I attempt to set up a field type using solr.ICUTokenizerFactory and/or solr.ICUFoldingFilterFactory, Solr refuses to start up, issuing Error loading class exceptions. I did see the README.txt file that mentions lucene-libs/lucene-*.jar and lib/icu4j-*.jar. I tried putting all of these files under my Solr home directory, but it made no difference. Is there some other .jar that I need to add to my library folder? Am I doing something wrong with the known dependencies? (This is the first time I've seen a lucene-libs directory, so I wasn't sure if that required some special configuration). Any general troubleshooting advice for figuring out what is going wrong here? thanks, Demian
RE: OAI on SOLR already done?
I already replied to the original poster off-list, but it seems that it may be worth weighing in here as well... The next release of VuFind (http://vufind.org) is going to include OAI-PMH server support. As you say, there is really no way to plug OAI-PMH directly into Solr... but a tool like VuFind can provide a fairly generic, extensible, Solr-based platform for building an OAI-PMH server. Obviously this is helpful for some use cases and not others... but I'm happy to provide more information if anyone needs it. - Demian From: Jonathan Rochkind [rochk...@jhu.edu] Sent: Wednesday, February 02, 2011 3:38 PM To: solr-user@lucene.apache.org Cc: Paul Libbrecht Subject: Re: OAI on SOLR already done? The trick is that you can't just have a generic black box OAI-PMH provider on top of any Solr index. How would it know where to get the metadata elements it needs, such as title, or last-updated date, etc. Any given solr index might not even have this in stored fields -- and a given app might want to look them up from somewhere other than stored fields. If the Solr index does have them in stored fields, and you do want to get them from the stored fields, then it's, I think (famous last words) relatively straightforward code to write. A mapping from solr stored fields to metadata elements needed for OAI-PMH, and then simply outputting the XML template with those filled in. I am not aware of anyone that has done this in a re-useable/configurable-for-your-solr tool. You could possibly do it solely using the built-in Solr JSP/XSLT/other-templating-stuff-I-am-not-familiar-with stuff, rather than as an external Solr client app, or it could be an external Solr client app. This is actually a very similar problem to something someone else asked a few days ago Does anyone have an OpenSearch add-on for Solr? Very very similar problem, just with a different XML template for output (usually RSS or Atom) instead of OAI-PMH. 
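Jonathan's description -- a mapping from Solr stored fields to OAI-PMH metadata elements, filled into an XML template -- can be sketched in a few lines. This is an illustrative outline, not VuFind's actual implementation; the Solr field names in FIELD_MAP are hypothetical.

```python
# Sketch: map Solr stored fields onto Dublin Core elements for an
# OAI-PMH-style record body. Field names are hypothetical examples.
import xml.etree.ElementTree as ET

FIELD_MAP = {                 # solr stored field -> DC element
    "title": "dc:title",
    "author": "dc:creator",
    "last_modified": "dc:date",
}

def solr_doc_to_oai_dc(doc):
    """Render one Solr document (dict of field -> list of values) as oai_dc XML."""
    root = ET.Element("oai_dc:dc")
    for solr_field, dc_elem in FIELD_MAP.items():
        for value in doc.get(solr_field, []):
            ET.SubElement(root, dc_elem).text = value
    return ET.tostring(root, encoding="unicode")

print(solr_doc_to_oai_dc({"title": ["Love customs in eighteenth-century Spain"],
                          "author": ["Demian Katz"]}))
```

A real provider would wrap this in the OAI-PMH envelope (resumption tokens, datestamp filtering, etc.), which is exactly the part that cannot be generic across arbitrary Solr schemas.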
On 2/2/2011 3:14 PM, Paul Libbrecht wrote: Peter, I'm afraid your service is harvesting and I am trying to look at a PMH provider service. Your project appeared early in the google matches. paul On 2 Feb 2011 at 20:46, Péter Király wrote: Hi, I don't know whether it fits your need, but we are building a tool based on Drupal (eXtensible Catalog Drupal Toolkit), which can harvest with OAI-PMH and index the harvested records into Solr. The records are harvested, processed, and stored into MySQL, then we index them into Solr. We created some ways to manipulate the original values before sending them to Solr. We created it in a modular way, so you can change settings in an admin interface or write your own hooks (special Drupal functions) to tailor the application to your needs. We support only Dublin Core and our own FRBR-like schema (called XC schema), but you can add more schemas. Since this forum is about Solr, and not applications using Solr, if you are interested in this tool, please write me a private message, or visit http://eXtensibleCatalog.org, or the module's page at http://drupal.org/project/xc. Hope this helps, Péter eXtensible Catalog 2011/2/2 Paul Libbrecht p...@hoplahup.net: Hello list, I've met a few Google matches that indicate that SOLR-based servers implement the Open Archive Initiative's Metadata Harvesting Protocol. Is there something made to be re-usable that would be an add-on to Solr? thanks in advance paul
RE: filter query from external list of Solr unique IDs
The main problem I've encountered with the lots of OR clauses approach is that you eventually hit the limit on Boolean clauses and the whole query fails. You can keep raising the limit through the Solr configuration, but there's still a ceiling eventually. - Demian -Original Message- From: Jonathan Rochkind [mailto:rochk...@jhu.edu] Sent: Friday, October 15, 2010 1:07 PM To: solr-user@lucene.apache.org Subject: RE: filter query from external list of Solr unique IDs Definitely interested in this. The naive obvious approach would be just putting all the ID's in the query. Like fq=(id:1 OR id:2 OR). Or making it another clause in the 'q'. Can you outline what's wrong with this approach, to make it more clear what's needed in a solution? From: Burton-West, Tom [tburt...@umich.edu] Sent: Friday, October 15, 2010 11:49 AM To: solr-user@lucene.apache.org Subject: filter query from external list of Solr unique IDs At the Lucene Revolution conference I asked about efficiently building a filter query from an external list of Solr unique ids. Some use cases I can think of are: 1) personal sub-collections (in our case a user can create a small subset of our 6.5 million doc collection and then run filter queries against it) 2) tagging documents 3) access control lists 4) anything that needs complex relational joins 5) a sort of alternative to incremental field updating (i.e. update in an external database or kv store) 6) Grant's clustering cluster points and similar apps. Grant pointed to SOLR 1715, but when I looked on JIRA, there doesn't seem to be any work on it yet. Hoss mentioned a couple of ideas: 1) sub-classing query parser 2) Having the app query a database and somehow passing something to Solr or lucene for the filter query Can Hoss or someone else point me to more detailed information on what might be involved in the two ideas listed above? 
Is somehow keeping an up-to-date map of unique Solr ids to internal Lucene ids needed to implement this or is that a separate issue? Tom Burton-West http://www.hathitrust.org/blogs/large-scale-search
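For scale, here are the naive approach and one later alternative side by side. The Boolean-clause ceiling Demian mentions applies to the first form; the {!terms} query parser shown second was added to Solr well after this 2010 thread, so treat it as a forward-looking note rather than an option available here.

```python
# Two ways to build a filter query string from an external ID list.
def boolean_fq(ids):
    """Naive form: one Boolean clause per ID; trips maxBooleanClauses eventually."""
    return "(" + " OR ".join("id:%s" % i for i in ids) + ")"

def terms_fq(ids):
    """Terms query parser form (later Solr releases): flat list, no clause limit."""
    return "{!terms f=id}" + ",".join(str(i) for i in ids)

ids = [1, 2, 3]
print(boolean_fq(ids))  # (id:1 OR id:2 OR id:3)
print(terms_fq(ids))    # {!terms f=id}1,2,3
```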
RE: solr.WordDelimiterFilterFactory problem with hyphenated terms?
I don't think the behavior is correct. The first example, with just one gap, does NOT match. The second example, with an extra second gap, DOES match. It seems that the term collapsing (eighteenth-century -> eighteenthcentury) somehow throws off the position sequence, forcing you to add an extra gap in order to get a match. It's good to know that slop is an option to work around this problem... but it still seems to me that something isn't working the way it is supposed to in this particular case. - Demian -Original Message- From: Robert Muir [mailto:rcm...@gmail.com] Sent: Friday, April 09, 2010 12:05 PM To: solr-user@lucene.apache.org Subject: Re: solr.WordDelimiterFilterFactory problem with hyphenated terms? but this behavior is correct, as you have position increments enabled. if you want the second query (which has 2 gaps) to match, you need to either use slop, or disable these increments altogether. On Fri, Apr 9, 2010 at 11:44 AM, Demian Katz demian.k...@villanova.edu wrote: I've given it a try, and it definitely seems to have improved the situation. However, there is still one weird case that's clearly related to term positions. If I do this search, it fails: title:"love customs in eighteenthcentury spain" ...but if I do this search, it succeeds: title:"love customs in in eighteenthcentury spain" (note the duplicate "in"). - Demian -Original Message- From: Erick Erickson [mailto:erickerick...@gmail.com] Sent: Thursday, April 08, 2010 11:20 AM To: solr-user@lucene.apache.org Subject: Re: solr.WordDelimiterFilterFactory problem with hyphenated terms? I'm not all that familiar with the underlying issues, but of the two I'd pick moving the WordDelimiterFactory rather than setting increments = false.
But that's at least partly a guess Best Erick On Thu, Apr 8, 2010 at 11:00 AM, Demian Katz demian.k...@villanova.eduwrote: Thanks for looking into this -- I appreciate the help (and feel a little better that there seems to be a bug at work here and not just my total incomprehension). Sorry for any confusion over the UnicodeNormalizationFactory -- that's actually a plug-in from the SolrMarc project ( http://code.google.com/p/solrmarc/) that slipped into my example. Also, as you guessed, my default operator is indeed set to AND. It sounds to me that, of your two proposed work-arounds, moving the StopFilterFactory after WordDelimiterFactory is the least disruptive. I'm guessing that disabling position increments across the board might have implications for other types of phrase searches, while filtering stopwords later in the chain should be more functionally equivalent, if slightly less efficient (potentially more terms to examine). Would you agree with this assessment? If not, what possible negative side effects am I forgetting about? thanks, Demian -Original Message- From: Erick Erickson [mailto:erickerick...@gmail.com] Sent: Wednesday, April 07, 2010 10:04 PM To: solr-user@lucene.apache.org Subject: Re: solr.WordDelimiterFilterFactory problem with hyphenated terms? Well, for a quick trial using trunk, I had to remove the UnicodeNormalizationFactory, is that yours? But with that removed, I get the results you do, ASSUMING that you've set your default operator to AND in schema.xml... Believe it or not, it all changes and all your queries return a hit if you do one of two things (I did this in both index and query when testing 'cause I'm lazy): 1 move the inclusion of the StopFilterFactory after WordDelimiterFactory or 2 for StopFilterFactory, set enablePositionIncrements=false I think either of these might work in your situation... 
On doing some more investigation, it appears that if a hyphenated word is immediately after a stopword AND the above is true (stop factory included before WordDelimiterFactory and enablePositionIncrements=true), then the search fails. I indexed this title: "Love-customs in eighteenth-century Spain for nineteenth-century" Searching in solr/admin/form.jsp for: title:(nineteenth-century) fails. But if I remove the "for" from the title, the above query works. Searching for title:(love-customs) always works. Finally, (and it's *really* time to go to sleep now), just setting enablePositionIncrements=false in the index portion of the schema also causes things to work. Developer folks: I didn't see anything in a quick look in SOLR or Lucene JIRAs, should I refine this a bit (really, sleepy time is near) and add a JIRA? Best
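The behavior discussed above -- a stopword removed with enablePositionIncrements=true leaves a positional hole -- can be illustrated with a toy position assigner. This is a simplification for intuition, not Solr code:

```python
# Toy model: stopword removal with position increments preserved leaves a gap.
def indexed_positions(text, stopwords):
    """Assign 1-based token positions, dropping stopwords but keeping their slots."""
    out = []
    for pos, token in enumerate(text.lower().split(), start=1):
        if token not in stopwords:
            out.append((token, pos))
    return out

idx = indexed_positions("Love customs in eighteenth-century Spain", {"in"})
print(idx)  # [('love', 1), ('customs', 2), ('eighteenth-century', 4), ('spain', 5)]
```

Position 3 is the hole left by "in". The failures in this thread involve how the catenated token produced by WordDelimiterFilterFactory lands relative to such a gap; phrase slop (e.g. ~1) papers over the resulting off-by-one.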
RE: solr.WordDelimiterFilterFactory problem with hyphenated terms?
I've given it a try, and it definitely seems to have improved the situation. However, there is still one weird case that's clearly related to term positions. If I do this search, it fails: title:"love customs in eighteenthcentury spain" ...but if I do this search, it succeeds: title:"love customs in in eighteenthcentury spain" (note the duplicate "in"). - Demian -Original Message- From: Erick Erickson [mailto:erickerick...@gmail.com] Sent: Thursday, April 08, 2010 11:20 AM To: solr-user@lucene.apache.org Subject: Re: solr.WordDelimiterFilterFactory problem with hyphenated terms? I'm not all that familiar with the underlying issues, but of the two I'd pick moving the WordDelimiterFactory rather than setting increments = false. But that's at least partly a guess Best Erick On Thu, Apr 8, 2010 at 11:00 AM, Demian Katz demian.k...@villanova.edu wrote: Thanks for looking into this -- I appreciate the help (and feel a little better that there seems to be a bug at work here and not just my total incomprehension). Sorry for any confusion over the UnicodeNormalizationFactory -- that's actually a plug-in from the SolrMarc project ( http://code.google.com/p/solrmarc/) that slipped into my example. Also, as you guessed, my default operator is indeed set to AND. It sounds to me that, of your two proposed work-arounds, moving the StopFilterFactory after WordDelimiterFactory is the least disruptive. I'm guessing that disabling position increments across the board might have implications for other types of phrase searches, while filtering stopwords later in the chain should be more functionally equivalent, if slightly less efficient (potentially more terms to examine). Would you agree with this assessment? If not, what possible negative side effects am I forgetting about?
thanks, Demian -Original Message- From: Erick Erickson [mailto:erickerick...@gmail.com] Sent: Wednesday, April 07, 2010 10:04 PM To: solr-user@lucene.apache.org Subject: Re: solr.WordDelimiterFilterFactory problem with hyphenated terms? Well, for a quick trial using trunk, I had to remove the UnicodeNormalizationFactory, is that yours? But with that removed, I get the results you do, ASSUMING that you've set your default operator to AND in schema.xml... Believe it or not, it all changes and all your queries return a hit if you do one of two things (I did this in both index and query when testing 'cause I'm lazy): 1 move the inclusion of the StopFilterFactory after WordDelimiterFactory or 2 for StopFilterFactory, set enablePositionIncrements=false I think either of these might work in your situation... On doing some more investigation, it appears that if a hyphenated word is immediately after a stopword AND the above is true (stop factory included before WordDelimiterFactory and enablePositionIncrements=true), then the search fails. I indexed this title: Love-customs in eighteenth-century Spain for nineteenth-century Searching in solr/admin/form.jsp for: title:(nineteenth-century) fails. But if I remove the for from the title, the above query works. Searching for title:(love-customs) always works. Finally, (and it's *really* time to go to sleep now), just setting enablePositionIncrements=false in the index portion of the schema also causes things to work. Developer folks: I didn't see anything in a quick look in SOLR or Lucene JIRAs, should I refine this a bit (really, sleepy time is near) and add a JIRA? Best Erick On Wed, Apr 7, 2010 at 10:29 AM, Demian Katz demian.k...@villanova.eduwrote: Hello. It has been a few weeks, and I haven't gotten any responses. Perhaps my question is too complicated -- maybe a better approach is to try to gain enough knowledge to answer it myself. 
My gut feeling is still that it's something to do with the way term positions are getting handled by the WordDelimiterFilterFactory, but I don't have a good understanding of how term positions are calculated or factored into searching. Can anyone recommend some good reading to familiarize myself with these concepts in better detail? thanks, Demian From: Demian Katz Sent: Tuesday, March 16, 2010 9:47 AM To: solr-user@lucene.apache.org Subject: solr.WordDelimiterFilterFactory problem with hyphenated terms? This is my first post on this list -- apologies if this has been discussed before; I didn't come upon anything exactly equivalent in searching the archives via Google. I'm using Solr 1.4 as part of the VuFind application, and I just noticed that searches for hyphenated terms are failing in strange ways. I strongly suspect it has something to do
with the solr.WordDelimiterFilterFactory filter, but I'm not exactly sure what.
RE: solr.WordDelimiterFilterFactory problem with hyphenated terms?
Thanks for looking into this -- I appreciate the help (and feel a little better that there seems to be a bug at work here and not just my total incomprehension). Sorry for any confusion over the UnicodeNormalizationFactory -- that's actually a plug-in from the SolrMarc project (http://code.google.com/p/solrmarc/) that slipped into my example. Also, as you guessed, my default operator is indeed set to AND. It sounds to me that, of your two proposed work-arounds, moving the StopFilterFactory after WordDelimiterFactory is the least disruptive. I'm guessing that disabling position increments across the board might have implications for other types of phrase searches, while filtering stopwords later in the chain should be more functionally equivalent, if slightly less efficient (potentially more terms to examine). Would you agree with this assessment? If not, what possible negative side effects am I forgetting about? thanks, Demian -Original Message- From: Erick Erickson [mailto:erickerick...@gmail.com] Sent: Wednesday, April 07, 2010 10:04 PM To: solr-user@lucene.apache.org Subject: Re: solr.WordDelimiterFilterFactory problem with hyphenated terms? Well, for a quick trial using trunk, I had to remove the UnicodeNormalizationFactory, is that yours? But with that removed, I get the results you do, ASSUMING that you've set your default operator to AND in schema.xml... Believe it or not, it all changes and all your queries return a hit if you do one of two things (I did this in both index and query when testing 'cause I'm lazy): 1 move the inclusion of the StopFilterFactory after WordDelimiterFactory or 2 for StopFilterFactory, set enablePositionIncrements=false I think either of these might work in your situation... On doing some more investigation, it appears that if a hyphenated word is immediately after a stopword AND the above is true (stop factory included before WordDelimiterFactory and enablePositionIncrements=true), then the search fails. 
I indexed this title: "Love-customs in eighteenth-century Spain for nineteenth-century" Searching in solr/admin/form.jsp for: title:(nineteenth-century) fails. But if I remove the "for" from the title, the above query works. Searching for title:(love-customs) always works. Finally, (and it's *really* time to go to sleep now), just setting enablePositionIncrements=false in the index portion of the schema also causes things to work. Developer folks: I didn't see anything in a quick look in SOLR or Lucene JIRAs, should I refine this a bit (really, sleepy time is near) and add a JIRA? Best Erick On Wed, Apr 7, 2010 at 10:29 AM, Demian Katz demian.k...@villanova.edu wrote: Hello. It has been a few weeks, and I haven't gotten any responses. Perhaps my question is too complicated -- maybe a better approach is to try to gain enough knowledge to answer it myself. My gut feeling is still that it's something to do with the way term positions are getting handled by the WordDelimiterFilterFactory, but I don't have a good understanding of how term positions are calculated or factored into searching. Can anyone recommend some good reading to familiarize myself with these concepts in better detail? thanks, Demian From: Demian Katz Sent: Tuesday, March 16, 2010 9:47 AM To: solr-user@lucene.apache.org Subject: solr.WordDelimiterFilterFactory problem with hyphenated terms? This is my first post on this list -- apologies if this has been discussed before; I didn't come upon anything exactly equivalent in searching the archives via Google. I'm using Solr 1.4 as part of the VuFind application, and I just noticed that searches for hyphenated terms are failing in strange ways. I strongly suspect it has something to do with the solr.WordDelimiterFilterFactory filter, but I'm not exactly sure what. The problem is that I have a record with the title "Love customs in eighteenth-century Spain". Depending on how I search for this, I get successes or failures in a seemingly unpredictable pattern.
Demonstration queries below were tested using the direct Solr administration tool, just to eliminate any VuFind-related factors from the equation while debugging.
Queries that work:
title:(Love customs in eighteenth century Spain) // no hyphen, no phrases
title:("Love customs in eighteenth-century Spain") // phrase search on whole title, with hyphen
Queries that fail:
title:(Love customs in eighteenth-century Spain) // hyphen, no phrases
title:("Love customs in eighteenth century Spain") // phrase search on whole title, without hyphen
title:(Love customs in "eighteenth-century" Spain) // hyphenated word as phrase
title:(Love customs in "eighteenth century" Spain) // hyphenated word as phrase, hyphen removed
RE: solr.WordDelimiterFilterFactory problem with hyphenated terms?
Hello. It has been a few weeks, and I haven't gotten any responses. Perhaps my question is too complicated -- maybe a better approach is to try to gain enough knowledge to answer it myself. My gut feeling is still that it's something to do with the way term positions are getting handled by the WordDelimiterFilterFactory, but I don't have a good understanding of how term positions are calculated or factored into searching. Can anyone recommend some good reading to familiarize myself with these concepts in better detail? thanks, Demian From: Demian Katz Sent: Tuesday, March 16, 2010 9:47 AM To: solr-user@lucene.apache.org Subject: solr.WordDelimiterFilterFactory problem with hyphenated terms? This is my first post on this list -- apologies if this has been discussed before; I didn't come upon anything exactly equivalent in searching the archives via Google. I'm using Solr 1.4 as part of the VuFind application, and I just noticed that searches for hyphenated terms are failing in strange ways. I strongly suspect it has something to do with the solr.WordDelimiterFilterFactory filter, but I'm not exactly sure what. The problem is that I have a record with the title Love customs in eighteenth-century Spain. Depending on how I search for this, I get successes or failures in a seemingly unpredictable pattern. Demonstration queries below were tested using the direct Solr administration tool, just to eliminate any VuFind-related factors from the equation while debugging. 
Queries that work:
title:(Love customs in eighteenth century Spain) // no hyphen, no phrases
title:("Love customs in eighteenth-century Spain") // phrase search on whole title, with hyphen
Queries that fail:
title:(Love customs in eighteenth-century Spain) // hyphen, no phrases
title:("Love customs in eighteenth century Spain") // phrase search on whole title, without hyphen
title:(Love customs in "eighteenth-century" Spain) // hyphenated word as phrase
title:(Love customs in "eighteenth century" Spain) // hyphenated word as phrase, hyphen removed
Here is VuFind's text field type definition:
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    <filter class="schema.UnicodeNormalizationFilterFactory" version="icu4j" composed="false" remove_diacritics="true" remove_modifiers="true" fold="true"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    <filter class="schema.UnicodeNormalizationFilterFactory" version="icu4j" composed="false" remove_diacritics="true" remove_modifiers="true" fold="true"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
  </analyzer>
</fieldType>
I did notice that the text field type in VuFind's schema has catenateWords and catenateNumbers turned on in both the index and query analyzer chains. It is my understanding that these options should be disabled for the query chain and only enabled for the index chain. However, this may be a red herring -- I have already tried changing this setting, but it didn't change the success/failure pattern described above. I have also played with the preserveOriginal setting without apparent effect. From playing with the Field Analysis tool, I notice that there is a gap in the term position sequence after analysis... but I'm not sure if this is significant. Has anybody else run into this sort of problem? Any ideas on a fix? thanks, Demian
solr.WordDelimiterFilterFactory problem with hyphenated terms?
This is my first post on this list -- apologies if this has been discussed before; I didn't come upon anything exactly equivalent in searching the archives via Google. I'm using Solr 1.4 as part of the VuFind application, and I just noticed that searches for hyphenated terms are failing in strange ways. I strongly suspect it has something to do with the solr.WordDelimiterFilterFactory filter, but I'm not exactly sure what. The problem is that I have a record with the title "Love customs in eighteenth-century Spain". Depending on how I search for this, I get successes or failures in a seemingly unpredictable pattern. Demonstration queries below were tested using the direct Solr administration tool, just to eliminate any VuFind-related factors from the equation while debugging.
Queries that work:
title:(Love customs in eighteenth century Spain) // no hyphen, no phrases
title:("Love customs in eighteenth-century Spain") // phrase search on whole title, with hyphen
Queries that fail:
title:(Love customs in eighteenth-century Spain) // hyphen, no phrases
title:("Love customs in eighteenth century Spain") // phrase search on whole title, without hyphen
title:(Love customs in "eighteenth-century" Spain) // hyphenated word as phrase
title:(Love customs in "eighteenth century" Spain) // hyphenated word as phrase, hyphen removed
Here is VuFind's text field type definition:
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    <filter class="schema.UnicodeNormalizationFilterFactory" version="icu4j" composed="false" remove_diacritics="true" remove_modifiers="true" fold="true"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    <filter class="schema.UnicodeNormalizationFilterFactory" version="icu4j" composed="false" remove_diacritics="true" remove_modifiers="true" fold="true"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
  </analyzer>
</fieldType>
I did notice that the text field type in VuFind's schema has catenateWords and catenateNumbers turned on in both the index and query analyzer chains. It is my understanding that these options should be disabled for the query chain and only enabled for the index chain. However, this may be a red herring -- I have already tried changing this setting, but it didn't change the success/failure pattern described above. I have also played with the preserveOriginal setting without apparent effect. From playing with the Field Analysis tool, I notice that there is a gap in the term position sequence after analysis... but I'm not sure if this is significant. Has anybody else run into this sort of problem? Any ideas on a fix? thanks, Demian
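For reference, the convention Demian cites (catenation enabled at index time only) would make the query-side WordDelimiterFilterFactory look like the fragment below. A sketch only -- as he notes, changing these flags did not alter the reported success/failure pattern:

```xml
<!-- query analyzer: word-part generation on, catenation off -->
<filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="1" generateNumberParts="1"
        catenateWords="0" catenateNumbers="0"
        catenateAll="0" splitOnCaseChange="1"/>
```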