Re: Cloud Deployment Strategy... In the Cloud
Ant is very good at this sort of thing, and easier for Java devs to learn than Make. Python has a module called Fabric that is also very fine, but for my dev ops it is another thing to learn. I tend to divide things into three categories:

- Things that have to do with system setup and need to be run as root. For this I write a bash script (I should learn Puppet, but...).
- Things that have to do with one-time installation as a solr admin user with /bin/bash, including upconfig. For this I use an ant build.
- Normal operational procedures. For this I typically use the Solr admin UI or scripts, but I wish I had time to create a good webapp (or money to purchase Fusion).

On Thu, Sep 24, 2015 at 12:39 AM, Erick Erickson wrote:
> bq: What tools do you use for the "auto setup"? How do you get your config automatically uploaded to zk?
>
> Both uploading the config to ZK and creating collections are one-time operations, usually done manually. Currently uploading the config set is accomplished with zkCli (yes, it's a little clumsy). There's a JIRA to put this into solr/bin as a command, though; in any given situation they'd be easy enough to script with a shell script or wizard.
>
> Best,
> Erick
>
> On Wed, Sep 23, 2015 at 7:33 PM, Steve Davids wrote:
> > What tools do you use for the "auto setup"? How do you get your config automatically uploaded to zk?
> >
> > On Tue, Sep 22, 2015 at 2:35 PM, Gili Nachum wrote:
> > > Our auto setup sequence is:
> > > 1. Deploy 3 ZK nodes.
> > > 2. Deploy Solr nodes and start them connecting to ZK.
> > > 3. Upload collection config to ZK.
> > > 4. Call the create-collection REST API.
> > > 5. Done. SolrCloud ready to work.
> > >
> > > Don't yet have automation for replacing or adding a node.
> > > On Sep 22, 2015 18:27, "Steve Davids" wrote:
> > > > Hi,
> > > >
> > > > I am trying to come up with a repeatable process for deploying a Solr Cloud cluster from scratch along with the appropriate security groups, auto scaling groups, and custom Solr plugin code. I saw that LucidWorks created a Solr Scale Toolkit, but that seems to be more of a one-shot deal than really setting up your environment for the long haul. Here is where we are at right now:
> > > >
> > > > 1. The ZooKeeper ensemble is easily brought up via a CloudFormation script.
> > > > 2. We have an RPM built to lay down the Solr distribution + custom plugins + configuration.
> > > > 3. Solr machines come up and connect to ZK.
> > > >
> > > > Now, we are using Puppet, which could easily create the core.properties file for the corresponding core and have ZK get bootstrapped, but that seems to be a no-no these days... So, can anyone think of a way to get ZK bootstrapped automatically with pre-configured collection configurations? Also, is there a recommendation on how to deal with machines that are coming/going? As I see it, machines will be getting spun up and terminated from time to time, and we need a process for dealing with that. The first idea was to just use a common node name, so if a machine was terminated a new one could come up and replace that particular node, but on second thought it would seem to require an auto scaling group *per* node (so it knows what node name it is). For a large cluster this seems crazy from a maintenance perspective, especially if you want to be elastic with regard to the number of live replicas for peak times. So the next idea was to have some outside observer listen for when new EC2 instances are created or terminated (via CloudWatch SQS) and make the appropriate API calls to either add the replica or delete it. This seems doable, but perhaps not the simplest solution that could work.
> > > >
> > > > I was hoping others have already gone through this and have valuable advice to give; we are trying to set up Solr Cloud the "right way" so we don't get nickel-and-dimed to death from an Ops perspective.
> > > >
> > > > Thanks,
> > > >
> > > > -Steve
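To make the bootstrap concrete, the one-time upconfig/create steps and the observer's add/remove calls might look roughly like the sketch below for Solr 5.x. The host names, collection name `mycoll`, config name `myconf`, and node/replica identifiers are all placeholders, not values from this thread:

```
# One-time: upload a config set to ZooKeeper via zkcli.sh
# (ships under server/scripts/cloud-scripts in Solr 5.x)
server/scripts/cloud-scripts/zkcli.sh -zkhost zk1:2181,zk2:2181,zk3:2181 \
  -cmd upconfig -confdir /path/to/conf -confname myconf

# One-time: create the collection against that config set
curl "http://solr1:8983/solr/admin/collections?action=CREATE&name=mycoll&numShards=2&replicationFactor=2&collection.configName=myconf"

# An outside observer reacting to instance churn could then call:
curl "http://solr1:8983/solr/admin/collections?action=ADDREPLICA&collection=mycoll&shard=shard1&node=newhost:8983_solr"
curl "http://solr1:8983/solr/admin/collections?action=DELETEREPLICA&collection=mycoll&shard=shard1&replica=core_node3"
```

Wrapping the last two calls in whatever consumes the CloudWatch notifications is the part that remains site-specific.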
Re: Solr authentication - Error 401 Unauthorized
It seems that you have secured Solr so thoroughly that you cannot now run bin/solr status! bin/solr has no arguments as yet for providing a username/password; as mostly a user, like you, I'm not sure of the roadmap. I think you should relax those restrictions a bit and try again.

On Fri, Sep 11, 2015 at 5:06 AM, Merlin Morgenstern < merlin.morgenst...@gmail.com> wrote:
> I have secured Solr Cloud via basic authentication.
>
> Now I am having difficulties creating cores and getting status information. Solr keeps telling me that the request is unauthorized. However, I have access to the admin UI after login.
>
> How do I configure Solr to use the basic authentication credentials?
>
> This is the error message:
>
> /opt/solr-5.3.0/bin/solr status
>
> Found 1 Solr nodes:
>
> Solr process 31114 running on port 8983
>
> ERROR: Failed to get system information from http://localhost:8983/solr due to: org.apache.http.client.ClientProtocolException: Expected JSON response from server but received:
>
> Error 401 Unauthorized
> HTTP ERROR 401
> Problem accessing /solr/admin/info/system. Reason: Unauthorized
> Powered by Jetty://
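In the meantime, since bin/solr status just calls the system-info endpoint over HTTP, you can at least verify that the credentials themselves work by supplying them to curl directly (the user/password pair here is a placeholder):

```
# Fetch the same system info bin/solr status needs, with Basic auth
curl -u myuser:mypassword "http://localhost:8983/solr/admin/info/system?wt=json"

# Without -u, the same request should return the Jetty 401 page quoted
# in the error message, confirming the protection is active
curl "http://localhost:8983/solr/admin/info/system?wt=json"
```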
Re: Solr authentication - Error 401 Unauthorized
Noble,

You should also look at this if it is intended to be more than an internal API. Using the minor protections I added to test SOLR-8000, I was able to reproduce a problem very much like this:

bin/solr healthcheck -z localhost:2181 -c mycollection

since Solr /select is protected...

On Sat, Sep 12, 2015 at 9:40 AM, Dan Davis <dansm...@gmail.com> wrote:
> It seems that you have secured Solr so thoroughly that you cannot now run bin/solr status!
>
> bin/solr has no arguments as yet for providing a username/password; as mostly a user, like you, I'm not sure of the roadmap.
>
> I think you should relax those restrictions a bit and try again.
Re: Issue Using Solr 5.3 Authentication and Authorization Plugins
Kevin & Noble,

I've manually verified the fix for SOLR-8000, but not yet for SOLR-8004.

I reproduced the initial problem with reloading security.json after restarting both Solr and ZooKeeper. I verified using zkcli.sh that ZooKeeper does retain the changes to the file after using /solr/admin/authorization, and that therefore the problem was Solr.

After building solr-5.3.1-SNAPSHOT.tgz with ant package (because I don't know how to give parameters to ant server), I expanded it, copied in the core data, and then started it. I was prompted for a password, and it let me in once the password was given.

I'll probably get to SOLR-8004 shortly, since I have both environments built and working.

It also occurs to me that it might be better to forbid all permissions and grant specific permissions to specific roles. Is there a comprehensive list of the permissions available?

On Tue, Sep 8, 2015 at 1:07 PM, Kevin Lee <kgle...@yahoo.com.invalid> wrote:
> Thanks Dan! Please let us know what you find. I'm interested to know if this is an issue with anyone else's setup or if I have an issue in my local configuration that is still preventing it from working on start/restart.
>
> - Kevin
>
> > On Sep 5, 2015, at 8:45 AM, Dan Davis <dansm...@gmail.com> wrote:
> >
> > Kevin & Noble,
> >
> > I'll take it on to test this. I've built from source before, and I've wanted this authorization capability for a while.
> >
> > On Fri, Sep 4, 2015 at 9:59 AM, Kevin Lee <kgle...@yahoo.com.invalid> wrote:
> >
> >> Noble,
> >>
> >> Does SOLR-8000 need to be re-opened? Has anyone else been able to test the restart fix?
> >>
> >> At startup, these are the log messages that say there is no security configuration and the plugins aren't being used even though security.json is in Zookeeper:
> >> 2015-09-04 08:06:21.205 INFO (main) [ ] o.a.s.c.CoreContainer Security conf doesn't exist. Skipping setup for authorization module.
> >> 2015-09-04 08:06:21.205 INFO (main) [ ] o.a.s.c.CoreContainer No authentication plugin used.
> >>
> >> Thanks,
> >> Kevin
> >>
> >>> On Sep 4, 2015, at 5:47 AM, Noble Paul <noble.p...@gmail.com> wrote:
> >>>
> >>> There are no download links for the 5.3.x branch till we do a bug-fix release.
> >>>
> >>> If you wish to download the trunk nightly (which is not the same as 5.3.0), check here: https://builds.apache.org/job/Solr-Artifacts-trunk/lastSuccessfulBuild/artifact/solr/package/
> >>>
> >>> If you wish to get the binaries for the 5.3 branch you will have to make it (you will need to install svn and ant).
> >>>
> >>> Here are the steps:
> >>>
> >>> svn checkout http://svn.apache.org/repos/asf/lucene/dev/branches/lucene_solr_5_3/
> >>> cd lucene_solr_5_3/solr
> >>> ant server
> >>>
> >>> On Fri, Sep 4, 2015 at 4:11 PM, davidphilip cherian <davidphilipcher...@gmail.com> wrote:
> >>>> Hi Kevin/Noble,
> >>>>
> >>>> What is the download link to take the latest? What are the steps to compile it, test and use? We also have a use case for this feature in Solr, so the above info would help a lot to get started.
> >>>>
> >>>> Thanks.
> >>>>
> >>>> On Fri, Sep 4, 2015 at 1:45 PM, Kevin Lee <kgle...@yahoo.com.invalid> wrote:
> >>>>
> >>>>> Thanks, I downloaded the source and compiled it and replaced the jar file in the dist and solr-webapp's WEB-INF/lib directory. It does seem to be protecting the Collections API reload command now, as long as I upload the security.json after startup of the Solr instances. If I shut down and bring the instances back up, the security is no longer in place and I have to upload the security.json again for it to take effect.
> >>>>>
> >>>>> - Kevin
> >>>>>
> >>>>>> On Sep 3, 2015, at 10:29 PM, Noble Paul <noble.p...@gmail.com> wrote:
> >>>>>>
> >>>>>> Both these are committed.
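For anyone following along, the permission edits discussed here are made against the /admin/authorization endpoint. A sketch of granting collection-admin-edit to a role might look like this; the user, password, and role name are examples, not the configuration from this thread:

```
# Grant the built-in collection-admin-edit permission to a role
curl --user solr:SolrRocks http://localhost:8983/solr/admin/authorization \
  -H 'Content-type:application/json' \
  -d '{"set-permission": {"name": "collection-admin-edit", "role": "adminRole"}}'

# Map a user to that role
curl --user solr:SolrRocks http://localhost:8983/solr/admin/authorization \
  -H 'Content-type:application/json' \
  -d '{"set-user-role": {"admin-user": ["adminRole"]}}'
```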
Re: Issue Using Solr 5.3 Authentication and Authorization Plugins
SOLR-8004 also appears to work for me. I manually edited security.json and did putfile. I didn't bother with the browse permission, because it was Kevin's workaround. solr-5.3.1-SNAPSHOT did challenge me for credentials when going to curl http://localhost:8983/solr/admin/collections?action=CREATE and so on...

On Thu, Sep 10, 2015 at 11:10 PM, Dan Davis <dansm...@gmail.com> wrote:
> Kevin & Noble,
>
> I've manually verified the fix for SOLR-8000, but not yet for SOLR-8004.
>
> I reproduced the initial problem with reloading security.json after restarting both Solr and ZooKeeper. I verified using zkcli.sh that ZooKeeper does retain the changes to the file after using /solr/admin/authorization, and that therefore the problem was Solr.
>
> After building solr-5.3.1-SNAPSHOT.tgz with ant package (because I don't know how to give parameters to ant server), I expanded it, copied in the core data, and then started it. I was prompted for a password, and it let me in once the password was given.
>
> I'll probably get to SOLR-8004 shortly, since I have both environments built and working.
>
> It also occurs to me that it might be better to forbid all permissions and grant specific permissions to specific roles. Is there a comprehensive list of the permissions available?
Re: Issue Using Solr 5.3 Authentication and Authorization Plugins
Kevin & Noble,

I'll take it on to test this. I've built from source before, and I've wanted this authorization capability for a while.

On Fri, Sep 4, 2015 at 9:59 AM, Kevin Lee wrote:
> Noble,
>
> Does SOLR-8000 need to be re-opened? Has anyone else been able to test the restart fix?
>
> At startup, these are the log messages that say there is no security configuration and the plugins aren't being used even though security.json is in Zookeeper:
> 2015-09-04 08:06:21.205 INFO (main) [ ] o.a.s.c.CoreContainer Security conf doesn't exist. Skipping setup for authorization module.
> 2015-09-04 08:06:21.205 INFO (main) [ ] o.a.s.c.CoreContainer No authentication plugin used.
>
> Thanks,
> Kevin
>
> > On Sep 4, 2015, at 5:47 AM, Noble Paul wrote:
> >
> > There are no download links for the 5.3.x branch till we do a bug-fix release.
> >
> > If you wish to download the trunk nightly (which is not the same as 5.3.0), check here: https://builds.apache.org/job/Solr-Artifacts-trunk/lastSuccessfulBuild/artifact/solr/package/
> >
> > If you wish to get the binaries for the 5.3 branch you will have to make it (you will need to install svn and ant).
> >
> > Here are the steps:
> >
> > svn checkout http://svn.apache.org/repos/asf/lucene/dev/branches/lucene_solr_5_3/
> > cd lucene_solr_5_3/solr
> > ant server
> >
> > On Fri, Sep 4, 2015 at 4:11 PM, davidphilip cherian wrote:
> >> Hi Kevin/Noble,
> >>
> >> What is the download link to take the latest? What are the steps to compile it, test and use? We also have a use case for this feature in Solr, so the above info would help a lot to get started.
> >>
> >> Thanks.
> >>
> >> On Fri, Sep 4, 2015 at 1:45 PM, Kevin Lee wrote:
> >>
> >>> Thanks, I downloaded the source and compiled it and replaced the jar file in the dist and solr-webapp's WEB-INF/lib directory. It does seem to be protecting the Collections API reload command now, as long as I upload the security.json after startup of the Solr instances. If I shut down and bring the instances back up, the security is no longer in place and I have to upload the security.json again for it to take effect.
> >>>
> >>> - Kevin
> >>>
> >>>> On Sep 3, 2015, at 10:29 PM, Noble Paul wrote:
> >>>>
> >>>> Both these are committed. If you could test with the latest 5.3 branch it would be helpful.
> >>>>
> >>>> On Wed, Sep 2, 2015 at 5:11 PM, Noble Paul wrote:
> >>>>> I opened a ticket for the same: https://issues.apache.org/jira/browse/SOLR-8004
> >>>>>
> >>>>> On Wed, Sep 2, 2015 at 1:36 PM, Kevin Lee wrote:
> >>>>>> I've found that completely exiting Chrome or Firefox and opening it back up re-prompts for credentials when they are required. It was re-prompting with the /browse path where authentication was working each time I completely exited and started the browser again; however, it won't re-prompt unless you exit completely and close all running instances, so I closed all instances each time to test.
> >>>>>>
> >>>>>> However, to make sure, I ran it via the command line via curl as suggested, and it still does not give any authentication error when trying to issue the command via curl. I get a success response from all the Solr instances that the reload was successful.
> >>>>>>
> >>>>>> Not sure why the pre-canned permissions aren't working, but the one for the request handler at the /browse path is.
> >>>>>>
> >>>>>>> On Sep 1, 2015, at 11:03 PM, Noble Paul wrote:
> >>>>>>>
> >>>>>>> "However, after uploading the new security.json and restarting the web browser,"
> >>>>>>>
> >>>>>>> The browser remembers your login, so it is unlikely to prompt for the credentials again.
> >>>>>>>
> >>>>>>> Why don't you try the RELOAD operation using the command line (curl)?
> >>>>>>>
> >>>>>>> On Tue, Sep 1, 2015 at 10:31 PM, Kevin Lee wrote:
> >>>>>>>> The restart issues aside, I'm trying to lock down usage of the Collections API, but that also does not seem to be working either.
> >>>>>>>>
> >>>>>>>> Here is my security.json. I'm using the "collection-admin-edit" permission and assigning it to the "adminRole". However, after uploading the new security.json and restarting the web browser, it doesn't seem to be requiring credentials when calling the RELOAD action on the Collections API. The only thing that seems to work is the custom permission "browse", which is requiring authentication before allowing me to pull up the page. Am I using the permissions correctly for the RuleBasedAuthorizationPlugin?
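For reference, a minimal security.json along the lines Kevin describes might look like the sketch below. The credentials hash is the stock solr/SolrRocks example from the Solr documentation, and the collection, path, and role names are illustrative, not Kevin's actual file:

```json
{
  "authentication": {
    "class": "solr.BasicAuthPlugin",
    "credentials": {
      "solr": "IV0EHq1OnNrj6gvRCwvFwTrZ1+z1oBbnQdiVC3otuq0= Ndd7LKvVBAaZIF0QAVi1ekCfAJXr1GGfLtverpYCTmXY="
    }
  },
  "authorization": {
    "class": "solr.RuleBasedAuthorizationPlugin",
    "permissions": [
      { "name": "collection-admin-edit", "role": "adminRole" },
      { "name": "browse", "collection": "mycoll", "path": "/browse", "role": "browseRole" }
    ],
    "user-role": { "solr": ["adminRole", "browseRole"] }
  }
}
```

Uploading this with zkcli.sh putfile before node startup, versus POSTing edits to /admin/authorization afterwards, is exactly the distinction the restart bug in this thread hinges on.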
Re: analyzer, indexAnalyzer and queryAnalyzer
Hi Doug, nice write-up, and 2 questions:

- You write your own QParser plugins - can one keep the features of edismax for field boosting/phrase-match boosting by subclassing edismax? Assuming yes...
- What do pf2 and pf3 do in the edismax query parser?

hon-lucene-synonyms plugin link corrections:
http://nolanlawson.com/2012/10/31/better-synonym-handling-in-solr/
https://github.com/healthonnet/hon-lucene-synonyms

On Wed, Apr 29, 2015 at 9:24 PM, Doug Turnbull dturnb...@opensourceconnections.com wrote:

So Solr has the idea of a query parser. The query parser is a convenient way of passing a search string to Solr and having Solr parse it into underlying Lucene queries. You can see a list of query parsers here: http://wiki.apache.org/solr/QueryParser

What this means is that the query parser does work to pull terms into individual clauses *before* analysis is run. It's a parsing layer that sits outside the analysis chain. This creates problems like the "sea biscuit" problem, whereby we declare "sea biscuit" as a query-time synonym of "seabiscuit". As you may know, synonyms are checked during analysis. However, if the query parser splits up "sea" from "biscuit" before running analysis, the query-time analyzer will fail. The string "sea" is brought by itself to the query-time analyzer and of course won't match "sea biscuit". Same with the string "biscuit" in isolation. If the full string "sea biscuit" were brought to the analyzer, it would see [sea] next to [biscuit] and declare it a synonym of "seabiscuit". Thanks to the query parser, the analyzer has lost the association between the terms, and the terms aren't brought together to the analyzer.

My colleague John Berryman wrote a pretty good blog post on this: http://opensourceconnections.com/blog/2013/10/27/why-is-multi-term-synonyms-so-hard-in-solr/

There are several solutions out there that attempt to address this problem. One from Ted Sullivan at Lucidworks: https://lucidworks.com/blog/solution-for-multi-term-synonyms-in-lucenesolr-using-the-auto-phrasing-tokenfilter/

Another popular one is the hon-lucene-synonyms plugin: http://lucene.apache.org/solr/5_1_0/solr-core/org/apache/solr/search/FieldQParserPlugin.html

Yet another workaround is to use the field query parser: http://lucene.apache.org/solr/5_1_0/solr-core/org/apache/solr/search/FieldQParserPlugin.html

I also tend to write my own query parsers, so on the one hand it's annoying that query parsers have the problems above; on the flip side, Solr makes it very easy to implement whatever parsing you think is appropriate with a small bit of Java/Lucene knowledge.

Hopefully that explanation wasn't too deep, but it's an important thing to know about Solr. Are you asking out of curiosity, or do you have a specific problem?

Thanks
-Doug

On Wed, Apr 29, 2015 at 6:32 PM, Steven White swhite4...@gmail.com wrote:

Hi Doug,

I don't understand what you mean by the following:

"For example, if a user searches for q=hot dogs&defType=edismax&qf=title body the *query parser* *not* the *analyzer* first turns the query into:"

If I have indexAnalyzer and queryAnalyzer in a fieldType that are 100% identical, does the example you provided stand? If so, why? Or do you mean something totally different by "query parser"?

Thanks

Steve

On Wed, Apr 29, 2015 at 4:18 PM, Doug Turnbull dturnb...@opensourceconnections.com wrote:

*1) If the content of indexAnalyzer and queryAnalyzer are exactly the same, that's the same as if I have an analyzer only, right?*

1) Yes

*2) Under the hood, all three are the same thing when it comes to what kind of data and configuration attributes they can take, right?*

2) Yes. Both take in text and output a token stream.

*What I'm trying to figure out is this: besides being able to configure a fieldType to have different analyzer settings at index and query time, there is nothing else that's unique about each.*

The only thing to look out for in Solr land is the query parser. Most Solr query parsers treat whitespace as meaningful. For example, if a user searches for q=hot dogs&defType=edismax&qf=title body, the *query parser*, *not* the *analyzer*, first turns the query into:

(title:hot title:dog) | (body:hot body:dog)

each word of which *then* gets analyzed. This is because the query parser tries to be smart and turn "hot dog" into hot OR dog, or more specifically making them two must clauses. This trips quite a few folks up; you can use the field query parser, which treats the field value as a phrase query.

Hope that helps

--
*Doug Turnbull* | Search Relevance Consultant | OpenSource Connections, LLC | 240.476.9983 | http://www.opensourceconnections.com
Author: Taming Search http://manning.com/turnbull from Manning Publications
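A quick way to observe the pre-analysis split Doug describes is to compare the parsed query that edismax produces against the field query parser, using debugQuery. The collection and field names below are examples, not from this thread:

```
# edismax splits on whitespace before analysis, so "sea" and "biscuit"
# reach the query-time analyzer separately and the synonym cannot fire
curl 'http://localhost:8983/solr/mycoll/select?q=sea%20biscuit&defType=edismax&qf=title%20body&debugQuery=true'

# the field parser hands the whole string "sea biscuit" to the query
# analyzer of one field, so a query-time synonym for seabiscuit can match
curl 'http://localhost:8983/solr/mycoll/select?q={!field f=title}sea biscuit&debugQuery=true'
```

The parsedquery section of the debug output shows the difference directly: per-word clauses in the first case, a single phrase-like query in the second.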
Re: Odp.: solr issue with pdf forms
Steve,

You gave as an example:

Ich�bestätige�mit�meiner�Unterschrift,�dass�alle�Angaben�korrekt�und� vollständig�sind

This sentence is probably from the PDF form label content, rather than form values. Sometimes in PDF, the form's value fields are kept in a separate file. I'm 99% sure Tika won't be able to handle that, because it handles one file at a time. If the form's value fields are in the PDF, Tika should be able to handle it, but may be making some small errors that could be addressed.

When you look at the form in Acrobat Reader, can you see whether the indexed words contain any words from the form fields' values?

If you have a form where the data is not sensitive, I can investigate. If you are interested in this, contact me offline - dansm...@gmail.com or d...@danizen.net.

Thanks,
Dan

On Thu, Apr 23, 2015 at 11:59 AM, Erick Erickson erickerick...@gmail.com wrote:

When you say they're not indexed correctly, what's your evidence? You cannot rely on the display in the browser; that's the raw input just as it was sent to Solr, _not_ the actual tokens in the index. What do you see when you go to the admin schema browser page and load the actual terms? Or use the TermsComponent (https://cwiki.apache.org/confluence/display/solr/The+Terms+Component) to see the actual terms in the index, as opposed to the stored data you see in the browser when you look at search results.

If the actual terms don't seem right _in the index_, we need to see your analysis chain, i.e. your fieldType definition.

I'm 90% sure you're seeing the stored data and your terms are indexed just fine, but I've certainly been wrong before, more times than I want to remember.

Best,
Erick

On Thu, Apr 23, 2015 at 1:18 AM, steve.sch...@t-systems.com wrote:

Hey Erick, thanks for your answer. They are not indexed correctly. Also through the Solr admin interface I see these typical question marks within a rhombus where a blank space should be. I now figured out the following (not sure if it is relevant at all):
- PDF documents created with Acrobat PDFMaker 10.0 for Word are indexed correctly, no issues
- PDF documents (with editable form fields) created with Adobe InDesign CS5 (7.0.1) are indexed with the blank-space issue

Best
Steve

-----Original message-----
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Wednesday, 22 April 2015 17:11
To: solr-user@lucene.apache.org
Subject: Re: solr issue with pdf forms

Are they not _indexed_ correctly or not being displayed correctly? Take a look at admin UI >> schema browser >> your field and press the "load terms" button. That'll show you what is _in_ the index, as opposed to what the raw data looked like. When you return the field in a Solr search, you get a verbatim, un-analyzed copy of your original input.

My guess is that your browser isn't using a compatible character encoding for display.

Best,
Erick

On Wed, Apr 22, 2015 at 7:08 AM, steve.sch...@t-systems.com wrote:

Thanks for your answer. Maybe my English is not good enough; what are you trying to say? Sorry, I didn't get the point. :-(

-----Original message-----
From: LAFK [mailto:tomasz.bo...@gmail.com]
Sent: Wednesday, 22 April 2015 14:01
To: solr-user@lucene.apache.org; solr-user@lucene.apache.org
Subject: Re: solr issue with pdf forms

Off the top of my head, I'd follow up on how the writable PDFs are created and encoded.

@LAFK_PL

-----Original message-----
From: steve.sch...@t-systems.com
Sent: Wednesday, 22 April 2015 12:41
To: solr-user@lucene.apache.org
Reply-to: solr-user@lucene.apache.org
Subject: solr issue with pdf forms

Hi guys,

hopefully you can help me with my issue. We are using a Solr setup and have the following issue:
- usual PDF files are indexed just fine
- PDF files with writable form fields look like this:
Ich bestätige mit meiner Unterschrift, dass alle Angaben korrekt und v ollständig sind

Somehow the blank space character is not indexed correctly. Is this a known issue? Does anybody have an idea?

Thanks a lot
Best
Steve
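One way to narrow down where the spaces are lost is to run the same PDF through stand-alone Tika, outside of Solr, and compare its output with what ends up indexed. A rough sketch; the tika-app version and file name are examples, not from this thread:

```
# Extract plain text roughly the way ExtractingRequestHandler would see it
java -jar tika-app-1.7.jar --text form.pdf > form.txt

# Dump detected metadata too (producer/creator tool often explains
# why InDesign output behaves differently from PDFMaker output)
java -jar tika-app-1.7.jar --metadata form.pdf
```

If form.txt already shows the replacement characters, the problem is in Tika's extraction (or the PDF's encoding); if it is clean, the problem lies between Solr and the display.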
Re: solr issue with pdf forms
Steve,

Are you using ExtractingRequestHandler / DataImportHandler, or extracting the text content from the PDF outside of Solr?

On Wed, Apr 22, 2015 at 6:40 AM, steve.sch...@t-systems.com wrote:

Hi guys,

hopefully you can help me with my issue. We are using a Solr setup and have the following issue:
- usual PDF files are indexed just fine
- PDF files with writable form fields look like this:
Ich�bestätige�mit�meiner�Unterschrift,�dass�alle�Angaben�korrekt�und�vollständig�sind

Somehow the blank space character is not indexed correctly. Is this a known issue? Does anybody have an idea?

Thanks a lot
Best
Steve
Re: Odp.: solr issue with pdf forms
+1 - I like Erick's answer. Let me know if that turns out to be the problem - I'm interested in this problem and would be happy to help.

On Wed, Apr 22, 2015 at 11:11 AM, Erick Erickson erickerick...@gmail.com wrote:

Are they not _indexed_ correctly or not being displayed correctly? Take a look at admin UI >> schema browser >> your field and press the "load terms" button. That'll show you what is _in_ the index, as opposed to what the raw data looked like. When you return the field in a Solr search, you get a verbatim, un-analyzed copy of your original input.

My guess is that your browser isn't using a compatible character encoding for display.

Best,
Erick

On Wed, Apr 22, 2015 at 7:08 AM, steve.sch...@t-systems.com wrote:

Thanks for your answer. Maybe my English is not good enough; what are you trying to say? Sorry, I didn't get the point. :-(

-----Original message-----
From: LAFK [mailto:tomasz.bo...@gmail.com]
Sent: Wednesday, 22 April 2015 14:01
To: solr-user@lucene.apache.org; solr-user@lucene.apache.org
Subject: Re: solr issue with pdf forms

Off the top of my head, I'd follow up on how the writable PDFs are created and encoded.

@LAFK_PL

-----Original message-----
From: steve.sch...@t-systems.com
Sent: Wednesday, 22 April 2015 12:41
To: solr-user@lucene.apache.org
Reply-to: solr-user@lucene.apache.org
Subject: solr issue with pdf forms

Hi guys,

hopefully you can help me with my issue. We are using a Solr setup and have the following issue:
- usual PDF files are indexed just fine
- PDF files with writable form fields look like this:
Ich�bestätige�mit�meiner�Unterschrift,�dass�alle�Angaben�korrekt�und�vollständig�sind

Somehow the blank space character is not indexed correctly. Is this a known issue? Does anybody have an idea?

Thanks a lot
Best
Steve
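Erick's two inspection routes can also be driven from the command line; a sketch, with the core and field names as placeholders and assuming the default /terms and /analysis/field handlers from the sample solrconfig are enabled:

```
# List actual indexed terms for the field via the TermsComponent
curl 'http://localhost:8983/solr/mycore/terms?terms.fl=content&terms.limit=20&wt=json'

# Run a string through the field's analysis chain without indexing it
curl 'http://localhost:8983/solr/mycore/analysis/field?analysis.fieldname=content&analysis.fieldvalue=best%C3%A4tige&wt=json'
```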
Re: Securing solr index
Where you want true Role-Based Access Control (RBAC) on each index (core or collection), one solution is to buy Solr Enterprise from LucidWorks. My personal practice is mostly dictated by financial decisions: - Each core/index has its configuration directory in a Git repository/branch where the Git repository software provides RBAC. - This relies on developers to keep a separate Solr for development, and then to check-in their configuration directory changes when they are satisfied with the changes. This is probably a best practice anyway :) - Continuous Integration pushes the Git configuration appropriately when a particular branch changes. - The main URL /solr has security provided by Apache httpd on port 80 (a reverse proxy to http://localhost:8983/solr/) - That port is also open, secured by IP address, to other Solr nodes in the cluster. - The /select request Handler for each core/collection is reverse proxied to /search/corename. - The Solr Amin UI uses a authentication/authorization handler such that only the Search Administrators group has access to it. The security here relies on search developers not enabling handleSelect in their solrconfig.xml.The security can also be extended by adding security on reverse proxied URLs such as /search/corename and /update/corename so that the client application needs to know some key, or have access to an SSL private key file. The downside is that only Search Administrators group has access to the QA or production Solr Admin UI. On Mon, Apr 13, 2015 at 6:13 AM, Suresh Vanasekaran suresh_vanaseka...@infosys.com wrote: Hi, We are having the solr index maintained in a central server and multiple users might be able to access the index data. May I know what are best practice for securing the solr index folder where ideally only application user should be able to access. Even an admin user should not be able to copy the data and use it in another schema. 
Thanks
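The reverse-proxy arrangement described above can be sketched as an httpd fragment - illustrative only, not the author's actual config; paths and port are taken from the mail, and the file is written to /tmp here so the sketch runs anywhere:

```shell
# Illustrative httpd fragment for the setup described above: /solr
# reverse-proxied to the local Solr, plus a per-core /search endpoint
# that exposes only the select handler. Requires mod_proxy_http.
cat > /tmp/solr-proxy.conf <<'EOF'
ProxyPass        /solr http://localhost:8983/solr/
ProxyPassReverse /solr http://localhost:8983/solr/
# Per-core search endpoint, hiding the rest of the Solr API:
ProxyPass        /search/corename http://localhost:8983/solr/corename/select
ProxyPassReverse /search/corename http://localhost:8983/solr/corename/select
EOF
grep -c ProxyPass /tmp/solr-proxy.conf
```

Securing the admin UI and /update endpoints would be layered on top of this with standard httpd auth directives.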
Re: What is the best way of Indexing different formats of documents?
Sangeetha, You can also run Tika directly from the Data Import Handler, and the Data Import Handler can be made to run several threads if you can partition the input documents by directory or database id. I've done 4 threads by having a base configuration that does an Oracle query like this:

SELECT * FROM (SELECT id, url, ..., Mod(rowNum, 4) AS threadid FROM ... WHERE ...) WHERE threadid = %d

A bash/sed script writes several data import handler XML files. I can then index several threads at a time. Each of these threads can then use all the transformers, e.g. TemplateTransformer, etc. XML can be transformed via XSLT. The Data Import Handler has other entities that go out to the web and then index the document via Tika. If you are indexing generic HTML, you may want to figure out an approach to SOLR-3808 and SOLR-2250 - these can be resolved by recompiling Solr and Tika locally, because Boilerpipe has a bug that has been fixed, but not pushed to Maven Central. Without that, the ASF cannot include the fix, but distributions such as LucidWorks Solr Enterprise can. I can drop some configs into github.com if I clean them up to obfuscate host names, passwords, and such.

On Tue, Apr 7, 2015 at 9:14 AM, Yavar Husain yavarhus...@gmail.com wrote:

Well, I have indexed heterogeneous sources including a variety of NoSQLs, RDBMSs and rich documents (PDF, Word, etc.) using SolrJ. The only prerequisite for using SolrJ is that you should have an API to fetch data from your data source (say JDBC for an RDBMS, Tika for extracting text content from rich documents, etc.); then SolrJ is so damn great and simple. It's as simple as downloading the jar and a few lines of code to send data to your solr server after pre-processing your data.
More details here: http://lucidworks.com/blog/indexing-with-solrj/ https://wiki.apache.org/solr/Solrj http://www.solrtutorial.com/solrj-tutorial.html Cheers, Yavar

On Tue, Apr 7, 2015 at 4:18 PM, sangeetha.subraman...@gtnexus.com sangeetha.subraman...@gtnexus.com wrote:

Hi, I am a newbie to SOLR and basically from a database background. We have a requirement to index files of different formats (X12, EDIFACT, CSV, XML). The files which are inputted can be of any format and we need to do a content-based search on them. From the web I understand we can use the TIKA processor to extract the content and store it in SOLR. What I want to know is, is there any better approach for indexing files in SOLR? Can we index the document through streaming directly from the application? If so, what is the disadvantage of using it (against DIH, which fetches from the database)? Could someone share some insight on this? Is there any web link I can refer to to get some idea of it? Please do help. Thanks, Sangeetha
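Dan's modulo partitioning earlier in the thread relies on the row-number bucket spreading rows evenly; that property can be checked in isolation with a small sketch (mine, not from the thread):

```shell
# 1000 synthetic "rows"; awk's NR stands in for Oracle's RowNum.
# Each row lands in bucket NR % 4, giving four equal partitions of 250
# regardless of how the ids themselves are distributed.
seq 1 1000 | awk '{ print NR % 4 }' | sort | uniq -c
```

This is why the modulo is taken on the row number rather than the id column: skewed or sparse ids would make id-based buckets uneven.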
Re: Customzing Solr Dedupe
But you can potentially still use Solr dedupe if you do the upfront work (in RDBMS or NoSQL pre-index processing) to assign some sort of group ID. See OCLC's FRBR Work-Set Algorithm, http://www.oclc.org/content/dam/research/activities/frbralgorithm/2009-08.pdf?urlm=161376 , for details on one such algorithm. If the job is too big for an RDBMS, and/or you don't want to use or don't have a suitable NoSQL, you can have two Solr indexes (collection/core/whatever) - one for classification, with only id, field1, field2, field3, and another for production queries. Then, you put stuff into the classification index, use queries and your own algorithm to do the classification, assigning a groupId, and then put the document with the groupId assigned into the production index. A key question is whether you want to preserve the groupId. In some cases you do, and in some cases it is just an internal signature. In both cases, a non-deterministic up-front algorithm can work, but if the groupId needs to be preserved, you need to work harder to make sure it all hangs together. Hope this helps, -Dan

On Wed, Apr 1, 2015 at 7:05 AM, Jack Krupansky jack.krupan...@gmail.com wrote:

Solr dedupe is based on the concept of a signature - some fields and rules that reduce a document to a discrete signature, and then checking whether that signature exists as a document key that can be looked up quickly in the index. That's the conceptual basis. It is not based on any kind of field-by-field comparison to all existing documents. -- Jack Krupansky

On Wed, Apr 1, 2015 at 6:35 AM, thakkar.aayush thakkar.aay...@gmail.com wrote:

I'm facing a challenge using de-duplication of Solr documents. De-duplication is done using TextProfileSignature with the following parameters:

<str name="fields">field1, field2, field3</str>
<str name="quantRate">0.5</str>
<str name="minTokenLen">3</str>

Here field3 is normal text with a few lines of data. field1 and field2 can contain up to 5 or 6 words of data.
I want to de-duplicate when the data in field1 and field2 is exactly the same and 90% of the lines in field3 match those in another document. Is there any way to achieve this? -- View this message in context: http://lucene.472066.n3.nabble.com/Customzing-Solr-Dedupe-tp4196879.html Sent from the Solr - User mailing list archive at Nabble.com.
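For the exact-match half of this requirement (field1 and field2 identical), the classify-then-assign idea from the reply above can be sketched as a hash-based groupId - my own illustration, not TextProfileSignature or the OCLC algorithm, and the 90% fuzzy match on field3 would still need a real near-duplicate algorithm on top:

```shell
# Reduce the fields that must match exactly to a stable signature and use
# it as a candidate groupId. md5sum is an arbitrary choice of hash here.
sig() { printf '%s|%s' "$1" "$2" | md5sum | cut -d' ' -f1; }

a=$(sig "red widget" "acme")   # doc 1
b=$(sig "red widget" "acme")   # doc 2, same field1/field2 -> same group
c=$(sig "blue widget" "acme")  # doc 3, different field1 -> different group
echo "$a"
echo "$b"
echo "$c"
```

Documents sharing a signature would then be candidates for the fuzzy field3 comparison before the final groupId is written into the production index.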
Re: Solr on Tomcat
As an application developer, I have to agree with this direction. I ran ManifoldCF and Solr together in the same Tomcat, and the slf4j configurations of the two conflicted, with strange results. From a systems administrator/operations perspective, a separate install allows better packaging, e.g. Debian and RPM packages are then possible, although they may not be preferred, as many enterprises will want to use Oracle Java rather than OpenJDK.

On Tue, Feb 10, 2015 at 1:12 PM, Matt Kuiper matt.kui...@issinc.com wrote:

Thanks for all the responses. I am planning a new project, and considering deployment options at this time. It's helpful to see where Solr is headed. Thanks, Matt Kuiper

-Original Message- From: Shawn Heisey [mailto:apa...@elyograg.org] Sent: Tuesday, February 10, 2015 10:05 AM To: solr-user@lucene.apache.org Subject: Re: Solr on Tomcat

On 2/10/2015 9:48 AM, Matt Kuiper wrote: I am starting to look into Solr 5.0. I have been running Solr 4.* on Tomcat. I was surprised to find the following notice on https://cwiki.apache.org/confluence/display/solr/Running+Solr+on+Tomcat (marked as Unreleased): "Beginning with Solr 5.0, deploying Solr as a WAR in servlet containers like Tomcat is no longer supported." I want to verify that it is true that Solr 5.0 will not be able to run on Tomcat, and confirm that the recommended way to deploy Solr 5.0 is as a Linux service.

Solr will eventually (hopefully soon) be entirely its own application. The documentation you have seen in the reference guide is there to prepare users for this eventuality. Right now we are in a transition period. We have built scripts for controlling the start and stop of the example server installation. Under the covers, Solr is still a web application contained in a war, and the example server still runs an unmodified copy of jetty.
Down the road, when Solr becomes a completely standalone application, we will merely have to modify the script wrapper to use it, and the user may not even notice the change. With 5.0, if you want to run in tomcat, you will be able to find the war in the download's server/webapps directory and use it just like you do now ... but we will be encouraging people NOT to do this, because eventually it will be completely unsupported. Thanks, Shawn
Re: clarification regarding shard splitting and composite IDs
Thanks, Anshum - I should never have posted so late. It is true that different users will have different word frequencies, but an application exploiting that for better relevancy would be going to great lengths for the relevancy of an individual user's results.

On Thu, Feb 5, 2015 at 12:41 AM, Anshum Gupta ans...@anshumgupta.net wrote:

Solr 5.0 has support for distributed IDF. Also, users having the same IDF is orthogonal to the original question. In general, the doc freq. is only per-shard. If for some reason a single user has documents split across shards, the IDF used would be different for docs on different shards.

On Wed, Feb 4, 2015 at 9:06 PM, Dan Davis dansm...@gmail.com wrote:

Doesn't relevancy for that assume that the IDF and TF for user1 and user2 are not too different? SolrCloud still doesn't use a distributed IDF, correct?

On Wed, Feb 4, 2015 at 7:05 PM, Gili Nachum gilinac...@gmail.com wrote:

Alright. So shard splitting and composite routing play nicely together. Thank you Anshum.

On Wed, Feb 4, 2015 at 11:24 AM, Anshum Gupta ans...@anshumgupta.net wrote:

In one line: shard splitting doesn't depend on the routing mechanism, just the hash range, so you could have documents for the same prefix split up. Here's an overview of routing in SolrCloud:
* Happens based on a hash value.
* The hash is calculated using the multiple parts of the routing key. In the case of A!B, the top 16 bits are obtained from murmurhash(A) and the bottom 16 bits of the routing key are obtained from murmurhash(B). This sends the docs to the right shard.
* When querying using A!, all shards whose hash ranges overlap the range of hashes whose top 16 bits come from murmurhash(A) are used.

When you split a shard, it is split from the middle of its hash range (by default), and over multiple splits, docs for the same A! prefix might end up on different shards, but the request routing should take care of that.
You can read more about routing here: https://lucidworks.com/blog/solr-cloud-document-routing/ http://lucidworks.com/blog/multi-level-composite-id-routing-solrcloud/ and shard splitting here: http://lucidworks.com/blog/shard-splitting-in-solrcloud/

On Wed, Feb 4, 2015 at 12:59 AM, Gili Nachum gilinac...@gmail.com wrote:

Hi, I'm also interested. When using the composite ID, the _route_ information is not kept on the document itself, so to me it looks like it's not possible, as the split API https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-api3 doesn't have a relevant parameter to split correctly. I could report back once I try it in practice.

On Mon, Nov 10, 2014 at 7:27 PM, Ian Rose ianr...@fullstory.com wrote:

Howdy - We are using composite IDs of the form user!event. This ensures that all events for a user are stored in the same shard. I'm assuming from the description of how composite ID routing works that if you split a shard, the split point of the hash range for that shard is chosen to maintain the invariant that all documents that share a routing prefix (before the !) will still map to the same (new) shard. Is that accurate? A naive shard-split implementation (e.g. one that chose the hash range split point arbitrarily) could end up with child shards that split a routing prefix. Thanks, Ian

-- Anshum Gupta http://about.me/anshumgupta
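Anshum's two-part hash can be sketched numerically. Note that cksum below is only a deterministic stand-in for Solr's actual MurmurHash3, so the values are illustrative; the point is that every A!* document shares the top 16 bits and therefore falls in one contiguous hash range:

```shell
# cksum is NOT Solr's MurmurHash3 -- just a deterministic stand-in.
hash16() { printf '%s' "$1" | cksum | awk '{ print $1 % 65536 }'; }

# Compose a 32-bit routing hash: top 16 bits from the prefix (A),
# bottom 16 bits from the rest of the id (B), as in "A!B".
route() { echo $(( ($(hash16 "$1") << 16) | $(hash16 "$2") )); }

r1=$(route user1 event1)
r2=$(route user1 event2)
# Same prefix => same top 16 bits => same shard (before any split):
echo $(( r1 >> 16 )) $(( r2 >> 16 ))
```

A split that lands inside that 65536-value sub-range is what scatters a prefix across child shards, which is why the query-time routing has to fan out to both.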
Re: Delta import query not working
It looks like you are returning the transformed ID, along with some other fields, in the deltaQuery command. deltaQuery should only return the ID, without the stk_ prefix, and then deltaImportQuery should retrieve the transformed ID. I'd suggest:

<entity ...
  deltaQuery="SELECT id FROM stock_items WHERE updated_at > '${dih.delta.last_index_time}'"
  deltaImportQuery="SELECT CONCAT('stk_', id) AS id, part_no, name, description FROM stock_items WHERE id='${dih.delta.id}'">

I'm not sure which RDBMS you are using, but you probably don't need to work around the column names at all.

On Thu, Feb 5, 2015 at 5:18 PM, willbrindle m...@willbrindle.com wrote:

Hi, I am very new to Solr but I have been playing around with it a bit and my imports are all working fine. However, now I wish to perform a delta import on my query and I'm just getting nothing. I have the entity:

<entity name="stock"
  query="SELECT CONCAT('stk_', id) AS id, part_no, name, description FROM stock_items"
  deltaQuery="SELECT CONCAT('stk_', id) AS id, part_no, name, description, updated_at FROM stock_items WHERE updated_at > '${dih.delta.last_index_time}'"
  deltaImportQuery="SELECT CONCAT('stk_', id) AS id, id AS id2, part_no, name, description FROM stock_items WHERE id2='${dih.delta.id}'">

I am not too sure if ${dih.delta.id} is supposed to be id or id2, but I have tried both and neither works. My output is something along the lines of:

{
  "responseHeader": { "status": 0, "QTime": 0 },
  "initArgs": [ "defaults", [ "config", "data-config.xml" ] ],
  "command": "status",
  "status": "idle",
  "importResponse": "",
  "statusMessages": {
    "Time Elapsed": "0:0:16.778",
    "Total Requests made to DataSource": "2",
    "Total Rows Fetched": "0",
    "Total Documents Skipped": "0",
    "Delta Dump started": "2015-02-05 16:17:54",
    "Identifying Delta": "2015-02-05 16:17:54",
    "Deltas Obtained": "2015-02-05 16:17:54",
    "Building documents": "2015-02-05 16:17:54",
    "Total Changed Documents": "0",
    "Delta Import Failed": "2015-02-05 16:17:54"
  },
  "WARNING": "This response format is experimental. It is likely to change in the future."
}

My full import query is working fine.
Thanks. -- View this message in context: http://lucene.472066.n3.nabble.com/Delta-import-query-not-working-tp4184280.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Delta import query not working
It also should be ${dataimporter.last_index_time}. Also, that's two queries - an outer query to get the IDs that are modified, and another query (done repeatedly) to get the data. You can go faster using a parameterized data import as described in the wiki: http://wiki.apache.org/solr/DataImportHandlerDeltaQueryViaFullImport Hope this helps, Dan

On Thu, Feb 5, 2015 at 9:30 PM, Dan Davis dansm...@gmail.com wrote:

It looks like you are returning the transformed ID, along with some other fields, in the deltaQuery command. deltaQuery should only return the ID, without the stk_ prefix, and then deltaImportQuery should retrieve the transformed ID. I'd suggest:

<entity ...
  deltaQuery="SELECT id FROM stock_items WHERE updated_at > '${dih.delta.last_index_time}'"
  deltaImportQuery="SELECT CONCAT('stk_', id) AS id, part_no, name, description FROM stock_items WHERE id='${dih.delta.id}'">

I'm not sure which RDBMS you are using, but you probably don't need to work around the column names at all.

On Thu, Feb 5, 2015 at 5:18 PM, willbrindle m...@willbrindle.com wrote:

Hi, I am very new to Solr but I have been playing around with it a bit and my imports are all working fine. However, now I wish to perform a delta import on my query and I'm just getting nothing. I have the entity:

<entity name="stock"
  query="SELECT CONCAT('stk_', id) AS id, part_no, name, description FROM stock_items"
  deltaQuery="SELECT CONCAT('stk_', id) AS id, part_no, name, description, updated_at FROM stock_items WHERE updated_at > '${dih.delta.last_index_time}'"
  deltaImportQuery="SELECT CONCAT('stk_', id) AS id, id AS id2, part_no, name, description FROM stock_items WHERE id2='${dih.delta.id}'">

I am not too sure if ${dih.delta.id} is supposed to be id or id2, but I have tried both and neither works.
My output is something along the lines of:

{
  "responseHeader": { "status": 0, "QTime": 0 },
  "initArgs": [ "defaults", [ "config", "data-config.xml" ] ],
  "command": "status",
  "status": "idle",
  "importResponse": "",
  "statusMessages": {
    "Time Elapsed": "0:0:16.778",
    "Total Requests made to DataSource": "2",
    "Total Rows Fetched": "0",
    "Total Documents Skipped": "0",
    "Delta Dump started": "2015-02-05 16:17:54",
    "Identifying Delta": "2015-02-05 16:17:54",
    "Deltas Obtained": "2015-02-05 16:17:54",
    "Building documents": "2015-02-05 16:17:54",
    "Total Changed Documents": "0",
    "Delta Import Failed": "2015-02-05 16:17:54"
  },
  "WARNING": "This response format is experimental. It is likely to change in the future."
}

My full import query is working fine. Thanks. -- View this message in context: http://lucene.472066.n3.nabble.com/Delta-import-query-not-working-tp4184280.html Sent from the Solr - User mailing list archive at Nabble.com.
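The wiki pattern Dan points to collapses deltaQuery/deltaImportQuery into the main query, switched by request parameters. A sketch using the thread's stock_items example, written to a temp file so it is runnable here (the exact attribute set should be checked against the wiki page):

```shell
cat > /tmp/delta-via-full-import.xml <<'EOF'
<!-- One query serves both modes: a full import runs with the default
     clean=true and gets everything; a delta is issued as
     command=full-import&clean=false and gets only recently updated rows. -->
<entity name="stock" pk="id"
        query="SELECT CONCAT('stk_', id) AS id, part_no, name, description
               FROM stock_items
               WHERE '${dataimporter.request.clean}' != 'false'
                  OR updated_at &gt; '${dataimporter.last_index_time}'"/>
EOF
grep -c "last_index_time" /tmp/delta-via-full-import.xml
```

This avoids the per-ID deltaImportQuery round trips entirely, which is the speedup Dan refers to.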
Re: Solr 4.9 Calling DIH concurrently
Suresh and Meena, I have solved this problem by taking a row count on a query, and adding its modulo as another field called threadid. The base query is wrapped in a query that selects a subset of the results for indexing. The modulo on the row number was intentional - you cannot rely on id columns to be well distributed, and you cannot rely on the number of rows to stay constant over time. To make it more concrete, I have a base DataImportHandler configuration that looks something like what's below - your SQL may differ, as we use Oracle.

<entity name="medsite" dataSource="oltp01_prod" rootEntity="true"
        query="SELECT * FROM (SELECT t.*, Mod(RowNum, 4) threadid FROM medplus.public_topic_sites_us_v t) WHERE threadid = %%d%%"
        transformer="TemplateTransformer">
  ...
</entity>

To get it to be multi-threaded, I then copy it to 4 different configuration files as follows:

echo "Medical Sites Configuration - ${MEDSITES_CONF:=medical-sites-conf.xml}"
echo "Medical Sites Prototype - ${MEDSITES_PROTOTYPE:=medical-sites-%%d%%-conf.xml}"
for tid in `seq 0 3`; do
  MEDSITES_OUT=`echo $MEDSITES_PROTOTYPE | sed -e "s/%%d%%/$tid/"`
  sed -e "s/%%d%%/$tid/" $MEDSITES_CONF > $MEDSITES_OUT
done

Then, I have 4 requestHandlers in solrconfig.xml that point to each of these files. They are /import/medical-sites-0 through /import/medical-sites-3. Note that this wouldn't work with a single Data Import Handler that was parameterized - a particular Data Import Handler is either idle or busy, and should not be run in multiple threads. How this would work if the first entity weren't the root entity is another question - you can usually structure it with the first SQL query being the root entity if you are using SQL. XML is another story, however. I did it this way because I wanted to stay with Solr out-of-the-box, because it was an evaluation of what the Data Import Handler could do.
If I were doing this without a business requirement to evaluate whether Solr out-of-the-box could do a multithreaded database import, I'd probably write a multi-threaded front-end that did the queries and transformations I needed. In this case, I was considering the best way to do all our data imports from RDBMS, and the Data Import Handler is the only good solution that involves writing configuration, not code. The distinction is slight, I think. Hope this helps, Dan Davis

On Wed, Feb 4, 2015 at 3:02 AM, Mikhail Khludnev mkhlud...@griddynamics.com wrote:

Suresh, There are a few common workarounds for such problems. But I think that submitting more than maxIndexingThreads is not really productive. Also, I think that the out-of-memory problem is caused not by indexing, but by opening a searcher. Do you really need to open it? I don't think it's a good idea to search on an instance which is cooking a multi-terabyte index at the same time. Are you sure you don't issue superfluous commits, and have you disabled auto-commit? Let's nail down the OOM problem first, and then deal with the indexing speedup. I like huge indices!

On Wed, Feb 4, 2015 at 1:10 AM, Arumugam, Suresh suresh.arumu...@emc.com wrote:

We are also facing the same problem in loading 14 billion documents into Solr 4.8.10. Data import is working single-threaded, which is taking more than 3 weeks. This is working fine without any issues, but it takes months to complete the load. When we tried SolrJ with the below configuration in a multithreaded load, Solr was taking more memory, and at one point we end up out of memory as well.

Batch Doc count : 10 docs
No of Threads : 16/32
Solr Memory Allocated : 200 GB

The reason can be as below: Solr is taking a snapshot whenever we open a SearchIndexer. Due to this, more memory is getting consumed and Solr is extremely slow while running 16 or more threads for loading.
If anyone has already done a multithreaded data load into Solr in a quicker way, can you please share the code or logic for using the SolrJ API? Thanks in advance. Regards, Suresh.A

-Original Message- From: Dyer, James [mailto:james.d...@ingramcontent.com] Sent: Tuesday, February 03, 2015 1:58 PM To: solr-user@lucene.apache.org Subject: RE: Solr 4.9 Calling DIH concurrently

DIH is single-threaded. There was once a threaded option, but it was buggy and subsequently was removed. What I do is partition my data and run multiple DIH request handlers at the same time. It means redundant sections in solrconfig.xml and it's not very elegant, but it works. For instance, for a sql query, I add something like this: where mod(id, ${dataimporter.request.numPartitions}) = ${dataimporter.request.currentPartition}. I think, though, most users who want to make the most out of multithreading write their own program and use the solrj api to send the updates. James Dyer Ingram Content Group
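James's partitioned handlers are driven by request parameters, which reach the config as ${dataimporter.request.*}. The requests one would issue look like this - echoed rather than sent, so the sketch runs without a server; host and core name are assumptions:

```shell
# Emit one full-import request per partition; in practice these would be
# sent with curl against a running Solr instance.
for p in 0 1 2 3; do
  echo "http://localhost:8983/solr/mycore/dataimport?command=full-import&clean=false&numPartitions=4&currentPartition=$p"
done
```

Each request binds numPartitions and currentPartition for the mod(id, ...) clause in that handler's query, so the four imports cover disjoint slices of the table.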
Re: Solr 4.9 Calling DIH concurrently
"Data Import Handler is the only good solution that involves writing configuration, not code." - I also had a requirement not to look at product-oriented enhancements to Solr, and there are many products I didn't look at, or rejected, like django-haystack. Perl, ruby, and python have good handling of both databases and Solr, as does Java with JDBC and SolrJ. Pushing to Solr probably has more legs than the Data Import Handler going forward.

On Wed, Feb 4, 2015 at 11:13 AM, Dan Davis dansm...@gmail.com wrote:

Suresh and Meena, I have solved this problem by taking a row count on a query, and adding its modulo as another field called threadid. The base query is wrapped in a query that selects a subset of the results for indexing. The modulo on the row number was intentional - you cannot rely on id columns to be well distributed, and you cannot rely on the number of rows to stay constant over time. To make it more concrete, I have a base DataImportHandler configuration that looks something like what's below - your SQL may differ, as we use Oracle.

<entity name="medsite" dataSource="oltp01_prod" rootEntity="true"
        query="SELECT * FROM (SELECT t.*, Mod(RowNum, 4) threadid FROM medplus.public_topic_sites_us_v t) WHERE threadid = %%d%%"
        transformer="TemplateTransformer">
  ...
</entity>

To get it to be multi-threaded, I then copy it to 4 different configuration files as follows:

echo "Medical Sites Configuration - ${MEDSITES_CONF:=medical-sites-conf.xml}"
echo "Medical Sites Prototype - ${MEDSITES_PROTOTYPE:=medical-sites-%%d%%-conf.xml}"
for tid in `seq 0 3`; do
  MEDSITES_OUT=`echo $MEDSITES_PROTOTYPE | sed -e "s/%%d%%/$tid/"`
  sed -e "s/%%d%%/$tid/" $MEDSITES_CONF > $MEDSITES_OUT
done

Then, I have 4 requestHandlers in solrconfig.xml that point to each of these files. They are /import/medical-sites-0 through /import/medical-sites-3.
Note that this wouldn't work with a single Data Import Handler that was parameterized - a particular Data Import Handler is either idle or busy, and should not be run in multiple threads. How this would work if the first entity weren't the root entity is another question - you can usually structure it with the first SQL query being the root entity if you are using SQL. XML is another story, however. I did it this way because I wanted to stay with Solr out-of-the-box, because it was an evaluation of what the Data Import Handler could do. If I were doing this without a business requirement to evaluate whether Solr out-of-the-box could do a multithreaded database import, I'd probably write a multi-threaded front-end that did the queries and transformations I needed. In this case, I was considering the best way to do all our data imports from RDBMS, and the Data Import Handler is the only good solution that involves writing configuration, not code. The distinction is slight, I think. Hope this helps, Dan Davis

On Wed, Feb 4, 2015 at 3:02 AM, Mikhail Khludnev mkhlud...@griddynamics.com wrote:

Suresh, There are a few common workarounds for such problems. But I think that submitting more than maxIndexingThreads is not really productive. Also, I think that the out-of-memory problem is caused not by indexing, but by opening a searcher. Do you really need to open it? I don't think it's a good idea to search on an instance which is cooking a multi-terabyte index at the same time. Are you sure you don't issue superfluous commits, and have you disabled auto-commit? Let's nail down the OOM problem first, and then deal with the indexing speedup. I like huge indices!

On Wed, Feb 4, 2015 at 1:10 AM, Arumugam, Suresh suresh.arumu...@emc.com wrote:

We are also facing the same problem in loading 14 billion documents into Solr 4.8.10. Data import is working single-threaded, which is taking more than 3 weeks. This is working fine without any issues, but it takes months to complete the load.
When we tried SolrJ with the below configuration in a multithreaded load, Solr was taking more memory, and at one point we end up out of memory as well.

Batch Doc count : 10 docs
No of Threads : 16/32
Solr Memory Allocated : 200 GB

The reason can be as below: Solr is taking a snapshot whenever we open a SearchIndexer. Due to this, more memory is getting consumed and Solr is extremely slow while running 16 or more threads for loading. If anyone has already done a multithreaded data load into Solr in a quicker way, can you please share the code or logic for using the SolrJ API? Thanks in advance. Regards, Suresh.A

-Original Message- From: Dyer, James [mailto:james.d...@ingramcontent.com] Sent: Tuesday, February 03, 2015 1:58 PM To: solr-user@lucene.apache.org Subject: RE: Solr 4.9 Calling DIH concurrently

DIH is single-threaded. There was once a threaded option
Re: clarification regarding shard splitting and composite IDs
Doesn't relevancy for that assume that the IDF and TF for user1 and user2 are not too different? SolrCloud still doesn't use a distributed IDF, correct?

On Wed, Feb 4, 2015 at 7:05 PM, Gili Nachum gilinac...@gmail.com wrote:

Alright. So shard splitting and composite routing play nicely together. Thank you Anshum.

On Wed, Feb 4, 2015 at 11:24 AM, Anshum Gupta ans...@anshumgupta.net wrote:

In one line: shard splitting doesn't depend on the routing mechanism, just the hash range, so you could have documents for the same prefix split up. Here's an overview of routing in SolrCloud:
* Happens based on a hash value.
* The hash is calculated using the multiple parts of the routing key. In the case of A!B, the top 16 bits are obtained from murmurhash(A) and the bottom 16 bits of the routing key are obtained from murmurhash(B). This sends the docs to the right shard.
* When querying using A!, all shards whose hash ranges overlap the range of hashes whose top 16 bits come from murmurhash(A) are used.

When you split a shard, it is split from the middle of its hash range (by default), and over multiple splits, docs for the same A! prefix might end up on different shards, but the request routing should take care of that. You can read more about routing here: https://lucidworks.com/blog/solr-cloud-document-routing/ http://lucidworks.com/blog/multi-level-composite-id-routing-solrcloud/ and shard splitting here: http://lucidworks.com/blog/shard-splitting-in-solrcloud/

On Wed, Feb 4, 2015 at 12:59 AM, Gili Nachum gilinac...@gmail.com wrote:

Hi, I'm also interested. When using the composite ID, the _route_ information is not kept on the document itself, so to me it looks like it's not possible, as the split API https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-api3 doesn't have a relevant parameter to split correctly. I could report back once I try it in practice.
On Mon, Nov 10, 2014 at 7:27 PM, Ian Rose ianr...@fullstory.com wrote: Howdy - We are using composite IDs of the form user!event. This ensures that all events for a user are stored in the same shard. I'm assuming from the description of how composite ID routing works, that if you split a shard the split point of the hash range for that shard is chosen to maintain the invariant that all documents that share a routing prefix (before the !) will still map to the same (new) shard. Is that accurate? A naive shard-split implementation (e.g. that chose the hash range split point arbitrarily) could end up with child shards that split a routing prefix. Thanks, Ian -- Anshum Gupta http://about.me/anshumgupta
Re: role of the wiki and cwiki
Hoss et al., I'm not intending to contribute documentation in any immediate sense (hence the disclaimer), but I thank you all for the clarification. It makes some sense to require a committer to review each suggested piece of official documentation, but I wonder abstractly how a non-committer should then contribute to the documentation. I just did an evaluation of several WCM systems, and it sounds almost like you need something more like a WCM that supports a moderation workflow, rather than a wiki. With current technology, possibilities include:
* Make a comment within Confluence suggesting content or making a clarification,
* Create a blog post or MoinMoin edit with whatever content seems to be needed,
* Paste text and/or content into a JIRA ticket, or upload an attachment to the JIRA ticket.

I think the JIRA ticket is the strongest, honestly, because it is true moderation - nothing shows up until evaluated by a committer. I also want to say that I value the very technical nature of the Solr documentation, even as I welcome better organization. Many products' documentation is far too abstract, because it is written by a technical writer not deeply familiar with either the technology or with what users specifically want to do. This is addressed by surfacing what the users want to do, and then how-to documentation is written that is still too vague on the technical details. Sometimes a worked example is very useful. I see a little, though not too much, of this transition in the Data Import Handler documentation - https://cwiki.apache.org/confluence/display/solr/Uploading+Structured+Data+Store+Data+with+the+Data+Import+Handler is more abstract, and moves too fast, relative to http://wiki.apache.org/solr/DataImportHandler. The ability to nest SQL-based entities is key to understanding, and is not covered in the former. One needs to see that an entity is not always a root entity.
So, I agree with the direction, but I hope the Solr Reference Guide can go into more depth in some places, even as it continues to be better organized for someone reading from scratch rather than starting with Solr in Action or something like that. Thanks again, Dan

On Mon, Feb 2, 2015 at 11:57 AM, Chris Hostetter hossman_luc...@fucit.org wrote:

: Because they have different potential authors, the two systems now serve
: different purposes.
:
: There are still some pages on the MoinMoin wiki that contain
: documentation that should be in the reference guide, but isn't.
:
: The MoinMoin wiki is still useful, as a place where users can collect
: information that is useful to others, but doesn't qualify as official
: documentation, or perhaps simply hasn't been verified. I believe this
: means that a lot of information which has been migrated into the
: reference guide will eventually be removed from MoinMoin.

+1 ... it's just a matter of time/energy to clean things up... https://cwiki.apache.org/confluence/display/solr/Internal+-+Maintaining+Documentation#Internal-MaintainingDocumentation-WhatShouldandShouldNotbeIncludedinThisDocumentation

FWIW: Emmanuel Stalling has started doing an audit of the wiki content vs the ref guide ... once more folks have a chance to review and dive in with edits, it should be really helpful in cleaning all this up... https://wiki.apache.org/solr/WikiManualComparison

-Hoss http://www.lucidworks.com/
Re: Calling custom request handler with data import
The Data Import Handler isn't pushing data into the /update request handler. However, the Data Import Handler can be extended with transformers. Two such transformers are the TemplateTransformer and the ScriptTransformer. It may be possible to get a script function to load your custom Java code. You could also just write a StanfordNerTransformer. Hope this helps, Dan On Fri, Jan 30, 2015 at 9:07 AM, vineet yadav vineet.yadav.i...@gmail.com wrote: Hi, I am using the data import handler to import data from mysql, and I want to identify named entities in it, so I am following this example (http://www.searchbox.com/named-entity-recognition-ner-in-solr/), which uses the Stanford NER to identify named entities. I am using the following request handler for importing data from mysql: <requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler"> <lst name="defaults"> <str name="config">data-import.xml</str> </lst> </requestHandler> and the following for identifying named entities: <requestHandler name="/ner" class="com.searchbox.ner.NerHandler" /> <updateRequestProcessorChain name="mychain"> <processor class="com.searchbox.ner.NerProcessorFactory"> <lst name="queryFields"> <str name="queryField">content</str> </lst> </processor> <processor class="solr.LogUpdateProcessorFactory" /> <processor class="solr.RunUpdateProcessorFactory" /> </updateRequestProcessorChain> <requestHandler name="/update" class="solr.UpdateRequestHandler"> <lst name="defaults"> <str name="update.chain">mychain</str> </lst> </requestHandler> The NER request handler identifies named entities in the content field and stores the extracted entities in solr fields. The NER request handler was working when I was using nutch with solr, but when I import data from mysql the ner request handler is not invoked, so entities are not stored in solr for the imported documents. Can anybody tell me how to call a custom request handler from the data import handler? Alternatively, if I can invoke the ner request handler externally, so that it indexes person, organization, and location in solr for the imported documents. 
That is also fine. Any suggestions are welcome. Thanks Vineet Yadav
Re: Calling custom request handler with data import
You know, another thing you can do is just write some Java/perl/whatever to pull data out of your database and push it to Solr. Not as convenient for development perhaps, but it has more legs in the long run. The Data Import Handler does not easily multi-thread. On Sat, Jan 31, 2015 at 12:34 AM, Dan Davis dansm...@gmail.com wrote: The Data Import Handler isn't pushing data into the /update request handler. However, the Data Import Handler can be extended with transformers. Two such transformers are the TemplateTransformer and the ScriptTransformer. It may be possible to get a script function to load your custom Java code. You could also just write a StanfordNerTransformer. Hope this helps, Dan On Fri, Jan 30, 2015 at 9:07 AM, vineet yadav vineet.yadav.i...@gmail.com wrote: Hi, I am using the data import handler to import data from mysql, and I want to identify named entities in it, so I am following this example (http://www.searchbox.com/named-entity-recognition-ner-in-solr/), which uses the Stanford NER to identify named entities. I am using the following request handler for importing data from mysql: <requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler"> <lst name="defaults"> <str name="config">data-import.xml</str> </lst> </requestHandler> and the following for identifying named entities: <requestHandler name="/ner" class="com.searchbox.ner.NerHandler" /> <updateRequestProcessorChain name="mychain"> <processor class="com.searchbox.ner.NerProcessorFactory"> <lst name="queryFields"> <str name="queryField">content</str> </lst> </processor> <processor class="solr.LogUpdateProcessorFactory" /> <processor class="solr.RunUpdateProcessorFactory" /> </updateRequestProcessorChain> <requestHandler name="/update" class="solr.UpdateRequestHandler"> <lst name="defaults"> <str name="update.chain">mychain</str> </lst> </requestHandler> The NER request handler identifies named entities in the content field and stores the extracted entities in solr fields. The NER request handler was working when I was using nutch with solr. 
But when I import data from mysql, the ner request handler is not invoked, so entities are not stored in solr for the imported documents. Can anybody tell me how to call a custom request handler from the data import handler? Otherwise, if I can invoke the ner request handler externally, so that it indexes person, organization, and location in solr for the imported documents, that is also fine. Any suggestions are welcome. Thanks Vineet Yadav
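Erick's suggestion above (pull rows out of the database yourself and push them to Solr) can be sketched in a few lines. This is only an illustration, not code from the thread: the Solr URL and core name are assumptions, and the JSON update endpoint shown is the standard one; swap in your own connection details.

```python
import json
import urllib.request

# Hypothetical Solr core; adjust host/core to your deployment.
SOLR_UPDATE_URL = "http://localhost:8983/solr/mycore/update"

def rows_to_docs(rows, field_names):
    """Convert DB result rows (tuples) into Solr JSON documents."""
    return [dict(zip(field_names, row)) for row in rows]

def push_to_solr(docs, url=SOLR_UPDATE_URL):
    """POST a batch of documents to Solr's JSON update endpoint and commit."""
    body = json.dumps(docs).encode("utf-8")
    req = urllib.request.Request(
        url + "?commit=true",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status
```

Batching a few hundred documents per POST and committing once at the end is usually kinder to Solr than committing per batch; and unlike the Data Import Handler, nothing stops you running several of these pushers in parallel.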
role of the wiki and cwiki
I've been thinking of https://wiki.apache.org/solr/ as the Old Wiki and https://cwiki.apache.org/confluence/display/solr as the New Wiki. I guess that's the wrong way to think about it - Confluence is being used for the Solr Reference Guide, and MoinMoin is being used as a wiki. Is this the correct understanding?
Re: Cannot reindex to add a new field
For this I prefer TemplateTransformer to RegexTransformer - it's not a regex, just a pattern, so TemplateTransformer should be more efficient. A script will also work, of course. On Tue, Jan 27, 2015 at 5:54 PM, Alexandre Rafalovitch arafa...@gmail.com wrote: On 27 January 2015 at 17:47, Carl Roberts carl.roberts.zap...@gmail.com wrote: <field column="product" sourceColName="vulnerable-software" commonField="false" regex=":" replaceWith="" /> Yes, that works because the transformer copies it, not the EntityProcessor. So, no conflict on xpath. Regards, Alex. Sign up for my Solr resources newsletter at http://www.solr-start.com/
Re: Need help importing data
Glad it worked out. On Fri, Jan 23, 2015 at 9:50 PM, Carl Roberts carl.roberts.zap...@gmail.com wrote: NVM, I figured this out. The problem was this: pk=link in rss-data-config.xml, but the unique key in schema.xml is not link - it is id. From rss-data-config.xml: <entity name="cve-2002" *pk="link"* url="https://nvd.nist.gov/feeds/xml/cve/nvdcve-2.0-2002.xml.zip" processor="XPathEntityProcessor" forEach="/nvd/entry"> <field column="id" xpath="/nvd/entry/@id" commonField="true" /> <field column="cve" xpath="/nvd/entry/cve-id" commonField="true" /> <field column="cwe" xpath="/nvd/entry/cwe/@id" commonField="true" /> <!-- <field column="vulnerable-configuration" xpath="/nvd/entry/vulnerable-configuration/logical-test/fact-ref/@name" commonField="false" /> <field column="vulnerable-software" xpath="/nvd/entry/vulnerable-software-list/product" commonField="false" /> <field column="published" xpath="/nvd/entry/published-datetime" commonField="false" /> <field column="modified" xpath="/nvd/entry/last-modified-datetime" commonField="false" /> <field column="summary" xpath="/nvd/entry/summary" commonField="false" /> --> </entity> From schema.xml: *<uniqueKey>id</uniqueKey>* What really bothers me is that there were no errors output by Solr to indicate this type of misconfiguration, and all the messages Solr gave indicated the import was successful. This lack of appropriate error reporting is a pain, especially for someone learning Solr. Switching pk=link to pk=id solved the problem and I was then able to import the data. On 1/23/15, 9:39 PM, Carl Roberts wrote: Hi, I have set log4j logging to level DEBUG and I have also modified the code to see what is being imported, and I can see the nextRow() records, and the import is successful, but I have no data. Can someone please help me figure this out? 
Here is the logging output: ow: r1={{id=CVE-2002-2353, cve=CVE-2002-2353, cwe=CWE-264, $forEach=/nvd/entry}} 2015-01-23 21:28:04,606- INFO-[Thread-15]-[XPathEntityProcessor.java:251] -org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow: r3={{id=CVE-2002-2353, cve=CVE-2002-2353, cwe=CWE-264, $forEach=/nvd/entry}} 2015-01-23 21:28:04,606- INFO-[Thread-15]-[XPathEntityProcessor.java:221] -org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow: URL={url} 2015-01-23 21:28:04,606- INFO-[Thread-15]-[XPathEntityProcessor.java:227] -org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow: r1={{id=CVE-2002-2354, cve=CVE-2002-2354, cwe=CWE-20, $forEach=/nvd/entry}} 2015-01-23 21:28:04,606- INFO-[Thread-15]-[XPathEntityProcessor.java:251] -org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow: r3={{id=CVE-2002-2354, cve=CVE-2002-2354, cwe=CWE-20, $forEach=/nvd/entry}} 2015-01-23 21:28:04,606- INFO-[Thread-15]-[XPathEntityProcessor.java:221] -org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow: URL={url} 2015-01-23 21:28:04,606- INFO-[Thread-15]-[XPathEntityProcessor.java:227] -org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow: r1={{id=CVE-2002-2355, cve=CVE-2002-2355, cwe=CWE-255, $forEach=/nvd/entry}} 2015-01-23 21:28:04,606- INFO-[Thread-15]-[XPathEntityProcessor.java:251] -org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow: r3={{id=CVE-2002-2355, cve=CVE-2002-2355, cwe=CWE-255, $forEach=/nvd/entry}} 2015-01-23 21:28:04,607- INFO-[Thread-15]-[XPathEntityProcessor.java:221] -org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow: URL={url} 2015-01-23 21:28:04,607- INFO-[Thread-15]-[XPathEntityProcessor.java:227] -org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow: r1={{id=CVE-2002-2356, cve=CVE-2002-2356, cwe=CWE-264, $forEach=/nvd/entry}} 2015-01-23 21:28:04,607- INFO-[Thread-15]-[XPathEntityProcessor.java:251] 
-org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow: r3={{id=CVE-2002-2356, cve=CVE-2002-2356, cwe=CWE-264, $forEach=/nvd/entry}} 2015-01-23 21:28:04,607- INFO-[Thread-15]-[XPathEntityProcessor.java:221] -org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow: URL={url} 2015-01-23 21:28:04,607- INFO-[Thread-15]-[XPathEntityProcessor.java:227] -org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow: r1={{id=CVE-2002-2357, cve=CVE-2002-2357, cwe=CWE-119, $forEach=/nvd/entry}} 2015-01-23 21:28:04,607- INFO-[Thread-15]-[XPathEntityProcessor.java:251] -org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow: r3={{id=CVE-2002-2357, cve=CVE-2002-2357, cwe=CWE-119, $forEach=/nvd/entry}} 2015-01-23 21:28:04,607- INFO-[Thread-15]-[XPathEntityProcessor.java:221] -org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow: URL={url} 2015-01-23 21:28:04,607- INFO-[Thread-15]-[XPathEntityProcessor.java:227]
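The root cause in this thread (a Data Import Handler entity declaring pk="link" while schema.xml declares id as the uniqueKey) is exactly the kind of mismatch a small pre-flight check could catch, since Solr itself reported success. A hedged sketch of such a check, assuming plain data-config and schema XML files; the function names are mine, not anything in Solr:

```python
import xml.etree.ElementTree as ET

def check_pk_matches_unique_key(data_config_xml, schema_xml):
    """Return a list of (entity_name, pk) pairs whose pk differs from
    the schema's uniqueKey; an empty list means the configs agree."""
    unique_key = ET.fromstring(schema_xml).findtext(".//uniqueKey").strip()
    mismatches = []
    for entity in ET.fromstring(data_config_xml).iter("entity"):
        pk = entity.get("pk")
        if pk is not None and pk != unique_key:
            mismatches.append((entity.get("name"), pk))
    return mismatches
```

Run against the files from this thread it would have flagged ("cve-2002", "link") immediately, instead of a silent no-op import.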
Re: Need Help with custom ZIPURLDataSource class
I have seen such errors by looking under Logging in the Solr Admin UI. There is also the LogTransformer for the Data Import Handler. However, it is a design choice in the Data Import Handler to skip fields not in the schema. I would suggest you always run the first couple of documents through the GUI with Debug and Verbose, and then go over the debugging output with a fine-toothed comb. I'm not sure whether there's an option for it, but it would be nice if the Data Import Handler could collect skipped fields into the status response. That would highlight your problem without forcing you to look in other areas. On Fri, Jan 23, 2015 at 9:51 PM, Carl Roberts carl.roberts.zap...@gmail.com wrote: NVM - I have this working. The problem was this: pk=link in rss-data-config.xml, but the unique key in schema.xml is not link - it is id. From rss-data-config.xml: <entity name="cve-2002" *pk="link"* url="https://nvd.nist.gov/feeds/xml/cve/nvdcve-2.0-2002.xml.zip" processor="XPathEntityProcessor" forEach="/nvd/entry"> <field column="id" xpath="/nvd/entry/@id" commonField="true" /> <field column="cve" xpath="/nvd/entry/cve-id" commonField="true" /> <field column="cwe" xpath="/nvd/entry/cwe/@id" commonField="true" /> <!-- <field column="vulnerable-configuration" xpath="/nvd/entry/vulnerable-configuration/logical-test/fact-ref/@name" commonField="false" /> <field column="vulnerable-software" xpath="/nvd/entry/vulnerable-software-list/product" commonField="false" /> <field column="published" xpath="/nvd/entry/published-datetime" commonField="false" /> <field column="modified" xpath="/nvd/entry/last-modified-datetime" commonField="false" /> <field column="summary" xpath="/nvd/entry/summary" commonField="false" /> --> </entity> From schema.xml: *<uniqueKey>id</uniqueKey>* What really bothers me is that there were no errors output by Solr to indicate this type of misconfiguration, and all the messages Solr gave indicated the import was successful. This lack of appropriate error reporting is a pain, especially for someone learning Solr. 
Switching pk=link to pk=id solved the problem and I was then able to import the data. On 1/23/15, 6:34 PM, Carl Roberts wrote: Hi, I created a custom ZIPURLDataSource class to unzip the content from an HTTP URL for a zipped XML file, and it seems to be working (at least I have no errors), but no data is imported. Here is my configuration in rss-data-config.xml: <dataConfig> <dataSource type="ZIPURLDataSource" connectionTimeout="15000" readTimeout="3"/> <document> <entity name="cve-2002" pk="link" url="https://nvd.nist.gov/feeds/xml/cve/nvdcve-2.0-2002.xml.zip" processor="XPathEntityProcessor" forEach="/nvd/entry" transformer="DateFormatTransformer"> <field column="id" xpath="/nvd/entry/@id" commonField="true" /> <field column="cve" xpath="/nvd/entry/cve-id" commonField="true" /> <field column="cwe" xpath="/nvd/entry/cwe/@id" commonField="true" /> <field column="vulnerable-configuration" xpath="/nvd/entry/vulnerable-configuration/logical-test/fact-ref/@name" commonField="false" /> <field column="vulnerable-software" xpath="/nvd/entry/vulnerable-software-list/product" commonField="false" /> <field column="published" xpath="/nvd/entry/published-datetime" commonField="false" /> <field column="modified" xpath="/nvd/entry/last-modified-datetime" commonField="false" /> <field column="summary" xpath="/nvd/entry/summary" commonField="false" /> </entity> </document> </dataConfig> Attached is the ZIPURLDataSource.java file. It actually unzips and saves the raw XML to disk, which I have verified to be a valid XML file. 
The file has one or more entries (here is an example, truncated as in the original message): <nvd xmlns:scap-core="http://scap.nist.gov/schema/scap-core/0.1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:patch="http://scap.nist.gov/schema/patch/0.1" xmlns:vuln="http://scap.nist.gov/schema/vulnerability/0.4" xmlns:cvss="http://scap.nist.gov/schema/cvss-v2/0.2" xmlns:cpe-lang="http://cpe.mitre.org/language/2.0" xmlns="http://scap.nist.gov/schema/feed/vulnerability/2.0" pub_date="2015-01-10T05:37:05" xsi:schemaLocation="http://scap.nist.gov/schema/patch/0.1 http://nvd.nist.gov/schema/patch_0.1.xsd http://scap.nist.gov/schema/scap-core/0.1 http://nvd.nist.gov/schema/scap-core_0.1.xsd http://scap.nist.gov/schema/feed/vulnerability/2.0 http://nvd.nist.gov/schema/nvd-cve-feed_2.0.xsd" nvd_xml_version="2.0"> <entry id="CVE-1999-0001"> <vuln:vulnerable-configuration id="http://nvd.nist.gov/"> <cpe-lang:logical-test operator="OR" negate="false"> <cpe-lang:fact-ref name="cpe:/o:bsdi:bsd_os:3.1"/> <cpe-lang:fact-ref name="cpe:/o:freebsd:freebsd:1.0"/> <cpe-lang:fact-ref name="cpe:/o:freebsd:freebsd:1.1"/> <cpe-lang:fact-ref name="cpe:/o:freebsd:freebsd:1.1.5.1"/> <cpe-lang:fact-ref name="cpe:/o:freebsd:freebsd:1.2"/> <cpe-lang:fact-ref name="cpe:/o:freebsd:freebsd:2.0"/> <cpe-lang:fact-ref name="cpe:/o:freebsd:freebsd:2.0.5"/> <cpe-lang:fact-ref
Re: Indexed epoch time in Solr
I think copying to a new Solr date field is your best bet, because then you have the flexibility to do date range facets in the future. If you can re-index and are using the Data Import Handler, Jim Musil's suggestion is just right. If you can re-index and are not using the Data Import Handler: - This seems like a job for an UpdateRequestProcessor (https://cwiki.apache.org/confluence/display/solr/Update+Request+Processors), but I don't see one for this. - This seems like a good candidate for a standard, core UpdateRequestProcessor, but I haven't checked JIRA for a bug report. If the scale is too large to re-index, then there is surely still a way, but I'm not sure I can advise you on the best one. I'm not a Solr expert yet... just someone on the list with an IR background. On Mon, Jan 26, 2015 at 12:35 AM, Ahmed Adel ahmed.a...@badrit.com wrote: Hi All, Is there a way to convert a unix time field that is already indexed to ISO-8601 format in the query response? If this is not possible at the query level, what is the best way to copy this field to a new Solr standard date field? Thanks, -- *Ahmed Adel*
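The conversion an UpdateRequestProcessor (or any re-indexing script) would perform here is tiny: turn a unix epoch value into the ISO-8601 form Solr date fields expect. A minimal sketch of just that step, in Python for brevity; the function name is mine:

```python
from datetime import datetime, timezone

def epoch_to_iso8601(epoch_seconds):
    """Convert a unix timestamp (seconds) into the ISO-8601 / UTC form
    that Solr date fields expect, e.g. 1970-01-01T00:00:00Z."""
    dt = datetime.fromtimestamp(int(epoch_seconds), tz=timezone.utc)
    return dt.strftime("%Y-%m-%dT%H:%M:%SZ")
```

In a re-indexing pass you would read the existing epoch field, write the result into the new date field, and leave the original field untouched so nothing else breaks.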
Re: Solr admin Url issues
Is Jetty actually running on port 80? Do you have an Apache2 reverse proxy in front? On Mon, Jan 26, 2015 at 11:02 PM, Summer Shire shiresum...@gmail.com wrote: Hi All, Running solr (4.7.2) locally and hitting the admin page like this works just fine: http://localhost:8983/solr/ or http://localhost:8983/solr/# But on my deployment server my path is http://example.org/jetty/MyApp/1/solr/# or http://example.org/jetty/MyApp/1/solr/admin/cores or http://example.org/jetty/MyApp/1/solr/main/admin/ The above request in a browser loads the admin page halfway and then spawns another request at http://example.org/solr/admin/cores .... How can I maintain my other path segments, such as jetty/MyApp/1/? By the way, http://example.org/jetty/MyApp/1/solr/main/select?q=*:* or any other request handlers work just fine. What is going on here? Any ideas? thanks, Summer
Re: How to implement Auto complete, suggestion client side
It cannot get any easier than jquery-ui's autocomplete widget - http://jqueryui.com/autocomplete/. Basically, you set some classes and implement JavaScript that calls the server to get the autocomplete data. I would never expose Solr directly to browsers, so I would have the AJAX call go to a PHP script (or a function/method if you are using a web framework such as CakePHP or Symfony). Then, on the server, you make a request to Solr's /suggest or /spell handler with wt=json, and reformulate the result into a simple JSON response that is just an array of options. You can do this in stages: - Constant suggestions - you change your HTML and implement JavaScript that shows constant suggestions after, for instance, 2 seconds. - Constant suggestions from the server - you change your JavaScript to call the server, and have the server return a constant list. - Dynamic suggestions from the server - you implement the server side to query Solr and turn the return from /suggest or /spell into a JSON array. - Tuning, tuning, tuning - you work hard on tuning it so that you get high-quality suggestions for a wide variety of inputs. Note that the autocomplete I've described is basically the simplest thing possible, since you say you are new to this. It is not based on data mining of query and click-through logs, which is a very common pattern these days, and there is no bolding of the portion of the words that are new. It is just a basic autocomplete widget with a delay. On Mon, Jan 26, 2015 at 5:11 PM, Olivier Austina olivier.aust...@gmail.com wrote: Hi All, I would say I am new to web technology. I would like to implement autocomplete/suggestion in the search box as the user types (like Google, for example). I am using Solr as a database. Basically I am familiar with Solr and I can formulate suggestion queries, but I don't know how to implement suggestions in the user interface. Which technologies do I need? The website is in PHP. 
Any suggestions, examples, or basic tutorials are welcome. Thank you. Regards Olivier
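The server-side reformulation step Dan describes (turning Solr's /suggest or /spell JSON into a plain array for the widget) is small. Here is a hedged sketch in Python; the response shape assumed in the comment varies across Solr versions and suggester components, so adjust the parsing to whatever your handler actually returns:

```python
def suggestions_to_array(solr_response, term):
    """Flatten a spellcheck-style Solr response into a plain list of
    suggestion strings for an autocomplete widget.

    Assumed shape (varies by component/version):
      {"spellcheck": {"suggestions": [term, {"suggestion": [...]}, ...]}}
    """
    pairs = solr_response.get("spellcheck", {}).get("suggestions", [])
    out = []
    # The list alternates: queried term, then a dict of its suggestions.
    for i in range(0, len(pairs) - 1, 2):
        if pairs[i] == term:
            out.extend(pairs[i + 1].get("suggestion", []))
    return out
```

Your PHP endpoint would do the same thing: call Solr with wt=json, run the response through this kind of flattening, and emit the bare array the jquery-ui widget expects.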
Re: [MASSMAIL]Weighting of prominent text in HTML
Helps lots. Thanks, Jorge Luis. Good point about different fields - I'll just put the h1 and h2 (however deep I want to go) into fields, and we can sort out weighting, and whether we want it, later with edismax. The blogs on adding plugins for that sort of thing look straightforward. On Mon, Jan 26, 2015 at 12:47 AM, Jorge Luis Betancourt González jlbetanco...@uci.cu wrote: Hi Dan: Agreed, this question is more Nutch-related than Solr ;) Nutch doesn't send any data to the /update/extract request handler; all the text and metadata extraction happens on the Nutch side rather than relying on the ExtractingRequestHandler provided by Solr. Underneath, Nutch uses Tika, the same technology as the ExtractingRequestHandler, so there shouldn't be any great difference. By default Nutch doesn't boost anything, as it is Solr's job to boost the different content in the different fields, which is what happens when you query Solr. Nutch calculates the LinkRank, which is a variation of the famous PageRank (or the OPIC score, another scoring algorithm implemented in Nutch, which I believe is the default in Nutch 2.x). What you can do is map the heading tags into different fields and then apply different boosts to each field. The general idea with Nutch is to split the web page into pieces and store each piece in a different field in Solr; then you can tweak your relevance function using the values you see fit, so you don't need to write any plugin to accomplish this (at least for the h1, h2, etc. example you provided; if you want to extract other parts of the web page you'll need to write your own plugin to do so). 
Nutch is highly customizable: you can write a plugin for almost any piece of logic, from parsers to indexers, passing through URL filters, scoring algorithms, protocols, and a long, long list. Usually the plugins are not so difficult to write; the hard part is knowing which extension point you need to use, and that comes with experience and a good dive into the source code. Hope this helps, - Original Message - From: Dan Davis dansm...@gmail.com To: solr-user solr-user@lucene.apache.org Sent: Monday, January 26, 2015 12:08:13 AM Subject: [MASSMAIL]Weighting of prominent text in HTML By examining solr.log, I can see that Nutch is using the /update request handler rather than /update/extract. So, this may be a more appropriate question for the nutch mailing list. OTOH, y'all may know the answer off the top of your head. Will Nutch boost text occurring in h1, h2, etc. more heavily than text in a normal paragraph? Can this weighting be tuned without writing a plugin? Is writing a plugin often needed because of the flexibility that is needed in practice? I wanted to call this post *Anatomy of a small scale search engine*, but lacked the nerve ;) Thanks, all and many, Dan Davis, Systems/Applications Architect National Library of Medicine --- XII Anniversary of the founding of the University of Informatics Sciences. 12 years of history alongside Fidel. December 12, 2014.
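Jorge's advice above (map headings into separate fields, then boost them at query time with edismax) boils down to a handful of query parameters. A sketch in Python; the field names (title, h1, h2, content) assume a Nutch-style field mapping, and the weights are made-up illustrations to be tuned, not recommended values:

```python
from urllib.parse import urlencode

def build_edismax_params(user_query):
    """Query parameters for an edismax search that weights heading
    fields above body text. Field names and boosts are illustrative."""
    return {
        "q": user_query,
        "defType": "edismax",
        # qf lists fields with per-field boosts: higher = more weight.
        "qf": "title^5 h1^3 h2^2 content^1",
    }

def build_select_query_string(user_query):
    """URL-encoded query string for Solr's /select handler."""
    return urlencode(build_edismax_params(user_query))
```

Because the boosting lives entirely in the query, you can tune the weights (or drop heading boosts altogether) without re-crawling or re-indexing anything.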
Weighting of prominent text in HTML
By examining solr.log, I can see that Nutch is using the /update request handler rather than /update/extract. So, this may be a more appropriate question for the nutch mailing list. OTOH, y'all may know the answer off the top of your head. Will Nutch boost text occurring in h1, h2, etc. more heavily than text in a normal paragraph? Can this weighting be tuned without writing a plugin? Is writing a plugin often needed because of the flexibility that is needed in practice? I wanted to call this post *Anatomy of a small scale search engine*, but lacked the nerve ;) Thanks, all and many, Dan Davis, Systems/Applications Architect National Library of Medicine
Re: solr replication vs. rsync
@Erick, my problem space is not constant indexing. I had thought SolrCloud replicas used replication, but you imply parallel indexing. Good to know. On Sunday, January 25, 2015, Erick Erickson erickerick...@gmail.com wrote: @Shawn: Cool table, thanks! @Dan: Just to throw a different spin on it, if you migrate to SolrCloud, then this question becomes moot, as the raw documents are sent to each of the replicas, so you very rarely have to copy the full index. It's a tradeoff between constant load, because you're sending the raw documents around whenever you index, and peak usage when the index replicates. There are a bunch of other reasons to go to SolrCloud, but you know your problem space best. FWIW, Erick On Sun, Jan 25, 2015 at 9:26 AM, Shawn Heisey apa...@elyograg.org wrote: On 1/24/2015 10:56 PM, Dan Davis wrote: When I polled the various projects already using Solr at my organization, I was greatly surprised that none of them were using Solr replication, because they had talked about replicating the data. But we are not Pinterest, and do not expect to be taking in changes one post at a time (at least the engineers don't - just wait until it's used for a CRUD app that wants full-text search on a description field!). Still, rsync can be very, very fast with the right options (-W for gigabit ethernet, and maybe -S for sparse files). I've clocked it at 48 MB/s over GigE previously. Does anyone have any numbers for how fast Solr replication goes, and what to do to tune it? I'm not enthusiastic about giving up recently tested cluster stability for a home-grown mess, but I am interested in numbers that are out there. Numbers are included on the Solr replication wiki page, both in graph and numeric form. Gathering these numbers must have been pretty easy -- before HTTP replication made it into Solr, Solr used to contain an rsync-based implementation. 
http://wiki.apache.org/solr/SolrReplication#Performance_numbers Other data on that wiki page discusses the replication config. There's not a lot to tune. I run a redundant non-SolrCloud index myself through a different method -- my indexing program indexes each index copy completely independently. There is no replication. This separation allows me to upgrade any component, or change any part of solrconfig or schema, on either copy of the index without affecting the other copy at all. With replication, if something is changed on the master or the slave, you might find that the slave no longer works, because it will be handling an index created by different software or a different config. Thanks, Shawn
Re: solr replication vs. rsync
Thanks! On Sunday, January 25, 2015, Erick Erickson erickerick...@gmail.com wrote: @Shawn: Cool table, thanks! @Dan: Just to throw a different spin on it, if you migrate to SolrCloud, then this question becomes moot, as the raw documents are sent to each of the replicas, so you very rarely have to copy the full index. It's a tradeoff between constant load, because you're sending the raw documents around whenever you index, and peak usage when the index replicates. There are a bunch of other reasons to go to SolrCloud, but you know your problem space best. FWIW, Erick On Sun, Jan 25, 2015 at 9:26 AM, Shawn Heisey apa...@elyograg.org wrote: On 1/24/2015 10:56 PM, Dan Davis wrote: When I polled the various projects already using Solr at my organization, I was greatly surprised that none of them were using Solr replication, because they had talked about replicating the data. But we are not Pinterest, and do not expect to be taking in changes one post at a time (at least the engineers don't - just wait until it's used for a CRUD app that wants full-text search on a description field!). Still, rsync can be very, very fast with the right options (-W for gigabit ethernet, and maybe -S for sparse files). I've clocked it at 48 MB/s over GigE previously. Does anyone have any numbers for how fast Solr replication goes, and what to do to tune it? I'm not enthusiastic about giving up recently tested cluster stability for a home-grown mess, but I am interested in numbers that are out there. Numbers are included on the Solr replication wiki page, both in graph and numeric form. Gathering these numbers must have been pretty easy -- before HTTP replication made it into Solr, Solr used to contain an rsync-based implementation. http://wiki.apache.org/solr/SolrReplication#Performance_numbers Other data on that wiki page discusses the replication config. There's not a lot to tune. 
I run a redundant non-SolrCloud index myself through a different method -- my indexing program indexes each index copy completely independently. There is no replication. This separation allows me to upgrade any component, or change any part of solrconfig or schema, on either copy of the index without affecting the other copy at all. With replication, if something is changed on the master or the slave, you might find that the slave no longer works, because it will be handling an index created by different software or a different config. Thanks, Shawn
solr replication vs. rsync
When I polled the various projects already using Solr at my organization, I was greatly surprised that none of them were using Solr replication, because they had talked about replicating the data. But we are not Pinterest, and do not expect to be taking in changes one post at a time (at least the engineers don't - just wait until it's used for a CRUD app that wants full-text search on a description field!). Still, rsync can be very, very fast with the right options (-W for gigabit ethernet, and maybe -S for sparse files); I've clocked it at 48 MB/s over GigE previously. Does anyone have any numbers for how fast Solr replication goes, and what to do to tune it? I'm not enthusiastic about giving up recently tested cluster stability for a home-grown mess, but I am interested in numbers that are out there.
Re: OutOfMemoryError for PDF document upload into Solr
Why rewrite all the document conversion in Java ;) Tika is very slow. A 5 GB PDF is very big. If you have a lot of PDFs like that, try pdftotext in HTML and UTF-8 output mode. The HTML mode captures some metadata that would otherwise be lost. If you need to go faster still, you can also write some stuff linked directly against the poppler library. Before you jump down my throat about Tika being slow - I wrote a PDF indexer that ran at 36 MB/s per core. Different indexer, all C, lots of setjmp/longjmp. But fast... On Thu, Jan 15, 2015 at 1:54 PM, ganesh.ya...@sungard.com wrote: Siegfried and Michael, thank you for your replies and help. -Original Message- From: Siegfried Goeschl [mailto:sgoes...@gmx.at] Sent: Thursday, January 15, 2015 3:45 AM To: solr-user@lucene.apache.org Subject: Re: OutOfMemoryError for PDF document upload into Solr Hi Ganesh, you can increase the heap size, but parsing a 4 GB PDF document will very likely consume A LOT OF memory - I think you need to check whether that large a PDF can be parsed at all :-) Cheers, Siegfried Goeschl On 14.01.15 18:04, Michael Della Bitta wrote: Yep, you'll have to increase the heap size for your Tomcat container. http://stackoverflow.com/questions/6897476/tomcat-7-how-to-set-initial-heap-size-correctly Michael Della Bitta Senior Software Engineer o: +1 646 532 3062 appinions inc. "The Science of Influence Marketing" 18 East 41st Street New York, NY 10017 t: @appinions https://twitter.com/Appinions | w: appinions.com On Wed, Jan 14, 2015 at 12:00 PM, ganesh.ya...@sungard.com wrote: Hello, can someone pass on hints to get around the following error? Is there a Heap Size parameter I can set in Tomcat or in the Solr webapp that gets deployed in Solr? I am running the Solr webapp inside Tomcat on my local machine, which has 12 GB of RAM. 
I have a PDF document which is 4 GB max in size that needs to be loaded into Solr. Exception in thread http-apr-8983-exec-6 java.lang.OutOfMemoryError: Java heap space at java.util.AbstractCollection.toArray(Unknown Source) at java.util.ArrayList.<init>(Unknown Source) at org.apache.pdfbox.cos.COSDocument.getObjects(COSDocument.java:518) at org.apache.pdfbox.cos.COSDocument.close(COSDocument.java:575) at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:254) at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1238) at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1203) at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:111) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242) at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120) at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:219) at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135) at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:246) at org.apache.solr.core.SolrCore.execute(SolrCore.java:1967) at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:777) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:418) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208) at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:220) at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:122) at 
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:170) at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:103) at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:950) at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:116) at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:421) at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1070) at
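The pdftotext advice above can be sketched as a one-liner. -htmlmeta and -enc are standard pdftotext (poppler-utils) options; the input file name here is hypothetical:

```shell
# Convert a PDF to HTML-wrapped UTF-8 text outside the JVM, avoiding Tika's
# heap usage entirely; -htmlmeta emits a simple HTML wrapper whose <meta>
# tags preserve title/author metadata that plain text mode would drop.
# "big-manual.pdf" is a hypothetical input file.
if command -v pdftotext >/dev/null 2>&1; then
  pdftotext -htmlmeta -enc UTF-8 big-manual.pdf big-manual.html
  RESULT="converted"
else
  RESULT="pdftotext not installed (part of poppler-utils)"
fi
echo "$RESULT"
```

The resulting HTML can then be posted to Solr (or parsed further) without Tika ever touching the multi-gigabyte original.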
Improved suggester question
The suggester is not working for me with Solr 4.10.2. Can anyone shed light on why I might be getting the exception below when I build the dictionary?

<response>
  <lst name="responseHeader">
    <int name="status">500</int>
    <int name="QTime">26</int>
  </lst>
  <lst name="error">
    <str name="msg">len must be &lt;= 32767; got 35680</str>
    <str name="trace">java.lang.IllegalArgumentException: len must be &lt;= 32767; got 35680
    at org.apache.lucene.util.OfflineSorter$ByteSequencesWriter.write(OfflineSorter.java:479)
    at org.apache.lucene.search.suggest.analyzing.AnalyzingSuggester.build(AnalyzingSuggester.java:493)
    at org.apache.lucene.search.suggest.Lookup.build(Lookup.java:190)
    at org.apache.solr.spelling.suggest.SolrSuggester.build(SolrSuggester.java:160)
    at org.apache.solr.handler.component.SuggestComponent.prepare(SuggestComponent.java:165)
    at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:197)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
    at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:246)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1967)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:777)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:418)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:222)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:123)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:171)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:100)
    at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:953)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:408)
    at org.apache.coyote.ajp.AjpProcessor.process(AjpProcessor.java:200)
    at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:603)
    at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:310)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)</str>
    <int name="code">500</int>
  </lst>
</response>

Thank you. I've configured my suggester as follows:

<searchComponent name="suggest" class="solr.SuggestComponent">
  <lst name="suggester">
    <str name="name">mySuggester</str>
    <str name="lookupImpl">FuzzyLookupFactory</str>
    <str name="dictionaryImpl">DocumentDictionaryFactory</str>
    <str name="field">text</str>
    <str name="weightField">medsite_id</str>
    <str name="suggestAnalyzerFieldType">text_general</str>
    <str name="buildOnCommit">true</str>
    <str name="threshold">0.1</str>
  </lst>
</searchComponent>

<requestHandler name="/suggest" class="solr.SearchHandler" startup="lazy">
  <lst name="defaults">
    <str name="suggest">on</str>
    <str name="suggest.dictionary">mySuggester</str>
    <str name="suggest.count">10</str>
  </lst>
  <arr name="components">
    <str>suggest</str>
  </arr>
</requestHandler>
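For reference, the dictionary above can also be built explicitly over HTTP rather than via buildOnCommit; a sketch, assuming the /suggest handler and mySuggester dictionary names from the config, and a local Solr with a hypothetical core name:

```shell
# suggest.build=true asks the SuggestComponent to (re)build its dictionary;
# suggest.dictionary names which configured suggester to build.
SOLR="http://localhost:8983/solr/collection1"   # hypothetical core name
RESPONSE=$(curl -s "$SOLR/suggest?suggest=true&suggest.dictionary=mySuggester&suggest.build=true" \
  || echo "solr-unreachable")
echo "$RESPONSE"
```

Triggering the build deliberately, instead of on every commit, makes it easier to catch and reproduce errors like the one above.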
Re: Logging in Solr's DataImportHandler
Mikhail, Thanks - it works now. The script transformer was really not needed, a template transformer is clearer, and the log transformer is now working.

On Mon, Dec 8, 2014 at 1:56 AM, Mikhail Khludnev mkhlud...@griddynamics.com wrote:

Hello Dan, Usually it works well. Can you describe how you run it, particularly, e.g. what you download exactly and what's the command line?

On Fri, Dec 5, 2014 at 11:37 PM, Dan Davis dansm...@gmail.com wrote:

I have a script transformer and a log transformer, and I'm not seeing the log messages, at least not where I expect. Is there any way I can simply log a custom message from within my script? Can the script easily interact with its container's logger?

-- Sincerely yours, Mikhail Khludnev, Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
Suggester questions
I am having some trouble getting the suggester to work. The spell requestHandler is working, but I didn't like the results I was getting from the word-breaking dictionary and turned it off. So, some basic questions:

- How can I check on the status of a dictionary?
- How can I see what is in that dictionary?
- How do I actually rebuild the dictionary manually? All attempts to set spellcheck.build=on or suggest.build=on have led to nearly instant results (0 suggestions for the latter), indicating something is wrong.

Thanks, Daniel Davis
Re: Best way to implement Spotlight of certain results
Maybe I can use grouping, but my understanding of the feature is not up to figuring that out :) I tried something like

http://localhost:8983/solr/collection/select?q=childhood+cancer&group=on&group.query=childhood+cancer

Because group.limit=1, I get a single result, and no other results. If I add group.field=title, then I get each result, in a group of 1 member... Erick's re-ranking I do understand - I can re-rank the top-N to make sure the spotlighted result is always first, avoiding the potential problem of having to overweight the title field. In practice, I may not ever need to use the re-ranking, but it's there if I need it. This is enough, because it gives me talking points.

On Fri, Jan 9, 2015 at 3:05 PM, Michał B. . m.bienkow...@gmail.com wrote:

Maybe I understand you badly, but I think that you could use grouping to achieve such an effect. If you prepare two group queries, one with the exact match and the other, let's say, the default, then you will be able to extract the matches from the grouping results. E.g. (using the default Solr example collection):

http://localhost:8983/solr/collection1/select?q=*:*&group=true&group.query=manu%3A%22Ap+Computer+Inc.%22&group.query=name:Apple%2060%20GB%20iPod%20with%20Video%20Playback%20Black&group.limit=10

This query will return two groups: one with the exact match, the second with the rest of the standard results. Regards, Michal

2015-01-09 20:44 GMT+01:00 Erick Erickson erickerick...@gmail.com:

Hmm, I wonder if the RerankingQueryParser might help here? See: https://cwiki.apache.org/confluence/display/solr/Query+Re-Ranking Best, Erick

On Fri, Jan 9, 2015 at 10:35 AM, Dan Davis dansm...@gmail.com wrote:

I have a requirement to spotlight certain results if the query text exactly matches the title or see reference (indexed by me as alttitle_t). What that means is that these matching results are shown above the top-10/20 list with different CSS and fields. It's like feeling lucky on Google :) I have considered three ways of implementing this: 1.
Assume that edismax qf/pf will boost these results to be first when there is an exact match on these important fields. The downside then is that my relevancy is constrained and I must maintain my configuration with title and alttitle_t as top search fields (see the XML snippet below). I may have to overweight them to achieve the always-first criterion. Another, less major downside is that I must always return the spotlight summary field (for display) and the image to display on each search. These could be fetched from a database by id; however, it is convenient to get them from Solr. 2. Issue two searches for every user search, and use a second set of parameters (change the search type and fields to search only by exactly matching a specific string field, spottitle_s). The search for the spotlight can then have its own configuration. The downside here is that I am using Django and pysolr for the front-end, and pysolr is both synchronous and tied, by convention, to the requestHandler named select. Of course, running in parallel is not a fix-all - running a search takes some time, even if run in parallel. 3. Automate the population of elevate.xml so that all 959 of these queries are there. This is probably best, but forces me to restart/reload when there are changes to this component. The elevation can be done through a query. What I'd love to do is to configure the select requestHandler to run both searches and return me both sets of results. Is there any way to do that - apply the same q= parameter to two configured ways to run a search? Something like sub-queries? I suspect that approach 1 will get me through my demo and a brief evaluation period, but that either approach 2 or 3 will be the winner. Here's a snippet from my current qf/pf configuration:

<str name="qf">title^100 alttitle_t^100 ... text</str>
<str name="pf">title^1000 alttitle_t^1000 ... text^10</str>

Thanks, Dan Davis

-- Michał Bieńkowski
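The two-group idea in this thread can be issued as a single request. A sketch, where the exact-match field spottitle_s, the core name, and the query text are placeholders for illustration:

```shell
# One request, two group.query clauses: the first group holds the exact-title
# "spotlight" match, the second holds the ordinary relevance results.
# curl -G + --data-urlencode builds the query string with proper escaping.
SOLR="http://localhost:8983/solr/collection1"   # hypothetical core name
RESPONSE=$(curl -s -G "$SOLR/select" \
  --data-urlencode 'q=childhood cancer' \
  --data-urlencode 'group=true' \
  --data-urlencode 'group.query=spottitle_s:"childhood cancer"' \
  --data-urlencode 'group.query=*:*' \
  --data-urlencode 'group.limit=10' \
  || echo "solr-unreachable")
echo "$RESPONSE"
```

This keeps the spotlight and the regular results in one round trip, which sidesteps the "two synchronous pysolr searches" downside of approach 2.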
Re: Occasionally getting error in solr suggester component.
Related question - I see mention of needing to rebuild the spellcheck/suggest dictionary after a Solr core reload. I see spellcheckIndexDir in both the old wiki entry and the Solr reference guide https://cwiki.apache.org/confluence/display/solr/Spell+Checking. If this parameter is provided, it sounds like the index is stored on the filesystem and need not be rebuilt each time the core is reloaded. Is this a correct understanding?

On Tue, Jan 13, 2015 at 2:17 PM, Michael Sokolov msoko...@safaribooksonline.com wrote:

I think you are probably getting bitten by one of the issues addressed in LUCENE-5889. I would recommend against using buildOnCommit=true - with a large index this can be a performance-killer. Instead, build the index yourself using the Solr spellchecker support (spellcheck.build=true). -Mike

On 01/13/2015 10:41 AM, Dhanesh Radhakrishnan wrote:

Hi all, I am experiencing a problem in the Solr SuggestComponent. Occasionally the suggester component throws an error like:

Solr failed: {"responseHeader":{"status":500,"QTime":1},"error":{"msg":"suggester was not built","trace":"
java.lang.IllegalStateException: suggester was not built
    at org.apache.lucene.search.suggest.analyzing.AnalyzingInfixSuggester.lookup(AnalyzingInfixSuggester.java:368)
    at org.apache.lucene.search.suggest.analyzing.AnalyzingInfixSuggester.lookup(AnalyzingInfixSuggester.java:342)
    at org.apache.lucene.search.suggest.Lookup.lookup(Lookup.java:240)
    at org.apache.solr.spelling.suggest.SolrSuggester.getSuggestions(SolrSuggester.java:199)
    at org.apache.solr.handler.component.SuggestComponent.process(SuggestComponent.java:234)
    at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:218)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
    at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:246)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1967)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:777)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:418)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:225)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:123)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:168)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:98)
    at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:927)
    at org.apache.catalina.valves.RemoteIpValve.invoke(RemoteIpValve.java:680)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:407)
    at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1002)
    at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:579)
    at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:312)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
","code":500}}

This is not happening frequently, but when indexing and the suggester component are working together this error occurs.
In solrconfig:

<searchComponent name="suggest" class="solr.SuggestComponent">
  <lst name="suggester">
    <str name="name">haSuggester</str>
    <str name="lookupImpl">AnalyzingInfixLookupFactory</str> <!-- org.apache.solr.spelling.suggest.fst -->
    <str name="suggestAnalyzerFieldType">textSpell</str>
    <str name="dictionaryImpl">DocumentDictionaryFactory</str> <!-- org.apache.solr.spelling.suggest.HighFrequencyDictionaryFactory -->
    <str name="field">name</str>
    <str name="weightField">packageWeight</str>
    <str name="buildOnCommit">true</str>
  </lst>
</searchComponent>

<requestHandler name="/suggest" class="solr.SearchHandler" startup="lazy">
  <lst name="defaults">
    <str name="suggest">true</str>
    <str name="suggest.count">10</str>
  </lst>
  <arr name="components">
    <str>suggest</str>
  </arr>
</requestHandler>

Can anyone suggest where to look to figure out why these errors are occurring?

Thanks, dhanesh s.r
Best way to implement Spotlight of certain results
I have a requirement to spotlight certain results if the query text exactly matches the title or see reference (indexed by me as alttitle_t). What that means is that these matching results are shown above the top-10/20 list with different CSS and fields. It's like feeling lucky on Google :) I have considered three ways of implementing this: 1. Assume that edismax qf/pf will boost these results to be first when there is an exact match on these important fields. The downside then is that my relevancy is constrained and I must maintain my configuration with title and alttitle_t as top search fields (see the XML snippet below). I may have to overweight them to achieve the always-first criterion. Another, less major downside is that I must always return the spotlight summary field (for display) and the image to display on each search. These could be fetched from a database by id; however, it is convenient to get them from Solr. 2. Issue two searches for every user search, and use a second set of parameters (change the search type and fields to search only by exactly matching a specific string field, spottitle_s). The search for the spotlight can then have its own configuration. The downside here is that I am using Django and pysolr for the front-end, and pysolr is both synchronous and tied, by convention, to the requestHandler named select. Of course, running in parallel is not a fix-all - running a search takes some time, even if run in parallel. 3. Automate the population of elevate.xml so that all 959 of these queries are there. This is probably best, but forces me to restart/reload when there are changes to this component. The elevation can be done through a query. What I'd love to do is to configure the select requestHandler to run both searches and return me both sets of results. Is there any way to do that - apply the same q= parameter to two configured ways to run a search? Something like sub-queries?
I suspect that approach 1 will get me through my demo and a brief evaluation period, but that either approach 2 or 3 will be the winner. Here's a snippet from my current qf/pf configuration:

<str name="qf">title^100 alttitle_t^100 ... text</str>
<str name="pf">title^1000 alttitle_t^1000 ... text^10</str>

Thanks, Dan Davis
Re: Spellchecker delivers far too few suggestions
What about the frequency comparison - I haven't used the spellchecker heavily, but it seems that if bnak is in the database, but bank is much more frequent, then bank should be a suggestion anyway...

On Wed, Dec 17, 2014 at 10:41 AM, Erick Erickson erickerick...@gmail.com wrote:

First, I'd look in your corpus for bnak. The problem with index-based suggestions is that if your index contains garbage, the garbage terms are correctly spelled as far as the spellchecker is concerned, since they're in the index. TermsComponent is very useful for this. You can also loosen up the match criteria, and as I remember the collations parameter does some permutations of the word (but my memory of how that works is shaky). Best, Erick

On Wed, Dec 17, 2014 at 9:13 AM, Martin Dietze mdie...@gmail.com wrote:

I recently upgraded to SOLR 4.10.1 and after that set up the spell checker, which I use for returning suggestions after searches with few or no results. When the spellchecker is active, this request handler is used (most of which is taken from examples I found on the net):

<requestHandler name="standardWithSpell" class="solr.SearchHandler" default="false">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <str name="spellcheck">true</str>
    <str name="spellcheck.onlyMorePopular">false</str>
    <str name="spellcheck.count">10</str>
    <str name="spellcheck.collate">false</str>
    <str name="q.alt">*:*</str>
    <str name="echoParams">explicit</str>
    <int name="rows">50</int>
    <str name="fl">*,score</str>
  </lst>
  <arr name="last-components">
    <str>spellcheck</str>
  </arr>
</requestHandler>

The search component is configured as follows (again, most of it copied from examples on the net):

<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <str name="queryAnalyzerFieldType">text</str>
  <lst name="spellchecker">
    <str name="name">default</str>
    <str name="field">text</str>
    <str name="classname">solr.DirectSolrSpellChecker</str>
    <str name="distanceMeasure">internal</str>
    <float name="accuracy">0.3</float>
    <int name="maxEdits">2</int>
    <int name="minPrefix">1</int>
    <int name="maxInspections">5</int>
    <int name="minQueryLength">4</int>
    <float name="maxQueryFrequency">0.01</float>
    <float name="maxQueryFrequency">.01</float>
  </lst>
</searchComponent>

With this setup I can get suggestions for misspelled words. The results on my developer machine were mostly fine, but on the test system (much larger database, much larger search index) I found it very hard to get suggestions at all. If for instance I misspell "bank" as "bnak", I'd expect to get a suggestion for "bank" (since that word can be found in the index very often). I've played around with maxQueryFrequency with no success. Does anyone see any obvious misconfiguration? Anything that I could try? Any way I can debug this? (The problem is that my application uses the core API, which means trying out requests through the web interface does not work.) Any help would be greatly appreciated! Cheers, Martin

-- mdie...@gmail.com --/-- mar...@the-little-red-haired-girl.org / http://herbert.the-little-red-haired-girl.org
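Erick's TermsComponent tip can be checked from the command line. A sketch, assuming a /terms handler is registered (as in the stock example configs), the field is named text, and the core name is a placeholder:

```shell
# Look for garbage tokens like "bnak" directly in the index: if the term
# exists there, the spellchecker treats it as correctly spelled and will
# never suggest "bank" in its place.
SOLR="http://localhost:8983/solr/collection1"   # hypothetical core name
RESPONSE=$(curl -s "$SOLR/terms?terms.fl=text&terms.prefix=bnak&terms.limit=10" \
  || echo "solr-unreachable")
echo "$RESPONSE"
```

A non-zero document count for the misspelling would explain why the spellchecker stays silent.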
Re: Tika HTTP 400 Errors with DIH
I would say that you should determine a row that gives a bad URL, and then run it in the DIH admin interface (or from the command line) with debug enabled. The url parameter going into Tika should be present, in its transformed form, before the next entity gets going. This works in a similar scenario for me.

On Tue, Dec 2, 2014 at 1:19 PM, Teague James teag...@insystechinc.com wrote:

Hi all, I am using Solr 4.9.0 to index a DB with DIH. In the DB there is a URL field. In the DIH, Tika uses that field to fetch and parse the documents. The URL from the field is valid and will download the document in the browser just fine. But Tika is getting HTTP response code 400. Any ideas why?

ERROR BinURLDataSource java.io.IOException: Server returned HTTP response code: 400 for URL:
EntityProcessorWrapper Exception in entity : tika_content:org.apache.solr.handler.dataimport.DataImportHandlerException: Exception in invoking url

DIH dataConfig:

<dataConfig>
  <dataSource type="JdbcDataSource" name="ds-1" driver="net.sourceforge.jtds.jdbc.Driver"
              url="jdbc:jtds:sqlserver://1.2.3.4/database;instance=INSTANCE;user=USER;password=PASSWORD" />
  <dataSource type="BinURLDataSource" name="ds-2" />
  <document>
    <entity name="db_content" dataSource="ds-1" transformer="ClobTransformer, RegexTransformer"
            query="SELECT ContentID, DownloadURL FROM DATABASE.VIEW">
      <field column="ContentID" name="id" />
      <field column="DownloadURL" clob="true" name="DownloadURL" />
      <entity name="tika_content" processor="TikaEntityProcessor" url="${db_content.DownloadURL}"
              onError="continue" dataSource="ds-2">
        <field column="TikaParsedContent" />
      </entity>
    </entity>
  </document>
</dataConfig>

SCHEMA - Fields:

<field name="DownloadURL" type="string" indexed="true" stored="true" />
<field name="TikaParsedContent" type="text_general" indexed="true" stored="true" multiValued="true"/>
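Running one problem row through DIH with debugging on, as suggested above, can be done with a request along these lines (the /dataimport handler path and core name are assumptions about the setup):

```shell
# debug=true + verbose=true makes DIH echo each row through the transformer
# chain, so the url value handed to TikaEntityProcessor is visible before
# the fetch; rows=1 limits the run to one document, commit=false leaves the
# index untouched.
SOLR="http://localhost:8983/solr/collection1"   # hypothetical core name
RESPONSE=$(curl -s "$SOLR/dataimport?command=full-import&debug=true&verbose=true&rows=1&commit=false" \
  || echo "solr-unreachable")
echo "$RESPONSE"
```

Comparing the debug output's url value against what the browser downloads usually reveals whitespace or encoding differences that cause the 400.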
DIH XPathEntityProcessor question
When I have a forEach attribute like the following:

forEach="/medical-topics/medical-topic/health-topic[@language='English']"

And then need to match an attribute of that, is there any alternative to spelling it all out:

<field column="url" xpath="/medical-topics/medical-topic/health-topic[@language='English']/@url"/>

I suppose I could do //health-topic/@url, since the document should then have a single health-topic (as long as I know they don't nest).
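One way to see what a full XPath engine would select for these expressions is xmllint (from libxml2); note that DIH's XPathEntityProcessor supports only a subset of XPath, so a match here does not guarantee DIH accepts it. The sample document below is a hypothetical miniature of the one described:

```shell
# Evaluate the spelled-out path with libxml2's xmllint against a tiny
# stand-in document; a path that fails here will certainly fail in DIH.
if command -v xmllint >/dev/null 2>&1; then
  cat > /tmp/topics.xml <<'EOF'
<medical-topics>
  <medical-topic>
    <health-topic language="English" url="http://example.org/t1">
      <title>Topic One</title>
    </health-topic>
  </medical-topic>
</medical-topics>
EOF
  RESULT=$(xmllint --xpath "/medical-topics/medical-topic/health-topic[@language='English']/@url" /tmp/topics.xml)
else
  RESULT="xmllint not installed (libxml2-utils)"
fi
echo "$RESULT"
```

This separates "is my XPath wrong?" from "does DIH's subset support it?" when debugging.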
Re: DIH XPathEntityProcessor question
In experimentation with a much simpler and smaller XML file, it looks like //health-topic/@url will not work, nor will //@url, etc. So far, only spelling it all out will work. With child elements, such as title, an xpath of //title works fine, but it is beginning to seem dangerous. Is there any short-hand for the current node or the match?

On Mon, Dec 8, 2014 at 4:42 PM, Dan Davis dansm...@gmail.com wrote:

When I have a forEach attribute like the following:

forEach="/medical-topics/medical-topic/health-topic[@language='English']"

And then need to match an attribute of that, is there any alternative to spelling it all out:

<field column="url" xpath="/medical-topics/medical-topic/health-topic[@language='English']/@url"/>

I suppose I could do //health-topic/@url, since the document should then have a single health-topic (as long as I know they don't nest).
Re: DIH XPathEntityProcessor question
The problem is that XPathEntityProcessor implements XPath on its own, and implements only a subset of XPath. So, if the input document is small enough, it makes no sense to fight it. One possibility is to apply an XSLT to the file before processing it. This blog post http://www.andornot.com/blog/post/Sample-Solr-DataImportHandler-for-XML-Files.aspx shows a worked example. The XSL transform takes place before the forEach or field specifications, which is the principal question I had about it from the documentation. This is also illustrated in the initQuery() private method of XPathEntityProcessor - you can see the transformation being applied before the forEach. This will not scale to extremely large XML documents containing millions of rows - that is why they have the stream=true argument there, so that you don't preprocess the document. In my case, the entire XML file is 29 MB, and so I think I could do the XSL transformation and then do the forEach over each document. This potentially shortens my time frame of moving to Apache Solr substantially, because the common case with our previous indexer is to run XSLT to transform to the document format desired by the indexer.

On Mon, Dec 8, 2014 at 5:10 PM, Alexandre Rafalovitch arafa...@gmail.com wrote:

I don't believe there are any alternatives. At least I could not get anything but the full path to work. Regards, Alex.
Personal: http://www.outerthoughts.com/ and @arafalov
Solr resources and newsletter: http://www.solr-start.com/ and @solrstart
Solr popularizers community: https://www.linkedin.com/groups?gid=6713853

On 8 December 2014 at 17:01, Dan Davis dansm...@gmail.com wrote:

In experimentation with a much simpler and smaller XML file, it looks like //health-topic/@url will not work, nor will //@url, etc. So far, only spelling it all out will work. With child elements, such as title, an xpath of //title works fine, but it is beginning to seem dangerous. Is there any short-hand for the current node or the match?

On Mon, Dec 8, 2014 at 4:42 PM, Dan Davis dansm...@gmail.com wrote:

When I have a forEach attribute like the following:

forEach="/medical-topics/medical-topic/health-topic[@language='English']"

And then need to match an attribute of that, is there any alternative to spelling it all out:

<field column="url" xpath="/medical-topics/medical-topic/health-topic[@language='English']/@url"/>

I suppose I could do //health-topic/@url, since the document should then have a single health-topic (as long as I know they don't nest).
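The pre-flattening approach discussed in this thread can be sketched from the command line with xsltproc; the stylesheet and file names are hypothetical:

```shell
# Flatten the source XML with a full XSLT engine before DIH ever sees it,
# so the DIH config only needs trivial forEach/field paths.
# transform.xsl and topics.xml are hypothetical file names.
if command -v xsltproc >/dev/null 2>&1 && [ -f transform.xsl ] && [ -f topics.xml ]; then
  xsltproc transform.xsl topics.xml > flat.xml
  RESULT="wrote flat.xml"
else
  RESULT="skipped: xsltproc or input files missing"
fi
echo "$RESULT"
```

For a one-step pipeline, the same stylesheet can instead be referenced from the entity's xsl attribute so DIH applies it itself, as the blog post above demonstrates.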
Re: DIH XPathEntityProcessor question
Yes, that worked quite well. I still need the //tagname form, but that is the only DIH incantation I need. This will substantially accelerate things.

On Mon, Dec 8, 2014 at 5:37 PM, Dan Davis d...@danizen.net wrote:

The problem is that XPathEntityProcessor implements XPath on its own, and implements only a subset of XPath. So, if the input document is small enough, it makes no sense to fight it. One possibility is to apply an XSLT to the file before processing it. This blog post http://www.andornot.com/blog/post/Sample-Solr-DataImportHandler-for-XML-Files.aspx shows a worked example. The XSL transform takes place before the forEach or field specifications, which is the principal question I had about it from the documentation. This is also illustrated in the initQuery() private method of XPathEntityProcessor - you can see the transformation being applied before the forEach. This will not scale to extremely large XML documents containing millions of rows - that is why they have the stream=true argument there, so that you don't preprocess the document. In my case, the entire XML file is 29 MB, and so I think I could do the XSL transformation and then do the forEach over each document. This potentially shortens my time frame of moving to Apache Solr substantially, because the common case with our previous indexer is to run XSLT to transform to the document format desired by the indexer.

On Mon, Dec 8, 2014 at 5:10 PM, Alexandre Rafalovitch arafa...@gmail.com wrote:

I don't believe there are any alternatives. At least I could not get anything but the full path to work. Regards, Alex.
Personal: http://www.outerthoughts.com/ and @arafalov
Solr resources and newsletter: http://www.solr-start.com/ and @solrstart
Solr popularizers community: https://www.linkedin.com/groups?gid=6713853

On 8 December 2014 at 17:01, Dan Davis dansm...@gmail.com wrote:

In experimentation with a much simpler and smaller XML file, it looks like //health-topic/@url will not work, nor will //@url, etc. So far, only spelling it all out will work. With child elements, such as title, an xpath of //title works fine, but it is beginning to seem dangerous. Is there any short-hand for the current node or the match?

On Mon, Dec 8, 2014 at 4:42 PM, Dan Davis dansm...@gmail.com wrote:

When I have a forEach attribute like the following:

forEach="/medical-topics/medical-topic/health-topic[@language='English']"

And then need to match an attribute of that, is there any alternative to spelling it all out:

<field column="url" xpath="/medical-topics/medical-topic/health-topic[@language='English']/@url"/>

I suppose I could do //health-topic/@url, since the document should then have a single health-topic (as long as I know they don't nest).
Logging in Solr's DataImportHandler
I have a script transformer and a log transformer, and I'm not seeing the log messages, at least not where I expect. Is there any way I can simply log a custom message from within my script? Can the script easily interact with its container's logger?
Fwd: Best Practices for open source pipeline/connectors
The volume and influx rate in my scenario are very modest. Our largest collection with the existing indexing software is about 20 million objects, the second largest is about 5 million, and more typical collections are in the tens of thousands. Aside from the 20-million-object corpus, we re-index and replicate nightly. Note that I am not responsible for any specific operation, only for advising my organization on how to go. My organization wants to understand how much programming will be involved using Solr rather than higher-level tools. I have to acknowledge that our current solution involves less programming, even as I urge them to think of programming as not a bad thing ;) From my perspective, 'programming', that is, configuration files in a git archive (with internal comments and commit comments), is much, much more productive than using form-based configuration software. So, my organization's needs and mine may be different...

-- Forwarded message --
From: Jürgen Wagner (DVT) juergen.wag...@devoteam.com
Date: Tue, Nov 4, 2014 at 4:48 PM
Subject: Re: Best Practices for open source pipeline/connectors
To: solr-user@lucene.apache.org

Hello Dan, ManifoldCF is a connector framework, not a processing framework. Therefore, you may try your own lightweight connectors (which usually are not really rocket science, and may take less time to write than it takes to configure a super-generic connector of some sort), any connector out there (including Nutch and others), or even commercial offerings from some companies. That, however, won't make you very happy all by itself - my guess. Key to really creating value out of data dragged into a search platform is the processing pipeline.
Depending on the scale of data and the amount of processing you need to do, you may take a simplistic approach with just some more or less configurable Java components massaging your data until it can be sent to Solr (without using Tika or any other processing in Solr), or you can employ frameworks like Apache Spark to really heavily transform and enrich data before feeding it into Solr. I prefer to have a clear separation between connectors, processing, indexing/querying, and front-end visualization/interaction. Only the indexing/querying task do I grant to Solr (or naked Lucene or Elasticsearch). Each of the different task types has entirely different scaling requirements and computing/networking properties, so you definitely don't want them to depend on each other too much. Addressing the needs of several customers, one sometimes even needs to swap one or the other component in favour of what a customer prefers or needs. So, my answer is YES. But we've also tried Nutch, our own specialized crawlers, and a number of elaborate connectors for special customer applications. In any case, the result of that connector won't go into Solr. It will go into processing. From there it will go into Solr. I suspect that connectors won't be the challenge in your project. Solr requires a bit of tuning and tweaking, but you'll be fine eventually. Document processing will be the fun part. As you come to scale the zoo of components, this will become evident :-) What is the volume and influx rate in your scenario? Best regards, --Jürgen

On 04.11.2014 22:01, Dan Davis wrote:

I'm trying to do research for my organization on the best practices for open source pipelines/connectors. Since we need Web crawls, file system crawls, and databases, it seems to me that ManifoldCF might be the best case. Has anyone combined ManifoldCF with Solr UpdateRequestProcessors or DataImportHandler?
It would be nice to decide in ManifoldCF which resultHandler should receive a document or id; barring that, you can post some fields including a URL and have Data Import Handler handle it - it already supports scripts, whereas ManifoldCF may not at this time. Suggestions and ideas? Thanks, Dan

-- Mit freundlichen Grüßen/Kind regards/Cordialement vôtre/Atentamente/С уважением
i.A. Jürgen Wagner
Head of Competence Center Intelligence, Senior Cloud Consultant
Devoteam GmbH, Industriestr. 3, 70565 Stuttgart, Germany
Phone: +49 6151 868-8725, Fax: +49 711 13353-53, Mobile: +49 171 864 1543
E-Mail: juergen.wag...@devoteam.com, URL: www.devoteam.de
Managing Board: Jürgen Hatzipantelis (CEO)
Address of Record: 64331 Weiterstadt, Germany; Commercial Register: Amtsgericht Darmstadt HRB 6450; Tax Number: DE 172 993 071
Re: Tika Integration problem with DIH and JDBC
All, The problem here was that I gave driver=BinURLDataSource rather than type=BinURLDataSource. Of course, saying driver=BinURLDataSource caused it to be unable to find any such JDBC driver.
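For anyone hitting the same wall: in DIH configuration, driver= names a JDBC driver class, while type= names the DataSource implementation. A hedged sketch of a working layout (the entity, field, and connection names here are illustrative, not taken from the original config):

```xml
<!-- dih-example.xml (illustrative): the key point is type= on the
     Tika binary source, not driver= -->
<dataConfig>
  <!-- JDBC source: driver= takes a JDBC driver class name -->
  <dataSource name="db" driver="oracle.jdbc.OracleDriver"
              url="jdbc:oracle:thin:@//dbhost:1521/SVC"
              user="solr" password="..."/>
  <!-- Binary-over-HTTP source: type= takes the DataSource class -->
  <dataSource name="bin" type="BinURLDataSource"/>
  <document>
    <entity name="topic" dataSource="db" query="SELECT id, url FROM topics">
      <entity name="extract" dataSource="bin" processor="TikaEntityProcessor"
              url="${topic.url}" format="text">
        <field column="text" name="text"/>
      </entity>
    </entity>
  </document>
</dataConfig>
```

Note that both dataSource declarations sit at the top of the file, above the document element, which matches the behavior reported later in this thread.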
Best Practices for open source pipeline/connectors
I'm trying to do research for my organization on the best practices for open source pipelines/connectors. Since we need web crawls, file system crawls, and databases, it seems to me that ManifoldCF might be the best choice. Has anyone combined ManifoldCF with Solr UpdateRequestProcessors or DataImportHandler? It would be nice to decide in ManifoldCF which resultHandler should receive a document or id. Barring that, you can post some fields including a URL and have the Data Import Handler handle it - it already supports scripts, whereas ManifoldCF may not at this time. Suggestions and ideas? Thanks, Dan
Re: Best Practices for open source pipeline/connectors
We are looking at LucidWorks, but also want to see what we can do on our own so we can evaluate the value-add of LucidWorks among other products. On Tue, Nov 4, 2014 at 4:13 PM, Alexandre Rafalovitch arafa...@gmail.com wrote: And, just to get the stupid question out of the way: do you prefer to pay in developer integration time rather than in purchase/maintenance fees? Because otherwise I would look at the LucidWorks commercial offering first, even just to have a comparison. Regards, Alex. Personal: http://www.outerthoughts.com/ and @arafalov Solr resources and newsletter: http://www.solr-start.com/ and @solrstart Solr popularizers community: https://www.linkedin.com/groups?gid=6713853 On 4 November 2014 16:01, Dan Davis dansm...@gmail.com wrote: I'm trying to do research for my organization on the best practices for open source pipeline/connectors. ...
Re: javascript form data save to XML in server side
I always, always have a web application running that accepts the JavaScript AJAX call and then forwards it on to the Apache Solr request handler. Even if you don't control the web application and can only add JavaScript, you can put up an API-oriented webapp somewhere that fronts Solr for just a couple of endpoints. Then you can use CORS or JSONP to facilitate interaction between the main web application and the ancillary webapp providing APIs for Solr integration. Of course, this only applies if you don't control the primary application. If you can use a Drupal or Typo3 to front-end Solr, then this is a great way to solve the problem. On Mon, Oct 20, 2014 at 11:02 PM, LongY zhangyulin8...@hotmail.com wrote: thank you very much, Alex. Your reply is very informative and I really appreciate it. I hope I will be able to help others in this forum in the future, like you do. -- View this message in context: http://lucene.472066.n3.nabble.com/javascript-form-data-save-to-XML-in-server-side-tp4165025p4165066.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Problem with DIH
This seems a little abstract. What I'd do is double-check that the SQL is working correctly by running the stored procedure outside of Solr and seeing what you get. You should also be able to look at the corresponding .properties file and see the inputs used for the delta import. If the data import XML is called dih-example.xml, then the properties file should be called dih-example.properties and be in the same conf directory (for the collection). Example contents are: #Fri Oct 10 14:53:44 EDT 2014 last_index_time=2014-10-10 14\:53\:44 healthtopic.last_index_time=2014-10-10 14\:53\:44 Again, I'm suggesting you double-check that the SQL is working correctly. If that isn't the problem, provide more details on your data import handler, e.g. the XML with some modifications (no passwords). On Thu, Oct 16, 2014 at 2:11 AM, Jay Potharaju jspothar...@gmail.com wrote: Hi, I'm using DIH for updating my core. I'm using a stored procedure for doing full/delta imports. In order to avoid running delta imports for a long time, I limit the rows returned to a max of 100,000 rows at a given time. On average the delta import runs for less than 1 minute. For the last couple of days I have been noticing that my delta imports have been running for a couple of hours and try to update all the records in the core. I'm not sure why that has been happening. I can't reproduce this event all the time; it happens randomly. Has anyone noticed this kind of behavior? And secondly, are there any Solr logs that will tell me what is getting updated or what exactly is happening in the DIH? Any suggestion appreciated. Document size: 20 million Solr 4.9 3 nodes in the Solr cloud. Thanks J
Re: import solr source to eclipse
I had a problem with the ant eclipse answer - it was unable to resolve javax.activation for the Javadoc. Updating solr/contrib/dataimporthandler-extras/ivy.xml as follows did the trick for me: - <dependency org="javax.activation" name="activation" rev="${/javax.activation/activation}" conf="compile->*"/> + <dependency org="javax.activation" name="activation" rev="${/javax.activation/activation}" conf="compile->default"/> What I'm trying to do is to construct a failing unit test for something that I think is a bug. But the first thing is to be able to run tests, probably in Eclipse, though the command line might be good enough, if not ideal. On Tue, Oct 14, 2014 at 10:38 AM, Erick Erickson erickerick...@gmail.com wrote: I do exactly what Anurag mentioned, but _only_ when what I want to debug is, for some reason, not accessible via unit tests. It's very easy to do. It's usually much faster though to use unit tests, which you should be able to run from Eclipse without starting a server at all. In IntelliJ, you just ctrl-click on the file and the menu gives you a choice of running or debugging the unit test; I'm sure Eclipse does something similar. There are zillions of unit tests to choose from, and for new development it's a Good Thing to write the unit test first... Good luck! Erick On Tue, Oct 14, 2014 at 1:37 AM, Anurag Sharma anura...@gmail.com wrote: Another alternative is to launch the Jetty server from outside and attach to it remotely from Eclipse: java -Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=y,address=7666 -jar start.jar The above command waits until the application attach succeeds. On Tue, Oct 14, 2014 at 12:56 PM, Rajani Maski rajinima...@gmail.com wrote: Configure Eclipse with the Jetty plugin. Create a Solr folder under your Solr-Java-Project and run the project [Run as] on the Jetty server. This blog[1] may help you to configure Solr within Eclipse. 
[1] http://hokiesuns.blogspot.in/2010/01/setting-up-apache-solr-in-eclipse.html On Tue, Oct 14, 2014 at 12:06 PM, Ali Nazemian alinazem...@gmail.com wrote: Thank you very much for your guides but how can I run solr server inside eclipse? Best regards. On Mon, Oct 13, 2014 at 8:02 PM, Rajani Maski rajinima...@gmail.com wrote: Hi, The best tutorial for setting up Solr[solr 4.7] in eclipse/intellij is documented in Solr In Action book, Apendix A, *Working with the Solr codebase* On Mon, Oct 13, 2014 at 6:45 AM, Tomás Fernández Löbbe tomasflo...@gmail.com wrote: The way I do this: From a terminal: svn checkout https://svn.apache.org/repos/asf/lucene/dev/trunk/ lucene-solr-trunk cd lucene-solr-trunk ant eclipse ... And then, from your Eclipse import existing java project, and select the directory where you placed lucene-solr-trunk On Sun, Oct 12, 2014 at 7:09 AM, Ali Nazemian alinazem...@gmail.com wrote: Hi, I am going to import solr source code to eclipse for some development purpose. Unfortunately every tutorial that I found for this purpose is outdated and did not work. So would you please give me some hint about how can I import solr source code to eclipse? Thank you very much. -- A.Nazemian -- A.Nazemian
Tika Integration problem with DIH and JDBC
What I want to do is to pull a URL out of an Oracle database, and then use TikaEntityProcessor and BinURLDataSource to go fetch and process that URL. I'm having a problem with this that seems general to JDBC with Tika - I get an exception as follows: Exception in entity : extract:org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to execute query: http://www.cdc.gov/healthypets/pets/wildlife.html Processing Document # 14 at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:71) ... Steps to reproduce any problem should be: - Try it with the XML and verify you get two documents and they contain text (schema browser with the text field) - Try it with a JDBC sqlite3 dataSource, verify that you get an exception, and advise me as to what may be the problem in my configuration ... Now, I've tried this 3 ways: - My Oracle database - fails as above - An SQLite3 database, to see if it is Oracle-specific - fails with "Unable to execute query", but doesn't have the URL as part of the message. - An XML file listing two URLs - succeeds without error. For the SQL attempts, setting onError=skip leads the data from the database to be indexed, but the exception is logged for each root entity. I can tell that nothing is indexed from the text extraction by browsing the text field from the schema browser and seeing how few terms there are. 
The exceptions also sort of give it away, but it is good to be careful :) This is using: - Tomcat 7.0.55 - Solr 4.10.1 - JDBC drivers: ojdbc7.jar, sqlite-jdbc-3.7.2.jar Excerpt of solrconfig.xml: <!-- Data Import Handler for Health Topics --> <requestHandler name="/dih-healthtopics" class="solr.DataImportHandler"> <lst name="defaults"> <str name="config">dih-healthtopics.xml</str> </lst> </requestHandler> <!-- Data Import Handler that imports a single URL via Tika --> <requestHandler name="/dih-smallxml" class="solr.DataImportHandler"> <lst name="defaults"> <str name="config">dih-smallxml.xml</str> </lst> </requestHandler> <!-- Data Import Handler that imports a single URL via Tika --> <requestHandler name="/dih-smallsqlite" class="solr.DataImportHandler"> <lst name="defaults"> <str name="config">dih-smallsqlite.xml</str> </lst> </requestHandler> The data import handlers and a copy-paste from Solr logging are attached. Exception in entity : extract:org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to execute query: Processing Document # 1 at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:71) at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.init(JdbcDataSource.java:283) at org.apache.solr.handler.dataimport.JdbcDataSource.getData(JdbcDataSource.java:240) at org.apache.solr.handler.dataimport.JdbcDataSource.getData(JdbcDataSource.java:44) at org.apache.solr.handler.dataimport.DebugLogger$2.getData(DebugLogger.java:188) at org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:112) at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:243) at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:476) at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:502) at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:415) at 
org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:330) at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:232) at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:416) at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:480) at org.apache.solr.handler.dataimport.DataImportHandler.handleRequestBody(DataImportHandler.java:189) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135) at org.apache.solr.core.SolrCore.execute(SolrCore.java:1967) at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:777) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:418) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208) at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:220) at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:122) at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:171) at
Re: Tika Integration problem with DIH and JDBC
Thanks, Alexandre. My role is to kick the tires on this. We're trying it a couple of different ways. So, I'm going to assume this could be resolved and move on to trying ManifoldCF, to see whether it can do similar things for me, e.g. what it adds for free to our bag of tricks. On Fri, Oct 10, 2014 at 3:16 PM, Alexandre Rafalovitch arafa...@gmail.com wrote: I would concentrate on the stack traces and try reading them. They often provide a lot of clues. For example, your original stack trace had org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:71) at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.init(JdbcDataSource.java:283) at org.apache.solr.handler.dataimport.JdbcDataSource.getData(JdbcDataSource.java:240) 2) at org.apache.solr.handler.dataimport.JdbcDataSource.getData(JdbcDataSource.java:44) at org.apache.solr.handler.dataimport.DebugLogger$2.getData(DebugLogger.java:188) 1) at org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:112) at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:243) I added 1) and 2) to show the lines of importance. You can see in 1) that your TikaEntityProcessor is calling 2) JdbcDataSource, which is not what you wanted, as you specified BinDataSource. So, you focus on that until it gets resolved. Sometimes this happens when the XML file says 'datasource' instead of 'dataSource' (DIH is case-sensitive), but that does not seem to be the case in your situation. Regards, Alex. P.s. If you still haven't figured it out, mention the Solr version in the next email. Sometimes it makes a difference, though DIH has been largely unchanged for a while. 
-- Forwarded message -- From: Dan Davis d...@danizen.net Date: 10 October 2014 15:00 Subject: Re: Tika Integration problem with DIH and JDBC To: Alexandre Rafalovitch arafa...@gmail.com The definition <dataSource name="bin" type="BinURLDataSource"/> is in each of the dih-*.xml files, but only the XML version has the definition at the top, above the document. Moving the dataSource definition to the top does change the behavior; now I get the following error for that entity: Exception in entity : extract:org.apache.solr.handler.dataimport.DataImportHandlerException: JDBC URL or JNDI name has to be specified Processing Document # 30 When I changed it to specify url=, it then reverted to form: Exception in entity : extract:org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to execute query: http://www.cdc.gov/flu/swineflu/ Processing Document # 1 at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:71) It does seem to be a problem resolving the dataSource in some way. I therefore double-checked another part of solrconfig.xml. Since the XML example still works, I guess I know it has to be there. <lib dir="${solr.solr.home:}/dist/" regex="solr-dataimporthandler-.*\.jar" /> <lib dir="${solr.solr.home:}/contrib/extraction/lib" regex=".*\.jar" /> <lib dir="${solr.solr.home:}/dist/" regex="solr-cell-\d.*\.jar" /> <lib dir="${solr.solr.home:}/contrib/clustering/lib/" regex=".*\.jar" /> <lib dir="${solr.solr.home:}/dist/" regex="solr-clustering-\d.*\.jar" /> <lib dir="${solr.solr.home:}/contrib/langid/lib/" regex=".*\.jar" /> <lib dir="${solr.solr.home:}/dist/" regex="solr-langid-\d.*\.jar" /> <lib dir="${solr.solr.home:}/contrib/velocity/lib" regex=".*\.jar" /> <lib dir="${solr.solr.home:}/dist/" regex="solr-velocity-\d.*\.jar" /> On Fri, Oct 10, 2014 at 2:37 PM, Alexandre Rafalovitch arafa...@gmail.com wrote: You say dataSource='bin' but I don't see you defining that dataSource. 
E.g.: <dataSource type="BinURLDataSource" name="bin"/> So, there might be some weird default fallback that just causes strange problems. Regards, Alex. Personal: http://www.outerthoughts.com/ and @arafalov Solr resources and newsletter: http://www.solr-start.com/ and @solrstart Solr popularizers community: https://www.linkedin.com/groups?gid=6713853 On 10 October 2014 14:17, Dan Davis dansm...@gmail.com wrote: What I want to do is to pull a URL out of an Oracle database, and then use TikaEntityProcessor and BinURLDataSource to go fetch and process that URL. I'm having a problem with this that seems general to JDBC with Tika - I get an exception as follows: Exception in entity : extract:org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to execute query: http://www.cdc.gov/healthypets/pets/wildlife.html Processing Document # 14 at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:71) ... Steps to reproduce any problem should be: Try it with the XML and verify you get two documents and they contain text
Re: Nagle's Algorithm
I don't keep up with this list well enough to know whether anyone else answered. I don't know how to do it in jetty.xml, but you can certainly tweak the code. java.net.Socket has a method setTcpNoDelay() that corresponds to the standard Unix system calls. Long ago, my suggesting this made Apache Axis 2.0 250 ms faster per call (1). Now I want to know whether Apache Solr sets it. One common way to measure the overhead portion of latency is to project the latency of a zero-size request from larger requests. What you do is warm requests (so everything is in memory) for progressively fewer and fewer rows. You can make requests for 100, 90, 80, 70 ... 10 rows, each more than once so that everything is warmed. If you plot this, it should look like a linear function latency(rows) = m*rows + b, since everything is cached in memory. You have to control what else is going on on the server to get a clean linear plot, of course - that can be quite hard on modern Linux. But once you have it, you can simply calculate latency(0) and you have the latency of a theoretical zero-size request. This is a tangential answer at best - I wish I just knew a setting to give you. (1) Latency Performance of SOAP Implementations http://citeseer.ist.psu.edu/viewdoc/similar?doi=10.1.1.21.8556&type=ab On Sun, Sep 29, 2013 at 9:22 PM, William Bell billnb...@gmail.com wrote: How do I set TCP_NODELAY on the HTTP sockets for Jetty in Solr 4? Is there an option in jetty.xml? /* Create new stream socket */ sock = socket( AF_INET, SOCK_STREAM, 0 ); /* Disable the Nagle (TCP No Delay) algorithm */ flag = 1; ret = setsockopt( sock, IPPROTO_TCP, TCP_NODELAY, (char *)&flag, sizeof(flag) ); -- Bill Bell billnb...@gmail.com cell 720-256-8076
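The zero-row projection described above is just an ordinary least-squares line fit evaluated at rows = 0. A minimal sketch (the timing numbers below are made-up placeholders; real values would come from your own warmed runs):

```java
// Project per-request overhead by fitting latency = m*rows + b over warmed
// measurements, then reading off the intercept b = latency(0).
public class LatencyIntercept {

    // Ordinary least-squares fit; returns {slope, intercept}.
    static double[] fit(double[] rows, double[] latencyMs) {
        int n = rows.length;
        double sx = 0, sy = 0, sxx = 0, sxy = 0;
        for (int i = 0; i < n; i++) {
            sx  += rows[i];
            sy  += latencyMs[i];
            sxx += rows[i] * rows[i];
            sxy += rows[i] * latencyMs[i];
        }
        double slope = (n * sxy - sx * sy) / (n * sxx - sx * sx);
        double intercept = (sy - slope * sx) / n;
        return new double[] { slope, intercept };
    }

    public static void main(String[] args) {
        // Placeholder timings for 10..100 rows; substitute measured values.
        double[] rows = { 10, 20, 30, 40, 50, 60, 70, 80, 90, 100 };
        double[] ms   = { 12, 14, 16, 18, 20, 22, 24, 26, 28, 30 };
        double[] mb = fit(rows, ms);
        // The intercept estimates fixed overhead (network round-trip, request
        // parsing, and any Nagle-style delays) independent of result size.
        System.out.printf("slope=%.3f ms/row, overhead latency(0)=%.1f ms%n",
                mb[0], mb[1]);
    }
}
```

The fit is only meaningful if the runs really are warmed and the server is otherwise idle, as the post notes.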
Excluding a facet's constraint to exclude a facet
Summary - when constraining a search using a filter query, how can I exclude the constraint for a particular facet? Detail - Suppose I have the following facet results for a query q=mainquery: <lst name="facet_counts"> <lst name="facet_queries"/> <lst name="facet_fields"> <lst name="foo"> <int name="A">491</int> <int name="B">111</int> <int name="C">103</int> ... </lst> ... I understand from http://people.apache.org/~hossman/apachecon2010/facets/ and Wiki documentation that I can limit results to category A as follows: fq={!raw f=foo}A But I cannot seem to (Solr 3.6.1) exclude that way: fq={!raw f=foo}-A And the simpler test (with edismax) doesn't work either: fq=foo:A # works fq=foo:-A # doesn't work Do I need to be using facet.method=enum to get this to work? What else could be the problem here?
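For reference, standard Lucene query syntax puts the negation operator before the field name, and Solr's filter tagging/excluding local params address the other half of the question (excluding one filter from one facet's counts). A hedged sketch, reusing the field name from the post:

```text
# Exclude documents where foo is A: the minus precedes the field.
fq=-foo:A

# Constrain to foo:A but exclude that filter when counting the foo facet:
fq={!tag=fooTag}foo:A
facet.field={!ex=fooTag}foo
```

The form fq=foo:-A instead searches the foo field for the literal term "-A", which is why it does not behave as a negation.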
Re: Storing query results
You could copy the existing core to a new core every once in a while, and then do your delta indexing into the new core once the copy is complete. If a persistent URL for the search results included the name of the original core, the results you would get from a bookmark would be stable. However, if you went to the site and ran a new search, you would be searching the newest core. This, I think, applies whether the site is an Intranet or not. Older cores could be aged out gracefully, and the search handler for an old core could be replaced by a search on the new core via sharding. On Fri, Aug 23, 2013 at 11:57 AM, jfeist jfe...@llminc.com wrote: I completely agree. I would prefer to just rerun the search each time. However, we are going to be replacing our RDB-based search with something like Solr, and the application currently behaves this way. Our users understand that the search is essentially a snapshot (and I would guess many prefer this over changing results), and we don't want to change existing behavior and confuse anyone. Also, my boss told me it unequivocally has to be this way :p Thanks for your input though; looks like I'm going to have to do something like you've suggested within our application. -- View this message in context: http://lucene.472066.n3.nabble.com/Storing-query-results-tp4086182p4086349.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: How to Manage RAM Usage at Heavy Indexing
This could be an operating system problem rather than a Solr problem. CentOS 6.4 (Linux kernel 2.6.32) may have some issues with page flushing, and I would read up on that. The VM parameters can be tuned in /etc/sysctl.conf On Sun, Aug 25, 2013 at 4:23 PM, Furkan KAMACI furkankam...@gmail.com wrote: Hi Erick; I wanted to get a quick answer, that's why I asked my question that way. The error is as follows: INFO - 2013-08-21 22:01:30.978; org.apache.solr.update.processor.LogUpdateProcessor; [collection1] webapp=/solr path=/update params={wt=javabinversion=2} {add=[com.deviantart.reachmeh ere:http/gallery/, com.deviantart.reachstereo:http/, com.deviantart.reachstereo:http/art/SE-mods-313298903, com.deviantart.reachtheclouds:http/, com.deviantart.reachthegoddess:http/, co m.deviantart.reachthegoddess:http/art/retouched-160219962, com.deviantart.reachthegoddess:http/badges/, com.deviantart.reachthegoddess:http/favourites/, com.deviantart.reachthetop:http/ art/Blue-Jean-Baby-82204657 (1444006227844530177), com.deviantart.reachurdreams:http/, ... 
(163 adds)]} 0 38790 ERROR - 2013-08-21 22:01:30.979; org.apache.solr.common.SolrException; java.lang.RuntimeException: [was class org.eclipse.jetty.io.EofException] early EOF at com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18) at com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:731) at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3657) at com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:809) at org.apache.solr.handler.loader.XMLLoader.readDoc(XMLLoader.java:393) at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:245) at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:173) at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92) at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135) at org.apache.solr.core.SolrCore.execute(SolrCore.java:1812) at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:639) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:345) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:141) at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1307) at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:453) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137) at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:560) at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231) at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1072) at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:382) at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193) at 
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1006) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135) at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255) at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154) at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116) at org.eclipse.jetty.server.Server.handle(Server.java:365) at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:485) at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53) at org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:937) at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:998) at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:948) at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235) at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72) at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264) at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608) at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543) at java.lang.Thread.run(Thread.java:722) Caused by: org.eclipse.jetty.io.EofException: early EOF at org.eclipse.jetty.server.HttpInput.read(HttpInput.java:65) at java.io.InputStream.read(InputStream.java:101) at com.ctc.wstx.io.UTF8Reader.loadMore(UTF8Reader.java:365) at com.ctc.wstx.io.UTF8Reader.read(UTF8Reader.java:110) at com.ctc.wstx.io.MergedReader.read(MergedReader.java:101) at com.ctc.wstx.io.ReaderSource.readInto(ReaderSource.java:84) at
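Regarding the /etc/sysctl.conf suggestion at the top of the reply: a hedged starting point for taming page-cache flush storms on a heavy-indexing box (the values are illustrative and should be tuned to the machine's RAM and I/O profile):

```text
# /etc/sysctl.conf additions (illustrative values)
# Start background writeback earlier and in smaller batches
vm.dirty_background_ratio = 5
# Cap how much dirty data may accumulate before writers block
vm.dirty_ratio = 10
# Prefer keeping the index in the page cache over swapping
vm.swappiness = 10
```

Apply with sysctl -p after editing, and watch /proc/meminfo (Dirty, Writeback) under indexing load to see the effect.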
Re: More on topic of Meta-search/Federated Search with Solr
On Tue, Aug 27, 2013 at 2:03 AM, Paul Libbrecht p...@hoplahup.net wrote: Dan, if you're bound to federated search then I would say that you need to work on the service guarantees of each of the nodes and, maybe, create strategies to cope with bad nodes. paul +1 I'll think on that.
Re: More on topic of Meta-search/Federated Search with Solr
On Tue, Aug 27, 2013 at 3:33 AM, Bernd Fehling bernd.fehl...@uni-bielefeld.de wrote: Years ago, when Federated Search was a buzzword, we did some development and testing with Lucene, FAST Search, Google and several other search engines regarding Federated Search in a library context. The results can be found here http://pub.uni-bielefeld.de/download/2516631/2516644 Some minor parts are in German; most is written in English. It also gives you an idea of where to keep an eye out, where the pitfalls are, and so on. We also had a tool called unity (written in Python) which did Federated Search on any search engine and database, like Google, Gigablast, FAST, Lucene, ... The trick with Federated Search is combining the results. We offered three options on the user search surface: - RoundRobin - Relevancy - PseudoRandom Thanks much - Andrzej B. suggested I read "Comparing top-k lists" in addition to his Berlin Buzzwords presentation. I will know soon whether we are intent on this direction; right now I'm still trying to think about how hard it will be.
Re: More on topic of Meta-search/Federated Search with Solr
On Mon, Aug 26, 2013 at 9:06 PM, Amit Jha shanuu@gmail.com wrote: Would you like to create something like http://knimbus.com I work at the National Library of Medicine. We are moving our library catalog to a newer platform, and we will probably include articles. The articles' content and meta-data are available from a number of web-scale discovery services such as PRIMO, Summon, EBSCO's EDS, and EBSCO's traditional API. Most libraries use open source solutions to avoid the cost of purchasing an expensive enterprise search platform. We are big; we already have a closed-source enterprise search engine (and our own home-grown Entrez search used for PubMed). Since we can already do Federated Search with the above, I am evaluating the effort of adding such a capability to Apache Solr. Because NLM data is used in the Open Relevance project, we actually have the relevancy decisions to decide whether we have done a good job of it. I obviously think it would be fun to add Federated Search to Apache Solr. *Standard disclosure* - my opinions do not represent the opinions of NIH or NLM. Fun is no reason to spend taxpayer money. Enhancing Apache Solr would reduce the risk of putting all our eggs in one basket, and there may be some other relevant benefits. We do use Apache Solr here for more than one other project... so keep up the good work even if my working group decides to go with the closed-source solution.
Re: More on topic of Meta-search/Federated Search with Solr
I have now come to the task of estimating man-days to add blended search results to Apache Solr. The argument has been made that this is not desirable (see Jonathan Rochkind's blog entries on Bento search with Blacklight), but the estimate remains. No estimate is worth much without a design, so I face the difficulty of estimating this without in-depth knowledge of the Apache Solr core. Here is my design, likely imperfect, as it stands. - Configure a core specific to each search source (local or remote) - On cores that index remote content, implement a periodic delete query that deletes documents whose timestamp is too old - Implement a custom requestHandler for the remote cores that goes out and queries the remote source. For each result in the top N (configurable), it computes an id that is stable (e.g. based on the remote resource URL, DOI, or a hash of the data returned). It uses that id to look up the document in the Lucene database. If the data is not there, it updates the Lucene core and sets a flag that a commit is required. Once it is done, it commits if needed. - Configure a core that uses a custom SearchComponent to call the requestHandler that goes and gets new documents and commits them. Since the cores for remote content are separate cores, they can restart their searchers at this point if any commit is needed. The custom SearchComponent will wait for commit and reload to complete. Then search continues using the other cores as shards. - Auto-warming on this will ensure that the most recently requested data is present. It will, of course, be very slow a good part of the time. Erick and others, I need to know whether this design has legs and what other alternatives I might consider. On Sun, Aug 18, 2013 at 3:14 PM, Erick Erickson erickerick...@gmail.com wrote: The lack of global TF/IDF has been answered in the past, in the sharded case, by "usually you have similar enough stats that it doesn't matter." 
This presupposes a fairly evenly distributed set of documents. But if you're talking about federated search across different types of documents, then what would you rescore with? How would you even consider scoring docs that are somewhat or totally different? Think magazine articles and meta-data associated with pictures. What I've usually found is that one can use grouping to show the top N of a variety of results. Or show tabs with different types. Or have the app intelligently combine the different types of documents in a way that makes sense. But I don't know how you'd just get the right thing to happen with some kind of scoring magic. Best, Erick On Fri, Aug 16, 2013 at 4:07 PM, Dan Davis dansm...@gmail.com wrote: I've thought about it, and I have no time to really do a meta-search during evaluation. What I need to do is create a single core that contains both of my data sets, and then describe the architecture that would be required to do blended results, with liberal estimates. From the perspective of evaluation, I need to understand whether any of the solutions to better ranking in the absence of global IDF have been explored. I suspect that one could retrieve a much larger than N set of results from a set of shards, and re-score in some way that doesn't require IDF, e.g. storing both result sets in the same priority queue and *re-scoring* before *re-ranking*. The other way to do this would be to have a custom SearchHandler that works differently - it performs the query, retrieves all results deemed relevant by another engine, adds them to the Lucene index, and then performs the query again in the standard way. This would be quite slow, but perhaps useful as a way to evaluate my method. I still welcome any suggestions on how such a SearchHandler could be implemented.
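The "periodic delete query" step in the design above maps directly onto Solr's delete-by-query with date math. A hedged sketch (the field name timestamp_dt and the 30-day window are assumptions for illustration):

```xml
<!-- POST to the core's /update handler, then commit; ages out cached
     remote documents whose indexing timestamp is older than 30 days -->
<delete>
  <query>timestamp_dt:[* TO NOW-30DAYS]</query>
</delete>
```

Run from cron or a scheduler against each remote-content core; because those cores are separate, the resulting commits do not disturb searchers on the local-content cores.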
Re: More on topic of Meta-search/Federated Search with Solr
First answer: My employer is a library, and we do not have the license to harvest everything indexed by a web-scale discovery service such as PRIMO or Summon. If our design automatically relays searches entered by users, and then periodically purges results, I think it is reasonable from a licensing perspective.

Second answer: What if you wanted your Apache Solr powered search to include all results from Google Scholar for any query? Do you think you could easily or cheaply configure a ZooKeeper cluster large enough to harvest and index all of Google Scholar? Would that violate robot rules? Is it even possible to do this from an API perspective? Wouldn't Google notice?

Third answer: On Gartner's 2013 Enterprise Search Magic Quadrant, LucidWorks and the other Enterprise Search firm based on Apache Solr were dinged on the lack of Federated Search. I do not have the hubris to think I can fix that, and it is not really my role to try, but something that works without harvesting and local indexing is obviously desirable to Enterprise Search users.

On Mon, Aug 26, 2013 at 4:46 PM, Paul Libbrecht p...@hoplahup.net wrote: Why not simply create a meta search engine that indexes everything of each of the nodes? (I think one calls this harvesting.) I believe that this is the way to avoid all sorts of performance bottlenecks. As far as I could analyze, the performance of a federated search is the performance of the least speedy node, which can turn out to be quite bad if you do not have guarantees from the remote sources. Or are the remote cores below actually things that you manage on your side? If yes, guarantees are easy to manage. Paul

On 26 Aug 2013, at 22:38, Dan Davis wrote: I have now come to the task of estimating man-days to add Blended Search Results to Apache Solr. The argument has been made that this is not desirable (see Jonathan Rochkind's blog entries on Bento search with Blacklight), but the estimate remains. No estimate is worth much without a design.
So, I have come to the difficulty of estimating this without an in-depth knowledge of the Apache Solr core. Here is my design, likely imperfect, as it stands:
- Configure a core specific to each search source (local or remote).
- On cores that index remote content, implement a periodic delete query that deletes documents whose timestamp is too old.
- Implement a custom requestHandler for the remote cores that goes out and queries the remote source. For each result in the top N (configurable), it computes an id that is stable (e.g. based on the remote resource URL, DOI, or a hash of the data returned). It uses that id to look up the document in the Lucene index. If the data is not there, it updates the Lucene core and sets a flag that a commit is required. Once it is done, it commits if needed.
- Configure a core that uses a custom SearchComponent to call the requestHandler that goes and gets new documents and commits them. Since the cores for remote content are separate cores, they can restart their searchers at this point if any commit is needed. The custom SearchComponent will wait for the commit and reload to be completed; then the search continues using the other cores as shards.
- Auto-warming on this will assure that the most recently requested data is present. It will, of course, be very slow a good part of the time.
Erick and others, I need to know whether this design has legs and what other alternatives I might consider.

On Sun, Aug 18, 2013 at 3:14 PM, Erick Erickson erickerick...@gmail.com wrote: The lack of global TF/IDF has been answered in the past, in the sharded case, by "usually you have similar enough stats that it doesn't matter." This pre-supposes a fairly evenly distributed set of documents. But if you're talking about federated search across different types of documents, then what would you rescore with? How would you even consider scoring docs that are somewhat/totally different? Think magazine articles and meta-data associated with pictures.
What I've usually found is that one can use grouping to show the top N of a variety of results. Or show tabs with different types. Or have the app intelligently combine the different types of documents in a way that makes sense. But I don't know how you'd just get the right thing to happen with some kind of scoring magic. Best Erick On Fri, Aug 16, 2013 at 4:07 PM, Dan Davis dansm...@gmail.com wrote: I've thought about it, and I have no time to really do a meta-search during evaluation. What I need to do is to create a single core that contains both of my data sets, and then describe the architecture that would be required to do blended results, with liberal estimates. From the perspective of evaluation, I need to understand whether any of the solutions to better ranking
Re: More on topic of Meta-search/Federated Search with Solr
One more question here - is this topic more appropriate to a different list?

On Mon, Aug 26, 2013 at 4:38 PM, Dan Davis dansm...@gmail.com wrote: I have now come to the task of estimating man-days to add Blended Search Results to Apache Solr. The argument has been made that this is not desirable (see Jonathan Rochkind's blog entries on Bento search with Blacklight), but the estimate remains. No estimate is worth much without a design, so I have come to the difficulty of estimating this without an in-depth knowledge of the Apache Solr core. Here is my design, likely imperfect, as it stands:
- Configure a core specific to each search source (local or remote).
- On cores that index remote content, implement a periodic delete query that deletes documents whose timestamp is too old.
- Implement a custom requestHandler for the remote cores that goes out and queries the remote source. For each result in the top N (configurable), it computes an id that is stable (e.g. based on the remote resource URL, DOI, or a hash of the data returned). It uses that id to look up the document in the Lucene index. If the data is not there, it updates the Lucene core and sets a flag that a commit is required. Once it is done, it commits if needed.
- Configure a core that uses a custom SearchComponent to call the requestHandler that goes and gets new documents and commits them. Since the cores for remote content are separate cores, they can restart their searchers at this point if any commit is needed. The custom SearchComponent will wait for the commit and reload to be completed; then the search continues using the other cores as shards.
- Auto-warming on this will assure that the most recently requested data is present. It will, of course, be very slow a good part of the time.
Erick and others, I need to know whether this design has legs and what other alternatives I might consider.
On Sun, Aug 18, 2013 at 3:14 PM, Erick Erickson erickerick...@gmail.com wrote: The lack of global TF/IDF has been answered in the past, in the sharded case, by "usually you have similar enough stats that it doesn't matter." This pre-supposes a fairly evenly distributed set of documents. But if you're talking about federated search across different types of documents, then what would you rescore with? How would you even consider scoring docs that are somewhat/totally different? Think magazine articles and meta-data associated with pictures. What I've usually found is that one can use grouping to show the top N of a variety of results. Or show tabs with different types. Or have the app intelligently combine the different types of documents in a way that makes sense. But I don't know how you'd just get the right thing to happen with some kind of scoring magic. Best, Erick

On Fri, Aug 16, 2013 at 4:07 PM, Dan Davis dansm...@gmail.com wrote: I've thought about it, and I have no time to really do a meta-search during evaluation. What I need to do is to create a single core that contains both of my data sets, and then describe the architecture that would be required to do blended results, with liberal estimates. From the perspective of evaluation, I need to understand whether any of the solutions to better ranking in the absence of global IDF have been explored? I suspect that one could retrieve a much larger than N set of results from a set of shards and re-score in some way that doesn't require IDF, e.g. storing both results in the same priority queue and *re-scoring* before *re-ranking*. The other way to do this would be to have a custom SearchHandler that works differently - it performs the query, retrieves all results deemed relevant by another engine, adds them to the Lucene index, and then performs the query again in the standard way. This would be quite slow, but perhaps useful as a way to evaluate my method.
I still welcome any suggestions on how such a SearchHandler could be implemented.
Re: Flushing cache without restarting everything?
be careful with drop_caches - make sure you sync first

On Thu, Aug 22, 2013 at 1:28 PM, Jean-Sebastien Vachon jean-sebastien.vac...@wantedanalytics.com wrote: I was afraid someone would tell me that... thanks for your input

-Original Message- From: Toke Eskildsen [mailto:t...@statsbiblioteket.dk] Sent: August-22-13 9:56 AM To: solr-user@lucene.apache.org Subject: Re: Flushing cache without restarting everything?

On Tue, 2013-08-20 at 20:04 +0200, Jean-Sebastien Vachon wrote: Is there a way to flush the cache of all nodes in a Solr Cloud (by reloading all the cores, through the collection API, ...) without having to restart all nodes?

As MMapDirectory shares data with the OS disk cache, flushing Solr-related caches on a machine should involve: 1) Shut down all Solr instances on the machine 2) Clear the OS read cache ('echo 1 > /proc/sys/vm/drop_caches' as root on a Linux box) 3) Start the Solr instances. I do not know of any Solr-supported way to do step 2. For our performance tests we use custom scripts to perform the steps. - Toke Eskildsen, State and University Library, Denmark
Removing duplicates during a query
Suppose I have two documents with different id, and there is another field, for instance content-hash which is something like a 16-byte hash of the content. Can Solr be configured to return just one copy, and drop the other if both are relevant? If Solr does drop one result, do you get any indication in the document that was kept that there was another copy?
Re: How to avoid underscore sign indexing problem?
Ah, but what is the definition of punctuation in Solr?

On Wed, Aug 21, 2013 at 11:15 PM, Jack Krupansky j...@basetechnology.com wrote: "I thought that the StandardTokenizer always split on punctuation" - proving that you haven't read my book! The section on the standard tokenizer details the rules that the tokenizer uses (in addition to extensive examples.) That's what I mean by deep dive. -- Jack Krupansky

-Original Message- From: Shawn Heisey Sent: Wednesday, August 21, 2013 10:41 PM To: solr-user@lucene.apache.org Subject: Re: How to avoid underscore sign indexing problem?

On 8/21/2013 7:54 PM, Floyd Wu wrote: When using StandardAnalyzer to tokenize the string "Pacific_Rim", you get a single token: text pacific_rim, raw bytes [70 61 63 69 66 69 63 5f 72 69 6d], start 0, end 11, type ALPHANUM, position 1. How can this string be tokenized into the two tokens Pacific and Rim? Set _ as a stopword? Please kindly help with this. Many thanks.

Interesting. I thought that the StandardTokenizer always split on punctuation, but apparently that's not the case for the underscore character. You can always use the WordDelimiterFilter after the StandardTokenizer: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory Thanks, Shawn
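A field type along the lines Shawn suggests would split pacific_rim into pacific and rim. This is a sketch in Solr 4.x-era schema.xml syntax; the fieldType name and the exact WordDelimiterFilterFactory options are illustrative assumptions, so check them against the wiki page above.

```xml
<!-- sketch: StandardTokenizer keeps "pacific_rim" whole; the
     WordDelimiterFilter then splits it on the underscore -->
<fieldType name="text_split_underscore" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" splitOnCaseChange="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

The analysis page in the Solr admin UI is the quickest way to confirm the resulting token stream for a sample input.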
Re: More on topic of Meta-search/Federated Search with Solr
You are right, but here's my null hypothesis for studying the impact on relevance: hash the query to deterministically seed a random number generator, then pick one result from column A or column B at random. This is of course wrong - a query might find two non-relevant results in corpus A and lots of relevant results in corpus B, leading to poor precision, because the two non-relevant documents are likely to show up on the first page. You can weight by the size of the corpus, but the weighting is then probably wrong on any specific query. It was an interesting thought experiment though.

Erick, since LucidWorks was dinged in the 2013 Magic Quadrant on Enterprise Search due to a lack of Federated Search, the for-profit Enterprise Search companies must be doing it some way. Maybe relevance suffers (a lot), but you can do it if you want to. I have read very little of the IR literature - enough to sound like I know a little, but it is a very little. If there is literature on this, it would be an interesting read.

On Sun, Aug 18, 2013 at 3:14 PM, Erick Erickson erickerick...@gmail.com wrote: The lack of global TF/IDF has been answered in the past, in the sharded case, by "usually you have similar enough stats that it doesn't matter." This pre-supposes a fairly evenly distributed set of documents. But if you're talking about federated search across different types of documents, then what would you rescore with? How would you even consider scoring docs that are somewhat/totally different? Think magazine articles and meta-data associated with pictures. What I've usually found is that one can use grouping to show the top N of a variety of results. Or show tabs with different types. Or have the app intelligently combine the different types of documents in a way that makes sense. But I don't know how you'd just get the right thing to happen with some kind of scoring magic.
Best, Erick

On Fri, Aug 16, 2013 at 4:07 PM, Dan Davis dansm...@gmail.com wrote: I've thought about it, and I have no time to really do a meta-search during evaluation. What I need to do is to create a single core that contains both of my data sets, and then describe the architecture that would be required to do blended results, with liberal estimates. From the perspective of evaluation, I need to understand whether any of the solutions to better ranking in the absence of global IDF have been explored? I suspect that one could retrieve a much larger than N set of results from a set of shards and re-score in some way that doesn't require IDF, e.g. storing both results in the same priority queue and *re-scoring* before *re-ranking*. The other way to do this would be to have a custom SearchHandler that works differently - it performs the query, retrieves all results deemed relevant by another engine, adds them to the Lucene index, and then performs the query again in the standard way. This would be quite slow, but perhaps useful as a way to evaluate my method. I still welcome any suggestions on how such a SearchHandler could be implemented.
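Dan's null-hypothesis baseline above can be sketched concretely. A toy sketch under my own assumptions: hashing the query with MD5 to seed the RNG, and a 50/50 coin flip per slot to interpret "pick one from column A or column B randomly".

```python
import hashlib
import random

def blend(query, results_a, results_b, n=10):
    """Deterministically blend two ranked result lists.

    The query string seeds the RNG, so the same query always yields
    the same blended ordering - the 'null hypothesis' baseline.
    """
    seed = int(hashlib.md5(query.encode("utf-8")).hexdigest(), 16)
    rng = random.Random(seed)
    a, b, blended = list(results_a), list(results_b), []
    while (a or b) and len(blended) < n:
        # coin flip between column A and column B; fall back if one is empty
        pick_a = a and (not b or rng.random() < 0.5)
        blended.append(a.pop(0) if pick_a else b.pop(0))
    return blended
```

As Dan notes, this ignores relevance entirely: two non-relevant results from corpus A can still land on the first page, which is exactly why it is only a baseline for measuring better schemes against.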
Re: Removing duplicates during a query
OK - I see that this can be done with Field Collapsing/Grouping. I also see the mentions in the Wiki for avoiding duplicates using a 16-byte hash. So, question withdrawn... On Thu, Aug 22, 2013 at 10:21 PM, Dan Davis dansm...@gmail.com wrote: Suppose I have two documents with different id, and there is another field, for instance content-hash which is something like a 16-byte hash of the content. Can Solr be configured to return just one copy, and drop the other if both are relevant? If Solr does drop one result, do you get any indication in the document that was kept that there was another copy?
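For the index-time variant mentioned above, Solr's deduplication support can compute the hash for you via SignatureUpdateProcessorFactory. A sketch of the solrconfig.xml chain; the chain name and field names (content_hash, content) are illustrative assumptions.

```xml
<!-- sketch: compute a signature of the content field at index time;
     overwriteDupes=true keeps only one document per signature -->
<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">content_hash</str>
    <bool name="overwriteDupes">true</bool>
    <str name="fields">content</str>
    <str name="signatureClass">solr.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```

With overwriteDupes=false instead, both copies stay in the index and query-time grouping on the signature field (group=true&group.field=content_hash) collapses them per query, which also answers the second question: the grouped response tells you how many documents shared the hash.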
Re: Prevent Some Keywords at Analyzer Step
This is an interesting topic - my employer is a medical library, and there are many keywords that may need to be aliased in various ways, and 2- or 3-word phrases that perhaps should be treated specially. Jack, can you give me an example of how to do that sort of thing? Perhaps I need to buy your almost-released Deep Dive book... Sorry to be too tangential - it is my strange way.

On Mon, Aug 19, 2013 at 12:32 PM, Jack Krupansky j...@basetechnology.com wrote: Okay, but what is it that you are trying to prevent? And "diet follower" is a phrase, not a keyword or term. So I'm still baffled as to what you are really trying to do. Try explaining it in plain English. And given this same input, how would it be queried? -- Jack Krupansky

-Original Message- From: Furkan KAMACI Sent: Monday, August 19, 2013 11:22 AM To: solr-user@lucene.apache.org Subject: Re: Prevent Some Keywords at Analyzer Step

Let's assume that my sentence is: *Alice is a diet follower* My special keyword = *diet follower* Tokens will be: Token 1) Alice Token 2) is Token 3) a Token 4) diet Token 5) follower Token 6) *diet follower*

2013/8/19 Jack Krupansky j...@basetechnology.com Your example doesn't prevent any keywords. You need to elaborate the specific requirements with more detail. Given a long stream of text, what tokenization do you expect in the index? -- Jack Krupansky

-Original Message- From: Furkan KAMACI Sent: Monday, August 19, 2013 8:07 AM To: solr-user@lucene.apache.org Subject: Prevent Some Keywords at Analyzer Step

Hi; I want to write an analyzer that will handle some special words. For example, if the sentence to be indexed is: diet follower it will tokenize it as: token 1) diet token 2) follower token 3) diet follower How can I do that with Solr?
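One way to get Furkan's extra combined token is an index-time synonym mapping. A sketch only: the fieldType name, the keywords.txt file, and the choice of diet_follower as the combined token are my assumptions, and whether a single multi-word token is really what you want depends on how queries are analyzed.

```xml
<!-- sketch: keywords.txt contains the line
       diet follower => diet follower, diet_follower
     so "diet follower" emits diet, follower, and diet_follower -->
<fieldType name="text_keywords" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="keywords.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

The underscore form is convenient precisely because of the behavior discussed in the underscore thread on this list: StandardTokenizer does not split on _, so diet_follower survives as one token.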
More on topic of Meta-search/Federated Search with Solr
I've thought about it, and I have no time to really do a meta-search during evaluation. What I need to do is to create a single core that contains both of my data sets, and then describe the architecture that would be required to do blended results, with liberal estimates. From the perspective of evaluation, I need to understand whether any of the solutions to better ranking in the absence of global IDF have been explored? I suspect that one could retrieve a much larger than N set of results from a set of shards and re-score in some way that doesn't require IDF, e.g. storing both results in the same priority queue and *re-scoring* before *re-ranking*. The other way to do this would be to have a custom SearchHandler that works differently - it performs the query, retrieves all results deemed relevant by another engine, adds them to the Lucene index, and then performs the query again in the standard way. This would be quite slow, but perhaps useful as a way to evaluate my method. I still welcome any suggestions on how such a SearchHandler could be implemented.
Meta-search by subclassing SearchHandler
I am considering enabling a true Federated Search, or meta-search, using the following basic configuration (this configuration is only for development and evaluation). Three Solr cores:
- One to search data I have indexed locally
- One with a custom SearchHandler that is a facade, e.g. it performs a meta-search (aka Federated Search)
- One that queries and merges the above cores as shards
Lest I seem completely like Sauron, I read http://2011.berlinbuzzwords.de/sites/2011.berlinbuzzwords.de/files/AndrzejBialecki-Buzzwords-2011_0.pdf and am familiar with evaluating precision at 10, etc., although I am no doubt less familiar with IR than many. I think that it is much, much better for performance and relevancy to index it all on a level playing field. But my employer cannot do that, because we do not have a license to all the data we may wish to search in the future. My questions are simple - has anybody implemented such a SearchHandler that is a facade for another search engine? How would I get started with that? I have made a similar post on the blacklight developers Google group.
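Stripped of all Solr plumbing, the facade core's behavior is roughly this loop. A toy sketch with stubbed-out search functions and a plain dict standing in for the index; a real implementation would subclass Solr's SearchHandler and go through the update API with a commit.

```python
def federated_search(query, remote_search, local_index, local_search, top_n=10):
    """Facade sketch: relay the query to a remote engine, cache any
    results not yet seen into the local index, then answer the query
    from the (now-updated) local index."""
    new_docs = [d for d in remote_search(query)[:top_n]
                if d["id"] not in local_index]
    for doc in new_docs:
        local_index[doc["id"]] = doc   # stand-in for add + commit + reload
    return local_search(query, local_index)
```

The interesting property is the one Dan raises: the first query for a topic pays the full remote round-trip plus commit cost, while repeat queries are served from the local cache, subject to the periodic purge of stale documents.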