Re: Solr with Auto-suggest
Nice, great help. I have added the following fields to hold the tokens. I am wondering how I can extract the tokens? I can see all tokens on the http://localhost:8080/solr/admin/schema.jsp page for the fields prefix1 and prefix2, but when I query http://localhost:8080/solr/select?fl=prefix1,id&q=prefix2:jun%20prefix2:jun to get the content for prefix2, it does not display any content for prefix2. Am I doing anything wrong?

- RB

On 4/24/08, Ryan McKinley <[EMAIL PROTECTED]> wrote:
>
> I don't think there is a magic one-size-fits-all solution to this, only a
> set of approaches you will need to modify for your specific index.
>
> You will need to modify the jquery plugin to grab results from a solr
> query. For starters that can be just a standard query, whatever.
>
> Unless your index is small, you will likely need to configure your index
> with special fields to use for the auto-complete search. This is the
> approach pointed to in SOLR-357. Essentially you index: "Bould" as "b"
> "bo" "bou" "boul" "bould".
>
> ryan
Re: Updating in Solr.SOLR-139
Hi! I have already realized the mistake. My "id" field was generated from a copy of another field called "url"; in other words, it seems that things did not work well when the "id" field was generated as a copyField of another one. Now I have changed the schema.xml so that this does not happen, so documents indexed for the first time have to contain the id field themselves. Thank you very much for your attention!

Regards...

Koji Sekiguchi-2 wrote:
>
> I've just tried this again in my environment, but I couldn't reproduce
> what you pointed out.
>
> My schema is:
>   ... required="true" />
>   ... multiValued="true"/>
>   <uniqueKey>id</uniqueKey>
>
> Koji
>
> nutchvf wrote:
>> Hi!!
>> Thank you very much, Koji!!
>> Your response has helped me a lot and I have already managed to update
>> the document. Now I have another problem.
>>
>> Sending the update request to Solr, for example:
>>
>> http://localhost:8389/solr/update?mode=tags:overwrite&commit=true
>>
>>   AAA
>>   German
>>
>> After that step, I realized that the "id" field (defined in my schema.xml
>> as the uniqueKey field) appears as a multivalued field, with two "fields"
>> with the same value. Do you know what may be the reason for this
>> behavior?
>>
>> Thank you,
>>
>> Regards!

--
View this message in context: http://www.nabble.com/Updating-in-Solr.SOLR-139-tp16744841p16892573.html
Sent from the Solr - User mailing list archive at Nabble.com.
How to extract terms associated with a field
Hello Group,

I have a field named prefix1, which is a copy of another field called "content". My question is: how can I extract the terms associated with prefix1? Is there any query parameter that can extract all tokens for a field? Your help/input would be appreciated.

regards,
Ranjan
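One way Solr exposes indexed terms is through faceting on the field (facet.field, optionally narrowed with facet.prefix). A minimal sketch of building such a request, assuming a Solr instance at localhost:8983; the field name "prefix1" comes from the message above, everything else is illustrative:

```python
from urllib.parse import urlencode

def term_listing_url(base, field, prefix="", limit=-1):
    """Build a Solr faceting URL that lists the indexed terms of `field`."""
    params = {
        "q": "*:*",
        "rows": 0,             # we only want facet counts, not documents
        "facet": "true",
        "facet.field": field,
        "facet.prefix": prefix,
        "facet.limit": limit,  # -1 = no limit on returned terms
    }
    return base + "/select?" + urlencode(params)

url = term_listing_url("http://localhost:8983/solr", "prefix1", prefix="jun")
print(url)
```

The facet counts in the response then enumerate the field's indexed terms, which is what the stored-value-only /select response cannot show.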
Custom Filter. Pass field thru regular expression to match.
My data, found with Solr, needs to be tested against a matching regular expression formed at query time. To avoid sending big data chunks via HTTP, I suggested that results could be verified on the Solr side before they are sent to the client. I've heard that we can assign a custom Java function for filtering, but what about my own function that tests data against the formed regexp?

--
View this message in context: http://www.nabble.com/Custom-Filter.-Pass-field-thru-regular-expression-to-match.-tp16893711p16893711.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: GSA <-> Solr
Otis,

May I ask how you go about handling user access privileges? I mean, you need some mechanism to get user privileges from the corporate environment (LDAP, for example) and filter returned hits using the document access policy. You may also be caching this information for performance reasons (refreshing once a day, for example). Do you use some general open framework, or ad-hoc code?

Thanks & Regards,
Lukas

On Fri, Apr 25, 2008 at 7:26 AM, Otis Gospodnetic <[EMAIL PROTECTED]> wrote:
> Lukas,
>
> From your description, this looks like a Nutch job, not Solr (no crawling
> component), though one can also use Nutch with Solr now.
>
> I can't share the reasons, unfortunately. But from a personal standpoint,
> I've seen GSA and it's not all that impressive, it costs a pile of money,
> and the price rises exponentially with the number of documents, it seems.
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
> > BTW: Do you think you can share reasons why your clients are switching
> > from GSA? I am very interested in their experience.
> >
> > On Fri, Apr 25, 2008 at 6:29 AM, Lukas Vlcek wrote:
> > > Hi,
> > >
> > > I posted a related question to nutch-user yesterday: "Crawling
> > > MOSS 2007 content using Nutch via GSA connector".
> > >
> > > My specific situation is as follows:
> > > We are deploying MOSS 2007, which includes its own search server.
> > > However, we found that the search is lacking in some areas, and the
> > > solution requires additional expenses on HW or SW. Thus we are
> > > evaluating alternatives. GSA is one of them. But after I saw a
> > > presentation from technical guys on GSA, I thought to myself that
> > > Nutch could do the same (or even better, in terms of term boosting
> > > for example :-).
> > > GSA is able to use connectors for external datasources, and for
> > > SharePoint there is a sharepoint connector, which is written in Java
> > > and is Apache licensed. This connector can crawl document links out of
> > > MOSS 2007 and push them into GSA, which is then responsible for
> > > crawling. I wonder if I am able to use the sharepoint connector to get
> > > the list of URLs, which I can then crawl and index with Nutch. Is
> > > there any chance that using Solr makes sense in such a scenario? Is
> > > Solr more convenient for such a job?
> > >
> > > I have no experience with Solr. I think I just understand the basic
> > > concept: Solr is a search server which can accept documents in XML via
> > > HTTP. So I don't see a match with my use case, because I would have to
> > > download all those documents from MOSS on my own and convert them into
> > > XML prior to sending them to Solr. Am I correct?
> > >
> > > Regards,
> > > Lukas
> > >
> > > On Fri, Apr 25, 2008 at 3:42 AM, Otis Gospodnetic <[EMAIL PROTECTED]> wrote:
> > > > Ask me in about a month. I will likely be converting one *very*
> > > > large and well-known organization from the expensive GSA to Solr,
> > > > if that's what you are asking about.
> > > >
> > > > Otis
> > > > --
> > > > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> > > >
> > > > > From: Jon Baer
> > > > > Subject: GSA <-> Solr
> > > > >
> > > > > Hi,
> > > > >
> > > > > Going to try to persuade my employer to switch away some
> > > > > functions, maybe all, from the GSA black box to Solr, and was
> > > > > trying to find some (any?) case studies where this was done ...
> > > > >
> > > > > Also, what is the similar function to a "KeyMatch" in Solr? Is it
> > > > > elevate.xml?
> > > > >
> > > > > BTW, have been testing the DataImportHandler w/ MultiCore and it
> > > > > works very nicely.
> > > > >
> > > > > Thanks!
> > > > > - Jon

--
http://blog.lukas-vlcek.com/
Delete's increase while adding new documents
Hi all,

We send XML add-document messages to Solr and we notice something very strange. We autocommit at 10 documents. Starting from a totally clean index (we removed the data folder), when we start uploading we notice that docsPending goes up, but also that deletesPending goes up very fast. After reaching the first 10 we queried Solr to return everything, and the total result count was not 10 but somewhere around 77000, which is exactly 10 - docsDeleted from the stats page.

We used that Solr instance before, so my question is: is it possible that Solr remembers the unique identities somewhere other than in the data folder? Btw, we stopped Solr, removed the data folder and restarted Solr, and then this behavior began...

greetings,
Tim
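The usual cause of a climbing deletesPending counter is Solr's uniqueKey overwrite semantics: each add whose uniqueKey matches an already-indexed document first deletes the old document, then adds the new one. An illustrative sketch of that behavior (plain Python, not Solr code; the dict stands in for the index):

```python
def feed(index, docs):
    """Add docs to a uniqueKey-keyed index, counting overwrite deletes."""
    deletes = 0
    for doc in docs:
        key = doc["id"]
        if key in index:      # same uniqueKey as an earlier doc -> delete + add
            deletes += 1
        index[key] = doc
    return deletes

index = {}
feed(index, [{"id": 1}, {"id": 2}])
dupes = feed(index, [{"id": 2}, {"id": 3}])  # id 2 is re-fed, so one delete
```

So if the feed re-sends ids that were already indexed in the same run, deletesPending rises even though the data folder started out empty.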
Re: Caching of DataImportHandler's Status Page
Noble--

You should probably include SOLR-505 in your DataImportHandler patch.

-Sean

Noble Paul നോബിള്‍ नोब्ळ् wrote:
> It is caused by the new caching feature in Solr. The caching is done
> at the browser level; Solr just sends appropriate headers. We had
> raised an issue to disable that.
>
> BTW, the command is not exactly
> http://localhost:8983/solr/dataimport?command=status .
> http://localhost:8983/solr/dataimport itself gives the status. But
> even for an unknown command it just gives the status.
>
> --Noble
>
> On Fri, Apr 25, 2008 at 3:43 AM, Otis Gospodnetic <[EMAIL PROTECTED]> wrote:
>> Chris - what happens if you hit ctrl-R (or command-R on OSX)? That
>> should bypass the browser cache.
>>
>> Otis
>> --
>> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>>
>>> From: Chris Harris <[EMAIL PROTECTED]>
>>> Subject: Caching of DataImportHandler's Status Page
>>>
>>> I'm playing with the DataImportHandler, which so far seems pretty
>>> cool. (I've applied the latest patch from JIRA to a fresh download of
>>> trunk revision 651344. I'm using the basic Jetty setup in the example
>>> directory.) The thing that's bugging me is that while the handler's
>>> status page (http://localhost:8983/solr/dataimport?command=status)
>>> loads fine, if I hit reload in my browser (either IE or FF), the page
>>> won't update; the only way to get the page to provide up-to-date
>>> indexing status information seems to be to clear the browser cache
>>> and only then reload the page. Does anyone know whether this is most
>>> likely a Jetty issue, a Solr issue, a DataImportHandler issue, or a
>>> more idiosyncratic problem with my setup?
>>>
>>> Thanks,
>>> Chris
Re: solr performance for documents with hundreds of fields
That is well within the boundaries of what Solr/Lucene can handle.

But, of course, it depends on what you're doing with those fields too. Putting 200 fields into a dismax qf specification, for example, would surely be bad for performance :) But querying on only a handful of fields or fewer at a time should be no problem.

Erik

On Apr 25, 2008, at 2:24 AM, Umar Shah wrote:
> I am just wondering, because having 200 fields seems like too much (for
> me). I want to know if people actually have such kinds of schemas and
> how well they perform.
>
> On Thu, Apr 24, 2008 at 5:10 PM, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
>> Are you actually seeing performance problems, or just wondering if
>> there will be a performance problem?
>>
>> -Grant
>>
>> On Apr 24, 2008, at 7:08 AM, Umar Shah wrote:
>>> Hi,
>>>
>>> I wanted to know what the performance of Solr would be for the
>>> following scenario: the documents contain say 200 fields, with say
>>> 100 of the fields containing numbers and the rest containing short
>>> strings of 40-50 characters length. The sparseness of the data can be
>>> assumed to be approximately 50 fields missing per document.
>>>
>>> any insights?
>>>
>>> can a default value of 0 for missing fields change the performance, how?
>>>
>>> thanks in anticipation,
>>> -umar
>>
>> --
>> Grant Ingersoll
>>
>> Lucene Helpful Hints:
>> http://wiki.apache.org/lucene-java/BasicsOfPerformance
>> http://wiki.apache.org/lucene-java/LuceneFAQ
Re: Solr with Auto-suggest
On Apr 25, 2008, at 3:02 AM, Rantjil Bould wrote:
> Nice, great help. I have added the following fields to hold tokens:
>   ... class="solr.KeywordTokenizerFactory"/>
>   ... class="solr.KeywordTokenizerFactory"/>
>   ... stored="true"/>
>   ... stored="true"/>
> I am wondering how I can extract the tokens? I can see all tokens on the
> http://localhost:8080/solr/admin/schema.jsp page for fields prefix1 and
> prefix2, but when I query
> http://localhost:8080/solr/select?fl=prefix1,id&q=prefix2:jun%20prefix2:jun
> to get the content for prefix2, it does not display any content for
> prefix2. Am I doing anything wrong?

What do you mean by "extract tokens"? The documents returned from /select? are the stored field values, not the tokens -- you don't get to see the analyzed tokens (nor do you need to). If you want to interact with tokens, consider using faceting.

ryan

> - RB
>
> On 4/24/08, Ryan McKinley <[EMAIL PROTECTED]> wrote:
>> Unless your index is small, you will likely need to configure your
>> index with special fields to use for the auto-complete search. This is
>> the approach pointed to in SOLR-357. Essentially you index: "Bould" as
>> "b" "bo" "bou" "boul" "bould".
>>
>> ryan
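The SOLR-357-style prefix indexing Ryan describes ("Bould" becomes "b" "bo" "bou" "boul" "bould") is what an edge n-gram analysis chain produces at index time. A minimal sketch of the token generation in plain Python, just to show the idea:

```python
def prefix_tokens(term, min_len=1):
    """Emit every leading prefix of the lowercased term, shortest first."""
    term = term.lower()
    return [term[:i] for i in range(min_len, len(term) + 1)]

print(prefix_tokens("Bould"))  # ['b', 'bo', 'bou', 'boul', 'bould']
```

An auto-suggest query for the characters typed so far then becomes an exact match against this prefix field, rather than a scan over the main index.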
Reindexing mode for solr
Hi, Is there any way to tell solr to load in a kind of reindexing mode, which won't open a new searcher after every commit, etc? This is just when you don't have it available to query because you just want to reindex all the information. What do you think? Jonathan
Re: Reindexing mode for solr
Don't think so. But you can reindex on the master and query on the slave. If your concern is that the index will be sent to the search slave while you are still reindexing, just don't commit until you are done.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
> From: Jonathan Ariel <[EMAIL PROTECTED]>
> To: solr-user@lucene.apache.org
> Sent: Friday, April 25, 2008 10:05:55 AM
> Subject: Reindexing mode for solr
>
> Hi,
> Is there any way to tell solr to load in a kind of reindexing mode, which
> won't open a new searcher after every commit, etc? This is just when you
> don't have it available to query because you just want to reindex all the
> information.
>
> What do you think?
>
> Jonathan
Re: Caching of DataImportHandler's Status Page
Yes, we are waiting for the patch to get committed.

--Noble

On Fri, Apr 25, 2008 at 5:36 PM, Sean Timm <[EMAIL PROTECTED]> wrote:
> Noble--
>
> You should probably include SOLR-505 in your DataImportHandler patch.
>
> -Sean

--
--Noble Paul
Help required with external value source SOLR-351
I'm trying to get this new feature to work, without much success. I've completed the following steps:

1) downloaded the latest nightly build
2) added the necessary field and fieldType entries to schema.xml
3) created a file in the solr index folder, "external_cpc", with the following entries:

4901708=10
4901715=20

The ids correspond to job_id ids in the index. When I run the query _val_:cpc, the max score just corresponds to the defVal of 1; it doesn't seem to be picking up anything from the external file. For the query job_id:4901708 _val_:cpc, in the explain I get:

FunctionQuery(FileFloatSource(field=cpc,keyField=job_id,defVal=1.0,dataDir=D:/solr1/data/)), product of:
  1.0 = float(cpc{type=file,properties=})=1.0
  1.0 = boost

What am I doing wrong?

Thanks,
Howard
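For reference, the external score file described above is a plain key=value listing, one line per document, keyed by the keyField (job_id here). A small sketch of a parser for that layout, assuming the "docKey=floatValue" line format shown in the message; the parser itself is illustrative, not Solr's own code:

```python
def parse_external_scores(lines, default=1.0):
    """Parse 'key=value' score lines into a dict; unknown keys fall back to default."""
    scores = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue                      # skip blanks and comments
        key, _, val = line.partition("=")
        scores[key] = float(val)
    return scores

scores = parse_external_scores(["4901708=10", "4901715=20"])
```

A document whose key is absent from the file scores the defVal, which matches the symptom Howard sees: an explain output of 1.0 everywhere suggests the file is not being found or matched at all.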
Re: Reindexing mode for solr
In our setup, snapshooter is triggered on optimize, not commit. We can commit all we want on the master without making a snapshot; that only happens when we optimize.

The new Searcher is the biggest performance impact for us. We don't have that many documents (~250K), so copying an entire index is not a big deal.

wunder

On 4/25/08 8:28 AM, "Otis Gospodnetic" <[EMAIL PROTECTED]> wrote:
> Don't think so. But you can reindex on the master and query on the slave.
> If your concern is that the index will be sent to the search slave while
> you are still reindexing, just don't commit until you are done.
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
Re: Reindexing mode for solr
You're right. But I'm concerned about the "max number of searchers reached" error that I usually get when reindexing every once in a while.

On Fri, Apr 25, 2008 at 12:28 PM, Otis Gospodnetic <[EMAIL PROTECTED]> wrote:
> Don't think so. But you can reindex on the master and query on the slave.
> If your concern is that the index will be sent to the search slave while
> you are still reindexing, just don't commit until you are done.
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
Re: Reindexing mode for solr
Like Wunder said, you can reindex every once in a while all you want; just don't create index snapshots when you commit (disable the postCommit hook in solrconfig.xml), or don't commit at all until you are done. Or call optimize at the end and enable the postOptimize hook.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
> From: Jonathan Ariel <[EMAIL PROTECTED]>
> To: solr-user@lucene.apache.org
> Sent: Friday, April 25, 2008 12:11:49 PM
> Subject: Re: Reindexing mode for solr
>
> You're right. But I'm concerned about the "max number of searchers
> reached" error that I usually get when reindexing every once in a while.
Re: solr performance for documents with hundreds of fields
What Erik said ;) 200 fields is not a problem. Things to watch out for are:

- more index files, and thus more open file descriptors, if you use the non-compound Lucene index format and are working with non-optimized indices (on the master - optimize your index before it gets to the slaves)
- slower merging (I think) with more fields (on the master, not the slave searchers)
- more memory used if lots of fields don't have their norms turned off (i.e. are of a sub-optimal type)
- more memory used if you sort by lots of fields

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
> From: Erik Hatcher <[EMAIL PROTECTED]>
> To: solr-user@lucene.apache.org
> Sent: Friday, April 25, 2008 8:23:27 AM
> Subject: Re: solr performance for documents with hundreds of fields
>
> That is well within the boundaries of what Solr/Lucene can handle.
>
> But, of course, it depends on what you're doing with those fields too.
> Putting 200 fields into a dismax qf specification, for example, would
> surely be bad for performance :) But querying on only a handful of
> fields or less at a time - should be no problem.
>
> Erik
Re: GSA <-> Solr
The GSA -> Solr conversion I mentioned has not yet happened, and may not even include doc access right functionality. However, when I implemented things like that in the past, I used custom trickery, not a general open framework.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
> From: Lukas Vlcek <[EMAIL PROTECTED]>
> To: solr-user@lucene.apache.org
> Sent: Friday, April 25, 2008 7:05:48 AM
> Subject: Re: GSA <-> Solr
>
> Otis,
>
> May I ask how you go about handling user access privileges? I mean, you
> need some mechanism to get user privileges from the corporate environment
> (LDAP for example) and filter returned hits using the document access
> policy. Also you may be caching this information for performance reasons
> (refreshing once a day for example). Do you use some general open
> framework or ad-hoc code?
>
> Thanks & Regards,
> Lukas
>
> On Fri, Apr 25, 2008 at 7:26 AM, Otis Gospodnetic <[EMAIL PROTECTED]> wrote:
> > Lukas,
> >
> > From your description, this looks like a Nutch job, not Solr (no
> > crawling component), though one can also use Nutch with Solr now.
> >
> > I can't share the reasons, unfortunately. But from a personal
> > standpoint, I've seen GSA and it's not all that impressive, it costs a
> > pile of money, and the price rises exponentially with the number of
> > documents, it seems.
> >
> > Otis
> >
> > > BTW: Do you think you can share reasons why your clients are
> > > switching from GSA? I am very interested in their experience.
> > >
> > > On Fri, Apr 25, 2008 at 6:29 AM, Lukas Vlcek wrote:
> > > > Hi,
> > > >
> > > > I posted a related question to nutch-user yesterday.
Re: GSA <-> Solr
Custom trickery is pretty standard for access controls in search. A couple of the high points from deploying Ultraseek: three incompatible "single sign on" systems in one company, and a system that controlled which links were shown instead of access to the docs themselves.

The latter amazed me. If you had the URL, you could access the document. No access control at all, just trying to control knowledge of the URL. Of course, spiders are experts at finding URLs.

wunder

On 4/25/08 1:32 PM, "Otis Gospodnetic" <[EMAIL PROTECTED]> wrote:
> The GSA -> Solr conversion I mentioned has not yet happened and may not
> even include doc access right functionality. However, when I implemented
> things like that in the past, I used custom trickery, not a general open
> framework.
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
>> Otis,
>>
>> May I ask how you go about handling user access privileges? I mean, you
>> need some mechanism to get user privileges from the corporate
>> environment (LDAP for example) and filter returned hits using the
>> document access policy. Also you may be caching this information for
>> performance reasons (refreshing once a day for example). Do you use some
>> general open framework or ad-hoc code?
>>
>> Thanks & Regards,
>> Lukas
>>
>> On Fri, Apr 25, 2008 at 7:26 AM, Otis Gospodnetic <[EMAIL PROTECTED]> wrote:
>>> Lukas,
>>>
>>> From your description, this looks like a Nutch job, not Solr (no
>>> crawling component), though one can also use Nutch with Solr now.
>>>
>>> I can't share the reasons, unfortunately. But from a personal
>>> standpoint, I've seen GSA and it's not all that impressive, it costs a
>>> pile of money, and the price rises exponentially with the number of
>>> documents, it seems.
>>> >>> Otis >>> -- >>> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch >>> >>> >>> - Original Message From: Lukas Vlcek To: solr-user@lucene.apache.org Sent: Friday, April 25, 2008 12:31:13 AM Subject: Re: GSA <-> Solr BTW: Do you think you can share reasons why your clients are switching >>> from GSA? I am very interested in their experience. On Fri, Apr 25, 2008 at 6:29 AM, Lukas Vlcek wrote: > Hi, > > I posted related question into to Nutch-user yesterday. Here is the >>> post: Crawling > MOSS 2007 content using Nutch via GSA connector > > My specific situation if as folows: > We are deploying MOSS 2007 which includes its own search server. >>> However, > we found that the search is lacking in some areas and solution requires > additional expenses on HW or SW. Thus we are evaluating alternatives. >>> GSA is > one of them. But after I saw a presentation from technical guys on GSA >>> I > thought myself that Nutch could do the same (or even better in terms of >>> term > boosting for example :-). > GSA is able to use connectors for external datasources and for Share >>> Point > there is sharepoint connector which is written in Java and is Apache > licenced. This connector can crawl document links out of MOSS 2007 and >>> push > them into GSA which is then responsible for crawling. I wonder if I am >>> able > to use sharepoint connector to get the list of URLs which I can then >>> crawl > and index by Nutch. Is there any chance that using Solr make sanse in >>> such > scenario? Is Solr more convenient for such job? > > I have no experience with Solr. I think I just understand basic >>> concept: > Solr is a search server which can accept document in XML via HTTP. So I > don't see a match with my use case because I would have to download all > those documents from MOSS on my own and convert them into XML prior to > sending to Solr. Am I correct? 
>
> Regards,
> Lukas
>
> On Fri, Apr 25, 2008 at 3:42 AM, Otis Gospodnetic <[EMAIL PROTECTED]> wrote:
>
>> Ask me in about a month. I will likely be converting one *very* large and well-known organization from the expensive GSA to Solr, if that's what you are asking about.
>>
>> Otis
>> --
>> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>>
>> - Original Message
>>> From: Jon Baer
>>> To: solr-user@lucene.apache.org
>>> Sent: Thursday, April 24, 2008 8:03:19 PM
>>> Subject: GSA <-> Solr
>>>
>>> Hi,
>>>
>>> Going to try to persuade my employer to switch some functions, maybe all, away from the GSA black box to Solr, and was trying to find some (any?) case studies where this was done ...
>>>
>>> Also, what is the similar function to a "KeyMatch" in Solr?
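The access-filtering question Lukas raises is commonly answered with a Solr filter query over an ACL field. A minimal sketch, assuming a hypothetical `acl_groups` field holds the groups allowed to read each document (in practice the user's group list would be fetched from LDAP and cached):

```python
from urllib.parse import urlencode

def solr_query_url(base, user_query, user_groups):
    """Build a /select URL that filters hits by a hypothetical
    acl_groups field, so a user only sees documents readable by
    one of their groups (groups fetched and cached from LDAP)."""
    fq = "acl_groups:(%s)" % " OR ".join(user_groups)
    params = urlencode({"q": user_query, "fq": fq, "fl": "id,title"})
    return "%s/select?%s" % (base.rstrip("/"), params)

url = solr_query_url("http://localhost:8983/solr", "quarterly report",
                     ["sales", "managers"])
print(url)
```

The filter query keeps the access check out of the relevance score and lets Solr cache the filter per group combination.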
DisMax and pf
Hello,

I was looking at DisMax and playing with its "pf" parameter. I created a sample index with a field "content". I set "pf" to content^2.0 and expected to see (content:"my query here")^2.0 in the query (debugQuery=true). However, I only got (content:"my query here") -- no boost. Is this a bug, or am I forgetting something? I did add "&pf=content^2.0" to the request URL, and then I did see (content:"my query here")^2.0.

Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
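For reference, pf takes a whitespace-separated list of field^boost pairs, and the boost only appears in the parsed query when the parameter actually reaches the handler (via the URL, as above, or as a default in solrconfig.xml). A rough sketch of how such a spec decomposes -- illustrative parsing only, not Solr's actual code:

```python
def parse_boost_fields(spec):
    """Split a DisMax-style field spec like "content^2.0 title^5"
    into {field: boost}; boost defaults to 1.0 when omitted.
    Illustrative only -- not Solr's actual parser."""
    fields = {}
    for part in spec.split():
        name, _, boost = part.partition("^")
        fields[name] = float(boost) if boost else 1.0
    return fields

print(parse_boost_fields("content^2.0"))  # {'content': 2.0}
```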
RE: Solr with Auto-suggest
This is what the spellchecker does. It makes a separate Lucene index of n-gram letters and searches those. Works pretty well, and it is outside the main index. I did an experimental variation indexing word pairs as phrases, and it worked well too.

Lance Norskog

-----Original Message-----
From: Ryan McKinley [mailto:[EMAIL PROTECTED]]
Sent: Thursday, April 24, 2008 2:18 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr with Auto-suggest

On Apr 24, 2008, at 12:25 PM, Rantjil Bould wrote:
> Hi Group,
> I was asked in my project to implement Google Suggest kind of functionality for searching a help system. I have seen one thread http://www.mail-archive.com/solr-user@lucene.apache.org/msg06739.html which deals with the way to index if the index is large. But I am not able to get much information to start with. I am using jQuery's plugin for auto-suggest, and the query field is a large text (approx. 2000 chars long). I am just wondering how I can extract all tokens for any character typed by the user? Somebody might have already implemented the same functionality, and I would appreciate your help on this; even a hint might be a great help.

I don't think there is a magic one-size-fits-all solution to this, only a set of approaches you will need to modify for your specific index.

You will need to modify the jquery plugin to grab results from a solr query. For starters that can just be a standard query.

Unless your index is small, you will likely need to configure your index with special fields to use for the auto-complete search. This is the approach pointed to in SOLR-357. Essentially you index "Bould" as "b" "bo" "bou" "boul" "bould".

ryan
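The prefix-token scheme Ryan describes ("Bould" indexed as "b" "bo" ... "bould") can be sketched outside Solr; inside Solr the same effect usually comes from an edge n-gram style analyzer on the suggest field. A toy illustration of the tokens such a field would hold:

```python
def edge_prefixes(term, min_len=1):
    """Generate the leading prefixes an auto-suggest field would
    index for a term, lowercased: "Bould" -> "b", "bo", ... "bould"."""
    term = term.lower()
    return [term[:i] for i in range(min_len, len(term) + 1)]

print(edge_prefixes("Bould"))  # ['b', 'bo', 'bou', 'boul', 'bould']
```

Querying the field for the characters typed so far then becomes an exact term match, which is fast even on a large index.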
Re: Reindexing mode for solr
On 25-Apr-08, at 7:05 AM, Jonathan Ariel wrote:
> Hi,
> Is there any way to tell Solr to load in a kind of reindexing mode, which won't open a new searcher after every commit, etc.? This is for when you don't need it available to query, because you just want to reindex all the information.

Are you using autoCommit, and want a way to temporarily disable autoCommit?

-Mike
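Pending a real reindexing mode, a common workaround is to disable autoCommit in solrconfig.xml for the bulk load and commit once at the end, so no searcher is reopened mid-load. A sketch of that pattern with the HTTP POST stubbed out via a pluggable `send` function (names here are illustrative):

```python
def bulk_reindex(docs, send, batch_size=1000):
    """Post docs in batches with no intermediate commits, then issue a
    single commit -- so no new searcher is opened mid-load. In a real
    loader, send() would POST its argument to Solr's /update handler."""
    batch = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == batch_size:
            send("<add>%s</add>" % "".join(batch))
            batch = []
    if batch:
        send("<add>%s</add>" % "".join(batch))
    send("<commit/>")

calls = []
bulk_reindex(["<doc/>"] * 2500, calls.append, batch_size=1000)
# calls now holds three <add> batches followed by one <commit/>
```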
Re: Deletes increase while adding new documents
On 25-Apr-08, at 4:27 AM, Tim Mahy wrote:
> Hi all,
> we send XML add-document messages to Solr and we notice something very strange. We autocommit at 10 documents, starting from a totally clean index (removed the data folder). When we start uploading, we notice that docsPending is going up, but also that deletesPending is going up very fast. After reaching the first 10 we queried Solr to return everything, and the total results count was not 10 but somewhere around 77000, which is exactly 10 - docsDeleted from the stats page.
> We used that Solr instance before, so my question is: is it possible that Solr remembers the unique identities somewhere else than in the data folder? Btw, we stopped Solr, removed the data folder and restarted Solr, and then this behavior began...

Are you sure that all the documents you added were unique? (Btw, deletesPending doesn't necessarily mean that an old version of the doc was in the index, I think.)

-Mike
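Mike's question gets at uniqueKey overwrite semantics: re-adding an id that is already in the index counts as a delete plus an add, so duplicate ids inflate the deleted count. A toy model of that behavior (illustrative only, not Solr internals):

```python
def index_with_overwrite(ids):
    """Simulate uniqueKey semantics: re-adding an existing id deletes
    the old version first. Returns (live_docs, deleted_count)."""
    live, deleted = set(), 0
    for doc_id in ids:
        if doc_id in live:
            deleted += 1  # old version gets marked deleted
        live.add(doc_id)
    return len(live), deleted

print(index_with_overwrite(["a", "b", "a", "c", "a"]))  # (3, 2)
```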
Re: MultiThreaded Document Loader?
On 24-Apr-08, at 2:57 PM, oleg_gnatovskiy wrote:
> Hello. I was wondering if Solr has some kind of a multi-threaded document loader? I've been using post.sh (curl) to post documents to my Solr server, and it's pretty slow. I know it should be pretty easy to write one up, but I was just wondering if one already existed.

Yeah, I wouldn't rely on post.sh for performance. However, you can do "multithreaded" indexing by launching several instances of it, if you really wanted to:

$ post.sh [a-gA-G]*.xml &
$ post.sh [h-pH-P]*.xml &
$ post.sh [q-zQ-Z]*.xml &

-Mike
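The same trick in script form: a sketch of a thread-pool loader with the actual HTTP POST stubbed out (a real `post_one` would send each file's XML to Solr's /update handler):

```python
from concurrent.futures import ThreadPoolExecutor

def post_files(files, post_one, workers=3):
    """Index files concurrently, like launching several post.sh
    instances in parallel; post_one would POST one file to Solr."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(post_one, files))

results = post_files(["a.xml", "b.xml", "c.xml", "d.xml"],
                     lambda f: "posted:" + f)
```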
Re: Standard vs. DisMaxQueryHandler
I am frustrated that I have to pick between the two, because I want both. The way I look at it, there should be a more configurable query handler which allows me to dismax if I want to, and to pick a parser for the user's query (like the flexible one used by the standard query handler, or the more restrictive one found in DisMax Q.H. today).

At the moment, I'm faced with telling a user of my search service (another developer of a corporate app using my corporate search service) that he has to compose a dis-max query manually (i.e. use the standard query handler to get the job done) simply because he wants to do a prefix query (which isn't supported by DisMax Q.H.). This is for an auto-complete type thing, by the way. You might argue it's not hard -- that's true, though it is annoying. But the bigger issue is that I can't encapsulate these internal details into my search service -- where it belongs, IMO.

~ David Smiley

hossman_lucene wrote:
>
> : Is the main difference between the StandardQueryHandler and
> : DisMaxQueryHandler the supported query syntax (and different query
> : parser used in each of them), and the fact that the latter creates
> : DisjunctionMaxQueries, while the former just creates vanilla
> : BooleanQueries? Are there any other differences?
>
> the main difference is the query string, yes: Standard expects to get "lucene QueryParser" formatted queries, while DisMax expects to get raw user input strings ... Standard builds queries (whether they be prefix or boolean or wildcard) using the QueryParser as is, while DisMax does a "cross product" of the user input across many different fields and builds up a very specific query structure -- which can then be augmented with additional query clauses like the "bq" boost query and the "bf" boost function.
>
> there's no reason the StandardRequestHandler can't construct DisMaxQueries (once QueryParser has some syntax for them), and DisMaxRequestHandler does (at the outermost level) generate a BooleanQuery (with a custom "minShouldMatch" value set on it), but the main difference is really the use case: if you want the client to specify the exact query structure that they want, use StandardRequestHandler. if you want the client to just propagate the raw search string typed by the user, without any structure or escaping, and get the nice complex DisMax style query across the configured fields, the DisMax handler was written to fill that niche.
>
> (load up the example configs, and take a look at the query toString from this url to see what i mean about the complex structure...
>
> http://localhost:8983/solr/select/?qt=dismax&q=how+now+brown+cow&debugQuery=1
>
> -Hoss

--
View this message in context: http://www.nabble.com/Standard-vs.-DisMaxQueryHandler-tp6421205p16909626.html
Sent from the Solr - User mailing list archive at Nabble.com.
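Until DisMax grows prefix support, David's workaround -- composing the disjunction by hand for the standard handler -- can at least be encapsulated in the service layer. A hypothetical sketch with naive normalization (real input would need proper query-syntax escaping):

```python
def prefix_query(user_text, fields):
    """OR a prefix term across several fields for the standard
    handler -- a hand-rolled stand-in for DisMax prefix support.
    Naive normalization; illustrative, not production-ready."""
    term = user_text.strip().lower()
    return " OR ".join("%s:%s*" % (f, term) for f in fields)

print(prefix_query("bro", ["title", "body"]))  # title:bro* OR body:bro*
```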