Re: Solr with Auto-suggest

2008-04-25 Thread Rantjil Bould
Nice, great help. I have added the following fields to hold the tokens.
[The schema.xml field type and field definitions for prefix1 and prefix2 were
stripped by the mail archive.]
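A minimal sketch of the kind of prefix field being described, reconstructed
from the fragments that survive in the quoted copy of this mail below
(solr.KeywordTokenizerFactory, stored="true"); the type name, filter, and gram
sizes are assumptions for illustration, not the original schema:

<fieldType name="prefix_token" class="solr.TextField">
  <analyzer type="index">
    <!-- index time: treat the whole field value as one token, lowercase it,
         then emit every leading n-gram, e.g. "Bould" -> "b" "bo" "bou" "boul" "bould" -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25"/>
  </analyzer>
  <analyzer type="query">
    <!-- query time: no n-gramming, so the user's partial input matches the
         indexed prefixes directly -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<field name="prefix1" type="prefix_token" indexed="true" stored="true"/>
<copyField source="content" dest="prefix1"/>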
I am wondering how I can extract tokens? I can see all the tokens on the
http://localhost:8080/solr/admin/schema.jsp page for fields prefix1 and
prefix2, but when I query
http://localhost:8080/solr/select?fl=prefix1,id&q=prefix2:jun%20prefix2:jun to
get the content for prefix2, it does not display any content for prefix2.
Am I doing anything wrong?

- RB


On 4/24/08, Ryan McKinley <[EMAIL PROTECTED]> wrote:
>
>
> On Apr 24, 2008, at 12:25 PM, Rantjil Bould wrote:
>
>> Hi Group,
>>   I was asked in my project to implement Google Suggest kind of
>> functionality for searching a help system. I have seen one thread,
>> http://www.mail-archive.com/solr-user@lucene.apache.org/msg06739.html,
>> which
>> deals with how to index when the index is large. But I am not able to get much
>> information to start with. I am using jQuery's plugin for auto-suggest, and
>> the query field is large text (approx. 2000 chars long). I am just wondering how
>> I can extract all tokens for any character typed by the user? Somebody might
>> have already implemented the same functionality and I would appreciate
>> your
>> help on this; even a hint might be a great help.
>>
>
> I don't think there is a magic one-size-fits-all solution to this, only a
> set of approaches you will need to modify for your specific index.
>
> You will need to modify the jquery plugin to grab results from a solr
> query.  For starters that can be just a standard query whatever.
>
> Unless your index is small, you will likely need to configure your index
> with special fields to use for the auto-complete search.  This is the
> approach pointed to in SOLR-357.  Essentially you index "Bould" as "b" "bo"
> "bou" "boul" "bould".
>
> ryan
>
>


Re: Updating in Solr.SOLR-139

2008-04-25 Thread nutchvf

Hi!!!
I have already realized the mistake. My "id" field was generated from the
copy of another field called "url", in other words:

<copyField source="url" dest="id"/>

It seems that things did not work well when the "id" field was generated
from the copy of another one.

Now I have changed the schema.xml so that this does not happen: the "id"
field is no longer populated via copyField,
so the documents indexed for the first time have to contain the id field.
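A hedged reconstruction of what the corrected declaration probably looks like
(the type and the indexed/stored attributes are assumptions):

<field name="id" type="string" indexed="true" stored="true" required="true"/>
<uniqueKey>id</uniqueKey>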

Thank you very much for your attention!!!

Regards...

Koji Sekiguchi-2 wrote:
> 
> I've just tried this again in my environment, but I couldn't reproduce 
> what you pointed out.
> 
> My schema is:
> 
>    :
>    <field name="id" ... required="true" />
>    <field name="tags" ... multiValued="true"/>
>    :
>    <uniqueKey>id</uniqueKey>
> 
> 
> Koji
> 
> 
> nutchvf wrote:
>> Hi!!
>> Thank you very much, Koji!!
>> Your response has helped me a lot and I have already managed to update
>> the
>> document. Now I have another problem:
>> Sending the update request to Solr:
>>
>> For example:
>> http://localhost:8389/solr/update?mode=tags:overwrite&commit=true
>>
>> <add>
>> <doc>
>>   <field name="id">AAA</field>
>>   <field name="tags">German</field>
>> </doc>
>> </add>
>>
>> After that step, I realized that the "id" field (defined in my schema.xml
>> as
>> a uniqueKey field) appears as a multivalued field, with two values
>> that are the same. Do you know which may be the reason for this behavior?
>>
>> Thank you,
>>
>> Regards!
>>
>>   
> 
> 
> 

-- 
View this message in context: 
http://www.nabble.com/Updating-in-Solr.SOLR-139-tp16744841p16892573.html
Sent from the Solr - User mailing list archive at Nabble.com.



How to extract terms associated with a field

2008-04-25 Thread Rantjil Bould
Hello Group,
I have a field named prefix1, which is a copy of another
field called "content". The field type of prefix1 is:
[The field type definition was stripped by the mail archive.]
My question is: how can I extract the terms associated with prefix1? Is there
any query parameter that can extract all tokens for a field?
Your help/input would be appreciated.

regards,
Ranjan


Custom Filter. Pass field thru regular expression to match.

2008-04-25 Thread surfer10

My data, found with Solr, needs to be tested against a matching regular
expression formed at query time. To avoid sending big data chunks via HTTP,
I figured that results could be verified on the Solr side before they are sent
to the client.

I've heard that we can assign a custom Java function for filtering, but what
about my own function that tests data against the formed regexp?
-- 
View this message in context: 
http://www.nabble.com/Custom-Filter.-Pass-field-thru-regular-expression-to-match.-tp16893711p16893711.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: GSA <-> Solr

2008-04-25 Thread Lukas Vlcek
Otis,

May I ask how you go about handling user access privileges? I mean,
you need some mechanism to get user privileges from the corporate
environment (LDAP, for example) and to filter the returned hits using the
document access policy. You may also be caching this information for
performance reasons (refreshing once a day, for example). Do you use some
general open framework or ad-hoc code?

Thanks & Regards,
Lukas

On Fri, Apr 25, 2008 at 7:26 AM, Otis Gospodnetic <
[EMAIL PROTECTED]> wrote:

> Lukas,
>
> From your description, this looks like a Nutch job, not Solr (no crawling
> component), though one can also use Nutch with Solr now.
>
> I can't share the reasons, unfortunately.  But from a personal standpoint,
> I've seen GSA and it's not all that impressive, it costs a pile of money,
> and the price rises exponentially with the number of documents, it seems.
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
>
> - Original Message 
> > From: Lukas Vlcek <[EMAIL PROTECTED]>
> > To: solr-user@lucene.apache.org
> > Sent: Friday, April 25, 2008 12:31:13 AM
> > Subject: Re: GSA <-> Solr
> >
> > BTW: Do you think you can share reasons why your clients are switching
> from
> > GSA? I am very interested in their experience.
> >
> > On Fri, Apr 25, 2008 at 6:29 AM, Lukas Vlcek wrote:
> >
> > > Hi,
> > >
> > > I posted a related question to Nutch-user yesterday. Here is the
> post:
> > Crawling
> > > MOSS 2007 content using Nutch via GSA
> > connector
> > >
> > > My specific situation is as follows:
> > > We are deploying MOSS 2007 which includes its own search server.
> However,
> > > we found that the search is lacking in some areas and the solution requires
> > > additional expenses on HW or SW. Thus we are evaluating alternatives.
> GSA is
> > > one of them. But after I saw a presentation from technical guys on GSA
> I
> > > thought myself that Nutch could do the same (or even better in terms of
> term
> > > boosting for example :-).
> > > GSA is able to use connectors for external datasources, and for Share
> Point
> > > there is a SharePoint connector which is written in Java and is Apache
> > > licensed. This connector can crawl document links out of MOSS 2007 and
> push
> > > them into GSA which is then responsible for crawling. I wonder if I am
> able
> > > to use sharepoint connector to get the list of URLs which I can then
> crawl
> > > and index with Nutch. Is there any chance that using Solr makes sense in
> > such
> > > a scenario? Is Solr more convenient for such a job?
> > >
> > > I have no experience with Solr. I think I just understand the basic
> > concept:
> > > Solr is a search server which can accept document in XML via HTTP. So I
> > > don't see a match with my use case because I would have to download all
> > > those documents from MOSS on my own and convert them into XML prior to
> > > sending to Solr. Am I correct?
> > >
> > > Regards,
> > > Lukas
> > >
> > >
> > > On Fri, Apr 25, 2008 at 3:42 AM, Otis Gospodnetic <
> > > [EMAIL PROTECTED]> wrote:
> > >
> > >> Ask me in about a month.  I will likely be converting one *very* large
> and
> > >> well-known organization from the expensive GSA to Solr if that's
> what
> > >> you are asking about.
> > >>
> > >> Otis
> > >> --
> > >> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> > >>
> > >>
> > >> - Original Message 
> > >> > From: Jon Baer
> > >> > To: solr-user@lucene.apache.org
> > >> > Sent: Thursday, April 24, 2008 8:03:19 PM
> > >> > Subject: GSA <-> Solr
> > >> >
> > >> > Hi,
> > >> >
> > >> > Going to try to persuade my employer to switch away some functions,
> > >> > maybe all from the GSA black box to Solr and was trying to find some
> > >> > (any?) case studies where this was done ...
> > >> >
> > >> > Also what is the similar function to a "KeyMatch" in Solr?  Is it
> > >> > elevate.xml?
> > >> >
> > >> > BTW, have been testing the DataImportHandler w/ MultiCore and it
> works
> > >> > very nicely.
> > >> >
> > >> > Thanks!
> > >> >
> > >> > - Jon
> > >>
> > >>
> > >
> > >
> > > --
> > > http://blog.lukas-vlcek.com/
> >
> >
> >
> >
> > --
> > http://blog.lukas-vlcek.com/
>
>


-- 
http://blog.lukas-vlcek.com/


Delete's increase while adding new documents

2008-04-25 Thread Tim Mahy
Hi all,

we send XML add-document messages to Solr and we notice something very strange.
We autocommit at 100,000 documents, starting from a totally clean index (we removed
the data folder). When we start uploading we notice that docsPending is
going up, but also that deletesPending is going up very fast. After reaching
the first 100,000 we queried Solr to return everything, and the total results
count was not 100,000 but somewhere around 77,000, which is exactly 100,000 minus
docsDeleted from the stats page.

We used that Solr instance before, so my question is: is it possible that Solr
remembers the unique identities somewhere else than in the data folder? Btw we
stopped Solr, removed the data folder and restarted Solr, and then this behavior
began...

greetings,
Tim


The Info Support disclaimer applies to this e-mail message; see
http://www.infosupport.nl/disclaimer
 

Re: Caching of DataImportHandler's Status Page

2008-04-25 Thread Sean Timm

Noble--

You should probably include SOLR-505 in your DataImportHandler patch.

-Sean
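
For reference, the knob that SOLR-505 introduces is configured in
solrconfig.xml roughly like this (a sketch; the exact element names depend on
the patch version applied):

<requestDispatcher handleSelect="true">
  <!-- never304="true" tells Solr not to emit HTTP cache headers, so browsers
       will re-fetch pages like the DataImportHandler status on reload -->
  <httpCaching never304="true"/>
</requestDispatcher>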

Noble Paul നോബിള്‍ नोब्ळ् wrote:

It is caused by the new caching feature in Solr.  The caching is done
at the browser level; Solr just sends the appropriate headers. We had
raised an issue to disable that.

BTW, the command is not exactly
http://localhost:8983/solr/dataimport?command=status ;
http://localhost:8983/solr/dataimport itself gives the status, and
even for an unknown command it just gives the status.

--Noble

On Fri, Apr 25, 2008 at 3:43 AM, Otis Gospodnetic
<[EMAIL PROTECTED]> wrote:
  

Chris - what happens if you hit ctrl-R (or command-R on OSX)?  That should 
bypass the browser cache.

 Otis
 --
 Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch




 - Original Message 
 > From: Chris Harris <[EMAIL PROTECTED]>
 > To: solr-user@lucene.apache.org
 > Sent: Thursday, April 24, 2008 6:04:05 PM
 > Subject: Caching of DataImportHandler's Status Page
 >
 > I'm playing with the DataImportHandler, which so far seems pretty
 > cool. (I've applied the latest patch from JIRA to a fresh download of
 > trunk revision 651344. I'm using the basic Jetty setup in the example
 > directory.) The thing that's bugging me is that while the handler's
 > status page (http://localhost:8983/solr/dataimport?command=status)
 > loads fine, if I hit reload in my browser (either IE or FF), the page
 > won't update; the only way to get the page to provide up-to-date
 > indexing status information seems to be to clear the browser cache and
 > only then to reload the page. Does anyone know whether this is most
 > likely a Jetty issue, a Solr issue, a DataImportHandler issue, or a
 > more idiosyncratic problem with my setup?
 >
 > Thanks,
 > Chris





Re: solr performance for documents with hundreds of fields

2008-04-25 Thread Erik Hatcher

That is well within the boundaries of what Solr/Lucene can handle.

But, of course, it depends on what you're doing with those fields
too.  Putting 200 fields into a dismax qf specification, for example,
would surely be bad for performance :)  But querying on only a
handful of fields or less at a time should be no problem.


Erik



On Apr 25, 2008, at 2:24 AM, Umar Shah wrote:
I am just wondering, because having 200 fields seems like too much (for
me). I want to know if people actually have such kinds of schemas and how
well they perform.




On Thu, Apr 24, 2008 at 5:10 PM, Grant Ingersoll <[EMAIL PROTECTED]>
wrote:

Are you actually seeing performance problems, or just wondering if there
will be a performance problem?

-Grant


On Apr 24, 2008, at 7:08 AM, Umar Shah wrote:

Hi,

I wanted to know what the performance of Solr would be for the following
scenario: the documents contain, say, 200 fields, with
say 100 of the fields containing numbers
and the rest containing short strings of 40-50 characters in length.
The sparseness of the data can be assumed to be approximately 50 fields
missing per document.

Any insights?

Can a default value of 0 for missing fields change the performance, and how?


thanks in anticipation,
-umar



--
Grant Ingersoll

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ



Re: Solr with Auto-suggest

2008-04-25 Thread Ryan McKinley


On Apr 25, 2008, at 3:02 AM, Rantjil Bould wrote:

Nice, great help. I have added the following fields to hold the tokens.

[The quoted schema.xml snippet was stripped by the mail archive; the surviving
fragments show analyzers using solr.KeywordTokenizerFactory and fields with
stored="true".]

I am wondering how I can extract tokens? I can see all the tokens on the
http://localhost:8080/solr/admin/schema.jsp page for fields prefix1 and
prefix2, but when I query
http://localhost:8080/solr/select?fl=prefix1,id&q=prefix2:jun%20prefix2:jun to
get the content for prefix2, it does not display any content for prefix2.

Am I doing anything wrong???



what do you mean by "extract tokens"?

The documents returned from /select? are the stored field values, not
the tokens -- you don't get to see the analyzed tokens (nor do you
need to).  If you want to interact with tokens, consider using faceting.
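
For example, a facet query along these lines returns the indexed tokens of
prefix2 that begin with the characters typed so far (a hypothetical URL; the
facet.prefix parameter assumes a build recent enough to have it):

http://localhost:8080/solr/select?q=*:*&rows=0&facet=true&facet.field=prefix2&facet.prefix=jun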


ryan



- RB


On 4/24/08, Ryan McKinley <[EMAIL PROTECTED]> wrote:



On Apr 24, 2008, at 12:25 PM, Rantjil Bould wrote:


Hi Group,
  I was asked in my project to implement Google Suggest kind of
functionality for searching a help system. I have seen one thread,
http://www.mail-archive.com/solr-user@lucene.apache.org/msg06739.html,
which
deals with how to index when the index is large. But I am not able to
get much information to start with. I am using jQuery's plugin for
auto-suggest, and the query field is large text (approx. 2000 chars long).
I am just wondering how I can extract all tokens for any character typed
by the user? Somebody might have already implemented the same
functionality and I would appreciate your
help on this; even a hint might be a great help.



I don't think there is a magic one-size-fits-all solution to this, only a
set of approaches you will need to modify for your specific index.

You will need to modify the jquery plugin to grab results from a solr
query.  For starters that can be just a standard query, whatever.

Unless your index is small, you will likely need to configure your index
with special fields to use for the auto-complete search.  This is the
approach pointed to in SOLR-357.  Essentially you index "Bould" as
"b" "bo" "bou" "boul" "bould".

ryan






Reindexing mode for solr

2008-04-25 Thread Jonathan Ariel
Hi,
Is there any way to tell solr to load in a kind of reindexing mode, which
won't open a new searcher after every commit, etc? This is just when you
don't have it available to query because you just want to reindex all the
information.

What do you think?

Jonathan


Re: Reindexing mode for solr

2008-04-25 Thread Otis Gospodnetic
Don't think so.  But you reindex on the master and query on the slave.  If your 
concern is that the index will be sent to the search slave while you are still 
reindexing, just don't commit until you are done.


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


- Original Message 
> From: Jonathan Ariel <[EMAIL PROTECTED]>
> To: solr-user@lucene.apache.org
> Sent: Friday, April 25, 2008 10:05:55 AM
> Subject: Reindexing mode for solr
> 
> Hi,
> Is there any way to tell solr to load in a kind of reindexing mode, which
> won't open a new searcher after every commit, etc? This is just when you
> don't have it available to query because you just want to reindex all the
> information.
> 
> What do you think?
> 
> Jonathan



Re: Caching of DataImportHandler's Status Page

2008-04-25 Thread Noble Paul നോബിള്‍ नोब्ळ्
Yes, we are waiting for the patch to get committed.
--Noble

On Fri, Apr 25, 2008 at 5:36 PM, Sean Timm <[EMAIL PROTECTED]> wrote:
> Noble--
>
>  You should probably include SOLR-505 in your DataImportHandler patch.
>
>  -Sean
>
>
>
>  Noble Paul നോബിള്‍ नोब्ळ् wrote:
>
> > It is caused by the new caching feature in Solr.  The caching is done
> > at the browser level; Solr just sends the appropriate headers. We had
> > raised an issue to disable that.
> >
> > BTW, the command is not exactly
> > http://localhost:8983/solr/dataimport?command=status ;
> > http://localhost:8983/solr/dataimport itself gives the status, and
> > even for an unknown command it just gives the status.
> >
> > --Noble
> >
> > On Fri, Apr 25, 2008 at 3:43 AM, Otis Gospodnetic
> > <[EMAIL PROTECTED]> wrote:
> >
> >
> > > Chris - what happens if you hit ctrl-R (or command-R on OSX)?  That
> should bypass the browser cache.
> > >
> > >  Otis
> > >  --
> > >  Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> > >
> > >
> > >
> > >
> > >  - Original Message 
> > >  > From: Chris Harris <[EMAIL PROTECTED]>
> > >  > To: solr-user@lucene.apache.org
> > >  > Sent: Thursday, April 24, 2008 6:04:05 PM
> > >  > Subject: Caching of DataImportHandler's Status Page
> > >  >
> > >  > I'm playing with the DataImportHandler, which so far seems pretty
> > >  > cool. (I've applied the latest patch from JIRA to a fresh download of
> > >  > trunk revision 651344. I'm using the basic Jetty setup in the example
> > >  > directory.) The thing that's bugging me is that while the handler's
> > >  > status page (http://localhost:8983/solr/dataimport?command=status)
> > >  > loads fine, if I hit reload in my browser (either IE or FF), the page
> > >  > won't update; the only way to get the page to provide up-to-date
> > >  > indexing status information seems to be to clear the browser cache
> and
> > >  > only then to reload the page. Does anyone know whether this is most
> > >  > likely a Jetty issue, a Solr issue, a DataImportHandler issue, or a
> > >  > more idiosyncratic problem with my setup?
> > >  >
> > >  > Thanks,
> > >  > Chris
> > >
> > >
> > >
> > >
> >
>



-- 
--Noble Paul


Help required with external value source SOLR-351

2008-04-25 Thread Howard Lee
Help required with external value source SOLR-351

I'm trying to get this new feature to work without much success. I've
completed the following steps.

1) downloaded the latest nightly build
2) added the following to schema.xml: [the field type and field definitions
were stripped by the mail archive]
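Judging from the explain output quoted further down (field=cpc,
keyField=job_id, defVal=1.0), the stripped additions were probably along these
lines (a hedged reconstruction, not the original post):

<fieldType name="file" class="solr.ExternalFileField" keyField="job_id"
           defVal="1.0" stored="false" indexed="false" valType="float"/>

and

<field name="cpc" type="file"/>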
3) created a file in the Solr index folder, "external_cpc", with the
following entries:
4901708=10
4901715=20

The ids correspond to job_id ids in the index.

when I run a query _val_:cpc, the max score just corresponds to the defVal of 1.0.
It doesn't seem to be picking up anything from the external file.

from a query

job_id:4901708  _val_:cpc

In the explain I get

FunctionQuery(FileFloatSource(field=cpc,keyField=job_id,defVal=1.0,dataDir=D:/solr1/data/)),
product of:
1.0 = float(cpc{type=file,properties=})=1.0
1.0 = boost

what am I doing wrong?

Thanks

Howard


Re: Reindexing mode for solr

2008-04-25 Thread Walter Underwood
In our setup, snapshooter is triggered on optimize, not commit.
We can commit all we want on the master without making a
snapshot. That only happens when we optimize.

The new Searcher is the biggest performance impact for us.
We don't have that many documents (~250K), so copying an
entire index is not a big deal.

wunder

On 4/25/08 8:28 AM, "Otis Gospodnetic" <[EMAIL PROTECTED]> wrote:

> Don't think so.  But you reindex on the master and query on the slave.  If
> your concern is that the index will be sent to the search slave while you are
> still reindexing, just don't commit until you are done.
> 
> 
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> 
> 
> - Original Message 
>> From: Jonathan Ariel <[EMAIL PROTECTED]>
>> To: solr-user@lucene.apache.org
>> Sent: Friday, April 25, 2008 10:05:55 AM
>> Subject: Reindexing mode for solr
>> 
>> Hi,
>> Is there any way to tell solr to load in a kind of reindexing mode, which
>> won't open a new searcher after every commit, etc? This is just when you
>> don't have it available to query because you just want to reindex all the
>> information.
>> 
>> What do you think?
>> 
>> Jonathan
> 



Re: Reindexing mode for solr

2008-04-25 Thread Jonathan Ariel
You're right. But I'm concerned about the "Max Number of Searchers Reached"
error that I usually get when reindexing every once in a while.

On Fri, Apr 25, 2008 at 12:28 PM, Otis Gospodnetic <
[EMAIL PROTECTED]> wrote:

> Don't think so.  But you reindex on the master and query on the slave.  If
> your concern is that the index will be sent to the search slave while you
> are still reindexing, just don't commit until you are done.
>
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
>
> - Original Message 
> > From: Jonathan Ariel <[EMAIL PROTECTED]>
> > To: solr-user@lucene.apache.org
> > Sent: Friday, April 25, 2008 10:05:55 AM
> > Subject: Reindexing mode for solr
> >
> > Hi,
> > Is there any way to tell solr to load in a kind of reindexing mode,
> which
> > won't open a new searcher after every commit, etc? This is just when you
> > don't have it available to query because you just want to reindex all
> the
> > information.
> >
> > What do you think?
> >
> > Jonathan
>
>


Re: Reindexing mode for solr

2008-04-25 Thread Otis Gospodnetic
Like Wunder said, you can reindex every once in a while all you want; just 
don't create index snapshots when you commit (disable the postCommit hook in 
solrconfig.xml), or don't commit at all until you are done.  Or call optimize at 
the end and enable the postOptimize hook.
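
The hooks being discussed look roughly like this in solrconfig.xml (a sketch;
the script path and arguments are installation-specific):

<updateHandler class="solr.DirectUpdateHandler2">
  <!-- comment out the postCommit listener so plain commits do not snapshot -->
  <!--
  <listener event="postCommit" class="solr.RunExecutableListener">
    <str name="exe">snapshooter</str>
    <str name="dir">solr/bin</str>
    <bool name="wait">true</bool>
  </listener>
  -->
  <!-- snapshot only when an optimize finishes -->
  <listener event="postOptimize" class="solr.RunExecutableListener">
    <str name="exe">snapshooter</str>
    <str name="dir">solr/bin</str>
    <bool name="wait">true</bool>
  </listener>
</updateHandler>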

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


- Original Message 
> From: Jonathan Ariel <[EMAIL PROTECTED]>
> To: solr-user@lucene.apache.org
> Sent: Friday, April 25, 2008 12:11:49 PM
> Subject: Re: Reindexing mode for solr
> 
> You're right. But I'm concerned about some Max Number of Searchers Reached
> that I usually get when reindexing every one in a while.
> 
> On Fri, Apr 25, 2008 at 12:28 PM, Otis Gospodnetic <
> [EMAIL PROTECTED]> wrote:
> 
> > Don't think so.  But you reindex on the master and query on the slave.  If
> > your concern is that the index will be sent to the search slave while you
> > are still reindexing, just don't commit until you are done.
> >
> >
> > Otis
> > --
> > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> >
> >
> > - Original Message 
> > > From: Jonathan Ariel 
> > > To: solr-user@lucene.apache.org
> > > Sent: Friday, April 25, 2008 10:05:55 AM
> > > Subject: Reindexing mode for solr
> > >
> > > Hi,
> > > Is there any way to tell solr to load in a kind of reindexing mode,
> > which
> > > won't open a new searcher after every commit, etc? This is just when you
> > > don't have it available to query because you just want to reindex all
> > the
> > > information.
> > >
> > > What do you think?
> > >
> > > Jonathan
> >
> >



Re: solr performance for documents with hundreds of fields

2008-04-25 Thread Otis Gospodnetic
What Erik said ;)

200 fields is not a problem.  Things to watch out for are:
- more index files, and thus more open file descriptors, if you use the non-compound 
Lucene index format and are working with non-optimized indices (on the master: 
optimize your index before it gets to the slaves)
- slower merging (I think) with more fields (on the master, not the slave searchers)
- more memory used if lots of fields don't have their norms turned off (i.e. 
are of a sub-optimal type)
- more memory used if you sort on lots of fields
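
As an illustration of the norms point, turning norms off on a field is a
single schema.xml attribute (the field name and type here are hypothetical):

<field name="price" type="sfloat" indexed="true" stored="true" omitNorms="true"/>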


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


- Original Message 
> From: Erik Hatcher <[EMAIL PROTECTED]>
> To: solr-user@lucene.apache.org
> Sent: Friday, April 25, 2008 8:23:27 AM
> Subject: Re: solr performance for documents with hundreds of fields
> 
> That is well within the boundaries of what Solr/Lucene can handle.
> 
> But, of course, it depends on what you're doing with those fields
> too.  Putting 200 fields into a dismax qf specification, for example,
> would surely be bad for performance :)  But querying on only a
> handful of fields or less at a time should be no problem.
> 
> Erik
> 
> 
> 
> On Apr 25, 2008, at 2:24 AM, Umar Shah wrote:
> > I am just wondering, because having 200 fields seems like too much (for
> > me).
> > I want to know if people actually have such kinds of schemas and how well
> > they perform.
> >
> >
> >
> >
> > On Thu, Apr 24, 2008 at 5:10 PM, Grant Ingersoll 
> > wrote:
> >
> >> Are you actually seeing performance problems, or just wondering if there
> >> will be a performance problem?
> >>
> >> -Grant
> >>
> >>
> >> On Apr 24, 2008, at 7:08 AM, Umar Shah wrote:
> >>
> >>  Hi,
> >>>
> >>> I wanted to know what would be the performance of SOLR  for the  
> >>> following
> >>> scenario:
> >>> the documents contain say 200 fields with
> >>> say 100 of the fields (containing numbers)
> >>> and rest containing short strings of 40-50 character length.
> >>> the sparseness of the data can be assumed to be as approximately  
> >>> 50 fields
> >>> missing per document.
> >>>
> >>> any insights?
> >>>
> >>> can a default value of 0 for missing fields change the  
> >>> performance, how?
> >>>
> >>> thanks in anticipation,
> >>> -umar
> >>>
> >>
> >> --
> >> Grant Ingersoll
> >>
> >> Lucene Helpful Hints:
> >> http://wiki.apache.org/lucene-java/BasicsOfPerformance
> >> http://wiki.apache.org/lucene-java/LuceneFAQ
> >>
> >>
> >>
> >>
> >>
> >>
> >>



Re: GSA <-> Solr

2008-04-25 Thread Otis Gospodnetic
The GSA -> Solr conversion I mentioned has not yet happened, and may not even 
include doc access-rights functionality.
However, when I implemented things like that in the past, I used custom 
trickery, not a general open framework.

 
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


- Original Message 
> From: Lukas Vlcek <[EMAIL PROTECTED]>
> To: solr-user@lucene.apache.org
> Sent: Friday, April 25, 2008 7:05:48 AM
> Subject: Re: GSA <-> Solr
> 
> Otis,
> 
> May I ask how you go about handling user access privileges? I mean,
> you need some mechanism to get user privileges from the corporate
> environment (LDAP, for example) and to filter the returned hits using the
> document access policy. You may also be caching this information for
> performance reasons (refreshing once a day, for example). Do you use some
> general open framework or ad-hoc code?
> 
> Thanks & Regards,
> Lukas
> 
> On Fri, Apr 25, 2008 at 7:26 AM, Otis Gospodnetic <
> [EMAIL PROTECTED]> wrote:
> 
> > Lukas,
> >
> > From your description, this looks like a Nutch job, not Solr (no crawling
> > component), though one can also use Nutch with Solr now.
> >
> > I can't share the reasons, unfortunately.  But from a personal standpoint,
> > I've seen GSA and it's not all that impressive, it costs a pile of money,
> > and the price rises exponentially with the number of documents, it seems.
> >
> > Otis
> > --
> > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> >
> >
> > - Original Message 
> > > From: Lukas Vlcek 
> > > To: solr-user@lucene.apache.org
> > > Sent: Friday, April 25, 2008 12:31:13 AM
> > > Subject: Re: GSA <-> Solr
> > >
> > > BTW: Do you think you can share reasons why your clients are switching
> > from
> > > GSA? I am very interested in their experience.
> > >
> > > On Fri, Apr 25, 2008 at 6:29 AM, Lukas Vlcek wrote:
> > >
> > > > Hi,
> > > >
> > > I posted a related question to Nutch-user yesterday. Here is the
> > post:
> > > Crawling
> > > > MOSS 2007 content using Nutch via GSA
> > > connector
> > > >
> > > > My specific situation is as follows:
> > > > We are deploying MOSS 2007 which includes its own search server.
> > However,
> > > > we found that the search is lacking in some areas and the solution requires
> > > > additional expenses on HW or SW. Thus we are evaluating alternatives.
> > GSA is
> > > > one of them. But after I saw a presentation from technical guys on GSA
> > I
> > > > thought myself that Nutch could do the same (or even better in terms of
> > term
> > > > boosting for example :-).
> > > > GSA is able to use connectors for external datasources, and for Share
> > Point
> > > > there is a SharePoint connector which is written in Java and is Apache
> > > > licensed. This connector can crawl document links out of MOSS 2007 and
> > push
> > > > them into GSA which is then responsible for crawling. I wonder if I am
> > able
> > > > to use sharepoint connector to get the list of URLs which I can then
> > crawl
> > > > and index with Nutch. Is there any chance that using Solr makes sense in
> > such
> > > > a scenario? Is Solr more convenient for such a job?
> > > >
> > > > I have no experience with Solr. I think I just understand the basic
> > concept:
> > > > Solr is a search server which can accept document in XML via HTTP. So I
> > > > don't see a match with my use case because I would have to download all
> > > > those documents from MOSS on my own and convert them into XML prior to
> > > > sending to Solr. Am I correct?
> > > >
> > > > Regards,
> > > > Lukas
> > > >
> > > >
> > > > On Fri, Apr 25, 2008 at 3:42 AM, Otis Gospodnetic <
> > > > [EMAIL PROTECTED]> wrote:
> > > >
> > > >> Ask me in about a month.  I will likely be converting one *very* large
> > and
> > > >> well-known organization from the expensive GSA to Solr if that's
> > what
> > > >> you are asking about.
> > > >>
> > > >> Otis
> > > >> --
> > > >> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> > > >>
> > > >>
> > > >> - Original Message 
> > > >> > From: Jon Baer
> > > >> > To: solr-user@lucene.apache.org
> > > >> > Sent: Thursday, April 24, 2008 8:03:19 PM
> > > >> > Subject: GSA <-> Solr
> > > >> >
> > > >> > Hi,
> > > >> >
> > > >> > Going to try to persuade my employer to switch away some functions,
> > > >> > maybe all from the GSA black box to Solr and was trying to find some
> > > >> > (any?) case studies where this was done ...
> > > >> >
> > > >> > Also what is the similar function to a "KeyMatch" in Solr?  Is it
> > > >> > elevate.xml?
> > > >> >
> > > >> > BTW, have been testing the DataImportHandler w/ MultiCore and it
> > works
> > > >> > very nicely.
> > > >> >
> > > >> > Thanks!
> > > >> >
> > > >> > - Jon
> > > >>
> > > >>
> > > >
> > > >
> > > > --
> > > > http://blog.lukas-vlcek.com/
> > >
> > >
> > >
> > >
> > > --
> > > http://blog.lukas-vlcek.com/
> >
> >
> 
> 
> -- 
> http://blog.lukas-vlcek.com/



Re: GSA <-> Solr

2008-04-25 Thread Walter Underwood
Custom trickery is pretty standard for access controls in search.

A couple of the high points from deploying Ultraseek: three incompatible
"single sign on" systems in one company, and a system that controlled
which links were shown instead of access to the docs themselves.
The latter amazed me. If you had the URL, you could access the
document. No access control at all, just trying to control knowledge
of the URL. Of course, spiders are experts at finding URLs.

wunder

On 4/25/08 1:32 PM, "Otis Gospodnetic" <[EMAIL PROTECTED]> wrote:

> The GSA -> Solr conversion I mentioned has not yet happened, and may not even
> include doc access-rights functionality.
> However, when I implemented things like that in the past, I used custom
> trickery, not a general open framework.
> 
>  
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> 
> 
> - Original Message 
>> From: Lukas Vlcek <[EMAIL PROTECTED]>
>> To: solr-user@lucene.apache.org
>> Sent: Friday, April 25, 2008 7:05:48 AM
>> Subject: Re: GSA <-> Solr
>> 
>> Otis,
>> 
>> May I ask how you go about handling user access privileges? I mean,
>> you need some mechanism to get user privileges from the corporate
>> environment (LDAP, for example) and to filter the returned hits using the
>> document access policy. You may also be caching this information for
>> performance reasons (refreshing once a day, for example). Do you use some
>> general open framework or ad-hoc code?
>> 
>> Thanks & Regards,
>> Lukas
>> 
>> On Fri, Apr 25, 2008 at 7:26 AM, Otis Gospodnetic <
>> [EMAIL PROTECTED]> wrote:
>> 
>>> Lukas,
>>> 
>>> From your description, this looks like a Nutch job, not Solr (no crawling
>>> component), though one can also use Nutch with Solr now.
>>> 
>>> I can't share the reasons, unfortunately.  But from a personal standpoint,
>>> I've seen GSA and it's not all that impressive, it costs a pile of money,
>>> and the price rises exponentially with the number of documents, it seems.
>>> 
>>> Otis
>>> --
>>> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>>> 
>>> 
>>> - Original Message 
 From: Lukas Vlcek
 To: solr-user@lucene.apache.org
 Sent: Friday, April 25, 2008 12:31:13 AM
 Subject: Re: GSA <-> Solr
 
 BTW: Do you think you can share reasons why your clients are switching
>>> from
 GSA? I am very interested in their experience.
 
 On Fri, Apr 25, 2008 at 6:29 AM, Lukas Vlcek wrote:
 
> Hi,
> 
> I posted a related question to Nutch-user yesterday. Here is the
>>> post:
 Crawling
> MOSS 2007 content using Nutch via GSA
 connector
> 
> My specific situation is as follows:
> We are deploying MOSS 2007 which includes its own search server.
>>> However,
> we found that the search is lacking in some areas and the solution requires
> additional expenses on HW or SW. Thus we are evaluating alternatives.
>>> GSA is
> one of them. But after I saw a presentation from technical guys on GSA
>>> I
> thought myself that Nutch could do the same (or even better in terms of
>>> term
> boosting for example :-).
> GSA is able to use connectors for external datasources, and for Share
>>> Point
> there is a SharePoint connector which is written in Java and is Apache
> licensed. This connector can crawl document links out of MOSS 2007 and
>>> push
> them into GSA which is then responsible for crawling. I wonder if I am
>>> able
> to use sharepoint connector to get the list of URLs which I can then
>>> crawl
> and index with Nutch. Is there any chance that using Solr makes sense in
>>> such
> a scenario? Is Solr more convenient for such a job?
> 
> I have no experience with Solr. I think I just understand the basic
>>> concept:
> Solr is a search server which can accept document in XML via HTTP. So I
> don't see a match with my use case because I would have to download all
> those documents from MOSS on my own and convert them into XML prior to
> sending to Solr. Am I correct?
> 
> Regards,
> Lukas
> 
> 
> On Fri, Apr 25, 2008 at 3:42 AM, Otis Gospodnetic <
> [EMAIL PROTECTED]> wrote:
> 
>> Ask me in about a month.  I will likely be converting one *very* large
>>> and
>> well-known organization from the expensive GSA to Solr if that's
>>> what
>> you are asking about.
>> 
>> Otis
>> --
>> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>> 
>> 
>> - Original Message 
>>> From: Jon Baer
>>> To: solr-user@lucene.apache.org
>>> Sent: Thursday, April 24, 2008 8:03:19 PM
>>> Subject: GSA <-> Solr
>>> 
>>> Hi,
>>> 
>>> Going to try to persuade my employer to switch away some functions,
>>> maybe all from the GSA black box to Solr and was trying to find some
>>> (any?) case studies where this was done ...
>>> 
>>> Also what is the similar function to a "KeyMatch" in Solr

DisMax and pf

2008-04-25 Thread Otis Gospodnetic
Hello,

I was looking at DisMax and playing with its "pf" parameter.  I created a 
sample index with field "content".  I set "pf" to: content^2.0 and expected to 
see (content:"my query here")^2.0 in the query (debugQuery=true).  However, I 
only got (content:"my query here") -- no boost.
Is this a bug or am I forgetting something?

I did add "&pf=content^2.0" to the request URL and then I did see  (content:"my 
query here")^2.0 
Thanks,
Otis

--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch




RE: Solr with Auto-suggest

2008-04-25 Thread Lance Norskog
This is what the spellchecker does. It makes a separate Lucene index of n-gram
letters and searches those. It works pretty well, and it is outside the main
index. I did an experimental variation indexing word pairs as phrases, and
it worked well too.

Lance Norskog 

-Original Message-
From: Ryan McKinley [mailto:[EMAIL PROTECTED] 
Sent: Thursday, April 24, 2008 2:18 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr with Auto-suggest


On Apr 24, 2008, at 12:25 PM, Rantjil Bould wrote:
> Hi Group,
>    I was asked in my project to implement Google Suggest kind
> of functionality for searching a help system. I have seen one thread,
> http://www.mail-archive.com/solr-user@lucene.apache.org/msg06739.html,
> which
> deals with how to index when the index is large. But I am not able to get
> much information to start with. I am using jQuery's plugin for
> auto-suggest, and the query field is large text (approx. 2000 chars long). I am
> just wondering how I can extract all tokens for any character typed by
> the user? Somebody might have already implemented the same functionality
> and I would appreciate your help on this; even a hint might be a great
> help.

I don't think there is a magic one-size-fits-all solution to this, only a
set of approaches you will need to modify for your specific index.

You will need to modify the jquery plugin to grab results from a solr query.
For starters that can be just a standard query whatever.

Unless your index is small, you will likely need to configure your index
with special fields to use for the auto-complete search.  This is the
approach pointed to in SOLR-357.  Essentially you index
"Bould" as "b" "bo" "bou" "boul" "bould".

ryan




Re: Reindexing mode for solr

2008-04-25 Thread Mike Klaas

On 25-Apr-08, at 7:05 AM, Jonathan Ariel wrote:

Hi,
Is there any way to tell solr to load in a kind of reindexing mode, which
won't open a new searcher after every commit, etc? This is just when you
don't have it available to query because you just want to reindex all the
information.


Are you using autoCommit, and want a way to temporarily disable  
autoCommit?
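
For context, autoCommit is the solrconfig.xml setting sketched below; the
thresholds are just examples, and maxTime may not exist in older builds:

<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <!-- commit automatically after this many pending documents -->
    <maxDocs>100000</maxDocs>
    <!-- or after this many milliseconds, whichever comes first -->
    <maxTime>60000</maxTime>
  </autoCommit>
</updateHandler>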


-Mike


Re: Delete's increase while adding new documents

2008-04-25 Thread Mike Klaas

On 25-Apr-08, at 4:27 AM, Tim Mahy wrote:


Hi all,

we send XML add-document messages to Solr and we notice something
very strange.
We autocommit at 100,000 documents, starting from a totally clean index
(we removed the data folder). When we start uploading we notice that
docsPending is going up, but also that deletesPending is
going up very fast. After reaching the first 100,000 we queried
Solr to return everything, and the total results count was not 100,000
but somewhere around 77,000, which is exactly 100,000 minus docsDeleted
from the stats page.


We used that Solr instance before, so my question is: is it
possible that Solr remembers the unique identities somewhere else than
in the data folder? Btw we stopped Solr, removed the data folder
and restarted Solr, and then this behavior began...


Are you sure that all the documents you added were unique?

(btw, deletePending doesn't necessarily mean that an old version of  
the doc was in the index, I think).


-Mike


Re: MultiThreaded Document Loader?

2008-04-25 Thread Mike Klaas

On 24-Apr-08, at 2:57 PM, oleg_gnatovskiy wrote:


Hello. I was wondering if Solr has some kind of a multi-threaded document
loader? I've been using post.sh (curl) to post documents to my Solr server,
and it's pretty slow. I know it should be pretty easy to write one up, but I
was just wondering if one already existed.


Yeah, I wouldn't rely on post.sh for performance.  However, you can do
"multithreaded" indexing by launching several instances of it, if you
really wanted to:


$ post.sh [a-gA-G]*.xml &
$ post.sh [h-pH-P]*.xml &
$ post.sh [q-zQ-Z]*.xml &

-Mike


Re: Standard vs. DisMaxQueryHandler

2008-04-25 Thread David Smiley @MITRE.org

I am frustrated that I have to pick between the two, because I want both.  The
way I look at it, there should be a more configurable query handler which
allows me to dismax if I want to, and pick a parser for the user's query
(like the flexible one used by the standard query handler, or the more
restrictive one found in the DisMax Q.H. today).  At the moment, I'm faced with
telling a user of my search service (another developer of a corporate app
using my corporate search service) that he has to compose a dis-max manually
(i.e. use the standard query handler to get the job done) simply because he
wants to do a prefix query (which isn't supported by the DisMax Q.H.).  This is
for an auto-complete type thing, by the way.  You might argue it's not hard
-- that's true, though it is annoying.  But the bigger issue is that I can't
encapsulate these internal details into my search service -- where they
belong, IMO.
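
(The kind of query in question is an ordinary Lucene prefix query, e.g. a
hypothetical http://localhost:8983/solr/select?q=title:jun* , which the
standard handler's parser accepts but DisMax will not.)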

~ David Smiley


hossman_lucene wrote:
> 
> 
> : Is the main difference between the StandardQueryHandler and
> : DisMaxQueryHandler the supported query syntax (and different query
> : parser used in each of them), and the fact that the latter creates
> : DisjunctionMaxQueries, while the former just creates vanilla
> : BooleanQueries?  Are there any other differences?
> 
> the main difference is the query string, yes: Standard expects to get
> "lucene QueryParser" formatted queries, while DisMax expects to get raw
> user input strings ... Standard builds queries (whether they be prefix or
> boolean or wildcard) using the QueryParser as is, while DisMax does a
> "cross product" of the user input across many different fields and builds
> up a very specific query structure -- which can then be augmented with
> additional query clauses like the "bq" boost query and the "bf" boost
> function.
> 
> there's no reason the StandardRequestHandler can't construct DisMaxQueries
> (once QueryParser has some syntax for them) and DisMaxRequestHandler does
> (at the outermost level) generate a BooleanQuery (with a custom
> "minShouldMatch" value set on it) but the main difference is really the
> use case: if you want the client to specify the exact query structure that
> they want, use StandardRequestHandler.  if you want the client to just
> propagate the raw search string typed by the user, without any structure
> or escaping, and get the nice complex DisMax style query across the
> configured fields, the DisMax handler was written to fill that niche.
> 
> (load up the example configs, and take a look at the query toString from
> this url to see what i mean about the complex structure...
> 
> http://localhost:8983/solr/select/?qt=dismax&q=how+now+brown+cow&debugQuery=1
> 
> 
> 
> 
> -Hoss
> 
> 
> 

-- 
View this message in context: 
http://www.nabble.com/Standard-vs.-DisMaxQueryHandler-tp6421205p16909626.html
Sent from the Solr - User mailing list archive at Nabble.com.