Re: solr result....

2010-10-27 Thread Lance Norskog

I'm not quite sure what Tika exceptions mean in this context.
You can give the 'fl=field1,field2' option to only return some fields in 
a query.

You can get Google-like results using highlighting and 'snippetizing'.
These are documented on the wiki.
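
For example (a sketch -- the 'content' field name, port, and core path are my assumptions, not from the original mail), a query that returns only selected fields plus highlighted snippets might look like:

  http://localhost:8983/solr/select?q=tika&fl=id,title&hl=true&hl.fl=content&hl.snippets=3&hl.fragsize=100

fl restricts which stored fields come back, and the hl.* parameters return short fragments of the content field around each match instead of the whole document.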

satya swaroop wrote:

Hi ,
   Can the Solr result show only a part of the content of a
document returned in the result?

For example, if I send a query to search for "tika", the result comes back as follows:


[XML response fragment; the markup was lost in archiving. The response header shows status 0 and QTime 79; the matched document has content type text/html and its content field contains an entire Tomcat error page:]

Apache Tomcat/6.0.26 - Error report HTTP Status 500 -
org.apache.tika.exception.TikaException: Unexpected RuntimeException from
org.apache.tika.parser.pdf.pdfpar...@cc9d70

org.apache.solr.common.SolrException:
org.apache.tika.exception.TikaException: Unexpected RuntimeException from
org.apache.tika.parser.pdf.pdfpar...@cc9d70
 at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:214)
 at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
 at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
 at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:237)
 at org.apache.solr.core.SolrCore.execute(SolrCore.java:1323)
 at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:337)...

  




The result should not show the entire content of a file. It should show
only the part of the content where the query word is present, like
Google results and the search results on LucidImagination.

Regards,
satya

   


solr result....

2010-10-27 Thread satya swaroop
Hi ,
  Can the Solr result show only a part of the content of a
document returned in the result?

For example, if I send a query to search for "tika", the result comes back as follows:


[XML response fragment; the markup was lost in archiving. The response header shows status 0 and QTime 79; the matched document has content type text/html and its content field contains an entire Tomcat error page:]

Apache Tomcat/6.0.26 - Error report HTTP Status 500 -
org.apache.tika.exception.TikaException: Unexpected RuntimeException from
org.apache.tika.parser.pdf.pdfpar...@cc9d70

org.apache.solr.common.SolrException:
org.apache.tika.exception.TikaException: Unexpected RuntimeException from
org.apache.tika.parser.pdf.pdfpar...@cc9d70
 at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:214)
 at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
 at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
 at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:237)
 at org.apache.solr.core.SolrCore.execute(SolrCore.java:1323)
 at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:337)...

 
   



The result should not show the entire content of a file. It should show
only the part of the content where the query word is present, like
Google results and the search results on LucidImagination.

Regards,
satya


Re: How does DIH multithreading work?

2010-10-27 Thread markwaddle

Anyone know how it works?


Re: Stored or indexed?

2010-10-27 Thread kenf_nc

Interesting wiki link, I hadn't seen that table before.

And to answer your specific question about indexed=true, stored=false: this
is most often done when you are using analyzers/tokenizers on your field.
Such a field is for search only; you would never retrieve its contents for
display. It may in fact be an amalgam of several fields rolled into one 'content'
field. You keep your display copy stored in another field marked
indexed=false, stored=true and optionally compressed. I also have simple
string fields set to lowercase so searching is case-insensitive, and a
duplicate field where the string is normal case. The first one is
indexed/not stored, the second is stored/not indexed.
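
A minimal schema.xml sketch of that pattern (the field and type names are my own illustration, not from kenf_nc's schema):

  <!-- search-only field: analyzed and indexed, never returned for display -->
  <field name="content" type="text" indexed="true" stored="false" multiValued="true"/>
  <!-- display-only copy: returned verbatim, never searched -->
  <field name="content_display" type="string" indexed="false" stored="true" compressed="true"/>
  <!-- several source fields rolled into the search-only field -->
  <copyField source="title" dest="content"/>
  <copyField source="body" dest="content"/>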


Re: solr 1.3 suggester component

2010-10-27 Thread abhayd

hi erick,
I was able to implement this using the link you posted.
I am using Solr 1.3.

I wanted to add the spellcheck component to it, so I did this:

  <requestHandler name="standard" class="solr.SearchHandler" default="true">
    <lst name="defaults">
      <str name="echoParams">explicit</str>
    </lst>
    <arr name="last-components">
      <str>spellcheck</str>
    </arr>
  </requestHandler>

but for some reason it does not return suggestions for misspelled words. For
instance "iphole" does not get a suggestion of "iphone".
here is my query
http://localhost:10101/solr/core1/select?q=user_query:iphole&spellcheck=true&spellcheck.collate=true

At the same time, when I added another request handler:

  <requestHandler name="/spell" class="solr.SearchHandler">
    <lst name="defaults">
      <str name="spellcheck.onlyMorePopular">false</str>
      <str name="spellcheck.extendedResults">false</str>
      <str name="spellcheck.count">1</str>
    </lst>
    <arr name="last-components">
      <str>spellcheck</str>
    </arr>
  </requestHandler>

it works fine and returns suggestions.
here is my query
http://localhost:10101/solr/core1/spell?q=iphole&spellcheck=true&spellcheck.collate=true


replication not working between 1.4.1 and 3.1-dev

2010-10-27 Thread Shawn Heisey
I started to upgrade my slave servers from 1.4.1 to a 3.1-dev checked out
this morning.  Because of SOLR-2034 (new javabin version) the
replication fails.


Asking about it in comments on SOLR-2034 brought up the suggestion of 
switching to XML instead of javabin, but so far I have not been able to 
figure out how to do this.  I filed a new Jira (SOLR-2204) on the 
replication failure.


Is there any way (through either a config change or minor code changes) 
to make the replication handler use XML?  If I have to make small edits 
to the 1.4.1 source as well as 3.1, that would be OK.


Thanks,
Shawn



RE: Inconsistent slave performance after optimize

2010-10-27 Thread Jonathan Rochkind
Seriously, at least try JVM argument -XX:+UseConcMarkSweepGC .  That argument 
took care of very similar symptoms I was having.  I never did figure out 
exactly what was causing them, but at some point I tried that JVM argument, and 
they went away never to come back (which I guess is a clue about what was 
causing the slowdown, but the JVM still confuses me). 

From: Mason Hale [masonh...@gmail.com]
Sent: Wednesday, October 27, 2010 9:03 PM
To: solr-user@lucene.apache.org
Subject: Re: Inconsistent slave performance after optimize

On Wed, Oct 27, 2010 at 7:18 PM, Ken Krugler wrote:

> Normally I'd say it sounds like you were getting into swap hell, but based on your
> settings you only have 5GB of JVM space being used, on a 16GB box.
>
> Just to confirm, nothing else is using lots of memory, right? And the "top"
> command isn't showing any swap usage, right?
>
>
Correct. Only thing of note running on this machine is Solr.

I don't have a poor performing server on hand at the moment, but I recall
checking top when it was tanking, and it was not showing any swap usage.


> When you encounter very slow search times, what does the top command say
> about system load and cpu vs. I/O percentages?
>

I did look at iostat -x when the server was running slow and IO util was
100%.

This led me to believe the problem was cache-warming related, and that data
needed to be loaded into Solr caches and/or files loaded into the file
system cache.

Does that yield any additional clues?

If this does happen again, what stats should I collect?

(Note to self: need to install sar on these servers to collect historical
performance data...)

Mason


Re: documentCache clarification

2010-10-27 Thread Chris Hostetter

: schema.) My evidence for this is the documentCache stats reported by
: solr/admin. If I request "rows=10&fl=id" followed by
: "rows=10&fl=id,title" I would expect to see the 2nd request result in
: a 2nd insert to the cache, but instead I see that the 2nd request hits
: the cache from the 1st request. "rows=10&fl=*" does the same thing.

your evidence is correct, but your interpretation is incorrect.

the objects in the documentCache are Lucene Documents, which contain a 
list of Field references.  when enableLazyFieldLoading=true is set, a 
Document fetched from the IndexReader on a documentCache miss only 
contains the Fields specified in the fl, and all other Fields are marked 
as "LOAD_LAZY".

When there is a cache hit on that uniqueKey at a later date, the Fields 
already loaded are used directly if requested, but the Fields marked 
LOAD_LAZY are (you guessed it) lazy loaded from the IndexReader, and then 
the Document updates the reference to the newly actualized fields (which 
are no longer marked LOAD_LAZY).

So with different "fl" params, the same Document Object is continually 
used, but the Fields in that Document grow as the fields requested (using 
the "fl" param) change.

: will *not* result in an insert to queryResultCache. I have tried
: various increments--10, 100, 200, 500--and it seems the magic number
: is somewhere between 200 (cache insert) and 500 (no insert). Can
: someone explain this?

In addition to the <queryResultMaxDocsCached> config option already 
mentioned (which controls whether a DocList is cached based on its size), 
there is also the <queryResultWindowSize> config option, which may confuse 
your cache observations.  if the window size is "50" and you ask for 
start=0&rows=10, what actually gets cached is "0-50" (assuming there are 
more than 50 results), so a subsequent request for start=10&rows=10 will be 
a cache hit.
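
(For reference, a sketch of the solrconfig.xml options involved here -- the values are illustrative, not recommendations; all three live in the <query> section:

  <enableLazyFieldLoading>true</enableLazyFieldLoading>
  <queryResultWindowSize>50</queryResultWindowSize>
  <queryResultMaxDocsCached>200</queryResultMaxDocsCached>
)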

-Hoss


Re: ClassCastException Issue

2010-10-27 Thread Alex Matviychuk
I found it! Ran this against the webapps folder:

find . -name '*.jar' | sed 's/^.*\/\(.*\)$/\1/' | sort

...
lucene-analyzers-2.9.3.jar
lucene-core-2.9.1.jar
lucene-highlighter-2.9.3.jar
lucene-memory-2.9.3.jar
lucene-misc-2.9.3.jar
lucene-queries-2.9.1.jar
lucene-queries-2.9.3.jar
lucene-snowball-2.9.3.jar
lucene-spellchecker-2.9.3.jar
...

Hope that helps someone in a similar situation.

Thanks again for the help!

Cheers,
Alex Matviychuk

On Wed, Oct 27, 2010 at 18:01, Alex Matviychuk  wrote:
> On Wed, Oct 27, 2010 at 03:57, Chris Hostetter  
> wrote:
>> This almost certainly indicates a classloader issue - i suspect you have
>> multiple solr related jars in various places, and the FieldType class
>> instance found when StrField is loaded comes from a different
>> (incompatible) jar.
>
> Thanks for the feedback. I am pretty sure you are right about this
> being a class loader issue. This is a pretty big project and it isn't
> immediately easy to figure out what is causing the conflict. I am
> using EmbeddedSolr from the 1.4.1 release. These show up in the
> project's lib folder when deployed to tomcat:
>
> lucene-core-2.9.3.jar
> lucene-highlighter-2.9.3.jar
> lucene-memory-2.9.3.jar
> lucene-misc-2.9.3.jar
> lucene-queries-2.9.3.jar
> lucene-snowball-2.9.3.jar
> lucene-spellchecker-2.9.3.jar
>
> According to the solr homepage, this is what solr 1.4.1 is built against.
>
> I looked at the FieldType class and it looks like it only relies on
> solr stuff and lucene.
>
> I don't have much experience with classloader issues, any tips on how
> to debug this?
>
>
> And Ken:
> I tried renaming the field as you suggested, but I get the same issue.
>
> Thanks,
> Alex Matviychuk
>


Use SolrCloud (SOLR-1873) on trunk, or with 1.4.1?

2010-10-27 Thread Jeremy Hinegardner
Hi all,

I see that as of r1022188 Solr Cloud has been committed to trunk.

I was wondering about the stability of Solr Cloud on trunk.  We are
planning to do a major reindexing soon (within 30 days), several billion docs,
and would like to switch to a Solr Cloud based infrastructure. 

We are wondering whether we should use trunk as it is, now that SOLR-1873 has
been applied, or whether we should take the SOLR-1873 patch and apply it to Solr 1.4.1.

Has anyone used 1.4.1 + SOLR-1873?  In production?

Thanks,

-jeremy

-- 

 Jeremy Hinegardner  jer...@hinegardner.org 



Re: Inconsistent slave performance after optimize

2010-10-27 Thread Mason Hale
On Wed, Oct 27, 2010 at 7:18 PM, Ken Krugler wrote:

> Normally I'd say it sounds like you were getting into swap hell, but based on your
> settings you only have 5GB of JVM space being used, on a 16GB box.
>
> Just to confirm, nothing else is using lots of memory, right? And the "top"
> command isn't showing any swap usage, right?
>
>
Correct. Only thing of note running on this machine is Solr.

I don't have a poor performing server on hand at the moment, but I recall
checking top when it was tanking, and it was not showing any swap usage.


> When you encounter very slow search times, what does the top command say
> about system load and cpu vs. I/O percentages?
>

I did look at iostat -x when the server was running slow and IO util was
100%.

This led me to believe the problem was cache-warming related, and that data
needed to be loaded into Solr caches and/or files loaded into the file
system cache.

Does that yield any additional clues?

If this does happen again, what stats should I collect?

(Note to self: need to install sar on these servers to collect historical
performance data...)

Mason


Re: Inconsistent slave performance after optimize

2010-10-27 Thread Ken Krugler
Normally I'd say it sounds like you were getting into swap hell, but based on
your settings you only have 5GB of JVM space being used, on a 16GB box.

Just to confirm, nothing else is using lots of memory, right? And the
"top" command isn't showing any swap usage, right?

When you encounter very slow search times, what does the top command
say about system load and cpu vs. I/O percentages?


-- Ken

On Oct 27, 2010, at 3:33pm, Mason Hale wrote:


Hi Lance --

Thanks for the reply.

Did you restart all of these slave servers? That would help.

We discovered independently that restarting the slave nodes resulted in
dramatically improved performance (e.g. from 2.0 sec average response to
0.25 sec average).

Can you please explain why this is the case?

I would expect a process restart to invalidate caches and thus trigger
additional cache-warming overhead, slowing things down, not speeding things
up.

What garbage collection options do you use?

We've not tweaked the garbage collection settings. We're using -Xms512M
-Xmx5000M on the command line.

Which release of Solr?

version 1.4.0

How many Searchers are there in admin/stats.jsp?

I'm looking much later, and after a restart -- but I currently see 2
searchers listed.

I admit I'm not sure what I'm looking for on this page.

thanks,
Mason




On Wed, Oct 27, 2010 at 2:25 AM, Lance Norskog wrote:

Did you restart all of these slave servers? That would help.
What garbage collection options do you use?
Which release of Solr?
How many Searchers are there in admin/stats.jsp?
Searchers hold open all kinds of memory. They are supposed to cycle out.

These are standard questions, but- what you are seeing is definitely not
normal.

Separately, if you want a regular optimization regime, there is a new
option called 'maxSegments' to the optimize command. If you have solrconfig
mergeFactor set to 10, then optimize 'maxSegments=8' will roll up the very
smallest segments. This allows you to have a gradual optimization (and
replication overhead) instead of big ones.


Mason Hale wrote:

Hello esteemed Solr community --

I'm observing some inconsistent performance on our slave servers after
recently optimizing our master server.

Our configuration is as follows:

   - all servers are hosted at Amazon EC2, running Ubuntu 8.04
   - 1 master with heavy insert/update traffic, about 125K new documents
     per day (m1.large, ~8GB RAM)
     - autocommit every 1 minute
   - 3 slaves (m2.xlarge instance sizes, ~16GB RAM)
     - replicate every 5 minutes
     - we have configured autowarming queries for these machines
     - autowarmCount = 0
   - Total index size is ~7M documents

We were seeing increasing, but gradual performance degradation across all
nodes. So we decided to try optimizing our index to improve performance.

In preparation for the optimize we disabled replication polling on all
slaves. We also turned off all workers that were writing to the index.
Then we ran optimize on the master.

The optimize took 45-60 minutes to complete, and the total size went from
68GB down to 23GB.

We then enabled replication on each slave one at a time.

The first slave we re-enabled took about 15 minutes to copy the new files.
Once the files were copied, the performance of the slave plummeted. Average
response time went from 0.75 sec to 45 seconds. Over the past 18 hours the
average response time has gradually gone down to around 1.2 seconds now.

Before re-enabling replication on the second slave, we first removed it
from our load-balanced pool of available search servers. This server's
average query performance also degraded quickly, and then (unlike the
first slave we replicated) did not improve. It stayed at around 30 secs
per query. On the theory that this is a cache-warming issue, we added this
server back to the pool in hopes that additional traffic would warm the
cache. But what we saw was a quick spike of much worse performance (50 sec
/ query on average) followed by a slow/gradual decline in average response
times. As of now (10 hours after the initial replication) this server is
still reporting an average response time of ~2 seconds. This is much worse
than before the optimize and is a counter-intuitive result. We expected an
index 1/3 the size would be faster, not slower.

On the theory that the index files needed to be loaded into the file
system cache, I used the 'dd' command to copy the contents of the
data/index directory to /dev/null, but that did not result in any
noticeable performance improvement.

At this point, things were not going as expected. We did not expect the
replication after an optimize to result in such horrid performance. So we
decided to let the last slave continue to serve stale results while we
waited 4 hours for the other two slaves to approach some acceptable
performance level.

After the 4 hour break, we re-moved the 3rd and last slave server from our
load-balancing [...]

Searching for terms on specific fields

2010-10-27 Thread Imran
Hi All

We need to be able to perform a search based on two search terms (from the
user) against specific fields and a location. For example assume our index
(for a collection of books) has fields such as title, description, authors
(multi-valued), categories (multi-valued), location (of course lng and lats).
Every field is indexed.

I want to give the user the option of having two search options (one box for
title, and one for category - to find a more relevant result) along with a
location option. The user would then want to search for a book with
a certain title belonging to a set of categories in a given location (all of
these should be ANDed). I want to show results that would ONLY match the
terms for the corresponding fields.

I want to be able to use the dismax parser to search on both terms
(assigning slops and tie breaks). I take it the 'fq' is a candidate for
this, but can I add dismax capabilities to fq as well? Also my query would be
a spatial search as well. So the spatial tier would be included in the
query. What would be the best way to implement my query to match terms to
specific fields along with spatial capabilities? Appreciate your inputs.
Thanks!!

Cheers
-- Imran


RE: Inconsistent slave performance after optimize

2010-10-27 Thread Jonathan Rochkind
I'm guessing the slaves you restarted were running low on RAM, and possibly 
engaged in out of control GC. 

I have had good luck using the JVM option "-XX:+UseConcMarkSweepGC ", which 
seems to result in GC happening in another thread and not interfering with the 
servicing of requests. 

If that's what's going on, it may also indicate that you should give your Solr 
more RAM for your current cache settings. On the other hand, apparently 
sometimes out-of-control GC can also indicate that your JVM has too MUCH RAM. 
I don't know the answers; there doesn't seem to be any simple way to figure out 
how much RAM your Solr needs, or whether it's having problems as a result of not 
enough (or too much!). I haven't really figured out how to do it. 

For a server process like Solr, I also think it makes sense to set -Xms to the 
same value as -Xmx. I don't think there's much to be gained by leaving a 
window, and by setting them the same the JVM won't spend any time increasing 
its heap size.
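
(As a concrete sketch -- the heap size here is illustrative, not a recommendation -- the flags on a stock Jetty-based Solr start would look like:

  java -Xms5000M -Xmx5000M -XX:+UseConcMarkSweepGC -jar start.jar

-XX:+UseConcMarkSweepGC selects the concurrent mark-sweep collector, which does most of its collection work concurrently with the application instead of in long stop-the-world pauses.)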

From: Mason Hale [masonh...@gmail.com]
Sent: Wednesday, October 27, 2010 6:33 PM
To: solr-user@lucene.apache.org
Subject: Re: Inconsistent slave performance after optimize

Hi Lance --

Thanks for the reply.

> Did you restart all of these slave servers? That would help.

We discovered independently that restarting the slave nodes resulted in
dramatically improved performance (e.g. from 2.0 sec average response to
0.25 sec average).

Can you please explain why this is the case?

I would expect a process restart to invalidate caches and thus trigger
additional cache-warming overhead, slowing things down, not speeding things
up.

> What garbage collection options do you use?

We've not tweaked the garbage collection settings. We're using -Xms512M
-Xmx5000M on the command line.

> Which release of Solr?

version 1.4.0

> How many Searchers are there in admin/stats.jsp?

I'm looking much later, and after a restart -- but I currently see 2
searchers listed.

I admit I'm not sure what I'm looking for on this page.


thanks,
Mason




On Wed, Oct 27, 2010 at 2:25 AM, Lance Norskog  wrote:

> Did you restart all of these slave servers? That would help.
> What garbage collection options do you use?
> Which release of Solr?
> How many Searchers are there in admin/stats.jsp?
> Searchers hold open all kinds of memory. They are supposed to cycle out.
>
> These are standard questions, but- what you are seeing is definitely not
> normal.
>
> Separately, if you want a regular optimization regime, there is a new
> option called 'maxSegments' to the optimize command. If you have solrconfig
> mergeFactor set to 10, then optimize 'maxSegments=8' will roll up the very
> smallest segments. This allows you to have a gradual optimization (and
> replication overhead) instead of big ones.
>
>
> Mason Hale wrote:
>
>> Hello esteemed Solr community --
>>
>> I'm observing some inconsistent performance on our slave servers after
>> recently optimizing our master server.
>>
>> Our configuration is as follows:
>>
>> - all servers are hosted at Amazon EC2, running Ubuntu 8.04
>> - 1 master with heavy insert/update traffic, about 125K new documents
>> per day (m1.large, ~8GB RAM)
>>- autocommit every 1 minute
>> - 3 slaves (m2.xlarge instance sizes, ~16GB RAM)
>>- replicate every 5 minutes
>>- we have configured autowarming queries for these machines
>>- autowarmCount = 0
>> - Total index size is ~7M documents
>>
>> We were seeing increasing, but gradual performance degradation across all
>> nodes.
>> So we decided to try optimizing our index to improve performance.
>>
>> In preparation for the optimize we disabled replication polling on all
>> slaves. We also turned off all
>> workers that were writing to the index. Then we ran optimize on the
>> master.
>>
>> The optimize took 45-60 minutes to complete, and the total size went from
>> 68GB down to 23GB.
>>
>> We then enabled replication on each slave one at a time.
>>
>> The first slave we re-enabled took about 15 minutes to copy the new files.
>> Once the files were copied
>> the performance of slave plummeted. Average response time went from 0.75
>> sec
>> to 45 seconds.
>> Over the past 18 hours the average response time has gradually gone down
>> to around 1.2 seconds now.
>>
>> Before re-enabling replication on the second slave, we first removed it from
>> our load-balanced pool of available search servers.
>> This server's average query performance also degraded quickly, and then
>> (unlike the first slave we replicated) did not improve.
>> It stayed at around 30 secs per query. On the theory that this is a
>> cache-warming issue, we added this server
>> back to the pool in hopes that additional traffic would warm the cache.
>> But
>> what we saw was a quick spike of much worse
>> performance (50 sec / query on average) followed by a slow/gradual decline
>> in average response times.
>> As of now (10 hours after the ini

Re: documentCache clarification

2010-10-27 Thread Koji Sekiguchi

(10/10/28 6:32), Jonathan Rochkind wrote:

Woah, I hadn't known about that. queryResultMaxDocsCached is actually a part of 
Solr 1.4? Is it
documented anywhere at all? I guess it is included in the example 
solrconfig.xml, but is not in my
own personal solrconfig.xml.


The feature has been available since 1.3. Please see:
https://issues.apache.org/jira/browse/SOLR-291


Anyone know if it has a default size if left unspecified?


I believe the default is no limit.

Koji

--
http://www.rondhuit.com/en/


Re: Inconsistent slave performance after optimize

2010-10-27 Thread Mason Hale
Hi Lance --

Thanks for the reply.

> Did you restart all of these slave servers? That would help.

We discovered independently that restarting the slave nodes resulted in
dramatically improved performance (e.g. from 2.0 sec average response to
0.25 sec average).

Can you please explain why this is the case?

I would expect a process restart to invalidate caches and thus trigger
additional cache-warming overhead, slowing things down, not speeding things
up.

> What garbage collection options do you use?

We've not tweaked the garbage collection settings. We're using -Xms512M
-Xmx5000M on the command line.

> Which release of Solr?

version 1.4.0

> How many Searchers are there in admin/stats.jsp?

I'm looking much later, and after a restart -- but I currently see 2
searchers listed.

I admit I'm not sure what I'm looking for on this page.


thanks,
Mason




On Wed, Oct 27, 2010 at 2:25 AM, Lance Norskog  wrote:

> Did you restart all of these slave servers? That would help.
> What garbage collection options do you use?
> Which release of Solr?
> How many Searchers are there in admin/stats.jsp?
> Searchers hold open all kinds of memory. They are supposed to cycle out.
>
> These are standard questions, but- what you are seeing is definitely not
> normal.
>
> Separately, if you want a regular optimization regime, there is a new
> option called 'maxSegments' to the optimize command. If you have solrconfig
> mergeFactor set to 10, then optimize 'maxSegments=8' will roll up the very
> smallest segments. This allows you to have a gradual optimization (and
> replication overhead) instead of big ones.
>
>
> Mason Hale wrote:
>
>> Hello esteemed Solr community --
>>
>> I'm observing some inconsistent performance on our slave servers after
>> recently optimizing our master server.
>>
>> Our configuration is as follows:
>>
>> - all servers are hosted at Amazon EC2, running Ubuntu 8.04
>> - 1 master with heavy insert/update traffic, about 125K new documents
>> per day (m1.large, ~8GB RAM)
>>- autocommit every 1 minute
>> - 3 slaves (m2.xlarge instance sizes, ~16GB RAM)
>>- replicate every 5 minutes
>>- we have configured autowarming queries for these machines
>>- autowarmCount = 0
>> - Total index size is ~7M documents
>>
>> We were seeing increasing, but gradual performance degradation across all
>> nodes.
>> So we decided to try optimizing our index to improve performance.
>>
>> In preparation for the optimize we disabled replication polling on all
>> slaves. We also turned off all
>> workers that were writing to the index. Then we ran optimize on the
>> master.
>>
>> The optimize took 45-60 minutes to complete, and the total size went from
>> 68GB down to 23GB.
>>
>> We then enabled replication on each slave one at a time.
>>
>> The first slave we re-enabled took about 15 minutes to copy the new files.
>> Once the files were copied
>> the performance of slave plummeted. Average response time went from 0.75
>> sec
>> to 45 seconds.
>> Over the past 18 hours the average response time has gradually gone down
>> to around 1.2 seconds now.
>>
>> Before re-enabling replication on the second slave, we first removed it from
>> our load-balanced pool of available search servers.
>> This server's average query performance also degraded quickly, and then
>> (unlike the first slave we replicated) did not improve.
>> It stayed at around 30 secs per query. On the theory that this is a
>> cache-warming issue, we added this server
>> back to the pool in hopes that additional traffic would warm the cache.
>> But
>> what we saw was a quick spike of much worse
>> performance (50 sec / query on average) followed by a slow/gradual decline
>> in average response times.
>> As of now (10 hours after the initial replication) this server is still
>> reporting an average response time of ~2 seconds.
>> This is much worse than before the optimize and is a counter-intuitive
>> result. We expected an index 1/3 the size would be faster, not slower.
>>
>> On the theory that the index files needed to be loaded into the file
>> system
>> cache, I used the 'dd' command to copy
>> the contents of the data/index directory to /dev/null, but that did not
>> result in any noticeable performance improvement.
>>
>> At this point, things were not going as expected. We did not expect the
>> replication after an optimize to result in such horrid
>> performance. So we decided to let the last slave continue to serve stale
>> results while we waited 4 hours for the
>> other two slaves to approach some acceptable performance level.
>>
>> After the 4 hour break, we re-moved the 3rd and last slave server from our
>> load-balancing pool, then re-enabled replication.
>> This time we saw a tiny blip. The average performance went up to 1 second
>> briefly then went back to the (normal for us)
>> 0.25 to 0.5 second range. We then added this server back to the
>> load-balancing pool and observed no degradation in performance.

If I want to move a core from one physical machine to another....

2010-10-27 Thread Ron Mayer
If I want to move a core from one physical machine to another,
is it as simple as just
   scp -r core5 otherserver:/path/on/other/server/
and then adding a <core> entry for it in
that other server's solr.xml file and restarting the server there?
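
(A minimal sketch of the solr.xml entry I have in mind -- the name and path here are just my example:

   <cores adminPath="/admin/cores">
     ...
     <core name="core5" instanceDir="/path/on/other/server/core5"/>
   </cores>
)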



PS: Should I have been able to figure out the answer
to that by RTFM somewhere?


RE: how well does multicore scale?

2010-10-27 Thread Toke Eskildsen
mike anderson [saidthero...@gmail.com] wrote:
> That's a great point. If SSDs are sufficient, then what does the "Index size
> vs Response time" curve look like? Since that would dictate the number
> of machines needed. I took a look at 
> http://wiki.apache.org/solr/SolrPerformanceData but only one use case
> seemed comparable.

I generally find it very hard to compare across setups. Looking at 
SolrPerformanceData, for example, we see that CNET Shopper has a very poor 
response-time/size ratio, while HathiTrust's is a lot better. This is not too 
surprising, as CNET seems to use quite advanced searching where HathiTrust's is 
simpler, but it does illustrate that comparisons are not easy.

However, as long as I/O has been identified as the main bottleneck for a given 
setup, relative gains from different storage back ends should be fairly 
comparable across setups. We did some work on storage testing with Lucene two 
years ago (see the I-wish-I-had-the-time-to-update-this page at 
http://wiki.statsbiblioteket.dk/summa/Hardware), but unfortunately we did very 
little testing on scaling over index size.

...

I just dug out some old measurements that say a little bit: We tried 
changing the size of our index (by deleting every Xth document and optimizing) 
and performing 350K queries with extraction of 2 or 3 fairly small fields for 
the first 20 hits from each. The machine was capped at 4GB of RAM. I am fairly 
certain the searcher was single threaded and there were no web services 
involved, so this is very raw Lucene speed:
4GB index: 626 queries/second
9GB index: 405 queries/second
17GB index: 205 queries/second
26GB index: 188 queries/second
Not a lot of measurement points and I wish I had data for larger index sizes, 
as it seems that the curve is flattening quite drastically at the end. Graph at
http://www.mathcracker.com/scatterplotimage.php?datax=4,9,17,26&datay=626,405,205,188&namex=Index%20size%20in%20GB&namey=queries/second&titl=SSD%20scaling%20performance%20with%20Lucene

> We currently have about 25M docs, split into 18 shards, with a
> total index size of about 120GB. If index size has truly little
> impact on performance then perhaps tagging articles with user
> IDs is a better way to approach my use case.

I don't know your budget, but do consider buying a single 160GB Intel X25-M or 
one of the new 256GB SandForce-based SSDs for testing. If it does not deliver 
what you hoped for, you'll be happy to put it in your workstation.

It would be nice if there were some sort of corpus generator that generated 
Zipfian-distributed data and sample queries so that we could do large scale 
testing on different hardware without having to share sample data.

Regards,
Toke Eskildsen

Re: documentCache clarification

2010-10-27 Thread Jonathan Rochkind
Woah, I hadn't known about that. queryResultMaxDocsCached is actually a 
part of Solr 1.4?   Is it documented anywhere at all?  I guess it is 
included in the example solrconfig.xml, but is not in my own personal 
solrconfig.xml.


Anyone know if it has a default size if left unspecified?

Shawn Heisey wrote:

On 10/27/2010 12:17 PM, Jay Luker wrote:
  

A 2nd question: while watching these stats I noticed something else
weird with the queryResultCache. It seems that inserts to the
queryResultCache depend on the number of rows requested. For example,
an initial request (solr restarted, clean cache, etc) with rows=10
will result in an insert. A 2nd request of the same query with
rows=1000 will result in a cache hit. However if you reverse that
order, starting with a clean cache, an initial request for rows=1000
will *not* result in an insert to queryResultCache. I have tried
various increments--10, 100, 200, 500--and it seems the magic number
is somewhere between 200 (cache insert) and 500 (no insert). Can
someone explain this?



Perhaps it's this setting in the <query> section of solrconfig.xml?

<queryResultMaxDocsCached>200</queryResultMaxDocsCached>

See SOLR-291.

Shawn

  


Re: Stored or indexed?

2010-10-27 Thread Markus Jelsma
http://wiki.apache.org/solr/FieldOptionsByUseCase

> Hi all-
> 
> I've read through the documentation, but I'm still a little confused about
> the <field> tag, in terms of the indexed and stored attributes. If I have
> something marked as indexed="true", why would I ever want stored="false"?
> Are there any good tips-n-tricks anywhere about how to properly set the
> field tag? I've been finding bits and pieces both on the wiki and a couple
> of other websites, but there doesn't seem to be a good definitive how-to
> on this.
> 
> Thanks for any info,
> 
> Ron
> 


Re: documentCache clarification

2010-10-27 Thread Shawn Heisey

On 10/27/2010 12:17 PM, Jay Luker wrote:

A 2nd question: while watching these stats I noticed something else
weird with the queryResultCache. It seems that inserts to the
queryResultCache depend on the number of rows requested. For example,
an initial request (solr restarted, clean cache, etc) with rows=10
will result in an insert. A 2nd request of the same query with
rows=1000 will result in a cache hit. However if you reverse that
order, starting with a clean cache, an initial request for rows=1000
will *not* result in an insert to queryResultCache. I have tried
various increments--10, 100, 200, 500--and it seems the magic number
is somewhere between 200 (cache insert) and 500 (no insert). Can
someone explain this?


Perhaps it's this setting in the <query> section of solrconfig.xml?

<queryResultMaxDocsCached>200</queryResultMaxDocsCached>

See SOLR-291.

Shawn



Stored or indexed?

2010-10-27 Thread Olson, Ron
Hi all-

I've read through the documentation, but I'm still a little confused about the 
<field> tag, in terms of the indexed and stored attributes. If I have 
something marked as indexed="true", why would I ever want stored="false"? Are 
there any good tips-n-tricks anywhere about how to properly set the field tag? 
I've been finding bits and pieces both on the wiki and a couple of other 
websites, but there doesn't seem to be a good definitive how-to on this.

Thanks for any info,

Ron



Searching with wrong keyboard layout or using translit

2010-10-27 Thread Pavel Minchenkov
Hi,

When I'm trying to search Google with wrong keyboard layout -- it corrects
my query, example: http://www.google.ru/search?q=vjcrdf (I typed word
"Moscow" in Russian but in English keyboard layout).
Also, when I'm searching using
translit, It does the same: http://www.google.ru/search?q=moskva

What is the right way to implement this feature in Solr?

-- 
Pavel Minchenkov


Michigan Information Retrieval Enthusiasts Group Quarterly Meetup - November 13, 2010

2010-10-27 Thread Provalov, Ivan
Cengage Learning is organizing a second quarterly meetup in Michigan 
(web-conference and dial-in are available) for the IR Enthusiasts.  Please RSVP 
at http://www.meetup.com/Michigan-Information-Retrieval-Enthusiasts-Group

Presentations:

1. Search Assist Dictionary Based on Corpus Terms Collocation by Drew 
Kozsewnik, Cengage Learning
The presentation will be a brief overview of what search assist is followed by 
a technical discussion about the algorithms that make it work.  The technical 
part will detail phrase extraction (building the dictionaries), runtime 
(indexing and retrieving relevant phrases based on a partial query), and phrase 
correlation (The "Kevin Bacon" feature which returns phrases that often occur 
nearby recommended phrases).
2. Carrot2 Clustering Engine by Stanislaw Osinski and Dawid Weiss, Carrot 
Search (Poland)
Overview of Carrot2 - an Open Source Search Results Clustering Engine.  It can 
automatically organize small collections of documents, e.g. search results, 
into thematic categories.

Schedule:

1. 9-10am breakfast (uRefer, Ann Arbor)
2. 10-11am Presentations (uRefer, Ann Arbor, Web-conferencing)
3. 11am-12pm discussion

Remote Dial-in:
866-394-9514
6087393

Remote Web-Conferencing:
http://www.conferenceplus.com/
866-394-9514
6087393

Location:
uRefer, 924 N. Main St., Suite 3, Ann Arbor, MI

Thank you,

Ivan Provalov
Information Architect
Cengage Learning



Re: Solr sorting problem

2010-10-27 Thread Ron Mayer
Savvas-Andreas Moysidis wrote:
> In my understanding sorting on a field for which analysis has yielded
> multiple terms just doesn't make sense..
> If you have document#1 with a field A which has the terms Epsilon, Alpha,
> and document#2 with field A which has the terms Beta, Delta and request
> an ascending sort on field A what order should you get and why?

In the couple use cases I've been asked for it, either...
(a) returning each document only the first time it appeared
 document 1 [for alpha] followed by document 2[beta]
(b) or returning them with duplicates
 doc1 [alpha], doc2[beta], doc2[beta] doc1[epsilon]
... would have been an OK user experience.

The use case
   "show me documents relevant to things close to a location"
seems like a pretty broad use-case that any geospatial-aware
search engine would like to handle; and I imagine in many cases
a single document might refer to multiple addresses/locations.

In another case, I was asked if the application could "sort the
incidents by the age of rape victims".  And while most incidents
involved a single victim, some had 2 or more.  The idea wasn't
to impose some total ordering but rather to make it quick to find
documents involving younger people.  I realize I can work
around that one by adding a "min-age" column.

For the spatial one, where different users might pick different
center points I can't think of any good workaround beyond Jonathan's
idea of facets -- perhaps overlaying some map grid on the data
and using facets for that.




> On 27 October 2010 17:56, Jonathan Rochkind  wrote:
> 
>> I would suggest that trying to sort on a multi-token/multi-value value in
>> the first place ought to always raise an exception. Are there any reasons
>> why you'd EVER want to do this, with the way it currently works?  Letting
>> people do this and only _sometimes_ raise an exception, but never do
>> anything that's actually reasonable, just adds confusion for newbies.
>>
>> Alternately, perhaps sorting on a multi-valued or tokenized field ought to
>> sort only on the FIRST token found in the first value, but not sure how
>> feasible that is to code.
>>
>> Ron, for your particular use case -- lucene sorting just can't really do
>> that, I'm not sure there's a WAY to code sorting that works on multi-valued
>> fields.  A given lucene/solr search results set only includes each document
>> ONCE.  So where would that document appear in your sort on a multi-valued
>> field?   A different solution is required.  I too sometimes have similar use
>> cases, and my best ideas about how to solve them involve using faceting ---
>> you can facet on a multi-valued field, and you can sort facets--but you can
>> only sort facets by "index order", a strict byte-by-byte sort.  Which
>> doesn't always work for me either.  I haven't quite figured out the solution
>> to this sort of problem.
>>
>>
>> Ron Mayer wrote:
>>
>>> Lance Norskog wrote:
>>>
>>>
 You may not sort on a tokenized field. You may not sort on a multiValued
 field. You can only have one term in a field.

 If there are more search terms than documents, A) sorting doesn't mean
 anything and B) Lucene will throw an exception.


>>>
>>> Is that considered a feature, or an annoyance/bug?
>>>
>>> One of the things I'm using Solr for is to store a whole bunch of
>>> documents about crime events that contain information roughly
>>> like this:
>>>
>>> "the gang member ran the red light at 100 main st, and
>>>  continued driving to 500 main street where he hit a car. He
>>>  fled his car and ran to 789 second avenue where he hijacked
>>>  another car and drove to his house at 654 someother st"
>>>
>>> If I do a search for the name of that gang member's gang,
>>> I'd really really like to be able to sort my documents by
>>> distance from a location -- for example to quickly find
>>> any documents referring to gang activity in a neighborhood.
>>>
>>> And I'd really like to see this document near the top
>>> of my search results whether the user chose 100 main, 500 main,
>>> 790 second, or 650 someother street  as his center point for
>>> sorting his search.
>>>
>>>
>>> If I wanted that so badly I'd be willing to try coding it
>>> so you _could_ sort on multiValued fields, would people want
>>> that feature?   If so - would someone know off the top of
>>> their head where I should get started looking in the code?
>>>
>>> Or is it considered a feature that solr currently disallows it?
>>>
>>>
>>>
>>>
> 



Re: how well does multicore scale?

2010-10-27 Thread mike anderson
That's a great point. If SSDs are sufficient, then what does the "Index size
vs Response time" curve look like? Since that would dictate the number of
machines needed. I took a look at
http://wiki.apache.org/solr/SolrPerformanceData but only one use case seemed
comparable. We currently have about 25M docs, split into 18 shards, with a
total index size of about 120GB. If index size has truly little impact on
performance then perhaps tagging articles with user IDs is a better way to
approach my use case.

-Mike



On Wed, Oct 27, 2010 at 9:45 AM, Toke Eskildsen wrote:

> On Wed, 2010-10-27 at 14:20 +0200, mike anderson wrote:
> > [...] By my simple math, this would mean that if we want each shard's
> > index to be able to fit in memory, [...]
>
> Might I ask why you're planning on using memory-based sharding? The
> performance gap between memory and SSDs is not very big so using memory
> to get those last queries/second is quite expensive.
>
>


Re: documentCache clarification

2010-10-27 Thread Jay Luker
(btw, I'm running 1.4.1)

It looks like my assumption was wrong. Regardless of the fields
selected using the "fl" parameter and the enableLazyFieldLoading
setting, solr apparently fetches from disk and caches all the fields
in the document (or maybe just those that are stored="true" in my
schema.) My evidence for this is the documentCache stats reported by
solr/admin. If I request "rows=10&fl=id" followed by
"rows=10&fl=id,title" I would expect to see the 2nd request result in
a 2nd insert to the cache, but instead I see that the 2nd request hits
the cache from the 1st request. "rows=10&fl=*" does the same thing.
i.e., the first request, even though I have
enableLazyFieldLoading=true and I'm only asking for the ids, fetches
the entire document from disk and inserts into the documentCache.
Subsequent requests, regardless of which fields I actually select,
don't hit the disk but are loaded from the documentCache. Is this
really the expected behavior and/or am I misunderstanding something?

A 2nd question: while watching these stats I noticed something else
weird with the queryResultCache. It seems that inserts to the
queryResultCache depend on the number of rows requested. For example,
an initial request (solr restarted, clean cache, etc) with rows=10
will result in an insert. A 2nd request of the same query with
rows=1000 will result in a cache hit. However if you reverse that
order, starting with a clean cache, an initial request for rows=1000
will *not* result in an insert to queryResultCache. I have tried
various increments--10, 100, 200, 500--and it seems the magic number
is somewhere between 200 (cache insert) and 500 (no insert). Can
someone explain this?

Thanks,
--jay

On Wed, Oct 27, 2010 at 10:54 AM, Markus Jelsma
 wrote:
> I've been wondering about this too some time ago. I've found more 
> information
> in SOLR-52 and some correspondence on this one but it didn't give me a
> definitive answer..
>
> [1]: https://issues.apache.org/jira/browse/SOLR-52
> [2]: http://www.mail-archive.com/solr-...@lucene.apache.org/msg01185.html
>
> On Wednesday 27 October 2010 16:39:44 Jay Luker wrote:
>> Hi all,
>>
>> The solr wiki says this about the documentCache: "The more fields you
>> store in your documents, the higher the memory usage of this cache
>> will be."
>>
>> OK, but if i have enableLazyFieldLoading set to true and in my request
>> parameters specify "fl=id", then the number of fields per document
>> shouldn't affect the memory usage of the document cache, right?
>>
>> Thanks,
>> --jay
>
> --
> Markus Jelsma - CTO - Openindex
> http://www.linkedin.com/in/markus17
> 050-8536600 / 06-50258350
>


Re: newSearcher vs. firstSearcher

2010-10-27 Thread Chris Hostetter

: But thinking about warming queries, which is my use of new/firstSearcher (and
: probably the most common use?), I can't think of any case but ones where I'd
: want newSearcher and firstSearcher warming queries to be identical.

a firstSearcher event is one in which there is no previous searcher, so 
the *only* warming is the warming you do explicitly.

a newSearcher event happens when there is an "old" searcher, and 
"autowarming" will be happening for the caches that support it -- so you 
may not need as much explicit warming.

It's common to have a lot of warming configured to explicitly happen on 
firstSearcher, but then have much less warming configured to explicitly 
happen on newSearcher because of the implicit warming.
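
(For reference, a sketch of what such listeners look like in solrconfig.xml -- the queries themselves are placeholders:

  <listener event="firstSearcher" class="solr.QuerySenderListener">
    <arr name="queries">
      <lst><str name="q">some common query</str><str name="sort">price asc</str></lst>
      <lst><str name="q">another common query</str></lst>
    </arr>
  </listener>
  <listener event="newSearcher" class="solr.QuerySenderListener">
    <arr name="queries">
      <lst><str name="q">one quick warming query</str></lst>
    </arr>
  </listener>
)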

-Hoss


RE: Solr sorting problem

2010-10-27 Thread Toke Eskildsen
Jonathan Rochkind [rochk...@jhu.edu] wrote:
> I too sometimes have similar use cases, and my best ideas about how to
> solve them involve using faceting --- you can facet on a multi-valued
> field, and you can sort facets--but you can only sort facets by "index
> order", a strict byte-by-byte sort.  Which doesn't always work for me
> either.  I haven't quite figured out the solution to this sort of problem.

Index ICU CollatorKeys for the terms in a dedicated facet field; then index 
order will be equal to collator order. That is only half the solution though, 
as the documents still need to be extracted based on the top X tags in the 
facet. It could be done from outside Solr by iterating the tags and performing 
searches until the wanted number of documents has been resolved, but that seems 
inefficient. A faster way would be to step through the terms for the tags and 
request the docIDs, but that must AFAIK happen inside Solr.

- Toke

Re: documentCache clarification

2010-10-27 Thread Markus Jelsma
I've been wondering about this too some time ago. I've found more information 
in SOLR-52 and some correspondence on this one but it didn't give me a 
definitive answer..

[1]: https://issues.apache.org/jira/browse/SOLR-52
[2]: http://www.mail-archive.com/solr-...@lucene.apache.org/msg01185.html

On Wednesday 27 October 2010 16:39:44 Jay Luker wrote:
> Hi all,
> 
> The solr wiki says this about the documentCache: "The more fields you
> store in your documents, the higher the memory usage of this cache
> will be."
> 
> OK, but if i have enableLazyFieldLoading set to true and in my request
> parameters specify "fl=id", then the number of fields per document
> shouldn't affect the memory usage of the document cache, right?
> 
> Thanks,
> --jay

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536600 / 06-50258350


documentCache clarification

2010-10-27 Thread Jay Luker
Hi all,

The solr wiki says this about the documentCache: "The more fields you
store in your documents, the higher the memory usage of this cache
will be."

OK, but if i have enableLazyFieldLoading set to true and in my request
parameters specify "fl=id", then the number of fields per document
shouldn't affect the memory usage of the document cache, right?

Thanks,
--jay


Re: Implementing Search Suggestion on Solr

2010-10-27 Thread Israel Ekpo
I think you may want to configure the field type used for the spell check to
use the synonyms file/database.

That way synonyms are also processed during index time.

This could help.
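
(A sketch of what that might look like in schema.xml -- my illustration of the suggestion, not Israel's exact config:

  <fieldType name="textSpell" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

With expand="true" at index time, a synonyms.txt entry like "ubuntu, debian" would put both terms into the dictionary the suggestion component is built from.)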

On Wed, Oct 27, 2010 at 6:47 AM, Antonio Calo'  wrote:

> Hi
>
> If I understood, you will build a kind of dictionary or ontology or
> thesaurus and you will use it if Solr query results are few. At query time
> (before or after) you will perform a query on this dictionary in order to
> retrieve the suggested word.
>
> If you need to do this, you can try to create a custom request handler
> where you can control the querying process in a simple manner (
> http://wiki.apache.org/solr/SolrRequestHandler).
>
> With the custom request handler, you can add custom code to check query
> results before submitting the query to Solr or analyzing the query before
> sending results to the client. I never coded one, but I think this is a good
> starting point.
>
> Hope this can help you
>
> Antonio
>
> On 27/10/2010 11.03, Pablo Recio wrote:
>
>> Thanks, it's not what I'm looking for.
>>
>> Actually I need something like: search "Ubuntu" and it will prompt "Maybe
>> you will like 'Debian' too" or something like that. I'm not trying to do it
>> automatically; manually will be OK.
>>
>> Anyway, it is a good article you shared; maybe I will implement it, thanks!
>>
>> 2010/10/27 Jakub Godawa
>>
>>> I am a real rookie at solr, but try this:
>>> http://solr.pl/2010/10/18/solr-and-autocomplete-part-1/?lang=en
>>>
>>> 2010/10/27 Pablo Recio
>>>
>>>> Hi,
>>>>
>>>> I don't want to be annoying, but I'm looking for a way to do that.
>>>>
>>>> I repeat the question: is there a way to implement Search Suggestion
>>>> manually?
>>>>
>>>> Thanks in advance.
>>>> Regards,
>>>>
>>>> 2010/10/18 Pablo Recio Quijano
>>>>
>>>>> Hi!
>>>>>
>>>>> I'm trying to implement some kind of Search Suggestion on a search
>>>>> engine I have implemented. These search suggestions should not be
>>>>> automatic like the one described for the SpellCheckComponent [1]. I'm
>>>>> looking for something like:
>>>>>
>>>>> "SAS oppositions" => "Public job offers for some-company"
>>>>>
>>>>> So I will have to define it manually. I was thinking about synonyms [2]
>>>>> but I don't know if it's the proper way to do it, because semantically
>>>>> those terms are not synonyms.
>>>>>
>>>>> Any ideas or suggestions?
>>>>>
>>>>> Regards,
>>>>>
>>>>> [1] http://wiki.apache.org/solr/SpellCheckComponent
>>>>> [2]
>>>>> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory
>


-- 
°O°
"Good Enough" is not good enough.
To give anything less than your best is to sacrifice the gift.
Quality First. Measure Twice. Cut Once.
http://www.israelekpo.com/


Re: after the slave node pull index from master, when will solr del the tmp index dir

2010-10-27 Thread Jayendra Patil
We faced the same issue.
If you are executing a complete clean build, the slave copies the complete
index and just switches the pointer in index.properties to point to the new
index.<timestamp> directory, leaving behind the old copies. And it does not
clean them up.

I had logged a JIRA and a patch to the SnapPuller class; you may want to give it a
try -
https://issues.apache.org/jira/browse/SOLR-2156

Regards,
Jayendra

2010/10/26 Chengyang 

> I noticed that the slave node has some tmp index.* dirs that were created
> during the index sync with the master, but they are not removed even after
> several days. So when will Solr delete the tmp index dirs?
>


Re: How do I this in Solr?

2010-10-27 Thread Varun Gupta
Toke, the search query will contain 4-5 words on average (excluding the
stopwords).

Mike, I don't care about the result count. Excluding the terms at the client
side may be a good idea. Is there any way to alter scoring such that the
docs containing only the searched-for terms are shown first? Can I use term
frequency to do such kind of thing?

--
Thanks
Varun Gupta

On Wed, Oct 27, 2010 at 7:13 PM, Mike Sokolov  wrote:

> Yes I missed that requirement (as Steven also pointed out in a private
> e-mail).  I now agree that the combinatorics are required.
>
> Another possibility to consider (if the queries are large, which actually
> seems unlikely) is to use the default behavior where all terms are optional,
> sort by relevance, and truncate the result list on the client side after
> some unwanted term is found.  I *think* the scoring should find only docs
> with the searched-for terms first, although if there are a lot of repeated
> terms maybe not? Also result counts will be screwy.
>
> -Mike
>
>
> On 10/27/2010 09:34 AM, Toke Eskildsen wrote:
>
>> That does not work either as it requires that all the terms in the query
>> are present in the document. The original poster did not state this
>> requirement. On the contrary, his examples were mostly single-word
>> matches, implying an OR-search at the core.
>>
>> The query-explosion still seems like the only working idea. Maybe Varun
>> could comment on the maximum numbers of terms that his queries will
>> contain?
>>
>> Regards,
>> Toke Eskildsen
>>
>> On Wed, 2010-10-27 at 15:02 +0200, Mike Sokolov wrote:
>>
>>
>>> Right - my point was to combine this with the previous approaches to
>>> form a query like:
>>>
>>> samsung AND android AND GPS AND word_count:3
>>>
>>> in order to exclude documents containing additional words. This would
>>> avoid the combinatoric explosion problem otehrs had alluded to earlier.
>>> Of course this would fail because android is "mis-" spelled :)
>>>
>>> -Mike
>>>
>>> On 10/27/2010 08:45 AM, Steven A Rowe wrote:
>>>
>>>
 I'm pretty sure the word-count strategy won't work.




> If I search with the text "samsung andriod GPS", search results
> should only contain "samsung", "GPS", "andriod" and "samsung andriod".
>
>
>
 Using the word-count strategy, a document containing "samsung andriod
 PDQ" would be a hit, but Varun doesn't want it, because it contains a word
 that is not in the query.

 Steve




> -Original Message-
> From: Michael Sokolov [mailto:soko...@ifactory.com]
> Sent: Wednesday, October 27, 2010 7:44 AM
> To: solr-user@lucene.apache.org
> Subject: RE: How do I this in Solr?
>
> You might try adding a field containing the word count and making sure
> that
> matches the query's word count?
>
> This would require you to tokenize the query and document yourself,
> perhaps.
>
> -Mike
>
>
>
>
>> -Original Message-
>> From: Varun Gupta [mailto:varun.vgu...@gmail.com]
>> Sent: Tuesday, October 26, 2010 11:26 PM
>> To: solr-user@lucene.apache.org
>> Subject: Re: How do I this in Solr?
>>
>> Thanks everybody for the inputs.
>>
>> Looks like Steven's solution is the closest one but will lead
>> to performance issues when the query string has many terms.
>>
>> I will try to implement the two filters suggested by Steven
>> and see how the performance matches up.
>>
>> --
>> Thanks
>> Varun Gupta
>>
>>
>> On Wed, Oct 27, 2010 at 8:04 AM, scott chu (???)
>> wrote:
>>
>>
>>
>>
>>> I think you have to write a "yet exact match" handler
>>>
>>>
>>>
>> yourself (I mean
>>
>>
>>
>>> yet cause it's not quite exact match we normally know).
>>>
>>>
>>>
>> Steve's answer
>>
>>
>>
>>> is quite near your request. You can do further work based
>>>
>>>
>>>
>> on his solution.
>>
>>
>>
>>> At the last step, I'll suggest you eat up all blank within query
>>> string and query result, respectively & only return those results
>>> that have equal string length as the query string's.
>>>
>>> For example, giving:
>>> *query string = "Samsung with GPS"
>>> *query results:
>>> result 1 = "Samsung has lots of mobile with GPS"
>>> result 2 = "with GPS Samsung"
>>> result 3 = "GPS mobile with vendors, such as Sony, Samsung"
>>>
>>> they become:
>>> *query result = "SamsungwithGPS" (length =14) *query results:
>>> result 1 = "SamsunghaslotsofmobilewithGPS" (length =29) result 2 =
>>> "withGPSSamsung" (length =14) result 3 =
>>> "GPSmobilewithvendors,suchasSony,Samsung" (length =43)
>>>
>>> so result 2 matches your request.
>>>
>>> In this way, you can avoid case-sensitive,
>>>
>>>
>>>
>> word-order-rearrange 

Re: Multiple Word Facets

2010-10-27 Thread Ken Krugler


On Oct 27, 2010, at 6:29am, Adam Estrada wrote:


Ahhh...I see! I am doing my testing crawling a couple websites using
Nutch and in doing so I am assigning my facets to the title field
which is type=text. Are you saying that I will need to manually
generate the content for my facet field? I can see the reason and need
for doing it that way but I really need for my faceting to happen
dynamically based on the content in the field which in this case is
the title of a URL.


You would use copyField to copy the contents of the title into a new
field that uses the string type, and is the one you use for faceting.
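
A minimal schema.xml sketch of that setup (the field names here are illustrative, not from Adam's actual schema):

   <field name="title" type="text" indexed="true" stored="true"/>
   <field name="title_facet" type="string" indexed="true" stored="false" multiValued="true"/>
   <copyField source="title" dest="title_facet"/>

You then keep searching the tokenized "title" field but facet with facet.field=title_facet, so each title value stays intact as a single facet term.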


-- Ken


On Wed, Oct 27, 2010 at 9:19 AM, Jayendra Patil
 wrote:
The ShingleFilter breaks the words in a sentence into combinations of 2/3
words.

For the faceting field you should use:

<field name="..." type="string" indexed="true" stored="true" multiValued="true"/>

The type of the field should be *string* so that it is not
tokenised at all.


On Wed, Oct 27, 2010 at 9:12 AM, Adam Estrada  
wrote:


Thanks guys, the solr.ShingleFilterFactory did work to get me  
multiple

terms per facet but now I am seeing some redundancy in the facets
numbers. See below...

Highway (62)
Highway System (59)
National (59)
National Highway (59)
National Highway System (59)
System (59)

See what's going on here? How can I make my multi token facets  
smarter

so that the tokens aren't duplicated?

Thanks in advance,
Adam

On Tue, Oct 26, 2010 at 10:32 PM, Ahmet Arslan   
wrote:

Facets are generated from indexed terms.

Depending on your need/use-case:

You can use an additional separate String field (which is not tokenized)
for facets, and populate it via copyField. Search on the tokenized field,
facet on the non-tokenized field.


Or

You can add solr.ShingleFilterFactory to your index analyzer to  
form

multiple word terms.


--- On Wed, 10/27/10, Adam Estrada  wrote:


From: Adam Estrada 
Subject: Multiple Word Facets
To: solr-user@lucene.apache.org
Date: Wednesday, October 27, 2010, 4:43 AM
All,
I am new to Solr faceting and stuck on how to get
multiple-word
facets returned from a standard Solr query. See below for
what is
currently being returned.





89
87
87
87
84
60
32
22
19
15
15
14
12
11
10
9
7
7
7
6
6
6
6
...etc...

There are many terms in there that are 2 or 3 word phrases.
For
example, Eastern Federal Lands Highway Division all gets
broken down
into the individual words that make up the total group of
words. I've
seen quite a few websites that do what it is I am trying to
do here so
any suggestions at this point would be great. See my schema
below
(copied from the example schema).

    <fieldType name="..." class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>

Similar for type="query". Please advise on how to group or
cluster
document terms so that they can be used as facets.

Many thanks in advance,
Adam Estrada












--
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g







Re: how well does multicore scale?

2010-10-27 Thread Toke Eskildsen
On Wed, 2010-10-27 at 14:20 +0200, mike anderson wrote:
> [...] By my simple math, this would mean that if we want each shard's
> index to be able to fit in memory, [...]

Might I ask why you're planning on using memory-based sharding? The
performance gap between memory and SSDs is not very big so using memory
to get those last queries/second is quite expensive.



Re: How do I this in Solr?

2010-10-27 Thread Mike Sokolov
Yes I missed that requirement (as Steven also pointed out in a private 
e-mail).  I now agree that the combinatorics are required.


Another possibility to consider (if the queries are large, which 
actually seems unlikely) is to use the default behavior where all terms 
are optional, sort by relevance, and truncate the result list on the 
client side after some unwanted term is found.  I *think* the scoring 
should find only docs with the searched-for terms first, although if 
there are a lot of repeated terms maybe not? Also result counts will be 
screwy.


-Mike

On 10/27/2010 09:34 AM, Toke Eskildsen wrote:

That does not work either as it requires that all the terms in the query
are present in the document. The original poster did not state this
requirement. On the contrary, his examples were mostly single-word
matches, implying an OR-search at the core.

The query-explosion still seems like the only working idea. Maybe Varun
could comment on the maximum numbers of terms that his queries will
contain?

Regards,
Toke Eskildsen

On Wed, 2010-10-27 at 15:02 +0200, Mike Sokolov wrote:
   

Right - my point was to combine this with the previous approaches to
form a query like:

samsung AND android AND GPS AND word_count:3

in order to exclude documents containing additional words. This would
avoid the combinatoric explosion problem others had alluded to earlier.
Of course this would fail because android is "mis-" spelled :)

-Mike

On 10/27/2010 08:45 AM, Steven A Rowe wrote:
 

I'm pretty sure the word-count strategy won't work.


   

If I search with the text "samsung andriod GPS", search results
should only contain "samsung", "GPS", "andriod" and "samsung andriod".

 

Using the word-count strategy, a document containing "samsung andriod PDQ" 
would be a hit, but Varun doesn't want it, because it contains a word that is not in the 
query.

Steve


   

-Original Message-
From: Michael Sokolov [mailto:soko...@ifactory.com]
Sent: Wednesday, October 27, 2010 7:44 AM
To: solr-user@lucene.apache.org
Subject: RE: How do I this in Solr?

You might try adding a field containing the word count and making sure
that
matches the query's word count?

This would require you to tokenize the query and document yourself,
perhaps.

-Mike


 

-Original Message-
From: Varun Gupta [mailto:varun.vgu...@gmail.com]
Sent: Tuesday, October 26, 2010 11:26 PM
To: solr-user@lucene.apache.org
Subject: Re: How do I this in Solr?

Thanks everybody for the inputs.

Looks like Steven's solution is the closest one but will lead
to performance issues when the query string has many terms.

I will try to implement the two filters suggested by Steven
and see how the performance matches up.

--
Thanks
Varun Gupta


On Wed, Oct 27, 2010 at 8:04 AM, scott chu (???)
wrote:


   

I think you have to write a "yet exact match" handler

 

yourself (I mean

   

yet cause it's not quite exact match we normally know).

 

Steve's answer

   

is quite near your request. You can do further work based

 

on his solution.

   

At the last step, I'll suggest you eat up all blank within query
string and query result, respectively & only return those results
that have equal string length as the query string's.

For example, giving:
*query string = "Samsung with GPS"
*query results:
result 1 = "Samsung has lots of mobile with GPS"
result 2 = "with GPS Samsung"
result 3 = "GPS mobile with vendors, such as Sony, Samsung"

they become:
*query result = "SamsungwithGPS" (length =14) *query results:
result 1 = "SamsunghaslotsofmobilewithGPS" (length =29) result 2 =
"withGPSSamsung" (length =14) result 3 =
"GPSmobilewithvendors,suchasSony,Samsung" (length =43)

so result 2 matches your request.

In this way, you can avoid case-sensitive,

 

word-order-rearrange load

   

of works. Furthermore, you can do refined work, such as

 

remove white

   

characters, etc.

Scott @ Taiwan


- Original Message - From: "Varun Gupta"


To:
Sent: Tuesday, October 26, 2010 9:07 PM

Subject: How do I this in Solr?


   Hi,

 

I have lot of small documents (each containing 1 to 15

   

words) indexed

   

in Solr. For the search query, I want the search results

   

to contain

   

only those documents that satisfy this criteria "All of

   

the words of

   

the search result document are present in the search query"

For example:
If I have the following documents indexed: "nokia n95", "GPS",
"android", "samsung", "samsung andriod", "nokia andriod",

   

"mobile with GPS"

   

If I search with the text "samsung andriod GPS", search results
should only contain "samsung", "GPS", "andriod" and

   

"samsung andriod".

   

Is there a way to do this in Solr.

--
Thanks
Varun Gupta



   


Re: How do I this in Solr?

2010-10-27 Thread Toke Eskildsen
That does not work either as it requires that all the terms in the query
are present in the document. The original poster did not state this
requirement. On the contrary, his examples were mostly single-word
matches, implying an OR-search at the core.

The query-explosion still seems like the only working idea. Maybe Varun
could comment on the maximum numbers of terms that his queries will
contain?

Regards,
Toke Eskildsen

On Wed, 2010-10-27 at 15:02 +0200, Mike Sokolov wrote:
> Right - my point was to combine this with the previous approaches to 
> form a query like:
> 
> samsung AND android AND GPS AND word_count:3
> 
> in order to exclude documents containing additional words. This would 
> avoid the combinatoric explosion problem others had alluded to earlier. 
> Of course this would fail because android is "mis-" spelled :)
> 
> -Mike
> 
> On 10/27/2010 08:45 AM, Steven A Rowe wrote:
> > I'm pretty sure the word-count strategy won't work.
> >
> >
> >> If I search with the text "samsung andriod GPS", search results
> >> should only contain "samsung", "GPS", "andriod" and "samsung andriod".
> >>  
> > Using the word-count strategy, a document containing "samsung andriod PDQ" 
> > would be a hit, but Varun doesn't want it, because it contains a word that 
> > is not in the query.
> >
> > Steve
> >
> >
> >> -Original Message-
> >> From: Michael Sokolov [mailto:soko...@ifactory.com]
> >> Sent: Wednesday, October 27, 2010 7:44 AM
> >> To: solr-user@lucene.apache.org
> >> Subject: RE: How do I this in Solr?
> >>
> >> You might try adding a field containing the word count and making sure
> >> that
> >> matches the query's word count?
> >>
> >> This would require you to tokenize the query and document yourself,
> >> perhaps.
> >>
> >> -Mike
> >>
> >>  
> >>> -Original Message-
> >>> From: Varun Gupta [mailto:varun.vgu...@gmail.com]
> >>> Sent: Tuesday, October 26, 2010 11:26 PM
> >>> To: solr-user@lucene.apache.org
> >>> Subject: Re: How do I this in Solr?
> >>>
> >>> Thanks everybody for the inputs.
> >>>
> >>> Looks like Steven's solution is the closest one but will lead
> >>> to performance issues when the query string has many terms.
> >>>
> >>> I will try to implement the two filters suggested by Steven
> >>> and see how the performance matches up.
> >>>
> >>> --
> >>> Thanks
> >>> Varun Gupta
> >>>
> >>>
> >>> On Wed, Oct 27, 2010 at 8:04 AM, scott chu (???)
> >>> wrote:
> >>>
> >>>
>  I think you have to write a "yet exact match" handler
>   
> >>> yourself (I mean
> >>>
>  yet cause it's not quite exact match we normally know).
>   
> >>> Steve's answer
> >>>
>  is quite near your request. You can do further work based
>   
> >>> on his solution.
> >>>
>  At the last step, I'll suggest you eat up all blank within query
>  string and query result, respectively & only return those results
>  that have equal string length as the query string's.
> 
>  For example, giving:
>  *query string = "Samsung with GPS"
>  *query results:
>  result 1 = "Samsung has lots of mobile with GPS"
>  result 2 = "with GPS Samsung"
>  result 3 = "GPS mobile with vendors, such as Sony, Samsung"
> 
>  they become:
>  *query result = "SamsungwithGPS" (length =14) *query results:
>  result 1 = "SamsunghaslotsofmobilewithGPS" (length =29) result 2 =
>  "withGPSSamsung" (length =14) result 3 =
>  "GPSmobilewithvendors,suchasSony,Samsung" (length =43)
> 
>  so result 2 matches your request.
> 
>  In this way, you can avoid case-sensitive,
>   
> >>> word-order-rearrange load
> >>>
>  of works. Furthermore, you can do refined work, such as
>   
> >>> remove white
> >>>
>  characters, etc.
> 
>  Scott @ Taiwan
> 
> 
>  - Original Message - From: "Varun Gupta"
>  
> 
>  To:
>  Sent: Tuesday, October 26, 2010 9:07 PM
> 
>  Subject: How do I this in Solr?
> 
> 
>    Hi,
>   
> > I have lot of small documents (each containing 1 to 15
> >
> >>> words) indexed
> >>>
> > in Solr. For the search query, I want the search results
> >
> >>> to contain
> >>>
> > only those documents that satisfy this criteria "All of
> >
> >>> the words of
> >>>
> > the search result document are present in the search query"
> >
> > For example:
> > If I have the following documents indexed: "nokia n95", "GPS",
> > "android", "samsung", "samsung andriod", "nokia andriod",
> >
> >>> "mobile with GPS"
> >>>
> > If I search with the text "samsung andriod GPS", search results
> > >> should only contain "samsung", "GPS", "andriod" and
> >
> >>> "samsung andriod".
> >>>
> > >> Is there a way to do this in Solr.

Re: Multiple Word Facets

2010-10-27 Thread Adam Estrada
Ahhh...I see! I am doing my testing crawling a couple websites using
Nutch and in doing so I am assigning my facets to the title field
which is type=text. Are you saying that I will need to manually
generate the content for my facet field? I can see the reason and need
for doing it that way but I really need for my faceting to happen
dynamically based on the content in the field which in this case is
the title of a URL.

Thanks again for all the tips on getting this working for me.

Adam

On Wed, Oct 27, 2010 at 9:19 AM, Jayendra Patil
 wrote:
> The ShingleFilter breaks the words in a sentence into combinations of 2/3
> words.
>
> For the faceting field you should use:
> <field name="..." type="string" indexed="true" stored="true" multiValued="true"/>
>
> The type of the field should be *string* so that it is not tokenised at all.
>
> On Wed, Oct 27, 2010 at 9:12 AM, Adam Estrada wrote:
>
>> Thanks guys, the solr.ShingleFilterFactory did work to get me multiple
>> terms per facet but now I am seeing some redundancy in the facets
>> numbers. See below...
>>
>> Highway (62)
>> Highway System (59)
>> National (59)
>> National Highway (59)
>> National Highway System (59)
>> System (59)
>>
>> See what's going on here? How can I make my multi token facets smarter
>> so that the tokens aren't duplicated?
>>
>> Thanks in advance,
>> Adam
>>
>> On Tue, Oct 26, 2010 at 10:32 PM, Ahmet Arslan  wrote:
>> > Facets are generated from indexed terms.
>> >
>> > Depending on your need/use-case:
>> >
>> > You can use an additional separate String field (which is not tokenized)
>> for facets, and populate it via copyField. Search on the tokenized field, facet on
>> the non-tokenized field.
>> >
>> > Or
>> >
>> > You can add solr.ShingleFilterFactory to your index analyzer to form
>> multiple word terms.
>> >
>> > --- On Wed, 10/27/10, Adam Estrada  wrote:
>> >
>> >> From: Adam Estrada 
>> >> Subject: Multiple Word Facets
>> >> To: solr-user@lucene.apache.org
>> >> Date: Wednesday, October 27, 2010, 4:43 AM
>> >> All,
>> >> I am new to Solr faceting and stuck on how to get
>> >> multiple-word
>> >> facets returned from a standard Solr query. See below for
>> >> what is
>> >> currently being returned.
>> >>
>> >> 
>> >> 
>> >> 
>> >> 
>> >> 89
>> >> 87
>> >> 87
>> >> 87
>> >> 84
>> >> 60
>> >> 32
>> >> 22
>> >> 19
>> >> 15
>> >> 15
>> >> 14
>> >> 12
>> >> 11
>> >> 10
>> >> 9
>> >> 7
>> >> 7
>> >> 7
>> >> 6
>> >> 6
>> >> 6
>> >> 6
>> >> ...etc...
>> >>
>> >> There are many terms in there that are 2 or 3 word phrases.
>> >> For
>> >> example, Eastern Federal Lands Highway Division all gets
>> >> broken down
>> >> into the individual words that make up the total group of
>> >> words. I've
>> >> seen quite a few websites that do what it is I am trying to
>> >> do here so
>> >> any suggestions at this point would be great. See my schema
>> >> below
>> >> (copied from the example schema).
>> >>
>> >>     > >> class="solr.TextField" positionIncrementGap="100">
>> >>       
>> >>          > >> class="solr.WhitespaceTokenizerFactory"/>
>> >>     > >> class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>> >> ignoreCase="true" expand="false"/>
>> >>         > >> class="solr.StopFilterFactory"
>> >>
>> >> ignoreCase="true"
>> >>
>> >> words="stopwords.txt"
>> >>
>> >> enablePositionIncrements="true"
>> >>
>> >> />
>> >>     > >> class="solr.WordDelimiterFilterFactory"
>> >> generateWordParts="1"
>> >> generateNumberParts="1" catenateWords="0"
>> >> catenateNumbers="0"
>> >> catenateAll="0" splitOnCaseChange="1"/>
>> >>         > >> class="solr.RemoveDuplicatesTokenFilterFactory"/>
>> >>       
>> >>
>> >> Similar for type="query". Please advise on how to group or
>> >> cluster
>> >> document terms so that they can be used as facets.
>> >>
>> >> Many thanks in advance,
>> >> Adam Estrada
>> >>
>> >
>> >
>> >
>> >
>>
>


Re: Multiple Word Facets

2010-10-27 Thread Jayendra Patil
The ShingleFilter breaks the words in a sentence into combinations of 2/3
words.

For the faceting field you should use:

<field name="..." type="string" indexed="true" stored="true" multiValued="true"/>

The type of the field should be *string* so that it is not tokenised at all.
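
For completeness, the shingle-based alternative would look roughly like this in schema.xml (a sketch; the type name and parameters are illustrative):

   <fieldType name="shingled_text" class="solr.TextField" positionIncrementGap="100">
     <analyzer type="index">
       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
       <filter class="solr.ShingleFilterFactory" maxShingleSize="3" outputUnigrams="false"/>
     </analyzer>
   </fieldType>

Note that ShingleFilterFactory emits the single words alongside the 2/3-word shingles unless outputUnigrams is set to "false", which is likely why Adam sees both "National" and "National Highway" as facet values.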

On Wed, Oct 27, 2010 at 9:12 AM, Adam Estrada wrote:

> Thanks guys, the solr.ShingleFilterFactory did work to get me multiple
> terms per facet but now I am seeing some redundancy in the facets
> numbers. See below...
>
> Highway (62)
> Highway System (59)
> National (59)
> National Highway (59)
> National Highway System (59)
> System (59)
>
> See what's going on here? How can I make my multi token facets smarter
> so that the tokens aren't duplicated?
>
> Thanks in advance,
> Adam
>
> On Tue, Oct 26, 2010 at 10:32 PM, Ahmet Arslan  wrote:
> > Facets are generated from indexed terms.
> >
> > Depending on your need/use-case:
> >
> > You can use an additional separate String field (which is not tokenized)
> for facets, and populate it via copyField. Search on the tokenized field, facet on
> the non-tokenized field.
> >
> > Or
> >
> > You can add solr.ShingleFilterFactory to your index analyzer to form
> multiple word terms.
> >
> > --- On Wed, 10/27/10, Adam Estrada  wrote:
> >
> >> From: Adam Estrada 
> >> Subject: Multiple Word Facets
> >> To: solr-user@lucene.apache.org
> >> Date: Wednesday, October 27, 2010, 4:43 AM
> >> All,
> >> I am new to Solr faceting and stuck on how to get
> >> multiple-word
> >> facets returned from a standard Solr query. See below for
> >> what is
> >> currently being returned.
> >>
> >> 
> >> 
> >> 
> >> 
> >> 89
> >> 87
> >> 87
> >> 87
> >> 84
> >> 60
> >> 32
> >> 22
> >> 19
> >> 15
> >> 15
> >> 14
> >> 12
> >> 11
> >> 10
> >> 9
> >> 7
> >> 7
> >> 7
> >> 6
> >> 6
> >> 6
> >> 6
> >> ...etc...
> >>
> >> There are many terms in there that are 2 or 3 word phrases.
> >> For
> >> example, Eastern Federal Lands Highway Division all gets
> >> broken down
> >> into the individual words that make up the total group of
> >> words. I've
> >> seen quite a few websites that do what it is I am trying to
> >> do here so
> >> any suggestions at this point would be great. See my schema
> >> below
> >> (copied from the example schema).
> >>
> >>  >> class="solr.TextField" positionIncrementGap="100">
> >>   
> >>   >> class="solr.WhitespaceTokenizerFactory"/>
> >>  >> class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
> >> ignoreCase="true" expand="false"/>
> >>  >> class="solr.StopFilterFactory"
> >>
> >> ignoreCase="true"
> >>
> >> words="stopwords.txt"
> >>
> >> enablePositionIncrements="true"
> >>
> >> />
> >>  >> class="solr.WordDelimiterFilterFactory"
> >> generateWordParts="1"
> >> generateNumberParts="1" catenateWords="0"
> >> catenateNumbers="0"
> >> catenateAll="0" splitOnCaseChange="1"/>
> >>  >> class="solr.RemoveDuplicatesTokenFilterFactory"/>
> >>   
> >>
> >> Similar for type="query". Please advise on how to group or
> >> cluster
> >> document terms so that they can be used as facets.
> >>
> >> Many thanks in advance,
> >> Adam Estrada
> >>
> >
> >
> >
> >
>


Re: Multiple Word Facets

2010-10-27 Thread Adam Estrada
Thanks guys, the solr.ShingleFilterFactory did work to get me multiple
terms per facet but now I am seeing some redundancy in the facets
numbers. See below...

Highway (62)
Highway System (59)
National (59)
National Highway (59)
National Highway System (59)
System (59)

See what's going on here? How can I make my multi token facets smarter
so that the tokens aren't duplicated?

Thanks in advance,
Adam

On Tue, Oct 26, 2010 at 10:32 PM, Ahmet Arslan  wrote:
> Facets are generated from indexed terms.
>
> Depending on your need/use-case:
>
> You can use an additional separate String field (which is not tokenized) for
> facets, and populate it via copyField. Search on the tokenized field, facet on
> the non-tokenized field.
>
> Or
>
> You can add solr.ShingleFilterFactory to your index analyzer to form multiple 
> word terms.
>
> --- On Wed, 10/27/10, Adam Estrada  wrote:
>
>> From: Adam Estrada 
>> Subject: Multiple Word Facets
>> To: solr-user@lucene.apache.org
>> Date: Wednesday, October 27, 2010, 4:43 AM
>> All,
>> I am new to Solr faceting and stuck on how to get
>> multiple-word
>> facets returned from a standard Solr query. See below for
>> what is
>> currently being returned.
>>
>> 
>> 
>> 
>> 
>> 89
>> 87
>> 87
>> 87
>> 84
>> 60
>> 32
>> 22
>> 19
>> 15
>> 15
>> 14
>> 12
>> 11
>> 10
>> 9
>> 7
>> 7
>> 7
>> 6
>> 6
>> 6
>> 6
>> ...etc...
>>
>> There are many terms in there that are 2 or 3 word phrases.
>> For
>> example, Eastern Federal Lands Highway Division all gets
>> broken down
>> into the individual words that make up the total group of
>> words. I've
>> seen quite a few websites that do what it is I am trying to
>> do here so
>> any suggestions at this point would be great. See my schema
>> below
>> (copied from the example schema).
>>
>>     > class="solr.TextField" positionIncrementGap="100">
>>       
>>          > class="solr.WhitespaceTokenizerFactory"/>
>>     > class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>> ignoreCase="true" expand="false"/>
>>         > class="solr.StopFilterFactory"
>>
>> ignoreCase="true"
>>
>> words="stopwords.txt"
>>
>> enablePositionIncrements="true"
>>
>> />
>>     > class="solr.WordDelimiterFilterFactory"
>> generateWordParts="1"
>> generateNumberParts="1" catenateWords="0"
>> catenateNumbers="0"
>> catenateAll="0" splitOnCaseChange="1"/>
>>         > class="solr.RemoveDuplicatesTokenFilterFactory"/>
>>       
>>
>> Similar for type="query". Please advise on how to group or
>> cluster
>> document terms so that they can be used as facets.
>>
>> Many thanks in advance,
>> Adam Estrada
>>
>
>
>
>


Re: How do I this in Solr?

2010-10-27 Thread Mike Sokolov
Right - my point was to combine this with the previous approaches to 
form a query like:


samsung AND android AND GPS AND word_count:3

in order to exclude documents containing additional words. This would 
avoid the combinatoric explosion problem others had alluded to earlier. 
Of course this would fail because android is "mis-" spelled :)


-Mike

On 10/27/2010 08:45 AM, Steven A Rowe wrote:

I'm pretty sure the word-count strategy won't work.

   

If I search with the text "samsung andriod GPS", search results
should only contain "samsung", "GPS", "andriod" and "samsung andriod".
 

Using the word-count strategy, a document containing "samsung andriod PDQ" 
would be a hit, but Varun doesn't want it, because it contains a word that is not in the 
query.

Steve

   

-Original Message-
From: Michael Sokolov [mailto:soko...@ifactory.com]
Sent: Wednesday, October 27, 2010 7:44 AM
To: solr-user@lucene.apache.org
Subject: RE: How do I this in Solr?

You might try adding a field containing the word count and making sure
that
matches the query's word count?

This would require you to tokenize the query and document yourself,
perhaps.

-Mike

 

-Original Message-
From: Varun Gupta [mailto:varun.vgu...@gmail.com]
Sent: Tuesday, October 26, 2010 11:26 PM
To: solr-user@lucene.apache.org
Subject: Re: How do I this in Solr?

Thanks everybody for the inputs.

Looks like Steven's solution is the closest one but will lead
to performance issues when the query string has many terms.

I will try to implement the two filters suggested by Steven
and see how the performance matches up.

--
Thanks
Varun Gupta


On Wed, Oct 27, 2010 at 8:04 AM, scott chu (???)
wrote:

   

I think you have to write a "yet exact match" handler
 

yourself (I mean
   

yet cause it's not quite exact match we normally know).
 

Steve's answer
   

is quite near your request. You can do further work based
 

on his solution.
   

At the last step, I'll suggest you eat up all blank within query
string and query result, respectively & only return those results
that have equal string length as the query string's.

For example, giving:
*query string = "Samsung with GPS"
*query results:
result 1 = "Samsung has lots of mobile with GPS"
result 2 = "with GPS Samsung"
result 3 = "GPS mobile with vendors, such as Sony, Samsung"

they become:
*query result = "SamsungwithGPS" (length =14) *query results:
result 1 = "SamsunghaslotsofmobilewithGPS" (length =29) result 2 =
"withGPSSamsung" (length =14) result 3 =
"GPSmobilewithvendors,suchasSony,Samsung" (length =43)

so result 2 matches your request.

In this way, you can avoid case-sensitive,
 

word-order-rearrange load
   

of works. Furthermore, you can do refined work, such as
 

remove white
   

characters, etc.

Scott @ Taiwan


- Original Message - From: "Varun Gupta"


To:
Sent: Tuesday, October 26, 2010 9:07 PM

Subject: How do I this in Solr?


  Hi,
 

I have lot of small documents (each containing 1 to 15
   

words) indexed
   

in Solr. For the search query, I want the search results
   

to contain
   

only those documents that satisfy this criteria "All of
   

the words of
   

the search result document are present in the search query"

For example:
If I have the following documents indexed: "nokia n95", "GPS",
"android", "samsung", "samsung andriod", "nokia andriod",
   

"mobile with GPS"
   

If I search with the text "samsung andriod GPS", search results
should only contain "samsung", "GPS", "andriod" and
   

"samsung andriod".
   

Is there a way to do this in Solr.

--
Thanks
Varun Gupta


   



 

--
   

--



Checked by AVG - www.avg.com
Version: 9.0.862 / Virus Database: 271.1.1/3220 - Release Date:
10/26/10 14:34:00


 
   
   


RE: How do I this in Solr?

2010-10-27 Thread Steven A Rowe
I'm pretty sure the word-count strategy won't work.

> If I search with the text "samsung andriod GPS", search results
> should only contain "samsung", "GPS", "andriod" and "samsung andriod".

Using the word-count strategy, a document containing "samsung andriod PDQ" 
would be a hit, but Varun doesn't want it, because it contains a word that is 
not in the query.

Steve

> -Original Message-
> From: Michael Sokolov [mailto:soko...@ifactory.com]
> Sent: Wednesday, October 27, 2010 7:44 AM
> To: solr-user@lucene.apache.org
> Subject: RE: How do I this in Solr?
> 
> You might try adding a field containing the word count and making sure
> that
> matches the query's word count?
> 
> This would require you to tokenize the query and document yourself,
> perhaps.
> 
> -Mike
> 
> > -Original Message-
> > From: Varun Gupta [mailto:varun.vgu...@gmail.com]
> > Sent: Tuesday, October 26, 2010 11:26 PM
> > To: solr-user@lucene.apache.org
> > Subject: Re: How do I this in Solr?
> >
> > Thanks everybody for the inputs.
> >
> > Looks like Steven's solution is the closest one but will lead
> > to performance issues when the query string has many terms.
> >
> > I will try to implement the two filters suggested by Steven
> > and see how the performance matches up.
> >
> > --
> > Thanks
> > Varun Gupta
> >
> >
> > On Wed, Oct 27, 2010 at 8:04 AM, scott chu (???)
> > wrote:
> >
> > > I think you have to write a "yet exact match" handler
> > yourself (I mean
> > > yet cause it's not quite exact match we normally know).
> > Steve's answer
> > > is quite near your request. You can do further work based
> > on his solution.
> > >
> > > At the last step, I'll suggest you eat up all blank within query
> > > string and query result, respectively & only return those results
> > > that have equal string length as the query string's.
> > >
> > > For example, giving:
> > > *query string = "Samsung with GPS"
> > > *query results:
> > > result 1 = "Samsung has lots of mobile with GPS"
> > > result 2 = "with GPS Samsung"
> > > result 3 = "GPS mobile with vendors, such as Sony, Samsung"
> > >
> > > they become:
> > > *query result = "SamsungwithGPS" (length =14) *query results:
> > > result 1 = "SamsunghaslotsofmobilewithGPS" (length =29) result 2 =
> > > "withGPSSamsung" (length =14) result 3 =
> > > "GPSmobilewithvendors,suchasSony,Samsung" (length =43)
> > >
> > > so result 2 matches your request.
> > >
> > > In this way, you can avoid case-sensitive,
> > word-order-rearrange load
> > > of works. Furthermore, you can do refined work, such as
> > remove white
> > > characters, etc.
> > >
> > > Scott @ Taiwan
> > >
> > >
> > > - Original Message - From: "Varun Gupta"
> > > 
> > >
> > > To: 
> > > Sent: Tuesday, October 26, 2010 9:07 PM
> > >
> > > Subject: How do I this in Solr?
> > >
> > >
> > >  Hi,
> > >>
> > >> I have lot of small documents (each containing 1 to 15
> > words) indexed
> > >> in Solr. For the search query, I want the search results
> > to contain
> > >> only those documents that satisfy this criteria "All of
> > the words of
> > >> the search result document are present in the search query"
> > >>
> > >> For example:
> > >> If I have the following documents indexed: "nokia n95", "GPS",
> > >> "android", "samsung", "samsung andriod", "nokia andriod",
> > "mobile with GPS"
> > >>
> > >> If I search with the text "samsung andriod GPS", search results
> > >> should only contain "samsung", "GPS", "andriod" and
> > "samsung andriod".
> > >>
> > >> Is there a way to do this in Solr.
> > >>
> > >> --
> > >> Thanks
> > >> Varun Gupta
> > >>
> > >>
> > >
> > >
> > >
> > --
> > > --
> > >
> > >
> > >
> > > Checked by AVG - www.avg.com
> > > Version: 9.0.862 / Virus Database: 271.1.1/3220 - Release Date:
> > > 10/26/10 14:34:00
> > >
> > >
> >



Re: how well does multicore scale?

2010-10-27 Thread Tharindu Mathew
Hi mike,

I think I wasn't clear,

Each document will only be tagged with one user_id, or to be specific
one tenant_id. Users of the same tenant can't upload the same document
to the same path.

So I use this to make the key unique for each tenant. So I can index,
delete without a problem.
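
A minimal sketch of that kind of composite key in schema.xml (names are illustrative, not from Tharindu's actual setup):

   <field name="doc_key" type="string" indexed="true" stored="true"/>
   <uniqueKey>doc_key</uniqueKey>
   <!-- populated at index time as e.g. "tenant42:/reports/q3.pdf" -->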

On Wed, Oct 27, 2010 at 5:50 PM, mike anderson  wrote:
> Tagging every document with a few hundred thousand 6 character user-ids
> would  increase the document size by two orders of magnitude. I can't
> imagine why this wouldn't mean the index would increase by just as much
> (though I really don't know much about that file structure). By my simple
> math, this would mean that if we want each shard's index to be able to fit
> in memory, then (even with some beefy servers) each query would have to go
> out to a few thousand shards (as opposed to 21 if we used the MultiCore
> approach). This means the typical response time would be much slower.
>
>
> -mike
>
> On Tue, Oct 26, 2010 at 10:15 AM, Jonathan Rochkind wrote:
>
>> mike anderson wrote:
>>
>>> I'm really curious if there is a clever solution to the obvious problem
>>> with: "So you're better off using a single index and with a user id and use
>>> a query filter with the user id when fetching data.", i.e. when you have
>>> hundreds of thousands of user IDs tagged on each article. That just
>>> doesn't
>>> sound like it scales very well..
>>>
>>>
>> Actually, I think that design would scale pretty fine, I don't think
>> there's an 'obvious' problem. You store your userIDs in a multi-valued field
>> (or as multiple terms in a single value, ends up being similar). You fq on
>> there with the current userID.   There's one way to find out of course, but
>> that doesn't seem a patently ridiculous scenario or anything, that's the
>> kind of thing Solr is generally good at, it's what it's built for.   The
>> problem might actually be in the time it takes to add such a document to the
>> index; but not in query time.
>>
>> Doesn't mean it's the best solution for your problem though, I can't say.
>>
>> My impression is that Solr in general isn't really designed to support the
>> kind of multi-tenancy use case people are talking about lately.  So trying
>> to make it work anyway... if multi-cores work for you, then great, but be
>> aware they weren't really designed for that (having thousands of cores) and
>> may not. If a single index can work for you instead, great, but as you've
>> discovered it's not necessarily obvious how to set up the schema to do what
>> you need -- really this applies to Solr in general, unlike an rdbms where
>> you just third-form-normalize everything and figure it'll work for almost
>> any use case that comes up,  in Solr you generally need to custom fit the
>> schema for your particular use cases, sometimes being kind of clever to
>> figure out the optimal way to do that.
>>
>> This is, I'd argue/agree, indeed kind of a disadvantage, setting up a Solr
>> index takes more intellectual work than setting up an rdbms. The trade off
>> is you get speed, and flexible ways to set up relevancy (that still perform
>> well). Took a couple decades for rdbms to get as brainless to use as they
>> are, maybe in a couple more we'll have figured out ways to make indexing
>> engines like solr equally brainless, but not yet -- but it's still pretty
>> damn easy for what it is, the lucene/Solr folks have done a remarkable job.
>>
>



-- 
Regards,

Tharindu


Re: how well does multicore scale?

2010-10-27 Thread mike anderson
Tagging every document with a few hundred thousand 6 character user-ids
would  increase the document size by two orders of magnitude. I can't
imagine why this wouldn't mean the index would increase by just as much
(though I really don't know much about that file structure). By my simple
math, this would mean that if we want each shard's index to be able to fit
in memory, then (even with some beefy servers) each query would have to go
out to a few thousand shards (as opposed to 21 if we used the MultiCore
approach). This means the typical response time would be much slower.


-mike

On Tue, Oct 26, 2010 at 10:15 AM, Jonathan Rochkind wrote:

> mike anderson wrote:
>
>> I'm really curious if there is a clever solution to the obvious problem
>> with: "So you're better off using a single index and with a user id and use
>> a query filter with the user id when fetching data.", i.e. when you have
>> hundreds of thousands of user IDs tagged on each article. That just
>> doesn't
>> sound like it scales very well..
>>
>>
> Actually, I think that design would scale pretty fine, I don't think
> there's an 'obvious' problem. You store your userIDs in a multi-valued field
> (or as multiple terms in a single value, ends up being similar). You fq on
> there with the current userID.   There's one way to find out of course, but
> that doesn't seem a patently ridiculous scenario or anything, that's the
> kind of thing Solr is generally good at, it's what it's built for.   The
> problem might actually be in the time it takes to add such a document to the
> index; but not in query time.
>
> Doesn't mean it's the best solution for your problem though, I can't say.
>
> My impression is that Solr in general isn't really designed to support the
> kind of multi-tenancy use case people are talking about lately.  So trying
> to make it work anyway... if multi-cores work for you, then great, but be
> aware they weren't really designed for that (having thousands of cores) and
> may not. If a single index can work for you instead, great, but as you've
> discovered it's not necessarily obvious how to set up the schema to do what
> you need -- really this applies to Solr in general, unlike an rdbms where
> you just third-form-normalize everything and figure it'll work for almost
> any use case that comes up,  in Solr you generally need to custom fit the
> schema for your particular use cases, sometimes being kind of clever to
> figure out the optimal way to do that.
>
> This is, I'd argue/agree, indeed kind of a disadvantage, setting up a Solr
> index takes more intellectual work than setting up an rdbms. The trade off
> is you get speed, and flexible ways to set up relevancy (that still perform
> well). Took a couple decades for rdbms to get as brainless to use as they
> are, maybe in a couple more we'll have figured out ways to make indexing
> engines like solr equally brainless, but not yet -- but it's still pretty
> damn easy for what it is, the lucene/Solr folks have done a remarkable job.
>
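
For concreteness, the filter-query approach described above would look something like this (field name illustrative):

   <field name="user_ids" type="string" indexed="true" stored="false" multiValued="true"/>

with each search carrying a filter such as ...&q=heart+disease&fq=user_ids:u123456, so the filter is cached and reused across queries for the same user.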


Feeding Solr with its own Logs

2010-10-27 Thread Peter Karich

 In case someone is interested:
http://karussell.wordpress.com/2010/10/27/feeding-solr-with-its-own-logs/

a lot of TODOs, but it is working. I could also imagine that this kind
of example would be well suited for an intro tutorial,
because it covers dynamic fields, rapid Solr prototyping, filter and
facet queries, ... (and date faceting could be added)


Regards,
Peter.


RE: How do I this in Solr?

2010-10-27 Thread Michael Sokolov
You might try adding a field containing the word count and making sure that
matches the query's word count?

This would require you to tokenize the query and document yourself, perhaps.
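
A sketch of that idea (the field name is illustrative): add an integer field holding each document's word count, fill it at index time, and have the client count the query's terms and AND the count in:

   <field name="word_count" type="int" indexed="true" stored="false"/>
   <!-- client-built query for a 3-term search:
        q=samsung AND andriod AND GPS AND word_count:3 -->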

-Mike 

> -Original Message-
> From: Varun Gupta [mailto:varun.vgu...@gmail.com] 
> Sent: Tuesday, October 26, 2010 11:26 PM
> To: solr-user@lucene.apache.org
> Subject: Re: How do I this in Solr?
> 
> Thanks everybody for the inputs.
> 
> Looks like Steven's solution is the closest one but will lead 
> to performance issues when the query string has many terms.
> 
> I will try to implement the two filters suggested by Steven 
> and see how the performance matches up.
> 
> --
> Thanks
> Varun Gupta
> 
> 
> On Wed, Oct 27, 2010 at 8:04 AM, scott chu (???) 
> wrote:
> 
> > I think you have to write a "yet exact match" handler 
> yourself (I mean 
> > yet cause it's not quite exact match we normally know). 
> Steve's answer 
> > is quite near your request. You can do further work based 
> on his solution.
> >
> > At the last step, I'll suggest you eat up all blank within query 
> > string and query result, respectively & only return those results
> > that have equal string length as the query string's.
> >
> > For example, giving:
> > *query string = "Samsung with GPS"
> > *query results:
> > result 1 = "Samsung has lots of mobile with GPS"
> > result 2 = "with GPS Samsung"
> > result 3 = "GPS mobile with vendors, such as Sony, Samsung"
> >
> > they become:
> > *query result = "SamsungwithGPS" (length =14) *query results:
> > result 1 = "SamsunghaslotsofmobilewithGPS" (length =29) result 2 = 
> > "withGPSSamsung" (length =14) result 3 = 
> > "GPSmobilewithvendors,suchasSony,Samsung" (length =43)
> >
> > so result 2 matches your request.
> >
> > In this way, you can avoid case-sensitive, 
> word-order-rearrange load 
> > of works. Furthermore, you can do refined work, such as 
> remove white 
> > characters, etc.
> >
> > Scott @ Taiwan
> >
> >
> > - Original Message - From: "Varun Gupta" 
> > 
> >
> > To: 
> > Sent: Tuesday, October 26, 2010 9:07 PM
> >
> > Subject: How do I this in Solr?
> >
> >
> >  Hi,
> >>
> >> I have lot of small documents (each containing 1 to 15 
> words) indexed 
> >> in Solr. For the search query, I want the search results 
> to contain 
> >> only those documents that satisfy this criteria "All of 
> the words of 
> >> the search result document are present in the search query"
> >>
> >> For example:
> >> If I have the following documents indexed: "nokia n95", "GPS", 
> >> "android", "samsung", "samsung andriod", "nokia andriod", 
> "mobile with GPS"
> >>
> >> If I search with the text "samsung andriod GPS", search results 
> >> should only contain "samsung", "GPS", "andriod" and 
> "samsung andriod".
> >>
> >> Is there a way to do this in Solr.
> >>
> >> --
> >> Thanks
> >> Varun Gupta
> >>
> >>
> >
> >
> > 
> --
> > --
> >
> >
> >
> > Checked by AVG - www.avg.com
> > Version: 9.0.862 / Virus Database: 271.1.1/3220 - Release Date: 
> > 10/26/10 14:34:00
> >
> >
> 



Re: a bug of solr distributed search

2010-10-27 Thread Toke Eskildsen
On Tue, 2010-10-26 at 15:48 +0200, Ron Mayer wrote:
> And a third potential reason - it's arguably a feature instead of a bug
> for some applications.  Depending on how I organize my shards, "give me
> the most relevant document from each shard for this search" seems like
> it could be useful.

You can get that even if the shards scored equally, so it is a
limitation, not a feature. I hope to find the time later this week to
read some of the papers Andrzej was kind enough to point out, but it
seems like I really need to do the heavy lifting of setting up
comparisons for our own material.

The problem is of course to judge the quality of the outputs, but
setting the single index as the norm and plotting the differences in
document positions in the result sets might provide some insight.

Regards,
Toke Eskildsen



Re: Implementing Search Suggestion on Solr

2010-10-27 Thread Antonio Calo'

Hi

If I understood, you will build a kind of dictionary, ontology, or
thesaurus, and you will use it when Solr query results are few. At query
time (before or after) you will perform a query on this dictionary in
order to retrieve the suggested word.


If you need to do this, you can try to create a custom request handler
where you can control the querying process in a simple manner
(http://wiki.apache.org/solr/SolrRequestHandler).


With the custom request handler, you can add custom code to check query
results before submitting the query to Solr, or analyze the query before
sending the result to the client. I never coded one, but I think this is a
good starting point.
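
If you do go that route, registering such a handler in solrconfig.xml would look something like this (the class and parameter names here are hypothetical, just to show the shape):

   <requestHandler name="/suggest" class="com.example.ManualSuggestionHandler">
     <lst name="defaults">
       <str name="suggestionsFile">suggestions.txt</str>
     </lst>
   </requestHandler>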


Hope this can help you

Antonio



Il 27/10/2010 11.03, Pablo Recio ha scritto:

Thanks, it's not what I'm looking for.

Actually I need something like search "Ubuntu" and it will prompt "Maybe you
will like 'Debian' too" or something like that. I'm not trying to do it
automatically, manually will be ok.

Anyway, it's a good article you shared; maybe I will implement it, thanks!

2010/10/27 Jakub Godawa


I am a real rookie at solr, but try this:
http://solr.pl/2010/10/18/solr-and-autocomplete-part-1/?lang=en

2010/10/27 Pablo Recio


Hi,

I don't want to be annoying, but I'm looking for a way to do that.

I repeat the question: is there a way to implement Search Suggestion
manually?

Thanks in advance.
Regards,

2010/10/18 Pablo Recio Quijano


Hi!

I'm trying to implement some kind of Search Suggestion on a search

engine

I

have implemented. This search suggestions should not be automatically

like

the one described for the SpellCheckComponent [1]. I'm looking

something

like:

"SAS oppositions" =>  "Public job offers for some-company"

So I will have to define it manually. I was thinking about synonyms [2]

but

I don't know if it's the proper way to do it, because semantically

those

terms are not synonyms.

Any ideas or suggestions?

Regards,

[1] http://wiki.apache.org/solr/SpellCheckComponent
[2]


http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory




Re: Implementing Search Suggestion on Solr

2010-10-27 Thread Pablo Recio
Thanks, it's not what I'm looking for.

Actually I need something like search "Ubuntu" and it will prompt "Maybe you
will like 'Debian' too" or something like that. I'm not trying to do it
automatically, manually will be ok.

Anyway, it's a good article you shared; maybe I will implement it, thanks!
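
If you do try the synonyms route despite the semantic mismatch, a one-way mapping applied at query time would at least make the related term match (file name illustrative):

   # suggestions.txt
   ubuntu => ubuntu, debian

   <!-- in the query analyzer of the searched field type -->
   <filter class="solr.SynonymFilterFactory" synonyms="suggestions.txt" ignoreCase="true"/>

Note this silently expands the query rather than showing a "maybe you'd also like" prompt, so a custom component may still fit better.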

2010/10/27 Jakub Godawa 

> I am a real rookie at solr, but try this:
> http://solr.pl/2010/10/18/solr-and-autocomplete-part-1/?lang=en
>
> 2010/10/27 Pablo Recio 
>
> > Hi,
> >
> > I don't want to be annoying, but I'm looking for a way to do that.
> >
> > I repeat the question: is there a way to implement Search Suggestion
> > manually?
> >
> > Thanks in advance.
> > Regards,
> >
> > 2010/10/18 Pablo Recio Quijano 
> >
> > > Hi!
> > >
> > > I'm trying to implement some kind of Search Suggestion on a search
> engine
> > I
> > > have implemented. This search suggestions should not be automatically
> > like
> > > the one described for the SpellCheckComponent [1]. I'm looking
> something
> > > like:
> > >
> > > "SAS oppositions" => "Public job offers for some-company"
> > >
> > > So I will have to define it manually. I was thinking about synonyms [2]
> > but
> > > I don't know if it's the proper way to do it, because semantically
> those
> > > terms are not synonyms.
> > >
> > > Any ideas or suggestions?
> > >
> > > Regards,
> > >
> > > [1] http://wiki.apache.org/solr/SpellCheckComponent
> > > [2]
> > >
> >
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory
> > >
> >
>


Re: ClassCastException Issue

2010-10-27 Thread Alex Matviychuk
On Wed, Oct 27, 2010 at 03:57, Chris Hostetter  wrote:
> This almost certainly inidcates a classloader issue - i suspect you have
> multiple solr related jars in various places, and the FieldType class
> instance found when StrField is loaded comes from a different
> (incompatible) jar.

Thanks for the feedback. I am pretty sure you are right about this
being a class loader issue. This is a pretty big project and it isn't
immediately easy to figure out what is causing the conflict. I am
using EmbeddedSolr from the 1.4.1 release. These show up in the
project's lib folder when deployed to tomcat:

lucene-core-2.9.3.jar
lucene-highlighter-2.9.3.jar
lucene-memory-2.9.3.jar
lucene-misc-2.9.3.jar
lucene-queries-2.9.3.jar
lucene-snowball-2.9.3.jar
lucene-spellchecker-2.9.3.jar

According to the solr homepage, this is what solr 1.4.1 is built against.

I looked at the FieldType class and it looks like it only relies on
solr stuff and lucene.

I don't have much experience with classloader issues, any tips on how
to debug this?
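
One low-tech way (assuming a Unix shell; adjust the path to your Tomcat installation) is to list every jar that bundles the suspect class - more than one hit usually points at the conflict:

   for j in $(find /path/to/tomcat -name '*.jar'); do
     unzip -l "$j" 2>/dev/null | grep -q 'org/apache/solr/schema/FieldType.class' \
       && echo "$j"
   done

Running the same loop for StrField.class can also show whether the two classes are being loaded from different jars.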


And Ken:
I tried renaming the field as you suggested, but I get the same issue.

Thanks,
Alex Matviychuk


Re: Implementing Search Suggestion on Solr

2010-10-27 Thread Jakub Godawa
I am a real rookie at solr, but try this:
http://solr.pl/2010/10/18/solr-and-autocomplete-part-1/?lang=en

2010/10/27 Pablo Recio 

> Hi,
>
> I don't want to be annoying, but I'm looking for a way to do that.
>
> I repeat the question: is there a way to implement Search Suggestion
> manually?
>
> Thanks in advance.
> Regards,
>
> 2010/10/18 Pablo Recio Quijano 
>
> > Hi!
> >
> > I'm trying to implement some kind of Search Suggestion on a search engine
> I
> > have implemented. This search suggestions should not be automatically
> like
> > the one described for the SpellCheckComponent [1]. I'm looking something
> > like:
> >
> > "SAS oppositions" => "Public job offers for some-company"
> >
> > So I will have to define it manually. I was thinking about synonyms [2]
> but
> > I don't know if it's the proper way to do it, because semantically those
> > terms are not synonyms.
> >
> > Any ideas or suggestions?
> >
> > Regards,
> >
> > [1] http://wiki.apache.org/solr/SpellCheckComponent
> > [2]
> >
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory
> >
>


Re: Implementing Search Suggestion on Solr

2010-10-27 Thread Pablo Recio
Hi,

I don't want to be annoying, but I'm looking for a way to do that.

I repeat the question: is there a way to implement Search Suggestion
manually?

Thanks in advance.
Regards,

2010/10/18 Pablo Recio Quijano 

> Hi!
>
> I'm trying to implement some kind of Search Suggestion on a search engine I
> have implemented. This search suggestions should not be automatically like
> the one described for the SpellCheckComponent [1]. I'm looking something
> like:
>
> "SAS oppositions" => "Public job offers for some-company"
>
> So I will have to define it manually. I was thinking about synonyms [2] but
> I don't know if it's the proper way to do it, because semantically those
> terms are not synonyms.
>
> Any ideas or suggestions?
>
> Regards,
>
> [1] http://wiki.apache.org/solr/SpellCheckComponent
> [2]
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory
>


Re: Strange search

2010-10-27 Thread Gora Mohanty
On Wed, Oct 27, 2010 at 1:23 PM, ramzesua  wrote:
>
> Can anyone give me a working schema.xml and solrconfig from your own project?
[...]

Solr comes with an example configuration in example/solr/conf/ . Please see
http://lucene.apache.org/solr/tutorial.html for an example of how to get started
with that.

As a schema is specific to the problem domain, as is solrconfig.xml to some
extent, having access to someone else's version of this would probably not be
of much help. Instead, you could probably use the example configuration, and
ask questions about anything that is not documented, or not well understood.

Regards,
Gora


Re: Strange search

2010-10-27 Thread ramzesua

Can anyone give me a working schema.xml and solrconfig from your own project?
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Strange-search-tp998961p1778760.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Step by step tutorial for multi-language indexing and search

2010-10-27 Thread Lance Norskog
Yes, you can declare each field with the Spanish, French, etc. types. 
The _t and other types are "dynamic" and don't have to be declared. This 
feature is generally used when you have hundreds or thousands of fields. 
It is more clear to declare your fields.
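
For example, an explicitly declared Spanish field might look like this (the type and field names are illustrative; SnowballPorterFilterFactory ships with Solr 1.4):

   <fieldType name="text_es" class="solr.TextField" positionIncrementGap="100">
     <analyzer>
       <tokenizer class="solr.StandardTokenizerFactory"/>
       <filter class="solr.LowerCaseFilterFactory"/>
       <filter class="solr.SnowballPorterFilterFactory" language="Spanish"/>
     </analyzer>
   </fieldType>
   <field name="answer_es" type="text_es" indexed="true" stored="true"/>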


You're right- that error should not be thrown. You are not asking for a 
sort.
I don't know that one. You could try starting over with the Solr 1.4.1 
release binaries.


Jakub Godawa wrote:

Hi Erick, thanks for your help!

I need some technical help though... let me put it that way:

1. I deleted everything in index with:
curl http://localhost:8983/solr/update -F stream.body='<delete><query>*:*</query></delete>'
curl http://localhost:8983/solr/update -F stream.body='<commit/>'

2. I created 2 documents with fields: name_en, answer_en, name_es, answer_es
3. I made a query through the admin page, with this response:

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">9</int>
    <lst name="params">
      <str name="indent">on</str>
      <str name="start">0</str>
      <str name="q">Jakub</str>
      <str name="version">2.2</str>
      <str name="rows">10</str>
    </lst>
  </lst>
  <result name="response" numFound="2" start="0">
    <doc>
      <arr name="answer_en_t"><str>My name is Jakub</str></arr>
      <arr name="answer_es_t"><str>Me llamo Jakub.</str></arr>
      <str name="id">Question:1</str>
      <arr name="name_en_t"><str>What is your name?</str></arr>
      <arr name="name_es_t"><str>Como te llamas?</str></arr>
      <int name="pk_i">1</int>
      <arr name="text">
        <str>What is your name?</str>
        <str>My name is Jakub</str>
        <str>Como te llamas?</str>
        <str>Me llamo Jakub.</str>
      </arr>
    </doc>
    <doc>
      <arr name="answer_en_t"><str>I am in the kitchen Jakub!</str></arr>
      <arr name="answer_es_t"><str>Estoy en la cocina.</str></arr>
      <str name="id">Question:2</str>
      <arr name="name_en_t"><str>Where are you?</str></arr>
      <arr name="name_es_t"><str>Donde estas?</str></arr>
      <int name="pk_i">2</int>
      <arr name="text">
        <str>Where are you?</str>
        <str>I am in the kitchen Jakub!</str>
        <str>Donde estas?</str>
        <str>Estoy en la cocina.</str>
      </arr>
    </doc>
  </result>
</response>

4. Now I needed two dismaxes to make it work in two separate languages. Let's
say I just want to look up the *_en fields; then I created a dismax handler:

<requestHandler name="..." class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="echoParams">explicit</str>
    <float name="tie">0.01</float>
    <str name="qf">
      name_en_t^0.5 answer_en_t^1.0
    </str>
  </lst>
</requestHandler>

5. Hitting the url: http://localhost:8982/solr/English/?q=Jakub gave me an
error:

there are more terms than documents in field "name_en_t", but it's
impossible to sort on tokenized fields

6. I know that I should create a separate dismax for Spanish.

My questions:
1. Why are those fields named with *_t? I saw in schema.xml that they are
made dynamically. Can/should I create my own predefined fields in schema.xml?
Is this the place where you put "HOW" the field should be interpreted by
indexer?
2. Why the error in no. 5 is being thrown? I know that you cannot do sorting
on tokenized fields, but I don't see myself trying to index anything nor
tokenizing.
3. How should it be changed to work properly?

Thank you, and I ask for patience, as this can help many rookies like me to
get started.
Jakub.

2010/10/21 Erick Erickson

   

See below:

But also search the archives for multilanguage, this topic has been
discussed
many times before. Lucid Imagination maintains a Solr-powered (of course)
searchable
list at: http://www.lucidimagination.com/search/



On Wed, Oct 20, 2010 at 9:03 AM, Jakub Godawa 

wrote:
   
 

Hi everyone! (my first post)

I am new, but really curious about usefullness of lucene/solr in
   

documents
 

search from the web applications. I use Ruby on Rails to create one, with
plugin "acts_as_solr_reloaded" that makes connection between web app and
solr easy.

So I am in a point, where I know that good solution is to prepare
multi-language documents with fields like:
question_en, answer_en,
question_fr, answer_fr,
question_pl,  answer_pl... etc.

I need to create an index that would work with 6 languages: english,
french,
german, russian, ukrainian and polish.

My questions are:
1. Is it doable to have just one search field that behaves like Google's
for
all those documents? It can be an option to indicate a language to
   

search.
 
   

This depends on what you mean by do-able. Are you going to allow a French
user to search an English document (&  etc)? But the real answer is "yes,
you
can
if you .". There'll be tradeoffs.

Take a look at the dismax handler. It's kind of hard to grok all at once,
but you
can cause it to search across multiple fields. That is, the user types
"language",
and you can turn it into a complex query under the covers like
lang_en:language lang_fr:language lang_ru:language, etc. You can also
apply boosts. Note that this has obvious problems with, say, Russian. Half
your
job will be figuring out what will satisfy the user.

You could also have a #different# dismax handler defined for various
languages. Say
the user was coming from Spanish. Consider a browseES handler. See
solrconfig.xml
for the default dismax handler. The Solr book mentioned above describes
this.


 

2. How should I begin changing the solr/conf/schema.xml (or other) file
   

to
 

tailor it to my needs? As I am a real rookie here, I am still a bit
confused
about "fields", "fieldTypes" and their connection with particular field
(ex.
answer_fr) and the "tokenizers" and "analyzers". If someone can provide a
basic step by step tutorial on how to make it work in two languages I
   

would
 

be more that happy.

   

You have several choices here:
 

books "Lucene in Action" and "Solr 1.4, Enterprise SearchServer" both
   

have
discussions here.
 

Spend some time on the solr/admin/analysis page. Th

Re: FieldCollapsing and Stats or Sum ?!

2010-10-27 Thread stockiii

Okay.

I want one number per group. Yes, it's similar to the "group by" command.

Is there another way to get this?
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/FieldCollapsing-and-Stats-or-Sum-tp1773842p1778731.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: xpath processing

2010-10-27 Thread Lance Norskog
The XPathEntityProcessor does not do full XPath. It is a very limited 
set intended to be very fast.
You can add code in any scripting language, but that is not really 
performant.
Is it possible to use the RegexTransformer to find your records with 
regular expressions?
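
For reference, a hedged data-config.xml sketch of the RegexTransformer (the
entity, field names, and regex are made up for illustration, and
${files.fileAbsolutePath} assumes an outer FileListEntityProcessor entity
named "files"):

    <entity name="rec" processor="XPathEntityProcessor"
            url="${files.fileAbsolutePath}" forEach="/records/record"
            transformer="RegexTransformer">
      <field column="title" xpath="/records/record/title"/>
      <!-- pull a 4-digit year out of the title, e.g. "Foo (2009)" -->
      <field column="year" sourceColName="title" regex=".*\((\d{4})\)"/>
    </entity>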


Ken Stanley wrote:

On Fri, Oct 22, 2010 at 11:52 PM,  wrote:


The documentation says you don't need a dataSource for your
XPathEntityProcessor entity; in my configuration, I have mine set to the
name of the top-level FileListEntityProcessor. Everything else looks fine.
Can you provide one record from your data? Also, are you getting any errors
in your log?

- Ken


Re: command line to check if Solr is up running

2010-10-27 Thread Pradeep Singh
How about - Please do not respond to 20 emails at one time?

On Wed, Oct 27, 2010 at 12:33 AM, Lance Norskog  wrote:

> Please start new threads for new topics.
>
>
> Xin Li wrote:
>
>> As we know we can use browser to check if Solr is running by going to
>> http://$hostName:$portNumber/$masterName/admin, say
>> http://localhost:8080/solr1/admin. My question is: are there any ways to
>> check it using the command line? I used "curl http://localhost:8080" to
>> check my Tomcat; it worked fine. However, there is no response if I try
>> "curl http://localhost:8080/solr1/admin" (even when my Solr is running). Does
>> anyone know any command line alternatives?
>>
>> Thanks,
>> Xin
>


Re: Failing to successfully import international characters via DIH

2010-10-27 Thread Lance Norskog

CLOB is probably better for what you want.
Also, make sure the table is declared UTF-8 (or Unicode or whatever 
mysql calls it.)
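
For example, a sketch of converting an existing table to UTF-8 in MySQL (the
table name is a placeholder):

    ALTER TABLE documents CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;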


virtas wrote:

As it turns out, the issue was somewhere in MySQL. Not sure exactly where,
but something to do with BLOB.

Now I changed the text field from BLOB to varchar and started using
mysql_real_escape_string in my PHP code, and everything started working just
fine.

Thanks for the help


Re: command line to check if Solr is up running

2010-10-27 Thread Lance Norskog

Please start new threads for new topics.

Xin Li wrote:

As we know we can use browser to check if Solr is running by going to 
http://$hostName:$portNumber/$masterName/admin, say
http://localhost:8080/solr1/admin. My question is: are there any ways to
check it using the command line? I used "curl http://localhost:8080" to check
my Tomcat; it worked fine. However, there is no response if I try "curl
http://localhost:8080/solr1/admin" (even when my Solr is running). Does
anyone know any command line alternatives?

Thanks,
Xin
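
One way to script such a check (a sketch, not from this thread; it assumes
the stock PingRequestHandler is mapped at admin/ping in solrconfig.xml):

    # -f makes curl exit non-zero on an HTTP error, so $? tells you if Solr is up
    curl -sf http://localhost:8080/solr1/admin/ping > /dev/null && echo "Solr is up"

    # or just inspect the HTTP status code (200 means healthy)
    curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8080/solr1/admin/ping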


Re: Documents are deleted when Solr is restarted

2010-10-27 Thread Lance Norskog

These directories are shown at the top of the admin/index.jsp page.
Check out all of the pages off of admin/index.jsp; there is a lot of
information there about what Solr is doing.


Israel Ekpo wrote:

The Solr home is the -Dsolr.solr.home Java System property

Also make sure that -Dsolr.data.dir is defined for your data directory, if it
is not already defined in the solrconfig.xml file.
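
For example (the paths here are placeholders), with the example Jetty setup
both properties go on the command line; under Tomcat, put them in JAVA_OPTS:

    java -Dsolr.solr.home=/opt/solr/home \
         -Dsolr.data.dir=/opt/solr/data \
         -jar start.jar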

On Tue, Oct 26, 2010 at 10:46 AM, Upayavira  wrote:

   

You need to watch what you are setting your solr.home to. That is where
your indexes are being written. Are they getting overwritten/lost
somehow? Watch the files in that dir while doing a restart.

That's a start at least.

Upayavira

On Tue, 26 Oct 2010 16:40 +0300, "Mackram Raydan"
wrote:
 

Hey everyone,

I apologize if this question is rudimentary but it is getting to me and
I did not find anything reasonable about it online.

So basically I have a Solr 1.4.1 setup behind Tomcat 6. I used the
SolrTomcat wiki page for the setup. The system works exactly the way I want
it (proper search, highlighting, etc.). The problem, however, is that when I
restart my Tomcat server, all the data in Solr (i.e. the index) is simply
lost. The admin shows me that the number of docs is 0 when it was in the
thousands before.

Can someone please help me understand why the above is happening and how I
can work around it if possible?

Big thanks for any help you can send my way.

Regards,

Mackram


Re: Inconsistent slave performance after optimize

2010-10-27 Thread Lance Norskog

Did you restart all of these slave servers? That would help.
What garbage collection options do you use?
Which release of Solr?
How many Searchers are there in admin/stats.jsp?
Searchers hold open all kinds of memory. They are supposed to cycle out.

These are standard questions, but what you are seeing is definitely not
normal.


Separately, if you want a regular optimization regime, there is a new 
option called 'maxSegments' to the optimize command. If you have 
solrconfig mergeFactor set to 10, then optimize 'maxSegments=8' will 
roll up the very smallest segments. This allows you to have a gradual 
optimization (and replication overhead) instead of big ones.
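
For example, a sketch of a partial optimize posted to the update handler
(host, port, and core path are placeholders):

    curl http://localhost:8983/solr/update \
         -H 'Content-Type: text/xml' \
         --data-binary '<optimize maxSegments="8"/>'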


Mason Hale wrote:

Hello esteemed Solr community --

I'm observing some inconsistent performance on our slave servers after
recently optimizing our master server.

Our configuration is as follows:

 - all servers are hosted at Amazon EC2, running Ubuntu 8.04
 - 1 master with heavy insert/update traffic, about 125K new documents
per day (m1.large, ~8GB RAM)
- autocommit every 1 minute
 - 3 slaves (m2.xlarge instance sizes, ~16GB RAM)
- replicate every 5 minutes
- we have configured autowarming queries for these machines
- autowarmCount = 0
 - Total index size is ~7M documents

We were seeing increasing, but gradual performance degradation across all
nodes.
So we decided to try optimizing our index to improve performance.

In preparation for the optimize we disabled replication polling on all
slaves. We also turned off all
workers that were writing to the index. Then we ran optimize on the master.

The optimize took 45-60 minutes to complete, and the total size went from
68GB down to 23GB.

We then enabled replication on each slave one at a time.

The first slave we re-enabled took about 15 minutes to copy the new files.
Once the files were copied
the performance of slave plummeted. Average response time went from 0.75 sec
to 45 seconds.
Over the past 18 hours the average response time has gradually gone down to
around 1.2 seconds now.

Before re-enabling replication the second slave, we first removed it from
our load-balanced pool of available search servers.
This server's average query performance also degraded quickly, and then
(unlike the first slave we replicated) did not improve.
It stayed at around 30 secs per query. On the theory that this is a
cache-warming issue, we added this server
back to the pool in hopes that additional traffic would warm the cache. But
what we saw was a quick spike of much worse
performance (50 sec / query on average) followed by a slow/gradual decline
in average response times.
As of now (10 hours after the initial replication) this server is still
reporting an average response time of ~2 seconds.
This is much worse than before the optimize and is a counter-intuitive
result. We expected an index 1/3 the size would be faster, not slower.

On the theory that the index files needed to be loaded into the file system
cache, I used the 'dd' command to copy
the contents of the data/index directory to /dev/null, but that did not
result in any noticeable performance improvement.
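
(For reference, the warm-up read was along these lines; the index path is a
placeholder:)

    # read every index file once so it lands in the OS page cache
    for f in /var/solr/data/index/*; do dd if="$f" of=/dev/null bs=1M; done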

At this point, things were not going as expected. We did not expect the
replication after an optimize to result in such horrid
performance. So we decided to let the last slave continue to serve stale
results while we waited 4 hours for the
other two slaves to approach some acceptable performance level.

After the 4 hour break, we removed the 3rd and last slave server from our
load-balancing pool, then re-enabled replication.
This time we saw a tiny blip. The average performance went up to 1 second
briefly then went back to the (normal for us)
0.25 to 0.5 second range. We then added this server back to the
load-balancing pool and observed no degradation in performance.

While we were happy to avoid a repeat of the poor performance we saw on the
previous slaves, we are at a loss to explain
why this slave did not also have such poor performance.

At this point we're scratching our heads trying to understand:
(a) Why the performance of the first two slaves was so terrible after the
optimize. We think it's cache-warming related, but we're not sure.
  >  10 hours seems like a long time to wait for the cache to warm up
(b) Why the performance of the third slave was barely impacted. It should
have hit the same cold-cache issues as the other servers, if that is indeed
the root cause.
(c) Why performance of the first 2 slaves is still much worse after the
optimize than it was before the optimize,
whereas the performance of the 3rd slave is pretty much unchanged. We
expected the optimize to *improve* performance.

All 3 slave servers are identically configured, and the procedure for
re-enabling replication was identical for the 2nd and 3rd
slaves, with the exception of a 4-hour wait period.

We have confirmed that the 3rd slave did replicate, the number of documents
and total index s

Re: Jars required in classpath to run embedded solr server?

2010-10-27 Thread Lance Norskog
It requires all of the jars that are packed into solr.war. It is a full 
and complete implementation of indexing and searching.
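
A minimal sketch of starting an embedded server (the core name "core1" and
the use of -Dsolr.solr.home are assumptions following the SolrJ wiki pattern
for Solr 1.4):

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.embedded.EmbeddedSolrServer;
    import org.apache.solr.core.CoreContainer;

    public class EmbeddedExample {
        public static void main(String[] args) throws Exception {
            // picks up -Dsolr.solr.home and loads solr.xml from it
            CoreContainer.Initializer initializer = new CoreContainer.Initializer();
            CoreContainer container = initializer.initialize();
            SolrServer server = new EmbeddedSolrServer(container, "core1");
            System.out.println(server.ping().getStatus());
            container.shutdown();
        }
    }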


Tharindu Mathew wrote:

Hi everyone,

Do we need all lucene jars in the class path for this? Seems that the
solr-solrj and solr-core jars are not enough
(http://wiki.apache.org/solr/Solrj). It is asking for lucene jars in
the classpath. Could I know what jars are required to run this?

Thanks in advance.

   


Re: Solr sorting problem

2010-10-27 Thread Lance Norskog
You may not sort on a tokenized field. You may not sort on a multiValued
field. A field you sort on must have only one term per document.

If a field has more terms than there are documents in the index, (a) sorting
on it doesn't mean anything and (b) Lucene will throw an exception.
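
The usual workaround is a separate untokenized copy of the field just for
sorting; a schema.xml sketch (field names here are hypothetical):

    <field name="first_name" type="text" indexed="true" stored="true"/>
    <field name="first_name_sort" type="string" indexed="true" stored="false"/>
    <copyField source="first_name" dest="first_name_sort"/>

Then query with sort=first_name_sort asc.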


Erick Erickson wrote:

In general, the behavior when sorting is not predictable when
sorting on a tokenized field, which "text" is. What would
it mean to sort on a field with "erick" "Moazzam" as tokens
in a single document? Should it be in the "e"s or the "m"s?

That said, you probably want to watch out for case.

Best
Erick

On Fri, Oct 22, 2010 at 10:02 AM, Moazzam Khan  wrote:

   

For anyone who faced the same problem, changing the field to string
from text worked!

-Moazzam

On Fri, Oct 22, 2010 at 8:50 AM, Moazzam Khan  wrote:
 

The field type of the first name and last name is text. Could that be
why it's not sorting properly? I just changed it to string and started
a full-import. Hopefully that will work.

Thanks,
Moazzam

On Thu, Oct 21, 2010 at 7:42 PM, Jayendra Patil
  wrote:
   

Need additional information.
Sorting is easy in Solr, just by passing the sort parameter.

However, when it comes to text sorting, it depends on how you analyse and
tokenize your fields.
Sorting does not work on fields with multiple tokens.

 

http://wiki.apache.org/solr/FAQ#Why_Isn.27t_Sorting_Working_on_my_Text_Fields.3F
 

On Thu, Oct 21, 2010 at 7:24 PM, Moazzam Khan
 

wrote:
 
 

Hey guys,

I have a list of people indexed in Solr. I am trying to sort by their
first names but I keep getting results that are not alphabetically
sorted (I see the names starting with W before the names starting with
A). I have a feeling that the results are first being sorted by
relevancy then sorted by first name.

Is there a way I can get the results to be sorted alphabetically?

Thanks,
Moazzam


Re: How do I this in Solr?

2010-10-27 Thread Lance Norskog
There is also a feature called a 'filter'. If you use certain words a 
lot, you can make filter queries with just those words.  Look for 
'filter' and 'fq=' on the wiki.
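
For example (the field and values are hypothetical), each filter query is
cached independently of the main query:

    http://localhost:8983/solr/select?q=GPS&fq=type:mobile&fq=has_gps:true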


But really you can have hundreds of words in a query and not have a 
performance problem. Solr/Lucene is very fast. In benchmarking I have 
trouble sending enough requests to make several processors run at the 
same time.


Varun Gupta wrote:

Hi,

I have a lot of small documents (each containing 1 to 15 words) indexed in
Solr. For the search query, I want the search results to contain only those
documents that satisfy this criterion: "all of the words of the search result
document are present in the search query".

For example:
If I have the following documents indexed: "nokia n95", "GPS", "android",
"samsung", "samsung andriod", "nokia andriod", "mobile with GPS"

If I search with the text "samsung andriod GPS", the search results should
only contain "samsung", "GPS", "andriod" and "samsung andriod".

Is there a way to do this in Solr?

--
Thanks
Varun Gupta