Extended characters

2017-10-29 Thread Robert Brown

Hi,

I have a text field in my index containing extended characters, which 
I'd like to match against when searching without the extended characters.


e.g.  field contains "Ensō" which I want to match when searching for 
just "enso".


My current config for that field (type) is given below:


[The fieldType definition did not survive the archive intact; only these attribute fragments remain:]

positionIncrementGap="100" autoGeneratePhraseQueries="true">
synonyms="index_synonyms.txt" ignoreCase="true" expand="true" />
words="lang/stopwords_en.txt" />
words="lang/stopwords_en.txt" />
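For the question itself, the usual way to make "Ensō" match a query for "enso" is a character-folding filter in the analyzer chain, applied at both index and query time. A minimal sketch with placeholder field/type names, not the original config:

<fieldType name="text_folded" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- folds accented/extended characters such as "ō" to their ASCII equivalent "o" -->
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>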





















Re: (Tiny Index) Solr dies but not OOM

2017-05-26 Thread Robert Brown

Thanks Rick,

Swap is actually turned off, but reducing the number of Perl processes 
is a quick win.
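For reference, a quick way to confirm the swap state from the shell (standard Linux tools, nothing Solr-specific):

free -m          # the "Swap:" row shows total/used/free swap in MB
swapon --show    # lists active swap devices; no output means swap is off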




On 26/05/17 17:06, Rick Leir wrote:

Robert,
Cool, perl is taking most of your memory. 12 fcgi processes, at about 8% memory 
each. Try changing the web server config so it just forks 2 or 4 of them.

And check whether your swap device is working. With a working swap disk, maybe 
your system would just slow down instead of crashing. No, sorry, your swap _is_ 
working, and java is mostly swapped out. It must be slow. Cheers -- Rick

On May 26, 2017 1:25:55 PM EDT, Robert Brown  wrote:

Thanks Shawn,

It's more inquisitiveness now than anything.

http://web.lavoco.com/top.png

(forgot to mention mariadb on there too  :)



On 26/05/17 16:20, Shawn Heisey wrote:

On 5/26/2017 11:01 AM, Robert Brown wrote:

Let's assume I can't get more RAM - why would an index of no more

than

1MB (on disk) need so much?

(without getting into why I'm using Solr on such a small index in

the

first place  :)

My docs consist of 3 text fields for searching, all others are
strings/ints for facets and filtering, about 20 fields in total.

Currently just 500 docs.

If Solr were the only thing on the server, I feel fairly confident

that

you would not be having any problems.  Although your heap is only at
256MB, Java itself requires memory to run, and that memory may be

even

larger than 256MB.

Webservers, particularly if they are running in a forked-process
paradigm rather than a multi-threaded paradigm (common with Apache),
tend to be VERY memory hungry.  I assume that nginx is threaded, but
although a threaded webserver uses less memory than a forked-process
webserver, a busy site is still going to eat up a lot of memory.

With

only 2GB of memory, you should be limiting the number of idle
threads/processes the webserver will keep around, and you might want

to

limit the number of simultaneous connections the webserver allows.

Your perl webapp is a complete unknown where memory usage is

concerned.

If you run top, press shift-M to sort by memory, grab a screenshot,

and

put that screenshot somewhere we can access it by URL, I'll be able

to

see the overall memory usage of the server and at least tell you

what's

happening.

The best recommendation I can make, even without that top screenshot,

is

to add memory to the server, or to get a second server and dedicate

it

to Solr.

Thanks,
Shawn





Re: (Tiny Index) Solr dies but not OOM

2017-05-26 Thread Robert Brown

Thanks Shawn,

It's more inquisitiveness now than anything.

http://web.lavoco.com/top.png

(forgot to mention mariadb on there too  :)



On 26/05/17 16:20, Shawn Heisey wrote:

On 5/26/2017 11:01 AM, Robert Brown wrote:

Let's assume I can't get more RAM - why would an index of no more than
1MB (on disk) need so much?

(without getting into why I'm using Solr on such a small index in the
first place  :)

My docs consist of 3 text fields for searching, all others are
strings/ints for facets and filtering, about 20 fields in total.

Currently just 500 docs.

If Solr were the only thing on the server, I feel fairly confident that
you would not be having any problems.  Although your heap is only at
256MB, Java itself requires memory to run, and that memory may be even
larger than 256MB.

Webservers, particularly if they are running in a forked-process
paradigm rather than a multi-threaded paradigm (common with Apache),
tend to be VERY memory hungry.  I assume that nginx is threaded, but
although a threaded webserver uses less memory than a forked-process
webserver, a busy site is still going to eat up a lot of memory.  With
only 2GB of memory, you should be limiting the number of idle
threads/processes the webserver will keep around, and you might want to
limit the number of simultaneous connections the webserver allows.

Your perl webapp is a complete unknown where memory usage is concerned.

If you run top, press shift-M to sort by memory, grab a screenshot, and
put that screenshot somewhere we can access it by URL, I'll be able to
see the overall memory usage of the server and at least tell you what's
happening.

The best recommendation I can make, even without that top screenshot, is
to add memory to the server, or to get a second server and dedicate it
to Solr.

Thanks,
Shawn





Re: (Tiny Index) Solr dies but not OOM

2017-05-26 Thread Robert Brown
Let's assume I can't get more RAM - why would an index of no more than 
1MB (on disk) need so much?


(without getting into why I'm using Solr on such a small index in the 
first place  :)


My docs consist of 3 text fields for searching, all others are 
strings/ints for facets and filtering, about 20 fields in total.


Currently just 500 docs.



On 26/05/17 15:43, Erick Erickson wrote:

Or get more physical memory? Solr _likes_ memory, you won't be able to
do much with only 2G physical memory..

On Fri, May 26, 2017 at 2:00 AM, Robert Brown  wrote:

Thanks Rick,

Turns out it was the kernel killing it, dmesg showed:

Out of memory: Kill process 2647 (java) score 118 or sacrifice child
Killed process 2647, UID 1006, (java) total-vm:2857484kB, anon-rss:227440kB,
file-rss:12kB

Now I just need to tell the kernel not to do that.

The other things on the box are nginx and my Perl web-app.

Those 2 are both restarted upon a deploy, which is what knocks Solr down.

I'll experiment with different heap values, with a 1MB index (on disk) I
should be able to get it fairly low.

I have the same occasional problem on my dev box, which only has 1GB RAM -
quite surprising it runs at all on there if it has half the ram of the live
box(es).




On 26/05/17 07:35, Rick Leir wrote:

Robert,

What is at the end of solr.log when it has died?

Is there anything in syslog or messages?

What is the other app?

Run the top command, memory screen, on Ubuntu:

$ top -o RES

I have never used strace(1) on Solr, but that is an option. Run Solr in
strace with the appropriate options to reduce the voluminous output.

You could upgrade your hardware cheaply at a surplus store (almost every
machine in my office is surplus .. think .. actually, every one).

cheers -- Rick


On 2017-05-25 06:55 PM, Robert Brown wrote:

Hi,

I'm currently running 6.5.1 with a tiny index, less than 1MB.

When I restart another app on the same server as Solr, Solr occasionally
dies, but no solr_oom_killer.log file.

Heap size is 256MB (~30MB used), Physical RAM 2GB, typically using 1.5GB.

How else can I debug what's causing it?

Also, for such a small index, what would be an appropriate heap size?
10MB seems to just kill it (OOM log file produced) as it starts.

Thanks,
Rob





Re: (Tiny Index) Solr dies but not OOM

2017-05-26 Thread Robert Brown

Thanks Rick,

Turns out it was the kernel killing it, dmesg showed:

Out of memory: Kill process 2647 (java) score 118 or sacrifice child
Killed process 2647, UID 1006, (java) total-vm:2857484kB, 
anon-rss:227440kB, file-rss:12kB


Now I just need to tell the kernel not to do that.
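One way to do that, sketched on the assumption that Solr was started via the bundled Jetty start.jar and can be found with pgrep (the setting does not survive a restart):

# Make the Solr JVM a less likely victim for the kernel OOM killer.
# -1000 exempts it entirely; values between -1000 and 0 just lower its priority.
echo -800 | sudo tee /proc/$(pgrep -f start.jar | head -n1)/oom_score_adj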

The other things on the box are nginx and my Perl web-app.

Those 2 are both restarted upon a deploy, which is what knocks Solr down.

I'll experiment with different heap values, with a 1MB index (on disk) I 
should be able to get it fairly low.


I have the same occasional problem on my dev box, which only has 1GB RAM 
- quite surprising it runs at all on there if it has half the ram of the 
live box(es).




On 26/05/17 07:35, Rick Leir wrote:

Robert,

What is at the end of solr.log when it has died?

Is there anything in syslog or messages?

What is the other app?

Run the top command, memory screen, on Ubuntu:

$ top -o RES

I have never used strace(1) on Solr, but that is an option. Run Solr 
in strace with the appropriate options to reduce the voluminous output.


You could upgrade your hardware cheaply at a surplus store (almost 
every machine in my office is surplus .. think .. actually, every one).


cheers -- Rick


On 2017-05-25 06:55 PM, Robert Brown wrote:

Hi,

I'm currently running 6.5.1 with a tiny index, less than 1MB.

When I restart another app on the same server as Solr, Solr 
occasionally dies, but no solr_oom_killer.log file.


Heap size is 256MB (~30MB used), Physical RAM 2GB, typically using 
1.5GB.


How else can I debug what's causing it?

Also, for such a small index, what would be an appropriate heap 
size?  10MB seems to just kill it (OOM log file produced) as it starts.


Thanks,
Rob







(Tiny Index) Solr dies but not OOM

2017-05-25 Thread Robert Brown

Hi,

I'm currently running 6.5.1 with a tiny index, less than 1MB.

When I restart another app on the same server as Solr, Solr occasionally 
dies, but no solr_oom_killer.log file.


Heap size is 256MB (~30MB used), Physical RAM 2GB, typically using 1.5GB.

How else can I debug what's causing it?

Also, for such a small index, what would be an appropriate heap size?  
10MB seems to just kill it (OOM log file produced) as it starts.


Thanks,
Rob



Grouping performance with MLT

2016-07-05 Thread Robert Brown

Hi All,

I have an index with 10m documents.

When performing an MLT query and grouping by a field, response times are 
roughly 20s.


The group field is currently populated with unique values, as we now 
start to manually group documents (hence using MLT).


The group field has docValues turned on.

The index fits in RAM, and is a single shard with 1 replica, version 6.0.1

Example query is :

http://server:8983/solr/uk/select?group=true&group.field=product&group.limit=10&group.ngroups=true&group.sort=price%20asc&q={!mlt%20qf=name,brand}bg-uk-9cef78e8d5812f14bfebbf2801888a43 



With a QTime of 18000.

A "normal" group query is also over 1 second...

http://server:8983/solr/uk/select?group=true&group.field=product&group.limit=10&group.ngroups=true&group.sort=price%20asc&q=iphone 



Does anyone have any suggestions for improving performance?

Thanks,
Rob



Re: Alternate Port Not Working for Solr 6.0.0

2016-06-02 Thread Robert Brown
In addition to a separate proxy you could use iptables, I use this 
technique for another app (running on port 5000 but requests come in 
port 80)...



*nat
:PREROUTING ACCEPT [0:0]
:POSTROUTING ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]

-A PREROUTING -i eth0 -p tcp --dport 80 -j REDIRECT --to-port 5000

COMMIT
*filter
:INPUT ACCEPT [0:0]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
-A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
-A INPUT -p icmp -j ACCEPT
-A INPUT -i lo -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp --dport 22 -j ACCEPT





On 02/06/16 20:48, Shawn Heisey wrote:

On 6/2/2016 12:51 PM, Teague James wrote:

Thanks for that suggestion, but I had found that file and I had
changed it to 80, but still no luck. Solr isn't running because it
never started in the first place. I also tried the -p 80 flag using
the install script and it failed.

Something I just thought of, but should have remembered earlier:  In
order to bind to port 80, you must run as root.  Binding to any port
below 1024 requires privilege.  It looks like you installed Solr to run
as the user named "solr" -- so it cannot do what it is being asked to do.

It might be possible to fiddle with selinux and achieve this without
running as root, but I have no idea how that is done.  You can also
install a proxy in front of Solr that runs on port 80, and accesses Solr
via some other port.

This is one of the reasons that Solr runs on a high port number by default.

Thanks,
Shawn





Re: [E] Re: Faceting Question(s)

2016-06-02 Thread Robert Brown
MaryJo, I think you've misunderstood.  The counts are different simply 
because the 2nd query contains a filter on a facet value from the 1st 
query - that's completely expected.


The issue is how to get the original facet counts (with no filters but 
same q) in the same call as also filtering by one of those facet values.


Personally I don't think it's possible, but will be interested to hear 
others input, since it's a very common situation for me - I cache the 
first result in memcached and tag future queries as related to the first.


Or could you always make 2 calls back to Solr (one original (again), and 
one with the filters), the caches should help massively.




On 02/06/16 19:07, MaryJo Sminkey wrote:

And you're saying the count for the second query is different than what was
returned in the facet? You may need to check for any defaults you have set
up in the solrconfig for the select parser, if for instance you have any
grouping going on, but aren't doing grouping in your facet, that could
result in the counts being off.

MJ




On Thu, Jun 2, 2016 at 2:01 PM, Jamal, Sarfaraz <
sarfaraz.ja...@verizonwireless.com.invalid> wrote:


Absolutely,

Here is what it looks like:

This brings the right counts as it should
http://
**select?q=video&hl=true&hl.fl=*&hl.snippets=20&facet=true&facet.field=team

Then when I specify which team
http://
**select?q=video&hl=true&hl.fl=*&hl.snippets=20&facet=true&facet.field=team&fq=team:rollback

The counts are obviously different now, as the result set is limited to
one team.

Sas

-Original Message-
From: MaryJo Sminkey [mailto:mjsmin...@gmail.com]
Sent: Thursday, June 2, 2016 1:56 PM
To: solr-user@lucene.apache.org
Subject: [E] Re: Faceting Question(s)

Jamai - what is your q= set to? And do you have a fq for the original
query? I have found that if you do a wildcard search (*.*) you have to be
careful about other parameters you set as that can often result in the
numbers returned being off. In my case, my defaults had things like edismax
settings for phrase boosting, etc. that don't apply if there isn't a search
term, and once I removed those for a wildcard search I got the correct
numbers. So possibly your facet query itself may be set up correctly but
something else in the parameters and/or filters with the two queries may be
the cause of the difference.

Mary Jo


On Thu, Jun 2, 2016 at 1:47 PM, Jamal, Sarfaraz <
sarfaraz.ja...@verizonwireless.com.invalid> wrote:


Hello Everyone,

I am working on implementing some basic faceting into my project.

I have it working the way I want to, but I feel like there is probably
a better way the way I went about it.

* I want to show a category and its count.
* when someone clicks a category, it sets a FQ= to that category.

But now that the results are being filtered, the category counts from
the original query without the filters are off.

So, I have a single api call that I make with rows set to 0 and the
base query without any filters, and use that to display my categories.

And then I call the api again, this time to get the results. And the
category count is the same.

I hope that makes sense.

I was hoping  facet.query would be of help, but I am not sure I
understood it properly.

Thanks in advance =)

Sas





MongoDB and Solr - Massive re-indexing

2016-06-02 Thread Robert Brown

Hi,

Currently we import data-sets from various sources (csv, xml, json, 
etc.) and POST to Solr, after some pre-processing to get it into a 
consistent format, and some other transformations.


We currently dump out to a json file in batches of 1,000 documents and 
POST that file to Solr.
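For reference, a sketch of that kind of batch POST (collection name and file path are placeholders):

curl 'http://localhost:8983/solr/products/update' \
     -H 'Content-Type: application/json' \
     --data-binary @batch_0001.json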


Roughly 50m documents come in throughout the day, and are fully 
re-indexed.  Following the update calls, we then delete any docs based 
on a last_seen datetime field, which removes documents before the most 
recent run, related to that run.


I'm now importing our raw data firstly into MongoDB, in raw format. The 
data will then be translated and stored in another Mongo collection.  
These 2 steps are for business reasons.


That final Mongo collection then needs to be sent to Solr.

My question is whether sending batches of 1,000 documents to Solr is 
still beneficial (thinking about docs that may not change), or if I 
should look at the MongoDB connector for Solr, based on the volume of 
incoming data we see.


Would the connector still see all docs updating if I re-insert them 
blindly, and thus still send all 50m documents back to Solr everyday anyway?


Is my setup quite typical for the MongoDB connector?

Thanks,
Rob





Re: Idle timeout expired: 50000/50000 ms

2016-04-29 Thread Robert Brown

Thanks Shawn,

I'm definitely not looking to just upping the timeout, like you say, 
there's a bigger issue to be resolved.


My indexes are between 1m and up to 60m docs (30m per shard, ~70GB on 
disk each).


All of these collections get completely refreshed at least once a day, 
data may not actually be changing, but the JSON files are re-uploaded 
currently.


I've used a 3GB heap for all of the nodes, perhaps that needs upping a bit.

Strange that I've not seen this issue before though.

Any good advice/guidance for analysing and tweaking GC?




On 29/04/16 18:52, Shawn Heisey wrote:

On 4/28/2016 3:13 PM, Robert Brown wrote:

I operate several collections (about 7-8) all using the same 5-node
ZooKeeper cluster.  They've been in production for 3 months, with only
2 previous issues where a Solr node went down.

Tonight, during several updates to the various collections, a handful
failed due to the below error.

Could this be related to ZooKeeper in any way?  If so, what could I
check to ensure everything is running smoothly?

The collections are a mix of 1 and 2 shards, all with 1 replica.

Updates are performed in batches of 1000 in JSON files.

Are there any other things I could/should be checking?


$VAR1 = {
  'error' => {
   'code' => 500,
   'msg' => 'java.util.concurrent.TimeoutException:
Idle timeout expired: 50000/50000 ms',

This idle timeout is configured in Jetty.  The default setting in the
jetty config provided with Solr 5.x is 50 seconds.

If your update requests are taking too long for the Jetty idle timeout,
then I think you're having a general performance problem with Solr.
Increasing the timeout might help in the short term, but unless you fix
the underlying performance issue, you'd probably just run into the new
timeout at some point in the future.
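If you do need to raise it as a stop-gap, the value lives in Solr's bundled Jetty configuration; a sketch of the relevant setting (the exact file, server/etc/jetty.xml or jetty-http.xml, and the property name vary by Solr version, so treat this as something to verify against your install):

<!-- connector idle timeout in milliseconds; 50000 is the shipped default being hit here -->
<Set name="idleTimeout">
  <Property name="solr.jetty.http.idleTimeout" default="50000"/>
</Set>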

Most severe performance problems like this are memory related, and are
solved by adding more memory.  Sometimes that is Java heap memory,
sometimes that is memory that is not allocated to a program.  Sometimes
both are required.

Thanks,
Shawn





Idle timeout expired: 50000/50000 ms

2016-04-28 Thread Robert Brown

Hi,

I operate several collections (about 7-8) all using the same 5-node 
ZooKeeper cluster.  They've been in production for 3 months, with only 2 
previous issues where a Solr node went down.


Tonight, during several updates to the various collections, a handful 
failed due to the below error.


Could this be related to ZooKeeper in any way?  If so, what could I 
check to ensure everything is running smoothly?


The collections are a mix of 1 and 2 shards, all with 1 replica.

Updates are performed in batches of 1000 in JSON files.

Are there any other things I could/should be checking?


$VAR1 = {
 'error' => {
  'code' => 500,
  'msg' => 'java.util.concurrent.TimeoutException: 
Idle timeout expired: 50000/50000 ms',
  'trace' => 'java.io.IOException: 
java.util.concurrent.TimeoutException: Idle timeout expired: 50000/50000 ms
at 
org.eclipse.jetty.util.SharedBlockingCallback$Blocker.block(SharedBlockingCallback.java:234)
at 
org.eclipse.jetty.server.HttpInputOverHTTP.blockForContent(HttpInputOverHTTP.java:66)
at 
org.eclipse.jetty.server.HttpInput$1.waitForContent(HttpInput.java:476)

at org.eclipse.jetty.server.HttpInput.read(HttpInput.java:121)
at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
at java.io.InputStreamReader.read(InputStreamReader.java:184)
at org.noggit.JSONParser.fill(JSONParser.java:196)
at org.noggit.JSONParser.getMore(JSONParser.java:203)
at org.noggit.JSONParser.readStringChars2(JSONParser.java:646)
at org.noggit.JSONParser.readStringChars(JSONParser.java:626)
at org.noggit.JSONParser.getStringChars(JSONParser.java:1029)
at org.noggit.JSONParser.getString(JSONParser.java:1017)
at 
org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.parseDoc(JsonLoader.java:501)
at 
org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.handleAdds(JsonLoader.java:470)
at 
org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.processUpdate(JsonLoader.java:135)
at 
org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.load(JsonLoader.java:114)

at org.apache.solr.handler.loader.JsonLoader.load(JsonLoader.java:77)
at 
org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:95)
at 
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:70)
at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:156)

at org.apache.solr.core.SolrCore.execute(SolrCore.java:2073)
at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:658)
at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:457)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:222)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:181)
at 
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
at 
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
at 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
at 
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577)
at 
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223)
at 
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
at 
org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
at 
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
at 
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
at 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
at 
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
at 
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:110)
at 
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)

at org.eclipse.jetty.server.Server.handle(Server.java:499)
at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:310)
at 
org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)
at 
org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:540)
at 
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
at 
org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)

at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.concurrent.TimeoutException: Idle timeout expired: 
50000/50000 ms
at 
org.eclipse.jetty.io.IdleTimeout.checkIdleTimeout(IdleTimeout.java:161)

at org.eclipse.jetty.io.IdleTimeout$1.run(IdleTimeout.java:50)
   

HTTP Client Only

2016-04-14 Thread Robert Brown

Hi,

I have a collection with 2 shards, 1 replica each.

When I send updates, I currently /admin/ping each of the nodes, and then 
pick one at random.


I'm guessing it makes more sense to only send updates to one of the 
leaders, so I'm contemplating getting the collection status instead, and 
filter out the leaders.
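A sketch of one way to do that with the Collections API (host and collection name are placeholders); each replica in the response carries a "leader" flag:

curl 'http://localhost:8983/solr/admin/collections?action=CLUSTERSTATUS&collection=mycollection&wt=json'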


Is there anything else I should be aware of, apart from using a Java 
client, etc.


I guess the ping becomes redundant?

Thanks,
Rob





Commiting with no updates

2016-04-13 Thread Robert Brown

Hi,

My autoSoftCommit is set to 1 minute.  Does this actually affect things 
if no documents have actually been updated/created?  Will this also 
affect the clearing of any caches?


Is this also the same for hard commits, either with autoCommit or making 
an explicit HTTP request to commit?


Thanks,
Rob



Bad Request

2016-04-12 Thread Robert Brown

Hi,

My collection had issues earlier, 1 shard showed as Down, the other only 
replica was Gone.


Both were actually still up and running, no disk or CPU issues.

This occurred during updates.

The server since recovered after a reboot.

Upon trying to update the index again, I'm now getting constant Bad 
Requests.


Does anyone know what the issue could be, and/or how to resolve it?

org.apache.solr.common.SolrException: Bad Request

request: 
http://hostname:8983/solr/de_shard1_replica1/update?update.distrib=TOLEADER&distrib.from=http%3A%2F%2Fhostname%3A8983%2Fsolr%2Fde_shard2_replica2%2F&wt=javabin&version=2
at 
org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.sendUpdateStream(ConcurrentUpdateSolrClient.java:287)
at 
org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.run(ConcurrentUpdateSolrClient.java:160)
at 
org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$1.run(ExecutorUtil.java:232)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)

at java.lang.Thread.run(Thread.java:745)

I also occasionally get "Exception writing document id 
be-de-109513-307573357 to the index; possible analysis error." which was 
the first bunch of errors I saw.


Thanks,
Rob




Re: Range filters: inclusive?

2016-04-11 Thread Robert Brown

It's a string field, ean...

http://paste.scsys.co.uk/510132



On 04/11/2016 06:00 PM, Yonik Seeley wrote:

On Mon, Apr 11, 2016 at 12:52 PM, Robert Brown  wrote:

Hi,

When I perform a range query of ['' TO *] to filter out docs where a
particular field has a value, this does what I want, but I thought using the
square brackets was inclusive, so empty-string values should actually be
included?

They should be.  Are you saying that zero length values are not
included by the range query above?

-Yonik




Range filters: inclusive?

2016-04-11 Thread Robert Brown

Hi,

When I perform a range query of ['' TO *] to filter out docs where a 
particular field has a value, this does what I want, but I thought using 
the square brackets was inclusive, so empty-string values should 
actually be included?


The JSON I post to Solr has empty values, not null/undefined.

Am I missing something or is this a feature?

Thanks,
Rob




Re: Delete by query, including negative filters

2016-04-09 Thread Robert Brown

Thanks Erick,

The *'s were accidental, if that makes any difference whatsoever.




On 09/04/16 15:42, Erick Erickson wrote:

Should work, or
-merchant_id:(12345 OR 9876*)

But do be aware that Solr is not strict boolean logic. The above is
close enough for this purpose. Here's an excellent writeup on this
subtlety:

https://lucidworks.com/blog/2011/12/28/why-not-and-or-and-not/

Best,
Erick

On Sat, Apr 9, 2016 at 3:51 AM, Robert Brown  wrote:

Hi,

I have this delete query: "*partner:pg AND market:us AND last_seen:[* TO
2016-04-09T02:01:06Z]*"

And would like to add "AND merchant_id != 12345 AND merchant_id != 98765"

Would this be done by including "*AND -merchant_id:12345 AND
-merchant_id:98765*" ?

Thanks,
Rob





Delete by query, including negative filters

2016-04-09 Thread Robert Brown

Hi,

I have this delete query: "*partner:pg AND market:us AND last_seen:[* TO 
2016-04-09T02:01:06Z]*"


And would like to add "AND merchant_id != 12345 AND merchant_id != 98765"

Would this be done by including "*AND -merchant_id:12345 AND 
-merchant_id:98765*" ?
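For reference, a sketch of the combined delete as it would be posted (collection name is a placeholder; the extra clauses are the negative merchant_id filters asked about above):

curl 'http://localhost:8983/solr/mycollection/update?commit=true' \
     -H 'Content-Type: text/xml' \
     --data-binary '<delete><query>partner:pg AND market:us AND last_seen:[* TO 2016-04-09T02:01:06Z] AND -merchant_id:12345 AND -merchant_id:98765</query></delete>'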


Thanks,
Rob



Re: Update Speed: QTime 1,000 - 5,000

2016-04-05 Thread Robert Brown

The QTime's are from the updates.

We don't have the resource right now to switch to SolrJ, but I would 
assume only sending updates to the leaders would take some redirects out 
of the process, I can regularly query for the collection status to know 
who's who.


I'm now more interested in the caches that are thrown away on 
softCommit, since we do see some performance issues on queries too. 
Would these caches affect querying and faceting?


Thanks,
Rob



On 06/04/16 00:41, Erick Erickson wrote:

bq: Apart from the obvious delay, I'm also seeing QTime's of 1,000 to 5,000

QTimes for what? The update? Queries? If for queries, autowarming may help,
especially as your soft commit is throwing away all the top-level
caches (i.e. the
ones configured in solrconfig.xml) every minute. It shouldn't be that bad on the
lower-level Lucene caches though, at least the per-segment ones.
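For reference, a sketch of what autowarming looks like in solrconfig.xml (cache classes are the standard ones; sizes and counts are placeholders, not recommendations):

<filterCache      class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="64"/>
<queryResultCache class="solr.LRUCache"     size="512" initialSize="512" autowarmCount="64"/>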

You'll get some improvement by using SolrJ (with CloudSolrClient)
rather than cURL.
no matter which node you hit, about half your documents will have to
be forwarded to
the other shard when using cURL, whereas SolrJ (with CloudSolrClient)
will route the docs
to the correct leader right from the client.

Best,
Erick

On Tue, Apr 5, 2016 at 2:53 PM, John Bickerstaff
 wrote:

A few thoughts...

 From a black-box testing perspective, you might try changing that
softCommit time frame  to something longer and see if it makes a difference.

The size of  your documents will make a difference too - so the comparison
to 300 - 500 on other cloud setups may or may not be comparing apples to
oranges...

Are the "new" documents actually new or are you overwriting existing solr
doc ID's?  If you are overwriting, you may want to optimize and see if that
helps.



On Tue, Apr 5, 2016 at 2:38 PM, Robert Brown  wrote:


Hi,

I'm currently posting updates via cURL, in batches of 1,000 docs in JSON
files.

My setup consists of 2 shards, 1 replica each, 50m docs in total.

These updates are hitting a node at random, from a server across the
Internet.

Apart from the obvious delay, I'm also seeing QTime's of 1,000 to 5,000.

This strikes me as quite high since I also sometimes see times of around
300-500, on similar cloud setups.

The setup is running on VMs with rotary disks, and enough RAM to hold
roughly half the entire index in disk cache (I'm in the process of
upgrading this).

I hard commit every 10 minutes but don't open a new searcher, just to make
sure data is "safe".  I softCommit every 1 minute to make data available.

Are there any obvious things I can do to improve my situation?

Thanks,
Rob









Update Speed: QTime 1,000 - 5,000

2016-04-05 Thread Robert Brown

Hi,

I'm currently posting updates via cURL, in batches of 1,000 docs in JSON 
files.


My setup consists of 2 shards, 1 replica each, 50m docs in total.

These updates are hitting a node at random, from a server across the 
Internet.


Apart from the obvious delay, I'm also seeing QTime's of 1,000 to 5,000.

This strikes me as quite high since I also sometimes see times of around 
300-500, on similar cloud setups.


The setup is running on VMs with rotary disks, and enough RAM to hold 
roughly half the entire index in disk cache (I'm in the process of 
upgrading this).


I hard commit every 10 minutes but don't open a new searcher, just to 
make sure data is "safe".  I softCommit every 1 minute to make data 
available.
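In solrconfig.xml terms, that policy looks roughly like this (times are the ones described above, in milliseconds):

<autoCommit>
  <maxTime>600000</maxTime>        <!-- hard commit every 10 minutes -->
  <openSearcher>false</openSearcher>
</autoCommit>
<autoSoftCommit>
  <maxTime>60000</maxTime>         <!-- soft commit every minute; makes new docs visible -->
</autoSoftCommit>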


Are there any obvious things I can do to improve my situation?

Thanks,
Rob






Re: Parallel Updates

2016-04-04 Thread Robert Brown

Thanks John,

I have 2 shards, 1 replica in each.

The issue is the external processing job(s) I have to convert external 
data into JSON, and then upload it via cURL.


Will one Solr server only accept one update at a time and have any 
others queued?  (And possibly timeout).


I like the idea of having my leaders only deal with indexing, and the 
replicas only deal with searching - how can I actually configure this?  
And is it actually required with my shard setup?


I'm doing hard commits every minute but not opening a new searcher (so I 
know the data is safe), with soft commits happening every 10 minutes to 
make the data visible.


Cheers,
Rob


On 04/04/16 22:40, John Bickerstaff wrote:

Will the processes be Solr processes?  Or do you mean multiple threads
hitting the same Solr server(s)?

There will be a natural bottleneck at one Solr server if you are hitting it
with a lot of threads - since that one server will have to do all the
indexing.

I don't know if this idea is helpful, but if your underlying challenge is
protecting the user experience and preventing slowdown during the indexing,
you can have a separate Solr server that just accepts incoming documents
(and bearing the cost of the indexing) while serving documents from other
Solr servers...

There will be a slight cost for those "serving servers" to get updates from
the "indexing server" but that will be much less than the cost of indexing
directly.

If processing power was really important you could have two or more
"indexing" servers and fire multiple threads at each one...

You probably already know this, but the key is how often you "commit" and
force the indexing to occur...

On Mon, Apr 4, 2016 at 3:33 PM, Robert Brown  wrote:


Hi,

Does Solr have any sort of limit when attempting multiple updates, from
separate clients?

Are there any safe thresholds one should try to stay within?

I have an index of around 60m documents that gets updated at key points
during the day from ~200 downloaded files - I'd like to fork off multiple
processes to deal with the incoming data to get it into Solr quicker.

Thanks,
Rob







Parallel Updates

2016-04-04 Thread Robert Brown

Hi,

Does Solr have any sort of limit when attempting multiple updates, from 
separate clients?


Are there any safe thresholds one should try to stay within?

I have an index of around 60m documents that gets updated at key points 
during the day from ~200 downloaded files - I'd like to fork off 
multiple processes to deal with the incoming data to get it into Solr 
quicker.


Thanks,
Rob




Re: Facet by truncated date

2016-03-31 Thread Robert Brown

Hi Emir,

What if I don't want to specify a range?  Or would I have to do year 0 
to NOW?


Thanks,
Rob


On 03/31/2016 10:26 AM, Emir Arnautovic wrote:

Hi Yago,
Not sure if I misunderstood the case, but assuming you have date field 
called my_date you can facet last 10 days by day using range queries:


?facet.range=my_date&facet.range.start=NOW/DAY-10DAYS&facet.range.end=NOW/DAY+1DAY&facet.range.gap=+1DAY 



Regards,
Emir

On 31.03.2016 11:14, Yago Riveiro wrote:
If you want to aggregate by the truncated date, I think the only way to
do it is to use another field containing the truncated date.

You can use an update request processor to calculate the truncated date
(https://wiki.apache.org/solr/UpdateRequestProcessor) or add the field at
indexing time.

date:"2016-03-31T12:00:00Z"

truncated_date_s:'2016-03-31' or truncated_date_i:20160331 (this should be
more memory efficient)

--
Yago Riveiro



On Mar 31 2016, at 10:08 am, Emir Arnautovic
<emir.arnauto...@sematext.com> wrote:


Hi Robert,

You can use range faceting and set facet.range.gap to control how dates
are "truncated".


Regards,

Emir


On 31.03.2016 10:52, Robert Brown wrote:

> Hi,
>
> Is it possible to facet by a date (solr.TrieDateField) but truncated
> to the day, or even the hour?
>
> If not, are there any other options apart from storing that truncated
> data in another (string?) field?
>
> Thanks,
> Rob


--

Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * <http://sematext.com/>








Facet by truncated date

2016-03-31 Thread Robert Brown

Hi,

Is it possible to facet by a date (solr.TrieDateField) but truncated to 
the day, or even the hour?


If not, are there any other options apart from storing that truncated 
data in another (string?) field?


Thanks,
Rob



Re: Index not fitting in memory (file-cache)

2016-03-24 Thread Robert Brown

Thanks Shawn,

One of my indexes is 70G on disk but only has 25G RAM, usually it's fast 
as hell, less than 0.5s for a full API wrapped call, but we do 
occasionally see searches taking 2.5 seconds.


I'm currently shuffling VMs around to increase the RAM, good to hear 
this may solve those random slowdowns - or at least rule it out.




On 03/24/2016 01:44 PM, Shawn Heisey wrote:

On 3/24/2016 4:02 AM, Robert Brown wrote:

If my index data directory size is 70G, and I don't have 70G (plus
heap, etc) in the system, this will occasionally affect search speed
right?  When Solr has to resort to reading from disk?

Before I go out and throw more RAM into the system, in the above
example, what would you recommend?

Having enough memory available to cache all your index data offers the
best possible performance.

You may be able to achieve acceptable performance when you don't have
that much memory, but I would try to make sure there's at least enough
memory available to cache *half* the index data.  Depending on the
nature of your queries and your index, this might not be enough, but
chances are good that it would work well.

I have a dev server where there's only enough memory available to cache
about a tenth of the index -- it's got full copies of all three of my
large indexes on ONE machine, while production runs two copies of these
same indexes on ten machines.  Performance of any single query is not
very good on the dev server, but if I absolutely had to use that server
for production with one of my indexes, it would be a slow, but I could
do it.  I don't think it would have enough performance to handle running
all three indexes for production, though.

Thanks,
Shawn





Index not fitting in memory (file-cache)

2016-03-24 Thread Robert Brown

Hi,

If my index data directory size is 70G, and I don't have 70G (plus heap, 
etc) in the system, this will occasionally affect search speed right?  
When Solr has to resort to reading from disk?


Before I go out and throw more RAM into the system, in the above 
example, what would you recommend?


Thanks,
Rob




Re: Creating new cluster with existing config in zookeeper

2016-03-23 Thread Robert Brown

Thanks all,

I am no doubt confusing things myself - I (rather stupidly) have 5 
completely separate clouds, with separate ZK trees - a bad design 
decision on day one when I thought each config needed a separate ZK tree.


So it could all be simplified a bit, but that's my current view, which 
is probably sounding confused.


Cheers,
Rob


On 03/23/2016 04:03 PM, Tom Evans wrote:

On Wed, Mar 23, 2016 at 3:43 PM, Robert Brown  wrote:

So I setup a new solr server to point to my existing ZK configs.

When going to the admin UI on this new server I can see the shards/replica's
of the existing collection, and can even query it, even tho this new server
has no cores on it itself.

Is this all expected behaviour?

Is there any performance gain with what I have at this precise stage?  The
extra server certainly makes it appear i could balance more load/requests,
but I guess the queries are just being forwarded on to the servers with the
actual data?

Am I correct in thinking I can now create a new collection on this host, and
begin to build up a new cluster?  and they won't interfere with each other
at all?

Also, that I'll be able to see both collections when using the admin UI
Cloud page on any of the servers in either collection?


I'm confused slightly:

SolrCloud is a (singular) cluster of servers, storing all of its state
and configuration underneath a single zookeeper path. The cluster
contains collections. Collections are tied to a particular config set
within the cluster. Collections are made up of 1 or more shards. Each
shard is a core, and there are 1 or more replicas of each core.

You can add more servers to the cluster, and then create a new
collection with the same config as an existing collection, but it is
still part of the same cluster. Of course, you could think of a set of
servers within a cluster as a "logical" cluster if it just serves
particular collection, but "cluster" to me would be all of the servers
within the same zookeeper tree, because that is where cluster state is
maintained.

Cheers

Tom




Re: Creating new cluster with existing config in zookeeper

2016-03-23 Thread Robert Brown

So I setup a new solr server to point to my existing ZK configs.

When going to the admin UI on this new server I can see the 
shards/replica's of the existing collection, and can even query it, even 
tho this new server has no cores on it itself.


Is this all expected behaviour?

Is there any performance gain with what I have at this precise stage?  
The extra server certainly makes it appear i could balance more 
load/requests, but I guess the queries are just being forwarded on to 
the servers with the actual data?


Am I correct in thinking I can now create a new collection on this host, 
and begin to build up a new cluster?  and they won't interfere with each 
other at all?


Also, that I'll be able to see both collections when using the admin UI 
Cloud page on any of the servers in either collection?


Thanks,
Rob



On 03/22/2016 04:47 PM, Erick Erickson wrote:

The whole _point_ of configsets is to re-use them in multiple
collections, so please do!

Best,
Erick

On Tue, Mar 22, 2016 at 5:38 AM, Robert Brown  wrote:

Hi,

Is it safe to create a new cluster but use an existing config set that's in
zookeeper?  Or does that config set contain the cluster status too?

I want to (re)-build a cluster from scratch, with a different amount of
shards, but not using shard-splitting.

Thanks,
Rob





Re: Delete by query using JSON?

2016-03-22 Thread Robert Brown

"why do you care? just do this ..."

I see this a lot on mailing lists these days, it's usually a learning 
curve/task/question.  I know I fall into these types of questions/tasks 
regularly.


Which usually leads to "don't tell me my approach is wrong, just explain 
what's going on, and why", or "just answer the straightforward question 
I asked in the first place".


Sorry for rambling, this just sounded familiar...

:)



On 22/03/16 22:50, Alexandre Rafalovitch wrote:

Why do you care?

The difference between Q and FQ are the scoring. For delete, you
delete all of them regardless of scoring and there is no difference.
Just chuck them all into Q.

Regards,
Alex.

Newsletter and resources for Solr beginners and intermediates:
http://www.solr-start.com/


On 23 March 2016 at 06:07, Paul Hoffman  wrote:

I've been struggling to find the right syntax for deleting by query
using JSON, where the query includes an fq parameter.

I know how to delete *all* documents, but how would I delete only
documents with field doctype = "cres"?  I have tried the following along
with a number of variations, all to no avail:

$ curl -s -d @- 'http://localhost:8983/solr/blacklight-core/update?wt=json' 
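For what it's worth, a sketch of the JSON body being asked about; delete-by-query takes a single query string, so any fq-style restriction is simply folded into it:

curl -s 'http://localhost:8983/solr/blacklight-core/update?commit=true' \
     -H 'Content-Type: application/json' \
     -d '{"delete": {"query": "doctype:cres"}}'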


Re: Creating new cluster with existing config in zookeeper

2016-03-22 Thread Robert Brown

Thanks Erick and Shawn, a "collection" is indeed what I meant.

I was under the impression the entire Tree view in the admin GUI was 
showing everything in ZK, including things like 
"collections/name/state.json", not just the /configs directory.


The solr.xml file is too isn't it? (I added it to ZK as per the docs), 
just a bit confusing to see some files/directories from ZK, and some not.


Thanks for any more insight.



On 03/22/2016 04:57 PM, Shawn Heisey wrote:

On 3/22/2016 6:38 AM, Robert Brown wrote:
Is it safe to create a new cluster but use an existing config set 
that's in zookeeper?  Or does that config set contain the cluster 
status too?


I want to (re)-build a cluster from scratch, with a different amount 
of shards, but not using shard-splitting.


When you say "cluster" what exactly do you mean?

To me, "cluster" in a Solr context means "a bunch of Solr servers."  
If this is what you mean, there is nothing built in to copy things 
from an existing cluster.  You *can* run multiple SolrCloud clusters 
on one Zookeeper ensemble.


If you are actually talking about a *collection* when you say 
"cluster", then what Erick said is 100% correct.


Thanks,
Shawn





Creating new cluster with existing config in zookeeper

2016-03-22 Thread Robert Brown

Hi,

Is it safe to create a new cluster but use an existing config set that's 
in zookeeper?  Or does that config set contain the cluster status too?


I want to (re)-build a cluster from scratch, with a different amount of 
shards, but not using shard-splitting.


Thanks,
Rob



Re: Boosts for relevancy (shopping products)

2016-03-20 Thread Robert Brown
It's also worth mentioning that our platform contains shopping products 
in every single category, and will be searched by absolutely anyone, via 
an API made available to various websites, some niche, some not.


If those websites are category specific, ie, electrical goods, then we 
could boost on certain categories for a given website, but if they're 
also broad, is this even possible?


I guess we could track individual users and build up search-histories to 
try and guide us, but I don't see many hits being made on repeat users.


Recording clicks on products could also be used to boost individual 
products for specific keywords - I'm beginning to think this is actually 
our best hope?  e.g.  A multi-valued field containing keywords that 
resulted in a click on that product.





On 03/18/2016 04:14 PM, Robert Brown wrote:

That does sound rather useful!

We currently have it set to 0.1



On 03/18/2016 04:13 PM, Nick Vasilyev wrote:
Tie does quite a bit; without it only the highest weighted field that has
the term will be included in the relevance score. Tie lets you include the
other fields that match as well.
On Mar 18, 2016 10:40 AM, "Robert Brown"  wrote:


Thanks for the added input.

I'll certainly look into the machine learning aspect, will be good to put
some basic knowledge I have into practice.

I'd been led to believe the tie parameter didn't actually do a lot. :-/



On 03/18/2016 12:07 PM, Nick Vasilyev wrote:

I work with a similar catalog; except our data is especially bad.  
We've

found that several things helped:

- Item level grouping (group same item sold by multiple vendors). Rank
items with more vendors a bit higher.
- Include a boost function for other attributes, such as an 
original image

of the product
- Rank items a bit higher if they have data from an external 
catalog like

IceCat
- For relevance and performance, we have several fields that we 
copy data
into. High value fields get copied into a high weighted field, 
while lower

value fields like description get copied into a lower weighted field.
These
fields are the backbone of our qf parameter, with other fields adding
additional boost.
- Play around with the tie parameter for edismax, we found that it 
makes

quite a big difference.

Hope this helps.

On Fri, Mar 18, 2016 at 6:19 AM, Alessandro Benedetti <
abenede...@apache.org


wrote:
In a relevancy problem I would repeat what my colleagues already 
pointed

out :
Data is key. We need to understand first of all our data before we 
can

understand what is relevant and what is not.
Once we specify a groundfloor which make sense ( and your basic 
approach

+
proper schema configuration as suggested + properly configured 
request

handler , seems a good start to me ) .

At this point if you are still not happy with the relevancy (i.e. 
you are

not happy with the different boosts you assigned ) my strongest
suggestion
at this time is to move to machine learning.
You need a good amount of data to feed the learner and make it 
your Super

Business Expert) .
I have been recently working with the Learn To Rank Bloomberg 
Plugin [1]

.
In  my opinion will be key for all the business that have many 
features

in
the game, that can help to evaluate a proper ranking.
For that you need to be able to collect and process signals, and 
you need

to carefully tune the features of your interest.
But the results could be surprising .

[1] https://issues.apache.org/jira/browse/SOLR-8542
[2] Learning to Rank in Solr <
https://www.youtube.com/watch?v=M7BKwJoh96s>

Cheers

On Thu, Mar 17, 2016 at 10:15 AM, Robert Brown 
wrote:

Thanks Scott and John,
As luck would have it I've got a PhD graduate coming for an 
interview

today, who just happened to do her research thesis on information


retrieval


with quantum theory and machine learning  :)

John, it sounds like you're describing my system! Shopping products
from
multiple sources.  (De-duplication is going to be fun soon).

I already copy fields like merchant, brand, category, to string 
fields

to
use them as facets/filters.  I was contemplating removing the
description
due to the spammy issue you mentioned, I didn't know about the
RemoveDuplicatesTokenFilterFactory, so I'm sure that's going to be a
huge
help.

Thanks a lot,
Rob



On 03/17/2016 10:01 AM, John Smith wrote:

Hi,

For once I might be of some help: I've had a similar configuration
(large set of products from various sources). It's very 
difficult to

find the right balance between all parameters and requires a lot of
tweaking, most often in the dark unfortunately.

What I've found is that omitNorms=true is a real breakthrough: 
without
it results tend to favor small texts, which is not what's wanted 
for
product names. I also added a RemoveDuplicatesTokenFilterFactory 
for

the
name as it's a common practice for spammers to repeat some key 
words in
order to be better placed in results.

Re: Boosts for relevancy (shopping products)

2016-03-19 Thread Robert Brown

Thanks Scott and John,

As luck would have it I've got a PhD graduate coming for an interview 
today, who just happened to do her research thesis on information 
retrieval with quantum theory and machine learning  :)


John, it sounds like you're describing my system!  Shopping products 
from multiple sources.  (De-duplication is going to be fun soon).


I already copy fields like merchant, brand, category, to string fields 
to use them as facets/filters.  I was contemplating removing the 
description due to the spammy issue you mentioned, I didn't know about 
the RemoveDuplicatesTokenFilterFactory, so I'm sure that's going to be a 
huge help.


Thanks a lot,
Rob


On 03/17/2016 10:01 AM, John Smith wrote:

Hi,

For once I might be of some help: I've had a similar configuration
(large set of products from various sources). It's very difficult to
find the right balance between all parameters and requires a lot of
tweaking, most often in the dark unfortunately.

What I've found is that omitNorms=true is a real breakthrough: without
it results tend to favor small texts, which is not what's wanted for
product names. I also added a RemoveDuplicatesTokenFilterFactory for the
name as it's a common practice for spammers to repeat some key words in
order to be better placed in results. Stemming and custom stop words
(e.g. "cheap", "sale", ...) are other potential ideas.

I've also ended up in removing the description field as it's often too
broad, and name is now the only field left: brand, category and merchant
(as well as other fields) are offered as additional filters using
facets. Note that you'd have to re-index them as plain strings.

It's more difficult to achieve but popularity boost can also be useful:
you can measure it by sales or by number of clicks. I use a combination
of both, and store those values using partial updates.

Hope it helps,
John


On 17/03/16 09:36, Robert Brown wrote:

Hi,

I currently have an index of ~50m docs representing shopping products:
name, description, brand, category, etc.

Our "qf" is currently setup as:

name^5
brand^2
category^3
merchant^2
description^1

mm: 100%
ps: 5
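For reference, a sketch of how those parameters sit in an edismax request handler in solrconfig.xml (handler name is a placeholder; the tie value of 0.1 is the one mentioned later in the thread):

<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="qf">name^5 brand^2 category^3 merchant^2 description^1</str>
    <str name="mm">100%</str>
    <str name="ps">5</str>
    <str name="tie">0.1</str>
  </lst>
</requestHandler>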

I'm getting complaints from the business concerning relevancy, and was
hoping to get some constructive ideas/thoughts on whether these boosts
look semi-sensible or not, I think they were put in place pretty much
at random.

I know it's going to be a case of rounds upon rounds of testing, but
maybe there's a good starting point that will save me some time?

My initial thoughts right now are to actually just search on the name
field, and maybe the brand (for things like "Apple Ipod").

Has anyone got a similar setup that could share some direction?

Many Thanks,
Rob





Re: Shard splitting for immediate performance boost?

2016-03-19 Thread Robert Brown

Thanks Erick,

I have another index with the same infrastructure setup, but only 10m 
docs, and never see these slow-downs, that's why my first instinct was 
to look at creating more shards.


I'll definitely make a point of investigating further tho with all the 
things you and Shawn mentioned, time is unfortunately against me.


Cheers,
Rob



On 19/03/16 19:11, Erick Erickson wrote:

Be _very_ cautious when you're looking at these timings. Random
spikes are often due to opening a new searcher (assuming
you're indexing as you query) and are eminently tunable by
autowarming. Obviously you can't fire the same query again and again,
but if you collect a set of "bad" queries and, say, measure them after
Solr has been running for a while (without any indexing going on) and
the times are radically better, then autowarming is where I'd look first.

Second, what are you measuring? Time until the client displays the
results? There's a bunch of other things going on there, it might be
network issues. The QTime is a rough measure of how long Solr is
taking, although it doesn't include the time spent assembling the return
packet.

Third, "randomly requesting facets" is something of a red flag. Make
really sure the facets are realistic. Fields with high cardinality make
solr work harder. For instance, let's say you have a date field with
millisecond resolution. I'd bet that faceting on that field is not something
you'll ever support. NOTE: I'm talking about just setting
facet.field=datefield
here. A range facet on the field is totally reasonable. Really, I'm saying
to insure that your queries are realistic before jumping into sharding.

Fourth, garbage collection (especially "stop the world" GCs) won't be helped
by just splitting into shards.

And the list goes on and on. Really what both Shawn and I are saying is
that you really need to identify _what's_ slowing you down before trying
a solution like sharding. And you need to be able to quantify that rather
than
"well, sometimes when I put stuff in it seems slow" or you'll spend a large
amount of time chasing the wrong thing (at least I have).

30M docs per shard is well within a reasonable range, although the
complexity of your docs may push that number up or down. You haven't told
us much about how much memory you have on your machine, how much
RAM you're allocating to Solr and the like so it's hard to say much other
than generalities

Best,
Erick

On Sat, Mar 19, 2016 at 10:41 AM, Shawn Heisey  wrote:


On 3/19/2016 11:12 AM, Robert Brown wrote:

I have an index of 60m docs split across 2 shards (each with a replica).

When load testing queries (picking random keywords I know exist), and
randomly requesting facets too, 95% of my responses are under 0.5s.

However, during some random manual tests, sometimes I see searches
taking between 1-2 seconds.

Should I expect a simple shard split to assist with the speed
immediately?  Even with the 2 new shards still being on the original
servers?

Will move them to their own dedicated hosts, but just want to
understand what I should expect during the process.

Maybe.  It depends on why the responses are slow in the first place.

If your queries are completely CPU-bound, then splitting into more
shards and either putting those shards on additional machines or taking
advantage of idle CPUs will make performance better.  Note that if your
query rate is extremely high, you should only have one shard replica on
each server -- all your CPU power will be needed for handling query
volume, so none of your CPUs will be idle.

Most of the time, Solr installations are actually I/O bound, because
there's not enough unused RAM to effectively cache the index.  If this
is what's happening and you don't add memory (which you can do by adding
machines and adding/removing replicas to move them), then you'll make
performance worse by splitting into more shards.

Thanks,
Shawn






Re: Boosts for relevancy (shopping products)

2016-03-19 Thread Robert Brown

That does sound rather useful!

We currently have it set to 0.1



On 03/18/2016 04:13 PM, Nick Vasilyev wrote:

Tie does quite a bit, without it only the highest weighted field that has
the term will be included in relevance score. Tie let's you include the
other fields that match as well.
On Mar 18, 2016 10:40 AM, "Robert Brown"  wrote:


Thanks for the added input.

I'll certainly look into the machine learning aspect, will be good to put
some basic knowledge I have into practice.

I'd been led to believe the tie parameter didn't actually do a lot. :-/



On 03/18/2016 12:07 PM, Nick Vasilyev wrote:


I work with a similar catalog; except our data is especially bad.  We've
found that several things helped:

- Item level grouping (group same item sold by multiple vendors). Rank
items with more vendors a bit higher.
- Include a boost function for other attributes, such as an original image
of the product
- Rank items a bit higher if they have data from an external catalog like
IceCat
- For relevance and performance, we have several fields that we copy data
into. High value fields get copied into a high weighted field, while lower
value fields like description get copied into a lower weighted field.
These
fields are the backbone of our qf parameter, with other fields adding
additional boost.
- Play around with the tie parameter for edismax, we found that it makes
quite a big difference.

Hope this helps.

On Fri, Mar 18, 2016 at 6:19 AM, Alessandro Benedetti <
abenede...@apache.org


wrote:
In a relevancy problem I would repeat what my colleagues already pointed
out :
Data is key. We need to understand first of all our data before we can
understand what is relevant and what is not.
Once we specify a groundfloor which make sense ( and your basic approach
+
proper schema configuration as suggested + properly configured request
handler , seems a good start to me ) .

At this point if you are still not happy with the relevancy (i.e. you are
not happy with the different boosts you assigned ) my strongest
suggestion
at this time is to move to machine learning.
You need a good amount of data to feed the learner and make it your Super
Business Expert) .
I have been recently working with the Learn To Rank Bloomberg Plugin [1]
.
In  my opinion will be key for all the business that have many features
in
the game, that can help to evaluate a proper ranking.
For that you need to be able to collect and process signals, and you need
to carefully tune the features of your interest.
But the results could be surprising .

[1] https://issues.apache.org/jira/browse/SOLR-8542
[2] Learning to Rank in Solr <
https://www.youtube.com/watch?v=M7BKwJoh96s>

Cheers

On Thu, Mar 17, 2016 at 10:15 AM, Robert Brown 
wrote:

Thanks Scott and John,

As luck would have it I've got a PhD graduate coming for an interview
today, who just happened to do her research thesis on information


retrieval


with quantum theory and machine learning  :)

John, it sounds like you're describing my system!  Shopping products
from
multiple sources.  (De-duplication is going to be fun soon).

I already copy fields like merchant, brand, category, to string fields
to
use them as facets/filters.  I was contemplating removing the
description
due to the spammy issue you mentioned, I didn't know about the
RemoveDuplicatesTokenFilterFactory, so I'm sure that's going to be a
huge
help.

Thanks a lot,
Rob



On 03/17/2016 10:01 AM, John Smith wrote:

Hi,

For once I might be of some help: I've had a similar configuration
(large set of products from various sources). It's very difficult to
find the right balance between all parameters and requires a lot of
tweaking, most often in the dark unfortunately.

What I've found is that omitNorms=true is a real breakthrough: without
it results tend to favor small texts, which is not what's wanted for
product names. I also added a RemoveDuplicatesTokenFilterFactory for
the
name as it's a common practice for spammers to repeat some key words in
order to be better placed in results. Stemming and custom stop words
(e.g. "cheap", "sale", ...) are other potential ideas.

I've also ended up in removing the description field as it's often too
broad, and name is now the only field left: brand, category and
merchant
(as well as other fields) are offered as additional filters using
facets. Note that you'd have to re-index them as plain strings.

It's more difficult to achieve but popularity boost can also be useful:
you can measure it by sales or by number of clicks. I use a combination
of both, and store those values using partial updates.

Hope it helps,
John


On 17/03/16 09:36, Robert Brown wrote:

Hi,

I currently have an index of ~50m docs representing shopping products:
name, description, brand, category, etc.

Our "qf" is currently setup as:

name^5
brand^2
category^3
merchant^2
desc

Shard splitting for immediate performance boost?

2016-03-19 Thread Robert Brown

Hi,

I have an index of 60m docs split across 2 shards (each with a replica).

When load testing queries (picking random keywords I know exist), and 
randomly requesting facets too, 95% of my responses are under 0.5s.


However, during some random manual tests, sometimes I see searches 
taking between 1-2 seconds.


Should I expect a simple shard split to assist with the speed 
immediately?  Even with the 2 new shards still being on the original 
servers?


Will move them to their own dedicated hosts, but just want to understand 
what I should expect during the process.
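For reference, I'm assuming the split itself is just the Collections API call,
something like (collection/shard names are placeholders for my setup):

http://host:8983/solr/admin/collections?action=SPLITSHARD&collection=products&shard=shard1

followed later by ADDREPLICA/DELETEREPLICA calls to move the resulting
sub-shards onto the dedicated hosts.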


Thanks,
Rob



Re: Boosts for relevancy (shopping products)

2016-03-19 Thread Robert Brown
Thanks, that would be a great idea, but unfortunately we don't have that sort 
of granularity of features.


Can definitely use the category of clicked products though, sounds like 
a good enough start.





On 03/18/2016 04:36 PM, Alessandro Benedetti wrote:

Actually if you are able to collect past ( or future signals) like clicks
or purchase, i would rather focus on the features of your products rather
than the products themselves.
What will happen is that you are going to be able rank in a better way
products based on how their feature should affect the score.
i.e.
after you trained your model you realize that people searching for computer
gadgets are more likely to click and buy :
specific brands - apple compatible - low energy consumption - high user
rating  ect ect products

At this point even new products that will arrive, which have that set of
features, are going to be boosted.
Even if you haven't seen them at all.

Cheers

On Fri, Mar 18, 2016 at 4:21 PM, Robert Brown  wrote:


It's also worth mentioning that our platform contains shopping products in
every single category, and will be searched by absolutely anyone, via an
API made available to various websites, some niche, some not.

If those websites are category specific, ie, electrical goods, then we
could boost on certain categories for a given website, but if they're also
broad, is this even possible?

I guess we could track individual users and build up search-histories to
try and guide us, but I don't see many hits being made on repeat users.

Recording clicks on products could also be used to boost individual
products for specific keywords - I'm beginning to think this is actually
our best hope?  e.g.  A multi-valued field containing keywords that
resulted in a click on that product.
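Something along these lines, perhaps (completely untested, field name and type
made up): a multiValued field in the schema,

<field name="clicked_keywords" type="text_general" indexed="true" stored="false" multiValued="true"/>

populated from the click logs, and then a boost query on it at search time,
e.g. bq=clicked_keywords:(tablet)^5 with edismax.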





On 03/18/2016 04:14 PM, Robert Brown wrote:


That does sound rather useful!

We currently have it set to 0.1



On 03/18/2016 04:13 PM, Nick Vasilyev wrote:


Tie does quite a bit, without it only the highest weighted field that has
the term will be included in relevance score. Tie let's you include the
other fields that match as well.
On Mar 18, 2016 10:40 AM, "Robert Brown"  wrote:

Thanks for the added input.

I'll certainly look into the machine learning aspect, will be good to
put
some basic knowledge I have into practice.

I'd been led to believe the tie parameter didn't actually do a lot. :-/



On 03/18/2016 12:07 PM, Nick Vasilyev wrote:

I work with a similar catalog; except our data is especially bad.  We've

found that several things helped:

- Item level grouping (group same item sold by multiple vendors). Rank
items with more vendors a bit higher.
- Include a boost function for other attributes, such as an original
image
of the product
- Rank items a bit higher if they have data from an external catalog
like
IceCat
- For relevance and performance, we have several fields that we copy
data
into. High value fields get copied into a high weighted field, while
lower
value fields like description get copied into a lower weighted field.
These
fields are the backbone of our qf parameter, with other fields adding
additional boost.
- Play around with the tie parameter for edismax, we found that it
makes
quite a big difference.

Hope this helps.

On Fri, Mar 18, 2016 at 6:19 AM, Alessandro Benedetti <
abenede...@apache.org

wrote:

In a relevancy problem I would repeat what my colleagues already
pointed
out :
Data is key. We need to understand first of all our data before we can
understand what is relevant and what is not.
Once we specify a groundfloor which make sense ( and your basic
approach
+
proper schema configuration as suggested + properly configured request
handler , seems a good start to me ) .

At this point if you are still not happy with the relevancy (i.e. you
are
not happy with the different boosts you assigned ) my strongest
suggestion
at this time is to move to machine learning.
You need a good amount of data to feed the learner and make it your
Super
Business Expert) .
I have been recently working with the Learn To Rank Bloomberg Plugin
[1]
.
In  my opinion will be key for all the business that have many
features
in
the game, that can help to evaluate a proper ranking.
For that you need to be able to collect and process signals, and you
need
to carefully tune the features of your interest.
But the results could be surprising .

[1] https://issues.apache.org/jira/browse/SOLR-8542
[2] Learning to Rank in Solr <
https://www.youtube.com/watch?v=M7BKwJoh96s>

Cheers

On Thu, Mar 17, 2016 at 10:15 AM, Robert Brown 
wrote:

Thanks Scott and John,


As luck would have it I've got a PhD graduate coming for an interview
today, who just happened to do her research thesis on information

retrieval

with quantum theory and machine learning  :)

John, it sounds like you're describing my system! Shopping products
from
multiple sources.  (De-duplication is going to be fun soon).

Re: Boosts for relevancy (shopping products)

2016-03-19 Thread Robert Brown

Thanks for the added input.

I'll certainly look into the machine learning aspect, will be good to 
put some basic knowledge I have into practice.


I'd been led to believe the tie parameter didn't actually do a lot. :-/
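For anyone following along, our current edismax params look roughly like this
(boosts as per the config in my original post):

defType=edismax
qf=name^5 brand^2 category^3 merchant^2 description^1
tie=0.1
mm=100%
ps=5

so bumping tie upwards is presumably the thing to experiment with.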



On 03/18/2016 12:07 PM, Nick Vasilyev wrote:

I work with a similar catalog; except our data is especially bad.  We've
found that several things helped:

- Item level grouping (group same item sold by multiple vendors). Rank
items with more vendors a bit higher.
- Include a boost function for other attributes, such as an original image
of the product
- Rank items a bit higher if they have data from an external catalog like
IceCat
- For relevance and performance, we have several fields that we copy data
into. High value fields get copied into a high weighted field, while lower
value fields like description get copied into a lower weighted field. These
fields are the backbone of our qf parameter, with other fields adding
additional boost.
- Play around with the tie parameter for edismax, we found that it makes
quite a big difference.

Hope this helps.

On Fri, Mar 18, 2016 at 6:19 AM, Alessandro Benedetti 
wrote:
In a relevancy problem I would repeat what my colleagues already pointed
out :
Data is key. We need to understand first of all our data before we can
understand what is relevant and what is not.
Once we specify a groundfloor which make sense ( and your basic approach +
proper schema configuration as suggested + properly configured request
handler , seems a good start to me ) .

At this point if you are still not happy with the relevancy (i.e. you are
not happy with the different boosts you assigned ) my strongest suggestion
at this time is to move to machine learning.
You need a good amount of data to feed the learner and make it your Super
Business Expert) .
I have been recently working with the Learn To Rank Bloomberg Plugin [1] .
In  my opinion will be key for all the business that have many features in
the game, that can help to evaluate a proper ranking.
For that you need to be able to collect and process signals, and you need
to carefully tune the features of your interest.
But the results could be surprising .

[1] https://issues.apache.org/jira/browse/SOLR-8542
[2] Learning to Rank in Solr <https://www.youtube.com/watch?v=M7BKwJoh96s>

Cheers

On Thu, Mar 17, 2016 at 10:15 AM, Robert Brown 
wrote:


Thanks Scott and John,

As luck would have it I've got a PhD graduate coming for an interview
today, who just happened to do her research thesis on information

retrieval

with quantum theory and machine learning  :)

John, it sounds like you're describing my system!  Shopping products from
multiple sources.  (De-duplication is going to be fun soon).

I already copy fields like merchant, brand, category, to string fields to
use them as facets/filters.  I was contemplating removing the description
due to the spammy issue you mentioned, I didn't know about the
RemoveDuplicatesTokenFilterFactory, so I'm sure that's going to be a huge
help.

Thanks a lot,
Rob



On 03/17/2016 10:01 AM, John Smith wrote:


Hi,

For once I might be of some help: I've had a similar configuration
(large set of products from various sources). It's very difficult to
find the right balance between all parameters and requires a lot of
tweaking, most often in the dark unfortunately.

What I've found is that omitNorms=true is a real breakthrough: without
it results tend to favor small texts, which is not what's wanted for
product names. I also added a RemoveDuplicatesTokenFilterFactory for the
name as it's a common practice for spammers to repeat some key words in
order to be better placed in results. Stemming and custom stop words
(e.g. "cheap", "sale", ...) are other potential ideas.

I've also ended up in removing the description field as it's often too
broad, and name is now the only field left: brand, category and merchant
(as well as other fields) are offered as additional filters using
facets. Note that you'd have to re-index them as plain strings.

It's more difficult to achieve but popularity boost can also be useful:
you can measure it by sales or by number of clicks. I use a combination
of both, and store those values using partial updates.

Hope it helps,
John


On 17/03/16 09:36, Robert Brown wrote:


Hi,

I currently have an index of ~50m docs representing shopping products:
name, description, brand, category, etc.

Our "qf" is currently setup as:

name^5
brand^2
category^3
merchant^2
description^1

mm: 100%
ps: 5

I'm getting complaints from the business concerning relevancy, and was
hoping to get some constructive ideas/thoughts on whether these boosts
look semi-sensible or not, I think they were put in place pretty much
at random.

I know it's going to be a case of rounds upon rounds of testing, but
maybe there's a good starting point that will save me some time?

Boosts for relevancy (shopping products)

2016-03-19 Thread Robert Brown

Hi,

I currently have an index of ~50m docs representing shopping products: 
name, description, brand, category, etc.


Our "qf" is currently setup as:

name^5
brand^2
category^3
merchant^2
description^1

mm: 100%
ps: 5

I'm getting complaints from the business concerning relevancy, and was 
hoping to get some constructive ideas/thoughts on whether these boosts 
look semi-sensible or not, I think they were put in place pretty much at 
random.


I know it's going to be a case of rounds upon rounds of testing, but 
maybe there's a good starting point that will save me some time?


My initial thoughts right now are to actually just search on the name 
field, and maybe the brand (for things like "Apple iPod").


Has anyone got a similar setup that could share some direction?

Many Thanks,
Rob



Relevancy for "tablet"

2016-03-09 Thread Robert Brown

Hi,

I'm looking for some advice and possible options for dealing with our 
relevancy when searching through shopping products.


A search for "tablet" returns pills, when the user would expect 
electronic devices.


Without any extra criteria (like category), how would/could you manage 
this situation?


Any solution would also need to scale since this is just a random example.

Thanks,
Rob



Different scores depending on cloud node

2016-03-08 Thread Robert Brown

Hi,

I have 2 shards, each with 1 replica.

When sending the same request to the cluster, I'm seeing the same 
results, but ordered differently, and with different scores.


Does this highlight an issue with my index, or is this an accepted anomaly?

Example of 8 results:

1st call:

160.2047
160.2047
157.86732
157.86732
157.86732
157.86732
152.6514
152.6514

2nd call:

157.86732
157.86732
157.86732
157.86732
157.64246
157.64246
150.39238
150.39238
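In case it helps diagnose, I can hit each replica core directly and compare,
e.g. (host and core names are placeholders):

http://host1:8983/solr/myindex_shard1_replica1/select?q=...&fl=id,score&distrib=false&debugQuery=true

distrib=false keeps the query on that one core, and the debug explain shows
the per-replica term statistics (docFreq etc.) feeding the scores.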



Thanks,
Rob



Re: Disk Usage anomaly across shards/replicas

2016-03-06 Thread Robert Brown
There was only a single index dir. After taking a node down, another 
was created with the timestamp (so I know what you mean), but then the 
original was removed.


I've since replaced the node and all is well again, just very odd.




On 06/03/16 06:52, Varun Thacker wrote:

Hi Robert,

Within the shard directory there should be multiple directories - "tlog"
and "index.<timestamp>". Do you see multiple "index.*" directories in there
for the shard which has more data on disk?

On Sat, Mar 5, 2016 at 6:39 PM, Robert Brown  wrote:


Hi,

I have an index with 65m docs spread across 2 shards, each with 1 replica.

The replica1 of shard2 is using up nearly double the amount of disk space
as the other shards/replicas.

Could there be a reason/fix for this?


/home/s123/solr/data/de_shard1_replica1 = 72G

numDocs:34,786,026
maxDoc:45,825,444
deletedDocs:11,039,418



/home/s123/solr/data/de_shard1_replica2 = 70G

numDocs:34,786,026
maxDoc:46,914,095
deletedDocs:12,128,069



/home/s123/solr/data/de_shard2_replica1 = 138G

numDocs:34,775,193
maxDoc:45,409,362
deletedDocs:10,634,169



/home/s123/solr/data/de_shard2_replica2 = 66G

numDocs:34,775,193
maxDoc:44,181,734
deletedDocs:9,406,541



Thanks,
Rob












Re: Disk Usage anomaly across shards/replicas

2016-03-05 Thread Robert Brown

Thanks Shawn,

I'm just about to remove that node and rebuild it, at least there won't 
be any actual downtime.




On 05/03/16 14:44, Shawn Heisey wrote:

On 3/5/2016 6:09 AM, Robert Brown wrote:

I have an index with 65m docs spread across 2 shards, each with 1
replica.

The replica1 of shard2 is using up nearly double the amount of disk
space as the other shards/replicas.

I *very* occasionally see some of the shards in my non-SolrCloud index
show this behavior.  Usually if I fully rebuild the index (which takes
several hours), the problem will correct itself.

I have no idea what causes it.  I do not recall seeing it before
upgrading from 3.5 to late 4.x.  I do have some 5.x indexes ... I have
not been running them for very long, so I do not know whether that
version is having the same problem.

Thanks,
Shawn





Re: Disk Usage anomaly across shards/replicas

2016-03-05 Thread Robert Brown

Nope, we never run optimise.

Would there be some tell-tale files in the index dir to indicate if 
someone else had run an optimise?




On 05/03/16 13:11, Binoy Dalal wrote:

Have you executed an optimize across that particular shard?

On Sat, 5 Mar 2016, 18:39 Robert Brown,  wrote:


Hi,

I have an index with 65m docs spread across 2 shards, each with 1 replica.

The replica1 of shard2 is using up nearly double the amount of disk
space as the other shards/replicas.

Could there be a reason/fix for this?


/home/s123/solr/data/de_shard1_replica1 = 72G

numDocs:34,786,026
maxDoc:45,825,444
deletedDocs:11,039,418



/home/s123/solr/data/de_shard1_replica2 = 70G

numDocs:34,786,026
maxDoc:46,914,095
deletedDocs:12,128,069



/home/s123/solr/data/de_shard2_replica1 = 138G

numDocs:34,775,193
maxDoc:45,409,362
deletedDocs:10,634,169



/home/s123/solr/data/de_shard2_replica2 = 66G

numDocs:34,775,193
maxDoc:44,181,734
deletedDocs:9,406,541



Thanks,
Rob





--

Regards,
Binoy Dalal





Disk Usage anomaly across shards/replicas

2016-03-05 Thread Robert Brown

Hi,

I have an index with 65m docs spread across 2 shards, each with 1 replica.

The replica1 of shard2 is using up nearly double the amount of disk 
space as the other shards/replicas.


Could there be a reason/fix for this?


/home/s123/solr/data/de_shard1_replica1 = 72G

numDocs:34,786,026
maxDoc:45,825,444
deletedDocs:11,039,418



/home/s123/solr/data/de_shard1_replica2 = 70G

numDocs:34,786,026
maxDoc:46,914,095
deletedDocs:12,128,069



/home/s123/solr/data/de_shard2_replica1 = 138G

numDocs:34,775,193
maxDoc:45,409,362
deletedDocs:10,634,169



/home/s123/solr/data/de_shard2_replica2 = 66G

numDocs:34,775,193
maxDoc:44,181,734
deletedDocs:9,406,541



Thanks,
Rob







SolrCloud, Best performance directly from C

2016-02-22 Thread Robert Brown

Hi,

As a pure C user, without wishing to use Java, what's my best approach 
for managing the SolrCloud environment?


I operate a FastCGI environment, so I have the persistence to cache the 
state of the "cloud".


So far, making good use of the Collections API looks like my best bet?
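Concretely I was thinking of polling something like this from the FastCGI
processes and caching the parsed JSON between requests (host/collection names
are placeholders):

http://host:8983/solr/admin/collections?action=CLUSTERSTATUS&collection=uk&wt=json

which returns the shards, their replica URLs and live/down state, so the C side
can pick a healthy node to send queries to.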

Any other thoughts or experiences?

Thanks,
Rob





Re: MLT Component only returns ID and score

2016-01-31 Thread Robert Brown
Thanks for the info. Does this replace the mlt component, i.e. can I 
remove that component from being loaded/used?


These are my parameters; the results look okay, I just want to ensure I'm 
using it right...



  fq => [
  'market:uk'
],
  q => '{!mlt 
qf=name,description,brand,category,ean,upc,asin}az-uk-b017u8thna',

  rows => 10,
  start => 0,
  wt => 'json'
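As a plain HTTP request that works out to roughly the following (host and
collection are placeholders, and the {!mlt ...} local params need URL-encoding):

http://host:8983/solr/products/select?q={!mlt qf=name,description,brand,category,ean,upc,asin}az-uk-b017u8thna&fq=market:uk&fl=*,score&rows=10&start=0&wt=json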






On 31/01/16 20:17, Upayavira wrote:

Try the MLT query parser, which is a much newer way of doing this.
Perhaps it will work better for you.

Upayavira

On Sun, Jan 31, 2016, at 06:31 PM, Robert Brown wrote:

Hi,

I've had to switch to using the MLT component, rather than the handler,
since I'm running on Solrcloud (5.4) and if I hit a node without the
starting document, I get nothing back.

When I perform a MLT query, I only get back the ID and score for the
similar documents, yet my fl=*,score.

"moreLikeThis" : [
"tg-uk-6336277-5820875618687434337",
{
   "maxScore" : 16.857872,
   "numFound" : 49559,
   "docs" : [
  {
 "score" : 16.857872,
 "id" : "tg-uk-6336277-6676971947462687384"
  },
  {
 "score" : 16.857872,
 "id" : "tg-uk-6336277-1922478172471276129"
  },



Here's my config...

  

  

  edismax

  explicit

  0.1

  *,score

  
  name^5
  brand^2
  category^3
  

  name

  100%

  100

  *:*

  
  name^5 description^2 brand^3 category^3
  

  name,description,ean,upc,asin,brand,category

  

  
  query
  facet
  mlt
  

  








MLT Component only returns ID and score

2016-01-31 Thread Robert Brown

Hi,

I've had to switch to using the MLT component, rather than the handler, 
since I'm running on Solrcloud (5.4) and if I hit a node without the 
starting document, I get nothing back.


When I perform a MLT query, I only get back the ID and score for the 
similar documents, yet my fl=*,score.


"moreLikeThis" : [
  "tg-uk-6336277-5820875618687434337",
  {
 "maxScore" : 16.857872,
 "numFound" : 49559,
 "docs" : [
{
   "score" : 16.857872,
   "id" : "tg-uk-6336277-6676971947462687384"
},
{
   "score" : 16.857872,
   "id" : "tg-uk-6336277-1922478172471276129"
},



Here's my config...





edismax

explicit

0.1

*,score


name^5
brand^2
category^3


name

100%

100

*:*


name^5 description^2 brand^3 category^3


name="mlt.fl">name,description,ean,upc,asin,brand,category





query
facet
mlt









Query cache with grouping

2016-01-28 Thread Robert Brown

Hi,

During some testing, I've found that the queryResultCache is not used 
when I use grouping.


Is there another cache that is being used in this scenario? If so, 
which, and how can I ensure they're providing a real benefit?
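The grouped queries look roughly like this (field name is just an example):

q=some keywords
group=true
group.field=merchant
rows=10

I've since spotted group.cache.percent in the docs, which apparently caches the
second pass of a grouped search when set above 0 (e.g. group.cache.percent=50),
so maybe that's the knob I'm after?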


Thanks,
Rob



Leader Election Time

2016-01-15 Thread Robert Brown

Hi,

I have 2 shards, 1 leader and 1 replica in each.

I've just removed a leader from one of the shards but the replica hasn't 
become a leader yet.


How quickly should this normally happen?

tickTime=2000
dataDir=/home/rob/zoodata
clientPort=2181
initLimit=5
syncLimit=2

Thanks,
Rob



Re: Querying only replica's

2016-01-11 Thread Robert Brown

We won't be using SolrJ, etc. anytime soon unfortunately.

We'll be using a hardware load-balancer to send requests into the 
cloud/pool of servers.


The LB therefore needs to know when a node is down, otherwise a query 
wouldn't get anywhere.


The solr.PingRequestHandler is what I was after.
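For the archives, the handler config I'm pointing the load-balancer at is along
these lines (the healthcheck file name is arbitrary):

<requestHandler name="/admin/ping" class="solr.PingRequestHandler">
  <lst name="invariants">
    <str name="q">*:*</str>
  </lst>
  <str name="healthcheckFile">server-enabled.txt</str>
</requestHandler>

The LB then probes /solr/<collection>/admin/ping, and a node can be taken out
of rotation by deleting the healthcheck file (the docs mention
/admin/ping?action=disable does this) without actually stopping Solr.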




On 01/11/2016 05:16 PM, Alessandro Benedetti wrote:

mmm i think there is a misconception here :

On 10 January 2016 at 19:00, Robert Brown  wrote:


I'm thinking more about how the external load-balancer will know if a node
is down, as to take it out the pool of active servers to even attempt
sending a query to.


This is SolrCloud responsibility and in particular Zookeeper knows the
topology of the cluster.
A query will not reach a dead node.
You should use a SolrCloud aware client ( like the SolrJ one) .

If you want to use a different load-balancer because you don't like the
SolrCloud one, it will not be that easy, because the distribution of the
queries happens automatically.

Cheers


I could ping tho that just means the IP is alive.  I could configure the
load-balancer to actually try a query, but this may be (even a tiny)
performance hit.

Is there another recommended way of configuring external load-balancers to
know when a node is not accepting queries?




On 10/01/16 18:25, Erick Erickson wrote:


For health checks, you can go ahead and get the real IP addresses and
ping them directly if you care to Or just let Zookeeper do that
for you. One of the tasks of Zookeeper is pinging all the machines
with all the replicas and, if any of them are unreachable, telling the
rest of the cluster that that machine is down.

Best,
Erick

On Sun, Jan 10, 2016 at 5:19 AM, Robert Brown 
wrote:


Thanks Erick,

For the health-checks on the load-balancer side, would you recommend a
simple query, or is there a reliable ping or similar for this scenario?

Cheers,
Rob


On 09/01/16 23:44, Erick Erickson wrote:


bq: is it best/good to get the CLUSTERSTATUS via the collection API
and explicitly send queries to a replica to ensure I don't send
queries to the leaders of my collection

In a word _no_. SolrCloud is vastly different than the old
master/slave. In SolrCloud, each and every node (leader and replicas)
index all the docs and serve queries. The additional burden the leader
has is actually very small. There's absolutely no reason to _not_ use
the leader to serve queries.

As far as sending updates, there would be a _little_ benefit to
sending the updates directly to the leader, but _far_ more benefit in
using SolrJ. If you use SolrJ (and CloudSolrClient), then the
documents are split up on the _client_ and only the docs for a
particular shard are automatically sent to the leader for that shard.
Using SolrJ you can essentially scale indexing linearly with the
number of shards you have. Just using HTTP does not scale linearly.
Your particular app may not care, but in high-throughput situations
this can be significant.

So rather than spend time and effort sending updates directly to a
leader and have the leader then forward the docs to the correct shard,
I recommend investing the time in using SolrJ for updates rather than
sending updates to the leader over HTTP. Or just ignore the problem
and devote your efforts to something that are more valuable.

So in short:
1> just stick a load balancer in front of _all_ your Solr nodes for
queries. And note that there's an internal load balancer already in
Solr that routes things around anyway, although putting a load
balancer in front of your entire cluster makes it so there's not a
single point of failure.
2> Depending on your throughput needs, either
2a> use SolrJ to index
2b> don't worry about it and send updates through the load balancer as
well. There'll be an extra hop if you send updates to a replica, but
if that's significant you should be using SolrJ

As for 5.5, it's not at all clear that there _will_ be a 5.5. 5.4 was
just released in early December. There's usually a several month lag
between point releases and there's some agitation to start the 6.0
release process, so it's up in the air.


On Sat, Jan 9, 2016 at 12:04 PM, Robert Brown 
wrote:


Hi,

(btw, when is 5.5 due?  I see the docs reference it, but not the
download
page)

Anyway, I index and query Solr over HTTP (no SolrJ, etc.) - is it
best/good
to get the CLUSTERSTATUS via the collection API and explicitly send
queries
to a replica to ensure I don't send queries to the leaders of my
collection,
to improve performance?  Like-wise with sending updates directly to a
Leader?

My leaders will receive full updates of the entire collection once a
day,
so
I would assume if the leader is handling queries too, performance would
be
hit?

Is the CLUSTERSTATUS API the only way to do this btw without SolrJ,
etc.?
I
wasn't sure if ZooKeeper would be able to tell me also.

Do I also need to do anything to ensure the leaders are never sent
queries from the replica's?

Re: Querying only replica's

2016-01-10 Thread Robert Brown
I'm thinking more about how the external load-balancer will know when a 
node is down, so as to take it out of the pool of active servers rather 
than attempt to send it a query.


I could ping, though that just means the IP is alive.  I could configure the 
load-balancer to actually try a query, but that may incur a performance hit, 
even if only a tiny one.


Is there another recommended way of configuring external load-balancers 
to know when a node is not accepting queries?




On 10/01/16 18:25, Erick Erickson wrote:

For health checks, you can go ahead and get the real IP addresses and
ping them directly if you care to Or just let Zookeeper do that
for you. One of the tasks of Zookeeper is pinging all the machines
with all the replicas and, if any of them are unreachable, telling the
rest of the cluster that that machine is down.

Best,
Erick

On Sun, Jan 10, 2016 at 5:19 AM, Robert Brown  wrote:

Thanks Erick,

For the health-checks on the load-balancer side, would you recommend a
simple query, or is there a reliable ping or similar for this scenario?

Cheers,
Rob


On 09/01/16 23:44, Erick Erickson wrote:

bq: is it best/good to get the CLUSTERSTATUS via the collection API
and explicitly send queries to a replica to ensure I don't send
queries to the leaders of my collection

In a word _no_. SolrCloud is vastly different than the old
master/slave. In SolrCloud, each and every node (leader and replicas)
index all the docs and serve queries. The additional burden the leader
has is actually very small. There's absolutely no reason to _not_ use
the leader to serve queries.

As far as sending updates, there would be a _little_ benefit to
sending the updates directly to the leader, but _far_ more benefit in
using SolrJ. If you use SolrJ (and CloudSolrClient), then the
documents are split up on the _client_ and only the docs for a
particular shard are automatically sent to the leader for that shard.
Using SolrJ you can essentially scale indexing linearly with the
number of shards you have. Just using HTTP does not scale linearly.
Your particular app may not care, but in high-throughput situations
this can be significant.

So rather than spend time and effort sending updates directly to a
leader and have the leader then forward the docs to the correct shard,
I recommend investing the time in using SolrJ for updates rather than
sending updates to the leader over HTTP. Or just ignore the problem
and devote your efforts to something that are more valuable.

So in short:
1> just stick a load balancer in front of _all_ your Solr nodes for
queries. And note that there's an internal load balancer already in
Solr that routes things around anyway, although putting a load
balancer in front of your entire cluster makes it so there's not a
single point of failure.
2> Depending on your throughput needs, either
2a> use SolrJ to index
2b> don't worry about it and send updates through the load balancer as
well. There'll be an extra hop if you send updates to a replica, but
if that's significant you should be using SolrJ

As for 5.5, it's not at all clear that there _will_ be a 5.5. 5.4 was
just released in early December. There's usually a several month lag
between point releases and there's some agitation to start the 6.0
release process, so it's up in the air.


On Sat, Jan 9, 2016 at 12:04 PM, Robert Brown 
wrote:

Hi,

(btw, when is 5.5 due?  I see the docs reference it, but not the download
page)

Anyway, I index and query Solr over HTTP (no SolrJ, etc.) - is it
best/good
to get the CLUSTERSTATUS via the collection API and explicitly send
queries
to a replica to ensure I don't send queries to the leaders of my
collection,
to improve performance?  Like-wise with sending updates directly to a
Leader?

My leaders will receive full updates of the entire collection once a day,
so
I would assume if the leader is handling queries too, performance would
be
hit?

Is the CLUSTERSTATUS API the only way to do this btw without SolrJ, etc.?
I
wasn't sure if ZooKeeper would be able to tell me also.

Do I also need to do anything to ensure the leaders are never sent
queries
from the replica's?

Does this all sound sane?

One of my collections is 3 shards, with 2 replica's each (9 total nodes),
70m docs in total.

Thanks,
Rob





Re: Querying only replica's

2016-01-10 Thread Robert Brown

Thanks Erick,

For the health-checks on the load-balancer side, would you recommend a 
simple query, or is there a reliable ping or similar for this scenario?


Cheers,
Rob


On 09/01/16 23:44, Erick Erickson wrote:

bq: is it best/good to get the CLUSTERSTATUS via the collection API
and explicitly send queries to a replica to ensure I don't send
queries to the leaders of my collection

In a word _no_. SolrCloud is vastly different than the old
master/slave. In SolrCloud, each and every node (leader and replicas)
index all the docs and serve queries. The additional burden the leader
has is actually very small. There's absolutely no reason to _not_ use
the leader to serve queries.

As far as sending updates, there would be a _little_ benefit to
sending the updates directly to the leader, but _far_ more benefit in
using SolrJ. If you use SolrJ (and CloudSolrClient), then the
documents are split up on the _client_ and only the docs for a
particular shard are automatically sent to the leader for that shard.
Using SolrJ you can essentially scale indexing linearly with the
number of shards you have. Just using HTTP does not scale linearly.
Your particular app may not care, but in high-throughput situations
this can be significant.

So rather than spend time and effort sending updates directly to a
leader and have the leader then forward the docs to the correct shard,
I recommend investing the time in using SolrJ for updates rather than
sending updates to the leader over HTTP. Or just ignore the problem
and devote your efforts to something that are more valuable.

So in short:
1> just stick a load balancer in front of _all_ your Solr nodes for
queries. And note that there's an internal load balancer already in
Solr that routes things around anyway, although putting a load
balancer in front of your entire cluster makes it so there's not a
single point of failure.
2> Depending on your throughput needs, either
2a> use SolrJ to index
2b> don't worry about it and send updates through the load balancer as
well. There'll be an extra hop if you send updates to a replica, but
if that's significant you should be using SolrJ

As for 5.5, it's not at all clear that there _will_ be a 5.5. 5.4 was
just released in early December. There's usually a several month lag
between point releases and there's some agitation to start the 6.0
release process, so it's up in the air.


On Sat, Jan 9, 2016 at 12:04 PM, Robert Brown  wrote:

Hi,

(btw, when is 5.5 due?  I see the docs reference it, but not the download
page)

Anyway, I index and query Solr over HTTP (no SolrJ, etc.) - is it best/good
to get the CLUSTERSTATUS via the collection API and explicitly send queries
to a replica to ensure I don't send queries to the leaders of my collection,
to improve performance?  Like-wise with sending updates directly to a
Leader?

My leaders will receive full updates of the entire collection once a day, so
I would assume if the leader is handling queries too, performance would be
hit?

Is the CLUSTERSTATUS API the only way to do this btw without SolrJ, etc.?  I
wasn't sure if ZooKeeper would be able to tell me also.

Do I also need to do anything to ensure the leaders are never sent queries
from the replica's?

Does this all sound sane?

One of my collections is 3 shards, with 2 replica's each (9 total nodes),
70m docs in total.

Thanks,
Rob





Querying only replica's

2016-01-09 Thread Robert Brown

Hi,

(btw, when is 5.5 due?  I see the docs reference it, but not the 
download page)


Anyway, I index and query Solr over HTTP (no SolrJ, etc.) - is it 
best/good to get the CLUSTERSTATUS via the collection API and explicitly 
send queries to a replica to ensure I don't send queries to the leaders 
of my collection, to improve performance?  Like-wise with sending 
updates directly to a Leader?


My leaders will receive full updates of the entire collection once a 
day, so I would assume if the leader is handling queries too, 
performance would be hit?


Is the CLUSTERSTATUS API the only way to do this btw without SolrJ, 
etc.?  I wasn't sure if ZooKeeper would be able to tell me also.


Do I also need to do anything to ensure the leaders are never sent 
queries from the replica's?


Does this all sound sane?

One of my collections is 3 shards, with 2 replica's each (9 total 
nodes), 70m docs in total.


Thanks,
Rob



Re: SolrCloud: Setting/finding node names for deleting replicas

2016-01-08 Thread Robert Brown

Thanks for the pointer Jeff,

For SolrCloud it turned out to be...

&property.coreNodeName=xxx
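i.e. the full calls end up looking something like this (names are just
examples):

/admin/collections?action=ADDREPLICA&collection=uk&shard=shard2&node=host2:8983_solr&property.coreNodeName=uk_shard2_rep2

/admin/collections?action=DELETEREPLICA&collection=uk&shard=shard2&replica=uk_shard2_rep2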

btw, for your app, isn't "slice" old notation?




On 08/01/16 22:05, Jeff Wartes wrote:


I’m pretty sure you could change the name when you ADDREPLICA using a core.name 
property. I don’t know if you can when you initially create the collection 
though.

The CLUSTERSTATUS command will tell you the core names: 
https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-api18

That said, this tool might make things easier.
https://github.com/whitepages/solrcloud_manager


# shows cluster status, including core names:
java -jar solrcloud_manager-assembly-1.4.0.jar -z zk0.example.com:2181/myapp


# deletes a replica by node/collection/shard (figures out the core name under 
the hood)
java -jar solrcloud_manager-assembly-1.4.0.jar deletereplica -z 
zk0.example.com:2181/myapp -c collection1 --node node1.example.com --slice 
shard2


I mention this tool every now and then on this list because I like it, but I’m 
the author, so take that with a pretty big grain of salt. Feedback is very 
welcome.







On 1/8/16, 1:18 PM, "Robert Brown"  wrote:


Hi,

I'm having trouble identifying a replica to delete...

I've created a 3-shard cluster, all 3 created on a single host, then
added a replica for shard2 onto another host, no problem so far.

Now I want to delete the original shard, but got this error when trying
a *replica* param value I thought would work...

shard2/uk available replicas are core_node1,core_node4

I can't find any mention of core_node1 or core_node4 via the admin UI,
how would I know/find the name of each one?

Is it possible to set these names explicitly myself for easier maintenance?

Many thanks for any guidance,
Rob





SolrCloud: Setting/finding node names for deleting replicas

2016-01-08 Thread Robert Brown

Hi,

I'm having trouble identifying a replica to delete...

I've created a 3-shard cluster, all 3 created on a single host, then 
added a replica for shard2 onto another host, no problem so far.


Now I want to delete the original shard, but got this error when trying 
a *replica* param value I thought would work...


shard2/uk available replicas are core_node1,core_node4

I can't find any mention of core_node1 or core_node4 via the admin UI, 
how would I know/find the name of each one?


Is it possible to set these names explicitly myself for easier maintenance?

Many thanks for any guidance,
Rob



Re: Which Tokeniser (and/or filter)

2012-02-08 Thread Robert Brown
Attempting to reproduce legacy behaviour (I know!) of simple SQL
substring searching, with and without phrases.

I feel simply NGram'ing 4m CVs may be pushing it?


---

IntelCompute
Web Design & Local Online Marketing

http://www.intelcompute.com


On Wed, 8 Feb 2012 11:27:24 -0500, Erick Erickson
 wrote:
> You'll probably have to index them in separate fields to
> get what you want. The question is always whether it's
> worth it, is the use-case really well served by having a
> variant that keeps dots and things? But that's always more
> a question for your product manager
> 
> Best
> Erick
> 
> On Wed, Feb 8, 2012 at 9:23 AM, Robert Brown  wrote:
>> Thanks Erick,
>>
>> I didn't get confused with multiple tokens vs multiValued  :)
>>
>> Before I go ahead and re-index 4m docs, and believe me I'm using the
>> analysis page like a mad-man!
>>
>> What do I need to configure to have the following both indexed with and
>> without the dots...
>>
>> .net
>> sales manager.
>> £12.50
>>
>> Currently...
>>
>> 
>> >        generateWordParts="1"
>>        generateNumberParts="1"
>>        catenateWords="1"
>>        catenateNumbers="1"
>>        catenateAll="1"
>>        splitOnCaseChange="1"
>>        splitOnNumerics="1"
>>        types="wdftypes.txt"
>> />
>>
>> with nothing specific in wdftypes.txt for full-stops.
>>
>> Should there also be any difference when quoting my searches?
>>
>> The analysis page seems to just drop the quotes, but surely actual
>> calls don't do this?
>>
>>
>>
>> ---
>>
>> IntelCompute
>> Web Design & Local Online Marketing
>>
>> http://www.intelcompute.com
>>
>>
>> On Wed, 8 Feb 2012 07:38:42 -0500, Erick Erickson
>>  wrote:
>>> Yes, WDDF creates multiple tokens. But that has
>>> nothing to do with the multiValued suggestion.
>>>
>>> You can get exactly what you want by
>>> 1> setting multiValued="true" in your schema file and re-indexing. Say
>>> positionIncrementGap is set to 100
>>> 2> When you index, add the field for each sentence, so your doc
>>>       looks something like:
>>>      
>>>         i am a sales-manager in here
>>>        using asp.net and .net daily
>>>          .
>>>       
>>> 3> search like "sales manager"~100
>>>
>>> Best
>>> Erick
>>>
>>> On Wed, Feb 8, 2012 at 3:05 AM, Rob Brown  wrote:
>>>> Apologies if things were a little vague.
>>>>
>>>> Given the example snippet to index (numbered to show searches needed to
>>>> match)...
>>>>
>>>> 1: i am a sales-manager in here
>>>> 2: using asp.net and .net daily
>>>> 3: working in design.
>>>> 4: using something called sage 200. and i'm fluent
>>>> 5: german sausages.
>>>> 6: busy A&E dept earning £10,000 annually
>>>>
>>>>
>>>> ... all with newlines in place.
>>>>
>>>> able to match...
>>>>
>>>> 1. sales
>>>> 1. "sales manager"
>>>> 1. sales-manager
>>>> 1. "sales-manager"
>>>> 2. .net
>>>> 2. asp.net
>>>> 3. design
>>>> 4. sage 200
>>>> 6. A&E
>>>> 6. £10,000
>>>>
>>>> But do NOT match "fluent german" from 4 + 5 since there's a newline
>>>> between them when indexed, but not when searched.
>>>>
>>>>
>>>> Do the filters (wdf in this case) not create multiple tokens, so if
>>>> splitting on period in "asp.net" would create tokens for all of "asp",
>>>> "asp.", "asp.net", ".net", "net".
>>>>
>>>>
>>>> Cheers,
>>>> Rob
>>>>
>>>> --
>>>>
>>>> IntelCompute
>>>> Web Design and Online Marketing
>>>>
>>>> http://www.intelcompute.com
>>>>
>>>>
>>>> -Original Message-
>>>> From: Chris Hostetter 
>>>> Reply-to: solr-user@lucene.apache.org
>>>> To: solr-user@lucene.apache.org
>>>> Subject: Re: Which Tokeniser (and/or filter)
>>>> Date: Tue, 7 Feb 2012 15:02:36 -0800 (PST)
>>>>
>>>> : This all seems a bit too much work for such a real-world scenario?
>>>>
>>>> You haven't really told us what your scenerio is.
>>>>
>>>> You said you want to split tokens on whitespace, full-stop (aka:
>>>> period) and comma only, but then in response to some suggestions you added
>>>> comments other things that you never mentioned previously...
>>>>
>>>> 1) evidently you don't want the "." in foo.net to cause a split in tokens?
>>>> 2) evidently you not only want token splits on newlines, but also
>>>> positition gaps to prevent phrases matching across newlines.
>>>>
>>>> ...these are kind of important details that affect suggestions people
>>>> might give you.
>>>>
>>>> can you please provide some concrete examples of hte types of data you
>>>> have, the types of queries you want them to match, and the types of
>>>> queries you *don't* want to match?
>>>>
>>>>
>>>> -Hoss
>>>>
>>



Re: Which Tokeniser (and/or filter)

2012-02-08 Thread Robert Brown
Thanks Erick,

I didn't get confused with multiple tokens vs multiValued  :)

Before I go ahead and re-index 4m docs, and believe me I'm using the
analysis page like a mad-man!

What do I need to configure to have the following both indexed with and
without the dots...

.net
sales manager.
£12.50

Currently...

<filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="1" generateNumberParts="1"
        catenateWords="1" catenateNumbers="1" catenateAll="1"
        splitOnCaseChange="1" splitOnNumerics="1"
        types="wdftypes.txt" />

with nothing specific in wdftypes.txt for full-stops.
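I'm now wondering if preserveOriginal is the missing piece, i.e. adding this
attribute to the WordDelimiterFilterFactory above (untested):

        preserveOriginal="1"

so ".net", "sales manager." and "£12.50" are kept as-is in addition to the
split-apart tokens.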

Should there also be any difference when quoting my searches?

The analysis page seems to just drop the quotes, but surely actual
calls don't do this?



---

IntelCompute
Web Design & Local Online Marketing

http://www.intelcompute.com


On Wed, 8 Feb 2012 07:38:42 -0500, Erick Erickson
 wrote:
> Yes, WDDF creates multiple tokens. But that has
> nothing to do with the multiValued suggestion.
> 
> You can get exactly what you want by
> 1> setting multiValued="true" in your schema file and re-indexing. Say
> positionIncrementGap is set to 100
> 2> When you index, add the field for each sentence, so your doc
>   looks something like:
>  
> i am a sales-manager in here
>using asp.net and .net daily
>  .
>   
> 3> search like "sales manager"~100
> 
> Best
> Erick
> 
> On Wed, Feb 8, 2012 at 3:05 AM, Rob Brown  wrote:
>> Apologies if things were a little vague.
>>
>> Given the example snippet to index (numbered to show searches needed to
>> match)...
>>
>> 1: i am a sales-manager in here
>> 2: using asp.net and .net daily
>> 3: working in design.
>> 4: using something called sage 200. and i'm fluent
>> 5: german sausages.
>> 6: busy A&E dept earning £10,000 annually
>>
>>
>> ... all with newlines in place.
>>
>> able to match...
>>
>> 1. sales
>> 1. "sales manager"
>> 1. sales-manager
>> 1. "sales-manager"
>> 2. .net
>> 2. asp.net
>> 3. design
>> 4. sage 200
>> 6. A&E
>> 6. £10,000
>>
>> But do NOT match "fluent german" from 4 + 5 since there's a newline
>> between them when indexed, but not when searched.
>>
>>
>> Do the filters (wdf in this case) not create multiple tokens, so if
>> splitting on period in "asp.net" would create tokens for all of "asp",
>> "asp.", "asp.net", ".net", "net".
>>
>>
>> Cheers,
>> Rob
>>
>> --
>>
>> IntelCompute
>> Web Design and Online Marketing
>>
>> http://www.intelcompute.com
>>
>>
>> -Original Message-
>> From: Chris Hostetter 
>> Reply-to: solr-user@lucene.apache.org
>> To: solr-user@lucene.apache.org
>> Subject: Re: Which Tokeniser (and/or filter)
>> Date: Tue, 7 Feb 2012 15:02:36 -0800 (PST)
>>
>> : This all seems a bit too much work for such a real-world scenario?
>>
>> You haven't really told us what your scenerio is.
>>
>> You said you want to split tokens on whitespace, full-stop (aka:
>> period) and comma only, but then in response to some suggestions you added
>> comments other things that you never mentioned previously...
>>
>> 1) evidently you don't want the "." in foo.net to cause a split in tokens?
>> 2) evidently you not only want token splits on newlines, but also
>> positition gaps to prevent phrases matching across newlines.
>>
>> ...these are kind of important details that affect suggestions people
>> might give you.
>>
>> can you please provide some concrete examples of hte types of data you
>> have, the types of queries you want them to match, and the types of
>> queries you *don't* want to match?
>>
>>
>> -Hoss
>>



Re: Which Tokeniser (and/or filter)

2012-02-07 Thread Robert Brown
This all seems a bit too much work for such a real-world scenario?


---

IntelCompute
Web Design & Local Online Marketing

http://www.intelcompute.com


On Tue, 7 Feb 2012 05:11:01 -0800 (PST), Ahmet Arslan
 wrote:
>> I'm still finding matches across
>> newlines
>>
>> index...
>>
>> i am fluent
>> german racing
>>
>> search...
>>
>> "fluent german"
>>
>> Any suggestions? 
> 
> You can use a multiValued field for this. Split your document
> according to new line at client side.
> 
> i am fluent
> german racing
> 
> positionIncrementGap="100" will prevent query "fluent german" to match.
> 
> Or, may be you can inject artificial tokens via 
> 
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.PatternReplaceCharFilterFactory
> 
> Your document becomes : i am fluent NEWLINE german racing



Re: Which Tokeniser (and/or filter)

2012-02-07 Thread Robert Brown
I'm still finding matches across newlines

index...

i am fluent
german racing

search...

"fluent german" 

Any suggestions?  I've currently got this in wdftypes.txt for
WordDelimiterfilterfactory


\u000A => ALPHANUM
\u000B => ALPHANUM
\u000C => ALPHANUM
\u000D => ALPHANUM
# \u000D\u000A => ALPHA
\u0085 => ALPHANUM
\u2028 => ALPHANUM
\u2029 => ALPHANUM

\u2424 => ALPHANUM





---

IntelCompute
Web Design & Local Online Marketing

http://www.intelcompute.com

On Mon, 6 Feb 2012 04:10:18 -0800 (PST), Ahmet Arslan
 wrote:
>> My fear is what will then happen with
>> highlighting if I use re-mapping?
> 
> What do you mean by re-mapping?



Re: Which Tokeniser (and/or filter)

2012-02-06 Thread Robert Brown
Mapping dots to spaces.  I don't think that's workable anyway since
".net" would cause issues.

Trying out the wdftypes now...


---

IntelCompute
Web Design & Local Online Marketing

http://www.intelcompute.com

On Mon, 6 Feb 2012 04:10:18 -0800 (PST), Ahmet Arslan
 wrote:
>> My fear is what will then happen with
>> highlighting if I use re-mapping?
> 
> What do you mean by re-mapping?



Re: Which Tokeniser (and/or filter)

2012-02-06 Thread Robert Brown
My fear is what will then happen with highlighting if I use re-mapping?



On Mon, 6 Feb 2012 03:33:03 -0800 (PST), Ahmet Arslan
 wrote:
>> I need to tokenise on whitespace, full-stop, and comma
>> ONLY.
>>
>> Currently using solr.WhitespaceTokenizerFactory with
>> WordDelimiterFilterFactory but this is also splitting on
>> &, /, new-line, etc.
> 
> WDF is customizable via types="wdftypes.txt" parameter. 
> 
> https://svn.apache.org/repos/asf/lucene/dev/trunk/solr/core/src/test-files/solr/conf/wdftypes.txt
> 
> Alternatively you can convert . and , to whitespace (before
> tokenizer) by MappingCharFilterFactory.
> 
> http://lucene.apache.org/solr/api/org/apache/solr/analysis/MappingCharFilterFactory.html



Symbols in synonyms

2012-02-06 Thread Robert Brown
Is it good practice, common, or even possible to put symbols in my 
list of synonyms?


I'm having trouble indexing and searching for "A&E", with it being 
split on the &.


we already convert .net to dotnet, but don't want to store every 
combination of 2 letters, A&E, M&E, etc.
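Assuming it's the WordDelimiterFilterFactory doing the splitting, would mapping
the ampersand in wdftypes.txt be a cleaner route than synonyms? i.e. an entry
like (untested):

& => ALPHA

so "&" is treated as a letter and "A&E" never gets split in the first place.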





--

IntelCompute
Web Design & Local Online Marketing

http://www.intelcompute.com



Which Tokeniser (and/or filter)

2012-02-06 Thread Robert Brown

Hi,

I need to tokenise on whitespace, full-stop, and comma ONLY.

Currently using solr.WhitespaceTokenizerFactory with 
WordDelimiterFilterFactory but this is also splitting on &, /, 
new-line, etc.


It seems such a simple setup, what am I doing wrong?  what do you use 
for such "normal searching"?


Thanks,
Rob

--

IntelCompute
Web Design & Local Online Marketing

http://www.intelcompute.com



"sage 200" not matching "... sage 200."

2012-01-30 Thread Robert Brown
A document containing "... sage 200." (with the trailing full-stop) is not 
being matched when searching for "sage 200", using the field type below...


Do I need the WordDelimiterFilterFactory for this to work as expected? 
I don't see any mention of periods being discussed in the docs.



positionIncrementGap="100">



		synonyms="textgen-synonyms.txt" ignoreCase="true" expand="true"/>





		synonyms="textgen-synonyms.txt" ignoreCase="true" expand="true"/>





Thanks,
Rob


--

IntelCompute
Web Design & Local Online Marketing

http://www.intelcompute.com



edismax phrase matching with a non-word char inbetween

2011-12-13 Thread Robert Brown

I have a field which is indexed and queried as follows:



ignoreCase="true" expand="true"/>
words="stopwords.txt" enablePositionIncrements="true" />
generateNumberParts="1" catenateWords="0" catenateNumbers="0" 
catenateAll="0" splitOnCaseChange="1"/>


protected="protwords.txt"/>




When searching for "street work" (with quotes), I'm getting matches 
and highlighting on things like...



"...Oxford Street (Work Experience)..."


why is this happening, and what can I do to stop it?

I've set it to 0 in my config to try and avert this sort of behaviour; 
am I correct in thinking that this is used to ensure there are no words 
in between the phrase words?




Highlighting to include stop words

2011-12-08 Thread Robert Brown

I have a text field, using stopwords...

Index and query analysers setup as follows:

SynonymFilterFactory
StopFilterFactory
WordDelimiterFilterFactory
LowerCaseFilterFactory
SnowballPorterFilterFactory


Searching for "front of house" brings back perfect matches, but 
doesn't highlight the "of".


I took "of" out of the stop words for the query analyser and it now 
matches "front-of-house", but I know there are better matches stored as 
"front of house" (without the hyphens) that are ranked much lower.


Is there any quick way to have the highlighting applied to the entire 
phrase that was searched?





--

IntelCompute
Web Design & Local Online Marketing

http://www.intelcompute.com



highlight 1 field twice

2011-12-06 Thread Robert Brown
When searching against 1 field, is it possible to have highlighting 
returned 2 different ways?


We'd like the full field returned with keywords highlighted, but then 
also returned as snippets.


Any possible approaches?
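One idea (untested, field names made up): copyField the text into a second
stored field and request different highlighting params per field, e.g.

<copyField source="body" dest="body_snippets"/>

hl=true&hl.fl=body,body_snippets
&f.body.hl.fragsize=0
&f.body_snippets.hl.snippets=3&f.body_snippets.hl.fragsize=100

since hl.fragsize=0 should return the whole field value highlighted, while the
second field comes back as normal snippets.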


--

IntelCompute
Web Design & Local Online Marketing

http://www.intelcompute.com



lower score for synonyms

2011-12-06 Thread Robert Brown

Is it possible to lower the score for synonym matches?

we setup...

admin => administration

but if someone searches specifically for "admin", we want those 
specific matches to rank higher than matches for "administration"
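One approach I'm considering (untested, names made up): index the text twice
via copyField, once through the synonym analyser and once without, and weight
the non-synonym copy higher, e.g.

<field name="title" type="text_with_synonyms" indexed="true" stored="true"/>
<field name="title_exact" type="text_no_synonyms" indexed="true" stored="false"/>
<copyField source="title" dest="title_exact"/>

with qf=title_exact^3 title, so an exact "admin" hit outscores one that only
matched via the admin => administration expansion.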




--

IntelCompute
Web Design & Local Online Marketing

http://www.intelcompute.com



Re: overriding qf in q affecting boosts

2011-12-05 Thread Robert Brown
So I need to explicitly set the boosts in the query?

i.e.

q=+(field1:this^2 field1:"that thing"^4) +(field2:other^3)
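I'll sanity-check it with debugQuery as suggested, e.g. (suitably URL-encoded):

/select?defType=edismax&q=%2B(field1:this^2 field1:"that thing"^4) %2B(field2:other^3)&debugQuery=true

and look at the parsedquery section to see exactly which boosts end up applied.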



---

IntelCompute
Web Design & Local Online Marketing

http://www.intelcompute.com

On Mon, 5 Dec 2011 09:49:34 -0300, Tomás Fernández Löbbe
 wrote:
> In this case, the boost and fields in the "qf" parameter won't be
> considered for the search. With this query Solr will search for documents
> with the terms "this" and/or (depending on your default operator) "that" in
> the field1 and the term "other" in the field2
> 
> On Mon, Dec 5, 2011 at 9:44 AM, Robert Brown  wrote:
> 
>> Thanks Tomás,
>>
>> My example should have read...
>>
>> q=+(field1:this field1:that) +(field2:other)
>>
>> I'm using edismax.
>>
>> so with this approach, the boosts as specified in solrconfig qf will
>> remain in place?
>>
>>
>> ---
>>
>> IntelCompute
>> Web Design & Local Online Marketing
>>
>> http://www.intelcompute.com
>>
>> On Mon, 5 Dec 2011 09:17:59 -0300, Tomás Fernández Löbbe
>>  wrote:
>> > Hi Robert, the answer depends on the query parser you are using. If you
>> are
>> > using the "edismax" query parser, then the "qf" will only be used when
>> you
>> > don't specify any field in the "q" parameter. In your example the result
>> > query will be, boolean queries for "this" and "that" in the field1 and a
>> > DisMax query for the term "other" in fields (and the boost) you specify
>> in
>> > qf.
>> >
>> > If you use "dismax" the field in the query will not be considered and if
>> > you use LuceneQP the qf are not considered and it is going to use the
>> > default search field for the term "other" and no boost.
>> >
>> > You can see this very easily turning on the "debugQuery".
>> >
>> > Regards,
>> >
>> > Tomás
>> >
>> > On Mon, Dec 5, 2011 at 8:18 AM, Robert Brown 
>> wrote:
>> >
>> >> If I have a set list in solrconfig for my "qf" along with their boosts,
>> >> and I then specify field names directly in q (where I could also
>> override
>> >> the boosts), are the boosts left in place, or reset to 1?
>> >>
>> >>
>> >> 
>> >> <str name="qf">
>> >>  this^3
>> >>  that^2
>> >>  other^9
>> >> </str>
>> >>
>> >>
>> >> ie q=field1:+(this that) +(other)
>> >>
>> >>
>> >>
>> >>
>> >>
>> >> --
>> >>
>> >> IntelCompute
>> >> Web Design & Local Online Marketing
>> >>
>> >> http://www.intelcompute.com
>> >>
>> >>
>>
>>



Re: overriding qf in q affecting boosts

2011-12-05 Thread Robert Brown
Thanks Tomás,

My example should have read...

q=+(field1:this field1:that) +(field2:other)

I'm using edismax.

so with this approach, the boosts as specified in solrconfig qf will
remain in place?


---

IntelCompute
Web Design & Local Online Marketing

http://www.intelcompute.com

On Mon, 5 Dec 2011 09:17:59 -0300, Tomás Fernández Löbbe
 wrote:
> Hi Robert, the answer depends on the query parser you are using. If you are
> using the "edismax" query parser, then the "qf" will only be used when you
> don't specify any field in the "q" parameter. In your example the result
> query will be, boolean queries for "this" and "that" in the field1 and a
> DisMax query for the term "other" in fields (and the boost) you specify in
> qf.
> 
> If you use "dismax" the field in the query will not be considered and if
> you use LuceneQP the qf are not considered and it is going to use the
> default search field for the term "other" and no boost.
> 
> You can see this very easily turning on the "debugQuery".
> 
> Regards,
> 
> Tomás
> 
> On Mon, Dec 5, 2011 at 8:18 AM, Robert Brown  wrote:
> 
>> If I have a set list in solrconfig for my "qf" along with their boosts,
>> and I then specify field names directly in q (where I could also override
>> the boosts), are the boosts left in place, or reset to 1?
>>
>>
>> 
>> <str name="qf">
>>  this^3
>>  that^2
>>  other^9
>> </str>
>>
>>
>> ie q=field1:+(this that) +(other)
>>
>>
>>
>>
>>
>> --
>>
>> IntelCompute
>> Web Design & Local Online Marketing
>>
>> http://www.intelcompute.com
>>
>>



overriding qf in q affecting boosts

2011-12-05 Thread Robert Brown
If I have a set list in solrconfig for my "qf" along with their 
boosts, and I then specify field names directly in q (where I could 
also override the boosts), are the boosts left in place, or reset to 1?




<str name="qf">
  this^3
  that^2
  other^9
</str>


ie q=field1:+(this that) +(other)





--

IntelCompute
Web Design & Local Online Marketing

http://www.intelcompute.com



switching on hl.requireFieldMatch reducing highlighted fields returned

2011-12-01 Thread Robert Brown
I have a query which is highlighting 3 snippets in 1 field, and 1 
snippet in another field.


By enabling hl.requireFieldMatch, only the latter highlighted field is 
returned.


from this...

  plc Whetstone Temporary [hl-on]Sales[hl-off] Assistant Customer service Cashier work 08

  and customer queries. 07 / 99 – 2003 Debenhams Central London [hl-on]Sales[hl-off] Adviser Customer

  Central London [hl-on]Sales[hl-off] Assistant Customer service; Visual merchandising; Dealing

  with telephone enquiries; Assisted in the [hl-on]production[hl-off] of jewellery, e.g. setting stones

  [hl-on]product[hl-off] knowledge [hl-on]sales[hl-off] experience

to this...

  product knowledge [hl-on]sales[hl-off] experience

I'm doing this so the word "product" and its variants are NOT 
highlighted - they match against a different field.




Re: Don't snowball depending on terms

2011-11-30 Thread Robert Brown
Thanks Erick,

This is a required feature since we're swapping out an existing search
engine for Solr - users have saved searches that need to behave the
same.

I'll look into the edismax stuff; that's the handler we're using
anyway.



---

IntelCompute
Web Design & Local Online Marketing

http://www.intelcompute.com

On Wed, 30 Nov 2011 09:12:11 -0500, Erick Erickson
 wrote:
> First, watch the syntax 
> 
> q=+(stemmed:perl^2 or stemmed:java^3) +unstemmed:"development manager"^5
> although it is a bit confusing to see the dismax stuff where the boost
> is put on the
> field name, but that's not how the queries are formed.
> 
> BTW, have you looked at edismax queries? You can distribute your terms
> across the fields, applying whatever boost you want and have the query
> input be pretty simple. It takes a bit to get your head around what
> edismax does,
> but it's worth it
> 
> But before you go there You've presented no evidence that this is
> desirable.
> What is the use-case here? You say "users may want"... Well, why do the work
> unless they *do* want this capability? I'd strongly advise that you
> just forget about
> this feature unless and until there's a demonstrated need. Here's a
> blog I made at
> Lucid. Long-winded, but I'm like that sometimes
> 
> http://www.lucidimagination.com/blog/2011/11/03/stop-being-so-agreeable/
> 
> Best
> Erick
> 
> 
> On Wed, Nov 30, 2011 at 8:50 AM, Robert Brown  wrote:
>> Boosts can be included there too can't they?
>>
>> so this is valid?
>>
>> q=+(stemmed^2:perl or stemmed^3:java) +unstemmed^5:"development
>> manager"
>>
>> is it possible to have different boosts on the same field btw?
>>
>> We currently search across 5 fields anyway, so my queries are gonna
>> start getting messy.  :-/
>>
>>
>> ---
>>
>> IntelCompute
>> Web Design & Local Online Marketing
>>
>> http://www.intelcompute.com
>>
>> On Wed, 30 Nov 2011 08:08:41 -0500, Erick Erickson
>>  wrote:
>>> You can't have multiple "q" clauses (as opposed to "fq" clauses).
>>> You could form something like
>>> q=unstemmed:perl or java&fq=stemmed:manager
>>> or
>>> q=+(unstemmed:perl or java) +stemmed:manager
>>>
>>> BTW, this fragment of the query probably doesn't do
>>> what you expect:
>>> unstemmed:perl or java
>>> would be parsed as
>>> unstemmed:perl OR default_search_field:java
>>>
>>> FWIW
>>> Erick
>>>
>>> On Wed, Nov 30, 2011 at 7:39 AM, Rob Brown  wrote:
>>>> I guess I could do a bit of pre-processing, look for any words that are
>>>> quoted, and search in a diff field for those
>>>>
>>>> How is a query like this formulated?
>>>>
>>>> q=unstemmed:perl or java&q=stemmed:manager
>>>>
>>>>
>>>> --
>>>>
>>>> IntelCompute
>>>> Web Design and Online Marketing
>>>>
>>>> http://www.intelcompute.com
>>>>
>>>>
>>>> -Original Message-
>>>> From: Tomas Zerolo 
>>>> Reply-to: solr-user@lucene.apache.org
>>>> To: solr-user@lucene.apache.org
>>>> Subject: Re: Don't snowball depending on terms
>>>> Date: Wed, 30 Nov 2011 08:49:37 +0100
>>>>
>>>> On Tue, Nov 29, 2011 at 01:53:44PM -0500, François Schiettecatte wrote:
>>>>> It won't and depending on how your analyzer is set up the terms are most 
>>>>> likely stemmed at index time.
>>>>>
>>>>> You could create a separate field for unstemmed terms though, or use a 
>>>>> less aggressive stemmer such as EnglishMinimalStemFilterFactory.
>>>>
>>>> This is surprising to me. Snowball introduces new homonyms, meaning it
>>>> will lump e.g. "management" and "manage" into one index entry. Thus,
>>>> I'd expect a handful of "false positives" (but usually not too many).
>>>>
>>>> That's a "lossy index" (loosely speaking) and could be fixed by
>>>> post-filtering (instead of introducing another index, which in
>>>> most cases would seem a waste of resources).
>>>>
>>>> Is there no way in SOLR of filtering the results *after* the index
>>>> scan? I'd be disappointed!
>>>>
>>>> Regards
>>>> -- tomás
>>>>
>>



Re: Don't snowball depending on terms

2011-11-30 Thread Robert Brown
Boosts can be included there too, can't they?

so this is valid?

q=+(stemmed^2:perl or stemmed^3:java) +unstemmed^5:"development
manager"

is it possible to have different boosts on the same field btw?

We currently search across 5 fields anyway, so my queries are gonna
start getting messy.  :-/


---

IntelCompute
Web Design & Local Online Marketing

http://www.intelcompute.com

On Wed, 30 Nov 2011 08:08:41 -0500, Erick Erickson
 wrote:
> You can't have multiple "q" clauses (as opposed to "fq" clauses).
> You could form something like
> q=unstemmed:perl or java&fq=stemmed:manager
> or
> q=+(unstemmed:perl or java) +stemmed:manager
> 
> BTW, this fragment of the query probably doesn't do
> what you expect:
> unstemmed:perl or java
> would be parsed as
> unstemmed:perl OR default_search_field:java
> 
> FWIW
> Erick
> 
> On Wed, Nov 30, 2011 at 7:39 AM, Rob Brown  wrote:
>> I guess I could do a bit of pre-processing, look for any words that are
>> quoted, and search in a diff field for those
>>
>> How is a query like this formulated?
>>
>> q=unstemmed:perl or java&q=stemmed:manager
>>
>>
>> --
>>
>> IntelCompute
>> Web Design and Online Marketing
>>
>> http://www.intelcompute.com
>>
>>
>> -Original Message-
>> From: Tomas Zerolo 
>> Reply-to: solr-user@lucene.apache.org
>> To: solr-user@lucene.apache.org
>> Subject: Re: Don't snowball depending on terms
>> Date: Wed, 30 Nov 2011 08:49:37 +0100
>>
>> On Tue, Nov 29, 2011 at 01:53:44PM -0500, François Schiettecatte wrote:
>>> It won't and depending on how your analyzer is set up the terms are most 
>>> likely stemmed at index time.
>>>
>>> You could create a separate field for unstemmed terms though, or use a less 
>>> aggressive stemmer such as EnglishMinimalStemFilterFactory.
>>
>> This is surprising to me. Snowball introduces new homonyms, meaning it
>> will lump e.g. "management" and "manage" into one index entry. Thus,
>> I'd expect a handful of "false positives" (but usually not too many).
>>
>> That's a "lossy index" (loosely speaking) and could be fixed by
>> post-filtering (instead of introducing another index, which in
>> most cases would seem a waste of resources).
>>
>> Is there no way in SOLR of filtering the results *after* the index
>> scan? I'd be disappointed!
>>
>> Regards
>> -- tomás
>>



Don't snowball depending on terms

2011-11-29 Thread Robert Brown
Is it possible to search a field but not be affected by the snowball 
filter?


ie, searching for "manage" is matching "management", but a user may 
want to restrict results to only those containing "manage".


I was hoping that simply quoting the term would do this, but it 
doesn't appear to make any difference.
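
A sketch of the separate-field approach mentioned in the replies (field and
type names are hypothetical): keep an unstemmed copy of the field and point
exact searches at it:

<field name="body"       type="text_stemmed"   indexed="true" stored="true"/>
<field name="body_exact" type="text_unstemmed" indexed="true" stored="false"/>
<copyField source="body" dest="body_exact"/>

A query such as q=body_exact:manage then only matches the literal token,
while q=body:manage keeps the stemmed behaviour.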





--

IntelCompute
Web Design & Local Online Marketing

http://www.intelcompute.com



Highlighting too much, indexing not seeing commas?

2011-11-23 Thread Robert Brown

Solr 3.3.0

I have a field/type indexed as below.

For a particular document the content of this field is 
'FreeBSD,Perl,Linux,Unix,SQL,MySQL,Exim,Postgresql,Apache,Exim'


Using eDismax, mm=1

When I query for...

+perl +(apache sql) +(linux unix)

Strangely, the highlighting is being returned as...

FreeBSD,Perl,Linux,Unix,SQL,MySQL,Exim,Postgresql,Apache,Exim


The full call is...

/select/?qt=core&q=%2Bperl%20%2B%28apache%20sql%29%20%2B%28linux%20unix%29&fl=skills&hl=true&hl.fl=skills&fq=id:2819615

I've checked the matching in the online analyser, which looks fine, so I 
can't understand why the highlighting isn't correct. I would have 
thought the highlighting would behave in the same way the 
analyser tool does?




Is it an index-time/field type issue, or am I missing something in the 
request?



Thanks in advance...



positionIncrementGap="100">



		ignoreCase="true" expand="true"/>
		words="stopwords.txt" enablePositionIncrements="true" />
		generateWordParts="1" generateNumberParts="1" catenateWords="1" 
catenateNumbers="1" catenateAll="0" splitOnCaseChange="0"/>





		ignoreCase="true" expand="true"/>
		words="stopwords.txt" enablePositionIncrements="true" />
		generateWordParts="1" generateNumberParts="1" catenateWords="0" 
catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>






stored="true"  multiValued="false" />






--

IntelCompute
Web Design & Local Online Marketing

http://www.intelcompute.com



Re: Always return total number of documents

2011-10-28 Thread Robert Brown
Cheers Kuli,

This is actually of huge importance to our customers, to see how many
documents we store.

The faceting option sounds a bit messy, so maybe we'll have to stick with
2 queries.


---

IntelCompute
Web Design & Local Online Marketing

http://www.intelcompute.com

On Fri, 28 Oct 2011 11:43:11 +0200, Michael Kuhlmann 
wrote:
> Am 28.10.2011 11:16, schrieb Robert Brown:
>> Is there no way to return the total number of docs as part of a search?
> 
> No, it isn't. Usually this information is of absolutely no value to the
> end user.
> 
> A workaround would be to add some field to the schema that has the same
> value for every document, and use this for facetting.
> 
> Greetings,
> Kuli



Always return total number of documents

2011-10-28 Thread Robert Brown
Currently I'm making 2 calls to Solr to be able to state "matched 20 
out of 200 documents".


Is there no way to return the total number of docs as part of a 
search?
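
If it does have to stay as two calls, the second one can at least be very
cheap - a match-all query with no rows returned:

/select?q=*:*&rows=0

numFound in the response then gives the total number of documents in the
index without fetching any stored fields.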



--

IntelCompute
Web Design & Local Online Marketing

http://www.intelcompute.com



Limit by score? sort by other field

2011-10-27 Thread Robert Brown
When we display search results to our users we include a percentage 
score.


Top result being 100%, then all others normalised based on the 
maxScore, calculated outside of Solr.


We now want to limit the returned docs to those with a percentage score 
higher than, say, 50%.


e.g. we want to search but only return docs scoring above 80%, while 
sorting by date, hence not being able to just sort by score.
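
One possible route (a sketch only - it works on the raw Lucene score, not the
normalised percentage, so the threshold would have to be derived from maxScore
first, e.g. in a prior request): filter on the main query's score with a
function range query while sorting on the date field:

q=perl developer&fq={!frange l=2.5}query($q)&sort=posted_date desc

Here 2.5 and posted_date are just placeholders; l= sets the lower bound of the
accepted score range.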




Getting single documents by fq on unique field, performance

2011-10-21 Thread Robert Brown

Hi,

We do regular searches against documents, with highlighting on.  To 
then view a document in more detail, we re-do the search but add 
fq=id:12345 to return the single document of interest; we still want 
highlighting on, so we send the q param back again.


Is there anything you would recommend doing to increase performance 
(it's not currently a problem, more a curiosity)?  I had the following 
in mind, but wanted to gauge whether they'd actually be worthwhile...


1. Using a different request handler with no boosts, etc.
2. Setting rows=1 since we know there's only 1 doc coming back.
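
For what it's worth, the single-document request might then look like this
(a sketch; the handler name and highlight fields are placeholders):

/select?q=<original user query>&fq=id:12345&rows=1&hl=true&hl.fl=<fields>

The fq on the unique key is cached in the filterCache, so repeat views of the
same document are cheap, and rows=1 avoids asking for more documents than can
possibly come back.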

Thanks,
Rob

--

IntelCompute
Web Design & Local Online Marketing

http://www.intelcompute.com



Re: Multi CPU Cores

2011-10-17 Thread Robert Brown
Thanks Otis,

I certainly won't be copying & pasting - good to know such options are
available, though.



On Mon, 17 Oct 2011 07:01:24 -0700 (PDT), Otis Gospodnetic
 wrote:
> Robert,
> 
> You have to add (some of) that stuff to the command for starting
> Java/Tomcat.  Likely in a catalina.sh script.
> 
> That said, I do NOT recommend you use those parameters at all because
> they may be completely unneeded or even unsuitable for your
> environment.
> 
> Otis
> 
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Lucene ecosystem search :: http://search-lucene.com/
> 
> 
> 
> 
>>
>>From: Robert Brown 
>>To: solr-user@lucene.apache.org
>>Sent: Monday, October 17, 2011 4:01 AM
>>Subject: Re: Multi CPU Cores
>>
>>Where exactly do you set this up?  We're running Solr3.4 under tomcat,
>>OpenJDK 1.6.0.20
>>
>>btw, is the JRE just a different name for the VM?  Apologies for such a
>>newbie Java question.
>>
>>
>>
>>On Sun, 16 Oct 2011 12:51:44 -0400, Johannes Goll
>> wrote:
>>> we use the the following in production
>>>
>>> java -server -XX:+UseParallelGC -XX:+AggressiveOpts
>>> -XX:+DisableExplicitGC -Xms3G -Xmx40G -Djetty.port=
>>> -Dsolr.solr.home= jar start.jar
>>>
>>> more information
>>> http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html
>>>
>>> Johannes
>>
>>
>>
>>



Re: Multi CPU Cores

2011-10-17 Thread Robert Brown
Where exactly do you set this up?  We're running Solr 3.4 under Tomcat,
OpenJDK 1.6.0.20.

btw, is the JRE just a different name for the VM?  Apologies for such a
newbie Java question.
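
In case it helps: with Tomcat the usual place for such flags is a
bin/setenv.sh next to catalina.sh, which catalina.sh sources automatically if
it exists - something along these lines (the values are purely illustrative,
not a recommendation):

# $CATALINA_HOME/bin/setenv.sh
export CATALINA_OPTS="-server -Xms512m -Xmx2g -XX:+UseParallelGC"

And loosely speaking, yes: the JRE is the runtime package (the JVM plus the
class libraries), while the JVM is the virtual machine itself.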



On Sun, 16 Oct 2011 12:51:44 -0400, Johannes Goll
 wrote:
> we use the the following in production
> 
> java -server -XX:+UseParallelGC -XX:+AggressiveOpts
> -XX:+DisableExplicitGC -Xms3G -Xmx40G -Djetty.port=
> -Dsolr.solr.home= jar start.jar
> 
> more information
> http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html
> 
> Johannes



Re: what is the recommended way to store locations?

2011-10-06 Thread Robert Brown
Expanding CA to California sounds like a use for a synonyms config
file?  You can then do that translation at index and query time, if
needed.
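
A sketch of what that could look like (file name and filter placement are
assumptions):

# synonyms.txt
CA, California
NY, New York

referenced from the field's analyzer chain with something like:

<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>

With expand="true", either form in the query (or in the indexed text,
depending on which analyzer the filter sits in) will match the other.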


On Thu, 6 Oct 2011 12:01:33 -0400, Jason Toy 
wrote:
> Hi Otis,
>  Thanks for the response. So just to make sure I understand clearly, so I
> would store a location field of either text or ngram fields
> of the format "San Francisco, California, United States"  and use full text
>  search against that so someone could search for San Francisco or California
> and get that hit?
> I've also added some code in the application level so that if someone
> searches for CA, it gets expanded to California during search time, would it
> be better to store this in the doc directly or keep it in application code?
> 
> Jason
> 
> 
> On Thu, Oct 6, 2011 at 11:34 AM, Otis Gospodnetic <
> otis_gospodne...@yahoo.com> wrote:
> 
>> Jason,
>>
>> That sounds pretty simple and works well if you plan on allowing
>> fielded/structured search.
>> If not, you could alternatively stick all geo values in a single text field
>> and avoid dealing with multiple fields.
>>
>> You may also want to use ngram fields instead of text if you want to still
>> match that San Fransisco oops typo.
>>
>> Otis
>> 
>> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
>> Lucene ecosystem search :: http://search-lucene.com/
>>
>>
>> - Original Message -
>> > From: Jason Toy 
>> > To: solr-user@lucene.apache.org
>> > Cc:
>> > Sent: Thursday, October 6, 2011 11:27 AM
>> > Subject: what is the recommended way to store locations?
>> >
>> > In our current system ,we have 3 fields for location,  city, state, and
>> > country.People in our system search for one of those 3 strings.
>> > So a user can search for "San Francisco" or "California".
>> > In solr I store
>> > those 3 fields as strings and when a search happens I search with an OR
>> > statement across those 3 fields.
>> >
>> > Is there a more efficient way to store this data storage wise and/or
>> speed
>> > wise?  We don't currently plan to use any spacial features like "3
>> > miles
>> > near SF".
>> >
>>



Re: negative boosts for docs with common field value

2011-10-06 Thread Robert Brown
We don't want to limit the number of results coming back, so
unfortunately grouping doesn't quite fix it. It would also, by nature,
pull docs by a particular Author together when they might not otherwise
be adjacent.



On Thu, 6 Oct 2011 07:16:48 -0700 (PDT), Ahmet Arslan
 wrote:
>> For the sake of simplicity, I have an index with docs
>> containing the following fields:
>>
>> Title
>> Description
>> Author
>>
>> Some searches will obviously be saturated by docs from any
>> given author if they've simply written more.
>>
>> I'd like to give a negative boost to these matches,
>> there-by making sure that 1 Author doesn't saturate the
>> results just because they've written 500 documents, compared
>> to others who may have only written 2-3 documents.
>>
>> The actual author value doesn't matter, I just want to
>> bring down the score of docs by any common author to give
>> more varied results.
>>
>> What's the easiest approach for this, and is it even
>> possible at query time?  I could do this at index time
>> but would prefer a Solr solution.
>>
>> Solr 3.4 using edismax handler
> 
> You can consider grouping results by author name. Display 2-3 results
> per author, and put a link saying "see remaining xxx documents of this
> author"
> 
> http://wiki.apache.org/solr/FieldCollapsing



negative boosts for docs with common field value

2011-10-06 Thread Robert Brown

Hi,

For the sake of simplicity, I have an index with docs containing the 
following fields:


Title
Description
Author

Some searches will obviously be saturated by docs from any given 
author if they've simply written more.


I'd like to give a negative boost to these matches, thereby making 
sure that 1 Author doesn't saturate the results just because they've 
written 500 documents, compared to others who may have only written 2-3 
documents.


The actual author value doesn't matter, I just want to bring down the 
score of docs by any common author to give more varied results.


What's the easiest approach for this, and is it even possible at query 
time?  I could do this at index time but would prefer a Solr solution.


Solr 3.4 using edismax handler
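
For completeness, the field-collapsing route suggested in the reply would look
roughly like this (a sketch; it caps how many documents per author are
returned rather than deboosting them):

&group=true&group.field=Author&group.limit=3

group.limit controls how many documents come back per author group; it
doesn't change scoring, which is why it may not fit if the full, ungrouped
result list is still needed.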

Thanks,
Rob