eDisMax, multiple language support and stopwords

2013-11-07 Thread Tom Mortimer
Hi all,

Thanks for the help and advice I've got here so far!

Another question - I want to support stopwords at search time, so that e.g.
the query "oscar and wilde" is equivalent to "oscar wilde" (this is with
lowercaseOperators=false). Fair enough - I have the stopword "and" in the
query analyser chain.

However, I also need to support French as well as English, so I've got _en
and _fr versions of the text fields, with appropriate stemming and
stopwords. I index French content into the _fr fields and English into the
_en fields. I'm searching with eDisMax over both versions, e.g.:

<str name="qf">headline_en headline_fr</str>

However, this means I get no results for "oscar and wilde". The parsed
query is:

(+((DisjunctionMaxQuery((headline_fr:osca | headline_en:oscar))
DisjunctionMaxQuery((headline_fr:and))
DisjunctionMaxQuery((headline_fr:wild | headline_en:wild)))~3))/no_coord

If I add "and" to the French stopwords list, I *do* get results, and the
parsed query is:

(+((DisjunctionMaxQuery((headline_fr:osca | headline_en:oscar))
DisjunctionMaxQuery((headline_fr:wild | headline_en:wild)))~2))/no_coord

This implies that the only solution is to have a minimal, shared stopwords
list for all languages I want to support. Is this correct, or is there a
way of supporting this kind of searching with per-language stopword lists?

Thanks for any ideas!

Tom


Re: eDisMax, multiple language support and stopwords

2013-11-07 Thread Tom Mortimer
Ah, thanks Markus. I think I'll just add the Boolean operators to the
stopwords list in that case.
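
For the record, roughly what I mean - a sketch, assuming a small shared file
(say, stopwords_operators.txt, containing just "and", "or", "not") referenced
from the query analyser chain of both the _en and _fr field types:

<filter class="solr.StopFilterFactory" words="stopwords_operators.txt"
        ignoreCase="true"/>

That way both fields filter the same tokens for the operator words, so the mm
calculation stays consistent across them.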

Tom



On 7 November 2013 12:01, Markus Jelsma markus.jel...@openindex.io wrote:

 This is an ancient problem. The issue here is your mm parameter: it gets
 confused because a different number of tokens is filtered/emitted for each
 field, so it is never going to work quite like this. The easiest
 option is not to use the stopfilter.


 http://lucene.472066.n3.nabble.com/Dismax-Minimum-Match-Stopwords-Bug-td493483.html
 https://issues.apache.org/jira/browse/SOLR-3085



Re: newbie getting started with solr

2013-11-07 Thread Tom Mortimer
Hi Eric,

Solr configuration can certainly be confusing at first. And for some time
after. :P

If you're running start.jar from the example folder (which is fine for
testing, and I've known some people to use it for production systems) then
the default solr home is example/solr.  This contains solr.xml, which
specifies where to find per-core configuration and data. (A core is
equivalent to a collection in a simple non-sharded setup).

For now, the easiest thing would be to use the default core in
example/solr/collection1. Copy your solrconfig.xml and schema.xml over the
ones in collection1/conf (backing up the originals for reference). Create
your data directory wherever you like and symlink it into collection1.
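
(You asked where the collection name is set: with the legacy solr.xml format
that the 4.x example uses, it's the core name in example/solr/solr.xml - a
sketch, assuming you keep the collection1 instance directory:

<cores adminPath="/admin/cores">
  <core name="mycollection" instanceDir="collection1" />
</cores>

"mycollection" is then the name you use in URLs, e.g.
http://localhost:8983/solr/mycollection/select?q=*:* .)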

Now when you run "java -jar start.jar" in example/, you should be able to
access Solr at http://localhost:8983/solr/ , and add and search for
documents.

Hope that helps a bit!

Tom



On 7 November 2013 14:50, Palmer, Eric epal...@richmond.edu wrote:

 Sorry if this is obvious (because it isn't for me)

 I want to build a solr (4.5.1) + nutch (1.7.1) environment.  I'm doing
 this on amazon linux (I may put nutch on a separate server eventually).

 Please let me know if my thinking is sound or off base

 in the example folder are a lot of files and folders including the war
 file and start.jar

 drwxr-xr-x   cloud-scripts
 drwxr-xr-x   contexts
 drwxr-xr-x   etc
 drwxr-xr-x   example-DIH
 drwxr-xr-x   exampledocs
 drwxr-xr-x   example-schemaless
 drwxr-xr-x   lib
 drwxr-xr-x   logs
 drwxr-xr-x   multicore
 -rw-r--r--   README.txt
 drwxr-xr-x   resources
 drwxr-xr-x   solr
 drwxr-xr-x   solr-webapp
 -rw-r--r--   start.jar
 drwxr-xr-x   webapps


 I am creating a separate folder for the conf and data folders (on another
 disk) and placing these files in the conf folder

 schema-solr.xml (from nutch) renamed to schema.xml
 solrconfig.xml

 I will use the example folder and start.jar from that location. (Is this
 okay?)

 Where do I set the collection name?

 What else do I need to do to get a basic web page indexer built? (I'll
 work out the crawling later, I just want to be able to manually add some
 documents and query).  I'm trying to understand solr first and then will
 use nutch.

 I have several books and have looked at the tutorial and other web sites.
 It seems they assume that I know where to begin when creating a new
 collection and customizing it.

 Thanks in advance for your help.

 --
 Eric Palmer
 Web Services
 U of Richmond

 To report technical issues, obtain technical support or make requests for
 enhancements please visit
 http://web.richmond.edu/contact/technical-support.html



eDisMax and Boolean operator case-sensitivity

2013-11-06 Thread Tom Mortimer
Hi,

I'm using the eDisMax query parser, and need to support the Boolean operators
AND and OR. It seems from testing that these are *not* case sensitive, e.g.
with mm set to 0, "oscar AND wilde" returns the same results as "oscar and
wilde" (15 hits), while "oscar foo wilde" returns the same results as "oscar
wilde" (2000 hits).

Is it possible to configure eDisMax to do case-sensitive parsing, so that
"AND" is an operator but "and" is just another term?

thanks,
Tom


Re: eDisMax and Boolean operator case-sensitivity

2013-11-06 Thread Tom Mortimer
Oh, good grief - I was just reading that page, how did I miss that? *derp*

Thanks Shawn!!!

Tom


On 6 November 2013 18:59, Shawn Heisey s...@elyograg.org wrote:


 Include another query parameter: lowercaseOperators=false

 http://wiki.apache.org/solr/ExtendedDisMax#lowercaseOperators
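
 If you want it on every request, you could also set it as a default on your
 handler in solrconfig.xml - a sketch, assuming an edismax handler:

 <requestHandler name="/select" class="solr.SearchHandler">
   <lst name="defaults">
     <str name="defType">edismax</str>
     <str name="lowercaseOperators">false</str>
   </lst>
 </requestHandler>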

 Thanks,
 Shawn




Re: SolrCloud performance in VM environment

2013-10-22 Thread Tom Mortimer
Boogie, Shawn,

Thanks for the replies. I'm going to try out some of your suggestions
today. Although without more RAM, I'm not that optimistic...

Tom



On 21 October 2013 18:40, Shawn Heisey s...@elyograg.org wrote:

 On 10/21/2013 9:48 AM, Tom Mortimer wrote:

 Hi everyone,

 I've been working on an installation recently which uses SolrCloud to
 index
 45M documents into 8 shards on 2 VMs running 64-bit Ubuntu (with another 2
 identical VMs set up for replicas). The reason we're using so many shards
 for a relatively small index is that there are complex filtering
 requirements at search time, to restrict users to items they are licensed
 to view. Initial tests demonstrated that multiple shards would be
 required.

 The total size of the index is about 140GB, and each VM has 16GB RAM (32GB
 total) and 4 CPU units. I know this is far under what would normally be
 recommended for an index of this size, and I'm working on persuading the
 customer to increase the RAM (basically, telling them it won't work
 otherwise.) Performance is currently pretty poor and I would expect more
 RAM to improve things. However, there are a couple of other oddities which
 concern me:


 Running multiple shards like you are, where each operating system is
 handling more than one shard, is only going to perform better if your query
 volume is low and you have lots of CPU cores.  If your query volume is high
 or you only have 2-4 CPU cores on each VM, you might be better off with
 fewer shards or not sharded at all.

 The way that I read this is that you've got two physical machines with
 32GB RAM, each running two VMs that have 16GB.  Each VM houses 4 shards, or
 70GB of index.

 There's a scenario that might be better if all of the following are true:
 1) I'm right about how your hardware is provisioned.  2) You or the client
 owns the hardware.  3) You have an extremely low-end third machine
 available - single CPU with 1GB of RAM would probably be enough.  In this
 scenario, you run one Solr instance and one zookeeper instance on each of
 your two big machines, and use the third wimpy machine as a third
 zookeeper node.  No virtualization.  For the rest of my reply, I'm assuming
 that you haven't taken this step, but it will probably apply either way.


  The first is that I've been reindexing a fixed set of 500 docs to test
 indexing and commit performance (with soft commits within 60s). The time
 taken to complete a hard commit after this is longer than I'd expect, and
 highly variable - from 10s to 70s. This makes me wonder whether the SAN
 (which provides all the storage for these VMs and the customer's several
 other VMs) is being saturated periodically. I grabbed some iostat output
 on
 different occasions to (possibly) show the variability:

 Device:tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
 sdb  64.50 0.00  2476.00  0   4952
 ...
 sdb   8.90 0.00   348.00  0   6960
 ...
 sdb   1.15 0.0043.20  0864


 There are two likely possibilities for this.  One or both of them might be
 in play.  1) Because the OS disk cache is small, not much of the index can
 be cached.  This can result in a lot of disk I/O for a commit, slowing
 things way down.  Increasing the size of the OS disk cache is really the
 only solution for that. 2) Cache autowarming, particularly the filter
 cache.  In the cache statistics, you can see how long each cache took to
 warm up after the last searcher was opened.  The solution for that is to
 reduce the autowarmCount values.
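
 (For example, the filter cache is configured in solrconfig.xml; a sketch
 with autowarming turned off, assuming the stock cache implementation:

 <filterCache class="solr.FastLRUCache"
              size="512"
              initialSize="512"
              autowarmCount="0"/>
 )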


  The other thing that confuses me is that after a Solr restart or hard
 commit, search times average about 1.2s under light load. After searching
 the same set of queries for 5-6 iterations this improves to 0.1s. However,
 in either case - cold or warm - iostat reports no device reads at all:

 Device:tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
 sdb   0.40 0.00 8.00  0160
 ...
 sdb   0.30 0.0010.40  0104

 (the writes are due to logging). This implies to me that the 'hot' blocks
 are being completely cached in RAM - so why the variation in search time
 and the number of iterations required to speed it up?


 Linux is pretty good about making limited OS disk cache resources work.
  Sounds like the caching is working reasonably well for queries.  It might
 not be working so well for updates or commits, though.

 Running multiple Solr JVMs per machine, virtual or not, causes more
 problems than it solves.  Solr has no limits on the number of cores (shard
 replicas) per instance, assuming there are enough system resources.  There
 should be exactly one Solr JVM per operating system.  Running more than one
 results in quite a lot of overhead, and your memory is precious.  When you
 create a collection, you can give the collections API

Re: SolrCloud performance in VM environment

2013-10-22 Thread Tom Mortimer
Just tried it with no other changes than upping the RAM to 128GB total, and
it's flying. I think that proves that RAM is good. =)  Will implement
suggested changes later, though.

cheers,
Tom


SolrCloud performance in VM environment

2013-10-21 Thread Tom Mortimer
Hi everyone,

I've been working on an installation recently which uses SolrCloud to index
45M documents into 8 shards on 2 VMs running 64-bit Ubuntu (with another 2
identical VMs set up for replicas). The reason we're using so many shards
for a relatively small index is that there are complex filtering
requirements at search time, to restrict users to items they are licensed
to view. Initial tests demonstrated that multiple shards would be required.

The total size of the index is about 140GB, and each VM has 16GB RAM (32GB
total) and 4 CPU units. I know this is far under what would normally be
recommended for an index of this size, and I'm working on persuading the
customer to increase the RAM (basically, telling them it won't work
otherwise.) Performance is currently pretty poor and I would expect more
RAM to improve things. However, there are a couple of other oddities which
concern me:

The first is that I've been reindexing a fixed set of 500 docs to test
indexing and commit performance (with soft commits within 60s). The time
taken to complete a hard commit after this is longer than I'd expect, and
highly variable - from 10s to 70s. This makes me wonder whether the SAN
(which provides all the storage for these VMs and the customer's several
other VMs) is being saturated periodically. I grabbed some iostat output on
different occasions to (possibly) show the variability:

Device:tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sdb  64.50 0.00  2476.00  0   4952
...
sdb   8.90 0.00   348.00  0   6960
...
sdb   1.15 0.0043.20  0864

The other thing that confuses me is that after a Solr restart or hard
commit, search times average about 1.2s under light load. After searching
the same set of queries for 5-6 iterations this improves to 0.1s. However,
in either case - cold or warm - iostat reports no device reads at all:

Device:tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sdb   0.40 0.00 8.00  0160
...
sdb   0.30 0.0010.40  0104

(the writes are due to logging). This implies to me that the 'hot' blocks
are being completely cached in RAM - so why the variation in search time
and the number of iterations required to speed it up?

The Solr caches are only being used lightly by these tests and there are no
evictions. GC is not a significant overhead. Each Solr shard runs in a
separate JVM with 1GB heap.

I don't have a great deal of experience in low-level performance tuning, so
please forgive any naivety. Any ideas of what to do next would be greatly
appreciated. I don't currently have details of the VM implementation but
can get hold of this if it's relevant.

thanks,
Tom


Re: Restricting search results by field value

2012-12-06 Thread Tom Mortimer
Sounds like it's worth a try! Thanks Andre.
Tom

On 5 Dec 2012, at 17:49, Andre Bois-Crettez andre.b...@kelkoo.com wrote:

 If you do grouping on source_id, it should be enough to request 3 times
 more documents than you need, then reorder and drop the bottom.
 
 Is a 3x overhead acceptable ?
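
 Something like this, say, if you need the top 10 - a sketch, with the final
 re-sort by score done on the client (parameter values here are invented):

 q=your+query&group=true&group.field=source_id&group.limit=3&group.main=true&rows=30&fl=*,score

 - then sort the 30 returned docs by score and keep the first 10.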
 
 
 
 
 --
 André Bois-Crettez
 
 Search technology, Kelkoo
 http://www.kelkoo.com/
 



Re: Restricting search results by field value

2012-12-06 Thread Tom Mortimer
Thanks, but even with group.main=true the results are not in relevancy (score)
order; they are in group order, which is why I can't use it as is.

Tom


On 6 Dec 2012, at 19:00, Way Cool way1.wayc...@gmail.com wrote:

 Grouping should work:
  group=true&group.field=source_id&group.limit=3&group.main=true
 



Restricting search results by field value

2012-12-05 Thread Tom Mortimer
Hi everyone,

I've got a problem where I have docs with a source_id field, and there can be 
many docs from each source. Searches will typically return docs from many 
sources. I want to restrict the number of docs from each source in results, so 
there will be no more than (say) 3 docs from source_id=123 etc.

Field collapsing is the obvious approach, but I want to get the results back in 
relevancy order, not grouped by source_id. So it looks like I'll have to fetch 
more docs than I need to and re-sort them. It might even be better to count 
source_ids in the client code and drop excess docs that way, but the potential 
overhead is large.

Is there any way of doing this in Solr without hacking in a custom Lucene 
Collector? (which doesn't look all that straightforward).

cheers,
Tom
 

Re: AutoIndexing

2012-09-25 Thread Tom Mortimer
Hi Darshan,

Can you give us some more details? E.g. what do you mean by "database" - an
RDBMS? Which software? How are you indexing it (or intending to index it) to
Solr? etc...

cheers,
Tom


On 25 Sep 2012, at 09:55, darshan dk...@dreamsoftech.com wrote:

 Hi All,
 
   Is there any way I can auto-index whenever there are
 changes in my database?
 
 Thanks,
 
 Darshan
 



Re: How can I create about 100000 independent indexes in Solr?

2012-09-25 Thread Tom Mortimer
Hi,

Why do you think that the indexes should be independent? What would be the 
problem with using a single index and filter queries?
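
A sketch of what I mean - each doc carries its customer's id (the field name
here is invented), and every query is filtered on it:

   <field name="customer_id" type="string" indexed="true" stored="true"/>

   q=some+query&fq=customer_id:12345

fq filters are cached independently of the main query, so this is cheap.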

Tom

On 25 Sep 2012, at 03:21, 韦震宇 weizhe...@win-trust.com wrote:

 Dear all,
    The company I'm working in has a website serving more than 100000
 customers, and every customer should have its own search category. So I
 should create an independent index for every customer.
    The site http://wiki.apache.org/solr/MultipleIndexes gives some solutions for
 creating multiple indexes.
    I want to use the multicore solution, but I'm afraid that Solr can't support
 so many indexes with this solution.
    The other solution, "Flattening data into a single index", is a choice, but
 I think it's best to keep all indexes independent.
    Could you tell me how to create about 100000 independent indexes in Solr?
    Thank you all for replying!



Re: AutoIndexing

2012-09-25 Thread Tom Mortimer
I'm afraid I don't have any DIH experience myself, but some googling suggests 
that using a postgresql trigger to start a delta import might be one approach:

http://wiki.apache.org/solr/DataImportHandler#Using_delta-import_command  and
http://wiki.apache.org/solr/DataImportHandlerDeltaQueryViaFullImport
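
Roughly, the delta config in DIH's data-config.xml looks like this - just a
sketch, with invented table and column names:

<entity name="item" pk="id"
        query="SELECT * FROM items"
        deltaQuery="SELECT id FROM items
                    WHERE last_modified &gt; '${dataimporter.last_index_time}'"
        deltaImportQuery="SELECT * FROM items WHERE id='${dih.delta.id}'"/>

The trigger would then just hit /dataimport?command=delta-import.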

Tom

On 25 Sep 2012, at 11:28, darshan dk...@dreamsoftech.com wrote:

 My document source is a database (yes, an RDBMS) and the software for it is
 postgresql; any change in its tables should be reflected without re-indexing.
 I am indexing it via the DIH process.
 Thanks,
 Darshan
 



Re: ID reference field - Needed but not searchable or retrievable

2012-09-20 Thread Tom Mortimer
Hi James,

If you don't want this field to be included in user searches, just omit it from 
the search configuration (e.g. if using eDisMax parser, don't put it in the qf 
list). To keep it out of search results, exclude it from the fl list. See

http://wiki.apache.org/solr/CommonQueryParameters
and
http://wiki.apache.org/solr/ExtendedDisMax
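
A sketch of that, assuming an edismax handler (field names invented) - the
uniqueID field stays indexed and stored, it's just left out of qf and fl:

<lst name="defaults">
  <str name="defType">edismax</str>
  <str name="qf">title^2 body</str>
  <str name="fl">title,body,score</str>
</lst>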

The overhead from storing it will most likely be very small, and as Michael
points out, it means you could potentially reference documents both ways.

Not sure about the JSON question, but in XML
<delete><query>uniqueID:7</query></delete> would remove the whole doc, not
just the uniqueID field.

Tom


On 20 Sep 2012, at 13:38, Spadez james_will...@hotmail.com wrote:

 Hi.
 
 My SQL database assigns a uniqueID to each item. I want to keep this
 uniqueID associated with the items that are in Solr even though I won't ever
 need to display it or have it searchable. I do however want to be able
 to target specific items in Solr with it, for updating or deleting the
 record.
 
 Right now I have this in my schema:
 <field name="id" type="string" indexed="true" stored="true"/>
 
 However, since I don't want it searchable or stored, it should be this:
 <field name="uniqueid" type="string" indexed="false" stored="false"/>
 
 Firstly, is this the correct way of doing this? I saw mention of an "ignore"
 attribute that can be added.
 
 Secondly, if I wanted to do updates to the fields using JSON by targeting
 the uniqueID, can I still do something like this:
 "delete": { "uniqueid": "7" },   /* delete entry
 uniqueID=7 */
 
 Thank you for any help you can give. Hope I explained it well enough.
 
 James
 
 
 
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/ID-reference-field-Needed-but-not-searchable-or-retrievable-tp4009162.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Solr 4.0 - disappointing results sharding on 1 machine

2012-09-20 Thread Tom Mortimer
Hi all,

After reading 
http://carsabi.com/car-news/2012/03/23/optimizing-solr-7x-your-search-speed/ , 
I thought I'd do my own experiments. I used 2M docs from wikipedia, indexed in 
Solr 4.0 Beta on a standard EC2 large instance. I compared an unsharded and 
2-shard configuration (the latter set up with SolrCloud following the 
http://wiki.apache.org/solr/SolrCloud example). I wrote a simple python script 
to randomly throw queries from a hand-compiled list at Solr. The only extra I 
had turned on was facets (on document category).

To my surprise, the performance of the 2-shard configuration is almost exactly 
half that of the unsharded index - 

unsharded
4983912891 results in 24920 searches; 0 errors
70.02 mean qps
0.35s mean query time, 2.25s max, 0.00s min
90%   of qtimes = 0.83s
99%   of qtimes = 1.42s
99.9% of qtimes = 1.68s

2-shard
4990351660 results in 24501 searches; 0 errors
34.07 mean qps
0.66s mean query time, 694.20s max, 0.01s min
90%   of qtimes = 1.19s
99%   of qtimes = 2.12s
99.9% of qtimes = 2.95s

All caches were set to 4096 items, and performance looks ok in both cases (hit 
ratios close to 1.0, 0 evictions). I gave the single VM -Xmx1G and each shard 
VM -Xmx500M.

I must be doing something stupid - surely this result is unexpected? Does 
anybody have any thoughts where it might be going wrong?

cheers,
Tom



Re: Solr 4.0 - disappointing results sharding on 1 machine

2012-09-20 Thread Tom Mortimer
Before anyone asks, these results were obtained warm.




Re: Personalized Boosting

2012-09-19 Thread Tom Mortimer
I'm still not sure I understand what it is you're trying to do. Index-time or 
query-time boosts would probably be neater and more predictable than multiple 
field instances, though.

http://wiki.apache.org/solr/UpdateXmlMessages#Optional_attributes_for_.22field.22
http://wiki.apache.org/solr/SolrRelevancyFAQ#How_can_I_change_the_score_of_a_document_based_on_the_.2Avalue.2A_of_a_field_.28say.2C_.22popularity.22.29
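
For instance, index-time boosts go in the XML update format like this - a
sketch with made-up values:

<add>
  <doc boost="2.5">
    <field name="id">userA</field>
    <field name="location" boost="3.0">Moscow</field>
  </doc>
</add>

(Field-level boosts like that need omitNorms="false" on the field.)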

Tom


On 19 Sep 2012, at 02:49, deniz denizdurmu...@gmail.com wrote:

 Hello Tom
 
 Thank you for your link, but after looking it over, I don't think it will
 help... In my case it will be dynamic, rather than set in a config file,
 and for a big country like Russia or China I would need to add
 all cities manually to the elevate.xml file, and also the boosted users,
 which is not something I desire...
 
 It seems like duplicating the values in the location field is the best (at
 least quickest) solution for this case...
 
 
 
 -
 Zeki ama calismiyor... Calissa yapar...
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Personalized-Boosting-tp4008495p4008783.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Solr4 how to make it do this?

2012-09-18 Thread Tom Mortimer
Hi George,

I don't think this will work. The synonyms will be added after the query is
parsed, so you'll have the term "bed:3" rather than a match of 3 against the
bed field. If I were implementing this I'd try doing some pattern matching
before passing the query to Solr, e.g.:

"3 bed Surrey"  ->  q=Surrey fq=bed:3

I guess this kind of thing could also be implemented as a Solr query plug-in. 
Don't know if anything like it exists.

Tom


On 18 Sep 2012, at 11:30, george123 daniel.tarase...@gmail.com wrote:

 I guess I could come up with a synonyms.txt file, and for every instance of
 "3 bed"
 I change it to
 "bed:3"
 and it should work.
 
 e.g.
 
 3 bed => bed:3
 
 Not exactly a synonym or what it was designed for, but it might work?
 
 
 
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Solr4-how-to-make-it-do-this-tp4008574p4008576.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Personalized Boosting

2012-09-18 Thread Tom Mortimer
Hi,

Would this do the job?  http://wiki.apache.org/solr/QueryElevationComponent
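
Its setup is a search component in solrconfig.xml plus an elevate.xml file -
a sketch:

<searchComponent name="elevator" class="solr.QueryElevationComponent">
  <str name="queryFieldType">string</str>
  <str name="config-file">elevate.xml</str>
</searchComponent>

where elevate.xml maps query texts to the doc IDs to force to the top.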

Tom


On 18 Sep 2012, at 01:36, deniz denizdurmu...@gmail.com wrote:

 Hello All,
 
 I have a requirement (or a pre-requirement) for our search application.
 Basically the engine will be on a website with plenty of users and more than
 20 different fields, including location.
 
 So basically, the question is this:
 
 Is it possible to let users define their position in search results when
 location is queried? Let's say that I am UserA, and when you make a search
 for "Moscow", my default ranking is 258. By clicking a button, something like
 "Boost Me!", I would like to see UserA as the first user when the search is
 done with the "Moscow" query.
 
 Is this possible? I have some ideas (like adding the person's
 location to their location field 10 times, so it will score the highest,
 and so on) but I am not sure whether the requirement is hard or easy to
 implement, or whether it will require a plugin rather than config changes...
 
 anyone has any ideas? 
 
 
 
 -
 Zeki ama calismiyor... Calissa yapar...
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Personalized-Boosting-tp4008495.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Error loading a custom request handler in Solr 4.0

2011-08-10 Thread Tom Mortimer
Hi,

Apologies if this is really basic. I'm trying to learn how to create a
custom request handler, so I wrote the minimal class (attached), compiled
and jar'd it, and placed it in example/lib. I added this to solrconfig.xml:

<requestHandler name="/flaxtest" class="FlaxTestHandler" />

When I started Solr with java -jar start.jar, I got this:

...
SEVERE: java.lang.NoClassDefFoundError:
org/apache/solr/handler/RequestHandlerBase
at java.lang.ClassLoader.defineClass1(Native Method)
...

So I copied all the dist/*.jar files into lib and tried again. This time it
seemed to start ok, but browsing to http://localhost:8983/solr/ displayed
this:

org.apache.solr.common.SolrException: Error Instantiating Request
Handler, FlaxTestHandler is not a org.apache.solr.request.SolrRequestHandler

at org.apache.solr.core.SolrCore.createInstance(SolrCore.java:410) ...


Any ideas?

thanks,
Tom


Re: Error loading a custom request handler in Solr 4.0

2011-08-10 Thread Tom Mortimer
Sure -

import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.handler.RequestHandlerBase;

public class FlaxTestHandler extends RequestHandlerBase {

    public FlaxTestHandler() { }

    public void handleRequestBody(SolrQueryRequest req, SolrQueryResponse rsp)
        throws Exception
    {
        rsp.add("FlaxTest", "Hello!");
    }

    public String getDescription() { return "Flax"; }
    public String getSourceId() { return "Flax"; }
    public String getSource() { return "Flax"; }
    public String getVersion() { return "Flax"; }

}



On 10 August 2011 16:43, simon mtnes...@gmail.com wrote:

 The attachment isn't showing up (in gmail, at least). Can you inline
 the relevant bits of code ?




Re: how to ignore case in solr search field?

2011-08-10 Thread Tom Mortimer
You can use solr.LowerCaseFilterFactory in an analyser chain for both
indexing and queries. The schema.xml supplied with the example has several
field types using this (including text_general).
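
A sketch of such a field type - this is close to the stock text_general,
minus the stopword and synonym filters:

<fieldType name="text_lc" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>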

Tom


On 10 August 2011 16:42, nagarjuna nagarjuna.avul...@gmail.com wrote:

 Hi, please help me ..
    how do I ignore case while searching in Solr?


 e.g. I need the same results for the keywords abc, ABC, aBc, AbC and all
 the cases.




 Thank u in advance

 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/how-to-ignore-case-in-solr-search-field-tp3242967p3242967.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Error loading a custom request handler in Solr 4.0

2011-08-10 Thread Tom Mortimer
Interesting.. is this in trunk (4.0)? Maybe I've broken mine somehow!

What classpath did you use for compiling? And did you copy anything other
than the new jar into lib/ ?

thanks,
Tom


On 10 August 2011 18:07, simon mtnes...@gmail.com wrote:

 It's working for me. Compiled, inserted in solr/lib, added the config
 line to solrconfig.

  when I send a /flaxtest request i get

 <response>
 <lst name="responseHeader">
 <int name="status">0</int>
 <int name="QTime">16</int>
 </lst>
 <str name="FlaxTest">Hello!</str>
 </response>

 I was doing this within a core defined in solr.xml

 -Simon




Re: Error loading a custom request handler in Solr 4.0

2011-08-10 Thread Tom Mortimer
Thanks Simon. I'll try again tomorrow.

Tom

On 10 August 2011 18:46, simon mtnes...@gmail.com wrote:

 This is in trunk (up to date). Compiler is 1.6.0_26

 classpath was
  
 dist/apache-solr-solrj-4.0-SNAPSHOT.jar:dist/apache-solr-core-4.0-SNAPSHOT.jar
 built from trunk just prior by 'ant dist'

 I'd try again with a clean trunk.

 -Simon




Highlighting not working

2011-04-07 Thread Tom Mortimer
Hi,

I'm having trouble getting highlighting to work for a large text
field. This field can be in several languages, so I'm sending it to
one of several fields configured appropriately (e.g. cv_text_en) and
then copying it to a common field for storage and display (cv_text).
The relevant fragment of schema.xml looks like:

<field name="cv_text_en" type="text_en" indexed="true"
       stored="false" termVectors="true" termPositions="true"/>
...
<copyField source="cv_text_*" dest="cv_text" />
<field name="cv_text"    type="text"    indexed="false" stored="true" />

At search time I can't get cv_text to be highlighted - it's returned
in its entirety. Here's the relevant bit of solrconfig.xml (I'm using
qt=all with the default request handler):

  <requestHandler name="all" class="solr.SearchHandler" default="false">
    <lst name="defaults">
        <str name="echoParams">explicit</str>
        <int name="rows">10</int>

        <str name="df">cv_text_en</str>
        <str name="df">cv_text_de</str>
        ...

        <str name="hl">on</str>
        <str name="hl.fl">cv_text</str>

     </lst>

I've tried playing with other hl. parameters, but have had no luck so
far. Any ideas?

thanks,
Tom


Re: Highlighting not working

2011-04-07 Thread Tom Mortimer
I guess what I'm asking is - can Solr highlight non-indexed fields?

Tom





Re: Highlighting not working

2011-04-07 Thread Tom Mortimer
Problem solved. *bangs head on desk*
T




copyField at search time / multi-language support

2011-03-28 Thread Tom Mortimer
Hi,

Here's my problem: I'm indexing a corpus with text in a variety of
languages. I'm planning to detect these at index time and send the
text to one of a set of suitably-configured fields (e.g. mytext_de for
German, mytext_cjk for Chinese/Japanese/Korean etc.)

At search time I want to search all of these fields. However, there
will be at least 12 of them, which could lead to a very long query
string. (Also I need to use the standard query parser rather than
dismax, for full query syntax.)

Therefore I was wondering if there is a way to copy fields at search
time, so I can write my query against a single mytext field and have it
copied to mytext_de, mytext_cjk etc. Something like:

   <copyQueryField source="mytext" dest="mytext_de" />
   <copyQueryField source="mytext" dest="mytext_cjk" />
   ...

If this is not currently possible, could someone give me some pointers
for hacking Solr to support it? Should I subclass solr.SearchHandler?
I know nothing about Solr internals at the moment...

thanks,
Tom