How to configure Solr to not bind at 8983

2015-08-19 Thread Samy Ateia
I changed the Solr listen port in the solr.in.sh file in my Solr home directory
by setting the variable SOLR_PORT=.
But Solr is still also trying to listen on 8983, because it gets started with
the -DSTOP.PORT=8983 variable.

What is this -DSTOP.PORT variable for and where should I configure it?
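(STOP.PORT is the port Jetty listens on for its shutdown command, not a second
HTTP listener. A sketch of how the 5.x bin/solr script derives it — the exact
lines vary by version, and the paths below are assumptions:)

    # bin/solr derives the stop port from the listen port:
    STOP_PORT=`expr $SOLR_PORT - 1000`
    # when installed via install_solr_service.sh, the service reads its
    # include file from its own location (e.g. /var/solr/solr.in.sh or
    # /etc/default/solr.in.sh, depending on version), not from the Solr
    # home directory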

I ran the install_solr_service.sh script to set up Solr and changed
SOLR_PORT afterwards.

Best regards,

Samy
  

Re: How to find the ordinal for a numeric doc value

2015-08-19 Thread Mikhail Khludnev
Hello,
Given the code at
https://github.com/apache/lucene-solr/blob/trunk/solr/core/src/java/org/apache/solr/schema/TrieField.java#L727
it creates NumericDocValuesField only.
Try defining the field as multivalued; given the code, it then creates
SortedSetDocValuesField.
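(A sketch of such a multivalued docValues field in schema.xml — the field and
type names are illustrative:)

    <field name="price_dv" type="tdouble" indexed="true" stored="true"
           multiValued="true" docValues="true"/>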

On Wed, Aug 19, 2015 at 11:13 PM, tedsolr  wrote:

> One error (others perhaps?) in my statement ... the code
>
> searcher.getLeafReader().getSortedDocValues(field)
>
> just returns null for numeric and date fields. That is why they appear to
> be
> ignored, not that the ordinals are all absent or equivalent. But my
> question
> is still valid I think!
>
>
>
>



-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics





Re: How to find the ordinal for a numeric doc value

2015-08-19 Thread Toke Eskildsen
tedsolr  wrote:
> I'm sure there is a good reason why SortedDocValues exposes
> the backing dictionary and [Sorted]NumericDocValues does not.

There is: Numerics do not have a backing dictionary. Instead of storing the 
values via the intermediate ordinals-map (aka by reference), they are stored 
directly. Overall it takes up less space and makes a lot of things easier (and 
a few things harder, as you have found out).

> How do I get to the ordinal for number and date fields?

You don't really. You will have to assign ordinals yourself by sorting all the 
existing values upon first call.

> I assume my fallback is to not index with doc values, and use an uninverting
> reader to get the field data. Is there a better approach?

You could index your integers as DocValued Strings, prefixed with zeroes to 
ensure same length and proper integer sort.
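(A sketch of that with SolrJ — the field name and pad width are assumptions,
and it presumes non-negative values:)

    // zero-pad so lexicographic (string) order matches numeric order;
    // the width must cover the largest possible value
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("price_str", String.format("%012d", 42L));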

- Toke Eskildsen


Re: Plagiarism checker with Solr

2015-08-19 Thread Roshan Agarwal
Dear Jack,

Thank you very much,

Roshan Agarwal


On Mon, Aug 10, 2015 at 8:38 PM, Jack Krupansky 
wrote:

> The simplest and maybe best approach is to use the edismax query parser and
> query all terms using the OR operator and use the PF1, PF2, and PF3
> parameters to boost phrases so that the closest matches rank higher.
>
> No need to do any special indexing.
>
> You can tune the ps, ps2, and ps3 parameters as well to loosen the
> tightness of phrases that get boosted.
>
> -- Jack Krupansky
>
> On Mon, Aug 10, 2015 at 1:54 AM, Roshan Agarwal 
> wrote:
>
> > Dear All,
> >
> > Can anyone let us know how to implement a plagiarism checker with Solr:
> > how to index content with shingles, and what to send in queries?
> >
> > Roshan
> >
> > --
> >
> > Siddhast Ip innovation (P) ltd
> > 907 chandra vihar colony
> > Jhansi-284002
> > M:+919871549769
> > M:+917376314900
> >
>
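(A sketch of such a request — the field name "text" and the boost/slop values
are assumptions; note the actual parameter names are pf/pf2/pf3 and ps/ps2/ps3:)

    q=<full text of the suspect document>
    defType=edismax
    q.op=OR
    pf=text^10&pf2=text^5&pf3=text^3
    ps=0&ps2=1&ps3=2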



-- 

Roshan Agarwal
Director sales
Siddhast Ip innovation (P) ltd
907 chandra vihar colony
Jhansi-284002
M:+919871549769
M:+917376314900


Solr having problems with highlighting when using Jieba analyzer

2015-08-19 Thread Zheng Lin Edwin Yeo
Hi,

I'm using the Jieba analyzer to index Chinese characters in Solr. It works
fine with the segmentation when using the Analysis screen on the Solr Admin UI.

However, when I try to do highlighting in Solr, it does not highlight in
the correct place. For example, when I search for 自然环境与企业本身, it highlights
认为自然环境与企业本身的.

Even when I search for the English word "responsibility", it highlights
*responsibilit*y.

I'm using jieba-analysis-1.0.0, Solr 5.2.1 and Lucene 5.1.0

Regards,
Edwin


How to delta-import to Solr by Id (keyword)

2015-08-19 Thread fent
I have a table with an Id column, which is an auto-incrementing attribute,
so I want to delta-import new records into Solr with something like "select *
from my_table where Id > '${latest_id}'",
where latest_id is the max Id from the last import.
How do I configure data-config.xml for this, or how can I get the max Id back
from Solr?

Thanks!
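(One way to read the current max Id back from Solr, assuming Id is a sortable
indexed field — the core name is illustrative:)

    curl 'http://localhost:8983/solr/collection1/select?q=*:*&sort=Id+desc&rows=1&fl=Id&wt=json'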








SolrCloud: /live_nodes in ZK shows the server is there, but all cores are down in /clusterstate.json.

2015-08-19 Thread forest_soup
Opened a JIRA - https://issues.apache.org/jira/browse/SOLR-7947

A SolrCloud with 2 Solr nodes running in Tomcat on 2 VM servers. After
restarting one Solr node, the cores on it turn to the "down" state, and the
logs show the errors below.

Logs are in the attachment solr.zip.
  

ERROR - 2015-07-24 09:40:34.887; org.apache.solr.common.SolrException;
null:org.apache.solr.common.SolrException: Unable to create core:
collection1_shard1_replica1
at org.apache.solr.core.CoreContainer.recordAndThrow(CoreContainer.java:989)
at org.apache.solr.core.CoreContainer.create(CoreContainer.java:606)
at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:258)
at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:250)
at java.util.concurrent.FutureTask.run(FutureTask.java:273)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:482)
at java.util.concurrent.FutureTask.run(FutureTask.java:273)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1156)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:626)
at java.lang.Thread.run(Thread.java:804)
Caused by: org.apache.solr.common.SolrException
at org.apache.solr.core.SolrCore.<init>(SolrCore.java:844)
at org.apache.solr.core.SolrCore.<init>(SolrCore.java:630)
at org.apache.solr.core.ZkContainer.createFromZk(ZkContainer.java:244)
at org.apache.solr.core.CoreContainer.create(CoreContainer.java:595)
... 8 more
Caused by: java.nio.channels.OverlappingFileLockException
at sun.nio.ch.SharedFileLockTable.checkList(FileLockTable.java:267)
at sun.nio.ch.SharedFileLockTable.add(FileLockTable.java:164)
at sun.nio.ch.FileChannelImpl.tryLock(FileChannelImpl.java:1078)
at java.nio.channels.FileChannel.tryLock(FileChannel.java:1165)
at org.apache.lucene.store.NativeFSLock.obtain(NativeFSLockFactory.java:217)
at
org.apache.lucene.store.NativeFSLock.isLocked(NativeFSLockFactory.java:319)
at org.apache.lucene.index.IndexWriter.isLocked(IndexWriter.java:4510)
at org.apache.solr.core.SolrCore.initIndex(SolrCore.java:485)
at org.apache.solr.core.SolrCore.<init>(SolrCore.java:761)
... 11 more





Re: How to Fast Bulk Insert documents

2015-08-19 Thread Troy Edwards
Thank you for taking the time to do the test.

I have been doing similar tests using the Post Tool (SimplePostTool) with
the real data and was able to get to about 10K documents/second.

I am considering using multiple files (one per client) FTP'd onto a Solr
node, and then using a scheduled job to run the Post Tool and post them to
Solr.

The only issue I have run into so far is that if there is an error in data
(e.g. required field missing) the post tool stops processing the rest of
the file.
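(A sketch of such a job — the paths and core name are assumptions; logging
failures keeps one bad file from blocking the rest:)

    for f in /data/incoming/*.csv; do
      java -Dtype=text/csv -Durl=http://localhost:8983/solr/collection1/update \
           -jar post.jar "$f" || echo "failed: $f" >> post-failures.log
    done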



On Wed, Aug 19, 2015 at 3:58 PM, Toke Eskildsen 
wrote:

> Toke Eskildsen  wrote
> > Use more than one cloud. Make them fully independent.
> > As I suggested when you asked 4 days ago. That would
> > also make it easy to scale: Just measure how much a
> > single setup can take and do the math.
>
> The goal is 250K documents/second.
>
> I tried modifying the books.csv-example that comes with Solr to use lines
> with 400 characters and inflated it to 4 * 1 million entries. I then
> started Solr with the techproducts example and ingested the 4*1M entries
> using curl from 4 prompts at the same time. The longest run took 138
> seconds. 4M/138 seconds = 29K documents/second.
>
> My machine is a 4 core (8 with HyperThreading) i7 laptop, using SSD. On a
> modern server and with custom schema & config, the speed should of course
> be better. On the other hand, the rate might slow down as the shards grow.
>
> Give or take, something like 10 machines could conceivably be enough to
> handle the Solr load if the analysis chain is near the books-example in
> complexity. Of course real data tests are needed, and the CSV data must be
> constructed somehow.
>
> - Toke Eskildsen
>


Re: How to Fast Bulk Insert documents

2015-08-19 Thread Troy Edwards
Are you suggesting that requests come into a service layer that identifies
which client is on which solrcloud and passes the request to that cloud?

Thank you

On Wed, Aug 19, 2015 at 1:13 PM, Toke Eskildsen 
wrote:

> Troy Edwards  wrote:
> > My average document size is 400 bytes
> > Number of documents that need to be inserted 25/second
> > (for a total of about 3.6 Billion documents)
>
> > Any ideas/suggestions on how that can be done? (use a client
> > or uploadcsv or stream or data import handler)
>
> Use more than one cloud. Make them fully independent. As I suggested when
> you asked 4 days ago. That would also make it easy to scale: Just measure
> how much a single setup can take and do the math.
>
> - Toke Eskildsen
>


Re: SolrCloud node is not coming up

2015-08-19 Thread Erick Erickson
Well, you can use curl instead ;).

But at present there's no real collections admin UI akin to the core
admin UI. One is in the works with the new AngularJS-based admin UI,
but the ETA is not defined quite yet, although it shouldn't be all
that far away.
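(For example, any Collections API call works the same way from the shell —
the host and action are illustrative:)

    curl 'http://localhost:8983/solr/admin/collections?action=CLUSTERSTATUS&wt=json'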



On Wed, Aug 19, 2015 at 2:48 PM, Merlin Morgenstern
 wrote:
> Thank you for the quick answer. I learned now how to use the Collections
> API.
>
> Is there a "better" way to issue the commands than entering them into the
> browser as a URL and getting back JSON?
>
>
>
> 2015-08-19 22:23 GMT+02:00 Erick Erickson :
>
>> No, nothing. The graphical view shows collections and the associated
>> replicas.
>> This new node has no replicas that are part of any collection, so it won't
>> show in the graphical view.
>>
>> If you create a new collection that happens to put a replica on the new
>> node,
>> it'll then show up as part of that collection in the graphical view.
>>
>> If you do an ADDREPLICA to the existing collection and specify the new
>> machine with the "node" parameter, see:
>>
>> https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-api_addreplica
>>
>> then it should show up.
>>
>> On Wed, Aug 19, 2015 at 12:42 PM, Merlin Morgenstern
>>  wrote:
>> > I have a Solrcloud cluster running with 2 nodes, configured with 1 shard
>> > and 2 replica. Now I have added a node on a new server, registered with
>> the
>> > same three zookeepers. The node shows up inside the tree of the Solrcloud
>> > admin GUI under "live nodes".
>> >
>> > Unfortunately the new node is not inside the graphical view, and it
>> > shows 0 cores available, while the other admin interface shows the
>> > available core. I have also shut down the second replica server, which
>> > is now grayed out. But the third node is still not available.
>> >
>> > Is there something I have to do in order to add a node, besides
>> > registering it? This is the startup command I am using:
>> > bin/solr start -cloud -s server/solr2 -p 8983 -z
>> zk1:2181,zk1:2182,zk1:2183
>> > -noprompt
>>


Re: Cache

2015-08-19 Thread Nagasharath

I will go with {!cache=false}.

Can we specify the facet method in a JSON nested faceting query?




> On 19-Aug-2015, at 7:07 pm, Yonik Seeley  wrote:
> 
>> On Wed, Aug 19, 2015 at 8:00 PM, Nagasharath  
>> wrote:
>> Trying to evaluate the performance of queries with and without cache
> 
> Yeah, so to try and see how much a specific type of query costs, you can use
> {!cache=false}
> 
> But I've seen some people trying to benchmark the performance of the
> *system* with caching disabled, and that's not really a valid way to
> go about it.
> 
> -Yonik
> 
> 
> 
>>> On 18-Aug-2015, at 11:30 am, Yonik Seeley  wrote:
>>> 
>>> On Tue, Aug 18, 2015 at 12:23 PM, naga sharathrayapati
>>>  wrote:
 Is it possible to clear the cache through query?
 
 I need this for performance evaluation.
>>> 
>>> No, but you can prevent a query from being cached:
>>> q={!cache=false}my query
>>> 
>>> What are you trying to test the performance of exactly?
>>> If you think queries will be highly unique, the best way of testing is
>>> to make your test queries highly unique (for example, adding a random
>>> number in the mix) so that the hit rate on the query cache won't be
>>> unrealistically high.
>>> 
>>> -Yonik


Re: Cache

2015-08-19 Thread Yonik Seeley
On Wed, Aug 19, 2015 at 8:00 PM, Nagasharath  wrote:
> Trying to evaluate the performance of queries with and without cache

Yeah, so to try and see how much a specific type of query costs, you can use
{!cache=false}

But I've seen some people trying to benchmark the performance of the
*system* with caching disabled, and that's not really a valid way to
go about it.

-Yonik



>> On 18-Aug-2015, at 11:30 am, Yonik Seeley  wrote:
>>
>> On Tue, Aug 18, 2015 at 12:23 PM, naga sharathrayapati
>>  wrote:
>>> Is it possible to clear the cache through query?
>>>
>>> I need this for performance evaluation.
>>
>> No, but you can prevent a query from being cached:
>> q={!cache=false}my query
>>
>> What are you trying to test the performance of exactly?
>> If you think queries will be highly unique, the best way of testing is
>> to make your test queries highly unique (for example, adding a random
>> number in the mix) so that the hit rate on the query cache won't be
>> unrealistically high.
>>
>> -Yonik


Re: Cache

2015-08-19 Thread Walter Underwood
Why? Do you evaluate Unix performance with and without file buffers?

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


On Aug 19, 2015, at 5:00 PM, Nagasharath  wrote:

> Trying to evaluate the performance of queries with and without cache
> 
> 
> 
>> On 18-Aug-2015, at 11:30 am, Yonik Seeley  wrote:
>> 
>> On Tue, Aug 18, 2015 at 12:23 PM, naga sharathrayapati
>>  wrote:
>>> Is it possible to clear the cache through query?
>>> 
>>> I need this for performance evaluation.
>> 
>> No, but you can prevent a query from being cached:
>> q={!cache=false}my query
>> 
>> What are you trying to test the performance of exactly?
>> If you think queries will be highly unique, the best way of testing is
>> to make your test queries highly unique (for example, adding a random
>> number in the mix) so that the hit rate on the query cache won't be
>> unrealistically high.
>> 
>> -Yonik



Re: Cache

2015-08-19 Thread Nagasharath
Trying to evaluate the performance of queries with and without cache



> On 18-Aug-2015, at 11:30 am, Yonik Seeley  wrote:
> 
> On Tue, Aug 18, 2015 at 12:23 PM, naga sharathrayapati
>  wrote:
>> Is it possible to clear the cache through query?
>> 
>> I need this for performance evaluation.
> 
> No, but you can prevent a query from being cached:
> q={!cache=false}my query
> 
> What are you trying to test the performance of exactly?
> If you think queries will be highly unique, the best way of testing is
> to make your test queries highly unique (for example, adding a random
> number in the mix) so that the hit rate on the query cache won't be
> unrealistically high.
> 
> -Yonik


Re: Changing Similarity without re-indexing (for example from default to BM25)

2015-08-19 Thread Ahmet Arslan
Hi again,

Here is a relevant/past discussion : 
http://search-lucene.com/m/eHNlTDHKb17MW532

Ahmet



On Thursday, August 20, 2015 2:28 AM, Ahmet Arslan  
wrote:
Hi Tom,

The computeNorm(FieldInvertState) method is the only place where similarity is
tied to the indexing process.
If you want to switch between different similarities, they should share the
same implementation of that method. For example, subclasses of SimilarityBase
can be used without re-indexing.

By the way, DefaultSimilarity and BM25 look compatible.

For memory consumption reasons, the exact value of the field length is
encoded/decoded into a byte in norms, at the expense of some precision loss.

Ahmet



On Wednesday, August 19, 2015 7:40 PM, Tom Burton-West  
wrote:
Hello all,

The last time I worked with changing Similarities was with Solr 4.1, and at
that time, it was possible to simply change the schema to specify the use
of a different Similarity without re-indexing.   This allowed me to
experiment with several different ranking algorithms without having to
re-index.

Currently the documentation states that doing this is theoretically
possible but not well defined:

"To change Similarity
,
one must do so for both indexing and searching, and the changes must happen
before either of these actions take place. Although in theory there is
nothing stopping you from changing mid-stream, it just isn't well-defined
what is going to happen."

http://lucene.apache.org/core/5_2_0/core/org/apache/lucene/search/similarities/package-summary.html#changingSimilarity

Has something changed between 4.1 and 5.2 that actually will prevent
changing Similarity without re-indexing from working, or is this just a
warning in case at some future point someone contributes code so that a
particular similarity takes advantage of a different index format?

Tom


Re: Changing Similarity without re-indexing (for example from default to BM25)

2015-08-19 Thread Ahmet Arslan
Hi Tom,

The computeNorm(FieldInvertState) method is the only place where similarity is
tied to the indexing process.
If you want to switch between different similarities, they should share the
same implementation of that method. For example, subclasses of SimilarityBase
can be used without re-indexing.

By the way, DefaultSimilarity and BM25 look compatible.

For memory consumption reasons, the exact value of the field length is
encoded/decoded into a byte in norms, at the expense of some precision loss.
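(For reference, a sketch of selecting BM25 globally in schema.xml — the k1/b
values shown are the BM25 defaults and are optional:)

    <similarity class="solr.BM25SimilarityFactory">
      <float name="k1">1.2</float>
      <float name="b">0.75</float>
    </similarity>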

Ahmet


On Wednesday, August 19, 2015 7:40 PM, Tom Burton-West  
wrote:
Hello all,

The last time I worked with changing Similarities was with Solr 4.1, and at
that time, it was possible to simply change the schema to specify the use
of a different Similarity without re-indexing.   This allowed me to
experiment with several different ranking algorithms without having to
re-index.

Currently the documentation states that doing this is theoretically
possible but not well defined:

"To change Similarity
,
one must do so for both indexing and searching, and the changes must happen
before either of these actions take place. Although in theory there is
nothing stopping you from changing mid-stream, it just isn't well-defined
what is going to happen."

http://lucene.apache.org/core/5_2_0/core/org/apache/lucene/search/similarities/package-summary.html#changingSimilarity

Has something changed between 4.1 and 5.2 that actually will prevent
changing Similarity without re-indexing from working, or is this just a
warning in case at some future point someone contributes code so that a
particular similarity takes advantage of a different index format?

Tom


Re: Reindexing

2015-08-19 Thread Alexandre Rafalovitch
Reload will pick up the new schema definitions. But all the indexed
content will stay as-is, and will probably start causing problems if
you changed analyzer definitions significantly.

You probably will have to reindex from scratch/external source.

Sorry.

Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/


On 19 August 2015 at 18:03, Azazel K  wrote:
> Hi,
> We have an over-engineered index that we would like to rework.  It's already
> holding 150M documents with 94GB of index size.  We have a high-index,
> high-query system running Solr 4.5.
> My question -  If we update the schema, can we run reindex by using "Reload" 
> action in CoreAdmin UI?  Will that regenerate the index according to schema 
> updates?
> Thanks, Az
>


Reindexing

2015-08-19 Thread Azazel K
Hi,
We have an over-engineered index that we would like to rework.  It's already
holding 150M documents with 94GB of index size.  We have a high-index,
high-query system running Solr 4.5.
My question -  If we update the schema, can we run reindex by using "Reload" 
action in CoreAdmin UI?  Will that regenerate the index according to schema 
updates?
Thanks, Az
  

Re: SolrCloud node is not coming up

2015-08-19 Thread Merlin Morgenstern
Thank you for the quick answer. I learned now how to use the Collections
API.

Is there a "better" way to issue the commands than entering them into the
browser as a URL and getting back JSON?



2015-08-19 22:23 GMT+02:00 Erick Erickson :

> No, nothing. The graphical view shows collections and the associated
> replicas.
> This new node has no replicas that are part of any collection, so it won't
> show in the graphical view.
>
> If you create a new collection that happens to put a replica on the new
> node,
> it'll then show up as part of that collection in the graphical view.
>
> If you do an ADDREPLICA to the existing collection and specify the new
> machine with the "node" parameter, see:
>
> https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-api_addreplica
>
> then it should show up.
>
> On Wed, Aug 19, 2015 at 12:42 PM, Merlin Morgenstern
>  wrote:
> > I have a Solrcloud cluster running with 2 nodes, configured with 1 shard
> > and 2 replica. Now I have added a node on a new server, registered with
> the
> > same three zookeepers. The node shows up inside the tree of the Solrcloud
> > admin GUI under "live nodes".
> >
> > Unfortunately the new node is not inside the graphical view, and it
> > shows 0 cores available, while the other admin interface shows the
> > available core. I have also shut down the second replica server, which
> > is now grayed out. But the third node is still not available.
> >
> > Is there something I have to do in order to add a node, besides
> > registering it? This is the startup command I am using:
> > bin/solr start -cloud -s server/solr2 -p 8983 -z
> zk1:2181,zk1:2182,zk1:2183
> > -noprompt
>


Re: How to Fast Bulk Insert documents

2015-08-19 Thread Toke Eskildsen
Toke Eskildsen  wrote
> Use more than one cloud. Make them fully independent.
> As I suggested when you asked 4 days ago. That would
> also make it easy to scale: Just measure how much a
> single setup can take and do the math.

The goal is 250K documents/second.

I tried modifying the books.csv example that comes with Solr to use lines with
400 characters and inflated it to 4 * 1 million entries. I then started Solr
with the techproducts example and ingested the 4*1M entries using curl from 4
prompts at the same time. The longest run took 138 seconds. 4M/138 seconds =
29K documents/second.

My machine is a 4 core (8 with HyperThreading) i7 laptop, using SSD. On a 
modern server and with custom schema & config, the speed should of course be 
better. On the other hand, the rate might slow down as the shards grow.

Give or take, something like 10 machines could conceivably be enough to handle
the Solr load if the analysis chain is near the books example in complexity. Of
course real data tests are needed, and the CSV data must be constructed somehow.
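(A sketch of such an ingest call — the file name and core are illustrative:)

    curl 'http://localhost:8983/solr/techproducts/update?commit=true' \
         -H 'Content-type: text/csv' --data-binary @books_big.csv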

- Toke Eskildsen


Re: How to Fast Bulk Insert documents

2015-08-19 Thread Susheel Kumar
For indexing 3.5 billion documents, you will run into bottlenecks not only
with Solr but also at other places (data acquisition, Solr document
object creation, submitting in bulk/batches to Solr).

Parallelizing the operations at each of the above steps is what will get
you maximum throughput.  A multi-threaded Java SolrJ-based indexer with
CloudSolrClient is required, as described by Shawn.  I have used
ConcurrentUpdateSolrClient in the past, but with CloudSolrClient,
setParallelUpdates should be tried out.

Thanks,
Susheel

On Wed, Aug 19, 2015 at 2:41 PM, Erick Erickson 
wrote:

> If you're sitting on HDFS anyway, you could use MapReduceIndexerTool. I'm
> not
> sure that'll hit your rate, it spends some time copying things around.
> If you're not on
> HDFS, though, it's not an option.
>
> Best,
> Erick
>
> On Wed, Aug 19, 2015 at 11:36 AM, Upayavira  wrote:
> >
> >
> > On Wed, Aug 19, 2015, at 07:13 PM, Toke Eskildsen wrote:
> >> Troy Edwards  wrote:
> >> > My average document size is 400 bytes
> >> > Number of documents that need to be inserted: 250,000/second
> >> > (for a total of about 3.6 Billion documents)
> >>
> >> > Any ideas/suggestions on how that can be done? (use a client
> >> > or uploadcsv or stream or data import handler)
> >>
> >> Use more than one cloud. Make them fully independent. As I suggested
> when
> >> you asked 4 days ago. That would also make it easy to scale: Just
> measure
> >> how much a single setup can take and do the math.
> >
> > Yes - work out how much each node can handle, then you can work out how
> > many nodes you need.
> >
> > You could consider using implicit routing rather than compositeId, which
> > means that you take on responsibility for hashing your ID to push
> > content to the right node. (Or, if you use compositeId, you could use
> > the same algorithm, and be sure that you send docs directly to the
> > correct shard.)
> >
> > At the moment, if you push five documents to a five shard collection,
> > the node you send them to could end up doing four HTTP requests to the
> > other nodes in the collection. This means you don't need to worry about
> > where to post your content - it is just handled for you. However, there
> > is a performance hit there. Push content direct to the correct node
> > (either using implicit routing, or by replicating the compositeId hash
> > calculation in your client) and you'd increase your indexing throughput
> > significantly, I would theorise.
> >
> > Upayavira
>


Re: SolrCloud node is not coming up

2015-08-19 Thread Susheel Kumar
When you are adding a node, what exactly are you looking for that node to
do?  Are you adding the node to create a new replica?  In that case you will
call the ADDREPLICA Collections API.

Thanks,
Susheel

On Wed, Aug 19, 2015 at 3:42 PM, Merlin Morgenstern <
merlin.morgenst...@gmail.com> wrote:

> I have a Solrcloud cluster running with 2 nodes, configured with 1 shard
> and 2 replica. Now I have added a node on a new server, registered with the
> same three zookeepers. The node shows up inside the tree of the Solrcloud
> admin GUI under "live nodes".
>
> Unfortunately the new node is not inside the graphical view, and it shows 0
> cores available, while the other admin interface shows the available core. I
> have also shut down the second replica server, which is now grayed out. But
> the third node is still not available.
>
> Is there something I have to do in order to add a node, besides registering
> it? This is the startup command I am using:
> bin/solr start -cloud -s server/solr2 -p 8983 -z zk1:2181,zk1:2182,zk1:2183
> -noprompt
>


Re: SolrCloud node is not coming up

2015-08-19 Thread Erick Erickson
No, nothing. The graphical view shows collections and the associated replicas.
This new node has no replicas that are part of any collection, so it won't
show in the graphical view.

If you create a new collection that happens to put a replica on the new node,
it'll then show up as part of that collection in the graphical view.

If you do an ADDREPLICA to the existing collection and specify the new
machine with the "node" parameter, see:
https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-api_addreplica

then it should show up.
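(A sketch of that call — the host, collection, shard, and node values are
illustrative:)

    curl 'http://localhost:8983/solr/admin/collections?action=ADDREPLICA&collection=collection1&shard=shard1&node=newhost:8983_solr'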

On Wed, Aug 19, 2015 at 12:42 PM, Merlin Morgenstern
 wrote:
> I have a Solrcloud cluster running with 2 nodes, configured with 1 shard
> and 2 replica. Now I have added a node on a new server, registered with the
> same three zookeepers. The node shows up inside the tree of the Solrcloud
> admin GUI under "live nodes".
>
> Unfortunately the new node is not inside the graphical view, and it shows 0
> cores available, while the other admin interface shows the available core. I
> have also shut down the second replica server, which is now grayed out. But
> the third node is still not available.
>
> Is there something I have to do in order to add a node, besides registering
> it? This is the startup command I am using:
> bin/solr start -cloud -s server/solr2 -p 8983 -z zk1:2181,zk1:2182,zk1:2183
> -noprompt


Re: How to find the ordinal for a numeric doc value

2015-08-19 Thread tedsolr
One error (others perhaps?) in my statement ... the code

searcher.getLeafReader().getSortedDocValues(field)

just returns null for numeric and date fields. That is why they appear to be
ignored, not that the ordinals are all absent or equivalent. But my question
is still valid I think!





How to find the ordinal for a numeric doc value

2015-08-19 Thread tedsolr
I'm trying to upgrade my custom post filter from Solr 4.9 to 5.2. This filter
collapses documents based on a user chosen field set. The key to the whole
thing is determining document uniqueness based on a fixed int array of field
value ordinals. In 4.9 this worked regardless of the field type. In the
collect() method of my delegating collector:

for (SortedDocValues vals : values) {
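   // one ordinal per chosen field; matching ord arrays identify duplicate docs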
   ords[i++] = vals.getOrd(globalDoc);
}

The values come from FieldCache:

SortedDocValues docValues =
FieldCache.DEFAULT.getTermsIndex(searcher.getAtomicReader(), field);

I was not indexing using doc values. Now I'm trying to use doc values, and I
find that for non string fields I can't recover the ordinal for a field
value. At least I don't know how to get at it. So in my upgrade I index with
doc values, and the above code looks like:

SortedDocValues docValues =
searcher.getLeafReader().getSortedDocValues(field);

However this is not equivalent to the old 4.9 approach. My TrieDateField and
TrieDoubleField fields are being ignored (all field value ordinals are
treated as the same). I'm sure there is a good reason why SortedDocValues
exposes the backing dictionary and [Sorted]NumericDocValues does not. How do
I get to the ordinal for number and date fields?

I assume my fallback is to not index with doc values, and use an uninverting
reader to get the field data. Is there a better approach?





SolrCloud node is not coming up

2015-08-19 Thread Merlin Morgenstern
I have a Solrcloud cluster running with 2 nodes, configured with 1 shard
and 2 replica. Now I have added a node on a new server, registered with the
same three zookeepers. The node shows up inside the tree of the Solrcloud
admin GUI under "live nodes".

Unfortunately the new node is not inside the graphical view, and it shows 0
cores available, while the other admin interface shows the available core. I
have also shut down the second replica server, which is now grayed out. But
the third node is still not available.

Is there something I have to do in order to add a node, besides registering
it? This is the startup command I am using:
bin/solr start -cloud -s server/solr2 -p 8983 -z zk1:2181,zk1:2182,zk1:2183
-noprompt


Re: Performance issue with FILTER QUERY

2015-08-19 Thread Erick Erickson
If you're committing that rapidly then you're correct, filter caching
may not be a good fit. The entire _point_ of
filter caching is to increase performance of subsequent executions of
the exact same fq clause. But if you're
throwing them away every second there's little/no benefit.

You really have two choices here
1> lengthen out the commit interval. Frankly, 1 second commit
intervals are rarely necessary despite what
 your product manager says. Really, check this requirement out.
2> disable caches.

Autowarming is potentially useful here, but if your filter queries are
taking on the order of a second and
you're committing every second then autowarming takes too long to help.
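(A sketch of option 1 in solrconfig.xml — the 30-second value is illustrative:)

    <autoSoftCommit>
      <maxTime>30000</maxTime>
    </autoSoftCommit>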

Best,
Erick

On Wed, Aug 19, 2015 at 12:26 AM, Mikhail Khludnev
 wrote:
> Maulin,
> Did you check performance with segmented filters which I advised recently?
>
> On Wed, Aug 19, 2015 at 10:24 AM, Maulin Rathod  wrote:
>
>> As per my understanding, caches are flushed every time we add a new
>> document to the collection (we do a soft commit every 1 sec to make newly
>> added documents available for search). Because of this, the cache is not
>> used effectively, and hence it is slow every time in our case.
>>
>> -Original Message-
>> From: Toke Eskildsen [mailto:t...@statsbiblioteket.dk]
>> Sent: 19 August 2015 12:16
>> To: solr-user@lucene.apache.org
>> Subject: Re: Performance issue with FILTER QUERY
>>
>> On Wed, 2015-08-19 at 05:55 +, Maulin Rathod wrote:
>> > SLOW WITH FILTER QUERY (takes more than 1 second)
>> > 
>> >
>> > q=+recipient_id:(4042) AND project_id:(332) AND resource_id:(13332247
>> > 13332245 13332243 13332241 13332239) AND entity_type:(2) AND
>> > -action_id:(20 32) ==> This returns 5 records
>> > fq=+action_status:(0) AND is_active:(true) ==> This Filter Query
>> > returns 9432252 records
>>
>> The fq is evaluated independently of the q: For the fq a bitset is
>> allocated, filled and stored in cache. Then the q is evaluated and the two
>> bitsets are merged.
>>
>> Next time you use the same fq, it should be cached (if you have caching
>> enabled) and be a lot faster.
>>
>>
>> Also, if you ran your two tests right after each other, the second one
>> benefits from disk caching. If you had executed them in reverse order, the
>> q+fq might have been the fastest one.
>>
>> - Toke Eskildsen, State and University Library, Denmark
>>
>>
>>
>
>
> --
> Sincerely yours
> Mikhail Khludnev
> Principal Engineer,
> Grid Dynamics
>
> 
> 


Re: jetty.xml

2015-08-19 Thread Erick Erickson
What's happening on the system when you see this? If you're heavily
indexing and NOT using SolrJ's CloudSolrServer/Client, then a lot of
threads can be occupied forwarding documents to the other shards.

Best,
Erick

On Wed, Aug 19, 2015 at 6:55 AM, Davis, Daniel (NIH/NLM) [C]
 wrote:
> Jetty includes a QoSFilter, 
> https://wiki.eclipse.org/Jetty/Reference/QoSFilter, with some changes I think 
> it might be able to throttle the requests coming into Solr from truly 
> outside, e.g. not SolrCloud replication, ZooKeeper etc., so as to make sure 
> that Solr's own work could still get done.   This is just an idea, not a 
> suggestion.
>
> -Original Message-
> From: Shawn Heisey [mailto:apa...@elyograg.org]
> Sent: Wednesday, August 19, 2015 9:15 AM
> To: solr-user@lucene.apache.org
> Subject: Re: jetty.xml
>
> On 8/18/2015 11:50 PM, William Bell wrote:
>> We sometimes get a spike in Solr, and we get like 3K of threads and
>> then timeouts...
>>
>> In Solr 5.2.1 the defult jetty settings is kinda crazy for threads -
>> since the value is HIGH!
>>
>> What do others recommend?
>
> The setting of 10000 is so that there is effectively no limit.  Solr will 
> stop working right if it is not allowed to start threads whenever it wishes.  
> Solr is not a typical web application.
>
> As far as I know (and my knowledge could be wrong), a typical web application 
> that serves a website to users will handle all back-end details with the same 
> thread(s) that were created when the connection was opened.  Putting a 
> relatively low limit on the number of threads in that situation is sensible.
>
> A very small Solr install with a low query volume will work in 200 threads 
> (the default limit in most containers), but it doesn't take very much to 
> exceed that.
>
> I have a Solr 4.9.1 dev install with 44 cores, running with the Jetty 8 
> example included in the 4.x download.  19 of those cores are build cores, 
> with 19 cores for live indexes.  The other six cores are always empty, with a 
> shards parameter in the search handler definition for distributed searching.  
> This install does NOT run in SolrCloud mode.
>
> This dev server sees very little traffic besides a few indexing requests 
> every minute and load balancer health checks.  JConsole shows the number of 
> threads hovering between 230 and 235.  If I scroll through the thread list, 
> most of them show a state of WAITING on various locking mechanisms, which 
> explains why my CPUs (8 CPU cores total) are not being overwhelmed with work 
> from all those threads.
>
> Solr and Lucene don't really have a runaway thread problem as far as I can 
> tell, but the system does use a fair number of them for basic operation, with 
> more cores/collections adding more threads.  SolrCloud mode will also use 
> more threads.
>
> If you send requests to Solr at a very fast rate, the servlet container may 
> also use a lot of threads.
>
> Thanks,
> Shawn
>
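(For reference, a sketch of the jetty.xml setting under discussion — the exact
surrounding markup varies by Solr/Jetty version:)

    <Set name="maxThreads">10000</Set>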


Re: Geospatial Predicate Question

2015-08-19 Thread david.w.smi...@gmail.com
Hi Jamie,

Your understanding is inverted.  The predicates can be read as:
  <indexed shape> <predicate> <query shape>.

For indexed point data, there is almost no semantic difference between the
Within and Intersects predicates.  There is if the field is multi-valued
and you want to ensure that all of the points for a document are within the
query shape (Within predicate) versus any of them being okay (Intersects
predicate).  Intersects is pretty fast.

The Contains predicate only makes sense for non-point indexed data.
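(Hedged examples of the three predicates as filter queries — the field name
"geo" is an assumption; ENVELOPE takes minX, maxX, maxY, minY:)

    fq=geo:"Intersects(ENVELOPE(-10, 10, 20, -20))"
    fq=geo:"IsWithin(ENVELOPE(-10, 10, 20, -20))"
    fq=geo:"Contains(ENVELOPE(-10, 10, 20, -20))"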

~ David

On Wed, Aug 12, 2015 at 6:02 PM Jamie Johnson  wrote:

> Can someone clarify the difference between isWithin and Contains in regards
> to Solr's spatial support?  From
> https://wiki.apache.org/solr/SolrAdaptersForLuceneSpatial4 I see that if
> you are using point data you should use Intersects, but it is not clear
> when to use isWithin and contains.  My guess is that you use isWithin when
> you want to know if the query shape is within the shape that is indexed and
> you use contains to know if the query shape contains the indexed shape.  Is
> that right?
>
-- 
Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
LinkedIn: http://linkedin.com/in/davidwsmiley | Book:
http://www.solrenterprisesearchserver.com


Re: How to Fast Bulk Insert documents

2015-08-19 Thread Erick Erickson
If you're sitting on HDFS anyway, you could use MapReduceIndexerTool. I'm not
sure that'll hit your rate; it spends some time copying things around.
If you're not on HDFS, though, it's not an option.

Best,
Erick

On Wed, Aug 19, 2015 at 11:36 AM, Upayavira  wrote:
>
>
> On Wed, Aug 19, 2015, at 07:13 PM, Toke Eskildsen wrote:
>> Troy Edwards  wrote:
>> > My average document size is 400 bytes
>> > Number of documents that need to be inserted: 250,000/second
>> > (for a total of about 3.6 Billion documents)
>>
>> > Any ideas/suggestions on how that can be done? (use a client
>> > or uploadcsv or stream or data import handler)
>>
>> Use more than one cloud. Make them fully independent. As I suggested when
>> you asked 4 days ago. That would also make it easy to scale: Just measure
>> how much a single setup can take and do the math.
>
> Yes - work out how much each node can handle, then you can work out how
> many nodes you need.
>
> You could consider using implicit routing rather than compositeId, which
> means that you take on responsibility for hashing your ID to push
> content to the right node. (Or, if you use compositeId, you could use
> the same algorithm, and be sure that you send docs directly to the
> > correct shard.)
>
> At the moment, if you push five documents to a five shard collection,
> the node you send them to could end up doing four HTTP requests to the
> other nodes in the collection. This means you don't need to worry about
> where to post your content - it is just handled for you. However, there
> is a performance hit there. Push content direct to the correct node
> (either using implicit routing, or by replicating the compositeId hash
> calculation in your client) and you'd increase your indexing throughput
> significantly, I would theorise.
>
> Upayavira


Re: How to Fast Bulk Insert documents

2015-08-19 Thread Upayavira


On Wed, Aug 19, 2015, at 07:13 PM, Toke Eskildsen wrote:
> Troy Edwards  wrote:
> > My average document size is 400 bytes
> > Number of documents that need to be inserted: 250,000/second
> > (for a total of about 3.6 Billion documents)
> 
> > Any ideas/suggestions on how that can be done? (use a client
> > or uploadcsv or stream or data import handler)
> 
> Use more than one cloud. Make them fully independent. As I suggested when
> you asked 4 days ago. That would also make it easy to scale: Just measure
> how much a single setup can take and do the math.

Yes - work out how much each node can handle, then you can work out how
many nodes you need.

You could consider using implicit routing rather than compositeId, which
means that you take on responsibility for hashing your ID to push
content to the right node. (Or, if you use compositeId, you could use
the same algorithm, and be sure that you send docs directly to the
correct shard.

At the moment, if you push five documents to a five shard collection,
the node you send them to could end up doing four HTTP requests to the
other nodes in the collection. This means you don't need to worry about
where to post your content - it is just handled for you. However, there
is a performance hit there. Push content direct to the correct node
(either using implicit routing, or by replicating the compositeId hash
calculation in your client) and you'd increase your indexing throughput
significantly, I would theorise.

Upayavira


Re: Changing Similarity without re-indexing (for example from default to BM25)

2015-08-19 Thread Upayavira
warning: I'm no expert on other similarities.

Having said that, I'm not aware of similarities being used in the
indexing process - during indexing term frequency, document frequency,
field norms, and so on are all recorded. These are things that the
default similarity (TF/IDF) uses to calculate its score. So long as the
data required by the similarity is already in the index, I don't see why
changing similarity would require a re-index.

But then, who ever wrote that must have been thinking of something...

Upayavira

On Wed, Aug 19, 2015, at 05:40 PM, Tom Burton-West wrote:
> Hello all,
> 
> The last time I worked with changing Similarities was with Solr 4.1, and at
> that time, it was possible to simply change the schema to specify the use
> of a different Similarity without re-indexing.   This allowed me to
> experiment with several different ranking algorithms without having to
> re-index.
> 
>  Currently the documentation states that doing this is theoretically
> possible but not well defined:
> 
> "To change Similarity
> ,
> one must do so for both indexing and searching, and the changes must
> happen
> before either of these actions take place. Although in theory there is
> nothing stopping you from changing mid-stream, it just isn't well-defined
> what is going to happen."
> 
> http://lucene.apache.org/core/5_2_0/core/org/apache/lucene/search/similarities/package-summary.html#changingSimilarity
> 
> Has something changed between 4.1 and 5.2 that actually will prevent
> changing Similarity without re-indexing from working, or is this just a
> warning in case at some future point someone contributes code so that a
> particular similarity takes advantage of a different index format?
> 
> Tom


Re: How to Fast Bulk Insert documents

2015-08-19 Thread Toke Eskildsen
Troy Edwards  wrote:
> My average document size is 400 bytes
> Number of documents that need to be inserted: 250,000/second
> (for a total of about 3.6 Billion documents)

> Any ideas/suggestions on how that can be done? (use a client
> or uploadcsv or stream or data import handler)

Use more than one cloud. Make them fully independent. As I suggested when you 
asked 4 days ago. That would also make it easy to scale: Just measure how much 
a single setup can take and do the math.

- Toke Eskildsen


Re: How to Fast Bulk Insert documents

2015-08-19 Thread Shawn Heisey
On 8/19/2015 11:09 AM, Troy Edwards wrote:
> I have a requirement where I have to bulk insert a lot of documents in
> SolrCloud.
>
> My average document size is 400 bytes
> Number of documents that need to be inserted: 250,000/second (for a total of
> about 3.6 Billion documents)
>
> Any ideas/suggestions on how that can be done? (use a client or uploadcsv
> or stream or data import handler)
>
> How can SolrCloud be configured to allow this fast bulk insert?
>
> Any thoughts on what the SolrCloud configuration would probably look like?

I think this is an unrealistic goal, unless you're planning on a couple
hundred shards with a very small number of shards (1 or 2) per server. 
This would also require a very large number of very fast servers with a
fair amount of RAM.  The more shards you have on each server, the more
likely it is that you'll need SSD storage.  This will get very expensive.

It is likely going to take a lot longer than 4 hours to rebuild your
entire 3.6 billion document index.  Your small document size will help
keep the rebuild time lower than I would otherwise expect, but 3.6
billion is a VERY large number.  I can achieve about 6000 docs per
second on my largest index, which means that each of my cold shards
indexes at about 1000 docs per second.  I'm not sure how large my
documents are, but a few kilobytes is probably about right.  The entire
rebuild takes over 9 hours for a little more than 200 million documents.

The best performance is likely to come from a heavily multi-threaded
SolrJ 5.2.1 or later application using CloudSolrClient, with at least
version 5.2.1 on your servers.  Even if you build the hardware
infrastructure I described above, it won't perform to your expectations
unless you've got someone with considerable Java programming skills.
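(A hedged sketch of such an indexer on SolrJ 5.2.x — the ZooKeeper hosts,
collection name, thread count, batch size, and the moreData()/nextBatch()
helpers are all assumptions:)

    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    CloudSolrClient client = new CloudSolrClient("zk1:2181,zk2:2181,zk3:2181");
    client.setDefaultCollection("collection1");
    ExecutorService pool = Executors.newFixedThreadPool(16);
    while (moreData()) {                              // hypothetical source
        List<SolrInputDocument> batch = nextBatch(1000);  // hypothetical; send batches, not single docs
        pool.submit(() -> client.add(batch));         // each task pushes one batch
    }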

Thanks,
Shawn



Re: Is it good query performance with this data size?

2015-08-19 Thread wwang525
Hi Upayavira,

I happened to compose individual fq for each field, such as:
fq=Gatewaycode:(...)&fq=DestCode:(...)&fq=DateDep:(...)&fq=Duration:(...)

It is nice to know that I am not creating unnecessary cache entries, since
the above method results in minimal cardinality, as you pointed out.

Thanks







Re: How to Fast Bulk Insert documents

2015-08-19 Thread Vineeth Dasaraju
I have been using the solrj client and get speeds of 1000 objects per
second. The size of my object is around 4 kb.
On Aug 19, 2015 12:09 PM, "Troy Edwards"  wrote:

> I have a requirement where I have to bulk insert a lot of documents in
> SolrCloud.
>
> My average document size is 400 bytes
> Number of documents that need to be inserted: 250,000/second (for a total of
> about 3.6 Billion documents)
>
> Any ideas/suggestions on how that can be done? (use a client or uploadcsv
> or stream or data import handler)
>
> How can SolrCloud be configured to allow this fast bulk insert?
>
> Any thoughts on what the SolrCloud configuration would probably look like?
>
> Thanks
>


How to Fast Bulk Insert documents

2015-08-19 Thread Troy Edwards
I have a requirement where I have to bulk insert a lot of documents in
SolrCloud.

My average document size is 400 bytes
Number of documents that need to be inserted: 250,000/second (for a total of
about 3.6 Billion documents)

Any ideas/suggestions on how that can be done? (use a client or uploadcsv
or stream or data import handler)

How can SolrCloud be configured to allow this fast bulk insert?

Any thoughts on what the SolrCloud configuration would probably look like?

Thanks


Re: Is it good query performance with this data size?

2015-08-19 Thread Upayavira
Yes, you can limit the size of the filter cache, as Erick says, but
then, you could just end up with cache churn, where you are constantly
re-populating your cache as stuff gets pushed out, only to have to
regenerate it again for the next query.

Is it possible to decompose these queries into parts?

fq=+category:sport +year:2015

could be better expressed as:
fq=category:sport
fq=year:2015

Instead of resulting in cardinality(category) * cardinality(year) cache
entries, you'd have cardinality(category) + cardinality(year).

cardinality() here simply means the number of unique values for that
field.

Upayavira

On Wed, Aug 19, 2015, at 05:23 PM, Erick Erickson wrote:
> bq:  can I limit the size of the three
> caches so that the RAM usage will be under control
> 
> That's exactly what the "size" parameter is for.
> 
> As Upayavira says, the rough size of each entry in
> the filterCache is maxDocs/8 + (sizeof query string).
> 
> queryResultCache is much smaller per entry; it's
> roughly (sizeof entire query) + ((sizeof Java int) * <queryResultWindowSize>)
> 
> <queryResultWindowSize> is from solrconfig.xml. The point
> here is this is rarely very big unless you make the
> queryResultCache huge.
> 
> As for the documentCache, it's also usually not
> very large; it's (the size you declare) * (average size of a doc).
> 
> Best,
> Erick
> 
> On Wed, Aug 19, 2015 at 9:12 AM, wwang525  wrote:
> > Hi Upayavira,
> >
> > Thank you very much for pointing out the potential design issue
> >
> > The queries will be determined through a configuration by business users.
> > There will be limited number of queries every day, and will get executed by
> > customers repeatedly. However, business users will change the configurations
> > so that new queries will get generated and also will be limited. The change
> > can be as frequent as daily or weekly. The project is to support daily
> > promotions based on fresh index data.
> >
> > Cumulatively, there can be a lot of different queries. If I still want to
> > take advantage of the filterCache, can I limit the size of the three
> > caches so that the RAM usage will be under control?
> >
> > Thanks
> >
> >
> >


Changing Similarity without re-indexing (for example from default to BM25)

2015-08-19 Thread Tom Burton-West
Hello all,

The last time I worked with changing Similarities was with Solr 4.1, and at
that time, it was possible to simply change the schema to specify the use
of a different Similarity without re-indexing.   This allowed me to
experiment with several different ranking algorithms without having to
re-index.

Currently the documentation states that doing this is theoretically
possible but not well defined:

"To change Similarity
,
one must do so for both indexing and searching, and the changes must happen
before either of these actions take place. Although in theory there is
nothing stopping you from changing mid-stream, it just isn't well-defined
what is going to happen."

http://lucene.apache.org/core/5_2_0/core/org/apache/lucene/search/similarities/package-summary.html#changingSimilarity

Has something changed between 4.1 and 5.2 that actually will prevent
changing Similarity without re-indexing from working, or is this just a
warning in case at some future point someone contributes code so that a
particular similarity takes advantage of a different index format?

Tom


Re: Is it good query performance with this data size?

2015-08-19 Thread Erick Erickson
bq:  can I limit the size of the three
caches so that the RAM usage will be under control

That's exactly what the "size" parameter is for.

As Upayavira says, the rough size of each entry in
the filterCache is maxDocs/8 + (sizeof query string).

queryResultCache is much smaller per entry; it's
roughly (sizeof entire query) + ((sizeof Java int) * <queryResultWindowSize>).

<queryResultWindowSize> is from solrconfig.xml. The point
here is this is rarely very big unless you make the
queryResultCache huge.

As for the documentCache, it's also usually not
very large; it's (the size you declare) * (average size of a doc).
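(A sketch of where those sizes live in solrconfig.xml — the values are
illustrative:)

    <filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="0"/>
    <queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>
    <documentCache class="solr.LRUCache" size="512" initialSize="512"/>
    <queryResultWindowSize>20</queryResultWindowSize>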

Best,
Erick

On Wed, Aug 19, 2015 at 9:12 AM, wwang525  wrote:
> Hi Upayavira,
>
> Thank you very much for pointing out the potential design issue
>
> The queries will be determined through a configuration by business users.
> There will be limited number of queries every day, and will get executed by
> customers repeatedly. However, business users will change the configurations
> so that new queries will get generated and also will be limited. The change
> can be as frequent as daily or weekly. The project is to support daily
> promotions based on fresh index data.
>
> Cumulatively, there can be a lot of different queries. If I still want to
> take advantage of the filterCache, can I limit the size of the three
> caches so that the RAM usage will be under control?
>
> Thanks
>
>
>


Re: Is it good query performance with this data size?

2015-08-19 Thread wwang525
Hi Upayavira,

Thank you very much for pointing out the potential design issue

The queries will be determined through a configuration by business users.
There will be limited number of queries every day, and will get executed by
customers repeatedly. However, business users will change the configurations
so that new queries will get generated and also will be limited. The change
can be as frequent as daily or weekly. The project is to support daily
promotions based on fresh index data.

Cumulatively, there can be a lot of different queries. If I still want to
take advantage of the filterCache, can I limit the size of the three
caches so that the RAM usage will be under control?

Thanks





Re: Difficulties in getting SolrCloud running

2015-08-19 Thread Susheel Kumar
Use a command like the one below to create a collection:

http://<host>:<port>/solr/admin/collections?action=CREATE&name=<collection>&numShards=2&replicationFactor=2&maxShardsPerNode=2&collection.configName=<configName>

Susheel



On Wed, Aug 19, 2015 at 11:03 AM, Kevin Lee 
wrote:

> Hi,
>
> Have you created a collection yet?  If not, then there won’t be a graph
> yet.  It doesn’t show up until there is at least one collection.
>
> - Kevin
>
> > On Aug 19, 2015, at 5:48 AM, Merlin Morgenstern <
> merlin.morgenst...@gmail.com> wrote:
> >
> > Hi everybody,
> >
> > I am trying to set up SolrCloud on Ubuntu, and somehow the graph on the
> > admin interface does not show up. It is simply blank. The tree is available.
> >
> > This is a test installation on one machine.
> >
> > There are 3 zookeepers running.
> >
> > I start two solr nodes like this:
> >
> > solr-5.2.1$ bin/solr start -cloud -s server/solr1 -p 8983 -z
> > zk1:2181,zk1:2182,zk1:2183 -noprompt
> >
> > solr-5.2.1$ bin/solr start -cloud -s server/solr2 -p 8984 -z
> > zk1:2181,zk1:2182,zk1:2183 -noprompt
> >
> > zk1 is a local interface with 10.0.0.120
> >
> > it all looks OK, no error messages.
> >
> > Thank you in advance for any help on this
>
>


Re: Difficulties in getting SolrCloud running

2015-08-19 Thread Kevin Lee
Hi,

Have you created a collection yet?  If not, then there won’t be a graph yet.  
It doesn’t show up until there is at least one collection.

- Kevin

> On Aug 19, 2015, at 5:48 AM, Merlin Morgenstern 
>  wrote:
> 
> Hi everybody,
> 
> I am trying to set up SolrCloud on Ubuntu, and somehow the graph on the admin
> interface does not show up. It is simply blank. The tree is available.
> 
> This is a test installation on one machine.
> 
> There are 3 zookeepers running.
> 
> I start two solr nodes like this:
> 
> solr-5.2.1$ bin/solr start -cloud -s server/solr1 -p 8983 -z
> zk1:2181,zk1:2182,zk1:2183 -noprompt
> 
> solr-5.2.1$ bin/solr start -cloud -s server/solr2 -p 8984 -z
> zk1:2181,zk1:2182,zk1:2183 -noprompt
> 
> zk1 is a local interface with 10.0.0.120
> 
> it all looks OK, no error messages.
> 
> Thank you in advance for any help on this



Re: Is it a good query performance with this data size ?

2015-08-19 Thread Upayavira
You say "all of my queries are based upon fq"? Why? How unique are they?
Remember, for each fq value, Solr can end up storing one bit per
document in your index. If you have 8M documents, that is about 1 MB of
cache usage for that query alone!

Filter queries are primarily designed for queries that are repeated,
e.g.: category:sport, where caching gives some advantage.

If all of your queries are unique, then move them to the q= parameter,
or make them fq={!cache=false}, otherwise you will waste memory storing
cached values that are never used, and CPU building and then destroying
those cached entries.
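
To illustrate both forms (field names and values here are made up): a repeated
filter that benefits from caching, and a one-off filter that skips the cache:

  q=laptop&fq=category:electronics
  q=laptop&fq={!cache=false}user_id:12345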

Upayavira

On Wed, Aug 19, 2015, at 02:25 PM, wwang525 wrote:
> Hi Erick,
> 
> All my queries are based on fq (filter queries). I have to send randomly
> generated queries to warm up the low-level Lucene caches.
> 
> I took the more tedious route of warming up the low-level caches without
> the three Solr caches, by turning them off (setting their sizes to zero).
> Then I sent 800 randomly generated requests to Solr. RAM usage jumped from
> 500 MB to 2.5 GB and stayed there.
> 
> Then I tested individual queries against Solr. This time the response times
> were very close on the first, second, and third requests.
> 
> The results: 
> 
> (1) average response time: 803 ms with only one request having a response
> time >1 second (1042 ms)
> (2) the majority of the time was spent on query, and not on faceting 
> (730/803 = 90%)
> 
> So the query is the bottleneck.
> 
> I also have an interesting finding: it looks like filter queries work better
> with an integer type. I created string-type versions of two properties,
> DateDep and Duration, since defining docValues=true on the integer type did
> not work with faceted search. At one point I accidentally used a filter
> query on the string-type property and found that query performance degraded
> quite a lot.
> 
> Is it generally true that fq works better with an integer type?
> 
> If this is the case, I could create two integer-type properties for the two
> other fq fields to check whether I can boost performance.
> 
> Thanks
> 
> 
> 
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Is-it-a-good-query-performance-with-this-data-size-tp4223699p4223920.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Lucene 5.2.1 Spatial Strategy PointVectorStrategy

2015-08-19 Thread Pablo Mincz
Hi,

I'm implementing a search sorted by distance with a PointVectorStrategy.
In the indexing process I used createIndexableFields from the strategy
and makePoint from the GEO context.

But when I'm sorting the search I get the error:
Java::JavaLang::IllegalStateException: unexpected docvalues type NONE
for field 'location__x' (expected=NUMERIC)

And from what I see, it is impossible to use a specific FieldType with
DocValuesType NUMERIC.

Does someone know how to fix this?

Thanks a lot for the help.

Regards,
Pablo.
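
One possible fix, assuming Lucene 5.x, where PointVectorStrategy indexes plain
numeric fields without doc values: wrap the reader in an UninvertingReader so
the sort can treat location__x/location__y as numerics. A minimal sketch; the
directory and query coordinates are illustrative, not from the original post:

  import java.util.HashMap;
  import java.util.Map;

  import org.apache.lucene.index.DirectoryReader;
  import org.apache.lucene.search.IndexSearcher;
  import org.apache.lucene.search.Sort;
  import org.apache.lucene.spatial.vector.PointVectorStrategy;
  import org.apache.lucene.uninverting.UninvertingReader;

  import com.spatial4j.core.context.SpatialContext;
  import com.spatial4j.core.shape.Point;

  // location__x / location__y are the fields PointVectorStrategy creates for
  // the "location" prefix; map them to DOUBLE so sorting can uninvert them.
  Map<String, UninvertingReader.Type> mapping = new HashMap<>();
  mapping.put("location__x", UninvertingReader.Type.DOUBLE);
  mapping.put("location__y", UninvertingReader.Type.DOUBLE);

  DirectoryReader reader =
      UninvertingReader.wrap(DirectoryReader.open(directory), mapping);
  IndexSearcher searcher = new IndexSearcher(reader);

  SpatialContext ctx = SpatialContext.GEO;
  PointVectorStrategy strategy = new PointVectorStrategy(ctx, "location");
  Point origin = ctx.makePoint(queryLon, queryLat);
  Sort byDistance =
      new Sort(strategy.makeDistanceValueSource(origin).getSortField(false))
          .rewrite(searcher);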


json facet

2015-08-19 Thread naga sharathrayapati
Is it possible to specify facet.method with a nested JSON faceting query?

I would like to see if there would be a performance improvement using different
methods.
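
For reference, the JSON Facet API accepts a per-facet method option on terms
facets; a hedged sketch (field names are made up, and which method values are
honored depends on the Solr release):

  json.facet={
    categories : {
      type   : terms,
      field  : cat,
      method : dv,
      facet  : {
        brands : { type: terms, field: brand, method: dv }
      }
    }
  }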


Re: Solr leader and replica version mismatch 4.7.2

2015-08-19 Thread Shawn Heisey
On 8/19/2015 7:52 AM, Jeff Courtade wrote:
> We are running SOLR 4.7.2
> SolrCloud with 2 shards
> one Leader and one replica per shard.
> 
> the "Version" of the replica and leader differ displayed here as...
> 
> curl http://ps01:8983/solr/admin/cores?action=STATUS | sed ...
> version: 7753045
> 
> 
> However the commitTimeMSec lastModified and sizeInBytes matches on Leader
> and replica

SolrCloud works very differently than the old master-slave replication.
 The index is NOT copied from the leader to the other replicas, except
in extreme recovery circumstances.

Each replica builds its own copy of the index independently from the
others.  Due to slight timing differences in the indexing operations,
and possible actions related to transaction log replay on node restart,
each replica may end up with a different index layout.  There also could
be differences in the number of deleted documents.  Unless something
goes really wrong, all replicas should contain the same live documents.

Thanks,
Shawn



Re: Solr leader and replica version mismatch 4.7.2

2015-08-19 Thread Jeff Courtade
What I am trying to determine is a way to validate, for instance if a leader
dies and is completely unrecoverable, that the data on the replica is an exact
match to what the leader had.

I need to be able to monitor it and have confidence that it is working as
expected.

I had assumed the version number was what I was interested in.

Should the version number then be different in SolrCloud, since it is
deprecated?
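
One hedged way to compare live documents (the core names here are made up) is
to query each core directly with distrib=false and compare numFound; counts can
differ briefly until both replicas have applied and committed the same updates:

  curl "http://ps01:8983/solr/collection1_shard1_replica1/select?q=*:*&rows=0&distrib=false"
  curl "http://ps02:8983/solr/collection1_shard1_replica2/select?q=*:*&rows=0&distrib=false"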


--
Thanks,

Jeff Courtade
M: 240.507.6116

On Wed, Aug 19, 2015 at 10:08 AM, Shawn Heisey  wrote:

> On 8/19/2015 7:52 AM, Jeff Courtade wrote:
> > We are running SOLR 4.7.2
> > SolrCloud with 2 shards
> > one Leader and one replica per shard.
> >
> > the "Version" of the replica and leader differ displayed here as...
> >
> > curl http://ps01:8983/solr/admin/cores?action=STATUS |sed 's/>\n >
> > 7753045
> >
> >
> > However the commitTimeMSec lastModified and sizeInBytes matches on Leader
> > and replica
>
> SolrCloud works very differently than the old master-slave replication.
>  The index is NOT copied from the leader to the other replicas, except
> in extreme recovery circumstances.
>
> Each replica builds its own copy of the index independently from the
> others.  Due to slight timing differences in the indexing operations,
> and possible actions related to transaction log replay on node restart,
> each replica may end up with a different index layout.  There also could
> be differences in the number of deleted documents.  Unless something
> goes really wrong, all replicas should contain the same live documents.
>
> Thanks,
> Shawn
>
>


RE: jetty.xml

2015-08-19 Thread Davis, Daniel (NIH/NLM) [C]
Jetty includes a QoSFilter (https://wiki.eclipse.org/Jetty/Reference/QoSFilter).
With some changes, I think it might be able to throttle the requests coming into
Solr from truly outside (i.e., not SolrCloud replication, ZooKeeper, etc.), so
as to make sure that Solr's own work can still get done. This is just an idea,
not a suggestion.
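
A sketch of what wiring the QoSFilter into the Solr webapp's web.xml might look
like; the request limit and URL pattern are illustrative, distinguishing
external from internal SolrCloud traffic would need more work than shown here,
and the jetty-servlets jar must be on the webapp classpath:

  <filter>
    <filter-name>QoSFilter</filter-name>
    <filter-class>org.eclipse.jetty.servlets.QoSFilter</filter-class>
    <init-param>
      <!-- max concurrent requests allowed through; others are suspended -->
      <param-name>maxRequests</param-name>
      <param-value>50</param-value>
    </init-param>
  </filter>
  <filter-mapping>
    <filter-name>QoSFilter</filter-name>
    <!-- throttles everything hitting the webapp, including internal requests -->
    <url-pattern>/*</url-pattern>
  </filter-mapping>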

-Original Message-
From: Shawn Heisey [mailto:apa...@elyograg.org] 
Sent: Wednesday, August 19, 2015 9:15 AM
To: solr-user@lucene.apache.org
Subject: Re: jetty.xml

On 8/18/2015 11:50 PM, William Bell wrote:
> We sometimes get a spike in Solr, and we get like 3K of threads and 
> then timeouts...
> 
> In Solr 5.2.1 the default jetty settings are kinda crazy for threads - 
> since the value is HIGH!
> 
> What do others recommend?

The setting of 10000 is so that there is effectively no limit.  Solr will stop 
working right if it is not allowed to start threads whenever it wishes.  Solr 
is not a typical web application.

As far as I know (and my knowledge could be wrong), a typical web application 
that serves a website to users will handle all back-end details with the same 
thread(s) that were created when the connection was opened.  Putting a 
relatively low limit on the number of threads in that situation is sensible.

A very small Solr install with a low query volume will work in 200 threads (the 
default limit in most containers), but it doesn't take very much to exceed that.

I have a Solr 4.9.1 dev install with 44 cores, running with the Jetty 8 example 
included in the 4.x download.  19 of those cores are build cores, with 19 cores 
for live indexes.  The other six cores are always empty, with a shards 
parameter in the search handler definition for distributed searching.  This 
install does NOT run in SolrCloud mode.

This dev server sees very little traffic besides a few indexing requests every 
minute and load balancer health checks.  JConsole shows the number of threads 
hovering between 230 and 235.  If I scroll through the thread list, most of 
them show a state of WAITING on various locking mechanisms, which explains why 
my CPUs (8 CPU cores total) are not being overwhelmed with work from all those 
threads.

Solr and Lucene don't really have a runaway thread problem as far as I can 
tell, but the system does use a fair number of them for basic operation, with 
more cores/collections adding more threads.  SolrCloud mode will also use more 
threads.

If you send requests to Solr at a very fast rate, the servlet container may 
also use a lot of threads.

Thanks,
Shawn



Solr leader and replica version mismatch 4.7.2

2015-08-19 Thread Jeff Courtade
We are running SOLR 4.7.2
SolrCloud with 2 shards
one Leader and one replica per shard.

the "Version" of the replica and leader differ displayed here as...

curl http://ps01:8983/solr/admin/cores?action=STATUS |sed 's/>\n7753045


However the commitTimeMSec lastModified and sizeInBytes matches on Leader
and replica

curl http://ps01:8983/solr/admin/cores?action=STATUS |sed 's/>\n1439974815928

2015-08-19T09:00:15.928Z
43691759309
40.69 GB



If that number, date, and size match on the leader and the replicas, I believe
we are in sync.

Can anyone verify this?



--
Thanks,

Jeff Courtade
M: 240.507.6116


Re: Is it a good query performance with this data size ?

2015-08-19 Thread wwang525
Hi Erick,

All my queries are based on fq (filter queries). I have to send randomly
generated queries to warm up the low-level Lucene caches.

I took the more tedious route of warming up the low-level caches without the
three Solr caches, by turning them off (setting their sizes to zero). Then I
sent 800 randomly generated requests to Solr. RAM usage jumped from 500 MB to
2.5 GB and stayed there.

Then I tested individual queries against Solr. This time the response times
were very close on the first, second, and third requests.

The results: 

(1) average response time: 803 ms with only one request having a response
time >1 second (1042 ms)
(2) the majority of the time was spent on query, and not on faceting 
(730/803 = 90%)

So the query is the bottleneck.

I also have an interesting finding: it looks like filter queries work better
with an integer type. I created string-type versions of two properties,
DateDep and Duration, since defining docValues=true on the integer type did
not work with faceted search. At one point I accidentally used a filter query
on the string-type property and found that query performance degraded quite a
lot.

Is it generally true that fq works better with an integer type?

If this is the case, I could create two integer-type properties for the two
other fq fields to check whether I can boost performance.

Thanks



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Is-it-a-good-query-performance-with-this-data-size-tp4223699p4223920.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: jetty.xml

2015-08-19 Thread Shawn Heisey
On 8/18/2015 11:50 PM, William Bell wrote:
> We sometimes get a spike in Solr, and we get like 3K of threads and then
> timeouts...
> 
> In Solr 5.2.1 the default jetty settings are kinda crazy for threads - since
> the value is HIGH!
> 
> What do others recommend?

The setting of 10000 is so that there is effectively no limit.  Solr
will stop working right if it is not allowed to start threads whenever
it wishes.  Solr is not a typical web application.
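
For reference, the limit lives in the thread pool section of
server/etc/jetty.xml; a sketch of roughly what it looks like, though the exact
layout varies between Solr and Jetty versions:

  <New id="threadPool" class="org.eclipse.jetty.util.thread.QueuedThreadPool">
    <Set name="minThreads">10</Set>
    <!-- effectively unlimited for Solr's purposes -->
    <Set name="maxThreads">10000</Set>
  </New>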

As far as I know (and my knowledge could be wrong), a typical web
application that serves a website to users will handle all back-end
details with the same thread(s) that were created when the connection
was opened.  Putting a relatively low limit on the number of threads in
that situation is sensible.

A very small Solr install with a low query volume will work in 200
threads (the default limit in most containers), but it doesn't take very
much to exceed that.

I have a Solr 4.9.1 dev install with 44 cores, running with the Jetty 8
example included in the 4.x download.  19 of those cores are build
cores, with 19 cores for live indexes.  The other six cores are always
empty, with a shards parameter in the search handler definition for
distributed searching.  This install does NOT run in SolrCloud mode.

This dev server sees very little traffic besides a few indexing requests
every minute and load balancer health checks.  JConsole shows the number
of threads hovering between 230 and 235.  If I scroll through the thread
list, most of them show a state of WAITING on various locking
mechanisms, which explains why my CPUs (8 CPU cores total) are not being
overwhelmed with work from all those threads.

Solr and Lucene don't really have a runaway thread problem as far as I
can tell, but the system does use a fair number of them for basic
operation, with more cores/collections adding more threads.  SolrCloud
mode will also use more threads.

If you send requests to Solr at a very fast rate, the servlet container
may also use a lot of threads.

Thanks,
Shawn



Re: Disable caching

2015-08-19 Thread Jamie Johnson
This was my original thought.  We already have the thread local, so it should
be straightforward to just wrap the field name and use that as the key.  Again,
thanks; I really appreciate the feedback.
On Aug 19, 2015 8:12 AM, "Yonik Seeley"  wrote:

> On Tue, Aug 18, 2015 at 10:58 PM, Jamie Johnson  wrote:
> > Hmm...so I think I have things setup correctly, I have a custom
> > QParserPlugin building a custom query that wraps the query built from the
> > base parser and stores the user who is executing the query.  I've added
> the
> > username to the hashCode and equals checks so I think everything is setup
> > properly.  I ran a quick test and it definitely looks like my items are
> > being cached now per user, which is really great.
> >
> > The issue that I'm running into now is that the FieldValueCache doesn't take
> > into account the query, so the FieldValueCache is built for user a and
> then
> > reused for user b, which is an issue for me.  In short I'm back to my
> > NoOpCache for FieldValues.  It's great that I'm in a better spot for the
> > others, but is there anything that can be done with FieldValues to take
> > into account the requesting user?
>
> I guess a cache implementation that gets the user through a thread
> local and either wraps the original key with an object containing the
> user, or delegates to a per-user cache underneath.
>
> -Yonik
>


Difficulties in getting Solrcloud running

2015-08-19 Thread Merlin Morgenstern
HI everybody,

I am trying to set up SolrCloud on Ubuntu, and somehow the graph on the admin
interface does not show up. It is simply blank. The tree is available.

This is a test installation on one machine.

There are 3 zookeepers running.

I start two solr nodes like this:

solr-5.2.1$ bin/solr start -cloud -s server/solr1 -p 8983 -z
zk1:2181,zk1:2182,zk1:2183 -noprompt

solr-5.2.1$ bin/solr start -cloud -s server/solr2 -p 8984 -z
zk1:2181,zk1:2182,zk1:2183 -noprompt

zk1 is a local interface with 10.0.0.120

it all looks OK, no error messages.

Thank you in advance for any help on this


Re: Disable caching

2015-08-19 Thread Yonik Seeley
On Tue, Aug 18, 2015 at 10:58 PM, Jamie Johnson  wrote:
> Hmm...so I think I have things setup correctly, I have a custom
> QParserPlugin building a custom query that wraps the query built from the
> base parser and stores the user who is executing the query.  I've added the
> username to the hashCode and equals checks so I think everything is setup
> properly.  I ran a quick test and it definitely looks like my items are
> being cached now per user, which is really great.
>
> The issue that I'm running into now is that the FieldValueCache doesn't take
> into account the query, so the FieldValueCache is built for user a and then
> reused for user b, which is an issue for me.  In short I'm back to my
> NoOpCache for FieldValues.  It's great that I'm in a better spot for the
> others, but is there anything that can be done with FieldValues to take
> into account the requesting user?

I guess a cache implementation that gets the user through a thread
local and either wraps the original key with an object containing the
user, or delegates to a per-user cache underneath.

-Yonik
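
A minimal sketch of that key-wrapping idea; the class is hypothetical rather
than part of Solr's API, and the ThreadLocal that supplies the user is assumed
to exist elsewhere:

  // Hypothetical wrapper: partitions cache entries per user by folding the
  // requesting user into the cache key's equals/hashCode.
  public final class UserScopedKey {
    private final Object delegate; // the original cache key
    private final String user;     // the requesting user, read from a ThreadLocal

    public UserScopedKey(Object delegate, String user) {
      this.delegate = delegate;
      this.user = user;
    }

    @Override
    public boolean equals(Object o) {
      if (!(o instanceof UserScopedKey)) return false;
      UserScopedKey other = (UserScopedKey) o;
      return delegate.equals(other.delegate) && user.equals(other.user);
    }

    @Override
    public int hashCode() {
      return 31 * delegate.hashCode() + user.hashCode();
    }
  }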


Re: Performance issue with FILTER QUERY

2015-08-19 Thread Mikhail Khludnev
Maulin,
Did you check performance with segmented filters which I advised recently?

On Wed, Aug 19, 2015 at 10:24 AM, Maulin Rathod  wrote:

> As per my understanding, the caches are flushed every time we add a new
> document to the collection (we soft commit every 1 second to make newly
> added documents available for search). Because of this, the caches are not
> used effectively, and hence it is slow every time in our case.
>
> -Original Message-
> From: Toke Eskildsen [mailto:t...@statsbiblioteket.dk]
> Sent: 19 August 2015 12:16
> To: solr-user@lucene.apache.org
> Subject: Re: Performance issue with FILTER QUERY
>
> On Wed, 2015-08-19 at 05:55 +, Maulin Rathod wrote:
> > SLOW WITH FILTER QUERY (takes more than 1 second)
> > 
> >
> > q=+recipient_id:(4042) AND project_id:(332) AND resource_id:(13332247
> > 13332245 13332243 13332241 13332239) AND entity_type:(2) AND
> > -action_id:(20 32) ==> This returns 5 records
> > fq=+action_status:(0) AND is_active:(true) ==> This Filter Query
> > returns 9432252 records
>
> The fq is evaluated independently of the q: For the fq a bitset is
> allocated, filled and stored in cache. Then the q is evaluated and the two
> bitsets are merged.
>
> Next time you use the same fq, it should be cached (if you have caching
> enabled) and be a lot faster.
>
>
> Also, if you ran your two tests right after each other, the second one
> benefits from disk caching. If you had executed them in reverse order, the
> q+fq might have been the fastest one.
>
> - Toke Eskildsen, State and University Library, Denmark
>
>
>


-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics





Re: Performance issue with FILTER QUERY

2015-08-19 Thread Mikhail Khludnev
Hello,

Try experimenting with fq={!cache=false}... or fq={!cache=false cost=100}...
See https://cwiki.apache.org/confluence/display/solr/Common+Query+Parameters

On Wed, Aug 19, 2015 at 8:55 AM, Maulin Rathod  wrote:

>
> Hi,
>
> http://stackoverflow.com/questions/11627427/solr-query-q-or-filter-query-fq
>
> The above link suggests using a filter query, but we observed that the filter
> query is slower than a normal query in our case. Are we doing something wrong?
>
>
> SLOW WITH FILTER QUERY (takes more than 1 second)
> 
>
> q=+recipient_id:(4042) AND project_id:(332) AND resource_id:(13332247
> 13332245 13332243 13332241 13332239) AND entity_type:(2) AND -action_id:(20
> 32) ==> This returns 5 records
> fq=+action_status:(0) AND is_active:(true) ==> This Filter Query returns
> 9432252 records
>
> Final result is 0 records.
>
> FAST WITHOUT FILTER QUERY (takes less than 10 millisecond)
> ==
>
> q=+recipient_id:( 4042) AND project_id:( 332) AND resource_id:(13332247
> 13332245 13332243 13332241 13332239) AND entity_type:(2) AND -action_id:(20
> 32) AND +action_status:(0) AND is_active:(true)==> This returns 0  records
>
> Final result is 0 records.
>
>
>
>
>
> Regards,
>
> Maulin
>
>
>


-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics





RE: Performance issue with FILTER QUERY

2015-08-19 Thread Maulin Rathod
As per my understanding, the caches are flushed every time we add a new document
to the collection (we soft commit every 1 second to make newly added documents
available for search). Because of this, the caches are not used effectively, and
hence it is slow every time in our case.
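
For reference, that interval is the autoSoftCommit setting in solrconfig.xml; a
minimal sketch matching the 1-second interval described above (each soft commit
opens a new searcher, which invalidates the Solr caches):

  <autoSoftCommit>
    <maxTime>1000</maxTime> <!-- milliseconds -->
  </autoSoftCommit>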
 
-Original Message-
From: Toke Eskildsen [mailto:t...@statsbiblioteket.dk] 
Sent: 19 August 2015 12:16
To: solr-user@lucene.apache.org
Subject: Re: Performance issue with FILTER QUERY

On Wed, 2015-08-19 at 05:55 +, Maulin Rathod wrote:
> SLOW WITH FILTER QUERY (takes more than 1 second) 
> 
> 
> q=+recipient_id:(4042) AND project_id:(332) AND resource_id:(13332247 
> 13332245 13332243 13332241 13332239) AND entity_type:(2) AND 
> -action_id:(20 32) ==> This returns 5 records
> fq=+action_status:(0) AND is_active:(true) ==> This Filter Query 
> returns 9432252 records

The fq is evaluated independently of the q: For the fq a bitset is allocated, 
filled and stored in cache. Then the q is evaluated and the two bitsets are 
merged.
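
As a rough worked example, assuming maxDoc is on the order of the 9,432,252
matches above: one bit per document gives 9,432,252 / 8 bytes, or roughly
1.2 MB per cached filter entry.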

Next time you use the same fq, it should be cached (if you have caching
enabled) and be a lot faster.


Also, if you ran your two tests right after each other, the second one benefits 
from disk caching. If you had executed them in reverse order, the q+fq might 
have been the fastest one.

- Toke Eskildsen, State and University Library, Denmark