Re: Some performance questions....

2018-03-16 Thread Walter Underwood
> On Mar 16, 2018, at 3:26 PM, Deepak Goel  wrote:
> 
> Can you please post results of your test?
> 
> Please tell us the tps at 25%, 50%, 75%, 100% of your CPU resource


I could, but it probably would not be useful for your documents or your queries.

We have 22 million homework problems. Our queries are often hundreds of words 
long,
because students copy and paste entire problems. After pre-processing, the 
average query
is still 25 words.

For load benchmarking, I use access logs from production. I typically gather 
over a half-million
lines of log. Using production logs means that queries have the same 
statistical distribution
as prod, so the cache hit rates are reasonable.

Before each benchmark, I restart all the Solr instances to clear the caches. 
Then the first part
of the query log is used to warm the caches, typically about 4000 queries.

After that, the measured benchmark run starts. This uses JMeter with 100-500 
threads. Each
thread is configured with a constant throughput timer so a constant load is 
offered. Test runs are one to two hours. Recently, I ran a test with a rate of 1000 requests/minute
for one hour.

During the benchmark, I monitor the CPU usage. Our systems are configured with 
enough RAM
so that disk is not accessed for search indexes. If the CPU goes over 75-80%, 
there is congestion
and queries will slow down. Also, if the run queue (load average) increases 
over the number of
CPUs, there will be congestion.

After the benchmark run, the JMeter log is analyzed to report response time 
percentiles for
each Solr request handler.
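(A minimal sketch of that analysis step, assuming a CSV-format JMeter results
file with the default timeStamp,elapsed,label,... columns; the class name and
column indexes are illustrative and may need adjusting for your
jmeter.properties.)

import java.nio.file.*;
import java.util.*;

public class JtlPercentiles {
    public static void main(String[] args) throws Exception {
        // group elapsed times (ms) by sampler label, i.e. per request handler
        Map<String, List<Long>> byLabel = new HashMap<>();
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        for (String line : lines.subList(1, lines.size())) {   // skip header
            String[] cols = line.split(",");
            byLabel.computeIfAbsent(cols[2], k -> new ArrayList<>())
                   .add(Long.parseLong(cols[1]));
        }
        for (Map.Entry<String, List<Long>> e : byLabel.entrySet()) {
            List<Long> t = e.getValue();
            Collections.sort(t);
            System.out.printf("%s p50=%dms p95=%dms p99=%dms%n",
                    e.getKey(), pct(t, 50), pct(t, 95), pct(t, 99));
        }
    }

    // nearest-rank percentile over a sorted list
    static long pct(List<Long> sorted, int p) {
        int idx = (int) Math.ceil(p / 100.0 * sorted.size()) - 1;
        return sorted.get(Math.max(0, idx));
    }
}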

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)



Re: Solr 6.6.3: Errors when using facet.field

2018-03-16 Thread Jay Potharaju
This is my
query: 
facet=true=true=true=product_id=true=category_id

Field def:



Tried adding both docvalues & without docvalues.

Shards: 2

Has anyone else experienced this error?

Thanks


Thanks
Jay Potharaju


On Fri, Mar 16, 2018 at 2:20 PM, Jay Potharaju 
wrote:

> It looks like it was fixed as part of 6.6.3 : SOLR-6160
> <https://issues.apache.org/jira/browse/SOLR-6160>.
> FYI: I have 2 shards in my test environment.
>
>
> Thanks
> Jay Potharaju
>
>
> On Fri, Mar 16, 2018 at 2:07 PM, Jay Potharaju 
> wrote:
>
>> Hi,
>> I am running a simple query with group by & faceting.
>>
>> facet=true=true=true=product_
>> id=true=true=product_id=1
>>
>>
>> When I run the query I get errors
>>
>>  <lst name="error">
>>   <lst name="metadata">
>> <str name="error-class">org.apache.solr.common.SolrException</str>
>> <str name="root-error-class">java.lang.IllegalStateException</str>
>> <str 
>> name="error-class">org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException</str>
>> <str 
>> name="root-error-class">org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException</str>
>>   </lst>
>>   <str name="msg">Error from server at 
>> http://localhost:9223/solr/test2_shard2_replica1: Exception during 
>> facet.field: category_id</str>
>>   <str 
>> name="trace">org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:
>>  Error from server at http://localhost:9223/solr/test2_shard2_replica1: 
>> Exception during facet.field: category_id
>>  at 
>> org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:612)
>>  at 
>> org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:279)
>>  at 
>> org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:268)
>>  at org.apache.solr.client.solrj.SolrClient.request(SolrClient.java:1219)
>>  at 
>> org.apache.solr.handler.component.HttpShardHandler.lambda$submit$0(HttpShardHandler.java:163)
>>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>  at 
>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>  at 
>> com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:176)
>>  at 
>> org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
>>  at 
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>>  at 
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>>  at java.lang.Thread.run(Thread.java:748)
>> </str>
>>   <int name="code">500</int>
>>
>>
>> The above query worked in Solr 5.3. Any suggestions?
>> Thanks
>> Jay Potharaju
>>
>>
>
>


Re: Some performance questions....

2018-03-16 Thread Deepak Goel
On Sat, Mar 17, 2018 at 3:11 AM, Walter Underwood 
wrote:

> > On Mar 16, 2018, at 1:21 PM, Deepak Goel  wrote:
> >
> > However a single client object with thousands of queries coming in would
> > surely become a bottleneck. I can test this scenario too.
>
> No it isn’t. The single client object is thread-safe and manages a pool of
> connections.
>
> Your benchmark is probably the bottleneck. I have no problem driving 36
> CPUs to beyond
> 65% utilization with a benchmark.
>
>
Can you please post results of your test?

Please tell us the tps at 25%, 50%, 75%, 100% of your CPU resource


> Using one client object is not a scenario. It is how SolrJ was designed to
> be used.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
>


Re: Some performance questions....

2018-03-16 Thread Deepak Goel
On Sat, Mar 17, 2018 at 2:56 AM, Shawn Heisey  wrote:

> On 3/16/2018 2:21 PM, Deepak Goel wrote:
> > I wanted to test how many max connections can Solr handle concurrently.
> > Also I would have to implement a 'connection pooling' of the
> client-object
> > connections rather than a single connection thread
> >
> > However a single client object with thousands of queries coming in would
> > surely become a bottleneck. I can test this scenario too.
>
> Handling thousands of simultaneous queries is NOT something you can
> expect a single Solr server to do.  It's not going to happen.  It
> wouldn't happen with ES, either.  Handling that much load requires load
> balancing to a LOT of servers.  The server would be much more of a
> bottleneck than the client.
>

The problem is not server in my case. The server has hardware resources.
It's the software which is a problem.


>
> > The problem is the max throughput which I can get on the machine is
> around
> > 28 tps, even though I increase the load further & only 65% CPU is
> utilised
> > (there is still 35% which is not being used). This clearly indicates the
> > software is a problem as there are enough hardware resources.
>
> If your code is creating a client object before every single query, that
> could be part of the issue.  The benchmark code should be using the same
> client for all requests.  I really don't know how long it takes to
> create HttpSolrClient objects, but I don't imagine that it's super-fast.
>
>
It is taking less than 100ms to create a HttpSolrClient Object


> What version of SolrJ were you using?
>

Solr 7.2.0


> Depending on the SolrJ version you may need to create the client with a
> custom HttpClient object in order to allow it to handle plenty of
> threads.  This is how I create client objects in my SolrJ code:
>
>   RequestConfig rc = RequestConfig.custom().setConnectTimeout(2000)
> .setSocketTimeout(60000).build();
>   CloseableHttpClient httpClient =
> HttpClients.custom().setDefaultRequestConfig(rc).setMaxConnPerRoute(1024)
> .setMaxConnTotal(4096).disableAutomaticRetries().build();
>
>   SolrClient sc = new HttpSolrClient.Builder().withBaseSolrUrl(solrUrl)
> .withHttpClient(httpClient).build();
>
>
I can give the above configuration a spin and test if the results improve


> Thanks,
> Shawn
>
>


Re: Some performance questions....

2018-03-16 Thread Walter Underwood
> On Mar 16, 2018, at 1:21 PM, Deepak Goel  wrote:
> 
> However a single client object with thousands of queries coming in would
> surely become a bottleneck. I can test this scenario too.

No it isn’t. The single client object is thread-safe and manages a pool of 
connections.

Your benchmark is probably the bottleneck. I have no problem driving 36 CPUs to 
beyond
65% utilization with a benchmark.

Using one client object is not a scenario. It is how SolrJ was designed to be 
used.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)



Re: Some performance questions....

2018-03-16 Thread Shawn Heisey
On 3/16/2018 2:21 PM, Deepak Goel wrote:
> I wanted to test how many max connections can Solr handle concurrently.
> Also I would have to implement a 'connection pooling' of the client-object
> connections rather than a single connection thread
>
> However a single client object with thousands of queries coming in would
> surely become a bottleneck. I can test this scenario too.

Handling thousands of simultaneous queries is NOT something you can
expect a single Solr server to do.  It's not going to happen.  It
wouldn't happen with ES, either.  Handling that much load requires load
balancing to a LOT of servers.  The server would be much more of a
bottleneck than the client.

> The problem is the max throughput which I can get on the machine is around
> 28 tps, even though I increase the load further & only 65% CPU is utilised
> (there is still 35% which is not being used). This clearly indicates the
> software is a problem as there are enough hardware resources.

If your code is creating a client object before every single query, that
could be part of the issue.  The benchmark code should be using the same
client for all requests.  I really don't know how long it takes to
create HttpSolrClient objects, but I don't imagine that it's super-fast.

What version of SolrJ were you using?

Depending on the SolrJ version you may need to create the client with a
custom HttpClient object in order to allow it to handle plenty of
threads.  This is how I create client objects in my SolrJ code:

  RequestConfig rc = RequestConfig.custom().setConnectTimeout(2000)
    .setSocketTimeout(60000).build();
  CloseableHttpClient httpClient =
HttpClients.custom().setDefaultRequestConfig(rc).setMaxConnPerRoute(1024)
    .setMaxConnTotal(4096).disableAutomaticRetries().build();

  SolrClient sc = new HttpSolrClient.Builder().withBaseSolrUrl(solrUrl)
    .withHttpClient(httpClient).build();
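For example, the single client can then be shared by all benchmark threads
(a sketch; pool size and query are illustrative, and it assumes the usual
java.util.concurrent and SolrJ imports):

  ExecutorService pool = Executors.newFixedThreadPool(100);
  for (int i = 0; i < 100; i++) {
      pool.submit(() -> {
          try {
              // every thread reuses the one sc object built above
              QueryResponse rsp = sc.query(new SolrQuery("*:*"));
              System.out.println(rsp.getResults().getNumFound());
          } catch (Exception ex) {
              ex.printStackTrace();
          }
      });
  }
  pool.shutdown();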

Thanks,
Shawn



Re: Solr 6.6.3: Errors when using facet.field

2018-03-16 Thread Jay Potharaju
It looks like it was fixed as part of 6.6.3 : SOLR-6160
<https://issues.apache.org/jira/browse/SOLR-6160>.
FYI: I have 2 shards in my test environment.


Thanks
Jay Potharaju


On Fri, Mar 16, 2018 at 2:07 PM, Jay Potharaju 
wrote:

> Hi,
> I am running a simple query with group by & faceting.
>
> facet=true=true=true=
> product_id=true=true=
> product_id=1
>
>
> When I run the query I get errors
>
> <lst name="error">
>   <lst name="metadata">
> <str name="error-class">org.apache.solr.common.SolrException</str>
> <str name="root-error-class">java.lang.IllegalStateException</str>
> <str 
> name="error-class">org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException</str>
> <str 
> name="root-error-class">org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException</str>
>   </lst>
>   <str name="msg">Error from server at 
> http://localhost:9223/solr/test2_shard2_replica1: Exception during 
> facet.field: category_id</str>
>   <str 
> name="trace">org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:
>  Error from server at http://localhost:9223/solr/test2_shard2_replica1: 
> Exception during facet.field: category_id
>   at 
> org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:612)
>   at 
> org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:279)
>   at 
> org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:268)
>   at org.apache.solr.client.solrj.SolrClient.request(SolrClient.java:1219)
>   at 
> org.apache.solr.handler.component.HttpShardHandler.lambda$submit$0(HttpShardHandler.java:163)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:176)
>   at 
> org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> </str>
>   <int name="code">500</int>
>
>
> The above query worked in Solr 5.3. Any suggestions?
> Thanks
> Jay Potharaju
>
>


Solr 6.6.3: Errors when using facet.field

2018-03-16 Thread Jay Potharaju
Hi,
I am running a simple query with group by & faceting.

facet=true=true=true=product_id=true=true=product_id=1


When I run the query I get errors

<lst name="error">
  <lst name="metadata">
<str name="error-class">org.apache.solr.common.SolrException</str>
<str name="root-error-class">java.lang.IllegalStateException</str>
<str name="error-class">org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException</str>
<str name="root-error-class">org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException</str>
  </lst>
  <str name="msg">Error from server at
http://localhost:9223/solr/test2_shard2_replica1: Exception during
facet.field: category_id</str>
  <str name="trace">org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:
Error from server at http://localhost:9223/solr/test2_shard2_replica1:
Exception during facet.field: category_id
at 
org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:612)
at 
org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:279)
at 
org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:268)
at org.apache.solr.client.solrj.SolrClient.request(SolrClient.java:1219)
at 
org.apache.solr.handler.component.HttpShardHandler.lambda$submit$0(HttpShardHandler.java:163)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:176)
at 
org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
</str>
  <int name="code">500</int>


The above query worked in Solr 5.3. Any suggestions?
Thanks
Jay Potharaju


Solrj Analytics component

2018-03-16 Thread Asmaa Shoala
Hello,

I want to use the analytics 
component (https://lucene.apache.org/solr/guide/7_2/analytics.html#analytic-pivot-facets)
in Java code but I didn't find any guide on the internet.

Can you please help me?
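(One possible approach, sketched with made-up names: the component takes its
request as JSON in the raw "analytics" parameter, so SolrJ can set it directly.
Treat the exact JSON syntax as an assumption and check it against the ref-guide
page above.)

  SolrQuery q = new SolrQuery("*:*");
  q.setRows(0);
  // analytics request JSON, per the Analytics Component request syntax
  q.setParam("analytics", "{\"expressions\":{\"max_price\":\"max(price)\"}}");
  QueryResponse rsp = solrClient.query("mycollection", q);
  // the component adds an "analytics" section to the response
  System.out.println(rsp.getResponse().get("analytics"));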

Thanks,

Asmaa Ramzy Shoala

novomind Egypt LLC
_

7 Abou Rafea Street, Moustafa Kamel, Alexandria, Egypt

Mobile +20 1227281143
email asmaa.sho...@nm-eg.com · Skype 
asmaa.shoala_nmeg



Re: Adding Documents to Solr by using Java Client API is failed

2018-03-16 Thread Andy Tang
Erik,

Thank you for reminding.
javac -cp
.:/opt/solr/solr-6.6.2/dist/*:/opt/solr/solr-6.6.2/dist/solrj-lib/*
 AddingDocument.java

java -cp
.:/opt/solr/solr-6.6.2/dist/*:/opt/solr/solr-6.6.2/dist/solrj-lib/*
 AddingDocument

SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further
details.
Documents added

All jars are included and documents added successfully. However, there are
some error message coming out.

Thank you.


On Fri, Mar 16, 2018 at 12:43 PM, Erick Erickson 
wrote:

> this is the important bit:
>
> java.lang.NoClassDefFoundError: org/apache/http/Header
>
> That class is not defined in the Solr code at all, it's in
> httpcore-#.#.#.jar
>
> You probably need to include /opt/solr/solr-6.6.2/dist/solrj-lib in
> your classpath.
>
> Best,
> Erick
>
> On Fri, Mar 16, 2018 at 12:14 PM, Andy Tang 
> wrote:
> > I have the code to add document to Solr. I tested it in Both Solr 6.6.2
> and
> > Solr 7.2.1 and failed.
> >
> > import java.io.IOException;
> > import org.apache.solr.client.solrj.SolrClient;
> > import org.apache.solr.client.solrj.SolrServerException;
> > import org.apache.solr.client.solrj.impl.HttpSolrClient;
> > import org.apache.solr.common.SolrInputDocument;
> >
> > public class AddingDocument {
> >    public static void main(String args[]) throws Exception {
> >
> >   String urlString = "http://localhost:8983/solr/Solr_example";
> >   SolrClient Solr = new HttpSolrClient.Builder(urlString).build();
> >
> >   //Preparing the Solr document
> >   SolrInputDocument doc = new SolrInputDocument();
> >
> >   //Adding fields to the document
> >   doc.addField("id", "007");
> >   doc.addField("name", "James Bond");
> >   doc.addField("age","45");
> >   doc.addField("addr","England");
> >
> >   //Adding the document to Solr
> >   Solr.add(doc);
> >
> >   //Saving the changes
> >   Solr.commit();
> >   System.out.println("Documents added");
> >} }
> >
> > The compilation is successful like below.
> >
> > javac -cp .:/opt/solr/solr-6.6.2/dist/solr-solrj-6.6.2.jar
> > AddingDocument.java
> >
> > However, when I run it, it gave me some error messages that confused me.
> >
> > java -cp .:/opt/solr/solr-6.6.2/dist/solr-solrj-6.6.2.jar AddingDocument
> >
> > Exception in thread "main" java.lang.NoClassDefFoundError:
> > org/apache/http/Header
> > at org.apache.solr.client.solrj.impl.HttpSolrClient$Builder.
> build(HttpSolrClient.java:892)
> > at AddingDocument.main(AddingDocument.java:13)Caused by:
> > java.lang.ClassNotFoundException: org.apache.http.Header
> > at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
> > at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> > at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:338)
> > at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> > ... 2 more
> >
> > What is wrong with it? Is this urlString correct?
> >
> > Any help is appreciated!
> > Andy Tang
>


RE: Some performance questions....

2018-03-16 Thread Davis, Daniel (NIH/NLM) [C]
Deepak,

A better test of multi-user support might be to vary the queries and try to 
simulate a realistic 'working set' of search data.

I've made this same performance analysis mistake with the search index of 
www.indexengines.com, which I developed (in part).   It's somewhat different 
from Lucene on the inside, though.

What we cared a lot about were these things:

- If a query was done warm, e.g. with results cached in memory, response time 
should be very fast.
- If a query was done cold, e.g. with results from disk, response time should 
still be acceptable.
- If a lot of different queries were done, that we think simulate the real 
behavior of N users, that the memory usage of cache should be acceptable, e.g. 
the cache should get warm and there should be few cache misses.

This last test was key - if we have designed our caching properly, then the 
queries of X users will fit in Y memory, and we will be able to develop a 
simple understanding of that, with our target users.

Generating that realistic amount of query behavior for X users is hard.   Using 
real search logs from your previous search product is a good idea.   For 
instance, if you look at the top 1000 queries performed by your users over a 
particular period of time, you can then say that some percentage of user 
queries were covered by the top 1000 queries, e.g. 90%.   Then, maybe you 
measure over that same period your queries per second (QPS).

Now, you can say that if you randomly sample those top 1000 queries while 
generating the same QPS with an exponential distribution generator, that you 
have covered 90% of your real traffic.   Your queries are much more randomly 
distributed, but that's OK, because what you want to know is whether it all 
fits in cache memory, the effect of # of CPUs, amount of Memory, number of 
cluster nodes, sharding, and replication on the response time and such.
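That sampling-plus-exponential-arrivals idea can be sketched in a few lines of
Java (names and numbers are illustrative; a real generator would fire requests
asynchronously instead of sleeping through their latency):

import java.util.*;

public class PoissonLoad {
    public static void main(String[] args) throws Exception {
        // stand-in for the top-1000 queries mined from real search logs
        List<String> topQueries = Arrays.asList("q1", "q2", "q3");
        double targetQps = 50.0;
        Random rnd = new Random();
        while (true) {
            // uniform sample from the logged queries
            String q = topQueries.get(rnd.nextInt(topQueries.size()));
            sendQuery(q);
            // exponential inter-arrival gap with mean 1/targetQps seconds
            double gapMs = -Math.log(1.0 - rnd.nextDouble()) / targetQps * 1000.0;
            Thread.sleep((long) gapMs);
        }
    }

    static void sendQuery(String q) { /* submit to the search tier, not shown */ }
}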

Depending on your user community, top 1000 queries may not be enough to hit 
90%, it may only hit 70%.   Maybe you also need to look at the rate of 
"advanced search" and "search", or account for queries that drive business 
intelligence reports.   It really depends on your use case.   I wish I'd had 
the cloud available to test performance with - we were really naïve and did all 
this testing with our metal because, well, we thought our stuff relied on that.

I recommend you read the first couple chapters of Raj Jain's The Art of Computer 
Systems Performance Analysis.   It’s a great book even if you totally skip the 
later chapters on Queuing System analysis, and just think about what and how to 
test.

Hope this helps,

-Dan 





-Original Message-
From: Deepak Goel [mailto:deic...@gmail.com] 
Sent: Friday, March 16, 2018 4:22 PM
To: solr-user@lucene.apache.org
Subject: Re: Some performance questions

On Sat, Mar 17, 2018 at 1:06 AM, Shawn Heisey  wrote:

> On 3/16/2018 7:38 AM, Deepak Goel wrote:
> > I did a performance study of Solr a while back. And I found that it 
> > does not scale beyond a particular point on a single machine (could 
> > be due to the way its coded). Hence multiple instances might make sense.
> >
> > https://docs.google.com/document/d/1kUqEcZl3NhOo6SLklo5Icg3fMnn9O
> tLY_lwnc6wbXus/edit?usp=sharing
>
> How did you *use* that code that you've shown?  That is not apparent 
> (at least to me) from the document.
>
> If every usage of the SolrJ code went through ALL of the code you've 
> shown, then it's not done well.  It appears that you're creating and 
> closing a client object with every query.  This will be VERY inefficient.
>
> The client object should be created during an initialization step, and 
> then passed to the benchmark step to be used there.  One client object 
> can be used by many threads.


I wanted to test how many max connections can Solr handle concurrently.
Also I would have to implement a 'connection pooling' of the client-object 
connections rather than a single connection thread

However a single client object with thousands of queries coming in would surely 
become a bottleneck. I can test this scenario too.

> Very likely the ES client works the same,
> but you'd need to ask them to be sure.
>
> That code seems to be doing an identical query on every run.  If 
> that's what's happening, it's not a good indicator of performance.  
> Running the same query over and over will show better performance than 
> you can expect from a real-world query load

> What evidence do you see that Solr isn't scaling like you expect?

The problem is the max throughput which I can get on the machine is around
28 tps, even though I increase the load further & only 65% CPU is utilised 
(there is still 35% which is not being used). This clearly indicates the 
software is a problem as there are enough hardware resources.

Also very soon I will have a Linux environment with me, so I can conduct the 
test in the document on Linux too (for the users interested in Linux and not 
Windows)


> 

Recovering from machine failure

2018-03-16 Thread Andy C
Running Solr 7.2 in SolrCloud mode with 5 Linux VMs. Each VM was a single
shard, no replication. Single Zookeeper instance running on the same VM as
one of the Solr instances.

IT was making changes, and 2 of the VMs won't reboot (including the VM
where Zookeeper is installed). There was a dedicated drive which Solr (and
Zookeeper for the one node) where installed on, and a dedicated drive where
the Solr indexes were created.

They believe the drives are still good. Their plan is to create 2 new VMs
and attach the drives from the old VMs to them. However the IP addresses of
the new VMs will be different.

In the solr.in.sh I had set the SOLR_HOST entry to the IP address of the
VM. Is this just an arbitrary name? Will Zookeeper still recognize the Solr
instance if the SOLR_HOST entry doesn't match the IP address?

Obviously I will need to adjust the ZK_HOST entries on all nodes to reflect
the new IP address of the VMs. But will that be sufficient?
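(For reference, the two entries in question look like this in solr.in.sh;
the addresses are hypothetical:)

SOLR_HOST="10.1.2.33"        # the address this node registers in ZooKeeper
ZK_HOST="10.1.2.30:2181"     # where this node finds ZooKeeper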

Appreciate any guidance.

Thanks
- Andy -


Re: Some performance questions....

2018-03-16 Thread Deepak Goel
On Sat, Mar 17, 2018 at 1:06 AM, Shawn Heisey  wrote:

> On 3/16/2018 7:38 AM, Deepak Goel wrote:
> > I did a performance study of Solr a while back. And I found that it does
> > not scale beyond a particular point on a single machine (could be due to
> > the way its coded). Hence multiple instances might make sense.
> >
> > https://docs.google.com/document/d/1kUqEcZl3NhOo6SLklo5Icg3fMnn9O
> tLY_lwnc6wbXus/edit?usp=sharing
>
> How did you *use* that code that you've shown?  That is not apparent (at
> least to me) from the document.
>
> If every usage of the SolrJ code went through ALL of the code you've
> shown, then it's not done well.  It appears that you're creating and
> closing a client object with every query.  This will be VERY inefficient.
>
> The client object should be created during an initialization step, and
> then passed to the benchmark step to be used there.  One client object
> can be used by many threads.


I wanted to test how many max connections can Solr handle concurrently.
Also I would have to implement a 'connection pooling' of the client-object
connections rather than a single connection thread

However a single client object with thousands of queries coming in would
surely become a bottleneck. I can test this scenario too.

> Very likely the ES client works the same,
> but you'd need to ask them to be sure.
>
> That code seems to be doing an identical query on every run.  If that's
> what's happening, it's not a good indicator of performance.  Running the
> same query over and over will show better performance than you can
> expect from a real-world query load

> What evidence do you see that Solr isn't scaling like you expect?

The problem is the max throughput which I can get on the machine is around
28 tps, even though I increase the load further & only 65% CPU is utilised
(there is still 35% which is not being used). This clearly indicates the
software is a problem as there are enough hardware resources.

Also very soon I will have a Linux environment with me, so I can conduct
the test in the document on Linux too (for the users interested in Linux
and not Windows)


> Thanks,
> Shawn
>
>


Re: Adding Documents to Solr by using Java Client API is failed

2018-03-16 Thread Erick Erickson
this is the important bit:

java.lang.NoClassDefFoundError: org/apache/http/Header

That class is not defined in the Solr code at all, it's in  httpcore-#.#.#.jar

You probably need to include /opt/solr/solr-6.6.2/dist/solrj-lib in
your classpath.

Best,
Erick

On Fri, Mar 16, 2018 at 12:14 PM, Andy Tang  wrote:
> I have the code to add document to Solr. I tested it in Both Solr 6.6.2 and
> Solr 7.2.1 and failed.
>
> import java.io.IOException;
> import org.apache.solr.client.solrj.SolrClient;
> import org.apache.solr.client.solrj.SolrServerException;
> import org.apache.solr.client.solrj.impl.HttpSolrClient;
> import org.apache.solr.common.SolrInputDocument;
>
> public class AddingDocument {
>    public static void main(String args[]) throws Exception {
>
>   String urlString = "http://localhost:8983/solr/Solr_example";
>   SolrClient Solr = new HttpSolrClient.Builder(urlString).build();
>
>   //Preparing the Solr document
>   SolrInputDocument doc = new SolrInputDocument();
>
>   //Adding fields to the document
>   doc.addField("id", "007");
>   doc.addField("name", "James Bond");
>   doc.addField("age","45");
>   doc.addField("addr","England");
>
>   //Adding the document to Solr
>   Solr.add(doc);
>
>   //Saving the changes
>   Solr.commit();
>   System.out.println("Documents added");
>} }
>
> The compilation is successful like below.
>
> javac -cp .:/opt/solr/solr-6.6.2/dist/solr-solrj-6.6.2.jar
> AddingDocument.java
>
> However, when I run it, it gave me some error messages that confused me.
>
> java -cp .:/opt/solr/solr-6.6.2/dist/solr-solrj-6.6.2.jar AddingDocument
>
> Exception in thread "main" java.lang.NoClassDefFoundError:
> org/apache/http/Header
> at 
> org.apache.solr.client.solrj.impl.HttpSolrClient$Builder.build(HttpSolrClient.java:892)
> at AddingDocument.main(AddingDocument.java:13)Caused by:
> java.lang.ClassNotFoundException: org.apache.http.Header
> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:338)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> ... 2 more
>
> What is wrong with it? Is this urlString correct?
>
> Any help is appreciated!
> Andy Tang


Re: Some performance questions....

2018-03-16 Thread Shawn Heisey
On 3/16/2018 7:38 AM, Deepak Goel wrote:
> I did a performance study of Solr a while back. And I found that it does
> not scale beyond a particular point on a single machine (could be due to
> the way its coded). Hence multiple instances might make sense.
>
> https://docs.google.com/document/d/1kUqEcZl3NhOo6SLklo5Icg3fMnn9OtLY_lwnc6wbXus/edit?usp=sharing

How did you *use* that code that you've shown?  That is not apparent (at
least to me) from the document.

If every usage of the SolrJ code went through ALL of the code you've
shown, then it's not done well.  It appears that you're creating and
closing a client object with every query.  This will be VERY inefficient.

The client object should be created during an initialization step, and
then passed to the benchmark step to be used there.  One client object
can be used by many threads.  Very likely the ES client works the same,
but you'd need to ask them to be sure.

That code seems to be doing an identical query on every run.  If that's
what's happening, it's not a good indicator of performance.  Running the
same query over and over will show better performance than you can
expect from a real-world query load.

What evidence do you see that Solr isn't scaling like you expect?

Thanks,
Shawn



Adding Documents to Solr by using Java Client API is failed

2018-03-16 Thread Andy Tang
I have the code to add document to Solr. I tested it in Both Solr 6.6.2 and
Solr 7.2.1 and failed.

import java.io.IOException;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class AddingDocument {
   public static void main(String args[]) throws Exception {

  String urlString = "http://localhost:8983/solr/Solr_example";
  SolrClient Solr = new HttpSolrClient.Builder(urlString).build();

  //Preparing the Solr document
  SolrInputDocument doc = new SolrInputDocument();

  //Adding fields to the document
  doc.addField("id", "007");
  doc.addField("name", "James Bond");
  doc.addField("age","45");
  doc.addField("addr","England");

  //Adding the document to Solr
  Solr.add(doc);

  //Saving the changes
  Solr.commit();
  System.out.println("Documents added");
   } }

The compilation is successful like below.

javac -cp .:/opt/solr/solr-6.6.2/dist/solr-solrj-6.6.2.jar
AddingDocument.java

However, when I run it, it gave me some error messages that confused me.

java -cp .:/opt/solr/solr-6.6.2/dist/solr-solrj-6.6.2.jar AddingDocument

Exception in thread "main" java.lang.NoClassDefFoundError:
org/apache/http/Header
at 
org.apache.solr.client.solrj.impl.HttpSolrClient$Builder.build(HttpSolrClient.java:892)
at AddingDocument.main(AddingDocument.java:13)Caused by:
java.lang.ClassNotFoundException: org.apache.http.Header
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:338)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 2 more

What is wrong with it? Is this urlString correct?

Any help is appreciated!
Andy Tang


Re: statistics in hitlist

2018-03-16 Thread Joel Bernstein
With regression you're looking at how the change in one variable affects
the change in another variable. So you need to have values that are
changing. What you described is an average of field X which is not
changing, regressed against the value of X.

I think one approach to this is to regress the moving average of X with the
actual value of X. We can do this with the math library, but before
exploring the code for this spend some time
thinking about whether that's the problem you're trying to solve. Take a look at
how moving averages work: https://en.wikipedia.org/wiki/Moving_average
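A rough sketch of that approach, reusing the expressions from earlier in the
thread (whether movingAvg and copyOfRange are available depends on your Solr
version, and the window size and range bounds are illustrative; both arrays
passed to regress must end up the same length):

let(a=search(tx_prod_production, q="oil_first_90_days_production:[1 TO *]",
             fq="isParent:true", rows="150",
             fl="id,oil_first_90_days_production", sort="id asc"),
    b=col(a, oil_first_90_days_production),
    m=movingAvg(b, 10),
    c=copyOfRange(b, 9, 150),
    d=regress(c, m))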





Joel Bernstein
http://joelsolr.blogspot.com/

On Fri, Mar 16, 2018 at 9:26 AM, John Smith  wrote:

> Thanks for the link to the documentation, that will probably come in
> useful.
>
> I didn't see a way though, to get my avg function working? So instead of
> doing a linear regression on two fields, X and Y, in a hitlist, we need to
> do a linear regression on field X, and the average value of X. Is that
> possible? To pass in a function to the regress function instead of a field?
>
>
>
>
>
> On Thu, Mar 15, 2018 at 10:41 PM, Joel Bernstein 
> wrote:
>
> > I've been working on the user guide for the math expressions. Here is the
> > page on regression:
> >
> > https://github.com/joel-bernstein/lucene-solr/blob/math_expressions_
> > documentation/solr/solr-ref-guide/src/regression.adoc
> >
> > This page is part of the larger math expression documentation. The TOC is
> > here:
> >
> > https://github.com/joel-bernstein/lucene-solr/blob/math_expressions_
> > documentation/solr/solr-ref-guide/src/math-expressions.adoc
> >
> > The docs are still very rough but you can get an idea of the coverage.
> >
> >
> >
> > Joel Bernstein
> > http://joelsolr.blogspot.com/
> >
> > On Thu, Mar 15, 2018 at 10:26 PM, Joel Bernstein 
> > wrote:
> >
> > > If you want to get everything in query you can do this:
> > >
> > > let(echo="d,e",
> > >  a=search(tx_prod_production, q="oil_first_90_days_production:[1
> TO
> > > *]",
> > > fq="isParent:true", rows="150",
> > > fl="id,oil_first_90_days_production,oil_last_30_days_production",
> > sort="id
> > > asc"),
> > >  b=col(a, oil_first_90_days_production),
> > >  c=col(a, oil_last_30_days_production),
> > >  d=regress(b, c),
> > >  e=someExpression())
> > >
> > > The echo parameter tells the let expression which variables to output.
> > >
> > >
> > >
> > > Joel Bernstein
> > > http://joelsolr.blogspot.com/
> > >
> > > On Thu, Mar 15, 2018 at 3:13 PM, Erick Erickson <
> erickerick...@gmail.com
> > >
> > > wrote:
> > >
> > >> What does the fq clause look like?
> > >>
> > >> On Thu, Mar 15, 2018 at 11:51 AM, John Smith 
> > >> wrote:
> > >> > Hi Joel, I did some more work on this statistics stuff today. Yes,
> we
> > do
> > >> > have nulls in our data; the document contains many fields, we don't
> > >> always
> > >> > have values for each field, but we can't set the nulls to 0 either
> (or
> > >> any
> > >> > other value, really) as that will mess up other calculations (such
> as
> > >> when
> > >> > calculating average etc); we would normally just ignore fields with
> > null
> > >> > values when calculating stats manually ourselves.
> > >> >
> > >> > Adding a check in the "q" parameter to ensure that the fields used
> in
> > >> the
> > >> > calculations are > 0 does work now. Thanks for the tip (and sorry,
> > >> should
> > >> > have caught that myself). But I am unable to use "fq" for these
> > checks,
> > >> > they have to be added to the q instead. Adding fq's doesn't have any
> > >> effect.
> > >> >
> > >> >
> > >> > Anyway, I'm trying to change this up a little. This is what I'm
> > >> currently
> > >> > using (switched from "random" to "search" since I actually need the
> > full
> > >> > hitlist not just a random subset):
> > >> >
> > >> > let(a=search(tx_prod_production, q="oil_first_90_days_production:[1
> > TO
> > >> *]",
> > >> > fq="isParent:true", rows="150",
> > >> > fl="id,oil_first_90_days_production,oil_last_30_days_production",
> > >> sort="id
> > >> > asc"),
> > >> >  b=col(a, oil_first_90_days_production),
> > >> >  c=col(a, oil_last_30_days_production),
> > >> >  d=regress(b, c))
> > >> >
> > >> > So I have 2 fields there defined, that works great (in terms of a
> test
> > >> and
> > >> > running the query); but I need to replace the second field,
> > >> > "oil_last_30_days_production" with the avg value in
> > >> > oil_first_90_days_production.
> > >> >
> > >> > I can get the avg with this expression:
> > >> > stats(tx_prod_production, q="oil_first_90_days_production:[1 TO
> *]",
> > >> > fq="isParent:true", rows="150", avg(oil_first_90_days_
> > production))
> > >> >
> > >> > But I don't know how to push that avg value into the first streaming
> > >> > expression; guessing I have to set "c=" but that is where I'm
> > >> getting
> > >> > lost, since avg only 

QueryElevator prepare() in in distributed search

2018-03-16 Thread Markus Jelsma
Hello,

QueryElevator.prepare() runs five times for a single query in distributed 
search. This is probably not how it should be, but in what phase of distributed 
search is it supposed to actually run?

Many thanks,
Markus



Re: Some performance questions....

2018-03-16 Thread Deepak Goel
> That benchmark is on Windows, so not interesting for most of us.

I guess I must have missed this in the author's question. Did he describe
his OS?

Also other applications scale well on Windows. Why would Solr be different?
The Solr page does not say anything about performance limits on Windows
(shouldn't they say that upfront in that case!)

https://lucene.apache.org/solr/guide/6_6/installing-solr.html#got-java
(You can install Solr in any system where a suitable Java Runtime
Environment (JRE) is available, as detailed below. Currently this includes
Linux, OS X, and Microsoft Windows.)

> Windows has very different handling for threads, memory, and files
> compared to Unix. I had to do a lot of Windows-specific tuning for
> Ultraseek Server to get decent performance. For example, merge speed was
> terrible unless I opened files with a Windows-specific caching hint.





Deepak
"Please stop cruelty to Animals, help by becoming a Vegan"
+91 73500 12833
deic...@gmail.com

Facebook: https://www.facebook.com/deicool
LinkedIn: www.linkedin.com/in/deicool

"Plant a Tree, Go Green"
On Fri, Mar 16, 2018 at 9:43 PM, Walter Underwood 
wrote:

> On Mar 16, 2018, at 6:38 AM, Deepak Goel  wrote:
> >
> > I did a performance study of Solr a while back. And I found that it does
> > not scale beyond a particular point on a single machine (could be due to
> > the way its coded). Hence multiple instances might make sense.
> >
> > https://docs.google.com/document/d/1kUqEcZl3NhOo6SLklo5Icg3fMnn9OtLY_lwnc6wbXus/edit?usp=sharing
> >
> > ***Deepak***
>
> That benchmark is on Windows, so not interesting for most of us.
>
> Windows has very different handling for threads, memory, and files
> compared to Unix. I had to do a lot of Windows-specific tuning for
> Ultraseek Server to get decent performance. For example, merge speed was
> terrible unless I opened files with a Windows-specific caching hint.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
>


Re: Some performance questions....

2018-03-16 Thread Deepak Goel
> > On Mar 16, 2018, at 6:26 AM, Deepak Goel  wrote:
> >
> > I would try multiple Solr instances rather than a single Solr instance (it
> > definitely will give a performance boost)
>
> I would avoid multiple Solr instances on a single machine. I can use all 36
> cores on our servers with one Solr process.

Is your load scaling linearly? Can you please post the results?




Deepak
"Please stop cruelty to Animals, help by becoming a Vegan"
+91 73500 12833
deic...@gmail.com

Facebook: https://www.facebook.com/deicool
LinkedIn: www.linkedin.com/in/deicool

"Plant a Tree, Go Green"

On Fri, Mar 16, 2018 at 9:39 PM, Walter Underwood 
wrote:

> > On Mar 16, 2018, at 6:26 AM, Deepak Goel  wrote:
> >
> > I would try multiple Solr instances rather than a single Solr instance (it
> > definitely will give a performance boost)
>
>
> I would avoid multiple Solr instances on a single machine. I can use all 36
> cores on our servers with one Solr process.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
>


Re: Solr document routing using composite key

2018-03-16 Thread Erick Erickson
What Shawn said. 117 shards and 116 docs tells you absolutely nothing
useful. I've never seen the number of docs on various shards be off by
more than 2-3% when enough docs are indexed to be statistically valid.

Best,
Erick

On Fri, Mar 16, 2018 at 5:34 AM, Shawn Heisey  wrote:
> On 3/6/2018 11:53 AM, Nawab Zada Asad Iqbal wrote:
>>
>> I have 117 shards and i tried to use document ids from zero to 116. I find
>> that the distribution is very uneven, e.g., the largest bucket receives
>> total 5 documents; and around 38 shards will be empty.  Is it expected?
>
>
> With such a small data set, this fits what I would expect.
>
> Choosing buckets by hashing (which is what compositeId does) is not perfect,
> but if you send it thousands or millions of documents, it will be
> *generally* balanced.
>
> Thanks,
> Shawn
>
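A toy simulation makes Shawn's point concrete (String.hashCode below is only a
stand-in for the MurmurHash3 that compositeId routing actually uses):

import java.util.*;

public class BucketSkew {
    public static void main(String[] args) {
        for (int docs : new int[] {117, 1000000}) {
            int[] buckets = new int[117];
            for (int i = 0; i < docs; i++) {
                // hash the doc id into one of 117 buckets
                int h = Integer.toString(i).hashCode();
                buckets[Math.floorMod(h, 117)]++;
            }
            System.out.printf("%d docs: min=%d max=%d empty=%d%n", docs,
                    Arrays.stream(buckets).min().getAsInt(),
                    Arrays.stream(buckets).max().getAsInt(),
                    Arrays.stream(buckets).filter(b -> b == 0).count());
        }
    }
}

With 117 docs some buckets stay empty while others get several; with a million
docs the counts even out to within a few percent.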


Re: Recommendations for non-narrative data

2018-03-16 Thread Erick Erickson
For an index that size, you have a lot of options. I'd completely
ignore any discussion that starts with "but our index will be bigger
if we do that" until it's proven to be a problem. For reference, I
commonly see 200G-300G indexes so

Ok, to your problem.
Your update rate is very low so don't worry about it. In this case I'd
set my autocommit setting to as long as you can tolerate (say 15
seconds? 5 seconds?). If you can batch up your updates it'll help
(i.e. let's say you update your Solr index once a minute. Collect all
of the records that have changed in the last minute, batch them up in
a single request and send it).

If your update pattern _is_ something like above, it really doesn't
matter what your autocommit interval is since it'll only be triggered
every minute in my example. At this size/rate I wouldn't worry about
soft commits at all, just leave it out or set it to -1 (never fires).

As for your use-cases, pre-and-postfix wildcards are tricky. In the
naive case where you just index them regularly, they're quite
expensive since to find the matching terms you must enumerate all
terms in a field. However, at this size this is the first thing I'd
try, it might be fast enough. If it's not, the trick is to use ngrams
(say bigrams). So if I'm indexing "erick", it becomes "er" "ri" "ic"
"ck". Now a search for *ric* becomes simpler as it's a phrase search
for "ri" followed by "ic". Again, at your size the index increase not
a problem I'd guess.

So StandardTokenizer + LowercaseFilter + NgramFilter is where I'd
start. You'll find the admin/analysis page _extremely_ valuable for
understanding how these interact.
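A sketch of that starting point as a fieldType (the name and gram sizes are
illustrative; the same analysis runs on both sides, so a query for "ric"
becomes the adjacent grams "ri" "ic"):

<fieldType name="text_ngram" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="2"/>
  </analyzer>
</fieldType>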

Do be careful to try edge cases, particularly ones involving
punctuation. You'll discover that switching to something like
WhitespaceTokenizer all of the sudden stops removing punctuation for
instance.

Best,
Erick

On Fri, Mar 16, 2018 at 6:46 AM, Christopher Schultz
 wrote:
> All,
>
> I'm using Solr to index and search a database of user data (username,
> email, first and last name), so there aren't really "terms" in the data
> to search for, like you might search for words that describe products in
> a catalog, for example.
>
> I have set up my schema to include plain-old text fields for each of the
> data mentioned above, plus I have a copy-field called "all" which
> includes everything all together, plus I have a first + last field which
> uses a phonetic index and query analyzer.
>
> Since I don't need things such as term-replacement (spanner == wrench),
> stemming (first name 'chris' -> 'chri'), and possibly other features
> that I don't know about, I'm wondering what might be a recommended set
> of tokenizer(s), analyzer(s), etc. for such data.
>
> We will definitely want to be able to search by substring (to find
> 'cschultz' as a username with 'schultz' as input) but some substrings
> are probably useless (such as @gmail.com for email addresses) and don't
> need to be supported.
>
> What are some good options to look at for this type of data?
>
> In production, we have fewer than 5M records to handle, so this is more
> of an academic exercise than an actual performance requirement (since
> Solr is at least an order of magnitude faster than our current
> RDBMS-searching implementation).
>
> If it makes any difference, we are trying to keep the index up-to-date
> with all user changes made in real time (okay, maybe delayed by a few
> seconds, but basically realtime). We have a few hundred new-user
> registrations per day and probably half as many changes to user records
> as that, so perhaps 2 document-updates per minute on average (during ~12
> business hours in the US on weekdays).
>
> Thanks for any advice anyone may have,
> -chris
>


Re: Some performance questions....

2018-03-16 Thread Walter Underwood
On Mar 16, 2018, at 6:38 AM, Deepak Goel  wrote:
> 
> I did a performance study of Solr a while back. And I found that it does
> not scale beyond a particular point on a single machine (could be due to
> the way its coded). Hence multiple instances might make sense.
> 
> https://docs.google.com/document/d/1kUqEcZl3NhOo6SLklo5Icg3fMnn9OtLY_lwnc6wbXus/edit?usp=sharing
>  
> 
> 
> ***Deepak***

That benchmark is on Windows, so not interesting for most of us.

Windows has very different handling for threads, memory, and files compared to 
Unix. I had to do a lot of Windows-specific tuning for Ultraseek Server to get 
decent performance. For example, merge speed was terrible unless I opened files 
with a Windows-specific caching hint.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)



Re: Some performance questions....

2018-03-16 Thread Walter Underwood
> On Mar 16, 2018, at 6:26 AM, Deepak Goel  wrote:
> 
> I would try multiple Solr instances rather than a single Solr instance (it
> definitely will give a performance boost)


I would avoid multiple Solr instances on a single machine. I can use all 36 cores 
on our servers with one Solr process.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)



Re: question regarding wildcard-searches

2018-03-16 Thread Erick Erickson
If you goal is to search prefixes only, I'd go away from the _text_
field all together and use a "string" type. This will mean you need to
1> make it multiValued=true
2> split this up (either on your client or use a
FieldMutatingUpdateProcessor, probably RegexReplaceProcessorFactory)
into separate entries, i.e.
'EO.1954.53.1', 'EO.1954.53.2', EO.1954.53.3'
becomes three separate entries in the field
'EO.1954.53.1'
'EO.1954.53.2'
'EO.1954.53.3'

At that point, searches like: 'EO.1954.53.*'

will work just fine. NOTE: String types do zero analysis, so you have
to handle things like casing yourself. That is, 'eO.1954.53.*' would
_not_ match. You can probably use something like
KeywordTokenizerFactory + LowerCaseFilterFactory in that case.
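A sketch of that last variant (type name illustrative, field name from the
thread):

<fieldType name="string_lc" class="solr.TextField" sortMissingLast="true">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<field name="genormaliseerdInventarisnummer" type="string_lc"
       indexed="true" stored="true" multiValued="true"/>

Each value stays one token, lowercased, so a wildcard query like eo.1954.53.*
matches EO.1954.53.1.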

All this makes _much_ more sense if you use the admin UI>>analysis
page (probably uncheck the "verbose" checkbox, there'll be less
clutter").

Best,
Erick

On Fri, Mar 16, 2018 at 8:35 AM, Emir Arnautović
 wrote:
> Hi Roel,
> As mentioned, _text_ field probably does not contain complete “EO.1954.53.1” 
> but only its parts. You can verify that using the analysis screen in the admin 
> console. What you can try is searching for phrase without wildcard 
> “EO.1954.53” or if you are using WordDelimiterTokenFilter in your analysis 
> chain, you can set preserveOriginal=“1” and reindex.
>
> Can you share how your text_general looks like.
>
> HTH,
> Emir
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>
>
>
>> On 16 Mar 2018, at 14:05, Paesen Roel  wrote:
>>
>> Hi,
>>
>> Unfortunately that also gives no results (and it would not be practical, as 
>> for this example the numbering only goes up till 19 but others go up into 
>> the thousands etc)
>>
>> Anybody with a pointer on this?
>>
>> Thanks already,
>> Roel
>>
>>
>> -Original Message-
>> From: jagdish vasani [mailto:jagdisht.vas...@gmail.com]
>> Sent: vrijdag 16 maart 2018 12:41
>> To: solr-user@lucene.apache.org
>> Subject: Re: question regarding wildcard-searches
>>
>> Hi paesen,
>>
>> Value - EO.1954.53.1 is indexed as below Eo
>> 1954
>> 53
>> 1
>> Dot is removed. Try with wildcard '?',
>> like EO.1954.53.?? if you have only 2 digits at the end.
>>
>> I have not tried but you just check it.
>> Hope it will solve your problem.
>>
>> Thanks,
>> Jagdish
>> On 16-Mar-2018 3:51 pm, "Paesen Roel"  wrote:
>>
>>> Hi everybody,
>>>
>>> We are experimenting with solr, and I have a (I think) basic-level
>>> question:
>>> we have a multiple fields, all copied into a generic field so we can
>>> search everything at once.
>>> However we have a (for us) strange situation doing wildcard searches
>>> for the contents of one specific field.
>>>
>>> Given in the schema:
>>>
>>> >> multiValued="true"/>
>>>
>>> >> stored="true"/>
>>> 
>>> and lot of other fields exactly like 'genormaliseerdInventarisnummer'.
>>>
>>>
>>> Now, we are certain that the field 'genormaliseerdInventarisnummer'
>>> contains entries like 'EO.1954.53.1', 'EO.1954.53.2', EO.1954.53.3',
>>> all the way up to '.19', we can query these directly by passing these
>>> exact texts to the query on field '_text_' (our default search field).
>>> Problem is: wildcard searches for these don't work, like 'EO.1954.53.*'
>>> for example returns zero results.
>>>
>>> Why is that?
>>> What needs to be adjusted? (and how?)
>>>
>>> Thanks already,
>>> Roel
>>>
>>>
>


Re: In Place Updates not work as expected

2018-03-16 Thread Emir Arnautović
Hi,
That’s how you build regular document. Incremental/atomic updates need to use 
update commands. 
Did not check latest Solrj, so maybe there is built in way of doing that, but 
quick googling showed how it can be achieved:

 SolrInputDocument doc2 = new SolrInputDocument();
doc2.addField("id", "123");   // uniqueKey of the doc to update (value illustrative)
Map<String, Object> fpValue2 = new HashMap<>();
fpValue2.put("add", "fp2");   // atomic-update "add" command
doc2.setField("FACTURES_PRODUIT", fpValue2);
solrClient.add(collectionName, doc2);
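Note that the snippet above is a regular atomic update. For a true in-place
update the ref guide requires the target field to be single-valued,
non-indexed, non-stored, with docValues enabled, and only the set/inc commands
qualify; a sketch with an illustrative field:

<field name="FACTURES_COUNT" type="pint" indexed="false" stored="false" docValues="true"/>

and in SolrJ (java.util.Collections):

SolrInputDocument doc = new SolrInputDocument();
doc.addField("id", "123");
doc.addField("FACTURES_COUNT", Collections.singletonMap("inc", 1));
solrClient.add(collectionName, doc);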
HTH,
Emir 
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 16 Mar 2018, at 15:36, mganeshs  wrote:
> 
> Hi Emir,
> 
> It's normal setfield and addDocument
> 
> for ex.
> in a for loop 
>   solrInputDocument.setField(sFieldId, fieldValue);
> and after this, we add the created document.
>   solrClient.add(collectionName, solrInputDocuments);
> 
> I just want to know whether, we need to do something specific for in-place
> updates ? 
> 
> Kindly let me know,
> 
> Regards,
> 
> 
> 
> 
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html



Re: question regarding wildcard-searches

2018-03-16 Thread Emir Arnautović
Hi Roel,
As mentioned, _text_ field probably does not contain complete “EO.1954.53.1” 
but only its parts. You can verify that using the analysis screen in the admin console. 
What you can try is searching for phrase without wildcard “EO.1954.53” or if 
you are using WordDelimiterTokenFilter in your analysis chain, you can set 
preserveOriginal=“1” and reindex.

Can you share how your text_general looks like.

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 16 Mar 2018, at 14:05, Paesen Roel  wrote:
> 
> Hi,
> 
> Unfortunately that also gives no results (and it would not be practical, as 
> for this example the numbering only goes up till 19 but others go up into the 
> thousands etc)
> 
> Anybody with a pointer on this?
> 
> Thanks already,
> Roel
> 
> 
> -Original Message-
> From: jagdish vasani [mailto:jagdisht.vas...@gmail.com] 
> Sent: vrijdag 16 maart 2018 12:41
> To: solr-user@lucene.apache.org
> Subject: Re: question regarding wildcard-searches
> 
> Hi paesen,
> 
> Value - EO.1954.53.1 is indexed as below Eo
> 1954
> 53
> 1
> Dot is removed. Try with wildcard '?',
> like EO.1954.53.?? if you have only 2 digits at the end.
> 
> I have not tried but you just check it.
> Hope it will solve your problem.
> 
> Thanks,
> Jagdish
> On 16-Mar-2018 3:51 pm, "Paesen Roel"  wrote:
> 
>> Hi everybody,
>> 
>> We are experimenting with solr, and I have a (I think) basic-level
>> question:
>> we have a multiple fields, all copied into a generic field so we can 
>> search everything at once.
>> However we have a (for us) strange situation doing wildcard searches 
>> for the contents of one specific field.
>> 
>> Given in the schema:
>> 
>> > multiValued="true"/>
>> 
>> > stored="true"/>
>>  
>> and lot of other fields exactly like 'genormaliseerdInventarisnummer'.
>> 
>> 
>> Now, we are certain that the field 'genormaliseerdInventarisnummer'
>> contains entries like 'EO.1954.53.1', 'EO.1954.53.2', EO.1954.53.3', 
>> all the way up to '.19', we can query these directly by passing these 
>> exact texts to the query on field '_text_' (our default search field).
>> Problem is: wildcard searches for these don't work, like 'EO.1954.53.*'
>> for example returns zero results.
>> 
>> Why is that?
>> What needs to be adjusted? (and how?)
>> 
>> Thanks already,
>> Roel
>> 
>> 



solr equivalent for elasticsearch 'terminate_after' param

2018-03-16 Thread Martin Buechler


Hi,

In order to decide if any search result exists for a given query, you 
can do this in ES efficiently using 'size=0&terminate_after=1',


see 
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-body.html#_fast_check_for_any_matching_docs


Is there an equivalent for the terminate_after param in solr/lucene?

--
Martin






Re: In Place Updates not work as expected

2018-03-16 Thread mganeshs
Hi Emir,

It's normal setfield and addDocument

for ex.
in a for loop 
   solrInputDocument.setField(sFieldId, fieldValue);
and after this, we add the created document.
   solrClient.add(collectionName, solrInputDocuments);

I just want to know whether, we need to do something specific for in-place
updates ? 

Kindly let me know,

Regards,




--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Recommendations for non-narrative data

2018-03-16 Thread Christopher Schultz
All,

I'm using Solr to index and search a database of user data (username,
email, first and last name), so there aren't really "terms" in the data
to search for, like you might search for words that describe products in
a catalog, for example.

I have set up my schema to include plain-old text fields for each of the
data mentioned above, plus I have a copy-field called "all" which
includes everything all together, plus I have a first + last field which
uses a phonetic index and query analyzer.

Since I don't need things such as term-replacement (spanner == wrench),
stemming (first name 'chris' -> 'chri'), and possibly other features
that I don't know about, I'm wondering what might be a recommended set
of tokenizer(s), analyzer(s), etc. for such data.

We will definitely want to be able to search by substring (to find
'cschultz' as a username with 'schultz' as input) but some substrings
are probably useless (such as @gmail.com for email addresses) and don't
need to be supported.

What are some good options to look at for this type of data?

In production, we have fewer than 5M records to handle, so this is more
of an academic exercise than an actual performance requirement (since
Solr is at least an order of magnitude faster than our current
RDBMS-searching implementation).

If it makes any difference, we are trying to keep the index up-to-date
with all user changes made in real time (okay, maybe delayed by a few
seconds, but basically realtime). We have a few hundred new-user
registrations per day and probably half as many changes to user records
as that, so perhaps 2 document-updates per minute on average (during ~12
business hours in the US on weekdays).

Thanks for any advice anyone may have,
-chris





Re: Some performance questions....

2018-03-16 Thread Deepak Goel
On Fri, Mar 16, 2018 at 6:03 PM, Shawn Heisey  wrote:

> On 3/15/2018 6:34 AM, BlackIce wrote:
>
>> However the main app that will be
>> running is more or less a single threated app which takes advantage when
>> run under several instances, ie: parallelism, so I thought, since I'm at
>> it
>> I may give solr a few instances as well
>>
>
> ***Deepak***

I did a performance study of Solr a while back, and I found that it does
not scale beyond a particular point on a single machine (which could be due
to the way it's coded). Hence multiple instances might make sense.

https://docs.google.com/document/d/1kUqEcZl3NhOo6SLklo5Icg3fMnn9OtLY_lwnc6wbXus/edit?usp=sharing

***Deepak***



> Solr is a fully threaded app, capable of doing LOTS of things at the same
> time, without multiple instances.
>
> Thnx for the Heap pointer.. I've read, from some Professor.. that Solr
>> actually is more efficient with a very small Heap and to have everything
>> mapped to virtual memory... Which brings me to the next question.. is the
>> Virtual memory mapping done by the OS or Solr? Does the Virtual memory
>> reside on the OS HDD? Or on the Solr HDD?.. and if the Virtual memory
>> mapping is done on the OS HDD, wouldn't it be beneficial to run the OS off
>> a SSD?
>>
>
> ***Deepak***
If you have a small amount of RAM (I am assuming that is what you mean by a
small heap), then the OS will do swapping or demand paging to manage your
memory requirements. SSD will help. However, it might be better to have more
RAM than to rely on SSD.
***Deepak***

> There appears to be some confusion here.
>
> The virtual memory doesn't reside on ANY hard drive, unless you've REALLY
> configured the system badly and the system starts using swap space.  If the
> system starts using swap, performance is going to be terrible, no matter
> how fast the disk where swap resides is.
>
> The "mapping to virtual memory" feature is something the operating system
> does.  Lucene/Solr utilizes MMAP code in Java, which then turns around and
> uses MMAP functionality provided by the OS.
>
> At that point, that file can be accessed by the application as if it were
> a very large block of memory.  Mapping the file doesn't immediately use any
> memory at all.  The OS manages the access to the file.  If the part of the
> file that is being accessed has not been accessed before, then the OS will
> read the data off the disk, place it into the OS disk cache, and provide it
> to whatever requested it.  If it has been accessed before and is still in
> the disk cache, then it won't read the disk, it will just provide the data
> from the cache.  Getting most data from cache is *required* for good Solr
> performance.
>
> http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
>
> Running with your indexes on SSD might indeed help performance, and
> regardless of anything that's going on, WILL help performance in the short
> term, when you first turn the machine on.  But if it also helps with
> long-term query performance, then chances are that the machine doesn't have
> enough memory. When Solr servers are sized correctly, running on SSD is
> typically not going to make a big difference, unless the machine does a lot
> more indexing than querying.
>
> For now.. my FEELING is to run one Solr instance on this particular
>> machine.. by the time the RAM is outgrown add another machine and so
>> forth...
>>
>
> Any plans you have for a growth strategy with multiple Solr instances are
> extremely likely to still be possible with only one instance, with very
> little change.
>
> Thanks,
> Shawn







Deepak
"Please stop cruelty to Animals, help by becoming a Vegan"
+91 73500 12833
deic...@gmail.com

Facebook: https://www.facebook.com/deicool
LinkedIn: www.linkedin.com/in/deicool

"Plant a Tree, Go Green"


Re: Some performance questions....

2018-03-16 Thread Deepak Goel
>I think there is no benefit in having multiple Solr instances on a single
>server, unless the heap memory required by the JVM is too big.
***Deepak***
I would try multiple Solr instances rather than a single Solr instance (it
definitely will give a performance boost).
***Deepak***
>And remember that this has relatively to do with the index size ( inverted
>index is memory mapped OFF heap and docValues as well).
>On the other hand of course Apache Solr uses plenty of JVM heap memory as
>well ( caches, temporary data structures during indexing, ect ect)

> Deepak:
>
> Well its kinda a given that when running ANYTHING under a VM you have an
> overhead..

>***Deepak***
>You mean you are assuming without any facts (performance benchmarks with and
>without a VM)?
>***Deepak***
>I think Shawn detailed this quite extensively. I am no sysadmin or OS
>expert, but there is no need for benchmarks and I don't even understand your
>doubts.
>In information technology, any time you add additional layers of software you
>need adapters, which means additional instructions executed.
>It is obvious that having:
>metal -> OS -> APP is cheaper instruction-wise than
>metal -> OS -> VM -> APP
>The APP will execute instructions in the VM, which will be responsible for
>translating those instructions for the underlying OS.
***Deepak***
I had past experience with VMs. They absolutely do not add any overhead.
Since we have conflicting opinions, it is best to benchmark it yourself.
***Deepak***
>Going direct, you skip one step.
>You can think about this when you emulate a different OS: is it cheaper to
>run Windows on a machine directly to execute Windows applications, or to run
>a Windows VM on top of another OS to execute Windows applications?









Deepak
"Please stop cruelty to Animals, help by becoming a Vegan"
+91 73500 12833
deic...@gmail.com

Facebook: https://www.facebook.com/deicool
LinkedIn: www.linkedin.com/in/deicool

"Plant a Tree, Go Green"

On Thu, Mar 15, 2018 at 9:43 PM, Alessandro Benedetti 
wrote:

> *Single Solr Instance VS Multiple Solr instances on Single Server
> *
>
> I think there is no benefit in having multiple Solr instances on a single
> server, unless the heap memory required by the JVM is too big.
> And remember that this has relatively to do with the index size ( inverted
> index is memory mapped OFF heap and docValues as well).
> On the other hand of course Apache Solr uses plenty of JVM heap memory as
> well ( caches, temporary data structures during indexing, ect ect)
>
> > Deepak:
> >
> > Well its kinda a given that when running ANYTHING under a VM you have an
> > overhead..
>
> ***Deepak***
> You mean you are assuming without any facts (performance benchmarks with and
> without a VM)?
> ***Deepak***
> I think Shawn detailed this quite extensively. I am no sysadmin or OS
> expert, but there is no need for benchmarks and I don't even understand your
> doubts.
> In information technology, any time you add additional layers of software you
> need adapters, which means additional instructions executed.
> It is obvious that having:
> metal -> OS -> APP is cheaper instruction-wise than
> metal -> OS -> VM -> APP
> The APP will execute instructions in the VM, which will be responsible for
> translating those instructions for the underlying OS.
> Going direct, you skip one step.
> You can think about this when you emulate a different OS: is it cheaper to
> run Windows on a machine directly to execute Windows applications, or to run
> a Windows VM on top of another OS to execute Windows applications?
>
>
>
> -
> ---
> Alessandro Benedetti
> Search Consultant, R&D Software Engineer, Director
> Sease Ltd. - www.sease.io
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>


Re: statistics in hitlist

2018-03-16 Thread John Smith
Thanks for the link to the documentation; that will probably come in useful.

I didn't see a way, though, to get my avg function working. So instead of
doing a linear regression on two fields, X and Y, in a hitlist, we need to
do a linear regression on field X and the average value of X. Is that
possible? Can a function be passed to the regress function instead of a field?
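
One guess I had was to compute the mean inside the let() and bind it to a
variable, assuming the mean() evaluator from the docs below exists in our
version, e.g.:

let(a=search(tx_prod_production, q="oil_first_90_days_production:[1 TO *]",
             fq="isParent:true", rows="150",
             fl="id,oil_first_90_days_production", sort="id asc"),
    b=col(a, oil_first_90_days_production),
    m=mean(b))

But even then I would not know how to expand that single value m into a
column that regress() will accept.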





On Thu, Mar 15, 2018 at 10:41 PM, Joel Bernstein  wrote:

> I've been working on the user guide for the math expressions. Here is the
> page on regression:
>
> https://github.com/joel-bernstein/lucene-solr/blob/math_expressions_
> documentation/solr/solr-ref-guide/src/regression.adoc
>
> This page is part of the larger math expression documentation. The TOC is
> here:
>
> https://github.com/joel-bernstein/lucene-solr/blob/math_expressions_
> documentation/solr/solr-ref-guide/src/math-expressions.adoc
>
> The docs are still very rough but you can get an idea of the coverage.
>
>
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Thu, Mar 15, 2018 at 10:26 PM, Joel Bernstein 
> wrote:
>
> > If you want to get everything in query you can do this:
> >
> > let(echo="d,e",
> >  a=search(tx_prod_production, q="oil_first_90_days_production:[1 TO
> > *]",
> > fq="isParent:true", rows="150",
> > fl="id,oil_first_90_days_production,oil_last_30_days_production",
> sort="id
> > asc"),
> >  b=col(a, oil_first_90_days_production),
> >  c=col(a, oil_last_30_days_production),
> >  d=regress(b, c),
> >  e=someExpression())
> >
> > The echo parameter tells the let expression which variables to output.
> >
> >
> >
> > Joel Bernstein
> > http://joelsolr.blogspot.com/
> >
> > On Thu, Mar 15, 2018 at 3:13 PM, Erick Erickson  >
> > wrote:
> >
> >> What does the fq clause look like?
> >>
> >> On Thu, Mar 15, 2018 at 11:51 AM, John Smith 
> >> wrote:
> >> > Hi Joel, I did some more work on this statistics stuff today. Yes, we
> do
> >> > have nulls in our data; the document contains many fields, we don't
> >> always
> >> > have values for each field, but we can't set the nulls to 0 either (or
> >> any
> >> > other value, really) as that will mess up other calculations (such as
> >> when
> >> > calculating average etc); we would normally just ignore fields with
> null
> >> > values when calculating stats manually ourselves.
> >> >
> >> > Adding a check in the "q" parameter to ensure that the fields used in
> >> the
> >> > calculations are > 0 does work now. Thanks for the tip (and sorry,
> >> should
> >> > have caught that myself). But I am unable to use "fq" for these
> checks,
> >> > they have to be added to the q instead. Adding fq's doesn't have any
> >> effect.
> >> >
> >> >
> >> > Anyway, I'm trying to change this up a little. This is what I'm
> >> currently
> >> > using (switched from "random" to "search" since I actually need the
> full
> >> > hitlist not just a random subset):
> >> >
> >> > let(a=search(tx_prod_production, q="oil_first_90_days_production:[1
> TO
> >> *]",
> >> > fq="isParent:true", rows="150",
> >> > fl="id,oil_first_90_days_production,oil_last_30_days_production",
> >> sort="id
> >> > asc"),
> >> >  b=col(a, oil_first_90_days_production),
> >> >  c=col(a, oil_last_30_days_production),
> >> >  d=regress(b, c))
> >> >
> >> > So I have 2 fields there defined, that works great (in terms of a test
> >> and
> >> > running the query); but I need to replace the second field,
> >> > "oil_last_30_days_production" with the avg value in
> >> > oil_first_90_days_production.
> >> >
> >> > I can get the avg with this expression:
> >> > stats(tx_prod_production, q="oil_first_90_days_production:[1 TO *]",
> >> > fq="isParent:true", rows="150", avg(oil_first_90_days_production))
> >> >
> >> > But I don't know how to push that avg value into the first streaming
> >> > expression; guessing I have to set "c=" but that is where I'm
> >> getting
> >> > lost, since avg only returns 1 value and the first parameter, "b",
> >> returns
> >> > a list of sorts. Somehow I have to get the avg value stuffed inside a
> >> > "col", where it is the same value for every row in the hitlist...?
> >> >
> >> > Thanks for your help!
> >> >
> >> >
> >> > On Mon, Mar 5, 2018 at 10:50 PM, Joel Bernstein 
> >> wrote:
> >> >
> >> >> I suspect you've got nulls in your data. I just tested with null
> >> values and
> >> >> got the same error. For testing purposes try loading the data with
> >> default
> >> >> values of zero.
> >> >>
> >> >>
> >> >> Joel Bernstein
> >> >> http://joelsolr.blogspot.com/
> >> >>
> >> >> On Mon, Mar 5, 2018 at 10:12 PM, Joel Bernstein 
> >> >> wrote:
> >> >>
> >> >> > Let's break the expression down and build it up slowly. Let's start
> >> with:
> >> >> >
> >> >> > let(echo="true",
> >> >> >  a=random(tx_prod_production, q="*:*", fq="isParent:true",
> >> rows="15",
> 

RE: question regarding wildcard-searches

2018-03-16 Thread Paesen Roel
Hi,

Unfortunately that also gives no results (and it would not be practical, as for
this example the numbering only goes up to 19, but others go up into the
thousands, etc.)

Anybody with a pointer on this?

Thanks already,
Roel


-Original Message-
From: jagdish vasani [mailto:jagdisht.vas...@gmail.com] 
Sent: vrijdag 16 maart 2018 12:41
To: solr-user@lucene.apache.org
Subject: Re: question regarding wildcard-searches

Hi paesen,

Value EO.1954.53.1 is indexed as the separate tokens below:
Eo
1954
53
1
The dots are removed. Try the single-character wildcard '?' instead, e.g.
EO.1954.53.?? if the last part only ever has two digits.

I have not tried it, but do check it.
Hope it solves your problem.

Thanks,
Jagdish
On 16-Mar-2018 3:51 pm, "Paesen Roel"  wrote:

> Hi everybody,
>
> We are experimenting with solr, and I have a (I think) basic-level
> question:
> we have multiple fields, all copied into a generic field so we can 
> search everything at once.
> However we have a (for us) strange situation doing wildcard searches 
> for the contents of one specific field.
>
> Given in the schema:
>
>  multiValued="true"/>
>
>  stored="true"/>
>  
> and lot of other fields exactly like 'genormaliseerdInventarisnummer'.
>
>
> Now, we are certain that the field 'genormaliseerdInventarisnummer'
> contains entries like 'EO.1954.53.1', 'EO.1954.53.2', EO.1954.53.3', 
> all the way up to '.19', we can query these directly by passing these 
> exact texts to the query on field '_text_' (our default search field).
> Problem is: wildcard searches for these don't work, like 'EO.1954.53.*'
> for example returns zero results.
>
> Why is that?
> What needs to be adjusted? (and how?)
>
> Thanks already,
> Roel
>
>


Re: Solr on DC/OS ?

2018-03-16 Thread Søren

Thanks a lot guys. Now we know where to start.

Best
    Soren

On 15-03-2018 09:27, Hendrik Haddorp wrote:

Hi,

we are running Solr on Marathon/Mesos, which should basically be the 
same as DC/OS. Solr and ZooKeeper are running in docker containers. I 
wrote my own Mesos framework that handles the assignment to the 
agents. There is a public sample that does the same for ElasticSearch. 
I'm not aware of a public Solr Mesos framework. The only "mediation" 
that happens here is that Solr runs in a docker container with a 
memory limit. If you give it enough resources it should be pretty 
close to running straight on the machine. JVM memory tuning and docker
are, however, not the most fun.


regards,
Hendrik

On 15.03.2018 00:09, Rick Leir wrote:

Søren,
DC/OS installs on top of Ubuntu or RedHat, and it is used to 
coordinate many machines so they appear as a cluster.


Solr needs to be on a single machine, or in the case of SolrCloud, on 
many machines. It has no need of the coordination which DC/OS 
provides. Solr depends on direct access to lots of memory, and if any 
coordination layer attempts to mediate access to the memory then Solr 
would slow down. I recommend you install Solr directly on Ubuntu or 
Redhat or Windows Server (Disclosure: I know very little about DC/OS)

Cheers -- Rick


On March 14, 2018 6:19:22 AM EDT, "Søren"  wrote:

Hi, has anyone experience in running solr on DC/OS?

If so, how is that achieved succesfully? Solr is not in Universe.

Thanks in advance,
Soren






Re: Solr document routing using composite key

2018-03-16 Thread Shawn Heisey

On 3/6/2018 11:53 AM, Nawab Zada Asad Iqbal wrote:

I have 117 shards and I tried to use document ids from zero to 116. I find
that the distribution is very uneven, e.g., the largest bucket receives a
total of 5 documents, and around 38 shards will be empty. Is that expected?


With such a small data set, this fits what I would expect.

Choosing buckets by hashing (which is what compositeId does) is not 
perfect, but if you send it thousands or millions of documents, it will 
be *generally* balanced.
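
For illustration, with the compositeId router everything before a '!' in the
id is hashed to pick the shard, so documents sharing a prefix land together
(a SolrJ sketch with made-up ids):

    // both documents hash the "user117" prefix, so they go to the same shard
    doc1.setField("id", "user117!order1");
    doc2.setField("id", "user117!order2");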


Thanks,
Shawn



Re: Some performance questions....

2018-03-16 Thread Shawn Heisey

On 3/15/2018 6:34 AM, BlackIce wrote:

However the main app that will be
running is more or less a single threaded app which takes advantage when
run under several instances, ie: parallelism, so I thought, since I'm at it
I may give solr a few instances as well


Solr is a fully threaded app, capable of doing LOTS of things at the 
same time, without multiple instances.



Thnx for the Heap pointer.. I've read, from some Professor.. that Solr
actually is more efficient with a very small Heap and to have everything
mapped to virtual memory... Which brings me to the next question.. is the
Virtual memory mapping done by the OS or Solr? Does the Virtual memory
reside on the OS HDD? Or on the Solr HDD?.. and if the Virtual memory
mapping is done on the OS HDD, wouldn't it be beneficial to run the OS off
a SSD?


There appears to be some confusion here.

The virtual memory doesn't reside on ANY hard drive, unless you've 
REALLY configured the system badly and the system starts using swap 
space.  If the system starts using swap, performance is going to be 
terrible, no matter how fast the disk where swap resides is.


The "mapping to virtual memory" feature is something the operating 
system does.  Lucene/Solr utilizes MMAP code in Java, which then turns 
around and uses MMAP functionality provided by the OS.


At that point, that file can be accessed by the application as if it 
were a very large block of memory.  Mapping the file doesn't immediately 
use any memory at all.  The OS manages the access to the file.  If the 
part of the file that is being accessed has not been accessed before, 
then the OS will read the data off the disk, place it into the OS disk 
cache, and provide it to whatever requested it.  If it has been accessed 
before and is still in the disk cache, then it won't read the disk, it 
will just provide the data from the cache.  Getting most data from cache 
is *required* for good Solr performance.


http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
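
For the curious, the Lucene API involved looks roughly like this (a minimal
sketch; the index path is assumed):

    import java.nio.file.Paths;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.MMapDirectory;

    // MMapDirectory maps the index files into virtual address space; the
    // OS page cache then serves repeated reads without touching the disk.
    Directory dir = new MMapDirectory(Paths.get("/var/solr/data/mycore/data/index"));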

Running with your indexes on SSD might indeed help performance, and 
regardless of anything that's going on, WILL help performance in the 
short term, when you first turn the machine on.  But if it also helps 
with long-term query performance, then chances are that the machine 
doesn't have enough memory. When Solr servers are sized correctly, 
running on SSD is typically not going to make a big difference, unless 
the machine does a lot more indexing than querying.



For now.. my FEELING is to run one Solr instance on this particular
machine.. by the time the RAM is outgrown add another machine and so
forth...


Any plans you have for a growth strategy with multiple Solr instances 
are extremely likely to still be possible with only one instance, with 
very little change.


Thanks,
Shawn



Re: Remove Replacement character "�" from the search results

2018-03-16 Thread uttamdhakal
Erick Erickson wrote
> This is more likely a problem with your browser's character set, try
> setting it to UTF-8.

The problem is not with my browser's character set. Anyway, I want to
remove/replace certain characters from the search results.
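
What I had in mind is cleaning the value before it is stored, for example
with an update processor chain (a sketch; the "content" field name is made
up, and \uFFFD is the replacement character):

    <updateRequestProcessorChain name="strip-replacement-char">
      <processor class="solr.RegexReplaceProcessorFactory">
        <str name="fieldName">content</str>
        <str name="pattern">\uFFFD</str>
        <str name="replacement"></str>
      </processor>
      <processor class="solr.RunUpdateProcessorFactory"/>
    </updateRequestProcessorChain>

An analysis-chain char filter would only change the indexed terms, not the
stored value that comes back in results.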



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


RE: SpellCheck Reload

2018-03-16 Thread Sadiki Latty
Thanks Alessandro, I'll give this a try next time. I ended up deleting the 
spell folder after trying the reload option without success. Next time I will 
try the reload then build method you suggested.

Thanks again for the info.

-Original Message-
From: Alessandro Benedetti [mailto:a.benede...@sease.io] 
Sent: March-15-18 1:34 PM
To: solr-user@lucene.apache.org
Subject: RE: SpellCheck Reload

Hi Sadiki,
the kind of spellchecker you are using builds an auxiliary Lucene index as a
support data structure.
That is going to be used to provide the spellcheck suggestions.

"My question is, does "reloading the dictionary" mean completely erasing the 
current dictionary and starting from scratch (which is what I want)? "

What you want is re-build the spellchecker.
In the case of the IndexBasedSpellChecker, the index is used to build the
dictionary.
When the spellchecker is initialized a reader is opened from the latest index 
version available.

If in the meantime your index has changed and commits have happened, just
building the spellchecker *should* use the old reader:

@Override
public void build(SolrCore core, SolrIndexSearcher searcher) throws IOException {
  IndexReader reader = null;
  if (sourceLocation == null) {
    // Load from Solr's index
    reader = searcher.getIndexReader();
  } else {
    // Load from Lucene index at given sourceLocation
    reader = this.reader;
  }

This means your dictionary is not going to see any substantial changes.

So what you need to do is :

1) reload the spellchecker -> which will initialise again the source for the 
dictionary to the latest index commit
2) re-build the dictionary
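
For example (a sketch; the core name and the "/spell" handler are assumed):

curl 'http://localhost:8983/solr/admin/cores?action=RELOAD&core=mycore'
curl 'http://localhost:8983/solr/mycore/spell?q=test&spellcheck=true&spellcheck.build=true'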



Cheers







-
---
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Using multi valued field in solr cloud Graph Traversal Query

2018-03-16 Thread Jan Høydahl
> Adding multi-value field support is a fairly high priority so I would
> expect this to be coming in a future release.

I got this question from a client of mine as well. Trying to find a JIRA issue 
for multi value support, is there one?

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

> 11. mar. 2017 kl. 03:21 skrev Joel Bernstein :
> 
> Currently gatherNodes only works on single value fields. You can seed a
> gatherNodes with a facet() expression which works with multi-value fields,
> but after that it only works with single value fields.
> 
> So you would have to index the data as a graph like this:
> 
> id, concept1, participant1
> id, concept1, participant2
> id, concept2, participant1
> id, concept2, participant3
> id, concept3, participant2
> 
> 
> Then you walk the graph like this:
> 
> gatherNodes(mydata,
>             gatherNodes(mydata, walk="concept1->conceptID",
>                         gather="participantID"),
>             walk="node->participantID",
>             gather="conceptID")
> 
> This is a two step graph expression:
> 1) Gathers all the participantID's where concept1 is in the conceptID
> field.
> 2) Gathers all the conceptID's for the participantID's gathered in step 1.
> 
> Let me know if you have other questions about how to structure the data or
> run the queries.
> 
> 
> 
> 
> 
> 
> 
> 
> Adding multi-value field support is a fairly high priority so I would
> expect this to be coming in a future release.
> 
> 
> 
> 
> Joel Bernstein
> http://joelsolr.blogspot.com/
> 
> On Fri, Mar 10, 2017 at 5:15 PM, Pratik Patel  wrote:
> 
>> I am trying to do a graph traversal query using the gatherNodes function. I
>> am seeding it with a streaming expression to get some documents, and then I
>> am trying to map their ids (conceptid) to a multi-valued field
>> "participantIds" and gather nodes.
>> 
>> Here is the query I am doing.
>> 
>> 
>> gatherNodes(collection1,
>>> search(collection1,q="*:*",fl="conceptid",sort="conceptid
>>> asc",fq=storeid:"524efcfd505637004b1f6f24",fq=tags:"Project"),
>>> walk=conceptid->participantIds,
>>> gather="conceptid")
>> 
>> 
>> The field participantIds is a multi valued field. This is the field which
>> holds connections between the documents. When I execute this query, I get
>> exception as below.
>> 
>> 
>> { "result-set": { "docs": [ { "EXCEPTION":
>> "java.util.concurrent.ExecutionException: java.lang.RuntimeException:
>> java.io.IOException: java.util.concurrent.ExecutionException:
>> java.io.IOException: -->
>> http://169.254.40.158:8081/solr/collection1_shard1_replica1/:can not sort
>> on multivalued field: participantIds", "EOF": true, "RESPONSE_TIME": 15 } ]
>> } }
>> 
>> 
>> Does this mean you can not look into multivalued fields in graph traversal
>> query? In our solr index, we have documents having "conceptid" field which
>> is id and we have participantIds which is a multivalued field storing
>> connections of that document to other documents. I believe we need to have
>> one field in the document which stores the connections of that document so
>> that graph traversal is possible. If not, what is the other way to index
>> graph data and use graph traversal? I am trying to explore graph traversal
>> and am new to it. Any help would be appreciated.
>> 
>> Thanks,
>> Pratik
>> 



Re: question regarding wildcard-searches

2018-03-16 Thread jagdish vasani
Hi paesen,

Value EO.1954.53.1 is indexed as the separate tokens below:
Eo
1954
53
1
The dots are removed. Try the single-character wildcard '?' instead, e.g.
EO.1954.53.?? if the last part only ever has two digits.

I have not tried it, but do check it.
Hope it solves your problem.

Thanks,
Jagdish
On 16-Mar-2018 3:51 pm, "Paesen Roel"  wrote:

> Hi everybody,
>
> We are experimenting with solr, and I have a (I think) basic-level
> question:
> we have multiple fields, all copied into a generic field so we can
> search everything at once.
> However we have a (for us) strange situation doing wildcard searches for
> the contents of one specific field.
>
> Given in the schema:
>
>  multiValued="true"/>
>
>  stored="true"/>
> 
> and lot of other fields exactly like 'genormaliseerdInventarisnummer'.
>
>
> Now, we are certain that the field 'genormaliseerdInventarisnummer'
> contains entries like 'EO.1954.53.1', 'EO.1954.53.2', EO.1954.53.3', all
> the way up to '.19', we can query these directly by passing these exact
> texts to the query on field '_text_' (our default search field).
> Problem is: wildcard searches for these don't work, like 'EO.1954.53.*'
> for example returns zero results.
>
> Why is that?
> What needs to be adjusted? (and how?)
>
> Thanks already,
> Roel
>
>


Re: LTR - OriginalScore query issue

2018-03-16 Thread ilayaraja
Yes, I have tried that too:

But it was throwing an error during feature extraction:
 "Exception from createWeight for SolrFeature [name=originalLuceneScore,
params={q={!dismax qf=tem_type_all^30.0 ..}${user_query}}] Failed to
parse feature query.
at
org.apache.solr.ltr.LTRScoringQuery.createWeights(LTRScoringQuery.java:241)
at
org.apache.solr.ltr.LTRScoringQuery.createWeight(LTRScoringQuery.java:208)
at
org.apache.solr.ltr.response.transform.LTRFeatureLoggerTransformerFactory$FeatureTransformer.setContext(LTRFeatureLoggerTransformerFactory.java:245)
at
org.apache.solr.response.transform.DocTransformers.setContext(DocTransformers.java:69)
at org.apache.solr.response.DocsStreamer.<init>(DocsStreamer.java:89)
at
org.apache.solr.response.ResultContext.getProcessedDocuments(ResultContext.java:55)
at
org.apache.solr.response.TextResponseWriter.writeDocuments(TextResponseWriter.java:260)



-
--Ilay
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: LTR - OriginalScore query issue

2018-03-16 Thread Alessandro Benedetti
I understood your requirement,
the SolrFeature feature type should be quite flexible.
Have you tried:

{
  name: "overallEdismaxScore",
  class: "org.apache.solr.ltr.feature.SolrFeature",
  params: {
    q: "{!dismax qf=item_typel^3.0 brand^2.0 title^5.0}${user_query}"
  },
  store: "myFeatureStoreDemo"
}

Cheers



-
---
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


question regarding wildcard-searches

2018-03-16 Thread Paesen Roel
Hi everybody,

We are experimenting with solr, and I have a (I think) basic-level question:
we have multiple fields, all copied into a generic field so we can search 
everything at once.
However we have a (for us) strange situation doing wildcard searches for the 
contents of one specific field.

Given in the schema:





and lot of other fields exactly like 'genormaliseerdInventarisnummer'.


Now, we are certain that the field 'genormaliseerdInventarisnummer' contains 
entries like 'EO.1954.53.1', 'EO.1954.53.2', EO.1954.53.3', all the way up to 
'.19', we can query these directly by passing these exact texts to the query on 
field '_text_' (our default search field).
Problem is: wildcard searches for these don't work, like 'EO.1954.53.*' for 
example returns zero results.

Why is that?
What needs to be adjusted? (and how?)
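
One thing we wondered: since wildcard terms are not analyzed, a tokenizer
that splits on the dots would mean 'EO.1954.53.*' can never match a single
indexed token. Would we need an untokenized (string) copy to wildcard
against, something like this sketch (the field names are just examples)?

    <field name="inventarisnummerExact" type="string" indexed="true" stored="false"/>
    <copyField source="genormaliseerdInventarisnummer" dest="inventarisnummerExact"/>

and then query inventarisnummerExact:EO.1954.53.* (noting that string fields
are case-sensitive).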

Thanks already,
Roel



Re: Apache commons fileupload migration

2018-03-16 Thread padmanabhan1616
Yes, I read the 1.3.3 changelog. This release contains the security
vulnerability fix.

DiskFileItem can actually no longer be deserialized, *unless a system
property is set to true*. Fixes FILEUPLOAD-279.
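
As far as I can tell from the 1.3.3 release notes, the opt-in is a JVM system
property (name taken from those notes; please verify):

    -Dorg.apache.commons.fileupload.disk.DiskFileItem.serializable=true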

We don't have a security architect for my product to decide whether it is
vulnerable or not, so please kindly help us with the questions below.

My concern here is:
Are lower versions of commons-fileupload vulnerable or not? If yes, is
upgrading the jar directly in Apache Solr 5.2.1 a good idea or not?



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html