Re: _childDocuments_ automatically multivalued field type

2018-07-02 Thread jeebix
Ok, I'll have a look at the link above.

Thanks a lot...

Best
JB





Re: _childDocuments_ automatically multivalued field type

2018-07-02 Thread jeebix
Ok, I see what I have to look for, thanks to your reply. I'll adjust the
schema and see the difference.

Thanks.

Best
JB





please unsubscribe

2018-07-02 Thread Karl Hampel



Re: Managed Schemas and Version Control

2018-07-02 Thread Zimmermann, Thomas
Thanks all! I think we will maintain our current approach of hand editing
the configs in git and implement something at the shell level to automate
the process of running upconfig and performing a core reload.
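
For anyone wanting to script the same thing, a rough sketch of such a wrapper
(the ZooKeeper address, paths, and the config/collection names below are
placeholders, not taken from this thread):

    # upload the hand-edited config from the working copy to ZooKeeper
    ./server/scripts/cloud-scripts/zkcli.sh -zkhost zk1:2181 -cmd upconfig \
        -confdir ./solr-configs/myconf/conf -confname myconf
    # then reload so the running collection picks up the new config
    curl "http://localhost:8983/solr/admin/collections?action=RELOAD&name=mycollection"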



Override a single value in a Config Set

2018-07-02 Thread Zimmermann, Thomas
Hi,

We have several cores with identical configurations, the sole exception 
being the language of their document sets. I'd like to leverage Config Sets to 
manage these going forward, but ran into two issues I'm struggling to solve 
conceptually.

Sample Cores:
our_documents
our_documents_de
our_documents_es
our_documents_fr

The values I'd like to override are:

Set a default value for a field called "language" to the language of the 
core, e.g. "de" on a German core.
Override some text field analyzers to use the correct language.
Override index-specific language files like stopwords.txt.

All of our config files live in SVN and are pushed out to staging/prod envs via 
zkcli, so we want to avoid API-dependent settings on our prod servers. We 
always want our configs in SVN and don't want to rely on the API to manage 
production settings in a way that we can't change by redeploying our code.

Any thoughts on if this is feasible? Should I just stick with independent core 
configs?

Thanks,
TZ





Re: Solrcloud collection sharding and querying

2018-07-02 Thread Sushant Vengurlekar
We have two collections which are 21G and constantly growing. The index on
one of them is also 12G. I am trying to see how sharding can be employed to
improve query performance by routing documents to shards based on a
field in schema.xml. So I am trying to figure out how to split the
collections into shards based on this one field and then query them by
routing each query to a particular shard based on this field.

Thank you

On Mon, Jul 2, 2018 at 7:36 PM, Erick Erickson 
wrote:

> This seems like an "XY problem". _Why_ do you want to do this?
> Has your collection outgrown one shard and you feel you have to
> split it? Sharding should only be used when you can't host your
> entire collection on a single replica and still get adequate performance.
>
> When you do reach that point, the usual process is to just
> decide how many shards you need and let Solr do the rest
> of the work. Why do you think you need to specify how docs
> are routed based on some field?
>
> Best,
> Erick
>
> On Mon, Jul 2, 2018 at 6:06 PM, Sushant Vengurlekar
>  wrote:
> > I want to split a collection based on one field. How do I do it and then
> > query based off that.
> >
> > Ex: collection1. Field to split off col1
> >
> > Thank you
>


Block Join Child Query returns incorrect result

2018-07-02 Thread kristaclaire14
Hi,

I'm having a problem in my Solr when querying third-level child documents. I
want to retrieve parent documents that have specific third-level child
documents. The example data is:

[
  {
    "id":"1001",
    "path":"1.Project",
    "Project_Title":"Sample Project",
    "_childDocuments_":[
      {
        "id":"2001",
        "path":"2.Project.Submission",
        "Submission_No":"1234-QWE",
        "_childDocuments_":[
          {
            "id":"3001",
            "path":"3.Project.Submission.Agency",
            "Agency_Cd":"QWE"
          }
        ]
      }
    ]
  },
  {
    "id":"1002",
    "path":"1.Project",
    "Project_Title":"Test Project QWE",
    "_childDocuments_":[
      {
        "id":"2002",
        "path":"2.Project.Submission",
        "Submission_No":"4567-AGY",
        "_childDocuments_":[
          {
            "id":"3002",
            "path":"3.Project.Submission.Agency",
            "Agency_Cd":"AGY"
          }
        ]
      }
    ]
  }
]

I want to retrieve the parent with *Agency_Cd:ZXC* in a third-level child
document.
So far, this is my query:
q={!parent which="path:1.Project" v="path:3.Project.Submission.Agency AND
Agency_Cd:ZXC"}

My expected result is 0 documents, but Solr returns parents with no matching
child documents for this query. Am I doing something wrong in the query?
Thanks in advance.





Re: Solrcloud collection sharding and querying

2018-07-02 Thread Erick Erickson
This seems like an "XY problem". _Why_ do you want to do this?
Has your collection outgrown one shard and you feel you have to
split it? Sharding should only be used when you can't host your
entire collection on a single replica and still get adequate performance.

When you do reach that point, the usual process is to just
decide how many shards you need and let Solr do the rest
of the work. Why do you think you need to specify how docs
are routed based on some field?

Best,
Erick

On Mon, Jul 2, 2018 at 6:06 PM, Sushant Vengurlekar
 wrote:
> I want to split a collection based on one field. How do I do it and then
> query based off that.
>
> Ex: collection1. Field to split off col1
>
> Thank you


Re: Solr 7.1.0 - NoNode for /collections

2018-07-02 Thread Shawn Heisey
On 7/2/2018 5:57 PM, Joe Obernberger wrote:
> Just to add to this - looks like the only valid replica that is
> remaining is a TLOG type, and I suspect that is why it no longer has a
> leader.  Poop.

A replica of that type (TLOG) should be capable of becoming leader.  The
PULL replica type is the one that cannot become leader.

> On 7/2/2018 7:54 PM, Joe Obernberger wrote:
>> Hi - On startup, I'm getting the following error.  The shard had 3
>> replicas, but none are selected as the leader.  I deleted one, and
>> adding a new one back, but that had no effect, and at times the calls
>> would timeout.  I was having the same issue with another shard on the
>> same collection and deleting/re-adding a replica worked; the shard
>> now has a leader.  This one, I can't seem to get to come up.  Any ideas?

>> Caused by: org.apache.zookeeper.KeeperException$NoNodeException:
>> KeeperErrorCode = NoNode for
>> /collections/UNCLASS_30DAYS/leaders/shard6/leader

This is saying that a specific path
(/collections/UNCLASS_30DAYS/leaders/shard6/leader) does not exist in
the ZooKeeper database.  I do not know why it doesn't exist.

You may need to create the znode represented by that path.  If you use
the same zkHost option for the command as you have for Solr itself, you
can create the path with the "zk mkroot" command on the control script:

https://lucene.apache.org/solr/guide/6_6/solr-control-script-reference.html#solr-control-script-reference
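
For example (a sketch only; substitute your own ZooKeeper connection string for
the -z value):

    bin/solr zk mkroot /collections/UNCLASS_30DAYS/leaders/shard6/leader -z zk1:2181,zk2:2181,zk3:2181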

After creating that znode, it is probably a good idea to restart one of
the Solr nodes with a shard for that collection.  It is possible that it
still might not work, but it should change the error message even if it
doesn't work, and we can go from there to come up with additional ideas.

Thanks,
Shawn



Solrcloud collection sharding and querying

2018-07-02 Thread Sushant Vengurlekar
I want to split a collection based on one field. How do I do it and then
query based off that.

Ex: collection1. Field to split off col1

Thank you


Re: CDCR Custom Document Routing

2018-07-02 Thread Jay Potharaju
Solr cdcr : https://issues.apache.org/jira/browse/SOLR-12380
deletebyid: https://issues.apache.org/jira/browse/SOLR-8889

Thanks
Jay Potharaju



On Mon, Jul 2, 2018 at 5:41 PM Jay Potharaju  wrote:

> Hi Amrit,
> I am using a curl command to send a request to solr for deleting
> documents. That is because deleteById does not work for collections setup
> with implicit routing.
>
> curl http://localhost:8983/solr/test_5_replica2/update/json/ -H
> 'Content-type:application/json/docs' -d '{
> "delete": {"id":"documentid13123123"}
> }'
> The deletes don't seem to propagate correctly to the target side.
>
> Thanks
> Jay Potharaju
>
>
>
> On Mon, Jul 2, 2018 at 5:37 PM Amrit Sarkar 
> wrote:
>
>> Jay,
>>
>> Can you sample delete command you are firing at the source to understand
>> the issue with Cdcr.
>>
>> On Tue, 3 Jul 2018, 4:22 am Jay Potharaju,  wrote:
>>
>> > Hi
>> > The current cdcr setup does not work if my collection uses implicit
>> > routing.
>> > In my testing i found that adding documents works without any problems.
>> It
>> > doesn't seem to work correctly when deleting documents.
>> > Is there an alternative to cdcr that would work in cross data center
>> > scenario.
>> >
>> > Setup:
>> > 8 shards : 2 on each node
>> > Solr:6.6.4
>> >
>> > Thanks
>> > Jay Potharaju
>> >
>>
>


Re: CDCR Custom Document Routing

2018-07-02 Thread Jay Potharaju
Hi Amrit,
I am using a curl command to send a request to solr for deleting documents.
That is because deleteById does not work for collections setup with
implicit routing.

curl http://localhost:8983/solr/test_5_replica2/update/json/ -H
'Content-type:application/json/docs' -d '{
"delete": {"id":"documentid13123123"}
}'
The deletes don't seem to propagate correctly to the target side.

Thanks
Jay Potharaju



On Mon, Jul 2, 2018 at 5:37 PM Amrit Sarkar  wrote:

> Jay,
>
> Can you sample delete command you are firing at the source to understand
> the issue with Cdcr.
>
> On Tue, 3 Jul 2018, 4:22 am Jay Potharaju,  wrote:
>
> > Hi
> > The current cdcr setup does not work if my collection uses implicit
> > routing.
> > In my testing i found that adding documents works without any problems.
> It
> > doesn't seem to work correctly when deleting documents.
> > Is there an alternative to cdcr that would work in cross data center
> > scenario.
> >
> > Setup:
> > 8 shards : 2 on each node
> > Solr:6.6.4
> >
> > Thanks
> > Jay Potharaju
> >
>


Re: CDCR Custom Document Routing

2018-07-02 Thread Amrit Sarkar
Jay,

Can you share a sample of the delete command you are firing at the source, so
we can understand the issue with CDCR?

On Tue, 3 Jul 2018, 4:22 am Jay Potharaju,  wrote:

> Hi
> The current cdcr setup does not work if my collection uses implicit
> routing.
> In my testing i found that adding documents works without any problems. It
> doesn't seem to work correctly when deleting documents.
> Is there an alternative to cdcr that would work in cross data center
> scenario.
>
> Setup:
> 8 shards : 2 on each node
> Solr:6.6.4
>
> Thanks
> Jay Potharaju
>


Re: Creating single CloudSolrClient object which can be used throughout the application

2018-07-02 Thread Shawn Heisey
On 7/2/2018 7:35 AM, Ritesh Kumar wrote:
> I have got a static method which returns CloudSolrClient object if Solr is
> running in Cloud mode and HttpSolrClient object otherwise.

Declare that method as synchronized, so that multiple usages do not step
on each other's toes.  This will also eliminate object visibility issues
in multi-threaded code.  The modifiers for the method will probably end
up being "public static synchronized".

In the class where that method lives, create a "private static
SolrClient" field and set it to null.  In the method, if the class-level
field is not null, return it.  If it is null, create the HttpSolrClient
or CloudSolrClient object just as you do now, set the default collection
if that's required, then assign that client object to the class-level
field and return it.
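
As a rough sketch of what that looks like (the zkHost value, the collection
name, and the cloudMode flag below are placeholders, not something from your
setup):

    private static SolrClient client = null;

    public static synchronized SolrClient getClient() {
        if (client == null) {
            if (cloudMode) {
                // build the cloud client once and remember it
                CloudSolrClient cloud = new CloudSolrClient.Builder()
                        .withZkHost("zk1:2181,zk2:2181,zk3:2181").build();
                cloud.setDefaultCollection("mycollection");
                client = cloud;
            } else {
                client = new HttpSolrClient.Builder(
                        "http://localhost:8983/solr/mycollection").build();
            }
        }
        return client;  // every caller reuses the same instance
    }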

Remove any client.close() calls that you have currently.  You can close
the client at application shutdown, but this is not actually necessary
if application shutdown also halts the JVM.

You could also use the singleton paradigm that Erick mentioned, but
since you already have code to obtain a client object, it's probably
more straightforward to just modify that code as I have described, and
don't close the client after you use it.

Thanks,
Shawn



Re: Can't recover - HDFS

2018-07-02 Thread Shawn Heisey
On 7/2/2018 1:40 PM, Joe Obernberger wrote:
> Hi All - having this same problem again with a large index in HDFS.  A
> replica needs to recover, and it just spins retrying over and over
> again.  Any ideas?  Is there an adjustable timeout?
>
> Screenshot:
> http://lovehorsepower.com/images/SolrShot1.jpg

There is considerably more log detail available than can be seen in the
screenshot.  Can you please make your solr.log file from this server
available so we can see full error and warning log messages, and let us
know the exact Solr version that wrote the log?  You'll probably need to
use a file sharing site, and make sure the file is available until after
the problem has been examined.  Attachments sent to the mailing list are
almost always stripped.

Based on the timestamps in the screenshot, it is taking about 22 to 24
seconds to transfer 1750073344 bytes.  Which calculates to right around
the 75 MB per second rate that you were configuring in your last email
thread.  In order for that single large file to transfer successfully,
you're going to need a timeout of at least 40 seconds.  Based on what I
see, it sounds like the timeout has been set to 20 seconds.  The default
client socket timeout on replication should be about two minutes, which
would be plenty for a file of that size to transfer.
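
(As a rough check of that arithmetic: 1,750,073,344 bytes is about 1.75 GB, and
1.75 GB transferred in roughly 23 seconds works out to about 76 MB per second,
which matches the 75 MB/sec cap from the earlier thread.)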

This might be a timeout issue, but without seeing the full log and
knowing the exact version of Solr that created it, it is difficult to
know for sure where the problem might be or what can be done to fix it. 
We will need that logfile.  If there are multiple servers involved, we
may need logfiles from both ends of the replication.

Do you have any config in solrconfig.xml for the /replication handler
other than the maxWriteMBPerSec config you showed last time?

Have you configured anything (particularly a socket timeout or sotimeout
setting) to a value near 20 or 2?

Thanks,
Shawn



Re: A user defined request handler is failing to fetch the data.

2018-07-02 Thread Shawn Heisey
On 7/2/2018 12:58 AM, Adarsh_infor wrote:
> Yes, I am going to have the shards on 6 different servers, which will later be
> called in my searchHandler by specifying the shards list.  But initially I was
> testing the filesearch handler with a single shard, which was supposed to work.
> I know SolrCloud handles these things better, but for now I need to use the
> master/slave architecture with a distributed node in front of them. As of now,
> if I keep luceneMatchVersion at 6.6.3 in solrconfig.xml, I see the error which
> I posted earlier; if I switch the version back to LUCENE_40 it just works fine.
> But isn't it supposed to work with 6.6.3? I am confused there.
>
> Also, the logs I pasted are from solr.log, not from the client side.

I agree with Erick that this looks like a problem with a recursive
shards parameter.  I set up a test handler with a recursive shards
parameter, and when I tried a request to that handler, it spit out an
error message that looked just like what you're seeing.

It is normally a bad idea to include the shards parameter in a handler
definition.  It *can* be safe, but takes a lot of special care to make
sure that recursive shard calls do not happen.

I can't imagine how setting luceneMatchVersion could alter this
behavior.  Which might be a failure of imagination on my part.

Thanks,
Shawn



Resources for Monitoring Cassandra, Spark, Solr

2018-07-02 Thread Rahul Singh
Folks,
We often get questions on monitoring here so I assembled this post with 
articles from those in the community as well as links to the component tools to 
give folks a more comprehensive listing.

https://blog.anant.us/resources-for-monitoring-datastax-cassandra-spark-solr-performance/
This is a work in progress and I'll update this with screenshots as well as 
with links from other contributors.

--
Rahul Singh
rahul.si...@anant.us

Anant Corporation


Re: Solr 7.1.0 - NoNode for /collections

2018-07-02 Thread Joe Obernberger
Just to add to this - looks like the only valid replica that is 
remaining is a TLOG type, and I suspect that is why it no longer has a 
leader.  Poop.


-Joe


On 7/2/2018 7:54 PM, Joe Obernberger wrote:
Hi - On startup, I'm getting the following error.  The shard had 3 
replicas, but none are selected as the leader.  I deleted one, and 
adding a new one back, but that had no effect, and at times the calls 
would timeout.  I was having the same issue with another shard on the 
same collection and deleting/re-adding a replica worked; the shard now 
has a leader.  This one, I can't seem to get to come up.  Any ideas?


org.apache.solr.common.SolrException: Error getting leader from zk for 
shard shard6
    at 
org.apache.solr.cloud.ZkController.getLeader(ZkController.java:1223)
    at 
org.apache.solr.cloud.ZkController.register(ZkController.java:1090)
    at 
org.apache.solr.cloud.ZkController.register(ZkController.java:1018)
    at 
org.apache.solr.core.ZkContainer.lambda$registerInZk$0(ZkContainer.java:187)
    at 
org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:188)
    at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)

    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.solr.common.SolrException: Could not get leader 
props
    at 
org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:1270)
    at 
org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:1234)
    at 
org.apache.solr.cloud.ZkController.getLeader(ZkController.java:1190)

    ... 7 more
Caused by: org.apache.zookeeper.KeeperException$NoNodeException: 
KeeperErrorCode = NoNode for 
/collections/UNCLASS_30DAYS/leaders/shard6/leader
    at 
org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
    at 
org.apache.zookeeper.KeeperException.create(KeeperException.java:51)

    at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1151)
    at 
org.apache.solr.common.cloud.SolrZkClient.lambda$getData$5(SolrZkClient.java:340)
    at 
org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:60)
    at 
org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:340)
    at 
org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:1248)

    ... 9 more


-Joe





Solr 7.1.0 - NoNode for /collections

2018-07-02 Thread Joe Obernberger
Hi - On startup, I'm getting the following error.  The shard had 3 
replicas, but none are selected as the leader.  I deleted one and added 
a new one back, but that had no effect, and at times the calls would 
time out.  I was having the same issue with another shard on the same 
collection and deleting/re-adding a replica worked; the shard now 
has a leader.  This one, I can't seem to get to come up.  Any ideas?


org.apache.solr.common.SolrException: Error getting leader from zk for 
shard shard6

    at org.apache.solr.cloud.ZkController.getLeader(ZkController.java:1223)
    at org.apache.solr.cloud.ZkController.register(ZkController.java:1090)
    at org.apache.solr.cloud.ZkController.register(ZkController.java:1018)
    at 
org.apache.solr.core.ZkContainer.lambda$registerInZk$0(ZkContainer.java:187)
    at 
org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:188)
    at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)

    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.solr.common.SolrException: Could not get leader props
    at 
org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:1270)
    at 
org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:1234)

    at org.apache.solr.cloud.ZkController.getLeader(ZkController.java:1190)
    ... 7 more
Caused by: org.apache.zookeeper.KeeperException$NoNodeException: 
KeeperErrorCode = NoNode for 
/collections/UNCLASS_30DAYS/leaders/shard6/leader
    at 
org.apache.zookeeper.KeeperException.create(KeeperException.java:111)

    at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1151)
    at 
org.apache.solr.common.cloud.SolrZkClient.lambda$getData$5(SolrZkClient.java:340)
    at 
org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:60)
    at 
org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:340)
    at 
org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:1248)

    ... 9 more


-Joe



CDCR Custom Document Routing

2018-07-02 Thread Jay Potharaju
Hi
The current cdcr setup does not work if my collection uses implicit
routing.
In my testing I found that adding documents works without any problems. It
doesn't seem to work correctly when deleting documents.
Is there an alternative to CDCR that would work in a cross data center
scenario?

Setup:
8 shards : 2 on each node
Solr:6.6.4

Thanks
Jay Potharaju


Re: Creating single CloudSolrClient object which can be used throughout the application

2018-07-02 Thread Ritesh Kumar
Yes, the client object is closed each time.

The bulk indexing service calls the service which, let's say, indexes all
the orders from the database. So, a service is called *asynchronously* from
within the bulk service, which indexes order-related data individually for
each order. There may be more than one such bulk service.

Pseudo code:
createOrdersIndex() {
    allTheOrders.iterate {
        createOrderIndex(orderId)
    }
}

createOrderIndex(orderId) {
    get the client object
    client.add(SolrInputDocument)
    client.close()
}

Each individual service indexing an order is scheduled, each one having its own
client object. So there is a possibility that the services have not yet
finished execution, meaning the client objects are not closed yet,
resulting in connection timeouts.

This is how I have prepared the CloudSolrClient object
client = new CloudSolrClient.Builder().withZkHost("zkHosts").build();
client.setDefaultCollection(collectionName)

Hope, this gives a clear picture of the problem I am facing.



On Mon, Jul 2, 2018 at 7:41 PM Erick Erickson 
wrote:

> It's recommended to use one object of course. That said, you should
> not be having a connection problem just because you create new ones
> all the time. Are you closing it after you're done with it each time?
>
> As to your question about how to reuse the same one, the "singleton
> pattern" is one solution.
>
> Best,
> Erick
>
> On Mon, Jul 2, 2018 at 6:35 AM, Ritesh Kumar
>  wrote:
> > Hello Team,
> >
> > I have got a static method which returns CloudSolrClient object if Solr
> is
> > running in Cloud mode and HttpSolrClient object otherwise.
> >
> > When running bulk indexing service, this method is called from within the
> > indexing service to get the appropriate client object. Each time, this
> > method creates a new client object. The problem is, when the bulk
> indexing
> > service is run, after a while, connection error occurs (could not connect
> > to zookeeper running at 0.0.0.0:2181. It seems the Zookeeper runs out of
> > connections.
> >
> > Configuration:
> > One Zookeeper - maxClientCnxns=60
> > Two Solr nodes, running in the Cloud mode.
> >
> > After looking out for the solution, I could find that CloudSolrClient is
> > thread safe provided it collection remains the same.
> >
> > How can I create an object of CloudSolrClient such that it is used
> > throughout the application without creating a new object each time the
> data
> > is indexed.
> >
> > Best,
> > Ritesh Kumar
>


Can't recover - HDFS

2018-07-02 Thread Joe Obernberger
Hi All - having this same problem again with a large index in HDFS.  A 
replica needs to recover, and it just spins retrying over and over 
again.  Any ideas?  Is there an adjustable timeout?


Screenshot:
http://lovehorsepower.com/images/SolrShot1.jpg

Thank you!

-Joe Obernberger




Re: NgramTokenizerFactory question

2018-07-02 Thread Alexandre Rafalovitch
I am not familiar with the Lucene method to create an analyzer. Perhaps it
was already doing just the analysis phase. But here is what the NGram
tokenizer would do to a string of '123456' with just trigrams:
123
234
345
456

So, if you only apply it on the index side, and your query is '2345' -
there is no such token in the index to match against.

On the other hand, if you apply trigram on the query side as well,
against the query '2349', it will split into:
234
349

And 234 would match. If it is OK for you that 2349 would match
against 123456, you are fine. But if you want any search string to be
actually present in full, then you need index-only NGram and the max gram
size needs to be as large as your longest possible string.

So with index-only min=3 and max=4, you will get:
123
1234
234
2345
345
3456
456

Then 2349, not being ngrammed will not match anything, but 2345 will.
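
If you want to reproduce those token lists outside of Solr, here is a minimal
Lucene sketch (assuming lucene-core and lucene-analyzers-common on the
classpath; the field name "f" is arbitrary):

    import java.io.IOException;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.ngram.NGramTokenizer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class NGramDemo {
        public static void main(String[] args) throws IOException {
            // index-side analyzer: min gram 3, max gram 4
            Analyzer analyzer = new Analyzer() {
                @Override
                protected TokenStreamComponents createComponents(String fieldName) {
                    Tokenizer t = new NGramTokenizer(3, 4);
                    return new TokenStreamComponents(t);
                }
            };
            try (TokenStream ts = analyzer.tokenStream("f", "123456")) {
                CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
                ts.reset();
                while (ts.incrementToken()) {
                    // prints 123, 1234, 234, 2345, 345, 3456, 456
                    System.out.println(term.toString());
                }
                ts.end();
            }
        }
    }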

Again, Admin UI will show that to you.

Regards,
   Alex.

On 2 July 2018 at 14:33, Kudrettin Güleryüz  wrote:
>> 1) if you want face to match interface, you need max value to be at least
> 4.
> Can you please explain this a bit more? I am not following this one. Values
> are set to 3,3 and Solr already matches interface and interfaces when
> searched for face.  In addition to that Solr matches the trigrams of face
> (fac and ace) as well, which I find not as relevant as interface or faceted.
>
> Application I am working on moving to Solr 7.3.1 is currently using Lucene
> API 5.3.1 and has a custom analyzer like following:
>
>
> public class TrigramCaseAnalyzer extends SourceSearchAnalyzer {
> private int indexType;
>
> public TrigramCaseAnalyzer() {
> indexType = 1;
> }
>
> @Override
> public int getIndexType() {
> return this.indexType;
> }
>
> @Override
> public void setIndexType(int type) {
> this.indexType = type;
> }
>
> @Override
> protected TokenStreamComponents createComponents(String fieldName) {
> Tokenizer st;
> st = new NGramTokenizer(3, 3);
> return new TokenStreamComponents(st);
> }
> }
>
> This somehow behaves as I described. (for a search: face returns interface
> face faceted but not fac or ace).
>
> Is there a change since 5.3.1 regarding this behavious in Lucene? Or is the
> difference in behaviour caused by Solr's implementation of the Lucene API?
>
> Thank you
>
>
> On Mon, Jul 2, 2018 at 2:00 PM Alexandre Rafalovitch 
> wrote:
>
>> Two things:
>> 1) if you want face to match interface, you need max value to be at least
>> 4.
>> 2) you probably have the factory symmetrically or on Query analyzer. You
>> probably want it on Index analyzer side only. Otherwise you are trying to
>> match any 3-letter query substring against yoir index.
>>
>> Admin UI analysis screen will show that to you.
>>
>> Regards,
>> Alex
>>
>> On Mon, Jul 2, 2018, 11:01 AM Kudrettin Güleryüz, 
>> wrote:
>>
>> > Hi,
>> >
>> > When using NgramTokenizerFactory with settings min ngram size=3 and max
>> > ngram size=3 I get the following behaviour.
>> >
>> > Assume that search term is, face
>> >
>> > I expect the results to show documents with strings:
>> > * interface or
>> > * face or
>> > * faceted
>> >
>> > but not
>> > * ace or
>> > * fac
>> >
>> > Why would I get the matches with results ace or fac? Am I missing some
>> > settings somewhere? What is the suggested way to change this this
>> > behaviour?
>> >
>> > Thank you,
>> >
>>


Re: NgramTokenizerFactory question

2018-07-02 Thread Kudrettin Güleryüz
> 1) if you want face to match interface, you need max value to be at least
4.
Can you please explain this a bit more? I am not following this one. Values
are set to 3,3 and Solr already matches interface and interfaces when
searched for face.  In addition to that Solr matches the trigrams of face
(fac and ace) as well, which I find not as relevant as interface or faceted.

The application I am working on moving to Solr 7.3.1 currently uses the Lucene
5.3.1 API and has a custom analyzer like the following:


public class TrigramCaseAnalyzer extends SourceSearchAnalyzer {
    private int indexType;

    public TrigramCaseAnalyzer() {
        indexType = 1;
    }

    @Override
    public int getIndexType() {
        return this.indexType;
    }

    @Override
    public void setIndexType(int type) {
        this.indexType = type;
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer st;
        st = new NGramTokenizer(3, 3);
        return new TokenStreamComponents(st);
    }
}

This somehow behaves as I described (for a search for face, it returns interface,
face and faceted, but not fac or ace).

Is there a change since 5.3.1 regarding this behaviour in Lucene? Or is the
difference in behaviour caused by Solr's implementation of the Lucene API?

Thank you


On Mon, Jul 2, 2018 at 2:00 PM Alexandre Rafalovitch 
wrote:

> Two things:
> 1) if you want face to match interface, you need max value to be at least
> 4.
> 2) you probably have the factory symmetrically or on Query analyzer. You
> probably want it on Index analyzer side only. Otherwise you are trying to
> match any 3-letter query substring against yoir index.
>
> Admin UI analysis screen will show that to you.
>
> Regards,
> Alex
>
> On Mon, Jul 2, 2018, 11:01 AM Kudrettin Güleryüz, 
> wrote:
>
> > Hi,
> >
> > When using NgramTokenizerFactory with settings min ngram size=3 and max
> > ngram size=3 I get the following behaviour.
> >
> > Assume that search term is, face
> >
> > I expect the results to show documents with strings:
> > * interface or
> > * face or
> > * faceted
> >
> > but not
> > * ace or
> > * fac
> >
> > Why would I get the matches with results ace or fac? Am I missing some
> > settings somewhere? What is the suggested way to change this this
> > behaviour?
> >
> > Thank you,
> >
>


Scores with Solr Suggester

2018-07-02 Thread Buckler, Christine
Is it possible to return a score field for Suggester results like it does with 
standard search? I am looking for the score which quantifies how close of a 
match between type entered and suggestion result (not the weight associated 
with the suggestion). Is this possible?

Christine Buckler
[id:image001.png@01D3F81F.AF489300]christinebuckler
[id:image002.png@01D3F81F.AF489300]206.295.6772


Re: _childDocuments_ automatically multivalued field type

2018-07-02 Thread Shawn Heisey
On 7/2/2018 9:18 AM, jeebix wrote:
> I don't understand why for example "type_cmd_s" get the field type attribute
> "singleValued", but "TTC" or "kits_sans_suite" get "multiValued" attribute ?
> Why those field are in the managed-schema and enseigne_s (for example) is
> not ?

The field named enseigne_s is almost certainly handled by a dynamic
field definition, most likely one with the name "*_s".  That field (and
its field type) do not have multiValued="true".  This was probably
already in your schema before you did any indexing.

The ones that were automatically added by the data-driven nature of your
schema were added as the "strings" type, which IS multi-valued.  The
update processor definition that is in the Solr examples is set up to
add fields as multiValued, so that if a later indexing request comes in
with multiple values for the field, it will not fail.

This is the major danger of relying on Solr to automatically add fields
to your schema.  Chances are good that the choice it makes for the field
will be the wrong choice.  And when that happens, you will need to fix
the schema and completely reindex.

https://wiki.apache.org/solr/HowToReindex

Thanks,
Shawn



Re: NgramTokenizerFactory question

2018-07-02 Thread Alexandre Rafalovitch
Two things:
1) if you want face to match interface, you need max value to be at least 4.
2) you probably have the factory symmetrically or on Query analyzer. You
probably want it on Index analyzer side only. Otherwise you are trying to
match any 3-letter query substring against yoir index.

Admin UI analysis screen will show that to you.

Regards,
Alex

On Mon, Jul 2, 2018, 11:01 AM Kudrettin Güleryüz, 
wrote:

> Hi,
>
> When using NgramTokenizerFactory with settings min ngram size=3 and max
> ngram size=3 I get the following behaviour.
>
> Assume that search term is, face
>
> I expect the results to show documents with strings:
> * interface or
> * face or
> * faceted
>
> but not
> * ace or
> * fac
>
> Why would I get the matches with results ace or fac? Am I missing some
> settings somewhere? What is the suggested way to change this this
> behaviour?
>
> Thank you,
>


Re: _childDocuments_ automatically multivalued field type

2018-07-02 Thread Alexandre Rafalovitch
Because your _s fields must be mapping to a dynamicField definition, and so are
created dynamically in the schema without needing a special definition for each
field.

The TTC field you did map explicitly, perhaps with "schemaless" mapping
autodiscovery, which does create specific field definitions, but always
multiValued.

The multiValued attribute can be set on the type, not just on an individual field.

So you may just want to adjust the schema definition to use single-valued types
instead.

The Admin UI schema screen is helpful to see such differences.

Regards,
 Alex

On Mon, Jul 2, 2018, 11:43 AM jeebix,  wrote:

> Hello everybody,
>
> I have a problem with some field types in the managed-schema generated.
>
> First, the data SOLR returned with a standard query :
>
> response":{"numFound":365567,"start":0,"docs":[
>   {
> "id":"560.561.134676",
> "parent_i":560,
> "asso_i":561,
> "personne_i":134676,
> "etat_technique_s":"avec_documents",
> "etat_marketing_s":"actif",
> "type_parent_s":"Ecole élémentaire publique",
> "type_asso_s":"APE (association de parents d'élèves)",
> "groupe_type_parent_s":"ENSEIGNEMENT_PRIMAIRE",
> "groupe_type_asso_s":"ASSOCIATION_DE_PARENTS",
> "nombre_commandes_brut_i":2,
> "nombre_commandes_i":1,
> "nombre_kits_saveur_i":0,
> "ca_periode_i":560,
> "ca_periode_fleur_i":0,
> "ca_periode_saveur_i":0,
> "zone_scolaire_s":"A",
> "territoire_s":"France Métropolitaine",
> "region_s":"AUVERGNE RHONE-ALPES",
> "departement_s":"01 AIN",
> "postal_country_s":"FR",
> "asso_country_s":"FRANCE",
> "object_type_s":"contact",
> "date_derni_re_commande_dt":"2016-05-20T00:00:00Z",
> "_version_":1604889647955050496,
> "_childDocuments_":[
> {
>   "fixe_facturation":["0256897856"],
>   "object_type":["order"],
>   "mobile_livraison":["0658987874"],
>   "kit_sans_suite":["false"],
>   "fixe_livraison":["0450598311"],
>   "type_cde_s":"CDE",
>   "statut_s":"V",
>   "mobile_facturation":["0658787458"],
>   "campagne_s":"A",
>   "TTC":[780],
>   "date_dt":"2016-05-20T00:00:00Z",
>   "id":"A28837",
>   "enseigne_s":"CRE"},
> {
>   "fixe_facturation":["0245784975"],
>   "object_type":["order"],
>   "mobile_livraison":["0645789874"],
>   "kit_sans_suite":["false"],
>   "type_cde_s":"KIT",
>   "statut_s":"V",
>   "mobile_facturation":["0612345678"],
>   "campagne_s":"A",
>   "TTC":[0],
>   "date_dt":"2016-05-04T00:00:00Z",
>   "id":"A25415",
>   "enseigne_s":"CRE"}]}
>
> My goal is to sum fields "TTC" by parentDocument. But with the type
> "multiValued", I can't use aggregation functions.
>
> The core get the data from this script : /opt/solr/bin/post -c 
> -format solr build/index.json
>
> The index.json looks like that:
>
> [
>   {
> "id": "781.782.134878",
> "parent_i": 781,
> "asso_i": 782,
> "personne_i": 134878,
> "etat_technique_s": "avec_documents",
> "etat_marketing_s": "inactif",
> "type_parent_s": "Ecole élémentaire privée",
> "type_asso_s": "APEL (association de parents école libre)",
> "groupe_type_parent_s": "ENSEIGNEMENT_PRIMAIRE",
> "groupe_type_asso_s": "ASSOCIATION_DE_PARENTS",
> "nombre_commandes_brut_i": 4,
> "nombre_commandes_i": 2,
> "nombre_kits_saveur_i": 2,
> "date_dernière_commande_dt": "2010-11-16",
> "ca_periode_i": 0,
> "ca_periode_fleur_i": 0,
> "ca_periode_saveur_i": 0,
> "zone_scolaire_s": "A",
> "territoire_s": "France Métropolitaine",
> "region_s": "AUVERGNE RHONE-ALPES",
> "departement_s": "01 AIN",
> "postal_country_s": "FR",
> "asso_country_s": "FRANCE",
> "object_type_s": "contact",
> "kits_sans_suite_ss": null,
> "_childDocuments_": [
>   {
> "fixe_facturation": "0450407279",
> "object_type": "order",
> "mobile_livraison": "0628332864",
> "kit_sans_suite": "false",
> "fixe_livraison": "0450407279",
> "type_cde_s": "KIT",
> "statut_s": "V",
> "mobile_facturation": "0628332864",
> "campagne_s": "L",
> "TTC": 0,
> "date_dt": "2009-10-12T00:00:00Z",
> "id": "L14276",
> "enseigne_s": "SAV",
> "gamme": [
>   "KITS > Kits Saveurs"
> ]
>   },
>   {
> "fixe_facturation": "0450407279",
> "object_type": "order",
> "mobile_livraison": "0628332864",
> "kit_sans_suite": "false",
> "fixe_livraison": "0450407279",
> "type_cde_s": "CDE",
> "statut_s": "V",
> "mobile_facturation": "0628332864",
> "campagne_s": "L",
> "TTC": 1045,
> "date_dt"

Re: NgramTokenizerFactory question

2018-07-02 Thread Kudrettin Güleryüz
It is correct that the search string causes the following query to be generated:
+(field:fac field:ace)
Hence the results... However, I fail to see how (fac OR ace) is a relevant
query; shouldn't it be
+field:fac +field:ace
instead?

What is the suggested way to change this behaviour?

On Mon, Jul 2, 2018 at 11:47 AM Erick Erickson 
wrote:

> Take a look at two things:
> 1> the admin/analysis page. This is probably mostly a sanity check to
> insure you're seeing what you expect.
> 2> add debug=query to the query and look at the parsed query. My bet
> is that the grams are being OR'd together
>  and your search term is effectively
>
> fac OR ace
>
> Best,
> Erick
>
> On Mon, Jul 2, 2018 at 8:01 AM, Kudrettin Güleryüz 
> wrote:
> > Hi,
> >
> > When using NgramTokenizerFactory with settings min ngram size=3 and max
> > ngram size=3 I get the following behaviour.
> >
> > Assume that search term is, face
> >
> > I expect the results to show documents with strings:
> > * interface or
> > * face or
> > * faceted
> >
> > but not
> > * ace or
> > * fac
> >
> > Why would I get the matches with results ace or fac? Am I missing some
> > settings somewhere? What is the suggested way to change this this
> > behaviour?
> >
> > Thank you,
>


Re: NgramTokenizerFactory question

2018-07-02 Thread Erick Erickson
Take a look at two things:
1> the admin/analysis page. This is probably mostly a sanity check to
insure you're seeing what you expect.
2> add debug=query to the query and look at the parsed query. My bet
is that the grams are being OR'd together
 and your search term is effectively

fac OR ace
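
For example (host, port and collection name here are placeholders), a request like
http://localhost:8983/solr/yourcollection/select?q=field:face&debug=query
will include a "parsedquery" entry in the debug section of the response, which
shows exactly which grams ended up in the query and how they are combined.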

Best,
Erick

On Mon, Jul 2, 2018 at 8:01 AM, Kudrettin Güleryüz  wrote:
> Hi,
>
> When using NgramTokenizerFactory with settings min ngram size=3 and max
> ngram size=3 I get the following behaviour.
>
> Assume that search term is, face
>
> I expect the results to show documents with strings:
> * interface or
> * face or
> * faceted
>
> but not
> * ace or
> * fac
>
> Why would I get the matches with results ace or fac? Am I missing some
> settings somewhere? What is the suggested way to change this this
> behaviour?
>
> Thank you,


_childDocuments_ automatically multivalued field type

2018-07-02 Thread jeebix
Hello everybody,

I have a problem with some field types in the managed-schema generated.

First, the data SOLR returned with a standard query :

response":{"numFound":365567,"start":0,"docs":[
  {
"id":"560.561.134676",
"parent_i":560,
"asso_i":561,
"personne_i":134676,
"etat_technique_s":"avec_documents",
"etat_marketing_s":"actif",
"type_parent_s":"Ecole élémentaire publique",
"type_asso_s":"APE (association de parents d'élèves)",
"groupe_type_parent_s":"ENSEIGNEMENT_PRIMAIRE",
"groupe_type_asso_s":"ASSOCIATION_DE_PARENTS",
"nombre_commandes_brut_i":2,
"nombre_commandes_i":1,
"nombre_kits_saveur_i":0,
"ca_periode_i":560,
"ca_periode_fleur_i":0,
"ca_periode_saveur_i":0,
"zone_scolaire_s":"A",
"territoire_s":"France Métropolitaine",
"region_s":"AUVERGNE RHONE-ALPES",
"departement_s":"01 AIN",
"postal_country_s":"FR",
"asso_country_s":"FRANCE",
"object_type_s":"contact",
"date_derni_re_commande_dt":"2016-05-20T00:00:00Z",
"_version_":1604889647955050496,
"_childDocuments_":[
{
  "fixe_facturation":["0256897856"],
  "object_type":["order"],
  "mobile_livraison":["0658987874"],
  "kit_sans_suite":["false"],
  "fixe_livraison":["0450598311"],
  "type_cde_s":"CDE",
  "statut_s":"V",
  "mobile_facturation":["0658787458"],
  "campagne_s":"A",
  "TTC":[780],
  "date_dt":"2016-05-20T00:00:00Z",
  "id":"A28837",
  "enseigne_s":"CRE"},
{
  "fixe_facturation":["0245784975"],
  "object_type":["order"],
  "mobile_livraison":["0645789874"],
  "kit_sans_suite":["false"],
  "type_cde_s":"KIT",
  "statut_s":"V",
  "mobile_facturation":["0612345678"],
  "campagne_s":"A",
  "TTC":[0],
  "date_dt":"2016-05-04T00:00:00Z",
  "id":"A25415",
  "enseigne_s":"CRE"}]}

My goal is to sum fields "TTC" by parentDocument. But with the type
"multiValued", I can't use aggregation functions.

The core gets the data from this script: /opt/solr/bin/post -c 
-format solr build/index.json

The index.json looks like that:

[
  {
"id": "781.782.134878",
"parent_i": 781,
"asso_i": 782,
"personne_i": 134878,
"etat_technique_s": "avec_documents",
"etat_marketing_s": "inactif",
"type_parent_s": "Ecole élémentaire privée",
"type_asso_s": "APEL (association de parents école libre)",
"groupe_type_parent_s": "ENSEIGNEMENT_PRIMAIRE",
"groupe_type_asso_s": "ASSOCIATION_DE_PARENTS",
"nombre_commandes_brut_i": 4,
"nombre_commandes_i": 2,
"nombre_kits_saveur_i": 2,
"date_dernière_commande_dt": "2010-11-16",
"ca_periode_i": 0,
"ca_periode_fleur_i": 0,
"ca_periode_saveur_i": 0,
"zone_scolaire_s": "A",
"territoire_s": "France Métropolitaine",
"region_s": "AUVERGNE RHONE-ALPES",
"departement_s": "01 AIN",
"postal_country_s": "FR",
"asso_country_s": "FRANCE",
"object_type_s": "contact",
"kits_sans_suite_ss": null,
"_childDocuments_": [
  {
"fixe_facturation": "0450407279",
"object_type": "order",
"mobile_livraison": "0628332864",
"kit_sans_suite": "false",
"fixe_livraison": "0450407279",
"type_cde_s": "KIT",
"statut_s": "V",
"mobile_facturation": "0628332864",
"campagne_s": "L",
"TTC": 0,
"date_dt": "2009-10-12T00:00:00Z",
"id": "L14276",
"enseigne_s": "SAV",
"gamme": [
  "KITS > Kits Saveurs"
]
  },
  {
"fixe_facturation": "0450407279",
"object_type": "order",
"mobile_livraison": "0628332864",
"kit_sans_suite": "false",
"fixe_livraison": "0450407279",
"type_cde_s": "CDE",
"statut_s": "V",
"mobile_facturation": "0628332864",
"campagne_s": "L",
"TTC": 1045,
"date_dt": "2009-11-14T00:00:00Z",
"id": "L25049",
"enseigne_s": "SAV",
"gamme": [
  "CHOCOLAT > Assortiment",
  "CHOCOLAT > Individuel",
  "CHOCOLAT > Mono-produit",
  "EQUIPEMENT MAISON > Cuisine",
  "EQUIPEMENT MAISON > Décoration",
  "KITS > Kits Saveurs",
  "SAVEURS > Confiserie",
  "SAVEURS > Pâtisserie"
]
}
]

In the managed-schema, only those fields appear:














I don't understand why, for example, "type_cmd_s" gets the field type attribute
"singleValued", but "TTC" or "kits_sans_suite" get the "multiValued" attribute.
Why are those fields in the managed-schema while enseigne_s (for example) is
not?

Thanks a lot for your help...

Best
JB







NgramTokenizerFactory question

2018-07-02 Thread Kudrettin Güleryüz
Hi,

When using NgramTokenizerFactory with settings min ngram size=3 and max
ngram size=3 I get the following behaviour.

Assume that search term is, face

I expect the results to show documents with strings:
* interface or
* face or
* faceted

but not
* ace or
* fac

Why would I get the matches with results ace or fac? Am I missing some
settings somewhere? What is the suggested way to change this
behaviour?

Thank you,


Re: Server refused connection at: http://localhost:xxxx/solr/collectionName

2018-07-02 Thread Erick Erickson
Given your other e-mail I suspect you're not closing the client
and creating new ones for every update request.

You should simply not run out of connections, your client is
most probably incorrect.

Best,
Erick

On Mon, Jul 2, 2018 at 3:38 AM, Ritesh Kumar
 wrote:
> I could get the live Solr nodes using this piece of code
>
> ZkStateReader zkStateReader = client.getZkStateReader();
> ClusterState clusterState = zkStateReader.getClusterState();
> Set<String> liveNodes = clusterState.getLiveNodes();
>
> This way, I will be able to send a query to one of the live nodes and
> Zookeeper will take care of the rest, but, I was wondering if this is a
> good practice to query from SolrCloud.
>
> What if the Solr node goes down in the middle of bulk indexing.
>
> On Mon, Jul 2, 2018 at 3:37 PM Ritesh Kumar 
> wrote:
>
>> I did use CloudSolrClient to query or index data. I did not have to check
>> which Solr node is active. The problem I am facing during bulk indexing is
>> that the Zookeeper runs out of connections resulting in Connection Timeout
>> error.
>>
>> How can I get to know in advance the active Solr nodes? Any reference
>> would be helpful.
>>
>> Thanks
>>
>> On Mon, Jul 2, 2018 at 2:36 PM Yasufumi Mizoguchi 
>> wrote:
>>
>>> Hi,
>>>
>>> I think ZooKeeper can not notice requests to dead nodes, if you send
>>> requests to Solr nodes directly.
>>> It will be better that asking ZooKeeper which Solr nodes will be running
>>> before requesting Solr nodes with CloudSolrClient etc...
>>>
>>> Thanks,
>>> Yasufumi
>>>
>>> On Mon, Jul 2, 2018 at 16:49 Ritesh Kumar :
>>>
>>> > Hello Team,
>>> >
>>> > I have two Solr nodes running in cloud mode. I know that we send queries
>>> > and updates directly to Solr's collection e.g.http://host:
>>> > port/solr/. Any of the Solr nodes can be used. If
>>> the
>>> > node does not have the collection being queried then the request will be
>>> > forwarded internally to a Solr instance which has that collection.
>>> >
>>> > But, my question is what happens when the node being queried is down. I
>>> am
>>> > getting this
>>> > error: Server refused connection at http://localhost:
>>> > /solr/collectionName.
>>> >
>>> > Does not Zookeeper handle this scenario?
>>> >
>>> > Everything is fine when the node being queried is running. I am able to
>>> > index and fetch data.
>>> >
>>> > Please, help me.
>>> >
>>> > Best,
>>> > Ritesh Kumar
>>> >
>>>
>>


Re: CursorMarks and 'end of results'

2018-07-02 Thread Erick Erickson
OK, that makes sense then.

I don't think we've mentioned streaming as an alternative. It has some
restrictions (it can only export docValues), and frankly I don't
really remember how much of it was in 5.5 so you'll have to check.

Streaming is designed exactly to, well, stream the entire result set
out. There's some setup cost, so for your use case, where most queries
don't have all that many hits, the setup may be too onerous, but I thought
I'd mention it.
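
(As a minimal illustration, the related /export handler also streams the full
sorted result set and only works on docValues fields; host, collection, and
field names below are placeholders, and how much of this exists in 5.5 differs
from later releases, so check your version's docs first:

    curl "http://localhost:8983/solr/yourcollection/export?q=*:*&sort=id+asc&fl=id"
)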

Best,
Erick

On Mon, Jul 2, 2018 at 5:14 AM, David Frese  wrote:
> Am 29.06.18 um 17:42 schrieb Erick Erickson:
>>
>> bq. It basically cuts down the search time in half in the usual case
>> for us, so it's an important 'feature'.
>>
>> Wait. You mean that the "extra" call to get back 0 rows doubles your
>> query time? That's surprising, tell us more.
>>
>> How many times does your "usual" use case call using CursorMark? My
>> off-the-cuff explanation would be that
>> you usually get all the rows in the first call.
>>
>> CursorMark is intended to help with the "deep paging" problem, i.e.
>> where start=some_big_number to allow
>> returning large results sets in chunks, say through 10s of K rows.
>> Part of our puzzlement is that in that
>> case the overhead of the last call is minuscule compared to the rest.
>>
>> There's no reason that it can't be used for small result sets, those
>> are just usually handled by setting the
>> start parameter. Up through, say, 1,000 or so the extra overhead is
>> pretty unnoticeable. So my head was
>> in the "what's the problem with 1 extra call after making the first 50?".
>>
>> OTOH, if you make 100 successive calls to search with the CursorMark
>> and call 101 takes as long as
>> the previous 100, something's horribly wrong.
>
>
> Hi,
>
> I use it in a server application where I need to process all results in
> every case, which can be between 0 and 100's of thousands. We use pagination
> to have a boundary on the required memory on "our" side by processing
> page-after-page.
>
> Most cases will fit into one page though - a few hundred results. Our Solr
> cluster takes about 5 to 10 seconds (*) for the first 'filled' page _and_
> about the _same time_ again for the second empty page. So if I have the
> guarantee that the second page is always empty, that helps a lot.
>
> Solr 5.5 that is, btw.
>
> (*) If it could be faster then 5 seconds is a different issue. But the query
> is quite complex with a lot of AND/OR and BlockJoins too, and I have no idea
> if memory is large enough to hold the indices and things like that. Not
> really optimized yet.
>
>
> David.
>
> --
> David Frese
> +49 7071 70896 75
>
> Active Group GmbH
> Hechinger Str. 12/1, 72072 Tübingen
> Registergericht: Amtsgericht Stuttgart, HRB 224404
> Geschäftsführer: Dr. Michael Sperber


Re: Creating single CloudSolrClient object which can be used throughout the application

2018-07-02 Thread Erick Erickson
It's recommended to use one object of course. That said, you should
not be having a connection problem just because you create new ones
all the time. Are you closing it after you're done with it each time?

As to your question about how to reuse the same one, the "singleton
pattern" is one solution.

Best,
Erick

On Mon, Jul 2, 2018 at 6:35 AM, Ritesh Kumar
 wrote:
> Hello Team,
>
> I have got a static method which returns CloudSolrClient object if Solr is
> running in Cloud mode and HttpSolrClient object otherwise.
>
> When running bulk indexing service, this method is called from within the
> indexing service to get the appropriate client object. Each time, this
> method creates a new client object. The problem is, when the bulk indexing
> service is run, after a while, connection error occurs (could not connect
> to zookeeper running at 0.0.0.0:2181. It seems the Zookeeper runs out of
> connections.
>
> Configuration:
> One Zookeeper - maxClientCnxns=60
> Two Solr nodes, running in the Cloud mode.
>
> After looking out for the solution, I could find that CloudSolrClient is
> thread safe provided it collection remains the same.
>
> How can I create an object of CloudSolrClient such that it is used
> throughout the application without creating a new object each time the data
> is indexed.
>
> Best,
> Ritesh Kumar


Creating single CloudSolrClient object which can be used throughout the application

2018-07-02 Thread Ritesh Kumar
Hello Team,

I have got a static method which returns CloudSolrClient object if Solr is
running in Cloud mode and HttpSolrClient object otherwise.

When running bulk indexing service, this method is called from within the
indexing service to get the appropriate client object. Each time, this
method creates a new client object. The problem is, when the bulk indexing
service is run, after a while, connection error occurs (could not connect
to zookeeper running at 0.0.0.0:2181. It seems the Zookeeper runs out of
connections.

Configuration:
One Zookeeper - maxClientCnxns=60
Two Solr nodes, running in the Cloud mode.

After looking for a solution, I found that CloudSolrClient is
thread safe provided the collection remains the same.

How can I create an object of CloudSolrClient such that it is used
throughout the application, without creating a new object each time data
is indexed?

Best,
Ritesh Kumar


Re: CursorMarks and 'end of results'

2018-07-02 Thread David Frese

Am 29.06.18 um 17:42 schrieb Erick Erickson:

bq. It basically cuts down the search time in half in the usual case
for us, so it's an important 'feature'.

Wait. You mean that the "extra" call to get back 0 rows doubles your
query time? That's surprising, tell us more.

How many times does your "usual" use case call using CursorMark? My
off-the-cuff explanation would be that
you usually get all the rows in the first call.

CursorMark is intended to help with the "deep paging" problem, i.e.
where start=some_big_number to allow
returning large results sets in chunks, say through 10s of K rows.
Part of our puzzlement is that in that
case the overhead of the last call is minuscule compared to the rest.

There's no reason that it can't be used for small result sets, those
are just usually handled by setting the
start parameter. Up through, say, 1,000 or so the extra overhead is
pretty unnoticeable. So my head was
in the "what's the problem with 1 extra call after making the first 50?".

OTOH, if you make 100 successive calls to search with the CursorMark
and call 101 takes as long as
the previous 100, something's horribly wrong.


Hi,

I use it in a server application where I need to process all results in 
every case, which can be between 0 and 100's of thousands. We use 
pagination to have a boundary on the required memory on "our" side by 
processing page-after-page.
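
For reference, the standard SolrJ cursor loop (a minimal sketch, assuming "id"
is the uniqueKey and an already-built SolrClient named client) only detects the
end when a response comes back whose nextCursorMark equals the cursorMark that
was sent, which is exactly where the extra, empty request comes from:

    String cursorMark = CursorMarkParams.CURSOR_MARK_START;
    boolean done = false;
    while (!done) {
        SolrQuery q = new SolrQuery("*:*");
        q.setRows(500);
        q.setSort(SolrQuery.SortClause.asc("id"));
        q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);
        QueryResponse rsp = client.query(q);
        // process rsp.getResults() here ...
        String nextCursorMark = rsp.getNextCursorMark();
        if (cursorMark.equals(nextCursorMark)) {
            done = true;  // nothing left; this last round trip returns 0 rows
        }
        cursorMark = nextCursorMark;
    }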


Most cases will fit into one page though - a few hundred results. Our 
Solr cluster takes about 5 to 10 seconds (*) for the first 'filled' page 
_and_ about the _same time_ again for the second empty page. So if I 
have the guarantee that the second page is always empty, that helps a lot.


Solr 5.5 that is, btw.

(*) Whether it could be faster than 5 seconds is a different issue. But the 
query is quite complex with a lot of AND/OR and BlockJoins too, and I 
have no idea if memory is large enough to hold the indices and things 
like that. Not really optimized yet.



David.

--
David Frese
+49 7071 70896 75

Active Group GmbH
Hechinger Str. 12/1, 72072 Tübingen
Registergericht: Amtsgericht Stuttgart, HRB 224404
Geschäftsführer: Dr. Michael Sperber


Re: /replication?command=details does not show infos for all replicas on the core

2018-07-02 Thread Arturas Mazeika
Hi Shawn,
hi Erick,
hi et al.,

Very nice clarifications indeed. I also looked at the index replication
section. In addition to the clarifications in this thread this brought
quite some light into the area (and shows that I need to read solrcloud
part of the manual more extensively). Thanks a lot indeed!

Cheers,
Arturas


On Fri, Jun 29, 2018 at 5:44 PM, Shawn Heisey  wrote:

> On 6/29/2018 8:47 AM, Arturas Mazeika wrote:
>
>> Out of curiosity: some cores give infos for both shards (through
>> replication query) and some only for one (if you still be able to see the
>> prev post). I wonder why..
>>
>
> Adding to what Erick said:
>
> If SolrCloud has initiated a replication on that core at some point since
> that Solr instance started, then you might see both the master and slave
> side of that replication reported by the replication handler.  If a
> replication has never been initiated, then you will only see info about the
> local core.
>
> The replication handler is used by SolrCloud for two things:
>
> 1) Index recovery when a replica gets too far out of sync.
> 2) Replicating data to TLOG and PULL replica types (new in 7.x).
>
> Thanks,
> Shawn
>
>


Running Solr on Aws S3

2018-07-02 Thread Taher Koitawala
Hi All,
Has anyone here tried to run Solr on S3? I found a page here

which describes how you can run Solr on S3. I followed the link; however, I
get the following exception:

Class com.amazon.ws.emr.hadoop.fs.EmrFileSystem not found.

The jar I created referring to the link is kept in the
server/solr-webapp/webapp/WEB-INF/lib/ folder.

Is this the correct way to do it?


Regards,
Taher Koitawala
GS Lab Pune
+91 8407979163


Re: Server refused connection at: http://localhost:xxxx/solr/collectionName

2018-07-02 Thread Ritesh Kumar
I could get the live Solr nodes using this piece of code

ZkStateReader zkStateReader = client.getZkStateReader();
ClusterState clusterState = zkStateReader.getClusterState();
Set<String> liveNodes = clusterState.getLiveNodes();

This way, I will be able to send a query to one of the live nodes and
ZooKeeper will take care of the rest, but I was wondering if this is
good practice for querying SolrCloud.

What if the Solr node goes down in the middle of bulk indexing?

On Mon, Jul 2, 2018 at 3:37 PM Ritesh Kumar 
wrote:

> I did use CloudSolrClient to query or index data. I did not have to check
> which Solr node is active. The problem I am facing during bulk indexing is
> that the Zookeeper runs out of connections resulting in Connection Timeout
> error.
>
> How can I get to know in advance the active Solr nodes? Any reference
> would be helpful.
>
> Thanks
>
> On Mon, Jul 2, 2018 at 2:36 PM Yasufumi Mizoguchi 
> wrote:
>
>> Hi,
>>
>> I think ZooKeeper can not notice requests to dead nodes, if you send
>> requests to Solr nodes directly.
>> It will be better that asking ZooKeeper which Solr nodes will be running
>> before requesting Solr nodes with CloudSolrClient etc...
>>
>> Thanks,
>> Yasufumi
>>
>> On Mon, Jul 2, 2018 at 16:49 Ritesh Kumar :
>>
>> > Hello Team,
>> >
>> > I have two Solr nodes running in cloud mode. I know that we send queries
>> > and updates directly to Solr's collection e.g.http://host:
>> > port/solr/. Any of the Solr nodes can be used. If
>> the
>> > node does not have the collection being queried then the request will be
>> > forwarded internally to a Solr instance which has that collection.
>> >
>> > But, my question is what happens when the node being queried is down. I
>> am
>> > getting this
>> > error: Server refused connection at http://localhost:
>> > /solr/collectionName.
>> >
>> > Does not Zookeeper handle this scenario?
>> >
>> > Everything is fine when the node being queried is running. I am able to
>> > index and fetch data.
>> >
>> > Please, help me.
>> >
>> > Best,
>> > Ritesh Kumar
>> >
>>
>


Re: Server refused connection at: http://localhost:xxxx/solr/collectionName

2018-07-02 Thread Ritesh Kumar
I did use CloudSolrClient to query or index data. I did not have to check
which Solr node is active. The problem I am facing during bulk indexing is
that ZooKeeper runs out of connections, resulting in a Connection Timeout
error.

How can I get to know in advance the active Solr nodes? Any reference would
be helpful.

Thanks

On Mon, Jul 2, 2018 at 2:36 PM Yasufumi Mizoguchi 
wrote:

> Hi,
>
> I think ZooKeeper can not notice requests to dead nodes, if you send
> requests to Solr nodes directly.
> It will be better that asking ZooKeeper which Solr nodes will be running
> before requesting Solr nodes with CloudSolrClient etc...
>
> Thanks,
> Yasufumi
>
> On Mon, Jul 2, 2018 at 16:49 Ritesh Kumar :
>
> > Hello Team,
> >
> > I have two Solr nodes running in cloud mode. I know that we send queries
> > and updates directly to Solr's collection e.g.http://host:
> > port/solr/. Any of the Solr nodes can be used. If
> the
> > node does not have the collection being queried then the request will be
> > forwarded internally to a Solr instance which has that collection.
> >
> > But, my question is what happens when the node being queried is down. I
> am
> > getting this
> > error: Server refused connection at http://localhost:
> > /solr/collectionName.
> >
> > Does not Zookeeper handle this scenario?
> >
> > Everything is fine when the node being queried is running. I am able to
> > index and fetch data.
> >
> > Please, help me.
> >
> > Best,
> > Ritesh Kumar
> >
>


Re: Server refused connection at: http://localhost:xxxx/solr/collectionName

2018-07-02 Thread Yasufumi Mizoguchi
Hi,

I think ZooKeeper cannot notice requests sent to dead nodes if you send
requests to Solr nodes directly.
It would be better to ask ZooKeeper which Solr nodes are running
before sending requests to them, e.g. with CloudSolrClient etc...

Thanks,
Yasufumi

On Mon, Jul 2, 2018 at 16:49 Ritesh Kumar :

> Hello Team,
>
> I have two Solr nodes running in cloud mode. I know that we send queries
> and updates directly to Solr's collection e.g.http://host:
> port/solr/. Any of the Solr nodes can be used. If the
> node does not have the collection being queried then the request will be
> forwarded internally to a Solr instance which has that collection.
>
> But, my question is what happens when the node being queried is down. I am
> getting this
> error: Server refused connection at http://localhost:
> /solr/collectionName.
>
> Does not Zookeeper handle this scenario?
>
> Everything is fine when the node being queried is running. I am able to
> index and fetch data.
>
> Please, help me.
>
> Best,
> Ritesh Kumar
>


Server refused connection at: http://localhost:xxxx/solr/collectionName

2018-07-02 Thread Ritesh Kumar
Hello Team,

I have two Solr nodes running in cloud mode. I know that we send queries
and updates directly to Solr's collection, e.g. http://host:port/solr/<collectionName>.
Any of the Solr nodes can be used. If the
node does not have the collection being queried then the request will be
forwarded internally to a Solr instance which has that collection.

But my question is what happens when the node being queried is down. I am
getting this
error: Server refused connection at
http://localhost:xxxx/solr/collectionName.

Does not Zookeeper handle this scenario?

Everything is fine when the node being queried is running. I am able to
index and fetch data.

Please, help me.

Best,
Ritesh Kumar