RE: help on implicit routing

2017-07-09 Thread imran
Thanks for the reference. I am guessing this feature is not available through 
the post utility inside solr/bin.

Regards,
Imran

Sent from Mail for Windows 10

From: Jan Høydahl
Sent: Friday, July 7, 2017 1:51 AM
To: solr-user@lucene.apache.org
Subject: Re: help on implicit routing

http://lucene.apache.org/solr/guide/6_6/shards-and-indexing-data-in-solrcloud.html
 


--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

> On 6 Jul 2017, at 03:15, im...@elogic.pk wrote:
> 
> I am trying out the document routing feature in Solr 6.4.1. I am unable to 
> comprehend the documentation where it states that 
> “The 'implicit' router does not
> automatically route documents to different
> shards.  Whichever shard you indicate on the
> indexing request (or within each document) will
> be used as the destination for those documents”
> 
> How do you specify the shard inside a document? E.g. if I have a basic 
> collection with two shards called day_1 and day_2, what value should be 
> populated in the router field to ensure the document is routed to the 
> respective shard?
> 
> Regards,
> Imran
> 
> Sent from Mail for Windows 10
> 
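For reference, two ways (as far as I recall from the referenced guide) to indicate the shard with the implicit router, using the day_1/day_2 names from the question. The collection name, field name, host and port below are made up for illustration:

    # If the collection was created with router.name=implicit and a router.field
    # (here assumed to be "shard_s"), the value of that field names the target shard:
    curl 'http://localhost:8983/solr/mycoll/update?commit=true' \
         -H 'Content-Type: application/json' \
         -d '[{"id":"doc1","shard_s":"day_1","title_s":"lands on day_1"}]'

    # Without a router.field, the shard can be named on the request itself
    # via the _route_ parameter:
    curl 'http://localhost:8983/solr/mycoll/update?commit=true&_route_=day_1' \
         -H 'Content-Type: application/json' \
         -d '[{"id":"doc2","title_s":"also lands on day_1"}]'

    # bin/post can append extra request parameters via -params, so something
    # like the following may work as well (untested):
    # bin/post -c mycoll -params "_route_=day_1" docs.json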




RE: ZooKeeper transaction logs

2017-07-09 Thread Avi Steiner
Thanks for the info, Sean.
Can I do it on a Windows server?

-Original Message-
From: Xie, Sean [mailto:sean@finra.org]
Sent: Sunday, July 9, 2017 7:33 PM
To: solr-user@lucene.apache.org
Subject: Re: ZooKeeper transaction logs

You can try running the purge manually to see if it is working: 
org.apache.zookeeper.server.PurgeTxnLog.

And use a cron job to do the cleanup.
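For reference, a manual purge is just an invocation of that class, so something along these lines should also work on Windows (paths, jar version and install location are assumptions), scheduled with Task Scheduler instead of cron:

    REM Keep the 3 newest snapshots and delete older snapshots/transaction logs.
    REM Run from the ZooKeeper install dir; -n must be at least 3.
    java -cp "zookeeper-3.4.6.jar;lib\*" org.apache.zookeeper.server.PurgeTxnLog ^
         C:\zookeeper\data C:\zookeeper\data -n 3

The first path is the dataLogDir (where the log.* files live) and the second the snapshot dir; with the zoo.cfg below they are the same directory.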


On 7/9/17, 11:07 AM, "Avi Steiner"  wrote:

Hello

I'm using Zookeeper 3.4.6

The ZK log data folder keeps growing with transaction logs files (log.*).

I set the following in zoo.cfg:
autopurge.purgeInterval=1
autopurge.snapRetainCount=3
dataDir=..\\data

Per ZK log, it reads those parameters:

2017-07-09 17:44:59,792 [myid:] - INFO  [main:DatadirCleanupManager@78] - 
autopurge.snapRetainCount set to 3
2017-07-09 17:44:59,792 [myid:] - INFO  [main:DatadirCleanupManager@79] - 
autopurge.purgeInterval set to 1

It also says that cleanup process is running:

2017-07-09 17:44:59,792 [myid:] - INFO  
[PurgeTask:DatadirCleanupManager$PurgeTask@138] - Purge task started.
2017-07-09 17:44:59,823 [myid:] - INFO  
[PurgeTask:DatadirCleanupManager$PurgeTask@144] - Purge task completed.

But nothing is actually deleted.
A new file is created on every service restart.

The only parameter I managed to change is preAllocSize, which sets the 
minimum size per file. The default is 64 MB; I changed it to 10 KB just to 
watch the effect.





Re: Solr 6.5.1 crashing when too many queries with error or high memory usage are queried

2017-07-09 Thread Zheng Lin Edwin Yeo
I have found that it could likely be due to the hashJoin in the streaming
expression, as this will store all tuples in memory?
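For context, the general shape of the expression in question is something like the following (collection and field names are made up); the stream passed as hashed= is read fully into memory to build the hash table, so a large right-hand side will drive heap usage up:

    hashJoin(
      search(orders, q="*:*", fl="orderId,customerId", sort="customerId asc", qt="/export"),
      hashed=search(customers, q="*:*", fl="customerId,name", sort="customerId asc", qt="/export"),
      on="customerId"
    )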

I have more than 12 million documents in the collection which I am querying, in 1
shard. The index size of the collection is 45 GB.
Physical RAM of server: 384 GB
Java Heap: 22 GB
Typical search latency: 2 to 4 seconds

Regards,
Edwin


On 7 July 2017 at 16:46, Jan Høydahl  wrote:

> You have not told us how many documents you have, how many shards, how big
> the docs are, physical RAM, Java heap, what typical search latency is etc.
>
> If you have tried to squeeze too many docs into a single node it might get
> overloaded faster, thus sharding would help.
> If you return too much content (large fields that you won’t use) that may
> lower the max QPS for a node, so check that.
> If you are not using DocValues, faceting etc will take too much memory,
> but since you use streaming I guess you use Docvalues.
> There are products that you can put in front of Solr that can do rate
> limiting for you, such as https://getkong.org/ 
>
> You really need to debug what is the bottleneck in your case and try to
> fix that.
>
> Can you share your key numbers here so we can do a qualified guess?
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
>
> > On 2 Jul 2017, at 09:00, Zheng Lin Edwin Yeo wrote:
> >
> > Hi,
> >
> > I'm currently facing an issue whereby Solr crashes when I have issued
> > too many queries with errors or ones with high memory usage, like JSON
> > facets or streaming expressions.
> >
> > What could be the issue here?
> >
> > I'm using Solr 6.5.1
> >
> > Regards,
> > Edwin
>
>


Re: CDCR - how to deal with the transaction log files

2017-07-09 Thread Xie, Sean
Did another round of testing: the tlog on the target cluster is cleaned up once the 
hard commit is triggered. However, on the source cluster, the tlog files stay there 
and never get cleaned up.

Not sure if there is any command to run manually to trigger the 
updateLogSynchronizer. The updateLogSynchronizer is already set to run every 10 
seconds, but it doesn't seem to help.

Any help?

Thanks
Sean

On 7/8/17, 1:14 PM, "Xie, Sean"  wrote:

I have monitored the CDCR process for a while; the updates are actively 
sent to the target without a problem. However, the tlog size and file count are 
growing every day. Even when there are 0 updates to send, the tlog stays there:

The following is from the action=QUEUES command, and you can see that after about a 
month or so of running, the transaction logs have reached about 140K total 
files, about 103 GB in size.
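For reference, the request behind this output is the CDCR API QUEUES action against the source collection (host and port here are assumed):

    curl "http://source-host:8983/solr/MY_COLLECTION/cdcr?action=QUEUES"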



<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">465</int>
  </lst>
  <lst name="queues">
    <lst name="${TargetZk}">
      <lst name="MY_COLLECTION">
        <long name="queueSize">0</long>
        <str name="lastTimestamp">2017-07-07T23:19:09.655Z</str>
      </lst>
    </lst>
  </lst>
  <long name="tlogTotalSize">102740042616</long>
  <long name="tlogTotalCount">140809</long>
  <str name="updateLogSynchronizer">stopped</str>
</response>


Any help on it? Or do I need to configure something else? The CDCR 
configuration pretty much follows the wiki:

On target:

  <requestHandler name="/cdcr" class="solr.CdcrRequestHandler">
    <lst name="buffer">
      <str name="defaultState">disabled</str>
    </lst>
  </requestHandler>

  <requestHandler name="/update" class="solr.UpdateRequestHandler">
    <lst name="defaults">
      <str name="update.chain">cdcr-processor-chain</str>
    </lst>
  </requestHandler>

  <updateRequestProcessorChain name="cdcr-processor-chain">
    <processor class="solr.CdcrUpdateProcessorFactory"/>
    <processor class="solr.RunUpdateProcessorFactory"/>
  </updateRequestProcessorChain>

  <updateHandler class="solr.DirectUpdateHandler2">
    <updateLog class="solr.CdcrUpdateLog">
      <str name="dir">${solr.ulog.dir:}</str>
    </updateLog>
    <autoCommit>
      <maxTime>${solr.autoCommit.maxTime:18}</maxTime>
      <openSearcher>false</openSearcher>
    </autoCommit>
    <autoSoftCommit>
      <maxTime>${solr.autoSoftCommit.maxTime:3}</maxTime>
    </autoSoftCommit>
  </updateHandler>

On source:

  <requestHandler name="/cdcr" class="solr.CdcrRequestHandler">
    <lst name="replica">
      <str name="zkHost">${TargetZk}</str>
      <str name="source">MY_COLLECTION</str>
      <str name="target">MY_COLLECTION</str>
    </lst>
    <lst name="replicator">
      <str name="threadPoolSize">1</str>
      <str name="schedule">1000</str>
      <str name="batchSize">128</str>
    </lst>
    <lst name="updateLogSynchronizer">
      <str name="schedule">6</str>
    </lst>
  </requestHandler>

  <updateHandler class="solr.DirectUpdateHandler2">
    <updateLog class="solr.CdcrUpdateLog">
      <str name="dir">${solr.ulog.dir:}</str>
    </updateLog>
    <autoCommit>
      <maxTime>${solr.autoCommit.maxTime:18}</maxTime>
      <openSearcher>false</openSearcher>
    </autoCommit>
    <autoSoftCommit>
      <maxTime>${solr.autoSoftCommit.maxTime:3}</maxTime>
    </autoSoftCommit>
  </updateHandler>

Thanks.
Sean

On 7/8/17, 12:10 PM, "Erick Erickson"  wrote:

This should not be the case if you are actively sending updates to the
target cluster. The tlog is used to store unsent updates, so if the
connection is broken for some time, the target cluster will have a
chance to catch up.

If you don't have the remote DC online and do not intend to bring it
online soon, you should turn CDCR off.

Best,
Erick
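Regarding "turn CDCR off" above: a minimal sketch of stopping replication via the CDCR API (host, port and collection name assumed):

    curl "http://source-host:8983/solr/MY_COLLECTION/cdcr?action=STOP"

This only stops the replication threads; removing CDCR entirely also means taking the CDCR handler and the CdcrUpdateLog out of solrconfig.xml.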

On Fri, Jul 7, 2017 at 9:35 PM, Xie, Sean  wrote:
> Once CDCR is enabled, the update log stores an unlimited number of entries. 
> This is causing the tlog folder to get bigger and bigger, and the number of open 
> files to grow. How can one reduce the number of open files and also the number 
> of tlog files? If it's not taken care of properly, sooner or later the log 
> file size and open file count will exceed the limits.
>
> Thanks
> Sean
>
>






Re: index new discovered fileds of different types

2017-07-09 Thread Rick Leir

Jan

I hope this is not off-topic, but I am curious: if you do not use the 
three fields subject, predicate, and object for indexing RDF, 
then what is your algorithm? Maybe document nesting is appropriate for 
this? Cheers -- Rick



On 2017-07-09 05:52 PM, Jan Høydahl wrote:

Hi,

I have personally written a Python script to parse RDF files into an in-memory 
graph structure and then pull data from that structure to index to Solr.
I.e. you may perfectly well have RDF (nt, turtle, whatever) as source but index 
sub structures in very specific ways.
Anyway, as Erick points out, that’s probably where in your code you should 
use the Managed Schema REST API in order to:
1. Query Solr for what fields are defined
2. If you need to index a field that is not yet in Solr, add it, using the 
correct field type (your app should know)
3. Push the data
4. Repeat

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com


On 8 Jul 2017, at 02:36, Rick Leir wrote:

Thaer
Whoa, hold everything! You said RDF, meaning resource description framework? If 
so, you have exactly three fields: subject, predicate, and object. Maybe they 
are text type, or for exact matches you might want string fields. Add an ID 
field, which could be automatically generated by Solr, so now you have four 
fields. Or am I on a tangent again? Cheers -- Rick

On July 7, 2017 6:01:00 AM EDT, Thaer Sammar  wrote:

Hi Jan,

Thanks! I am exploring the schemaless option based on Furkan's
suggestion. I need the flexibility because not all fields are known. We get the
data from an RDF database (which changes continuously). To be more specific, we
have a database and all changes on it are sent to a Kafka queue, and we
have a consumer which listens to the queue and updates the Solr index.

regards,
Thaer

On 7 July 2017 at 10:53, Jan Høydahl  wrote:


If you do not need the flexibility of dynamic fields, don’t use them.
Sounds to me that you really want a field “price” to be float and a field
“birthdate” to be of type date etc.
If so, simply create your schema (either manually, through Schema API or
using schemaless) up front and index each field as correct type without
messing with field name prefixes.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com


On 5 Jul 2017, at 15:23, Thaer Sammar wrote:

Hi,
We are trying to index documents of different types. Documents have
different fields. Fields are known at indexing time. We run a query on a
database and we index what comes back, using query variables as field names in
Solr. Our current solution: we use dynamic fields with a prefix, for example
feature_i_*. The issues with that:
1) we need to define the type of the dynamic field, and to be able to
cover the types of the discovered fields we define the following:
feature_i_* for integers, feature_t_* for strings, feature_d_* for doubles, ...
1.a) this means we need to check the type of the discovered field and
then put it in the corresponding dynamic field
2) at search time, we need to know the right prefix
We are looking for help to find a way to ignore the prefix and still check
the type.

regards,
Thaer



--
Sorry for being brief. Alternate email is rickleir at yahoo dot com




Re: How to "chain" import handlers: import from DB and from file system

2017-07-09 Thread Walter Underwood
4. Write an external program that fetches the file, fetches the metadata, 
combines them, and sends them to Solr.
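For what it's worth, one possible shape for such a program is to let Solr do the Tika work via the extracting handler and carry the DB metadata along as literal.* fields. A rough sketch, assuming the extraction contrib (/update/extract) is enabled in solrconfig.xml; the core name, field names, id and file path are made up:

    # For each DB row: post the referenced file and attach its metadata as literals
    curl "http://localhost:8983/solr/mycore/update/extract?literal.id=doc1&literal.author_s=Jane&literal.category_s=reports&commit=true" \
         -F "file=@/data/files/report-2017.pdf"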

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Jul 9, 2017, at 3:03 PM, Giovanni De Stefano  wrote:
> 
> Hello all,
> 
> I have to index (and search) data organised as follows: many files on the 
> filesystem, and each file has extra metadata stored in a DB (the DB table has 
> a reference to the file path).
> 
> I think I should have 1 Solr document per file with fields coming from both 
> the DB (through DIH) and from Tika.
> 
> How do you suggest to proceed?
> 
> 1. index into different cores and search across cores (I would rather not do 
> that but I would be able to reuse “standard” importers)
> 2. extend the DIH (which one?)
> 3. implement a custom import handler
> 
> How would you do it?
> 
> Developing in Java is not a problem; I would just need some ideas on where to 
> start (I have been away from Solr for many years…).
> 
> Thanks!
> G.



How to "chain" import handlers: import from DB and from file system

2017-07-09 Thread Giovanni De Stefano
Hello all,

I have to index (and search) data organised as follows: many files on the 
filesystem, and each file has extra metadata stored in a DB (the DB table has a 
reference to the file path).

I think I should have 1 Solr document per file with fields coming from both the 
DB (through DIH) and from Tika.

How do you suggest to proceed?

1. index into different cores and search across cores (I would rather not do 
that but I would be able to reuse “standard” importers)
2. extend the DIH (which one?)
3. implement a custom import handler

How would you do it?

Developing in Java is not a problem; I would just need some ideas on where to 
start (I have been away from Solr for many years…).

Thanks!
G.

Re: index new discovered fileds of different types

2017-07-09 Thread Jan Høydahl
Hi,

I have personally written a Python script to parse RDF files into an in-memory 
graph structure and then pull data from that structure to index to Solr.
I.e. you may perfectly well have RDF (nt, turtle, whatever) as source but index 
sub structures in very specific ways.
Anyway, as Erick points out, that’s probably where in your code you should 
use the Managed Schema REST API in order to:
1. Query Solr for what fields are defined
2. If you need to index a field that is not yet in Solr, add it, using the 
correct field type (your app should know)
3. Push the data
4. Repeat
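A minimal sketch of steps 1 and 2 with the Schema API (collection name, field name and type are only examples; type names as in the stock Solr 6.x schema):

    # Step 1: is the field already defined? (HTTP 404 if not)
    curl "http://localhost:8983/solr/mycollection/schema/fields/price"

    # Step 2: if missing, add it with the correct type before pushing data
    curl -X POST -H "Content-type: application/json" \
         "http://localhost:8983/solr/mycollection/schema" \
         -d '{"add-field": {"name": "price", "type": "float", "stored": true}}'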

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

> On 8 Jul 2017, at 02:36, Rick Leir wrote:
> 
> Thaer
> Whoa, hold everything! You said RDF, meaning resource description framework? 
> If so, you have exactly three fields: subject, predicate, and object. Maybe 
> they are text type, or for exact matches you might want string fields. Add an 
> ID field, which could be automatically generated by Solr, so now you have 
> four fields. Or am I on a tangent again? Cheers -- Rick
> 
> On July 7, 2017 6:01:00 AM EDT, Thaer Sammar  wrote:
>> Hi Jan,
>> 
>> Thanks! I am exploring the schemaless option based on Furkan's
>> suggestion. I need the flexibility because not all fields are known. We get the
>> data from an RDF database (which changes continuously). To be more specific, we
>> have a database and all changes on it are sent to a Kafka queue, and we
>> have a consumer which listens to the queue and updates the Solr index.
>> 
>> regards,
>> Thaer
>> 
>> On 7 July 2017 at 10:53, Jan Høydahl  wrote:
>> 
>>> If you do not need the flexibility of dynamic fields, don’t use them.
>>> Sounds to me that you really want a field “price” to be float and a field
>>> “birthdate” to be of type date etc.
>>> If so, simply create your schema (either manually, through Schema API or
>>> using schemaless) up front and index each field as correct type without
>>> messing with field name prefixes.
>>> 
>>> --
>>> Jan Høydahl, search solution architect
>>> Cominvent AS - www.cominvent.com
>>> 
>>>> On 5 Jul 2017, at 15:23, Thaer Sammar wrote:
>>>> 
>>>> Hi,
>>>> We are trying to index documents of different types. Documents have
>>>> different fields. Fields are known at indexing time. We run a query on a
>>>> database and we index what comes back, using query variables as field names in
>>>> Solr. Our current solution: we use dynamic fields with a prefix, for example
>>>> feature_i_*. The issues with that:
>>>> 1) we need to define the type of the dynamic field, and to be able to
>>>> cover the types of the discovered fields we define the following:
>>>> feature_i_* for integers, feature_t_* for strings, feature_d_* for doubles, ...
>>>> 1.a) this means we need to check the type of the discovered field and
>>>> then put it in the corresponding dynamic field
>>>> 2) at search time, we need to know the right prefix
>>>> We are looking for help to find a way to ignore the prefix and still check
>>>> the type.
>>>> 
>>>> regards,
>>>> Thaer
>>> 
>>> 
> 
> -- 
> Sorry for being brief. Alternate email is rickleir at yahoo dot com



Re: ZooKeeper transaction logs

2017-07-09 Thread Xie, Sean
You can try running the purge manually to see if it is working: 
org.apache.zookeeper.server.PurgeTxnLog.

And use a cron job to do the cleanup.


On 7/9/17, 11:07 AM, "Avi Steiner"  wrote:

Hello

I'm using Zookeeper 3.4.6

The ZK log data folder keeps growing with transaction logs files (log.*).

I set the following in zoo.cfg:
autopurge.purgeInterval=1
autopurge.snapRetainCount=3
dataDir=..\\data

Per ZK log, it reads those parameters:

2017-07-09 17:44:59,792 [myid:] - INFO  [main:DatadirCleanupManager@78] - 
autopurge.snapRetainCount set to 3
2017-07-09 17:44:59,792 [myid:] - INFO  [main:DatadirCleanupManager@79] - 
autopurge.purgeInterval set to 1

It also says that cleanup process is running:

2017-07-09 17:44:59,792 [myid:] - INFO  
[PurgeTask:DatadirCleanupManager$PurgeTask@138] - Purge task started.
2017-07-09 17:44:59,823 [myid:] - INFO  
[PurgeTask:DatadirCleanupManager$PurgeTask@144] - Purge task completed.

But nothing is actually deleted.
A new file is created on every service restart.

The only parameter I managed to change is preAllocSize, which sets the 
minimum size per file. The default is 64 MB; I changed it to 10 KB just to 
watch the effect.





Streaming expressions and Jetty Host

2017-07-09 Thread Pratik Patel
Hi Everyone,

We are running Solr 6.4.1 in cloud mode on a CentOS production server.
Currently, we are using the embedded ZooKeeper. It is a simple setup with
one collection and one shard.

By default, the Jetty server binds to all interfaces, which is not safe, so we
have changed the bin/solr script. We have added "-Djetty.host=127.0.0.1" to
SOLR_START_OPTS so that it looks as follows.

 SOLR_START_OPTS=('-server' "${JAVA_MEM_OPTS[@]}" "${GC_TUNE[@]}"
"${GC_LOG_OPTS[@]}" \
"${REMOTE_JMX_OPTS[@]}" "${CLOUD_MODE_OPTS[@]}"
$SOLR_LOG_LEVEL_OPT -Dsolr.log.dir="$SOLR_LOGS_DIR" \
"-Djetty.host=127.0.0.1" "-Djetty.port=$SOLR_PORT"
"-DSTOP.PORT=$stop_port" "-DSTOP.KEY=$STOP_KEY" \
"${SOLR_HOST_ARG[@]}" "-Duser.timezone=$SOLR_TIMEZONE" \
"-Djetty.home=$SOLR_SERVER_DIR" "-Dsolr.solr.home=$SOLR_HOME"
"-Dsolr.install.dir=$SOLR_TIP" \
"${LOG4J_CONFIG[@]}" "${SOLR_OPTS[@]}")


We just found that with this change everything works fine in cloud mode
except for streaming expressions. With streaming expressions, we get the
following response:

org.apache.solr.client.solrj.SolrServerException: Server refused connection
at: http://:8081/solr/collection1_shard1_replica1


We don't get this error if we let the Jetty server bind to all interfaces. Any
idea what the problem is here?

Thanks,
Pratik


ZooKeeper transaction logs

2017-07-09 Thread Avi Steiner
Hello

I'm using Zookeeper 3.4.6

The ZK log data folder keeps growing with transaction logs files (log.*).

I set the following in zoo.cfg:
autopurge.purgeInterval=1
autopurge.snapRetainCount=3
dataDir=..\\data

Per ZK log, it reads those parameters:

2017-07-09 17:44:59,792 [myid:] - INFO  [main:DatadirCleanupManager@78] - 
autopurge.snapRetainCount set to 3
2017-07-09 17:44:59,792 [myid:] - INFO  [main:DatadirCleanupManager@79] - 
autopurge.purgeInterval set to 1

It also says that cleanup process is running:

2017-07-09 17:44:59,792 [myid:] - INFO  
[PurgeTask:DatadirCleanupManager$PurgeTask@138] - Purge task started.
2017-07-09 17:44:59,823 [myid:] - INFO  
[PurgeTask:DatadirCleanupManager$PurgeTask@144] - Purge task completed.

But nothing is actually deleted.
A new file is created on every service restart.

The only parameter I managed to change is preAllocSize, which sets the minimum 
size per file. The default is 64 MB; I changed it to 10 KB just to watch the 
effect.


