Re: Solr 7.2.1 DELETEREPLICA: NRT replica automatically appears

2018-03-07 Thread Greg Roodt
I'll check the logs when I'm back at my computer. Mostly errors about
failing to find the core spamming the logs, if I recall correctly.

The node never becomes active; it just spams the logs. The only way to remove it
is to stop Solr on the node and delete the replica via the API on another node.
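For reference, this is the shape of the call I'm using to remove the stuck
replica (host, collection and replica names are placeholders):

curl 'http://livenode:8983/solr/admin/collections?action=DELETEREPLICA&collection=mycollection&shard=shard1&replica=core_node5'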


On Thu, 8 Mar 2018 at 15:49, Tomas Fernandez Lobbe 
wrote:

> This shouldn’t be happening. Did you see anything related in the logs?
> Does the new NRT replica ever become active? Is there a new core created
> or do you just see the replica in the clusterstate?
>
> Tomas
>
> Sent from my iPhone
>
> > On Mar 7, 2018, at 8:18 PM, Greg Roodt  wrote:
> >
> > Hi
> >
> > I am running a cluster of TLOG and PULL replicas. When I call the
> > DELETEREPLICA api to remove a replica, the replica is removed, however, a
> > new NRT replica pops up in a down state in the cluster.
> >
> > Any ideas why?
> >
> > Greg
>


Re: Solr 7.2.1 DELETEREPLICA: NRT replica automatically appears

2018-03-07 Thread Tomas Fernandez Lobbe
This shouldn’t be happening. Did you see anything related in the logs? Does the 
new NRT replica ever become active? Is there a new core created or do you just 
see the replica in the clusterstate?

Tomas 

Sent from my iPhone

> On Mar 7, 2018, at 8:18 PM, Greg Roodt  wrote:
> 
> Hi
> 
> I am running a cluster of TLOG and PULL replicas. When I call the
> DELETEREPLICA api to remove a replica, the replica is removed, however, a
> new NRT replica pops up in a down state in the cluster.
> 
> Any ideas why?
> 
> Greg


Solr 7.2.1 DELETEREPLICA: NRT replica automatically appears

2018-03-07 Thread Greg Roodt
Hi

I am running a cluster of TLOG and PULL replicas. When I call the
DELETEREPLICA api to remove a replica, the replica is removed, however, a
new NRT replica pops up in a down state in the cluster.

Any ideas why?

Greg


Re: LTR not picking up modified features

2018-03-07 Thread Roopa ML
Thank you, I reloaded the collection and see that the change was picked up.

I had not seen a need to do this in my local environment, which is in non-cloud 
mode.

Regards 
Roopa

Sent from my iPhone

> On Mar 7, 2018, at 7:09 PM, Shawn Heisey  wrote:
> 
>> On 3/6/2018 12:57 PM, Roopa Rao wrote:
>> There was an error in one of the feature definitions in the Solr LTR
>> features.json file, so I modified it and uploaded it to Solr.  I can see that
>> the definition change is uploaded correctly using the feature store url such
>> as
>> 
>> http://servername/solr/techproducts/schema/feature-store/myFeatureStore
>> I checked the _schema_feature-store.json file and I see that the change is
>> present.
>> 
>> However, at run time it is picking up the old feature definition.
> 
> Did you reload the collection (SolrCloud mode) or core (standalone
> mode)?  Or restart all Solr instances with that index present?
> 
> Most of the time, if you don't reload or restart, then configuration
> changes will not take effect.  When using the config or schema APIs that
> change things on the fly, Solr does a reload in order to make changes
> effective.
> 
> Thanks,
> Shawn
> 


Re: Solr Read-Only?

2018-03-07 Thread Shawn Heisey
On 3/6/2018 2:08 PM, Terry Steichen wrote:
> Is it possible to run solr in a read-only directory?

Solr can be installed as a service on most operating systems other than
Windows.  A service installer script comes with the download.  It is
installed to run as an unprivileged user, "solr" by default.

The program directory (defaulting to /opt/solr-X.Y.Z, with a symlink at
/opt/solr pointing to the real directory) gets set up so it is owned by
root, so that directory *is* effectively read-only.

The "var dir" defaults to /var/solr and is fully writable by the solr
user.  The solr home defaults to /var/solr/data.

If you want the solr home to be read only, then you will need to turn
off all index locking in your solrconfig.xml files.  When locking is
enabled, which it is by default, Lucene *will* write to the index
directory at startup, and the core will fail to start if it's not able
to make that write.  On startup, it writes a lock file, not the index
itself.

https://lucene.apache.org/solr/guide/7_2/indexconfig-in-solrconfig.html#index-locks

Looks like the lockType "none" is not in the documentation, but I'm
pretty sure it's a value you can use.
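If it works as I expect, the change would be a sketch like this in the
indexConfig section of solrconfig.xml:

<indexConfig>
  <!-- no lock file is written; only safe if nothing will ever write to this index -->
  <lockType>none</lockType>
</indexConfig>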

I would strongly recommend *NOT* making the solr home read only,
*especially* if you're running in SolrCloud mode.

> The problem is that it's an all-or-nothing situation so everyone who's
> authorized access to the platform has, in effect, administrator
> privileges on solr.  I understand that authentication is coming, but
> that it isn't here yet.  (Or, to add complexity, I had to downgrade from
> 7.2.1 to 6.4.2 to overcome a new bug concerning indexing of eml files,
> and 6.4.2 definitely doesn't have authentication.)

Solr has authentication, and has had it for a very long time.  Basic
authentication became a workable feature in 5.3, but at that point it
required SolrCloud.  If you're running standalone mode instead of
SolrCloud, then you need version 6.5.0 to use the authentication plugin.
Is this what you mean when you say that 6.4.2 doesn't have
authentication?  One option that you DO have with 6.4.2 (and a number of
other earlier versions) is to configure authentication with Kerberos.
But this is a lot more involved than basic authentication.

If you are using Tika to index those emails, then you should not be
running Tika within Solr.  Eventually Tika is probably going to crash
when trying to read a document with a layout its authors have never seen
before, and when that happens, it'll take any other software (like Solr)
running in the same process down with it.

> Anyway, what I was wondering is if it might be possible to run solr not
> as me (the administrator), but as a user with lesser privileges so that
> no one who came through the SSH tunnel could (inadvertently or
> otherwise) screw up the indexes.

As of version 6.3, Solr will refuse to start if it's run as root,
without a special option to force it.  So this is already there.

https://issues.apache.org/jira/browse/SOLR-9547

I would definitely recommend installing the service so there is a
dedicated unprivileged user account for Solr.
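For reference, a sketch of running the service installer script that ships in
the download (substitute whatever version you actually run):

# extract just the installer from the archive, then run it as root
tar xzf solr-6.4.2.tgz solr-6.4.2/bin/install_solr_service.sh --strip-components=2
sudo bash ./install_solr_service.sh solr-6.4.2.tgz

That sets up the root-owned program directory and the unprivileged "solr" user
described above.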

Thanks,
Shawn



Re: LTR not picking up modified features

2018-03-07 Thread Shawn Heisey
On 3/6/2018 12:57 PM, Roopa Rao wrote:
> There was an error in one of the feature definitions in the Solr LTR
> features.json file, so I modified it and uploaded it to Solr.  I can see that
> the definition change is uploaded correctly using the feature store url such
> as
>
> http://servername/solr/techproducts/schema/feature-store/myFeatureStore
> I checked the _schema_feature-store.json file and I see that the change is
> present.
>
> However, at run time it is picking up the old feature definition.

Did you reload the collection (SolrCloud mode) or core (standalone
mode)?  Or restart all Solr instances with that index present?

Most of the time, if you don't reload or restart, then configuration
changes will not take effect.  When using the config or schema APIs that
change things on the fly, Solr does a reload in order to make changes
effective.
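Concretely, a sketch of both reload calls (collection/core names are
placeholders):

# SolrCloud: reload every replica of the collection
curl 'http://localhost:8983/solr/admin/collections?action=RELOAD&name=mycollection'

# Standalone: reload a single core
curl 'http://localhost:8983/solr/admin/cores?action=RELOAD&core=mycore'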

Thanks,
Shawn



Re: Replicate configoverlay.json

2018-03-07 Thread Shawn Heisey
On 3/6/2018 10:50 AM, Sundaram, Dinesh wrote:
> Can you please share the steps to replicate configoverlay.json from
> Master to Slave… in other words, how do we replicate from Master to
> Slave if any configuration is updated via the API?

If that file is in the same place as solrconfig.xml, then you would add
it to the "confFiles" parameter in the master replication config.  If it
gets saved somewhere else, then I don't know if it would be possible. 
I've never used the config overlay, but it sounds like it probably gets
saved in the conf directory along with the rest of the config files.

https://lucene.apache.org/solr/guide/6_6/index-replication.html#IndexReplication-ConfiguringtheReplicationRequestHandleronaMasterServer
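A sketch of what that might look like on the master, assuming
configoverlay.json really does live in conf/ next to solrconfig.xml:

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">commit</str>
    <!-- files listed here are copied to slaves along with the index -->
    <str name="confFiles">solrconfig.xml,managed-schema,configoverlay.json</str>
  </lst>
</requestHandler>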

Thanks,
Shawn



Re: Solr Warming Doubts

2018-03-07 Thread Shawn Heisey

On 3/7/2018 12:10 PM, Bsr wrote:

I guess I should increase autowarmCount. What should be the ideal number?
Also, is there any way by which I can know that autowarming is completed?


There are no generic answers.  You want autowarmCount to be large enough 
to be effective, but small enough that warming doesn't take a really 
long time.  There's no way I can tell you how long warming will take 
with a certain number in autowarmCount.  That will depend on the nature 
of your queries, what's in your index, and what the hardware is.


When the new searcher opens, that's when you will know that all warming 
is complete.  The cache stats will show how long it took for a specific 
cache to warm up when that instance of the cache was created.


In another message you asked this:

Can you elaborate more on newSearcher and cache, i.e. what should we set?

The newSearcher config defines queries to execute on *every* new 
searcher.  This is different from firstSearcher, in that firstSearcher 
defines queries that will be executed exactly once, when the core first 
starts up.  If you want to use newSearcher, it probably needs the same 
queries you currently have in firstSearcher.  If this is a warming 
issue, adding newSearcher will probably help.
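As a sketch, a newSearcher listener in solrconfig.xml looks like this (the
queries themselves are placeholders):

<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst><str name="q">some frequent query</str><str name="sort">price asc</str></lst>
    <lst><str name="q">*:*</str><str name="facet">true</str><str name="facet.field">category</str></lst>
  </arr>
</listener>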


The caches themselves have autowarming, which reads the top N queries 
from the cache from the old searcher and re-executes those queries on 
the new index to populate the cache in the new searcher.  This tends to 
produce better results than newSearcher, because the queries might be 
different, and will reflect what's actually IN the cache.
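That autowarming is configured per cache in solrconfig.xml; a sketch with
placeholder sizes:

<!-- autowarmCount: how many of the old searcher's top entries get re-executed -->
<filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="128"/>
<queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="32"/>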


You said this as well:  Also, there is no resource crunch with the Solr 
resources.


At the risk of being offensive:  How do you know?  I find that many 
people do not actually know how to detect resource issues with Solr, 
particularly with memory.  They look at their systems and conclude that 
everything's fine, even though there is nowhere near enough memory 
installed in the system for good performance.


Thanks,
Shawn



LTR not able to upload org.apache.solr.ltr.model.MultipleAdditiveTreesModel

2018-03-07 Thread Roopa Rao
Trying to upload a simple MultipleAdditiveTreesModel; however, I am getting an
error:
"msg":"org.apache.solr.ltr.model.ModelException: Model type does not exist
org.apache.solr.ltr.model.MultipleAdditiveTreesModel"

Root cause seems to be a syntax error in the model file?
I did copy this from the example
https://lucene.apache.org/solr/guide/6_6/learning-to-rank.html#LearningToRank-Examples

Made sure all the values are surrounded by quotes

Made sure features specified in the model are in the features file

Did anyone else face this? What was the resolution?

Thanks,
Roopa





--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


CDCR performance issues

2018-03-07 Thread Tom Peters
I'm having issues with the target collection staying up-to-date with indexing 
from the source collection using CDCR.
 
This is what I'm getting back in terms of OPS:

curl -s 'solr2-a:8080/solr/mycollection/cdcr?action=OPS' | jq .
{
  "responseHeader": {
    "status": 0,
    "QTime": 0
  },
  "operationsPerSecond": [
    "zook01,zook02,zook03/solr",
    [
      "mycollection",
      [
        "all",
        49.10140553500938,
        "adds",
        10.27612635309587,
        "deletes",
        38.82527896994054
      ]
    ]
  ]
}

The source and target collections are in separate data centers.

A network test between the leader node in the source data center and the 
ZooKeeper nodes in the target data center
shows decent enough network performance: ~181 Mbit/s.

I've tried playing around with the "batchSize" value (128, 512, 728, 1000, 
2000, 2500) and they haven't made much of a difference.
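For reference, those knobs live in the replicator section of the /cdcr handler
on the source; a sketch with the sort of values I've been varying:

<requestHandler name="/cdcr" class="solr.CdcrRequestHandler">
  <lst name="replica">
    <str name="zkHost">zook01:2181,zook02:2181,zook03:2181/solr</str>
    <str name="source">mycollection</str>
    <str name="target">mycollection</str>
  </lst>
  <lst name="replicator">
    <str name="threadPoolSize">2</str>
    <str name="schedule">10</str>
    <str name="batchSize">512</str>
  </lst>
</requestHandler>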

Any suggestions on potential settings to tune to improve the performance?

Thanks

--

Here are some relevant log lines from the source data center's leader:

2018-03-07 23:16:11.984 INFO  
(cdcr-replicator-207-thread-3-processing-n:solr2-a:8080_solr 
x:mycollection_shard1_replica_n6 s:shard1 c:mycollection r:core_node9) 
[c:mycollection s:shard1 r:core_node9 x:mycollection_shard1_replica_n6] 
o.a.s.h.CdcrReplicator Forwarded 511 updates to target mycollection
2018-03-07 23:16:23.062 INFO  
(cdcr-replicator-207-thread-4-processing-n:solr2-a:8080_solr 
x:mycollection_shard1_replica_n6 s:shard1 c:mycollection r:core_node9) 
[c:mycollection s:shard1 r:core_node9 x:mycollection_shard1_replica_n6] 
o.a.s.h.CdcrReplicator Forwarded 510 updates to target mycollection
2018-03-07 23:16:32.063 INFO  
(cdcr-replicator-207-thread-5-processing-n:solr2-a:8080_solr 
x:mycollection_shard1_replica_n6 s:shard1 c:mycollection r:core_node9) 
[c:mycollection s:shard1 r:core_node9 x:mycollection_shard1_replica_n6] 
o.a.s.h.CdcrReplicator Forwarded 511 updates to target mycollection
2018-03-07 23:16:36.209 INFO  
(cdcr-replicator-207-thread-1-processing-n:solr2-a:8080_solr 
x:mycollection_shard1_replica_n6 s:shard1 c:mycollection r:core_node9) 
[c:mycollection s:shard1 r:core_node9 x:mycollection_shard1_replica_n6] 
o.a.s.h.CdcrReplicator Forwarded 512 updates to target mycollection
2018-03-07 23:16:42.091 INFO  
(cdcr-replicator-207-thread-2-processing-n:solr2-a:8080_solr 
x:mycollection_shard1_replica_n6 s:shard1 c:mycollection r:core_node9) 
[c:mycollection s:shard1 r:core_node9 x:mycollection_shard1_replica_n6] 
o.a.s.h.CdcrReplicator Forwarded 512 updates to target mycollection
2018-03-07 23:16:46.790 INFO  
(cdcr-replicator-207-thread-3-processing-n:solr2-a:8080_solr 
x:mycollection_shard1_replica_n6 s:shard1 c:mycollection r:core_node9) 
[c:mycollection s:shard1 r:core_node9 x:mycollection_shard1_replica_n6] 
o.a.s.h.CdcrReplicator Forwarded 511 updates to target mycollection
2018-03-07 23:16:50.004 INFO  
(cdcr-replicator-207-thread-4-processing-n:solr2-a:8080_solr 
x:mycollection_shard1_replica_n6 s:shard1 c:mycollection r:core_node9) 
[c:mycollection s:shard1 r:core_node9 x:mycollection_shard1_replica_n6] 
o.a.s.h.CdcrReplicator Forwarded 512 updates to target mycollection


And what the log looks like in the target:

2018-03-07 23:18:46.475 INFO  (qtp1595212853-26) [c:mycollection s:shard1 
r:core_node2 x:mycollection_shard1_replica_n1] o.a.s.c.S.Request 
[mycollection_shard1_replica_n1]  webapp=/solr path=/update 
params={_stateVer_=mycollection:30&_version_=-1594317067896487950&cdcr.update=&wt=javabin&version=2}
 status=0 QTime=0
2018-03-07 23:18:46.500 INFO  (qtp1595212853-25) [c:mycollection s:shard1 
r:core_node2 x:mycollection_shard1_replica_n1] o.a.s.c.S.Request 
[mycollection_shard1_replica_n1]  webapp=/solr path=/update 
params={_stateVer_=mycollection:30&_version_=-1594317067896487951&cdcr.update=&wt=javabin&version=2}
 status=0 QTime=0
2018-03-07 23:18:46.525 INFO  (qtp1595212853-24) [c:mycollection s:shard1 
r:core_node2 x:mycollection_shard1_replica_n1] o.a.s.c.S.Request 
[mycollection_shard1_replica_n1]  webapp=/solr path=/update 
params={_stateVer_=mycollection:30&_version_=-1594317067897536512&cdcr.update=&wt=javabin&version=2}
 status=0 QTime=0
2018-03-07 23:18:46.550 INFO  (qtp1595212853-3793) [c:mycollection s:shard1 
r:core_node2 x:mycollection_shard1_replica_n1] o.a.s.c.S.Request 
[mycollection_shard1_replica_n1]  webapp=/solr path=/update 
params={_stateVer_=mycollection:30&_version_=-1594317067897536513&cdcr.update=&wt=javabin&version=2}
 status=0 QTime=0
2018-03-07 23:18:46.575 INFO  (qtp1595212853-30) [c:mycollection s:shard1 
r:core_node2 x:mycollection_shard1_replica_n1] o.a.s.c.S.Request 
[mycollection_shard1_replica_n1]  webapp=/solr path=/update 
params={_stateVer_=mycollection:30&_version_=-1594317067897536514&cdcr.update=&wt=javabin&version=2}
 status=0 QTime=0

RE: Solr Read-Only?

2018-03-07 Thread Phil Scadden
I would also second the proxy approach. Besides keeping your Solr instance 
behind a firewall and not directly exposed, you can do a lot in a proxy: 
per-user control over which indexes they can access, filtering of queries, etc.

-Original Message-
From: Emir Arnautović [mailto:emir.arnauto...@sematext.com]
Sent: Wednesday, 7 March 2018 10:19 p.m.
To: solr-user@lucene.apache.org
Subject: Re: Solr Read-Only?

Hi Terry,
Maybe you can try alternative approaches, like putting some proxy in front of 
Solr and configuring it to allow only certain URLs. Another option is to define 
a custom update request processor chain that does not include 
RunUpdateProcessorFactory - that will prevent accidental index updates.
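E.g. a minimal sketch of such a chain (the name is arbitrary; point your update
handler at it):

<updateRequestProcessorChain name="read-only">
  <processor class="solr.LogUpdateProcessorFactory"/>
  <!-- solr.RunUpdateProcessorFactory deliberately omitted, so nothing reaches the index -->
</updateRequestProcessorChain>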

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection Solr & Elasticsearch 
Consulting Support Training - http://sematext.com/



> On 6 Mar 2018, at 22:55, Terry Steichen  wrote:
>
> Chris,
>
> Thanks for your suggestion.  Restarting solr after an in-memory
> corruption is, of course, trivial (compared to rebuilding the indexes).
>
> Are there any solr directories that MUST be read/write (even with a
> pre-built index)?  Would it suffice (for my purposes) to make only the
> data/index directory R-O?
>
> Terry
>
>
> On 03/06/2018 04:20 PM, Christopher Schultz wrote:
>> Terry,
>>
>> On 3/6/18 4:08 PM, Terry Steichen wrote:
>>> Is it possible to run solr in a read-only directory?
>>
>>> I'm running it just fine on a ubuntu server which is accessible only
>>> through SSH tunneling.  At the platform level, this is fine:
>>> only authorized users can access it (via a browser on their machine
>>> accessing a forwarded port).
>>
>>> The problem is that it's an all-or-nothing situation so everyone
>>> who's authorized access to the platform has, in effect,
>>> administrator privileges on solr.  I understand that authentication
>>> is coming, but that it isn't here yet.  (Or, to add complexity, I
>>> had to downgrade from 7.2.1 to 6.4.2 to overcome a new bug
>>> concerning indexing of eml files, and 6.4.2 definitely doesn't have
>>> authentication.)
>>
>>> Anyway, what I was wondering is if it might be possible to run solr
>>> not as me (the administrator), but as a user with lesser privileges
>>> so that no one who came through the SSH tunnel could (inadvertently
>>> or otherwise) screw up the indexes.
>>
>> With shell access, the only protection you could provide would be
>> through file-permissions. But of course Solr will need to be
>> read-write in order to build the index in the first place. So you'd
>> probably have to run read-write at first, build the index (perhaps
>> that's already been done in the past), then (possibly) restart in
>> read-only mode.
>>
>> Read-only can be achieved by simply revoking write-access to the data
>> directories from the euid of the Solr process. Theoretically, you
>> could switch from being read-write to read-only merely by changing
>> file-permissions... no Solr restarts required.
>>
>> I'm not sure if it matters to you very much, but a user can still do
>> some damage to the index even if the "server" is read-only (through
>> file-permissions): they can issue a batch of DELETE or ADD requests
>> that will effect the in-memory copies of the index. It might be
>> temporary, but it might require that you restart the Solr instance to
>> get back to a sane state.
>>
>> Hope that helps,
>> -chris
>>
>



Re: What is creating certain fields?

2018-03-07 Thread Cassandra Targett
I'll guess you're using Solr 7.x and those fields in your schema were
created automatically?

As of Solr 7.0, the schemaless mode field guessing added a copyField rule
for any field that's guessed to be text to copy the first 256 characters to
a multivalued string field. The way it works is a field is created with the
type "text_general", and a copyField is then automatically created with the
dynamic field rule "*_str" to create the multivalued string field.

This came from https://issues.apache.org/jira/browse/SOLR-9526.

You can prohibit the behavior if you want to by removing the copyField rule
section. See the docs for where in the solrconfig.xml you will want to
edit:
https://lucene.apache.org/solr/guide/schemaless-mode.html#enable-field-class-guessing
.
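For orientation, the relevant piece sits in the add-unknown-fields-to-the-schema
update chain; a sketch based on my reading of the 7.x _default configset (other
typeMapping entries omitted):

<updateProcessor class="solr.AddSchemaFieldsUpdateProcessorFactory" name="add-schema-fields">
  <lst name="typeMapping">
    <str name="valueClass">java.lang.String</str>
    <str name="fieldType">text_general</str>
    <!-- removing this copyField block stops the automatic *_str fields -->
    <lst name="copyField">
      <str name="dest">*_str</str>
      <int name="maxChars">256</int>
    </lst>
    <bool name="default">true</bool>
  </lst>
</updateProcessor>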

Cassandra

On Wed, Mar 7, 2018 at 9:46 AM, Erick Erickson 
wrote:

> Maybe a copyField is realizing the dynamic fields?
>
>
> On Wed, Mar 7, 2018 at 7:43 AM, David Hastings
>  wrote:
> > those are dynamic fields.
> >
> > <dynamicField name="*_str" type="strings" docValues="true" indexed="false" stored="false"/>
> >
> >
> > On Wed, Mar 7, 2018 at 12:43 AM, Keith Dopson 
> wrote:
> >
> >> My default query produces this:
> >>
> >> {
> >> "id":"44419",
> >> "date":["11/13/17 13:18"],
> >> "url":["http://www.someurl.com";],
> >> "title":["some title"],
> >> "content":["some indexed content..."],
> >> "date_str":["11/13/17 13:18"],
> >> "url_str":["http://www.someurl.com";],
> >> "title_str":["some title"],
> >> "_version_":1594211356390719488,
> >> "content_str":["some indexed content.."]
> >> },
> >>
> >>
> >> In my managed_schema file, I only have five populated fields,
> >>
> >> <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
> >>
> >> <field name="date" type="text_general" indexed="true" stored="true"/>
> >> <field name="url" type="text_general" indexed="true" stored="true"/>
> >> <field name="title" type="text_general" indexed="true" stored="true"/>
> >> <field name="content" type="text_general" indexed="true" stored="true"/>
> >>
> >> While other fields are declared, none of them are populated by my "post"
> >> command.
> >>
> >> My question is "Where are the x_str fields coming from?
> >> I.e., what is producing the
> >> "date_str":["...
> >> "url_str":["...
> >> "title_str":["...
> >> "content_str":["...
> >>
> >> entries?
> >>
> >> Thanks in advance.
> >>
> >>
> >>
>


Re: Solr Warming Doubts

2018-03-07 Thread Bsr
I guess I should increase autowarmCount. What should be the ideal number?
Also, is there any way by which I can know that autowarming is completed?



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: CDCR Invalid Number on deletes

2018-03-07 Thread Chris Troullis
Hey Amrit, thanks for the reply!

I checked out SOLR-12036, but it doesn't look like it has to do with CDCR,
and the patch that is attached doesn't look CDCR related. Are you sure
that's the correct JIRA number?

Thanks,

Chris

On Wed, Mar 7, 2018 at 11:21 AM, Amrit Sarkar 
wrote:

> Hey Chris,
>
> I figured out a separate issue while working on CDCR which may relate to your
> problem. Please see jira: SOLR-12063. This is a bug that got introduced when
> we supported the bidirectional approach, where an extra flag in the tlog
> entry for cdcr is added.
>
> This part of the code is messing up:
> *UpdateLog.java.RecentUpdates::update()::*
>
> switch (oper) {
>   case UpdateLog.ADD:
>   case UpdateLog.UPDATE_INPLACE:
>   case UpdateLog.DELETE:
>   case UpdateLog.DELETE_BY_QUERY:
> Update update = new Update();
> update.log = oldLog;
> update.pointer = reader.position();
> update.version = version;
>
> if (oper == UpdateLog.UPDATE_INPLACE && entry.size() == 5) {
>   update.previousVersion = (Long) entry.get(UpdateLog.PREV_
> VERSION_IDX);
> }
> updatesForLog.add(update);
> updates.put(version, update);
>
> if (oper == UpdateLog.DELETE_BY_QUERY) {
>   deleteByQueryList.add(update);
> } else if (oper == UpdateLog.DELETE) {
>   deleteList.add(new DeleteUpdate(version,
> (byte[])entry.get(entry.size()-1)));
> }
>
> break;
>
>   case UpdateLog.COMMIT:
> break;
>   default:
> throw new SolrException(SolrException.ErrorCode.SERVER_ERROR,
> "Unknown Operation! " + oper);
> }
>
> deleteList.add(new DeleteUpdate(version, (byte[])entry.get(entry.size()
> -1)));
>
> is expecting the last entry to be the payload, but everywhere in the
> project, pos:[2] is the index for the payload, while in / after Solr 7.2 the
> last entry is a boolean denoting whether the update is cdcr-forwarded or
> typical. UpdateLog.java.RecentUpdates is used in cdcr sync and checkpoint
> operations, hence it is a legit bug that slipped the tests I wrote.
>
> The immediate fix patch is uploaded and I am awaiting feedback on that.
> Meanwhile if it is possible for you to apply the patch, build the jar and
> try it out, please do and let us know.
>
> For SOLR-9394, if you
> can comment on the JIRA and post the sample docs, solr logs and relevant
> information, I can give it a thorough look.
>
> Amrit Sarkar
> Search Engineer
> Lucidworks, Inc.
> 415-589-9269
> www.lucidworks.com
> Twitter http://twitter.com/lucidworks
> LinkedIn: https://www.linkedin.com/in/sarkaramrit2
> Medium: https://medium.com/@sarkaramrit2
>
> On Wed, Mar 7, 2018 at 1:35 AM, Chris Troullis 
> wrote:
>
> > Hi all,
> >
> > We recently upgraded to Solr 7.2.0 as we saw that there were some CDCR
> bug
> > fixes and features added that would finally let us be able to make use of
> > it (bi-directional syncing was the big one). The first time we tried to
> > implement we ran into all kinds of errors, but this time we were able to
> > get it mostly working.
> >
> > The issue we seem to be having now is that any time a document is deleted
> > via deleteById from a collection on the primary node, we are flooded with
> > "Invalid Number" errors followed by a random sequence of characters when
> > CDCR tries to sync the update to the backup site. This happens on all of
> > our collections where our id fields are defined as longs (some of them
> the
> > ids are compound keys and are strings).
> >
> > Here's a sample exception:
> >
> > org.apache.solr.client.solrj.impl.CloudSolrClient$RouteException: Error
> > from server at http://ip/solr/collection_shard1_replica_n1: Invalid
> > Number:  ]
> > -s
> > at
> > org.apache.solr.client.solrj.impl.CloudSolrClient.
> > directUpdate(CloudSolrClient.java:549)
> > at
> > org.apache.solr.client.solrj.impl.CloudSolrClient.
> > sendRequest(CloudSolrClient.java:1012)
> > at
> > org.apache.solr.client.solrj.impl.CloudSolrClient.
> > requestWithRetryOnStaleState(CloudSolrClient.java:883)
> > at
> > org.apache.solr.client.solrj.impl.CloudSolrClient.
> > requestWithRetryOnStaleState(CloudSolrClient.java:945)
> > at
> > org.apache.solr.client.solrj.impl.CloudSolrClient.
> > requestWithRetryOnStaleState(CloudSolrClient.java:945)
> > at
> > org.apache.solr.client.solrj.impl.CloudSolrClient.
> > requestWithRetryOnStaleState(CloudSolrClient.java:945)
> > at
> > org.apache.solr.client.solrj.impl.CloudSolrClient.
> > requestWithRetryOnStaleState(CloudSolrClient.java:945)
> > at
> > org.apache.solr.client.solrj.impl.CloudSolrClient.
> > requestWithRetryOnStaleState(CloudSolrClient.java:945)
> > at
> > org.apache.solr.client.solrj.impl.CloudSolrClient.request(
> > CloudSolrClient.java:816)
> > at
> > org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:194)
> > at
> > org.apache.solr.clien

Re: Solr Warming Doubts

2018-03-07 Thread Bsr
Hi

It's just after the full import completed (within 1-3 seconds).
Can you elaborate more on newSearcher and cache, i.e. what should we set?

Also, there is no resource crunch with the Solr resources.



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Solr Warming Doubts

2018-03-07 Thread Shawn Heisey

On 3/7/2018 8:37 AM, Bsr wrote:

Whenever I am running the full-import, my response time for some requests
increases from 80ms to 3000ms.


I'll start with the same question Emir asked: Is this *during* the 
import, or *after* it's done?


If it's during the import, then the machine is doing a very heavyweight 
operation -- indexing -- and you're doing queries at the same time.


If it's after the import, then a warming issue is more likely.  I find 
that cache auto-warming is a better way to do warming than 
firstSearcher/newSearcher.  FYI: the firstSearcher and useColdSearcher 
parameters only apply to the very first searcher that gets created when 
Solr initially starts.  You want either newSearcher or cache 
auto-warming, and possibly both.


Seeing that much of an increase in response times probably indicates 
that you don't have enough system resources for what you are asking Solr 
to do, especially if it's *during* import.  Memory is usually the 
resource that makes the most difference.


Thanks,
Shawn



Re: Solr 7.2.0 CDCR Issue with TLOG collections

2018-03-07 Thread Amrit Sarkar
Webster,

I updated the JIRA: SOLR-12057. CdcrUpdateProcessor
has a hack: it enables PEER_SYNC to bypass the leader logic in
DistributedUpdateProcessor.versionAdd, which eventually ends up in
segments not getting created.

I wrote a very dirty patch which fixes the problem with basic tests to
prove it works. I will try to polish and finish this as soon as possible.

Amrit Sarkar
Search Engineer
Lucidworks, Inc.
415-589-9269
www.lucidworks.com
Twitter http://twitter.com/lucidworks
LinkedIn: https://www.linkedin.com/in/sarkaramrit2
Medium: https://medium.com/@sarkaramrit2

On Tue, Mar 6, 2018 at 10:07 PM, Webster Homer 
wrote:

> It seems that this is a bug in Solr:
> https://issues.apache.org/jira/browse/SOLR-12057
>
> Hopefully it can be addressed soon!
>
> On Mon, Mar 5, 2018 at 4:14 PM, Webster Homer 
> wrote:
>
> > I noticed that the cdcr action=queues returns different results for the
> > target clouds. One target says that the updateLogSynchronizer is
> > stopped, the other says started. Why? What does that mean? We don't
> > explicitly set that anywhere.
> >
> >
> > {"responseHeader": {"status": 0,"QTime": 0},"queues": [],"tlogTotalSize":
> > 0,"tlogTotalCount": 0,"updateLogSynchronizer": "stopped"}
> >
> > and the other
> >
> > {"responseHeader": {"status": 0,"QTime": 0},"queues": [],"tlogTotalSize":
> > 22254206389,"tlogTotalCount": 2,"updateLogSynchronizer": "started"}
> >
> > The source is as follows:
> > {
> >   "responseHeader": {
> >     "status": 0,
> >     "QTime": 5
> >   },
> >   "queues": [
> >     "xxx-mzk01.sial.com:2181,xxx-mzk02.sial.com:2181,xxx-mzk03.sial.com:2181/solr",
> >     [
> >       "b2b-catalog-material-180124T",
> >       [
> >         "queueSize",
> >         0,
> >         "lastTimestamp",
> >         "2018-02-28T18:34:39.704Z"
> >       ]
> >     ],
> >     "yyy-mzk01.sial.com:2181,yyy-mzk02.sial.com:2181,yyy-mzk03.sial.com:2181/solr",
> >     [
> >       "b2b-catalog-material-180124T",
> >       [
> >         "queueSize",
> >         0,
> >         "lastTimestamp",
> >         "2018-02-28T18:34:39.704Z"
> >       ]
> >     ]
> >   ],
> >   "tlogTotalSize": 1970848,
> >   "tlogTotalCount": 1,
> >   "updateLogSynchronizer": "stopped"
> > }
> >
> >
> > On Fri, Mar 2, 2018 at 5:05 PM, Webster Homer 
> > wrote:
> >
> >> It looks like the data is getting to the target servers. I see tlog
> files
> >> with the right timestamps. Looking at the timestamps on the documents in
> >> the collection none of the data appears to have been loaded.
> >> In the solr.log I see lots of /cdcr messages
> action=LASTPROCESSEDVERSION,
> >>  action=COLLECTIONCHECKPOINT, and  action=SHARDCHECKPOINT
> >>
> >> no errors
> >>
> >> autoCommit is set to  6 I tried sending a commit explicitly no
> >> difference. cdcr is uploading data, but no new data appears in the
> >> collection.
> >>
> >> On Fri, Mar 2, 2018 at 1:39 PM, Webster Homer 
> >> wrote:
> >>
> >>> We have been having strange behavior with CDCR on Solr 7.2.0.
> >>>
> >>> We have a number of replicas which have identical schemas. We found
> that
> >>> TLOG replicas give much more consistent search results.
> >>>
> >>> We created a collection using TLOG replicas in our QA clouds.
> >>> We have a locally hosted solrcloud with 2 nodes, all our collections
> >>> have 2 shards. We use CDCR to replicate the collections from this
> >>> environment to 2 data centers hosted in Google cloud. This seems to
> work
> >>> fairly well for our collections with NRT replicas. However the new TLOG
> >>> collection has problems.
> >>>
> >>> The google cloud solrclusters have 4 nodes each (3 separate
> Zookeepers).
> >>> 2 shards per collection with 2 replicas per shard.
> >>>
> >>> We never see data show up in the cloud collections, but we do see tlog
> >>> files show up on the cloud servers. I can see that all of the servers
> have
> >>> cdcr started, buffers are disabled.
> >>> The cdcr source configuration is:
> >>>
> >>> "requestHandler":{"/cdcr":{
> >>>   "name":"/cdcr",
> >>>   "class":"solr.CdcrRequestHandler",
> >>>   "replica":[
> >>> {
> >>>   "zkHost":"xxx-mzk01.sial.com:2181,xxx-mzk02.sial.com:2181,xx
> >>> x-mzk03.sial.com:2181/solr",
> >>>   "source":"b2b-catalog-material-180124T",
> >>>   "target":"b2b-catalog-material-180124T"},
> >>> {
> >>>   "zkHost":"-mzk01.sial.com:2181,-mzk02.sial.com:2181,
> >>> -mzk03.sial.com:2181/solr",
> >>>   "source":"b2b-catalog-material-180124T",
> >>>   "target":"b2b-catalog-material-180124T"}],
> >>>   "replicator":{
> >>> "threadPoolSize":4,
> >>> "schedule":500,
> >>> "batchSize":250},
> >>>   "updateLogSynchronizer":{"schedule":6
> >>>
> >>> The target configurations in the 2 clouds are the same:
> >>> "requestHandler":{"/cdcr":{ "name":"/cdcr", "class":
> >>> "solr.CdcrRequestHandler", "buffer":{"defaultState":"disabled"}}}
> >>>
> >>> All of our collections have a timestamp field, index_date. In the
> source
> >>> collection all the records have a date of 

Re: Solr Warming Doubts

2018-03-07 Thread Emir Arnautović
Hi,
Is it during full import or after full import? If it is during, then it might 
mean that you don’t have enough resources or maybe GC is more active. You 
should monitor your system to see if there is some resource starvation.
What sort of queries are slower? Does it include faceting? Maybe you are 
missing doc values for those fields and you don’t warm up field cache.
What is your commit strategy? You commit at the end of full import or you have 
autocommit every X seconds?
Is your index static after full import or you update docs between two imports? 
If there are updates, do you see similar slow-down?

Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 7 Mar 2018, at 16:37, Bsr  wrote:
> 
> Whenever I am running the full-import, my response time for some requests
> increases from 80ms to 3000ms.
> 
> This must be indicating my poor choice of warming up.
> 
> *1. FirstSearcher*
> I have added some 10 frequently used queries, but all my autowarmCount values
> are set to 0. I have also added facets for warming.
> So if my autowarmCount=0, does this mean my queries are not getting cached?
> 
> *2. useColdSearcher = false*
> Despite reading many documents, I am not able to understand how it works
> after a full import (assuming this is not my first full-import).
> 
> *3. maxWarmingSearchers not defined in solrconfig.*
> 
> Am I doing anything wrong, as my autowarm is not working properly?
> 
> Note: I am using solr6.6.0
> 
> 
> 
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html



Re: CDCR Invalid Number on deletes

2018-03-07 Thread Amrit Sarkar
Hey Chris,

I figured out a separate issue while working on CDCR which may relate to your
problem. Please see jira: SOLR-12063. This is a
bug that got introduced when we supported the bidirectional approach, where an
extra flag in the tlog entry for cdcr is added.

This part of the code is messing up:
*UpdateLog.java.RecentUpdates::update()::*

switch (oper) {
  case UpdateLog.ADD:
  case UpdateLog.UPDATE_INPLACE:
  case UpdateLog.DELETE:
  case UpdateLog.DELETE_BY_QUERY:
Update update = new Update();
update.log = oldLog;
update.pointer = reader.position();
update.version = version;

if (oper == UpdateLog.UPDATE_INPLACE && entry.size() == 5) {
  update.previousVersion = (Long) entry.get(UpdateLog.PREV_VERSION_IDX);
}
updatesForLog.add(update);
updates.put(version, update);

if (oper == UpdateLog.DELETE_BY_QUERY) {
  deleteByQueryList.add(update);
} else if (oper == UpdateLog.DELETE) {
  deleteList.add(new DeleteUpdate(version,
(byte[])entry.get(entry.size()-1)));
}

break;

  case UpdateLog.COMMIT:
break;
  default:
throw new SolrException(SolrException.ErrorCode.SERVER_ERROR,
"Unknown Operation! " + oper);
}

deleteList.add(new DeleteUpdate(version, (byte[])entry.get(entry.size()-1)));

is expecting the last entry to be the payload, but everywhere in the
project, pos:[2] is the index for the payload, while in / after Solr 7.2 the
last entry is a boolean denoting whether the update is cdcr-forwarded or
typical. UpdateLog.java.RecentUpdates is used in cdcr sync and checkpoint
operations, hence it is a legit bug that slipped the tests I wrote.

The immediate fix patch is uploaded and I am awaiting feedback on that.
Meanwhile if it is possible for you to apply the patch, build the jar and
try it out, please do and let us know.
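Roughly, assuming a lucene-solr source checkout on the 7.x line (the patch
filename here is hypothetical):

git checkout branch_7_2
patch -p1 < SOLR-12063.patch
cd solr
ant server   # rebuilds the Solr jars and a runnable server under solr/server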

For SOLR-9394, if you
can comment on the JIRA and post the sample docs, solr logs and relevant
information, I can give it a thorough look.

Amrit Sarkar
Search Engineer
Lucidworks, Inc.
415-589-9269
www.lucidworks.com
Twitter http://twitter.com/lucidworks
LinkedIn: https://www.linkedin.com/in/sarkaramrit2
Medium: https://medium.com/@sarkaramrit2

On Wed, Mar 7, 2018 at 1:35 AM, Chris Troullis  wrote:

> Hi all,
>
> We recently upgraded to Solr 7.2.0 as we saw that there were some CDCR bug
> fixes and features added that would finally let us be able to make use of
> it (bi-directional syncing was the big one). The first time we tried to
> implement we ran into all kinds of errors, but this time we were able to
> get it mostly working.
>
> The issue we seem to be having now is that any time a document is deleted
> via deleteById from a collection on the primary node, we are flooded with
> "Invalid Number" errors followed by a random sequence of characters when
> CDCR tries to sync the update to the backup site. This happens on all of
> our collections where our id fields are defined as longs (some of them the
> ids are compound keys and are strings).
>
> Here's a sample exception:
>
> org.apache.solr.client.solrj.impl.CloudSolrClient$RouteException: Error
> from server at http://ip/solr/collection_shard1_replica_n1: Invalid
> Number:  ]
> -s
> at
> org.apache.solr.client.solrj.impl.CloudSolrClient.
> directUpdate(CloudSolrClient.java:549)
> at
> org.apache.solr.client.solrj.impl.CloudSolrClient.
> sendRequest(CloudSolrClient.java:1012)
> at
> org.apache.solr.client.solrj.impl.CloudSolrClient.
> requestWithRetryOnStaleState(CloudSolrClient.java:883)
> at
> org.apache.solr.client.solrj.impl.CloudSolrClient.
> requestWithRetryOnStaleState(CloudSolrClient.java:945)
> at
> org.apache.solr.client.solrj.impl.CloudSolrClient.
> requestWithRetryOnStaleState(CloudSolrClient.java:945)
> at
> org.apache.solr.client.solrj.impl.CloudSolrClient.
> requestWithRetryOnStaleState(CloudSolrClient.java:945)
> at
> org.apache.solr.client.solrj.impl.CloudSolrClient.
> requestWithRetryOnStaleState(CloudSolrClient.java:945)
> at
> org.apache.solr.client.solrj.impl.CloudSolrClient.
> requestWithRetryOnStaleState(CloudSolrClient.java:945)
> at
> org.apache.solr.client.solrj.impl.CloudSolrClient.request(
> CloudSolrClient.java:816)
> at
> org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:194)
> at
> org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:211)
> at
> org.apache.solr.handler.CdcrReplicator.sendRequest(
> CdcrReplicator.java:140)
> at
> org.apache.solr.handler.CdcrReplicator.run(CdcrReplicator.java:104)
> at
> org.apache.solr.handler.CdcrReplicatorScheduler.lambda$null$0(
> CdcrReplicatorScheduler.java:81)
> at
> org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.
> lambda$execute$0(ExecutorUtil.java:188)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker

Solr Warming Doubts

2018-03-07 Thread Bsr
Whenever I am running the full-import, my response time for some requests
increases from 80ms to 3000ms.

This must be indicating my poor choice of warming up.

1. FirstSearcher
I have added some 10 frequently used queries, but all my autowarmCount values
are set to 0. I have also added facets for warming.
So if my autowarmCount=0, does this mean my queries are not getting cached?

2. useColdSearcher = false
Despite reading many documents, I am not able to understand how it works
after a full import (assuming this is not my first full-import).

3. maxWarmingSearchers not defined in solrconfig.

Am I doing anything wrong, as my autowarm is not working properly?

Note: I am using solr6.6.0



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Solr Warming Up Doubts

2018-03-07 Thread Birender Rawat
Whenever I am running the full-import, my response time for some requests
increases from 80ms to 3000ms.

This must be indicating my poor choice of warming up.

*1. FirstSearcher* I have added some 2 frequently used queries, but all my
autowarmCount values are set to 0. I have also added facets for warming. So if
my autowarmCount=0, does this mean my queries are not getting cached?

*2. useColdSearcher = false* Despite reading many documents, I am not able
to understand how it works after a full import (assuming this is not my first
full-import).

*3. maxWarmingSearchers not defined in solrconfig.*

Am I doing anything wrong, as my autowarm is not working properly?

Note: I am using solr6.6.0


Re: What is creating certain fields?

2018-03-07 Thread Erick Erickson
Maybe a copyField is realizing the dynamic fields?


On Wed, Mar 7, 2018 at 7:43 AM, David Hastings
 wrote:
> those are dynamic fields.
>
> <dynamicField name="*_str" type="strings" docValues="true" indexed="false" stored="false"/>
>
>
> On Wed, Mar 7, 2018 at 12:43 AM, Keith Dopson  wrote:
>
>> My default query produces this:
>>
>> {
>> "id":"44419",
>> "date":["11/13/17 13:18"],
>> "url":["http://www.someurl.com";],
>> "title":["some title"],
>> "content":["some indexed content..."],
>> "date_str":["11/13/17 13:18"],
>> "url_str":["http://www.someurl.com";],
>> "title_str":["some title"],
>> "_version_":1594211356390719488,
>> "content_str":["some indexed content.."]
>> },
>>
>>
>> In my managed_schema file, I only have five populated fields,
>>
>> <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
>>
>> <field name="date" type="text_general" indexed="true" stored="true"/>
>> <field name="url" type="text_general" indexed="true" stored="true"/>
>> <field name="title" type="text_general" indexed="true" stored="true"/>
>> <field name="content" type="text_general" indexed="true" stored="true"/>
>>
>> While other fields are declared, none of them are populated by my "post"
>> command.
>>
>> My question is "Where are the x_str fields coming from?
>> I.e., what is producing the
>> "date_str":["...
>> "url_str":["...
>> "title_str":["...
>> "content_str":["...
>>
>> entries?
>>
>> Thanks in advance.
>>
>>
>>


Re: Solr dih extract text from inline images in pdf

2018-03-07 Thread Charlie Hull

On 07/03/2018 13:29, lala wrote:

Thanks Charlie...
It's just confusing for me. In the DIH configuration file, for the inner entity
that takes "TikaEntityProcessor" as its processor, I can easily specify a
tikaConfig attribute pointing to an xml file located inside the config folder of
the core, and in this file I should be able to override the PDFParser
default properties... As in parseContext.Config...
The thing is that I placed my tika-config.xml file in the config folder and
set the "tikaConfig" attribute = "tika-config.xml"... But Tika is still not
parsing images inside the PDF file!
Let's say this is just experimenting with Solr DIH crawling... Why is it not
working?

This is my tika-config.xml file:

<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <param name="extractInlineImages" type="bool">true</param>
        <param name="sortByPosition" type="bool">true</param>
      </params>
    </parser>
  </parsers>
</properties>

I've read the code in both TikaEntityProcessor and TikaConfig... It should
read the xml file from the config folder, extract the params and override the
original PDFParser attributes. But it DOESN'T! Any idea?
Any Idea??


Hi,

My reading of 
https://tika.apache.org/1.17/configuring.html#Using_a_Tika_Configuration_XML_file 
indicates that your custom PDF parser may not run unless you explicitly exclude 
PDFs from the default parser, which I don't think you're doing above.


I'm not an expert on Tika configuration, but I think you should first 
try this xml file with standalone Tika and see if it does what you think 
it should. Once you're sure, then try it with DIH or SolrJ.
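A sketch of that standalone check with tika-app (I believe --config is
supported in recent Tika releases, but verify against your version):

java -jar tika-app-1.17.jar --config=tika-config.xml --text some-scanned.pdf

If the inline-image text shows up there but not via DIH, the problem is on the
Solr side.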


Cheers

Charlie




--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html




--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk


Re: What is creating certain fields?

2018-03-07 Thread David Hastings
those are dynamic fields.

  <dynamicField name="*_str" type="strings" docValues="true" indexed="false" stored="false"/>


On Wed, Mar 7, 2018 at 12:43 AM, Keith Dopson  wrote:

> My default query produces this:
>
> {
> "id":"44419",
> "date":["11/13/17 13:18"],
> "url":["http://www.someurl.com";],
> "title":["some title"],
> "content":["some indexed content..."],
> "date_str":["11/13/17 13:18"],
> "url_str":["http://www.someurl.com";],
> "title_str":["some title"],
> "_version_":1594211356390719488,
> "content_str":["some indexed content.."]
> },
>
>
> In my managed_schema file, I only have five populated fields,
>
> <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
>
> <field name="date" type="text_general" indexed="true" stored="true"/>
> <field name="url" type="text_general" indexed="true" stored="true"/>
> <field name="title" type="text_general" indexed="true" stored="true"/>
> <field name="content" type="text_general" indexed="true" stored="true"/>
>
> While other fields are declared, none of them are populated by my "post"
> command.
>
> My question is "Where are the x_str fields coming from?
> I.e., what is producing the
> "date_str":["...
> "url_str":["...
> "title_str":["...
> "content_str":["...
>
> entries?
>
> Thanks in advance.
>
>
>


Re: Solr dih extract text from inline images in pdf

2018-03-07 Thread Erick Erickson
You're missing Charlie's point, and if you read the blog I pointed you
to that point is reiterated.

DIH does the Tika processing on the Solr node that is _also_ indexing
documents and satisfying queries. Parsing a semi-structured document
(PDF in this case) consumes CPU cycles and memory, all _within_ the
Solr process. You can easily create an OOM problem on the Solr node if
someone drops, say, a 2G file in your directory structure and you
blithely send it to Solr via DIH.

Additionally there are so many variants of, say, the PDF "standard"
that some edge case somewhere can (and has) caused Tika to blow its
brains out. The Tika folks have done a marvelous job of fixing these
when they come up, but it's a never-ending battle.

If you do the Tika processing in your own Java process you isolate
your Solr's from these issues.

Up to you of course.
Erick

On Wed, Mar 7, 2018 at 5:39 AM, lala  wrote:
> I don't know what the problem is; when posting the message, the xml format
> inside the <params> element did not come through correctly. It should contain
> [<param name="extractInlineImages" type="bool">true</param>] AND
> [<param name="sortByPosition" type="bool">true</param>]...
>
>
>
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


What is creating certain fields?

2018-03-07 Thread Keith Dopson

My default query produces this:

{
"id":"44419",
"date":["11/13/17 13:18"],
"url":["http://www.someurl.com";],
"title":["some title"],
"content":["some indexed content..."],
"date_str":["11/13/17 13:18"],
"url_str":["http://www.someurl.com";],
"title_str":["some title"],
"_version_":1594211356390719488,
"content_str":["some indexed content.."]
},


In my managed_schema file, I only have five populated fields,

   <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />

   <field name="date" type="text_general" indexed="true" stored="true"/>
   <field name="url" type="text_general" indexed="true" stored="true"/>
   <field name="title" type="text_general" indexed="true" stored="true"/>
   <field name="content" type="text_general" indexed="true" stored="true"/>

While other fields are declared, none of them are populated by my "post" 
command.

My question is "Where are the x_str fields coming from?
I.e., what is producing the
"date_str":["...
"url_str":["...
"title_str":["...
"content_str":["...

entries?

Thanks in advance.




[ANNOUNCE] Apache Solr 6.6.3 released

2018-03-07 Thread Steve Rowe
7 March 2018, Apache Solr™ 6.6.3 available 

The Lucene PMC is pleased to announce the release of Apache Solr 6.6.3. 

Solr is the popular, blazing fast, open source NoSQL search platform from the 
Apache Lucene project. Its major features include powerful full-text search, 
hit highlighting, faceted search and analytics, rich document parsing, 
geospatial search, extensive REST APIs as well as parallel SQL. Solr is 
enterprise grade, secure and highly scalable, providing fault tolerant 
distributed search and indexing, and powers the search and navigation features 
of many of the world's largest internet sites. 

This release contains three bugfixes: 

* Disallow reference to external resources in DataImportHandler's dataConfig 
request parameter 
* Allow collections created with legacyCloud=true to be opened if 
legacyCloud=false 
* LeaderInitiatedRecoveryThread now retries on UnknownHostException 

The release is available for immediate download at: 

http://lucene.apache.org/solr/mirrors-solr-redir.html 

Please read CHANGES.txt for a detailed list of changes: 

https://lucene.apache.org/solr/6_6_3/changes/Changes.html 

Please report any feedback to the mailing lists 
(http://lucene.apache.org/solr/discussion.html) 

Note: The Apache Software Foundation uses an extensive mirroring 
network for distributing releases. It is possible that the mirror you 
are using may not have replicated the release yet. If that is the 
case, please try another mirror. This also goes for Maven access.

Re: Solr dih extract text from inline images in pdf

2018-03-07 Thread lala
I don't know what the problem is; when posting the message, the xml format
inside the <params> element did not come through correctly. It should contain
[<param name="extractInlineImages" type="bool">true</param>] AND
[<param name="sortByPosition" type="bool">true</param>]...



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html

Re: Solr dih extract text from inline images in pdf

2018-03-07 Thread lala
Thanks Charlie...
It's just confusing for me. In the DIH configuration file, for the inner entity
that takes "TikaEntityProcessor" as its processor, I can easily specify a
tikaConfig attribute pointing to an xml file located inside the config folder of
the core, and in this file I should be able to override the PDFParser
default properties... As in parseContext.Config...
The thing is that I placed my tika-config.xml file in the config folder and
set the "tikaConfig" attribute = "tika-config.xml"... But Tika is still not
parsing images inside the PDF file!
Let's say this is just experimenting with Solr DIH crawling... Why is it not
working?

This is my tika-config.xml file:

<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <param name="extractInlineImages" type="bool">true</param>
        <param name="sortByPosition" type="bool">true</param>
      </params>
    </parser>
  </parsers>
</properties>

I've read the code in both TikaEntityProcessor and TikaConfig... It should
read the xml file from the config folder, extract the params and override the
original PDFParser attributes. But it DOESN'T! Any idea?
Any Idea??



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Solr Collection Losing Leader

2018-03-07 Thread Aaryan Reddy
Folks, any suggestions here?

On Thu, Mar 1, 2018 at 12:28 PM, Aaryan Reddy 
wrote:

> Hello All,
>
> I am running into a frequent issue where the leader shard in SolrCloud
> stays active but is not acknowledged as "leader". This brings down the
> other replicas, as they go into recovery mode and eventually fail trying
> to sync up.
>
> The error seen in "solr.log" is below. { This is also similar to what is
> shared in this email thread (https://www.mail-archive.com/
> solr-user@lucene.apache.org/msg127969.html) }
>
> This has consumed a lot of time but I have not been able to get any direction
> here. Any help will be appreciated.
>
> Solr Version used : 5.5.2 { Comes packaged with HDP 2.5.3 }
> The index are being stored on HDFS.
>
> ==error==
>
> completed with http://node06.test.net:8984/solr/TEST_COLLECTION2_shard
>> 5_replica1/
>> 2018-02-21 20:41:10.148 INFO  (zkCallback-5-thread-4294-processing-n:
>> node04.test.net:8984_solr) [c:TEST_COLLECTION2 s:shard5 r:core_node1
>> 6 x:TEST_COLLECTION2_shard5_replica2] o.a.s.c.SyncStrategy http://no
>> de04.test.net:8984/solr/TEST_COLLECTION2_shard5_replica2/:  sync
>> completed with http://node17.test.net:8984/solr/TEST_COLLECTION2_shard
>> 5_replica3/
>> 2018-02-21 20:41:10.149 INFO  (zkCallback-5-thread-4294-processing-n:
>> node04.test.net:8984_solr) [c:TEST_COLLECTION2 s:shard5 r:core_node1
>> 6 x:TEST_COLLECTION2_shard5_replica2] o.a.s.c.ShardLeaderElectionContextBase
>> Creating leader registration node /collections/TEST_COLLECTION2/
>> leaders/sh
>> ard5/leader after winning as /collections/TEST_COLLECTION2/
>> leader_elect/shard5/election/171270658970051676-core_node16-n_001784
>> 2018-02-21 20:41:10.151 INFO  (zkCallback-5-thread-4294-processing-n:
>> node04.test.net:8984_solr) [c:TEST_COLLECTION2 s:shard5 r:core_node1
>> 6 x:TEST_COLLECTION2_shard5_replica2] o.a.s.c.u.RetryUtil Retry due to
>> Throwable, org.apache.zookeeper.KeeperException$NodeExistsException
>> KeeperErrorCode
>> = NodeExists
>> 2018-02-21 20:41:10.498 ERROR 
>> (recoveryExecutor-3-thread-55-processing-s:shard10
>> x:TEST_COLLECTION_shard10_replica3 c:TEST_COLLECTION
>> n:node04.test.net:8984_solr r:core_node59) [c:TEST_COLLECTION s:shard10
>> r:core_node59 x:TEST_COLLECTION_shard10_replica3]
>> o.a.s.c.RecoveryStrategy Error while trying to recover.
>> core=TEST_COLLECTION_shard10_replica3:org.apache.solr.common.SolrException:
>> No registered leader was found after waiting for 4000ms , collection:
>> TEST_COLLECTION slice: shard10
>> at org.apache.solr.common.cloud.ZkStateReader.getLeaderRetry(Zk
>> StateReader.java:626)
>> at org.apache.solr.common.cloud.ZkStateReader.getLeaderRetry(Zk
>> StateReader.java:612)
>> at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoverySt
>> rategy.java:306)
>> at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.
>> java:222)
>> at java.util.concurrent.Executors$RunnableAdapter.call(Executor
>> s.java:471)
>> at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>> at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolE
>> xecutor$1.run(ExecutorUtil.java:231)
>> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPool
>> Executor.java:1145)
>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoo
>> lExecutor.java:615)
>> at java.lang.Thread.run(Thread.java:744)
>> 2018-02-21 20:41:10.498 INFO  
>> (recoveryExecutor-3-thread-55-processing-s:shard10
>> x:TEST_COLLECTION_shard10_replica3 c:TEST_COLLECTION
>> n:node04.test.net:8984_solr r:core_node59) [c:TEST_COLLECTION s:shard10
>> r:core_node59 x:TEST_COLLECTION_shard10_replica3]
>> o.a.s.c.RecoveryStrategy Replay not started, or was not successful... still
>> buffering updates.
>> 2018-02-21 20:41:10.498 ERROR 
>> (recoveryExecutor-3-thread-55-processing-s:shard10
>> x:TEST_COLLECTION_shard10_replica3 c:TEST_COLLECTION
>> n:node04.test.net:8984_solr r:core_node59) [c:TEST_COLLECTION s:shard10
>> r:core_node59 x:TEST_COLLECTION_shard10_replica3]
>> o.a.s.c.RecoveryStrategy Recovery failed - trying again... (0)
>> 2018-02-21 20:41:10.498 INFO  
>> (recoveryExecutor-3-thread-55-processing-s:shard10
>> x:TEST_COLLECTION_shard10_replica3 c:TEST_COLLECTION
>> n:node04.test.net:8984_solr r:core_node59) [c:TEST_COLLECTION s:shard10
>> r:core_node59 x:TEST_COLLECTION_shard10_replica3]
>> o.a.s.c.RecoveryStrategy Wait [2.0] seconds before trying to recover again
>> (attempt=1)
>> 2018-02-21 20:41:10.928 INFO  (zkCallback-5-thread-4295-processing-n:
>> node04.test.net:8984_solr) [   ] o.a.s.c.c.ZkStateReader A cluster state
>> change: [WatchedEvent state:SyncConnected type:NodeDataChanged
>> path:/collections/TEST_COLLECTION3/state.json] for collection
>> [TEST_COLLECTION3] has occurred - updating... (live nodes size: [17])
>> 2018-02-21 20:41:10.928 INFO  (zkCallback-5-thread-4293-processing-n:
>> node04.test.net:8984_solr) [   ] o.a.s.c.c.ZkStateReader A cluster state
>> change: [W

Secure way to back up SolrCloud

2018-03-07 Thread Daniel Carrasco
Hello,

My question is whether there is any way to back up a Solr cluster even when all
replicas are "not synced"...

I'm using the api to create the dumps:
http://localhost:8983/solr/admin/collections?action=BACKUP&name=myBackupName&collection=myCollectionName&location=/path/to/my/shared/drive

But it is a lottery: most of the time a collection fails and is
impossible to back up until you restart the node that looks out of sync.

Error: {
  "responseHeader":{
"status":500,
"QTime":4057},
  "failure":{

"192.168.3.14:8983_solr":"org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:Error
from server at http://192.168.3.14:8983/solr: Failed to backup
core=myproducts_shard1_replica_n91 because
java.nio.file.NoSuchFileException:
/server/solr/solr-data/data/myproducts_shard1_replica_n91/data/index.20180212231626422/segments_5kjp"},
  "Operation backup caused
exception:":"org.apache.solr.common.SolrException:org.apache.solr.common.SolrException:
Could not backup all replicas",
  "exception":{
"msg":"Could not backup all replicas",
"rspCode":500},
  "error":{
"metadata":[
  "error-class","org.apache.solr.common.SolrException",
  "root-error-class","org.apache.solr.common.SolrException"],
"msg":"Could not backup all replicas",
"trace":"org.apache.solr.common.SolrException: Could not backup all
replicas\n\tat
org.apache.solr.handler.admin.CollectionsHandler.handleResponse(CollectionsHandler.java:306)\n\tat
org.apache.solr.handler.admin.CollectionsHandler.invokeAction(CollectionsHandler.java:243)\n\tat
org.apache.solr.handler.admin.CollectionsHandler.handleRequestBody(CollectionsHandler.java:221)\n\tat
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:177)\n\tat
org.apache.solr.servlet.HttpSolrCall.handleAdmin(HttpSolrCall.java:745)\n\tat
org.apache.solr.servlet.HttpSolrCall.handleAdminRequest(HttpSolrCall.java:726)\n\tat
org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:507)\n\tat
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:382)\n\tat
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:326)\n\tat
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1751)\n\tat
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582)\n\tat
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)\n\tat
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)\n\tat
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)\n\tat
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)\n\tat
org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)\n\tat
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)\n\tat
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)\n\tat
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)\n\tat
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)\n\tat
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119)\n\tat
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)\n\tat
org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335)\n\tat
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)\n\tat
org.eclipse.jetty.server.Server.handle(Server.java:534)\n\tat
org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:320)\n\tat
org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)\n\tat
org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283)\n\tat
org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:108)\n\tat
org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)\n\tat
org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)\n\tat
org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)\n\tat
org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)\n\tat
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)\n\tat
org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)\n\tat
java.lang.Thread.run(Thread.java:748)\n",
"code":500}}

After restarting the node that gives the error, the backup can be dumped
without any problem, but I can't restart every failing node each day just to
dump the data.
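
In case it's useful, this is how I'd at least find the suspect node from
SolrJ before launching the backup - just a sketch (assuming SolrJ 7.x; the
ZK host is illustrative), and in my case the broken replica sometimes
still reports as active:

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.cloud.DocCollection;
import org.apache.solr.common.cloud.Replica;

public class FindUnhealthyReplicas {
  public static void main(String[] args) throws Exception {
    try (CloudSolrClient cloud =
        new CloudSolrClient.Builder().withZkHost("192.168.3.14:2181").build()) {
      cloud.connect();
      DocCollection coll = cloud.getZkStateReader()
          .getClusterState().getCollection("myproducts");
      // Print every replica that is not ACTIVE before trying the backup
      for (Replica r : coll.getReplicas()) {
        if (r.getState() != Replica.State.ACTIVE) {
          System.out.println(r.getName() + " on " + r.getNodeName()
              + " is " + r.getState());
        }
      }
    }
  }
}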

Is there any way to do it?

Thanks!


-- 
_

  Daniel Carrasco Marín
  Ingeniería para la Innovación i2TIC, S.L.
  Tlf:  +34 911 12 32 84 Ext: 223
  www.i2tic.com
_


Re: Solr dih extract text from inline images in pdf

2018-03-07 Thread Charlie Hull

On 07/03/2018 09:32, lala wrote:

Thanks for your reply, Erick.

Actually I am using SolrJ to index files, among other operations with Solr,
but to index a large number of different kinds of files, I'm sending a DIH
request to Solr using the SolrJ API: FileListEntityProcessor with
TikaEntityProcessor...
Why not benefit from this technology if Solr offers it? It simplifies our
work tremendously...


It may simplify your work, but it isn't good practice. Tika has some 
heavy lifting to do to extract text from some formats, and you should 
consider how that load will affect Solr. We've often put Tika in a 
separate process for this reason.



Isn't there any way to extract inline images in PDF docs?


https://stackoverflow.com/questions/31303735/how-to-extract-images-from-a-file-using-apache-tika 
has some useful suggestions.
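
The usual Tika recipe - a rough sketch, assuming Tika 1.x with the
tika-parsers module on the classpath and Tesseract installed so the
extracted images can be OCRed; the file name is illustrative - is to
switch on inline image extraction in PDFParserConfig and let the same
parser recurse into the embedded images:

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.pdf.PDFParserConfig;
import org.apache.tika.sax.BodyContentHandler;

public class PdfInlineImages {
  public static void main(String[] args) throws Exception {
    AutoDetectParser parser = new AutoDetectParser();
    BodyContentHandler handler = new BodyContentHandler(-1); // no write limit

    // Ask the PDF parser to hand inline images to the embedded-document
    // machinery instead of skipping them.
    PDFParserConfig pdfConfig = new PDFParserConfig();
    pdfConfig.setExtractInlineImages(true);

    ParseContext context = new ParseContext();
    context.set(PDFParserConfig.class, pdfConfig);
    // Recurse into embedded docs; images go through OCR if Tesseract exists
    context.set(Parser.class, parser);

    try (InputStream in = Files.newInputStream(Paths.get("scanned.pdf"))) {
      parser.parse(in, handler, new Metadata(), context);
    }
    System.out.println(handler.toString());
  }
}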


Charlie


Waiting your reply, best regards...



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html




--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk


Re: Solr dih extract text from inline images in pdf

2018-03-07 Thread lala
Thanks for your reply, Erick.

Actually I am using SolrJ to index files, among other operations with Solr,
but to index a large number of different kinds of files, I'm sending a DIH
request to Solr using the SolrJ API: FileListEntityProcessor with
TikaEntityProcessor...
Why not benefit from this technology if Solr offers it? It simplifies our
work tremendously...
Isn't there any way to extract inline images in PDF docs?
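
(For reference, this is roughly how I send the import command from SolrJ -
just a sketch; the core URL is illustrative and assumes the default
/dataimport handler registration:)

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.params.ModifiableSolrParams;

public class TriggerDataImport {
  public static void main(String[] args) throws Exception {
    try (SolrClient client =
        new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build()) {
      ModifiableSolrParams params = new ModifiableSolrParams();
      params.set("qt", "/dataimport");  // route the request to the DIH handler
      params.set("command", "full-import");
      QueryResponse rsp = client.query(params);
      System.out.println(rsp.getResponse().get("status"));
    }
  }
}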

Waiting your reply, best regards...



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Solr Read-Only?

2018-03-07 Thread Emir Arnautović
Hi Terry,
Maybe you can try alternative approaches, like putting a proxy in front of
Solr and configuring it to allow only certain URLs. Another option is to
define a custom update request processor chain that does not include
RunUpdateProcessorFactory - that will prevent accidental index updates.
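
A minimal sketch of that second option for solrconfig.xml, assuming
standalone mode (the chain name is illustrative; updates are parsed and
logged but never reach the index because RunUpdateProcessorFactory is
left out):

<updateRequestProcessorChain name="read-only" default="true">
  <processor class="solr.LogUpdateProcessorFactory"/>
  <!-- solr.RunUpdateProcessorFactory deliberately omitted:
       nothing is ever written to the index -->
</updateRequestProcessorChain>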

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 6 Mar 2018, at 22:55, Terry Steichen  wrote:
> 
> Chris,
> 
> Thanks for your suggestion.  Restarting Solr after an in-memory
> corruption is, of course, trivial (compared to rebuilding the indexes).
> 
> Are there any solr directories that MUST be read/write (even with a
> pre-built index)?  Would it suffice (for my purposes) to make only the
> data/index directory R-O?
> 
> Terry
> 
> 
> On 03/06/2018 04:20 PM, Christopher Schultz wrote:
>> Terry,
>> 
>> On 3/6/18 4:08 PM, Terry Steichen wrote:
>>> Is it possible to run solr in a read-only directory?
>> 
>>> I'm running it just fine on a ubuntu server which is accessible
>>> only through SSH tunneling.  At the platform level, this is fine:
>>> only authorized users can access it (via a browser on their machine
>>> accessing a forwarded port).
>> 
>>> The problem is that it's an all-or-nothing situation so everyone
>>> who's authorized access to the platform has, in effect,
>>> administrator privileges on solr.  I understand that authentication
>>> is coming, but that it isn't here yet.  (Or, to add complexity, I
>>> had to downgrade from 7.2.1 to 6.4.2 to overcome a new bug
>>> concerning indexing of eml files, and 6.4.2 definitely doesn't have
>>> authentication.)
>> 
>>> Anyway, what I was wondering is if it might be possible to run solr
>>> not as me (the administrator), but as a user with lesser privileges
>>> so that no one who came through the SSH tunnel could (inadvertently
>>> or otherwise) screw up the indexes.
>> 
>> With shell access, the only protection you could provide would be
>> through file-permissions. But of course Solr will need to be
>> read-write in order to build the index in the first place. So you'd
>> probably have to run read-write at first, build the index (perhaps
>> that's already been done in the past), then (possibly) restart in
>> read-only mode.
>> 
>> Read-only can be achieved by simply revoking write-access to the data
>> directories from the euid of the Solr process. Theoretically, you
>> could switch from being read-write to read-only merely by changing
>> file-permissions... no Solr restarts required.
>> 
>> I'm not sure if it matters to you very much, but a user can still do
>> some damage to the index even if the "server" is read-only (through
>> file-permissions): they can issue a batch of DELETE or ADD requests
>> that will effect the in-memory copies of the index. It might be
>> temporary, but it might require that you restart the Solr instance to
>> get back to a sane state.
>> 
>> Hope that helps,
>> -chris
>> 
>