Re: Setting up to index multiple datastores

2017-03-03 Thread Daniel Miller

On 3/2/2017 5:14 PM, Shawn Heisey wrote:

On 3/2/2017 2:58 PM, Daniel Miller wrote:

I'm asking for some guidance on how I might
optimize Solr.

I use Solr for work.  I use Dovecot for personal domains.  I have not
used them together.  I probably should -- my personal mailbox is many
gigabytes and would benefit from a boost in search performance.
If using Thunderbird, searches on header fields like sender or subject 
don't change much. Body searches are an unbelievable difference.  And 
of course other clients, especially mobile clients, benefit tremendously.

What I don't know is:

1.  Is it possible to split the "indexes" (I'm still learning Solr
vocabulary) without creating separate "cores" (which to me means
separate Java instances)?
2.  Can these separate "indexes" be created on-demand - or do they
need to be explicitly created prior to use?

Here's a paragraph that hopefully clears up most confusion about Solr
terminology.  This is applicable to SolrCloud:

Collections are made up of one or more shards.  Shards are made up of
one or more replicas.  Each replica is a core.  One replica from each
shard is elected as the leader of that shard, and if there are multiple
replicas, the leader role can move between them in response to a change
in cluster state.

Further info: One Solr instance (JVM) can handle many cores.  SolrCloud
allows multiple Solr instances to coordinate with each other (via
ZooKeeper) and form a whole cluster.  Without SolrCloud, you have cores,
but no collections and no replicas.  Sharding is possible without
SolrCloud, but is handled mostly manually.
What I think I want is to create a single collection, with a 
shard/replica/core per user.  Or maybe I want a separate collection 
per user - which would again mean a single shard/replica/core.  But it 
seems like each shard/replica/core is a separate instance.


Without modifying Dovecot source, I can have it generate URLs like 
"http://solr.server.local:8983/solr/dovecot/" (which is what I do now) 
or maybe "http://solr.server.local:8983/solr/dovecot_user/" or even 
"http://solr.server.local:8983/solr/dovecot/dovecot_user".  But I'm not 
understanding how, if possible, I can have the indexes created 
appropriately to support such access.  The only examples I've seen use 
either separate ports or IPs for listeners.


One thing to note:  SolrCloud begins to have performance issues when the
number of collections in the cloud reaches the low hundreds.  It's not
going to scale very well with a collection per user or per mailbox
unless there aren't very many users.
At the moment, without digging into Dovecot code, it doesn't look like a 
per-mailbox option exists.  But per-user certainly does - and in my case 
I have less than 100 users so it shouldn't be an issue - if I get it to 
work.



Daniel



Re: How to update index after document expired.

2017-03-03 Thread XuQing Tan
On Fri, Mar 3, 2017 at 9:17 AM, Erick Erickson 
wrote:

> you'd have to copy/paste or petition to make
> DocExpirationUpdateProcessorFactory not final.
>

Yes, I copied DocExpirationUpdateProcessorFactory. An additional reason is
that our XML content from the external source already contains an expiration
Date, so I don't need to bother converting it to a TTL date-math expression
(with some update processor) and then converting it back to a Date - that's
just inefficient. I want to use the expires field directly.

The custom DocExpirationUpdateProcessorFactory in our case checks for expired
documents periodically (within 1 min) and refreshes them from the external
source if any are found. This way we achieve an almost instant refresh once a
doc expires, rather than waiting for another long-interval scheduled task.
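
For reference, this is roughly the kind of solrconfig.xml chain the stock
factory is configured with (only the "expires" field name and the 1-minute
period come from the description above; the chain name is made up, and in our
case the custom copy of the factory would take the place of the stock class):

  <updateRequestProcessorChain name="expire-and-refresh" default="true">
    <!-- stock expiration processor, driven by an existing "expires" date field,
         checked every 60 seconds -->
    <processor class="solr.processor.DocExpirationUpdateProcessorFactory">
      <str name="expirationFieldName">expires</str>
      <int name="autoDeletePeriodSeconds">60</int>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory"/>
    <processor class="solr.RunUpdateProcessorFactory"/>
  </updateRequestProcessorChain>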

  Thanks & Best Regards!

  ///
 (. .)
  ooO--(_)--Ooo
  |   Nick Tan   |
  


Solr 6.3.0, possible SYN flooding on port 8983. Sending cookies.

2017-03-03 Thread Yago Riveiro
Hello, 

I have this log in my dmesg: possible SYN flooding on port 8983. Sending
cookies.

The Solr instance (6.3.0) is not accepting any more HTTP connections.

I ran this: _lsof -nPi | grep \:8983 | wc -l_ and the number of connections to
port 8983 is about 14K in CLOSE_WAIT or ESTABLISHED state.

Any suggestion of what could be the reason?

Thanks,

/Yago






Re: solr warning - filling logs

2017-03-03 Thread Satya Marivada
There is nothing else running on the port that I am trying to use: 15101.
15102 works fine.

On Fri, Mar 3, 2017 at 2:25 PM Satya Marivada 
wrote:

> Dave and All,
>
> The below exception is not happening anymore when I change the startup
> port to something other than the one in the original startup. In the original
> startup, if I start without SSL enabled and then start up on the same port
> with SSL enabled, that is when this warning happens. But I really need to use
> the original port that I had. Any suggestion for getting around this?
>
> Thanks,
> Satya
>
> java.lang.IllegalArgumentException: No Authority for
> HttpChannelOverHttp@a01eef8{r=0,c=false,a=IDLE,uri=null}
> java.lang.IllegalArgumentException: No Authority
> at
> org.eclipse.jetty.http.HostPortHttpField.<init>(HostPortHttpField.java:43)
>
>
> On Sun, Feb 26, 2017 at 8:00 PM Dave  wrote:
>
> I don't know about your network setup, but a port scanner can sometimes be
> an IT security device that, well, scans ports looking to see if they're
> open.
>
> > On Feb 26, 2017, at 7:14 PM, Satya Marivada 
> wrote:
> >
> > May I ask about the port scanner running? Can you please elaborate?
> > Sure, will try to move out to external zookeeper
> >
> >> On Sun, Feb 26, 2017 at 7:07 PM Dave 
> wrote:
> >>
> >> You shouldn't use the embedded zookeeper with solr, it's just for
> >> development not anywhere near worthy of being out in production.
> Otherwise
> >> it looks like you may have a port scanner running. In any case don't use
> >> the zk that comes with solr
> >>
> >>> On Feb 26, 2017, at 6:52 PM, Satya Marivada  >
> >> wrote:
> >>>
> >>> Hi All,
> >>>
> >>> I have configured solr with SSL and enabled http authentication. It is
> >> all
> >>> working fine on the solr admin page, indexing and querying process. One
> >>> bothering thing is that it is filling up logs every second saying no
> >>> authority, I have configured host name, port and authentication
> >> parameters
> >>> right in all config files. Not sure, where is it coming from. Any
> >>> suggestions, please. Really appreciate it. It is with solr-6.3.0 cloud
> >> with
> >>> embedded zookeeper. Could it be some bug with solr-6.3.0 or am I
> missing
> >>> some configuration?
> >>>
> >>> 2017-02-26 23:32:43.660 WARN (qtp606548741-18) [c:plog s:shard1
> >>> r:core_node2 x:plog_shard1_replica1] o.e.j.h.HttpParser parse
> exception:
> >>> java.lang.IllegalArgumentException: No Authority for
> >>> HttpChannelOverHttp@6dac689d{r=0,c=false,a=IDLE,uri=null}
> >>> java.lang.IllegalArgumentException: No Authority
> >>> at
> >>>
> >>
> org.eclipse.jetty.http.HostPortHttpField.<init>(HostPortHttpField.java:43)
> >>> at org.eclipse.jetty.http.HttpParser.parsedHeader(HttpParser.java:877)
> >>> at org.eclipse.jetty.http.HttpParser.parseHeaders(HttpParser.java:1050)
> >>> at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:1266)
> >>> at
> >>>
> >>
> org.eclipse.jetty.server.HttpConnection.parseRequestBuffer(HttpConnection.java:344)
> >>> at
> >>>
> >>
> org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:227)
> >>> at org.eclipse.jetty.io
> >>> .AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273)
> >>> at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:95)
> >>> at
> >>
> org.eclipse.jetty.io.ssl.SslConnection.onFillable(SslConnection.java:186)
> >>> at org.eclipse.jetty.io
> >>> .AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273)
> >>> at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:95)
> >>> at org.eclipse.jetty.io
> >>> .SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
> >>> at
> >>>
> >>
> org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceAndRun(ExecuteProduceConsume.java:246)
> >>> at
> >>>
> >>
> org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:156)
> >>> at
> >>>
> >>
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:654)
> >>> at
> >>>
> >>
> org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:572)
> >>> at java.lang.Thread.run(Thread.java:745)
> >>
>
>


Re: copyField match, but how?

2017-03-03 Thread nbosecker
You're on the money, Chris. Thank you so much - I didn't even realize
"body" wasn't stored. Of course that is the reason!!







Re: copyField match, but how?

2017-03-03 Thread Chris Hostetter

: In my schema.xml, I have these copyFields:

you haven't shown us the field/fieldType definitions for any of those 
fields, so it's possible "simplex" was included in a field that is 
indexed=true but stored=false -- which is why you might be able to 
search on it, but not see it in the fields returned in a search. 

Wild guess...

: 

...perhaps your "body" field is stored=false, but contains "simplex" and 
was copied into alltext.
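
For illustration only -- the real definitions weren't posted, and the field
type here is a guess -- a schema.xml along these lines would behave exactly
that way:

  <!-- indexed but not stored: searchable, never returned with the document -->
  <field name="body"    type="text_general" indexed="true" stored="false"/>
  <field name="alltext" type="text_general" indexed="true" stored="false" multiValued="true"/>
  <copyField source="body" dest="alltext"/>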



-Hoss
http://www.lucidworks.com/


Re: copyField match, but how?

2017-03-03 Thread Alexandre Rafalovitch
I think you are not using default field, but rather eDismax field
definitions. Still you seem to be matching on alltext anyway.

What's the field definition? Did you check the index content with Maple or
with Admin Schema field content?

Regards,
   Alex


On 3 Mar 2017 5:07 PM, "nbosecker"  wrote:

I've got a confusing situation related to copyFields and search.

In my schema.xml, I have these copyFields:












and a defaultSearchField to the 'alltext' copyField:
alltext


In my index, this document with all these mapped fields - nothing to note
except that the word "*simplex*" is *NOT IN ANY OF THESE* :
"path": "Components/Analysis and Statistics/R
Statistics/Experimental Design/Design Mixture Experiment",
"folder": "Components/Analysis and Statistics/R
Statistics/Experimental Design",
"name": "Design Mixture Experiment",
"hostapplication": "Pro Client",
"purpose": "Designs a mixture (formulation) experiment using
Pipeline Pilot or R (DOE)",
"parameters": [
  "Design Type",
  "Ingredient Sum",
  "Number of Levels",
  "Centroid Dimension",
  "Ignore Properties",
  "Ingredient 1 Min Level",
  "Ingredient 1 Max Level",
  "Ingredient 2 Min Level",
  "Ingredient 2 Max Level",
  "Ingredient 3 Min Level",
  "Ingredient 3 Max Level",
  "Ingredient 4 Min Level",
  "Ingredient 4 Max Level",
  "Ingredient 5 Min Level",
  "Ingredient 5 Max Level",
  "Ingredient 6 Min Level",
  "Ingredient 6 Max Level",
  "Factors",
  "Ingredient 1",
  "Ingredient 2",
  "Ingredient 3",
  "Ingredient 4",
  "Ingredient 5",
  "Ingredient 6",
  "Constraints",
  "Constraint 1",
  "Constraint 2",
  "Filter Points",
  "Fill",
  "Factors from Input Data"
],
"components": [
  "Custom Filter (PilotScript)",
  "Custom Manipulator (PilotScript)",
  "Custom Manipulator (PilotScript)",
  "Custom Manipulator (PilotScript)",
  "Unmerge Data",
  "Custom Filter (PilotScript)",
  "Custom Filter (PilotScript)",
  "Custom Filter (PilotScript)",
  "Custom Manipulator (PilotScript)",
  "R Custom Script",
  "Custom Manipulator (PilotScript)"
],
"domain": [
  "Statistics"
],
"author": "BIOVIA"
  }

When I perform a debug search on "*simplex*" in the Solr Admin, it finds
this document as a match to the alltext copyField. *BUT HOW?*:
"debug": {
"rawquerystring": "simplex",
"querystring": "simplex",
"parsedquery": "(+DisjunctionMaxQuery((name:simplex^10.0 |
folder:simplex^5.0 | purpose:simplex^3.0 | alltext:simplex)~0.5)
())/no_coord",
"parsedquery_toString": "+(name:simplex^10.0 | folder:simplex^5.0 |
purpose:simplex^3.0 | alltext:simplex)~0.5 ()",
"explain": {
  "Components/Analysis and Statistics/R Statistics/Experimental
Design/Design Mixture Experiment": "\n0.039487615 = (MATCH) sum of:\n
0.039487615 = (MATCH) max plus 0.5 times others of:\n0.039487615 =
(MATCH) weight(alltext:simplex in 2191) [DefaultSimilarity], result of:\n
0.039487615 = score(doc=2191,freq=24.0 = termFreq=24.0\n), product of:\n
0.07139119 = queryWeight, product of:\n  7.225878 = idf(docFreq=11,
maxDocs=6068)\n  0.009879933 = queryNorm\n0.5531161 =
fieldWeight in 2191, product of:\n  4.8989797 = tf(freq=24.0), with
freq of:\n24.0 = termFreq=24.0\n  7.225878 =
idf(docFreq=11, maxDocs=6068)\n  0.015625 = fieldNorm(doc=2191)\n"
},
"QParser": "ExtendedDismaxQParser",



I'm probably missing something obvious, help!





copyField match, but how?

2017-03-03 Thread nbosecker
I've got a confusing situation related to copyFields and search.

In my schema.xml, I have these copyFields:












and a defaultSearchField to the 'alltext' copyField:
alltext


In my index, this document with all these mapped fields - nothing to note
except that the word "*simplex*" is *NOT IN ANY OF THESE* :
"path": "Components/Analysis and Statistics/R
Statistics/Experimental Design/Design Mixture Experiment",
"folder": "Components/Analysis and Statistics/R
Statistics/Experimental Design",
"name": "Design Mixture Experiment",
"hostapplication": "Pro Client",
"purpose": "Designs a mixture (formulation) experiment using
Pipeline Pilot or R (DOE)",
"parameters": [
  "Design Type",
  "Ingredient Sum",
  "Number of Levels",
  "Centroid Dimension",
  "Ignore Properties",
  "Ingredient 1 Min Level",
  "Ingredient 1 Max Level",
  "Ingredient 2 Min Level",
  "Ingredient 2 Max Level",
  "Ingredient 3 Min Level",
  "Ingredient 3 Max Level",
  "Ingredient 4 Min Level",
  "Ingredient 4 Max Level",
  "Ingredient 5 Min Level",
  "Ingredient 5 Max Level",
  "Ingredient 6 Min Level",
  "Ingredient 6 Max Level",
  "Factors",
  "Ingredient 1",
  "Ingredient 2",
  "Ingredient 3",
  "Ingredient 4",
  "Ingredient 5",
  "Ingredient 6",
  "Constraints",
  "Constraint 1",
  "Constraint 2",
  "Filter Points",
  "Fill",
  "Factors from Input Data"
],
"components": [
  "Custom Filter (PilotScript)",
  "Custom Manipulator (PilotScript)",
  "Custom Manipulator (PilotScript)",
  "Custom Manipulator (PilotScript)",
  "Unmerge Data",
  "Custom Filter (PilotScript)",
  "Custom Filter (PilotScript)",
  "Custom Filter (PilotScript)",
  "Custom Manipulator (PilotScript)",
  "R Custom Script",
  "Custom Manipulator (PilotScript)"
],
"domain": [
  "Statistics"
],
"author": "BIOVIA"
  }

When I perform a debug search on "*simplex*" in the Solr Admin, it finds
this document as a match to the alltext copyField. *BUT HOW?*:
"debug": {
"rawquerystring": "simplex",
"querystring": "simplex",
"parsedquery": "(+DisjunctionMaxQuery((name:simplex^10.0 |
folder:simplex^5.0 | purpose:simplex^3.0 | alltext:simplex)~0.5)
())/no_coord",
"parsedquery_toString": "+(name:simplex^10.0 | folder:simplex^5.0 |
purpose:simplex^3.0 | alltext:simplex)~0.5 ()",
"explain": {
  "Components/Analysis and Statistics/R Statistics/Experimental
Design/Design Mixture Experiment": "\n0.039487615 = (MATCH) sum of:\n 
0.039487615 = (MATCH) max plus 0.5 times others of:\n0.039487615 =
(MATCH) weight(alltext:simplex in 2191) [DefaultSimilarity], result of:\n 
0.039487615 = score(doc=2191,freq=24.0 = termFreq=24.0\n), product of:\n   
0.07139119 = queryWeight, product of:\n  7.225878 = idf(docFreq=11,
maxDocs=6068)\n  0.009879933 = queryNorm\n0.5531161 =
fieldWeight in 2191, product of:\n  4.8989797 = tf(freq=24.0), with
freq of:\n24.0 = termFreq=24.0\n  7.225878 =
idf(docFreq=11, maxDocs=6068)\n  0.015625 = fieldNorm(doc=2191)\n"
},
"QParser": "ExtendedDismaxQParser",



I'm probably missing something obvious, help!





Re: Partial Name matching

2017-03-03 Thread Alexandre Rafalovitch
That's a curse of too much info. Could you extract just the relevant
parts (field definitions, search configuration). And also explain
*) What you expected to see with a couple of examples
*) What you actually see
*) Why is that "wrong" for you

It is an interesting problem, but it is unusual enough that it can't be
answered without the details above.

Regards,
   Alex.

http://www.solr-start.com/ - Resources for Solr users, new and experienced


On 3 March 2017 at 15:49, Jackson  wrote:
> Hi,
>
> I have been using Solr 5.3 to search names against an index of 4.5
> million names.
>
> Our application uses Solr to search names with 2 options:
>
> Exact Match -
> - Uses phonetic, mis-spelled and re-arranged name search
>
> Partial Match -
> - Wanted to use partial matching, fuzzy and re-arranged name search to find
> part of the requested name within the Solr-indexed names
> and to get the output as per the % of matching set by the user.
>
> Partial should return the matched % in order of highest-matched names first.
>
>
> Require assistance in fixing the Partial Match, which is not getting the
> result as required.
>
> Would appreciate your assistance in addressing the above point.
>
> Best Regards,
>
> Jackson Vadakkan
> SMART Infotech
> Post Box 114954
> Abu Dhabi, United Arab Emirates.
> Phone : +971-2-6399411
> Mobile : +971-50-7724164
>


Re: solr warning - filling logs

2017-03-03 Thread Satya Marivada
Dave and All,

The below exception is not happening anymore when I change the startup port
to something other than the one in the original startup. In the original
startup, if I start without SSL enabled and then start up on the same port
with SSL enabled, that is when this warning happens. But I really need to use
the original port that I had. Any suggestion for getting around this?

Thanks,
Satya

java.lang.IllegalArgumentException: No Authority for
HttpChannelOverHttp@a01eef8{r=0,c=false,a=IDLE,uri=null}
java.lang.IllegalArgumentException: No Authority
at
org.eclipse.jetty.http.HostPortHttpField.<init>(HostPortHttpField.java:43)


On Sun, Feb 26, 2017 at 8:00 PM Dave  wrote:

> I don't know about your network setup, but a port scanner can sometimes be
> an IT security device that, well, scans ports looking to see if they're
> open.
>
> > On Feb 26, 2017, at 7:14 PM, Satya Marivada 
> wrote:
> >
> > May I ask about the port scanner running? Can you please elaborate?
> > Sure, will try to move out to external zookeeper
> >
> >> On Sun, Feb 26, 2017 at 7:07 PM Dave 
> wrote:
> >>
> >> You shouldn't use the embedded zookeeper with solr, it's just for
> >> development not anywhere near worthy of being out in production.
> Otherwise
> >> it looks like you may have a port scanner running. In any case don't use
> >> the zk that comes with solr
> >>
> >>> On Feb 26, 2017, at 6:52 PM, Satya Marivada  >
> >> wrote:
> >>>
> >>> Hi All,
> >>>
> >>> I have configured solr with SSL and enabled http authentication. It is
> >> all
> >>> working fine on the solr admin page, indexing and querying process. One
> >>> bothering thing is that it is filling up logs every second saying no
> >>> authority, I have configured host name, port and authentication
> >> parameters
> >>> right in all config files. Not sure, where is it coming from. Any
> >>> suggestions, please. Really appreciate it. It is with solr-6.3.0 cloud
> >> with
> >>> embedded zookeeper. Could it be some bug with solr-6.3.0 or am I
> missing
> >>> some configuration?
> >>>
> >>> 2017-02-26 23:32:43.660 WARN (qtp606548741-18) [c:plog s:shard1
> >>> r:core_node2 x:plog_shard1_replica1] o.e.j.h.HttpParser parse
> exception:
> >>> java.lang.IllegalArgumentException: No Authority for
> >>> HttpChannelOverHttp@6dac689d{r=0,c=false,a=IDLE,uri=null}
> >>> java.lang.IllegalArgumentException: No Authority
> >>> at
> >>>
> >>
> org.eclipse.jetty.http.HostPortHttpField.<init>(HostPortHttpField.java:43)
> >>> at org.eclipse.jetty.http.HttpParser.parsedHeader(HttpParser.java:877)
> >>> at org.eclipse.jetty.http.HttpParser.parseHeaders(HttpParser.java:1050)
> >>> at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:1266)
> >>> at
> >>>
> >>
> org.eclipse.jetty.server.HttpConnection.parseRequestBuffer(HttpConnection.java:344)
> >>> at
> >>>
> >>
> org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:227)
> >>> at org.eclipse.jetty.io
> >>> .AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273)
> >>> at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:95)
> >>> at
> >>
> org.eclipse.jetty.io.ssl.SslConnection.onFillable(SslConnection.java:186)
> >>> at org.eclipse.jetty.io
> >>> .AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273)
> >>> at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:95)
> >>> at org.eclipse.jetty.io
> >>> .SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
> >>> at
> >>>
> >>
> org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceAndRun(ExecuteProduceConsume.java:246)
> >>> at
> >>>
> >>
> org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:156)
> >>> at
> >>>
> >>
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:654)
> >>> at
> >>>
> >>
> org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:572)
> >>> at java.lang.Thread.run(Thread.java:745)
> >>
>


Re: Data Import Handler, also "Real Time" index updates

2017-03-03 Thread Alexandre Rafalovitch
Commit is index-global. So if you have overlapping timelines and a commit is
issued, it will affect all changes done up to that point.

So, the aliases may be better for you. You could potentially also reload a
core with changed solrconfig.xml settings, but that's heavy on caches.

Regards,
   Alex

On 3 Mar 2017 1:21 PM, "Sales" 
wrote:


>
> You have indicated that you have a way to avoid doing updates during the
> full import.  Because of this, you do have another option that is likely
> much easier for you to implement:  Set the "commitWithin" parameter on
> each update request.  This works almost identically to autoSoftCommit,
> but only after a request is made.  As long as there are never any of
> these updates during a full import, these commits cannot affect that
import.

I had attempted at least to say that there may be a few updates that happen
at the start of an import, so they occur while an import is happening, just
due to timing issues. Those will be detected and re-executed once the
import is done, though. But my question here is: if the update uses
commitWithin, does that only affect those updates that carry the
parameter, or does it then also soft commit the in-progress import? I
cannot guarantee that zero updates will be done, as there is a timing issue
at the very start of the import, so a few could cross over.

Adding commitWithin is fine. Just want to make sure those that might
execute for the first few seconds of an import don't kill anything.
>
> No matter what is happening, you should have autoCommit (not
> autoSoftCommit) configured with openSearcher set to false.  This will
> ensure transaction log rollover, without affecting change visibility.  I
> recommend a maxTime of one to five minutes for this.  You'll see 15
> seconds as the recommended value in many places.
>
> https://lucidworks.com/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/

Oh, we are fine with much longer, does not have to be instant. 10-15
minutes would be fine.

>
> Thanks
> Shawn
>


Re: Data Import Handler, also "Real Time" index updates

2017-03-03 Thread Sales

> 
> You have indicated that you have a way to avoid doing updates during the
> full import.  Because of this, you do have another option that is likely
> much easier for you to implement:  Set the "commitWithin" parameter on
> each update request.  This works almost identically to autoSoftCommit,
> but only after a request is made.  As long as there are never any of
> these updates during a full import, these commits cannot affect that import.

I had attempted at least to say that there may be a few updates that happen at 
the start of an import, so they occur while an import is happening, just due to 
timing issues. Those will be detected and re-executed once the import is done, 
though. But my question here is: if the update uses commitWithin, does that only 
affect those updates that carry the parameter, or does it then also soft commit 
the in-progress import? I cannot guarantee that zero updates will be done, as 
there is a timing issue at the very start of the import, so a few could cross 
over. 

Adding commitWithin is fine. Just want to make sure those that might execute 
for the first few seconds of an import don't kill anything. 
> 
> No matter what is happening, you should have autoCommit (not
> autoSoftCommit) configured with openSearcher set to false.  This will
> ensure transaction log rollover, without affecting change visibility.  I
> recommend a maxTime of one to five minutes for this.  You'll see 15
> seconds as the recommended value in many places.
> 
> https://lucidworks.com/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
>  
> 

Oh, we are fine with much longer, does not have to be instant. 10-15 minutes 
would be fine.

> 
> Thanks
> Shawn
> 



Re: Data Import Handler, also "Real Time" index updates

2017-03-03 Thread Shawn Heisey
On 3/3/2017 10:17 AM, Sales wrote:
> I am not sure how best to handle this. We use the data import handler to 
> re-sync all our data on a daily basis, takes 1-2 hours depending on system 
> load. It is set up to commit at the end, so, the old index remains until it’s 
> done, and, we lose no access while the import is happening.
>
> But, we now want to update certain fields in the index, but still regen 
> daily. So, it would seem we might need to autocommit, and, soft commit 
> potentially. When we enabled those, during the index, the data disappeared 
> since it kept soft committing during the import process, I see no way to 
> avoid soft commits during the import. But soft commits would appear to be 
> needed for the (non import) updates to the index. 
>
> I realize the import could happen while an update is done, but we can 
> actually avoid those. So, that is not an issue (one or two might go through, 
> but, we will redo those updates once the index is done, that part is all 
> handled.

Erick's solution of using aliases to swap a live index and a build index
is one very good way to go.  It does involve some additional complexity
that you may not be ready for.  Only you will know whether that's
something you can implement easily.  Collection aliasing was implemented
in Solr 4.2 by SOLR-4497, so 4.10 should definitely have it.

You have indicated that you have a way to avoid doing updates during the
full import.  Because of this, you do have another option that is likely
much easier for you to implement:  Set the "commitWithin" parameter on
each update request.  This works almost identically to autoSoftCommit,
but only after a request is made.  As long as there are never any of
these updates during a full import, these commits cannot affect that import.
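
As a rough sketch using the XML update format (the document fields and the
15-second value are made up for illustration):

  <add commitWithin="15000">  <!-- milliseconds; this add becomes visible within ~15s -->
    <doc>
      <field name="id">PROD-1234</field>
      <field name="price">9.99</field>
    </doc>
  </add>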

No matter what is happening, you should have autoCommit (not
autoSoftCommit) configured with openSearcher set to false.  This will
ensure transaction log rollover, without affecting change visibility.  I
recommend a maxTime of one to five minutes for this.  You'll see 15
seconds as the recommended value in many places.

https://lucidworks.com/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
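
In solrconfig.xml that would look something like this (the five-minute maxTime
is just one value from the range suggested above):

  <updateHandler class="solr.DirectUpdateHandler2">
    <autoCommit>
      <maxTime>300000</maxTime>           <!-- hard commit every 5 minutes -->
      <openSearcher>false</openSearcher>  <!-- rolls the tlog without changing visibility -->
    </autoCommit>
    <!-- no autoSoftCommit section: visibility comes from commitWithin on the updates -->
  </updateHandler>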

Thanks
Shawn



Re: Data Import Handler, also "Real Time" index updates

2017-03-03 Thread Sales

> On Mar 3, 2017, at 11:30 AM, Erick Erickson  wrote:
> 
> One way to handle this (presuming SolrCloud) is collection aliasing.
> You create two collections, c1 and c2. You then have two aliases. when
> you start "index" is aliased to c1 and "search" is aliased to c2. Now
> do your full import  to "index" (and, BTW, you'd be well advised to do
> at least a hard commit openSearcher=false during that time or you risk
> replaying all the docs in the tlog).
> 
> When the full import is done, switch the aliases so "search" points to c1 and
> "index" points to c2. Rinse. Repeat. Your client apps always use the same 
> alias,
> the alias switching makes whether c1 or c2 is being used transparent.
> By that I mean your user-facing app uses "search" and your indexing client
> uses "index".
> 
> You can now do your live updates to the "search" alias that has a soft
> commit set.
> Of course you have to have some mechanism for replaying all the live updates
> that came in when you were doing your full index into the "indexing"
> alias before
> you switch, but you say you have that handled.
> 
> Best,
> Erick
> 

Thanks. So, is this available on 4.10.4? 

If not, we used to gen another core, do the import, and swap cores, so this is 
possibly similar to collection aliases since, in the end, the client did not 
care. I don't see why that would not still work. Took a little effort to 
automate, but not much. 

Regarding the import and commit, we use readOnly in data-config.xml, so this 
sets autocommit the way I understand it. Not sure what happens with 
openSearcher though. If that is not sufficient, how would I do a hard commit with 
openSearcher=false during that time? Surely not by modifying the config file?

Re: Data Import Handler, also "Real Time" index updates

2017-03-03 Thread Erick Erickson
One way to handle this (presuming SolrCloud) is collection aliasing.
You create two collections, c1 and c2. You then have two aliases. when
you start "index" is aliased to c1 and "search" is aliased to c2. Now
do your full import  to "index" (and, BTW, you'd be well advised to do
at least a hard commit openSearcher=false during that time or you risk
replaying all the docs in the tlog).

When the full import is done, switch the aliases so "search" points to c1 and
"index" points to c2. Rinse. Repeat. Your client apps always use the same alias,
the alias switching makes whether c1 or c2 is being used transparent.
By that I mean your user-facing app uses "search" and your indexing client
uses "index".

You can now do your live updates to the "search" alias that has a soft
commit set.
Of course you have to have some mechanism for replaying all the live updates
that came in when you were doing your full index into the "indexing"
alias before
you switch, but you say you have that handled.

Best,
Erick

On Fri, Mar 3, 2017 at 9:22 AM, Alexandre Rafalovitch
 wrote:
> On 3 March 2017 at 12:17, Sales  
> wrote:
>> When we enabled those, during the index, the data disappeared since it kept 
>> soft committing during the import process,
>
> This part does not quite make sense. Could you expand on this "data
> disappeared" part to understand what the issue is.
>
> The main issue with "update" is that all fields (apart from pure
> copyField destinations) need to be stored, so the document can be
> reconstructed, updated, re-indexed. Perhaps you have something strange
> happening around that?
>
> Regards,
>Alex.
>
> 
> http://www.solr-start.com/ - Resources for Solr users, new and experienced


Re: Data Import Handler, also "Real Time" index updates

2017-03-03 Thread Sales
> 
> On Mar 3, 2017, at 11:22 AM, Alexandre Rafalovitch  wrote:
> 
> On 3 March 2017 at 12:17, Sales  
> wrote:
>> When we enabled those, during the index, the data disappeared since it kept 
>> soft committing during the import process,
> 
> This part does not quite make sense. Could you expand on this "data
> disappeared" part to understand what the issue is.
> 

So, the issue here is that the first step of the import handler is to erase all 
the data, so there are no products left in the index (it would appear, based on 
what we see, after the first soft commit). A search then returns no results at 
first, and an ever-increasing number of records while the import is happening. We 
have 6 million indexed products.

I can't find a way to stop soft commits during the import.

Re: What is the bottleneck for an optimise operation?

2017-03-03 Thread Erick Erickson
Well, historically, in the really old days, optimize made a major
difference. As Lucene evolved that difference grew smaller, and in
recent _anecdotal_ reports it's up to a 10% improvement in query
processing, with the usual caveats that there are a ton of variables
here, including and especially how frequently the index is updated.
And since yours is rarely updated, take that number with a huge grain
of salt. It's an open question whether a 10% increase (assuming that's
what you get) is noticeable. I mean, if your query latency is 250 ms,
who's going to notice? If it's 10 seconds, that's a different story...

There's a subtext here about whether your query load is high enough that
a 10% improvement can ripple, but that's only under pretty high query
loads.

Best,
Erick

On Fri, Mar 3, 2017 at 9:03 AM, Caruana, Matthew  wrote:
> We index rarely and in bulk as we’re an organisation that deals in enabling 
> access to leaked documents for journalists.
>
> The indexes are mostly static for 99% of the year. We only optimise after 
> reindexing due to schema changes or when
> we have a new leak.
>
> Our workflow is to index on a staging server, optimise then trigger 
> replication to a production instance of Solr. We cannot
> index straight to production as extracting text from documents is expensive 
> (lots of EC2 machines running Extract) and 
> we need
> to really hammer the Solr server with updates (up to 250 concurrent update 
> request at some times).
>
> I’ve never done benchmark tests, but it’s an interesting question. I always 
> worked on the assumption that if the optimise
> operation exists then there must be a reason. Also something tells me that 
> having your index spread over 70 files must be bad.
>
> The OOM error is certainly due to something else as it happens when we try 
> indexing text extracted from multi-gigabyte
> archives.
>
> On 3 Mar 2017, at 17:45, Erick Erickson <erickerick...@gmail.com> wrote:
>
> Matthew:
>
> What load testing have you done on optimized .vs. unoptimized indexes?
> Is there enough of a performance gain to be worth the trouble? Toke's
> indexes are pretty static, and in his situation it's worth the effort.
> Before spending a lot of cycles on making optimization
> work/understanding the ins and outs I'd really recommend you see if
> any performance gain is worth it ;)...
>
> And as I mentioned earlier, optimizing is unlikely to be related to
> OOMs during indexing. You never know of course
>
> Best,
> Erick
>
> On Fri, Mar 3, 2017 at 3:40 AM, Caruana, Matthew <mcaru...@icij.org> wrote:
> Thank you, you’re right - only one of the four cores is hitting 100%. This is 
> the correct answer. The bottleneck is CPU exacerbated by an absence of 
> parallelisation.
>
> On 3 Mar 2017, at 12:32, Toke Eskildsen <t...@kb.dk> wrote:
>
> On Thu, 2017-03-02 at 15:39 +, Caruana, Matthew wrote:
> Thank you. The question remains however, if this is such a hefty
> operation then why is it walking to the destination instead of
> running, so to speak?
>
> We only do optimize on an old Solr 4.10 setup, but for that we have
> plenty of experience. At least for single-shard, and at least for most
> of the work, optimize is a single-threaded process: It takes us ~8
> hours to optimize a ~900GB shard using SSDs, with 1 CPU-core at near
> 100% and the other ones not doing anything.
>
> The machine load number is a bit fuzzy, but if you do a top doing
> optimization, my guess is that you will see the same thing as we do:
> Only 1 CPU-core working.
> --
> Toke Eskildsen, Royal Danish Library
>
>


Re: Data Import Handler, also "Real Time" index updates

2017-03-03 Thread Alexandre Rafalovitch
On 3 March 2017 at 12:17, Sales  wrote:
> When we enabled those, during the index, the data disappeared since it kept 
> soft committing during the import process,

This part does not quite make sense. Could you expand on this "data
disappeared" part to understand what the issue is.

The main issue with "update" is that all fields (apart from pure
copyField destinations) need to be stored, so the document can be
reconstructed, updated, re-indexed. Perhaps you have something strange
happening around that?
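
A minimal schema.xml sketch of that rule, with placeholder field names:

  <!-- everything the documents actually carry is stored, so an atomic update
       can reconstruct, modify and re-index the whole document -->
  <field name="id"   type="string"       indexed="true" stored="true" required="true"/>
  <field name="name" type="text_general" indexed="true" stored="true"/>

  <!-- a pure copyField destination is the one place stored="false" stays safe -->
  <field name="alltext" type="text_general" indexed="true" stored="false" multiValued="true"/>
  <copyField source="name" dest="alltext"/>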

Regards,
   Alex.


http://www.solr-start.com/ - Resources for Solr users, new and experienced


Re: How to update index after document expired.

2017-03-03 Thread Erick Erickson
Right, you'd have to copy/paste or petition to make
DocExpirationUpdateProcessorFactory not final.
Although it was likely made that way for a good reason.

It does beg the question however, is this the right
thing to do? What you have here is some record
of when docs should expire. You had to know that
when you indexed the doc. Would an alternative be
to just "roll your own"? That is, you have some process
that checks your presumed system-of-record and updates
the docs that it finds there that are past their expiration
date? It could even delete those that have no newer
version in the system-of-record...

Not sure whether it'd be better or not, just thinking out
loud. Well, in print.

Erick

On Thu, Mar 2, 2017 at 9:53 PM, XuQing Tan  wrote:
> Solr gets the updated content from an external source (by calling a REST API
> which returns XML content).
> So my question is: how can I plug this logic
> into DocExpirationUpdateProcessorFactory, i.e. poll from the external source
> and update the index?
>
> For now I'm thinking of using a custom 'autoDeleteChainName'. I'm still
> experimenting with this - is it feasible?
>
> 
>   scheduled-delete-and-update
>   ...
> 
>
>   Thanks & Best Regards!
>
>   ///
>  (. .)
>   ooO--(_)--Ooo
>   |   Nick Tan   |
>   
>
> On Thu, Mar 2, 2017 at 7:36 PM, Alexandre Rafalovitch 
> wrote:
>
>> Where would Solr get the updated content? Do you mean would it poll
>> from external source to refresh? Then, no. And if it is pushed from
>> external sources to Solr, then you just replace it as normal.
>>
>> Not sure if I understand your use-case exactly.
>>
>> Regards,
>>Alex.
>> 
>> http://www.solr-start.com/ - Resources for Solr users, new and experienced
>>
>>
>> On 2 March 2017 at 22:29, XuQing Tan  wrote:
>> > Hi folks
>> >
>> > in our case, we have contents need to be refreshed periodically according
>> > to the TTL of each document.
>> >
>> > looks like DocExpirationUpdateProcessorFactory is a quite good fit
>> except
>> > that it does delete the document only, but no way to update the indexing
>> > with the new document.
>> >
>> > I don't see there's a way to hook into DocExpirationUpdateProcessorFactory
>> > for custom logic like get the document and update index. and even
>> > DocExpirationUpdateProcessorFactory is a final class.
>> >
>> > so just want to confirm with you, is there any existing solution for
>> this?
>> >
>> > otherwise, I might have our own copy of DocExpirationUpdateProcessorFactory
>> > with custom code.
>> >
>> >
>> >   Thanks & Best Regards!
>> >
>> >   ///
>> >  (. .)
>> >   ooO--(_)--Ooo
>> >   |   Nick Tan   |
>> >   
>>


Data Import Handler, also "Real Time" index updates

2017-03-03 Thread Sales
I am not sure how best to handle this. We use the data import handler to re-sync 
all our data on a daily basis; it takes 1-2 hours depending on system load. It is 
set up to commit at the end, so the old index remains until it's done, and we 
lose no access while the import is happening.

But we now want to update certain fields in the index, yet still regen daily. 
So it would seem we might need autoCommit, and potentially soft commits. 
When we enabled those, during the import the data disappeared, since it kept 
soft committing during the import process, and I see no way to avoid soft commits 
during the import. But soft commits would appear to be needed for the (non-
import) updates to the index. 

I realize the import could happen while an update is done, but we can actually 
avoid those. So that is not an issue (one or two might go through, but we 
will redo those updates once the index is done; that part is all handled).

So, what is the best way to handle "real time" updates (10-15 minutes is fine 
to see the updates in a searcher), yet also allow the data import handler to do a 
full clear and regen without losing products (what we index) during the import? 
We don't want searchers not seeing the data! Have not seen any techniques for 
this. 

Steve

Re: What is the bottleneck for an optimise operation?

2017-03-03 Thread Caruana, Matthew
We index rarely and in bulk as we’re an organisation that deals in enabling 
access to leaked documents for journalists.

The indexes are mostly static for 99% of the year. We only optimise after 
reindexing due to schema changes or when
we have a new leak.

Our workflow is to index on a staging server, optimise then trigger replication 
to a production instance of Solr. We cannot
index straight to production as extracting text from documents is expensive 
(lots of EC2 machines running Extract) and we 
need
to really hammer the Solr server with updates (up to 250 concurrent update 
request at some times).

I’ve never done benchmark tests, but it’s an interesting question. I always 
worked on the assumption that if the optimise
operation exists then there must be a reason. Also something tells me that 
having your index spread over 70 files must be bad.

The OOM error is certainly due to something else as it happens when we try 
indexing text extracted from multi-gigabyte
archives.

On 3 Mar 2017, at 17:45, Erick Erickson <erickerick...@gmail.com> wrote:

Matthew:

What load testing have you done on optimized .vs. unoptimized indexes?
Is there enough of a performance gain to be worth the trouble? Toke's
indexes are pretty static, and in his situation it's worth the effort.
Before spending a lot of cycles on making optimization
work/understanding the ins and outs I'd really recommend you see if
any performance gain is worth it ;)...

And as I mentioned earlier, optimizing is unlikely to be related to
OOMs during indexing. You never know of course

Best,
Erick

On Fri, Mar 3, 2017 at 3:40 AM, Caruana, Matthew <mcaru...@icij.org> wrote:
Thank you, you’re right - only one of the four cores is hitting 100%. This is 
the correct answer. The bottleneck is CPU exacerbated by an absence of 
parallelisation.

On 3 Mar 2017, at 12:32, Toke Eskildsen <t...@kb.dk> wrote:

On Thu, 2017-03-02 at 15:39 +, Caruana, Matthew wrote:
Thank you. The question remains however, if this is such a hefty
operation then why is it walking to the destination instead of
running, so to speak?

We only do optimize on an old Solr 4.10 setup, but for that we have
plenty of experience. At least for single-shard, and at least for most
of the work, optimize is a single-threaded process: It takes us ~8
hours to optimize a ~900GB shard using SSDs, with 1 CPU-core at near
100% and the other ones not doing anything.

The machine load number is a bit fuzzy, but if you do a top doing
optimization, my guess is that you will see the same thing as we do:
Only 1 CPU-core working.
--
Toke Eskildsen, Royal Danish Library




Re: Using solr-core-4.6.1.jar on solr-5.5.4 server

2017-03-03 Thread Erick Erickson
bq: " do you think there could be compatibility problems"

What API calls? Are you using some custom SolrJ from a
client?

This is _extremely_ risky IMO. Solr does not guarantee you can
do rolling upgrades across major versions for instance. And
the Solr<->Solr communications are through SolrJ (which I'm
assuming you are using when you say API calls).

So dropping an older jar file in the mix seems even riskier.

I don't recommend this. Your vendor has locked you in to
something proprietary, I'd ask them some hard questions about
why they haven't updated their offering in close to 2.5 years.

Best,
Erick

On Fri, Mar 3, 2017 at 1:47 AM, skasab2s  wrote:
> Hello,
>
> we want to update our Solr server to *solr-5.5.4*. For the API calls on this
> server, we can only use *solr-core-4.6.1.jar*, because it is maintained by
> an external vendor and we cannot upgrade the library ourselves. I tested
> this combination briefly and it seemed to work fine. Even so, do you think
> there could be compatibility problems? Is there a table showing which library
> version is compatible with which Solr server version?
>
> Many thanks in advance,
>
> Sven
>
>
>


Re: What is the bottleneck for an optimise operation?

2017-03-03 Thread Erick Erickson
Matthew:

What load testing have you done on optimized .vs. unoptimized indexes?
Is there enough of a performance gain to be worth the trouble? Toke's
indexes are pretty static, and in his situation it's worth the effort.
Before spending a lot of cycles on making optimization
work/understanding the ins and outs I'd really recommend you see if
any performance gain is worth it ;)...

And as I mentioned earlier, optimizing is unlikely to be related to
OOMs during indexing. You never know of course

Best,
Erick

On Fri, Mar 3, 2017 at 3:40 AM, Caruana, Matthew  wrote:
> Thank you, you’re right - only one of the four cores is hitting 100%. This is 
> the correct answer. The bottleneck is CPU exacerbated by an absence of 
> parallelisation.
>
>> On 3 Mar 2017, at 12:32, Toke Eskildsen  wrote:
>>
>> On Thu, 2017-03-02 at 15:39 +, Caruana, Matthew wrote:
>>> Thank you. The question remains however, if this is such a hefty
>>> operation then why is it walking to the destination instead of
>>> running, so to speak?
>>
>> We only do optimize on an old Solr 4.10 setup, but for that we have
>> plenty of experience. At least for single-shard, and at least for most
>> of the work, optimize is a single-threaded process: It takes us ~8
>> hours to optimize a ~900GB shard using SSDs, with 1 CPU-core at near
>> 100% and the other ones not doing anything.
>>
>> The machine load number is a bit fuzzy, but if you do a top doing
>> optimization, my guess is that you will see the same thing as we do:
>> Only 1 CPU-core working.
>> --
>> Toke Eskildsen, Royal Danish Library
>


Re: Joining across collections with Nested documents

2017-03-03 Thread Walter Underwood
Make two denormalized collections. Just don’t join at query time.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Mar 3, 2017, at 1:01 AM, Preeti Bhat  wrote:
> 
> We can't, they are being used for different purposes and we have a few cases 
> where we would need data from both.
> 
> 
> Thanks and Regards,
> Preeti Bhat
> 
> -Original Message-
> From: Walter Underwood [mailto:wun...@wunderwood.org]
> Sent: Friday, March 03, 2017 12:02 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Joining across collections with Nested documents
> 
> Make one collection with denormalized data. This looks like a relational, 
> multi-table schema in Solr. That will be slow and painful.
> 
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
> 
> 
>> On Mar 2, 2017, at 9:55 PM, Preeti Bhat  wrote:
>> 
>> Hi All,
>> 
>> I have two collections in SolrCloud, namely contact and company; they are in 
>> the same Solr instance. Company is a relatively simple document with id, Name, 
>> address etc... Coming over to Contact, this has nested documents like the one 
>> below. I would like to get the Company details using the "CompanyId" field 
>> in the child document by joining it to the "Company" collection's id 
>> field. Is this possible? Could someone please guide me on this?
>> 
>> {
>> id: "1"
>> , FirstName: "ABC"
>> , LastName: "BCD"
>> .
>> .
>> .
>> _childDocuments_:{
>> {
>> id:"123-1",
>> CompanyId: "123",
>> Email: "abc@smd.edu"
>> }
>> {
>> id:"124-1",
>> CompanyId: "124",
>> Email: "abc@smd.edu"
>> 
>> }
>> }
>> 
>> 
>> 
>> Thanks and Regards,
>> Preeti Bhat
>> 
>> 
>> 
>> 
>> 
> 
> 
> 
> 



Re: What is the bottleneck for an optimise operation?

2017-03-03 Thread Caruana, Matthew
Thank you, you’re right - only one of the four cores is hitting 100%. This is 
the correct answer. The bottleneck is CPU exacerbated by an absence of 
parallelisation.

> On 3 Mar 2017, at 12:32, Toke Eskildsen  wrote:
> 
> On Thu, 2017-03-02 at 15:39 +, Caruana, Matthew wrote:
>> Thank you. The question remains however, if this is such a hefty
>> operation then why is it walking to the destination instead of
>> running, so to speak?
> 
> We only do optimize on an old Solr 4.10 setup, but for that we have
> plenty of experience. At least for single-shard, and at least for most
> of the work, optimize is a single-threaded process: It takes us ~8
> hours to optimize a ~900GB shard using SSDs, with 1 CPU-core at near
> 100% and the other ones not doing anything.
> 
> The machine load number is a bit fuzzy, but if you do a top doing
> optimization, my guess is that you will see the same thing as we do:
> Only 1 CPU-core working.
> -- 
> Toke Eskildsen, Royal Danish Library



Re: What is the bottleneck for an optimise operation?

2017-03-03 Thread Toke Eskildsen
On Thu, 2017-03-02 at 15:39 +, Caruana, Matthew wrote:
> Thank you. The question remains however, if this is such a hefty
> operation then why is it walking to the destination instead of
> running, so to speak?

We only do optimize on an old Solr 4.10 setup, but for that we have
plenty of experience. At least for single-shard, and at least for most
of the work, optimize is a single-threaded process: It takes us ~8
hours to optimize a ~900GB shard using SSDs, with 1 CPU-core at near
100% and the other ones not doing anything.

The machine load number is a bit fuzzy, but if you do a top doing
optimization, my guess is that you will see the same thing as we do:
Only 1 CPU-core working.
-- 
Toke Eskildsen, Royal Danish Library


Re: Solr Query Suggestion

2017-03-03 Thread Emir Arnautovic

Hi Vrinda,

You should use field collapsing 
(https://cwiki.apache.org/confluence/display/solr/Collapse+and+Expand+Results) 
or if you cannot live with its limitations, you can use results grouping 
(https://cwiki.apache.org/confluence/display/solr/Result+Grouping)
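
As an illustration of the grouping route, the request parameters could look
like this (shown here as request-handler defaults; "category" stands in for
whatever your category field is actually called, and it has to be a
single-valued indexed field):

  <requestHandler name="/by-category" class="solr.SearchHandler">
    <lst name="defaults">
      <str name="group">true</str>
      <str name="group.field">category</str>
      <str name="group.limit">3</str>  <!-- top 3 documents per category -->
      <str name="rows">3</str>         <!-- up to 3 groups (categories) returned -->
    </lst>
  </requestHandler>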


HTH,
Emir


On 03.03.2017 10:55, vrindavda wrote:

Hello,

I have indexed data of 3 categories, say Category-1, Category-2, Category-3.

I need suggestions on how to form a query to get the top 3 results from each
category - Category-1(3), Category-2(3), Category-3(3) - 9 in total.

Is this possible?

Thank you,
Vrinda Davda





--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



Using solr-core-4.6.1.jar on solr-5.5.4 server

2017-03-03 Thread skasab2s
Hello,

we want to update our Solr server to *solr-5.5.4*. For the API calls on this
server, we can only use *solr-core-4.6.1.jar*, because it is maintained by
an external vendor and we cannot upgrade the library ourselves. I tested
this combination briefly and it seemed to work fine. Even so, do you think
there could be compatibility problems? Is there a table showing which library
version is compatible with which Solr server version?

Many thanks in advance,

Sven





Solr Query Suggestion

2017-03-03 Thread vrindavda
Hello,

I have indexed data of 3 categories, say Category-1, Category-2, Category-3.

I need suggestions on how to form a query to get the top 3 results from each
category - Category-1(3), Category-2(3), Category-3(3) - 9 in total.

Is this possible?

Thank you,
Vrinda Davda





FieldName as case insenstive

2017-03-03 Thread Preeti Bhat
Hi All,

I have a field named "CompanyName" in one of my collection. When I try to 
search CompanyName:xyz or CompanyName:XYZ it gives me results. But when I try 
companyname:xyz then the result fails. Is there a way to ensure that fieldname 
in solr is case insensitive as the client is going to pass the search string 
along with the fieldname for us.


Thanks and Regards,
Preeti Bhat







RE: Joining across collections with Nested documents

2017-03-03 Thread Preeti Bhat
Thanks Mikhail, I will look into this option.


Thanks and Regards,
Preeti Bhat

-Original Message-
From: Mikhail Khludnev [mailto:m...@apache.org]
Sent: Friday, March 03, 2017 1:03 PM
To: solr-user
Subject: Re: Joining across collections with Nested documents

Related docs can be retrieved with
https://cwiki.apache.org/confluence/display/solr/Transforming+Result+Documents#TransformingResultDocuments-[subquery]
but searching related docs is less ready.
Here is a patch for query time join across collections 
https://issues.apache.org/jira/browse/SOLR-8297.

On Fri, Mar 3, 2017 at 8:55 AM, Preeti Bhat 
wrote:

> Hi All,
>
> I have two collections in SolrCloud, namely contact and company; they
> are in the same Solr instance. Company is a relatively simple document with
> id, Name, address etc... Coming over to Contact, this has nested
> documents like the one below. I would like to get the Company details using the "CompanyId"
> field in the child document by joining it to the "Company"
> collection's id field. Is this possible? Could someone please guide me on
> this?
> this?
>
> {
> id: "1"
> , FirstName: "ABC"
> , LastName: "BCD"
> .
> .
> .
> _childDocuments_:{
> {
> id:"123-1",
> CompanyId: "123",
> Email: "abc@smd.edu"
> }
> {
> id:"124-1",
> CompanyId: "124",
> Email: "abc@smd.edu"
>
> }
> }
>
>
>
> Thanks and Regards,
> Preeti Bhat
>
>
>
>
>
>


--
Sincerely yours
Mikhail Khludnev





RE: Joining across collections with Nested documents

2017-03-03 Thread Preeti Bhat
We can't; they are being used for different purposes and we have a few cases 
where we would need data from both.


Thanks and Regards,
Preeti Bhat

-Original Message-
From: Walter Underwood [mailto:wun...@wunderwood.org]
Sent: Friday, March 03, 2017 12:02 PM
To: solr-user@lucene.apache.org
Subject: Re: Joining across collections with Nested documents

Make one collection with denormalized data. This looks like a relational, 
multi-table schema in Solr. That will be slow and painful.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)
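
For what it's worth, a denormalised contact document along those lines might look
roughly like this (field names purely illustrative) - each contact carries copies
of the company fields it needs, with one entry per company in the multi-valued
fields:

  {
    "id": "1",
    "FirstName": "ABC",
    "LastName": "BCD",
    "Email": ["abc@smd.edu"],
    "CompanyId": ["123", "124"],
    "CompanyName": ["Acme Corp", "Beta Ltd"],
    "CompanyCity": ["Boston", "Chicago"]
  }

The trade-offs are that updating a company means reindexing every contact that
references it, and that queries combining several company fields can match across
positions of the parallel multi-valued fields.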


> On Mar 2, 2017, at 9:55 PM, Preeti Bhat  wrote:
>
> Hi All,
>
> I have two collections in SolrCloud, namely contact and company; they are in 
> the same Solr instance. Company is a relatively simple document with id, Name, 
> address etc. Contact has nested documents like the one below. I would like to 
> get the Company details by joining the "CompanyId" field in the child documents 
> to the id field of the "Company" collection. Is this possible? Could someone 
> please guide me on this?
>
> {
> id: "1"
> , FirstName: "ABC"
> , LastName: "BCD"
> .
> .
> .
> _childDocuments_:{
> {
> id:"123-1",
> CompanyId: "123",
> Email: "abc@smd.edu"
> }
> {
> id:"124-1",
> CompanyId: "124",
> Email: "abc@smd.edu"
>
> }
> }
>
>
>
> Thanks and Regards,
> Preeti Bhat
>
>
>
>
>






Re: OR condition between !frange and normal query

2017-03-03 Thread Emir Arnautovic

Hi Edwin,

_query_ is not a field in your index but Solr syntax for subqueries. I'm not 
sure if that is the issue you are referring to, but the query you sent 
(the example I sent earlier) is not fully valid - it has an extra '('. Can you try:


q=_query_:"{!frange l=1}ms(startDate_dt,endDate_dt)" OR
_query_:"startDate:[2000-01-01T00:00:00Z TO *] AND
endDate:[* TO 2016-12-31T23:59:59Z]"

Emir
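
Edwin's original question below also asks how to put this into the Solr query URL -
the only thing needed is URL-encoding of the q parameter. A rough sketch, letting
curl do the encoding (collection and field names are placeholders; keep whichever
names actually exist in the schema):

  curl http://localhost:8983/solr/mycollection/select \
    --data-urlencode 'q=_query_:"{!frange l=1}ms(startDate_dt,endDate_dt)" OR _query_:"startDate_dt:[2000-01-01T00:00:00Z TO *] AND endDate_dt:[* TO 2016-12-31T23:59:59Z]"' \
    --data-urlencode 'wt=json'

curl sends this as a POST form, which the /select handler accepts, so the quotes,
braces and brackets never need to be escaped by hand.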

On 03.03.2017 02:53, Zheng Lin Edwin Yeo wrote:

Hi Emir,

Thanks for your reply.

For the query:

q=_query_:"({!frange l=1}ms(startDate_dt,endDate_dt)" OR
_query_:"startDate:[2000-01-01T00:00:00Z TO *] AND
endDate:[2016-12-31T23:59:59Z]"

Must the _query_ be one of the fields in the index? I do not have any
fields in the index that relate to the output of the query, and if I put
something that is not one of the fields in the index, it does not return
any results.

Regards,
Edwin



On 2 March 2017 at 17:04, Emir Arnautovic 
wrote:


Hi Edwin,

You can use subqueries:

q=_query_:"({!frange l=1}ms(startDate_dt,endDate_dt)" OR
_query_:"startDate:[2000-01-01T00:00:00Z TO *] AND
endDate:[2016-12-31T23:59:59Z]"

HTH,
Emir



On 02.03.2017 04:51, Zheng Lin Edwin Yeo wrote:


Hi,

Would like to check, how can we do an OR condition between !frange and
normal query?

For example, I want to have the following condition in my query:

({!frange l=1}ms(startDate_dt,endDate_dt) OR
(startDate:[2000-01-01T00:00:00Z TO *] AND endDate:[2016-12-31T23:59:59Z]
))

How can we put it in the Solr query URL for Solr to recognize this
condition?

I'm using Solr 6.4.1

Thank you.

Regards,
Edwin



--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/




--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



Re: What is the bottleneck for an optimise operation? / solve the disk space and time issues by specifying multiple segments to optimize

2017-03-03 Thread Caruana, Matthew
This is the current config:


100
1


10
10



We index in bulk, so after indexing about 4 million documents over a week (OCR 
takes a long time) we normally end up with about 60-70 segments with this 
configuration.
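
In case it helps, a rough sketch of what a tuned merge policy for Solr 6.x might
look like, along the lines Alexandre suggests below - the values are only
illustrative; fewer segments per tier and a larger maximum merged segment size
generally mean fewer, bigger segments without running optimize:

  <indexConfig>
    <mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
      <int name="maxMergeAtOnce">10</int>
      <int name="segmentsPerTier">5</int>
      <double name="maxMergedSegmentMB">20000</double>
    </mergePolicyFactory>
  </indexConfig>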

> On 3 Mar 2017, at 02:42, Alexandre Rafalovitch  wrote:
> 
> What do you have for merge configuration in solrconfig.xml? You should
> be able to tune it to - approximately - whatever you want without
> doing the grand optimize:
> https://cwiki.apache.org/confluence/display/solr/IndexConfig+in+SolrConfig#IndexConfiginSolrConfig-MergingIndexSegments
> 
> Regards,
>   Alex.
> 
> http://www.solr-start.com/ - Resources for Solr users, new and experienced
> 
> 
> On 2 March 2017 at 16:37, Caruana, Matthew  wrote:
>> Yes, we already do it outside Solr. See https://github.com/ICIJ/extract 
>> which we developed for this purpose. My guess is that the documents are very 
>> large, as you say.
>> 
>> Optimising was always an attempt to bring down the number of segments from 
>> 60+. Not sure how else to do that.
>> 
>>> On 2 Mar 2017, at 7:42 pm, Michael Joyner  wrote:
>>> 
>>> You can solve the disk space and time issues by specifying multiple 
>>> segments to optimize down to instead of a single segment.
>>> 
>>> When we reindex we have to optimize or we end up with hundreds of segments 
>>> and very horrible performance.
>>> 
>>> We optimize down to like 16 segments or so and it doesn't do the 3x disk 
>>> space thing and usually runs in a decent amount of time. (we have >50 
>>> million articles in one of our solr indexes).
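
If it helps, that kind of partial optimize can be requested by adding maxSegments
to the optimize call - a rough sketch, with the core name as a placeholder:

  curl 'http://localhost:8983/solr/mycore/update?optimize=true&maxSegments=16'

This merges down to at most 16 segments instead of one, which is what keeps the
temporary disk usage and run time far below a full single-segment optimize.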
>>> 
>>> 
 On 03/02/2017 10:20 AM, David Hastings wrote:
 Agreed, and the fact that it takes three times the space is part of the reason it
 takes so long: that 190 GB index ends up writing another 380 GB before it
 compresses down and deletes the two leftover files. It's a pretty hefty
 operation.
 
 On Thu, Mar 2, 2017 at 10:13 AM, Alexandre Rafalovitch 
 wrote:
 
> Optimize operation is no longer recommended for Solr, as the
> background merges got a lot smarter.
> 
> It is an extremely expensive operation that can require up to 3-times
> amount of disk during the processing.
> 
> This is not to say yours isn't a valid question - I am leaving that to
> others to respond to.
> 
> Regards,
>   Alex.
> 
> http://www.solr-start.com/ - Resources for Solr users, new and experienced
> 
> 
>> On 2 March 2017 at 10:04, Caruana, Matthew  wrote:
>> I’m currently performing an optimise operation on a ~190GB index with
> about 4 million documents. The process has been running for hours.
>> This is surprising, because the machine is an EC2 r4.xlarge with four
> cores and 30GB of RAM, 24GB of which is allocated to the JVM.
>> The load average has been steady at about 1.3. Memory usage is 25% or
> less the whole time. iostat reports ~6% util.
>> What gives?
>> 
>> Running Solr 6.4.1.
>>>