Re: Streaming Expressions: Merge array values? Inverse of cartesianProduct()

2018-06-14 Thread Joel Bernstein
Actually your second example is probably straightforward:

reduce(select(...), group(...), by="k1")
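
For example, a fuller sketch of that shape (the collection and field names here are hypothetical, and note that group() collects the matching tuples under a "group" field rather than flattening k2 into one array, so the exact output shape may still call for a custom reduce operation):

    reduce(search(collection1,
                  q="*:*",
                  fl="k1,k2",
                  sort="k1 asc",
                  qt="/export"),
           by="k1",
           group(sort="k1 asc", n="100"))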

Joel Bernstein
http://joelsolr.blogspot.com/

On Thu, Jun 14, 2018 at 7:33 PM, Joel Bernstein  wrote:

> Take a look at the reduce() function. You'll have to write a custom reduce
> operation but you can follow the example here:
>
> https://github.com/apache/lucene-solr/blob/master/solr/solrj/src/java/org/apache/solr/client/solrj/io/ops/GroupOperation.java
>
> You can plug in your custom reduce operation in the solrconfig.xml and use
> it like any other function. If you're interested in working on this you
> could create a ticket and I can provide guidance.
>
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> 2018-06-14 13:13 GMT-04:00 Christian Spitzlay <
> christian.spitz...@biologis.com>:
>
>> Hi,
>>
>> is there a way to merge array values?
>>
>> Something that transforms
>>
>> {
>>   "k1": "1",
>>   "k2": ["a", "b"]
>> },
>> {
>>   "k1": "2",
>>   "k2": ["c", "d"]
>> },
>> {
>>   "k1": "2",
>>   "k2": ["e", "f"]
>> }
>>
>> into
>>
>> {
>>   "k1": "1",
>>   "k2": ["a", "b"]
>> },
>> {
>>   "k1": "2",
>>   "k2": ["c", "d", "e", "f"]
>> }
>>
>>
>> And an inverse of cartesianProduct() that transforms
>>
>> {
>>   "k1": "1",
>>   "k2": "a"
>> },
>> {
>>   "k1": "2",
>>   "k2": "b"
>> },
>> {
>>   "k1": "2",
>>   "k2": "c"
>> }
>>
>> into
>>
>> {
>>   "k1": "1",
>>   "k2": ["a"]
>> },
>> {
>>   "k1": "2",
>>   "k2": ["b", "c"]
>> }
>>
>>
>> Christian
>>
>>
>>
>


Re: Streaming Expressions: Merge array values? Inverse of cartesianProduct()

2018-06-14 Thread Joel Bernstein
Take a look at the reduce() function. You'll have to write a custom reduce
operation but you can follow the example here:

https://github.com/apache/lucene-solr/blob/master/solr/solrj/src/java/org/apache/solr/client/solrj/io/ops/GroupOperation.java

You can plug in your custom reduce operation in the solrconfig.xml and use
it like any other function. If you're interested in working on this you
could create a ticket and I can provide guidance.


Joel Bernstein
http://joelsolr.blogspot.com/

2018-06-14 13:13 GMT-04:00 Christian Spitzlay <
christian.spitz...@biologis.com>:

> Hi,
>
> is there a way to merge array values?
>
> Something that transforms
>
> {
>   "k1": "1",
>   "k2": ["a", "b"]
> },
> {
>   "k1": "2",
>   "k2": ["c", "d"]
> },
> {
>   "k1": "2",
>   "k2": ["e", "f"]
> }
>
> into
>
> {
>   "k1": "1",
>   "k2": ["a", "b"]
> },
> {
>   "k1": "2",
>   "k2": ["c", "d", "e", "f"]
> }
>
>
> And an inverse of cartesianProduct() that transforms
>
> {
>   "k1": "1",
>   "k2": "a"
> },
> {
>   "k1": "2",
>   "k2": "b"
> },
> {
>   "k1": "2",
>   "k2": "c"
> }
>
> into
>
> {
>   "k1": "1",
>   "k2": ["a"]
> },
> {
>   "k1": "2",
>   "k2": ["b", "c"]
> }
>
>
> Christian
>
>
>


Re: Exception when processing streaming expression

2018-06-14 Thread Joel Bernstein
We have to check the behavior of the innerJoin. I suspect that it's closing
the second stream when the first stream has finished. This would cause a
broken pipe with the second stream. The export handler has specific code
that eats the broken pipe exception so it doesn't end up in the logs. The
select handler does not have this code.

In general you never want to use the select handler and set the rows to
such a big number. If you have that many rows you'll want to use the export
handler, which is designed to export the entire result set.
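
As a sketch, the stripped-down expression quoted below would then become something like the following (the only changes are switching the second search() to qt="/export" and dropping the rows parameter; this assumes the fields involved have docValues, which the export handler requires):

    innerJoin(
      sort(search(kmm,
                  q="sds_endpoint_uuid:(2f927a0b\-fe38\-451e\-9103\-580914a77e82)",
                  fl="sds_endpoint_uuid,sds_to_endpoint_uuid",
                  sort="sds_to_endpoint_uuid ASC",
                  qt="/export"),
           by="sds_endpoint_uuid ASC"),
      search(kmm,
             q="ss_search_api_datasource:entity\:as_metadata",
             fl="sds_metadata_of_uuid",
             sort="sds_metadata_of_uuid ASC",
             qt="/export"),
      on="sds_endpoint_uuid=sds_metadata_of_uuid")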

Joel Bernstein
http://joelsolr.blogspot.com/

On Thu, Jun 14, 2018 at 1:30 PM, Christian Spitzlay <
christian.spitz...@biologis.com> wrote:

> What does that mean exactly?  If I set the rows parameter to 10
> the exception still occurs.  AFAICT all this happens internally during the
> processing of the streaming expression.  Why wouldn't the select send
> the EOF tuple when it reaches the end of the documents?
> Or why wouldn't the receiving end wait for it to appear?
> Due to an incredibly low timeout used internally?
>
>
> Christian Spitzlay
>
>
>
> > On 14.06.2018 at 19:18, Susmit wrote:
> >
> > Hi,
> > This may be expected if one of the streams is closed early - does not
> > reach the EOF tuple
> >
> > Sent from my iPhone
> >
> >> On Jun 14, 2018, at 9:53 AM, Christian Spitzlay <
> christian.spitz...@biologis.com> wrote:
> >>
> >> Here is one I stripped down as far as I could:
> >>
> >> innerJoin(sort(search(kmm, q="sds_endpoint_uuid:(
> 2f927a0b\-fe38\-451e\-9103\-580914a77e82)", 
> fl="sds_endpoint_uuid,sds_to_endpoint_uuid",
> sort="sds_to_endpoint_uuid ASC", qt="/export"), by="sds_endpoint_uuid
> ASC"), search(kmm, q=ss_search_api_datasource:entity\:as_metadata,
> fl="sds_metadata_of_uuid", sort="sds_metadata_of_uuid ASC", qt="/select",
> rows=1), on="sds_endpoint_uuid=sds_metadata_of_uuid")
> >>
> >> The exception happens both via PHP (search_api_solr / Solarium) and via
> the Solr admin UI.
> >> (version: Solr 7.3.1 on macOS High Sierra 10.13.5)
> >>
> >> It seems to be related to the fact that the second stream uses "select".
> >> - If I use "export" the exception doesn’t occur.
> >> - If I set the rows parameter "low enough" so I do not get any results
> >> the exception doesn’t occur either.
> >>
> >>
> >> BTW: Do you know of any tool for formatting and/or syntax highlighting
> >> these expressions?
> >>
> >>
> >> Christian Spitzlay
> >>
> >>
> >>
> >>
> >>
> >>> On 13.06.2018 at 23:02, Joel Bernstein wrote:
> >>>
> >>> Can you provide some example expressions that are causing these
> exceptions?
> >>>
> >>> Joel Bernstein
> >>> http://joelsolr.blogspot.com/
> >>>
> >>> On Wed, Jun 13, 2018 at 9:02 AM, Christian Spitzlay <
> >>> christian.spitz...@biologis.com> wrote:
> >>>
>  Hi,
> 
>  I am seeing a lot of (reproducible) exceptions in my solr log file
>  when I execute streaming expressions:
> 
>  o.a.s.s.HttpSolrCall  Unable to write response, client closed
> connection
>  or we are shutting down
>  org.eclipse.jetty.io.EofException
>   at org.eclipse.jetty.io.ChannelEndPoint.flush(
>  ChannelEndPoint.java:292)
>   at org.eclipse.jetty.io.WriteFlusher.flush(
> WriteFlusher.java:429)
>   at org.eclipse.jetty.io.WriteFlusher.write(
> WriteFlusher.java:322)
>   at org.eclipse.jetty.io.AbstractEndPoint.write(
>  AbstractEndPoint.java:372)
>   at org.eclipse.jetty.server.HttpConnection$SendCallback.
>  process(HttpConnection.java:794)
>  […]
>   at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(
>  EatWhatYouKill.java:131)
>   at org.eclipse.jetty.util.thread.ReservedThreadExecutor$
>  ReservedThread.run(ReservedThreadExecutor.java:382)
>   at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(
>  QueuedThreadPool.java:708)
>   at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(
>  QueuedThreadPool.java:626)
>   at java.base/java.lang.Thread.run(Thread.java:844)
>  Caused by: java.io.IOException: Broken pipe
>   at java.base/sun.nio.ch.FileDispatcherImpl.writev0(Native
> Method)
>   at java.base/sun.nio.ch.SocketDispatcher.writev(
>  SocketDispatcher.java:51)
>   at java.base/sun.nio.ch.IOUtil.write(IOUtil.java:148)
>   at java.base/sun.nio.ch.SocketChannelImpl.write(
>  SocketChannelImpl.java:506)
>   at org.eclipse.jetty.io.ChannelEndPoint.flush(
>  ChannelEndPoint.java:272)
>   ... 69 more
> 
> 
>  I have read up on the exception message and found
>  http://lucene.472066.n3.nabble.com/Unable-to-write-
> response-client-closed-
>  connection-or-we-are-shutting-down-tt4350349.html#a4350947
>  but I don’t understand how an early client connect can cause what I am
>  seeing:
> 
>  What puzzles me is that the response has been delivered in full to the
>  client library, including 

Re: Suggestions for debugging performance issue

2018-06-14 Thread Shawn Heisey
On 6/12/2018 12:06 PM, Chris Troullis wrote:
> The issue we are seeing is with 1 collection in particular, after we set up
> CDCR, we are getting extremely slow response times when retrieving
> documents. Debugging the query shows QTime is almost nothing, but the
> overall responseTime is like 5x what it should be. The problem is
> exacerbated by larger result sizes. IE retrieving 25 results is almost
> normal, but 200 results is way slower than normal. I can run the exact same
> query multiple times in a row (so everything should be cached), and I still
> see response times way higher than another environment that is not using
> CDCR. It doesn't seem to matter if CDCR is enabled or disabled, just that
> we are using the CDCRUpdateLog. The problem started happening even before
> we enabled CDCR.
>
> In a lower environment we noticed that the transaction logs were huge
> (multiple gigs), so we tried stopping solr and deleting the tlogs then
> restarting, and that seemed to fix the performance issue. We tried the same
> thing in production the other day but it had no effect, so now I don't know
> if it was a coincidence or not.

There is one other cause besides CDCR buffering that I know of for huge
transaction logs, and it has nothing to do with CDCR:  A lack of hard
commits.  It is strongly recommended to have autoCommit set to a
reasonably short interval (about a minute in my opinion, but 15 seconds
is VERY common).  Most of the time openSearcher should be set to false
in the autoCommit config, and other mechanisms (which might include
autoSoftCommit) should be used for change visibility.  The example
autoCommit settings might seem superfluous because they don't affect
what's searchable, but it is actually a very important configuration to
keep.
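
A minimal sketch of that kind of configuration in solrconfig.xml (inside the updateHandler section; the intervals are example values, not a recommendation for any particular system):

    <autoCommit>
      <maxTime>${solr.autoCommit.maxTime:15000}</maxTime>
      <openSearcher>false</openSearcher>
    </autoCommit>

    <autoSoftCommit>
      <maxTime>${solr.autoSoftCommit.maxTime:60000}</maxTime>
    </autoSoftCommit>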

Are the docs in this collection really big, by chance?

As I went through previous threads you've started on the mailing list, I
noticed that none of your messages provided the details that would
be useful for looking into performance problems:

 * What OS vendor and version Solr is running on.
 * Total document count on the server (counting all index cores).
 * Total index size on the server (counting all cores).
 * What the total of all Solr heaps on the server is.
 * Whether there is software other than Solr on the server.
 * How much total memory the server has installed.

If you name the OS, I can use that information to help you gather some
additional info which will actually show me most of that list.  Total
document count is something that I cannot get from the info I would help
you gather.

Something else that can cause performance issues is GC pauses.  If you
provide a GC log (The script that starts Solr logs this by default), we
can analyze it to see if that's a problem.

Attachments to messages on the mailing list typically do not make it to
the list, so a file sharing website is a better way to share large
logfiles.  A paste website is good for log data that's smaller.

Thanks,
Shawn



Re: Changing Field Assignments

2018-06-14 Thread Shawn Heisey
On 6/14/2018 12:10 PM, Terry Steichen wrote:
> I don't disagree at all, but have a basic question: How do you easily
> transition from a system using a dynamic schema to one using a fixed one?

Not sure you need to actually transition.  Just remove the config in
solrconfig.xml that causes Solr to invoke the update chain where the
unknown fields are added, upload the new config to zookeeper, and reload
the collection.  When you do that, indexing with unknown fields will
fail, and if the indexing program has good error handling, somebody is
going to notice the failure.

The major difficulty with this will be more of a people problem than a
technical problem.  You have to convince people who use the Solr install
that it's a lot better that they get an indexing error and ask you to
fix it.  They may not care that you've got a major problem on your hands
when the system makes a mistake adding a field.

> I'm running 6.6.0 in cloud mode (only because it's necessary, as I
> understand it, to be in cloud mode for the authentication/authorization
> to work).  In my server/solr/configsets subdirectory there are
> directories "data_driven_schema_configs" and "basic_configs".  Both
> contain a file named "managed_schema."  Which one is the active one?

As of Solr 6.5.0, the basic authentication plugin also works in
non-cloud (standalone) mode.

https://issues.apache.org/jira/browse/SOLR-9481

I will typically recommend cloud mode to anyone setting up a brand new
Solr installation, mostly because it automates a lot of the steps of
setting up high availability.  I don't use cloud mode myself, because it
didn't exist when I set up my systems.  Converting to cloud mode would
require rewriting all of the tools I've written that keep the indexes up
to date.  I might do that one day, but not today.

In cloud mode, neither of the managed-schema files you have mentioned is
active.  The active config (solrconfig.xml, the schema, and all files
mentioned in either of those) is in zookeeper, not on the disk.

> From the AdminUI, each collection has an associated "managed_schema"
> (under the "Files" option).  I'm guessing that this collection-specific
> managed_schema is the result of the automated field discovery process,
> presumably using some baseline version (in configsets) to start with.

If you create a collection with "bin/solr create", the config that you
give it is usually uploaded to zookeeper and all shard replicas in the
collection use that uploaded config.  In older versions like 6.6.0,
basic_configs is used if no source config is named.  In newer versions,
_default is used.
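
For example, something along these lines uploads the named configset and creates a collection that uses it (a sketch; the collection and config names are placeholders):

    bin/solr create -c mycollection -d basic_configs -n mycollection_config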

When the update processor adds an unknown field, it is added to the
managed-schema file in zookeeper and the collection is reloaded.  The
source configset on disk is not touched.

> If that's true, then it would presumably make sense to save this
> collection-specific managed_schema to disk as schema.xml.  I further
> presume I'd create a config subdirectory for each of said collections
> and put schema.xml there.  Is that right?

As long as you're in cloud mode, all your index configs are in
zookeeper.  Any config you have on disk is NOT what is actually being used.

https://lucene.apache.org/solr/guide/6_6/using-zookeeper-to-manage-configuration-files.html

> Every time I read (and reread, and reread, ...) the Solr docs they seem
> to be making certain (very basic) assumptions that I'm unclear about, so
> your help in the preceding would be most appreciated.

The Solr documentation is not very friendly to novices.  Writing
documentation that an expert can use is sometimes difficult, but most
developers can manage it.  Writing documentation that a novice can use
is much harder, because it's not easy for someone who has intimate
knowledge of the system to step back and look at it from a place where
that knowledge isn't available.  Some success has been achieved in later
documentation versions.  It's going to take a lot of time and effort
before most of Solr's documentation is novice-friendly.

Thanks,
Shawn



Solr basic auth

2018-06-14 Thread Dinesh Sundaram
Hi,

I have configured basic auth for SolrCloud. It works well when I access the
Solr URL directly. I have integrated this Solr with the test.com domain. Now if
I access the Solr URL like test.com/solr it prompts for credentials, but I
don't want it to prompt in this case since it is a known domain. Is there any
way to achieve this? I'd much appreciate your quick response.

My security.json is below. I'm using the default security configuration and
want to allow my domain by default without prompting for any credentials.

{"authentication":{
   "blockUnknown": true,
   "class":"solr.BasicAuthPlugin",
   "credentials":{"solr":"IV0EHq1OnNrj6gvRCwvFwTrZ1+z1oBbnQdiVC3otuq0=
Ndd7LKvVBAaZIF0QAVi1ekCfAJXr1GGfLtRUXhgrF8c="}
},"authorization":{
   "class":"solr.RuleBasedAuthorizationPlugin",
   "permissions":[{"name":"security-edit",
  "role":"admin"}],
   "user-role":{"solr":"admin"}
}}


Re: Changing Field Assignments

2018-06-14 Thread Terry Steichen
Shawn,

I don't disagree at all, but have a basic question: How do you easily
transition from a system using a dynamic schema to one using a fixed one?

I'm running 6.6.0 in cloud mode (only because it's necessary, as I
understand it, to be in cloud mode for the authentication/authorization
to work).  In my server/solr/configsets subdirectory there are
directories "data_driven_schema_configs" and "basic_configs".  Both
contain a file named "managed_schema."  Which one is the active one?

From the AdminUI, each collection has an associated "managed_schema"
(under the "Files" option).  I'm guessing that this collection-specific
managed_schema is the result of the automated field discovery process,
presumably using some baseline version (in configsets) to start with.

If that's true, then it would presumably make sense to save this
collection-specific managed_schema to disk as schema.xml.  I further
presume I'd create a config subdirectory for each of said collections
and put schema.xml there.  Is that right?

And I have to do this for each collection, right?

Every time I read (and reread, and reread, ...) the Solr docs they seem
to be making certain (very basic) assumptions that I'm unclear about, so
your help in the preceding would be most appreciated.

Thanks.

Terry


On 06/14/2018 01:51 PM, Shawn Heisey wrote:
> On 6/11/2018 2:02 PM, Terry Steichen wrote:
>> I am using Solr (6.6.0) in the automatic mode (where it discovers
>> fields).  It's working fine with one exception.  The problem is that
>> Solr assigns the discovered "meta_creation_date" field the type
>> TrieDateField. 
>>
>> Unfortunately, that type is limited in a number of ways (like sorting,
>> abbreviated forms, etc.).  What I'd like to do is have that
>> ("meta_creation_date") field assigned to a different type, like
>> DateRangeField. 
>>
>> Is it possible to accomplish this (during indexing) by creating a copy
>> field to a different type, and using the copy field in the query?  Or
>> via some kind of function operation (which I've never understood)?
> What you are describing is precisely why I never use the mode where Solr
> automatically adds unknown fields.
>
> If the field does not exist in the schema before you index the document,
> then the best Solr can do is precisely what is configured in the update
> processor that adds unknown fields.  You can adjust that config, but it
> will always be a general purpose guess.
>
> What is actually needed for multiple unknown fields is often outside
> what that update processor is capable of detecting and configuring
> automatically.  For that reason, I set up the schema manually, and I
> want indexing to fail if the input documents contain fields that I
> haven't defined.  Then whoever is doing the indexing can contact me with
> their error details, and I can add new fields with the exact required
> definition.
>
> Thanks,
> Shawn
>
>



Re: Indexing to replica instead leader

2018-06-14 Thread Shawn Heisey
On 6/8/2018 3:56 AM, SOLR4189 wrote:
> /When a document is sent to a Solr node for indexing, the system first
> determines which Shard that document belongs to, and then which node is
> currently hosting the leader for that shard. The document is then forwarded
> to the current leader for indexing, and the leader forwards the update to
> all of the other replicas./
>
> So my question: what happens when I send an index request to a replica
> server instead of the leader server?
>
> Does the replica become a leader for this request? Or does the replica become
> only a federator that resends the request to the leader, and then the leader
> resends it to the replicas?

Terminology nit:  The leader *is* a replica.  It just has a temporary
special job.  It doesn't lose its status as a replica when it is elected
leader.

If you send a document update to an index that is not the leader for the
correct shard, it will do just what you said above -- figure out the
correct shard, figure out which replica is the leader of that shard, and
forward the request there.  That leader will index the request itself
and then handle updating the other replicas.  It will also reply to the
index where you sent the request, which will reply to you.  The leader
role will not change to another core unless there is a leader election
and the existing leader loses that election.  An election is not going
to happen without a significant cluster event.  Examples are an explicit
election request, or the core/server with the leader role going down.

Thanks,
Shawn



Re: Changing Field Assignments

2018-06-14 Thread Shawn Heisey
On 6/11/2018 2:02 PM, Terry Steichen wrote:
> I am using Solr (6.6.0) in the automatic mode (where it discovers
> fields).  It's working fine with one exception.  The problem is that
> Solr maps the discovered "meta_creation_date" is assigned the type
> TrieDateField. 
>
> Unfortunately, that type is limited in a number of ways (like sorting,
> abbreviated forms, etc.).  What I'd like to do is have that
> ("meta_creation_date") field assigned to a different type, like
> DateRangeField. 
>
> Is it possible to accomplish this (during indexing) by creating a copy
> field to a different type, and using the copy field in the query?  Or
> via some kind of function operation (which I've never understood)?

What you are describing is precisely why I never use the mode where Solr
automatically adds unknown fields.

If the field does not exist in the schema before you index the document,
then the best Solr can do is precisely what is configured in the update
processor that adds unknown fields.  You can adjust that config, but it
will always be a general purpose guess.

What is actually needed for multiple unknown fields is often outside
what that update processor is capable of detecting and configuring
automatically.  For that reason, I set up the schema manually, and I
want indexing to fail if the input documents contain fields that I
haven't defined.  Then whoever is doing the indexing can contact me with
their error details, and I can add new fields with the exact required
definition.

Thanks,
Shawn



Re: Exception when processing streaming expression

2018-06-14 Thread Christian Spitzlay
What does that mean exactly?  If I set the rows parameter to 10
the exception still occurs.  AFAICT all this happens internally during the 
processing of the streaming expression.  Why wouldn't the select send 
the EOF tuple when it reaches the end of the documents? 
Or why wouldn't the receiving end wait for it to appear?  
Due to an incredibly low timeout used internally?


Christian Spitzlay



> On 14.06.2018 at 19:18, Susmit wrote:
> 
> Hi, 
> This may be expected if one of the streams is closed early - does not reach
> the EOF tuple
> 
> Sent from my iPhone
> 
>> On Jun 14, 2018, at 9:53 AM, Christian Spitzlay 
>>  wrote:
>> 
>> Here is one I stripped down as far as I could:
>> 
>> innerJoin(sort(search(kmm, 
>> q="sds_endpoint_uuid:(2f927a0b\-fe38\-451e\-9103\-580914a77e82)", 
>> fl="sds_endpoint_uuid,sds_to_endpoint_uuid", sort="sds_to_endpoint_uuid 
>> ASC", qt="/export"), by="sds_endpoint_uuid ASC"), search(kmm, 
>> q=ss_search_api_datasource:entity\:as_metadata, fl="sds_metadata_of_uuid", 
>> sort="sds_metadata_of_uuid ASC", qt="/select", rows=1), 
>> on="sds_endpoint_uuid=sds_metadata_of_uuid")
>> 
>> The exception happens both via PHP (search_api_solr / Solarium) and via the 
>> Solr admin UI.
>> (version: Solr 7.3.1 on macOS High Sierra 10.13.5)
>> 
>> It seems to be related to the fact that the second stream uses "select".
>> - If I use "export" the exception doesn’t occur.
>> - If I set the rows parameter "low enough" so I do not get any results
>> the exception doesn’t occur either.
>> 
>> 
>> BTW: Do you know of any tool for formatting and/or syntax highlighting 
>> these expressions?
>> 
>> 
>> Christian Spitzlay
>> 
>> 
>> 
>> 
>> 
>>> On 13.06.2018 at 23:02, Joel Bernstein wrote:
>>> 
>>> Can you provide some example expressions that are causing these exceptions?
>>> 
>>> Joel Bernstein
>>> http://joelsolr.blogspot.com/
>>> 
>>> On Wed, Jun 13, 2018 at 9:02 AM, Christian Spitzlay <
>>> christian.spitz...@biologis.com> wrote:
>>> 
 Hi,
 
 I am seeing a lot of (reproducible) exceptions in my solr log file
 when I execute streaming expressions:
 
 o.a.s.s.HttpSolrCall  Unable to write response, client closed connection
 or we are shutting down
 org.eclipse.jetty.io.EofException
  at org.eclipse.jetty.io.ChannelEndPoint.flush(
 ChannelEndPoint.java:292)
  at org.eclipse.jetty.io.WriteFlusher.flush(WriteFlusher.java:429)
  at org.eclipse.jetty.io.WriteFlusher.write(WriteFlusher.java:322)
  at org.eclipse.jetty.io.AbstractEndPoint.write(
 AbstractEndPoint.java:372)
  at org.eclipse.jetty.server.HttpConnection$SendCallback.
 process(HttpConnection.java:794)
 […]
  at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(
 EatWhatYouKill.java:131)
  at org.eclipse.jetty.util.thread.ReservedThreadExecutor$
 ReservedThread.run(ReservedThreadExecutor.java:382)
  at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(
 QueuedThreadPool.java:708)
  at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(
 QueuedThreadPool.java:626)
  at java.base/java.lang.Thread.run(Thread.java:844)
 Caused by: java.io.IOException: Broken pipe
  at java.base/sun.nio.ch.FileDispatcherImpl.writev0(Native Method)
  at java.base/sun.nio.ch.SocketDispatcher.writev(
 SocketDispatcher.java:51)
  at java.base/sun.nio.ch.IOUtil.write(IOUtil.java:148)
  at java.base/sun.nio.ch.SocketChannelImpl.write(
 SocketChannelImpl.java:506)
  at org.eclipse.jetty.io.ChannelEndPoint.flush(
 ChannelEndPoint.java:272)
  ... 69 more
 
 
 I have read up on the exception message and found
 http://lucene.472066.n3.nabble.com/Unable-to-write-response-client-closed-
 connection-or-we-are-shutting-down-tt4350349.html#a4350947
 but I don’t understand how an early client connect can cause what I am
 seeing:
 
 What puzzles me is that the response has been delivered in full to the
 client library, including the document with EOF.
 
 So Solr must have already processed the streaming expression and returned
 the result.
 It’s just that the log is filled with stacktraces of this exception that
 suggests something went wrong.
 I don’t understand why this happens when the query seems to have succeeded.
 
 
 Best regards,
 Christian
 
 
 
>> 



Re: Exception when processing streaming expression

2018-06-14 Thread Susmit
Hi, 
This may be expected if one of the streams is closed early - does not reach
the EOF tuple

Sent from my iPhone

> On Jun 14, 2018, at 9:53 AM, Christian Spitzlay 
>  wrote:
> 
> Here is one I stripped down as far as I could:
> 
> innerJoin(sort(search(kmm, 
> q="sds_endpoint_uuid:(2f927a0b\-fe38\-451e\-9103\-580914a77e82)", 
> fl="sds_endpoint_uuid,sds_to_endpoint_uuid", sort="sds_to_endpoint_uuid ASC", 
> qt="/export"), by="sds_endpoint_uuid ASC"), search(kmm, 
> q=ss_search_api_datasource:entity\:as_metadata, fl="sds_metadata_of_uuid", 
> sort="sds_metadata_of_uuid ASC", qt="/select", rows=1), 
> on="sds_endpoint_uuid=sds_metadata_of_uuid")
> 
> The exception happens both via PHP (search_api_solr / Solarium) and via the 
> Solr admin UI.
> (version: Solr 7.3.1 on macOS High Sierra 10.13.5)
> 
> It seems to be related to the fact that the second stream uses "select".
> - If I use "export" the exception doesn’t occur.
> - If I set the rows parameter "low enough" so I do not get any results
>  the exception doesn’t occur either.
> 
> 
> BTW: Do you know of any tool for formatting and/or syntax highlighting 
> these expressions?
> 
> 
> Christian Spitzlay
> 
> 
> 
> 
> 
>> On 13.06.2018 at 23:02, Joel Bernstein wrote:
>> 
>> Can you provide some example expressions that are causing these exceptions?
>> 
>> Joel Bernstein
>> http://joelsolr.blogspot.com/
>> 
>> On Wed, Jun 13, 2018 at 9:02 AM, Christian Spitzlay <
>> christian.spitz...@biologis.com> wrote:
>> 
>>> Hi,
>>> 
>>> I am seeing a lot of (reproducible) exceptions in my solr log file
>>> when I execute streaming expressions:
>>> 
>>> o.a.s.s.HttpSolrCall  Unable to write response, client closed connection
>>> or we are shutting down
>>> org.eclipse.jetty.io.EofException
>>>   at org.eclipse.jetty.io.ChannelEndPoint.flush(
>>> ChannelEndPoint.java:292)
>>>   at org.eclipse.jetty.io.WriteFlusher.flush(WriteFlusher.java:429)
>>>   at org.eclipse.jetty.io.WriteFlusher.write(WriteFlusher.java:322)
>>>   at org.eclipse.jetty.io.AbstractEndPoint.write(
>>> AbstractEndPoint.java:372)
>>>   at org.eclipse.jetty.server.HttpConnection$SendCallback.
>>> process(HttpConnection.java:794)
>>> […]
>>>   at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(
>>> EatWhatYouKill.java:131)
>>>   at org.eclipse.jetty.util.thread.ReservedThreadExecutor$
>>> ReservedThread.run(ReservedThreadExecutor.java:382)
>>>   at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(
>>> QueuedThreadPool.java:708)
>>>   at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(
>>> QueuedThreadPool.java:626)
>>>   at java.base/java.lang.Thread.run(Thread.java:844)
>>> Caused by: java.io.IOException: Broken pipe
>>>   at java.base/sun.nio.ch.FileDispatcherImpl.writev0(Native Method)
>>>   at java.base/sun.nio.ch.SocketDispatcher.writev(
>>> SocketDispatcher.java:51)
>>>   at java.base/sun.nio.ch.IOUtil.write(IOUtil.java:148)
>>>   at java.base/sun.nio.ch.SocketChannelImpl.write(
>>> SocketChannelImpl.java:506)
>>>   at org.eclipse.jetty.io.ChannelEndPoint.flush(
>>> ChannelEndPoint.java:272)
>>>   ... 69 more
>>> 
>>> 
>>> I have read up on the exception message and found
>>> http://lucene.472066.n3.nabble.com/Unable-to-write-response-client-closed-
>>> connection-or-we-are-shutting-down-tt4350349.html#a4350947
>>> but I don’t understand how an early client connect can cause what I am
>>> seeing:
>>> 
>>> What puzzles me is that the response has been delivered in full to the
>>> client library, including the document with EOF.
>>> 
>>> So Solr must have already processed the streaming expression and returned
>>> the result.
>>> It’s just that the log is filled with stacktraces of this exception that
>>> suggests something went wrong.
>>> I don’t understand why this happens when the query seems to have succeeded.
>>> 
>>> 
>>> Best regards,
>>> Christian
>>> 
>>> 
>>> 
> 


Streaming Expressions: Merge array values? Inverse of cartesianProduct()

2018-06-14 Thread Christian Spitzlay
Hi,

is there a way to merge array values?

Something that transforms

{
  "k1": "1",
  "k2": ["a", "b"] 
},
{
  "k1": "2",
  "k2": ["c", "d"] 
},
{
  "k1": "2",
  "k2": ["e", "f"] 
}

into 

{
  "k1": "1",
  "k2": ["a", "b"] 
},
{
  "k1": "2",
  "k2": ["c", "d", "e", "f"] 
}


And an inverse of cartesianProduct() that transforms

{
  "k1": "1",
  "k2": "a"
},
{
  "k1": "2",
  "k2": "b"
},
{
  "k1": "2",
  "k2": "c"
}

into 

{
  "k1": "1",
  "k2": ["a"]
},
{
  "k1": "2",
  "k2": ["b", "c"] 
}


Christian




Re: Exception when processing streaming expression

2018-06-14 Thread Christian Spitzlay
Here is one I stripped down as far as I could:

innerJoin(sort(search(kmm, 
q="sds_endpoint_uuid:(2f927a0b\-fe38\-451e\-9103\-580914a77e82)", 
fl="sds_endpoint_uuid,sds_to_endpoint_uuid", sort="sds_to_endpoint_uuid ASC", 
qt="/export"), by="sds_endpoint_uuid ASC"), search(kmm, 
q=ss_search_api_datasource:entity\:as_metadata, fl="sds_metadata_of_uuid", 
sort="sds_metadata_of_uuid ASC", qt="/select", rows=1), 
on="sds_endpoint_uuid=sds_metadata_of_uuid")

The exception happens both via PHP (search_api_solr / Solarium) and via the 
Solr admin UI.
(version: Solr 7.3.1 on macOS High Sierra 10.13.5)

It seems to be related to the fact that the second stream uses "select".
- If I use "export" the exception doesn’t occur.
- If I set the rows parameter "low enough" so I do not get any results
  the exception doesn’t occur either.


BTW: Do you know of any tool for formatting and/or syntax highlighting 
these expressions?


Christian Spitzlay





> On 13.06.2018 at 23:02, Joel Bernstein wrote:
> 
> Can you provide some example expressions that are causing these exceptions?
> 
> Joel Bernstein
> http://joelsolr.blogspot.com/
> 
> On Wed, Jun 13, 2018 at 9:02 AM, Christian Spitzlay <
> christian.spitz...@biologis.com> wrote:
> 
>> Hi,
>> 
>> I am seeing a lot of (reproducible) exceptions in my solr log file
>> when I execute streaming expressions:
>> 
>> o.a.s.s.HttpSolrCall  Unable to write response, client closed connection
>> or we are shutting down
>> org.eclipse.jetty.io.EofException
>>at org.eclipse.jetty.io.ChannelEndPoint.flush(
>> ChannelEndPoint.java:292)
>>at org.eclipse.jetty.io.WriteFlusher.flush(WriteFlusher.java:429)
>>at org.eclipse.jetty.io.WriteFlusher.write(WriteFlusher.java:322)
>>at org.eclipse.jetty.io.AbstractEndPoint.write(
>> AbstractEndPoint.java:372)
>>at org.eclipse.jetty.server.HttpConnection$SendCallback.
>> process(HttpConnection.java:794)
>> […]
>>at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(
>> EatWhatYouKill.java:131)
>>at org.eclipse.jetty.util.thread.ReservedThreadExecutor$
>> ReservedThread.run(ReservedThreadExecutor.java:382)
>>at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(
>> QueuedThreadPool.java:708)
>>at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(
>> QueuedThreadPool.java:626)
>>at java.base/java.lang.Thread.run(Thread.java:844)
>> Caused by: java.io.IOException: Broken pipe
>>at java.base/sun.nio.ch.FileDispatcherImpl.writev0(Native Method)
>>at java.base/sun.nio.ch.SocketDispatcher.writev(
>> SocketDispatcher.java:51)
>>at java.base/sun.nio.ch.IOUtil.write(IOUtil.java:148)
>>at java.base/sun.nio.ch.SocketChannelImpl.write(
>> SocketChannelImpl.java:506)
>>at org.eclipse.jetty.io.ChannelEndPoint.flush(
>> ChannelEndPoint.java:272)
>>... 69 more
>> 
>> 
>> I have read up on the exception message and found
>> http://lucene.472066.n3.nabble.com/Unable-to-write-response-client-closed-
>> connection-or-we-are-shutting-down-tt4350349.html#a4350947
>> but I don’t understand how an early client connect can cause what I am
>> seeing:
>> 
>> What puzzles me is that the response has been delivered in full to the
>> client library, including the document with EOF.
>> 
>> So Solr must have already processed the streaming expression and returned
>> the result.
>> It’s just that the log is filled with stacktraces of this exception that
>> suggests something went wrong.
>> I don’t understand why this happens when the query seems to have succeeded.
>> 
>> 
>> Best regards,
>> Christian
>> 
>> 
>> 



Re: Cost of enabling doc values

2018-06-14 Thread Erick Erickson
My claim is it simply doesn't matter. You either have those bytes
lying around on disk (and using OS memory) in the DV case, or in
the cumulative Java heap in the non-DV case.

If you're doing one of the three operations I know of no situation
where I would _not_ enable docValues.

The Lucene people put a lot of effort into making things compact, so what
you're coming up with is probably an upper bound. Frankly I'd just
enable the DV fields, index a bunch of docs and look at the cumulative
sizes of your dvd and dvm files.

I'd probably index, say, 10M docs and measure the two extensions, then
index 10M more and use the delta between 10M and 20M to extrapolate.
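
For example, something like this run from a core's index directory will total up those two extensions (a sketch; assumes a du that supports -c and -h):

    du -ch *.dvd *.dvm | tail -1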

I also use the size of those files to get something of a sense of how
much OS memory I need for those operations (searching not included
yet). Gives me a sense of whether what I want to do is possible or
not.

Long blog on the topic of sizing, but it sums up as "try it and see":

https://lucidworks.com/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/

Best,
Erick

On Thu, Jun 14, 2018 at 8:34 AM, root23  wrote:
> Thanks for the detailed explanation erick.
> I did a little math as you suggested. Just wanted to see if i am doing it
> right.
> So we have around 4 billion docs in production and around 70 nodes.
>
> To support the business use case we have around 18 fields on which we have
> to enable docvalues for sorting.
>
> FieldType       totalFields   Size of field
> TrieIntField    2             4 bytes
> StrField        7             20 bytes
> IntField        1             4 bytes
> Bool            1             1 byte
> TrieDateField   2             10 bytes
> TextField       5             10 bytes
>
>
> Some of the byte sizes I approximated, like for StrField and TextField, based
> on the no. of characters we usually have in those fields. I am not sure
> how much the TrieDate field will take. Please feel free to correct me if
> I am way off.
>
> So according to the above, the total size for a doc is 2*4 + 7*20 + 4 + 1 + 20 + 50 =
> 223 bytes.
>
> So for 4 billion docs it comes to approximately 892 billion bytes, or 892 GB.
>
> Does that math sound right or am I way off?
>
>
>
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Cost of enabling doc values

2018-06-14 Thread root23
Thanks for the detailed explanation erick.
I did a little math as you suggested. Just wanted to see if i am doing it
right.
So we have around 4 billion docs in production and around 70 nodes.

To support the business use case we have around 18 fields on which we have
to enable docvalues for sorting.

FieldType       totalFields   Size of field
TrieIntField    2             4 bytes
StrField        7             20 bytes
IntField        1             4 bytes
Bool            1             1 byte
TrieDateField   2             10 bytes
TextField       5             10 bytes


Some of the byte sizes I approximated, like for StrField and TextField, based
on the no. of characters we usually have in those fields. I am not sure
how much the TrieDate field will take. Please feel free to correct me if
I am way off.

So according to the above, the total size for a doc is 2*4 + 7*20 + 4 + 1 + 20 + 50 =
223 bytes.

So for 4 billion docs it comes to approximately 892 billion bytes, or 892 GB.

Does that math sound right or am I way off?
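
A quick sanity check on the arithmetic above (a rough, uncompressed upper bound, and assuming the docs are spread evenly across the 70 nodes mentioned earlier):

    223 bytes/doc x 4,000,000,000 docs = 892,000,000,000 bytes, i.e. about 892 GB in total
    892 GB / 70 nodes = roughly 13 GB of docValues per node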



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Logging Every document to particular core

2018-06-14 Thread Mikhail Khludnev
You can enable DEBUG level for LogUpdateProcessorFactory category

https://github.com/apache/lucene-solr/blob/228a84fd6db3ef5fc1624d69e1c82a1f02c51352/solr/core/src/java/org/apache/solr/update/processor/LogUpdateProcessorFactory.java#L100
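
One way to turn that on, assuming a Solr version that still ships a log4j.properties (e.g. the 7.3.x mentioned elsewhere in this digest), is a logger line like the following in server/resources/log4j.properties; treat it as a sketch and adapt it to your own logging setup:

    log4j.logger.org.apache.solr.update.processor.LogUpdateProcessorFactory=DEBUG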



On Wed, Jun 13, 2018 at 5:00 PM, govind nitk  wrote:

> Hi,
>
> Is there any way to log all the data getting indexed to a particular core
> only ?
>
>
> Regards,
> govind
>



-- 
Sincerely yours
Mikhail Khludnev


Re: Cost of enabling doc values

2018-06-14 Thread Jan Høydahl
Depending on what your documents look like, it could be that enabling docValues 
would allow you to save space by switching to stored="false" since Solr can 
fetch the stored value from docValues. I say it depends on your documents and 
use case since sometimes it may be slower to access a docValue just to read one 
field if all the other fields come from stored values. If you do not do 
matches/lookups/range-queries on some fields you may even be able to set 
indexed="false" and save space in the inverted index.

A benefit of having docValues enabled is that it then lets you do atomic 
updates to your docs, to re-index from an existing index (not from source) and 
to use streaming expressions on all fields.
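
As a sketch of what that could look like in the schema (the field name is hypothetical; whether dropping indexed/stored is safe depends entirely on how the field is queried and returned):

    <field name="category_code" type="string" indexed="false" stored="false" docValues="true"/>

With a recent schema version the value can still be returned in results, since docValues-only fields are used as stored by default (useDocValuesAsStored).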

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

> On 14 Jun 2018 at 04:13, Erick Erickson wrote:
> 
> I pretty much agree with your business side.
> 
> The rough size of the docValues fields is one value of X bytes per doc. So
> say you have an int field. Size is near maxDoc * 4 bytes. This is not
> totally accurate, there is some int packing done for instance, but
> it'll do. If you really want an accurate count, look at the
> before/after size of your *.dvd, *.dvm segment files in your index.
> 
> However, it's "pay me now or pay me later". The critical operations
> are faceting, grouping and sorting. If you do any of those operations
> on a field that is _not_ docValues=true, it will be uninverted on the
> _java heap_, where it will consume GC cycles, put pressure on all your
> other operations, etc. This process will be done _every_ time you open
> a new searcher and use these fields.
> 
> If the field _does_ have docValues=true, that will be held in the OS's
> memory space, _not_ the JVM's heap due to using MMapDirectory (see:
> http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html).
> Among other virtues, it can be swapped out (although you don't want it
> to be, it's still better than OOMing). Plus loading it is just reading
> it off disk rather than the expensive uninversion process.
> 
> And if you don't do any of those operations (grouping, sorting and
> faceting), then the bits just sit there on disk doing nothing.
> 
> So say you carefully define what fields will be used for any of the
> three operations and enable docValues. Then 3 months later the
> business side comes back with "oh, we need to facet on another field".
> Your choices are:
> 1> live with the increased heap usage and other resource contention.
> Perhaps along the way panicking because your processes OOM and prod
> goes down.
> or
> 2> reindex from scratch, starting with a totally new collection.
> 
> And note the fragility here. Your application can be humming along
> just fine for months. Then one fine day someone innocently submits a
> query that sorts on a new field that has docValues=false and B-OOM.
> 
> If (and only if) you can _guarantee_ that fieldX will never be used
> for any of the three operations, then turning off docValues for that
> field will save you some disk space. But that's the only advantage.
> Well, alright. If you have to do a full index replication that'll
> happen a bit faster too.
> 
> So I prefer to err on the side of caution. I recommend making fields
> docValues=true unless I can absolutely guarantee (and business _also_
> agrees)
> 1>  that fieldX will never be used for sorting, grouping or faceting,
> or
> 2> if they can't promise that, they guarantee to give me time to
> completely reindex.
> 
> Best,
> Erick
> 
> 
> On Wed, Jun 13, 2018 at 4:30 PM, root23  wrote:
>> Hi all,
>> Does anyone know how much the index size typically increases when we enable
>> docValues on a field?
>> Our business side wants to enable sorting on most of our fields. I am
>> trying to push back saying that it will increase the index size, since
>> enabling docValues will create the uninverted index.
>> 
>> I know the size probably depends on what values are in the fields but I need
>> a general idea so that I can convince them that enabling it on those fields is
>> costly and it will incur this much cost.
>> 
>> If anyone knows how to find this out looking at an existing solr index which
>> has docValues enabled, that will also be a great help.
>> 
>> Thanks !!!
>> 
>> 
>> 
>> --
>> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html



Re: Solr Suggest Component and OOM

2018-06-14 Thread Alessandro Benedetti
I didn't get any answer to my questions (unless you meant you have 25
million different values for those fields ...)
Please read my answer again and elaborate further.
Does your problem happen for the 2 different suggesters?

Cheers



-
---
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Hardware-Aware Solr Coud Sharding?

2018-06-14 Thread Jan Høydahl
You could also look into the Autoscaling stuff in 7.x which can be programmed 
to move shards around based on system load and HW specs on the various nodes, 
so in theory that framework (although still a bit unstable) will suggest moving 
some replicas from weak nodes over to more powerful ones. If you "overshard" 
your system, i.e. if you have three nodes, you create a collection with 9 
shards, then there will be three shards per node, and Solr can suggest moving 
one of them off to another server.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

> On 12 Jun 2018 at 18:39, Erick Erickson wrote:
> 
> In a mixed-hardware situation you can certainly place replicas as you
> choose. Create a minimal collection or use the special nodeset EMPTY
> and then place your replicas one-by-one.
> 
> You can also consider "replica placement rules", see:
> https://lucene.apache.org/solr/guide/6_6/rule-based-replica-placement.html.
> I _think_ this would be a variant of "rack aware". In this case you'd
> provide a "snitch" that says something about the hardware
> characteristics and the rules you'd define would be sensitive to that.
> 
> WARNING: haven't done this myself so don't have any examples to point to
> 
> Best,
> Erick
> 
> On Tue, Jun 12, 2018 at 8:34 AM, Shawn Heisey  wrote:
>> On 6/12/2018 9:12 AM, Michael Braun wrote:
>>> The way to handle this right now looks to be running additional Solr
>>> instances on nodes with increased resources to balance the load (so if the
>>> machines are 1x, 1.5x, and 2x, run 2 instances, 3 instances, and 4
>>> instances, respectively). Has anyone looked into other ways of handling
>>> this that don't require the additional Solr instance deployments?
>> 
>> Usually, no.  In most cases, you only want to run one Solr instance per
>> server.  One Solr instance can handle many individual shard replicas.
>> If there are more individual indexes on a Solr instance, then it is
>> likely to be able to take advantage of additional system resources
>> without running another Solr instance.
>> 
>> The only time you should run multiple Solr instances is when the heap
>> requirements for running the required indexes with one instance would be
>> way too big.  Splitting the indexes between two instances with smaller
>> heaps might end up with much better garbage collection efficiency.
>> 
>> https://lucene.apache.org/solr/guide/7_3/taking-solr-to-production.html#running-multiple-solr-nodes-per-host
>> 
>> Thanks,
>> Shawn
>> 



Re: Can replace the IP with the hostname or some unique identifier for each node in Solr

2018-06-14 Thread Jan Høydahl
See this FAQ
https://github.com/docker-solr/docker-solr/blob/master/Docker-FAQ.md#can-i-run-zookeeper-and-solr-clusters-under-docker

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

> On 8 Jun 2018 at 14:52, akshat wrote:
> 
> Hi,
> 
> I have deployed Solr in docker swarm and scaling the replicas as 3.
> 
> What I have achieved -> Created the Solr core replicas in the other
> containers.
> 
> Blocker -> When I kill a container, Docker swarm brings another container
> with a different IP. So, when I see the graph it is still pointing to the
> older dead node. But in ZooKeeper live_nodes, I can see the newly registered
> node.
> 
> So, for experimenting I am doing it manually through the GUI: pointing to the
> new node by manually deleting the older node from the collections in the Solr
> GUI and creating a replica in the new node.
> 
> My question -> Is there some way we can trick Solr by replacing the IP which
> it shows in the graph with some unique identifier, so that when swarm brings
> up the new node it is still pointing to the unique identifier name, not the
> IP.
> 
> -- 
> Regards
> Akshat Singh



Re: Solr Suggest Component and OOM

2018-06-14 Thread Ratnadeep Rakshit
Anyone from the Solr team who can shed some more light?

On Tue, Jun 12, 2018 at 8:13 PM, Ratnadeep Rakshit 
wrote:

> I observed that the build works if the data size is below 25M. The moment
> the records go beyond that, this OOM error shows up. Solr itself shows 56%
> usage of 20GB space during the build. So, are there some settings I need to
> change to handle a larger data size?
>
> On Tue, Jun 12, 2018 at 3:17 PM, Alessandro Benedetti <
> a.benede...@sease.io> wrote:
>
>> Hi,
>> first of all the two different suggesters you are using are based on
>> different data structures ( with different memory utilisation) :
>>
>> - FuzzyLookupFactory -> FST ( in memory and stored binary on disk)
>> - AnalyzingInfixLookupFactory -> Auxiliary Lucene Index
>>
>> Both the data structures should be very memory efficient ( both in
>> building
>> and storage).
>> What is the cardinality of the fields you are building suggestions from ?
>> (
>> site_address and site_address_other)
>> What is the memory situation in Solr when you start the suggester
>> building ?
>> You are allocating much more memory to the JVM Solr process than to the OS
>> (which in your situation means the OS cache cannot fit the entire index, the ideal scenario).
>>
>> I would recommend to put some monitoring in place ( there are plenty of
>> open
>> source tools to do that)
>>
>> Regards
>>
>>
>>
>> -
>> ---
>> Alessandro Benedetti
>> Search Consultant, R&D Software Engineer, Director
>> Sease Ltd. - www.sease.io
>> --
>> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>>
>
>


Re: A good KV store/plugins to go with Solr

2018-06-14 Thread Joel Bernstein
The approach that Alfresco/Solr takes with this is to store the original
document in the filesystem when it indexes content. This way you can be frugal
about which fields are stored in the index. Then Alfresco/Solr can retrieve
the original document as part of the results using a doc transformer.

This may be an approach that Solr could adopt.

Joel Bernstein
http://joelsolr.blogspot.com/

On Thu, Jun 14, 2018 at 8:10 AM, Jan Høydahl  wrote:

> You could fetch the data from your application directly :;)
> Also, Streaming Expressions have a jdbc() function but then you will
> need to know what to query for. It also has a fetch() function which
> enriches documents with fields from another collection. It would probably
> be possible to write a fetchKV() function which per result document fetches
> data from external JDBC (or other) source and enriches on the fly.
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
>
> > On 5 Jun 2018 at 05:38, Erick Erickson wrote:
> >
> > Well, you can always throw more replicas at the problem as well.
> >
> > But Andrea's comment is spot on. When Solr stores a field, it
> > compresses it. So to fetch the stored info, it has to:
> > 1> seek the disk
> > 2> decompress at minimum 16K
> > 3> assemble the response.
> >
> > All the while perhaps causing memory to be consumed, adding to GC
> > issues and the like.
> >
> > One possibility is to implement a doc transformer. See the class
> > ValueAugmenterFactory for a model. What that does is, for each doc
> > returned in the result set, call the transform method.
> >
> > Another approach would be to only index the first, say, 1K characters
> > and just return _that_, along with a link for the full doc that you
> > get from another store. Or, indeed from Solr itself since that would
> > only be one doc at a time. If you put this in as a string type with
> > docValues=true you would avoid most of the disk seek/decompression
> > issues.
> >
> > Best,
> > Erick
> >
> > On Mon, Jun 4, 2018 at 12:27 PM, Andrea Gazzarini 
> wrote:
> >> Hi Sam, I have been in a similar scenario (not recently so my answer
> could
> >> be outdated). As far as I remember caching, at least in that scenario,
> >> didn't help so much, probably because the field size.
> >>
> >> So we went with the second option: a custom SearchComponent connected
> with
> >> Redis. I'm not aware if such component is available somewhere but, trust
> >> me, it's a very easy thing to write.
> >>
> >> Best,
> >> Andrea
> >>
> >> On Mon, 4 Jun 2018, 20:45 Sambhav Kothari, 
> wrote:
> >>
> >>> Hi everyone,
> >>>
> >>> We at MetaBrainz are trying to scale our solr cloud instance but are
> >>> hitting a bottle-neck.
> >>>
> >>> Each of the documents in our solr index is accompanied by a '_store'
> field
> >>> that store our API compatible response for that document (which is
> >>> basically parsed and displayed by our custom response writer).
> >>>
> >>> The main problem is that this field is very large (It takes up to
> 60-70% of
> >>> our index) and because of this, Solr is struggling to keep up with our
> >>> required reqs/s.
> >>>
> >>> Any ideas on how to improve upon this?
> >>>
> >>> I have a couple of options in mind -
> >>>
> >>> 1. Use caches extensively.
> >>> 2. Have solr return only a doc id and fetch the response string from a
> KV
> >>> store/fast db.
> >>>
> >>> About 2 - are there any solr plugins will allow me to do this?
> >>>
> >>> Thanks,
> >>> Sam
> >>>
>
>


Re: A good KV store/plugins to go with Solr

2018-06-14 Thread Jan Høydahl
You could fetch the data from your application directly :;)
Also, Streaming Expressions have a jdbc() function but then you will need to
know what to query for. It also has a fetch() function which enriches documents 
with fields from another collection. It would probably be possible to write a 
fetchKV() function which per result document fetches data from external JDBC 
(or other) source and enriches on the fly.
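
A sketch of what the existing fetch() decoration looks like (the collection and field names here are made up):

    fetch(kvcollection,
          search(products, q="*:*", fl="id,name,kv_key", sort="id asc", qt="/export"),
          fl="big_payload",
          on="kv_key=kv_id")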

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

> On 5 Jun 2018 at 05:38, Erick Erickson wrote:
> 
> Well, you can always throw more replicas at the problem as well.
> 
> But Andrea's comment is spot on. When Solr stores a field, it
> compresses it. So to fetch the stored info, it has to:
> 1> seek the disk
> 2> decompress at minimum 16K
> 3> assemble the response.
> 
> All the while perhaps causing memory to be consumed, adding to GC
> issues and the like.
> 
> One possibility is to implement a doc transformer. See the class
> ValueAugmenterFactory for a model. What that does is, for each doc
> returned in the result set, call the transform method.
> 
> Another approach would be to only index the first, say, 1K characters
> and just return _that_, along with a link for the full doc that you
> get from another store. Or, indeed from Solr itself since that would
> only be one doc at a time. If you put this in as a string type with
> docValues=true you would avoid most of the disk seek/decompression
> issues.
> 
> Best,
> Erick
> 
> On Mon, Jun 4, 2018 at 12:27 PM, Andrea Gazzarini  
> wrote:
>> Hi Sam, I have been in a similar scenario (not recently so my answer could
>> be outdated). As far as I remember caching, at least in that scenario,
>> didn't help so much, probably because of the field size.
>> 
>> So we went with the second option: a custom SearchComponent connected with
>> Redis. I'm not aware if such component is available somewhere but, trust
>> me, it's a very easy thing to write.
>> 
>> Best,
>> Andrea
>> 
>> On Mon, 4 Jun 2018, 20:45 Sambhav Kothari,  wrote:
>> 
>>> Hi everyone,
>>> 
>>> We at MetaBrainz are trying to scale our solr cloud instance but are
>>> hitting a bottle-neck.
>>> 
>>> Each of the documents in our solr index is accompanied by a '_store' field
>>> that store our API compatible response for that document (which is
>>> basically parsed and displayed by our custom response writer).
>>> 
>>> The main problem is that this field is very large (It takes up to 60-70% of
>>> our index) and because of this, Solr is struggling to keep up with our
>>> required reqs/s.
>>> 
>>> Any ideas on how to improve upon this?
>>> 
>>> I have a couple of options in mind -
>>> 
>>> 1. Use caches extensively.
>>> 2. Have solr return only a doc id and fetch the response string from a KV
>>> store/fast db.
>>> 
>>> About 2 - are there any solr plugins will allow me to do this?
>>> 
>>> Thanks,
>>> Sam
>>> 



Re: Logging Every document to particular core

2018-06-14 Thread Alessandro Benedetti
Isn't the Transaction Log what you are looking for ?

Read this good blog post as a reference :
https://lucidworks.com/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/

Cheers



-
---
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Solr 7.2.1 Master-slave replication Issue

2018-06-14 Thread Nitin Kumar
Hi,

I am facing an issue with Solr 7.2.1 master-slave replication.

Master-slave replication is working fine.
But if I disable replication from the master, the slaves show no data
(numFound=0). The slave is not serving the data it had before replication.
I suspect the index generation is getting updated on the slave, which was not
the case in the previous Solr version.

Please advise.

Thanks,
Nitin