Re: Solr on HDFS: Streaming API performance tuning

2016-12-16 Thread Reth RM
If you could provide the JSON parse exception stack trace, it might help to
pinpoint the issue.
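
If it helps to capture that trace, a minimal SolrJ sketch along these lines
(zkHost, collection and field names are placeholders, and with 5.x/6.x-era
SolrJ the params object may differ) reads the /export stream and prints the
full exception instead of swallowing it:

  import java.util.HashMap;
  import java.util.Map;
  import org.apache.solr.client.solrj.io.Tuple;
  import org.apache.solr.client.solrj.io.stream.CloudSolrStream;

  public class StreamDump {
    public static void main(String[] args) throws Exception {
      Map<String, String> props = new HashMap<>();
      props.put("q", "*:*");
      props.put("fl", "id");            // placeholder field list
      props.put("sort", "id asc");
      props.put("qt", "/export");

      // Placeholder zkHost and collection name.
      CloudSolrStream stream =
          new CloudSolrStream("zk1:2181,zk2:2181/solr", "my_collection", props);
      try {
        stream.open();
        Tuple tuple;
        while (!(tuple = stream.read()).EOF) {
          // consume tuples here, e.g. tuple.getString("id")
        }
      } catch (Exception e) {
        // The complete stack trace is what would show where the JSON parsing fails.
        e.printStackTrace();
      } finally {
        stream.close();
      }
    }
  }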


On Fri, Dec 16, 2016 at 5:52 PM, Chetas Joshi 
wrote:

> Hi Joel,
>
> The only non-alphanumeric characters I have in my data are '+' and '/'. I
> don't have any backslashes.
>
> If the special characters were the issue, I should be getting the JSON
> parsing exceptions every time, irrespective of the index size and of the
> available memory on the machine. That is not the case here. The streaming
> API successfully returns all the documents when the index size is small and
> fits in the available memory. That's the reason I am confused.
>
> Thanks!
>
> On Fri, Dec 16, 2016 at 5:43 PM, Joel Bernstein 
> wrote:
>
> > The Streaming API may have been throwing exceptions because the JSON
> > special characters were not escaped. This was fixed in Solr 6.0.
> >
> >
> >
> >
> >
> >
> > Joel Bernstein
> > http://joelsolr.blogspot.com/
> >
> > On Fri, Dec 16, 2016 at 4:34 PM, Chetas Joshi 
> > wrote:
> >
> > > Hello,
> > >
> > > I am running Solr 5.5.0.
> > > It is a SolrCloud of 50 nodes and I have the following config for all the
> > > collections:
> > > maxShardsPerNode: 1
> > > replicationFactor: 1
> > >
> > > I was using the Streaming API to get results back from Solr. It worked fine
> > > for a while, until the index data size grew beyond 40 GB per shard (i.e. per
> > > node). It then started throwing JSON parsing exceptions while reading the
> > > TupleStream data. FYI: I have other services (Yarn, Spark) deployed on the
> > > same boxes on which the Solr shards are running. Spark jobs also use a lot
> > > of disk cache, so the free memory available for disk cache on the boxes
> > > varies a lot depending upon what else is running on the box.
> > >
> > > Due to this issue, I moved to using the cursor approach, and it works fine,
> > > but as we all know it is way slower than the streaming approach.
> > >
> > > Currently the index size per shard is 80 GB (the machine has 512 GB of RAM,
> > > which is shared by different services/programs: heap, off-heap and disk
> > > cache requirements).
> > >
> > > When I have enough RAM available on the machine (more than 80 GB, so that
> > > all the index data can fit in memory), the streaming API succeeds without
> > > running into any exceptions.
> > >
> > > Question:
> > > How does the index data caching mechanism (for HDFS) differ between the
> > > Streaming API and the cursorMark approach?
> > > Why does the cursor approach work every time, while streaming works only
> > > when there is a lot of free disk cache?
> > >
> > > Thank you.
> > >
> >
>


Re: Solr on HDFS: Streaming API performance tuning

2016-12-16 Thread Chetas Joshi
Hi Joel,

The only non-alphanumeric characters I have in my data are '+' and '/'. I
don't have any backslashes.

If the special characters were the issue, I should be getting the JSON
parsing exceptions every time, irrespective of the index size and of the
available memory on the machine. That is not the case here. The streaming
API successfully returns all the documents when the index size is small and
fits in the available memory. That's the reason I am confused.

Thanks!

On Fri, Dec 16, 2016 at 5:43 PM, Joel Bernstein  wrote:

> The Streaming API may have been throwing exceptions because the JSON
> special characters were not escaped. This was fixed in Solr 6.0.
>
>
>
>
>
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Fri, Dec 16, 2016 at 4:34 PM, Chetas Joshi 
> wrote:
>
> > Hello,
> >
> > I am running Solr 5.5.0.
> > It is a SolrCloud of 50 nodes and I have the following config for all the
> > collections:
> > maxShardsPerNode: 1
> > replicationFactor: 1
> >
> > I was using the Streaming API to get results back from Solr. It worked fine
> > for a while, until the index data size grew beyond 40 GB per shard (i.e. per
> > node). It then started throwing JSON parsing exceptions while reading the
> > TupleStream data. FYI: I have other services (Yarn, Spark) deployed on the
> > same boxes on which the Solr shards are running. Spark jobs also use a lot
> > of disk cache, so the free memory available for disk cache on the boxes
> > varies a lot depending upon what else is running on the box.
> >
> > Due to this issue, I moved to using the cursor approach, and it works fine,
> > but as we all know it is way slower than the streaming approach.
> >
> > Currently the index size per shard is 80 GB (the machine has 512 GB of RAM,
> > which is shared by different services/programs: heap, off-heap and disk
> > cache requirements).
> >
> > When I have enough RAM available on the machine (more than 80 GB, so that
> > all the index data can fit in memory), the streaming API succeeds without
> > running into any exceptions.
> >
> > Question:
> > How does the index data caching mechanism (for HDFS) differ between the
> > Streaming API and the cursorMark approach?
> > Why does the cursor approach work every time, while streaming works only
> > when there is a lot of free disk cache?
> >
> > Thank you.
> >
>


Re: Solr on HDFS: Streaming API performance tuning

2016-12-16 Thread Joel Bernstein
The Streaming API may have been throwing exceptions because the JSON
special characters were not escaped. This was fixed in Solr 6.0.






Joel Bernstein
http://joelsolr.blogspot.com/

On Fri, Dec 16, 2016 at 4:34 PM, Chetas Joshi 
wrote:

> Hello,
>
> I am running Solr 5.5.0.
> It is a SolrCloud of 50 nodes and I have the following config for all the
> collections:
> maxShardsPerNode: 1
> replicationFactor: 1
>
> I was using the Streaming API to get results back from Solr. It worked fine
> for a while, until the index data size grew beyond 40 GB per shard (i.e. per
> node). It then started throwing JSON parsing exceptions while reading the
> TupleStream data. FYI: I have other services (Yarn, Spark) deployed on the
> same boxes on which the Solr shards are running. Spark jobs also use a lot
> of disk cache, so the free memory available for disk cache on the boxes
> varies a lot depending upon what else is running on the box.
>
> Due to this issue, I moved to using the cursor approach, and it works fine,
> but as we all know it is way slower than the streaming approach.
>
> Currently the index size per shard is 80 GB (the machine has 512 GB of RAM,
> which is shared by different services/programs: heap, off-heap and disk
> cache requirements).
>
> When I have enough RAM available on the machine (more than 80 GB, so that
> all the index data can fit in memory), the streaming API succeeds without
> running into any exceptions.
>
> Question:
> How does the index data caching mechanism (for HDFS) differ between the
> Streaming API and the cursorMark approach?
> Why does the cursor approach work every time, while streaming works only
> when there is a lot of free disk cache?
>
> Thank you.
>


Re: ttl on merge-time possible somehow ?

2016-12-16 Thread Chris Hostetter

: > lucene, something has to "mark" the segments as deleted in order for them
...
: Note, it doesn't mark the "segment", it marks the "document".

correct, typo on my part -- sorry.

: > The dissatisfaction you expressed with this approach confuses me...
: >
: Really ?
: If you have many expiring docs

...you didn't seem to finish that thought, so i'm still not really sure 
what your suggestion is in terms of why an alternative would be more 
efficient.

: "For example, with the configuration below the
: DocExpirationUpdateProcessorFactory will create a timer thread that wakes
: up every 30 seconds. When the timer triggers, it will execute a
: *deleteByQuery* command to *remove any documents* with a value in the
: press_release_expiration_date field value that is in the past "

that document is describing a *logical* deletion as i mentioned before -- 
the documents are "removed" in the sense that they are flagged "not alive" 
and won't be included in future searches, but the data still lives in the 
segments on disk until a future merge.  (That is end user documentation, 
focusing on the effects as perceived by clients -- the concept of "delete" 
from a low level storage implementation is a much more involved concept 
that affects any discussion of "deleting" documents in solr, not just TTL 
based deletes)

: > 1) nothing would ensure that docs *ever* get removed during periods when
: > docs aren't being added (thus no new segments, thus no merging)
: >
: This can be done with a periodic/smart thread that wakes up every 'ttl' and
: checks min-max (or histogram) of timestamps on segments. If there are a
: lot, do merge (or just delete the whole dead segment). At least that's how
: those systems do it.

OK -- with lucene/solr today we have the ConcurrentMergeScheduler which 
will watch for segments that have many (logically deleted) documents 
flagged "not alive" and will proactively merge those segments when the 
number of docs is above some configured/default threshold -- but to 
automatically flag those documents as "deleted" you need something like 
what solr is doing today.


Again: i really feel like the only disconnect here is terminology.

You're describing a background thread that wakes up periodically, scans 
the docs in each segment to see if they have an expire field < $now, and 
based on the size of the set of matches merges some segments and expunges 
the docs that were in that set.  For segments that aren't merged, docs 
stay put and are excluded from queries only by filters specified at 
request time.

What Solr/Lucene has are 2 background threads: one wakes up periodically, 
scans the docs in the index to see if the expire field < $now and if so 
flags them as being "not alive" so they don't match queries at request 
time. A second thread checks each segment to see how many docs are marked 
"not alive" -- either by the previous thread or by some other form of 
(logical) deletion -- and merges some of those segments, expunging the 
docs that were marked "not alive".  For segments that aren't merged, the 
"not alive" docs are still in the segment, but the "not alive" flag 
automatically excludes them from queries.
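
To make that first thread concrete, this is roughly what the expiration timer
amounts to, sketched with SolrJ (the zkHost, collection, field name and
interval are made-up placeholders; the real work is done internally by
DocExpirationUpdateProcessorFactory, this is just the moral equivalent):

  import java.util.concurrent.Executors;
  import java.util.concurrent.ScheduledExecutorService;
  import java.util.concurrent.TimeUnit;
  import org.apache.solr.client.solrj.impl.CloudSolrClient;

  public class ExpireLoop {
    public static void main(String[] args) {
      CloudSolrClient client = new CloudSolrClient("zk1:2181/solr");
      client.setDefaultCollection("my_collection");

      ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
      timer.scheduleAtFixedRate(() -> {
        try {
          // Logical delete: documents are only flagged "not alive"; their bytes
          // stay in the existing segments until a later merge expunges them.
          client.deleteByQuery("press_release_expiration_date:[* TO NOW]");
          client.commit();
        } catch (Exception e) {
          e.printStackTrace();
        }
      }, 30, 30, TimeUnit.SECONDS);
    }
  }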



-Hoss
http://www.lucidworks.com/


Re: ttl on merge-time possible somehow ?

2016-12-16 Thread Dorian Hoxha
On Fri, Dec 16, 2016 at 10:53 PM, Chris Hostetter 
wrote:

>
> : Yep, that's what came in my search. See how TTL work in hbase/cassandra/
> : rocksdb . There
> isn't a "delete old docs" query, but old docs are deleted by the storage
> : when merging. Looks like this needs to be a lucene-module which can then
> be
> : configured by solr ?
> ...
> : Just like in hbase,cassandra,rocksdb, when you "select" a row/document
> that
> : has expired, it exists on the storage, but isn't returned by the db,
>
>
> What you're describing is exactly how segment merges work in Lucene, it's
> just a question of terminology.
>
> In Lucene, "deleting" a document is a *logical* operation, the data still
> lives in the (existing) segments but the affected docs are recorded in a
> list of deletions (and automatically excluded from future searchers that
> are opened against them) ... once the segments are merged then the deleted
> documents are "expunged" rather than being copied over to the new
> segments.
>
> Where this diverges from what you describe is that as things stand in
> lucene, something has to "mark" the segments as deleted in order for them
> to later be expunged -- in Solr right now, the code in question does this
> via an (internal) DBQ.
>
Note, it doesn't mark the "segment", it marks the "document".

>
> The dissatisfaction you expressed with this approach confuses me...
>
Really ?
If you have many expiring docs

>
> >> I did some search for TTL on solr, and found only a way to do it with a
> >> delete-query. But that ~sucks, because you have to do a lot of inserts
> >> (and queries).
>
> ...nothing about this approach does any "inserts" (or queries -- unless
> you mean the DBQ itself?) so w/o more elaboration on what exactly you find
> problematic about this approach, it's hard to make any sense of your
> objection or request for an alternative.
>
"For example, with the configuration below the
DocExpirationUpdateProcessorFactory will create a timer thread that wakes
up every 30 seconds. When the timer triggers, it will execute a
*deleteByQuery* command to *remove any documents* with a value in the
press_release_expiration_date field value that is in the past "


>
> With all those caveats out of the way...
>
> What you're ultimately requesting -- new code that hooks into segment
> merging to exclude "expired" documents from being copied into the new
> merged segments --- should be theoretically possible with a custom
> MergePolicy, but I don't really see how it would be better than the
> current approach in typical use cases (ie: i want docs excluded from
> results after the expiration date is reached, with a min tolerance of
> X) ...
>
I mentioned that the client would also make a range-query since expired
documents in this case would still be indexed.
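
For reference, the range filter I have in mind is just an extra fq on the
expiration field, e.g. with SolrJ (zkHost, collection and field name are
placeholders, not my actual setup):

  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.impl.CloudSolrClient;
  import org.apache.solr.client.solrj.response.QueryResponse;

  public class LiveDocsQuery {
    public static void main(String[] args) throws Exception {
      CloudSolrClient client = new CloudSolrClient("zk1:2181/solr");
      client.setDefaultCollection("my_collection");

      SolrQuery q = new SolrQuery("*:*");
      // Hide documents whose TTL has already passed, at request time.
      // Note: a raw NOW defeats filterCache reuse; rounding (e.g. NOW/MINUTE) helps.
      q.addFilterQuery("expire_at_dt:[NOW TO *]");
      QueryResponse rsp = client.query(q);
      System.out.println("live docs matched: " + rsp.getResults().getNumFound());
      client.close();
    }
  }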

>
> 1) nothing would ensure that docs *ever* get removed during periods when
> docs aren't being added (thus no new segments, thus no merging)
>
This can be done with a periodic/smart thread that wakes up every 'ttl' and
checks the min-max (or a histogram) of timestamps on segments. If there are a
lot of expired docs, do a merge (or just drop the whole dead segment). At
least that's how those systems do it.

>
> 2) as you described, query clients would be required to specify date range
> filters on every query to identify the "logically live docs at this
> moment" on a per-request basis -- something that's far less efficient from
> a caching standpoint than letting the system do a DBQ on the backend to
> affect the *global* set of logically live docs at the index level.
>
This makes sense. The deleted-docs bitset is cached better than the range
query I described.

>
>
> Frankly: It seems to me that you've looked at how other non-lucene based
> systems X & Y handle TTL type logic and decided that's the best possible
> solution therefore the solution used by Solr "sucks" w/o taking into
> account that what's efficient in the underlying Lucene storage
> implementation might just be different than what's efficient in the underlying
> storage implementation of X & Y.
>
Yes.

>
> If you'd like to tackle implementing TTL as a lower level primitive
> concept in Lucene, then by all means be my guest -- but personally i
> don't think you're going to find any real perf improvements in an
> approach like you describe compared to what we offer today.  i look
> forward to being proved wrong.
>
Since the implementation is apparently more efficient than I thought, I'm
gonna leave it.

>
>
>
> -Hoss
> http://www.lucidworks.com/
>


Re: ttl on merge-time possible somehow ?

2016-12-16 Thread Chris Hostetter

: Yep, that's what came in my search. See how TTL work in hbase/cassandra/
: rocksdb . There
: isn't a "delete old docs" query, but old docs are deleted by the storage
: when merging. Looks like this needs to be a lucene-module which can then be
: configured by solr ?
...
: Just like in hbase,cassandra,rocksdb, when you "select" a row/document that
: has expired, it exists on the storage, but isn't returned by the db,


What you're describing is exactly how segment merges work in Lucene, it's 
just a question of terminology.

In Lucene, "deleting" a document is a *logical* operation, the data still 
lives in the (existing) segments but the affected docs are recorded in a 
list of deletions (and automatically excluded from future searchers that 
are opened against them) ... once the segments are merged then the deleted 
documents are "expunged" rather than being copied over to the new 
segments.

Where this diverges from what you describe is that as things stand in 
lucene, something has to "mark" the segments as deleted in order for them 
to later be expunged -- in Solr right now, the code in question does this 
via an (internal) DBQ.

The dissatisfaction you expressed with this approach confuses me...

>> I did some search for TTL on solr, and found only a way to do it with a
>> delete-query. But that ~sucks, because you have to do a lot of inserts 
>> (and queries).

...nothing about this approach does any "inserts" (or queries -- unless 
you mean the DBQ itself?) so w/o more elaboration on what exactly you find 
problematic about this approach, it's hard to make any sense of your 
objection or request for an alternative.


With all those caveats out of the way...

What you're ultimately requesting -- new code that hooks into segment 
merging to exclude "expired" documents from being copied into the new 
merged segments --- should be theoretically possible with a custom 
MergePolicy, but I don't really see how it would be better than the 
current approach in typical use cases (ie: i want docs excluded from 
results after the expiration date is reached, with a min tolerance of 
X) ...

1) nothing would ensure that docs *ever* get removed during periods when 
docs aren't being added (thus no new segments, thus no merging)

2) as you described, query clients would be required to specify date range 
filters on every query to identify the "logically live docs at this 
moment" on a per-request basis -- something that's far less efficient from 
a caching standpoint than letting the system do a DBQ on the backend to 
affect the *global* set of logically live docs at the index level.


Frankly: It seems to me that you've looked at how other non-lucene based 
systems X & Y handle TTL type logic and decided that's the best possible 
solution therefore the solution used by Solr "sucks" w/o taking into 
account that what's efficient in the underlying Lucene storage 
implementation might just be different than what's efficient in the underlying 
storage implementation of X & Y.

If you'd like to tackle implementing TTL as a lower level primitive 
concept in Lucene, then by all means be my guest -- but personally i 
don't think you're going to find any real perf improvements in an 
approach like you describe compared to what we offer today.  i look 
forward to being proved wrong.



-Hoss
http://www.lucidworks.com/


Solr on HDFS: Streaming API performance tuning

2016-12-16 Thread Chetas Joshi
Hello,

I am running Solr 5.5.0.
It is a SolrCloud of 50 nodes and I have the following config for all the
collections:
maxShardsPerNode: 1
replicationFactor: 1

I was using the Streaming API to get results back from Solr. It worked fine
for a while, until the index data size grew beyond 40 GB per shard (i.e. per
node). It then started throwing JSON parsing exceptions while reading the
TupleStream data. FYI: I have other services (Yarn, Spark) deployed on the
same boxes on which the Solr shards are running. Spark jobs also use a lot
of disk cache, so the free memory available for disk cache on the boxes
varies a lot depending upon what else is running on the box.

Due to this issue, I moved to using the cursor approach, and it works fine,
but as we all know it is way slower than the streaming approach.

Currently the index size per shard is 80 GB (the machine has 512 GB of RAM,
which is shared by different services/programs: heap, off-heap and disk
cache requirements).

When I have enough RAM available on the machine (more than 80 GB, so that
all the index data can fit in memory), the streaming API succeeds without
running into any exceptions.

Question:
How does the index data caching mechanism (for HDFS) differ between the
Streaming API and the cursorMark approach?
Why does the cursor approach work every time, while streaming works only
when there is a lot of free disk cache?
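
For comparison, the cursor approach I fall back to is roughly the following
SolrJ loop (zkHost, collection and sort field are placeholders, not my exact
code):

  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.SolrQuery.SortClause;
  import org.apache.solr.client.solrj.impl.CloudSolrClient;
  import org.apache.solr.client.solrj.response.QueryResponse;
  import org.apache.solr.common.params.CursorMarkParams;

  public class CursorWalk {
    public static void main(String[] args) throws Exception {
      CloudSolrClient client = new CloudSolrClient("zk1:2181/solr");
      client.setDefaultCollection("my_collection");

      SolrQuery q = new SolrQuery("*:*");
      q.setRows(10000);
      q.setSort(SortClause.asc("id"));   // cursorMark needs a sort ending on the uniqueKey

      String cursor = CursorMarkParams.CURSOR_MARK_START;
      while (true) {
        q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursor);
        QueryResponse rsp = client.query(q);
        // process rsp.getResults() here
        String next = rsp.getNextCursorMark();
        if (cursor.equals(next)) {
          break;                          // no more results
        }
        cursor = next;
      }
      client.close();
    }
  }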

Thank you.


Re: ttl on merge-time possible somehow ?

2016-12-16 Thread Dorian Hoxha
Well, there is a reason why they all do it that way.

I'm gonna guess that the reason Lucene does it this way is that it keeps a
"deleted docs" bitset, which acts like a filter and is not as slow as doing
a full delete+insert like in the other DBs that I mentioned.
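
A quick way to see that flagged-but-not-yet-merged documents are still
physically present is to compare maxDoc() with numDocs() on a Lucene reader
(a sketch against a local index path, which is an assumption; the idea is the
same for other Directory implementations):

  import java.nio.file.Paths;
  import org.apache.lucene.index.DirectoryReader;
  import org.apache.lucene.store.FSDirectory;

  public class DeletedDocsReport {
    public static void main(String[] args) throws Exception {
      // Placeholder path to a core's index directory.
      try (DirectoryReader reader =
               DirectoryReader.open(FSDirectory.open(Paths.get("/path/to/index")))) {
        // maxDoc counts every document slot in the segments, including those
        // flagged as deleted; numDocs counts only the "live" ones a search sees.
        System.out.println("maxDoc         = " + reader.maxDoc());
        System.out.println("numDocs        = " + reader.numDocs());
        System.out.println("numDeletedDocs = " + reader.numDeletedDocs());
      }
    }
  }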

Thanks Shawn.

On Fri, Dec 16, 2016 at 9:57 PM, Shawn Heisey  wrote:

> On 12/16/2016 1:12 PM, Dorian Hoxha wrote:
> > Shawn, I know how it works, I read the blog post. But I don't want it
> > that
> > way. So how to do it my way? Like a custom merge function on lucene or
> > something else ?
>
> A considerable amount of custom coding.
>
> At a minimum, you'd have to write your own implementations of some
> Lucene classes and probably some Solr classes.  This sort of integration
> might also require changes to the upstream Lucene/Solr source code.  I
> doubt there would be enough benefit (either performance or anything
> else) to be worth the time and energy required.  If Lucene-level support
> would have produced a demonstrably better expiration feature, it would
> have been implemented that way.
>
> If you're *already* an expert in Lucene/Solr code, then it might be a
> fun intellectual exercise, but such a large-scale overhaul of an
> existing feature that works well is not something I would try to do.
>
> Thanks,
> Shawn
>
>


Re: cores vs. instances vs. zookeeper vs. cloud vs ?

2016-12-16 Thread John Blythe
ah, there's a lightbulb. your last paragraph cleared up confusion i was
carrying through the responses (and, incidentally, was likely the reason
for my confusion on the ZK question you couldn't make sense of). i was
thinking zookeeper was a separate means of handling things from solrcloud,
two entirely different approaches to scaling out. very helpful to see
how off balance i was on that assumption!

thanks shawn-

-- 
*John Blythe*
Product Manager & Lead Developer

251.605.3071 | j...@curvolabs.com
www.curvolabs.com

58 Adams Ave
Evansville, IN 47713

On Fri, Dec 16, 2016 at 3:31 PM, Shawn Heisey  wrote:

> On 12/16/2016 10:30 AM, John Blythe wrote:
> > thanks, erick. this is helpful. a few questions for clarity's sake, but
> > first: nope, not using SolrCloud as of yet.
> >
> >- if i start using SolrCloud i could have my current multi-core setup
> >(e.g. "transactions", "opportunities", etc.) exist within the
> appropriate
> >collection. so instead of dev-transactions i'd have a 'dev'
> collection that
> >has a 'transactions' core inside of it?
>
> No.  You would not be thinking in terms of cores at all.  When your
> programs talk to SolrCloud, they will only care about collections.
>
> Some terminology clarification: Collections are made up of one or more
> shards.  Shards are made up of one or more replicas.  Each shard replica
> is a core.  One replica for each shard is elected as leader.  If there's
> only one replica, then there's no redundancy, and that replica becomes
> leader.
>
> For the example you gave, you would have dev-transactions and
> prod-transactions collections.  Each of these collections might have
> shard replicas (cores) on completely different machines in the cloud ...
> or they might be on the same machines.  During normal operation, you
> would never access a core directly.  You'd probably only ever do that if
> something went very wrong and you needed to take very unusual steps to
> fix it or figure out what went wrong.
>
> >- this seems to be the same with ZK, too?
>
> No idea what you're asking here.  Perhaps it should be obvious, but I
> can't figure it out.
>
> >- i'm totally fine w separate/diff indexing. the demo collection, for
> >instance, *has* to be separate from production bc the data has been
> >stitched together from various customers' accounts on prod and
> blinded so
> >that we have avoid privacy issues and can have all the various goodies
> >under one demo account rather than separate ones. is the separate
> indexing
> >happening out of the box w Cloud or something it's even capable of?
>
> Again, I don't really know what you're asking with "separate indexing".
> Different collections are separate from each other, just like cores in
> standalone mode.  Each collection is linked to a configuration in
> Zookeeper, which all of its shard replicas (cores) will use.  You could
> have all your collections pointing to the same config.  Some (or all) of
> them could point to completely different configs, too.
>
> Addressing a later question:  You don't have SolrCloud if you don't have
> zookeeper.  ZK is a requirement.  You don't really interact directly
> with zookeeper when you're using SolrCloud.  It's an administrative
> detail in the *setup* of SolrCloud.
>
> Thanks,
> Shawn
>
>


Re: Separating Search and Indexing in SolrCloud

2016-12-16 Thread Jaroslaw Rozanski
Thanks, that issue looks interesting!

On 16/12/16 16:38, Pushkar Raste wrote:
> This kind of separation is not supported yet.  There is, however, some work
> going on; you can read about it at
> https://issues.apache.org/jira/browse/SOLR-9835
> 
> This unfortunately would not support soft commits and hence would not be a
> good solution for near real time indexing.
> 
> On Dec 16, 2016 7:44 AM, "Jaroslaw Rozanski"  wrote:
> 
>> Sorry, not what I meant.
>>
>> Leader is responsible for distributing update requests to replicas. So
>> eventually all replicas have the same state as the leader. Not a problem.
>>
>> It is more about the performance of such. If I gather correctly, normal
>> replication happens via standard update requests, not by, say, segment copy.
>>
>> Which means an update on the leader is as "expensive" as on a replica.
>>
>> Hence, if my understanding is correct, sending search requests to replicas
>> only, in an index-heavy environment, would bring no benefit.
>>
>> So the question is: is there a mechanism in SolrCloud (not the legacy
>> master/slave set-up) to make one node take the load of indexing while the
>> other nodes focus on searching?
>>
>> This is not a question about the SolrClient, because it is clear how to
>> direct search requests to specific nodes. This is more about index
>> optimization, so that certain nodes (i.e. replicas) suffer less due to
>> high-volume indexing while serving search requests.
>>
>>
>>
>>
>> On 16/12/16 12:35, Dorian Hoxha wrote:
>>> The leader is the source of truth. You expect to make the replica the
>>> source of truth or something???Doesn't make sense?
>>> What people do, is send write to leader/master and reads to
>> replicas/slaves
>>> in other solr/other-dbs.
>>>
>>> On Fri, Dec 16, 2016 at 1:31 PM, Jaroslaw Rozanski >>
>>> wrote:
>>>
 Hi all,

 According to documentation, in normal operation (not recovery) in Solr
 Cloud configuration the leader sends updates it receives to all the
 replicas.

 This means that all nodes in the shard perform the same effort to index a
 single document. Correct?

 Is there then a benefit to *not* send search requests to the leader, but
 only to replicas?

 Given an index- and search-heavy SolrCloud system, is it possible to separate
 search nodes from indexing nodes?


 RE: Solr 5.5.0

 --
 Jaroslaw Rozanski | e: m...@jarekrozanski.com
 695E 436F A176 4961 7793  5C70 AFDF FB5E 682C 4D3D


>>>
>>
>> --
>> Jaroslaw Rozanski | e: m...@jarekrozanski.com
>> 695E 436F A176 4961 7793  5C70 AFDF FB5E 682C 4D3D
>>
>>
> 

-- 
Jaroslaw Rozanski | e: m...@jarekrozanski.com
695E 436F A176 4961 7793  5C70 AFDF FB5E 682C 4D3D





Re: Separating Search and Indexing in SolrCloud

2016-12-16 Thread Jaroslaw Rozanski
Thanks,


On 16/12/16 20:56, Shawn Heisey wrote:
> On 12/16/2016 5:43 AM, Jaroslaw Rozanski wrote:
>> Leader is responsible for distributing update requests to replicas. So
>> eventually all replicas have the same state as the leader. Not a problem. It
>> is more about the performance of such. If I gather correctly, normal
>> replication happens via standard update requests. Not by, say, segment
>> copy.
> 
> For SolrCloud, yes.  The master/slave replication that existed before
> SolrCloud does work by copying segment files, but SolrCloud does not
> work that way.  The old master/slave replication feature IS used by
> SolrCloud, but ONLY for index recovery -- copying the entire index from
> the leader to another replica in the event that the replica gets so far
> behind that it cannot be brought current by regular updates and/or the
> transaction log.  This is also used to make new replicas.
> 
>> Hence, if my understanding is correct, sending search requests to
>> replicas only, in an index-heavy environment, would bring no benefit.
> 
> Correct, it would have no benefit.  There's something else: when you
> send queries to SolrCloud, they do not necessarily stay on the node
> where you sent them.  By default, multiple query requests are load
> balanced across the cloud, so they'll hit the leader anyway, even if you
> never send them to the leader.

With a custom Solr client the above logic no longer applies in my case. I
can easily control which replica/core in a shard my query is directed to
(along with distrib=false).
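
For completeness, the kind of targeting I mean looks roughly like this (host
and core name are placeholders; with newer SolrJ the client is built via a
Builder instead):

  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.impl.HttpSolrClient;
  import org.apache.solr.client.solrj.response.QueryResponse;

  public class DirectReplicaQuery {
    public static void main(String[] args) throws Exception {
      // Point straight at one replica's core...
      HttpSolrClient replica =
          new HttpSolrClient("http://solr-node-3:8983/solr/collection1_shard1_replica2");
      SolrQuery q = new SolrQuery("*:*");
      q.set("distrib", "false");   // ...and keep the query local, no fan-out to other shards
      QueryResponse rsp = replica.query(q);
      System.out.println(rsp.getResults().getNumFound());
      replica.close();
    }
  }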

>> So the question is: is there a mechanism in SolrCloud (not the legacy
>> master/slave set-up) to make one node take the load of indexing while the
>> other nodes focus on searching?
> 
> Indexing will always be done by all replicas, including the leader.
> 
> Something to mention, although it doesn't accomplish what you're after: 
> There is a preferLocalShards parameter that you can send with your query
> to keep SolrCloud from doing its load balancing *if* the query can be
> satisfied from local indexes.  This parameter should only be used in one
> of the following situations:
> 
> * Your query rate is very low.
> * You are already load balancing the requests yourself.
> 
> If the preferLocalShards parameter is used in other situations, it can
> end up concentrating a large number of requests onto some replicas and
> leaving the other replicas idle.
> 
> https://cwiki.apache.org/confluence/display/solr/Distributed+Requests#DistributedRequests-PreferLocalShards


Yep, already solved. I am more concerned with the memory requirements of
high-volume indexing affecting the performance of search requests and/or
cluster stability.

> Thanks,
> Shawn
> 



-- 
Jaroslaw Rozanski | e: m...@jarekrozanski.com
695E 436F A176 4961 7793  5C70 AFDF FB5E 682C 4D3D





Re: ttl on merge-time possible somehow ?

2016-12-16 Thread Shawn Heisey
On 12/16/2016 1:12 PM, Dorian Hoxha wrote:
> Shawn, I know how it works, I read the blog post. But I don't want it
> that
> way. So how to do it my way? Like a custom merge function on lucene or
> something else ?

A considerable amount of custom coding.

At a minimum, you'd have to write your own implementations of some
Lucene classes and probably some Solr classes.  This sort of integration
might also require changes to the upstream Lucene/Solr source code.  I
doubt there would be enough benefit (either performance or anything
else) to be worth the time and energy required.  If Lucene-level support
would have produced a demonstrably better expiration feature, it would
have been implemented that way.

If you're *already* an expert in Lucene/Solr code, then it might be a
fun intellectual exercise, but such a large-scale overhaul of an
existing feature that works well is not something I would try to do.

Thanks,
Shawn



Re: Solr on HDFS: increase in query time with increase in data

2016-12-16 Thread Shawn Heisey
On 12/16/2016 11:58 AM, Chetas Joshi wrote:
> How does the index data caching mechanism differ between the Streaming
> API and the cursor approach?

Solr and Lucene do not handle that caching.  Systems external to Solr
(like the OS, or HDFS) handle the caching.  The cache effectiveness will
be a combination of the cache size, overall data size, and the data
access patterns of the application.  I do not know enough to tell you
how the cursorMark feature and the streaming API work when they access
the index data.  I would imagine them to be pretty similar, but cannot
be sure about that.

Thanks,
Shawn



Re: Separating Search and Indexing in SolrCloud

2016-12-16 Thread Shawn Heisey
On 12/16/2016 5:43 AM, Jaroslaw Rozanski wrote:
> Leader is responsible for distributing update requests to replicas. So
> eventually all replicas have the same state as the leader. Not a problem. It
> is more about the performance of such. If I gather correctly, normal
> replication happens via standard update requests. Not by, say, segment
> copy.

For SolrCloud, yes.  The master/slave replication that existed before
SolrCloud does work by copying segment files, but SolrCloud does not
work that way.  The old master/slave replication feature IS used by
SolrCloud, but ONLY for index recovery -- copying the entire index from
the leader to another replica in the event that the replica gets so far
behind that it cannot be brought current by regular updates and/or the
transaction log.  This is also used to make new replicas.

> Hence, if my understanding is correct, sending search requests to
> replicas only, in an index-heavy environment, would bring no benefit.

Correct, it would have no benefit.  There's something else: when you
send queries to SolrCloud, they do not necessarily stay on the node
where you sent them.  By default, multiple query requests are load
balanced across the cloud, so they'll hit the leader anyway, even if you
never send them to the leader.

> So the question is: is there a mechanism in SolrCloud (not the legacy
> master/slave set-up) to make one node take the load of indexing while the
> other nodes focus on searching?

Indexing will always be done by all replicas, including the leader.

Something to mention, although it doesn't accomplish what you're after: 
There is a preferLocalShards parameter that you can send with your query
to keep SolrCloud from doing its load balancing *if* the query can be
satisfied from local indexes.  This parameter should only be used in one
of the following situations:

* Your query rate is very low.
* You are already load balancing the requests yourself.

If the preferLocalShards parameter is used in other situations, it can
end up concentrating a large number of requests onto some replicas and
leaving the other replicas idle.

https://cwiki.apache.org/confluence/display/solr/Distributed+Requests#DistributedRequests-PreferLocalShards
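
In SolrJ terms it is just another query parameter, e.g. (zkHost and collection
name are placeholders):

  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.impl.CloudSolrClient;

  public class PreferLocalExample {
    public static void main(String[] args) throws Exception {
      CloudSolrClient client = new CloudSolrClient("zk1:2181/solr");
      client.setDefaultCollection("my_collection");

      SolrQuery q = new SolrQuery("*:*");
      // Ask the receiving node to serve the query from its own replicas when it
      // can, instead of load balancing across the cloud -- subject to the
      // caveats above about query rate and external load balancing.
      q.set("preferLocalShards", true);
      client.query(q);
      client.close();
    }
  }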

Thanks,
Shawn



Re: cores vs. instances vs. zookeeper vs. cloud vs ?

2016-12-16 Thread Shawn Heisey
On 12/16/2016 10:30 AM, John Blythe wrote:
> thanks, erick. this is helpful. a few questions for clarity's sake, but
> first: nope, not using SolrCloud as of yet.
>
>- if i start using SolrCloud i could have my current multi-core setup
>(e.g. "transactions", "opportunities", etc.) exist within the appropriate
>collection. so instead of dev-transactions i'd have a 'dev' collection that
>has a 'transactions' core inside of it?

No.  You would not be thinking in terms of cores at all.  When your
programs talk to SolrCloud, they will only care about collections.

Some terminology clarification: Collections are made up of one or more
shards.  Shards are made up of one or more replicas.  Each shard replica
is a core.  One replica for each shard is elected as leader.  If there's
only one replica, then there's no redundancy, and that replica becomes
leader.

For the example you gave, you would have dev-transactions and
prod-transactions collections.  Each of these collections might have
shard replicas (cores) on completely different machines in the cloud ...
or they might be on the same machines.  During normal operation, you
would never access a core directly.  You'd probably only ever do that if
something went very wrong and you needed to take very unusual steps to
fix it or figure out what went wrong.
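
In SolrJ that mental shift looks something like the sketch below (zkHost and
collection names are placeholders): the code only ever names a collection, and
the client plus ZooKeeper work out which cores are involved.

  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.impl.CloudSolrClient;
  import org.apache.solr.common.SolrInputDocument;

  public class CollectionsOnly {
    public static void main(String[] args) throws Exception {
      // The client discovers the cluster layout from ZooKeeper;
      // no core names appear anywhere in application code.
      CloudSolrClient client = new CloudSolrClient("zk1:2181,zk2:2181,zk3:2181/solr");

      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", "txn-1");

      client.add("dev-transactions", doc);        // index into the dev collection
      client.commit("dev-transactions");

      client.query("prod-transactions", new SolrQuery("*:*"));  // query the prod collection
      client.close();
    }
  }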

>- this seems to be the same with ZK, too?

No idea what you're asking here.  Perhaps it should be obvious, but I
can't figure it out.

>- i'm totally fine w separate/diff indexing. the demo collection, for
>instance, *has* to be separate from production bc the data has been
>stitched together from various customers' accounts on prod and blinded so
>that we avoid privacy issues and can have all the various goodies
>under one demo account rather than separate ones. is the separate indexing
>happening out of the box w Cloud or something it's even capable of?

Again, I don't really know what you're asking with "separate indexing". 
Different collections are separate from each other, just like cores in
standalone mode.  Each collection is linked to a configuration in
Zookeeper, which all of its shard replicas (cores) will use.  You could
have all your collections pointing to the same config.  Some (or all) of
them could point to completely different configs, too.

Addressing a later question:  You don't have SolrCloud if you don't have
zookeeper.  ZK is a requirement.  You don't really interact directly
with zookeeper when you're using SolrCloud.  It's an administrative
detail in the *setup* of SolrCloud.

Thanks,
Shawn



Re: ttl on merge-time possible somehow ?

2016-12-16 Thread Dorian Hoxha
On Fri, Dec 16, 2016 at 8:11 PM, Shawn Heisey  wrote:

> On 12/16/2016 11:13 AM, Dorian Hoxha wrote:
> > Yep, that's what came in my search. See how TTL work in hbase/cassandra/
> > rocksdb . There
> > isn't a "delete old docs" query, but old docs are deleted by the
> > storage when merging. Looks like this needs to be a lucene-module
> > which can then be configured by solr ?
>
> No.  Lucene doesn't know about expiration and doesn't need to know about
> expiration.
>
It needs to know, or else it will be ~inefficient in my case.

>
> The document expiration happens in Solr.  In the background, Solr
> finds/deletes old documents in the Lucene index according to how the
> expiration feature is configured.  What happens after that is basic
> Lucene operation.  If you index enough new data to trigger a merge (or
> if you do an optimize/forceMerge), then Lucene will get rid of deleted
> documents in the merged segments.  The contents of the documents in your
> index (whether that's a timestamp or something else) are completely
> irrelevant for decisions made during Lucene's segment merging.
>
Shawn, I know how it works; I read the blog post. But I don't want it that
way.
So how do I do it my way? A custom merge function in Lucene, or something
else?

>
> > Just like in hbase,cassandra,rocksdb, when you "select" a row/document
> > that has expired, it exists on the storage, but isn't returned by the
> > db, because it checks the timestamp and sees that it's expired. Looks
> > like this also need to be in lucene?
>
> That's pretty much how Lucene (and by extension, Solr) works, except
> it's not related to expiration, it is *deleted* documents that don't
> show up in the results.
>
No, it doesn't. But I want expiration to function that way. Just like there
are "custom update processors", there should be a similar hook for gets (so
in my custom get processor, I would check the timestamp and return NotFound
if the document has expired).

>
> Thanks,
> Shawn
>
Makes sense?


Re: cores vs. instances vs. zookeeper vs. cloud vs ?

2016-12-16 Thread John Blythe
this is awesome. i'm jazzed to try it out. will do some introductory
examples and what not to familiarize myself and get the lightbulb to
(hopefully) go off. one last question: given all the info and questions
above, does it seem like ZK is overkill at this point and i should focus my
efforts on solrcloud? seems like it to me, but hope to confirm that before
investing time in the wrong direction.

thanks!

-- 
*John Blythe*
Product Manager & Lead Developer

251.605.3071 | j...@curvolabs.com
www.curvolabs.com

58 Adams Ave
Evansville, IN 47713

On Fri, Dec 16, 2016 at 1:59 PM, Erick Erickson 
wrote:

> bq: if i start using SolrCloud i could have my current multi-core setup
>(e.g. "transactions", "opportunities", etc.) exist within the
> appropriate
>collection.
>
> I'd guess that cores == collections, but are you reaching across from one
> core to another to satisfy your use-case? I.e. using "cross core joins"
> or anything similar? If so, you have to do some careful placement of
> collections and co-locate the replicas for each collection. This is quite
> do-able via the collections API.
>
> bq: this seems to be the same with ZK, too?
>
> Kind of ignore ZK. From what you're describing you won't
> have 100s of nodes. Therefore ZK will just keep track of it
> all for you. One common misconception is that ZK is involved
> in indexing and querying. It's not, kind of. Once each Solr
> instance gets the current state of the network, the Solr
> instances don't need to reference ZK to update or query.
> It's only when nodes change state (are shut down and the
> like) that ZK gets involved in letting the Solr nodes know about
> the state change.
>
> bq: is the separate indexing
>happening out of the box w Cloud or something it's even capable of?
>
> Totally. The unit of address is a "collection". So let's say you have
> a collection named transactions_dev. You send updates to
> http://any_solr_server:port/solr/transactions_dev/update.
> Queries similarly as
> http://any_solr_server:port/solr/transactions_dev/query
>
> Don't try to put cores in here, just think about collections. The
> actual _core_ is something like transactions_dev_shard1_replica1
> but you'll almost never reference it directly unless you're debugging or
> something.
>
> So what I'd recommend is just go through the getting started example
> and at some point it'll all suddenly start to make sense ;).
>
> The take-away is that much of what you're doing in terms of keeping
> track of cores and all that just goes away.
>
> Best,
> Erick
>
> On Fri, Dec 16, 2016 at 9:30 AM, John Blythe  wrote:
> > thanks, erick. this is helpful. a few questions for clarity's sake, but
> > first: nope, not using SolrCloud as of yet.
> >
> >- if i start using SolrCloud i could have my current multi-core setup
> >(e.g. "transactions", "opportunities", etc.) exist within the
> appropriate
> >collection. so instead of dev-transactions i'd have a 'dev'
> collection that
> >has a 'transactions' core inside of it?
> >- this seems to be the same with ZK, too?
> >- i'm totally fine w separate/diff indexing. the demo collection, for
> >instance, *has* to be separate from production bc the data has been
> >stitched together from various customers' accounts on prod and
> blinded so
> >that we avoid privacy issues and can have all the various goodies
> >under one demo account rather than separate ones. is the separate
> indexing
> >happening out of the box w Cloud or something it's even capable of?
> >
> > thanks again, erick!
> >
> > --
> > *John Blythe*
> > Product Manager & Lead Developer
> >
> > 251.605.3071 | j...@curvolabs.com
> > www.curvolabs.com
> >
> > 58 Adams Ave
> > Evansville, IN 47713
> >
> > On Fri, Dec 16, 2016 at 11:38 AM, Erick Erickson <
> erickerick...@gmail.com>
> > wrote:
> >
> >> It's not quite clear to me whether you're using SolrCloud now or not, my
> >> guess is not. My guess here is that you _should_ move to SolrCloud and
> >> collections. Then, instead of thinking about "cores", you just think
> about
> >> collections. Where the replicas live then isn't something you have to
> >> manage
> >> in that case.
> >>
> >> There's a bit of a learning curve for Zookeeper, and a mental shift you
> >> have to make to not worry about cores at all, just trust Solr. That
> said,
> >> if you _want_ to explicitly manage where each and every core for each
> >> and every collection lives, that's easy with the collections API. Once
> you
> >> do make that shift, going back is painful ;)
> >>
> >> So the scenario is that you have three collections, prod, dev, demo.
> They
> >> all
> >> happen to use the same configset (which you keep in ZK). You have one
> >> zookeeper ensemble that the three collections reference. They can even
> >> all share the same machine if that machine has sufficient capacity.
> >>
> >> The deal here is that these are really completely independent; you'll
> have
> >> to index your content 

Re: ttl on merge-time possible somehow ?

2016-12-16 Thread Shawn Heisey
On 12/16/2016 11:13 AM, Dorian Hoxha wrote:
> Yep, that's what came in my search. See how TTL work in hbase/cassandra/
> rocksdb . There
> isn't a "delete old docs" query, but old docs are deleted by the
> storage when merging. Looks like this needs to be a lucene-module
> which can then be configured by solr ? 

No.  Lucene doesn't know about expiration and doesn't need to know about
expiration.

The document expiration happens in Solr.  In the background, Solr
finds/deletes old documents in the Lucene index according to how the
expiration feature is configured.  What happens after that is basic
Lucene operation.  If you index enough new data to trigger a merge (or
if you do an optimize/forceMerge), then Lucene will get rid of deleted
documents in the merged segments.  The contents of the documents in your
index (whether that's a timestamp or something else) are completely
irrelevant for decisions made during Lucene's segment merging.
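
If you ever need to force that reclamation instead of waiting for natural
merges (usually not recommended on a busy index), it is an explicit call, e.g.
via SolrJ (zkHost and collection name are placeholders):

  import org.apache.solr.client.solrj.impl.CloudSolrClient;

  public class ForceExpunge {
    public static void main(String[] args) throws Exception {
      CloudSolrClient client = new CloudSolrClient("zk1:2181/solr");
      client.setDefaultCollection("my_collection");
      // optimize() triggers a forced merge, which drops documents previously
      // flagged as deleted (expired ones included) from the rewritten segments.
      client.optimize();
      client.close();
    }
  }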

> Just like in hbase,cassandra,rocksdb, when you "select" a row/document
> that has expired, it exists on the storage, but isn't returned by the
> db, because it checks the timestamp and sees that it's expired. Looks
> like this also need to be in lucene?

That's pretty much how Lucene (and by extension, Solr) works, except
it's not related to expiration, it is *deleted* documents that don't
show up in the results.

Thanks,
Shawn



Re: cores vs. instances vs. zookeeper vs. cloud vs ?

2016-12-16 Thread Erick Erickson
bq: if i start using SolrCloud i could have my current multi-core setup
   (e.g. "transactions", "opportunities", etc.) exist within the appropriate
   collection.

I'd guess that cores == collections, but are you reaching across from one
core to another to satisfy your use-case? I.e. using "cross core joins"
or anything similar? If so, you have to do some careful placement of
collections and co-locate the replicas for each collection. This is quite
do-able via the collections API.
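
(For anyone unsure what that looks like: a cross-core/cross-collection join is
the {!join ... fromIndex=...} form sketched below -- collection and field names
are made up -- and fromIndex only resolves when the referenced index lives on
the same node, hence the co-location requirement.)

  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.impl.CloudSolrClient;

  public class CrossCollectionJoin {
    public static void main(String[] args) throws Exception {
      CloudSolrClient client = new CloudSolrClient("zk1:2181/solr"); // placeholder zkHost
      client.setDefaultCollection("transactions");

      // Keep only transactions whose opportunity is still open. The
      // "opportunities" index must be co-located on the same node for
      // fromIndex to resolve.
      SolrQuery q = new SolrQuery("*:*");
      q.addFilterQuery("{!join from=id to=opportunity_id fromIndex=opportunities}status:open");
      client.query(q);
      client.close();
    }
  }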

bq: this seems to be the same with ZK, too?

Kind of ignore ZK. From what you're describing you won't
have 100s of nodes. Therefore ZK will just keep track of it
all for you. One common misconception is that ZK is involved
in indexing and querying. It's not, kind of. Once each Solr
instance gets the current state of the network, the Solr
instances don't need to reference ZK to update or query.
It's only when nodes change state (are shut down and the
like) that ZK gets involved in letting the Solr nodes know about
the state change.

bq: is the separate indexing
   happening out of the box w Cloud or something it's even capable of?

Totally. The unit of address is a "collection". So let's say you have
a collection named transactions_dev. You send updates to
http://any_solr_server:port/solr/transactions_dev/update.
Queries similarly as
http://any_solr_server:port/solr/transactions_dev/query

Don't try to put cores in here, just think about collections. The
actual _core_ is something like transactions_dev_shard1_replica1
but you'll almost never reference it directly unless you're debugging or
something.

So what I'd recommend is just go through the getting started example
and at some point it'll all suddenly start to make sense ;).

The take-away is that much of what you're doing in terms of keeping
track of cores and all that just goes away.

Best,
Erick

On Fri, Dec 16, 2016 at 9:30 AM, John Blythe  wrote:
> thanks, erick. this is helpful. a few questions for clarity's sake, but
> first: nope, not using SolrCloud as of yet.
>
>- if i start using SolrCloud i could have my current multi-core setup
>(e.g. "transactions", "opportunities", etc.) exist within the appropriate
>collection. so instead of dev-transactions i'd have a 'dev' collection that
>has a 'transactions' core inside of it?
>- this seems to be the same with ZK, too?
>- i'm totally fine w separate/diff indexing. the demo collection, for
>instance, *has* to be separate from production bc the data has been
>stitched together from various customers' accounts on prod and blinded so
>that we avoid privacy issues and can have all the various goodies
>under one demo account rather than separate ones. is the separate indexing
>happening out of the box w Cloud or something it's even capable of?
>
> thanks again, erick!
>
> --
> *John Blythe*
> Product Manager & Lead Developer
>
> 251.605.3071 | j...@curvolabs.com
> www.curvolabs.com
>
> 58 Adams Ave
> Evansville, IN 47713
>
> On Fri, Dec 16, 2016 at 11:38 AM, Erick Erickson 
> wrote:
>
>> It's not quite clear to me whether you're using SolrCloud now or not, my
>> guess is not. My guess here is that you _should_ move to SolrCloud and
>> collections. Then, instead of thinking about "cores", you just think about
>> collections. Where the replicas live then isn't something you have to
>> manage
>> in that case.
>>
>> There's a bit of a learning curve for Zookeeper, and a mental shift you
>> have to make to not worry about cores at all, just trust Solr. That said,
>> if you _want_ to explicitly manage where each and every core for each
>> and every collection lives, that's easy with the collections API. Once you
>> do make that shift, going back is painful ;)
>>
>> So the scenario is that you have three collections, prod, dev, demo. They
>> all
>> happen to use the same configset (which you keep in ZK). You have one
>> zookeeper ensemble that the three collections reference. They can even
>> all share the same machine if that machine has sufficient capacity.
>>
>> The deal here is that these are really completely independent; you'll have
>> to index your content to each separately.
>>
>> But then your URL becomes x.x.x.x:8983/solr/prod, x.x.x.x:8983/solr/dev
>> and the like.
>>
>> FWIW,
>> Erick
>>
>> On Fri, Dec 16, 2016 at 5:26 AM, John Blythe  wrote:
>> > good morning everyone.
>> >
>> > i've got a growing number of cores that various parts of our application
>> > are relying upon. i'm having difficulty figuring out the best way to
>> > continue expanding for both sake of scale and convenience.
>> >
>> > i need two extra versions of each core due to our demo instance and our
>> > development instance. when we had just one or two cores it wasn't the
>> worst
>> > thing to have cores like X, demo-X, and dev-X. that has quickly become
>> > unnecessarily cumbersome.
>> >
>> > i've considered moving each instance to its own solr instance, perhaps
>> just
>> > throwing it on a different port.

Re: Solr on HDFS: increase in query time with increase in data

2016-12-16 Thread Chetas Joshi
Thank you everyone. I would add nodes to the SolrCloud and split the shards.

Shawn,

Thank you for explaining why putting the index data on the local file system
could be a better idea than using HDFS. I need to find out how HDFS caches the
index files in a resource-constrained environment.

I would also like to add that when I try the Streaming API instead of using
the cursor approach, it starts running into JSON parsing exceptions when my
nodes (running Solr shards) don't have enough RAM to fit the entire index
into memory. FYI: I have other services (Yarn, Spark) deployed on the same
boxes as well. Spark jobs also use a lot of disk cache.
When I have enough RAM (more than 70 GB so that all the index data could
fit in memory), the streaming API succeeds without running into any
exceptions. How does the index data caching mechanism differ between the
Streaming API and the cursor approach?

Thanks!



On Fri, Dec 16, 2016 at 6:52 AM, Shawn Heisey  wrote:

> On 12/14/2016 11:58 AM, Chetas Joshi wrote:
> > I am running Solr 5.5.0 on HDFS. It is a SolrCloud of 50 nodes and I have
> > the following config:
> > maxShardsPerNode: 1
> > replicationFactor: 1
> >
> > I have been ingesting data into Solr for the last 3 months. With increase
> > in data, I am observing increase in the query time. Currently the size of
> > my indices is 70 GB per shard (i.e. per node).
>
> Query times will increase as the index size increases, but significant
> jumps in the query time may be an indication of a performance problem.
> Performance problems are usually caused by insufficient resources,
> memory in particular.
>
> With HDFS, I am honestly not sure *where* the cache memory is needed.  I
> would assume that it's needed on the HDFS hosts, that a lot of spare
> memory on the Solr (HDFS client) hosts probably won't make much
> difference.  I could be wrong -- I have no idea what kind of caching
> HDFS does.  If the HDFS client can cache data, then you probably would
> want extra memory on the Solr machines.
>
> > I am using cursor approach (/export handler) using SolrJ client to get
> back
> > results from Solr. All the fields I am querying on and all the fields
> that
> > I get back from Solr are indexed and have docValues enabled as well. What
> > could be the reason behind increase in query time?
>
> If actual disk access is required to satisfy a query, Solr is going to
> be slow.  Caching is absolutely required for good performance.  If your
> query times are really long but used to be short, chances are that your
> index size has exceeded your system's ability to cache it effectively.
>
> One thing to keep in mind:  Gigabit Ethernet is comparable in speed to
> the sustained transfer rate of a single modern SATA magnetic disk, so if
> the data has to traverse a gigabit network, it probably will be nearly
> as slow as it would be if it were coming from a single disk.  Having a
> 10gig network for your storage is probably a good idea ... but current
> fast memory chips can leave 10gig in the dust, so if the data can come
> from cache and the chips are new enough, then it can be faster than
> network storage.
>
> Because the network can be a potential bottleneck, I strongly recommend
> putting index data on local disks.  If you have enough memory, the disk
> doesn't even need to be super-fast.
>
> > Has this got something to do with the OS disk cache that is used for
> > loading the Solr indices? When a query is fired, will Solr wait for all
> > (70GB) of disk cache being available so that it can load the index file?
>
> Caching the files on the disk is not handled by Solr, so Solr won't wait
> for the entire index to be cached unless the underlying storage waits
> for some reason.  The caching is usually handled by the OS.  For HDFS,
> it might be handled by a combination of the OS and Hadoop, but I don't
> know enough about HDFS to comment.  Solr makes a request for the parts
> of the index files that it needs to satisfy the request.  If the
> underlying system is capable of caching the data, if that feature is
> enabled, and if there's memory available for that purpose, then it gets
> cached.
>
> Thanks,
> Shawn
>
>


Re: Max vertical scaling in your experience ? (1 instance/server)

2016-12-16 Thread Dorian Hoxha
On Fri, Dec 16, 2016 at 5:55 PM, Erick Erickson 
wrote:

> You said that the data expires, but you haven't said
> how many docs you need to host at a time.

The data will expire in ~30 minutes on average. Many of them are updates to
the same document (which makes it worse, because updates are delete+insert).

> At 10M/second
> inserts you'll need a boatload of shards. All of the
> conversation about one beefy machine .vs. lots of not-so-beefy
> machines should wait until you answer that question.

Note, there will be multiple beefy machines (just fewer compared to small
machines). I was asking what would be a big enough one, so that 1 instance
will be able to use the full server.

> For
> instance, indexing some _very_ simple documents on my
> laptop can hit 10,000 docs/second/shard. So you're talking
> 1,000 shards here. Indexing more complex docs I might
> get 1,000 docs/second/shard, so then you'd need 10,000
> shards. Don't take these as hard numbers, I'm
> just trying to emphasize that you'll need to do scaling
> exercises to see if what you want to do is reasonable given
> your constraints.
>
Of course. I think I've done ~80K/s/server on a previous project (it wasn't
the bottleneck so didn't bother too much) but there are too many knobs that
will change that number.

>
> If those 10M docs/second are bursty and you can stand some
> latency, then that's one set of considerations. If it's steady-state
> it's another. In either case you need some _serious_ design work
> before you go forward.
>
I expect 10M to be the burst. But it needs to handle the burst. And I don't
think I will do 10M requests, but small batches.

>
> And then you want to facet (fq clauses aren't nearly as expensive)
> and want 2 second commit intervals.
>
It is what it is.

>
> You _really_ need to stand up some test systems and see what
> performance you can get before launching off on trying to do the
> whole thing. Fortunately, you can stand up, say, a 4 shard system
> and tune it and drive it as fast as you possibly can and extrapolate
> from there.
>
My ~thinking~ would be to have 1 instance/server + 1 shard/core + 1 thread
for each shard. Assuming I remove all "blocking" disk operations (I don't
know if the network is async?), it should be the ~best~ scenario. I'll have
to see how it functions in practice though.

>
> But to reiterate. This is a very high indexing rate that very few
> organizations have attempted. You _really_ need to do a
> proof-of-concept _then_ plan.
>
It's why I posted, to ask what have people used as big machine, and I would
test on that.

>
> Here's the long form of this argument:
> https://lucidworks.com/blog/2012/07/23/sizing-hardware-in-th
> e-abstract-why-we-dont-have-a-definitive-answer/


>
> Best,
> Erick
>
Thanks!

>
> On Fri, Dec 16, 2016 at 5:19 AM, GW  wrote:
> > Layer 2 bridge SAN is just for my Apache/apps on Conga so they can be
> spun
> > on up any host with a static IP. This has nothing to do with Solr which
> is
> > running on plain old hardware.
> >
> > Solrcloud is on a real cluster not on a SAN.
> >
> > The bit about dead with no error. I got this from a post I made asking
> > about the best way to deploy apps. Was shown some code on making your app
> > zookeeper aware. I am just getting to this so I'm talking from my ass. A
> ZK
> > aware program will have a list of nodes ready for business verses a plain
> > old Round Robin. If data on a machine is corrupted you can get 0 docs
> found
> > while a ZK aware app will know that node is shite.
> >
> >
> >
> >
> >
> >
> >
> > On 16 December 2016 at 07:20, Dorian Hoxha 
> wrote:
> >
> >> On Fri, Dec 16, 2016 at 12:39 PM, GW  wrote:
> >>
> >> > Dorian,
> >> >
> >> > From my reading, my belief is that you just need some beefy machines
> for
> >> > your zookeeper ensemble so they can think fast.
> >>
> >> Zookeeper need to think fast enough for cluster state/changes. So I
> think
> >> it scales with the number of machines/collections/shards and not
> documents.
> >>
> >> > After that your issues are
> >> > complicated by drive I/O which I believe is solved by using shards. If
> >> you
> >> > have a collection running on top of a single drive array it should not
> >> > compare to writing to a dozen drive arrays. So a whole bunch of light
> >> duty
> >> > machines that have a decent amount of memory and barely able process
> >> faster
> >> > than their drive I/O will serve you better.
> >> >
> >> My dataset will be lower than total memory, so I expect no query to hit
> >> disk.
> >>
> >> >
> >> > I think the Apache big data mandate was to be horizontally scalable to
> >> > infinity with cheap consumer hardware. In my minds eye you are not
> going
> >> to
> >> > get crazy input rates without a big horizontal drive system.
> >> >
> >> There is overhead with small machines, and with very big machines
> (pricy).
> >> So something in the middle.
> >> So small cluster of big machines or big cluster of small machines.
> >>
> >> >
> >> > I'm in the same boat. All the scaling and roll out documentation seems to
> >> > reference the Witch Doctor's secret handbook.

Re: ttl on merge-time possible somehow ?

2016-12-16 Thread Dorian Hoxha
On Fri, Dec 16, 2016 at 4:42 PM, Shawn Heisey  wrote:

> On 12/16/2016 12:54 AM, Dorian Hoxha wrote:
> > I did some search for TTL on solr, and found only a way to do it with
> > a delete-query. But that ~sucks, because you have to do a lot of
> > inserts (and queries).
>
> You're going to have to be very specific about what you want Solr to do.
>
> > The other(kinda better) way to do it, is to set a collection-level
> > ttl, and when indexes are merged, they will drop the documents that
> > have expired in the new merged segment. On the client, I will make
> > sure to do date-range queries so I don't get back old documents. So:
> > 1. is there a way to easily modify the segment-merger (or better way?)
> > to do that ?
>
> Does the following describe the feature you're after?
>
> https://lucidworks.com/blog/2014/05/07/document-expiration/
>
> If this is what you're after, this is *Solr* functionality.  Segment
> merging is *Lucene* functionality.  Lucene cannot remove documents
> during merge until they have been deleted.  It is Solr that handles
> deleting documents after they expire.  Lucene is not aware of the
> expiration concept.
>
Yep, that's what came up in my search. See how TTL works in HBase/Cassandra/
RocksDB. There
isn't a "delete old docs" query; old docs are dropped by the storage engine
when merging. Looks like this needs to be a Lucene module which can then be
configured by Solr?
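
For reference, the Solr-side approach from that blog post is configured as an
update processor; a rough sketch of the relevant solrconfig.xml piece (the chain
name, field names and delete period here are assumptions, not a tested config):

  <updateRequestProcessorChain name="add-ttl" default="true">
    <!-- Reads a TTL from the "ttl" field (e.g. "+30MINUTES"), writes an absolute
         expiration time into "expire_at", and every 30 seconds deletes documents
         whose expiration has passed. -->
    <processor class="solr.processor.DocExpirationUpdateProcessorFactory">
      <str name="ttlFieldName">ttl</str>
      <str name="expirationFieldName">expire_at</str>
      <int name="autoDeletePeriodSeconds">30</int>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory" />
    <processor class="solr.RunUpdateProcessorFactory" />
  </updateRequestProcessorChain>

Note that this still works by issuing deletes under the hood, which is exactly
what I'm questioning above; the merge-time/read-time filtering would indeed have
to live in Lucene.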


> > 2. is there a way to support this also on get ? looks like I can use
> > realtimeget + filter query and it should work based on documentation
>
> Realtime get allows you to retrieve documents that have been indexed but
> not yet committed.  I doubt that deleted documents or document
> expiration affects RTG at all.  We would need to know exactly what you
> want to get working here before we can say whether or not you're right
> when you say "it should work."
>
Just like in HBase/Cassandra/RocksDB: when you "select" a row/document that
has expired, it still exists on storage but isn't returned by the DB,
because the DB checks the timestamp and sees that it's expired. Looks like this
also needs to be in Lucene?

>
> Thanks,
> Shawn
>
Makes more sense?


Re: Stemming with SOLR

2016-12-16 Thread Susheel Kumar
To handle irregular nouns (
http://www.ef.com/english-resources/english-grammar/singular-and-plural-nouns/),
the simplest way is to handle them using StemmerOverrideFilterFactory; the list is
not that long. Otherwise go for commercial solutions like Basis Tech etc.,
as Alex suggested, or you can customize Hunspell extensively to handle most
of them.
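
As an illustration, a stemming-override setup would look something like this in
the analyzer chain (a sketch only; the field type name and dictionary entries are
assumptions), with a tab-separated stemdict.txt mapping the irregular forms:

  <fieldType name="text_en_stem" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <!-- applied before the stemmer so the overridden terms are protected -->
      <filter class="solr.StemmerOverrideFilterFactory" dictionary="stemdict.txt"/>
      <filter class="solr.KStemFilterFactory"/>
    </analyzer>
  </fieldType>

  # stemdict.txt (tab-separated: surface form <TAB> stem)
  caught	catch
  ran	run
  wolves	wolf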

Thanks,
Susheel

On Thu, Dec 15, 2016 at 9:46 PM, Alexandre Rafalovitch 
wrote:

> If you need the full fidelity solution taking care of multiple
> edge-cases, it could be worth looking at commercial solutions.
>
>
> http://www.basistech.com/ has one, including a free-level SAAS plan.
>
> Regards,
>Alex.
> 
> http://www.solr-start.com/ - Resources for Solr users, new and experienced
>
>
> On 15 December 2016 at 21:28, Lasitha Wattaladeniya 
> wrote:
> > Hi all,
> >
> > Thanks for the replies,
> >
> > @eric, ahmet : since those stemmers are logical stemmers it won't work on
> > words such as caught, ran and so on. So in our case it won't work
> >
> > @susheel : Yes I thought about it but problems we have is, the documents
> we
> > index are some what large text, so copy fielding these into duplicate
> > fields will affect on the index time ( we have jobs to index data
> > periodically) and query time. I wonder why there isn't a correct solution
> > to this
> >
> > Regards,
> > Lasitha
> >
> > Lasitha Wattaladeniya
> > Software Engineer
> >
> > Mobile : +6593896893
> > Blog : techreadme.blogspot.com
> >
> > On Fri, Dec 16, 2016 at 12:58 AM, Susheel Kumar 
> > wrote:
> >
> >> We did extensive comparison in the past for Snowball, KStem and Hunspell
> >> and there are cases where one of them works better but not other or
> >> vice-versa. You may utilise all three of them by having 3 different
> fields
> >> (fieldTypes) and during query, search in all of them.
> >>
> >> For some of the cases where none of them works (e.g. wolves, wolf etc.),
> use
> >> StemmerOverrideFilterFactory.
> >>
> >> HTH.
> >>
> >> Thanks,
> >> Susheel
> >>
> >> On Thu, Dec 15, 2016 at 11:32 AM, Ahmet Arslan
> 
> >> wrote:
> >>
> >> > Hi,
> >> >
> >> > KStemFilter returns legitimate English words, please use it.
> >> >
> >> > Ahmet
> >> >
> >> >
> >> >
> >> > On Thursday, December 15, 2016 6:17 PM, Lasitha Wattaladeniya <
> >> > watt...@gmail.com> wrote:
> >> > Hello devs,
> >> >
> >> > I'm trying to develop this indexing and querying flow where it
> converts
> >> the
> >> > words to its original form (lemmatization). I was doing bit of
> research
> >> > lately but the information on the internet is very limited. I tried
> using
> >> > hunspellfactory but it doesn't convert the word to it's original form,
> >> > instead it gives suggestions for some words (hunspell works for some
> >> > english words correctly but for some it gives multiple suggestions or
> no
> >> > suggestions, i used the en_us.dic provided by openoffice)
> >> >
> >> > I know this is a generic problem in searching, so is there anyone who
> can
> >> > point me to correct direction or some information :)
> >> >
> >> > Best regards,
> >> > Lasitha Wattaladeniya
> >> > Software Engineer
> >> >
> >> > Mobile : +6593896893
> >> > Blog : techreadme.blogspot.com
> >> >
> >>
>


Re: cores vs. instances vs. zookeeper vs. cloud vs ?

2016-12-16 Thread John Blythe
thanks, erick. this is helpful. a few questions for clarity's sake, but
first: nope, not using SolrCloud as of yet.

   - if i start using SolrCloud i could have my current multi-core setup
   (e.g. "transactions", "opportunities", etc.) exist within the appropriate
   collection. so instead of dev-transactions i'd have a 'dev' collection that
   has a 'transactions' core inside of it?
   - this seems to be the same with ZK, too?
   - i'm totally fine w separate/diff indexing. the demo collection, for
   instance, *has* to be separate from production bc the data has been
   stitched together from various customers' accounts on prod and blinded so
   that we avoid privacy issues and can have all the various goodies
   under one demo account rather than separate ones. is the separate indexing
   happening out of the box w Cloud or something it's even capable of?

thanks again, erick!

-- 
*John Blythe*
Product Manager & Lead Developer

251.605.3071 | j...@curvolabs.com
www.curvolabs.com

58 Adams Ave
Evansville, IN 47713

On Fri, Dec 16, 2016 at 11:38 AM, Erick Erickson 
wrote:

> It's not quite clear to me whether you're using SolrCloud now or not, my
> guess is not. My guess here is that you _should_ move to SolrCloud and
> collections. Then, instead of thinking about "cores", you just think about
> collections. Where the replicas live then isn't something you have to
> manage
> in that case.
>
> There's a bit of a learning curve for Zookeeper, and a mental shift you
> have to make to not worry about cores at all, just trust Solr. That said,
> if you _want_ to explicitly manage where each and every core for each
> and every collection lives, that's easy with the collections API. Once you
> do make that shift, going back is painful ;)
>
> So the scenario is that you have three collections, prod, dev, demo. They
> all
> happen to use the same configset (which you keep in ZK). You have one
> zookeeper ensemble that the three collections reference. They can even
> all share the same machine if that machine has sufficient capacity.
>
> The deal here is that these are really completely independent; you'll have
> to index your content to each separately.
>
> But then your URL becomes x.x.x.x:8983/solr/prod, x.x.x.x:8983/solr/dev
> and the like.
>
> FWIW,
> Erick
>
> On Fri, Dec 16, 2016 at 5:26 AM, John Blythe  wrote:
> > good morning everyone.
> >
> > i've got a growing number of cores that various parts of our application
> > are relying upon. i'm having difficulty figuring out the best way to
> > continue expanding for the sake of both scale and convenience.
> >
> > i need two extra versions of each core due to our demo instance and our
> > development instance. when we had just one or two cores it wasn't the
> worst
> > thing to have cores like X, demo-X, and dev-X. that has quickly become
> > unnecessarily cumbersome.
> >
> > i've considered moving each instance to its own solr instance, perhaps
> just
> > throwing it on a different port. for example, production could be
> > x.x.x.x:8983, dev x.x.x.x:8993, and demo x.x.x.x:8938.
> >
> > i'm pretty helpless at this point with zookeeper and/or solrcloud. given
> > the above info, i'd love to hear some quick overview ideas as to the best
> > approach that i can then begin to explore online.
> >
> > thanks for any pointers!
>


Re: Getting Error - Session expired for /collections/sprod/state.json

2016-12-16 Thread Erick Erickson
Look at your connection timeouts and your ZK timeouts.
This usually means your Solr instances are going
into heavy GC as Yago mentions. You can turn on
GC logging if it's not already enabled, then use something like
GCViewer to get a handle on the GC.

You really have two options:
1> if it is GC, tune your instances to avoid that if possible.
This is "more art than science".

2> lengthen timeouts, there are a series of them for client
connections, Solr<->Solr connections and ZK<->Solr
connections
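
A sketch of where both knobs usually live (the values are illustrative
assumptions, not recommendations):

  # bin/solr.in.sh
  # GC logging, so GCViewer has something to analyze
  # (bin/solr writes it to solr_gc.log in the Solr logs dir)
  GC_LOG_OPTS="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
    -XX:+PrintGCTimeStamps -XX:+PrintGCApplicationStoppedTime"

  # ZooKeeper client/session timeout in milliseconds
  # (also see zkClientTimeout in solr.xml)
  ZK_CLIENT_TIMEOUT="30000"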

Best,
Erick

On Fri, Dec 16, 2016 at 2:07 AM, Yago Riveiro  wrote:
> Do some GC profiling to get some information. It's possible you have
> configured a small heap and you are running into GC stop-the-world issues.
>
> Normally ZooKeeper errors are tied to GC and network latency issues
>
> --
>
> /Yago Riveiro
>
> On 16 Dec 2016, 09:49 +, Piyush Kunal , wrote:
>> Looks like an issue with 6.x version then.
>> But this seems too basic. Not sure if community would not have caught this
>> till now.
>>
>> On Fri, Dec 16, 2016 at 2:55 PM, Yago Riveiro > wrote:
>>
>> > I had some of this error in my logs too on 6.3.0
>> >
>> > My cluster also index like 20K docs/sec I don't know why.
>> >
>> > --
>> >
>> > /Yago Riveiro
>> >
>> > On 16 Dec 2016, 08:39 +, Piyush Kunal ,
>> > wrote:
>> > > Anyone has noticed such issue before?
>> > >
>> > > On Thu, Dec 15, 2016 at 4:36 PM, Piyush Kunal > > > wrote:
>> > >
>> > > > This is happening when heavy indexing like 100/second is going on.
>> > > >
>> > > > On Thu, Dec 15, 2016 at 4:33 PM, Piyush Kunal > > > > wrote:
>> > > >
>> > > > > - We have solr6.1.0 cluster running on production with 1 shard and 5
>> > > > > replicas.
>> > > > > - Zookeeper quorum on 3 nodes.
>> > > > > - Using a chroot in zookeeper to segregate the configs from other
>> > > > > collections.
>> > > > > - Using solrj5.1.0 as our client to query solr.
>> > > > >
>> > > > >
>> > > > >
>> > > > > Usually things work fine but on and off we witness this exception
>> > coming
>> > > > > up:
>> > > > > =
>> > > > > org.apache.solr.common.SolrException: Could not load collection from
>> > > > > ZK:sprod
>> > > > > at org.apache.solr.common.cloud.ZkStateReader.getCollectionLive
>> > > > > (ZkStateReader.java:815)
>> > > > > at org.apache.solr.common.cloud.ZkStateReader$5.get(ZkStateRead
>> > > > > er.java:477)
>> > > > > at org.apache.solr.client.solrj.impl.CloudSolrClient.getDocColl
>> > > > > ection(CloudSolrClient.java:1174)
>> > > > > at org.apache.solr.client.solrj.impl.CloudSolrClient.requestWit
>> > > > > hRetryOnStaleState(CloudSolrClient.java:807)
>> > > > > at org.apache.solr.client.solrj.impl.CloudSolrClient.request(Cl
>> > > > > oudSolrClient.java:782)
>> > > > > --
>> > > > > Caused by: org.apache.zookeeper.KeeperException$
>> > SessionExpiredException:
>> > > > > KeeperErrorCode = Session expired for /collections/sprod/state.json
>> > > > > at org.apache.zookeeper.KeeperException.create(KeeperException.
>> > > > > java:127)
>> > > > > at org.apache.zookeeper.KeeperException.create(KeeperException.
>> > > > > java:51)
>> > > > > at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1045)
>> > > > > at org.apache.solr.common.cloud.SolrZkClient$5.execute(SolrZkCl
>> > > > > ient.java:311)
>> > > > > at org.apache.solr.common.cloud.SolrZkClient$5.execute(SolrZkCl
>> > > > > ient.java:308)
>> > > > > at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(Zk
>> > > > > CmdExecutor.java:61)
>> > > > > at org.apache.solr.common.cloud.SolrZkClient.exists(SolrZkClien
>> > > > > t.java:308)
>> > > > > --
>> > > > > org.apache.solr.common.SolrException: Could not load collection from
>> > > > > ZK:sprod
>> > > > > at org.apache.solr.common.cloud.ZkStateReader.getCollectionLive
>> > > > > (ZkStateReader.java:815)
>> > > > > at org.apache.solr.common.cloud.ZkStateReader$5.get(ZkStateRead
>> > > > > er.java:477)
>> > > > > at org.apache.solr.client.solrj.impl.CloudSolrClient.getDocColl
>> > > > > ection(CloudSolrClient.java:1174)
>> > > > > at org.apache.solr.client.solrj.impl.CloudSolrClient.requestWit
>> > > > > hRetryOnStaleState(CloudSolrClient.java:807)
>> > > > > at org.apache.solr.client.solrj.impl.CloudSolrClient.request(Cl
>> > > > > oudSolrClient.java:782)
>> > > > > --
>> > > > > Caused by: org.apache.zookeeper.KeeperException$
>> > SessionExpiredException:
>> > > > > KeeperErrorCode = Session expired for /collections/sprod/state.json
>> > > > > at org.apache.zookeeper.KeeperException.create(KeeperException.
>> > > > > java:127)
>> > > > > at org.apache.zookeeper.KeeperException.create(KeeperException.
>> > > > > java:51)
>> > > > > at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1045)
>> > > > > at org.apache.solr.common.cloud.SolrZkClient$5.execute(SolrZkCl
>> > > > > ient.java:311)
>> > > > > at org.apache.solr.common.cloud.SolrZkClient$5.execute(SolrZkCl
>> > > > > ient.java:308)
>> > > > > at org.apache.solr.common.clou

Re: Max vertical scaling in your experience ? (1 instance/server)

2016-12-16 Thread Erick Erickson
You said that the data expires, but you haven't said
how many docs you need to host at a time. At 10M/second
inserts you'll need a boatload of shards. All of the
conversation about one beefy machine .vs. lots of not-so-beefy
machines should wait until you answer that question. For
instance, indexing some _very_ simple documents on my
laptop can hit 10,000 docs/second/shard. So you're talking
1,000 shards here. Indexing more complex docs I might
get 1,000 docs/second/shard, so then you'd need 10,000
shards. Don't take these as hard numbers, I'm
just trying to emphasize that you'll need to do scaling
exercises to see if what you want to do is reasonable given
your constraints.

If those 10M docs/second are bursty and you can stand some
latency, then that's one set of considerations. If it's steady-state
it's another. In either case you need some _serious_ design work
before you go forward.

And then you want to facet (fq clauses aren't nearly as expensive)
and want 2 second commit intervals.

You _really_ need to stand up some test systems and see what
performance you can get before launching off on trying to do the
whole thing. Fortunately, you can stand up, say, a 4 shard system
and tune it and drive it as fast as you possibly can and extrapolate
from there.

But to reiterate. This is a very high indexing rate that very few
organizations have attempted. You _really_ need to do a
proof-of-concept _then_ plan.

Here's the long form of this argument:
https://lucidworks.com/blog/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/

Best,
Erick

On Fri, Dec 16, 2016 at 5:19 AM, GW  wrote:
> Layer 2 bridge SAN is just for my Apache/apps on Conga so they can be spun
> on up any host with a static IP. This has nothing to do with Solr which is
> running on plain old hardware.
>
> Solrcloud is on a real cluster not on a SAN.
>
> The bit about dead with no error. I got this from a post I made asking
> about the best way to deploy apps. Was shown some code on making your app
> zookeeper aware. I am just getting to this so I'm talking from my ass. A ZK
> aware program will have a list of nodes ready for business verses a plain
> old Round Robin. If data on a machine is corrupted you can get 0 docs found
> while a ZK aware app will know that node is shite.
>
>
>
>
>
>
>
> On 16 December 2016 at 07:20, Dorian Hoxha  wrote:
>
>> On Fri, Dec 16, 2016 at 12:39 PM, GW  wrote:
>>
>> > Dorian,
>> >
>> > From my reading, my belief is that you just need some beefy machines for
>> > your zookeeper ensemble so they can think fast.
>>
>> Zookeeper need to think fast enough for cluster state/changes. So I think
>> it scales with the number of machines/collections/shards and not documents.
>>
>> > After that your issues are
>> > complicated by drive I/O which I believe is solved by using shards. If
>> you
>> > have a collection running on top of a single drive array it should not
>> > compare to writing to a dozen drive arrays. So a whole bunch of light
>> duty
>> > machines that have a decent amount of memory and barely able process
>> faster
>> > than their drive I/O will serve you better.
>> >
>> My dataset will be lower than total memory, so I expect no query to hit
>> disk.
>>
>> >
>> > I think the Apache big data mandate was to be horizontally scalable to
>> > infinity with cheap consumer hardware. In my minds eye you are not going
>> to
>> > get crazy input rates without a big horizontal drive system.
>> >
>> There is overhead with small machines, and with very big machines (pricy).
>> So something in the middle.
>> So small cluster of big machines or big cluster of small machines.
>>
>> >
>> > I'm in the same boat. All the scaling and roll out documentation seems to
>> > reference the Witch Doctor's secret handbook.
>> >
>> > I just started into making my applications ZK aware and really just
>> > starting to understand the architecture. After a whole year I still feel
>> > weak while at the same time I have traveled far. I still feel like an
>> > amateur.
>> >
>> > My plans are to use bridge tools in Linux so all my machines are sitting
>> on
>> > the switch with layer 2. Then use Conga to monitor which apps need to be
>> > running. If a server dies, it's apps are spun up on one of the other
>> > servers using the original IP and mac address through a bridge firewall
>> > gateway so there is no hold up with with mac phreaking like layer 3.
>> Layer
>> > 3 does not like to see a route change with a mac address. My apps will be
>> > on a SAN ~ Data on as many shards/machines as financially possible.
>> >
>> By conga you mean https://sourceware.org/cluster/conga/spec/ ?
>> Also SAN may/will suck like someone answered in your thread.
>>
>> >
>> > I was going to put a bunch of Apache web servers in round robin to talk
>> to
>> > Solr but discovered that a Solr node can be dead and not report errors.
>> >
>> Please explain more "dead but no error".
>>
>> > It's all rough at the moment but it makes total sense to send Solr requests
>> > based on what ZK says is available versus a round robin.

Re: cores vs. instances vs. zookeeper vs. cloud vs ?

2016-12-16 Thread Erick Erickson
It's not quite clear to me whether you're using SolrCloud now or not, my
guess is not. My guess here is that you _should_ move to SolrCloud and
collections. Then, instead of thinking about "cores", you just think about
collections. Where the replicas live then isn't something you have to manage
in that case.

There's a bit of a learning curve for Zookeeper, and a mental shift you
have to make to not worry about cores at all, just trust Solr. That said,
if you _want_ to explicitly manage where each and every core for each
and every collection lives, that's easy with the collections API. Once you
do make that shift, going back is painful ;)

So the scenario is that you have three collections, prod, dev, demo. They all
happen to use the same configset (which you keep in ZK). You have one
zookeeper ensemble that the three collections reference. They can even
all share the same machine if that machine has sufficient capacity.

The deal here is that these are really completely independent; you'll have
to index your content to each separately.

But then your URL becomes x.x.x.x:8983/solr/prod, x.x.x.x:8983/solr/dev
and the like.
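
A rough sketch of the bootstrap steps (ZK hosts, paths and shard/replica counts
are assumptions, adjust to taste):

  # upload one shared configset to ZooKeeper
  server/scripts/cloud-scripts/zkcli.sh -zkhost zk1:2181,zk2:2181,zk3:2181 \
    -cmd upconfig -confname shared_conf -confdir /path/to/conf

  # create the three collections against the same configset
  curl 'http://x.x.x.x:8983/solr/admin/collections?action=CREATE&name=prod&numShards=1&replicationFactor=2&collection.configName=shared_conf'
  curl 'http://x.x.x.x:8983/solr/admin/collections?action=CREATE&name=dev&numShards=1&replicationFactor=1&collection.configName=shared_conf'
  curl 'http://x.x.x.x:8983/solr/admin/collections?action=CREATE&name=demo&numShards=1&replicationFactor=1&collection.configName=shared_conf'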

FWIW,
Erick

On Fri, Dec 16, 2016 at 5:26 AM, John Blythe  wrote:
> good morning everyone.
>
> > i've got a growing number of cores that various parts of our application
> > are relying upon. i'm having difficulty figuring out the best way to
> > continue expanding for the sake of both scale and convenience.
>
> i need two extra versions of each core due to our demo instance and our
> development instance. when we had just one or two cores it wasn't the worst
> thing to have cores like X, demo-X, and dev-X. that has quickly become
> unnecessarily cumbersome.
>
> i've considered moving each instance to its own solr instance, perhaps just
> throwing it on a different port. for example, production could be
> x.x.x.x:8983, dev x.x.x.x:8993, and demo x.x.x.x:8938.
>
> i'm pretty helpless at this point with zookeeper and/or solrcloud. given
> the above info, i'd love to hear some quick overview ideas as to the best
> approach that i can then begin to explore online.
>
> thanks for any pointers!


Re: Separating Search and Indexing in SolrCloud

2016-12-16 Thread Pushkar Raste
This kind of separation is not supported yet. There is, however, some work
going on; you can read about it at
https://issues.apache.org/jira/browse/SOLR-9835

This unfortunately would not support soft commits and hence would not be a
good solution for near real time indexing.

On Dec 16, 2016 7:44 AM, "Jaroslaw Rozanski"  wrote:

> Sorry, not what I meant.
>
> Leader is responsible for distributing update requests to replica. So
> eventually all replicas have same state as leader. Not a problem.
>
> It is more about the performance of such. If I gather correctly normal
> replication happens by standard update request. Not by, say, segment copy.
>
> Which means update on leader is as "expensive" as on replica.
>
> Hence, if my understanding is correct, sending search request to replica
> only, in index heavy environment, would bring no benefit.
>
> So the question is: is there a mechanism, in SolrCloud (not legacy
> master/slave set-up) to make one node take a load of indexing which
> other nodes focus on searching.
>
> This is not a question of SolrClient cause that is clear how to direct
> search request to specific nodes. This is more about index optimization
> so that certain nodes (ie. replicas) could suffer less due to high
> volume indexing while serving search requests.
>
>
>
>
> On 16/12/16 12:35, Dorian Hoxha wrote:
> > The leader is the source of truth. You expect to make the replica the
> > source of truth or something???Doesn't make sense?
> > What people do, is send write to leader/master and reads to
> replicas/slaves
> > in other solr/other-dbs.
> >
> > On Fri, Dec 16, 2016 at 1:31 PM, Jaroslaw Rozanski  >
> > wrote:
> >
> >> Hi all,
> >>
> >> According to documentation, in normal operation (not recovery) in Solr
> >> Cloud configuration the leader sends updates it receives to all the
> >> replicas.
> >>
> >> This means and all nodes in the shard perform same effort to index
> >> single document. Correct?
> >>
> >> Is there then a benefit to *not* to send search requests to leader, but
> >> only to replicas?
> >>
> >> Given index & search heavy Solr Cloud system, is it possible to separate
> >> search from indexing nodes?
> >>
> >>
> >> RE: Solr 5.5.0
> >>
> >> --
> >> Jaroslaw Rozanski | e: m...@jarekrozanski.com
> >> 695E 436F A176 4961 7793  5C70 AFDF FB5E 682C 4D3D
> >>
> >>
> >
>
> --
> Jaroslaw Rozanski | e: m...@jarekrozanski.com
> 695E 436F A176 4961 7793  5C70 AFDF FB5E 682C 4D3D
>
>


Re: Stable releases of Solr

2016-12-16 Thread Jaroslaw Rozanski
Hi Deepak,

Lucene 6.3.0 is the latest official release:
https://lucene.apache.org/core/6_3_0/index.html

The same applies to Solr, if that is what you meant.

It is as stable as the release process guarantees.

On 16/12/16 07:10, Deepak Kumar Gupta wrote:
> Hi,
> 
> I am planning to upgrade lucene version in my codebase from 3.6.1
> What is the latest stable version to which I can upgrade it?
> Is 6.3.X stable?
> 
> Thanks,
> Deepak
> 

-- 
Jaroslaw Rozanski | e: m...@jarekrozanski.com
695E 436F A176 4961 7793  5C70 AFDF FB5E 682C 4D3D





Stable releases of Solr

2016-12-16 Thread Deepak Kumar Gupta
Hi,

I am planning to upgrade lucene version in my codebase from 3.6.1
What is the latest stable version to which I can upgrade it?
Is 6.3.X stable?

Thanks,
Deepak


Re: ttl on merge-time possible somehow ?

2016-12-16 Thread Shawn Heisey
On 12/16/2016 12:54 AM, Dorian Hoxha wrote:
> I did some search for TTL on solr, and found only a way to do it with
> a delete-query. But that ~sucks, because you have to do a lot of
> inserts (and queries). 

You're going to have to be very specific about what you want Solr to do.

> The other(kinda better) way to do it, is to set a collection-level
> ttl, and when indexes are merged, they will drop the documents that
> have expired in the new merged segment. On the client, I will make
> sure to do date-range queries so I don't get back old documents. So:
> 1. is there a way to easily modify the segment-merger (or better way?)
> to do that ? 

Does the following describe the feature you're after?

https://lucidworks.com/blog/2014/05/07/document-expiration/

If this is what you're after, this is *Solr* functionality.  Segment
merging is *Lucene* functionality.  Lucene cannot remove documents
during merge until they have been deleted.  It is Solr that handles
deleting documents after they expire.  Lucene is not aware of the
expiration concept.

> 2. is there a way to support this also on get ? looks like I can use
> realtimeget + filter query and it should work based on documentation

Realtime get allows you to retrieve documents that have been indexed but
not yet committed.  I doubt that deleted documents or document
expiration affects RTG at all.  We would need to know exactly what you
want to get working here before we can say whether or not you're right
when you say "it should work."

Thanks,
Shawn



Can't get spelling suggestions to work properly

2016-12-16 Thread jimi.hullegard
Hi,

I'm trying to add the spelling suggestion feature to our search, but I'm having 
problems getting suggestions on some misspellings.

For example, the Swedish word 'mycket' exists in ~14.000 of a total of ~40.000 
documents in our index.

A search for the incorrect spelling 'myket' (a missing 'c') gives several 
spelling suggestions, and the top one is 'mycket'. This is the wanted/expected 
behavior.

But a search for the incorrect spelling 'mycet' (a missing 'k') gives no 
spelling suggestions.

The only difference between these two searches is that the one that results in 
spelling suggestions had zero results, while the other one had two (2) results. 
Those two documents contain the incorrect spelling ('mycet'). Can this be the 
cause of the missing spelling suggestions? But I have set 'maxQueryFrequency' to 0.001, 
and with 40.000 documents in the index that should mean the word can exist 
in up to 40 documents, and since 2 is less than 40 I argue that this word 
should be considered a spelling mistake. But for some reason the Solr 
spellchecker considers 'myket' an incorrect spelling, while 'mycet' is 
incorrectly considered a correct spelling.

Also, I tried with spellcheck.accuracy=0 just to rule out that I have a too 
high accuracy setting, but that didn't help.

Can someone see what I'm doing wrong, or give some tips on configuration 
changes and/or how I can troubleshoot this? For example, is there any way to 
debug the spellchecker function?
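
For context, the settings being discussed live in the spellcheck component; a rough 
sketch of that part of solrconfig.xml (the field name matches my qf, the other values 
are assumptions and not necessarily my exact config):

  <searchComponent name="spellcheck" class="solr.SpellCheckComponent">
    <lst name="spellchecker">
      <str name="name">default</str>
      <str name="field">swedishText1</str>
      <str name="classname">solr.DirectSolrSpellChecker</str>
      <!-- a query term only gets suggestions if it occurs in no more than
           this fraction of documents (0.001 * 40.000 = 40 docs) -->
      <float name="maxQueryFrequency">0.001</float>
    </lst>
  </searchComponent>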


Here are the searches:

Search for 'myket':

http://localhost:8080/solr/s2/select/?q=myket&rows=100&sort=score+desc&fl=*%2Cscore%2C%5Bexplain+style%3Dtext%5D&defType=edismax&qf=title%5E2&qf=swedishText1%5E1&spellcheck=true&spellcheck.accuracy=0&spellcheck.maxCollationTries=200&fq=%2Bactivatedate%3A%5B*+TO+NOW%5D+%2Bexpiredate%3A%5BNOW+TO+*%5D+%2B%28state%3Apublished+OR+state%3Adraft-published+OR+state%3Asubmitted-published+OR+state%3Aapproved-published%29&wt=xml&indent=true

Spellcheck output for 'myket':

<lst name="spellcheck">
  <lst name="suggestions">
    <lst name="myket">
      <int name="numFound">16</int>
      <int name="startOffset">0</int>
      <int name="endOffset">5</int>
      <int name="origFreq">0</int>
      <arr name="suggestion">
        <lst>
          <str name="word">mycket</str>
          <int name="freq">14039</int>
        </lst>
        [...]
      </arr>
    </lst>
  </lst>
  <bool name="correctlySpelled">false</bool>
  <lst name="collations">
    <lst name="collation">
      <str name="collationQuery">mycket</str>
      <int name="hits">14005</int>
      <lst name="misspellingsAndCorrections">
        <str name="myket">mycket</str>
      </lst>
    </lst>
    [...]
  </lst>
</lst>



Spellcheck output for 'mycet':

http://localhost:8080/solr/s2/select/?q=mycet&rows=100&sort=score+desc&fl=*%2Cscore%2C%5Bexplain+style%3Dtext%5D&defType=edismax&qf=title%5E2&qf=swedishText1%5E1&spellcheck=true&spellcheck.accuracy=0&spellcheck.maxCollationTries=200&fq=%2Bactivatedate%3A%5B*+TO+NOW%5D+%2Bexpiredate%3A%5BNOW+TO+*%5D+%2B%28state%3Apublished+OR+state%3Adraft-published+OR+state%3Asubmitted-published+OR+state%3Aapproved-published%29&wt=xml&indent=true

Search for 'mycet':

http://localhost:8080/solr/s2/select/?q=mycet&rows=100&sort=score+desc&fl=*%2Cscore%2C%5Bexplain+style%3Dtext%5D&defType=edismax&qf=title%5E2&qf=swedishText1%5E1&spellcheck=true&spellcheck.accuracy=0&spellcheck.maxCollationTries=200&fq=%2Bactivatedate%3A%5B*+TO+NOW%5D+%2Bexpiredate%3A%5BNOW+TO+*%5D+%2B%28state%3Apublished+OR+

Re: Solr on HDFS: increase in query time with increase in data

2016-12-16 Thread Shawn Heisey
On 12/14/2016 11:58 AM, Chetas Joshi wrote:
> I am running Solr 5.5.0 on HDFS. It is a solrCloud of 50 nodes and I have
> the following config.
> maxShardsperNode: 1
> replicationFactor: 1
>
> I have been ingesting data into Solr for the last 3 months. With increase
> in data, I am observing increase in the query time. Currently the size of
> my indices is 70 GB per shard (i.e. per node).

Query times will increase as the index size increases, but significant
jumps in the query time may be an indication of a performance problem. 
Performance problems are usually caused by insufficient resources,
memory in particular.

With HDFS, I am honestly not sure *where* the cache memory is needed.  I
would assume that it's needed on the HDFS hosts, that a lot of spare
memory on the Solr (HDFS client) hosts probably won't make much
difference.  I could be wrong -- I have no idea what kind of caching
HDFS does.  If the HDFS client can cache data, then you probably would
want extra memory on the Solr machines.

> I am using cursor approach (/export handler) using SolrJ client to get back
> results from Solr. All the fields I am querying on and all the fields that
> I get back from Solr are indexed and have docValues enabled as well. What
> could be the reason behind increase in query time?

If actual disk access is required to satisfy a query, Solr is going to
be slow.  Caching is absolutely required for good performance.  If your
query times are really long but used to be short, chances are that your
index size has exceeded your system's ability to cache it effectively.

One thing to keep in mind:  Gigabit Ethernet is comparable in speed to
the sustained transfer rate of a single modern SATA magnetic disk, so if
the data has to traverse a gigabit network, it probably will be nearly
as slow as it would be if it were coming from a single disk.  Having a
10gig network for your storage is probably a good idea ... but current
fast memory chips can leave 10gig in the dust, so if the data can come
from cache and the chips are new enough, then it can be faster than
network storage.

Because the network can be a potential bottleneck, I strongly recommend
putting index data on local disks.  If you have enough memory, the disk
doesn't even need to be super-fast.

> Has this got something to do with the OS disk cache that is used for
> loading the Solr indices? When a query is fired, will Solr wait for all
> (70GB) of disk cache being available so that it can load the index file?

Caching the files on the disk is not handled by Solr, so Solr won't wait
for the entire index to be cached unless the underlying storage waits
for some reason.  The caching is usually handled by the OS.  For HDFS,
it might be handled by a combination of the OS and Hadoop, but I don't
know enough about HDFS to comment.  Solr makes a request for the parts
of the index files that it needs to satisfy the request.  If the
underlying system is capable of caching the data, if that feature is
enabled, and if there's memory available for that purpose, then it gets
cached.

Thanks,
Shawn



cores vs. instances vs. zookeeper vs. cloud vs ?

2016-12-16 Thread John Blythe
good morning everyone.

i've got a growing number of cores that various parts of our application
are relying upon. i'm having difficulty figuring out the best way to
continue expanding for the sake of both scale and convenience.

i need two extra versions of each core due to our demo instance and our
development instance. when we had just one or two cores it wasn't the worst
thing to have cores like X, demo-X, and dev-X. that has quickly become
unnecessarily cumbersome.

i've considered moving each instance to its own solr instance, perhaps just
throwing it on a different port. for example, production could be
x.x.x.x:8983, dev x.x.x.x:8993, and demo x.x.x.x:8938.

i'm pretty helpless at this point with zookeeper and/or solrcloud. given
the above info, i'd love to hear some quick overview ideas as to the best
approach that i can then begin to explore online.

thanks for any pointers!


Re: Max vertical scaling in your experience ? (1 instance/server)

2016-12-16 Thread GW
Layer 2 bridge SAN is just for my Apache/apps on Conga so they can be spun
on up any host with a static IP. This has nothing to do with Solr which is
running on plain old hardware.

Solrcloud is on a real cluster not on a SAN.

The bit about dead with no error: I got this from a post I made asking
about the best way to deploy apps, where I was shown some code on making your
app ZooKeeper aware. I am just getting to this so I'm talking from my ass. A
ZK-aware program will have a list of nodes ready for business versus a plain
old round robin. If data on a machine is corrupted you can get 0 docs found,
while a ZK-aware app will know that node is shite.







On 16 December 2016 at 07:20, Dorian Hoxha  wrote:

> On Fri, Dec 16, 2016 at 12:39 PM, GW  wrote:
>
> > Dorian,
> >
> > From my reading, my belief is that you just need some beefy machines for
> > your zookeeper ensemble so they can think fast.
>
> Zookeeper need to think fast enough for cluster state/changes. So I think
> it scales with the number of machines/collections/shards and not documents.
>
> > After that your issues are
> > complicated by drive I/O which I believe is solved by using shards. If
> you
> > have a collection running on top of a single drive array it should not
> > compare to writing to a dozen drive arrays. So a whole bunch of light
> duty
> > machines that have a decent amount of memory and barely able process
> faster
> > than their drive I/O will serve you better.
> >
> My dataset will be lower than total memory, so I expect no query to hit
> disk.
>
> >
> > I think the Apache big data mandate was to be horizontally scalable to
> > infinity with cheap consumer hardware. In my minds eye you are not going
> to
> > get crazy input rates without a big horizontal drive system.
> >
> There is overhead with small machines, and with very big machines (pricy).
> So something in the middle.
> So small cluster of big machines or big cluster of small machines.
>
> >
> > I'm in the same boat. All the scaling and roll out documentation seems to
> > reference the Witch Doctor's secret handbook.
> >
> > I just started into making my applications ZK aware and really just
> > starting to understand the architecture. After a whole year I still feel
> > weak while at the same time I have traveled far. I still feel like an
> > amateur.
> >
> > My plans are to use bridge tools in Linux so all my machines are sitting
> on
> > the switch with layer 2. Then use Conga to monitor which apps need to be
> > running. If a server dies, it's apps are spun up on one of the other
> > servers using the original IP and mac address through a bridge firewall
> > gateway so there is no hold up with with mac phreaking like layer 3.
> Layer
> > 3 does not like to see a route change with a mac address. My apps will be
> > on a SAN ~ Data on as many shards/machines as financially possible.
> >
> By conga you mean https://sourceware.org/cluster/conga/spec/ ?
> Also SAN may/will suck like someone answered in your thread.
>
> >
> > I was going to put a bunch of Apache web servers in round robin to talk
> to
> > Solr but discovered that a Solr node can be dead and not report errors.
> >
> Please explain more "dead but no error".
>
> > It's all rough at the moment but it makes total sense to send Solr
> requests
> > based on what ZK says is available verses a round robin.
> >
> Yes, like I&other commenter wrote on your thread.
>
> >
> > Will keep you posted on my roll out if you like.
> >
> > Best,
> >
> > GW
> >
> >
> >
> >
> >
> >
> >
> > On 16 December 2016 at 03:31, Dorian Hoxha 
> wrote:
> >
> > > Hello searchers,
> > >
> > > I'm researching solr for a project that would require a
> > max-inserts(10M/s)
> > > and some heavy facet+fq on top of that, though on low qps.
> > >
> > > And I'm trying to find blogs/slides where people have used some big
> > > machines instead of hundreds of small ones.
> > >
> > > 1. Largest I've found is this
> > >  > > 4-machines-1-solrcloud/>
> > > with 16cores + 384GB ram but they were using 25! solr4 instances /
> server
> > > which seems wasteful to me ?
> > >
> > > I know that 1 solr can have max ~29-30GB heap because GC is
> > wasteful/sucks
> > > after that, and you should leave the other amount to the os for
> > file-cache.
> > > 2. But do you think 1 instance will be able to fully-use a 256GB/20core
> > > machine ?
> > >
> > > 3. Like to share your findings/links with big-machine clusters ?
> > >
> > > Thank You
> > >
> >
>


Re: Separating Search and Indexing in SolrCloud

2016-12-16 Thread Dorian Hoxha
Makes more sense, but I think the leader has to do the write itself before it can
be forwarded to the other replicas, so I'm not sure that can be done.

In Elasticsearch you can have data nodes and coordinator nodes:
https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-node.html#coordinating-node
I don't think that's available in Solr though.

On Fri, Dec 16, 2016 at 1:43 PM, Jaroslaw Rozanski 
wrote:

> Sorry, not what I meant.
>
> Leader is responsible for distributing update requests to replica. So
> eventually all replicas have same state as leader. Not a problem.
>
> It is more about the performance of such. If I gather correctly normal
> replication happens by standard update request. Not by, say, segment copy.
>
> Which means update on leader is as "expensive" as on replica.
>
> Hence, if my understanding is correct, sending search request to replica
> only, in index heavy environment, would bring no benefit.
>
> So the question is: is there a mechanism, in SolrCloud (not legacy
> master/slave set-up) to make one node take a load of indexing which
> other nodes focus on searching.
>
> This is not a question of SolrClient cause that is clear how to direct
> search request to specific nodes. This is more about index optimization
> so that certain nodes (ie. replicas) could suffer less due to high
> volume indexing while serving search requests.
>
>
>
>
> On 16/12/16 12:35, Dorian Hoxha wrote:
> > The leader is the source of truth. You expect to make the replica the
> > source of truth or something???Doesn't make sense?
> > What people do, is send write to leader/master and reads to
> replicas/slaves
> > in other solr/other-dbs.
> >
> > On Fri, Dec 16, 2016 at 1:31 PM, Jaroslaw Rozanski  >
> > wrote:
> >
> >> Hi all,
> >>
> >> According to documentation, in normal operation (not recovery) in Solr
> >> Cloud configuration the leader sends updates it receives to all the
> >> replicas.
> >>
> >> This means and all nodes in the shard perform same effort to index
> >> single document. Correct?
> >>
> >> Is there then a benefit to *not* to send search requests to leader, but
> >> only to replicas?
> >>
> >> Given index & search heavy Solr Cloud system, is it possible to separate
> >> search from indexing nodes?
> >>
> >>
> >> RE: Solr 5.5.0
> >>
> >> --
> >> Jaroslaw Rozanski | e: m...@jarekrozanski.com
> >> 695E 436F A176 4961 7793  5C70 AFDF FB5E 682C 4D3D
> >>
> >>
> >
>
> --
> Jaroslaw Rozanski | e: m...@jarekrozanski.com
> 695E 436F A176 4961 7793  5C70 AFDF FB5E 682C 4D3D
>
>


Re: Separating Search and Indexing in SolrCloud

2016-12-16 Thread Jaroslaw Rozanski
Sorry, not what I meant.

The leader is responsible for distributing update requests to the replicas, so
eventually all replicas have the same state as the leader. Not a problem.

It is more about the performance of this. If I gather correctly, normal
replication happens via standard update requests, not by, say, segment copy.

Which means an update on the leader is as "expensive" as on a replica.

Hence, if my understanding is correct, sending search requests only to
replicas, in an index-heavy environment, would bring no benefit.

So the question is: is there a mechanism in SolrCloud (not the legacy
master/slave set-up) to make one node take the load of indexing while the
other nodes focus on searching?

This is not a question of the SolrClient, because it is clear how to direct
search requests to specific nodes. This is more about index optimization,
so that certain nodes (i.e. replicas) suffer less from high-volume
indexing while serving search requests.
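
As an aside, since directing requests to specific nodes came up: a minimal SolrJ
sketch of querying one chosen replica directly, bypassing the cloud-aware routing
(the host, port and core name are assumptions):

  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.impl.HttpSolrClient;
  import org.apache.solr.client.solrj.response.QueryResponse;

  public class DirectReplicaQuery {
      public static void main(String[] args) throws Exception {
          // point at one replica's core URL instead of using CloudSolrClient
          try (HttpSolrClient replica =
                   new HttpSolrClient("http://replica-host:8983/solr/collection1_shard1_replica2")) {
              SolrQuery q = new SolrQuery("*:*");
              q.set("distrib", "false"); // search only this core, no fan-out to other shards
              QueryResponse rsp = replica.query(q);
              System.out.println("numFound=" + rsp.getResults().getNumFound());
          }
      }
  }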




On 16/12/16 12:35, Dorian Hoxha wrote:
> The leader is the source of truth. You expect to make the replica the
> source of truth or something???Doesn't make sense?
> What people do, is send write to leader/master and reads to replicas/slaves
> in other solr/other-dbs.
> 
> On Fri, Dec 16, 2016 at 1:31 PM, Jaroslaw Rozanski 
> wrote:
> 
>> Hi all,
>>
>> According to documentation, in normal operation (not recovery) in Solr
>> Cloud configuration the leader sends updates it receives to all the
>> replicas.
>>
>> This means and all nodes in the shard perform same effort to index
>> single document. Correct?
>>
>> Is there then a benefit to *not* to send search requests to leader, but
>> only to replicas?
>>
>> Given index & search heavy Solr Cloud system, is it possible to separate
>> search from indexing nodes?
>>
>>
>> RE: Solr 5.5.0
>>
>> --
>> Jaroslaw Rozanski | e: m...@jarekrozanski.com
>> 695E 436F A176 4961 7793  5C70 AFDF FB5E 682C 4D3D
>>
>>
> 

-- 
Jaroslaw Rozanski | e: m...@jarekrozanski.com
695E 436F A176 4961 7793  5C70 AFDF FB5E 682C 4D3D





Re: error diagnosis help.

2016-12-16 Thread Comcast
AFAIK the only XML that Nutch should be touching is its own config files. This 
error shows up in the Solr admin UI.

Sent from my iPhone

> On Dec 16, 2016, at 1:55 AM, Reth RM  wrote:
> 
> Are you indexing xml files through nutch? This exception purely looks like
> processing of in-correct format xml file.
> 
> On Mon, Dec 12, 2016 at 11:53 AM, KRIS MUSSHORN 
> wrote:
> 
>> ive scoured my nutch and solr config files and I cant find any cause.
>> suggestions?
>> Monday, December 12, 2016 2:37:13 PMERROR   null
>> RequestHandlerBase  org.apache.solr.common.SolrException: Unexpected
>> character '&' (code 38) in epilog; expected '<'
>> org.apache.solr.common.SolrException: Unexpected character '&' (code 38)
>> in epilog; expected '<'
>> at [row,col {unknown-source}]: [1,36]
>>at org.apache.solr.handler.loader.XMLLoader.load(
>> XMLLoader.java:180)
>>at org.apache.solr.handler.UpdateRequestHandler$1.load(
>> UpdateRequestHandler.java:95)
>>at org.apache.solr.handler.ContentStreamHandlerBase.
>> handleRequestBody(ContentStreamHandlerBase.java:70)
>>at org.apache.solr.handler.RequestHandlerBase.handleRequest(
>> RequestHandlerBase.java:156)
>>at org.apache.solr.core.SolrCore.execute(SolrCore.java:2073)
>>at org.apache.solr.servlet.HttpSolrCall.execute(
>> HttpSolrCall.java:658)
>>at org.apache.solr.servlet.HttpSolrCall.call(
>> HttpSolrCall.java:457)
>>at org.apache.solr.servlet.SolrDispatchFilter.doFilter(
>> SolrDispatchFilter.java:223)
>>at org.apache.solr.servlet.SolrDispatchFilter.doFilter(
>> SolrDispatchFilter.java:181)
>>at org.eclipse.jetty.servlet.ServletHandler$CachedChain.
>> doFilter(ServletHandler.java:1652)
>>at org.eclipse.jetty.servlet.ServletHandler.doHandle(
>> ServletHandler.java:585)
>>at org.eclipse.jetty.server.handler.ScopedHandler.handle(
>> ScopedHandler.java:143)
>>at org.eclipse.jetty.security.SecurityHandler.handle(
>> SecurityHandler.java:577)
>>at org.eclipse.jetty.server.session.SessionHandler.
>> doHandle(SessionHandler.java:223)
>>at org.eclipse.jetty.server.handler.ContextHandler.
>> doHandle(ContextHandler.java:1127)
>>at org.eclipse.jetty.servlet.ServletHandler.doScope(
>> ServletHandler.java:515)
>>at org.eclipse.jetty.server.session.SessionHandler.
>> doScope(SessionHandler.java:185)
>>at org.eclipse.jetty.server.handler.ContextHandler.
>> doScope(ContextHandler.java:1061)
>>at org.eclipse.jetty.server.handler.ScopedHandler.handle(
>> ScopedHandler.java:141)
>>at org.eclipse.jetty.server.handler.ContextHandlerCollection.
>> handle(ContextHandlerCollection.java:215)
>>at org.eclipse.jetty.server.handler.HandlerCollection.
>> handle(HandlerCollection.java:110)
>>at org.eclipse.jetty.server.handler.HandlerWrapper.handle(
>> HandlerWrapper.java:97)
>>at org.eclipse.jetty.server.Server.handle(Server.java:499)
>>at org.eclipse.jetty.server.HttpChannel.handle(
>> HttpChannel.java:310)
>>at org.eclipse.jetty.server.HttpConnection.onFillable(
>> HttpConnection.java:257)
>>at org.eclipse.jetty.io.AbstractConnection$2.run(
>> AbstractConnection.java:540)
>>at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(
>> QueuedThreadPool.java:635)
>>at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(
>> QueuedThreadPool.java:555)
>>at java.lang.Thread.run(Thread.java:745)
>> 
>> 



Re: Separating Search and Indexing in SolrCloud

2016-12-16 Thread Dorian Hoxha
The leader is the source of truth. Do you expect to make the replica the
source of truth or something? That doesn't make sense.
What people do is send writes to the leader/master and reads to the replicas/slaves,
in Solr as in other DBs.

On Fri, Dec 16, 2016 at 1:31 PM, Jaroslaw Rozanski 
wrote:

> Hi all,
>
> According to documentation, in normal operation (not recovery) in Solr
> Cloud configuration the leader sends updates it receives to all the
> replicas.
>
> This means and all nodes in the shard perform same effort to index
> single document. Correct?
>
> Is there then a benefit to *not* to send search requests to leader, but
> only to replicas?
>
> Given index & search heavy Solr Cloud system, is it possible to separate
> search from indexing nodes?
>
>
> RE: Solr 5.5.0
>
> --
> Jaroslaw Rozanski | e: m...@jarekrozanski.com
> 695E 436F A176 4961 7793  5C70 AFDF FB5E 682C 4D3D
>
>


Separating Search and Indexing in SolrCloud

2016-12-16 Thread Jaroslaw Rozanski
Hi all,

According to documentation, in normal operation (not recovery) in Solr
Cloud configuration the leader sends updates it receives to all the
replicas.

This means all nodes in the shard perform the same effort to index a
single document. Correct?

Is there then a benefit to *not* send search requests to the leader, but
only to replicas?

Given an index- & search-heavy Solr Cloud system, is it possible to separate
search from indexing nodes?


RE: Solr 5.5.0

-- 
Jaroslaw Rozanski | e: m...@jarekrozanski.com
695E 436F A176 4961 7793  5C70 AFDF FB5E 682C 4D3D





Re: Max vertical scaling in your experience ? (1 instance/server)

2016-12-16 Thread Dorian Hoxha
On Fri, Dec 16, 2016 at 12:39 PM, GW  wrote:

> Dorian,
>
> From my reading, my belief is that you just need some beefy machines for
> your zookeeper ensemble so they can think fast.

ZooKeeper needs to think fast enough for cluster state changes. So I think
it scales with the number of machines/collections/shards and not documents.

> After that your issues are
> complicated by drive I/O which I believe is solved by using shards. If you
> have a collection running on top of a single drive array it should not
> compare to writing to a dozen drive arrays. So a whole bunch of light duty
> machines that have a decent amount of memory and barely able process faster
> than their drive I/O will serve you better.
>
My dataset will be smaller than total memory, so I expect no query to hit
disk.

>
> I think the Apache big data mandate was to be horizontally scalable to
> infinity with cheap consumer hardware. In my minds eye you are not going to
> get crazy input rates without a big horizontal drive system.
>
There is overhead with small machines, and very big machines get pricey.
So something in the middle:
either a small cluster of big machines or a big cluster of small machines.

>
> I'm in the same boat. All the scaling and roll out documentation seems to
> reference the Witch Doctor's secret handbook.
>
> I just started into making my applications ZK aware and really just
> starting to understand the architecture. After a whole year I still feel
> weak while at the same time I have traveled far. I still feel like an
> amateur.
>
> My plans are to use bridge tools in Linux so all my machines are sitting on
> the switch with layer 2. Then use Conga to monitor which apps need to be
> running. If a server dies, it's apps are spun up on one of the other
> servers using the original IP and mac address through a bridge firewall
> gateway so there is no hold up with with mac phreaking like layer 3. Layer
> 3 does not like to see a route change with a mac address. My apps will be
> on a SAN ~ Data on as many shards/machines as financially possible.
>
By conga you mean https://sourceware.org/cluster/conga/spec/ ?
Also SAN may/will suck like someone answered in your thread.

>
> I was going to put a bunch of Apache web servers in round robin to talk to
> Solr but discovered that a Solr node can be dead and not report errors.
>
Please explain more "dead but no error".

> It's all rough at the moment but it makes total sense to send Solr requests
> based on what ZK says is available verses a round robin.
>
Yes, like I and another commenter wrote on your thread.

>
> Will keep you posted on my roll out if you like.
>
> Best,
>
> GW
>
>
>
>
>
>
>
> On 16 December 2016 at 03:31, Dorian Hoxha  wrote:
>
> > Hello searchers,
> >
> > I'm researching solr for a project that would require a
> max-inserts(10M/s)
> > and some heavy facet+fq on top of that, though on low qps.
> >
> > And I'm trying to find blogs/slides where people have used some big
> > machines instead of hundreds of small ones.
> >
> > 1. Largest I've found is this
> >  > 4-machines-1-solrcloud/>
> > with 16cores + 384GB ram but they were using 25! solr4 instances / server
> > which seems wasteful to me ?
> >
> > I know that 1 solr can have max ~29-30GB heap because GC is
> wasteful/sucks
> > after that, and you should leave the other amount to the os for
> file-cache.
> > 2. But do you think 1 instance will be able to fully-use a 256GB/20core
> > machine ?
> >
> > 3. Like to share your findings/links with big-machine clusters ?
> >
> > Thank You
> >
>


Re: Max vertical scaling in your experience ? (1 instance/server)

2016-12-16 Thread GW
Dorian,

From my reading, my belief is that you just need some beefy machines for
your ZooKeeper ensemble so they can think fast. After that your issues are
complicated by drive I/O, which I believe is solved by using shards. If you
have a collection running on top of a single drive array it should not
compare to writing to a dozen drive arrays. So a whole bunch of light-duty
machines that have a decent amount of memory and are barely able to process
faster than their drive I/O will serve you better.

I think the Apache big data mandate was to be horizontally scalable to
infinity with cheap consumer hardware. In my mind's eye you are not going to
get crazy input rates without a big horizontal drive system.

I'm in the same boat. All the scaling and roll out documentation seems to
reference the Witch Doctor's secret handbook.

I just started making my applications ZK aware and am really just
starting to understand the architecture. After a whole year I still feel
weak, while at the same time I have traveled far. I still feel like an
amateur.

My plans are to use bridge tools in Linux so all my machines are sitting on
the switch with layer 2. Then use Conga to monitor which apps need to be
running. If a server dies, its apps are spun up on one of the other
servers using the original IP and mac address through a bridge firewall
gateway so there is no hold-up with mac phreaking like layer 3. Layer
3 does not like to see a route change with a mac address. My apps will be
on a SAN ~ Data on as many shards/machines as financially possible.

I was going to put a bunch of Apache web servers in round robin to talk to
Solr but discovered that a Solr node can be dead and not report errors.
It's all rough at the moment but it makes total sense to send Solr requests
based on what ZK says is available versus a round robin.

Will keep you posted on my roll out if you like.

Best,

GW







On 16 December 2016 at 03:31, Dorian Hoxha  wrote:

> Hello searchers,
>
> I'm researching solr for a project that would require a max-inserts(10M/s)
> and some heavy facet+fq on top of that, though on low qps.
>
> And I'm trying to find blogs/slides where people have used some big
> machines instead of hundreds of small ones.
>
> 1. Largest I've found is this
>  4-machines-1-solrcloud/>
> with 16cores + 384GB ram but they were using 25! solr4 instances / server
> which seems wasteful to me ?
>
> I know that 1 solr can have max ~29-30GB heap because GC is wasteful/sucks
> after that, and you should leave the other amount to the os for file-cache.
> 2. But do you think 1 instance will be able to fully-use a 256GB/20core
> machine ?
>
> 3. Like to share your findings/links with big-machine clusters ?
>
> Thank You
>


Re: Max vertical scaling in your experience ? (1 instance/server)

2016-12-16 Thread Dorian Hoxha
On Fri, Dec 16, 2016 at 11:31 AM, Toke Eskildsen 
wrote:

> On Fri, 2016-12-16 at 11:19 +0100, Dorian Hoxha wrote:
> > On Fri, Dec 16, 2016 at 10:45 AM, Toke Eskildsen
> >  wrote:
> > > We try hard to stay below 32GB, but for some setups the penalty of
> > > crossing the boundary is worth it. If, for example, having
> > > everything in 1 shard means a heap requirement of 50GB, it can be a
> > > better solution than a multi-shard setup with 2*25GB heap.
> > >
> > The heap is for the instance, not for each shard. Yeah, having less
> > shards is ~more efficient since terms-dictionary,cache etc have lower
> > duplication.
>
> True, but that was not my point. What I tried to communicate is that
> there can be a huge difference between having 1 shard in the collection
> and having more than 1 shard. Not for document searches, but for
> aggregations such as grouping and especially String faceting.
>
> - Toke Eskildsen, State and University Library, Denmark
>
Yes, makes sense. I remember that cross-shard aggregations may require more
than one call (one call to get the top(x), another to verify that those really
are the top(x) across shards). So fewer shards means fewer merges to get the
final values.
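
For example, with plain field faceting the size of that first-phase per-shard
request can be tuned (a sketch; "category" is just a placeholder field name,
and the over-request values shown are the usual defaults):

    q=*:*&rows=0
      &facet=true
      &facet.field=category
      &facet.limit=10
      &facet.overrequest.count=10
      &facet.overrequest.ratio=1.5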


Re: Max vertical scaling in your experience ? (1 instance/server)

2016-12-16 Thread Toke Eskildsen
On Fri, 2016-12-16 at 11:19 +0100, Dorian Hoxha wrote:
> On Fri, Dec 16, 2016 at 10:45 AM, Toke Eskildsen
>  wrote:
> > We try hard to stay below 32GB, but for some setups the penalty of
> > crossing the boundary is worth it. If, for example, having
> > everything in 1 shard means a heap requirement of 50GB, it can be a
> > better solution than a multi-shard setup with 2*25GB heap.
> > 
> The heap is for the instance, not for each shard. Yeah, having less
> shards is ~more efficient since terms-dictionary,cache etc have lower
> duplication.

True, but that was not my point. What I tried to communicate is that
there can be a huge difference between having 1 shard in the collection
and having more than 1 shard. Not for document searches, but for
aggregations such as grouping and especially String faceting.

- Toke Eskildsen, State and University Library, Denmark


Problem to specify end parameter for range facets

2016-12-16 Thread Aman Tandon
Hi,

I want to do range facets with a gap of 10, but I don't know the end value,
as it could be very large. How can I do that?
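
For example, something like the sketch below ("price" is just a placeholder
field), where I have no sensible value to put in facet.range.end:

    q=*:*&rows=0
      &facet=true
      &facet.range=price
      &facet.range.start=0
      &facet.range.gap=10
      &facet.range.end=???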

Thanks
Aman Tandon


Re: Max vertical scaling in your experience ? (1 instance/server)

2016-12-16 Thread Dorian Hoxha
On Fri, Dec 16, 2016 at 10:45 AM, Toke Eskildsen 
wrote:

> On Fri, 2016-12-16 at 09:31 +0100, Dorian Hoxha wrote:
> > I'm researching solr for a project that would require a max-
> > inserts(10M/s) and some heavy facet+fq on top of that, though on low
> > qps.
>
> You don't ask for much, do you :-) If you add high commit rate to the
> list, you have a serious candidate for worst-case.
>
I'm sorry, the commit will be 1-2 seconds :( . But this will be expiring
data, so it won't grow to petabytes. I can also relax disk activity. I don't
see a config for relaxing the transaction-log persistence, though? As in: I
write, Solr returns 'ok', the document is in the transaction log, but the
transaction log hasn't yet gotten an 'ok' from the filesystem.
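
For reference, the only related knobs I have found are the commit settings in
solrconfig.xml's updateHandler; a minimal sketch (times are placeholders). They
control visibility and hard-commit frequency, but as far as I can tell not the
fsync behaviour of the update log itself:

    <updateHandler class="solr.DirectUpdateHandler2">
      <updateLog>
        <str name="dir">${solr.ulog.dir:}</str>
      </updateLog>
      <!-- hard commit: flushes index segments to disk, no new searcher -->
      <autoCommit>
        <maxTime>15000</maxTime>
        <openSearcher>false</openSearcher>
      </autoCommit>
      <!-- soft commit: makes documents visible to searches (the 1-2 s target) -->
      <autoSoftCommit>
        <maxTime>2000</maxTime>
      </autoSoftCommit>
    </updateHandler>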

>
> > And I'm trying to find blogs/slides where people have used some big
> > machines instead of hundreds of small ones.
> >
> > 1. Largest I've found is this
> >  > solrcloud/>
> > with 16cores + 384GB ram but they were using 25! solr4 instances /
> > server which seems wasteful to me ?
>
> The way those machines are set up is (nearly) the same as having 16
> quadcore machines with 96GB of RAM, each running 6 Solr instances.
> I say nearly because the shared memory is a plus as it averages
> fluctuations in Solr requirements and a minus because of the cross-
> socket penalties in NUMA.
>
> I digress, sorry. Point is that they are not really run as large
> machines. The choice of box size vs. box count was hugely driven by
> purchase & maintenance cost. Also, as that setup is highly optimized
> towards serving a static index, I don't think it would fit your very
> high update requirements.
>
> As for you argument for less Solrs, each serving multiple shards, then
> it is entirely valid. I have answered your question about this on the
> blog, but the short story is: It works now and optimizing hardware
> utilization is not high on our priority list.
>
> > I know that 1 solr can have max ~29-30GB heap because GC is
> > wasteful/sucks after that, and you should leave the other amount to
> > the os for file-cache.
>
> We try hard to stay below 32GB, but for some setups the penalty of
> crossing the boundary is worth it. If, for example, having everything
> in 1 shard means a heap requirement of 50GB, it can be a better
> solution than a multi-shard setup with 2*25GB heap.
>
The heap is for the instance, not for each shard. Yeah, having fewer shards is
~more efficient since the terms dictionary, caches, etc. have lower duplication.

>
> > 2. But do you think 1 instance will be able to fully-use a
> > 256GB/20core machine ?
>
> I think (you should verify this) that there is some congestion issues
> in the indexing part of Solr: Feeding a single Solr with X threads will
> give you a lower index rate that feeding 2 separate Solrs (running on
> the same machine) with X/2 threads each.
>
That means the thread pools aren't very scalable with the number of cores,
assuming we compare 2 shards on 1 Solr vs. 2 Solrs each with 1 shard.

>
> - Toke Eskildsen, State and University Library, Denmark
>
Thanks Toke!


Re: Getting Error - Session expired for /collections/sprod/state.json

2016-12-16 Thread Yago Riveiro
Do some GC profiling to get more information about it. It's possible you have
configured a small heap and you are running into GC stop-the-world pauses.

Normally ZooKeeper errors are tied to GC and network latency issues.
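
If it helps, verbose GC logging can usually be switched on through solr.in.sh;
a rough sketch (the variable name and flags are the standard ones for Solr 5/6
on a pre-Java-9 HotSpot JVM; the log typically ends up as solr_gc.log under the
Solr logs directory):

    # solr.in.sh -- log GC activity so stop-the-world pauses show up with timestamps
    GC_LOG_OPTS="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
      -XX:+PrintGCApplicationStoppedTime"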

--

/Yago Riveiro

On 16 Dec 2016, 09:49 +, Piyush Kunal , wrote:
> Looks like an issue with 6.x version then.
> But this seems too basic. Not sure if community would not have caught this
> till now.
>
> On Fri, Dec 16, 2016 at 2:55 PM, Yago Riveiro  wrote:
>
> > I had some of this error in my logs too on 6.3.0
> >
> > My cluster also index like 20K docs/sec I don't know why.
> >
> > --
> >
> > /Yago Riveiro
> >
> > On 16 Dec 2016, 08:39 +, Piyush Kunal ,
> > wrote:
> > > Anyone has noticed such issue before?
> > >
> > > On Thu, Dec 15, 2016 at 4:36 PM, Piyush Kunal  > > wrote:
> > >
> > > > This is happening when heavy indexing like 100/second is going on.
> > > >
> > > > On Thu, Dec 15, 2016 at 4:33 PM, Piyush Kunal  > > > wrote:
> > > >
> > > > > - We have solr6.1.0 cluster running on production with 1 shard and 5
> > > > > replicas.
> > > > > - Zookeeper quorum on 3 nodes.
> > > > > - Using a chroot in zookeeper to segregate the configs from other
> > > > > collections.
> > > > > - Using solrj5.1.0 as our client to query solr.
> > > > >
> > > > >
> > > > >
> > > > > Usually things work fine but on and off we witness this exception
> > coming
> > > > > up:
> > > > > =
> > > > > org.apache.solr.common.SolrException: Could not load collection from
> > > > > ZK:sprod
> > > > > at org.apache.solr.common.cloud.ZkStateReader.getCollectionLive
> > > > > (ZkStateReader.java:815)
> > > > > at org.apache.solr.common.cloud.ZkStateReader$5.get(ZkStateRead
> > > > > er.java:477)
> > > > > at org.apache.solr.client.solrj.impl.CloudSolrClient.getDocColl
> > > > > ection(CloudSolrClient.java:1174)
> > > > > at org.apache.solr.client.solrj.impl.CloudSolrClient.requestWit
> > > > > hRetryOnStaleState(CloudSolrClient.java:807)
> > > > > at org.apache.solr.client.solrj.impl.CloudSolrClient.request(Cl
> > > > > oudSolrClient.java:782)
> > > > > --
> > > > > Caused by: org.apache.zookeeper.KeeperException$
> > SessionExpiredException:
> > > > > KeeperErrorCode = Session expired for /collections/sprod/state.json
> > > > > at org.apache.zookeeper.KeeperException.create(KeeperException.
> > > > > java:127)
> > > > > at org.apache.zookeeper.KeeperException.create(KeeperException.
> > > > > java:51)
> > > > > at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1045)
> > > > > at org.apache.solr.common.cloud.SolrZkClient$5.execute(SolrZkCl
> > > > > ient.java:311)
> > > > > at org.apache.solr.common.cloud.SolrZkClient$5.execute(SolrZkCl
> > > > > ient.java:308)
> > > > > at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(Zk
> > > > > CmdExecutor.java:61)
> > > > > at org.apache.solr.common.cloud.SolrZkClient.exists(SolrZkClien
> > > > > t.java:308)
> > > > > --
> > > > > org.apache.solr.common.SolrException: Could not load collection from
> > > > > ZK:sprod
> > > > > at org.apache.solr.common.cloud.ZkStateReader.getCollectionLive
> > > > > (ZkStateReader.java:815)
> > > > > at org.apache.solr.common.cloud.ZkStateReader$5.get(ZkStateRead
> > > > > er.java:477)
> > > > > at org.apache.solr.client.solrj.impl.CloudSolrClient.getDocColl
> > > > > ection(CloudSolrClient.java:1174)
> > > > > at org.apache.solr.client.solrj.impl.CloudSolrClient.requestWit
> > > > > hRetryOnStaleState(CloudSolrClient.java:807)
> > > > > at org.apache.solr.client.solrj.impl.CloudSolrClient.request(Cl
> > > > > oudSolrClient.java:782)
> > > > > --
> > > > > Caused by: org.apache.zookeeper.KeeperException$
> > SessionExpiredException:
> > > > > KeeperErrorCode = Session expired for /collections/sprod/state.json
> > > > > at org.apache.zookeeper.KeeperException.create(KeeperException.
> > > > > java:127)
> > > > > at org.apache.zookeeper.KeeperException.create(KeeperException.
> > > > > java:51)
> > > > > at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1045)
> > > > > at org.apache.solr.common.cloud.SolrZkClient$5.execute(SolrZkCl
> > > > > ient.java:311)
> > > > > at org.apache.solr.common.cloud.SolrZkClient$5.execute(SolrZkCl
> > > > > ient.java:308)
> > > > > at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(Zk
> > > > > CmdExecutor.java:61)
> > > > > at org.apache.solr.common.cloud.SolrZkClient.exists(SolrZkClien
> > > > > t.java:308)
> > > > > =
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > This is our zoo.cfg:
> > > > > ==
> > > > > tickTime=2000
> > > > > dataDir=/var/lib/zookeeper
> > > > > clientPort=2181
> > > > > initLimit=5
> > > > > syncLimit=2
> > > > > server.1=192.168.70.27:2888:3888
> > > > > server.2=192.168.70.64:2889:3889
> > > > > server.3=192.168.70.26:2889:3889
> > > > > maxClientCnxns=300
> > > > > maxSessionTimeo

Re: Getting Error - Session expired for /collections/sprod/state.json

2016-12-16 Thread Piyush Kunal
Looks like an issue with the 6.x version then.
But this seems too basic; I'm not sure how the community would not have caught
it by now.

On Fri, Dec 16, 2016 at 2:55 PM, Yago Riveiro 
wrote:

> I had some of this error in my logs too on 6.3.0
>
> My cluster also index like 20K docs/sec I don't know why.
>
> --
>
> /Yago Riveiro
>
> On 16 Dec 2016, 08:39 +, Piyush Kunal ,
> wrote:
> > Anyone has noticed such issue before?
> >
> > On Thu, Dec 15, 2016 at 4:36 PM, Piyush Kunal  > wrote:
> >
> > > This is happening when heavy indexing like 100/second is going on.
> > >
> > > On Thu, Dec 15, 2016 at 4:33 PM, Piyush Kunal  > > wrote:
> > >
> > > > - We have solr6.1.0 cluster running on production with 1 shard and 5
> > > > replicas.
> > > > - Zookeeper quorum on 3 nodes.
> > > > - Using a chroot in zookeeper to segregate the configs from other
> > > > collections.
> > > > - Using solrj5.1.0 as our client to query solr.
> > > >
> > > >
> > > >
> > > > Usually things work fine but on and off we witness this exception
> coming
> > > > up:
> > > > =
> > > > org.apache.solr.common.SolrException: Could not load collection from
> > > > ZK:sprod
> > > > at org.apache.solr.common.cloud.ZkStateReader.getCollectionLive
> > > > (ZkStateReader.java:815)
> > > > at org.apache.solr.common.cloud.ZkStateReader$5.get(ZkStateRead
> > > > er.java:477)
> > > > at org.apache.solr.client.solrj.impl.CloudSolrClient.getDocColl
> > > > ection(CloudSolrClient.java:1174)
> > > > at org.apache.solr.client.solrj.impl.CloudSolrClient.requestWit
> > > > hRetryOnStaleState(CloudSolrClient.java:807)
> > > > at org.apache.solr.client.solrj.impl.CloudSolrClient.request(Cl
> > > > oudSolrClient.java:782)
> > > > --
> > > > Caused by: org.apache.zookeeper.KeeperException$
> SessionExpiredException:
> > > > KeeperErrorCode = Session expired for /collections/sprod/state.json
> > > > at org.apache.zookeeper.KeeperException.create(KeeperException.
> > > > java:127)
> > > > at org.apache.zookeeper.KeeperException.create(KeeperException.
> > > > java:51)
> > > > at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1045)
> > > > at org.apache.solr.common.cloud.SolrZkClient$5.execute(SolrZkCl
> > > > ient.java:311)
> > > > at org.apache.solr.common.cloud.SolrZkClient$5.execute(SolrZkCl
> > > > ient.java:308)
> > > > at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(Zk
> > > > CmdExecutor.java:61)
> > > > at org.apache.solr.common.cloud.SolrZkClient.exists(SolrZkClien
> > > > t.java:308)
> > > > --
> > > > org.apache.solr.common.SolrException: Could not load collection from
> > > > ZK:sprod
> > > > at org.apache.solr.common.cloud.ZkStateReader.getCollectionLive
> > > > (ZkStateReader.java:815)
> > > > at org.apache.solr.common.cloud.ZkStateReader$5.get(ZkStateRead
> > > > er.java:477)
> > > > at org.apache.solr.client.solrj.impl.CloudSolrClient.getDocColl
> > > > ection(CloudSolrClient.java:1174)
> > > > at org.apache.solr.client.solrj.impl.CloudSolrClient.requestWit
> > > > hRetryOnStaleState(CloudSolrClient.java:807)
> > > > at org.apache.solr.client.solrj.impl.CloudSolrClient.request(Cl
> > > > oudSolrClient.java:782)
> > > > --
> > > > Caused by: org.apache.zookeeper.KeeperException$
> SessionExpiredException:
> > > > KeeperErrorCode = Session expired for /collections/sprod/state.json
> > > > at org.apache.zookeeper.KeeperException.create(KeeperException.
> > > > java:127)
> > > > at org.apache.zookeeper.KeeperException.create(KeeperException.
> > > > java:51)
> > > > at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1045)
> > > > at org.apache.solr.common.cloud.SolrZkClient$5.execute(SolrZkCl
> > > > ient.java:311)
> > > > at org.apache.solr.common.cloud.SolrZkClient$5.execute(SolrZkCl
> > > > ient.java:308)
> > > > at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(Zk
> > > > CmdExecutor.java:61)
> > > > at org.apache.solr.common.cloud.SolrZkClient.exists(SolrZkClien
> > > > t.java:308)
> > > > =
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > This is our zoo.cfg:
> > > > ==
> > > > tickTime=2000
> > > > dataDir=/var/lib/zookeeper
> > > > clientPort=2181
> > > > initLimit=5
> > > > syncLimit=2
> > > > server.1=192.168.70.27:2888:3888
> > > > server.2=192.168.70.64:2889:3889
> > > > server.3=192.168.70.26:2889:3889
> > > > maxClientCnxns=300
> > > > maxSessionTimeout=9
> > > > ===
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > This is our solr.xml on server side
> > > > ===
> > > >
> > > >  > > >
> > > >  > > >
> > > > ${host:} > > > ${jetty.port:8983} > > > ${hostContext:solr} > > >
> > > > ${genericCoreNodeNames:true} > > >
> > > > ${zkClientTimeout:3} > > > ${distribUpdateSoTimeout:
> 60} > > > ${distribUpdateConnTimeout:
> 6} > > > ${zkCredentialsProvider:org.
> apache.solr.common.cl

Re: Max vertical scaling in your experience ? (1 instance/server)

2016-12-16 Thread Toke Eskildsen
On Fri, 2016-12-16 at 09:31 +0100, Dorian Hoxha wrote:
> I'm researching solr for a project that would require a max-
> inserts(10M/s) and some heavy facet+fq on top of that, though on low
> qps.

You don't ask for much, do you :-) If you add high commit rate to the
list, you have a serious candidate for worst-case.

> And I'm trying to find blogs/slides where people have used some big
> machines instead of hundreds of small ones.
> 
> 1. Largest I've found is this
>  solrcloud/>
> with 16cores + 384GB ram but they were using 25! solr4 instances /
> server which seems wasteful to me ?

The way those machines are set up is (nearly) the same as having 16
quadcore machines with 96GB of RAM, each running 6 Solr instances.
I say nearly because the shared memory is a plus as it averages
fluctuations in Solr requirements and a minus because of the cross-
socket penalties in NUMA.

I digress, sorry. Point is that they are not really run as large
machines. The choice of box size vs. box count was hugely driven by
purchase & maintenance cost. Also, as that setup is highly optimized
towards serving a static index, I don't think it would fit your very
high update requirements.

As for your argument for fewer Solrs, each serving multiple shards, it
is entirely valid. I have answered your question about this on the
blog, but the short story is: it works now, and optimizing hardware
utilization is not high on our priority list.

> I know that 1 solr can have max ~29-30GB heap because GC is
> wasteful/sucks after that, and you should leave the other amount to
> the os for file-cache.

We try hard to stay below 32GB, but for some setups the penalty of
crossing the boundary is worth it. If, for example, having everything
in 1 shard means a heap requirement of 50GB, it can be a better
solution than a multi-shard setup with 2*25GB heap.

> 2. But do you think 1 instance will be able to fully-use a
> 256GB/20core machine ?

I think (you should verify this) that there is some congestion issues
in the indexing part of Solr: Feeding a single Solr with X threads will
give you a lower index rate than feeding 2 separate Solrs (running on
the same machine) with X/2 threads each.

- Toke Eskildsen, State and University Library, Denmark


Re: Search only for single value of Solr multivalue field

2016-12-16 Thread Leo BRUVRY-LAGADEC

Hi Dorian,

Firstly, thanks for your response, but it does not seem to work.

Here is another example: I want to search for documents whose affiliations
contain the NHM (Natural History Museum) of India. So I want to get only
the document with id=2:



<doc>
  <field name="id">1</field>
  <field name="idx_affilliation">NHM, Austria</field>
  <field name="idx_affilliation">Annamalai Univ, India</field>
</doc>

<doc>
  <field name="id">2</field>
  <field name="idx_affilliation">NHM, India</field>
  <field name="idx_affilliation">IRD, FRANCE</field>
</doc>


If I implement your solution, ((NHM in affilliation OR India in
affilliation) AND NOT (NHM in affilliation AND India in affilliation)), it
doesn't return any document. Did I miss something in your
explanation?


In the previous version of my application I had a solution with
Oracle Full Text; it seems weird that Solr cannot provide a solution for
that.
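
For completeness, the closest workaround I have read about (but have not tried
here) is a large positionIncrementGap on the field type combined with proximity
(phrase-with-slop) queries, so that a query cannot match terms coming from two
different values. A rough sketch, not my actual schema (the type name and gap
value are just examples):

    <!-- the gap is inserted between consecutive values of a multiValued field -->
    <fieldType name="text_affiliation" class="solr.TextField" positionIncrementGap="1000">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>
    <field name="idx_affilliation" type="text_affiliation" indexed="true"
           stored="true" multiValued="true"/>

A query such as idx_affilliation:"NHM India"~50 would then match only when both
terms occur inside the same value, because the slop (50) is smaller than the
gap (1000).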


Best regards,
Léo.

On 15/12/2016 at 12:44, Dorian Hoxha wrote:

You should be able to filter "(word1 in field OR word2 in field) AND
NOT(word1 in field AND word2 in field)". Translate that into the right
syntax.
I don't know if lucene is smart enough to execute the filter only once (it
should be i guess).
Makes sense ?

On Thu, Dec 15, 2016 at 12:12 PM, Leo BRUVRY-LAGADEC  wrote:


Hi,

I have a multivalued field in my schema called "idx_affilliation".

<field name="idx_affilliation">IFREMER, Ctr Brest, DRO Geosci Marines, F-29280 Plouzane, France.</field>
<field name="idx_affilliation">Univ Lisbon, Ctr Geofis, P-1269102 Lisbon, Portugal.</field>
<field name="idx_affilliation">Univ Bretagne Occidentale, Inst Univ Europeen Mer, Lab Domaines Ocean, F-29280 Plouzane, France.</field>
<field name="idx_affilliation">Total Explorat Prod Geosci Projets Nouveaux Exper, F-92078 Paris, France.</field>

I want to be able to do a query like: idx_affilliation:(IFREMER Portugal)
and not have this document returned. In other words, I do not want queries
to span individual values for the field.


---

Here are some further examples using the document above of how I want this
to work:

idx_affilliation:(IFREMER France) --> Returns it.
idx_affilliation:(IFREMER Plouzane) --> Returns it.
idx_affilliation:("Univ Bretagne Occidentale") --> Returns it.
idx_affilliation:("Univ Lisbon" Portugal) --> Returns it.
idx_affilliation:(IFREMER Portugal) --> DOES NOT RETURN IT.

Does someone know if it's possible to do this?

Best regards,
Leo.





Re: Getting Error - Session expired for /collections/sprod/state.json

2016-12-16 Thread Yago Riveiro
I have had some of these errors in my logs too, on 6.3.0.

My cluster also indexes around 20K docs/sec; I don't know why it happens.

--

/Yago Riveiro

On 16 Dec 2016, 08:39 +, Piyush Kunal , wrote:
> Anyone has noticed such issue before?
>
> On Thu, Dec 15, 2016 at 4:36 PM, Piyush Kunal  wrote:
>
> > This is happening when heavy indexing like 100/second is going on.
> >
> > On Thu, Dec 15, 2016 at 4:33 PM, Piyush Kunal  > wrote:
> >
> > > - We have solr6.1.0 cluster running on production with 1 shard and 5
> > > replicas.
> > > - Zookeeper quorum on 3 nodes.
> > > - Using a chroot in zookeeper to segregate the configs from other
> > > collections.
> > > - Using solrj5.1.0 as our client to query solr.
> > >
> > >
> > >
> > > Usually things work fine but on and off we witness this exception coming
> > > up:
> > > =
> > > org.apache.solr.common.SolrException: Could not load collection from
> > > ZK:sprod
> > > at org.apache.solr.common.cloud.ZkStateReader.getCollectionLive
> > > (ZkStateReader.java:815)
> > > at org.apache.solr.common.cloud.ZkStateReader$5.get(ZkStateRead
> > > er.java:477)
> > > at org.apache.solr.client.solrj.impl.CloudSolrClient.getDocColl
> > > ection(CloudSolrClient.java:1174)
> > > at org.apache.solr.client.solrj.impl.CloudSolrClient.requestWit
> > > hRetryOnStaleState(CloudSolrClient.java:807)
> > > at org.apache.solr.client.solrj.impl.CloudSolrClient.request(Cl
> > > oudSolrClient.java:782)
> > > --
> > > Caused by: org.apache.zookeeper.KeeperException$SessionExpiredException:
> > > KeeperErrorCode = Session expired for /collections/sprod/state.json
> > > at org.apache.zookeeper.KeeperException.create(KeeperException.
> > > java:127)
> > > at org.apache.zookeeper.KeeperException.create(KeeperException.
> > > java:51)
> > > at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1045)
> > > at org.apache.solr.common.cloud.SolrZkClient$5.execute(SolrZkCl
> > > ient.java:311)
> > > at org.apache.solr.common.cloud.SolrZkClient$5.execute(SolrZkCl
> > > ient.java:308)
> > > at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(Zk
> > > CmdExecutor.java:61)
> > > at org.apache.solr.common.cloud.SolrZkClient.exists(SolrZkClien
> > > t.java:308)
> > > --
> > > org.apache.solr.common.SolrException: Could not load collection from
> > > ZK:sprod
> > > at org.apache.solr.common.cloud.ZkStateReader.getCollectionLive
> > > (ZkStateReader.java:815)
> > > at org.apache.solr.common.cloud.ZkStateReader$5.get(ZkStateRead
> > > er.java:477)
> > > at org.apache.solr.client.solrj.impl.CloudSolrClient.getDocColl
> > > ection(CloudSolrClient.java:1174)
> > > at org.apache.solr.client.solrj.impl.CloudSolrClient.requestWit
> > > hRetryOnStaleState(CloudSolrClient.java:807)
> > > at org.apache.solr.client.solrj.impl.CloudSolrClient.request(Cl
> > > oudSolrClient.java:782)
> > > --
> > > Caused by: org.apache.zookeeper.KeeperException$SessionExpiredException:
> > > KeeperErrorCode = Session expired for /collections/sprod/state.json
> > > at org.apache.zookeeper.KeeperException.create(KeeperException.
> > > java:127)
> > > at org.apache.zookeeper.KeeperException.create(KeeperException.
> > > java:51)
> > > at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1045)
> > > at org.apache.solr.common.cloud.SolrZkClient$5.execute(SolrZkCl
> > > ient.java:311)
> > > at org.apache.solr.common.cloud.SolrZkClient$5.execute(SolrZkCl
> > > ient.java:308)
> > > at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(Zk
> > > CmdExecutor.java:61)
> > > at org.apache.solr.common.cloud.SolrZkClient.exists(SolrZkClien
> > > t.java:308)
> > > =
> > >
> > >
> > >
> > >
> > >
> > > This is our zoo.cfg:
> > > ==
> > > tickTime=2000
> > > dataDir=/var/lib/zookeeper
> > > clientPort=2181
> > > initLimit=5
> > > syncLimit=2
> > > server.1=192.168.70.27:2888:3888
> > > server.2=192.168.70.64:2889:3889
> > > server.3=192.168.70.26:2889:3889
> > > maxClientCnxns=300
> > > maxSessionTimeout=9
> > > ===
> > >
> > >
> > >
> > >
> > >
> > > This is our solr.xml on server side
> > > ===
> > >
> > >  > >
> > >  > >
> > > ${host:} > > ${jetty.port:8983} > > ${hostContext:solr} > >
> > > ${genericCoreNodeNames:true} > >
> > > ${zkClientTimeout:3} > > ${distribUpdateSoTimeout:60} > >  > > name="distribUpdateConnTimeout">${distribUpdateConnTimeout:6} > >  > > name="zkCredentialsProvider">${zkCredentialsProvider:org.apache.solr.common.cloud.DefaultZkCredentialsProvider} > >  > > name="zkACLProvider">${zkACLProvider:org.apache.solr.common.cloud.DefaultZkACLProvider} > >
> > >  > >
> > >  > > class="HttpShardHandlerFactory"
> > > ${socketTimeout:60} > > ${connTimeout:6} > >  > >  > >
> > > ===
> > >
> > >
> > >
> > >
> > > Any help appreciated.
> > >
> > > Regards,
> > > Piy

Re: Solr MapReduce Indexer Tool is failing for empty core name.

2016-12-16 Thread Manan Sheth
That's what I presume, and it should be utilizing the collection only. The
collection param has already been specified and it should take all the details
from there. Also, the core-to-collection change happened in Solr 4. The MapReduce
indexer shipped with Solr 4.10 is working correctly with this, but not the one for
Solr 6.

From: Reth RM 
Sent: Friday, December 16, 2016 12:45 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr MapReduce Indexer Tool is failing for empty core name.

The primary difference has been the move from standalone Solr to SolrCloud in
later versions, starting from Solr 4.0. And what happens if you try starting
Solr in stand-alone mode? SolrCloud does not consider 'core' anymore; it
considers 'collection' as the param.


On Thu, Dec 15, 2016 at 11:05 PM, Manan Sheth 
wrote:

> Thanks Reth. As noted this is the same map reduce based indexer tool that
> comes shipped with the solr distribution by default.
>
> It only take the zk_host details and extracts all required information
> from there only. It does not have core specific configurations. The same
> tool released with solr 4.10 distro is working correctly, it seems to be
> some issue/ changes from solr 5 onwards. I have tested it for both solr 5.5
> & solr 6.2.1 and the behaviour remains same for both.
>
> Thanks,
> Manan Sheth
> 
> From: Reth RM 
> Sent: Friday, December 16, 2016 12:21 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Solr MapReduce Indexer Tool is failing for empty core name.
>
> It looks like command line tool that you are using to initiate index
> process,  is expecting some name to solr-core with respective command line
> param. use -help on the command line tool that you are using and check the
> solr-core-name parameter key, pass that also with some value.
>
>
> On Tue, Dec 13, 2016 at 5:44 AM, Manan Sheth 
> wrote:
>
> > Hi All,
> >
> >
> > While working on a migration project from Solr 4 to Solr 6, I need to
> > reindex my data using Solr map reduce Indexer tool in offline mode with
> > avro data.
> >
> > While executing the map reduce indexer tool shipped with solr 6.2.1, it
> is
> > throwing error of cannot create core with empty name value. The solr
> > instances are running fine with new indexed are being added and modified
> > correctly. Below is the command that was being fired:
> >
> >
> > hadoop --config /etc/hadoop/conf jar /home/impadmin/solr-6.2.1/
> dist/solr-map-reduce-*.jar
> > -D 'mapred.child.java.opts=-Xmx500m' \
> >-libjars `echo /home/impadmin/solr6lib/*.jar | sed 's/ /,/g'`
> > --morphline-file /home/impadmin/app_quotes_morphline_actual.conf \
> >--zk-host 172.26.45.71:9984 --output-dir hdfs://
> > impetus-i0056.impetus.co.in:8020/user/impadmin/
> > MapReduceIndexerTool/output5 \
> >--collection app.quotes --log4j src/test/resources/log4j.
> properties
> > --verbose \
> >  "hdfs://impetus-i0056.impetus.co.in:8020/user/impadmin/
> > MapReduceIndexerTool/5d63e0f8-afc1-483e-bd3f-d508c885d794-00"
> >
> >
> > Below is the complete snapshot of error trace:
> >
> >
> > Failed to initialize record writer for org.apache.solr.hadoop.
> > MapReduceIndexerTool/MorphlineMapper, attempt_1479795440861_0343_r_
> > 00_0
> > at org.apache.solr.hadoop.SolrRecordWriter.(
> > SolrRecordWriter.java:128)
> > at org.apache.solr.hadoop.SolrOutputFormat.getRecordWriter(
> > SolrOutputFormat.java:163)
> > at org.apache.hadoop.mapred.ReduceTask$
> NewTrackingRecordWriter.
> > (ReduceTask.java:540)
> > at org.apache.hadoop.mapred.ReduceTask.runNewReducer(
> > ReduceTask.java:614)
> > at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389)
> > at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
> > at java.security.AccessController.doPrivileged(Native Method)
> > at javax.security.auth.Subject.doAs(Subject.java:422)
> > at org.apache.hadoop.security.UserGroupInformation.doAs(
> > UserGroupInformation.java:1709)
> > at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
> > Caused by: org.apache.solr.common.SolrException: Cannot create core with
> > empty name value
> > at org.apache.solr.core.CoreDescriptor.checkPropertyIsNotEmpty(
> > CoreDescriptor.java:280)
> > at org.apache.solr.core.CoreDescriptor.(
> CoreDescriptor.java:191)
> > at org.apache.solr.core.CoreContainer.create(CoreContainer.java:754)
> > at org.apache.solr.core.CoreContainer.create(CoreContainer.java:742)
> > at org.apache.solr.hadoop.SolrRecordWriter.createEmbeddedSolrServer(
> > SolrRecordWriter.java:163)
> > at org.apache.solr.hadoop.SolrRecordWriter.(
> SolrRecordWriter.java:121)
> > ... 9 more
> >
> > Additional points to note:
> >
> >
> >   *   The solrconfig and schema files are copied as is from Solr 4.
> >   *   Once collection is deployed, user can perform all operations on the
> > collection without any issue.
> >   *   The indexation process is working fine wi

Re: Solr on HDFS: increase in query time with increase in data

2016-12-16 Thread Piyush Kunal
I think 70 GB is too huge for a shard.
How much memory does the system have?
In case Solr does not have sufficient memory to hold the indexes, it will
use only the amount of memory defined in your Solr caches.

Even though you are on HDFS, Solr performance will be really bad if it has to
do disk I/O at query time.
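
On HDFS, the main memory knob is the HDFS block cache; below is a rough sketch
of the directoryFactory section of solrconfig.xml, with placeholder paths and
sizes (check the "Running Solr on HDFS" documentation for your version):

    <directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
      <str name="solr.hdfs.home">hdfs://namenode:8020/solr</str>
      <bool name="solr.hdfs.blockcache.enabled">true</bool>
      <!-- each slab is roughly 128 MB of off-heap cache; size to what the box can spare -->
      <int name="solr.hdfs.blockcache.slab.count">40</int>
      <bool name="solr.hdfs.blockcache.direct.memory.allocation">true</bool>
      <bool name="solr.hdfs.blockcache.read.enabled">true</bool>
      <str name="solr.hdfs.confdir">/etc/hadoop/conf</str>
    </directoryFactory>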

The best option for you is to shard it across at least 8-10 nodes and create
appropriate replicas according to your read traffic.

Regards,
Piyush

On Fri, Dec 16, 2016 at 12:15 PM, Reth RM  wrote:

> I think the shard index size is huge and should be split.
>
> On Wed, Dec 14, 2016 at 10:58 AM, Chetas Joshi 
> wrote:
>
> > Hi everyone,
> >
> > I am running Solr 5.5.0 on HDFS. It is a solrCloud of 50 nodes and I have
> > the following config.
> > maxShardsperNode: 1
> > replicationFactor: 1
> >
> > I have been ingesting data into Solr for the last 3 months. With increase
> > in data, I am observing increase in the query time. Currently the size of
> > my indices is 70 GB per shard (i.e. per node).
> >
> > I am using cursor approach (/export handler) using SolrJ client to get
> back
> > results from Solr. All the fields I am querying on and all the fields
> that
> > I get back from Solr are indexed and have docValues enabled as well. What
> > could be the reason behind increase in query time?
> >
> > Has this got something to do with the OS disk cache that is used for
> > loading the Solr indices? When a query is fired, will Solr wait for all
> > (70GB) of disk cache being available so that it can load the index file?
> >
> > Thnaks!
> >
>


Re: Getting Error - Session expired for /collections/sprod/state.json

2016-12-16 Thread Piyush Kunal
Has anyone noticed such an issue before?

On Thu, Dec 15, 2016 at 4:36 PM, Piyush Kunal 
wrote:

> This is happening when heavy indexing like 100/second is going on.
>
> On Thu, Dec 15, 2016 at 4:33 PM, Piyush Kunal 
> wrote:
>
>> - We have solr6.1.0 cluster running on production with 1 shard and 5
>> replicas.
>> - Zookeeper quorum on 3 nodes.
>> - Using a chroot in zookeeper to segregate the configs from other
>> collections.
>> - Using solrj5.1.0 as our client to query solr.
>>
>>
>>
>> Usually things work fine but on and off we witness this exception coming
>> up:
>> =
>> org.apache.solr.common.SolrException: Could not load collection from
>> ZK:sprod
>> at org.apache.solr.common.cloud.ZkStateReader.getCollectionLive
>> (ZkStateReader.java:815)
>> at org.apache.solr.common.cloud.ZkStateReader$5.get(ZkStateRead
>> er.java:477)
>> at org.apache.solr.client.solrj.impl.CloudSolrClient.getDocColl
>> ection(CloudSolrClient.java:1174)
>> at org.apache.solr.client.solrj.impl.CloudSolrClient.requestWit
>> hRetryOnStaleState(CloudSolrClient.java:807)
>> at org.apache.solr.client.solrj.impl.CloudSolrClient.request(Cl
>> oudSolrClient.java:782)
>> --
>> Caused by: org.apache.zookeeper.KeeperException$SessionExpiredException:
>> KeeperErrorCode = Session expired for /collections/sprod/state.json
>> at org.apache.zookeeper.KeeperException.create(KeeperException.
>> java:127)
>> at org.apache.zookeeper.KeeperException.create(KeeperException.
>> java:51)
>> at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1045)
>> at org.apache.solr.common.cloud.SolrZkClient$5.execute(SolrZkCl
>> ient.java:311)
>> at org.apache.solr.common.cloud.SolrZkClient$5.execute(SolrZkCl
>> ient.java:308)
>> at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(Zk
>> CmdExecutor.java:61)
>> at org.apache.solr.common.cloud.SolrZkClient.exists(SolrZkClien
>> t.java:308)
>> --
>> org.apache.solr.common.SolrException: Could not load collection from
>> ZK:sprod
>> at org.apache.solr.common.cloud.ZkStateReader.getCollectionLive
>> (ZkStateReader.java:815)
>> at org.apache.solr.common.cloud.ZkStateReader$5.get(ZkStateRead
>> er.java:477)
>> at org.apache.solr.client.solrj.impl.CloudSolrClient.getDocColl
>> ection(CloudSolrClient.java:1174)
>> at org.apache.solr.client.solrj.impl.CloudSolrClient.requestWit
>> hRetryOnStaleState(CloudSolrClient.java:807)
>> at org.apache.solr.client.solrj.impl.CloudSolrClient.request(Cl
>> oudSolrClient.java:782)
>> --
>> Caused by: org.apache.zookeeper.KeeperException$SessionExpiredException:
>> KeeperErrorCode = Session expired for /collections/sprod/state.json
>> at org.apache.zookeeper.KeeperException.create(KeeperException.
>> java:127)
>> at org.apache.zookeeper.KeeperException.create(KeeperException.
>> java:51)
>> at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1045)
>> at org.apache.solr.common.cloud.SolrZkClient$5.execute(SolrZkCl
>> ient.java:311)
>> at org.apache.solr.common.cloud.SolrZkClient$5.execute(SolrZkCl
>> ient.java:308)
>> at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(Zk
>> CmdExecutor.java:61)
>> at org.apache.solr.common.cloud.SolrZkClient.exists(SolrZkClien
>> t.java:308)
>> =
>>
>>
>>
>>
>>
>> This is our zoo.cfg:
>> ==
>> tickTime=2000
>> dataDir=/var/lib/zookeeper
>> clientPort=2181
>> initLimit=5
>> syncLimit=2
>> server.1=192.168.70.27:2888:3888
>> server.2=192.168.70.64:2889:3889
>> server.3=192.168.70.26:2889:3889
>> maxClientCnxns=300
>> maxSessionTimeout=9
>> ===
>>
>>
>>
>>
>>
>> This is our solr.xml on server side
>> ===
>>
>> 
>>
>>   
>>
>> ${host:}
>> ${jetty.port:8983}
>> ${hostContext:solr}
>>
>> ${genericCoreNodeNames:true}
>>
>> ${zkClientTimeout:3}
>> ${distribUpdateSoTimeout:60}
>> > name="distribUpdateConnTimeout">${distribUpdateConnTimeout:6}
>> > name="zkCredentialsProvider">${zkCredentialsProvider:org.apache.solr.common.cloud.DefaultZkCredentialsProvider}
>> > name="zkACLProvider">${zkACLProvider:org.apache.solr.common.cloud.DefaultZkACLProvider}
>>
>>   
>>
>>   > class="HttpShardHandlerFactory">
>> ${socketTimeout:60}
>> ${connTimeout:6}
>>   
>> 
>>
>> ===
>>
>>
>>
>>
>> Any help appreciated.
>>
>> Regards,
>> Piyush
>>
>
>


Max vertical scaling in your experience ? (1 instance/server)

2016-12-16 Thread Dorian Hoxha
Hello searchers,

I'm researching solr for a project that would require a max-inserts(10M/s)
and some heavy facet+fq on top of that, though on low qps.

And I'm trying to find blogs/slides where people have used some big
machines instead of hundreds of small ones.

1. Largest I've found is this

with 16cores + 384GB ram but they were using 25! solr4 instances / server
which seems wasteful to me ?

I know that 1 solr can have max ~29-30GB heap because GC is wasteful/sucks
after that, and you should leave the other amount to the os for file-cache.
2. But do you think 1 instance will be able to fully-use a 256GB/20core
machine ?

3. Like to share your findings/links with big-machine clusters ?

Thank You