(Newbie Help!) Seeking guidance in regards to Solr's suggestor and others

2016-12-12 Thread KV Medmeme
Hi Friends,

I'm new to Solr; I've been working with it for the past 2-3 months, trying to
really get my feet wet so that I can transition the current search engine at
my job (eww, Sphinx, haha) over to Solr. Anyway, I need some help. I've been
running around the net getting my suggester working and I'm stuck. This is
what I have so far (I'll explain after the links to the config files).

here is a link to my managed-schema.xml
http://pastebin.com/MiEWwESP

solrconfig.xml
http://pastebin.com/fq2yxbvp

I am currently using Solr 6.2.1. My issue is this:

I am trying to build a suggester that suggests search terms or phrases based
on the index that is in memory. I was playing around with the analyzers and
tokenizers, as well as reading some fairly old books that touch on Solr 4, and
I came up with this set of tokenizers and analyzer chain. Please correct it if
it's wrong. My index contains medical abstracts published by doctors, and the
terms I really need to search for are things like "brain cancer",
"anti-inflammatory", and "hiv-1", so you can see where I'm going with this. I
need to preserve the whitespace and use some sort of hyphen delimiter. After I
worked that out (now here comes the fun part),

I hit this URL to build the suggester:

http://localhost:8983/solr/AbstractSuggest/suggest/?spellcheck.build=true

Then, once it's built, I query:

http://localhost:8983/solr/AbstractSuggest/suggest/?spellcheck.q=suggest_field:%22anti-infl%22

That works great: I can see the collations, which is what I want to show in
the dropdown search bar when clients search these medical articles. Now for
PHP (I use the Solarium API to talk to Solr): since this is a website and I
intend to make an AJAX call to PHP, I need the same thing there, but I cannot
see the collation list. Solarium fails on hyphenated terms and also fails to
build the collations list. For example, if I type in

"brain canc" ( i want to search brain cancer)

it auto-suggests "brain", then "cancer", but nothing is shown in collations.
If I send the same request to the URL (a localhost URL, which will change when
we move to the production environment) I can see the collations. Screenshots
are here:

brain can (url) -> https://gyazo.com/30a9d11e4b9b73b0768a12d342223dc3

brain canc (solarium) -> https://gyazo.com/507b02e50d0e39d7daa96655dff83c76
php code -> https://gyazo.com/1d2b8c90013784d7cde5301769cd230c

So here is where I am: the ideal goal is to have the PHP API produce the
same results as the URL, so that when users type into the search bar they
can see the collations.
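
For reference, here is a minimal SolrJ sketch (hypothetical; the core, handler,
and field names are taken from the URLs above) that issues the same request and
prints any collations. It can be handy for confirming that the Solr side
behaves correctly, independently of the Solarium layer:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.client.solrj.response.SpellCheckResponse;

public class SuggestCheck {
  public static void main(String[] args) throws Exception {
    // Core and handler names taken from the URLs above
    HttpSolrClient client = new HttpSolrClient.Builder(
        "http://localhost:8983/solr/AbstractSuggest").build();

    SolrQuery q = new SolrQuery();
    q.setRequestHandler("/suggest");                       // handler defined in solrconfig.xml
    q.set("spellcheck.q", "suggest_field:\"anti-infl\"");  // same query as the browser URL

    QueryResponse rsp = client.query(q);
    SpellCheckResponse sc = rsp.getSpellCheckResponse();
    if (sc != null && sc.getCollatedResults() != null) {
      for (SpellCheckResponse.Collation c : sc.getCollatedResults()) {
        System.out.println("collation: " + c.getCollationQueryString());
      }
    }
    client.close();
  }
}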

 Can someone please help? I'm looking to the community as the savior of all
my problems. I also want to learn about Solr at the same time, so that if
future problems pop up I can solve them accordingly.

Thanks!
Happy Holidays
Kevin.


Re: Does sharding improve or degrade performance?

2016-12-12 Thread Shawn Heisey
On 12/12/2016 1:14 PM, Piyush Kunal wrote:
> We did the following change:
>
> 1. Previously we had 1 shard and 32 replicas for 1.2million documents of
> size 5 GB.
> 2. We changed it to 4 shards and 8 replicas for 1.2 million documents of
> size 5GB

How many machines and shards per machine were you running in both
situations?  For either setup, I would recommend at least 32 machines,
where each one handles exactly one shard replica.  For the latter setup,
you may need even more machines, so that each shard gets more replicas.

> We have a combined RPM of around 20k rpm for solr.

Twenty thousand queries per minute is over 300 per second.  This is a
very high query rate, which is going to require many replicas.  Your
replica count has gone down significantly with the change you made.

> But unfortunately we saw a degrade in performance with RTs going insanely
> high when we moved to setup 2.

With such a high query rate, I'm not really surprised that this caused
the performance to go down, even if you actually do have 32 machines. 
Distributed queries are a two-phase process where the coordinating node
sends individual queries to each shard to find out how to sort the
sub-results into one final result, and then sends a second query to
relevant shards to request the individual documents it needs for the
result.  The total number of individual queries goes up significantly.

Before, Solr was doing one query for one result.  Now it is doing between
five and nine queries for one result (the initial query from your client
to the coordinating node, the first query to each of the four shards,
and then a possible second query to each shard).  If the number of
search hits is more than zero, it will be at least six queries.  This is
why one shard is preferred for high query rates if you can fit the whole
index into one shard.  Five gigabytes is a pretty small Solr index.

Sharding is most effective when the query rate is low, because Solr can
take advantage of idle CPUs.  It makes it possible to have a much larger
index.  A high query rate means that there are no idle CPUs.

Thanks,
Shawn



Re: Does sharding improve or degrade performance?

2016-12-12 Thread Erick Erickson
Sharding adds inevitable overhead. In particular,
each request, rather than being serviced on a
single replica, has to send out a first request
to each shard, get the IDs and sort criteria back,
and then send out a second request to get the actual docs.

Especially if you're asking for a lot of rows this can get
very expensive. And you're now serving your queries
on 1/4 of the machines. In the first setup, an incoming
request was completely serviced on 1 node. Now you're
requiring 4 nodes to participate.

Sharding is always a second choice and always has
some overhead. As long as your QTimes are
acceptable, you should stick with a single shard.

Best,
Erick

On Mon, Dec 12, 2016 at 12:14 PM, Piyush Kunal  wrote:
> All our shards and replicas reside on different machines with 16GB RAM and
> 4 cores.
>
> On Tue, Dec 13, 2016 at 1:44 AM, Piyush Kunal 
> wrote:
>
>> We did the following change:
>>
>> 1. Previously we had 1 shard and 32 replicas for 1.2million documents of
>> size 5 GB.
>> 2. We changed it to 4 shards and 8 replicas for 1.2 million documents of
>> size 5GB
>>
>> We have a combined RPM of around 20k rpm for solr.
>>
>> But unfortunately we saw a degrade in performance with RTs going insanely
>> high when we moved to setup 2.
>>
>> What could be probable reasons and how it can be fixed?
>>


Re: How to check optimized or disk free status via solrj for a particular collection?

2016-12-12 Thread Erick Erickson
bq: We are indexing with autocommit at 30 minutes

OK, check the size of your tlogs. What this means is that all the
updates accumulate for 30 minutes in a single tlog. That tlog will be
closed when autocommit happens and a new one opened for the
next 30 minutes. The first tlog won't be purged until the second one
is closed. All this is detailed in the link I provided.

If the tlogs are significant in size this may be the entire problem.

Best,
Erick

On Mon, Dec 12, 2016 at 12:46 PM, Susheel Kumar  wrote:
> One option:
>
> First, you may want to purge all documents before the full re-index, so that
> you don't need to run optimize, unless you need the old data to keep serving
> queries at the same time.
>
> I think you are running out of space because your 43 million documents may be
> consuming 30% of total disk space, and when you re-index, total disk usage
> goes to 60%.  If you then run optimize, it may temporarily require roughly
> that much again, another 60% of disk, taking you to 120%, which causes the
> out-of-disk condition.
>
> The other option is to increase disk space if you want to run optimize at
> the end.
>
>
> On Mon, Dec 12, 2016 at 3:36 PM, Michael Joyner  wrote:
>
>> We are having an issue with running out of space when trying to do a full
>> re-index.
>>
>> We are indexing with autocommit at 30 minutes.
>>
>> We have it set to only optimize at the end of an indexing cycle.
>>
>>
>>
>> On 12/12/2016 02:43 PM, Erick Erickson wrote:
>>
>>> First off, optimize is actually rarely necessary. I wouldn't bother
>>> unless you have measurements to prove that it's desirable.
>>>
>>> I would _certainly_ not call optimize every 10M docs. If you must call
>>> it at all call it exactly once when indexing is complete. But see
>>> above.
>>>
>>> As far as the commit, I'd just set the autocommit settings in
>>> solrconfig.xml to something "reasonable" and forget it. I usually use
>>> time rather than doc count as it's a little more predictable. I often
>>> use 60 seconds, but it can be longer. The longer it is, the bigger
>>> your tlog will grow and if Solr shuts down forcefully the longer
>>> replaying may take. Here's the whole writeup on this topic:
>>>
>>> https://lucidworks.com/blog/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
>>>
>>> Running out of space during indexing with about 30% utilization is
>>> very odd. My guess is that you're trying to take too much control.
>>> Having multiple optimizations going on at once would be a very good
>>> way to run out of disk space.
>>>
>>> And I'm assuming one replica's index per disk or you're reporting
>>> aggregate index size per disk when you say 30%. Having three replicas
>>> on the same disk each consuming 30% is A Bad Thing.
>>>
>>> Best,
>>> Erick
>>>
>>> On Mon, Dec 12, 2016 at 8:36 AM, Michael Joyner 
>>> wrote:
>>>
 Halp!

 I need to reindex over 43 millions documents, when optimized the
 collection
 is currently < 30% of disk space, we tried it over this weekend and it
 ran
 out of space during the reindexing.

 I'm thinking for the best solution for what we are trying to do is to
 call
 commit/optimize every 10,000,000 documents or so and then wait for the
 optimize to complete.

 How to check optimized status via solrj for a particular collection?

 Also, is there is a way to check free space per shard by collection?

 -Mike


>>


Re: OOMs in Solr

2016-12-12 Thread Erick Erickson
bq: ...so I wonder if reducing the heap is going to help or it won’t
matter that much...

Well, if you're hitting OOM errors then you have no _choice_ but to
reduce the heap. Or increase the memory. And you don't have much
physical memory to grow into.

Longer term, reducing the JVM heap size (assuming you can do so without
hitting OOM errors) is always to the good. The more heap, the more GC you
have, and the longer stop-the-world GC pauses will take. The OS's memory
management is vastly more efficient (because it's simpler) than Java's GC.

Note, however, that this is "more art than science". I've seen situations
where the JVM requires very close to the max heap size at some point.
From there I've seen situations where the GC kicks in and recovers
just enough memory to continue for a few milliseconds and then go
right back into a GC cycle. So you need some overhead.

Or are you talking about SSDs for the OS to use for swapping? Assuming
you're swapping, we're talking about query response time here, and SSDs
will be much faster if you're swapping. But you _really_ want to
strive to _not_ swap. SSD access is faster than spinning disk for
sure, but still vastly slower than RAM access.

I applaud you changing one thing at a time BTW. You probably want to
use GCViewer or similar on the GC logs (turn them on first!) for Solr
for a quick take on how GC is performing when you test.

And the one other thing I'd do: mine your Solr (or servlet container)
logs for the real queries over one of these periods. Then use
something like jmeter (or a roll-your-own test program) to fire them at
your test instance to evaluate the effects of your changes.
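
A roll-your-own replayer can be very small; here is a minimal SolrJ sketch
under the assumption that the mined queries are stored one per line in a plain
text file (queries.txt and the core URL are placeholders):

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class QueryReplayer {
  public static void main(String[] args) throws Exception {
    // queries.txt: one q= value per line, mined from the Solr/servlet container logs
    List<String> queries = Files.readAllLines(Paths.get("queries.txt"));
    try (HttpSolrClient client =
             new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build()) {
      for (String q : queries) {
        long start = System.nanoTime();
        long found = client.query(new SolrQuery(q)).getResults().getNumFound();
        long ms = (System.nanoTime() - start) / 1_000_000;
        System.out.println(ms + " ms, " + found + " hits: " + q);
      }
    }
  }
}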

Best,
Erick



On Mon, Dec 12, 2016 at 1:03 PM, Alfonso Muñoz-Pomer Fuentes
 wrote:
> According to the post you linked to, it strongly advises to buy SSDs. I got
> in touch with the systems department in my organization and it turns out
> that our VM storage is SSD-backed, so I wonder if reducing the heap is going
> to help or it won’t matter that much. Of course, there’s nothing like trying
> it and checking out the results. I’ll do that in due time, though. At the moment
> I’ve reduced the filter cache and will change all parameters one at a time
> to see what affects performance the most.
>
> Thanks again for the feedback.
>
> On 12/12/2016 19:36, Erick Erickson wrote:
>>
>> The biggest bang for the buck is _probably_ docValues for the fields
>> you facet on. If that's the culprit, you can also reduce your JVM heap
>> considerably, as Toke says, leaving this little memory for the OS is
>> bad. Here's the writeup on why:
>> http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
>>
>> Roughly what's happening is that all the values you facet on have to
>> be read into memory somewhere. docvalues puts almost all of that into
>> the OS memory rather than JVM heap. It's much faster to load, reduces
>> JVM GC pressure, OOMs, and allows the pages to be swapped out.
>>
>> However, this is somewhat pushing the problem around. Moving the
>> memory consumption to the OS memory space will have a huge impact on
>> your OOM errors but the cost will be that you'll probably start
>> swapping pages out of the OS memory, which will impact search speed.
>> Slower searches are preferable to OOMs, certainly. That said you'll
>> probably need more physical memory at some point, or go to SolrCloud
>> or
>>
>> Best,
>> Erick
>>
>> On Mon, Dec 12, 2016 at 10:57 AM, Susheel Kumar 
>> wrote:
>>>
>>> Double check if your queries are not running into deep pagination
>>> (q=*:*...&start=).  This is something i recently
>>> experienced
>>> and was the only cause of OOM.  You may have the gc logs when OOM
>>> happened
>>> and drawing it on GC Viewer may give insight how gradual your heap got
>>> filled and run into OOM.
>>>
>>> Thanks,
>>> Susheel
>>>
>>> On Mon, Dec 12, 2016 at 10:32 AM, Alfonso Muñoz-Pomer Fuentes <
>>> amu...@ebi.ac.uk> wrote:
>>>
 Thanks again.

 I’m learning more about Solr in this thread than in my previous months
 reading about it!

 Moving to Solr Cloud is a possibility we’ve discussed and I guess it
 will
 eventually happen, as the index will grow no matter what.

 I’ve already lowered filterCache from 512 to 64 and I’m looking forward
 to
 seeing what happens in the next few days. Our filter cache hit ratio was
 0.99, so I would expect this to go down but if we can have a more
 efficient memory usage I think e.g. an extra second for each search is
 still acceptable.

 Regarding the startup scripts we’re using the ones included with Solr.

 As for the use of filters we’re always using the same four filters,
 IIRC.
 In any case we’ll review the code to ensure that that’s the case.

 I’m aware of the need to reindex when the schema changes, but thanks for
 the reminder. We’ll add docValues because I think that’ll make a
 significant difference in our case. We’ll also try to leave s

Re: OOMs in Solr

2016-12-12 Thread Alfonso Muñoz-Pomer Fuentes
According to the post you linked to, it strongly advises to buy SSDs. I 
got in touch with the systems department in my organization and it turns 
out that our VM storage is SSD-backed, so I wonder if reducing the heap 
is going to help or it won’t matter that much. Of course, there’s
nothing like trying it and checking out the results. I’ll do that in due time,
though. At the moment I’ve reduced the filter cache and will change all 
parameters one at a time to see what affects performance the most.


Thanks again for the feedback.

On 12/12/2016 19:36, Erick Erickson wrote:

The biggest bang for the buck is _probably_ docValues for the fields
you facet on. If that's the culprit, you can also reduce your JVM heap
considerably, as Toke says, leaving this little memory for the OS is
bad. Here's the writeup on why:
http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html

Roughly what's happening is that all the values you facet on have to
be read into memory somewhere. docvalues puts almost all of that into
the OS memory rather than JVM heap. It's much faster to load, reduces
JVM GC pressure, OOMs, and allows the pages to be swapped out.

However, this is somewhat pushing the problem around. Moving the
memory consumption to the OS memory space will have a huge impact on
your OOM errors but the cost will be that you'll probably start
swapping pages out of the OS memory, which will impact search speed.
Slower searches are preferable to OOMs, certainly. That said you'll
probably need more physical memory at some point, or go to SolrCloud
or

Best,
Erick

On Mon, Dec 12, 2016 at 10:57 AM, Susheel Kumar  wrote:

Double-check whether your queries are running into deep pagination
(q=*:*...&start=).  This is something I recently experienced,
and it was the only cause of the OOM.  You may have the GC logs from when the
OOM happened, and plotting them in GCViewer may give insight into how
gradually your heap filled up and ran into OOM.

Thanks,
Susheel

On Mon, Dec 12, 2016 at 10:32 AM, Alfonso Muñoz-Pomer Fuentes <
amu...@ebi.ac.uk> wrote:


Thanks again.

I’m learning more about Solr in this thread than in my previous months
reading about it!

Moving to Solr Cloud is a possibility we’ve discussed and I guess it will
eventually happen, as the index will grow no matter what.

I’ve already lowered filterCache from 512 to 64 and I’m looking forward to
seeing what happens in the next few days. Our filter cache hit ratio was
0.99, so I would expect this to go down but if we can have a more
efficient memory usage I think e.g. an extra second for each search is
still acceptable.

Regarding the startup scripts we’re using the ones included with Solr.

As for the use of filters we’re always using the same four filters, IIRC.
In any case we’ll review the code to ensure that that’s the case.

I’m aware of the need to reindex when the schema changes, but thanks for
the reminder. We’ll add docValues because I think that’ll make a
significant difference in our case. We’ll also try to leave space for the
disk cache as we’re using spinning disk storage.

Thanks again to everybody for the useful and insightful replies.

Alfonso


On 12/12/2016 14:12, Shawn Heisey wrote:


On 12/12/2016 3:13 AM, Alfonso Muñoz-Pomer Fuentes wrote:


I’m writing because in our web application we’re using Solr 5.1.0 and
currently we’re hosting it on a VM with 32 GB of RAM (of which 30 are
dedicated to Solr and nothing else is running there). We have four
cores, that are this size:
- 25.56 GB, Num Docs = 57,860,845
- 12.09 GB, Num Docs = 173,491,631

(The other two cores are about 10 MB, 20k docs)



An OOM indicates that a Java application is requesting more memory than
it has been told it can use. There are only two remedies for OOM errors:
Increase the heap, or make the program use less memory.  In this email,
I have concentrated on ways to reduce the memory requirements.

These index sizes and document counts are relatively small to Solr -- as
long as you have enough memory and are smart about how it's used.

Solr 5.1.0 comes with GC tuning built into the startup scripts, using
some well-tested CMS settings.  If you are using those startup scripts,
then the parallel collector will NOT be default.  No matter what
collector is in use, it cannot fix OOM problems.  It may change when and
how frequently they occur, but it can't do anything about them.

We aren’t indexing on this machine, and we’re getting OOM relatively

quickly (after about 14 hours of regular use). Right now we have a
Cron job that restarts Solr every 12 hours, so it’s not pretty. We use
faceting quite heavily and mostly as a document storage server (we
want full data sets instead of the n most relevant results).



Like Toke, I suspect two things: a very large filterCache, and the heavy
facet usage, maybe both.  Enabling docValues on the fields you're using
for faceting and reindexing will make the latter more memory efficient,
and likely faster.  Reducing the filterCache size would help the
forme

Re: How to check optimized or disk free status via solrj for a particular collection?

2016-12-12 Thread Susheel Kumar
One option:

First, you may want to purge all documents before the full re-index, so that
you don't need to run optimize, unless you need the old data to keep serving
queries at the same time.

I think you are running out of space because your 43 million documents may be
consuming 30% of total disk space, and when you re-index, total disk usage
goes to 60%.  If you then run optimize, it may temporarily require roughly
that much again, another 60% of disk, taking you to 120%, which causes the
out-of-disk condition.

The other option is to increase disk space if you want to run optimize at
the end.


On Mon, Dec 12, 2016 at 3:36 PM, Michael Joyner  wrote:

> We are having an issue with running out of space when trying to do a full
> re-index.
>
> We are indexing with autocommit at 30 minutes.
>
> We have it set to only optimize at the end of an indexing cycle.
>
>
>
> On 12/12/2016 02:43 PM, Erick Erickson wrote:
>
>> First off, optimize is actually rarely necessary. I wouldn't bother
>> unless you have measurements to prove that it's desirable.
>>
>> I would _certainly_ not call optimize every 10M docs. If you must call
>> it at all call it exactly once when indexing is complete. But see
>> above.
>>
>> As far as the commit, I'd just set the autocommit settings in
>> solrconfig.xml to something "reasonable" and forget it. I usually use
>> time rather than doc count as it's a little more predictable. I often
>> use 60 seconds, but it can be longer. The longer it is, the bigger
>> your tlog will grow and if Solr shuts down forcefully the longer
>> replaying may take. Here's the whole writeup on this topic:
>>
>> https://lucidworks.com/blog/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
>>
>> Running out of space during indexing with about 30% utilization is
>> very odd. My guess is that you're trying to take too much control.
>> Having multiple optimizations going on at once would be a very good
>> way to run out of disk space.
>>
>> And I'm assuming one replica's index per disk or you're reporting
>> aggregate index size per disk when you say 30%. Having three replicas
>> on the same disk each consuming 30% is A Bad Thing.
>>
>> Best,
>> Erick
>>
>> On Mon, Dec 12, 2016 at 8:36 AM, Michael Joyner 
>> wrote:
>>
>>> Halp!
>>>
>>> I need to reindex over 43 millions documents, when optimized the
>>> collection
>>> is currently < 30% of disk space, we tried it over this weekend and it
>>> ran
>>> out of space during the reindexing.
>>>
>>> I'm thinking for the best solution for what we are trying to do is to
>>> call
>>> commit/optimize every 10,000,000 documents or so and then wait for the
>>> optimize to complete.
>>>
>>> How to check optimized status via solrj for a particular collection?
>>>
>>> Also, is there is a way to check free space per shard by collection?
>>>
>>> -Mike
>>>
>>>
>


Re: How to check optimized or disk free status via solrj for a particular collection?

2016-12-12 Thread Michael Joyner
We are having an issue with running out of space when trying to do a 
full re-index.


We are indexing with autocommit at 30 minutes.

We have it set to only optimize at the end of an indexing cycle.


On 12/12/2016 02:43 PM, Erick Erickson wrote:

First off, optimize is actually rarely necessary. I wouldn't bother
unless you have measurements to prove that it's desirable.

I would _certainly_ not call optimize every 10M docs. If you must call
it at all call it exactly once when indexing is complete. But see
above.

As far as the commit, I'd just set the autocommit settings in
solrconfig.xml to something "reasonable" and forget it. I usually use
time rather than doc count as it's a little more predictable. I often
use 60 seconds, but it can be longer. The longer it is, the bigger
your tlog will grow and if Solr shuts down forcefully the longer
replaying may take. Here's the whole writeup on this topic:

https://lucidworks.com/blog/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/

Running out of space during indexing with about 30% utilization is
very odd. My guess is that you're trying to take too much control.
Having multiple optimizations going on at once would be a very good
way to run out of disk space.

And I'm assuming one replica's index per disk or you're reporting
aggregate index size per disk when you say 30%. Having three replicas
on the same disk each consuming 30% is A Bad Thing.

Best,
Erick

On Mon, Dec 12, 2016 at 8:36 AM, Michael Joyner  wrote:

Halp!

I need to reindex over 43 millions documents, when optimized the collection
is currently < 30% of disk space, we tried it over this weekend and it ran
out of space during the reindexing.

I'm thinking for the best solution for what we are trying to do is to call
commit/optimize every 10,000,000 documents or so and then wait for the
optimize to complete.

How to check optimized status via solrj for a particular collection?

Also, is there is a way to check free space per shard by collection?

-Mike





Re: Does sharding improve or degrade performance?

2016-12-12 Thread Piyush Kunal
All our shards and replicas reside on different machines with 16GB RAM and
4 cores.

On Tue, Dec 13, 2016 at 1:44 AM, Piyush Kunal 
wrote:

> We did the following change:
>
> 1. Previously we had 1 shard and 32 replicas for 1.2million documents of
> size 5 GB.
> 2. We changed it to 4 shards and 8 replicas for 1.2 million documents of
> size 5GB
>
> We have a combined RPM of around 20k rpm for solr.
>
> But unfortunately we saw a degrade in performance with RTs going insanely
> high when we moved to setup 2.
>
> What could be probable reasons and how it can be fixed?
>


Does sharding improve or degrade performance?

2016-12-12 Thread Piyush Kunal
We did the following change:

1. Previously we had 1 shard and 32 replicas for 1.2million documents of
size 5 GB.
2. We changed it to 4 shards and 8 replicas for 1.2 million documents of
size 5GB

We have a combined RPM of around 20k rpm for solr.

But unfortunately we saw a degrade in performance with RTs going insanely
high when we moved to setup 2.

What could be probable reasons and how it can be fixed?


Re: How to check optimized or disk free status via solrj for a particular collection?

2016-12-12 Thread Susheel Kumar
How much difference is there between the two parameters below on your Solr
stats screen?  For example, in our case we have very frequent updates, which
over time results in max docs = num docs x 2, and in that case I have seen
optimization help query performance.  Unless you have a huge difference,
optimization may not be necessary.

Num Docs: 39,183,404    Max Doc: 78,056,265
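
For the "via SolrJ" part of the original question, one way to read these
numbers programmatically is the /admin/luke handler, which reports numDocs,
maxDoc, deletedDocs, and segmentCount (a segmentCount of 1 generally means the
index is fully optimized). A minimal sketch, assuming SolrJ 6.x and a
placeholder core name:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.util.NamedList;

public class IndexStatus {
  public static void main(String[] args) throws Exception {
    try (HttpSolrClient client =
             new HttpSolrClient.Builder("http://localhost:8983/solr/my_collection").build()) {
      SolrQuery q = new SolrQuery();
      q.setRequestHandler("/admin/luke"); // Luke handler reports index-level stats
      q.set("show", "index");
      NamedList<Object> index =
          (NamedList<Object>) client.query(q).getResponse().get("index");
      System.out.println("numDocs      = " + index.get("numDocs"));
      System.out.println("maxDoc       = " + index.get("maxDoc"));
      System.out.println("deletedDocs  = " + index.get("deletedDocs"));
      System.out.println("segmentCount = " + index.get("segmentCount"));
    }
  }
}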



Thanks,
Susheel


On Mon, Dec 12, 2016 at 2:43 PM, Erick Erickson 
wrote:

> First off, optimize is actually rarely necessary. I wouldn't bother
> unless you have measurements to prove that it's desirable.
>
> I would _certainly_ not call optimize every 10M docs. If you must call
> it at all call it exactly once when indexing is complete. But see
> above.
>
> As far as the commit, I'd just set the autocommit settings in
> solrconfig.xml to something "reasonable" and forget it. I usually use
> time rather than doc count as it's a little more predictable. I often
> use 60 seconds, but it can be longer. The longer it is, the bigger
> your tlog will grow and if Solr shuts down forcefully the longer
> replaying may take. Here's the whole writeup on this topic:
>
> https://lucidworks.com/blog/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
>
> Running out of space during indexing with about 30% utilization is
> very odd. My guess is that you're trying to take too much control.
> Having multiple optimizations going on at once would be a very good
> way to run out of disk space.
>
> And I'm assuming one replica's index per disk or you're reporting
> aggregate index size per disk when you say 30%. Having three replicas
> on the same disk each consuming 30% is A Bad Thing.
>
> Best,
> Erick
>
> On Mon, Dec 12, 2016 at 8:36 AM, Michael Joyner 
> wrote:
> > Halp!
> >
> > I need to reindex over 43 millions documents, when optimized the
> collection
> > is currently < 30% of disk space, we tried it over this weekend and it
> ran
> > out of space during the reindexing.
> >
> > I'm thinking for the best solution for what we are trying to do is to
> call
> > commit/optimize every 10,000,000 documents or so and then wait for the
> > optimize to complete.
> >
> > How to check optimized status via solrj for a particular collection?
> >
> > Also, is there is a way to check free space per shard by collection?
> >
> > -Mike
> >
>


Re: regex-urlfilter help

2016-12-12 Thread KRIS MUSSHORN

sorry my mistake.. sent to wrong list. 
  
- Original Message -

From: "Shawn Heisey"  
To: solr-user@lucene.apache.org 
Sent: Monday, December 12, 2016 2:36:26 PM 
Subject: Re: regex-urlfilter help 

On 12/12/2016 12:19 PM, KRIS MUSSHORN wrote: 
> I'm using nutch 1.12 and Solr 5.4.1. 
>   
> Crawling a website and indexing into nutch. 
>   
> AFAIK the regex-urlfilter.txt file will cause content to not be crawled.. 
>   
> what if I have 
> https:///inside/default.cfm  as my seed url... 
> I want the links on this page to be crawled and indexed but I do not want 
> this page to be indexed into SOLR. 
> How would I set this up? 
>   
> I'm thinking that the regex-urlfilter.txt file is NOT the right place. 

These sound like questions about how to configure Nutch.  This is a Solr 
mailing list.  Nutch is a completely separate Apache product with its 
own mailing list.  Although there may be people here who do use Nutch, 
it's not the purpose of this list.  Please use support resources for Nutch. 

http://nutch.apache.org/mailing_lists.html 

I'm reasonably certain that this cannot be controlled by Solr's 
configuration.  Solr will index anything that is sent to it, so the 
choice of what to send or not send in this situation will be decided by 
Nutch. 

Thanks, 
Shawn 




error diagnosis help.

2016-12-12 Thread KRIS MUSSHORN
I've scoured my Nutch and Solr config files and I can't find any cause.
Suggestions?
Monday, December 12, 2016 2:37:13 PM  ERROR  null  RequestHandlerBase
org.apache.solr.common.SolrException: Unexpected character '&' (code 38) in epilog; expected '<'
org.apache.solr.common.SolrException: Unexpected character '&' (code 38) in epilog; expected '<'
 at [row,col {unknown-source}]: [1,36]
    at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:180)
    at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:95)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:70)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:156)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:2073)
    at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:658)
    at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:457)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:223)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:181)
    at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
    at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577)
    at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
    at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
    at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
    at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:110)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
    at org.eclipse.jetty.server.Server.handle(Server.java:499)
    at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:310)
    at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)
    at org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:540)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
    at java.lang.Thread.run(Thread.java:745)



Setting Shard Count at Initial Startup of SolrCloud

2016-12-12 Thread Furkan KAMACI
Hi,

I have an external ZooKeeper, and I don't want to run SolrCloud in test mode. I
upload the configs to ZooKeeper:

server/scripts/cloud-scripts/zkcli.sh -zkhost localhost:2181 -cmd upconfig
-confdir server/solr/my_collection/conf -confname my_collection

Start servers:

Server 1: bin/solr start -cloud -d server -p 8983 -z localhost:2181
Server 2: bin/solr start -cloud -d server -p 8984 -z localhost:2181

As usual, the shard count will be 1 with this approach. I want 2 shards. I know
that I can create the shards with:

bin/solr create

However, I would have to delete the existing collection and then create it
again with more shards. Is there any way to set the number of shards, maximum
shards per node, etc. at the initial start of Solr?

Kind Regards,
Furkan KAMACI
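
The shard count is a property of the collection, not of node startup, so it is
set when the collection is created, either with bin/solr create or through the
Collections API. A minimal SolrJ sketch, assuming SolrJ 6.x (where the
CollectionAdminRequest.createCollection(name, configName, numShards,
numReplicas) helper is available) and reusing the configset name uploaded
above; the shard/replica numbers are placeholders:

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.request.CollectionAdminRequest;

public class CreateCollection {
  public static void main(String[] args) throws Exception {
    try (CloudSolrClient client =
             new CloudSolrClient.Builder().withZkHost("localhost:2181").build()) {
      // Uses the "my_collection" configset already uploaded to ZooKeeper
      CollectionAdminRequest.Create create =
          CollectionAdminRequest.createCollection("my_collection", "my_collection", 2, 1);
      create.setMaxShardsPerNode(1);
      create.process(client);
    }
  }
}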


Re: How to check optimized or disk free status via solrj for a particular collection?

2016-12-12 Thread Erick Erickson
First off, optimize is actually rarely necessary. I wouldn't bother
unless you have measurements to prove that it's desirable.

I would _certainly_ not call optimize every 10M docs. If you must call
it at all call it exactly once when indexing is complete. But see
above.

As far as the commit, I'd just set the autocommit settings in
solrconfig.xml to something "reasonable" and forget it. I usually use
time rather than doc count as it's a little more predictable. I often
use 60 seconds, but it can be longer. The longer it is, the bigger
your tlog will grow and if Solr shuts down forcefully the longer
replaying may take. Here's the whole writeup on this topic:

https://lucidworks.com/blog/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/

Running out of space during indexing with about 30% utilization is
very odd. My guess is that you're trying to take too much control.
Having multiple optimizations going on at once would be a very good
way to run out of disk space.

And I'm assuming one replica's index per disk or you're reporting
aggregate index size per disk when you say 30%. Having three replicas
on the same disk each consuming 30% is A Bad Thing.

Best,
Erick

On Mon, Dec 12, 2016 at 8:36 AM, Michael Joyner  wrote:
> Halp!
>
> I need to reindex over 43 millions documents, when optimized the collection
> is currently < 30% of disk space, we tried it over this weekend and it ran
> out of space during the reindexing.
>
> I'm thinking for the best solution for what we are trying to do is to call
> commit/optimize every 10,000,000 documents or so and then wait for the
> optimize to complete.
>
> How to check optimized status via solrj for a particular collection?
>
> Also, is there is a way to check free space per shard by collection?
>
> -Mike
>


Re: OOMs in Solr

2016-12-12 Thread Erick Erickson
The biggest bang for the buck is _probably_ docValues for the fields
you facet on. If that's the culprit, you can also reduce your JVM heap
considerably, as Toke says, leaving this little memory for the OS is
bad. Here's the writeup on why:
http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html

Roughly what's happening is that all the values you facet on have to
be read into memory somewhere. docvalues puts almost all of that into
the OS memory rather than JVM heap. It's much faster to load, reduces
JVM GC pressure, OOMs, and allows the pages to be swapped out.

However, this is somewhat pushing the problem around. Moving the
memory consumption to the OS memory space will have a huge impact on
your OOM errors but the cost will be that you'll probably start
swapping pages out of the OS memory, which will impact search speed.
Slower searches are preferable to OOMs, certainly. That said you'll
probably need more physical memory at some point, or go to SolrCloud
or

Best,
Erick

On Mon, Dec 12, 2016 at 10:57 AM, Susheel Kumar  wrote:
> Double-check whether your queries are running into deep pagination
> (q=*:*...&start=).  This is something I recently experienced,
> and it was the only cause of the OOM.  You may have the GC logs from when the
> OOM happened, and plotting them in GCViewer may give insight into how
> gradually your heap filled up and ran into OOM.
>
> Thanks,
> Susheel
>
> On Mon, Dec 12, 2016 at 10:32 AM, Alfonso Muñoz-Pomer Fuentes <
> amu...@ebi.ac.uk> wrote:
>
>> Thanks again.
>>
>> I’m learning more about Solr in this thread than in my previous months
>> reading about it!
>>
>> Moving to Solr Cloud is a possibility we’ve discussed and I guess it will
>> eventually happen, as the index will grow no matter what.
>>
>> I’ve already lowered filterCache from 512 to 64 and I’m looking forward to
>> seeing what happens in the next few days. Our filter cache hit ratio was
>> 0.99, so I would expect this to go down but if we can have a more
>> efficient memory usage I think e.g. an extra second for each search is
>> still acceptable.
>>
>> Regarding the startup scripts we’re using the ones included with Solr.
>>
>> As for the use of filters we’re always using the same four filters, IIRC.
>> In any case we’ll review the code to ensure that that’s the case.
>>
>> I’m aware of the need to reindex when the schema changes, but thanks for
>> the reminder. We’ll add docValues because I think that’ll make a
>> significant difference in our case. We’ll also try to leave space for the
>> disk cache as we’re using spinning disk storage.
>>
>> Thanks again to everybody for the useful and insightful replies.
>>
>> Alfonso
>>
>>
>> On 12/12/2016 14:12, Shawn Heisey wrote:
>>
>>> On 12/12/2016 3:13 AM, Alfonso Muñoz-Pomer Fuentes wrote:
>>>
 I’m writing because in our web application we’re using Solr 5.1.0 and
 currently we’re hosting it on a VM with 32 GB of RAM (of which 30 are
 dedicated to Solr and nothing else is running there). We have four
 cores, that are this size:
 - 25.56 GB, Num Docs = 57,860,845
 - 12.09 GB, Num Docs = 173,491,631

 (The other two cores are about 10 MB, 20k docs)

>>>
>>> An OOM indicates that a Java application is requesting more memory than
>>> it has been told it can use. There are only two remedies for OOM errors:
>>> Increase the heap, or make the program use less memory.  In this email,
>>> I have concentrated on ways to reduce the memory requirements.
>>>
>>> These index sizes and document counts are relatively small to Solr -- as
>>> long as you have enough memory and are smart about how it's used.
>>>
>>> Solr 5.1.0 comes with GC tuning built into the startup scripts, using
>>> some well-tested CMS settings.  If you are using those startup scripts,
>>> then the parallel collector will NOT be default.  No matter what
>>> collector is in use, it cannot fix OOM problems.  It may change when and
>>> how frequently they occur, but it can't do anything about them.
>>>
>>> We aren’t indexing on this machine, and we’re getting OOM relatively
 quickly (after about 14 hours of regular use). Right now we have a
 Cron job that restarts Solr every 12 hours, so it’s not pretty. We use
 faceting quite heavily and mostly as a document storage server (we
 want full data sets instead of the n most relevant results).

>>>
>>> Like Toke, I suspect two things: a very large filterCache, and the heavy
>>> facet usage, maybe both.  Enabling docValues on the fields you're using
>>> for faceting and reindexing will make the latter more memory efficient,
>>> and likely faster.  Reducing the filterCache size would help the
>>> former.  Note that if you have a completely static index, then it is
>>> more likely that you will fill up the filterCache over time.
>>>
>>> I don’t know if what we’re experiencing is usual given the index size
 and memory constraint of the VM, or something looks like it’s wildly
 misconfigured. What do you think? Any us

Re: regex-urlfilter help

2016-12-12 Thread Shawn Heisey
On 12/12/2016 12:19 PM, KRIS MUSSHORN wrote:
> I'm using nutch 1.12 and Solr 5.4.1. 
>   
> Crawling a website and indexing into nutch. 
>   
> AFAIK the regex-urlfilter.txt file will cause content to not be crawled.. 
>   
> what if I have 
> https:///inside/default.cfm  as my seed url... 
> I want the links on this page to be crawled and indexed but I do not want 
> this page to be indexed into SOLR. 
> How would I set this up? 
>   
> I'm thinking that the regex-urlfilter.txt file is NOT the right place. 

These sound like questions about how to configure Nutch.  This is a Solr
mailing list.  Nutch is a completely separate Apache product with its
own mailing list.  Although there may be people here who do use Nutch,
it's not the purpose of this list.  Please use support resources for Nutch.

http://nutch.apache.org/mailing_lists.html

I'm reasonably certain that this cannot be controlled by Solr's
configuration.  Solr will index anything that is sent to it, so the
choice of what to send or not send in this situation will be decided by
Nutch.

Thanks,
Shawn



RE: Unicode Character Problem

2016-12-12 Thread Allison, Timothy B.
> I don't see any weird character when I manually copy it to a text editor.

That's a good diagnostic step, but there's a chance that Adobe (or your viewer) 
got it right, and Tika or PDFBox isn't getting it right.

If you run tika-app on the file [0], do you get the same problem?  See our stub 
on common text extraction challenges with PDFs [1] and how to run PDFBox's 
ExtractText against your file [2].

[0] java -jar tika-app.jar -i <input> -o <output>
[1] https://wiki.apache.org/tika/PDFParser%20%28Apache%20PDFBox%29
[2] https://wiki.apache.org/tika/Troubleshooting%20Tika#PDF_Text_Problems 
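
If it is easier to test from Java, here is a minimal Tika sketch (assuming
tika-core and tika-parsers, or the single tika-app jar, are on the classpath;
"document.pdf" is a placeholder for your file):

import java.io.File;
import org.apache.tika.Tika;

public class ExtractText {
  public static void main(String[] args) throws Exception {
    // Requires tika-core and tika-parsers (or the tika-app jar) on the classpath.
    Tika tika = new Tika();
    String text = tika.parseToString(new File("document.pdf")); // placeholder path
    System.out.println(text);
  }
}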

-Original Message-
From: Furkan KAMACI [mailto:furkankam...@gmail.com] 
Sent: Monday, December 12, 2016 10:55 AM
To: solr-user@lucene.apache.org; Ahmet Arslan 
Subject: Re: Unicode Character Problem

Hi Ahmet,

I don't see any weird character when I manually copy it to a text editor.

On Sat, Dec 10, 2016 at 6:19 PM, Ahmet Arslan 
wrote:

> Hi Furkan,
>
> I am pretty sure this is a pdf extraction thing.
> Turkish characters caused us trouble in the past during extracting 
> text from pdf files.
> You can confirm by performing manual copy-paste from original pdf file.
>
> Ahmet
>
>
> On Friday, December 9, 2016 8:44 PM, Furkan KAMACI 
> 
> wrote:
> Hi,
>
> I'm trying to index Turkish characters. These are what I see at my 
> index (I see both of them at different places of my content):
>
> aç  klama
> açıklama
>
> These are the same words but indexed differently (note the weird characters
> in the first one). I see that there is no weird character when I check the
> original PDF file.
> 
> What do you think about it? Is it related to Solr or Tika?
>
> PS: I use text_general for analyser of content field.
>
> Kind Regards,
> Furkan KAMACI
>


regex-urlfilter help

2016-12-12 Thread KRIS MUSSHORN
I'm using nutch 1.12 and Solr 5.4.1. 
  
Crawling a website and indexing into nutch. 
  
AFAIK the regex-urlfilter.txt file will cause content to not be crawled.. 
  
what if I have 
https:///inside/default.cfm  as my seed url... 
I want the links on this page to be crawled and indexed but I do not want this 
page to be indexed into SOLR. 
How would I set this up? 
  
I'm thinking that the regex-urlfilter.txt file is NOT the right place. 
  
TIA 
Kris 


Re: Copying Tokens

2016-12-12 Thread Alexandre Rafalovitch
Multilingual is - hard - fun. What you are trying to do is probably
not super-doable, as copyField copies the original text representation. You
don't want to copy tokens anyway, as your query-time analysis chains
are different too.

I would recommend looking at the books first.

Mine talks about languages (for older Solr version) and happens to use
English and Russian :-) You can read it for free at:
* 
https://www.packtpub.com/mapt/book/Big%20Data%20&%20Business%20Intelligence/9781782164845
(Free sample is the whole book :-) )
* multilingual setup is the last section/chapter
* Source code is at: https://github.com/arafalov/solr-indexing-book

There is also a large chapter in "Solr in Action" (chapter 14) that
has 3 different strategies, including one that multiplexes code using a
custom field type.

There might be others, but I can't remember off the top of my head.
But it is a problem books tend to cover, because it is known to be
thorny.

Regards,
   Alex.

http://www.solr-start.com/ - Resources for Solr users, new and experienced
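
As a stopgap that avoids copying tokens altogether, you can search the
per-language fields directly at query time instead of a combined text field. A
minimal SolrJ sketch using edismax over the content_en/content_ru fields from
the question (collection name and query string are placeholders):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class MultiLangSearch {
  public static void main(String[] args) throws Exception {
    try (HttpSolrClient client =
             new HttpSolrClient.Builder("http://localhost:8983/solr/mycollection").build()) {
      SolrQuery q = new SolrQuery("user input here");
      q.set("defType", "edismax");
      // Search the language-specific fields directly; each keeps its own analysis chain
      q.set("qf", "content_en content_ru");
      System.out.println(client.query(q).getResults().getNumFound() + " hits");
    }
  }
}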


On 12 December 2016 at 11:00, Furkan KAMACI  wrote:
> Hi,
>
> I'm testing language identification. I've enabled it solrconfig.xml.  Here
> is my dynamic fields at schema:
>
> 
> 
>
> So, after indexing, I see that fields are generated:
>
> content_en
> content_ru
>
> I copy my fields into a text field:
>
> 
> 
>
> Here is my text field:
>
>  multiValued="true"/>
>
> I want to let users only search on only *text* field. However, when I copy
> that fields into *text *field, they are indexed according to text_general.
>
> How can I copy *tokens* to *text *field?
>
> Kind Regards,
> Furkan KAMACI


Re: OOMs in Solr

2016-12-12 Thread Susheel Kumar
Double-check whether your queries are running into deep pagination
(q=*:*...&start=).  This is something I recently experienced,
and it was the only cause of the OOM.  You may have the GC logs from when the
OOM happened, and plotting them in GCViewer may give insight into how
gradually your heap filled up and ran into OOM.

Thanks,
Susheel
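
If deep paging does turn out to be the culprit, cursorMark is the usual fix,
since it avoids the memory cost of large start offsets. A minimal SolrJ sketch
(core name, rows, and sort field are placeholders; cursorMark requires a sort
that ends on the uniqueKey field):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrQuery.SortClause;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.params.CursorMarkParams;

public class CursorPaging {
  public static void main(String[] args) throws Exception {
    try (HttpSolrClient client =
             new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build()) {
      SolrQuery q = new SolrQuery("*:*");
      q.setRows(500);
      q.setSort(SortClause.asc("id")); // cursorMark requires a sort ending on the uniqueKey
      String cursor = CursorMarkParams.CURSOR_MARK_START;
      while (true) {
        q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursor);
        QueryResponse rsp = client.query(q);
        // ... process rsp.getResults() here ...
        String next = rsp.getNextCursorMark();
        if (cursor.equals(next)) break; // cursor did not advance: no more results
        cursor = next;
      }
    }
  }
}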

On Mon, Dec 12, 2016 at 10:32 AM, Alfonso Muñoz-Pomer Fuentes <
amu...@ebi.ac.uk> wrote:

> Thanks again.
>
> I’m learning more about Solr in this thread than in my previous months
> reading about it!
>
> Moving to Solr Cloud is a possibility we’ve discussed and I guess it will
> eventually happen, as the index will grow no matter what.
>
> I’ve already lowered filterCache from 512 to 64 and I’m looking forward to
> seeing what happens in the next few days. Our filter cache hit ratio was
> 0.99, so I would expect this to go down but if we can have a more
> efficient memory usage I think e.g. an extra second for each search is
> still acceptable.
>
> Regarding the startup scripts we’re using the ones included with Solr.
>
> As for the use of filters we’re always using the same four filters, IIRC.
> In any case we’ll review the code to ensure that that’s the case.
>
> I’m aware of the need to reindex when the schema changes, but thanks for
> the reminder. We’ll add docValues because I think that’ll make a
> significant difference in our case. We’ll also try to leave space for the
> disk cache as we’re using spinning disk storage.
>
> Thanks again to everybody for the useful and insightful replies.
>
> Alfonso
>
>
> On 12/12/2016 14:12, Shawn Heisey wrote:
>
>> On 12/12/2016 3:13 AM, Alfonso Muñoz-Pomer Fuentes wrote:
>>
>>> I’m writing because in our web application we’re using Solr 5.1.0 and
>>> currently we’re hosting it on a VM with 32 GB of RAM (of which 30 are
>>> dedicated to Solr and nothing else is running there). We have four
>>> cores, that are this size:
>>> - 25.56 GB, Num Docs = 57,860,845
>>> - 12.09 GB, Num Docs = 173,491,631
>>>
>>> (The other two cores are about 10 MB, 20k docs)
>>>
>>
>> An OOM indicates that a Java application is requesting more memory than
>> it has been told it can use. There are only two remedies for OOM errors:
>> Increase the heap, or make the program use less memory.  In this email,
>> I have concentrated on ways to reduce the memory requirements.
>>
>> These index sizes and document counts are relatively small to Solr -- as
>> long as you have enough memory and are smart about how it's used.
>>
>> Solr 5.1.0 comes with GC tuning built into the startup scripts, using
>> some well-tested CMS settings.  If you are using those startup scripts,
>> then the parallel collector will NOT be default.  No matter what
>> collector is in use, it cannot fix OOM problems.  It may change when and
>> how frequently they occur, but it can't do anything about them.
>>
>> We aren’t indexing on this machine, and we’re getting OOM relatively
>>> quickly (after about 14 hours of regular use). Right now we have a
>>> Cron job that restarts Solr every 12 hours, so it’s not pretty. We use
>>> faceting quite heavily and mostly as a document storage server (we
>>> want full data sets instead of the n most relevant results).
>>>
>>
>> Like Toke, I suspect two things: a very large filterCache, and the heavy
>> facet usage, maybe both.  Enabling docValues on the fields you're using
>> for faceting and reindexing will make the latter more memory efficient,
>> and likely faster.  Reducing the filterCache size would help the
>> former.  Note that if you have a completely static index, then it is
>> more likely that you will fill up the filterCache over time.
>>
>> I don’t know if what we’re experiencing is usual given the index size
>>> and memory constraint of the VM, or something looks like it’s wildly
>>> misconfigured. What do you think? Any useful pointers for some tuning
>>> we could do to improve the service? Would upgrading to Solr 6 make sense?
>>>
>>
>> As I already mentioned, the first thing I'd check is the size of the
>> filterCache.  Reduce it, possibly so it's VERY small.  Do everything you
>> can to assure that you are re-using filters, not sending many unique
>> filters.  One of the most common things that leads to low filter re-use
>> is using the bare NOW keyword in date filters and queries.  Use NOW/HOUR
>> or NOW/DAY instead -- NOW changes once a millisecond, so it is typically
>> unique for every query.  FilterCache entries are huge, as you were told
>> in another reply.
>>
>> Unless you use docValues, or utilize the facet.method parameter VERY
>> carefully, each field you facet on will tie up a large section of memory
>> containing the value for that field in EVERY document in the index.
>> With the document counts you've got, this is a LOT of memory.
>>
>> It is strongly recommended to have docValues enabled on every field
>> you're using for faceting.  If you change the schema in this manner, a
>> full reindex will be required before you can use that field again.

How to check optimized or disk free status via solrj for a particular collection?

2016-12-12 Thread Michael Joyner

Halp!

I need to reindex over 43 million documents. When optimized, the
collection is currently < 30% of disk space, but we tried the reindex over
this weekend and it ran out of space during the reindexing.

I'm thinking the best solution for what we are trying to do is to
call commit/optimize every 10,000,000 documents or so and then wait for
the optimize to complete.

How can I check optimize status via SolrJ for a particular collection?

Also, is there a way to check free space per shard by collection?

-Mike



Re: Distribution Packages

2016-12-12 Thread Pushkar Raste
We use the jdeb Maven plugin to build Debian packages; we use it for Solr
as well.

On Dec 12, 2016 9:03 AM, "Adjamilton Junior"  wrote:

> Hi folks,
>
> I am new here and I wonder why there are no Solr 6.x packages for
> Ubuntu/Debian?
>
> Thank you.
>
> Adjamilton Junior
>


Map Highlight Field into Another Field

2016-12-12 Thread Furkan KAMACI
Hi,

One can use * in highlight fields, like:

content_*

So, content_de and content_en can both match it. However, the response will
include fields like:

"highlighting":{
"my query":{
  "content_de":
  "content_en":
...

Is it possible to map matched fields onto a predefined field, like:

content_* => content

so that one can handle a generic name for such cases in the response?

If not, I can implement such a feature.
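
Until something like that exists, a client-side workaround is to collapse the
content_* keys yourself when reading the highlighting section. A minimal SolrJ
sketch (collection name and query are placeholders; field names follow the
example above):

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class MergeHighlights {
  public static void main(String[] args) throws Exception {
    try (HttpSolrClient client =
             new HttpSolrClient.Builder("http://localhost:8983/solr/mycollection").build()) {
      SolrQuery q = new SolrQuery("my query");
      q.setHighlight(true);
      q.set("hl.fl", "content_*");
      QueryResponse rsp = client.query(q);
      // docId -> (field -> snippets); collapse every content_* field into one list
      for (Map.Entry<String, Map<String, List<String>>> doc : rsp.getHighlighting().entrySet()) {
        List<String> content = new ArrayList<>();
        for (Map.Entry<String, List<String>> field : doc.getValue().entrySet()) {
          if (field.getKey().startsWith("content_")) {
            content.addAll(field.getValue());
          }
        }
        System.out.println(doc.getKey() + " -> " + content);
      }
    }
  }
}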

Kind Regards,
Furkan KAMACI


Copying Tokens

2016-12-12 Thread Furkan KAMACI
Hi,

I'm testing language identification. I've enabled it in solrconfig.xml.  Here
are my dynamic fields in the schema:




So, after indexing, I see that fields are generated:

content_en
content_ru

I copy my fields into a text field:




Here is my text field:



I want to let users search on only the *text* field. However, when I copy
those fields into the *text* field, they are indexed according to text_general.

How can I copy *tokens* to the *text* field?

Kind Regards,
Furkan KAMACI


Re: Unicode Character Problem

2016-12-12 Thread Furkan KAMACI
Hi Ahmet,

I don't see any weird character when I manually copy it to a text editor.

On Sat, Dec 10, 2016 at 6:19 PM, Ahmet Arslan 
wrote:

> Hi Furkan,
>
> I am pretty sure this is a pdf extraction thing.
> Turkish characters caused us trouble in the past during extracting text
> from pdf files.
> You can confirm by performing manual copy-paste from original pdf file.
>
> Ahmet
>
>
> On Friday, December 9, 2016 8:44 PM, Furkan KAMACI 
> wrote:
> Hi,
>
> I'm trying to index Turkish characters. These are what I see at my index (I
> see both of them at different places of my content):
>
> aç �klama
> açıklama
>
> These are the same words but indexed differently (note the weird characters in
> the first one). I see that there is no weird character when I check the
> original PDF file.
>
> What do you think about it? Is it related to Solr or Tika?
>
> PS: I use text_general for analyser of content field.
>
> Kind Regards,
> Furkan KAMACI
>


Re: OOMs in Solr

2016-12-12 Thread Alfonso Muñoz-Pomer Fuentes

Thanks again.

I’m learning more about Solr in this thread than in my previous months 
reading about it!


Moving to Solr Cloud is a possibility we’ve discussed and I guess it 
will eventually happen, as the index will grow no matter what.


I’ve already lowered filterCache from 512 to 64 and I’m looking forward 
to seeing what happens in the next few days. Our filter cache hit ratio 
was 0.99, so I would expect this to go down but if we can have a more 
efficient memory usage I think e.g. an extra second for each search is 
still acceptable.


Regarding the startup scripts we’re using the ones included with Solr.

As for the use of filters we’re always using the same four filters, 
IIRC. In any case we’ll review the code to ensure that that’s the case.


I’m aware of the need to reindex when the schema changes, but thanks for 
the reminder. We’ll add docValues because I think that’ll make a 
significant difference in our case. We’ll also try to leave space for 
the disk cache as we’re using spinning disk storage.


Thanks again to everybody for the useful and insightful replies.

Alfonso

On 12/12/2016 14:12, Shawn Heisey wrote:

On 12/12/2016 3:13 AM, Alfonso Muñoz-Pomer Fuentes wrote:

I’m writing because in our web application we’re using Solr 5.1.0 and
currently we’re hosting it on a VM with 32 GB of RAM (of which 30 are
dedicated to Solr and nothing else is running there). We have four
cores, that are this size:
- 25.56 GB, Num Docs = 57,860,845
- 12.09 GB, Num Docs = 173,491,631

(The other two cores are about 10 MB, 20k docs)


An OOM indicates that a Java application is requesting more memory than
it has been told it can use. There are only two remedies for OOM errors:
Increase the heap, or make the program use less memory.  In this email,
I have concentrated on ways to reduce the memory requirements.

These index sizes and document counts are relatively small to Solr -- as
long as you have enough memory and are smart about how it's used.

Solr 5.1.0 comes with GC tuning built into the startup scripts, using
some well-tested CMS settings.  If you are using those startup scripts,
then the parallel collector will NOT be default.  No matter what
collector is in use, it cannot fix OOM problems.  It may change when and
how frequently they occur, but it can't do anything about them.


We aren’t indexing on this machine, and we’re getting OOM relatively
quickly (after about 14 hours of regular use). Right now we have a
Cron job that restarts Solr every 12 hours, so it’s not pretty. We use
faceting quite heavily and mostly as a document storage server (we
want full data sets instead of the n most relevant results).


Like Toke, I suspect two things: a very large filterCache, and the heavy
facet usage, maybe both.  Enabling docValues on the fields you're using
for faceting and reindexing will make the latter more memory efficient,
and likely faster.  Reducing the filterCache size would help the
former.  Note that if you have a completely static index, then it is
more likely that you will fill up the filterCache over time.


I don’t know if what we’re experiencing is usual given the index size
and memory constraint of the VM, or something looks like it’s wildly
misconfigured. What do you think? Any useful pointers for some tuning
we could do to improve the service? Would upgrading to Solr 6 make sense?


As I already mentioned, the first thing I'd check is the size of the
filterCache.  Reduce it, possibly so it's VERY small.  Do everything you
can to assure that you are re-using filters, not sending many unique
filters.  One of the most common things that leads to low filter re-use
is using the bare NOW keyword in date filters and queries.  Use NOW/HOUR
or NOW/DAY instead -- NOW changes once a millisecond, so it is typically
unique for every query.  FilterCache entries are huge, as you were told
in another reply.

Unless you use docValues, or utilize the facet.method parameter VERY
carefully, each field you facet on will tie up a large section of memory
containing the value for that field in EVERY document in the index.
With the document counts you've got, this is a LOT of memory.

It is strongly recommended to have docValues enabled on every field
you're using for faceting.  If you change the schema in this manner, a
full reindex will be required before you can use that field again.

There is another problem lurking here that Toke already touched on:
Leaving only 2GB of RAM for the OS to handle disk caching will result in
terrible performance.

What you've been told by me and in other replies is discussed here:

https://wiki.apache.org/solr/SolrPerformanceProblems

Thanks,
Shawn



--
Alfonso Muñoz-Pomer Fuentes
Software Engineer @ Expression Atlas Team
European Bioinformatics Institute (EMBL-EBI)
European Molecular Biology Laboratory
Tel:+ 44 (0) 1223 49 2633
Skype: amunozpomer


Re: empty result set for a sort query

2016-12-12 Thread Yonik Seeley
Ah, 2-phase distributed search is the most likely answer (and
currently classified as more of a limitation than a bug)...
Phase 1 collects the top N ids from each shard (and merges them to
find the global top N)
Phase 2 retrieves the stored fields for the global top N

If any of the ids have been deleted between Phase 1 and Phase 2, then
you can get less than N docs back.

-Yonik


On Mon, Dec 12, 2016 at 4:26 AM, moscovig  wrote:
> I am not sure that it's related,
> but with local tests we got to a scenario where we
> add a doc that somehow has an *empty key* and then, when querying with sort over
> creationTime with rows=1, we get an empty result set.
> When specifying the recent doc's shard with shards=shard2 we do have results.
>
> I don't think we have empty keys in our production schema but maybe it can
> give a clue.
>
> Thanks
> Gilad


Re: Distribution Packages

2016-12-12 Thread Shawn Heisey
On 12/12/2016 7:03 AM, Adjamilton Junior wrote:
> I am new here and I wonder why there are no Solr 6.x packages
> for Ubuntu/Debian? 

There are no official Solr packages for ANY operating system.  We have
binary releases that include an installation script for UNIX-like
operating systems with typical open source utilities, but there are no
RPM or DEB packages.

It takes considerable developer time and effort to maintain such
packages.  The idea has been discussed, but nobody has volunteered to do
it, and Apache Infra has not been approached about the resources
required for hosting the repositories.

Thanks,
Shawn



Re: OOMs in Solr

2016-12-12 Thread Shawn Heisey
On 12/12/2016 3:13 AM, Alfonso Muñoz-Pomer Fuentes wrote:
> I’m writing because in our web application we’re using Solr 5.1.0 and
> currently we’re hosting it on a VM with 32 GB of RAM (of which 30 are
> dedicated to Solr and nothing else is running there). We have four
> cores, that are this size:
> - 25.56 GB, Num Docs = 57,860,845
> - 12.09 GB, Num Docs = 173,491,631
>
> (The other two cores are about 10 MB, 20k docs)

An OOM indicates that a Java application is requesting more memory than
it has been told it can use. There are only two remedies for OOM errors:
Increase the heap, or make the program use less memory.  In this email,
I have concentrated on ways to reduce the memory requirements.

These index sizes and document counts are relatively small to Solr -- as
long as you have enough memory and are smart about how it's used.

Solr 5.1.0 comes with GC tuning built into the startup scripts, using
some well-tested CMS settings.  If you are using those startup scripts,
then the parallel collector will NOT be default.  No matter what
collector is in use, it cannot fix OOM problems.  It may change when and
how frequently they occur, but it can't do anything about them.

> We aren’t indexing on this machine, and we’re getting OOM relatively
> quickly (after about 14 hours of regular use). Right now we have a
> Cron job that restarts Solr every 12 hours, so it’s not pretty. We use
> faceting quite heavily and mostly as a document storage server (we
> want full data sets instead of the n most relevant results).

Like Toke, I suspect two things: a very large filterCache, and the heavy
facet usage, maybe both.  Enabling docValues on the fields you're using
for faceting and reindexing will make the latter more memory efficient,
and likely faster.  Reducing the filterCache size would help the
former.  Note that if you have a completely static index, then it is
more likely that you will fill up the filterCache over time.

> I don’t know if what we’re experiencing is usual given the index size
> and memory constraint of the VM, or something looks like it’s wildly
> misconfigured. What do you think? Any useful pointers for some tuning
> we could do to improve the service? Would upgrading to Solr 6 make sense? 

As I already mentioned, the first thing I'd check is the size of the
filterCache.  Reduce it, possibly so it's VERY small.  Do everything you
can to assure that you are re-using filters, not sending many unique
filters.  One of the most common things that leads to low filter re-use
is using the bare NOW keyword in date filters and queries.  Use NOW/HOUR
or NOW/DAY instead -- NOW changes once a millisecond, so it is typically
unique for every query.  FilterCache entries are huge, as you were told
in another reply.
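
For example (illustrative field name), a filter such as

  fq=timestamp:[NOW-7DAYS TO NOW]

creates a brand-new cache entry on practically every request, while

  fq=timestamp:[NOW/DAY-7DAYS TO NOW/DAY+1DAY]

can be re-used for a whole day.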

Unless you use docValues, or utilize the facet.method parameter VERY
carefully, each field you facet on will tie up a large section of memory
containing the value for that field in EVERY document in the index. 
With the document counts you've got, this is a LOT of memory.

It is strongly recommended to have docValues enabled on every field
you're using for faceting.  If you change the schema in this manner, a
full reindex will be required before you can use that field again.
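
For example, a faceting field would be declared something like this in the
schema (field and type names are illustrative):

  <field name="category" type="string" indexed="true" stored="true" docValues="true"/>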

There is another problem lurking here that Toke already touched on:
Leaving only 2GB of RAM for the OS to handle disk caching will result in
terrible performance.

What you've been told by me and in other replies is discussed here:

https://wiki.apache.org/solr/SolrPerformanceProblems

Thanks,
Shawn



RE: OOMs in Solr

2016-12-12 Thread Prateek Jain J

You can also try the following:

1. Reduce the thread stack size using the -Xss flag.
2. Use sharding instead of a single large instance (if possible).
3. Reduce the cache sizes in solrconfig.xml.


Regards,
Prateek Jain

-Original Message-
From: Alfonso Muñoz-Pomer Fuentes [mailto:amu...@ebi.ac.uk] 
Sent: 12 December 2016 01:31 PM
To: solr-user@lucene.apache.org
Subject: Re: OOMs in Solr

I wasn’t aware of docValues and filterCache policies. We’ll try to fine-tune it 
and see if it helps.

Thanks so much for the info!

On 12/12/2016 12:13, Toke Eskildsen wrote:
> On Mon, 2016-12-12 at 10:13 +, Alfonso Muñoz-Pomer Fuentes wrote:
>> I’m writing because in our web application we’re using Solr 5.1.0 and 
>> currently we’re hosting it on a VM with 32 GB of RAM (of which 30 are 
>> dedicated to Solr and nothing else is running there).
>
> This leaves very little memory for disk cache. I hope your underlying 
> storage is local SSDs and not spinning drives over the network.
>
>>  We have four cores, that are this size:
>> - 25.56 GB, Num Docs = 57,860,845
>> - 12.09 GB, Num Docs = 173,491,631
>
> Smallish in bytes, largish in document count.
>
>> We aren’t indexing on this machine, and we’re getting OOM relatively 
>> quickly (after about 14 hours of regular use).
>
> The usual suspect for OOMs after some time is the filterCache. Worst- 
> case entries in that one take up 1 bit/document, which means 7MB and 
> 22MB respectively for the two collections above. If your filterCache 
> is set to 1000 for those, this means (7MB+22MB)*1000 ~= all your heap.
>
>
>>  Right now we have a Cron job that restarts Solr every 12 hours, so 
>> it’s not pretty. We use faceting quite heavily
>
> Hopefully on docValued fields?
>
>>  and mostly as a document storage server (we want full data sets 
>> instead of the n most relevant results).
>
> Hopefully with deep paging, as opposed to rows=173491631?
>
>> I don’t know if what we’re experiencing is usual given the index size 
>> and memory constraint of the VM, or something looks like it’s wildly 
>> misconfigured.
>
> I would have guessed that your heap was quite large enough for a 
> static index, but that is just ... guesswork.
>
> Would upgrading to Solr 6 make sense?
>
> It would not help in itself, but if you also switched to using 
> streaming for your assumedly large exports, it would lower memory 
> requirements.
>
> - Toke Eskildsen, State and University Library, Denmark
>
>>

--
Alfonso Muñoz-Pomer Fuentes
Software Engineer @ Expression Atlas Team European Bioinformatics Institute 
(EMBL-EBI) European Molecular Biology Laboratory Tel:+ 44 (0) 1223 49 2633
Skype: amunozpomer


Distribution Packages

2016-12-12 Thread Adjamilton Junior
Hi folks,

I am new here and I wonder why there are no Solr 6.x packages for
Ubuntu/Debian?

Thank you.

Adjamilton Junior


Re: Antw: Re: Solr 6.2.1 :: Collection Aliasing

2016-12-12 Thread Shawn Heisey
On 12/12/2016 3:56 AM, Rainer Gnan wrote:
> Do the query this way:
> http://hostname.de:8983/solr/live/select?indent=on&q=*:* 
>
> I have no idea whether the behavior you are seeing is correct or wrong,
> but if you send the traffic directly to the alias it should work correctly.
>
> It might turn out that this is a bug, but I believe the above workaround
> should take care of the issue in your environment.

It's standard SolrCloud usage.  You use the name of a collection in the
URL path after /solr, where normally (non-cloud) you would use a core
name.  All aliases, even those with multiple collections, will work for
queries, and single-collection aliases would have predictable results
for updates.  I do not have any documentation to point you at, although
it's possible that this IS mentioned in the docs.

Thanks,
Shawn



Re: OOMs in Solr

2016-12-12 Thread Alfonso Muñoz-Pomer Fuentes
I wasn’t aware of docValues and filterCache policies. We’ll try to 
fine-tune it and see if it helps.


Thanks so much for the info!

On 12/12/2016 12:13, Toke Eskildsen wrote:

On Mon, 2016-12-12 at 10:13 +, Alfonso Muñoz-Pomer Fuentes wrote:

I’m writing because in our web application we’re using Solr 5.1.0
and currently we’re hosting it on a VM with 32 GB of RAM (of which 30
are dedicated to Solr and nothing else is running there).


This leaves very little memory for disk cache. I hope your underlying
storage is local SSDs and not spinning drives over the network.


 We have four cores, that are this size:
- 25.56 GB, Num Docs = 57,860,845
- 12.09 GB, Num Docs = 173,491,631


Smallish in bytes, largish in document count.


We aren’t indexing on this machine, and we’re getting OOM relatively
quickly (after about 14 hours of regular use).


The usual suspect for OOMs after some time is the filterCache. Worst-
case entries in that one take up 1 bit/document, which means 7MB and
22MB respectively for the two collections above. If your filterCache is
set to 1000 for those, this means (7MB+22MB)*1000 ~= all your heap.



 Right now we have a Cron job that restarts Solr every 12 hours, so
it’s not pretty. We use faceting quite heavily


Hopefully on docValued fields?


 and mostly as a document storage server (we want full data sets
instead of the n most relevant results).


Hopefully with deep paging, as opposed to rows=173491631?


I don’t know if what we’re experiencing is usual given the index size
and memory constraint of the VM, or something looks like it’s wildly
misconfigured.


I would have guessed that your heap was quite large enough for a static
index, but that is just ... guesswork.

Would upgrading to Solr 6 make sense?

It would not help in itself, but if you also switched to using streaming
for your assumedly large exports, it would lower memory requirements.

- Toke Eskildsen, State and University Library, Denmark





--
Alfonso Muñoz-Pomer Fuentes
Software Engineer @ Expression Atlas Team
European Bioinformatics Institute (EMBL-EBI)
European Molecular Biology Laboratory
Tel:+ 44 (0) 1223 49 2633
Skype: amunozpomer


Re: OOMs in Solr

2016-12-12 Thread Alfonso Muñoz-Pomer Fuentes

Thanks for the reply. Here’s some more info...

Disk space:
39 GB / 148 GB (used / available)

Deployment model:
Single instance

JVM version:
1.7.0_04

Number of queries:
avgRequestsPerSecond: 0.5478469104833896

GC algorithm:
None specified, so I guess it defaults to the parallel GC.

On 12/12/2016 10:22, Prateek Jain J wrote:


Please provide some information like,

disk space available
deployment model of solr like solr-cloud or single instance
jvm version
no. of queries and type of queries etc.
GC algorithm used etc.


Regards,
Prateek Jain

-Original Message-
From: Alfonso Muñoz-Pomer Fuentes [mailto:amu...@ebi.ac.uk]
Sent: 12 December 2016 10:14 AM
To: solr-user@lucene.apache.org
Subject: OOMs in Solr

Hi Solr users,

I’m writing because in our web application we’re using Solr 5.1.0 and currently 
we’re hosting it on a VM with 32 GB of RAM (of which 30 are dedicated to Solr 
and nothing else is running there). We have four cores, that are this size:
- 25.56 GB, Num Docs = 57,860,845
- 12.09 GB, Num Docs = 173,491,631

(The other two cores are about 10 MB, 20k docs)

We aren’t indexing on this machine, and we’re getting OOM relatively quickly 
(after about 14 hours of regular use). Right now we have a Cron job that 
restarts Solr every 12 hours, so it’s not pretty. We use faceting quite heavily 
and mostly as a document storage server (we want full data sets instead of the 
n most relevant results).

I don’t know if what we’re experiencing is usual given the index size and 
memory constraint of the VM, or something looks like it’s wildly misconfigured. 
What do you think? Any useful pointers for some tuning we could do to improve 
the service? Would upgrading to Solr 6 make sense?

Thanks a lot in advance.

--
Alfonso Muñoz-Pomer Fuentes
Software Engineer @ Expression Atlas Team European Bioinformatics Institute 
(EMBL-EBI) European Molecular Biology Laboratory Tel:+ 44 (0) 1223 49 2633
Skype: amunozpomer



--
Alfonso Muñoz-Pomer Fuentes
Software Engineer @ Expression Atlas Team
European Bioinformatics Institute (EMBL-EBI)
European Molecular Biology Laboratory
Tel:+ 44 (0) 1223 49 2633
Skype: amunozpomer


Traverse over response docs in SearchComponent impl.

2016-12-12 Thread Markus Jelsma
Hello - I need to traverse over the list of response docs in a SearchComponent, 
get all values for a specific field, and then conditionally add a new field.

The request handler is configured as follows:

  dostuff


I can see that Solr calls the component's process() method, but from within 
that method rb.getResponseDocs() is always null. No matter what I try, I do 
not seem to be able to get hold of that list of response docs.

I'd rather not use a DocTransformer because I first need to check all fields 
in the response.

Any idea on how to process the SolrDocumentList correctly? I am clearly missing 
something.

Thanks,
Markus


Re: OOMs in Solr

2016-12-12 Thread Toke Eskildsen
On Mon, 2016-12-12 at 10:13 +, Alfonso Muñoz-Pomer Fuentes wrote:
> I’m writing because in our web application we’re using Solr 5.1.0
> and currently we’re hosting it on a VM with 32 GB of RAM (of which 30
> are dedicated to Solr and nothing else is running there).

This leaves very little memory for disk cache. I hope your underlying
storage is local SSDs and not spinning drives over the network.

>  We have four cores, that are this size:
> - 25.56 GB, Num Docs = 57,860,845
> - 12.09 GB, Num Docs = 173,491,631

Smallish in bytes, largish in document count.

> We aren’t indexing on this machine, and we’re getting OOM relatively 
> quickly (after about 14 hours of regular use).

The usual suspect for OOMs after some time is the filterCache. Worst-
case entries in that one take up 1 bit/document, which means 7MB and
22MB respectively for the two collections above. If your filterCache is
set to 1000 for those, this means (7MB+22MB)*1000 ~= all your heap.
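
Spelled out: 57,860,845 / 8 ≈ 7.2 MB and 173,491,631 / 8 ≈ 21.7 MB per
worst-case entry, so 1000 entries against each of those two cores is
(7.2 MB + 21.7 MB) * 1000 ≈ 29 GB -- essentially the whole 30 GB heap.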


>  Right now we have a Cron job that restarts Solr every 12 hours, so
> it’s not pretty. We use faceting quite heavily

Hopefully on docValued fields?

>  and mostly as a document storage server (we want full data sets
> instead of the n most relevant results).

Hopefully with deep paging, as opposed to rows=173491631?
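
(By deep paging I mean something along the lines of cursorMark, e.g.

  q=*:*&sort=timestamp+desc,id+asc&rows=1000&cursorMark=*

and then feeding the nextCursorMark from each response into the next
request. Field names are illustrative and this assumes id is your
uniqueKey.)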

> I don’t know if what we’re experiencing is usual given the index size
> and memory constraint of the VM, or something looks like it’s wildly 
> misconfigured.

I would have guessed that your heap was quite large enough for a static
index, but that is just ... guesswork.

Would upgrading to Solr 6 make sense?

It would not help in itself, but if you also switched to using streaming
for your assumedly large exports, it would lower memory requirements.

- Toke Eskildsen, State and University Library, Denmark

> 

Antw: Re: Solr 6.2.1 :: Collection Aliasing

2016-12-12 Thread Rainer Gnan
Hi Shawn,

your workaround works and is exactly what I was looking for.
Did you find this solution via trial and error or can you point me to the 
appropriate section in the APRGuide?

Thanks a lot!
Rainer


Rainer Gnan
Bayerische Staatsbibliothek 
BibliotheksVerbund Bayern
Verbundnahe Dienste
80539 München
Tel.: +49(0)89/28638-2445
Fax: +49(0)89/28638-2665
E-Mail: rainer.g...@bsb-muenchen.de




>>> Shawn Heisey  12.12.2016 11:43 >>>
On 12/12/2016 3:32 AM, Rainer Gnan wrote:
> Hi,
>
> actually I am trying to use Collection Aliasing in a SolrCloud-environment.
>
> My set up is as follows:
>
> 1. Collection_1 (alias "live") linked with config_1
> 2. Collection_2 (alias "test") linked with config_2
> 3. Collection_1 is different to Collection _2
> 4. config_1 is different to config_2 
>
> Case 1: Using
> http://hostname.de:8983/solr/Collection_2/select?indent=on&q=*:*&wt=xml&collection=test
>  
> leads to the same results as
> http://hostname.de:8983/solr/Collection_2/select?indent=on&q=*:*&wt=xml 
> which is correct.
>
> Case 2: Using
> http://hostname.de:8983/solr/Collection_1/select?indent=on&q=*:*&wt=xml&collection=live
>  
> leads to the same result as
> http://hostname.de:8983/solr/Collection_1/select?indent=on&q=*:*&wt=xml 
> which is correct, too.
>
> BUT
>
> Case 3: Using
> http://hostname.de:8983/solr/Collection_2/select?indent=on&q=*:*&wt=xml&collection=live
>  
> leads NOT to the same result as
> http://hostname.de:8983/solr/Collection_1/select?indent=on&q=*:*&wt=xml&collection=live
>  
> or 
> http://hostname.de:8983/solr/Collection_1/select?indent=on&q=*:*&wt=xml 

Do the query this way:

http://hostname.de:8983/solr/live/select?indent=on&q=*:* 

I have no idea whether the behavior you are seeing is correct or wrong,
but if you send the traffic directly to the alias it should work correctly.

It might turn out that this is a bug, but I believe the above workaround
should take care of the issue in your environment.

Thanks,
Shawn




Re: Solr 6.2.1 :: Collection Aliasing

2016-12-12 Thread Shawn Heisey
On 12/12/2016 3:32 AM, Rainer Gnan wrote:
> Hi,
>
> actually I am trying to use Collection Aliasing in a SolrCloud-environment.
>
> My set up is as follows:
>
> 1. Collection_1 (alias "live") linked with config_1
> 2. Collection_2 (alias "test") linked with config_2
> 3. Collection_1 is different to Collection _2
> 4. config_1 is different to config_2 
>
> Case 1: Using
> http://hostname.de:8983/solr/Collection_2/select?indent=on&q=*:*&wt=xml&collection=test
> leads to the same results as
> http://hostname.de:8983/solr/Collection_2/select?indent=on&q=*:*&wt=xml
> which is correct.
>
> Case 2: Using
> http://hostname.de:8983/solr/Collection_1/select?indent=on&q=*:*&wt=xml&collection=live
> leads to the same result as
> http://hostname.de:8983/solr/Collection_1/select?indent=on&q=*:*&wt=xml
> which is correct, too.
>
> BUT
>
> Case 3: Using
> http://hostname.de:8983/solr/Collection_2/select?indent=on&q=*:*&wt=xml&collection=live
> leads NOT to the same result as
> http://hostname.de:8983/solr/Collection_1/select?indent=on&q=*:*&wt=xml&collection=live
> or 
> http://hostname.de:8983/solr/Collection_1/select?indent=on&q=*:*&wt=xml

Do the query this way:

http://hostname.de:8983/solr/live/select?indent=on&q=*:*

I have no idea whether the behavior you are seeing is correct or wrong,
but if you send the traffic directly to the alias it should work correctly.

It might turn out that this is a bug, but I believe the above workaround
should take care of the issue in your environment.

Thanks,
Shawn



Re: Data Import Handler - maximum?

2016-12-12 Thread Shawn Heisey
On 12/11/2016 8:00 PM, Brian Narsi wrote:
> We are using Solr 5.1.0 and DIH to build index.
>
> We are using DIH with clean=true and commit=true and optimize=true.
> Currently retrieving about 10.5 million records in about an hour.
>
> I will like to find from other member's experiences as to how long can DIH
> run with no issues? What is the maximum number of records that anyone has
> pulled using DIH?
>
> Are there any limitations on the maximum number of records that can/should
> be pulled using DIH? What is the longest DIH can run?

There are no hard limits other than the Lucene limit of a little over
two billion docs per individual index.  With sharding, Solr is able to
easily overcome this limit on an entire index.

I have one index where each shard was over 50 million docs.  Each shard
has fewer docs now, because I changed it so there are more shards and
more machines.  For some reason the rebuild time (using DIH) got really
really long -- nearly 48 hours -- while building every shard in
parallel.  Still haven't figured out why the build time increased
dramatically.

One problem you might run into with DIH from a database has to do with
merging.  With default merge scheduler settings, eventually (typically
when there are millions of rows being imported) you'll run into a pause
in indexing that will take so long that the database connection will
close, causing the import to fail after the pause finishes.

I even opened a Lucene issue to get the default value for maxMergeCount
changed.  This issue went nowhere:

https://issues.apache.org/jira/browse/LUCENE-5705

Here's a thread from this mailing list discussing the problem and the
configuration solution:

http://lucene.472066.n3.nabble.com/What-does-quot-too-many-merges-stalling-quot-in-indexwriter-log-mean-td4077380.html
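
A rough sketch of that configuration change, in the indexConfig section of
solrconfig.xml (the values shown are illustrative -- see the thread for the
reasoning):

  <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
    <int name="maxMergeCount">6</int>
    <int name="maxThreadCount">1</int>
  </mergeScheduler>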

Thanks,
Shawn



Solr 6.2.1 :: Collection Aliasing

2016-12-12 Thread Rainer Gnan
Hi,

actually I am trying to use Collection Aliasing in a SolrCloud-environment.

My set up is as follows:

1. Collection_1 (alias "live") linked with config_1
2. Collection_2 (alias "test") linked with config_2
3. Collection_1 is different to Collection _2
4. config_1 is different to config_2 

Case 1: Using
http://hostname.de:8983/solr/Collection_2/select?indent=on&q=*:*&wt=xml&collection=test
leads to the same results as
http://hostname.de:8983/solr/Collection_2/select?indent=on&q=*:*&wt=xml
which is correct.

Case 2: Using
http://hostname.de:8983/solr/Collection_1/select?indent=on&q=*:*&wt=xml&collection=live
leads to the same result as
http://hostname.de:8983/solr/Collection_1/select?indent=on&q=*:*&wt=xml
which is correct, too.

BUT

Case 3: Using
http://hostname.de:8983/solr/Collection_2/select?indent=on&q=*:*&wt=xml&collection=live
leads NOT to the same result as
http://hostname.de:8983/solr/Collection_1/select?indent=on&q=*:*&wt=xml&collection=live
or 
http://hostname.de:8983/solr/Collection_1/select?indent=on&q=*:*&wt=xml

It seems that using alias "live" in case 3 forces solr to search in 
Collection_1 (which is desired) but it uses config_2 of Collection_2 (which is 
not desired).

MY AIM IS:
Running one collection as the production environment and the other as the test 
environment within a single SolrCloud.
After setting up a new index (new schema, new solrconfig.xml) on the test 
collection I want to assign the test collection the alias "live" and the live 
collection the alias "test".

How can I force solr to search in Collection_X with config_X?

I hope that my description makes clear what my problem is. If not, don't 
hesitate to ask, I appreciate any help.

Rainer





Rainer Gnan
Bayerische Staatsbibliothek 
BibliotheksVerbund Bayern
Verbundnahe Dienste
80539 München
Tel.: +49(0)89/28638-2445
Fax: +49(0)89/28638-2665
E-Mail: rainer.g...@bsb-muenchen.de






RE: OOMs in Solr

2016-12-12 Thread Prateek Jain J

Please provide some information like, 

disk space available
deployment model of solr like solr-cloud or single instance
jvm version
no. of queries and type of queries etc. 
GC algorithm used etc.


Regards,
Prateek Jain

-Original Message-
From: Alfonso Muñoz-Pomer Fuentes [mailto:amu...@ebi.ac.uk] 
Sent: 12 December 2016 10:14 AM
To: solr-user@lucene.apache.org
Subject: OOMs in Solr

Hi Solr users,

I’m writing because in our web application we’re using Solr 5.1.0 and currently 
we’re hosting it on a VM with 32 GB of RAM (of which 30 are dedicated to Solr 
and nothing else is running there). We have four cores, that are this size:
- 25.56 GB, Num Docs = 57,860,845
- 12.09 GB, Num Docs = 173,491,631

(The other two cores are about 10 MB, 20k docs)

We aren’t indexing on this machine, and we’re getting OOM relatively quickly 
(after about 14 hours of regular use). Right now we have a Cron job that 
restarts Solr every 12 hours, so it’s not pretty. We use faceting quite heavily 
and mostly as a document storage server (we want full data sets instead of the 
n most relevant results).

I don’t know if what we’re experiencing is usual given the index size and 
memory constraint of the VM, or something looks like it’s wildly misconfigured. 
What do you think? Any useful pointers for some tuning we could do to improve 
the service? Would upgrading to Solr 6 make sense?

Thanks a lot in advance.

--
Alfonso Muñoz-Pomer Fuentes
Software Engineer @ Expression Atlas Team European Bioinformatics Institute 
(EMBL-EBI) European Molecular Biology Laboratory Tel:+ 44 (0) 1223 49 2633
Skype: amunozpomer


OOMs in Solr

2016-12-12 Thread Alfonso Muñoz-Pomer Fuentes

Hi Solr users,

I’m writing because in our web application we’re using Solr 5.1.0 and 
currently we’re hosting it on a VM with 32 GB of RAM (of which 30 are 
dedicated to Solr and nothing else is running there). We have four 
cores, that are this size:

- 25.56 GB, Num Docs = 57,860,845
- 12.09 GB, Num Docs = 173,491,631

(The other two cores are about 10 MB, 20k docs)

We aren’t indexing on this machine, and we’re getting OOM relatively 
quickly (after about 14 hours of regular use). Right now we have a Cron 
job that restarts Solr every 12 hours, so it’s not pretty. We use 
faceting quite heavily and mostly as a document storage server (we want 
full data sets instead of the n most relevant results).


I don’t know if what we’re experiencing is usual given the index size 
and memory constraint of the VM, or something looks like it’s wildly 
misconfigured. What do you think? Any useful pointers for some tuning we 
could do to improve the service? Would upgrading to Solr 6 make sense?


Thanks a lot in advance.

--
Alfonso Muñoz-Pomer Fuentes
Software Engineer @ Expression Atlas Team
European Bioinformatics Institute (EMBL-EBI)
European Molecular Biology Laboratory
Tel:+ 44 (0) 1223 49 2633
Skype: amunozpomer


Re: empty result set for a sort query

2016-12-12 Thread moscovig
I am not sure that it's related,
but with local tests we got to a scenario where we 
add a doc that somehow has an *empty key* and then, when querying with sort over
creationTime with rows=1, we get an empty result set. 
When specifying the recent doc's shard with shards=shard2 we do have results. 

I don't think we have empty keys in our production schema but maybe it can
give a clue.

Thanks
Gilad



--
View this message in context: 
http://lucene.472066.n3.nabble.com/empty-result-set-for-a-sort-query-tp4309256p4309315.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: empty result set for a sort query

2016-12-12 Thread moscovig
Hi
Thanks for the reply. 

We are using 
select?q=*:*&sort=creationTimestamp+desc&rows=1
So as you said we should have got results. 

Another piece of information is that we commit within 300ms when inserting
the "sanity" doc. 
And again, we delete by query. 

We don't have any custom plugin/query processor.





--
View this message in context: 
http://lucene.472066.n3.nabble.com/empty-result-set-for-a-sort-query-tp4309256p4309304.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Data Import Handler - maximum?

2016-12-12 Thread Bernd Fehling

Am 12.12.2016 um 04:00 schrieb Brian Narsi:
> We are using Solr 5.1.0 and DIH to build index.
> 
> We are using DIH with clean=true and commit=true and optimize=true.
> Currently retrieving about 10.5 million records in about an hour.
> 
> I will like to find from other member's experiences as to how long can DIH
> run with no issues? What is the maximum number of records that anyone has
> pulled using DIH?

AFAIK, DIH will run up to the maximum number of documents per index.
Our longest run took about 3.5 days for a single DIH and over 100 million docs.
The runtime depends pretty much on the complexity of the analysis during 
loading.

Currently we are using concurrent DIH with 12 processes which takes 15 hours
for the same amount. Optimizing afterwards takes 9.5 hours.

SolrJ with 12 threads is doing the same indexing within 7.5 hours plus 
optimizing.
For huge amounts of data you should consider using SolrJ.

> 
> Are there any limitations on the maximum number of records that can/should
> be pulled using DIH? What is the longest DIH can run?
> 
> Thanks a bunch!
> 


RE: Problem with Cross Data Center Replication

2016-12-12 Thread WILLMES Gero (SAFRAN IDENTITY AND SECURITY)
Hi Erick,

thanks for the hint. Indeed, I just forgot to paste that section into the 
email. It was configured just the same way as you wrote. Do you have any idea 
what else could be the cause for the error?

Best regard,

Gero

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Wednesday, November 23, 2016 5:00 PM
To: solr-user 
Subject: Re: Problem with Cross Data Center Replication

Your _source_ (i.e. cdcr_testa) doesn't have the CDCR update log configured.
This section isn't in solrconfig for cdcr_testa:



<updateHandler class="solr.DirectUpdateHandler2">
  <updateLog class="solr.CdcrUpdateLog">
    <str name="dir">${solr.ulog.dir:}</str>
  </updateLog>
</updateHandler>


The update log is the transfer mechanism between the source and target 
clusters, so it needs to be configured in both.

Best,
Erick.

P.S. kudos for including enough info to diagnose (assuming I'm right)!


On Wed, Nov 23, 2016 at 4:40 AM, WILLMES Gero (SAFRAN IDENTITY AND
SECURITY)  wrote:
> Hi Solr users,
>
> i try to configure Cross Data Center Replication according to 
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=62687
> 462
>
> I set up two independent solr clouds. I created the collection "cdcr_testa" 
> on the source cloud and the collection "backup_collection" on the target 
> cloud.
>
> I adapted the configurations, according to the documentation.
>
> solrconfig.xml of the collection "cdcr_testa" in the source cluster:
>
>
>   
> <requestHandler name="/cdcr" class="solr.CdcrRequestHandler">
>   <lst name="replica">
>     <str name="zkHost">127.0.0.2:2181</str>
>     <str name="source">cdcr_testa</str>
>     <str name="target">backup_collection</str>
>   </lst>
>
>   <lst name="replicator">
>     <str name="threadPoolSize">8</str>
>     <str name="schedule">1000</str>
>     <str name="batchSize">128</str>
>   </lst>
>
>   <lst name="updateLogSynchronizer">
>     <str name="schedule">1000</str>
>   </lst>
> </requestHandler>
>
>
> solrconfig.xml of the collection "backup_collection" in the target cluster:
>
> <requestHandler name="/cdcr" class="solr.CdcrRequestHandler">
>   <lst name="buffer">
>     <str name="defaultState">disabled</str>
>   </lst>
> </requestHandler>
>
> <requestHandler name="/update" class="solr.UpdateRequestHandler">
>   <lst name="defaults">
>     <str name="update.chain">cdcr-processor-chain</str>
>   </lst>
> </requestHandler>
>
> <updateRequestProcessorChain name="cdcr-processor-chain">
>   <processor class="solr.CdcrUpdateProcessorFactory"/>
>   <processor class="solr.RunUpdateProcessorFactory"/>
> </updateRequestProcessorChain>
>
> <updateHandler class="solr.DirectUpdateHandler2">
>   <updateLog class="solr.CdcrUpdateLog">
>     <str name="dir">${solr.ulog.dir:}</str>
>   </updateLog>
> </updateHandler>
>
>
> When I now reload the collection "cdcr_testa", I allways get the 
> following  Solr Exception
>
> 2016-11-23 12:05:35.604 ERROR (qtp1134712904-8045) [c:cdcr_testa s:shard1 
> r:core_node1 x:cdcr_testa_shard1_replica1] o.a.s.s.HttpSolrCall 
> null:org.apache.solr.common.SolrException: Error handling 'reload' action
> at 
> org.apache.solr.handler.admin.CoreAdminOperation$3.call(CoreAdminOperation.java:150)
> at 
> org.apache.solr.handler.admin.CoreAdminHandler$CallInfo.call(CoreAdminHandler.java:367)
> at 
> org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:158)
> at 
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:156)
> at 
> org.apache.solr.servlet.HttpSolrCall.handleAdminRequest(HttpSolrCall.java:663)
> at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:445)
> at 
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:257)
> at 
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:208)
> at 
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1668)
> at 
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:581)
> at 
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
> at 
> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
> at 
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)
> at 
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1160)
> at 
> org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:511)
> at 
> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
> at 
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1092)
> at 
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
> at 
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
> at 
> org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119)
> at 
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
> at org.eclipse.jetty.server.Server.handle(Server.java:518)
> at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:308)
> at 
> org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:244)
> at 
> org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273)
> at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:95)
> at 
> org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
> at 
> org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceAndRun(ExecuteProduceConsume.java:246)
> at 
> org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:156)
> at 
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:654)
> at 
> org.eclipse.jetty.util.thread.Queue