Empty value fields not indexed

2017-04-27 Thread Zheng Lin Edwin Yeo
Hi,

I'm using Solr 6.4.2, and I realized that for those fields which have no
values, the field name is not indexed into Solr.

It was working fine in the previous version.

Any reason for this, or any settings which need to be done so that the
field name can be indexed even though its value is empty?

Regards,
Edwin


Re: DIH Speed

2017-04-27 Thread Vijay Kokatnur
Let me clarify -

DIH is running on Solr 6.5.0 and calls a different Solr instance running
4.5.0, which has 150M documents.  If we try to fetch them using DIH onto the
new Solr cluster, wouldn't it result in deep paging on Solr 4.5.0 and
drastically slow down indexing on Solr 6.5.0?

On Thu, Apr 27, 2017 at 4:40 PM, Erick Erickson 
wrote:

> I'm unclear why DIH and deep paging are mixed. DIH is
> indexing and deep paging is querying.
>
> If it's querying, consider cursorMark or the /export handler.
> https://lucidworks.com/2013/12/12/coming-soon-to-solr-
> efficient-cursor-based-iteration-of-large-result-sets/
>
> If it's DIH, please explain a bit more.
>
> Best,
> Erick
>
> On Thu, Apr 27, 2017 at 3:37 PM, Vijay Kokatnur
>  wrote:
> > We have a new solr 6.5.0 cluster, for which data is being imported via
> DIH
> > from another Solr cluster running version 4.5.0.
> >
> > This question comes back to deep paging, but we have observed that after
> 30
> > minutes of querying the rate of processing goes down from 400/s to about
> > 120/s.  At that point it has processed only 500K of 1.3M docs.  Is there
> > any way to speed this up?
> >
> > And, I can't go back to the source for the data.
> >
> > --
>



-- 
Best,
Vijay


Re: Poll: Master-Slave or SolrCloud?

2017-04-27 Thread David Lee
As someone who moved from ES to Solr, I can say that one of the things 
that makes ES so much easier to configure is that the majority of things 
that need to be set for a specific environment are all in pretty much 
one config file. Also, I didn't have to deal with the "magic stuff" that 
many people have talked about where SolrCloud is concerned.


One of the problems is also due to documentation and user blogs that
discuss how to use SolrCloud. They all tell you how to create a config
to run SolrCloud on one system using the -e cloud flag, but then that's
it. They all seem to avoid discussing what to do from there in terms
of best practices for distributing to other nodes. The information is out
there, but in many cases the guides refer to older versions of Solr, so it
is hard to know which version people are writing about until you try their
solutions, nothing works, and you finally figure out they are talking
about a much older release.


I moved away from ES to Solr because I prefer the openness of Solr and 
the community participation but I really haven't been very successful in 
deploying this in a production environment at this point.


I'd say the two things I find that I'm battling with the most are the 
cloud configuration and the work I'm having to do to get even the most 
basic JSON documents indexed correctly (specifically where I need block 
joins, etc.).


I'm hopeful that the V2 API will help with the JSON issue, but it would
be nice to have some documentation that goes more in-depth on how to set
up additional nodes. Also, even though I use ZK for other parts of my
application, I have no problem with a version running specifically for
Solr if it makes this process more straightforward.


David



On 4/27/2017 2:51 AM, Emir Arnautovic wrote:
I think creating a poll for ES people with the question "How do you run
master nodes? A) on some data nodes B) dedicated node C) dedicated server"
would give some insight into how big an issue having ZK is, and whether
hiding ZK behind Solr would do any good.


Emir


On 25.04.2017 23:13, Otis Gospodnetić wrote:

Hi Erick,

Could one run *only* embedded ZK on some SolrCloud nodes, sans any data?
It would be the equivalent of dedicated Elasticsearch master nodes, which
is the current ES best practice/recommendation.  I've never heard of
anyone being scared of running 3 dedicated master ES nodes, so if
SolrCloud offered the same, perhaps even completely hiding ZK from users,
that would present the same level of complexity (err, simplicity) ES users
love about ES.  Don't want to talk about SolrCloud vs. ES here at all,
just trying to share observations since we work a lot with both
Elasticsearch and Solr(Cloud) at Sematext.

Otis
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/


On Tue, Apr 25, 2017 at 4:03 PM, Erick Erickson wrote:


bq: I read somewhere that you should run your own ZK externally, and
turn off SolrCloud

this is a bit confused. "turn off SolrCloud" has nothing to do with
running ZK internally or externally. SolrCloud requires ZK; whether it is
internal or external is irrelevant to the term SolrCloud.

On to running an external ZK ensemble. Mostly, that's administratively
by far the safest. If you're running the embedded ZK, then the ZK
instances are tied to your Solr instance. Now if, for any reason, your
Solr nodes hosting ZK go down, you lose ZK quorum, can't index, etc.

Now consider a cluster with, say, 100 Solr nodes. Not talking replicas
in a collection here, I'm talking 100 physical machines. BTW, this is
not even close to the largest ones I'm aware of. Which three (for
example) are running ZK? If I want to upgrade Solr I had better make
really sure not to upgrade two of the Solr instances running ZK at once
if I want my cluster to keep going.

And, ZK is sensitive to system resources. So putting ZK on a Solr node
then hosing, say, updates to my Solr cluster can cause ZK to be
starved for resources.

This is one of those deals where _functionally_, it's OK to run
embedded ZK, but administratively it's suspect.

Best,
Erick

On Tue, Apr 25, 2017 at 10:49 AM, Rick Leir  wrote:

All,
I read somewhere that you should run your own ZK externally, and turn
off SolrCloud. Comments please!

Rick

On April 25, 2017 1:33:31 PM EDT, "Otis Gospodnetić" <
otis.gospodne...@gmail.com> wrote:
This is interesting - that ZK is seen as adding so much complexity that it
turns people off!

If you think about it, Elasticsearch users have no choice -- except their
"ZK" is built-in, hidden, so one doesn't have to think about it, at least
not initially.

I think I saw mentions (maybe on user or dev MLs or JIRA) about
potentially, in the future, there only being SolrCloud mode (and dropping
the SolrCloud name in favour of Solr).  If the above comment from Charlie
about complexity is really true for Solr users, and if that's the reason
why

Re: DIH Speed

2017-04-27 Thread Shawn Heisey
On 4/27/2017 9:15 PM, Vijay Kokatnur wrote:
> Hey Shawn, Unfortunately, we can't upgrade the existing cluster. That
> was my first approach as well. Yes, SolrEntityProcessor is used so it
> results in deep paging after certain rows. I have observed that
> instead of importing for a larger period, if data is imported only for
> 4 hours at a time, import process is much faster. Since we are
> importing for several months it would be nice if dataimport can be
> scripted, in bash or python. But I can't find any documentation on
> it. Any pointers? 

Hopefully this won't be too confusing:

If you have a field in the original index you can do a range query on,
you could use that range query to do the import in pieces, so each
import doesn't have as large a numFound value.

I'd probably put something like this in the SolrEntityProcessor:

query="${dih.request.solrquery}"

And then on the full-import command URL, I'd add these URL parameters,
varying the X and Y for each import:

clean=false&solrquery=field:[X TO Y}

You'd want to be sure that either you URL-encode the parameter value
(especially the brackets and spaces), or that whatever you're using to
execute the URL will automatically do the encoding for you -- which is
something a full browser would do.  For the first import, you'd want to
leave the clean parameter off or set it to true.

Doing it this way would require more imports, but the whole process
should go faster.  Unless you want to write something that can detect
when each import is finished, it would be relatively hard to fully
automate the process.
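
For the scripting question, a rough Python sketch of this piecewise import
(the /dataimport handler path, destination URL, and the "timestamp" range
field are assumptions, not something from your setup) could look like:

import time
from datetime import datetime, timedelta
import requests

SOLR = "http://localhost:8983/solr/mycore"   # destination core running DIH

def run_import(lower, upper, clean):
    params = {
        "command": "full-import",
        "clean": "true" if clean else "false",
        "commit": "true",
        # requests URL-encodes the brackets and spaces for us
        "solrquery": "timestamp:[%s TO %s}" % (lower, upper),
    }
    requests.get(SOLR + "/dataimport", params=params, timeout=30)
    while True:  # poll the DIH status command until the import finishes
        time.sleep(30)
        r = requests.get(SOLR + "/dataimport",
                         params={"command": "status", "wt": "json"}).json()
        if r.get("status") == "idle":
            break

# Four-hour slices, since short imports were observed to run faster;
# only the first import cleans the destination index.
start = datetime(2017, 1, 1)
for i in range(6):
    lo = (start + timedelta(hours=4 * i)).strftime("%Y-%m-%dT%H:%M:%SZ")
    hi = (start + timedelta(hours=4 * (i + 1))).strftime("%Y-%m-%dT%H:%M:%SZ")
    run_import(lo, hi, clean=(i == 0))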

Thanks,
Shawn



Re: DIH Speed

2017-04-27 Thread Vijay Kokatnur
Hey Shawn,

Unfortunately, we can't upgrade the existing cluster.  That was my first
approach as well.

Yes, SolrEntityProcessor is used so it results in deep paging after certain
rows.

I have observed that instead of importing for a larger period, if data is
imported only for 4 hours at a time, import process is much faster.  Since
we are importing for several months it would be nice if dataimport can be
scripted, in bash or python.  But I can't find any documentation on it.
Any pointers?

--
*From:* Shawn Heisey 
*Sent:* Thursday, April 27, 2017 5:07 PM
*To:* solr-user@lucene.apache.org
*Subject:* Re: DIH Speed

On 4/27/2017 5:40 PM, Erick Erickson wrote:
> I'm unclear why DIH and deep paging are mixed. DIH is indexing and deep
paging is querying.
>
> If it's querying, consider cursorMark or the /export handler.
https://lucidworks.com/2013/12/12/coming-soon-to-solr-efficient-cursor-based-iteration-of-large-result-sets/

Very likely they are using SolrEntityProcessor.

Vijay, if the source server were running 4.7 (or later) instead of 4.5,
you could enable cursorMark for SolrEntityProcessor in 6.5.0 as Erick
mentioned, and pagination would be immensely more efficient.
Unfortunately, 4.5 doesn't support cursorMark.

https://issues.apache.org/jira/browse/SOLR-9668

Any chance you could upgrade the source server to a later 4.x version?

Thanks,
Shawn


Re: 1 main collection or multiple smaller collections?

2017-04-27 Thread Derek Poh

Richard

I am considering the same option as your suggestion, to put them in 1 single
collection of product documents, with a product doc containing the supplier
info. In this option, the supplier info will get repeated in each of the
supplier's product docs. I may be influenced by DB concepts. Guess it's a
trade-off for this option.


On 4/28/2017 1:01 AM, Rick Leir wrote:

Does it make sense to use nested documents here? Products could be nested in a 
supplier document perhaps.

Alternately, consider de-normalizing "til it hurts". A product doc might be 
able to contain supplier info.

On April 27, 2017 8:50:59 AM EDT, Shawn Heisey  wrote:

On 4/26/2017 11:57 PM, Derek Poh wrote:

There are some common fields between them.
At the source data end (database), the supplier info and product info
are updated separately. In this regard, should I separate them?
If it's in 1 single collection, when there are updates to only the
supplier info, the product info will be indexed again even though there
are no updates to them. Is my reasoning valid?


On 4/27/2017 1:33 PM, Walter Underwood wrote:

Do they have the same fields or different fields? Are they updated
separately or together?

If they have the same fields and are updated together, I’d put them
in the same collection. Otherwise, probably separate.

Walter's statements are right on the money, you just might need a
little
more detail.

There are two critical details that decide whether you even CAN
combine different data in a single index: One is that all types of
records must use the same field (the uniqueKey field) to determine
uniqueness, and the value of this field must be unique across the
entire
dataset.  The other is that there SHOULD be a field with a name like
"type" that your search client can use to differentiate the different
kinds of documents.  This type field is not necessary, but it does make
things easier.

Assuming you CAN combine documents, there is still the question of
whether you SHOULD.  If the fields that you will commonly search are
the
same between the different kinds of documents, and if people want to be
able to do one search and get more than one of the document types you
are indexing, then it is something you should consider.  If people will
only ever search one type of document, you should probably keep them in
separate indexes to keep things cleaner.

Thanks,
Shawn




Re: 1 main collection or multiple smaller collections?

2017-04-27 Thread Derek Poh

Hi Shawn

1 set of data is the suppliers' info and 1 set is the suppliers' products
info. A user can either do a product search or a supplier search.

1 option I am thinking of is to put them in 1 single collection with each
product as a document. Each product document will have the supplier info
in it.

Product id will be the uniqueKey field.
With this option, the same supplier info will be in every product document
of the supplier.


A simplified example:
doc:
product id: P1
product description: XXX
supplier id: S1
supplier name: XXX
supplier address: XXX

doc:
product id: P2
product description: XXXYYY
supplier id: S1
supplier name: XXX
supplier address: XXX

I may be influenced by DB concepts. Is such a design logical?
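
A minimal sketch of indexing and querying this layout in Python (the URL and
exact field names are assumptions): each product document carries its
supplier's info, and a supplier search collapses the duplicates with field
collapsing (group=true).

import requests

SOLR = "http://localhost:8983/solr/products"

docs = [
    {"id": "P1", "product_description": "XXX",
     "supplier_id": "S1", "supplier_name": "Acme", "supplier_address": "XXX"},
    {"id": "P2", "product_description": "XXXYYY",
     "supplier_id": "S1", "supplier_name": "Acme", "supplier_address": "XXX"},
]
requests.post(SOLR + "/update?commit=true", json=docs)

# Product search: an ordinary query over product fields.
products = requests.get(SOLR + "/select", params={
    "q": "product_description:XXX", "wt": "json"}).json()

# Supplier search: group on supplier_id so each supplier appears once even
# though its info is repeated on every product document.
suppliers = requests.get(SOLR + "/select", params={
    "q": "supplier_name:Acme", "group": "true",
    "group.field": "supplier_id", "wt": "json"}).json()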


On 4/27/2017 8:50 PM, Shawn Heisey wrote:

On 4/26/2017 11:57 PM, Derek Poh wrote:

There are some common fields between them.
At the source data end (database), the supplier info and product info
are updated separately. In this regard, should I separate them?
If it's in 1 single collection, when there are updates to only the
supplier info, the product info will be indexed again even though there
are no updates to them. Is my reasoning valid?


On 4/27/2017 1:33 PM, Walter Underwood wrote:

Do they have the same fields or different fields? Are they updated
separately or together?

If they have the same fields and are updated together, I’d put them
in the same collection. Otherwise, probably separate.

Walter's statements are right on the money, you just might need a little
more detail.

There are two critical details that decide whether you even CAN
combine different data in a single index: One is that all types of
records must use the same field (the uniqueKey field) to determine
uniqueness, and the value of this field must be unique across the entire
dataset.  The other is that there SHOULD be a field with a name like
"type" that your search client can use to differentiate the different
kinds of documents.  This type field is not necessary, but it does make
things easier.

Assuming you CAN combine documents, there is still the question of
whether you SHOULD.  If the fields that you will commonly search are the
same between the different kinds of documents, and if people want to be
able to do one search and get more than one of the document types you
are indexing, then it is something you should consider.  If people will
only ever search one type of document, you should probably keep them in
separate indexes to keep things cleaner.

Thanks,
Shawn






Re: DIH Speed

2017-04-27 Thread Shawn Heisey
On 4/27/2017 5:40 PM, Erick Erickson wrote:
> I'm unclear why DIH and deep paging are mixed. DIH is indexing and deep paging
> is querying.
>
> If it's querying, consider cursorMark or the /export handler. 
> https://lucidworks.com/2013/12/12/coming-soon-to-solr-efficient-cursor-based-iteration-of-large-result-sets/

Very likely they are using SolrEntityProcessor.

Vijay, if the source server were running 4.7 (or later) instead of 4.5,
you could enable cursorMark for SolrEntityProcessor in 6.5.0 as Erick
mentioned, and pagination would be immensely more efficient. 
Unfortunately, 4.5 doesn't support cursorMark.

https://issues.apache.org/jira/browse/SOLR-9668

Any chance you could upgrade the source server to a later 4.x version?

Thanks,
Shawn



Re: Atomic Updates

2017-04-27 Thread Erick Erickson
Been there, done that, got the t-shirt. Thanks for closing it out!

Erick

On Thu, Apr 27, 2017 at 10:29 AM, Chris Ulicny  wrote:
> While recreating it with a fresh schema, I realized that this was a case of
> a very, very stupid user error during configuring the cores.
>
> I set up the testing cores with the wrong configset, and then proceeded to
> edit the schema in the right configset. So, the field was actually stored
> by default, but I wasn't attempting to retrieve it so I never realized it
> was being stored since I ended up looking at the wrong schema.
>
> I switched the config sets and everything works as expected. Any atomic
> updates clear out the indexed values for the non-stored field.
>
> Thanks for bearing with me.
> Chris
>
>
> On Thu, Apr 27, 2017 at 11:23 AM Chris Ulicny  wrote:
>
>> I'm sending commit=true with every update while testing. I'll write up the
>> tests and see if someone else can reproduce it.
>>
>> On Thu, Apr 27, 2017 at 10:54 AM Erick Erickson 
>> wrote:
>>
>>> bq: but is there any possibility that the values stick around until
>>> there is a segment merge for some strange reason
>>>
>>> There better not be or it's a bug. Things will stick around until
>>> you issue a commit, is there any chance that's the problem?
>>>
>>> If you can document the exact steps, maybe we can reproduce
>>> the issue and raise a JIRA.
>>>
>>> Best,
>>> Erick
>>>
>>> On Thu, Apr 27, 2017 at 6:03 AM, Chris Ulicny  wrote:
>>> > Yeah, something's not quite right somewhere. We never even considered
>>> > in-place updates an option since it requires the fields to be
>>> non-indexed
>>> > and non-stored. Our schemas never have any field that satisfies those
>>> two
>>> > conditions let alone the other necessary ones.
>>> >
>>> > I went ahead and tested the atomic updates on different textfields, and
>>> I
>>> > still can't get the indexed but not-stored othertext_field to
>>> disappear. So
>>> > far set, add, and remove updates do not change it regardless of what the
>>> > fields are in the atomic update.
>>> >
>>> > It would be extraordinarily useful if this update behavior is now
>>> expected
>>> > (but not currently documented) functionality.
>>> >
>>> > I'm not too familiar with the nitty-gritty details of merging segment
>>> > files, but is there any possibility that the values stick around until
>>> > there is a segment merge for some strange reason?
>>> >
>>> > On Thu, Apr 27, 2017 at 12:59 AM Dorian Hoxha 
>>> > wrote:
>>> >
>>> >> @Chris,
>>> >> According to doc-link-above, only INC,SET are in-place-updates. And
>>> only
>>> >> when they're not indexed/stored, while your 'integer-field' is. So
>>> still
>>> >> shenanigans in there somewhere (docs,your-code,your-test,solr-code).
>>> >>
>>> >> On Thu, Apr 27, 2017 at 2:04 AM, Chris Ulicny 
>>> wrote:
>>> >>
>>> >> > That's probably it then. None of the atomic updates that I've tried
>>> have
>>> >> > been on TextFields. I'll give the TextField atomic update to verify
>>> that
>>> >> it
>>> >> > will clear the other field.
>>> >> >
>>> >> > Has this functionality been consistent since atomic updates were
>>> >> > introduced, or is this a side effect of some other change? It'd be
>>> very
>>> >> > convenient for us to use this functionality as it currently works,
>>> but if
>>> >> > it's something that prevents us from upgrading versions in the
>>> future, we
>>> >> > should probably avoid expecting it to work.
>>> >> >
>>> >> > On Wed, Apr 26, 2017 at 7:36 PM Ishan Chattopadhyaya <
>>> >> > ichattopadhy...@gmail.com> wrote:
>>> >> >
>>> >> > > > Hmm, interesting. I can imagine that as long as you're updating
>>> >> > > > docValues fields, the other_text field would be there. But the
>>> >> instant
>>> >> > > > you updated a non-docValues field (text_field in your example)
>>> the
>>> >> > > > other_text field would disappear
>>> >> > >
>>> >> > > I can confirm this. When in-place updates to DV fields are done,
>>> the
>>> >> rest
>>> >> > > of the fields remain as they were.
>>> >> > >
>>> >> > > On Thu, Apr 27, 2017 at 4:33 AM, Erick Erickson <
>>> >> erickerick...@gmail.com
>>> >> > >
>>> >> > > wrote:
>>> >> > >
>>> >> > > > Hmm, interesting. I can imagine that as long as you're updating
>>> >> > > > docValues fields, the other_text field would be there. But the
>>> >> instant
>>> >> > > > you updated a non-docValues field (text_field in your example)
>>> the
>>> >> > > > other_text field would disappear.
>>> >> > > >
>>> >> > > > I DO NOT KNOW this for a fact, but I'm asking people who do.
>>> >> > > >
>>> >> > > > On Wed, Apr 26, 2017 at 2:13 PM, Dorian Hoxha <
>>> >> dorian.ho...@gmail.com>
>>> >> > > > wrote:
>>> >> > > > > There are In Place Updates, but according to docs they still
>>> >> shouldn't
>>> >> > > > work
>>> >> > > > > in your case:
>>> >> > > > > https://cwiki.apache.org/confluence/display/solr/
>>> >> > 

Re: DIH Speed

2017-04-27 Thread Erick Erickson
I'm unclear why DIH and deep paging are mixed. DIH is
indexing and deep paging is querying.

If it's querying, consider cursorMark or the /export handler.
https://lucidworks.com/2013/12/12/coming-soon-to-solr-efficient-cursor-based-iteration-of-large-result-sets/
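
For the querying case, a minimal cursorMark loop in Python looks roughly like
this (the URL and the uniqueKey field "id" are assumptions, and the side
being queried needs Solr 4.7+):

import requests

SOLR = "http://localhost:8983/solr/collection1/select"
cursor = "*"
while True:
    resp = requests.get(SOLR, params={
        "q": "*:*",
        "rows": 1000,
        "sort": "id asc",          # cursorMark requires a sort on the uniqueKey
        "cursorMark": cursor,
        "wt": "json",
    }).json()
    for doc in resp["response"]["docs"]:
        pass                       # process / re-index each document here
    next_cursor = resp["nextCursorMark"]
    if next_cursor == cursor:      # unchanged cursor means we've seen everything
        break
    cursor = next_cursor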

If it's DIH, please explain a bit more.

Best,
Erick

On Thu, Apr 27, 2017 at 3:37 PM, Vijay Kokatnur
 wrote:
> We have a new solr 6.5.0 cluster, for which data is being imported via DIH
> from another Solr cluster running version 4.5.0.
>
> This question comes back to deep paging, but we have observed that after 30
> minutes of querying the rate of processing goes down from 400/s to about
> 120/s.  At that point it has processed only 500K of 1.3M docs.  Is there
> any way to speed this up?
>
> And, I can't go back to the source for the data.
>
> --


Solr Query Performance benchmarking

2017-04-27 Thread Suresh Pendap
Hi,
I am trying to perform Solr query performance benchmarking and to measure the
maximum throughput and latency that I can get from a given Solr cluster.

Following are my configurations

Number of Solr Nodes: 4
Number of shards: 2
replication-factor:  2
Index size: 55 GB
Shard/Core size: 27.7 GB
maxConnsPerHost: 1000

The Solr nodes are VMs with 16 vCPU cores and 112 GB RAM.  The CPU is mapped
1-1 and is not overcommitted.

I am generating query load using a Java client program which fires Solr queries 
read from a static file.  The client java program is using the Apache Http 
Client library to invoke the queries. I have already configured the client to 
create 300 max connections.

The queries are mostly of the pattern below:
q=*:*&fl=orderNo,purchaseOrderNos,timestamp,eventName,eventID,_src_&fq=((orderNo:+AND+purchaseOrderNos:
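
A rough Python equivalent of the benchmark client described above: fire the
queries from the static file over a thread pool and record latencies. The
URL and the one-query-per-line file format are assumptions.

import time
import requests
from concurrent.futures import ThreadPoolExecutor
from requests.adapters import HTTPAdapter

SOLR = "http://localhost:8983/solr/mycoll/select?"
session = requests.Session()
# Mirror maxConnsPerHost-style pooling so 300 workers can reuse connections.
session.mount("http://", HTTPAdapter(pool_connections=1, pool_maxsize=300))

def fire(query_string):
    start = time.time()
    session.get(SOLR + query_string)
    return time.time() - start

with open("queries.txt") as f:
    queries = [line.strip() for line in f if line.strip()]

t0 = time.time()
with ThreadPoolExecutor(max_workers=300) as pool:
    latencies = sorted(pool.map(fire, queries))
wall = time.time() - t0

print("p50=%.3fs p99=%.3fs throughput=%.1f q/s" % (
    latencies[len(latencies) // 2],
    latencies[int(len(latencies) * 0.99)],
    len(latencies) / wall))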

DIH Speed

2017-04-27 Thread Vijay Kokatnur
We have a new solr 6.5.0 cluster, for which data is being imported via DIH
from another Solr cluster running version 4.5.0.

This question comes back to deep paging, but we have observed that after 30
minutes of querying the rate of processing goes down from 400/s to about
120/s.  At that point it has processed only 500K of 1.3M docs.  Is there
any way to speed this up?

And, I can't go back to the source for the data.

--


TransactionLog doesn't know how to serialize class java.util.UUID; try implementing ObjectResolver?

2017-04-27 Thread Mahmoud Almokadem
Hello,

When I try to update a document that exists on Solr Cloud, I get this message:

TransactionLog doesn't know how to serialize class java.util.UUID; try
implementing ObjectResolver?

With the stack trace:


{"data":{"responseHeader":{"status":500,"QTime":3},"error":{"metadata":["error-class","org.apache.solr.common.SolrException","root-error-class","org.apache.solr.common.SolrException"],"msg":"TransactionLog
doesn't know how to serialize class java.util.UUID; try implementing
ObjectResolver?","trace":"org.apache.solr.common.SolrException:
TransactionLog doesn't know how to serialize class java.util.UUID; try
implementing ObjectResolver?\n\tat
org.apache.solr.update.TransactionLog$1.resolve(TransactionLog.java:100)\n\tat
org.apache.solr.common.util.JavaBinCodec.writeVal(JavaBinCodec.java:234)\n\tat
org.apache.solr.common.util.JavaBinCodec.writeSolrInputDocument(JavaBinCodec.java:589)\n\tat
org.apache.solr.update.TransactionLog.write(TransactionLog.java:395)\n\tat
org.apache.solr.update.UpdateLog.add(UpdateLog.java:524)\n\tat
org.apache.solr.update.UpdateLog.add(UpdateLog.java:508)\n\tat
org.apache.solr.update.DirectUpdateHandler2.doNormalUpdate(DirectUpdateHandler2.java:320)\n\tat
org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:239)\n\tat
org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:194)\n\tat
org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:67)\n\tat
org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:55)\n\tat
org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:980)\n\tat
org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:1193)\n\tat
org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:749)\n\tat
org.apache.solr.update.processor.LogUpdateProcessorFactory$LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:103)\n\tat
org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.handleAdds(JsonLoader.java:502)\n\tat
org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.processUpdate(JsonLoader.java:141)\n\tat
org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.load(JsonLoader.java:117)\n\tat
org.apache.solr.handler.loader.JsonLoader.load(JsonLoader.java:80)\n\tat
org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:97)\n\tat
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:68)\n\tat
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:173)\n\tat
org.apache.solr.core.SolrCore.execute(SolrCore.java:2440)\n\tat
org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:723)\n\tat
org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:529)\n\tat
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:347)\n\tat
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:298)\n\tat
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1691)\n\tat
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582)\n\tat
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)\n\tat
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)\n\tat
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)\n\tat
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)\n\tat
org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)\n\tat
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)\n\tat
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)\n\tat
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)\n\tat
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)\n\tat
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119)\n\tat
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)\n\tat
org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335)\n\tat
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)\n\tat
org.eclipse.jetty.server.Server.handle(Server.java:534)\n\tat
org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:320)\n\tat
org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)\n\tat
org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273)\n\tat
org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:95)\n\tat
org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)\n\tat

Re: Atomic Updates

2017-04-27 Thread Chris Ulicny
While recreating it with a fresh schema, I realized that this was a case of
a very, very stupid user error during configuring the cores.

I set up the testing cores with the wrong configset, and then proceeded to
edit the schema in the right configset. So, the field was actually stored
by default, but I wasn't attempting to retrieve it so I never realized it
was being stored since I ended up looking at the wrong schema.

I switched the config sets and everything works as expected. Any atomic
updates clear out the indexed values for the non-stored field.
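
For reference, a minimal atomic "set" update of the kind being tested here
(the URL and field names are assumptions):

import requests

update = [{"id": "doc1", "stored_field": {"set": "new value"}}]
requests.post("http://localhost:8983/solr/mycore/update?commit=true",
              json=update)

# After this runs, any field that is indexed but not stored loses its
# indexed values, which is the behavior confirmed above.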

Thanks for bearing with me.
Chris


On Thu, Apr 27, 2017 at 11:23 AM Chris Ulicny  wrote:

> I'm sending commit=true with every update while testing. I'll write up the
> tests and see if someone else can reproduce it.
>
> On Thu, Apr 27, 2017 at 10:54 AM Erick Erickson 
> wrote:
>
>> bq: but is there any possibility that the values stick around until
>> there is a segment merge for some strange reason
>>
>> There better not be or it's a bug. Things will stick around until
>> you issue a commit, is there any chance that's the problem?
>>
>> If you can document the exact steps, maybe we can reproduce
>> the issue and raise a JIRA.
>>
>> Best,
>> Erick
>>
>> On Thu, Apr 27, 2017 at 6:03 AM, Chris Ulicny  wrote:
>> > Yeah, something's not quite right somewhere. We never even considered
>> > in-place updates an option since it requires the fields to be
>> non-indexed
>> > and non-stored. Our schemas never have any field that satisfies those
>> two
>> > conditions let alone the other necessary ones.
>> >
>> > I went ahead and tested the atomic updates on different textfields, and
>> I
>> > still can't get the indexed but not-stored othertext_field to
>> disappear. So
>> > far set, add, and remove updates do not change it regardless of what the
>> > fields are in the atomic update.
>> >
>> > It would be extraordinarily useful if this update behavior is now
>> expected
>> > (but not currently documented) functionality.
>> >
>> > I'm not too familiar with the nitty-gritty details of merging segment
>> > files, but is there any possibility that the values stick around until
>> > there is a segment merge for some strange reason?
>> >
>> > On Thu, Apr 27, 2017 at 12:59 AM Dorian Hoxha 
>> > wrote:
>> >
>> >> @Chris,
>> >> According to doc-link-above, only INC,SET are in-place-updates. And
>> only
>> >> when they're not indexed/stored, while your 'integer-field' is. So
>> still
>> >> shenanigans in there somewhere (docs,your-code,your-test,solr-code).
>> >>
>> >> On Thu, Apr 27, 2017 at 2:04 AM, Chris Ulicny 
>> wrote:
>> >>
>> >> > That's probably it then. None of the atomic updates that I've tried
>> have
>> >> > been on TextFields. I'll give the TextField atomic update to verify
>> that
>> >> it
>> >> > will clear the other field.
>> >> >
>> >> > Has this functionality been consistent since atomic updates were
>> >> > introduced, or is this a side effect of some other change? It'd be
>> very
>> >> > convenient for us to use this functionality as it currently works,
>> but if
>> >> > it's something that prevents us from upgrading versions in the
>> future, we
>> >> > should probably avoid expecting it to work.
>> >> >
>> >> > On Wed, Apr 26, 2017 at 7:36 PM Ishan Chattopadhyaya <
>> >> > ichattopadhy...@gmail.com> wrote:
>> >> >
>> >> > > > Hmm, interesting. I can imagine that as long as you're updating
>> >> > > > docValues fields, the other_text field would be there. But the
>> >> instant
>> >> > > > you updated a non-docValues field (text_field in your example)
>> the
>> >> > > > other_text field would disappear
>> >> > >
>> >> > > I can confirm this. When in-place updates to DV fields are done,
>> the
>> >> rest
>> >> > > of the fields remain as they were.
>> >> > >
>> >> > > On Thu, Apr 27, 2017 at 4:33 AM, Erick Erickson <
>> >> erickerick...@gmail.com
>> >> > >
>> >> > > wrote:
>> >> > >
>> >> > > > Hmm, interesting. I can imagine that as long as you're updating
>> >> > > > docValues fields, the other_text field would be there. But the
>> >> instant
>> >> > > > you updated a non-docValues field (text_field in your example)
>> the
>> >> > > > other_text field would disappear.
>> >> > > >
>> >> > > > I DO NOT KNOW this for a fact, but I'm asking people who do.
>> >> > > >
>> >> > > > On Wed, Apr 26, 2017 at 2:13 PM, Dorian Hoxha <
>> >> dorian.ho...@gmail.com>
>> >> > > > wrote:
>> >> > > > > There are In Place Updates, but according to docs they still
>> >> shouldn't
>> >> > > > work
>> >> > > > > in your case:
>> >> > > > > https://cwiki.apache.org/confluence/display/solr/
>> >> > > > Updating+Parts+of+Documents
>> >> > > > >
>> >> > > > > On Wed, Apr 26, 2017 at 10:36 PM, Chris Ulicny
>> 
>> >> > > wrote:
>> >> > > > >
>> >> > > > >> That's the thing I'm curious about though. As I mentioned in
>> the
>> >> > first
>> >> > > > >> post, I've 

Re: Split Shard not working

2017-04-27 Thread Walter Underwood
What is the message in the log when it crashes?

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Apr 27, 2017, at 10:10 AM, Vijay Kokatnur  wrote:
> 
> We recently upgraded 4.5 index to 6.5 using IndexUpgrader.  The index size
> is around 600 GB on disk.  When we try to split it using SPLITSHARD, it
> creates two new sub shards on the node and eventually crashes before
> completing the split.  After restart, the original shard size is around 100
> GB and each sub shards are around 2 GB.  They are no doubt not fully
> constructed.
> 
> We tried different heap settings as well- 15, 20 and 30 GB, but it always
> crashed.  RAM is about 256GB.
> 
> What's going on here?
> Anyone faced this situation before?



Split Shard not working

2017-04-27 Thread Vijay Kokatnur
We recently upgraded 4.5 index to 6.5 using IndexUpgrader.  The index size
is around 600 GB on disk.  When we try to split it using SPLITSHARD, it
creates two new sub shards on the node and eventually crashes before
completing the split.  After restart, the original shard size is around 100
GB and each sub shards are around 2 GB.  They are no doubt not fully
constructed.

We tried different heap settings as well- 15, 20 and 30 GB, but it always
crashed.  RAM is about 256GB.

What's going on here?
Anyone faced this situation before?
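
One way to get more information out of a long-running split is to run it
asynchronously and poll REQUESTSTATUS, whose output usually carries the
failure message. This is only a sketch, and the collection and shard names
are assumptions:

import time
import requests

ADMIN = "http://localhost:8983/solr/admin/collections"

requests.get(ADMIN, params={"action": "SPLITSHARD", "collection": "mycoll",
                            "shard": "shard1", "async": "split-1"})

while True:
    time.sleep(60)
    state = requests.get(ADMIN, params={
        "action": "REQUESTSTATUS", "requestid": "split-1",
        "wt": "json"}).json()
    if state["status"]["state"] in ("completed", "failed"):
        print(state["status"])
        break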


Re: 1 main collection or multiple smaller collections?

2017-04-27 Thread Walter Underwood
Design backwards from the search result pages (SRP). Make flat schema(s) with 
the fields you will search and display.

One example is the schema I used at Netflix. I used one collection to hold
movies, people (actors), and genres. There were collisions between the integer
IDs, so movie IDs were prefixed with “m”, people with “p”, and genres with “g”.
The searched fields were “title” and “description”. There was also a “type”
field which was “movie”, “person”, or “genre”. There was also a field for the
database ID (without the prefix).

A movie SRP used an “fq” filter of “type:movie”, and so on for other SRPs. 
There were a few other filters, like G-rated movies or streaming, DVD, HD DVD, 
or Bluray.

The full index was under 350K documents.
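
A small sketch of that ID-prefixing trick (the helper and field names below
are hypothetical):

PREFIX = {"movie": "m", "person": "p", "genre": "g"}

def solr_id(doc_type, db_id):
    # "m12345" for a movie, "p12345" for a person, "g12345" for a genre
    return PREFIX[doc_type] + str(db_id)

doc = {"id": solr_id("movie", 12345), "type": "movie", "db_id": 12345,
       "title": "Example Title", "description": "Example description"}
# Index as usual; a movie SRP then filters with fq=type:movie.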

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Apr 27, 2017, at 10:01 AM, Rick Leir  wrote:
> 
> Does it make sense to use nested documents here? Products could be nested in 
> a supplier document perhaps.
> 
> Alternately, consider de-normalizing "til it hurts". A product doc might be 
> able to contain supplier info.
> 
> On April 27, 2017 8:50:59 AM EDT, Shawn Heisey  wrote:
>> On 4/26/2017 11:57 PM, Derek Poh wrote:
>>> There are some common fields between them.
>>> At the source data end (database), the supplier info and product info
>>> are updated separately. In this regard, should I separate them?
>>> If it's in 1 single collection, when there are updates to only the
>>> supplier info, the product info will be indexed again even though there
>>> are no updates to them. Is my reasoning valid?
>>> 
>>> 
>>> On 4/27/2017 1:33 PM, Walter Underwood wrote:
 Do they have the same fields or different fields? Are they updated
 separately or together?
 
 If they have the same fields and are updated together, I’d put them
 in the same collection. Otherwise, probably separate. 
>> 
>> Walter's statements are right on the money, you just might need a
>> little
>> more detail.
>> 
>> There are two critical details that decide whether you even CAN
>> combine different data in a single index: One is that all types of
>> records must use the same field (the uniqueKey field) to determine
>> uniqueness, and the value of this field must be unique across the
>> entire
>> dataset.  The other is that there SHOULD be a field with a name like
>> "type" that your search client can use to differentiate the different
>> kinds of documents.  This type field is not necessary, but it does make
>> things easier.
>> 
>> Assuming you CAN combine documents, there is still the question of
>> whether you SHOULD.  If the fields that you will commonly search are
>> the
>> same between the different kinds of documents, and if people want to be
>> able to do one search and get more than one of the document types you
>> are indexing, then it is something you should consider.  If people will
>> only ever search one type of document, you should probably keep them in
>> separate indexes to keep things cleaner.
>> 
>> Thanks,
>> Shawn
> 
> -- 
> Sorry for being brief. Alternate email is rickleir at yahoo dot com



Re: 1 main collection or multiple smaller collections?

2017-04-27 Thread Rick Leir
Does it make sense to use nested documents here? Products could be nested in a 
supplier document perhaps.

Alternately, consider de-normalizing "til it hurts". A product doc might be 
able to contain supplier info.
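
A minimal sketch of the nested-documents idea using Solr's _childDocuments_
JSON syntax (URL and field names are assumptions):

import requests

supplier = {
    "id": "S1", "type": "supplier", "supplier_name": "Acme",
    "_childDocuments_": [
        {"id": "P1", "type": "product", "product_description": "Widget"},
        {"id": "P2", "type": "product", "product_description": "Gadget"},
    ],
}
requests.post("http://localhost:8983/solr/mycoll/update?commit=true",
              json=[supplier])

# Block-join query: return the product children of suppliers matching "acme".
resp = requests.get("http://localhost:8983/solr/mycoll/select", params={
    "q": "{!child of=type:supplier}supplier_name:acme", "wt": "json"}).json()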

On April 27, 2017 8:50:59 AM EDT, Shawn Heisey  wrote:
>On 4/26/2017 11:57 PM, Derek Poh wrote:
>> There are some common fields between them.
>> At the source data end (database), the supplier info and product info
>> are updated separately. In this regard, should I separate them?
>> If it's in 1 single collection, when there are updates to only the
>> supplier info, the product info will be indexed again even though there
>> are no updates to them. Is my reasoning valid?
>>
>>
>> On 4/27/2017 1:33 PM, Walter Underwood wrote:
>>> Do they have the same fields or different fields? Are they updated
>>> separately or together?
>>>
>>> If they have the same fields and are updated together, I’d put them
>>> in the same collection. Otherwise, probably separate. 
>
>Walter's statements are right on the money, you just might need a
>little
>more detail.
>
>There are two critical details that decide whether you even CAN
>combine different data in a single index: One is that all types of
>records must use the same field (the uniqueKey field) to determine
>uniqueness, and the value of this field must be unique across the
>entire
>dataset.  The other is that there SHOULD be a field with a name like
>"type" that your search client can use to differentiate the different
>kinds of documents.  This type field is not necessary, but it does make
>things easier.
>
>Assuming you CAN combine documents, there is still the question of
>whether you SHOULD.  If the fields that you will commonly search are
>the
>same between the different kinds of documents, and if people want to be
>able to do one search and get more than one of the document types you
>are indexing, then it is something you should consider.  If people will
>only ever search one type of document, you should probably keep them in
>separate indexes to keep things cleaner.
>
>Thanks,
>Shawn

-- 
Sorry for being brief. Alternate email is rickleir at yahoo dot com 

Re: size-estimator-lucene-solr.xls error in disk space estimator

2017-04-27 Thread Matteo Grolla
Right Alessandro, that's another bug.
Cheers

2017-04-27 12:30 GMT+02:00 alessandro.benedetti :

> +1
> I would add that what is called "Avg. Document Size (KB)" seems to me more
> like "Avg. Field Size (KB)".
> Cheers
>
>
>
> -
> ---
> Alessandro Benedetti
> Search Consultant, R&D Software Engineer, Director
> Sease Ltd. - www.sease.io
> --
> View this message in context: http://lucene.472066.n3.
> nabble.com/size-estimator-lucene-solr-xls-error-in-disk-space-estimator-
> tp4332156p4332160.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Atomic Updates

2017-04-27 Thread Chris Ulicny
I'm sending commit=true with every update while testing. I'll write up the
tests and see if someone else can reproduce it.

On Thu, Apr 27, 2017 at 10:54 AM Erick Erickson 
wrote:

> bq: but is there any possibility that the values stick around until
> there is a segment merge for some strange reason
>
> There better not be or it's a bug. Things will stick around until
> you issue a commit, is there any chance that's the problem?
>
> If you can document the exact steps, maybe we can reproduce
> the issue and raise a JIRA.
>
> Best,
> Erick
>
> On Thu, Apr 27, 2017 at 6:03 AM, Chris Ulicny  wrote:
> > Yeah, something's not quite right somewhere. We never even considered
> > in-place updates an option since it requires the fields to be non-indexed
> > and non-stored. Our schemas never have any field that satisfies those two
> > conditions let alone the other necessary ones.
> >
> > I went ahead and tested the atomic updates on different textfields, and I
> > still can't get the indexed but not-stored othertext_field to disappear.
> So
> > far set, add, and remove updates do not change it regardless of what the
> > fields are in the atomic update.
> >
> > It would be extraordinarily useful if this update behavior is now
> expected
> > (but not currently documented) functionality.
> >
> > I'm not too familiar with the nitty-gritty details of merging segment
> > files, but is there any possibility that the values stick around until
> > there is a segment merge for some strange reason?
> >
> > On Thu, Apr 27, 2017 at 12:59 AM Dorian Hoxha 
> > wrote:
> >
> >> @Chris,
> >> According to doc-link-above, only INC,SET are in-place-updates. And only
> >> when they're not indexed/stored, while your 'integer-field' is. So still
> >> shenanigans in there somewhere (docs,your-code,your-test,solr-code).
> >>
> >> On Thu, Apr 27, 2017 at 2:04 AM, Chris Ulicny  wrote:
> >>
> >> > That's probably it then. None of the atomic updates that I've tried
> have
> >> > been on TextFields. I'll give the TextField atomic update to verify
> that
> >> it
> >> > will clear the other field.
> >> >
> >> > Has this functionality been consistent since atomic updates were
> >> > introduced, or is this a side effect of some other change? It'd be
> very
> >> > convenient for us to use this functionality as it currently works,
> but if
> >> > it's something that prevents us from upgrading versions in the
> future, we
> >> > should probably avoid expecting it to work.
> >> >
> >> > On Wed, Apr 26, 2017 at 7:36 PM Ishan Chattopadhyaya <
> >> > ichattopadhy...@gmail.com> wrote:
> >> >
> >> > > > Hmm, interesting. I can imagine that as long as you're updating
> >> > > > docValues fields, the other_text field would be there. But the
> >> instant
> >> > > > you updated a non-docValues field (text_field in your example) the
> >> > > > other_text field would disappear
> >> > >
> >> > > I can confirm this. When in-place updates to DV fields are done, the
> >> rest
> >> > > of the fields remain as they were.
> >> > >
> >> > > On Thu, Apr 27, 2017 at 4:33 AM, Erick Erickson <
> >> erickerick...@gmail.com
> >> > >
> >> > > wrote:
> >> > >
> >> > > > Hmm, interesting. I can imagine that as long as you're updating
> >> > > > docValues fields, the other_text field would be there. But the
> >> instant
> >> > > > you updated a non-docValues field (text_field in your example) the
> >> > > > other_text field would disappear.
> >> > > >
> >> > > > I DO NOT KNOW this for a fact, but I'm asking people who do.
> >> > > >
> >> > > > On Wed, Apr 26, 2017 at 2:13 PM, Dorian Hoxha <
> >> dorian.ho...@gmail.com>
> >> > > > wrote:
> >> > > > > There are In Place Updates, but according to docs they still
> >> shouldn't
> >> > > > work
> >> > > > > in your case:
> >> > > > > https://cwiki.apache.org/confluence/display/solr/
> >> > > > Updating+Parts+of+Documents
> >> > > > >
> >> > > > > On Wed, Apr 26, 2017 at 10:36 PM, Chris Ulicny  >
> >> > > wrote:
> >> > > > >
> >> > > > >> That's the thing I'm curious about though. As I mentioned in
> the
> >> > first
> >> > > > >> post, I've already tried a few tests, and the value seems to
> still
> >> > be
> >> > > > >> present after an atomic update.
> >> > > > >>
> >> > > > >> I haven't exhausted all possible atomic updates, but 'set' and
> >> 'add'
> >> > > > seem
> >> > > > >> to preserve the non-stored text field.
> >> > > > >>
> >> > > > >> Thanks,
> >> > > > >> Chris
> >> > > > >>
> >> > > > >> On Wed, Apr 26, 2017 at 4:07 PM Dorian Hoxha <
> >> > dorian.ho...@gmail.com>
> >> > > > >> wrote:
> >> > > > >>
> >> > > > >> > You'll lose the data in that field. Try doing a commit and it
> >> > should
> >> > > > >> > happen.
> >> > > > >> >
> >> > > > >> > On Wed, Apr 26, 2017 at 9:50 PM, Chris Ulicny
>  >> >
> >> > > > wrote:
> >> > > > >> >
> >> > > > >> > > Thanks Shawn, I didn't realize docValues were enabled 

Re: Atomic Updates

2017-04-27 Thread Erick Erickson
bq: but is there any possibility that the values stick around until
there is a segment merge for some strange reason

There better not be or it's a bug. Things will stick around until
you issue a commit, is there any chance that's the problem?

If you can document the exact steps, maybe we can reproduce
the issue and raise a JIRA.

Best,
Erick

On Thu, Apr 27, 2017 at 6:03 AM, Chris Ulicny  wrote:
> Yeah, something's not quite right somewhere. We never even considered
> in-place updates an option since it requires the fields to be non-indexed
> and non-stored. Our schemas never have any field that satisfies those two
> conditions let alone the other necessary ones.
>
> I went ahead and tested the atomic updates on different textfields, and I
> still can't get the indexed but not-stored othertext_field to disappear. So
> far set, add, and remove updates do not change it regardless of what the
> fields are in the atomic update.
>
> It would be extraordinarily useful if this update behavior is now expected
> (but not currently documented) functionality.
>
> I'm not too familiar with the nitty-gritty details of merging segment
> files, but is there any possibility that the values stick around until
> there is a segment merge for some strange reason?
>
> On Thu, Apr 27, 2017 at 12:59 AM Dorian Hoxha 
> wrote:
>
>> @Chris,
>> According to doc-link-above, only INC,SET are in-place-updates. And only
>> when they're not indexed/stored, while your 'integer-field' is. So still
>> shenanigans in there somewhere (docs,your-code,your-test,solr-code).
>>
>> On Thu, Apr 27, 2017 at 2:04 AM, Chris Ulicny  wrote:
>>
>> > That's probably it then. None of the atomic updates that I've tried have
>> > been on TextFields. I'll give the TextField atomic update to verify that
>> it
>> > will clear the other field.
>> >
>> > Has this functionality been consistent since atomic updates were
>> > introduced, or is this a side effect of some other change? It'd be very
>> > convenient for us to use this functionality as it currently works, but if
>> > it's something that prevents us from upgrading versions in the future, we
>> > should probably avoid expecting it to work.
>> >
>> > On Wed, Apr 26, 2017 at 7:36 PM Ishan Chattopadhyaya <
>> > ichattopadhy...@gmail.com> wrote:
>> >
>> > > > Hmm, interesting. I can imagine that as long as you're updating
>> > > > docValues fields, the other_text field would be there. But the
>> instant
>> > > > you updated a non-docValues field (text_field in your example) the
>> > > > other_text field would disappear
>> > >
>> > > I can confirm this. When in-place updates to DV fields are done, the
>> rest
>> > > of the fields remain as they were.
>> > >
>> > > On Thu, Apr 27, 2017 at 4:33 AM, Erick Erickson <
>> erickerick...@gmail.com
>> > >
>> > > wrote:
>> > >
>> > > > Hmm, interesting. I can imagine that as long as you're updating
>> > > > docValues fields, the other_text field would be there. But the
>> instant
>> > > > you updated a non-docValues field (text_field in your example) the
>> > > > other_text field would disappear.
>> > > >
>> > > > I DO NOT KNOW this for a fact, but I'm asking people who do.
>> > > >
>> > > > On Wed, Apr 26, 2017 at 2:13 PM, Dorian Hoxha <
>> dorian.ho...@gmail.com>
>> > > > wrote:
>> > > > > There are In Place Updates, but according to docs they still
>> shouldn't
>> > > > work
>> > > > > in your case:
>> > > > > https://cwiki.apache.org/confluence/display/solr/
>> > > > Updating+Parts+of+Documents
>> > > > >
>> > > > > On Wed, Apr 26, 2017 at 10:36 PM, Chris Ulicny 
>> > > wrote:
>> > > > >
>> > > > >> That's the thing I'm curious about though. As I mentioned in the
>> > first
>> > > > >> post, I've already tried a few tests, and the value seems to still
>> > be
>> > > > >> present after an atomic update.
>> > > > >>
>> > > > >> I haven't exhausted all possible atomic updates, but 'set' and
>> 'add'
>> > > > seem
>> > > > >> to preserve the non-stored text field.
>> > > > >>
>> > > > >> Thanks,
>> > > > >> Chris
>> > > > >>
>> > > > >> On Wed, Apr 26, 2017 at 4:07 PM Dorian Hoxha <
>> > dorian.ho...@gmail.com>
>> > > > >> wrote:
>> > > > >>
>> > > > >> > You'll lose the data in that field. Try doing a commit and it
>> > should
>> > > > >> > happen.
>> > > > >> >
>> > > > >> > On Wed, Apr 26, 2017 at 9:50 PM, Chris Ulicny > >
>> > > > wrote:
>> > > > >> >
>> > > > >> > > Thanks Shawn, I didn't realize docValues were enabled by
>> default
>> > > > now.
>> > > > >> > > That's very convenient and probably makes a lot of the schemas
>> > > we've
>> > > > >> been
>> > > > >> > > making excessively verbose.
>> > > > >> > >
>> > > > >> > > This is on 6.3.0. Do you know what the first version was that
>> > they
>> > > > >> added
>> > > > >> > > the docValues by default for non-Text field?
>> > > > >> > >
>> > > > >> > > However, that shouldn't apply to this since I'm concerned
>> 

Blocked ConcurrentUpdateSolrClient

2017-04-27 Thread Christian Belka
Hello 

I am trying to update large numbers of documents (mostly ADD/DELETE) through
various threads.

After a certain amount of time (a few hours) all my threads get stuck at 


taskExecutor-46" prio=5 tid=0x268 nid=0x10c BLOCKED owned by 
taskExecutor-9 Id=230 - stats: cpu=2788 blk=-1 wait=-1
java.lang.Thread.State: BLOCKED
at 
org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient.blockUntilFinished(ConcurrentUpdateSolrClient.java:429)
- waiting to lock 
org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer@27a59c5b
at 
org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient.request(ConcurrentUpdateSolrClient.java:359)
at 
org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:149)
at 
org.apache.solr.client.solrj.SolrClient.deleteById(SolrClient.java:753)
at 
org.apache.solr.client.solrj.SolrClient.deleteById(SolrClient.java:716)
at 
org.apache.solr.client.solrj.SolrClient.deleteById(SolrClient.java:731)

I am using solr 5.5.4. 

Does anyone else have this problem ? 

Best regards 



Re: Indexing I/O errors and CorruptIndex messages

2017-04-27 Thread simon
Nope ... huge file system (600 GB) only 50% full, and a complete index would
be 80 GB max.

On Wed, Apr 26, 2017 at 4:04 PM, Erick Erickson 
wrote:

> Disk space issue? Lucene requires at least as much free disk space as
> your index size. Note that the disk full issue will be transient, IOW
> if you look now and have free space it still may have been all used up
> but had some space reclaimed.
>
> Best,
> Erick
>
> On Wed, Apr 26, 2017 at 12:02 PM, simon  wrote:
> > reposting this as the problem described is happening again and there were
> > no responses to the original email. Anyone?
> > 
> > I'm seeing an odd error during indexing for which I can't find any
> reason.
> >
> > The relevant solr log entry:
> >
> > 2017-03-24 19:09:35.363 ERROR (commitScheduler-30-thread-1) [
> > x:build0324] o.a.s.u.CommitTracker auto commit
> > error...:java.io.EOFException: read past EOF:  MMapIndexInput(path="/
> > indexes/solrindexes/build0324/index/_4ku.fdx")
> >  at org.apache.lucene.store.ByteBufferIndexInput.readByte(
> > ByteBufferIndexInput.java:75)
> > ...
> > Suppressed: org.apache.lucene.index.CorruptIndexException: checksum
> > status indeterminate: remaining=0, please run checkindex for more details
> > (resource= BufferedChecksumIndexInput(MMapIndexInput(path="/indexes/
> > solrindexes/build0324/index/_4ku.fdx")))
> >  at org.apache.lucene.codecs.CodecUtil.checkFooter(
> > CodecUtil.java:451)
> >  at org.apache.lucene.codecs.compressing.
> > CompressingStoredFieldsReader.<init>(CompressingStoredFieldsReader.
> java:140)
> >  followed within a few seconds by
> >
> >  2017-03-24 19:09:56.402 ERROR (commitScheduler-31-thread-1) [
> > x:build0324] o.a.s.u.CommitTracker auto commit
> > error...:org.apache.solr.common.SolrException:
> > Error opening new searcher
> > at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1820)
> > at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1931)
> > ...
> > Caused by: java.io.EOFException: read past EOF:
> > MMapIndexInput(path="/indexes/solrindexes/build0324/index/_4ku.fdx")
> > at org.apache.lucene.store.ByteBufferIndexInput.readByte(
> > ByteBufferIndexInput.java:75)
> >
> > This error is repeated a few times as the indexing continued and further
> > autocommits were triggered.
> >
> > I stopped the indexing process, made a backup snapshot of the index,
> >  restarted indexing at a checkpoint, and everything then completed
> without
> > further incidents
> >
> > I ran checkIndex on the saved snapshot and it reported no errors
> > whatsoever. Operations on the complete index (inclcuing an optimize and
> > several query scripts) have all been error-free.
> >
> > Some background:
> >  Solr information from the beginning of the checkindex output:
> >  ---
> >  Opening index @ /indexes/solrindexes/build0324.bad/index
> >
> > Segments file=segments_9s numSegments=105 version=6.3.0
> > id=7m1ldieoje0m6sljp7xocbz9l userData={commitTimeMSec=1490400514324}
> >   1 of 105: name=_be maxDoc=1227144
> > version=6.3.0
> > id=7m1ldieoje0m6sljp7xocburb
> > codec=Lucene62
> > compound=false
> > numFiles=14
> > size (MB)=4,926.186
> > diagnostics = {os=Linux, java.vendor=Oracle Corporation,
> > java.version=1.8.0_45, java.vm.version=25.45-b02, lucene.version=6.3.0,
> > mergeMaxNumSegments=-1, os.arch=amd64, java.runtime.version=1.8.0_45-
> b13,
> > source=merge, mergeFactor=19, os.version=3.10.0-229.1.2.el7.x86_64,
> > timestamp=1490380905920}
> > no deletions
> > test: open reader.OK [took 0.176 sec]
> > test: check integrity.OK [took 37.399 sec]
> > test: check live docs.OK [took 0.000 sec]
> > test: field infos.OK [49 fields] [took 0.000 sec]
> > test: field norms.OK [17 fields] [took 0.030 sec]
> > test: terms, freq, prox...OK [14568108 terms; 612537186 terms/docs
> > pairs; 801208966 tokens] [took 30.005 sec]
> > test: stored fields...OK [150164874 total field count; avg 122.4
> > fields per doc] [took 35.321 sec]
> > test: term vectorsOK [4804967 total term vector count; avg
> 3.9
> > term/freq vector fields per doc] [took 55.857 sec]
> > test: docvalues...OK [4 docvalues fields; 0 BINARY; 1
> NUMERIC;
> > 2 SORTED; 0 SORTED_NUMERIC; 1 SORTED_SET] [took 0.954 sec]
> > test: points..OK [0 fields, 0 points] [took 0.000 sec]
> >   -
> >
> >   The indexing process is a Python script (using the scorched Python
> > client) which spawns multiple instances of itself, in this case 6, so
> there
> > are definitely concurrent calls ( to /update/json )
> >
> > Solrconfig and the schema have not been changed for several months, during
> > which time many ingests have been done, and the documents which were being
> > indexed at the time of the error have been indexed before without problems,
> > so I don't think it's a 

Re: Atomic Updates

2017-04-27 Thread Chris Ulicny
Yeah, something's not quite right somewhere. We never even considered
in-place updates an option since they require the fields to be non-indexed
and non-stored. Our schemas never have any field that satisfies those two
conditions, let alone the other necessary ones.

I went ahead and tested the atomic updates on different textfields, and I
still can't get the indexed but not-stored othertext_field to disappear. So
far set, add, and remove updates do not change it regardless of what the
fields are in the atomic update.

It would be extraordinarily useful if this update behavior is now expected
(but not currently documented) functionality.

I'm not too familiar with the nitty-gritty details of merging segment
files, but is there any possibility that the values stick around until
there is a segment merge for some strange reason?
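
For reference, a minimal atomic "set" update of the kind being tested looks
like this (a sketch; collection name, doc id, and field value are
hypothetical):

  curl -X POST 'http://localhost:8983/solr/mycollection/update?commit=true' \
    -H 'Content-Type: application/json' \
    -d '[{"id": "1", "text_field": {"set": "new value"}}]'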

On Thu, Apr 27, 2017 at 12:59 AM Dorian Hoxha 
wrote:

> @Chris,
> According to doc-link-above, only INC,SET are in-place-updates. And only
> when they're not indexed/stored, while your 'integer-field' is. So still
> shenanigans in there somewhere (docs,your-code,your-test,solr-code).
>
> On Thu, Apr 27, 2017 at 2:04 AM, Chris Ulicny  wrote:
>
> > That's probably it then. None of the atomic updates that I've tried have
> > been on TextFields. I'll give the TextField atomic update to verify that
> it
> > will clear the other field.
> >
> > Has this functionality been consistent since atomic updates were
> > introduced, or is this a side effect of some other change? It'd be very
> > convenient for us to use this functionality as it currently works, but if
> > it's something that prevents us from upgrading versions in the future, we
> > should probably avoid expecting it to work.
> >
> > On Wed, Apr 26, 2017 at 7:36 PM Ishan Chattopadhyaya <
> > ichattopadhy...@gmail.com> wrote:
> >
> > > > Hmm, interesting. I can imagine that as long as you're updating
> > > > docValues fields, the other_text field would be there. But the
> instant
> > > > you updated a non-docValues field (text_field in your example) the
> > > > other_text field would disappear
> > >
> > > I can confirm this. When in-place updates to DV fields are done, the
> rest
> > > of the fields remain as they were.
> > >
> > > On Thu, Apr 27, 2017 at 4:33 AM, Erick Erickson <
> erickerick...@gmail.com
> > >
> > > wrote:
> > >
> > > > Hmm, interesting. I can imagine that as long as you're updating
> > > > docValues fields, the other_text field would be there. But the
> instant
> > > > you updated a non-docValues field (text_field in your example) the
> > > > other_text field would disappear.
> > > >
> > > > I DO NOT KNOW this for a fact, but I'm asking people who do.
> > > >
> > > > On Wed, Apr 26, 2017 at 2:13 PM, Dorian Hoxha <
> dorian.ho...@gmail.com>
> > > > wrote:
> > > > > There are In Place Updates, but according to docs they still
> > > > > shouldn't work in your case:
> > > > > https://cwiki.apache.org/confluence/display/solr/
> > > > Updating+Parts+of+Documents
> > > > >
> > > > > On Wed, Apr 26, 2017 at 10:36 PM, Chris Ulicny 
> > > wrote:
> > > > >
> > > > >> That's the thing I'm curious about though. As I mentioned in the
> > first
> > > > >> post, I've already tried a few tests, and the value seems to still
> > be
> > > > >> present after an atomic update.
> > > > >>
> > > > >> I haven't exhausted all possible atomic updates, but 'set' and
> 'add'
> > > > seem
> > > > >> to preserve the non-stored text field.
> > > > >>
> > > > >> Thanks,
> > > > >> Chris
> > > > >>
> > > > >> On Wed, Apr 26, 2017 at 4:07 PM Dorian Hoxha <
> > dorian.ho...@gmail.com>
> > > > >> wrote:
> > > > >>
> > > > >> > You'll lose the data in that field. Try doing a commit and it
> > should
> > > > >> > happen.
> > > > >> >
> > > > >> > On Wed, Apr 26, 2017 at 9:50 PM, Chris Ulicny  >
> > > > wrote:
> > > > >> >
> > > > >> > > Thanks Shawn, I didn't realize docValues were enabled by
> default
> > > > now.
> > > > >> > > That's very convenient and probably makes a lot of the schemas
> > > we've
> > > > >> been
> > > > >> > > making excessively verbose.
> > > > >> > >
> > > > >> > > This is on 6.3.0. Do you know what the first version was that
> > they
> > > > >> added
> > > > >> > > the docValues by default for non-Text field?
> > > > >> > >
> > > > >> > > However, that shouldn't apply to this since I'm concerned
> with a
> > > > >> > non-stored
> > > > >> > > TextField without docValues enabled.
> > > > >> > >
> > > > >> > > Best,
> > > > >> > > Chris
> > > > >> > >
> > > > >> > > On Wed, Apr 26, 2017 at 3:36 PM Shawn Heisey <
> > apa...@elyograg.org
> > > >
> > > > >> > wrote:
> > > > >> > >
> > > > >> > > > On 4/25/2017 1:40 PM, Chris Ulicny wrote:
> > > > >> > > > > Hello all,
> > > > >> > > > >
> > > > >> > > > > Suppose I have the following fields in a document and
> > populate
> > > > all
> > > > >> 4
> > > > >> > > > fields
> > > > >> > > > > for 

Re: Spatial Search: can not use FieldCache on a field which is neither indexed nor has doc values: latitudeLongitude_0_coordinate

2017-04-27 Thread freddy79
It does work with "solr.LatLonPointSpatialField" instead of
"solr.LatLonType".



But why not with "solr.LatLonType"?





Re: Update to Solr 6 - Amazon EC2 high CPU SYS usage

2017-04-27 Thread Shawn Heisey
On 4/27/2017 3:03 AM, Elodie Sannier wrote:
> We have migrated from Solr 5.4.1 to Solr 6.4.0 on Amazon EC2 and we have
> a high CPU SYS usage and it drastically decreases the Solr performance.
>
> The JVM version (java-1.8.0-openjdk-1.8.0.131-0.b11.el6_9.x86_64), the
> Jetty version (9.3.14) and the OS version (CentOS 6.9) have not changed
> with the Solr upgrade. 

I can almost guarantee that this is the issue you are experiencing:

https://issues.apache.org/jira/browse/SOLR-10130

Upgrade to 6.4.2 or later.

Thanks,
Shawn



Re: 1 main collection or multiple smaller collections?

2017-04-27 Thread Shawn Heisey
On 4/26/2017 11:57 PM, Derek Poh wrote:
> There are some common fields between them.
> At the source data end (database), the supplier info and product info
> are updated separately. In this regard, I should separate them?
> If it's In 1 single collection, when there are updatesto only the
> supplier info,the product info will be index again even though there
> is noupdates to them, Is my reasoning valid?
>
>
> On 4/27/2017 1:33 PM, Walter Underwood wrote:
>> Do they have the same fields or different fields? Are they updated
>> separately or together?
>>
>> If they have the same fields and are updated together, I’d put them
>> in the same collection. Otherwise, probably separate. 

Walter's statements are right on the money, you just might need a little
more detail.

There are two critical details that decide whether you even CAN
combine different data in a single index: One is that all types of
records must use the same field (the uniqueKey field) to determine
uniqueness, and the value of this field must be unique across the entire
dataset.  The other is that there SHOULD be a field with a name like
"type" that your search client can use to differentiate the different
kinds of documents.  This type field is not necessary, but it does make
things easier.
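
For illustration, two document types sharing one index might look like this
(hypothetical field values; note the prefixed uniqueKey and the type field):

  {"id": "supplier-123", "type": "supplier", "name": "Acme Corp"}
  {"id": "product-456", "type": "product", "name": "Widget", "supplier": "supplier-123"}

A filter like fq=type:product then restricts a search to one kind of document.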

Assuming you CAN combine documents, there is still the question of
whether you SHOULD.  If the fields that you will commonly search are the
same between the different kinds of documents, and if people want to be
able to do one search and get more than one of the document types you
are indexing, then it is something you should consider.  If people will
only ever search one type of document, you should probably keep them in
separate indexes to keep things cleaner.

Thanks,
Shawn



Spatial Search: can not use FieldCache on a field which is neither indexed nor has doc values: latitudeLongitude_0_coordinate

2017-04-27 Thread freddy79
Hi,

when doing a query with spatial search I get the error: can not use
FieldCache on a field which is neither indexed nor has doc values:
latitudeLongitude_0_coordinate

*SOLR Version:* 6.1.0
*schema.xml:*




*Query:*
http://localhost:8983/solr/career_educationVacancyLocation/select?q=*:*&fq={!geofilt}&sfield=latitudeLongitude&pt=48.15,16.23&d=10

*Error Message:*
can not use FieldCache on a field which is neither indexed nor has doc
values: latitudeLongitude_0_coordinate

What is wrong? Thanks.





[ANNOUNCE] Apache Solr 6.5.1 released

2017-04-27 Thread jim ferenczi
27 April 2017, Apache Solr™ 6.5.1 available


The Lucene PMC is pleased to announce the release of Apache Solr 6.5.1


Solr is the popular, blazing fast, open source NoSQL search platform from
the Apache Lucene project. Its major features include powerful full-text
search, hit highlighting, faceted search and analytics, rich document
parsing, geospatial search, extensive REST APIs as well as parallel SQL.
Solr is enterprise grade, secure and highly scalable, providing fault
tolerant distributed search and indexing, and powers the search and
navigation features of many of the world's largest internet sites.


This release includes 11 bug fixes since the 6.5.0 release. Some of the
major fixes are:


* bin\solr.cmd delete and healthcheck now works again; fixed continuation
chars ^


* Fix debug related NullPointerException in solr/contrib/ltr
OriginalScoreFeature class.


* The JSON output of /admin/metrics is fixed to write the container as a
map (SimpleOrderedMap) instead of an array (NamedList).


* On 'downnode', lots of wasteful mutations are done to ZK.


* Fix params persistence for solr/contrib/ltr (MinMax|Standard)Normalizer
classes.


* The fetch() streaming expression wouldn't work if a value included query
syntax chars (like :+-). Fixed, and enhanced the generated query to not
pollute the queryCache.


* Disable graph query production via schema configuration. This fixes broken
queries for ShingleFilter-containing query-time analyzers when request param
sow=false.


* Fix indexed="false" on numeric PointFields


* SQL AVG function mis-interprets field type.


* SQL interface does not use client cache.


* edismax with sow=false fails to create dismax-per-term queries when any
field is boosted.


Furthermore, this release includes Apache Lucene 6.5.1 which includes 3 bug
fixes since the 6.5.0 release.


The release is available for immediate download at:


http://www.apache.org/dyn/closer.lua/lucene/solr/6.5.1

Please read CHANGES.txt for a detailed list of changes:


https://lucene.apache.org/solr/6_5_1/changes/Changes.html

Please report any feedback to the mailing lists (
http://lucene.apache.org/solr/discussion.html)


Note: The Apache Software Foundation uses an extensive mirroring network
for distributing releases. It is possible that the mirror you are using may
not have replicated the release yet. If that is the case, please try
another mirror. This also goes for Maven access.


Re: Help with facet.limit

2017-04-27 Thread alessandro.benedetti
In addition to what Erick mentioned, you can use JSON faceting and sort
your facets according to your preferences using the stats integration [1].
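
A minimal sketch (hypothetical field names cat and price), sorting a terms
facet by a computed average:

  json.facet={
    categories: {
      type: terms,
      field: cat,
      sort: "avg_price desc",
      facet: { avg_price: "avg(price)" }
    }
  }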

Cheers

[1] https://cwiki.apache.org/confluence/display/solr/Faceted+Search



-
---
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io


Re: counting_number_of_term_in_a_doc

2017-04-27 Thread alessandro.benedetti
I think the closest you get out of the box is the term vector component [1].
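
A sketch against the techproducts example (assuming a field indexed with
termVectors="true", such as includes):

  http://localhost:8983/solr/techproducts/tvrh?q=name:ipod&tv.tf=true&tv.fl=includes&fl=id,name

The response lists each term in that field with its per-document term
frequency.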

Cheers

[1]
https://cwiki.apache.org/confluence/display/solr/The+Term+Vector+Component



-
---
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io


Re: size-estimator-lucene-solr.xls error in disk space estimator

2017-04-27 Thread alessandro.benedetti
+1
I would add that what is called "Avg. Document Size (KB)" seems to me more
like "Avg. Field Size (KB)".
Cheers



-
---
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io


size-estimator-lucene-solr.xls error in disk space estimator

2017-04-27 Thread Matteo Grolla
It seems to me that the estimation in MB is in fact an estimation in GB:
the formula includes the avg doc size, which is in KB, so the result is in
KB and should be divided by 1024 to obtain the result in MB.
But it's divided by 1024*1024.
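
A quick sanity check with hypothetical numbers: 1,000,000 docs at an average
of 10 KB each is 10,000,000 KB. Divided by 1024 that is about 9,766 MB;
divided by 1024*1024 it is about 9.54, which is the figure in GB, not MB.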


Update to Solr 6 - Amazon EC2 high CPU SYS usage

2017-04-27 Thread Elodie Sannier

Hello,

We have migrated from Solr 5.4.1 to Solr 6.4.0 on Amazon EC2 and we have
a high CPU SYS usage and it drastically decreases the Solr performance.

The JVM version (java-1.8.0-openjdk-1.8.0.131-0.b11.el6_9.x86_64), the
Jetty version (9.3.14) and the OS version (CentOS 6.9) have not changed
with the Solr upgrade.

Using "strace" command we have found a lot of "clock_gettime"
(gettimeofday) calls when Solr is started.

The clocksource on Amazon VMs is "xen" and, according to this web site,
it impacts the system calls:
https://blog.packagecloud.io/eng/2017/03/08/system-calls-are-much-slower-on-ec2/


We have updated the clocksource to "tsc" and it fixes the issue.
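
For reference, the check and the runtime switch (standard Linux sysfs paths;
writing the setting requires root):

  cat /sys/devices/system/clocksource/clocksource0/current_clocksource
  cat /sys/devices/system/clocksource/clocksource0/available_clocksource
  echo tsc > /sys/devices/system/clocksource/clocksource0/current_clocksource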

Is there a change between Solr 5.4.1 and 6.4.0 that would trigger many
more gettimeofday calls done by the JVM ?

Elodie



Re: Poll: Master-Slave or SolrCloud?

2017-04-27 Thread Emir Arnautovic
I think creating a poll for ES ppl with the question: "How do you run master 
nodes? A) on some data nodes B) dedicated node C) dedicated server" 
would give some insight into how big an issue having ZK is, and whether 
hiding ZK behind Solr would do any good.


Emir


On 25.04.2017 23:13, Otis Gospodnetić wrote:

Hi Erick,

Could one run *only* embedded ZK on some SolrCloud nodes, sans any data?
It would be the equivalent of dedicated Elasticsearch master nodes, which is the
current ES best practice/recommendation.  I've never heard of anyone being
scared of running 3 dedicated master ES nodes, so if SolrCloud offered the
same, perhaps even completely hiding ZK from users, that would present the
same level of complexity (err, simplicity) ES users love about ES.  Don't
want to talk about SolrCloud vs. ES here at all, just trying to share
observations since we work a lot with both Elasticsearch and Solr(Cloud) at
Sematext.

Otis
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/


On Tue, Apr 25, 2017 at 4:03 PM, Erick Erickson 
wrote:


bq: I read somewhere that you should run your own ZK externally, and
turn off SolrCloud

this is a bit confused. "turn off SolrCloud" has nothing to do with
running ZK internally or externally. SolrCloud requires ZK, whether
internal or external is irrelevant to the term SolrCloud.

On to running an external ZK ensemble. Mostly, that's administratively
by far the safest. If you're running the embedded ZK, then the ZK
instances are tied to your Solr instance. Now if, for any reason, your
Solr nodes hosting ZK go down, you lose ZK quorum, can't index, etc.

Now consider a cluster with, say, 100 Solr nodes. Not talking replicas
in a collection here, I'm talking 100 physical machines. BTW, this is
not even close to the largest ones I'm aware of. Which three (for
example) are running ZK? If I want to upgrade Solr I better make
really sure not to upgrade two of the Solr instances running ZK at once
if I want my cluster to keep going.

And, ZK is sensitive to system resources. So putting ZK on a Solr node
then hosing, say, updates to my Solr cluster can cause ZK to be
starved for resources.

This is one of those deals where _functionally_, it's OK to run
embedded ZK, but administratively it's suspect.

Best,
Erick

On Tue, Apr 25, 2017 at 10:49 AM, Rick Leir  wrote:

All,
I read somewhere that you should run your own ZK externally, and turn

off SolrCloud. Comments please!

Rick

On April 25, 2017 1:33:31 PM EDT, "Otis Gospodnetić" <

otis.gospodne...@gmail.com> wrote:

This is interesting - that ZK is seen as adding so much complexity that
it
turns people off!

If you think about it, Elasticsearch users have no choice -- except
their
"ZK" is built-in, hidden, so one doesn't have to think about it, at
least
not initially.

I think I saw mentions (maybe on user or dev MLs or JIRA) about
potentially, in the future, there only being SolrCloud mode (and
dropping
SolrCloud name in favour of Solr).  If the above comment from Charlie
about
complexity is really true for Solr users, and if that's the reason why
we
see so few people running SolrCloud today, perhaps that's a good signal
for
Solr development/priorities in terms of ZK
hiding/automating/embedding/something...

Otis
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/


On Tue, Apr 25, 2017 at 4:50 AM, Charlie Hull 
wrote:


On 24/04/2017 15:58, Otis Gospodnetić wrote:


Hi,

I'm really really surprised here.  Back in 2013 we did a poll to see

how

people were running Master-Slave (4.x back then) and SolrCloud was a

bit

more popular than Master-Slave:
https://sematext.com/blog/2013/02/25/poll-solr-cloud-or-not/

Here is a fresh new poll with pretty much the same question - How do

you

run your Solr?

 -

and guess what?  SolrCloud is *not* at all a lot more prevalent than
Master-Slave.

We definitely see a lot more SolrCloud used by Sematext Solr
consulting/support customers, so I'm a bit surprised by the results

of

this
poll so far.


I'm not particularly surprised. We regularly see clients either with
single nodes or elderly versions of Solr (or even Lucene). Zookeeper

is

still seen as a bit of a black art. Once you move from 'how do I run

a

search engine' to 'how do I manage a cluster of servers with scaling

for

performance/resilience/failover' you're looking at a completely new

set

of skills and challenges, which I think puts many people off.

Charlie


Is anyone else surprised by this?  See https://twitter.com/sematext/
status/854927627748036608

Thanks,
Otis
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training -

http://sematext.com/

