Re: Solr on HDFS: Streaming API performance tuning

2016-12-16 Thread Reth RM
If you could provide the JSON parse exception stack trace, it might help to diagnose the issue there. On Fri, Dec 16, 2016 at 5:52 PM, Chetas Joshi wrote: > Hi Joel, > > The only NON alpha-numeric characters I have in my data are '+' and '/'. I > don't have any backslashes.

Re: Solr on HDFS: Streaming API performance tuning

2016-12-16 Thread Chetas Joshi
Hi Joel, The only non-alphanumeric characters I have in my data are '+' and '/'. I don't have any backslashes. If the special characters were the issue, I should get the JSON parsing exceptions every time, irrespective of the index size and of the available memory on the machine.

Re: Solr on HDFS: Streaming API performance tuning

2016-12-16 Thread Joel Bernstein
The Streaming API may have been throwing exceptions because the JSON special characters were not escaped. This was fixed in Solr 6.0. Joel Bernstein http://joelsolr.blogspot.com/ On Fri, Dec 16, 2016 at 4:34 PM, Chetas Joshi wrote: > Hello, > > I am running Solr

Re: ttl on merge-time possible somehow ?

2016-12-16 Thread Chris Hostetter
: > lucene, something has to "mark" the segments as deleted in order for them ... : Note, it doesn't mark the "segment", it marks the "document". correct, typo on my part -- sorry. : > The dissatisfaction you expressed with this approach confuses me... : > : Really ? : If you have many

Re: ttl on merge-time possible somehow ?

2016-12-16 Thread Dorian Hoxha
On Fri, Dec 16, 2016 at 10:53 PM, Chris Hostetter wrote: > > : Yep, that's what came in my search. See how TTL work in hbase/cassandra/ > : rocksdb . There > : isn't a "delete old docs"query, but old docs are

Re: ttl on merge-time possible somehow ?

2016-12-16 Thread Chris Hostetter
: Yep, that's what came in my search. See how TTL work in hbase/cassandra/ : rocksdb . There : isn't a "delete old docs"query, but old docs are deleted by the storage : when merging. Looks like this needs to be a lucene-module which can then

Solr on HDFS: Streaming API performance tuning

2016-12-16 Thread Chetas Joshi
Hello, I am running Solr 5.5.0. It is a SolrCloud of 50 nodes and I have the following config for all the collections. maxShardsPerNode: 1 replicationFactor: 1 I was using the Streaming API to get back results from Solr. It worked fine for a while until the index data size reached beyond 40 GB per

Re: ttl on merge-time possible somehow ?

2016-12-16 Thread Dorian Hoxha
Well there is a reason why they all do it that way. I'm gonna guess that the reason lucene does it this way is because it keeps a 'deleted docs bitset', which should act like a filter, which is not as slow as doing a full-delete/insert like in the other dbs that I mentioned. Thanks Shawn. On
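The "deleted docs bitset acting like a filter" idea can be illustrated with a toy model. This is only a conceptual sketch, not Lucene's actual implementation; the `Segment` class and its methods are hypothetical names chosen for illustration.

```python
class Segment:
    """Toy append-only segment: documents are never rewritten in place."""

    def __init__(self, docs):
        self.docs = list(docs)
        self.deleted = [False] * len(self.docs)  # one flag per document

    def delete(self, doc_id):
        # Deleting only flips a flag; the stored document stays on "disk".
        self.deleted[doc_id] = True

    def search(self, predicate):
        # The flags act as a filter applied to every search.
        return [d for i, d in enumerate(self.docs)
                if not self.deleted[i] and predicate(d)]

    def merge(self):
        # A merge rewrites the segment and physically drops flagged documents.
        return Segment(d for i, d in enumerate(self.docs) if not self.deleted[i])


segment = Segment(["a", "b", "c"])
segment.delete(1)
visible = segment.search(lambda doc: True)
merged = segment.merge()
```

In this model a delete is cheap, and the space is only reclaimed when `merge()` runs, which is the behaviour the thread is discussing.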

Re: cores vs. instances vs. zookeeper vs. cloud vs ?

2016-12-16 Thread John Blythe
ah, there's a lightbulb. your last paragraph cleared up confusion i was carrying through the responses (and, incidentally, was likely the reason for confusion on the ZK question you couldn't make sense of). i was thinking zookeeper was a separate means of handling things from solrcloud, two

Re: Separating Search and Indexing in SolrCloud

2016-12-16 Thread Jaroslaw Rozanski
Thanks, that issue looks interesting! On 16/12/16 16:38, Pushkar Raste wrote: > This kind of separation is not supported yet. There is, however, some work > going on, you can read about it on > https://issues.apache.org/jira/browse/SOLR-9835 > > This unfortunately would not support soft commits and

Re: Separating Search and Indexing in SolrCloud

2016-12-16 Thread Jaroslaw Rozanski
Thanks, On 16/12/16 20:56, Shawn Heisey wrote: > On 12/16/2016 5:43 AM, Jaroslaw Rozanski wrote: >> Leader is responsible for distributing update requests to replica. So >> eventually all replicas have same state as leader. Not a problem. It >> is more about the performance of such. If I gather

Re: ttl on merge-time possible somehow ?

2016-12-16 Thread Shawn Heisey
On 12/16/2016 1:12 PM, Dorian Hoxha wrote: > Shawn, I know how it works, I read the blog post. But I don't want it > that > way. So how to do it my way? Like a custom merge function on lucene or > something else ? A considerable amount of custom coding. At a minimum, you'd have to write your own

Re: Solr on HDFS: increase in query time with increase in data

2016-12-16 Thread Shawn Heisey
On 12/16/2016 11:58 AM, Chetas Joshi wrote: > How different the index data caching mechanism is for the Streaming > API from the cursor approach? Solr and Lucene do not handle that caching. Systems external to Solr (like the OS, or HDFS) handle the caching. The cache effectiveness will be a

Re: Separating Search and Indexing in SolrCloud

2016-12-16 Thread Shawn Heisey
On 12/16/2016 5:43 AM, Jaroslaw Rozanski wrote: > Leader is responsible for distributing update requests to replica. So > eventually all replicas have same state as leader. Not a problem. It > is more about the performance of such. If I gather correctly normal > replication happens by standard

Re: cores vs. instances vs. zookeeper vs. cloud vs ?

2016-12-16 Thread Shawn Heisey
On 12/16/2016 10:30 AM, John Blythe wrote: > thanks, erick. this is helpful. a few questions for clarity's sake, but > first: nope, not using SolrCloud as of yet. > >- if i start using SolrCloud i could have my current multi-core setup >(e.g. "transactions", "opportunities", etc.) exist

Re: ttl on merge-time possible somehow ?

2016-12-16 Thread Dorian Hoxha
On Fri, Dec 16, 2016 at 8:11 PM, Shawn Heisey wrote: > On 12/16/2016 11:13 AM, Dorian Hoxha wrote: > > Yep, that's what came in my search. See how TTL work in hbase/cassandra/ > > rocksdb . There > > isn't a "delete old

Re: cores vs. instances vs. zookeeper vs. cloud vs ?

2016-12-16 Thread John Blythe
this is awesome. i'm jazzed to try it out. will do some introductory examples and what not to familiarize myself and get the lightbulb to (hopefully) go off. one last question: given all the info and questions above, does it seem like ZK is overkill at this point and i should focus my efforts on

Re: ttl on merge-time possible somehow ?

2016-12-16 Thread Shawn Heisey
On 12/16/2016 11:13 AM, Dorian Hoxha wrote: > Yep, that's what came in my search. See how TTL work in hbase/cassandra/ > rocksdb . There > isn't a "delete old docs"query, but old docs are deleted by the > storage when merging. Looks like this

Re: cores vs. instances vs. zookeeper vs. cloud vs ?

2016-12-16 Thread Erick Erickson
bq: if i start using SolrCloud i could have my current multi-core setup (e.g. "transactions", "opportunities", etc.) exist within the appropriate collection. I'd guess that cores == collections but Are you reaching across from one core to another to satisfy your use-case? I.e. using

Re: Solr on HDFS: increase in query time with increase in data

2016-12-16 Thread Chetas Joshi
Thank you everyone. I would add nodes to the SolrCloud and split the shards. Shawn, Thank you for explaining why putting index data on local file system could be a better idea than using HDFS. I need to find out how HDFS caches the index files in a resource constrained environment. I would also

Re: Max vertical scaling in your experience ? (1 instance/server)

2016-12-16 Thread Dorian Hoxha
On Fri, Dec 16, 2016 at 5:55 PM, Erick Erickson wrote: > You said that the data expires, but you haven't said > how many docs you need to host at a time. The data will expire in ~30 minutes average. Many of them are updates on the same document (this makes it worse

Re: ttl on merge-time possible somehow ?

2016-12-16 Thread Dorian Hoxha
On Fri, Dec 16, 2016 at 4:42 PM, Shawn Heisey wrote: > On 12/16/2016 12:54 AM, Dorian Hoxha wrote: > > I did some search for TTL on solr, and found only a way to do it with > > a delete-query. But that ~sucks, because you have to do a lot of > > inserts (and queries). > >

Re: Stemming with SOLR

2016-12-16 Thread Susheel Kumar
To handle irregular nouns ( http://www.ef.com/english-resources/english-grammar/singular-and-plural-nouns/), the simplest way is to handle them using StemmerOverrideFilterFactory. The list is not so long. Otherwise, go for commercial solutions like basistech etc. as Alex suggested, or you can customize
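Solr's `StemmerOverrideFilterFactory` reads its overrides from a dictionary file of tab-separated word and stem pairs, and the filter is placed before the stemmer in the analyzer chain. A minimal sketch of generating such a file follows; the noun pairs, function names, and file path are illustrative assumptions, not taken from the thread.

```python
# Hypothetical irregular-noun pairs (plural -> singular); extend as needed.
IRREGULAR_NOUNS = {
    "children": "child",
    "mice": "mouse",
    "geese": "goose",
    "feet": "foot",
}


def stem_dictionary_lines(mapping):
    """Return tab-separated 'word<TAB>stem' lines, sorted for stable output."""
    return [f"{word}\t{stem}" for word, stem in sorted(mapping.items())]


def write_stem_dictionary(path, mapping):
    """Write the override dictionary to `path`, one mapping per line."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(stem_dictionary_lines(mapping)) + "\n")


lines = stem_dictionary_lines(IRREGULAR_NOUNS)
```

Calling `write_stem_dictionary("stemdict.txt", IRREGULAR_NOUNS)` would produce a file that the filter's `dictionary` attribute could point at; the file name here is only an example.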

Re: cores vs. instances vs. zookeeper vs. cloud vs ?

2016-12-16 Thread John Blythe
thanks, erick. this is helpful. a few questions for clarity's sake, but first: nope, not using SolrCloud as of yet. - if i start using SolrCloud i could have my current multi-core setup (e.g. "transactions", "opportunities", etc.) exist within the appropriate collection. so instead of

Re: Getting Error - Session expired for /collections/sprod/state.json

2016-12-16 Thread Erick Erickson
Look at your connection timeouts and your ZK timeouts. This usually means your Solr instances are going into heavy GC as Yago mentions. You can turn on GC logging if it's not already on, then use something like GCViewer to get a handle on the GC. You really have two options: 1> if it is GC, tune your

Re: Max vertical scaling in your experience ? (1 instance/server)

2016-12-16 Thread Erick Erickson
You said that the data expires, but you haven't said how many docs you need to host at a time. At 10M/second inserts you'll need a boatload of shards. All of the conversation about one beefy machine vs. lots of not-so-beefy machines should wait until you answer that question. For instance,

Re: cores vs. instances vs. zookeeper vs. cloud vs ?

2016-12-16 Thread Erick Erickson
It's not quite clear to me whether you're using SolrCloud now or not, my guess is not. My guess here is that you _should_ move to SolrCloud and collections. Then, instead of thinking about "cores", you just think about collections. Where the replicas live then isn't something you have to manage in

Re: Separating Search and Indexing in SolrCloud

2016-12-16 Thread Pushkar Raste
This kind of separation is not supported yet. There is, however, some work going on; you can read about it at https://issues.apache.org/jira/browse/SOLR-9835 This unfortunately would not support soft commits and hence would not be a good solution for near real time indexing. On Dec 16, 2016 7:44

Re: Stable releases of Solr

2016-12-16 Thread Jaroslaw Rozanski
Hi Deepak, Lucene 6.3.0 is latest official release: https://lucene.apache.org/core/6_3_0/index.html Same applies to Solr if that is what you meant. It is as stable as guaranteed by release process. On 16/12/16 07:10, Deepak Kumar Gupta wrote: > Hi, > > I am planning to upgrade lucene version

Stable releases of Solr

2016-12-16 Thread Deepak Kumar Gupta
Hi, I am planning to upgrade the Lucene version in my codebase from 3.6.1. What is the latest stable version to which I can upgrade it? Is 6.3.X stable? Thanks, Deepak

Re: ttl on merge-time possible somehow ?

2016-12-16 Thread Shawn Heisey
On 12/16/2016 12:54 AM, Dorian Hoxha wrote: > I did some search for TTL on solr, and found only a way to do it with > a delete-query. But that ~sucks, because you have to do a lot of > inserts (and queries). You're going to have to be very specific about what you want Solr to do. > The
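For reference, the delete-query approach mentioned in the quoted message can be sketched as a JSON update command. This assumes each document carries an expiry timestamp in a date field; the field name `expire_at_dt` and the helper name are hypothetical.

```python
import json


def expire_delete_command(expiry_field):
    """Build a Solr JSON delete-by-query body for documents past their expiry."""
    # Matches every document whose expiry timestamp is at or before the current time.
    return json.dumps({"delete": {"query": f"{expiry_field}:[* TO NOW]"}})


body = expire_delete_command("expire_at_dt")
```

The resulting body would be POSTed to the collection's `/update` handler with a JSON content type, followed by a commit; how often to run it is a separate tuning decision.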

Can't get spelling suggestions to work properly

2016-12-16 Thread jimi.hullegard
Hi, I'm trying to add the spelling suggestion feature to our search, but I'm having problems getting suggestions on some misspellings. For example, the Swedish word 'mycket' exists in ~14,000 of a total of ~40,000 documents in our index. A search for the incorrect spelling 'myket' (a missing

cores vs. instances vs. zookeeper vs. cloud vs ?

2016-12-16 Thread John Blythe
good morning everyone. i've got a growing number of cores that various parts of our application are relying upon. i'm having difficulty figuring out the best way to continue expanding for both sake of scale and convenience. i need two extra versions of each core due to our demo instance and our

Re: Max vertical scaling in your experience ? (1 instance/server)

2016-12-16 Thread GW
Layer 2 bridge SAN is just for my Apache/apps on Conga so they can be spun up on any host with a static IP. This has nothing to do with Solr which is running on plain old hardware. Solrcloud is on a real cluster not on a SAN. The bit about dead with no error. I got this from a post I made asking

Re: Separating Search and Indexing in SolrCloud

2016-12-16 Thread Dorian Hoxha
Makes more sense, but I think the master should do the write before it can be redirected to other replicas. So not sure if that can be done. In elasticsearch you can have datanodes and coordinator nodes:

Re: Separating Search and Indexing in SolrCloud

2016-12-16 Thread Jaroslaw Rozanski
Sorry, not what I meant. Leader is responsible for distributing update requests to replica. So eventually all replicas have same state as leader. Not a problem. It is more about the performance of such. If I gather correctly normal replication happens by standard update request. Not by, say,

Re: error diagnosis help.

2016-12-16 Thread Comcast
Afaik the only xml that nutch should be touching is its own config files. This error shows up in solr admin Sent from my iPhone > On Dec 16, 2016, at 1:55 AM, Reth RM wrote: > > Are you indexing xml files through nutch? This exception purely looks like > processing of

Re: Separating Search and Indexing in SolrCloud

2016-12-16 Thread Dorian Hoxha
The leader is the source of truth. You expect to make the replica the source of truth or something? That doesn't make sense. What people do is send writes to the leader/master and reads to the replicas/slaves, in Solr and in other DBs. On Fri, Dec 16, 2016 at 1:31 PM, Jaroslaw Rozanski

Separating Search and Indexing in SolrCloud

2016-12-16 Thread Jaroslaw Rozanski
Hi all, According to the documentation, in normal operation (not recovery) in a SolrCloud configuration the leader sends updates it receives to all the replicas. This means all nodes in the shard perform the same effort to index a single document. Correct? Is there then a benefit to *not* send

Re: Max vertical scaling in your experience ? (1 instance/server)

2016-12-16 Thread Dorian Hoxha
On Fri, Dec 16, 2016 at 12:39 PM, GW wrote: > Dorian, > > From my reading, my belief is that you just need some beefy machines for > your zookeeper ensemble so they can think fast. Zookeeper needs to think fast enough for cluster state/changes. So I think it scales with

Re: Max vertical scaling in your experience ? (1 instance/server)

2016-12-16 Thread GW
Dorian, From my reading, my belief is that you just need some beefy machines for your zookeeper ensemble so they can think fast. After that your issues are complicated by drive I/O which I believe is solved by using shards. If you have a collection running on top of a single drive array it

Re: Max vertical scaling in your experience ? (1 instance/server)

2016-12-16 Thread Dorian Hoxha
On Fri, Dec 16, 2016 at 11:31 AM, Toke Eskildsen wrote: > On Fri, 2016-12-16 at 11:19 +0100, Dorian Hoxha wrote: > > On Fri, Dec 16, 2016 at 10:45 AM, Toke Eskildsen > > wrote: > > > We try hard to stay below 32GB, but for some setups the

Re: Max vertical scaling in your experience ? (1 instance/server)

2016-12-16 Thread Toke Eskildsen
On Fri, 2016-12-16 at 11:19 +0100, Dorian Hoxha wrote: > On Fri, Dec 16, 2016 at 10:45 AM, Toke Eskildsen > wrote: > > We try hard to stay below 32GB, but for some setups the penalty of > > crossing the boundary is worth it. If, for example, having > > everything in 1

Problem to specify end parameter for range facets

2016-12-16 Thread Aman Tandon
Hi, I want to do range facets with a gap of 10, but I don't know the end, as it could be a very large value. How could I do that? Thanks, Aman Tandon
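Solr range faceting requires an explicit `facet.range.end`. One possible approach, sketched here as an assumption rather than an answer given in the thread, is to pick a practical upper bound and add `facet.range.other=after`, so everything above the bound is counted in a single extra bucket. The field name `price` and the bound of 100 are hypothetical.

```python
from urllib.parse import urlencode


def range_facet_params(field, start, end, gap):
    """Build Solr range-facet query parameters for one numeric field."""
    return {
        "q": "*:*",
        "rows": 0,
        "facet": "true",
        "facet.range": field,
        f"f.{field}.facet.range.start": start,
        f"f.{field}.facet.range.end": end,
        f"f.{field}.facet.range.gap": gap,
        # Count every document whose value falls above `end` in one extra bucket.
        f"f.{field}.facet.range.other": "after",
    }


params = range_facet_params("price", 0, 100, 10)  # "price" is a hypothetical field
query_string = urlencode(params)
```

The `query_string` can be appended to a collection's `/select` URL. If the true maximum is needed, a separate stats query on the field could supply it before choosing `end`.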

Re: Max vertical scaling in your experience ? (1 instance/server)

2016-12-16 Thread Dorian Hoxha
On Fri, Dec 16, 2016 at 10:45 AM, Toke Eskildsen wrote: > On Fri, 2016-12-16 at 09:31 +0100, Dorian Hoxha wrote: > > I'm researching solr for a project that would require a max- > > inserts(10M/s) and some heavy facet+fq on top of that, though on low > > qps. > > You

Re: Getting Error - Session expired for /collections/sprod/state.json

2016-12-16 Thread Yago Riveiro
Do some GC profiling to get more information. It's possible you have configured a small heap and are running into GC stop-the-world pauses. Normally ZooKeeper errors are tied to GC and network latency issues -- /Yago Riveiro On 16 Dec 2016, 09:49 +, Piyush Kunal

Re: Getting Error - Session expired for /collections/sprod/state.json

2016-12-16 Thread Piyush Kunal
Looks like an issue with 6.x version then. But this seems too basic. Not sure if community would not have caught this till now. On Fri, Dec 16, 2016 at 2:55 PM, Yago Riveiro wrote: > I had some of this error in my logs too on 6.3.0 > > My cluster also index like 20K

Re: Max vertical scaling in your experience ? (1 instance/server)

2016-12-16 Thread Toke Eskildsen
On Fri, 2016-12-16 at 09:31 +0100, Dorian Hoxha wrote: > I'm researching solr for a project that would require a max- > inserts(10M/s) and some heavy facet+fq on top of that, though on low > qps. You don't ask for much, do you :-) If you add high commit rate to the list, you have a serious

Re: Search only for single value of Solr multivalue field

2016-12-16 Thread Leo BRUVRY-LAGADEC
Hi Dorian, Firstly, thanks for your response, but it does not seem to work. Here is another example: I want to search for documents whose affiliations contain the NHM (Natural History Museum) of India. So, I want to get only the document with id=2 : 1 NHM, Austria Annamalai Univ,

Re: Getting Error - Session expired for /collections/sprod/state.json

2016-12-16 Thread Yago Riveiro
I had some of these errors in my logs too on 6.3.0. My cluster also indexes around 20K docs/sec. I don't know why. -- /Yago Riveiro On 16 Dec 2016, 08:39 +, Piyush Kunal , wrote: > Anyone has noticed such issue before? > > On Thu, Dec 15, 2016 at 4:36 PM, Piyush Kunal

Re: Solr MapReduce Indexer Tool is failing for empty core name.

2016-12-16 Thread Manan Sheth
That's what I presume, and it should start utilizing the collection only. The collection param has already been specified and it should take all details from there only. Also, the core to collection change happened in Solr 4. The MapReduce indexer for Solr 4.10 is working correctly with this,

Re: Solr on HDFS: increase in query time with increase in data

2016-12-16 Thread Piyush Kunal
I think 70GB is too huge for a shard. How much memory does the system have? In case Solr does not have sufficient memory to load the indexes, it will use only the amount of memory defined in your Solr caches. Although you are on HDFS, Solr performance will be really bad if it has to do disk IO

Re: Getting Error - Session expired for /collections/sprod/state.json

2016-12-16 Thread Piyush Kunal
Has anyone noticed such an issue before? On Thu, Dec 15, 2016 at 4:36 PM, Piyush Kunal wrote: > This is happening when heavy indexing like 100/second is going on. > > On Thu, Dec 15, 2016 at 4:33 PM, Piyush Kunal > wrote: > >> - We have solr6.1.0

Max vertical scaling in your experience ? (1 instance/server)

2016-12-16 Thread Dorian Hoxha
Hello searchers, I'm researching solr for a project that would require a max-inserts(10M/s) and some heavy facet+fq on top of that, though on low qps. And I'm trying to find blogs/slides where people have used some big machines instead of hundreds of small ones. 1. Largest I've found is this