Re: rough maximum cores (shards) per machine?

2015-03-25 Thread Damien Kamerman
I've tried (very simplistically) hitting a collection with a good variety of searches and looking at the collection's heap memory and working out the bytes / doc. I've seen results around 100 bytes / doc, and as low as 3 bytes / doc for collections with small docs. It's still a work-in-progress -
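The bytes-per-document arithmetic described above can be sketched in Python. This is only a rough illustration: the function names and the 10M document count are assumptions, and only the 100 and 3 bytes/doc ratios come from the post.

```python
def bytes_per_doc(heap_bytes: int, num_docs: int) -> float:
    """Heap attributed to a collection divided by its document count."""
    if num_docs <= 0:
        raise ValueError("num_docs must be positive")
    return heap_bytes / num_docs


def projected_heap_bytes(num_docs: int, per_doc: float) -> float:
    """Extrapolate heap use for num_docs at a measured bytes/doc ratio."""
    return num_docs * per_doc


# Hypothetical example: 10M docs at the two ratios mentioned in the post.
high = projected_heap_bytes(10_000_000, 100)  # 100 bytes/doc
low = projected_heap_bytes(10_000_000, 3)     # 3 bytes/doc
```

Such a ratio is only as good as the search mix used to warm the collection, so it is best treated as an order-of-magnitude estimate.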

Re: rough maximum cores (shards) per machine?

2015-03-25 Thread Shai Erera
While it's hard to answer this question because, as others have said, it depends, I think it would be good if we could quantify or assess the cost of running a SolrCore. For instance, let's say that a server can handle a load of 10M indexed documents (I omit search load on purpose for now) in a

Re: Using G1 with Apache Solr

2015-03-25 Thread Daniel Collins
Interesting nonetheless, Shawn :) We use G1GC on our servers. We were on Java 7 (64-bit, RHEL6) but are trying to migrate to Java 8 (which seems to cause more GC issues, so we clearly need to tweak our settings); will investigate 8u40 though. On 25 March 2015 at 04:23, Shawn Heisey

Re: Custom TokenFilter

2015-03-25 Thread Test Test
Thanks Erick, I'm working on Solr 4.10.2 and all my dependency jars seem to be compatible with this version. I can't figure out which one causes this issue. Thanks, Regards. On Tuesday, 24 March 2015 at 23:45, Erick Erickson erickerick...@gmail.com wrote: bq: 13 more Caused by:

Re: rough maximum cores (shards) per machine?

2015-03-25 Thread Toke Eskildsen
On Wed, 2015-03-25 at 03:46 +0100, Ian Rose wrote: Thus theoretically we could actually just use one single collection for all of our customers (adding a 'customer:whatever' type fq to all queries) but since we never need to query across customers it seemed more performant (as well as safer -

Re: rough maximum cores (shards) per machine?

2015-03-25 Thread Per Steffensen
In one of our production environments we use 32GB, 4-core, 3T RAID0 spinning disk Dell servers (do not remember the exact model). We have about 25 collections with 2 replica (shard-instances) per collection on each machine - 25 machines. Total of 25 coll * 2 replica/coll/machine * 25 machines
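The replica count in the post above works out as follows. This is a small Python sketch of the stated arithmetic only; the variable names are illustrative.

```python
collections = 25                           # collections in the cluster
replicas_per_collection_per_machine = 2    # shard-instances per collection on each machine
machines = 25

# Cores hosted by a single machine.
replicas_per_machine = collections * replicas_per_collection_per_machine

# Cores across the whole cluster.
total_replicas = replicas_per_machine * machines

print(replicas_per_machine, total_replicas)  # prints: 50 1250
```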

Re: Data indexing is going too slow on single shard Why?

2015-03-25 Thread Shawn Heisey
On 3/25/2015 5:03 AM, Nitin Solanki wrote: Can anyone please assist me? I am indexing on a single shard and it is taking too much time to index the data. I am indexing around 49GB of data on a single shard. What's wrong? Why is Solr taking so much time to index the data? Earlier I was

Re: rough maximum cores (shards) per machine?

2015-03-25 Thread Ian Rose
Per - Wow, 1 trillion documents stored is pretty impressive. One clarification: when you say that you have 2 replica per collection on each machine, what exactly does that mean? Do you mean that each collection is sharded into 50 shards, divided evenly over all 25 machines (thus 2 shards per

Re: Data indexing is going too slow on single shard Why?

2015-03-25 Thread Nitin Solanki
Hello, *Updating my question again.* Can anyone please assist me? I am indexing on a single shard and it is taking too much time to index the data. I am indexing around 49GB of data on a single shard. What's wrong? Why is Solr taking so much time to index the data? Earlier I was

Re: rough maximum cores (shards) per machine?

2015-03-25 Thread Per Steffensen
On 25/03/15 15:03, Ian Rose wrote: Per - Wow, 1 trillion documents stored is pretty impressive. One clarification: when you say that you have 2 replica per collection on each machine, what exactly does that mean? Do you mean that each collection is sharded into 50 shards, divided evenly over

Information Retrieval/Text Mining opportunity @ GE Research Data Mining Labs, Bangalore

2015-03-25 Thread Yavar Husain
I have loved working on Solr, so thought of posting an Information Retrieval/Text Mining requirement that we have for our GE Data Mining Research Labs @ Bangalore. Apologies if it is considered inappropriate here. Here goes the Job Description for those interested: If Information Retrieval,

Re: Have anyone used Automatic Phrase Tokenization (AutoPhrasingTokenFilterFactory) ?

2015-03-25 Thread afrooz
Hi, I am a .NET developer, but I need to use Solr, and specifically this good plugin, AutoPhrasingTokenFilter. I searched everywhere and couldn't find useful information. Can anyone help me run it in Solr 5.0 or even previous versions? I am not able to add it to my Solr; it is throwing the below

Sorting and Rerank

2015-03-25 Thread innoculou
If I do an initial search without any field sorting, and then do the exact same query but also sort on one field, will I get the same result set in the subsequent query, but sorted? In other words, does simply applying a sort criterion affect the re-rank on the full search, or does it just sort the

Solr Monitoring - Stored Stats?

2015-03-25 Thread Matt Kuiper
Hello, I am familiar with the JMX points that Solr exposes to allow for monitoring of statistics like QPS, numdocs, Average Query Time... I am wondering if there is a way to configure Solr to automatically store the value of these stats over time (for a given time interval), and then allow a

Optimize SolrCloud without downtime

2015-03-25 Thread pavelhladik
Hi, I haven't found the answer yet, please help. We have standalone Solr 5.0.0 with a few cores so far. One of those cores contains: numDocs:120M deletedDocs:110M. Our data changes frequently, which is why there are so many deletedDocs. The optimized core takes around 50GB on disk; we are now at almost 100GB

Replica and node states

2015-03-25 Thread Shai Erera
Hi Is it possible for a replica to be DOWN, while the node it resides on is under /live_nodes? If so, what can lead to it, aside from someone unloading a core. I don't know if each SolrCore reports status to ZK independently, or it's done by the Solr process as a whole. Also, is it possible for

Re: Data indexing is going too slow on single shard Why?

2015-03-25 Thread Nitin Solanki
Hi Shawn, Sorry for all the things. Server configuration: 8 CPUs, 32 GB RAM, O.S. - Linux. *Earlier*, I was using 8 shards without replicas (the default is 1) using SolrCloud. Only Solr is running on the server; no other applications are running. Java heap set to 4096 MB in

Re: Sorting and Rerank

2015-03-25 Thread Koji Sekiguchi
Hi, You're right. The two result sets are the same; only the document order differs. Koji On 2015/03/26 0:53, innoculou wrote: If I do an initial search without any field sorting, and then do the exact same query but also sort on one field, will I get the same result set in the subsequent query

Re: Unable to setup solr cloud with multiple collections.

2015-03-25 Thread Erick Erickson
You're still mixing master/slave with SolrCloud. Do _not_ reconfigure the replication. If you want your core (we call them replicas in SolrCloud) to appear on various nodes in your cluster, either create the collection with the nodes specified (createNodeSet) or, once the collection is created on
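As a hedged illustration of the `createNodeSet` suggestion above, the following Python sketch builds a Collections API `CREATE` request URL. The base URL, collection name, shard counts, and node names are hypothetical placeholders; node names are assumed to follow the `host:port_solr` form that SolrCloud lists under `/live_nodes`.

```python
from urllib.parse import urlencode


def create_collection_url(base_url, name, num_shards, replication_factor, nodes):
    """Build a Collections API CREATE URL restricted to the given nodes."""
    params = {
        "action": "CREATE",
        "name": name,
        "numShards": num_shards,
        "replicationFactor": replication_factor,
        # Comma-separated node names on which the replicas may be placed.
        "createNodeSet": ",".join(nodes),
    }
    return f"{base_url}/admin/collections?{urlencode(params)}"


# Hypothetical cluster with two nodes.
url = create_collection_url(
    "http://localhost:8983/solr",
    "collection1",
    2,
    2,
    ["host1:8983_solr", "host2:8983_solr"],
)
```

The URL is only constructed here, not sent; issuing it is left to whatever HTTP client is already in use.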

Re: Custom TokenFilter

2015-03-25 Thread Erick Erickson
Images don't come through the mailing list, can't see your image. Whether or not all the jars in the directory you're working on are consistent is the least of your problems. Are the libs to be found in any _other_ place specified on your classpath? Best, Erick On Wed, Mar 25, 2015 at 12:36 AM,

Re: Setting up SOLR 5 from an RPM

2015-03-25 Thread Shawn Heisey
On 3/25/2015 5:49 AM, Tom Evans wrote: On Tue, Mar 24, 2015 at 4:00 PM, Tom Evans tevans...@googlemail.com wrote: Hi all We're migrating to SOLR 5 (from 4.8), and our infrastructure guys would prefer we installed SOLR from an RPM rather than extracting the tarball where we need it. They are

Re: Optimize SolrCloud without downtime

2015-03-25 Thread Erick Erickson
That's a high number of deleted documents as a percentage of your index! Or at least I find those numbers surprising. When segments are merged in the background during normal indexing, quite a bit of weight is given to segments that have a high percentage of deleted docs. I usually see at most

Re: Optimize SolrCloud without downtime

2015-03-25 Thread Shawn Heisey
On 3/25/2015 9:08 AM, pavelhladik wrote: Our data changes frequently, which is why there are so many deletedDocs. The optimized core takes around 50GB on disk; we are now at almost 100GB and I'm looking for the best way to optimize this huge core without downtime. I know optimization working in

Re: Optimize SolrCloud without downtime

2015-03-25 Thread Erick Erickson
bq: It does NOT optimize multiple replicas or shards in parallel. This behavior was changed in 4.10 though, see: https://issues.apache.org/jira/browse/SOLR-6264 So with 5.0 Pavel is seeing the result of that JIRA I bet. I have to agree with Shawn, the optimization step should proceed invisibly

Re: [MARKETING] Re: KeywordTokenizerFactory splits by whitespaces

2015-03-25 Thread Erick Erickson
Yeah, this is a head scratcher. But it _has_ to be that way for things like edismax to work where you mix-and-match fielded and un-fielded terms. I.e. I can have a query like q=field1:whatever some more stuff&qf=field2,field3,field4 where I want whatever to be evaluated only against field1, but the

Re: Solr Monitoring - Stored Stats?

2015-03-25 Thread Erick Erickson
Matt: Not really. There's a bunch of third-party log analysis tools that give much of this information (not everything exposed by JMX of course is in the log files though). Not quite sure whether things like Nagios, Zabbix and the like have this kind of stuff built in; seems like a natural

German Compound Splitter words.fst causing problems.

2015-03-25 Thread Chris Morley
Hello, Chris Morley here, of Wayfair.com. I am working on the German compound-splitter by Dawid Weiss. I tried to upgrade the words.fst file that comes with the German compound-splitter using Solr 3.5, but it doesn't work. Below is the IndexNotFoundException that I get.

Re: Data indexing is going too slow on single shard Why?

2015-03-25 Thread Shawn Heisey
On 3/25/2015 8:42 AM, Nitin Solanki wrote: Server configuration: 8 CPUs. 32 GB RAM O.S. - Linux <snip> are running. Java heap set to 4096 MB in Solr. While indexing, <snip> *Currently*, I have 1 shard with 2 replicas using SOLR CLOUD. Data Size: 102G

KeywordTokenizerFactory splits by whitespaces

2015-03-25 Thread Vadim Gorlovetsky
Hello, solr.KeywordTokenizerFactory seems to be splitting on whitespace, though according to the Solr documentation it shouldn't do that. For example, I have the following configuration for the fields proj_name and proj_name_sort: <field name="proj_name" type="sortable_text_general" indexed="true" stored="true"/>

Re: Setting up SOLR 5 from an RPM

2015-03-25 Thread Tom Evans
On Wed, Mar 25, 2015 at 2:40 PM, Shawn Heisey apa...@elyograg.org wrote: I think you will only need to change the ownership of the solr home and the location where the .war file is extracted, which by default is server/solr-webapp. The user must be able to *read* the program data, but should

Re: KeywordTokenizerFactory splits by whitespaces

2015-03-25 Thread Erick Erickson
This is a _very_ common thing we all had to learn; what you're seeing is the results of the _query parser_, not the analysis chain. Anything like proj_name_sort:term1 term2 gets split at the query parser level, attaching debug=query to the URL should show down in the parsed query section something
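One common way to keep the query parser from splitting on whitespace is to quote the value so it reaches the field's analysis chain as a single string. The Python sketch below shows that idea under stated assumptions: the helper names are hypothetical, and `proj_name_sort` and `debug=query` are taken from the thread.

```python
from urllib.parse import urlencode


def quoted_clause(field: str, value: str) -> str:
    """Return field:"value", escaping backslashes and double quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'{field}:"{escaped}"'


def debug_query_params(clause: str) -> str:
    """URL-encode the clause together with debug=query for inspection."""
    return urlencode({"q": clause, "debug": "query"})


clause = quoted_clause("proj_name_sort", "term1 term2")
params = debug_query_params(clause)
```

Appending `params` to a select URL and reading the parsed query section of the debug output should show whether the value was kept together.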

Re: Solr Monitoring - Stored Stats?

2015-03-25 Thread Shawn Heisey
On 3/25/2015 9:26 AM, Matt Kuiper wrote: I am familiar with the JMX points that Solr exposes to allow for monitoring of statistics like QPS, numdocs, Average Query Time... I am wondering if there is a way to configure Solr to automatically store the value of these stats over time (for a

Re: Replica and node states

2015-03-25 Thread Shalin Shekhar Mangar
Comments inline: On Wed, Mar 25, 2015 at 8:30 AM, Shai Erera ser...@gmail.com wrote: Hi Is it possible for a replica to be DOWN, while the node it resides on is under /live_nodes? If so, what can lead to it, aside from someone unloading a core. Yes, aside from someone unloading the index,

RE: [MARKETING] Re: KeywordTokenizerFactory splits by whitespaces

2015-03-25 Thread Vadim Gorlovetsky
Thanks for the quick response. It is a bit confusing that an analyzer of query type configured to use KeywordTokenizerFactory does not keep the query criteria un-tokenized. I guess whitespace is the only special case because it separates phrases in a query and is processed prior to analysis. Actually I am handling a query the

Re: rough maximum cores (shards) per machine?

2015-03-25 Thread Jack Krupansky
Just to give a specific answer to the original question, I would say that dozens of cores (collections) is certainly fine (assuming the total data load and query rate is reasonable), maybe 50 or even 100. Low hundreds of cores/collections MAY work, but isn't advisable. Thousands, if it works at

Re: Replica and node states

2015-03-25 Thread Shai Erera
Thanks. Does Solr ever clean up those states? I.e. does it ever remove down replicas, or replicas belonging to non-live_nodes after some time? Or will these remain in the cluster state forever (assuming they never come back up)? If they remain there, is there any penalty? E.g. Solr tries to send

Re: Custom TokenFilter

2015-03-25 Thread Test Test
Re, Sorry about the image. So, here are all my dependency jars, listed below: - commons-cli-2.0-mahout.jar - commons-compress-1.9.jar - commons-io-2.4.jar - commons-logging-1.2.jar - httpclient-4.4.jar - httpcore-4.4.jar - httpmime-4.4.jar - junit-4.10.jar - log4j-1.2.17.jar -

Re: Custom TokenFilter

2015-03-25 Thread Test Test
Re, Sorry about the image. So, here are all my dependency jars, listed below: - commons-cli-2.0-mahout.jar - commons-compress-1.9.jar - commons-io-2.4.jar - commons-logging-1.2.jar - httpclient-4.4.jar - httpcore-4.4.jar - httpmime-4.4.jar

Applying Tokenizers and Filters to CopyFields

2015-03-25 Thread Martin Wunderlich
Hi all, I am wondering what the process is for applying Tokenizers and Filters (as defined in the FieldType definition) to field contents that result from CopyFields. To be more specific, in my Solr instance, I would like to support query expansion by two means: removing stop words and adding

Uneven data distribution with composite router

2015-03-25 Thread Shamik Bandopadhyay
Hi, I'm using a three level composite router in a solr cloud environment, primarily for multi-tenant and field collapsing. The format is as follows. *language!topic!url*. An example would be : ENU!12345!www.testurl.com/enu/doc1 GER!12345!www.testurl.com/ger/doc2
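The three-level key format above can be sketched as a pair of small Python helpers. The function names are hypothetical and this only illustrates how such an ID is assembled and split on the client side; it says nothing about how Solr hashes the parts.

```python
def make_composite_id(language: str, topic: str, url: str) -> str:
    """Join the three routing levels as language!topic!url."""
    for part in (language, topic):
        if "!" in part:
            raise ValueError("routing prefixes must not contain '!'")
    return "!".join((language, topic, url))


def split_composite_id(doc_id: str):
    """Split an ID back into (language, topic, url)."""
    language, topic, url = doc_id.split("!", 2)
    return language, topic, url


doc_id = make_composite_id("ENU", "12345", "www.testurl.com/enu/doc1")
```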

Re: Applying Tokenizers and Filters to CopyFields

2015-03-25 Thread Martin Wunderlich
Thanks a lot, Michael. See replies below. On 25.03.2015 at 21:41, Michael Della Bitta michael.della.bi...@appinions.com wrote: Two other things I noticed: 1. You probably don't want to store your copyFields. That's literally going to be the same information each time. OK, got it. I

Re: Applying Tokenizers and Filters to CopyFields

2015-03-25 Thread Michael Della Bitta
Two other things I noticed: 1. You probably don't want to store your copyFields. That's literally going to be the same information each time. 2. Your expectation that the pre-processed version of the text is added to the index may be incorrect. Anything done in <analyzer type="query"> sections actually

Re: Replica and node states

2015-03-25 Thread Shalin Shekhar Mangar
On Wed, Mar 25, 2015 at 12:51 PM, Shai Erera ser...@gmail.com wrote: Thanks. Does Solr ever clean up those states? I.e. does it ever remove down replicas, or replicas belonging to non-live_nodes after some time? Or will these remain in the cluster state forever (assuming they never come back

Re: Applying Tokenizers and Filters to CopyFields

2015-03-25 Thread Ahmet Arslan
Hi Martin, fq means filter query. Maybe you want to use the qf (query fields) parameter of edismax? On Wednesday, March 25, 2015 9:23 PM, Martin Wunderlich martin...@gmx.net wrote: Hi all, I am wondering what the process is for applying Tokenizers and Filter (as defined in the FieldType

Re: Applying Tokenizers and Filters to CopyFields

2015-03-25 Thread Martin Wunderlich
Thanks a lot, Ahmet. I’ve just read up on this query field parameter and it sounds good. Since the field contents are currently all identical, I can’t really test it yet. Cheers, Martin On 25.03.2015 at 21:27, Ahmet Arslan iori...@yahoo.com.INVALID wrote: Hi Martin, fq means

location field giving error for lat long

2015-03-25 Thread abhayd
Hi, I have a field named GeoLocate with datatype location. For some lat and long values it gives me the following error during the indexing process: Can't parse point '139.9544301,35.4298081' because: Bad Y value 139.9544301 is not in boundary Rect(minX=-180.0,maxX=180.0,minY=-90.0,maxY=90.0) Any idea
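The error above is consistent with the point being supplied in longitude,latitude order: Solr's location type expects `lat,lon`, and 139.9544301 is outside the valid latitude range of -90 to 90. A minimal Python sketch of a client-side check follows; the function name is hypothetical, and reading the posted value as a swapped pair is an assumption.

```python
def to_solr_point(lat: float, lon: float) -> str:
    """Format a point as 'lat,lon', rejecting out-of-range values."""
    if not -90.0 <= lat <= 90.0:
        raise ValueError(f"latitude {lat} is not in [-90, 90]")
    if not -180.0 <= lon <= 180.0:
        raise ValueError(f"longitude {lon} is not in [-180, 180]")
    return f"{lat},{lon}"


# The values from the error message, with latitude first.
point = to_solr_point(35.4298081, 139.9544301)
```

Running the same check with the two arguments swapped raises `ValueError`, which surfaces the problem before the document reaches Solr.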

Re: Custom TokenFilter

2015-03-25 Thread Test Test
Re, Finally, I think I found where this problem comes from. I didn't extend the right class: instead of using Tokenizers, I'm using Token filter. Erick, thanks for your replies. Regards. On Wednesday, 25 March 2015 at 23:55, Test Test andymish...@yahoo.fr wrote: Re, I have tried to remove

Re: Applying Tokenizers and Filters to CopyFields

2015-03-25 Thread Michael Della Bitta
I agree the terminology is possibly a little confusing. Stored refers to values that are stored verbatim. You can retrieve them verbatim. Analysis does not affect stored values. Indexed values are tokenized/transformed and stored inverted. You can't recover the literal analyzed version (at least,

Re: Problem with Terms Query Parser

2015-03-25 Thread Jack Krupansky
That should work. Check to be sure that you really are running Solr 5.0. Was it an old version of trunk or the 5x branch before last August when the terms query parser was added? -- Jack Krupansky On Tue, Mar 24, 2015 at 5:15 PM, Shamik Bandopadhyay sham...@gmail.com wrote: Hi, I'm trying

RE: German Compound Splitter words.fst causing problems.

2015-03-25 Thread Markus Jelsma
Hello Chris - i don't know that token filter you mention but i would like to recommend Lucene's HyphenationCompoundWordTokenFilter. It works reasonably well if you provide the hyphenation rules and a dictionary. It has some flaws such as decompounding to irrelevant subwords, overlapping

Re: Custom TokenFilter

2015-03-25 Thread Test Test
Re, I have tried to remove all the redundant jar files. Then I relaunched it, but it is blocked on the same issue right away. It's very strange. Regards, On Wednesday, 25 March 2015 at 23:31, Erick Erickson erickerick...@gmail.com wrote: Wait, you didn't put, say, lucene-core-4.10.2.jar

Re: Applying Tokenizers and Filters to CopyFields

2015-03-25 Thread Erick Erickson
Martin: Perhaps this would help. indexed=true, stored=true: field can be searched. The raw input (not analyzed in any way) can be shown to the user in the results list. indexed=true, stored=false: field can be searched. However, the field can't be returned in the results list with the document.

RE: Difference in indexing using config file vs client i.e SolrJ

2015-03-25 Thread Purohit, Sumit
Thanks Erick for the helpful explanations. thanks sumit From: Erick Erickson [erickerick...@gmail.com] Sent: Monday, March 23, 2015 4:58 PM To: solr-user@lucene.apache.org Subject: Re: Difference in indexing using config file vs client i.e SolrJ 1 Either

Re: Using G1 with Apache Solr

2015-03-25 Thread William Bell
The issue we had with Java 8 was with DIH handler. We were using Rhino and with the new implementation in Java 8, we had several Regex expression issues... We are almost ready to go now, since we moved away from Rhino and now use Java. Bill On Wed, Mar 25, 2015 at 2:14 AM, Daniel Collins

Re: [MASSMAIL]Re: Issues to create new core

2015-03-25 Thread Alejandro Jesus Mariño Molerio
Erick, Thanks for your help. I could fix the problem. I work in non-SolrCloud mode. Best Regards, Ale - Original message - From: Erick Erickson erickerick...@gmail.com To: solr-user@lucene.apache.org Sent: Tuesday, 24 March 2015 10:14:22 Subject: [MASSMAIL]Re: Issues to create new

Re: Setting up SOLR 5 from an RPM

2015-03-25 Thread Tom Evans
On Tue, Mar 24, 2015 at 4:00 PM, Tom Evans tevans...@googlemail.com wrote: Hi all We're migrating to SOLR 5 (from 4.8), and our infrastructure guys would prefer we installed SOLR from an RPM rather than extracting the tarball where we need it. They are creating the RPM file themselves, and

Re: Replica and node states

2015-03-25 Thread Shalin Shekhar Mangar
On Wed, Mar 25, 2015 at 9:24 PM, Shai Erera ser...@gmail.com wrote: There's even a param onlyIfDown=true which will remove a replica only if it's already 'down'. That will only work if the replica is in DOWN state, correct? That is, if the Solr JVM was killed, and the replica stays in
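For reference, the parameter discussed above is spelled `onlyIfDown` in the Collections API `DELETEREPLICA` action. The Python sketch below builds such a request URL; the base URL and the collection, shard, and replica names are hypothetical placeholders.

```python
from urllib.parse import urlencode


def delete_replica_url(base_url, collection, shard, replica, only_if_down=True):
    """Build a DELETEREPLICA URL that optionally requires the replica to be down."""
    params = {
        "action": "DELETEREPLICA",
        "collection": collection,
        "shard": shard,
        "replica": replica,
        # Solr expects the lowercase strings "true" / "false".
        "onlyIfDown": "true" if only_if_down else "false",
    }
    return f"{base_url}/admin/collections?{urlencode(params)}"


# Hypothetical names for illustration only.
url = delete_replica_url(
    "http://localhost:8983/solr", "collection1", "shard1", "core_node3"
)
```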

Re: Custom TokenFilter

2015-03-25 Thread Erick Erickson
Wait, you didn't put, say, lucene-core-4.10.2.jar into your contrib/tamingtext/dependency directory did you? That means you have Lucene (and solr and solrj and ...) in your class path twice since they're _already_ in your classpath by default since you're running Solr. All your jars should be in

Retrieving list of words for highlighting

2015-03-25 Thread Damien Dykman
In Solr 5 (or 4), is there an easy way to retrieve the list of words to highlight? Use case: allow an external application to highlight the matching words of a matching document, rather than using the highlighted snippets returned by Solr. Thanks, Damien

Data indexing is going too slow on single shard Why?

2015-03-25 Thread Nitin Solanki
Hello, Can anyone please assist me? I am indexing on a single shard and it is taking too much time to index the data. I am indexing around 49GB of data on a single shard. What's wrong? Why is Solr taking so much time to index the data? Earlier I was indexing the same data on 8 shards. That time,

Re: Custom TokenFilter

2015-03-25 Thread Erick Erickson
Thanks for letting us know the resolution, the problem was bugging me Erick On Wed, Mar 25, 2015 at 4:21 PM, Test Test andymish...@yahoo.fr wrote: Re, Finally, i think i found where this problem comes.I didn't use the right class extender, instead using Tokenizers, i'm using Token

Re: Replica and node states

2015-03-25 Thread Shai Erera
There's even a param onlyIfDown=true which will remove a replica only if it's already 'down'. That will only work if the replica is in DOWN state, correct? That is, if the Solr JVM was killed, and the replica stays in ACTIVE, but its node is not under /live_nodes, it won't get deleted? What I