date:20150325

While it's hard to answer this question because as others have said, it
depends, I think it will be good of we can quantify or assess the cost of
running a SolrCore.

For instance, let's say that a server can handle a load of 10M indexed
documents (I omit search load on purpose for now) in a single SolrCore.
Would the same server be able to handle the same number of documents, If we
indexed 1000 docs per SolrCore, in total of 10,000 SorClores? If the answer
is no, then it means there is some cost that comes w/ each SolrCore, and we
may at least be able to give an upper bound --- on a server with X amount
of storage, Y GB RAM and Z cores you can run up to maxSolrCores(X, Y, Z).

Another way to look at it, if I were to create empty SolrCores, would I be
able to create an infinite number of cores if storage was infinite? Or even
empty cores have their toll on CPU and RAM?

I know from the Lucene side of things that each SolrCore (carries a Lucene
index) there is a toll to an index -- the lexicon, IW's RAM buffer, Codecs
that store things in memory etc. For instance, one downside of splitting a
10M core into 10,000 cores is that the cost of the holding the total
lexicon (dictionary of indexed words) goes up drastically, since now every
word (just the byte[] of the word) is potentially represented in memory
10,000 times.

What other RAM/CPU/Storage costs does a SolrCore carry with it? There are
the caches of course, which really depend on how many documents are
indexed. Any other non-trivial or constant cost?

So yes, there isn't a single answer to this question. It's just like
someone would ask how many documents can a single Lucene index handle
efficiently. But if we can come up with basic numbers as I outlined above,
it might help people doing rough estimates. That doesn't mean people
shouldn't benchmark, as that upper bound may be wy too high for their
data set, query workload and search needs.

Shai

On Wed, Mar 25, 2015 at 5:25 AM, Damien Kamerman dami...@gmail.com wrote:

 From my experience on a high-end sever (256GB memory, 40 core CPU) testing
 collection numbers with one shard and two replicas, the maximum that would
 work is 3,000 cores (1,500 collections). I'd recommend much less (perhaps
 half of that), depending on your startup-time requirements. (Though I have
 settled on 6,000 collection maximum with some patching. See SOLR-7191). You
 could create multiple clouds after that, and choose the cloud least used to
 create your collection.

 Regarding memory usage I'd pencil in 6MB overheard (no docs) java heap per
 collection.

 On 25 March 2015 at 13:46, Ian Rose ianr...@fullstory.com wrote:

  First off thanks everyone for the very useful replies thus far.
 
  Shawn - thanks for the list of items to check.  #1 and #2 should be fine
  for us and I'll check our ulimit for #3.
 
  To add a bit of clarification, we are indeed using SolrCloud.  Our
 current
  setup is to create a new collection for each customer.  For now we allow
  SolrCloud to decide for itself where to locate the initial shard(s) but
 in
  time we expect to refine this such that our system will automatically
  choose the least loaded nodes according to some metric(s).
 
  Having more than one business entity controlling the configuration of a
   single (Solr) server is a recipe for disaster. Solr works well if there
  is
   an architect for the system.
 
 
  Jack, can you explain a bit what you mean here?  It looks like Toke
 caught
  your meaning but I'm afraid it missed me.  What do you mean by business
  entity?  Is your concern that with automatic creation of collections
 they
  will be distributed willy-nilly across the cluster, leading to uneven
 load
  across nodes?  If it is relevant, the schema and solrconfig are
 controlled
  entirely by me and is the same for all collections.  Thus theoretically
 we
  could actually just use one single collection for all of our customers
  (adding a 'customer:whatever' type fq to all queries) but since we
 never
  need to query across customers it seemed more performant (as well as
 safer
  - less chance of accidentally leaking data across customers) to use
  separate collections.
 
  Better to give each tenant a separate Solr instance that you spin up and
   spin down based on demand.
 
 
  Regarding this, if by tenant you mean customer, this is not viable for
 us
  from a cost perspective.  As I mentioned initially, many of our customers
  are very small so dedicating an entire machine to each of them would not
 be
  economical (or efficient).  Or perhaps I am not understanding what your
  definition of tenant is?
 
  Cheers,
  Ian
 
 
 
  On Tue, Mar 24, 2015 at 4:51 PM, Toke Eskildsen t...@statsbiblioteket.dk
  wrote:
 
   Jack Krupansky [jack.krupan...@gmail.com] wrote:
I'm sure that I am quite unqualified to describe his hypothetical
  setup.
   I
mean, he's the one using the term multi-tenancy, so it's for him to
 be
clear.
  
   It was my understanding that Ian

Re: Using G1 with Apache Solr

2015-03-25 Thread Daniel Collins

Interesting none the less Shawn :)

We use G1GC on our servers, we were on Java 7 (64-bit, RHEL6), but are
trying to migrate to Java 8 (which seems to cause more GC issues, so we
clearly need to tweak our settings), will investigate 8u40 though.

On 25 March 2015 at 04:23, Shawn Heisey apa...@elyograg.org wrote:

 On 3/24/2015 9:52 PM, Shawn Heisey wrote:
  On 3/24/2015 3:48 PM, Kamran Khawaja wrote:
  I'm running Solr 4.7.2 with Java 7u75 with the following JVM params:

 I really got my wires crossed.  Kamran sent his message to the
 hostpot-gc-use mailing list, not the solr-user list!

 Thanks,
 Shawn

Re: Custom TokenFilter

Thanks Eric, 
I'm working on Solr 4.10.2 and all my dependencies jar seems to be compatible 
with this version.


I can't figure out which one make this issue.
ThanksRegards,
 


 Le Mardi 24 mars 2015 23h45, Erick Erickson erickerick...@gmail.com a 
écrit :
   

 bq: 13 moreCaused by: java.lang.ClassCastException: class
com.tamingtext.texttamer.solr.

This usually means you have jar files from different versions of Solr
in your classpath.

Best,
Erick

On Tue, Mar 24, 2015 at 2:38 PM, Test Test andymish...@yahoo.fr wrote:
 Hi there,
 I'm trying to create my own TokenizerFactory (from tamingtext's book).After 
 setting schema.xml and have adding path in solrconfig.xml, i start solr.I 
 have this error message : Caused by: org.apache.solr.common.SolrException: 
 Plugin init failure for [schema.xml] fieldType text: Plugin init failure 
 for [schema.xml] analyzer/tokenizer: class 
 com.tamingtext.texttamer.solr.SentenceTokenizerFactory. Schema file is 
 .../conf/schema.xmlat 
 org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:595)at 
 org.apache.solr.schema.IndexSchema.init(IndexSchema.java:166)at 
 org.apache.solr.schema.IndexSchemaFactory.create(IndexSchemaFactory.java:55)at
  
 org.apache.solr.schema.IndexSchemaFactory.buildIndexSchema(IndexSchemaFactory.java:69)at
  
 org.apache.solr.core.ConfigSetService.createIndexSchema(ConfigSetService.java:90)at
  org.apache.solr.core.ConfigSetService.getConfig(ConfigSetService.java:62)... 
 7 moreCaused by: org.apache.solr.common.SolrException: Plugin init failure 
 for [schema.xml] fieldType text: Plugin init failure for [schema.xml] 
 analyzer/tokenizer: class 
 com.tamingtext.texttamer.solr.SentenceTokenizerFactoryat 
 org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:177)at
  org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:486)... 12 
 moreCaused by: org.apache.solr.common.SolrException: Plugin init failure for 
 [schema.xml] analyzer/tokenizer: class 
 com.tamingtext.texttamer.solr.SentenceTokenizerFactoryat 
 org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:177)at
  
 org.apache.solr.schema.FieldTypePluginLoader.readAnalyzer(FieldTypePluginLoader.java:362)at
  
 org.apache.solr.schema.FieldTypePluginLoader.create(FieldTypePluginLoader.java:95)at
  
 org.apache.solr.schema.FieldTypePluginLoader.create(FieldTypePluginLoader.java:43)at
  
 org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:151)...
  13 moreCaused by: java.lang.ClassCastException: class 
 com.tamingtext.texttamer.solr.SentenceTokenizerFactoryat 
 java.lang.Class.asSubclass(Class.java:3208)at 
 org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:474)at
  
 org.apache.solr.core.SolrResourceLoader.newInstance(SolrResourceLoader.java:593)at
  
 org.apache.solr.schema.FieldTypePluginLoader$2.create(FieldTypePluginLoader.java:342)at
  
 org.apache.solr.schema.FieldTypePluginLoader$2.create(FieldTypePluginLoader.java:335)at
  
 org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:151)
 Someone can help?
 Thanks.Regards.

Re: rough maximum cores (shards) per machine?

2015-03-25 Thread Toke Eskildsen

On Wed, 2015-03-25 at 03:46 +0100, Ian Rose wrote:
 Thus theoretically we could actually just use one single collection for
all of our customers (adding a 'customer:whatever' type fq to all
 queries) but since we never need to query across customers it seemed
 more performant (as well as safer - less chance of accidentally
 leaking data across customers) to use separate collections.

If only a few customers are active at a given time, it is more
performant to use a collestion/customer. If many of them are active, the
more performant option is to lump them together and filter on a field,
due to the redundancy-reduction of larger indexes.

The 1 collection/customer solution has another edge as ranking will be
calculated based on the corpus of the customer and not based on all
customers. If the number of customers is low enough to get the
individual collections solution to work, that would be the preferable
solution.

- Toke Eskildsen, State and University Library, Denmark

Re: rough maximum cores (shards) per machine?

2015-03-25 Thread Per Steffensen

In one of our production environments we use 32GB, 4-core, 3T RAID0 
spinning disk Dell servers (do not remember the exact model). We have 
about 25 collections with 2 replica (shard-instances) per collection on 
each machine - 25 machines. Total of 25 coll * 2 replica/coll/machine * 
25 machines = 1250 replica. Each replica contains about 800 million 
pretty small documents - thats about 1000 billion (do not know the 
english word for it) documents all in all. We index about 1.5 billion 
new documents every day (mainly into one of the collections = 50 replica 
across 25 machine) and keep a history of 2 years on the data. Shifting 
the index into collection every month. We can fairly easy keep up with 
the indexing load. We have almost non of the data on the heap, but of 
course a small fraction of the data in the files will at any time be in 
OS file-cache.
Compared to our indexing frequency we do not do a lot of searches. We 
have about 10 users searching the system from time to time - anything 
from major extracts to small quick searches. Depending on the nature of 
the search we have response-times between 1 sec and 5 min. But of course 
that is very dependent on clever choice on each field wrt index, 
store, doc-value etc.
BUT we are not using out-of-box Apache Solr. We have made quit a lot of 
performance tweaks ourselves.
Please note that, even though you disable all Solr caches, each replica 
will use heap-memory linearly dependent on the number of documents (and 
their size) in that replica. But not much, so you can get pretty far 
with relatively little RAM.
Our version of Solr is based on Apache Solr 4.4.0, but I expect/hope it 
did not get worse in newer releases.


Just to give you some idea of what can at least be achieved - in the 
high-end of #replica and #docs, I guess


Regards, Per Steffensen

On 24/03/15 14:02, Ian Rose wrote:

Hi all -

I'm sure this topic has been covered before but I was unable to find any
clear references online or in the mailing list.

Are there any rules of thumb for how many cores (aka shards, since I am
using SolrCloud) is too many for one machine?  I realize there is no one
answer (depends on size of the machine, etc.) so I'm just looking for a
rough idea.  Something like the following would be very useful:

* People commonly run up to X cores/shards on a mid-sized (4 or 8 core)
server without any problems.
* I have never heard of anyone successfully running X cores/shards on a
single machine, even if you throw a lot of hardware at it.

Thanks!
- Ian

Re: Data indexing is going too slow on single shard Why?

On 3/25/2015 5:03 AM, Nitin Solanki wrote:
 Please can anyone assist me? I am indexing on single shard it
 is taking too much of time to index data. And I am indexing around 49GB of
 data on single shard. What's wrong? Why solr is taking too much time to
 index data?
 Earlier I was indexing same data on 8 shards. That time, it was fast as
 compared to single shard. Why so? Any help please..

There's practically no information to go on here, so about all I can
offer is general information in return:

http://wiki.apache.org/solr/SolrPerformanceProblems

I looked over the previous messages that you have sent the list, and I
can find very little of the required information about your index.  I
see a lot of questions from you, but they did not include the kind of
details needed here:

How much total RAM is in each Solr server?  Are there any other programs
on the server with significant RAM requirements?  An example of such a
program would be a database server.  On each server, how much memory is
dedicated to the java heap(s) for Solr?  I gather from other questions
that you are running SolrCloud, can you confirm?

On a per-server basis, how much disk space do all the index replicas
take?  How many documents are on each server?  Note that for disk space
and number of documents, I am asking you to count every replica, not
take the total in the collection and divide it by the number of servers.

How are you doing your indexing?  For this question, I am asking what
program or Solr API is actually sending the data to Solr.  Possible
answers include the dataimport handler, a SolrJ program, one of the
other Solr APIs such as a PHP client, and hand-crafted URLs with an HTTP
client.

Thanks,
Shawn

Re: rough maximum cores (shards) per machine?

2015-03-25 Thread Ian Rose

Per - Wow, 1 trillion documents stored is pretty impressive.  One
clarification: when you say that you have 2 replica per collection on each
machine, what exactly does that mean?  Do you mean that each collection is
sharded into 50 shards, divided evenly over all 25 machines (thus 2 shards
per machine)?  Or are some of these slave replicas (e.g. 25x sharding with
1 replica per shard)?

Thanks!

On Wed, Mar 25, 2015 at 5:13 AM, Per Steffensen st...@designware.dk wrote:

 In one of our production environments we use 32GB, 4-core, 3T RAID0
 spinning disk Dell servers (do not remember the exact model). We have about
 25 collections with 2 replica (shard-instances) per collection on each
 machine - 25 machines. Total of 25 coll * 2 replica/coll/machine * 25
 machines = 1250 replica. Each replica contains about 800 million pretty
 small documents - thats about 1000 billion (do not know the english word
 for it) documents all in all. We index about 1.5 billion new documents
 every day (mainly into one of the collections = 50 replica across 25
 machine) and keep a history of 2 years on the data. Shifting the index
 into collection every month. We can fairly easy keep up with the indexing
 load. We have almost non of the data on the heap, but of course a small
 fraction of the data in the files will at any time be in OS file-cache.
 Compared to our indexing frequency we do not do a lot of searches. We have
 about 10 users searching the system from time to time - anything from major
 extracts to small quick searches. Depending on the nature of the search we
 have response-times between 1 sec and 5 min. But of course that is very
 dependent on clever choice on each field wrt index, store, doc-value etc.
 BUT we are not using out-of-box Apache Solr. We have made quit a lot of
 performance tweaks ourselves.
 Please note that, even though you disable all Solr caches, each replica
 will use heap-memory linearly dependent on the number of documents (and
 their size) in that replica. But not much, so you can get pretty far with
 relatively little RAM.
 Our version of Solr is based on Apache Solr 4.4.0, but I expect/hope it
 did not get worse in newer releases.

 Just to give you some idea of what can at least be achieved - in the
 high-end of #replica and #docs, I guess

 Regards, Per Steffensen


 On 24/03/15 14:02, Ian Rose wrote:

 Hi all -

 I'm sure this topic has been covered before but I was unable to find any
 clear references online or in the mailing list.

 Are there any rules of thumb for how many cores (aka shards, since I am
 using SolrCloud) is too many for one machine?  I realize there is no one
 answer (depends on size of the machine, etc.) so I'm just looking for a
 rough idea.  Something like the following would be very useful:

 * People commonly run up to X cores/shards on a mid-sized (4 or 8 core)
 server without any problems.
 * I have never heard of anyone successfully running X cores/shards on a
 single machine, even if you throw a lot of hardware at it.

 Thanks!
 - Ian

Re: Data indexing is going too slow on single shard Why?

2015-03-25 Thread Nitin Solanki

Hello,
* Updating my question again.*
Please can anyone assist me? I am indexing on single shard it
is taking too much of time to index data. And I am indexing around 49GB of
data on single shard. What's wrong? Why solr is taking too much time to
index data?
Earlier I was indexing same data on 8 shards. That time, it was fast as
compared to single shard. Why so? Any help please..


*HardCommit - 15 sec*
*SoftCommit - 10 min.*

ii) Searching a query/term is also taking too much time. Any help on this
also.



On Wed, Mar 25, 2015 at 4:33 PM, Nitin Solanki nitinml...@gmail.com wrote:

 Hello,
 Please can anyone assist me? I am indexing on single shard it
 is taking too much of time to index data. And I am indexing around 49GB of
 data on single shard. What's wrong? Why solr is taking too much time to
 index data?
 Earlier I was indexing same data on 8 shards. That time, it was fast as
 compared to single shard. Why so? Any help please..


 *HardCommit - 15 sec*
 *SoftCommit - 10 min.*



 Best,
 Nitin

Re: rough maximum cores (shards) per machine?

2015-03-25 Thread Per Steffensen


On 25/03/15 15:03, Ian Rose wrote:

Per - Wow, 1 trillion documents stored is pretty impressive.  One
clarification: when you say that you have 2 replica per collection on each
machine, what exactly does that mean?  Do you mean that each collection is
sharded into 50 shards, divided evenly over all 25 machines (thus 2 shards
per machine)?

Yes

   Or are some of these slave replicas (e.g. 25x sharding with
1 replica per shard)?
No replication. It does not work very well, at least in 4.4.0. Besides 
that I am not a big fan of two (or more) machines having to do all the 
indexing work and making sure to keep synchronized. Use a distributed 
file-system supporting multiple copies of every piece of data (like 
HDFS) for HA on data-level. Have only one Solr-node handle the indexing 
into a particular shard - if this Solr-node breaks down let another 
Solr-node take over the indexing leadership on this shard. Besides the 
indexing Solr-node several other Solr-nodes can serve data from this 
shard - just watching the data-folder (can commits) done by the 
indexing-leader of this particular shard - will give you HA on 
service-level. That is probably how we are going to do HA - pretty soon. 
But that is another story


Thanks!

No problem

Information Retrieval/Text Mining opportunity @ GE Research Data Mining Labs, Bangalore

2015-03-25 Thread Yavar Husain

I have loved working on Solr, so thought of posting an Information
Retrieval/Text Mining requirement that we have for our GE Data Mining
Research Labs @ Bangalore. Apologies if it is considered inappropriate here.



Here goes the Job Description for those interested:



If Information Retrieval, Text Mining, Natural Language Processing  
Machine Learning fascinates you; if you are excited to research  build
state of art Algorithms working on massive data-sets for an array of Text
Mining problems (Search, Named Entity Recognition, Semantic Graphs,
Sentiments, Spell Corrector, Text Categorization, Clustering, Topic
Modelling and so on…) then GE Global Research Data Mining Labs in Bangalore
is looking out for you. The real scope of applied research in our lab goes
way beyond the term “Natural” in Natural Language Processing.



Do connect if you need more information. Even if one has limited or no
experience with the areas mentioned above but is passionate about
Information Retrieval/Text Mining  have rock solid background in
Algorithms is encouraged to apply/connect.



Check out more on GE Research: http://www.geglobalresearch.com/



Cheers,

Yavar Husain

Lead Data Scientist - Text Mining Laboratory

GE Research, Bangalore

LinkedIn: http://www.linkedin.com/pub/yavar-husain/5/805/151

Text@ yavarhus...@gmail.com

Re: Have anyone used Automatic Phrase Tokenization (AutoPhrasingTokenFilterFactory) ?

2015-03-25 Thread afrooz

Hi,
I am a .net developer, but i need to use solr and specifically this good
plugin AutoPhrasingTokenFilter.
I searched everywhere and i couldn't get useful information, can any one
help me to run it in solr 5.0 or even previous versions. I am not able to
add it to my solr it is throwing below error while i am putting the Lib
folder under the core which contains also my jar files for the
AutoPhrasingTokenFilter

Error:
org.apache.solr.common.SolrException:org.apache.solr.common.SolrException:
JVM Error creating core [gettingstarted_shard1_replica1]: class
org.apache.lucene.codecs.diskdv.DiskDocValuesFormat$1 cannot access its
superclass org.apache.lucene.codecs.lucene45.Lucene45DocValuesConsumer
 





--
View this message in context: 
http://lucene.472066.n3.nabble.com/Have-anyone-used-Automatic-Phrase-Tokenization-AutoPhrasingTokenFilterFactory-tp4173808p4195182.html
Sent from the Solr - User mailing list archive at Nabble.com.

Sorting and Rerank

2015-03-25 Thread innoculou

If I do an initial search without any field sorting; and then do the exact
same query but also sort one field will I get the same result set in the
subsequent query but sorted.  In other words, does simply applying a sort
criteria affect the re-rank on the full search or does it just sort the
result from the main query?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Sorting-and-Rerank-tp4195187.html
Sent from the Solr - User mailing list archive at Nabble.com.

Solr Monitoring - Stored Stats?

2015-03-25 Thread Matt Kuiper

Hello,

I am familiar with the JMX points that Solr exposes to allow for monitoring of 
statistics like QPS, numdocs, Average Query Time...

I am wondering if there is a way to configure Solr to automatically store the 
value of these stats over time (for a given time interval), and then allow a 
user to query a stat over a time range.  So for the QPS stat,  the query might 
return a set that includes the QPS value for each hour in the time range 
specified.

Thanks,
Matt

Optimize SolrCloud without downtime

2015-03-25 Thread pavelhladik

Hi,

I didn't find the answer yet, please help. We have standalone Solr 5.0.0
with a few cores yet. One of those cores contains:

numDocs:120M
deletedDocs:110M

Our data are changing frequently so that's why so many deletedDocs.
Optimized core takes around 50GB on disk, we are now almost on 100GB and I'm
looking for best solution howto optimize this huge core without downtime. I
know optimization working in background, but anyway when the optimization is
running our search system is slow and sometimes I receive errors - this
behavior is like a downtime for us.

I would like to switch to SolrCloud, the performance is not a issue, so I
don't need the sharding feature at this time. I'm more interested with
replication and distribute requests by some Nginx proxy. Idea is:

1) proxy forward requests to node1 and optimize cores on node2
2) proxy forward requests to node2 and optimize cores on node1

But when I do optimize on node2, the node1 is doing optimization as well,
even if I use the distrib=false with curl.

Can you please recommend architecture for optimizing without downtime? Many
thanks.

Pavel



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Optimize-SolrCloud-without-downtime-tp4195170.html
Sent from the Solr - User mailing list archive at Nabble.com.

Replica and node states

Hi

Is it possible for a replica to be DOWN, while the node it resides on is
under /live_nodes? If so, what can lead to it, aside from someone unloading
a core.

I don't know if each SolrCore reports status to ZK independently, or it's
done by the Solr process as a whole.

Also, is it possible for a replica to report ACTIVE, while the node it
lives on is no longer under /live_nodes? Are there any ZK timings that can
cause that?

Shai

Re: Data indexing is going too slow on single shard Why?

2015-03-25 Thread Nitin Solanki

Hi Shawn,
  Sorry for all the things.

Server configuration:
8 CPUs.
32 GB RAM
O.S. - Linux
*Earlier*, I was using 8 shards without replica(default is 1) using SOLR
CLOUD. On server, Only Solr is running. There is no other application which
are running.  Java heap set to 4096 MB in Solr.  While indexing,
Solr(sometime) eats up whole RAM. I don't know how each solr server takes
RAM? Each server taking around 50 GB data(indexed). Actually, I had deleted
previous solr architecture, so I don't have any idea that how many
documents were on each shards and also don't know total documents.

*Currently*, I have 1 shard  with 2 replicas using SOLR CLOUD.
Data Size:
102Gsolr/node1/solr/wikingram_shard1_replica2
102Gsolr/node2/solr/wikingram_shard1_replica1

I am running a python script to index data using Solr RESTAPI. Commiting
2 Documents each time for indexing using python script with Solr
RESTAPI.
If I missed anything related to Solr. Please inform me..
THanks Shawn. Waiting for your reply




On Wed, Mar 25, 2015 at 7:33 PM, Shawn Heisey apa...@elyograg.org wrote:

 On 3/25/2015 5:03 AM, Nitin Solanki wrote:
  Please can anyone assist me? I am indexing on single shard it
  is taking too much of time to index data. And I am indexing around 49GB
 of
  data on single shard. What's wrong? Why solr is taking too much time to
  index data?
  Earlier I was indexing same data on 8 shards. That time, it was fast as
  compared to single shard. Why so? Any help please..

 There's practically no information to go on here, so about all I can
 offer is general information in return:

 http://wiki.apache.org/solr/SolrPerformanceProblems

 I looked over the previous messages that you have sent the list, and I
 can find very little of the required information about your index.  I
 see a lot of questions from you, but they did not include the kind of
 details needed here:

 How much total RAM is in each Solr server?  Are there any other programs
 on the server with significant RAM requirements?  An example of such a
 program would be a database server.  On each server, how much memory is
 dedicated to the java heap(s) for Solr?  I gather from other questions
 that you are running SolrCloud, can you confirm?

 On a per-server basis, how much disk space do all the index replicas
 take?  How many documents are on each server?  Note that for disk space
 and number of documents, I am asking you to count every replica, not
 take the total in the collection and divide it by the number of servers.

 How are you doing your indexing?  For this question, I am asking what
 program or Solr API is actually sending the data to Solr.  Possible
 answers include the dataimport handler, a SolrJ program, one of the
 other Solr APIs such as a PHP client, and hand-crafted URLs with an HTTP
 client.

 Thanks,
 Shawn

Re: Sorting and Rerank

2015-03-25 Thread Koji Sekiguchi


Hi,

You're right. Those sets are same each other, only documents order is different.

Koji


On 2015/03/26 0:53, innoculou wrote:

If I do an initial search without any field sorting; and then do the exact
same query but also sort one field will I get the same result set in the
subsequent query but sorted.  In other words, does simply applying a sort
criteria affect the re-rank on the full search or does it just sort the
result from the main query?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Sorting-and-Rerank-tp4195187.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Unable to setup solr cloud with multiple collections.

You're still mixing master/slave with SolrCloud. Do _not_ reconfigure
the replication. If you want your core (we call them replicas in
SolrCloud) to appear on various nodes in your cluster, either create
the collection with the nodes specified (createNodeSet) or, once the
collection is created on any node (or set of nodes) do an ADDREPLICA
(again with the collections API) where you want replicas to appear.
The rest is automatic, i.e. the replica's index will be copied from
the leader, all updates will be forwarded etc., without you doing any
other configuration.

I think you're shooting yourself in the foot by trying to fiddle with
replication.

Or I misunderstand your problem entirely.

Best,
Erick

On Tue, Mar 24, 2015 at 8:09 PM, sthita sthit...@gmail.com wrote:
Thanks Erick for your reply.
I am trying to create a new core i.e dict_cn , which is totally different in
terms of index data, configs etc from the existing core abc.
The core is created successfully in my master (i.e mail) and i can do solr
query on this newly created core .
All the config files(Schema.xml and solrconfig.xml) are in mail server and
zookeper helps it for me to share all config files to other collections.
I did the similar setup to other collection , so that newly created core
should be available to all the collections, but it is still showing down.

--
View this message in context:
http://lucene.472066.n3.nabble.com/Unable-to-setup-solr-cloud-with-multiple-collections-tp4194833p4195078.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Custom TokenFilter

Images don't come through the mailing list, can't see your image.

Whether or not all the jars in the directory you're working on are
consistent is the least of your problems. Are the libs to be found in any
_other_ place specified on your classpath?

Best,
Erick

On Wed, Mar 25, 2015 at 12:36 AM, Test Test andymish...@yahoo.fr wrote:

 Thanks Eric,

 I'm working on Solr 4.10.2 and all my dependencies jar seems to be
 compatible with this version.

 [image: Image en ligne]

 I can't figure out which one make this issue.

 Thanks
 Regards,




   Le Mardi 24 mars 2015 23h45, Erick Erickson erickerick...@gmail.com a
 écrit :


 bq: 13 moreCaused by: java.lang.ClassCastException: class
 com.tamingtext.texttamer.solr.

 This usually means you have jar files from different versions of Solr
 in your classpath.

 Best,
 Erick

 On Tue, Mar 24, 2015 at 2:38 PM, Test Test andymish...@yahoo.fr wrote:
  Hi there,
  I'm trying to create my own TokenizerFactory (from tamingtext's
 book).After setting schema.xml and have adding path in solrconfig.xml, i
 start solr.I have this error message : Caused by:
 org.apache.solr.common.SolrException: Plugin init failure for [schema.xml]
 fieldType text: Plugin init failure for [schema.xml] analyzer/tokenizer:
 class com.tamingtext.texttamer.solr.SentenceTokenizerFactory. Schema file
 is .../conf/schema.xmlat
 org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:595)at
 org.apache.solr.schema.IndexSchema.init(IndexSchema.java:166)at
 org.apache.solr.schema.IndexSchemaFactory.create(IndexSchemaFactory.java:55)at
 org.apache.solr.schema.IndexSchemaFactory.buildIndexSchema(IndexSchemaFactory.java:69)at
 org.apache.solr.core.ConfigSetService.createIndexSchema(ConfigSetService.java:90)at
 org.apache.solr.core.ConfigSetService.getConfig(ConfigSetService.java:62)...
 7 moreCaused by: org.apache.solr.common.SolrException: Plugin init failure
 for [schema.xml] fieldType text: Plugin init failure for [schema.xml]
 analyzer/tokenizer: class
 com.tamingtext.texttamer.solr.SentenceTokenizerFactoryat
 org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:177)at
 org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:486)... 12
 moreCaused by: org.apache.solr.common.SolrException: Plugin init failure
 for [schema.xml] analyzer/tokenizer: class
 com.tamingtext.texttamer.solr.SentenceTokenizerFactoryat
 org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:177)at
 org.apache.solr.schema.FieldTypePluginLoader.readAnalyzer(FieldTypePluginLoader.java:362)at
 org.apache.solr.schema.FieldTypePluginLoader.create(FieldTypePluginLoader.java:95)at
 org.apache.solr.schema.FieldTypePluginLoader.create(FieldTypePluginLoader.java:43)at
 org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:151)...
 13 moreCaused by: java.lang.ClassCastException: class
 com.tamingtext.texttamer.solr.SentenceTokenizerFactoryat
 java.lang.Class.asSubclass(Class.java:3208)at
 org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:474)at
 org.apache.solr.core.SolrResourceLoader.newInstance(SolrResourceLoader.java:593)at
 org.apache.solr.schema.FieldTypePluginLoader$2.create(FieldTypePluginLoader.java:342)at
 org.apache.solr.schema.FieldTypePluginLoader$2.create(FieldTypePluginLoader.java:335)at
 org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:151)
  Someone can help?
  Thanks.Regards.

Re: Setting up SOLR 5 from an RPM

On 3/25/2015 5:49 AM, Tom Evans wrote:
 On Tue, Mar 24, 2015 at 4:00 PM, Tom Evans tevans...@googlemail.com wrote:
 Hi all

 We're migrating to SOLR 5 (from 4.8), and our infrastructure guys
 would prefer we installed SOLR from an RPM rather than extracting the
 tarball where we need it. They are creating the RPM file themselves,
 and it installs an init.d script and the equivalent of the tarball to
 /opt/solr.

 We're having problems running SOLR from the installed files, as SOLR
 wants to (I think) extract the WAR file and create various temporary
 files below /opt/solr/server.
 
 From the SOLR 5 reference guide, section Managing SOLR, sub-section
 Taking SOLR to production, it seems changing the ownership of the
 installed files to the user that will run SOLR is an explicit
 requirement if you do not wish to run as root.
 
 It would be better if this was not required. With most applications
 you do not normally require permission to modify the installed files
 in order to run the application, eg I do not need write permission to
 /usr/share/vim to run vim, it is a shame I need write permission to
 /opt/solr to run solr.

I think you will only need to change the ownership of the solr home and
the location where the .war file is extracted, which by default is
server/solr-webapp.  The user must be able to *read* the program data,
but should not need to write to it. If you are using the start script
included with Solr 5 and one of the examples, I believe the logging
destination will also be located under the solr home, but you should
make sure that's the case.

Thanks,
Shawn

Re: Optimize SolrCloud without downtime

That's a high number of deleted documents as a percentage of your
index! Or at least I find those numbers surprising. When segments are
merged in the background during normal indexing, quite a bit of weight
is given to segments that have a high percentage of deleted docs. I
usually see at most 10-20% of docs deleted.

So what kinds of things have you done to get into this state? Did you
optimize previously? Change the merge policy? Anything else?

Best,
Erick

On Wed, Mar 25, 2015 at 8:08 AM, pavelhladik pavel.hla...@profimedia.cz wrote:
 Hi,

 I didn't find the answer yet, please help. We have standalone Solr 5.0.0
 with a few cores yet. One of those cores contains:

 numDocs:120M
 deletedDocs:110M

 Our data are changing frequently so that's why so many deletedDocs.
 Optimized core takes around 50GB on disk, we are now almost on 100GB and I'm
 looking for best solution howto optimize this huge core without downtime. I
 know optimization working in background, but anyway when the optimization is
 running our search system is slow and sometimes I receive errors - this
 behavior is like a downtime for us.

 I would like to switch to SolrCloud, the performance is not a issue, so I
 don't need the sharding feature at this time. I'm more interested with
 replication and distribute requests by some Nginx proxy. Idea is:

 1) proxy forward requests to node1 and optimize cores on node2
 2) proxy forward requests to node2 and optimize cores on node1

 But when I do optimize on node2, the node1 is doing optimization as well,
 even if I use the distrib=false with curl.

 Can you please recommend architecture for optimizing without downtime? Many
 thanks.

 Pavel



 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Optimize-SolrCloud-without-downtime-tp4195170.html
 Sent from the Solr - User mailing list archive at Nabble.com.

Re: Optimize SolrCloud without downtime

On 3/25/2015 9:08 AM, pavelhladik wrote:
 Our data are changing frequently so that's why so many deletedDocs.
 Optimized core takes around 50GB on disk, we are now almost on 100GB and I'm
 looking for best solution howto optimize this huge core without downtime. I
 know optimization working in background, but anyway when the optimization is
 running our search system is slow and sometimes I receive errors - this
 behavior is like a downtime for us.

 I would like to switch to SolrCloud, the performance is not a issue, so I
 don't need the sharding feature at this time. I'm more interested with
 replication and distribute requests by some Nginx proxy. Idea is:

 1) proxy forward requests to node1 and optimize cores on node2
 2) proxy forward requests to node2 and optimize cores on node1

 But when I do optimize on node2, the node1 is doing optimization as well,
 even if I use the distrib=false with curl.

You are correct - with SolrCloud, any optimize command will optimize the
entire collection, one shard replica at a time, regardless of any
distrib parameter.  It does NOT optimize multiple replicas or shards in
parallel.  I thought we had an issue in Jira asking to make optimize
honor a distrib=false parameter, but I can't find it.  Even if that
were fixed, it would not help you, because SolrCloud is only optimizing
one shard replica at any given moment.

Optimization does NOT directly result in downtime ... but because
optimize generates a very large amount of disk I/O, it can be disruptive
if your server does not have enough resources.

I don't have enough information to say for sure, but I am betting that
you don't have enough RAM in your machine to effectively cache your
index, so anything that negatively affects performance, like an
optimize, is too much for your server to handle at the same time as
ongoing queries or indexing.  The info on this wiki page can help you
determine how much total RAM you might need:

http://wiki.apache.org/solr/SolrPerformanceProblems

Thanks,
Shawn

Re: Optimize SolrCloud without downtime

bq:  It does NOT optimize multiple replicas or shards in parallel.

This behavior was changed in 4.10 though, see:
https://issues.apache.org/jira/browse/SOLR-6264

So with 5.0 Pavel is seeing the result of that JIRA I bet.

I have to agree with Shawn, the optimization step should proceed
invisibly in the background, I suspect you have something else
going on here.

FWIW,
Erick

On Wed, Mar 25, 2015 at 9:54 AM, Shawn Heisey apa...@elyograg.org wrote:
 On 3/25/2015 9:08 AM, pavelhladik wrote:
 Our data are changing frequently so that's why so many deletedDocs.
 Optimized core takes around 50GB on disk, we are now almost on 100GB and I'm
 looking for best solution howto optimize this huge core without downtime. I
 know optimization working in background, but anyway when the optimization is
 running our search system is slow and sometimes I receive errors - this
 behavior is like a downtime for us.

 I would like to switch to SolrCloud, the performance is not a issue, so I
 don't need the sharding feature at this time. I'm more interested with
 replication and distribute requests by some Nginx proxy. Idea is:

 1) proxy forward requests to node1 and optimize cores on node2
 2) proxy forward requests to node2 and optimize cores on node1

 But when I do optimize on node2, the node1 is doing optimization as well,
 even if I use the distrib=false with curl.

 You are correct - with SolrCloud, any optimize command will optimize the
 entire collection, one shard replica at a time, regardless of any
 distrib parameter.  It does NOT optimize multiple replicas or shards in
 parallel.  I thought we had an issue in Jira asking to make optimize
 honor a distrib=false parameter, but I can't find it.  Even if that
 were fixed, it would not help you, because SolrCloud is only optimizing
 one shard replica at any given moment.

 Optimization does NOT directly result in downtime ... but because
 optimize generates a very large amount of disk I/O, it can be disruptive
 if your server does not have enough resources.

 I don't have enough information to say for sure, but I am betting that
 you don't have enough RAM in your machine to effectively cache your
 index, so anything that negatively affects performance, like an
 optimize, is too much for your server to handle at the same time as
 ongoing queries or indexing.  The info on this wiki page can help you
 determine how much total RAM you might need:

 http://wiki.apache.org/solr/SolrPerformanceProblems

 Thanks,
 Shawn

Re: [MARKETING] Re: KeywordTokenizerFactory splits by whitespaces

Yeah, this is a head scratcher. But it _has_ to be that way for things
like edismax to work where you mix-and-match fielded and un-fielded
terms. I.e. I can have a query like q=field1:whatever some more
stuffqf=field2,field3,field4 where I want whatever to be evaluated
only against field1, but the remaining terms to be searched for in the
three other fields.

The deal is that how you want _individual terms_ handled at index time
may be different than at query time, WordDelimiterFilterFactory and
SynonymFilterFactory are prime examples of this. Getting my head
around why field analysis is completely different from query _parsing_
took me a while. But the fact that both are query is confusing, I'm
just not sure what would be better since they're very closely related,
they just both deal with queries just at different times.

Missed the wildcards, you're right you need to escape Or use
the prefix query parser. It'd look like:
q={!prefix f=proj_name_sort}CR610070 An

No escaping is necessary. If you add debug=query to a query using the
prefix queries you see that there's an implied * trailing... Do be
aware, though, that there is _no_ analysis done, so things like
lowercasing would have to be done by the app.

Neither one is more correct, in fact I believe that the wildcard
query becomes a prefix query eventually, strictly a matter of how you
want to deal with that in the app.

Best,
Erick

On Wed, Mar 25, 2015 at 10:04 AM, Vadim Gorlovetsky vadim...@amdocs.com wrote:
 Thanks for a quick response.

 A bit confusing that analyzer of query type configured to use 
 KeywordTokenizerFactory does not un-tokenize query criteria.
 I guess whitespace only the special case because it separates phrases in a 
 query and runs prior analyzing.

 Actually I am handling a query the way you are recommended:
 Double quotes for exact matching and escaped whitespace for a values with 
 wildcards (double quotes do not work as probably considering * wildcard as 
 a part of the criteria value).

 Thanks
 Vadim

 -Original Message-
 From: Erick Erickson [mailto:erickerick...@gmail.com]
 Sent: Wednesday, March 25, 2015 6:34 PM
 To: solr-user@lucene.apache.org
 Subject: [MARKETING] Re: KeywordTokenizerFactory splits by whitespaces

 This is a _very_ common thing we all had to learn; what you're seeing is the 
 results of the _query parser_, not the analysis chain. Anything like
 proj_name_sort:term1 term2 gets split at the query parser level, attaching 
 debug=query to the URL should show down in the parsed query section 
 something like:

 proj_name_sort:term1 default_search_field:term2

 To get thing through the query parser, enclose in double quotes, escape the 
 space and such. That'll get the terms _as a single token_ to the analysis 
 chain for that field where the behavior will be what you expect.

 Best,
 Erick

 On Wed, Mar 25, 2015 at 9:26 AM, Vadim Gorlovetsky vadim...@amdocs.com 
 wrote:
 Hello,

 solr.KeywordTokenizerFactory seems splitting by whitespaces though according 
 SOLR documentation shouldn't do that.


 For example I have the following configuration for the fields proj_name 
 and proj_name_sort:

 field name=proj_name type=sortable_text_general indexed=true
 stored=true/ field name=proj_name_sort type=string_sort
 indexed=true stored=false/ ..

 copyField source=proj_name dest=proj_name_sort /
 ..

 fieldType name=string_sort class=solr.TextField sortMissingLast=true 
 omitNorms=true
   analyzer
 !-- KeywordTokenizer does no actual tokenizing, so the entire
  input string is preserved as a single token
  --
 tokenizer class=solr.KeywordTokenizerFactory/
 !-- The LowerCase TokenFilter does what you expect, which can be
  when you want your sorting to be case insensitive
   --
 filter class=solr.LowerCaseFilterFactory /
 !-- The TrimFilter removes any leading or trailing whitespace --
 filter class=solr.TrimFilterFactory /
   /analyzer
 /fieldType

 There are 3 indexed documents having the respective field values:
 proj_name:
 Test1008
 CR610070 Test1
 CR610070 Another Test2

 Searching on the proj_name_sort giving me the following results:

 Query

 Expected

 Real

 Comments

 proj_name_sort : CR610070 Test1

 CR610070 Test1

 CR610070 Test1

 Expectable as seems searching exact un-tokenized value

 proj_name_sort : CR610070 Te

 None

 None

 Expectable as seems searching exact un-tokenized value

 proj_name_sort : CR610070 Te*

 CR610070 Test1

 CR610070 Test1, Test1008, CR610070 Another Test2

 Seems splits on tokens by whitespace ?

 proj_name_sort : CR610070 An*

 CR610070 Another Test2

 CR610070 Another Test2

 Expectable as seems applying wild card on un-tokenized value

 proj_name_sort : CR610070 Another Te*

 CR610070 Another Test2

 CR610070 Test1, Test1008, CR610070 Another Test2

 Seems splits on tokens by whitespace ?

 proj_name_sort : CR610070 Another Test1*

 None

 CR610070 Test1, Test1008, CR610070

Re: Solr Monitoring - Stored Stats?

Matt:

Not really. There's a bunch of third-party log analysis tools that
give much of this information (not everything exposed by JMX of course
is in the log files though).

Not quite sure whether things like Nagios, Zabbix and the like have
this kind of stuff built in seems like a natural extension of those
kinds of tools though

Not much help here...
Erick

On Wed, Mar 25, 2015 at 8:26 AM, Matt Kuiper matt.kui...@issinc.com wrote:
 Hello,

 I am familiar with the JMX points that Solr exposes to allow for monitoring 
 of statistics like QPS, numdocs, Average Query Time...

 I am wondering if there is a way to configure Solr to automatically store the 
 value of these stats over time (for a given time interval), and then allow a 
 user to query a stat over a time range.  So for the QPS stat,  the query 
 might return a set that includes the QPS value for each hour in the time 
 range specified.

 Thanks,
 Matt

German Compound Splitter words.fst causing problems.

2015-03-25 Thread Chris Morley

Hello, Chris Morley here, of Wayfair.com. I am working on the German
compound-splitter by Dawid Weiss.

I tried to upgrade the words.fst file that comes with the German
compound-splitter using Solr 3.5, but it doesn't work. Below is the
IndexNotFoundException that I get.

cmorley@Caracal01:~/Work/oss/git/apache-solr-3.5.0$ java -cp
lucene/build/lucene-core-3.5-SNAPSHOT.jar org.apache.lucene.index.IndexUpgrader
wordsFst
Exception in thread main org.apache.lucene.index.IndexNotFoundException:
org.apache.lucene.store.MMapDirectory@/home/cmorley/Work/oss/git/apache-solr-3.5.0/wordsFst
lockFactory=org.apache.lucene.store.NativeFSLockFactory@201a755e
at
org.apache.lucene.index.IndexUpgrader.upgrade(IndexUpgrader.java:118)
at
org.apache.lucene.index.IndexUpgrader.main(IndexUpgrader.java:85)

The reason I'm attempting this at all is due to the answer here,
http://stackoverflow.com/questions/25450865/migrate-solr-1-4-index-files-to-4-7,
which says to do the upgrade in a two step process, first using Solr 3.5, and
then the latest Solr version (4.10.3). When I try this running the unit tests
for my modified German compound-splitter I'm getting this same type of error.
The thing is, this is an FST, not an index, which is a little confusing. The
reason why I'm following this answer though, is because I'm getting that exact
same message when trying to build the (modified) project with mavenat the
point at which it tries to load in words.fst. Below.

[main] ERROR com.wayfair.lucene.analysis.de.compound.GermanCompoundSplitter -
Format version is not supported (resource:
com.wayfair.lucene.analysis.de.compound.InputStreamDataInput@79a66240): 0
(needs to be between 3 and 4). This version of Lucene only supports indexes
created with release 3.0 and later. Failed to initialize static data
structures for German compound splitter.

Thanks,
-Chris.

Re: Data indexing is going too slow on single shard Why?

On 3/25/2015 8:42 AM, Nitin Solanki wrote:
 Server configuration:
 8 CPUs.
 32 GB RAM
 O.S. - Linux

snip

 are running.  Java heap set to 4096 MB in Solr.  While indexing,

snip

 *Currently*, I have 1 shard  with 2 replicas using SOLR CLOUD.
 Data Size:
 102Gsolr/node1/solr/wikingram_shard1_replica2
 102Gsolr/node2/solr/wikingram_shard1_replica1

If both of those are on the same machine, I'm guessing that you're
running two Solr instances on that machine, so there's 8GB of RAM used
for Java.  That means you have about 24 GB of RAM left for caching ...
and 200GB of index data to cache.

24GB is not enough to cache 200GB of index.  If there is only one Solr
instance (leaving 28GB for caching) with 102GB of data on the machine,
it still might not be enough.  See that SolrPerformanceProblems wiki
page I linked in my earlier email.

For 102GB of data per server, I recommend at least 64GB of total RAM,
preferably 128GB.

For 204GB of data per server, I recommend at least 128GB of total RAM,
preferably 256GB.

Thanks,
Shawn

KeywordTokenizerFactory splits by whitespaces

2015-03-25 Thread Vadim Gorlovetsky

Hello,

solr.KeywordTokenizerFactory seems splitting by whitespaces though according 
SOLR documentation shouldn't do that.


For example I have the following configuration for the fields proj_name and 
proj_name_sort:

field name=proj_name type=sortable_text_general indexed=true 
stored=true/
field name=proj_name_sort type=string_sort indexed=true stored=false/
..

copyField source=proj_name dest=proj_name_sort /
..

fieldType name=string_sort class=solr.TextField sortMissingLast=true 
omitNorms=true
  analyzer
!-- KeywordTokenizer does no actual tokenizing, so the entire
 input string is preserved as a single token
 --
tokenizer class=solr.KeywordTokenizerFactory/
!-- The LowerCase TokenFilter does what you expect, which can be
 when you want your sorting to be case insensitive
  --
filter class=solr.LowerCaseFilterFactory /
!-- The TrimFilter removes any leading or trailing whitespace --
filter class=solr.TrimFilterFactory /
  /analyzer
/fieldType

There are 3 indexed documents having the respective field values:
proj_name:
Test1008
CR610070 Test1
CR610070 Another Test2

Searching on the proj_name_sort giving me the following results:

Query

Expected

Real

Comments

proj_name_sort : CR610070 Test1

CR610070 Test1

CR610070 Test1

Expectable as seems searching exact un-tokenized value

proj_name_sort : CR610070 Te

None

None

Expectable as seems searching exact un-tokenized value

proj_name_sort : CR610070 Te*

CR610070 Test1

CR610070 Test1, Test1008, CR610070 Another Test2

Seems splits on tokens by whitespace ?

proj_name_sort : CR610070 An*

CR610070 Another Test2

CR610070 Another Test2

Expectable as seems applying wild card on un-tokenized value

proj_name_sort : CR610070 Another Te*

CR610070 Another Test2

CR610070 Test1, Test1008, CR610070 Another Test2

Seems splits on tokens by whitespace ?

proj_name_sort : CR610070 Another Test1*

None

CR610070 Test1, Test1008, CR610070 Another Test2

Seems splits on tokens by whitespace ?


Please, advise the way to search on un-tokenized fields using partial criteria 
and wild cards.

Thanks
Vadim


This message and the information contained herein is proprietary and 
confidential and subject to the Amdocs policy statement,
you may review at http://www.amdocs.com/email_disclaimer.asp

Re: Setting up SOLR 5 from an RPM

2015-03-25 Thread Tom Evans

On Wed, Mar 25, 2015 at 2:40 PM, Shawn Heisey apa...@elyograg.org wrote:
 I think you will only need to change the ownership of the solr home and
 the location where the .war file is extracted, which by default is
 server/solr-webapp.  The user must be able to *read* the program data,
 but should not need to write to it. If you are using the start script
 included with Solr 5 and one of the examples, I believe the logging
 destination will also be located under the solr home, but you should
 make sure that's the case.


Thanks Shawn, this sort of makes sense. The thing which I cannot seem
to do is change the location where the war file is extracted. I think
this is probably because, as of solr 5, I am not supposed to know or
be aware that there is a war file, or that the war file is hosted in
jetty, which makes it tricky to specify the jetty temporary directory.

Our use case is that we want to create a single system image that
would be usable for several projects, each project would check out its
solr home and run solr as their own user (possibly on the same
server). Eg, /data/projectA being a solr home for one project,
/data/projectB being a solr home for another project, both running
solr from the same location.

Also, on a dev server, I want to install solr once, and each member of
my team run it from that single location. Because they cannot change
the temporary directory, and they cannot all own server/solr-webapp,
this does not work and they must each have their own copy of the solr
install.

I think the way we will go for this is in production to run all our
solr instance as the solr user, who will own the files in /opt/solr,
and have their solr home directory wherever they choose. In dev, we
will just do something...

Cheers

Tom

Re: KeywordTokenizerFactory splits by whitespaces

This is a _very_ common thing we all had to learn; what you're seeing
is the results of the _query parser_, not the analysis chain. Anything
like
proj_name_sort:term1 term2 gets split at the query parser level,
attaching debug=query to the URL should show down in the parsed
query section something like:

proj_name_sort:term1 default_search_field:term2

To get thing through the query parser, enclose in double quotes,
escape the space and such. That'll get the terms _as a single token_
to the analysis chain for that field where the behavior will be what
you expect.

Best,
Erick

On Wed, Mar 25, 2015 at 9:26 AM, Vadim Gorlovetsky vadim...@amdocs.com wrote:
 Hello,

 solr.KeywordTokenizerFactory seems splitting by whitespaces though according 
 SOLR documentation shouldn't do that.


 For example I have the following configuration for the fields proj_name and 
 proj_name_sort:

 field name=proj_name type=sortable_text_general indexed=true 
 stored=true/
 field name=proj_name_sort type=string_sort indexed=true 
 stored=false/
 ..

 copyField source=proj_name dest=proj_name_sort /
 ..

 fieldType name=string_sort class=solr.TextField sortMissingLast=true 
 omitNorms=true
   analyzer
 !-- KeywordTokenizer does no actual tokenizing, so the entire
  input string is preserved as a single token
  --
 tokenizer class=solr.KeywordTokenizerFactory/
 !-- The LowerCase TokenFilter does what you expect, which can be
  when you want your sorting to be case insensitive
   --
 filter class=solr.LowerCaseFilterFactory /
 !-- The TrimFilter removes any leading or trailing whitespace --
 filter class=solr.TrimFilterFactory /
   /analyzer
 /fieldType

 There are 3 indexed documents having the respective field values:
 proj_name:
 Test1008
 CR610070 Test1
 CR610070 Another Test2

 Searching on the proj_name_sort giving me the following results:

 Query

 Expected

 Real

 Comments

 proj_name_sort : CR610070 Test1

 CR610070 Test1

 CR610070 Test1

 Expectable as seems searching exact un-tokenized value

 proj_name_sort : CR610070 Te

 None

 None

 Expectable as seems searching exact un-tokenized value

 proj_name_sort : CR610070 Te*

 CR610070 Test1

 CR610070 Test1, Test1008, CR610070 Another Test2

 Seems splits on tokens by whitespace ?

 proj_name_sort : CR610070 An*

 CR610070 Another Test2

 CR610070 Another Test2

 Expectable as seems applying wild card on un-tokenized value

 proj_name_sort : CR610070 Another Te*

 CR610070 Another Test2

 CR610070 Test1, Test1008, CR610070 Another Test2

 Seems splits on tokens by whitespace ?

 proj_name_sort : CR610070 Another Test1*

 None

 CR610070 Test1, Test1008, CR610070 Another Test2

 Seems splits on tokens by whitespace ?


 Please, advise the way to search on un-tokenized fields using partial 
 criteria and wild cards.

 Thanks
 Vadim


 This message and the information contained herein is proprietary and 
 confidential and subject to the Amdocs policy statement,
 you may review at http://www.amdocs.com/email_disclaimer.asp

Re: Solr Monitoring - Stored Stats?

On 3/25/2015 9:26 AM, Matt Kuiper wrote:
 I am familiar with the JMX points that Solr exposes to allow for monitoring 
 of statistics like QPS, numdocs, Average Query Time...

 I am wondering if there is a way to configure Solr to automatically store the 
 value of these stats over time (for a given time interval), and then allow a 
 user to query a stat over a time range.  So for the QPS stat,  the query 
 might return a set that includes the QPS value for each hour in the time 
 range specified.

I am reasonably sure that JMX does not have this ability built in, and
Solr does not keep track of each stat over time.

Some of the statistics, in particular the average and percentile
statistics for QTime on a request handler, are relevant across the
entire history of the handler -- so they are valid until the core is
reloaded or Solr restarts.

Thanks,
Shawn

Re: Replica and node states

2015-03-25 Thread Shalin Shekhar Mangar

Comments inline:

On Wed, Mar 25, 2015 at 8:30 AM, Shai Erera ser...@gmail.com wrote:

 Hi

 Is it possible for a replica to be DOWN, while the node it resides on is
 under /live_nodes? If so, what can lead to it, aside from someone unloading
 a core.


Yes, aside from someone unloading the index, this can happen in two ways 1)
during startup each core publishes it's state as 'down' before it enters
recovery, and 2) the leader force-publishes a replica as 'down' if it is
not able to forward updates to that replica (this mechanism is called
Leader-Initiated-Recovery or LIR in short)

The #2 above can happen when the replica is partitioned from leader but
both are able to talk to ZooKeeper.



 I don't know if each SolrCore reports status to ZK independently, or it's
 done by the Solr process as a whole.


It is done on a per-core basis for now. But the 'live' node is maintained
one per Solr instance (JVM).


 Also, is it possible for a replica to report ACTIVE, while the node it
 lives on is no longer under /live_nodes? Are there any ZK timings that can
 cause that?


Yes, this can happen if the JVM crashed. A replica publishes itself as
'down' on shutdown so if the graceful shutdown step is skipped then the
replica will continue to be 'active' in the cluster state. Even LIR doesn't
apply here because there's no point in the leader marking a node as 'down'
if it is not 'live' already.



 Shai




-- 
Regards,
Shalin Shekhar Mangar.

RE: [MARKETING] Re: KeywordTokenizerFactory splits by whitespaces

2015-03-25 Thread Vadim Gorlovetsky

Thanks for a quick response.

A bit confusing that analyzer of query type configured to use 
KeywordTokenizerFactory does not un-tokenize query criteria.
I guess whitespace only the special case because it separates phrases in a 
query and runs prior analyzing.

Actually I am handling a query the way you are recommended:
Double quotes for exact matching and escaped whitespace for a values with 
wildcards (double quotes do not work as probably considering * wildcard as a 
part of the criteria value).

Thanks
Vadim

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Wednesday, March 25, 2015 6:34 PM
To: solr-user@lucene.apache.org
Subject: [MARKETING] Re: KeywordTokenizerFactory splits by whitespaces

This is a _very_ common thing we all had to learn; what you're seeing is the 
results of the _query parser_, not the analysis chain. Anything like
proj_name_sort:term1 term2 gets split at the query parser level, attaching 
debug=query to the URL should show down in the parsed query section 
something like:

proj_name_sort:term1 default_search_field:term2

To get thing through the query parser, enclose in double quotes, escape the 
space and such. That'll get the terms _as a single token_ to the analysis chain 
for that field where the behavior will be what you expect.

Best,
Erick

On Wed, Mar 25, 2015 at 9:26 AM, Vadim Gorlovetsky vadim...@amdocs.com wrote:
 Hello,

 solr.KeywordTokenizerFactory seems splitting by whitespaces though according 
 SOLR documentation shouldn't do that.


 For example I have the following configuration for the fields proj_name and 
 proj_name_sort:

 field name=proj_name type=sortable_text_general indexed=true 
 stored=true/ field name=proj_name_sort type=string_sort 
 indexed=true stored=false/ ..

 copyField source=proj_name dest=proj_name_sort / 
 ..

 fieldType name=string_sort class=solr.TextField sortMissingLast=true 
 omitNorms=true
   analyzer
 !-- KeywordTokenizer does no actual tokenizing, so the entire
  input string is preserved as a single token
  --
 tokenizer class=solr.KeywordTokenizerFactory/
 !-- The LowerCase TokenFilter does what you expect, which can be
  when you want your sorting to be case insensitive
   --
 filter class=solr.LowerCaseFilterFactory /
 !-- The TrimFilter removes any leading or trailing whitespace --
 filter class=solr.TrimFilterFactory /
   /analyzer
 /fieldType

 There are 3 indexed documents having the respective field values:
 proj_name:
 Test1008
 CR610070 Test1
 CR610070 Another Test2

 Searching on the proj_name_sort giving me the following results:

 Query

 Expected

 Real

 Comments

 proj_name_sort : CR610070 Test1

 CR610070 Test1

 CR610070 Test1

 Expectable as seems searching exact un-tokenized value

 proj_name_sort : CR610070 Te

 None

 None

 Expectable as seems searching exact un-tokenized value

 proj_name_sort : CR610070 Te*

 CR610070 Test1

 CR610070 Test1, Test1008, CR610070 Another Test2

 Seems splits on tokens by whitespace ?

 proj_name_sort : CR610070 An*

 CR610070 Another Test2

 CR610070 Another Test2

 Expectable as seems applying wild card on un-tokenized value

 proj_name_sort : CR610070 Another Te*

 CR610070 Another Test2

 CR610070 Test1, Test1008, CR610070 Another Test2

 Seems splits on tokens by whitespace ?

 proj_name_sort : CR610070 Another Test1*

 None

 CR610070 Test1, Test1008, CR610070 Another Test2

 Seems splits on tokens by whitespace ?


 Please, advise the way to search on un-tokenized fields using partial 
 criteria and wild cards.

 Thanks
 Vadim


 This message and the information contained herein is proprietary and 
 confidential and subject to the Amdocs policy statement, you may review at 
 http://www.amdocs.com/email_disclaimer.asp

This message and the information contained herein is proprietary and 
confidential and subject to the Amdocs policy statement,
you may review at http://www.amdocs.com/email_disclaimer.asp

Re: rough maximum cores (shards) per machine?

2015-03-25 Thread Jack Krupansky

Just to give a specific answer to the original question, I would say that
dozens of cores (collections) is certainly fine (assuming the total data
load and query rate is reasonable), maybe 50 or even 100. Low hundreds of
cores/collections MAY work, but isn't advisable. Thousands, if it works at
all, is probably just asking for trouble and likely to be far more hassle
than it could possible be worth.

Whether the number for you ends up being 37, 50, 75, 100, 237, or 1273, you
will have to do a proof of concept implementation to validate it.

I'm not sure where we are at these days for lazy-loading of cores. That may
work for you with hundreds (thousands?!) of cores/collections for tenants
who are mostly idle or dormant, but if the server is running long enough,
it may build up a lot of memory usage for collections that were active but
have gone idle after days or weeks.


-- Jack Krupansky

On Wed, Mar 25, 2015 at 2:49 AM, Shai Erera ser...@gmail.com wrote:

 While it's hard to answer this question because as others have said, it
 depends, I think it will be good of we can quantify or assess the cost of
 running a SolrCore.

 For instance, let's say that a server can handle a load of 10M indexed
 documents (I omit search load on purpose for now) in a single SolrCore.
 Would the same server be able to handle the same number of documents, If we
 indexed 1000 docs per SolrCore, in total of 10,000 SorClores? If the answer
 is no, then it means there is some cost that comes w/ each SolrCore, and we
 may at least be able to give an upper bound --- on a server with X amount
 of storage, Y GB RAM and Z cores you can run up to maxSolrCores(X, Y, Z).

 Another way to look at it, if I were to create empty SolrCores, would I be
 able to create an infinite number of cores if storage was infinite? Or even
 empty cores have their toll on CPU and RAM?

 I know from the Lucene side of things that each SolrCore (carries a Lucene
 index) there is a toll to an index -- the lexicon, IW's RAM buffer, Codecs
 that store things in memory etc. For instance, one downside of splitting a
 10M core into 10,000 cores is that the cost of the holding the total
 lexicon (dictionary of indexed words) goes up drastically, since now every
 word (just the byte[] of the word) is potentially represented in memory
 10,000 times.

 What other RAM/CPU/Storage costs does a SolrCore carry with it? There are
 the caches of course, which really depend on how many documents are
 indexed. Any other non-trivial or constant cost?

 So yes, there isn't a single answer to this question. It's just like
 someone would ask how many documents can a single Lucene index handle
 efficiently. But if we can come up with basic numbers as I outlined above,
 it might help people doing rough estimates. That doesn't mean people
 shouldn't benchmark, as that upper bound may be wy too high for their
 data set, query workload and search needs.

 Shai

 On Wed, Mar 25, 2015 at 5:25 AM, Damien Kamerman dami...@gmail.com
 wrote:

  From my experience on a high-end sever (256GB memory, 40 core CPU)
 testing
  collection numbers with one shard and two replicas, the maximum that
 would
  work is 3,000 cores (1,500 collections). I'd recommend much less (perhaps
  half of that), depending on your startup-time requirements. (Though I
 have
  settled on 6,000 collection maximum with some patching. See SOLR-7191).
 You
  could create multiple clouds after that, and choose the cloud least used
 to
  create your collection.
 
  Regarding memory usage I'd pencil in 6MB overheard (no docs) java heap
 per
  collection.
 
  On 25 March 2015 at 13:46, Ian Rose ianr...@fullstory.com wrote:
 
   First off thanks everyone for the very useful replies thus far.
  
   Shawn - thanks for the list of items to check.  #1 and #2 should be
 fine
   for us and I'll check our ulimit for #3.
  
   To add a bit of clarification, we are indeed using SolrCloud.  Our
  current
   setup is to create a new collection for each customer.  For now we
 allow
   SolrCloud to decide for itself where to locate the initial shard(s) but
  in
   time we expect to refine this such that our system will automatically
   choose the least loaded nodes according to some metric(s).
  
   Having more than one business entity controlling the configuration of a
single (Solr) server is a recipe for disaster. Solr works well if
 there
   is
an architect for the system.
  
  
   Jack, can you explain a bit what you mean here?  It looks like Toke
  caught
   your meaning but I'm afraid it missed me.  What do you mean by
 business
   entity?  Is your concern that with automatic creation of collections
  they
   will be distributed willy-nilly across the cluster, leading to uneven
  load
   across nodes?  If it is relevant, the schema and solrconfig are
  controlled
   entirely by me and is the same for all collections.  Thus theoretically
  we
   could actually just use one single collection for all of our customers

Re: Replica and node states

Thanks.

Does Solr ever clean up those states? I.e. does it ever remove down
replicas, or replicas belonging to non-live_nodes after some time? Or will
these remain in the cluster state forever (assuming they never come back
up)?

If they remain there, is there any penalty? E.g. Solr tries to send them
updates, maybe tries to route search requests to? I'm talking about
replicas that stay in ACTIVE state, but their nodes aren't under
/live_nodes.

Shai

On Wed, Mar 25, 2015 at 8:05 PM, Shalin Shekhar Mangar 
shalinman...@gmail.com wrote:

 Comments inline:

 On Wed, Mar 25, 2015 at 8:30 AM, Shai Erera ser...@gmail.com wrote:

  Hi
 
  Is it possible for a replica to be DOWN, while the node it resides on is
  under /live_nodes? If so, what can lead to it, aside from someone
 unloading
  a core.
 

 Yes, aside from someone unloading the index, this can happen in two ways 1)
 during startup each core publishes it's state as 'down' before it enters
 recovery, and 2) the leader force-publishes a replica as 'down' if it is
 not able to forward updates to that replica (this mechanism is called
 Leader-Initiated-Recovery or LIR in short)

 The #2 above can happen when the replica is partitioned from leader but
 both are able to talk to ZooKeeper.


 
  I don't know if each SolrCore reports status to ZK independently, or it's
  done by the Solr process as a whole.
 
 
 It is done on a per-core basis for now. But the 'live' node is maintained
 one per Solr instance (JVM).


  Also, is it possible for a replica to report ACTIVE, while the node it
  lives on is no longer under /live_nodes? Are there any ZK timings that
 can
  cause that?
 

 Yes, this can happen if the JVM crashed. A replica publishes itself as
 'down' on shutdown so if the graceful shutdown step is skipped then the
 replica will continue to be 'active' in the cluster state. Even LIR doesn't
 apply here because there's no point in the leader marking a node as 'down'
 if it is not 'live' already.


 
  Shai
 



 --
 Regards,
 Shalin Shekhar Mangar.

Re: Custom TokenFilter

Re,
Sorry about the image.So, there are all my dependencies jar in listing below :- 
commons-cli-2.0-mahout.jar- commons-compress-1.9.jar- commons-io-2.4.jar- 
commons-logging-1.2.jar- httpclient-4.4.jar- httpcore-4.4.jar- 
httpmime-4.4.jar- junit-4.10.jar- log4j-1.2.17.jar- 
lucene-analyzers-common-4.10.2.jar- lucene-benchmark-4.10.2.jar- 
lucene-core-4.10.2.jar- mahout-core-0.9.jar- noggit-0.5.jar- 
opennlp-maxent-3.0.3.jar- opennlp-tools-1.5.3.jar- slf4j-api-1.7.9.jar- 
slf4j-simple-1.7.10.jar- solr-solrj-4.10.2.jar
I have put them into a specific repository (contrib/tamingtext/dependency).And 
my jar containing my class into another repository (contrib/tamingtext/lib).I 
added these paths in solrconfig.xml
lib dir=../../../contrib/tamingtext/lib regex=.*\.jar /lib 
dir=../../../contrib/tamingtext/dependency regex=.*\.jar /
Thanks for advance,Regards.
  


 Le Mercredi 25 mars 2015 17h12, Erick Erickson erickerick...@gmail.com a 
écrit :
   

 Images don't come through the mailing list, can't see your image.

Whether or not all the jars in the directory you're working on are
consistent is the least of your problems. Are the libs to be found in any
_other_ place specified on your classpath?

Best,
Erick

On Wed, Mar 25, 2015 at 12:36 AM, Test Test andymish...@yahoo.fr wrote:

 Thanks Eric,

 I'm working on Solr 4.10.2 and all my dependencies jar seems to be
 compatible with this version.

 [image: Image en ligne]

 I can't figure out which one make this issue.

 Thanks
 Regards,




  Le Mardi 24 mars 2015 23h45, Erick Erickson erickerick...@gmail.com a
 écrit :


 bq: 13 moreCaused by: java.lang.ClassCastException: class
 com.tamingtext.texttamer.solr.

 This usually means you have jar files from different versions of Solr
 in your classpath.

 Best,
 Erick

 On Tue, Mar 24, 2015 at 2:38 PM, Test Test andymish...@yahoo.fr wrote:
  Hi there,
  I'm trying to create my own TokenizerFactory (from tamingtext's
 book).After setting schema.xml and have adding path in solrconfig.xml, i
 start solr.I have this error message : Caused by:
 org.apache.solr.common.SolrException: Plugin init failure for [schema.xml]
 fieldType text: Plugin init failure for [schema.xml] analyzer/tokenizer:
 class com.tamingtext.texttamer.solr.SentenceTokenizerFactory. Schema file
 is .../conf/schema.xmlat
 org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:595)at
 org.apache.solr.schema.IndexSchema.init(IndexSchema.java:166)at
 org.apache.solr.schema.IndexSchemaFactory.create(IndexSchemaFactory.java:55)at
 org.apache.solr.schema.IndexSchemaFactory.buildIndexSchema(IndexSchemaFactory.java:69)at
 org.apache.solr.core.ConfigSetService.createIndexSchema(ConfigSetService.java:90)at
 org.apache.solr.core.ConfigSetService.getConfig(ConfigSetService.java:62)...
 7 moreCaused by: org.apache.solr.common.SolrException: Plugin init failure
 for [schema.xml] fieldType text: Plugin init failure for [schema.xml]
 analyzer/tokenizer: class
 com.tamingtext.texttamer.solr.SentenceTokenizerFactoryat
 org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:177)at
 org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:486)... 12
 moreCaused by: org.apache.solr.common.SolrException: Plugin init failure
 for [schema.xml] analyzer/tokenizer: class
 com.tamingtext.texttamer.solr.SentenceTokenizerFactoryat
 org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:177)at
 org.apache.solr.schema.FieldTypePluginLoader.readAnalyzer(FieldTypePluginLoader.java:362)at
 org.apache.solr.schema.FieldTypePluginLoader.create(FieldTypePluginLoader.java:95)at
 org.apache.solr.schema.FieldTypePluginLoader.create(FieldTypePluginLoader.java:43)at
 org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:151)...
 13 moreCaused by: java.lang.ClassCastException: class
 com.tamingtext.texttamer.solr.SentenceTokenizerFactoryat
 java.lang.Class.asSubclass(Class.java:3208)at
 org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:474)at
 org.apache.solr.core.SolrResourceLoader.newInstance(SolrResourceLoader.java:593)at
 org.apache.solr.schema.FieldTypePluginLoader$2.create(FieldTypePluginLoader.java:342)at
 org.apache.solr.schema.FieldTypePluginLoader$2.create(FieldTypePluginLoader.java:335)at
 org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:151)
  Someone can help?
  Thanks.Regards.

Re: Custom TokenFilter

Re,
Sorry about the image.So, there are all my dependencies jar in listing below :  
 
   - commons-cli-2.0-mahout.jar   

   - commons-compress-1.9.jar   

   - commons-io-2.4.jar   

   - commons-logging-1.2.jar   

   - httpclient-4.4.jar   

   - httpcore-4.4.jar   

   - httpmime-4.4.jar   

   - junit-4.10.jar   

   - log4j-1.2.17.jar   

   - lucene-analyzers-common-4.10.2.jar   

   - lucene-benchmark-4.10.2.jar   

   - lucene-core-4.10.2.jar   

   - mahout-core-0.9.jar   

   - noggit-0.5.jar   

   - opennlp-maxent-3.0.3.jar   

   - opennlp-tools-1.5.3.jar   

   - slf4j-api-1.7.9.jar   

   - slf4j-simple-1.7.10.jar   

   - solr-solrj-4.10.2.jar   


I have put them into a specific repository (contrib/tamingtext/dependency).And 
my jar containing my class into another repository (contrib/tamingtext/lib).I 
added these paths in solrconfig.xml
   
   - lib dir=../../../contrib/tamingtext/lib regex=.*\.jar /   

   - lib dir=../../../contrib/tamingtext/dependency regex=.*\.jar /   


Thanks for advance
Regards. 



 Le Mercredi 25 mars 2015 20h18, Test Test andymish...@yahoo.fr a écrit :
   

 Re,
Sorry about the image.So, there are all my dependencies jar in listing below :- 
commons-cli-2.0-mahout.jar- commons-compress-1.9.jar- commons-io-2.4.jar- 
commons-logging-1.2.jar- httpclient-4.4.jar- httpcore-4.4.jar- 
httpmime-4.4.jar- junit-4.10.jar- log4j-1.2.17.jar- 
lucene-analyzers-common-4.10.2.jar- lucene-benchmark-4.10.2.jar- 
lucene-core-4.10.2.jar- mahout-core-0.9.jar- noggit-0.5.jar- 
opennlp-maxent-3.0.3.jar- opennlp-tools-1.5.3.jar- slf4j-api-1.7.9.jar- 
slf4j-simple-1.7.10.jar- solr-solrj-4.10.2.jar
I have put them into a specific repository (contrib/tamingtext/dependency).And 
my jar containing my class into another repository (contrib/tamingtext/lib).I 
added these paths in solrconfig.xml
lib dir=../../../contrib/tamingtext/lib regex=.*\.jar /lib 
dir=../../../contrib/tamingtext/dependency regex=.*\.jar /
Thanks for advance,Regards.
  


    Le Mercredi 25 mars 2015 17h12, Erick Erickson erickerick...@gmail.com a 
écrit :
  

 Images don't come through the mailing list, can't see your image.

Whether or not all the jars in the directory you're working on are
consistent is the least of your problems. Are the libs to be found in any
_other_ place specified on your classpath?

Best,
Erick

On Wed, Mar 25, 2015 at 12:36 AM, Test Test andymish...@yahoo.fr wrote:

 Thanks Eric,

 I'm working on Solr 4.10.2 and all my dependencies jar seems to be
 compatible with this version.

 [image: Image en ligne]

 I can't figure out which one make this issue.

 Thanks
 Regards,




  Le Mardi 24 mars 2015 23h45, Erick Erickson erickerick...@gmail.com a
 écrit :


 bq: 13 moreCaused by: java.lang.ClassCastException: class
 com.tamingtext.texttamer.solr.

 This usually means you have jar files from different versions of Solr
 in your classpath.

 Best,
 Erick

 On Tue, Mar 24, 2015 at 2:38 PM, Test Test andymish...@yahoo.fr wrote:
  Hi there,
  I'm trying to create my own TokenizerFactory (from tamingtext's
 book).After setting schema.xml and have adding path in solrconfig.xml, i
 start solr.I have this error message : Caused by:
 org.apache.solr.common.SolrException: Plugin init failure for [schema.xml]
 fieldType text: Plugin init failure for [schema.xml] analyzer/tokenizer:
 class com.tamingtext.texttamer.solr.SentenceTokenizerFactory. Schema file
 is .../conf/schema.xmlat
 org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:595)at
 org.apache.solr.schema.IndexSchema.init(IndexSchema.java:166)at
 org.apache.solr.schema.IndexSchemaFactory.create(IndexSchemaFactory.java:55)at
 org.apache.solr.schema.IndexSchemaFactory.buildIndexSchema(IndexSchemaFactory.java:69)at
 org.apache.solr.core.ConfigSetService.createIndexSchema(ConfigSetService.java:90)at
 org.apache.solr.core.ConfigSetService.getConfig(ConfigSetService.java:62)...
 7 moreCaused by: org.apache.solr.common.SolrException: Plugin init failure
 for [schema.xml] fieldType text: Plugin init failure for [schema.xml]
 analyzer/tokenizer: class
 com.tamingtext.texttamer.solr.SentenceTokenizerFactoryat
 org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:177)at
 org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:486)... 12
 moreCaused by: org.apache.solr.common.SolrException: Plugin init failure
 for [schema.xml] analyzer/tokenizer: class
 com.tamingtext.texttamer.solr.SentenceTokenizerFactoryat
 org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:177)at
 org.apache.solr.schema.FieldTypePluginLoader.readAnalyzer(FieldTypePluginLoader.java:362)at
 org.apache.solr.schema.FieldTypePluginLoader.create(FieldTypePluginLoader.java:95)at
 org.apache.solr.schema.FieldTypePluginLoader.create(FieldTypePluginLoader.java:43)at
 org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:151)...
 13 moreCaused by: java.lang.ClassCastException: class

Applying Tokenizers and Filters to CopyFields

2015-03-25 Thread Martin Wunderlich

Hi all, 

I am wondering what the process is for applying Tokenizers and Filter (as 
defined in the FieldType definition) to field contents that result from 
CopyFields. To be more specific, in my Solr instance, Iwould like to support 
query expansion by two means: removing stop words and adding inflected word 
forms as synonyms. 

To use a specific example, let’s say I have the following sentence to be 
indexed (from a Wittgenstein manuscript): 

Was zum Wesen der Welt gehört, kann die Sprache nicht ausdrücken.“


This sentence will be indexed in a field called „original“ that is defined as 
follows: 

field name=original type=text_original indexed=true stored=true 
required=true“/

fieldType name=text_windex_original class=solr.TextField 
positionIncrementGap=100
  analyzer type=index
tokenizer class=solr.StandardTokenizerFactory/
  /analyzer
  analyzer type=query
tokenizer class=solr.StandardTokenizerFactory/
  /analyzer
/fieldType


Then, in order to create fields for the two types of query expansion, I have 
set up specific fields for this: 

- one field where stopwords are removed both on the indexed content and the 
query. So, if the users is searching for a phrase like „der Sprache“, Solr 
should still find the segment above, because the determiners („der“ and „die“) 
are removed prior to indexing and prior to querying, respectively. This field 
is defined as follows: 

field name=stopwords_removed type=text_stopwords_removed indexed=true 
stored=true required=true“/

fieldType name=text_stopwords_removed class=solr.TextField 
positionIncrementGap=100
  analyzer type=index
tokenizer class=solr.StandardTokenizerFactory/
filter class=solr.StopFilterFactory ignoreCase=true 
words=„stopwords_de.txt format=snowball/
filter class=solr.LowerCaseFilterFactory/
  /analyzer
  analyzer type=query
tokenizer class=solr.StandardTokenizerFactory/
filter class=solr.StopFilterFactory ignoreCase=true 
words=stopwords_de.txt format=snowball/
filter class=solr.LowerCaseFilterFactory/
  /analyzer
/fieldType


- a second field where synonyms are added to the query so that more segments 
will be found. For instance, if the user is searching for the plural form 
„Sprachen“, Solr should return the segment above, due to this entry in the 
synonyms file: Sprache,Sprach,Sprachen“. This field is defined as follows: 

field name=expanded type=text_multiplied indexed=true stored=true 
required=true“/expanded

fieldType name=text_expanded class=solr.TextField 
positionIncrementGap=100
  analyzer type=index
tokenizer class=solr.StandardTokenizerFactory/
filter class=solr.StopFilterFactory ignoreCase=true 
words=stopwords_de.txt format=snowball/
filter class=solr.LowerCaseFilterFactory/
  /analyzer
  analyzer type=query
tokenizer class=solr.StandardTokenizerFactory/
filter class=solr.StopFilterFactory ignoreCase=true 
words=stopwords_de.txt format=snowball/
filter class=solr.SynonymFilterFactory synonyms=synonyms_de.txt 
ignoreCase=true expand=true/
filter class=solr.LowerCaseFilterFactory/
  /analyzer
/fieldType

Finally, to avoid having to specify three fields with identical content in the 
import documents, I am defining the two fields for query expansion as 
copyFields: 

  copyField source=original dest=stopwords_removed/
  copyField source=original dest=expanded“/

Now, my expectation would be as follows: 
- during import, two temporary fields are created by copying content from the 
original field
- these two temporary fields are then pre-processed as per the definitions above
- the pre-processed version of the text is added to the index
- then, the user can search for „Sprache“, „sprache“, „Sprachen“ or „der 
Sprache“ and will always get the segment above as a matching result. 

However, what happens actually is that I get matches only for „Sprache“ and 
„sprache“. 

The other thing that strikes as odd, is that when I restrict the search to one 
of the fields only using the „fq“ parameter, I get no results. For instance: 
http://localhost:8983/solr/windex/select?q=Sprachefq=originalwt=jsonindent=true
 
http://localhost:8983/solr/windex/select?q=Sprachefq=originalwt=jsonindent=true

will return no matches. I would expected that using the fq parameter the user 
can specify what type of search (s)he would like to carry out: A standard 
search (field original) or an expanded search (one of the other two fields). 

For debugging, I have checked the analysis and results seem ok (posted below). 
Apologies for the long post, but I am really a bit stuck here (even after doing 
a lot of reading and googling). It is probably something simple that I missing. 
Thanks a lot in advance for any help. 

Cheers, 

Martin
 

ST
Was
zum
Wesen
der
Welt
gehört
kann
die
Sprache
nicht
ausdrücken
SF
Was
zum
Wesen
 
Welt
gehört
kann
die
Sprache
nicht

Uneven data distribution with composite router

2015-03-25 Thread Shamik Bandopadhyay

Hi,

   I'm using a three level composite router in a solr cloud environment,
primarily for multi-tenant and field collapsing. The format is as follows.

*language!topic!url*.

An example would be :

ENU!12345!www.testurl.com/enu/doc1
GER!12345!www.testurl.com/ger/doc2
CHS!67890!www.testurl.com/chs/doc3

The Solr Cloud cluster contains 2 shard, each having 3 replicas. After
indexing around 10 million documents, I'm observing that the index size in
shard 1 is around 60gb while shard 2 is 15gb. So the bulk of the data is
getting indexed in shard 1. Since 60% of the document is english, I expect
the index size to be higher on one shard, but the difference seem little
too high.

The idea is to make sure that all ENU!12345 documents are routed to one
shard so that distributed field collapsing works. Is there something I can
do differently here to make a better distribution ?

Any pointers will be appreciated.

Regards,
Shamik

Re: Applying Tokenizers and Filters to CopyFields

2015-03-25 Thread Martin Wunderlich

Thanks a lot, Michael. See replies below. 


 Am 25.03.2015 um 21:41 schrieb Michael Della Bitta 
 michael.della.bi...@appinions.com:
 
 Two other things I noticed:
 
 1. You probably don't want to store your copyFields. That's literally going
 to be the same information each time.

OK, got it. I have set the targets of the copy fields to store=„false“. 

 
 2. Your expectation the pre-processed version of the text is added to the
 index may be incorrect. Anything done in analyzer type=query sections
 actually happens at query time. Not sure if that's significant for you.

I was actually referring to what is happening at index time. So, the 
pre-processing steps are applied under analyzer type=„index“. And this point 
is not quite clear to me: Assuming that I have a simple case-folding step 
applied to the target of the copyField: How or where are the lower-case tokens 
stored, if the text isn’t added to the index? How is the query supposed to 
retrieve the lower-case version? 
(sorry, if this sounds like a naive question, but I have a feeling that I am 
missing something really basic here). 

Cheers, 

Martin
 

 
 
 Michael Della Bitta
 
 Senior Software Engineer
 
 o: +1 646 532 3062
 
 appinions inc.
 
 “The Science of Influence Marketing”
 
 18 East 41st Street
 
 New York, NY 10017
 
 t: @appinions https://twitter.com/Appinions | g+:
 plus.google.com/appinions
 https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts
 w: appinions.com http://www.appinions.com/
 
 On Wed, Mar 25, 2015 at 4:27 PM, Ahmet Arslan iori...@yahoo.com.invalid
 wrote:
 
 Hi Martin,
 
 fq means filter query. May be you want to use qf (query fields) parameter
 of edismax?
 
 
 
 On Wednesday, March 25, 2015 9:23 PM, Martin Wunderlich martin...@gmx.net
 wrote:
 Hi all,
 
 I am wondering what the process is for applying Tokenizers and Filter (as
 defined in the FieldType definition) to field contents that result from
 CopyFields. To be more specific, in my Solr instance, Iwould like to
 support query expansion by two means: removing stop words and adding
 inflected word forms as synonyms.
 
 To use a specific example, let’s say I have the following sentence to be
 indexed (from a Wittgenstein manuscript):
 
 Was zum Wesen der Welt gehört, kann die Sprache nicht ausdrücken.“
 
 
 This sentence will be indexed in a field called „original“ that is defined
 as follows:
 
 field name=original type=text_original indexed=true stored=true
 required=true“/
 
fieldType name=text_windex_original class=solr.TextField
 positionIncrementGap=100
  analyzer type=index
tokenizer class=solr.StandardTokenizerFactory/
  /analyzer
  analyzer type=query
tokenizer class=solr.StandardTokenizerFactory/
  /analyzer
/fieldType
 
 
 Then, in order to create fields for the two types of query expansion, I
 have set up specific fields for this:
 
 - one field where stopwords are removed both on the indexed content and
 the query. So, if the users is searching for a phrase like „der Sprache“,
 Solr should still find the segment above, because the determiners („der“
 and „die“) are removed prior to indexing and prior to querying,
 respectively. This field is defined as follows:
 
 field name=stopwords_removed type=text_stopwords_removed
 indexed=true stored=true required=true“/
 
fieldType name=text_stopwords_removed class=solr.TextField
 positionIncrementGap=100
  analyzer type=index
tokenizer class=solr.StandardTokenizerFactory/
filter class=solr.StopFilterFactory ignoreCase=true
 words=„stopwords_de.txt format=snowball/
filter class=solr.LowerCaseFilterFactory/
  /analyzer
  analyzer type=query
tokenizer class=solr.StandardTokenizerFactory/
filter class=solr.StopFilterFactory ignoreCase=true
 words=stopwords_de.txt format=snowball/
filter class=solr.LowerCaseFilterFactory/
  /analyzer
/fieldType
 
 
 - a second field where synonyms are added to the query so that more
 segments will be found. For instance, if the user is searching for the
 plural form „Sprachen“, Solr should return the segment above, due to this
 entry in the synonyms file: Sprache,Sprach,Sprachen“. This field is
 defined as follows:
 
 field name=expanded type=text_multiplied indexed=true stored=true
 required=true“/expanded
 
fieldType name=text_expanded class=solr.TextField
 positionIncrementGap=100
  analyzer type=index
tokenizer class=solr.StandardTokenizerFactory/
filter class=solr.StopFilterFactory ignoreCase=true
 words=stopwords_de.txt format=snowball/
filter class=solr.LowerCaseFilterFactory/
  /analyzer
  analyzer type=query
tokenizer class=solr.StandardTokenizerFactory/
filter class=solr.StopFilterFactory ignoreCase=true
 words=stopwords_de.txt format=snowball/
filter class=solr.SynonymFilterFactory
 synonyms=synonyms_de.txt ignoreCase=true expand=true/
filter

Re: Applying Tokenizers and Filters to CopyFields

2015-03-25 Thread Michael Della Bitta

Two other things I noticed:

1. You probably don't want to store your copyFields. That's literally going
to be the same information each time.

2. Your expectation the pre-processed version of the text is added to the
index may be incorrect. Anything done in analyzer type=query sections
actually happens at query time. Not sure if that's significant for you.


Michael Della Bitta

Senior Software Engineer

o: +1 646 532 3062

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions https://twitter.com/Appinions | g+:
plus.google.com/appinions
https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts
w: appinions.com http://www.appinions.com/

On Wed, Mar 25, 2015 at 4:27 PM, Ahmet Arslan iori...@yahoo.com.invalid
wrote:

 Hi Martin,

 fq means filter query. May be you want to use qf (query fields) parameter
 of edismax?



 On Wednesday, March 25, 2015 9:23 PM, Martin Wunderlich martin...@gmx.net
 wrote:
 Hi all,

 I am wondering what the process is for applying Tokenizers and Filter (as
 defined in the FieldType definition) to field contents that result from
 CopyFields. To be more specific, in my Solr instance, Iwould like to
 support query expansion by two means: removing stop words and adding
 inflected word forms as synonyms.

 To use a specific example, let’s say I have the following sentence to be
 indexed (from a Wittgenstein manuscript):

 Was zum Wesen der Welt gehört, kann die Sprache nicht ausdrücken.“


 This sentence will be indexed in a field called „original“ that is defined
 as follows:

 field name=original type=text_original indexed=true stored=true
 required=true“/

 fieldType name=text_windex_original class=solr.TextField
 positionIncrementGap=100
   analyzer type=index
 tokenizer class=solr.StandardTokenizerFactory/
   /analyzer
   analyzer type=query
 tokenizer class=solr.StandardTokenizerFactory/
   /analyzer
 /fieldType


 Then, in order to create fields for the two types of query expansion, I
 have set up specific fields for this:

 - one field where stopwords are removed both on the indexed content and
 the query. So, if the users is searching for a phrase like „der Sprache“,
 Solr should still find the segment above, because the determiners („der“
 and „die“) are removed prior to indexing and prior to querying,
 respectively. This field is defined as follows:

 field name=stopwords_removed type=text_stopwords_removed
 indexed=true stored=true required=true“/

 fieldType name=text_stopwords_removed class=solr.TextField
 positionIncrementGap=100
   analyzer type=index
 tokenizer class=solr.StandardTokenizerFactory/
 filter class=solr.StopFilterFactory ignoreCase=true
 words=„stopwords_de.txt format=snowball/
 filter class=solr.LowerCaseFilterFactory/
   /analyzer
   analyzer type=query
 tokenizer class=solr.StandardTokenizerFactory/
 filter class=solr.StopFilterFactory ignoreCase=true
 words=stopwords_de.txt format=snowball/
 filter class=solr.LowerCaseFilterFactory/
   /analyzer
 /fieldType


 - a second field where synonyms are added to the query so that more
 segments will be found. For instance, if the user is searching for the
 plural form „Sprachen“, Solr should return the segment above, due to this
 entry in the synonyms file: Sprache,Sprach,Sprachen“. This field is
 defined as follows:

 field name=expanded type=text_multiplied indexed=true stored=true
 required=true“/expanded

 fieldType name=text_expanded class=solr.TextField
 positionIncrementGap=100
   analyzer type=index
 tokenizer class=solr.StandardTokenizerFactory/
 filter class=solr.StopFilterFactory ignoreCase=true
 words=stopwords_de.txt format=snowball/
 filter class=solr.LowerCaseFilterFactory/
   /analyzer
   analyzer type=query
 tokenizer class=solr.StandardTokenizerFactory/
 filter class=solr.StopFilterFactory ignoreCase=true
 words=stopwords_de.txt format=snowball/
 filter class=solr.SynonymFilterFactory
 synonyms=synonyms_de.txt ignoreCase=true expand=true/
 filter class=solr.LowerCaseFilterFactory/
   /analyzer
 /fieldType

 Finally, to avoid having to specify three fields with identical content in
 the import documents, I am defining the two fields for query expansion as
 copyFields:

   copyField source=original dest=stopwords_removed/
   copyField source=original dest=expanded“/

 Now, my expectation would be as follows:
 - during import, two temporary fields are created by copying content from
 the original field
 - these two temporary fields are then pre-processed as per the definitions
 above
 - the pre-processed version of the text is added to the index
 - then, the user can search for „Sprache“, „sprache“, „Sprachen“ or „der
 Sprache“ and will always get the segment above as a matching result.

 However, what happens actually is that I

Re: Replica and node states

2015-03-25 Thread Shalin Shekhar Mangar

On Wed, Mar 25, 2015 at 12:51 PM, Shai Erera ser...@gmail.com wrote:

 Thanks.

 Does Solr ever clean up those states? I.e. does it ever remove down
 replicas, or replicas belonging to non-live_nodes after some time? Or will
 these remain in the cluster state forever (assuming they never come back
 up)?


No, they remain there forever. You can still call the deletereplica API to
clean them up. There's even a param onyIfDown=true which will remove a
replica only if it's already 'down'.



 If they remain there, is there any penalty? E.g. Solr tries to send them
 updates, maybe tries to route search requests to? I'm talking about
 replicas that stay in ACTIVE state, but their nodes aren't under
 /live_nodes.


No, there is no penalty because we always check for the state=active and
the live-ness before routing any requests to a replica.



 Shai

 On Wed, Mar 25, 2015 at 8:05 PM, Shalin Shekhar Mangar 
 shalinman...@gmail.com wrote:

  Comments inline:
 
  On Wed, Mar 25, 2015 at 8:30 AM, Shai Erera ser...@gmail.com wrote:
 
   Hi
  
   Is it possible for a replica to be DOWN, while the node it resides on
 is
   under /live_nodes? If so, what can lead to it, aside from someone
  unloading
   a core.
  
 
  Yes, aside from someone unloading the index, this can happen in two ways
 1)
  during startup each core publishes it's state as 'down' before it enters
  recovery, and 2) the leader force-publishes a replica as 'down' if it is
  not able to forward updates to that replica (this mechanism is called
  Leader-Initiated-Recovery or LIR in short)
 
  The #2 above can happen when the replica is partitioned from leader but
  both are able to talk to ZooKeeper.
 
 
  
   I don't know if each SolrCore reports status to ZK independently, or
 it's
   done by the Solr process as a whole.
  
  
  It is done on a per-core basis for now. But the 'live' node is maintained
  one per Solr instance (JVM).
 
 
   Also, is it possible for a replica to report ACTIVE, while the node it
   lives on is no longer under /live_nodes? Are there any ZK timings that
  can
   cause that?
  
 
  Yes, this can happen if the JVM crashed. A replica publishes itself as
  'down' on shutdown so if the graceful shutdown step is skipped then the
  replica will continue to be 'active' in the cluster state. Even LIR
 doesn't
  apply here because there's no point in the leader marking a node as
 'down'
  if it is not 'live' already.
 
 
  
   Shai
  
 
 
 
  --
  Regards,
  Shalin Shekhar Mangar.
 




-- 
Regards,
Shalin Shekhar Mangar.

Re: Applying Tokenizers and Filters to CopyFields

2015-03-25 Thread Ahmet Arslan

Hi Martin,

fq means filter query. May be you want to use qf (query fields) parameter of 
edismax?



On Wednesday, March 25, 2015 9:23 PM, Martin Wunderlich martin...@gmx.net 
wrote:
Hi all, 

I am wondering what the process is for applying Tokenizers and Filter (as 
defined in the FieldType definition) to field contents that result from 
CopyFields. To be more specific, in my Solr instance, Iwould like to support 
query expansion by two means: removing stop words and adding inflected word 
forms as synonyms. 

To use a specific example, let’s say I have the following sentence to be 
indexed (from a Wittgenstein manuscript): 

Was zum Wesen der Welt gehört, kann die Sprache nicht ausdrücken.“


This sentence will be indexed in a field called „original“ that is defined as 
follows: 

field name=original type=text_original indexed=true stored=true 
required=true“/

fieldType name=text_windex_original class=solr.TextField 
positionIncrementGap=100
  analyzer type=index
tokenizer class=solr.StandardTokenizerFactory/
  /analyzer
  analyzer type=query
tokenizer class=solr.StandardTokenizerFactory/
  /analyzer
/fieldType


Then, in order to create fields for the two types of query expansion, I have 
set up specific fields for this: 

- one field where stopwords are removed both on the indexed content and the 
query. So, if the users is searching for a phrase like „der Sprache“, Solr 
should still find the segment above, because the determiners („der“ and „die“) 
are removed prior to indexing and prior to querying, respectively. This field 
is defined as follows: 

field name=stopwords_removed type=text_stopwords_removed indexed=true 
stored=true required=true“/

fieldType name=text_stopwords_removed class=solr.TextField 
positionIncrementGap=100
  analyzer type=index
tokenizer class=solr.StandardTokenizerFactory/
filter class=solr.StopFilterFactory ignoreCase=true 
words=„stopwords_de.txt format=snowball/
filter class=solr.LowerCaseFilterFactory/
  /analyzer
  analyzer type=query
tokenizer class=solr.StandardTokenizerFactory/
filter class=solr.StopFilterFactory ignoreCase=true 
words=stopwords_de.txt format=snowball/
filter class=solr.LowerCaseFilterFactory/
  /analyzer
/fieldType


- a second field where synonyms are added to the query so that more segments 
will be found. For instance, if the user is searching for the plural form 
„Sprachen“, Solr should return the segment above, due to this entry in the 
synonyms file: Sprache,Sprach,Sprachen“. This field is defined as follows: 

field name=expanded type=text_multiplied indexed=true stored=true 
required=true“/expanded

fieldType name=text_expanded class=solr.TextField 
positionIncrementGap=100
  analyzer type=index
tokenizer class=solr.StandardTokenizerFactory/
filter class=solr.StopFilterFactory ignoreCase=true 
words=stopwords_de.txt format=snowball/
filter class=solr.LowerCaseFilterFactory/
  /analyzer
  analyzer type=query
tokenizer class=solr.StandardTokenizerFactory/
filter class=solr.StopFilterFactory ignoreCase=true 
words=stopwords_de.txt format=snowball/
filter class=solr.SynonymFilterFactory synonyms=synonyms_de.txt 
ignoreCase=true expand=true/
filter class=solr.LowerCaseFilterFactory/
  /analyzer
/fieldType

Finally, to avoid having to specify three fields with identical content in the 
import documents, I am defining the two fields for query expansion as 
copyFields: 

  copyField source=original dest=stopwords_removed/
  copyField source=original dest=expanded“/

Now, my expectation would be as follows: 
- during import, two temporary fields are created by copying content from the 
original field
- these two temporary fields are then pre-processed as per the definitions above
- the pre-processed version of the text is added to the index
- then, the user can search for „Sprache“, „sprache“, „Sprachen“ or „der 
Sprache“ and will always get the segment above as a matching result. 

However, what happens actually is that I get matches only for „Sprache“ and 
„sprache“. 

The other thing that strikes as odd, is that when I restrict the search to one 
of the fields only using the „fq“ parameter, I get no results. For instance: 
http://localhost:8983/solr/windex/select?q=Sprachefq=originalwt=jsonindent=true
 
http://localhost:8983/solr/windex/select?q=Sprachefq=originalwt=jsonindent=true

will return no matches. I would expected that using the fq parameter the user 
can specify what type of search (s)he would like to carry out: A standard 
search (field original) or an expanded search (one of the other two fields). 

For debugging, I have checked the analysis and results seem ok (posted below). 
Apologies for the long post, but I am really a bit stuck here (even after doing 
a lot of reading and googling). It is probably something simple that I missing. 
Thanks

Re: Applying Tokenizers and Filters to CopyFields

2015-03-25 Thread Martin Wunderlich

Thanks a lot, Ahmet. I’ve just read up on this query field parameter and it 
sounds good. Since the field contents are currently all identical, I can’t 
really test it, yet. 

Cheers, 

Martin
 



 Am 25.03.2015 um 21:27 schrieb Ahmet Arslan iori...@yahoo.com.INVALID:
 
 Hi Martin,
 
 fq means filter query. May be you want to use qf (query fields) parameter of 
 edismax?
 
 
 
 On Wednesday, March 25, 2015 9:23 PM, Martin Wunderlich martin...@gmx.net 
 wrote:
 Hi all, 
 
 I am wondering what the process is for applying Tokenizers and Filter (as 
 defined in the FieldType definition) to field contents that result from 
 CopyFields. To be more specific, in my Solr instance, Iwould like to support 
 query expansion by two means: removing stop words and adding inflected word 
 forms as synonyms. 
 
 To use a specific example, let’s say I have the following sentence to be 
 indexed (from a Wittgenstein manuscript): 
 
 Was zum Wesen der Welt gehört, kann die Sprache nicht ausdrücken.“
 
 
 This sentence will be indexed in a field called „original“ that is defined as 
 follows: 
 
 field name=original type=text_original indexed=true stored=true 
 required=true“/
 
fieldType name=text_windex_original class=solr.TextField 
 positionIncrementGap=100
  analyzer type=index
tokenizer class=solr.StandardTokenizerFactory/
  /analyzer
  analyzer type=query
tokenizer class=solr.StandardTokenizerFactory/
  /analyzer
/fieldType
 
 
 Then, in order to create fields for the two types of query expansion, I have 
 set up specific fields for this: 
 
 - one field where stopwords are removed both on the indexed content and the 
 query. So, if the users is searching for a phrase like „der Sprache“, Solr 
 should still find the segment above, because the determiners („der“ and 
 „die“) are removed prior to indexing and prior to querying, respectively. 
 This field is defined as follows: 
 
 field name=stopwords_removed type=text_stopwords_removed indexed=true 
 stored=true required=true“/
 
fieldType name=text_stopwords_removed class=solr.TextField 
 positionIncrementGap=100
  analyzer type=index
tokenizer class=solr.StandardTokenizerFactory/
filter class=solr.StopFilterFactory ignoreCase=true 
 words=„stopwords_de.txt format=snowball/
filter class=solr.LowerCaseFilterFactory/
  /analyzer
  analyzer type=query
tokenizer class=solr.StandardTokenizerFactory/
filter class=solr.StopFilterFactory ignoreCase=true 
 words=stopwords_de.txt format=snowball/
filter class=solr.LowerCaseFilterFactory/
  /analyzer
/fieldType
 
 
 - a second field where synonyms are added to the query so that more segments 
 will be found. For instance, if the user is searching for the plural form 
 „Sprachen“, Solr should return the segment above, due to this entry in the 
 synonyms file: Sprache,Sprach,Sprachen“. This field is defined as follows: 
 
 field name=expanded type=text_multiplied indexed=true stored=true 
 required=true“/expanded
 
fieldType name=text_expanded class=solr.TextField 
 positionIncrementGap=100
  analyzer type=index
tokenizer class=solr.StandardTokenizerFactory/
filter class=solr.StopFilterFactory ignoreCase=true 
 words=stopwords_de.txt format=snowball/
filter class=solr.LowerCaseFilterFactory/
  /analyzer
  analyzer type=query
tokenizer class=solr.StandardTokenizerFactory/
filter class=solr.StopFilterFactory ignoreCase=true 
 words=stopwords_de.txt format=snowball/
filter class=solr.SynonymFilterFactory synonyms=synonyms_de.txt 
 ignoreCase=true expand=true/
filter class=solr.LowerCaseFilterFactory/
  /analyzer
/fieldType
 
 Finally, to avoid having to specify three fields with identical content in 
 the import documents, I am defining the two fields for query expansion as 
 copyFields: 
 
  copyField source=original dest=stopwords_removed/
  copyField source=original dest=expanded“/
 
 Now, my expectation would be as follows: 
 - during import, two temporary fields are created by copying content from the 
 original field
 - these two temporary fields are then pre-processed as per the definitions 
 above
 - the pre-processed version of the text is added to the index
 - then, the user can search for „Sprache“, „sprache“, „Sprachen“ or „der 
 Sprache“ and will always get the segment above as a matching result. 
 
 However, what happens actually is that I get matches only for „Sprache“ and 
 „sprache“. 
 
 The other thing that strikes as odd, is that when I restrict the search to 
 one of the fields only using the „fq“ parameter, I get no results. For 
 instance: 
 http://localhost:8983/solr/windex/select?q=Sprachefq=originalwt=jsonindent=true
  
 http://localhost:8983/solr/windex/select?q=Sprachefq=originalwt=jsonindent=true
 
 will return no matches. I would expected that using the fq parameter the user 
 can specify what type of search (s)he

location field giving error for lat long

2015-03-25 Thread abhayd

hi 

I have field name GeoLocate with datatype as location. For some lat and long
it is giving me following error during indexing process


Can't parse point '139.9544301,35.4298081' because: Bad Y value 139.9544301
is not in boundary Rect(minX=-180.0,maxX=180.0,minY=-90.0,maxY=90.0)


Any idea whats wrong?




--
View this message in context: 
http://lucene.472066.n3.nabble.com/location-field-giving-error-for-lat-long-tp4195339.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Custom TokenFilter

Re,
Finally, i think i found where this problem comes.I didn't use the right class 
extender, instead using Tokenizers, i'm using Token filter.
Eric, thanks for your replies.Regards. 


 Le Mercredi 25 mars 2015 23h55, Test Test andymish...@yahoo.fr a écrit :
   

 Re,
I have tried to remove all the redundant jar files.Then i've relaunched it but 
it's blocked directly on the same issue.
It's very strange.
Regards, 


    Le Mercredi 25 mars 2015 23h31, Erick Erickson erickerick...@gmail.com a 
écrit :
  

 Wait, you didn't put, say, lucene-core-4.10.2.jar into your
contrib/tamingtext/dependency directory did you? That means you have
Lucene (and solr and solrj and ...) in your class path twice since
they're _already_ in your classpath by default since you're running
Solr.

All your jars should be in your aggregate classpath exactly once.
Having them in twice would explain the cast exception. not need these
in the tamingtext/dependency subdirectory, just the things that are
_not_ in Solr already..

Best,
Erick

On Wed, Mar 25, 2015 at 12:21 PM, Test Test andymish...@yahoo.fr wrote:
 Re,
 Sorry about the image.So, there are all my dependencies jar in listing below :
    - commons-cli-2.0-mahout.jar

    - commons-compress-1.9.jar

    - commons-io-2.4.jar

    - commons-logging-1.2.jar

    - httpclient-4.4.jar

    - httpcore-4.4.jar

    - httpmime-4.4.jar

    - junit-4.10.jar

    - log4j-1.2.17.jar

    - lucene-analyzers-common-4.10.2.jar

    - lucene-benchmark-4.10.2.jar

    - lucene-core-4.10.2.jar

    - mahout-core-0.9.jar

    - noggit-0.5.jar

    - opennlp-maxent-3.0.3.jar

    - opennlp-tools-1.5.3.jar

    - slf4j-api-1.7.9.jar

    - slf4j-simple-1.7.10.jar

    - solr-solrj-4.10.2.jar


 I have put them into a specific repository 
 (contrib/tamingtext/dependency).And my jar containing my class into another 
 repository (contrib/tamingtext/lib).I added these paths in solrconfig.xml

    - lib dir=../../../contrib/tamingtext/lib regex=.*\.jar /

    - lib dir=../../../contrib/tamingtext/dependency regex=.*\.jar /


 Thanks for advance
 Regards.



      Le Mercredi 25 mars 2015 20h18, Test Test andymish...@yahoo.fr a écrit 
:


  Re,
 Sorry about the image.So, there are all my dependencies jar in listing below 
 :- commons-cli-2.0-mahout.jar- commons-compress-1.9.jar- commons-io-2.4.jar- 
 commons-logging-1.2.jar- httpclient-4.4.jar- httpcore-4.4.jar- 
 httpmime-4.4.jar- junit-4.10.jar- log4j-1.2.17.jar- 
 lucene-analyzers-common-4.10.2.jar- lucene-benchmark-4.10.2.jar- 
 lucene-core-4.10.2.jar- mahout-core-0.9.jar- noggit-0.5.jar- 
 opennlp-maxent-3.0.3.jar- opennlp-tools-1.5.3.jar- slf4j-api-1.7.9.jar- 
 slf4j-simple-1.7.10.jar- solr-solrj-4.10.2.jar
 I have put them into a specific repository 
 (contrib/tamingtext/dependency).And my jar containing my class into another 
 repository (contrib/tamingtext/lib).I added these paths in solrconfig.xml
 lib dir=../../../contrib/tamingtext/lib regex=.*\.jar /lib 
 dir=../../../contrib/tamingtext/dependency regex=.*\.jar /
 Thanks for advance,Regards.



    Le Mercredi 25 mars 2015 17h12, Erick Erickson erickerick...@gmail.com a 
écrit :


  Images don't come through the mailing list, can't see your image.

 Whether or not all the jars in the directory you're working on are
 consistent is the least of your problems. Are the libs to be found in any
 _other_ place specified on your classpath?

 Best,
 Erick

 On Wed, Mar 25, 2015 at 12:36 AM, Test Test andymish...@yahoo.fr wrote:

 Thanks Eric,

 I'm working on Solr 4.10.2 and all my dependencies jar seems to be
 compatible with this version.

 [image: Image en ligne]

 I can't figure out which one make this issue.

 Thanks
 Regards,




  Le Mardi 24 mars 2015 23h45, Erick Erickson erickerick...@gmail.com a
 écrit :


 bq: 13 moreCaused by: java.lang.ClassCastException: class
 com.tamingtext.texttamer.solr.

 This usually means you have jar files from different versions of Solr
 in your classpath.

 Best,
 Erick

 On Tue, Mar 24, 2015 at 2:38 PM, Test Test andymish...@yahoo.fr wrote:
  Hi there,
  I'm trying to create my own TokenizerFactory (from tamingtext's
 book).After setting schema.xml and have adding path in solrconfig.xml, i
 start solr.I have this error message : Caused by:
 org.apache.solr.common.SolrException: Plugin init failure for [schema.xml]
 fieldType text: Plugin init failure for [schema.xml] analyzer/tokenizer:
 class com.tamingtext.texttamer.solr.SentenceTokenizerFactory. Schema file
 is .../conf/schema.xmlat
 org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:595)at
 org.apache.solr.schema.IndexSchema.init(IndexSchema.java:166)at
 org.apache.solr.schema.IndexSchemaFactory.create(IndexSchemaFactory.java:55)at
 org.apache.solr.schema.IndexSchemaFactory.buildIndexSchema(IndexSchemaFactory.java:69)at
 org.apache.solr.core.ConfigSetService.createIndexSchema(ConfigSetService.java:90)at
 org.apache.solr.core.ConfigSetService.getConfig(ConfigSetService.java:62)...
 7

Re: Applying Tokenizers and Filters to CopyFields

2015-03-25 Thread Michael Della Bitta

I agree the terminology is possibly a little confusing.

Stored refers to values that are stored verbatim. You can retrieve them
verbatim. Analysis does not affect stored values.
Indexed values are tokenized/transformed and stored inverted. You can't
recover the literal analyzed version (at least, not easily).

If what you really want is to store and retrieve case folded versions of
your data as well as the original, you need to use something like a
UpdateRequestProcessor, which I personally am less familiar with.


On Wed, Mar 25, 2015 at 5:28 PM, Martin Wunderlich martin...@gmx.net
wrote:

 So, the pre-processing steps are applied under analyzer type=„index“.
 And this point is not quite clear to me: Assuming that I have a simple
 case-folding step applied to the target of the copyField: How or where are
 the lower-case tokens stored, if the text isn’t added to the index? How is
 the query supposed to retrieve the lower-case version?
 (sorry, if this sounds like a naive question, but I have a feeling that I
 am missing something really basic here).



Michael Della Bitta

Senior Software Engineer

o: +1 646 532 3062

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions https://twitter.com/Appinions | g+:
plus.google.com/appinions
https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts
w: appinions.com http://www.appinions.com/

Re: Problem with Terms Query Parser

2015-03-25 Thread Jack Krupansky

That should work. Check to be sure that you really are running Solr 5.0.
Was it an old version of trunk or the 5x branch before last August when the
terms query parser was added?

-- Jack Krupansky

On Tue, Mar 24, 2015 at 5:15 PM, Shamik Bandopadhyay sham...@gmail.com
wrote:

 Hi,

   I'm trying to use Terms Query Parser for one of my use cases where I use
 an implicit filter on bunch of sources.

 When I'm trying to run the following query,

 fq={!terms f=Source}help,documentation,sfdc

 I'm getting the following error.

 lst name=errorstr name=msgUnknown query parser 'terms'/strint
 name=code400/int/lst

 What am I missing here ? I'm using Solr 5.0 version.

 Any pointers will be appreciated.

 Regards,
 Shamik

RE: German Compound Splitter words.fst causing problems.

2015-03-25 Thread Markus Jelsma

Hello Chris - i don't know that token filter you mention but i would like to
recommend Lucene's HyphenationCompoundWordTokenFilter. It works reasonably well
if you provide the hyphenation rules and a dictionary. It has some flaws such
as decompounding to irrelevant subwords, overlapping subwords or to subwords
that do not form the whole compound word (minus genitives), but these can be
fixed.

Markus

-Original message-
From:Chris Morley ch...@depahelix.com
Sent: Wednesday 25th March 2015 17:59
To: solr-user@lucene.apache.org
Subject: German Compound Splitter words.fst causing problems.

Hello, Chris Morley here, of Wayfair.com. I am working on the German
compound-splitter by Dawid Weiss.

I tried to upgrade the words.fst file that comes with the German
compound-splitter using Solr 3.5, but it doesn't work. Below is the
IndexNotFoundException that I get.

cmorley@Caracal01:~/Work/oss/git/apache-solr-3.5.0$ java -cp
lucene/build/lucene-core-3.5-SNAPSHOT.jar
org.apache.lucene.index.IndexUpgrader wordsFst
Exception in thread main org.apache.lucene.index.IndexNotFoundException:
org.apache.lucene.store.MMapDirectory@/home/cmorley/Work/oss/git/apache-solr-3.5.0/wordsFst
lockFactory=org.apache.lucene.store.NativeFSLockFactory@201a755e
at
org.apache.lucene.index.IndexUpgrader.upgrade(IndexUpgrader.java:118)
at
org.apache.lucene.index.IndexUpgrader.main(IndexUpgrader.java:85)

The reason I'm attempting this at all is due to the answer here,
http://stackoverflow.com/questions/25450865/migrate-solr-1-4-index-files-to-4-7,
which says to do the upgrade in a two step process, first using Solr 3.5,
and then the latest Solr version (4.10.3). When I try this running the unit
tests for my modified German compound-splitter I'm getting this same type of
error. The thing is, this is an FST, not an index, which is a little
confusing. The reason why I'm following this answer though, is because I'm
getting that exact same message when trying to build the (modified) project
with mavenat the point at which it tries to load in words.fst. Below.

[main] ERROR com.wayfair.lucene.analysis.de.compound.GermanCompoundSplitter
- Format version is not supported (resource:
com.wayfair.lucene.analysis.de.compound.InputStreamDataInput@79a66240): 0
(needs to be between 3 and 4). This version of Lucene only supports indexes
created with release 3.0 and later. Failed to initialize static data
structures for German compound splitter.

Thanks,
-Chris.

Re: Custom TokenFilter

Re,
I have tried to remove all the redundant jar files.Then i've relaunched it but 
it's blocked directly on the same issue.
It's very strange.
Regards, 


 Le Mercredi 25 mars 2015 23h31, Erick Erickson erickerick...@gmail.com a 
écrit :
   

 Wait, you didn't put, say, lucene-core-4.10.2.jar into your
contrib/tamingtext/dependency directory did you? That means you have
Lucene (and solr and solrj and ...) in your class path twice since
they're _already_ in your classpath by default since you're running
Solr.

All your jars should be in your aggregate classpath exactly once.
Having them in twice would explain the cast exception. not need these
in the tamingtext/dependency subdirectory, just the things that are
_not_ in Solr already..

Best,
Erick

On Wed, Mar 25, 2015 at 12:21 PM, Test Test andymish...@yahoo.fr wrote:
 Re,
 Sorry about the image.So, there are all my dependencies jar in listing below :
    - commons-cli-2.0-mahout.jar

    - commons-compress-1.9.jar

    - commons-io-2.4.jar

    - commons-logging-1.2.jar

    - httpclient-4.4.jar

    - httpcore-4.4.jar

    - httpmime-4.4.jar

    - junit-4.10.jar

    - log4j-1.2.17.jar

    - lucene-analyzers-common-4.10.2.jar

    - lucene-benchmark-4.10.2.jar

    - lucene-core-4.10.2.jar

    - mahout-core-0.9.jar

    - noggit-0.5.jar

    - opennlp-maxent-3.0.3.jar

    - opennlp-tools-1.5.3.jar

    - slf4j-api-1.7.9.jar

    - slf4j-simple-1.7.10.jar

    - solr-solrj-4.10.2.jar


 I have put them into a specific repository 
 (contrib/tamingtext/dependency).And my jar containing my class into another 
 repository (contrib/tamingtext/lib).I added these paths in solrconfig.xml

    - lib dir=../../../contrib/tamingtext/lib regex=.*\.jar /

    - lib dir=../../../contrib/tamingtext/dependency regex=.*\.jar /


 Thanks for advance
 Regards.



      Le Mercredi 25 mars 2015 20h18, Test Test andymish...@yahoo.fr a écrit 
:


  Re,
 Sorry about the image.So, there are all my dependencies jar in listing below 
 :- commons-cli-2.0-mahout.jar- commons-compress-1.9.jar- commons-io-2.4.jar- 
 commons-logging-1.2.jar- httpclient-4.4.jar- httpcore-4.4.jar- 
 httpmime-4.4.jar- junit-4.10.jar- log4j-1.2.17.jar- 
 lucene-analyzers-common-4.10.2.jar- lucene-benchmark-4.10.2.jar- 
 lucene-core-4.10.2.jar- mahout-core-0.9.jar- noggit-0.5.jar- 
 opennlp-maxent-3.0.3.jar- opennlp-tools-1.5.3.jar- slf4j-api-1.7.9.jar- 
 slf4j-simple-1.7.10.jar- solr-solrj-4.10.2.jar
 I have put them into a specific repository 
 (contrib/tamingtext/dependency).And my jar containing my class into another 
 repository (contrib/tamingtext/lib).I added these paths in solrconfig.xml
 lib dir=../../../contrib/tamingtext/lib regex=.*\.jar /lib 
 dir=../../../contrib/tamingtext/dependency regex=.*\.jar /
 Thanks for advance,Regards.



    Le Mercredi 25 mars 2015 17h12, Erick Erickson erickerick...@gmail.com a 
écrit :


  Images don't come through the mailing list, can't see your image.

 Whether or not all the jars in the directory you're working on are
 consistent is the least of your problems. Are the libs to be found in any
 _other_ place specified on your classpath?

 Best,
 Erick

 On Wed, Mar 25, 2015 at 12:36 AM, Test Test andymish...@yahoo.fr wrote:

 Thanks Eric,

 I'm working on Solr 4.10.2 and all my dependencies jar seems to be
 compatible with this version.

 [image: Image en ligne]

 I can't figure out which one make this issue.

 Thanks
 Regards,




  Le Mardi 24 mars 2015 23h45, Erick Erickson erickerick...@gmail.com a
 écrit :


 bq: 13 moreCaused by: java.lang.ClassCastException: class
 com.tamingtext.texttamer.solr.

 This usually means you have jar files from different versions of Solr
 in your classpath.

 Best,
 Erick

 On Tue, Mar 24, 2015 at 2:38 PM, Test Test andymish...@yahoo.fr wrote:
  Hi there,
  I'm trying to create my own TokenizerFactory (from tamingtext's
 book).After setting schema.xml and have adding path in solrconfig.xml, i
 start solr.I have this error message : Caused by:
 org.apache.solr.common.SolrException: Plugin init failure for [schema.xml]
 fieldType text: Plugin init failure for [schema.xml] analyzer/tokenizer:
 class com.tamingtext.texttamer.solr.SentenceTokenizerFactory. Schema file
 is .../conf/schema.xmlat
 org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:595)at
 org.apache.solr.schema.IndexSchema.init(IndexSchema.java:166)at
 org.apache.solr.schema.IndexSchemaFactory.create(IndexSchemaFactory.java:55)at
 org.apache.solr.schema.IndexSchemaFactory.buildIndexSchema(IndexSchemaFactory.java:69)at
 org.apache.solr.core.ConfigSetService.createIndexSchema(ConfigSetService.java:90)at
 org.apache.solr.core.ConfigSetService.getConfig(ConfigSetService.java:62)...
 7 moreCaused by: org.apache.solr.common.SolrException: Plugin init failure
 for [schema.xml] fieldType text: Plugin init failure for [schema.xml]
 analyzer/tokenizer: class
 com.tamingtext.texttamer.solr.SentenceTokenizerFactoryat

Re: Applying Tokenizers and Filters to CopyFields

Martin:
Perhaps this would help

indexed=true, stored=true
field can be searched. The raw input (not analyzed in any way) can be
shown to the user in the results list.

indexed=true, stored=false
field can be searched. However, the field can't be returned in the
results list with the document.

indexed=false, stored=true
The field cannot be searched, but the contents can be returned in the
results list with the document. There are some use-cases where this is
desirable behavior.

indexed=false, stored=false
The entire field is thrown out, it's just as if you didn't send the
field to be indexed at all.

And one other thing, the copyField gets the _raw_ data not the
analyzed data. Let's say you have two fields, src and dst.
copying from src to dest in schema.xml is identical to
add
  doc
field name=srcoriginal text/field
   field name=dstoriginal text/field
/doc
/add

that is, copyfield directives are not chained.

Also, watch out for your query syntax. Michael's comments are spot-on,
I'd just add this:

http://localhost:8983/solr/windex/select?q=Sprachefq=originalwt=jsonindent=true

is kind of odd. Let's assume you mean qf rather than fq. That
_only_ matters if your query parser is edismax, it'll be ignored in
this case I believe.

You'd want something like
q=src:Sprache
or
q=dst:Sprache
or even
http://localhost:8983/solr/windex/select?q=Sprachedf=src
http://localhost:8983/solr/windex/select?q=Sprachedf=dst

where df is default field and the search is applied against that
field in the absence of a field qualification like my first two
examples.

Best,
Erick

On Wed, Mar 25, 2015 at 2:52 PM, Michael Della Bitta
michael.della.bi...@appinions.com wrote:
 I agree the terminology is possibly a little confusing.

 Stored refers to values that are stored verbatim. You can retrieve them
 verbatim. Analysis does not affect stored values.
 Indexed values are tokenized/transformed and stored inverted. You can't
 recover the literal analyzed version (at least, not easily).

 If what you really want is to store and retrieve case folded versions of
 your data as well as the original, you need to use something like a
 UpdateRequestProcessor, which I personally am less familiar with.


 On Wed, Mar 25, 2015 at 5:28 PM, Martin Wunderlich martin...@gmx.net
 wrote:

 So, the pre-processing steps are applied under analyzer type=„index“.
 And this point is not quite clear to me: Assuming that I have a simple
 case-folding step applied to the target of the copyField: How or where are
 the lower-case tokens stored, if the text isn’t added to the index? How is
 the query supposed to retrieve the lower-case version?
 (sorry, if this sounds like a naive question, but I have a feeling that I
 am missing something really basic here).



 Michael Della Bitta

 Senior Software Engineer

 o: +1 646 532 3062

 appinions inc.

 “The Science of Influence Marketing”

 18 East 41st Street

 New York, NY 10017

 t: @appinions https://twitter.com/Appinions | g+:
 plus.google.com/appinions
 https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts
 w: appinions.com http://www.appinions.com/

RE: Difference in indexing using config file vs client i.e SolrJ

2015-03-25 Thread Purohit, Sumit

Thanks Erick for the helpful explanations.

thanks
sumit 

From: Erick Erickson [erickerick...@gmail.com]
Sent: Monday, March 23, 2015 4:58 PM
To: solr-user@lucene.apache.org
Subject: Re: Difference in indexing using config file vs client i.e SolrJ

1 Either none or lots, depending;). You're talking schemaless here
I think. schemaless mode guesses what the field should be based on the
document and creates a field in the doc. pre-defined schemas require
you to make that decision up front.

So in terms of what the underlying index looks like on a lower-level
Lucene basis, whether a field is defined in the schema.xml or
dynamically it's identical. So in that perspective, there's no
difference.

However, whether the field definitions chosen best represent the
problem you're trying to solve is another issue all together.
Schemaless simply cannot apply the same kind of domain-specific
interpretation that a human can, not to mention construct analysis
chains for the tokens that are reflective of the characteristics
specific to that domain.

2 There have been some anecdotal reports of schemaless copying
everything into a _text field that impact performance, but this is
configurable.

3 Again, the underlying structure of the index at the Lucene level is
the same. What's NOT the same is whether schemaless mode makes the
right decisions. Almost invariably a human being can do better since
you're armed with knowledge of what's important and what's not.

Here's my take: Schemaless mode is a great way to get started with
minimal effort on your part. But pretty soon the problem domain
requires that you take control of the schema and hand-craft
schema.xml. For some problem spaces, schemaless may be good enough,
you have to evaluate your corpus and your problem space

Best,
Erick

On Mon, Mar 23, 2015 at 4:41 PM, Purohit, Sumit sumit.puro...@pnnl.gov wrote:
 Hi All,

 I have recently started working with Solr and i have a trivial question to 
 ask, as i could not find suitable answer.

 A document's indexes can be defined in a config file (such as schema.xml) and 
 on the fly using some solr client such as SolrJ.

 1. What is the difference in indexes created by both the approaches ?
 2. Is there any major performance gain in the case of using predefined index 
 instead of using SolrJ ?
 3. Does solr persist these indexes differently and does that has any impact 
 on the Query efficiency ?

 Thanks
 Sumit Purohit

Re: Using G1 with Apache Solr

2015-03-25 Thread William Bell

The issue we had with Java 8 was with DIH handler. We were using Rhino and
with the new implementation in Java 8, we had several Regex expression
issues...

We are almost ready to go now, since we moved away from Rhino and now use
Java.

Bill

On Wed, Mar 25, 2015 at 2:14 AM, Daniel Collins danwcoll...@gmail.com
wrote:

 Interesting none the less Shawn :)

 We use G1GC on our servers, we were on Java 7 (64-bit, RHEL6), but are
 trying to migrate to Java 8 (which seems to cause more GC issues, so we
 clearly need to tweak our settings), will investigate 8u40 though.

 On 25 March 2015 at 04:23, Shawn Heisey apa...@elyograg.org wrote:

  On 3/24/2015 9:52 PM, Shawn Heisey wrote:
   On 3/24/2015 3:48 PM, Kamran Khawaja wrote:
   I'm running Solr 4.7.2 with Java 7u75 with the following JVM params:
 
  I really got my wires crossed.  Kamran sent his message to the
  hostpot-gc-use mailing list, not the solr-user list!
 
  Thanks,
  Shawn
 
 




-- 
Bill Bell
billnb...@gmail.com
cell 720-256-8076

Re: [MASSMAIL]Re: Issues to create new core

2015-03-25 Thread Alejandro Jesus Mariño Molerio

Erick,
Thanks for your help.  I could fix the problem. I work in no SolrCloud mode.
Best Regards,
Ale

- Mensaje original -
De: Erick Erickson erickerick...@gmail.com
Para: solr-user@lucene.apache.org
Enviados: Martes, 24 de Marzo 2015 10:14:22
Asunto: [MASSMAIL]Re: Issues to create new core

Tell us all the steps you went through to do this. Note that you
should _not_ be using the core admin in the admin UI if you're working
with SolrCloud.

For stand-alone Solr, the message above is probably caused by your not
having a conf directory set up already. The core admin UI expects that
you have a pre-existing directory with a conf directory that
contains solrconfig.xml, schema.xml, and all the rest of the
configuration files. You can specify this via some of the parameters
on the admin UI screen (see instanceDir and dataDir). Each core must
be in a separate directory or Bad Things Happen.

HTH,
Erick

On Tue, Mar 24, 2015 at 7:01 AM, Alejandro Jesus Mariño Molerio
ajmar...@estudiantes.uci.cu wrote:
 Dear Solr Community:
 I just began to work with Solr. I choose Solr 5.0, but when I try to create a 
 new core with GUI, show the following error:  Error CREATEing SolrCore 
 'datos': Unable to create core [datos] Caused by: Can't find resource 
 'solrconfig.xml' in classpath or 'C:\solr\server\solr\datos\conf'. My 
 question is simple, How can fix this problem?.

 Thanks in advance for your consideration.
 Alejandro.

Re: Setting up SOLR 5 from an RPM

2015-03-25 Thread Tom Evans

On Tue, Mar 24, 2015 at 4:00 PM, Tom Evans tevans...@googlemail.com wrote:
 Hi all

 We're migrating to SOLR 5 (from 4.8), and our infrastructure guys
 would prefer we installed SOLR from an RPM rather than extracting the
 tarball where we need it. They are creating the RPM file themselves,
 and it installs an init.d script and the equivalent of the tarball to
 /opt/solr.

 We're having problems running SOLR from the installed files, as SOLR
 wants to (I think) extract the WAR file and create various temporary
 files below /opt/solr/server.

From the SOLR 5 reference guide, section Managing SOLR, sub-section
Taking SOLR to production, it seems changing the ownership of the
installed files to the user that will run SOLR is an explicit
requirement if you do not wish to run as root.

It would be better if this was not required. With most applications
you do not normally require permission to modify the installed files
in order to run the application, eg I do not need write permission to
/usr/share/vim to run vim, it is a shame I need write permission to
/opt/solr to run solr.

Cheers

Tom

Re: Replica and node states

2015-03-25 Thread Shalin Shekhar Mangar

On Wed, Mar 25, 2015 at 9:24 PM, Shai Erera ser...@gmail.com wrote:

 
  There's even a param onyIfDown=true which will remove a
  replica only if it's already 'down'.
 

 That will only work if the replica is in DOWN state correct? That is, if
 the Solr JVM was killed, and the replica stays in ACTIVE, but its node is
 not under /live_nodes, it won't get deleted? What I chose to do is to
 delete the replica if its node is not under /live_nodes, and I'm sure it
 will never return.


Probably not and we should fix it. It should be possible to delete replicas
which are not live I guess. But there are more behaviors that need to
defined e.g. what happens if a node was down and you deleted the replica
which was supposed to be on it and then the node came back up. Should we
re-create the replica automatically or ask the node to delete the local
core and have something new assigned to it? Some of these behaviors are
what we informally call ZK as Truth features where we want to move to a
world where ZK is the source of truth and nodes modify their state and
cores depending on what's inside ZK.



 No, there is no penalty because we always check for the state=active and
  the live-ness before routing any requests to a replica.
 

 Well, that's also a penalty :), though I agree it's a minor one. There is
 also a penalty ZK-wise -- clusterstate.json still records these orphanage
 replicas, so I'll make sure I do this cleanup from time to time.


Yeah but just to avoid any misunderstanding -- the live nodes are watched
by ZK so checking live-ness is a hash set lookup which is the cost but a
small one. But yeah you do need to cleanup from time to time.


 Thanks for the responses and clarifications!

 Shai

 On Wed, Mar 25, 2015 at 11:39 PM, Shalin Shekhar Mangar 
 shalinman...@gmail.com wrote:

  On Wed, Mar 25, 2015 at 12:51 PM, Shai Erera ser...@gmail.com wrote:
 
   Thanks.
  
   Does Solr ever clean up those states? I.e. does it ever remove down
   replicas, or replicas belonging to non-live_nodes after some time? Or
  will
   these remain in the cluster state forever (assuming they never come
 back
   up)?
  
 
  No, they remain there forever. You can still call the deletereplica API
 to
  clean them up. There's even a param onyIfDown=true which will remove a
  replica only if it's already 'down'.
 
 
  
   If they remain there, is there any penalty? E.g. Solr tries to send
 them
   updates, maybe tries to route search requests to? I'm talking about
   replicas that stay in ACTIVE state, but their nodes aren't under
   /live_nodes.
  
 
  No, there is no penalty because we always check for the state=active and
  the live-ness before routing any requests to a replica.
 
 
  
   Shai
  
   On Wed, Mar 25, 2015 at 8:05 PM, Shalin Shekhar Mangar 
   shalinman...@gmail.com wrote:
  
Comments inline:
   
On Wed, Mar 25, 2015 at 8:30 AM, Shai Erera ser...@gmail.com
 wrote:
   
 Hi

 Is it possible for a replica to be DOWN, while the node it resides
 on
   is
 under /live_nodes? If so, what can lead to it, aside from someone
unloading
 a core.

   
Yes, aside from someone unloading the index, this can happen in two
  ways
   1)
during startup each core publishes it's state as 'down' before it
  enters
recovery, and 2) the leader force-publishes a replica as 'down' if it
  is
not able to forward updates to that replica (this mechanism is called
Leader-Initiated-Recovery or LIR in short)
   
The #2 above can happen when the replica is partitioned from leader
 but
both are able to talk to ZooKeeper.
   
   

 I don't know if each SolrCore reports status to ZK independently,
 or
   it's
 done by the Solr process as a whole.


It is done on a per-core basis for now. But the 'live' node is
  maintained
one per Solr instance (JVM).
   
   
 Also, is it possible for a replica to report ACTIVE, while the node
  it
 lives on is no longer under /live_nodes? Are there any ZK timings
  that
can
 cause that?

   
Yes, this can happen if the JVM crashed. A replica publishes itself
 as
'down' on shutdown so if the graceful shutdown step is skipped then
 the
replica will continue to be 'active' in the cluster state. Even LIR
   doesn't
apply here because there's no point in the leader marking a node as
   'down'
if it is not 'live' already.
   
   

 Shai

   
   
   
--
Regards,
Shalin Shekhar Mangar.
   
  
 
 
 
  --
  Regards,
  Shalin Shekhar Mangar.
 




-- 
Regards,
Shalin Shekhar Mangar.

Re: Custom TokenFilter

Wait, you didn't put, say, lucene-core-4.10.2.jar into your
contrib/tamingtext/dependency directory did you? That means you have
Lucene (and solr and solrj and ...) in your class path twice since
they're _already_ in your classpath by default since you're running
Solr.

All your jars should be in your aggregate classpath exactly once.
Having them in twice would explain the cast exception. not need these
in the tamingtext/dependency subdirectory, just the things that are
_not_ in Solr already..

Best,
Erick

On Wed, Mar 25, 2015 at 12:21 PM, Test Test andymish...@yahoo.fr wrote:
 Re,
 Sorry about the image.So, there are all my dependencies jar in listing below :
- commons-cli-2.0-mahout.jar

- commons-compress-1.9.jar

- commons-io-2.4.jar

- commons-logging-1.2.jar

- httpclient-4.4.jar

- httpcore-4.4.jar

- httpmime-4.4.jar

- junit-4.10.jar

- log4j-1.2.17.jar

- lucene-analyzers-common-4.10.2.jar

- lucene-benchmark-4.10.2.jar

- lucene-core-4.10.2.jar

- mahout-core-0.9.jar

- noggit-0.5.jar

- opennlp-maxent-3.0.3.jar

- opennlp-tools-1.5.3.jar

- slf4j-api-1.7.9.jar

- slf4j-simple-1.7.10.jar

- solr-solrj-4.10.2.jar


 I have put them into a specific repository 
 (contrib/tamingtext/dependency).And my jar containing my class into another 
 repository (contrib/tamingtext/lib).I added these paths in solrconfig.xml

- lib dir=../../../contrib/tamingtext/lib regex=.*\.jar /

- lib dir=../../../contrib/tamingtext/dependency regex=.*\.jar /


 Thanks for advance
 Regards.



  Le Mercredi 25 mars 2015 20h18, Test Test andymish...@yahoo.fr a écrit 
 :


  Re,
 Sorry about the image.So, there are all my dependencies jar in listing below 
 :- commons-cli-2.0-mahout.jar- commons-compress-1.9.jar- commons-io-2.4.jar- 
 commons-logging-1.2.jar- httpclient-4.4.jar- httpcore-4.4.jar- 
 httpmime-4.4.jar- junit-4.10.jar- log4j-1.2.17.jar- 
 lucene-analyzers-common-4.10.2.jar- lucene-benchmark-4.10.2.jar- 
 lucene-core-4.10.2.jar- mahout-core-0.9.jar- noggit-0.5.jar- 
 opennlp-maxent-3.0.3.jar- opennlp-tools-1.5.3.jar- slf4j-api-1.7.9.jar- 
 slf4j-simple-1.7.10.jar- solr-solrj-4.10.2.jar
 I have put them into a specific repository 
 (contrib/tamingtext/dependency).And my jar containing my class into another 
 repository (contrib/tamingtext/lib).I added these paths in solrconfig.xml
 lib dir=../../../contrib/tamingtext/lib regex=.*\.jar /lib 
 dir=../../../contrib/tamingtext/dependency regex=.*\.jar /
 Thanks for advance,Regards.



 Le Mercredi 25 mars 2015 17h12, Erick Erickson erickerick...@gmail.com 
 a écrit :


  Images don't come through the mailing list, can't see your image.

 Whether or not all the jars in the directory you're working on are
 consistent is the least of your problems. Are the libs to be found in any
 _other_ place specified on your classpath?

 Best,
 Erick

 On Wed, Mar 25, 2015 at 12:36 AM, Test Test andymish...@yahoo.fr wrote:

 Thanks Eric,

 I'm working on Solr 4.10.2 and all my dependencies jar seems to be
 compatible with this version.

 [image: Image en ligne]

 I can't figure out which one make this issue.

 Thanks
 Regards,




  Le Mardi 24 mars 2015 23h45, Erick Erickson erickerick...@gmail.com a
 écrit :


 bq: 13 moreCaused by: java.lang.ClassCastException: class
 com.tamingtext.texttamer.solr.

 This usually means you have jar files from different versions of Solr
 in your classpath.

 Best,
 Erick

 On Tue, Mar 24, 2015 at 2:38 PM, Test Test andymish...@yahoo.fr wrote:
  Hi there,
  I'm trying to create my own TokenizerFactory (from tamingtext's
 book).After setting schema.xml and have adding path in solrconfig.xml, i
 start solr.I have this error message : Caused by:
 org.apache.solr.common.SolrException: Plugin init failure for [schema.xml]
 fieldType text: Plugin init failure for [schema.xml] analyzer/tokenizer:
 class com.tamingtext.texttamer.solr.SentenceTokenizerFactory. Schema file
 is .../conf/schema.xmlat
 org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:595)at
 org.apache.solr.schema.IndexSchema.init(IndexSchema.java:166)at
 org.apache.solr.schema.IndexSchemaFactory.create(IndexSchemaFactory.java:55)at
 org.apache.solr.schema.IndexSchemaFactory.buildIndexSchema(IndexSchemaFactory.java:69)at
 org.apache.solr.core.ConfigSetService.createIndexSchema(ConfigSetService.java:90)at
 org.apache.solr.core.ConfigSetService.getConfig(ConfigSetService.java:62)...
 7 moreCaused by: org.apache.solr.common.SolrException: Plugin init failure
 for [schema.xml] fieldType text: Plugin init failure for [schema.xml]
 analyzer/tokenizer: class
 com.tamingtext.texttamer.solr.SentenceTokenizerFactoryat
 org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:177)at
 org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:486)... 12
 moreCaused by: org.apache.solr.common.SolrException: Plugin init failure
 for [schema.xml] analyzer/tokenizer: class

Retrieving list of words for highlighting

2015-03-25 Thread Damien Dykman

In Solr 5 (or 4), is there an easy way to retrieve the list of words to
highlight?

Use case: allow an external application to highlight the matching words
of a matching document, rather than using the highlighted snippets
returned by Solr.

Thanks,
Damien

Data indexing is going too slow on single shard Why?

2015-03-25 Thread Nitin Solanki

Hello,
Please can anyone assist me? I am indexing on single shard it
is taking too much of time to index data. And I am indexing around 49GB of
data on single shard. What's wrong? Why solr is taking too much time to
index data?
Earlier I was indexing same data on 8 shards. That time, it was fast as
compared to single shard. Why so? Any help please..


*HardCommit - 15 sec*
*SoftCommit - 10 min.*



Best,
Nitin

Re: Custom TokenFilter

Thanks for letting us know the resolution, the problem was bugging me

Erick

On Wed, Mar 25, 2015 at 4:21 PM, Test Test andymish...@yahoo.fr wrote:
 Re,
 Finally, i think i found where this problem comes.I didn't use the right 
 class extender, instead using Tokenizers, i'm using Token filter.
 Eric, thanks for your replies.Regards.


  Le Mercredi 25 mars 2015 23h55, Test Test andymish...@yahoo.fr a écrit 
 :


  Re,
 I have tried to remove all the redundant jar files.Then i've relaunched it 
 but it's blocked directly on the same issue.
 It's very strange.
 Regards,


 Le Mercredi 25 mars 2015 23h31, Erick Erickson erickerick...@gmail.com 
 a écrit :


  Wait, you didn't put, say, lucene-core-4.10.2.jar into your
 contrib/tamingtext/dependency directory did you? That means you have
 Lucene (and solr and solrj and ...) in your class path twice since
 they're _already_ in your classpath by default since you're running
 Solr.

 All your jars should be in your aggregate classpath exactly once.
 Having them in twice would explain the cast exception. not need these
 in the tamingtext/dependency subdirectory, just the things that are
 _not_ in Solr already..

 Best,
 Erick

 On Wed, Mar 25, 2015 at 12:21 PM, Test Test andymish...@yahoo.fr wrote:
 Re,
 Sorry about the image.So, there are all my dependencies jar in listing below 
 :
- commons-cli-2.0-mahout.jar

- commons-compress-1.9.jar

- commons-io-2.4.jar

- commons-logging-1.2.jar

- httpclient-4.4.jar

- httpcore-4.4.jar

- httpmime-4.4.jar

- junit-4.10.jar

- log4j-1.2.17.jar

- lucene-analyzers-common-4.10.2.jar

- lucene-benchmark-4.10.2.jar

- lucene-core-4.10.2.jar

- mahout-core-0.9.jar

- noggit-0.5.jar

- opennlp-maxent-3.0.3.jar

- opennlp-tools-1.5.3.jar

- slf4j-api-1.7.9.jar

- slf4j-simple-1.7.10.jar

- solr-solrj-4.10.2.jar


 I have put them into a specific repository 
 (contrib/tamingtext/dependency).And my jar containing my class into another 
 repository (contrib/tamingtext/lib).I added these paths in solrconfig.xml

- lib dir=../../../contrib/tamingtext/lib regex=.*\.jar /

- lib dir=../../../contrib/tamingtext/dependency regex=.*\.jar /


 Thanks for advance
 Regards.



  Le Mercredi 25 mars 2015 20h18, Test Test andymish...@yahoo.fr a 
 écrit :


  Re,
 Sorry about the image.So, there are all my dependencies jar in listing below 
 :- commons-cli-2.0-mahout.jar- commons-compress-1.9.jar- commons-io-2.4.jar- 
 commons-logging-1.2.jar- httpclient-4.4.jar- httpcore-4.4.jar- 
 httpmime-4.4.jar- junit-4.10.jar- log4j-1.2.17.jar- 
 lucene-analyzers-common-4.10.2.jar- lucene-benchmark-4.10.2.jar- 
 lucene-core-4.10.2.jar- mahout-core-0.9.jar- noggit-0.5.jar- 
 opennlp-maxent-3.0.3.jar- opennlp-tools-1.5.3.jar- slf4j-api-1.7.9.jar- 
 slf4j-simple-1.7.10.jar- solr-solrj-4.10.2.jar
 I have put them into a specific repository 
 (contrib/tamingtext/dependency).And my jar containing my class into another 
 repository (contrib/tamingtext/lib).I added these paths in solrconfig.xml
 lib dir=../../../contrib/tamingtext/lib regex=.*\.jar /lib 
 dir=../../../contrib/tamingtext/dependency regex=.*\.jar /
 Thanks for advance,Regards.



Le Mercredi 25 mars 2015 17h12, Erick Erickson erickerick...@gmail.com 
 a écrit :


  Images don't come through the mailing list, can't see your image.

 Whether or not all the jars in the directory you're working on are
 consistent is the least of your problems. Are the libs to be found in any
 _other_ place specified on your classpath?

 Best,
 Erick

 On Wed, Mar 25, 2015 at 12:36 AM, Test Test andymish...@yahoo.fr wrote:

 Thanks Eric,

 I'm working on Solr 4.10.2 and all my dependencies jar seems to be
 compatible with this version.

 [image: Image en ligne]

 I can't figure out which one make this issue.

 Thanks
 Regards,




  Le Mardi 24 mars 2015 23h45, Erick Erickson erickerick...@gmail.com a
 écrit :


 bq: 13 moreCaused by: java.lang.ClassCastException: class
 com.tamingtext.texttamer.solr.

 This usually means you have jar files from different versions of Solr
 in your classpath.

 Best,
 Erick

 On Tue, Mar 24, 2015 at 2:38 PM, Test Test andymish...@yahoo.fr wrote:
  Hi there,
  I'm trying to create my own TokenizerFactory (from tamingtext's
 book).After setting schema.xml and have adding path in solrconfig.xml, i
 start solr.I have this error message : Caused by:
 org.apache.solr.common.SolrException: Plugin init failure for [schema.xml]
 fieldType text: Plugin init failure for [schema.xml] analyzer/tokenizer:
 class com.tamingtext.texttamer.solr.SentenceTokenizerFactory. Schema file
 is .../conf/schema.xmlat
 org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:595)at
 org.apache.solr.schema.IndexSchema.init(IndexSchema.java:166)at
 org.apache.solr.schema.IndexSchemaFactory.create(IndexSchemaFactory.java:55)at

Re: Replica and node states