Re: Solr scores remain the same for exact match and nearly exact match

2013-04-03 Thread Gora Mohanty
On 3 April 2013 10:52, amit amit.mal...@gmail.com wrote:

 Below is my query
 http://localhost:8983/solr/select/?q=subject:session management in
 php&fq=category:[*%20TO%20*]&fl=category,score,subject
[...]

Add debugQuery=on to your Solr URL, and you will get an
explanation of the score. Your subject field is tokenised, so
that there is no a priori reason that an exact match should
score higher. Several strategies are available if you want that
behaviour. Try searching Google, e.g., for solr exact match
higher score.

Regards,
Gora
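
A minimal sketch of one such strategy, for illustration only (the type and
field names here are hypothetical, not from this thread): copy the tokenised
field into a lightly analysed field and boost exact matches on it.

  <fieldType name="text_exact" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>
  <field name="subject_exact" type="text_exact" indexed="true" stored="false"/>
  <copyField source="subject" dest="subject_exact"/>

A query can then boost the exact field, e.g.:

  q=subject:(session management in php) OR subject_exact:"session management in php"^10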


Re: Out of memory on some faceting queries

2013-04-03 Thread Toke Eskildsen
On Tue, 2013-04-02 at 17:08 +0200, Dotan Cohen wrote:
 Most of the time I facet on one field that has about twenty unique
 values.

They are likely to be disk cached so warming those for 9M documents
should only take a few seconds.

 However, once per day I would like to facet on the text field,
 which is a free-text field usually around 1 KiB (about 100 words), in
 order to determine what the top keywords / topics are. That query
 would take up to 200 seconds to run, [...]

If that query is somehow part of your warming, then I am surprised that
search has worked at all with your commit frequency. That would however
explain your OOM if you have multiple warmups running at the same time.

It sounds like TermsComponent would be a better fit for getting top
topics: https://wiki.apache.org/solr/TermsComponent
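
For example, a request of roughly this shape (assuming the /terms handler
registered in the example solrconfig.xml, and a field named text) returns the
most frequent terms:

  http://localhost:8983/solr/terms?terms.fl=text&terms.limit=10&terms.sort=count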



maxWarmingSearchers in Solr 4.

2013-04-03 Thread Dotan Cohen
I have been dragging the same solrconfig.xml from Solr 3.x to 4.0 to
4.1, with no customization (bad, bad me!). I'm now looking into
customizing it and I see that the Solr 4.1 solrconfig.xml is much
simpler and shorter. Is this simply because many of the examples have
been removed?

In particular, I notice that there is no mention of
maxWarmingSearchers in the Solr 4.1 solrconfig.xml. I assume that I
can simply add it back in; are there any other critical config options that
are missing that I should be looking into as well? Would I be better
off using the old Solr 3.x solrconfig.xml in Solr 4.1 as it contains
so many examples?

--
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com
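
For reference, a minimal sketch of the element in question; it goes directly
under <config> in solrconfig.xml, and the value shown is just an example:

  <maxWarmingSearchers>2</maxWarmingSearchers>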


Re: Out of memory on some faceting queries

2013-04-03 Thread Dotan Cohen
On Tue, Apr 2, 2013 at 6:26 PM, Andre Bois-Crettez
andre.b...@kelkoo.com wrote:
 warmupTime is available on the admin page for each type of cache (in
 milliseconds) :
 http://solr-box:8983/solr/#/core1/plugins/cache

 Or if you are only interested in the total :
 http://solr-box:8983/solr/core1/admin/mbeans?stats=true&key=searcher


Thanks.


 Batches of 20-50 results are added to Solr a few times a minute, and a
 commit is done after each batch, since I'm calling Solr as such:
 http://127.0.0.1:8983/solr/core/update/json?commit=true
 Should I remove commit=true and run a cron job to commit once per minute?


 Even better, it sounds like a job for CommitWithin :
 http://wiki.apache.org/solr/CommitWithin



I'll look into that. Thank you!


--
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com
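
A rough sketch of what the commitWithin variant of that update request might
look like; the 60000 ms window is an arbitrary example value:

  http://127.0.0.1:8983/solr/core/update/json?commitWithin=60000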


Re: Out of memory on some faceting queries

2013-04-03 Thread Dotan Cohen
On Wed, Apr 3, 2013 at 10:11 AM, Toke Eskildsen t...@statsbiblioteket.dk 
wrote:
 However, once per day I would like to facet on the text field,
 which is a free-text field usually around 1 KiB (about 100 words), in
 order to determine what the top keywords / topics are. That query
 would take up to 200 seconds to run, [...]

 If that query is somehow part of your warming, then I am surprised that
 search has worked at all with your commit frequency. That would however
 explain your OOM if you have multiple warmups running at the same time.


No, the 'heavy facet' is not part of the warming. I run it at most
once per day, at the end of the day. Solr is not shut down daily.

 It sounds like TermsComponent would be a better fit for getting top
 topics: https://wiki.apache.org/solr/TermsComponent


I had once looked at TermsComponent, but I think that I eliminated it
as a possibility because I actually need the top keywords related to a
specific keyword. For instance, I need to know which words are most
commonly used with the word "coffee".


--
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com
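
For reference, a rough sketch of what such a query could look like with plain
faceting, assuming the free-text field is named text; this is the expensive
approach described above, not a recommendation:

  http://localhost:8983/solr/select?q=text:coffee&rows=0&facet=true&facet.field=text&facet.limit=20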


Re: Solr 4.2.0 results links

2013-04-03 Thread zeroeffect
Thanks for the response. I found the issue. The data was being ingested
correctly; it was just being echoed incorrectly. While inspecting the final HTML
output I found that the richtext-doc.vm file was used to display
my data. The code in this file generated the links to local files. After
some more research on Velocity coding, and some trial and error, I now have my
links displaying and working correctly.

I'm still picking apart the example collections and Solr configs to suit my
needs. I ran into a heap memory issue, but that is more of a Java
thing. I have adjusted the setting and am testing it out.

Down the road I'd like to make the year a drop-down option, so that you
only search the selected year and not the whole library, but that is a
different topic and I need to do some more research.

Again thanks for the reply,

ZeroEffect





Query parser cuts last letter from search term.

2013-04-03 Thread vsl
Hi,
I have a strange problem with a Solr query. I added to my Solr index a new
document with the word "behave!" inside its content. While trying to search for
this document using the search term "behave" it was impossible; only "behave!"
returns a result. Additionally, the search debug returns the following information:

"debug": {
  "rawquerystring": "behave",
  "querystring": "behave",
  "parsedquery": "allText:behav",
  "parsedquery_toString": "allText:behav",
  
Does anybody know how to deal with such a case? Below is my field type
definition.


Field definition: 

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="1"
            catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"
            preserveOriginal="1" types="characters.txt"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="1"
            catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"
            preserveOriginal="1" types="characters.txt"/>
  </analyzer>
</fieldType>

where: characters.txt 

§ = ALPHA 
$ = ALPHA 
% = ALPHA 
 = ALPHA 
/ = ALPHA 
( = ALPHA 
) = ALPHA 
= = ALPHA 
? = ALPHA 
+ = ALPHA 
* = ALPHA 
# = ALPHA 
' = ALPHA 
- = ALPHA 
 = ALPHA 
 = ALPHA 





RE: MoreLikeThis - Odd results - what am I doing wrong?

2013-04-03 Thread DC tech
Thanks David - I suppose it is an AWS question, and thank you for the pointers. 

As a further input to the MLT question - it does seem that the 3.6 behavior is 
different from 4.2 - the issue seems to be more in terms of the raw query that 
is generated. 
I will do some more research and report back with details. 

David Parks davidpark...@yahoo.com wrote:

Isn't this an AWS security groups question? You should probably post this 
question on the AWS forums, but for the moment, here's the basic reading 
material - go set up your EC2 security groups and lock down your systems.

   
 http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-network-security.html

If you just want to password protect Solr here are the instructions:

   http://wiki.apache.org/solr/SolrSecurity

But I most certainly would not leave it open to the world even with a password 
(note that basic password authentication sends passwords in clear text if 
you're not using HTTPS; best to lock the thing down behind a firewall).

Dave


-Original Message-
From: DC tech [mailto:dctech1...@gmail.com] 
Sent: Tuesday, April 02, 2013 1:02 PM
To: solr-user@lucene.apache.org
Subject: Re: MoreLikeThis - Odd results - what am I doing wrong?

OK - so I have my SOLR instance running on AWS. 
Any suggestions on how to safely share the link?  Right now, the whole SOLR 
instance is totally open. 



Gagandeep singh gagan.g...@gmail.com wrote:

Set debugQuery=true&mlt=true and see the scores for the MLT query, not 
a sample query. You can use Amazon EC2 to bring up your Solr; you 
should be able to get a micro instance for free trial.
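
A sketch of what such a debug request against the MLT handler might look
like; the handler path and document id here are hypothetical, assuming an
/mlt handler is registered:

  http://localhost:8983/solr/mlt?q=id:camry&mlt.fl=simi&mlt.mintf=1&mlt.mindf=1&debugQuery=true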


On Mon, Apr 1, 2013 at 5:10 AM, dc tech dctech1...@gmail.com wrote:

 I did try the raw query against the *simi* field and those seem to 
 return results in the order expected.
 For instance, Acura MDX has (large, SUV, 4WD and Luxury) in the simi field.
 Running a query with those words against the simi field returns the 
 expected models (X5, Audi Q5, etc.) and then the subsequent documents 
 have decreasing relevance. So the basic query mechanism seems to be fine.

 The issue just seems to be with MoreLikeThis component and handler.
 I can post the index on a public SOLR instance - any suggestions? (or 
 for
 hosting)


 On Sun, Mar 31, 2013 at 1:54 PM, Gagandeep singh 
 gagan.g...@gmail.com
 wrote:

  If you can bring up your Solr setup on a public machine then I'm
  sure a lot of debugging can be done. Without that, I think what you
  should look at is the tf-idf scores of the terms like camry etc. Usually
  idf is the deciding factor in which results show at the top (tf should
  be 1 for your data).
  Enable debugQuery=true and look at the explain section to see how the
  score is getting calculated.
 
  You should try giving different boosts to class, type, drive, size 
  to control the results.
 
 
  On Sun, Mar 31, 2013 at 8:52 PM, dc tech dctech1...@gmail.com wrote:
 
  I am running some experiments on MoreLikeThis and the results 
  seem rather odd - I am doing something wrong but just cannot figure out 
  what.
  Basically, the similarity results are decent - but not great.
 
  *Issue 1 = Quality*
  Toyota Camry: finds Altima (good), but then the next one is Camry
  Hybrid, whereas it should have found Accord.
  I have normalized the data into a simi field which has only the 
  attributes that I care about.
  Without the simi field, I could not get mlt.qf boosts to work well
 enough
  to return results
 
  *Issue 2*
  Some fields do not work at all. For instance, text+simi (in mlt.fl)
  works whereas just simi does not.
  So there is some weirdness that I am just not understanding.
 
  Would be grateful for your guidance !
 
 
  Here is the setup:
  *1. SOLR Version*
  solr-spec 4.2.0.2013.03.06.22.32.13
  solr-impl 4.2.0 1453694   rmuir - 2013-03-06 22:32:13
  lucene-spec 4.2.0
  lucene-impl 4.2.0 1453694 -  rmuir - 2013-03-06 22:25:29
 
  *2. Machine Information*
  Sun Microsystems Inc. Java HotSpot(TM) 64-Bit Server VM (1.6.0_23
  19.0-b09)
  Windows 7 Home 64 Bit with 4 GB RAM
 
  *3. Sample Data *
  I created this 'dummy' data of cars  - the idea being that these 
  would
 be
  sufficient and simple to generate similarity and understand how it 
  would work.
  There are 181 rows in the data set (I have attached it for 
  reference in CSV format)
 
  [image: Inline image 1]
 
  *4. SCHEMA*
  *Field Definitions*
  <field name="id" type="string" indexed="true" stored="true"
         termVectors="true" multiValued="false"/>
  <field name="make" type="string" indexed="true" stored="true"
         termVectors="true" multiValued="false"/>
  <field name="model" type="string" indexed="true" stored="true"
         termVectors="true" multiValued="false"/>
  <field name="class" type="string" indexed="true" stored="true"
         termVectors="true" multiValued="false"/>
  <field name="type" type="string" indexed="true" stored="true"
         termVectors="true" multiValued="false"/>
  <field name="drive" type="string" indexed="true" stored="true"
         termVectors="true" multiValued="false"/>
 field 

Re: Query parser cuts last letter from search term.

2013-04-03 Thread Upayavira
This is called 'stemming', and is caused by this:

<filter class="solr.SnowballPorterFilterFactory" language="English"/>

It means that all of these terms would match:

behave
behaving
behaved
(and possibly more)

because they would all stem down to 'behav'.

This stemming will happen at index time and at query time, so stemmed
terms are stored in your index, and also, as you are seeing, stemming
happens on your query terms. You can use the 'analyze' option in the
admin interface to see what happens to terms at query/indexing time for
your various field definitions.
 
Upayavira
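
A rough illustration of what the main steps of that chain do to the indexed
word (stop word and word delimiter steps omitted):

  behave!  --StandardTokenizer-->    behave
  behave   --LowerCaseFilter-->      behave
  behave   --SnowballPorterFilter->  behav

The same reduction happens to the query term, which is why the parsed query
in the debug output shows allText:behav.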

On Wed, Apr 3, 2013, at 11:25 AM, vsl wrote:
 Hi,
 I have a strange problem with a Solr query. I added to my Solr index a new
 document with the word "behave!" inside its content. While trying to search for
 this document using the search term "behave" it was impossible; only
 "behave!"
 returns a result. Additionally, the search debug returns the following information:
 
 "debug": {
   "rawquerystring": "behave",
   "querystring": "behave",
   "parsedquery": "allText:behav",
   "parsedquery_toString": "allText:behav",
   
 Does anybody know how to deal with such a case? Below is my field type
 definition.
 
 
 Field definition: 
 
 <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
   <analyzer type="index">
     <tokenizer class="solr.StandardTokenizerFactory"/>
     <filter class="solr.StopFilterFactory" ignoreCase="true"
             words="stopwords.txt" enablePositionIncrements="true"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.SnowballPorterFilterFactory" language="English"/>
     <filter class="solr.WordDelimiterFilterFactory"
             generateWordParts="1" generateNumberParts="1" catenateWords="1"
             catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"
             preserveOriginal="1" types="characters.txt"/>
   </analyzer>
   <analyzer type="query">
     <tokenizer class="solr.StandardTokenizerFactory"/>
     <filter class="solr.StopFilterFactory" ignoreCase="true"
             words="stopwords.txt" enablePositionIncrements="true"/>
     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
             ignoreCase="true" expand="true"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.SnowballPorterFilterFactory" language="English"/>
     <filter class="solr.WordDelimiterFilterFactory"
             generateWordParts="1" generateNumberParts="1" catenateWords="1"
             catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"
             preserveOriginal="1" types="characters.txt"/>
   </analyzer>
 </fieldType>
 
 where: characters.txt 
 
 § = ALPHA 
 $ = ALPHA 
 % = ALPHA 
  = ALPHA 
 / = ALPHA 
 ( = ALPHA 
 ) = ALPHA 
 = = ALPHA 
 ? = ALPHA 
 + = ALPHA 
 * = ALPHA 
 # = ALPHA 
 ' = ALPHA 
 - = ALPHA 
  = ALPHA 
  = ALPHA 
 
 
 


Re: Query parser cuts last letter from search term.

2013-04-03 Thread vsl
So why Solr does not return proper document?





Re: Flow Chart of Solr

2013-04-03 Thread Furkan KAMACI
So, all in all, is there anybody who can write down just the main steps of
Solr (including parsing, stemming, etc.)?


2013/4/2 Furkan KAMACI furkankam...@gmail.com

 I think about myself as an example. I have started to do research on
 Solr just for some weeks. I have learned Solr and its related projects. My
 next step is writing down the main steps of Solr. We have separated the
 learning curve of Solr into two main categories.
 The first one is for those who use it as an out-of-the-box component. The
 second one is the developer side.

 Actually, the developer side branches into two ways.

 The first one is the general steps of it, i.e. a document comes into Solr
 (e.g. crawled data from Nutch); which analysis processes are going to be done
 (stemming, hamming, etc.); what will be done after parsing, step by step.
 When a search query happens, what happens step by step, and at which step
 scores are calculated, and so on and so forth.
 The second one is more code-specific, i.e. which handlers take in the data
 that is going to be indexed (no need to explain every handler at this step);
 which are the analyzer and tokenizer classes, and what is the flow between
 them; how response handlers work and what they are.

 Also, explaining the cloud side is another piece of work.

 Some explanations are currently present in the wiki (but some of them are in
 very deep places in the wiki and it is not easy to find the parent topic of
 them; maybe starting the wiki from a top page and branching to all other
 topics from it as far as possible could be better).

 If we could show the big picture, and beside it the smaller pictures within
 it, it would be great (if you know the main parts it will be easy to go deep
 into the code, i.e. you don't need to explain every handler; if you show the
 way to the developer, he/she can debug and find what is needed).

 When I think about myself as an example: I have to write down the steps of
 Solr in a bit of detail, and even though I read many pages of the wiki and a
 book about it, I see that it is not easy even to write down the big picture
 of the developer side.


 2013/4/2 Alexandre Rafalovitch arafa...@gmail.com

 Yago,

 My point - perhaps lost in too much text - was that Solr is presented -
 and can function - as a black box. Which makes it different from more
 traditional open-source projects. So stage 2 happens exactly when the
 non-programmers have to cross the boundary from the black box into the
 code-first approach, and the hand-off is not particularly smooth. Or even
 when - say - a PHP or .Net programmer tries to get beyond the basic
 operations of their client library and has to understand the server-side
 aspects of Solr.

 Regards,
Alex.

 On Tue, Apr 2, 2013 at 1:19 PM, Yago Riveiro yago.rive...@gmail.com
 wrote:

  Alexandre,
 
  You describe the normal path when a beginner tries to use a source of code
  that they don't understand: black box, reading code, hacking, ok now I know
  10% of the project, with luck :p.
 


 Personal blog: http://blog.outerthoughts.com/
 LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
 - Time is the quality of nature that keeps events from happening all at
 once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)





Words being duplicated with highlighting and DictionaryCompoundWordTokenFilterFactory

2013-04-03 Thread Philtjens, Raf
I'm having issues with highlighting and DictionaryCompoundWordTokenFilterFactory 
in Solr 3.6.1/3.6.2.

It's duplicating/adding words in the highlighted snippet. For example, my 
dictionary (Dutch) has the following words: premie, beter, ring.
If I search for 'verbetering', results with 'verbeteringspremie' are correctly 
found, but highlighted as follows: 
Ver<highlight>beter</highlight><highlight>Verbetering</highlight>spremie.
Words from the DictionaryCompoundWordTokenFilterFactory dictionary are added to 
the highlighted item, resulting in all kinds of gibberish.

schema.xml: http://pastebin.com/SxGAg52N (problem is happening for fields of 
type 'text')
solrconfig.xml: http://pastebin.com/MUTkgZJq

The only solution I can come up with at the moment is removing those words 
(beter, ring) from the dictionary (which disables compound-word searching on 
those words... which is unwanted).

Any idea what this could be? I found someone else facing the exact same 
problem: 
http://stackoverflow.com/questions/13879349/solr-duplicating-words-in-highlighted-results
 - unfortunately, no workable solution has been given.


Solr ZooKeeper ensemble with HBase

2013-04-03 Thread Amit Sela
Hi all,

I have a running Hadoop + HBase cluster and the HBase cluster is running
its own ZooKeeper (HBase manages ZooKeeper).
I would like to deploy my SolrCloud cluster on a portion of the machines on
that cluster.

My question is: Should I have any trouble / issues deploying an additional
ZooKeeper ensemble ? I don't want to use the HBase ZooKeeper because, well
first of all HBase manages it so I'm not sure it's possible and second I
have HBase working pretty hard at times and I don't want to create any
connection issues by overloading ZooKeeper.

Thanks,

Amit.


Re: Solr 4.2 Cloud Replication Replica has higher version than Master?

2013-04-03 Thread Mark Miller
Clearing out its tlogs before starting it again may help.

- Mark

On Apr 2, 2013, at 10:07 PM, Jamie Johnson jej2...@gmail.com wrote:

 I brought the bad one down and back up and it did nothing. I can clear the
 index and try 4.2.1. I will save off the logs and see if there is anything
 else odd
 On Apr 2, 2013 9:13 PM, Mark Miller markrmil...@gmail.com wrote:
 
 It would appear it's a bug given what you have said.
 
 Any other exceptions would be useful. Might be best to start tracking in a
 JIRA issue as well.
 
 To fix, I'd bring the behind node down and back again.
 
 Unfortunately, I'm pressed for time, but we really need to get to the
 bottom of this and fix it, or determine if it's fixed in 4.2.1 (spreading
 to mirrors now).
 
 - Mark
 
 On Apr 2, 2013, at 7:21 PM, Jamie Johnson jej2...@gmail.com wrote:
 
 Sorry I didn't ask the obvious question.  Is there anything else that I
 should be looking for here and is this a bug?  I'd be happy to troll
 through the logs further if more information is needed, just let me know.
 
 Also what is the most appropriate mechanism to fix this.  Is it required
 to
 kill the index that is out of sync and let solr resync things?
 
 
 On Tue, Apr 2, 2013 at 5:45 PM, Jamie Johnson jej2...@gmail.com wrote:
 
 sorry for spamming here
 
 shard5-core2 is the instance we're having issues with...
 
 Apr 2, 2013 7:27:14 PM org.apache.solr.common.SolrException log
 SEVERE: shard update error StdNode:
 
 http://10.38.33.17:7577/solr/dsc-shard5-core2/:org.apache.solr.common.SolrException
 :
 Server at http://10.38.33.17:7577/solr/dsc-shard5-core2 returned non ok
 status:503, message:Service Unavailable
   at
 
 org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:373)
   at
 
 org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
   at
 
 org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:332)
   at
 
 org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:306)
   at
 java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
   at java.util.concurrent.FutureTask.run(FutureTask.java:138)
   at
 java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
   at
 java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
   at java.util.concurrent.FutureTask.run(FutureTask.java:138)
   at
 
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
   at
 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
   at java.lang.Thread.run(Thread.java:662)
 
 
 On Tue, Apr 2, 2013 at 5:43 PM, Jamie Johnson jej2...@gmail.com
 wrote:
 
 here is another one that looks interesting
 
 Apr 2, 2013 7:27:14 PM org.apache.solr.common.SolrException log
 SEVERE: org.apache.solr.common.SolrException: ClusterState says we are
 the leader, but locally we don't think so
   at
 
 org.apache.solr.update.processor.DistributedUpdateProcessor.doDefensiveChecks(DistributedUpdateProcessor.java:293)
   at
 
 org.apache.solr.update.processor.DistributedUpdateProcessor.setupRequest(DistributedUpdateProcessor.java:228)
   at
 
 org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:339)
   at
 
 org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100)
   at
 
 org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:246)
   at
 org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:173)
   at
 
 org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
   at
 
 org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
   at
 
 org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
   at org.apache.solr.core.SolrCore.execute(SolrCore.java:1797)
   at
 
 org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:637)
   at
 
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:343)
 
 
 
 On Tue, Apr 2, 2013 at 5:41 PM, Jamie Johnson jej2...@gmail.com
 wrote:
 
 Looking at the master it looks like at some point there were shards
 that
 went down.  I am seeing things like what is below.
 
INFO: A cluster state change: WatchedEvent state:SyncConnected
 type:NodeChildrenChanged path:/live_nodes, has occurred - updating...
 (live
 nodes size: 12)
 Apr 2, 2013 8:12:52 PM org.apache.solr.common.cloud.ZkStateReader$3
 process
 INFO: Updating live nodes... (9)
 Apr 2, 2013 8:12:52 PM
 org.apache.solr.cloud.ShardLeaderElectionContext
 runLeaderProcess
 INFO: Running the leader process.
 Apr 2, 2013 8:12:52 PM
 org.apache.solr.cloud.ShardLeaderElectionContext
 shouldIBeLeader
 INFO: Checking if I should try and be the leader.
 Apr 2, 2013 8:12:52 PM
 org.apache.solr.cloud.ShardLeaderElectionContext
 

Re: Solr 4.2 Cloud Replication Replica has higher version than Master?

2013-04-03 Thread Mark Miller
No, not that I know if, which is why I say we need to get to the bottom of it.

- Mark

On Apr 2, 2013, at 10:18 PM, Jamie Johnson jej2...@gmail.com wrote:

 Mark
 Is there a particular JIRA issue that you think may address this? I read
 through it quickly but didn't see one that jumped out
 On Apr 2, 2013 10:07 PM, Jamie Johnson jej2...@gmail.com wrote:
 
 I brought the bad one down and back up and it did nothing.  I can clear
  the index and try 4.2.1. I will save off the logs and see if there is
 anything else odd
 On Apr 2, 2013 9:13 PM, Mark Miller markrmil...@gmail.com wrote:
 
 It would appear it's a bug given what you have said.
 
 Any other exceptions would be useful. Might be best to start tracking in
 a JIRA issue as well.
 
 To fix, I'd bring the behind node down and back again.
 
 Unfortunately, I'm pressed for time, but we really need to get to the
 bottom of this and fix it, or determine if it's fixed in 4.2.1 (spreading
 to mirrors now).
 
 - Mark
 
 On Apr 2, 2013, at 7:21 PM, Jamie Johnson jej2...@gmail.com wrote:
 
 Sorry I didn't ask the obvious question.  Is there anything else that I
 should be looking for here and is this a bug?  I'd be happy to troll
 through the logs further if more information is needed, just let me
 know.
 
 Also what is the most appropriate mechanism to fix this.  Is it
 required to
 kill the index that is out of sync and let solr resync things?
 
 
 On Tue, Apr 2, 2013 at 5:45 PM, Jamie Johnson jej2...@gmail.com
 wrote:
 
 sorry for spamming here
 
 shard5-core2 is the instance we're having issues with...
 
 Apr 2, 2013 7:27:14 PM org.apache.solr.common.SolrException log
 SEVERE: shard update error StdNode:
 
 http://10.38.33.17:7577/solr/dsc-shard5-core2/:org.apache.solr.common.SolrException
 :
 Server at http://10.38.33.17:7577/solr/dsc-shard5-core2 returned non
 ok
 status:503, message:Service Unavailable
   at
 
 org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:373)
   at
 
 org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
   at
 
 org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:332)
   at
 
 org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:306)
   at
 java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
   at java.util.concurrent.FutureTask.run(FutureTask.java:138)
   at
 java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
   at
 java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
   at java.util.concurrent.FutureTask.run(FutureTask.java:138)
   at
 
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
   at
 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
   at java.lang.Thread.run(Thread.java:662)
 
 
 On Tue, Apr 2, 2013 at 5:43 PM, Jamie Johnson jej2...@gmail.com
 wrote:
 
 here is another one that looks interesting
 
 Apr 2, 2013 7:27:14 PM org.apache.solr.common.SolrException log
 SEVERE: org.apache.solr.common.SolrException: ClusterState says we are
 the leader, but locally we don't think so
   at
 
 org.apache.solr.update.processor.DistributedUpdateProcessor.doDefensiveChecks(DistributedUpdateProcessor.java:293)
   at
 
 org.apache.solr.update.processor.DistributedUpdateProcessor.setupRequest(DistributedUpdateProcessor.java:228)
   at
 
 org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:339)
   at
 
 org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100)
   at
 
 org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:246)
   at
 org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:173)
   at
 
 org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
   at
 
 org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
   at
 
 org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
   at org.apache.solr.core.SolrCore.execute(SolrCore.java:1797)
   at
 
 org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:637)
   at
 
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:343)
 
 
 
 On Tue, Apr 2, 2013 at 5:41 PM, Jamie Johnson jej2...@gmail.com
 wrote:
 
 Looking at the master it looks like at some point there were shards
 that
 went down.  I am seeing things like what is below.
 
  INFO: A cluster state change: WatchedEvent state:SyncConnected
 type:NodeChildrenChanged path:/live_nodes, has occurred -
 updating... (live
 nodes size: 12)
 Apr 2, 2013 8:12:52 PM org.apache.solr.common.cloud.ZkStateReader$3
 process
 INFO: Updating live nodes... (9)
 Apr 2, 2013 8:12:52 PM
 org.apache.solr.cloud.ShardLeaderElectionContext
 runLeaderProcess
 INFO: Running the leader process.
 

Re: Flow Chart of Solr

2013-04-03 Thread Jack Krupansky
Sure, yes. But... it comes down to what level of detail you want and need 
for a specific task. In other words, there are probably a dozen or more 
levels of detail. The reality is that if you are going to work at the Solr 
code level, that is very, very different than being a user of Solr, and at 
that point your first step is to become familiar with the code itself.


When you talk about "parsing" and "stemming", you are really talking about 
the user level, not the Solr code level. Maybe what you really need is a 
cheat sheet that maps a user-visible feature to the main Solr code component 
that implements that user feature.


There are a number of different forms of "parsing" in Solr - parsing of 
what? Queries? Requests? Solr documents? Function queries?


Stemming? Well, in truth, Solr doesn't even do stemming - Lucene does that. 
Lucene does all of the token filtering. Are you asking for details on how 
Lucene works? Maybe you meant to ask how term analysis works, which is 
split between Solr and Lucene. Or maybe you simply wanted to know when and 
where term analysis is done. Tell us your specific problem or specific 
question and we can probably quickly give you an answer.


In truth, NOBODY uses flow charts anymore. Sure, there are some user-level 
diagrams, but not down to the code level.


If you could focus on specific questions, we could give you specific 
answers.


Main steps? That depends on what level you are working at. Tell us what 
problem you are trying to solve and we can point you to the relevant areas.


In truth, if you become generally familiar with Solr at the user level 
(study the wikis), you will already know what the main steps are.


So, it is not main steps of Solr, but main steps of some specific 
request of Solr, and for a specified level of detail, and for a specified 
area of Solr if greater detail is needed. Be more specific, and then we can 
be more specific.


For now, the general advice for people who need or want to go far beyond the 
user level is to get familiar with the code - just LOOK at it - a lot of 
the package and class names are OBVIOUS, really, and follow the class 
hierarchy and code flow using the standard features of any modern Java IDE. 
If you are wondering where to start for some specific user-level feature, 
please ask specifically about that feature. But... make a diligent effort to 
discover and learn on your own before asking open-ended questions.


Sure, there are lots of things in Lucene and Solr that are rather complex 
and seemingly convoluted, and not obvious, but people are more than willing 
to help you out if you simply ask a specific question. I mean, not everybody 
needs to know the fine detail of query parsing, analysis, building a 
Lucene-level stemmer, etc. If we tried to put all of that in a diagram, most 
people would be more confused than enlightened.


At which step are scores calculated? That's more of a Lucene question. Or, 
are you really asking what code in Solr invokes Lucene search methods that 
calculate basic scores?


In short, you need to be more specific. Don't force us to guess what problem 
you are trying to solve.


-- Jack Krupansky

-Original Message- 
From: Furkan KAMACI

Sent: Wednesday, April 03, 2013 6:52 AM
To: solr-user@lucene.apache.org
Subject: Re: Flow Chart of Solr

So, all in all, is there anybody who can write down just the main steps of
Solr (including parsing, stemming, etc.)?


2013/4/2 Furkan KAMACI furkankam...@gmail.com


I think about myself as an example. I have started to do research on
Solr just for some weeks. I have learned Solr and its related projects. My
next step is writing down the main steps of Solr. We have separated the
learning curve of Solr into two main categories.
The first one is for those who use it as an out-of-the-box component. The
second one is the developer side.

Actually, the developer side branches into two ways.

The first one is the general steps of it, i.e. a document comes into Solr
(e.g. crawled data from Nutch); which analysis processes are going to be done
(stemming, hamming, etc.); what will be done after parsing, step by step.
When a search query happens, what happens step by step, and at which step
scores are calculated, and so on and so forth.
The second one is more code-specific, i.e. which handlers take in the data
that is going to be indexed (no need to explain every handler at this step);
which are the analyzer and tokenizer classes, and what is the flow between
them; how response handlers work and what they are.

Also, explaining the cloud side is another piece of work.

Some explanations are currently present in the wiki (but some of them are in
very deep places in the wiki and it is not easy to find the parent topic of
them; maybe starting the wiki from a top page and branching to all other
topics from it as far as possible could be better)

If we could show the big picture, and beside it the smaller pictures within
it, it would be great (if you know the main parts it will be easy to
go deep into the code, i.e. you don't 

Re: Query parser cuts last letter from search term.

2013-04-03 Thread Jack Krupansky
The standard tokenizer recognizes "!" as a punctuation character, so it will 
be treated as white space.


You could use the white space tokenizer if punctuation is considered 
significant.


-- Jack Krupansky
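
A minimal sketch of that alternative, for illustration; only the tokenizer
line differs from the original definition, and the rest of the filter chain
is elided here:

  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    ...
  </analyzer>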

-Original Message- 
From: vsl

Sent: Wednesday, April 03, 2013 6:25 AM
To: solr-user@lucene.apache.org
Subject: Query parser cuts last letter from search term.

Hi,
I have a strange problem with a Solr query. I added to my Solr index a new
document with the word "behave!" inside its content. While trying to search for
this document using the search term "behave" it was impossible; only "behave!"
returns a result. Additionally, the search debug returns the following information:

"debug": {
  "rawquerystring": "behave",
  "querystring": "behave",
  "parsedquery": "allText:behav",
  "parsedquery_toString": "allText:behav",

Does anybody know how to deal with such a case? Below is my field type
definition.


Field definition:

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="1"
            catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"
            preserveOriginal="1" types="characters.txt"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="1"
            catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"
            preserveOriginal="1" types="characters.txt"/>
  </analyzer>
</fieldType>

where: characters.txt

§ = ALPHA
$ = ALPHA
% = ALPHA
 = ALPHA
/ = ALPHA
( = ALPHA
) = ALPHA
= = ALPHA
? = ALPHA
+ = ALPHA
* = ALPHA
# = ALPHA
' = ALPHA
- = ALPHA
 = ALPHA

= ALPHA







RE: Confusion over Solr highlight hl.q parameter

2013-04-03 Thread Van Tassell, Kristian
Thank you for the response; unfortunately, I'm still getting no highlighting 
hits for this query. 

...hl.q={!dismax}text_it_IT:l'assieme...

-Original Message-
From: Koji Sekiguchi [mailto:k...@r.email.ne.jp] 
Sent: Tuesday, April 02, 2013 9:00 PM
To: solr-user@lucene.apache.org
Subject: Re: Confusion over Solr highlight hl.q parameter

(13/04/03 5:27), Van Tassell, Kristian wrote:
 Thanks Koji, this helped with some of our problems, but it is still not 
 perfect.
 
 This query, for example, returns no highlighting:
 
 ?q=id:abc123hl.q=text_it_IT:l'assiemehl.fl=text_it_IThl=truedefTyp
 e=edismax
 
 But this one does (when it is, in effect, the same query):
 
 ?q=text_it_IT:l'assiemehl=truedefType=edismaxhl.fl=text_it_IT
 
 I've tried many combinations but can't seem to get the right one to work. Is 
 this possibly a bug?

As hl.q doesn't respect the defType parameter, but does respect localParams, 
can you try putting {!edismax} in the hl.q parameter?

koji
--
http://soleami.com/blog/lucene-4-is-super-convenient-for-developing-nlp-tools.html


Re: Solr 4.2 Cloud Replication Replica has higher version than Master?

2013-04-03 Thread Jamie Johnson
Ok, so clearing the transaction log allowed things to go again.  I am going
to clear the index and try to replicate the problem on 4.2.0 and then I'll
try on 4.2.1


On Wed, Apr 3, 2013 at 8:21 AM, Mark Miller markrmil...@gmail.com wrote:

 No, not that I know if, which is why I say we need to get to the bottom of
 it.

 - Mark

 On Apr 2, 2013, at 10:18 PM, Jamie Johnson jej2...@gmail.com wrote:

  Mark
  Is there a particular JIRA issue that you think may address this? I
 read
  through it quickly but didn't see one that jumped out
  On Apr 2, 2013 10:07 PM, Jamie Johnson jej2...@gmail.com wrote:
 
  I brought the bad one down and back up and it did nothing.  I can clear
  the index and try 4.2.1. I will save off the logs and see if there is
  anything else odd
  On Apr 2, 2013 9:13 PM, Mark Miller markrmil...@gmail.com wrote:
 
  It would appear it's a bug given what you have said.
 
  Any other exceptions would be useful. Might be best to start tracking
 in
  a JIRA issue as well.
 
  To fix, I'd bring the behind node down and back again.
 
  Unfortunately, I'm pressed for time, but we really need to get to the
  bottom of this and fix it, or determine if it's fixed in 4.2.1
 (spreading
  to mirrors now).
 
  - Mark
 
  On Apr 2, 2013, at 7:21 PM, Jamie Johnson jej2...@gmail.com wrote:
 
  Sorry I didn't ask the obvious question.  Is there anything else that
 I
  should be looking for here and is this a bug?  I'd be happy to troll
  through the logs further if more information is needed, just let me
  know.
 
  Also what is the most appropriate mechanism to fix this.  Is it
  required to
  kill the index that is out of sync and let solr resync things?
 
 
  On Tue, Apr 2, 2013 at 5:45 PM, Jamie Johnson jej2...@gmail.com
  wrote:
 
  sorry for spamming here
 
  shard5-core2 is the instance we're having issues with...
 
  Apr 2, 2013 7:27:14 PM org.apache.solr.common.SolrException log
  SEVERE: shard update error StdNode:
 
 
 http://10.38.33.17:7577/solr/dsc-shard5-core2/:org.apache.solr.common.SolrException
  :
  Server at http://10.38.33.17:7577/solr/dsc-shard5-core2 returned non
  ok
  status:503, message:Service Unavailable
at
 
 
 org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:373)
at
 
 
 org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
at
 
 
 org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:332)
at
 
 
 org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:306)
at
  java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at
 
 java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
at
  java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at
 
 
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at
 
 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
 
 
  On Tue, Apr 2, 2013 at 5:43 PM, Jamie Johnson jej2...@gmail.com
  wrote:
 
  here is another one that looks interesting
 
  Apr 2, 2013 7:27:14 PM org.apache.solr.common.SolrException log
  SEVERE: org.apache.solr.common.SolrException: ClusterState says we
 are
  the leader, but locally we don't think so
at
 
 
 org.apache.solr.update.processor.DistributedUpdateProcessor.doDefensiveChecks(DistributedUpdateProcessor.java:293)
at
 
 
 org.apache.solr.update.processor.DistributedUpdateProcessor.setupRequest(DistributedUpdateProcessor.java:228)
at
 
 
 org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:339)
at
 
 
 org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100)
at
 
 
 org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:246)
at
  org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:173)
at
 
 
 org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
at
 
 
 org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
at
 
 
 org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1797)
at
 
 
 org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:637)
at
 
 
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:343)
 
 
 
  On Tue, Apr 2, 2013 at 5:41 PM, Jamie Johnson jej2...@gmail.com
  wrote:
 
  Looking at the master it looks like at some point there were shards
  that
  went down.  I am seeing things like what is below.
 
  INFO: A cluster state change: 

Re: is there a way we can build spell dictionary from solr index such that it only take words leaving all`special characters

2013-04-03 Thread Rohan Thakur
hi Upayavira,

you mean to say that I don't have to follow this:
http://wiki.apache.org/solr/SpellCheckComponent

and that I can directly create a spellcheck field with copyField and use it... I
don't have to build a dictionary on the field, just use the copyField for spell
suggestions?

thanks
regards
Rohan
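
A rough sketch of the index-based approach, assuming a copyField target named
spell and Solr 4.x's DirectSolrSpellChecker; the field and source names are
illustrative only:

  <copyField source="title" dest="spell"/>

  <searchComponent name="spellcheck" class="solr.SpellCheckComponent">
    <lst name="spellchecker">
      <str name="name">default</str>
      <str name="field">spell</str>
      <str name="classname">solr.DirectSolrSpellChecker</str>
    </lst>
  </searchComponent>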


On Wed, Mar 13, 2013 at 12:56 PM, Upayavira u...@odoko.co.uk wrote:

 Use text analysis and copyField to create a new field that has terms as
 you expect them. Then use that for your spellcheck dictionary.

 Note, since 4.0, you don't need to create a dictionary. Solr can use
 your index directly.

 Upayavira

 On Wed, Mar 13, 2013, at 06:00 AM, Rohan Thakur wrote:
  while building the spell dictionary...
 
  On Wed, Mar 13, 2013 at 11:29 AM, Rohan Thakur rohan.i...@gmail.com
  wrote:
 
    I even do not want to break the words, as in samsung to s a m s u n g,
    or sII to s II, or s2 to s 2
  
   On Wed, Mar 13, 2013 at 11:28 AM, Rohan Thakur rohan.i...@gmail.com
 wrote:
  
    OK, as in: the field I am indexing from the database, like title,
    has characters like () - # /n//
   example:
  
   Screenguard for Samsung Galaxy SII (Matt and Gloss) (with Dual
 Protection, Cleaning Cloth and Bubble Remover)
  
   or
   samsung-galaxy-sii-screenguard-matt-and-gloss.html
   or
   /s/a/samsung_galaxy_sii_i9100_pink_.jpg
   or
   4.27-inch Touchscreen, 3G, Android v2.3 OS, 8MP Camera with LED Flash
  
    now I want to build the spell dictionary to include only the words,
    not any of the - , _ . ( ) /s/a/ or numeric tokens like 4.27.
    How can I do that?
  
   thanks
   regards
   Rohan
  
   On Tue, Mar 12, 2013 at 11:06 PM, Alexandre Rafalovitch 
   arafa...@gmail.com wrote:
  
   Sorry, leaving them where?
  
   Can you give a concrete example or problem.
  
   Regards,
   Alex
   On Mar 12, 2013 1:31 PM, Rohan Thakur rohan.i...@gmail.com
 wrote:
  
hi all
   
 wanted to know: is there a way we can make the spell dictionary from the Solr
 index such that it only takes words from the index, leaving out all the
 special and unwanted characters.
   
thanks
regards
Rohan
   
  
  
  
  



Re: Solr metrics in Codahale metrics and Graphite?

2013-04-03 Thread Shawn Heisey
On 3/29/2013 12:07 PM, Walter Underwood wrote:
 What are folks using for this?

I don't know that this really answers your question, but Solr 4.1 and
later includes a big chunk of codahale metrics internally for request
handler statistics - see SOLR-1972.  First we tried including the jar
and using the API, but that created thread leak problems, so the source
code was added.

Thanks,
Shawn



Re: Synonyms problem

2013-04-03 Thread Shawn Heisey
On 3/29/2013 12:14 PM, Plamen Mihaylov wrote:
 Can I ask you another question: I have Magento + Solr and have a
 requirement to create an admin magento module, where I can add/remove
 synonyms dynamically. Is this possible? I searched google but it seems not
 possible.

If you change the synonym list that you are using in your index analyzer
chain, you must rebuild your entire index.  If you don't, the updated
synonyms will only affect newly added records.  This is because the
index analyzer is only applied at index time.

Thanks,
Shawn
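
For illustration, synonyms applied only in the query-side analyzer take
effect without a rebuild (with the usual caveats around multi-word synonyms);
a minimal sketch:

  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
  </analyzer>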



Question on Exact Matches - edismax

2013-04-03 Thread Sandeep Mestry
Hi All,

I have a requirement wherein exact matches for 2 fields (Series Title,
Title) should be ranked higher than partial matches. The configuration
looks like the one below:

<requestHandler name="assetdismax" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="echoParams">explicit</str>
    <float name="tie">0.01</float>
    <str name="qf">pg_series_title_ci^500 title_ci^300 pg_series_title^200
      title^25 classifications^15 classifications_texts^15
      parent_classifications^10 synonym_classifications^5 pg_brand_title^5
      pg_series_working_title^5 p_programme_title^5 p_item_title^5
      p_interstitial_title^5 description^15 pg_series_description
      annotations^0.1 classification_notes^0.05 pv_program_version_number^2
      pv_program_version_number_ci^2 pv_program_number^2
      pv_program_number_ci^2 p_program_number^2 ma_version_number^2
      ma_recording_location ma_contributions^0.001 rel_pg_series_title
      rel_programme_title rel_programme_number rel_programme_number_ci
      pg_uuid^0.5 p_uuid^0.5 pv_uuid^0.5 ma_uuid^0.5</str>
    <str name="pf">pg_series_title_ci^500 title_ci^500</str>
    <int name="ps">0</int>
    <str name="q.alt">*:*</str>
    <str name="mm">100%</str>
    <str name="q.op">AND</str>
    <str name="facet">true</str>
    <str name="facet.limit">-1</str>
    <str name="facet.mincount">1</str>
  </lst>
</requestHandler>

As you can see above, the search is against many fields. What I want is that
documents with exact matches for the series title and title fields
should rank higher than the rest.

I have added 2 case-insensitive (pg_series_title_ci, title_ci) fields for
series title and title, and have boosted them higher than the tokenized and
remaining fields. I have also implemented a similarity class to override
idf; however, I still get documents having partial matches in title and other
fields ranking higher than an exact match in pg_series_title_ci.

Many Thanks,
Sandeep


Re: Solr scores remain the same for exact match and nearly exact match

2013-04-03 Thread amit
Thanks. I added a copy field and that fixed the issue.


On Wed, Apr 3, 2013 at 12:29 PM, Gora Mohanty-3 [via Lucene] 
ml-node+s472066n4053412...@n3.nabble.com wrote:

 On 3 April 2013 10:52, amit [hidden 
 email]http://user/SendEmail.jtp?type=nodenode=4053412i=0
 wrote:
 
  Below is my query
   http://localhost:8983/solr/select/?q=subject:session management in
   php&fq=category:[*%20TO%20*]&fl=category,score,subject
 [...]

 Add debugQuery=on to your Solr URL, and you will get an
 explanation of the score. Your subject field is tokenised, so
 that there is no a priori reason that an exact match should
 score higher. Several strategies are available if you want that
 behaviour. Try searching Google, e.g., for solr exact match
 higher score.

 Regards,
 Gora








Re: Solr 4.2 Cloud Replication Replica has higher version than Master?

2013-04-03 Thread Jamie Johnson
Something interesting that I'm noticing as well: I just indexed 300,000
items, and somehow 300,020 ended up in the index. I thought perhaps I
messed something up, so I started the indexing again and indexed another
400,000, and I see 400,064 docs. Is there a good way to find possible
duplicates? I had tried to facet on key (our id field) but that didn't
give me anything with more than a count of 1.
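
A sketch of one way to look for duplicates with faceting, assuming key is the
unique field as described above; facet.mincount=2 keeps only values that
occur more than once:

  http://localhost:8983/solr/select?q=*:*&rows=0&facet=true&facet.field=key&facet.mincount=2&facet.limit=-1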


On Wed, Apr 3, 2013 at 9:22 AM, Jamie Johnson jej2...@gmail.com wrote:

 Ok, so clearing the transaction log allowed things to go again.  I am
 going to clear the index and try to replicate the problem on 4.2.0 and then
 I'll try on 4.2.1


 On Wed, Apr 3, 2013 at 8:21 AM, Mark Miller markrmil...@gmail.com wrote:

 No, not that I know if, which is why I say we need to get to the bottom
 of it.

 - Mark

 On Apr 2, 2013, at 10:18 PM, Jamie Johnson jej2...@gmail.com wrote:

  Mark
  Is there a particular JIRA issue that you think may address this? I
 read
  through it quickly but didn't see one that jumped out
  On Apr 2, 2013 10:07 PM, Jamie Johnson jej2...@gmail.com wrote:
 
  I brought the bad one down and back up and it did nothing.  I can clear
  the index and try 4.2.1. I will save off the logs and see if there is
  anything else odd
  On Apr 2, 2013 9:13 PM, Mark Miller markrmil...@gmail.com wrote:
 
  It would appear it's a bug given what you have said.
 
  Any other exceptions would be useful. Might be best to start tracking
 in
  a JIRA issue as well.
 
  To fix, I'd bring the behind node down and back again.
 
  Unfortunately, I'm pressed for time, but we really need to get to the
  bottom of this and fix it, or determine if it's fixed in 4.2.1
 (spreading
  to mirrors now).
 
  - Mark
 
  On Apr 2, 2013, at 7:21 PM, Jamie Johnson jej2...@gmail.com wrote:
 
  Sorry I didn't ask the obvious question.  Is there anything else
 that I
  should be looking for here and is this a bug?  I'd be happy to troll
  through the logs further if more information is needed, just let me
  know.
 
  Also what is the most appropriate mechanism to fix this.  Is it
  required to
  kill the index that is out of sync and let solr resync things?
 
 
  On Tue, Apr 2, 2013 at 5:45 PM, Jamie Johnson jej2...@gmail.com
  wrote:
 
  sorry for spamming here
 
  shard5-core2 is the instance we're having issues with...
 
  Apr 2, 2013 7:27:14 PM org.apache.solr.common.SolrException log
  SEVERE: shard update error StdNode:
 
 
 http://10.38.33.17:7577/solr/dsc-shard5-core2/:org.apache.solr.common.SolrException
  :
  Server at http://10.38.33.17:7577/solr/dsc-shard5-core2 returned
 non
  ok
  status:503, message:Service Unavailable
at
 
 
 org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:373)
at
 
 
 org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
at
 
 
 org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:332)
at
 
 
 org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:306)
at
  java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at
 
 java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
at
  java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at
 
 
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at
 
 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
 
 
  On Tue, Apr 2, 2013 at 5:43 PM, Jamie Johnson jej2...@gmail.com
  wrote:
 
  here is another one that looks interesting
 
  Apr 2, 2013 7:27:14 PM org.apache.solr.common.SolrException log
  SEVERE: org.apache.solr.common.SolrException: ClusterState says we
 are
  the leader, but locally we don't think so
at
 
 
 org.apache.solr.update.processor.DistributedUpdateProcessor.doDefensiveChecks(DistributedUpdateProcessor.java:293)
at
 
 
 org.apache.solr.update.processor.DistributedUpdateProcessor.setupRequest(DistributedUpdateProcessor.java:228)
at
 
 
 org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:339)
at
 
 
 org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100)
at
 
 
 org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:246)
at
  org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:173)
at
 
 
 org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
at
 
 
 org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
at
 
 
 org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
at 

Re: Solr ZooKeeper ensemble with HBase

2013-04-03 Thread Michael Della Bitta
Hello, Amit:

My guess is that, if HBase is working hard, you're going to have more
trouble with HBase and Solr on the same nodes than HBase and Solr
sharing a Zookeeper. Solr's usage of Zookeeper is very minimal.

Michael Della Bitta


Appinions
18 East 41st Street, 2nd Floor
New York, NY 10017-6271

www.appinions.com

Where Influence Isn’t a Game


On Wed, Apr 3, 2013 at 8:06 AM, Amit Sela am...@infolinks.com wrote:
 Hi all,

 I have a running Hadoop + HBase cluster and the HBase cluster is running
 it's own zookeeper (HBase manages zookeeper).
 I would like to deploy my SolrCloud cluster on a portion of the machines on
 that cluster.

 My question is: should I have any trouble / issues deploying an additional
 ZooKeeper ensemble? I don't want to use the HBase ZooKeeper because, first of
 all, HBase manages it so I'm not sure it's possible, and second, I have HBase
 working pretty hard at times and I don't want to create any connection issues
 by overloading ZooKeeper.

 Thanks,

 Amit.


Re: Flow Chart of Solr

2013-04-03 Thread Jack Park
There are three books on Solr, two with that in the title, and one,
Taming Text, each of which has been very valuable in understanding
Solr.

Jack

On Wed, Apr 3, 2013 at 5:25 AM, Jack Krupansky j...@basetechnology.com wrote:
 Sure, yes. But... it comes down to what level of detail you want and need
 for a specific task. In other words, there are probably a dozen or more
 levels of detail. The reality is that if you are going to work at the Solr
 code level, that is very, very different than being a user of Solr, and at
 that point your first step is to become familiar with the code itself.

 When you talk about parsing and stemming, you are really talking about
 the user-level, not the Solr code level. Maybe what you really need is a
 cheat sheet that maps a user-visible feature to the main Solr code component
 that implements that user feature.

 There are a number of different forms of parsing in Solr - parsing of
 what? Queries? Requests? Solr documents? Function queries?

 Stemming? Well, in truth, Solr doesn't even do stemming - Lucene does that.
 Lucene does all of the token filtering. Are you asking for details on how
 Lucene works? Maybe you meant to ask how term analysis works, which is
 split between Solr and Lucene. Or maybe you simply wanted to know when and
 where term analysis is done. Tell us your specific problem or specific
 question and we can probably quickly give you an answer.

 In truth, NOBODY uses flow charts anymore. Sure, there are some user-level
 diagrams, but not down to the code level.

 If you could focus on specific questions, we could give you specific
 answers.

 Main steps? That depends on what level you are working at. Tell us what
 problem you are trying to solve and we can point you to the relevant areas.

 In truth, if you become generally familiar with Solr at the user level
 (study the wikis), you will already know what the main steps are.

 So, it is not main steps of Solr, but main steps of some specific
 request of Solr, and for a specified level of detail, and for a specified
 area of Solr if greater detail is needed. Be more specific, and then we can
 be more specific.

 For now, the general advice for people who need or want to go far beyond the
 user level is to get familiar with the code - just LOOK at it - a lot of
 the package and class names are OBVIOUS, really, and follow the class
 hierarchy and code flow using the standard features of any modern Java IDE.
 If you are wondering where to start for some specific user-level feature,
 please ask specifically about that feature. But... make a diligent effort to
 discover and learn on your own before asking open-ended questions.

 Sure, there are lots of things in Lucene and Solr that are rather complex
 and seemingly convoluted, and not obvious, but people are more than willing
 to help you out if you simply ask a specific question. I mean, not everybody
 needs to know the fine detail of query parsing, analysis, building a
 Lucene-level stemmer, etc. If we tried to put all of that in a diagram, most
 people would be more confused than enlightened.

 At which step are scores calculated? That's more of a Lucene question. Or,
 are you really asking what code in Solr invokes Lucene search methods that
 calculate basic scores?

 In short, you need to be more specific. Don't force us to guess what problem
 you are trying to solve.

 -- Jack Krupansky

 -Original Message- From: Furkan KAMACI
 Sent: Wednesday, April 03, 2013 6:52 AM
 To: solr-user@lucene.apache.org
 Subject: Re: Flow Chart of Solr


 So, all in all, is there anybody who can write down just the main steps of
 Solr (including parsing, stemming, etc.)?


 2013/4/2 Furkan KAMACI furkankam...@gmail.com

 Take me as an example. I have been researching Solr for just a few weeks,
 and I have learned Solr and its related projects. My next step is writing
 down the main steps of Solr. We have separated the learning
 curve of Solr into two main categories.
 The first is for those who use it as an out-of-the-box component. The second
 is the developer side.

 The developer side itself branches into two parts.

 The first part is the general steps: i.e. a document comes into Solr (e.g.
 crawled data from Nutch), which analysis processes are going to be done
 (stemming, etc.), and what happens after parsing, step by step.
 When a search query happens, what happens step by step, and at which step
 scores
 are calculated, so on and so forth.
 The second part is more code-specific: i.e. which handlers take in the
 data that is going to be indexed (no need to explain every handler at
 this step), which are the analyzer and tokenizer classes and what the
 flow between them is, and how response handlers work and what they are.

 Explaining the cloud side is another piece of work.

 Some of the explanations are currently present in the wiki (but some of them
 are in very deep places in the wiki and it is not easy to find their parent
 topic; maybe starting the wiki from a top page and 

Re: Solr ZooKeeper ensemble with HBase

2013-04-03 Thread Amit Sela
Trouble in what way? If I have enough memory - HBase RegionServer 10GB and
maybe 2GB for Solr - or do you mean CPU / disk?


On Wed, Apr 3, 2013 at 5:54 PM, Michael Della Bitta 
michael.della.bi...@appinions.com wrote:

 Hello, Amit:

 My guess is that, if HBase is working hard, you're going to have more
 trouble with HBase and Solr on the same nodes than HBase and Solr
 sharing a Zookeeper. Solr's usage of Zookeeper is very minimal.

 Michael Della Bitta

 
 Appinions
 18 East 41st Street, 2nd Floor
 New York, NY 10017-6271

 www.appinions.com

 Where Influence Isn’t a Game


 On Wed, Apr 3, 2013 at 8:06 AM, Amit Sela am...@infolinks.com wrote:
  Hi all,
 
  I have a running Hadoop + HBase cluster and the HBase cluster is running
  it's own zookeeper (HBase manages zookeeper).
  I would like to deploy my SolrCloud cluster on a portion of the machines
 on
  that cluster.
 
  My question is: Should I have any trouble / issues deploying an
 additional
  ZooKeeper ensemble ? I don't want to use the HBase ZooKeeper because,
 well
  first of all HBase manages it so I'm not sure it's possible and second I
  have HBase working pretty hard at times and I don't want to create any
  connection issues by overloading ZooKeeper.
 
  Thanks,
 
  Amit.



Re: Solr ZooKeeper ensemble with HBase

2013-04-03 Thread Michael Della Bitta
Solr heavily uses RAM for disk caching, so depending on your index
size and what you intend to do with it, 2 GB could easily not be
enough. We run with 6 GB heaps on 34 GB boxes, and the remaining RAM
is there solely to act as a disk cache. We're on EC2, though, so
unless you're using the SSD instances, the disks are slow. Might not
be a problem for you.

Also things like faceting and sorting can heavily hit the CPU.

Michael Della Bitta


Appinions
18 East 41st Street, 2nd Floor
New York, NY 10017-6271

www.appinions.com

Where Influence Isn’t a Game


On Wed, Apr 3, 2013 at 11:55 AM, Amit Sela am...@infolinks.com wrote:
 Trouble in what way? If I have enough memory - HBase RegionServer 10GB and
 maybe 2GB for Solr - or do you mean CPU / disk?


 On Wed, Apr 3, 2013 at 5:54 PM, Michael Della Bitta 
 michael.della.bi...@appinions.com wrote:

 Hello, Amit:

 My guess is that, if HBase is working hard, you're going to have more
 trouble with HBase and Solr on the same nodes than HBase and Solr
 sharing a Zookeeper. Solr's usage of Zookeeper is very minimal.

 Michael Della Bitta

 
 Appinions
 18 East 41st Street, 2nd Floor
 New York, NY 10017-6271

 www.appinions.com

 Where Influence Isn’t a Game


 On Wed, Apr 3, 2013 at 8:06 AM, Amit Sela am...@infolinks.com wrote:
  Hi all,
 
  I have a running Hadoop + HBase cluster and the HBase cluster is running
  it's own zookeeper (HBase manages zookeeper).
  I would like to deploy my SolrCloud cluster on a portion of the machines
 on
  that cluster.
 
  My question is: Should I have any trouble / issues deploying an
 additional
  ZooKeeper ensemble ? I don't want to use the HBase ZooKeeper because,
 well
  first of all HBase manages it so I'm not sure it's possible and second I
  have HBase working pretty hard at times and I don't want to create any
  connection issues by overloading ZooKeeper.
 
  Thanks,
 
  Amit.



Re: Upgrade Solr3.5 to Solr4.1 - Index Reformat ?

2013-04-03 Thread Shawn Heisey
On 4/1/2013 12:19 PM, feroz_kh wrote:
 Hi Shawn,
 
 I tried optimizing using this command...
 
 curl
 'http://localhost:/solr/update?optimize=true&maxSegments=10&waitFlush=true'
 
 And i got this response within secs...
 
 <?xml version="1.0" encoding="UTF-8"?>
 <response>
 <lst name="responseHeader"><int name="status">0</int><int name="QTime">840</int></lst>
 </response>
 
 Is this a valid response that one should get?
 I checked the statistics link from the /solr/admin page and it shows that the
 number of segments got updated.
 Would this be a good indication that optimization is complete?
 At the same time, I noticed the number of files in the data/index
 directory hasn't reduced and not all files were updated.
 Since it took just a couple of secs for the response (even with waitFlush=true),
 I doubt the optimization really happened, but the details on the statistics
 page show me the correct number of segments.

That looks like a valid success response.  An optimize in Solr defaults
to one segment.  You asked it to do ten segments.  Either you already
had less than 10 segments, or it was able to find some very small
segments to merge in order to get below 10.

When you are optimizing in order to upgrade the index format, you should
leave maxSegments off or set it to 1.
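
For example, something like this (a sketch against the same update endpoint
quoted above; host, port and core path will vary):

  curl 'http://localhost:8983/solr/update?optimize=true&waitFlush=true'

Leaving maxSegments off merges down to a single segment, which rewrites the
whole index in the current format.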

Thanks,
Shawn



Re: Lengthy description is converted to hash symbols

2013-04-03 Thread Danny Watari
Yes... the <str>...</str> value is what I see in the admin console when I perform a
search for the document.  Currently, I am using solrj and the addBean()
method to update the core.  What's strange is that in our QA env the document
indexed correctly, but in prod I see hash symbols, and thus any user search
against that field fails to find the document.  Btw, I see no errors in the
logs!
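
For context, the SolrJ update path in question looks roughly like this (a
sketch; the bean variable is hypothetical, and fields are mapped from
@Field-annotated POJO properties):

  HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/m7779912");
  server.addBean(myAnnotatedPojo);  // hypothetical bean instance
  server.commit();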





Re: Flow Chart of Solr

2013-04-03 Thread Jack Krupansky

And another one on the way:
http://www.amazon.com/Lucene-Solr-Definitive-comprehensive-realtime/dp/1449359957

Hopefully that helps a lot as well. Plenty of diagrams. Lots of examples.

-- Jack Krupansky

-Original Message- 
From: Jack Park

Sent: Wednesday, April 03, 2013 11:25 AM
To: solr-user@lucene.apache.org
Subject: Re: Flow Chart of Solr

There are three books on Solr, two with that in the title, and one,
Taming Text, each of which has been very valuable in understanding
Solr.

Jack

On Wed, Apr 3, 2013 at 5:25 AM, Jack Krupansky j...@basetechnology.com 
wrote:

Sure, yes. But... it comes down to what level of detail you want and need
for a specific task. In other words, there are probably a dozen or more
levels of detail. The reality is that if you are going to work at the Solr
code level, that is very, very different than being a user of Solr, and 
at

that point your first step is to become familiar with the code itself.

When you talk about parsing and stemming, you are really talking about
the user-level, not the Solr code level. Maybe what you really need is a
cheat sheet that maps a user-visible feature to the main Solr code 
component

that implements that user feature.

There are a number of different forms of parsing in Solr - parsing of
what? Queries? Requests? Solr documents? Function queries?

Stemming? Well, in truth, Solr doesn't even do stemming - Lucene does 
that.
Lucene does all of the token filtering. Are you asking for details on 
how

Lucene works? Maybe you meant to ask how term analysis works, which is
split between Solr and Lucene. Or maybe you simply wanted to know when and
where term analysis is done. Tell us your specific problem or specific
question and we can probably quickly give you an answer.

In truth, NOBODY uses flow charts anymore. Sure, there are some 
user-level

diagrams, but not down to the code level.

If you could focus on specific questions, we could give you specific
answers.

Main steps? That depends on what level you are working at. Tell us what
problem you are trying to solve and we can point you to the relevant 
areas.


In truth, if you become generally familiar with Solr at the user level
(study the wikis), you will already know what the main steps are.

So, it is not main steps of Solr, but main steps of some specific
request of Solr, and for a specified level of detail, and for a 
specified
area of Solr if greater detail is needed. Be more specific, and then we 
can

be more specific.

For now, the general advice for people who need or want to go far beyond 
the

user level is to get familiar with the code - just LOOK at it - a lot of
the package and class names are OBVIOUS, really, and follow the class
hierarchy and code flow using the standard features of any modern Java 
IDE.

If you are wondering where to start for some specific user-level feature,
please ask specifically about that feature. But... make a diligent effort 
to

discover and learn on your own before asking open-ended questions.

Sure, there are lots of things in Lucene and Solr that are rather complex
and seemingly convoluted, and not obvious, but people are more than 
willing
to help you out if you simply ask a specific question. I mean, not 
everybody

needs to know the fine detail of query parsing, analysis, building a
Lucene-level stemmer, etc. If we tried to put all of that in a diagram, 
most

people would be more confused than enlightened.

At which step are scores calculated? That's more of a Lucene question. Or,
are you really asking what code in Solr invokes Lucene search methods that
calculate basic scores?

In short, you need to be more specific. Don't force us to guess what 
problem

you are trying to solve.

-- Jack Krupansky

-Original Message- From: Furkan KAMACI
Sent: Wednesday, April 03, 2013 6:52 AM
To: solr-user@lucene.apache.org
Subject: Re: Flow Chart of Solr


So, all in all, is there anybody who can write down just the main steps of
Solr (including parsing, stemming, etc.)?


2013/4/2 Furkan KAMACI furkankam...@gmail.com


I think about myself as an example. I have started to make research about
Solr just for some weeks. I have learned Solr and its related projects. 
My

next step writing down the main steps Solr. We have separated learning
curve of Solr into two main categories.
First one is who are using it as out of the box components. Second one is
developer side.

Actually developer side branches into two way.

First one is general steps of it. i.e. document comes into Solr (i.e.
crawled data of Nutch). which analyzing processes are going to done
(stemming, etc.), what will be doing after parsing step by step.
When a search query happens what happens step by step, at which step
scores
are calculated so on so forth.
Second one is more code specific i.e. which handlers takes into account
data that will going to be indexed(no need the explain every handler at
this step) . Which are the analyzer, tokenizer classes and what are the

Re: Lengthy description is converted to hash symbols

2013-04-03 Thread Jack Krupansky

Show us the exact query URL as well as the request handler defaults.

Make sure to try an explicit query on the field that has the # value.


QA and prod may differ because QA may have been completely reindexed more 
recently while prod hasn't been fully reindexed. Maybe the 
schema changed but a full reindex wasn't done.


-- Jack Krupansky

-Original Message- 
From: Danny Watari

Sent: Wednesday, April 03, 2013 12:15 PM
To: solr-user@lucene.apache.org
Subject: Re: Lengthy description is converted to hash symbols

Yes... the str.. / is what I see in the admin console when I perform a
search for the document.  Currently, I am using solrj and the addBean()
method to update the core.  Whats strange is in our QA env, the document
indexed correctly.  But in prod, I see hash symbols and thus any user search
against that field fails to find the document.  Btw, I see no errors in the
logs!






SolrCloud not distributing documents across shards

2013-04-03 Thread vsilgalis
So we have 3 servers in a SolrCloud cluster.

http://lucene.472066.n3.nabble.com/file/n4053506/Cloud1.png 

We have 2 shards for our collection (classic_bt) with a shard on each of the
first two servers as the picture shows. The third server has replicas of the
first 2 shards just for high availability purposes.

Now if we go into counts we have the following information:
shard1 - Numdocs - 33010
shard2 - Numdocs - 85934

Both shards replicate to the third server with no issues.

For some reason the documents aren't distributing across the shards as
expected; nothing in the logs indicates a problem, but I'm not sure what we
should be looking for.

Let me know if you need more information.





Re: Filtering Search Cloud

2013-04-03 Thread Shawn Heisey
On 4/1/2013 3:02 PM, Furkan KAMACI wrote:
 I want to separate my cloud into two logical parts. One of them is indexer
 cloud of SolrCloud. Second one is Searcher cloud of SolrCloud.
 
 My first question is this: does separating my cloud this way make sense as a
 performance improvement? I think that indexing slows search response times,
 so if I separate them I should get a performance improvement. On
 the other hand, maybe using all Solr machines as a whole (I mean not
 partitioning as I mentioned) lets SolrCloud do better load balancing; I
 would like to understand this.
 
 My second question: let's assume that I have separated my machines
 as I mentioned. Can I control which indexes are searchable from the
 searcher SolrCloud?

SolrCloud gets rid of the master and slave designations.  It also gets
rid of the line between indexing and querying.  Each shard has a replica
that is designated the leader, but that has no real impact on searching
and indexing, only on deciding which data to use when replicas get out
of sync.

In the old master-slave architecture, you indexed to the master and the
updated index files were replicated to the slave.  The slave did not
handle the analysis for indexing, so it was usually better to send
queries to slaves and let the master only do indexing.

SolrCloud is very different.  When you index, the documents are indexed
on all replicas at about the same time.  When you query, the requests
are load balanced across all replicas.  During normal operation,
SolrCloud does not use replication at all.  The replication feature is
only used when a replica gets out of sync with the leader, and in that
case, the entire index is replicated.

Thanks,
Shawn



Re: Solr 4.2 Cloud Replication Replica has higher version than Master?

2013-04-03 Thread Jamie Johnson
Since I don't have that many items in my index, I exported all of the keys
for each shard and wrote a simple Java program that checks for duplicates.
 I found some duplicate keys on different shards; a grep of the files for
the keys found does indicate that they made it to the wrong places.  If you
notice, documents with the same ID are on shard 3 and shard 5.  Is it
possible that the hash is being calculated taking into account only the
live nodes?  I know that we don't specify the numShards param at startup,
so could this be what is happening?

grep -c 7cd1a717-3d94-4f5d-bcb1-9d8a95ca78de *
shard1-core1:0
shard1-core2:0
shard2-core1:0
shard2-core2:0
shard3-core1:1
shard3-core2:1
shard4-core1:0
shard4-core2:0
shard5-core1:1
shard5-core2:1
shard6-core1:0
shard6-core2:0
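
For reference, a facet query tuned to surface only repeated values looks like
this (a sketch; key is our id field, host and core path will vary, and
facet.mincount=2 hides everything that occurs once - distributed faceting sums
counts across shards, so a cross-shard duplicate should show up with a count
of 2):

  http://localhost:8983/solr/select?q=*:*&rows=0&facet=true&facet.field=key&facet.mincount=2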


On Wed, Apr 3, 2013 at 10:42 AM, Jamie Johnson jej2...@gmail.com wrote:

 Something interesting that I'm noticing as well, I just indexed 300,000
 items, and somehow 300,020 ended up in the index.  I thought perhaps I
 messed something up so I started the indexing again and indexed another
 400,000 and I see 400,064 docs.  Is there a good way to find possible
 duplicates?  I had tried to facet on key (our id field) but that didn't
 give me anything with more than a count of 1.


 On Wed, Apr 3, 2013 at 9:22 AM, Jamie Johnson jej2...@gmail.com wrote:

 Ok, so clearing the transaction log allowed things to go again.  I am
 going to clear the index and try to replicate the problem on 4.2.0 and then
 I'll try on 4.2.1


 On Wed, Apr 3, 2013 at 8:21 AM, Mark Miller markrmil...@gmail.comwrote:

 No, not that I know if, which is why I say we need to get to the bottom
 of it.

 - Mark

 On Apr 2, 2013, at 10:18 PM, Jamie Johnson jej2...@gmail.com wrote:

  Mark
  Is there a particular jira issue that you think may address this? I
 read
  through it quickly but didn't see one that jumped out
  On Apr 2, 2013 10:07 PM, Jamie Johnson jej2...@gmail.com wrote:
 
  I brought the bad one down and back up and it did nothing.  I can
 clear
  the index and try 4.2.1. I will save off the logs and see if there is
  anything else odd
  On Apr 2, 2013 9:13 PM, Mark Miller markrmil...@gmail.com wrote:
 
  It would appear it's a bug given what you have said.
 
  Any other exceptions would be useful. Might be best to start
 tracking in
  a JIRA issue as well.
 
  To fix, I'd bring the behind node down and back again.
 
  Unfortunately, I'm pressed for time, but we really need to get to the
  bottom of this and fix it, or determine if it's fixed in 4.2.1
 (spreading
  to mirrors now).
 
  - Mark
 
  On Apr 2, 2013, at 7:21 PM, Jamie Johnson jej2...@gmail.com wrote:
 
  Sorry I didn't ask the obvious question.  Is there anything else
 that I
  should be looking for here and is this a bug?  I'd be happy to troll
  through the logs further if more information is needed, just let me
  know.
 
  Also what is the most appropriate mechanism to fix this.  Is it
  required to
  kill the index that is out of sync and let solr resync things?
 
 
  On Tue, Apr 2, 2013 at 5:45 PM, Jamie Johnson jej2...@gmail.com
  wrote:
 
  sorry for spamming here
 
  shard5-core2 is the instance we're having issues with...
 
  Apr 2, 2013 7:27:14 PM org.apache.solr.common.SolrException log
  SEVERE: shard update error StdNode:
 
 
 http://10.38.33.17:7577/solr/dsc-shard5-core2/:org.apache.solr.common.SolrException
  :
  Server at http://10.38.33.17:7577/solr/dsc-shard5-core2 returned
 non
  ok
  status:503, message:Service Unavailable
at
 
 
 org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:373)
at
 
 
 org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
at
 
 
 org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:332)
at
 
 
 org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:306)
at
  java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at
 
 java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
at
  java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at
 
 
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at
 
 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
 
 
  On Tue, Apr 2, 2013 at 5:43 PM, Jamie Johnson jej2...@gmail.com
  wrote:
 
  here is another one that looks interesting
 
  Apr 2, 2013 7:27:14 PM org.apache.solr.common.SolrException log
  SEVERE: org.apache.solr.common.SolrException: ClusterState says
 we are
  the leader, but locally we don't think so
at
 
 
 org.apache.solr.update.processor.DistributedUpdateProcessor.doDefensiveChecks(DistributedUpdateProcessor.java:293)
at
 
 
 

SolrException: Error opening new searcher

2013-04-03 Thread Van Tassell, Kristian
We're suddenly seeing an error when trying to do updates/commits.

This is on Solr 4.2 (Tomcat, solr war deployed to webapps, on Linux SuSE 11).

Based on some initial searching related to this issue, I have set 
ulimit in Linux to 'unlimited' and verified that Tomcat has enough memory for 
the virtual memory needed to run the Solr index (which is 1.1GB in size).
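
For reference, on Linux the knobs that usually matter for MMapDirectory "Map
failed" errors are the process virtual-memory limit and the kernel's
per-process mmap count (a sketch; the 262144 value is illustrative, not a
recommendation):

  ulimit -v unlimited                       # virtual address space limit
  cat /proc/sys/vm/max_map_count            # kernel default is often 65530
  sudo sysctl -w vm.max_map_count=262144    # raise the number of allowed mmap regions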

Does anyone have any ideas?

11:25:41  SEVERE  UpdateLog  Error opening realtime searcher for
deleteByQuery: org.apache.solr.common.SolrException: Error opening new searcher

11:25:39  SEVERE  UpdateLog  Replay exception: final commit.

java.io.IOException: Map failed
        at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:761)
        at org.apache.lucene.store.MMapDirectory.map(MMapDirectory.java:283)
        at org.apache.lucene.store.MMapDirectory$MMapIndexInput.<init>(MMapDirectory.java:228)
        at org.apache.lucene.store.MMapDirectory.openInput(MMapDirectory.java:195)
        at org.apache.lucene.codecs.lucene41.Lucene41PostingsReader.<init>(Lucene41PostingsReader.java:81)
        at org.apache.lucene.codecs.lucene41.Lucene41PostingsFormat.fieldsProducer(Lucene41PostingsFormat.java:430)
        at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsReader.<init>(PerFieldPostingsFormat.java:194)
        at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat.fieldsProducer(PerFieldPostingsFormat.java:233)
        at org.apache.lucene.index.SegmentCoreReaders.<init>(SegmentCoreReaders.java:127)
        at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:56)
        at org.apache.lucene.index.ReadersAndLiveDocs.getReader(ReadersAndLiveDocs.java:121)
        at org.apache.lucene.index.BufferedDeletesStream.applyDeletes(BufferedDeletesStream.java:269)
        at org.apache.lucene.index.IndexWriter.applyAllDeletes(IndexWriter.java:2961)
        at org.apache.lucene.index.IndexWriter.maybeApplyDeletes(IndexWriter.java:2952)
        at org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:2692)
        at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:2827)
        at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:2807)
        at org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:541)
        at org.apache.solr.update.UpdateLog$LogReplayer.doReplay(UpdateLog.java:1341)
        at org.apache.solr.update.UpdateLog$LogReplayer.run(UpdateLog.java:1160)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:619)
Caused by: java.lang.OutOfMemoryError: Map failed
        at sun.nio.ch.FileChannelImpl.map0(Native Method)
        at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:758)
        ... 28 more



SolrConfig:

<query>

  <useColdSearcher>true</useColdSearcher>

  <maxBooleanClauses>1024</maxBooleanClauses>

  <filterCache class="solr.FastLRUCache"
               size="512"
               initialSize="512"
               autowarmCount="0"/>

  <queryResultCache class="solr.LRUCache" size="512" initialSize="512"
                    autowarmCount="0"/>

  <documentCache class="solr.LRUCache" size="512" initialSize="512"
                 autowarmCount="0"/>

  <queryResultWindowSize>20</queryResultWindowSize>

  <queryResultMaxDocsCached>200</queryResultMaxDocsCached>

  <maxWarmingSearchers>6</maxWarmingSearchers>

</query>



Re: Lengthy description is converted to hash symbols

2013-04-03 Thread Danny Watari
I looked at the text via the admin analysis tool.  The text appeared to be
ok!  Unfortunately, the description is client data... so I can't post it
here, but I do not see any issues when running the analysis tool.






Re: Solr ZooKeeper ensemble with HBase

2013-04-03 Thread Walter Underwood
It will be limited by disk IO until you get the caches full. Then it will be 
limited by CPU. 

wunder

On Apr 3, 2013, at 8:55 AM, Amit Sela am...@infolinks.com wrote:

 Trouble in what way? If I have enough memory - HBase RegionServer 10GB and
 maybe 2GB for Solr - or do you mean CPU / disk?
 
 
 On Wed, Apr 3, 2013 at 5:54 PM, Michael Della Bitta 
 michael.della.bi...@appinions.com wrote:
 
 Hello, Amit:
 
 My guess is that, if HBase is working hard, you're going to have more
 trouble with HBase and Solr on the same nodes than HBase and Solr
 sharing a Zookeeper. Solr's usage of Zookeeper is very minimal.
 
 Michael Della Bitta
 
 
 Appinions
 18 East 41st Street, 2nd Floor
 New York, NY 10017-6271
 
 www.appinions.com
 
 Where Influence Isn’t a Game
 
 
 On Wed, Apr 3, 2013 at 8:06 AM, Amit Sela am...@infolinks.com wrote:
 Hi all,
 
 I have a running Hadoop + HBase cluster and the HBase cluster is running
 it's own zookeeper (HBase manages zookeeper).
 I would like to deploy my SolrCloud cluster on a portion of the machines
 on
 that cluster.
 
 My question is: Should I have any trouble / issues deploying an
 additional
 ZooKeeper ensemble ? I don't want to use the HBase ZooKeeper because,
 well
 first of all HBase manages it so I'm not sure it's possible and second I
 have HBase working pretty hard at times and I don't want to create any
 connection issues by overloading ZooKeeper.
 
 Thanks,
 
 Amit.
 


Re: maxWarmingSearchers in Solr 4.

2013-04-03 Thread Shawn Heisey
On 4/3/2013 1:48 AM, Dotan Cohen wrote:
 I have been dragging the same solrconfig.xml from Solr 3.x to 4.0 to
 4.1, with no customization (bad, bad me!). I'm now looking into
 customizing it and I see that the Solr 4.1 solrconfig.xml is much
 simpler and shorter. Is this simply because many of the examples have
 been removed?
 
 In particular, I notice that there is no mention of
 maxWarmingSearchers in the Solr 4.1 solrconfig.xml. I assume that I
 can simply add it in, are there any other critical config options that
 are missing that I should be looking into as well? Would I be better
 off using the old Solr 3.x solrconfig.xml in Solr 4.1 as it contains
 so many examples?

In situations where I don't want to change the default value, I prefer
to leave config elements out of the solrconfig.  It makes the config
smaller, and it also makes it so that I will automatically see benefits
from the default changing in new versions.

In the case of maxWarmingSearchers, I would hope that you have your
system set up so that you would never need more than 1 warming searcher
at a time.  If you do a commit while a previous commit is still warming,
Solr will try to create a second warming searcher.
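
If you do want to set it explicitly, it is a single element inside the query
section of solrconfig.xml (a sketch; the value 2 is illustrative):

  <maxWarmingSearchers>2</maxWarmingSearchers>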

I went poking in the code, and it seems that maxWarmingSearchers
defaults to Integer.MAX_VALUE.  I'm not sure whether this is a bad
default or not.  It does mean that a pathological setup without
maxWarmingSearchers in the config will probably blow up with an
OutOfMemory exception, but is that better or worse than commits that
don't make new documents searchable?  I can see arguments either way.

Thanks,
Shawn



Re: Flow Chart of Solr

2013-04-03 Thread Jack Park
Jack,

Is that new book up to the 4.+ series?

Thanks
The other Jack

On Wed, Apr 3, 2013 at 9:19 AM, Jack Krupansky j...@basetechnology.com wrote:
 And another one on the way:
 http://www.amazon.com/Lucene-Solr-Definitive-comprehensive-realtime/dp/1449359957

 Hopefully that helps a lot as well. Plenty of diagrams. Lots of examples.

 -- Jack Krupansky

 -Original Message- From: Jack Park
 Sent: Wednesday, April 03, 2013 11:25 AM

 To: solr-user@lucene.apache.org
 Subject: Re: Flow Chart of Solr

 There are three books on Solr, two with that in the title, and one,
 Taming Text, each of which has been very valuable in understanding
 Solr.

 Jack

 On Wed, Apr 3, 2013 at 5:25 AM, Jack Krupansky j...@basetechnology.com
 wrote:

 Sure, yes. But... it comes down to what level of detail you want and need
 for a specific task. In other words, there are probably a dozen or more
 levels of detail. The reality is that if you are going to work at the Solr
 code level, that is very, very different than being a user of Solr, and
 at
 that point your first step is to become familiar with the code itself.

 When you talk about parsing and stemming, you are really talking about
 the user-level, not the Solr code level. Maybe what you really need is a
 cheat sheet that maps a user-visible feature to the main Solr code
 component
 that implements that user feature.

 There are a number of different forms of parsing in Solr - parsing of
 what? Queries? Requests? Solr documents? Function queries?

 Stemming? Well, in truth, Solr doesn't even do stemming - Lucene does
 that.
 Lucene does all of the token filtering. Are you asking for details on
 how
 Lucene works? Maybe you meant to ask how term analysis works, which is
 split between Solr and Lucene. Or maybe you simply wanted to know when and
 where term analysis is done. Tell us your specific problem or specific
 question and we can probably quickly give you an answer.

 In truth, NOBODY uses flow charts anymore. Sure, there are some
 user-level
 diagrams, but not down to the code level.

 If you could focus on specific questions, we could give you specific
 answers.

 Main steps? That depends on what level you are working at. Tell us what
 problem you are trying to solve and we can point you to the relevant
 areas.

 In truth, if you become generally familiar with Solr at the user level
 (study the wikis), you will already know what the main steps are.

 So, it is not main steps of Solr, but main steps of some specific
 request of Solr, and for a specified level of detail, and for a
 specified
 area of Solr if greater detail is needed. Be more specific, and then we
 can
 be more specific.

 For now, the general advice for people who need or want to go far beyond
 the
 user level is to get familiar with the code - just LOOK at it - a lot of
 the package and class names are OBVIOUS, really, and follow the class
 hierarchy and code flow using the standard features of any modern Java
 IDE.
 If you are wondering where to start for some specific user-level feature,
 please ask specifically about that feature. But... make a diligent effort
 to
 discover and learn on your own before asking open-ended questions.

 Sure, there are lots of things in Lucene and Solr that are rather complex
 and seemingly convoluted, and not obvious, but people are more than
 willing
 to help you out if you simply ask a specific question. I mean, not
 everybody
 needs to know the fine detail of query parsing, analysis, building a
 Lucene-level stemmer, etc. If we tried to put all of that in a diagram,
 most
 people would be more confused than enlightened.

 At which step are scores calculated? That's more of a Lucene question. Or,
 are you really asking what code in Solr invokes Lucene search methods that
 calculate basic scores?

 In short, you need to be more specific. Don't force us to guess what
 problem
 you are trying to solve.

 -- Jack Krupansky

 -Original Message- From: Furkan KAMACI
 Sent: Wednesday, April 03, 2013 6:52 AM
 To: solr-user@lucene.apache.org
 Subject: Re: Flow Chart of Solr


 So, all in all, is there anybody who can write down just the main steps of
 Solr (including parsing, stemming, etc.)?


 2013/4/2 Furkan KAMACI furkankam...@gmail.com

 I think about myself as an example. I have started to make research about
 Solr just for some weeks. I have learned Solr and its related projects.
 My
 next step writing down the main steps Solr. We have separated learning
 curve of Solr into two main categories.
 First one is who are using it as out of the box components. Second one is
 developer side.

 Actually developer side branches into two way.

 First one is general steps of it. i.e. document comes into Solr (i.e.
 crawled data of Nutch). which analyzing processes are going to done
 (stamming, hamming etc.), what will be doing after parsing step by step.
 When a search query happens what happens step by step, at which step
 scores
 are calculated so on so forth.
 

Re: Flow Chart of Solr

2013-04-03 Thread Jack Krupansky
We're using the 4.x branch code as the basis for our writing. So, 
effectively it will be for at least 4.3 when the book comes out in the 
summer.


Early access will be in about a month or so. O'Reilly will be showing a 
galley proof of 200 pages of the book at Big Data TechCon next week in 
Boston.


-- Jack Krupansky

-Original Message- 
From: Jack Park

Sent: Wednesday, April 03, 2013 12:56 PM
To: solr-user@lucene.apache.org
Subject: Re: Flow Chart of Solr

Jack,

Is that new book up to the 4.+ series?

Thanks
The other Jack

On Wed, Apr 3, 2013 at 9:19 AM, Jack Krupansky j...@basetechnology.com 
wrote:

And another one on the way:
http://www.amazon.com/Lucene-Solr-Definitive-comprehensive-realtime/dp/1449359957

Hopefully that helps a lot as well. Plenty of diagrams. Lots of examples.

-- Jack Krupansky

-Original Message- From: Jack Park
Sent: Wednesday, April 03, 2013 11:25 AM

To: solr-user@lucene.apache.org
Subject: Re: Flow Chart of Solr

There are three books on Solr, two with that in the title, and one,
Taming Text, each of which has been very valuable in understanding
Solr.

Jack

On Wed, Apr 3, 2013 at 5:25 AM, Jack Krupansky j...@basetechnology.com
wrote:


Sure, yes. But... it comes down to what level of detail you want and need
for a specific task. In other words, there are probably a dozen or more
levels of detail. The reality is that if you are going to work at the 
Solr

code level, that is very, very different than being a user of Solr, and
at
that point your first step is to become familiar with the code itself.

When you talk about parsing and stemming, you are really talking 
about

the user-level, not the Solr code level. Maybe what you really need is a
cheat sheet that maps a user-visible feature to the main Solr code
component
that implements that user feature.

There are a number of different forms of parsing in Solr - parsing of
what? Queries? Requests? Solr documents? Function queries?

Stemming? Well, in truth, Solr doesn't even do stemming - Lucene does
that.
Lucene does all of the token filtering. Are you asking for details on
how
Lucene works? Maybe you meant to ask how term analysis works, which is
split between Solr and Lucene. Or maybe you simply wanted to know when 
and

where term analysis is done. Tell us your specific problem or specific
question and we can probably quickly give you an answer.

In truth, NOBODY uses flow charts anymore. Sure, there are some
user-level
diagrams, but not down to the code level.

If you could focus on specific questions, we could give you specific
answers.

Main steps? That depends on what level you are working at. Tell us what
problem you are trying to solve and we can point you to the relevant
areas.

In truth, if you become generally familiar with Solr at the user level
(study the wikis), you will already know what the main steps are.

So, it is not main steps of Solr, but main steps of some specific
request of Solr, and for a specified level of detail, and for a
specified
area of Solr if greater detail is needed. Be more specific, and then we
can
be more specific.

For now, the general advice for people who need or want to go far beyond
the
user level is to get familiar with the code - just LOOK at it - a lot 
of

the package and class names are OBVIOUS, really, and follow the class
hierarchy and code flow using the standard features of any modern Java
IDE.
If you are wondering where to start for some specific user-level feature,
please ask specifically about that feature. But... make a diligent effort
to
discover and learn on your own before asking open-ended questions.

Sure, there are lots of things in Lucene and Solr that are rather complex
and seemingly convoluted, and not obvious, but people are more than
willing
to help you out if you simply ask a specific question. I mean, not
everybody
needs to know the fine detail of query parsing, analysis, building a
Lucene-level stemmer, etc. If we tried to put all of that in a diagram,
most
people would be more confused than enlightened.

At which step are scores calculated? That's more of a Lucene question. 
Or,
are you really asking what code in Solr invokes Lucene search methods 
that

calculate basic scores?

In short, you need to be more specific. Don't force us to guess what
problem
you are trying to solve.

-- Jack Krupansky

-Original Message- From: Furkan KAMACI
Sent: Wednesday, April 03, 2013 6:52 AM
To: solr-user@lucene.apache.org
Subject: Re: Flow Chart of Solr


So, all in all, is there anybody who can write down just the main steps of
Solr (including parsing, stemming, etc.)?


2013/4/2 Furkan KAMACI furkankam...@gmail.com

I think about myself as an example. I have started to make research 
about

Solr just for some weeks. I have learned Solr and its related projects.
My
next step writing down the main steps Solr. We have separated learning
curve of Solr into two main categories.
First one is who are using it as out of the box 

Re: Solr 4.2 Cloud Replication Replica has higher version than Master?

2013-04-03 Thread Jamie Johnson
No, my thought was wrong; it appears that even with the parameter set I am
seeing this behavior.  I've been able to duplicate it on 4.2.0 by indexing
100,000 documents on 10 threads (10,000 each); the duplicates appear when I
get to 400,000 or so total.  I will try this on 4.2.1 to see if I see the
same behavior


On Wed, Apr 3, 2013 at 12:37 PM, Jamie Johnson jej2...@gmail.com wrote:

 Since I don't have that many items in my index I exported all of the keys
 for each shard and wrote a simple java program that checks for duplicates.
  I found some duplicate keys on different shards, a grep of the files for
 the keys found does indicate that they made it to the wrong places.  If you
 notice documents with the same ID are on shard 3 and shard 5.  Is it
 possible that the hash is being calculated taking into account only the
 live nodes?  I know that we don't specify the numShards param @ startup
 so could this be what is happening?

 grep -c 7cd1a717-3d94-4f5d-bcb1-9d8a95ca78de *
 shard1-core1:0
 shard1-core2:0
 shard2-core1:0
 shard2-core2:0
 shard3-core1:1
 shard3-core2:1
 shard4-core1:0
 shard4-core2:0
 shard5-core1:1
 shard5-core2:1
 shard6-core1:0
 shard6-core2:0


 On Wed, Apr 3, 2013 at 10:42 AM, Jamie Johnson jej2...@gmail.com wrote:

 Something interesting that I'm noticing as well, I just indexed 300,000
 items, and somehow 300,020 ended up in the index.  I thought perhaps I
 messed something up so I started the indexing again and indexed another
 400,000 and I see 400,064 docs.  Is there a good way to find possible
 duplicates?  I had tried to facet on key (our id field) but that didn't
 give me anything with more than a count of 1.


 On Wed, Apr 3, 2013 at 9:22 AM, Jamie Johnson jej2...@gmail.com wrote:

 Ok, so clearing the transaction log allowed things to go again.  I am
 going to clear the index and try to replicate the problem on 4.2.0 and then
 I'll try on 4.2.1


 On Wed, Apr 3, 2013 at 8:21 AM, Mark Miller markrmil...@gmail.comwrote:

 No, not that I know if, which is why I say we need to get to the bottom
 of it.

 - Mark

 On Apr 2, 2013, at 10:18 PM, Jamie Johnson jej2...@gmail.com wrote:

  Mark
  Is there a particular jira issue that you think may address this? I
 read
  through it quickly but didn't see one that jumped out
  On Apr 2, 2013 10:07 PM, Jamie Johnson jej2...@gmail.com wrote:
 
  I brought the bad one down and back up and it did nothing.  I can
 clear
  the index and try 4.2.1. I will save off the logs and see if there is
  anything else odd
  On Apr 2, 2013 9:13 PM, Mark Miller markrmil...@gmail.com wrote:
 
  It would appear it's a bug given what you have said.
 
  Any other exceptions would be useful. Might be best to start
 tracking in
  a JIRA issue as well.
 
  To fix, I'd bring the behind node down and back again.
 
  Unfortunately, I'm pressed for time, but we really need to get to
 the
  bottom of this and fix it, or determine if it's fixed in 4.2.1
 (spreading
  to mirrors now).
 
  - Mark
 
  On Apr 2, 2013, at 7:21 PM, Jamie Johnson jej2...@gmail.com
 wrote:
 
  Sorry I didn't ask the obvious question.  Is there anything else
 that I
  should be looking for here and is this a bug?  I'd be happy to
 troll
  through the logs further if more information is needed, just let me
  know.
 
  Also what is the most appropriate mechanism to fix this.  Is it
  required to
  kill the index that is out of sync and let solr resync things?
 
 
  On Tue, Apr 2, 2013 at 5:45 PM, Jamie Johnson jej2...@gmail.com
  wrote:
 
  sorry for spamming here
 
  shard5-core2 is the instance we're having issues with...
 
  Apr 2, 2013 7:27:14 PM org.apache.solr.common.SolrException log
  SEVERE: shard update error StdNode:
 
 
 http://10.38.33.17:7577/solr/dsc-shard5-core2/:org.apache.solr.common.SolrException
  :
  Server at http://10.38.33.17:7577/solr/dsc-shard5-core2 returned
 non
  ok
  status:503, message:Service Unavailable
at
 
 
 org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:373)
at
 
 
 org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
at
 
 
 org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:332)
at
 
 
 org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:306)
at
  java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at
 
 java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
at
  java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at
 
 
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at
 
 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
 
 
  On Tue, Apr 2, 2013 at 5:43 PM, Jamie Johnson jej2...@gmail.com
  wrote:
 
  

Re: Query parser cuts last letter from search term.

2013-04-03 Thread Upayavira

On Wed, Apr 3, 2013, at 11:36 AM, vsl wrote:
 So why does Solr not return the proper document?

You're gonna have to give us a bit more than that. 

What is wrong with the documents it is returning?

Upayavira


Re: Solr Multiword Search

2013-04-03 Thread skmirch
I have been trying to use the MultiWordSpellingQueryConverter.java since I
need to be able to find the documents that correspond to the suggested
collations.  At the moment it seems to be producing collations based on word
matches: arbitrary words from the field are picked up to form the collation,
so nothing corresponds to any of the titles in our set of indexed
documents.

Could anyone please confirm that this would work if I took the following
steps:

1. Get the solr4.2.war file.
2. Go to WEB-INF/lib and add lucene-core-4.2.0.jar and
solr-core-4.2.0.jar to the classpath to compile
MultiWordSpellingQueryConverter.java. The code for this is in my previous
post in this thread.
3. jar cvf multiwordspellchecker.jar
com/foo/MultiWordSpellingQueryConverter.class
4. Copy this jar to the $SOLR_HOME/lib directory.
5. Define the queryConverter.  Question: where does this need to go? I have
just put it somewhere between the searchComponent and the requestHandler for
spell checks.
6. Start the webserver. I see this jar file getting registered at startup:
2013-04-03 12:56:22,243 INFO  [org.apache.solr.core.SolrResourceLoader]
(coreLoadExecutor-3-thread-1) Adding
'file:/solr/lib/multiwordspellchecker.jar' to classloader
7. When I run the spell query, I don't see my print statements, so I am not
sure if this code is really being called.  I don't think it is the
logging that is failing but rather this code not being called at all.
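
For reference, the declaration in step 5 is typically a top-level element in
solrconfig.xml, next to the spellcheck searchComponent (a sketch reusing the
class name from this thread):

  <queryConverter name="queryConverter" class="com.foo.MultiWordSpellingQueryConverter"/>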

I would appreciate any information on what I might be doing wrong.  Please
help.

Thanks.
Regards,
-- Sandeep





Re: Out of memory on some faceting queries

2013-04-03 Thread Shawn Heisey
On 4/2/2013 3:09 AM, Dotan Cohen wrote:
 I notice that this only occurs on queries that run facets. I start
 Solr with the following command:
 sudo nohup java -XX:NewRatio=1 -XX:+UseParNewGC
 -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled
 -Dsolr.solr.home=/mnt/SolrFiles100/solr -jar
 /opt/solr-4.1.0/example/start.jar 

It looks like you've followed some advice that I gave previously on how
to tune Java.  I have since learned that this advice is bad; it results
in long GC pauses, even with heaps that aren't huge.

As others have pointed out, you don't have a max heap setting, which
would mean that you're using whatever Java chooses for its default,
which might not be enough.  If you can get Solr to successfully run for
a while with queries and updates happening, the heap should eventually
max out and the admin UI will show you what Java is choosing by default.

Here is what I would now recommend for a beginning point on your Solr
startup command.  You may need to increase the heap beyond 4GB, but be
careful that you still have enough free memory to be able to do
effective caching of your index.

sudo nohup java -Xms4096M -Xmx4096M -XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=75 -XX:NewRatio=3
-XX:MaxTenuringThreshold=8 -XX:+CMSParallelRemarkEnabled
-XX:+ParallelRefProcEnabled -XX:+UseLargePages -XX:+AggressiveOpts
-Dsolr.solr.home=/mnt/SolrFiles100/solr -jar
/opt/solr-4.1.0/example/start.jar 

If you are running a really old build of java (latest versions on
Oracle's website are 1.6 build 43 and 1.7 build 17), you might want to
leave AggressiveOpts out.  Some people would argue that you should never
use that option.

Thanks,
Shawn



Re: SolrCloud not distributing documents across shards

2013-04-03 Thread Michael Della Bitta
Hello Vytenis,

What exactly do you mean by aren't distributing across the shards?
Do you mean that POSTs against the server for shard 1 never end up
resulting in documents saved in shard 2?

Michael Della Bitta


Appinions
18 East 41st Street, 2nd Floor
New York, NY 10017-6271

www.appinions.com

Where Influence Isn’t a Game


On Wed, Apr 3, 2013 at 12:31 PM, vsilgalis vsilga...@gmail.com wrote:
 So we have 3 servers in a SolrCloud cluster.

 http://lucene.472066.n3.nabble.com/file/n4053506/Cloud1.png

 We have 2 shards for our collection (classic_bt) with a shard on each of the
 first two servers as the picture shows. The third server has replicas of the
 first 2 shards just for high availability purposes.

 Now if we go into counts we have the following information:
 shard1 - Numdocs - 33010
 shard2 - Numdocs - 85934

 Both shards replicate to the third server with no issues.

 For some reason the documents aren't distributing across the shards, nothing
 in the logs indicates a problem but I'm not sure what we should be looking
 for.

 Let me know if you need more information.





Re: Lengthy description is converted to hash symbols

2013-04-03 Thread Danny Watari
Here is a query that should return 2 documents... but it only returns 1.

/solr/m7779912/select?indent=on&version=2.2&q=description%3Agateway&fq=&start=0&rows=10&fl=description&qt=&wt=&explainOther=&hl.fl=

Oddly enough, the descriptions of the two documents are exactly the same,
except one is indexed correctly and the other contains the hash symbols.

Btw, when the core was created, it was built from scratch via POJOs and
the addBeans() method.





Solr Tika Override

2013-04-03 Thread JerryC
I am researching Solr and seeing if it would be a good fit for a document
search service I am helping to develop.  One of the requirements is that we
will need to be able to customize how file contents are parsed beyond the
default configurations that are offered out of the box by Tika.  For
example, we know that we will be indexing .pdf files that will contain a
cover page with a project start date, and would like to pull this date out
into a searchable field that is separate from the file content.  I have seen
several sources saying you can do this by overriding the
ExtractingRequestHandler.createFactory() method, but I have not been able to
find much documentation on how to implement a new parser.  Can someone point
me in the right direction on where to look, or let me know if the scenario I
described above is even possible?
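
For what it's worth, the usual shape of such an override looks roughly like
the sketch below. Everything here is illustrative: the class and field names
(CoverPageRequestHandler, project_start_date, the regex) are hypothetical,
the "content" field name depends on your fmap.* settings, and the constructor
signatures should be checked against your Solr version.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.common.params.SolrParams;
import org.apache.solr.handler.extraction.ExtractingRequestHandler;
import org.apache.solr.handler.extraction.SolrContentHandler;
import org.apache.solr.handler.extraction.SolrContentHandlerFactory;
import org.apache.solr.schema.IndexSchema;
import org.apache.tika.metadata.Metadata;

// Hypothetical handler that post-processes the Tika-extracted document.
public class CoverPageRequestHandler extends ExtractingRequestHandler {

  @Override
  protected SolrContentHandlerFactory createFactory() {
    return new SolrContentHandlerFactory(dateFormats) {
      @Override
      public SolrContentHandler createSolrContentHandler(Metadata metadata,
          SolrParams params, IndexSchema schema) {
        return new SolrContentHandler(metadata, params, schema, dateFormats) {
          @Override
          public SolrInputDocument newDocument() {
            SolrInputDocument doc = super.newDocument();
            // Scan the extracted body for the cover-page date and copy it to
            // its own field; "content" assumes the default field mapping.
            Object text = doc.getFieldValue("content");
            if (text != null) {
              Matcher m = Pattern.compile("Project Start Date:\\s*(\\S+)")
                  .matcher(text.toString());
              if (m.find()) {
                doc.setField("project_start_date", m.group(1));
              }
            }
            return doc;
          }
        };
      }
    };
  }
}

The custom class would then be registered in place of the stock class in the
/update/extract requestHandler definition in solrconfig.xml, with the jar on
a lib path that Solr scans.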





Re: Solr 4.2 Cloud Replication Replica has higher version than Master?

2013-04-03 Thread Mark Miller
Thanks for digging, Jamie. In 4.2, hash ranges are assigned up front when a 
collection is created - each shard gets a range, which is stored in zookeeper. 
You should not be able to end up with the same id on different shards - 
something very odd is going on.
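
One quick way to sanity-check those stored ranges is to dump the cluster state
from ZooKeeper, e.g. with the zkcli script shipped under example/cloud-scripts
(a sketch; host, port and paths will vary):

  cloud-scripts/zkcli.sh -zkhost localhost:2181 -cmd get /clusterstate.json

Each shard entry should carry its own non-overlapping range value.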

Hopefully I'll have some time to try and help you reproduce. Ideally we can 
capture it in a test case.

- Mark

On Apr 3, 2013, at 1:13 PM, Jamie Johnson jej2...@gmail.com wrote:

 no, my thought was wrong, it appears that even with the parameter set I am
 seeing this behavior.  I've been able to duplicate it on 4.2.0 by indexing
 100,000 documents on 10 threads (10,000 each) when I get to 400,000 or so.
 I will try this on 4.2.1. to see if I see the same behavior
 
 
 On Wed, Apr 3, 2013 at 12:37 PM, Jamie Johnson jej2...@gmail.com wrote:
 
 Since I don't have that many items in my index I exported all of the keys
 for each shard and wrote a simple java program that checks for duplicates.
 I found some duplicate keys on different shards, a grep of the files for
 the keys found does indicate that they made it to the wrong places.  If you
 notice documents with the same ID are on shard 3 and shard 5.  Is it
 possible that the hash is being calculated taking into account only the
 live nodes?  I know that we don't specify the numShards param @ startup
 so could this be what is happening?
 
 grep -c 7cd1a717-3d94-4f5d-bcb1-9d8a95ca78de *
 shard1-core1:0
 shard1-core2:0
 shard2-core1:0
 shard2-core2:0
 shard3-core1:1
 shard3-core2:1
 shard4-core1:0
 shard4-core2:0
 shard5-core1:1
 shard5-core2:1
 shard6-core1:0
 shard6-core2:0
 
 
 On Wed, Apr 3, 2013 at 10:42 AM, Jamie Johnson jej2...@gmail.com wrote:
 
 Something interesting that I'm noticing as well, I just indexed 300,000
 items, and somehow 300,020 ended up in the index.  I thought perhaps I
 messed something up so I started the indexing again and indexed another
 400,000 and I see 400,064 docs.  Is there a good way to find possible
 duplicates?  I had tried to facet on key (our id field) but that didn't
 give me anything with more than a count of 1.
 
 
 On Wed, Apr 3, 2013 at 9:22 AM, Jamie Johnson jej2...@gmail.com wrote:
 
 Ok, so clearing the transaction log allowed things to go again.  I am
 going to clear the index and try to replicate the problem on 4.2.0 and then
 I'll try on 4.2.1
 
 
 On Wed, Apr 3, 2013 at 8:21 AM, Mark Miller markrmil...@gmail.comwrote:
 
 No, not that I know if, which is why I say we need to get to the bottom
 of it.
 
 - Mark
 
 On Apr 2, 2013, at 10:18 PM, Jamie Johnson jej2...@gmail.com wrote:
 
 Mark
 Is there a particular jira issue that you think may address this? I read
 through it quickly but didn't see one that jumped out.
 On Apr 2, 2013 10:07 PM, Jamie Johnson jej2...@gmail.com wrote:
 
 I brought the bad one down and back up and it did nothing.  I can clear
 the index and try 4.2.1.  I will save off the logs and see if there is
 anything else odd.
 On Apr 2, 2013 9:13 PM, Mark Miller markrmil...@gmail.com wrote:
 
 It would appear it's a bug given what you have said.
 
 Any other exceptions would be useful. Might be best to start
 tracking in
 a JIRA issue as well.
 
 To fix, I'd bring the behind node down and back again.
 
 Unfortunately, I'm pressed for time, but we really need to get to
 the
 bottom of this and fix it, or determine if it's fixed in 4.2.1
 (spreading
 to mirrors now).
 
 - Mark
 
 On Apr 2, 2013, at 7:21 PM, Jamie Johnson jej2...@gmail.com
 wrote:
 
 Sorry I didn't ask the obvious question.  Is there anything else
 that I
 should be looking for here and is this a bug?  I'd be happy to
 troll
 through the logs further if more information is needed, just let me
 know.
 
 Also what is the most appropriate mechanism to fix this.  Is it
 required to
 kill the index that is out of sync and let solr resync things?
 
 
 On Tue, Apr 2, 2013 at 5:45 PM, Jamie Johnson jej2...@gmail.com
 wrote:
 
 sorry for spamming here
 
 shard5-core2 is the instance we're having issues with...
 
 Apr 2, 2013 7:27:14 PM org.apache.solr.common.SolrException log
 SEVERE: shard update error StdNode:
 
 
 http://10.38.33.17:7577/solr/dsc-shard5-core2/:org.apache.solr.common.SolrException
 :
 Server at http://10.38.33.17:7577/solr/dsc-shard5-core2 returned
 non
 ok
 status:503, message:Service Unavailable
  at
 
 
 org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:373)
  at
 
 
 org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
  at
 
 
 org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:332)
  at
 
 
 org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:306)
  at
 java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
  at java.util.concurrent.FutureTask.run(FutureTask.java:138)
  at
 
 java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
  at
 

RE: AW: AW: java.lang.OutOfMemoryError: Map failed

2013-04-03 Thread Van Tassell, Kristian
I just posted a similar error and discovered that decreasing Xmx fixed the
problem for me. The free command, top, etc. indicated I was flying just below
the threshold of my allowed memory, with swap/virtual space still available,
so I'm still confused as to what the issue is, but you might try this in your
configuration to see if it helps.

-Original Message-
From: Per Steffensen [mailto:st...@designware.dk] 
Sent: Tuesday, April 02, 2013 6:09 AM
To: solr-user@lucene.apache.org
Subject: Re: AW: AW: java.lang.OutOfMemoryError: Map failed

I have seen the exact same thing on Ubuntu Server 12.04. Adding some swap
space helped, but I do not understand why it is necessary, since the OS ought
to just use the actual memory mapped files if there is not room in (virtual)
memory, swapping pages in and out on demand. Note that I saw this for memory
mapped files opened for read+write - not in the exact same context as you see
it, where MMapDirectory is trying to map memory mapped files.

If you find a solution/explanation, please post it here. I really want to
know more about why FileChannel.map can cause an OOM. I do not think it is a
real OOM indicating no more space on the Java heap, but rather an exception
saying that the OS has no more memory (in some interpretation of that).
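
To see the raw behavior in isolation, here is a minimal probe of the same
call (a sketch; assumes Java 7, pass it the path to a large file):

  import java.io.RandomAccessFile;
  import java.nio.MappedByteBuffer;
  import java.nio.channels.FileChannel;

  public class MmapProbe {
    public static void main(String[] args) throws Exception {
      try (RandomAccessFile raf = new RandomAccessFile(args[0], "r");
           FileChannel ch = raf.getChannel()) {
        // A single map() call is limited to 2GB.
        long size = Math.min(ch.size(), Integer.MAX_VALUE);
        // An OutOfMemoryError("Map failed") here means the OS refused the
        // mapping (address space, overcommit, max_map_count), not that the
        // Java heap is full.
        MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, size);
        System.out.println("mapped " + buf.capacity() + " bytes");
      }
    }
  }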

Regards, Per Steffensen

On 4/2/13 11:32 AM, Arkadi Colson wrote:
 It is running as root:

 root@solr01-dcg:~# ps aux | grep tom
 root  1809 10.2 67.5 49460420 6931232 ?Sl   Mar28 706:29 
 /usr/bin/java
 -Djava.util.logging.config.file=/usr/local/tomcat/conf/logging.propert
 ies -server -Xms2048m -Xmx6144m -XX:PermSize=64m -XX:MaxPermSize=128m 
 -XX:+UseG1GC -verbose:gc -Xloggc:/solr/tomcat-logs/gc.log 
 -XX:+PrintGCTimeStamps -XX:+PrintGCDetails -Duser.timezone=UTC
 -Dfile.encoding=UTF8 -Dsolr.solr.home=/opt/solr/ -Dport=8983 
 -Dcollection.configName=smsc -DzkClientTimeout=2
 -DzkHost=solr01-dcg.intnet.smartbit.be:2181,solr01-gs.intnet.smartbit.
 be:2181,solr02-dcg.intnet.smartbit.be:2181,solr02-gs.intnet.smartbit.b
 e:2181,solr03-dcg.intnet.smartbit.be:2181,solr03-gs.intnet.smartbit.be
 :2181 
 -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager
 -Dcom.sun.management.jmxremote
 -Dcom.sun.management.jmxremote.port=
 -Dcom.sun.management.jmxremote.ssl=false
 -Dcom.sun.management.jmxremote.authenticate=false
 -Djava.endorsed.dirs=/usr/local/tomcat/endorsed -classpath 
 /usr/local/tomcat/bin/bootstrap.jar:/usr/local/tomcat/bin/tomcat-juli.
 jar -Dcatalina.base=/usr/local/tomcat 
 -Dcatalina.home=/usr/local/tomcat 
 -Djava.io.tmpdir=/usr/local/tomcat/temp
 org.apache.catalina.startup.Bootstrap start

 Arkadi

 On 04/02/2013 11:29 AM, André Widhani wrote:
 The output is from the root user. Are you running Solr as root?

 If not, please try again using the operating system user that runs Solr.

 André
 
 Von: Arkadi Colson [ark...@smartbit.be]
 Gesendet: Dienstag, 2. April 2013 11:26
 An: solr-user@lucene.apache.org
 Cc: André Widhani
 Betreff: Re: AW: java.lang.OutOfMemoryError: Map failed

 Hmmm I checked it and it seems to be ok:

 root@solr01-dcg:~# ulimit -v
 unlimited

 Any other tips or do you need more debug info?

 BR

 On 04/02/2013 11:15 AM, André Widhani wrote:
 Hi Arkadi,

 this error usually indicates that virtual memory is not sufficient 
 (should be unlimited).

 Please see
 http://comments.gmane.org/gmane.comp.jakarta.lucene.solr.user/69168

 Regards,
 André

 
 Von: Arkadi Colson [ark...@smartbit.be]
 Gesendet: Dienstag, 2. April 2013 10:24
 An: solr-user@lucene.apache.org
 Betreff: java.lang.OutOfMemoryError: Map failed

 Hi

 Recently solr crashed. I've found this in the error log.
My commit settings look like this:
<autoCommit>
  <maxTime>1</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>

<autoSoftCommit>
  <maxTime>2000</maxTime>
</autoSoftCommit>

 The machine has 10GB of memory. Tomcat is running with -Xms2048m 
 -Xmx6144m

 Versions
 Solr: 4.2
 Tomcat: 7.0.33
 Java: 1.7

 Anybody any idea?

 Thx!

 Arkadi

 SEVERE: auto commit error...:org.apache.solr.common.SolrException: 
 Error
 opening new searcher
at
 org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1415)
at
 org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1527)
at
 org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandl
 er2.java:562)

at
 org.apache.solr.update.CommitTracker.run(CommitTracker.java:216)
at
 java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at
 java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
at java.util.concurrent.FutureTask.run(FutureTask.java:166)
at
 java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask
 .access$201(ScheduledThreadPoolExecutor.java:178)

at
 

RE: Solr Multiword Search

2013-04-03 Thread Dyer, James
You have specified spellcheck.q in your query.  The whole purpose of
spellcheck.q is to bypass any query converter you've configured, giving it
raw keywords instead.

But possibly a custom query converter is not your best answer?

I agree that charles and charlie are within an edit distance of 2, so if
everything is set up correctly then DirectSolrSpellChecker with maxEdits=2
should find it.  The collate functionality as you have it set up will check
the index and only give you re-written queries that are guaranteed to return
hits.  But there is a big caveat: if the word charles occurs at all in the
dictionary (because any document in your index contains it), then the
spellchecker (by default) assumes it is a correctly-spelled word and will not
try to correct it.  In this case, specify spellcheck.alternativeTermCount
with a non-zero value. (See
http://wiki.apache.org/solr/SpellCheckComponent#spellcheck.alternativeTermCount)
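
For illustration, the relevant pieces look roughly like this (a sketch, not
your exact config - field and component names are examples):

  <searchComponent name="spellcheck" class="solr.SpellCheckComponent">
    <lst name="spellchecker">
      <str name="name">default</str>
      <str name="field">title</str>
      <str name="classname">solr.DirectSolrSpellChecker</str>
      <int name="maxEdits">2</int>
    </lst>
  </searchComponent>

and on the request:

  spellcheck=true&spellcheck.collate=true&spellcheck.alternativeTermCount=5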
  

James Dyer
Ingram Content Group
(615) 213-4311


-Original Message-
From: skmirch [mailto:skmi...@hotmail.com] 
Sent: Wednesday, April 03, 2013 12:19 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr Multiword Search

I have been trying to use the MultiWordSpellingQueryConverter.java since I
need to be able to find the documents that correspond to the suggested
collations.  At the moment it seems to be producing collations based on word
matches, and arbitrary words from the field are picked up to form collations,
so nothing corresponds to any of the titles in our set of indexed documents.

Could anyone please confirm that this would work if I took the following
steps.

steps:
1. Get the solr4.2.war file.
2. Go to WEB-INF/lib and add the lucene-core-4.2.0.jar and the
solr-core-4.2.0.jar to the classpath to compile
MultiWordSpellingQueryConverter.java.  The code for this is in my previous
post in this thread.
3. jar cvf multiwordspellchecker.jar
com/foo/MultiWordSpellingQueryConverter.java
4. Copy this jar to the $SOLR_HOME/lib directory.
5. Define the queryConverter.  Question: Where does this need to go? I have
just put it somewhere between the searchComponent and the requestHandler for
spell checks.
6. Start the webserver. I see this jar file getting registered at startup:
2013-04-03 12:56:22,243 INFO  [org.apache.solr.core.SolrResourceLoader]
(coreLoadExecutor-3-thread-1) Adding
'file:/solr/lib/multiwordspellchecker.jar' to classloader
7. When I run the spell query, I don't see my print statements, so I am not
sure if this code is really being called.  I don't think it is the logging
that is failing but rather this code not being called at all.

I would appreciate any information on what I might be doing wrong.  Please
help.
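
For reference, my definition is just a bare top-level element in
solrconfig.xml (class name from step 3 above):

  <queryConverter name="queryConverter" class="com.foo.MultiWordSpellingQueryConverter"/>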

Thanks.
Regards,
-- Sandeep







Re: SolrCloud not distributing documents across shards

2013-04-03 Thread vsilgalis
Michael Della Bitta-2 wrote
 Hello Vytenis,
 
 What exactly do you mean by "aren't distributing across the shards"?
 Do you mean that POSTs against the server for shard 1 never end up
 resulting in documents saved in shard 2?

So we indexed a set of 33010 documents on server01 which are now in shard1.
And we kicked off a set of 85934 documents on server02 which are now in
shard2 (as tests).  In my understanding of how SolrCloud works, the
documents should be distributed across the shards in the collection.  Now I
have seen this work before in my environment.  Not sure what I need to look
at to ensure this distribution.

Just as an FYI, this is Solr 4.1.





Re: Solr 4.2 Cloud Replication Replica has higher version than Master?

2013-04-03 Thread Jamie Johnson
Where is this information stored in ZK?  I don't see it in the cluster
state (or perhaps I don't understand it ;) ).

Perhaps something with my process is broken.  What I do when I start from
scratch is the following

ZkCLI -cmd upconfig ...
ZkCLI -cmd linkconfig 

but I don't ever explicitly create the collection.  What should the steps
from scratch be?  I am moving from an unreleased snapshot of 4.0, so I never
did that previously either; perhaps I did create the collection in one of my
steps to get this working and have forgotten it along the way.


On Wed, Apr 3, 2013 at 2:16 PM, Mark Miller markrmil...@gmail.com wrote:

 Thanks for digging Jamie. In 4.2, hash ranges are assigned up front when a
 collection is created - each shard gets a range, which is stored in
 zookeeper. You should not be able to end up with the same id on different
 shards - something very odd going on.

 Hopefully I'll have some time to try and help you reproduce. Ideally we
 can capture it in a test case.

 - Mark


Re: Solr 4.2 Cloud Replication Replica has higher version than Master?

2013-04-03 Thread Mark Miller
It should be part of your clusterstate.json. Some users have reported trouble 
upgrading a previous zk install when this change came. I recommended manually 
updating the clusterstate.json to have the right info, and that seemed to work. 
Otherwise, I guess you have to start from a clean zk state.

If you don't have that range information, I think there will be trouble. Do
you have a router type defined in the clusterstate.json?

- Mark

On Apr 3, 2013, at 2:24 PM, Jamie Johnson jej2...@gmail.com wrote:

 Where is this information stored in ZK?  I don't see it in the cluster
 state (or perhaps I don't understand it ;) ).
 
 Perhaps something with my process is broken.  What I do when I start from
 scratch is the following
 
 ZkCLI -cmd upconfig ...
 ZkCLI -cmd linkconfig 
 
 but I don't ever explicitly create the collection.  What should the steps
 from scratch be?  I am moving from an unreleased snapshot of 4.0, so I never
 did that previously either; perhaps I did create the collection in one of
 my steps to get this working and have forgotten it along the way.
 
 

Re: SolrCloud not distributing documents across shards

2013-04-03 Thread Chris Hostetter
: So we indexed a set of 33010 documents on server01 which are now in shard1.
: And we kicked off a set of 85934 documents on server02 which are now in
: shard2 (as tests).  In my understanding of how SolrCloud works, the
: documents should be distributed across the shards in the collection.  Now I

I'm not familiar with the details, but i've seen miller respond to a 
similar question with reference to the issue of not explicitly specifying 
numShards when creating your collections...

http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201303.mbox/%3c0aa0b422-f1de-4915-b602-53cb1849204a@gmail.com%3E


-Hoss


Re: It seems a issue of deal with chinese synonym for solr

2013-04-03 Thread Kuro Kurosaka

On 3/11/13 6:15 PM, 李威 wrote:

In org.apache.solr.parser.SolrQueryParserBase there is a function:
protected Query newFieldQuery(Analyzer analyzer, String field, String
queryText, boolean quoted) throws SyntaxError

The code below can't process Chinese correctly.

  BooleanClause.Occur occur = positionCount > 1 && operator == AND_OPERATOR ?
      BooleanClause.Occur.MUST : BooleanClause.Occur.SHOULD;



For example, "北京市" and "北京" are synonyms. If I search 北京市动物园, the expected
parse result is +(北京市 北京) +动物园, but actually it is parsed to +北京市 +北京 +动物园.

The code can process English, because English words are separated by spaces
and each occupies only one position.


An interesting feature of this example is that the difference between the two
synonyms is the omission of one token, 市 (city). Doesn't the same problem
happen if we define "London City" and "London" as synonyms, and execute a
query like "London City Zoo"? Must a Chinese Analyzer be used to reproduce
this problem?

I tried to test this but I couldn't reproduce it. The result of query string
expansion using Solr 4.2's query interface with debug output shows:

<str name="parsedquery">MultiPhraseQuery(text:"(london london) city zoo")</str>

I see no plus (+). What query parser did you use?
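
For anyone trying to reproduce this, a garden-variety multi-word synonym
setup looks roughly like the following (a sketch with illustrative names,
not necessarily the original poster's config):

  # synonyms.txt
  London City, London

  <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
              ignoreCase="true" expand="true"/>
    </analyzer>
  </fieldType>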

--
Kuro Kurosaka


Re: Solr 4.2 Cloud Replication Replica has higher version than Master?

2013-04-03 Thread Jamie Johnson
The router says "implicit".  I did start from a blank zk state, but perhaps
I missed one of the ZkCLI commands?  One of my shards from the
clusterstate.json is shown below.  What is the process that should be done
to bootstrap a cluster other than the ZkCLI commands I listed above?  My
process right now is run those ZkCLI commands and then start solr on all of
the instances with a command like this

java -server -Dshard=shard5 -DcoreName=shard5-core1
-Dsolr.data.dir=/solr/data/shard5-core1 -Dcollection.configName=solr-conf
-Dcollection=collection1 -DzkHost=so-zoo1:2181,so-zoo2:2181,so-zoo3:2181
-Djetty.port=7575 -DhostPort=7575 -jar start.jar

I feel like maybe I'm missing a step.

"shard5":{
    "state":"active",
    "replicas":{
      "10.38.33.16:7575_solr_shard5-core1":{
        "shard":"shard5",
        "state":"active",
        "core":"shard5-core1",
        "collection":"collection1",
        "node_name":"10.38.33.16:7575_solr",
        "base_url":"http://10.38.33.16:7575/solr",
        "leader":"true"},
      "10.38.33.17:7577_solr_shard5-core2":{
        "shard":"shard5",
        "state":"recovering",
        "core":"shard5-core2",
        "collection":"collection1",
        "node_name":"10.38.33.17:7577_solr",
        "base_url":"http://10.38.33.17:7577/solr"}}}


On Wed, Apr 3, 2013 at 2:40 PM, Mark Miller markrmil...@gmail.com wrote:

 It should be part of your clusterstate.json. Some users have reported
 trouble upgrading a previous zk install when this change came. I
 recommended manually updating the clusterstate.json to have the right info,
 and that seemed to work. Otherwise, I guess you have to start from a clean
 zk state.

 If you don't have that range information, I think there will be trouble.
 Do you have a router type defined in the clusterstate.json?

 - Mark


Re: SolrCloud not distributing documents across shards

2013-04-03 Thread vsilgalis
Chris Hostetter-3 wrote
 I'm not familiar with the details, but i've seen miller respond to a 
 similar question with reference to the issue of not explicitly specifying 
 numShards when creating your collections...
 
  http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201303.mbox/%3C0AA0B422-F1DE-4915-B602-53CB1849204A@gmail.com%3E
 
 
 -Hoss

Well theoretically we are okay there.

The commands we run to create our collection are as follows (note the
numShards being specified):
http://server01/solr/admin/cores?action=CREATE&name=classic_bt&collection=classic_bt&numShards=2&instanceDir=instances/basistech&dataDir=/opt/index/classic_bt&config=solrconfig.xml&schema=schema.xml&collection.configName=classic_bt

http://server02/solr/admin/cores?action=CREATE&name=classic_bt&collection=classic_bt&numShards=2&instanceDir=instances/basistech&dataDir=/opt/index/classic_bt&config=solrconfig.xml&schema=schema.xml&collection.configName=classic_bt

http://server03/solr/admin/cores?action=CREATE&name=classic_bt_shard1&collection=classic_bt&numShards=2&instanceDir=instances/basistech&dataDir=/opt/index/classic_bt_shard1&config=solrconfig.xml&schema=schema.xml&collection.configName=classic_bt&shard=shard1

http://server03/solr/admin/cores?action=CREATE&name=classic_bt_shard2&collection=classic_bt&numShards=2&instanceDir=instances/basistech&dataDir=/opt/index/classic_bt_shard2&config=solrconfig.xml&schema=schema.xml&collection.configName=classic_bt&shard=shard2






HTML entities being missed by DIH HTMLStripTransformer

2013-04-03 Thread Ashok
Hi,

I am using DIH to index some database fields. These fields contain
HTML-formatted text. I use the HTMLStripTransformer to remove that
markup. This works fine when the text is, for example:

<li>Item One</li> or <b>This is in Bold</b>

However, when the text has HTML entity names, as in:

&lt;li&gt;Item One&lt;/li&gt; or &lt;b&gt;This is in Bold&lt;/b&gt;

NOTHING HAPPENS. 

Two questions.

(1) Is this the expected behavior of DIH HTMLStripTransformer?
(2) If yes, is there another transformer that I can employ first to turn
these HTML entities into their usual symbols, which can then be removed by
the DIH HTMLStripTransformer?

Thanks

- ashok





Re: HTML entities being missed by DIH HTMLStripTransformer

2013-04-03 Thread Gora Mohanty
On 4 April 2013 00:30, Ashok ash...@qualcomm.com wrote:
[...]
 Two questions.

 (1) Is this the expected behavior of DIH HTMLStripTransformer?

Yes, I believe so.

 (2) If yes, is there an another transformer that I can employ first to turn
 these html entities into their usual symbols that can then be removed by the
 DIH HTMLStripTransformer?

How are the HTML tags getting converted into entities?
Are you escaping input somewhere?

Regards,
Gora


Re: Filtering Search Cloud

2013-04-03 Thread Furkan KAMACI
Shawn, thanks for your detailed explanation. My system will run under high
load: I will always be indexing something, and something will always be
getting queried. That is why I am considering physically separating the
indexing machines from the query-serving machines. Imagine a machine that is
both indexing (one kind of disk IO - I don't know whether the underlying Solr
IO is sequential) and trying to answer queries (another kind of IO); that is
my main challenge in deciding whether to separate them. The next step is: if
I separate them, can I filter the data of the indexer machines before it is
returned in responses? (I don't have any filtering issues right now; I just
think that maybe I will need it in the future.)


2013/4/3 Shawn Heisey s...@elyograg.org

 On 4/1/2013 3:02 PM, Furkan KAMACI wrote:
  I want to separate my cloud into two logical parts. One of them is
 indexer
  cloud of SolrCloud. Second one is Searcher cloud of SolrCloud.
 
  My first question is that. Does separating my cloud system make sense
 about
  performance improvement. Because I think that when indexing, searching
 make
  time to response and if I separate them I get a performance improvement.
 On
  the other hand maybe using all Solr machines as whole (I mean not
  partitioning as I mentioned) SolrCloud can make a better load balancing,
 I
  would want to learn it.
 
  My second question is that. Let's assume that I have separated my
 machines
  as I mentioned. Can I filter some indexes to be searchable or not from
  Searcher SolrCloud?

 SolrCloud gets rid of the master and slave designations.  It also gets
 rid of the line between indexing and querying.  Each shard has a replica
 that is designated the leader, but that has no real impact on searching
 and indexing, only on deciding which data to use when replicas get out
 of sync.

 In the old master-slave architecture, you indexed to the master and the
 updated index files were replicated to the slave.  The slave did not
 handle the analysis for indexing, so it was usually better to send
 queries to slaves and let the master only do indexing.

 SolrCloud is very different.  When you index, the documents are indexed
 on all replicas at about the same time.  When you query, the requests
 are load balanced across all replicas.  During normal operation,
 SolrCloud does not use replication at all.  The replication feature is
 only used when a replica gets out of sync with the leader, and in that
 case, the entire index is replicated.

 Thanks,
 Shawn




Re: Solr 4.2 Cloud Replication Replica has higher version than Master?

2013-04-03 Thread Mark Miller
If you don't specify numShards after 4.1, you get an implicit doc router and 
it's up to you to distribute updates. In the past, partitioning was done on the 
fly - but for shard splitting and perhaps other features, we now divvy up the 
hash range up front based on numShards and store it in ZooKeeper. No numShards 
is now how you take complete control of updates yourself.

- Mark

On Apr 3, 2013, at 2:57 PM, Jamie Johnson jej2...@gmail.com wrote:

 The router says "implicit".  I did start from a blank zk state, but perhaps
 I missed one of the ZkCLI commands?  One of my shards from the
 clusterstate.json is shown below.  What is the process that should be done
 to bootstrap a cluster other than the ZkCLI commands I listed above?  My
 process right now is run those ZkCLI commands and then start solr on all of
 the instances with a command like this
 
 java -server -Dshard=shard5 -DcoreName=shard5-core1
 -Dsolr.data.dir=/solr/data/shard5-core1 -Dcollection.configName=solr-conf
 -Dcollection=collection1 -DzkHost=so-zoo1:2181,so-zoo2:2181,so-zoo3:2181
 -Djetty.port=7575 -DhostPort=7575 -jar start.jar
 
 I feel like maybe I'm missing a step.
 
 "shard5":{
     "state":"active",
     "replicas":{
       "10.38.33.16:7575_solr_shard5-core1":{
         "shard":"shard5",
         "state":"active",
         "core":"shard5-core1",
         "collection":"collection1",
         "node_name":"10.38.33.16:7575_solr",
         "base_url":"http://10.38.33.16:7575/solr",
         "leader":"true"},
       "10.38.33.17:7577_solr_shard5-core2":{
         "shard":"shard5",
         "state":"recovering",
         "core":"shard5-core2",
         "collection":"collection1",
         "node_name":"10.38.33.17:7577_solr",
         "base_url":"http://10.38.33.17:7577/solr"}}}
 
 

Re: HTML entities being missed by DIH HTMLStripTransformer

2013-04-03 Thread Ashok
Well, the database field has text, sometimes with HTML entities and at
other times with HTML tags. I have no control over the process that
populates the database tables with info.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/HTML-entities-being-missed-by-DIH-HTMLStripTransformer-tp4053582p4053586.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr 4.2 Cloud Replication Replica has higher version than Master?

2013-04-03 Thread Jamie Johnson
ah interesting... so I need to specify numShards, blow out zk, and then try
this again to see if things work properly now.  What is really strange is
that for the most part things have worked right, and on 4.2.1 I have 600,000
items indexed with no duplicates.  In any event I will specify numShards,
clear out zk, and begin again.  If this works properly, what should the
router type be?


On Wed, Apr 3, 2013 at 3:14 PM, Mark Miller markrmil...@gmail.com wrote:

 If you don't specify numShards after 4.1, you get an implicit doc router
 and it's up to you to distribute updates. In the past, partitioning was
 done on the fly - but for shard splitting and perhaps other features, we
 now divvy up the hash range up front based on numShards and store it in
 ZooKeeper. No numShards is now how you take complete control of updates
 yourself.

 - Mark


Re: SolrCloud not distributing documents across shards

2013-04-03 Thread Michael Della Bitta
With earlier versions of SolrCloud, if there was any error or warning
when you made a collection, you likely were set up for "implicit"
routing, which means that documents only go to the shard you're talking
to. What you want is "compositeId" routing, which works how you think
it should.

Go into the cloud GUI and look at clusterstate.json in the Tree tab.
You should see the routing algorithm it's using in that file.

Michael Della Bitta


Appinions
18 East 41st Street, 2nd Floor
New York, NY 10017-6271

www.appinions.com

Where Influence Isn’t a Game


On Wed, Apr 3, 2013 at 2:59 PM, vsilgalis vsilga...@gmail.com wrote:
 Chris Hostetter-3 wrote
 I'm not familiar with the details, but i've seen miller respond to a
 similar question with reference to the issue of not explicitly specifying
 numShards when creating your collections...

  http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201303.mbox/%3C0AA0B422-F1DE-4915-B602-53CB1849204A@gmail.com%3E


 -Hoss

 Well theoretically we are okay there.

 The commands we run to create our collection are as follow (note the
 numShards being specified):
 http://server01/solr/admin/cores?action=CREATE&name=classic_bt&collection=classic_bt&numShards=2&instanceDir=instances/basistech&dataDir=/opt/index/classic_bt&config=solrconfig.xml&schema=schema.xml&collection.configName=classic_bt

 http://server02/solr/admin/cores?action=CREATE&name=classic_bt&collection=classic_bt&numShards=2&instanceDir=instances/basistech&dataDir=/opt/index/classic_bt&config=solrconfig.xml&schema=schema.xml&collection.configName=classic_bt

 http://server03/solr/admin/cores?action=CREATE&name=classic_bt_shard1&collection=classic_bt&numShards=2&instanceDir=instances/basistech&dataDir=/opt/index/classic_bt_shard1&config=solrconfig.xml&schema=schema.xml&collection.configName=classic_bt&shard=shard1

 http://server03/solr/admin/cores?action=CREATE&name=classic_bt_shard2&collection=classic_bt&numShards=2&instanceDir=instances/basistech&dataDir=/opt/index/classic_bt_shard2&config=solrconfig.xml&schema=schema.xml&collection.configName=classic_bt&shard=shard2






Re: HTML entities being missed by DIH HTMLStripTransformer

2013-04-03 Thread Alexandre Rafalovitch
Then, I would say, you have a bigger problem...

However, you can probably run a RegEx filter and replace those known escapes
with real characters before you run your HTMLStrip filter. Or run HTMLStrip,
RegEx, and HTMLStrip again.

Regards,
   Alex.

Personal blog: http://blog.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at
once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)


On Wed, Apr 3, 2013 at 3:19 PM, Ashok ash...@qualcomm.com wrote:

 Well, the database field has text,  sometimes with HTML entities and at
 other
 times with html tags. I have no control over the process that populates the
 database tables with info.



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/HTML-entities-being-missed-by-DIH-HTMLStripTransformer-tp4053582p4053586.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Filtering Search Cloud

2013-04-03 Thread Shawn Heisey
On 4/3/2013 1:13 PM, Furkan KAMACI wrote:
 Shawn, thanks for your detailed explanation. My system will run under high
 load: I will always be indexing something, and something will always be
 getting queried. That is why I am considering physically separating the
 indexing machines from the query-serving machines. Imagine a machine that
 is both indexing (one kind of disk IO - I don't know whether the underlying
 Solr IO is sequential) and trying to answer queries (another kind of IO);
 that is my main challenge in deciding whether to separate them. The next
 step is: if I separate them, can I filter the data of the indexer machines
 before it is returned in responses? (I don't have any filtering issues
 right now; I just think that maybe I will need it in the future.)

We do seem to have a language barrier, so let me try to be very clear:
If you use SolrCloud, you can't separate querying and indexing.  You
will have to use the master-slave replication that has been part of Solr
since at least 1.4, possibly earlier.
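
For reference, that split is configured through the ReplicationHandler,
roughly like this (a sketch, not a complete config):

On the master:

  <requestHandler name="/replication" class="solr.ReplicationHandler">
    <lst name="master">
      <str name="replicateAfter">commit</str>
      <str name="confFiles">schema.xml,stopwords.txt</str>
    </lst>
  </requestHandler>

On each slave:

  <requestHandler name="/replication" class="solr.ReplicationHandler">
    <lst name="slave">
      <str name="masterUrl">http://master-host:8983/solr/core1</str>
      <str name="pollInterval">00:00:60</str>
    </lst>
  </requestHandler>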

Thanks,
Shawn



Re: HTML entities being missed by DIH HTMLStripTransformer

2013-04-03 Thread Steve Rowe
Hi Ashok,

HTMLStripTransformer uses HTMLStripCharFilter under the hood, and 
HTMLStripCharFilter converts all HTML entities to their corresponding 
characters.

What version of Solr are you using?

My guess is that it only appears that nothing is happening, since when they are 
presented in a browser, they show up as the characters the entities represent.

I think (never done this myself) that if you apply the HTMLStripTransformer 
twice, it will first convert the entities to characters, and then on the second 
pass, remove the HTML constructs.

From http://wiki.apache.org/solr/DataImportHandler#Transformer:

-
The entity transformer attribute can consist of a comma separated list of
transformers (say transformer="foo.X,foo.Y"). The transformers are chained in
this case and they are applied one after the other in the order in which they
are specified. What this means is that after the fields are fetched from the
datasource, the list of entity columns are processed one at a time in the
order listed inside the entity tag and scanned by the first transformer to
see if any of that transformer's attributes are present. If so the
transformer does its thing! When all of the listed entity columns have been
scanned the process is repeated using the next transformer in the list.
-
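
So something like this might work (an untested sketch - the entity and
column names are invented):

  <entity name="doc" query="select id, description from docs"
          transformer="HTMLStripTransformer,HTMLStripTransformer">
    <field column="description" stripHTML="true"/>
  </entity>

The first pass should turn &lt;b&gt; into <b>, and the second pass should
strip the resulting tags.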

Steve

On Apr 3, 2013, at 3:30 PM, Alexandre Rafalovitch arafa...@gmail.com wrote:

 Then, I would say, you have a bigger problem
 
 However, you can probably run RegEx filter and replace those known escapes
 with real characters before you run your HTMLStrip filter. Or run,
 HTMLStrip, RegEx and HTMLStrip again.
 
 Regards,
   Alex.
 
 Personal blog: http://blog.outerthoughts.com/
 LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
 - Time is the quality of nature that keeps events from happening all at
 once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)
 
 
 On Wed, Apr 3, 2013 at 3:19 PM, Ashok ash...@qualcomm.com wrote:
 
 Well, the database field has text,  sometimes with HTML entities and at
 other
 times with html tags. I have no control over the process that populates the
 database tables with info.




Re: SolrCloud not distributing documents across shards

2013-04-03 Thread vsilgalis
Michael Della Bitta-2 wrote
 With earlier versions of Solr Cloud, if there was any error or warning
 when you made a collection, you likely were set up for implicit
 routing which means that documents only go to the shard you're talking
 to. What you want is compositeId routing, which works how you think
 it should.
 
 Go into the cloud GUI and look at clusterstate.json in the Tree tab.
 You should see the routing algorithm it's using in that file.
 
 Michael Della Bitta

That sounds like my huckleberry.

 "router":"implicit"

is in the collection info in the clusterstate.json.

How do I fix this? Just wipe the clusterstate.json?

Thanks for your help.





Re: Solr 4.2 Cloud Replication Replica has higher version than Master?

2013-04-03 Thread Jamie Johnson
answered my own question, it now says "compositeId".  What is problematic
though is that in addition to my shards (which are named, say, jamie-shard1)
I see the Solr-created shards (shard1).  I assume that these were created
because of the numShards param.  Is there no way to specify the names of
these shards?


On Wed, Apr 3, 2013 at 3:25 PM, Jamie Johnson jej2...@gmail.com wrote:

 ah interesting... so I need to specify numShards, blow out zk, and then
 try this again to see if things work properly now.  What is really strange
 is that for the most part things have worked right, and on 4.2.1 I have
 600,000 items indexed with no duplicates.  In any event I will specify
 numShards, clear out zk, and begin again.  If this works properly, what
 should the router type be?



Re: SolrCloud not distributing documents across shards

2013-04-03 Thread Michael Della Bitta
If you can work with a clean state, I'd turn off all your shards,
clear out the Solr directories in Zookeeper, reset solr.xml for each
of your shards, upgrade to the latest version of Solr, and turn
everything back on again. Then upload config, recreate your
collection, etc.

I do it like this, but YMMV:

curl "http://localhost:8080/solr/admin/collections?action=CREATE&name=$name&numShards=$num&collection.configName=$config-name"
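
For the clearing-out-Zookeeper step, the same ZkCLI tool used for upconfig
has a clear command. A sketch, assuming a /solr chroot (adjust the path and
zkhost to your setup):

  zkcli.sh -zkhost localhost:2181 -cmd clear /solr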


Michael Della Bitta


Appinions
18 East 41st Street, 2nd Floor
New York, NY 10017-6271

www.appinions.com

Where Influence Isn’t a Game


On Wed, Apr 3, 2013 at 3:40 PM, vsilgalis vsilga...@gmail.com wrote:
 Michael Della Bitta-2 wrote
 With earlier versions of Solr Cloud, if there was any error or warning
 when you made a collection, you likely were set up for implicit
 routing which means that documents only go to the shard you're talking
 to. What you want is compositeId routing, which works how you think
 it should.

 Go into the cloud GUI and look at clusterstate.json in the Tree tab.
 You should see the routing algorithm it's using in that file.

 Michael Della Bitta

 That sounds like my huckleberry.

  "router":"implicit"

 Is in the collection info in the clusterstate.json

 How do I fix this? Just wipe the clusterstate.json?

 Thanks for your help.





Re: Filtering Search Cloud

2013-04-03 Thread Furkan KAMACI
Thanks for your explanation; you explained everything I need. Just one more
question. I see that I cannot do it with SolrCloud, but I can do something
like that with Solr's master-slave replication. If I use master-slave
replication, can I eliminate (filter) something that was indexed on the
master from being returned in a response when querying the slaves?


2013/4/3 Shawn Heisey s...@elyograg.org

 On 4/3/2013 1:13 PM, Furkan KAMACI wrote:
  Shawn, thanks for your detailed explanation. My system will work under
  high load: I will always be indexing something, and something will always
  be getting queried. That is why I am considering physically separating
  the indexer and query machines. Imagine a machine that is both indexing
  (one kind of disk IO; I don't know the internals, maybe Solr does
  sequential IO) and trying to answer queries (another kind of IO). That is
  my main reason for considering separating them. The next step is: if I
  separate them, can I filter the data of the indexer machines out of the
  response (I don't have any filtering issues right now, I just think I
  may need it in the future)?

 We do seem to have a language barrier, so let me try to be very clear:
 If you use SolrCloud, you can't separate querying and indexing.  You
 will have to use the master-slave replication that has been part of Solr
 since at least 1.4, possibly earlier.

 Thanks,
 Shawn




Re: Solr 4.2 Cloud Replication Replica has higher version than Master?

2013-04-03 Thread Mark Miller
I had thought you could - but looking at the code recently, I don't think you 
can anymore. I think that's a technical limitation more than anything though. 
When these changes were made, I think support for that was simply not added at 
the time.

I'm not sure exactly how straightforward it would be, but it seems doable - as 
it is, the overseer will preallocate shards when first creating the collection 
- that's when they get named shard(n). There would have to be logic to replace 
shard(n) with the custom shard name when the core actually registers.
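
For reference, once the overseer has preallocated the shards with
compositeId routing, clusterstate.json carries each shard's hash range up
front. A sketch of a healthy entry with numShards=2; the exact layout
varies a little across 4.x versions:

  "collection1":{
    "shards":{
      "shard1":{
        "range":"80000000-ffffffff",
        "state":"active",
        "replicas":{...}},
      "shard2":{
        "range":"0-7fffffff",
        "state":"active",
        "replicas":{...}}},
    "router":"compositeId"}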

- Mark

On Apr 3, 2013, at 3:42 PM, Jamie Johnson jej2...@gmail.com wrote:

 answered my own question, it now says compositeId.  What is problematic
 though is that in addition to my shards (which are say jamie-shard1) I see
 the solr created shards (shard1).  I assume that these were created because
 of the numShards param.  Is there no way to specify the names of these
 shards?
 
 
 On Wed, Apr 3, 2013 at 3:25 PM, Jamie Johnson jej2...@gmail.com wrote:
 
  ah interesting... so I need to specify num shards, blow out zk and then
 try this again to see if things work properly now.  What is really strange
 is that for the most part things have worked right and on 4.2.1 I have
 600,000 items indexed with no duplicates.  In any event I will specify num
 shards clear out zk and begin again.  If this works properly what should
 the router type be?
 
 
 On Wed, Apr 3, 2013 at 3:14 PM, Mark Miller markrmil...@gmail.com wrote:
 
 If you don't specify numShards after 4.1, you get an implicit doc router
 and it's up to you to distribute updates. In the past, partitioning was
 done on the fly - but for shard splitting and perhaps other features, we
 now divvy up the hash range up front based on numShards and store it in
 ZooKeeper. No numShards is now how you take complete control of updates
 yourself.
 
 - Mark
 
 On Apr 3, 2013, at 2:57 PM, Jamie Johnson jej2...@gmail.com wrote:
 
 The router says implicit.  I did start from a blank zk state but
 perhaps
 I missed one of the ZkCLI commands?  One of my shards from the
 clusterstate.json is shown below.  What is the process that should be
 done
 to bootstrap a cluster other than the ZkCLI commands I listed above?  My
 process right now is run those ZkCLI commands and then start solr on
 all of
 the instances with a command like this
 
 java -server -Dshard=shard5 -DcoreName=shard5-core1
 -Dsolr.data.dir=/solr/data/shard5-core1
 -Dcollection.configName=solr-conf
 -Dcollection=collection1 -DzkHost=so-zoo1:2181,so-zoo2:2181,so-zoo3:2181
 -Djetty.port=7575 -DhostPort=7575 -jar start.jar
 
 I feel like maybe I'm missing a step.
 
  "shard5":{
    "state":"active",
    "replicas":{
      "10.38.33.16:7575_solr_shard5-core1":{
        "shard":"shard5",
        "state":"active",
        "core":"shard5-core1",
        "collection":"collection1",
        "node_name":"10.38.33.16:7575_solr",
        "base_url":"http://10.38.33.16:7575/solr",
        "leader":"true"},
      "10.38.33.17:7577_solr_shard5-core2":{
        "shard":"shard5",
        "state":"recovering",
        "core":"shard5-core2",
        "collection":"collection1",
        "node_name":"10.38.33.17:7577_solr",
        "base_url":"http://10.38.33.17:7577/solr"}}}
 
 
 On Wed, Apr 3, 2013 at 2:40 PM, Mark Miller markrmil...@gmail.com
 wrote:
 
 It should be part of your clusterstate.json. Some users have reported
 trouble upgrading a previous zk install when this change came. I
 recommended manually updating the clusterstate.json to have the right
 info,
 and that seemed to work. Otherwise, I guess you have to start from a
 clean
 zk state.
 
 If you don't have that range information, I think there will be
 trouble.
  Do you have a router type defined in the clusterstate.json?
 
 - Mark
 
 On Apr 3, 2013, at 2:24 PM, Jamie Johnson jej2...@gmail.com wrote:
 
 Where is this information stored in ZK?  I don't see it in the cluster
 state (or perhaps I don't understand it ;) ).
 
 Perhaps something with my process is broken.  What I do when I start
 from
 scratch is the following
 
 ZkCLI -cmd upconfig ...
 ZkCLI -cmd linkconfig 
 
 but I don't ever explicitly create the collection.  What should the
 steps
 from scratch be?  I am moving from an unreleased snapshot of 4.0 so I
 never
 did that previously either so perhaps I did create the collection in
 one
 of
 my steps to get this working but have forgotten it along the way.
 
 
 On Wed, Apr 3, 2013 at 2:16 PM, Mark Miller markrmil...@gmail.com
 wrote:
 
 Thanks for digging Jamie. In 4.2, hash ranges are assigned up front
 when a
 collection is created - each shard gets a range, which is stored in
 zookeeper. You should not be able to end up with the same id on
 different
 shards - something very odd going on.
 
 Hopefully I'll have some time to try and help you reproduce. Ideally
 we
 can capture it in a test case.
 
 - Mark
 
 On Apr 3, 2013, at 1:13 PM, Jamie Johnson jej2...@gmail.com wrote:
 
 no, my thought was wrong, it 

Re: Solr 4.2 Cloud Replication Replica has higher version than Master?

2013-04-03 Thread Jamie Johnson
ok, so that's not a deal breaker for me.  I just changed it to match the
shards that are auto created and it looks like things are happy.  I'll go
ahead and try my test to see if I can get things out of sync.


On Wed, Apr 3, 2013 at 3:56 PM, Mark Miller markrmil...@gmail.com wrote:

 I had thought you could - but looking at the code recently, I don't think
 you can anymore. I think that's a technical limitation more than anything
 though. When these changes were made, I think support for that was simply
 not added at the time.

 I'm not sure exactly how straightforward it would be, but it seems doable
 - as it is, the overseer will preallocate shards when first creating the
 collection - that's when they get named shard(n). There would have to be
 logic to replace shard(n) with the custom shard name when the core actually
 registers.

 - Mark

 On Apr 3, 2013, at 3:42 PM, Jamie Johnson jej2...@gmail.com wrote:

  answered my own question, it now says compositeId.  What is problematic
  though is that in addition to my shards (which are say jamie-shard1) I
 see
  the solr created shards (shard1).  I assume that these were created
 because
  of the numShards param.  Is there no way to specify the names of these
  shards?
 
 
  On Wed, Apr 3, 2013 at 3:25 PM, Jamie Johnson jej2...@gmail.com wrote:
 
  ah interesting... so I need to specify num shards, blow out zk and then
  try this again to see if things work properly now.  What is really
 strange
  is that for the most part things have worked right and on 4.2.1 I have
  600,000 items indexed with no duplicates.  In any event I will specify
 num
  shards clear out zk and begin again.  If this works properly what should
  the router type be?
 
 
  On Wed, Apr 3, 2013 at 3:14 PM, Mark Miller markrmil...@gmail.com
 wrote:
 
  If you don't specify numShards after 4.1, you get an implicit doc
 router
  and it's up to you to distribute updates. In the past, partitioning was
  done on the fly - but for shard splitting and perhaps other features,
 we
  now divvy up the hash range up front based on numShards and store it in
  ZooKeeper. No numShards is now how you take complete control of updates
  yourself.
 
  - Mark
 
  On Apr 3, 2013, at 2:57 PM, Jamie Johnson jej2...@gmail.com wrote:
 
  The router says implicit.  I did start from a blank zk state but
  perhaps
  I missed one of the ZkCLI commands?  One of my shards from the
  clusterstate.json is shown below.  What is the process that should be
  done
  to bootstrap a cluster other than the ZkCLI commands I listed above?
  My
  process right now is run those ZkCLI commands and then start solr on
  all of
  the instances with a command like this
 
  java -server -Dshard=shard5 -DcoreName=shard5-core1
  -Dsolr.data.dir=/solr/data/shard5-core1
  -Dcollection.configName=solr-conf
  -Dcollection=collection1
 -DzkHost=so-zoo1:2181,so-zoo2:2181,so-zoo3:2181
  -Djetty.port=7575 -DhostPort=7575 -jar start.jar
 
  I feel like maybe I'm missing a step.
 
  "shard5":{
    "state":"active",
    "replicas":{
      "10.38.33.16:7575_solr_shard5-core1":{
        "shard":"shard5",
        "state":"active",
        "core":"shard5-core1",
        "collection":"collection1",
        "node_name":"10.38.33.16:7575_solr",
        "base_url":"http://10.38.33.16:7575/solr",
        "leader":"true"},
      "10.38.33.17:7577_solr_shard5-core2":{
        "shard":"shard5",
        "state":"recovering",
        "core":"shard5-core2",
        "collection":"collection1",
        "node_name":"10.38.33.17:7577_solr",
        "base_url":"http://10.38.33.17:7577/solr"}}}
 
 
  On Wed, Apr 3, 2013 at 2:40 PM, Mark Miller markrmil...@gmail.com
  wrote:
 
  It should be part of your clusterstate.json. Some users have reported
  trouble upgrading a previous zk install when this change came. I
  recommended manually updating the clusterstate.json to have the right
  info,
  and that seemed to work. Otherwise, I guess you have to start from a
  clean
  zk state.
 
  If you don't have that range information, I think there will be
  trouble.
   Do you have a router type defined in the clusterstate.json?
 
  - Mark
 
  On Apr 3, 2013, at 2:24 PM, Jamie Johnson jej2...@gmail.com wrote:
 
  Where is this information stored in ZK?  I don't see it in the
 cluster
  state (or perhaps I don't understand it ;) ).
 
  Perhaps something with my process is broken.  What I do when I start
  from
  scratch is the following
 
  ZkCLI -cmd upconfig ...
  ZkCLI -cmd linkconfig 
 
  but I don't ever explicitly create the collection.  What should the
  steps
  from scratch be?  I am moving from an unreleased snapshot of 4.0 so
 I
  never
  did that previously either so perhaps I did create the collection in
  one
  of
  my steps to get this working but have forgotten it along the way.
 
 
  On Wed, Apr 3, 2013 at 2:16 PM, Mark Miller markrmil...@gmail.com
  wrote:
 
  Thanks for digging Jamie. In 4.2, hash ranges are assigned up front
  when a
  collection is 

Re: HTML entities being missed by DIH HTMLStripTransformer

2013-04-03 Thread Ashok
Hi Steve,

Fabulous suggestion! Yup, that is it! Using the HTMLStripTransformer twice
did the trick. I am using Solr 4.1.

Thank you very much!

- ashok





do SearchComponents have access to response contents

2013-04-03 Thread xavier jmlucjav
I need to implement a SearchComponent that will deal with metrics on the
response. Some things will be easy to get, like the number of hits, but I am
more worried about this:

We also need to track the size of the response (the size in bytes of the
whole XML response that is streamed, with stored fields and all). I am
wondering whether a SearchComponent will actually have access to the
response bytes...

Can someone confirm one way or the other? We are targeting Solr 4.0.

thanks
xavier


Re: Solr 4.2 Cloud Replication Replica has higher version than Master?

2013-04-03 Thread Jamie Johnson
with these changes things are looking good, I'm up to 600,000 documents
without any issues as of right now.  I'll keep going and add more to see if
I find anything.


On Wed, Apr 3, 2013 at 4:01 PM, Jamie Johnson jej2...@gmail.com wrote:

 ok, so that's not a deal breaker for me.  I just changed it to match the
 shards that are auto created and it looks like things are happy.  I'll go
 ahead and try my test to see if I can get things out of sync.


 On Wed, Apr 3, 2013 at 3:56 PM, Mark Miller markrmil...@gmail.com wrote:

 I had thought you could - but looking at the code recently, I don't think
 you can anymore. I think that's a technical limitation more than anything
 though. When these changes were made, I think support for that was simply
 not added at the time.

 I'm not sure exactly how straightforward it would be, but it seems doable
 - as it is, the overseer will preallocate shards when first creating the
 collection - that's when they get named shard(n). There would have to be
 logic to replace shard(n) with the custom shard name when the core actually
 registers.

 - Mark

 On Apr 3, 2013, at 3:42 PM, Jamie Johnson jej2...@gmail.com wrote:

  answered my own question, it now says compositeId.  What is problematic
  though is that in addition to my shards (which are say jamie-shard1) I
 see
  the solr created shards (shard1).  I assume that these were created
 because
  of the numShards param.  Is there no way to specify the names of these
  shards?
 
 
  On Wed, Apr 3, 2013 at 3:25 PM, Jamie Johnson jej2...@gmail.com
 wrote:
 
  ah interesting... so I need to specify num shards, blow out zk and then
  try this again to see if things work properly now.  What is really
 strange
  is that for the most part things have worked right and on 4.2.1 I have
  600,000 items indexed with no duplicates.  In any event I will specify
 num
  shards clear out zk and begin again.  If this works properly what
 should
  the router type be?
 
 
  On Wed, Apr 3, 2013 at 3:14 PM, Mark Miller markrmil...@gmail.com
 wrote:
 
  If you don't specify numShards after 4.1, you get an implicit doc
 router
  and it's up to you to distribute updates. In the past, partitioning
 was
  done on the fly - but for shard splitting and perhaps other features,
 we
  now divvy up the hash range up front based on numShards and store it
 in
  ZooKeeper. No numShards is now how you take complete control of
 updates
  yourself.
 
  - Mark
 
  On Apr 3, 2013, at 2:57 PM, Jamie Johnson jej2...@gmail.com wrote:
 
  The router says implicit.  I did start from a blank zk state but
  perhaps
  I missed one of the ZkCLI commands?  One of my shards from the
  clusterstate.json is shown below.  What is the process that should be
  done
  to bootstrap a cluster other than the ZkCLI commands I listed above?
  My
  process right now is run those ZkCLI commands and then start solr on
  all of
  the instances with a command like this
 
  java -server -Dshard=shard5 -DcoreName=shard5-core1
  -Dsolr.data.dir=/solr/data/shard5-core1
  -Dcollection.configName=solr-conf
  -Dcollection=collection1
 -DzkHost=so-zoo1:2181,so-zoo2:2181,so-zoo3:2181
  -Djetty.port=7575 -DhostPort=7575 -jar start.jar
 
  I feel like maybe I'm missing a step.
 
  "shard5":{
    "state":"active",
    "replicas":{
      "10.38.33.16:7575_solr_shard5-core1":{
        "shard":"shard5",
        "state":"active",
        "core":"shard5-core1",
        "collection":"collection1",
        "node_name":"10.38.33.16:7575_solr",
        "base_url":"http://10.38.33.16:7575/solr",
        "leader":"true"},
      "10.38.33.17:7577_solr_shard5-core2":{
        "shard":"shard5",
        "state":"recovering",
        "core":"shard5-core2",
        "collection":"collection1",
        "node_name":"10.38.33.17:7577_solr",
        "base_url":"http://10.38.33.17:7577/solr"}}}
 
 
  On Wed, Apr 3, 2013 at 2:40 PM, Mark Miller markrmil...@gmail.com
  wrote:
 
  It should be part of your clusterstate.json. Some users have
 reported
  trouble upgrading a previous zk install when this change came. I
  recommended manually updating the clusterstate.json to have the
 right
  info,
  and that seemed to work. Otherwise, I guess you have to start from a
  clean
  zk state.
 
  If you don't have that range information, I think there will be
  trouble.
   Do you have a router type defined in the clusterstate.json?
 
  - Mark
 
  On Apr 3, 2013, at 2:24 PM, Jamie Johnson jej2...@gmail.com
 wrote:
 
  Where is this information stored in ZK?  I don't see it in the
 cluster
  state (or perhaps I don't understand it ;) ).
 
  Perhaps something with my process is broken.  What I do when I
 start
  from
  scratch is the following
 
  ZkCLI -cmd upconfig ...
  ZkCLI -cmd linkconfig 
 
  but I don't ever explicitly create the collection.  What should the
  steps
  from scratch be?  I am moving from an unreleased snapshot of 4.0
 so I
  never
  did that previously either so perhaps I did create the collection
 in
  one

Re: HTML entities being missed by DIH HTMLStripTransformer

2013-04-03 Thread Steve Rowe
Cool, glad I was able to help.

On Apr 3, 2013, at 4:18 PM, Ashok ash...@qualcomm.com wrote:

 Hi Steve,
 
 Fabulous suggestion! Yup, that is it! Using the HTMLStripTransformer twice
 did the trick. I am using Solr 4.1.
 
 Thank you very much!
 
 - ashok
 
 
 



Re: SolrCloud not distributing documents across shards

2013-04-03 Thread vsilgalis
Michael Della Bitta-2 wrote
 If you can work with a clean state, I'd turn off all your shards,
 clear out the Solr directories in Zookeeper, reset solr.xml for each
 of your shards, upgrade to the latest version of Solr, and turn
 everything back on again. Then upload config, recreate your
 collection, etc.
 
 I do it like this, but YMMV:
 
 curl "http://localhost:8080/solr/admin/collections?action=CREATE&name=$name&numShards=$num&collection.configName=$config-name"
 
 
 Michael Della Bitta


Looks like that was the problem.  Thanks, much appreciated.

Is there any insight into specifically what I should look into for
preventing this in the future?





Re: Question on Exact Matches - edismax

2013-04-03 Thread Jan Høydahl
Can you show us your *_ci field type? Solr does not really have a way to tell 
whether a match is exact or only partial, but you could hack around it with 
the fieldType. See https://github.com/cominvent/exactmatch for a possible 
solution.
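
A minimal sketch of that fieldType hack (the name is illustrative): keep the
whole field value as a single lowercased token, so only a full,
case-insensitive match can hit the field:

  <fieldType name="string_ci" class="solr.TextField" omitNorms="true">
    <analyzer>
      <!-- keep the entire value as one token, lowercased -->
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

copyField the title into a field of this type and boost it above the
tokenized fields; exact matches then float to the top.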

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

On 3 Apr 2013, at 15:55, Sandeep Mestry sanmes...@gmail.com wrote:

 Hi All,
 
 I have a requirement wherein exact matches for 2 fields (Series Title,
 Title) should be ranked higher than partial matches. The configuration
 looks like this:
 
 <requestHandler name="assetdismax" class="solr.SearchHandler">
   <lst name="defaults">
     <str name="defType">edismax</str>
     <str name="echoParams">explicit</str>
     <float name="tie">0.01</float>
     <str name="qf">pg_series_title_ci^500 title_ci^300
       pg_series_title^200 title^25 classifications^15 classifications_texts^15
       parent_classifications^10 synonym_classifications^5 pg_brand_title^5
       pg_series_working_title^5 p_programme_title^5 p_item_title^5
       p_interstitial_title^5 description^15 pg_series_description annotations^0.1
       classification_notes^0.05 pv_program_version_number^2
       pv_program_version_number_ci^2 pv_program_number^2 pv_program_number_ci^2
       p_program_number^2 ma_version_number^2 ma_recording_location
       ma_contributions^0.001 rel_pg_series_title rel_programme_title
       rel_programme_number rel_programme_number_ci pg_uuid^0.5 p_uuid^0.5
       pv_uuid^0.5 ma_uuid^0.5</str>
     <str name="pf">pg_series_title_ci^500 title_ci^500</str>
     <int name="ps">0</int>
     <str name="q.alt">*:*</str>
     <str name="mm">100%</str>
     <str name="q.op">AND</str>
     <str name="facet">true</str>
     <str name="facet.limit">-1</str>
     <str name="facet.mincount">1</str>
   </lst>
 </requestHandler>
 
 As you can see above, the search is against many fields. What I'd want is
 for documents that have exact matches in the series title and title fields
 to rank higher than the rest.
 
 I have added 2 case-insensitive fields (pg_series_title_ci, title_ci) for
 series title and title and have boosted them higher than the tokenized and
 remaining fields. I have also implemented a similarity class to override
 idf; however, I still get documents with partial matches in title and other
 fields ranking higher than an exact match in pg_series_title_ci.
 
 Many Thanks,
 Sandeep



Re: do SearchComponents have access to response contents

2013-04-03 Thread Jack Krupansky
The search components can see the response as a NamedList, but it is only 
when SolrDispatchFilter calls the QueryResponseWriter that XML or JSON or 
whatever other format (Javabin as well) is generated from the named list for 
final output in an HTTP response.


You probably want a custom query response writer that wraps the XML response 
writer. Then you can generate the XML and do whatever you want with it.


See the QueryResponseWriter class and the queryResponseWriter registration 
in solrconfig.xml.
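
A minimal sketch of such a wrapper, assuming Solr 4.x (the class name and
the println are mine, not part of Solr; hook real metrics where indicated):

  import java.io.IOException;
  import java.io.StringWriter;
  import java.io.Writer;

  import org.apache.solr.common.util.NamedList;
  import org.apache.solr.request.SolrQueryRequest;
  import org.apache.solr.response.QueryResponseWriter;
  import org.apache.solr.response.SolrQueryResponse;
  import org.apache.solr.response.XMLResponseWriter;

  public class SizeTrackingXmlWriter implements QueryResponseWriter {

    private final XMLResponseWriter delegate = new XMLResponseWriter();

    public void init(NamedList args) {}

    public String getContentType(SolrQueryRequest req, SolrQueryResponse rsp) {
      return delegate.getContentType(req, rsp);
    }

    public void write(Writer writer, SolrQueryRequest req, SolrQueryResponse rsp)
        throws IOException {
      // Render the whole response into a buffer first so it can be measured.
      StringWriter buffer = new StringWriter();
      delegate.write(buffer, req, rsp);
      String xml = buffer.toString();
      // Hook your metrics here; size in bytes depends on the wire encoding.
      System.out.println("response bytes (UTF-8): " + xml.getBytes("UTF-8").length);
      writer.write(xml);
    }
  }

Register it with a queryResponseWriter name="xml" entry in solrconfig.xml so
it replaces the stock XML writer. Note it buffers the entire response in
memory, which is fine for measuring but worth watching with very large
responses.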

-- Jack Krupansky

-Original Message- 
From: xavier jmlucjav

Sent: Wednesday, April 03, 2013 4:22 PM
To: solr-user@lucene.apache.org
Subject: do SearchComponents have access to response contents

I need to implement a SearchComponent that will deal with metrics on the
response. Some things will be easy to get, like the number of hits, but I am
more worried about this:

We also need to track the size of the response (the size in bytes of the
whole XML response that is streamed, with stored fields and all). I am
wondering whether a SearchComponent will actually have access to the
response bytes...

Can someone confirm one way or the other? We are targeting Solr 4.0.

thanks
xavier 



Re: Solr Tika Override

2013-04-03 Thread Jan Høydahl
You'd probably want to work on the XML output from Tika's PDF parser, from 
which you can identify which page and context.

Personally I would build a separate indexing application in Java and call Tika 
directly, then build a SolrInputDocument which you pass to solr through SolrJ. 
I.e. not use ExtractingRequestHandler, but put all this logic on the client 
side. This scales better, you can handle weird parsing errors and OOM 
situations better and you have full control of how to deal with the XML output 
from various file formats, and what metadata to pass on into the Solr document. 

This is possible with a customized ExtractingHandler too, but it will be uglier 
and harder to test. With a standalone indexer application you can write unit 
tests for all the special parsing requirements. See http://tika.apache.org for 
more.
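
A bare-bones sketch of such a standalone indexer (field names, the id scheme
and the cover-page logic are placeholders, not a recommendation):

  import java.io.InputStream;
  import java.nio.file.Files;
  import java.nio.file.Paths;

  import org.apache.solr.client.solrj.impl.HttpSolrServer;
  import org.apache.solr.common.SolrInputDocument;
  import org.apache.tika.metadata.Metadata;
  import org.apache.tika.parser.AutoDetectParser;
  import org.apache.tika.sax.BodyContentHandler;

  public class StandaloneIndexer {
    public static void main(String[] args) throws Exception {
      HttpSolrServer solr =
          new HttpSolrServer("http://localhost:8983/solr/collection1");

      AutoDetectParser parser = new AutoDetectParser();
      BodyContentHandler text = new BodyContentHandler(-1); // -1 = no write limit
      Metadata metadata = new Metadata();
      try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
        parser.parse(in, text, metadata);
      }

      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", args[0]);
      doc.addField("content", text.toString());
      doc.addField("title", metadata.get("title"));
      // Custom extraction goes here, e.g. a regex over the cover page's
      // text to pull the project start date into its own field.

      solr.add(doc);
      solr.commit();
    }
  }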

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

On 3 Apr 2013, at 20:09, JerryC coss...@vt.edu wrote:

 I am researching Solr and seeing if it would be a good fit for a document
 search service I am helping to develop.  One of the requirements is that we
 will need to be able to customize how file contents are parsed beyond the
 default configurations that are offered out of the box by Tika.  For
 example, we know that we will be indexing .pdf files that will contain a
 cover page with a project start date, and would like to pull this date out
 into a searchable field that is separate from the file content.  I have seen
 several sources saying you can do this by overriding the
 ExtractingRequestHandler.createFactory() method, but I have not been able to
 find much documentation on how to implement a new parser.  Can someone point
 me in the right direction on where to look, or let me know if the scenario I
 described above is even possible?
 
 
 



Re: SolrCloud not distributing documents across shards

2013-04-03 Thread Michael Della Bitta
From what I can tell, the Collections API has been hardened
significantly since 4.2 and now will refuse to create a collection if
you give it something ambiguous to do. So if you upgrade to 4.2,
things will become safer.

But overall I'd find a way of using the Collections API that works and
stick with it.

Michael Della Bitta


Appinions
18 East 41st Street, 2nd Floor
New York, NY 10017-6271

www.appinions.com

Where Influence Isn’t a Game


On Wed, Apr 3, 2013 at 5:01 PM, vsilgalis vsilga...@gmail.com wrote:
 Michael Della Bitta-2 wrote
 If you can work with a clean state, I'd turn off all your shards,
 clear out the Solr directories in Zookeeper, reset solr.xml for each
 of your shards, upgrade to the latest version of Solr, and turn
 everything back on again. Then upload config, recreate your
 collection, etc.

 I do it like this, but YMMV:

 curl "http://localhost:8080/solr/admin/collections?action=CREATE&name=$name&numShards=$num&collection.configName=$config-name"


 Michael Della Bitta


 Looks like that was the problem.  Thanks, much appreciated.

 Is there any insight into specifically what I should look into for
 preventing this in the future?





Re: SolrCloud not distributing documents across shards

2013-04-03 Thread Mark Miller

On Apr 3, 2013, at 5:53 PM, Michael Della Bitta 
michael.della.bi...@appinions.com wrote:

 From what I can tell, the Collections API has been hardened
 significantly since 4.2 

I did a lot of work here for 4.2.1 - there was a lot to improve. Hopefully 
there is much less now, but if anyone finds anything, I'll fix any JIRAs.

- Mark

Re: Filtering Search Cloud

2013-04-03 Thread Shawn Heisey
On 4/3/2013 1:52 PM, Furkan KAMACI wrote:
  Thanks for your explanation; you explained everything I need. Just one
  more question. I see that I cannot do it with SolrCloud, but I can do
  something like that with Solr's master-slave replication. If I use
  master-slave replication, can I eliminate (filter) something that was
  indexed on the master from being returned in a response when querying
  the slaves?

I don't understand the question.  I will attempt to give you more
information, but it might not answer your question.  If not, you'll have
to try to improve your question.

Your master and each of that master's slaves will have the same index as
soon as replication is done.  A query on the slave has no idea that the
master exists.

Thanks,
Shawn



Streaming search results

2013-04-03 Thread Victor Miroshnikov
Is it possible to stream search results from Solr? Seems that this feature is 
missing.

I see two options to solve this: 

1. Use the search-results pagination feature
The idea is to implement a smart proxy that streams chunks from the search
results using pagination.

2. Implement a Solr plugin with a search streaming feature (is that possible
at all?)

The first option is easy to implement and reliable, though I don't know what
the drawbacks are.
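
A SolrJ sketch of option 1 (chunk size and query are placeholders; note that
start/rows paging gets slower as the offset grows, which is the main
drawback):

  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.impl.HttpSolrServer;
  import org.apache.solr.common.SolrDocument;
  import org.apache.solr.common.SolrDocumentList;

  public class PagingStreamer {
    public static void main(String[] args) throws Exception {
      HttpSolrServer solr =
          new HttpSolrServer("http://localhost:8983/solr/collection1");
      int rows = 500;
      for (int start = 0; ; start += rows) {
        SolrQuery q = new SolrQuery("*:*").setStart(start).setRows(rows);
        SolrDocumentList page = solr.query(q).getResults();
        for (SolrDocument doc : page) {
          // hand each chunk to the downstream consumer here
          System.out.println(doc.getFieldValue("id"));
        }
        if (start + rows >= page.getNumFound()) {
          break; // past the last page
        }
      }
    }
  }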

Regards,
Viktor 




Re: Solr metrics in Codahale metrics and Graphite?

2013-04-03 Thread Walter Underwood
That sounds great. I'll check out the bug, I didn't see anything in the docs 
about this. And if I can't find it with a search engine, it probably isn't 
there.  --wunder

On Apr 3, 2013, at 6:39 AM, Shawn Heisey wrote:

 On 3/29/2013 12:07 PM, Walter Underwood wrote:
 What are folks using for this?
 
 I don't know that this really answers your question, but Solr 4.1 and
 later includes a big chunk of codahale metrics internally for request
 handler statistics - see SOLR-1972.  First we tried including the jar
 and using the API, but that created thread leak problems, so the source
 code was added.
 
 Thanks,
 Shawn
 







Re: Solr metrics in Codahale metrics and Graphite?

2013-04-03 Thread Otis Gospodnetic
It's there! :)
http://search-lucene.com/?q=percentilefc_project=Solrfc_type=issue

Otis
--
Solr  ElasticSearch Support
http://sematext.com/





On Wed, Apr 3, 2013 at 6:29 PM, Walter Underwood wun...@wunderwood.org wrote:
 That sounds great. I'll check out the bug, I didn't see anything in the docs 
 about this. And if I can't find it with a search engine, it probably isn't 
 there.  --wunder

 On Apr 3, 2013, at 6:39 AM, Shawn Heisey wrote:

 On 3/29/2013 12:07 PM, Walter Underwood wrote:
 What are folks using for this?

 I don't know that this really answers your question, but Solr 4.1 and
 later includes a big chunk of codahale metrics internally for request
 handler statistics - see SOLR-1972.  First we tried including the jar
 and using the API, but that created thread leak problems, so the source
 code was added.

 Thanks,
 Shawn








Re: Solr metrics in Codahale metrics and Graphite?

2013-04-03 Thread Walter Underwood
In the Jira, but not in the docs. 

It would be nice to have VM stats like GC, too, so we can have common 
monitoring and alerting on all our services.

wunder

On Apr 3, 2013, at 3:31 PM, Otis Gospodnetic wrote:

 It's there! :)
 http://search-lucene.com/?q=percentilefc_project=Solrfc_type=issue
 
 Otis
 --
 Solr  ElasticSearch Support
 http://sematext.com/
 
 On Wed, Apr 3, 2013 at 6:29 PM, Walter Underwood wun...@wunderwood.org 
 wrote:
 That sounds great. I'll check out the bug, I didn't see anything in the docs 
 about this. And if I can't find it with a search engine, it probably isn't 
 there.  --wunder
 
 On Apr 3, 2013, at 6:39 AM, Shawn Heisey wrote:
 
 On 3/29/2013 12:07 PM, Walter Underwood wrote:
 What are folks using for this?
 
 I don't know that this really answers your question, but Solr 4.1 and
 later includes a big chunk of codahale metrics internally for request
 handler statistics - see SOLR-1972.  First we tried including the jar
 and using the API, but that created thread leak problems, so the source
 code was added.
 
 Thanks,
 Shawn
 
 
 
 
 
 

--
Walter Underwood
wun...@wunderwood.org





Re: Solr 4.2 Cloud Replication Replica has higher version than Master?

2013-04-03 Thread Jamie Johnson
just an update, I'm at 1M records now with no issues.  This looks promising
as to the cause of my issues, thanks for the help.  Is the routing method
with numShards documented anywhere?  I know numShards is documented but I
didn't know that the routing changed if you don't specify it.


On Wed, Apr 3, 2013 at 4:44 PM, Jamie Johnson jej2...@gmail.com wrote:

 with these changes things are looking good, I'm up to 600,000 documents
 without any issues as of right now.  I'll keep going and add more to see if
 I find anything.


 On Wed, Apr 3, 2013 at 4:01 PM, Jamie Johnson jej2...@gmail.com wrote:

 ok, so that's not a deal breaker for me.  I just changed it to match the
 shards that are auto created and it looks like things are happy.  I'll go
 ahead and try my test to see if I can get things out of sync.


 On Wed, Apr 3, 2013 at 3:56 PM, Mark Miller markrmil...@gmail.comwrote:

 I had thought you could - but looking at the code recently, I don't
 think you can anymore. I think that's a technical limitation more than
 anything though. When these changes were made, I think support for that was
 simply not added at the time.

 I'm not sure exactly how straightforward it would be, but it seems
 doable - as it is, the overseer will preallocate shards when first creating
 the collection - that's when they get named shard(n). There would have to
 be logic to replace shard(n) with the custom shard name when the core
 actually registers.

 - Mark

 On Apr 3, 2013, at 3:42 PM, Jamie Johnson jej2...@gmail.com wrote:

  answered my own question, it now says compositeId.  What is problematic
  though is that in addition to my shards (which are say jamie-shard1) I
 see
  the solr created shards (shard1).  I assume that these were created
 because
  of the numShards param.  Is there no way to specify the names of these
  shards?
 
 
  On Wed, Apr 3, 2013 at 3:25 PM, Jamie Johnson jej2...@gmail.com
 wrote:
 
  ah interesting... so I need to specify num shards, blow out zk and
 then
  try this again to see if things work properly now.  What is really
 strange
  is that for the most part things have worked right and on 4.2.1 I have
  600,000 items indexed with no duplicates.  In any event I will
 specify num
  shards clear out zk and begin again.  If this works properly what
 should
  the router type be?
 
 
  On Wed, Apr 3, 2013 at 3:14 PM, Mark Miller markrmil...@gmail.com
 wrote:
 
  If you don't specify numShards after 4.1, you get an implicit doc
 router
  and it's up to you to distribute updates. In the past, partitioning
 was
  done on the fly - but for shard splitting and perhaps other
 features, we
  now divvy up the hash range up front based on numShards and store it
 in
  ZooKeeper. No numShards is now how you take complete control of
 updates
  yourself.
 
  - Mark
 
  On Apr 3, 2013, at 2:57 PM, Jamie Johnson jej2...@gmail.com wrote:
 
  The router says implicit.  I did start from a blank zk state but
  perhaps
  I missed one of the ZkCLI commands?  One of my shards from the
  clusterstate.json is shown below.  What is the process that should
 be
  done
  to bootstrap a cluster other than the ZkCLI commands I listed
 above?  My
  process right now is run those ZkCLI commands and then start solr on
  all of
  the instances with a command like this
 
  java -server -Dshard=shard5 -DcoreName=shard5-core1
  -Dsolr.data.dir=/solr/data/shard5-core1
  -Dcollection.configName=solr-conf
  -Dcollection=collection1
 -DzkHost=so-zoo1:2181,so-zoo2:2181,so-zoo3:2181
  -Djetty.port=7575 -DhostPort=7575 -jar start.jar
 
  I feel like maybe I'm missing a step.
 
  "shard5":{
    "state":"active",
    "replicas":{
      "10.38.33.16:7575_solr_shard5-core1":{
        "shard":"shard5",
        "state":"active",
        "core":"shard5-core1",
        "collection":"collection1",
        "node_name":"10.38.33.16:7575_solr",
        "base_url":"http://10.38.33.16:7575/solr",
        "leader":"true"},
      "10.38.33.17:7577_solr_shard5-core2":{
        "shard":"shard5",
        "state":"recovering",
        "core":"shard5-core2",
        "collection":"collection1",
        "node_name":"10.38.33.17:7577_solr",
        "base_url":"http://10.38.33.17:7577/solr"}}}
 
 
  On Wed, Apr 3, 2013 at 2:40 PM, Mark Miller markrmil...@gmail.com
  wrote:
 
  It should be part of your clusterstate.json. Some users have
 reported
  trouble upgrading a previous zk install when this change came. I
  recommended manually updating the clusterstate.json to have the
 right
  info,
  and that seemed to work. Otherwise, I guess you have to start from
 a
  clean
  zk state.
 
  If you don't have that range information, I think there will be
  trouble.
  Do you have a router type defined in the clusterstate.json?
 
  - Mark
 
  On Apr 3, 2013, at 2:24 PM, Jamie Johnson jej2...@gmail.com
 wrote:
 
  Where is this information stored in ZK?  I don't see it in the
 cluster
  state (or perhaps I don't understand it ;) ).
 
  Perhaps something with my process 

RE: Solr Multiword Search

2013-04-03 Thread skmirch
The following query is doing a word search (based on my previous post)...

solr/spell?q=(charles+and+the+choclit+factory+OR+(title2:(charles+and+the+choclit+factory)))&spellcheck.collate=true&spellcheck=true&spellcheck.q=charles+and+the+choclit+factory
 

It produces a lot of unwanted matches.


In order to do a phrase search, I changed it to:
solr/spell?q=("charles+and+the+choclit+factory"+OR+(title2:"charles+and+the+choclit+factory"))&spellcheck.collate=true&spellcheck=true&spellcheck.q="charles+and+the+choclit+factory"
 

It does not find any match for the words in the phrase I am looking for and
does poorly in the suggested collations.  I want phrase corrections.  How do
I achieve this?

"charles and the chocolit factory" produces the following collations:
<bool name="correctlySpelled">false</bool>
<lst name="collation">
  <str name="collationQuery">charles and the chocolat factory</str>
  <int name="hits">2849777</int>
  <lst name="misspellingsAndCorrections">
    <str name="charles">charles</str>
    <str name="and">and</str>
    <str name="the">the</str>
    <str name="chocolit">chocolat</str>
    <str name="factory">factory</str>
  </lst>
</lst>
<lst name="collation">
  <str name="collationQuery">charles and the chocalit factory</str>
  <int name="hits">2849464</int>
  <lst name="misspellingsAndCorrections">
    <str name="charles">charles</str>
    <str name="and">and</str>
    <str name="the">the</str>
    <str name="chocolit">chocalit</str>
    <str name="factory">factory</str>
  </lst>
</lst>
<lst name="collation">
  <str name="collationQuery">charles and the chocolat factors</str>
  <int name="hits">2841190</int>
  <lst name="misspellingsAndCorrections">
    <str name="charles">charles</str>
    <str name="and">and</str>
    <str name="the">the</str>
    <str name="chocolit">chocolat</str>
    <str name="factory">factors</str>
  </lst>
</lst>
<lst name="collation">
  <str name="collationQuery">charley and the chocolat factory</str>
  <int name="hits">2827908</int>
  <lst name="misspellingsAndCorrections">
    <str name="charles">charley</str>
    <str name="and">and</str>
    <str name="the">the</str>
    <str name="chocolit">chocolat</str>
    <str name="factory">factory</str>
  </lst>
</lst>
<lst name="collation">
  <str name="collationQuery">charles and the chocalit factors</str>
  <int name="hits">2840877</int>
  <lst name="misspellingsAndCorrections">
    <str name="charles">charles</str>
    <str name="and">and</str>
    <str name="the">the</str>
    <str name="chocolit">chocalit</str>
    <str name="factory">factors</str>
  </lst>
</lst>
<lst name="collation">
  <str name="collationQuery">charles and the chocklit factory</str>
  <int name="hits">2849464</int>
  <lst name="misspellingsAndCorrections">
    <str name="charles">charles</str>
    <str name="and">and</str>
    <str name="the">the</str>
    <str name="chocolit">chocklit</str>
    <str name="factory">factory</str>
  </lst>
</lst>
<lst name="collation">
  <str name="collationQuery">charles and the chocolat factorz</str>
  <int name="hits">2841173</int>
  <lst name="misspellingsAndCorrections">
    <str name="charles">charles</str>
    <str name="and">and</str>
    <str name="the">the</str>
    <str name="chocolit">chocolat</str>
    <str name="factory">factorz</str>
  </lst>
</lst>
<lst name="collation">
  <str name="collationQuery">charley and the chocalit factory</str>
  <int name="hits">2827595</int>
  <lst name="misspellingsAndCorrections">
    <str name="charles">charley</str>
    <str name="and">and</str>
    <str name="the">the</str>
    <str name="chocolit">chocalit</str>
    <str name="factory">factory</str>
  </lst>
</lst>
<lst name="collation">
  <str name="collationQuery">charley and the chocolat factors</str>
  <int name="hits">2819321</int>
  <lst name="misspellingsAndCorrections">
    <str name="charles">charley</str>
    <str name="and">and</str>
    <str name="the">the</str>
    <str name="chocolit">chocolat</str>
    <str name="factory">factors</str>
  </lst>
</lst>
<lst name="collation">
  <str name="collationQuery">charlies and the chocolat factory</str>
  <int name="hits">2826661</int>
  <lst name="misspellingsAndCorrections">
    <str name="charles">charlies</str>
    <str name="and">and</str>
    <str name="the">the</str>
    <str name="chocolit">chocolat</str>
    <str name="factory">factory</str>
  </lst>
</lst>
</lst>

Notice the number of hits. This does not look right. Please help.

Thanks.
-- Sandeep





Re: Solr 4.2 Cloud Replication Replica has higher version than Master?

2013-04-03 Thread Jamie Johnson
I am occasionally seeing this in the log, is this just a timeout issue?
 Should I be increasing the zk client timeout?

WARNING: Overseer cannot talk to ZK
Apr 3, 2013 11:14:25 PM
org.apache.solr.cloud.DistributedQueue$LatchChildWatcher process
INFO: Watcher fired on path: null state: Expired type None
Apr 3, 2013 11:14:25 PM org.apache.solr.cloud.Overseer$ClusterStateUpdater
run
WARNING: Solr cannot talk to ZK, exiting Overseer main queue loop
org.apache.zookeeper.KeeperException$SessionExpiredException:
KeeperErrorCode = Session expired for /overseer/queue
at
org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
at
org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1468)
at
org.apache.solr.common.cloud.SolrZkClient$6.execute(SolrZkClient.java:236)
at
org.apache.solr.common.cloud.SolrZkClient$6.execute(SolrZkClient.java:233)
at
org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:65)
at
org.apache.solr.common.cloud.SolrZkClient.getChildren(SolrZkClient.java:233)
at
org.apache.solr.cloud.DistributedQueue.orderedChildren(DistributedQueue.java:89)
at
org.apache.solr.cloud.DistributedQueue.element(DistributedQueue.java:131)
at
org.apache.solr.cloud.DistributedQueue.peek(DistributedQueue.java:326)
at
org.apache.solr.cloud.Overseer$ClusterStateUpdater.run(Overseer.java:128)
at java.lang.Thread.run(Thread.java:662)



On Wed, Apr 3, 2013 at 7:25 PM, Jamie Johnson jej2...@gmail.com wrote:

 just an update, I'm at 1M records now with no issues.  This looks
 promising as to the cause of my issues, thanks for the help.  Is the
 routing method with numShards documented anywhere?  I know numShards is
 documented but I didn't know that the routing changed if you don't specify
 it.


 On Wed, Apr 3, 2013 at 4:44 PM, Jamie Johnson jej2...@gmail.com wrote:

 with these changes things are looking good, I'm up to 600,000 documents
 without any issues as of right now.  I'll keep going and add more to see if
 I find anything.


 On Wed, Apr 3, 2013 at 4:01 PM, Jamie Johnson jej2...@gmail.com wrote:

 ok, so that's not a deal breaker for me.  I just changed it to match the
 shards that are auto created and it looks like things are happy.  I'll go
 ahead and try my test to see if I can get things out of sync.


 On Wed, Apr 3, 2013 at 3:56 PM, Mark Miller markrmil...@gmail.comwrote:

 I had thought you could - but looking at the code recently, I don't
 think you can anymore. I think that's a technical limitation more than
 anything though. When these changes were made, I think support for that was
 simply not added at the time.

 I'm not sure exactly how straightforward it would be, but it seems
 doable - as it is, the overseer will preallocate shards when first creating
 the collection - that's when they get named shard(n). There would have to
 be logic to replace shard(n) with the custom shard name when the core
 actually registers.

 - Mark

 On Apr 3, 2013, at 3:42 PM, Jamie Johnson jej2...@gmail.com wrote:

  answered my own question, it now says compositeId.  What is
 problematic
  though is that in addition to my shards (which are say jamie-shard1)
 I see
  the solr created shards (shard1).  I assume that these were created
 because
  of the numShards param.  Is there no way to specify the names of these
  shards?
 
 
  On Wed, Apr 3, 2013 at 3:25 PM, Jamie Johnson jej2...@gmail.com
 wrote:
 
  ah interesting... so I need to specify num shards, blow out zk and
 then
  try this again to see if things work properly now.  What is really
 strange
  is that for the most part things have worked right and on 4.2.1 I
 have
  600,000 items indexed with no duplicates.  In any event I will
 specify num
  shards clear out zk and begin again.  If this works properly what
 should
  the router type be?
 
 
  On Wed, Apr 3, 2013 at 3:14 PM, Mark Miller markrmil...@gmail.com
 wrote:
 
  If you don't specify numShards after 4.1, you get an implicit doc
 router
  and it's up to you to distribute updates. In the past, partitioning
 was
  done on the fly - but for shard splitting and perhaps other
 features, we
  now divvy up the hash range up front based on numShards and store
 it in
  ZooKeeper. No numShards is now how you take complete control of
 updates
  yourself.
 
  - Mark
 
  On Apr 3, 2013, at 2:57 PM, Jamie Johnson jej2...@gmail.com
 wrote:
 
  The router says implicit.  I did start from a blank zk state but
  perhaps
  I missed one of the ZkCLI commands?  One of my shards from the
  clusterstate.json is shown below.  What is the process that should
 be
  done
  to bootstrap a cluster other than the ZkCLI commands I listed
 above?  My
  process right now is run those ZkCLI commands and then start solr
 on
  all of
  the instances with a command like this
 
  java -server -Dshard=shard5 -DcoreName=shard5-core1
  

Re: Solr 4.2 Cloud Replication Replica has higher version than Master?

2013-04-03 Thread Mark Miller
Yeah. Are you using the concurrent low pause garbage collector?

This means the overseer wasn't able to communicate with zk for 15 seconds - due 
to load or gc or whatever. If you can't resolve the root cause of that, or the 
load just won't allow for it, next best thing you can do is raise it to 30 
seconds.
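
Concretely, that might look like this (a sketch: standard CMS flags, and
zkClientTimeout assumes the stock example solr.xml, which reads the value
from this system property):

  java -server -XX:+UseConcMarkSweepGC -XX:+UseParNewGC \
       -DzkClientTimeout=30000 \
       -DzkHost=so-zoo1:2181,so-zoo2:2181,so-zoo3:2181 \
       -jar start.jar

The effective session timeout is also bounded by the ZooKeeper server's
minSessionTimeout/maxSessionTimeout, so check that side as well.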

- Mark

On Apr 3, 2013, at 7:41 PM, Jamie Johnson jej2...@gmail.com wrote:

 I am occasionally seeing this in the log, is this just a timeout issue?
 Should I be increasing the zk client timeout?
 
 WARNING: Overseer cannot talk to ZK
 Apr 3, 2013 11:14:25 PM
 org.apache.solr.cloud.DistributedQueue$LatchChildWatcher process
 INFO: Watcher fired on path: null state: Expired type None
 Apr 3, 2013 11:14:25 PM org.apache.solr.cloud.Overseer$ClusterStateUpdater
 run
 WARNING: Solr cannot talk to ZK, exiting Overseer main queue loop
 org.apache.zookeeper.KeeperException$SessionExpiredException:
 KeeperErrorCode = Session expired for /overseer/queue
at
 org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
at
 org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1468)
at
 org.apache.solr.common.cloud.SolrZkClient$6.execute(SolrZkClient.java:236)
at
 org.apache.solr.common.cloud.SolrZkClient$6.execute(SolrZkClient.java:233)
at
 org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:65)
at
 org.apache.solr.common.cloud.SolrZkClient.getChildren(SolrZkClient.java:233)
at
 org.apache.solr.cloud.DistributedQueue.orderedChildren(DistributedQueue.java:89)
at
 org.apache.solr.cloud.DistributedQueue.element(DistributedQueue.java:131)
at
 org.apache.solr.cloud.DistributedQueue.peek(DistributedQueue.java:326)
at
 org.apache.solr.cloud.Overseer$ClusterStateUpdater.run(Overseer.java:128)
at java.lang.Thread.run(Thread.java:662)
 
 
 
 On Wed, Apr 3, 2013 at 7:25 PM, Jamie Johnson jej2...@gmail.com wrote:
 
 just an update, I'm at 1M records now with no issues.  This looks
 promising as to the cause of my issues, thanks for the help.  Is the
 routing method with numShards documented anywhere?  I know numShards is
 documented but I didn't know that the routing changed if you don't specify
 it.
 
 
 On Wed, Apr 3, 2013 at 4:44 PM, Jamie Johnson jej2...@gmail.com wrote:
 
 with these changes things are looking good, I'm up to 600,000 documents
 without any issues as of right now.  I'll keep going and add more to see if
 I find anything.
 
 
 On Wed, Apr 3, 2013 at 4:01 PM, Jamie Johnson jej2...@gmail.com wrote:
 
 ok, so that's not a deal breaker for me.  I just changed it to match the
 shards that are auto created and it looks like things are happy.  I'll go
 ahead and try my test to see if I can get things out of sync.
 
 
 On Wed, Apr 3, 2013 at 3:56 PM, Mark Miller markrmil...@gmail.comwrote:
 
 I had thought you could - but looking at the code recently, I don't
 think you can anymore. I think that's a technical limitation more than
 anything though. When these changes were made, I think support for that 
 was
 simply not added at the time.
 
 I'm not sure exactly how straightforward it would be, but it seems
 doable - as it is, the overseer will preallocate shards when first 
 creating
 the collection - that's when they get named shard(n). There would have to
 be logic to replace shard(n) with the custom shard name when the core
 actually registers.
 
 - Mark
 
 On Apr 3, 2013, at 3:42 PM, Jamie Johnson jej2...@gmail.com wrote:
 
 answered my own question, it now says compositeId.  What is
 problematic
 though is that in addition to my shards (which are say jamie-shard1)
 I see
 the solr created shards (shard1).  I assume that these were created
 because
 of the numShards param.  Is there no way to specify the names of these
 shards?
 
 
 On Wed, Apr 3, 2013 at 3:25 PM, Jamie Johnson jej2...@gmail.com
 wrote:
 
  ah interesting... so I need to specify num shards, blow out zk and
 then
 try this again to see if things work properly now.  What is really
 strange
 is that for the most part things have worked right and on 4.2.1 I
 have
 600,000 items indexed with no duplicates.  In any event I will
 specify num
 shards clear out zk and begin again.  If this works properly what
 should
 the router type be?
 
 
 On Wed, Apr 3, 2013 at 3:14 PM, Mark Miller markrmil...@gmail.com
 wrote:
 
 If you don't specify numShards after 4.1, you get an implicit doc
 router
 and it's up to you to distribute updates. In the past, partitioning
 was
 done on the fly - but for shard splitting and perhaps other
 features, we
 now divvy up the hash range up front based on numShards and store
 it in
 ZooKeeper. No numShards is now how you take complete control of
 updates
 yourself.
 
 - Mark
 
 On Apr 3, 2013, at 2:57 PM, Jamie Johnson jej2...@gmail.com
 wrote:
 
 The router says implicit.  I did start from a blank zk state but
 perhaps
 

Re: Solr 4.2 Cloud Replication Replica has higher version than Master?

2013-04-03 Thread Mark Miller
This shouldn't be a problem though, if things are working as they are supposed 
to. Another node should simply take over as the overseer and continue 
processing the work queue. It's just best if you configure so that session 
timeouts don't happen unless a node is really down. On the other hand, it's 
nicer to detect that faster. Your tradeoff to make.

- Mark

On Apr 3, 2013, at 7:46 PM, Mark Miller markrmil...@gmail.com wrote:

 Yeah. Are you using the concurrent low pause garbage collector?
 
 This means the overseer wasn't able to communicate with zk for 15 seconds - 
 due to load or gc or whatever. If you can't resolve the root cause of that, 
 or the load just won't allow for it, next best thing you can do is raise it 
 to 30 seconds.
 
 - Mark
 
 On Apr 3, 2013, at 7:41 PM, Jamie Johnson jej2...@gmail.com wrote:
 
 I am occasionally seeing this in the log, is this just a timeout issue?
 Should I be increasing the zk client timeout?
 
 WARNING: Overseer cannot talk to ZK
 Apr 3, 2013 11:14:25 PM
 org.apache.solr.cloud.DistributedQueue$LatchChildWatcher process
 INFO: Watcher fired on path: null state: Expired type None
 Apr 3, 2013 11:14:25 PM org.apache.solr.cloud.Overseer$ClusterStateUpdater
 run
 WARNING: Solr cannot talk to ZK, exiting Overseer main queue loop
 org.apache.zookeeper.KeeperException$SessionExpiredException:
 KeeperErrorCode = Session expired for /overseer/queue
   at
 org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
   at
 org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
   at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1468)
   at
 org.apache.solr.common.cloud.SolrZkClient$6.execute(SolrZkClient.java:236)
   at
 org.apache.solr.common.cloud.SolrZkClient$6.execute(SolrZkClient.java:233)
   at
 org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:65)
   at
 org.apache.solr.common.cloud.SolrZkClient.getChildren(SolrZkClient.java:233)
   at
 org.apache.solr.cloud.DistributedQueue.orderedChildren(DistributedQueue.java:89)
   at
 org.apache.solr.cloud.DistributedQueue.element(DistributedQueue.java:131)
   at
 org.apache.solr.cloud.DistributedQueue.peek(DistributedQueue.java:326)
   at
 org.apache.solr.cloud.Overseer$ClusterStateUpdater.run(Overseer.java:128)
   at java.lang.Thread.run(Thread.java:662)
 
 
 
 On Wed, Apr 3, 2013 at 7:25 PM, Jamie Johnson jej2...@gmail.com wrote:
 
 just an update, I'm at 1M records now with no issues.  This looks
 promising as to the cause of my issues, thanks for the help.  Is the
 routing method with numShards documented anywhere?  I know numShards is
 documented but I didn't know that the routing changed if you don't specify
 it.
 
 
 On Wed, Apr 3, 2013 at 4:44 PM, Jamie Johnson jej2...@gmail.com wrote:
 
 with these changes things are looking good, I'm up to 600,000 documents
 without any issues as of right now.  I'll keep going and add more to see if
 I find anything.
 
 
 On Wed, Apr 3, 2013 at 4:01 PM, Jamie Johnson jej2...@gmail.com wrote:
 
 ok, so that's not a deal breaker for me.  I just changed it to match the
 shards that are auto created and it looks like things are happy.  I'll go
 ahead and try my test to see if I can get things out of sync.
 
 
 On Wed, Apr 3, 2013 at 3:56 PM, Mark Miller markrmil...@gmail.comwrote:
 
 I had thought you could - but looking at the code recently, I don't
 think you can anymore. I think that's a technical limitation more than
 anything though. When these changes were made, I think support for that 
 was
 simply not added at the time.
 
 I'm not sure exactly how straightforward it would be, but it seems
 doable - as it is, the overseer will preallocate shards when first 
 creating
 the collection - that's when they get named shard(n). There would have to
 be logic to replace shard(n) with the custom shard name when the core
 actually registers.
 
 - Mark
 
 On Apr 3, 2013, at 3:42 PM, Jamie Johnson jej2...@gmail.com wrote:
 
 answered my own question, it now says compositeId.  What is
 problematic
 though is that in addition to my shards (which are say jamie-shard1)
 I see
 the solr created shards (shard1).  I assume that these were created
 because
 of the numShards param.  Is there no way to specify the names of these
 shards?
 
 
 On Wed, Apr 3, 2013 at 3:25 PM, Jamie Johnson jej2...@gmail.com
 wrote:
 
  ah interesting... so I need to specify num shards, blow out zk and
 then
 try this again to see if things work properly now.  What is really
 strange
 is that for the most part things have worked right and on 4.2.1 I
 have
 600,000 items indexed with no duplicates.  In any event I will
 specify num
 shards clear out zk and begin again.  If this works properly what
 should
 the router type be?
 
 
 On Wed, Apr 3, 2013 at 3:14 PM, Mark Miller markrmil...@gmail.com
 wrote:
 
 If you don't specify numShards after 4.1, you get an implicit doc
 router
 and it's up to you to 

Re: Solr 4.2 Cloud Replication Replica has higher version than Master?

2013-04-03 Thread Jamie Johnson
I am not using the concurrent low pause garbage collector. I could look at switching - I'm assuming you're talking about adding -XX:+UseConcMarkSweepGC, correct?

I also just had a shard go down and am seeing this in the log

SEVERE: org.apache.solr.common.SolrException: I was asked to wait on state
down for 10.38.33.17:7576_solr but I still do not see the requested state.
I see state: recovering live:false
  at org.apache.solr.handler.admin.CoreAdminHandler.handleWaitForStateAction(CoreAdminHandler.java:890)
  at org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:186)
  at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
  at org.apache.solr.servlet.SolrDispatchFilter.handleAdminRequest(SolrDispatchFilter.java:591)
  at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:192)
  at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:141)
  at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1307)
  at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:453)
  at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
  at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:560)
  at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)

Nothing other than this in the log jumps out as interesting though.


On Wed, Apr 3, 2013 at 7:47 PM, Mark Miller markrmil...@gmail.com wrote:

 This shouldn't be a problem though, if things are working as they are
 supposed to. Another node should simply take over as the overseer and
 continue processing the work queue. It's just best if you configure so that
 session timeouts don't happen unless a node is really down. On the other
 hand, it's nicer to detect that faster. Your tradeoff to make.

 - Mark

 On Apr 3, 2013, at 7:46 PM, Mark Miller markrmil...@gmail.com wrote:

  Yeah. Are you using the concurrent low pause garbage collector?
 
  This means the overseer wasn't able to communicate with zk for 15
 seconds - due to load or gc or whatever. If you can't resolve the root
 cause of that, or the load just won't allow for it, next best thing you can
 do is raise it to 30 seconds.
 
  - Mark
 
  On Apr 3, 2013, at 7:41 PM, Jamie Johnson jej2...@gmail.com wrote:
 
  I am occasionally seeing this in the log, is this just a timeout issue?
  Should I be increasing the zk client timeout?
 
  WARNING: Overseer cannot talk to ZK
  Apr 3, 2013 11:14:25 PM
  org.apache.solr.cloud.DistributedQueue$LatchChildWatcher process
  INFO: Watcher fired on path: null state: Expired type None
  Apr 3, 2013 11:14:25 PM
 org.apache.solr.cloud.Overseer$ClusterStateUpdater
  run
  WARNING: Solr cannot talk to ZK, exiting Overseer main queue loop
  org.apache.zookeeper.KeeperException$SessionExpiredException:
  KeeperErrorCode = Session expired for /overseer/queue
  at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
  at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
  at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1468)
  at org.apache.solr.common.cloud.SolrZkClient$6.execute(SolrZkClient.java:236)
  at org.apache.solr.common.cloud.SolrZkClient$6.execute(SolrZkClient.java:233)
  at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:65)
  at org.apache.solr.common.cloud.SolrZkClient.getChildren(SolrZkClient.java:233)
  at org.apache.solr.cloud.DistributedQueue.orderedChildren(DistributedQueue.java:89)
  at org.apache.solr.cloud.DistributedQueue.element(DistributedQueue.java:131)
  at org.apache.solr.cloud.DistributedQueue.peek(DistributedQueue.java:326)
  at org.apache.solr.cloud.Overseer$ClusterStateUpdater.run(Overseer.java:128)
  at java.lang.Thread.run(Thread.java:662)
 
 
 
  On Wed, Apr 3, 2013 at 7:25 PM, Jamie Johnson jej2...@gmail.com
 wrote:
 
  just an update, I'm at 1M records now with no issues.  This looks
  promising as to the cause of my issues, thanks for the help.  Is the
  routing method with numShards documented anywhere?  I know numShards is
  documented but I didn't know that the routing changed if you don't
 specify
  it.
 
 
  On Wed, Apr 3, 2013 at 4:44 PM, Jamie Johnson jej2...@gmail.com
 wrote:
 
  with these changes things are looking good, I'm up to 600,000
 documents
  without any issues as of right now.  I'll keep going and add more to
 see if
  I find anything.
 
 
  On Wed, Apr 3, 2013 at 4:01 PM, Jamie Johnson jej2...@gmail.com
 wrote:
 
  ok, so that's not a deal breaker for me.  I just changed it to match
 the
  shards that are auto created and it looks like things are happy.
  I'll go
  ahead and try my test to see if I can get things out of sync.
 
 
  On Wed, Apr 3, 2013 at 

Re: Solr 4.2 Cloud Replication Replica has higher version than Master?

2013-04-03 Thread Mark Miller


On Apr 3, 2013, at 8:17 PM, Jamie Johnson jej2...@gmail.com wrote:

 I am not using the concurrent low pause garbage collector, I could look at
 switching, I'm assuming you're talking about adding -XX:+UseConcMarkSweepGC
 correct?

Right - if you don't do that, the default is almost always the throughput collector (I've only seen OS X buck this trend, back when Apple handled Java). That means stop-the-world garbage collections, so with larger heaps a collection can be a fair amount of time during which no threads can run. That's not great for something as interactive as search in any case, and it's especially bad when combined with heavy load and a 15 sec session timeout between Solr and ZooKeeper.
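As a concrete sketch, the startup flags might look like this - heap size and zkHost are placeholders, and the zkClientTimeout property assumes the stock example solr.xml, which reads that system property:

  java -Xmx4g -XX:+UseConcMarkSweepGC -XX:+UseParNewGC \
       -DzkClientTimeout=30000 \
       -DzkHost=zk1:2181,zk2:2181,zk3:2181 \
       -jar start.jar

The two important pieces are the collector choice and the 30 second timeout; everything else is incidental.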


The below is odd - a replica node is waiting for the leader to see it as recovering and live - live means it has created an ephemeral node for that Solr CoreContainer in zk. It's very strange if that didn't happen, unless it happened during shutdown or something.
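One way to check the live part directly is the ZooKeeper CLI that ships with zk - the host/port and the sample output below are illustrative:

  ./bin/zkCli.sh -server localhost:2181
  ls /live_nodes
  [10.38.33.16:8983_solr, 10.38.33.17:7576_solr]

If the node's entry is missing there while the core is up, that points at the session/ephemeral node being the problem rather than the recovery logic.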

 
 I also just had a shard go down and am seeing this in the log
 
 SEVERE: org.apache.solr.common.SolrException: I was asked to wait on state
 down for 10.38.33.17:7576_solr but I still do not see the requested state.
 I see state: recovering live:false
 at org.apache.solr.handler.admin.CoreAdminHandler.handleWaitForStateAction(CoreAdminHandler.java:890)
 at org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:186)
 at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
 at org.apache.solr.servlet.SolrDispatchFilter.handleAdminRequest(SolrDispatchFilter.java:591)
 at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:192)
 at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:141)
 at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1307)
 at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:453)
 at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
 at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:560)
 at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
 
 Nothing other than this in the log jumps out as interesting though.
 
 
 On Wed, Apr 3, 2013 at 7:47 PM, Mark Miller markrmil...@gmail.com wrote:
 
 This shouldn't be a problem though, if things are working as they are
 supposed to. Another node should simply take over as the overseer and
 continue processing the work queue. It's just best if you configure so that
 session timeouts don't happen unless a node is really down. On the other
 hand, it's nicer to detect that faster. Your tradeoff to make.
 
 - Mark
 
 On Apr 3, 2013, at 7:46 PM, Mark Miller markrmil...@gmail.com wrote:
 
 Yeah. Are you using the concurrent low pause garbage collector?
 
 This means the overseer wasn't able to communicate with zk for 15
 seconds - due to load or gc or whatever. If you can't resolve the root
 cause of that, or the load just won't allow for it, next best thing you can
 do is raise it to 30 seconds.
 
 - Mark
 
 On Apr 3, 2013, at 7:41 PM, Jamie Johnson jej2...@gmail.com wrote:
 
 I am occasionally seeing this in the log, is this just a timeout issue?
 Should I be increasing the zk client timeout?
 
 WARNING: Overseer cannot talk to ZK
 Apr 3, 2013 11:14:25 PM
 org.apache.solr.cloud.DistributedQueue$LatchChildWatcher process
 INFO: Watcher fired on path: null state: Expired type None
 Apr 3, 2013 11:14:25 PM
 org.apache.solr.cloud.Overseer$ClusterStateUpdater
 run
 WARNING: Solr cannot talk to ZK, exiting Overseer main queue loop
 org.apache.zookeeper.KeeperException$SessionExpiredException:
 KeeperErrorCode = Session expired for /overseer/queue
  at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
  at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
  at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1468)
  at org.apache.solr.common.cloud.SolrZkClient$6.execute(SolrZkClient.java:236)
  at org.apache.solr.common.cloud.SolrZkClient$6.execute(SolrZkClient.java:233)
  at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:65)
  at org.apache.solr.common.cloud.SolrZkClient.getChildren(SolrZkClient.java:233)
  at org.apache.solr.cloud.DistributedQueue.orderedChildren(DistributedQueue.java:89)
  at org.apache.solr.cloud.DistributedQueue.element(DistributedQueue.java:131)
  at org.apache.solr.cloud.DistributedQueue.peek(DistributedQueue.java:326)
  at org.apache.solr.cloud.Overseer$ClusterStateUpdater.run(Overseer.java:128)
  at java.lang.Thread.run(Thread.java:662)
 
 
 
 On Wed, Apr 3, 2013 at 7:25 PM, Jamie Johnson jej2...@gmail.com
 wrote:
 
 just an update, I'm at 1M records now with no issues.  This looks
 promising as to 

hl.usePhraseHighlighter defaults to true but Query form and wiki suggest otherwise

2013-04-03 Thread Timothy Potter
Minor issue - it seems that hl.usePhraseHighlighter is enabled by default, which definitely makes sense, but the wiki says its default value is false and the checkbox is unchecked by default on the Query form. This gives the impression that the parameter defaults to false.

I'm assuming the code is right in this case and we just need a JIRA to bring the Query form in sync with the code. I can update the wiki ... just want to make sure that having this field enabled by default is the correct behavior before I update things.
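A quick sanity check, independent of the form and the wiki, is to pass the parameter explicitly and compare - host, core, field, and query here are placeholders:

  http://localhost:8983/solr/select?q=features:%22exact%20phrase%22&hl=true&hl.fl=features&hl.usePhraseHighlighter=true
  http://localhost:8983/solr/select?q=features:%22exact%20phrase%22&hl=true&hl.fl=features&hl.usePhraseHighlighter=false

If omitting the parameter gives the same highlights as =true on a phrase query, the code default is true and only the form/wiki need fixing.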

Cheers,
Tim


Re: Solr 4.2 Cloud Replication Replica has higher version than Master?

2013-04-03 Thread Jamie Johnson
Thanks I will try that.


On Wed, Apr 3, 2013 at 8:28 PM, Mark Miller markrmil...@gmail.com wrote:



 On Apr 3, 2013, at 8:17 PM, Jamie Johnson jej2...@gmail.com wrote:

  I am not using the concurrent low pause garbage collector, I could look
 at
  switching, I'm assuming you're talking about adding
 -XX:+UseConcMarkSweepGC
  correct?

 Right - if you don't do that, the default is almost always the throughput
 collector (I've only seen OSX buck this trend when apple handled java).
 That means stop the world garbage collections, so with larger heaps, that
 can be a fair amount of time that no threads can run. It's not that great
 for something as interactive as search generally is anyway, but it's always
 not that great when added to heavy load and a 15 sec session timeout
 between solr and zk.


 The below is odd - a replica node is waiting for the leader to see it as
 recovering and live - live means it has created an ephemeral node for that
 Solr corecontainer in zk - it's very strange if that didn't happen, unless
 this happened during shutdown or something.

 
  I also just had a shard go down and am seeing this in the log
 
  SEVERE: org.apache.solr.common.SolrException: I was asked to wait on
 state
  down for 10.38.33.17:7576_solr but I still do not see the requested
 state.
  I see state: recovering live:false
 at org.apache.solr.handler.admin.CoreAdminHandler.handleWaitForStateAction(CoreAdminHandler.java:890)
 at org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:186)
 at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
 at org.apache.solr.servlet.SolrDispatchFilter.handleAdminRequest(SolrDispatchFilter.java:591)
 at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:192)
 at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:141)
 at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1307)
 at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:453)
 at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
 at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:560)
 at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
 
  Nothing other than this in the log jumps out as interesting though.
 
 
  On Wed, Apr 3, 2013 at 7:47 PM, Mark Miller markrmil...@gmail.com
 wrote:
 
  This shouldn't be a problem though, if things are working as they are
  supposed to. Another node should simply take over as the overseer and
  continue processing the work queue. It's just best if you configure so
 that
  session timeouts don't happen unless a node is really down. On the other
  hand, it's nicer to detect that faster. Your tradeoff to make.
 
  - Mark
 
  On Apr 3, 2013, at 7:46 PM, Mark Miller markrmil...@gmail.com wrote:
 
  Yeah. Are you using the concurrent low pause garbage collector?
 
  This means the overseer wasn't able to communicate with zk for 15
  seconds - due to load or gc or whatever. If you can't resolve the root
  cause of that, or the load just won't allow for it, next best thing you
 can
  do is raise it to 30 seconds.
 
  - Mark
 
  On Apr 3, 2013, at 7:41 PM, Jamie Johnson jej2...@gmail.com wrote:
 
  I am occasionally seeing this in the log, is this just a timeout
 issue?
  Should I be increasing the zk client timeout?
 
  WARNING: Overseer cannot talk to ZK
  Apr 3, 2013 11:14:25 PM
  org.apache.solr.cloud.DistributedQueue$LatchChildWatcher process
  INFO: Watcher fired on path: null state: Expired type None
  Apr 3, 2013 11:14:25 PM
  org.apache.solr.cloud.Overseer$ClusterStateUpdater
  run
  WARNING: Solr cannot talk to ZK, exiting Overseer main queue loop
  org.apache.zookeeper.KeeperException$SessionExpiredException:
  KeeperErrorCode = Session expired for /overseer/queue
   at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
   at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
   at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1468)
   at org.apache.solr.common.cloud.SolrZkClient$6.execute(SolrZkClient.java:236)
   at org.apache.solr.common.cloud.SolrZkClient$6.execute(SolrZkClient.java:233)
   at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:65)
   at org.apache.solr.common.cloud.SolrZkClient.getChildren(SolrZkClient.java:233)
   at org.apache.solr.cloud.DistributedQueue.orderedChildren(DistributedQueue.java:89)
   at org.apache.solr.cloud.DistributedQueue.element(DistributedQueue.java:131)
   at org.apache.solr.cloud.DistributedQueue.peek(DistributedQueue.java:326)
   at org.apache.solr.cloud.Overseer$ClusterStateUpdater.run(Overseer.java:128)

Difference Between Indexing and Reindexing

2013-04-03 Thread Furkan KAMACI
OK, this could be a very easy question, but I want to learn a bit more of the technical detail. When I use Nutch to send documents to Solr for indexing, there are two parameters:

-index and -reindex.

What does Solr do differently for each one?


  1   2   >