Re: Solr scores remain the same for exact match and nearly exact match
On 3 April 2013 10:52, amit amit.mal...@gmail.com wrote: Below is my query http://localhost:8983/solr/select/?q=subject:session management in php&fq=category:[*%20TO%20*]&fl=category,score,subject [...] Add debugQuery=on to your Solr URL, and you will get an explanation of the score. Your subject field is tokenised, so there is no a priori reason that an exact match should score higher. Several strategies are available if you want that behaviour. Try searching Google, e.g., for solr exact match higher score. Regards, Gora
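Gora's suggestion can be sketched as a properly encoded request; note the & separators between parameters, which the archived URL above lost. A minimal Python sketch (host, core, and field names are taken from the message; the phrase quoting is an illustrative tweak, not part of the original query):

```python
from urllib.parse import urlencode

# Standard Solr query parameters; debugQuery=on asks Solr to explain each score.
base = "http://localhost:8983/solr/select/"
params = {
    "q": 'subject:"session management in php"',  # phrase query, closer to an exact match
    "fq": "category:[* TO *]",                   # keep only docs that have a category
    "fl": "category,score,subject",
    "debugQuery": "on",
}
url = base + "?" + urlencode(params)
print(url)
```

urlencode handles the brackets, spaces, and commas so the parameters survive intact.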
Re: Out of memory on some faceting queries
On Tue, 2013-04-02 at 17:08 +0200, Dotan Cohen wrote: Most of the time I facet on one field that has about twenty unique values. They are likely to be disk cached so warming those for 9M documents should only take a few seconds. However, once per day I would like to facet on the text field, which is a free-text field usually around 1 KiB (about 100 words), in order to determine what the top keywords / topics are. That query would take up to 200 seconds to run, [...] If that query is somehow part of your warming, then I am surprised that search has worked at all with your commit frequency. That would however explain your OOM if you have multiple warmups running at the same time. It sounds like TermsComponent would be a better fit for getting top topics: https://wiki.apache.org/solr/TermsComponent
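A TermsComponent request like the one Toke suggests might be built as follows. This is only a sketch: the field name "text" and the /terms handler path are assumptions (the handler must be mapped to TermsComponent in solrconfig.xml):

```python
from urllib.parse import urlencode

# TermsComponent parameters for pulling the top-N terms of a field by
# document frequency -- far cheaper than faceting the whole text field.
params = {
    "terms": "true",
    "terms.fl": "text",     # assumed field name
    "terms.limit": "20",    # top-N terms
    "terms.sort": "count",  # sort by frequency rather than index order
}
terms_url = "http://localhost:8983/solr/terms?" + urlencode(params)
print(terms_url)
```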
maxWarmingSearchers in Solr 4.
I have been dragging the same solrconfig.xml from Solr 3.x to 4.0 to 4.1, with no customization (bad, bad me!). I'm now looking into customizing it, and I see that the Solr 4.1 solrconfig.xml is much simpler and shorter. Is this simply because many of the examples have been removed? In particular, I notice that there is no mention of maxWarmingSearchers in the Solr 4.1 solrconfig.xml. I assume that I can simply add it back in; are there any other critical config options that are missing that I should be looking into as well? Would I be better off using the old Solr 3.x solrconfig.xml in Solr 4.1, as it contains so many examples? -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
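For reference, the element can indeed simply be added back. The sketch below embeds a minimal solrconfig.xml fragment and checks it parses; the value 2 is the old 3.x example default, not a recommendation for any particular setup:

```python
import xml.etree.ElementTree as ET

# Minimal solrconfig.xml fragment; maxWarmingSearchers is a direct child
# of <config> and caps how many searchers may warm concurrently.
fragment = """
<config>
  <maxWarmingSearchers>2</maxWarmingSearchers>
</config>
"""
root = ET.fromstring(fragment)
value = int(root.findtext("maxWarmingSearchers"))
print(value)
```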
Re: Out of memory on some faceting queries
On Tue, Apr 2, 2013 at 6:26 PM, Andre Bois-Crettez andre.b...@kelkoo.com wrote: warmupTime is available on the admin page for each type of cache (in milliseconds): http://solr-box:8983/solr/#/core1/plugins/cache Or if you are only interested in the total: http://solr-box:8983/solr/core1/admin/mbeans?stats=true&key=searcher Thanks. Batches of 20-50 results are added to Solr a few times a minute, and a commit is done after each batch since I'm calling Solr as such: http://127.0.0.1:8983/solr/core/update/json?commit=true Should I remove commit=true and run a cron job to commit once per minute? Even better, it sounds like a job for CommitWithin: http://wiki.apache.org/solr/CommitWithin I'll look into that. Thank you! -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
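The CommitWithin idea can be sketched like this: instead of commit=true on every batch, ask Solr to commit within a time window, and overlapping requests get folded into one commit. The document fields and the 60-second window are illustrative assumptions; only the request is built here, nothing is sent:

```python
import json
from urllib.parse import urlencode

# commitWithin is given in milliseconds; Solr guarantees a commit happens
# within that window rather than once per update request.
docs = [{"id": "1", "subject": "session management"}]
params = {"commitWithin": "60000"}
update_url = "http://127.0.0.1:8983/solr/core/update/json?" + urlencode(params)
payload = json.dumps(docs)
print(update_url)
print(payload)
```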
Re: Out of memory on some faceting queries
On Wed, Apr 3, 2013 at 10:11 AM, Toke Eskildsen t...@statsbiblioteket.dk wrote: However, once per day I would like to facet on the text field, which is a free-text field usually around 1 KiB (about 100 words), in order to determine what the top keywords / topics are. That query would take up to 200 seconds to run, [...] If that query is somehow part of your warming, then I am surprised that search has worked at all with your commit frequency. That would however explain your OOM if you have multiple warmups running at the same time. No, the 'heavy facet' is not part of the warming. I run it at most once per day, at the end of the day. Solr is not shut down daily. It sounds like TermsComponent would be a better fit for getting top topics: https://wiki.apache.org/solr/TermsComponent I had once looked at TermsComponent, but I think that I eliminated it as a possibility because I actually need the top keywords related to a specific keyword. For instance, I need to know which words are most commonly used with the word coffee. -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
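The "words most commonly used with coffee" requirement is a co-occurrence count rather than a plain top-terms list, which is why TermsComponent alone does not fit. A toy sketch of the computation, over tiny made-up documents:

```python
from collections import Counter

# Count which words co-occur with a target term across documents.
docs = [
    "coffee with milk and sugar",
    "black coffee no sugar",
    "tea with milk",
]
target = "coffee"
co = Counter()
for doc in docs:
    words = set(doc.split())          # unique words per document
    if target in words:
        co.update(words - {target})   # credit every co-occurring word once
print(co.most_common(3))
```

In Solr terms this is closer to a filtered facet (fq on the target term, facet on the text field) than to TermsComponent, which ignores the query entirely.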
Re: Solr 4.2.0 results links
Thanks for the response. I found the issue. The data was being ingested correctly; it was just being echoed incorrectly. While inspecting the final HTML output I was able to find that the richtext-doc.vm file was used to display my data. The code in this file generated the links to local files. After some more research on Velocity coding and some trial and error, I now have my links displaying and working correctly. I'm still picking apart the example collections and Solr configs to suit my needs. I also ran into a heap memory issue, but that is more of a Java thing; I have adjusted the setting and am testing it out. Down the road I'd like to make the year a drop-down option, so that you only search the selected year and not the whole library, but that is a different topic and I need to do some more research. Again, thanks for the reply, ZeroEffect -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-4-2-0-results-links-tp4049788p4053420.html Sent from the Solr - User mailing list archive at Nabble.com.
Query parser cuts last letter from search term.
Hi, I have a strange problem with a Solr query. I added to my Solr index a new document with the word behave! inside the content. While trying to search for this document using the search term behave, it was impossible; only behave! returns a result. Additionally, the search debug returns the following information:

debug: {
  "rawquerystring": "behave",
  "querystring": "behave",
  "parsedquery": "allText:behav",
  "parsedquery_toString": "allText:behav",

Does anybody know how to deal with such a case? Below is my field type definition.

Field definition:

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" preserveOriginal="1" types="characters.txt"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" preserveOriginal="1" types="characters.txt"/>
  </analyzer>
</fieldType>

where characters.txt contains:

§ => ALPHA
$ => ALPHA
% => ALPHA
=> ALPHA
/ => ALPHA
( => ALPHA
) => ALPHA
= => ALPHA
? => ALPHA
+ => ALPHA
* => ALPHA
# => ALPHA
' => ALPHA
- => ALPHA
=> ALPHA
=> ALPHA

(some characters in this list were stripped by the archive)

-- View this message in context: http://lucene.472066.n3.nabble.com/Query-parser-cuts-last-letter-from-search-term-tp4053432.html Sent from the Solr - User mailing list archive at Nabble.com.
RE: MoreLikeThis - Odd results - what am I doing wrong?
Thanks David - I suppose it is an AWS question, and thank you for the pointers. As a further input to the MLT question - it does seem that 3.6 behavior is different from 4.2 - the issue seems to be more in terms of the raw query that is generated. I will do some more research and report back with details. David Parks davidpark...@yahoo.com wrote: Isn't this an AWS security groups question? You should probably post this question on the AWS forums, but for the moment, here's the basic reading material - go set up your EC2 security groups and lock down your systems. http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-network-security.html If you just want to password protect Solr, here are the instructions: http://wiki.apache.org/solr/SolrSecurity But I most certainly would not leave it open to the world even with a password (note that basic password authentication sends passwords in clear text if you're not using HTTPS; best lock the thing down behind a firewall). Dave -Original Message- From: DC tech [mailto:dctech1...@gmail.com] Sent: Tuesday, April 02, 2013 1:02 PM To: solr-user@lucene.apache.org Subject: Re: MoreLikeThis - Odd results - what am I doing wrong? OK - so I have my SOLR instance running on AWS. Any suggestions on how to safely share the link? Right now, the whole SOLR instance is totally open. Gagandeep singh gagan.g...@gmail.com wrote: Set debugQuery=true&mlt=true and see the scores for the MLT query, not a sample query. You can use Amazon EC2 to bring up your Solr; you should be able to get a micro instance for the free trial. On Mon, Apr 1, 2013 at 5:10 AM, dc tech dctech1...@gmail.com wrote: I did try the raw query against the *simi* field and those seem to return results in the order expected. For instance, Acura MDX has (large, SUV, 4WD, Luxury) in the simi field. Running a query with those words against the simi field returns the expected models (X5, Audi Q5, etc.) and then the subsequent documents have decreasing relevance.
So the basic query mechanism seems to be fine. The issue just seems to be with the MoreLikeThis component and handler. I can post the index on a public SOLR instance - any suggestions? (or for hosting) On Sun, Mar 31, 2013 at 1:54 PM, Gagandeep singh gagan.g...@gmail.com wrote: If you can bring up your Solr setup on a public machine then I'm sure a lot of debugging can be done. Without that, I think what you should look at are the tf-idf scores of the terms like camry etc. Usually idf is the deciding factor in which results show at the top (tf should be 1 for your data). Enable debugQuery=true and look at the explain section to see how the score is getting calculated. You should try giving different boosts to class, type, drive, and size to control the results. On Sun, Mar 31, 2013 at 8:52 PM, dc tech dctech1...@gmail.com wrote: I am running some experiments on MoreLikeThis and the results seem rather odd - I am doing something wrong but just cannot figure out what. Basically, the similarity results are decent - but not great. *Issue 1 = Quality* Toyota Camry: finds Altima (good) but then the next one is Camry Hybrid, whereas it should have found Accord. I have normalized the data into a simi field which has only the attributes that I care about. Without the simi field, I could not get mlt.qf boosts to work well enough to return results. *Issue 2* Some fields do not work at all. For instance, text+simi (in mlt.fl) works whereas just simi does not. So some weirdness that I am just not understanding. Would be grateful for your guidance! Here is the setup: *1. SOLR Version* solr-spec 4.2.0.2013.03.06.22.32.13 solr-impl 4.2.0 1453694 rmuir - 2013-03-06 22:32:13 lucene-spec 4.2.0 lucene-impl 4.2.0 1453694 - rmuir - 2013-03-06 22:25:29 *2. Machine Information* Sun Microsystems Inc. Java HotSpot(TM) 64-Bit Server VM (1.6.0_23 19.0-b09) Windows 7 Home 64 Bit with 4 GB RAM *3. 
Sample Data * I created this 'dummy' data of cars - the idea being that these would be sufficient and simple to generate similarity and understand how it would work. There are 181 rows in the data set (I have attached it for reference in CSV format) [image: Inline image 1] *4. SCHEMA* *Field Definitions*

<field name="id" type="string" indexed="true" stored="true" termVectors="true" multiValued="false"/>
<field name="make" type="string" indexed="true" stored="true" termVectors="true" multiValued="false"/>
<field name="model" type="string" indexed="true" stored="true" termVectors="true" multiValued="false"/>
<field name="class" type="string" indexed="true" stored="true" termVectors="true" multiValued="false"/>
<field name="type" type="string" indexed="true" stored="true" termVectors="true" multiValued="false"/>
<field name="drive" type="string" indexed="true" stored="true" termVectors="true" multiValued="false"/>
<field
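For reference, an MLT request over this schema might be built as sketched below. The field names simi and text come from the thread; the boosts, the document id, and the mintf/mindf overrides are illustrative guesses (lowering mlt.mintf matters here because the data has tf = 1, and MLT's default minimum term frequency is higher):

```python
from urllib.parse import urlencode

# MoreLikeThis parameters; mlt.fl lists the fields to mine for
# "interesting" terms, mlt.qf weights them in the generated query.
params = {
    "q": "id:camry",                 # hypothetical seed document
    "mlt": "true",
    "mlt.fl": "simi,text",
    "mlt.qf": "simi^2.0 text^0.5",   # illustrative boosts
    "mlt.mintf": "1",
    "mlt.mindf": "1",
    "debugQuery": "true",            # see how the raw MLT query is built
}
mlt_url = "http://localhost:8983/solr/select?" + urlencode(params)
print(mlt_url)
```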
Re: Query parser cuts last letter from search term.
This is called 'stemming', and is caused by this: filter class=solr.SnowballPorterFilterFactory language=English/ It means that all of these terms would match: behave, behaving, behaved (and possibly more), because they would all stem down to 'behav'. This stemming happens at index time and at query time, so stemmed terms are stored in your index, and also, as you are seeing, stemming happens on your query terms. You can use the 'analysis' option in the admin interface to see what happens to terms at query/index time for your various field definitions. Upayavira On Wed, Apr 3, 2013, at 11:25 AM, vsl wrote: Hi, I have a strange problem with a Solr query. I added to my Solr index a new document with the word behave! inside the content. While trying to search for this document using the behave search term it was impossible. Only behave! returns a result. Additionally, the search debug returns the following information: debug: { rawquerystring: behave, querystring: behave, parsedquery: allText:behav, parsedquery_toString: allText:behav, Does anybody know how to deal with such a case? Below is my field type definition. 
Field definition:

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" preserveOriginal="1" types="characters.txt"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" preserveOriginal="1" types="characters.txt"/>
  </analyzer>
</fieldType>

where characters.txt maps § $ % / ( ) = ? + * # ' - (and a few characters stripped by the archive) to ALPHA.
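Upayavira's point - that behave, behaving, and behaved all collapse to the index term behav - can be illustrated with a toy stemmer. This is deliberately NOT the real Snowball algorithm, just a minimal suffix stripper showing why several surface forms share one index term:

```python
# Toy suffix stripping -- illustrative only, not Snowball/Porter.
def toy_stem(word):
    for suffix in ("ing", "ed", "e"):
        # require a reasonable stem length before stripping
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word

print([toy_stem(w) for w in ("behave", "behaving", "behaved")])
```

Because the same stemmer runs at index and query time, the query term behave also becomes behav, which is exactly what the debug output's parsedquery shows.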
Re: Query parser cuts last letter from search term.
So why does Solr not return the proper document? -- View this message in context: http://lucene.472066.n3.nabble.com/Query-parser-cuts-last-letter-from-search-term-tp4053432p4053435.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Flow Chart of Solr
So, all in all, is there anybody who can write down just the main steps of Solr (including parsing, stemming, etc.)? 2013/4/2 Furkan KAMACI furkankam...@gmail.com I think about myself as an example. I have been doing research about Solr for just a few weeks. I have learned Solr and its related projects. My next step is writing down the main steps of Solr. We have separated the learning curve of Solr into two main categories. The first is for people using it as an out-of-the-box component. The second is the developer side, which actually branches into two. The first branch covers the general steps, i.e. a document comes into Solr (e.g. crawled data from Nutch), which analysis processes will be done (stemming etc.), and what happens after parsing, step by step; when a search query happens, what happens step by step, at which step scores are calculated, and so forth. The second branch is more code-specific, i.e. which handlers take in the data that is going to be indexed (no need to explain every handler at this step), which are the analyzer and tokenizer classes and what is the flow between them, and how response handlers work and what they are. Explaining the cloud side is separate work. Some explanations are currently present in the wiki (but some of them are in very deep places in the wiki and it is not easy to find their parent topic; maybe starting the wiki from a top page and branching out to all other topics from it would be better). If we could show the big picture, and beside it the smaller pictures within it, it would be great (if you know the main parts it will be easy to go deep into the code, i.e. you don't need to explain every handler; if you show the way, the developer could debug and find what is needed). Taking myself as an example: I have to write down the steps of Solr in some detail, and even though I have read many wiki pages and a book about it, I see that it is not easy even to write down the big picture of the developer side.
2013/4/2 Alexandre Rafalovitch arafa...@gmail.com Yago, My point - perhaps lost in too much text - was that Solr is presented - and can function - as a black box. Which makes it different from more traditional open-source projects. So, stage 2 happens exactly when the non-programmers have to cross the boundary from the black box into a code-first approach, and the hand-off is not particularly smooth. Or even when - say - a PHP or .NET programmer tries to get beyond the basic operations of their client library and has to understand the server-side aspects of Solr. Regards, Alex. On Tue, Apr 2, 2013 at 1:19 PM, Yago Riveiro yago.rive...@gmail.com wrote: Alexandre, You describe the normal path when a beginner tries to use a body of code they don't understand: black box, reading code, hacking, and "OK, now I know 10% of the project", with luck :p. Personal blog: http://blog.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)
Words being duplicated when highlighting with DictionaryCompoundWordTokenFilterFactory
I'm having issues with highlighting and DictionaryCompoundWordTokenFilterFactory in Solr 3.6.1/3.6.2. It's duplicating/adding words in the highlighted snippet. For example, my dictionary (Dutch) has the following words: premie, beter, ring. If I search for 'verbetering', results with 'verbeteringspremie' are correctly found, but highlighted as follows: Ver<highlight>beter</highlight><highlight>Verbetering</highlight>spremie. Words from the DictionaryCompoundWordTokenFilterFactory dictionary are added to the highlighted item, resulting in all kinds of gibberish. schema.xml: http://pastebin.com/SxGAg52N (the problem is happening for fields of type 'text') solrconfig.xml: http://pastebin.com/MUTkgZJq The only solution I can come up with at the moment is removing those words (beter, ring) from the dictionary (which disables compound-word searching on those words... which is unwanted). Any idea what this could be? I found someone else facing the exact same problem: http://stackoverflow.com/questions/13879349/solr-duplicating-words-in-highlighted-results - unfortunately, no workable solution has been given.
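The underlying mechanism can be sketched with a toy decomposer in the spirit of DictionaryCompoundWordTokenFilter (this is an approximation, not Lucene's algorithm): every dictionary word found inside the compound token is emitted as an extra token at the same position, and those overlapping subword tokens are what the highlighter then marks alongside the original word:

```python
# Toy compound decomposition: find every dictionary word embedded in a
# token. Each hit becomes an extra token overlapping the original.
dictionary = {"premie", "beter", "ring"}

def decompose(token, min_len=3):
    parts = []
    for i in range(len(token)):
        for j in range(i + min_len, len(token) + 1):
            if token[i:j] in dictionary:
                parts.append(token[i:j])
    return parts

print(decompose("verbeteringspremie"))
```

Here 'verbeteringspremie' yields beter, ring, and premie as extra tokens, which matches the subwords showing up inside the highlighted snippet.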
Solr ZooKeeper ensemble with HBase
Hi all, I have a running Hadoop + HBase cluster, and the HBase cluster is running its own ZooKeeper (HBase manages ZooKeeper). I would like to deploy my SolrCloud cluster on a portion of the machines in that cluster. My question is: should I expect any trouble / issues deploying an additional ZooKeeper ensemble? I don't want to use the HBase ZooKeeper because, well, first of all HBase manages it so I'm not sure it's possible, and second, I have HBase working pretty hard at times and I don't want to create any connection issues by overloading ZooKeeper. Thanks, Amit.
Re: Solr 4.2 Cloud Replication Replica has higher version than Master?
Clearing out its tlogs before starting it again may help. - Mark On Apr 2, 2013, at 10:07 PM, Jamie Johnson jej2...@gmail.com wrote: I brought the bad one down and back up and it did nothing. I can clear the index and try 4.2.1. I will save off the logs and see if there is anything else odd. On Apr 2, 2013 9:13 PM, Mark Miller markrmil...@gmail.com wrote: It would appear it's a bug given what you have said. Any other exceptions would be useful. Might be best to start tracking this in a JIRA issue as well. To fix, I'd bring the behind node down and back again. Unfortunately, I'm pressed for time, but we really need to get to the bottom of this and fix it, or determine if it's fixed in 4.2.1 (spreading to mirrors now). - Mark On Apr 2, 2013, at 7:21 PM, Jamie Johnson jej2...@gmail.com wrote: Sorry, I didn't ask the obvious question. Is there anything else that I should be looking for here, and is this a bug? I'd be happy to troll through the logs further if more information is needed, just let me know. Also, what is the most appropriate mechanism to fix this? Is it required to kill the index that is out of sync and let Solr resync things? On Tue, Apr 2, 2013 at 5:45 PM, Jamie Johnson jej2...@gmail.com wrote: sorry for spamming here, shard5-core2 is the instance we're having issues with... 
Apr 2, 2013 7:27:14 PM org.apache.solr.common.SolrException log
SEVERE: shard update error StdNode: http://10.38.33.17:7577/solr/dsc-shard5-core2/:org.apache.solr.common.SolrException: Server at http://10.38.33.17:7577/solr/dsc-shard5-core2 returned non ok status:503, message:Service Unavailable
    at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:373)
    at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
    at org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:332)
    at org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:306)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)

On Tue, Apr 2, 2013 at 5:43 PM, Jamie Johnson jej2...@gmail.com wrote: here is another one that looks interesting

Apr 2, 2013 7:27:14 PM org.apache.solr.common.SolrException log
SEVERE: org.apache.solr.common.SolrException: ClusterState says we are the leader, but locally we don't think so
    at org.apache.solr.update.processor.DistributedUpdateProcessor.doDefensiveChecks(DistributedUpdateProcessor.java:293)
    at org.apache.solr.update.processor.DistributedUpdateProcessor.setupRequest(DistributedUpdateProcessor.java:228)
    at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:339)
    at org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100)
    at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:246)
    at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:173)
    at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1797)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:637)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:343)

On Tue, Apr 2, 2013 at 5:41 PM, Jamie Johnson jej2...@gmail.com wrote: Looking at the master it looks like at some point there were shards that went down. I am seeing things like what is below.

INFO: A cluster state change: WatchedEvent state:SyncConnected type:NodeChildrenChanged path:/live_nodes, has occurred - updating... (live nodes size: 12)
Apr 2, 2013 8:12:52 PM org.apache.solr.common.cloud.ZkStateReader$3 process
INFO: Updating live nodes... (9)
Apr 2, 2013 8:12:52 PM org.apache.solr.cloud.ShardLeaderElectionContext runLeaderProcess
INFO: Running the leader process.
Apr 2, 2013 8:12:52 PM org.apache.solr.cloud.ShardLeaderElectionContext shouldIBeLeader
INFO: Checking if I should try and be the leader.
Apr 2, 2013 8:12:52 PM org.apache.solr.cloud.ShardLeaderElectionContext
Re: Solr 4.2 Cloud Replication Replica has higher version than Master?
No, not that I know of, which is why I say we need to get to the bottom of it. - Mark On Apr 2, 2013, at 10:18 PM, Jamie Johnson jej2...@gmail.com wrote: Mark, is there a particular JIRA issue that you think may address this? I read through them quickly but didn't see one that jumped out. On Apr 2, 2013 10:07 PM, Jamie Johnson jej2...@gmail.com wrote: I brought the bad one down and back up and it did nothing. I can clear the index and try 4.2.1. I will save off the logs and see if there is anything else odd. On Apr 2, 2013 9:13 PM, Mark Miller markrmil...@gmail.com wrote: It would appear it's a bug given what you have said. Any other exceptions would be useful. Might be best to start tracking this in a JIRA issue as well. To fix, I'd bring the behind node down and back again. Unfortunately, I'm pressed for time, but we really need to get to the bottom of this and fix it, or determine if it's fixed in 4.2.1 (spreading to mirrors now). - Mark On Apr 2, 2013, at 7:21 PM, Jamie Johnson jej2...@gmail.com wrote: Sorry, I didn't ask the obvious question. Is there anything else that I should be looking for here, and is this a bug? I'd be happy to troll through the logs further if more information is needed, just let me know. Also, what is the most appropriate mechanism to fix this? Is it required to kill the index that is out of sync and let Solr resync things? On Tue, Apr 2, 2013 at 5:45 PM, Jamie Johnson jej2...@gmail.com wrote: sorry for spamming here, shard5-core2 is the instance we're having issues with... 
Re: Flow Chart of Solr
Sure, yes. But... it comes down to what level of detail you want and need for a specific task. In other words, there are probably a dozen or more levels of detail. The reality is that if you are going to work at the Solr code level, that is very, very different than being a user of Solr, and at that point your first step is to become familiar with the code itself. When you talk about parsing and stemming, you are really talking about the user level, not the Solr code level. Maybe what you really need is a cheat sheet that maps a user-visible feature to the main Solr code component that implements that user feature. There are a number of different forms of parsing in Solr - parsing of what? Queries? Requests? Solr documents? Function queries? Stemming? Well, in truth, Solr doesn't even do stemming - Lucene does that. Lucene does all of the token filtering. Are you asking for details on how Lucene works? Maybe you meant to ask how term analysis works, which is split between Solr and Lucene. Or maybe you simply wanted to know when and where term analysis is done. Tell us your specific problem or specific question and we can probably quickly give you an answer. In truth, NOBODY uses flow charts anymore. Sure, there are some user-level diagrams, but not down to the code level. If you could focus on specific questions, we could give you specific answers. Main steps? That depends on what level you are working at. Tell us what problem you are trying to solve and we can point you to the relevant areas. In truth, if you become generally familiar with Solr at the user level (study the wikis), you will already know what the main steps are. So, it is not "main steps of Solr", but main steps of some specific request of Solr, at a specified level of detail, and for a specified area of Solr if greater detail is needed. Be more specific, and then we can be more specific. 
For now, the general advice for people who need or want to go far beyond the user level is to get familiar with the code - just LOOK at it - a lot of the package and class names are OBVIOUS, really, and follow the class hierarchy and code flow using the standard features of any modern Java IDE. If you are wondering where to start for some specific user-level feature, please ask specifically about that feature. But... make a diligent effort to discover and learn on your own before asking open-ended questions. Sure, there are lots of things in Lucene and Solr that are rather complex and seemingly convoluted, and not obvious, but people are more than willing to help you out if you simply ask a specific question. I mean, not everybody needs to know the fine detail of query parsing, analysis, building a Lucene-level stemmer, etc. If we tried to put all of that in a diagram, most people would be more confused than enlightened. At which step are scores calculated? That's more of a Lucene question. Or, are you really asking what code in Solr invokes Lucene search methods that calculate basic scores? In short, you need to be more specific. Don't force us to guess what problem you are trying to solve. -- Jack Krupansky -Original Message- From: Furkan KAMACI Sent: Wednesday, April 03, 2013 6:52 AM To: solr-user@lucene.apache.org Subject: Re: Flow Chart of Solr
Re: Query parser cuts last letter from search term.
The standard tokenizer recognizes ! as a punctuation character, so it will be treated as white space. You could use the white space tokenizer if punctuation is considered significant.

-- Jack Krupansky

-----Original Message----- From: vsl Sent: Wednesday, April 03, 2013 6:25 AM To: solr-user@lucene.apache.org Subject: Query parser cuts last letter from search term.

Hi, I have a strange problem with a Solr query. I added to my Solr index a new document with the word behave! inside the content. While I was trying to find this document using the search term behave, it was impossible. Only behave! returns a result. Additionally, the search debug returns the following information:

debug: {
  rawquerystring: behave,
  querystring: behave,
  parsedquery: allText:behav,
  parsedquery_toString: allText:behav,

Does anybody know how to deal with such a case? Below is my field type definition:

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" preserveOriginal="1" types="characters.txt"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" preserveOriginal="1" types="characters.txt"/>
  </analyzer>
</fieldType>

where characters.txt contains:

§ = ALPHA
$ = ALPHA
% = ALPHA
= ALPHA
/ = ALPHA
( = ALPHA
) = ALPHA
= = ALPHA
? = ALPHA
+ = ALPHA
* = ALPHA
# = ALPHA
' = ALPHA
- = ALPHA
= ALPHA
= ALPHA

-- View this message in context: http://lucene.472066.n3.nabble.com/Query-parser-cuts-last-letter-from-search-term-tp4053432.html Sent from the Solr - User mailing list archive at Nabble.com.
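Following Jack's suggestion, a sketch of an alternative field type that keeps punctuation such as "!" by tokenizing only on whitespace (the type name here is made up for illustration; any change to an index-time analyzer requires reindexing):

```xml
<!-- Hypothetical field type: tokens are split only on whitespace, so
     "behave!" is indexed as the literal token "behave!". -->
<fieldType name="text_ws_keep_punct" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Note that with this type a query for behave still will not match behave! as an exact token; the usual compromise is to keep the stemmed field for recall and add a second field like this one for punctuation-sensitive matching.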
RE: Confusion over Solr highlight hl.q parameter
Thank you for the response; unfortunately I'm still getting no highlighting hits for this query: ...hl.q={!dismax}text_it_IT:l'assieme...

-----Original Message----- From: Koji Sekiguchi [mailto:k...@r.email.ne.jp] Sent: Tuesday, April 02, 2013 9:00 PM To: solr-user@lucene.apache.org Subject: Re: Confusion over Solr highlight hl.q parameter

(13/04/03 5:27), Van Tassell, Kristian wrote: Thanks Koji, this helped with some of our problems, but it is still not perfect. This query, for example, returns no highlighting: ?q=id:abc123&hl.q=text_it_IT:l'assieme&hl.fl=text_it_IT&hl=true&defType=edismax But this one does (when it is, in effect, the same query): ?q=text_it_IT:l'assieme&hl=true&defType=edismax&hl.fl=text_it_IT I've tried many combinations but can't seem to get the right one to work. Is this possibly a bug?

As hl.q doesn't honor the defType parameter but does honor localParams, can you try putting {!edismax} in the hl.q parameter? koji -- http://soleami.com/blog/lucene-4-is-super-convenient-for-developing-nlp-tools.html
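For reference, Koji's suggestion spelled out as a full request would look roughly like this (note {!edismax}, not the {!dismax} tried above; field name and query value are the ones from the thread, host and path are assumed):

```
http://localhost:8983/solr/select?q=id:abc123&hl=true&hl.fl=text_it_IT&hl.q={!edismax}text_it_IT:l'assieme
```

In a real request the hl.q value should be URL-encoded ({ and } become %7B and %7D).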
Re: Solr 4.2 Cloud Replication Replica has higher version than Master?
Ok, so clearing the transaction log allowed things to go again. I am going to clear the index and try to replicate the problem on 4.2.0, and then I'll try on 4.2.1.

On Wed, Apr 3, 2013 at 8:21 AM, Mark Miller markrmil...@gmail.com wrote: No, not that I know of, which is why I say we need to get to the bottom of it. - Mark

On Apr 2, 2013, at 10:18 PM, Jamie Johnson jej2...@gmail.com wrote: Mark, is there a particular JIRA issue that you think may address this? I read through it quickly but didn't see one that jumped out.

On Apr 2, 2013 10:07 PM, Jamie Johnson jej2...@gmail.com wrote: I brought the bad one down and back up and it did nothing. I can clear the index and try 4.2.1. I will save off the logs and see if there is anything else odd.

On Apr 2, 2013 9:13 PM, Mark Miller markrmil...@gmail.com wrote: It would appear it's a bug given what you have said. Any other exceptions would be useful. Might be best to start tracking in a JIRA issue as well. To fix, I'd bring the behind node down and back again. Unfortunately, I'm pressed for time, but we really need to get to the bottom of this and fix it, or determine if it's fixed in 4.2.1 (spreading to mirrors now). - Mark

On Apr 2, 2013, at 7:21 PM, Jamie Johnson jej2...@gmail.com wrote: Sorry, I didn't ask the obvious question. Is there anything else that I should be looking for here, and is this a bug? I'd be happy to troll through the logs further if more information is needed, just let me know. Also, what is the most appropriate mechanism to fix this? Is it required to kill the index that is out of sync and let Solr resync things?

On Tue, Apr 2, 2013 at 5:45 PM, Jamie Johnson jej2...@gmail.com wrote: sorry for spamming here, shard5-core2 is the instance we're having issues with...
Apr 2, 2013 7:27:14 PM org.apache.solr.common.SolrException log
SEVERE: shard update error StdNode: http://10.38.33.17:7577/solr/dsc-shard5-core2/:org.apache.solr.common.SolrException: Server at http://10.38.33.17:7577/solr/dsc-shard5-core2 returned non ok status:503, message:Service Unavailable
    at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:373)
    at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
    at org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:332)
    at org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:306)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)

On Tue, Apr 2, 2013 at 5:43 PM, Jamie Johnson jej2...@gmail.com wrote: here is another one that looks interesting

Apr 2, 2013 7:27:14 PM org.apache.solr.common.SolrException log
SEVERE: org.apache.solr.common.SolrException: ClusterState says we are the leader, but locally we don't think so
    at org.apache.solr.update.processor.DistributedUpdateProcessor.doDefensiveChecks(DistributedUpdateProcessor.java:293)
    at org.apache.solr.update.processor.DistributedUpdateProcessor.setupRequest(DistributedUpdateProcessor.java:228)
    at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:339)
    at org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100)
    at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:246)
    at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:173)
    at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1797)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:637)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:343)

On Tue, Apr 2, 2013 at 5:41 PM, Jamie Johnson jej2...@gmail.com wrote: Looking at the master, it looks like at some point there were shards that went down. I am seeing things like what is below. INFO: A cluster state change:
Re: is there a way we can build spell dictionary from solr index such that it only takes words, leaving all special characters
hi upayavira, you mean to say that I don't have to follow this: http://wiki.apache.org/solr/SpellCheckComponent and I can directly create a spell check field from a copyField and use it... I don't have to build a dictionary on the field, just use the copyField for spell suggestions? thanks regards Rohan

On Wed, Mar 13, 2013 at 12:56 PM, Upayavira u...@odoko.co.uk wrote: Use text analysis and copyField to create a new field that has terms as you expect them. Then use that for your spellcheck dictionary. Note, since 4.0, you don't need to create a dictionary. Solr can use your index directly. Upayavira

On Wed, Mar 13, 2013, at 06:00 AM, Rohan Thakur wrote: while building the spell dictionary...

On Wed, Mar 13, 2013 at 11:29 AM, Rohan Thakur rohan.i...@gmail.com wrote: I also do not want to break the words, as in samsung to s a m s u n g, or sII to s II, or s2 to s 2.

On Wed, Mar 13, 2013 at 11:28 AM, Rohan Thakur rohan.i...@gmail.com wrote: OK, as in, the field I am indexing from the database, like title, has characters like () - # /n//. Example: Screenguard for Samsung Galaxy SII (Matt and Gloss) (with Dual Protection, Cleaning Cloth and Bubble Remover) or samsung-galaxy-sii-screenguard-matt-and-gloss.html or /s/a/samsung_galaxy_sii_i9100_pink_.jpg or 4.27-inch Touchscreen, 3G, Android v2.3 OS, 8MP Camera with LED Flash. Now I want the spell dictionary to include only the words, and none of - , _ . ( ) /s/a/ or numerics like 4.27. How can I do that? thanks regards Rohan

On Tue, Mar 12, 2013 at 11:06 PM, Alexandre Rafalovitch arafa...@gmail.com wrote: Sorry, leaving them where? Can you give a concrete example or problem. Regards, Alex

On Mar 12, 2013 1:31 PM, Rohan Thakur rohan.i...@gmail.com wrote: hi all, I wanted to know: is there a way we can make a spell dictionary from the Solr index such that it only takes words from the index, leaving out all the special and unwanted characters? thanks regards Rohan
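To make Upayavira's suggestion concrete, a schema sketch along these lines could feed the spellchecker only clean words (all field and type names here are hypothetical, and the exact filter choices depend on the data):

```xml
<!-- Hypothetical spellcheck source field: copy the title in, then strip
     everything that is not a plain lowercase word before it is indexed. -->
<field name="spell" type="text_spell" indexed="true" stored="false" multiValued="true"/>
<copyField source="title" dest="spell"/>

<fieldType name="text_spell" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- drop non-letter characters entirely, so "4.27-inch" contributes "inch" -->
    <filter class="solr.PatternReplaceFilterFactory" pattern="[^a-z]" replacement="" replace="all"/>
    <!-- discard tokens that became empty or too short after stripping -->
    <filter class="solr.LengthFilterFactory" min="2" max="50"/>
  </analyzer>
</fieldType>
```

The spellcheck component would then point at this field (e.g. via its field/dictionary configuration in solrconfig.xml); since Solr 4.0 it can read the field's indexed terms directly instead of a separately built dictionary.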
Re: Solr metrics in Codahale metrics and Graphite?
On 3/29/2013 12:07 PM, Walter Underwood wrote: What are folks using for this? I don't know that this really answers your question, but Solr 4.1 and later includes a big chunk of codahale metrics internally for request handler statistics - see SOLR-1972. First we tried including the jar and using the API, but that created thread leak problems, so the source code was added. Thanks, Shawn
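As a sketch of one common approach for getting these numbers into Graphite without touching Solr internals: poll the admin mbeans handler (which exposes the request-handler statistics Shawn mentions) and push the values from an external script. The core name below is an assumption:

```
http://localhost:8983/solr/collection1/admin/mbeans?stats=true&wt=json
```

The JSON response includes per-handler stats (requests, errors, average time per request, and so on) that map naturally onto Graphite metrics.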
Re: Synonyms problem
On 3/29/2013 12:14 PM, Plamen Mihaylov wrote: Can I ask you another question: I have Magento + Solr and have a requirement to create an admin magento module, where I can add/remove synonyms dynamically. Is this possible? I searched google but it seems not possible. If you change the synonym list that you are using in your index analyzer chain, you must rebuild your entire index. If you don't, the updated synonyms will only affect newly added records. This is because the index analyzer is only applied at index time. Thanks, Shawn
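One detail worth adding to Shawn's point: the reindex requirement applies to index-time synonyms. If the synonym filter sits only in the query analyzer, an edited synonyms.txt takes effect after a core reload with no reindex, at the cost of the known weaknesses of query-time multi-word synonyms. The filter line itself is standard:

```xml
<analyzer type="query">
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
  <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
```

For an admin module that edits synonyms dynamically, query-time synonyms plus a reload request is therefore the simpler integration; index-time synonyms give better scoring and phrase behaviour but force a full rebuild on every change.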
Question on Exact Matches - edismax
Hi All, I have a requirement wherein exact matches on 2 fields (Series Title, Title) should be ranked higher than partial matches. The configuration looks like below:

<requestHandler name="assetdismax" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="echoParams">explicit</str>
    <float name="tie">0.01</float>
    <str name="qf">pg_series_title_ci^500 title_ci^300 pg_series_title^200 title^25 classifications^15 classifications_texts^15 parent_classifications^10 synonym_classifications^5 pg_brand_title^5 pg_series_working_title^5 p_programme_title^5 p_item_title^5 p_interstitial_title^5 description^15 pg_series_description annotations^0.1 classification_notes^0.05 pv_program_version_number^2 pv_program_version_number_ci^2 pv_program_number^2 pv_program_number_ci^2 p_program_number^2 ma_version_number^2 ma_recording_location ma_contributions^0.001 rel_pg_series_title rel_programme_title rel_programme_number rel_programme_number_ci pg_uuid^0.5 p_uuid^0.5 pv_uuid^0.5 ma_uuid^0.5</str>
    <str name="pf">pg_series_title_ci^500 title_ci^500</str>
    <int name="ps">0</int>
    <str name="q.alt">*:*</str>
    <str name="mm">100%</str>
    <str name="q.op">AND</str>
    <str name="facet">true</str>
    <str name="facet.limit">-1</str>
    <str name="facet.mincount">1</str>
  </lst>
</requestHandler>

As you can see above, the search is against many fields. What I want is that documents with exact matches on the series title and title fields should rank higher than the rest. I have added 2 case-insensitive fields (pg_series_title_ci, title_ci) for series title and title, and have boosted them higher than the tokenized fields and the rest. I have also implemented a similarity class to override idf; however, I still get documents having partial matches in title and other fields ranking higher than an exact match in pg_series_title_ci. Many Thanks, Sandeep
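For comparison, the usual way to build a "_ci" exact-match field like the ones boosted above is a copyField into a type whose tokenizer emits the whole field value as a single lowercased token, so only a full-field match can hit it. The field names below mirror the thread, but the definition itself is a sketch, not the poster's actual schema:

```xml
<!-- Sketch: whole-value, case-insensitive matching. A query only matches
     title_ci when it equals the entire original title (ignoring case). -->
<fieldType name="string_ci" class="solr.TextField" sortMissingLast="true">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<field name="title_ci" type="string_ci" indexed="true" stored="false"/>
<copyField source="title" dest="title_ci"/>
```

With this in place, debugQuery=on against the edismax handler shows whether the _ci clause actually fired for a given query; if it never matches, the ^500 boost has nothing to boost, which is a common reason partial matches still win.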
Re: solre scores remains same for exact match and nearly exact match
Thanks. I added a copy field and that fixed the issue.

On Wed, Apr 3, 2013 at 12:29 PM, Gora Mohanty-3 [via Lucene] ml-node+s472066n4053412...@n3.nabble.com wrote: On 3 April 2013 10:52, amit wrote: Below is my query http://localhost:8983/solr/select/?q=subject:session management in phpfq=category:[*%20TO%20*]fl=category,score,subject [...] Add debugQuery=on to your Solr URL, and you will get an explanation of the score. Your subject field is tokenised, so that there is no a priori reason that an exact match should score higher. Several strategies are available if you want that behaviour. Try searching Google, e.g., for solr exact match higher score. Regards, Gora

-- View this message in context: http://lucene.472066.n3.nabble.com/solre-scores-remains-same-for-exact-match-and-nearly-exact-match-tp4053406p4053478.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr 4.2 Cloud Replication Replica has higher version than Master?
Something interesting that I'm noticing as well: I just indexed 300,000 items, and somehow 300,020 ended up in the index. I thought perhaps I messed something up, so I started the indexing again and indexed another 400,000, and I see 400,064 docs. Is there a good way to find possible duplicates? I had tried to facet on key (our id field) but that didn't give me anything with more than a count of 1.

On Wed, Apr 3, 2013 at 9:22 AM, Jamie Johnson jej2...@gmail.com wrote: [earlier messages in this thread quoted; trimmed]
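To hunt for duplicate keys, one option (assuming the unique key field is named key, as in the message above) is a facet request that only returns values occurring more than once:

```
/solr/select?q=*:*&rows=0&facet=true&facet.field=key&facet.mincount=2&facet.limit=-1
```

In SolrCloud this facets across all shards, so a document that landed on two different shards under the same key should show up with a count of 2 even though a per-shard facet would report 1.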
Re: Solr ZooKeeper ensemble with HBase
Hello, Amit: My guess is that, if HBase is working hard, you're going to have more trouble with HBase and Solr on the same nodes than HBase and Solr sharing a Zookeeper. Solr's usage of Zookeeper is very minimal. Michael Della Bitta Appinions 18 East 41st Street, 2nd Floor New York, NY 10017-6271 www.appinions.com Where Influence Isn’t a Game On Wed, Apr 3, 2013 at 8:06 AM, Amit Sela am...@infolinks.com wrote: Hi all, I have a running Hadoop + HBase cluster and the HBase cluster is running it's own zookeeper (HBase manages zookeeper). I would like to deploy my SolrCloud cluster on a portion of the machines on that cluster. My question is: Should I have any trouble / issues deploying an additional ZooKeeper ensemble ? I don't want to use the HBase ZooKeeper because, well first of all HBase manages it so I'm not sure it's possible and second I have HBase working pretty hard at times and I don't want to create any connection issues by overloading ZooKeeper. Thanks, Amit.
Re: Flow Chart of Solr
There are three books on Solr, two with that in the title, and one, Taming Text, each of which has been very valuable in understanding Solr. Jack

On Wed, Apr 3, 2013 at 5:25 AM, Jack Krupansky j...@basetechnology.com wrote: [Jack's earlier reply in this thread quoted; trimmed]
Re: Solr ZooKeeper ensemble with HBase
Trouble in what way? If I have enough memory - HBase RegionServer 10GB and maybe 2GB for Solr? - or do you mean CPU / disk?

On Wed, Apr 3, 2013 at 5:54 PM, Michael Della Bitta michael.della.bi...@appinions.com wrote: [earlier reply quoted; trimmed]
Re: Solr ZooKeeper ensemble with HBase
Solr heavily uses RAM for disk caching, so depending on your index size and what you intend to do with it, 2 GB could easily not be enough. We run with 6 GB heaps on 34 GB boxes, and the remaining RAM is there solely to act as a disk cache. We're on EC2, though, so unless you're using the SSD instances, the disks are slow. Might not be a problem for you. Also, things like faceting and sorting can heavily hit the CPU. Michael Della Bitta Appinions 18 East 41st Street, 2nd Floor New York, NY 10017-6271 www.appinions.com Where Influence Isn’t a Game

On Wed, Apr 3, 2013 at 11:55 AM, Amit Sela am...@infolinks.com wrote: [earlier messages in this thread quoted; trimmed]
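For what it's worth, running a second ZooKeeper ensemble alongside HBase's managed one mainly requires giving it its own dataDir and a clientPort that doesn't clash with HBase's (usually 2181). A zoo.cfg sketch for a 3-node ensemble, with hostnames, ports, and paths as examples only:

```
# zoo.cfg sketch for a standalone 3-node ensemble dedicated to SolrCloud.
# All values below are illustrative, not recommendations.
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/lib/zookeeper-solr
clientPort=2182   # non-default, to avoid colliding with HBase's ZK on 2181
server.1=zk1.example.com:2889:3889
server.2=zk2.example.com:2889:3889
server.3=zk3.example.com:2889:3889
```

Solr would then be started with -DzkHost=zk1.example.com:2182,zk2.example.com:2182,zk3.example.com:2182 pointing at this ensemble rather than HBase's.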
Re: Upgrade Solr3.5 to Solr4.1 - Index Reformat ?
On 4/1/2013 12:19 PM, feroz_kh wrote: Hi Shawn, I tried optimizing using this command... curl 'http://localhost:/solr/update?optimize=true&maxSegments=10&waitFlush=true' And I got this response within seconds...

<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">840</int></lst>
</response>

Is this a valid response that one should get? I checked the statistics link from the /solr/admin page and it shows the number of segments got updated. Would this be a good indication that the optimization is complete? At the same time, I noticed the number of files in the data/index directory hasn't reduced, and not all files are updated. Since it took just a couple of seconds for the response (even with waitFlush=true), I doubt the optimization really happened, but the details on the statistics page show me the correct number of segments.

That looks like a valid success response. An optimize in Solr defaults to one segment. You asked it to do ten segments. Either you already had fewer than 10 segments, or it was able to find some very small segments to merge in order to get below 10. When you are optimizing in order to upgrade the index format, you should leave maxSegments off or set it to 1. Thanks, Shawn
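Per Shawn's advice, forcing the whole index to be rewritten into a single segment (and hence the new format) would look something like the following; the host, port, and core path are assumptions, and expect this to take far longer than a couple of seconds on a large index:

```
curl 'http://localhost:8983/solr/update?optimize=true&maxSegments=1&waitFlush=true'
```

Afterwards the data/index directory should shrink to one segment's worth of files (old files linger until the searcher releases them).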
Re: Lengthy description is converted to hash symbols
Yes... the str.. / is what I see in the admin console when I perform a search for the document. Currently, I am using SolrJ and the addBean() method to update the core. What's strange is that in our QA env the document indexed correctly, but in prod I see hash symbols, and thus any user search against that field fails to find the document. Btw, I see no errors in the logs! -- View this message in context: http://lucene.472066.n3.nabble.com/Lengthy-description-is-converted-to-hash-symbols-tp4053338p4053505.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Flow Chart of Solr
And another one on the way: http://www.amazon.com/Lucene-Solr-Definitive-comprehensive-realtime/dp/1449359957 Hopefully that help a lot as well. Plenty of diagrams. Lots of examples. -- Jack Krupansky -Original Message- From: Jack Park Sent: Wednesday, April 03, 2013 11:25 AM To: solr-user@lucene.apache.org Subject: Re: Flow Chart of Solr There are three books on Solr, two with that in the title, and one, Taming Text, each of which have been very valuable in understanding Solr. Jack On Wed, Apr 3, 2013 at 5:25 AM, Jack Krupansky j...@basetechnology.com wrote: Sure, yes. But... it comes down to what level of detail you want and need for a specific task. In other words, there are probably a dozen or more levels of detail. The reality is that if you are going to work at the Solr code level, that is very, very different than being a user of Solr, and at that point your first step is to become familiar with the code itself. When you talk about parsing and stemming, you are really talking about the user-level, not the Solr code level. Maybe what you really need is a cheat sheet that maps a user-visible feature to the main Solr code component for that implements that user feature. There are a number of different forms of parsing in Solr - parsing of what? Queries? Requests? Solr documents? Function queries? Stemming? Well, in truth, Solr doesn't even do stemming - Lucene does that. Lucene does all of the token filtering. Are you asking for details on how Lucene works? Maybe you meant to ask how term analysis works, which is split between Solr and Lucene. Or maybe you simply wanted to know when and where term analysis is done. Tell us your specific problem or specific question and we can probably quickly give you an answer. In truth, NOBODY uses flow charts anymore. Sure, there are some user-level diagrams, but not down to the code level. If you could focus on specific questions, we could give you specific answers. Main steps? That depends on what level you are working at. 
Tell us what problem you are trying to solve and we can point you to the relevant areas. In truth, if you become generally familiar with Solr at the user level (study the wikis), you will already know what the main steps are. So, it is not the main steps of Solr, but the main steps of some specific request to Solr, for a specified level of detail, and for a specified area of Solr if greater detail is needed. Be more specific, and then we can be more specific. For now, the general advice for people who need or want to go far beyond the user level is to get familiar with the code - just LOOK at it - a lot of the package and class names are OBVIOUS, really - and follow the class hierarchy and code flow using the standard features of any modern Java IDE. If you are wondering where to start for some specific user-level feature, please ask specifically about that feature. But... make a diligent effort to discover and learn on your own before asking open-ended questions. Sure, there are lots of things in Lucene and Solr that are rather complex and seemingly convoluted, and not obvious, but people are more than willing to help you out if you simply ask a specific question. I mean, not everybody needs to know the fine detail of query parsing, analysis, building a Lucene-level stemmer, etc. If we tried to put all of that in a diagram, most people would be more confused than enlightened. At which step are scores calculated? That's more of a Lucene question. Or, are you really asking what code in Solr invokes the Lucene search methods that calculate basic scores? In short, you need to be more specific. Don't force us to guess what problem you are trying to solve. -- Jack Krupansky -Original Message- From: Furkan KAMACI Sent: Wednesday, April 03, 2013 6:52 AM To: solr-user@lucene.apache.org Subject: Re: Flow Chart of Solr So, all in all, is there anybody who can write down just the main steps of Solr (including parsing, stemming, etc.)? 
2013/4/2 Furkan KAMACI furkankam...@gmail.com I think about myself as an example. I have started to do research about Solr just for some weeks. I have learned Solr and its related projects. My next step is writing down the main steps of Solr. We have separated the learning curve of Solr into two main categories. The first one is for those who are using it as out-of-the-box components. The second one is the developer side. Actually, the developer side branches into two ways. The first one is the general steps of it, i.e. a document comes into Solr (i.e. crawled data of Nutch), which analyzing processes are going to be done (stemming, hamming, etc.), and what will be done after parsing, step by step. When a search query happens, what happens step by step, at which step scores are calculated, and so on and so forth. The second one is more code specific, i.e. which handlers take into account the data that is going to be indexed (no need to explain every handler at this step). Which are the analyzer, tokenizer classes and what are the
Re: Lengthy description is converted to hash symbols
Show us the exact query URL as well as the request handler defaults. Make sure to try to do an explicit query on the field that has the # value. QA and prod may differ because maybe QA got completely reindexed more recently and maybe prod hasn't been fully reindexed recently. Maybe the schema changed but a full reindex wasn't done. -- Jack Krupansky -Original Message- From: Danny Watari Sent: Wednesday, April 03, 2013 12:15 PM To: solr-user@lucene.apache.org Subject: Re: Lengthy description is converted to hash symbols Yes... the str.. / is what I see in the admin console when I perform a search for the document. Currently, I am using solrj and the addBean() method to update the core. What's strange is that in our QA env the document indexed correctly, but in prod I see hash symbols, and thus any user search against that field fails to find the document. Btw, I see no errors in the logs! -- View this message in context: http://lucene.472066.n3.nabble.com/Lengthy-description-is-converted-to-hash-symbols-tp4053338p4053505.html Sent from the Solr - User mailing list archive at Nabble.com.
SolrCloud not distributing documents across shards
So we have 3 servers in a SolrCloud cluster. http://lucene.472066.n3.nabble.com/file/n4053506/Cloud1.png We have 2 shards for our collection (classic_bt), with a shard on each of the first two servers as the picture shows. The third server has replicas of the first 2 shards, just for high availability purposes. Now if we go into counts we have the following information: shard1 - Numdocs - 33010 shard2 - Numdocs - 85934 Both shards replicate to the third server with no issues. For some reason the documents aren't being distributed across the shards; nothing in the logs indicates a problem, but I'm not sure what we should be looking for. Let me know if you need more information. -- View this message in context: http://lucene.472066.n3.nabble.com/SolrCloud-not-distributing-documents-across-shards-tp4053506.html
Re: Filtering Search Cloud
On 4/1/2013 3:02 PM, Furkan KAMACI wrote: I want to separate my cloud into two logical parts. One of them is the indexer cloud of SolrCloud. The second one is the searcher cloud of SolrCloud. My first question is this: does separating my cloud system make sense for performance improvement? I think that when indexing, searching takes longer to respond, and if I separate them I get a performance improvement. On the other hand, maybe using all Solr machines as a whole (I mean not partitioning as I mentioned) lets SolrCloud do better load balancing; I would want to learn that. My second question is this: let's assume that I have separated my machines as I mentioned. Can I filter some indexes to be searchable or not from the searcher SolrCloud? SolrCloud gets rid of the master and slave designations. It also gets rid of the line between indexing and querying. Each shard has a replica that is designated the leader, but that has no real impact on searching and indexing, only on deciding which data to use when replicas get out of sync. In the old master-slave architecture, you indexed to the master and the updated index files were replicated to the slave. The slave did not handle the analysis for indexing, so it was usually better to send queries to slaves and let the master only do indexing. SolrCloud is very different. When you index, the documents are indexed on all replicas at about the same time. When you query, the requests are load balanced across all replicas. During normal operation, SolrCloud does not use replication at all. The replication feature is only used when a replica gets out of sync with the leader, and in that case, the entire index is replicated. Thanks, Shawn
Re: Solr 4.2 Cloud Replication Replica has higher version than Master?
Since I don't have that many items in my index I exported all of the keys for each shard and wrote a simple java program that checks for duplicates. I found some duplicate keys on different shards, and a grep of the files for the keys found does indicate that they made it to the wrong places. If you notice, documents with the same ID are on shard 3 and shard 5. Is it possible that the hash is being calculated taking into account only the live nodes? I know that we don't specify the numShards param @ startup, so could this be what is happening? grep -c 7cd1a717-3d94-4f5d-bcb1-9d8a95ca78de * shard1-core1:0 shard1-core2:0 shard2-core1:0 shard2-core2:0 shard3-core1:1 shard3-core2:1 shard4-core1:0 shard4-core2:0 shard5-core1:1 shard5-core2:1 shard6-core1:0 shard6-core2:0 On Wed, Apr 3, 2013 at 10:42 AM, Jamie Johnson jej2...@gmail.com wrote: Something interesting that I'm noticing as well: I just indexed 300,000 items, and somehow 300,020 ended up in the index. I thought perhaps I messed something up so I started the indexing again and indexed another 400,000 and I see 400,064 docs. Is there a good way to find possible duplicates? I had tried to facet on key (our id field) but that didn't give me anything with more than a count of 1. On Wed, Apr 3, 2013 at 9:22 AM, Jamie Johnson jej2...@gmail.com wrote: Ok, so clearing the transaction log allowed things to go again. I am going to clear the index and try to replicate the problem on 4.2.0 and then I'll try on 4.2.1 On Wed, Apr 3, 2013 at 8:21 AM, Mark Miller markrmil...@gmail.com wrote: No, not that I know of, which is why I say we need to get to the bottom of it. - Mark On Apr 2, 2013, at 10:18 PM, Jamie Johnson jej2...@gmail.com wrote: Mark Is there a particular JIRA issue that you think may address this? I read through it quickly but didn't see one that jumped out On Apr 2, 2013 10:07 PM, Jamie Johnson jej2...@gmail.com wrote: I brought the bad one down and back up and it did nothing. I can clear the index and try 4.2.1. 
I will save off the logs and see if there is anything else odd On Apr 2, 2013 9:13 PM, Mark Miller markrmil...@gmail.com wrote: It would appear it's a bug given what you have said. Any other exceptions would be useful. Might be best to start tracking in a JIRA issue as well. To fix, I'd bring the behind node down and back again. Unfortunately, I'm pressed for time, but we really need to get to the bottom of this and fix it, or determine if it's fixed in 4.2.1 (spreading to mirrors now). - Mark On Apr 2, 2013, at 7:21 PM, Jamie Johnson jej2...@gmail.com wrote: Sorry I didn't ask the obvious question. Is there anything else that I should be looking for here and is this a bug? I'd be happy to troll through the logs further if more information is needed, just let me know. Also what is the most appropriate mechanism to fix this. Is it required to kill the index that is out of sync and let solr resync things? On Tue, Apr 2, 2013 at 5:45 PM, Jamie Johnson jej2...@gmail.com wrote: sorry for spamming here shard5-core2 is the instance we're having issues with... 
Apr 2, 2013 7:27:14 PM org.apache.solr.common.SolrException log
SEVERE: shard update error StdNode: http://10.38.33.17:7577/solr/dsc-shard5-core2/:org.apache.solr.common.SolrException: Server at http://10.38.33.17:7577/solr/dsc-shard5-core2 returned non ok status:503, message:Service Unavailable
    at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:373)
    at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
    at org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:332)
    at org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:306)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)
On Tue, Apr 2, 2013 at 5:43 PM, Jamie Johnson jej2...@gmail.com wrote: here is another one that looks interesting
Apr 2, 2013 7:27:14 PM org.apache.solr.common.SolrException log
SEVERE: org.apache.solr.common.SolrException: ClusterState says we are the leader, but locally we don't think so
    at org.apache.solr.update.processor.DistributedUpdateProcessor.doDefensiveChecks(DistributedUpdateProcessor.java:293)
    at
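The duplicate-key check described above (export the keys for each shard, then compare) doesn't need a custom Java program; a minimal shell sketch, assuming one exported key file per leader core (the shardN-coreN.txt file names are hypothetical, matching the grep output above):

```shell
# Hypothetical sketch: each file holds one document key per line,
# exported from one shard's leader core. A key printed by find_dups
# appears in more than one shard, i.e. it was routed to the wrong place.
find_dups() {
  sort "$@" | uniq -d
}
# Usage (file names are assumptions):
# find_dups shard1-core1.txt shard2-core1.txt shard3-core1.txt \
#           shard4-core1.txt shard5-core1.txt shard6-core1.txt
```

Comparing only the leader core of each shard avoids false positives from the replica copies, which legitimately hold the same keys as their leader.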
SolrException: Error opening new searcher
We're suddenly seeing an error when trying to do updates/commits. This is on Solr 4.2 (Tomcat, solr war deployed to webapps, on Linux SuSE 11). Based off of some initial searching on things related to this issue, I have set ulimit in Linux to 'unlimited' and verified that Tomcat has enough memory for the virtual memory needed to run the Solr index (which is 1.1GB in size). Does anyone have any ideas?

11:25:41 SEVERE UpdateLog Error opening realtime searcher for deleteByQuery: org.apache.solr.common.SolrException: Error opening new searcher
11:25:39 SEVERE UpdateLog Replay exception: final commit. java.io.IOException: Map failed
    at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:761)
    at org.apache.lucene.store.MMapDirectory.map(MMapDirectory.java:283)
    at org.apache.lucene.store.MMapDirectory$MMapIndexInput.<init>(MMapDirectory.java:228)
    at org.apache.lucene.store.MMapDirectory.openInput(MMapDirectory.java:195)
    at org.apache.lucene.codecs.lucene41.Lucene41PostingsReader.<init>(Lucene41PostingsReader.java:81)
    at org.apache.lucene.codecs.lucene41.Lucene41PostingsFormat.fieldsProducer(Lucene41PostingsFormat.java:430)
    at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsReader.<init>(PerFieldPostingsFormat.java:194)
    at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat.fieldsProducer(PerFieldPostingsFormat.java:233)
    at org.apache.lucene.index.SegmentCoreReaders.<init>(SegmentCoreReaders.java:127)
    at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:56)
    at org.apache.lucene.index.ReadersAndLiveDocs.getReader(ReadersAndLiveDocs.java:121)
    at org.apache.lucene.index.BufferedDeletesStream.applyDeletes(BufferedDeletesStream.java:269)
    at org.apache.lucene.index.IndexWriter.applyAllDeletes(IndexWriter.java:2961)
    at org.apache.lucene.index.IndexWriter.maybeApplyDeletes(IndexWriter.java:2952)
    at org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:2692)
    at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:2827)
    at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:2807)
    at org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:541)
    at org.apache.solr.update.UpdateLog$LogReplayer.doReplay(UpdateLog.java:1341)
    at org.apache.solr.update.UpdateLog$LogReplayer.run(UpdateLog.java:1160)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:619)
Caused by: java.lang.OutOfMemoryError: Map failed
    at sun.nio.ch.FileChannelImpl.map0(Native Method)
    at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:758)
    ... 28 more

SolrConfig:

<query>
  <useColdSearcher>true</useColdSearcher>
  <maxBooleanClauses>1024</maxBooleanClauses>
  <filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="0"/>
  <queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>
  <documentCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>
  <queryResultWindowSize>20</queryResultWindowSize>
  <queryResultMaxDocsCached>200</queryResultMaxDocsCached>
  <maxWarmingSearchers>6</maxWarmingSearchers>
</query>
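Since "Map failed" comes from mmap rather than the Java heap, two OS-level limits are worth checking beyond Tomcat's memory settings; a sketch (standard Linux paths, and the 262144 value is just a commonly suggested starting point, not a recommendation):

```shell
# "java.io.IOException: Map failed" from MMapDirectory usually means the OS
# refused a new memory mapping, not that the Java heap is exhausted.

# Per-process virtual address space limit; MMapDirectory wants "unlimited".
ulimit -v

# Linux kernel cap on distinct mmap regions per process; many segments plus
# frequent searcher reopens can exhaust it. Raise it with something like:
#   sysctl -w vm.max_map_count=262144
cat /proc/sys/vm/max_map_count 2>/dev/null || echo "not available"
```

Note that these must be checked in the environment of the Tomcat process itself (e.g. its init script), since a ulimit set in an interactive shell does not necessarily apply to the service.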
Re: Lengthy description is converted to hash symbols
I looked at the text via the admin analysis tool. The text appeared to be ok! Unfortunately, the description is client data... so I can't post it here, but I do not see any issues when running the analysis tool. -- View this message in context: http://lucene.472066.n3.nabble.com/Lengthy-description-is-converted-to-hash-symbols-tp4053338p4053526.html
Re: Solr ZooKeeper ensemble with HBase
It will be limited by disk IO until you get the caches full. Then it will be limited by CPU. wunder On Apr 3, 2013, at 8:55 AM, Amit Sela am...@infolinks.com wrote: Trouble in what way? If I have enough memory - HBase RegionServer 10GB and maybe 2GB for Solr? - or do you mean CPU / disk? On Wed, Apr 3, 2013 at 5:54 PM, Michael Della Bitta michael.della.bi...@appinions.com wrote: Hello, Amit: My guess is that, if HBase is working hard, you're going to have more trouble with HBase and Solr on the same nodes than with HBase and Solr sharing a Zookeeper. Solr's usage of Zookeeper is very minimal. Michael Della Bitta Appinions 18 East 41st Street, 2nd Floor New York, NY 10017-6271 www.appinions.com Where Influence Isn’t a Game On Wed, Apr 3, 2013 at 8:06 AM, Amit Sela am...@infolinks.com wrote: Hi all, I have a running Hadoop + HBase cluster, and the HBase cluster is running its own ZooKeeper (HBase manages ZooKeeper). I would like to deploy my SolrCloud cluster on a portion of the machines in that cluster. My question is: should I have any trouble / issues deploying an additional ZooKeeper ensemble? I don't want to use the HBase ZooKeeper because, first of all, HBase manages it so I'm not sure it's possible, and second, I have HBase working pretty hard at times and I don't want to create any connection issues by overloading ZooKeeper. Thanks, Amit.
Re: maxWarmingSearchers in Solr 4.
On 4/3/2013 1:48 AM, Dotan Cohen wrote: I have been dragging the same solrconfig.xml from Solr 3.x to 4.0 to 4.1, with no customization (bad, bad me!). I'm now looking into customizing it and I see that the Solr 4.1 solrconfig.xml is much simpler and shorter. Is this simply because many of the examples have been removed? In particular, I notice that there is no mention of maxWarmingSearchers in the Solr 4.1 solrconfig.xml. I assume that I can simply add it in, are there any other critical config options that are missing that I should be looking into as well? Would I be better off using the old Solr 3.x solrconfig.xml in Solr 4.1 as it contains so many examples? In situations where I don't want to change the default value, I prefer to leave config elements out of the solrconfig. It makes the config smaller, and it also makes it so that I will automatically see benefits from the default changing in new versions. In the case of maxWarmingSearchers, I would hope that you have your system set up so that you would never need more than 1 warming searcher at a time. If you do a commit while a previous commit is still warming, Solr will try to create a second warming searcher. I went poking in the code, and it seems that maxWarmingSearchers defaults to Integer.MAX_VALUE. I'm not sure whether this is a bad default or not. It does mean that a pathological setup without maxWarmingSearchers in the config will probably blow up with an OutOfMemory exception, but is that better or worse than commits that don't make new documents searchable? I can see arguments either way. Thanks, Shawn
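For reference, putting the setting back is a single top-level element inside solrconfig.xml; a minimal sketch (the value 2 here is only the value the old example configs shipped with, not a recommendation):

```xml
<!-- Caps how many searchers may be warming concurrently. Commits that would
     exceed the cap fail with a "maxWarmingSearchers exceeded" error instead
     of letting warming searchers pile up and exhaust memory. -->
<maxWarmingSearchers>2</maxWarmingSearchers>
```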
Re: Flow Chart of Solr
Jack, Is that new book up to the 4.+ series? Thanks The other Jack On Wed, Apr 3, 2013 at 9:19 AM, Jack Krupansky j...@basetechnology.com wrote: And another one on the way: http://www.amazon.com/Lucene-Solr-Definitive-comprehensive-realtime/dp/1449359957 Hopefully that helps a lot as well. Plenty of diagrams. Lots of examples. [...]
Re: Flow Chart of Solr
We're using the 4.x branch code as the basis for our writing. So, effectively it will be for at least 4.3 when the book comes out in the summer. Early access will be in about a month or so. O'Reilly will be showing a galley proof for 200 pages of the book next week at Big Data TechCon in Boston. -- Jack Krupansky -Original Message- From: Jack Park Sent: Wednesday, April 03, 2013 12:56 PM To: solr-user@lucene.apache.org Subject: Re: Flow Chart of Solr Jack, Is that new book up to the 4.+ series? Thanks The other Jack [...]
Re: Solr 4.2 Cloud Replication Replica has higher version than Master?
no, my thought was wrong; it appears that even with the parameter set I am seeing this behavior. I've been able to duplicate it on 4.2.0 by indexing 100,000 documents on 10 threads (10,000 each) when I get to 400,000 or so. I will try this on 4.2.1 to see if I see the same behavior On Wed, Apr 3, 2013 at 12:37 PM, Jamie Johnson jej2...@gmail.com wrote: Since I don't have that many items in my index I exported all of the keys for each shard and wrote a simple java program that checks for duplicates. I found some duplicate keys on different shards, a grep of the files for the keys found does indicate that they made it to the wrong places. [...]
Re: Query parser cuts last letter from search term.
On Wed, Apr 3, 2013, at 11:36 AM, vsl wrote: So why Solr does not return proper document? You're gonna have to give us a bit more than that. What is wrong with the documents it is returning? Upayavira
Re: Solr Multiword Search
I have been trying to use the MultiWordSpellingQueryConverter.java since I need to be able to find the documents that correspond to the suggested collations. At the moment it seems to be producing collations based on word matches, and arbitrary words from the field are picked up to form a collation, so nothing corresponds to any of the titles in our set of indexed documents. Could anyone please confirm that this would work if I took the following steps? Steps: 1. Get the solr4.2.war file. 2. Get to the WEB-INF lib and add the lucene-core-4.2.0.jar and the solr-core-4.2.0.jar to the classpath to compile the MultiWordSpellingQueryConverter.java. The code for this is in my previous post in this thread. 3. jar cvf multiwordspellchecker.jar com/foo/MultiWordSpellingQueryConverter.java 4. Copy this jar to the $SOLR_HOME/lib directory. 5. Define queryConverter. Question: Where does this need to go? I have just put this somewhere between the searchComponent and the requestHandler for spell checks. 6. Start webserver. I see this jar file getting registered at startup: 2013-04-03 12:56:22,243 INFO [org.apache.solr.core.SolrResourceLoader] (coreLoadExecutor-3-thread-1) Adding 'file:/solr/lib/multiwordspellchecker.jar' to classloader 7. When I run the spell query, I don't see my print statements, so I am not sure if this code is really being called. I don't think it's the logging that is failing; rather, the code is not being called at all. I would appreciate any information on what I might be doing wrong. Please help. Thanks. Regards, -- Sandeep -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-Multiword-Search-tp4053038p4053534.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Out of memory on some faceting queries
On 4/2/2013 3:09 AM, Dotan Cohen wrote: I notice that this only occurs on queries that run facets. I start Solr with the following command: sudo nohup java -XX:NewRatio=1 -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -Dsolr.solr.home=/mnt/SolrFiles100/solr -jar /opt/solr-4.1.0/example/start.jar It looks like you've followed some advice that I gave previously on how to tune java. I have since learned that this advice is bad, it results in long GC pauses, even with heaps that aren't huge. As others have pointed out, you don't have a max heap setting, which would mean that you're using whatever Java chooses for its default, which might not be enough. If you can get Solr to successfully run for a while with queries and updates happening, the heap should eventually max out and the admin UI will show you what Java is choosing by default. Here is what I would now recommend for a beginning point on your Solr startup command. You may need to increase the heap beyond 4GB, but be careful that you still have enough free memory to be able to do effective caching of your index. sudo nohup java -Xms4096M -Xmx4096M -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:NewRatio=3 -XX:MaxTenuringThreshold=8 -XX:+CMSParallelRemarkEnabled -XX:+ParallelRefProcEnabled -XX:+UseLargePages -XX:+AggressiveOpts -Dsolr.solr.home=/mnt/SolrFiles100/solr -jar /opt/solr-4.1.0/example/start.jar If you are running a really old build of java (latest versions on Oracle's website are 1.6 build 43 and 1.7 build 17), you might want to leave AggressiveOpts out. Some people would argue that you should never use that option. Thanks, Shawn
Re: SolrCloud not distributing documents across shards
Hello Vytenis, What exactly do you mean by aren't distributing across the shards? Do you mean that POSTs against the server for shard 1 never end up resulting in documents saved in shard 2? Michael Della Bitta Appinions 18 East 41st Street, 2nd Floor New York, NY 10017-6271 www.appinions.com Where Influence Isn’t a Game On Wed, Apr 3, 2013 at 12:31 PM, vsilgalis vsilga...@gmail.com wrote: So we have 3 servers in a SolrCloud cluster. http://lucene.472066.n3.nabble.com/file/n4053506/Cloud1.png We have 2 shards for our collection (classic_bt) with a shard on each of the first two servers as the picture shows. The third server has replicas of the first 2 shards just for high availability purposes. Now if we go into counts we have the following information: shard1 - Numdocs - 33010 shard2 - Numdocs - 85934 Both shards replicate to the third server with no issues. For some reason the documents aren't distributing across the shards, nothing in the logs indicates a problem but I'm not sure what we should be looking for. Let me know if you need more information. -- View this message in context: http://lucene.472066.n3.nabble.com/SolrCloud-not-distributing-documents-across-shards-tp4053506.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Lengthy description is converted to hash symbols
Here is a query that should return 2 documents... but it only returns 1. /solr/m7779912/select?indent=on&version=2.2&q=description%3Agateway&fq=&start=0&rows=10&fl=description&qt=&wt=&explainOther=&hl.fl= Oddly enough, the descriptions of the two documents are exactly the same, except one is indexed correctly and the other contains the hash symbols. Btw, when the core was created, it was built from scratch via POJOs and the addBeans() method. -- View this message in context: http://lucene.472066.n3.nabble.com/Lengthy-description-is-converted-to-hash-symbols-tp4053338p4053544.html Sent from the Solr - User mailing list archive at Nabble.com.
Solr Tika Override
I am researching Solr and seeing if it would be a good fit for a document search service I am helping to develop. One of the requirements is that we will need to be able to customize how file contents are parsed beyond the default configurations that are offered out of the box by Tika. For example, we know that we will be indexing .pdf files that will contain a cover page with a project start date, and would like to pull this date out into a searchable field that is separate from the file content. I have seen several sources saying you can do this by overriding the ExtractingRequestHandler.createFactory() method, but I have not been able to find much documentation on how to implement a new parser. Can someone point me in the right direction on where to look, or let me know if the scenario I described above is even possible? -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-Tika-Override-tp4053552.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr 4.2 Cloud Replication Replica has higher version than Master?
Thanks for digging Jamie. In 4.2, hash ranges are assigned up front when a collection is created - each shard gets a range, which is stored in zookeeper. You should not be able to end up with the same id on different shards - something very odd going on. Hopefully I'll have some time to try and help you reproduce. Ideally we can capture it in a test case. - Mark On Apr 3, 2013, at 1:13 PM, Jamie Johnson jej2...@gmail.com wrote: no, my thought was wrong, it appears that even with the parameter set I am seeing this behavior. I've been able to duplicate it on 4.2.0 by indexing 100,000 documents on 10 threads (10,000 each) when I get to 400,000 or so. I will try this on 4.2.1 to see if I see the same behavior. On Wed, Apr 3, 2013 at 12:37 PM, Jamie Johnson jej2...@gmail.com wrote: Since I don't have that many items in my index I exported all of the keys for each shard and wrote a simple java program that checks for duplicates. I found some duplicate keys on different shards; a grep of the files for the keys found does indicate that they made it to the wrong places. If you notice, documents with the same ID are on shard 3 and shard 5. Is it possible that the hash is being calculated taking into account only the live nodes? I know that we don't specify the numShards param @ startup so could this be what is happening? grep -c 7cd1a717-3d94-4f5d-bcb1-9d8a95ca78de * shard1-core1:0 shard1-core2:0 shard2-core1:0 shard2-core2:0 shard3-core1:1 shard3-core2:1 shard4-core1:0 shard4-core2:0 shard5-core1:1 shard5-core2:1 shard6-core1:0 shard6-core2:0 On Wed, Apr 3, 2013 at 10:42 AM, Jamie Johnson jej2...@gmail.com wrote: Something interesting that I'm noticing as well: I just indexed 300,000 items, and somehow 300,020 ended up in the index. I thought perhaps I messed something up so I started the indexing again and indexed another 400,000 and I see 400,064 docs. Is there a good way to find possible duplicates? 
I had tried to facet on key (our id field) but that didn't give me anything with more than a count of 1. On Wed, Apr 3, 2013 at 9:22 AM, Jamie Johnson jej2...@gmail.com wrote: Ok, so clearing the transaction log allowed things to go again. I am going to clear the index and try to replicate the problem on 4.2.0 and then I'll try on 4.2.1 On Wed, Apr 3, 2013 at 8:21 AM, Mark Miller markrmil...@gmail.com wrote: No, not that I know of, which is why I say we need to get to the bottom of it. - Mark On Apr 2, 2013, at 10:18 PM, Jamie Johnson jej2...@gmail.com wrote: Mark Is there a particular JIRA issue that you think may address this? I read through it quickly but didn't see one that jumped out On Apr 2, 2013 10:07 PM, Jamie Johnson jej2...@gmail.com wrote: I brought the bad one down and back up and it did nothing. I can clear the index and try 4.2.1. I will save off the logs and see if there is anything else odd On Apr 2, 2013 9:13 PM, Mark Miller markrmil...@gmail.com wrote: It would appear it's a bug given what you have said. Any other exceptions would be useful. Might be best to start tracking in a JIRA issue as well. To fix, I'd bring the behind node down and back again. Unfortunately, I'm pressed for time, but we really need to get to the bottom of this and fix it, or determine if it's fixed in 4.2.1 (spreading to mirrors now). - Mark On Apr 2, 2013, at 7:21 PM, Jamie Johnson jej2...@gmail.com wrote: Sorry I didn't ask the obvious question. Is there anything else that I should be looking for here and is this a bug? I'd be happy to troll through the logs further if more information is needed, just let me know. Also what is the most appropriate mechanism to fix this? Is it required to kill the index that is out of sync and let solr resync things? On Tue, Apr 2, 2013 at 5:45 PM, Jamie Johnson jej2...@gmail.com wrote: sorry for spamming here shard5-core2 is the instance we're having issues with... 
Apr 2, 2013 7:27:14 PM org.apache.solr.common.SolrException log SEVERE: shard update error StdNode: http://10.38.33.17:7577/solr/dsc-shard5-core2/:org.apache.solr.common.SolrException : Server at http://10.38.33.17:7577/solr/dsc-shard5-core2 returned non ok status:503, message:Service Unavailable at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:373) at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181) at org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:332) at org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:306) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439) at
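Mark's description of up-front hash ranges can be sketched in a few lines. This is a minimal illustration of range-based routing, not Solr's implementation: String.hashCode() stands in for the MurmurHash3 that Solr actually uses, and the class and method names are made up for this example.

```java
// Minimal sketch of range-based document routing as described above:
// each shard owns a fixed slice of the 32-bit hash space, recorded in
// clusterstate.json when the collection is created. String.hashCode()
// stands in for the MurmurHash3 Solr actually uses.
public class RangeRouter {

    private final int numShards;

    public RangeRouter(int numShards) {
        this.numShards = numShards;
    }

    // Map a document id onto one of numShards contiguous hash ranges.
    public int shardFor(String id) {
        // shift the signed 32-bit hash into the range 0 .. 2^32-1
        long hash = (long) id.hashCode() - Integer.MIN_VALUE;
        long rangeSize = (1L << 32) / numShards;
        // clamp so the last shard absorbs the division remainder
        return (int) Math.min(hash / rangeSize, numShards - 1);
    }

    public static void main(String[] args) {
        RangeRouter router = new RangeRouter(6);
        // Routing depends only on the id, never on which nodes are live,
        // so one id should never legitimately land on two shards.
        String id = "7cd1a717-3d94-4f5d-bcb1-9d8a95ca78de";
        System.out.println("shard index: " + router.shardFor(id));
    }
}
```

The point of the fixed ranges is exactly the symptom in this thread: if assignment instead depended on the set of live nodes, the same id could hash to different shards at different times.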
RE: AW: AW: java.lang.OutOfMemoryError: Map failed
I just posted a similar error and discovered that decreasing the Xmx fixed the problem for me. The free command/top, etc. indicated I was flying just below the threshold for my allowed memory, and with swap/virtual space available, so I'm still confused as to what the issue is, but you may try this in your configurations to see if it helps. -Original Message- From: Per Steffensen [mailto:st...@designware.dk] Sent: Tuesday, April 02, 2013 6:09 AM To: solr-user@lucene.apache.org Subject: Re: AW: AW: java.lang.OutOfMemoryError: Map failed I have seen the exact same on Ubuntu Server 12.04. It helped adding some swap space, but I do not understand why this is necessary, since the OS ought to just use the actual memory mapped files if there is not room in (virtual) memory, swapping pages in and out on demand. Note that I saw this for memory mapped files opened for read+write - not in the exact same context as you see it, where MMapDirectory is trying to map memory mapped files. If you find a solution/explanation, please post it here. I really want to know more about why FileChannel.map can cause OOM. I do not think the OOM is a real OOM indicating no more space on the Java heap, but is more an exception saying that the OS has no more memory (in some interpretation of that). Regards, Per Steffensen On 4/2/13 11:32 AM, Arkadi Colson wrote: It is running as root: root@solr01-dcg:~# ps aux | grep tom root 1809 10.2 67.5 49460420 6931232 ? Sl Mar28 706:29 /usr/bin/java -Djava.util.logging.config.file=/usr/local/tomcat/conf/logging.properties -server -Xms2048m -Xmx6144m -XX:PermSize=64m -XX:MaxPermSize=128m -XX:+UseG1GC -verbose:gc -Xloggc:/solr/tomcat-logs/gc.log -XX:+PrintGCTimeStamps -XX:+PrintGCDetails -Duser.timezone=UTC -Dfile.encoding=UTF8 -Dsolr.solr.home=/opt/solr/ -Dport=8983 -Dcollection.configName=smsc -DzkClientTimeout=2 -DzkHost=solr01-dcg.intnet.smartbit.be:2181,solr01-gs.intnet.smartbit.be:2181,solr02-dcg.intnet.smartbit.be:2181,solr02-gs.intnet.smartbit.be:2181,solr03-dcg.intnet.smartbit.be:2181,solr03-gs.intnet.smartbit.be:2181 -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port= -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false -Djava.endorsed.dirs=/usr/local/tomcat/endorsed -classpath /usr/local/tomcat/bin/bootstrap.jar:/usr/local/tomcat/bin/tomcat-juli.jar -Dcatalina.base=/usr/local/tomcat -Dcatalina.home=/usr/local/tomcat -Djava.io.tmpdir=/usr/local/tomcat/temp org.apache.catalina.startup.Bootstrap start Arkadi On 04/02/2013 11:29 AM, André Widhani wrote: The output is from the root user. Are you running Solr as root? If not, please try again using the operating system user that runs Solr. André From: Arkadi Colson [ark...@smartbit.be] Sent: Tuesday, April 2, 2013 11:26 To: solr-user@lucene.apache.org Cc: André Widhani Subject: Re: AW: java.lang.OutOfMemoryError: Map failed Hmmm I checked it and it seems to be ok: root@solr01-dcg:~# ulimit -v unlimited Any other tips or do you need more debug info? BR On 04/02/2013 11:15 AM, André Widhani wrote: Hi Arkadi, this error usually indicates that virtual memory is not sufficient (should be unlimited). Please see http://comments.gmane.org/gmane.comp.jakarta.lucene.solr.user/69168 Regards, André From: Arkadi Colson [ark...@smartbit.be] Sent: Tuesday, April 2, 2013 10:24 To: solr-user@lucene.apache.org Subject: java.lang.OutOfMemoryError: Map failed Hi Recently solr crashed. I've found this in the error log. My commit settings are looking like this: <autoCommit> <maxTime>1</maxTime> <openSearcher>false</openSearcher> </autoCommit> <autoSoftCommit> <maxTime>2000</maxTime> </autoSoftCommit> The machine has 10GB of memory. Tomcat is running with -Xms2048m -Xmx6144m Versions Solr: 4.2 Tomcat: 7.0.33 Java: 1.7 Anybody any idea? Thx! 
Arkadi SEVERE: auto commit error...:org.apache.solr.common.SolrException: Error opening new searcher at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1415) at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1527) at org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:562) at org.apache.solr.update.CommitTracker.run(CommitTracker.java:216) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334) at java.util.concurrent.FutureTask.run(FutureTask.java:166) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178) at
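The checks discussed in this thread can be run together. A sketch of the usual suspects behind mmap "Map failed" errors; run these as the operating-system user that actually runs Solr/Tomcat, since (as André points out) root's limits can differ:

```shell
# Limits worth checking when MMapDirectory fails with "Map failed".
ulimit -v   # virtual address space limit; should print "unlimited"
ulimit -n   # open-file limit; mmapped index segments count against it
# On Linux there is also a per-process cap on the number of mmap regions:
[ -r /proc/sys/vm/max_map_count ] && cat /proc/sys/vm/max_map_count || true
```

A large heap (-Xmx) leaves less address space and OS memory for the memory-mapped index, which is consistent with the report above that lowering Xmx made the error go away.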
RE: Solr Multiword Search
You have specified spellcheck.q in your query. The whole purpose of spellcheck.q is to bypass any query converter you've configured, giving it raw keywords instead. But possibly a custom query converter is not your best answer? I agree that charles to charlie is an edit distance of 2, so if everything is set up correctly then DirectSolrSpellChecker with maxEdits=2 should find it. The collate functionality as you have it set up would check the index and only give you re-written queries that are guaranteed to return hits. But there is a big caveat: if the word charles occurs at all in the dictionary (because any document in your index contains it), then the spellchecker (by default) assumes it is a correctly-spelled word and will not try to correct it. In this case, specify spellcheck.alternativeTermCount with a non-zero value. (See http://wiki.apache.org/solr/SpellCheckComponent#spellcheck.alternativeTermCount) James Dyer Ingram Content Group (615) 213-4311 -Original Message- From: skmirch [mailto:skmi...@hotmail.com] Sent: Wednesday, April 03, 2013 12:19 PM To: solr-user@lucene.apache.org Subject: Re: Solr Multiword Search I have been trying to use the MultiWordSpellingQueryConverter.java since I need to be able to find the documents that correspond to the suggested collations. At the moment it seems to be producing collations based on word matches, and arbitrary words from the field are picked up to form a collation, so nothing corresponds to any of the titles in our set of indexed documents. Could anyone please confirm that this would work if I took the following steps? Steps: 1. Get the solr4.2.war file. 2. Get to the WEB-INF lib and add the lucene-core-4.2.0.jar and the solr-core-4.2.0.jar to the classpath to compile the MultiWordSpellingQueryConverter.java. The code for this is in my previous post in this thread. 3. jar cvf multiwordspellchecker.jar com/foo/MultiWordSpellingQueryConverter.java 4. Copy this jar to the $SOLR_HOME/lib directory. 5. 
Define queryConverter. Question: Where does this need to go? I have just put this somewhere between the searchComponent and the requestHandler for spell checks. 6. Start webserver. I see this jar file getting registered at startup: 2013-04-03 12:56:22,243 INFO [org.apache.solr.core.SolrResourceLoader] (coreLoadExecutor-3-thread-1) Adding 'file:/solr/lib/multiwordspellchecker.jar' to classloader 7. When I run the spell query, I don't see my print statements, so I am not sure if this code is really being called. I don't think it's the logging that is failing; rather, the code is not being called at all. I would appreciate any information on what I might be doing wrong. Please help. Thanks. Regards, -- Sandeep -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-Multiword-Search-tp4053038p4053534.html Sent from the Solr - User mailing list archive at Nabble.com.
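James's suggestions map to solrconfig.xml roughly as follows. This is a sketch, not a drop-in config: the component name and the field name are assumed, but maxEdits, spellcheck.alternativeTermCount, and the collation parameters are the actual knobs he refers to:

```xml
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="name">default</str>
    <str name="field">title</str>  <!-- assumed field name -->
    <str name="classname">solr.DirectSolrSpellChecker</str>
    <!-- maxEdits=2 so that e.g. charles -> charlie is reachable -->
    <str name="maxEdits">2</str>
  </lst>
</searchComponent>

<requestHandler name="/spell" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="spellcheck">true</str>
    <!-- also suggest for terms that DO occur in the index -->
    <str name="spellcheck.alternativeTermCount">5</str>
    <!-- only return collations that actually produce hits -->
    <str name="spellcheck.collate">true</str>
    <str name="spellcheck.maxCollationTries">10</str>
  </lst>
  <arr name="last-components">
    <str>spellcheck</str>
  </arr>
</requestHandler>
```

Note that per James's first point, queries sent with spellcheck.q bypass any configured query converter entirely.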
Re: SolrCloud not distributing documents across shards
Michael Della Bitta-2 wrote Hello Vytenis, What exactly do you mean by aren't distributing across the shards? Do you mean that POSTs against the server for shard 1 never end up resulting in documents saved in shard 2? So we indexed a set of 33010 documents on server01 which are now in shard1. And we kicked off a set of 85934 documents on server02 which are now in shard2 (as tests). In my understanding of how SolrCloud works, the documents should be distributed across the shards in the collection. Now I have seen this work before in my environment. Not sure what I need to look at to ensure this distribution. Just as a FYI, this is SOLR 4.1 -- View this message in context: http://lucene.472066.n3.nabble.com/SolrCloud-not-distributing-documents-across-shards-tp4053506p4053563.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr 4.2 Cloud Replication Replica has higher version than Master?
Where is this information stored in ZK? I don't see it in the cluster state (or perhaps I don't understand it ;) ). Perhaps something with my process is broken. What I do when I start from scratch is the following ZkCLI -cmd upconfig ... ZkCLI -cmd linkconfig but I don't ever explicitly create the collection. What should the steps from scratch be? I am moving from an unreleased snapshot of 4.0 so I never did that previously either so perhaps I did create the collection in one of my steps to get this working but have forgotten it along the way. On Wed, Apr 3, 2013 at 2:16 PM, Mark Miller markrmil...@gmail.com wrote: Thanks for digging Jamie. [...]
Re: Solr 4.2 Cloud Replication Replica has higher version than Master?
It should be part of your clusterstate.json. Some users have reported trouble upgrading a previous zk install when this change came. I recommended manually updating the clusterstate.json to have the right info, and that seemed to work. Otherwise, I guess you have to start from a clean zk state. If you don't have that range information, I think there will be trouble. Do you have a router type defined in the clusterstate.json? - Mark On Apr 3, 2013, at 2:24 PM, Jamie Johnson jej2...@gmail.com wrote: Where is this information stored in ZK? [...]
Re: SolrCloud not distributing documents across shards
: So we indexed a set of 33010 documents on server01 which are now in shard1. : And we kicked off a set of 85934 documents on server02 which are now in : shard2 (as tests). In my understanding of how SolrCloud works, the : documents should be distributed across the shards in the collection. Now I I'm not familiar with the details, but i've seen miller respond to a similar question with reference to the issue of not explicitly specifying numShards when creating your collections... http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201303.mbox/%3c0aa0b422-f1de-4915-b602-53cb18492...@gmail.com%3E -Hoss
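The fix referenced in that thread generally amounts to passing numShards the first time the collection's nodes are started, since it is read once at collection creation and the resulting hash ranges are stored in ZooKeeper. A sketch of a first-node startup for Solr 4.x; the ZooKeeper hosts, config paths, and names below are placeholders:

```shell
# First start of the first node: numShards is read once, when the
# collection is created, and the shard hash ranges are recorded in ZK.
java -DzkHost=zk1:2181,zk2:2181,zk3:2181 \
     -Dbootstrap_confdir=./solr/collection1/conf \
     -Dcollection.configName=myconf \
     -DnumShards=2 \
     -jar start.jar
```

Without numShards, the collection falls back to implicit routing, where documents simply stay on whichever shard received them, which matches the behavior Vytenis describes.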
Re: It seems a issue of deal with chinese synonym for solr
On 3/11/13 6:15 PM, 李威 wrote: In org.apache.solr.parser.SolrQueryParserBase, there is a function: protected Query newFieldQuery(Analyzer analyzer, String field, String queryText, boolean quoted) throws SyntaxError The code below can't process Chinese correctly: BooleanClause.Occur occur = positionCount > 1 && operator == AND_OPERATOR ? BooleanClause.Occur.MUST : BooleanClause.Occur.SHOULD; For example, "北京市" and "北京" are synonyms; if I search for 北京市动物园, the expected parse result is +(北京市 北京) +动物园, but actually it is parsed to +北京市 +北京 +动物园. The code can process English, because English words are separated by spaces and each occupies only one position. An interesting feature of this example is that the difference between the two synonyms is the omission of one token, 市 (city). Doesn't the same problem happen if we define London City and London as synonyms, and execute a query like London City Zoo? Must a Chinese analyzer be used to reproduce this problem? I tried to test this but I couldn't reproduce it. The result of query string expansion using Solr 4.2's query interface with debug output shows: <str name="parsedquery">MultiPhraseQuery(text:(london london) city zoo)</str> I see no plus (+). What query parser did you use? -- Kuro Kurosaka
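For reference, query-time synonyms are wired up in schema.xml roughly like this. A sketch with an assumed field type name and synonyms file; multi-token entries such as the 北京市/北京 or London City/London pairs above are exactly the case where query-time synonym expansion is known to misbehave, because the query parser splits on whitespace before the analyzer runs:

```xml
<!-- sketch: a field type with query-time synonym expansion -->
<fieldType name="text_syn" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- synonyms.txt would contain lines like:
         london city, london -->
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Many setups sidestep the multi-word problem by applying synonyms at index time instead (analyzer type="index"), at the cost of reindexing whenever the synonym list changes.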
Re: Solr 4.2 Cloud Replication Replica has higher version than Master?
The router says implicit. I did start from a blank zk state but perhaps I missed one of the ZkCLI commands? One of my shards from the clusterstate.json is shown below. What is the process that should be done to bootstrap a cluster other than the ZkCLI commands I listed above? My process right now is run those ZkCLI commands and then start solr on all of the instances with a command like this java -server -Dshard=shard5 -DcoreName=shard5-core1 -Dsolr.data.dir=/solr/data/shard5-core1 -Dcollection.configName=solr-conf -Dcollection=collection1 -DzkHost=so-zoo1:2181,so-zoo2:2181,so-zoo3:2181 -Djetty.port=7575 -DhostPort=7575 -jar start.jar I feel like maybe I'm missing a step. shard5:{ state:active, replicas:{ 10.38.33.16:7575_solr_shard5-core1:{ shard:shard5, state:active, core:shard5-core1, collection:collection1, node_name:10.38.33.16:7575_solr, base_url:http://10.38.33.16:7575/solr;, leader:true}, 10.38.33.17:7577_solr_shard5-core2:{ shard:shard5, state:recovering, core:shard5-core2, collection:collection1, node_name:10.38.33.17:7577_solr, base_url:http://10.38.33.17:7577/solr}}} On Wed, Apr 3, 2013 at 2:40 PM, Mark Miller markrmil...@gmail.com wrote: It should be part of your clusterstate.json. Some users have reported trouble upgrading a previous zk install when this change came. I recommended manually updating the clusterstate.json to have the right info, and that seemed to work. Otherwise, I guess you have to start from a clean zk state. If you don't have that range information, I think there will be trouble. Do you have an router type defined in the clusterstate.json? - Mark On Apr 3, 2013, at 2:24 PM, Jamie Johnson jej2...@gmail.com wrote: Where is this information stored in ZK? I don't see it in the cluster state (or perhaps I don't understand it ;) ). Perhaps something with my process is broken. What I do when I start from scratch is the following ZkCLI -cmd upconfig ... ZkCLI -cmd linkconfig but I don't ever explicitly create the collection. 
What should the steps from scratch be? I am moving from an unreleased snapshot of 4.0 so I never did that previously either so perhaps I did create the collection in one of my steps to get this working but have forgotten it along the way. On Wed, Apr 3, 2013 at 2:16 PM, Mark Miller markrmil...@gmail.com wrote: Thanks for digging Jamie. In 4.2, hash ranges are assigned up front when a collection is created - each shard gets a range, which is stored in zookeeper. You should not be able to end up with the same id on different shards - something very odd going on. Hopefully I'll have some time to try and help you reproduce. Ideally we can capture it in a test case. - Mark On Apr 3, 2013, at 1:13 PM, Jamie Johnson jej2...@gmail.com wrote: no, my thought was wrong, it appears that even with the parameter set I am seeing this behavior. I've been able to duplicate it on 4.2.0 by indexing 100,000 documents on 10 threads (10,000 each) when I get to 400,000 or so. I will try this on 4.2.1. to see if I see the same behavior On Wed, Apr 3, 2013 at 12:37 PM, Jamie Johnson jej2...@gmail.com wrote: Since I don't have that many items in my index I exported all of the keys for each shard and wrote a simple java program that checks for duplicates. I found some duplicate keys on different shards, a grep of the files for the keys found does indicate that they made it to the wrong places. If you notice documents with the same ID are on shard 3 and shard 5. Is it possible that the hash is being calculated taking into account only the live nodes? I know that we don't specify the numShards param @ startup so could this be what is happening? 
grep -c 7cd1a717-3d94-4f5d-bcb1-9d8a95ca78de *
shard1-core1:0
shard1-core2:0
shard2-core1:0
shard2-core2:0
shard3-core1:1
shard3-core2:1
shard4-core1:0
shard4-core2:0
shard5-core1:1
shard5-core2:1
shard6-core1:0
shard6-core2:0

On Wed, Apr 3, 2013 at 10:42 AM, Jamie Johnson jej2...@gmail.com wrote: Something interesting that I'm noticing as well: I just indexed 300,000 items, and somehow 300,020 ended up in the index. I thought perhaps I messed something up, so I started the indexing again and indexed another 400,000, and I see 400,064 docs. Is there a good way to find possible duplicates? I had tried to facet on key (our id field) but that didn't give me anything with more than a count of 1. On Wed, Apr 3, 2013 at 9:22 AM, Jamie Johnson jej2...@gmail.com wrote: Ok, so clearing the transaction log allowed things to go again. I am going to clear the index and try to replicate the problem on 4.2.0 and then I'll try on 4.2.1. On Wed,
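The export-and-compare step Jamie describes (dumping the keys of each core and checking for IDs that appear on more than one shard) can also be done with standard shell tools instead of a custom Java program. The sketch below assumes one exported key file per core, one doc ID per line; the sample files and IDs are fabricated just to make the demo self-contained.

```shell
# Sketch: find doc IDs that appear in more than one core dump.
# Assumes one key file per core (one ID per line); the files
# created here are fake sample data for illustration.
dir=$(mktemp -d)
printf 'id-1\nid-2\n' > "$dir/shard1-core1"
printf 'id-3\nid-4\n' > "$dir/shard3-core1"
printf 'id-2\nid-5\n' > "$dir/shard5-core1"
# Any ID printed here is present in two or more core dumps:
sort "$dir"/shard*-core* | uniq -d
# -> id-2
```

To ignore legitimate copies on replicas of the same shard, you would dump keys from only one core per shard before comparing.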
Re: SolrCloud not distributing documents across shards
Chris Hostetter-3 wrote: I'm not familiar with the details, but I've seen Miller respond to a similar question with reference to the issue of not explicitly specifying numShards when creating your collections... http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201303.mbox/% 3C0AA0B422-F1DE-4915-B602-53CB1849204A@ %3E -Hoss

Well, theoretically we are okay there. The commands we run to create our collection are as follows (note the numShards being specified):

http://server01/solr/admin/cores?action=CREATE&name=classic_bt&collection=classic_bt&numShards=2&instanceDir=instances/basistech&dataDir=/opt/index/classic_bt&config=solrconfig.xml&schema=schema.xml&collection.configName=classic_bt
http://server02/solr/admin/cores?action=CREATE&name=classic_bt&collection=classic_bt&numShards=2&instanceDir=instances/basistech&dataDir=/opt/index/classic_bt&config=solrconfig.xml&schema=schema.xml&collection.configName=classic_bt
http://server03/solr/admin/cores?action=CREATE&name=classic_bt_shard1&collection=classic_bt&numShards=2&instanceDir=instances/basistech&dataDir=/opt/index/classic_bt_shard1&config=solrconfig.xml&schema=schema.xml&collection.configName=classic_bt&shard=shard1
http://server03/solr/admin/cores?action=CREATE&name=classic_bt_shard2&collection=classic_bt&numShards=2&instanceDir=instances/basistech&dataDir=/opt/index/classic_bt_shard2&config=solrconfig.xml&schema=schema.xml&collection.configName=classic_bt&shard=shard2

-- View this message in context: http://lucene.472066.n3.nabble.com/SolrCloud-not-distributing-documents-across-shards-tp4053506p4053581.html Sent from the Solr - User mailing list archive at Nabble.com.
HTML entities being missed by DIH HTMLStripTransformer
Hi, I am using DIH to index some database fields. These fields contain HTML-formatted text in them. I use the 'HTMLStripTransformer' to remove that markup. This works fine when the text is, for example: <li>Item One</li> or <b>This is in Bold</b>. However, when the text has HTML entity names, as in: &lt;li&gt;Item One&lt;/li&gt; or &lt;b&gt;This is in Bold&lt;/b&gt;, NOTHING HAPPENS. Two questions: (1) Is this the expected behavior of the DIH HTMLStripTransformer? (2) If yes, is there another transformer that I can employ first to turn these HTML entities into their usual symbols, which can then be removed by the DIH HTMLStripTransformer? Thanks - ashok -- View this message in context: http://lucene.472066.n3.nabble.com/HTML-entities-being-missed-by-DIH-HTMLStripTransformer-tp4053582.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: HTML entities being missed by DIH HTMLStripTransformer
On 4 April 2013 00:30, Ashok ash...@qualcomm.com wrote: [...] Two questions. (1) Is this the expected behavior of DIH HTMLStripTransformer? Yes, I believe so. (2) If yes, is there another transformer that I can employ first to turn these HTML entities into their usual symbols that can then be removed by the DIH HTMLStripTransformer? How are the HTML tags getting converted into entities? Are you escaping input somewhere? Regards, Gora
Re: Filtering Search Cloud
Shawn, thanks for your detailed explanation. My system will work under high load. I mean, I will always be indexing something, and something will always be queried on my system. That is why I am considering physically separating the indexer and query-reply machines. Think about it: imagine a machine that both does indexing (one kind of disk I/O; I don't know the underlying system, maybe Solr does sequential I/O) and tries to answer queries (another kind of I/O) at the same time. That is my main challenge in deciding whether to separate them. And the next step is: if I separate them, can I filter the data of the indexer machines before the response? (I don't have any filtering issues right now, I just think that maybe I will need it in the future.) 2013/4/3 Shawn Heisey s...@elyograg.org On 4/1/2013 3:02 PM, Furkan KAMACI wrote: I want to separate my cloud into two logical parts. One of them is the indexer cloud of SolrCloud. The second one is the searcher cloud of SolrCloud. My first question: does separating my cloud system make sense for performance improvement? Because I think that when indexing, searching takes time to respond, and if I separate them I get a performance improvement. On the other hand, maybe using all Solr machines as a whole (I mean not partitioning as I mentioned) lets SolrCloud do better load balancing; I would want to learn that. My second question: let's assume that I have separated my machines as I mentioned. Can I filter some indexes to be searchable or not from the searcher SolrCloud? SolrCloud gets rid of the master and slave designations. It also gets rid of the line between indexing and querying. Each shard has a replica that is designated the leader, but that has no real impact on searching and indexing, only on deciding which data to use when replicas get out of sync. In the old master-slave architecture, you indexed to the master and the updated index files were replicated to the slave. 
The slave did not handle the analysis for indexing, so it was usually better to send queries to slaves and let the master only do indexing. SolrCloud is very different. When you index, the documents are indexed on all replicas at about the same time. When you query, the requests are load balanced across all replicas. During normal operation, SolrCloud does not use replication at all. The replication feature is only used when a replica gets out of sync with the leader, and in that case, the entire index is replicated. Thanks, Shawn
Re: Solr 4.2 Cloud Replication Replica has higher version than Master?
If you don't specify numShards after 4.1, you get an implicit doc router and it's up to you to distribute updates. In the past, partitioning was done on the fly - but for shard splitting and perhaps other features, we now divvy up the hash range up front based on numShards and store it in ZooKeeper. No numShards is now how you take complete control of updates yourself. - Mark On Apr 3, 2013, at 2:57 PM, Jamie Johnson jej2...@gmail.com wrote: The router says implicit. I did start from a blank zk state but perhaps I missed one of the ZkCLI commands? [...]
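Mark's point about up-front hash partitioning can be illustrated with a toy sketch. This is not Solr's actual router (the compositeId router uses MurmurHash3 over hash ranges pre-assigned per shard and stored in ZooKeeper); the function below is a stand-in. It only demonstrates the property that matters in this thread: with hash routing, the target shard is a pure function of the doc ID and numShards, so it cannot drift with the set of live nodes.

```shell
# Toy hash router (NOT Solr's MurmurHash3/range implementation):
# shard = f(id, numShards), independent of which nodes are live.
num_shards=6
shard_for() {
  h=$(printf '%s' "$1" | cksum | cut -d ' ' -f 1)
  echo "shard$(( h % num_shards + 1 ))"
}
shard_for 7cd1a717-3d94-4f5d-bcb1-9d8a95ca78de
shard_for 7cd1a717-3d94-4f5d-bcb1-9d8a95ca78de   # same ID, same shard, every time
```

Without numShards (the implicit router), there is no such function at all, and the same ID can legitimately be sent to different shards, which matches the duplicates observed above.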
Re: HTML entities being missed by DIH HTMLStripTransformer
Well, the database field has text, sometimes with HTML entities and at other times with html tags. I have no control over the process that populates the database tables with info. -- View this message in context: http://lucene.472066.n3.nabble.com/HTML-entities-being-missed-by-DIH-HTMLStripTransformer-tp4053582p4053586.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr 4.2 Cloud Replication Replica has higher version than Master?
Ah, interesting... so I need to specify numShards, blow out zk, and then try this again to see if things work properly now. What is really strange is that for the most part things have worked right, and on 4.2.1 I have 600,000 items indexed with no duplicates. In any event I will specify numShards, clear out zk, and begin again. If this works properly, what should the router type be? On Wed, Apr 3, 2013 at 3:14 PM, Mark Miller markrmil...@gmail.com wrote: If you don't specify numShards after 4.1, you get an implicit doc router and it's up to you to distribute updates. [...]
Re: SolrCloud not distributing documents across shards
With earlier versions of SolrCloud, if there was any error or warning when you made a collection, you likely were set up for implicit routing, which means that documents only go to the shard you're talking to. What you want is compositeId routing, which works how you think it should. Go into the Cloud GUI and look at clusterstate.json in the Tree tab. You should see the routing algorithm it's using in that file. Michael Della Bitta Appinions 18 East 41st Street, 2nd Floor New York, NY 10017-6271 www.appinions.com Where Influence Isn’t a Game On Wed, Apr 3, 2013 at 2:59 PM, vsilgalis vsilga...@gmail.com wrote: Chris Hostetter-3 wrote: I'm not familiar with the details, but I've seen Miller respond to a similar question with reference to the issue of not explicitly specifying numShards when creating your collections... -Hoss Well, theoretically we are okay there. The commands we run to create our collection are as follows (note the numShards being specified): [...]
Re: HTML entities being missed by DIH HTMLStripTransformer
Then, I would say, you have a bigger problem... However, you can probably run a RegEx filter and replace those known escapes with real characters before you run your HTMLStrip filter. Or run HTMLStrip, RegEx, and HTMLStrip again. Regards, Alex. Personal blog: http://blog.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book) On Wed, Apr 3, 2013 at 3:19 PM, Ashok ash...@qualcomm.com wrote: Well, the database field has text, sometimes with HTML entities and at other times with HTML tags. I have no control over the process that populates the database tables with info.
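Alex's two-pass idea (turn the entities back into characters first, then strip the tags) can be sketched outside DIH with a plain sed pipeline; inside DIH the rough equivalent would be a RegexTransformer chained ahead of the HTMLStripTransformer. The entity list below is deliberately minimal, just enough to show the mechanism.

```shell
# Sketch of "unescape entities, then strip tags". Only a few
# entities are handled; a real field would need a fuller map,
# and &amp; must be unescaped last to avoid double-unescaping.
echo '&lt;b&gt;This is in Bold&lt;/b&gt;' \
  | sed -e 's/&lt;/</g' -e 's/&gt;/>/g' -e 's/&amp;/\&/g' \
  | sed -e 's/<[^>]*>//g'
# -> This is in Bold
```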
Re: Filtering Search Cloud
On 4/3/2013 1:13 PM, Furkan KAMACI wrote: Shawn, thanks for your detailed explanation. My system will work on high load. I mean I will always index something and something always will be queried at my system. That is why I consider physically separating indexer and query reply machines. [...] We do seem to have a language barrier, so let me try to be very clear: If you use SolrCloud, you can't separate querying and indexing. You will have to use the master-slave replication that has been part of Solr since at least 1.4, possibly earlier. Thanks, Shawn
Re: HTML entities being missed by DIH HTMLStripTransformer
Hi Ashok, HTMLStripTransformer uses HTMLStripCharFilter under the hood, and HTMLStripCharFilter converts all HTML entities to their corresponding characters. What version of Solr are you using? My guess is that it only appears that nothing is happening, since when they are presented in a browser, they show up as the characters the entities represent. I think (never done this myself) that if you apply the HTMLStripTransformer twice, it will first convert the entities to characters, and then on the second pass, remove the HTML constructs. From http://wiki.apache.org/solr/DataImportHandler#Transformer: - The entity transformer attribute can consist of a comma-separated list of transformers (say transformer=foo.X,foo.Y). The transformers are chained in this case and they are applied one after the other in the order in which they are specified. What this means is that after the fields are fetched from the datasource, the list of entity columns are processed one at a time in the order listed inside the entity tag and scanned by the first transformer to see if any of that transformer's attributes are present. If so, the transformer does its thing! When all of the listed entity columns have been scanned, the process is repeated using the next transformer in the list. - Steve On Apr 3, 2013, at 3:30 PM, Alexandre Rafalovitch arafa...@gmail.com wrote: Then, I would say, you have a bigger problem... [...]
Re: SolrCloud not distributing documents across shards
Michael Della Bitta-2 wrote: With earlier versions of SolrCloud, if there was any error or warning when you made a collection, you likely were set up for implicit routing, which means that documents only go to the shard you're talking to. What you want is compositeId routing [...] Michael Della Bitta That sounds like my huckleberry. router:implicit is in the collection info in the clusterstate.json. How do I fix this? Just wipe the clusterstate.json? Thanks for your help. -- View this message in context: http://lucene.472066.n3.nabble.com/SolrCloud-not-distributing-documents-across-shards-tp4053506p4053593.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr 4.2 Cloud Replication Replica has higher version than Master?
Answered my own question: it now says compositeId. What is problematic, though, is that in addition to my shards (which are, say, jamie-shard1) I see the Solr-created shards (shard1). I assume that these were created because of the numShards param. Is there no way to specify the names of these shards? On Wed, Apr 3, 2013 at 3:25 PM, Jamie Johnson jej2...@gmail.com wrote: Ah, interesting... so I need to specify numShards, blow out zk, and then try this again to see if things work properly now. [...]
Re: SolrCloud not distributing documents across shards
If you can work with a clean state, I'd turn off all your shards, clear out the Solr directories in Zookeeper, reset solr.xml for each of your shards, upgrade to the latest version of Solr, and turn everything back on again. Then upload config, recreate your collection, etc. I do it like this, but YMMV: curl "http://localhost:8080/solr/admin/collections?action=CREATE&name=$name&numShards=$num&collection.configName=$config-name" Michael Della Bitta Appinions 18 East 41st Street, 2nd Floor New York, NY 10017-6271 www.appinions.com Where Influence Isn’t a Game On Wed, Apr 3, 2013 at 3:40 PM, vsilgalis vsilga...@gmail.com wrote: Michael Della Bitta-2 wrote: [...] That sounds like my huckleberry. router:implicit is in the collection info in the clusterstate.json. How do I fix this? Just wipe the clusterstate.json? Thanks for your help.
Re: Filtering Search Cloud
Thanks for your explanation, you explained everything that I need. Just one more question. I see that I cannot do it with SolrCloud, but I can do something like it with master-slave replication in Solr. If I use master-slave replication, can I eliminate (filter) something that was indexed on the master from being returned in a response when querying the slaves? 2013/4/3 Shawn Heisey s...@elyograg.org On 4/3/2013 1:13 PM, Furkan KAMACI wrote: Shawn, thanks for your detailed explanation. My system will work on high load. [...] We do seem to have a language barrier, so let me try to be very clear: If you use SolrCloud, you can't separate querying and indexing. You will have to use the master-slave replication that has been part of Solr since at least 1.4, possibly earlier. Thanks, Shawn
Re: Solr 4.2 Cloud Replication Replica has higher version than Master?
I had thought you could - but looking at the code recently, I don't think you can anymore. I think that's a technical limitation more than anything though. When these changes were made, I think support for that was simply not added at the time. I'm not sure exactly how straightforward it would be, but it seems doable - as it is, the overseer will preallocate shards when first creating the collection - that's when they get named shard(n). There would have to be logic to replace shard(n) with the custom shard name when the core actually registers. - Mark On Apr 3, 2013, at 3:42 PM, Jamie Johnson jej2...@gmail.com wrote: answered my own question, it now says compositeId. What is problematic though is that in addition to my shards (which are say jamie-shard1) I see the solr created shards (shard1). I assume that these were created because of the numShards param. Is there no way to specify the names of these shards? On Wed, Apr 3, 2013 at 3:25 PM, Jamie Johnson jej2...@gmail.com wrote: ah interesting... so I need to specify numShards, blow out zk and then try this again to see if things work properly now. What is really strange is that for the most part things have worked right and on 4.2.1 I have 600,000 items indexed with no duplicates. In any event I will specify numShards, clear out zk and begin again. If this works properly what should the router type be? On Wed, Apr 3, 2013 at 3:14 PM, Mark Miller markrmil...@gmail.com wrote: If you don't specify numShards after 4.1, you get an implicit doc router and it's up to you to distribute updates. In the past, partitioning was done on the fly - but for shard splitting and perhaps other features, we now divvy up the hash range up front based on numShards and store it in ZooKeeper. No numShards is now how you take complete control of updates yourself. - Mark On Apr 3, 2013, at 2:57 PM, Jamie Johnson jej2...@gmail.com wrote: The router says implicit.
I did start from a blank zk state but perhaps I missed one of the ZkCLI commands? One of my shards from the clusterstate.json is shown below. What is the process that should be done to bootstrap a cluster other than the ZkCLI commands I listed above? My process right now is to run those ZkCLI commands and then start solr on all of the instances with a command like this: java -server -Dshard=shard5 -DcoreName=shard5-core1 -Dsolr.data.dir=/solr/data/shard5-core1 -Dcollection.configName=solr-conf -Dcollection=collection1 -DzkHost=so-zoo1:2181,so-zoo2:2181,so-zoo3:2181 -Djetty.port=7575 -DhostPort=7575 -jar start.jar I feel like maybe I'm missing a step.

"shard5":{
  "state":"active",
  "replicas":{
    "10.38.33.16:7575_solr_shard5-core1":{
      "shard":"shard5",
      "state":"active",
      "core":"shard5-core1",
      "collection":"collection1",
      "node_name":"10.38.33.16:7575_solr",
      "base_url":"http://10.38.33.16:7575/solr",
      "leader":"true"},
    "10.38.33.17:7577_solr_shard5-core2":{
      "shard":"shard5",
      "state":"recovering",
      "core":"shard5-core2",
      "collection":"collection1",
      "node_name":"10.38.33.17:7577_solr",
      "base_url":"http://10.38.33.17:7577/solr"}}}

On Wed, Apr 3, 2013 at 2:40 PM, Mark Miller markrmil...@gmail.com wrote: It should be part of your clusterstate.json. Some users have reported trouble upgrading a previous zk install when this change came. I recommended manually updating the clusterstate.json to have the right info, and that seemed to work. Otherwise, I guess you have to start from a clean zk state. If you don't have that range information, I think there will be trouble. Do you have a router type defined in the clusterstate.json? - Mark On Apr 3, 2013, at 2:24 PM, Jamie Johnson jej2...@gmail.com wrote: Where is this information stored in ZK? I don't see it in the cluster state (or perhaps I don't understand it ;) ). Perhaps something with my process is broken. What I do when I start from scratch is the following: ZkCLI -cmd upconfig ... ZkCLI -cmd linkconfig but I don't ever explicitly create the collection. What should the steps from scratch be?
I am moving from an unreleased snapshot of 4.0 so I never did that previously either so perhaps I did create the collection in one of my steps to get this working but have forgotten it along the way. On Wed, Apr 3, 2013 at 2:16 PM, Mark Miller markrmil...@gmail.com wrote: Thanks for digging Jamie. In 4.2, hash ranges are assigned up front when a collection is created - each shard gets a range, which is stored in zookeeper. You should not be able to end up with the same id on different shards - something very odd going on. Hopefully I'll have some time to try and help you reproduce. Ideally we can capture it in a test case. - Mark On Apr 3, 2013, at 1:13 PM, Jamie Johnson jej2...@gmail.com wrote: no, my thought was wrong, it
Re: Solr 4.2 Cloud Replication Replica has higher version than Master?
ok, so that's not a deal breaker for me. I just changed it to match the shards that are auto created and it looks like things are happy. I'll go ahead and try my test to see if I can get things out of sync. On Wed, Apr 3, 2013 at 3:56 PM, Mark Miller markrmil...@gmail.com wrote: I had thought you could - but looking at the code recently, I don't think you can anymore. [...]
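Pulling the thread together, a from-scratch bootstrap along the lines discussed (upload the config set, link it, then explicitly create the collection through the Collections API so numShards and the hash ranges get recorded in ZooKeeper) might look like this. Host names, ports, the classpath, and the config name are placeholders:

```shell
# upload the config set to ZooKeeper
java -classpath ... org.apache.solr.cloud.ZkCLI -cmd upconfig \
  -zkhost so-zoo1:2181,so-zoo2:2181,so-zoo3:2181 \
  -confdir ./solr-conf -confname solr-conf

# link the config set to the collection name
java -classpath ... org.apache.solr.cloud.ZkCLI -cmd linkconfig \
  -zkhost so-zoo1:2181,so-zoo2:2181,so-zoo3:2181 \
  -collection collection1 -confname solr-conf

# create the collection explicitly so the compositeId router
# and per-shard hash ranges end up in clusterstate.json
curl "http://localhost:7575/solr/admin/collections?action=CREATE&name=collection1&numShards=5&collection.configName=solr-conf"
```

The key step Jamie was missing is the last one: without an explicit CREATE (or numShards at startup), you get the implicit router and updates are not distributed for you.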
Re: HTML entities being missed by DIH HTMLStripTransformer
Hi Steve, Fabulous suggestion! Yup, that is it! Using the HTMLStripTransformer twice did the trick. I am using Solr 4.1. Thank you very much! - ashok -- View this message in context: http://lucene.472066.n3.nabble.com/HTML-entities-being-missed-by-DIH-HTMLStripTransformer-tp4053582p4053609.html Sent from the Solr - User mailing list archive at Nabble.com.
do SearchComponents have access to response contents
I need to implement some SearchComponent that will deal with metrics on the response. Some things I see will be easy to get, like number of hits for instance, but I am more worried about this: we need to also track the size of the response (the size in bytes of the whole xml response that is streamed, with stored fields and all). I was a bit worried because I am wondering if a SearchComponent will actually have access to the response bytes... Can someone confirm one way or the other? We are targeting Solr 4.0. thanks xavier
Re: Solr 4.2 Cloud Replication Replica has higher version than Master?
with these changes things are looking good, I'm up to 600,000 documents without any issues as of right now. I'll keep going and add more to see if I find anything. On Wed, Apr 3, 2013 at 4:01 PM, Jamie Johnson jej2...@gmail.com wrote: ok, so that's not a deal breaker for me. I just changed it to match the shards that are auto created and it looks like things are happy. [...]
Re: HTML entities being missed by DIH HTMLStripTransformer
Cool, glad I was able to help. On Apr 3, 2013, at 4:18 PM, Ashok ash...@qualcomm.com wrote: Hi Steve, Fabulous suggestion! Yup, that is it! Using the HTMLStripTransformer twice did the trick. I am using Solr 4.1. Thank you very much! - ashok -- View this message in context: http://lucene.472066.n3.nabble.com/HTML-entities-being-missed-by-DIH-HTMLStripTransformer-tp4053582p4053609.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: SolrCloud not distributing documents across shards
Michael Della Bitta-2 wrote If you can work with a clean state, I'd turn off all your shards, clear out the Solr directories in Zookeeper, reset solr.xml for each of your shards, upgrade to the latest version of Solr, and turn everything back on again. Then upload config, recreate your collection, etc. I do it like this, but YMMV: curl "http://localhost:8080/solr/admin/collections?action=CREATE&name=$name&numShards=$num&collection.configName=$config-name" Michael Della Bitta Looks like that was the problem. Thanks, much appreciated. Is there any insight into specifically what I should look into for preventing this in the future? -- View this message in context: http://lucene.472066.n3.nabble.com/SolrCloud-not-distributing-documents-across-shards-tp4053506p4053622.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Question on Exact Matches - edismax
Can you show us your *_ci field type? Solr does not really have a way to tell whether a match is exact or only partial, but you could hack around it with the fieldType. See https://github.com/cominvent/exactmatch for a possible solution. -- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com Solr Training - www.solrtraining.com 3. apr. 2013 kl. 15:55 skrev Sandeep Mestry sanmes...@gmail.com: Hi All, I have a requirement wherein exact matches for 2 fields (Series Title, Title) should be ranked higher than the partial matches. The configuration looks like below:

<requestHandler name="assetdismax" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="echoParams">explicit</str>
    <float name="tie">0.01</float>
    <str name="qf">pg_series_title_ci^500 title_ci^300 pg_series_title^200 title^25
      classifications^15 classifications_texts^15 parent_classifications^10
      synonym_classifications^5 pg_brand_title^5 pg_series_working_title^5
      p_programme_title^5 p_item_title^5 p_interstitial_title^5 description^15
      pg_series_description annotations^0.1 classification_notes^0.05
      pv_program_version_number^2 pv_program_version_number_ci^2
      pv_program_number^2 pv_program_number_ci^2 p_program_number^2
      ma_version_number^2 ma_recording_location ma_contributions^0.001
      rel_pg_series_title rel_programme_title rel_programme_number
      rel_programme_number_ci pg_uuid^0.5 p_uuid^0.5 pv_uuid^0.5 ma_uuid^0.5</str>
    <str name="pf">pg_series_title_ci^500 title_ci^500</str>
    <int name="ps">0</int>
    <str name="q.alt">*:*</str>
    <str name="mm">100%</str>
    <str name="q.op">AND</str>
    <str name="facet">true</str>
    <str name="facet.limit">-1</str>
    <str name="facet.mincount">1</str>
  </lst>
</requestHandler>

As you can see above, the search is against many fields. What I'd want is the documents that have exact matches for series title and title fields should rank higher than the rest.
I have added 2 case-insensitive fields (pg_series_title_ci, title_ci) for series title and title and have boosted them higher than the tokenized fields and the rest. I have also implemented a similarity class to override idf, however I still get documents with partial matches in title and other fields ranking higher than an exact match in pg_series_title_ci. Many Thanks, Sandeep
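For context, a typical case-insensitive exact-match field type of the kind Jan is asking about keeps the whole title as a single token, so only a full-field match can hit it. A sketch (the type name and the TrimFilter are assumptions, adjust to your schema):

```xml
<!-- Sketch of a *_ci field type: the whole value stays one token,
     lowercased, so a query matches only if the entire field matches. -->
<fieldType name="text_ci" class="solr.TextField" omitNorms="true">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
  </analyzer>
</fieldType>
```

If the _ci fields use a regular tokenizing analyzer instead, partial matches will still hit them, which would explain the ranking Sandeep is seeing.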
Re: do SearchComponents have access to response contents
The search components can see the response as a NamedList, but it is only when SolrDispatchFilter calls the QueryResponseWriter that XML or JSON or whatever other format (Javabin as well) is generated from the named list for final output in an HTTP response. You probably want a custom query response writer that wraps the XML response writer. Then you can generate the XML and then do whatever you want with it. See the QueryResponseWriter class and the queryResponseWriter element in solrconfig.xml. -- Jack Krupansky -Original Message- From: xavier jmlucjav Sent: Wednesday, April 03, 2013 4:22 PM To: solr-user@lucene.apache.org Subject: do SearchComponents have access to response contents I need to implement some SearchComponent that will deal with metrics on the response. [...]
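The counting part of Jack's wrapping suggestion can be illustrated with plain JDK classes: a FilterWriter that counts every character passing through to the underlying writer. The class names here are hypothetical; a real wrapping QueryResponseWriter would interpose something like this between Solr and the stock XML writer it delegates to.

```java
import java.io.FilterWriter;
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;

// Hypothetical sketch: counts all characters written through to the
// wrapped writer, the same trick a wrapping response writer could use
// to measure the size of the serialized response.
class CountingWriter extends FilterWriter {
    private long count = 0;

    CountingWriter(Writer out) { super(out); }

    @Override public void write(int c) throws IOException {
        super.write(c);
        count++;
    }

    @Override public void write(char[] cbuf, int off, int len) throws IOException {
        super.write(cbuf, off, len);
        count += len;
    }

    @Override public void write(String str, int off, int len) throws IOException {
        super.write(str, off, len);
        count += len;
    }

    long getCount() { return count; }
}

public class CountingWriterDemo {
    public static void main(String[] args) throws IOException {
        StringWriter sink = new StringWriter();
        CountingWriter counting = new CountingWriter(sink);
        counting.write("<response><result numFound=\"42\"/></response>");
        counting.flush();
        // the response still reaches the sink unchanged; we also know its size
        System.out.println(counting.getCount());
    }
}
```

Note this measures characters before HTTP encoding; if you need exact wire bytes you would count at the OutputStream level instead.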
Re: Solr Tika Override
You'd probably want to work on the XML output from Tika's PDF parser, from which you can identify which page and context. Personally I would build a separate indexing application in Java and call Tika directly, then build a SolrInputDocument which you pass to solr through SolrJ. I.e. not use ExtractingRequestHandler, but put all this logic on the client side. This scales better, you can handle weird parsing errors and OOM situations better and you have full control of how to deal with the XML output from various file formats, and what metadata to pass on into the Solr document. This is possible with a customized ExtractingHandler too, but it will be uglier and harder to test. With a standalone indexer application you can write unit tests for all the special parsing requirements. see http://tika.apache.org for more. -- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com Solr Training - www.solrtraining.com 3. apr. 2013 kl. 20:09 skrev JerryC coss...@vt.edu: I am researching Solr and seeing if it would be a good fit for a document search service I am helping to develop. One of the requirements is that we will need to be able to customize how file contents are parsed beyond the default configurations that are offered out of the box by Tika. For example, we know that we will be indexing .pdf files that will contain a cover page with a project start date, and would like to pull this date out into a searchable field that is separate from the file content. I have seen several sources saying you can do this by overriding the ExtractingRequestHandler.createFactory() method, but I have not been able to find much documentation on how to implement a new parser. Can someone point me in the right direction on where to look, or let me know if the scenario I described above is even possible? -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-Tika-Override-tp4053552.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: SolrCloud not distributing documents across shards
From what I can tell, the Collections API has been hardened significantly since 4.2 and now will refuse to create a collection if you give it something ambiguous to do. So if you upgrade to 4.2, things will become more safe. But overall I'd find a way of using the Collections API that works and stick with it. Michael Della Bitta On Wed, Apr 3, 2013 at 5:01 PM, vsilgalis vsilga...@gmail.com wrote: Michael Della Bitta-2 wrote If you can work with a clean state, I'd turn off all your shards, clear out the Solr directories in Zookeeper, reset solr.xml for each of your shards, upgrade to the latest version of Solr, and turn everything back on again. [...] Looks like that was the problem. Thanks, much appreciated. Is there any insight into specifically what I should look into for preventing this in the future? -- View this message in context: http://lucene.472066.n3.nabble.com/SolrCloud-not-distributing-documents-across-shards-tp4053506p4053622.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: SolrCloud not distributing documents across shards
On Apr 3, 2013, at 5:53 PM, Michael Della Bitta michael.della.bi...@appinions.com wrote: From what I can tell, the Collections API has been hardened significantly since 4.2 I did a lot of work here for 4.2.1 - there was a lot to improve. Hopefully there is much less now, but if anyone finds anything, I'll fix any JIRA's. - Mark
Re: Filtering Search Cloud
On 4/3/2013 1:52 PM, Furkan KAMACI wrote: Thanks for your explanation, you explained everything I need. Just one more question. I see that I cannot do it with SolrCloud, but I can do something like that with master-slave replication. If I use master-slave replication, can I filter something (something that was indexed on the master) out of the response when querying the slaves? I don't understand the question. I will attempt to give you more information, but it might not answer your question. If not, you'll have to try to improve your question. Your master and each of that master's slaves will have the same index as soon as replication is done. A query on the slave has no idea that the master exists. Thanks, Shawn
Streaming search results
Is it possible to stream search results from Solr? It seems that this feature is missing. I see two options to solve this: 1. Use the search results pagination feature. The idea is to implement a smart proxy that will stream chunks of search results using pagination. 2. Implement a Solr plugin with a search streaming feature (is that possible at all?). The first option is easy to implement and reliable, though I don't know what the drawbacks are. Regards, Viktor
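Option 1 boils down to a start/rows loop. A minimal sketch of just the paging logic, with the HTTP call stubbed out (fetchPage is a hypothetical stand-in for a real request with ?start=...&rows=..., so the loop is runnable offline):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class PagingSketch {
    // Hypothetical stand-in for a Solr request with ?start=...&rows=...;
    // here it just slices a fake result set.
    static List<Integer> fetchPage(List<Integer> all, int start, int rows) {
        if (start >= all.size()) {
            return Collections.emptyList();
        }
        return all.subList(start, Math.min(start + rows, all.size()));
    }

    // Keep requesting the next window until a page comes back empty.
    // A real proxy would flush each chunk to the client as it arrives
    // instead of accumulating everything in memory.
    static List<Integer> streamAll(List<Integer> all, int rows) {
        List<Integer> out = new ArrayList<>();
        int start = 0;
        while (true) {
            List<Integer> page = fetchPage(all, start, rows);
            if (page.isEmpty()) break;
            out.addAll(page);
            start += rows; // advance the paging window
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> docs = new ArrayList<>();
        for (int i = 0; i < 10; i++) docs.add(i);
        System.out.println(streamAll(docs, 3).size()); // prints 10
    }
}
```

One drawback worth knowing about: with plain start/rows paging, deep pages get progressively more expensive for the server, since each request must internally collect and skip all preceding results.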
Re: Solr metrics in Codahale metrics and Graphite?
That sounds great. I'll check out the bug, I didn't see anything in the docs about this. And if I can't find it with a search engine, it probably isn't there. --wunder On Apr 3, 2013, at 6:39 AM, Shawn Heisey wrote: On 3/29/2013 12:07 PM, Walter Underwood wrote: What are folks using for this? I don't know that this really answers your question, but Solr 4.1 and later includes a big chunk of codahale metrics internally for request handler statistics - see SOLR-1972. First we tried including the jar and using the API, but that created thread leak problems, so the source code was added. Thanks, Shawn
Re: Solr metrics in Codahale metrics and Graphite?
It's there! :) http://search-lucene.com/?q=percentile&fc_project=Solr&fc_type=issue Otis -- Solr ElasticSearch Support http://sematext.com/ On Wed, Apr 3, 2013 at 6:29 PM, Walter Underwood wun...@wunderwood.org wrote: That sounds great. I'll check out the bug, I didn't see anything in the docs about this. And if I can't find it with a search engine, it probably isn't there. --wunder On Apr 3, 2013, at 6:39 AM, Shawn Heisey wrote: On 3/29/2013 12:07 PM, Walter Underwood wrote: What are folks using for this? I don't know that this really answers your question, but Solr 4.1 and later includes a big chunk of codahale metrics internally for request handler statistics - see SOLR-1972. First we tried including the jar and using the API, but that created thread leak problems, so the source code was added. Thanks, Shawn
Re: Solr metrics in Codahale metrics and Graphite?
In the Jira, but not in the docs. It would be nice to have VM stats like GC, too, so we can have common monitoring and alerting on all our services. wunder On Apr 3, 2013, at 3:31 PM, Otis Gospodnetic wrote: It's there! :) http://search-lucene.com/?q=percentile&fc_project=Solr&fc_type=issue Otis -- Solr ElasticSearch Support http://sematext.com/ On Wed, Apr 3, 2013 at 6:29 PM, Walter Underwood wun...@wunderwood.org wrote: That sounds great. I'll check out the bug, I didn't see anything in the docs about this. And if I can't find it with a search engine, it probably isn't there. --wunder [...] -- Walter Underwood wun...@wunderwood.org
Re: Solr 4.2 Cloud Replication Replica has higher version than Master?
just an update, I'm at 1M records now with no issues. This looks promising as to the cause of my issues, thanks for the help. Is the routing method with numShards documented anywhere? I know numShards is documented but I didn't know that the routing changed if you don't specify it. On Wed, Apr 3, 2013 at 4:44 PM, Jamie Johnson jej2...@gmail.com wrote: with these changes things are looking good, I'm up to 600,000 documents without any issues as of right now. I'll keep going and add more to see if I find anything. [...]
RE: Solr Multiword Search
The following query is doing a word search (based on my previous post):

solr/spell?q=(charles+and+the+choclit+factory+OR+(title2:(charles+and+the+choclit+factory)))&spellcheck.collate=true&spellcheck=true&spellcheck.q=charles+and+the+choclit+factory

It produces a lot of unwanted matches. In order to do a phrase search, I changed it to quote the phrase:

solr/spell?q=(%22charles+and+the+choclit+factory%22+OR+(title2:%22charles+and+the+choclit+factory%22))&spellcheck.collate=true&spellcheck=true&spellcheck.q=charles+and+the+choclit+factory

It does not find any match for the words in the phrase I am looking for and does poorly in the suggested collations. I want phrase corrections. How do I achieve this? "charles and the chocolit factory" produces the following collations:

<bool name="correctlySpelled">false</bool>
<lst name="collation">
  <str name="collationQuery">charles and the chocolat factory</str>
  <int name="hits">2849777</int>
  <lst name="misspellingsAndCorrections"><str name="charles">charles</str><str name="and">and</str><str name="the">the</str><str name="chocolit">chocolat</str><str name="factory">factory</str></lst>
</lst>
<lst name="collation">
  <str name="collationQuery">charles and the chocalit factory</str>
  <int name="hits">2849464</int>
  <lst name="misspellingsAndCorrections"><str name="charles">charles</str><str name="and">and</str><str name="the">the</str><str name="chocolit">chocalit</str><str name="factory">factory</str></lst>
</lst>
<lst name="collation">
  <str name="collationQuery">charles and the chocolat factors</str>
  <int name="hits">2841190</int>
  <lst name="misspellingsAndCorrections"><str name="charles">charles</str><str name="and">and</str><str name="the">the</str><str name="chocolit">chocolat</str><str name="factory">factors</str></lst>
</lst>
<lst name="collation">
  <str name="collationQuery">charley and the chocolat factory</str>
  <int name="hits">2827908</int>
  <lst name="misspellingsAndCorrections"><str name="charles">charley</str><str name="and">and</str><str name="the">the</str><str name="chocolit">chocolat</str><str name="factory">factory</str></lst>
</lst>
<lst name="collation">
  <str name="collationQuery">charles and the chocalit factors</str>
  <int name="hits">2840877</int>
  <lst name="misspellingsAndCorrections"><str name="charles">charles</str><str name="and">and</str><str name="the">the</str><str name="chocolit">chocalit</str><str name="factory">factors</str></lst>
</lst>
<lst name="collation">
  <str name="collationQuery">charles and the chocklit factory</str>
  <int name="hits">2849464</int>
  <lst name="misspellingsAndCorrections"><str name="charles">charles</str><str name="and">and</str><str name="the">the</str><str name="chocolit">chocklit</str><str name="factory">factory</str></lst>
</lst>
<lst name="collation">
  <str name="collationQuery">charles and the chocolat factorz</str>
  <int name="hits">2841173</int>
  <lst name="misspellingsAndCorrections"><str name="charles">charles</str><str name="and">and</str><str name="the">the</str><str name="chocolit">chocolat</str><str name="factory">factorz</str></lst>
</lst>
<lst name="collation">
  <str name="collationQuery">charley and the chocalit factory</str>
  <int name="hits">2827595</int>
  <lst name="misspellingsAndCorrections"><str name="charles">charley</str><str name="and">and</str><str name="the">the</str><str name="chocolit">chocalit</str><str name="factory">factory</str></lst>
</lst>
<lst name="collation">
  <str name="collationQuery">charley and the chocolat factors</str>
  <int name="hits">2819321</int>
  <lst name="misspellingsAndCorrections"><str name="charles">charley</str><str name="and">and</str><str name="the">the</str><str name="chocolit">chocolat</str><str name="factory">factors</str></lst>
</lst>
<lst name="collation">
  <str name="collationQuery">charlies and the chocolat factory</str>
  <int name="hits">2826661</int>
  <lst name="misspellingsAndCorrections"><str name="charles">charlies</str><str name="and">and</str><str name="the">the</str><str name="chocolit">chocolat</str><str name="factory">factory</str></lst>
</lst>

Notice the number of hits. These counts do not look right for a phrase search. Please help. Thanks. -- Sandeep -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-Multiword-Search-tp4053038p4053674.html Sent from the Solr - User mailing list archive at Nabble.com.
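One way to issue the phrase form with explicit quoting and URL encoding is sketched below. The host, port, and /spell handler path are assumptions based on the thread; the spellcheck parameters are the ones used above, and whether phrase search fixes the collation counts depends on the schema.

```python
# Building a properly URL-encoded phrase query for a /spell handler.
# Host, port, handler path, and the field name "title2" are taken from
# the thread and are assumptions, not a verified setup.
from urllib.parse import urlencode

phrase = "charles and the choclit factory"
params = {
    # Quote the phrase so the default field and title2 match it as a unit.
    "q": '("%s" OR title2:"%s")' % (phrase, phrase),
    "spellcheck": "true",
    "spellcheck.collate": "true",
    "spellcheck.q": phrase,
}
query_string = urlencode(params)
url = "http://localhost:8983/solr/spell?" + query_string
print(url)
```

Letting urlencode do the escaping avoids hand-building `%22` and `+` sequences in the URL.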
Re: Solr 4.2 Cloud Replication Replica has higher version than Master?
I am occasionally seeing this in the log, is this just a timeout issue? Should I be increasing the zk client timeout?

WARNING: Overseer cannot talk to ZK
Apr 3, 2013 11:14:25 PM org.apache.solr.cloud.DistributedQueue$LatchChildWatcher process
INFO: Watcher fired on path: null state: Expired type None
Apr 3, 2013 11:14:25 PM org.apache.solr.cloud.Overseer$ClusterStateUpdater run
WARNING: Solr cannot talk to ZK, exiting Overseer main queue loop
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /overseer/queue
	at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
	at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
	at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1468)
	at org.apache.solr.common.cloud.SolrZkClient$6.execute(SolrZkClient.java:236)
	at org.apache.solr.common.cloud.SolrZkClient$6.execute(SolrZkClient.java:233)
	at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:65)
	at org.apache.solr.common.cloud.SolrZkClient.getChildren(SolrZkClient.java:233)
	at org.apache.solr.cloud.DistributedQueue.orderedChildren(DistributedQueue.java:89)
	at org.apache.solr.cloud.DistributedQueue.element(DistributedQueue.java:131)
	at org.apache.solr.cloud.DistributedQueue.peek(DistributedQueue.java:326)
	at org.apache.solr.cloud.Overseer$ClusterStateUpdater.run(Overseer.java:128)
	at java.lang.Thread.run(Thread.java:662)

On Wed, Apr 3, 2013 at 7:25 PM, Jamie Johnson jej2...@gmail.com wrote: just an update, I'm at 1M records now with no issues. This looks promising as to the cause of my issues, thanks for the help. Is the routing method with numShards documented anywhere? I know numShards is documented but I didn't know that the routing changed if you don't specify it. On Wed, Apr 3, 2013 at 4:44 PM, Jamie Johnson jej2...@gmail.com wrote: with these changes things are looking good, I'm up to 600,000 documents without any issues as of right now.
I'll keep going and add more to see if I find anything. On Wed, Apr 3, 2013 at 4:01 PM, Jamie Johnson jej2...@gmail.com wrote: ok, so that's not a deal breaker for me. I just changed it to match the shards that are auto created and it looks like things are happy. I'll go ahead and try my test to see if I can get things out of sync. On Wed, Apr 3, 2013 at 3:56 PM, Mark Miller markrmil...@gmail.com wrote: I had thought you could - but looking at the code recently, I don't think you can anymore. I think that's a technical limitation more than anything though. When these changes were made, I think support for that was simply not added at the time. I'm not sure exactly how straightforward it would be, but it seems doable - as it is, the overseer will preallocate shards when first creating the collection - that's when they get named shard(n). There would have to be logic to replace shard(n) with the custom shard name when the core actually registers. - Mark On Apr 3, 2013, at 3:42 PM, Jamie Johnson jej2...@gmail.com wrote: answered my own question, it now says compositeId. What is problematic though is that in addition to my shards (which are say jamie-shard1) I see the solr created shards (shard1). I assume that these were created because of the numShards param. Is there no way to specify the names of these shards? On Wed, Apr 3, 2013 at 3:25 PM, Jamie Johnson jej2...@gmail.com wrote: ah interesting... so I need to specify numShards, blow out zk and then try this again to see if things work properly now. What is really strange is that for the most part things have worked right and on 4.2.1 I have 600,000 items indexed with no duplicates. In any event I will specify numShards, clear out zk and begin again. If this works properly what should the router type be? On Wed, Apr 3, 2013 at 3:14 PM, Mark Miller markrmil...@gmail.com wrote: If you don't specify numShards after 4.1, you get an implicit doc router and it's up to you to distribute updates.
In the past, partitioning was done on the fly - but for shard splitting and perhaps other features, we now divvy up the hash range up front based on numShards and store it in ZooKeeper. No numShards is now how you take complete control of updates yourself. - Mark On Apr 3, 2013, at 2:57 PM, Jamie Johnson jej2...@gmail.com wrote: The router says implicit. I did start from a blank zk state but perhaps I missed one of the ZkCLI commands? One of my shards from the clusterstate.json is shown below. What is the process that should be done to bootstrap a cluster other than the ZkCLI commands I listed above? My process right now is to run those ZkCLI commands and then start solr on all of the instances with a command like this java -server -Dshard=shard5 -DcoreName=shard5-core1
Re: Solr 4.2 Cloud Replication Replica has higher version than Master?
Yeah. Are you using the concurrent low pause garbage collector? This means the overseer wasn't able to communicate with zk for 15 seconds - due to load or gc or whatever. If you can't resolve the root cause of that, or the load just won't allow for it, next best thing you can do is raise it to 30 seconds. - Mark On Apr 3, 2013, at 7:41 PM, Jamie Johnson jej2...@gmail.com wrote: I am occasionally seeing this in the log, is this just a timeout issue? Should I be increasing the zk client timeout? WARNING: Overseer cannot talk to ZK [...] On Wed, Apr 3, 2013 at 7:25 PM, Jamie Johnson jej2...@gmail.com wrote: just an update, I'm at 1M records now with no issues.
This looks promising as to the cause of my issues, thanks for the help. Is the routing method with numShards documented anywhere? I know numShards is documented but I didn't know that the routing changed if you don't specify it. On Wed, Apr 3, 2013 at 4:44 PM, Jamie Johnson jej2...@gmail.com wrote: with these changes things are looking good, I'm up to 600,000 documents without any issues as of right now. I'll keep going and add more to see if I find anything. On Wed, Apr 3, 2013 at 4:01 PM, Jamie Johnson jej2...@gmail.com wrote: ok, so that's not a deal breaker for me. I just changed it to match the shards that are auto created and it looks like things are happy. I'll go ahead and try my test to see if I can get things out of sync. On Wed, Apr 3, 2013 at 3:56 PM, Mark Miller markrmil...@gmail.com wrote: I had thought you could - but looking at the code recently, I don't think you can anymore. I think that's a technical limitation more than anything though. When these changes were made, I think support for that was simply not added at the time. I'm not sure exactly how straightforward it would be, but it seems doable - as it is, the overseer will preallocate shards when first creating the collection - that's when they get named shard(n). There would have to be logic to replace shard(n) with the custom shard name when the core actually registers. - Mark On Apr 3, 2013, at 3:42 PM, Jamie Johnson jej2...@gmail.com wrote: answered my own question, it now says compositeId. What is problematic though is that in addition to my shards (which are say jamie-shard1) I see the solr created shards (shard1). I assume that these were created because of the numShards param. Is there no way to specify the names of these shards? On Wed, Apr 3, 2013 at 3:25 PM, Jamie Johnson jej2...@gmail.com wrote: ah interesting... so I need to specify numShards, blow out zk and then try this again to see if things work properly now.
What is really strange is that for the most part things have worked right and on 4.2.1 I have 600,000 items indexed with no duplicates. In any event I will specify numShards, clear out zk and begin again. If this works properly what should the router type be? On Wed, Apr 3, 2013 at 3:14 PM, Mark Miller markrmil...@gmail.com wrote: If you don't specify numShards after 4.1, you get an implicit doc router and it's up to you to distribute updates. In the past, partitioning was done on the fly - but for shard splitting and perhaps other features, we now divvy up the hash range up front based on numShards and store it in ZooKeeper. No numShards is now how you take complete control of updates yourself. - Mark On Apr 3, 2013, at 2:57 PM, Jamie Johnson jej2...@gmail.com wrote: The router says implicit. I did start from a blank zk state but perhaps
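The 15-second session timeout discussed in this thread is controlled by zkClientTimeout. A hedged sketch of where it can be raised in a Solr 4.x-style solr.xml follows; the zkClientTimeout attribute is the actual setting, while the surrounding attribute values mirror the stock example file and may differ from this cluster's setup:

```xml
<!-- solr.xml (Solr 4.x style): raise the ZooKeeper session timeout
     from the 15s default to 30s. zkClientTimeout is the real knob;
     the other attribute values here follow the stock example file. -->
<solr persistent="true">
  <cores adminPath="/admin/cores" defaultCoreName="collection1"
         host="${host:}" hostPort="${jetty.port:8983}"
         hostContext="${hostContext:solr}"
         zkClientTimeout="${zkClientTimeout:30000}">
    <core name="collection1" instanceDir="collection1" />
  </cores>
</solr>
```

Because the value is read through a property placeholder, it can also be overridden at startup with -DzkClientTimeout=30000.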
Re: Solr 4.2 Cloud Replication Replica has higher version than Master?
This shouldn't be a problem though, if things are working as they are supposed to. Another node should simply take over as the overseer and continue processing the work queue. It's just best if you configure so that session timeouts don't happen unless a node is really down. On the other hand, it's nicer to detect that faster. Your tradeoff to make. - Mark On Apr 3, 2013, at 7:46 PM, Mark Miller markrmil...@gmail.com wrote: Yeah. Are you using the concurrent low pause garbage collector? This means the overseer wasn't able to communicate with zk for 15 seconds - due to load or gc or whatever. If you can't resolve the root cause of that, or the load just won't allow for it, next best thing you can do is raise it to 30 seconds. - Mark On Apr 3, 2013, at 7:41 PM, Jamie Johnson jej2...@gmail.com wrote: I am occasionally seeing this in the log, is this just a timeout issue? Should I be increasing the zk client timeout? WARNING: Overseer cannot talk to ZK [...] On Wed, Apr 3, 2013 at 7:25 PM, Jamie Johnson jej2...@gmail.com wrote: just an update, I'm at 1M records now with no issues. This looks promising as to the cause of my issues, thanks for the help. Is the routing method with numShards documented anywhere? I know numShards is documented but I didn't know that the routing changed if you don't specify it. On Wed, Apr 3, 2013 at 4:44 PM, Jamie Johnson jej2...@gmail.com wrote: with these changes things are looking good, I'm up to 600,000 documents without any issues as of right now. I'll keep going and add more to see if I find anything. On Wed, Apr 3, 2013 at 4:01 PM, Jamie Johnson jej2...@gmail.com wrote: ok, so that's not a deal breaker for me. I just changed it to match the shards that are auto created and it looks like things are happy. I'll go ahead and try my test to see if I can get things out of sync. On Wed, Apr 3, 2013 at 3:56 PM, Mark Miller markrmil...@gmail.com wrote: I had thought you could - but looking at the code recently, I don't think you can anymore. I think that's a technical limitation more than anything though. When these changes were made, I think support for that was simply not added at the time. I'm not sure exactly how straightforward it would be, but it seems doable - as it is, the overseer will preallocate shards when first creating the collection - that's when they get named shard(n). There would have to be logic to replace shard(n) with the custom shard name when the core actually registers. - Mark On Apr 3, 2013, at 3:42 PM, Jamie Johnson jej2...@gmail.com wrote: answered my own question, it now says compositeId.
What is problematic though is that in addition to my shards (which are say jamie-shard1) I see the solr created shards (shard1). I assume that these were created because of the numShards param. Is there no way to specify the names of these shards? On Wed, Apr 3, 2013 at 3:25 PM, Jamie Johnson jej2...@gmail.com wrote: ah interesting... so I need to specify numShards, blow out zk and then try this again to see if things work properly now. What is really strange is that for the most part things have worked right and on 4.2.1 I have 600,000 items indexed with no duplicates. In any event I will specify numShards, clear out zk and begin again. If this works properly what should the router type be? On Wed, Apr 3, 2013 at 3:14 PM, Mark Miller markrmil...@gmail.com wrote: If you don't specify numShards after 4.1, you get an implicit doc router and it's up to you to
Re: Solr 4.2 Cloud Replication Replica has higher version than Master?
I am not using the concurrent low pause garbage collector, I could look at switching, I'm assuming you're talking about adding -XX:+UseConcMarkSweepGC correct? I also just had a shard go down and am seeing this in the log:

SEVERE: org.apache.solr.common.SolrException: I was asked to wait on state down for 10.38.33.17:7576_solr but I still do not see the requested state. I see state: recovering live:false
	at org.apache.solr.handler.admin.CoreAdminHandler.handleWaitForStateAction(CoreAdminHandler.java:890)
	at org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:186)
	at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
	at org.apache.solr.servlet.SolrDispatchFilter.handleAdminRequest(SolrDispatchFilter.java:591)
	at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:192)
	at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:141)
	at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1307)
	at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:453)
	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
	at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:560)
	at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)

Nothing other than this in the log jumps out as interesting though. On Wed, Apr 3, 2013 at 7:47 PM, Mark Miller markrmil...@gmail.com wrote: This shouldn't be a problem though, if things are working as they are supposed to. Another node should simply take over as the overseer and continue processing the work queue. It's just best if you configure so that session timeouts don't happen unless a node is really down. On the other hand, it's nicer to detect that faster. Your tradeoff to make. - Mark On Apr 3, 2013, at 7:46 PM, Mark Miller markrmil...@gmail.com wrote: Yeah.
Are you using the concurrent low pause garbage collector? This means the overseer wasn't able to communicate with zk for 15 seconds - due to load or gc or whatever. If you can't resolve the root cause of that, or the load just won't allow for it, next best thing you can do is raise it to 30 seconds. - Mark On Apr 3, 2013, at 7:41 PM, Jamie Johnson jej2...@gmail.com wrote: I am occasionally seeing this in the log, is this just a timeout issue? Should I be increasing the zk client timeout? WARNING: Overseer cannot talk to ZK [...] On Wed, Apr 3, 2013 at 7:25 PM, Jamie Johnson jej2...@gmail.com wrote: just an update, I'm at 1M records now with no issues.
This looks promising as to the cause of my issues, thanks for the help. Is the routing method with numShards documented anywhere? I know numShards is documented but I didn't know that the routing changed if you don't specify it. On Wed, Apr 3, 2013 at 4:44 PM, Jamie Johnson jej2...@gmail.com wrote: with these changes things are looking good, I'm up to 600,000 documents without any issues as of right now. I'll keep going and add more to see if I find anything. On Wed, Apr 3, 2013 at 4:01 PM, Jamie Johnson jej2...@gmail.com wrote: ok, so that's not a deal breaker for me. I just changed it to match the shards that are auto created and it looks like things are happy. I'll go ahead and try my test to see if I can get things out of sync. On Wed, Apr 3, 2013 at
Re: Solr 4.2 Cloud Replication Replica has higher version than Master?
On Apr 3, 2013, at 8:17 PM, Jamie Johnson jej2...@gmail.com wrote: I am not using the concurrent low pause garbage collector, I could look at switching, I'm assuming you're talking about adding -XX:+UseConcMarkSweepGC correct? Right - if you don't do that, the default is almost always the throughput collector (I've only seen OS X buck this trend, back when Apple handled Java). That means stop-the-world garbage collections, so with larger heaps, that can be a fair amount of time during which no threads can run. It's not great for something as interactive as search anyway, and it's especially bad when combined with heavy load and a 15 sec session timeout between Solr and zk. The below is odd - a replica node is waiting for the leader to see it as recovering and live - live means it has created an ephemeral node for that Solr CoreContainer in zk - it's very strange if that didn't happen, unless this happened during shutdown or something. I also just had a shard go down and am seeing this in the log SEVERE: org.apache.solr.common.SolrException: I was asked to wait on state down for 10.38.33.17:7576_solr but I still do not see the requested state.
I see state: recovering live:false [...] Nothing other than this in the log jumps out as interesting though. On Wed, Apr 3, 2013 at 7:47 PM, Mark Miller markrmil...@gmail.com wrote: This shouldn't be a problem though, if things are working as they are supposed to. Another node should simply take over as the overseer and continue processing the work queue. It's just best if you configure so that session timeouts don't happen unless a node is really down. On the other hand, it's nicer to detect that faster. Your tradeoff to make. - Mark On Apr 3, 2013, at 7:46 PM, Mark Miller markrmil...@gmail.com wrote: Yeah. Are you using the concurrent low pause garbage collector? This means the overseer wasn't able to communicate with zk for 15 seconds - due to load or gc or whatever. If you can't resolve the root cause of that, or the load just won't allow for it, next best thing you can do is raise it to 30 seconds.
- Mark On Apr 3, 2013, at 7:41 PM, Jamie Johnson jej2...@gmail.com wrote: I am occasionally seeing this in the log, is this just a timeout issue? Should I be increasing the zk client timeout? WARNING: Overseer cannot talk to ZK [...] On Wed, Apr 3, 2013 at 7:25 PM, Jamie Johnson jej2...@gmail.com wrote: just an update, I'm at 1M records now with no issues. This looks promising as to
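For reference, a sketch of a launch command enabling the CMS (concurrent low-pause) collector discussed in this thread. The flags are standard HotSpot options of the Solr 4.x / Java 6-7 era, but the heap size is a placeholder and real tuning depends on the workload:

```shell
# Enable the CMS collector for a Solr 4.x jetty launch. Heap size and
# occupancy fraction are placeholders, not a tuned recommendation.
java -server -Xms2g -Xmx2g \
     -XX:+UseConcMarkSweepGC \
     -XX:+UseParNewGC \
     -XX:+CMSParallelRemarkEnabled \
     -XX:CMSInitiatingOccupancyFraction=75 \
     -XX:+UseCMSInitiatingOccupancyOnly \
     -DzkClientTimeout=30000 \
     -jar start.jar
```

Shorter pauses reduce the chance of blowing the ZooKeeper session timeout, which is what triggers the SessionExpiredException above.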
hl.usePhraseHighlighter defaults to true but Query form and wiki suggest otherwise
Minor issues - It seems that hl.usePhraseHighlighter is enabled by default, which definitely makes sense, but the wiki says its default value is false and the checkbox is unchecked by default on the Query form. This gives the impression that the parameter defaults to false. I'm assuming the code is right in this case and we just need a JIRA to bring the Query form in sync with the code. I can update the wiki ... just want to make sure that having this field enabled by default is the correct behavior before I update things. Cheers, Tim
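Until the wiki and Query form agree, the ambiguity can be sidestepped by setting the parameter explicitly per request. A small sketch of building such a request, where the host, core path, and the field name "text" are assumptions for illustration:

```python
# Pin hl.usePhraseHighlighter explicitly instead of relying on the default,
# so the wiki/Query-form discrepancy described above cannot bite.
# Host, handler path, and the field name "text" are assumptions.
from urllib.parse import urlencode

params = {
    "q": 'text:"session management"',
    "hl": "true",
    "hl.fl": "text",
    "hl.usePhraseHighlighter": "true",  # make the effective value explicit
}
url = "http://localhost:8983/solr/select?" + urlencode(params)
print(url)
```

An explicit parameter in the request also makes the behavior survive any future change to the server-side default.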
Re: Solr 4.2 Cloud Replication Replica has higher version than Master?
Thanks I will try that. On Wed, Apr 3, 2013 at 8:28 PM, Mark Miller markrmil...@gmail.com wrote: On Apr 3, 2013, at 8:17 PM, Jamie Johnson jej2...@gmail.com wrote: I am not using the concurrent low pause garbage collector, I could look at switching, I'm assuming you're talking about adding -XX:+UseConcMarkSweepGC correct? Right - if you don't do that, the default is almost always the throughput collector (I've only seen OSX buck this trend when apple handled java). That means stop the world garbage collections, so with larger heaps, that can be a fair amount of time that no threads can run. It's not that great for something as interactive as search generally is anyway, but it's always not that great when added to heavy load and a 15 sec session timeout between solr and zk. The below is odd - a replica node is waiting for the leader to see it as recovering and live - live means it has created an ephemeral node for that Solr corecontainer in zk - it's very strange if that didn't happen, unless this happened during shutdown or something. I also just had a shard go down and am seeing this in the log SEVERE: org.apache.solr.common.SolrException: I was asked to wait on state down for 10.38.33.17:7576_solr but I still do not see the requested state. 
I see state: recovering live:false [...] Nothing other than this in the log jumps out as interesting though. On Wed, Apr 3, 2013 at 7:47 PM, Mark Miller markrmil...@gmail.com wrote: This shouldn't be a problem though, if things are working as they are supposed to. Another node should simply take over as the overseer and continue processing the work queue. It's just best if you configure so that session timeouts don't happen unless a node is really down. On the other hand, it's nicer to detect that faster. Your tradeoff to make. - Mark On Apr 3, 2013, at 7:46 PM, Mark Miller markrmil...@gmail.com wrote: Yeah. Are you using the concurrent low pause garbage collector? This means the overseer wasn't able to communicate with zk for 15 seconds - due to load or gc or whatever. If you can't resolve the root cause of that, or the load just won't allow for it, next best thing you can do is raise it to 30 seconds.
- Mark On Apr 3, 2013, at 7:41 PM, Jamie Johnson jej2...@gmail.com wrote: I am occasionally seeing this in the log, is this just a timeout issue? Should I be increasing the zk client timeout? WARNING: Overseer cannot talk to ZK [...]
Difference Between Indexing and Reindexing
OK, this could be a very easy question, but I want to learn a bit more of the technical detail behind it. When I use Nutch to send documents to Solr for indexing, there are two parameters: -index and -reindex. What does Solr do for each one that is different from the other?