Solr 1.4 - Performance Issues

2013-11-04 Thread Stephen Delano
Hi all,

I wanted to share the issues we're having with Solr 1.4 to get some ideas
of things we can do in the short term that will buy us enough time to
validate Solr 4 before upgrading and not have 1.4 burn to the ground before
we get there.

We've been running Solr 1.4 in production for over 3 years now, but are
really starting to hit some performance bottlenecks that are beginning to
affect our users. Here are the details of our setup:

We're running 2 4-CPU Solr servers. The data is on a 4-disk RAID 10 array
and we're using block-level replication via DRBD over GigE to write to the
standby node. Only one server is serving traffic at a time.

Some tuning information:
- Merge Factor: 25
- Auto Commit: 60s / 1000 docs

What we're seeing:
In roughly 14 hour cycles, the CPU usage climbs from 100% to between 200
and 250%. At the end of the cycle, we get one long commit of roughly 500
seconds, blocking all writes. Around the same time queries begin to get
very slow, often causing timeouts from connecting clients. This behavior is
cyclical, and is getting progressively worse.

What is this, and what can we do about it?

I've attached relevant graphs. Apologies in advance for the obscenely large
image sizes.

Cheers,
Stephen

 
Attached graphs: client-requests-2.png, cpu-usage.png, disk-ios-2.png, mem-usage-2.png, tcp-connections-2.png



Re: character encoding issue...

2013-11-04 Thread Chris
Sorry, was away a bit & hence the delay.

I am inserting Java strings into a Java bean class, and then calling the
addBean() method to insert the POJO into Solr.

When I query using either Tomcat or Jetty, I get these special characters. I
have noticed that if I change the output encoding to Shift-JIS, those
characters appear as what I think are Japanese characters.

But this solution doesn't work for all special characters, as I can still see
some of them... Isn't there an encoding that can cover all characters,
whatever they might be? Any ideas on what I should do?

Regards,
Chris
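
A quick self-contained check of the usual culprit (decoding the crawled bytes
with the platform default charset before they ever reach addBean()) might look
like this; a sketch only, and the sample bytes are just a stand-in for your
own data:

    import java.io.UnsupportedEncodingException;

    public class CharsetCheck {
        public static void main(String[] args) throws UnsupportedEncodingException {
            // Hypothetical stand-in for crawled text: the UTF-8 bytes of a Japanese string.
            byte[] raw = "\u65e5\u672c\u8a9e".getBytes("UTF-8");
            // Decoded with the platform default charset: may already be mojibake on a
            // non-UTF-8 default, and that damaged String is what addBean() would index.
            System.out.println(new String(raw));
            // Decoded explicitly as UTF-8: correct regardless of the platform default.
            System.out.println(new String(raw, "UTF-8"));
        }
    }

If the String is corrupted before indexing, no output encoding chosen at query
time (Shift-JIS or otherwise) can repair it.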


On Mon, Nov 4, 2013 at 6:27 PM, Erick Erickson wrote:

> The problem is there are about a dozen places where the character
> encoding can be mis-configured. The problem you're seeing above
> actually looks like a problem with the character set configured in
> your browser, it may have nothing to do with what's actually in Solr.
>
> You might write small SolrJ program and see if you can dump the contents
> in binary and examine to see...
>
> Best
> Erick
>
>
> On Sun, Nov 3, 2013 at 6:39 AM, Rajani Maski 
> wrote:
>
> > How are you extracting the text that is there in the website[1] you are
> > referring to? Apache Nutch or any other crawler? If yes, initially check
> > whether that crawler engine is giving you data in correct format before
> you
> > invoke solr index method.
> >
> > [1]http://blog.diigo.com/2009/09/28/scheduled-groups-maintenance/
> >
> > URI encoding should resolve this problem.
> >
> >
> >
> >
> > On Fri, Nov 1, 2013 at 10:50 AM, Chris  wrote:
> >
> > > Hi Rajani,
> > >
> > > I followed the steps exactly as in
> > >
> > >
> >
> http://zensarteam.wordpress.com/2011/11/25/6-steps-to-configure-solr-on-apache-tomcat-7-0-20/
> > >
> > > However, when i send a query to this new instance in tomcat, i again
> get
> > > the error -
> > >
> > >   Scheduled Groups Maintenance
> > > In preparation for the new release roll-out, Diigo groups won’t be
> > > accessible on Sept 28 (Mon) around midnight 0:00 PST for several
> > > hours.
> > > Stay tuned to say hello to Diigo V4 soon!
> > >
> > > location of the text  -
> > > http://blog.diigo.com/2009/09/28/scheduled-groups-maintenance/
> > >
> > > same problem at - http://cn.nytimes.com/business/20130926/c26alibaba/
> > >
> > > All text in title comes like -
> > >
> > >  - �
> > > 
> > > 
> > >    -
> > > � 
> > > 
> > >
> > >
> > > Can you please advice?
> > >
> > > Chris
> > >
> > >
> > >
> > >
> > > On Tue, Oct 29, 2013 at 11:33 PM, Rajani Maski  > > >wrote:
> > >
> > > > Hi,
> > > >
> > > >If you are using Apache Tomcat Server, hope you are not missing
> the
> > > > below mentioned configuration:
> > > >
> > > >   > > > connectionTimeout=”2″
> > > > redirectPort=”8443″ *URIEncoding=”UTF-8″*/>
> > > >
> > > > I had faced similar issue with Chinese Characters and had resolved
> with
> > > the
> > > > above config.
> > > >
> > > > Links for reference :
> > > >
> > > >
> > >
> >
> http://zensarteam.wordpress.com/2011/11/25/6-steps-to-configure-solr-on-apache-tomcat-7-0-20/
> > > >
> > > >
> > >
> >
> http://blog.sidu.in/2007/05/tomcat-and-utf-8-encoded-uri-parameters.html#.Um_3P3Cw2X8
> > > >
> > > >
> > > > Thanks
> > > >
> > > >
> > > >
> > > > On Tue, Oct 29, 2013 at 9:20 PM, Chris  wrote:
> > > >
> > > > > Hi All,
> > > > >
> > > > > I get characters like -
> > > > >
> > > > > �� - CTA -
> > > > >
> > > > > in the solr index. I am adding Java beans to solr by the addBean()
> > > > > function.
> > > > >
> > > > > This seems to be a character encoding issue. Any pointers on how to
> > > > > resolve this one?
> > > > >
> > > > > I have seen that this occurs  mostly for japanese chinese
> characters.
> > > > >
> > > >
> > >
> >
>


Slow Indexing speed for csv files, multi-threaded indexing

2013-11-04 Thread Vikram Srinivasan
Hello,

  I know this has been discussed extensively in past posts. I have tried a
bunch of suggestions and I still have a few questions.

I am using Solr 4.4 on Tomcat 7 with OpenJDK 1.7, and I am using a single
Solr core. I am trying to index a bunch of CSV files (13GB in total). Each CSV
file contains a long list of tuples of the form (word1 word2, frequency), i.e.
bigram frequencies, as shown below.

E.g: blue sky, 2500
   green grass, 300

My schema.xml is as simple as it can be: I index just these two fields, of
type string and long, and do not use any tokenizer or analyzer factories.

In my solrconfig.xml:

My ramBufferSizeMB is 100, my merge factor is 10, and maxIndexingThreads is 8.

I am using SolrJ and ConcurrentUpdateSolrServer (CUSS) to index. I have set
the queue size to 1 and the number of threads to 10, and I use the javabin format.

I run my solrj instance by providing the path to the directory where the
csv files are stored.

I start one instance of CUSS and have multiple threads reading from the
various files and writing into the CUSS queue simultaneously. I do a commit
only after all the records have been indexed. My autoCommit values for the
number of documents and the commit interval are also set to very large
numbers.
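
For reference, a condensed sketch of that setup (one shared
ConcurrentUpdateSolrServer, several producer threads feeding it, a single
commit at the end). The URL, field names and CSV parsing here are illustrative
assumptions, not the actual code:

    import java.io.File;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;
    import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class CsvIndexer {
        public static void main(String[] args) throws Exception {
            // one shared CUSS instance; queue size and thread count are the knobs under discussion
            final ConcurrentUpdateSolrServer solr =
                    new ConcurrentUpdateSolrServer("http://localhost:8080/solr/collection1", 10000, 10);
            ExecutorService pool = Executors.newFixedThreadPool(8);   // producer threads, one per CSV file
            for (final File csv : new File(args[0]).listFiles()) {
                pool.submit(new Runnable() {
                    public void run() {
                        try {
                            for (String line : Files.readAllLines(csv.toPath(), StandardCharsets.UTF_8)) {
                                int comma = line.lastIndexOf(',');
                                SolrInputDocument doc = new SolrInputDocument();
                                doc.addField("bigram", line.substring(0, comma).trim());   // field names assumed
                                doc.addField("frequency", Long.parseLong(line.substring(comma + 1).trim()));
                                solr.add(doc);   // queued; CUSS's own threads stream batches to Solr
                            }
                        } catch (Exception e) {
                            e.printStackTrace();
                        }
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.HOURS);
            solr.commit();      // single commit after all files have been queued and drained
            solr.shutdown();
        }
    }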

I have tried indexing a test set of csv files which contains 1.44M records
(total size 21MB).  All my tests have been on different types of Amazon ec2
instances - e.g. m1.xlarge (4vCPU, 15GB RAM) and m3.2xlarge(8vCPU, 30GB
RAM).

I have set my jvm heap size large enough and tuned gc parameters as seen on
various forums.

Observations:

1. My indexing speed for 1.44M records (rows in the CSV files) is 240s on the
m1.xlarge instance and 160s on the m3.2xlarge instance.
2. The indexing speed is the same whether I have one large file with 1.44M
rows or 2 files with 720K rows each.
3. My indexing speed is independent of the number of threads and the queue
size I specify for CUSS. I have set both parameters as low as 1 with no
difference.
4. My indexing speed is independent of the merge factor, the RAM buffer and
the number of indexing threads. I've tried various settings.
5. It appears that I am not really indexing my files in parallel if I use a
single Solr core. Is this not possible? What exactly does maxIndexingThreads
in solrconfig.xml control?
6. My concern is that my indexing speed is way slower than what I've seen
claimed on various forums (e.g., 29GB of Wikipedia in 13 minutes, 50GB in 39
minutes, etc.), even with a single Solr core.

What am I doing wrong? How do I speed up my indexing? Any suggestions will
be appreciated.

Thanks,
Vikram


The first search is slow

2013-11-04 Thread Boole.Z.Guo (mis.cnsh04.Newegg) 41442
Hi,
I am using Solr 4.3.1.
When I search for something, the first query is very slow. How can I improve this?

[screenshot: the first search]

[screenshot: the second search]

Best Regards,
Boole Guo
Software Engineer, NESC-SH.MIS
+86-021-51530666*41442
Floor 19, KaiKai Plaza, 888, Wanhangdu Rd, Shanghai (200042)
ONCE YOU KNOW, YOU NEWEGG.



Re: Does solr supports Federated search, if not what framework

2013-11-04 Thread Alexandre Rafalovitch
On Tue, Nov 5, 2013 at 6:09 AM, Susheel Kumar <
susheel.ku...@thedigitalgroup.net> wrote:

> Hello,
>
> We have a scenario where we present results to users one from solr and
> other from real time web site search. The solr data we have locally
> available that we are able to index but other website search, we don't host
> data and it is real time.
>

Have you looked at Carrot2? http://project.carrot2.org/

Regards,
   Alex.

Personal website: http://www.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at
once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)


Does solr supports Federated search, if not what framework

2013-11-04 Thread Susheel Kumar
Hello,

We have a scenario where we present results to users from two sources: one is
Solr and the other is a real-time web site search. The Solr data is available
locally and we are able to index it; for the other website search we don't
host the data and it is real time.

We are wondering whether we can use some federated search framework that can
unify the results into a single set with consistent relevancy.

Any thoughts?

Thanks & appreciate your help.
Susheel



Disjunctive Queries (OR queries) and FilterCache

2013-11-04 Thread Patanachai Tangchaisin

Hello,

We are running our search system on Apache Solr 4.2.1 using the
master/slave model.
Our index has ~100M documents. The index size is ~20GB.
The machine has 24 CPUs and 48GB of RAM.

Our response time is pretty bad: the median is ~4 seconds at 25
queries/second.

We noticed a couple of things:
- Our machine is always at 100% CPU.
- There is a lot of room in the Java heap. We assign Xms12g and Xmx16g, but
the heap size stays at only 12g.
- Solr's filterCache hit ratio is only 0.76, and the numbers of insertions
and evictions are almost equal.

The weird thing is that most items in Solr's filterCache (at least the first
100) refer to a single field, which we filter on using an OR query. Note
that every request carries a constraint on this field.

For example, if field name is x
fq=x:(1 OR 2 OR 3)&fq=y:'a'
fq=x:(3 OR 2 OR 1)&fq=y:'b'
fq=x:(2 OR 1 OR 3)&fq=y:'c'

The order of the items differs because the input comes from a different
system.

To me, it seems that Solr caches this filter under a different entry if the
order of the items is different, e.g. "(1 OR 2)" and "(2 OR 1)" end up as
different cache entries.

Question:
Is there a way to build an fq parameter using 'OR' so that Solr caches the
variants as the same entry?


Thanks,
Patanachai Tangchaisin
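
One client-side way to avoid the duplicate entries described above is to
normalize the filter before it is sent, for example by sorting the IDs so that
"(1 OR 2)" and "(2 OR 1)" become the same string and hit the same filterCache
entry. A minimal sketch (the field name x follows the example in the message):

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.Collections;
    import java.util.List;

    public class NormalizedFilter {
        static String buildFq(List<Integer> ids) {
            List<Integer> sorted = new ArrayList<Integer>(ids);
            Collections.sort(sorted);                         // canonical order
            StringBuilder fq = new StringBuilder("x:(");
            for (int i = 0; i < sorted.size(); i++) {
                if (i > 0) fq.append(" OR ");
                fq.append(sorted.get(i));
            }
            return fq.append(')').toString();
        }

        public static void main(String[] args) {
            System.out.println(buildFq(Arrays.asList(3, 1, 2)));  // -> x:(1 OR 2 OR 3)
            System.out.println(buildFq(Arrays.asList(2, 3, 1)));  // -> x:(1 OR 2 OR 3)
        }
    }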



Re: Problem of facet on 170M documents

2013-11-04 Thread Mingfeng Yang
Erick,

It could have more than 4M distinct values.  The purpose of this facet is
to display the most frequent, say top 500, urls to users.

Sascha,

Thanks for the info. I will look into this thread thing.

Mingfeng
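
For reference, the facet.threads parameter Sascha mentions can be set per
request. A sketch of the kind of query discussed in this thread, with an
illustrative URL and the field names from the original message (facet.threads
needs Solr 4.5 or later):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;

    public class TopUrlsFacet {
        public static void main(String[] args) throws Exception {
            HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");
            SolrQuery q = new SolrQuery("*:*");
            q.addFilterQuery("source:Video");
            q.setFacet(true);
            q.addFacetField("url");
            q.setFacetLimit(500);
            q.set("facet.threads", 8);   // let Solr use several threads for facet counting
            q.setRows(0);                // only the facet counts are needed
            System.out.println(solr.query(q).getFacetField("url").getValues());
            solr.shutdown();
        }
    }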


On Mon, Nov 4, 2013 at 4:47 AM, Erick Erickson wrote:

> How many unique URLs do you have in your 9M
> docs? If your 9M hits have 4M distinct URLs, then
> this is not very valuable to the user.
>
> Sascha:
> Was that speedup on a single field or were you faceting over
> multiple fields? Because as I remember that code spins off
> threads on a per-field basis, and if I'm mis-remembering I need
> to look again!
>
> Best,
> Erick
>
>
> On Sat, Nov 2, 2013 at 5:07 AM, Sascha SZOTT  wrote:
>
> > Hi Ming,
> >
> > which Solr version are you using? In case you use one of the latest
> > versions (4.5 or above) try the new parameter facet.threads with a
> > reasonable value (4 to 8 gave me a massive performance speedup when
> > working with large facets, i.e. nTerms >> 10^7).
> >
> > -Sascha
> >
> >
> > Mingfeng Yang wrote:
> > > I have an index with 170M documents, and two of the fields for each
> > > doc is "source" and "url".  And I want to know the top 500 most
> > > frequent urls from Video source.
> > >
> > > So I did a facet with
> > > "fq=source:Video&facet=true&facet.field=url&facet.limit=500", and
> > > the matching documents are about 9 millions.
> > >
> > > The solr cluster is hosted on two ec2 instances each with 4 cpu, and
> > > 32G memory. 16G is allocated tfor java heap.  4 master shards on one
> > > machine, and 4 replica on another machine. Connected together via
> > > zookeeper.
> > >
> > > Whenever I did the query above, the response is just taking too long
> > > and the client will get timed out. Sometimes,  when the end user is
> > > impatient, so he/she may wait for a few second for the results, and
> > > then kill the connection, and then issue the same query again and
> > > again.  Then the server will have to deal with multiple such heavy
> > > queries simultaneously and being so busy that we got "no server
> > > hosting shard" error, probably due to lost communication between solr
> > > node and zookeeper.
> > >
> > > Is there any way to deal with such problem?
> > >
> > > Thanks, Ming
> > >
> >
>


RE: 2 replicas with different num of documents

2013-11-04 Thread Markus Jelsma
Hi - we've seen that issue as well (SOLR-4260) and it happened many times with
older versions. The good thing is that we haven't seen it for a very long time
now, so I silently assumed other fixes had already solved the problem.

We don't know how to reproduce the problem, but in older versions it seemed to
happen when, while indexing, one of the replicas died (usually from an OOM). It
would be very helpful if you could reproduce the problem and follow up on the
issue with steps to reproduce.

https://issues.apache.org/jira/browse/SOLR-4260

 
 
-Original message-
> From:yriveiro 
> Sent: Monday 4th November 2013 22:57
> To: solr-user@lucene.apache.org
> Subject: 2 replicas with different num of documents
> 
> Hi,
> 
> I have 2 replicas with different number of documents, Is it possible?
> 
> I'm using Solr 4.5.1 
> 
> Replica 1:
> 
> version:77847
> numDocs:5951879
> maxDoc:5951978
> deletedDocs:99
> 
> Replica 2:
> 
> version:76011
> numDocs:5951793
> maxDoc:5951965
> deletedDocs:172
> 
> Is it not supposed tlog ensure the data consistency?
> 
> 
> 
> -
> Best regards
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/2-replicas-with-different-num-of-documents-tp4099279.html
> Sent from the Solr - User mailing list archive at Nabble.com.
> 


2 replicas with different num of documents

2013-11-04 Thread yriveiro
Hi,

I have 2 replicas with different numbers of documents. Is that possible?

I'm using Solr 4.5.1 

Replica 1:

version:77847
numDocs:5951879
maxDoc:5951978
deletedDocs:99

Replica 2:

version:76011
numDocs:5951793
maxDoc:5951965
deletedDocs:172

Isn't the tlog supposed to ensure data consistency?



-
Best regards
--
View this message in context: 
http://lucene.472066.n3.nabble.com/2-replicas-with-different-num-of-documents-tp4099279.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: Facet question: Getting only the matched value from multivalued field

2013-11-04 Thread Susheel Kumar
Thanks, Aloke.

Prefix solves this problem partially, but I wanted to see whether there is a
solution that works all the time. For example, if we search for "Ronald
Wagner", in the multivalued field we get results like the ones below, and I
really only want the facet values that matched: "Wagner, Ronald S MD" and
"Wagner Enterprise Ronald".

"docs": [
{
"dname": [
   "Oracle Radiology of NV",
   "Wagner, Ronald S MD ",
  ]  
,
"dname": [
   "Wagner Enterprise Ronald",
   "Gery Levy",
  ]  

Any help/suggestion on this?
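
For reference, the facet.prefix mechanism discussed below looks roughly like
this in SolrJ (a sketch with an illustrative URL; as noted, it only helps when
the wanted values share an exact prefix such as "Wagner"):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;

    public class DnamePrefixFacet {
        public static void main(String[] args) throws Exception {
            HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");
            SolrQuery q = new SolrQuery("dname:(Ronald Wagner)");
            q.setFacet(true);
            q.addFacetField("dname");
            q.setFacetPrefix("Wagner");   // only facet values starting with "Wagner" are returned
            q.setRows(0);
            System.out.println(solr.query(q).getFacetField("dname").getValues());
            solr.shutdown();
        }
    }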

-Original Message-
From: Aloke Ghoshal [mailto:alghos...@gmail.com] 
Sent: Monday, November 04, 2013 1:09 AM
To: solr-user@lucene.apache.org
Subject: Re: Facet question: Getting only the matched value from multivalued 
field

Hi Susheel,

You might be able to pull something off using facet.prefix:
http://wiki.apache.org/solr/SimpleFacetParameters#facet.prefix.
Will work when the prefix is exact and doesn't require any analysis, something 
along these lines:
http://solr.pl/en/2013/03/25/autocomplete-on-multivalued-fields-using-faceting/

Regards,
Aloke


On Mon, Nov 4, 2013 at 10:44 AM, Susheel Kumar < 
susheel.ku...@thedigitalgroup.net> wrote:

> Hello,
>
> We have one multivalued field called "dname". When user search for any 
> of the name like "160 Associates LLC", we are able to get facet, but 
> we only want values which matches the search query. Is there any way?
>
> For e.g. assuming below doc, I want to get facet results for only 
> first value "160 WATER ASSOCIATES LLC" which produced hit not all 3.
> -
>
> "dname": [
>   "160 WATER ASSOCIATES LLC",
>   "McDonald",
>  "Office of Mcdowel Attorney"
> ]
>
>
> Thanks in advance and appreciate your help.
>
> Thanks,
> Susheel
>
>
>
>


Re: Can't find some fields in solr result

2013-11-04 Thread Jack Krupansky
Is it possible that you added stored="true" later, after some of the 
documents were already indexed? Then the older documents would not have the 
stored values. If so, you need to reindex the older documents.


-- Jack Krupansky

-Original Message- 
From: gohome190

Sent: Monday, November 04, 2013 2:38 PM
To: solr-user@lucene.apache.org
Subject: Re: Can't find some fields in solr result

All fields are set to stored="true" in my schema.xml, and fl=* doesn't 
change

the output of the response.  I even checked the logs, no errors on any
fields.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Can-t-find-some-fields-in-solr-result-tp4099245p4099251.html
Sent from the Solr - User mailing list archive at Nabble.com. 



Re: Can't find some fields in solr result

2013-11-04 Thread gohome190
All fields are set to stored="true" in my schema.xml, and fl=* doesn't change
the output of the response.  I even checked the logs, no errors on any
fields.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Can-t-find-some-fields-in-solr-result-tp4099245p4099251.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Can't find some fields in solr result

2013-11-04 Thread gohome190
Also, adding fl=* still doesn't solve the problem; only 19 fields are
returned. And the missing fields definitely have values, because I can run a
specific Solr query on a missing field and its value and the entry shows up
(with only 19 fields again, though).



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Can-t-find-some-fields-in-solr-result-tp4099245p4099249.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Can't find some fields in solr result

2013-11-04 Thread Yonik Seeley
On Mon, Nov 4, 2013 at 2:19 PM, gohome190  wrote:
> I have a database that has about 25 fields for each entry.  However, when I
> do a solr *:* query, I can only see the first 19 fields for each entry.
> However, I can successfully use the fields that don't show up as queries.
> So weird! Because that means that solr has them, but isn't sending them in
> the response!  Any ideas?

One of two things:
- fields need to be stored (stored="true" on the field def in the
schema) to be returned.  A field can be indexed but not stored... this
means that you can still search the field (because the index for the
field exists), but it won't appear when you retrieve that document.
- there is a "fl" (field list) parameter that controls what fields are
returned.  The default is all stored fields.

-Yonik
http://heliosearch.com -- making solr shine
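
A small sketch of both checks described above (the URL is a placeholder):
request fl=* explicitly and look at which field names actually come back; a
field that is indexed but not stored will match queries yet never appear in
the returned document.

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrDocument;

    public class FieldListCheck {
        public static void main(String[] args) throws Exception {
            HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");
            SolrQuery q = new SolrQuery("*:*");
            q.setFields("*");   // fl=* : all stored fields
            q.setRows(1);       // assumes the index has at least one document
            SolrDocument doc = solr.query(q).getResults().get(0);
            System.out.println(doc.getFieldNames());   // only stored fields show up here
            solr.shutdown();
        }
    }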


Re: Can't find some fields in solr result

2013-11-04 Thread gohome190
Also, no errors in the Logging, and all fields are in the schema.xml.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Can-t-find-some-fields-in-solr-result-tp4099245p4099247.html
Sent from the Solr - User mailing list archive at Nabble.com.


Can't find some fields in solr result

2013-11-04 Thread gohome190
Hi,

I have a database that has about 25 fields for each entry. However, when I
do a Solr *:* query, I can only see the first 19 fields for each entry.
However, I can successfully query on the fields that don't show up. So weird!
That means that Solr has them, but isn't sending them in the response. Any
ideas?

Thanks! 



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Can-t-find-some-fields-in-solr-result-tp4099245.html
Sent from the Solr - User mailing list archive at Nabble.com.


SolrCloud (4.4) and CurrencyField refresh intervals

2013-11-04 Thread Michael Tracey
I've got a 4.4 SolrCloud cluster running, and an external process that
rebuilds the currency.xml file and uploads the latest version to ZooKeeper
every X minutes.

It looks like with CurrencyField the OpenExchangeRatesOrgProvider provider has 
a refreshInterval setting, but the documentation does not mention a 
refreshInterval on the FileExchangeRateProvider.  Is there a way to do this 
without reloading the whole core on each of the nodes after updating the rates? 
 (Ideally, I'd like the changes to be picked up at the next hard commit).

Thanks,

M.


Re: Recherche avec et sans espaces

2013-11-04 Thread Roman Chyla
Hi Antoine,
I'll permit myself to respond in English, because my written French is
slower ;-)
Your problem is well known among Solr users: the query parser splits tokens on
whitespace, so the analyzer never sees the input 'la redoutte'; it receives
'la' and 'redoutte'. You can of course enclose your search in quotes, like
"la redoutte", but it is hard to force your users to do the same. I have
solved this and related problems for our astrophysics system by writing a
better query parser that searches both for the individual tokens and for
phrases, so essentially the parser decides when to join tokens together. This
also takes care of multi-token synonyms, because synonym recognition is a
related issue: it happens in the analysis phase, which comes after parsing.
The code is in LUCENE-5014 and I'll perhaps make it available as a simple jar
that you can drop into Solr, but that is impossible to do soon, things are too
busy. I hope the explanation will help you search for a solution: you need to
make sure that your analysis chain sees 'la redoutte', and then, because you
are using a whitespace tokenizer, you need to define the synonym laredoutte,la\
redoutte

Hth

Roman
On 4 Nov 2013 11:48, "Antoine REBOUL"  wrote:

> Hello,
>
> I would like searches on a text field to return results even if the spaces
> are typed incorrectly
> (for example: "la redoute" = "laredoute").
>
> Today my text field is defined as follows:
>
>
> 
>  
> 
>  
>   ignoreCase="true"
> words="stopwords.txt"
> enablePositionIncrements="true"
>  />
> 
>   ignoreCase="true" expand="false"/>
> 
>   generateWordParts="1"
> generateNumberParts="1"
>  catenateWords="1"
> catenateNumbers="1"
> catenateAll="1"
>  splitOnCaseChange="1"
> splitOnNumerics="1"
> preserveOriginal="1"
>  />
> 
> 
>  
> 
>  
>   generateWordParts="1"
> generateNumberParts="1"
> catenateWords="1"
>  catenateNumbers="0"
> catenateAll="1"
> splitOnCaseChange="1"
>  preserveOriginal="1"
> />
>   ignoreCase="true"
> words="stopwords.txt"
> enablePositionIncrements="true"
>  />
> 
>  
> 
>  
> 
> 
>
>
>
>
>
>
> Thanks in advance for any replies.
> Best regards.
>
> Antoine Reboul
> *
>


RE: Recherche avec et sans espaces

2013-11-04 Thread Jean-Sebastien Vachon
Hello Antoine,

I see only two solutions to your problem.

1) Use synonyms, but you will be limited to cases known in advance, so it is a
solution that does not scale in the long term.

2) Otherwise, consider adding a second field (probably populated via a
copyField) that does not use a WhitespaceTokenizer (the KeywordTokenizerFactory
class looks like a good candidate) and search on both fields (fq=champ1:"la
redoute" OR champ2:"la redoute").

The admin analysis page (/solr/admin/analysis.jsp) lets you see exactly what
happens for different values and fields.

Also, you will have a much better chance of getting answers to your questions
if they are written in English. ;)

Good luck

> -Original Message-
> From: Antoine REBOUL [mailto:antoine.reb...@gmail.com]
> Sent: November-04-13 11:42 AM
> To: solr-user@lucene.apache.org
> Subject: Recherche avec et sans espaces
> 
> Hello,
>
> I would like searches on a text field to return results even if the spaces
> are typed incorrectly (for example:
> "la redoute" = "laredoute").
>
> Today my text field is defined as follows:
> 
> 
> 
>  
>  
>   ignoreCase="true"
> words="stopwords.txt"
> enablePositionIncrements="true"
>  />
>class="solr.SynonymFilterFactory" synonyms="synonyms2.txt"
> ignoreCase="true" expand="false"/>
> 
>   generateWordParts="1"
> generateNumberParts="1"
>  catenateWords="1"
> catenateNumbers="1"
> catenateAll="1"
>  splitOnCaseChange="1"
> splitOnNumerics="1"
> preserveOriginal="1"
>  />
> 
> 
>  
> 
>  
>   generateWordParts="1"
> generateNumberParts="1"
> catenateWords="1"
>  catenateNumbers="0"
> catenateAll="1"
> splitOnCaseChange="1"
>  preserveOriginal="1"
> />
>   ignoreCase="true"
> words="stopwords.txt"
> enablePositionIncrements="true"
>  />
>class="solr.ASCIIFoldingFilterFactory"/>
> 
>  
> 
> 
> 
> 
> 
> 
> 
> 
> Thanks in advance for any replies.
> Best regards.
> 
> Antoine Reboul
> *
> 


Re: Performance of "rows" and "start" parameters

2013-11-04 Thread Erick Erickson
bq: start=0&rows=30

Let's see the start and rows parameters for a few of
your queries, because on the surface this makes
no sense. If you're always starting at 0, this
shouldn't be happening

And you say "the second query is visibly slower". You're
talking about the "deep paging" problem, which you shouldn't
notice until your start parameter is at least up in the
thousands, perhaps 10s of thousands.

So unless you're incrementing the start parameter way up
there, there's something else going on.

You should be seeing this reflected in your QTimes BTW, if
not then you're seeing something else, perhaps just
too much happening on the box...

FWIW,
Erick


On Mon, Nov 4, 2013 at 11:01 AM, Michael Della Bitta <
michael.della.bi...@appinions.com> wrote:

> The query time increases because in order to calculate the set of documents
> that belongs in page N, you must first calculate all the pages prior to
> page N, and this information is not stored in between requests.
>
> Two ways of speeding this stuff up are to request bigger pages, and/or use
> filter queries over some sort of orderable field in your index to do the
> paging. So for example, if you have a timestamp field in your index, and
> your data represents 100 days, doing 100 queries, one for each day, is much
> better than doing 100 queries using start/rows.
>
>
> Michael Della Bitta
>
> Applications Developer
>
> o: +1 646 532 3062  | c: +1 917 477 7906
>
> appinions inc.
>
> “The Science of Influence Marketing”
>
> 18 East 41st Street
>
> New York, NY 10017
>
> t: @appinions  | g+:
> plus.google.com/appinions<
> https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts
> >
> w: appinions.com 
>
>
> On Mon, Nov 4, 2013 at 8:43 AM, michael.boom  wrote:
>
> > I saw that some time ago there was a JIRA ticket dicussing this, but
> still
> > i
> > found no relevant information on how to deal with it.
> >
> > When working with big nr of docs (e.g. 70M) in my case, I'm using
> > start=0&rows=30 in my requests.
> > For the first req the query time is ok, the next one is visibily slower,
> > the
> > third even more slow and so on until i get some huge query times of up
> > 140secs, after a few hundreds requests. My test were done with SolrMeter
> at
> > a rate of 1000qpm. Same thing happens at 100qpm, tough.
> >
> > Is there a best practice on how to do in this situation, or maybe an
> > explanation why is the query time increasing, from request to request ?
> >
> > Thanks!
> >
> >
> >
> > -
> > Thanks,
> > Michael
> > --
> > View this message in context:
> >
> http://lucene.472066.n3.nabble.com/Performance-of-rows-and-start-parameters-tp4099194.html
> > Sent from the Solr - User mailing list archive at Nabble.com.
> >
>


Re: SolrCloud: read only node

2013-11-04 Thread Erick Erickson
Well, I do have to question why you need to do anything.
Just don't send updates to the remote machines..

But do remember that all nodes in SolrCloud can be equal,
which is one of the points.

FWIW,
Erick


On Mon, Nov 4, 2013 at 10:34 AM, Uwe Reh  wrote:

> F***, this is the answer, I was afraid of. ;-)
> I hoped, there could be anything, similar to http://zookeeper.apache.org/
> doc/trunk/zookeeperObservers.html.
>
> Nevertheless, thank you.
> Uwe
>
> Am 04.11.2013 14:14, schrieb Erick Erickson:
>
>  In this situation, I'd consider going with the older master/slave
>> setup. The problem is that in SolrCloud, you have a lot of chatter
>> back and forth. Presumably the connection to your local instances
>> is rather slow, so if you're adding data to your index, each and
>> every add has to be communicated individually to the remote node.
>>
>> But no, there's no good way in SolrCloud to make a node "read only".
>> Actually, that doesn't really make sense in the solr cloud world since
>> each node maintains its own index, does its own indexing, etc. So
>> each node _must_ be able to change the Solr index it uses.
>>
>> FWIW,
>> Erick
>>
>>
>


Recherche avec et sans espaces

2013-11-04 Thread Antoine REBOUL
Hello,

I would like searches on a text field to return results even if the spaces are
typed incorrectly
(for example: "la redoute" = "laredoute").

Today my text field is defined as follows:



 

 


 

 


 

 



 

 








Thanks in advance for any replies.
Best regards.

Antoine Reboul
*


Re: Performance of "rows" and "start" parameters

2013-11-04 Thread Michael Della Bitta
The query time increases because in order to calculate the set of documents
that belongs in page N, you must first calculate all the pages prior to
page N, and this information is not stored in between requests.

Two ways of speeding this stuff up are to request bigger pages, and/or use
filter queries over some sort of orderable field in your index to do the
paging. So for example, if you have a timestamp field in your index, and
your data represents 100 days, doing 100 queries, one for each day, is much
better than doing 100 queries using start/rows.
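
A rough sketch of the timestamp-bucket approach described above (the URL, the
field name timestamp_dt and the 100-day range are illustrative assumptions):
each request is restricted to one day's documents with a small start offset,
instead of one ever-growing start value.

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;

    public class DayByDayPaging {
        public static void main(String[] args) throws Exception {
            HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");
            for (int day = 0; day < 100; day++) {
                SolrQuery q = new SolrQuery("*:*");
                // restrict each request to a single day's worth of documents
                q.addFilterQuery(String.format(
                    "timestamp_dt:[NOW/DAY-%dDAYS TO NOW/DAY-%dDAYS]", day + 1, day));
                q.setStart(0);
                q.setRows(30);
                long found = solr.query(q).getResults().getNumFound();
                System.out.println("day " + day + ": " + found + " docs");
            }
            solr.shutdown();
        }
    }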


Michael Della Bitta

Applications Developer

o: +1 646 532 3062  | c: +1 917 477 7906

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions  | g+:
plus.google.com/appinions
w: appinions.com 


On Mon, Nov 4, 2013 at 8:43 AM, michael.boom  wrote:

> I saw that some time ago there was a JIRA ticket dicussing this, but still
> i
> found no relevant information on how to deal with it.
>
> When working with big nr of docs (e.g. 70M) in my case, I'm using
> start=0&rows=30 in my requests.
> For the first req the query time is ok, the next one is visibily slower,
> the
> third even more slow and so on until i get some huge query times of up
> 140secs, after a few hundreds requests. My test were done with SolrMeter at
> a rate of 1000qpm. Same thing happens at 100qpm, tough.
>
> Is there a best practice on how to do in this situation, or maybe an
> explanation why is the query time increasing, from request to request ?
>
> Thanks!
>
>
>
> -
> Thanks,
> Michael
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Performance-of-rows-and-start-parameters-tp4099194.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: SolrCloud: read only node

2013-11-04 Thread Uwe Reh

F***, this is the answer I was afraid of. ;-)
I had hoped there could be something similar to
http://zookeeper.apache.org/doc/trunk/zookeeperObservers.html.


Nevertheless, thank you.
Uwe

Am 04.11.2013 14:14, schrieb Erick Erickson:

In this situation, I'd consider going with the older master/slave
setup. The problem is that in SolrCloud, you have a lot of chatter
back and forth. Presumably the connection to your local instances
is rather slow, so if you're adding data to your index, each and
every add has to be communicated individually to the remote node.

But no, there's no good way in SolrCloud to make a node "read only".
Actually, that doesn't really make sense in the solr cloud world since
each node maintains its own index, does its own indexing, etc. So
each node _must_ be able to change the Solr index it uses.

FWIW,
Erick





Re: Performance of "rows" and "start" parameters

2013-11-04 Thread Bill Bell
Do you want to look through them all? Have you considered the Lucene API? Not
sure whether that is better, but it might be.

Bill Bell
Sent from mobile


> On Nov 4, 2013, at 6:43 AM, "michael.boom"  wrote:
> 
> I saw that some time ago there was a JIRA ticket dicussing this, but still i
> found no relevant information on how to deal with it.
> 
> When working with big nr of docs (e.g. 70M) in my case, I'm using
> start=0&rows=30 in my requests.
> For the first req the query time is ok, the next one is visibily slower, the
> third even more slow and so on until i get some huge query times of up
> 140secs, after a few hundreds requests. My test were done with SolrMeter at
> a rate of 1000qpm. Same thing happens at 100qpm, tough.
> 
> Is there a best practice on how to do in this situation, or maybe an
> explanation why is the query time increasing, from request to request ?
> 
> Thanks!
> 
> 
> 
> -
> Thanks,
> Michael
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Performance-of-rows-and-start-parameters-tp4099194.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: Core admin: create new core

2013-11-04 Thread Bill Bell
You could pre-create a bunch of directories and base configs and create cores
as needed. Then use the schemaless API to set them up... or make changes in a
script and reload the core.

Bill Bell
Sent from mobile


> On Nov 4, 2013, at 6:06 AM, Erick Erickson  wrote:
> 
> Right, this has been an issue for a while, there's no current
> way to do this.
> 
> Someday, I'll be able to work on SOLR-4779 which should
> go some toward making this work more easily. It's still not
> exactly what you're looking for, but it might work.
> 
> Of course with SolrCloud you can specify a configuration
> set that is used for multiple collections.
> 
> People are using Puppet or similar to automate this over
> large numbers of nodes, but that's not entirely satisfactory
> either in our case I suspect.
> 
> FWIW,
> Erick
> 
> 
>> On Mon, Nov 4, 2013 at 4:00 AM, Bram Van Dam  wrote:
>> 
>> The core admin CREATE function requires that the new instance dir and
>> schema/config exist already. Is there a particular reason for this? It
>> would be incredible convenient if I could create a core with a new schema
>> and new config simply by calling CREATE (maybe providing the contents of
>> config.xml and schema.xml as base64 encoded strings in HTTP POST or
>> something?).
>> 
>> I'm guessing this isn't currently possible?
>> 
>> Ta,
>> 
>> - bram
>> 


Re: SolrCloud different machine sizes

2013-11-04 Thread michael.boom
Thank you, Erick!



-
Thanks,
Michael
--
View this message in context: 
http://lucene.472066.n3.nabble.com/SolrCloud-different-machine-sizes-tp4099138p4099195.html
Sent from the Solr - User mailing list archive at Nabble.com.


Performance of "rows" and "start" parameters

2013-11-04 Thread michael.boom
I saw that some time ago there was a JIRA ticket discussing this, but I still
found no relevant information on how to deal with it.

When working with a big number of docs (70M in my case), I'm using
start=0&rows=30 in my requests.
For the first request the query time is OK; the next one is visibly slower,
the third even slower, and so on, until I get some huge query times of up to
140 seconds after a few hundred requests. My tests were done with SolrMeter at
a rate of 1000 qpm. The same thing happens at 100 qpm, though.

Is there a best practice for this situation, or maybe an explanation of why
the query time increases from request to request?

Thanks!



-
Thanks,
Michael
--
View this message in context: 
http://lucene.472066.n3.nabble.com/Performance-of-rows-and-start-parameters-tp4099194.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: SolrCloud: read only node

2013-11-04 Thread Erick Erickson
In this situation, I'd consider going with the older master/slave
setup. The problem is that in SolrCloud, you have a lot of chatter
back and forth. Presumably the connection to your local instances
is rather slow, so if you're adding data to your index, each and
every add has to be communicated individually to the remote node.

But no, there's no good way in SolrCloud to make a node "read only".
Actually, that doesn't really make sense in the solr cloud world since
each node maintains its own index, does its own indexing, etc. So
each node _must_ be able to change the Solr index it uses.

FWIW,
Erick


On Mon, Nov 4, 2013 at 7:34 AM, Uwe Reh  wrote:

> Hi,
>
> as service provider for libraries we run a small cloud (1 collection, 1
> shard, 3 replicas).  To improve the local reliability we want to offer the
> possibility to set up own local replicas.
> As fas as I know, this can be easily done just by adding a new node to the
> cloud. But the external node shouldn't be able to do any changes on the
> index.
>
> Is there a cheap way to restrict a node of a SolrCloud into a read only
> modus?
> Is it a better idea, to do legacy replication from one node (master) to an
> external slave?
>
>
> Uwe
>


Re: SolrCloud different machine sizes

2013-11-04 Thread Erick Erickson
"It Depends"(tm). As long as you're getting adequate
throughput on the smaller machines, adding bigger
machines won't make it any _slower_. But sometime
as you add documents, the smaller machines will start
having memory issues etc. and you will see an impact.

Fortunately, the migrating path to larger machines is
pretty painless.
1> bring up at least one larger machine for each shard
2> wait for them to synch up and start serving queries
3> shut off the smaller machines.
FWIW,
Erick


On Mon, Nov 4, 2013 at 5:01 AM, michael.boom  wrote:

> I've setup my SolrCloud using AWS and i'm currently using 2 average
> machines.
> I'm planning to ad one more bigger machine (by bigger i mean double the
> RAM).
>
> If they all work in a cluster and the search being distributed, will the
> smaller machines limit the performance the bigger machine could offer?
> (they
> have less memory, so less cache, thus more disk reads on that machines ==>
> bigger query times) ?
> Thanks!
>
>
>
> -
> Thanks,
> Michael
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/SolrCloud-different-machine-sizes-tp4099138.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Core admin: create new core

2013-11-04 Thread Erick Erickson
Right, this has been an issue for a while, there's no current
way to do this.

Someday, I'll be able to work on SOLR-4779 which should
go some toward making this work more easily. It's still not
exactly what you're looking for, but it might work.

Of course with SolrCloud you can specify a configuration
set that is used for multiple collections.

People are using Puppet or similar to automate this over
large numbers of nodes, but that's not entirely satisfactory
either in our case I suspect.

FWIW,
Erick


On Mon, Nov 4, 2013 at 4:00 AM, Bram Van Dam  wrote:

> The core admin CREATE function requires that the new instance dir and
> schema/config exist already. Is there a particular reason for this? It
> would be incredible convenient if I could create a core with a new schema
> and new config simply by calling CREATE (maybe providing the contents of
> config.xml and schema.xml as base64 encoded strings in HTTP POST or
> something?).
>
> I'm guessing this isn't currently possible?
>
> Ta,
>
>  - bram
>


Re: Cloud issue as an issue with SolrJ?

2013-11-04 Thread Erick Erickson
Thanks for closing this off!

Erick


On Sun, Nov 3, 2013 at 8:24 PM, Jack Park  wrote:

> Issue resolved, with great thanks to Tim Casey.
> The issue was based on my own poor understanding of the mechanics of
> ZooKeeper. The "host" setting in solr.xml must find the correct value
> and not default to localhost. Simply hard-wiring host to the network
> address of the computer made everything work.
>
>
> On Sun, Nov 3, 2013 at 12:04 PM, Jack Park 
> wrote:
> > I now have a single ZK running standalone on 2121. On the same CPU, I
> > have three nodes.
> >
> > I used a curl to send over two documents, one each to two of the three
> > nodes in the cloud.  According to a web query, they are both there.
> >
> > My solrconfig.xml file has a custom update response processor chain
> > defined thus:
> >
> > 
> >   
> >> class="org.apache.solr.update.TopicQuestsHarvestProcessFactory">
> > hello
> >   
> >   
> > 
> >
> > where the added process intercepts a SolrDocument after it is
> > processed and sends it out as a JSON object to TCP socket listeners.
> >
> > The instance of SolrJ I have implemented looks like this:
> >
> >  LBHttpSolrServer sv = new
> LBHttpSolrServer(solrurla,solrurlb,solrurlc);
> > sv.getHttpClient().getParams().setParameter("update.chain",
> > "update"); // "merge");
> >CloudSolrServer server = new CloudSolrServer(zkurl,sv);
> > server.setDefaultCollection("collection1");
> >
> > where the commented-out code would call my "merge" update chain.
> >
> > In curl tests, /solr/merge?commit=true ... got a jetty error
> > /solr/merge not found.
> > When I changed that to /solr/update?commit=true... the document got
> > indexed. Thus, commenting out "merge" in favor of "update".
> >
> > In any case (merge, update, or no update.chain setting at all), the
> > SolrJ implementation fails, typically at a zookeeper.out nio exception
> > "socket closed by peer".
> >
> > Rewriting my implementation to this:
> >CloudSolrServer server = new CloudSolrServer(zkurl);
> >server.setDefaultCollection("collection1");
> > makes no change in behavior.
> >
> > Where is the error thrown?
> >
> > The code to build a doc is this (which reflects my field definitions):
> >
> > SolrInputDocument doc = new SolrInputDocument();
> >doc.addField( "locator", "doc"+i);
> >doc.addField( "label", "document " + i);
> >doc.addField( "details", "This is document " + i);
> >server.add(doc);
> >
> > The error is thrown at server.add(doc)
> >
> > Many thanks in advance for any observations or suggestions.
> >
> > Cheers
> > Jack
>


Re: Lots of tlog files remained, why?

2013-11-04 Thread Erick Erickson
What is your commit strategy? A hard commit
(openSearcher=true or false doesn't matter)
should close the current tlog file, open
a new one and delete old ones. That said, there
will be enough tlog files kept around to hold at
least 100 documents. So if you're committing
too often (say after every document or something),
you can expect to have a bunch around. The
real question is whether they stay around forever
or not. If you index more documents, do old ones
disappear?

Here's a write-up:
http://searchhub.org/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/

If that doesn't help: what version of Solr? How big are your tlog files?
Details matter.

Best,
Erick
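
For reference, a minimal sketch of issuing an explicit hard commit from SolrJ
(the URL is a placeholder, and the three-argument commit is believed to be
available in SolrJ 4.x); a hard commit is what rolls the current tlog file
over, whether or not a new searcher is opened:

    import org.apache.solr.client.solrj.impl.HttpSolrServer;

    public class HardCommit {
        public static void main(String[] args) throws Exception {
            HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");
            // waitFlush=true, waitSearcher=true, softCommit=false -> a hard commit
            solr.commit(true, true, false);
            solr.shutdown();
        }
    }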


On Sun, Nov 3, 2013 at 10:03 AM, Floyd Wu  wrote:

> After re-index 2 xml files and done commit, optimization many times, I
> still have many tlog files in data/tlof directory.
>
> Why?
>
> How to remove those files(delete them directly or just ignored them?)
>
> What is the difference if tlog files exist or not?
>
> Please kindly guide me.
>
> Thanks
>
> Floyd
>


Re: character encoding issue...

2013-11-04 Thread Erick Erickson
The problem is there are about a dozen places where the character
encoding can be mis-configured. The problem you're seeing above
actually looks like a problem with the character set configured in
your browser, it may have nothing to do with what's actually in Solr.

You might write a small SolrJ program and see if you can dump the contents
in binary and examine them to see...

Best
Erick
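
A minimal sketch of that kind of dump (not an actual program from this thread;
the URL, the query and the field name "title" are placeholders): fetch one
affected document and print the stored value together with its UTF-8 bytes in
hex, so you can see whether the stored data itself is already damaged.

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrDocument;

    public class DumpFieldBytes {
        public static void main(String[] args) throws Exception {
            HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");
            SolrQuery q = new SolrQuery("id:doc1");   // pick a document you know is affected
            q.setFields("title");
            for (SolrDocument doc : solr.query(q).getResults()) {
                String value = (String) doc.getFieldValue("title");
                System.out.println("String value: " + value);
                for (byte b : value.getBytes("UTF-8")) {
                    System.out.printf("%02x ", b);    // hex dump of the UTF-8 bytes
                }
                System.out.println();
            }
            solr.shutdown();
        }
    }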


On Sun, Nov 3, 2013 at 6:39 AM, Rajani Maski  wrote:

> How are you extracting the text that is there in the website[1] you are
> referring to? Apache Nutch or any other crawler? If yes, initially check
> whether that crawler engine is giving you data in correct format before you
> invoke solr index method.
>
> [1]http://blog.diigo.com/2009/09/28/scheduled-groups-maintenance/
>
> URI encoding should resolve this problem.
>
>
>
>
> On Fri, Nov 1, 2013 at 10:50 AM, Chris  wrote:
>
> > Hi Rajani,
> >
> > I followed the steps exactly as in
> >
> >
> http://zensarteam.wordpress.com/2011/11/25/6-steps-to-configure-solr-on-apache-tomcat-7-0-20/
> >
> > However, when i send a query to this new instance in tomcat, i again get
> > the error -
> >
> >   Scheduled Groups Maintenance
> > In preparation for the new release roll-out, Diigo groups won’t be
> > accessible on Sept 28 (Mon) around midnight 0:00 PST for several
> > hours.
> > Stay tuned to say hello to Diigo V4 soon!
> >
> > location of the text  -
> > http://blog.diigo.com/2009/09/28/scheduled-groups-maintenance/
> >
> > same problem at - http://cn.nytimes.com/business/20130926/c26alibaba/
> >
> > All text in title comes like -
> >
> >  - �
> > 
> > 
> >    -
> > � 
> > 
> >
> >
> > Can you please advice?
> >
> > Chris
> >
> >
> >
> >
> > On Tue, Oct 29, 2013 at 11:33 PM, Rajani Maski  > >wrote:
> >
> > > Hi,
> > >
> > >If you are using Apache Tomcat Server, hope you are not missing the
> > > below mentioned configuration:
> > >
> > >   > > connectionTimeout=”2″
> > > redirectPort=”8443″ *URIEncoding=”UTF-8″*/>
> > >
> > > I had faced similar issue with Chinese Characters and had resolved with
> > the
> > > above config.
> > >
> > > Links for reference :
> > >
> > >
> >
> http://zensarteam.wordpress.com/2011/11/25/6-steps-to-configure-solr-on-apache-tomcat-7-0-20/
> > >
> > >
> >
> http://blog.sidu.in/2007/05/tomcat-and-utf-8-encoded-uri-parameters.html#.Um_3P3Cw2X8
> > >
> > >
> > > Thanks
> > >
> > >
> > >
> > > On Tue, Oct 29, 2013 at 9:20 PM, Chris  wrote:
> > >
> > > > Hi All,
> > > >
> > > > I get characters like -
> > > >
> > > > �� - CTA -
> > > >
> > > > in the solr index. I am adding Java beans to solr by the addBean()
> > > > function.
> > > >
> > > > This seems to be a character encoding issue. Any pointers on how to
> > > > resolve this one?
> > > >
> > > > I have seen that this occurs  mostly for japanese chinese characters.
> > > >
> > >
> >
>


Re: Store Solr OpenBitSets In Solr Indexes

2013-11-04 Thread Erick Erickson
If the bitset is something you control you can use the binary
field type, although it's not a horribly efficient way to store binary
data.

If the bitset is bounded, you could do something with indexing
N long values that will contain the set and write a custom
similarity class to work with it.

Best,
Erick
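
A rough sketch of the binary-field option (assuming a field declared with the
binary field type and stored="true" in schema.xml; the field name, id value
and class name are illustrative): pack the OpenBitSet's backing long[] words
into a byte[] and store that, reversing the packing when the document is read
back.

    import java.nio.ByteBuffer;
    import org.apache.lucene.util.OpenBitSet;
    import org.apache.solr.common.SolrInputDocument;

    public class BitSetToBinaryField {
        public static void main(String[] args) {
            OpenBitSet bits = new OpenBitSet();
            bits.set(0);
            bits.set(1000);

            // pack the live words of the bitset into bytes
            long[] words = bits.getBits();
            int numWords = bits.getNumWords();
            ByteBuffer buf = ByteBuffer.allocate(numWords * 8);
            for (int i = 0; i < numWords; i++) {
                buf.putLong(words[i]);
            }

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "group-42");
            doc.addField("SolrBitSets", buf.array());  // byte[] value for a binary-typed field
            // server.add(doc); server.commit();       // read back with ByteBuffer.getLong()
        }
    }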


On Sat, Nov 2, 2013 at 9:19 AM, David Philip wrote:

> Oh fine. Caution point was useful for me.
> Yes I wanted to do something similar to filer queries. It is not XY
> problem. I am simply trying to implement  something as described below.
>
> I have a [non-clinical] group sets in system and I want to build bitset
> based on the documents belonging to that group and save it.
> So that, While searching I want to retrieve similar bitset from Solr engine
> for matched document and then execute logical XOR. [Am I clear with problem
> explanation now?]
>
>
> So what I am looking for is, If I have to retrieve bitset instance from
> Solr search engine for the documents matched, how can I get it?
> And How do I save bit mapping for the documents belonging to a particular
> group. thus enable XOR operation.
>
> Thanks - David
>
>
>
>
>
>
>
>
>
>
> On Fri, Nov 1, 2013 at 5:05 PM, Erick Erickson  >wrote:
>
> > Why are you saving this? Because if the bitset you're saving
> > has anything to do with, say, filter queries, it's probably useless.
> >
> > The internal bitsets are often based on the internal Lucene doc ID,
> > which will change when segment merges happen, thus the caution.
> >
> > Otherwise, theres the binary type you can probably use. It's not very
> > efficient since I believe it uses base-64 encoding under the covers
> > though...
> >
> > Is this an "XY" problem?
> >
> > Best,
> > Erick
> >
> >
> > On Wed, Oct 30, 2013 at 8:06 AM, David Philip
> > wrote:
> >
> > > Hi All,
> > >
> > > What should be the field type if I have to save solr's open bit set
> value
> > > within solr document object and retrieve it later for search?
> > >
> > >   OpenBitSet bits = new OpenBitSet();
> > >
> > >   bits.set(0);
> > >   bits.set(1000);
> > >
> > >   doc.addField("SolrBitSets", bits);
> > >
> > >
> > > What should be the field type of  SolrBitSets?
> > >
> > > Thanks
> > >
> >
>


Re: Problem of facet on 170M documents

2013-11-04 Thread Erick Erickson
How many unique URLs do you have in your 9M
docs? If your 9M hits have 4M distinct URLs, then
this is not very valuable to the user.

Sascha:
Was that speedup on a single field or were you faceting over
multiple fields? Because as I remember that code spins off
threads on a per-field basis, and if I'm mis-remembering I need
to look again!

Best,
Erick


On Sat, Nov 2, 2013 at 5:07 AM, Sascha SZOTT  wrote:

> Hi Ming,
>
> which Solr version are you using? In case you use one of the latest
> versions (4.5 or above) try the new parameter facet.threads with a
> reasonable value (4 to 8 gave me a massive performance speedup when
> working with large facets, i.e. nTerms >> 10^7).
>
> -Sascha
>
>
> Mingfeng Yang wrote:
> > I have an index with 170M documents, and two of the fields for each
> > doc is "source" and "url".  And I want to know the top 500 most
> > frequent urls from Video source.
> >
> > So I did a facet with
> > "fq=source:Video&facet=true&facet.field=url&facet.limit=500", and
> > the matching documents are about 9 millions.
> >
> > The solr cluster is hosted on two ec2 instances each with 4 cpu, and
> > 32G memory. 16G is allocated tfor java heap.  4 master shards on one
> > machine, and 4 replica on another machine. Connected together via
> > zookeeper.
> >
> > Whenever I did the query above, the response is just taking too long
> > and the client will get timed out. Sometimes,  when the end user is
> > impatient, so he/she may wait for a few second for the results, and
> > then kill the connection, and then issue the same query again and
> > again.  Then the server will have to deal with multiple such heavy
> > queries simultaneously and being so busy that we got "no server
> > hosting shard" error, probably due to lost communication between solr
> > node and zookeeper.
> >
> > Is there any way to deal with such problem?
> >
> > Thanks, Ming
> >
>


Re: Simple (?) zookeeper question

2013-11-04 Thread Erick Erickson
Well, the easiest thing to do is cheat. Fire up the admin UI, should be
something like
http://localhost:8983/solr

See if anything drops down in the "core selector" box and select it. Then
select a core,
the default is "collection1". Now you should see a "query" section, go
there and
scroll down to the "execute query" button. You should see stuff.

But here's the important bit. There should be a URL in light grey near the
top of the screen
that gives you the right URL to ping. And anywhere in the above steps you
can't proceed
(say you don't see a drop-down with a core to select) and you know where to
focus your
efforts...

Best,
Erick

Oh, and please raise new issues in a new e-mail thread, see "thread
hijacking"
http://people.apache.org/~hossman/#threadhijack




On Fri, Nov 1, 2013 at 2:19 PM, Jack Park  wrote:

> Thanks. I reviewed clusterstate.json again; those URLs are alive. Why
> they are not responding seems to be the mystery du jour.
>
> I reviewed my test suite: it is using field names in schema.xml, and
> the server is configured to use the update responders I installed, all
> of which work fine in a non-cloud mode.
>
> Thanks
> Jack
>
> On Fri, Nov 1, 2013 at 11:12 AM, Shawn Heisey  wrote:
> > On 11/1/2013 12:07 PM, Jack Park wrote:
> >>
> >> The top error message at my test harness is this:
> >>
> >> No live SolrServers available to handle this request:
> >> [http://127.0.1.1:8983/solr/collection1,
> >> http://127.0.1.1:7574/solr/collection1,
> >> http://127.0.1.1:7590/solr/collection1]
> >>
> >> I have to assume that error message was somehow shipped by zookeeper,
> >> because those servers actually exist, to the test harness, at
> >> 10.1.10.178, and if I access any one of them from the browser,
> >> /solr/collection1 does not work, but /solr/#/collection1 does work.
> >
> >
> > Those are *base* urls.  By themselves, they return 404. For an example of
> > how a base URL is used, try /solr/collection1/select?q=*:* instead.
> >
> > Any URL with /#/ in it is part of the admin UI, which runs mostly in the
> > browser and accesses Solr handlers to gather information. It is not Solr
> > itself.
> >
> > Thanks,
> > Shawn
> >
>


SolrCloud: read only node

2013-11-04 Thread Uwe Reh

Hi,

as a service provider for libraries we run a small cloud (1 collection, 1
shard, 3 replicas). To improve local reliability we want to offer libraries
the possibility of setting up their own local replicas.
As far as I know, this can easily be done just by adding a new node to the
cloud. But the external node shouldn't be able to make any changes to the
index.


Is there a cheap way to restrict a node of a SolrCloud to a read-only mode?
Or is it a better idea to do legacy replication from one node (master) to an
external slave?



Uwe


Re: Replication after re adding nodes to cluster (sleeping replicas)

2013-11-04 Thread Erick Erickson
The whole point of SolrCloud is to automatically take care of all
the ugly details of synching etc. You should be able to add a node
and, assuming it has been assigned to a shard, do nothing.
The node will start up, synch with the leader, get registered and
start handling queries without you having to do anything.

If you shut the node down, SolrCloud will figure that out and stop
sending requests to it.

If yo then bring the node back up, SolrCloud will figure out how
to synch it with the leader and just make it happen. When it's
synched, it'll start serving requests.

Watch the Solr admin page and you'll see the status change as
these operations happen. You'll have to refresh the screen

And finally, watch the Solr log on the new node, that'll give you
a good sense of what the steps are.

Best,
Erick


On Fri, Nov 1, 2013 at 4:13 AM, michael.boom  wrote:

> I have a SolrCloud cluster holding 4 collections, each with with 3 shards
> and
> replication factor = 2.
> They all live on 2 machines, and I am currently using this setup for
> testing.
>
> However, i would like to connect this test setup to our live application,
> just for benchmarking and evaluating if it can handle the big qpm number.
> I am planning also to setup a new machine, and add new nodes manually, one
> more replica for each shard on the new machines, in case the first two have
> problems handling the big qpm.
> But what i would like to do is after I set up the new nodes, to shut down
> the new machine and only put it back in the cluster if it's needed.
>
> Thus, getting to the title of this mail:
> After re adding the 3rd machine to the cluster, will the replicas be
> automatically synced with the leader, or do i need to manually trigger this
> somehow ?
>
> Is there a better idea for having this sleeping  replicas? I bet lots of
> people faced this problem, so a best practice must be out there.
>
>
>
> -
> Thanks,
> Michael
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Replication-after-re-adding-nodes-to-cluster-sleeping-replicas-tp4098764.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


SolrCloud different machine sizes

2013-11-04 Thread michael.boom
I've set up my SolrCloud on AWS and I'm currently using 2 average machines.
I'm planning to add one more, bigger machine (by bigger I mean double the
RAM).

If they all work in one cluster and the search is distributed, will the
smaller machines limit the performance the bigger machine could offer? (They
have less memory, so less cache, thus more disk reads on those machines ==>
bigger query times.)
Thanks!



-
Thanks,
Michael
--
View this message in context: 
http://lucene.472066.n3.nabble.com/SolrCloud-different-machine-sizes-tp4099138.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: how can i disable coord?

2013-11-04 Thread Markus Jelsma
You cannot disable the coordination factor at query time at the moment, so you
need to change your Similarity in the schema. The easiest way to do this is to
set the SchemaSimilarityFactory, which defaults to TFIDF but without queryNorm
and coord; or use another Similarity implementation.
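
A minimal sketch of the second option, a custom Similarity that neutralizes
the coordination factor (the package name is illustrative; it would be
registered in schema.xml with a <similarity> element pointing at this class,
assuming the Lucene/Solr 4.x DefaultSimilarity):

    package com.example;

    import org.apache.lucene.search.similarities.DefaultSimilarity;

    public class NoCoordSimilarity extends DefaultSimilarity {
        @Override
        public float coord(int overlap, int maxOverlap) {
            return 1.0f;   // ignore how many optional clauses matched
        }
    }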
 
-Original message-
> From:jihyun suh 
> Sent: Monday 4th November 2013 6:14
> To: solr-user@lucene.apache.org
> Subject: how can i disable coord?
> 
> I want to disable coord in bq.
> But eventhough I set the coordFactor=false just like
> .../select?q=...&coordFactor=false, it's not working...
> How can I disable coord? 
> 
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/how-can-i-disable-coord-tp4099121.html
> Sent from the Solr - User mailing list archive at Nabble.com.
> 


Re: Unable to add mahout classifier

2013-11-04 Thread lovely kasi
I didn't understand what I need to do.
Should I make any changes in the CategorizeDocumentFactory, or change the
version of the Solr core jars?

Thanks,


On Thu, Oct 31, 2013 at 2:35 PM, Koji Sekiguchi  wrote:

> Caused by: java.lang.ClassCastException: class com.mahout.solr.classifier.
>> CategorizeDocumentFactory
>>  at java.lang.Class.asSubclass(Unknown Source)
>>  at org.apache.solr.core.SolrResourceLoader.findClass(
>> SolrResourceLoader.java:433)
>>  at org.apache.solr.core.SolrResourceLoader.findClass(
>> SolrResourceLoader.java:381)
>>  at org.apache.solr.core.SolrCore.createInstance(SolrCore.java:
>> 526)
>>  ... 21 more
>>
>
> There seems to be a problem related class loaders, e.g.
> CategorizeDocumentFactory
> which extends UpdateRequestProcessorFactory, loaded by class loader "B",
> but Solr core has loaded UpdateRequestProcessorFactory via class loader "A"
> or something like that...
>
> koji
> --
> http://www.rondhuit.com/
>


Core admin: create new core

2013-11-04 Thread Bram Van Dam
The core admin CREATE function requires that the new instance dir and
schema/config exist already. Is there a particular reason for this? It
would be incredibly convenient if I could create a core with a new
schema and new config simply by calling CREATE (maybe providing the
contents of config.xml and schema.xml as base64-encoded strings in an HTTP
POST or something?).


I'm guessing this isn't currently possible?

Ta,

 - bram