Re: Regarding improving the performance of Solr
@Shawn: Correct, I am trying to reduce the index size. I am working on reindexing Solr with some of the fields set as indexed but not stored. @Jean: I tried with different caches. It did not show much improvement. On Fri, Sep 6, 2013 at 3:17 PM, Shawn Heisey s...@elyograg.org wrote: On 9/6/2013 2:54 AM, prabu palanisamy wrote: I am currently using Solr 3.5.0, and have indexed a Wikipedia dump (50 GB) with Java 1.6. I am searching Solr with text (which is actually Twitter tweets). Currently it takes an average of 210 milliseconds per post, of which 200 milliseconds are consumed by the Solr server (QTime). I used the JConsole monitoring tool. If the size of all your Solr indexes on disk is in the 50GB range of your wikipedia dump, then for ideal performance, you'll want to have 50GB of free memory so the OS can cache your index. You might be able to get by with 25-30GB of free memory, depending on your index composition. Note that this is memory over and above what you allocate to the Solr JVM, and memory used by other processes on the machine. If you do have other services on the same machine, note that those programs might ALSO require OS disk cache RAM. http://wiki.apache.org/solr/SolrPerformanceProblems#OS_Disk_Cache Thanks, Shawn
No or limited use of FieldCache
Hi, We have a SolrCloud setup handling huge amounts of data. When we do group, facet or sort searches, Solr will use its FieldCache and add data to it for every single document we have. For us it is not realistic that this will ever fit in memory, and we get OOM exceptions. Is there some way of disabling the FieldCache (taking the performance penalty, of course) or making it behave in a nicer way where it only uses up to e.g. 80% of the memory available to the JVM? Or other suggestions? Regards, Per Steffensen
Re: Some highlighted snippets aren't being returned
Thank you, Aloke and Bryan! I'll give this a try and I'll report back on what happens! - Eric On Sep 9, 2013, at 2:32 AM, Aloke Ghoshal alghos...@gmail.com wrote: Hi Eric, As Bryan suggests, you should look at appropriately setting up the fragSize maxAnalyzedChars for long documents. One issue I find with your search request is that in trying to highlight across three separate fields, you have added each of them as a separate request param: hl.fl=contentshl.fl=titlehl.fl=original_url The way to do it would be (http://wiki.apache.org/solr/HighlightingParameters#hl.fl) to pass them as values to one comma (or space) separated field: hl.fl=contents,title,original_url Regards, Aloke On 9/9/13, Bryan Loofbourrow bloofbour...@knowledgemosaic.com wrote: Eric, Your example document is quite long. Are you setting hl.maxAnalyzedChars? If you don't, the highlighter you appear to be using will not look past the first 51,200 characters of the document for snippet candidates. http://wiki.apache.org/solr/HighlightingParameters#hl.maxAnalyzedChars -- Bryan -Original Message- From: Eric O'Hanlon [mailto:elo2...@columbia.edu] Sent: Sunday, September 08, 2013 2:01 PM To: solr-user@lucene.apache.org Subject: Re: Some highlighted snippets aren't being returned Hi again Everyone, I didn't get any replies to this, so I thought I'd re-send in case anyone missed it and has any thoughts. Thanks, Eric On Aug 7, 2013, at 1:51 PM, Eric O'Hanlon elo2...@columbia.edu wrote: Hi Everyone, I'm facing an issue in which my solr query is returning highlighted snippets for some, but not all results. For reference, I'm searching through an index that contains web crawls of human-rights-related websites. I'm running solr as a webapp under Tomcat and I've included the query's solr params from the Tomcat log: ... webapp=/solr-4.2 path=/select params={facet=truesort=score+descgroup.limit=10spellcheck.q=Unanganf.m imetype_code.facet.limit=7hl.simple.pre=codeq.alt=*:*f.organization_t ype__facet.facet.limit=6f.language__facet.facet.limit=6hl=truef.date_of _capture_.facet.limit=6group.field=original_urlhl.simple.post=/code facet.field=domainfacet.field=date_of_capture_facet.field=mimetype _codefacet.field=geographic_focus__facetfacet.field=organization_based_i n__facetfacet.field=organization_type__facetfacet.field=language__facet facet.field=creator_name__facethl.fragsize=600f.creator_name__facet.face t.limit=6facet.mincount=1qf=text^1hl.fl=contentshl.fl=titlehl.fl=orig inal_urlwt=rubyf.geographic_focus__facet.facet.limit=6defType=edismaxr ows=10f.domain.facet.limit=6q=Unanganf.organization_based_in__facet.fac et.limit=6q.op=ANDgroup=truehl.usePhraseHighlighter=true} hits=8 status=0 QTime=108 ... For the query above (which can be simplified to say: find all documents that contain the word unangan and return facets, highlights, etc.), I get five search results. Only three of these are returning highlighted snippets. 
Here's the highlighting portion of the solr response (note: printed in ruby notation because I'm receiving this response in a Rails app): highlighting= {20100602195444/http://www.kontras.org/uu_ri_ham/UU%20Nomor%2023%20Tahun% 202002%20tentang%20Perlindungan%20Anak.pdf= {}, 20100902203939/http://www.kontras.org/uu_ri_ham/UU%20Nomor%2023%20Tahun%2 02002%20tentang%20Perlindungan%20Anak.pdf= {}, 20111202233029/http://www.kontras.org/uu_ri_ham/UU%20Nomor%2023%20Tahun%2 02002%20tentang%20Perlindungan%20Anak.pdf= {}, 20100618201646/http://www.komnasham.go.id/portal/files/39-99.pdf= {contents= [...actual snippet is returned here...]}, 20100902235358/http://www.komnasham.go.id/portal/files/39-99.pdf= {contents= [...actual snippet is returned here...]}, 20110302213056/http://www.komnasham.go.id/publikasi/doc_download/2- uu-no-39-tahun-1999= {contents= [...actual snippet is returned here...]}, 20110302213102/http://www.komnasham.go.id/publikasi/doc_view/2-uu-no- 39-tahun-1999?tmpl=componentformat=raw= {contents= [...actual snippet is returned here...]}, 20120303113654/http://www.iwgia.org/iwgia_files_publications_files/0028_U timut_heritage.pdf= {}} I have eight (as opposed to five) results above because I'm also doing a grouped query, grouping by a field called original_url, and this leads to five grouped results. I've confirmed that my highlight-lacking results DO contain the word unangan, as expected, and this term is appearing in a text field that's indexed and stored, and being searched for all text searches. For example, one of the search results is for a crawl of this document: http://www.iwgia.org/iwgia_files_publications_files/0028_Utimut_heritage.p df And if you view that document on the web, you'll see that it does contain unangan. Has anyone seen this before? And does
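A minimal sketch of the two changes suggested above, with the core name and analysis window size as assumptions (hl.fl takes one comma- or space-separated list, and hl.maxAnalyzedChars raises the 51,200-character default that Bryan mentions):

http://localhost:8983/solr/collection1/select?q=Unangan&defType=edismax&qf=text
    &hl=true
    &hl.fl=contents,title,original_url
    &hl.fragsize=600
    &hl.maxAnalyzedChars=500000

(Line breaks added for readability; the facet and grouping parameters from the original request are unchanged.)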
Re: How to facet data from a multivalued field?
oh got it.. Thanks a lot... On Tue, Sep 10, 2013 at 10:10 PM, Erick Erickson erickerick...@gmail.comwrote: You can't facet on fields where indexed=false. When you look at output docs, you're seeing _stored_ not indexed data. Set indexed=true and re-index... Best, Erick On Tue, Sep 10, 2013 at 5:51 AM, Rah1x raheel_itst...@yahoo.com wrote: Hi buddy, I am having this problem that I cant even reach to what you did at first step.. all I get is: lst name=series / This is the schema: field name=series type=string indexed=false stored=true required=false omitTermFreqAndPositions=true multiValued=true / Note: the data is correctly placed in the field as the query results shows. However, the facet is not working. Could you please share the schema of what you did to achieve it? Thanks a lot. -- View this message in context: http://lucene.472066.n3.nabble.com/How-to-facet-data-from-a-multivalued-field-tp3897853p4089045.html Sent from the Solr - User mailing list archive at Nabble.com. -- Regards, Raheel Hasan
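For reference, the field definition from the original post with the one change Erick describes (a sketch; everything other than indexed=true is as posted):

<field name="series" type="string" indexed="true" stored="true" required="false" omitTermFreqAndPositions="true" multiValued="true"/>

After re-indexing, facet.field=series should return per-value counts for the multivalued field.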
Re: charfilter doesn't do anything
perfect, i tried it before but always at the tail of the expression with no effect. thanks a lot. a last question, do you know how to keep the html comments from being filtered before the transformer has done its work? On 10. Sep 2013, at 3:17 PM, Jack Krupansky wrote: Okay, I can repro the problem. Yes, in appears that the pattern replace char filter does not default to multiline mode for pattern matching, so body on one line and /body on another line cannot be matched. Now, whether that is by design or a bug or an option for enhancement is a matter for some committer to comment on. But, the good news is that you can in fact set multiline mode in your pattern my starting it with (?s), which means that dot accepts line break characters as well. So, here are my revised field types: fieldType name=text_html_body class=solr.TextField positionIncrementGap=100 analyzer charFilter class=solr.PatternReplaceCharFilterFactory pattern=(?s)^.*lt;bodygt;(.*)lt;/bodygt;.*$ replacement=$1 / tokenizer class=solr.StandardTokenizerFactory/ filter class=solr.LowerCaseFilterFactory/ /analyzer /fieldType fieldType name=text_html_body_strip class=solr.TextField positionIncrementGap=100 analyzer charFilter class=solr.PatternReplaceCharFilterFactory pattern=(?s)^.*lt;bodygt;(.*)lt;/bodygt;.*$ replacement=$1 / charFilter class=solr.HTMLStripCharFilterFactory / tokenizer class=solr.StandardTokenizerFactory/ filter class=solr.LowerCaseFilterFactory/ /analyzer /fieldType The first type accepts everything within body, including nested HTML formatting, while the latter strips nested HTML formatting as well. The tokenizer will in fact strip out white space, but that happens after all character filters have completed. -- Jack Krupansky -Original Message- From: Andreas Owen Sent: Tuesday, September 10, 2013 7:07 AM To: solr-user@lucene.apache.org Subject: Re: charfilter doesn't do anything ok i am getting there now but if there are newlines involved the regex stops as soon as it reaches a \r\n even if i try [\t\r\n.]* in the regex. I have to get rid of the newlines. why isn't whitespaceTokenizerFactory the right element for this? On 10. Sep 2013, at 1:21 AM, Jack Krupansky wrote: Use XML then. Although you will need to escape the XML special characters as I did in the pattern. The point is simply: Quickly and simply try to find the simple test scenario that illustrates the problem. -- Jack Krupansky -Original Message- From: Andreas Owen Sent: Monday, September 09, 2013 7:05 PM To: solr-user@lucene.apache.org Subject: Re: charfilter doesn't do anything i tried but that isn't working either, it want a data-stream, i'll have to check how to post json instead of xml On 10. Sep 2013, at 12:52 AM, Jack Krupansky wrote: Did you at least try the pattern I gave you? The point of the curl was the data, not how you send the data. You can just use the standard Solr simple post tool. -- Jack Krupansky -Original Message- From: Andreas Owen Sent: Monday, September 09, 2013 6:40 PM To: solr-user@lucene.apache.org Subject: Re: charfilter doesn't do anything i've downloaded curl and tried it in the comman prompt and power shell on my win 2008r2 server, thats why i used my dataimporter with a single line html file and copy/pastet the lines into schema.xml On 9. Sep 2013, at 11:20 PM, Jack Krupansky wrote: Did you in fact try my suggested example? If not, please do so. 
-- Jack Krupansky -Original Message- From: Andreas Owen Sent: Monday, September 09, 2013 4:42 PM To: solr-user@lucene.apache.org Subject: Re: charfilter doesn't do anything i index html pages with a lot of lines and not just a string with the body-tag. it doesn't work with proper html files, even though i took all the new lines out. html-file: htmlnav-contentbody nur das will ich sehen/bodyfooter-content/html solr update debug output: text_html: [html\r\n\r\nmeta name=\Content-Encoding\ content=\ISO-8859-1\\r\nmeta name=\Content-Type\ content=\text/html; charset=ISO-8859-1\\r\ntitle/title\r\n\r\nbodynav-content nur das will ich sehenfooter-content/body/html] On 8. Sep 2013, at 3:28 PM, Jack Krupansky wrote: I tried this and it seems to work when added to the standard Solr example in 4.4: field name=body type=text_html_body indexed=true stored=true / fieldType name=text_html_body class=solr.TextField positionIncrementGap=100 analyzer charFilter class=solr.PatternReplaceCharFilterFactory pattern=^.*lt;bodygt;(.*)lt;/bodygt;.*$ replacement=$1 / tokenizer class=solr.StandardTokenizerFactory/ filter class=solr.LowerCaseFilterFactory/ /analyzer /fieldType That char filter retains only text between body and /body. Is that what you wanted? Indexing this data: curl 'localhost:8983/solr/update?commit=true' -H
Dynamic analyzer settings change
Let's take the following type definition and schema (borrowed from Rafal Kuc's Solr 4 cookbook): fieldType name=text class=solr.TextField positionIncrementGap=100 analyzer tokenizer class=solr.WhitespaceTokenizerFactory/ filter class=solr.LowerCaseFilterFactory/ filter class=solr.SnowballPorterFilterFactory language=English/ /analyzer /fieldType and schema: field name=id type=string indexed=true stored=true required=true / field name=title type=text indexed=true stored=true / The above analyzer will apply the English SnowballPorterFilter. But would it be possible to change the language to French during indexing for some documents? If not, what would be the best solution for having the same analyzer but with different languages, with the language determined at index time? Thanks! -- View this message in context: http://lucene.472066.n3.nabble.com/Dynamic-analizer-settings-change-tp4089274.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: No or limited use of FieldCache
On 9/11/13 3:11 AM, Per Steffensen wrote: Hi We have a SolrCloud setup handling huge amounts of data. When we do group, facet or sort searches Solr will use its FieldCache, and add data in it for every single document we have. For us it is not realistic that this will ever fit in memory and we get OOM exceptions. Are there some way of disabling the FieldCache (taking the performance penalty of course) or make it behave in a nicer way where it only uses up to e.g. 80% of the memory available to the JVM? Or other suggestions? Regards, Per Steffensen I think you might want to look into using DocValues fields, which are column-stride fields stored as compressed arrays - one value per document -- for the fields on which you are sorting and faceting. My understanding (which is limited) is that these avoid the use of the field cache, and I believe you have the option to control whether they are held in memory or on disk. I hope someone who knows more will elaborate... -Mike
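A sketch of what Michael describes in schema.xml terms, with placeholder field names; docValues=true (Solr 4.2+) builds the column-stride structure at index time for the fields used in sorting, faceting and grouping, and requires a re-index:

<field name="category" type="string" indexed="true" stored="true" docValues="true"/>
<field name="created"  type="tdate"  indexed="true" stored="true" docValues="true"/>

Whether the values are held in memory or on disk is controlled by the docValuesFormat chosen for the field type (e.g. docValuesFormat="Disk" on the fieldType), though the exact options depend on the Solr version.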
Re: Solr doesn't return an answer when searching numbers
Mail guy. You've been around long enough to know to try adding debug=query to your URL and looking at the results; what does that show? Best Erick On Tue, Sep 10, 2013 at 9:25 AM, Mysurf Mail stammail...@gmail.com wrote: I am querying using http://...:8983/solr/vault/select?q=design test&fl=PackageName I get 3 results: - design test - design test 2013 - design test for jobs Now when I query using q=test for jobs - I get only design test for jobs. But when I query using q=2013 http://...:8983/solr/vault/select?q=2013&fl=PackageName I get no result. Why doesn't it return an answer when I query with numbers? In schema.xml: field name=PackageName type=text_en indexed=true stored=true required=true/
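To make Erick's suggestion concrete, something along these lines (host and core as in the original post) shows how the query is actually parsed:

http://...:8983/solr/vault/select?q=2013&fl=PackageName&debug=query

The parsedquery entry in the debug section of the response shows what the text_en analysis chain turns 2013 into, which usually points at the culprit.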
Re: SolrCloud 4.x hangs under high update volume
If you use CloudSolrServer, you need to apply SOLR-4816 or use a recent copy of the 4x branch. By recent, I mean like today, it looks like Mark applied this early this morning. But several reports indicate that this will solve your problem. I would expect that increasing the number of shards would make the problem worse, not better. There's also SOLR-5232... Best Erick On Tue, Sep 10, 2013 at 5:20 PM, Tim Vaillancourt t...@elementspace.comwrote: Hey guys, Based on my understanding of the problem we are encountering, I feel we've been able to reduce the likelihood of this issue by making the following changes to our app's usage of SolrCloud: 1) We increased our document batch size to 200 from 10 - our app batches updates to reduce HTTP requests/overhead. The theory is increasing the batch size reduces the likelihood of this issue happening. 2) We reduced to 1 application node sending updates to SolrCloud - we write Solr updates to Redis, and have previously had 4 application nodes pushing the updates to Solr (popping off the Redis queue). Reducing the number of nodes pushing to Solr reduces the concurrency on SolrCloud. 3) Less threads pushing to SolrCloud - due to the increase in batch size, we were able to go down to 5 update threads on the update-pushing-app (from 10 threads). To be clear the above only reduces the likelihood of the issue happening, and DOES NOT actually resolve the issue at hand. If we happen to encounter issues with the above 3 changes, the next steps (I could use some advice on) are: 1) Increase the number of shards (2x) - the theory here is this reduces the locking on shards because there are more shards. Am I onto something here, or will this not help at all? 2) Use CloudSolrServer - currently we have a plain-old least-connection HTTP VIP. If we go direct to what we need to update, this will reduce concurrency in SolrCloud a bit. Thoughts? Thanks all! Cheers, Tim On 6 September 2013 14:47, Tim Vaillancourt t...@elementspace.com wrote: Enjoy your trip, Mark! Thanks again for the help! Tim On 6 September 2013 14:18, Mark Miller markrmil...@gmail.com wrote: Okay, thanks, useful info. Getting on a plane, but ill look more at this soon. That 10k thread spike is good to know - that's no good and could easily be part of the problem. We want to keep that from happening. Mark Sent from my iPhone On Sep 6, 2013, at 2:05 PM, Tim Vaillancourt t...@elementspace.com wrote: Hey Mark, The farthest we've made it at the same batch size/volume was 12 hours without this patch, but that isn't consistent. Sometimes we would only get to 6 hours or less. During the crash I can see an amazing spike in threads to 10k which is essentially our ulimit for the JVM, but I strangely see no OutOfMemory: cannot open native thread errors that always follow this. Weird! We also notice a spike in CPU around the crash. The instability caused some shard recovery/replication though, so that CPU may be a symptom of the replication, or is possibly the root cause. The CPU spikes from about 20-30% utilization (system + user) to 60% fairly sharply, so the CPU, while spiking isn't quite pinned (very beefy Dell R720s - 16 core Xeons, whole index is in 128GB RAM, 6xRAID10 15k). More on resources: our disk I/O seemed to spike about 2x during the crash (about 1300kbps written to 3500kbps), but this may have been the replication, or ERROR logging (we generally log nothing due to WARN-severity unless something breaks). 
Lastly, I found this stack trace occurring frequently, and have no idea what it is (may be useful or not): java.lang.IllegalStateException : at org.eclipse.jetty.server.Response.resetBuffer(Response.java:964) at org.eclipse.jetty.server.Response.sendError(Response.java:325) at org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:692) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:380) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155) at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1423) at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:450) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:138) at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:564) at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:213) at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1083) at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:379) at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:175) at
Re: Regarding improving the performance of Solr
Be a little careful when extrapolating from disk to memory. Any fields where you've set stored=true will put data in segment files with extensions .fdt and .fdx, see These are the compressed verbatim copy of the data for stored fields and have very little impact on memory required for searching. I've seen indexes where 75% of the data is stored and indexes where 5% of the data is stored. Summary of File Extensions here: http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/codecs/lucene40/package-summary.html Best, Erick On Wed, Sep 11, 2013 at 2:57 AM, prabu palanisamy pr...@serendio.comwrote: @Shawn: Correctly I am trying to reduce the index size. I am working on reindex the solr with some of the features as indexed and not stored @Jean: I tried with different caches. It did not show much improvement. On Fri, Sep 6, 2013 at 3:17 PM, Shawn Heisey s...@elyograg.org wrote: On 9/6/2013 2:54 AM, prabu palanisamy wrote: I am currently using solr -3.5.0, indexed wikipedia dump (50 gb) with java 1.6. I am searching the solr with text (which is actually twitter tweets) . Currently it takes average time of 210 millisecond for each post, out of which 200 millisecond is consumed by solr server (QTime). I used the jconsole monitor tool. If the size of all your Solr indexes on disk is in the 50GB range of your wikipedia dump, then for ideal performance, you'll want to have 50GB of free memory so the OS can cache your index. You might be able to get by with 25-30GB of free memory, depending on your index composition. Note that this is memory over and above what you allocate to the Solr JVM, and memory used by other processes on the machine. If you do have other services on the same machine, note that those programs might ALSO require OS disk cache RAM. http://wiki.apache.org/solr/SolrPerformanceProblems#OS_Disk_Cache Thanks, Shawn
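As an illustration of the change prabu describes (a sketch with made-up field names), keeping a field searchable while dropping its verbatim copy from the .fdt/.fdx stored-field files is a matter of stored=false, followed by a full re-index:

<field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/>
<copyField source="title" dest="text"/>

copyField targets in particular rarely need to be stored, since the original source fields can be stored and returned instead.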
Re: Dynamic analyzer settings change
I wouldn't :). Here's the problem. Say you do this successfully at index time. How do you then search reasonably? There's often not near enough information to know what the search language is, there's little or no context. If the number of languages is limited, people often index into separate language-specific fields, say title_fr and title_en and use edismax to automatically distribute queries against all the fields. Others index families of languages in separate fields using things like the folding filters for Western languages, another field for, say, CJK languages and another for Middle Eastern languages etc. FWIW, Erick On Wed, Sep 11, 2013 at 6:55 AM, maephisto my_sky...@yahoo.com wrote: Let's take the following type definition and schema (borrowed from Rafal Kuc's Solr 4 cookbook) : fieldType name=text class=solr.TextField positionIncrementGap=100 analyzer tokenizer class=solr.WhitespaceTokenizerFactory/ filter class=solr.LowerCaseFilterFactory/ filter class=solr.SnowballPorterFilterFactory language=English/ /analyzer /fieldType and schema: field name=id type=string indexed=true stored=true required=true / field name=title type=text indexed=true stored=true / The above analizer will apply SnowballPorterFilter english language filter. But would it be possible to change the language to french during indexing for some documents. is this possible? If not, what would be the best solution for having the same analizer but with different languages, which languange being determined at index time ? Thanks! -- View this message in context: http://lucene.472066.n3.nabble.com/Dynamic-analizer-settings-change-tp4089274.html Sent from the Solr - User mailing list archive at Nabble.com.
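A sketch of the separate-fields approach, with hypothetical field names (text_en exists in the stock example schema; text_fr and text_de would need equivalent fieldTypes with French and German stemmers):

<field name="title_en" type="text_en" indexed="true" stored="true"/>
<field name="title_fr" type="text_fr" indexed="true" stored="true"/>
<field name="title_de" type="text_de" indexed="true" stored="true"/>

...&defType=edismax&qf=title_en+title_fr+title_de

If the application already knows the user's language, qf can name just the one field, which is what maephisto's follow-up later in the thread gets at.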
Re: No or limited use of FieldCache
I don't know any more than Michael, but I'd _love_ some reports from the field. There are some restriction on DocValues though, I believe one of them is that they don't really work on analyzed data FWIW, Erick On Wed, Sep 11, 2013 at 7:00 AM, Michael Sokolov msoko...@safaribooksonline.com wrote: On 9/11/13 3:11 AM, Per Steffensen wrote: Hi We have a SolrCloud setup handling huge amounts of data. When we do group, facet or sort searches Solr will use its FieldCache, and add data in it for every single document we have. For us it is not realistic that this will ever fit in memory and we get OOM exceptions. Are there some way of disabling the FieldCache (taking the performance penalty of course) or make it behave in a nicer way where it only uses up to e.g. 80% of the memory available to the JVM? Or other suggestions? Regards, Per Steffensen I think you might want to look into using DocValues fields, which are column-stride fields stored as compressed arrays - one value per document -- for the fields on which you are sorting and faceting. My understanding (which is limited) is that these avoid the use of the field cache, and I believe you have the option to control whether they are held in memory or on disk. I hope someone who knows more will elaborate... -Mike
Re: Stemming and protwords configuration
Did you try putting them _all_ in protwords.txt? i.e. frais, fraise, fraises? Don't forget to re-index. An alternative is to index in a second field that doesn't have the stemmer and when you want exact matches, search against that field. Best Erick On Mon, Sep 9, 2013 at 10:29 AM, csicard@orange.com wrote: Hi, We have a Solr server using stemming: filter class=solr.SnowballPorterFilterFactory language=French protected=protwords.txt / I would like to query the French words frais and fraise separately. I put the word fraise in protwords.txt file. - When I query the word fraise, no document indexed with the word frais are found. - When I query the word frais, I've got documents indexed with the word fraise. Is there a way to do not match fraises documents in the second situation ? I hope this is clear. Thanks for your reply. Christophe _ Ce message et ses pieces jointes peuvent contenir des informations confidentielles ou privilegiees et ne doivent donc pas etre diffuses, exploites ou copies sans autorisation. Si vous avez recu ce message par erreur, veuillez le signaler a l'expediteur et le detruire ainsi que les pieces jointes. Les messages electroniques etant susceptibles d'alteration, Orange decline toute responsabilite si ce message a ete altere, deforme ou falsifie. Merci. This message and its attachments may contain confidential or privileged information that may be protected by law; they should not be distributed, used or copied without authorisation. If you have received this email in error, please notify the sender and delete this message and its attachments. As emails may be altered, Orange is not liable for messages that have been modified, changed or falsified. Thank you.
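For the record, protwords.txt is one protected term per line, so Erick's first suggestion would look something like this (whether all three forms belong there depends on the matching you want):

# protwords.txt - terms the SnowballPorterFilterFactory leaves unstemmed
frais
fraise
fraises

The second suggestion is a parallel unstemmed field: copyField the French text into a field whose type omits the stemmer (field and type names here are made up) and query that field when exact matches are required:

<copyField source="text_fr" dest="text_fr_exact"/>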
solrj-httpclient-slow
Hi everyone, When I track my Solr client timing, I find one problem: sometimes the whole execution time is very long, but when I look at the details, the Solr server itself executes quickly; the main cost is inside HttpClient (making a connection, sending the request, receiving the response, and so on). I am not familiar with the HttpClient internals. Has anyone met the same problem? Although I updated to a newer SolrJ version, the problem remains. By the way, my SolrJ version is 4.2 and Solr is 3.*. Thanks a lot -- View this message in context: http://lucene.472066.n3.nabble.com/solrj-httpclient-slow-tp4089287.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Dynamic analyzer settings change
Thanks, Erik! I might have missed mentioning something relevant. When querying Solr, I wouldn't actually need to query all fields, but only the one corresponding to the language picked by the user on the website. If he's using DE, then the search should only apply to the text_de field. What if I need to work with 50 different languages? Then I would get a schema with 50 types and 50 fields (text_en, text_fr, text_de, ...): won't this affect the performance ? bigger documents - slower queries. -- View this message in context: http://lucene.472066.n3.nabble.com/Dynamic-analizer-settings-change-tp4089274p4089288.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Profiling Solr Lucene for query
Dmitry - currently we don't have such a front end, this sounds like a good idea creating it. And yes, we do query all 36 shards every query. Mikhail - I do think 1 minute is enough data, as during this exact minute I had a single query running (that took a qtime of 1 minute). I wanted to isolate these hard queries. I repeated this profiling few times. I think I will take the termInterval from 128 to 32 and check the results. I'm currently using NRTCachingDirectoryFactory On Mon, Sep 9, 2013 at 11:29 PM, Dmitry Kan solrexp...@gmail.com wrote: Hi Manuel, The frontend solr instance is the one that does not have its own index and is doing merging of the results. Is this the case? If yes, are all 36 shards always queried? Dmitry On Mon, Sep 9, 2013 at 10:11 PM, Manuel Le Normand manuel.lenorm...@gmail.com wrote: Hi Dmitry, I have solr 4.3 and every query is distributed and merged back for ranking purpose. What do you mean by frontend solr? On Mon, Sep 9, 2013 at 2:12 PM, Dmitry Kan solrexp...@gmail.com wrote: are you querying your shards via a frontend solr? We have noticed, that querying becomes much faster if results merging can be avoided. Dmitry On Sun, Sep 8, 2013 at 6:56 PM, Manuel Le Normand manuel.lenorm...@gmail.com wrote: Hello all Looking on the 10% slowest queries, I get very bad performances (~60 sec per query). These queries have lots of conditions on my main field (more than a hundred), including phrase queries and rows=1000. I do return only id's though. I can quite firmly say that this bad performance is due to slow storage issue (that are beyond my control for now). Despite this I want to improve my performances. As tought in school, I started profiling these queries and the data of ~1 minute profile is located here: http://picpaste.com/pics/IMG_20130908_132441-ZyrfXeTY.1378637843.jpg Main observation: most of the time I do wait for readVInt, who's stacktrace (2 out of 2 thread dumps) is: catalina-exec-3870 - Thread t@6615 java.lang.Thread.State: RUNNABLE at org.apadhe.lucene.store.DataInput.readVInt(DataInput.java:108) at org.apaChe.lucene.codeosAockTreeIermsReade$FieldReader$SegmentTermsEnumFrame.loadBlock(BlockTreeTermsReader.java: 2357) at ora.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum.seekExact(BlockTreeTermsReader.java:1745) at org.apadhe.lucene.index.TermContext.build(TermContext.java:95) at org.apache.lucene.search.PhraseQuery$PhraseWeight.init(PhraseQuery.java:221) at org.apache.lucene.search.PhraseQuery.createWeight(PhraseQuery.java:326) at org.apache.lucene.search.BooleanQuery$BooleanWeight.init(BooleanQuery.java:183) at org.apache.lucene.search.BooleanQuery.createWeight(BooleanQuery.java:384) at org.apache.lucene.searth.BooleanQuery$BooleanWeight.init(BooleanQuery.java:183) at oro.apache.lucene.search.BooleanQuery.createWeight(BooleanQuery.java:384) at org.apache.lucene.searth.BooleanQuery$BooleanWeight.init(BooleanQuery.java:183) at org.apache.lucene.search.BooleanQuery.createWeight(BooleanQuery.java:384) at org.apache.lucene.search.IndexSearcher.createNormalizedWeight(IndexSearcher.java:675) at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:297) So I do actually wait for IO as expected, but I might be too many time page faulting while looking for the TermBlocks (tim file), ie locating the term. As I reindex now, would it be useful lowering down the termInterval (default to 128)? As the FST (tip files) are that small (few 10-100 MB) so there are no memory contentions, could I lower down this param to 8 for example? 
The benefit from lowering down the term interval would be to obligate the FST to get on memory (JVM - thanks to the NRTCachingDirectory) as I do not control the term dictionary file (OS caching, loads an average of 6% of it). General configs: solr 4.3 36 shards, each has few million docs These 36 servers (each server has 2 replicas) are running virtual, 16GB memory each (4GB for JVM, 12GB remain for the OS caching), consuming 260GB of disk mounted for the index files.
RE: Dynamic analyzer settings change
-Original message- From:maephisto my_sky...@yahoo.com Sent: Wednesday 11th September 2013 14:34 To: solr-user@lucene.apache.org Subject: Re: Dynamic analizer settings change Thanks, Erik! I might have missed mentioning something relevant. When querying Solr, I wouldn't actually need to query all fields, but only the one corresponding to the language picked by the user on the website. If he's using DE, then the search should only apply to the text_de field. What if I need to work with 50 different languages? Then I would get a schema with 50 types and 50 fields (text_en, text_fr, text_de, ...): won't this affect the performance ? bigger documents - slower queries. Yes, that will affect performance greatly! The problem is not searching 50 languages but when using (e)dismax, the problem is creating the entire query. You will see good performance in the `process` part of a search but poor performance in the `prepare` part of the search when debugging. -- View this message in context: http://lucene.472066.n3.nabble.com/Dynamic-analizer-settings-change-tp4089274p4089288.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: No or limited use of FieldCache
The reason I mention sort is that we in my project, half a year ago, have dealt with the FieldCache-OOM-problem when doing sort-requests. We basically just reject sort-requests unless they hit below X documents - in case they do we just find them without sorting and sort them ourselves afterwards. Currently our problem is, that we have to do a group/distinct (in SQL-language) query and we have found that we can do what we want to do using group (http://wiki.apache.org/solr/FieldCollapsing) or facet - either will work for us. Problem is that they both use FieldCache and we know that using FieldCache will lead to OOM-execptions with the amount of data each of our Solr-nodes administrate. This time we have really no option of just limit usage as we did with sort. Therefore we need a group/distinct-functionality that works even on huge data-amounts (and a algorithm using FieldCache will not) I believe setting facet.method=enum will actually make facet not use the FieldCache. Is that true? Is it a bad idea? I do not know much about DocValues, but I do not believe that you will avoid FieldCache by using DocValues? Please elaborate, or point to documentation where I will be able to read that I am wrong. Thanks! Regards, Per Steffensen On 9/11/13 1:38 PM, Erick Erickson wrote: I don't know any more than Michael, but I'd _love_ some reports from the field. There are some restriction on DocValues though, I believe one of them is that they don't really work on analyzed data FWIW, Erick
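For what it's worth, facet.method=enum enumerates the terms in the field and intersects a filter per term, using the filterCache rather than the FieldCache, so memory use is bounded by the filterCache size. A hedged experiment (category is a placeholder field name) would be:

facet=true&facet.field=category&facet.method=enum&facet.enum.cache.minDf=100&facet.mincount=1

facet.enum.cache.minDf skips the filterCache for rare terms; whether enum is fast enough on a high-cardinality field is something to measure, since it is usually slower there than the FieldCache-based method.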
Re: Dynamic analyzer settings change
Yes, supporting multiple languages will be a performance hit, but maybe it won't be so bad since all but one of these language-specific fields will be empty for each document and Lucene text search should handle empty field values just fine. If you can't accept that performance hit, don't support multiple languages! It is completely your choice. There are index-time update processors that can do language detection and then automatically direct the text to the proper text_xx field. See: https://cwiki.apache.org/confluence/display/solr/Detecting+Languages+During+Indexing Although my e-book has a lot better examples, especially for the field redirection aspect. -- Jack Krupansky -Original Message- From: maephisto Sent: Wednesday, September 11, 2013 8:33 AM To: solr-user@lucene.apache.org Subject: Re: Dynamic analizer settings change Thanks, Erik! I might have missed mentioning something relevant. When querying Solr, I wouldn't actually need to query all fields, but only the one corresponding to the language picked by the user on the website. If he's using DE, then the search should only apply to the text_de field. What if I need to work with 50 different languages? Then I would get a schema with 50 types and 50 fields (text_en, text_fr, text_de, ...): won't this affect the performance ? bigger documents - slower queries. -- View this message in context: http://lucene.472066.n3.nabble.com/Dynamic-analizer-settings-change-tp4089274p4089288.html Sent from the Solr - User mailing list archive at Nabble.com.
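A sketch of such a chain based on the page Jack links, with made-up field names and a small language whitelist; the processor detects the language from the listed fields, records it, and maps the text into a per-language field such as text_en or text_fr:

<updateRequestProcessorChain name="langid">
  <processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
    <str name="langid.fl">title,text</str>
    <str name="langid.langField">language</str>
    <bool name="langid.map">true</bool>
    <str name="langid.map.fl">text</str>
    <str name="langid.whitelist">en,fr,de</str>
    <str name="langid.fallback">en</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

The chain is selected per update request with update.chain=langid, or wired into the update handler's defaults.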
charset encoding
I'm using Solr 4.3.1 with Tika to index HTML pages. The HTML files are ISO-8859-1 (ANSI) encoded, and the content-encoding meta tag says so as well. The server HTTP header says it's UTF-8, and the Firefox web developer tools agree. When I index a page with special chars like ä, ö, ü, Solr outputs completely foreign characters, not the usual mangled chars with 1/4 or the flag in them, so it seems it's not simply the normal UTF-8/ISO-8859-1 discrepancy. Has anyone got an idea what's wrong?
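Not an answer, but one way to narrow it down (assuming the pages are posted to the extracting handler; adjust if the data import handler is in use): declare the charset explicitly on the request and check whether the umlauts survive, which at least tells you whether the mangling happens before or inside Solr. The URL, document id and file name here are placeholders:

curl "http://localhost:8983/solr/update/extract?literal.id=doc1&commit=true" \
  -H "Content-Type: text/html; charset=ISO-8859-1" \
  --data-binary @page.html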
Re: No or limited use of FieldCache
On 09/11/2013 08:40 AM, Per Steffensen wrote: The reason I mention sort is that we in my project, half a year ago, have dealt with the FieldCache-OOM-problem when doing sort-requests. We basically just reject sort-requests unless they hit below X documents - in case they do we just find them without sorting and sort them ourselves afterwards. Currently our problem is, that we have to do a group/distinct (in SQL-language) query and we have found that we can do what we want to do using group (http://wiki.apache.org/solr/FieldCollapsing) or facet - either will work for us. Problem is that they both use FieldCache and we know that using FieldCache will lead to OOM-execptions with the amount of data each of our Solr-nodes administrate. This time we have really no option of just limit usage as we did with sort. Therefore we need a group/distinct-functionality that works even on huge data-amounts (and a algorithm using FieldCache will not) I believe setting facet.method=enum will actually make facet not use the FieldCache. Is that true? Is it a bad idea? I do not know much about DocValues, but I do not believe that you will avoid FieldCache by using DocValues? Please elaborate, or point to documentation where I will be able to read that I am wrong. Thanks! There is Simon Willnauer's presentation http://www.slideshare.net/lucenerevolution/willnauer-simon-doc-values-column-stride-fields-in-lucene and this blog post http://blog.trifork.com/2011/10/27/introducing-lucene-index-doc-values/ and this one that shows some performance comparisons: http://searchhub.org/2013/04/02/fun-with-docvalues-in-solr-4-2/
Re: Facet values for spatial field
Hi Eric (and others), thanx for the the explanation. This helps. For the usecase: I am cataloging findings of field expeditions. The collectors usualy store a single location for the field trip, so the numer of locations is limited. Regards Chris Von: Erick Erickson [erickerick...@gmail.com] Gesendet: Dienstag, 10. September 2013 19:14 Bis: solr-user@lucene.apache.org Betreff: Re: Facet values for spacial field You might be able to facet by query, but faceting by location fields doesn't make a huge amount of sense, you'll have lots of facets on individual lat/lon points. What is the use-case you are trying to support here? Best, Erick On Tue, Sep 10, 2013 at 8:43 AM, Christian Köhler - ZFMK c.koeh...@zfmk.dewrote: Hi, I use the new SpatialRecursivePrefixTreeFiel**dType field to store geo coordinates (e.g. 14.021666,51.5433353 ). I can retrieve the coordinates just find so I am sure they are indexed correctly. However when I try to create facets from this field, solr returns something which looks like a hash of the coordinates: Schema: ?xml version=1.0 encoding=UTF-8 ? schema name=example version=1.5 types ... fieldType name=location class=solr.**SpatialRecursivePrefixTreeFiel**dType units=degrees / ... field name=geo_locality type=location indexed=true stored=true / /schema Result: http://localhost/solr/browse?**facet=truefacet.field=geo_**localityhttp://localhost/solr/browse?facet=truefacet.field=geo_locality- ... lst name=facet_fields lst name=geo_locality int name=7zz660/int int name=t4m70cmvej9290/int int name=t4187pnmky3214/int int name=t441z6vwv3j179/int int name=t4328x4s6dj165/int int name=t1c639yyxdr143/int ... /lst /lst Filtering by this hashes fails: http://localhost/solr/browse?**q=fq=geo_localityhttp://localhost/solr/browse?q=fq=geo_locality :**t4m70cmvej9 java.lang.**IllegalArgumentException: missing parens: t4m70cmvej9 How do I get the results of a single location using faceting? Any thoughts? Regards Chris -- Christian Köhler Zoologisches Forschungsmuseum Alexander Koenig Leibniz-Institut für Biodiversität der Tiere Adenauerallee 160, 53113 Bonn, Germany www.zfmk.de Stiftung des öffentlichen Rechts Direktor: Prof. J. Wolfgang Wägele Sitz: Bonn -- Zoologisches Forschungsmuseum Alexander Koenig - Leibniz-Institut für Biodiversität der Tiere - Adenauerallee 160, 53113 Bonn, Germany www.zfmk.de Stiftung des öffentlichen Rechts; Direktor: Prof. J. Wolfgang Wägele Sitz: Bonn -- Zoologisches Forschungsmuseum Alexander Koenig - Leibniz-Institut für Biodiversität der Tiere - Adenauerallee 160, 53113 Bonn, Germany www.zfmk.de Stiftung des öffentlichen Rechts; Direktor: Prof. J. Wolfgang Wägele Sitz: Bonn
Re: Dynamic analyzer settings change
Thanks Jack! Indeed, very nice examples in your book. Inspired from there, here's a crazy idea: would it be possible to build a custom processor chain that would detect the language and use it to apply filters, like the aforementioned SnowballPorterFilter. That would leave at the end a document having as fields: text(with filtered content) and language(the one determined by the processor). And at search time, always append the language=user selected language. Does this make sense? If so, would it affect the performance at index time? Thanks! -- View this message in context: http://lucene.472066.n3.nabble.com/Dynamic-analizer-settings-change-tp4089274p4089305.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: solrj-httpclient-slow
First, I would be wary of mixing the solrj version with a different solr version. They are pretty compatible but what are you expecting to gain for the risk? Regardless, though, that shouldn't be your problem. You'll have to give us a lot more detail about what you're trying to do, what you mean by slow (300ms? 300 secnds?) and what you expect. Best Erick On Wed, Sep 11, 2013 at 7:44 AM, xiaoqi belivexia...@gmail.com wrote: hi,everyone when i track my solr client timing cost , i find one problem : some time the whole execute time is very long ,when i go to detail ,i find the solr server execute short time , then the main costs inside httpclient (make a connection ,send request or recived response ,blablabla. i am not familar httpclient inside code . does anyone met the same problem ? although , i update solrj 's new version ,the problem still. by the way : my solrj version is : 4.2 ,solr is 3.* Thanks a lot -- View this message in context: http://lucene.472066.n3.nabble.com/solrj-httpclient-slow-tp4089287.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Facet values for spatial field
It seems like the right thing to do here is store something more intelligible than an encoded lat/lon pair and facet on that instead. lat/lon, even bare are not all that useful without some effort anywa... FWIW, Erick On Wed, Sep 11, 2013 at 9:24 AM, Köhler Christian c.koeh...@zfmk.de wrote: Hi Eric (and others), thanx for the the explanation. This helps. For the usecase: I am cataloging findings of field expeditions. The collectors usualy store a single location for the field trip, so the numer of locations is limited. Regards Chris Von: Erick Erickson [erickerick...@gmail.com] Gesendet: Dienstag, 10. September 2013 19:14 Bis: solr-user@lucene.apache.org Betreff: Re: Facet values for spacial field You might be able to facet by query, but faceting by location fields doesn't make a huge amount of sense, you'll have lots of facets on individual lat/lon points. What is the use-case you are trying to support here? Best, Erick On Tue, Sep 10, 2013 at 8:43 AM, Christian Köhler - ZFMK c.koeh...@zfmk.dewrote: Hi, I use the new SpatialRecursivePrefixTreeFiel**dType field to store geo coordinates (e.g. 14.021666,51.5433353 ). I can retrieve the coordinates just find so I am sure they are indexed correctly. However when I try to create facets from this field, solr returns something which looks like a hash of the coordinates: Schema: ?xml version=1.0 encoding=UTF-8 ? schema name=example version=1.5 types ... fieldType name=location class=solr.**SpatialRecursivePrefixTreeFiel**dType units=degrees / ... field name=geo_locality type=location indexed=true stored=true / /schema Result: http://localhost/solr/browse?**facet=truefacet.field=geo_**locality http://localhost/solr/browse?facet=truefacet.field=geo_locality- ... lst name=facet_fields lst name=geo_locality int name=7zz660/int int name=t4m70cmvej9290/int int name=t4187pnmky3214/int int name=t441z6vwv3j179/int int name=t4328x4s6dj165/int int name=t1c639yyxdr143/int ... /lst /lst Filtering by this hashes fails: http://localhost/solr/browse?**q=fq=geo_locality http://localhost/solr/browse?q=fq=geo_locality :**t4m70cmvej9 java.lang.**IllegalArgumentException: missing parens: t4m70cmvej9 How do I get the results of a single location using faceting? Any thoughts? Regards Chris -- Christian Köhler Zoologisches Forschungsmuseum Alexander Koenig Leibniz-Institut für Biodiversität der Tiere Adenauerallee 160, 53113 Bonn, Germany www.zfmk.de Stiftung des öffentlichen Rechts Direktor: Prof. J. Wolfgang Wägele Sitz: Bonn -- Zoologisches Forschungsmuseum Alexander Koenig - Leibniz-Institut für Biodiversität der Tiere - Adenauerallee 160, 53113 Bonn, Germany www.zfmk.de Stiftung des öffentlichen Rechts; Direktor: Prof. J. Wolfgang Wägele Sitz: Bonn -- Zoologisches Forschungsmuseum Alexander Koenig - Leibniz-Institut für Biodiversität der Tiere - Adenauerallee 160, 53113 Bonn, Germany www.zfmk.de Stiftung des öffentlichen Rechts; Direktor: Prof. J. Wolfgang Wägele Sitz: Bonn
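A sketch of Erick's suggestion with a made-up field name: keep the spatial field for geo filtering, and add a plain string field carrying the human-readable locality, which is what gets faceted:

<field name="geo_locality"  type="location" indexed="true" stored="true"/>
<field name="locality_name" type="string"   indexed="true" stored="true"/>

...facet=true&facet.field=locality_name

The encoded tokens seen in the original facet output are the prefix-tree cells the location field type indexes internally, which is why they are not directly usable as facet values or filter terms.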
Re: Dynamic analyzer settings change
You're still in danger of overly-broad hits. When you try stemming differently into the _same_ underlying field you get things that make sense in one language but are totally bogus in another language matching the query. As far as lots and lots of fields is concerned, if you want to restrict your searches to only one language you have a couple of choices here Consider a different core per language. Solr easily handles many cores/server. Now you have no 'wasted' space, it just happens that the stemmer for the core uses the DE-specific stemmers. Which you can extend to German de-compounding etc. Alternatively, you can form your queries with some care. There's nothing that requires, say, edismax to be specified in solrconfig.xml. Anything you would put in the defaults section of the config you can override on the command line. So, for instance, if you knew you were querying in French, you could form something like (going from memory) defType=edismaxqf=title_fr,text_fr or qf=title_de,text_de and so completely avoid cross-languge searching. Or you could simply include a field that has the language and tack on an fq clause like fq=de. But you haven't told us how big your problem is. I wouldn't worry at all about efficiency at this stage if you have, say, 10M documents, I'd just try the simplest thing first and measure. 500M documents is probably another story. FWIW Erick On Wed, Sep 11, 2013 at 9:50 AM, maephisto my_sky...@yahoo.com wrote: Thanks Jack! Indeed, very nice examples in your book. Inspired from there, here's a crazy idea: would it be possible to build a custom processor chain that would detect the language and use it to apply filters, like the aforementioned SnowballPorterFilter. That would leave at the end a document having as fields: text(with filtered content) and language(the one determined by the processor). And at search time, always append the language=user selected language. Does this make sense? If so, would it affect the performance at index time? Thanks! -- View this message in context: http://lucene.472066.n3.nabble.com/Dynamic-analizer-settings-change-tp4089274p4089305.html Sent from the Solr - User mailing list archive at Nabble.com.
Higher Memory Usage with solr 4.4
Hi, We are using Solr 4.4 on Linux with 64-bit OpenJDK. We started Solr with a 40GB heap, but we noticed that QTime is much higher than on a similar Solr 3.5 setup. Both the 3.5 and 4.4 configurations and schemas are constructed similarly. Also, during triage we found physical memory utilization at 95%. Is there any configuration we might be missing? Looking forward to your reply. Thanks. Kuchekar, Nilesh
Re: Error with Solr 4.4.0, Glassfish, and CentOS 6.2
On 9/10/2013 9:18 PM, vhoangvu wrote: Yesterday, I just install latest version of Solr 4.4.0 on Glassfish and CentOS 6.2 and got an error when try to access the administration page. I have checked this version on Mac OS one month ago, it works well. So, please help me clarify what problem. snip [#|2013-09-10T18:31:36.896+|INFO|oracle-glassfish3.1.2|javax.enterprise.system.std.com.sun.enterprise.server.logging|_ThreadID=1;_ThreadName=Thread-2;|2907 [main] ERROR org.apache.solr.core.SolrCore ? null:org.apache.solr.common.SolrException: Error instantiating shardHandlerFactory class [HttpShardHandlerFactory]: Failure initializing default system SSL context This is a container problem. It can't initialize SSL. The most common reason is that the java keystore has a password and it hasn't been provided. If that's the problem, here's one solution: http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201303.mbox/%3c1364232676233-4051159.p...@n3.nabble.com%3E Another solution, especially if you aren't going to be hosting SSL in Java containers at all on that machine, is to get rid of the keystore entirely. If that doesn't do it, you'll need to get help from a Glassfish support avenue. Thanks, Shawn
Re: Higher Memory Usage with solr 4.4
There are some defaults (sorry, don't have them listed) that are somewhat different. If you took your 3.5 and just used it for 4.x, it's probably worth going back over it and start with the 4.x example and add in any customizations you did for 3.5... But in general, the memory usage for 4.x should be much smaller than for 3.5, there were some _major_ improvements in that area. So I'm guessing you've moved over some innocent-seeming config.. FWIW, Erick On Wed, Sep 11, 2013 at 10:54 AM, Kuchekar kuchekar.nil...@gmail.comwrote: Hi, We are using solr 4.4 on Linux with OpenJDK 64-Bit. We started the Solr with 40GB but we noticed that the QTime is way high compared to similar on 3.5 solr. Both the 3.5 and 4.4 solr's configurations and schema are similarly constructed. Also during the triage we found the physical memory to be utilized at 95..%. Is there any configuration we might be missing. Looking forward for your reply. Thanks. Kuchekar, Nilesh
Re: synonyms not working
Attach debug=query to your URL and inspect the parsed query, you should be seeing the substitutions if you're configured correctly. Multi-word synonyms at query time have the getting through the query parser problem. Best Erick On Wed, Sep 11, 2013 at 11:04 AM, cheops m.schm...@mediaskill.de wrote: Hi, I'm using solr4.4 and try to use different synonyms based on different fieldtypes: fieldType name=text_general class=solr.TextField positionIncrementGap=100 analyzer type=index tokenizer class=solr.StandardTokenizerFactory/ filter class=solr.StopFilterFactory ignoreCase=true words=stopwords.txt / filter class=solr.LowerCaseFilterFactory/ /analyzer analyzer type=query tokenizer class=solr.StandardTokenizerFactory/ filter class=solr.StopFilterFactory ignoreCase=true words=stopwords.txt / filter class=solr.SynonymFilterFactory synonyms=synonyms.txt ignoreCase=true expand=true/ filter class=solr.LowerCaseFilterFactory/ /analyzer /fieldType ...I have the same fieldtype for english (name=text_general_en and synonyms=synonyms_en.txt). The first fieldtype works fine, my synonyms are processed and the result is as expected. But the en-version doesn't seem to work. I'm able to find the original english words but the synonyms are not processed. ps: yes, i know using synonyms at query time is not a good idea :-) ... but can't change it here Any help would be appreciated! Thank you. Best regards Marcus -- View this message in context: http://lucene.472066.n3.nabble.com/synonyms-not-working-tp4089318.html Sent from the Solr - User mailing list archive at Nabble.com.
synonyms not working
Hi, I'm using solr4.4 and try to use different synonyms based on different fieldtypes: fieldType name=text_general class=solr.TextField positionIncrementGap=100 analyzer type=index tokenizer class=solr.StandardTokenizerFactory/ filter class=solr.StopFilterFactory ignoreCase=true words=stopwords.txt / filter class=solr.LowerCaseFilterFactory/ /analyzer analyzer type=query tokenizer class=solr.StandardTokenizerFactory/ filter class=solr.StopFilterFactory ignoreCase=true words=stopwords.txt / filter class=solr.SynonymFilterFactory synonyms=synonyms.txt ignoreCase=true expand=true/ filter class=solr.LowerCaseFilterFactory/ /analyzer /fieldType ...I have the same fieldtype for english (name=text_general_en and synonyms=synonyms_en.txt). The first fieldtype works fine, my synonyms are processed and the result is as expected. But the en-version doesn't seem to work. I'm able to find the original english words but the synonyms are not processed. ps: yes, i know using synonyms at query time is not a good idea :-) ... but can't change it here Any help would be appreciated! Thank you. Best regards Marcus -- View this message in context: http://lucene.472066.n3.nabble.com/synonyms-not-working-tp4089318.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: synonyms not working
Thanks for your help. I was able to solve the problem in the meantime! I had used analyzer type=query_en ... which is wrong; it must be analyzer type=query -- View this message in context: http://lucene.472066.n3.nabble.com/synonyms-not-working-tp4089318p4089345.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Higher Memory Usage with solr 4.4
On 9/11/2013 8:54 AM, Kuchekar wrote: We are using solr 4.4 on Linux with OpenJDK 64-Bit. We started the Solr with 40GB but we noticed that the QTime is way high compared to similar on 3.5 solr. Both the 3.5 and 4.4 solr's configurations and schema are similarly constructed. Also during the triage we found the physical memory to be utilized at 95..%. A 40GB heap is *huge*. Unless you are dealing with millions of super-large documents or many many millions of smaller documents, there should be no need for a heap that large. Additionally, if you are allocating most of your system memory to Java, then you will have little or no RAM available for OS disk caching, which will cause major performance issues. For most indexes, memory usage should be less after an upgrade, but there are exceptions. I see that you had an earlier question about stored field compression, and that you talked about exporting data from your 3.5 install to index into 4.4, in which you had stored every field, including copyFields. If you have a lot of stored data, memory usage for decompression can become a problem. It's usually a lot better to store minimal information, just enough to display a result grid/list, and some ID information so that when someone clicks on an individual result, you can retrieve the entire record from another data source, like a database or a filesystem. Here's a more exhaustive list of potential performance and memory problems with Solr: http://wiki.apache.org/solr/SolrPerformanceProblems OpenJDK may be problematic, especially if it's version 6. With Java 7, OpenJDK is actually the reference implementation, so if you are using OpenJDK 7, I would be less concerned. With either version, Oracle Java tends to produce better results. Thanks, Shawn
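As a hypothetical illustration of Shawn's point: on a 64GB machine it is usually better to start with a modest heap and grow it only if GC behavior or OOMs demand it, leaving the rest of the RAM to the OS page cache, e.g.:

java -Xms8g -Xmx8g -jar start.jar

The 8GB figure is an assumption for the sake of the example; the right number depends on the index size, stored-field usage and query mix.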
Distributing lucene segments across multiple disks.
Hi, I know that SolrCloud allows you to have multiple shards on different machines (or a single machine), but it requires a ZooKeeper installation for things like leader election and leader availability. While SolrCloud may be the ideal solution for my use case eventually, I'd like to know if there's a way I can point my Solr instance at Lucene segments distributed across different disks attached to the same machine. Thanks! -Deepak
Re: Distributing lucene segments across multiple disks.
I think you'll find it hard to distribute different segments between disks, as they are typically stored in the same directory. However, instantiating separate cores on different disks should be straight-forward enough, and would give you a performance benefit. I've certainly heard of that done at Amazon, with a separate EBS volume per core giving some performance improvement. Upayavira On Wed, Sep 11, 2013, at 07:35 PM, Deepak Konidena wrote: Hi, I know that SolrCloud allows you to have multiple shards on different machines (or a single machine). But it requires a zookeeper installation for doing things like leader election, leader availability, etc While SolrCloud may be the ideal solution for my usecase eventually, I'd like to know if there's a way I can point my Solr instance to read lucene segments distributed across different disks attached to the same machine. Thanks! -Deepak
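A sketch of that layout using the legacy (pre core-discovery) solr.xml format, with made-up paths; each core's dataDir points at a different physical disk, and no ZooKeeper is involved:

<solr persistent="true">
  <cores adminPath="/admin/cores">
    <core name="shard1" instanceDir="shard1" dataDir="/mnt/disk1/solr/shard1/data"/>
    <core name="shard2" instanceDir="shard2" dataDir="/mnt/disk2/solr/shard2/data"/>
  </cores>
</solr>

If the data is actually split across the cores rather than duplicated, a query can cover both with old-style distributed search, e.g. &shards=localhost:8983/solr/shard1,localhost:8983/solr/shard2.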
Re: Distributing lucene segments across multiple disks.
@Greg - Are you suggesting RAID as a replacement for Solr or making Solr work with RAID? Could you elaborate more on the latter, if that's what you meant? We make use of solr's advanced text processing features which would be hard to replicate just using RAID. -Deepak On Wed, Sep 11, 2013 at 12:11 PM, Greg Walters gwalt...@sherpaanalytics.com wrote: Why not use some form of RAID for your index store? You'd get the performance benefit of multiple disks without the complexity of managing them via solr. Thanks, Greg -Original Message- From: Deepak Konidena [mailto:deepakk...@gmail.com] Sent: Wednesday, September 11, 2013 2:07 PM To: solr-user@lucene.apache.org Subject: Re: Distributing lucene segments across multiple disks. Are you suggesting a multi-core setup, where all the cores share the same schema, and the cores lie on different disks? Basically, I'd like to know if I can distribute shards/segments on a single machine (with multiple disks) without the use of zookeeper. -Deepak On Wed, Sep 11, 2013 at 11:55 AM, Upayavira u...@odoko.co.uk wrote: I think you'll find it hard to distribute different segments between disks, as they are typically stored in the same directory. However, instantiating separate cores on different disks should be straight-forward enough, and would give you a performance benefit. I've certainly heard of that done at Amazon, with a separate EBS volume per core giving some performance improvement. Upayavira On Wed, Sep 11, 2013, at 07:35 PM, Deepak Konidena wrote: Hi, I know that SolrCloud allows you to have multiple shards on different machines (or a single machine). But it requires a zookeeper installation for doing things like leader election, leader availability, etc While SolrCloud may be the ideal solution for my usecase eventually, I'd like to know if there's a way I can point my Solr instance to read lucene segments distributed across different disks attached to the same machine. Thanks! -Deepak
Re: Distributing lucene segments across multiple disks.
Are you suggesting a multi-core setup, where all the cores share the same schema, and the cores lie on different disks? Basically, I'd like to know if I can distribute shards/segments on a single machine (with multiple disks) without the use of zookeeper. -Deepak On Wed, Sep 11, 2013 at 11:55 AM, Upayavira u...@odoko.co.uk wrote: I think you'll find it hard to distribute different segments between disks, as they are typically stored in the same directory. However, instantiating separate cores on different disks should be straight-forward enough, and would give you a performance benefit. I've certainly heard of that done at Amazon, with a separate EBS volume per core giving some performance improvement. Upayavira On Wed, Sep 11, 2013, at 07:35 PM, Deepak Konidena wrote: Hi, I know that SolrCloud allows you to have multiple shards on different machines (or a single machine). But it requires a zookeeper installation for doing things like leader election, leader availability, etc While SolrCloud may be the ideal solution for my usecase eventually, I'd like to know if there's a way I can point my Solr instance to read lucene segments distributed across different disks attached to the same machine. Thanks! -Deepak
RE: Distributing lucene segments across multiple disks.
Why not use some form of RAID for your index store? You'd get the performance benefit of multiple disks without the complexity of managing them via solr. Thanks, Greg -Original Message- From: Deepak Konidena [mailto:deepakk...@gmail.com] Sent: Wednesday, September 11, 2013 2:07 PM To: solr-user@lucene.apache.org Subject: Re: Distributing lucene segments across multiple disks. Are you suggesting a multi-core setup, where all the cores share the same schema, and the cores lie on different disks? Basically, I'd like to know if I can distribute shards/segments on a single machine (with multiple disks) without the use of zookeeper. -Deepak On Wed, Sep 11, 2013 at 11:55 AM, Upayavira u...@odoko.co.uk wrote: I think you'll find it hard to distribute different segments between disks, as they are typically stored in the same directory. However, instantiating separate cores on different disks should be straight-forward enough, and would give you a performance benefit. I've certainly heard of that done at Amazon, with a separate EBS volume per core giving some performance improvement. Upayavira On Wed, Sep 11, 2013, at 07:35 PM, Deepak Konidena wrote: Hi, I know that SolrCloud allows you to have multiple shards on different machines (or a single machine). But it requires a zookeeper installation for doing things like leader election, leader availability, etc While SolrCloud may be the ideal solution for my usecase eventually, I'd like to know if there's a way I can point my Solr instance to read lucene segments distributed across different disks attached to the same machine. Thanks! -Deepak
RE: Distributing lucene segments across multiple disks.
Deepak, Sorry for not being more verbose in my previous suggestion. As I take your question, you'd like to spread your index files across multiple disks (for performance or space reasons I assume). If you used even a basic md-raid setup you could then format the raid device and thus your entire set of disks with your favorite filesystem, mount it in one directory in the directory tree then configure it as the data directory that solr uses. This setup would accomplish your goal of having the lucene indexes spread across multiple disks without the complexity of using multiple solr cores/collections. Thanks, Greg -Original Message- From: Deepak Konidena [mailto:deepakk...@gmail.com] Sent: Wednesday, September 11, 2013 2:26 PM To: solr-user@lucene.apache.org Subject: Re: Distributing lucene segments across multiple disks. @Greg - Are you suggesting RAID as a replacement for Solr or making Solr work with RAID? Could you elaborate more on the latter, if that's you meant? We make use of solr's advanced text processing features which would be hard to replicate just using RAID. -Deepak On Wed, Sep 11, 2013 at 12:11 PM, Greg Walters gwalt...@sherpaanalytics.com wrote: Why not use some form of RAID for your index store? You'd get the performance benefit of multiple disks without the complexity of managing them via solr. Thanks, Greg -Original Message- From: Deepak Konidena [mailto:deepakk...@gmail.com] Sent: Wednesday, September 11, 2013 2:07 PM To: solr-user@lucene.apache.org Subject: Re: Distributing lucene segments across multiple disks. Are you suggesting a multi-core setup, where all the cores share the same schema, and the cores lie on different disks? Basically, I'd like to know if I can distribute shards/segments on a single machine (with multiple disks) without the use of zookeeper. -Deepak On Wed, Sep 11, 2013 at 11:55 AM, Upayavira u...@odoko.co.uk wrote: I think you'll find it hard to distribute different segments between disks, as they are typically stored in the same directory. However, instantiating separate cores on different disks should be straight-forward enough, and would give you a performance benefit. I've certainly heard of that done at Amazon, with a separate EBS volume per core giving some performance improvement. Upayavira On Wed, Sep 11, 2013, at 07:35 PM, Deepak Konidena wrote: Hi, I know that SolrCloud allows you to have multiple shards on different machines (or a single machine). But it requires a zookeeper installation for doing things like leader election, leader availability, etc While SolrCloud may be the ideal solution for my usecase eventually, I'd like to know if there's a way I can point my Solr instance to read lucene segments distributed across different disks attached to the same machine. Thanks! -Deepak
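A rough sketch of that approach on Linux, assuming four spare disks named /dev/sdb through /dev/sde and a /var/solr-data mount point (the RAID level is also an assumption; Shawn recommends RAID10 later in this thread):
# build the array, put a filesystem on it, and mount it
mdadm --create /dev/md0 --level=10 --raid-devices=4 /dev/sdb /dev/sdc /dev/sdd /dev/sde
mkfs.ext4 /dev/md0
mount /dev/md0 /var/solr-data
# then point the core's dataDir (in solrconfig.xml or solr.xml) at /var/solr-data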
Do I need to delete my index?
Hello, I'm in the process of creating my index using a series of SolrClient::request commands in PHP. I ran into a problem when some of the fields that I had as text_general fieldType contained &load= in a URL, triggering an error because the HTML entity &load wasn't recognized. I realized that I should have made my URL fields of type string instead, so that they would be taken as is (they're not being indexed, just stored), so I removed all docs from my index, updated schema.xml, and restarted Solr, but I'm still getting the same error. Do I need to delete the index itself and then restart to get this to work? Am I correct that changing those fields to string type should fix the issue? Thanks, Brian
Re: Distributing lucene segments across multiple disks.
On 9/11/2013 1:07 PM, Deepak Konidena wrote: Are you suggesting a multi-core setup, where all the cores share the same schema, and the cores lie on different disks? Basically, I'd like to know if I can distribute shards/segments on a single machine (with multiple disks) without the use of zookeeper. Sure, you can do it all manually. At that point you would not be using SolrCloud at all, because the way to enable SolrCloud is to tell Solr where zookeeper lives. Without SolrCloud, there is no cluster automation at all. There is no collection paradigm, you just have cores. You have to send updates to the correct core; they will not be redirected for you. Similarly, queries will not be load balanced automatically. For Java clients, the CloudSolrServer object can work seamlessly when servers go down. If you're not using SolrCloud, you can't use CloudSolrServer. You would be in charge of creating the shards parameter yourself. The way that I do this on my index is that I have a broker core that has no index of its own, but its solrconfig.xml has the shards and shards.qt parameters in all the request handler definitions. You can also include the parameter with the query. You would also have to handle redundancy yourself, either with replication or with independently updated indexes. I use the latter method, because it offers a lot more flexibility than replication. As mentioned in another reply, setting up RAID with a lot of disks may be better than trying to split your index up on different filesystems that each reside on different disks. I would recommend RAID10 for Solr, and it works best if it's hardware RAID and the controller has battery-backed (or NVRAM) cache. Thanks, Shawn
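For illustration, a non-SolrCloud distributed query along the lines Shawn describes can also be issued ad hoc; the host and core names below are invented:
# ask one core to fan the search out to two shards and merge the results;
# the same shards value can instead be set as a default on the broker core's /select handler
curl 'http://broker-host:8983/solr/broker/select?q=test&shards=host1:8983/solr/core1,host2:8983/solr/core2'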
Re: Error while importing HBase data to Solr using the DataImportHandler
Hi, Can you provide me an example of data-config.xml? Because with my HBase configuration, I am getting Full Import failed:java.lang.RuntimeException: java.lang.RuntimeException: org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.NoSuchMethodError: org.apache.hadoop.net.NetUtils.getInputStream(Ljava/net/Socket;)Ljava/io/InputStream; AND Exception while processing: item document : SolrInputDocument[]:org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to execute SCANNER: [tableName=Item, startRow=null, stopRow=null, columns=[{Item|r}, {Item|m}, {Item|u}]] Processing Document # 1
My data-config.xml:
<dataConfig>
  <dataSource type="HbaseDataSource" name="HBase" host="127.0.0.1" port="2181" />
  <document name="Item">
    <entity name="item" pk="ROW_KEY" dataSource="HBase" processor="HbaseEntityProcessor" tableName="Item" onError="abort" columns="Item|r, Item|m, Item|u" query="scan 'Item', {COLUMNS => ['r','m', 'u']}" deltaImportQuery="" deltaQuery="">
      <field column="ROW_KEY" name="id" />
      <field column="r" name="r" />
      <field column="m" name="m" />
      <field column="u" name="u" />
    </entity>
  </document>
</dataConfig>
Please respond ASAP. Thanks in advance!!
Re: Do I need to delete my index?
In addition, if I do need to delete my index, how do I go about that? I've been looking through the documentation and can't find anything specific. I know where the index is, I'm just not sure which files to delete. Hello, I'm in the process of creating my index using a series of SolrClient::request commands in PHP. I ran into a problem when some of the fields that I had as text_general fieldType contained load= in a URL, triggering an error because the HTML entity load wasn't recognized. I realized that I should have made my URL fields of type string instead, so that they would be taken as is (they're not being indexed, just stored), so I removed all docs from my index, updated schema.xml, and restarted Solr, but I'm still getting the same error. Do I need to delete the index itself and then restart to get this to work? Am I correct that changing those fields to string type should fix the issue? Thanks, Brian
Re: Distributing lucene segments across multiple disks.
I guess at this point in the discussion, I should probably give some more background on why I am doing what I am doing. Having a single Solr shard (multiple segments) on the same disk is posing severe performance problems under load, in that calls to Solr cause a lot of connection timeouts. When we looked at the ganglia stats for the Solr box, we saw that while memory, cpu and network usage were quite normal, the i/o wait spiked. We are unsure of what caused the i/o wait and why there were no spikes in the cpu/memory usage. Since the Solr box is a beefy box (multi-core setup, huge ram, SSD), we'd like to distribute the segments to multiple locations (disks) and see whether this improves performance under load. @Greg - Thanks for clarifying that. I just learnt that I can't set them up using RAID as some of them are SSDs and some others are SATA (spinning disks). @Shawn Heisey - Could you elaborate more about the broker core and delegating the requests to other cores? -Deepak On Wed, Sep 11, 2013 at 1:10 PM, Shawn Heisey s...@elyograg.org wrote: On 9/11/2013 1:07 PM, Deepak Konidena wrote: Are you suggesting a multi-core setup, where all the cores share the same schema, and the cores lie on different disks? Basically, I'd like to know if I can distribute shards/segments on a single machine (with multiple disks) without the use of zookeeper. Sure, you can do it all manually. At that point you would not be using SolrCloud at all, because the way to enable SolrCloud is to tell Solr where zookeeper lives. Without SolrCloud, there is no cluster automation at all. There is no collection paradigm, you just have cores. You have to send updates to the correct core; they will not be redirected for you. Similarly, queries will not be load balanced automatically. For Java clients, the CloudSolrServer object can work seamlessly when servers go down. If you're not using SolrCloud, you can't use CloudSolrServer. You would be in charge of creating the shards parameter yourself. The way that I do this on my index is that I have a broker core that has no index of its own, but its solrconfig.xml has the shards and shards.qt parameters in all the request handler definitions. You can also include the parameter with the query. You would also have to handle redundancy yourself, either with replication or with independently updated indexes. I use the latter method, because it offers a lot more flexibility than replication. As mentioned in another reply, setting up RAID with a lot of disks may be better than trying to split your index up on different filesystems that each reside on different disks. I would recommend RAID10 for Solr, and it works best if it's hardware RAID and the controller has battery-backed (or NVRAM) cache. Thanks, Shawn
Re: Do I need to delete my index?
On 9/11/2013 2:17 PM, Brian Robinson wrote: I'm in the process of creating my index using a series of SolrClient::request commands in PHP. I ran into a problem when some of the fields that I had as text_general fieldType contained &load= in a URL, triggering an error because the HTML entity &load wasn't recognized. I realized that I should have made my URL fields of type string instead, so that they would be taken as is (they're not being indexed, just stored), so I removed all docs from my index, updated schema.xml, and restarted Solr, but I'm still getting the same error. Do I need to delete the index itself and then restart to get this to work? Am I correct that changing those fields to string type should fix the issue? Changing the field type is not going to affect this issue. Because you are not indexing the field, the choice of string or text_general is not really going to matter, but string will probably be more efficient. What is happening here is an XML issue with the update request itself. The PHP client is sending an XML update request to Solr, and the request includes the URL text as-is in the XML request. It is not properly XML encoded. For an XML update request, that snippet of your text would need to be encoded as &amp;load= to work properly. XML has a much smaller list of valid entities than HTML, but &load is not a valid entity in either XML or HTML. I was going to call this a bug in the PHP client library, but then I got a look at what SolrClient::request actually does: http://php.net/manual/en/solrclient.request.php It expects you to create the XML yourself, which means you have to do all the encoding of characters which have special meaning to XML. If you have no desire to figure out proper XML encoding, you should probably be using SolrClient::addDocument or SolrClient::addDocuments instead. Thanks, Shawn
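As a sketch of what a correctly escaped hand-built update looks like (the update path, field names and URL are assumptions, not Brian's actual schema):
# the & inside the stored URL must be written as &amp; in the XML payload
curl 'http://localhost:8983/solr/update?commit=true' -H 'Content-Type: text/xml' --data-binary '<add><doc><field name="id">doc1</field><field name="url">http://example.com/page?foo=1&amp;load=2</field></doc></add>'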
RE: Distributing lucene segments across multiple disks.
Deepak, It might be a bit outside what you're willing to consider but you can make a raid out of your spinning disks then use your SSD(s) as a dm-cache device to accelerate reads and writes to the raid device. If you're putting lucene indexes on a mixed bag of disks and ssd's without any type of control for what goes where you'd want to use the ssd to accelerate the spinning disks anyway. Check out http://lwn.net/Articles/540996/ for more information on the dm-cache device. Thanks, Greg -Original Message- From: Deepak Konidena [mailto:deepakk...@gmail.com] Sent: Wednesday, September 11, 2013 3:57 PM To: solr-user@lucene.apache.org Subject: Re: Distributing lucene segments across multiple disks. I guess at this point in the discussion, I should probably give some more background on why I am doing what I am doing. Having a single Solr shard (multiple segments) on the same disk is posing severe performance problems under load,in that, calls to Solr cause a lot of connection timeouts. When we looked at the ganglia stats for the Solr box, we saw that while memory, cpu and network usage were quite normal, the i/o wait spiked. We are unsure on what caused the i/o wait and why there were no spikes in the cpu/memory usage. Since the Solr box is a beefy box (multi-core setup, huge ram, SSD), we'd like to distribute the segments to multiple locations (disks) and see whether this improves performance under load. @Greg - Thanks for clarifying that. I just learnt that I can't set them up using RAID as some of them are SSDs and some others are SATA (spinning disks). @Shawn Heisey - Could you elaborate more about the broker core and delegating the requests to other cores? -Deepak On Wed, Sep 11, 2013 at 1:10 PM, Shawn Heisey s...@elyograg.org wrote: On 9/11/2013 1:07 PM, Deepak Konidena wrote: Are you suggesting a multi-core setup, where all the cores share the same schema, and the cores lie on different disks? Basically, I'd like to know if I can distribute shards/segments on a single machine (with multiple disks) without the use of zookeeper. Sure, you can do it all manually. At that point you would not be using SolrCloud at all, because the way to enable SolrCloud is to tell Solr where zookeeper lives. Without SolrCloud, there is no cluster automation at all. There is no collection paradigm, you just have cores. You have to send updates to the correct core; they not be redirected for you. Similarly, queries will not be load balanced automatically. For Java clients, the CloudSolrServer object can work seamlessly when servers go down. If you're not using SolrCloud, you can't use CloudSolrServer. You would be in charge of creating the shards parameter yourself. The way that I do this on my index is that I have a broker core that has no index of its own, but its solrconfig.xml has the shards and shards.qt parameters in all the request handler definitions. You can also include the parameter with the query. You would also have to handle redundancy yourself, either with replication or with independently updated indexes. I use the latter method, because it offers a lot more flexibility than replication. As mentioned in another reply, setting up RAID with a lot of disks may be better than trying to split your index up on different filesystems that each reside on different disks. I would recommend RAID10 for Solr, and it works best if it's hardware RAID and the controller has battery-backed (or NVRAM) cache. Thanks, Shawn
ReplicationFactor for solrcloud
Hi - I am trying to set up 3 shards and 3 replicas for my solrcloud deployment with 3 servers, specifying replicationFactor=3 and numShards=3 when starting the first node. I see each of the servers allocated 1 shard each. However, I do not see 3 replicas allocated on each node. I specifically need to have 3 replicas across 3 servers with 3 shards. Can anyone think of a reason not to have this configuration? -- Regards, -Aditya Sakhuja
Re: Distributing lucene segments across multiple disks.
On 9/11/2013 2:57 PM, Deepak Konidena wrote: I guess at this point in the discussion, I should probably give some more background on why I am doing what I am doing. Having a single Solr shard (multiple segments) on the same disk is posing severe performance problems under load, in that calls to Solr cause a lot of connection timeouts. When we looked at the ganglia stats for the Solr box, we saw that while memory, cpu and network usage were quite normal, the i/o wait spiked. We are unsure of what caused the i/o wait and why there were no spikes in the cpu/memory usage. Since the Solr box is a beefy box (multi-core setup, huge ram, SSD), we'd like to distribute the segments to multiple locations (disks) and see whether this improves performance under load. @Greg - Thanks for clarifying that. I just learnt that I can't set them up using RAID as some of them are SSDs and some others are SATA (spinning disks). @Shawn Heisey - Could you elaborate more about the broker core and delegating the requests to other cores? On the broker core - I have a core on my servers that has no index of its own. In the /select handler (and others) I have placed a shards parameter, and many of them also have a shards.qt parameter. The shards parameter is how a non-cloud distributed search is done. http://wiki.apache.org/solr/DistributedSearch Addressing your first paragraph: You say that you have lots of RAM ... but is there a lot of unallocated RAM that the OS can use for caching, or is it mostly allocated to processes, such as the java heap for Solr? Depending on exactly how your indexes are composed, you need up to 100% of the total index size available as unallocated RAM. With SSD, the requirement is less, but cannot be ignored. I personally wouldn't go below about 25-50% even with SSD, and I'd plan on 50-100% for regular disks. There is some evidence to suggest that you only need unallocated RAM equal to 10% of your index size for caching with SSD, but that is only likely to work if you have a lot of stored (as opposed to indexed) data. If most of your index is unstored, then more would be required. Thanks, Shawn
Re: Do I need to delete my index?
On 9/11/2013 3:17 PM, Brian Robinson wrote: In addition, if I do need to delete my index, how do I go about that? I've been looking through the documentation and can't find anything specific. I know where the index is, I'm just not sure which files to delete. Generally you'll find it in a path that ends with data/index ... but if you have messed with dataDir, it might just end in /index instead. Here is an example of index directory contents, from a system that *is* changing dataDir: ncindex@bigindy5 /index/solr4/data/s2_0 $ echo `ls -1 index` _m5o.fdt _m5o.fdx _m5o.fnm _m5o_Lucene41_0.doc _m5o_Lucene41_0.pos _m5o_Lucene41_0.tim _m5o_Lucene41_0.tip _m5o_Lucene45_0.dvd _m5o_Lucene45_0.dvm _m5o.nvd _m5o.nvm _m5o.si _m5o.tvd _m5o.tvx _m5o_z.del _m5p.fdt _m5p.fdx _m5p.fnm _m5p_Lucene41_0.doc _m5p_Lucene41_0.pos _m5p_Lucene41_0.tim _m5p_Lucene41_0.tip _m5p_Lucene45_0.dvd _m5p_Lucene45_0.dvm _m5p.nvd _m5p.nvm _m5p.si _m5p.tvd _m5p.tvx _m5v.fdt _m5v.fdx _m5v.fnm _m5v_Lucene41_0.doc _m5v_Lucene41_0.pos _m5v_Lucene41_0.tim _m5v_Lucene41_0.tip _m5v_Lucene45_0.dvd _m5v_Lucene45_0.dvm _m5v.nvd _m5v.nvm _m5v.si _m5v.tvd _m5v.tvx _m5w.fdt _m5w.fdx _m5w.fnm _m5w_Lucene41_0.doc _m5w_Lucene41_0.pos _m5w_Lucene41_0.tim _m5w_Lucene41_0.tip _m5w_Lucene45_0.dvd _m5w_Lucene45_0.dvm _m5w.nvd _m5w.nvm _m5w.si _m5w.tvd _m5w.tvx _m5x.fdt _m5x.fdx _m5x.fnm _m5x_Lucene41_0.doc _m5x_Lucene41_0.pos _m5x_Lucene41_0.tim _m5x_Lucene41_0.tip _m5x_Lucene45_0.dvd _m5x_Lucene45_0.dvm _m5x.nvd _m5x.nvm _m5x.si _m5x.tvd _m5x.tvx _m5y.fdt _m5y.fdx _m5y.fnm _m5y_Lucene41_0.doc _m5y_Lucene41_0.pos _m5y_Lucene41_0.tim _m5y_Lucene41_0.tip _m5y_Lucene45_0.dvd _m5y_Lucene45_0.dvm _m5y.nvd _m5y.nvm _m5y.si _m5y.tvd _m5y.tvx _m5z.fdt _m5z.fdx _m5z.fnm _m5z_Lucene41_0.doc _m5z_Lucene41_0.pos _m5z_Lucene41_0.tim _m5z_Lucene41_0.tip _m5z_Lucene45_0.dvd _m5z_Lucene45_0.dvm _m5z.nvd _m5z.nvm _m5z.si _m5z.tvd _m5z.tvx segments_554 segments.gen write.lock Thanks, Shawn
Re: Distributing lucene segments across multiple disks.
@Greg - Thanks for the suggestion. Will pass it along to my folks. @Shawn - That's the link I was looking for 'non-SolrCloud approach to distributed search'. Thanks for passing that along. Will give it a try. As far as RAM usage goes, I believe we set the heap size to about 40% of the RAM and less than 10% is available for OS caching (since replica takes another 40%). Why does unallocated RAM help? How does it impact performance under load? -Deepak On Wed, Sep 11, 2013 at 2:50 PM, Shawn Heisey s...@elyograg.org wrote: On 9/11/2013 2:57 PM, Deepak Konidena wrote: I guess at this point in the discussion, I should probably give some more background on why I am doing what I am doing. Having a single Solr shard (multiple segments) on the same disk is posing severe performance problems under load, in that calls to Solr cause a lot of connection timeouts. When we looked at the ganglia stats for the Solr box, we saw that while memory, cpu and network usage were quite normal, the i/o wait spiked. We are unsure of what caused the i/o wait and why there were no spikes in the cpu/memory usage. Since the Solr box is a beefy box (multi-core setup, huge ram, SSD), we'd like to distribute the segments to multiple locations (disks) and see whether this improves performance under load. @Greg - Thanks for clarifying that. I just learnt that I can't set them up using RAID as some of them are SSDs and some others are SATA (spinning disks). @Shawn Heisey - Could you elaborate more about the broker core and delegating the requests to other cores? On the broker core - I have a core on my servers that has no index of its own. In the /select handler (and others) I have placed a shards parameter, and many of them also have a shards.qt parameter. The shards parameter is how a non-cloud distributed search is done. http://wiki.apache.org/solr/DistributedSearch Addressing your first paragraph: You say that you have lots of RAM ... but is there a lot of unallocated RAM that the OS can use for caching, or is it mostly allocated to processes, such as the java heap for Solr? Depending on exactly how your indexes are composed, you need up to 100% of the total index size available as unallocated RAM. With SSD, the requirement is less, but cannot be ignored. I personally wouldn't go below about 25-50% even with SSD, and I'd plan on 50-100% for regular disks. There is some evidence to suggest that you only need unallocated RAM equal to 10% of your index size for caching with SSD, but that is only likely to work if you have a lot of stored (as opposed to indexed) data. If most of your index is unstored, then more would be required. Thanks, Shawn
Re: Do I need to delete my index?
Thanks Shawn. I had actually tried changing &load= to &amp;load=, but still got the error. It sounds like addDocuments is worth a try, though. On 9/11/2013 4:37 PM, Shawn Heisey wrote: On 9/11/2013 2:17 PM, Brian Robinson wrote: I'm in the process of creating my index using a series of SolrClient::request commands in PHP. I ran into a problem when some of the fields that I had as text_general fieldType contained &load= in a URL, triggering an error because the HTML entity &load wasn't recognized. I realized that I should have made my URL fields of type string instead, so that they would be taken as is (they're not being indexed, just stored), so I removed all docs from my index, updated schema.xml, and restarted Solr, but I'm still getting the same error. Do I need to delete the index itself and then restart to get this to work? Am I correct that changing those fields to string type should fix the issue? Changing the field type is not going to affect this issue. Because you are not indexing the field, the choice of string or text_general is not really going to matter, but string will probably be more efficient. What is happening here is an XML issue with the update request itself. The PHP client is sending an XML update request to Solr, and the request includes the URL text as-is in the XML request. It is not properly XML encoded. For an XML update request, that snippet of your text would need to be encoded as &amp;load= to work properly. XML has a much smaller list of valid entities than HTML, but &load is not a valid entity in either XML or HTML. I was going to call this a bug in the PHP client library, but then I got a look at what SolrClient::request actually does: http://php.net/manual/en/solrclient.request.php It expects you to create the XML yourself, which means you have to do all the encoding of characters which have special meaning to XML. If you have no desire to figure out proper XML encoding, you should probably be using SolrClient::addDocument or SolrClient::addDocuments instead. Thanks, Shawn
Re: Distributing lucene segments across multiple disks.
On 9/11/2013 4:16 PM, Deepak Konidena wrote: As far as RAM usage goes, I believe we set the heap size to about 40% of the RAM and less than 10% is available for OS caching ( since replica takes another 40%). Why does unallocated RAM help? How does it impact performance under load? Because once the data is in the OS disk cache, reading it becomes instantaneous, it doesn't need to go out to the disk. Disks are glacial compared to RAM. Even SSD has a far slower response time. Any recent operating system does this automatically, including the one from Redmond that we all love to hate. http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html Thanks, Shawn
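On Linux, one quick way to see how much memory is actually left over for that cache (a sketch; the exact column layout varies by distribution):
# the "cached" figure is RAM the OS is using to cache files such as index segments
free -m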
Re: Do I need to delete my index?
Typically I'll just delete the entire data dir recursively after shutting down Solr, the default location is solr_home/solr/collectionblah/data On Wed, Sep 11, 2013 at 6:01 PM, Brian Robinson br...@socialsurgemedia.comwrote: Thanks Shawn. I had actually tried changing load= to amp;load=, but still got the error. It sounds like addDocuments is worth a try, though. On 9/11/2013 4:37 PM, Shawn Heisey wrote: On 9/11/2013 2:17 PM, Brian Robinson wrote: I'm in the process of creating my index using a series of SolrClient::request commands in PHP. I ran into a problem when some of the fields that I had as text_general fieldType contained load= in a URL, triggering an error because the HTML entity load wasn't recognized. I realized that I should have made my URL fields of type string instead, so that they would be taken as is (they're not being indexed, just stored), so I removed all docs from my index, updated schema.xml, and restarted Solr, but I'm still getting the same error. Do I need to delete the index itself and then restart to get this to work? Am I correct that changing those fields to string type should fix the issue? Changing the field type is not going to affect this issue. Because you are not indexing the field, the choice of string or text_general is not really going to matter, but string will probably be more efficient. What is happening here is an XML issue with the update request itself. The PHP client is sending an XML update request to Solr, and the request includes the URL text as-is in the XML request. It is not properly XML encoded. For an XML update request, that snippet of your text would need to be encoded as amp;load= to work properly. XML has a much smaller list of valid entities than HTML, but load is not a valid entity in either XML or HTML. I was going to call this a bug in the PHP client library, but then I got a look at what SolrClient::request actually does: http://php.net/manual/en/**solrclient.request.phphttp://php.net/manual/en/solrclient.request.php It expects you to create the XML yourself, which means you have to do all the encoding of characters which have special meaning to XML. If you have no desire to figure out proper XML encoding, you should probably be using SolrClient::addDocument or SolrClient::addDocuments instead. Thanks, Shawn
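A sketch of that, assuming Solr has been stopped first and a default single-core layout (adjust the path to your solr home and core name):
# remove the whole data directory; Solr recreates an empty index on restart
rm -rf /path/to/solr_home/solr/collection1/data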
Re: Do I need to delete my index?
Thanks Erick On 9/11/2013 6:46 PM, Erick Erickson wrote: Typically I'll just delete the entire data dir recursively after shutting down Solr, the default location is solr_home/solr/collectionblah/data On Wed, Sep 11, 2013 at 6:01 PM, Brian Robinson br...@socialsurgemedia.comwrote: Thanks Shawn. I had actually tried changing load= to amp;load=, but still got the error. It sounds like addDocuments is worth a try, though. On 9/11/2013 4:37 PM, Shawn Heisey wrote: On 9/11/2013 2:17 PM, Brian Robinson wrote: I'm in the process of creating my index using a series of SolrClient::request commands in PHP. I ran into a problem when some of the fields that I had as text_general fieldType contained load= in a URL, triggering an error because the HTML entity load wasn't recognized. I realized that I should have made my URL fields of type string instead, so that they would be taken as is (they're not being indexed, just stored), so I removed all docs from my index, updated schema.xml, and restarted Solr, but I'm still getting the same error. Do I need to delete the index itself and then restart to get this to work? Am I correct that changing those fields to string type should fix the issue? Changing the field type is not going to affect this issue. Because you are not indexing the field, the choice of string or text_general is not really going to matter, but string will probably be more efficient. What is happening here is an XML issue with the update request itself. The PHP client is sending an XML update request to Solr, and the request includes the URL text as-is in the XML request. It is not properly XML encoded. For an XML update request, that snippet of your text would need to be encoded as amp;load= to work properly. XML has a much smaller list of valid entities than HTML, but load is not a valid entity in either XML or HTML. I was going to call this a bug in the PHP client library, but then I got a look at what SolrClient::request actually does: http://php.net/manual/en/**solrclient.request.phphttp://php.net/manual/en/solrclient.request.php It expects you to create the XML yourself, which means you have to do all the encoding of characters which have special meaning to XML. If you have no desire to figure out proper XML encoding, you should probably be using SolrClient::addDocument or SolrClient::addDocuments instead. Thanks, Shawn
Grouping by field substring?
Hi all, Assuming I want to use the first N characters of a specific field for grouping results, is such a thing possible out-of-the-box? If not, then what would the next best option be? E.g. a custom function query? Thanks, -- Ken -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com custom big data solutions training Hadoop, Cascading, Cassandra Solr
Re: Grouping by field substring?
Do a copyField to another field, with a limit of 8 characters, and then use that other field. -- Jack Krupansky -Original Message- From: Ken Krugler Sent: Wednesday, September 11, 2013 8:24 PM To: solr-user@lucene.apache.org Subject: Grouping by field substring? Hi all, Assuming I want to use the first N characters of a specific field for grouping results, is such a thing possible out-of-the-box? If not, then what would the next best option be? E.g. a custom function query? Thanks, -- Ken -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com custom big data solutions training Hadoop, Cascading, Cassandra Solr
Re: solr performance against oracle
Setting aside the excellent responses that have already been made in this thread, there are fundamental discrepancies in what you are comparing in your respective timing tests. First off: a micro benchmark like this is virtually useless -- unless you really plan on only ever executing a single query in a single run of a java application that then terminates, trying to time a single query is silly -- you should do lots and lots of iterations using a large set of sample inputs. Second: what you are timing is vastly different between the two cases. In your Solr timing, no communication happens over the wire to the solr server until the call to server.query() inside your time stamps -- if you were doing multiple requests using the same SolrServer object, the HTTP connection would get re-used, but as things stand your timing includes all of the network overhead of connecting to the server, sending the request, and reading the response. In your oracle method however, the timestamps you record are only around the call to executeQuery(), rs.next(), and rs.getString() ... you are ignoring the timing necessary for the getConnection() and prepareStatement() methods, which may be significant as they both involve over the wire communication with the remote server (And it's not like these are one-time, execute-and-forget methods ... in a real long lived application you'd need to manage your connections, re-open if they get closed, recreate the prepared statement if your connection has to be re-opened, etc...) Your comparison is definitely apples and oranges. Lastly, as others have mentioned: 150-200ms to request a single document by uniqueKey from an index containing 800K docs seems ridiculously slow, and suggests that something is poorly configured about your solr instance (another apples to oranges comparison: you've got an ad-hoc solr installation set up on your laptop and you're benchmarking it against a remote oracle server running on dedicated remote hardware that has probably been heavily tuned/optimized for queries). You haven't provided us any details however about how your index is set up, or how you have configured solr, or what JVM options you are using to run solr, or what physical resources are available to your solr process (disk, jvm heap ram, os file system cache ram) so there isn't much we can offer in the way of advice on how to speed things up. FWIW: On my laptop, using Solr 4.4 w/ the example configs and built in jetty (ie: java -jar start.jar) I got a 3.4 GB max heap, and a 1.5 GB default heap, with plenty of physical ram left over for the os file system cache of an index I created containing 1,000,000 documents with 6 small fields containing small amounts of random terms. I then used curl to execute ~4150 requests for documents by id (using simple search, not the /get RTG handler) and return the results using JSON. This completed in under 4.5 seconds, or ~1.0ms/request.
Using the more verbose XML response format (after restarting solr to ensure nothing was in the query result caches) only took 0.3 seconds longer on the total time (~1.1ms/request)
$ time curl -sS 'http://localhost:8983/solr/collection1/select?q=id%3A[1-100:241]&wt=json&indent=true' > /dev/null
real 0m4.471s
user 0m0.412s
sys 0m0.116s
$ time curl -sS 'http://localhost:8983/solr/collection1/select?q=id%3A[1-100:241]&wt=xml&indent=true' > /dev/null
real 0m4.868s
user 0m0.376s
sys 0m0.136s
$ java -version
java version "1.7.0_25"
OpenJDK Runtime Environment (IcedTea 2.3.10) (7u25-2.3.10-1ubuntu0.12.04.2)
OpenJDK 64-Bit Server VM (build 23.7-b01, mixed mode)
$ uname -a
Linux frisbee 3.2.0-52-generic #78-Ubuntu SMP Fri Jul 26 16:21:44 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
-Hoss
Re: No or limited use of FieldCache
Per, check zee Wiki, there is a page describing docvalues. We used them successfully in a solr for analytics scenario. Otis Solr ElasticSearch Support http://sematext.com/ On Sep 11, 2013 9:15 AM, Michael Sokolov msoko...@safaribooksonline.com wrote: On 09/11/2013 08:40 AM, Per Steffensen wrote: The reason I mention sort is that we in my project, half a year ago, have dealt with the FieldCache-OOM-problem when doing sort-requests. We basically just reject sort-requests unless they hit below X documents - in case they do we just find them without sorting and sort them ourselves afterwards. Currently our problem is, that we have to do a group/distinct (in SQL-language) query and we have found that we can do what we want to do using group (http://wiki.apache.org/solr/FieldCollapsing) or facet - either will work for us. Problem is that they both use FieldCache and we know that using FieldCache will lead to OOM-exceptions with the amount of data each of our Solr-nodes administrate. This time we have really no option of just limiting usage as we did with sort. Therefore we need a group/distinct-functionality that works even on huge data-amounts (and an algorithm using FieldCache will not) I believe setting facet.method=enum will actually make facet not use the FieldCache. Is that true? Is it a bad idea? I do not know much about DocValues, but I do not believe that you will avoid FieldCache by using DocValues? Please elaborate, or point to documentation where I will be able to read that I am wrong. Thanks! There is Simon Willnauer's presentation http://www.slideshare.net/lucenerevolution/willnauer-simon-doc-values-column-stride-fields-in-lucene and this blog post http://blog.trifork.com/2011/10/27/introducing-lucene-index-doc-values/ and this one that shows some performance comparisons: http://searchhub.org/2013/04/02/fun-with-docvalues-in-solr-4-2/
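To try the facet.method=enum idea mentioned above, the parameter just goes on the request; host, collection and field names here are assumptions:
# term-enumeration faceting works off the filterCache instead of the FieldCache
curl 'http://localhost:8983/solr/collection1/select?q=*:*&rows=0&facet=true&facet.field=category&facet.method=enum'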
Re: charset encoding
Using tomcat by any chance? The ML archive has the solution. May be on Wiki, too. Otis Solr ElasticSearch Support http://sematext.com/ On Sep 11, 2013 8:56 AM, Andreas Owen a...@conx.ch wrote: i'm using solr 4.3.1 with tika to index html-pages. the html files are iso-8859-1 (ansi) encoded and the meta tag content-encoding as well. the server-http-header says it's utf8 and firefox-webdeveloper agrees. when i index a page with special chars like ä,ö,ü solr outputs completely foreign signs, not the normal wrong chars with 1/4 or the Flag in it. so it seems that it's not simply the normal utf8/iso-8859-1 discrepancy. has anyone got an idea what's wrong?
Re: Distributing lucene segments across multiple disks.
Very helpful link. Thanks for sharing that. -Deepak On Wed, Sep 11, 2013 at 4:34 PM, Shawn Heisey s...@elyograg.org wrote: On 9/11/2013 4:16 PM, Deepak Konidena wrote: As far as RAM usage goes, I believe we set the heap size to about 40% of the RAM and less than 10% is available for OS caching (since replica takes another 40%). Why does unallocated RAM help? How does it impact performance under load? Because once the data is in the OS disk cache, reading it becomes instantaneous, it doesn't need to go out to the disk. Disks are glacial compared to RAM. Even SSD has a far slower response time. Any recent operating system does this automatically, including the one from Redmond that we all love to hate. http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html Thanks, Shawn
Re: Can we used CloudSolrServer for searching data
Thanks for your reply. I am using SolrCloud with a ZooKeeper setup, and using CloudSolrServer for both indexing and searching. As per my understanding, CloudSolrServer by default uses LBHttpSolrServer: CloudSolrServer connects to ZooKeeper and passes all the running server nodes to LBHttpSolrServer. Thanks once again to you guys for your reply.
Wrapper for SOLR for Compression
I asked this before... But can we add a parameter for SOLR to expose the compression modes in solrconfig.xml? https://issues.apache.org/jira/browse/LUCENE-4226 It mentions that we can set the compression mode: FAST, HIGH_COMPRESSION, FAST_DECOMPRESSION. -- Bill Bell billnb...@gmail.com cell 720-256-8076
number of replicas in Cloud
Hi, I want to set up solrcloud with 2 shards and 1 replica for each shard:
MyCollection
shard1, shard2
shard1-replica, shard2-replica
In this case, I would give numShards=2. For replicationFactor, should I give replicationFactor=1 or replicationFactor=2? Please suggest. Thanks, Prasi
Re: number of replicas in Cloud
Prasi, a replicationFactor of 2 is what you want. However, as of the current releases, this is not persisted. On Thu, Sep 12, 2013 at 11:17 AM, Prasi S prasi1...@gmail.com wrote: Hi, I want to setup solrcloud with 2 shards and 1 replica for each shard. MyCollection shard1 , shard2 shard1-replica , shard2-replica In this case, i would numShards=2. For replicationFactor , should give replicationFactor=1 or replicationFActor=2 ? Pls suggest me. thanks, Prasi -- Anshum Gupta http://www.anshumgupta.net
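For reference, a layout like the one Prasi describes can be created through the Collections API roughly as follows (host, config name and the maxShardsPerNode value are assumptions; maxShardsPerNode only matters when there are fewer nodes than total replicas):
# 2 shards x 2 copies each = 4 cores spread across the available nodes
curl 'http://localhost:8983/solr/admin/collections?action=CREATE&name=MyCollection&numShards=2&replicationFactor=2&maxShardsPerNode=2&collection.configName=myconf'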