date:20130409

This question may not have a generel answer and may be open ended but is
there any commodity server spec. for a usual Solr running machine? I mean
what is the average server spesification for a Solr machine (i.e. Hadoop
running system it is not recommended to have very big storage capably
computers.) I will use Solr for indexing web crawled data.

Re: SOLR-4581

2013-04-09 Thread Shalin Shekhar Mangar

Hi Alexander,

I have put up a test case reproducing your issue. Perhaps someone more
familiar with faceting code can debug this.

For now, you can workaround this issue by adding facet.method=fc on your
queries.


On Mon, Apr 8, 2013 at 2:14 PM, Alexander Buhr a.b...@epages.com wrote:

 Hello,

 I created
 https://issues.apache.org/jira/browse/SOLR-4581
 on 14.03.2013. Can anyone help me out with this?
 Thank You.

 Alexander Buhr
 Software Engineer

 ePages GmbH
 Pilatuspool 2
 20355 Hamburg
 Germany

 +49-40-350 188-266 phone
 +49-40-350 188-222 fax

 a.b...@epages.commailto:a.b...@epages.com
 www.epages.comhttp://www.epages.com/
 www.epages.com/bloghttp://www.epages.com/blog
 www.epages.com/twitterhttp://www.epages.com/twitter
 www.epages.com/facebookhttp://www.epages.com/facebook

 e-commerce. now plugplay.

 Geschäftsführer: Wilfried Beeck
 Handelsregister: Amtsgericht Hamburg HRB 120861
 Sitz der Gesellschaft: Pilatuspool 2, 20355 Hamburg
 Steuernummer: 48/718/02195
 USt-Ident.-Nr.: DE 282 947 700




-- 
Regards,
Shalin Shekhar Mangar.

Doc Transformer with SolrDocumentList object

2013-04-09 Thread neha yadav

I am trying to modify the results of solr output . basically I need to
change the ranking of the output of solr for a query.

So please can anyone help.

I wrote a java code that returns the SolrDocumentList object which is a
union of the results  I want this object to be displayed on solr.

hats is once the query is hit. The solr runs the java code i wrote and the
output returned in the java code gets as a output to the screen .


I have tried to use the code as a data transformer. But I am getting this
error:


org.apache.solr.handler.dataimport.SolrWriter upload
WARNING: Error creating document : SolrInputDocument[id=44,
category=Apparel  Fash Accessories, _version_=1431753044032225280,
price=ERROR:SCHEMA-INDEX-MISMATC
H,stringValue=1400, description=for girls, brand=Wrangler,
price_c=1400,USD, siz
e=ERROR:SCHEMA-INDEX-MISMATCH,stringValue=12]
org.apache.solr.common.SolrException: version conflict for 44
expected=143175304
4032225280 actual=-1


Please can anyone help ?

Re: conditional queries?

2013-04-09 Thread Koji Sekiguchi


Hi Mark,

 Is it possible to do a conditional query if another query has no results?  For example, say I 
want to search against a given field for:


- Search for car.  If there are results, return them.
- Else, search for car* .  If there are results, return them.
- Else, search for car~ .  If there are results, return them.

Is this possible in one query?  Or would I need to make 3 separate queries by 
implementing this logic within my client?


As far as I know, there is no such SearchComponent.
But the idea of FallbackRequestHandler has been told, see SOLR-1878, for 
example:

https://issues.apache.org/jira/browse/SOLR-1878

koji
--
http://soleami.com/blog/lucene-4-is-super-convenient-for-developing-nlp-tools.html

How to configure shards with SSL?

2013-04-09 Thread eShard

Good morning everyone,
I'm running solr 4.0 Final with ManifoldCF v1.2dev on tomcat 7.0.37 and I
had shards up and running on http but when I migrated to SSL it won't work
anymore.
First I got an IO Exception but then I changed my configuration in
solrconfig.xml to this:
   requestHandler name=/all class=solr.SearchHandler
 lst name=defaults
   str name=echoParamsexplicit/str
   str name=wtxml/str
   str name=indenttrue/str
   str name=q.alt*:*/str

str name=flid, solr.title, content, category, link, pubdateiso/str

str
name=shardsdev:7443/solr/ProfilesJava/|dev:7443/solr/C3Files/|dev:7443/solr/Blogs/|dev:7443/solr/Communities/|dev:7443/solr/Wikis/|dev:7443/solr/Bedeworks/|dev:7443/solr/Forums/|dev:7443/solr/Web/|dev:7443/solr/Bookmarks//str
 

 /lst
 shardHandlerFactory class=HttpShardHandlerFactory
str name=urlSchemehttps:///str
int name=socketTimeOut1000/int
int name=connTimeOut5000/int
/shardHandlerFactory   

  /requestHandler

And Now I'm getting this error:
org.apache.solr.client.solrj.SolrServerException: No live SolrServers
available to handle this request:
How do I configure shards with SSL?
Thanks,



--
View this message in context: 
http://lucene.472066.n3.nabble.com/How-to-configure-shards-with-SSL-tp4054735.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Latency Comparison between cloud hosting Vs Dedicated hosting

2013-04-09 Thread Michael Della Bitta

On Tue, Apr 9, 2013 at 3:33 AM, Sujatha Arun suja.a...@gmail.com wrote:
 Would a bigger instance improve latency?

Yes, and prewarming caches would help, too.


Michael Della Bitta


Appinions
18 East 41st Street, 2nd Floor
New York, NY 10017-6271

www.appinions.com

Where Influence Isn’t a Game

Re: Best practice for rebuild index in SolrCloud

2013-04-09 Thread Michael Della Bitta

We're setting up two collection aliases. One's a read alias, one's a
write alias.

When we need to start over with a new collection, we create the
collection alongside the original, and point the write alias at it.

When indexing is done, we point the read alias at it.

Then you can delete the old collection when you feel good about the new one.

Obviously this means that none of your clients should point at the
collection directly, but rather one of the aliases depending on
whether they're reading or writing.

HTH,

Michael Della Bitta


Appinions
18 East 41st Street, 2nd Floor
New York, NY 10017-6271

www.appinions.com

Where Influence Isn’t a Game


On Mon, Apr 8, 2013 at 5:45 PM, Bill Au bill.w...@gmail.com wrote:
 We are using SolrCloud for replication and dynamic scaling but not
 distribution so we are only using a single shard.  From time to time we
 make changes to the index schema that requires rebuilding of the index.

 Should I treat the rebuilding as just any other index operation?  It seems
 to me it would be better if I can somehow take a node offline and rebuild
 the index there, then put it back online and let the new index be
 replicated from there.  But I am not sure how to do the latter.

 Bill

Re: conditional queries?

We do this on the client side with multiple queries. It is fairly efficient, 
because most responses are from the first, exact query.

wunder

On Apr 9, 2013, at 6:15 AM, Koji Sekiguchi wrote:

 Hi Mark,
 
  Is it possible to do a conditional query if another query has no results?  
  For example, say I want to search against a given field for:
 
 - Search for car.  If there are results, return them.
 - Else, search for car* .  If there are results, return them.
 - Else, search for car~ .  If there are results, return them.
 
 Is this possible in one query?  Or would I need to make 3 separate queries 
 by implementing this logic within my client?
 
 As far as I know, there is no such SearchComponent.
 But the idea of FallbackRequestHandler has been told, see SOLR-1878, for 
 example:
 
 https://issues.apache.org/jira/browse/SOLR-1878
 
 koji
 -- 
 http://soleami.com/blog/lucene-4-is-super-convenient-for-developing-nlp-tools.html

Execution of Queries in Parallel: geotagged textual documents in Solrvvvv

2013-04-09 Thread Massimiliano Ruocco

I have around 100M of textual document geotagged (lat,long). THese 
documents are indexed with Solr 1.4. I am testing a retrieval model 
(written over Terrier). This model requires frequent execution of 
queries ( Bounding-box filter). These queries could be executed in 
parallel, one for each specific geographic tile.


I was wondering if exists a solution speeding up the execution of 
queries in parallel. My naif idea is Split the index in many parts 
according the geographical tiles (how to do that? SolrCloud? Solr Index 
Replication? What is the max number of eventual replication?)


Any practical further suggestion?

Thanks in advance

Massimiliano

Re: Search data who does not have x field

2013-04-09 Thread Victor Ruiz

Sorry, I didnt explain my self good, I mean , you have to create an
additional field 'hasCategory' in your schema, and then, before indexing,
set the field 'hasCategory' in the indexed document as true, if your
document has categories, or set it to false, if it has any. With this you
will save computation time, since the query for a boolean field is much
easier for Solr than checking for an empty string field. 

The query should be = q=*:*fq=hasCategory:true 


anurag.jain wrote
 another solution would be to add a boolean field, hasCategory, and use it
 for filtering 
 q=
 your query here
 fq=hasCategory:true 
 
 
 I am not getting result.
 
 
 i am trying
 
 localhost:8983/search?q=*:*fq=category:true
 
 it is giving zero result.
 
 by the way first technique is working fine.





--
View this message in context: 
http://lucene.472066.n3.nabble.com/Search-data-who-does-not-have-x-field-tp4046959p4054763.html
Sent from the Solr - User mailing list archive at Nabble.com.

corrupted index in slave?

2013-04-09 Thread Victor Ruiz

Hi guys,

I'm getting exceptions in a Solr slave, when accessing TermVector component
and RealTimeGetHandler. The weird thing is, that in the master and in one of
the 2 slaves, the documents are ok, and the same query doesnt return any
exception. For now, the only way I have to solve the problem is deleting
these documents and indexing them again.

I upgraded Solr from 4.0 directly to 4.2, then to 4.2.1 last week These
exceptions seems to appear since the upgrade to 4.2.
I didn't run the script for migrating the index files (as I did in the
migration from 3.6 to 4.0), should I? Has the format of the index changed?
If not, is that a known bug? If it's, sorry I couldn't find it in JIRA.

These are the exceptions I get:

{responseHeader:{status:500,QTime:1},response:{numFound:1,start:0,docs:[{itemid:105266867,text:exklusiver
kann man kaum würzen  safran ist das teuerste gewürz der welt handverlesen
und in mühevoller kleinstarbeit hergestellt ist safran sehr selten und wird
in winzigen mengen gehandelt und
verwendet,title:safran,domainid:4287,date_i:2012-11-21T17:01:23Z,date:2012-11-21T17:01:09Z,category:[kultur,literatur,gesellschaft,umwelt,trinken,essen]}]},termVectors:[uniqueKeyFieldName,itemid,105266867,[uniqueKey,105266867]],error:{trace:java.lang.ArrayIndexOutOfBoundsException\n\tat
org.apache.lucene.codecs.compressing.LZ4.decompress(LZ4.java:132)\n\tat
org.apache.lucene.codecs.compressing.CompressionMode$4.decompress(CompressionMode.java:135)\n\tat
org.apache.lucene.codecs.compressing.CompressingTermVectorsReader.get(CompressingTermVectorsReader.java:493)\n\tat
org.apache.lucene.index.SegmentReader.getTermVectors(SegmentReader.java:175)\n\tat
org.apache.lucene.index.BaseCompositeReader.getTermVectors(BaseCompositeReader.java:97)\n\tat
org.apache.lucene.index.IndexReader.getTermVector(IndexReader.java:385)\n\tat
org.apache.solr.handler.component.TermVectorComponent.process(TermVectorComponent.java:313)\n\tat
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:208)\n\tat
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)\n\tat
org.apache.solr.core.SolrCore.execute(SolrCore.java:1817)\n\tat
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:639)\n\tat
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:345)\n\tat
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:141)\n\tat
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1157)\n\tat
org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:388)\n\tat
org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)\n\tat
org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)\n\tat
org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)\n\tat
org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:418)\n\tat
org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)\n\tat
org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)\n\tat
org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)\n\tat
org.mortbay.jetty.Server.handle(Server.java:326)\n\tat
org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)\n\tat
org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:926)\n\tat
org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)\n\tat
org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)\n\tat
org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)\n\tat
org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:410)\n\tat
org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)\n,code:500}}


{error:{trace:java.lang.ArrayIndexOutOfBoundsException\n\tat
org.apache.lucene.codecs.compressing.LZ4.decompress(LZ4.java:132)\n\tat
org.apache.lucene.codecs.compressing.CompressionMode$4.decompress(CompressionMode.java:135)\n\tat
org.apache.lucene.codecs.compressing.CompressingStoredFieldsReader.visitDocument(CompressingStoredFieldsReader.java:258)\n\tat
org.apache.lucene.index.SegmentReader.document(SegmentReader.java:139)\n\tat
org.apache.lucene.index.BaseCompositeReader.document(BaseCompositeReader.java:116)\n\tat
org.apache.lucene.index.IndexReader.document(IndexReader.java:436)\n\tat
org.apache.solr.search.SolrIndexSearcher.doc(SolrIndexSearcher.java:640)\n\tat
org.apache.solr.search.SolrIndexSearcher.doc(SolrIndexSearcher.java:568)\n\tat
org.apache.solr.handler.component.RealTimeGetComponent.process(RealTimeGetComponent.java:176)\n\tat
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:208)\n\tat
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)\n\tat
org.apache.solr.core.SolrCore.execute(SolrCore.java:1817)\n\tat

Re: corrupted index in slave?

2013-04-09 Thread Victor Ruiz

sorry I forgot to say, the exceptions are not for every document, but only
for a few...

regards,
Victor

Victor Ruiz wrote
 Hi guys,
 
 I'm getting exceptions in a Solr slave, when accessing TermVector
 component and RealTimeGetHandler. The weird thing is, that in the master
 and in one of the 2 slaves, the documents are ok, and the same query
 doesnt return any exception. For now, the only way I have to solve the
 problem is deleting these documents and indexing them again.
 
 I upgraded Solr from 4.0 directly to 4.2, then to 4.2.1 last week These
 exceptions seems to appear since the upgrade to 4.2.
 I didn't run the script for migrating the index files (as I did in the
 migration from 3.6 to 4.0), should I? Has the format of the index changed?
 If not, is that a known bug? If it's, sorry I couldn't find it in JIRA.
 
 These are the exceptions I get:
 
 {responseHeader:{status:500,QTime:1},response:{numFound:1,start:0,docs:[{itemid:105266867,text:exklusiver
 kann man kaum würzen  safran ist das teuerste gewürz der welt handverlesen
 und in mühevoller kleinstarbeit hergestellt ist safran sehr selten und
 wird in winzigen mengen gehandelt und
 verwendet,title:safran,domainid:4287,date_i:2012-11-21T17:01:23Z,date:2012-11-21T17:01:09Z,category:[kultur,literatur,gesellschaft,umwelt,trinken,essen]}]},termVectors:[uniqueKeyFieldName,itemid,105266867,[uniqueKey,105266867]],error:{trace:java.lang.ArrayIndexOutOfBoundsException\n\tat
 org.apache.lucene.codecs.compressing.LZ4.decompress(LZ4.java:132)\n\tat
 org.apache.lucene.codecs.compressing.CompressionMode$4.decompress(CompressionMode.java:135)\n\tat
 org.apache.lucene.codecs.compressing.CompressingTermVectorsReader.get(CompressingTermVectorsReader.java:493)\n\tat
 org.apache.lucene.index.SegmentReader.getTermVectors(SegmentReader.java:175)\n\tat
 org.apache.lucene.index.BaseCompositeReader.getTermVectors(BaseCompositeReader.java:97)\n\tat
 org.apache.lucene.index.IndexReader.getTermVector(IndexReader.java:385)\n\tat
 org.apache.solr.handler.component.TermVectorComponent.process(TermVectorComponent.java:313)\n\tat
 org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:208)\n\tat
 org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)\n\tat
 org.apache.solr.core.SolrCore.execute(SolrCore.java:1817)\n\tat
 org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:639)\n\tat
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:345)\n\tat
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:141)\n\tat
 org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1157)\n\tat
 org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:388)\n\tat
 org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)\n\tat
 org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)\n\tat
 org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)\n\tat
 org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:418)\n\tat
 org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)\n\tat
 org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)\n\tat
 org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)\n\tat
 org.mortbay.jetty.Server.handle(Server.java:326)\n\tat
 org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)\n\tat
 org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:926)\n\tat
 org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)\n\tat
 org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)\n\tat
 org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)\n\tat
 org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:410)\n\tat
 org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)\n,code:500}}
 
 
 {error:{trace:java.lang.ArrayIndexOutOfBoundsException\n\tat
 org.apache.lucene.codecs.compressing.LZ4.decompress(LZ4.java:132)\n\tat
 org.apache.lucene.codecs.compressing.CompressionMode$4.decompress(CompressionMode.java:135)\n\tat
 org.apache.lucene.codecs.compressing.CompressingStoredFieldsReader.visitDocument(CompressingStoredFieldsReader.java:258)\n\tat
 org.apache.lucene.index.SegmentReader.document(SegmentReader.java:139)\n\tat
 org.apache.lucene.index.BaseCompositeReader.document(BaseCompositeReader.java:116)\n\tat
 org.apache.lucene.index.IndexReader.document(IndexReader.java:436)\n\tat
 org.apache.solr.search.SolrIndexSearcher.doc(SolrIndexSearcher.java:640)\n\tat
 org.apache.solr.search.SolrIndexSearcher.doc(SolrIndexSearcher.java:568)\n\tat
 org.apache.solr.handler.component.RealTimeGetComponent.process(RealTimeGetComponent.java:176)\n\tat
 org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:208)\n\tat

Re: solr 4.2.1 still has problems with index version and index generation


: And with replication?command=details I also see the correct commit part as
: above, BUT where the hell are the wrong info below the commit array are
: coming from?

Please read the details in the previously mentioned Jira issue...

https://issues.apache.org/jira/browse/SOLR-4661

The indexVersion and generation you are looking at refer to the speciics 
of the IndexReader as used by the *seracher* on the master server -- but 
in addition to situations like openSearcher=false, there are some 
optimizations in place such that Solr/Lucene is smart enough to realize 
when an empty commit doesn't change the IndexReader it continues to use 
the previous commit point...

https://issues.apache.org/jira/browse/SOLR-4661?focusedCommentId=13620195page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13620195

...but from the perspective of the slave, this is still a commit that 
needs replicated and loaded.

Hence the current objective of the patch in SOLR-3855: add more details to 
the command=details response (as well as the Admin UI) to clearly 
distinguish between the gen/ver of the currently replicatable commit and 
the gen/ver of the currently open searcher.

All available information suggests that this is purely a problem of 
conveying information to users via command=details -- Replication is 
behaving as designed using the correct information about hte commit 
points.



-Hoss

How can I set configuration options?

2013-04-09 Thread Edd Grant

Hi all,

I have been working through the examples on the SolrCloud page:
http://wiki.apache.org/solr/SolrCloud

I am now at the point where, rather than firing up Solr through start.jar,
I'm deploying the Solr war in to Tomcat instances. Taking the following
command as an example:

java -Dbootstrap_confdir=./solr/collection1/conf
-Dcollection.configName=myconf -DzkRun
-DzkHost=localhost:9983,localhost:8574,localhost:9900 -DnumShards=2
-jar start.jar

I can't figure out from the documentation how/ where I set the above
properties when deploying Solr as a war file. I initially thought these
might be configurable through solr.xml but can't find anything in the
documentation to support this.

Most grateful for any pointers here.

Cheers,

Edd
-- 
Web: http://www.eddgrant.com
Email: e...@eddgrant.com
Mobile: +44 (0) 7861 394 543

Re: conditional queries?

2013-04-09 Thread Miguel

I not sure, but you can create a class extend of SearchComponent and 
include at the least of your requesthandler and in this way add optional 
actions about whatever query on your solr server.

Example solrconfig.xml
 requestHandler
...
 arr name=last-components
  stractions/str
/arr
 /requestHandler

 searchComponent name=actions class= HERE YOUR CLASS 
str name=params/str
  /searchComponent

Regars

El 09/04/2013 17:05, Walter Underwood escribió:

We do this on the client side with multiple queries. It is fairly efficient, 
because most responses are from the first, exact query.

wunder

On Apr 9, 2013, at 6:15 AM, Koji Sekiguchi wrote:


Hi Mark,


Is it possible to do a conditional query if another query has no results?  For 
example, say I want to search against a given field for:

- Search for car.  If there are results, return them.
- Else, search for car* .  If there are results, return them.
- Else, search for car~ .  If there are results, return them.

Is this possible in one query?  Or would I need to make 3 separate queries by 
implementing this logic within my client?

As far as I know, there is no such SearchComponent.
But the idea of FallbackRequestHandler has been told, see SOLR-1878, for 
example:

https://issues.apache.org/jira/browse/SOLR-1878

koji
--
http://soleami.com/blog/lucene-4-is-super-convenient-for-developing-nlp-tools.html

Index Replication Failing in Solr 4.2.1

2013-04-09 Thread Umesh Prasad

Hi All,
  I am migrating from Solr 3.5.0 to Solr 4.2.1. And everything is running
fine and set to go, except the master slave replication.

We use master slave replication with multi cores ( 1 master, 10 slaves and
20 plus cores).

My Configuration is :

Master :  Solr 3.5.0,  Has existing index, and delta import running using
DIH.
Slave : Solr 4.2.1 ,  Has no startup index


Apr 9, 2013 9:18:40 PM org.apache.solr.core.SolrCore execute
INFO: [phcare] webapp= path=/replication
params={command=fetchindex_=1365522520521wt=json} status=0 QTime=1
Apr 9, 2013 9:18:40 PM org.apache.solr.handler.SnapPuller fetchLatestIndex
*INFO: Master's generation: 107876
*Apr 9, 2013 9:18:40 PM org.apache.solr.handler.SnapPuller fetchLatestIndex
*INFO: Slave's generation: 79248
*Apr 9, 2013 9:18:40 PM org.apache.solr.handler.SnapPuller fetchLatestIndex
INFO: Starting replication process
*Apr 9, 2013 9:18:40 PM org.apache.solr.handler.SnapPuller fetchFileList
SEVERE: No files to download for index generation: 107876
*Apr 9, 2013 9:18:40 PM org.apache.solr.core.SolrCore execute
INFO: [phcare] webapp= path=/replication
params={command=details_=1365522520556wt=json} status=0 QTime=7

In Both Master and Slave The File list for replicable version is correct.
*on Slave *

{

   - masterDetails: {
  - indexSize: 4.31 MB,
  - indexPath:
  /var/lib/fk-w3-sherlock/cores/phcare/data/index.20130124235012,
  - commits: [
 - [
- indexVersion,
- 1323961124638,
- generation,
- 107856,
- filelist,
- [
   - _45e1.tii,
   - _45e1.nrm,
   -

..


*ON Master
*
[

   - indexVersion,
   - 1323961124638,
   - generation,
   - 107856,
   - filelist,
   - [
  - _45e1.tii,
  - _45e1.nrm,
  - _45e2_1.del,
  - _45e2.frq,
  - _45e1_3.del,
  - _45e1.tis,
  - ..



Can someone help. Our whole Migration to Solr 4.2 is blocked on Replication
issue.

---
Thanks  Regards
Umesh Prasad

SolrCloud: Result Grouping - no groups with field type with precisionStep 0

2013-04-09 Thread Elodie Sannier


Hello,

I am using the Result Grouping feature with SolrCloud, and it seems that
grouping does not work with field types having precisionStep property
greater than 0, in distributed mode.

I updated the SolrCloud - Getting Started page example A (Simple two
shard cluster).
In my schema.xml, the popularity field has an int type where I
changed precisionStep from 0 to 4 :

fieldType name=int class=solr.TrieIntField precisionStep=4
positionIncrementGap=0 /
field name=popularity type=int indexed=true stored=true /

When I'm requesting in distributed mode, the grouping on this field does
not return groups :
http://localhost:8983/solr/select?q=*:*group=truegroup.field=popularitydistrib=true

lst name=grouped
lst name=popularity
int name=matches1/int
arr name=groups
lst
int name=groupValue0/int
result name=doclist numFound=0 start=0 /
/lst
/arr
/lst
/lst

When I'm requesting on a single core, the grouping on this field returns
a group :
http://localhost:8983/solr/select?q=*:*group=truegroup.field=popularitydistrib=false

lst
int name=groupValue10/int
result name=doclist numFound=1 start=0
doc
str name=idMA147LL/A/str
...
int name=popularity10/int
...
/doc
/result
/lst

If I come back to the origin configuration, changing the int type with
precisionStep=0, the distributed request works :
fieldType name=int class=solr.TrieIntField precisionStep=0
positionIncrementGap=0 /

The precisionStep  0 can be useful for range queries but is it normal
that it is not compatible with grouping queries, in distributed mode only ?

Elodie Sannier

Kelkoo SAS
Société par Actions Simplifiée
Au capital de € 4.168.964,30
Siège social : 8, rue du Sentier 75002 Paris
425 093 069 RCS Paris

Ce message et les pièces jointes sont confidentiels et établis à l'attention 
exclusive de leurs destinataires. Si vous n'êtes pas le destinataire de ce 
message, merci de le détruire et d'en avertir l'expéditeur.

Re: Execution of Queries in Parallel: geotagged textual documents in Solrvvvv

Hi,

I'd move to SolrCloud 4.2.1 to benefit from sharding, replication, and
the latest Lucene.  How many queries you will then be able to run in
parallel will depend on their complexity, index size, query
cachability, index size, latency requirements... But move to the
latest setup first.

Otis
--
SOLR Performance Monitoring - http://sematext.com/spm/index.html





On Tue, Apr 9, 2013 at 11:10 AM, Massimiliano Ruocco ruo...@idi.ntnu.no wrote:
 I have around 100M of textual document geotagged (lat,long). THese documents
 are indexed with Solr 1.4. I am testing a retrieval model (written over
 Terrier). This model requires frequent execution of queries ( Bounding-box
 filter). These queries could be executed in parallel, one for each specific
 geographic tile.

 I was wondering if exists a solution speeding up the execution of queries in
 parallel. My naif idea is Split the index in many parts according the
 geographical tiles (how to do that? SolrCloud? Solr Index Replication? What
 is the max number of eventual replication?)

 Any practical further suggestion?

 Thanks in advance

 Massimiliano

Re: Latency Comparison between cloud hosting Vs Dedicated hosting

Hi Sujatha,

You should really do the same stuff to improve latency in the cloud as
what you would do on a dedicated server.
Amazon-specific stuff:
Bigger EC2 instances have better IO.  EBS performance varies.  Some
people mount N of them and stripe across them.  Some people try N EBS
volumes to find the best performing one(s) and discard the rest.  Some
people pay for provisioned IOPS.

Otis
--
SOLR Performance Monitoring - http://sematext.com/spm/index.html





On Tue, Apr 9, 2013 at 3:33 AM, Sujatha Arun suja.a...@gmail.com wrote:
 Hi,

 We are comparing search request latency between Amazon Vs  Dedicated
 hosting [Rackspace] .For comparison we used solr version 3.6.1 and Amazon
 small instance.The index size was less than 1GB.

 We see that the latency is about 75 -100 %  from Amazon. Any body who has
 migrated form Dedicated hosting to Cloud has any pointers  for improving
 latecny?

 Would a bigger instance improve latency?

 Regards
 Sujatha

Solr 4.2.1 SSLInitializationException

2013-04-09 Thread Sarita Nair

Hi All,

Deploying Solr 4.2.1 to GlassFish 3.1.1 results in the error below.  I have 
seen similar problems being reported with Solr 4.2

and my take-away was that 4.2.1 contains the necessary fix.

Any help with this will be appreciated.

Thanks!


    2013-04-09 10:45:06,144 [main] ERROR 
org.apache.solr.servlet.SolrDispatchFilter - Could not start Solr. Check 
solr/home property and the logs
    2013-04-09 10:45:06,224 [main] ERROR 
org.apache.solr.core.SolrCore - 
null:org.apache.http.conn.ssl.SSLInitializationException: Failure 
initializing default system SSL context
    Caused by: java.io.IOException: Keystore was tampered with, or password was 
incorrect
  at sun.security.provider.JavaKeyStore.engineLoad(JavaKeyStore.java:772)
    at sun.security.provider.JavaKeyStore$JKS.engineLoad(JavaKeyStore.java:55)
    at java.security.KeyStore.load(KeyStore.java:1214)
    at
 
org.apache.http.conn.ssl.SSLSocketFactory.createSystemSSLContext(SSLSocketFactory.java:281)
    at 
org.apache.http.conn.ssl.SSLSocketFactory.createSystemSSLContext(SSLSocketFactory.java:366)
 
    ... 50 more
Caused by: java.security.UnrecoverableKeyException: Password verification failed
    at sun.security.provider.JavaKeyStore.engineLoad(JavaKeyStore.java:770)

Re: Solr 4.2.1 SSLInitializationException


: Deploying Solr 4.2.1 to GlassFish 3.1.1 results in the error below.  I 
: have seen similar problems being reported with Solr 4.2

Are you trying to use server SSL with glassfish?

can you please post the full stack trace so we can see where this error is 
coming from.

My best guess is that this is coming from the changes made in 
SOLR-4451 to use system defaults correctly when initializing HttpClient, 
which suggets that your problem is exactly what the error message says...

  Keystore was tampered with, or password was incorrect

Is it possible that the default keystore password for your JVM (or as 
overridden by glassfish defaults - possibly using the 
javax.net.ssl.keyStore sysprop) has a password set on it?  If so you 
need to confiure your JVM with the standard java system properties to 
specify what that password is.

http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201303.mbox/%3c1364232676233-4051159.p...@n3.nabble.com%3E

:     2013-04-09 10:45:06,144 [main] ERROR 
: org.apache.solr.servlet.SolrDispatchFilter - Could not start Solr. Check 
solr/home property and the logs
:     2013-04-09 10:45:06,224 [main] ERROR 
: org.apache.solr.core.SolrCore - 
: null:org.apache.http.conn.ssl.SSLInitializationException: Failure 
: initializing default system SSL context
:     Caused by: java.io.IOException: Keystore was tampered with, or password 
was incorrect
:   at sun.security.provider.JavaKeyStore.engineLoad(JavaKeyStore.java:772)
:     at sun.security.provider.JavaKeyStore$JKS.engineLoad(JavaKeyStore.java:55)
:     at java.security.KeyStore.load(KeyStore.java:1214)
:     at
:  
org.apache.http.conn.ssl.SSLSocketFactory.createSystemSSLContext(SSLSocketFactory.java:281)
:     at 
org.apache.http.conn.ssl.SSLSocketFactory.createSystemSSLContext(SSLSocketFactory.java:366)
 
:     ... 50 more
: Caused by: java.security.UnrecoverableKeyException: Password verification 
failed
:     at sun.security.provider.JavaKeyStore.engineLoad(JavaKeyStore.java:770)

-Hoss

Re: Execution of Queries in Parallel: geotagged textual documents in Solrvvvv


: I'd move to SolrCloud 4.2.1 to benefit from sharding, replication, and
: the latest Lucene.  How many queries you will then be able to run in
: parallel will depend on their complexity, index size, query
: cachability, index size, latency requirements... But move to the
: latest setup first.

No to mention thta geospatial query support is vastly improved in Solr 4.x 
vs what was possible in Solr 1.4.

-Hoss

query regarding the use of boost across the fields in edismax query

2013-04-09 Thread Rohan Thakur

hi all

wanted to know what could be the difference between the results if I apply
boost accross say 5 fields in query like for

first: title^10.0 features^7.0 cat^5.0 color^3.0 root^1.0 and
second settings like : title^10.0 features^5.0 cat^3.0 color^2.0 root^1.0

what could be the difference as in the weights are in same order decreasing?

thanks in advance

regards
Rohan

Re: query regarding the use of boost across the fields in edismax query

Not sure if i'm missing something but in the first case features, cat,
and color field have more weight, so matches on them with have bigger
contribution to the overall relevancy score.

Otis
--
Solr  ElasticSearch Support
http://sematext.com/





On Tue, Apr 9, 2013 at 1:52 PM, Rohan Thakur rohan.i...@gmail.com wrote:
 hi all

 wanted to know what could be the difference between the results if I apply
 boost accross say 5 fields in query like for

 first: title^10.0 features^7.0 cat^5.0 color^3.0 root^1.0 and
 second settings like : title^10.0 features^5.0 cat^3.0 color^2.0 root^1.0

 what could be the difference as in the weights are in same order decreasing?

 thanks in advance

 regards
 Rohan

Re: Average Solr Server Spec.

Hi,

You are right there is no average.  I saw a Solr cluster with a
few EC2 micro instances yesterday and regularly see Solr running on 16
or 32 GB RAM and sometimes well over 100 GB RAM.  Sometimes they have
just 2 CPU cores, sometimes 32 or more.  Some use SSDs, some HDDs,
some local storage, some SAN, some EBS on AWS. etc.

Otis
--
Solr  ElasticSearch Support
http://sematext.com/





On Tue, Apr 9, 2013 at 7:04 AM, Furkan KAMACI furkankam...@gmail.com wrote:
 This question may not have a generel answer and may be open ended but is
 there any commodity server spec. for a usual Solr running machine? I mean
 what is the average server spesification for a Solr machine (i.e. Hadoop
 running system it is not recommended to have very big storage capably
 computers.) I will use Solr for indexing web crawled data.

Re: Average Solr Server Spec.

We mostly run m1.xlarge with an 8GB heap. --wunder

On Apr 9, 2013, at 10:57 AM, Otis Gospodnetic wrote:

 Hi,
 
 You are right there is no average.  I saw a Solr cluster with a
 few EC2 micro instances yesterday and regularly see Solr running on 16
 or 32 GB RAM and sometimes well over 100 GB RAM.  Sometimes they have
 just 2 CPU cores, sometimes 32 or more.  Some use SSDs, some HDDs,
 some local storage, some SAN, some EBS on AWS. etc.
 
 Otis
 --
 Solr  ElasticSearch Support
 http://sematext.com/
 
 
 
 
 
 On Tue, Apr 9, 2013 at 7:04 AM, Furkan KAMACI furkankam...@gmail.com wrote:
 This question may not have a generel answer and may be open ended but is
 there any commodity server spec. for a usual Solr running machine? I mean
 what is the average server spesification for a Solr machine (i.e. Hadoop
 running system it is not recommended to have very big storage capably
 computers.) I will use Solr for indexing web crawled data.

Results Order When Performing Wildcard Query

2013-04-09 Thread P Williams

Hi,

I wrote a test of my application which revealed a Solr oddity (I think).
 The test which I wrote on Windows 7 and makes use of the
solr-test-frameworkhttp://lucene.apache.org/solr/4_1_0/solr-test-framework/index.html
fails
under Ubuntu 12.04 because the Solr results I expected for a wildcard query
of the test data are ordered differently under Ubuntu than Windows.  On
both Windows and Ubuntu all items in the result set have a score of 1.0 and
appear to be ordered by docid (which looks like in corresponds to
alphabetical unique id on Windows but not Ubuntu).  I'm guessing that the
root of my issue is that a different docid was assigned to the same
document on each operating system.

The data was imported using a DataImportHandler configuration during a
@BeforeClass step in my JUnit test on both systems.

Any suggestions on how to ensure a consistently ordered wildcard query
result set for testing?

Thanks,
Tricia

Re: How can I set configuration options?

2013-04-09 Thread Nate Fox

In Ubuntu, I've added it to /etc/default/tomcat7 in the JAVA_OPTS options.

For example, I have:
JAVA_OPTS=-Djava.awt.headless=true -Xmx2048m -XX:+UseConcMarkSweepGC
JAVA_OPTS=${JAVA_OPTS} -DnumShards=2 -Djetty.port=8080
-DzkHost=zookeeper01.dev.:2181 -Dboostrap_conf=true



--
Nate Fox
Sr Systems Engineer

o: 310.658.5775
m: 714.248.5350

Follow us @NEOGOV http://twitter.com/NEOGOV and on
Facebookhttp://www.facebook.com/neogov

NEOGOV http://www.neogov.com/ is among the top fastest growing software
companies in the USA, recognized by Inc 500|5000, Deloitte Fast 500, and
the LA Business Journal. We are hiring!http://www.neogov.com/#/company/careers



On Tue, Apr 9, 2013 at 8:55 AM, Edd Grant e...@eddgrant.com wrote:

 Hi all,

 I have been working through the examples on the SolrCloud page:
 http://wiki.apache.org/solr/SolrCloud

 I am now at the point where, rather than firing up Solr through start.jar,
 I'm deploying the Solr war in to Tomcat instances. Taking the following
 command as an example:

 java -Dbootstrap_confdir=./solr/collection1/conf
 -Dcollection.configName=myconf -DzkRun
 -DzkHost=localhost:9983,localhost:8574,localhost:9900 -DnumShards=2
 -jar start.jar

 I can't figure out from the documentation how/ where I set the above
 properties when deploying Solr as a war file. I initially thought these
 might be configurable through solr.xml but can't find anything in the
 documentation to support this.

 Most grateful for any pointers here.

 Cheers,

 Edd
 --
 Web: http://www.eddgrant.com
 Email: e...@eddgrant.com
 Mobile: +44 (0) 7861 394 543

Re: How can I set configuration options?

Hi Edd;

The parameters you mentioned are JVM parameters. There are two ways to
define them.
First one is if you are using an IDE you can indicate them as JVM
parameters. i.e. if you are using Intellij IDEA when you click your
Run/Debug configurations there is a line called VM Options. You can write
your paramters without writing java word in front of them.

Second one is deploying your war file into Tomcat without using an IDE (I
think this is what you want). Here is what to do:

Go to tomcat home folder and under the bin folder create a file called
setenv.sh Then add that lines:

#!/bin/sh
#
#
export JAVA_OPTS=$JAVA_OPTS
-Dbootstrap_confdir=./solr/collection1/conf
-Dcollection.configName=myconf -DzkRun
-DzkHost=localhost:9983,localhost:8574,localhost:9900 -DnumShards=2



2013/4/9 Edd Grant e...@eddgrant.com

 Hi all,

 I have been working through the examples on the SolrCloud page:
 http://wiki.apache.org/solr/SolrCloud

 I am now at the point where, rather than firing up Solr through start.jar,
 I'm deploying the Solr war in to Tomcat instances. Taking the following
 command as an example:

 java -Dbootstrap_confdir=./solr/collection1/conf
 -Dcollection.configName=myconf -DzkRun
 -DzkHost=localhost:9983,localhost:8574,localhost:9900 -DnumShards=2
 -jar start.jar

 I can't figure out from the documentation how/ where I set the above
 properties when deploying Solr as a war file. I initially thought these
 might be configurable through solr.xml but can't find anything in the
 documentation to support this.

 Most grateful for any pointers here.

 Cheers,

 Edd
 --
 Web: http://www.eddgrant.com
 Email: e...@eddgrant.com
 Mobile: +44 (0) 7861 394 543

Re: Pointing to Hbase for Docuements or Directly Saving Documents at Hbase

You may also be interested in looking at things like solrbase (on Github).

Otis
--
Solr  ElasticSearch Support
http://sematext.com/





On Sat, Apr 6, 2013 at 6:01 PM, Furkan KAMACI furkankam...@gmail.com wrote:
 Hi;

 First of all should mention that I am new to Solr and making a research
 about it. What I am trying to do that I will crawl some websites with Nutch
 and then I will index them with Solr. (Nutch 2.1, Solr-SolrCloud 4.2 )

 I wonder about something. I have a cloud of machines that crawls websites
 and stores that documents. Then I send that documents into SolrCloud. Solr
 indexes that documents and generates indexes and save them. I know that
 from Information Retrieval theory: it *may* not be efficient to store
 indexes at a NoSQL database (they are something like linked lists and if
 you store them in such kind of database you *may* have a sparse
 representation -by the way there may be some solutions for it. If you
 explain them you are welcome.)

 However Solr stores some documents too (i.e. highlights) So some of my
 documents will be doubled somehow. If I consider that I will have many
 documents, that dobuled documents may cause a problem for me. So is there
 any way not storing that documents at Solr and pointing to them at
 Hbase(where I save my crawled documents) or instead of pointing directly
 storing them at Hbase (is it efficient or not)?

Re: Average Solr Server Spec.

Hi Walter;

Could I learn that what is the average size of Solr indexes and average
query per second to your Solr. Maybe I can come up with an assumption?

2013/4/9 Walter Underwood wun...@wunderwood.org

 We mostly run m1.xlarge with an 8GB heap. --wunder

 On Apr 9, 2013, at 10:57 AM, Otis Gospodnetic wrote:

  Hi,
 
  You are right there is no average.  I saw a Solr cluster with a
  few EC2 micro instances yesterday and regularly see Solr running on 16
  or 32 GB RAM and sometimes well over 100 GB RAM.  Sometimes they have
  just 2 CPU cores, sometimes 32 or more.  Some use SSDs, some HDDs,
  some local storage, some SAN, some EBS on AWS. etc.
 
  Otis
  --
  Solr  ElasticSearch Support
  http://sematext.com/
 
 
 
 
 
  On Tue, Apr 9, 2013 at 7:04 AM, Furkan KAMACI furkankam...@gmail.com
 wrote:
  This question may not have a generel answer and may be open ended but is
  there any commodity server spec. for a usual Solr running machine? I
 mean
  what is the average server spesification for a Solr machine (i.e. Hadoop
  running system it is not recommended to have very big storage capably
  computers.) I will use Solr for indexing web crawled data.

Indexing and searching documents in different languages

2013-04-09 Thread dev



Hello,

I'm trying to index a large number of documents in different languages.
I don't know the language of the document, so I'm using  
TikaLanguageIdentifierUpdateProcessorFactory to identify it.


So, this is my configuration in solrconfig.xml

 updateRequestProcessorChain name=langid
   processor  
class=org.apache.solr.update.processor.TikaLanguageIdentifierUpdateProcessorFactory

 bool name=langidtrue/bool
 str name=langid.fltitle,subtitle,content/str
 str name=langid.langFieldlanguage_s/str
 str name=langid.threshold0.3/str
 str name=langid.fallbackgeneral/str
 str name=langid.whitelisten,fr,de,it,es/str
 bool name=langid.maptrue/bool
 bool name=langid.map.keepOrigtrue/bool
   /processor
   processor class=solr.LogUpdateProcessorFactory /
   processor class=solr.RunUpdateProcessorFactory /
 /updateRequestProcessorChain

So, the detection works fine and I put some dynamic fields in  
schema.xml to store the results:
  dynamicField name=*_en  type=text_enindexed=true   
stored=true multiValued=true/
  dynamicField name=*_fr  type=text_frindexed=true   
stored=true multiValued=true/
  dynamicField name=*_de  type=text_deindexed=true   
stored=true multiValued=true/
  dynamicField name=*_it  type=text_itindexed=true   
stored=true multiValued=true/
  dynamicField name=*_es  type=text_esindexed=true   
stored=true multiValued=true/


My main problem now is how to search the document without knowing the  
language of the searched document.
I don't want to have a huge querystring like   
?q=title_en:+term+subtitle_en:+term+title_de:+term...
Okay, using copyField and copy all fields into the text field...but  
text has the type text_general, so the language specific indexing is  
not working. I could use at least a combined field for every language  
(like text_en, text_fr...) but still, my querystring gets very long  
and to add new languages is terribly uncomfortable.


So, what can I do? Is there a better solution to index and search  
documents in many languages without knowing the language of the  
document and the query before?


- Geschan

Re: Number of segments

2013-04-09 Thread Michael Long

My main concern was just making sure we were getting the best search 
performance, and that we did not have too many segments. Every attempt I 
made to adjust the segment count resulted in no difference (segment 
count never changed). Looking at that blog page, it looks like 30-40 
segments is probably the norm.


On 04/08/2013 08:43 PM, Chris Hostetter wrote:

: How do I determine how many tiers it has?

You may find this blog post from mccandless helpful...

http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html

(don't ignore the videos! watching them really helpful to understand what
he is talking about)

Once you've obsorbed that, then please revist your question, specifically
Upayavira's key point: what is the problem you are trying to solve?

https://people.apache.org/~hossman/#xyproblem
XY Problem

Your question appears to be an XY Problem ... that is: you are dealing
with X, you are assuming Y will help you, and you are asking about Y
without giving more details about the X so that we can understand the
full issue.  Perhaps the best solution doesn't involve Y at all?
See Also: http://www.perlmonks.org/index.pl?node_id=542341


-Hoss

Re: Indexing and searching documents in different languages

Hi,

Typically people try to figure out the query language somehow.
Queries are short, so LID on them is hard.  But user profile could
indicate a language, or users can be asked and such.

Otis
--
Solr  ElasticSearch Support
http://sematext.com/





On Tue, Apr 9, 2013 at 2:32 PM,  d...@geschan.de wrote:

 Hello,

 I'm trying to index a large number of documents in different languages.
 I don't know the language of the document, so I'm using
 TikaLanguageIdentifierUpdateProcessorFactory to identify it.

 So, this is my configuration in solrconfig.xml

  updateRequestProcessorChain name=langid
processor
 class=org.apache.solr.update.processor.TikaLanguageIdentifierUpdateProcessorFactory
  bool name=langidtrue/bool
  str name=langid.fltitle,subtitle,content/str
  str name=langid.langFieldlanguage_s/str
  str name=langid.threshold0.3/str
  str name=langid.fallbackgeneral/str
  str name=langid.whitelisten,fr,de,it,es/str
  bool name=langid.maptrue/bool
  bool name=langid.map.keepOrigtrue/bool
/processor
processor class=solr.LogUpdateProcessorFactory /
processor class=solr.RunUpdateProcessorFactory /
  /updateRequestProcessorChain

 So, the detection works fine and I put some dynamic fields in schema.xml to
 store the results:
   dynamicField name=*_en  type=text_enindexed=true  stored=true
 multiValued=true/
   dynamicField name=*_fr  type=text_frindexed=true  stored=true
 multiValued=true/
   dynamicField name=*_de  type=text_deindexed=true  stored=true
 multiValued=true/
   dynamicField name=*_it  type=text_itindexed=true  stored=true
 multiValued=true/
   dynamicField name=*_es  type=text_esindexed=true  stored=true
 multiValued=true/

 My main problem now is how to search the document without knowing the
 language of the searched document.
 I don't want to have a huge querystring like
 ?q=title_en:+term+subtitle_en:+term+title_de:+term...
 Okay, using copyField and copy all fields into the text field...but text
 has the type text_general, so the language specific indexing is not working.
 I could use at least a combined field for every language (like text_en,
 text_fr...) but still, my querystring gets very long and to add new
 languages is terribly uncomfortable.

 So, what can I do? Is there a better solution to index and search documents
 in many languages without knowing the language of the document and the query
 before?

 - Geschan

Re: Solr metrics in Codahale metrics and Graphite?

If it isn't obvious, I'm glad to help test a patch for this. We can run a 
simulated production load in dev and report to our metrics server.

wunder

On Apr 8, 2013, at 1:07 PM, Walter Underwood wrote:

 That approach sounds great. --wunder
 
 On Apr 7, 2013, at 9:40 AM, Alan Woodward wrote:
 
 I've been thinking about how to improve this reporting, especially now that 
 metrics-3 (which removes all of the funky thread issues we ran into last 
 time I tried to add it to Solr) is close to release.  I think we could go 
 about it as follows:
 
 * refactor the existing JMX reporting to use metrics-3.  This would mean 
 replacing the SolrCore.infoRegistry map with a MetricsRegistry, and adding a 
 JmxReporter, keeping the existing config logic to determine which JMX server 
 to use.  PluginInfoHandler and SolrMBeanInfoHandler translate the metrics-3 
 data back into SolrMBean format to keep the reporting backwards-compatible.  
 This seems like a lot of work for no visible benefit, but…
 * we can then add the ability to define other metrics reporters in 
 solrconfig.xml.  There are already reporters for Ganglia and Graphite - you 
 just add then to the Solr lib/ directory, configure them in solrconfig, and 
 voila - Solr can be monitored using the same devops tools you use to monitor 
 everything else.
 
 Does this sound sane?
 
 Alan Woodward
 www.flax.co.uk
 
 
 On 6 Apr 2013, at 20:49, Walter Underwood wrote:
 
 Wow, that really doesn't help at all, since these seem to only be reported 
 in the stats page. 
 
 I don't need another non-standard app-specific set of metrics, especially 
 one that needs polling. I need metrics delivered to the common system that 
 we use for all our servers.
 
 This is also why SPM is not useful for us, sorry Otis.
 
 Also, there is no time period on these stats. How do you graph the 95th 
 percentile? I know there was a lot of work on these, but they seem really 
 useless to me. I'm picky about metrics, working at Netflix does that to you.
 
 wunder
 
 On Apr 3, 2013, at 4:01 PM, Walter Underwood wrote:
 
 In the Jira, but not in the docs. 
 
 It would be nice to have VM stats like GC, too, so we can have common 
 monitoring and alerting on all our services.
 
 wunder
 
 On Apr 3, 2013, at 3:31 PM, Otis Gospodnetic wrote:
 
 It's there! :)
 http://search-lucene.com/?q=percentilefc_project=Solrfc_type=issue
 
 Otis
 --
 Solr  ElasticSearch Support
 http://sematext.com/
 
 On Wed, Apr 3, 2013 at 6:29 PM, Walter Underwood wun...@wunderwood.org 
 wrote:
 That sounds great. I'll check out the bug, I didn't see anything in the 
 docs about this. And if I can't find it with a search engine, it 
 probably isn't there.  --wunder
 
 On Apr 3, 2013, at 6:39 AM, Shawn Heisey wrote:
 
 On 3/29/2013 12:07 PM, Walter Underwood wrote:
 What are folks using for this?
 
 I don't know that this really answers your question, but Solr 4.1 and
 later includes a big chunk of codahale metrics internally for request
 handler statistics - see SOLR-1972.  First we tried including the jar
 and using the API, but that created thread leak problems, so the source
 code was added.
 
 Thanks,
 Shawn
 
 
 
 
 
 
 --
 Walter Underwood
 wun...@wunderwood.org
 
 
 

--
Walter Underwood
wun...@wunderwood.org

Re: Indexing and searching documents in different languages

2013-04-09 Thread Alexandre Rafalovitch

Have you looked at edismax and the 'qf' fields parameter? It allows you to
define the fields to search. Also, you can define those parameters in
solrconfig.xml and not have to send them down the wire.

Finally, you can define several different request handlers (e.g. /ensearch,
/frsearch) and have each of them use different 'qf' values, possibly with
'fl' field also defined and with field name aliasing from language-specific
to generic names.

Regards,
   Alex.

Personal blog: http://blog.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at
once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)


On Tue, Apr 9, 2013 at 2:32 PM, d...@geschan.de wrote:


 Hello,

 I'm trying to index a large number of documents in different languages.
 I don't know the language of the document, so I'm using
 TikaLanguageIdentifierUpdatePr**ocessorFactory to identify it.

 So, this is my configuration in solrconfig.xml

  updateRequestProcessorChain name=langid
processor class=org.apache.solr.update.**processor.**
 TikaLanguageIdentifierUpdatePr**ocessorFactory
  bool name=langidtrue/bool
  str name=langid.fltitle,**subtitle,content/str
  str name=langid.langField**language_s/str
  str name=langid.threshold0.3/**str
  str name=langid.fallback**general/str
  str name=langid.whitelisten,fr,**de,it,es/str
  bool name=langid.maptrue/bool
  bool name=langid.map.keepOrig**true/bool
/processor
processor class=solr.**LogUpdateProcessorFactory /
processor class=solr.**RunUpdateProcessorFactory /
  /updateRequestProcessorChain

 So, the detection works fine and I put some dynamic fields in schema.xml
 to store the results:
   dynamicField name=*_en  type=text_enindexed=true
  stored=true multiValued=true/
   dynamicField name=*_fr  type=text_frindexed=true
  stored=true multiValued=true/
   dynamicField name=*_de  type=text_deindexed=true
  stored=true multiValued=true/
   dynamicField name=*_it  type=text_itindexed=true
  stored=true multiValued=true/
   dynamicField name=*_es  type=text_esindexed=true
  stored=true multiValued=true/

 My main problem now is how to search the document without knowing the
 language of the searched document.
 I don't want to have a huge querystring like
  ?q=title_en:+term+subtitle_en:**+term+title_de:+term...
 Okay, using copyField and copy all fields into the text field...but
 text has the type text_general, so the language specific indexing is not
 working. I could use at least a combined field for every language (like
 text_en, text_fr...) but still, my querystring gets very long and to add
 new languages is terribly uncomfortable.

 So, what can I do? Is there a better solution to index and search
 documents in many languages without knowing the language of the document
 and the query before?

 - Geschan

Re: Slow qTime for distributed search

2013-04-09 Thread Manuel Le Normand

Thanks for replying.
My config:

   - 40 dedicated servers, dual-core each
   - Running Tomcat servlet on Linux
   - 12 Gb RAM per server, splitted half between OS and Solr
   - Complex queries (up to 30 conditions on different fields), 1 qps rate

Sharding my index was done for two reasons, based on 2 servers (4shards)
tests:

   1. As index grew above few million of docs qTime raised greatly, while
   sharding the index to smaller pieces (about 0.5M docs) gave way better
   results, so I bound every shard to have 0.5M docs.
   2. Tests showed i was cpu-bounded during queries. As i have low qps rate
   (emphasize: lower than expected qTime) and as a query runs single-threaded
   on each shard, it made sense to accord a cpu to each shard.

For the same amount of docs per shards I do expect a raise in total qTime
for the reasons:

   1. The response should wait for the slowest shard
   2. Merging the responses from 40 different shards takes time

What i understand from your explanation is that it's the merging that takes
time and as qTime ends only after the second retrieval phase, the qTime on
each shard will take longer. Meaning during a significant proportion of the
first query phase (right after the [id,score] are retieved), all cpu's are
idle except the response-merger thread running on a single cpu. I thought
of the merge as a simple sorting of [id,score], way more simple than
additional 300 ms cpu time.

Why would a RAM increase improve my performances, as it's a
response-merge (CPU resource) bottleneck?

Thanks in advance,
Manu


On Mon, Apr 8, 2013 at 10:19 PM, Shawn Heisey s...@elyograg.org wrote:

 On 4/8/2013 12:19 PM, Manuel Le Normand wrote:

 It seems that sharding my collection to many shards slowed down
 unreasonably, and I'm trying to investigate why.

 First, I created collection1 - 4 shards*replicationFactor=1 collection
 on
 2 servers. Second I created collection2 - 48 shards*replicationFactor=2
 collection on 24 servers, keeping same config and same num of documents
 per
 shard.


 The primary reason to use shards is for index size, when your index is so
 big that a single index cannot give you reasonable performance. There are
 also sometimes performance gains when you break a smaller index into
 shards, but there is a limit.

 Going from 2 shards to 3 shards will have more of an impact that going
 from 8 shards to 9 shards.  At some point, adding shards makes things
 slower, not faster, because of the extra work required for combining
 multiple queries into one result response.  There is no reasonable way to
 predict when that will happen.

  Observations showed the following:

 1. Total qTime for the same query set is 5 time higher in collection2
 (150ms-700 ms)
 2. Adding to colleciton2 the *shard.info=true* param in the query
 shows

 that each shard is much slower than each shard was in collection1
 (about 4
 times slower)
 3.  Querying only specific shards on collection2 (by adding the

 shards=shard1,shard2...shard12 param) gave me much better qTime per
 shard
 (only 2 times higher than in collection1)
 4. I have a low qps rate, thus i don't suspect the replication factor

 for being the major cause of this.
 5. The avg. cpu load on servers during querying was much higher in

 collection1 than in collection2 and i didn't catch any other
 bottlekneck.


 A distributed query actually consists of up to two queries per shard. The
 first query just requests the uniqueKey field, not the entire document.  If
 you are sorting the results, then the sort field(s) are also requested,
 otherwise the only additional information requested is the relevance score.
  The results are compiled into a set of unique keys, then a second query is
 sent to the proper shards requesting specific documents.


  Q:
 1. Why does the amount of shards affect the qTime of each shard?
 2. How can I overcome to reduce back the qTime of each shard?


 With more shards, it takes longer for the first phase to compile the
 results, so the second phase (document retrieval) gets delayed, and the
 QTime goes up.

 One way to reduce the total time is to reduce the number of shards.

 You haven't said anything about how complex your queries are, your index
 size(s), or how much RAM you have on each server and how it is allocated.
  Can you provide this information?

 Getting good performance out of Solr requires plenty of RAM in your OS
 disk cache.  Query times of 150 to 700 milliseconds seem very high, which
 could be due to query complexity or a lack of server resources (especially
 RAM), or possibly both.

 Thanks,
 Shawn

Re: Results Order When Performing Wildcard Query


On 4/9/2013 12:08 PM, P Williams wrote:

I wrote a test of my application which revealed a Solr oddity (I think).
  The test which I wrote on Windows 7 and makes use of the
solr-test-frameworkhttp://lucene.apache.org/solr/4_1_0/solr-test-framework/index.html
fails
under Ubuntu 12.04 because the Solr results I expected for a wildcard query
of the test data are ordered differently under Ubuntu than Windows.  On
both Windows and Ubuntu all items in the result set have a score of 1.0 and
appear to be ordered by docid (which looks like in corresponds to
alphabetical unique id on Windows but not Ubuntu).  I'm guessing that the
root of my issue is that a different docid was assigned to the same
document on each operating system.


It might be due to differences in how Java works on the two platforms, 
or even something as simple as different Java versions.  I don't know a 
lot about the underlying Lucene stuff, so this next sentence may not be 
correct: If you have are not starting from an index where the actual 
index directory was deleted before the test started (rather than 
deleting all documents), that might produce different internal Lucene 
document ids.



The data was imported using a DataImportHandler configuration during a
@BeforeClass step in my JUnit test on both systems.

Any suggestions on how to ensure a consistently ordered wildcard query
result set for testing?


Include an explicit sort parameter.  That way it will depend on the 
data, not the internal Lucene representation.


Thanks,
Shawn

Re: Slow qTime for distributed search


On 4/9/2013 2:10 PM, Manuel Le Normand wrote:

Thanks for replying.
My config:

- 40 dedicated servers, dual-core each
- Running Tomcat servlet on Linux
- 12 Gb RAM per server, splitted half between OS and Solr
- Complex queries (up to 30 conditions on different fields), 1 qps rate

Sharding my index was done for two reasons, based on 2 servers (4shards)
tests:

1. As index grew above few million of docs qTime raised greatly, while
sharding the index to smaller pieces (about 0.5M docs) gave way better
results, so I bound every shard to have 0.5M docs.
2. Tests showed i was cpu-bounded during queries. As i have low qps rate
(emphasize: lower than expected qTime) and as a query runs single-threaded
on each shard, it made sense to accord a cpu to each shard.

For the same amount of docs per shards I do expect a raise in total qTime
for the reasons:

1. The response should wait for the slowest shard
2. Merging the responses from 40 different shards takes time

What i understand from your explanation is that it's the merging that takes
time and as qTime ends only after the second retrieval phase, the qTime on
each shard will take longer. Meaning during a significant proportion of the
first query phase (right after the [id,score] are retieved), all cpu's are
idle except the response-merger thread running on a single cpu. I thought
of the merge as a simple sorting of [id,score], way more simple than
additional 300 ms cpu time.

Why would a RAM increase improve my performances, as it's a
response-merge (CPU resource) bottleneck?


If you have not tweaked the Tomcat configuration, that can lead to 
problems, but if your total query volume is really only one query per 
second, this is probably not a worry for you.  A tomcat connector can be 
configured with a maxThreads parameter.  The recommended value there is 
1, but Tomcat defaults to 200.


You didn't include the index sizes.  There's half a million docs per 
shard, but I don't know what that translates to in terms of MB or GB of 
disk space.


On another email thread you mention that your documents are about 50KB 
each.  That would translate to an index that's at least 25GB, possibly 
more.  That email thread also says that optimization for you takes an 
hour, further indications that you've got some really big indexes.


You're saying that you have given 6GB out of the 12GB to Solr, leaving 
only 6GB for the OS and caching.  Ideally you want to have enough RAM to 
cache the entire index, but in reality you can usually get away with 
caching between half and two thirds of the index.  Exactly what ratio 
works best is highly dependent on your schema.


If my numbers are even close to right, then you've got a lot more index 
on each server than available RAM.  Based on what I can deduce, you 
would want 24 to 48GB of RAM per server.  If my numbers are wrong, then 
this estimate is wrong.


I would be interested in seeing your queries.  If the complexity can be 
expressed as filter queries that get re-used a lot, the filter cache can 
be a major boost to performance.  Solr's caches in general can make a 
big difference.  There is no guarantee that caches will help, of course.


Thanks,
Shawn

Re: How can I set configuration options?

2013-04-09 Thread Edd Grant

Thanks for the replies. The problem I have is that setting them at the JVM
level would mean that all instances of Solr deployed in the Tomcat instance
are forced to use the same settings. I actually want to set the properties
at the application level (e.g. in solr.xml, zoo.conf or maybe an
application level Tomcat Context.xml file).

I'll grab the Solr source and see if there's any way to do this, unless
anyone knows how off the top of their head?

Cheers,

Edd


On 9 April 2013 19:21, Furkan KAMACI furkankam...@gmail.com wrote:

 Hi Edd;

 The parameters you mentioned are JVM parameters. There are two ways to
 define them.
 First one is if you are using an IDE you can indicate them as JVM
 parameters. i.e. if you are using Intellij IDEA when you click your
 Run/Debug configurations there is a line called VM Options. You can write
 your paramters without writing java word in front of them.

 Second one is deploying your war file into Tomcat without using an IDE (I
 think this is what you want). Here is what to do:

 Go to tomcat home folder and under the bin folder create a file called
 setenv.sh Then add that lines:

 #!/bin/sh
 #
 #
 export JAVA_OPTS=$JAVA_OPTS
 -Dbootstrap_confdir=./solr/collection1/conf
 -Dcollection.configName=myconf -DzkRun
 -DzkHost=localhost:9983,localhost:8574,localhost:9900 -DnumShards=2



 2013/4/9 Edd Grant e...@eddgrant.com

  Hi all,
 
  I have been working through the examples on the SolrCloud page:
  http://wiki.apache.org/solr/SolrCloud
 
  I am now at the point where, rather than firing up Solr through
 start.jar,
  I'm deploying the Solr war in to Tomcat instances. Taking the following
  command as an example:
 
  java -Dbootstrap_confdir=./solr/collection1/conf
  -Dcollection.configName=myconf -DzkRun
  -DzkHost=localhost:9983,localhost:8574,localhost:9900 -DnumShards=2
  -jar start.jar
 
  I can't figure out from the documentation how/ where I set the above
  properties when deploying Solr as a war file. I initially thought these
  might be configurable through solr.xml but can't find anything in the
  documentation to support this.
 
  Most grateful for any pointers here.
 
  Cheers,
 
  Edd
  --
  Web: http://www.eddgrant.com
  Email: e...@eddgrant.com
  Mobile: +44 (0) 7861 394 543
 




-- 
Web: http://www.eddgrant.com
Email: e...@eddgrant.com
Mobile: +44 (0) 7861 394 543

Re: Results Order When Performing Wildcard Query

2013-04-09 Thread P Williams

Hey Shawn,

My gut says the difference in assignment of docids has to do with how the
FileListEntityProcessorhttp://wiki.apache.org/solr/DataImportHandler#FileListEntityProcessor
works
on the two operating systems. The documents are updated/imported in a
different order is my guess, but I haven't tested that theory. I still
think it's kind of odd that there would be a difference.

Indexes are created from scratch in my test, so it's not that. java
-versionreports the same values on both machines
java version 1.7.0_17
Java(TM) SE Runtime Environment (build 1.7.0_17-b02)
Java HotSpot(TM) Client VM (build 23.7-b01, mixed mode)

The explicit (arbitrary non-score) sort parameter will work as a
work-around to get my test to pass in both environments while I think about
this some more. Thanks!

Cheers,
Tricia

On Tue, Apr 9, 2013 at 2:13 PM, Shawn Heisey s...@elyograg.org wrote:

On 4/9/2013 12:08 PM, P Williams wrote:

I wrote a test of my application which revealed a Solr oddity (I think).
The test which I wrote on Windows 7 and makes use of the
solr-test-frameworkhttp://**lucene.apache.org/solr/4_1_0/**
solr-test-framework/index.htmlhttp://lucene.apache.org/solr/4_1_0/solr-test-framework/index.html
**

fails
under Ubuntu 12.04 because the Solr results I expected for a wildcard
query
of the test data are ordered differently under Ubuntu than Windows. On
both Windows and Ubuntu all items in the result set have a score of 1.0
and
appear to be ordered by docid (which looks like in corresponds to
alphabetical unique id on Windows but not Ubuntu). I'm guessing that the
root of my issue is that a different docid was assigned to the same
document on each operating system.

It might be due to differences in how Java works on the two platforms, or
even something as simple as different Java versions. I don't know a lot
about the underlying Lucene stuff, so this next sentence may not be
correct: If you have are not starting from an index where the actual index
directory was deleted before the test started (rather than deleting all
documents), that might produce different internal Lucene document ids.

The data was imported using a DataImportHandler configuration during a
@BeforeClass step in my JUnit test on both systems.

Any suggestions on how to ensure a consistently ordered wildcard query
result set for testing?

Include an explicit sort parameter. That way it will depend on the data,
not the internal Lucene representation.

Thanks,
Shawn

Re: Slow qTime for distributed search

Hi Shawn;

You say that:

*... your documents are about 50KB each.  That would translate to an index
that's at least 25GB*

I know we can not say an exact size but what is the approximately ratio of
document size / index size according to your experiences?


2013/4/9 Shawn Heisey s...@elyograg.org

 On 4/9/2013 2:10 PM, Manuel Le Normand wrote:

 Thanks for replying.
 My config:

 - 40 dedicated servers, dual-core each
 - Running Tomcat servlet on Linux
 - 12 Gb RAM per server, splitted half between OS and Solr
 - Complex queries (up to 30 conditions on different fields), 1 qps
 rate

 Sharding my index was done for two reasons, based on 2 servers (4shards)
 tests:

 1. As index grew above few million of docs qTime raised greatly, while
 sharding the index to smaller pieces (about 0.5M docs) gave way better
 results, so I bound every shard to have 0.5M docs.
 2. Tests showed i was cpu-bounded during queries. As i have low qps
 rate
 (emphasize: lower than expected qTime) and as a query runs
 single-threaded
 on each shard, it made sense to accord a cpu to each shard.

 For the same amount of docs per shards I do expect a raise in total qTime
 for the reasons:

 1. The response should wait for the slowest shard
 2. Merging the responses from 40 different shards takes time

 What i understand from your explanation is that it's the merging that
 takes
 time and as qTime ends only after the second retrieval phase, the qTime on
 each shard will take longer. Meaning during a significant proportion of
 the
 first query phase (right after the [id,score] are retieved), all cpu's are
 idle except the response-merger thread running on a single cpu. I thought
 of the merge as a simple sorting of [id,score], way more simple than
 additional 300 ms cpu time.

 Why would a RAM increase improve my performances, as it's a
 response-merge (CPU resource) bottleneck?


 If you have not tweaked the Tomcat configuration, that can lead to
 problems, but if your total query volume is really only one query per
 second, this is probably not a worry for you.  A tomcat connector can be
 configured with a maxThreads parameter.  The recommended value there is
 1, but Tomcat defaults to 200.

 You didn't include the index sizes.  There's half a million docs per
 shard, but I don't know what that translates to in terms of MB or GB of
 disk space.

 On another email thread you mention that your documents are about 50KB
 each.  That would translate to an index that's at least 25GB, possibly
 more.  That email thread also says that optimization for you takes an hour,
 further indications that you've got some really big indexes.

 You're saying that you have given 6GB out of the 12GB to Solr, leaving
 only 6GB for the OS and caching.  Ideally you want to have enough RAM to
 cache the entire index, but in reality you can usually get away with
 caching between half and two thirds of the index.  Exactly what ratio works
 best is highly dependent on your schema.

 If my numbers are even close to right, then you've got a lot more index on
 each server than available RAM.  Based on what I can deduce, you would want
 24 to 48GB of RAM per server.  If my numbers are wrong, then this estimate
 is wrong.

 I would be interested in seeing your queries.  If the complexity can be
 expressed as filter queries that get re-used a lot, the filter cache can be
 a major boost to performance.  Solr's caches in general can make a big
 difference.  There is no guarantee that caches will help, of course.

 Thanks,
 Shawn

Approximately needed RAM for 5000 query/second at a Solr machine?

Are there anybody who can help me about how to guess the approximately
needed RAM for 5000 query/second at a Solr machine?

Re: Approximately needed RAM for 5000 query/second at a Solr machine?

2013-04-09 Thread Jack Krupansky

It all depends on the nature of your query and the nature of the data in the 
index. Does returning results from a result cache count in your QPS? Not to 
mention how many cores and CPU speed and CPU caching as well. Not to mention 
network latency.


The best way to answer is to do a proof of concept implementation and 
measure it yourself.


-- Jack Krupansky

-Original Message- 
From: Furkan KAMACI

Sent: Tuesday, April 09, 2013 6:06 PM
To: solr-user@lucene.apache.org
Subject: Approximately needed RAM for 5000 query/second at a Solr machine?

Are there anybody who can help me about how to guess the approximately
needed RAM for 5000 query/second at a Solr machine?

Re: Approximately needed RAM for 5000 query/second at a Solr machine?

Actually I will propose a system and I should figure out about machine
specifications. There will be no faceting mechanism at first, just simple
search queries of a web search engine. We can think that I will have a
commodity server (I don't know is there any benchmark for a usual Solr
machine)

2013/4/10 Jack Krupansky j...@basetechnology.com

 It all depends on the nature of your query and the nature of the data in
 the index. Does returning results from a result cache count in your QPS?
 Not to mention how many cores and CPU speed and CPU caching as well. Not to
 mention network latency.

 The best way to answer is to do a proof of concept implementation and
 measure it yourself.

 -- Jack Krupansky

 -Original Message- From: Furkan KAMACI
 Sent: Tuesday, April 09, 2013 6:06 PM
 To: solr-user@lucene.apache.org
 Subject: Approximately needed RAM for 5000 query/second at a Solr machine?


 Are there anybody who can help me about how to guess the approximately
 needed RAM for 5000 query/second at a Solr machine?

Re: Approximately needed RAM for 5000 query/second at a Solr machine?

On Apr 9, 2013, at 3:06 PM, Furkan KAMACI wrote:

 Are there anybody who can help me about how to guess the approximately
 needed RAM for 5000 query/second at a Solr machine?

No.

That depends on the kind of queries you have, the size and content of the 
index, the required response time, how frequently the index is updated, and 
many more factors. So anyone who can guess that is wrong.

You can only find that out by running your own benchmarks with your own queries 
against your own index.

In our system, we can meet our response time requirements at a rate of 4000 
queries/minute. We have several cores, but most traffic goes to a 3M document 
index. This index is small documents, mostly titles and authors of books. We 
have no wildcard queries and less than 5% of our queries use fuzzy matching. We 
update once per day and have cache hit rates of around 30%.

We run new benchmarks twice each year, before our busy seasons. We use the 
current index and configuration and the queries from the busiest day of the 
previous season.

Our key benchmark is the 95th percentile response time, but we also measure 
median, 90th, and 99th percentile.

We are currently on Solr 3.3 with some customizations. We're working on 
transitioning to Solr 4.

wunder
--
Walter Underwood
wun...@wunderwood.org

Re: Approximately needed RAM for 5000 query/second at a Solr machine?

Hi Walter;

Firstly thank for your detailed reply. I know that this is not a well
detailed question but I don't have any metrics yet. If we talk about your
system, what is the average RAM size of your Solr machines? Maybe that can
help me to make a comparison.

2013/4/10 Walter Underwood wun...@wunderwood.org

 On Apr 9, 2013, at 3:06 PM, Furkan KAMACI wrote:

  Are there anybody who can help me about how to guess the approximately
  needed RAM for 5000 query/second at a Solr machine?

 No.

 That depends on the kind of queries you have, the size and content of the
 index, the required response time, how frequently the index is updated, and
 many more factors. So anyone who can guess that is wrong.

 You can only find that out by running your own benchmarks with your own
 queries against your own index.

 In our system, we can meet our response time requirements at a rate of
 4000 queries/minute. We have several cores, but most traffic goes to a 3M
 document index. This index is small documents, mostly titles and authors of
 books. We have no wildcard queries and less than 5% of our queries use
 fuzzy matching. We update once per day and have cache hit rates of around
 30%.

 We run new benchmarks twice each year, before our busy seasons. We use the
 current index and configuration and the queries from the busiest day of the
 previous season.

 Our key benchmark is the 95th percentile response time, but we also
 measure median, 90th, and 99th percentile.

 We are currently on Solr 3.3 with some customizations. We're working on
 transitioning to Solr 4.

 wunder
 --
 Walter Underwood
 wun...@wunderwood.org

Re: Approximately needed RAM for 5000 query/second at a Solr machine?

We are using Amazon EC2 M1 Extra Large instances (m1.xlarge).

http://aws.amazon.com/ec2/instance-types/

wunder

On Apr 9, 2013, at 3:35 PM, Furkan KAMACI wrote:

 Hi Walter;
 
 Firstly thank for your detailed reply. I know that this is not a well
 detailed question but I don't have any metrics yet. If we talk about your
 system, what is the average RAM size of your Solr machines? Maybe that can
 help me to make a comparison.
 
 2013/4/10 Walter Underwood wun...@wunderwood.org
 
 On Apr 9, 2013, at 3:06 PM, Furkan KAMACI wrote:
 
 Are there anybody who can help me about how to guess the approximately
 needed RAM for 5000 query/second at a Solr machine?
 
 No.
 
 That depends on the kind of queries you have, the size and content of the
 index, the required response time, how frequently the index is updated, and
 many more factors. So anyone who can guess that is wrong.
 
 You can only find that out by running your own benchmarks with your own
 queries against your own index.
 
 In our system, we can meet our response time requirements at a rate of
 4000 queries/minute. We have several cores, but most traffic goes to a 3M
 document index. This index is small documents, mostly titles and authors of
 books. We have no wildcard queries and less than 5% of our queries use
 fuzzy matching. We update once per day and have cache hit rates of around
 30%.
 
 We run new benchmarks twice each year, before our busy seasons. We use the
 current index and configuration and the queries from the busiest day of the
 previous season.
 
 Our key benchmark is the 95th percentile response time, but we also
 measure median, 90th, and 99th percentile.
 
 We are currently on Solr 3.3 with some customizations. We're working on
 transitioning to Solr 4.
 
 wunder
 --
 Walter Underwood
 wun...@wunderwood.org
 
 
 
 

--
Walter Underwood
wun...@wunderwood.org

Re: Approximately needed RAM for 5000 query/second at a Solr machine?

Thanks for your answer.

2013/4/10 Walter Underwood wun...@wunderwood.org

 We are using Amazon EC2 M1 Extra Large instances (m1.xlarge).

 http://aws.amazon.com/ec2/instance-types/

 wunder

 On Apr 9, 2013, at 3:35 PM, Furkan KAMACI wrote:

  Hi Walter;
 
  Firstly thank for your detailed reply. I know that this is not a well
  detailed question but I don't have any metrics yet. If we talk about your
  system, what is the average RAM size of your Solr machines? Maybe that
 can
  help me to make a comparison.
 
  2013/4/10 Walter Underwood wun...@wunderwood.org
 
  On Apr 9, 2013, at 3:06 PM, Furkan KAMACI wrote:
 
  Are there anybody who can help me about how to guess the approximately
  needed RAM for 5000 query/second at a Solr machine?
 
  No.
 
  That depends on the kind of queries you have, the size and content of
 the
  index, the required response time, how frequently the index is updated,
 and
  many more factors. So anyone who can guess that is wrong.
 
  You can only find that out by running your own benchmarks with your own
  queries against your own index.
 
  In our system, we can meet our response time requirements at a rate of
  4000 queries/minute. We have several cores, but most traffic goes to a
 3M
  document index. This index is small documents, mostly titles and
 authors of
  books. We have no wildcard queries and less than 5% of our queries use
  fuzzy matching. We update once per day and have cache hit rates of
 around
  30%.
 
  We run new benchmarks twice each year, before our busy seasons. We use
 the
  current index and configuration and the queries from the busiest day of
 the
  previous season.
 
  Our key benchmark is the 95th percentile response time, but we also
  measure median, 90th, and 99th percentile.
 
  We are currently on Solr 3.3 with some customizations. We're working on
  transitioning to Solr 4.
 
  wunder
  --
  Walter Underwood
  wun...@wunderwood.org
 
 
 
 

 --
 Walter Underwood
 wun...@wunderwood.org

Re: Pushing a whole set of pdf-files to solr

If anybody could still help me out with this, I'd really appreciate it.
Thanks!



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Pushing-a-whole-set-of-pdf-files-to-solr-tp4025256p4054885.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Pushing a whole set of pdf-files to solr

Apache Solr 4 Cookbok says that:

curl http://localhost:8983/solr/update/extract?literal.id=1commit=true;
-F myfile=@cookbook.pdf

is that what you want?

2013/4/10 sdspieg sdsp...@mail.ru

 If anybody could still help me out with this, I'd really appreciate it.
 Thanks!



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Pushing-a-whole-set-of-pdf-files-to-solr-tp4025256p4054885.html
 Sent from the Solr - User mailing list archive at Nabble.com.

Re: Pushing a whole set of pdf-files to solr

2013-04-09 Thread Jack Krupansky

The newer release of SimplePostTool with Solr 4.x makes it easy to post PDF 
files from a directory, including automatically adding the file name to a 
field. But SolrCell is the direct API that it uses as well.


-- Jack Krupansky
-Original Message- 
From: Furkan KAMACI

Sent: Tuesday, April 09, 2013 6:58 PM
To: solr-user@lucene.apache.org
Subject: Re: Pushing a whole set of pdf-files to solr

Apache Solr 4 Cookbok says that:

curl http://localhost:8983/solr/update/extract?literal.id=1commit=true;
-F myfile=@cookbook.pdf

is that what you want?

2013/4/10 sdspieg sdsp...@mail.ru


If anybody could still help me out with this, I'd really appreciate it.
Thanks!



--
View this message in context:
http://lucene.472066.n3.nabble.com/Pushing-a-whole-set-of-pdf-files-to-solr-tp4025256p4054885.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Slow qTime for distributed search


On 4/9/2013 3:50 PM, Furkan KAMACI wrote:

Hi Shawn;

You say that:

*... your documents are about 50KB each.  That would translate to an index
that's at least 25GB*

I know we can not say an exact size but what is the approximately ratio of
document size / index size according to your experiences?


If you store the fields, that is actual size plus a small amount of 
overhead.  Starting with Solr 4.1, stored fields are compressed.  I 
believe that it uses LZ4 compression.  Some people store all fields, 
some people store only a few or one - an ID field.  The size of stored 
fields does have an impact on how much OS disk cache you need, but not 
as much as the other parts of an index.


It's been my experience that termvectors take up almost as much space as 
stored data for the same fields, and sometimes more.  Starting with Solr 
4.2, termvectors are also compressed.


Adding docValues (new in 4.2) to the schema will also make the index 
larger.  The requirements here are similar to stored fields.  I do not 
know whether this data gets compressed, but I don't think it does.


As for the indexed data, this is where I am less clear about the storage 
ratios, but I think you can count on it needing almost as much space as 
the original data.  If the schema uses types or filters that produce a 
lot of information, the indexed data might be larger than the original 
input.  Examples of data explosions in a schema: trie fields with a 
non-zero precisionStep, the edgengram filter, the shingle filter.


Thanks,
Shawn

Re: How can I set configuration options?

: Thanks for the replies. The problem I have is that setting them at the JVM
: level would mean that all instances of Solr deployed in the Tomcat instance
: are forced to use the same settings. I actually want to set the properties
: at the application level (e.g. in solr.xml, zoo.conf or maybe an
: application level Tomcat Context.xml file).

the thing to keep in mind is that most of the params you refered to are 
things you would not typically want in a deployed setup.  others are 
just ways of specifying defaults that are substituted into configs...

:   java -Dbootstrap_confdir=./solr/collection1/conf

you don't wnat this option for a normal setup, it's just for boostratping 
(hence it's only a system property).  in a production setup you would use 
the zookeeper tools to load the configs into your zk quorum.

https://wiki.apache.org/solr/SolrCloud#Config_Startup_Bootstrap_Params
...vs...
https://wiki.apache.org/solr/SolrCloud#Command_Line_Util

:   -Dcollection.configName=myconf -DzkRun

ditto for collection.configName -- it's only for boostraping

zkRun is something you only use in trivial setups like the examples in the 
SolrCloud tutorial to run zookeeper embedded in Solr.  if you are running 
a production cluster where you want to be able to add/remove solr nodes on 
the fly, then you are going to want to set of specific machines running 
standalone zookeper.

:   -DzkHost=localhost:9983,localhost:8574,localhost:9900 -DnumShards=2

zkHost can be specified in solr.xml (allthough i'm not sure why the 
example solr.xml doesn't include it, i'll update SOLR-4622 to address 
this), or it can be overridden by a system property.


-Hoss

Re: Field exist in schema.xml but returns

2013-04-09 Thread deniz

Raymond Wiker wrote
 You have misspelt the tag name in the field definition... you have fiald
 instead of field.

thank you Raymond, it was really hard to find it out in a massive schema
file



-
Zeki ama calismiyor... Calissa yapar...
--
View this message in context: 
http://lucene.472066.n3.nabble.com/Field-exist-in-schema-xml-but-returns-tp4054634p4054903.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Approximately needed RAM for 5000 query/second at a Solr machine?


On 4/9/2013 4:06 PM, Furkan KAMACI wrote:

Are there anybody who can help me about how to guess the approximately
needed RAM for 5000 query/second at a Solr machine?


You've already gotten some good replies, and I'm aware that they haven't 
really answered your question.  This is the kind of question that cannot 
be answered.


The amount of RAM that you'll need for extreme performance actually 
isn't hard to figure out - you need enough free RAM for the OS to cache 
the maximum amount of disk space all your indexes will ever use. 
Normally this will be twice the size of all the indexes on the machine, 
because that's how much disk space will likely be used in a worst-case 
merge scenario (optimize).  That's very expensive, so it is cheaper to 
budget for only the size of the index.


A load of 5000 queries per second is pretty high, and probably something 
you will not achieve with a single-server (not counting backup) 
approach.  All of the tricks that high-volume website developers use are 
also applicable to Solr.


Once you have enough RAM, you need to worry more about the number of 
servers, the number of CPU cores in each server, and the speed of those 
CPU cores.  Testing with actual production queries is the only way to 
find out what you really need.


Beyond hardware design, making the requests as simple as possible and 
taking advantage of caches is important.  Solr has caches for queries, 
filters, and documents.  You can also put a caching proxy (something 
like Varnish) in front of Solr, but that would make NRT updates pretty 
much impossible, and that kind of caching can be difficult to get 
working right.


Thanks,
Shawn

Re: Approximately needed RAM for 5000 query/second at a Solr machine?

These are really good metrics for me:

You say that RAM size should be at least index size, and it is better to
have a RAM size twice the index size (because of worst case scenario).

On the other hand let's assume that I have a RAM size that is bigger than
twice of indexes at machine. Can Solr use that extra RAM or is it a
approximately maximum limit (to have twice size of indexes at machine)?


2013/4/10 Shawn Heisey s...@elyograg.org

 On 4/9/2013 4:06 PM, Furkan KAMACI wrote:

 Are there anybody who can help me about how to guess the approximately
 needed RAM for 5000 query/second at a Solr machine?


 You've already gotten some good replies, and I'm aware that they haven't
 really answered your question.  This is the kind of question that cannot be
 answered.

 The amount of RAM that you'll need for extreme performance actually isn't
 hard to figure out - you need enough free RAM for the OS to cache the
 maximum amount of disk space all your indexes will ever use. Normally this
 will be twice the size of all the indexes on the machine, because that's
 how much disk space will likely be used in a worst-case merge scenario
 (optimize).  That's very expensive, so it is cheaper to budget for only the
 size of the index.

 A load of 5000 queries per second is pretty high, and probably something
 you will not achieve with a single-server (not counting backup) approach.
  All of the tricks that high-volume website developers use are also
 applicable to Solr.

 Once you have enough RAM, you need to worry more about the number of
 servers, the number of CPU cores in each server, and the speed of those CPU
 cores.  Testing with actual production queries is the only way to find out
 what you really need.

 Beyond hardware design, making the requests as simple as possible and
 taking advantage of caches is important.  Solr has caches for queries,
 filters, and documents.  You can also put a caching proxy (something like
 Varnish) in front of Solr, but that would make NRT updates pretty much
 impossible, and that kind of caching can be difficult to get working right.

 Thanks,
 Shawn

Re: Results Order When Performing Wildcard Query


: My gut says the difference in assignment of docids has to do with how the
: 
FileListEntityProcessorhttp://wiki.apache.org/solr/DataImportHandler#FileListEntityProcessor

docids just represent the order documents are added to the index.  if you 
use DIH with FileListEntityProcessor to create one doc per file then the 
order of the documents will (if i remember correctly) corrispond tothe 
order of the files returned by the OS, which may vary.

even if the files are ordered consitently by modification date: 1) the 
modification date of these files on your machines  might be different; the 
graunlarity of file modification dates supported by the filesystem or file 
io layer in the JVM on each machine might be different -- causing two 
files to appera to have identical mod times on one machine, but different 
mod times on the other machine.


-Hoss

Re: Pushing a whole set of pdf-files to solr

Thanks for those replies. I will look into them. But if anyone knows of a
site that describes step by step how a windows user who has already
installed solr (and tomcat) can easily feed a folder (and subfolders) with
100s of pdfs into solr, or would be willing to write down down those steps,
I would really appreciate the reference. And I bet you there are lots of
people like me... 



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Pushing-a-whole-set-of-pdf-files-to-solr-tp4025256p4054915.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Pushing a whole set of pdf-files to solr

I am able to run the java -jar post.jar -help command which I found here:
http://docs.lucidworks.com/display/solr/Running+Solr. But now how can I tell
post to post all pdf files in a certain folder (preferably recursively) to a
collection? Could anybody please post the exact command for that? 



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Pushing-a-whole-set-of-pdf-files-to-solr-tp4025256p4054916.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Solr 4.2.1 SSLInitializationException

2013-04-09 Thread Sarita Nair

Hi Chris,

Thanks for your response.

My understanding is that GlassFish specifies the keystore as a system property, 
but does not specify the password  in order to protect it from 
snooping. There's
a keychain that requires a password to be passed from the DAS in order to 
unlock the key for the keystore. 

Is there some way to specify a 
different HttpClient implementation (e.g. DefaultHttpClient rather than 
SystemDefaultHttpClient), as we don't want the application to have 
access to the keystore?


I have also pasted the entire stack trace below:

2013-04-09 10:45:06,144 [main] ERROR org.apache.solr.servlet.SolrDispatchFilter 
- Could not start Solr. Check solr/home property and the logs
    2013-04-09 10:45:06,224 [main] ERROR org.apache.solr.core.SolrCore - 
null:org.apache.http.conn.ssl.SSLInitializationException: Failure initializing 
default system SSL context
    at 
org.apache.http.conn.ssl.SSLSocketFactory.createSystemSSLContext(SSLSocketFactory.java:368)
    at 
org.apache.http.conn.ssl.SSLSocketFactory.getSystemSocketFactory(SSLSocketFactory.java:204)
    at 
org.apache.http.impl.conn.SchemeRegistryFactory.createSystemDefault(SchemeRegistryFactory.java:82)
    at 
org.apache.http.impl.client.SystemDefaultHttpClient.createClientConnectionManager(SystemDefaultHttpClient.java:118)
    at 
org.apache.http.impl.client.AbstractHttpClient.getConnectionManager(AbstractHttpClient.java:466)
    at 
org.apache.solr.client.solrj.impl.HttpClientUtil.setMaxConnections(HttpClientUtil.java:179)
    at 
org.apache.solr.client.solrj.impl.HttpClientConfigurer.configure(HttpClientConfigurer.java:33)
    at 
org.apache.solr.client.solrj.impl.HttpClientUtil.configureClient(HttpClientUtil.java:115)
    at 
org.apache.solr.client.solrj.impl.HttpClientUtil.createClient(HttpClientUtil.java:105)
    at 
org.apache.solr.handler.component.HttpShardHandlerFactory.init(HttpShardHandlerFactory.java:134)
    at 
com.sun.enterprise.glassfish.bootstrap.GlassFishImpl.start(GlassFishImpl.java:79)
    at 
com.sun.enterprise.glassfish.bootstrap.GlassFishDecorator.start(GlassFishDecorator.java:63)
    at 
com.sun.enterprise.glassfish.bootstrap.osgi.OSGiGlassFishImpl.start(OSGiGlassFishImpl.java:69)
    at 
com.sun.enterprise.glassfish.bootstrap.GlassFishMain$Launcher.launch(GlassFishMain.java:117)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:601)
    at 
com.sun.enterprise.glassfish.bootstrap.GlassFishMain.main(GlassFishMain.java:97)
    at com.sun.enterprise.glassfish.bootstrap.ASMain.main(ASMain.java:55)
Caused by: java.io.IOException: Keystore was tampered with, or password was 
incorrect
  at sun.security.provider.JavaKeyStore.engineLoad(JavaKeyStore.java:772)
    at sun.security.provider.JavaKeyStore$JKS.engineLoad(JavaKeyStore.java:55)
    at java.security.KeyStore.load(KeyStore.java:1214)
    at 
org.apache.http.conn.ssl.SSLSocketFactory.createSystemSSLContext(SSLSocketFactory.java:281)
    at 
org.apache.http.conn.ssl.SSLSocketFactory.createSystemSSLContext(SSLSocketFactory.java:366)
 
    ... 50 more
Caused by: java.security.UnrecoverableKeyException: Password verification failed
    at sun.security.provider.JavaKeyStore.engineLoad(JavaKeyStore.java:770)
    ... 54 more






 







 From: Chris Hostetter hossman_luc...@fucit.org
To: solr-user@lucene.apache.org solr-user@lucene.apache.org; Sarita Nair 
sarita...@yahoo.com 
Sent: Tuesday, April 9, 2013 1:31 PM
Subject: Re: Solr 4.2.1 SSLInitializationException
 

: Deploying Solr 4.2.1 to GlassFish 3.1.1 results in the error below.  I 
: have seen similar problems being reported with Solr 4.2

Are you trying to use server SSL with glassfish?

can you please post the full stack trace so we can see where this error is 
coming from.

My best guess is that this is coming from the changes made in 
SOLR-4451 to use system defaults correctly when initializing HttpClient, 
which suggets that your problem is exactly what the error message says...

  Keystore was tampered with, or password was incorrect

Is it possible that the default keystore password for your JVM (or as 
overridden by glassfish defaults - possibly using the 
javax.net.ssl.keyStore sysprop) has a password set on it?  If so you 
need to confiure your JVM with the standard java system properties to 
specify what that password is.

http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201303.mbox/%3c1364232676233-4051159.p...@n3.nabble.com%3E

:     2013-04-09 10:45:06,144 [main] ERROR 
: org.apache.solr.servlet.SolrDispatchFilter - Could not start Solr. Check 
solr/home property and the logs
:     2013-04-09 10:45:06,224 [main] ERROR 
: org.apache.solr.core.SolrCore - 
:

Re: Pushing a whole set of pdf-files to solr

2013-04-09 Thread Gora Mohanty

On 10 April 2013 07:28, sdspieg sdsp...@mail.ru wrote:
 I am able to run the java -jar post.jar -help command which I found here:
 http://docs.lucidworks.com/display/solr/Running+Solr. But now how can I tell
 post to post all pdf files in a certain folder (preferably recursively) to a
 collection? Could anybody please post the exact command for that?
[...]

There are two options:
* I am not familiar with Microsoft Windows, but writing some kind of a batch
  script that recurses down a directory, and posts files to Solr should be easy.
* One could use the Solr DataImportHandler with FileDataSource to handle
   the filesystem traversal, and TikaEntityProcessor to handle the indexing of
   rich content. Please see:
   http://wiki.apache.org/solr/DataImportHandler
   http://wiki.apache.org/solr/TikaEntityProcessor

Regards,
Gora

Re: Pushing a whole set of pdf-files to solr

Another progress report. I 'flattened' all the folders which contained the
pdf files with Fileboss and then moved the pdf files to the directory where
I found the post.jar file (in solr-4.2.1\solr-4.2.1\example\exampledocs). I
then ran java -Ddata=files -jar post.jar *.pdf and in the command window
it seemed to be working fine (these are just academic articles in pdf-format
that I downloaded with ZOtyero from EBSCO):
04/10/2013  12:20 AM   159,224 Vorontsov - 2012 - The Korea- Russia
Gas
Pipeline Project Past, Pres.pdf
04/10/2013  12:12 AM 3,885,056 Walker - 2012 - Asia competes for
energy
security.pdf
04/10/2013  12:45 AM66,195 Whitmill - 2012 - Is UK Energy Policy
Dri
ving Energy Innovation - or.pdf
04/10/2013  12:29 AM 2,208,367 Wietfeld - 2011 - Understanding
Middle Ea
st Gas Exporting Behavior.pdf
04/10/2013  12:59 AM 3,011,185 Wiseman - 2011 - Expanding Regional
Renew
able Governance.pdf
04/10/2013  12:38 AM   180,692 Woudhuysen - 2012 - Innovation in
Energy
Expressions of a Crisis, and.pdf
04/10/2013  12:49 AM   229,991 Yergin - 2012 - How Is Energy
Remaking th
e World.pdf
04/10/2013  12:40 AM 3,397,328 Young - 2012 - Industrial Gases.
(cover s
tory).pdf
04/10/2013  01:36 AM73,125 Zimmerer - 2011 - New Geographies of
Ener
gy Introduction to the Spe.pdf
... and so on, all together some 300 articles.

But then when I looked in solr, I saw the following:
04:34:41
SEVERE
SolrCore
org.apache.solr.common.SolrException: Invalid UTF-8 middle byte 0xe3 (at
char #10, byte #-1)
04:34:41
SEVERE
SolrCore
org.apache.solr.common.SolrException: Invalid UTF-8 middle byte 0xe3 (at
char #10, byte #-1)

... and a lot more of those.

I'd like to think I made SOME progress, but it also seems like I'm still not
close to being there. Any suggestions from the experts here on what I am
doing wrong? 

Thanks!

-Stephan



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Pushing-a-whole-set-of-pdf-files-to-solr-tp4025256p4054920.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Approximately needed RAM for 5000 query/second at a Solr machine?

On 4/9/2013 7:03 PM, Furkan KAMACI wrote:
 These are really good metrics for me:
 
 You say that RAM size should be at least index size, and it is better to
 have a RAM size twice the index size (because of worst case scenario).
 
 On the other hand let's assume that I have a RAM size that is bigger than
 twice of indexes at machine. Can Solr use that extra RAM or is it a
 approximately maximum limit (to have twice size of indexes at machine)?

What we have been discussing is the OS cache, which is memory that is
not used by programs.  The OS uses that memory to make everything run
faster.  The OS will instantly give that memory up if a program requests it.

Solr is a java program, and java uses memory a little differently, so
Solr most likely will NOT use more memory when it is available.

In a normal directly executable program, memory can be allocated at
any time, and given back to the system at any time.

With Java, you tell it the maximum amount of memory the program is ever
allowed to use.  Because of how memory is used inside Java, most
long-running Java programs (like Solr) will allocate up to the
configured maximum even if they don't really need that much memory.
Most Java virtual machines will never give the memory back to the system
even if it is not required.

Thanks,
Shawn

Re: Pushing a whole set of pdf-files to solr

2013-04-09 Thread Gora Mohanty

On 10 April 2013 08:11, sdspieg sdsp...@mail.ru wrote:
 Another progress report. I 'flattened' all the folders which contained the
 pdf files with Fileboss and then moved the pdf files to the directory where
 I found the post.jar file (in solr-4.2.1\solr-4.2.1\example\exampledocs). I
 then ran java -Ddata=files -jar post.jar *.pdf and in the command window
 it seemed to be working fine (these are just academic articles in pdf-format
 that I downloaded with ZOtyero from EBSCO):
[...]

If it works, great, but it is not generally advisable to have a large number
of files under one directory. However, that is not the source of your error
here.
 But then when I looked in solr, I saw the following:
 04:34:41
 SEVERE
 SolrCore
 org.apache.solr.common.SolrException: Invalid UTF-8 middle byte 0xe3 (at
 char #10, byte #-1)
[...]

Your files seem to have some encoding other than UTF-8: My random
guess would be Windows-1252. You need to convert the files to UTF-8.

Regards,
Gora

Re: Pushing a whole set of pdf-files to solr

2013-04-09 Thread Jack Krupansky

The newer SimplePostTool can in fact recurse a directory of PDFs. Just get 
the usage for the tool. I'm sure it lists the command options.


-- Jack Krupansky

-Original Message- 
From: sdspieg

Sent: Tuesday, April 09, 2013 9:48 PM
To: solr-user@lucene.apache.org
Subject: Re: Pushing a whole set of pdf-files to solr

Thanks for those replies. I will look into them. But if anyone knows of a
site that describes step by step how a windows user who has already
installed solr (and tomcat) can easily feed a folder (and subfolders) with
100s of pdfs into solr, or would be willing to write down down those steps,
I would really appreciate the reference. And I bet you there are lots of
people like me...



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Pushing-a-whole-set-of-pdf-files-to-solr-tp4025256p4054915.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: edismax returns very less matches than regular

2013-04-09 Thread Erick Erickson

Adding debugQuery=true is your friend. I suspect that you'll find
your first query is actually searching
name:coldfusion OR defaultsearchfield:cache and you _think_ it's
searching for both coldfusion and cache in the name field

Best
Erick

On Mon, Apr 8, 2013 at 2:50 AM, amit amit.mal...@gmail.com wrote:
 I have a simple system. I put the title of webpages into the name field and
 content of the web pages into the Description field.
 I want to search both fields and give the name a little more boost.
 A search on name field or description field returns records cloase to
 hundreds.

 http://localhost:8983/solr/select/?q=name:%28coldfusion^2%20cache^1%29fq=author:[*%20TO%20*]%20AND%20-author:chinmoypstart=0rows=10fl=author,score,%20id

 But search on both fields using boost just gives 5 matches.

 http://localhost:8983/solr/mindfire/?q=%28%20coldfusion^2%20cache^1%29*defType=edismaxqf=name^1.5%20description^1.0*fq=author:[*%20TO%20*]%20AND%20-author:chinmoypstart=0rows=10fl=author,score,%20id

 I am wondering what is wrong, because there are valid results returned in
 1st query which is ignored by edismax. I am on solr3.6



 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/edismax-returns-very-less-matches-than-regular-tp4054442.html
 Sent from the Solr - User mailing list archive at Nabble.com.

Re: Approximately needed RAM for 5000 query/second at a Solr machine?

I am sorry but you said:

*you need enough free RAM for the OS to cache the maximum amount of disk
space all your indexes will ever use*

I have made an assumption my indexes at my machine. Let's assume that it is
5 GB. So it is better to have at least 5 GB RAM? OK, Solr will use RAM up
to how much I define it as a Java processes. When we think about the
indexes at storage and caching them at RAM by OS, is that what you talk
about: having more than 5 GB - or - 10 GB RAM for my machine?

2013/4/10 Shawn Heisey s...@elyograg.org

 On 4/9/2013 7:03 PM, Furkan KAMACI wrote:
  These are really good metrics for me:
 
  You say that RAM size should be at least index size, and it is better to
  have a RAM size twice the index size (because of worst case scenario).
 
  On the other hand let's assume that I have a RAM size that is bigger than
  twice of indexes at machine. Can Solr use that extra RAM or is it a
  approximately maximum limit (to have twice size of indexes at machine)?

 What we have been discussing is the OS cache, which is memory that is
 not used by programs.  The OS uses that memory to make everything run
 faster.  The OS will instantly give that memory up if a program requests
 it.

 Solr is a java program, and java uses memory a little differently, so
 Solr most likely will NOT use more memory when it is available.

 In a normal directly executable program, memory can be allocated at
 any time, and given back to the system at any time.

 With Java, you tell it the maximum amount of memory the program is ever
 allowed to use.  Because of how memory is used inside Java, most
 long-running Java programs (like Solr) will allocate up to the
 configured maximum even if they don't really need that much memory.
 Most Java virtual machines will never give the memory back to the system
 even if it is not required.

 Thanks,
 Shawn

Re: Approximately needed RAM for 5000 query/second at a Solr machine?