Exception when using File based and Index based SpellChecker

2013-07-18 Thread smanad
I am trying to use the file-based and index-based spell checkers together and am getting this
exception: "All checkers need to use the same StringDistance."

They work fine as expected individually but not together. 
Any pointers?

-Manasi



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Exception-when-using-File-based-and-Index-based-SpellChecker-tp4078773.html
Sent from the Solr - User mailing list archive at Nabble.com.
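One thing worth checking, as a sketch (field and file names here are illustrative): when two dictionaries are combined in one request, Solr requires their checkers to resolve the same StringDistance, so pinning an explicit distanceMeasure on both spellcheckers in solrconfig.xml may satisfy that check:

<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="name">index</str>
    <str name="classname">solr.IndexBasedSpellChecker</str>
    <str name="field">spell</str>
    <str name="distanceMeasure">org.apache.lucene.search.spell.LevensteinDistance</str>
  </lst>
  <lst name="spellchecker">
    <str name="name">file</str>
    <str name="classname">solr.FileBasedSpellChecker</str>
    <str name="sourceLocation">spellings.txt</str>
    <str name="characterEncoding">UTF-8</str>
    <str name="distanceMeasure">org.apache.lucene.search.spell.LevensteinDistance</str>
  </lst>
</searchComponent>

Both dictionaries would then be requested together with spellcheck.dictionary=index&spellcheck.dictionary=file.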


Re: Doc's FunctionQuery result field in my custom SearchComponent class ?

2013-07-18 Thread Tony Mullins
Erick,
In freq:termfreq(product,'spider'), freq is an alias for the 'termfreq' function
query, so I could have that field with the name 'freq' in the document response.
This is the code I am using to get the document object, and there is no
termfreq field in its fields collection.

DocList docs = rb.getResults().docList;
DocIterator iterator = docs.iterator();
int sumFreq = 0;
String id = null;

for (int i = 0; i < docs.size(); i++) {
try {
int docId = iterator.nextDoc();

   // Document doc = searcher.doc(docId, fieldSet);
Document doc = searcher.doc(docId);

Thanks,
Tony


On Wed, Jul 17, 2013 at 5:30 PM, Erick Erickson erickerick...@gmail.com wrote:

 Where are you getting the syntax
 freq:termfreq(product,'spider')
 ? Try just

 termfreq(product,'spider')
 you'll get an element in the doc labeled 'termfreq', at least
 I do.

 Best
 Erick

 On Tue, Jul 16, 2013 at 1:03 PM, Tony Mullins tonymullins...@gmail.com
 wrote:
  OK, so that's why I cannot see the FunctionQuery fields in my
  SearchComponent class.
  So then the question would be: how can I apply my custom processing/logic to
  these FunctionQuery results? What's the extension point in Solr for such
 scenarios?
 
  Basically I want to call termfreq() for each document and then apply the
  sum to all doc's termfreq() results and show in one aggregated TermFreq
  field in my query response.
 
  Thanks.
  Tony
 
 
 
  On Tue, Jul 16, 2013 at 6:01 PM, Jack Krupansky j...@basetechnology.com
 wrote:
 
  Basically, the evaluation of function queries in the fl parameter
 occurs
  when the response writer is composing the document results. That's AFTER
  all of the search components are done.
 
  SolrReturnFields.**getTransformer() gets the DocTransformer, which is
  really a DocTransformers, and then a call to
 DocTransformers.transform() in
  each response writer will evaluate the embedded function queries and
 insert
  their values in the results as they are being written.
 
  -- Jack Krupansky
 
  -Original Message- From: Tony Mullins
  Sent: Tuesday, July 16, 2013 1:37 AM
  To: solr-user@lucene.apache.org
  Subject: Re: Doc's FunctionQuery result field in my custom
 SearchComponent
  class ?
 
 
  No sorry, I am still not getting the termfreq() field in my 'doc'
 object.
  I do get the _version_ field in my 'doc' object which I think is
  realValue=StoredField.
 
  At which point does termfreq() or any other FunctionQuery field become
  part of the doc object in Solr? And at that point, can I perform some
  custom logic and append it to the response?
 
  Thanks.
  Tony
 
 
 
 
 
  On Tue, Jul 16, 2013 at 1:34 AM, Patanachai Tangchaisin
  patanachai.tangchaisin@wizecommerce.com wrote:
 
   Hi,
 
  I think the process of retrieving a stored field (through fl) happens
  after SearchComponent.

  One solution: if you wrap the q param with a function, your score will be
  the result of the function.
  For example,
 
  http://localhost:8080/solr/collection2/demoendpoint?q=termfreq%28product,%27spider%27%29&wt=xml&indent=true&fl=*,score
 
  
 
 
 
  Now your score is going to be a result of termfreq(product,'spider')
 
 
  --
  Patanachai Tangchaisin
 
 
 
  On 07/15/2013 12:01 PM, Tony Mullins wrote:
 
   any help plz !!!
 
 
  On Mon, Jul 15, 2013 at 4:13 PM, Tony Mullins tonymullins...@gmail.com
  wrote:
 
 
   Please, any help on how to get the value of the 'freq' field in my custom
  SearchComponent?
 
 
  http://localhost:8080/solr/collection2/demoendpoint?q=spider&wt=xml&indent=true&fl=*,freq:termfreq%28product,%27spider%27%29
 
  
 
 
  <doc><str name="id">11</str><str name="type">Video Games</str><str
  name="format">xbox 360</str><str name="product">The Amazing
  Spider-Man</str><int name="popularity">11</int><long
  name="_version_">1439994081345273856</long><int
  name="freq">1</int></doc>
 
 
 
  Here is my code
 
  DocList docs = rb.getResults().docList;
   DocIterator iterator = docs.iterator();
   int sumFreq = 0;
   String id = null;
 
   for (int i = 0; i < docs.size(); i++) {
   try {
   int docId = iterator.nextDoc();
 
  // Document doc = searcher.doc(docId, fieldSet);
   Document doc = searcher.doc(docId);
 
  In doc object I can 
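A sketch of one way to get at the numbers Tony wants without going through the response-writer machinery: read the postings for the term directly inside process(). This assumes the Solr/Lucene 4.x APIs; the field and term are hard-coded for illustration and error handling is omitted:

import java.util.Arrays;
import org.apache.lucene.index.DocsEnum;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiFields;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.util.BytesRef;
import org.apache.solr.search.DocIterator;
import org.apache.solr.search.DocList;

// inside process(ResponseBuilder rb): sum the term frequency of
// product:spider over the documents in the current result page
DocList docs = rb.getResults().docList;
int[] ids = new int[docs.size()];
DocIterator iterator = docs.iterator();
for (int i = 0; i < ids.length; i++) {
    ids[i] = iterator.nextDoc();
}
Arrays.sort(ids); // DocsEnum.advance() requires increasing targets

IndexReader reader = rb.req.getSearcher().getIndexReader();
DocsEnum postings = MultiFields.getTermDocsEnum(
        reader, MultiFields.getLiveDocs(reader), "product", new BytesRef("spider"));

int sumFreq = 0;
if (postings != null) {
    int d = -1;
    for (int id : ids) {
        if (d < id) {
            d = postings.advance(id); // jump to the first doc >= id
        }
        if (d == DocIdSetIterator.NO_MORE_DOCS) {
            break;
        }
        if (d == id) {
            sumFreq += postings.freq(); // freq of product:spider in this doc
        }
    }
}
rb.rsp.add("sumTermFreq", sumFreq);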

Configuring Tomcat 6 with Solr431 with multiple cores

2013-07-18 Thread PeterKerk

Thanks to Sandeep in this post:
http://lucene.472066.n3.nabble.com/HTTP-Status-503-Server-is-shutting-down-td4065958.html#a4078567
I was able to set up Tomcat 6 with Solr 4.3.1.

However, I need a multicore implementation and am now stuck on how to do so.
Here is what I did so far, based on Sandeep's recommended steps, and what I
still need:

1. Extract the Solr 4.3.1 package. In my case I did this in E:\solr-4.3.1\example\solr
Peter's path: C:\Dropbox\Databases\solr-4.3.1\example\solr
2. Now copy the solr dir from the extracted package (E:\solr-4.3.1\example\solr)
into the TOMCAT_HOME dir. In my case TOMCAT_HOME points to
E:\Apache\Tomcat 6.0.
3. I can now refer to SOLR_HOME as E:\Apache\Tomcat 6.0\solr (please
remember this)
Peter's path: C:\Program Files\Apache Software Foundation\Tomcat
6.0\solr
4. Copy the solr.war file from the extracted package to the SOLR_HOME dir, i.e.
E:\Apache\Tomcat 6.0\solr. This is required to create the context, as I
do not want to pass this as JAVA_OPTS.
5. Create a solr1.xml file in TOMCAT_HOME\conf\Catalina\localhost (I gave
the file the name solr1.xml):

<?xml version="1.0" encoding="utf-8"?>
<Context docBase="C:\Program Files\Apache Software Foundation\Tomcat
6.0\solr\solr-4.3.1.war" debug="0" crossContext="true">
<Environment name="solr/home" type="java.lang.String" value="C:\Program
Files\Apache Software Foundation\Tomcat 6.0\solr" override="true"/>
</Context>

6. Also copy the solr.war file into TOMCAT_HOME\webapps for deployment purposes.
7. If you start Tomcat now you will get errors, as mentioned by Shawn. So you
need to copy all 5 jar files from the extracted Solr package
(E:\solr-4.3.1\example\lib\ext) to the TOMCAT_HOME\lib dir (jul-to-slf4j-1.6.6,
jcl-over-slf4j-1.6.6, slf4j-log4j12-1.6.6, slf4j-api-1.6.6, log4j-1.2.16).
8. Also copy the log4j.properties file from the
E:\solr-4.3.1\example\resources dir to the TOMCAT_HOME\lib dir.
9. Now if you start Tomcat you won't have any problems.

So far Sandeep's steps.

I can now reach http://localhost:8080/solr-4.3.1/#/

Now that I've completed the basic setup of Tomcat 6 and Solr 4.3.1, I want
to migrate my Solr 3.5.0 cores (now running on Cygwin) to that
environment:

C:\Dropbox\Databases\apache-solr-3.5.0\example\example-DIH\solr\tt 
C:\Dropbox\Databases\apache-solr-3.5.0\example\example-DIH\solr\shop 
C:\Dropbox\Databases\apache-solr-3.5.0\example\example-DIH\solr\homes 

Where do I need to copy the above cores for this all to work? To C:\Program
Files\Apache Software Foundation\Tomcat 6.0\solr?
And how can I then reach the data-import handler? I currently do this like so:
http://localhost:8983/solr/tt/dataimport?command=full-import


Thanks!




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Configuring-Tomcat-6-with-Solr431-with-multiple-cores-tp4078778.html
Sent from the Solr - User mailing list archive at Nabble.com.
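For the multicore part, a sketch of the legacy (pre-4.4) solr.xml that would sit directly in SOLR_HOME, with one entry per migrated core (core names taken from the 3.5.0 paths above; each instanceDir needs its own conf directory, and the 3.5.0 configs will likely need updating for 4.3.1):

<?xml version="1.0" encoding="UTF-8" ?>
<solr persistent="true">
  <cores adminPath="/admin/cores" defaultCoreName="tt">
    <core name="tt" instanceDir="tt" />
    <core name="shop" instanceDir="shop" />
    <core name="homes" instanceDir="homes" />
  </cores>
</solr>

With that layout, the data-import call would presumably become http://localhost:8080/solr-4.3.1/tt/dataimport?command=full-import (port and context name from the Tomcat setup above).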


Inconsistent solrcloud search

2013-07-18 Thread Vladimir Poroshin

Hi,

I see strange behavior while searching my SolrCloud cluster:
for a query like
http://localhost/solr/my_collection/select?q=my+query

Solr sometimes responds with one document and sometimes with no documents.
The found document is located on shard8, so if I query with
shards=shard8 then I always get this document,
but if I query with shards=shard8,shard1 then about 50% of my requests
return no documents at all.

I tried it with solr 4.3.0 and also with 4.3.1.
My cluster has 8 shards with 8 replicas with about 100M docs and default 
(compositeId) document routing.




boost docs if token matches happen in the first 5 words

2013-07-18 Thread Anatoli Matuskova
I've a set of documents with a whitespace-tokenized field. I want to give more
boost when the query match happens in the first 3 token positions of
the field. Is there any way to do that? (I don't want to use payloads, as they
mean one more seek to disk and thus lower performance.)
 



--
View this message in context: 
http://lucene.472066.n3.nabble.com/boost-docs-if-token-matches-happen-in-the-first-5-words-tp4078786.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: boost docs if token matches happen in the first 5 words

2013-07-18 Thread Markus Jelsma
You must implement a SpanFirst query yourself. These are not implemented in any
Solr query parser. You can easily extend the (e)dismax parsers and add support
for it.
 
-Original message-
 From:Anatoli Matuskova anatoli.matusk...@gmail.com
 Sent: Thursday 18th July 2013 11:54
 To: solr-user@lucene.apache.org
 Subject: boost docs if token matches happen in the first 5 words
 
 I've a set of documents with a WhiteSpaceTokenize field. I want to give more
 boost when the match of the query happens in the first 3 token positions of
 the field. Is there any way to do that (don't want to use payloads as they
 mean on more seek to disk so lower performance)
  
 
 
 
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/boost-docs-if-token-matches-happen-in-the-first-5-words-tp4078786.html
 Sent from the Solr - User mailing list archive at Nabble.com.
 


Custom RequestHandlerBase XML Response Issue

2013-07-18 Thread Vineet Mishra
Hi all

I am using a custom RequestHandlerBase where I am querying multiple
different Solr instances and aggregating their output as an XML Document
using DOM. Now, in the RequestHandler's handleRequestBody(SolrQueryRequest
req, SolrQueryResponse resp) method, I want to output this XML Document to the
user as a response. But if I write it as a Document or Node by

For a Document:
response.add("grouped", domResult);
or, for a Node:
response.add("grouped", domNode);

it writes to the user

For a Document:
com.sun.org.apache.xerces.internal.dom.DocumentImpl:[#document: null]
or, for a Node:
com.sun.org.apache.xerces.internal.dom.ElementImpl:[arr: null]

The Document is definitely present: when I convert the Document to
String it comes out perfectly. But I don't want it as a String; rather, I want
it in XML format.

Please help, this is very urgent. Has anybody worked on this?

Regards
Vineet


RE: boost docs if token matches happen in the first 5 words

2013-07-18 Thread Anatoli Matuskova
Thanks for the quick answer Markus.
Could you give me a a guideline or point me where to check in the solr
source code to see how to get it done?




--
View this message in context: 
http://lucene.472066.n3.nabble.com/boost-docs-if-token-matches-happen-in-the-first-5-words-tp4078786p4078792.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Custom RequestHandlerBase XML Response Issue

2013-07-18 Thread Shalin Shekhar Mangar
This isn't a Solr issue. Maybe ask on the xerces list?


On Thu, Jul 18, 2013 at 3:31 PM, Vineet Mishra clearmido...@gmail.com wrote:

 Hi all

 I am using a Custom RequestHandlerBase where I am querying from multiple
 different Solr instance and aggregating their output as a XML Document
 using DOM,
 now in the RequestHandler's function handleRequestBody(SolrQueryRequest
 req, SolrQueryResponse resp) I want to output this XML Document to the user
 as a response, but if I write it as a Document or Node by

 For Document
 response.add("grouped", domResult);
 or

 response.add("grouped", domNode);

 its writing to the user

 For Document
 com.sun.org.apache.xerces.internal.dom.DocumentImpl:[#document: null]
 or
 For Node
 com.sun.org.apache.xerces.internal.dom.ElementImpl:[arr: null]


 Even when the Document is present, because when I convert the Document to
 String its coming perfectly, but I don't want it as a String rather I want
 it in a XML format.

 Please this is very urgent, has anybody worked on this!

 Regards
 Vineet




-- 
Regards,
Shalin Shekhar Mangar.


Re: Custom RequestHandlerBase XML Response Issue

2013-07-18 Thread Vineet Mishra
Thanks for your response Shalin,
so does that mean that we can't return an XML object in SolrQueryResponse
through a custom RequestHandler?


On Thu, Jul 18, 2013 at 4:04 PM, Shalin Shekhar Mangar 
shalinman...@gmail.com wrote:

 This isn't a Solr issue. Maybe ask on the xerces list?


 On Thu, Jul 18, 2013 at 3:31 PM, Vineet Mishra clearmido...@gmail.com
 wrote:

  Hi all
 
  I am using a Custom RequestHandlerBase where I am querying from multiple
  different Solr instance and aggregating their output as a XML Document
  using DOM,
  now in the RequestHandler's function handleRequestBody(SolrQueryRequest
  req, SolrQueryResponse resp) I want to output this XML Document to the
 user
  as a response, but if I write it as a Document or Node by
 
  For Document
  response.add("grouped", domResult);
  or
 
  response.add("grouped", domNode);
 
  its writing to the user
 
  For Document
  com.sun.org.apache.xerces.internal.dom.DocumentImpl:[#document: null]
  or
  For Node
  com.sun.org.apache.xerces.internal.dom.ElementImpl:[arr: null]
 
 
  Even when the Document is present, because when I convert the Document to
  String its coming perfectly, but I don't want it as a String rather I
 want
  it in a XML format.
 
  Please this is very urgent, has anybody worked on this!
 
  Regards
  Vineet
 



 --
 Regards,
 Shalin Shekhar Mangar.



RE: boost docs if token matches happen in the first 5 words

2013-07-18 Thread Markus Jelsma

You'll need to import the org.apache.lucene.search.spans package in Solr's 
ExtendedDismaxQParserPlugin and add SpanFirstQuery's to the main query. 
Something like:
query.add(new SpanFirstQuery(new SpanTermQuery(new Term(field, termText)), distance),
BooleanClause.Occur.SHOULD);
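Slightly expanded, that could look roughly like the following inside the parser; the field name, boost and term list are placeholders, and this is untested:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.spans.SpanFirstQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

BooleanQuery main = new BooleanQuery();
// ... the clauses built by the (e)dismax parser go here ...
String[] queryTerms = { "spider" }; // stand-in for the analyzed user terms
int distance = 5;                   // a match must end within the first 5 positions
for (String termText : queryTerms) {
    SpanFirstQuery sfq = new SpanFirstQuery(
            new SpanTermQuery(new Term("body", termText)), distance);
    sfq.setBoost(10f); // placeholder boost factor
    main.add(sfq, BooleanClause.Occur.SHOULD); // boosts, does not restrict
}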

 
-Original message-
 From:Anatoli Matuskova anatoli.matusk...@gmail.com
 Sent: Thursday 18th July 2013 12:33
 To: solr-user@lucene.apache.org
 Subject: RE: boost docs if token matches happen in the first 5 words
 
 Thanks for the quick answer Markus.
 Could you give me a a guideline or point me where to check in the solr
 source code to see how to get it done?
 
 
 
 
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/boost-docs-if-token-matches-happen-in-the-first-5-words-tp4078786p4078792.html
 Sent from the Solr - User mailing list archive at Nabble.com.
 


Re: Custom RequestHandlerBase XML Response Issue

2013-07-18 Thread Shalin Shekhar Mangar
Solr's response writers support only a few known types. Look at the
writeVal method in TextResponseWriter:

https://svn.apache.org/repos/asf/lucene/dev/trunk/solr/core/src/java/org/apache/solr/response/TextResponseWriter.java


On Thu, Jul 18, 2013 at 4:08 PM, Vineet Mishra clearmido...@gmail.com wrote:

 Thanks for your response Shalin,
 so does that mean that we can't return a XML object in SolrQueryResponse
 through Custom RequestHandler?


 On Thu, Jul 18, 2013 at 4:04 PM, Shalin Shekhar Mangar 
 shalinman...@gmail.com wrote:

  This isn't a Solr issue. Maybe ask on the xerces list?
 
 
  On Thu, Jul 18, 2013 at 3:31 PM, Vineet Mishra clearmido...@gmail.com
  wrote:
 
   Hi all
  
   I am using a Custom RequestHandlerBase where I am querying from
 multiple
   different Solr instance and aggregating their output as a XML Document
   using DOM,
   now in the RequestHandler's function handleRequestBody(SolrQueryRequest
   req, SolrQueryResponse resp) I want to output this XML Document to the
  user
   as a response, but if I write it as a Document or Node by
  
   For Document
   response.add("grouped", domResult);
   or
  
   response.add("grouped", domNode);
  
   its writing to the user
  
   For Document
   com.sun.org.apache.xerces.internal.dom.DocumentImpl:[#document: null]
   or
   For Node
   com.sun.org.apache.xerces.internal.dom.ElementImpl:[arr: null]
  
  
   Even when the Document is present, because when I convert the Document
 to
   String its coming perfectly, but I don't want it as a String rather I
  want
   it in a XML format.
  
   Please this is very urgent, has anybody worked on this!
  
   Regards
   Vineet
  
 
 
 
  --
  Regards,
  Shalin Shekhar Mangar.
 




-- 
Regards,
Shalin Shekhar Mangar.


Re: autoCommit and performance

2013-07-18 Thread Aditya
Hi

It totally depends upon what you can afford. If you can afford it, go for
bigger RAM, an SSD drive and a 64-bit OS.

Benchmark your application with a certain set of docs: how much RAM it
takes, indexing time, search time, etc. Increase the document count and
perform the benchmarking tasks again. This will provide more information.
Everything is directly proportional to the number of docs.

In my case, I have a basic hosting plan and I am happy with the performance.
My point is you don't always need fancy hardware. Start with basic and,
based on need, you can change the plan.

Regards
Aditya
www.findbestopensource.com





On Wed, Jul 17, 2013 at 4:55 PM, Ayman Plaha aymanpl...@gmail.com wrote:

 Thanks Aditya, can I also please get some advice on hosting.

- What *hosting specs* should I get? How much RAM? Considering my
client application is very simple: it just registers users to a database,
queries SOLR and displays SOLR results.
- A simple batch program adds 1000 or 2000 documents to SOLR every
second.

 I'm hoping to deploy the code next week, if you guys can give me any other
 advice I'd really appreciate that.


 On Wed, Jul 17, 2013 at 7:07 PM, Aditya findbestopensou...@gmail.com
 wrote:

  Hi
 
  It will not affect the performance. We are doing this  regularly. If you
 do
  optimize and search then there may be some impact.
 
  Regards
  Aditya
  www.findbestopensource.com
 
 
 
  On Wed, Jul 17, 2013 at 12:52 PM, Ayman Plaha aymanpl...@gmail.com
  wrote:
 
   Hey Guys,
  
   I've finally finished my Spring Java application that uses SOLR for
   searches and just had performance related question about SOLR. I'm
  indexing
   exactly 1000 *OR* 2000 records every second. Every record having 13
  fields
   including 'id'. Majority of the fields are solr.StrField (no filters)
  with
   characters ranging from 5 - 50 in length and one field which is text_t
   (solr.TextField) which can be of length 100 characters to 2000
 characters
   and has the following tokenizer and filters
  
  - PatternTokenizerFactory
  - LowerCaseFilterFactory
  - SynonymFilterFactory
  - SnowballPorterFilterFactory.
  
  
   I'm not using shards. I was hoping when searches get slow I will
 consider
   this or should I consider this now ?
  
   *Questions:*
  
  - I'm using SOLR autoCommit (every 15 minutes) with openSearcher set
  as
  true. I'm not using autoSoftCommit because instant availability of
 the
  documents for search is not necessary and I don't want to chew up
 too
   much
  memory because I'm consider Cloud hosting.
   <autoCommit>
     <maxTime>900000</maxTime>
     <openSearcher>true</openSearcher>
   </autoCommit>
  *will this effect the query performance of the client website if the
  index grew to 10 million records ? I mean while the commit is
  happening
  does that *effect the performance of queries* and how will this
 effect
  the queries if the index grew to 10 million records ?
  - What *hosting specs* should I get ? How much RAM ? Considering my
  - client application is very simple that just register users to
  database
  and queries SOLR and displays SOLR results.
  - simple batch program adds the 1000 OR 2000 documents to SOLR every
  second.
  
  
   I'm hoping to deploy the code next week, if you guys can give me any
  other
   advice I'd really appreciate that.
  
   Thanks
   Ayman
  
 



Re: autoCommit and performance

2013-07-18 Thread Ayman Plaha
Thanks Shawn and Aditya, I really appreciate your help. Based on your advice
and after reading the SolrPerformance article Shawn linked me to, I ended up
getting an Intel Dual Core (2-core) i3 3220 3.3GHz with 36GB RAM and 2 x
125GB SSD drives for $227 per month. It's still expensive for me, but I got
it anyway because even a very basic dedicated host in Australia is $150 per
month, and VPSes in Australia don't offer more than 2GB. I hope I made the
right decision. What do you guys think?

Thanks
Ayman



On Thu, Jul 18, 2013 at 9:07 PM, Aditya findbestopensou...@gmail.com wrote:

 Hi

 It totally depends upon your affordability. If you could afford go for
 bigger RAM, SSD drive and 64 Bit OS.

 Benchmark your application, with certain set of docs, how much RAM it
 takes, Indexing time, Search time etc. Increase the document count and
 perform benchmarking tasks again. This will provide more information.
 Everything is directly proportional to number of docs.

 In my case, I have basic hosting plan and i am happy with the performance.
 My point is you don't always need fancy hardware. Start with basic and
 based on the need you could change the plan.

 Regards
 Aditya
 www.findbestopensource.com





 On Wed, Jul 17, 2013 at 4:55 PM, Ayman Plaha aymanpl...@gmail.com wrote:

  Thanks Aditya, can I also please get some advice on hosting.
 
 - What *hosting specs* should I get ? How much RAM ? Considering my
 - client application is very simple that just register users to
 database
 and queries SOLR and displays SOLR results.
 - simple batch program adds the 1000 OR 2000 documents to SOLR every
 second.
 
  I'm hoping to deploy the code next week, if you guys can give me any
 other
  advice I'd really appreciate that.
 
 
  On Wed, Jul 17, 2013 at 7:07 PM, Aditya findbestopensou...@gmail.com
  wrote:
 
   Hi
  
   It will not affect the performance. We are doing this  regularly. If
 you
  do
   optimize and search then there may be some impact.
  
   Regards
   Aditya
   www.findbestopensource.com
  
  
  
   On Wed, Jul 17, 2013 at 12:52 PM, Ayman Plaha aymanpl...@gmail.com
   wrote:
  
Hey Guys,
   
I've finally finished my Spring Java application that uses SOLR for
searches and just had performance related question about SOLR. I'm
   indexing
exactly 1000 *OR* 2000 records every second. Every record having 13
   fields
including 'id'. Majority of the fields are solr.StrField (no filters)
   with
characters ranging from 5 - 50 in length and one field which is
 text_t
(solr.TextField) which can be of length 100 characters to 2000
  characters
and has the following tokenizer and filters
   
   - PatternTokenizerFactory
   - LowerCaseFilterFactory
   - SynonymFilterFactory
   - SnowballPorterFilterFactory.
   
   
I'm not using shards. I was hoping when searches get slow I will
  consider
this or should I consider this now ?
   
*Questions:*
   
   - I'm using SOLR autoCommit (every 15 minutes) with openSearcher
 set
   as
   true. I'm not using autoSoftCommit because instant availability of
  the
   documents for search is not necessary and I don't want to chew up
  too
much
   memory because I'm consider Cloud hosting.
    <autoCommit>
      <maxTime>900000</maxTime>
      <openSearcher>true</openSearcher>
    </autoCommit>
   *will this effect the query performance of the client website if
 the
   index grew to 10 million records ? I mean while the commit is
   happening
   does that *effect the performance of queries* and how will this
  effect
   the queries if the index grew to 10 million records ?
   - What *hosting specs* should I get ? How much RAM ? Considering
 my
   - client application is very simple that just register users to
   database
   and queries SOLR and displays SOLR results.
   - simple batch program adds the 1000 OR 2000 documents to SOLR
 every
   second.
   
   
I'm hoping to deploy the code next week, if you guys can give me any
   other
advice I'd really appreciate that.
   
Thanks
Ayman
   
  
 



Re: Doc's FunctionQuery result field in my custom SearchComponent class ?

2013-07-18 Thread Jack Krupansky
As detailed in a previous email, termfreq is not a field - it is a
transformer or function. Technically, it is actually a ValueSource.


If you look at the TextResponseWriter.writeVal method you can see how it
kicks off the execution of transformers when writing documents.


-- Jack Krupansky
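For readers wanting the extension point itself: a small custom transformer is one way to run per-document logic at write time. A sketch against the Solr 4.x transformer SPI (the class name and the injected field are made up):

import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.params.SolrParams;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.transform.DocTransformer;
import org.apache.solr.response.transform.TransformerFactory;

public class MarkerTransformerFactory extends TransformerFactory {
    @Override
    public DocTransformer create(final String field, final SolrParams params,
                                 SolrQueryRequest req) {
        final String value = params.get("v", "marked");
        return new DocTransformer() {
            @Override
            public String getName() {
                return field;
            }

            @Override
            public void transform(SolrDocument doc, int docid) {
                // invoked per document while the response is being written
                doc.setField(field, value);
            }
        };
    }
}

Registered with <transformer name="mark" class="com.example.MarkerTransformerFactory"/> in solrconfig.xml, it would be requested as fl=*,[mark v=hello].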

-Original Message- 
From: Tony Mullins

Sent: Thursday, July 18, 2013 2:49 AM
To: solr-user@lucene.apache.org
Subject: Re: Doc's FunctionQuery result field in my custom SearchComponent 
class ?


Erick,
In freq:termfreq(product,'spider'), freq is an alias for the 'termfreq' function
query, so I could have that field with the name 'freq' in the document response.
This is the code I am using to get the document object, and there is no
termfreq field in its fields collection.

DocList docs = rb.getResults().docList;
   DocIterator iterator = docs.iterator();
   int sumFreq = 0;
   String id = null;

   for (int i = 0; i < docs.size(); i++) {
   try {
   int docId = iterator.nextDoc();

  // Document doc = searcher.doc(docId, fieldSet);
   Document doc = searcher.doc(docId);

Thanks,
Tony


On Wed, Jul 17, 2013 at 5:30 PM, Erick Erickson 
erickerick...@gmail.comwrote:



Where are you getting the syntax
freq:termfreq(product,'spider')
? Try just

termfreq(product,'spider')
you'll get an element in the doc labeled 'termfreq', at least
I do.

Best
Erick

On Tue, Jul 16, 2013 at 1:03 PM, Tony Mullins tonymullins...@gmail.com
wrote:
 OK, so that's why I cannot see the FunctionQuery fields in my
 SearchComponent class.
 So then the question would be: how can I apply my custom processing/logic to
 these FunctionQuery results? What's the extension point in Solr for such
scenarios?

 Basically I want to call termfreq() for each document and then apply the
 sum to all doc's termfreq() results and show in one aggregated TermFreq
 field in my query response.

 Thanks.
 Tony



 On Tue, Jul 16, 2013 at 6:01 PM, Jack Krupansky j...@basetechnology.com
wrote:

 Basically, the evaluation of function queries in the fl parameter
occurs
 when the response writer is composing the document results. That's 
 AFTER

 all of the search components are done.

 SolrReturnFields.**getTransformer() gets the DocTransformer, which is
 really a DocTransformers, and then a call to
DocTransformers.transform() in
 each response writer will evaluate the embedded function queries and
insert
 their values in the results as they are being written.

 -- Jack Krupansky

 -Original Message- From: Tony Mullins
 Sent: Tuesday, July 16, 2013 1:37 AM
 To: solr-user@lucene.apache.org
 Subject: Re: Doc's FunctionQuery result field in my custom
SearchComponent
 class ?


 No sorry, I am still not getting the termfreq() field in my 'doc'
object.
 I do get the _version_ field in my 'doc' object which I think is
 realValue=StoredField.

 At which point does termfreq() or any other FunctionQuery field become
 part of the doc object in Solr? And at that point, can I perform some
 custom logic and append it to the response?

 Thanks.
 Tony





 On Tue, Jul 16, 2013 at 1:34 AM, Patanachai Tangchaisin
 patanachai.tangchaisin@wizecommerce.com wrote:

  Hi,

 I think the process of retrieving a stored field (through fl) happens
 after SearchComponent.

 One solution: if you wrap the q param with a function, your score will be
 the result of the function.
 For example,

 http://localhost:8080/solr/collection2/demoendpoint?q=termfreq%28product,%27spider%27%29&wt=xml&indent=true&fl=*,score

 



 Now your score is going to be a result of termfreq(product,'spider')


 --
 Patanachai Tangchaisin



 On 07/15/2013 12:01 PM, Tony Mullins wrote:

  any help plz !!!


 On Mon, Jul 15, 2013 at 4:13 PM, Tony Mullins tonymullins...@gmail.com
 wrote:


  Please, any help on how to get the value of the 'freq' field in my custom
 SearchComponent?


 http://localhost:8080/solr/collection2/demoendpoint?q=spider&wt=xml&indent=true&fl=*,freq:termfreq%28product,%27spider%27%29

 


 <doc><str name="id">11</str><str name="type">Video Games</str><str
 name="format">xbox 360</str><str name="product">The Amazing
 Spider-Man</str><int name="popularity">11</int><long
 name="_version_">1439994081345273856</long><int
 name="freq">1</int></doc>



 Here is my code

 DocList docs = rb.getResults().docList;
  DocIterator 

Re: Custom RequestHandlerBase XML Response Issue

2013-07-18 Thread Vineet Mishra
But it seems there is even something called XMLResponseWriter:

https://svn.apache.org/repos/asf/lucene/dev/trunk/solr/core/src/java/org/apache/solr/response/XMLResponseWriter.java

Won't it be appropriate in my case?
Although I have not implemented it yet, how come there is no
way to put XML-formatted content into a SolrQueryResponse?


On Thu, Jul 18, 2013 at 4:36 PM, Shalin Shekhar Mangar 
shalinman...@gmail.com wrote:

 Solr's response writers support only a few known types. Look at the
 writeVal method in TextResponseWriter:


 https://svn.apache.org/repos/asf/lucene/dev/trunk/solr/core/src/java/org/apache/solr/response/TextResponseWriter.java


 On Thu, Jul 18, 2013 at 4:08 PM, Vineet Mishra clearmido...@gmail.com
 wrote:

  Thanks for your response Shalin,
  so does that mean that we can't return a XML object in SolrQueryResponse
  through Custom RequestHandler?
 
 
  On Thu, Jul 18, 2013 at 4:04 PM, Shalin Shekhar Mangar 
  shalinman...@gmail.com wrote:
 
   This isn't a Solr issue. Maybe ask on the xerces list?
  
  
   On Thu, Jul 18, 2013 at 3:31 PM, Vineet Mishra clearmido...@gmail.com
   wrote:
  
Hi all
   
I am using a Custom RequestHandlerBase where I am querying from
  multiple
different Solr instance and aggregating their output as a XML
 Document
using DOM,
now in the RequestHandler's function
 handleRequestBody(SolrQueryRequest
req, SolrQueryResponse resp) I want to output this XML Document to
 the
   user
as a response, but if I write it as a Document or Node by
   
For Document
response.add("grouped", domResult);
or
   
response.add("grouped", domNode);
   
its writing to the user
   
For Document
com.sun.org.apache.xerces.internal.dom.DocumentImpl:[#document: null]
or
For Node
com.sun.org.apache.xerces.internal.dom.ElementImpl:[arr: null]
   
   
Even when the Document is present, because when I convert the
 Document
  to
String its coming perfectly, but I don't want it as a String rather I
   want
it in a XML format.
   
Please this is very urgent, has anybody worked on this!
   
Regards
Vineet
   
  
  
  
   --
   Regards,
   Shalin Shekhar Mangar.
  
 



 --
 Regards,
 Shalin Shekhar Mangar.



Re: Custom RequestHandlerBase XML Response Issue

2013-07-18 Thread Jack Krupansky

It would probably be better to integrate the responses (document lists.)

Solr response writers do a lot of special processing of the response data, 
so you can't just throw random objects into the response.


You may need to explain your use case a little more clearly.

-- Jack Krupansky

-Original Message- 
From: Vineet Mishra

Sent: Thursday, July 18, 2013 8:41 AM
To: solr-user@lucene.apache.org
Subject: Re: Custom RequestHandlerBase XML Response Issue

But it seems it even have something called  XML ResponseWriter

https://svn.apache.org/repos/asf/lucene/dev/trunk/solr/core/src/java/org/apache/solr/response/XMLResponseWriter.java

Wont it be appropriate in my case?
Although I have not implemented it yet but how come there couldn't be any
way to make a SolrQueryResponse in XML format!


On Thu, Jul 18, 2013 at 4:36 PM, Shalin Shekhar Mangar 
shalinman...@gmail.com wrote:


Solr's response writers support only a few known types. Look at the
writeVal method in TextResponseWriter:


https://svn.apache.org/repos/asf/lucene/dev/trunk/solr/core/src/java/org/apache/solr/response/TextResponseWriter.java


On Thu, Jul 18, 2013 at 4:08 PM, Vineet Mishra clearmido...@gmail.com
wrote:

 Thanks for your response Shalin,
 so does that mean that we can't return a XML object in SolrQueryResponse
 through Custom RequestHandler?


 On Thu, Jul 18, 2013 at 4:04 PM, Shalin Shekhar Mangar 
 shalinman...@gmail.com wrote:

  This isn't a Solr issue. Maybe ask on the xerces list?
 
 
  On Thu, Jul 18, 2013 at 3:31 PM, Vineet Mishra clearmido...@gmail.com
  wrote:
 
   Hi all
  
   I am using a Custom RequestHandlerBase where I am querying from
 multiple
   different Solr instance and aggregating their output as a XML
Document
   using DOM,
   now in the RequestHandler's function
handleRequestBody(SolrQueryRequest
   req, SolrQueryResponse resp) I want to output this XML Document to
the
  user
   as a response, but if I write it as a Document or Node by
  
   For Document
   response.add("grouped", domResult);
   or
  
   response.add("grouped", domNode);
  
   its writing to the user
  
   For Document
   com.sun.org.apache.xerces.internal.dom.DocumentImpl:[#document: 
   null]

   or
   For Node
   com.sun.org.apache.xerces.internal.dom.ElementImpl:[arr: null]
  
  
   Even when the Document is present, because when I convert the
Document
 to
   String its coming perfectly, but I don't want it as a String rather 
   I

  want
   it in a XML format.
  
   Please this is very urgent, has anybody worked on this!
  
   Regards
   Vineet
  
 
 
 
  --
  Regards,
  Shalin Shekhar Mangar.
 




--
Regards,
Shalin Shekhar Mangar.





Re: Custom RequestHandlerBase XML Response Issue

2013-07-18 Thread Vineet Mishra
So does that mean there is no way that we can write an XML or JSON object to
the SolrQueryResponse and expect it to be formatted?


Re: Custom RequestHandlerBase XML Response Issue

2013-07-18 Thread Shalin Shekhar Mangar
Okay, let me explain. If you construct your combined response (why are you
doing that again?) in the form of a Solr NamedList or SolrDocumentList, then
the XMLResponseWriter (which btw uses TextResponseWriter) has no problem
writing it out as XML. The problem here is that you are giving it an
object (a DOM Document?) which it doesn't know how to serialize, so it just
calls .toString on it and writes that out.

As long as you stick a known type into the SolrQueryResponse, you should be
fine.


On Thu, Jul 18, 2013 at 6:24 PM, Vineet Mishra clearmido...@gmail.com wrote:

 So does that mean there is no way that we can write a XML or JSON object to
 the SolrQueryResponse and expect it to be formatted?




-- 
Regards,
Shalin Shekhar Mangar.


Sort by document similarity counts

2013-07-18 Thread zygis
Hi,

Is it possible to sort search results based on the count of similar documents a
document has? Say we have a document A which has 4 other similar documents in
the index and a document B which has 10; then the order Solr returns them in
should be B, A. Sorting on moreLikeThis counts for each document would be an
example of this (in my case I use ngram similarity detection from Tika).

I have tried doing this via a custom SearchComponent, where I can find all
similar documents for each document in the current search result, then add a
new field into the document, hoping to use the sort parameter
(q=*&sort=similarityCount). But this will not work because sorting is done
before my custom search component is handled, if it is added via
last-components. I can't add it via first-components, because then I will have
no access to the query results. And I do not want to override QueryComponent
because I need all the functionality it covers: grouping, facets, etc.

Thanks


Re: Custom RequestHandlerBase XML Response Issue

2013-07-18 Thread Vineet Mishra
My case is like this: I have got a few Solr instances; I query them and
get their XML responses, and out of that XML I have to extract a group of
specific XML nodes. Later I combine the other Solr responses into a
single XML and make a DOM document out of it.

So, as you mentioned in your last mail, how can I prepare a combined
response for this XML doc? Even if I do, I don't think it would work,
because I am doing the same thing in the RequestHandler.





On Thu, Jul 18, 2013 at 6:30 PM, Shalin Shekhar Mangar 
shalinman...@gmail.com wrote:

 Okay, let me explain. If you construct your combined response (why are you
 doing that again?) in the form a Solr NamedList or SolrDocumentList then
 the XMLResponseWriter (which btw uses TextResponseWriter) has no problem
 writing it down as XML. The problem here is that you are giving it an
 object (a DOM Document?) which it doesn't know how to serialize so it just
 calls .toString on it and writes it out.

 As long as you stick a known type into the SolrQueryResponse, you should be
 fine.


 On Thu, Jul 18, 2013 at 6:24 PM, Vineet Mishra clearmido...@gmail.com
 wrote:

  So does that mean there is no way that we can write a XML or JSON object
 to
  the SolrQueryResponse and expect it to be formatted?
 



 --
 Regards,
 Shalin Shekhar Mangar.



Re: Sort by document similarity counts

2013-07-18 Thread Koji Sekiguchi

I have tried doing this via custom SearchComponent, where I can find all similar 
documents for each document in current search result, then add a new field into 
document hoping to use sort parameter (q=*sort=similarityCount).


I don't understand this part very well, but:


But this will not work because sort is done before handling my custom search 
component, if added via last-components. Can't add it via first-components, 
because then I will have no access to query results. And I do not want to 
override QueryComponent because I need to have all the functionality it covers: 
grouping, facets, etc.


You may want to put your custom SearchComponent in last-components and inject a
SortSpec
in your prepare() so that QueryComponent sorts the result complying with
your SortSpec.

koji
--
http://soleami.com/blog/automatically-acquiring-synonym-knowledge-from-wikipedia.html
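A sketch of what Koji describes, assuming Solr 4.x internals (SortSpec.setSort) and that the similarity count is available as a sortable field at query time; the field name is illustrative:

import java.io.IOException;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.solr.handler.component.ResponseBuilder;

// in a component registered under last-components; prepare() of all
// components runs before QueryComponent.process(), so the injected
// sort is in place when the query actually executes
@Override
public void prepare(ResponseBuilder rb) throws IOException {
    // true = descending, so documents with more similar docs come first
    Sort sort = new Sort(new SortField("similarityCount", SortField.Type.INT, true));
    rb.getSortSpec().setSort(sort);
}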


Re: How can I learn the total count of how many documents indexed and how many documents updated?

2013-07-18 Thread Furkan KAMACI
Hi Shawn;

This is what I see when I look at mbeans:
<lst name="UPDATEHANDLER"><lst name="updateHandler"><str
name="class">org.apache.solr.update.DirectUpdateHandler2</str><str
name="version">1.0</str><str name="description">Update handler that
efficiently directly updates the on-disk main lucene index</str><str
name="src">$URL$</str>
<lst name="stats">
<long name="commits">41</long>
<str name="autocommit maxTime">15000ms</str>
<int name="autocommits">37</int>
<int name="soft autocommits">0</int>
<long name="optimizes">2</long>
<long name="rollbacks">0</long>
<long name="expungeDeletes">0</long>
<long name="docsPending">0</long>
<long name="adds">0</long>
<long name="deletesById">0</long>
<long name="deletesByQuery">0</long>
<long name="errors">0</long>
<long name="cumulative_adds">211453</long>
<long name="cumulative_deletesById">0</long>
<long name="cumulative_deletesByQuery">0</long>
<long name="cumulative_errors">0</long>
</lst></lst></lst>

I don't think the information I am looking for is in there?

2013/7/18 Shawn Heisey s...@elyograg.org

 On 7/17/2013 8:06 AM, Furkan KAMACI wrote:
  I have crawled some web pages and indexed them at my SolrCloud(Solr
 4.2.1).
  However before I index them there was already some indexes. I can
 calculate
  the difference between current and previous document count. However it
  doesn't mean that I have indexed that count of documents. Because urls of
  websites are unique ids at my system. So it means that some of documents
  updated and they did not increased document count.
 
  My question is that: How can I learn the total count of how many
 documents
  indexed and how many documents updated?

 Look at the update handler statistics.  Your application should record
 the numbers there, then you can check the handler statistics again and
 note the differences.  Here's a URL that can give you those statistics.

 http://server:port/solr/mycollectionname/admin/mbeans?stats=true

 They are also available in the UI on the UPDATEHANDLER section of
 Plugins / Stats, but you can't really use that in a program.

 By setting the request handler path on a query object to /admin/mbeans
 and setting the stats parameter, you can get this information with SolrJ.

 Thanks,
 Shawn




RE: How can I learn the total count of how many documents indexed and how many documents updated?

2013-07-18 Thread Markus Jelsma
Not your updateHandler - that only shows numbers about what it's doing, and it
can be restarted. Check your cores:
host:port/solr/admin/cores
 
 
-Original message-
 From:Furkan KAMACI furkankam...@gmail.com
 Sent: Thursday 18th July 2013 15:46
 To: solr-user@lucene.apache.org
 Subject: Re: How can I learn the total count of how many documents indexed 
 and how many documents updated?
 
 Hi Shawn;
 
 This is what I see when I look at mbeans:
 <lst name="UPDATEHANDLER"><lst name="updateHandler"><str
 name="class">org.apache.solr.update.DirectUpdateHandler2</str><str
 name="version">1.0</str><str name="description">Update handler that
 efficiently directly updates the on-disk main lucene index</str><str
 name="src">$URL$</str>
 <lst name="stats">
 <long name="commits">41</long>
 <str name="autocommit maxTime">15000ms</str>
 <int name="autocommits">37</int>
 <int name="soft autocommits">0</int>
 <long name="optimizes">2</long>
 <long name="rollbacks">0</long>
 <long name="expungeDeletes">0</long>
 <long name="docsPending">0</long>
 <long name="adds">0</long>
 <long name="deletesById">0</long>
 <long name="deletesByQuery">0</long>
 <long name="errors">0</long>
 <long name="cumulative_adds">211453</long>
 <long name="cumulative_deletesById">0</long>
 <long name="cumulative_deletesByQuery">0</long>
 <long name="cumulative_errors">0</long>
 </lst></lst></lst>
 
 I think that there is no information about what I look for?
 
 2013/7/18 Shawn Heisey s...@elyograg.org
 
  On 7/17/2013 8:06 AM, Furkan KAMACI wrote:
   I have crawled some web pages and indexed them at my SolrCloud(Solr
  4.2.1).
   However before I index them there was already some indexes. I can
  calculate
   the difference between current and previous document count. However it
   doesn't mean that I have indexed that count of documents. Because urls of
   websites are unique ids at my system. So it means that some of documents
   updated and they did not increased document count.
  
   My question is that: How can I learn the total count of how many
  documents
   indexed and how many documents updated?
 
  Look at the update handler statistics.  Your application should record
  the numbers there, then you can check the handler statistics again and
  note the differences.  Here's a URL that can give you those statistics.
 
  http://server:port/solr/mycollectionname/admin/mbeans?stats=true
 
  They are also available in the UI on the UPDATEHANDLER section of
  Plugins / Stats, but you can't really use that in a program.
 
  By setting the request handler path on a query object to /admin/mbeans
  and setting the stats parameter, you can get this information with SolrJ.
 
  Thanks,
  Shawn
 
 
 


Re: Custom RequestHandlerBase XML Response Issue

2013-07-18 Thread Shalin Shekhar Mangar
This sounds like a bad idea. You could have done this much more simply inside
your own application, using libraries that you know well.

That being said, instead of creating a DOM document, create a Solr
NamedList object, which can be serialized by XMLResponseWriter.
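A minimal sketch of that, inside handleRequestBody (the names are invented; the point is only that NamedList and SolrDocumentList are types every response writer knows how to serialize):

import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrDocumentList;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.common.util.SimpleOrderedMap;

// instead of rsp.add("grouped", domDocument):
NamedList<Object> grouped = new SimpleOrderedMap<Object>();
SolrDocumentList merged = new SolrDocumentList();
SolrDocument d = new SolrDocument();
d.setField("id", "11");             // values extracted from each
d.setField("source", "instance-1"); // instance's XML response
merged.add(d);
merged.setNumFound(merged.size());
grouped.add("docs", merged);
rsp.add("grouped", grouped); // XMLResponseWriter serializes this cleanly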


On Thu, Jul 18, 2013 at 6:48 PM, Vineet Mishra clearmido...@gmail.com wrote:

 My case is like, I have got a few Solr Instances and querying them and
 getting their xml response, out of that xml I have to extract a group of
 specific xml nodes, later I am combining other solr's response into a
 single xml and making a DOM document out of it.

 So as you mentioned in your last mail, how can I prepare a combined
 response for this xml doc and even if I do I don't think it would work
 because the same I am doing in the RequstHandler.





 On Thu, Jul 18, 2013 at 6:30 PM, Shalin Shekhar Mangar 
 shalinman...@gmail.com wrote:

  Okay, let me explain. If you construct your combined response (why are
 you
  doing that again?) in the form a Solr NamedList or SolrDocumentList then
  the XMLResponseWriter (which btw uses TextResponseWriter) has no problem
  writing it down as XML. The problem here is that you are giving it an
  object (a DOM Document?) which it doesn't know how to serialize so it
 just
  calls .toString on it and writes it out.
 
  As long as you stick a known type into the SolrQueryResponse, you should
 be
  fine.
 
 
  On Thu, Jul 18, 2013 at 6:24 PM, Vineet Mishra clearmido...@gmail.com
  wrote:
 
   So does that mean there is no way that we can write a XML or JSON
 object
  to
   the SolrQueryResponse and expect it to be formatted?
  
 
 
 
  --
  Regards,
  Shalin Shekhar Mangar.
 




-- 
Regards,
Shalin Shekhar Mangar.


Getting a large number of documents by id

2013-07-18 Thread Brian Hurt
I have a situation which is common in our current use case, where I need to
get a large number (many hundreds) of documents by id.  What I'm doing
currently is creating a large query of the form id:12345 OR id:23456 OR
... and sending it off.  Unfortunately, this query is taking a long time,
especially the first time it's executed.  I'm seeing times of like 4+
seconds for this query to return, to get 847 documents.

So, my question is: what should I be looking at to improve the performance
here?

Brian


Re: Clearing old nodes from zookeper without restarting solrcloud cluster

2013-07-18 Thread Luis Carlos Guerrero Covo
Hey Andre, that isn't a possibility for us right now since we are
terminating nodes using AWS autoscaling policies. We'll have to either
change our policies so that we can have some kind of graceful shutdown,
where we get the chance to unload cores, or update ZooKeeper's cluster
state every once in a while to clear old offline nodes. Thanks for the help!


On Wed, Jul 17, 2013 at 2:23 AM, Andre Bois-Crettez
andre.b...@kelkoo.comwrote:

 Indeed we are using UNLOAD of cores before shutting down extra replica
 nodes; it works well but, as already said, it needs such nodes to be up.
 Once UNLOADed it is possible to stop them; this works well for our use case.

 But if nodes are already down, maybe it is possible to manually create
 and upload a cleaned /clusterstate.json to Zookeeper ?


 André


 On 07/16/2013 11:18 PM, Marcin Rzewucki wrote:

 Unloading a core is the known way to unregister a solr node in zookeeper
 (and not use for further querying). It works for me. If you didn't do that
 like this, unused nodes may remain in the cluster state and Solr may try
 to
 use them without a success. I'd suggest to start some machine with the old
 name, run solr, join the cluster for a while, unload a core to unregister
 it from the cluster and shutdown host at the end. This way you could have
 clear cluster state.



 On 16 July 2013 14:41, Luis Carlos Guerrero Covo
 lcguerreroc...@gmail.com**wrote:

  Thanks, I was actually asking about deleting nodes from the cluster state
 not cores, unless you can unload cores specific to an already offline
 node
 from zookeeper.

 --
 André Bois-Crettez

 Search technology, Kelkoo
 http://www.kelkoo.com/


 Kelkoo SAS
 Société par Actions Simplifiée
 Au capital de € 4.168.964,30
 Siège social : 8, rue du Sentier 75002 Paris
 425 093 069 RCS Paris

 Ce message et les pièces jointes sont confidentiels et établis à
 l'attention exclusive de leurs destinataires. Si vous n'êtes pas le
 destinataire de ce message, merci de le détruire et d'en avertir
 l'expéditeur.




-- 
Luis Carlos Guerrero Covo
M.S. Computer Engineering
(57) 3183542047


Two-steps queries with different sorting criteria

2013-07-18 Thread Fabio Amato
Hi all,
I need to execute a Solr query in two steps: in the first step, a generic
limited-results query ordered by relevance, and in the second step an
ordering of the results of the first step according to a given sorting
criterion (different from relevance).

This two-step query is meaningful when the query terms are so generic
that the number of matched results exceeds the wanted number of results.

In such circumstances, using single-step queries with different sorting
criteria has a very confusing effect on the user experience, because at
each change of sorting criterion the user gets different results, even if
the search query and the filtering conditions have not changed.

On the contrary, using a two-step query where the sorting order of the
first step is always relevance is more acceptable in case of a large
number of matched results, because the result set would not change with
the sorting criterion of the second step.

I am wondering if such a two-step query is achievable with a single Solr
query, or if I am obliged to execute the sorting step of my two-step query
outside of Solr (i.e., in my application). Another possibility could be the
development of a Solr plugin, but I am afraid of the possible effects on
performance.

I am using Solr 3.4.0

Thanks in advance for your kind help.
Fabio


Re: How can I learn the total count of how many documents indexed and how many documents updated?

2013-07-18 Thread Furkan KAMACI
Hi Markus;

It doesn't tell me how many documents were updated since the last commit.

2013/7/18 Markus Jelsma markus.jel...@openindex.io

 Not your updateHandler, that only shows number about what it's doing and
 it can be restarted. Check your cores:
 host:port/solr/admin/cores


 -Original message-
  From:Furkan KAMACI furkankam...@gmail.com
  Sent: Thursday 18th July 2013 15:46
  To: solr-user@lucene.apache.org
  Subject: Re: How can I learn the total count of how many documents
 indexed and how many documents updated?
 
  Hi Shawn;
 
  This is what I see when I look at mbeans:
  <lst name="UPDATEHANDLER"><lst name="updateHandler"><str
  name="class">org.apache.solr.update.DirectUpdateHandler2</str><str
  name="version">1.0</str><str name="description">Update handler that
  efficiently directly updates the on-disk main lucene index</str><str
  name="src">$URL$</str>
  <lst name="stats">
  <long name="commits">41</long>
  <str name="autocommit maxTime">15000ms</str>
  <int name="autocommits">37</int>
  <int name="soft autocommits">0</int>
  <long name="optimizes">2</long>
  <long name="rollbacks">0</long>
  <long name="expungeDeletes">0</long>
  <long name="docsPending">0</long>
  <long name="adds">0</long>
  <long name="deletesById">0</long>
  <long name="deletesByQuery">0</long>
  <long name="errors">0</long>
  <long name="cumulative_adds">211453</long>
  <long name="cumulative_deletesById">0</long>
  <long name="cumulative_deletesByQuery">0</long>
  <long name="cumulative_errors">0</long>
  </lst></lst></lst>
 
  I think that there is no information about what I look for?
 
  2013/7/18 Shawn Heisey s...@elyograg.org
 
   On 7/17/2013 8:06 AM, Furkan KAMACI wrote:
I have crawled some web pages and indexed them at my SolrCloud(Solr
   4.2.1).
However before I index them there was already some indexes. I can
   calculate
the difference between current and previous document count. However
 it
doesn't mean that I have indexed that count of documents. Because
 urls of
websites are unique ids at my system. So it means that some of
 documents
updated and they did not increased document count.
   
My question is that: How can I learn the total count of how many
   documents
indexed and how many documents updated?
  
   Look at the update handler statistics.  Your application should record
   the numbers there, then you can check the handler statistics again and
   note the differences.  Here's a URL that can give you those statistics.
  
   http://server:port/solr/mycollectionname/admin/mbeans?stats=true
  
   They are also available in the UI on the UPDATEHANDLER section of
   Plugins / Stats, but you can't really use that in a program.
  
   By setting the request handler path on a query object to /admin/mbeans
   and setting the stats parameter, you can get this information with
 SolrJ.
  
   Thanks,
   Shawn
  
  
 



RE: How can I learn the total count of how many documents indexed and how many documents updated?

2013-07-18 Thread Markus Jelsma
No, nothing will. If you must know, you'll have to do it on the client side and
make sure autocommit is disabled.
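A client-side tally could look roughly like this with SolrJ 4.x (a sketch: exception handling omitted, names illustrative; note that a search only sees committed documents, which is why autocommit must be off for the existence check to be reliable):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.util.ClientUtils;
import org.apache.solr.common.SolrInputDocument;

HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
long added = 0, updated = 0;
for (SolrInputDocument doc : batch) { // batch: the docs about to be indexed
    String id = (String) doc.getFieldValue("id");
    SolrQuery q = new SolrQuery("id:" + ClientUtils.escapeQueryChars(id));
    q.setRows(0); // only numFound is needed
    if (server.query(q).getResults().getNumFound() > 0) {
        updated++;
    } else {
        added++;
    }
    server.add(doc);
}
// commit once at the end; added/updated now hold the split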
 
-Original message-
 From:Furkan KAMACI furkankam...@gmail.com
 Sent: Thursday 18th July 2013 17:01
 To: solr-user@lucene.apache.org
 Subject: Re: How can I learn the total count of how many documents indexed 
 and how many documents updated?
 
 Hi Markus;
 
 It doesn't give me how many documents updated from last commit.
 
 2013/7/18 Markus Jelsma markus.jel...@openindex.io
 
  Not your updateHandler, that only shows number about what it's doing and
  it can be restarted. Check your cores:
  host:port/solr/admin/cores
 
 
  -Original message-
   From:Furkan KAMACI furkankam...@gmail.com
   Sent: Thursday 18th July 2013 15:46
   To: solr-user@lucene.apache.org
   Subject: Re: How can I learn the total count of how many documents
  indexed and how many documents updated?
  
   Hi Shawn;
  
   This is what I see when I look at mbeans:
    <lst name="UPDATEHANDLER"><lst name="updateHandler"><str
    name="class">org.apache.solr.update.DirectUpdateHandler2</str><str
    name="version">1.0</str><str name="description">Update handler that
    efficiently directly updates the on-disk main lucene index</str><str
    name="src">$URL$</str>
    <lst name="stats">
    <long name="commits">41</long>
    <str name="autocommit maxTime">15000ms</str>
    <int name="autocommits">37</int>
    <int name="soft autocommits">0</int>
    <long name="optimizes">2</long>
    <long name="rollbacks">0</long>
    <long name="expungeDeletes">0</long>
    <long name="docsPending">0</long>
    <long name="adds">0</long>
    <long name="deletesById">0</long>
    <long name="deletesByQuery">0</long>
    <long name="errors">0</long>
    <long name="cumulative_adds">211453</long>
    <long name="cumulative_deletesById">0</long>
    <long name="cumulative_deletesByQuery">0</long>
    <long name="cumulative_errors">0</long>
    </lst></lst></lst>
  
   I think that there is no information about what I look for?
  
   2013/7/18 Shawn Heisey s...@elyograg.org
  
On 7/17/2013 8:06 AM, Furkan KAMACI wrote:
 I have crawled some web pages and indexed them at my SolrCloud(Solr
4.2.1).
 However before I index them there was already some indexes. I can
calculate
 the difference between current and previous document count. However
  it
 doesn't mean that I have indexed that count of documents. Because
  urls of
 websites are unique ids at my system. So it means that some of
  documents
 updated and they did not increased document count.

 My question is that: How can I learn the total count of how many
documents
 indexed and how many documents updated?
   
Look at the update handler statistics.  Your application should record
the numbers there, then you can check the handler statistics again and
note the differences.  Here's a URL that can give you those statistics.
   
http://server:port/solr/mycollectionname/admin/mbeans?stats=true
   
They are also available in the UI on the UPDATEHANDLER section of
Plugins / Stats, but you can't really use that in a program.
   
By setting the request handler path on a query object to /admin/mbeans
and setting the stats parameter, you can get this information with
  SolrJ.
   
Thanks,
Shawn
   
   
  
 
 


Re: Getting a large number of documents by id

2013-07-18 Thread Alexandre Rafalovitch
You could start by using id:(12345 23456 ...) to reduce the query length and
possibly speed up parsing.
You could also move the query from the 'q' parameter to the 'fq' parameter,
since you probably don't care about ranking ('fq' does not rank).
If these sets are unique every time, you could probably also look at not
caching the filter (a sketch of the syntax follows below).

That's all I can think of at the moment without digging deep into why you
need to do this at all.

Regards,
   Alex.

Personal website: http://www.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at
once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)


On Thu, Jul 18, 2013 at 10:46 AM, Brian Hurt bhur...@gmail.com wrote:

 I have a situation which is common in our current use case, where I need to
 get a large number (many hundreds) of documents by id.  What I'm doing
 currently is creating a large query of the form id:12345 OR id:23456 OR
 ... and sending it off.  Unfortunately, this query is taking a long time,
 especially the first time it's executed.  I'm seeing times of like 4+
 seconds for this query to return, to get 847 documents.

 So, my question is: what should I be looking at to improve the performance
 here?

 Brian



Re: Getting a large number of documents by id

2013-07-18 Thread Jack Krupansky
Solr really isn't designed for that kind of use case. If it happens to work 
well for your particular situation, great, but don't complain when you are 
well outside the normal usage for a search engine (10, 20, 50, 100 results 
paged at a time, with modest sized query strings.)


If you must get these 847 documents, do them in reasonable size batches, 
like 20, 50, or 100 at a time.


That said, there may be something else going on here, since a query for 847 
results should not take 4 seconds anyway.


Check QTime - is it 4 seconds?

Add debugQuery=true to your query and check the individual module times - 
which ones are the biggest hogs? Or, maybe it is none of them and the 
problem is elsewhere, like formatting the response, network problems, etc.


Hmmm... I wonder if the new real-time Get API would be better for your 
case. It takes a comma-separated list of document IDs (keys). Check it out:


http://wiki.apache.org/solr/RealTimeGet
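
For example, a sketch assuming the default /get handler and a core named
collection1:

http://localhost:8983/solr/collection1/get?ids=12345,23456,34567

Real-time get fetches documents directly by key (from the index or the
update log), so it skips query parsing and scoring entirely.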

-- Jack Krupansky

-Original Message- 
From: Brian Hurt

Sent: Thursday, July 18, 2013 10:46 AM
To: solr-user@lucene.apache.org
Subject: Getting a large number of documents by id

I have a situation which is common in our current use case, where I need to
get a large number (many hundreds) of documents by id.  What I'm doing
currently is creating a large query of the form id:12345 OR id:23456 OR
... and sending it off.  Unfortunately, this query is taking a long time,
especially the first time it's executed.  I'm seeing times of like 4+
seconds for this query to return, to get 847 documents.

So, my question is: what should I be looking at to improve the performance
here?

Brian 



Re: Getting a large number of documents by id

2013-07-18 Thread Michael Della Bitta
Brian,

Have you tried the realtime get handler? It supports multiple documents.

http://wiki.apache.org/solr/RealTimeGet

Michael Della Bitta

Applications Developer

o: +1 646 532 3062  | c: +1 917 477 7906

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions https://twitter.com/Appinions | g+:
plus.google.com/appinions
w: appinions.com http://www.appinions.com/


On Thu, Jul 18, 2013 at 10:46 AM, Brian Hurt bhur...@gmail.com wrote:

 I have a situation which is common in our current use case, where I need to
 get a large number (many hundreds) of documents by id.  What I'm doing
 currently is creating a large query of the form id:12345 OR id:23456 OR
 ... and sending it off.  Unfortunately, this query is taking a long time,
 especially the first time it's executed.  I'm seeing times of like 4+
 seconds for this query to return, to get 847 documents.

 So, my question is: what should I be looking at to improve the performance
 here?

 Brian



Re: Solr with Hadoop

2013-07-18 Thread Matt Lieber
Rajesh,

If you require to have an integration between Solr and Hadoop or NoSQL, I
would recommend using a commercial distribution. I think most are free to
use as long as you don't require support.
I inquired about the Cloudera Search capability, but it seems that so
far it is just preliminary: there is no tight integration yet between
Hbase and Solr, for example, other than full text search on the HDFS data
(I believe enabled in Hue). I am not too familiar with what MapR's M7 has
to offer.
However Datastax does a good job of tightly integrating Solr with
Cassandra, and lets you query over the data ingested from Solr in Hive for
example, which is pretty nice. Solr would not trigger Hadoop jobs, though.

Cheers,
Matt


On 7/17/13 7:37 PM, Rajesh Jain rjai...@gmail.com wrote:

I have a newbie question on integrating Solr with Hadoop.

There are some vendors like Cloudera/MapR who have announced Solr Search
for Hadoop.

If I use the Apache distro, how can I use Solr Search on docs in
HDFS/Hadoop

Is there a tutorial on how to use it or getting started.

I am using Flume to sink CSV docs into Hadoop/HDFS and I would like to use
Solr to provide Search.

Does Solr Search trigger MapReduce Jobs (like Splunk-Hunk) does?

Thanks,
Rajesh












Re: Getting a large number of documents by id

2013-07-18 Thread Roman Chyla
Look at the speed of reading the data - likely, it takes a long time to
assemble a big response, especially if there are many long fields - you may
want to try SSD disks, if you have that option.

Also, to gain better understanding: Start your solr, start jvisualvm and
attach to your running solr. Start sending queries and observe where the
most time is spent - it is very easy, you don't have to be a programmer to
do it.

The crucial parts (though they will show up under different names) are:

1. query parsing
2. search execution
3. response assembly

quite likely, your query is a huge boolean OR clause, that may not be as
efficient as some filter query.

Your use case is actually not at all exotic. There will soon be a JIRA
ticket that makes the scenario of sending/querying with a large number of
IDs less painful.

http://lucene.472066.n3.nabble.com/Solr-large-boolean-filter-td4070747.html#a4070964
http://lucene.472066.n3.nabble.com/ACL-implementation-Pseudo-join-performance-amp-Atomic-Updates-td4077894.html

But I would really recommend you to do the jvisualvm measurement - that's
like bringing the light into darkness.

roman


On Thu, Jul 18, 2013 at 10:46 AM, Brian Hurt bhur...@gmail.com wrote:

 I have a situation which is common in our current use case, where I need to
 get a large number (many hundreds) of documents by id.  What I'm doing
 currently is creating a large query of the form id:12345 OR id:23456 OR
 ... and sending it off.  Unfortunately, this query is taking a long time,
 especially the first time it's executed.  I'm seeing times of like 4+
 seconds for this query to return, to get 847 documents.

 So, my question is: what should I be looking at to improve the performance
 here?

 Brian



RE: Solr with Hadoop

2013-07-18 Thread Saikat Kanjilal
I'm familiar with and have used the DSE cluster, and I am in the process of 
evaluating Cloudera Search. In general, Cloudera Search has tight integration 
with HDFS and takes care of replication and sharding transparently by using 
the pre-existing HDFS replication and sharding. However, Cloudera Search 
actually uses SolrCloud underneath, so you would need to install ZooKeeper to 
enable coordination between the Solr nodes. DataStax allows you to talk to 
Solr, but their model scales around the data model and architecture of 
Cassandra; release 3.1 adds some Solr admin functionality and removes the 
need to write Cassandra-specific code.

If you go the open source route you have a few options:

1) You can build a custom plugin inside Solr that internally queries HDFS and 
returns data. You would need to figure out how to scale this, potentially 
using a solution very similar to Cloudera Search (i.e. leverage SolrCloud), 
and if using SolrCloud you would need to install ZooKeeper for node 
coordination.

2) You could create a Flume channel that accumulates specific events from 
HDFS and a sink that writes the data directly to Solr (see the sketch below).

3) I would look at Cloudera Search if you need tight integration with Hadoop; 
it might save you some time and effort.

I don't think you want Solr triggering MapReduce jobs if you're looking for
very fast throughput from your search service.
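
For option 2, here is a very rough sketch of a custom sink built on SolrJ
(the class name, config key and field names are made up, and error handling
is minimal - a starting point, not a finished implementation):

import org.apache.flume.Channel;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.Transaction;
import org.apache.flume.conf.Configurable;
import org.apache.flume.sink.AbstractSink;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class SimpleSolrSink extends AbstractSink implements Configurable {
  private HttpSolrServer solr;

  @Override
  public void configure(Context context) {
    // e.g. agent.sinks.solrSink.solrUrl = http://localhost:8983/solr/collection1
    solr = new HttpSolrServer(context.getString("solrUrl"));
  }

  @Override
  public Status process() throws EventDeliveryException {
    Channel channel = getChannel();
    Transaction txn = channel.getTransaction();
    txn.begin();
    try {
      Event event = channel.take();
      if (event == null) {          // nothing queued right now
        txn.commit();
        return Status.BACKOFF;
      }
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", event.getHeaders().get("id"));
      doc.addField("body_txt", new String(event.getBody(), "UTF-8"));
      solr.add(doc);                // visibility is handled by Solr-side autoCommit
      txn.commit();
      return Status.READY;
    } catch (Exception e) {
      txn.rollback();               // leave the event on the channel for a retry
      return Status.BACKOFF;
    } finally {
      txn.close();
    }
  }
}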


Hope this helps, ping me offline if you have more questions.
Regards

 From: mlie...@impetus.com
 To: solr-user@lucene.apache.org
 Subject: Re: Solr with Hadoop
 Date: Thu, 18 Jul 2013 15:41:36 +
 
 Rajesh,
 
 If you require to have an integration between Solr and Hadoop or NoSQL, I
 would recommend using a commercial distribution. I think most are free to
 use as long as you don't require support.
 I inquired about the Cloudera Search capability, but it seems that so
 far it is just preliminary: there is no tight integration yet between
 Hbase and Solr, for example, other than full text search on the HDFS data
 (I believe enabled in Hue). I am not too familiar with what MapR's M7 has
 to offer.
 However Datastax does a good job of tightly integrating Solr with
 Cassandra, and lets you query over the data ingested from Solr in Hive for
 example, which is pretty nice. Solr would not trigger Hadoop jobs, though.
 
 Cheers,
 Matt
 
 
 On 7/17/13 7:37 PM, Rajesh Jain rjai...@gmail.com wrote:
 
 I have a newbie question on integrating Solr with Hadoop.
 
 There are some vendors like Cloudera/MapR who have announced Solr Search
 for Hadoop.
 
 If I use the Apache distro, how can I use Solr Search on docs in
 HDFS/Hadoop
 
 Is there a tutorial on how to use it or getting started.
 
 I am using Flume to sink CSV docs into Hadoop/HDFS and I would like to use
 Solr to provide Search.
 
 Does Solr Search trigger MapReduce Jobs (like Splunk-Hunk) does?
 
 Thanks,
 Rajesh
 
 
 
 
 
 
 
 
 
 
  

Re: Getting a large number of documents by id

2013-07-18 Thread Alexandre Rafalovitch
And I guess, if only a subset of fields is being requested but there are
other large fields present, there could be the cost of loading those extra
fields into memory before discarding them. In which case,
using enableLazyFieldLoading may help.
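
That's the enableLazyFieldLoading flag in the query section of
solrconfig.xml (the example config ships with it set to true):

<query>
  <enableLazyFieldLoading>true</enableLazyFieldLoading>
</query>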

Regards,
   Alex.

Personal website: http://www.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at
once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)


On Thu, Jul 18, 2013 at 11:47 AM, Roman Chyla roman.ch...@gmail.com wrote:

  Look at the speed of reading the data - likely, it takes a long time to
  assemble a big response, especially if there are many long fields - you may
  want to try SSD disks, if you have that option.
 
  Also, to gain better understanding: Start your solr, start jvisualvm and
  attach to your running solr. Start sending queries and observe where the
  most time is spent - it is very easy, you don't have to be a programmer to
  do it.
 
  The crucial parts (though they will show up under different names) are:
 
  1. query parsing
  2. search execution
  3. response assembly
 
  quite likely, your query is a huge boolean OR clause, that may not be as
  efficient as some filter query.
 
  Your use case is actually not at all exotic. There will soon be a JIRA
  ticket that makes the scenario of sending/querying with a large number of
  IDs less painful.


 http://lucene.472066.n3.nabble.com/Solr-large-boolean-filter-td4070747.html#a4070964

 http://lucene.472066.n3.nabble.com/ACL-implementation-Pseudo-join-performance-amp-Atomic-Updates-td4077894.html

 But I would really recommend you to do the jvisualvm measurement - that's
 like bringing the light into darkness.

 roman


 On Thu, Jul 18, 2013 at 10:46 AM, Brian Hurt bhur...@gmail.com wrote:

  I have a situation which is common in our current use case, where I need
 to
  get a large number (many hundreds) of documents by id.  What I'm doing
  currently is creating a large query of the form id:12345 OR id:23456 OR
  ... and sending it off.  Unfortunately, this query is taking a long
 time,
  especially the first time it's executed.  I'm seeing times of like 4+
  seconds for this query to return, to get 847 documents.
 
  So, my question is: what should I be looking at to improve the
 performance
  here?
 
  Brian
 



XInclude and Document Entity not working on schema.xml

2013-07-18 Thread Elodie Sannier

Hello,

I am using the solr nightly version 4.5-2013-07-18_06-04-44 and I want
to use a Document Entity in schema.xml; I get this exception:
java.lang.RuntimeException: schema fieldtype
string(org.apache.solr.schema.StrField) invalid
arguments:{xml:base=solrres:/commonschema_types.xml}
at org.apache.solr.schema.FieldType.setArgs(FieldType.java:187)
at
org.apache.solr.schema.FieldTypePluginLoader.init(FieldTypePluginLoader.java:141)
at
org.apache.solr.schema.FieldTypePluginLoader.init(FieldTypePluginLoader.java:43)
at
org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:190)
... 16 more

schema.xml:
<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE schema [
<!ENTITY commonschema_types SYSTEM "commonschema_types.xml">
]>
<schema name="searchSolrSchema" version="1.5">
  <types>
    <!-- Stuff -->
    &commonschema_types;
  </types>
  <!-- Stuff -->
</schema>

commonschema_types.xml:
<?xml version="1.0" encoding="UTF-8" ?>
<fieldType name="string" class="solr.StrField"
    sortMissingLast="true" omitNorms="true"/>
<fieldType name="long" class="solr.TrieLongField" precisionStep="0"
    positionIncrementGap="0"/>
<!-- Stuff -->

The same error appears in this bug (fixed ?):
https://issues.apache.org/jira/browse/SOLR-3087

It works with solr-4.2.1.

//-

I also tried to use the XML XInclude mechanism
(http://en.wikipedia.org/wiki/XInclude) to include parts of schema.xml.

When I try to include a fieldType, I get this exception :
org.apache.solr.common.SolrException: Unknown fieldType 'long' specified
on field _version_
at org.apache.solr.schema.IndexSchema.loadFields(IndexSchema.java:644)
at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:470)
at org.apache.solr.schema.IndexSchema.init(IndexSchema.java:164)
at
org.apache.solr.schema.IndexSchemaFactory.create(IndexSchemaFactory.java:55)
at
org.apache.solr.schema.IndexSchemaFactory.buildIndexSchema(IndexSchemaFactory.java:69)
at org.apache.solr.core.ZkContainer.createFromZk(ZkContainer.java:267)
at org.apache.solr.core.CoreContainer.create(CoreContainer.java:622)
... 10 more

The type is not found.

I include 'schema_integration.xml' like this in 'schema.xml' :
<?xml version="1.0" encoding="UTF-8" ?>
<schema name="default" version="1.5">
  <types>
    <!-- Stuff -->
    <xi:include href="commonschema_types.xml"
        xmlns:xi="http://www.w3.org/2001/XInclude"/>
  </types>
  <!-- Stuff -->
  <fields>
    <field name="_version_" type="long" indexed="true" stored="true"
        multiValued="false"/>
    <!-- Stuff -->
  </fields>
</schema>

Is this a bug in the nightly version?

Elodie Sannier

Kelkoo SAS
Société par Actions Simplifiée
Au capital de € 4.168.964,30
Siège social : 8, rue du Sentier 75002 Paris
425 093 069 RCS Paris

Ce message et les pièces jointes sont confidentiels et établis à l'attention 
exclusive de leurs destinataires. Si vous n'êtes pas le destinataire de ce 
message, merci de le détruire et d'en avertir l'expéditeur.


Re: solr autodetectparser tikaconfig dataimporter error

2013-07-18 Thread Andreas Owen
I have now changed some things and the import runs without error. In schema.xml 
I don't have the field "text" but "contentsExact". Unfortunately the text (from 
the file) isn't indexed, even though I mapped it to the proper field. What am I 
doing wrong?

data-config.xml:

<dataConfig>
  <dataSource type="BinFileDataSource" name="data"/>
  <dataSource type="BinURLDataSource" name="dataUrl"/>
  <dataSource type="URLDataSource"
      baseUrl="http://127.0.0.1/tkb/internet/" name="main"/>
  <document>
    <entity name="rec" processor="XPathEntityProcessor" url="docImport.xml"
        forEach="/albums/album" dataSource="main">
      <!-- transformer="script:GenerateId" -->
      <field column="title" xpath="//title" />
      <field column="id" xpath="//file" />
      <field column="path" xpath="//path" />
      <field column="Author" xpath="//author" />

      <!-- <field column="tstamp">2013-07-05T14:59:46.889Z</field> -->

      <entity name="f" processor="FileListEntityProcessor"
          baseDir="C:\web\development\tkb\internet\public" fileName="${rec.id}"
          dataSource="data" onError="skip">
        <entity name="tika" processor="TikaEntityProcessor"
            url="${f.fileAbsolutePath}">
          <field column="text" name="contentsExact" />
        </entity>
      </entity>
    </entity>
  </document>
</dataConfig>

I noticed that when I move the field "author" into the tika entity it isn't 
indexed. Could this have something to do with why the text from the file isn't 
indexed? Do I have to do something special about the entity levels in 
<document>?

PS: how do I import tstamp? It's a static value.




On 14. Jul 2013, at 10:30 PM, Jack Krupansky wrote:

 Caused by: java.lang.NoSuchMethodError:
 
 That means you have some out of date jars or some newer jars mixed in with 
 the old ones.
 
 -- Jack Krupansky
 
 -Original Message- From: Andreas Owen
 Sent: Sunday, July 14, 2013 3:07 PM
 To: solr-user@lucene.apache.org
 Subject: Re: solr autodetectparser tikaconfig dataimporter error
 
 hi
 
 is there nowone with a idea what this error is or even give me a pointer 
 where to look? If not is there a alternitave way to import documents from a 
 xml-file with meta-data and the filename to parse?
 
 thanks for any help.
 
 
 On 12. Jul 2013, at 10:38 PM, Andreas Owen wrote:
 
 i am using solr 3.5, tika-app-1.4 and tagcloud 1.2.1. when i try to import a
 file via xml i get this error, it doesn't matter what file format i try to
 index, txt, cfm, pdf, all the same error:
 
 SEVERE: Exception while processing: rec document :
 SolrInputDocument[{id=id(1.0)={myTest.txt},
 title=title(1.0)={Beratungsseminar kundenbrief},
 contents=contents(1.0)={wie kommuniziert man},
 author=author(1.0)={Peter Z.},
 path=path(1.0)={download/online}}]:org.apache.solr.handler.dataimport.DataImportHandlerException:
 java.lang.NoSuchMethodError:
 org.apache.tika.parser.AutoDetectParser.setConfig(Lorg/apache/tika/config/TikaConfig;)V
 at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:669)
 at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:622)
 at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:268)
 at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:187)
 at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:359)
 at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:427)
 at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:408)
 Caused by: java.lang.NoSuchMethodError:
 org.apache.tika.parser.AutoDetectParser.setConfig(Lorg/apache/tika/config/TikaConfig;)V
 at org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:122)
 at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:238)
 at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:596)
 ... 6 more
 
 Jul 11, 2013 5:23:36 PM org.apache.solr.common.SolrException log
 SEVERE: Full Import
 failed:org.apache.solr.handler.dataimport.DataImportHandlerException:
 java.lang.NoSuchMethodError:
 org.apache.tika.parser.AutoDetectParser.setConfig(Lorg/apache/tika/config/TikaConfig;)V
 at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:669)
 at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:622)
 at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:268)
 at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:187)
 at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:359)
 at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:427)
 at
 

Luke's analysis of Trie Dates

2013-07-18 Thread JohnRodey
I have a TrieDateField dynamic field setup in my schema, pretty standard...

  <dynamicField name="*_tdt" type="tdate" indexed="true" stored="false"/>

  <fieldType name="tdate" class="solr.TrieDateField" omitNorms="true"
      precisionStep="6" positionIncrementGap="0"/>

In my code I only set one field, creation_tdt and I round it to the
nearest second before storing it.  However when I analyze it with Luke I
get:

<lst name="fields">
  <lst name="creation_tdt">
    <str name="type">tdate</str>
    <str name="schema">IT--OF--</str>
    <str name="dynamicBase">*_tdt</str>
    <str name="index">(unstored field)</str>
    <int name="docs">22404</int>
    <int name="distinct">-1</int>
    <lst name="topTerms">
      <int name="2013-07-18T13:37:33.696Z">22404</int>
      <int name="1970-01-01T00:00:00Z">22404</int>
      <int name="1970-01-01T00:00:00Z">22404</int>
      <int name="2013-07-08T20:36:32.896Z">22404</int>
      <int name="1970-01-01T00:00:00Z">22404</int>
      <int name="2011-05-17T22:07:37.984Z">22404</int>
      <int name="1970-01-01T00:00:00Z">22404</int>
      <int name="2013-07-18T15:09:18.72Z">16014</int>
      <int name="2013-07-18T15:04:56.576Z">6390</int>
      <int name="2013-07-18T15:09:10.528Z">1535</int>
      <int name="2013-07-18T15:09:55.584Z">1459</int>
      <int name="2013-07-18T15:09:14.624Z">1268</int>
      <int name="2013-07-18T15:09:06.432Z">1193</int>
      <int name="2013-07-18T15:09:18.72Z">1187</int>
      <int name="2013-07-18T15:09:51.488Z">1152</int>
      <int name="2013-07-18T15:09:59.68Z">1129</int>
      <int name="2013-07-18T15:09:02.336Z">1089</int>
      ...


So my question is: where are all these entries coming from? They are not
the dates I specified, because they have millis, and my field isn't
multivalued, so the term counts don't add up (how could I have more than
22404 terms if I only have 22404 documents?). Why multiple
1970-01-01T00:00:00Z entries?

Is this somehow related to Trie fields and how they are indexed?

Thanks!



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Luke-s-analysis-of-Trie-Dates-tp4078885.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Luke's analysis of Trie Dates

2013-07-18 Thread Yonik Seeley
On Thu, Jul 18, 2013 at 12:53 PM, JohnRodey timothydd...@yahoo.com wrote:
 I have a TrieDateField dynamic field setup in my schema, pretty standard...

    <dynamicField name="*_tdt" type="tdate" indexed="true" stored="false"/>

    <fieldType name="tdate" class="solr.TrieDateField" omitNorms="true"
        precisionStep="6" positionIncrementGap="0"/>

 In my code I only set one field, creation_tdt and I round it to the
 nearest second before storing it.  However when I analyze it with Luke I
 get:

  <lst name="fields">
    <lst name="creation_tdt">
      <str name="type">tdate</str>
      <str name="schema">IT--OF--</str>
      <str name="dynamicBase">*_tdt</str>
      <str name="index">(unstored field)</str>
      <int name="docs">22404</int>
      <int name="distinct">-1</int>
      <lst name="topTerms">
        <int name="2013-07-18T13:37:33.696Z">22404</int>
        <int name="1970-01-01T00:00:00Z">22404</int>
        <int name="1970-01-01T00:00:00Z">22404</int>
        <int name="2013-07-08T20:36:32.896Z">22404</int>
        <int name="1970-01-01T00:00:00Z">22404</int>
        <int name="2011-05-17T22:07:37.984Z">22404</int>
        <int name="1970-01-01T00:00:00Z">22404</int>
        <int name="2013-07-18T15:09:18.72Z">16014</int>
        <int name="2013-07-18T15:04:56.576Z">6390</int>
        <int name="2013-07-18T15:09:10.528Z">1535</int>
        <int name="2013-07-18T15:09:55.584Z">1459</int>
        <int name="2013-07-18T15:09:14.624Z">1268</int>
        <int name="2013-07-18T15:09:06.432Z">1193</int>
        <int name="2013-07-18T15:09:18.72Z">1187</int>
        <int name="2013-07-18T15:09:51.488Z">1152</int>
        <int name="2013-07-18T15:09:59.68Z">1129</int>
        <int name="2013-07-18T15:09:02.336Z">1089</int>
        ...


  So my question is: where are all these entries coming from? They are not
  the dates I specified, because they have millis, and my field isn't
  multivalued, so the term counts don't add up (how could I have more than
  22404 terms if I only have 22404 documents?). Why multiple
  1970-01-01T00:00:00Z entries?

 Is this somehow related to Trie fields and how they are indexed?

Yes, it's due to how trie fields are indexed (can have multiple
indexed tokens per logical value to speed up range queries).
If you want counts of values (as opposed to tokens), use faceting.
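
For example (default host/port and core name assumed):

http://localhost:8983/solr/collection1/select?q=*:*&rows=0&facet=true&facet.field=creation_tdt&facet.limit=10

Faceting counts documents per value, so the extra precision-step tokens
that Luke reports as terms won't show up in the facet counts.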

-Yonik
http://lucidworks.com


Re: JVM Crashed - SOLR deployed in Tomcat

2013-07-18 Thread neoman
Thanks for your reply. Yes, it worked. No more crashes after switching to
1.6.0_30



--
View this message in context: 
http://lucene.472066.n3.nabble.com/JVM-Crashed-SOLR-deployed-in-Tomcat-tp4078439p4078906.html
Sent from the Solr - User mailing list archive at Nabble.com.


Indexing into SolrCloud

2013-07-18 Thread Beale, Jim (US-KOP)
Hey folks,

I've been migrating an application which indexes about 15M documents from 
straight-up Lucene into SolrCloud.  We've set up 5 Solr instances with a 3 
zookeeper ensemble using HAProxy for load balancing. The documents are 
processed on a quad core machine with 6 threads and indexed into SolrCloud 
through HAProxy using ConcurrentUpdateSolrServer in order to batch the updates. 
 The indexing box is heavily-loaded during indexing but I don't think it is so 
bad that it would cause issues.

I'm using Solr 4.3.1 on client and server side, zookeeper 3.4.5 and HAProxy 
1.4.22.

I've been accepting the default HttpClient with 50K buffered docs and 2 
threads, i.e.,

int solrMaxBufferedDocs = 5;
int solrThreadCount = 2;
solrServer = new ConcurrentUpdateSolrServer(solrHttpIPAddress, 
solrMaxBufferedDocs, solrThreadCount);

autoCommit is configured in the solrconfig as follows:

 <autoCommit>
   <maxTime>60</maxTime>
   <maxDocs>50</maxDocs>
   <openSearcher>false</openSearcher>
 </autoCommit>

I'm getting the following errors on the client and server sides respectively:

Client side:

2013-07-16 19:02:47,002 [concurrentUpdateScheduler-1-thread-4] INFO  
SystemDefaultHttpClient - I/O exception (java.net.SocketException) caught when 
processing request: Software caused connection abort: socket write error
2013-07-16 19:02:47,002 [concurrentUpdateScheduler-1-thread-4] INFO  
SystemDefaultHttpClient - Retrying request
2013-07-16 19:02:47,002 [concurrentUpdateScheduler-1-thread-5] INFO  
SystemDefaultHttpClient - I/O exception (java.net.SocketException) caught when 
processing request: Software caused connection abort: socket write error
2013-07-16 19:02:47,002 [concurrentUpdateScheduler-1-thread-5] INFO  
SystemDefaultHttpClient - Retrying request

Server side:

7988753 [qtp1956653918-23] ERROR org.apache.solr.core.SolrCore – 
java.lang.RuntimeException: [was class org.eclipse.jetty.io.EofException] early 
EOF
at 
com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18)
at com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:731)
at 
com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3657)
at com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:809)
at org.apache.solr.handler.loader.XMLLoader.readDoc(XMLLoader.java:393)

When I disabled autoCommit on the server side, I didn't see any errors there 
but I still get the issue client-side after about 2 million documents - which 
is about 45 minutes.

Has anyone seen this issue before?  I couldn't find anything useful on the 
usual places.

I suppose I could setup wireshark to see what is happening but I'm hoping that 
someone has a better suggestion.

Thanks in advance for any help!


Best regards,
Jim Beale

hibu.com
2201 Renaissance Boulevard, King of Prussia, PA, 19406
Office: 610-879-3864
Mobile: 610-220-3067



Re: Getting a large number of documents by id

2013-07-18 Thread Brian Hurt
Thanks everyone for the response.

On Thu, Jul 18, 2013 at 11:22 AM, Alexandre Rafalovitch
arafa...@gmail.comwrote:

 You could start from doing id:(12345 23456) to reduce the query length and
 possibly speed up parsing.


I didn't know about this syntax- it looks useful.


 You could also move the query from 'q' parameter to 'fq' parameter, since
 you probably don't care about ranking ('fq' does not rank).


Yes, I don't care about rank, so this helps.


 If these are unique every time, you could probably look at not caching
 (can't remember exact syntax).


 That's all I can think of at the moment without digging deep into why you
 need to do this at all.


Short version of a long story: I'm implementing a graph database on top of
solr, which is not what solr is designed for, I know.  This is a case
where I'm following a set of edges from a given node to its 847 children,
and I need to get the children.  And yes, I've looked at neo4j - it doesn't
help.



 Regards,
Alex.

 Personal website: http://www.outerthoughts.com/
 LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
 - Time is the quality of nature that keeps events from happening all at
 once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)


 On Thu, Jul 18, 2013 at 10:46 AM, Brian Hurt bhur...@gmail.com wrote:

  I have a situation which is common in our current use case, where I need
 to
  get a large number (many hundreds) of documents by id.  What I'm doing
  currently is creating a large query of the form id:12345 OR id:23456 OR
  ... and sending it off.  Unfortunately, this query is taking a long
 time,
  especially the first time it's executed.  I'm seeing times of like 4+
  seconds for this query to return, to get 847 documents.
 
  So, my question is: what should I be looking at to improve the
 performance
  here?
 
  Brian
 



Auto-sharding and numShard parameter

2013-07-18 Thread Flavio Pompermaier
Hi to all,
Probably this question has a simple answer but I just want to be sure of
the potential drawbacks. When I run SolrCloud, I run the main Solr instance
with the -DnumShards option (e.g. 2).
Then as data grows, the shards could potentially become a huge number. If I
had to restart all nodes and re-run the master with numShards=2, what would
happen? Would it just be ignored, or would Solr try to reduce the number of
shards...?

Another question: in SolrCloud, how do I restart the whole cloud at once? Is
that possible?

Best,
Flavio


Need ideas to perform historical search

2013-07-18 Thread SolrLover

I am trying to implement Historical search using SOLR.

Ex:

If I search on address 800 5th Ave and provide a time range, it should list
the name of the person who was living at the address during the time period.
I am trying to figure out a way to store the data without redundancy.

I can do a join in the database to return the names of everyone who was living
at a particular address during a particular time, but I know it's difficult to
do that in SOLR and SOLR is not a database (it works best when the data is
denormalized)...

Is there any other way / idea by which I can reduce the redundancy of
creating multiple records for a particular person again and again?







--
View this message in context: 
http://lucene.472066.n3.nabble.com/Need-ideas-to-perform-historical-search-tp4078980.html
Sent from the Solr - User mailing list archive at Nabble.com.


Spellcheck questions

2013-07-18 Thread smanad
I am exploring the various spell checkers in Solr and have a few questions:
1. Which algorithm is used for generating suggestions when using
IndexBasedSpellChecker? I know it's Levenshtein (with edit distance=2 by
default) in DirectSolrSpellChecker.
2. If i have 2 indices, can I setup multiple IndexBasedSpellCheckers to
point to different spellcheck dictionaries to generate suggestions from
both.
3. Can I use IndexBasedSpellChecker and FileBasedSpellChecker together? I
tried doing it and ran into an exception All checkers need to use the same
StringDistance.

Any help will be much appreciated.
Thanks, 
-Manasi



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Spellcheck-questions-tp4078985.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Spellcheck questions

2013-07-18 Thread SolrLover
Check the link below for more info on the IndexBasedSpellChecker:

http://searchhub.org/2010/08/31/getting-started-spell-checking-with-apache-lucene-and-solr/



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Spellcheck-questions-tp4078985p4079000.html
Sent from the Solr - User mailing list archive at Nabble.com.


additional requests sent to solr

2013-07-18 Thread alxsss
Hello,

I sent to solr (to server1 in the cluster of two servers) the following request:

http://server1:8983/solr/mycollection/select?q=alex&wt=xml&defType=edismax&facet.field=school&facet.field=company&facet=true&facet.limit=10&facet.mincount=1&qf=school_txt+company_txt+name&shards=server1:8983/solr/mycollection,server2.com:8983/solr/mycollection

I see in the logs 2 additional requests

INFO: [mycollection] webapp=/solr path=/select 
params={facet=true&f.company.facet.limit=25&qf=school_txt+company_txt+name&distrib=false&wt=javabin&version=2&rows=10&defType=edismax&f.school_facet.facet.limit=25&NOW=1374191542130&shard.url=server1:8983/solr/mycollection&fl=id,score&start=0&q=alex&facet.field=school&facet.field=company&isShard=true&fsv=true}
 hits=9118 status=0 QTime=72

Jul 18, 2013 4:52:22 PM org.apache.solr.core.SolrCore execute
INFO: [mycollection] webapp=/solr path=/select 
params={facet=true&facet.mincount=1&company__terms=Google&ids=957642543183429632,957841245982425088,67612781366,56659036467,50875569066,957707339232706560,465078975511&facet.limit=10&qf=school_txt+company_txt+name&distrib=false&wt=javabin&version=2&rows=10&defType=edismax&NOW=1374191542130&shard.url=server1:8983/solr/mycollection&school__terms=Michigan+State+University,Brigham+Young+University,Northeastern+University&q=alex&facet.field={!terms%3D$school__terms}school&facet.field={!terms%3D$company__terms}company&isShard=true}
 status=0 QTime=6

Jul 18, 2013 4:52:22 PM org.apache.solr.core.SolrCore execute
INFO: [mycollection] webapp=/solr path=/select 
params={facet=true&shards=server1.prod.mylife.com:8983/solr/mycollection,server2:8983/solr/mycollection&facet.mincount=1&q=alex&facet.limit=10&qf=school_txt+company_txt+name&facet.field=school&facet.field=company&wt=xml&defType=edismax}
 hits=97262 status=0 QTime=168


I can understand that the first and the third log records are related to the 
above request, but I cannot understand where the second one comes from. 
I see in it company__terms and 
{!terms%3D$school__terms}school&facet.field={!terms%3D$company__terms}company, which 
does not seem to have anything to do with the initial request. This is solr-4.2.0.


Any ideas about it are welcome.

Thanks in advance.
Alex.


Solr 4.3 open a lot more files than solr 3.6

2013-07-18 Thread Zhang, Lisheng
Hi,
 
After upgrading solr from 3.6 to 4.3, we found that solr opens a lot more
files compared to solr 3.6 (when a core is open). Since we have many cores
(more than 2K and still growing), we would like to reduce the number of open
files.
 
We already use shareSchema and sharedLib, we share the SolrConfig across all
cores, and we have commented out autoSoftCommit in solrconfig.xml.
 
In solr 3.6, it seems that the IndexWriter was opened only when an indexing
request came in and was immediately closed after the request was done, but in
solr 4.3 the IndexWriter is kept open. Is there an easy way to go back to the
3.6 behavior (we do not need Near Real Time search)? Can we change the code to
disable keeping the IndexWriter open (if there is no better way)?
 
Any guidance on reducing open files would be very helpful.
 
Thanks very much for your help, Lisheng


Re: add to ContributorsGroup - Instructions for setting up SolrCloud on jboss

2013-07-18 Thread Erick Erickson
Thank you for adding to the wiki! It's always appreciated...

On Wed, Jul 17, 2013 at 5:18 PM, Ali, Saqib docbook@gmail.com wrote:
 Thanks Erick!

 I have added the instructions for running SolrCloud on Jboss:
 http://wiki.apache.org/solr/SolrCloud%20using%20Jboss

 I will refine the instructions further, and also post some screenshots.

 Thanks.


 On Sun, Jul 14, 2013 at 5:05 AM, Erick Erickson 
 erickerick...@gmail.comwrote:

 Done, sorry it took so long, hadn't looked at the list in a couple of days.


 Erick

 On Fri, Jul 12, 2013 at 5:46 PM, Ali, Saqib docbook@gmail.com wrote:
  username: saqib
 
 
  On Fri, Jul 12, 2013 at 2:35 PM, Ali, Saqib docbook@gmail.com
 wrote:
 
  Hello,
 
  Can you please add me to the ContributorsGroup? I would like to add
  instructions for setting up SolrCloud using Jboss.
 
  thanks.
 
 



Re: Need ideas to perform historical search

2013-07-18 Thread Alexandre Rafalovitch
Why do you care about redundancy? That's the search engine's architectural
tradeoff (as far as I understand). And, the tokens are all normalized under
the covers, so it does not take as much space as you expect.

Specifically regarding your issue, maybe you should store 'occupancy' as
the record. That's similar to what they do at Gilt:
http://www.slideshare.net/trenaman/personalized-search-on-the-largest-flash-sale-site-in-america(slide
36+)

The other option is to use location as spans with some clever queries:
http://wiki.apache.org/solr/SpatialForTimeDurations (follow the links).
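
As a rough sketch of the occupancy idea (all field names here are invented,
not from your schema), each stay becomes its own small document:

<add>
  <doc>
    <field name="id">occupancy-1</field>
    <field name="person_name">Jane Smith</field>
    <field name="address">800 5th Ave</field>
    <field name="movein_dt">2005-03-01T00:00:00Z</field>
    <field name="moveout_dt">2009-06-30T00:00:00Z</field>
  </doc>
</add>

Then "who lived here during [start, end]" becomes a filter like
fq=address:"800 5th Ave" AND movein_dt:[* TO end] AND moveout_dt:[start TO *],
and the person's full profile can live in a single separate record.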

Regards,
   Alex.

Personal website: http://www.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at
once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)


On Thu, Jul 18, 2013 at 5:58 PM, SolrLover bbar...@gmail.com wrote:


 I am trying to implement Historical search using SOLR.

 Ex:

 If I search on address 800 5th Ave and provide a time range, it should list
 the name of the person who was living at the address during the time
 period.
 I am trying to figure out a way to store the data without redundancy.

 I can do a join in the database to return the names of everyone who was
 living at a particular address during a particular time, but I know it's
 difficult to do that in SOLR and SOLR is not a database (it works best when
 the data is denormalized)...

 Is there any other way / idea by which I can reduce the redundancy of
 creating multiple records for a particular person again and again?







 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Need-ideas-to-perform-historical-search-tp4078980.html
 Sent from the Solr - User mailing list archive at Nabble.com.