Re: Updating a single field in a Solr document

2010-01-19 Thread Raghuveer Kancherla
Is this feature planned for any future release? I ask because it will help
me plan my system architecture accordingly.

Thanks,
Raghu



On Tue, Jan 19, 2010 at 7:28 PM, Shalin Shekhar Mangar 
shalinman...@gmail.com wrote:

 On Mon, Jan 18, 2010 at 5:11 PM, Raghuveer Kancherla 
 raghuveer.kanche...@aplopio.com wrote:

  Hi,
  I have 2 fields: one captures the category of the document, and the other
  holds the pre-processed text of the document. The text of the document is
  fairly large.
  The category of the document changes often while the text remains the same.
  Search happens on both fields.
 
  The problem is, I have to index both the text and the category each time
  the category changes. The text being large obviously makes this suboptimal.
  Is there a patch or a tricky way to avoid indexing the text field every
  time?
 
 
 Sure: make the text field stored, read the old document, and create the
 new one. Sorry, there is no way to update an indexed document in place in
 Solr (yet).
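
 For example, a rough SolrJ sketch of this read-and-reindex approach (the
 URL, id, and field names are only illustrative, and the text field must be
 stored for this to work):

   import org.apache.solr.client.solrj.SolrQuery;
   import org.apache.solr.client.solrj.SolrServer;
   import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
   import org.apache.solr.common.SolrDocument;
   import org.apache.solr.common.SolrInputDocument;

   SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

   // Fetch the stored fields of the existing document.
   SolrDocument old =
       server.query(new SolrQuery("id:doc1")).getResults().get(0);

   // Re-create the whole document; only the category actually changed.
   SolrInputDocument doc = new SolrInputDocument();
   doc.addField("id", old.getFieldValue("id"));
   doc.addField("text", old.getFieldValue("text"));
   doc.addField("category", "new-category");
   server.add(doc);
   server.commit();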

 --
 Regards,
 Shalin Shekhar Mangar.



Updating a single field in a Solr document

2010-01-18 Thread Raghuveer Kancherla
Hi,
I have 2 fields: one captures the category of the document, and the other
holds the pre-processed text of the document. The text of the document is
fairly large.
The category of the document changes often while the text remains the same.
Search happens on both fields.

The problem is, I have to index both the text and the category each time the
category changes. The text being large obviously makes this suboptimal. Is
there a patch or a tricky way to avoid indexing the text field every time?

Thanks,
Raghu


Re: Configuring Solr to use RAMDirectory

2010-01-02 Thread Raghuveer Kancherla
Hi Dipti,
Just out of curiosity, are you trying to use RAMDirectory for improvement in
speed? I tried doing that and did not see any significant improvement. Would
be nice to know what your experiment shows.

- Raghu


 On Thu, Dec 31, 2009 at 4:17 PM, Erik Hatcher erik.hatc...@gmail.com wrote:

 It's possible, but requires a custom DirectoryFactory implementation.
  There isn't a built-in factory to construct a RAMDirectory.  You wire it
 into solrconfig.xml this way:

   <directoryFactory name="DirectoryFactory"
                     class="[fully.qualified.classname]">
     <!-- Parameters as required by the implementation -->
   </directoryFactory>
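
 A minimal sketch of such a factory, assuming the SOLR-465 DirectoryFactory
 contract in Solr 1.4 (a single open(path) method); illustrative only, not a
 tested implementation:

   import java.io.IOException;
   import org.apache.lucene.store.Directory;
   import org.apache.lucene.store.RAMDirectory;
   import org.apache.solr.core.DirectoryFactory;

   public class RAMDirectoryFactory extends DirectoryFactory {
     @Override
     public Directory open(String path) throws IOException {
       // Ignores the on-disk path: a fresh, empty RAMDirectory is returned,
       // so the index has to be rebuilt from scratch on every startup.
       return new RAMDirectory();
     }
   }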



 On Dec 31, 2009, at 5:06 AM, dipti khullar wrote:

  Hi

 Can somebody let me know if it's possible to configure RAMDirectory from
 solrconfig.xml? Although it's clearly mentioned in
 https://issues.apache.org/jira/browse/SOLR-465 by Mark that he has worked
 on it, I couldn't find any such property in the config file in the Solr 1.4
 latest download.
 Maybe I am overlooking some simple property. Any help would be
 appreciated.


 Thanks
 Dipti

 On Fri, Nov 20, 2009 at 2:27 PM, Andrey Klochkov akloch...@griddynamics.com wrote:


  I thought that SOLR-465 just does what is asked, i.e. one can use any
 Directory implementation including RAMDirectory. Thomas, take a look at
 it.

 On Thu, Nov 12, 2009 at 7:55 AM, Otis Gospodnetic 
 otis_gospodne...@yahoo.com wrote:

  I think not out of the box, but look at SOLR-243 issue in JIRA.

 You could also put your index on a RAM disk (tmpfs), but it would be
 useless for writing to it.

 Note that when people ask about loading the whole index in memory
 explicitly, it's often a premature optimization attempt.

 Otis
 --
 Sematext is hiring -- http://sematext.com/about/jobs.html?mls
 Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR



 - Original Message 

 From: Thomas Nguyen thngu...@ign.com
 To: solr-user@lucene.apache.org
 Sent: Wed, November 11, 2009 8:46:11 PM
 Subject: Configuring Solr to use RAMDirectory

 Is it possible to configure Solr to fully load indexes in memory?  I
 wasn't able to find any documentation about this on either their site or
 in the Solr 1.4 Enterprise Search Server book.





 --
 Andrew Klochkov
 Senior Software Engineer,
 Grid Dynamics





Re: Multi Solr

2009-12-21 Thread Raghuveer Kancherla
Based on your need, you can choose one of the options listed at
http://wiki.apache.org/solr/MultipleIndexes


- Raghu


On Tue, Dec 22, 2009 at 10:46 AM, Olala hthie...@gmail.com wrote:


 Hi all!

 I have deployed Solr on Tomcat, but now I want to run multiple Solr
 instances on a single Tomcat server. Can that be done or not?
 --
 View this message in context:
 http://old.nabble.com/Multi-Solr-tp26884086p26884086.html
 Sent from the Solr - User mailing list archive at Nabble.com.




Re: payload queries running slow

2009-12-20 Thread Raghuveer Kancherla
Hi Grant,
My queries are about 5 times slower when using payloads as compared to
queries that don't use payloads on the same index. I have not done any
profiling yet; I am trying out Lucid Gaze now.
I do all the load testing after warming up.
Since my index is small (~1 GB), I was wondering if a RAMDirectory would
help instead of the default Directory implementation for the IndexReader?

Thanks,
Raghu



On Thu, Dec 17, 2009 at 6:58 PM, Grant Ingersoll gsing...@apache.org wrote:


 On Dec 17, 2009, at 4:52 AM, Raghuveer Kancherla wrote:

  Hi,
  With help from the group here, I have been able to set up a search
  application with payloads enabled. However, there is a noticeable increase
  in query response times with payloads as compared to the same queries
  without payloads. I am also seeing a lot more disk IO (I have a 7200 rpm
  disk) and comparatively less CPU usage.
 
  I am guessing this is because of the use of PayloadTermQuery and
  PayloadNearQuery, both of which are SpanQuery subclasses. SpanQueries read
  the positions index, which is much larger than the index accessed by a
  simple TermQuery.
 
  Is there any way of making this system faster without having to distribute
  the index? My index size is hardly 1 GB (~200k documents and only one
  field to search in). I am experiencing query times as high as 2 seconds
  (average).
 
  Any indications on the direction in which I can experiment would also be
  very helpful.
 

 Yeah, payloads are going to be slower, but how much slower are they for
 you? Are you warming up those queries?

 Also, have you done any profiling?


  I looked at the HathiTrust digital library articles. The methods indicated
  there talk about avoiding reading the positions index (converting
  PhraseQueries to TermQueries). That will not work in my case because I
  still have to read the positions index to get the payload information
  during scoring. Let me know if my understanding is incorrect.
 
 
  Thanks,
  -Raghu

 --
 Grant Ingersoll
 http://www.lucidimagination.com/

 Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using
 Solr/Lucene:
 http://www.lucidimagination.com/search




payload queries running slow

2009-12-17 Thread Raghuveer Kancherla
Hi,
With help from the group here, I have been able to set up a search
application with payloads enabled. However, there is a noticeable increase
in query response times with payloads as compared to the same queries
without payloads. I am also seeing a lot more disk IO (I have a 7200 rpm
disk) and comparatively less CPU usage.

I am guessing this is because of the use of PayloadTermQuery and
PayloadNearQuery, both of which are SpanQuery subclasses. SpanQueries read
the positions index, which is much larger than the index accessed by a
simple TermQuery.

Is there any way of making this system faster without having to distribute
the index? My index size is hardly 1 GB (~200k documents and only one field
to search in). I am experiencing query times as high as 2 seconds (average).

Any indications on the direction in which I can experiment would also be
very helpful.

I looked at the HathiTrust digital library articles. The methods indicated
there talk about avoiding reading the positions index (converting
PhraseQueries to TermQueries). That will not work in my case because I still
have to read the positions index to get the payload information during
scoring. Let me know if my understanding is incorrect.


Thanks,
-Raghu


Re: parsedquery becomes PhraseQuery

2009-12-16 Thread Raghuveer Kancherla
It's likely that your analyzer uses WordDelimiterFilterFactory (look at your
schema for the field in question).
If a single token is split into multiple tokens during the analysis phase,
Solr will do a phrase query instead of a term query.

In your case, disk/1.0 is being analyzed into disk 1 0 (three tokens), hence
the phrase query.
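
For reference, a hedged schema.xml sketch of a field type whose analyzer
would keep disk/1.0 as a single token (the name text_ws is illustrative):

  <fieldType name="text_ws" class="solr.TextField">
    <analyzer>
      <!-- splits on whitespace only, so "disk/1.0" stays one token -->
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>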

-Raghu

On Thu, Dec 17, 2009 at 3:40 AM, Jibo John jiboj...@mac.com wrote:

 Hello,

 I have a question on how Solr determines whether the q value needs to be
 analyzed as a regular query or as a phrase query.

 Let's say I have the text 'jibojohn info disk/1.0'.

 If I query for 'jibojohn info', I get the results. The query is parsed as:

  <str name="rawquerystring">jibojohn info</str>
  <str name="querystring">jibojohn info</str>
  <str name="parsedquery">+data:jibojohn +data:info</str>
  <str name="parsedquery_toString">+data:jibojohn +data:info</str>

 However, if I query for 'disk/1.0', I get nothing. The query is parsed as:

  <str name="rawquerystring">disk/1.0</str>
  <str name="querystring">disk/1.0</str>
  <str name="parsedquery">PhraseQuery(data:"disk 1 0")</str>
  <str name="parsedquery_toString">data:"disk 1 0"</str>

 I was expecting this to be treated as a regular query, instead of a phrase
 query.  I was wondering why.

 Appreciate your input.

 -Jibo







Re: Payloads with Phrase queries

2009-12-15 Thread Raghuveer Kancherla
The interesting thing I am noticing is that the scoring works fine for a
phrase query like "solr rocks".
This led me to look at what query I am using in the case of a single term.
It turns out that I am using PayloadTermQuery, taking a cue from the
SOLR-1485 patch.

I changed this to BoostingTermQuery (I read somewhere that this is
deprecated, but I was just experimenting) and the scoring seems to work as
expected now for a single term.

Now, the important question is: what is the payload version of a TermQuery?
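
For what it's worth, a hedged sketch using the Lucene 2.9 payloads API; the
third constructor argument is assumed to control whether the regular span
score is multiplied in (with false, only scorePayload() should contribute):

  import org.apache.lucene.index.Term;
  import org.apache.lucene.search.payloads.AveragePayloadFunction;
  import org.apache.lucene.search.payloads.PayloadTermQuery;

  PayloadTermQuery q = new PayloadTermQuery(
      new Term("payloadTest", "solr"),
      new AveragePayloadFunction(),
      false);  // drop the tf/idf/norm span score; keep the payload factor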

Regards
Raghu


On Tue, Dec 15, 2009 at 12:45 PM, Raghuveer Kancherla 
raghuveer.kanche...@aplopio.com wrote:

 Hi,
 Thanks everyone for the responses, I am now able to get both phrase queries
 and term queries to use payloads.

 However, the score values for each document (and consequently, the
 ordering of documents) are coming out wrong.

 In the Solr output appended below, document 4 has a score higher than
 document 2 (look at the debug part). The results section shows a wrong
 score (which is the payload value I am returning from my custom similarity
 class) and the ordering is also wrong because of this. Can someone explain
 this?

 My custom query parser is pasted here http://pastebin.com/m9f21565

 In the similarity class, I return 10.0 if payload is 1 and 20.0 if payload
 is 2. For everything else I return 1.0.

 {
  'responseHeader':{
   'status':0,
   'QTime':2,
   'params':{
   'fl':'*,score',
   'debugQuery':'on',
   'indent':'on',


   'start':'0',
   'q':'solr',
   'qt':'aplopio',
   'wt':'python',
   'fq':'',
   'rows':'10'}},
  'response':{'numFound':5,'start':0,'maxScore':20.0,'docs':[


   {
'payloadTest':'solr|2 rocks|1',
'id':'2',
'score':20.0},
   {
'payloadTest':'solr|2',
'id':'4',
'score':20.0},


   {
'payloadTest':'solr|1 rocks|2',
'id':'1',
'score':10.0},
   {
'payloadTest':'solr|1 rocks|1',
'id':'3',
'score':10.0},


   {
'payloadTest':'solr',
'id':'5',
'score':1.0}]
  },
  'debug':{
   'rawquerystring':'solr',
   'querystring':'solr',


   'parsedquery':'PayloadTermQuery(payloadTest:solr)',
   'parsedquery_toString':'payloadTest:solr',
   'explain':{
   '2':'\n7.227325 = (MATCH) fieldWeight(payloadTest:solr in 1), product 
 of:\n  14.142136 = (MATCH) btq, product of:\n0.70710677 = 
 tf(phraseFreq=0.5)\n20.0 = scorePayload(...)\n  0.81767845 = 
 idf(payloadTest:  solr=5)\n  0.625 = fieldNorm(field=payloadTest, doc=1)\n',


   '4':'\n11.56372 = (MATCH) fieldWeight(payloadTest:solr in 3), product 
 of:\n  14.142136 = (MATCH) btq, product of:\n0.70710677 = 
 tf(phraseFreq=0.5)\n20.0 = scorePayload(...)\n  0.81767845 = 
 idf(payloadTest:  solr=5)\n  1.0 = fieldNorm(field=payloadTest, doc=3)\n',


   '1':'\n3.6136625 = (MATCH) fieldWeight(payloadTest:solr in 0), product 
 of:\n  7.071068 = (MATCH) btq, product of:\n0.70710677 = 
 tf(phraseFreq=0.5)\n10.0 = scorePayload(...)\n  0.81767845 = 
 idf(payloadTest:  solr=5)\n  0.625 = fieldNorm(field=payloadTest, doc=0)\n',


   '3':'\n3.6136625 = (MATCH) fieldWeight(payloadTest:solr in 2), product 
 of:\n  7.071068 = (MATCH) btq, product of:\n0.70710677 = 
 tf(phraseFreq=0.5)\n10.0 = scorePayload(...)\n  0.81767845 = 
 idf(payloadTest:  solr=5)\n  0.625 = fieldNorm(field=payloadTest, doc=2)\n',


   '5':'\n0.578186 = (MATCH) fieldWeight(payloadTest:solr in 4), product 
 of:\n  0.70710677 = (MATCH) btq, product of:\n0.70710677 = 
 tf(phraseFreq=0.5)\n1.0 = scorePayload(...)\n  0.81767845 = 
 idf(payloadTest:  solr=5)\n  1.0 = fieldNorm(field=payloadTest, doc=4)\n'},


   'QParser':'BoostingTermQParser',
   'filter_queries':[''],
   'parsed_filter_queries':[],
   'timing':{
   'time':2.0,
   'prepare':{
'time':1.0,


'org.apache.solr.handler.component.QueryComponent':{
 'time':1.0},
'org.apache.solr.handler.component.FacetComponent':{
 'time':0.0},
'org.apache.solr.handler.component.MoreLikeThisComponent':{


 'time':0.0},
'org.apache.solr.handler.component.HighlightComponent':{
 'time':0.0},
'org.apache.solr.handler.component.StatsComponent':{
 'time':0.0},
'org.apache.solr.handler.component.DebugComponent':{


 'time':0.0}},
   'process':{
'time':1.0,
'org.apache.solr.handler.component.QueryComponent':{
 'time':0.0},
'org.apache.solr.handler.component.FacetComponent':{


 'time':0.0},
'org.apache.solr.handler.component.MoreLikeThisComponent':{
 'time':0.0},
'org.apache.solr.handler.component.HighlightComponent':{
 'time':0.0},


'org.apache.solr.handler.component.StatsComponent':{
 'time':0.0},
'org.apache.solr.handler.component.DebugComponent':{
 'time':1.0}












 On Thu, Dec

Re: Payloads with Phrase queries

2009-12-14 Thread Raghuveer Kancherla
Hi,
Thanks everyone for the responses, I am now able to get both phrase queries
and term queries to use payloads.

However, the score values for each document (and consequently, the
ordering of documents) are coming out wrong.

In the Solr output appended below, document 4 has a score higher than
document 2 (look at the debug part). The results section shows a wrong score
(which is the payload value I am returning from my custom similarity class)
and the ordering is also wrong because of this. Can someone explain this?

My custom query parser is pasted here http://pastebin.com/m9f21565

In the similarity class, I return 10.0 if payload is 1 and 20.0 if payload
is 2. For everything else I return 1.0.
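
A hedged reconstruction of such a Similarity, assuming the Lucene 2.9
scorePayload() signature shipped with Solr 1.4 and single-byte payload
values (both are assumptions, not taken from the actual code):

  import org.apache.lucene.search.DefaultSimilarity;

  public class PayloadSimilarity extends DefaultSimilarity {
    @Override
    public float scorePayload(int docId, String fieldName, int start, int end,
                              byte[] payload, int offset, int length) {
      if (payload == null || length == 0) return 1.0f;
      if (payload[offset] == 1) return 10.0f;  // payload 1 -> 10.0
      if (payload[offset] == 2) return 20.0f;  // payload 2 -> 20.0
      return 1.0f;                             // everything else -> 1.0
    }
  }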

{
 'responseHeader':{
  'status':0,
  'QTime':2,
  'params':{
'fl':'*,score',
'debugQuery':'on',
'indent':'on',

'start':'0',
'q':'solr',
'qt':'aplopio',
'wt':'python',
'fq':'',
'rows':'10'}},
 'response':{'numFound':5,'start':0,'maxScore':20.0,'docs':[

{
 'payloadTest':'solr|2 rocks|1',
 'id':'2',
 'score':20.0},
{
 'payloadTest':'solr|2',
 'id':'4',
 'score':20.0},

{
 'payloadTest':'solr|1 rocks|2',
 'id':'1',
 'score':10.0},
{
 'payloadTest':'solr|1 rocks|1',
 'id':'3',
 'score':10.0},

{
 'payloadTest':'solr',
 'id':'5',
 'score':1.0}]
 },
 'debug':{
  'rawquerystring':'solr',
  'querystring':'solr',

  'parsedquery':'PayloadTermQuery(payloadTest:solr)',
  'parsedquery_toString':'payloadTest:solr',
  'explain':{
'2':'\n7.227325 = (MATCH) fieldWeight(payloadTest:solr in 1), product
of:\n  14.142136 = (MATCH) btq, product of:\n0.70710677 =
tf(phraseFreq=0.5)\n20.0 = scorePayload(...)\n  0.81767845 =
idf(payloadTest:  solr=5)\n  0.625 = fieldNorm(field=payloadTest,
doc=1)\n',

'4':'\n11.56372 = (MATCH) fieldWeight(payloadTest:solr in 3), product
of:\n  14.142136 = (MATCH) btq, product of:\n0.70710677 =
tf(phraseFreq=0.5)\n20.0 = scorePayload(...)\n  0.81767845 =
idf(payloadTest:  solr=5)\n  1.0 = fieldNorm(field=payloadTest,
doc=3)\n',

'1':'\n3.6136625 = (MATCH) fieldWeight(payloadTest:solr in 0),
product of:\n  7.071068 = (MATCH) btq, product of:\n0.70710677 =
tf(phraseFreq=0.5)\n10.0 = scorePayload(...)\n  0.81767845 =
idf(payloadTest:  solr=5)\n  0.625 = fieldNorm(field=payloadTest,
doc=0)\n',

'3':'\n3.6136625 = (MATCH) fieldWeight(payloadTest:solr in 2),
product of:\n  7.071068 = (MATCH) btq, product of:\n0.70710677 =
tf(phraseFreq=0.5)\n10.0 = scorePayload(...)\n  0.81767845 =
idf(payloadTest:  solr=5)\n  0.625 = fieldNorm(field=payloadTest,
doc=2)\n',

'5':'\n0.578186 = (MATCH) fieldWeight(payloadTest:solr in 4), product
of:\n  0.70710677 = (MATCH) btq, product of:\n0.70710677 =
tf(phraseFreq=0.5)\n1.0 = scorePayload(...)\n  0.81767845 =
idf(payloadTest:  solr=5)\n  1.0 = fieldNorm(field=payloadTest,
doc=4)\n'},

  'QParser':'BoostingTermQParser',
  'filter_queries':[''],
  'parsed_filter_queries':[],
  'timing':{
'time':2.0,
'prepare':{
 'time':1.0,

 'org.apache.solr.handler.component.QueryComponent':{
  'time':1.0},
 'org.apache.solr.handler.component.FacetComponent':{
  'time':0.0},
 'org.apache.solr.handler.component.MoreLikeThisComponent':{

  'time':0.0},
 'org.apache.solr.handler.component.HighlightComponent':{
  'time':0.0},
 'org.apache.solr.handler.component.StatsComponent':{
  'time':0.0},
 'org.apache.solr.handler.component.DebugComponent':{

  'time':0.0}},
'process':{
 'time':1.0,
 'org.apache.solr.handler.component.QueryComponent':{
  'time':0.0},
 'org.apache.solr.handler.component.FacetComponent':{

  'time':0.0},
 'org.apache.solr.handler.component.MoreLikeThisComponent':{
  'time':0.0},
 'org.apache.solr.handler.component.HighlightComponent':{
  'time':0.0},

 'org.apache.solr.handler.component.StatsComponent':{
  'time':0.0},
 'org.apache.solr.handler.component.DebugComponent':{
  'time':1.0}












On Thu, Dec 10, 2009 at 5:48 PM, AHMET ARSLAN iori...@yahoo.com wrote:


  I was looking through some lucene
  source codes and found the following class
  org.apache.lucene.search.payloads.PayloadSpanUtil
 
  There is a function named queryToSpanQuery in this class.
  Is this the
  preferred way to convert a PhraseQuery to
  PayloadNearQuery?

 The queryToSpanQuery method does not return a PayloadNearQuery.

 You need to override getFieldQuery(String field, String queryText, int
 slop) of SolrQueryParser or QueryParser.

 This code is modified from the Lucene in Action book (2nd edition), Chapter
 6.3.4, "Allowing ordered phrase queries":
 

Payloads with Phrase queries

2009-12-10 Thread Raghuveer Kancherla
Hi,
I am looking for a way to use payloads in my search application. Indexing
data with payloads into Solr is pretty straightforward. However using the
payloads during search time is a bit confusing. Can anyone point me in the
right direction to enable payloads on a *PhraseQuery*. I looked at the
following resources and got payload on a TermQuery working.

   1. http://www.lucidimagination.com/blog/2009/08/05/getting-started-with-payloads/
   2. http://www.mail-archive.com/solr-user@lucene.apache.org/msg24863.html
   3. There is also a jira issue (SOLR-1485) that gives a patch for using
   Payload.
   4. Lucene-In-Action

I am guessing that I should return a payload version of PhraseQuery in the
newPhraseQuery function of QueryParser (package
org.apache.lucene.queryParser). If yes, what type should this query be?

Thanks,
Raghu


Re: Payloads with Phrase queries

2009-12-10 Thread Raghuveer Kancherla
I was looking through some Lucene source code and found the following class:
org.apache.lucene.search.payloads.PayloadSpanUtil

There is a function named queryToSpanQuery in this class. Is this the
preferred way to convert a PhraseQuery to PayloadNearQuery?

Also, are there any performance considerations while using a
PayloadNearQuery instead of a PhraseQuery?

Thanks,
Raghu



On Thu, Dec 10, 2009 at 4:40 PM, AHMET ARSLAN iori...@yahoo.com wrote:

  Hi,
  I am looking for a way to use payloads in my search
  application. Indexing
  data with payloads into Solr is pretty straightforward.
  However using the
  payloads during search time is a bit confusing. Can anyone
  point me in the
  right direction to enable payloads on a *PhraseQuery*. I
  looked at the
  following resources and got payload on a TermQuery
  working.
 
 1. http://www.lucidimagination.com/blog/2009/08/05/getting-started-with-payloads/
 2.
 http://www.mail-archive.com/solr-user@lucene.apache.org/msg24863.html
 3. There is also a jira issue (SOLR-1485)
  that gives a patch for using
 Payload.
 4. Lucene-In-Action
 
  I am guessing that I should return a payload version of
  PhraseQuery in
  QueryParser's (package
  org.apache.lucene.queryParser.queryParser.java)
  newPhraseQuery function. If yes, what type should this
  query be?

 Yes. PayloadNearQuery [1]

 [1]
 http://lucene.apache.org/java/3_0_0/api/core/org/apache/lucene/search/payloads/PayloadNearQuery.html






Re: Retrieving large num of docs

2009-12-05 Thread Raghuveer Kancherla
Hi Otis,
I think my experiments are not conclusive about the reduction in search
time. I was playing around with various configurations to reduce the time to
retrieve documents from Solr. I am sure that after changing the two
multi-valued text fields from stored to un-stored, retrieval time (query
time + time to load the stored fields) became very fast. I was expecting the
enableLazyFieldLoading setting in solrconfig to take care of this, but
apparently it is not working as expected.

Out of curiosity, I removed these 2 fields from the index (this time I am
not even indexing them) and my search time got better (10 times better).
However, I am still trying to isolate the reason for the search time
reduction. It may be because of 2 fewer fields to search in, or because of
the reduction in the size of the index, or maybe something else. I am not
sure if lazy field loading has any part in explaining this.

- Raghu



On Fri, Dec 4, 2009 at 3:07 AM, Otis Gospodnetic otis_gospodne...@yahoo.com
 wrote:

 Hm, hm, interesting.  I was looking into something like this the other day
 (BIG indexed+stored text fields).  After seeing enableLazyFieldLoading=true
 in solrconfig, and after seeing that fl didn't include those big fields, I
 thought: hm, so Lucene/Solr will not be pulling those large fields from
 disk, OK.

 You are saying that this may not be true based on your experiment?
 And what I'm calling your experiment means that you reindexed the same
 data, but without the 2 multi-valued text fields... and that was the only
 change you made, and you got roughly a 10x search performance improvement?

 Sorry for repeating your words, just trying to confirm and understand.

 Thanks,
 Otis
 --
 Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch



 - Original Message 
  From: Raghuveer Kancherla raghuveer.kanche...@aplopio.com
  To: solr-user@lucene.apache.org
  Sent: Thu, December 3, 2009 8:43:16 AM
  Subject: Re: Retrieving large num of docs
 
  Hi Hoss,
 
  I was experimenting with various queries to solve this problem and in one
  such test I remember that requesting only the ID did not change the
  retrieval time. To be sure, I tested it again using the curl command
  today, and it confirms my previous observation.
 
  Also, the enableLazyFieldLoading setting is set to true in my solrconfig.
 
  Another general observation (off topic) is that having a moderately large
  multi-valued text field (~200 entries) in the index seems to slow down the
  search significantly. I removed the 2 multi-valued text fields from my
  index and my search got ~10 times faster. :)
 
  - Raghu
 
 
  On Thu, Dec 3, 2009 at 2:14 AM, Chris Hostetter wrote:
 
  
   : I think I solved the problem of retrieving 300 docs per request for
 now.
   The
   : problem was that I was storing 2 moderately large multivalued text
 fields
   : though I was not retrieving them during search time.  I reindexed all
 my
   : data without storing these fields. Now the response time (time for
 Solr
   to
   : return the http response) is very close to the QTime Solr is showing
 in
   the
  
   Hmmm
  
   two comments:
  
   1) the example URL from your previous mail...
  
   : 
  
 
 http://localhost:1212/solr/select/?rows=300&q=%28ResumeAllText%3A%28%28%28%22java+j2ee%22+%28java+j2ee%29%29%29%5E4%29%5E1.0%29&start=0&wt=python
  
   ...doesn't match your earlier statement that you are only returning the
   id field (there is no fl param in that URL) ... are you certain you
   weren't returning those large stored fields in the response?
  
   2) assuming you were actually using an fl param to limit the fields,
   make sure you have this setting in your solrconfig.xml...

     <enableLazyFieldLoading>true</enableLazyFieldLoading>

   ..that should make it pretty fast to return only a few fields of each
   document, even if you do have some jumbo stored fields that aren't being
   returned.
  
  
  
   -Hoss
  
  




Re: WELCOME to solr-user@lucene.apache.org

2009-12-05 Thread Raghuveer Kancherla
2 ways I can think of ...

   - ExtractingRequestHandler (this is what I am guessing you are using now)

Set extractOnly=true while making a request to the ExtractingRequestHandler
and get the parsed content back. Then make a post request to the update
request handler with whatever fields and field values you want (see the
curl sketch after this list).


   - Use HTMLStripWhitespaceTokenizerFactory. This article may help explain
   what I mean:
   http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripWhitespaceTokenizerFactory
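
A hedged example of the extractOnly round trip (the /update/extract path and
parameter names assume the stock Solr 1.4 example configuration):

  # 1) Parse only: nothing is indexed; the extracted content comes back.
  curl "http://localhost:8983/solr/update/extract?extractOnly=true" \
       -F "myfile=@page.html"

  # 2) Index exactly the fields you want via the normal update handler.
  curl "http://localhost:8983/solr/update?commit=true" \
       -H "Content-Type: text/xml" --data-binary \
       '<add><doc><field name="id">10</field><field name="text">...extracted text...</field></doc></add>'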



- Raghu



On Sat, Dec 5, 2009 at 3:44 AM, khalid y kern...@gmail.com wrote:

 Hi,

 I have a problem with Solr. I'm indexing some HTML content and Solr crashes
 because my id field is multivalued.
 I found that Tika reads the HTML and extracts metadata like <meta name="id"
 content="12"> from my HTML files, but my documents already have an id set
 by literal.id=10.

 I tried to map the id from Tika with fmap.id=ignored_ but it also ignores
 my literal.id.

 I'm using Solr 1.4 and Tika 0.5.

 Can someone explain to me how I can ignore the Tika id metadata?

 Thanks



Re: Retrieving large num of docs

2009-12-03 Thread Raghuveer Kancherla
Hi Hoss,

I was experimenting with various queries to solve this problem and in one
such test I remember that requesting only the ID did not change the
retrieval time. To be sure, I tested it again using the curl command today
and it confirms my previous observation.

Also, the enableLazyFieldLoading setting is set to true in my solrconfig.

Another general observation (off topic) is that having a moderately large
multi-valued text field (~200 entries) in the index seems to slow down the
search significantly. I removed the 2 multi-valued text fields from my index
and my search got ~10 times faster. :)

- Raghu


On Thu, Dec 3, 2009 at 2:14 AM, Chris Hostetter hossman_luc...@fucit.org wrote:


 : I think I solved the problem of retrieving 300 docs per request for now.
 The
 : problem was that I was storing 2 moderately large multivalued text fields
 : though I was not retrieving them during search time.  I reindexed all my
 : data without storing these fields. Now the response time (time for Solr
 to
 : return the http response) is very close to the QTime Solr is showing in
 the

 Hmmm

 two comments:

 1) the example URL from your previous mail...

 : 
 http://localhost:1212/solr/select/?rows=300&q=%28ResumeAllText%3A%28%28%28%22java+j2ee%22+%28java+j2ee%29%29%29%5E4%29%5E1.0%29&start=0&wt=python

 ...doesn't match your earlier statement that you are only returning the id
 field (there is no fl param in that URL) ... are you certain you weren't
 returning those large stored fields in the response?

 2) assuming you were actually using an fl param to limit the fields, make
 sure you have this setting in your solrconfig.xml...

    <enableLazyFieldLoading>true</enableLazyFieldLoading>

 ..that should make it pretty fast to return only a few fields of each
 document, even if you do have some jumbo stored fields that aren't being
 returned.
 returned.



 -Hoss




Re: Retrieving large num of docs

2009-12-01 Thread Raghuveer Kancherla
Hi Hoss/Andrew,
I think I solved the problem of retrieving 300 docs per request for now. The
problem was that I was storing 2 moderately large multi-valued text fields
even though I was not retrieving them at search time. I reindexed all my
data without storing these fields. Now the response time (time for Solr to
return the http response) is very close to the QTime Solr is showing in the
logs.

Thanks for all the help,
Raghu


On Mon, Nov 30, 2009 at 11:37 AM, Raghuveer Kancherla 
raghuveer.kanche...@aplopio.com wrote:

 Thanks Hoss,
 In my previous mail, I was measuring the system time difference between
 sending an (HTTP) request and receiving a response. This was being run on
 a (different) client machine.

 Like you suggested, I tried to time the response on the server itself as
 follows:

 $ /usr/bin/time -p curl -sS -o solr.out \
 "http://localhost:1212/solr/select/?rows=300&q=%28ResumeAllText%3A%28%28%28%22java+j2ee%22+%28java+j2ee%29%29%29%5E4%29%5E1.0%29&start=0&wt=python"
 real 3.49

 user 0.00
 sys 0.00

 The query time in the Solr log shows QTime=600.
 The size of solr.out is 843 kB.

 As you've mentioned, Solr shouldn't give this kind of number for 300
 docs, and we're quite perplexed as to what's going on.

 Thanks,
 Raghu




 On Mon, Nov 30, 2009 at 6:00 AM, Chris Hostetter hossman_luc...@fucit.org wrote:


 : I am using Solr1.4 for searching through half a million documents. The
 : problem is, I want to retrieve nearly 200 documents for each search
 query.
 : The query time in Solr logs is showing 0.02 seconds and I am fairly
 happy
 : with that. However Solr is taking a long time (4 to 5 secs) to return
 the
 : results (I think it is because of the number of docs I am requesting). I
 : tried returning only the id's (unique key) without any other stored
 fields,
 : but it is not helping me improve the response times (time to return the
 id's
 : of matching documents).

 What exactly does your request URL look like, and how exactly are you
 timing the total response time?

 200 isn't a very big number for the rows param -- people who want to get
 100K documents back in their response at a time may have problems, but 200
 is not that big.

 so like i said: how exactly are you timing things?

 My guess: it's more likely that network overhead or the performance of
 your client code (reading the data off the wire) is causing your timing
 code to seem slow, than it is that Solr is taking 5 seconds to write out
 those document IDs.

 I suspect if you try hitting the same exact URL using curl via localhost,
 you'll see the total response time be a lot less than 5 seconds.

 Here's an example of a query that asks solr to return *every* field from
 500 documents, in the XML format.  And these are not small documents...

 $ /usr/bin/time -p curl -sS -o /tmp/solr.out \
 "http://localhost:5051/solr/select/?q=doctype:product&version=2.2&start=0&rows=500&indent=on"
 real 0.07
 user 0.00
 sys 0.00
 [chr...@c18-ssa-so-dfll-qry1 ~]$ du -sh /tmp/solr.out
 1.6M    /tmp/solr.out

 ...that's 1.6 MB of 500 Solr documents with all of their fields in
 verbose XML format (including indenting) fetched in 70ms.

 If it's taking 5 seconds for you to get just the ids of 200 docs, you've
 got a problem somewhere, and I'm 99% certain it's not in Solr.

 what does a similar time curl command for your URL look like when you
 run it on your solr server?


 -Hoss





Re: Retrieving large num of docs

2009-11-29 Thread Raghuveer Kancherla
Thanks Hoss,
In my previous mail, I was measuring the system time difference between
sending an (HTTP) request and receiving a response. This was being run on a
(different) client machine.

Like you suggested, I tried to time the response on the server itself as
follows:

$ /usr/bin/time -p curl -sS -o solr.out \
"http://localhost:1212/solr/select/?rows=300&q=%28ResumeAllText%3A%28%28%28%22java+j2ee%22+%28java+j2ee%29%29%29%5E4%29%5E1.0%29&start=0&wt=python"

real 3.49
user 0.00
sys 0.00

The query time in the Solr log shows QTime=600.
The size of solr.out is 843 kB.

As you've mentioned, Solr shouldn't give this kind of number for 300 docs,
and we're quite perplexed as to what's going on.

Thanks,
Raghu



On Mon, Nov 30, 2009 at 6:00 AM, Chris Hostetter hossman_luc...@fucit.org wrote:


 : I am using Solr1.4 for searching through half a million documents. The
 : problem is, I want to retrieve nearly 200 documents for each search
 query.
 : The query time in Solr logs is showing 0.02 seconds and I am fairly happy
 : with that. However Solr is taking a long time (4 to 5 secs) to return the
 : results (I think it is because of the number of docs I am requesting). I
 : tried returning only the id's (unique key) without any other stored
 fields,
 : but it is not helping me improve the response times (time to return the
 id's
 : of matching documents).

 What exactly does your request URL look like, and how exactly are you
 timing the total response time?

 200 isn't a very big number for the rows param -- people who want to get
 100K documents back in their response at a time may have problems, but 200
 is not that big.

 so like i said: how exactly are you timing things?

 My guess: it's more likely that network overhead or the performance of
 your client code (reading the data off the wire) is causing your timing
 code to seem slow, than it is that Solr is taking 5 seconds to write out
 those document IDs.

 I suspect if you try hitting the same exact URL using curl via localhost,
 you'll see the total response time be a lot less than 5 seconds.

 Here's an example of a query that asks solr to return *every* field from
 500 documents, in the XML format.  And these are not small documents...

 $ /usr/bin/time -p curl -sS -o /tmp/solr.out \
 "http://localhost:5051/solr/select/?q=doctype:product&version=2.2&start=0&rows=500&indent=on"
 real 0.07
 user 0.00
 sys 0.00
 [chr...@c18-ssa-so-dfll-qry1 ~]$ du -sh /tmp/solr.out
 1.6M    /tmp/solr.out

 ...that's 1.6 MB of 500 Solr documents with all of their fields in
 verbose XML format (including indenting) fetched in 70ms.

 If it's taking 5 seconds for you to get just the ids of 200 docs, you've
 got a problem somewhere, and I'm 99% certain it's not in Solr.

 what does a similar time curl command for your URL look like when you
 run it on your solr server?


 -Hoss




Re: Retrieving large num of docs

2009-11-28 Thread Raghuveer Kancherla
Hi Andrew,
I applied the patch you suggested. I am not finding any significant changes
in the response times.
I am wondering if I forgot some important configuration setting etc.
Here is what I did:

   1. Wrote a small program using SolrJ to use EmbeddedSolrServer (most of
   the code is from the Solr wiki; see the sketch after this list), ran the
   server on an index of ~700k docs, and noted down the avg response time
   2. Applied the SOLR-797.patch to the source code of Solr 1.4
   3. Compiled the source code and rebuilt the jar files
   4. Reran step 1 using the new jar files
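
A minimal EmbeddedSolrServer setup along the lines of the Solr 1.4 wiki
example (the solr home path is illustrative):

  import org.apache.solr.client.solrj.SolrServer;
  import org.apache.solr.client.solrj.embedded.EmbeddedSolrServer;
  import org.apache.solr.core.CoreContainer;

  System.setProperty("solr.solr.home", "/path/to/solr/home");
  CoreContainer.Initializer initializer = new CoreContainer.Initializer();
  CoreContainer coreContainer = initializer.initialize();
  // "" selects the default core in a single-core setup.
  SolrServer server = new EmbeddedSolrServer(coreContainer, "");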

Am I supposed to make any other config changes in order to see the
performance jump that you are able to achieve?

Thanks a lot,
Raghu


On Fri, Nov 27, 2009 at 3:16 PM, AHMET ARSLAN iori...@yahoo.com wrote:

  Hi Andrew,
  We are running solr using its http interface from python.
  From the resources
  I could find, EmbeddedSolrServer is possible only if I am
  using solr from a
  java program.  It will be useful to understand if a
  significant part of the
  performance increase is due to bypassing HTTP before going
  down this path.
 
  In the mean time I am trying my luck with the other
  suggestions. Can you
  share the patch that helps cache solr documents instead of
  lucene documents?

 Maybe these links can help:
 http://wiki.apache.org/lucene-java/ImproveSearchingSpeed
 http://wiki.apache.org/lucene-java/ImproveIndexingSpeed
 http://www.lucidimagination.com/Downloads/LucidGaze-for-Solr

 How often do you update your index?
 Is your index optimized?
 Configuring caching can also help (see the snippet after these links):

 http://wiki.apache.org/solr/SolrCaching
 http://wiki.apache.org/solr/SolrPerformanceFactors
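
 For instance, a hedged solrconfig.xml cache sketch (these are the stock
 Solr 1.4 cache elements inside <query>; the sizes are only illustrative):

   <filterCache class="solr.LRUCache"
                size="512" initialSize="512" autowarmCount="128"/>
   <queryResultCache class="solr.LRUCache"
                     size="512" initialSize="512" autowarmCount="128"/>
   <documentCache class="solr.LRUCache"
                  size="16384" initialSize="4096"/>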








Re: Retrieving large num of docs

2009-11-27 Thread Raghuveer Kancherla
Hi Andrew,
We are running Solr via its HTTP interface from Python. From the resources
I could find, EmbeddedSolrServer is possible only if I am using Solr from a
Java program. It will be useful to understand whether a significant part of
the performance increase is due to bypassing HTTP before going down this
path.

In the meantime I am trying my luck with the other suggestions. Can you
share the patch that helps cache Solr documents instead of Lucene documents?


On a different note, I am wondering why it takes 4-5 seconds for Solr to
return the IDs of ranked documents when it can rank the results in about 20
milliseconds? Am I missing something here?

Thanks,
Raghu



On Fri, Nov 27, 2009 at 2:15 AM, Andrey Klochkov akloch...@griddynamics.com
 wrote:

 Hi

 We obtain ALL documents for every query; the index size is about 50k.
 We use a number of stored fields. Often the result set size is several
 thousand docs.

 We performed the following things to make it faster:

 1. Use EmbeddedSolrServer
 2. Patch Solr to avoid unnecessary marshalling while using
 EmbeddedSolrServer (there's an issue in Solr JIRA)
 3. Patch Solr to cache SolrDocument instances instead of Lucene's Document
 instances. I was going to share this patch, but then decided that our usage
 of Solr is not common and this functionality is useless in most cases
 4. We have all documents in cache
 5. In fact our index is stored in a data grid, not a file system. But as
 tests showed, this is not important because the standard FSDirectory is
 faster if you have enough RAM free for OS caches.

 These changes improved the performance very much, so in the end we have
 performance comparable to (about 3-5 times slower than) proper Solr usage
 (obtaining just the first 20 documents).

 To get more details on how different Solr components perform we injected
 perf4j statements into key points in the code. And a profiler was helpful
 too.

 Hope it helps somehow.

 On Thu, Nov 26, 2009 at 8:48 PM, Raghuveer Kancherla 
 raghuveer.kanche...@aplopio.com wrote:

  Hi,
  I am using Solr1.4 for searching through half a million documents. The
  problem is, I want to retrieve nearly 200 documents for each search
 query.
  The query time in Solr logs is showing 0.02 seconds and I am fairly happy
  with that. However Solr is taking a long time (4 to 5 secs) to return the
  results (I think it is because of the number of docs I am requesting). I
  tried returning only the id's (unique key) without any other stored
 fields,
  but it is not helping me improve the response times (time to return the
  id's
  of matching documents).
  I understand that retrieving 200 documents for each search term is
  impractical in most scenarios, but I don't have any other option. Any
  pointers on how to improve the response times will be a great help.
 
  Thanks,
   Raghu
 



 --
 Andrew Klochkov
 Senior Software Engineer,
 Grid Dynamics



Retrieving large num of docs

2009-11-26 Thread Raghuveer Kancherla
Hi,
I am using Solr 1.4 for searching through half a million documents. The
problem is, I want to retrieve nearly 200 documents for each search query.
The query time in Solr logs is showing 0.02 seconds and I am fairly happy
with that. However, Solr is taking a long time (4 to 5 secs) to return the
results (I think it is because of the number of docs I am requesting). I
tried returning only the id's (unique key) without any other stored fields,
but it is not helping me improve the response times (time to return the id's
of matching documents).
I understand that retrieving 200 documents for each search term is
impractical in most scenarios, but I don't have any other option. Any
pointers on how to improve the response times will be a great help.

Thanks,
 Raghu