Re: Fast Vector Highlighter Working for some records only

2012-02-23 Thread dhaivat
Hi Koji,

Thanks for your guidance. I have looked into the analysis page of Solr and it is
working fine, but highlighting is still not working for a few documents.

Here is the configuration for the highlighter I am using; I have specified it in
solrconfig.xml. Can you please tell me what I should change for the highlighter
to work for all documents? For your information, I am not using any kind of
filter on the custom field, just my custom tokeniser.


  <searchComponent class="solr.HighlightComponent" name="highlight">
    <highlighting>

      <fragmenter name="gap"
                  default="true"
                  class="solr.highlight.GapFragmenter">
        <lst name="defaults">
          <int name="hl.snippets">1000</int>
          <int name="hl.fragsize">7</int>
          <int name="hl.maxAnalyzedChars">7</int>
        </lst>
      </fragmenter>

      <fragmenter name="regex"
                  class="org.apache.solr.highlight.RegexFragmenter">
        <lst name="defaults">
          <int name="hl.fragsize">70</int>
          <float name="hl.regex.slop">0.5</float>
          <str name="hl.regex.pattern">[-\w ,/\n\']{20,200}</str>
        </lst>
      </fragmenter>

      <formatter name="html"
                 default="true"
                 class="solr.highlight.HtmlFormatter">
        <lst name="defaults">
          <str name="hl.simple.pre"></str>
          <str name="hl.simple.post"></str>
        </lst>
      </formatter>

      <encoder name="html"
               class="solr.highlight.HtmlEncoder" />

      <fragListBuilder name="simple"
                       default="true"
                       class="solr.highlight.SimpleFragListBuilder"/>

      <fragListBuilder name="single"
                       class="solr.highlight.SingleFragListBuilder"/>

      <fragmentsBuilder name="default"
                        default="true"
                        class="solr.highlight.ScoreOrderFragmentsBuilder">
      </fragmentsBuilder>

      <fragmentsBuilder name="colored"
                        class="solr.highlight.ScoreOrderFragmentsBuilder">
        <lst name="defaults">
          <str name="hl.tag.pre"></str>
          <str name="hl.tag.post"></str>
        </lst>
      </fragmentsBuilder>

      <boundaryScanner name="default"
                       default="true"
                       class="solr.highlight.SimpleBoundaryScanner">
        <lst name="defaults">
          <str name="hl.bs.maxScan">10</str>
          <str name="hl.bs.chars">.,!? &#9;&#10;&#13;</str>
        </lst>
      </boundaryScanner>

      <boundaryScanner name="breakIterator"
                       class="solr.highlight.BreakIteratorBoundaryScanner">
        <lst name="defaults">
          <str name="hl.bs.type">WORD</str>
          <str name="hl.bs.language">en</str>
          <str name="hl.bs.country">US</str>
        </lst>
      </boundaryScanner>

    </highlighting>
  </searchComponent>

 



Koji Sekiguchi wrote
 
 Hi dhaivat,
 
 I think you may want to use analysis.jsp:
 
 http://localhost:8983/solr/admin/analysis.jsp
 
 Go to the URL and look into how your custom tokenizer produces tokens,
 and compare with the output of Solr's inbuilt tokenizer.
 
 koji
 -- 
 Query Log Visualizer for Apache Solr
 http://soleami.com/
 
 
 (12/02/22 21:35), dhaivat wrote:

 Koji Sekiguchi wrote

 (12/02/22 11:58), dhaivat wrote:
 Thanks for reply,

 But can you please tell me why it's working for some documents and not
 for
 other.

 As Solr 1.4.1 does not recognize the hl.useFastVectorHighlighter flag, Solr
 just ignores it; but because hl=true is there, Solr tries to create highlight
 snippets using the existing (traditional, i.e. not FVH) Highlighter.
 Since the Highlighter (including FVH) sometimes cannot produce snippets for
 various reasons, you can use the hl.alternateField parameter.

 http://wiki.apache.org/solr/HighlightingParameters#hl.alternateField

 koji
 --
 Query Log Visualizer for Apache Solr
 http://soleami.com/


 Thank you so much for the explanation.

 I have updated my Solr version and am now using 3.5. Could you please tell
 me: when I am using a custom Tokenizer on a field, do I need to make any
 changes for the Solr highlighter to work?

 here is my custom analyser

   <fieldType name="custom_text" class="solr.TextField"
              positionIncrementGap="100">
     <analyzer type="index">
       <tokenizer class="ns.solr.analyser.CustomIndexTokeniserFactory"/>
     </analyzer>
     <analyzer type="query">
       <tokenizer class="ns.solr.analyser.CustomSearcherTokeniserFactory"/>
     </analyzer>
   </fieldType>

 here is the field info:

 <field name="contents" type="custom_text" indexed="true" stored="true"
        multiValued="true" termPositions="true" termVectors="true"
        termOffsets="true"/>

 I am creating tokens using my custom analyser, and when I try to use the
 highlighter it does not work properly for the contents field. But when I
 tried Solr's inbuilt tokeniser, the word was highlighted for the same
 query. Can you please help me out with this?


 Thanks in 

Re: 'location' fieldType indexation impossible

2012-02-23 Thread Xavier
You totally got it :)

I've deleted those dynamicFields (though it was just an example); why didn't
I read the comment above the line!

Thanks a lot ;)

Best regards,
Xavier.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/location-fieldType-indexation-impossible-tp3766136p3769065.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: How to merge an autofacet with a predefined facet

2012-02-23 Thread Xavier
Thank you for this information, I'll keep it in mind.

But I'm sorry, I don't get the process to do it.


Em wrote
 
 Well, you could create a keyword-file out of your database and join it
 with your self-maintained keywordslist. 
 


By that, do you mean:
- the 'self-maintained keywordslist' is my 'predefined_facet', already filled
in the database, which I'll still import with DIH?
- the keyword-file isn't the same thing that I've created with the
synonyms/keepwords combination?

And I still don't get how to 'merge' both ways of getting facet values
into one single facet!

Thanks in advance,
Xavier


--
View this message in context: 
http://lucene.472066.n3.nabble.com/How-to-merge-an-autofacet-with-a-predefined-facet-tp3763988p3769121.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: solr 3.5 and indexing performance

2012-02-23 Thread mizayah
OK, I found it.

It's because of Hunspell, which is now in Solr. Somehow, when I use it by
myself in 3.4 it is a lot faster than the one from 3.5.

I don't know about the differences, but is there any way I can use my old
Google Hunspell jar?

--
View this message in context: 
http://lucene.472066.n3.nabble.com/solr-3-5-and-indexing-performance-tp3766653p3769139.html
Sent from the Solr - User mailing list archive at Nabble.com.


Can this type of sorting/boosting be done by solr

2012-02-23 Thread rks_lucene
Hi,

I have a journal article citation schema like this:
{  AT - article_title
   AID - article_id (Unique id)
   AREFS - article_references_list (List of article id's referred/cited in
this article. Multi-valued)
   AA - Article Abstract
   ---
   other_article_stuff
   ...
}

So for example, in order to search for all those articles that refer(cite)
article id 51643, I simply need to search for AREFS:51643 and it will give
me the list of articles that have 51643 listed in AREFS.

Now, I want to be able to search in the text of articles and sort the
results by most-referred articles. How can I do this?

Say my search query is q=AT:metal and it gives me 1700 results. How can I
sort those 1700 results by those that have received the maximum number of
citations from others?

I have been researching function queries to solve this but have been unable
to do so.

Thanks in advance.
Ritesh


--
View this message in context: 
http://lucene.472066.n3.nabble.com/Can-this-type-of-sorting-boosting-be-done-by-solr-tp3769315p3769315.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Can this type of sorting/boosting be done by solr

2012-02-23 Thread Chantal Ackermann
Hi Ritesh,

you could add another field that contains the size of the list in the
AREFS field. This way you'd simply sort by that field in descending
order.

Should you update AREFS dynamically, you'd have to update the field with
the size, as well, of course.

Chantal
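Chantal's suggestion can be sketched in client-side indexing code. A minimal
sketch, assuming documents are assembled in the application before being sent
to Solr; `AREFS_COUNT` is a hypothetical field name (any sortable integer
field works):

```python
# Derive a sortable "size of the AREFS list" field at index time.
# AREFS_COUNT is a hypothetical field name.
def with_refs_count(doc):
    doc = dict(doc)
    doc["AREFS_COUNT"] = len(doc.get("AREFS", []))
    return doc

docs = [
    {"AID": "51643", "AREFS": ["100", "200", "300"]},
    {"AID": "51644", "AREFS": ["51643"]},
]
indexed = [with_refs_count(d) for d in docs]
# Sorting by AREFS_COUNT descending corresponds to sort=AREFS_COUNT desc in Solr.
indexed.sort(key=lambda d: d["AREFS_COUNT"], reverse=True)
```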

On Thu, 2012-02-23 at 11:27 +0100, rks_lucene wrote:
 [...]



Re: String search in Dismax handler

2012-02-23 Thread mechravi25
Hi Erick,

Thanks for the response.

I am currently using Solr version 1.5.

We get the following query when we give the search term Pass By Value
without quotes and use qt=dismax in the request:
 webapp=/solr path=/select/
params={facet=true&f.typeFacet.facet.mincount=1&qf=name^2.3+text+x_name^0.3+id^0.3+uxid^0.3&hl.fl=*&hl=true&f.rFacet.facet.mincount=1&rows=10&debugQuery=true&fl=*&start=0&q=pass+by+value&facet.field=typeFacet&facet.field=rFacet&qt=dismax}
hits=0 status=0 QTime=63
 
and the response for it in the UI is as follows
 
<result name="response" numFound="0" start="0" />
<lst name="facet_counts">
  <lst name="facet_queries" />
  <lst name="facet_fields">
    <lst name="typeFacet" />
    <lst name="rFacet" />
  </lst>
  <lst name="facet_dates" />
</lst>
<lst name="highlighting" />
<lst name="debug">
  <str name="rawquerystring">pass by value</str>
  <str name="querystring">pass by value</str>
  <str name="parsedquery">+((DisjunctionMaxQuery((uxid:pass^0.3 |
id:pass^0.3 | x_name:pass^0.3 | text:loan | name:pass^2.3))
DisjunctionMaxQuery((uxid:by^0.3 | id:by^0.3))
DisjunctionMaxQuery((uxid:value^0.3 | id:value^0.3 | x_name:value^0.3 |
text:value | name:value^2.3)))~3) ()</str>
  <str name="parsedquery_toString">+(((uxid:pass^0.3 | id:loan^0.3 |
x_name:pass^0.3 | text:loan | name:pass^2.3) (uxid:by^0.3 | id:by^0.3)
(uxid:value^0.3 | id:value^0.3 | x_name:value^0.3 | text:value |
name:value^2.3))~3) ()</str>
  <lst name="explain" />
  <str name="QParser">DisMaxQParser</str>
  <null name="altquerystring" />
  <null name="boostfuncs" />
  <lst name="timing">
    <double name="time">3.0</double>
    <lst name="prepare">
      <double name="time">1.0</double>
      <lst name="org.apache.solr.handler.component.QueryComponent">
        <double name="time">1.0</double>
      </lst>
      <lst name="org.apache.solr.handler.component.FacetComponent">
        <double name="time">0.0</double>
      </lst>
      <lst name="org.apache.solr.handler.component.MoreLikeThisComponent">
        <double name="time">0.0</double>
      </lst>
      <lst name="org.apache.solr.handler.component.HighlightComponent">
        <double name="time">0.0</double>
      </lst>
      <lst name="org.apache.solr.handler.component.StatsComponent">
        <double name="time">0.0</double>
      </lst>
      <lst name="org.apache.solr.handler.component.DebugComponent">
        <double name="time">0.0</double>
      </lst>
    </lst>
    <lst name="process">
      <double name="time">2.0</double>
      <lst name="org.apache.solr.handler.component.QueryComponent">
        <double name="time">1.0</double>
      </lst>
      <lst name="org.apache.solr.handler.component.FacetComponent">
        <double name="time">0.0</double>
      </lst>
      <lst name="org.apache.solr.handler.component.MoreLikeThisComponent">
        <double name="time">0.0</double>
      </lst>
      <lst name="org.apache.solr.handler.component.HighlightComponent">
        <double name="time">1.0</double>
      </lst>
      <lst name="org.apache.solr.handler.component.StatsComponent">
        <double name="time">0.0</double>
      </lst>
      <lst name="org.apache.solr.handler.component.DebugComponent">
        <double name="time">0.0</double>
      </lst>
    </lst>
  </lst>
</lst>
 
whereas we get the following query when we remove the qt=dismax parameter
from the request, and this fetches the required results:
 
webapp=/solr path=/select/
params={facet=true&qf=name^2.3+text+x_name^0.3+id^0.3+uxid^0.3&f.typeFacet.facet.mincount=1&hl.fl=*&f.rFacet.facet.mincount=1&hl=true&rows=10&fl=*&debugQuery=true&start=0&q=pass+by+value&facet.field=typeFacet&facet.field=rFacet}
hits=9203 status=0 QTime=1158

In another case, where we use "Pass by Value" with quotes and also with
qt=dismax in the request, the search query fetches the right values. The
following is the relevant query:
 
 webapp=/solr path=/select/
params={facet=true&qf=name^2.3+text+x_name^0.3+id^0.3+uxid^0.3&f.typeFacet.facet.mincount=1&hl.fl=*&f.rFacet.facet.mincount=1&hl=true&rows=10&fl=*&debugQuery=true&start=0&q=pass+by+value&facet.field=typeFacet&facet.field=rFacet}
hits=18 status=0 QTime=213 
 
 
 
 and the response for it from UI is
 
 <?xml version="1.0" encoding="UTF-8" ?>
 <response>
   <lst name="responseHeader">
     <int name="status">0</int>
     <int name="QTime">578</int>
     <lst name="params">
       <str name="facet">true</str>
       <str name="f.typeFacet.facet.mincount">1</str>
       <str name="qf">name^2.3 text x_name^0.3 id^0.3 xid^0.3</str>
       <str name="hl.fl">*</str>
       <str name="hl">true</str>
       <str name="f.rFacet.facet.mincount">1</str>
       <str name="rows">10</str>
       <str name="debugQuery">true</str>
       <str name="fl">*</str>
       <str name="start">0</str>
       <str name="q">"pass by value"</str>
       <arr name="facet.field">
         <str>typeFacet</str>
         <str>rFacet</str>
       </arr>
       <str name="qt">dismax</str>
     </lst>
   </lst>
   <result name="response" numFound="18" start="0">...</result>
   <lst name="facet_counts">...</lst>
   <lst name="highlighting">...</lst>
   <lst name="debug">
     <str name="rawquerystring">"pass by value"</str>
     <str name="querystring">"pass by value"</str>
     <str name="parsedquery">+DisjunctionMaxQuery((xid:"pass by value"^0.3 |
 id:"pass by value"^0.3 | x_name:"pass ? value"^0.3 | text:"pass ? value" |
 name:"pass ? value"^2.3)) ()</str>
     <str name="parsedquery_toString">+(xid:"pass by value"^0.3 | id:"pass by
 value"^0.3 | x_name:"pass ? value"^0.3 | text:"pass ? value" | name:"pass ?
 value"^2.3) ()</str>
     <lst name="explain">
       <str>

Re: Can this type of sorting/boosting be done by solr

2012-02-23 Thread rks_lucene
Dear Chantal,

Thanks for your reply, but that's not what I was asking.

Let me explain. The size of the list in AREFS would give me how many records
are *referred to by* an article, NOT how many records *refer to* an article.

Say an article, id 51463, was published in 2002 and refers to 10 articles
dating from 1990-2002. Then the count of AREFS would be 10, which is static
once the journal has been published.

However, if the same article is *referred to* by 20 articles published from
2003-2012, then I am talking about this count of 20. This count is dynamic:
as we keep adding records to the index, more articles will refer to article
51463 in their AREFS field in the future.
/(Obviously, when we add article 51463 to the index we have no clue who will
refer to it in the future, so we cannot keep another field in it for this,
nor can we update 51463 every time someone refers to it.)/

So today, if I want to know who is referring to 51463, I actually search for
this id in the AREFS field. The query is as simple as q=AREFS:51463; it will
give me the list of articles from 2003 to 2012, and the result count would
be 20.

So back to the question: say my search query is q=AT:metal and it gives me
1700 results. How can I sort those 1700 results by those that have received
the maximum number of citations to date (i.e., that would return the most
results if I individually searched for their ids in the AREFS field)?

Hope this makes it clear. I feel this is a sort/boost-by-function-query
candidate, but I am not able to figure it out.

Thanks
Ritesh  
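
The inbound count Ritesh describes can be computed outside Solr by inverting
the AREFS lists over the whole corpus. A minimal sketch with in-memory data;
in practice this would be a batch job over the index or the source database:

```python
from collections import Counter

def inbound_citation_counts(docs):
    """For each article id, count how many other articles list it in AREFS."""
    counts = Counter()
    for doc in docs:
        for cited_id in doc.get("AREFS", []):
            counts[cited_id] += 1
    return counts

corpus = [
    {"AID": "51463", "AREFS": []},
    {"AID": "60001", "AREFS": ["51463"]},
    {"AID": "60002", "AREFS": ["51463", "60001"]},
]
counts = inbound_citation_counts(corpus)
```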

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Can-this-type-of-sorting-boosting-be-done-by-solr-tp3769315p3769475.html
Sent from the Solr - User mailing list archive at Nabble.com.


Range Query with sensitive Scoring

2012-02-23 Thread Hannes Carl Meyer
Hello,

I have an integer field which carries a value between 0 and 18.

Is there a way to query this field fuzzily? For example, search for field:5
and also match documents near it (like documents containing field:4 or
field:6)?

And if this is possible, is it also possible to boost exact matches and
lower the boost for the fuzzy matches?

Thanks in advance and kind regards

Hannes


Re: SnapPull failed :org.apache.solr.common.SolrException: Error opening new searcher

2012-02-23 Thread eks dev
It looks like it works. With the patch, after a couple of hours of testing
under the same conditions, I didn't see it happen (without it, it happened
approx. every 15 minutes).

I do not think it will happen again with this patch.

Thanks again, and my respect for your debugging capacity; my bug report
was really thin.


On Thu, Feb 23, 2012 at 8:47 AM, eks dev eks...@yahoo.co.uk wrote:
 thanks Mark, I will give it a go and report back...

 On Thu, Feb 23, 2012 at 1:31 AM, Mark Miller markrmil...@gmail.com wrote:
 Looks like an issue around replication IndexWriter reboot, soft commits and 
 hard commits.

 I think I've got a workaround for it:

 Index: solr/core/src/java/org/apache/solr/handler/SnapPuller.java
 ===
 --- solr/core/src/java/org/apache/solr/handler/SnapPuller.java  (revision 
 1292344)
 +++ solr/core/src/java/org/apache/solr/handler/SnapPuller.java  (working 
 copy)
 @@ -499,6 +499,17 @@

       // reboot the writer on the new index and get a new searcher
       solrCore.getUpdateHandler().newIndexWriter();
 +      Future[] waitSearcher = new Future[1];
 +      solrCore.getSearcher(true, false, waitSearcher, true);
 +      if (waitSearcher[0] != null) {
 +        try {
 +         waitSearcher[0].get();
 +       } catch (InterruptedException e) {
 +         SolrException.log(LOG,e);
 +       } catch (ExecutionException e) {
 +         SolrException.log(LOG,e);
 +       }
 +     }
       // update our commit point to the right dir
       solrCore.getUpdateHandler().commit(new CommitUpdateCommand(req, 
 false));

 That should allow the searcher that the following commit command prompts to 
 see the *new* IndexWriter.

 On Feb 22, 2012, at 10:56 AM, eks dev wrote:

 We started observing strange failures from ReplicationHandler when we
 commit on master trunk version 4-5 days old.
 It works sometimes and sometimes not; we didn't dig deeper yet.

 Looks like the real culprit hides behind:
 org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed

 Looks familiar to somebody?


 120222 154959 SEVERE SnapPull failed
 :org.apache.solr.common.SolrException: Error opening new searcher
    at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1138)
    at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1251)
    at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1043)
    at org.apache.solr.update.DirectUpdateHandler2.commit(Unknown Source)
    at org.apache.solr.handler.SnapPuller.doCommit(SnapPuller.java:503)
    at 
 org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:348)
    at org.apache.solr.handler.ReplicationHandler.doFetch(Unknown Source)
    at org.apache.solr.handler.SnapPuller$1.run(SnapPuller.java:163)
    at 
 java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at 
 java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:351)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:178)
    at 
 java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
    at 
 java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
    at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
    at java.lang.Thread.run(Thread.java:722)
 Caused by: org.apache.lucene.store.AlreadyClosedException: this
 IndexWriter is closed
    at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:810)
    at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:815)
    at 
 org.apache.lucene.index.IndexWriter.nrtIsCurrent(IndexWriter.java:3984)
    at 
 org.apache.lucene.index.StandardDirectoryReader.doOpenFromWriter(StandardDirectoryReader.java:254)
    at 
 org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:233)
    at 
 org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:223)
    at 
 org.apache.lucene.index.DirectoryReader.openIfChanged(DirectoryReader.java:170)
    at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1095)
    ... 15 more

 - Mark Miller
 lucidimagination.com













Re: Range Query with sensitive Scoring

2012-02-23 Thread Ahmet Arslan
 [...]

Yes, it is possible with query manipulation.

If you are using the lucene query parser: q=+field:{4 TO 6} field:5^10
If you are using the edismax query parser: q=field:{4 TO 6}&bq=field:5^10
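
The same two-part query (a numeric window plus a boost on the exact value)
can be assembled programmatically. A sketch under stated assumptions: the
field name `field` is a placeholder, and this version uses inclusive
brackets `[low TO high]` so that the neighbouring values themselves match
(exclusive `{...}` brackets would exclude the endpoints):

```python
from urllib.parse import urlencode

def fuzzy_int_params(field, value, spread=1, exact_boost=10):
    """Build edismax params: an inclusive numeric window plus an exact-value boost."""
    low, high = value - spread, value + spread
    return {
        "q": f"{field}:[{low} TO {high}]",       # inclusive range: neighbours match
        "bq": f"{field}:{value}^{exact_boost}",  # exact hits score higher
        "defType": "edismax",
    }

params = fuzzy_int_params("field", 5)
query_string = urlencode(params)  # ready to append to a /select? request
```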


Re: Can this type of sorting/boosting be done by solr

2012-02-23 Thread Lee Carroll
Have you looked at external fields?

 
http://lucidworks.lucidimagination.com/display/solr/Solr+Field+Types#SolrFieldTypes-WorkingwithExternalFiles

You will need a process to do the counts, and note the limitation that
updates are only visible after a commit, but I think it would fit your use
case.
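
Feeding the external field is a small batch job. A sketch, assuming the
ExternalFileField convention of a file named external_<fieldname> (optionally
with a .txt suffix) in the index data directory, one "docid=value" line per
document; `citation_count` is a hypothetical field name, and the counts here
are placeholder data:

```python
import os
import tempfile

# Inbound citation counts per article id (computed elsewhere).
counts = {"51463": 20, "60001": 1}

# Write the counts in ExternalFileField format: one "docid=value" line each.
path = os.path.join(tempfile.mkdtemp(), "external_citation_count.txt")
with open(path, "w") as fh:
    for doc_id, count in sorted(counts.items()):
        fh.write(f"{doc_id}={count}\n")
```

In a real deployment the file would be dropped into the core's data
directory and picked up after a commit, per the limitation Lee notes.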



On 23 February 2012 12:04, rks_lucene ppro.i...@gmail.com wrote:
 [...]


Re: Can this type of sorting/boosting be done by solr

2012-02-23 Thread Chantal Ackermann
Sorry to have misunderstood.
It seems the new Relevance Functions in Solr 4.0 might help, unless you
need to use an official release.

http://wiki.apache.org/solr/FunctionQuery#Relevance_Functions



On Thu, 2012-02-23 at 13:04 +0100, rks_lucene wrote:
 [...]



Re: Solr Performance Improvement and degradation Help

2012-02-23 Thread Erick Erickson
It's pretty hard to say, even with the data you've provided. But
try adding debugQuery=on and look particularly near the bottom;
there will be a <lst name="timing"> section. That section lists the
time taken by all the components of a search (highlighting, etc.),
not just the QTime, which can often give a clue where the time is
spent.

What sort of wildcards are you using? Did you have to bump
maxBooleanClauses?

This is a bit puzzling, though.

Best
Erick

On Wed, Feb 22, 2012 at 3:16 PM, naptowndev naptowndev...@gmail.com wrote:
 As an update to this... I tried running a query against the
 4.0.0.2010.12.10.08.54.56 version and the newer 4.0.0.2012.02.16 (both on
 the same box). The query params were the same and the returned results were
 the same, but 4.0.0.2010.12.10.08.54.56 returned the results in about 1.6
 seconds and the newer (4.0.0.2012.02.16) version returned the results in
 about 4 seconds.

 If I add the wildcard field list to the newer version, the time increases
 anywhere from .5-1 second.

 These are all averages after running the queries several times over a 30
 minute period. (allowing for warming and cache).

 Anybody have any insight into why the newer versions are performing a bit
 slower?

 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Solr-Performance-Improvement-and-degradation-Help-tp3767015p3767725.html
 Sent from the Solr - User mailing list archive at Nabble.com.


Re: Same id on two shards

2012-02-23 Thread Erick Erickson
I really think you'll be in a world of hurt if you have the same
ID on different shards. I just wouldn't go there. The statement
"may be non-deterministic" should be taken to mean that this
is just unsupported.

Why is this the case? What is the use case for putting the
same ID on different shards? This seems like an XY problem...

Best
Erick

On Wed, Feb 22, 2012 at 4:43 PM, jerry.min...@gmail.com
jerry.min...@gmail.com wrote:
 Hi,

 I stumbled across this thread after running into the same question. The
 answers presented here seem a little vague and I was hoping to renew the
 discussion.

 I am using a branch of Solr 4, distributed searching over 12 shards.
 I want the documents in the first shard to always be selected over
 documents that appear in the other 11 shards.

 The queries to these shards look something like this:
 http://solrserver/shard_1_app/select?shards=solr_server:/shard_1_app/,solr_server:/shard_2_app,
 ... ,solr_server:/shard_12_app&q=id:

 When I execute a query for an ID that I know exists in shard_1 and another
 shard, I do always get the result from shard 1.

 Here's some questions that I have:
 1. Has anyone rigorously tested the comment in the wiki If docs with
 duplicate unique keys are encountered, Solr will make an attempt to return
 valid results, but the behavior may be non-deterministic.

 2. Who is relying on this behavior (the document of the first shard is
 returned) today? When do you notice the wrong document is selected? Do you
 have a feeling for how frequently your distributed search returns the
 document from a shard other than the first?

 3. Is there a good web source other than the Solr wiki for information
 about Solr distributed queries?


 Thanks,
 Jerry M.


 On Mon, Aug 8, 2011 at 7:41 PM, simon mtnes...@gmail.com wrote:

 I think the first one to respond is indeed the way it works, but
 that's only deterministic up to a point (if your small index is in the
 throes of a commit and everything required for a response happens to
 be  cached on the larger shard ... who knows ?)

 On Mon, Aug 8, 2011 at 7:10 PM, Shawn Heisey s...@elyograg.org wrote:
  On 8/8/2011 4:07 PM, simon wrote:
 
  Only one should be returned, but it's non-deterministic. See
 
 
 http://wiki.apache.org/solr/DistributedSearch#Distributed_Searching_Limitations
 
  I had heard it was based on which one responded first.  This is part of
 why
  we have a small index that contains the newest content and only
 distribute
  content to the other shards once a day.  The hope is that the small index
  (less than 1GB, fits into RAM on that virtual machine) will always
 respond
  faster than the other larger shards (over 18GB each).  Is this an
 incorrect
  assumption on our part?
 
  The build system does do everything it can to ensure that periods of
 overlap
  are limited to the time it takes to commit a change across all of the
  shards, which should amount to just a few seconds once a day.  There
 might
  be situations when the index gets out of whack and we have duplicate id
  values for a longer time period, but in practice it hasn't happened yet.
 
  Thanks,
  Shawn
 
 



Re: Trunk build errors

2012-02-23 Thread Erick Erickson
There was recently some work done to get better about checking
on licenses, when did you last get trunk? About 9 days ago was
the last go-round.

And did you do an 'ant clean'?

It works on my machine with a fresh pull this morning.

Best
Erick

On Wed, Feb 22, 2012 at 5:27 PM, Darren Govoni dar...@ontrenet.com wrote:
 Hi,
  I am getting numerous errors preventing a build of solrcloud trunk.

  [licenses] MISSING LICENSE for the following file:
 

 Any tips to get a clean build working?

 thanks




Re: Can this type of sorting/boosting be done by solr

2012-02-23 Thread rks_lucene
Hi Chantal,

Yes, I have thought about the docfreq(field_name,'search_text') function,
but somehow I will have to dereference the article ids (AID) from the
results of the query into the sort. The query below does not work:

q=AT:metal&sort=docfreq(AREFS,$q.AID)

Is there a mistake in the query that I am missing, or is dereferencing not
supported in relevance functions?

Thanks,
Ritesh




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Can-this-type-of-sorting-boosting-be-done-by-solr-tp3769315p3769779.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Trunk build errors

2012-02-23 Thread darren
I updated yesterday and did an ant clean, ant test.

I will try a clean pull next.

I'm on linux. Perhaps an ant version issue?

 There was recently some work done to get better about checking
 on licenses, when did you last get trunk? About 9 days ago was
 the last go-round.

 And did you do an 'ant clean'?

 It works on my machine with a fresh pull this morning.

 Best
 Erick

 On Wed, Feb 22, 2012 at 5:27 PM, Darren Govoni dar...@ontrenet.com
 wrote:
 Hi,
  I am getting numerous errors preventing a build of solrcloud trunk.

  [licenses] MISSING LICENSE for the following file:
 

 Any tips to get a clean build working?

 thanks






Re: Unique key constraint and optimistic locking (versioning)

2012-02-23 Thread Per Steffensen

Em wrote:

Hi Per,

Solr provides the so called UniqueKey-field.
Refer to the Wiki to learn more:
http://wiki.apache.org/solr/UniqueKey
  
I believe the uniqueKey does not enforce a unique key constraint, i.e. it 
does not prevent you from creating a document with an id when a document 
with the same id already exists. So it is not the whole solution.
  

Optimistic locking (versioning)


... is not provided by Solr out of the box. If you add a new document
with the same UniqueKey it replaces the old one.
You have to do the versioning on your own (and keep in mind concurrent
updates).

Kind regards,
Em

On 21.02.2012 13:50, Per Steffensen wrote:
  

Hi

Does solr/lucene provide any mechanism for unique key constraint and
optimistic locking (versioning)?
Unique key constraint: That a client will not succeed creating a new
document in solr/lucene if a document already exists having the same
value in some field (e.g. an id field). Of course implemented right, so
that even though two or more threads are concurrently trying to create a
new document with the same value in this field, only one of them will
succeed.
Optimistic locking (versioning): That a client will only succeed
updating a document if this updated document is based on the version of
the document currently stored in solr/lucene. Implemented in the
optimistic way that clients during an update have to tell which version
of the document they fetched from Solr and that they therefore have used
as a starting-point for their updated document. So basically having a
version field on the document that clients increase by one before
sending to solr for update, and some code in Solr that only makes the
update succeed if the version number of the updated document is exactly
one higher than the version number of the document already stored. Of
course again implemented right, so that even though two or more threads
are concurrently trying to update a document, and they all have their
updated document based on the current version in solr/lucene, only one
of them will succeed.

Or do I have to do stuff like this myself outside solr/lucene - e.g. in
the client using Solr?

Regards, Per Steffensen
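
The insert/update semantics Per asks for can be sketched outside Solr, in
client code. A minimal in-memory illustration (Python; purely hypothetical —
Solr itself offers no such check in this version):

```python
import threading

class VersionedStore:
    """Toy sketch of the unique-key-constraint and optimistic-locking
    semantics described above. All names here are illustrative."""

    def __init__(self):
        self._docs = {}            # id -> (version, doc)
        self._lock = threading.Lock()

    def insert(self, doc_id, doc):
        # INSERT semantics: fail if a document with this id already exists.
        with self._lock:
            if doc_id in self._docs:
                raise KeyError("duplicate id: %s" % doc_id)
            self._docs[doc_id] = (1, doc)

    def update(self, doc_id, doc, based_on_version):
        # UPDATE semantics: only succeed if the caller based the update
        # on the version currently stored (optimistic locking).
        with self._lock:
            current_version, _ = self._docs[doc_id]
            if current_version != based_on_version:
                raise RuntimeError("stale version for id: %s" % doc_id)
            self._docs[doc_id] = (current_version + 1, doc)
```

The lock makes the check-then-write atomic, which is exactly the part Solr
does not expose at this point.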




  




Re: Unique key constraint and optimistic locking (versioning)

2012-02-23 Thread Em
Hi Per,

well, Solr has no update method like an RDBMS has; an update is a re-insert
of the whole document. Therefore adding a document with an existing UniqueKey
marks the old document as deleted and inserts the new one.
However this is not the whole story, since this constraint only works
per index/SolrCore/Shard (depending on your use-case).

Does this help you?

Kind regards,
Em

On 23.02.2012 15:34, Per Steffensen wrote:
 Em wrote:
 Hi Per,

 Solr provides the so called UniqueKey-field.
 Refer to the Wiki to learn more:
 http://wiki.apache.org/solr/UniqueKey
   
 I believe the uniqueKey does not enforce a unique key constraint, i.e. it
 does not prevent you from creating a document with an id when a document
 with the same id already exists. So it is not the whole solution.
  
 Optimistic locking (versioning)
 
 ... is not provided by Solr out of the box. If you add a new document
 with the same UniqueKey it replaces the old one.
 You have to do the versioning on your own (and keep in mind concurrent
 updates).

 Kind regards,
 Em

 On 21.02.2012 13:50, Per Steffensen wrote:
  
 Hi

 Does solr/lucene provide any mechanism for unique key constraint and
 optimistic locking (versioning)?
 Unique key constraint: That a client will not succeed creating a new
 document in solr/lucene if a document already exists having the same
 value in some field (e.g. an id field). Of course implemented right, so
 that even though two or more threads are concurrently trying to create a
 new document with the same value in this field, only one of them will
 succeed.
 Optimistic locking (versioning): That a client will only succeed
 updating a document if this updated document is based on the version of
 the document currently stored in solr/lucene. Implemented in the
 optimistic way that clients during an update have to tell which version
 of the document they fetched from Solr and that they therefore have used
 as a starting-point for their updated document. So basically having a
 version field on the document that clients increase by one before
 sending to solr for update, and some code in Solr that only makes the
 update succeed if the version number of the updated document is exactly
 one higher than the version number of the document already stored. Of
 course again implemented right, so that even though two or more threads
 are concurrently trying to update a document, and they all have their
 updated document based on the current version in solr/lucene, only one
 of them will succeed.

 Or do I have to do stuff like this myself outside solr/lucene - e.g. in
 the client using solr.

 Regards, Per Steffensen

 

   
 
 


Re: Solr Performance Improvement and degradation Help

2012-02-23 Thread naptowndev
Erick -

Agreed, it is puzzling.

What I've found is that it doesn't matter if I pass in wildcards for the
field list or not...but that the overall response time from the newer builds
of Solr that we've tested (e.g. 4.0.0.2012.02.16) is slower than the older
(4.0.0.2010.12.10.08.54.56) build.  

If I run the exact same query against those two cores, bringing back a
payload of just over 13MB (xml), the older build brings it back in about 1.6
seconds and the newer build brings it back in about 8.4 seconds.

Implementing the field list wildcard allows us to reduce the payload in the
newer build (not an option in the older build).  The payload is reduced to
1.8MB but takes over 3.5 seconds to come back, as compared to the full
payload (13MB) in the older build at about 1.6 seconds.  

With everything else remaining the same (machine/processors/memory/network
and the code base calling Solr) it seems to point to something in the newer
builds that's causing the slowdown, but I'm not intimate enough with Solr to
be able to figure that out.

We are using the debugQuery=on in our test to see timings and they aren't
showing any anomalies, so that makes it even more confusing.

From a wildcard perspective, it's on the fl parameter... here's a 'snippet'
of part of our fl parameter for the query

fl=id, CategoryGroupTypeID, MedicalSpecialtyDescription, TermsMisspelled,
DictionarySource, timestamp, Category_*_MemberReports,
Category_*_MemberReportRange, Category_*_NonMemberReports, Category_*_Grade,
Category_*_GradeDisplay, Category_*_GradeTier, Category_*_ReportLocations,
Category_*_ReportLocationCoordinates, Category_*_coordinate, score

Please note that the fl param is greatly reduced from our full query; we
have over 100 static fields and a slew of dynamic fields, but that should
give you an idea of how we are using wildcards.

I'm not sure about the maxBooleanClauses...not being all that familiar with
Solr, does that apply to wildcards used in the fl list?

Thanks!

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-Performance-Improvement-and-degradation-Help-tp3767015p3769995.html
Sent from the Solr - User mailing list archive at Nabble.com.


How to retrieve tokens?

2012-02-23 Thread Thiago
Hi everybody,

My name is Thiago and I'm new to Apache Solr and NoSQL databases. At the
moment I'm working with Solr for document indexing. My question is: is
there any way to retrieve the tokens in place of the original data?

For example:
I have a field using the fieldtype text_general from the original
schema.xml. If I insert a document with the string "All you need is love"
in this field, the tokens that I get are: all, you, need, love.
When I search this index, I want to get the tokens (all, you, need, love)
instead of the indexed string.

I searched for this on the web and in this forum too, but I saw some people
saying to use TermVectorsComponent. Is there an easier way to do it? As far
as I saw, TermVectorsComponent is more difficult and uses more memory.

Thanks to everybody.

Thiago


--
View this message in context: 
http://lucene.472066.n3.nabble.com/How-to-retrieve-tokens-tp3770007p3770007.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Unique key constraint and optimistic locking (versioning)

2012-02-23 Thread Per Steffensen

Em wrote:

Hi Per,

well, Solr has no Update-Method like a RDBMS. It is a re-insert of the
whole document. Therefore a document with an existing UniqueKey marks
the old document as deleted and inserts the new one.
  
Yes, I understand. But that is not always what I want to achieve. I want an 
error to occur when a document with the same id already exists and my 
intent is to INSERT a new document. When my intent is to UPDATE a 
document in solr/lucene, I want the old document already in solr/lucene 
deleted and the new version of this document added (exactly as you 
explain). It will not be possible for solr/lucene to decide what to do 
unless I give it some information about my intent - whether I want INSERT 
or UPDATE semantics. I guess solr/lucene always gives me INSERT 
semantics when a document with the same id does not already exist, and 
always gives me UPDATE semantics when a document with the same id 
does exist? I cannot decide?

However this is not the whole story, since this constraint only works
per index/SolrCore/Shard (depending on your use-case).
  
Yes I know. But with the right routing strategy based on id's I will be 
able to achieve what I want if the feature were just there per 
index/core/shard.

Does this help you?
  
Yes, it helps me be sure that what I am looking for is not there. There 
is no built-in way to make solr/lucene give me an error if I try 
to insert a new document with an id equal to that of a document already in 
the index/core/shard. The existing document will always be updated 
(implemented as old deleted and new added). Correct?

Kind regards,
Em
  

Regards, Per Steffensen



RE: Trunk build errors

2012-02-23 Thread Steven A Rowe
Hi Darren,

I use Ant 1.7.1.  There have been some efforts to make the build work with Ant 
1.8.X, but it is not (yet) the required version.  So if you're not using Ant 
1.7.1, I suggest you try it.

Steve

 -Original Message-
 From: dar...@ontrenet.com [mailto:dar...@ontrenet.com]
 Sent: Thursday, February 23, 2012 8:59 AM
 To: solr-user@lucene.apache.org
 Subject: Re: Trunk build errors
 
 I updated yesterday and did an ant clean, ant test.
 
 I will try a clean pull next.
 
 I'm on linux. Perhaps an ant version issue?
 
  There was recently some work done to get better about checking
  on licenses, when did you last get trunk? About 9 days ago was
  the last go-round.
 
  And did you do an 'ant clean'?
 
  It works on my machine with a fresh pull this morning.
 
  Best
  Erick
 
  On Wed, Feb 22, 2012 at 5:27 PM, Darren Govoni dar...@ontrenet.com
  wrote:
  Hi,
   I am getting numerous errors preventing a build of solrcloud trunk.
 
   [licenses] MISSING LICENSE for the following file:
  
 
  Any tips to get a clean build working?
 
  thanks
 
 
 



Re: String search in Dismax handler

2012-02-23 Thread Erick Erickson
OK, I really don't get this. The quoted bit gives:
+DisjunctionMaxQuery((xid:"pass by value"^0.3 | id:"pass by value"^0.3 |
x_name:"pass ? value"^0.3 | text:"pass ? value" | name:"pass ?
value"^2.3))

The bare bit gives:
+((DisjunctionMaxQuery((uxid:pass^0.3 | id:pass^0.3 | x_name:pass^0.3
| text:loan | name:pass^2.3))
DisjunctionMaxQuery((uxid:by^0.3 | id:by^0.3))
DisjunctionMaxQuery((uxid:value^0.3 | id:value^0.3 | x_name:value^0.3 |
text:value | name:value^2.3)))~3

In one case you're searching on xid, in the other uxid. The
unquoted case also has text:loan and id:by and id:value. Is that
where you are getting your hits?

Erick

On Thu, Feb 23, 2012 at 6:52 AM, mechravi25 mechrav...@yahoo.co.in wrote:
 HI Erick,

 Thanks for the response.

 I am currently using Solr 1.5.

 We are getting the following query when we give the search text Pass By
 Value without quotes and use qt=dismax in the request:
  webapp=/solr path=/select/
 params={facet=true&f.typeFacet.facet.mincount=1&qf=name^2.3+text+x_name^0.3+id^0.3+uxid^0.3&hl.fl=*&hl=true&f.rFacet.facet.mincount=1&rows=10&debugQuery=true&fl=*&start=0&q=pass+by+value&facet.field=typeFacet&facet.field=rFacet&qt=dismax}
 hits=0 status=0 QTime=63

 and the response for it in the UI is as follows

 <result name="response" numFound="0" start="0" />
 <lst name="facet_counts">
  <lst name="facet_queries" />
  <lst name="facet_fields">
   <lst name="typeFacet" />
   <lst name="rFacet" />
  </lst>
  <lst name="facet_dates" />
 </lst>
 <lst name="highlighting" />
 <lst name="debug">
  <str name="rawquerystring">pass by value</str>
  <str name="querystring">pass by value</str>
  <str name="parsedquery">+((DisjunctionMaxQuery((uxid:pass^0.3 |
  id:pass^0.3 | x_name:pass^0.3 | text:loan | name:pass^2.3))
  DisjunctionMaxQuery((uxid:by^0.3 | id:by^0.3))
  DisjunctionMaxQuery((uxid:value^0.3 | id:value^0.3 | x_name:value^0.3 |
  text:value | name:value^2.3)))~3) ()</str>
  <str name="parsedquery_toString">+(((uxid:pass^0.3 | id:loan^0.3 |
  x_name:pass^0.3 | text:loan | name:pass^2.3) (uxid:by^0.3 | id:by^0.3)
  (uxid:value^0.3 | id:value^0.3 | x_name:value^0.3 | text:value |
  name:value^2.3))~3) ()</str>
  <lst name="explain" />
  <str name="QParser">DisMaxQParser</str>
  <null name="altquerystring" />
  <null name="boostfuncs" />
  <lst name="timing">
   <double name="time">3.0</double>
   <lst name="prepare">
    <double name="time">1.0</double>
    <lst name="org.apache.solr.handler.component.QueryComponent">
     <double name="time">1.0</double>
    </lst>
    <lst name="org.apache.solr.handler.component.FacetComponent">
     <double name="time">0.0</double>
    </lst>
    <lst name="org.apache.solr.handler.component.MoreLikeThisComponent">
     <double name="time">0.0</double>
    </lst>
    <lst name="org.apache.solr.handler.component.HighlightComponent">
     <double name="time">0.0</double>
    </lst>
    <lst name="org.apache.solr.handler.component.StatsComponent">
     <double name="time">0.0</double>
    </lst>
    <lst name="org.apache.solr.handler.component.DebugComponent">
     <double name="time">0.0</double>
    </lst>
   </lst>
   <lst name="process">
    <double name="time">2.0</double>
    <lst name="org.apache.solr.handler.component.QueryComponent">
     <double name="time">1.0</double>
    </lst>
    <lst name="org.apache.solr.handler.component.FacetComponent">
     <double name="time">0.0</double>
    </lst>
    <lst name="org.apache.solr.handler.component.MoreLikeThisComponent">
     <double name="time">0.0</double>
    </lst>
    <lst name="org.apache.solr.handler.component.HighlightComponent">
     <double name="time">1.0</double>
    </lst>
    <lst name="org.apache.solr.handler.component.StatsComponent">
     <double name="time">0.0</double>
    </lst>
    <lst name="org.apache.solr.handler.component.DebugComponent">
     <double name="time">0.0</double>
    </lst>
   </lst>
  </lst>
 </lst>
 </response>

 whereas we get the following query when we remove the qt=dismax parameter
 from the request, and this fetches the required results.

 webapp=/solr path=/select/
 params={facet=true&qf=name^2.3+text+x_name^0.3+id^0.3+uxid^0.3&f.typeFacet.facet.mincount=1&hl.fl=*&f.rFacet.facet.mincount=1&hl=true&rows=10&fl=*&debugQuery=true&start=0&q=pass+by+value&facet.field=typeFacet&facet.field=rFacet}
 hits=9203 status=0 QTime=1158

 In another case, where we use Pass by Value with quotes and also with
 qt=dismax in the request handler, the search query fetches the right
 values. The following is the concerned query.

  webapp=/solr path=/select/
 params={facet=true&qf=name^2.3+text+x_name^0.3+id^0.3+uxid^0.3&f.typeFacet.facet.mincount=1&hl.fl=*&f.rFacet.facet.mincount=1&hl=true&rows=10&fl=*&debugQuery=true&start=0&q="pass+by+value"&facet.field=typeFacet&facet.field=rFacet}
 hits=18 status=0 QTime=213



  and the response for it from UI is

  <?xml version="1.0" encoding="UTF-8" ?>
  <response>
  <lst name="responseHeader">
   <int name="status">0</int>
   <int name="QTime">578</int>
  <lst name="params">
   <str name="facet">true</str>
   <str name="f.typeFacet.facet.mincount">1</str>
   <str name="qf">name^2.3 text x_name^0.3 id^0.3 xid^0.3</str>
   <str name="hl.fl">*</str>
   <str name="hl">true</str>
   <str name="f.rFacet.facet.mincount">1</str>
   <str name="rows">10</str>
   <str name="debugQuery">true</str>
   <str name="fl">*</str>
   <str name="start">0</str>
   <str 

Re: Solr Performance Improvement and degradation Help

2012-02-23 Thread Erick Erickson
Ah, no, my mistake. The wildcards for the fl list won't matter re:
maxBooleanClauses,
I didn't read carefully enough.

I assume that just returning a field or two doesn't slow down

But one possible culprit, especially since you say this kicks in after
a while, is garbage collection. Here's an excellent intro:

http://www.lucidimagination.com/blog/2011/03/27/garbage-collection-bootcamp-1-0/

Especially look at the "getting a view into garbage collection"
section and try specifying
those options. The result should be that your solr log gets stats
dumped every time
GC kicks in. If this is a problem, look at the times in the logfile
after your system slows
down. You'll see a bunch of GC dumps that collect very little unused
memory. You can
also connect to the process using jConsole (should be in the Java
distro) and watch
the memory tab, especially after your server has slowed down. You can also
connect jConsole remotely...

This is just an experiment, but any time I see "it slows down
after ### minutes",
GC is the first thing I think of.
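
For reference, the "view into garbage collection" usually means starting the
JVM with GC logging enabled; one common set of HotSpot options from this era
(exact flags vary by JVM version and vendor, and start.jar is the example
Jetty launcher) looks like:

```shell
java -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
     -Xloggc:gc.log -jar start.jar
```

With these in place, each collection is appended to gc.log, so slowdowns can
be correlated with GC activity as described above.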


Best
Erick


On Thu, Feb 23, 2012 at 10:16 AM, naptowndev naptowndev...@gmail.com wrote:
 Erick -

 Agreed, it is puzzling.

 What I've found is that it doesn't matter if I pass in wildcards for the
 field list or not...but that the overall response time from the newer builds
 of Solr that we've tested (e.g. 4.0.0.2012.02.16) is slower than the older
 (4.0.0.2010.12.10.08.54.56) build.

 If I run the exact same query against those two cores, bringing back a
 payload of just over 13MB (xml), the older build brings it back in about 1.6
 seconds and the newer build brings it back in about 8.4 seconds.

 Implementing the field list wildcard allows us to reduce the payload in the
 newer build (not an option in the older build).  The payload is reduced to
 1.8MB but takes over 3.5 seconds to come back as compared to the full
 payload (13MB) in the older build at about 1.6 seconds.

 With everything else remaining the same (machine/processors/memory/network
 and the code base calling Solr) it seems to point to something in the newer
 builds that's causing the slowdown, but I'm not intimate enough with Solr to
 be able to figure that out.

 We are using the debugQuery=on in our test to see timings and they aren't
 showing any anomalies, so that makes it even more confusing.

 From a wildcard perspective, it's on the fl parameter... here's a 'snippet'
 of part of our fl parameter for the query

 fl=id, CategoryGroupTypeID, MedicalSpecialtyDescription, TermsMisspelled,
 DictionarySource, timestamp, Category_*_MemberReports,
 Category_*_MemberReportRange, Category_*_NonMemberReports, Category_*_Grade,
 Category_*_GradeDisplay, Category_*_GradeTier, Category_*_ReportLocations,
 Category_*_ReportLocationCoordinates, Category_*_coordinate, score

 Please note that the fl param is greatly reduced from our full query; we
 have over 100 static fields and a slew of dynamic fields, but that should
 give you an idea of how we are using wildcards.

 I'm not sure about the maxBooleanClauses...not being all that familiar with
 Solr, does that apply to wildcards used in the fl list?

 Thanks!

 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Solr-Performance-Improvement-and-degradation-Help-tp3767015p3769995.html
 Sent from the Solr - User mailing list archive at Nabble.com.


Re: How to retrieve tokens?

2012-02-23 Thread Erick Erickson
Essentially, you're talking about reconstructing the field from the
tokens, and that's pretty difficult in general, and lossy. For instance,
if you use stemming and "running" gets stemmed to "run", you
get back just "run" from the index. Is that acceptable?

But otherwise, you've got to go into the low levels of Lucene to
get this info, and reassembling it is lengthy; I suspect you'd find
the performance unacceptable.

Why do you want to do this? This may be an XY problem.
http://people.apache.org/~hossman/#xyproblem
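
For completeness, if term vectors do turn out to be the right tool despite
the cost, enabling them is a schema change plus a query parameter; a hedged
sketch (the field name is illustrative):

```xml
<!-- schema.xml: store term vectors for the field (illustrative name) -->
<field name="content" type="text_general" indexed="true" stored="true"
       termVectors="true"/>
```

Queries can then request the vectors with tv=true (and optionally
tv.fl=content) against a request handler that includes the
TermVectorComponent.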

Best
Erick

On Thu, Feb 23, 2012 at 10:22 AM, Thiago thiagosousasilve...@gmail.com wrote:
 Hi to everybody,

 My name is Thiago and I'm new with Apache Solr and NoSQL databases. At the
 moment, I'm working and using Solr for document indexing. My Question is: Is
 there any way to retrieve the tokens in place of the original data?

 For example:
 I have a field using the fieldtype text_general from the original
 schema.xml. If I insert a document with the following string in this field:
 All you need is love, the tokens that I get are: all, you, need, love.
 When I search in this base, I want to get the tokens(all, you, need, love)
 in place of the indexed string.

 I searched for this on the web and in this forum too, but I saw some people
 saying to use TermVectorsComponent. Is there an easier way to do it? As far
 as I saw, TermVectorsComponent is more difficult and uses more memory.

 Thanks to everybody.

 Thiago


 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/How-to-retrieve-tokens-tp3770007p3770007.html
 Sent from the Solr - User mailing list archive at Nabble.com.


Re: Unique key constraint and optimistic locking (versioning)

2012-02-23 Thread Erick Erickson
Per:

Yep, you've got it. You could write a custom update handler that queried
(via TermDocs or something) for the ID when your intent was to
INSERT, but it'll have to be custom work. I suppose you could query
with a divide-and-conquer approach, that is, query for
id:(1 2 58 90... all your insert IDs) and go/no-go based on whether
your return had any hits, but that supposes you have some idea of
whether pre-existing documents are likely.

But Solr doesn't have anything like you're looking for.
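
The divide-and-conquer pre-check can live entirely in the client; a small
sketch of building those batched existence queries (Python, illustrative;
"id" is an assumed field name, the id:(...) syntax is standard Lucene):

```python
def existence_queries(ids, batch_size=100):
    """Yield Lucene queries of the form 'id:(v1 v2 ...)' that pre-check
    whether any of the given ids already exist in the index.
    Purely illustrative client-side helper."""
    for i in range(0, len(ids), batch_size):
        chunk = ids[i:i + batch_size]
        yield "id:(%s)" % " ".join(str(v) for v in chunk)

# Any batch that returns hits means at least one id is already taken;
# the caller can then go/no-go per batch, as suggested above.
```

Note this check is not atomic with the subsequent add, so a race between two
clients is still possible — which is exactly why a server-side constraint
was asked for.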

Best
Erick

On Thu, Feb 23, 2012 at 10:32 AM, Per Steffensen st...@designware.dk wrote:
 Em wrote:

 Hi Per,

 well, Solr has no Update-Method like a RDBMS. It is a re-insert of the
 whole document. Therefore a document with an existing UniqueKey marks
 the old document as deleted and inserts the new one.


 Yes, I understand. But that is not always what I want to achieve. I want an
 error to occur when a document with the same id already exists and my intent
 is to INSERT a new document. When my intent is to UPDATE a document in
 solr/lucene, I want the old document already in solr/lucene deleted and the
 new version of this document added (exactly as you explain). It will not be
 possible for solr/lucene to decide what to do unless I give it some
 information about my intent - whether I want INSERT or UPDATE semantics.
 I guess solr/lucene always gives me INSERT semantics when a document
 with the same id does not already exist, and always gives me UPDATE
 semantics when a document with the same id does exist? I cannot decide?

 However this is not the whole story, since this constraint only works
 per index/SolrCore/Shard (depending on your use-case).


 Yes I know. But with the right routing strategy based on id's I will be able
 to achieve what I want if the feature were just there per index/core/shard.

 Does this help you?


 Yes, it helps me be sure that what I am looking for is not there. There
 is no built-in way to make solr/lucene give me an error if I try to insert
 a new document with an id equal to that of a document already in the
 index/core/shard. The existing document will always be updated (implemented
 as old deleted and new added). Correct?

 Kind regards,
 Em


 Regards, Per Steffensen



Probleme with unicode query

2012-02-23 Thread Frederic Bouchery
hello,

I'm using Solr 3.5 over Tomcat 6 and I have some problems with unicode queries.

Here is my text field configuration:
<analyzer type="index">
  <charFilter class="solr.HTMLStripCharFilterFactory"/>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.StandardFilterFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.ElisionFilterFactory" articles="elisions.txt"/>
  <filter class="solr.StopFilterFactory" words="stopwords.txt"
ignoreCase="true"/>
  <filter class="solr.ASCIIFoldingFilterFactory"/>
  <filter class="solr.SnowballPorterFilterFactory" language="French"/>
</analyzer>
<analyzer type="query">
  <charFilter class="solr.HTMLStripCharFilterFactory"/>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.StandardFilterFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.ElisionFilterFactory" articles="elisions.txt"/>
  <filter class="solr.StopFilterFactory" words="stopwords.txt"
ignoreCase="true"/>
  <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
ignoreCase="true"/>
  <filter class="solr.ASCIIFoldingFilterFactory"/>
  <filter class="solr.SnowballPorterFilterFactory" language="French"/>
</analyzer>

When I perform this request: select/?q=hygiene sécurité&debugQuery=true
here is the debug info:
<str name="rawquerystring">hygiene sécurité</str>
<str name="querystring">hygiene sécurité</str>
<str name="parsedquery">searchText:hygien (searchText:sa
searchText:curit)</str>
<str name="parsedquery_toString">searchText:hygien (searchText:sa
searchText:curit)</str>

As you can see, the unicode request fails: searchText:sa searchText:curit
instead of searchText:securite.
I've tried ISOLatin1AccentFilterFactory and I've changed the order, but
no difference :(

Any ideas?

Thanks

Frederic



Re: Unique key constraint and optimistic locking (versioning)

2012-02-23 Thread Em
Hi Per,

 I want an error to occur if a document with the same id already
 exists, when my intent is to INSERT a new document. When my intent is
 to UPDATE a document in solr/lucene I want the old document already
 in solr/lucene deleted and the new version of this document added
 (exactly as you explain). It will not be possible for solr/lucene to
 decide what to do unless I give it some information about my intent -
 whether it is INSERT or UPDATE semantics I want. I guess solr/lucene
 always give me INSERT semantics when a document with the same id does
 not already exist, and that it always give me UPDATE semantics when a
 document with the same id does exist? I cannot decide?

Given that you've set a uniqueKey-field and there already exists a
document with that uniqueKey, it will delete the old one and insert the
new one. There is really no difference between the semantics - updates
do not exist.
To create a UNIQUE constraint as you know it from a database, you have to
check whether a document is already in the index *or* whether it is
already pending (waiting to be flushed to the index).
Fortunately, Solr manages a so-called pending set with all those
documents waiting to be flushed to disk (Solr 3.5).
I think you have to write your own DirectUpdateHandler to achieve what
you want at the Solr level, or extend Lucene's IndexWriter to do it at
the Lucene level.

While doing so, keep track of what is going on in the trunk and how
Near-Real-Time-Search will change the current way of handling updates.

 There is no built-in way to make solr/lucene give me an error if I
 try to insert a new document with an id equal to a document already
 in the index/core/shard. The existing document will always be updated
 (implemented as old deleted and new added). Correct?
Exactly.

If you really want to get your hands on that topic I suggest you to
learn more about Lucene's IndexWriter:

http://lucene.apache.org/core/old_versioned_docs/versions/3_5_0/api/all/index.html?org/apache/lucene/index/IndexWriter.html

Kind Regards,
Em


Re: Probleme with unicode query

2012-02-23 Thread Em
Hi Frederic,

I saw similar issues when sending such a request without proper
URL-encoding. It is important to note that the URL-encoded string
already has to be a UTF-8 string.
What happens if you send that query via Solr's admin-panel?

Have a look at this page for troubleshooting:
http://wiki.apache.org/solr/SolrTomcat
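
A quick way to check the client side of this is to percent-encode the UTF-8
bytes of the query before sending it, e.g. in Python (illustrative; any
language's URL-encoding routine works as long as it encodes UTF-8):

```python
from urllib.parse import quote

query = "hygiene sécurité"
# Percent-encode the UTF-8 bytes of the query string.
encoded = quote(query)
print(encoded)  # hygiene%20s%C3%A9curit%C3%A9
url = "http://localhost:8983/solr/select/?q=" + encoded + "&debugQuery=true"
```

If the server still mangles it, the Tomcat connector's URIEncoding setting
(covered on the SolrTomcat wiki page above) is the usual culprit.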

Kind regards,
Em

On 23.02.2012 18:15, Frederic Bouchery wrote:
 hello,
 
 I'm using Solr 3.5 over Tomcat 6 and I have some problems with unicode queries.
 
 Here is my text field configuration:
 <analyzer type="index">
   <charFilter class="solr.HTMLStripCharFilterFactory"/>
   <tokenizer class="solr.StandardTokenizerFactory"/>
   <filter class="solr.StandardFilterFactory"/>
   <filter class="solr.LowerCaseFilterFactory"/>
   <filter class="solr.ElisionFilterFactory" articles="elisions.txt"/>
   <filter class="solr.StopFilterFactory" words="stopwords.txt"
 ignoreCase="true"/>
   <filter class="solr.ASCIIFoldingFilterFactory"/>
   <filter class="solr.SnowballPorterFilterFactory" language="French"/>
 </analyzer>
 <analyzer type="query">
   <charFilter class="solr.HTMLStripCharFilterFactory"/>
   <tokenizer class="solr.StandardTokenizerFactory"/>
   <filter class="solr.StandardFilterFactory"/>
   <filter class="solr.LowerCaseFilterFactory"/>
   <filter class="solr.ElisionFilterFactory" articles="elisions.txt"/>
   <filter class="solr.StopFilterFactory" words="stopwords.txt"
 ignoreCase="true"/>
   <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
 ignoreCase="true"/>
   <filter class="solr.ASCIIFoldingFilterFactory"/>
   <filter class="solr.SnowballPorterFilterFactory" language="French"/>
 </analyzer>
 
 When I perform this request: select/?q=hygiene sécurité&debugQuery=true
 here is the debug info:
 <str name="rawquerystring">hygiene sécurité</str>
 <str name="querystring">hygiene sécurité</str>
 <str name="parsedquery">searchText:hygien (searchText:sa
 searchText:curit)</str>
 <str name="parsedquery_toString">searchText:hygien (searchText:sa
 searchText:curit)</str>
 
 As you can see, the unicode request fails: searchText:sa searchText:curit
 instead of searchText:securite.
 I've tried ISOLatin1AccentFilterFactory and I've changed the order, but
 no difference :(
 
 Any ideas?
 
 Thanks
 
 Frederic
 


undefined field on CSV db import

2012-02-23 Thread pmcgovern
I am trying to import a csv file of values via curl (PHP) and am receiving an
'undefined field' error, but I am not sure why, as I am defining the field.
Can someone lend some insight as to what I am missing / doing wrong? Thank
you in advance.

Sample of CSV File:
---
Product_ID  Product_Name  Product_ManufacturerPart  Product_Img 
ImageURL  Manufacturer_Name  lowestPrice  vendorCount
-2121813476  Over-the-Sink Dish Rack  123478   
http://image10.bizrate-images.com/resize?sq=60&uid=2511766107&mid=18900 
WALTERDRAKE  24.99  1  
-2121813460  Oregon Nike NCAA Twill Shorts - Mens - Green 
00025305XODR   
http://image10.bizrate-images.com/resize?sq=60&uid=2564249353&mid=23598 
Nike  44.99  3  
-2121813456  Sudden Change Under Eye Firming Serum  091777   
http://image10.bizrate-images.com/resize?sq=60&uid=2564994087&mid=18900 
WALTERDRAKE  19.99  1  
-2121813445  Global Keratin Leave-In Conditioner Cream  005248   
http://image10.bizrate-images.com/resize?sq=60&uid=2101271875&mid=21473 
Global Keratin  24  1  
-2121813443  Oregon Nike NCAA Twill Shorts - Mens - White 
00025305XODH   
http://image10.bizrate-images.com/resize?sq=60&uid=2564226023&mid=17345 
Nike  59.99  3  
-2121813441  Paul Brown Hawaii Shine Amplifier 4 oz.  000684   
http://image10.bizrate-images.com/resize?sq=60&uid=1171412855&mid=21473 
Paul Brown  20.1  1  
-2121813437  Dish Drying Mat Large  077608   
http://image10.bizrate-images.com/resize?sq=60&uid=1371997268&mid=18900 
WALTERDRAKE  14.99  1  


Solr Update URL:

http://localhost:8983/solr/db/update/csv?commit=true&header=true&separator=%09&escape=\\&fieldNames=Product_ID,Product_Name,Product_ManufacturerPart,Product_Img,ImageURL,Manufacturer_Name,lowestPrice,vendorCount


Error Output:
-
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"/>
<title>Error 400 undefined field Product_ID</title>
</head>
<body>
<h2>HTTP ERROR 400</h2>

<p>Problem accessing /solr/db/update/csv. Reason:
<pre>undefined field Product_ID</pre></p><hr /><small>Powered by
Jetty://</small><br/>

--
View this message in context: 
http://lucene.472066.n3.nabble.com/undefined-field-on-CSV-db-import-tp3770552p3770552.html
Sent from the Solr - User mailing list archive at Nabble.com.
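For comparison, the ampersand-separated update URL can be built programmatically. Note two assumptions here: Solr's CSV handler documents the field-list parameter in lowercase (fieldnames), and every listed field must also be defined in schema.xml; the 400 above is Solr reporting that Product_ID is not. This sketch only shows the URL construction, with host and path taken from the post:

```python
from urllib.parse import urlencode

# Parameter names follow the Solr CSV update handler; values match the
# tab-separated sample from the post.
params = {
    "commit": "true",
    "header": "true",      # first line of the file is a header row
    "separator": "\t",     # urlencode renders this as %09
    "escape": "\\",
    "fieldnames": ",".join([
        "Product_ID", "Product_Name", "Product_ManufacturerPart",
        "Product_Img", "ImageURL", "Manufacturer_Name",
        "lowestPrice", "vendorCount",
    ]),
}
url = "http://localhost:8983/solr/db/update/csv?" + urlencode(params)
print(url)
```

Building the query string this way also avoids shell quoting problems when the URL is passed to curl.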


autoGeneratePhraseQueries sort of silently set to false

2012-02-23 Thread Naomi Dushay
Another thing I noticed when upgrading from Solr 1.4 to Solr 3.5 had to do with 
results when there were hyphenated words:   aaa-bbb.   Erik Hatcher pointed me 
to the autoGeneratePhraseQueries attribute now available on fieldtype 
definitions in schema.xml.  This is a great feature, and everything is peachy 
if you start with Solr 3.4.   But many of us started earlier and are upgrading, 
and that's a different story.

It was surprising to me that

a.  the default for this new feature caused different search results than Solr 
1.4 

b.  it wasn't documented clearly, IMO

http://wiki.apache.org/solr/SchemaXml   makes no mention of it


In the schema.xml example, there is this at the top:

<!-- attribute "name" is the name of this schema and is only used for display 
purposes.
   Applications should change this to reflect the nature of the search 
collection.
   version="1.4" is Solr's version number for the schema syntax and 
semantics.  It should
   not normally be changed by applications.
   1.0: multiValued attribute did not exist, all fields are multiValued by 
nature
   1.1: multiValued attribute introduced, false by default 
   1.2: omitTermFreqAndPositions attribute introduced, true by default 
except for text fields.
   1.3: removed optional field compress feature
   1.4: default auto-phrase (QueryParser feature) to off
 -->

And there was this in a couple of field definitions:

<fieldType name="text_en_splitting" class="solr.TextField" 
positionIncrementGap="100" autoGeneratePhraseQueries="true">
<fieldType name="text_ja" class="solr.TextField" positionIncrementGap="100" 
autoGeneratePhraseQueries="false">

But that was it.
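Upgraders who want the old phrase-generating behavior regardless of the schema version default can set the attribute explicitly on each text field type. A sketch; the field type name and analyzer chain here are illustrative, not taken from the example schema:

```xml
<fieldType name="text" class="solr.TextField"
           positionIncrementGap="100"
           autoGeneratePhraseQueries="true">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

An explicit attribute overrides whatever default the schema version implies, so it survives future upgrades unchanged.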



Re: Multiple Property Substitution

2012-02-23 Thread entdeveloper
*bump*

I'm also curious whether something like this is possible. Being able to nest
property substitution variables, especially when using multiple cores, would
be a really slick feature.


Zach Friedland wrote
 
 Has anyone found a way to have multiple properties (override & default)? 
 What 
 I'd like to create is a default property with an override property that
 usually 
 wouldn't be set, but would be set as a JVM parameter if I want to turn off 
 replication on a particular index on a particular server.  I tried this
 syntax 
 but it didn't work...
 
  <requestHandler name="/replication" class="solr.ReplicationHandler">
    <lst name="slave">
      <str 
  name="enable">${Solr.enable.slave.core.override:${Solr.enable.slave.default:false}}</str>
    </lst>
  </requestHandler>
 
 Thanks
 


--
View this message in context: 
http://lucene.472066.n3.nabble.com/Multiple-Property-Substitution-tp2223781p3770649.html
Sent from the Solr - User mailing list archive at Nabble.com.
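The resolution order the nested syntax asks for (a per-core override property wins, then a shared default property, then a literal) can be sketched outside Solr. Here resolve is a hypothetical helper reading JVM-style properties from the environment; it is not a Solr API:

```python
import os

def resolve(name, default):
    """Return the named property if set, otherwise the supplied fallback."""
    return os.environ.get(name, default)

# Override wins if set; else the shared default; else the literal "false".
enable_slave = resolve(
    "Solr.enable.slave.core.override",
    resolve("Solr.enable.slave.default", "false"),
)
print(enable_slave)
```

With neither property set, the innermost literal wins, which is exactly the behavior the nested ${a:${b:false}} syntax is meant to express.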


Re: result present in Solr 1.4, but missing in Solr 3.5, dismax only

2012-02-23 Thread Naomi Dushay
Robert,

You found it!   it is the phrase slop.  What do I do now?   I am using Solr 
from trunk from December, and all those JIRA tixes are marked fixed …

- Naomi


Solr 1.4:

luceneQueryParser:

URL: q=all_search:"The Beatles as musicians : Revolver through the Anthology"~3
final query:  all_search:"the beatl as musician revolv through the antholog"~3

got result


Solr 3.5

luceneQueryParser:

URL: q=all_search:"The Beatles as musicians : Revolver through the Anthology"~3
final query:  all_search:"the beatl as musician revolv through the antholog"~3

NO result



 lucene QueryParser:
 
 URL:  q=all_search:"The Beatles as musicians : Revolver through the Anthology"
 final query:  all_search:"the beatl as musician revolv through the antholog"




On Feb 22, 2012, at 7:34 PM, Robert Muir [via Lucene] wrote:

 On Wed, Feb 22, 2012 at 7:35 PM, Naomi Dushay [hidden email] wrote: 
  Jonathan has brought it to my attention that BOTH of my failing searches 
  happen to have 8 terms, and one of the terms is repeated: 
  
   The Beatles as musicians : Revolver through the Anthology 
   Color-blindness [print/digital]; its dangers and its detection 
  
  but this is a PHRASE search. 
  
 
 Can you take your same phrase queries, and simply add some slop to 
 them (e.g. ~3) and ensure they still match with the lucene 
 queryparser? SloppyPhraseQuery has a bit of a history with repeats 
 since Lucene 2.9 that you were using. 
 
 https://issues.apache.org/jira/browse/LUCENE-3068
 https://issues.apache.org/jira/browse/LUCENE-3215
 https://issues.apache.org/jira/browse/LUCENE-3412
 
 -- 
 lucidimagination.com 
 
 
 If you reply to this email, your message will be added to the discussion 
 below:
 http://lucene.472066.n3.nabble.com/result-present-in-Solr-1-4-but-missing-in-Solr-3-5-dismax-only-tp3767851p3768619.html
 To unsubscribe from result present in Solr 1.4, but missing in Solr 3.5, 
 dismax only, click here.
 NAML



--
View this message in context: 
http://lucene.472066.n3.nabble.com/result-present-in-Solr-1-4-but-missing-in-Solr-3-5-dismax-only-tp3767851p3770665.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: result present in Solr 1.4, but missing in Solr 3.5, dismax only

2012-02-23 Thread Robert Muir
Is it possible to also provide your document?
If you could attach the document and the analysis config and queries
to a JIRA issue, that would be most ideal.

On Thu, Feb 23, 2012 at 2:05 PM, Naomi Dushay ndus...@stanford.edu wrote:
 Robert,

 You found it!   it is the phrase slop.  What do I do now?   I am using Solr 
 from trunk from December, and all those JIRA tixes are marked fixed …

 - Naomi


 Solr 1.4:

 luceneQueryParser:

 URL: q=all_search:The Beatles as musicians : Revolver through the 
 Anthology~3
 final query:  all_search:the beatl as musician revolv through the antholog~3

 got result


 Solr 3.5

 luceneQueryParser:

 URL: q=all_search:The Beatles as musicians : Revolver through the 
 Anthology~3
 final query:  all_search:the beatl as musician revolv through the antholog~3

 NO result



 lucene QueryParser:

 URL:  q=all_search:The Beatles as musicians : Revolver through the 
 Anthology
 final query:  all_search:the beatl as musician revolv through the antholog




 On Feb 22, 2012, at 7:34 PM, Robert Muir [via Lucene] wrote:

 On Wed, Feb 22, 2012 at 7:35 PM, Naomi Dushay [hidden email] wrote:
  Jonathan has brought it to my attention that BOTH of my failing searches 
  happen to have 8 terms, and one of the terms is repeated:
 
   The Beatles as musicians : Revolver through the Anthology
   Color-blindness [print/digital]; its dangers and its detection
 
  but this is a PHRASE search.
 

 Can you take your same phrase queries, and simply add some slop to
 them (e.g. ~3) and ensure they still match with the lucene
 queryparser? SloppyPhraseQuery has a bit of a history with repeats
 since Lucene 2.9 that you were using.

 https://issues.apache.org/jira/browse/LUCENE-3068
 https://issues.apache.org/jira/browse/LUCENE-3215
 https://issues.apache.org/jira/browse/LUCENE-3412

 --
 lucidimagination.com


 If you reply to this email, your message will be added to the discussion 
 below:
 http://lucene.472066.n3.nabble.com/result-present-in-Solr-1-4-but-missing-in-Solr-3-5-dismax-only-tp3767851p3768619.html
 To unsubscribe from result present in Solr 1.4, but missing in Solr 3.5, 
 dismax only, click here.
 NAML



 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/result-present-in-Solr-1-4-but-missing-in-Solr-3-5-dismax-only-tp3767851p3770665.html
 Sent from the Solr - User mailing list archive at Nabble.com.



-- 
lucidimagination.com


Re: result present in Solr 1.4, but missing in Solr 3.5, dismax only

2012-02-23 Thread Naomi Dushay
Robert,

I will create a jira issue with the documentation.  FYI, I tried ps values of 
3, 2, 1 and 0 and none of them worked with dismax;   For lucene QueryParser, 
only the value of 0 got results.

- Naomi


On Feb 23, 2012, at 11:12 AM, Robert Muir [via Lucene] wrote:

 Is it possible to also provide your document? 
 If you could attach the document and the analysis config and queries 
 to a JIRA issue, that would be most ideal. 
 
 On Thu, Feb 23, 2012 at 2:05 PM, Naomi Dushay [hidden email] wrote:
 
  Robert, 
  
  You found it!   it is the phrase slop.  What do I do now?   I am using Solr 
  from trunk from December, and all those JIRA tixes are marked fixed … 
  
  - Naomi 
  
  
  Solr 1.4: 
  
  luceneQueryParser: 
  
  URL: q=all_search:The Beatles as musicians : Revolver through the 
  Anthology~3 
  final query:  all_search:the beatl as musician revolv through the 
  antholog~3 
  
  got result 
  
  
  Solr 3.5 
  
  luceneQueryParser: 
  
  URL: q=all_search:The Beatles as musicians : Revolver through the 
  Anthology~3 
  final query:  all_search:the beatl as musician revolv through the 
  antholog~3 
  
  NO result 
  
  
  
  lucene QueryParser: 
  
  URL:  q=all_search:The Beatles as musicians : Revolver through the 
  Anthology 
  final query:  all_search:the beatl as musician revolv through the 
  antholog 
  
  
  
  
  On Feb 22, 2012, at 7:34 PM, Robert Muir [via Lucene] wrote: 
  
  On Wed, Feb 22, 2012 at 7:35 PM, Naomi Dushay [hidden email] wrote: 
   Jonathan has brought it to my attention that BOTH of my failing searches 
   happen to have 8 terms, and one of the terms is repeated: 
   
The Beatles as musicians : Revolver through the Anthology 
Color-blindness [print/digital]; its dangers and its detection 
   
   but this is a PHRASE search. 
   
  
  Can you take your same phrase queries, and simply add some slop to 
  them (e.g. ~3) and ensure they still match with the lucene 
  queryparser? SloppyPhraseQuery has a bit of a history with repeats 
  since Lucene 2.9 that you were using. 
  
  https://issues.apache.org/jira/browse/LUCENE-3068
  https://issues.apache.org/jira/browse/LUCENE-3215
  https://issues.apache.org/jira/browse/LUCENE-3412
  
  -- 
  lucidimagination.com 
  
  
  If you reply to this email, your message will be added to the discussion 
  below: 
  http://lucene.472066.n3.nabble.com/result-present-in-Solr-1-4-but-missing-in-Solr-3-5-dismax-only-tp3767851p3768619.html
  To unsubscribe from result present in Solr 1.4, but missing in Solr 3.5, 
  dismax only, click here. 
  NAML 
  
  
  
  -- 
  View this message in context: 
  http://lucene.472066.n3.nabble.com/result-present-in-Solr-1-4-but-missing-in-Solr-3-5-dismax-only-tp3767851p3770665.html
  Sent from the Solr - User mailing list archive at Nabble.com.
 
 
 
 -- 
 lucidimagination.com 
 
 
 If you reply to this email, your message will be added to the discussion 
 below:
 http://lucene.472066.n3.nabble.com/result-present-in-Solr-1-4-but-missing-in-Solr-3-5-dismax-only-tp3767851p3770681.html
 To unsubscribe from result present in Solr 1.4, but missing in Solr 3.5, 
 dismax only, click here.
 NAML



Re: Solr HBase - Re: How is Data Indexed in HBase?

2012-02-23 Thread T Vinod Gupta
regarding your question on hbase support for high performance and
consistency - i would say hbase is highly scalable and performant. how it
does what it does can be understood by reading relevant chapters around
architecture and design in the hbase book.

with regards to ranking, i see your problem. but if you split the problem
into hbase specific solution and solr based solution, you can achieve the
results probably. may be you do the ranking and store the rank in hbase and
then use solr to get the results and then use hbase as a lookup to get the
rank. or you can put the rank as part of the document schema and index the
rank too for range queries and such. is my understanding of your scenario
wrong?

thanks
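The split described above (Solr for keyword retrieval, HBase for the frequently updated rank) can be sketched with hypothetical adapter functions; solr_search and hbase_get_rank stand in for real client calls and are not actual library APIs:

```python
def combine(query, solr_search, hbase_get_rank):
    """Fetch candidate docs from Solr, then decorate and re-sort them
    by the frequently-updated rank kept in HBase."""
    docs = solr_search(query)
    for doc in docs:
        doc["rank"] = hbase_get_rank(doc["id"])
    return sorted(docs, key=lambda d: d["rank"], reverse=True)

# Tiny fake backends for demonstration only.
ranks = {"a": 0.9, "b": 0.5}
result = combine(
    "keyword",
    lambda q: [{"id": "b"}, {"id": "a"}],   # Solr returns candidates
    lambda doc_id: ranks[doc_id],           # HBase lookup per document
)
print(result)
```

The alternative the post mentions, indexing the rank as a Solr field, trades this per-query lookup for re-indexing cost on every rank update.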

On Wed, Feb 22, 2012 at 9:51 AM, Bing Li lbl...@gmail.com wrote:

 Mr Gupta,

 Thanks so much for your reply!

 In my use cases, retrieving data by keyword is one of them. I think Solr
 is a proper choice.

 However, Solr does not provide a complex enough support to rank. And,
 frequent updating is also not suitable in Solr. So it is difficult to
 retrieve data randomly based on the values other than keyword frequency in
 text. In this case, I attempt to use HBase.

 But I don't know how HBase support high performance when it needs to keep
 consistency in a large scale distributed system.

 Now both of them are used in my system.

 I will check out ElasticSearch.

 Best regards,
 Bing


 On Thu, Feb 23, 2012 at 1:35 AM, T Vinod Gupta tvi...@readypulse.comwrote:

 Bing,
 Its a classic battle on whether to use solr or hbase or a combination of
 both. both systems are very different but there is some overlap in the
 utility. they also differ vastly when it compares to computation power,
 storage needs, etc. so in the end, it all boils down to your use case. you
 need to pick the technology that it best suited to your needs.
 im still not clear on your use case though.

 btw, if you haven't started using solr yet - then you might want to
 checkout ElasticSearch. I spent over a week researching between solr and ES
 and eventually chose ES due to its cool merits.

 thanks


 On Wed, Feb 22, 2012 at 9:31 AM, Ted Yu yuzhih...@gmail.com wrote:

 There is no secondary index support in HBase at the moment.

 It's on our road map.

 FYI

 On Wed, Feb 22, 2012 at 9:28 AM, Bing Li lbl...@gmail.com wrote:

  Jacques,
 
  Yes. But I still have questions about that.
 
  In my system, when users search with a keyword arbitrarily, the query
 is
  forwarded to Solr. No any updating operations but appending new indexes
  exist in Solr managed data.
 
  When I need to retrieve data based on ranking values, HBase is used.
 And,
  the ranking values need to be updated all the time.
 
  Is that correct?
 
  My question is that the performance must be low if keeping consistency
 in a
  large scale distributed environment. How does HBase handle this issue?
 
  Thanks so much!
 
  Bing
 
 
  On Thu, Feb 23, 2012 at 1:17 AM, Jacques whs...@gmail.com wrote:
 
   It is highly unlikely that you could replace Solr with HBase.
  They're
   really apples and oranges.
  
  
   On Wed, Feb 22, 2012 at 1:09 AM, Bing Li lbl...@gmail.com wrote:
  
   Dear all,
  
   I wonder how data in HBase is indexed? Now Solr is used in my system
   because data is managed in inverted index. Such an index is
 suitable to
   retrieve unstructured and huge amount of data. How does HBase deal
 with
   the
   issue? May I replaced Solr with HBase?
  
   Thanks so much!
  
   Best regards,
   Bing
  
  
  
 






Re: result present in Solr 1.4, but missing in Solr 3.5, dismax only

2012-02-23 Thread Robert Muir
Please attach your docs if you don't mind.

I worked up tests for this (in general for ANY phrase query,
increasing the slop should never remove results, only potentially
enlarge them).

It fails already... but it's good to have your test case as well...
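The invariant stated above, that raising the slop can only grow the result set, can be illustrated with a brute-force matcher. This is a simplified model of a sloppy phrase query, not Lucene's actual SloppyPhraseScorer: it assigns each phrase term (repeats included) its own distinct position and sums the displacement from a contiguous in-order occurrence.

```python
from itertools import product

def matches(doc, phrase, slop):
    # Candidate positions for each phrase term, repeats included.
    positions = [[i for i, t in enumerate(doc) if t == term] for term in phrase]
    if any(not p for p in positions):
        return False
    for combo in product(*positions):
        if len(set(combo)) != len(combo):
            continue  # a repeated term may not reuse another term's position
        start = min(combo)
        # Total moves away from a perfectly contiguous, in-order occurrence.
        moves = sum(abs(p - (start + k)) for k, p in enumerate(combo))
        if moves <= slop:
            return True
    return False

doc = "the beatl as musician revolv through the antholog".split()
phrase = doc[:]  # the exact stemmed title, with "the" repeated
print(matches(doc, phrase, 0), matches(doc, phrase, 3))
```

Under this model anything that matches at slop s trivially matches at s+1, so a slop-0 match disappearing at slop 3, as in the reports above, points at a bug in the real scorer's handling of repeated terms.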

On Thu, Feb 23, 2012 at 2:20 PM, Naomi Dushay ndus...@stanford.edu wrote:
 Robert,

 I will create a jira issue with the documentation.  FYI, I tried ps values of 
 3, 2, 1 and 0 and none of them worked with dismax;   For lucene QueryParser, 
 only the value of 0 got results.

 - Naomi


 On Feb 23, 2012, at 11:12 AM, Robert Muir [via Lucene] wrote:

 Is it possible to also provide your document?
 If you could attach the document and the analysis config and queries
 to a JIRA issue, that would be most ideal.

 On Thu, Feb 23, 2012 at 2:05 PM, Naomi Dushay [hidden email] wrote:

  Robert,
 
  You found it!   it is the phrase slop.  What do I do now?   I am using 
  Solr from trunk from December, and all those JIRA tixes are marked fixed …
 
  - Naomi
 
 
  Solr 1.4:
 
  luceneQueryParser:
 
  URL: q=all_search:The Beatles as musicians : Revolver through the 
  Anthology~3
  final query:  all_search:the beatl as musician revolv through the 
  antholog~3
 
  got result
 
 
  Solr 3.5
 
  luceneQueryParser:
 
  URL: q=all_search:The Beatles as musicians : Revolver through the 
  Anthology~3
  final query:  all_search:the beatl as musician revolv through the 
  antholog~3
 
  NO result
 
 
 
  lucene QueryParser:
 
  URL:  q=all_search:The Beatles as musicians : Revolver through the 
  Anthology
  final query:  all_search:the beatl as musician revolv through the 
  antholog
 
 
 
 
  On Feb 22, 2012, at 7:34 PM, Robert Muir [via Lucene] wrote:
 
  On Wed, Feb 22, 2012 at 7:35 PM, Naomi Dushay [hidden email] wrote:
   Jonathan has brought it to my attention that BOTH of my failing 
   searches happen to have 8 terms, and one of the terms is repeated:
  
    The Beatles as musicians : Revolver through the Anthology
    Color-blindness [print/digital]; its dangers and its detection
  
   but this is a PHRASE search.
  
 
  Can you take your same phrase queries, and simply add some slop to
  them (e.g. ~3) and ensure they still match with the lucene
  queryparser? SloppyPhraseQuery has a bit of a history with repeats
  since Lucene 2.9 that you were using.
 
  https://issues.apache.org/jira/browse/LUCENE-3068
  https://issues.apache.org/jira/browse/LUCENE-3215
  https://issues.apache.org/jira/browse/LUCENE-3412
 
  --
  lucidimagination.com
 
 
  If you reply to this email, your message will be added to the discussion 
  below:
  http://lucene.472066.n3.nabble.com/result-present-in-Solr-1-4-but-missing-in-Solr-3-5-dismax-only-tp3767851p3768619.html
  To unsubscribe from result present in Solr 1.4, but missing in Solr 3.5, 
  dismax only, click here.
  NAML
 
 
 
  --
  View this message in context: 
  http://lucene.472066.n3.nabble.com/result-present-in-Solr-1-4-but-missing-in-Solr-3-5-dismax-only-tp3767851p3770665.html
  Sent from the Solr - User mailing list archive at Nabble.com.



 --
 lucidimagination.com


 If you reply to this email, your message will be added to the discussion 
 below:
 http://lucene.472066.n3.nabble.com/result-present-in-Solr-1-4-but-missing-in-Solr-3-5-dismax-only-tp3767851p3770681.html
 To unsubscribe from result present in Solr 1.4, but missing in Solr 3.5, 
 dismax only, click here.
 NAML




-- 
lucidimagination.com


Re: Solr HBase - Re: How is Data Indexed in HBase?

2012-02-23 Thread Bing Li
Dear Mr Gupta,

Your understanding about my solution is correct. Now both HBase and Solr
are used in my system. I hope it could work.

Thanks so much for your reply!

Best regards,
Bing

On Fri, Feb 24, 2012 at 3:30 AM, T Vinod Gupta tvi...@readypulse.comwrote:

 regarding your question on hbase support for high performance and
 consistency - i would say hbase is highly scalable and performant. how it
 does what it does can be understood by reading relevant chapters around
 architecture and design in the hbase book.

 with regards to ranking, i see your problem. but if you split the problem
 into hbase specific solution and solr based solution, you can achieve the
 results probably. may be you do the ranking and store the rank in hbase and
 then use solr to get the results and then use hbase as a lookup to get the
 rank. or you can put the rank as part of the document schema and index the
 rank too for range queries and such. is my understanding of your scenario
 wrong?

 thanks


 On Wed, Feb 22, 2012 at 9:51 AM, Bing Li lbl...@gmail.com wrote:

 Mr Gupta,

 Thanks so much for your reply!

 In my use cases, retrieving data by keyword is one of them. I think Solr
 is a proper choice.

 However, Solr does not provide a complex enough support to rank. And,
 frequent updating is also not suitable in Solr. So it is difficult to
 retrieve data randomly based on the values other than keyword frequency in
 text. In this case, I attempt to use HBase.

 But I don't know how HBase support high performance when it needs to keep
 consistency in a large scale distributed system.

 Now both of them are used in my system.

 I will check out ElasticSearch.

 Best regards,
 Bing


 On Thu, Feb 23, 2012 at 1:35 AM, T Vinod Gupta tvi...@readypulse.comwrote:

 Bing,
 Its a classic battle on whether to use solr or hbase or a combination of
 both. both systems are very different but there is some overlap in the
 utility. they also differ vastly when it compares to computation power,
 storage needs, etc. so in the end, it all boils down to your use case. you
 need to pick the technology that it best suited to your needs.
 im still not clear on your use case though.

 btw, if you haven't started using solr yet - then you might want to
 checkout ElasticSearch. I spent over a week researching between solr and ES
 and eventually chose ES due to its cool merits.

 thanks


 On Wed, Feb 22, 2012 at 9:31 AM, Ted Yu yuzhih...@gmail.com wrote:

 There is no secondary index support in HBase at the moment.

 It's on our road map.

 FYI

 On Wed, Feb 22, 2012 at 9:28 AM, Bing Li lbl...@gmail.com wrote:

  Jacques,
 
  Yes. But I still have questions about that.
 
  In my system, when users search with a keyword arbitrarily, the query
 is
  forwarded to Solr. No any updating operations but appending new
 indexes
  exist in Solr managed data.
 
  When I need to retrieve data based on ranking values, HBase is used.
 And,
  the ranking values need to be updated all the time.
 
  Is that correct?
 
  My question is that the performance must be low if keeping
 consistency in a
  large scale distributed environment. How does HBase handle this issue?
 
  Thanks so much!
 
  Bing
 
 
  On Thu, Feb 23, 2012 at 1:17 AM, Jacques whs...@gmail.com wrote:
 
   It is highly unlikely that you could replace Solr with HBase.
  They're
   really apples and oranges.
  
  
   On Wed, Feb 22, 2012 at 1:09 AM, Bing Li lbl...@gmail.com wrote:
  
   Dear all,
  
   I wonder how data in HBase is indexed? Now Solr is used in my
 system
   because data is managed in inverted index. Such an index is
 suitable to
   retrieve unstructured and huge amount of data. How does HBase deal
 with
   the
   issue? May I replaced Solr with HBase?
  
   Thanks so much!
  
   Best regards,
   Bing
  
  
  
 







Re: result present in Solr 1.4, but missing in Solr 3.5, dismax only

2012-02-23 Thread Naomi Dushay
Robert -

Did you mean for me to attach my docs to an existing ticket (which one?) or 
just want to make sure I attach the docs to the new issue?

- Naomi

On Feb 23, 2012, at 11:39 AM, Robert Muir [via Lucene] wrote:

 Please attach your docs if you dont mind. 
 
 I worked up tests for this (in general for ANY phrase query, 
 increasing the slop should never remove results, only potentially 
 enlarge them). 
 
 It fails already... but its good to also have your test case too... 
 
 On Thu, Feb 23, 2012 at 2:20 PM, Naomi Dushay [hidden email] wrote:
 
  Robert, 
  
  I will create a jira issue with the documentation.  FYI, I tried ps values 
  of 3, 2, 1 and 0 and none of them worked with dismax;   For lucene 
  QueryParser, only the value of 0 got results. 
  
  - Naomi 
  
  
  On Feb 23, 2012, at 11:12 AM, Robert Muir [via Lucene] wrote: 
  
  Is it possible to also provide your document? 
  If you could attach the document and the analysis config and queries 
  to a JIRA issue, that would be most ideal. 
  
  On Thu, Feb 23, 2012 at 2:05 PM, Naomi Dushay [hidden email] wrote: 
  
   Robert, 
   
   You found it!   it is the phrase slop.  What do I do now?   I am using 
   Solr from trunk from December, and all those JIRA tixes are marked fixed 
   … 
   
   - Naomi 
   
   
   Solr 1.4: 
   
   luceneQueryParser: 
   
   URL: q=all_search:The Beatles as musicians : Revolver through the 
   Anthology~3 
   final query:  all_search:the beatl as musician revolv through the 
   antholog~3 
   
   got result 
   
   
   Solr 3.5 
   
   luceneQueryParser: 
   
   URL: q=all_search:The Beatles as musicians : Revolver through the 
   Anthology~3 
   final query:  all_search:the beatl as musician revolv through the 
   antholog~3 
   
   NO result 
   
   
   
   lucene QueryParser: 
   
   URL:  q=all_search:The Beatles as musicians : Revolver through the 
   Anthology 
   final query:  all_search:the beatl as musician revolv through the 
   antholog 
   
   
   
   
   On Feb 22, 2012, at 7:34 PM, Robert Muir [via Lucene] wrote: 
   
   On Wed, Feb 22, 2012 at 7:35 PM, Naomi Dushay [hidden email] wrote: 
Jonathan has brought it to my attention that BOTH of my failing 
searches happen to have 8 terms, and one of the terms is repeated: 

 The Beatles as musicians : Revolver through the Anthology 
 Color-blindness [print/digital]; its dangers and its detection 

but this is a PHRASE search. 

   
   Can you take your same phrase queries, and simply add some slop to 
   them (e.g. ~3) and ensure they still match with the lucene 
   queryparser? SloppyPhraseQuery has a bit of a history with repeats 
   since Lucene 2.9 that you were using. 
   
   https://issues.apache.org/jira/browse/LUCENE-3068
   https://issues.apache.org/jira/browse/LUCENE-3215
   https://issues.apache.org/jira/browse/LUCENE-3412
   
   -- 
   lucidimagination.com 
   
   
   If you reply to this email, your message will be added to the 
   discussion below: 
   http://lucene.472066.n3.nabble.com/result-present-in-Solr-1-4-but-missing-in-Solr-3-5-dismax-only-tp3767851p3768619.html
   To unsubscribe from result present in Solr 1.4, but missing in Solr 
   3.5, dismax only, click here. 
   NAML 
   
   
   
   -- 
   View this message in context: 
   http://lucene.472066.n3.nabble.com/result-present-in-Solr-1-4-but-missing-in-Solr-3-5-dismax-only-tp3767851p3770665.html
   Sent from the Solr - User mailing list archive at Nabble.com. 
  
  
  
  -- 
  lucidimagination.com 
  
  
  If you reply to this email, your message will be added to the discussion 
  below: 
  http://lucene.472066.n3.nabble.com/result-present-in-Solr-1-4-but-missing-in-Solr-3-5-dismax-only-tp3767851p3770681.html
  To unsubscribe from result present in Solr 1.4, but missing in Solr 3.5, 
  dismax only, click here. 
  NAML 
 
 
 
 
 -- 
 lucidimagination.com 
 
 
 If you reply to this email, your message will be added to the discussion 
 below:
 http://lucene.472066.n3.nabble.com/result-present-in-Solr-1-4-but-missing-in-Solr-3-5-dismax-only-tp3767851p3770746.html
 To unsubscribe from result present in Solr 1.4, but missing in Solr 3.5, 
 dismax only, click here.
 NAML



RE: autoGeneratePhraseQueries sort of silently set to false

2012-02-23 Thread Burton-West, Tom
Seems like a change in default behavior like this should be included in the 
changes.txt for Solr 3.5.
Not sure how to do that.

Tom

-Original Message-
From: Naomi Dushay [mailto:ndus...@stanford.edu] 
Sent: Thursday, February 23, 2012 1:57 PM
To: solr-user@lucene.apache.org
Subject: autoGeneratePhraseQueries sort of silently set to false 

Another thing I noticed when upgrading from Solr 1.4 to Solr 3.5 had to do with 
results when there were hyphenated words:   aaa-bbb.   Erik Hatcher pointed me 
to the autoGeneratePhraseQueries attribute now available on fieldtype 
definitions in schema.xml.  This is a great feature, and everything is peachy 
if you start with Solr 3.4.   But many of us started earlier and are upgrading, 
and that's a different story.

It was surprising to me that

a.  the default for this new feature caused different search results than Solr 
1.4 

b.  it wasn't documented clearly, IMO

http://wiki.apache.org/solr/SchemaXml   makes no mention of it


In the schema.xml example, there is this at the top:

<!-- attribute "name" is the name of this schema and is only used for display 
purposes.
   Applications should change this to reflect the nature of the search 
collection.
   version="1.4" is Solr's version number for the schema syntax and 
semantics.  It should
   not normally be changed by applications.
   1.0: multiValued attribute did not exist, all fields are multiValued by 
nature
   1.1: multiValued attribute introduced, false by default 
   1.2: omitTermFreqAndPositions attribute introduced, true by default 
except for text fields.
   1.3: removed optional field compress feature
   1.4: default auto-phrase (QueryParser feature) to off
 -->

And there was this in a couple of field definitions:

<fieldType name="text_en_splitting" class="solr.TextField" 
positionIncrementGap="100" autoGeneratePhraseQueries="true">
<fieldType name="text_ja" class="solr.TextField" positionIncrementGap="100" 
autoGeneratePhraseQueries="false">

But that was it.



Re: autoGeneratePhraseQueries sort of silently set to false

2012-02-23 Thread Erik Hatcher
there's this (for 3.1, but in the 3.x CHANGES.txt):

* SOLR-2015: Add a boolean attribute autoGeneratePhraseQueries to TextField.
  autoGeneratePhraseQueries="true" (the default) causes the query parser to
  generate phrase queries if multiple tokens are generated from a single
  non-quoted analysis string.  For example WordDelimiterFilter splitting 
text:pdp-11
  will cause the parser to generate text:"pdp 11" rather than (text:PDP OR 
text:11).
  Note that autoGeneratePhraseQueries="true" tends to not work well for non 
whitespace
  delimited languages. (yonik)

with a ton of useful, though back and forth, commentary here: 
https://issues.apache.org/jira/browse/SOLR-2015

Note that the behavior, as Naomi pointed out so succinctly, is adjustable based 
off the *schema* version setting.  (look at your schema line in schema.xml).  
The code is simply this:

if (schema.getVersion() > 1.3f) {
  autoGeneratePhraseQueries = false;
} else {
  autoGeneratePhraseQueries = true;
}

on TextField.  Specifying autoGeneratePhraseQueries explicitly on a field type 
overrides whatever the default may be.

Erik



On Feb 23, 2012, at 14:45 , Burton-West, Tom wrote:

 Seems like a change in default behavior like this should be included in the 
 changes.txt for Solr 3.5.
 Not sure how to do that.
 
 Tom
 
 -Original Message-
 From: Naomi Dushay [mailto:ndus...@stanford.edu] 
 Sent: Thursday, February 23, 2012 1:57 PM
 To: solr-user@lucene.apache.org
 Subject: autoGeneratePhraseQueries sort of silently set to false 
 
 Another thing I noticed when upgrading from Solr 1.4 to Solr 3.5 had to do 
 with results when there were hyphenated words:   aaa-bbb.   Erik Hatcher 
 pointed me to the autoGeneratePhraseQueries attribute now available on 
 fieldtype definitions in schema.xml.  This is a great feature, and everything 
 is peachy if you start with Solr 3.4.   But many of us started earlier and 
 are upgrading, and that's a different story.
 
 It was surprising to me that
 
 a.  the default for this new feature caused different search results than 
 Solr 1.4 
 
 b.  it wasn't documented clearly, IMO
 
 http://wiki.apache.org/solr/SchemaXml   makes no mention of it
 
 
 In the schema.xml example, there is this at the top:
 
 <!-- attribute name is the name of this schema and is only used for display 
 purposes.
   Applications should change this to reflect the nature of the search 
 collection.
   version=1.4 is Solr's version number for the schema syntax and 
 semantics.  It should
   not normally be changed by applications.
   1.0: multiValued attribute did not exist, all fields are multiValued by 
 nature
   1.1: multiValued attribute introduced, false by default 
   1.2: omitTermFreqAndPositions attribute introduced, true by default 
 except for text fields.
   1.3: removed optional field compress feature
   1.4: default auto-phrase (QueryParser feature) to off
  -->
 
 And there was this in a couple of field definitions:
 
 <fieldType name="text_en_splitting" class="solr.TextField" 
 positionIncrementGap="100" autoGeneratePhraseQueries="true"/>
 <fieldType name="text_ja" class="solr.TextField" positionIncrementGap="100" 
 autoGeneratePhraseQueries="false"/>
 
 But that was it.
 



Re: result present in Solr 1.4, but missing in Solr 3.5, dismax only

2012-02-23 Thread Robert Muir
Please make a new one if you don't mind!

On Thu, Feb 23, 2012 at 2:45 PM, Naomi Dushay ndus...@stanford.edu wrote:
 Robert -

 Did you mean for me to attach my docs to an existing ticket (which one?) or 
 did you just want to make sure I attach the docs to the new issue?

 - Naomi

 On Feb 23, 2012, at 11:39 AM, Robert Muir [via Lucene] wrote:

 Please attach your docs if you don't mind.

 I worked up tests for this (in general for ANY phrase query,
 increasing the slop should never remove results, only potentially
 enlarge them).

 It fails already... but it's good to have your test case too...

 On Thu, Feb 23, 2012 at 2:20 PM, Naomi Dushay [hidden email] wrote:

  Robert,
 
  I will create a jira issue with the documentation.  FYI, I tried ps values 
  of 3, 2, 1 and 0 and none of them worked with dismax.  For the lucene 
  QueryParser, only the value of 0 got results.
 
  - Naomi
 
 
  On Feb 23, 2012, at 11:12 AM, Robert Muir [via Lucene] wrote:
 
  Is it possible to also provide your document?
  If you could attach the document and the analysis config and queries
  to a JIRA issue, that would be most ideal.
 
  On Thu, Feb 23, 2012 at 2:05 PM, Naomi Dushay [hidden email] wrote:
 
   Robert,
  
   You found it!  It is the phrase slop.  What do I do now?  I am using 
   Solr from trunk from December, and all those JIRA tixes are marked 
   fixed …
  
   - Naomi
  
  
   Solr 1.4:
  
   luceneQueryParser:
  
    URL: q=all_search:"The Beatles as musicians : Revolver through the 
    Anthology"~3
    final query:  all_search:"the beatl as musician revolv through the 
    antholog"~3
  
   got result
  
  
   Solr 3.5
  
   luceneQueryParser:
  
    URL: q=all_search:"The Beatles as musicians : Revolver through the 
    Anthology"~3
    final query:  all_search:"the beatl as musician revolv through the 
    antholog"~3
  
   NO result
  
  
  
   lucene QueryParser:
  
    URL:  q=all_search:"The Beatles as musicians : Revolver through the 
    Anthology"
    final query:  all_search:"the beatl as musician revolv through the 
    antholog"
  
  
  
  
   On Feb 22, 2012, at 7:34 PM, Robert Muir [via Lucene] wrote:
  
   On Wed, Feb 22, 2012 at 7:35 PM, Naomi Dushay [hidden email] wrote:
Jonathan has brought it to my attention that BOTH of my failing 
searches happen to have 8 terms, and one of the terms is repeated:
   
  "The Beatles as musicians : Revolver through the Anthology"
  "Color-blindness [print/digital]; its dangers and its detection"
   
but this is a PHRASE search.
   
  
   Can you take your same phrase queries, and simply add some slop to
   them (e.g. ~3) and ensure they still match with the lucene
   queryparser? SloppyPhraseQuery has a bit of a history with repeats
   since Lucene 2.9 that you were using.
  
   https://issues.apache.org/jira/browse/LUCENE-3068
   https://issues.apache.org/jira/browse/LUCENE-3215
   https://issues.apache.org/jira/browse/LUCENE-3412
  
   --
   lucidimagination.com
  
  
  
  
  
   --
   View this message in context: 
   http://lucene.472066.n3.nabble.com/result-present-in-Solr-1-4-but-missing-in-Solr-3-5-dismax-only-tp3767851p3770665.html
   Sent from the Solr - User mailing list archive at Nabble.com.
 
 
 
  --
  lucidimagination.com
 
 
 



 --
 lucidimagination.com






-- 
lucidimagination.com


Re: DataImportHandler running out of memory

2012-02-23 Thread Shawn Heisey

On 2/20/2012 6:49 AM, v_shan wrote:

DIH still running out of memory for me, with Full Import on a database of
size 1.5 GB.

Solr version: 3_5_0

Note that I have already set batchSize=-1 but am getting the same error.


A few questions:

- How much memory have you given to the JVM running this Solr instance?
- How much memory does your server have?
- What is the size of all your index cores, and how many documents are 
in them?
- How large are your Solr caches (filterCache, documentCache, 
queryResultCache)?

- What is your ramBufferSizeMB set to in the indexDefaults section?
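For reference, the settings these questions point at all live in solrconfig.xml; a minimal sketch of where to look in a 3.x config (the values shown are the stock example defaults, not recommendations):

```xml
<!-- indexDefaults section: RAM used to buffer documents before a flush -->
<indexDefaults>
  <ramBufferSizeMB>32</ramBufferSizeMB>
</indexDefaults>

<!-- query section: cache sizes that contribute directly to heap usage -->
<query>
  <filterCache      class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="0"/>
  <queryResultCache class="solr.LRUCache"     size="512" initialSize="512" autowarmCount="0"/>
  <documentCache    class="solr.LRUCache"     size="512" initialSize="512" autowarmCount="0"/>
</query>
```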

Thanks,
Shawn



RE: autoGeneratePhraseQueries sort of silently set to false

2012-02-23 Thread Burton-West, Tom
Thanks Erik,

The 3.1 changes document the ability to set this attribute, with the default 
being true. However, apparently between 3.4 and 3.5 the default was changed to 
false. Since this will change the behavior of any field where 
autoGeneratePhraseQueries is not explicitly set, it could easily surprise users 
who upgrade to 3.5.

That's why I think the change in the default behavior (i.e. when the attribute 
is not explicitly set) should be called out explicitly in the changes.txt for 
3.5.

True, everyone should read the notes in the example schema.xml, but I think it 
would help if the change was also noted in changes.txt.  

Is it possible to revise the changes.txt for 3.5?

Do you by any chance know where the change in the default behavior was 
discussed?  I know it has been a contentious issue.

Tom




Re: Solr Performance Improvement and degradation Help

2012-02-23 Thread naptowndev
Erick -

Thanks.  We've actually worked with Sematext to optimize the GC settings
and saw initial (and continued) performance boosts as a result...

The situation we're seeing now, has both versions of Solr running on the
same box under the same JVM, but we are undeploying an instance at a time
so as to prevent any outlying performance hits in the tests...

So, that being said, both instances of solr, on the same box are running
under the optimized settings.  I'd assume if GC was impacting the results
of the newer version of Solr, we'd see similar decrease in performance on
the older version.

Aside from the QTime and other timings (highlight, etc) - which are all
faster in the new version, the overall response time/delivery of the
results are significantly slower under the new version.

I've unfortunately exhausted my knowledge of Solr and what may or may not
have changed between the nightly builds.

I do appreciate your insight and hope you'll continue to throw out some
ideas...and maybe someone else out there has seen these inconsistencies as
well.

The last set of tests I ran consistently showed the older build of Solr
bringing back a result set of 13.1MB with 1200 records in 2.3 seconds
whereas the newer build was bringing back the same result set in about 17.4
seconds.  The catch is that the qtime and highlighting component time in
the newer version are faster than in the older version.

Again, if you have any more ideas, let me know.

Thanks!
Brian

On Thu, Feb 23, 2012 at 11:51 AM, Erick Erickson [via Lucene] 
ml-node+s472066n377030...@n3.nabble.com wrote:

 Ah, no, my mistake. The wildcards for the fl list won't matter re:
 maxBooleanClauses,
 I didn't read carefully enough.

 I assume that just returning a field or two doesn't slow down

 But one possible culprit, especially since you say this kicks in after
 a while, is garbage collection. Here's an excellent intro:


 http://www.lucidimagination.com/blog/2011/03/27/garbage-collection-bootcamp-1-0/

 Especially look at the "getting a view into garbage collection"
 section and try specifying
 those options. The result should be that your solr log gets stats
 dumped every time
 GC kicks in. If this is a problem, look at the times in the logfile
 after your system slows
 down. You'll see a bunch of GC dumps that collect very little unused
 memory. You can
 also connect to the process using jConsole (should be in the Java
 distro) and watch
 the memory tab, especially after your server has slowed down. You can
 also
 connect jConsole remotely...

 This is just an experiment, but any time I see "and it slows down
 after ### minutes",
 GC is the first thing I think of.
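The options Erick refers to are the standard HotSpot GC-logging flags of that era; a sketch of a startup line (heap sizes and log path are illustrative, not from the thread):

```
java -Xms512m -Xmx2g \
     -verbose:gc \
     -XX:+PrintGCDetails \
     -XX:+PrintGCTimeStamps \
     -Xloggc:/var/log/solr/gc.log \
     -jar start.jar
```

With these in place, each collection is appended to gc.log with a timestamp and the amount of memory reclaimed, which is what to watch after the slowdown kicks in.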


 Best
 Erick


 On Thu, Feb 23, 2012 at 10:16 AM, naptowndev [hidden email]
 wrote:

  Erick -
 
  Agreed, it is puzzling.
 
  What I've found is that it doesn't matter if I pass in wildcards for the
  field list or not...but that the overall response time from the newer
 builds
  of Solr that we've tested (e.g. 4.0.0.2012.02.16) is slower than the
 older
  (4.0.0.2010.12.10.08.54.56) build.
 
  If I run the exact same query against those two cores, bringing back a
  payload of just over 13MB (xml), the older build brings it back in about
 1.6
  seconds and the newer build brings it back in about 8.4 seconds.
 
  Implementing the field list wildcard allows us to reduce the payload in
 the
  newer build (not an option in the older build).  They payload is reduced
 to
  1.8MB but takes over 3.5 seconds to come back as compared to the full
  payload (13MB) in the older build at about 1.6 seconds.
 
  With everything else remaining the same
 (machine/processors/memory/network
  and the code base calling Solr) it seems to point to something in the
 newer
  builds that's causing the slowdown, but I'm not intimate enough with
 Solr to
  be able to figure that out.
 
  We are using the debugQuery=on in our test to see timings and they
 aren't
  showing any anomalies, so that makes it even more confusing.
 
  From a wildcard perspective, it's on the fl parameter... here's a
 'snippet'
  of part of our fl parameter for the query
 
  fl=id, CategoryGroupTypeID, MedicalSpecialtyDescription,
 TermsMisspelled,
  DictionarySource, timestamp, Category_*_MemberReports,
  Category_*_MemberReportRange, Category_*_NonMemberReports,
 Category_*_Grade,
  Category_*_GradeDisplay, Category_*_GradeTier,
 Category_*_ReportLocations,
  Category_*_ReportLocationCoordinates, Category_*_coordinate, score
 
  Please note that that fl param is greatly reduced from our full query,
 we
  have over 100 static files and a slew of dynamic fields - but that
 should
  give you an idea of how we are using wildcards.
 
  I'm not sure about the maxBooleanClauses...not being all that familiar
 with
  Solr, does that apply to wildcards used in the fl list?
 
  Thanks!
 
  --

Backporting Wildcard fieldlist Features to 3.x versions

2012-02-23 Thread naptowndev
We are currently running tests against some of the more recent nightly builds
of Solr 4, but have noticed some significant performance decreases recently. 
Some of the reasons we are using Solr 4 are that we needed geofiltering
and highlighting, which were not originally available in 3, from my
understanding.

It appears however, that those features have been backported to 3.x.

One other feature that we are very interested in, because we have very large
payloads returning in our search, is the wildcard field list for return
fields.  We've seen it work in the later builds of 4.x, but again, the gain
we are getting from the smaller payload by leaving out some fields (out of
hundreds), is negated by some poor performance on the response times.

Are there any plans to backport the wildcard fieldlist feature to 3.x?

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Backporting-Wildcard-fieldlist-Features-to-3-x-versions-tp3770953p3770953.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: need to support bi-directional synonyms

2012-02-23 Thread Jonathan Rochkind

Honestly, I'd just map 'em both to the same thing in the index.

sprayer, washer => sprayer

or

sprayer, washer => sprayer_washer

At both index and query time. Now if the source document includes either 
'sprayer' or 'washer', it'll get indexed as 'sprayer_washer'.  And if 
the user enters either 'sprayer' or 'washer', it'll search the index for 
'sprayer_washer', and find source documents that included either 
'sprayer' or 'washer'.


Of course, if you really use sprayer_washer, then if the user actually 
enters sprayer_washer they'll also find sprayer, washer, and 
sprayer_washer.


So it's probably best to actually use either 'sprayer' or 'washer' as 
the destination, even though it seems odd:


sprayer, washer => washer

Will do what you want, pretty sure.
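Wired into a schema, the suggestion above would look roughly like this — a sketch assuming the stock SynonymFilterFactory and a one-line synonyms.txt entry (field type name is illustrative):

```xml
<!-- synonyms.txt contains the single explicit mapping:
     sprayer, washer => washer
-->
<fieldType name="text_syn" class="solr.TextField" positionIncrementGap="100">
  <!-- one analyzer, so the same mapping applies at index and query time -->
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="false"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Because the mapping is explicit (`=>`), both "sprayer" and "washer" collapse to the single indexed token "washer" on both sides, which is exactly the bi-directional behavior asked for.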

On 2/23/2012 1:03 AM, remi tassing wrote:

Same question here...

On Wednesday, February 22, 2012, geeky2gee...@hotmail.com  wrote:

hello all,

i need to support the following:

if the user enters sprayer in the desc field - then they get results for
BOTH sprayer and washer.

and in the other direction

if the user enters washer in the desc field - then they get results for
BOTH washer and sprayer.

would i set up my synonym file like this?

assuming expand = true..

sprayer => washer
washer => sprayer

thank you,
mark

--
View this message in context:

http://lucene.472066.n3.nabble.com/need-to-support-bi-directional-synonyms-tp3767990p3767990.html

Sent from the Solr - User mailing list archive at Nabble.com.



Date search by specific month and day

2012-02-23 Thread Kurt Nordstrom

Hello all!

We have a situation involving date searching that I could use some 
seasoned opinions on. What we have is a collection of records, each 
containing a Solr date field on which we want to search.


The catch is that we want to be able to search for items that match a 
specific day/month. Essentially, we're trying to implement a this day 
in history feature for our dataset, so that users would be able to put 
in a date and we'd return all matching records from the past 100 years 
or so.


Is there a way to perform this kind of search with only the basic Solr 
date field? Or would I have to parse out the month and day and store them 
in separate fields at indexing time?


Thanks for the help!

-Kurt


Re: result present in Solr 1.4, but missing in Solr 3.5, dismax only

2012-02-23 Thread Naomi Dushay
Ticket created:

https://issues.apache.org/jira/browse/SOLR-3158

(perhaps it's a lucene problem, not a Solr one -- feel free to move it or 
whatever.)

- Naomi



Preferred file system for Solr

2012-02-23 Thread Mou
We are using a VeloDrive (SSD) to store and search our solr index.
The system is running on SLES 11.

Right now we are using ext3 but wondering if anyone has any experience using
XFS/ext3 on SSD or FusionIO for Solr.

Does Solr have any preference for the underlying file system?

Our index will be big (around 250 M docs) to start with, adding 5 M docs
every week; 50 to 60% of those will be updates.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Preferred-file-system-for-Solr-tp3771250p3771250.html
Sent from the Solr - User mailing list archive at Nabble.com.


how to ignore cases while querying with a field with type=string?

2012-02-23 Thread Yuhan Zhang
hi all,

I am storing a list of tags in a field using type=string with multiValued
setting:

<field name="pageKeywords" type="string" indexed="true" stored="true"
multiValued="true"/>

It works ok when I query with pageKeywords:"The ones.", and when I search
for "ones" no record will come up, as desired.

However, it appears that the query is case sensitive, so the queries
pageKeywords:"The ones" and pageKeywords:"The Ones"
give different results, which is not desirable in my case.

Is there some setting in the query to make it ignore case? Or do I have
to correct the data by keeping everything lower case?


Thank you.

Yuhan Zhang
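No reply appears in this thread, but one commonly suggested approach is to keep the single-token, string-like behavior while normalizing case: a TextField built on KeywordTokenizer plus LowerCaseFilter. A sketch (type name is illustrative):

```xml
<!-- A "case-insensitive string": the whole value stays one token,
     but it is lowercased at both index and query time -->
<fieldType name="string_ci" class="solr.TextField" sortMissingLast="true">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<field name="pageKeywords" type="string_ci" indexed="true" stored="true"
       multiValued="true"/>
```

With this, pageKeywords:"The ones" and pageKeywords:"The Ones" both analyze to the same lowercased token, so the two queries match the same records. Reindexing is required after the type change.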


Re: undefined field on CSV db import

2012-02-23 Thread Erick Erickson
What does your schema.xml file look like? Is Product_ID defined
as a field?
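The error suggests the schema simply lacks those fields. A sketch of schema.xml entries matching the fieldNames list in the update URL (the types chosen here are illustrative guesses, not from the thread):

```xml
<field name="Product_ID"               type="string" indexed="true"  stored="true"/>
<field name="Product_Name"             type="text"   indexed="true"  stored="true"/>
<field name="Product_ManufacturerPart" type="string" indexed="true"  stored="true"/>
<field name="Product_Img"              type="string" indexed="false" stored="true"/>
<field name="ImageURL"                 type="string" indexed="false" stored="true"/>
<field name="Manufacturer_Name"        type="text"   indexed="true"  stored="true"/>
<field name="lowestPrice"              type="float"  indexed="true"  stored="true"/>
<field name="vendorCount"              type="int"    indexed="true"  stored="true"/>
```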

Best
Erick

On Thu, Feb 23, 2012 at 1:24 PM, pmcgovern pmcgov...@portal63.com wrote:
 I am trying to import a csv file of values via curl (PHP) and am receiving an
 'undefined field' error, but I am not sure why, as I am defining the field.
 Can someone lend some insight as to what I am missing / doing wrong? Thank
 you in advance.

 Sample of CSV File:
 ---
 Product_ID  Product_Name  Product_ManufacturerPart  Product_Img
 ImageURL  Manufacturer_Name  lowestPrice  vendorCount
 -2121813476  Over-the-Sink Dish Rack  123478  
 http://image10.bizrate-images.com/resize?sq=60&uid=2511766107&mid=18900
 WALTERDRAKE  24.99  1
 -2121813460  Oregon Nike NCAA Twill Shorts - Mens - Green
 00025305XODR  
 http://image10.bizrate-images.com/resize?sq=60&uid=2564249353&mid=23598
 Nike  44.99  3
 -2121813456  Sudden Change Under Eye Firming Serum  091777  
 http://image10.bizrate-images.com/resize?sq=60&uid=2564994087&mid=18900
 WALTERDRAKE  19.99  1
 -2121813445  Global Keratin Leave-In Conditioner Cream  005248  
 http://image10.bizrate-images.com/resize?sq=60&uid=2101271875&mid=21473
 Global Keratin  24  1
 -2121813443  Oregon Nike NCAA Twill Shorts - Mens - White
 00025305XODH  
 http://image10.bizrate-images.com/resize?sq=60&uid=2564226023&mid=17345
 Nike  59.99  3
 -2121813441  Paul Brown Hawaii Shine Amplifier 4 oz.  000684  
 http://image10.bizrate-images.com/resize?sq=60&uid=1171412855&mid=21473
 Paul Brown  20.1  1
 -2121813437  Dish Drying Mat Large  077608  
 http://image10.bizrate-images.com/resize?sq=60&uid=1371997268&mid=18900
 WALTERDRAKE  14.99  1


 Solr Update URL:
 
 http://localhost:8983/solr/db/update/csv?commit=true&header=true&separator=%09&escape=\\&fieldNames=Product_ID,Product_Name,Product_ManufacturerPart,Product_Img,ImageURL,Manufacturer_Name,lowestPrice,vendorCount


 Error Output:
 -
 <html>
 <head>
 <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"/>
 <title>Error 400 undefined field Product_ID</title>
 </head>
 <body>
 HTTP ERROR 400

 <p>Problem accessing /solr/db/update/csv. Reason:
 <pre>    undefined field Product_ID</pre></p><hr /><small>Powered by
 Jetty://</small><br/>

 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/undefined-field-on-CSV-db-import-tp3770552p3770552.html
 Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr Performance Improvement and degradation Help

2012-02-23 Thread Erick Erickson
It's still worth looking at the GC characteristics, there's a possibility
that the newer build uses memory such that you're tripping over some
threshold, but that's grasping at straws. I'd at least hook up jConsole
for a sanity check...

But if your QTimes are fast, the next thing that comes to mind is that
you're spending (for some reason I can't name) more time gathering
your fields off disk. Which, with 1,200 records is a possibility. Again,
the why is a mystery. But you can do some triage by returning
just a few fields to see if that's the issue.

Wild stab: Did you re-index the data for your new version of Solr?
The index format changed not too long ago, so it's at least possible.
But why that would slow things down so much is another mystery
but it's worth testing.

Another wild bit would be your documentCache. Is it sized large enough?
As I remember, the figure is (max docs returned) * (possible number of
simultaneous requests), see:
http://wiki.apache.org/solr/SolrCaching#documentCache

Is there any chance that enableLazyFieldLoading is false
in solrconfig.xml? That could account for it.

But I'm afraid it's a matter of trying to remove stuff from your
process until something changes because this is pretty
surprising...

Best
Erick

On Thu, Feb 23, 2012 at 4:44 PM, naptowndev naptowndev...@gmail.com wrote:
 Erick -

 Thanks.  We've actually worked with Sematext to optimize the GC settings
 and saw initial (and continued) performance boosts as a result...

 The situation we're seeing now, has both versions of Solr running on the
 same box under the same JVM, but we are undeploying an instance at a time
 so as to prevent any outlying performance hits in the tests...

 So, that being said, both instances of solr, on the same box are running
 under the optimized settings.  I'd assume if GC was impacting the results
 of the newer version of Solr, we'd see similar decrease in performance on
 the older version.

 Aside from the QTime and other timings (highlight, etc) - which are all
 faster in the new version, the overall response time/delivery of the
 results are significantly slower under the new version.

 I've unfortunately exhausted my knowledge of Solr and what may or may not
 have changed between the nightly builds.

 I do appreciate your insight and hope you'll continue to throw out some
 ideas...and maybe someone else out there has seen these inconsistencies as
 well.

 The last set of test I ran consistently showed the the older build of Solr
 bringing back a result set of 13.1MB with 1200 records in 2.3 seconds
 wheres the newer build was bringing back the same result set in about 17.4
 seconds.  The catch is that the qtime and highlighting component time in
 the newer version are faster than the older version.

 Again, if you have any more ideas, let me know.

 Thanks!
 Brian

 On Thu, Feb 23, 2012 at 11:51 AM, Erick Erickson [via Lucene] 
 ml-node+s472066n377030...@n3.nabble.com wrote:

 Ah, no, my mistake. The wildcards for the fl list won't matter re:
 maxBooleanClauses,
 I didn't read carefully enough.

 I assume that just returning a field or two doesn't slow down

 But one possible culprit, especially since you say this kicks in after
 a while, is garbage collection. Here's an excellent intro:


 http://www.lucidimagination.com/blog/2011/03/27/garbage-collection-bootcamp-1-0/

 Especially look at the getting a view into garbage collection
 section and try specifying
 those options. The result should be that your solr log gets stats
 dumped every time
 GC kicks in. If this is a problem, look at the times in the logfile
 after your system slows
 down. You'll see a bunch of GC dumps that collect very little unused
 memory. You can
 also connect to the process using jConsole (should be in the Java
 distro) and watch
 the memory tab, especially after your server has slowed down. You can
 also
 connect jConsole remotely...

 This is just an experiment, but any time I see "and it slows down
 after ### minutes",
 GC is the first thing I think of.


 Best
 Erick


 On Thu, Feb 23, 2012 at 10:16 AM, naptowndev [hidden email] wrote:

  Erick -
 
  Agreed, it is puzzling.
 
  What I've found is that it doesn't matter if I pass in wildcards for the
  field list or not...but that the overall response time from the newer
 builds
  of Solr that we've tested (e.g. 4.0.0.2012.02.16) is slower than the
 older
  (4.0.0.2010.12.10.08.54.56) build.
 
  If I run the exact same query against those two cores, bringing back a
  payload of just over 13MB (xml), the older build brings it back in about
 1.6
  seconds and the newer build brings it back in about 8.4 seconds.
 
  Implementing the field list wildcard allows us to reduce the payload in the
  newer build (not an option in the older build). The payload is reduced to
  1.8MB but takes over 3.5 seconds to come back, as compared to the full
  payload (13MB) in the older build at about 1.6 seconds.
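For reference, the GC-logging JVM options Erick mentions above (a sketch from memory of that era's HotSpot flags; the linked blog post has the authoritative list for your JVM version) typically look like:

```
java -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
     -Xloggc:logs/gc.log -jar start.jar
```

With these in place, GC stats are written to the log every time a collection runs, which is what lets you spot back-to-back full GCs that reclaim very little memory after the slowdown begins.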
 
 

Re: Date search by specific month and day

2012-02-23 Thread Erick Erickson
I think your best bet is to parse out the relevant units and index
them independently. But this is probably only a few ints
per record, so it shouldn't be much of a resource hog.

Best
Erick

On Thu, Feb 23, 2012 at 5:24 PM, Kurt Nordstrom kurt.nordst...@unt.edu wrote:
 Hello all!

 We have a situation involving date searching that I could use some seasoned
 opinions on. What we have is a collection of records, each containing a Solr
 date field that we want to search on.

 The catch is that we want to be able to search for items that match a
 specific day/month. Essentially, we're trying to implement a "this day in
 history" feature for our dataset, so that users would be able to put in a
 date and we'd return all matching records from the past 100 years or so.

 Is there a way to perform this kind of search with only the basic Solr date
 field? Or would I have to parse out the month and day and store them in
 separate fields at indexing time?

 Thanks for the help!

 -Kurt
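A minimal sketch of Erick's suggestion: derive the month and day from the date at indexing time and index them as separate int fields, so "this day in history" becomes a simple filter query like fq=month_i:2 AND day_i:23. The field names (month_i, day_i) and the ISO date format are assumptions for illustration, not from the thread.

```java
import java.time.LocalDate;

public class DayInHistory {
    // Parse an ISO-8601 date string and pull out the month number (1-12).
    public static int month(String isoDate) {
        return LocalDate.parse(isoDate).getMonthValue();
    }

    // Pull out the day-of-month (1-31) for the separate day field.
    public static int day(String isoDate) {
        return LocalDate.parse(isoDate).getDayOfMonth();
    }

    public static void main(String[] args) {
        String stored = "2012-02-23";
        // These two ints would be sent alongside the full date field
        // when each document is indexed.
        System.out.println("month_i=" + month(stored) + " day_i=" + day(stored));
    }
}
```

As Erick notes, this is only a couple of extra ints per record, so the index-size cost is negligible compared with trying to express "same month and day, any year" against a single date field.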


Re: how to ignore cases while querying with a field with type=string?

2012-02-23 Thread Erick Erickson
I think your best bet is to NOT use string, use
something like:

  <fieldType name="lowercase" class="solr.TextField"
    sortMissingLast="true" omitNorms="true">
    <analyzer>
      <!-- KeywordTokenizer does no actual tokenizing, so the entire
           input string is preserved as a single token
        -->
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <!-- The LowerCase TokenFilter does what you expect, which can be
           useful when you want your sorting to be case insensitive
        -->
      <filter class="solr.LowerCaseFilterFactory" />
      <!-- The TrimFilter removes any leading or trailing whitespace -->
      <filter class="solr.TrimFilterFactory" />
    </analyzer>
  </fieldType>

The TrimFilterFactory is optional here. This will do what you need.
Of course you'll have to re-index.

Best
Erick

On Thu, Feb 23, 2012 at 6:29 PM, Yuhan Zhang yzh...@onescreen.com wrote:
 hi all,

 I am storing a list of tags in a field using type=string with multiValued
 setting:

 <field name="pageKeywords" type="string" indexed="true" stored="true"
 multiValued="true"/>

 It works OK when I query with pageKeywords:"The ones", and when I search
 for "ones" no record comes up, as desired.

 However, it appears that the query is case sensitive, so the queries
 pageKeywords:"The ones" and pageKeywords:"The Ones"
 give different results, which is not desirable in my case.

 Is there some setting in the query to let it ignore case? Or do I have
 to correct the data by keeping everything lower case?


 Thank you.

 Yuhan Zhang
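Why the KeywordTokenizer + LowerCaseFilter field type Erick suggests fixes this: both the indexed value and the query value collapse to one single lowercased token, so "The ones" and "The Ones" match each other while "ones" alone still does not. A rough client-side model of that normalization (plain Java, just to illustrate the analyzer's effect):

```java
import java.util.Locale;

public class TagNormalizer {
    // Mimic KeywordTokenizer (whole string = one token),
    // LowerCaseFilter, and TrimFilter.
    public static String normalize(String tag) {
        return tag.trim().toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // Differently-cased variants normalize to the same token...
        System.out.println(normalize("The Ones").equals(normalize("The ones")));
        // ...but a partial value is still a different token.
        System.out.println(normalize("ones").equals(normalize("The ones")));
    }
}
```

The key point is that the whole tag stays one token, so exact-match semantics are preserved; only the case (and surrounding whitespace) is folded away. Re-indexing is required because the stored tokens themselves change.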


TikaLanguageIdentifierUpdateProcessorFactory(since Solr3.5.0) to be used in Solr3.3.0?

2012-02-23 Thread bing
Hi, all, 

I am using
org.apache.solr.update.processor.TikaLanguageIdentifierUpdateProcessorFactory
(since Solr3.5.0) to do language detection, and it's cool.
 
An issue: if I deploy Solr 3.3.0, is it possible to import that factory
from Solr 3.5.0 and use it in Solr 3.3.0?

The reason I am stuck on Solr 3.3.0 is that I am working on DSpace
(discovery) calling Solr, and for now the highest version that Solr can be
upgraded to is 3.3.0.

I would hope to do this while keeping the DSpace + Solr setup as it is.
Say, importing that factory into Solr 3.3.0: is it possible? Does anyone
happen to know a way to solve this?

Best Regards, 
Bing

--
View this message in context: 
http://lucene.472066.n3.nabble.com/TikaLanguageIdentifierUpdateProcessorFactory-since-Solr3-5-0-to-be-used-in-Solr3-3-0-tp3771620p3771620.html
Sent from the Solr - User mailing list archive at Nabble.com.


How to increase Size of Document in solr

2012-02-23 Thread Suneel
Hello friends,

I am facing a problem during indexing with Solr. Indexing worked
successfully when the data size was 300 MB, but now my data size has
increased to around 50 GB. When I index the data it takes 8 hours, and
after that I found the data had not been committed. I have tried twice,
but the same issue occurred.

Is there any setting that needs to be done in solrconfig.xml to increase
the data capacity, or is this some other problem?

Please suggest; this will be very helpful to me.


Thanks & Regards



-
Suneel Pandey
Sr. Software Developer
--
View this message in context: 
http://lucene.472066.n3.nabble.com/How-to-increase-Size-of-Document-in-solr-tp3771813p3771813.html
Sent from the Solr - User mailing list archive at Nabble.com.
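One thing worth checking here, assuming no explicit commit is being issued after the 50 GB load (a guess; the thread doesn't show the indexing code): an autoCommit block in solrconfig.xml makes Solr commit periodically during long indexing runs instead of relying on one final commit. The thresholds below are illustrative, not a recommendation:

```
<updateHandler class="solr.DirectUpdateHandler2">
  <!-- Hard-commit every 100,000 docs or every 15 minutes
       (900,000 ms), whichever comes first. -->
  <autoCommit>
    <maxDocs>100000</maxDocs>
    <maxTime>900000</maxTime>
  </autoCommit>
</updateHandler>
```

With periodic commits, a crash or timeout 8 hours in loses only the uncommitted tail of the run rather than everything.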


Re: How to increase Size of Document in solr

2012-02-23 Thread bing
Hi, Suneel, 

There is a configuration in solrconfig.xml that you might need to look at.
Below I set the limit to 2 GB.

<requestParsers enableRemoteStreaming="true"
    multipartUploadLimitInKB="2048000" />

Best Regards, 
Bing 


--
View this message in context: 
http://lucene.472066.n3.nabble.com/How-to-increase-Size-of-Document-in-solr-tp3771813p3771931.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Fast Vector Highlighter Working for some records only

2012-02-23 Thread dhaivat
Hi Koji

 I am using Solr 3.5 and I want to highlight a multivalued field. When I
supply a single value for the multivalued field, the highlighter works
fine, but when I index multiple values for the field and try to highlight
that field, I get the following error with the Fast Vector Highlighter:

java.lang.StringIndexOutOfBoundsException: String index out of range: -1099

I have set the following parameters using SolrJ:


query.add("hl.q", term);
query.add("hl.fl", "contents");
query.add("hl", "true");

query.add("hl.useFastVectorHighlighter", "true");
query.add("hl.snippets", "100");
query.add("hl.fragsize", "7");
query.add("hl.maxAnalyzedChars", "7");

Can you please tell me the cause of this error?

Thanks in advance
Dhaivat





Koji Sekiguchi wrote
 
 Hi dhaivat,
 
 I think you may want to use analysis.jsp:
 
 http://localhost:8983/solr/admin/analysis.jsp
 
 Go to the URL and look into how your custom tokenizer produces tokens,
 and compare with the output of Solr's inbuilt tokenizer.
 
 koji
 -- 
 Query Log Visualizer for Apache Solr
 http://soleami.com/
 
 
 (12/02/22 21:35), dhaivat wrote:

 Koji Sekiguchi wrote

 (12/02/22 11:58), dhaivat wrote:
 Thanks for reply,

 But can you please tell me why it's working for some documents and not
 for others.

 As Solr 1.4.1 cannot recognize the hl.useFastVectorHighlighter flag, Solr
 just ignores it, but because hl=true is there, Solr tries to create
 highlight snippets using the existing (traditional, i.e. not FVH)
 Highlighter. Since the Highlighter (including FVH) sometimes cannot
 produce snippets for various reasons, you can use the hl.alternateField
 parameter.

 http://wiki.apache.org/solr/HighlightingParameters#hl.alternateField

 koji
 --
 Query Log Visualizer for Apache Solr
 http://soleami.com/


 Thank you so much for the explanation,

 I have updated my Solr version and am using 3.5. Could you please tell me:
 when I am using a custom Tokenizer on the field, do I need to make any
 changes related to the Solr highlighter?

 here is my custom analyser

   <fieldType name="custom_text" class="solr.TextField"
     positionIncrementGap="100">
     <analyzer type="index">
       <tokenizer class="ns.solr.analyser.CustomIndexTokeniserFactory"/>
     </analyzer>
     <analyzer type="query">
       <tokenizer class="ns.solr.analyser.CustomSearcherTokeniserFactory"/>
     </analyzer>
   </fieldType>

 here is the field info:

 <field name="contents" type="custom_text" indexed="true" stored="true"
 multiValued="true" termPositions="true" termVectors="true"
 termOffsets="true"/>

 I am creating tokens using my custom analyser, and when I try to use the
 highlighter it doesn't work properly for the contents field. But when I
 tried Solr's inbuilt tokeniser, the word was highlighted for the
 particular query. Can you please help me out with this?


 Thanks in advance
 Dhaivat





 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Fast-Vector-Highlighter-Working-for-some-records-only-tp3763286p3766335.html
 Sent from the Solr - User mailing list archive at Nabble.com.

 


--
View this message in context: 
http://lucene.472066.n3.nabble.com/Fast-Vector-Highlighter-Working-for-some-records-only-tp3763286p3771933.html
Sent from the Solr - User mailing list archive at Nabble.com.
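One plausible cause of the StringIndexOutOfBoundsException above (a guess, not confirmed in the thread): a custom tokenizer that emits start/end character offsets which don't line up with the stored field text. The Fast Vector Highlighter cuts snippets out of the stored value using those offsets, so for multivalued fields each value's offsets must stay within that value. A sketch of the invariant every emitted token should satisfy:

```java
public class OffsetCheck {
    // True if [start, end) is a valid character span within the
    // stored field value that the highlighter will slice from.
    public static boolean valid(String value, int start, int end) {
        return 0 <= start && start <= end && end <= value.length();
    }

    public static void main(String[] args) {
        String value = "hello world";
        // A well-formed token span covering "world".
        System.out.println(valid(value, 6, 11));
        // An offset past the end of the value, the kind of span that
        // makes the highlighter compute a negative substring index.
        System.out.println(valid(value, 6, 1200));
    }
}
```

If the inbuilt tokenizer highlights correctly and the custom one does not, comparing the offsets each produces on the same input (e.g. via the analysis page Koji suggested) is a quick way to confirm this.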