Re: dismax request handler without q

2010-07-20 Thread Joe Calderon
try something like this:
q.alt=*:*&fq=keyphrase:hotel

though if you don't need to query across multiple fields, dismax is
probably not the best choice
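A quick sketch of the suggested request, built with Python's urlencode; the host, port, and handler path here are assumptions, not from the thread:

```python
from urllib.parse import urlencode

# Hypothetical base URL; adjust host/port/handler to your deployment.
base = "http://localhost:8983/solr/select"

# With dismax, q.alt supplies a query when q is absent; the fq then
# restricts results to documents whose keyphrase field contains "hotel".
params = {
    "defType": "dismax",
    "q.alt": "*:*",
    "fq": "keyphrase:hotel",
}
url = base + "?" + urlencode(params)
print(url)
```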

On Tue, Jul 20, 2010 at 4:57 AM, olivier sallou
olivier.sal...@gmail.com wrote:
 q will search in defaultSearchField if no field name is set, but you can
 specify in your q param the fields you want to search in.

 Dismax is a handler where you can specify a number of fields to look in for
 the input query. In this case, you do not specify the fields, and dismax will
 look in the fields specified in its configuration.
 However, by default, dismax is not used; it needs to be invoked with the
 query type parameter (qt=dismax).

 In the default solr config, you can call ...solr/select?q=keyphrase:hotel if
 keyphrase is a declared field in your schema

 2010/7/20 Chamnap Chhorn chamnapchh...@gmail.com

 I can't put q=keyphrase:hotel in my request using dismax handler. It
 returns
 no result.

 On Tue, Jul 20, 2010 at 1:19 PM, Chamnap Chhorn chamnapchh...@gmail.com
 wrote:

  There are some default configuration on my solrconfig.xml that I didn't
  show you. I'm a little confused when reading
  http://wiki.apache.org/solr/DisMaxRequestHandler#q. I think q is for
 plain
  user input query.
 
 
  On Tue, Jul 20, 2010 at 12:08 PM, olivier sallou 
 olivier.sal...@gmail.com
   wrote:
 
  Hi,
  this is not very clear. if you need to query only keyphrase, why don't you
  query it directly? e.g. q=keyphrase:hotel ?
  Furthermore, why dismax if only the keyphrase field is of interest? dismax is
  used to query multiple fields automatically.

  At least, dismax does not appear in your query (via the query type). Is it
  set in
  your config as your default request handler?
 
  2010/7/20 Chamnap Chhorn chamnapchh...@gmail.com
 
    I wonder how i could make a query to return only *all books* that have
    the keyphrase "web development" using the dismax handler? A book has multiple
    keyphrases (keyphrase is a multivalued column). Do I have to pass the q
    parameter?
   
   
    Is it the correct one?
    http://localhost:8081/solr/select?q=hotel&fq=keyphrase:%20hotel
  
   --
   Chhorn Chamnap
   http://chamnapchhorn.blogspot.com/
  
 
 
 
 
  --
  Chhorn Chamnap
  http://chamnapchhorn.blogspot.com/
 



 --
 Chhorn Chamnap
 http://chamnapchhorn.blogspot.com/




Re: preside != president

2010-06-28 Thread Joe Calderon
the general consensus among people who run into the problem you have
is to use a plurals-only stemmer, a synonyms file, or a combination of
both (for irregular nouns, etc.)

if you search the archives you can find info on a plurals stemmer

On Mon, Jun 28, 2010 at 6:49 AM,  dar...@ontrenet.com wrote:
 Thanks for the tip. Yeah, I think the stemming confounds search results as
 it stands (porter stemmer).

 I was also thinking of using my dictionary of 500,000 words with their
 complete morphologies and conjugations to create a synonyms.txt that
 provides accurate English morphology.

 Is this a good idea?

 Darren

 Hi Darren,

 You might want to look at the KStemmer
 (http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters/Kstem)
 instead of the standard PorterStemmer. It essentially has a 'dictionary'
 of exception words where stemming stops if found, so in your case
 president won't be stemmed any further than president (but presidents will
 be stemmed to president). You will have to integrate it into solr
 yourself, but that's straightforward.

 HTH
 Brendan


 On Jun 28, 2010, at 8:04 AM, Darren Govoni wrote:

 Hi,
  It seems to me that because the stemming does not produce
 grammatically correct stems in many cases,
 search anomalies can occur like the one I am seeing, where I have a
 document with "president" in it and it is returned
 when I search for "preside", a different word entirely.

 Is this correct or acceptable behavior? In previous discussions here on
 stemming, I was told it's ok as long as all the words reduce
 to the same stem, but when different words reduce to the same stem it
 seems to affect search results in a bad way.

 Darren






Re: Strange query behavior

2010-06-28 Thread Joe Calderon
splitOnCaseChange is creating multiple tokens from "3dsMax"; disable it
or enable catenateAll. Use the analysis page in the admin tool to see
exactly how your text will be indexed by the analyzers without having to
reindex your documents; once you have it right you can do a full
reindex.
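A rough Python approximation of what splitOnCaseChange does to a token like "3dsMax" (the real WordDelimiterFilterFactory is more involved; this just illustrates why the whole word stops matching unless the parts are catenated back together):

```python
import re

def word_delimiter_parts(token, split_on_case_change=True, catenate_all=False):
    """Rough approximation of Solr's WordDelimiterFilterFactory:
    split on letter/digit boundaries and (optionally) on case changes."""
    if split_on_case_change:
        # runs of digits, all-caps runs, Capitalized words, lowercase runs
        parts = re.findall(r"\d+|[A-Z]+(?![a-z])|[A-Z][a-z]*|[a-z]+", token)
    else:
        parts = re.findall(r"\d+|[A-Za-z]+", token)
    out = list(parts)
    if catenate_all and len(parts) > 1:
        out.append("".join(parts))  # catenateAll adds the joined form back
    return out

# "3dsMax" is split into several tokens, which is why a query for the
# whole word finds nothing unless the parts are catenated back together:
print(word_delimiter_parts("3dsMax"))                     # ['3', 'ds', 'Max']
print(word_delimiter_parts("3dsMax", catenate_all=True))  # ['3', 'ds', 'Max', '3dsMax']
```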

On Mon, Jun 28, 2010 at 5:48 AM, Marc Ghorayeb dekay...@hotmail.com wrote:

 Hello,
 I have a title that says "3DVIA Studio & Virtools Maya and 3dsMax Exporters".
 The analysis tool for this field gives me these tokens:
 3dvia; dvia; studio; virtool; maya; 3dsmax; ds; systèm; max; export


 However, when I search for "3dsmax", I get no results :( Furthermore, if I
 search for "dsmax" the spellchecker suggests "3dsmax" even
 though it doesn't find any results. If I search for any other token (3dvia,
 or max for example), the document is found. "3dsmax" is the only token that
 doesn't seem to work!! :(
 Here is my schema for this field:
 <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
        <analyzer type="index">
                <tokenizer class="solr.WhitespaceTokenizerFactory"/>

                <filter class="solr.WordDelimiterFilterFactory"
                        generateWordParts="1"
                        generateNumberParts="1"
                        catenateWords="0"
                        catenateNumbers="0"
                        catenateAll="0"
                        splitOnCaseChange="1"
                        preserveOriginal="1"
                />

                <filter class="solr.TrimFilterFactory" updateOffsets="true"/>
                <filter class="solr.LengthFilterFactory" min="2" max="15"/>
                <filter class="solr.StopFilterFactory" ignoreCase="true"
                        words="stopwords.txt" enablePositionIncrements="true"/>
                <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
                        ignoreCase="true" expand="true"/>

                <filter class="solr.LowerCaseFilterFactory"/>
                <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
                <filter class="solr.SnowballPorterFilterFactory"
                        language="${Language}" protected="protwords.txt"/>
        </analyzer>

        <analyzer type="query">
                <tokenizer class="solr.WhitespaceTokenizerFactory"/>

                <filter class="solr.WordDelimiterFilterFactory"
                        generateWordParts="1"
                        generateNumberParts="1"
                        catenateWords="1"
                        catenateNumbers="1"
                        catenateAll="0"
                        splitOnCaseChange="1"
                        preserveOriginal="1"
                />

                <filter class="solr.TrimFilterFactory" updateOffsets="true"/>
                <filter class="solr.LengthFilterFactory" min="2" max="15"/>
                <filter class="solr.StopFilterFactory" ignoreCase="true"
                        words="stopwords.txt" enablePositionIncrements="true"/>
                <filter class="solr.LowerCaseFilterFactory"/>
                <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
                <filter class="solr.SnowballPorterFilterFactory"
                        language="${Language}" protected="protwords.txt"/>
        </analyzer>
 </fieldType>
 Can anyone help me out please? :(
 PS: the ${Language} is set to en (for english) in this case...



Re: questions about Solr shards

2010-06-28 Thread Joe Calderon
there is a first-pass query to retrieve all matching document ids from
every shard along with the relevant sorting information; the document ids
are then sorted and limited to the number needed, then a second query
is sent for the rest of the documents' metadata.
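A toy sketch of the first-pass merge described above, with made-up shard results and scores:

```python
# Hypothetical shard responses: each shard returns (doc_id, sort_value)
# pairs for the full result window in the first pass.
shard_a = [("d1", 0.91), ("d4", 0.55), ("d6", 0.20)]
shard_b = [("d2", 0.87), ("d3", 0.60), ("d5", 0.11)]

def first_pass_merge(shard_results, rows):
    """Merge ids + sort keys from every shard, keep only the top `rows`."""
    merged = [pair for shard in shard_results for pair in shard]
    merged.sort(key=lambda p: p[1], reverse=True)  # e.g. sort by score desc
    return [doc_id for doc_id, _ in merged[:rows]]

top_ids = first_pass_merge([shard_a, shard_b], rows=3)
print(top_ids)  # ['d1', 'd2', 'd3']
# Second pass (not shown): fetch stored fields only for top_ids,
# each from the shard that reported that id.
```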

On Sun, Jun 27, 2010 at 7:32 PM, Babak Farhang farh...@gmail.com wrote:
 Otis,

 Belated thanks for your reply.

 2. The index could change between stages, e.g. a document that matched a
  query and was subsequently changed may no longer match but will still be
  retrieved.

 2. This describes the situation where, for instance, a
 document with ID=10 is updated between the 2 calls
 to the Solr instance/shard where that doc ID=10 lives.

 Can you explain why this happens? (I.e. does each query to the sharded
 index somehow involve 2 calls to each shard instance from the base
 instance?)

 -Babak

 On Thu, Jun 24, 2010 at 10:14 PM, Otis Gospodnetic
 otis_gospodne...@yahoo.com wrote:
 Hi Babak,

 1. Yes, you are reading that correctly.

 2. This describes the situation where, for instance, a document with ID=10 
 is updated between the 2 calls to the Solr instance/shard where that doc 
 ID=10 lives.

 3. Yup, orthogonal.  You can have a master with multiple cores for sharded 
 and non-sharded indices and you can have a slave with cores that hold 
 complete indices or just their shards.
  Otis
 
 Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
 Lucene ecosystem search :: http://search-lucene.com/



 - Original Message 
 From: Babak Farhang farh...@gmail.com
 To: solr-user@lucene.apache.org
 Sent: Thu, June 24, 2010 6:32:54 PM
 Subject: questions about Solr shards

 Hi everyone,

 There are a couple of notes on the limitations of this approach at
 http://wiki.apache.org/solr/DistributedSearch which I'm having trouble
 understanding.

 1. When duplicate doc IDs are received, Solr chooses the first doc
    and discards subsequent ones

 "Received" here is from the perspective of the base Solr instance at
 query time, right?  I.e. if you inadvertently indexed 2 versions of
 the document with the same unique ID but different contents to 2
 shards, then at query time, the first document (putting aside for the
 moment what exactly "first" means) would win.  Am I reading this
 right?


 2. The index could change between stages, e.g. a document that matched a
    query and was subsequently changed may no longer match but will still be
    retrieved.

 I have no idea what this second statement means.


 And one other question about shards:

 3. The examples I've seen documented do not illustrate sharded,
 multicore setups; only sharded monolithic cores.  I assume sharding
 works with multicore as well (i.e. the two issues are orthogonal).  Is
 this right?


 Any help on interpreting the above would be much appreciated.

 Thank you,
 -Babak




Re: SOLR partial string matching question

2010-06-22 Thread Joe Calderon
you want a combination of WhitespaceTokenizer and EdgeNGramFilter
http://lucene.apache.org/solr/api/org/apache/solr/analysis/WhitespaceTokenizerFactory.html
http://lucene.apache.org/solr/api/org/apache/solr/analysis/EdgeNGramFilterFactory.html

the first will create tokens for each word; the second will create
multiple prefix tokens from each word.

use the analysis link from the admin page to test your filter chain
and make sure it's doing what you want.
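A small Python sketch of what that chain would index (the real EdgeNGramFilterFactory has more options; the gram sizes here are arbitrary):

```python
def edge_ngrams(text, min_gram=2, max_gram=10):
    """Whitespace-tokenize, then emit prefix n-grams per token: roughly
    what WhitespaceTokenizer + EdgeNGramFilter produce at index time."""
    grams = []
    for token in text.lower().split():
        for n in range(min_gram, min(max_gram, len(token)) + 1):
            grams.append(token[:n])
    return grams

index_terms = edge_ngrams("bank of america")
print(index_terms)
# ['ba', 'ban', 'bank', 'of', 'am', 'ame', 'amer', 'ameri', 'americ', 'america']

# A plain whitespace-tokenized query for "of ameri" yields the terms
# "of" and "ameri", both of which appear among the indexed grams,
# so the partial input matches the indexed document.
query_tokens = "of ameri".split()
print(all(t in index_terms for t in query_tokens))  # True
```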


On Tue, Jun 22, 2010 at 4:06 PM, Vladimir Sutskever
vladimir.sutske...@jpmorgan.com wrote:
 Hi,

 Can you guys make a recommendation for which types/filters to use to accomplish
 the following partial keyword match:


 A. Actual Indexed Term:  "bank of america"

 B. User Enters Search Term:  "of ameri"


 I would like SOLR to match the document "bank of america" with the partial
 string "of ameri"

 Any suggestions?



 Kind regards,

 Vladimir Sutskever
 Investment Bank - Technology
 JPMorgan Chase, Inc.



 This email is confidential and subject to important disclaimers and
 conditions including on offers for the purchase or sale of
 securities, accuracy and completeness of information, viruses,
 confidentiality, legal privilege, and legal entity disclaimers,
 available at http://www.jpmorgan.com/pages/disclosures/email.


Re: DismaxRequestHandler

2010-06-17 Thread Joe Calderon
the qs parameter affects matching, but you have to wrap your query in
double quotes, e.g.

q="oil spill"&qf=title description&qs=4&defType=dismax

i'm not sure how to formulate such a query to apply that rule just to
description, maybe with nested queries ...

On Thu, Jun 17, 2010 at 12:01 PM, Blargy zman...@hotmail.com wrote:

 I have a title field and a description field. I am searching across both
 fields, but I don't want description matches unless they are within some slop
 of each other. How can I query for this? It seems that I'm getting back crazy
 results when there are matches that are nowhere near each other

 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/DismaxRequestHandler-tp903641p903641.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Exact match on a filter

2010-06-17 Thread Joe Calderon
use a copyField and index the copy as type string; exact matches on
that field should then work, as the text won't be tokenized
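A minimal schema sketch of this approach (the field names here are illustrative, not from the original post):

```xml
<!-- tokenized field used for normal full-text search -->
<field name="brand" type="text" indexed="true" stored="true"/>
<!-- untokenized copy used only for exact-match filtering -->
<field name="brand_exact" type="string" indexed="true" stored="false"/>
<copyField source="brand" dest="brand_exact"/>
```

With this in place, fq=brand_exact:apple matches only documents whose brand is exactly "apple", because string fields index the whole value as a single term.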

On Thu, Jun 17, 2010 at 3:13 PM, Pete Chudykowski
pchudykow...@shopzilla.com wrote:
 Hi,

 I'm trying with no luck to filter on the exact-match value of a field.
 Specifically:
  fq=brand:apple
 returns documents whose 'brand' field contains values like "apple bottoms".

 Is there a way to formulate the fq expression to match precisely and only
 "apple"?

 Thanks in advance for your help.
 Pete.



Re: DismaxRequestHandler

2010-06-17 Thread Joe Calderon
see yonik's post on nested queries
http://www.lucidimagination.com/blog/2009/03/31/nested-queries-in-solr/

so, for example, i thought you could possibly do a dismax query across
the main fields (in this case just title) and OR that with
_query_:{!description:'oil spill'~4}

On Thu, Jun 17, 2010 at 3:01 PM, MitchK mitc...@web.de wrote:

 Joe,

 please, can you provide an example of what you are thinking of?

 Subqueries with Solr... I've never seen something like that before.

 Thank you!

 Kind regards
 - Mitch
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/DismaxRequestHandler-tp903641p904142.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: federated / meta search

2010-06-17 Thread Joe Calderon
yes, you can use distributed search across shards with different
schemas as long as the query only references overlapping fields. i
usually test adding new fields or tokenizers on one shard and deploy
only after i've verified it's working properly

On Thu, Jun 17, 2010 at 1:10 PM, Markus Jelsma markus.jel...@buyways.nl wrote:
 Hi,



 Check out Solr's sharding [1] capabilities. I never tested it with different
 schemas, but if each node is queried with fields that it supports, it should
 return useful results.



 [1]: http://wiki.apache.org/solr/DistributedSearch



 Cheers.

 -Original message-
 From: Sascha Szott sz...@zib.de
 Sent: Thu 17-06-2010 19:44
 To: solr-user@lucene.apache.org;
 Subject: federated / meta search

 Hi folks,

 if I'm seeing it right Solr currently does not provide any support for
 federated / meta searching. Therefore, I'd like to know if anyone has
 already put efforts into this direction? Moreover, is federated / meta
 search considered a scenario Solr should be able to deal with at all or
 is it (far) beyond the scope of Solr?

 To be more precise, I'll give you a short explanation of my
 requirements. Assume, there are a couple of Solr instances running at
 different places. The documents stored within those instances are all
 from the same domain (bibliographic records), but it can not be ensured
 that the schema definitions conform to 100%. But lets say, there are at
 least some index fields that are present in all instances (fields with
 the same name and type definition). Now, I'd like to perform a search on
 all instances at the same time (with the restriction that the query
 contains only those fields that overlap among the different schemas) and
 combine the results in a reasonable way by utilizing the score
 information associated with each hit. Please note, that due to legal
 issues it is not feasible to build a single index that integrates the
 documents of all Solr instances under consideration.

 Thanks in advance,
 Sascha




Re: how to have shards parameter by default

2010-06-10 Thread Joe Calderon
you've created an infinite loop: the shard you query calls all other
shards and itself, and so on. Create a separate requestHandler and
query that, e.g.

<requestHandler name="/distributed_select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="shards">localhost:7500/solr,localhost:7501/solr,localhost:7502/solr,localhost:7503/solr,localhost:7504/solr,localhost:7505/solr,localhost:7506/solr</str>
  </lst>
  <arr name="components">
    <str>facet</str>
    <str>debug</str>
  </arr>
</requestHandler>


On Wed, Jun 9, 2010 at 9:10 PM, Scott Zhang macromars...@gmail.com wrote:
 I tried putting shards into the default request handler.
 But now each time I search, solr hangs forever.
 So what's the correct solution?

 Thanks.

  <requestHandler name="standard" class="solr.SearchHandler" default="true">
    <!-- default values for query parameters -->
    <lst name="defaults">
      <str name="echoParams">explicit</str>
      <int name="rows">10</int>
      <str name="fl">*</str>
      <str name="version">2.1</str>
      <str name="shards">localhost:7500/solr,localhost:7501/solr,localhost:7502/solr,localhost:7503/solr,localhost:7504/solr,localhost:7505/solr,localhost:7506/solr</str>
    </lst>
  </requestHandler>

 On Thu, Jun 10, 2010 at 11:48 AM, Scott Zhang macromars...@gmail.comwrote:

 Hi. I am running distributed search on solr.
 I have 70 solr instances. So each time I want to search I need to use
  ?shards=localhost:7500/solr,...,localhost:7620/solr

 It is very long url.

 so how can I encode shards into the config file so I don't need to type it
 each time?


 thanks.
 Scott




Re: Field Collapsing: How to estimate total number of hits

2010-05-12 Thread Joe Calderon
don't know if it's the best solution, but i have a field i facet on
called "type" (it's either 0 or 1); combined with collapse.facet=before i just
sum all the values of the facet field to get the total number found

if you don't have such a field, you can always add a field with a single value
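A toy illustration of the trick, with made-up facet counts:

```python
# Hypothetical facet counts returned with collapse.facet=before:
# the faceted field "type" has the values "0" and "1".
facet_counts = {"0": 1200, "1": 345}

# With collapsing enabled, numFound only reflects the rows returned,
# but summing the pre-collapse facet counts recovers the total hits.
total_found = sum(facet_counts.values())
print(total_found)  # 1545
```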

--joe

On Wed, May 12, 2010 at 10:41 AM, Sergey Shinderuk sshinde...@gmail.com wrote:
 Hi, fellows!

 I use field collapsing to collapse near-duplicate documents based on
 document fuzzy signature calculated at index time.
 The problem is that, when field collapsing is enabled, in query
 response numFound is equal to the number of rows requested.

 For instance, with solr example schema i can issue the following query

 http://localhost:8983/solr/select?q=*:*&rows=3&collapse.field=manu_exact

 In response i get collapse_counts together with ordinary result list,
 but numFound equals 3.
 As far as I understand, this is due to the way field collapsing works.

 I want to show the total number of hits to the user and provide a
 pagination through the results.

 Any ideas?

 Regards,
 Sergey Shinderuk



synonym filter and offsets

2010-04-19 Thread Joe Calderon
hello *, i'm having issues with the synonym filter altering token offsets.

my input text is
"saturday night live"
it is tokenized by the whitespace tokenizer, yielding 3 tokens
[saturday, 0,8], [night, 9, 14], [live, 15,19]

on indexing these are passed through a synonym filter that has this line
saturday night live => snl, saturday night live


i now end up with four tokens
[saturday, 0, 19], [snl, 0, 19], [night, 0, 19], [live, 0,19]

what i want is
[saturday, 0,8], [snl, 0,19], [night, 9, 14], [live, 15,19]


when using the highlighter i want to make it so only the relevant part
of the text is highlighted, how can i fix my filter chain?


thx much
--joe


highlighter issue

2010-04-02 Thread Joe Calderon
hello *, i have a field that is indexing the string "the
ex-girlfriend" as these tokens: [the, exgirlfriend, ex, girlfriend].
they are then passed to the edgengram filter; this allows me to match
different user spellings and allows for partial highlighting. however,
a token like 'ex' gets generated twice, which should be fine,
except the highlighter seems to highlight that token twice even though
it has the same offsets (4,6)

is there a way to make the highlighter not highlight the same token
twice, or do i have to create a token filter that would dump tokens
with equal text and offsets ?


basically whats happening now is if i search

'the e', i get:
'<em>Seinfeld</em> The <em>E</em><em>E</em>x-Girlfriend'

for 'the ex', i get:
'<em>Seinfeld</em> The <em>Ex</em><em>Ex</em>-Girlfriend'

and so on


thx much

--joe


Re: highlighter issue

2010-04-02 Thread Joe Calderon
i had tried it earlier with no effect; when i looked at the source, it
doesn't look at offsets at all, just position increments, so short of
somebody finding a better way i'm going to create a similar filter that
compares offsets...
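A sketch of that offset-comparing filter in plain Python rather than as a Lucene TokenFilter; the token tuples are hypothetical:

```python
def dedupe_tokens(tokens):
    """Drop tokens whose text and offsets match an earlier token.
    Unlike RemoveDuplicatesTokenFilter (which only looks at position
    increments), this compares (text, start, end) triples."""
    seen = set()
    out = []
    for text, start, end in tokens:
        key = (text, start, end)
        if key not in seen:
            seen.add(key)
            out.append((text, start, end))
    return out

# "the ex-girlfriend": the edge-ngram chain emits "ex" twice at offsets (4, 6).
tokens = [("the", 0, 3), ("ex", 4, 6), ("ex", 4, 6), ("girlfriend", 7, 17)]
print(dedupe_tokens(tokens))
# [('the', 0, 3), ('ex', 4, 6), ('girlfriend', 7, 17)]
```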

On Fri, Apr 2, 2010 at 2:07 PM, Erik Hatcher erik.hatc...@gmail.com wrote:
 Will adding the RemoveDuplicatesTokenFilter(Factory) do the trick here?

        Erik

 On Apr 2, 2010, at 4:13 PM, Joe Calderon wrote:

 hello *, i have a field that is indexing the string the
 ex-girlfriend as these tokens: [the, exgirlfriend, ex, girlfriend]
 then they are passed to the edgengram filter, this allows me to match
 different user spellings and allows for partial highlighting, however
 a token like 'ex' would get generated twice which should be fine
 except the highlighter seems to highlight that token twice even though
 it has the same offsets (4,6)

 is there away to make the highlighter not highlight the same token
 twice, or do i have to create a token filter that would dump tokens
 with equal text and offsets ?


 basically whats happening now is if i search

 'the e', i get:
 '<em>Seinfeld</em> The <em>E</em><em>E</em>x-Girlfriend'

 for 'the ex', i get:
 '<em>Seinfeld</em> The <em>Ex</em><em>Ex</em>-Girlfriend'

 and so on


 thx much

 --joe




how to create this highlighter behaviour

2010-03-29 Thread Joe Calderon
hello *, i've been using the highlighter and been pretty happy with
its results, however there's an edge case i'm not sure how to fix.

for query: "amazing grace"

the record matched and highlighted is
<em>amazing</em> rendition of <em>amazing grace</em>

is there any way to only highlight "amazing grace" without using phrase
queries? can i modify the highlighter components to only use terms
once and to favor contiguous sections?

i don't want to enforce phrase queries, as sometimes i do want terms out
of order highlighted, but i only want each term matched highlighted
once


does this make sense?


Re: Need help in deploying the modified SOLR source code

2010-03-12 Thread Joe Calderon
do `ant clean dist` within the solr source and use the resulting war
file, though in the future you might think about extending the built-in
parser and creating a parser plugin rather than modifying the actual sources


see http://wiki.apache.org/solr/SolrPlugins#QParserPlugin for more info

--joe
On 03/12/2010 07:34 PM, JavaGuy84 wrote:

Hi,

I made some changes to solrqueryparser.java using Eclipse and I am able
to do a leading wildcard search using the Jetty plugin (downloaded this plugin
for Eclipse). Now I am not sure how I can package this code and redeploy it.
Can someone help me out please?

Thanks,
B
   




Re: Highlighting

2010-03-10 Thread Joe Calderon
just to make sure we're on the same page: you're saying that the
highlight section of the response is empty, right? the results section
is never highlighted; a separate section contains the highlighted
fields specified in hl.fl=

On Wed, Mar 10, 2010 at 5:23 AM, Ahmet Arslan iori...@yahoo.com wrote:


 Yes Content is stored and I get same
 results adding that parameter.

 Still not highlighting the content :-(

 Any other ideas

 What is the field type of attr_content? And what is your query?

 Are you running your query on another field and then requesting snippets from
 attr_content?

 q=attr_content:somequery&hl=true&hl.fl=attr_content&hl.maxAnalyzedChars=-1
 should return highlighting.






Re: Highlighting

2010-03-10 Thread Joe Calderon
no, that's not the case, see this example response in json format:
{
 "responseHeader":{
  "status":0,
  "QTime":0,
  "params":{
    "indent":"on",
    "q":"title_edge:fami",
    "hl.fl":"title_edge",
    "wt":"json",
    "hl":"on",
    "rows":"1"}},
 "response":{"numFound":18,"start":0,"docs":[
    {
     "title_id":"1581",
     "title_edge":"Family",
     "num":4}]
 },
 "highlighting":{
  "1581":{
    "title_edge":["<em>Fami</em>ly"]}}}



see how the highlight info is separate from the results?

On Wed, Mar 10, 2010 at 7:44 AM, Lee Smith l...@weblee.co.uk wrote:
 I am getting results no problem with the query.

 But from what I believe it should wrap <em></em> around the text in the result.

 So if I search for e.g. "Andrew", within the returned content I would have the
 contents with the word <em>Andrew</em>

 and hl.fl=attr_content

 Thank you for you help

 Begin forwarded message:

 From: Joe Calderon calderon@gmail.com
 Date: 10 March 2010 15:37:35 GMT
 To: solr-user@lucene.apache.org
 Subject: Re: Highlighting
 Reply-To: solr-user@lucene.apache.org

 just to make sure were on the same page, youre saying that the
 highlight section of the response is empty right? the results section
 is never highlighted but a separate section contains the highlighted
 fields specified in hl.fl=

 On Wed, Mar 10, 2010 at 5:23 AM, Ahmet Arslan iori...@yahoo.com wrote:


 Yes Content is stored and I get same
 results adding that parameter.

 Still not highlighting the content :-(

 Any other ideas

 What is the field type of attr_content? And what is your query?

 Are you running your query on another field and then requesting snippets 
 from
 attr_content?

  q=attr_content:somequery&hl=true&hl.fl=attr_content&hl.maxAnalyzedChars=-1
  should return highlighting.








Re: Highlighting

2010-03-09 Thread Joe Calderon
did you enable the highlighting component in solrconfig.xml? try setting
debugQuery=true to see if the highlighting component is even being
called...

On Tue, Mar 9, 2010 at 12:23 PM, Lee Smith l...@weblee.co.uk wrote:
 Hey All

 I have indexed a whole bunch of documents and now I want to search against 
 them.

 My search is going great all but highlighting.

 I have these items set

 hl=true
 hl.snippets=2
 hl.fl = attr_content
 hl.fragsize=100

 Everything works apart from the highlighted text found not being surrounded
 with an <em>

 Am I missing a setting ?

 Lee


Re: indexing a huge data

2010-03-05 Thread Joe Calderon
i've found the csv update to be exceptionally fast, though others enjoy
the flexibility of the data import handler

On Fri, Mar 5, 2010 at 10:21 AM, Mark N nipen.m...@gmail.com wrote:
 what would be the fastest way to index documents? I am indexing a huge
 collection of data after extracting certain meta-data information,
 for example the author and filename of each file.

 i am extracting this information and storing it in XML format,

 for example:   <fileid>1</fileid><author>abc</author>
 <filename>abc.doc</filename>
                <fileid>2</fileid><author>abc</author>
 <filename>abc1.doc</filename>

 I can not index these documents directly to solr as it is not in the format
 required by solr (i can not change the format as it's used in other modules)

 would converting these files to CSV be a better and faster approach
 compared to XML?



 please  suggest




 --
 Nipen Mark



Re: Issue on stopword list

2010-03-02 Thread Joe Calderon
or you can try the CommonGrams filter, which combines tokens adjacent to a stopword
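A rough sketch of CommonGrams-style output for the example query in this thread (the real CommonGramsFilter is a Lucene TokenFilter; this just mimics its bigram output):

```python
def common_grams(tokens, stopwords):
    """Whenever a token or its right-hand neighbor is a stopword,
    also emit the joined bigram alongside the plain token."""
    out = []
    for i, tok in enumerate(tokens):
        out.append(tok)
        if i + 1 < len(tokens) and (tok in stopwords or tokens[i + 1] in stopwords):
            out.append(tok + "_" + tokens[i + 1])
    return out

stop = {"this", "is", "that", "the"}
print(common_grams("this is that".split(), stop))
# ['this', 'this_is', 'is', 'is_that', 'that']
```

Because the stopword-adjacent bigrams are indexed as tokens, a query for "This is that" can still match even though the bare stopwords alone would be removed.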

On Tue, Mar 2, 2010 at 6:56 AM, Walter Underwood wun...@wunderwood.org wrote:
 Don't remove stopwords if you want to search on them. --wunder

 On Mar 2, 2010, at 5:43 AM, Erick Erickson wrote:

 This is a classic problem with Stopword removal. Have you tried
 just removing stopwords from the indexing definition and the
 query definition and reindexing?

 You can't search on them no matter what you do if they've
 been removed, they just aren't there

 HTH
 Erick

 On Tue, Mar 2, 2010 at 5:47 AM, Suram reactive...@yahoo.com wrote:


 Hi,

 How can i search using stopwords? my query is like this:

 This             - 0 results because it is a stopword
 is               - 0 results because it is a stopword
 that             - 0 results because it is a stopword

 if i search like "This is that" - it must give the result

 do i need to change anything in my schema file to get a result for "This is
 that"?
 --
 View this message in context:
 http://old.nabble.com/Issue-on-stopword-list-tp27754434p27754434.html
 Sent from the Solr - User mailing list archive at Nabble.com.






Re: Search Result differences Standard vs DisMax

2010-03-01 Thread Joe Calderon
what are you using for the mm parameter? if you set it to 1, only one
word has to match.

On 03/01/2010 05:07 PM, Steve Reichgut wrote:
***Sorry if this was sent twice. I had connection problems here and it 
didn't look like the first time it went out


I have been testing out results for some basic queries using both the 
Standard and DisMax query parsers. The results though aren't what I 
expected and am wondering if I am misundertanding how the DisMax query 
parser works.


 For example, let's say I am doing a basic search for "Apache Solr"
 across a single field = Field 1 using the Standard parser. My results
 are exactly what I expected. Any document that includes either
 "Apache" or "Solr" or "Apache Solr" in Field 1 is listed with priority
 given to those that include both words.


 Now, if I do the same search for "Apache Solr" across multiple fields
 - Field 1, Field 2 - using DisMax, I would expect basically the same
 results. The results should include any document that has one or both
 words in Field 1 or Field 2.


When I run that query in DisMax though, it only returns the documents 
that have BOTH words included which in my sample set only includes 1 
or 2 documents. I thought that, by default, DisMax should make both 
words optional so I am confused as to why I am only getting such a 
small subset.


Can anyone shed some light on what I am doing wrong or if I am 
misunderstanding how DisMax works.


Thanks,
Steve




Re: Changing term frequency according to value of one of the fields

2010-02-26 Thread Joe Calderon
extend the Similarity class, compile it against the jars in lib, put it in
a path solr can find, and set your schema to use it

http://wiki.apache.org/solr/SolrPlugins#Similarity
On 02/25/2010 10:09 PM, Pooja Verlani wrote:

Hi,
I want to modify the Similarity class for my app like the following:
Right now tf is Math.sqrt(termFrequency).
I would like to modify it to
Math.sqrt(termFrequency / solrDoc.getFieldValue("count"))
where "count" is one of the fields in the particular solr document.
Is it possible to do so? Can I import solrDocument class and take the
particular solrDoc for calculating tf in the similarity class?

Please suggest.

regards,
Pooja

   




Re: Solr 1.4 distributed search configuration

2010-02-26 Thread Joe Calderon
you can set a default shards parameter on the request handler doing
distributed search; set up two different request handlers, one
with a shards default and one without

On Thu, Feb 25, 2010 at 1:35 PM, Jeffrey Zhao
jeffrey.z...@metalogic-inc.com wrote:
 Now I got it, I just forgot to put qt=search in the query.

 By the way, in solr 1.3, I used shards.txt under the conf directory and
 distributed=true in the query for distributed search.  That way, in my
 java application, I could hard-code the solr query with distributed=true and
 control the use of distributed search by defining shards.txt or not.

 In solr 1.4, it is more difficult to use distributed search dynamically. Is
 there a way I can just change the configuration, without changing the query,
 to make DS work?

 Thanks,



 From:   Mark Miller markrmil...@gmail.com
 To:     solr-user@lucene.apache.org
 Date:   25/02/2010 04:13 PM
 Subject:        Re: Solr 1.4 distributed search configuration



 Can you elaborate on "doesn't work" when you put it in the /search
 handler?

 You get an error in the logs? Nothing happens?

 On 02/25/2010 03:47 PM, Jeffrey Zhao wrote:
 Hi Mark,

 Thanks for your reply. I did make a new handler as following, but it
 does
 not work, anything wrong with my configuration?

 Thanks,

   <requestHandler name="search" class="solr.SearchHandler">
        <!-- default values for query parameters -->
         <lst name="defaults">
           <str name="shards">202.161.196.189:8080/solr,localhost:8080/solr</str>
         </lst>
       <arr name="components">
         <str>query</str>
         <str>facet</str>
         <str>spellcheck</str>
         <str>debug</str>
       </arr>
 </requestHandler>



 From:   Mark Millermarkrmil...@gmail.com
 To:     solr-user@lucene.apache.org
 Date:   25/02/2010 03:41 PM
 Subject:        Re: Solr 1.4 distributed search configuration



 On 02/25/2010 03:32 PM, Jeffrey Zhao wrote:

 How do define a new search handler with a shards parameter?  I defined

 as

 following way but it doesn't work. If I put the shards parameter in
 default handler, it seems I got an infinite loop.


  <requestHandler name="standard" class="solr.SearchHandler" default="true">
    <!-- default values for query parameters -->
    <lst name="defaults">
      <str name="echoParams">explicit</str>
    </lst>
  </requestHandler>

  <requestHandler name="search" class="solr.SearchHandler">
    <!-- default values for query parameters -->
    <lst name="defaults">
      <str name="shards">202.161.196.189:8080/solr,localhost:8080/solr</str>
    </lst>
    <arr name="components">
      <str>query</str>
      <str>facet</str>
      <str>spellcheck</str>
      <str>debug</str>
    </arr>
  </requestHandler>


 Thanks,


 Not seeing this on the wiki (it should be there), but you can't put the
 shards param on the default search handler without causing an infinite
 loop - you have to make a new request handler and put it on that.




 --
 - Mark

 http://www.lucidimagination.com








Re: Autosuggest/Autocomplete with solr 1.4 and EdgeNGrams

2010-02-24 Thread Joe Calderon
i had to create an autosuggest implementation not too long ago.
originally i was using faceting, where i would match wildcards on a
tokenized field and facet on an unaltered field; this had the
advantage that i could do everything from one index, though it was
also limited by the fact suggestions came through facets, and scoring
and highlighting went out the window


what i settled on was to create a separate core for suggest to use; i
analyze the fields i want to match against with a whitespace tokenizer
and an edgengram filter. this has multiple advantages:
the query is run through text analysis, whereas wildcarded terms are not
the highlighter will highlight only the text matched, not the expanded word
scoring and boosts can be used to rank suggest results

i tokenize on whitespace so i can match out-of-order tokens, ex
q=family guy stewie and q=stewie family guy, etc; this is something
that prefix based solutions wont be able to do

one small gotcha is that i recently submitted a patch to the edgengram
filter to fix highlighting behaviour; it has been committed to lucene's
trunk but it's only available in versions 2.9.2 and up unless you patch
it yourself
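A rough Python sketch of the analysis described above (whitespace tokenize, lowercase, edge n-grams per token), showing why partial words typed out of order can still match — the helper names here are made up for illustration:

```python
def edge_ngrams(token, min_gram=2, max_gram=10):
    # Prefix grams, in the spirit of solr.EdgeNGramFilterFactory
    return [token[:n] for n in range(min_gram, min(max_gram, len(token)) + 1)]

def analyze(text):
    # Whitespace tokenizer + lowercase + edge n-grams per token
    grams = set()
    for tok in text.lower().split():
        grams.update(edge_ngrams(tok))
    return grams

indexed = analyze("Family Guy Stewie")

# Partial words, typed out of order, all hit the indexed grams
assert all(g in indexed for g in ["stew", "fam", "gu"])
print(sorted(edge_ngrams("guy")))  # ['gu', 'guy']
```

A pure prefix structure over the whole title string would only match queries starting with "family", which is the limitation the message points out.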

On Wed, Feb 24, 2010 at 7:35 AM, Grant Ingersoll gsing...@apache.org wrote:
 You might also look at http://issues.apache.org/jira/browse/SOLR-1316

 On Feb 24, 2010, at 1:17 AM, Sachin wrote:



 Hi All,

 I am trying to setup autosuggest using solr 1.4 for my site and needed some 
 pointers on that. Basically, we provide autosuggest for user typed in 
 characters in the searchbox. The autosuggest index is created with older 
 user typed in search queries which returned > 0 results. We do some lazy 
 writing to store this information into the db and then export it to solr on 
 a nightly basis. As far as I know, there are 3 ways (apart from wild card 
 search) of achieving autosuggest using solr 1.4:

 1. Use EdgeNGrams
 2. Use shingles and prefix query.
 3. Use the new Terms component.

 I am for now more inclined towards using the EdgeNGrams (no method to 
 madness) and just wanted to know: is there any recommended approach out of 
 the 3 in terms of performance, since the user expects the suggestions to be 
 almost instantaneous? We do some heavy caching at our end to avoid hitting 
 solr every time, but is any of these 3 approaches faster than the others?

 Also, I would also like to return the suggestion even if the user typed in 
 query matches in between: for instance if I have the query "chicken pasta" 
 in my index and the user types in "pasta", I would also like this query to 
 be returned as part of the suggestion (ala Yahoo!). Below is my field 
 definition:

        <fieldType name="suggest" class="solr.TextField" positionIncrementGap="100">
            <analyzer type="index">
                <tokenizer class="solr.KeywordTokenizerFactory"/>
                <filter class="solr.LowerCaseFilterFactory"/>
                <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="50" />
            </analyzer>
            <analyzer type="query">
                <tokenizer class="solr.KeywordTokenizerFactory"/>
                <filter class="solr.LowerCaseFilterFactory"/>
            </analyzer>
        </fieldType>


 I tried replacing the KeywordTokenizerFactory with LetterTokenizerFactory, 
 and though it works great for the above scenario (does an in-between match), 
 it has the side-effect of removing everything which is not letters, so if 
 the user types in 123 he gets absolutely no suggestions. Is there anything 
 that I'm missing in my configuration, is this even achievable by using 
 EdgeNGrams or shall I look at using perhaps the TermsComponent after 
 applying the regex patch from 1.5 and maybe do something like 
 .*user-typed-in-chars.*?

 Thanks!








Re: including 'the' dismax query kills results

2010-02-18 Thread Joe Calderon
use the common grams filter, it'll create tokens for stop words and
their adjacent terms
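The effect of common grams can be sketched in a few lines of Python; the stopword set is illustrative, and the `_`-joined bigrams are roughly what solr.CommonGramsFilterFactory emits:

```python
STOPWORDS = {"the", "a", "an", "of"}  # illustrative stopword set

def common_grams(tokens):
    # Emit each token, plus a bigram whenever a stopword is adjacent,
    # roughly what solr.CommonGramsFilterFactory produces
    out = []
    for i, tok in enumerate(tokens):
        out.append(tok)
        if i + 1 < len(tokens) and (tok in STOPWORDS or tokens[i + 1] in STOPWORDS):
            out.append(tok + "_" + tokens[i + 1])
    return out

print(common_grams(["the", "british", "open"]))
# ['the', 'the_british', 'british', 'open']
```

Because "the_british" exists in the index, a query for "the british open" can still require every clause to match without the stopword clause matching nothing.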

On Thu, Feb 18, 2010 at 7:16 AM, Nagelberg, Kallin
knagelb...@globeandmail.com wrote:
 I've noticed some peculiar behavior with the dismax searchhandler.

 In my case I'm making the search "The British Open", and am getting 0 
 results. When I change it to "British Open" I get many hits. I looked at the 
 query analyzer and it should be broken down to "british" and "open" tokens 
 ('the' is a stopword). I imagine it is doing an 'and' type search, and by 
 setting the 'mm' parameter to 1 I once again get results for 'the british 
 open'. I would like mm to be 100% however, but just not care about stopwords. 
 Is there a way to do this?

 Thanks,
 -Kal



Re: Reindex after changing defaultSearchField?

2010-02-17 Thread Joe Calderon
no, you're just changing how you're querying the index, not the actual 
index. you will need to restart the servlet container or reload the core 
for the config changes to take effect tho

On 02/17/2010 10:04 AM, Frederico Azeiteiro wrote:

Hi,



If i change the defaultSearchField in the core schema, do I need to
recreate the index?



Thanks,

Frederico




   




Re: defaultSearchField and DisMaxRequestHandler

2010-02-15 Thread Joe Calderon

no but you can set a default for the qf parameter with the same value
On 02/15/2010 01:50 AM, Steve Radhouani wrote:

Hi there,
Can the defaultSearchField option be used by the DisMaxRequestHandler?
  Thanks,
-Steve

   




Re: problem with edgengramtokenfilter and highlighter

2010-02-14 Thread Joe Calderon

LUCENE-2266 filed and patch posted.
On 02/13/2010 09:14 PM, Robert Muir wrote:

Joe, can you open a Lucene JIRA issue for this?

I just glanced at the code and it looks like a bug to me.

On Sun, Feb 14, 2010 at 12:07 AM, Joe Calderon calderon@gmail.com wrote:

   

i ran into a problem while using the edgengramtokenfilter: it seems to
report incorrect offsets when generating tokens. more specifically, all
the tokens have offset 0 and the term length as start and end, which leads
to goofy highlighting behavior when creating edge grams for tokens
beyond the first one. i created a small patch that takes into account
the start of the original token and adds that to the reported
start/end offsets.

 



   




problem with edgengramtokenfilter and highlighter

2010-02-13 Thread Joe Calderon
i ran into a problem while using the edgengramtokenfilter: it seems to
report incorrect offsets when generating tokens. more specifically, all
the tokens have offset 0 and the term length as start and end, which leads
to goofy highlighting behavior when creating edge grams for tokens
beyond the first one. i created a small patch that takes into account
the start of the original token and adds that to the reported
start/end offsets.


reloading sharedlib folder

2010-02-12 Thread Joe Calderon
when using solr.xml, you can specify a sharedLib directory to share
among cores. is it possible to reload the classes in this dir without
having to restart the servlet container? it would be useful to be able
to make changes to those classes on the fly or be able to drop in new
plugins


Re: How to reindex data without restarting server

2010-02-11 Thread Joe Calderon
if you use the core model via solr.xml you can reload a core without 
having to restart the servlet container,

http://wiki.apache.org/solr/CoreAdmin
On 02/11/2010 02:40 PM, Emad Mushtaq wrote:

Hi,

I would like to know if there is a way of reindexing data without restarting
the server. Lets say I make a change in the schema file. That would require
me to reindex data. Is there a solution to this ?

   




Re: analysing wild carded terms

2010-02-10 Thread Joe Calderon
sorry, what i meant to say is: apply text analysis to the part of the
query that is wildcarded. for example, if a term with latin1 diacritics
is wildcarded i'd still like to run it through ISOLatin1Filter

On Wed, Feb 10, 2010 at 4:59 AM, Fuad Efendi f...@efendi.ca wrote:
 hello *, quick question, what would i have to change in the query
 parser to allow wildcarded terms to go through text analysis?

 I believe it is illogical; wildcarded terms will go through the terms
 enumerator.





analysing wild carded terms

2010-02-09 Thread Joe Calderon
hello *, quick question, what would i have to change in the query
parser to allow wildcarded terms to go through text analysis?


Re: old wildcard highlighting behaviour

2010-02-06 Thread Joe Calderon
when i set hl.highlightMultiTerm=false the term that matches the wild
card is not highlighted at all. ideally i'd like a partial highlight
(the characters before the wildcard), but if not i can live without it

thx much for the help

--joe

On Fri, Feb 5, 2010 at 10:44 PM, Mark Miller markrmil...@gmail.com wrote:
 On iPhone so don't remember exact param I named it, but check wiki -
 something like hl.highlightMultiTerm - set it to false.

 - Mark

 http://www.lucidimagination.com (mobile)

 On Feb 6, 2010, at 12:00 AM, Joe Calderon calderon@gmail.com wrote:

 hello *, currently with hl.usePhraseHighlighter=true, a query for (joe
 jack*) will highlight <em>joe jackson</em>, however after reading the
 archives, what im looking for is the old 1.1 behaviour so that only
 <em>joe jack</em> is highlighted, is this possible in solr 1.5 ?


 thx  much
 --joe



old wildcard highlighting behaviour

2010-02-05 Thread Joe Calderon
hello *, currently with hl.usePhraseHighlighter=true, a query for (joe
jack*) will highlight <em>joe jackson</em>, however after reading the
archives, what im looking for is the old 1.1 behaviour so that only
<em>joe jack</em> is highlighted, is this possible in solr 1.5 ?


thx  much
--joe


fuzzy matching / configurable distance function?

2010-02-04 Thread Joe Calderon
is it possible to configure the distance formula used by fuzzy
matching? i see there are others on the function query page under
strdist, but im wondering if they are applicable to fuzzy matching

thx much


--joe
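For reference, Lucene's fuzzy matching is built on Levenshtein (edit) distance, while strdist exposes other measures (jw, ngram) as function-query scoring tools rather than replacements for the fuzzy operator. A minimal edit-distance sketch:

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance, the measure
    # Lucene's fuzzy queries are based on
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```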


source tree for lucene

2010-02-04 Thread Joe Calderon
i want to recompile lucene with
http://issues.apache.org/jira/browse/LUCENE-2230, but im not sure
which source tree to use, i tried using the implied trunk revision
from the admin/system page but solr fails to build with the generated
jars, even if i exclude the patches from 2230...

im wondering if there is another lucene tree i should grab to use to build solr?


--joe


Re: distributed search and failed core

2010-02-03 Thread Joe Calderon
thx guys, i ended up using a mix of code from the solr-1143 and
solr-1537 patches. now whenever there is an exception, there is a
section in the results indicating the result is partial that also lists
the failed core(s). we've added some monitoring to check for that
output as well, to alert us when a shard has failed

On Wed, Feb 3, 2010 at 10:55 AM, Yonik Seeley
yo...@lucidimagination.com wrote:
 On Fri, Jan 29, 2010 at 3:31 PM, Joe Calderon calderon@gmail.com wrote:
 hello *, in distributed search when a shard goes down, an error is
 returned and the search fails, is there a way to avoid the error and
 return the results from the shards that are still up?

 The SolrCloud branch has load-balancing capabilities for distributed
 search amongst shard replicas.
 http://wiki.apache.org/solr/SolrCloud

 -Yonik
 http://www.lucidimagination.com



Re: Basic indexing question

2010-02-02 Thread Joe Calderon
by default solr will only search the default field; you have to
either query all fields, field1:(ore) OR field2:(ore) OR field3:(ore),
or use a different query parser like dismax
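The per-field workaround can be generated mechanically; a small sketch (field names borrowed from the schema below, helper name hypothetical) that builds the standard lucene-parser query string:

```python
def multi_field_query(fields, term):
    # Expand one user term across several fields using the
    # standard (lucene) query parser's OR syntax
    return " OR ".join(f"{field}:({term})" for field in fields)

print(multi_field_query(["name", "description", "text"], "ore"))
# name:(ore) OR description:(ore) OR text:(ore)
```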

On Tue, Feb 2, 2010 at 3:31 PM, Stefan Maric sma...@ntlworld.com wrote:
 I have got a basic configuration of Solr up and running and have loaded some 
 data to experiment with
  When I run a query for 'ore' I get 3 results when I'm expecting 4
 Dataimport is pulling the expected number of rows in from my DB view

  In my schema.xml I have
  <field name="id" type="string" indexed="true" stored="true" required="true" />
  <field name="atomId" type="string" indexed="true" stored="true" required="true" />
  <field name="name" type="text" indexed="true" stored="true"/>
  <field name="description" type="text" indexed="true" stored="true" />

  and the defaults
  <field name="text" type="text" indexed="true" stored="false" multiValued="true"/>
  <copyField source="name" dest="text"/>

  From an SQL point of view - I am expecting a search for 'ore' to retrieve 4 
 results (which the following does)
 select * from v_sm_search_sectors where description like '% ore%' or name 
 like '% ore%';
 121 B0.010.010  Mining and quarrying  
 Mining of metal ore, stone, sand, clay, coal and other solid minerals
 1000144 E0.030  Metal and metal ores wholesale   
 (null)
 1000145 E0.030.010  Metal and metal ores wholesale   (null)
 1000146 E0.030.020  Metal and metal ores wholesale agents   (null)

 From a Solr query for 'ore' - I get the following
 <response>
   <lst name="responseHeader">
     <int name="status">0</int>
     <int name="QTime">0</int>
     <lst name="params">
       <str name="rows">10</str>
       <str name="start">0</str>
       <str name="indent">on</str>
       <str name="q">ore</str>
       <str name="version">2.2</str>
     </lst>
   </lst>
   <result name="response" numFound="3" start="0">
     <doc>
       <str name="atomId">E0.030</str>
       <str name="id">1000144</str>
       <str name="name">Metal and metal ores wholesale</str>
     </doc>
     <doc>
       <str name="atomId">E0.030.010</str>
       <str name="id">1000145</str>
       <str name="name">Metal and metal ores wholesale</str>
     </doc>
     <doc>
       <str name="atomId">E0.030.020</str>
       <str name="id">1000146</str>
       <str name="name">Metal and metal ores wholesale agents</str>
     </doc>
   </result>
 </response>


  So I don't retrieve the document where 'ore' is in the description field 
 (and NOT the name field)

  It would seem that Solr is ONLY returning me results based on what has 
 been put into the field name="text" by the <copyField source="name" dest="text"/>

  Any hints as to what I've missed ??

  Regards
  Stefan Maric



Re: Basic indexing question

2010-02-02 Thread Joe Calderon
see http://wiki.apache.org/solr/SchemaXml#The_Default_Search_Field for
details on the default field; most people use the dismax handler when
handling queries from users,
see http://wiki.apache.org/solr/DisMaxRequestHandler for more details.
if you dont have many fields you can write your own query using the
lucene query parser as i mentioned before; the syntax can be found at
http://lucene.apache.org/java/2_9_1/queryparsersyntax.html

hope this helps


--joe
On Tue, Feb 2, 2010 at 3:59 PM, Stefan Maric sma...@ntlworld.com wrote:
 Thanks for the quick reply
 I will have to see if the default query mechanism will suffice for most of
 my needs

 I have skimmed through most of the Solr documentation and didn't see
 anything describing

 I can easily change my DB View so that I only source Solr with a single
 string plus my id field
 (as my application makng the search will have to collate associated
 information into a presentable screen anyhow - so I'm not too worried about
 info being returned by Solr as such)

 Would that be a reasonable way of using Solr




 -Original Message-
 From: Joe Calderon [mailto:calderon@gmail.com]
 Sent: 02 February 2010 23:42
 To: solr-user@lucene.apache.org
 Subject: Re: Basic indexing question


 by default solr will only search the default fields, you have to
 either query all fields field1:(ore) or field2:(ore) or field3:(ore)
 or use a different query parser like dismax





distributed search and failed core

2010-01-29 Thread Joe Calderon
hello *, in distributed search when a shard goes down, an error is
returned and the search fails, is there a way to avoid the error and
return the results from the shards that are still up?

thx much

--joe


Re: index of facet fields are not same as original string

2010-01-28 Thread Joe Calderon
facets are based off the indexed version of your string, not the stored
version. you probably have an analyzer thats removing punctuation.
most people index the same field multiple ways for different purposes:
matching, sorting, faceting etc...

index a copy of your field as string type and facet on that

On Thu, Jan 28, 2010 at 3:12 AM, Sergey Pavlikovskiy
pavlikovs...@gmail.com wrote:
 Hi,

 probably, it's because of stemming
 if you need unstemmed text you can use 'textgen' data type for the field

 Sergey

 On Thu, Jan 28, 2010 at 12:25 PM, Solr user uma.ravind...@yahoo.co.inwrote:


 Hi,

  I am new to Solr. I found facets fields does not reflect the original
 string in the record. For example,

 the returned xml is,

  <doc>
    <str name="g_number">G-EUPE</str>
  </doc>
  <lst name="facet_counts">
    <lst name="facet_queries" />
    <lst name="facet_fields">
      <lst name="g_number">
        <int name="gupe">1</int>
      </lst>
    </lst>
    <lst name="facet_dates" />
  </lst>

 Here, G-EUPE is displayed under facet field as 'gupe' where it is not
 capital and missing '-' from the original string. Is there any way we could
 fix this to match the original text in record? Thanks in advance.

 Regards,
 uma
 --
 View this message in context:
 http://old.nabble.com/index-of-facet-fields-are-not-same-as-original-string-tp27353838p27353838.html
 Sent from the Solr - User mailing list archive at Nabble.com.





create requesthandler with default shard parameter for different query parser

2010-01-21 Thread Joe Calderon
hello *, what is the best way to create a requesthandler for
distributed search with a default shards parameter but that can use
different query parsers

thus far i have

  <requestHandler name="/ds" class="solr.SearchHandler">
    <!-- default values for query parameters -->
    <lst name="defaults">
      <str name="fl">*,score</str>
      <str name="wt">json</str>
      <str name="shards">host0:8080/solr/core0,host1:8080/solr/core1,host2:8080/solr/core2,localhost:8080/solr/core3</str>
    </lst>
    <arr name="components">
      <str>query</str>
      <str>facet</str>
      <str>spellcheck</str>
      <str>debug</str>
    </arr>
  </requestHandler>


which works as long as qt=standard; if i change it to dismax it doesn't
use the shards parameter anymore...


thx much

--joe


Re: create requesthandler with default shard parameter for different query parser

2010-01-21 Thread Joe Calderon
thx much, i see now. having request handlers with the same name as the
query parsers was confusing me. i do however have an additional
problem: if i use defType it does indeed use the right query parser,
but is there a way to not send all the query parameters in the url
(qf, pf, bf etc)? it's the main reason im creating the new request
handler. or do i put them all as defaults under my new request handler
and let the query parser use whichever ones it supports?

On Thu, Jan 21, 2010 at 11:45 AM, Yonik Seeley
yo...@lucidimagination.com wrote:
 On Thu, Jan 21, 2010 at 2:39 PM, Joe Calderon calderon@gmail.com wrote:
 hello *, what is the best way to create a requesthandler for
 distributed search with a default shards parameter but that can use
 different query parsers

 thus far i have

  <requestHandler name="/ds" class="solr.SearchHandler">
     <!-- default values for query parameters -->
     <lst name="defaults">
       <str name="fl">*,score</str>
       <str name="wt">json</str>
       <str name="shards">host0:8080/solr/core0,host1:8080/solr/core1,host2:8080/solr/core2,localhost:8080/solr/core3</str>
     </lst>
     <arr name="components">
       <str>query</str>
       <str>facet</str>
       <str>spellcheck</str>
       <str>debug</str>
     </arr>
   </requestHandler>


 which works as long as qt=standard; if i change it to dismax it doesn't
 use the shards parameter anymore...

 Legacy terminology causing some confusion I think... qt does stand for
 query type, but it actually picks the request handler.
 defType defines the default query parser to use, so you probably
 don't want to be using qt at all.

 So try something like:
 http://localhost:8983/solr/ds?defType=dismax&qf=text&q=foo

 -Yonik
 http://www.lucidimagination.com
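Yonik's point — the path picks the request handler, defType picks the query parser within it — shows up directly in how the request URL is assembled. A sketch (host and parameter values are just the example's):

```python
from urllib.parse import urlencode

# /ds selects the request handler (which carries the default shards);
# defType selects the query parser used inside that handler
params = {"defType": "dismax", "qf": "text", "q": "foo"}
url = "http://localhost:8983/solr/ds?" + urlencode(params)
print(url)  # http://localhost:8983/solr/ds?defType=dismax&qf=text&q=foo
```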



Re: Field collapsing patch error

2010-01-19 Thread Joe Calderon
this has come up before; my suggestion would be to use the 12/24
patch with trunk revision 892336

http://www.lucidimagination.com/search/document/797549d29e1810d9/solr_1_4_field_collapsing_what_are_the_steps_for_applying_the_solr_236_patch

2010/1/19 Licinio Fernández Maurelo licinio.fernan...@gmail.com:
 Hi folks,

 i've downloaded solr release 1.4 and tried to apply the latest field collapsing
 patch <https://issues.apache.org/jira/secure/attachment/12428902/SOLR-236.patch>
 i've found. Found errors:

 d...@backend05:~/workspace/solr-release-1.4.0$ patch -p0 -i SOLR-236.patch

 patching file src/test/test-files/solr/conf/solrconfig-fieldcollapse.xml
 patching file src/test/test-files/solr/conf/schema-fieldcollapse.xml
 patching file src/test/test-files/solr/conf/solrconfig.xml
 patching file src/test/test-files/fieldcollapse/testResponse.xml
 patching file
 src/test/org/apache/solr/search/fieldcollapse/FieldCollapsingIntegrationTest.java
 patching file
 src/test/org/apache/solr/search/fieldcollapse/DistributedFieldCollapsingIntegrationTest.java
 patching file
 src/test/org/apache/solr/search/fieldcollapse/NonAdjacentDocumentCollapserTest.java

 patching file
 src/test/org/apache/solr/search/fieldcollapse/AdjacentCollapserTest.java

 patching file
 src/test/org/apache/solr/handler/component/CollapseComponentTest.java

 patching file
 src/test/org/apache/solr/client/solrj/response/FieldCollapseResponseTest.java

 patching file
 src/java/org/apache/solr/search/DocSetAwareCollector.java

 patching file
 src/java/org/apache/solr/search/fieldcollapse/CollapseGroup.java

 patching file
 src/java/org/apache/solr/search/fieldcollapse/DocumentCollapseResult.java

 patching file
 src/java/org/apache/solr/search/fieldcollapse/DocumentCollapser.java

 patching file
 src/java/org/apache/solr/search/fieldcollapse/collector/CollapseCollectorFactory.java

 patching file
 src/java/org/apache/solr/search/fieldcollapse/collector/DocumentGroupCountCollapseCollectorFactory.java

 patching file
 src/java/org/apache/solr/search/fieldcollapse/collector/aggregate/AverageFunction.java

 patching file
 src/java/org/apache/solr/search/fieldcollapse/collector/aggregate/MinFunction.java

 patching file
 src/java/org/apache/solr/search/fieldcollapse/collector/aggregate/SumFunction.java

 patching file
 src/java/org/apache/solr/search/fieldcollapse/collector/aggregate/MaxFunction.java

 patching file
 src/java/org/apache/solr/search/fieldcollapse/collector/aggregate/AggregateFunction.java

 patching file
 src/java/org/apache/solr/search/fieldcollapse/collector/CollapseContext.java

 patching file
 src/java/org/apache/solr/search/fieldcollapse/collector/DocumentFieldsCollapseCollectorFactory.java
 patching file
 src/java/org/apache/solr/search/fieldcollapse/collector/AggregateCollapseCollectorFactory.java

 patching file
 src/java/org/apache/solr/search/fieldcollapse/collector/CollapseCollector.java

 patching file
 src/java/org/apache/solr/search/fieldcollapse/collector/FieldValueCountCollapseCollectorFactory.java

 patching file
 src/java/org/apache/solr/search/fieldcollapse/collector/AbstractCollapseCollector.java

 patching file
 src/java/org/apache/solr/search/fieldcollapse/AbstractDocumentCollapser.java

 patching file
 src/java/org/apache/solr/search/fieldcollapse/NonAdjacentDocumentCollapser.java

 patching file
 src/java/org/apache/solr/search/fieldcollapse/AdjacentDocumentCollapser.java

 patching file
 src/java/org/apache/solr/search/fieldcollapse/util/Counter.java

 patching file
 src/java/org/apache/solr/search/SolrIndexSearcher.java

 patching file
 src/java/org/apache/solr/search/DocSetHitCollector.java

 patching file
 src/java/org/apache/solr/handler/component/CollapseComponent.java

 patching file
 src/java/org/apache/solr/handler/component/QueryComponent.java

 Hunk #1 FAILED at
 522.

 1 out of 1 hunk FAILED -- saving rejects to file
 src/java/org/apache/solr/handler/component/QueryComponent.java.rej

 patching file
 src/java/org/apache/solr/util/DocSetScoreCollector.java

 patching file
 src/common/org/apache/solr/common/params/CollapseParams.java

 patching file src/solrj/org/apache/solr/client/solrj/SolrQuery.java
 Hunk #1 FAILED at 17.
 Hunk #2 FAILED at 50.
 Hunk #3 FAILED at 76.
 Hunk #4 FAILED at 148.
 Hunk #5 FAILED at 197.
 Hunk #6 succeeded at 510 (offset -155 lines).
 Hunk #7 succeeded at 566 (offset -155 lines).
 5 out of 7 hunks FAILED -- saving rejects to file
 src/solrj/org/apache/solr/client/solrj/SolrQuery.java.rej
 patching file
 src/solrj/org/apache/solr/client/solrj/response/QueryResponse.java
 Hunk #1 succeeded at 17 with fuzz 1.
 Hunk #2 FAILED at 42.
 Hunk #3 FAILED at 58.
 Hunk #4 succeeded at 117 with fuzz 2 (offset -8 lines).
 Hunk #5 succeeded at 315 with fuzz 2 (offset 17 lines).
 2 out of 5 hunks FAILED -- saving rejects to file
 src/solrj/org/apache/solr/client/solrj/response/QueryResponse.java.rej
 patching file
 src/solrj/org/apache/solr/client/solrj/response/FieldCollapseResponse.java

 

Re: question about date boosting

2010-01-12 Thread Joe Calderon

I think you need to use the new TrieDateField
On 01/12/2010 07:06 PM, Daniel Higginbotham wrote:

Hello,

I'm trying to boost results based on date using the first example 
here:http://wiki.apache.org/solr/SolrRelevancyFAQ#How_can_I_boost_the_score_of_newer_documents 



However, I'm getting an error that reads, "Can't use ms() function on 
non-numeric legacy date field"


The date field uses solr.DateField . What am I doing wrong?

Thank you!
Daniel Higginbotham




help implementing a couple of business rules

2010-01-11 Thread Joe Calderon
hello *, im looking for help on writing queries to implement a few
business rules.


1. given a set of fields, how to return matches that match across them
but not just one specific one? ex: im using a dismax parser currently
but i want to exclude any results that only match against a field
called 'description2'


2. given a set of fields, how to return matches that match across them
but on one specific field match as a phrase only? ex: im using a dismax
parser currently but i want matches against a field called 'people' to
only match as a phrase


thx much,

--joe


Re: help implementing a couple of business rules

2010-01-11 Thread Joe Calderon
thx, but im not sure that covers all edge cases. to clarify:
1. matching description2 is okay if other fields are matched too, but
results matching only description2 should be omitted

2. it's okay to not match against the people field, but matches against
the people field should only be phrase matches

sorry if i was unclear

--joe
On Mon, Jan 11, 2010 at 10:13 AM, Erik Hatcher erik.hatc...@gmail.com wrote:

 On Jan 11, 2010, at 12:56 PM, Joe Calderon wrote:

 1. given a set of fields how to return matches that match across them
 but not just one specific one, ex im using a dismax parser currently
 but i want to exclude any results that only match against a field
 called 'description2'

 One way could be to add an fq parameter to the request:

   fq=-description2:(query)

 2. given a set of fields how to return matches that match across them
 but on one specific field match as a phrase only, ex im using a dismax
 parser currently but i want matches against a field called 'people' to
 only match as a phrase

 Doesn't setting pf=people accomplish this?

        Erik




Re: Solr 1.4 Field collapsing - What are the steps for applying the SOLR-236 patch?

2010-01-11 Thread Joe Calderon
it seems to be in flux right now as the solr developers slowly make 
improvements and ingest the various pieces into the solr trunk, i think 
your best bet might be to use the 12/24 patch and fix any errors where 
it doesnt apply cleanly


im using solr trunk r892336 with the 12/24 patch


--joe
On 01/11/2010 08:48 PM, Kelly Taylor wrote:

Hi,

Is there a step-by-step for applying the patch for SOLR-236 to enable field
collapsing in Solr 1.4?

Thanks,
Kelly
   




custom wildcarding in qparser

2010-01-08 Thread Joe Calderon
hello *, what do i need to do to make a query parser that works just
like the standard query parser but also runs analyzers/tokenizers on a
wildcarded term? specifically, im looking to wildcard only the last
token.

ive tried the edismax qparser and the prefix qparser, and neither is
exactly what im looking for. the problem im trying to solve is
matching wildcards on terms that can be entered multiple ways; i have
a set of analyzers that generate the various terms, e.g. wildcarding on
stemmed fields etc.
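For what it's worth, one workaround that avoids a custom qparser is to run the analysis on the client and reattach the wildcard to the last token only. A minimal Python sketch of the idea; the lowercase/split chain and the field name are stand-ins, not the actual analyzers from this setup:

```python
import re

def analyze(text):
    """Stand-in analysis chain: lowercase and split on non-alphanumerics."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]

def wildcard_last(text, field):
    """Build a query that analyzes every token but wildcards only the last one."""
    tokens = analyze(text)
    if not tokens:
        return ""
    *head, last = tokens
    clauses = [f"{field}:{t}" for t in head] + [f"{field}:{last}*"]
    return " AND ".join(clauses)

print(wildcard_last("Belgian Bee", "title"))  # title:belgian AND title:bee*
```

The same shape works server-side as a QParserPlugin: analyze all but the last token normally, then wrap the last analyzed token in a PrefixQuery.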


thx much

--joe


analyzer type=query with NGramTokenFilterFactory forces phrase query

2009-12-31 Thread Joe Calderon
Hello *, im trying to make an index to support spelling errors/fuzzy
matching. ive indexed my document titles with NGramFilterFactory
minGramSize=2 maxGramSize=3, and on the analysis page i can see the
common grams match between the indexed value and the query value.
however, when i try a query such as title_ngram:(family), the debug
output says the query is converted to the phrase query "f a m i l y
fa am mi il ly fam ami mil ily". if this is the expected behavior, is
there a way to override it?

or should i scrap this approach and use title:(family) and boost on
strdist(family, title, ngram, 3) ?
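For context on why the phrase query appears: the single query token is expanded into many grams, which the query parser then treats as a multi-token (phrase) query. A quick sketch of what minGramSize=2/maxGramSize=3 produces for "family" (illustrative; Lucene's emission order may differ):

```python
def ngrams(term, min_gram=2, max_gram=3):
    """All character n-grams of term with lengths min_gram..max_gram."""
    return [term[i:i + n]
            for n in range(min_gram, max_gram + 1)
            for i in range(len(term) - n + 1)]

grams = ngrams("family")
print(grams)  # ['fa', 'am', 'mi', 'il', 'ly', 'fam', 'ami', 'mil', 'ily']
```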


Re: analyzer type=query with NGramTokenFilterFactory forces phrase query

2009-12-31 Thread Joe Calderon
if this is the expected behaviour is there a way to override it?[1]

[1] me

On Thu, Dec 31, 2009 at 10:13 AM, AHMET ARSLAN iori...@yahoo.com wrote:
 Hello *, im trying to make an index to support spelling errors/fuzzy
 matching, ive indexed my document titles with NGramFilterFactory
 minGramSize=2 maxGramSize=3, using the analysis page i can see the
 common grams match between the indexed value and the query value,
 however when i try to do a query for it ex. title_ngram:(family) the
 debug output says the query is converted to a phrase query f a m i l
 y fa am mi il ly fam ami mil ily, if this is the expected behavior is
 there a way to override it?

 If a single token is split into more tokens during the analysis phase, solr 
 will do a phrase query instead of a term query. [1]

 [1]http://www.mail-archive.com/solr-user@lucene.apache.org/msg30055.html







score = result of function query

2009-12-30 Thread Joe Calderon
how can i make the score be solely the output of a function query?

the function query wiki page details something like
 q=boxname:findbox+_val_:product(product(x,y),z)&fl=*,score


but that doesn't seem to work
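For what it's worth, the function query parser via local params is the usual way to make the function itself the whole query (and hence the score); a hedged sketch, not verified against this exact build:

```
q={!func}product(product(x,y),z)&fl=*,score
```

With `{!func}` the entire q string is parsed as a function query, so the returned score is just the function's value for each document.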


--joe


boosting on string distance

2009-12-29 Thread Joe Calderon
hello *, i want to boost documents that match the query better.
currently i also index my field as a string and boost if i match the
string field.

but im wondering if its possible to boost, with the bf parameter, using
a formula based on strdist(). i know one of the arguments would be the
field name, but how do i specify the user query as the other
argument?

http://wiki.apache.org/solr/FunctionQuery#strdist


best,

--joe
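As an illustration of what an n-gram string distance measures, here is a stand-in Dice coefficient over character bigrams (not Lucene's exact strdist implementation):

```python
def bigrams(s):
    """Character bigrams of a lowercased string."""
    s = s.lower()
    return [s[i:i + 2] for i in range(len(s) - 1)]

def dice_bigram_sim(a, b):
    """Dice similarity over character bigrams: 1.0 = identical, 0.0 = disjoint."""
    ba, bb = bigrams(a), bigrams(b)
    if not ba or not bb:
        return 0.0
    overlap = sum(min(ba.count(g), bb.count(g)) for g in set(ba))
    return 2.0 * overlap / (len(ba) + len(bb))

print(dice_bigram_sim("beer", "beer"))      # 1.0
print(dice_bigram_sim("belgian", "belgain"))
```

Transpositions and single-character typos still share most bigrams, which is why this kind of measure is forgiving of spelling errors.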


Re: SOLR Performance Tuning: Pagination

2009-12-24 Thread Joe Calderon
fwiw, when implementing distributed search i ran into a similar
problem, but then i noticed even google doesnt let you go past page
1000; it's easier to just set a limit on start
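A minimal sketch of that kind of guard, clamping the incoming start parameter before the request is built (the cap value is arbitrary):

```python
MAX_START = 10000  # arbitrary cap; deep offsets get slower and heavier

def clamp_start(start, rows=10):
    """Clamp pagination so start never exceeds the configured cap."""
    start = max(0, int(start))
    return min(start, MAX_START - rows)

print(clamp_start(28838540))  # 9990
```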

On Thu, Dec 24, 2009 at 8:36 AM, Walter Underwood wun...@wunderwood.org wrote:
 When do users do a query like that? --wunder

 On Dec 24, 2009, at 8:09 AM, Fuad Efendi wrote:

 I used pagination for a while till found this...


 I have a filtered query ID:[* TO *] returning 20 million results (no
 faceting), and pagination always seemed to be fast. However, it is fast
 only with low values of start (e.g. start=12345). Queries like
 start=28838540 take 40-60 seconds, and even cause OutOfMemoryException.

 I use highlight, faceting on nontokenized Country field, standard handler.


 It even seems to be a bug...


 Fuad Efendi
 +1 416-993-2060
 http://www.linkedin.com/in/liferay

 Tokenizer Inc.
 http://www.tokenizer.ca/
 Data Mining, Vertical Search





wildcard oddity

2009-12-15 Thread Joe Calderon
im trying to do a wild card search:

q=item_title:(gets*)    returns no results
q=item_title:(gets)     returns results
q=item_title:(get*)     returns results

seems like * at the end of a token is requiring a character; instead
of being 0 or more it's acting like 1 or more

the text im trying to match is The Gang Gets Extreme: Home Makeover Edition

the field uses the following analyzers

<fieldType name="text_token" class="solr.TextField"
    positionIncrementGap="100" omitNorms="false">
  <analyzer>
    <charFilter class="solr.HTMLStripCharFilterFactory" />
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.ISOLatin1AccentFilterFactory" />
    <filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="1" generateNumberParts="0" catenateAll="1"
        splitOnNumerics="0" splitOnCaseChange="0" stemEnglishPossessive="0" />
  </analyzer>
</fieldType>


is anybody else having similar problems?


best,
--joe


Re: apply a patch on solr

2009-11-03 Thread Joe Calderon
patch -p0 < /path/to/field-collapse-5.patch

On Tue, Nov 3, 2009 at 7:48 PM, michael8 mich...@saracatech.com wrote:

 Hmmm, perhaps I jumped the gun.  I just looked over the field collapse patch
 for SOLR-236 and each file listed in the patch has its own revision #.

 E.g. from field-collapse-5.patch:
 --- src/java/org/apache/solr/core/SolrConfig.java       (revision 824364)
 --- src/solrj/org/apache/solr/client/solrj/response/QueryResponse.java
 (revision 816372)
 --- src/solrj/org/apache/solr/client/solrj/SolrQuery.java       (revision 
 823653)
 --- src/java/org/apache/solr/search/SolrIndexSearcher.java      (revision 
 794328)
 --- src/java/org/apache/solr/search/DocSetHitCollector.java     (revision
 794328)

 Unless there is a better way, it seems like I would need to do svn up
 --revision ... for each of the files to be patched and then apply the
 patch?  This seems error prone and tedious.  Am I missing something simpler
 here?

 Michael


 michael8 wrote:

 Perfect.  This is what I need to know instead of patching 'in the dark'.
 Good thing SVN revision cuts across all files like a tag.

 Thanks Mike!

 Michael


 cambridgemike wrote:

 You can see what revision the patch was written for at the top of the
 patch,
 it will look like this:

 Index: org/apache/solr/handler/MoreLikeThisHandler.java
 ===
 --- org/apache/solr/handler/MoreLikeThisHandler.java (revision 772437)
 +++ org/apache/solr/handler/MoreLikeThisHandler.java (working copy)

 now check out revision 772437 using the --revision switch in svn, patch
 away, and then svn up to make sure everything merges cleanly.  This is a
 good guide to follow as well:
 http://www.mail-archive.com/solr-user@lucene.apache.org/msg10189.html

 cheers,
 -mike

 On Mon, Nov 2, 2009 at 3:55 PM, michael8 mich...@saracatech.com wrote:


 Hi,

 First, I'd like to pardon my novice question on patching solr (1.4).  What
 I'd like to know is: given a patch, like the one for collapse field, how
 would one go about knowing what solr source that patch is meant for, since
 this is a source-level patch?  Wouldn't the exact versions of the set of
 java files to be patched be critical for the patch to work properly?

 So far what I have done is to pull the latest collapse field patch down
 from
 http://issues.apache.org/jira/browse/SOLR-236 (field-collapse-5.patch),
 and
 then svn up the latest trunk from
 http://svn.apache.org/repos/asf/lucene/solr/trunk/, then patch and
 build.
 Intuitively I was thinking I should be doing svn up to a specific
 revision/tag instead of just latest.  So far everything seems fine, but
 I
 just want to make sure I'm doing the right thing and not just being
 lucky.

 Thanks,
 Michael
 --
 View this message in context:
 http://old.nabble.com/apply-a-patch-on-solr-tp26157827p26157827.html
 Sent from the Solr - User mailing list archive at Nabble.com.







 --
 View this message in context: 
 http://old.nabble.com/apply-a-patch-on-solr-tp26157827p26190563.html
 Sent from the Solr - User mailing list archive at Nabble.com.




tokenize after filters

2009-11-02 Thread Joe Calderon
is it possible to tokenize a field on whitespace after some filters
have been applied?

ex: A + W Root Beer
the field uses a keyword tokenizer to keep the string together, then
it gets converted to aw root beer by a custom filter ive made. i now
want to split that up into 3 tokens (aw, root, beer), but it seems
like you cant use a tokenizer after a filter ... so whats the best way
of accomplishing this?

thx much

--joe


profiling solr

2009-10-26 Thread Joe Calderon
as a curiosity i'd like to use a profiler to see where solr queries
spend most of their time; im curious what tools, if any, others use
for this type of task.

im using jetty as my servlet container, so ideally i'd like a profiler
that's compatible with it

--joe


field collapsing exception

2009-10-26 Thread Joe Calderon
found another exception. i cant find specific steps to reproduce, but
starting with an unfiltered result and then, given an int field with
values (1,2,3), filtering by 3 sometimes triggers it; this is in an
index with very frequent updates and deletes


--joe


java.lang.NullPointerException
at 
org.apache.solr.search.fieldcollapse.collector.FieldValueCountCollapseCollectorFactory$FieldValueCountCollapseCollector.getResult(FieldValueCountCollapseCollectorFactory.java:84)
at 
org.apache.solr.search.fieldcollapse.AbstractDocumentCollapser.getCollapseInfo(AbstractDocumentCollapser.java:191)
at 
org.apache.solr.handler.component.CollapseComponent.doProcess(CollapseComponent.java:179)
at 
org.apache.solr.handler.component.CollapseComponent.process(CollapseComponent.java:121)
at 
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:195)
at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
at 
org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:233)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
at 
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
at 
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1148)
at 
org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:387)
at 
org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
at 
org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
at 
org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:765)
at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:417)
at 
org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
at 
org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
at 
org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
at org.mortbay.jetty.Server.handle(Server.java:326)
at 
org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:534)
at 
org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:864)
at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:539)
at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
at 
org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:409)
at 
org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:520)


field collapsing bug (java.lang.ArrayIndexOutOfBoundsException)

2009-10-23 Thread Joe Calderon
seems to happen when sorting on anything besides strictly score; even
"score desc, num desc" triggers it. using the latest nightly and the
10/14 patch
Problem accessing /solr/core1/select. Reason:

4731592

java.lang.ArrayIndexOutOfBoundsException: 4731592
at 
org.apache.lucene.search.FieldComparator$StringOrdValComparator.copy(FieldComparator.java:660)
at 
org.apache.solr.search.NonAdjacentDocumentCollapser$DocumentComparator.compare(NonAdjacentDocumentCollapser.java:235)
at 
org.apache.solr.search.NonAdjacentDocumentCollapser$DocumentPriorityQueue.lessThan(NonAdjacentDocumentCollapser.java:173)
at 
org.apache.lucene.util.PriorityQueue.insertWithOverflow(PriorityQueue.java:158)
at 
org.apache.solr.search.NonAdjacentDocumentCollapser.doCollapsing(NonAdjacentDocumentCollapser.java:95)
at 
org.apache.solr.search.AbstractDocumentCollapser.collapse(AbstractDocumentCollapser.java:208)
at 
org.apache.solr.handler.component.CollapseComponent.doProcess(CollapseComponent.java:98)
at 
org.apache.solr.handler.component.CollapseComponent.process(CollapseComponent.java:66)
at 
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:195)
at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
at 
org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:233)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
at 
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
at 
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1148)
at 
org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:387)
at 
org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
at 
org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
at 
org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:765)
at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:417)
at 
org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
at 
org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
at 
org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
at org.mortbay.jetty.Server.handle(Server.java:326)
at 
org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:534)
at 
org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:864)
at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:539)
at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
at 
org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:409)
at 
org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:520)


boostQParser and dismax

2009-10-22 Thread Joe Calderon
hello *, i was just reading over the wiki function query page and
found this little gem for boosting recent docs that's much better than
what i was doing before:

recip(ms(NOW,mydatefield),3.16e-11,1,1)


my question is, at the bottom it says:
"The most effective way to use such a boost is to multiply it with the
relevancy score, rather than add it in. One way to do this is with the
boost query parser."


how exactly do i use the boost query parser along with the dismax
parser? can someone post an example solrconfig snippet?
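For reference, the pattern usually shown on the Solr wiki nests a dismax query inside the boost parser via local params; a hedged sketch (the qq parameter name is arbitrary):

```
q={!boost b=recip(ms(NOW,mydatefield),3.16e-11,1,1) v=$qq defType=dismax}&qq=user query
```

Here recip(x,m,a,b) computes a/(m*x+b), so with m=3.16e-11 (roughly 1/milliseconds-per-year) a document dated NOW gets a multiplier of about 1.0 and a year-old document about 0.5, multiplied into the dismax relevancy score.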


thx much

--joe


max words/tokens

2009-10-20 Thread Joe Calderon
i have a pretty basic question: is there an existing analyzer that
limits the number of words/tokens indexed from a field? let's say i only
wanted to index the top 25 words...

thx much

--joe


Re: max words/tokens

2009-10-20 Thread Joe Calderon
cool np, i just didnt want to duplicate code if that already existed.

On Tue, Oct 20, 2009 at 12:49 PM, Yonik Seeley
yo...@lucidimagination.com wrote:
 On Tue, Oct 20, 2009 at 1:53 PM, Joe Calderon calderon@gmail.com wrote:
 i have a pretty basic question, is there an existing analyzer that
 limits the number of words/tokens indexed from a field? let say i only
 wanted to index the top 25 words...

 It would be really easy to write one, but no there is not currently.

 -Yonik
 http://www.lucidimagination.com

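Since such a filter doesn't exist yet, the logic reduces to passing through the first N tokens of the stream; a toy sketch (the real thing would be a Lucene TokenFilter in Java):

```python
from itertools import islice

def limit_tokens(tokens, max_tokens=25):
    """Pass through at most max_tokens tokens from the stream, dropping the rest."""
    return list(islice(tokens, max_tokens))

print(limit_tokens(iter(["a", "b", "c"]), max_tokens=2))  # ['a', 'b']
```

In a TokenFilter the equivalent is a counter in incrementToken() that returns false once the limit is reached.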


lucene 2.9 bug

2009-10-16 Thread Joe Calderon
hello *, ive read in other threads that lucene 2.9 had a serious bug
in it, hence trunk moved to 2.9.1-dev. im wondering what the bug is, as
ive been using the 2.9.0 version for the past few weeks with no
problems. is it critical to upgrade?

--joe


Re: Solr 1.4 release candidate

2009-10-14 Thread Joe Calderon
maybe im just not familiar with the way version numbers work in
trunk, but when i build the latest nightly the jars have names like
*-1.5-dev.jar. is that normal?

On Wed, Oct 14, 2009 at 7:01 AM, Yonik Seeley
yo...@lucidimagination.com wrote:
 Folks, we've been in code freeze since Monday and a test release
 candidate was created yesterday, however it already had to be updated
 last night due to a serious bug found in Lucene.

 For now you can use the latest nightly build to get any recent changes
 like this:
 http://people.apache.org/builds/lucene/solr/nightly/

 We'll probably release the final bits next week, so in the meantime,
 download the latest nightly build and give it a spin!

 -Yonik
 http://www.lucidimagination.com



how to get field contents out of Document object

2009-10-14 Thread Joe Calderon
hello *, sorry if this seems like a dumb question; im still fairly new
to working with lucene/solr internals.

given a Document object, what is the proper way to fetch an integer
value for a field called num_in_stock? it is both indexed and stored

thx much

--joe


concatenating tokens

2009-10-08 Thread Joe Calderon
hello *, im using a combination of tokenizers and filters that give me
the desired tokens; however, for a particular field i want to
concatenate these tokens back into a single string. is there a filter to
do that? if not, what are the steps needed to make my own filter to
concatenate tokens?

for example, i start with "Sprocket (widget) - Blue"; the analyzers
churn out the tokens [sprocket, widget, blue], and i want to end up with
the string "sprocket widget blue". this is a simple example; in the
general case lowercasing and punctuation removal does not work, hence
why im looking to concatenate tokens

--joe
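The usual shape of such a filter is to buffer every token from upstream and emit one joined token at the end; a sketch of just that logic outside Lucene:

```python
def concat_tokens(tokens, sep=" "):
    """Collapse a token stream into a single token joined by sep."""
    return sep.join(tokens)

print(concat_tokens(["sprocket", "widget", "blue"]))  # sprocket widget blue
```

As a TokenFilter this means consuming all of incrementToken() from the wrapped stream first, then emitting a single synthesized token.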


Re: stats page slow in latest nightly

2009-10-06 Thread Joe Calderon
thx much guys, no biggie for me, i just wanted to get to the bottom of
it in case i had screwed something else up..

--joe

On Tue, Oct 6, 2009 at 1:19 PM, Mark Miller markrmil...@gmail.com wrote:
 I was worried about that actually. I haven't tested how fast the RAM
 estimator is on huge String FieldCaches - it will be fast on everything
 else, but it checks the size of each String in the array.

 When I was working on it, I was actually going to default to not show
 the size, and make you click a link that added a param to get the sizes
 in the display too. But I foolishly didn't bring it up when Hoss made my
 life easier with his simpler patch.

 Yonik Seeley wrote:
 Might be the new Lucene fieldCache stats stuff that was recently added?

 -Yonik
 http://www.lucidimagination.com


 On Tue, Oct 6, 2009 at 3:56 PM, Joe Calderon calderon@gmail.com wrote:

 hello *, ive been noticing that /admin/stats.jsp is really slow in the
 recent builds, has anyone else encountered this?


 --joe



 --
 - Mark

 http://www.lucidimagination.com






Re: JVM OOM when using field collapse component

2009-10-02 Thread Joe Calderon
heap space is 4gb, set to grow up to 8gb; usage is normally ~1-2gb. it
seems to happen within a few searches.

if its just me ill try to isolate it; it could be some other part of
my implementation

thx much

On Fri, Oct 2, 2009 at 1:18 AM, Martijn v Groningen
martijn.is.h...@gmail.com wrote:
 No, I have not encountered OOM exceptions yet with the current field collapse patch.
 How large is your configured JVM heap space (-Xmx)? Field collapsing
 requires more memory than regular searches do. Does Solr run out of
 memory during the first search(es), or does it run out of memory after
 a while, when it has performed quite a few field collapse searches?

 I see that you are also using the collapse.includeCollapsedDocs.fl
 parameter for your search. This feature will require more memory than
 a normal field collapse search.

 I normally give the Solr instance a heap space of 1024M when having an
 index of a few million.

 Martijn

 2009/10/2 Joe Calderon calderon@gmail.com:
 i gotten two different out of memory errors while using the field
 collapsing component, using the latest patch (2009-09-26) and the
 latest nightly,

 has anyone else encountered similar problems? my collection is 5
 million results but ive gotten the error collapsing as little as a few
 thousand

 SEVERE: java.lang.OutOfMemoryError: Java heap space
        at org.apache.lucene.util.ArrayUtil.grow(ArrayUtil.java:173)
        at 
 org.apache.lucene.util.OpenBitSet.ensureCapacityWords(OpenBitSet.java:749)
        at 
 org.apache.lucene.util.OpenBitSet.ensureCapacity(OpenBitSet.java:757)
        at 
 org.apache.lucene.util.OpenBitSet.expandingWordNum(OpenBitSet.java:292)
        at org.apache.lucene.util.OpenBitSet.set(OpenBitSet.java:233)
        at 
 org.apache.solr.search.AbstractDocumentCollapser.addCollapsedDoc(AbstractDocumentCollapser.java:402)
        at 
 org.apache.solr.search.NonAdjacentDocumentCollapser.doCollapsing(NonAdjacentDocumentCollapser.java:115)
        at 
 org.apache.solr.search.AbstractDocumentCollapser.collapse(AbstractDocumentCollapser.java:208)
        at 
 org.apache.solr.handler.component.CollapseComponent.doProcess(CollapseComponent.java:98)
        at 
 org.apache.solr.handler.component.CollapseComponent.process(CollapseComponent.java:66)
        at 
 org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:195)
        at 
 org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
        at 
 org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
        at 
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
        at 
 org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1148)
        at 
 org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:387)
        at 
 org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
        at 
 org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
        at 
 org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:765)
        at 
 org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:417)
        at 
 org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
        at 
 org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
        at 
 org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
        at org.mortbay.jetty.Server.handle(Server.java:326)
        at 
 org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:534)
        at 
 org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:864)
        at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:539)
        at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
        at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
        at 
 org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:409)
        at 
 org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:520)

 SEVERE: java.lang.OutOfMemoryError: Java heap space
        at 
 org.apache.solr.util.DocSetScoreCollector.<init>(DocSetScoreCollector.java:44)
        at 
 org.apache.solr.search.NonAdjacentDocumentCollapser.doQuery(NonAdjacentDocumentCollapser.java:68)
        at 
 org.apache.solr.search.AbstractDocumentCollapser.collapse(AbstractDocumentCollapser.java:205)
        at 
 org.apache.solr.handler.component.CollapseComponent.doProcess(CollapseComponent.java:98)
        at 
 org.apache.solr.handler.component.CollapseComponent.process(CollapseComponent.java:66)
        at 
 org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:195)
        at 
 org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131

Re: field collapsing sums

2009-10-01 Thread Joe Calderon
hello martijn, thx for the tip. i tried that approach but ran into two
snags: 1. returning the fields makes collapsing a lot slower, but that
might just be the nature of iterating large results;
2. it seems like only dupes of records on the first page are returned.

or is there a setting im missing? currently im only sending
collapse.field=brand and collapse.includeCollapseDocs.fl=num_in_stock

--joe
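Following the approach Martijn describes below (returning the collapsed docs' fields and summing on the client), the summing step might look like this; a sketch where the dict mirrors the collapsedDocs response structure and field names from this thread:

```python
def sum_collapsed_field(collapsed_docs, field="num_in_stock"):
    """Sum a numeric field over the collapsed docs of each group."""
    return {group: sum(int(doc[field]) for doc in docs)
            for group, docs in collapsed_docs.items()}

# toy response fragment: one group ("brand1") with two collapsed docs
collapsed = {"brand1": [{"num_in_stock": "2"}, {"num_in_stock": "3"}]}
print(sum_collapsed_field(collapsed))  # {'brand1': 5}
```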

On Thu, Oct 1, 2009 at 1:14 AM, Martijn v Groningen
martijn.is.h...@gmail.com wrote:
 Hi Joe,

 Currently the patch does not do that, but you can do something else
 that might help you in getting your summed stock.

 In the latest patch you can include fields of collapsed documents in
 the result per distinct field value.
 If you specify collapse.includeCollapseDocs.fl=num_in_stock in the
 request and, let's say, you collapse on brand, then in the response you
 will receive the following xml:
 <lst name="collapsedDocs">
   <result name="brand1" numFound="48" start="0">
     <doc>
       <str name="num_in_stock">2</str>
     </doc>
     <doc>
       <str name="num_in_stock">3</str>
     </doc>
     ...
   </result>
   <result name="brand2" numFound="9" start="0">
     ...
   </result>
 </lst>

 On the client side you can do whatever you want with this data and for
 example sum it together. Although the patch does not sum for you, I
 think it will allow to implement your requirement without to much
 hassle.

 Cheers,

 Martijn

 2009/10/1 Matt Weber m...@mattweber.org:
 You might want to see how the stats component works with field collapsing.

 Thanks,

 Matt Weber

 On Sep 30, 2009, at 5:16 PM, Uri Boness wrote:

 Hi,

 At the moment I think the most appropriate place to put it is in the
 AbstractDocumentCollapser (in the getCollapseInfo method). Though, it might
 not be the most efficient.

 Cheers,
 Uri

 Joe Calderon wrote:

 hello all, i have a question on the field collapsing patch, say i have
 an integer field called num_in_stock and i collapse by some other
 column, is it possible to sum up that integer field and return the
 total in the output, if not how would i go about extending the
 collapsing component to support that?


 thx much

 --joe







Re: field collapsing sums

2009-10-01 Thread Joe Calderon
thx for the reply. i just want the number of dupes in the query
result, but it seems i dont get the correct totals.

for example, a non-collapsed dismax query for belgian beer returns X
results, but when i collapse and sum the number of docs under
collapse_counts, its much less than X.

it does seem to work when the collapsed results fit on one page (10
rows in my case)


--joe

 2) It seems that you are using the parameters as was intended. The
 collapsed documents will contain all documents (from whole query
 result) that have been collapsed on a certain field value that occurs
 in the result set that is being displayed. That is how it should work.
 But if I'm understanding you correctly you want to display all dupes
 from the whole query result set (also those whose collapse field value
 does not occur in the displayed result set)?


JVM OOM when using field collapse component

2009-10-01 Thread Joe Calderon
ive gotten two different out of memory errors while using the field
collapsing component, using the latest patch (2009-09-26) and the
latest nightly.

has anyone else encountered similar problems? my collection is 5
million results, but ive gotten the error collapsing as few as a few
thousand

SEVERE: java.lang.OutOfMemoryError: Java heap space
at org.apache.lucene.util.ArrayUtil.grow(ArrayUtil.java:173)
at 
org.apache.lucene.util.OpenBitSet.ensureCapacityWords(OpenBitSet.java:749)
at org.apache.lucene.util.OpenBitSet.ensureCapacity(OpenBitSet.java:757)
at 
org.apache.lucene.util.OpenBitSet.expandingWordNum(OpenBitSet.java:292)
at org.apache.lucene.util.OpenBitSet.set(OpenBitSet.java:233)
at 
org.apache.solr.search.AbstractDocumentCollapser.addCollapsedDoc(AbstractDocumentCollapser.java:402)
at 
org.apache.solr.search.NonAdjacentDocumentCollapser.doCollapsing(NonAdjacentDocumentCollapser.java:115)
at 
org.apache.solr.search.AbstractDocumentCollapser.collapse(AbstractDocumentCollapser.java:208)
at 
org.apache.solr.handler.component.CollapseComponent.doProcess(CollapseComponent.java:98)
at 
org.apache.solr.handler.component.CollapseComponent.process(CollapseComponent.java:66)
at 
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:195)
at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
at 
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
at 
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1148)
at 
org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:387)
at 
org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
at 
org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
at 
org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:765)
at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:417)
at 
org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
at 
org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
at 
org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
at org.mortbay.jetty.Server.handle(Server.java:326)
at 
org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:534)
at 
org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:864)
at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:539)
at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
at 
org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:409)
at 
org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:520)

SEVERE: java.lang.OutOfMemoryError: Java heap space
at 
org.apache.solr.util.DocSetScoreCollector.<init>(DocSetScoreCollector.java:44)
at 
org.apache.solr.search.NonAdjacentDocumentCollapser.doQuery(NonAdjacentDocumentCollapser.java:68)
at 
org.apache.solr.search.AbstractDocumentCollapser.collapse(AbstractDocumentCollapser.java:205)
at 
org.apache.solr.handler.component.CollapseComponent.doProcess(CollapseComponent.java:98)
at 
org.apache.solr.handler.component.CollapseComponent.process(CollapseComponent.java:66)
at 
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:195)
at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
at 
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
at 
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1148)
at 
org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:387)
at 
org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
at 
org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
at 
org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:765)
at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:417)
at 
org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
at 
org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
at 
org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)

field collapsing sums

2009-09-30 Thread Joe Calderon
hello all, i have a question on the field collapsing patch. say i have
an integer field called num_in_stock and i collapse by some other
column; is it possible to sum up that integer field and return the
total in the output? if not, how would i go about extending the
collapsing component to support that?


thx much

--joe


changing dismax parser to not treat symbols differently

2009-09-30 Thread Joe Calderon
how would i go about modifying the dismax parser to treat +/- as regular text?


Re: KStem download

2009-09-14 Thread Joe Calderon
is the source for the lucid kstemmer available? from the lucid solr
package i only found the compiled jars.

On Mon, Sep 14, 2009 at 11:04 AM, Yonik Seeley
yo...@lucidimagination.com wrote:
 On Mon, Sep 14, 2009 at 1:56 PM, darniz rnizamud...@edmunds.com wrote:
 Pascal Dimassimo wrote:

 Hi,

 I want to try KStem. I'm following the instructions on this page:
 http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters/Kstem

 ... but the download link doesn't work.

 Does anyone know the new location to download KStem?

 I am stuck with the same issue;
 the link has not been working for a long time.


 is there any alternate link
 Please let us know

 *shrug* - looks like they changed their download structure (or just
 took it down).  I searched around their site a bit but couldn't find
 another one (and google wasn't able to find it either).

 The one from Lucid is functionally identical, free, and much, much
 faster though - I'd just use that.

 -Yonik
 http://www.lucidimagination.com



query parser question

2009-09-10 Thread Joe Calderon
i have a field called text_stem that has a kstemmer on it; im having
trouble matching wildcard searches on a word that got stemmed.

for example, i index the word america's, which according to
analysis.jsp gets indexed after stemming as america.

when matching, i do a query like myfield:(ame*), which matches the
indexed term. this all works fine until the query becomes
myfield:(america's*), at which point it doesnt match; however, if i
remove the wildcard, like myfield:(america's), then it works again.

its almost like the term doesnt get stemmed when using a wildcard.

im using a 1.4 nightly. is this the correct behaviour? is there
something i should do differently?
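For context on why this happens: Lucene's query parsers generally do not run the analysis chain on wildcard (multiterm) terms, so the raw text is looked up against the index while the indexed token is the stemmed form. A sketch of the mismatch, using the field name from the question:

```
indexed token (after stemming):  america

text_stem:(ame*)        -> matches  (raw prefix "ame" matches "america")
text_stem:(america's*)  -> no match (wildcard term is not analyzed, and
                                     no indexed token starts with "america's")
text_stem:(america's)   -> matches  (non-wildcard term IS analyzed,
                                     stemming it back to "america")
```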

in the mean time ive added americas as a protected word in the
kstemmer, but im afraid more edge cases will come up.

--joe


help with solr.PatternTokenizerFactory

2009-09-09 Thread Joe Calderon
hello *, im not sure what im doing wrong. i have this field defined in
schema.xml, and using admin/analysis.jsp its working as expected:

<fieldType name="text_spell" class="solr.TextField">
  <analyzer>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.PatternTokenizerFactory" pattern=";"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="(\p{Punct}+)" replacement="" replace="all"/>
  </analyzer>
</fieldType>


but when i try to update via csvhandler i get

Error 500 org.apache.solr.analysis.PatternTokenizerFactory$1 cannot be
cast to org.apache.lucene.analysis.Tokenizer

java.lang.ClassCastException:
org.apache.solr.analysis.PatternTokenizerFactory$1 cannot be cast to
org.apache.lucene.analysis.Tokenizer
at org.apache.solr.analysis.TokenizerChain.getStream(TokenizerChain.java:69)
at org.apache.solr.analysis.SolrAnalyzer.reusableTokenStream(SolrAnalyzer.java:74)
...



im using nightly of solr 1.4

thx much,
--joe


Re: Geographic clustering

2009-09-08 Thread Joe Calderon
there are clustering libraries like
http://bonsai.ims.u-tokyo.ac.jp/~mdehoon/software/cluster/, that have
bindings to perl/python, you can preprocess your results and create
clusters for each zoom level
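A minimal sketch of the morton-order idea mentioned in this thread (plain Java; the class and method names are mine, not from any patch): interleave the bits of the two grid coordinates into a single code, then shift off low bits so that nearby points collapse into the same cluster key, with the shift amount tied to zoom level.

```java
public class Morton {
    // Interleave the low 16 bits of x and y (grid coordinates) into one
    // Morton code: x's bits land on even positions, y's on odd positions.
    static long interleave(int x, int y) {
        long code = 0;
        for (int i = 0; i < 16; i++) {
            code |= (long) ((x >> i) & 1) << (2 * i);
            code |= (long) ((y >> i) & 1) << (2 * i + 1);
        }
        return code;
    }

    // Throw away the lowest 2*k bits so nearby points share one cluster
    // key; a larger k gives coarser clusters (lower zoom levels).
    static long clusterKey(int x, int y, int k) {
        return interleave(x, y) >>> (2 * k);
    }

    public static void main(String[] args) {
        System.out.println(interleave(3, 5));  // prints 39 (0b100111)
        // nearby points share a key once precision is dropped
        System.out.println(clusterKey(1000, 2000, 8) == clusterKey(1001, 2001, 8));  // prints true
    }
}
```

Faceting on a stored cluster-key field (one per zoom level) would then return cluster counts directly from Solr rather than looping over the docset.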

On Tue, Sep 8, 2009 at 8:08 AM, gwk g...@eyefi.nl wrote:
 Hi,

 I just completed a simple proof-of-concept clusterer component which
 naively clusters with a specified bounding box around each position,
 similar to what the javascript MarkerClusterer does. It's currently very
 slow as I loop over the entire docset and request the longitude and
 latitude of each document (not to mention that my unfamiliarity with
 Lucene/Solr isn't helping the implementation's performance any; most code
 is copied from grep-ing the Solr source). Clustering a set of about
 80,000 documents takes about 5-6 seconds. I'm currently looking into
 storing the Hilbert curve mapping in Solr and clustering using facet
 counts on numerical ranges of that mapping, but I'm not sure it will pan out.

 Regards,

 gwk

 Grant Ingersoll wrote:

 Not directly related to geo clustering, but
 http://issues.apache.org/jira/browse/SOLR-769 is all about a pluggable
 interface to clustering implementations.  It currently has Carrot2
 implemented, but the APIs are marked as experimental.  I would definitely be
 interested in hearing your experience with implementing your clustering
 algorithm in it.

 -Grant

 On Sep 8, 2009, at 4:00 AM, gwk wrote:

 Hi,

 I'm working on a search-on-map interface for our website. I've created a
 little proof of concept which uses the MarkerClusterer
 (http://code.google.com/p/gmaps-utility-library-dev/) which clusters the
 markers nicely. But because sending tens of thousands of markers over Ajax
 is not quite as fast as I would like it to be, I'd prefer to do the
 clustering on the server side. I've considered a few options like storing
 the morton-order and throwing away precision to cluster, assigning all
 locations to a grid position. Or simply cluster based on country/region/city
 depending on zoom level by adding latitude on longitude fields for each zoom
 level (so that for smaller countries you have to be zoomed in further to get
 the next level of clustering).

 I was wondering if anybody else has worked on something similar and if so
 what their solutions are.

 Regards,

 gwk

 --
 Grant Ingersoll
 http://www.lucidimagination.com/

 Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using
 Solr/Lucene:
 http://www.lucidimagination.com/search






stemming plurals

2009-09-04 Thread Joe Calderon
i saw some posts regarding stemming plurals in the archives from 2008,
and i was wondering if this was ever integrated or if custom hackery is
still needed. is there something like a stem-plurals analyzer, or is
the kstemmer the closest thing?


thx much
--joe


score = sum of boosts

2009-09-02 Thread Joe Calderon
hello *, what would be the best approach to return the sum of boosts
as the score?

ex:
a dismax handler boosts matches to field1^100 and field2^50, a query
matches both fields hence the score for that row would be 150



is this something i could do with a function query or do i need to
hack up DisjunctionMaxScorer ?
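One built-in knob worth checking before hacking DisjunctionMaxScorer: dismax's tie parameter. A DisjunctionMaxQuery scores a document as (max field score) + tie * (sum of the other field scores), so tie=1.0 effectively sums the per-field scores instead of taking the max. Note this sums the *scores* (which the boosts multiply into), not the raw boost values themselves; a sketch for the example above:

```
qf=field1^100 field2^50
tie=1.0
```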

--joe


Re: Responses getting truncated

2009-08-28 Thread Joe Calderon
i had a similar issue with text from past requests showing up. this was
on a 1.3 nightly; i switched to using the lucid build of 1.3 and the
problem went away. im using a nightly of 1.4 right now, also without
probs. then again, your mileage may vary, as i also made a bunch of
schema changes that might have had some effect. it wouldnt hurt to try though.



On 08/28/2009 02:04 PM, Rupert Fiasco wrote:

Firstly, to everyone who has been helping me, thank you very much. All
this feedback is helping me narrow down these issues.

I deleted the index and re-indexed all the data from scratch and for a
couple of days we were OK, but now it seems to be erring again.

It happens on different input documents so what was broken before now
works (documents that were having issues before are OK now, after a
fresh re-index).

An issue we are seeing now is that an XML response from Solr will
contain the tail of an earlier response, for an example:

http://brockwine.com/solr2.txt

That is a response we are getting from Solr - using the web interface
for Solr in Firefox, Firefox freaks out because it tries to parse
that, and of course, its invalid XML, but I can retrieve that via
curl.

Anyone seeing this before?

In regards to earlier questions:

   

i assume you are correct, but you listed several steps of transformation
above, are you certian they all work correctly and produce valid UTF-8?
 

Yes, I have looked at the source and contacted the author of the
conversion library we are using and have verified that if UTF8 goes in
then UTF8 will come out and UTF8 is definitely going in.

I dont think sending over an actual input document would help because
it seems to change. Plus, this latest issue appears to be more an
issue of the last response buffer not clearing or something.

Whats strange is that if I wait a few minutes and reload, then the
buffer is cleared and I get back a valid response, its intermittent,
but appears to be happening frequently.

If it matters, we started using LucidGaze for Solr about 10 days ago,
approximately when these issues started happening (but its hard to say
if thats an issue because at this same time we switched from a PHP to
Java indexing client).

Thanks for your patience

-Rupert

On Tue, Aug 25, 2009 at 8:33 PM, Chris Hostetter
hossman_luc...@fucit.org wrote:
   

: We are running an instance of MediaWiki so the text goes through a
: couple of transformations: wiki markup -  html -  plain text.
: Its at this last step that I take a snippet and insert that into Solr.
...
: doc.addField(text_snippet_t, article.getSnippet(1000));

ok, well first off: that's not the field where you are having problems,
is it?  if i remember correctly from your previous posts, wasn't the
response getting aborted in the middle of the Contents field?

: and a maximum of 1K chars if its bigger. I initialized this String
: from the DB by using the String constructor where I pass in the
: charset/collation
:
: text = new String(textFromDB, UTF-8);
:
: So to the best of my knowledge, accessing a substring of a UTF-8
: encoded string should not break up the UTF-8 code point. Is that an

i assume you are correct, but you listed several steps of transformation
above, are you certian they all work correctly and produce valid UTF-8?

this leads back to my suggestion before

:  Can you put the orriginal (pre solr, pre solrj, raw untouched, etc...)
:  file that this solr doc came from online somewhere?
:
:  What does your *indexing* code look like? ... Can you add some debuging to
:  the SolrJ client when you *add* this doc to print out exactly what those
:  1000 characters are?


-Hoss
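One concrete way the snippet step discussed above could produce invalid UTF-8 even when the input String is fine: String.substring() counts UTF-16 code units, so a fixed-length cut like the 1000-character snippet can split a surrogate pair, and the orphaned surrogate encodes to an invalid byte sequence. A small defensive sketch (plain Java; truncate() is a hypothetical helper, not from the posters' code):

```java
public class SafeTruncate {
    // Truncate a String to at most maxChars UTF-16 code units without
    // splitting a surrogate pair (which would encode to invalid UTF-8).
    static String truncate(String s, int maxChars) {
        if (s.length() <= maxChars) return s;
        int end = maxChars;
        // back off one unit if the cut would strand a high surrogate
        if (Character.isHighSurrogate(s.charAt(end - 1))) end--;
        return s.substring(0, end);
    }

    public static void main(String[] args) {
        // supplementary character U+1D50A is two chars (a surrogate pair)
        String s = "abc" + new String(Character.toChars(0x1D50A));
        System.out.println(truncate(s, 4).length());  // prints 3: pair not split
    }
}
```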

 




Re: Responses getting truncated

2009-08-28 Thread Joe Calderon
yonik has a point; when i ran into this i also upgraded to the latest
stable jetty. im using jetty 6.1.18.


On 08/28/2009 04:07 PM, Rupert Fiasco wrote:

I deployed LucidWorks with my existing solrconfig / schema and
re-indexed my data into it and pushed it out to production, we'll see
how it stacks up over the weekend. Already queries that were breaking
on the prior Jetty/stock Solr setup are now working - but I have seen
it before where upon an initial re-index things work OK then a couple
of days later they break.

Keep y'all posted.

Thanks
-Rupert

On Fri, Aug 28, 2009 at 3:12 PM, Rupert Fiasco rufia...@gmail.com wrote:
   

Yes, I am hitting the Solr server directly (medsolr1.colo:9007)

Versions / architectures:

Jetty(6.1.3)

o...@medsolr1 ~ $ uname -a
Linux medsolr1 2.6.18-xen-r12 #9 SMP Tue Mar 3 15:34:08 PST 2009
x86_64 Intel(R) Xeon(R) CPU L5420 @ 2.50GHz GenuineIntel GNU/Linux

o...@medsolr1 ~ $ java -version
java version 1.6.0_11
Java(TM) SE Runtime Environment (build 1.6.0_11-b03)
Java HotSpot(TM) 64-Bit Server VM (build 11.0-b16, mixed mode)


I was thinking of trying LucidWorks for Solr (1.3.02) x64 - worth a try.

-Rupert

On Fri, Aug 28, 2009 at 3:08 PM, Yonik Seeley ysee...@gmail.com wrote:
 

On Mon, Aug 24, 2009 at 6:30 PM, Rupert Fiasco rufia...@gmail.com wrote:
   

If I run these through curl on the command line it's
truncated, and if I run the search through the web-based admin panel
then I get an XML parse error.
 

Are you running curl directly against the solr server, or going
through a load balancer?  Cutting out the middle-men using curl was a
great idea - just make sure to go all the way.

At first I thought it could possibly be a FastWriter bug (internal
Solr class), but that's only used on the TextWriter (JSON, Python,
Ruby) based formats, not on the original XML format.

It really looks like you're hitting a lower-level IO buffering bug
(esp when you see a response starting off with the tail of another
response).  That doesn't look like it could be a Solr bug... but
rather smells like a thread safety bug in the servlet container.

What type of machine are you running on?  What JVM?
You could try upgrading your version of Jetty, the JVM, or try
switching to Tomcat.

-Yonik
http://www.lucidimagination.com


   

This appears to have just started recently and the only thing we have
done is change our indexer from a PHP one to a Java one, but
functionally they are identical.

Any thoughts? Thanks in advance.

- Rupert

 
   
 




extended documentation on analyzers

2009-08-27 Thread Joe Calderon
is there an online resource or a book that contains a thorough list of
tokenizers and filters available and their functionality?

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

is very helpful but i would like to go through additional filters to
make sure im not reinventing the wheel adding my own

--joe


shingle filter

2009-08-24 Thread Joe Calderon
hello *, im currently faceting on a shingled field to obtain popular
phrases and its working well. however, id like to limit the number of
shingles that get created. solr.ShingleFilterFactory supports
maxShingleSize; can it be made to support a minimum as well? can
someone point me in the right direction?
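For reference, the wiring in question looks roughly like this (a sketch; the type name and tokenizer choice are mine). In the 1.4-era factory only maxShingleSize is exposed; a minimum shingle size was added to the underlying Lucene ShingleFilter in a later release, so getting it earlier means patching the filter and its factory:

```xml
<fieldType name="text_shingle" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- emit shingles of up to 3 tokens; drop single-token output -->
    <filter class="solr.ShingleFilterFactory" maxShingleSize="3"
            outputUnigrams="false"/>
  </analyzer>
</fieldType>
```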

thx much
--joe


where to get solr 1.4 nightly

2009-08-20 Thread Joe Calderon
i want to try out the improvements in 1.4 but the nightly site is down

http://people.apache.org/builds/lucene/solr/nightly/


is there a mirror for nightlies?


--joe


Re: dealing with duplicates

2009-08-10 Thread Joe Calderon
so in the case someone can help me with the query syntax, the
relational query i would use for this would be something like:

SELECT * FROM videos
WHERE
title LIKE 'family guy'
AND desc LIKE 'stewie%'
AND (
  ( is_dup = 0 )
  OR
  ( is_dup = 1 AND id NOT IN
(
SELECT id FROM videos
WHERE
title LIKE 'family guy'
AND desc LIKE 'stewie%'
AND is_dup = 0
)
  )
)
ORDER BY views
LIMIT 10

can a similar query be written in lucene or do i need to structure my
index differently to be able to do such a query?

thx much

--joe


On Sat, Aug 1, 2009 at 9:15 AM, Joe Calderon calderon@gmail.com wrote:
 hello, thanks for the response, i did take a look at that document but
 in my application i actually want the duplicates, as i mentioned, the
 matching text could be very different among cluster members, what
 joins them together is a similar set of numeric features.

 currently i do a query with fq=duplicate:0 and show a link to
 optionally show the dupes by querying for all dupes of the
 master id, however im currently missing any documents that matched the
 query but are duplicates of other masters not included in that result
 set.

 in a relational database (fulltext indexing aside) i would use a
 subquery, i imagine a similar approach could be used with lucene, i
 just dont know the syntax

 best,

 --joe

 On Fri, Jul 31, 2009 at 11:32 PM, Otis Gospodnetic
 otis_gospodne...@yahoo.com wrote:
 Joe,

 Maybe we can take a step back first.  Would it be better if your index was 
 cleaner and didn't have flagged duplicates in the first place?  If so, have 
 you tried using http://wiki.apache.org/solr/Deduplication ?

  Otis
 --
 Sematext is hiring -- http://sematext.com/about/jobs.html?mls
 Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR



 - Original Message 
 From: Joe Calderon calderon@gmail.com
 To: solr-user@lucene.apache.org
 Sent: Friday, July 31, 2009 5:06:48 PM
 Subject: dealing with duplicates

 hello all, i have a collection of a few million documents; i have many
 duplicates in this collection. they have been clustered with a simple
 algorithm, i have a field called 'duplicate' which is 0 or 1, and
 fields called 'description, tags, meta'; documents are clustered on
 different criteria and the text i search against could be very
 different among members of a cluster.

 im currently using a dismax handler to search across the text fields
 with different boosts, and a filter query to restrict to masters
 (duplicate: 0)

 my question is then, how do i best query for documents which are
 masters OR match text but are not included in the matched set of
 masters?

 does this make sense?





concurrent csv loading

2009-08-06 Thread Joe Calderon
for first time loads i currently post to
/update/csv?commit=false&separator=%09&escape=\&stream.file=workfile.txt&map=NULL:&keepEmpty=false,
this works well and finishes in about 20 minutes for my work load.

this is mostly cpu bound; i have an 8 core box and it seems one core
takes the brunt of the work.

if i wanted to optimize, would i see any benefit to splitting
workfile.txt in two and doing two posts?

im running lucid's build of solr 1.3.0 on jetty 6, io is not a
bottleneck as the data folder is on tmpfs
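A quick way to try the two-way split locally (a sketch: the sample data, paths, port, and URL parameters are assumptions; the curl calls are left commented since they need a running Solr):

```shell
# stand-in for workfile.txt: a header plus 4 tab-separated data rows
printf 'id\ttitle\n' > workfile.txt
for i in 1 2 3 4; do printf '%s\trow%s\n' "$i" "$i" >> workfile.txt; done

tail -n +2 workfile.txt > body.txt   # keep the header out of the halves
split -l 2 body.txt part_            # -> part_aa and part_ab, 2 rows each

# post the halves concurrently, then commit once (assumed URL/params):
# curl "http://localhost:8983/solr/update/csv?commit=false&separator=%09&stream.file=$PWD/part_aa&fieldnames=id,title" &
# curl "http://localhost:8983/solr/update/csv?commit=false&separator=%09&stream.file=$PWD/part_ab&fieldnames=id,title" &
# wait
# curl 'http://localhost:8983/solr/update?stream.body=<commit/>'
```

Whether this actually helps depends on where the single hot core is spending its time; if analysis dominates, two concurrent posts should spread it across cores.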

thx much
--joe


Re: dealing with duplicates

2009-08-01 Thread Joe Calderon
hello, thanks for the response, i did take a look at that document but
in my application i actually want the duplicates, as i mentioned, the
matching text could be very different among cluster members, what
joins them together is a similar set of numeric features.

currently i do a query with fq=duplicate:0 and show a link to
optionally show the dupes by querying for all dupes of the
master id, however im currently missing any documents that matched the
query but are duplicates of other masters not included in that result
set.

in a relational database (fulltext indexing aside) i would use a
subquery, i imagine a similar approach could be used with lucene, i
just dont know the syntax

best,

--joe

On Fri, Jul 31, 2009 at 11:32 PM, Otis Gospodnetic
otis_gospodne...@yahoo.com wrote:
 Joe,

 Maybe we can take a step back first.  Would it be better if your index was 
 cleaner and didn't have flagged duplicates in the first place?  If so, have 
 you tried using http://wiki.apache.org/solr/Deduplication ?

  Otis
 --
 Sematext is hiring -- http://sematext.com/about/jobs.html?mls
 Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR



 - Original Message 
 From: Joe Calderon calderon@gmail.com
 To: solr-user@lucene.apache.org
 Sent: Friday, July 31, 2009 5:06:48 PM
 Subject: dealing with duplicates

 hello all, i have a collection of a few million documents; i have many
 duplicates in this collection. they have been clustered with a simple
 algorithm, i have a field called 'duplicate' which is 0 or 1, and
 fields called 'description, tags, meta'; documents are clustered on
 different criteria and the text i search against could be very
 different among members of a cluster.

 im currently using a dismax handler to search across the text fields
 with different boosts, and a filter query to restrict to masters
 (duplicate: 0)

 my question is then, how do i best query for documents which are
 masters OR match text but are not included in the matched set of
 masters?

 does this make sense?