RE: highlighting performance poor with *.tar, *.gz files

2011-11-25 Thread Shyam Bhaskaran
Hi Eric, Thanks for the response. I am already using termVectors with offsets & positions enabled as shown below. I am indexing FAQ content, and some of these FAQs have attachments linked to them. These attachments include files like PDF, DOC, *.TAR, *.GZIP that contain additional informa

Re: inconsistent JVM crash with version 4.0-SNAPSHOT

2011-11-25 Thread Erick Erickson
Don't know if it's this particular issue, but have you seen: https://issues.apache.org/jira/browse/LUCENE-3588 Best Erick On Fri, Nov 25, 2011 at 4:59 PM, Justin Caratzas wrote: > Lasse Aagren writes: > >> Hi, >> >> We are running Solr-Lucene 4.0-SNAPSHOT (1199777M - hudson - 2011-11-09 >> 14:5

Re: remove answers with identical scores

2011-11-25 Thread Erick Erickson
Have you considered removing them at index time? See: http://wiki.apache.org/solr/Deduplication Best Erick On Fri, Nov 25, 2011 at 3:13 PM, Ted Dunning wrote: > See http://en.wikipedia.org/wiki/Locality-sensitive_hashing > > The obvious thought that I had just after hitting send was that you cou

Re: Index a null text field

2011-11-25 Thread Erick Erickson
Are you committing after the run? Best Erick On Fri, Nov 25, 2011 at 1:32 PM, Young, Cody wrote: > I don't see anything wrong so far other than a typo here (missing a p in > the second price): > > >  Can you see if there are any warnings in the log about documents not > being able to be created

Re: solrQueryParser defaultOperator

2011-11-25 Thread Erick Erickson
Please review: http://wiki.apache.org/solr/UsingMailingLists You're asking us to figure out what you've done. In particular, are you using either dismax or edismax? They don't respect the defaultOperator. Use the mm param to get this kind of behavior. Best Erick On Thu, Nov 24, 2011 at 6:33 PM,

Re: WordDelimiterFilter MultiPhraseQuery case insesitive Issue

2011-11-25 Thread Erick Erickson
Have you looked at the admin/analysis page? That's invaluable for answering this kind of question. Best Erick On Thu, Nov 24, 2011 at 2:30 PM, Uomesh wrote: > Hi, > > I tried with preserveOriginal="1" and reindex too but still no result. > > Thanks, > Umesh > > On Wed, Nov 23, 2011 at 5:33 PM, S

Re: Incorrect Search results

2011-11-25 Thread Erick Erickson
Please review: http://wiki.apache.org/solr/UsingMailingLists You have given us virtually no information that would allow us to help... Best Erick On Thu, Nov 24, 2011 at 1:57 PM, GAURAV PAREEK wrote: > I am serching some of the key word but I am not getting the correct result. > > According my

Re: highlighting performance poor with *.tar, *.gz files

2011-11-25 Thread Erick Erickson
Highlighting is dependent on the size of the data being fed through the highlighter. Unless you have termVectors & offsets & positions enabled, the text must be re-analyzed, see: http://wiki.apache.org/solr/FieldOptionsByUseCase?highlight=%28termvector%29%7C%28retrieve%29%7C%28contents%29 But high

Re: Query a field with no value or a particular value.

2011-11-25 Thread Erick Erickson
You just need two clauses, something like q=field:yes (field:* -field:[* TO *]) fq could work here too. Best Erick On Fri, Nov 25, 2011 at 10:06 AM, Phil Hoy wrote: > Hi, > > Thanks for getting back to me, and sorry the default q value was *:* so I > omitted it from the example. > > I do not

Re: trouble with CollationKeyFilter

2011-11-25 Thread Erick Erickson
It's checked in, SOLR-2438. Although it's getting some surgery so you can expect it to morph a bit. Erick On Wed, Nov 23, 2011 at 11:22 PM, Michael Sokolov wrote: > Thanks for confirming that, and laying out the options, Robert. > > -Mike > > On 11/23/2011 9:03 PM, Robert Muir wrote: >> >> hi, >

Re: inconsistent JVM crash with version 4.0-SNAPSHOT

2011-11-25 Thread Justin Caratzas
Lasse Aagren writes: > Hi, > > We are running Solr-Lucene 4.0-SNAPSHOT (1199777M - hudson - 2011-11-09 > 14:58:50) on several servers running: > > 64bit Debian Squeeze (6.0.3) > OpenJDK6 (b18-1.8.9-0.1~squeeze1) > Tomcat 6.028 (6.0.28-9+squeeze1) > > Some of the servers have 48G RAM and in that

Re: remove answers with identical scores

2011-11-25 Thread Ted Dunning
See http://en.wikipedia.org/wiki/Locality-sensitive_hashing The obvious thought that I had just after hitting send was that you could put the LSH signatures on the documents. That would let you do the scan at low volume and using LSH would make the duplicate scan almost as fast as your score scan

Re: How many defaultsearchfields we can have in one schema.xml file?

2011-11-25 Thread Lee Carroll
only one field can be a default. use copy field and copy the fields you need to search into a single field and set the copy field to be the default. That might be ok depending upon your circumstances On 25 November 2011 12:46, kiran.bodigam wrote: > In my schema i have defined below tag for index

Re: Huge Performance: Solr distributed search

2011-11-25 Thread Mikhail Garber
in general terms, when your Java heap is so large, it is beneficial to set mx and ms to the same size. On Wed, Nov 23, 2011 at 5:12 AM, Artem Lokotosh wrote: > Hi! > > * Data: > - Solr 3.4; > - 30 shards ~ 13GB, 27-29M docs each shard. > > * Machine parameters (Ubuntu 10.04 LTS): > user@Solr:~$ u

RE: Index a null text field

2011-11-25 Thread Young, Cody
I don't see anything wrong so far other than a typo here (missing a p in the second price): Can you see if there are any warnings in the log about documents not being able to be created? Also, you should have a field type definition for text in your schema. It will look something like

Re: trouble with CollationKeyFilter

2011-11-25 Thread Robert Muir
On Wed, Nov 23, 2011 at 11:22 PM, Michael Sokolov wrote: > Thanks for confirming that, and laying out the options, Robert. > FYI: Erick committed the multiterm stuff, so I opened an issue for this: https://issues.apache.org/jira/browse/SOLR-2919 -- lucidimagination.com

Re: Boosted documents not appearing higher than less-boosted ones for equal relevancy.

2011-11-25 Thread Tomás Fernández Löbbe
I don't think there is a way of seeing the "boosts" from the index, as those are encoded as "norms" (together with length normalization). You can see the norms with Luke if you want to and in the debugQuery output the index-time boost should be represented in the "fieldNorm" section. (if you click

Re: Unable to index documents using DataImportHandler with MSSQL

2011-11-25 Thread Ian Grainger
Update on this: I've established: * It's not a problem in the DB (I can index from this DB into a Solr instance on another server) * It's not Tomcat (I get the same problem in Jetty) * It's not the schema (I have simplified it to one field) That leaves SolrConfig.xml and data-config. Only thing c

RE: XML Manager for Solr

2011-11-25 Thread Steven A Rowe
Hi Stephane, Do you know about Solr's DataImportHandler, aka DIH?: http://wiki.apache.org/solr/DataImportHandler Steve > -Original Message- > From: KabooHahahein [mailto:stele...@hotmail.com] > Sent: Friday, November 25, 2011 10:33 AM > To: solr-user@lucene.apache.org > Subject: XML Man

XML Manager for Solr

2011-11-25 Thread KabooHahahein
Hi, I am new to Solr, and from what I understand, Solr indexes an XML database into its own format in order to enter the data into the search engine. I am currently trying to find an XML solution for management of these XML files. My database will include multiple XML files, and I'd like to be ab

Re: Huge Performance: Solr distributed search

2011-11-25 Thread Artem Lokotosh
On 11/25/2011 3:13 AM, Mark Miller wrote: When you search each shard, are you positive that you are using all of the same parameters? You are sure you are hitting request handlers that are configured exactly the same and sending exactly the same queries? In my experience, the overhead for dist

Boosted documents not appearing higher than less-boosted ones for equal relevancy.

2011-11-25 Thread Andrew Ingram
Hi all, I have 4 products, let's call them p1,p2, p3 and p4, at the point of indexing I'm boosting each document as follows (using ): p1 = 2.3434156476491901 p2 = 2.1894875146124502 p3 = 2.51677824126855 p4 = 2.2773491010634999 (Note: scores may not be identical to what it currently indexed, be

RE: Sort question

2011-11-25 Thread Phil Hoy
You might be able to sort by the map function q=*:*&sort=map(price,0,100, 10) asc, price asc. Phil -Original Message- From: Erick Erickson [mailto:erickerick...@gmail.com] Sent: 25 November 2011 13:49 To: solr-user@lucene.apache.org Subject: Re: Sort question Not that I know of. Yo

RE: Query a field with no value or a particular value.

2011-11-25 Thread Phil Hoy
Hi, Thanks for getting back to me, and sorry the default q value was *:* so I omitted it from the example. I do not have a problem getting the null values so q=*:*&fq=-field:[* TO *] indeed works but I also need the docs with a specific value e.g. fq=field:yes. Is this possible? Phil -Or

Re: Separate ACL and document index

2011-11-25 Thread Erick Erickson
There's another approach that *may* help, see: https://issues.apache.org/jira/browse/SOLR-2429 This is probably suitable if you don't have a zillion results to sort through. The idea here is that you can specify a filter query that only executes after all the other parts of a query are done, i.e.

Re: Synonyms 1 fetching 2001, how to avoid

2011-11-25 Thread Erick Erickson
Please review: http://wiki.apache.org/solr/UsingMailingLists You haven't shown the relevant parts of your configs. You haven't shown the queries you're using, with &debugQuery=on You haven't shown the input You haven't explained why you think synonyms have anything to do with the problem. So it'

Re: strange behavior of scores and term proximity use

2011-11-25 Thread Erick Erickson
You might try with a less "fraught" search phrase, "to be or not to be" is a classic query that may be all stop words. Otherwise, I'm clueless. On Wed, Nov 23, 2011 at 3:15 PM, Ariel Zerbib wrote: > I tested with the version 4.0-2011-11-04_09-29-42. > > Ariel > > > 2011/11/17 Erick Erickson >

Re: Solr dismax scoring and weight

2011-11-25 Thread Erick Erickson
No, I mean the number that's used to hold the length of the field is a byte, but that it's not just a simple byte. It's encoded to handle very long fields in that byte, but there's some loss of precision. For instance, and I'm pulling numbers out of thin air here, fields of 1-25 terms may collapse

Re: Solr Search for misspelled search term

2011-11-25 Thread Erick Erickson
Did you turn it on? In the defaults section, something like: on BTW, I would NOT do the spellcheck.build=true on every request, this will rebuild your dictionary every time which is a definite performance problem! Best Erick On Wed, Nov 23, 2011 at 7:32 AM, meghana wrote: > > I have configured

Re: date range in solr 3.1

2011-11-25 Thread Erick Erickson
I think you're asking for something like: fq=date:[NOW/DAY-5DAYS TO NOW/DAY+1DAY]? Best Erick On Wed, Nov 23, 2011 at 6:29 AM, do3do3 wrote: > what i got is the number of this period but i want to get this result only, > what is the query to can get that like > fq=source:"news" > > > -- > View t
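To make the date math in the suggested filter concrete, here is a minimal Python sketch. It assumes NOW/DAY rounds down to the start of the current day in UTC (Solr's default behaviour for date math, as far as I know), and the helper name day_range_filter is hypothetical, not part of any Solr client.

```python
from datetime import datetime, timedelta, timezone


def day_range_filter(field, days_back, now):
    """Build the fq string and the UTC bounds it is assumed to resolve to.

    Hypothetical helper for illustration only.
    """
    # NOW/DAY: round "now" down to midnight (UTC assumed).
    start_of_day = now.astimezone(timezone.utc).replace(
        hour=0, minute=0, second=0, microsecond=0
    )
    start = start_of_day - timedelta(days=days_back)  # NOW/DAY-5DAYS
    end = start_of_day + timedelta(days=1)            # NOW/DAY+1DAY
    fq = f"{field}:[NOW/DAY-{days_back}DAYS TO NOW/DAY+1DAY]"
    return fq, start, end


fq, start, end = day_range_filter(
    "date", 5, datetime(2011, 11, 25, 13, 49, tzinfo=timezone.utc)
)
print(fq)
print(start.isoformat(), "to", end.isoformat())
```

Rounding both ends to day boundaries also keeps the filter string identical for every request made on the same day, which may help the filter cache.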

Re: Query a field with no value or a particular value.

2011-11-25 Thread Erick Erickson
You haven't specified any "q" clause, just an "fq" clause. Try q=*:* -field:[* TO *] or q=*:*&fq=-field:[* TO *] BTW, the logic of field:yes -field:[* TO *] makes no sense You're saying "find me all the fields containing the value "yes" and remove from that set all the fields containing any value

Re: Sort question

2011-11-25 Thread Erick Erickson
Not that I know of. You could conceivably do some work at index time to create a field that would sort in that order by doing some sort of mapping from these values into a field that sorts the way you want, or you might be able to do a plugin Best Erick On Wed, Nov 23, 2011 at 3:29 AM, vraa wrot

Re: Can files be faceted based on their size ?

2011-11-25 Thread Erick Erickson
Well, you can try adding a directive to put it into a numeric field But you need to provide significantly more details. From what you've said there's not enough information to say much besides "it should work". Perhaps you should review: http://wiki.apache.org/solr/UsingMailingLists Best E

Re: Faceting is not Using Field Value Cache . . ?

2011-11-25 Thread Erick Erickson
In addition to Samuel's comment, the filterCache is also used under certain circumstances Best Erick 2011/11/22 Samuel García Martínez : > AFAIK, FieldValueCache is only used for faceting on tokenized fields. > Maybe, are you getting confused with FieldCache ( > http://lucene.apache.org/java/

Re: remove answers with identical scores

2011-11-25 Thread Fred Zimmerman
thanks. i did consider postprocessing and may wind up doing that, i was hoping there was a way to have Solr do it for me! that I have to ask this question is probably not a good sign, but what is LSH clustering? On Fri, Nov 25, 2011 at 4:34 AM, Ted Dunning wrote: > You can do that pretty easily

How many defaultsearchfields we can have in one schema.xml file?

2011-11-25 Thread kiran.bodigam
In my schema i have defined below tag for indexing the fields because in my use case except the uniquekey remaining fields needs to be indexed as it is (with same datatype) Here i would like to search all of them with out field name unfortunately i can't put all of them using option coz its dyna

Unable to index documents using DataImportHandler with MSSQL

2011-11-25 Thread Ian Grainger
Hi I have copied my Solr config from a working Windows server to a new one, and it can't seem to run an import. They're both using win server 2008 and SQL 2008R2. This is the data importer config I can use MS SQL Prof

Query a field with no value or a particular value.

2011-11-25 Thread Phil Hoy
Hi, Is it possible to constrain the results of a query to return docs where a field contains no value or a particular value? I tried ?fq=(field:yes OR -field:[* TO *]) but I get no results even though queries with either ?fq=field:yes or ?fq=-field:[* TO *] do return results. Phil
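One commonly suggested form for this kind of filter pairs the negative clause with *:* inside the parentheses, because a purely negative clause nested in a boolean expression has nothing to subtract from and so matches nothing by itself. The Python sketch below only builds the request parameters; the helper name value_or_missing_params is hypothetical, and whether this form behaves as expected should be checked against your own Solr version.

```python
from urllib.parse import parse_qs, urlencode


def value_or_missing_params(field, value):
    """Return URL-encoded params matching docs where `field` equals `value`
    or has no value at all. Hypothetical helper for illustration only."""
    # (*:* -field:[* TO *]) = all docs minus those with any value in the field.
    fq = f"{field}:{value} OR (*:* -{field}:[* TO *])"
    return urlencode({"q": "*:*", "fq": fq})


params = value_or_missing_params("field", "yes")
print(params)
print(parse_qs(params)["fq"][0])
```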

Re: Efficient title sorting on large result sets.

2011-11-25 Thread Andrew Ingram
On 21 Nov 2011, at 23:17, Chris Hostetter wrote: > > : The way that I've solved this in the past is to make a field > : specifically for sorting and then truncate the string to a small number > : of characters and sort on that. You have to accept that in some cases > > Something to consider is

Re: Clustering and FieldType

2011-11-25 Thread Stanislaw Osinski
Hi, You're right -- currently Carrot2 clustering ignores the Solr analysis chain and uses its own pipeline. It is possible to integrate with Solr's analysis components to some extent, see the discussion here: https://issues.apache.org/jira/browse/SOLR-2917. Staszek > > Hi > > Trying to use carr

Re: remove answers with identical scores

2011-11-25 Thread Ted Dunning
You can do that pretty easily by just retrieving extra documents and post processing the results list. You are likely to have a significant number of apparent duplicates this way. To really get rid of duplicates in results, it might be better to remove them from the corpus by deploying something
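The post-processing idea (retrieve extra documents, then filter the result list) can be sketched in a few lines of Python. This is only an illustration under assumptions: results are taken to be a list of dicts with a "score" key, already sorted by score, and the function name drop_identical_scores is made up for the example.

```python
def drop_identical_scores(results, wanted):
    """Keep the first document for each distinct score, up to `wanted` docs.

    Hypothetical helper: `results` should be an over-fetched, score-sorted list.
    """
    kept = []
    seen_scores = set()
    for doc in results:
        if doc["score"] in seen_scores:
            continue  # same score as an earlier hit: treat as an apparent duplicate
        seen_scores.add(doc["score"])
        kept.append(doc)
        if len(kept) == wanted:
            break
    return kept


hits = [
    {"id": "a", "score": 2.5},
    {"id": "b", "score": 2.5},
    {"id": "c", "score": 1.75},
    {"id": "d", "score": 1.75},
    {"id": "e", "score": 0.5},
]
print([doc["id"] for doc in drop_identical_scores(hits, 3)])
```

Identical scores do not guarantee identical documents, so this removes apparent duplicates only; true deduplication is better handled at index time, as suggested elsewhere in the thread.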

Re: Huge Performance: Solr distributed search

2011-11-25 Thread Dmitry Kan
45 000 000 per shard approx, Tomcat, caching was tweaked in solrconfig and shard given 12GB of RAM max. <filterCache class="solr.FastLRUCache" size="1200" initialSize="1200" autowarmCount="128"/> true 50 200 In your case I would first check if the network throu