Issue in indexing Zip file content with apache-solr-3.3.0

2011-08-23 Thread Jagdish Kumar

Hi All
 
I am using apache-solr-3.3.0 with apache-solr-cell-3.3.0.jar. I am able to 
index the zip files, but I get no results if I search for content present in 
a zip file. Please suggest a possible solution.
 
Thanks and regards
Jagdish   

Re: Sorting results by Range

2011-08-23 Thread Sowmya V.B.
Hi Chris

Thanks a lot for the mail.

I did not quite understand how that function was made. But, it does work
like you said - there is a sorted list of documents now, where documents
around value 20 are ranked first and documents around 10 are ranked below.
(I chose a field with 0 and 100 as limits and tried with that. So, replaced
infinities with 0 and 100 respectively)

sort=map(map(myNumField,-Infinity,10,0),20,Infinity,0) desc, score desc

If I needed the results sorted in ascending order instead, with results around
the value 10 ranked above those around 20, what should I do in this case?

I tried giving,
sort=map(map(myNumField,-Infinity,10,0),20,Infinity,0) *asc*, score desc
But, that does not seem to work quite as I expected.

S.


On Mon, Aug 22, 2011 at 9:48 PM, Chris Hostetter
hossman_luc...@fucit.org wrote:


 : 1) The user gives a query, and also has an option to choose the from
 : and to values for a specific field.
 : (For Eg: Give me all documents that match the query Solr Users, but with
 : those that were last updated between 10th and 20th of August ranked on
 : top)
 :
 : -Over here, I am currently using a BoostQuery functionality, to do this.
 : However, along with this, I want to provide an additional option of
 : sorting these boosted results based on that range chosen above.

 This should be doable using sort by function, but obviously you'd have to
 decide which end of the range should score higher.

 the key would be to:
  * use a primary and a secondary sort
  * secondary sort is simple score desc
  * primary sort is on a function over the field whose range you care about
  * primary sort function needs to map all values out of the range to a
 constant value so secondary sort applies.

 I haven't tested this out, but i think the map function should make this
 relatively straightforward...

 sort=map(map(myNumField,-Infinity,10,0),20,Infinity,0) desc, score desc

 Assuming -Infinity and Infinity are actually legal values in functions
 (if they aren't you'd need to pick some upper/lower limits) that should
 sort any doc where myNumField is between 10 and 20 first, with docs
 matching 20 sorting at the top above docs matching 19, 18, ... 10 and
 then after those docs all remaining matching docs will sort by score.
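
[Archive note: the collapsing effect of the doubled map can be sketched outside
Solr. This is an illustration of the sort logic only — field values, scores, and
ids below are made up, and this is Python, not Solr code:]

```python
import math

def solr_map(value, lo, hi, target):
    # Solr's map(field, min, max, target): values inside [min, max]
    # become target; everything else passes through unchanged.
    return target if lo <= value <= hi else value

def sort_key(doc):
    # map(map(f,-Infinity,10,0),20,Infinity,0): everything outside (10, 20)
    # collapses to the constant 0, so the secondary sort (score) decides there.
    mapped = solr_map(solr_map(doc["f"], -math.inf, 10, 0), 20, math.inf, 0)
    return (-mapped, -doc["score"])  # primary desc, then score desc

docs = [
    {"id": "a", "f": 25, "score": 0.9},  # above the range -> collapses to 0
    {"id": "b", "f": 19, "score": 0.1},  # in range
    {"id": "c", "f": 12, "score": 0.5},  # in range
    {"id": "d", "f": 3,  "score": 0.7},  # below the range -> collapses to 0
]
ranked = [d["id"] for d in sorted(docs, key=sort_key)]
print(ranked)
```

In-range docs come first (higher values on top), and the out-of-range docs fall
back to ordering by score, exactly as described above.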


 -Hoss




-- 
Sowmya V.B.

Losing optimism is blasphemy!
http://vbsowmya.wordpress.com



what's the status of droids project(http://incubator.apache.org/droids/)?

2011-08-23 Thread Li Li
hi all
I am interested in a vertical crawler, but it seems this project is not
very active. Its last update was 11/16/2009.


Re: Boost or BQ?

2011-08-23 Thread Markus Jelsma
iirc boost gets multiplied into the equation whereas bq is added. Check your 
debug output.

 What is the difference between boost= and bq= ?
 
 I cannot find any documentation…


Re: can i create filters of score range

2011-08-23 Thread jame vaalet
okay, so this is something i was looking for .. the default order of result
docs in lucene\solr ..
and you are right, since i don't care about the order in which i get the docs,
ideally i shouldn't ask solr to do any sorting on its raw result list ...
though i understand your point, how do i do it as a solr client? by default,
if i am not mentioning the sort parameter in the query URL to solr, solr will
try to sort with respect to the score it calculated .. how do i prevent even
this sorting .. do we have any setting as such in solr for this?


On 23 August 2011 03:29, Chris Hostetter hossman_luc...@fucit.org wrote:


 : before going into lucene doc id, i have got a creationDate datetime field
 : in my index which i can use as page definition using a filter query..
 : i have learned exposing the lucene docid won't be a clever idea, as it's
 : again relative to the index instance.. whereas my index date field will be
 : unique.. and i can definitely create ranges with that..

 i think you misunderstood me: i'm *not* suggesting you do any filtering
 on the internal lucene doc id.  I am suggesting that you forget all about
 trying to filter to work around the issues with deep paging, and simply
 *sort* on _docid_ asc, which should make all inherent issues with deep
 paging go away (as far as i know).  At no point will the internal lucene
 doc ids be exposed to your client code, it's just an instruction to
 Solr/Lucene that it doesn't really need to do any sorting, it can just
 return the Nth-Mth docs as collected.
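
 [Archive note: concretely, the suggestion amounts to replacing the default
 relevance sort with the _docid_ pseudo-field in the request, along these
 lines (parameter values are illustrative only):]

 ```
 q=*:*&start=500000&rows=100&sort=_docid_ asc
 ```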

 : i have got one more doubt .. if i use a filter query each time, will it
 : result in memory problems like those we see in deep paging issues..

 it could, i'm not sure. that's why i said...

 :  I'm not sure if this would really gain you much though -- yes this
 :  would work around some of the memory issues inherent in deep paging
 :  but it would still require a lot of rescoring of documents again and
 :  again.


 -Hoss




-- 

-JAME


Re: Issue in indexing Zip file content with apache-solr-3.3.0

2011-08-23 Thread Jayendra Patil
Solr doesn't index the content of the files, but just the file names.

You can apply these patches:
https://issues.apache.org/jira/browse/SOLR-2416
https://issues.apache.org/jira/browse/SOLR-2332

Regards,
Jayendra

On Tue, Aug 23, 2011 at 2:26 AM, Jagdish Kumar
jagdish.thapar...@hotmail.com wrote:

 Hi All

 I am using apache-solr-3.3.0 with apache-solr-cell-3.3.0.jar. I am able
 to index the zip files, but I get no results if I search for content
 present in a zip file. Please suggest a possible solution.

 Thanks and regards
 Jagdish


Re: what's the status of droids project(http://incubator.apache.org/droids/)?

2011-08-23 Thread Markus Jelsma
You should ask on the Droids list but there's some activity in Jira. And did 
you consider Apache Nutch?

On Tuesday 23 August 2011 10:17:50 Li Li wrote:
 hi all
 I am interested in a vertical crawler, but it seems this project is not
 very active. Its last update was 11/16/2009.


How to copy and extract information from a multi-line text before the tokenizer

2011-08-23 Thread Michael Kliewe
Hello all,

I have a custom schema with a few fields, and I would like to create a new 
field in the schema that has only one particular line of another field indexed. 
Let's use this example:

field AllData (TextField) has for example this data:
Title: exampleTitle of the book
Author: Example Author
Date: 01.01.1980

Each line is separated by a line break.
I now need a new field named OnlyAuthor which only has the Author information 
in it, so I can search and facet for specific Author information. I added this 
to my schema:

<fieldType name="authorField" class="solr.TextField">
  <analyzer type="index">
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="^.*\nAuthor: (.*?)\n.*$" replacement="$1" replace="all"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="^.*\nAuthor: (.*?)\n.*$" replacement="$1" replace="all"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
  </analyzer>
</fieldType>

<field name="OnlyAuthor" type="authorField" indexed="true" stored="true"/>

<copyField source="AllData" dest="OnlyAuthor"/>


But this is not working: the new OnlyAuthor field contains all the data, because 
the regex didn't match. But I need "Example Author" in that field (I think) to 
be able to search and facet only on author information.
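
[Archive note: two likely suspects worth checking, neither confirmed in the
thread. First, char filters only affect what is *indexed* — the *stored* value
of OnlyAuthor will always show the full copied text, so inspect the indexed
terms (e.g. on the analysis page) rather than the stored output. Second, '.'
does not cross newlines by default (in Java regexes as in Python), which makes
the pattern brittle. A quick sketch, using the sample text from the mail plus
one invented extra line:]

```python
import re

PATTERN = r"^.*\nAuthor: (.*?)\n.*$"
three_lines = ("Title: exampleTitle of the book\n"
               "Author: Example Author\n"
               "Date: 01.01.1980")
four_lines = three_lines + "\nISBN: 1234567890"  # any extra trailing line

# On the exact three-line sample the pattern happens to match, because the
# final '.*$' only has a single line left to consume...
extracted = re.sub(PATTERN, r"\1", three_lines)

# ...but with one more line the match fails outright ('.' will not cross
# '\n' without DOTALL), so the input comes through completely unchanged --
# the symptom described above.
unchanged = re.sub(PATTERN, r"\1", four_lines)

# A dot-matches-newline pattern (in Java regexes: prefix with (?s)) survives.
robust = re.sub(PATTERN, r"\1", four_lines, flags=re.DOTALL)
```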

I don't know where the problem is; perhaps one of you can give me a hint, 
or a totally different method to achieve my goal of extracting a single line 
from this multi-line text.

Kind regards and thanks for any help
Michael




RE: what's the status of droids project(http://incubator.apache.org/droids/)?

2011-08-23 Thread karl.wright
It's also worth looking at ManifoldCF.

Karl

-----Original Message-----
From: ext Markus Jelsma
Sent:  23/08/2011, 6:24  AM
To: solr-user@lucene.apache.org
Cc: java-u...@lucene.apache.org
Subject: Re: what's the status of droids 
project(http://incubator.apache.org/droids/)?

You should ask on the Droids list but there's some activity in Jira. And did
you consider Apache Nutch?

On Tuesday 23 August 2011 10:17:50 Li Li wrote:
 hi all
 I am interested in a vertical crawler, but it seems this project is not
 very active. Its last update was 11/16/2009.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: How to copy and extract information from a multi-line text before the tokenizer

2011-08-23 Thread Chantal Ackermann

Hi Michael,

have you considered the DataImportHandler?
You could use the the LineEntityProcessor to create fields per line and
then copyField to collect everything for the AllData field.

http://wiki.apache.org/solr/DataImportHandler#LineEntityProcessor
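
[Archive note: a sketch of the shape such a DIH config could take. The file
name, field names, and regex are invented for illustration, and note that
LineEntityProcessor emits one row per line (in a column called rawLine), so
grouping the lines back into one document per book still needs handling:]

```
<dataConfig>
  <dataSource type="FileDataSource" encoding="UTF-8"/>
  <document>
    <entity name="lines" processor="LineEntityProcessor"
            url="books.txt" rootEntity="true"
            transformer="RegexTransformer">
      <field column="rawLine" name="AllData"/>
      <!-- pull the author line into its own field -->
      <field column="OnlyAuthor" regex="^Author: (.*)$" sourceColName="rawLine"/>
    </entity>
  </document>
</dataConfig>
```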

Chantal



On Tue, 2011-08-23 at 12:28 +0200, Michael Kliewe wrote:
 Hello all,
 
 I have a custom schema which has a few fields, and I would like to create a 
 new field in the schema that only has one special line of another field 
 indexed. Lets use this example:
 
 field AllData (TextField) has for example this data:
 Title: exampleTitle of the book
 Author: Example Author
 Date: 01.01.1980
 
 Each line is separated by a line break.
 I now need a new field named OnlyAuthor which only has the Author information 
 in it, so I can search and facet for specific Author information. I added 
 this to my schema:
 
 <fieldType name="authorField" class="solr.TextField">
   <analyzer type="index">
     <charFilter class="solr.PatternReplaceCharFilterFactory"
                 pattern="^.*\nAuthor: (.*?)\n.*$" replacement="$1" replace="all"/>
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.TrimFilterFactory"/>
   </analyzer>
   <analyzer type="query">
     <charFilter class="solr.PatternReplaceCharFilterFactory"
                 pattern="^.*\nAuthor: (.*?)\n.*$" replacement="$1" replace="all"/>
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.TrimFilterFactory"/>
   </analyzer>
 </fieldType>
 
 <field name="OnlyAuthor" type="authorField" indexed="true" stored="true"/>
 
 <copyField source="AllData" dest="OnlyAuthor"/>
 
 
 But this is not working, the new AuthorOnly field contains all data, because 
 the regex didn't match. But I need Example Author in that field (I think) 
 to be able to search and facet only author information.
 
 I don't know where the problem is, perhaps someone of you can give me a hint, 
 or a totally different method to achieve my goal to extract a single line 
 from this multi-line-text.
 
 Kind regards and thanks for any help
 Michael
 
 



Re: SSD experience

2011-08-23 Thread Peter Sturge
Just to add a few cents worth regarding SSD...

We use Vertex SSD drives for storing indexes, and wow, they really
scream compared to SATA/SAS/SAN. As we do some heavy commits, it's the
commit times where we see the biggest performance boost.
In tests, we found that locally attached 15k SAS drives are the next
best for performance. SANs can work well, but should be FibreChannel.
IP-based SANs are ok, as long as they're not heavily taxed by other,
non-Solr disk I/O.
NAS is far and away the poorest performing - not recommended for real indexes.

HTH,
Peter



On Mon, Aug 22, 2011 at 3:54 PM, Rich Cariens richcari...@gmail.com wrote:
 Ahoy ahoy!

 Does anyone have any experiences or stories they can share with the list
 about how SSDs impacted search performance for better or worse?

 I found a Lucene SSD performance benchmark doc:
 http://wiki.apache.org/lucene-java/SSD_performance?action=AttachFile&do=view&target=combined-disk-ssd.pdf
 but the wiki engine is refusing to let me view the attachment (I get "You
 are not allowed to do AttachFile on this page.").

 Thanks in advance!



Query parameter changes from solr 1.4 to 3.3

2011-08-23 Thread Samarendra Pratap
Hi,
 We are upgrading solr 1.4 (with collapsing patch SOLR-236) to solr 3.3. I
was looking for the required changes in query parameters (or parameter
names), if any.
 One thing I know for sure is that collapse and its sub-options are now
known as group, but I didn't find anything else.

 Can someone point me to some document or webpage for this?
 Or if there aren't any other changes can someone confirm that?

-- 
Regards,
Samar


RE: what's the status of droids project(http://incubator.apache.org/droids/)?

2011-08-23 Thread O. Klein
Or check http://www.crawl-anywhere.com/

Very customizable crawler.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/what-s-the-status-of-droids-project-http-incubator-apache-org-droids-tp3277367p3277698.html
Sent from the Solr - User mailing list archive at Nabble.com.


Funky date string accepted

2011-08-23 Thread Markus Jelsma
Hi,

The following field value for a date field type is accepted:
<field name="somedate">-0001-11-30T00:00:00Z</field>

and ends up in the index and as stored value as:
<date name="somedate">2-11-30T00:00:00Z</date>

I'd prefer to be punished with an exception. File a bug?

Thanks


Re: Full sentence spellcheck

2011-08-23 Thread Valentin
I tried your solution and it works. But it modifies all the spellcheckers that I
made, so that's not a good solution for me (I have an autocomplete and a
regular spellchecker with separated words that I want to keep).

I tried to move the line <queryConverter name="queryConverter"
class="com.myPackage.SpellingQueryConverter"/> *into* the requestHandler,
but of course it does not work.

Why can't I just use this evil spellcheck.q ? -_-

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Full-sentence-spellcheck-tp3265257p3277847.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: SSD experience

2011-08-23 Thread Gerard Roos
Interesting. Do you make a symlink to the indexes or is the whole Solr 
directory on SSD?

thanks,
Gerard

On 23 Aug 2011, at 12:53, Peter Sturge wrote:

 Just to add a few cents worth regarding SSD...
 
 We use Vertex SSD drives for storing indexes, and wow, they really
 scream compared to SATA/SAS/SAN. As we do some heavy commits, it's the
 commit times where we see the biggest performance boost.
 In tests, we found that locally attached 15k SAS drives are the next
 best for performance. SANs can work well, but should be FibreChannel.
 IP-based SANs are ok, as long they're not heavily taxed by other,
 non-Solr disk I/O.
 NAS is far and away the poorest performing - not recommended for real indexes.
 
 HTH,
 Peter
 
 
 
 On Mon, Aug 22, 2011 at 3:54 PM, Rich Cariens richcari...@gmail.com wrote:
 Ahoy ahoy!
 
 Does anyone have any experiences or stories they can share with the list
 about how SSDs impacted search performance for better or worse?
 
 I found a Lucene SSD performance benchmark doc:
 http://wiki.apache.org/lucene-java/SSD_performance?action=AttachFile&do=view&target=combined-disk-ssd.pdf
 but the wiki engine is refusing to let me view the attachment (I get "You
 are not allowed to do AttachFile on this page.").
 
 Thanks in advance!
 
 
 



Spatial Search problems

2011-08-23 Thread Javier Heras
Hi all,

I'm new to Solr. I've downloaded Solr 3.3 and tested queries for
spatial search with the examples that come in the tutorial. Everything is OK. But
when I substitute the tutorial index with my index, spatial search doesn't
work until parameter d is greater than 4510 (km?).

Any idea what's going on?

Thanks

Javier

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Spatial-Search-problems-tp3277945p3277945.html
Sent from the Solr - User mailing list archive at Nabble.com.


Spellcheck index replication

2011-08-23 Thread Herman Kiefus
We employ one 'indexing' master that replicates to many 'query' slaves.  We 
have also recently introduced spellchecking/DYM.  It appears that replication 
does not 'cover' the spellchecker index.  Do I understand this correctly?

Further, we have seen where 'buildOnCommit' will cause the spellcheck index to 
be [re]built on each slave; however, during the time that the spellcheck index 
is being rebuilt, spellcheck queries do not produce suggestions, which makes 
sense.

What suggestions does the community have regarding this issue, and/or what is 
working well for you?


Re: SSD experience

2011-08-23 Thread Peter Sturge
The Solr index directory lives directly on the SSD (running on Windows
- where the word symlink does not appear in any dictionary within a
100 mile radius of Redmond :-)

Currently, the main limiting factors of SSD are cost and size. SSDs
will get larger over time. Splitting indexes across multiple shards on
multiple SSDs is a wonderfully fast, if not slightly extravagant
method of getting excellent IO performance.
Regarding cost, I've seen many organizations where the use of fast
SANs costs at least the same if not more per GB of storage than SSD.
Hybrid drives can be a good cost-effective alternative as well.

Peter



On Tue, Aug 23, 2011 at 3:29 PM, Gerard Roos l...@gerardroos.nl wrote:
 Interesting. Do you make a symlink to the indexes or is the whole Solr 
 directory on SSD?

 thanks,
 Gerard

 On 23 Aug 2011, at 12:53, Peter Sturge wrote:

 Just to add a few cents worth regarding SSD...

 We use Vertex SSD drives for storing indexes, and wow, they really
 scream compared to SATA/SAS/SAN. As we do some heavy commits, it's the
 commit times where we see the biggest performance boost.
 In tests, we found that locally attached 15k SAS drives are the next
 best for performance. SANs can work well, but should be FibreChannel.
 IP-based SANs are ok, as long they're not heavily taxed by other,
 non-Solr disk I/O.
 NAS is far and away the poorest performing - not recommended for real 
 indexes.

 HTH,
 Peter



 On Mon, Aug 22, 2011 at 3:54 PM, Rich Cariens richcari...@gmail.com wrote:
 Ahoy ahoy!

 Does anyone have any experiences or stories they can share with the list
 about how SSDs impacted search performance for better or worse?

 I found a Lucene SSD performance benchmark doc:
 http://wiki.apache.org/lucene-java/SSD_performance?action=AttachFile&do=view&target=combined-disk-ssd.pdf
 but the wiki engine is refusing to let me view the attachment (I get "You
 are not allowed to do AttachFile on this page.").

 Thanks in advance!







RE: Spellcheck Phrases

2011-08-23 Thread Herman Kiefus
The angle that I am trying here is to create a dictionary from indexed terms 
that contain only correctly spelled words.  We are doing this by having the 
field from which the dictionary is created use a type that employs 
solr.KeepWordFilterFactory, which in turn uses a text file of known 
correctly spelled words (including their respective derivations, for example: 
lead, leads, leading, etc.).

This is working great for us with the exception being those fields in our 
schema that contain proper names.  I can't seem to get (unfiltered) terms from 
those fields along with (correctly spelled) terms from other fields into the 
single field upon which the dictionary is built.
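
[Archive note: for readers following along, the kind of analyzer chain being
described looks roughly like this — the type name and words file are invented
for illustration:]

```
<fieldType name="text_spell_strict" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- only terms listed in knownwords.txt survive into the dictionary -->
    <filter class="solr.KeepWordFilterFactory" words="knownwords.txt"
            ignoreCase="true"/>
  </analyzer>
</fieldType>
```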

-----Original Message-----
From: Dyer, James [mailto:james.d...@ingrambook.com] 
Sent: Thursday, June 02, 2011 11:40 AM
To: solr-user@lucene.apache.org
Subject: RE: Spellcheck Phrases

Actually, someone just pointed out to me that a patch like this is unnecessary. 
 The code works as-is if configured like this:

<float name="thresholdTokenFrequency">.01</float>  (correct)

instead of this:

<str name="thresholdTokenFrequency">.01</str> (incorrect)

I tested this and it seems to work.  I'm still trying to figure out whether using 
this parameter actually improves the quality of our spell suggestions, now that 
I know how to use it properly.

Sorry about the mis-information earlier.

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311


-----Original Message-----
From: Dyer, James
Sent: Wednesday, June 01, 2011 3:02 PM
To: solr-user@lucene.apache.org
Subject: RE: Spellcheck Phrases

Tanner,

I just entered SOLR-2571 to fix the float-parsing bug that breaks 
thresholdTokenFrequency.  It's just a one-line code fix, so I also included a 
patch that should apply cleanly to Solr 3.1.  See 
https://issues.apache.org/jira/browse/SOLR-2571 for info and patches.

This parameter appears absent from the wiki.  And as it has always been broken 
for me, I haven't tested it.  However, my understanding is that it should be set 
as the minimum percentage of documents in which a term has to occur in order for 
it to appear in the spelling dictionary.  For instance, in the config below, a 
term would have to occur in at least 1% of the documents for it to be part of 
the spelling dictionary.  This might be a good setting for long fields, but for 
the short fields in my application I was thinking of setting this to something 
like 1/1000 of 1% ...

<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
 <str name="queryAnalyzerFieldType">text</str>
 <lst name="spellchecker">
  <str name="name">spellchecker</str>
  <str name="field">Spelling_Dictionary</str>
  <str name="fieldType">text</str>
  <str name="spellcheckIndexDir">./spellchecker</str>
  <str name="thresholdTokenFrequency">.01</str>
 </lst>
</searchComponent>
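
[Archive note: to make the ratio concrete — thresholdTokenFrequency is a
fraction of documents, not an absolute count, so the two settings discussed
in this thread translate to very different minimum document counts. Index
size below is invented for illustration:]

```python
num_docs = 10_000_000  # hypothetical index size

# thresholdTokenFrequency is a ratio of documents: a term must occur in at
# least ratio * num_docs documents to enter the spelling dictionary.
for ratio in (0.01, 0.00001):  # 1% vs 1/1000 of 1%
    print(f"{ratio}: at least {int(ratio * num_docs)} docs")
```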

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311


-----Original Message-----
From: Tanner Postert [mailto:tanner.post...@gmail.com]
Sent: Friday, May 27, 2011 6:04 PM
To: solr-user@lucene.apache.org
Subject: Re: Spellcheck Phrases

are there any updates on this? any third party apps that can make this work as 
expected?

On Wed, Feb 23, 2011 at 12:38 PM, Dyer, James james.d...@ingrambook.comwrote:

 Tanner,

 Currently Solr will only make suggestions for words that are not in 
 the dictionary, unless you specify spellcheck.onlyMorePopular=true.  
 However, if you do that, then it will try to improve every word in 
 your query, even the ones that are spelled correctly (so while it 
 might change "brake" to "break" it might also change "leg" to "log").

 You might be able to alleviate some of the pain by setting the 
 thresholdTokenFrequency so as to remove misspelled and rarely-used 
 words from your dictionary, although I personally haven't been able to 
 get this parameter to work.  It also doesn't seem to be documented on 
 the wiki, but it is in the 1.4.1 source code, in class 
 IndexBasedSpellChecker.  It's also mentioned in Smiley & Pugh's book.  I 
 tried setting it like this, but got a ClassCastException on the float value:

 <searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <str name="queryAnalyzerFieldType">text_spelling</str>
  <lst name="spellchecker">
   <str name="name">spellchecker</str>
   <str name="field">Spelling_Dictionary</str>
   <str name="fieldType">text_spelling</str>
   <str name="buildOnOptimize">true</str>
   <str name="thresholdTokenFrequency">.001</str>
  </lst>
 </searchComponent>

 I have it on my to-do list to look into this further but haven't yet.  
 If you decide to try it and can get it to work, please let me know how 
 you do it.

 James Dyer
 E-Commerce Systems
 Ingram Content Group
 (615) 213-4311

-----Original Message-----
 From: Tanner Postert [mailto:tanner.post...@gmail.com]
 Sent: Wednesday, February 23, 2011 12:53 PM
 To: solr-user@lucene.apache.org
 Subject: Spellcheck Phrases

 right now when I search for 'brake a leg', solr returns valid results 
 with no indication of misspelling, which is understandable since all 
 of those terms are valid words and are 

RE: HTTP 400 Error

2011-08-23 Thread Lawson, Chris
I am trying to submit a search (Cntrct:1310015) on both the Prod and Model
systems, and after submitting with the Search button, the result is a page
displaying HTTP 400.
 
Thanks,
 
Chris Lawson
chris.law...@lfg.com
(336) 691-3733
 




Notice of Confidentiality: **This E-mail and any of its attachments may contain 
Lincoln National Corporation proprietary information, which is privileged, 
confidential,
or subject to copyright belonging to the Lincoln National Corporation family of 
companies. This E-mail is intended solely for the use of the individual or 
entity to 
which it is addressed. If you are not the intended recipient of this E-mail, 
you are 
hereby notified that any dissemination, distribution, copying, or action taken 
in 
relation to the contents of and attachments to this E-mail is strictly 
prohibited 
and may be unlawful. If you have received this E-mail in error, please notify 
the 
sender immediately and permanently delete the original and any copy of this 
E-mail 
and any printout. Thank You.**


Re: HTTP 400 Error

2011-08-23 Thread Gora Mohanty
On Tue, Aug 23, 2011 at 6:30 PM, Lawson, Chris chris.law...@lfg.com wrote:
 I am trying to submit a search (Cntrct:1310015) on both Prod and Model
 system and after submitting with Search button, the result is a page
 displaying HTTP 400.
[...]

Please show us the actual URL used to query Solr: at first
guess, you are not properly escaping the HTML. Have you
tried the same search from the Solr admin panel?

Regards,
Gora


Re: Funky date string accepted

2011-08-23 Thread Chris Hostetter

: The following field value for a date field type is accepted:
: <field name="somedate">-0001-11-30T00:00:00Z</field>
: 
: and ends up in the index and as stored value as:
: <date name="somedate">2-11-30T00:00:00Z</date>
: 
: I'd prefer to be punished with an exception. File a bug?

That is actually a legal date according to the format spec (although there 
seem to be some conflicting guidelines about whether the format allows 
a year 0, which makes the interpretation of negative years ambiguous, at 
least to me).

There is, however, already a known bug in Solr with parsing/formatting dates 
prior to year 1000...

https://issues.apache.org/jira/browse/SOLR-1899

...patches most certainly welcome.


-Hoss


Re: Funky date string accepted

2011-08-23 Thread Markus Jelsma
I see, is the leading - char just ignored then?

 : The following field value for a date field type is accepted:
 : <field name="somedate">-0001-11-30T00:00:00Z</field>
 : 
 : and ends up in the index and as stored value as:
 : <date name="somedate">2-11-30T00:00:00Z</date>
 : 
 : I'd prefer to be punished with an exception. File a bug?
 
 That is actually a legal date according to the format spec (although there
 seem to be some conflicting guidelines about whether the format allows
 a year 0, which makes the interpretation of negative years ambiguous, at
 least to me)
 
 There is however already a known bug in SOlr with parsing/formatting dates
 prior to year 1000...
 
 https://issues.apache.org/jira/browse/SOLR-1899
 
 ...patches most certainly welcome.
 
 
 -Hoss


Re: SSD experience

2011-08-23 Thread Sanne Grinovero
Indeed I would never actually use it, but symlinks do exist on Windows.

http://en.wikipedia.org/wiki/NTFS_symbolic_link

Sanne

2011/8/23 Peter Sturge peter.stu...@gmail.com:
 The Solr index directory lives directly on the SSD (running on Windows
 - where the word symlink does not appear in any dictionary within a
 100 mile radius of Redmond :-)

 Currently, the main limiting factors of SSD are cost and size. SSDs
 will get larger over time. Splitting indexes across multiple shards on
 multiple SSDs is a wonderfully fast, if not slightly extravagant
 method of getting excellent IO performance.
 Regarding cost, I've seen many organizations where the use of fast
 SANs costs at least the same if not more per GB of storage than SSD.
 Hybrid drives can be a good cost-effective alternative as well.

 Peter



 On Tue, Aug 23, 2011 at 3:29 PM, Gerard Roos l...@gerardroos.nl wrote:
 Interesting. Do you make a symlink to the indexes or is the whole Solr 
 directory on SSD?

 thanks,
 Gerard

 On 23 Aug 2011, at 12:53, Peter Sturge wrote:

 Just to add a few cents worth regarding SSD...

 We use Vertex SSD drives for storing indexes, and wow, they really
 scream compared to SATA/SAS/SAN. As we do some heavy commits, it's the
 commit times where we see the biggest performance boost.
 In tests, we found that locally attached 15k SAS drives are the next
 best for performance. SANs can work well, but should be FibreChannel.
 IP-based SANs are ok, as long they're not heavily taxed by other,
 non-Solr disk I/O.
 NAS is far and away the poorest performing - not recommended for real 
 indexes.

 HTH,
 Peter



 On Mon, Aug 22, 2011 at 3:54 PM, Rich Cariens richcari...@gmail.com wrote:
 Ahoy ahoy!

 Does anyone have any experiences or stories they can share with the list
 about how SSDs impacted search performance for better or worse?

 I found a Lucene SSD performance benchmark doc:
 http://wiki.apache.org/lucene-java/SSD_performance?action=AttachFile&do=view&target=combined-disk-ssd.pdf
 but the wiki engine is refusing to let me view the attachment (I get "You
 are not allowed to do AttachFile on this page.").

 Thanks in advance!








Solr indexing process: keep a persistent Mysql connection through all the indexing process

2011-08-23 Thread samuele.mattiuzzo
I wrote a custom update handler for my Solr installation, using JDBC to
query a MySQL database. Everything works fine: the updater queries the db,
gets the data I need and updates it in my documents! Fantastic!

The only issue is I have to open and close a MySQL connection for every document
I read. Since we have something like 10kk indexed documents, I was thinking
about opening a MySQL connection at the very beginning of the indexing
process, keeping it stored somewhere and using it inside my custom update
handler. When the whole indexing process is complete, the connection would
be closed.

So far, is it possible?

Thanks all in advance!
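
[Archive note: the open-once/reuse/close-once pattern being asked about, as a
standalone sketch. sqlite3 stands in for MySQL/JDBC here; in a real Solr update
handler the open and close would live in the component's init and shutdown
hooks, or the connection would be borrowed from a pool such as DBCP — details
vary by version:]

```python
import sqlite3

class EnrichingUpdater:
    """Hold one DB connection for the whole indexing run."""

    def __init__(self, db_path):
        # Opened once, before the first document is processed.
        self.conn = sqlite3.connect(db_path)

    def enrich(self, doc_id):
        # Reuses the single connection for every document.
        row = self.conn.execute(
            "SELECT extra FROM meta WHERE id = ?", (doc_id,)).fetchone()
        return {"id": doc_id, "extra": row[0] if row else None}

    def close(self):
        # Closed once, after the last document.
        self.conn.close()

# Demo with an in-memory database standing in for MySQL.
updater = EnrichingUpdater(":memory:")
updater.conn.execute("CREATE TABLE meta (id INTEGER PRIMARY KEY, extra TEXT)")
updater.conn.execute("INSERT INTO meta VALUES (1, 'hello')")
docs = [updater.enrich(i) for i in (1, 2)]
updater.close()
```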

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-indexing-process-keep-a-persistent-Mysql-connection-throu-all-the-indexing-process-tp3278608p3278608.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Spatial Search problems

2011-08-23 Thread Smiley, David W.
Could you reproduce a very simple example of this? For example, if there is a 
particular indexed point in your data that should be returned from your query 
(a query smaller than d=4510), then reproduce that bug in the Solr example app 
by supplying a dummy document with this point and running your query.  Also, be 
sure you are using the correct field type (LatLonType).

~ David Smiley

On Aug 23, 2011, at 9:12 AM, Javier Heras wrote:

 Hi all,
 
 I'm new at solr. I've downloaded solr 3.3, and having tested solr querys for
 spatial search with examples that come in the tutorial. Everything ok. But
 when I substitute the tutorial index with my index, spatial search doesn't
 work until parameter d is greater than 4510 (km?)
 
 Any idea what's going on?
 
 Thanks
 
 Javier
 
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Spatial-Search-problems-tp3277945p3277945.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: SSD experience

2011-08-23 Thread Peter Sturge
Ah yes, the beautiful new links in Windows 6. These are 'symlinks' in
name only - they operate *very* differently from Linux symlinks, and
sadly, not quite so well. NTFS is one of the best things about
Windows, but its architecture is not well suited to 'on-the-fly'
redirection, as there are many items 'in the chain' to cater for at
various points - e.g. driver stack, sid context, SACL/DACLs, DFS,
auditing etc. This makes links on NTFS much more difficult to manage,
and it is common to encounter all manner of strange behaviour when
using them.


On Tue, Aug 23, 2011 at 5:34 PM, Sanne Grinovero
sanne.grinov...@gmail.com wrote:
 Indeed I would never actually use it, but symlinks do exist on Windows.

 http://en.wikipedia.org/wiki/NTFS_symbolic_link

 Sanne

 2011/8/23 Peter Sturge peter.stu...@gmail.com:
 The Solr index directory lives directly on the SSD (running on Windows
 - where the word symlink does not appear in any dictionary within a
 100 mile radius of Redmond :-)

 Currently, the main limiting factors of SSD are cost and size. SSDs
 will get larger over time. Splitting indexes across multiple shards on
 multiple SSDs is a wonderfully fast, if not slightly extravagant
 method of getting excellent IO performance.
 Regarding cost, I've seen many organizations where the use of fast
 SANs costs at least the same if not more per GB of storage than SSD.
 Hybrid drives can be a good cost-effective alternative as well.

 Peter



 On Tue, Aug 23, 2011 at 3:29 PM, Gerard Roos l...@gerardroos.nl wrote:
 Interesting. Do you make a symlink to the indexes or is the whole Solr 
 directory on SSD?

 thanks,
 Gerard

 Op 23 aug. 2011, om 12:53 heeft Peter Sturge het volgende geschreven:

 Just to add a few cents worth regarding SSD...

 We use Vertex SSD drives for storing indexes, and wow, they really
 scream compared to SATA/SAS/SAN. As we do some heavy commits, it's the
 commit times where we see the biggest performance boost.
 In tests, we found that locally attached 15k SAS drives are the next
 best for performance. SANs can work well, but should be FibreChannel.
 IP-based SANs are ok, as long they're not heavily taxed by other,
 non-Solr disk I/O.
 NAS is far and away the poorest performing - not recommended for real 
 indexes.

 HTH,
 Peter



 On Mon, Aug 22, 2011 at 3:54 PM, Rich Cariens richcari...@gmail.com 
 wrote:
 Ahoy ahoy!

 Does anyone have any experiences or stories they can share with the list
 about how SSDs impacted search performance for better or worse?

 I found a Lucene SSD performance benchmark doc
 http://wiki.apache.org/lucene-java/SSD_performance?action=AttachFile&do=view&target=combined-disk-ssd.pdf
 but the wiki engine is refusing to let me view the attachment (I get "You
 are not allowed to do AttachFile on this page.").

 Thanks in advance!









Re: can i create filters of score range

2011-08-23 Thread Erick Erickson
Did you try exactly what Chris suggested? Appending
"sort=_docid_ asc" to the query? When you say
"client" I assume you're talking SolrJ, and I'm pretty
sure that SolrQuery.setSortField is what you want.

I suppose you could also set this as the default in your
query handler.

Best
Erick

On Tue, Aug 23, 2011 at 4:43 AM, jame vaalet jamevaa...@gmail.com wrote:
 okey, so this is something i was looking for .. the default order of result
 docs in lucene/solr ..
 and you are right, since i don't care about the order in which i get the docs,
 ideally i shouldn't ask solr to do any sorting on its raw result list ...
 though i understand your point, how do i do it as a solr client? by default,
 if i am not mentioning the sort parameter in the query URL to solr, solr will try
 to sort it with respect to the score it calculated .. how do i prevent even
 this sorting? .. do we have any setting as such in solr for this?


 On 23 August 2011 03:29, Chris Hostetter hossman_luc...@fucit.org wrote:


 : before going into lucene doc id, i have got a creationDate datetime field in
 : my index which i can use as page definition using a filter query..
 : i have learned exposing lucene docid won't be a clever idea, as it's again
 : relative to the index instance.. whereas my index date field will be unique
 : ..and i can definitely create ranges with that..

 i think you misunderstood me: i'm *not* suggesting you do any filtering
 on the internal lucene doc id.  I am suggesting that you forget all about
 trying to filter to work around the issues with deep paging, and simply
 *sort* on _docid_ asc, which should make all inherent issues with deep
 paging go away (as far as i know).  At no point will the internal lucene
 doc ids be exposed to your client code, it's just an instruction to
 Solr/Lucene that it doesn't really need to do any sorting, it can just
 return the Nth-Mth docs as collected.

 : i have got one more doubt .. if i use a filter query each time, will it result
 : in memory problems like those we see in deep paging issues..

 it could, i'm not sure. that's why i said...

 :  I'm not sure if this would really gain you much though -- yes this would
 :  work around some of the memory issues inherent in deep paging but it
 :  would still require a lot of rescoring of documents again and again.


 -Hoss




 --

 -JAME



RE: Text Analysis and copyField

2011-08-23 Thread Herman Kiefus
To close, I found this article from Hoss: 
http://lucene.472066.n3.nabble.com/CopyField-into-another-CopyField-td3122408.html

Since I cannot use one copyField directive to copy from another copyField's 
dest[ination], I cannot achieve what I desire: some terms that are subject to 
KeepWordFilterFactory and some that are not.

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Monday, August 22, 2011 1:16 PM
To: solr-user@lucene.apache.org
Subject: Re: Text Analysis and copyField

I suspect that the things going into TermsDictionary are from fields other than 
CorrectlySpelledTerms.

In other words I don't think that anything is getting into TermsDictionary from 
CorrectlySpelledTerms...

Be careful to remove the index between schema changes, just to be sure that 
you're not seeing old data.

Best
Erick

On Mon, Aug 22, 2011 at 11:41 AM, Herman Kiefus herm...@angieslist.com wrote:
 That's what I thought, but my experiments show differently.  In actuality:

 I have a number of fields that are of type text (the default as it is 
 packaged).

 I have a type 'textCorrectlySpelled' that utilizes KeepWordFilterFactory in 
 index-time analysis, using a file of terms which are known to be correctly 
 spelled.

 I have a type 'textDictionary' that has no index-time analysis.

 I have the fields:
 <field name="CorrectlySpelledTerms" type="textCorrectlySpelled"
 indexed="false" stored="false" multiValued="true"/>
 <field name="TermsDictionary" type="textDictionary" indexed="true"
 stored="false" multiValued="true"/>

 I want 'TermsDictionary' to contain only those terms from some fields that 
 are correctly spelled plus those terms from a couple other fields 
 (CompanyName and ContactName) as is.  I use several copyField directives as 
 follows:

 <copyField source="Field1" dest="CorrectlySpelledTerms"/>
 <copyField source="Field2" dest="CorrectlySpelledTerms"/>
 <copyField source="Field3" dest="CorrectlySpelledTerms"/>

 <copyField source="Name" dest="TermsDictionary"/>
 <copyField source="Contact" dest="TermsDictionary"/>
 <copyField source="CorrectlySpelledTerms" dest="TermsDictionary"/>

 If I query 'Field1' for a term that I know is misspelled (electical) it 
 yields results.
 If I query 'TermsDictionary' for the same term it yields no results.

 It would seem by these results that 'TermsDictionary' only contains those 
 terms with misspellings stripped as a results of the text analysis on the 
 field 'CorrectlySpelledTerms'.

 Asked another way, I think you can see what I'm getting at: a source for the 
 spellchecker that only contains correct spelled terms plus proper names; 
 should I have gone about this in a different way?

 -Original Message-
 From: Stephen Duncan Jr [mailto:stephen.dun...@gmail.com]
 Sent: Monday, August 22, 2011 9:30 AM
 To: solr-user@lucene.apache.org
 Subject: Re: Text Analysis and copyField

 On Mon, Aug 22, 2011 at 9:25 AM, Herman Kiefus herm...@angieslist.com wrote:
 Is my thinking correct?

 I have a field 'F1' of type 'T1' whose index time analysis employs the 
 StopFilterFactory.

 I also have a field 'F2' of type 'T2' whose index time analysis does NOT 
 employ the StopFilterFactory.

 There is a copyField directive <copyField source="F1" dest="F2"/>

 F2 will not contain any stop words because they were filtered out as F1 was 
 populated.


 No, F2 will contain stop words.  Copy fields does not process input through a 
 chain, it sends the original content to each field and therefore analysis is 
 totally independent.

 --
 Stephen Duncan Jr
 www.stephenduncanjr.com
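

Stephen's point - that copyField forwards the raw source text, not the analyzed tokens - can be illustrated with a minimal schema sketch (the field and type names here are hypothetical):

```xml
<!-- F1's type strips stopwords at index time; F2's type does not -->
<field name="F1" type="text_with_stopfilter" indexed="true" stored="false"/>
<field name="F2" type="text_plain"           indexed="true" stored="false"/>
<copyField source="F1" dest="F2"/>
<!-- Indexing "the quick fox" puts [quick, fox] in F1's index,
     but F2 receives the original "the quick fox" and indexes all
     three terms, so a search for "the" on F2 will still match. -->
```

This is also why chaining one copyField off another copyField's dest (as Herman attempted) cannot work: the directives all copy from the original input, never from another field's analyzed output.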



Re: Sorting results by Range

2011-08-23 Thread Chris Hostetter

: I did not quite understand how that function was made. But, it does work

basically the map function just translates values in a range to some 
fixed value.  so if you nest two map functions (that use 
different ranges) inside of each other you get a resulting curve that is 
flat in those two ranges (below 10 and above 20) and returns the actual 
field value in the middle.

: (I chose a field with 0 and 100 as limits and tried with that. So, replaced
: infinities with 0 and 100 respectively)
: 
: sort=map(map(myNumField,-Infinity,10,0),20,Infinity,0) desc, score desc
: 
: If I needed Sorted results in ascending order, Results around the value 10
: ranked above those of 20, what should I do in this case?
: 
: I tried giving,
: sort=map(map(myNumField,-Infinity,10,0),20,Infinity,0) *asc*, score desc
: But, that does not seem to work quite as I expected.

Hmmm... ok.  FWIW: anytime you say things like "does not seem to work 
quite as I expected" ... you really need to explain: a) what you expected. 
b) what you got.

But i think i see the problem...

if you change to asc, then it's going to sort docs by the result of that 
function asc, and because of the map a *lot* of docs are going to have a 
value of 0 for that function -- so in addition to changing to asc 
you'll want to change the target value of that function to something above 
the upper endpoint of the range you care about (20 in this example)

so if the range of legal values is 0-100, and you care about 10-20

sort=map(map(myNumField,0,10,0),20,100,0) desc, score desc
sort=map(map(myNumField,0,10,100),20,100,100) asc, score desc



-Hoss
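
As a sanity check on those two sort expressions, here is a small Python sketch of Solr's map(x,min,max,target) semantics. This is a simplification - inside Solr the function operates on per-document field values - but it shows why the asc variant needs the collapse target moved above the band you care about:

```python
def solr_map(x, lo, hi, target):
    # Solr's map(field,min,max,target): values inside [min,max] become target
    return target if lo <= x <= hi else x

# desc variant: everything outside (10,20) collapses to 0, so docs in the
# 10-20 band sort first, highest values on top
def key_desc(x):
    return solr_map(solr_map(x, 0, 10, 0), 20, 100, 0)

# asc variant: everything outside (10,20) collapses to 100 (above the band),
# so docs in the band sort first, lowest values on top
def key_asc(x):
    return solr_map(solr_map(x, 0, 10, 100), 20, 100, 100)

values = [5, 12, 18, 25, 80]
print(sorted(values, key=key_desc, reverse=True))  # [18, 12, 5, 25, 80]
print(sorted(values, key=key_asc))                 # [12, 18, 5, 25, 80]
```

In both orderings the ties (all the collapsed values) would then fall back to the secondary "score desc" sort in Solr.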


Re: Funky date string accepted

2011-08-23 Thread Chris Hostetter

: I see, is the leading - char just ignored then?

i'd have to re-look at the tests/docs (i don't really want to repeat 
that agonizing headache right now), but i believe what you are seeing is a 
compound problem... 

* parsing sees the "-0001" and recognizes that as a negative year.
* somewhere the negative year is dealt with in a way that assumes there is 
  (isn't?) a year 0, making -1 = Year 2 BC
* formatting code doesn't include the era in the output and 
  doesn't zero pad properly so you just get "2" in the response.


-Hoss


Re: Solr indexing process: keep a persistent Mysql connection throu all the indexing process

2011-08-23 Thread Gora Mohanty
On Tue, Aug 23, 2011 at 10:25 PM, samuele.mattiuzzo samum...@gmail.com wrote:
 I wrote my custom update handler for my solr installation, using jdbc to
 query a mysql database. Everything works fine: the updater queries the db,
 gets the data i need and update it in my documents! Fantastic!

 Only issue is i have to open and close a mysql connection for every document
 i read. Since we have something like 10kk indexed documents, i was thinking
 about opening a mysql connection at the very beginning of the indexing
 process, keeping it stored somewhere and use it inside my custom update
 handler. When the whole indexing process is complete, the connection should
 be closed.
[...]

If you are using a custom update handler, then I imagine that
it is up to you to keep a persistent connection open.

You could also consider using the Solr DataImportHandler,
http://wiki.apache.org/solr/DataImportHandler . This can
interface with mysql, and does keep a persistent connection
open.

Regards,
Gora
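
For the DataImportHandler route Gora mentions, a minimal data-config.xml sketch might look like this. The connection details, table, and column names are placeholders; DIH opens the JDBC connection once per import run rather than per document:

```xml
<dataConfig>
  <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/geo" user="solr" password="secret"/>
  <document>
    <entity name="place" query="SELECT country, region, city FROM places">
      <field column="country" name="country"/>
      <field column="region"  name="region"/>
      <field column="city"    name="city"/>
    </entity>
  </document>
</dataConfig>
```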


Batch updates order guaranteed?

2011-08-23 Thread Glenn
Hello,

Question about batch updates (performing a delete and add in same
request, as described at bottom
of http://wiki.apache.org/solr/UpdateXmlMessages): is the order
guaranteed?  If a delete is followed by an add, will the delete
always be performed first?  I would assume so but would like to get
confirmation.

(I realize that it is not normally necessary to explicitly delete a
document before updating with an add, but we have a need to do some
clean up of certain related documents.  The initial delete-by-query will
ensure that the subsequent add will cleanly update some possible old,
improper documents, but if the delete might ever be performed after
the add, it would end up removing the new document as well.)

Thanks!

Glenn
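
A sketch of the kind of combined message in question (the field names and query are hypothetical; the wiki page above documents the exact format):

```xml
<update>
  <!-- clean up any old, improper documents first -->
  <delete><query>related_id:42</query></delete>
  <!-- then add the replacement document -->
  <add>
    <doc>
      <field name="id">42</field>
      <field name="related_id">42</field>
    </doc>
  </add>
</update>
```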


Re: Batch updates order guaranteed?

2011-08-23 Thread Yonik Seeley
On Tue, Aug 23, 2011 at 2:17 PM, Glenn s...@t2.zazu.com wrote:
 Question about batch updates (performing a delete and add in same
 request, as described at bottom
 of http://wiki.apache.org/solr/UpdateXmlMessages): is the order
 guaranteed?  If a delete is followed by an add, will the delete
 always be performed first?  I would assume so but would like to get
 confirmation.

Yes, if you're crafting the update message yourself in XML or JSON.
SolrJ is a different matter I think.

-Yonik
http://www.lucidimagination.com


Re: Batch updates order guaranteed?

2011-08-23 Thread Glenn
Yes, I'm crafting the XML update message myself.

Thanks for the confirmation.

Glenn

--

On 8/23/11 1:38 PM, Yonik Seeley wrote:
 On Tue, Aug 23, 2011 at 2:17 PM, Glenn s...@t2.zazu.com wrote:
 Question about batch updates (performing a delete and add in same
 request, as described at bottom
 of http://wiki.apache.org/solr/UpdateXmlMessages): is the order
 guaranteed?  If a delete is followed by an add, will the delete
 always be performed first?  I would assume so but would like to get
 confirmation.
 Yes, if you're crafting the update message yourself in XML or JSON.
 SolrJ is a different matter I think.

 -Yonik
 http://www.lucidimagination.com


Re: Batch updates order guaranteed?

2011-08-23 Thread Yonik Seeley
On Tue, Aug 23, 2011 at 3:38 PM, Yonik Seeley
yo...@lucidimagination.com wrote:
 On Tue, Aug 23, 2011 at 2:17 PM, Glenn s...@t2.zazu.com wrote:
 Question about batch updates (performing a delete and add in same
 request, as described at bottom
 of http://wiki.apache.org/solr/UpdateXmlMessages): is the order
 guaranteed?  If a delete is followed by an add, will the delete
 always be performed first?  I would assume so but would like to get
 confirmation.

 Yes, if you're crafting the update message yourself in XML or JSON.
 SolrJ is a different matter I think.

Found the SolrJ issue:
https://issues.apache.org/jira/browse/SOLR-1162

Looks like it sort of got dropped, but I think this is worth fixing.

-Yonik
http://www.lucidimagination.com


Re: Funky date string accepted

2011-08-23 Thread Markus Jelsma
That makes sense indeed. Wouldn't it be an idea to test for the single allowed 
format before parsing it?

 : I see, is the leading - char just ignored then?
 
 i'd have to re-look at the tests/docs (i don't really want to repeat
 that agonizing headache right now), but i believe what you are seeing is a
 compound problem...
 
 * parsing sees the "-0001" and recognizes that as a negative year.
 * somewhere the negative year is dealt with in a way that assumes there is
   (isn't?) a year 0, making -1 = Year 2 BC
 * formatting code doesn't include the era in the output and
   doesn't zero pad properly so you just get "2" in the response.
 
 
 -Hoss


Re: hierarchical faceting in Solr?

2011-08-23 Thread Naomi Dushay

Chris Beer just did a revamp of the wiki page at:

  http://wiki.apache.org/solr/HierarchicalFaceting

Yay Chris!

- Naomi
( ... and I helped!)


On Aug 22, 2011, at 10:49 AM, Naomi Dushay wrote:


Chris,

Is there a document somewhere on how to do this?  If not, might you  
create one?   I could even imagine such a document living on the  
Solr wiki ...  this one has mostly ancient content:


http://wiki.apache.org/solr/HierarchicalFaceting

- Naomi




Re: Solr indexing process: keep a persistent Mysql connection throu all the indexing process

2011-08-23 Thread Tom
10K documents.  Why not just batch them?  

You could read in 10K from your database, load them into an array of
SolrDocuments, and then post them all at once to the Solr server?  Or do them
in 1K increments if they are really big.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-indexing-process-keep-a-persistent-Mysql-connection-throu-all-the-indexing-process-tp3278608p3279708.html
Sent from the Solr - User mailing list archive at Nabble.com.
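
The batching Tom describes is easy to sketch in isolation. The posting step is stubbed out here; in SolrJ it would be something like adding a whole collection of documents per request:

```python
def batches(docs, size=1000):
    """Group documents so the client posts one chunk per request,
    instead of one round trip per document."""
    batch = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # trailing partial chunk
        yield batch

# stand-in for posting each chunk to Solr: just record the chunk sizes
chunk_sizes = [len(chunk) for chunk in batches(range(2500), size=1000)]
print(chunk_sizes)  # [1000, 1000, 500]
```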


Re: Solr indexing process: keep a persistent Mysql connection throu all the indexing process

2011-08-23 Thread samuele.mattiuzzo
those documents are unrelated to the database. the db i have is just storing
countries - regions - cities, and it's used to do a refinement on a specific
solr field

example:

solrField "thetext" with content "Mary comes from London"

updateHandler polls the database for europe - great britain - london and
updates those values to the correct fields

isn't an update handler relative to a single document? at least, that's what
i understood...

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-indexing-process-keep-a-persistent-Mysql-connection-throu-all-the-indexing-process-tp3278608p3279765.html
Sent from the Solr - User mailing list archive at Nabble.com.

