Re: Trouble handling Unit symbol

2012-04-13 Thread Rajani Maski
Hi All,

   I tried to index with UTF-8 encoding, but the issue is still not fixed.
Please see my inputs below.

*Indexed XML:*
<?xml version="1.0" encoding="UTF-8"?>
<add>
  <doc>
    <field name="ID">0.100</field>
    <field name="BODY">µ</field>
  </doc>
</add>

*Search Query - * BODY:µ

numfound : 0 results obtained.

*What can be the reason for this? How do I need to build the search query so
that the above document is found?*


Thanks & Regards

Regards
Rajani



2012/4/2 Rajani Maski rajinima...@gmail.com

 Thank you for the reply.



 On Sat, Mar 31, 2012 at 3:38 AM, Chris Hostetter hossman_luc...@fucit.org
  wrote:


 : We have data having such symbols like :  µ
 : Indexed data has  -Dose:0 µL
 : Now , when  it is searched as  - Dose:0 µL
...
 : Query Q value observed  : <str name="q">S257:0 ÂµL/injection</str>

 First off: your "when searched as" example does not match up to your
 "Query Q observed" value (ie: field queries, extra "/injection" text at
 the end) suggesting that you maybe cut/paste something you didn't mean to
 -- so take the rest of this advice with a grain of salt.

 If i ignore your "when it is searched as" example and focus entirely on
 what you say you've indexed the data as, and the Q value you are seeing (in
 what looks like the echoParams output) then the first thing that jumps out
 at me is that it looks like your servlet container (or perhaps your web
 browser if that's where you tested this) is not dealing with the unicode
 correctly -- because although i see a µ in the first three lines i
 quoted above (UTF8: 0xC2 0xB5) in your "value observed" i'm seeing it
 preceded by a Â (UTF8: 0xC3 0x82) ... suggesting that perhaps the µ
 did not get URL encoded properly when the request was made to your servlet
 container?

 In particular, you might want to take a look at...


 https://wiki.apache.org/solr/FAQ#Why_don.27t_International_Characters_Work.3F
 http://wiki.apache.org/solr/SolrTomcat#URI_Charset_Config
 The example/exampledocs/test_utf8.sh script included with solr
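
As a quick way to see what a correctly encoded request should look like,
here is a minimal sketch using only JDK classes (the handler path is
illustrative):

    import java.net.URLEncoder;

    public class Utf8QueryCheck {
        public static void main(String[] args) throws Exception {
            // U+00B5 (micro sign) is the UTF-8 byte pair 0xC2 0xB5, so a
            // correctly encoded request must carry %C2%B5 for it.
            String q = URLEncoder.encode("BODY:µ", "UTF-8");
            System.out.println("/solr/select?q=" + q);
            // prints: /solr/select?q=BODY%3A%C2%B5
        }
    }

If the request that reaches the servlet container carries anything other
than %C2%B5 for the µ, the corruption happened before Solr ever saw the
query.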




 -Hoss





How to read SOLR cache statistics?

2012-04-13 Thread Kashif Khan
Can anyone explain what the following parameters mean in the SOLR cache
statistics?

*name*:  queryResultCache   
*class*:  org.apache.solr.search.LRUCache   
*version*:  1.0   
*description*:  LRU Cache(maxSize=512, initialSize=512)   
*stats*:  lookups : 98 
*hits *: 59 
*hitratio *: 0.60 
*inserts *: 41 
*evictions *: 0 
*size *: 41 
*warmupTime *: 0 
*cumulative_lookups *: 98 
*cumulative_hits *: 59 
*cumulative_hitratio *: 0.60 
*cumulative_inserts *: 39 
*cumulative_evictions *: 0 

AND also this


*name*:  fieldValueCache   
*class*:  org.apache.solr.search.FastLRUCache   
*version*:  1.0   
*description*:  Concurrent LRU Cache(maxSize=1, initialSize=10,
minSize=9000, acceptableSize=9500, cleanupThread=false)   
*stats*:  *lookups *: 8 
*hits *: 4 
*hitratio *: 0.50 
*inserts *: 2 
*evictions *: 0 
*size *: 2 
*warmupTime *: 0 
*cumulative_lookups *: 8 
*cumulative_hits *: 4 
*cumulative_hitratio *: 0.50 
*cumulative_inserts *: 2 
*cumulative_evictions *: 0 
*item_ABC *:
{field=ABC,memSize=340592,tindexSize=1192,time=1360,phase1=1344,nTerms=7373,bigTerms=1,termInstances=11513,uses=4}
 
*item_BCD *:
{field=BCD,memSize=341248,tindexSize=1952,time=1688,phase1=1688,nTerms=8075,bigTerms=0,termInstances=13510,uses=2}
 
 
Without understanding these terms I cannot configure the server for better cache
usage. The point is that searches are very slow. These stats were taken after the
server was restarted. I just want to understand what these terms actually mean.


--
View this message in context: 
http://lucene.472066.n3.nabble.com/How-to-read-SOLR-cache-statistics-tp3907294p3907294.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Lexical analysis tools for German language data

2012-04-13 Thread Tomas Zerolo
On Thu, Apr 12, 2012 at 03:46:56PM +, Michael Ludwig wrote:
  Von: Walter Underwood
 
  German noun decompounding is a little more complicated than it might
  seem.
  
  There can be transformations or inflections, like the s in
  Weihnachtsbaum (Weihnachten/Baum).
 
 I remember from my linguistics studies that the terminus technicus for
 these is Fugenmorphem (interstitial or joint morpheme) [...]

IANAL (I am not a linguist -- pun intended ;) but I've always read that
as a genitive. Any pointers?

Regards
-- 
Tomás Zerolo
Axel Springer AG
Axel Springer media Systems
BILD Produktionssysteme
Axel-Springer-Straße 65
10888 Berlin
Tel.: +49 (30) 2591-72875
tomas.zer...@axelspringer.de
www.axelspringer.de

Axel Springer AG, registered office Berlin, Amtsgericht Charlottenburg, HRB 4998
Chairman of the Supervisory Board: Dr. Giuseppe Vita
Management Board: Dr. Mathias Döpfner (Chairman),
Jan Bayer, Ralph Büchi, Lothar Lanz, Dr. Andreas Wiele


Re: Solr Scoring

2012-04-13 Thread Li Li
another way is to use payloads: http://wiki.apache.org/solr/Payloads
the advantage of payloads is that you only need one field and can make the .frq
file smaller than using two fields. but the disadvantage is that payloads are
stored in the .prx file, so I am not sure which one is faster. maybe you can try
them both.

On Fri, Apr 13, 2012 at 8:04 AM, Erick Erickson erickerick...@gmail.comwrote:

 GAH! I had my head in "make this happen in one field" when I wrote my
 response, without being explicit. Of course Walter's solution is pretty
 much the standard way to deal with this.

 Best
 Erick

 On Thu, Apr 12, 2012 at 5:38 PM, Walter Underwood wun...@wunderwood.org
 wrote:
  It is easy. Create two fields, text_exact and text_stem. Don't use the
 stemmer in the first chain, do use the stemmer in the second. Give the
 text_exact a bigger weight than text_stem.
 
  wunder
 
  On Apr 12, 2012, at 4:34 PM, Erick Erickson wrote:
 
  No, I don't think there's an OOB way to make this happen. It's
  a recurring theme, make exact matches score higher than
  stemmed matches.
 
  Best
  Erick
 
  On Thu, Apr 12, 2012 at 5:18 AM, Kissue Kissue kissue...@gmail.com
 wrote:
  Hi,
 
  I have a field in my index called itemDesc which i am applying
  EnglishMinimalStemFilterFactory to. So if i index a value to this field
  containing "Edges", the EnglishMinimalStemFilterFactory applies stemming
  and "Edges" becomes "Edge". Now when i search for "Edges", documents with
  "Edge" score better than documents with the actual search word - "Edges".
  Is there a way i can make documents with the actual search word, in this
  case "Edges", score better than documents with "Edge"?
 
  I am using Solr 3.5. My field definition is shown below:
 
  <fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt"
              ignoreCase="true" expand="false"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true"
              words="stopwords_en.txt" enablePositionIncrements="true"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.EnglishPossessiveFilterFactory"/>
      <filter class="solr.EnglishMinimalStemFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
              ignoreCase="true" expand="true"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true"
              words="stopwords_en.txt" enablePositionIncrements="true"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.EnglishPossessiveFilterFactory"/>
      <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
      <filter class="solr.EnglishMinimalStemFilterFactory"/>
    </analyzer>
  </fieldType>
 
  Thanks.
 
 
 
 
 



Re: EmbeddedSolrServer and StreamingUpdateSolrServer

2012-04-13 Thread Mikhail Khludnev
Did I understand correctly that you have two separate processes (different apps)
accessing the same Lucene Directory simultaneously? In that case I suggest reading
about the locking mechanism; I'm not really experienced with it.
You showed logs from the StreamingUpdateSolrServer failure, which is clear. Can
you show logs from the EmbeddedSolrServer commit, which is supposed to be
successful?

On Fri, Apr 13, 2012 at 9:34 AM, pcrao purn...@gmail.com wrote:

 Hi Shawn,

 Thanks for sharing your opinion.

 Mikhail Khludnev, what do you think of Shawn's opinion?

 Thanks,
 PC Rao.

 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/EmbeddedSolrServer-and-StreamingUpdateSolrServer-tp3889073p3907223.html
 Sent from the Solr - User mailing list archive at Nabble.com.




-- 
Sincerely yours
Mikhail Khludnev
ge...@yandex.ru

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Re: How to read SOLR cache statistics?

2012-04-13 Thread Li Li
http://wiki.apache.org/solr/SolrCaching

On Fri, Apr 13, 2012 at 2:30 PM, Kashif Khan uplink2...@gmail.com wrote:

 Can anyone explain what the following parameters mean in the SOLR cache
 statistics?

 *name*:  queryResultCache
 *class*:  org.apache.solr.search.LRUCache
 *version*:  1.0
 *description*:  LRU Cache(maxSize=512, initialSize=512)
 *stats*:  lookups : 98
 *hits *: 59
 *hitratio *: 0.60
 *inserts *: 41
 *evictions *: 0
 *size *: 41
 *warmupTime *: 0
 *cumulative_lookups *: 98
 *cumulative_hits *: 59
 *cumulative_hitratio *: 0.60
 *cumulative_inserts *: 39
 *cumulative_evictions *: 0

 AND also this


 *name*:  fieldValueCache
 *class*:  org.apache.solr.search.FastLRUCache
 *version*:  1.0
 *description*:  Concurrent LRU Cache(maxSize=1, initialSize=10,
 minSize=9000, acceptableSize=9500, cleanupThread=false)
 *stats*:  *lookups *: 8
 *hits *: 4
 *hitratio *: 0.50
 *inserts *: 2
 *evictions *: 0
 *size *: 2
 *warmupTime *: 0
 *cumulative_lookups *: 8
 *cumulative_hits *: 4
 *cumulative_hitratio *: 0.50
 *cumulative_inserts *: 2
 *cumulative_evictions *: 0
 *item_ABC *:

 {field=ABC,memSize=340592,tindexSize=1192,time=1360,phase1=1344,nTerms=7373,bigTerms=1,termInstances=11513,uses=4}
 *item_BCD *:

 {field=BCD,memSize=341248,tindexSize=1952,time=1688,phase1=1688,nTerms=8075,bigTerms=0,termInstances=13510,uses=2}

 Without understanding these terms I cannot configure the server for better
 cache usage. The point is that searches are very slow. These stats were taken
 after the server was restarted. I just want to understand what these terms
 actually mean.


 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/How-to-read-SOLR-cache-statistics-tp3907294p3907294.html
 Sent from the Solr - User mailing list archive at Nabble.com.



AW: Lexical analysis tools for German language data

2012-04-13 Thread Michael Ludwig
 Von: Tomas Zerolo

   There can be transformations or inflections, like the s in
   Weihnachtsbaum (Weihnachten/Baum).
 
  I remember from my linguistics studies that the terminus technicus
  for these is Fugenmorphem (interstitial or joint morpheme) [...]
 
 IANAL (I am not a linguist -- pun intended ;) but I've always read
 that as a genitive. Any pointers?

Admittedly, that's what you'd think, and despite linguistics telling me
otherwise I'd maintain there's some truth in it. For this case, however,
consider: die Weihnacht declines like die Nacht, so:

nom. die Weihnacht
gen. der Weihnacht
dat. der Weihnacht
akk. die Weihnacht

As you can see, there's no s to be found anywhere, not even in the
genitive. But my gut feeling, like yours, is that this should indicate
genitive, and I would make a point of well-argued gut feeling being at
least as relevant as formalist analysis.

Michael


Re: two structures in solr

2012-04-13 Thread tkoomzaaskz
Thank you very much Erick for your reply!

So should it go something like the following:

http://lucene.472066.n3.nabble.com/file/n3907393/solr_index.png 
sorry for an ugly drawing ;)

In this example, the index will have 13 columns: 6 for project, 6 for
contractor and one to define the type. Is that right?

--
View this message in context: 
http://lucene.472066.n3.nabble.com/two-structures-in-solr-tp3905143p3907393.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Boost differences in two environments for same query and config

2012-04-13 Thread Kerwin
Hi Erick,

Thanks for your suggestions.
I did an optimize on the remote installation, this time with the
same number of documents, but still face the same issue, as seen from
the debug output below:

9.950362E-4 = (MATCH) sum of:
  9.950362E-4 = (MATCH) weight(RECORD_TYPE:info in 35916), product of:
    9.950362E-4 = queryWeight(RECORD_TYPE:info), product of:
      1.0 = idf(docFreq=58891, maxDocs=8181811)
      9.950362E-4 = queryNorm
    1.0 = (MATCH) fieldWeight(RECORD_TYPE:info in 35916), product of:
      1.0 = tf(termFreq(RECORD_TYPE:info)=1)
      1.0 = idf(docFreq=58891, maxDocs=8181811)
      1.0 = fieldNorm(field=RECORD_TYPE, doc=35916)
  0.0 = (MATCH) product of:
    1.0945399 = (MATCH) sum of:
      0.99503624 = (MATCH) weight(CD:ee123^1000.0 in 35916), product of:
        0.99503624 = queryWeight(CD:ee123^1000.0), product of:
          1000.0 = boost
          1.0 = idf(docFreq=1, maxDocs=8181811)
          9.950362E-4 = queryNorm
        1.0 = (MATCH) fieldWeight(CD:ee123 in 35916), product of:
          1.0 = tf(termFreq(CD:ee123)=1)
          1.0 = idf(docFreq=1, maxDocs=8181811)
          1.0 = fieldNorm(field=CD, doc=35916)
      0.09950362 = (MATCH)
        ConstantScoreQuery(QueryWrapperFilter(CD:ee123 CD:ee123c CD:ee123c.
        CD:ee123dc CD:ee123e CD:ee123e. CD:ee123en CD:ee123fx CD:ee123g
        CD:ee123g.1 CD:ee123g1 CD:ee123ee123 CD:ee123l.1 CD:ee123l1 CD:ee123ll
        CD:ee123lr CD:ee123m.z CD:ee123mg CD:ee123mz CD:ee123na CD:ee123nx
        CD:ee123ol CD:ee123op CD:ee123p CD:ee123p.1 CD:ee123p1 CD:ee123pn
        CD:ee123r.1 CD:ee123r1 CD:ee123s CD:ee123s.z CD:ee123sm CD:ee123sn
        CD:ee123sp CD:ee123ss CD:ee123sz)), product of:
        100.0 = boost
        9.950362E-4 = queryNorm
    0.0 = coord(2/3)


So I got the conf folder from the remote server location and replaced
my local conf folder with this one to see if the indexes were formed
differently, but my local installation continues to work. I would expect
to see the same behaviour as on the remote installation, but it did not
happen. (The only difference is that the remote installation has
cores while my local installation has no cores.)
Anything else I could try?
Thanks for your help.

On 4/11/12, Erick Erickson erickerick...@gmail.com wrote:
 Well, you're matching a different number of records, so I have to assume
 your indexes are different on the two machines.

 Here is one case where doing an optimize might make sense, that'll purge
 the data associated with any deleted records from the index which should
 make comparisons better

 Additionally, you have to insure that your request handler is identical
 on both, have you made any changes to solrconfig.xml?

 About the coord (2/3), I'm pretty clueless. But also insure that your
 parsed query is identical on both, which is an additional check on
 whether you've changed something on one server and not the
 other.

 Best
 Erick

 On Wed, Apr 11, 2012 at 8:19 AM, Kerwin kerwin...@gmail.com wrote:
 Hi All,

 I am firing the following Solr query against installations on two
 environments one on my local Windows machine and the other on Unix
 (Remote).

 RECORD_TYPE:info AND (NAME:ee123* OR CD:ee123^1000 OR CD:ee123*^100)

 There are no differences in the DataImportHandler configuration ,
 Schema and Solrconfig for both these installations.
 The correct expected result is given by the local installation of Solr
 which also gives scores as expected for the boosts.

 CORRECT/Expected:
 Debug query output for local installation:

 10.822258 = (MATCH) sum of:
0.002170282 = (MATCH) weight(RECORD_TYPE:info in 35916), product
 of:
3.65739E-4 = queryWeight(RECORD_TYPE:info), product of:
5.933964 = idf(docFreq=58891, maxDocs=8181811)
6.1634855E-5 = queryNorm
5.933964 = (MATCH) fieldWeight(RECORD_TYPE:info in 35916),
 product of:
1.0 = tf(termFreq(RECORD_TYPE:info)=1)
5.933964 = idf(docFreq=58891, maxDocs=8181811)
1.0 = fieldNorm(field=RECORD_TYPE, doc=35916)
10.820087 = (MATCH) product of:
16.230131 = (MATCH) sum of:
16.223969 = (MATCH) weight(CD:ee123^1000.0 in
 35916), product of:
0.81 = queryWeight(CD:ee123^1000.0),
 product of:
1000.0 = boost
16.224277 = idf(docFreq=1,
 maxDocs=8181811)
  

Re: Solr Scoring

2012-04-13 Thread Kissue Kissue
Thanks a lot. I had already implemented Walter's solution and was wondering
if this was the right way to deal with it. This has now given me the
confidence to go with the solution.

Many thanks.
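
For reference, a minimal SolrJ sketch of querying with that weighting; the
field names follow Walter's description, while the boosts, core URL, and use
of the 3.x CommonsHttpSolrServer client are assumptions:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class ExactOverStemmed {
        public static void main(String[] args) throws Exception {
            CommonsHttpSolrServer server =
                new CommonsHttpSolrServer("http://localhost:8983/solr");
            SolrQuery q = new SolrQuery("Edges");
            q.set("defType", "dismax");
            // Search both chains; the unstemmed field gets the bigger weight,
            // so an exact "Edges" match outscores a stemmed "Edge" match.
            q.set("qf", "text_exact^2.0 text_stem^1.0");
            QueryResponse rsp = server.query(q);
            System.out.println("hits: " + rsp.getResults().getNumFound());
        }
    }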

On Fri, Apr 13, 2012 at 1:04 AM, Erick Erickson erickerick...@gmail.comwrote:

 GAH! I had my head in "make this happen in one field" when I wrote my
 response, without being explicit. Of course Walter's solution is pretty
 much the standard way to deal with this.

 Best
 Erick

 On Thu, Apr 12, 2012 at 5:38 PM, Walter Underwood wun...@wunderwood.org
 wrote:
  It is easy. Create two fields, text_exact and text_stem. Don't use the
 stemmer in the first chain, do use the stemmer in the second. Give the
 text_exact a bigger weight than text_stem.
 
  wunder
 
  On Apr 12, 2012, at 4:34 PM, Erick Erickson wrote:
 
  No, I don't think there's an OOB way to make this happen. It's
  a recurring theme, make exact matches score higher than
  stemmed matches.
 
  Best
  Erick
 
  On Thu, Apr 12, 2012 at 5:18 AM, Kissue Kissue kissue...@gmail.com
 wrote:
  Hi,
 
   I have a field in my index called itemDesc which i am applying
   EnglishMinimalStemFilterFactory to. So if i index a value to this field
   containing "Edges", the EnglishMinimalStemFilterFactory applies stemming
   and "Edges" becomes "Edge". Now when i search for "Edges", documents with
   "Edge" score better than documents with the actual search word - "Edges".
   Is there a way i can make documents with the actual search word, in this
   case "Edges", score better than documents with "Edge"?
 
  I am using Solr 3.5. My field definition is shown below:
 
   <fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
     <analyzer type="index">
       <tokenizer class="solr.StandardTokenizerFactory"/>
       <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt"
               ignoreCase="true" expand="false"/>
       <filter class="solr.StopFilterFactory" ignoreCase="true"
               words="stopwords_en.txt" enablePositionIncrements="true"/>
       <filter class="solr.LowerCaseFilterFactory"/>
       <filter class="solr.EnglishPossessiveFilterFactory"/>
       <filter class="solr.EnglishMinimalStemFilterFactory"/>
     </analyzer>
     <analyzer type="query">
       <tokenizer class="solr.StandardTokenizerFactory"/>
       <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
               ignoreCase="true" expand="true"/>
       <filter class="solr.StopFilterFactory" ignoreCase="true"
               words="stopwords_en.txt" enablePositionIncrements="true"/>
       <filter class="solr.LowerCaseFilterFactory"/>
       <filter class="solr.EnglishPossessiveFilterFactory"/>
       <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
       <filter class="solr.EnglishMinimalStemFilterFactory"/>
     </analyzer>
   </fieldType>
 
  Thanks.
 
 
 
 
 



Re: Facets involving multiple fields

2012-04-13 Thread Marc SCHNEIDER
Hi,

Thanks for your answer.
Yes it works in this case when I know the facet name (Computer). What
if I want to automatically compute all facets?
facet.query=keyword:* short_title:* doesn't work, right?

Marc.

On Thu, Apr 12, 2012 at 2:08 PM, Erick Erickson erickerick...@gmail.com wrote:
 facet.query=keywords:computer short_title:computer
 seems like what you're asking for.
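
A minimal SolrJ sketch of that suggestion (the core URL is an assumption,
and the explicit OR just avoids depending on the default operator); a single
facet.query counts each document once, even when both fields match:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class CombinedFieldFacet {
        public static void main(String[] args) throws Exception {
            CommonsHttpSolrServer server =
                new CommonsHttpSolrServer("http://localhost:8983/solr");
            SolrQuery q = new SolrQuery("*:*");
            q.setFacet(true);
            // One facet.query per term of interest; a document matching in
            // both fields is still counted only once.
            q.addFacetQuery("keywords:Computer OR short_title:Computer");
            QueryResponse rsp = server.query(q);
            // e.g. {keywords:Computer OR short_title:Computer=3}
            System.out.println(rsp.getFacetQuery());
        }
    }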

 On Thu, Apr 12, 2012 at 3:19 AM, Marc SCHNEIDER
 marc.schneide...@gmail.com wrote:
 Hi,

 Thanks for your answer.
 Let's say I have two fields : 'keywords' and 'short_title'.
 For these fields I'd like to make a faceted search : if 'Computer' is
 stored in at least one of these fields for a document I'd like to get
 it added in my results.
 doc1 = keywords : 'Computer' / short_title : 'Computer'
 doc2 = keywords : 'Computer'
 doc3 = short_title : 'Computer'

 In this case I'd like to have : Computer (3)

 I don't see how to solve this with facet.query.

 Thanks,
 Marc.

 On Wed, Apr 11, 2012 at 5:13 PM, Erick Erickson erickerick...@gmail.com 
 wrote:
 Have you considered facet.query? You can specify an arbitrary query
 to facet on which might do what you want. Otherwise, I'm not sure what
 you mean by faceted search using two fields. How should these fields
 be combined into a single facet? What that means practically is not at
 all obvious from your problem statement.

 Best
 Erick

 On Tue, Apr 10, 2012 at 8:55 AM, Marc SCHNEIDER
 marc.schneide...@gmail.com wrote:
 Hi,

 I'd like to make a faceted search using two fields. I want to have a
 single result and not a result by field (like when using
 facet.field=f1,facet.field=f2).
 I don't want to use a copy field either because I want it to be
 dynamic at search time.
 As far as I know this is not possible for Solr 3.x...
 But I saw a new parameter named group.facet for Solr4. Could that
 solve my problem? If yes could somebody give me an example?

 Thanks,
 Marc.


Re: How to read SOLR cache statistics?

2012-04-13 Thread Kashif Khan
Hi Li Li,

I have been through that WIKI before, but it does not explain what
*evictions*, *inserts*, *cumulative_inserts*, *cumulative_evictions*,
*hitratio* and the rest are. These terms are foreign to me. What does the
following line mean?

*item_ABC :
{field=ABC,memSize=340592,tindexSize=1192,time=1360,phase1=1344,nTerms=7373,bigTerms=1,termInstances=11513,uses=4}
*

I want that kind of explanation. I have read the wiki and the comments in
the solrconfig.xml file about all these things, but neither says how to read
the stats, which is very *important!!!*

--
View this message in context: 
http://lucene.472066.n3.nabble.com/How-to-read-SOLR-cache-statistics-tp3907294p3907633.html
Sent from the Solr - User mailing list archive at Nabble.com.

Issues with language based indexing

2012-04-13 Thread JGar
Hello,

I am new to Solr. It is returning some docs in my search for the string
"Acciones y Valores". When I go and search for the same words in the given doc
manually, I cannot find those words. Please help me understand on what basis
the doc is found in the search.

Thanks

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Issues-with-language-based-indexing-tp3907601p3907601.html
Sent from the Solr - User mailing list archive at Nabble.com.

Realtime /get versus SearchHandler

2012-04-13 Thread Benson Margulies
A discussion over on the dev list led me to expect that the by-id
field retrievals in a SolrCloud query would come through the get
handler. In fact, I've seen them turn up in my search component in the
search handler that is configured with my custom QT. (I have a
'prepare' method that sets ShardParams.QT to my QT to get my
processing involved in the first of the two queries.) Did I overthink
this?


Re: Trouble handling Unit symbol

2012-04-13 Thread Erick Erickson
Please review:
http://wiki.apache.org/solr/UsingMailingLists

Especially the bit about adding debugQuery=on
and showing the results. You're asking people
to guess at solutions without providing much
in the way of context.

You might try looking at your index with Luke to
see what's actually in your index, or perhaps
TermsComponent


Best
Erick

On Fri, Apr 13, 2012 at 2:29 AM, Rajani Maski rajinima...@gmail.com wrote:
 Hi All,

   I tried to index with UTF-8 encoding, but the issue is still not fixed.
 Please see my inputs below.

 *Indexed XML:*
 <?xml version="1.0" encoding="UTF-8"?>
 <add>
  <doc>
    <field name="ID">0.100</field>
    <field name="BODY">µ</field>
  </doc>
 </add>

 *Search Query - * BODY:µ

 numfound : 0 results obtained.

 *What can be the reason for this? How do I need to build the search query so
 that the above document is found?*


 Thanks & Regards

 Regards
 Rajani



 2012/4/2 Rajani Maski rajinima...@gmail.com

 Thank you for the reply.



 On Sat, Mar 31, 2012 at 3:38 AM, Chris Hostetter hossman_luc...@fucit.org
  wrote:


  : We have data having such symbols like :  µ
  : Indexed data has  -    Dose:0 µL
  : Now , when  it is searched as  - Dose:0 µL
         ...
  : Query Q value observed  : <str name="q">S257:0 ÂµL/injection</str>

  First off: your "when searched as" example does not match up to your
  "Query Q observed" value (ie: field queries, extra "/injection" text at
  the end) suggesting that you maybe cut/paste something you didn't mean to
  -- so take the rest of this advice with a grain of salt.

  If i ignore your "when it is searched as" example and focus entirely on
  what you say you've indexed the data as, and the Q value you are seeing (in
  what looks like the echoParams output) then the first thing that jumps out
  at me is that it looks like your servlet container (or perhaps your web
  browser if that's where you tested this) is not dealing with the unicode
  correctly -- because although i see a µ in the first three lines i
  quoted above (UTF8: 0xC2 0xB5) in your "value observed" i'm seeing it
  preceded by a Â (UTF8: 0xC3 0x82) ... suggesting that perhaps the µ
  did not get URL encoded properly when the request was made to your servlet
  container?

 In particular, you might want to take a look at...


 https://wiki.apache.org/solr/FAQ#Why_don.27t_International_Characters_Work.3F
 http://wiki.apache.org/solr/SolrTomcat#URI_Charset_Config
 The example/exampledocs/test_utf8.sh script included with solr




 -Hoss





Re: two structures in solr

2012-04-13 Thread Erick Erickson
bq: Is that right?

I don't know, does it work <G>? You'll probably want an
additional field for a unique id (just named "id" in the example)
that should be disjoint between your types (for example,
ids like project-123 vs contractor-456).

Best
Erick

On Fri, Apr 13, 2012 at 3:41 AM, tkoomzaaskz tomasz.du...@gmail.com wrote:
 Thank you very much Erick for your reply!

 So should it go something like the following:

 http://lucene.472066.n3.nabble.com/file/n3907393/solr_index.png
 sorry for an ugly drawing ;)

 In this example, the index will have 13 columns: 6 for project, 6 for
 contractor and one to define the type. Is that right?

 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/two-structures-in-solr-tp3905143p3907393.html
 Sent from the Solr - User mailing list archive at Nabble.com.


Re: Boost differences in two environments for same query and config

2012-04-13 Thread Erick Erickson
Well, next thing I'd do is just copy your entire solr home
directory to the remote machine and try that. If that gives
identical results on both, then try moving just your
solr home/data directory to the remote machine.

I suspect that you've done something different between the two
machines that's leading to this, but haven't a clue what.

If you copy your entire Solr installation over and _still_ get
this kind of thing, we're into whether the JVM or operating system
are somehow changing things, which would surprise me a lot.

Best
Erick

On Fri, Apr 13, 2012 at 4:24 AM, Kerwin kerwin...@gmail.com wrote:
 Hi Erick,

 Thanks for your suggestions.
 I did an optimize on the remote installation, this time with the
 same number of documents, but still face the same issue, as seen from
 the debug output below:

 9.950362E-4 = (MATCH) sum of:
        9.950362E-4 = (MATCH) weight(RECORD_TYPE:info in 35916), product of:
                9.950362E-4 = queryWeight(RECORD_TYPE:info), product of:
                        1.0 = idf(docFreq=58891, maxDocs=8181811)
                        9.950362E-4 = queryNorm
                1.0 = (MATCH) fieldWeight(RECORD_TYPE:info in 35916), product 
 of:
                        1.0 = tf(termFreq(RECORD_TYPE:info)=1)
                        1.0 = idf(docFreq=58891, maxDocs=8181811)
                        1.0 = fieldNorm(field=RECORD_TYPE, doc=35916)
        0.0 = (MATCH) product of:
                1.0945399 = (MATCH) sum of:
                        0.99503624 = (MATCH) weight(CD:ee123^1000.0 in 35916), 
 product of:
                                0.99503624 = queryWeight(CD:ee123^1000.0), 
 product of:
                                        1000.0 = boost
                                        1.0 = idf(docFreq=1, maxDocs=8181811)
                                        9.950362E-4 = queryNorm
                                1.0 = (MATCH) fieldWeight(CD:ee123 in 35916), 
 product of:
                                        1.0 = tf(termFreq(CD:ee123)=1)
                                        1.0 = idf(docFreq=1, maxDocs=8181811)
                                        1.0 = fieldNorm(field=CD, doc=35916)
                                0.09950362 = (MATCH)
 ConstantScoreQuery(QueryWrapperFilter(CD:ee123 CD:ee123c CD:ee123c.
 CD:ee123dc CD:ee123e CD:ee123e. CD:ee123en CD:ee123fx CD:ee123g
 CD:ee123g.1 CD:ee123g1 CD:ee123ee123 CD:ee123l.1 CD:ee123l1 CD:ee123ll
 CD:ee123lr CD:ee123m.z CD:ee123mg CD:ee123mz CD:ee123na CD:ee123nx
 CD:ee123ol CD:ee123op CD:ee123p CD:ee123p.1 CD:ee123p1 CD:ee123pn
 CD:ee123r.1 CD:ee123r1 CD:ee123s CD:ee123s.z CD:ee123sm CD:ee123sn
 CD:ee123sp CD:ee123ss CD:ee123sz)), product of:
                                        100.0 = boost
                                        9.950362E-4 = queryNorm
                0.0 = coord(2/3)


 So I got the conf folder from the remote server location and replaced
 my local conf folder with this one to see if the indexes were formed
 differently, but my local installation continues to work. I would expect
 to see the same behaviour as on the remote installation, but it did not
 happen. (The only difference is that the remote installation has
 cores while my local installation has no cores.)
 Anything else I could try?
 Thanks for your help.

 On 4/11/12, Erick Erickson erickerick...@gmail.com wrote:
 Well, you're matching a different number of records, so I have to assume
 your indexes are different on the two machines.

 Here is one case where doing an optimize might make sense, that'll purge
 the data associated with any deleted records from the index which should
 make comparisons better

 Additionally, you have to insure that your request handler is identical
 on both, have you made any changes to solrconfig.xml?

 About the coord (2/3), I'm pretty clueless. But also insure that your
 parsed query is identical on both, which is an additional check on
 whether you've changed something on one server and not the
 other.

 Best
 Erick

 On Wed, Apr 11, 2012 at 8:19 AM, Kerwin kerwin...@gmail.com wrote:
 Hi All,

 I am firing the following Solr query against installations on two
 environments one on my local Windows machine and the other on Unix
 (Remote).

 RECORD_TYPE:info AND (NAME:ee123* OR CD:ee123^1000 OR CD:ee123*^100)

 There are no differences in the DataImportHandler configuration ,
 Schema and Solrconfig for both these installations.
 The correct expected result is given by the local installation of Solr
 which also gives scores as expected for the boosts.

 CORRECT/Expected:
 Debug query output for local installation:

 10.822258 = (MATCH) sum of:
        0.002170282 = (MATCH) weight(RECORD_TYPE:info in 35916), product
 of:
                3.65739E-4 = queryWeight(RECORD_TYPE:info), product of:
                        5.933964 = idf(docFreq=58891, maxDocs=8181811)
                        6.1634855E-5 = queryNorm
                5.933964 = (MATCH) fieldWeight(RECORD_TYPE:info in 35916),
 product of:
   

Re: Trouble handling Unit symbol

2012-04-13 Thread Rajani Maski
Fine. Thank you. I will look at it.


On Fri, Apr 13, 2012 at 5:21 PM, Erick Erickson erickerick...@gmail.comwrote:

 Please review:
 http://wiki.apache.org/solr/UsingMailingLists

 Especially the bit about adding debugQuery=on
 and showing the results. You're asking people
 to guess at solutions without providing much
 in the way of context.

 You might try looking at your index with Luke to
 see what's actually in your index, or perhaps
 TermsComponent


 Best
 Erick

 On Fri, Apr 13, 2012 at 2:29 AM, Rajani Maski rajinima...@gmail.com
 wrote:
  Hi All,
 
   I tried to index with UTF-8 encoding, but the issue is still not fixed.
  Please see my inputs below.
 
  *Indexed XML:*
  <?xml version="1.0" encoding="UTF-8"?>
  <add>
    <doc>
      <field name="ID">0.100</field>
      <field name="BODY">µ</field>
    </doc>
  </add>
 
  *Search Query - * BODY:µ
 
  numfound : 0 results obtained.
 
  *What can be the reason for this? How do I need to build the search query so
  that the above document is found?*
 
 
  Thanks & Regards
 
  Regards
  Rajani
 
 
 
  2012/4/2 Rajani Maski rajinima...@gmail.com
 
  Thank you for the reply.
 
 
 
  On Sat, Mar 31, 2012 at 3:38 AM, Chris Hostetter 
 hossman_luc...@fucit.org
   wrote:
 
 
  : We have data having such symbols like :  µ
  : Indexed data has  -Dose:0 µL
  : Now , when  it is searched as  - Dose:0 µL
 ...
  : Query Q value observed  : <str name="q">S257:0 ÂµL/injection</str>
 
  First off: your "when searched as" example does not match up to your
  "Query Q observed" value (ie: field queries, extra "/injection" text at
  the end) suggesting that you maybe cut/paste something you didn't mean to
  -- so take the rest of this advice with a grain of salt.
 
  If i ignore your "when it is searched as" example and focus entirely on
  what you say you've indexed the data as, and the Q value you are seeing (in
  what looks like the echoParams output) then the first thing that jumps out
  at me is that it looks like your servlet container (or perhaps your web
  browser if that's where you tested this) is not dealing with the unicode
  correctly -- because although i see a µ in the first three lines i
  quoted above (UTF8: 0xC2 0xB5) in your "value observed" i'm seeing it
  preceded by a Â (UTF8: 0xC3 0x82) ... suggesting that perhaps the µ
  did not get URL encoded properly when the request was made to your
  servlet container?
 
  In particular, you might want to take a look at...
 
 
 
 https://wiki.apache.org/solr/FAQ#Why_don.27t_International_Characters_Work.3F
  http://wiki.apache.org/solr/SolrTomcat#URI_Charset_Config
  The example/exampledocs/test_utf8.sh script included with solr
 
 
 
 
  -Hoss
 
 
 



Re: Facets involving multiple fields

2012-04-13 Thread Erick Erickson
Nope. Information about your higher level use-case
would probably be a good thing, this is starting to
smell like an XY problem.

Best
Erick

On Fri, Apr 13, 2012 at 5:48 AM, Marc SCHNEIDER
marc.schneide...@gmail.com wrote:
 Hi,

 Thanks for your answer.
 Yes it works in this case when I know the facet name (Computer). What
 if I want to automatically compute all facets?
 facet.query=keyword:* short_title:* doesn't work, right?

 Marc.

 On Thu, Apr 12, 2012 at 2:08 PM, Erick Erickson erickerick...@gmail.com 
 wrote:
 facet.query=keywords:computer short_title:computer
 seems like what you're asking for.

 On Thu, Apr 12, 2012 at 3:19 AM, Marc SCHNEIDER
 marc.schneide...@gmail.com wrote:
 Hi,

 Thanks for your answer.
 Let's say I have two fields : 'keywords' and 'short_title'.
 For these fields I'd like to make a faceted search : if 'Computer' is
 stored in at least one of these fields for a document I'd like to get
 it added in my results.
 doc1 = keywords : 'Computer' / short_title : 'Computer'
 doc2 = keywords : 'Computer'
 doc3 = short_title : 'Computer'

 In this case I'd like to have : Computer (3)

 I don't see how to solve this with facet.query.

 Thanks,
 Marc.

 On Wed, Apr 11, 2012 at 5:13 PM, Erick Erickson erickerick...@gmail.com 
 wrote:
 Have you considered facet.query? You can specify an arbitrary query
 to facet on which might do what you want. Otherwise, I'm not sure what
 you mean by faceted search using two fields. How should these fields
 be combined into a single facet? What that means practically is not at
 all obvious from your problem statement.

 Best
 Erick

 On Tue, Apr 10, 2012 at 8:55 AM, Marc SCHNEIDER
 marc.schneide...@gmail.com wrote:
 Hi,

 I'd like to make a faceted search using two fields. I want to have a
 single result and not a result by field (like when using
 facet.field=f1,facet.field=f2).
 I don't want to use a copy field either because I want it to be
 dynamic at search time.
 As far as I know this is not possible for Solr 3.x...
 But I saw a new parameter named group.facet for Solr4. Could that
 solve my problem? If yes could somebody give me an example?

 Thanks,
 Marc.


Solr data export to CSV File

2012-04-13 Thread Pavnesh
Hi Team,

 

Many thanks to you guys who developed such a nice product.

I have one query regarding Solr: I have approximately 36 million records in my
Solr index and I want to export all the data to a CSV file, but I have found
nothing on this, so please help me on this topic.

 

 

Regards

Pavnesh

 



Re: How to read SOLR cache statistics?

2012-04-13 Thread Erick Erickson
Well, the place to start is here:
*stats*:  lookups : 98
*hits *: 59
*hitratio *: 0.60
*inserts *: 41
*evictions *: 0
*size *: 41

the important bits are hitratio and evictions.
Caches only really start to show their stuff
when the hit ratio is quite high. That's
the fraction of requests that are satisfied
by entries already in the cache (hits / lookups).
You want this number to be as high as possible, 0.90 or above.

evictions are the number of entries that have been
removed from the cache. The pre-configured
number is usually 512, so when the 513th entry
is inserted in the cache, some are removed
to make room and tallied in the evictions
section.
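
To make the arithmetic concrete, here is a tiny sketch using the numbers
posted above (these are just the relationships between the reported fields,
not any Solr API):

    public class CacheStatsArithmetic {
        public static void main(String[] args) {
            long lookups = 98, hits = 59, inserts = 41, evictions = 0;
            // hitratio is simply hits divided by lookups.
            System.out.printf("hitratio = %d / %d = %.2f%n",
                              hits, lookups, (double) hits / lookups);
            // Until entries start being evicted (and ignoring autowarmed
            // entries), size simply tracks inserts.
            System.out.println("size = inserts - evictions = "
                               + (inserts - evictions));
        }
    }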

Do note that some of the caches (documentCache
in particular) will rarely have a huge hit ratio due
to its nature, ditto with queryResultCache so you
can temporarily ignore those.

Best
Erick

On Fri, Apr 13, 2012 at 6:28 AM, Kashif Khan uplink2...@gmail.com wrote:
 Hi Li Li,

 I have been through that WIKI before, but it does not explain what
 *evictions*, *inserts*, *cumulative_inserts*, *cumulative_evictions*,
 *hitratio* and the rest are. These terms are foreign to me. What does the
 following line mean?

 *item_ABC :
 {field=ABC,memSize=340592,tindexSize=1192,time=1360,phase1=1344,nTerms=7373,bigTerms=1,termInstances=11513,uses=4}
 *

 I want that kind of explanation. I have read the wiki and the comments in
 the solrconfig.xml file about all these things, but neither says how to read
 the stats, which is very *important!!!*

 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/How-to-read-SOLR-cache-statistics-tp3907294p3907633.html
 Sent from the Solr - User mailing list archive at Nabble.com.


Re: performance impact using string or float when querying ranges

2012-04-13 Thread Erick Erickson
Well, I guess my first question is whether using strings
is fast enough, in which case there's little reason to
make your life more complex.

But yes, range queries will be significantly faster with
any of the Trie types than with strings. Trie types are
all numeric types.


Best
Erick

On Fri, Apr 13, 2012 at 3:49 AM, crive marco.cr...@gmail.com wrote:
 Hi All,
 is there a big difference in terms of performance when querying a range
 like [50.0 TO *] on a string field compared to a float field?

 At the moment I am using a dynamic field of type string to map some values
 coming from our database, and their type can vary depending on the context
 (float/integer/string); it is easier to use a dynamic field rather than having
 to create a bespoke field for each type of value.

 Marco


Re: Issues with language based indexing

2012-04-13 Thread Erick Erickson
Please review:
http://wiki.apache.org/solr/UsingMailingLists

there's so little information to go on here that I
really can't say anything that isn't a guess.

At a minimum we need the raw input, the
fieldType definitions from your schema,
the results of adding debugQuery=on
to your URL

Best
Erick

On Fri, Apr 13, 2012 at 6:04 AM, JGar jyothi.garladi...@citi.com wrote:
 Hello,

 I am new to Solr. It is returning some docs in my search for the string
 "Acciones y Valores". When I go and search for the same words in the given doc
 manually, I cannot find those words. Please help me understand on what basis
 the doc is found in the search.

 Thanks

 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Issues-with-language-based-indexing-tp3907601p3907601.html
 Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr data export to CSV File

2012-04-13 Thread Erick Erickson
Does this help?

http://wiki.apache.org/solr/CSVResponseWriter

Best
Erick

On Fri, Apr 13, 2012 at 7:59 AM, Pavnesh
pavnesh.ku...@altruistindia.com wrote:
 Hi Team,



 Many thanks to you guys who developed such a nice product.

 I have one query regarding Solr: I have approximately 36 million records in
 my Solr index and I want to export all the data to a CSV file, but I have
 found nothing on this, so please help me on this topic.





 Regards

 Pavnesh





RE: Realtime /get versus SearchHandler

2012-04-13 Thread Darren Govoni

Yes

--- Original Message ---
On 4/13/2012 06:25 AM Benson Margulies wrote:
 A discussion over on the dev list led me to expect that the by-id
 field retrievals in a SolrCloud query would come through the get
 handler. In fact, I've seen them turn up in my search component in the
 search handler that is configured with my custom QT. (I have a
 'prepare' method that sets ShardParams.QT to my QT to get my
 processing involved in the first of the two queries.) Did I overthink
 this?


RE: Solr data export to CSV File

2012-04-13 Thread Ben McCarthy
A combination of the CSV response writer and SolrJ to page through all of the
results, sending each line to something like Apache Commons IO FileUtils:

  FileUtils.writeStringToFile(new File("output.csv"),
      outputLine + System.getProperty("line.separator"), true);

Would be quite quick to knock up in Java.
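
A minimal sketch of that loop (assuming the SolrJ 3.x CommonsHttpSolrServer
client, hypothetical "id" and "title" stored fields, and ignoring proper CSV
quoting):

    import java.io.File;
    import org.apache.commons.io.FileUtils;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.common.SolrDocument;
    import org.apache.solr.common.SolrDocumentList;

    public class CsvExport {
        public static void main(String[] args) throws Exception {
            CommonsHttpSolrServer server =
                new CommonsHttpSolrServer("http://localhost:8983/solr");
            File out = new File("output.csv");
            String sep = System.getProperty("line.separator");
            int rows = 1000;
            for (int start = 0; ; start += rows) {
                SolrQuery q = new SolrQuery("*:*");
                q.setStart(start);
                q.setRows(rows);
                SolrDocumentList page = server.query(q).getResults();
                if (page.isEmpty()) break;
                for (SolrDocument doc : page) {
                    String line = doc.getFieldValue("id") + ","
                                + doc.getFieldValue("title");
                    FileUtils.writeStringToFile(out, line + sep, true); // append
                }
                // Note: start-based paging slows down deep into a 36M-doc index.
            }
        }
    }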

Thanks
Ben

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: 13 April 2012 13:28
To: solr-user@lucene.apache.org
Subject: Re: Solr data export to CSV File

Does this help?

http://wiki.apache.org/solr/CSVResponseWriter

Best
Erick

On Fri, Apr 13, 2012 at 7:59 AM, Pavnesh pavnesh.ku...@altruistindia.com 
wrote:
 Hi Team,



 A very-very thanks to you guy who had developed such a nice product.

 I have one query regarding solr that I have app 36 Million data in my
 solr and I wants to export all the data to a csv file but I have found
 nothing on the same  so please help me on this topic .





 Regards

 Pavnesh










Re: searching across multiple fields using edismax - am i setting this up right?

2012-04-13 Thread geeky2
thank you for the response.

it seems to be working well ;)

1) i tried your suggestion about removing the qt parameter - 

*somecore/partItemNoSearch*?q=dishwasher&debugQuery=on&rows=10

but this results in a 404 error message - is there some configuration i am
missing to support this short-hand syntax for specifying the requestHandler
in the url ?



2) ok - good suggestion.



3) yes it looks like it IS searching across all three (3) fields.

i noticed that for the itemNo field, it reduced the search string from
dishwasher to dishwash - is this because of stemming on the field type used
for the itemNo field?

<lst name="debug">
  <str name="rawquerystring">dishwasher</str>
  <str name="querystring">dishwasher</str>
  <str name="parsedquery">+DisjunctionMaxQuery((brand:dishwasher^0.5 |
    *itemNo:dishwash* | productType:dishwasher^0.8))</str>
  <str name="parsedquery_toString">+(brand:dishwasher^0.5 | itemNo:dishwash |
    productType:dishwasher^0.8)</str>





--
View this message in context: 
http://lucene.472066.n3.nabble.com/searching-across-multiple-fields-using-edismax-am-i-setting-this-up-right-tp3906334p3907875.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: searching across multiple fields using edismax - am i setting this up right?

2012-04-13 Thread Erick Erickson
as to 1) you have to define your request handler with
a leading "/", as in name="/partItemNoSearch". Don't
forget to restart your server.

3) Of course. The input terms MUST be run through
the associated analysis chain to have any hope of
matching correctly.

Best
Erick

On Fri, Apr 13, 2012 at 8:36 AM, geeky2 gee...@hotmail.com wrote:
 thank you for the response.

 it seems to be working well ;)

 1) i tried your suggestion about removing the qt parameter -

 *somecore/partItemNoSearch*?q=dishwasher&debugQuery=on&rows=10

 but this results in a 404 error message - is there some configuration i am
 missing to support this short-hand syntax for specifying the requestHandler
 in the url ?



 2) ok - good suggestion.



 3) yes it looks like it IS searching across all three (3) fields.

 i noticed that for the itemNo field, it reduced the search string from
 dishwasher to dishwash - is this because of stemming on the field type used
 for the itemNo field?

 <lst name="debug">
   <str name="rawquerystring">dishwasher</str>
   <str name="querystring">dishwasher</str>
   <str name="parsedquery">+DisjunctionMaxQuery((brand:dishwasher^0.5 |
     *itemNo:dishwash* | productType:dishwasher^0.8))</str>
   <str name="parsedquery_toString">+(brand:dishwasher^0.5 | itemNo:dishwash |
     productType:dishwasher^0.8)</str>





 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/searching-across-multiple-fields-using-edismax-am-i-setting-this-up-right-tp3906334p3907875.html
 Sent from the Solr - User mailing list archive at Nabble.com.


Errors during indexing

2012-04-13 Thread Ben McCarthy
Hello

We have just switched to Solr4 as we needed the ability to return geodist() 
along with our results.

I use a simple multithreaded java app and solr to ingest the data.  We keep 
seeing the following:

13-Apr-2012 15:50:10 org.apache.solr.common.SolrException log
SEVERE: null:org.apache.solr.common.SolrException: Error handling 'status' 
action
at 
org.apache.solr.handler.admin.CoreAdminHandler.handleStatusAction(CoreAdminHandler.java:546)
at 
org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:156)
at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
at 
org.apache.solr.servlet.SolrDispatchFilter.handleAdminRequest(SolrDispatchFilter.java:359)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:175)
at 
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
at 
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
at 
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
at 
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
at 
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
at 
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
at 
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
at 
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:293)
at 
org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:859)
at 
org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:602)
at 
org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.io.FileNotFoundException: /usr/solr4/data/index/_2jb.fnm (No 
such file or directory)
at java.io.RandomAccessFile.open(Native Method)
        at java.io.RandomAccessFile.<init>(RandomAccessFile.java:216)
at 
org.apache.lucene.store.MMapDirectory.openInput(MMapDirectory.java:219)
at 
org.apache.lucene.codecs.lucene40.Lucene40FieldInfosReader.read(Lucene40FieldInfosReader.java:47)
at 
org.apache.lucene.index.SegmentInfo.loadFieldInfos(SegmentInfo.java:201)
at 
org.apache.lucene.index.SegmentInfo.getFieldInfos(SegmentInfo.java:227)
at org.apache.lucene.index.SegmentInfo.files(SegmentInfo.java:415)
at org.apache.lucene.index.SegmentInfos.files(SegmentInfos.java:756)
at 
org.apache.lucene.index.StandardDirectoryReader$ReaderCommit.<init>(StandardDirectoryReader.java:369)
at 
org.apache.lucene.index.StandardDirectoryReader.getIndexCommit(StandardDirectoryReader.java:354)
at 
org.apache.solr.handler.admin.LukeRequestHandler.getIndexInfo(LukeRequestHandler.java:558)
at 
org.apache.solr.handler.admin.CoreAdminHandler.getCoreStatus(CoreAdminHandler.java:816)
at 
org.apache.solr.handler.admin.CoreAdminHandler.handleStatusAction(CoreAdminHandler.java:537)
... 16 more


This seems to happen when we're using the new admin tool. I'm checking on the
autocommit handler.

Has anyone seen anything similar?

Thanks
Ben







RE: solr 3.5 taking long to index

2012-04-13 Thread Rohit
Hi Shawn,

Thanks for the information; let me give this a try. Since this is a live box I
will try it during the weekend and update you.

Regards,
Rohit
Mobile: +91-9901768202
About Me: http://about.me/rohitg


-Original Message-
From: Shawn Heisey [mailto:s...@elyograg.org] 
Sent: 13 April 2012 11:01
To: solr-user@lucene.apache.org
Subject: Re: solr 3.5 taking long to index

On 4/12/2012 8:42 PM, Rohit wrote:
 The machine has a total RAM of around 46GB. My biggest concern is the Solr
 index time gradually increasing until the commit stops because of timeouts;
 our commit rate is very high, but I am not able to find the root cause of
 the issue.

For good performance, Solr relies on the OS having enough free RAM to keep 
critical portions of the index in the disk cache.  Some numbers that I have 
collected from your information so far are listed below.  
Please let me know if I've got any of this wrong:

46GB total RAM
36GB RAM allocated to Solr
300GB total index size

This leaves only 10GB of RAM free to cache 300GB of index, assuming that this 
server is dedicated to Solr.  The critical portions of your index are very 
likely considerably larger than 10GB, which causes constant reading from the 
disk for queries and updates.  With a high commit rate and a relatively low 
mergeFactor of 10, your index will be doing a lot of merging during updates, 
and some of those merges are likely to be quite large, further complicating the 
I/O situation.

Another thing that can lead to increasing index update times is cache warming, 
also greatly affected by high I/O levels.  If you visit the 
/solr/corename/admin/stats.jsp#cache URL, you can see the warmupTime for each 
cache in milliseconds.

Adding more memory to the server would probably help things.  You'll want to 
carefully check all the server and Solr statistics you can to make sure that 
memory is the root of problem, before you actually spend the money.  At the 
server level, look for things like a high iowait CPU percentage.  For Solr, you 
can turn the logging level up to INFO in the admin interface as well as turn on 
the infostream in solrconfig.xml for extensive debugging.

I hope this is helpful.  If not, I can try to come up with more specific things 
you can look at.

Thanks,
Shawn




Solr is not extracting the CDATA part of xml

2012-04-13 Thread srini
I am trying to use a method suggested on the Solr forum to pull out the CDATA
part of the XML, but it is not working: the result shows the whole XML content
instead of just the CDATA part.

schema.xml
<fieldType name="text_ws2" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <charFilter class="solr.MappingCharFilterFactory" mapping="mappings.txt"/>
  </analyzer>
</fieldType>

mappings.txt
 = 

my xml content
<body><![CDATA[ ... ]]></body>





--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-is-not-extracting-the-CDATA-part-of-xml-tp3908317p3908317.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr is not extracting the CDATA part of xml

2012-04-13 Thread srini
not sure why the CDATA part did not get interpreted. this is how the xml
content looks; I added quotes just to present the exact xml content.

"<body><![CDATA[ ... ]]></body>"

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-is-not-extracting-the-CDATA-part-of-xml-tp3908317p3908341.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: performance impact using string or float when querying ranges

2012-04-13 Thread Yonik Seeley
On Fri, Apr 13, 2012 at 8:11 AM, Erick Erickson erickerick...@gmail.com wrote:
 Well, I guess my first question is whether using strings
 is fast enough, in which case there's little reason to
 make your life more complex.

 But yes, range queries will be significantly faster with
 any of the Trie types than with strings.

To elaborate on this point a bit... range queries on strings will be
the same speed as a numeric field with precisionStep=0.
You need a precisionStep > 0 (so the number will be indexed in
multiple parts) to speed up range queries on numeric fields.  (See
"int" vs "tint" in the solr schema).

-Yonik
lucenerevolution.com - Lucene/Solr Open Source Search Conference.
Boston May 7-10




 Trie types are
 all numeric types.


 Best
 Erick

 On Fri, Apr 13, 2012 at 3:49 AM, crive marco.cr...@gmail.com wrote:
 Hi All,
  is there a big difference in terms of performance when querying a range
  like [50.0 TO *] on a string field compared to a float field?

  At the moment I am using a dynamic field of type string to map some values
  coming from our database, and their type can vary depending on the context
  (float/integer/string); it is easier to use a dynamic field rather than
  having to create a bespoke field for each type of value.

 Marco


mergePolicy element format change in 3.6 vs 3.5?

2012-04-13 Thread Peter Wolanin
Trying to maintain the Drupal integration module across multiple versions
of 3.x, we've gotten a bug report suggesting that Solr 3.6 needs this
change to solrconfig:

- <mergePolicy>org.apache.lucene.index.LogByteSizeMergePolicy</mergePolicy>
+ <mergePolicy class="org.apache.lucene.index.LogByteSizeMergePolicy" />


I don't see this mentioned in the release notes - is the second format
useable with 3.5, 3.4, etc?

-- 
Peter M. Wolanin, Ph.D.  : Momentum Specialist,  Acquia. Inc.
peter.wola...@acquia.com : 781-313-8322

Get a free, hosted Drupal 7 site: http://www.drupalgardens.com


RE: mergePolicy element format change in 3.6 vs 3.5?

2012-04-13 Thread Michael Ryan
It looks like the first format was removed in 3.6 as part of 
https://issues.apache.org/jira/browse/SOLR-1052. The second format works in all 
3.x versions.

-Michael

-Original Message-
From: Peter Wolanin [mailto:peter.wola...@acquia.com] 
Sent: Friday, April 13, 2012 12:32 PM
To: solr-user@lucene.apache.org
Subject: mergePolicy element format change in 3.6 vs 3.5?

Trying to maintain the Drupal integration module across multiple versions
of 3.x, we've gotten a bug report suggesting that Solr 3.6 needs this
change to solrconfig:

- <mergePolicy>org.apache.lucene.index.LogByteSizeMergePolicy</mergePolicy>
+ <mergePolicy class="org.apache.lucene.index.LogByteSizeMergePolicy" />


I don't see this mentioned in the release notes - is the second format
useable with 3.5, 3.4, etc?


Re: mergePolicy element format change in 3.6 vs 3.5?

2012-04-13 Thread Peter Wolanin
Ok, thanks for the info.  As long as the second one works, we can just use
that.

I just verified that it works for 3.5 at least.

-Peter

On Fri, Apr 13, 2012 at 1:12 PM, Michael Ryan mr...@moreover.com wrote:

 It looks like the first format was removed in 3.6 as part of
 https://issues.apache.org/jira/browse/SOLR-1052. The second format works
 in all 3.x versions.

 -Michael

 -Original Message-
 From: Peter Wolanin [mailto:peter.wola...@acquia.com]
 Sent: Friday, April 13, 2012 12:32 PM
 To: solr-user@lucene.apache.org
 Subject: mergePolicy element format change in 3.6 vs 3.5?

 Trying to maintain the Drupal integration module across multiple versions
 of 3.x, we've gotten a bug report suggesting that Solr 3.6 needs this
 change to solrconfig:

 - <mergePolicy>org.apache.lucene.index.LogByteSizeMergePolicy</mergePolicy>
 + <mergePolicy class="org.apache.lucene.index.LogByteSizeMergePolicy" />


 I don't see this mentioned in the release notes - is the second format
 useable with 3.5, 3.4, etc?




-- 
Peter M. Wolanin, Ph.D.  : Momentum Specialist,  Acquia. Inc.
peter.wola...@acquia.com : 781-313-8322

Get a free, hosted Drupal 7 site: http://www.drupalgardens.com


Re: Solr is not extracting the CDATA part of xml

2012-04-13 Thread Erick Erickson
Solr does not index arbitrary XML content. There is an XML
form of a Solr document that can be sent to Solr, but it is
a specific form of XML.

An example of the XML you're trying to index and what you mean
by not working would be helpful.

Best
Erick

On Fri, Apr 13, 2012 at 11:50 AM, srini softtec...@gmail.com wrote:
 not sure why the CDATA part did not get interpreted. This is how the xml
 content looks. I added quotes just to present the exact xml content.

 <body></body>

 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Solr-is-not-extracting-the-CDATA-part-of-xml-tp3908317p3908341.html
 Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr is not extracting the CDATA part of xml

2012-04-13 Thread srini
Erick,

Thanks for your reply. When you say Solr does not index arbitrary xml
documents: below is the way my xml document looks, which is sitting
in Oracle. Could you suggest the best way of indexing it? Which method should I
follow? Should I use XPathEntityProcessor?

<?xml version="1.0" encoding="UTF-8" ?>
<message xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns="someurl" xmlns:csp="someurl.xsd" xsi:schemaLocation="somelocation
jar:" id="002" message-type="create">
<content>
 <dsp:row>
  <dsp:channel>100</dsp:channel>
  <dsp:role>115</dsp:role>
  </dsp:row>

 </body></content></message>

Thanks in Advance
Erick Erickson wrote
 
 Solr does not index arbitrary XML content. There is an XML
 form of a Solr document that can be sent to Solr, but it is
 a specific form of XML.
 
 An example of the XML you're trying to index and what you mean
 by not working would be helpful.
 
 Best
 Erick
 
 On Fri, Apr 13, 2012 at 11:50 AM, srini <softtech88@> wrote:
 not sure why the CDATA part did not get interpreted. This is how the xml
 content looks. I added quotes just to present the exact xml content.

 <body></body>

 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Solr-is-not-extracting-the-CDATA-part-of-xml-tp3908317p3908341.html
 Sent from the Solr - User mailing list archive at Nabble.com.
 

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-is-not-extracting-the-CDATA-part-of-xml-tp3908317p3908791.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr is not extracting the CDATA part of xml

2012-04-13 Thread Erick Erickson
Right, that will not work at all for direct transmission to
Solr.

You could write a Java program that parses this and sends
it to Solr via SolrJ.

Personally I haven't connected a database to Solr with
XPathEntityProcessor in the mix, but I believe I've seen
messages go by with this configuration. You might want
to search the mail archive...
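
Something like this might work as a starting point -- an untested SolrJ
sketch, with the field names, the XPath, and the sample record as
assumptions (real namespaces like dsp: would need extra handling):

import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class XmlRecordIndexer {
  public static void main(String[] args) throws Exception {
    // stand-in for one row fetched from the database
    String xml = "<message id=\"002\"><content><body><![CDATA[some text]]></body></content></message>";

    Document dom = DocumentBuilderFactory.newInstance()
        .newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
    XPath xp = XPathFactory.newInstance().newXPath();

    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", xp.evaluate("/message/@id", dom));
    // evaluate() returns the element's text content, CDATA included
    doc.addField("content", xp.evaluate("/message/content/body", dom));

    SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
    server.add(doc);
    server.commit();
  }
}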

Best
Erick

On Fri, Apr 13, 2012 at 3:13 PM, srini softtec...@gmail.com wrote:
 Erick,

 Thanks for your reply. When you say Solr does not index arbitrary xml
 documents: below is the way my xml document looks, which is sitting
 in Oracle. Could you suggest the best way of indexing it? Which method should I
 follow? Should I use XPathEntityProcessor?

 <?xml version="1.0" encoding="UTF-8" ?>
 <message xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
 xmlns="someurl" xmlns:csp="someurl.xsd" xsi:schemaLocation="somelocation
 jar:" id="002" message-type="create">
 <content>
     <dsp:row>
      <dsp:channel>100</dsp:channel>
      <dsp:role>115</dsp:role>
      </dsp:row>

  </body></content></message>

 Thanks in Advance
 Erick Erickson wrote

 Solr does not index arbitrary XML content. There is an XML
 form of a Solr document that can be sent to Solr, but it is
 a specific form of XML.

 An example of the XML you're trying to index and what you mean
 by not working would be helpful.

 Best
 Erick

 On Fri, Apr 13, 2012 at 11:50 AM, srini <softtech88@> wrote:
 not sure why the CDATA part did not get interpreted. This is how the xml
 content looks. I added quotes just to present the exact xml content.

 <body></body>

 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Solr-is-not-extracting-the-CDATA-part-of-xml-tp3908317p3908341.html
 Sent from the Solr - User mailing list archive at Nabble.com.


 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Solr-is-not-extracting-the-CDATA-part-of-xml-tp3908317p3908791.html
 Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr is not extracting the CDATA part of xml

2012-04-13 Thread Alexander Aristov
Hi

This is not Solr format. You must re-format your XML into Solr XML. You may
find examples on the Solr wiki or in the Solr examples dir.

Best Regards
Alexander Aristov


On 13 April 2012 23:13, srini softtec...@gmail.com wrote:

 Erick,

 Thanks for your reply. When you say Solr does not index arbitrary xml
 documents: below is the way my xml document looks, which is sitting in
 Oracle. Could you suggest the best way of indexing it? Which method should I
 follow? Should I use XPathEntityProcessor?

 <?xml version="1.0" encoding="UTF-8" ?>
 <message xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
 xmlns="someurl" xmlns:csp="someurl.xsd" xsi:schemaLocation="somelocation
 jar:" id="002" message-type="create">
 <content>
 <dsp:row>
  <dsp:channel>100</dsp:channel>
  <dsp:role>115</dsp:role>
  </dsp:row>

  </body></content></message>

 Thanks in Advance
 Erick Erickson wrote
 
  Solr does not index arbitrary XML content. There is an XML
  form of a Solr document that can be sent to Solr, but it is
  a specific form of XML.
 
  An example of the XML you're trying to index and what you mean
  by not working would be helpful.
 
  Best
  Erick
 
  On Fri, Apr 13, 2012 at 11:50 AM, srini <softtech88@> wrote:
  not sure why the CDATA part did not get interpreted. This is how the xml
  content looks. I added quotes just to present the exact xml content.
 
  <body></body>
 
  --
  View this message in context:
 
 http://lucene.472066.n3.nabble.com/Solr-is-not-extracting-the-CDATA-part-of-xml-tp3908317p3908341.html
  Sent from the Solr - User mailing list archive at Nabble.com.
 

 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Solr-is-not-extracting-the-CDATA-part-of-xml-tp3908317p3908791.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Solr is not extracting the CDATA part of xml

2012-04-13 Thread srini
Thanks again for the quick reply. I am a little curious about the procedure you
suggested. I had thought of using the same procedure: writing a Java program
to fetch the xml records from the db, parse the content, and hand it to
Solr for indexing.

But what if my database content gets changed? Should I re-run my Java program
to fetch the xml and add it to Solr for re-indexing?

The format of my xml content does not match the Solr example xml formats. Any
suggestions here?

When I import xml records from Oracle, add them to Solr, and search for a
word, Solr displays the whole xml doc which has that word. What is wrong with
this procedure? (I do see my search word in the content of the xml; the only
bad part is that it displays the whole doc instead of just the CDATA part.)
Please suggest if there is a better way of doing this task other than SolrJ.

Thanks in Advance
Srini





--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-is-not-extracting-the-CDATA-part-of-xml-tp3908317p3908825.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Boosting StandardQuery scores with a subquery?

2012-04-13 Thread Chris Hostetter

: I'm having some trouble wrapping my head around boosting StandardQueries.
: It looks like the function: query(subquery, default)
: http://wiki.apache.org/solr/FunctionQuery#query is what I want, but the
: examples seem to focus on just returning a score (e.g. product of popularity
: and the score of the subquery). I assume my difficulty stems from the fact
: that I'd like to retrieve highlighting from one query, but impact score and
: 'relevance' by a different (sub)query.

if your primary concern is just having highlighting on some words, while 
lots of other words contribute to the score, then you should take a look at 
the hl.q param introduced in Solr 3.5...

http://wiki.apache.org/solr/HighlightingParameters#hl.q

That lets you completely separate the two if you'd like.

you can even use local param syntax to reduce duplication...

  q={!v=$qq}
  qq=content:("roi" "return on investment" "return investment"~5)
  hl.q={!v=$qq}
  fq=extension:(pdf doc)
  boost=keywords:(financial investment profit loss) 
title:(financial investment profit loss) 
url:(investment investor relations phoenix)

...should work, I think.

-Hoss


Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environment

2012-04-13 Thread Jan Høydahl
Hi,

For a web crawl+search like this you will probably need a lot of additional Big 
Data crunching, so a Hadoop based solution is wise.

In addition to those products mentioned we also now have Amazon's own 
CloudSearch, http://aws.amazon.com/cloudsearch/. It's new and not as cool as Solr 
(not even Lucene based), but it gives you the elasticity you request, I guess. If 
you run your Hadoop cluster in EC2 already it would be quite efficient to 
batch-load the crawled and processed data into a SearchDomain in the same 
availability zone. However, both cost and features may prohibit this as a 
realistic choice for you.

It would be cool to explore a Hadoop/HDFS + SolrCloud integration. SolrCloud 
would not build the indexes, but be pulling pre-built indexes from HDFS down to 
local disk every time it's told to. Or perhaps the SolrCloud nodes could be 
part of the hadoop cluster, being responsible for the Reduce part building the 
indexes?

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

On 13. apr. 2012, at 04:23, Otis Gospodnetic wrote:

 Hello Ali,
 
 I'm trying to setup a large scale *Crawl + Index + Search *infrastructure
 
 using Nutch and Solr/Lucene. The targeted scale is *5 Billion web pages*,
 crawled + indexed every *4 weeks, *with a search latency of less than 0.5
 seconds.
 
 
 That's fine.  Whether it's doable with any tech will depend on how much 
 hardware you give it, among other things.
 
 Needless to mention, the search index needs to scale to 5Billion pages. It
 is also possible that I might need to store multiple indexes -- one for
 crawled content, and one for ancillary data that is also very large. Each
 of these indices would likely require a logically distributed and
 replicated index.
 
 
 Yup, OK.
 
 However, I would like for such a system to be homogenous with the Hadoop
 infrastructure that is already installed on the cluster (for the crawl). In
 other words, I would much prefer if the replication and distribution of the
 Solr/Lucene index be done automagically on top of Hadoop/HDFS, instead of
 using another scalability framework (such as SolrCloud). In addition, it
 would be ideal if this environment was flexible enough to be dynamically
 scaled based on the size requirements of the index and the search traffic
 at the time (i.e. if it is deployed on an Amazon cluster, it should be easy
 enough to automatically provision additional processing power into the
 cluster without requiring server re-starts).
 
 
 There is no such thing just yet.
 There is no Search+Hadoop/HDFS in a box just yet.  There was an attempt to 
 automatically index HBase content, but that was either not completed or not 
 committed into HBase.
 
 However, I'm not sure which Solr-based tool in the Hadoop ecosystem would
 be ideal for this scenario. I've heard mention of Solr-on-HBase, Solandra,
 Lily, ElasticSearch, IndexTank etc, but I'm really unsure which of these is
 mature enough and would be the right architectural choice to go along with
 a Nutch crawler setup, and to also satisfy the dynamic/auto-scaling aspects
 above.
 
 
 Here is a summary on all of them:
 * Search on HBase - I assume you are referring to the same thing I mentioned 
 above.  Not ready.
 * Solandra - uses Cassandra+Solr, plus DataStax now has a different 
 (commercial) offering that combines search and Cassandra.  Looks good.
 * Lily - data stored in HBase cluster gets indexed to a separate Solr 
 instance(s)  on the side.  Not really integrated the way you want it to be.
 * ElasticSearch - solid at this point, the most dynamic solution today, can 
 scale well (we are working on a many-B documents index and hundreds of 
 nodes with ElasticSearch right now), etc.  But again, not integrated with 
 Hadoop the way you want it.
 * IndexTank - has some technical weaknesses, not integrated with Hadoop, not 
 sure about its future considering LinkedIn uses Zoie and Sensei already.
 * And there is SolrCloud, which is coming soon and will be solid, but is 
 again not integrated.
 
 If I were you and I had to pick today - I'd pick ElasticSearch if I were 
 completely open.  If I had Solr bias I'd give SolrCloud a try first.
 
 Lastly, how much hardware (assuming a medium sized EC2 instance) would you
 estimate my needing with this setup, for regular web-data (HTML text) at
 this scale?
 
 I don't know off the top of my head, but I'm guessing several hundred for 
 serving search requests.
 
 HTH,
 
 Otis
 --
 Search Analytics - http://sematext.com/search-analytics/index.html
 
 Scalable Performance Monitoring - http://sematext.com/spm/index.html
 
 
 Any architectural guidance would be greatly appreciated. The more details
 provided, the wider my grin :).
 
 Many many thanks in advance.
 
 Thanks,
 Safdar
 



Re: Post Sorting hook before the doc slicing.

2012-04-13 Thread Chris Hostetter

: Basically, I need to find item X in the result set and return say N items
: before and N items after.
: 
:  - N items -- Item X --- N items 
...
: So I might be wrong, but it looks like the only way would be to create a
: custom SolrIndexSearcher which will find the offset and create the related
: docslice. That slicing part doesn't seem to be well factored that I can
: see, so it seems to imply copy/pasting a significant chunk off the code. Am
: I looking at the wrong place ?

trying to do this as a hook into the SolrIndexSearcher would definitely be 
complicated ... largely because of how matches are collected.

the most straightforward way I can think of to get the data you want is 
to consider what you are sorting on, and use that as a range filter, ie...

1) do your search, and filter on id:X
2) look at the values X has in the fields you are sorting on
3) search again, this time filter on those fields, asking for the first N 
docs with values greater than whatever id:X has
4) search again, this time reverse your sort, and reverse your filters 
(docs with values less than whatever id:X has) and get the first N docs.


...even if your sort is score you can use the frange parser to filter 
(not usually recommended for score, but possible)
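
For concreteness, a sketch of those steps assuming the sort is on a
hypothetical numeric field named price, and step 2 showed that doc X has
price=42.0 (ties would need a secondary sort and filter):

  q=*:*&fq=id:X&fl=id,price                           (read off X's sort value)
  q=*:*&fq=price:{42.0 TO *}&sort=price asc&rows=N    (the N docs after X)
  q=*:*&fq=price:{* TO 42.0}&sort=price desc&rows=N   (the N docs before X, reversed)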



-Hoss


Re: Can I discover what part of a score is attributable to a subquery?

2012-04-13 Thread John Chee
On Fri, Apr 13, 2012 at 2:40 PM, Benson Margulies bimargul...@gmail.com wrote:
 Given a query including a subquery, is there any way for me to learn
 that subquery's contribution to the overall document score?

 I can provide 'why on earth would anyone ...' if someone wants to know.

Have you tried debugQuery=true?
http://wiki.apache.org/solr/CommonQueryParameters#debugQuery The
'explain' field of the result explains the scoring of each document.


Re: two structures in solr

2012-04-13 Thread Chris Hostetter

: I need to store *two big structures* in SOLR: projects and contractors.
: Contractors will search for available projects and project owners will
: search for contractors who would do it for them.

http://wiki.apache.org/solr/MultipleIndexes

: that *I want to have two structures*. I guess running two parallel solr
: instances is not the idea. I took a look at

there's nothing wrong with it, the real question is whether you ever need 
to do things with both sets of documents at once.

if contractors only ever search for projects, and project owners only ever 
search for contractors, and no one ever searches for a mix of projects and 
contractors at the same time, then I would just suggest using multiple 
SolrCores...

http://wiki.apache.org/solr/MultipleIndexes#MultiCore
http://wiki.apache.org/solr/CoreAdmin
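
A minimal solr.xml sketch for that layout (core names are illustrative):

<solr persistent="true">
  <cores adminPath="/admin/cores">
    <core name="projects" instanceDir="projects" />
    <core name="contractors" instanceDir="contractors" />
  </cores>
</solr>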


-Hoss


Re: term frequency outweighs exact phrase match

2012-04-13 Thread alxsss
Hello Hoss,

Here are the explain tags for two docs:

<str name="a0127d8e70a6d523">
0.021646015 = (MATCH) sum of:
  0.021646015 = (MATCH) sum of:
0.02141003 = (MATCH) max plus 0.01 times others of:
  2.84194E-4 = (MATCH) weight(content:apache^0.5 in 3578), product of:
0.0029881175 = queryWeight(content:apache^0.5), product of:
  0.5 = boost
  4.3554416 = idf(docFreq=126092, maxDocs=3613605)
  0.0013721307 = queryNorm
0.09510804 = (MATCH) fieldWeight(content:apache in 3578), product of:
  2.236068 = tf(termFreq(content:apache)=5)
  4.3554416 = idf(docFreq=126092, maxDocs=3613605)
  0.009765625 = fieldNorm(field=content, doc=3578)
  0.021407187 = (MATCH) weight(title:apache^1.2 in 3578), product of:
0.01371095 = queryWeight(title:apache^1.2), product of:
  1.2 = boost
  8.327043 = idf(docFreq=2375, maxDocs=3613605)
  0.0013721307 = queryNorm
1.5613205 = (MATCH) fieldWeight(title:apache in 3578), product of:
  1.0 = tf(termFreq(title:apache)=1)
  8.327043 = idf(docFreq=2375, maxDocs=3613605)
  0.1875 = fieldNorm(field=title, doc=3578)
2.359865E-4 = (MATCH) max plus 0.01 times others of:
  2.359865E-4 = (MATCH) weight(content:solr^0.5 in 3578), product of:
0.004071705 = queryWeight(content:solr^0.5), product of:
  0.5 = boost
  5.9348645 = idf(docFreq=25986, maxDocs=3613605)
  0.0013721307 = queryNorm
0.05795766 = (MATCH) fieldWeight(content:solr in 3578), product of:
  1.0 = tf(termFreq(content:solr)=1)
  5.9348645 = idf(docFreq=25986, maxDocs=3613605)
  0.009765625 = fieldNorm(field=content, doc=3578)
</str><str name="d89380e313c64aa5">
0.021465056 = (MATCH) sum of:
  1.8154096E-4 = (MATCH) sum of:
6.354771E-5 = (MATCH) max plus 0.01 times others of:
  6.354771E-5 = (MATCH) weight(content:apache^0.5 in 638040), product of:
0.0029881175 = queryWeight(content:apache^0.5), product of:
  0.5 = boost
  4.3554416 = idf(docFreq=126092, maxDocs=3613605)
  0.0013721307 = queryNorm
0.021266805 = (MATCH) fieldWeight(content:apache in 638040), product of:
  1.0 = tf(termFreq(content:apache)=1)
  4.3554416 = idf(docFreq=126092, maxDocs=3613605)
  0.0048828125 = fieldNorm(field=content, doc=638040)
1.1799325E-4 = (MATCH) max plus 0.01 times others of:
  1.1799325E-4 = (MATCH) weight(content:solr^0.5 in 638040), product of:
0.004071705 = queryWeight(content:solr^0.5), product of:
  0.5 = boost
  5.9348645 = idf(docFreq=25986, maxDocs=3613605)
  0.0013721307 = queryNorm
0.02897883 = (MATCH) fieldWeight(content:solr in 638040), product of:
  1.0 = tf(termFreq(content:solr)=1)
  5.9348645 = idf(docFreq=25986, maxDocs=3613605)
  0.0048828125 = fieldNorm(field=content, doc=638040)
  0.021283515 = (MATCH) weight(content:"apache solr"~1^30.0 in 638040), product 
of:
    0.42358932 = queryWeight(content:"apache solr"~1^30.0), product of:
      30.0 = boost
      10.290306 = idf(content: apache=126092 solr=25986)
      0.0013721307 = queryNorm
    0.050245635 = fieldWeight(content:"apache solr" in 638040), product of:
      1.0 = tf(phraseFreq=1.0)
      10.290306 = idf(content: apache=126092 solr=25986)
      0.0048828125 = fieldNorm(field=content, doc=638040)
</str>

Although the second doc has the exact match, it is placed after the first one, which 
does not have the exact match.

I use the following request handler

<requestHandler name="search" class="solr.SearchHandler">
<lst name="defaults">
<str name="defType">edismax</str>
<str name="echoParams">explicit</str>
<float name="tie">0.01</float>
<str name="qf">host^30  content^0.5 title^1.2 anchor^1.2</str>
<str name="pf">content^30</str>
<str name="fl">url,id, site ,title</str>
<str name="mm">2&lt;-1 5&lt;-2 6&lt;90%</str>
<int name="ps">1</int>
<bool name="hl">true</bool>
<str name="q.alt">*:*</str>
<str name="hl.fl">content</str>
<str name="f.title.hl.fragsize">0</str>
<str name="hl.fragsize">165</str>
<str name="f.title.hl.alternateField">title</str>
<str name="f.url.hl.fragsize">0</str>
<str name="f.url.hl.alternateField">url</str>
<str name="f.content.hl.fragmenter">regex</str>
<str name="spellcheck">true</str>
<str name="spellcheck.collate">true</str>
<str name="spellcheck.count">5</str>
<str name="group">true</str>
<str name="group.field">site</str>
<str name="group.ngroups">true</str>
</lst>
<arr name="last-components">
 <str>spellcheck</str>
</arr>
</requestHandler>


and the query is as follows 

http://localhost:8983/solr/select/?q=apache solr&version=2.2&start=0&rows=10&indent=on&qt=search&debugQuery=true

Thanks.
Alex.


-Original Message-
From: Chris Hostetter hossman_luc...@fucit.org
To: solr-user solr-user@lucene.apache.org
Sent: Thu, Apr 12, 2012 7:43 pm
Subject: Re: term frequency outweighs exact phrase match



: I use solr 3.5 with edismax. I have the following issue with phrase 
: search. For example if I have three documents with content like
: 
: 1.apache apache
: 2. solr solr
: 

Re: Can I discover what part of a score is attributable to a subquery?

2012-04-13 Thread Benson Margulies
On Fri, Apr 13, 2012 at 6:43 PM, John Chee johnc...@mylife.com wrote:
 On Fri, Apr 13, 2012 at 2:40 PM, Benson Margulies bimargul...@gmail.com 
 wrote:
 Given a query including a subquery, is there any way for me to learn
 that subquery's contribution to the overall document score?

I need this number to be available in a SearchComponent that runs
after QueryComponent.



 I can provide 'why on earth would anyone ...' if someone wants to know.

 Have you tried debugQuery=true?
 http://wiki.apache.org/solr/CommonQueryParameters#debugQuery The
 'explain' field of the result explains the scoring of each document.


Re: Can I discover what part of a score is attributable to a subquery?

2012-04-13 Thread Chris Hostetter

: Given a query including a subquery, is there any way for me to learn
: that subquery's contribution to the overall document score?

You have to just execute the subquery itself ... doc collection 
and score calculation doesn't keep track of the subscores.

you could do this using functions in the fl, but since you mentioned 
wanting this in a SearchComponent, just pass the subquery to 
SolrIndexSearcher using a DocSet filter of the current page (ie: make your 
own DocSet based on the current DocList)
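
A hedged sketch of a component along those lines -- my own illustration
against the 3.x APIs, using a hypothetical subq parameter, and scoring via
explain() per document rather than the DocSet-filtered search described
above (simpler to show, but not the cheapest route):

import java.io.IOException;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.search.Query;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.handler.component.ResponseBuilder;
import org.apache.solr.handler.component.SearchComponent;
import org.apache.solr.search.DocIterator;
import org.apache.solr.search.QParser;
import org.apache.solr.search.SolrIndexSearcher;

public class SubScoreComponent extends SearchComponent {

  @Override
  public void prepare(ResponseBuilder rb) throws IOException {}

  @Override
  public void process(ResponseBuilder rb) throws IOException {
    String sub = rb.req.getParams().get("subq"); // hypothetical parameter
    if (sub == null || rb.getResults() == null) return;
    try {
      Query subQuery = QParser.getParser(sub, null, rb.req).getQuery();
      SolrIndexSearcher searcher = rb.req.getSearcher();
      NamedList<Float> subScores = new NamedList<Float>();
      DocIterator it = rb.getResults().docList.iterator();
      while (it.hasNext()) {
        int docId = it.nextDoc();
        // explain() re-scores this one doc against the subquery
        subScores.add(Integer.toString(docId),
                      searcher.explain(subQuery, docId).getValue());
      }
      rb.rsp.add("subscores", subScores);
    } catch (ParseException e) {
      throw new RuntimeException(e);
    }
  }

  @Override public String getDescription() { return "subquery score component"; }
  @Override public String getSource() { return "$URL$"; }
  @Override public String getSourceId() { return "$Id$"; }
  @Override public String getVersion() { return "1.0"; }
}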


-Hoss


Re: Solr is not extracting the CDATA part of xml

2012-04-13 Thread Lance Norskog
This all comes from a database? Here is what you want.

The DataImportHandler includes a toolkit for doing full and
incremental loading from databases.

Read this first:
http://www.lucidimagination.com/search/link?url=http://wiki.apache.org/solr/DIHQuickStart

Then these:
http://www.lucidimagination.com/search/link?url=http://wiki.apache.org/solr/DataImportHandlerFaq
http://lucidworks.lucidimagination.com/display/solr/Uploading+Structured+Data+Store+Data+with+the+Data+Import+Handler

After you try the procedure in QuickStart and read the other two, if
you still have questions please ask.
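
For the XML-in-a-CLOB case specifically, a hedged sketch of a DIH config
that nests XPathEntityProcessor inside a JDBC entity (driver, table, and
column names are placeholders, and XPathEntityProcessor supports only a
limited XPath subset, so namespaced documents can be troublesome):

<dataConfig>
  <dataSource name="db" driver="oracle.jdbc.OracleDriver"
              url="jdbc:oracle:thin:@//dbhost:1521/SID" user="user" password="pass"/>
  <dataSource name="fld" type="FieldReaderDataSource"/>
  <document>
    <entity name="rec" dataSource="db" transformer="ClobTransformer"
            query="select id, xml_col from my_table">
      <field column="xml_col" clob="true"/>
      <entity name="msg" dataSource="fld" dataField="rec.xml_col"
              processor="XPathEntityProcessor" forEach="/message">
        <field column="content" xpath="/message/content/body"/>
      </entity>
    </entity>
  </document>
</dataConfig>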

Cheers!

On Fri, Apr 13, 2012 at 12:34 PM, srini softtec...@gmail.com wrote:
 Thanks again for the quick reply. I am a little curious about the procedure you
 suggested. I had thought of using the same procedure: writing a Java program
 to fetch the xml records from the db, parse the content, and hand it to
 Solr for indexing.

 But what if my database content gets changed? Should I re-run my Java program
 to fetch the xml and add it to Solr for re-indexing?

 The format of my xml content does not match the Solr example xml formats. Any
 suggestions here?

 When I import xml records from Oracle, add them to Solr, and search for a
 word, Solr displays the whole xml doc which has that word. What is wrong with
 this procedure? (I do see my search word in the content of the xml; the only
 bad part is that it displays the whole doc instead of just the CDATA part.)
 Please suggest if there is a better way of doing this task other than SolrJ.

 Thanks in Advance
 Srini





 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Solr-is-not-extracting-the-CDATA-part-of-xml-tp3908317p3908825.html
 Sent from the Solr - User mailing list archive at Nabble.com.



-- 
Lance Norskog
goks...@gmail.com


Re: Can I discover what part of a score is attributable to a subquery?

2012-04-13 Thread Benson Margulies
On Fri, Apr 13, 2012 at 7:07 PM, Chris Hostetter
hossman_luc...@fucit.org wrote:

 : Given a query including a subquery, is there any way for me to learn
 : that subquery's contribution to the overall document score?

 You have to just execute the subquery itself ... doc collection
 and score calculation doesn't keep track of the subscores.

 you could do this using functions in the fl, but since you mentioned
 wanting this in a SearchComponent, just pass the subquery to
 SolrIndexSearcher using a DocSet filter of the current page (ie: make your
 own DocSet based on the current DocList)

I get it. Some fairly intricate dancing can then ensue with SolrCloud. Thanks.



 -Hoss


Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environment

2012-04-13 Thread Ali S Kureishy
Thanks Otis.

I really appreciate the details offered here. This was very helpful
information.

I'm going to go through Solandra and ElasticSearch and see if those make
sense. I was also given a suggestion to use SolrCloud on FuseDFS (that's
two recommendations for SolrCloud so far), so I will give that a shot when
it is available. However, do you know when SolrCloud IS expected to be
available?

Thanks again!

Warm regards,
Safdar



On Fri, Apr 13, 2012 at 5:23 AM, Otis Gospodnetic 
otis_gospodne...@yahoo.com wrote:

 Hello Ali,

  I'm trying to setup a large scale *Crawl + Index + Search *infrastructure

  using Nutch and Solr/Lucene. The targeted scale is *5 Billion web pages*,
  crawled + indexed every *4 weeks, *with a search latency of less than 0.5
  seconds.


 That's fine.  Whether it's doable with any tech will depend on how much
 hardware you give it, among other things.

  Needless to mention, the search index needs to scale to 5Billion pages.
 It
  is also possible that I might need to store multiple indexes -- one for
  crawled content, and one for ancillary data that is also very large. Each
  of these indices would likely require a logically distributed and
  replicated index.


 Yup, OK.

  However, I would like for such a system to be homogenous with the Hadoop
  infrastructure that is already installed on the cluster (for the crawl).
 In
  other words, I would much prefer if the replication and distribution of
 the
  Solr/Lucene index be done automagically on top of Hadoop/HDFS, instead of
  using another scalability framework (such as SolrCloud). In addition, it
  would be ideal if this environment was flexible enough to be dynamically
  scaled based on the size requirements of the index and the search traffic
  at the time (i.e. if it is deployed on an Amazon cluster, it should be
 easy
  enough to automatically provision additional processing power into the
  cluster without requiring server re-starts).


 There is no such thing just yet.
 There is no Search+Hadoop/HDFS in a box just yet.  There was an attempt to
 automatically index HBase content, but that was either not completed or not
 committed into HBase.

  However, I'm not sure which Solr-based tool in the Hadoop ecosystem would
  be ideal for this scenario. I've heard mention of Solr-on-HBase,
 Solandra,
  Lily, ElasticSearch, IndexTank etc, but I'm really unsure which of these
 is
  mature enough and would be the right architectural choice to go along
 with
  a Nutch crawler setup, and to also satisfy the dynamic/auto-scaling
 aspects
  above.


 Here is a summary on all of them:
 * Search on HBase - I assume you are referring to the same thing I
 mentioned above.  Not ready.
 * Solandra - uses Cassandra+Solr, plus DataStax now has a different
 (commercial) offering that combines search and Cassandra.  Looks good.
 * Lily - data stored in HBase cluster gets indexed to a separate Solr
 instance(s)  on the side.  Not really integrated the way you want it to be.
 * ElasticSearch - solid at this point, the most dynamic solution today,
 can scale well (we are working on a many-B documents index and hundreds
 of nodes with ElasticSearch right now), etc.  But again, not integrated
 with Hadoop the way you want it.
 * IndexTank - has some technical weaknesses, not integrated with Hadoop,
 not sure about its future considering LinkedIn uses Zoie and Sensei already.
 * And there is SolrCloud, which is coming soon and will be solid, but is
 again not integrated.

 If I were you and I had to pick today - I'd pick ElasticSearch if I were
 completely open.  If I had Solr bias I'd give SolrCloud a try first.

  Lastly, how much hardware (assuming a medium sized EC2 instance) would
 you
  estimate my needing with this setup, for regular web-data (HTML text) at
  this scale?

 I don't know off the top of my head, but I'm guessing several hundred
 for serving search requests.

 HTH,

 Otis
 --
 Search Analytics - http://sematext.com/search-analytics/index.html

 Scalable Performance Monitoring - http://sematext.com/spm/index.html


  Any architectural guidance would be greatly appreciated. The more details
  provided, the wider my grin :).
 
  Many many thanks in advance.
 
  Thanks,
  Safdar
 



dynamic analyzer based on condition

2012-04-13 Thread srinir
Hi,

I want to pick different analyzers for the same field for different
languages. I can determine the language from a different field. I would have
different fieldTypes defined in my schema.xml, such as text_en, text_de,
text_fr, etc., where I specify which analyzer and filters to use at
indexing and query time. 

<fieldType name="text_en" class="solr.TextField"
positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
  </analyzer>
</fieldType>

But I would like to define the field dynamically, e.g.:

if lang == en
<field name="description" type="text_en" indexed="true" stored="true" />
else if lang == de
<field name="description" type="text_de" indexed="true" stored="true" />
...


Can I achieve this somehow? If this approach cannot be done, then I can just
create one field for every language.
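
A sketch of that fallback, with illustrative field names (the indexing
client, or an UpdateProcessor, would route the text based on the lang
value):

<field name="description_en" type="text_en" indexed="true" stored="true" />
<field name="description_de" type="text_de" indexed="true" stored="true" />
<field name="description_fr" type="text_fr" indexed="true" stored="true" />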

Thanks
Srini

--
View this message in context: 
http://lucene.472066.n3.nabble.com/dynamic-analyzer-based-on-condition-tp3909345p3909345.html
Sent from the Solr - User mailing list archive at Nabble.com.


remoteLink that change it's text

2012-04-13 Thread Marcelo Carvalho Fernandes
Hi!

I have the following gsp code...

<g:each in="${productInstanceList}" status="i" var="productInstance">
   <!-- display product properties omitted -->
   <g:remoteLink action="addaction"
     id="${i}"
     update="[success:'what-to-put-here',failure:'error']"
     on404="alert('not found');">
   Select this product
   </g:remoteLink>
</g:each>

How do I have each remoteLink change its Select this product text to
what addaction renders?
The problem I'm facing is that I don't know what to put in
'what-to-put-here' in order to achieve that.

Of course, I'm new to gsp tags. Any idea?

Thanks in advance,


Marcelo Carvalho Fernandes
+55 21 8272-7970
+55 21 2205-2786


Re: remoteLink that change it's text

2012-04-13 Thread Marcelo Carvalho Fernandes
Sorry! Wrong list!


Marcelo Carvalho Fernandes
+55 21 8272-7970
+55 21 2205-2786


On Fri, Apr 13, 2012 at 10:54 PM, Marcelo Carvalho Fernandes 
mcf2...@gmail.com wrote:

 Hi!

 I have the following gsp code...

 <g:each in="${productInstanceList}" status="i" var="productInstance">
    <!-- display product properties omitted -->
    <g:remoteLink action="addaction"
      id="${i}"
      update="[success:'what-to-put-here',failure:'error']"
      on404="alert('not found');">
    Select this product
    </g:remoteLink>
 </g:each>

 How do I have each remoteLink change its Select this product text to
 what addaction renders?
 The problem I'm facing is that I don't know what to put in
 'what-to-put-here' in order to achieve that.

 Of course, I'm new to gsp tags. Any idea?

 Thanks in advance,

 
 Marcelo Carvalho Fernandes
 +55 21 8272-7970
 +55 21 2205-2786