Re: SpellCheck Help

2012-01-24 Thread vishal_asc
I have installed the same Solr 3.5 with Jetty and am integrating it with Magento
1.11, but it does not seem to be working.
My search results are not showing the "Did you mean ...?" string when I misspell
a word.

I followed all the steps necessary for the Magento/Solr integration.

Please help ASAP.

Thanks
Vishal
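
For reference, the "Did you mean ...?" text only appears when a SpellCheckComponent is
configured and attached to the search handler. A minimal solrconfig.xml sketch for Solr
3.5 (the field name "spell" and the handler wiring are illustrative, not Magento's
actual config):

    <searchComponent name="spellcheck" class="solr.SpellCheckComponent">
      <lst name="spellchecker">
        <str name="name">default</str>
        <str name="field">spell</str>
        <str name="buildOnCommit">true</str>
      </lst>
    </searchComponent>

    <requestHandler name="/select" class="solr.SearchHandler">
      <lst name="defaults">
        <str name="spellcheck">true</str>
        <str name="spellcheck.collate">true</str>
      </lst>
      <arr name="last-components">
        <str>spellcheck</str>
      </arr>
    </requestHandler>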



Re: solr not working with magento enterprise 1.11

2012-01-24 Thread David Radunz

Hey,

I am using Magento Community Edition; I wrote my own Magento 
extension to integrate Solr and it works fine, so I really don't know 
what the Enterprise edition does. On a personal and unrelated note, I 
would never use Windows for a server; it's unreliable, and most of the 
system resources go towards the OS.


Cheers,

David

On 25/01/2012 3:30 PM, vishal_asc wrote:

Thanks David. As of now we are configuring it on a local WAMP server, and we have 
only the development version provided by the sales team.

Do you know where Solr saves information, or where it pushes the XML docs, when we 
run Index Management in Magento?

I followed this site: 
http://www.summasolutions.net/blogposts/magento-apache-solr-set

Please let me know if you have some other info also.

Best Regards,
Vishal Porwal

From: David Radunz [via Lucene] 
[mailto:ml-node+s472066n3686805...@n3.nabble.com]
Sent: Wednesday, January 25, 2012 9:47 AM
To: Vishal Porwal
Subject: Re: solr not working with magento enterprise 1.11

Hey,

  Shouldn't you be asking the Magento people this question? You
have an Enterprise edition, so you have paid for their support.

Cheers,

David

On 25/01/2012 2:57 PM, vishal_asc wrote:


I am integrating Solr 3.5 with Jetty in Magento EE 1.11.

I have followed all the necessary steps, and configured and tested the Solr
connection in Magento's catalog system config.

I have copied the magento/lib/Solr/conf/ content to the Solr installation. I have
run Index Management and restarted Jetty, but when I search for any word or
misspell it, it does not show me the "Did you mean ...?" string, i.e. it is not
correcting the misspelling. It seems Solr is not returning results.

Please let me know how I can verify that Solr is working with Magento, and where
Solr saves the XML documents when Magento pushes attribute and product
information to it. Which directory does it store them in?










RE: solr not working with magento enterprise 1.11

2012-01-24 Thread vishal_asc
Thanks David. As of now we are configuring it on a local WAMP server, and we have 
only the development version provided by the sales team.

Do you know where Solr saves information, or where it pushes the XML docs, when we 
run Index Management in Magento?

I followed this site: 
http://www.summasolutions.net/blogposts/magento-apache-solr-set

Please let me know if you have some other info also.

Best Regards,
Vishal Porwal

From: David Radunz [via Lucene] 
[mailto:ml-node+s472066n3686805...@n3.nabble.com]
Sent: Wednesday, January 25, 2012 9:47 AM
To: Vishal Porwal
Subject: Re: solr not working with magento enterprise 1.11

Hey,

 Shouldn't you be asking the Magento people this question? You
have an Enterprise edition, so you have paid for their support.

Cheers,

David

On 25/01/2012 2:57 PM, vishal_asc wrote:

> I am integrating Solr 3.5 with Jetty in Magento EE 1.11.
>
> I have followed all the necessary steps, and configured and tested the Solr
> connection in Magento's catalog system config.
>
> I have copied the magento/lib/Solr/conf/ content to the Solr installation. I have
> run Index Management and restarted Jetty, but when I search for any word or
> misspell it, it does not show me the "Did you mean ...?" string, i.e. it is not
> correcting the misspelling. It seems Solr is not returning results.
>
> Please let me know how I can verify that Solr is working with Magento, and where
> Solr saves the XML documents when Magento pushes attribute and product
> information to it. Which directory does it store them in?
>








Re: Solr Cores

2012-01-24 Thread Sujatha Arun
Thanks Erick.

Regards
Sujatha

On Mon, Jan 23, 2012 at 11:16 PM, Erick Erickson wrote:

> You can have a large number of cores, some people have multiple
> hundreds. Having multiple cores is preferred over having
> multiple JVMs since it's more efficient at sharing system
> resources. If you're running a 32 bit JVM, you are limited in
> the amount of memory you can let the JVM use, so that's a
> consideration, but otherwise use multiple cores in one JVM
> and give that JVM say, half of the physical memory on the
> machine and tune from there.
>
> Best
> Erick
>
> On Sun, Jan 22, 2012 at 8:16 PM, Sujatha Arun  wrote:
> > Hello,
> >
> > We have in production a number of individual Solr instances on a single
> > JVM. As a result, we see that the permgen space keeps increasing with each
> > additional instance added.
> >
> > I would like to know if we can have Solr cores instead of individual
> > instances.
> >
> >
> >   - Is there any limit to the number of cores for a single instance?
> >   - Will this decrease the permgen space, as the lib is shared?
> >   - Would there be any decrease in performance as the number of cores grows?
> >   - Anything else I should know before moving to cores?
> >
> >
> > Any help would be appreciated.
> >
> > Regards
> > Sujatha
>
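
For illustration, a minimal multi-core solr.xml sketch (core names and paths are
placeholders); all cores share one JVM and its libraries:

    <solr persistent="true">
      <cores adminPath="/admin/cores">
        <core name="core0" instanceDir="core0"/>
        <core name="core1" instanceDir="core1"/>
      </cores>
    </solr>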


Re: solr not working with magento enterprise 1.11

2012-01-24 Thread David Radunz

Hey,

Shouldn't you be asking the Magento people this question? You 
have an Enterprise edition, so you have paid for their support.


Cheers,

David

On 25/01/2012 2:57 PM, vishal_asc wrote:

I am integrating Solr 3.5 with Jetty in Magento EE 1.11.

I have followed all the necessary steps, and configured and tested the Solr
connection in Magento's catalog system config.

I have copied the magento/lib/Solr/conf/ content to the Solr installation. I have
run Index Management and restarted Jetty, but when I search for any word or
misspell it, it does not show me the "Did you mean ...?" string, i.e. it is not
correcting the misspelling. It seems Solr is not returning results.

Please let me know how I can verify that Solr is working with Magento, and where
Solr saves the XML documents when Magento pushes attribute and product
information to it. Which directory does it store them in?





solr not working with magento enterprise 1.11

2012-01-24 Thread vishal_asc
I am integrating Solr 3.5 with Jetty in Magento EE 1.11.

I have followed all the necessary steps, and configured and tested the Solr
connection in Magento's catalog system config.

I have copied the magento/lib/Solr/conf/ content to the Solr installation. I have
run Index Management and restarted Jetty, but when I search for any word or
misspell it, it does not show me the "Did you mean ...?" string, i.e. it is not
correcting the misspelling. It seems Solr is not returning results.

Please let me know how I can verify that Solr is working with Magento, and where
Solr saves the XML documents when Magento pushes attribute and product
information to it. Which directory does it store them in?
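
For what it's worth, Solr does not keep the posted XML: documents are indexed into
binary Lucene segment files under the core's data directory. A sketch of the relevant
solrconfig.xml setting (the path shown is the common default, not necessarily
Magento's):

    <!-- segment files end up under <dataDir>/index -->
    <dataDir>${solr.data.dir:./solr/data}</dataDir>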



Re: Do Highlighting + proximity using surround query parser

2012-01-24 Thread Ahmet Arslan
> I got this working the way you describe it (in the getHighlightQuery()
> method). The span queries were tripping it up, so I extracted the query
> terms and created a DisMax query from them. There'll be a loss of accuracy
> in the highlighting, but in my case that's better than no highlighting.
>
> Should I just go ahead and submit a patch to SOLR-2703?

I think a separate jira ticket would be more appropriate. 

By the way, o.a.l.search.Query#rewrite(IndexReader reader) should do the trick. 

/**
 * Highlighter does not recognize SurroundQuery.
 * It must be rewritten in its most primitive form to enable highlighting.
 */
@Override
public Query getHighlightQuery() throws ParseException {

  Query rewrittenQuery;

  try {
    rewrittenQuery = getQuery().rewrite(getReq().getSearcher().getIndexReader());
  } catch (IOException ioe) {
    rewrittenQuery = null;
    LOG.error("query.rewrite() failed", ioe);
  }

  // Fall back to the original query if the rewrite failed.
  if (rewrittenQuery == null)
    return getQuery();
  else
    return rewrittenQuery;
}


Re: Solr 3.5.0 can't find Carrot classes

2012-01-24 Thread Christopher J. Bottaro
On Tuesday, January 24, 2012 at 3:07 PM, Christopher J. Bottaro wrote:
> SEVERE: java.lang.NoClassDefFoundError: org/carrot2/core/ControllerFactory
> at 
> org.apache.solr.handler.clustering.carrot2.CarrotClusteringEngine.<init>(CarrotClusteringEngine.java:102)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown 
> Source)
> at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown 
> Source)
> at java.lang.reflect.Constructor.newInstance(Unknown Source)
> at java.lang.Class.newInstance0(Unknown Source)
> at java.lang.Class.newInstance(Unknown Source)
>  
> …
>  
> I'm starting Solr with -Dsolr.clustering.enabled=true and I can see that the 
> Carrot jars in contrib are getting loaded.
>  
> Full log file is here:  http://onespot-development.s3.amazonaws.com/solr.log  
>  
> Any ideas?  Thanks for the help.
>  
Ok, got a little further.  Seems that Solr doesn't like it if you include jars 
more than once (I had a lib dir and also <lib> directives in the solrconfig, 
which ended up loading the same jars twice).

But now I'm getting these errors:  java.lang.NoClassDefFoundError: 
org/apache/solr/handler/clustering/SearchClusteringEngine

Any help?  Thanks. 
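
For context, jars reach the classpath either from the core's lib directory or from
<lib> directives; having both load the same jar causes the duplicate-loading problem
described above. A sketch of the <lib> form, with illustrative paths taken from the
stock example config:

    <!-- load the clustering contrib and its dependencies exactly once -->
    <lib dir="../../contrib/clustering/lib/" regex=".*\.jar" />
    <lib dir="../../dist/" regex="apache-solr-clustering-\d.*\.jar" />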

Re: Do Highlighting + proximity using surround query parser

2012-01-24 Thread Scott Stults
I got this working the way you describe it (in the getHighlightQuery()
method). The span queries were tripping it up, so I extracted the query
terms and created a DisMax query from them. There'll be a loss of accuracy
in the highlighting, but in my case that's better than no highlighting.

Should I just go ahead and submit a patch to SOLR-2703?


On Tue, Jan 10, 2012 at 9:35 AM, Ahmet Arslan  wrote:

> > I am not able to do highlighting with the surround query parser on the
> > returned results.
> > I have tried the highlighting component but it does not return highlighted
> > results.
>
> Highlighter does not recognize Surround Query. It must be re-written to
> enable highlighting in o.a.s.search.QParser#getHighlightQuery() method.
>
> Not sure this functionality should be added in SOLR-2703 or a separate
> jira issue.
>
>


-- 
Scott Stults | Founder & Solutions Architect | OpenSource Connections, LLC
| 434.409.2780
http://www.opensourceconnections.com


Fw: Problem with SplitBy in Solr 3.4

2012-01-24 Thread Sumit Sen



- Forwarded Message -
From: Sumit Sen 
To: Solr List  
Sent: Tuesday, January 24, 2012 3:53 PM
Subject: Problem with SplitBy in Solr 3.4


Hi All:

I have a very silly problem. I am using Solr 3.4. I have a data import handler 
for indexing which is not splitting a field's data by '|' in spite of the 
following setup:

[data-config.xml entity and field definitions stripped by the list archive]


In schema.xml I have:

[field definitions stripped by the list archive]

Results:

 G3FRBV113TQ4WV4Y
 US
 Salaried
 White
 last
 first
 Full-Time
 2012-02-11T05:00:00Z
 A
 US
 30004
 052641
 Single
 AX9
 0
 E
 Bennett,Chad D

 Reporting Manager|||Employee|
 
 Male
 

 G3FRBV113TQ4955G
 US
 124214
 Hourly
 White
 last
 first
 Full-Time
 2012-02-11T05:00:00Z
 Atlanta
 US
 30004
 052643
 Single
 22D
 0
 E
 Bell,Derrick

 Reporting Manager|||Employee|
 
 Male

Why is the RegexTransformer not splitting the roleCode field?
 
Thanks
Sumit Sen
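
For comparison, a minimal DIH entity that splits a pipe-delimited column with
RegexTransformer might look like this (table and column names are assumptions, since
the original snippet was stripped). Note that '|' is a regex metacharacter, so splitBy
usually needs it escaped, and the target field must be multiValued in schema.xml:

    <entity name="employee" transformer="RegexTransformer"
            query="select id, role_codes from employees">
      <field column="id" name="id"/>
      <!-- splitBy is a regex: use \| to split on a literal pipe -->
      <field column="roleCode" sourceColName="role_codes" splitBy="\|"/>
    </entity>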

Re: HTMLStripCharFilterFactory not working in Solr4?

2012-01-24 Thread Mike Hugo
Thanks for the responses everyone.

Steve, the test method you provided also works for me.  However, when I try
a more end-to-end test with the HTMLStripCharFilterFactory configured for a
field, I am still having the same problem.  I attached a failing unit test
and configuration to the following issue in JIRA:

https://issues.apache.org/jira/browse/LUCENE-3721

I appreciate all the prompt responses!  Looking forward to finding the root
cause of this guy :)  If there's something I'm doing incorrectly in the
configuration, please let me know!

Mike

On Tue, Jan 24, 2012 at 1:57 PM, Steven A Rowe  wrote:

> Hi Mike,
>
> When I add the following test to TestHTMLStripCharFilterFactory.java on
> Solr trunk, it passes:
>
> public void testNumericCharacterEntities() throws Exception {
>   final String text = "Bose&#174; &#8482;";  // |Bose® ™|
>   HTMLStripCharFilterFactory htmlStripFactory = new HTMLStripCharFilterFactory();
>   htmlStripFactory.init(Collections.emptyMap());
>   CharStream charStream = htmlStripFactory.create(CharReader.get(new StringReader(text)));
>   StandardTokenizerFactory stdTokFactory = new StandardTokenizerFactory();
>   stdTokFactory.init(DEFAULT_VERSION_PARAM);
>   Tokenizer stream = stdTokFactory.create(charStream);
>   assertTokenStreamContents(stream, new String[] { "Bose" });
> }
>
> What's happening:
>
> First, htmlStripFactory converts "&#174;" to "®" and "&#8482;" to "™".
> Then stdTokFactory declines to tokenize "®" and "™", because they belong
> to the Unicode general category "Symbol, Other", and so are not included
> in any of the output tokens.
>
> StandardTokenizer uses the Word Break rules from UAX#29
> <http://unicode.org/reports/tr29/> to find token boundaries, and then
> outputs only alphanumeric tokens.  See the JFlex grammar for details:
> <http://svn.apache.org/viewvc/lucene/dev/trunk/modules/analysis/common/src/java/org/apache/lucene/analysis/standard/StandardTokenizerImpl.jflex?view=markup>.
>
> The behavior you're seeing is not consistent with the above test.
>
> Steve
>
> > -Original Message-
> > From: Mike Hugo [mailto:m...@piragua.com]
> > Sent: Tuesday, January 24, 2012 1:34 PM
> > To: solr-user@lucene.apache.org
> > Subject: HTMLStripCharFilterFactory not working in Solr4?
> >
> > We recently updated to the latest build of Solr4 and everything is
> working
> > really well so far!  There is one case that is not working the same way
> it
> > was in Solr 3.4 - we strip out certain HTML constructs (like trademark
> and
> > registered, for example) in a field as defined below - it was working in
> > Solr3.4 with the configuration shown here, but is not working the same
> way
> > in Solr4.
> >
> > The label field is defined as type="text_general"
> > [field definition stripped by the list archive]
> >
> > Here's the type definition for text_general field:
> > [analyzer definitions stripped by the list archive]
> >
> >
> > In Solr 3.4, that configuration was completely stripping html constructs
> > out of the indexed field which is exactly what we wanted.  If for
> example,
> > we then do a facet on the label field, like in the test below, we're
> > getting some terms in the response that we would not like to be there.
> >
> >
> > // test case (groovy)
> > void specialHtmlConstructsGetStripped() {
> > SolrInputDocument inputDocument = new SolrInputDocument()
> > inputDocument.addField('label', 'Bose&#174; &#8482;')
> >
> > solrServer.add(inputDocument)
> > solrServer.commit()
> >
> > QueryResponse response = solrServer.query(new SolrQuery('bose'))
> > assert 1 == response.results.numFound
> >
> > SolrQuery facetQuery = new SolrQuery('bose')
> > facetQuery.facet = true
> > facetQuery.set(FacetParams.FACET_FIELD, 'label')
> > facetQuery.set(FacetParams.FACET_MINCOUNT, '1')
> >
> > response = solrServer.query(facetQuery)
> > FacetField ff = response.facetFields.find {it.name == 'label'}
> >
> > List suggestResponse = []
> >
> > for (FacetField.Count facetField in ff?.values) {
> > suggestResponse << facetField.name
> > }
> >
> > assert suggestResponse == ['bose']
> > }
> >
> > With the upgrade to Solr4, the assertion fails, the suggested response
> > contains 174 and 8482 as terms.  Test output is:
> >
> > Assertion failed:
> >
> > assert suggestResponse == ['bose']
> >|   |
> >|   false
> >[174, 8482, bose]
> >
> >
> > I just tried again using the latest build from today, namely:
> > https://builds.apache.org/job/Lucene-Solr-Maven-trunk/369/ and we're still
> > getting the failing assertion. Is there a different way to configure the
> > HTMLStripCharFilterFactory in Solr4?

Re: Size of index to use shard

2012-01-24 Thread Vadim Kisselmann
@Erick
Thanks :)
I'm of the same opinion;
my load tests show the same.

@Dmitry
My docs are small too, I think about 3-15KB per doc.
I update my index all the time and I have an average of 20-50 requests
per minute (20% facet queries, 80% large boolean queries with
wildcard/fuzzy). How many docs at a time depends on the chosen
filters: from 10 to all 100 million.
I work with very small caches (strangely, if my index is under
100GB I need larger caches; over 100GB, smaller caches).
My JVM has 6GB; 18GB is for I/O.
With few updates a day I would configure very big caches, like Tom
Burton-West (see HathiTrust's blog).

Regards Vadim



2012/1/24 Anderson vasconcelos :
> Thanks for the explanation Erick :)
>
> 2012/1/24, Erick Erickson :
>> Talking about "index size" can be very misleading. Take
>> a look at http://lucene.apache.org/java/3_5_0/fileformats.html#file-names.
>> Note that the *.fdt and *.fdx files are used for stored fields, i.e.
>> the verbatim copy of data put in the index when you specify
>> stored="true". These files have virtually no impact on search
>> speed.
>>
>> So, if your *.fdx and *.fdt files are 90G out of a 100G index
>> it is a much different thing than if these files are 10G out of
>> a 100G index.
>>
>> And this doesn't even mention the peculiarities of your query mix.
>> Nor does it say a thing about whether your cheapest alternative
>> is to add more memory.
>>
>> Anderson's method is about the only reliable one, you just have
>> to test with your index and real queries. At some point, you'll
>> find your tipping point, typically when you come under memory
>> pressure. And it's a balancing act between how much memory
>> you allocate to the JVM and how much you leave for the op
>> system.
>>
>> Bottom line: No hard and fast numbers. And you should periodically
>> re-test the empirical numbers you *do* arrive at...
>>
>> Best
>> Erick
>>
>> On Tue, Jan 24, 2012 at 5:31 AM, Anderson vasconcelos
>>  wrote:
>>> Apparently, not so easy to determine when to break the content into
>>> pieces. I'll investigate further about the amount of documents, the
>>> size of each document and what kind of search is being used. It seems,
>>> I will have to do a load test to identify the cutoff point to begin
>>> using the strategy of shards.
>>>
>>> Thanks
>>>
>>> 2012/1/24, Dmitry Kan :
 Hi,

 The article you gave mentions 13GB of index size. It is quite small index
 from our perspective. We have noticed, that at least solr 3.4 has some
 sort
 of "choking" point with respect to growing index size. It just becomes
 substantially slower than what we need (a query on avg taking more than
 3-4
 seconds) once index size crosses a magic level (about 80GB following our
 practical observations). We try to keep our indices at around 60-70GB for
 fast searches and above 100GB for slow ones. We also route majority of
 user
 queries to fast indices. Yes, caching may help, but not necessarily we
 can
 afford adding more RAM for bigger indices. BTW, our documents are very
 small, thus in 100GB index we can have around 200 mil. documents. It
 would
 be interesting to see, how you manage to ensure q-times under 1 sec with
 an
 index of 250GB? How many documents / facets do you ask max. at a time?
 FYI,
 we ask for a thousand of facets in one go.

 Regards,
 Dmitry

 On Tue, Jan 24, 2012 at 10:30 AM, Vadim Kisselmann <
 v.kisselm...@googlemail.com> wrote:

> Hi,
> it depends on your hardware.
> Read this:
>
> http://www.derivante.com/2009/05/05/solr-performance-benchmarks-single-vs-multi-core-index-shards/
> Think about your cache-config (few updates, big caches) and a good
> HW-infrastructure.
> In my case i can handle a 250GB index with 100mil. docs on a I7
> machine with RAID10 and 24GB RAM => q-times under 1 sec.
> Regards
> Vadim
>
>
>
> 2012/1/24 Anderson vasconcelos :
> > Hi
> > Is there some index size (or number of docs) at which it is necessary to
> > break the index into shards?
> > I have a index with 100GB of size. This index increase 10GB per year.
> > (I don't have information how many docs they have) and the docs never
> > will be deleted.  Thinking in 30 years, the index will be with 400GB
> > of size.
> >
> > I think it is not required to break into shards, because I don't consider
> > this a "large index". Am I correct? What is a real "large index"?
> >
> >
> > Thanks
>

>>


Re: dismax: limiting term match to one field

2012-01-24 Thread astubbs
This seems like a real shame. As soon as you search across more than one
field, the mm setting becomes nearly useless.



Re: HTMLStripCharFilterFactory not working in Solr4?

2012-01-24 Thread Yonik Seeley
Oops, I didn't read carefully enough to see that you wanted those constructs
entirely stripped out.

Given that you're seeing numbers indexed, this strongly indicates an
escaping bug in the SolrJ client that must have been introduced at
some point.
I'll see if I can reproduce it in a unit test.


-Yonik
http://www.lucidimagination.com


RE: HTMLStripCharFilterFactory not working in Solr4?

2012-01-24 Thread Michael Ryan
Try putting the HTMLStripCharFilterFactory before the StandardTokenizerFactory 
instead of after it. I vaguely recall being burned by something like this 
before.

-Michael


RE: HTMLStripCharFilterFactory not working in Solr4?

2012-01-24 Thread Steven A Rowe
Hi Mike,

When I add the following test to TestHTMLStripCharFilterFactory.java on Solr 
trunk, it passes:
  
public void testNumericCharacterEntities() throws Exception {
  final String text = "Bose&#174; &#8482;";  // |Bose® ™|
  HTMLStripCharFilterFactory htmlStripFactory = new HTMLStripCharFilterFactory();
  htmlStripFactory.init(Collections.emptyMap());
  CharStream charStream = htmlStripFactory.create(CharReader.get(new StringReader(text)));
  StandardTokenizerFactory stdTokFactory = new StandardTokenizerFactory();
  stdTokFactory.init(DEFAULT_VERSION_PARAM);
  Tokenizer stream = stdTokFactory.create(charStream);
  assertTokenStreamContents(stream, new String[] { "Bose" });
}

What's happening: 

First, htmlStripFactory converts "&#174;" to "®" and "&#8482;" to "™".  Then 
stdTokFactory declines to tokenize "®" and "™", because they belong to the 
Unicode general category "Symbol, Other", and so are not included in any of the 
output tokens.

StandardTokenizer uses the Word Break rules from UAX#29 
<http://unicode.org/reports/tr29/> to find token boundaries, and then outputs 
only alphanumeric tokens.  See the JFlex grammar for details: 
<http://svn.apache.org/viewvc/lucene/dev/trunk/modules/analysis/common/src/java/org/apache/lucene/analysis/standard/StandardTokenizerImpl.jflex?view=markup>.

The behavior you're seeing is not consistent with the above test.

Steve

> -Original Message-
> From: Mike Hugo [mailto:m...@piragua.com]
> Sent: Tuesday, January 24, 2012 1:34 PM
> To: solr-user@lucene.apache.org
> Subject: HTMLStripCharFilterFactory not working in Solr4?
> 
> We recently updated to the latest build of Solr4 and everything is working
> really well so far!  There is one case that is not working the same way it
> was in Solr 3.4 - we strip out certain HTML constructs (like trademark and
> registered, for example) in a field as defined below - it was working in
> Solr3.4 with the configuration shown here, but is not working the same way
> in Solr4.
> 
> The label field is defined as type="text_general"
> [field definition stripped by the list archive]
> 
> Here's the type definition for text_general field:
> [analyzer definitions stripped by the list archive]
> 
> In Solr 3.4, that configuration was completely stripping html constructs
> out of the indexed field which is exactly what we wanted.  If for example,
> we then do a facet on the label field, like in the test below, we're
> getting some terms in the response that we would not like to be there.
> 
> 
> // test case (groovy)
> void specialHtmlConstructsGetStripped() {
> SolrInputDocument inputDocument = new SolrInputDocument()
> inputDocument.addField('label', 'Bose&#174; &#8482;')
> 
> solrServer.add(inputDocument)
> solrServer.commit()
> 
> QueryResponse response = solrServer.query(new SolrQuery('bose'))
> assert 1 == response.results.numFound
> 
> SolrQuery facetQuery = new SolrQuery('bose')
> facetQuery.facet = true
> facetQuery.set(FacetParams.FACET_FIELD, 'label')
> facetQuery.set(FacetParams.FACET_MINCOUNT, '1')
> 
> response = solrServer.query(facetQuery)
> FacetField ff = response.facetFields.find {it.name == 'label'}
> 
> List suggestResponse = []
> 
> for (FacetField.Count facetField in ff?.values) {
> suggestResponse << facetField.name
> }
> 
> assert suggestResponse == ['bose']
> }
> 
> With the upgrade to Solr4, the assertion fails, the suggested response
> contains 174 and 8482 as terms.  Test output is:
> 
> Assertion failed:
> 
> assert suggestResponse == ['bose']
>|   |
>|   false
>[174, 8482, bose]
> 
> 
> I just tried again using the latest build from today, namely:
> https://builds.apache.org/job/Lucene-Solr-Maven-trunk/369/ and we're still
> getting the failing assertion. Is there a different way to configure the
> HTMLStripCharFilterFactory in Solr4?
> 
> Thanks in advance for any tips!
> 
> Mike


Re: phrase auto-complete with suggester component

2012-01-24 Thread Tommy Chheng
Thanks, I'll try out the custom class file. Any possibility this
class could be merged into Solr? It seems like expected behavior.


On Tue, Jan 24, 2012 at 11:29 AM, O. Klein  wrote:
> You might wanna read
> http://lucene.472066.n3.nabble.com/suggester-issues-td3262718.html#a3264740
> which contains the solution to your problem.



-- 
Tommy Chheng


Indexing failover and replication

2012-01-24 Thread Anderson vasconcelos
Hi
I'm doing a test now with replication using Solr 1.4.1. I configured
two servers (server1 and server2) as master/slave to synchronize
both. I put Apache on the front side, and we index sometimes on server1
and sometimes on server2.

I realized that both index servers are now confused. In the Solr data
folder, many index folders were created with the timestamp of
synchronization (example: index.20120124041340), with some segments
inside.

I thought it was possible to index on two master servers and then
synchronize both using replication. Is it really possible to do this with
the replication mechanism? If it is possible, what have I done wrong?

I need more than one node for indexing to guarantee failover for
indexing. Is multi-master the best way to guarantee failover for
indexing?

Thanks
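
For reference, Solr 1.4's built-in replication is one-way: a slave polls a single
master. A minimal sketch of each side's solrconfig.xml (host names are placeholders):

    <!-- master -->
    <requestHandler name="/replication" class="solr.ReplicationHandler">
      <lst name="master">
        <str name="replicateAfter">commit</str>
        <str name="confFiles">schema.xml,stopwords.txt</str>
      </lst>
    </requestHandler>

    <!-- slave -->
    <requestHandler name="/replication" class="solr.ReplicationHandler">
      <lst name="slave">
        <str name="masterUrl">http://server1:8983/solr/replication</str>
        <str name="pollInterval">00:00:60</str>
      </lst>
    </requestHandler>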


Re: phrase auto-complete with suggester component

2012-01-24 Thread O. Klein
You might wanna read
http://lucene.472066.n3.nabble.com/suggester-issues-td3262718.html#a3264740
which contains the solution to your problem.



Re: HTMLStripCharFilterFactory not working in Solr4?

2012-01-24 Thread Mike Hugo
Thanks for the response Yonik,
Interestingly enough, changing to the LegacyHTMLStripCharFilterFactory
does NOT solve the problem - in fact I get the same result.

I can see that the LegacyHTMLStripCharFilterFactory is being applied at
startup:

Jan 24, 2012 1:25:29 PM org.apache.solr.util.plugin.AbstractPluginLoader
load
INFO: created : org.apache.solr.analysis.LegacyHTMLStripCharFilterFactory

however, I'm still getting the same assertion error.  Any thoughts?

Mike


On Tue, Jan 24, 2012 at 12:40 PM, Yonik Seeley
wrote:

> You can use LegacyHTMLStripCharFilterFactory to get the previous behavior.
> See https://issues.apache.org/jira/browse/LUCENE-3690 for more details.
>
> -Yonik
> http://www.lucidimagination.com
>
>
>
> On Tue, Jan 24, 2012 at 1:34 PM, Mike Hugo  wrote:
> > We recently updated to the latest build of Solr4 and everything is
> working
> > really well so far!  There is one case that is not working the same way
> it
> > was in Solr 3.4 - we strip out certain HTML constructs (like trademark
> and
> > registered, for example) in a field as defined below - it was working in
> > Solr3.4 with the configuration shown here, but is not working the same
> way
> > in Solr4.
> >
> > The label field is defined as type="text_general"
> > [field definition stripped by the list archive]
> >
> > Here's the type definition for text_general field:
> > [analyzer definitions stripped by the list archive]
> >
> > In Solr 3.4, that configuration was completely stripping html constructs
> > out of the indexed field which is exactly what we wanted.  If for
> example,
> > we then do a facet on the label field, like in the test below, we're
> > getting some terms in the response that we would not like to be there.
> >
> >
> > // test case (groovy)
> > void specialHtmlConstructsGetStripped() {
> >SolrInputDocument inputDocument = new SolrInputDocument()
> >    inputDocument.addField('label', 'Bose&#174; &#8482;')
> >
> >solrServer.add(inputDocument)
> >solrServer.commit()
> >
> >QueryResponse response = solrServer.query(new SolrQuery('bose'))
> >assert 1 == response.results.numFound
> >
> >SolrQuery facetQuery = new SolrQuery('bose')
> >facetQuery.facet = true
> >facetQuery.set(FacetParams.FACET_FIELD, 'label')
> >facetQuery.set(FacetParams.FACET_MINCOUNT, '1')
> >
> >response = solrServer.query(facetQuery)
> >FacetField ff = response.facetFields.find {it.name == 'label'}
> >
> >List suggestResponse = []
> >
> >for (FacetField.Count facetField in ff?.values) {
> >suggestResponse << facetField.name
> >}
> >
> >assert suggestResponse == ['bose']
> > }
> >
> > With the upgrade to Solr4, the assertion fails, the suggested response
> > contains 174 and 8482 as terms.  Test output is:
> >
> > Assertion failed:
> >
> > assert suggestResponse == ['bose']
> >   |   |
> >   |   false
> >   [174, 8482, bose]
> >
> >
> > I just tried again using the latest build from today, namely:
> > https://builds.apache.org/job/Lucene-Solr-Maven-trunk/369/ and we're
> still
> > getting the failing assertion. Is there a different way to configure the
> > HTMLStripCharFilterFactory in Solr4?
> >
> > Thanks in advance for any tips!
> >
> > Mike
>


Re: Size of index to use shard

2012-01-24 Thread Anderson vasconcelos
Thanks for the explanation Erick :)

2012/1/24, Erick Erickson :
> Talking about "index size" can be very misleading. Take
> a look at http://lucene.apache.org/java/3_5_0/fileformats.html#file-names.
> Note that the *.fdt and *.fdx files are used for stored fields, i.e.
> the verbatim copy of data put in the index when you specify
> stored="true". These files have virtually no impact on search
> speed.
>
> So, if your *.fdx and *.fdt files are 90G out of a 100G index
> it is a much different thing than if these files are 10G out of
> a 100G index.
>
> And this doesn't even mention the peculiarities of your query mix.
> Nor does it say a thing about whether your cheapest alternative
> is to add more memory.
>
> Anderson's method is about the only reliable one, you just have
> to test with your index and real queries. At some point, you'll
> find your tipping point, typically when you come under memory
> pressure. And it's a balancing act between how much memory
> you allocate to the JVM and how much you leave for the op
> system.
>
> Bottom line: No hard and fast numbers. And you should periodically
> re-test the empirical numbers you *do* arrive at...
>
> Best
> Erick
>
> On Tue, Jan 24, 2012 at 5:31 AM, Anderson vasconcelos
>  wrote:
>> Apparently, not so easy to determine when to break the content into
>> pieces. I'll investigate further about the amount of documents, the
>> size of each document and what kind of search is being used. It seems,
>> I will have to do a load test to identify the cutoff point to begin
>> using the strategy of shards.
>>
>> Thanks
>>
>> 2012/1/24, Dmitry Kan :
>>> Hi,
>>>
>>> The article you gave mentions 13GB of index size. It is quite small index
>>> from our perspective. We have noticed, that at least solr 3.4 has some
>>> sort
>>> of "choking" point with respect to growing index size. It just becomes
>>> substantially slower than what we need (a query on avg taking more than
>>> 3-4
>>> seconds) once index size crosses a magic level (about 80GB following our
>>> practical observations). We try to keep our indices at around 60-70GB for
>>> fast searches and above 100GB for slow ones. We also route majority of
>>> user
>>> queries to fast indices. Yes, caching may help, but not necessarily we
>>> can
>>> afford adding more RAM for bigger indices. BTW, our documents are very
>>> small, thus in 100GB index we can have around 200 mil. documents. It
>>> would
>>> be interesting to see, how you manage to ensure q-times under 1 sec with
>>> an
>>> index of 250GB? How many documents / facets do you ask max. at a time?
>>> FYI,
>>> we ask for a thousand of facets in one go.
>>>
>>> Regards,
>>> Dmitry
>>>
>>> On Tue, Jan 24, 2012 at 10:30 AM, Vadim Kisselmann <
>>> v.kisselm...@googlemail.com> wrote:
>>>
 Hi,
 it depends on your hardware.
 Read this:

 http://www.derivante.com/2009/05/05/solr-performance-benchmarks-single-vs-multi-core-index-shards/
 Think about your cache-config (few updates, big caches) and a good
 HW-infrastructure.
 In my case i can handle a 250GB index with 100mil. docs on a I7
 machine with RAID10 and 24GB RAM => q-times under 1 sec.
 Regards
 Vadim



 2012/1/24 Anderson vasconcelos :
 > Hi
 > Is there some index size (or number of docs) at which it is necessary to
 > break the index into shards?
 > I have a index with 100GB of size. This index increase 10GB per year.
 > (I don't have information how many docs they have) and the docs never
 > will be deleted.  Thinking in 30 years, the index will be with 400GB
 > of size.
 >
 > I think it is not required to break into shards, because I don't consider
 > this a "large index". Am I correct? What is a real "large index"?
 >
 >
 > Thanks

>>>
>


Re: Hierarchical faceting in UI

2012-01-24 Thread Yuhao
Hi Darren.  You said: 


"Your UI will associate the correct parent id to build the facet query"

This is the part I'm having trouble figuring out how to accomplish and some 
guidance would help. How would I get the value of the parent to build the facet 
query in the UI, if the value is in another document field?  I was imagining 
that I would add the additional filter of "parent:" to the "fq" 
URL parameter.  But I don't have a way to do it yet.

Perhaps seeing some data would help.  Here is a record in old (flattened) and 
new (parent-enabled) versions, both in JSON format:

OLD:
    {
        "ID" : "3816",
        "Gene Symbol" : "KLK1",
        "Alternate Names" : "hCG_22931;Klk6;hK1;KLKR",
        "Description" : "Kallikrein 1, a peptidase that cleaves kininogen, 
functions in glucose homeostasis, heart contraction, semen liquefaction, and 
vasoconstriction, aberrantly expressed in pancreatitis and endometrial cancer; 
gene polymorphism correlates with kidney failure (BKL)",
        "GAD_Positive_Disease_Associations" : ["Mental Disorders(MESH:D001523) 
>> Dementia, Vascular(MESH:D015140)", "Cardiovascular Diseases(MESH:D002318) >> 
Coronary Artery Disease(MESH:D003324)"],
        "HuGENet_GeneProspector_Associations" : ["atherosclerosis", "HDL"],
    }



NEW:
    {
        "ID" : "3816",
        "Gene Symbol" : "KLK1",
        "Alternate Names" : "hCG_22931;Klk6;hK1;KLKR",
        "Description" : "Kallikrein 1, a peptidase that cleaves kininogen, 
functions in glucose homeostasis, heart contraction, semen liquefaction, and 
vasoconstriction, aberrantly expressed in pancreatitis and endometrial cancer; 
gene polymorphism correlates with kidney failure (BKL)",
        "GAD_Positive_Disease_Associations" : ["Dementia, 
Vascular(MESH:D015140)", "Coronary Artery Disease(MESH:D003324)"],
        "GAD_Positive_Disease_Associations_parent" : ["Mental 
Disorders(MESH:D001523)", "Cardiovascular Diseases(MESH:D002318)"],
        "HuGENet_GeneProspector_Associations" : ["atherosclerosis", "HDL"],
    }

In the old version, the field "GAD_Positive_Disease_Associations" had 2 levels 
of hierarchy that were flattened.  It had the full path of the hierarchy 
leading to the current term.  In the new version, the field only has the 
current term.  A separate field called 
"GAD_Positive_Disease_Associations_parent" has the full path preceding the 
current term.

So, let's say in the UI, I click on the term "Dementia, Vascular(MESH:D015140)" 
to get its child terms and data.  My filters in the URL querystring would be 
exactly: 

fq=GAD_Positive_Disease_Associations:"Dementia, 
Vascular(MESH:D015140)"&fq=GAD_Positive_Disease_Associations_parent:"Mental 
Disorders(MESH:D001523)"

My question is, how to get the parent value of "Mental Disorders(MESH:D001523)" 
to build that querystring?

Thanks!

Yuhao





 From: Darren Govoni 
To: solr-user@lucene.apache.org 
Sent: Tuesday, January 24, 2012 1:23 PM
Subject: Re: Hierarchical faceting in UI
 
Yuhao,
     Ok, let me think about this. A term can have multiple parents. Each of 
those parents would be 'different', yes?
In this case, use a multivalued field for the parent and add all the parent 
names or id's to it. The relations should be unique.

Your UI will associate the correct parent id to build the facet query from and 
return the correct children because the user
is descending down a specific path in the UI and the parent node unique id's 
are returned along the way.

Now, if you are having parent names/id's that themselves can appear in multiple 
locations (vs. just terms 'the leafs'),
then perhaps your hierarchy needs refactoring for redundancy?

Happy to help with more details.

Darren


On 01/24/2012 11:22 AM, Yuhao wrote:
> Darren,
>
> One challenge for me is that a term can appear in multiple places of the 
> hierarchy.  So it's not safe to simply use the term as it appears to get its 
> children; I probably need to include the entire tree path up to this term.  
> For example, if the hierarchy is "Cardiovascular Diseases>  
> Arteriosclerosis>  Coronary Artery Disease", and I'm getting the children of 
> the middle term Arteriosclerosis, I need to filter on something like 
> "parent:Cardiovascular Diseases/Arteriosclerosis".
>
> I'm having trouble figuring out how I can get the complete path per above to 
> add to the URL of each facet term.  I know "velocity/facet_field.vm" is where 
> I build the URL.  I know how to simply add a "parent:" filter to the 
> URL.  But I don't know how to access a document field, like the complete 
> parent path, in "facet_field.vm".  Any help would be great.
>
> Yuhao
>
>
>
>
> 
>   From: "dar...@ontrenet.com"
> To: Yuhao
> Cc: solr-user@lucene.apache.org
> Sent: Monday, January 23, 2012 7:16 PM
> Subject: Re: Hierarchical faceting in UI
>
>
> On Mon, 23 Jan 2012 14:33:00 -0800 (PST), Yuhao
> wrote:
>> Programmatically, something like this might work: for each facet field,
>> add another hidden field that identifies its parent.

Re: HTMLStripCharFilterFactory not working in Solr4?

2012-01-24 Thread Yonik Seeley
You can use LegacyHTMLStripCharFilterFactory to get the previous behavior.
See https://issues.apache.org/jira/browse/LUCENE-3690 for more details.

-Yonik
http://www.lucidimagination.com



On Tue, Jan 24, 2012 at 1:34 PM, Mike Hugo  wrote:
> We recently updated to the latest build of Solr4 and everything is working
> really well so far!  There is one case that is not working the same way it
> was in Solr 3.4 - we strip out certain HTML constructs (like trademark and
> registered, for example) in a field as defined below - it was working in
> Solr3.4 with the configuration shown here, but is not working the same way
> in Solr4.
>
> The label field is defined as type="text_general"
> [field definition stripped by the list archive]
>
> Here's the type definition for text_general field:
> [analyzer definitions stripped by the list archive]
>
>
> In Solr 3.4, that configuration was completely stripping html constructs
> out of the indexed field which is exactly what we wanted.  If for example,
> we then do a facet on the label field, like in the test below, we're
> getting some terms in the response that we would not like to be there.
>
>
> // test case (groovy)
> void specialHtmlConstructsGetStripped() {
>    SolrInputDocument inputDocument = new SolrInputDocument()
>    inputDocument.addField('label', 'Bose&#174; &#8482;')
>
>    solrServer.add(inputDocument)
>    solrServer.commit()
>
>    QueryResponse response = solrServer.query(new SolrQuery('bose'))
>    assert 1 == response.results.numFound
>
>    SolrQuery facetQuery = new SolrQuery('bose')
>    facetQuery.facet = true
>    facetQuery.set(FacetParams.FACET_FIELD, 'label')
>    facetQuery.set(FacetParams.FACET_MINCOUNT, '1')
>
>    response = solrServer.query(facetQuery)
>    FacetField ff = response.facetFields.find {it.name == 'label'}
>
>    List suggestResponse = []
>
>    for (FacetField.Count facetField in ff?.values) {
>        suggestResponse << facetField.name
>    }
>
>    assert suggestResponse == ['bose']
> }
>
> With the upgrade to Solr4, the assertion fails, the suggested response
> contains 174 and 8482 as terms.  Test output is:
>
> Assertion failed:
>
> assert suggestResponse == ['bose']
>       |               |
>       |               false
>       [174, 8482, bose]
>
>
> I just tried again using the latest build from today, namely:
> https://builds.apache.org/job/Lucene-Solr-Maven-trunk/369/ and we're still
> getting the failing assertion. Is there a different way to configure the
> HTMLStripCharFilterFactory in Solr4?
>
> Thanks in advance for any tips!
>
> Mike


HTMLStripCharFilterFactory not working in Solr4?

2012-01-24 Thread Mike Hugo
We recently updated to the latest build of Solr4 and everything is working
really well so far!  There is one case that is not working the same way it
was in Solr 3.4 - we strip out certain HTML constructs (like trademark and
registered, for example) in a field as defined below - it was working in
Solr3.4 with the configuration shown here, but is not working the same way
in Solr4.

The label field is defined as type="text_general"
[field definition stripped by the list archive]

Here's the type definition for text_general field:
[analyzer definitions stripped by the list archive]

In Solr 3.4, that configuration was completely stripping html constructs
out of the indexed field which is exactly what we wanted.  If for example,
we then do a facet on the label field, like in the test below, we're
getting some terms in the response that we would not like to be there.
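
Since the archive stripped the XML above, here is a sketch of what a text_general type
with HTML stripping commonly looks like; the char filter runs before the tokenizer,
and the poster's exact attributes may differ:

    <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <charFilter class="solr.HTMLStripCharFilterFactory"/>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"
                enablePositionIncrements="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"
                enablePositionIncrements="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>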


// test case (groovy)
void specialHtmlConstructsGetStripped() {
SolrInputDocument inputDocument = new SolrInputDocument()
inputDocument.addField('label', 'Bose&#174; &#8482;')

solrServer.add(inputDocument)
solrServer.commit()

QueryResponse response = solrServer.query(new SolrQuery('bose'))
assert 1 == response.results.numFound

SolrQuery facetQuery = new SolrQuery('bose')
facetQuery.facet = true
facetQuery.set(FacetParams.FACET_FIELD, 'label')
facetQuery.set(FacetParams.FACET_MINCOUNT, '1')

response = solrServer.query(facetQuery)
FacetField ff = response.facetFields.find {it.name == 'label'}

List suggestResponse = []

for (FacetField.Count facetField in ff?.values) {
suggestResponse << facetField.name
}

assert suggestResponse == ['bose']
}

With the upgrade to Solr4, the assertion fails, the suggested response
contains 174 and 8482 as terms.  Test output is:

Assertion failed:

assert suggestResponse == ['bose']
   |   |
   |   false
   [174, 8482, bose]


I just tried again using the latest build from today, namely:
https://builds.apache.org/job/Lucene-Solr-Maven-trunk/369/ and we're still
getting the failing assertion. Is there a different way to configure the
HTMLStripCharFilterFactory in Solr4?

Thanks in advance for any tips!

Mike


SolrCell maximum file size

2012-01-24 Thread Augusto Camarotti
Hi everybody
 
Does anyone know if there is a maximum file size that can be uploaded to the 
ExtractingRequestHandler via HTTP request?
 
Thanks in advance,
 
Augusto Camarotti
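
For reference, the multipart upload cap is set in solrconfig.xml's request dispatcher;
a sketch (the ~2GB value mirrors the stock example config and is illustrative only):

    <requestDispatcher handleSelect="true">
      <!-- multipartUploadLimitInKB caps the size of POSTed files -->
      <requestParsers enableRemoteStreaming="false" multipartUploadLimitInKB="2048000" />
    </requestDispatcher>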


Re: Hierarchical faceting in UI

2012-01-24 Thread Darren Govoni

Yuhao,
Ok, let me think about this. A term can have multiple parents. Each of 
those parents would be 'different', yes?
In this case, use a multivalued field for the parent and add all the parent 
names or id's to it. The relations should be unique.

Your UI will associate the correct parent id to build the facet query from and 
return the correct children because the user
is descending down a specific path in the UI and the parent node unique id's 
are returned along the way.

Now, if you are having parent names/id's that themselves can appear in multiple 
locations (vs. just terms 'the leafs'),
then perhaps your hierarchy needs refactoring for redundancy?

Happy to help with more details.

Darren


On 01/24/2012 11:22 AM, Yuhao wrote:

Darren,

One challenge for me is that a term can appear in multiple places of the hierarchy.  So it's not safe to 
simply use the term as it appears to get its children; I probably need to include the entire tree path up 
to this term.  For example, if the hierarchy is "Cardiovascular Diseases>  Arteriosclerosis>  
Coronary Artery Disease", and I'm getting the children of the middle term Arteriosclerosi, I need to 
filter on something like "parent:Cardiovascular Diseases/Arteriosclerosis".

I'm having trouble figuring out how I can get the complete path per above to add to the URL of each facet term.  I 
know "velocity/facet_field.vm" is where I build the URL.  I know how to simply add a 
"parent:" filter to the URL.  But I don't know how to access a document field, like the 
complete parent path, in "facet_field.vm".  Any help would be great.

Yuhao





  From: "dar...@ontrenet.com"
To: Yuhao
Cc: solr-user@lucene.apache.org
Sent: Monday, January 23, 2012 7:16 PM
Subject: Re: Hierarchical faceting in UI


On Mon, 23 Jan 2012 14:33:00 -0800 (PST), Yuhao
wrote:

Programmatically, something like this might work: for each facet field,
add another hidden field that identifies its parent.  Then, program
additional logic in the UI to show only the facet terms at the currently
selected level.  For example, if one filters on "cat:electronics", the new
UI logic would apply the additional filter "cat_parent:electronics".

Can this be done?

Yes. This is how I do it.


Would it be a lot of work?

No. It's not a lot of work: simply represent your hierarchy as parent/child
relations in the document fields, and in your UI drill down by issuing new
faceted searches. Use the current facet (tree level) as the parent:
in the next query. It's much easier than other suggestions for this.


Is there a better way?

Not in my opinion, there isn't. This is the simplest to implement and
understand.


By the way, Flamenco (another faceted browser) has built-in support for
hierarchies, and it has worked well for my data in this aspect (but less
well than Solr in others).  I'm looking for the same kind of

hierarchical

UI feature in Solr.




phrase auto-complete with suggester component

2012-01-24 Thread Tommy Chheng
I'm testing out the various auto-complete functionalities on the
wikipedia dataset.

I first tried the facet.prefix and found it slow at times. I'm now
looking at the Suggester component. Given a query like "new york", I
would like to get results like "New York" or "New York City".

When I tried using the suggest component, it suggests entries for each
word rather than the phrase (even if I add quotes). How can I change my
config to get title matches and not have the query broken into each
word?




<lst name="spellcheck">
  <lst name="suggestions">
    <lst name="new">
      <int name="numFound">5</int>
      <int name="startOffset">0</int>
      <int name="endOffset">3</int>
      <arr name="suggestion">
        <str>newt</str>
        <str>newwy patitta</str>
        <str>newyddion</str>
        <str>newyorker</str>
        <str>newyork–presbyterian hospital</str>
      </arr>
    </lst>
    <lst name="york">
      <int name="numFound">5</int>
      <int name="startOffset">4</int>
      <int name="endOffset">8</int>
      <arr name="suggestion">
        <str>york</str>
        <str>york–dauphin (septa station)</str>
        <str>york—humber</str>
        <str>york—scarborough</str>
        <str>york—simcoe</str>
      </arr>
    </lst>
    <str name="collation">newt york</str>
  </lst>
</lst>



/solr/suggest?q=new%20york&omitHeader=true&spellcheck.count=5&spellcheck.collate=true

solrconfig.xml:
  <searchComponent name="suggest" class="solr.SpellCheckComponent">
    <lst name="spellchecker">
      <str name="name">suggest</str>
      <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
      <str name="lookupImpl">org.apache.solr.spelling.suggest.tst.TSTLookup</str>
      <str name="field">title_autocomplete</str>
      <str name="buildOnCommit">true</str>
    </lst>
  </searchComponent>

  <requestHandler name="/suggest" class="solr.SearchHandler">
    <lst name="defaults">
      <str name="spellcheck">true</str>
      <str name="spellcheck.dictionary">suggest</str>
      <str name="spellcheck.count">10</str>
    </lst>
    <arr name="components">
      <str>suggest</str>
    </arr>
  </requestHandler>

schema.xml:
  [title_autocomplete fieldType and field definitions stripped by the list archive]


-- 
Tommy Chheng


Re: Limiting term frequency in a document to a specific term

2012-01-24 Thread Erick Erickson
At a guess, you're using 3.x and the relevance functions are only
on trunk (4.0).

Best
Erick

On Tue, Jan 24, 2012 at 7:49 AM, solr user  wrote:
> With the Solr search relevancy functions, I get a ParseException: unknown
> function ttf in FunctionQuery.
>
> http://localhost:8983/solr/select/?fl=score,documentPageId&defType=func&q=ttf(contents,amplifiers)
>
> where contents is a field name, and amplifiers is a term in that field.
>
> Just curious why I get a parse exception for the above syntax.
>
>
>
>
> On Monday, January 23, 2012, Ahmet Arslan  wrote:
>>> Below is an example query to search for the term frequency in a document,
>>> but it is returning the frequency for all the terms.
>>>
>>> http://localhost:8983/solr/select/?fl=documentPageId&q=documentPageId:49667.3&qt=tvrh&tv.tf=true&tv.fl=contents
>>>
>>> I would like to be able to limit the query to just one term
>>> that I know
>>> occurs in the document.
>>
>> I don't fully follow but http://wiki.apache.org/solr/FunctionQuery#tf may
> be what you want?
>>


Re: Size of index to use shard

2012-01-24 Thread Erick Erickson
Talking about "index size" can be very misleading. Take
a look at http://lucene.apache.org/java/3_5_0/fileformats.html#file-names.
Note that the *.fdt and *.fdx files are used for stored fields, i.e.
the verbatim copy of data put in the index when you specify
stored="true". These files have virtually no impact on search
speed.

So, if your *.fdx and *.fdt files are 90G out of a 100G index
it is a much different thing than if these files are 10G out of
a 100G index.

And this doesn't even mention the peculiarities of your query mix.
Nor does it say a thing about whether your cheapest alternative
is to add more memory.

Anderson's method is about the only reliable one, you just have
to test with your index and real queries. At some point, you'll
find your tipping point, typically when you come under memory
pressure. And it's a balancing act between how much memory
you allocate to the JVM and how much you leave for the op
system.

Bottom line: No hard and fast numbers. And you should periodically
re-test the empirical numbers you *do* arrive at...

Best
Erick

On Tue, Jan 24, 2012 at 5:31 AM, Anderson vasconcelos
 wrote:
> Apparently, not so easy to determine when to break the content into
> pieces. I'll investigate further about the amount of documents, the
> size of each document and what kind of search is being used. It seems,
> I will have to do a load test to identify the cutoff point to begin
> using the strategy of shards.
>
> Thanks
>
> 2012/1/24, Dmitry Kan :
>> Hi,
>>
>> The article you gave mentions 13GB of index size. It is quite small index
>> from our perspective. We have noticed, that at least solr 3.4 has some sort
>> of "choking" point with respect to growing index size. It just becomes
>> substantially slower than what we need (a query on avg taking more than 3-4
>> seconds) once index size crosses a magic level (about 80GB following our
>> practical observations). We try to keep our indices at around 60-70GB for
>> fast searches and above 100GB for slow ones. We also route majority of user
>> queries to fast indices. Yes, caching may help, but not necessarily we can
>> afford adding more RAM for bigger indices. BTW, our documents are very
>> small, thus in 100GB index we can have around 200 mil. documents. It would
>> be interesting to see, how you manage to ensure q-times under 1 sec with an
>> index of 250GB? How many documents / facets do you ask max. at a time? FYI,
>> we ask for a thousand of facets in one go.
>>
>> Regards,
>> Dmitry
>>
>> On Tue, Jan 24, 2012 at 10:30 AM, Vadim Kisselmann <
>> v.kisselm...@googlemail.com> wrote:
>>
>>> Hi,
>>> it depends from your hardware.
>>> Read this:
>>>
>>> http://www.derivante.com/2009/05/05/solr-performance-benchmarks-single-vs-multi-core-index-shards/
>>> Think about your cache-config (few updates, big caches) and a good
>>> HW-infrastructure.
>>> In my case i can handle a 250GB index with 100mil. docs on a I7
>>> machine with RAID10 and 24GB RAM => q-times under 1 sec.
>>> Regards
>>> Vadim
>>>
>>>
>>>
>>> 2012/1/24 Anderson vasconcelos :
>>> > Hi
>>> > Is there some index size (or number of docs) at which it is necessary to
>>> > break the index into shards?
>>> > I have a index with 100GB of size. This index increase 10GB per year.
>>> > (I don't have information how many docs they have) and the docs never
>>> > will be deleted.  Thinking in 30 years, the index will be with 400GB
>>> > of size.
>>> >
>>> > I think it is not required to break into shards, because I don't consider
>>> > this a "large index". Am I correct? What is a real "large index"?
>>> >
>>> >
>>> > Thanks
>>>
>>


using pre-core properties in dih config

2012-01-24 Thread Robert Stewart
I have a multi-core setup, and for each core I have a shared
data-config.xml which specifies a SQL query for data import.  What I
want to do is have the same data-config.xml file shared between my
cores (linked to the same physical file). I'd like to specify core
properties in solr.xml such that each core loads a different set of
data from SQL.  So my query might look like this:

query="select * from index_values where mod(index_id,${NUM_CORES})=${CORE_ID}"

So I want to have NUM_CORES and CORE_ID specified as properties in
solr.xml, something like:

<solr persistent="true">
  <cores adminPath="/admin/cores">
    <core name="core0" instanceDir="core0">
      <property name="NUM_CORES" value="3"/>
      <property name="CORE_ID" value="0"/>
    </core>
    <core name="core1" instanceDir="core1">
      <property name="NUM_CORES" value="3"/>
      <property name="CORE_ID" value="1"/>
    </core>
    <core name="core2" instanceDir="core2">
      <property name="NUM_CORES" value="3"/>
      <property name="CORE_ID" value="2"/>
    </core>
  </cores>
</solr>


So my question is: is this possible, and if so, what is the exact syntax to
make it work?

Thanks,
Bob
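If solr.xml property substitution turns out not to reach the DIH config, one
fallback is to pass the values on the import request instead and reference
them through DIH's request namespace; a minimal sketch (parameter names are
hypothetical, not from this thread):

<entity name="index_values"
        query="select * from index_values
               where mod(index_id, ${dataimporter.request.numCores}) = ${dataimporter.request.coreId}"/>

and then trigger each core's import with its own values:

http://localhost:8983/solr/core0/dataimport?command=full-import&numCores=3&coreId=0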


Re: java.net.SocketException: Too many open files

2012-01-24 Thread Sethi, Parampreet
Hi Jonty,

You can try raising the maximum number of files a process may open using the
command:

ulimit -n XXX

If the number of open files is not increasing with time, but is just a
constant number that is larger than the system default limit, this should fix
it.
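To see whether descriptors are actually leaking over time, the count can be
watched on the Solr process itself; a quick check, assuming Linux and a
hypothetical pid:

lsof -p <solr_pid> | wc -l     # all open descriptors, files and sockets alike

If that number keeps climbing under steady load, raising the ulimit only
postpones the failure.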

-param

On 1/24/12 11:40 AM, "Michael Kuhlmann"  wrote:

>Hi Jonty,
>
>no, not really. When we first had such problems, we really thought that
>the number of open files was the problem, so we implemented an algorithm
>that performed an optimize from time to time to force a segment merge.
>Due to some misconfiguration, this ran too often, with the result that
>an optimize was issued before the previous optimization had finished,
>which is a really bad idea.
>
>We removed the optimization calls, and since then we didn't have any
>more problems.
>
>However, I never found out the initial reason for the exception. Maybe
>there was some bug in Solr's 3.1 version - we're using 3.5 right now -,
>but I couldn't find a hint in the changelog.
>
>At least we didn't have this exception for more than two months now, so
>I'm hoping that the cause for this has disappeared somehow.
>
>Sorry that I can't help you more.
>
>Greetings,
>Kuli
>
>On 24.01.2012 07:48, Jonty Rhods wrote:
>> Hi Kuli,
>>
>> Did you get the solution of this problem? I am still facing this
>>problem.
>> Please help me to overcome this problem.
>>
>> regards
>>
>>
>> On Wed, Oct 26, 2011 at 1:16 PM, Michael Kuhlmann
>>wrote:
>>
>>> Hi;
>>>
>>> we have a similar problem here. We already raised the file ulimit on
>>> the server to 4096, but this only deferred the problem. We get a
>>> TooManyOpenFilesException every few months.
>>>
>>> The problem has nothing to do with real files. When we had the last
>>> TooManyOpenFilesException, we investigated with netstat -a and saw that
>>> there were about 3900 open sockets in Jetty.
>>>
>>> Curiously, we only have one SolrServer instance per Solr client, and we
>>> only have three clients (our running web servers).
>>>
>>> We have set defaultMaxConnectionsPerHost to 20 and maxTotalConnections
>>> to 100. There should be room enough.
>>>
>>> Sorry that I can't help you, we still have not solved the problem on
>>> our own.
>>>
>>> Greetings,
>>> Kuli
>>>
>>> Am 25.10.2011 22:03, schrieb Jonty Rhods:
 Hi,

 I am using solrj, and for the connection to the server I am using an
 instance of the solr server:

 SolrServer server =  new CommonsHttpSolrServer("
 http://localhost:8080/solr/core0");

 I noticed that after a few minutes it starts throwing the exception
 java.net.SocketException: Too many open files.
 It seems to be related to the instances of HttpClient. How can I limit
 the instances to a certain number, like a connection pool in dbcp etc.?

 I am not experienced in Java, so please help me resolve this problem.

   solr version: 3.4

 regards
 Jonty

>>>
>>>
>>
>
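A minimal sketch of the usual fix, assuming SolrJ 3.x with Commons HttpClient
3.x: share one CommonsHttpSolrServer for the whole application and bound its
connection pool, instead of creating a new instance per request:

import java.net.MalformedURLException;
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.MultiThreadedHttpConnectionManager;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class SolrClientHolder {
    // one shared, thread-safe instance; sockets are pooled by HttpClient
    private static final SolrServer SERVER = create();

    private static SolrServer create() {
        MultiThreadedHttpConnectionManager cm = new MultiThreadedHttpConnectionManager();
        cm.getParams().setDefaultMaxConnectionsPerHost(20);
        cm.getParams().setMaxTotalConnections(100);
        try {
            return new CommonsHttpSolrServer("http://localhost:8080/solr/core0",
                                             new HttpClient(cm));
        } catch (MalformedURLException e) {
            throw new RuntimeException(e);
        }
    }

    public static SolrServer get() {
        return SERVER;
    }
}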



Re: java.net.SocketException: Too many open files

2012-01-24 Thread Michael Kuhlmann

Hi Jonty,

no, not really. When we first had such problems, we really thought that
the number of open files was the problem, so we implemented an algorithm
that performed an optimize from time to time to force a segment merge.
Due to some misconfiguration, this ran too often, with the result that
an optimize was issued before the previous optimization had finished,
which is a really bad idea.


We removed the optimization calls, and since then we didn't have any 
more problems.


However, I never found out the initial reason for the exception. Maybe 
there was some bug in Solr's 3.1 version - we're using 3.5 right now -, 
but I couldn't find a hint in the changelog.


At least we didn't have this exception for more than two months now, so 
I'm hoping that the cause for this has disappeared somehow.


Sorry that I can't help you more.

Greetings,
Kuli

On 24.01.2012 07:48, Jonty Rhods wrote:

Hi Kuli,

Did you get the solution of this problem? I am still facing this problem.
Please help me to overcome this problem.

regards


On Wed, Oct 26, 2011 at 1:16 PM, Michael Kuhlmann  wrote:


Hi;

we have a similar problem here. We already raised the file ulimit on the
server to 4096, but this only deferred the problem. We get a
TooManyOpenFilesException every few months.

The problem has nothing to do with real files. When we had the last
TooManyOpenFilesException, we investigated with netstat -a and saw that
there were about 3900 open sockets in Jetty.

Curiously, we only have one SolrServer instance per Solr client, and we
only have three clients (our running web servers).

We have set defaultMaxConnectionsPerHost to 20 and maxTotalConnections
to 100. There should be room enough.

Sorry that I can't help you, we still have not solved the problem on
our own.

Greetings,
Kuli

Am 25.10.2011 22:03, schrieb Jonty Rhods:

Hi,

I am using solrj, and for the connection to the server I am using an
instance of the solr server:

SolrServer server =  new CommonsHttpSolrServer("
http://localhost:8080/solr/core0");

I noticed that after a few minutes it starts throwing the exception
java.net.SocketException: Too many open files.
It seems to be related to the instances of HttpClient. How can I limit
the instances to a certain number, like a connection pool in dbcp etc.?

I am not experienced in Java, so please help me resolve this problem.

  solr version: 3.4

regards
Jonty


Re: Hierarchical faceting in UI

2012-01-24 Thread Yuhao
Darren,

One challenge for me is that a term can appear in multiple places of the 
hierarchy.  So it's not safe to simply use the term as it appears to get its 
children; I probably need to include the entire tree path up to this term.  For 
example, if the hierarchy is "Cardiovascular Diseases > Arteriosclerosis > 
Coronary Artery Disease", and I'm getting the children of the middle term 
Arteriosclerosis, I need to filter on something like "parent:Cardiovascular 
Diseases/Arteriosclerosis".

I'm having trouble figuring out how I can get the complete path per above to 
add to the URL of each facet term.  I know "velocity/facet_field.vm" is where I 
build the URL.  I know how to simply add a "parent:" filter to the URL.  
But I don't know how to access a document field, like the complete parent path, 
in "facet_field.vm".  Any help would be great.

Yuhao





 From: "dar...@ontrenet.com" 
To: Yuhao  
Cc: solr-user@lucene.apache.org 
Sent: Monday, January 23, 2012 7:16 PM
Subject: Re: Hierarchical faceting in UI
 

On Mon, 23 Jan 2012 14:33:00 -0800 (PST), Yuhao 
wrote:
> Programmatically, something like this might work: for each facet field,
> add another hidden field that identifies its parent.  Then, program
> additional logic in the UI to show only the facet terms at the currently
> selected level.  For example, if one filters on "cat:electronics", the new
> UI logic would apply the additional filter "cat_parent:electronics".  Can
> this be done?  

Yes. This is how I do it.

> Would it be a lot of work?  
No. It's not a lot of work: simply represent your hierarchy as parent/child
relations in the document fields and in your UI drill down by issuing new
faceted searches. Use the current facet (tree level) as the parent:
in the next query. Its much easier than other suggestions for this.

> Is there a better way?
Not in my opinion, there isn't. This is the simplest to implement and
understand.

> 
> By the way, Flamenco (another faceted browser) has built-in support for
> hierarchies, and it has worked well for my data in this aspect (but less
> well than Solr in others).  I'm looking for the same kind of hierarchical
> UI feature in Solr.
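A common way to make a repeated term unambiguous is to index the full path at
every level with a depth prefix and drill down with facet.prefix; a sketch,
where the field name category_path is hypothetical:

Indexed values for one document:
  category_path: 0/Cardiovascular Diseases
  category_path: 1/Cardiovascular Diseases/Arteriosclerosis
  category_path: 2/Cardiovascular Diseases/Arteriosclerosis/Coronary Artery Disease

Children of Arteriosclerosis (restrict, then facet one level deeper):
  fq=category_path:"1/Cardiovascular Diseases/Arteriosclerosis"
  facet.field=category_path
  facet.prefix=2/Cardiovascular Diseases/Arteriosclerosis/

The same leaf term under two different parents then yields two distinct facet
values.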

Re: analyzing stored fields (removing HTML tags)

2012-01-24 Thread darul
You could use a sanitizer, as we do here:

http://stackoverflow.com/questions/1947021/libs-for-html-sanitizing
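For example, a minimal sketch with Jsoup (one of the libraries discussed at
that link), stripping markup before the document is sent to Solr, so the
stored value is already clean text:

import org.jsoup.Jsoup;
import org.apache.solr.common.SolrInputDocument;

public final class HtmlStripper {
    // strip tags client-side so the *stored* field value has no HTML
    public static SolrInputDocument toDoc(String id, String rawHtml) {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", id);
        doc.addField("text", Jsoup.parse(rawHtml).text());
        return doc;
    }
}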



--
View this message in context: 
http://lucene.472066.n3.nabble.com/analyzing-stored-fields-removing-HTML-tags-tp3685144p3685182.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr Java client API

2012-01-24 Thread Erick Erickson
It would really help to see the relevant parts of the code
you're using, so we can tell what you've tried. You might want to
review:
http://wiki.apache.org/solr/UsingMailingLists

Best
Erick

On Mon, Jan 23, 2012 at 2:45 PM, jingjung Ng  wrote:
> Hi,
>
> I implemented the facet using
>
> query.addFacetQuery
> query.addFilterQuery
>
> to facet on:
>
> gender:male
> state:DC
>
> This works fine. How can I facet on multiple values using the SolrJ API, like
> the following:
>
> gender:male
> gender:female
> state:DC
>
>
> I've tried, but it returns 0. Can anyone help?
>
> Thanks,
>
> -jingjung ng
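One likely cause, sketched here with SolrJ (a guess, not confirmed in the
thread): separate filter queries are intersected, so gender:male and
gender:female as two fqs match nothing; values of the same field need to be
ORed inside a single filter query:

import org.apache.solr.client.solrj.SolrQuery;

SolrQuery query = new SolrQuery("*:*");
query.addFilterQuery("gender:(male OR female)"); // one fq, values ORed
query.addFilterQuery("state:DC");                // separate fqs are ANDed
query.setFacet(true);
query.addFacetField("gender", "state");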


Re: "index-time" over boosted

2012-01-24 Thread Jan Høydahl
Hi,

Well, I think you are doing it right, but are getting tricked by either
editing the wrong file, a typo, or browser caching.
Why not start with a fresh Solr 3.5.0: start the example app, index all the
exampledocs and search for "Podcasts"; you get one hit, in the fields "text"
and "features".
Then change solr/example/solr/conf/schema.xml and add omitNorms="true" to these 
two fields. Then stop Solr, delete your index, start Solr, re-index the docs 
and try again. fieldNorm is now 1.0. Once you get that working you can start 
debugging where you got it wrong in your own setup.
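For reference, the two edits in the example schema would look roughly like
this (stock 3.5 field definitions, with only omitNorms added as an
assumption of what the final lines should be):

<field name="features" type="text_general" indexed="true" stored="true" multiValued="true" omitNorms="true"/>
<field name="text" type="text_general" indexed="true" stored="false" multiValued="true" omitNorms="true"/>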

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

On 24. jan. 2012, at 14:55, remi tassing wrote:

> Hello,
> 
> thanks for helping out, Jan, I really appreciate it!
> 
> These are the full explains of two results:
> 
> Result#1.--
> 
> 3.0412199E-5 = (MATCH) max of:
>  3.0412199E-5 = (MATCH) weight(content:"mobil broadband"^0.5 in
> 19081), product of:
>0.13921623 = queryWeight(content:"mobil broadband"^0.5), product of:
>  0.5 = boost
>  6.3531075 = idf(content: mobil=5270 broadband=2392)
>  0.043826185 = queryNorm
>2.1845297E-4 = fieldWeight(content:"mobil broadband" in 19081), product of:
>  3.6055512 = tf(phraseFreq=13.0)
>  6.3531075 = idf(content: mobil=5270 broadband=2392)
>  9.536743E-6 = fieldNorm(field=content, doc=19081)
> 
> Result#2.-
> 
> 2.6991445E-5 = (MATCH) max of:
>  2.6991445E-5 = (MATCH) weight(content:"mobil broadband"^0.5 in
> 15306), product of:
>0.13921623 = queryWeight(content:"mobil broadband"^0.5), product of:
>  0.5 = boost
>  6.3531075 = idf(content: mobil=5270 broadband=2392)
>  0.043826185 = queryNorm
>1.9388145E-4 = fieldWeight(content:"mobil broadband" in 15306), product of:
>  1.0 = tf(phraseFreq=1.0)
>  6.3531075 = idf(content: mobil=5270 broadband=2392)
>  3.0517578E-5 = fieldNorm(field=content, doc=15306)
> 
> Remi
> 
> 
> On Tue, Jan 24, 2012 at 3:38 PM, Jan Høydahl  wrote:
> 
>> That looks right. Can you restart your Solr, do a new search with
>> &debugQuery=true and copy/paste the full EXPLAIN output for your query?
>> 
>> --
>> Jan Høydahl, search solution architect
>> Cominvent AS - www.cominvent.com
>> Solr Training - www.solrtraining.com
>> 
>> On 24. jan. 2012, at 13:22, remi tassing wrote:
>> 
>>> Any idea?
>>> 
>>> This is a snippet of my schema.xml now:
>>> 
>>> <fields>
>>>   <field name="id" type="string" stored="true" indexed="true" required="true"/>
>>>   <field name="content" type="text" stored="true" indexed="true" omitNorms="true"/>
>>>   <field name="..." type="..." stored="true" indexed="true" multiValued="true"/>
>>>   ...
>>> </fields>
>>> 
>>> <uniqueKey>id</uniqueKey>
>>> 
>>>  Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
> Solr Training - www.solrtraining.com
> 
> On 19. jan. 2012, at 13:01, remi tassing wrote:
> 
>> Hello Jan,
>> 
>> My schema wasn't changed from the release 3.5.0. The content can be
>> seen
>> below:
>> 
>> <types>
>>   <fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
>>   <fieldType name="..." class="..." omitNorms="true"/>
>>   <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
>>     <analyzer>
>>       <tokenizer class="..."/>
>>       <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>>       <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
>>               generateNumberParts="1" catenateWords="1" catenateNumbers="1"
>>               catenateAll="0" splitOnCaseChange="1"/>
>>       <filter class="..." protected="protwords.txt"/>
>>       <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>>     </analyzer>
>>   </fieldType>
>>   <fieldType name="..." class="solr.TextField" positionIncrementGap="100">
>>     <analyzer>
>>       <tokenizer class="..."/>
>>       <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1"/>
>>       <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>>     </analyzer>
>>   </fieldType>
>>   ...
>> </types>
>> 
>> <fields>
>>   <field name="..." type="..." stored="true" indexed="false"/>
>>   <field name="..." type="..." stored="true" indexed="false"/>
>>   ...
>> </fields>
>> 



analyzing stored fields (removing HTML tags)

2012-01-24 Thread Robert Stewart
Is it possible to configure schema to remove HTML tags from stored
field content?  As far as I can tell analyzers can only be applied to
indexed content, but they don't affect stored content.  I want to
remove HTML tags from text fields so that returned text content from
stored field has no HTML tags.

Thanks
Bob


Re: Limiting term frequency in a document to a specific term

2012-01-24 Thread solr user
With the Solr search relevancy functions, I get a ParseException: unknown
function ttf in FunctionQuery.

http://localhost:8983/solr/select/?fl=score,documentPageId&defType=func&q=ttf(contents,amplifiers)

where contents is a field name, and amplifiers is a term in that field.

Just curious why I get a parse exception for the above syntax.
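If memory serves, the relevance functions tf() and ttf() only exist on trunk
(what became Solr 4.0), so a 3.x FunctionQuery parser reports them as unknown.
On a build that has them, the syntax itself is fine, e.g.:

http://localhost:8983/solr/select/?fl=score,documentPageId&defType=func&q=ttf(contents,'amplifiers')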




On Monday, January 23, 2012, Ahmet Arslan  wrote:
>> Below is an example query to search for the term frequency
>> in a document,
>> but it is returning the frequency for all the terms.
>>
>> http://localhost:8983/solr/select/?fl=documentPageId&q=documentPageId:49667.3&qt=tvrh&tv.tf=true&tv.fl=contents
>>
>> I would like to be able to limit the query to just one term
>> that I know
>> occurs in the document.
>
> I don't fully follow but http://wiki.apache.org/solr/FunctionQuery#tf may
be what you want?
>


Not getting the expected search results

2012-01-24 Thread m0rt0n
Hello,

I am a newbie in the Solr world, and I am surprised because when I do
searches, both with the browser interface and with a Java client, the
expected results do not appear.

The issue is:

1) I have set up an entity called "via" in my data-config.xml with 5 fields.
I do the full-import and it indexes 1.5M records:

<entity name="via" query="...">
  <field column="NVIAC" name="NVIAC"/>
  <field column="CPRO" name="CPRO"/>
  <field column="CMUM" name="CMUM"/>
  <field column="..." name="..."/>
  <field column="..." name="..."/>
</entity>

2) These 5 fields are mapped in the schema.xml, this way:
   <field name="NVIAC" type="..." indexed="true" stored="true"/>
   <field name="CPRO" type="..." indexed="true" stored="true"/>
   <field name="CMUM" type="..." indexed="true" stored="true"/>
   <field name="..." type="..." indexed="true" stored="true"/>
   <field name="..." type="..." indexed="true" stored="true"/>

3) I try to do a search for "Alcala street in Madrid":
NVIAC:ALCALA AND CPRO:28 AND CMUM:079

But it just gets two results (neither of them the desired one):

079 / 28 / 45363 / ALCALA GAZULES / CALLE
079 / 28 / 08116 / ALCALA GUADAIRA / CALLE

4) When I do the indexing by delimiting the entity search:

   <entity name="via" query="... where ..."/>
The full import does 913 documents and I do the same search, but this time I
get the desired result:

079 / 28 / 00132 / ALCALA / CALLE

Can anyone help me with that? I don't know why it does not work as expected
when I do the full-import of the whole set of streets.

Thanks a lot in advance.


--
View this message in context: 
http://lucene.472066.n3.nabble.com/Not-getting-the-expected-search-results-tp3684974p3684974.html
Sent from the Solr - User mailing list archive at Nabble.com.


full import is not working and still not showing any errors

2012-01-24 Thread scabra4
Hi all, can anyone help me with this, please?
I am trying to do a full import. I've done everything correctly, but when I
try the full import an XML page displays showing the following, and it stays
like this no matter how often I refresh the page:
This XML file does not appear to have any style information associated with
it. The document tree is shown below.

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">0</int>
  </lst>
  <lst name="initArgs">
    <lst name="defaults">
      <str name="config">C:\solr\conf\data-config.xml</str>
    </lst>
  </lst>
  <str name="command">full-import</str>
  <str name="status">busy</str>
  <str name="importResponse">A command is still running...</str>
  <lst name="statusMessages">
    <str name="Time Elapsed">0:5:8.925</str>
    <str name="Total Requests made to DataSource">1</str>
    <str name="Total Rows Fetched">0</str>
    <str name="Total Documents Processed">0</str>
    <str name="Total Documents Skipped">0</str>
    <str name="Full Dump Started">2012-01-24 16:29:31</str>
  </lst>
  <str name="WARNING">This response format is experimental. It is likely to
change in the future.</str>
</response>

--
View this message in context: 
http://lucene.472066.n3.nabble.com/full-import-is-not-working-and-still-not-showing-any-errors-tp3684751p3684751.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: "index-time" over boosted

2012-01-24 Thread remi tassing
Hello,

thanks for helping out, Jan, I really appreciate it!

These are the full explains of two results:

Result#1.--

3.0412199E-5 = (MATCH) max of:
  3.0412199E-5 = (MATCH) weight(content:"mobil broadband"^0.5 in
19081), product of:
0.13921623 = queryWeight(content:"mobil broadband"^0.5), product of:
  0.5 = boost
  6.3531075 = idf(content: mobil=5270 broadband=2392)
  0.043826185 = queryNorm
2.1845297E-4 = fieldWeight(content:"mobil broadband" in 19081), product of:
  3.6055512 = tf(phraseFreq=13.0)
  6.3531075 = idf(content: mobil=5270 broadband=2392)
  9.536743E-6 = fieldNorm(field=content, doc=19081)

Result#2.-

2.6991445E-5 = (MATCH) max of:
  2.6991445E-5 = (MATCH) weight(content:"mobil broadband"^0.5 in
15306), product of:
0.13921623 = queryWeight(content:"mobil broadband"^0.5), product of:
  0.5 = boost
  6.3531075 = idf(content: mobil=5270 broadband=2392)
  0.043826185 = queryNorm
1.9388145E-4 = fieldWeight(content:"mobil broadband" in 15306), product of:
  1.0 = tf(phraseFreq=1.0)
  6.3531075 = idf(content: mobil=5270 broadband=2392)
  3.0517578E-5 = fieldNorm(field=content, doc=15306)

Remi


On Tue, Jan 24, 2012 at 3:38 PM, Jan Høydahl  wrote:

> That looks right. Can you restart your Solr, do a new search with
> &debugQuery=true and copy/paste the full EXPLAIN output for your query?
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
> Solr Training - www.solrtraining.com
>
> On 24. jan. 2012, at 13:22, remi tassing wrote:
>
> > Any idea?
> >
> > This is a snippet of my schema.xml now:
> >
> > <fields>
> >   <field name="id" type="string" stored="true" indexed="true" required="true"/>
> >   <field name="content" type="text" stored="true" indexed="true" omitNorms="true"/>
> >   <field name="..." type="..." stored="true" indexed="true" multiValued="true"/>
> >   ...
> > </fields>
> >
> > <uniqueKey>id</uniqueKey>
> >
> >  >>> Jan Høydahl, search solution architect
> >>> Cominvent AS - www.cominvent.com
> >>> Solr Training - www.solrtraining.com
> >>>
> >>> On 19. jan. 2012, at 13:01, remi tassing wrote:
> >>>
>  Hello Jan,
> 
>  My schema wasn't changed from the release 3.5.0. The content can be
> seen
>  below:
> 
>  <types>
>    <fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
>    <fieldType name="..." class="..." omitNorms="true"/>
>    <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
>      <analyzer>
>        <tokenizer class="..."/>
>        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
>                generateNumberParts="1" catenateWords="1" catenateNumbers="1"
>                catenateAll="0" splitOnCaseChange="1"/>
>        <filter class="..." protected="protwords.txt"/>
>        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>      </analyzer>
>    </fieldType>
>    <fieldType name="..." class="solr.TextField" positionIncrementGap="100">
>      <analyzer>
>        <tokenizer class="..."/>
>        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1"/>
>        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>      </analyzer>
>    </fieldType>
>    ...
>  </types>
> 
>  <fields>
>    <field name="..." type="..." stored="true" indexed="false"/>
>    <field name="..." type="..." stored="true" indexed="false"/>
>    ...
>  </fields>
> >>
>
>


RE: Highlighting more than 1 term

2012-01-24 Thread Tim Hibbs
Nitin and any others who may have followed this item,

I resolved the issue, but I'm not exactly sure of the originating cause.
I had changed the field types of my "text" fields to "text_en" and then
re-indexed. Changing to "text_en" kept highlighting from happening on
more than one term in the fields for which I desired highlighting. Note
that I used the "stock" fieldtype definitions supplied with solr.

Once I changed the field type back to "text" and re-indexed again,
highlighting multiple terms in the same field was re-enabled.

Thanks,
Tim Hibbs

-Original Message-
From: csscouter [mailto:tim.hi...@verizon.net] 
Sent: Thursday, January 19, 2012 9:54 AM
To: solr-user@lucene.apache.org
Subject: RE: Highlighting more than 1 term

Nitin (and any other interested parties here):

Unfortunately, re-indexing the content did not resolve the problem and
the symptom remains the same. Any additional advice is appreciated.

Tim

--
View this message in context:
http://lucene.472066.n3.nabble.com/Highlighting-more-than-1-term-tp36708
62p3672538.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: "index-time" over boosted

2012-01-24 Thread Jan Høydahl
That looks right. Can you restart your Solr, do a new search with 
&debugQuery=true and copy/paste the full EXPLAIN output for your query?

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

On 24. jan. 2012, at 13:22, remi tassing wrote:

> Any idea?
> 
> This is a snippet of my schema.xml now:
> 
> <fields>
>   <field name="id" type="string" stored="true" indexed="true" required="true"/>
>   <field name="content" type="text" stored="true" indexed="true" omitNorms="true"/>
>   <field name="..." type="..." stored="true" indexed="true" multiValued="true"/>
>   ...
> </fields>
> 
> <uniqueKey>id</uniqueKey>
> 
> >> Jan Høydahl, search solution architect
>>> Cominvent AS - www.cominvent.com
>>> Solr Training - www.solrtraining.com
>>> 
>>> On 19. jan. 2012, at 13:01, remi tassing wrote:
>>> 
 Hello Jan,
 
 My schema wasn't changed from the release 3.5.0. The content can be seen
 below:

 <types>
   <fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
   <fieldType name="..." class="..." omitNorms="true"/>
   <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
     <analyzer>
       <tokenizer class="..."/>
       <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
       <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
               generateNumberParts="1" catenateWords="1" catenateNumbers="1"
               catenateAll="0" splitOnCaseChange="1"/>
       <filter class="..." protected="protwords.txt"/>
       <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
     </analyzer>
   </fieldType>
   <fieldType name="..." class="solr.TextField" positionIncrementGap="100">
     <analyzer>
       <tokenizer class="..."/>
       <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1"/>
       <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
     </analyzer>
   </fieldType>
   ...
 </types>

 <fields>
   <field name="..." type="..." stored="true" indexed="false"/>
   <field name="..." type="..." stored="true" indexed="false"/>
   ...
 </fields>



Re: Size of index to use shard

2012-01-24 Thread Anderson vasconcelos
Apparently it's not so easy to determine when to break the content into
pieces. I'll investigate further into the number of documents, the
size of each document and what kind of searches are being used. It seems
I will have to do a load test to identify the cutoff point at which to
begin using the shard strategy.

Thanks

2012/1/24, Dmitry Kan :
> Hi,
>
> The article you gave mentions a 13GB index, which is quite small from our
> perspective. We have noticed that at least Solr 3.4 has some sort of
> "choking" point with respect to growing index size: it becomes
> substantially slower than what we need (a query taking more than 3-4
> seconds on average) once the index size crosses a magic level (about 80GB,
> following our practical observations). We try to keep our indices at around
> 60-70GB for fast searches and above 100GB for slow ones, and we route the
> majority of user queries to the fast indices. Yes, caching may help, but we
> cannot necessarily afford to add more RAM for bigger indices. BTW, our
> documents are very small, so in a 100GB index we can have around 200 mil.
> documents. It would be interesting to see how you manage to ensure q-times
> under 1 sec with an index of 250GB. How many documents / facets do you ask
> for at a time, max.? FYI, we ask for a thousand facets in one go.
>
> Regards,
> Dmitry
>
> On Tue, Jan 24, 2012 at 10:30 AM, Vadim Kisselmann <
> v.kisselm...@googlemail.com> wrote:
>
>> Hi,
>> it depends on your hardware.
>> Read this:
>>
>> http://www.derivante.com/2009/05/05/solr-performance-benchmarks-single-vs-multi-core-index-shards/
>> Think about your cache config (few updates, big caches) and a good
>> HW infrastructure.
>> In my case I can handle a 250GB index with 100 mil. docs on an i7
>> machine with RAID10 and 24GB RAM => q-times under 1 sec.
>> Regards
>> Vadim
>>
>>
>>
>> 2012/1/24 Anderson vasconcelos :
>> > Hi
>> > Is there some index size (or number of docs) at which it becomes
>> > necessary to break the index into shards?
>> > I have an index of 100GB. This index grows by 10GB per year
>> > (I don't have information on how many docs it holds) and the docs will
>> > never be deleted. So in 30 years the index will be about 400GB in size.
>> >
>> > I think sharding is not required, because I don't consider this a
>> > "large index". Am I correct? What is a real "large index"?
>> >
>> >
>> > Thanks
>>
>


Re: Advanced stopword handling edismax

2012-01-24 Thread O. Klein

O. Klein wrote
> 
> As I understand it with edismax in trunk, whenever you have a query that
> only contains stopwords then all the terms are required.
> 
> But when I try this I only get an empty parsedQuery like: (+() () () () ()
> () () () () () ()
> FunctionQuery((1.0/(3.16E-11*float(ms(const(132710400),date(date_dt)))+1.0))^50.0))/no_coord
> 
> Am I misunderstanding this feature? Or is something going wrong?
> 

Can someone at least confirm that when using edismax with a query like "to be
or not to be" (with the English stopword list) the parsed query is empty?

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Advanced-stopword-handling-edismax-tp3677878p3684599.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Size of index to use shard

2012-01-24 Thread Dmitry Kan
Hi,

The article you gave mentions a 13GB index, which is quite small from our
perspective. We have noticed that at least Solr 3.4 has some sort of
"choking" point with respect to growing index size: it becomes substantially
slower than what we need (a query taking more than 3-4 seconds on average)
once the index size crosses a magic level (about 80GB, following our
practical observations). We try to keep our indices at around 60-70GB for
fast searches and above 100GB for slow ones, and we route the majority of
user queries to the fast indices. Yes, caching may help, but we cannot
necessarily afford to add more RAM for bigger indices. BTW, our documents are
very small, so in a 100GB index we can have around 200 mil. documents. It
would be interesting to see how you manage to ensure q-times under 1 sec with
an index of 250GB. How many documents / facets do you ask for at a time,
max.? FYI, we ask for a thousand facets in one go.

Regards,
Dmitry

On Tue, Jan 24, 2012 at 10:30 AM, Vadim Kisselmann <
v.kisselm...@googlemail.com> wrote:

> Hi,
> it depends on your hardware.
> Read this:
>
> http://www.derivante.com/2009/05/05/solr-performance-benchmarks-single-vs-multi-core-index-shards/
> Think about your cache config (few updates, big caches) and a good
> HW infrastructure.
> In my case I can handle a 250GB index with 100 mil. docs on an i7
> machine with RAID10 and 24GB RAM => q-times under 1 sec.
> Regards
> Vadim
>
>
>
> 2012/1/24 Anderson vasconcelos :
> > Hi
> > Is there some index size (or number of docs) at which it becomes
> > necessary to break the index into shards?
> > I have an index of 100GB. This index grows by 10GB per year
> > (I don't have information on how many docs it holds) and the docs will
> > never be deleted. So in 30 years the index will be about 400GB in size.
> >
> > I think sharding is not required, because I don't consider this a
> > "large index". Am I correct? What is a real "large index"?
> >
> >
> > Thanks
>


Re: "index-time" over boosted

2012-01-24 Thread remi tassing
Any idea?

This is a snippet of my schema.xml now:

<fields>
  <field name="id" type="string" stored="true" indexed="true" required="true"/>
  <field name="content" type="text" stored="true" indexed="true" omitNorms="true"/>
  <field name="..." type="..." stored="true" indexed="true" multiValued="true"/>
  ...
</fields>

<uniqueKey>id</uniqueKey>
  > Jan Høydahl, search solution architect
> > Cominvent AS - www.cominvent.com
> > Solr Training - www.solrtraining.com
> >
> > On 19. jan. 2012, at 13:01, remi tassing wrote:
> >
> >> Hello Jan,
> >>
> >> My schema wasn't changed from the release 3.5.0. The content can be seen
> >> below:
> >>
> >> <types>
> >>   <fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
> >>   <fieldType name="..." class="..." omitNorms="true"/>
> >>   <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
> >>     <analyzer>
> >>       <tokenizer class="..."/>
> >>       <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
> >>       <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
> >>               generateNumberParts="1" catenateWords="1" catenateNumbers="1"
> >>               catenateAll="0" splitOnCaseChange="1"/>
> >>       <filter class="..." protected="protwords.txt"/>
> >>       <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
> >>     </analyzer>
> >>   </fieldType>
> >>   <fieldType name="..." class="solr.TextField" positionIncrementGap="100">
> >>     <analyzer>
> >>       <tokenizer class="..."/>
> >>       <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1"/>
> >>       <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
> >>     </analyzer>
> >>   </fieldType>
> >>   ...
> >> </types>
> >>
> >> <fields>
> >>   <field name="..." type="..." stored="true" indexed="false"/>
> >>   <field name="..." type="..." stored="true" indexed="false"/>
> >>   ...
> >> </fields>
> >>


highlighter not supporting surround parser

2012-01-24 Thread manyutomar
I want to perform span queries using the surround parser and I want to show
the results with the highlighter, but the problem is that the highlighter is
not working properly with the surround query parser. Are there any plugins or
updates available to do this?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/highlighter-not-supporting-surround-parser-tp3684474p3684474.html
Sent from the Solr - User mailing list archive at Nabble.com.


solr stopwords issue - documents are not matching

2012-01-24 Thread Ankita Patil
Hi,

I am using solr-3.4. The relevant part of my schema looks like:

<fieldType name="..." class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="..."/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_en.txt"/>
    <filter class="..."/>
    ...
  </analyzer>
</fieldType>

<field name="cContent" type="..." indexed="true" stored="true"/>

stopwords_en.txt contains:
a
an
and
are
as

etc..

Now when I search for "buy house", Solr does not return the documents
with the text "buy a house".
Also, when I search for "buy a house", Solr does not return the
documents with the text "buy house".

Part of the debugQuery output is:
rawquerystring: cContent:"buy a house"
querystring: cContent:"buy a house"
parsedquery: PhraseQuery(cContent:"bui ? hous")
parsedquery_toString: cContent:"bui ? hous"

Any idea how I can solve this problem, or what is wrong?

Thanks
Ankita
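A hedged note on the likely cause: the "?" in the parsed phrase is a position
hole left by StopFilterFactory when enablePositionIncrements is on, so the
token positions of "buy house" and "buy a house" no longer line up for an
exact phrase match. Keeping that setting identical at index and query time
and allowing one position of slop usually helps; a sketch (attribute values
are assumptions, not from the original schema):

<filter class="solr.StopFilterFactory" ignoreCase="true"
        words="stopwords_en.txt" enablePositionIncrements="true"/>

and query with slop, e.g. cContent:"buy house"~1 (or qs=1 with dismax/edismax).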


Re: Highlighting stopwords

2012-01-24 Thread O. Klein
Ah, I never used the hl.q

That did the trick. Thanx!

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Highlighting-stopwords-tp3681901p3684245.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Filtering search results by an external set of values

2012-01-24 Thread Mikhail Khludnev
Phil,

Some time ago I posted my thoughts about a similar problem. Scroll to
part II.

http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201201.mbox/%3CCANGii8egoB1_rXFfwJMheyxx72v48B_DA-6KteKOymiBrR=m...@mail.gmail.com%3E

Regards

On Tue, Jan 24, 2012 at 1:36 PM, John, Phil (CSS) wrote:

> Thanks for the responses.
>
> Groups probably wouldn't work as while there will be some overlap between
> customers, each will have a very different overall set of accessible
> resources.
>
> I'll try the suggestion about simply reindexing, or using the no-cache
> option and see how I get on.
>
> Failing that, are there hooks to write custom filter modules that use
> other parts of the records to decide on whether to include them in a result
> set or not? In our use case, the documents represent articles, which have
> an "issue" field. Each customer has defined issues (or ranges of issues)
> that they have subscriptions to, so the upper bounds for "what to filter"
> would probably be fairly small (10k - 20k issues/ranges). This could
> probably be used with the no-cache option you've pointed me to.
>
> Best wishes,
>
> Phil.
>
> -Original Message-
> From: Erick Erickson [mailto:erickerick...@gmail.com]
> Sent: 23 January 2012 17:34
> To: solr-user@lucene.apache.org
> Subject: Re: Filtering search results by an external set of values
>
> A second, but arguably quite expert option, is to use the no-cache option.
> See: https://issues.apache.org/jira/browse/SOLR-2429
>
> The idea here is that you can specify that a filter is "expensive" and it
> will only be run after all the other filters & etc have been applied.
> Furthermore,
> it will not be cached and only documents that pass through all the other
> filters will be matched against this filter. It has been specifically used
> for ACL calculations...
>
> That said, see exactly how painful storing auth tokens is. I can index, on
> a relatively underpowered laptop, 11M Wiki documents in 5 minutes or so. If
> your worst-case rights update take 1/2 hour to re-index and it only happens
> once a month, why be complex?
>
> And groups, as Jan says, often make even this unnecessary.
>
> Best
> Erick
>
> On Mon, Jan 23, 2012 at 5:16 AM, Jan Høydahl 
> wrote:
> > Hi,
> >
> > Do you have any kind of "group" membership for your users? If you do,
> > a resource's list of security access tokens could be smaller and avoid
> > re-indexing most resources when adding "normal" users which mostly
> > belong to groups. The common way is to add filters on the query. You
> > may do it yourself or have some framework/plugin to it for you, see
> > http://wiki.apache.org/solr/SolrSecurity#Document_Level_Security
> >
> > --
> > Jan Høydahl, search solution architect Cominvent AS -
> > www.cominvent.com Solr Training - www.solrtraining.com
> >
> > On 23. jan. 2012, at 11:49, John, Phil (CSS) wrote:
> >
> >> Hi,
> >>
> >>
> >>
> >> We're building quite a large shared index of resources, using Solr.
> >> The application that makes use of these resources is a multitenant
> >> one (i.e., many customers using the same index). For resources that
> >> are "private" to a customer, it's fairly easy to tag a document with
> >> their customer ID and use a FilterQuery to limit results to just
> >> their "stuff".
> >>
> >>
> >>
> >> We are soon going to be adding a large number (many tens of millions)
> >> of records that will be shared amongst customers. Not all customers
> >> will have access to the same shared resources, e.g.:
> >>
> >>
> >>
> >> * Shared resource 1:
> >>
> >> o   Customer 1
> >>
> >> o   Customer 3
> >>
> >>
> >>
> >> * Shared resource 2:
> >>
> >> o   Customer 2
> >>
> >> o   Customer 1
> >>
> >>
> >>
> >> The issue is, what is the best way to model this in Solr? Should we
> >> have multiple customer_id fields on each record, and then use the
> >> filter query as with "private" resources, or is there a better way of
> doing it?
> >> What happens if we need to do a bulk change - i.e. adding new
> >> customer, or a previous customer has a large change in what shared
> >> resources they have access to? Am I right in thinking that we'd need
> >> to go through every shared resource, read it, make the required
> >> change, and reindex it?
> >>
> >>
> >>
> >> I'm wondering if there's a way, instead of updating these resources
> >> directly, I could construct a set of documents that would act as a
> >> filter at query time of which shared resources to return?
> >>
> >>
> >>
> >> Kind regards,
> >>
> >>
> >>
> >> Phil John
> >>
> >> Technical Lead, Capita Software Services
> >>
> >> Knights Court, Solihull Parkway
> >>
> >> Birmingham Business Park B37 7YB
> >>
> >> Office: 0870 400 5000
> >>
> >> Fax: 0870 400 5001
> >> email: philj...@capita.co.uk 
> >>
> >>
> >>
> >> Part of Capita plc www.capita.co.uk 
> >>
> >>
> >>
> >>
> >>
> >> This email and any attachment to it are confidential.  Unless you are
>

Re: ExtractionHandler/Cell ignore just 2 fields defined in schema 3.5.0

2012-01-24 Thread Wayne W
Ah perfect - thank you Jan so much. :-)


On Tue, Jan 24, 2012 at 11:14 AM, Jan Høydahl  wrote:
> Hi,
>
> It's because lowernames=true by default in solrconfig.xml, and it will 
> convert any "-" into "_" in field names. So try adding a request parameter 
> &lowernames=false or change the default in solrconfig.xml. Alternatively, 
> leave as is but name your fields project_id and company_id :)
>
> http://wiki.apache.org/solr/ExtractingRequestHandler#Input_Parameters
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
> Solr Training - www.solrtraining.com
>
> On 23. jan. 2012, at 22:26, Wayne W wrote:
>
>> Hi,
>>
>> I've been trying to figure this out for a few days now and I'm just not
>> getting anywhere, so any pointers would be MOST welcome. I'm in the
>> process of upgrading from 1.3 to the latest and greatest version of
>> Solr and I'm getting there slowly. However I have this (final) problem
>> that when sending a document for extraction, 2 of my fields defined in
>> my schema are ignored. When I don't using the extraction the fields
>> are used fine (I can see them via Luke).
>>
>> My schema has:
>> <fields>
>>   ...
>>   <field name="company-id" type="..." indexed="true" stored="true" multiValued="true" omitNorms="true"/>
>>   <field name="project-id" type="..." indexed="true" stored="true" multiValued="true"/>
>>   <field name="..." type="..." indexed="true" stored="true" multiValued="true"/>
>>   <field name="..." type="..." indexed="true" stored="true" multiValued="true"/>
>> </fields>
>>
>>
>> My request:
>> INFO: [] webapp=/solr path=/update/extract
>> params={literal.company-id=8&literal.uid=hub.app.model.Document#203657&literal.date=2012-01-23T21:10:42Z&literal.id=203657&literal.type=hub.app.model.Document&idx.attr=true&literal.label=&literal.title=hotel+surfers.pdf&def.fl=text&literal.project-id=36}
>> status=0 QTime=3579
>> Jan 24, 2012 8:10:58 AM org.apache.solr.update.DirectUpdateHandler2 commit
>>
>>
>> For unknown reasons the fields 'company-id', and 'project-id' are ignored.
>>
>> any ideas?
>> many thanks
>> Wayne
>
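With lowernames=false, a request that keeps the hyphenated names could look
roughly like this (a sketch; the file name and upload field are hypothetical):

curl "http://localhost:8080/solr/update/extract?literal.company-id=8&literal.project-id=36&literal.id=203657&lowernames=false" \
  -F "myfile=@hotel-surfers.pdf"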


RE: Filtering search results by an external set of values

2012-01-24 Thread John, Phil (CSS)
Thanks for the responses.

Groups probably wouldn't work as while there will be some overlap between 
customers, each will have a very different overall set of accessible resources.

I'll try the suggestion about simply reindexing, or using the no-cache option 
and see how I get on.

Failing that, are there hooks to write custom filter modules that use other 
parts of the records to decide on whether to include them in a result set or 
not? In our use case, the documents represent articles, which have an "issue" 
field. Each customer has defined issues (or ranges of issues) that they have 
subscriptions to, so the upper bounds for "what to filter" would probably be 
fairly small (10k - 20k issues/ranges). This could probably be used with the 
no-cache option you've pointed me to.

Best wishes,

Phil.

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: 23 January 2012 17:34
To: solr-user@lucene.apache.org
Subject: Re: Filtering search results by an external set of values

A second, but arguably quite expert option, is to use the no-cache option.
See: https://issues.apache.org/jira/browse/SOLR-2429

The idea here is that you can specify that a filter is "expensive" and it will 
only be run after all the other filters & etc have been applied.
Furthermore,
it will not be cached and only documents that pass through all the other 
filters will be matched against this filter. It has been specifically used for 
ACL calculations...
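For reference, the expensive-filter option from SOLR-2429 is expressed with
local params on the fq; a sketch, with the field and values hypothetical:

fq={!cache=false cost=150}issue:(17 OR [100 TO 250])

A cost of 100 or more marks the filter as a post filter where the
implementation supports it, so it only sees documents that survived the
cheaper filters.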

That said, see exactly how painful storing auth tokens is. I can index, on a 
relatively underpowered laptop, 11M Wiki documents in 5 minutes or so. If your 
worst-case rights update take 1/2 hour to re-index and it only happens once a 
month, why be complex?

And groups, as Jan says, often make even this unnecessary.

Best
Erick

On Mon, Jan 23, 2012 at 5:16 AM, Jan Høydahl  wrote:
> Hi,
>
> Do you have any kind of "group" membership for your users? If you do, 
> a resource's list of security access tokens could be smaller and avoid 
> re-indexing most resources when adding "normal" users which mostly 
> belong to groups. The common way is to add filters on the query. You 
> may do it yourself or have some framework/plugin to it for you, see 
> http://wiki.apache.org/solr/SolrSecurity#Document_Level_Security
>
> --
> Jan Høydahl, search solution architect Cominvent AS - 
> www.cominvent.com Solr Training - www.solrtraining.com
>
> On 23. jan. 2012, at 11:49, John, Phil (CSS) wrote:
>
>> Hi,
>>
>>
>>
>> We're building quite a large shared index of resources, using Solr. 
>> The application that makes use of these resources is a multitenant 
>> one (i.e., many customers using the same index). For resources that 
>> are "private" to a customer, it's fairly easy to tag a document with 
>> their customer ID and use a FilterQuery to limit results to just 
>> their "stuff".
>>
>>
>>
>> We are soon going to be adding a large number (many tens of millions) 
>> of records that will be shared amongst customers. Not all customers 
>> will have access to the same shared resources, e.g.:
>>
>>
>>
>> *         Shared resource 1:
>>
>> o   Customer 1
>>
>> o   Customer 3
>>
>>
>>
>> *         Shared resource 2:
>>
>> o   Customer 2
>>
>> o   Customer 1
>>
>>
>>
>> The issue is, what is the best way to model this in Solr? Should we 
>> have multiple customer_id fields on each record, and then use the 
>> filter query as with "private" resources, or is there a better way of doing 
>> it?
>> What happens if we need to do a bulk change - i.e. adding new 
>> customer, or a previous customer has a large change in what shared 
>> resources they have access to? Am I right in thinking that we'd need 
>> to go through every shared resource, read it, make the required 
>> change, and reindex it?
>>
>>
>>
>> I'm wondering if there's a way, instead of updating these resources 
>> directly, I could construct a set of documents that would act as a 
>> filter at query time of which shared resources to return?
>>
>>
>>
>> Kind regards,
>>
>>
>>
>> Phil John
>>
>> Technical Lead, Capita Software Services
>>
>> Knights Court, Solihull Parkway
>>
>> Birmingham Business Park B37 7YB
>>
>> Office: 0870 400 5000
>>
>> Fax: 0870 400 5001
>> email: philj...@capita.co.uk 
>>
>>
>>
>> Part of Capita plc www.capita.co.uk 
>>
>>
>>
>>
>>
>> This email and any attachment to it are confidential.  Unless you are the 
>> intended recipient, you may not use, copy or disclose either the message or 
>> any information contained in the message. If you are not the intended 
>> recipient, you should delete this email and notify the sender immediately.
>>
>> Any views or opinions expressed in this email are those of the sender only, 
>> unless otherwise stated.  All copyright in any Capita material in this email 
>> is reserved.
>>
>> All emails, incoming and outgoing, may be recorded by Capita and monitored 
>> for legitimate business purposes.

Re: Size of index to use shard

2012-01-24 Thread Vadim Kisselmann
Hi,
it depends on your hardware.
Read this:
http://www.derivante.com/2009/05/05/solr-performance-benchmarks-single-vs-multi-core-index-shards/
Think about your cache config (few updates, big caches) and a good
HW infrastructure.
In my case I can handle a 250GB index with 100 mil. docs on an i7
machine with RAID10 and 24GB RAM => q-times under 1 sec.
Regards
Vadim



2012/1/24 Anderson vasconcelos :
> Hi
> Is there some index size (or number of docs) at which it becomes necessary
> to break the index into shards?
> I have an index of 100GB. This index grows by 10GB per year
> (I don't have information on how many docs it holds) and the docs will
> never be deleted. So in 30 years the index will be about 400GB in size.
>
> I think sharding is not required, because I don't consider this a
> "large index". Am I correct? What is a real "large index"?
>
>
> Thanks


Re: Highlighting stopwords

2012-01-24 Thread Koji Sekiguchi

(12/01/24 9:31), O. Klein wrote:

Let's say I search for "spellcheck solr" on a website that only contains
info about Solr, so "solr" was added to the stopwords.txt. The query that
will be parsed then (dismax) will not contain the term "solr".

So fragments won't contain highlights of the term "solr". So when a fragment
with the highlighted term "spellcheck" is generated, it would be less
confusing for people who don't know how search engines work to also
highlight the term "solr".

So my first test was to have a field with StopFilterFactory and search on
that field, while using another field without StopFilterFactory to highlight
on. This didn't do the trick.


Are you saying that using the hl.q parameter on the highlight field, while
using q on the search field that has StopFilter, doesn't work for you?

koji
--
http://www.rondhuit.com/en/
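For anyone following along, the working combination looks roughly like this
(field names hypothetical): query the stopped field, highlight an unstopped
copy, and repeat the user's query in hl.q:

q=spellcheck solr&defType=dismax&qf=text&hl=true&hl.fl=text_unstopped&hl.q=spellcheck solr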


Re: hot deploy of newer version of solr schema in production

2012-01-24 Thread Jan Høydahl
Hi,

To be able to do a true hot deploy of a newer schema without reindexing, you must 
carefully see to it that none of your changes are breaking changes. So you should 
test the process on your development machine and make sure it works. Adding and 
deleting fields would work, but not changing the field-type or analysis of an 
existing field. Depending on from/to version, you may want to keep the old 
schema-version number.

The process is:
1. Deploy the new schema, including all dependencies such as dictionaries
2. Do a RELOAD CORE http://wiki.apache.org/solr/CoreAdmin#RELOAD
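For a core named core0 (the name here is hypothetical), the reload is a single
HTTP call to the CoreAdmin handler:

http://localhost:8983/solr/admin/cores?action=RELOAD&core=core0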

My preference is to do a more thorough upgrade of schema including new 
functionality and breaking changes, and then do a full reindex. The exception 
is if my index is huge and the reason for Solr upgrade or schema change is to 
fix a bug, not to use new functionality.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

On 24. jan. 2012, at 01:51, roz dev wrote:

> Hi All,
> 
> I need the community's feedback about deploying newer versions of a solr
> schema into production while the existing (older) schema is in use by
> applications.
> 
> How do people perform these things? What have people learned about this?
> 
> Any thoughts are welcome.
> 
> Thanks
> Saroj