Check out SearchWorkings.org - it just went live!

2011-09-09 Thread Simon Willnauer
Hey folks,

As some of you might have heard, a small group of passionate search
technology professionals and I have been working hard over the last few
months to launch a community site known as SearchWorkings.org [1]. This
initiative gives search professionals a single point of contact, a
comprehensive resource where they can learn and talk about all the
exciting new developments in the world of open source search.

Anyone familiar with open source search knows that technologies like
Lucene and Solr have grown tremendously in popularity over the years,
but this growth has also brought a number of challenges, such as
limited support and education. With the launch of SearchWorkings.org we
are convinced we can resolve some of these challenges.

Covering open source search technologies from Apache Lucene and Apache
Solr to Apache Mahout, one of the key objectives for the community is
to create a place where search specialists can engage with one another
and enjoy a single point of contact for various resources, downloads
and documentation.

Like any other community website, content will be added on a regular
basis, and community members can make their own contributions and stay
on top of everything search related. For now, there is access to an
extensive resource centre offering online tutorials, downloads, white
papers and access to a host of search specialists in the forum. With
the ability to post blog items and keep up to date with relevant news,
the site is a search specialist's dream come true and addresses what we
felt was a clear need in the market.

SearchWorkings.org starts off with an initial focus on Lucene, Solr &
Friends but aims to be much broader. Each of you can & should
contribute: tell us your search, data-processing, setup or optimization
story. I am looking forward to more and more blogs, articles and
tutorials about smaller projects like Apache Lucy, real-world
case studies and third-party extensions for OSS search components.

have fun,

Simon

[1] http://www.searchworkings.org
[2] Trademark Acknowledgement: Apache Lucene, Apache Solr, Apache
Mahout, Apache Lucy and their respective logos are trademarks of The
Apache Software Foundation. All other marks mentioned may be trademarks
or registered trademarks of their respective owners.


question about StandardAnalyzer, differences between solr 1.4 and solr 3.3

2011-09-09 Thread Marc Des Garets
Hi,

I have a simple field defined like this:

<fieldtype name="text_standard" class="solr.TextField">
  <analyzer class="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
</fieldtype>

Which I use here:
<field name="middlename" type="text_standard" indexed="true" stored="true" required="false" />

In solr 1.4, I could do:
?q=(middlename:a*)

And I was getting all documents where middlename = A or where middlename starts
with the letter A.

In solr 3.3, I get only results where middlename starts with the letter A, but
not where middlename is equal to A.

The thing is, this happens only with the letter A; with other letters it is
fine, I get the ones starting with the letter and the ones equal to the letter.
My guess is that it treats A as the English article, but I do not specify any
filter with stopwords, so how come the behaviour with the letter A is different
from the other letters? Is there a bug? How can I change my field to work with
the letter A the same way it does with other letters?


Thanks,
Marc


indexing data from rich documents - Tika with solr3.1

2011-09-09 Thread scorpking
Hi everyone,
I have a problem with Tika and Solr. I succeeded in indexing data from
various file formats (pdf, doc, ...) given an absolute file path, but now I
have a link from the internet (e.g. http://myweb/filename.pdf). I want to
index from this link, but it doesn't work and I don't know why. This is my
dataconfig.xml:

<dataConfig>
  <dataSource name="bin" type="BinFileDataSource" />
  <document>
    <entity processor="TikaEntityProcessor"
            url="http://myweb/filename.pdf" format="text" dataSource="bin">
      ...
    </entity>
  </document>
</dataConfig>

When I change url="http://myweb/filename.pdf" to an absolute file path, it
works very well.
Does anyone know about this?
Thanks for your help.



scoring only by higher boost

2011-09-09 Thread crisfromnova
Hi,

For my search I want to calculate the score only by higher boost. For
example:
doc1
<doc>
  <field name="name">Charlie</field>
  <field name="surname">Jonhson</field>
</doc>

doc2
<doc>
  <field name="name">Charlie</field>
  <field name="surname">Charlie</field>
</doc>

So when I use the query "q=name:Charlie^5 surname:Charlie^2", I want both
documents to have the same score, based on the boost value of the first
field matched.

I use a custom Similarity class and I override all methods which can
influence the score (computeNorm(...), tf(...), idf(...), queryNorm(...)),
but I don't know how to change the score to take into consideration only the
higher boost value and not the sum of boosts from all matched fields.

Any idea, please...

 



Re: scoring only by higher boost

2011-09-09 Thread Jamie Johnson
I could be wrong, but isn't that what edismax does?

http://lucene.apache.org/java/3_0_0/api/core/org/apache/lucene/search/DisjunctionMaxQuery.html
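
For instance (a sketch, assuming the stock /select handler and the field
names from the original post), dismax wraps each clause in a
DisjunctionMaxQuery, and with tie=0.0 a document's score is the maximum of
the per-field scores rather than their sum:

http://localhost:8983/solr/select?defType=dismax&q=Charlie&qf=name^5+surname^2&tie=0.0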

On Fri, Sep 9, 2011 at 7:49 AM, crisfromnova  wrote:
> Hi,
>
> For my search I want to calculate the score only by higher boost. For
> example:
> doc1
> <doc>
>  <field name="name">Charlie</field>
>  <field name="surname">Jonhson</field>
> </doc>
> doc2
> <doc>
>  <field name="name">Charlie</field>
>  <field name="surname">Charlie</field>
> </doc>
>
> So when I use the query "q=name:Charlie^5 surname:Charlie^2", I want both
> documents to have the same score, based on the boost value of the first
> field matched.
>
> I use a custom Similarity class and I override all methods which can
> influence the score (computeNorm(...), tf(...), idf(...), queryNorm(...)),
> but I don't know how to change the score to take into consideration only the
> higher boost value and not the sum of boosts from all matched fields.
>
> Any idea, please...
>
>
>


Re: indexing data from rich documents - Tika with solr3.1

2011-09-09 Thread Erik Hatcher
If the only thing you're doing is indexing file content, then you can bypass 
using the Data Import Handler altogether and use the ExtractingRequestHandler 
(aka Solr Cell).  And you can feed in a file from a URL using the stream.url 
capability, like the stream.file example here: 


Something like -  
http://localhost:8983/solr/update/extract?stream.url=http://myweb/filename.pdf&literal.id=filename.pdf

But to fix what you're doing below, looks like you should be using 
BinURLDataSource rather than BinFileDataSource - other than that, it looks fine.
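
In sketch form, that one change would look like this (hedged: the entity
attributes are taken from your snippet, and the field mappings are elided):

<dataConfig>
  <dataSource name="bin" type="BinURLDataSource" />
  <document>
    <entity processor="TikaEntityProcessor"
            url="http://myweb/filename.pdf" format="text" dataSource="bin">
      ...
    </entity>
  </document>
</dataConfig>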

Erik

On Sep 9, 2011, at 06:58 , scorpking wrote:

> Hi everyone,
> I have a problem with Tika and Solr. I succeeded in indexing data from
> various file formats (pdf, doc, ...) given an absolute file path, but now I
> have a link from the internet (e.g. http://myweb/filename.pdf). I want to
> index from this link, but it doesn't work and I don't know why. This is my
> dataconfig.xml:
> 
> <dataConfig>
>   <dataSource name="bin" type="BinFileDataSource" />
>   <document>
>     <entity processor="TikaEntityProcessor"
>             url="http://myweb/filename.pdf" format="text" dataSource="bin">
>       ...
>     </entity>
>   </document>
> </dataConfig>
> 
> When I change url="http://myweb/filename.pdf" to an absolute file path, it
> works very well.
> Does anyone know about this?
> Thanks for your help.
> 



Re: Indexing Lotus Notes database using API

2011-09-09 Thread Alexandre Rafalovitch
I was looking at doing something similar a little while ago and I
would not actually go with entry-by-entry extraction code.

There is a semi-secret way to export a whole Lotus Notes database into an
XML format. It can then be processed to extract and import whatever
information you want, much more than you can usefully get through the API.
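
Presumably this refers to the DXL export; a minimal sketch with the Notes
Java API (Notes.jar) might look like the following - hedged, since the
poster's exact code isn't shown, and the database path is a placeholder:

// Hedged sketch: export an entire Notes database as DXL (XML) via Notes.jar.
// "names.nsf" is a placeholder database path, not from the original post.
import lotus.domino.*;

public class NotesDxlDump {
    public static void main(String[] args) throws NotesException {
        NotesThread.sinitThread();                 // initialize the Notes runtime for this thread
        try {
            Session session = NotesFactory.createSession();
            Database db = session.getDatabase("", "names.nsf");
            DxlExporter exporter = session.createDxlExporter();
            String xml = exporter.exportDxl(db);   // whole database as DXL
            System.out.println(xml);
        } finally {
            NotesThread.stermThread();             // tear down the Notes runtime
        }
    }
}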

We use it for a different purpose; it takes about 20 minutes to dump an
8 documents Lotus Notes database, and simple post-processing then takes
another 10 minutes or so.

Please contact me directly if you are interested and I will share the
code to get this started.

Regards,
   Alex.
P.s. On a related note, I once had an idea for a project where this XML
dump could be used to, pretty much, automatically give any Lotus Notes
database a real Solr-based search. The built-in Lotus Notes search is
just a disaster, in both the version 6 and version 8.5 variations.

Personal blog: http://blog.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- I think age is a very high price to pay for maturity (Tom Stoppard)




On Fri, Sep 9, 2011 at 1:48 AM, Tobias Berg  wrote:
> Hi again,
>
> After reading a bit more, IBM no longer supports the JDBC driver for Lotus
> Notes. Instead the Notes.jar API is recommended. So I'll go with that, as
> Oleg suggested.
>
> 2011/9/6 Tobias Berg 
>
>> Thanks Jan,
>>
>> I will look into using the JDBC driver.
>>
>> /Tobias
>>
>>
>> 2011/9/5 Jan Høydahl 
>>
>>> Hi,
>>>
>>> You should be able to index Notes databases through JDBC, either with DIH
>>> or ManifoldCF. Have not tried myself though.
>>>
>>> --
>>> Jan Høydahl, search solution architect
>>> Cominvent AS - www.cominvent.com
>>> Solr Training - www.solrtraining.com
>>>
>>> On 5. sep. 2011, at 12:28, Tobias Berg wrote:
>>>
>>> > Hi,
>>> >
>>> > I need to index some databases in Lotus Notes format.
>>> > Unfortunately, they cannot be web-enabled, so I can't use a crawler
>>> > such as Nutch.
>>> >
>>> > Before starting to write my own code, has anyone indexed Lotus Notes
>>> > databases using the Notes API before? Maybe using the Apache ManifoldCF
>>> > framework or a custom DataImportHandler?
>>> >
>>> > I've tried both Google and searching the mailing list but haven't
>>> > found any information.
>>> >
>>> > Best regards,
>>> > Tobias Berg
>>>
>>>
>>
>


RE: question about StandardAnalyzer, differences between solr 1.4 and solr 3.3

2011-09-09 Thread Steven A Rowe
Hi Marc,

StandardAnalyzer includes StopFilter.  See the Javadocs for Lucene 3.3 here: 


This is not new behavior - StandardAnalyzer in Lucene 2.9.1 (the version of 
Lucene bundled with Solr 1.4) also includes a StopFilter: 


If you don't want a StopFilter configured, you can specify the individual 
components directly, e.g. to get the equivalent of StandardAnalyzer, but 
without the StopFilter:

<fieldtype name="text_standard" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldtype>

Steve

> -Original Message-
> From: Marc Des Garets [mailto:marc.desgar...@192.com]
> Sent: Friday, September 09, 2011 6:21 AM
> To: solr-user@lucene.apache.org
> Subject: question about StandardAnalyzer, differences between solr 1.4
> and solr 3.3
> 
> Hi,
> 
> I have a simple field defined like this:
>
> <fieldtype name="text_standard" class="solr.TextField">
>   <analyzer class="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
> </fieldtype>
>
> Which I use here:
> <field name="middlename" type="text_standard" indexed="true" stored="true"
> required="false" />
> 
> In solr 1.4, I could do:
> ?q=(middlename:a*)
> 
> And I was getting all documents where middlename = A or where middlename
> starts with the letter A.
>
> In solr 3.3, I get only results where middlename starts with the letter A,
> but not where middlename is equal to A.
>
> The thing is, this happens only with the letter A; with other letters it
> is fine, I get the ones starting with the letter and the ones equal to the
> letter. My guess is that it treats A as the English article, but I do
> not specify any filter with stopwords, so how come the behaviour with the
> letter A is different from the other letters? Is there a bug? How can I
> change my field to work with the letter A the same way it does with
> other letters?
> 
> 
> Thanks,
> Marc


RE: NRT and commit behavior

2011-09-09 Thread Tirthankar Chatterjee
Erick,
What you said is correct. For us, searches are based on Active Directory
permissions which are populated in the filter query parameter, so we don't
have any warming query concept, as we cannot fire one for every user ahead of
time.

What we do is that when a user logs in, we run an invalid query (which returns
no results, instead of '*') with the correct filter query (his permissions
based on the login). This way the cache gets warmed up with valid docs.

It works then. 


Also, can you please let me know why commit is taking 45 minutes to 1 hour on
well-resourced hardware (multiple processors, 16GB RAM, 64-bit VM, etc.)?
We tried passing waitSearcher as false and found that inside the code it is
hard-coded to true. Is there a specific reason? Can we change that value to
honor what is being passed?

Thanks,
Tirthankar

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Thursday, September 01, 2011 8:38 AM
To: solr-user@lucene.apache.org
Subject: Re: NRT and commit behavior

Hmm, I'm guessing a bit here, but using an invalid query doesn't sound very
safe, though I suppose it *might* be OK.

What does "invalid" mean? Syntax error? Not safe.

A search that returns 0 results? I don't know, but I'd guess that filling your
caches, which is the point of warming queries, might be short-circuited if the
query returns 0 results, but I don't know for sure.

But the fact that "invalid queries return quicker" does not inspire confidence 
since the *point* of warming queries is to spend the time up front so your 
users don't have to wait.

So here's a test. Comment out your warming queries.
Restart your server and fire the warming query from the browser with
&debugQuery=on and look at the QTime parameter.

Now fire the same form of the query (as in the same sort, facet, grouping, etc, 
but presumably a valid term). See the QTime.

Now fire the same form of the query with a *different* value in the query. That 
is, it should search on different terms but with the same sort, facet, etc. to 
avoid getting your data straight from the queryResultCache.

My guess is that the last query will return much more quickly than the second 
query. Which would indicate that the first form isn't doing you any good.
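
In URL form, the three probes might look like this (a sketch; the field and
term names are invented for illustration):

http://localhost:8983/solr/select?q=conv:NOSUCHTERM&fq=perms:user1&sort=date+desc&debugQuery=on
http://localhost:8983/solr/select?q=conv:apple&fq=perms:user1&sort=date+desc&debugQuery=on
http://localhost:8983/solr/select?q=conv:banana&fq=perms:user1&sort=date+desc&debugQuery=on

If the third request is much faster than the second, then it was the second
query, not the 0-result warming query, that actually filled the caches.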

But a test is worth a thousand opinions.

Best
Erick

On Wed, Aug 31, 2011 at 11:04 AM, Tirthankar Chatterjee 
 wrote:
> Also noticed that "waitSearcher" parameter value is not  honored inside 
> commit. It is always defaulted to true which makes it slow during indexing.
>
> What we are trying to do is use an invalid query (which wont return any 
> results) as a warming query. This way the commit returns faster. Are we doing 
> something wrong here?
>
> Thanks,
> Tirthankar
>
> -Original Message-
> From: Jonathan Rochkind [mailto:rochk...@jhu.edu]
> Sent: Monday, July 18, 2011 11:38 AM
> To: solr-user@lucene.apache.org; yo...@lucidimagination.com
> Subject: Re: NRT and commit behavior
>
> In practice, in my experience at least, a very 'expensive' commit can 
> still slow down searches significantly, I think just due to CPU (or
> i/o?) starvation. Not sure anything can be done about that.  That's my 
> experience in Solr 1.4.1, but since searches have always been async with 
> commits, it probably is the same situation even in more recent versions, I'd 
> guess.
>
> On 7/18/2011 11:07 AM, Yonik Seeley wrote:
>> On Mon, Jul 18, 2011 at 10:53 AM, Nicholas Chase  
>> wrote:
>>> Very glad to hear that NRT is finally here!  But my question is this:
>>> will things still come to a standstill during a commit?
>> New updates can now proceed in parallel with a commit, and searches 
>> have always been completely asynchronous w.r.t. commits.
>>
>> -Yonik
>> http://www.lucidimagination.com
>>


TermsComponent from deleted document

2011-09-09 Thread Manish Bafna
Hi,
http://wiki.apache.org/solr/TermsComponent states that TermsComponent will
return frequencies from deleted documents too.

Is there any way to omit the deleted documents when getting the frequencies?

I know that facets can be used instead. Is it recommended to use facets
for an autosuggest feature?

Thanks,
Manish.


Adding Query Filter custom implementation to Solr's pipeline

2011-09-09 Thread Eugene Prystupa
Hi,

When I was using Lucene directly, I used a custom query filter implementation
to enforce entitlements on search results. Now that I'm switching my
infrastructure from a custom host to Solr, what is the best way to configure
Solr to use my custom query filter for every request?

Thanks!
-Eugene




Alias name for a index field

2011-09-09 Thread Tirthankar Chatterjee
Hi,
Is there a way to give an alias name to a field so that the schema does not
need to change?

Use Case: We defined the schema with a field called "conv" (basically to store
the conversation of an email). There are users who want this to be used as
"subject".

One Solution: Use a copy field, but that definitely takes some resources.
Instead, can we have something like an alias name, so a field can have
multiple alias names which different users from different geographical regions
can use for doing fielded search?

Let us know what you think, or if a JIRA already exists.

Thanks,
Tirthankar



Re: Alias name for a index field

2011-09-09 Thread darren

See http://wiki.apache.org/solr/FieldAliasesAndGlobsInParams
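
For example (a sketch, assuming the "conv" field from the original post and
the edismax parser), a per-field alias lets users type subject: while the
query actually runs against conv:

?q=subject:(quarterly report)&defType=edismax&f.subject.qf=conv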

On Fri, 9 Sep 2011 09:59:57 -0400, Tirthankar Chatterjee
 wrote:
> Hi,
> Is there a way to give an alias name to a field so that the schema does
> not need to change?
>
> Use Case: We defined the schema with a field called "conv" (basically to
> store the conversation of an email). There are users who want this to be
> used as "subject".
>
> One Solution: Use a copy field, but that definitely takes some resources.
> Instead, can we have something like an alias name, so a field can have
> multiple alias names which different users from different geographical
> regions can use for doing fielded search?
>
> Let us know what you think, or if a JIRA already exists.
> 
> Thanks,
> Tirthankar
> 
> 


Re: Alias name for a index field

2011-09-09 Thread Erik Hatcher
How are you doing fielded search currently?  End users using the "lucene" query 
parser?  Or using dismax/qf?  

I'm just curious to drill into your needs here, exactly in terms of
request/response, and whether simple application-layer handling of the alias
need would suffice, or whether this is something best handled in Solr.

Erik

On Sep 9, 2011, at 09:59 , Tirthankar Chatterjee wrote:

> Hi,
> Is there a way to give an alias name to a field so that the schema does
> not need to change?
>
> Use Case: We defined the schema with a field called "conv" (basically to
> store the conversation of an email). There are users who want this to be
> used as "subject".
>
> One Solution: Use a copy field, but that definitely takes some resources.
> Instead, can we have something like an alias name, so a field can have
> multiple alias names which different users from different geographical
> regions can use for doing fielded search?
>
> Let us know what you think, or if a JIRA already exists.
> 
> Thanks,
> Tirthankar
> 
> 



Re: scoring only by higher boost

2011-09-09 Thread Jamie Johnson
No problem, occasionally a blind squirrel finds a nut :)

On Fri, Sep 9, 2011 at 8:46 AM, crisfromnova  wrote:
> It works with dismax.
> Thank you very much!!
>


Re: Solr Cloud - is replication really a feature on the trunk?

2011-09-09 Thread Pulkit Singhal
Thank You Yury. After looking at your thread, there's something I must
clarify: Is solr.xml not uploaded and held in ZooKeeper? I ask this
because you have a slightly different config between Node 1 & 2:
http://lucene.472066.n3.nabble.com/Replication-setup-with-SolrCloud-Zk-td2952602.html

On Wed, Sep 7, 2011 at 8:34 PM, Yury Kats  wrote:
> On 9/7/2011 3:18 PM, Pulkit Singhal wrote:
>> Hello,
>>
>> I'm working off the trunk and the following wiki link:
>> http://wiki.apache.org/solr/SolrCloud
>>
>> The wiki link has a section that seeks to quickly familiarize a user
>> with replication in SolrCloud - "Example B: Simple two shard cluster
>> with shard replicas"
>>
>> But after going through it, I have to wonder if this is truly
>> replication?
>
Not really. Replication is not set up in the example.
The example uses "replicas" as "copies", to demonstrate high search
availability.
>
>> Because if it is truly replication then somewhere along
>> the line, the following properties must have been set
>> programmatically:
>> replicateAfter, confFiles, masterUrl, pollInterval
>> Can someone tell me: Where exactly in the code is this happening?
>
> Nowhere.
>
> If you want replication, you need to set all the properties you listed
> in solrconfig.xml.
>
> I've done it recently, see 
> http://lucene.472066.n3.nabble.com/Replication-setup-with-SolrCloud-Zk-td2952602.html
>
>


Re: Solr Cloud - is replication really a feature on the trunk?

2011-09-09 Thread Yury Kats
On 9/9/2011 10:52 AM, Pulkit Singhal wrote:
> Thank You Yury. After looking at your thread, there's something I must
> clarify: Is solr.xml not uploaded and held in ZooKeeper? 

Not as far as I understand. Cores are loaded/created by the local
Solr server based on solr.xml and then registered with ZK, so that
ZK knows what cores are out there and how they are organized in shards.


> because you have a slightly different config between Node 1 & 2:
> http://lucene.472066.n3.nabble.com/Replication-setup-with-SolrCloud-Zk-td2952602.html


I have two shards, each shard having a master and a slave core.
Cores are located so that master and slave are on different nodes.
This protects search (but not indexing) from node failure.


RE: question about StandardAnalyzer, differences between solr 1.4 and solr 3.3

2011-09-09 Thread Marc Des Garets
Ok, thanks. I don't know why the behaviour is different in my 1.4 index then,
but hopefully it will behave the same once I do what you suggest.

Thanks again,

Marc

-Original Message-
From: Steven A Rowe [mailto:sar...@syr.edu] 
Sent: 09 September 2011 14:40
To: solr-user@lucene.apache.org
Subject: RE: question about StandardAnalyzer, differences between solr 1.4 and 
solr 3.3

Hi Marc,

StandardAnalyzer includes StopFilter.  See the Javadocs for Lucene 3.3 here: 


This is not new behavior - StandardAnalyzer in Lucene 2.9.1 (the version of 
Lucene bundled with Solr 1.4) also includes a StopFilter: 


If you don't want a StopFilter configured, you can specify the individual 
components directly, e.g. to get the equivalent of StandardAnalyzer, but 
without the StopFilter:

<fieldtype name="text_standard" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldtype>


Steve

> -Original Message-
> From: Marc Des Garets [mailto:marc.desgar...@192.com]
> Sent: Friday, September 09, 2011 6:21 AM
> To: solr-user@lucene.apache.org
> Subject: question about StandardAnalyzer, differences between solr 1.4
> and solr 3.3
> 
> Hi,
> 
> I have a simple field defined like this:
> 
>class="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
> 
> 
> Which I use here:
> required="false" />
> 
> In solr 1.4, I could do:
> ?q=(middlename:a*)
> 
> And I was getting all documents where middlename = A or where middlename
> starts with the letter A.
>
> In solr 3.3, I get only results where middlename starts with the letter A,
> but not where middlename is equal to A.
>
> The thing is, this happens only with the letter A; with other letters it
> is fine, I get the ones starting with the letter and the ones equal to the
> letter. My guess is that it treats A as the English article, but I do
> not specify any filter with stopwords, so how come the behaviour with the
> letter A is different from the other letters? Is there a bug? How can I
> change my field to work with the letter A the same way it does with
> other letters?
> 
> 
> Thanks,
> Marc


Re: How to write this query?

2011-09-09 Thread Erick Erickson
That's not valid query syntax at all. What are you trying to do with key=?

You probably want something like
key:(value1^8 value2^4 value3^2)

or

key:value1^8 key:value2^4 key:value3^2

Best
Erick

On Thu, Sep 8, 2011 at 8:29 AM, crisfromnova  wrote:
> You can try this: q=key:value1^8 key=value2^4 key=value3^2.
>
> It should be working.
>


Weird behaviors with not operators.

2011-09-09 Thread electroyou
Hi all.
I'm running into a weird behavior with - operators.
If I execute the query
-text AND -text
I get all the expected results (a lot), but if I put in some parentheses, like
-text AND (-text)
or
(-text) AND (-text)
then I get no results at all. I can't understand why.
Do you have an explanation for this behavior?

Thank you in advance.



Re: How to order results by word position???

2011-09-09 Thread Chris Hostetter

: I have a problem with solr search. If I search for "vitamin" I receive:
: 1 - arrca MULTIVITAMIN FRUCHTSAFTBÄRCHEN
: 2 - VITAMIN E-KAPSELN NAT. 400

1) That first example will not match a query for "vitamin" using the 
analyzers you specified -- so if those are the results you are getting, 
you are not matching on the field you think you are.

2) As for your specific question about scoring based on the position of 
the word: the default scoring model doesn't do this, but there are things 
you can do to change that:

a) index a positional marker at the start of the values (ie: 
"__START_OF_STRING__" or something like that) and change your queries for 
things like "vitamin" to be a sloppy phrase query including your 
positional marker (ie: "__START_OF_STRING__ vitamin"~100).  Phrase queries 
score shorter phrases higher than longer phrases.

b) use SpanQueries ... Solr 3.3 doesn't support these via a query 
syntax, but you could write a custom QParser that uses a SpanFirstQuery, 
or look into the SurroundQParser which has been committed to the 4x branch 
(SOLR-2703)
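
For option (b), a sketch of what such a custom QParser would build in raw
Lucene (the field name here is assumed, not from the original post):

// Matches "vitamin" only when it occurs within the first position of the field.
SpanQuery q = new SpanFirstQuery(
    new SpanTermQuery(new Term("name", "vitamin")), 1);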


-Hoss

Re: SolrCloud Feedback

2011-09-09 Thread Pulkit Singhal
Hello Jan,

You've made a very good point in (b). I would be happy to make the
edit to the wiki if I understood your explanation completely.

When you say that it is "looking up what collection that core is part
of" ... I'm curious how a core is being put under a particular
collection in the first place? And what that collection is named?
Obviously you've made it clear that colelction1 is really the name of
the core itself. And where this association is being stored for the
code to look it up?

If not Jan, then perhaps the gurus who wrote Solr Cloud could answer :)

Thanks!
- Pulkit

On Thu, Feb 10, 2011 at 9:10 AM, Jan Høydahl  wrote:
> Hi,
>
> I have so far just tested the examples and got a N by M cluster running. My 
> feedback:
>
> a) First of all, a major update of the SolrCloud Wiki is needed, to clearly 
> state what is in which version, what are current improvement plans and get 
> rid of outdated stuff. That said I think there are many good ideas there.
>
> b) The "collection" terminology is too much confused with "core", and should 
> probably be made more distinct. I just tried to configure two cores on the 
> same Solr instance into the same collection, and that worked fine, both as 
> distinct shards and as same shard (replica). The wiki examples give the 
> impression that "collection1" in 
> localhost:8983/solr/collection1/select?distrib=true is some magic collection 
> identifier, but what it really does is doing the query on the *core* named 
> "collection1", looking up what collection that core is part of and 
> distributing the query to all shards in that collection.
>
> c) ZK is not designed to store large files. While the files in conf are 
> normally well below the 1M limit ZK imposes, we should perhaps consider using 
> a lightweight distributed object or k/v store for holding the /CONFIGS and 
> let ZK store a reference only
>
> d) How are admins supposed to update configs in ZK? Install their favourite 
> ZK editor?
>
> e) We should perhaps not be so afraid to make ZK a requirement for Solr in 
> v4. Ideally you should interact with a 1-node Solr in the same manner as you 
> do with a 100-node Solr. An example is the Admin GUI where the "schema" and 
> "solrconfig" links assume local file. This requires decent tool support to 
> make ZK interaction intuitive, such as "import" and "export" commands.
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
>
> On 19. jan. 2011, at 21.07, Mark Miller wrote:
>
>> Hello Users,
>>
>> About a little over a year ago, a few of us started working on what we 
>> called SolrCloud.
>>
>> This initial bit of work was really a combination of laying some base work - 
>> figuring out how to integrate ZooKeeper with Solr in a limited way, dealing 
>> with some infrastructure - and picking off some low hanging search side 
>> fruit.
>>
>> The next step is the indexing side. And we plan on starting to tackle that 
>> sometime soon.
>>
>> But first - could you help with some feedback? Some people are using our 
>> SolrCloud start - I have seen evidence of it ;) Some, even in production.
>>
>> I would love to have your help in targeting what we now try and improve. Any 
>> suggestions or feedback? If you have sent this before, I/others likely 
>> missed it - send it again!
>>
>> I know anyone that has used SolrCloud has some feedback. I know it because 
>> I've used it too ;) It's too complicated to setup still. There are still 
>> plenty of pain points. We accepted some compromise trying to fit into what 
>> Solr was, and not wanting to dig in too far before feeling things out and 
>> letting users try things out a bit. Thinking that we might be able to adjust 
>> Solr to be more in favor of SolrCloud as we go, what is the ideal state of 
>> the work we have currently done?
>>
>> If anyone using SolrCloud helps with the feedback, I'll help with the coding 
>> effort.
>>
>> - Mark Miller
>> -- lucidimagination.com
>
>


Running solr on small amounts of RAM

2011-09-09 Thread Mike Austin
I'm trying to push to get solr used in our environment. I know I could get
responses asking WHY can't you get more RAM etc., but let's just skip those
and work with this situation.

Our index is very small with 100k documents and a light load at the moment.
If I wanted to use the smallest possible RAM on the server, how would I do
this and what are the issues?

I know that caching would be the biggest loss, but if solr ran with little
to no caching, would the performance still be ok? I know this is a relative
question.
This is the only application using java on this machine; would tuning java
to use less cache help anything?
Should I set the cache settings low in the config?
Basically, what will having a very low cache hit rate do to search speed and
server performance? I know more is better and it depends on what I'm
comparing it to, but could you just answer in some way saying that it's
not going to cripple the machine or cause 5-second searches?

It's on a windows server.


Thanks,
Mike


Re: Sorting groups by numFound group size

2011-09-09 Thread O. Klein
I am also looking for a way to sort on numFound.

Has an issue been created?



Re: problems of getting frequency and position for a paticular word

2011-09-09 Thread Chris Hostetter

: Is there a way for solr to return only the frequency and position of a
: particular word back to the client?

I don't think so.

It would probably be relatively straightforward to add to 
TermVectorComponent -- I don't know that it would save any *time* (I think 
it would still have to process all the terms to get all the positions for 
the one term you care about) but it would certainly reduce the data sent 
over the wire.
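
For reference, the stock component can already return every term's frequency
and positions, which you would then filter client-side. A sketch, assuming
the example /tvrh handler with the TermVectorComponent registered:

http://localhost:8983/solr/tvrh?q=id:123&tv.tf=true&tv.positions=true&tv.fl=text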

The other option you may want to consider is a custom Highlighter 
component (I think you would need a fragmenter and a formatter) that just 
paid attention to the positions and returned them instead of marking up 
the text.

-Hoss


Re: TermsComponent from deleted document

2011-09-09 Thread Chris Hostetter

: http://wiki.apache.org/solr/TermsComponent states that TermsComponent will
: return frequencies from deleted documents too.
: 
: Is there anyway to omit the deleted documents to get the frequencies.

Not really -- until a deleted document is expunged by segment merging, 
it is still included in the term stats that the TermsComponent 
looks at.

If having 100% accurate term counts is really important to you, then you 
can optimize after doing any updates on your index - but there is 
obviously a performance tradeoff there.
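
For example (a sketch, assuming the stock /update handler on localhost),
either of these forces a merge down to one segment, expunging deleted docs:

curl 'http://localhost:8983/solr/update?optimize=true'
curl 'http://localhost:8983/solr/update' -H 'Content-type: text/xml' --data-binary '<optimize/>'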



-Hoss


Re: SolrCloud Feedback

2011-09-09 Thread Pulkit Singhal
I think I understand it a bit better now but wouldn't mind some validation.

1) solr.xml does not become part of ZooKeeper
2) The default looks like this out-of-box:
<cores adminPath="/admin/cores" defaultCoreName="collection1">
  <core name="collection1" instanceDir="." />
</cores>
so that may leave one wondering where the core's association to a
collection name is made.

It can be made like so:
a) statically in solr.xml:
<core name="collection1" instanceDir="." collection="myconf" />
b) at start time via java:
java ... -Dcollection.configName=myconf ... -jar start.jar

And I'm guessing that since the core's name ("collection1") for shard1
has already been associated with -Dcollection.configName=myconf in
http://wiki.apache.org/solr/SolrCloud#Example_A:_Simple_two_shard_cluster
once already, adding an additional shard2 with the same core name
("collection1") automatically throws it in with the collection name
("myconf") without any need to specify anything at startup via -D or
statically in the solr.xml file.

Validate away otherwise I'll just accept any hate mail after making
edits to the Solr wiki directly.

- Pulkit

On Fri, Sep 9, 2011 at 11:38 AM, Pulkit Singhal  wrote:
> Hello Jan,
>
> You've made a very good point in (b). I would be happy to make the
> edit to the wiki if I understood your explanation completely.
>
> When you say that it is "looking up what collection that core is part
> of" ... I'm curious how a core is being put under a particular
> collection in the first place? And what that collection is named?
> Obviously you've made it clear that colelction1 is really the name of
> the core itself. And where this association is being stored for the
> code to look it up?
>
> If not Jan, then perhaps the gurus who wrote Solr Cloud could answer :)
>
> Thanks!
> - Pulkit
>
> On Thu, Feb 10, 2011 at 9:10 AM, Jan Høydahl  wrote:
>> Hi,
>>
>> I have so far just tested the examples and got a N by M cluster running. My 
>> feedback:
>>
>> a) First of all, a major update of the SolrCloud Wiki is needed, to clearly 
>> state what is in which version, what are current improvement plans and get 
>> rid of outdated stuff. That said I think there are many good ideas there.
>>
>> b) The "collection" terminology is too much confused with "core", and should 
>> probably be made more distinct. I just tried to configure two cores on the 
>> same Solr instance into the same collection, and that worked fine, both as 
>> distinct shards and as same shard (replica). The wiki examples give the 
>> impression that "collection1" in 
>> localhost:8983/solr/collection1/select?distrib=true is some magic collection 
>> identifier, but what it really does is doing the query on the *core* named 
>> "collection1", looking up what collection that core is part of and 
>> distributing the query to all shards in that collection.
>>
>> c) ZK is not designed to store large files. While the files in conf are 
>> normally well below the 1M limit ZK imposes, we should perhaps consider 
>> using a lightweight distributed object or k/v store for holding the /CONFIGS 
>> and let ZK store a reference only
>>
>> d) How are admins supposed to update configs in ZK? Install their favourite 
>> ZK editor?
>>
>> e) We should perhaps not be so afraid to make ZK a requirement for Solr in 
>> v4. Ideally you should interact with a 1-node Solr in the same manner as you 
>> do with a 100-node Solr. An example is the Admin GUI where the "schema" and 
>> "solrconfig" links assume local file. This requires decent tool support to 
>> make ZK interaction intuitive, such as "import" and "export" commands.
>>
>> --
>> Jan Høydahl, search solution architect
>> Cominvent AS - www.cominvent.com
>>
>> On 19. jan. 2011, at 21.07, Mark Miller wrote:
>>
>>> Hello Users,
>>>
>>> About a little over a year ago, a few of us started working on what we 
>>> called SolrCloud.
>>>
>>> This initial bit of work was really a combination of laying some base work 
>>> - figuring out how to integrate ZooKeeper with Solr in a limited way, 
>>> dealing with some infrastructure - and picking off some low hanging search 
>>> side fruit.
>>>
>>> The next step is the indexing side. And we plan on starting to tackle that 
>>> sometime soon.
>>>
>>> But first - could you help with some feedback? Some people are using our 
>>> SolrCloud start - I have seen evidence of it ;) Some, even in production.
>>>
>>> I would love to have your help in targeting what we now try and improve. 
>>> Any suggestions or feedback? If you have sent this before, I/others likely 
>>> missed it - send it again!
>>>
>>> I know anyone that has used SolrCloud has some feedback. I know it because 
>>> I've used it too ;) It's too complicated to setup still. There are still 
>>> plenty of pain points. We accepted some compromise trying to fit into what 
>>> Solr was, and not wanting to dig in too far before feeling things out and 
>>> letting users try things out a bit. Thinking that we might be able to 
>>> adjust Solr to be more in favor of SolrCloud as we go, what is the ideal 
>>> state of the work we have currently done?
>>>
>>> If anyone using SolrCloud helps with the feedback, I'll help with the coding 
>>> effort.
>>>
>>> - Mark Miller
>>> -- lucidimagination.com

Re: FunctionQueryNode pipeline?

2011-09-09 Thread Chris Hostetter

: space, so identifying a function vs. a group clause hinders any progress. 
: Is that why they separated the functionality of queries using the
: defType=func?

The function syntax in solr predates the new QueryNode based QueryParser 
in lucene.  

The main motivation behind "defType" was to refactor out out query parsing 
so you could have multiple different Query Parser (lucene, dismax, func, 
etc...) impls (with arbitrary query syntaxes) that could be mixed and 
matched in differetn situations (q, fq, etc...).
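
As a concrete sketch of that mixing (field names invented for illustration),
local params let each parameter pick its own parser:

q={!dismax qf='title^2 body'}solr cloud&fq={!lucene}inStock:true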

Having one universal syntax that melds the historic lucene syntax with 
arbitrary functions is probably too hairy to fathom.  I'm also 
not sure that it would really provide much added value: different syntaxes 
for different audiences seems like a saner idea (let end users enter 
dismax queries, let advanced users and biz managers specify things in 
lucene queries, let solr admins configure boost queries using function 
syntax that refers to variable params specified by biz users, etc...)


-Hoss


"String index out of range: -1" for hl.fl=* in Solr 1.4.1?

2011-09-09 Thread Demian Katz
I'm running into a strange problem with Solr 1.4.1 - this request:

http://localhost:8080/solr/website/select/?q=*%3A*&rows=20&start=0&indent=yes&fl=score&facet=true&facet.mincount=1&facet.limit=30&facet.field=category&facet.field=linktype&facet.field=subject&facet.prefix=&facet.sort=&fq=category%3A%22Exhibits%22&spellcheck=true&spellcheck.q=*%3A*&spellcheck.dictionary=default&hl=true&hl.fl=*&hl.simple.pre=START_HILITE&hl.simple.post=END_HILITE&wt=json&json.nl=arrarr

leads to this error dump:

String index out of range: -1

java.lang.StringIndexOutOfBoundsException: String index out of range: -1
at java.lang.String.substring(String.java:1949)
at 
org.apache.lucene.search.highlight.Highlighter.getBestTextFragments(Highlighter.java:263)
at 
org.apache.solr.highlight.DefaultSolrHighlighter.doHighlighting(DefaultSolrHighlighter.java:335)
at 
org.apache.solr.handler.component.HighlightComponent.process(HighlightComponent.java:89)
at 
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:195)
at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
at 
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
at 
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1088)
at 
org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:360)
at 
org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
at 
org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
at 
org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:729)
at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
at 
org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:206)
at 
org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
at 
org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
at org.mortbay.jetty.Server.handle(Server.java:324)
at 
org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:505)
at 
org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:829)
at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:513)
at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:211)
at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:380)
at 
org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:395)
at 
org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:488)

I've managed to work around the problem by replacing the hl.fl=* parameter with 
a comma-delimited list of the fields I actually need highlighted...  but I 
don't understand why I'm encountering this error, and for peace of mind I would 
like to understand the problem in case there's a deeper problem at work here.  
I'll be happy to share schema or other details if they would help narrow down a 
potential cause!

thanks,
Demian


Re: Running solr on small amounts of RAM

2011-09-09 Thread Mike Austin
or actually disabling caching as mentioned here:
http://wiki.apache.org/solr/SolrCaching#Cache_Sizing
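
A sketch of what that looks like in solrconfig.xml (per the wiki, a size of
0 effectively disables a cache; the class names are the stock ones):

<filterCache      class="solr.FastLRUCache" size="0" initialSize="0" autowarmCount="0"/>
<queryResultCache class="solr.LRUCache"     size="0" initialSize="0" autowarmCount="0"/>
<documentCache    class="solr.LRUCache"     size="0" initialSize="0" autowarmCount="0"/>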

On Fri, Sep 9, 2011 at 11:48 AM, Mike Austin  wrote:

> I'm trying to push to get solr used in our environment. I know I could get
> responses asking WHY can't you get more RAM etc., but let's just skip those
> and work with this situation.
>
> Our index is very small with 100k documents and a light load at the
> moment.  If I wanted to use the smallest possible RAM on the server, how
> would I do this and what are the issues?
>
> I know that caching would be the biggest loss, but if solr ran with little
> to no caching, would the performance still be ok? I know this is a relative
> question.
> This is the only application using java on this machine; would tuning java
> to use less cache help anything?
> Should I set the cache settings low in the config?
> Basically, what will having a very low cache hit rate do to search speed
> and server performance? I know more is better and it depends on what I'm
> comparing it to, but could you just answer in some way saying that it's
> not going to cripple the machine or cause 5-second searches?
>
> It's on a windows server.
>
>
> Thanks,
> Mike
>
>
>
>


RE: Alias name for a index field

2011-09-09 Thread Tirthankar Chatterjee
We are using fielded queries with EDISMAX, passed in the q parameter.

-Original Message-
From: Erik Hatcher [mailto:erik.hatc...@gmail.com] 
Sent: Friday, September 09, 2011 10:21 AM
To: solr-user@lucene.apache.org
Subject: Re: Alias name for a index field

How are you doing fielded search currently?  End users using the "lucene" query 
parser?  Or using dismax/qf?  

I'm just curious to drill into your needs here, exactly in terms of
request/response, and whether simple application-layer handling of the alias
need would suffice, or whether this is something best handled in Solr.

Erik

On Sep 9, 2011, at 09:59 , Tirthankar Chatterjee wrote:

> Hi,
> Is there a way to give an alias name to a field so that the schema does
> not need to change?
>
> Use Case: We defined the schema with a field called "conv" (basically to
> store the conversation of an email). There are users who want this to be
> used as "subject".
>
> One Solution: Use a copy field, but that definitely takes some resources.
> Instead, can we have something like an alias name, so a field can have
> multiple alias names which different users from different geographical
> regions can use for doing fielded search?
>
> Let us know what you think, or if a JIRA already exists.
> 
> Thanks,
> Tirthankar
> 
> 



SolrCloud and replica question

2011-09-09 Thread Jamie Johnson
When doing writes, do all writes need to be done to the primary shard,
or are writes that are done to a replica also pushed to all replicas
of that shard?


Re: Solr Cloud - is replication really a feature on the trunk?

2011-09-09 Thread Pulkit Singhal
Thanks Again.

Another question:

My solr.xml has:
<cores adminPath="/admin/cores" defaultCoreName="collection1">
  <core name="collection1" instanceDir="." collection="myconf" />
</cores>

And I omitted -Dcollection.configName=myconf from the startup command
because I felt that specifying collection="myconf" should take care of
that:
cd /trunk/solr/example
java -Dbootstrap_confdir=./solr/conf -Dslave=disabled -DzkRun -jar start.jar

But the zookeeper.jsp page doesn't seem to take any of that into
effect and shows:
 /collections (v=6 children=1)
  collection1 (v=0 children=1) "configName=configuration1"
   shards (v=0 children=1)
shard1 (v=0 children=1)
 tiklup-mac.local:8983_solr_ (v=0)
"node_name=tiklup-mac.local:8983_solr
url=http://tiklup-mac.local:8983/solr/"

Then what is the point of naming the core and the collection?

- Pulkit

2011/9/9 Yury Kats :
> On 9/9/2011 10:52 AM, Pulkit Singhal wrote:
>> Thank You Yury. After looking at your thread, there's something I must
>> clarify: Is solr.xml not uploaded and held in ZooKeeper?
>
> Not as far as I understand. Cores are loaded/created by the local
> Solr server based on solr.xml and then registered with ZK, so that
> ZK know what cores are out there and how they are organized in shards.
>
>
>> because you have a slightly different config between Node 1 & 2:
>> http://lucene.472066.n3.nabble.com/Replication-setup-with-SolrCloud-Zk-td2952602.html
>
>
> I have two shards, each shard having a master and a slave core.
> Cores are located so that master and slave are on different nodes.
> This protects search (but not indexing) from node failure.
>


Re: Solr Cloud - is replication really a feature on the trunk?

2011-09-09 Thread Pulkit Singhal
I had forgotten to save the file; the collection name at least shows
up now, but the core name is still not used. Is it simply decorative?

/collections (v=6 children=1)
  myconf (v=0 children=1) "configName=configuration1"
shards (v=0 children=1)
  shard1 (v=0 children=1)
tiklup-mac.local:8983_solr_ (v=0)
"node_name=tiklup-mac.local:8983_solr
 url=http://tiklup-mac.local:8983/solr/"

Thanks!
- Pulkit

On Fri, Sep 9, 2011 at 5:54 PM, Pulkit Singhal  wrote:
> Thanks Again.
>
> Another question:
>
> My solr.xml has:
> <cores adminPath="/admin/cores" defaultCoreName="collection1">
>   <core name="collection1" instanceDir="." collection="myconf" />
> </cores>
>
> And I omitted -Dcollection.configName=myconf from the startup command
> because I felt that specifying collection="myconf" should take care of
> that:
> cd /trunk/solr/example
> java -Dbootstrap_confdir=./solr/conf -Dslave=disabled -DzkRun -jar start.jar
>
> But the zookeeper.jsp page doesn't seem to take any of that into
> effect and shows:
>     /collections (v=6 children=1)
>          collection1 (v=0 children=1) "configName=configuration1"
>               shards (v=0 children=1)
>                    shard1 (v=0 children=1)
>                         tiklup-mac.local:8983_solr_ (v=0)
> "node_name=tiklup-mac.local:8983_solr
> url=http://tiklup-mac.local:8983/solr/"
>
> Then what is the point of naming the core and the collection?
>
> - Pulkit
>
> 2011/9/9 Yury Kats :
>> On 9/9/2011 10:52 AM, Pulkit Singhal wrote:
>>> Thank You Yury. After looking at your thread, there's something I must
>>> clarify: Is solr.xml not uploaded and held in ZooKeeper?
>>
>> Not as far as I understand. Cores are loaded/created by the local
>> Solr server based on solr.xml and then registered with ZK, so that
>> ZK know what cores are out there and how they are organized in shards.
>>
>>
>>> because you have a slightly different config between Node 1 & 2:
>>> http://lucene.472066.n3.nabble.com/Replication-setup-with-SolrCloud-Zk-td2952602.html
>>
>>
>> I have two shards, each shard having a master and a slave core.
>> Cores are located so that master and slave are on different nodes.
>> This protects search (but not indexing) from node failure.
>>
>


solr equivalent of "select distinct"

2011-09-09 Thread Mark juszczec
Hello everyone

Let's say each record in my index contains fields named PK, FLD1, FLD2,
FLD3, ..., FLD100.

PK is my solr primary key; I'm creating it by concatenating
FLD1+FLD2+FLD3, and I'm guaranteed that combination will be unique.

Let's say 2 of these records have FLD1 = A and FLD2 = B.  I am unsure about
the remaining fields

Right now, if I do a query specifying FLD1 = A and FLD2 = B then I get both
records.  I only want 1.

Research says I should use faceting.  But this:

q=FLD1:A and FLD2:B & rows=500 & defType=edismax & fl=FLD1, FLD2 &
facet=true & facet_field=FLD1 & facet_field=FLD2

gives me 2 records.

In fact, it gives me the same results as:

q=FLD1:A and FLD2:B & rows=500 & defType=edismax & fl=FLD1, FLD2

I'm wrong somewhere, but I'm unsure where.

Is faceting the right way to go or should I be using grouping?

Curiously, when I use grouping like this:

q=FLD1:A and FLD2:B &rows=500 &defType=edismax &indent=true &fl=FLD1, FLD2
&group=true &group.field=FLD1 &group.field=FLD2

I get 2 records as well.

Has anyone dealt with mimicking "select distinct" in Solr?

Any advice would be very appreciated.

Mark


searching for terms containing embedded spaces

2011-09-09 Thread Mark juszczec
Hi folks

I've got a field that contains 2 words separated by a single blank.

What's the trick to creating a search string that contains the single blank?

Mark


Re: SolrCloud and replica question

2011-09-09 Thread Yury Kats
On 9/9/2011 4:48 PM, Jamie Johnson wrote:
> When doing writes do all writes need to be done to the primary shard
> or are writes that are done to the replica also pushed to all replicas
> of that shard?
> 

If you have replication set up between cores, all changes to the
slave will be overwritten by replication. Therefore it makes sense
to submit docs for indexing only to the master cores.
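
In practice that means pointing updates at the master core only; a minimal
sketch with a made-up host and core name:

  curl 'http://master-host:8983/solr/master1/update?commit=true' \
    -H 'Content-Type: text/xml' \
    --data-binary '<add><doc><field name="id">1</field></doc></add>'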


Re: Solr Cloud - is replication really a feature on the trunk?

2011-09-09 Thread Yury Kats
On 9/9/2011 6:54 PM, Pulkit Singhal wrote:
> Thanks Again.
> 
> Another question:
> 
> My solr.xml has:
>   <cores ...>
>     <core name="..." instanceDir="..." collection="myconf"/>
>   </cores>
> 
> And I omitted -Dcollection.configName=myconf from the startup command
> because I felt that specifying collection="myconf" should take care of
> that:
> cd /trunk/solr/example
> java -Dbootstrap_confdir=./solr/conf -Dslave=disabled -DzkRun -jar start.jar

With this you are telling ZK to bootstrap a collection with the content of
specific files, but you don't tell it which collection that should be.

Hence you want the collection.configName parameter, and you want
solr.xml to reference the same name in the 'collection' attribute for the cores,
so that SolrCloud knows where to pull each core's configuration from.
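
Putting that together with the command from Pulkit's mail, a startup that
names the configuration explicitly would look like:

  cd /trunk/solr/example
  java -Dbootstrap_confdir=./solr/conf -Dcollection.configName=myconf \
       -Dslave=disabled -DzkRun -jar start.jar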




Stemming and other tokenizers

2011-09-09 Thread Patrick Sauts
Hello,

 

I want to implement some kind of auto-stemming that will detect the language
of a field based on a tag at the start of that field, like #en#. The field is
stored on disk, but I don't want this tag to be stored. Is there a way to
avoid storing it?

As I understand it, all the filters and tokenizers interact only with the
indexed form of a field and not the stored one.

Am I wrong?

Is it possible to write such a filter?

 

Patrick.
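
For what it's worth, analysis chains indeed touch only the indexed terms;
the stored value is always the verbatim input. Changing what gets stored
means rewriting the document before it reaches the index, for example in an
UpdateRequestProcessor. A rough sketch (class, field, and package names are
made up, and the code is untested against any particular Solr version):

  import java.io.IOException;
  import org.apache.solr.common.SolrInputDocument;
  import org.apache.solr.request.SolrQueryRequest;
  import org.apache.solr.response.SolrQueryResponse;
  import org.apache.solr.update.AddUpdateCommand;
  import org.apache.solr.update.processor.UpdateRequestProcessor;
  import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

  // Strips a leading language tag such as "#en#" from the "text" field
  // before the document is indexed, so the tag is neither indexed nor stored.
  public class StripLangTagProcessorFactory extends UpdateRequestProcessorFactory {
    @Override
    public UpdateRequestProcessor getInstance(SolrQueryRequest req,
        SolrQueryResponse rsp, UpdateRequestProcessor next) {
      return new UpdateRequestProcessor(next) {
        @Override
        public void processAdd(AddUpdateCommand cmd) throws IOException {
          SolrInputDocument doc = cmd.getSolrInputDocument();
          Object value = doc.getFieldValue("text");
          if (value instanceof String && ((String) value).startsWith("#")) {
            String s = (String) value;
            int end = s.indexOf('#', 1);      // closing '#' of the "#en#" tag
            if (end > 0) {
              // s.substring(1, end) is the detected language, usable to pick
              // an analysis chain; store the field without the tag.
              doc.setField("text", s.substring(end + 1));
            }
          }
          super.processAdd(cmd);
        }
      };
    }
  }

The factory would then be wired into an update chain in solrconfig.xml:

  <updateRequestProcessorChain name="strip-lang">
    <processor class="com.example.StripLangTagProcessorFactory"/>
    <processor class="solr.RunUpdateProcessorFactory"/>
  </updateRequestProcessorChain>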



Re: MMapDirectory failed to map a 23G compound index segment

2011-09-09 Thread Lance Norskog
I remember now: when one block of address space that big is memory-mapped,
the garbage collector has problems working around it. If the OOM is
repeatable, you could try watching the app with jconsole and keeping an eye
on the memory spaces.

Lance

On Thu, Sep 8, 2011 at 8:58 PM, Lance Norskog  wrote:

> Do you need to use the compound format?
>
> On Thu, Sep 8, 2011 at 3:57 PM, Rich Cariens wrote:
>
>> I should add some more context:
>>
>>   1. the problem index included several cfs segment files that were around
>>   4.7G, and
>>   2. I'm running four SOLR instances on the same box, all of which have
>>   similar problem indices.
>>
>> A colleague thought perhaps I was bumping up against my 256,000 open files
>> ulimit. Do the MultiMMapIndexInput ByteBuffer arrays each consume a file
>> handle/descriptor?
>>
>> On Thu, Sep 8, 2011 at 5:19 PM, Rich Cariens 
>> wrote:
>>
>> > FWiW I optimized the index down to a single segment and now I have no
>> > trouble opening an MMapDirectory on that index, even though the 23G cfx
>> > segment file remains.
>> >
>> >
>> > On Thu, Sep 8, 2011 at 4:27 PM, Rich Cariens wrote:
>> >
>> >> Thanks for the response. "free -g" reports:
>> >>
>> >>                    total   used   free   shared  buffers   cached
>> >> Mem:                  141     95     46        0        0       93
>> >> -/+ buffers/cache:              2    139
>> >> Swap:                   3      0      3
>> >>
>> >> 2011/9/7 François Schiettecatte 
>> >>
>> >>> My memory of this is a little rusty but isn't mmap also limited by mem
>> +
>> >>> swap on the box? What does 'free -g' report?
>> >>>
>> >>> François
>> >>>
>> >>> On Sep 7, 2011, at 12:25 PM, Rich Cariens wrote:
>> >>>
>> >>> > Ahoy ahoy!
>> >>> >
>> >>> > I've run into the dreaded OOM error with MMapDirectory on a 23G cfs
>> >>> compound
>> >>> > index segment file. The stack trace looks pretty much like every
>> other
>> >>> trace
>> >>> > I've found when searching for OOM & "map failed"[1]. My
>> configuration
>> >>> > follows:
>> >>> >
>> >>> > Solr 1.4.1/Lucene 2.9.3 (plus SOLR-1969)
>> >>> > CentOS 4.9 (Final)
>> >>> > Linux 2.6.9-100.ELsmp x86_64 yada yada yada
>> >>> > Java SE (build 1.6.0_21-b06)
>> >>> > Hotspot 64-bit Server VM (build 17.0-b16, mixed mode)
>> >>> > ulimits:
>> >>> >    core file size          (blocks, -c) 0
>> >>> >    data seg size           (kbytes, -d) unlimited
>> >>> >    file size               (blocks, -f) unlimited
>> >>> >    pending signals                 (-i) 1024
>> >>> >    max locked memory       (kbytes, -l) 32
>> >>> >    max memory size         (kbytes, -m) unlimited
>> >>> >    open files                      (-n) 256000
>> >>> >    pipe size            (512 bytes, -p) 8
>> >>> >    POSIX message queues     (bytes, -q) 819200
>> >>> >    stack size              (kbytes, -s) 10240
>> >>> >    cpu time               (seconds, -t) unlimited
>> >>> >    max user processes              (-u) 1064959
>> >>> >    virtual memory          (kbytes, -v) unlimited
>> >>> >    file locks                      (-x) unlimited
>> >>> >
>> >>> > Any suggestions?
>> >>> >
>> >>> > Thanks in advance,
>> >>> > Rich
>> >>> >
>> >>> > [1]
>> >>> > ...
>> >>> > java.io.IOException: Map failed
>> >>> > at sun.nio.ch.FileChannelImpl.map(Unknown Source)
>> >>> > at org.apache.lucene.store.MMapDirectory$MMapIndexInput.<init>(Unknown Source)
>> >>> > at org.apache.lucene.store.MMapDirectory$MMapIndexInput.<init>(Unknown Source)
>> >>> > at org.apache.lucene.store.MMapDirectory.openInput(Unknown Source)
>> >>> > at org.apache.lucene.index.SegmentReader$CoreReaders.<init>(Unknown Source)
>> >>> > at org.apache.lucene.index.SegmentReader.get(Unknown Source)
>> >>> > at org.apache.lucene.index.SegmentReader.get(Unknown Source)
>> >>> > at org.apache.lucene.index.DirectoryReader.<init>(Unknown Source)
>> >>> > at org.apache.lucene.index.ReadOnlyDirectoryReader.<init>(Unknown Source)
>> >>> > at org.apache.lucene.index.DirectoryReader$1.doBody(Unknown Source)
>> >>> > at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(Unknown Source)
>> >>> > at org.apache.lucene.index.DirectoryReader.open(Unknown Source)
>> >>> > at org.apache.lucene.index.IndexReader.open(Unknown Source)
>> >>> > ...
>> >>> > Caused by: java.lang.OutOfMemoryError: Map failed
>> >>> > at sun.nio.ch.FileChannelImpl.map0(Native Method)
>> >>> > ...
>> >>>
>> >>>
>> >>
>> >
>>
>
>
>
> --
> Lance Norskog
> goks...@gmail.com
>
>


-- 
Lance Norskog
goks...@gmail.com
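
For anyone else hitting the same "Map failed" OOM: mapped buffers do not
hold file descriptors once the underlying channel is closed, but every chunk
mapped by MultiMMapIndexInput does count against the kernel's per-process
map limit. That limit is worth checking (a suggestion, not something
verified on Rich's box):

  sysctl vm.max_map_count             # view the limit, commonly 65530
  sysctl -w vm.max_map_count=262144   # raise it (as root)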


Re: Solr Cloud - is replication really a feature on the trunk?

2011-09-09 Thread Jamie Johnson
As a note, you could change the values in solr.xml to be as follows and pull
them from system properties:

  <cores ...>
    <core name="${...}" collection="${...}" .../>
  </cores>

Unless someone says otherwise, the quick tests I've run seem to work
perfectly well with this setup.
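
With placeholders like those, startup could pass the values on the command
line; the property names below are made up for illustration, not standard
Solr properties:

  java -DcoreName=master1 -DcollectionName=myconf -DzkRun -jar start.jar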

2011/9/9 Yury Kats :
> On 9/9/2011 6:54 PM, Pulkit Singhal wrote:
>> Thanks Again.
>>
>> Another question:
>>
>> My solr.xml has:
>>   <cores ...>
>>     <core name="..." instanceDir="..." collection="myconf"/>
>>   </cores>
>>
>> And I omitted -Dcollection.configName=myconf from the startup command
>> because I felt that specifying collection="myconf" should take care of
>> that:
>> cd /trunk/solr/example
>> java -Dbootstrap_confdir=./solr/conf -Dslave=disabled -DzkRun -jar start.jar
>
> With this you are telling ZK to bootstrap a collection with the content of
> specific files, but you don't tell it which collection that should be.
>
> Hence you want the collection.configName parameter, and you want
> solr.xml to reference the same name in the 'collection' attribute for the cores,
> so that SolrCloud knows where to pull each core's configuration from.
>
>
>


Re: SolrCloud and replica question

2011-09-09 Thread Jamie Johnson
Great, thanks Yury, that's what I thought but just wanted to verify.

2011/9/9 Yury Kats :
> On 9/9/2011 4:48 PM, Jamie Johnson wrote:
>> When doing writes do all writes need to be done to the primary shard
>> or are writes that are done to the replica also pushed to all replicas
>> of that shard?
>>
>
> If you have replication set up between cores, all changes to the
> slave will be overwritten by replication. Therefore it makes sense
> to submit docs for indexing only to the master cores.
>


Re: TermsComponent from deleted document

2011-09-09 Thread Manish Bafna
Which is preferable for autosuggest: TermsComponent or facets?

On Fri, Sep 9, 2011 at 10:33 PM, Chris Hostetter
wrote:

>
> : http://wiki.apache.org/solr/TermsComponent states that TermsComponent
> will
> : return frequencies from deleted documents too.
> :
> : Is there anyway to omit the deleted documents to get the frequencies.
>
> not really -- until deleted documents are expunged by segment merging,
> they are still included in the term stats, which is what the TermsComponent
> looks at.
>
> If having 100% accurate term counts is really important to you, then you
> can optimize after doing any updates on your index - but there is
> obviously a performance tradeoff there.
>
>
>
> -Hoss
>
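
For Manish's question: TermsComponent reads raw index term statistics, so it
is very fast but, per Hoss's note above, counts deleted documents; faceting
counts only live documents matching the query, at somewhat higher cost. A
minimal sketch of each, assuming a hypothetical field named "name":

  http://localhost:8983/solr/terms?terms.fl=name&terms.prefix=ca&terms.limit=10
  http://localhost:8983/solr/select?q=*:*&rows=0&facet=true&facet.field=name&facet.prefix=ca&facet.limit=10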