Re: HTTP ERROR: 500 - java.lang.ArrayIndexOutOfBoundsException

2010-07-16 Thread Lance Norskog
This can happen when there are multiple values in a field. Is 'first'
a multi-valued field?

Sorting only works on single-valued fields. After all, if there are
multiple values, it can only sort on one value and there is no way to
decide which one. So, make sure that 'first' has multiValued="false"
in the field declaration. If this is the problem, you will have to fix
your data and re-index.

Is 'first' an analyzed text field? Then sorting definitely will not work.
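
A common fix is to sort on an untokenized, single-valued copy of the field.
A minimal sketch, with hypothetical schema.xml names:

  <field name="first" type="text" indexed="true" stored="true"/>
  <field name="first_sort" type="string" indexed="true" stored="false"
         multiValued="false"/>
  <copyField source="first" dest="first_sort"/>

and then query with sort=first_sort+desc.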

On Fri, Jul 16, 2010 at 6:54 PM, Girish Pandit  wrote:
> Hi,
>
> As soon as I add the "sort=first+desc" parameter to the select clause, it throws
> an ArrayIndexOutOfBoundsException. Please suggest if I am missing anything.
>
> http://localhost:8983/solr/select?q=girish&start=0&indent=on&wt=json&sort=first+desc
>
> I have close to 1 million records indexed.
>
> Thanks
> Girish
>
>
>



-- 
Lance Norskog
goks...@gmail.com


HTTP ERROR: 500 - java.lang.ArrayIndexOutOfBoundsException

2010-07-16 Thread Girish Pandit

Hi,

As soon as I add the "sort=first+desc" parameter to the select clause, it 
throws an ArrayIndexOutOfBoundsException. Please suggest if I am missing 
anything.


http://localhost:8983/solr/select?q=girish&start=0&indent=on&wt=json&sort=first+desc

I have close to 1 million records indexed.

Thanks
Girish




Re: limiting the total number of documents matched

2010-07-16 Thread Yonik Seeley
On Wed, Jul 14, 2010 at 5:46 PM, Paul  wrote:
> I thought of another way to do it, but I still have one thing I don't
> know how to do. I could do the search without sorting for the 50th
> page, then look at the relevancy score on the first item on that page,
> then repeat the search, but add score > that relevancy as a parameter.
> Is it possible to do a search with "score:[5 to *]"? It didn't work in
> my first attempt.

frange could possibly help (range query on an arbitrary function).
http://www.lucidimagination.com/blog/tag/frange/

So perhaps something like
q={!frange l=0.85}query($qq)
qq=<the normal relevancy query>

where 0.85 is the lower bound you want for scores and qq is the normal
relevancy query
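
For example (a hypothetical request; the qq value would be URL-escaped in
practice):

http://localhost:8983/solr/select?q={!frange l=0.85}query($qq)&qq=common+term&sort=title+asc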

-Yonik
http://www.lucidimagination.com


>
> On Wed, Jul 14, 2010 at 5:34 PM, Paul  wrote:
>> I was hoping for a way to do this purely by configuration and making
>> the correct GET requests, but if there is a way to do it by creating a
>> custom Request Handler, I suppose I could plunge into that. Would that
>> yield the best results, and would that be particularly difficult?
>>
>> On Wed, Jul 14, 2010 at 4:37 PM, Nagelberg, Kallin
>>  wrote:
>>> So you want to take the top 1000 sorted by score, then sort those by 
>>> another field. It's a strange case, and I can't think of a clean way to 
>>> accomplish it. You could do it in two queries, where the first is by score 
>>> and you only request your IDs to keep it snappy, then do a second query 
>>> against the IDs and sort by your other field. 1000 seems like a lot for 
>>> that approach, but who knows until you try it on your data.
>>>
>>> -Kallin Nagelberg
>>>
>>>
>>> -Original Message-
>>> From: Paul [mailto:p...@nines.org]
>>> Sent: Wednesday, July 14, 2010 4:16 PM
>>> To: solr-user
>>> Subject: limiting the total number of documents matched
>>>
>>> I'd like to limit the total number of documents that are returned for
>>> a search, particularly when the sort order is not based on relevancy.
>>>
>>> In other words, if the user searches for a very common term, they
>>> might get tens of thousands of hits, and if they sort by "title", then
>>> very high relevancy documents will be interspersed with very low
>>> relevancy documents. I'd like to set a limit to the 1000 most relevant
>>> documents, then sort those by title.
>>>
>>> Is there a way to do this?
>>>
>>> I guess I could always retrieve the top 1000 documents and sort them
>>> in the client, but that seems particularly inefficient. I can't find
>>> any other way to do this, though.
>>>
>>> Thanks,
>>> Paul
>>>
>>
>


Re: SOLR Search Query : Exception : Software caused connection abort

2010-07-16 Thread Lance Norskog
How big is "very big"?

Tomcat has to be configured for the maximum length of the parameter
field in a POST. Is your query string longer than that?
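
If that is the limit being hit, the setting to raise is the maxPostSize
attribute on the Connector in Tomcat's server.xml (port and size here are
just examples; the default is 2 MB):

  <Connector port="8080" protocol="HTTP/1.1" maxPostSize="4194304" ... />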

If much of the query string is repeated across queries, you can make a
<requestHandler> in solrconfig.xml that adds the extra parameters in the
file with a <defaults> clause.
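
A minimal sketch of such a handler (the name and parameters are made up for
illustration):

  <requestHandler name="/mysearch" class="solr.SearchHandler">
    <lst name="defaults">
      <str name="fl">id,score</str>
      <str name="rows">10</str>
    </lst>
  </requestHandler>

Clients then send only the short, varying part of the query to /mysearch.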

It is also possible that the parsed query was too large for Lucene to handle.

On Thu, Jul 15, 2010 at 5:44 AM, sandeep kumar
 wrote:
>
> Hi,
> I am trying to test the SOLR search with a very big query, but when I try it
> throws the exception: "Exception : Software caused connection abort".
> I'm using HTTP POST and the server I'm using is Tomcat.
> Does a SOLR query have any limitations on size or length, etc.?
> Pls help me and let me know a solution to this problem ASAP.
>
> Regards
> Sandeep
>



-- 
Lance Norskog
goks...@gmail.com


Re: limiting the total number of documents matched

2010-07-16 Thread Lance Norskog
Yes, multiple (radix) sorts work and you can use the score value. The
sort parameters come in order, most important to least important.

This sorts first by score, and then documents with the same score are
sorted by field f:

sort=score+desc,f+asc



On Wed, Jul 14, 2010 at 2:46 PM, Paul  wrote:
> I thought of another way to do it, but I still have one thing I don't
> know how to do. I could do the search without sorting for the 50th
> page, then look at the relevancy score on the first item on that page,
> then repeat the search, but add score > that relevancy as a parameter.
> Is it possible to do a search with "score:[5 to *]"? It didn't work in
> my first attempt.
>
> On Wed, Jul 14, 2010 at 5:34 PM, Paul  wrote:
>> I was hoping for a way to do this purely by configuration and making
>> the correct GET requests, but if there is a way to do it by creating a
>> custom Request Handler, I suppose I could plunge into that. Would that
>> yield the best results, and would that be particularly difficult?
>>
>> On Wed, Jul 14, 2010 at 4:37 PM, Nagelberg, Kallin
>>  wrote:
>>> So you want to take the top 1000 sorted by score, then sort those by 
>>> another field. It's a strange case, and I can't think of a clean way to 
>>> accomplish it. You could do it in two queries, where the first is by score 
>>> and you only request your IDs to keep it snappy, then do a second query 
>>> against the IDs and sort by your other field. 1000 seems like a lot for 
>>> that approach, but who knows until you try it on your data.
>>>
>>> -Kallin Nagelberg
>>>
>>>
>>> -Original Message-
>>> From: Paul [mailto:p...@nines.org]
>>> Sent: Wednesday, July 14, 2010 4:16 PM
>>> To: solr-user
>>> Subject: limiting the total number of documents matched
>>>
>>> I'd like to limit the total number of documents that are returned for
>>> a search, particularly when the sort order is not based on relevancy.
>>>
>>> In other words, if the user searches for a very common term, they
>>> might get tens of thousands of hits, and if they sort by "title", then
>>> very high relevancy documents will be interspersed with very low
>>> relevancy documents. I'd like to set a limit to the 1000 most relevant
>>> documents, then sort those by title.
>>>
>>> Is there a way to do this?
>>>
>>> I guess I could always retrieve the top 1000 documents and sort them
>>> in the client, but that seems particularly inefficient. I can't find
>>> any other way to do this, though.
>>>
>>> Thanks,
>>> Paul
>>>
>>
>



-- 
Lance Norskog
goks...@gmail.com


Re: indexing rich documents

2010-07-16 Thread Lance Norskog
The libraries are searched for in the solr/lib directory, not solr home.
If using multicore, solr/<core>/lib.

These are searched automatically. You can also tell Solr to search in
other directories with the <lib> directive in solrconfig.xml.
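
For example (the path is an assumption; point it at wherever your extra jars
live):

  <lib dir="../../contrib/extraction/lib" />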

On Tue, Jul 13, 2010 at 11:48 PM, satya swaroop  wrote:
>
> here I attach my solrconfig, Tika config, and schema files... if there is
> anything wrong, tell me
>



-- 
Lance Norskog
goks...@gmail.com


JSON and DataImportHandler

2010-07-16 Thread P Williams

Hi All,

Has anyone gotten the DataImportHandler to work with JSON as
input?  Is there an even easier alternative to DIH?  Could you show me
an example?


Many thanks,
Tricia


RE: documents with known relevancy

2010-07-16 Thread fiedzia


Jonathan Rochkind wrote:
> 
> I've never used it, but I think this is the use case that the Solr feature
> to use Lucene 'payloads' is meant for?  
> http://www.lucidimagination.com/blog/2009/08/05/getting-started-with-payloads/
> 
This is it, thanks for this link.



Re: How to speed up solr search speed

2010-07-16 Thread Peter Karich
> > Each solr (jetty) instance only consumes 40M-60M memory.

> java -Xmx1024M -jar start.jar

That's a good suggestion!
Please double-check that you are using the -server version of the jvm
and the latest 1.6.0_20 or so.

Additionally you can start jvisualvm (shipped with the jdk) and hook
into jetty/tomcat easily to see the current CPU and memory load.

> But I have 70 solr cores

If you ask me, I would reduce them to 10-15 or even fewer and increase
the RAM.
Try out Tomcat too.

> solr distributed search's speed is decided by the slowest one. 

So, try to reduce the number of cores.

Regards,
Peter.

> you mentioned that you have a lot of mem free, but your jetty containers are
> only using between 40-60 MB.
>
> probably stating the obvious, but have you increased the -Xmx param like for
> instance:
> java -Xmx1024M -jar start.jar
>
> that way you're configuring the container to use a maximum of 1024 MB ram
> instead of the standard which is much lower (I'm not sure what exactly but
> it could well be 64MB for non -server, aligning with what you're seeing)
>
> Geert-Jan
>
> 2010/7/16 marship 
>
>   
>> Hi Tom Burton-West.
>>
>>  Sorry, it looks like my email ISP filtered out your replies. I checked the web
>> version of the mailing list and saw your reply.
>>
>>  My query string is always simple like "design", "principle of design",
>> "tom"
>>
>>
>>
>> EG:
>>
>> URL:
>> http://localhost:7550/solr/select/?q=design&version=2.2&start=0&rows=10&indent=on
>>
>> Response:
>>
>> (XML response; markup stripped by the mail archive: status=0, QTime=16,
>> q=design, start=0, rows=10, indent=on; first doc: product_208619.)
>>
>>
>>
>>
>>
>> EG:
>> http://localhost:7550/solr/select/?q=Principle&version=2.2&start=0&rows=10&indent=on
>>
>> (XML response, markup stripped: status=0, QTime=94, q=Principle, start=0,
>> rows=10; first doc: product_56926.)
>>
>>
>>
>> As I am querying over a single core and the other cores are not querying at the
>> same time, the QTime looks good.
>>
>> But when I query the distributed node: (For this case, 6422ms is still a
>> not bad one. Many cost ~20s)
>>
>> URL:
>> http://localhost:7499/solr/select/?q=the+first+world+war&version=2.2&start=0&rows=10&indent=on&debugQuery=true
>>
>> Response:
>>
>> (XML response, markup stripped: status=0, QTime=6422, q=the first world war,
>> debugQuery=true, start=0, rows=10.)
>>
>>
>>
>> Actually I am thinking and testing a solution: As I believe the bottleneck
>> is in harddisk and all our indexes add up is about 10-15G. What about I just
>> add another 16G memory to my server then use "MemDisk" to map a memory disk
>> and put all my indexes into it. Then each time, solr/jetty need to load
>> index from harddisk, it is loading from memory. This should give solr the
>> most throughput and avoid the harddisk access delay. I am testing ...
>>
>> But if there is a way to make solr better use our limited resources to
>> avoid adding new ones, that would be great.
>>
>>
>>
>>
>>
>>
>> 
>   


-- 
http://karussell.wordpress.com/



Re: documents with known relevancy

2010-07-16 Thread Dennis Gearon
Looks to me like a sort of way to get to 'categories', if one were interested 
in doing that, shudder.


Dennis Gearon

Signature Warning

EARTH has a Right To Life,
  otherwise we all die.

Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php


--- On Fri, 7/16/10, Peter Karich  wrote:

> From: Peter Karich 
> Subject: Re: documents with known relevancy
> To: solr-user@lucene.apache.org
> Date: Friday, July 16, 2010, 12:25 PM
> I didn't look at payloads as
> mentioned by Jonathan, but another
> solution could be (similar to Dennis'):
> 
> create a field 'tags' and then add the tag1 several times
> to it -
> depending on the weight.
> E.g. add it 10 times if the weight is 1.0
> But add it only 2 times if the weight is 0.2 etc.
> 
> Of course this limits the weight to 11 weights (0, 0.1,
> 0.2, ... and 1)
> but should work :-)
> 
> Regards,
> Peter.
> 
> > I came up with another idea, which seems to do what I
> want. Any comments about
> > better solutions
> > or improving efficiency are welcome:
> >
> > for each document create multivalue text field "tags"
> with all tags,
> > and multiple dynamic fields for each tag containing
> value, so we have:
> > {
> >   id: 123
> >   tags: tag1, tag2, ..., tagN
> >   tag1_float: 0.1,
> >   tag2_float: 0.2,
> >   ...
> >   tagN_float: 0.3,
> > }
> >
> > then a query for tag1 and tag2 could look like that:
> > tags:tag1 AND tags: tag2
> > and sort results by sum of tag1_float and tag2_float.
> >
> >   
> 
> 
> -- 
> http://karussell.wordpress.com/
> 
>


Re: documents with known relevancy

2010-07-16 Thread Peter Karich
I didn't look at payloads as mentioned by Jonathan, but another
solution could be (similar to Dennis'):

create a field 'tags' and then add the tag1 several times to it -
depending on the weight.
E.g. add it 10 times if the weight is 1.0
But add it only 2 times if the weight is 0.2 etc.

Of course this limits the weight to 11 weights (0, 0.1, 0.2, ... and 1)
but should work :-)
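
So a document would be indexed roughly like this (a sketch using the field
names from this thread):

  <doc>
    <field name="id">123</field>
    <!-- weight 1.0 -> 10 copies, weight 0.2 -> 2 copies -->
    <field name="tags">tag1 tag1 tag1 tag1 tag1 tag1 tag1 tag1 tag1 tag1 tag2 tag2</field>
  </doc>

Term frequency then stands in for the weight at scoring time.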

Regards,
Peter.

> I came up with another idea, which seems to do what I want. Any comments about
> better solutions
> or improving efficiency are welcome:
>
> for each document create multivalue text field "tags" with all tags,
> and multiple dynamic fields for each tag containing its value, so we have:
> {
>   id: 123
>   tags: tag1, tag2, ..., tagN
>   tag1_float: 0.1,
>   tag2_float: 0.2,
>   ...
>   tagN_float: 0.3,
> }
>
> then a query for tag1 and tag2 could look like that:
> tags:tag1 AND tags: tag2
> and sort results by sum of tag1_float and tag2_float.
>
>   


-- 
http://karussell.wordpress.com/



Re:Re: How to speed up solr search speed

2010-07-16 Thread Dennis Gearon
Isn't it always one of these four? (from most likely to least likely, generally)

Memory (as a ceiling limit)
Disk Speed
WebServer and its code
CPU.

Memory and Disk are related, as swapping occurs between them. As long as memory 
is high enough, it becomes:

Disk Speed
WebServer and its code
CPU

If the WebServer is configured to be as fast as possible, only THEN does the CPU 
come into play.

So normally:

1/ Put enough memory in so it doesn't swap
2/ Buy the fastest damn disk/diskArrays/SolidState/HyperDrive RamDisk/RAIDed 
HyperDrive RamDisk that you can afford.
3/ Tune your webserver code.

1 moderate *LAPTOP* with 8-16 gig of ram (with a 64bit OS :-), and a single, 
external SATA HyperDrive 64Gig RamDrive is SCREAMING, way beyond most single 
server boxes you'll pay to get hosting on. The processor almost doesn't matter. 
Imagine what it's like with an array of those things.

Shows how much Ram and Disk slow things down.

Get going that fast and it's the Ethernet connection that slows things down. 
Good gaming boards are now coming with dual ethernet IO stock with software 
preconfigured to issue requests via both and get delivered to both. One dies 
and it falls back to the other.

Dennis Gearon

Signature Warning

EARTH has a Right To Life,
  otherwise we all die.

Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php


--- On Fri, 7/16/10, marship  wrote:

> From: marship 
> Subject: Re:Re: How to speed up solr search speed
> To: solr-user@lucene.apache.org
> Date: Friday, July 16, 2010, 11:26 AM
> Hi. Peter. 
> 
>  Thanks for replying.
> 
> 
> >Hi Scott!
> >
> >> I am aware these cores on same server are
> interfering with each other.
> >
> >That's not good. Try to use only one core per CPU. With
> more per CPU you
> >won't have any benefits over the single-core version, I
> think.
> 
>  I only have 2 servers, each with an 8-core CPU. Each server
> has 6G memory. So I have 16 CPU cores in total. But I have 70
> solr cores, so I have to spread them over my 2 servers. Based on
> my observation, even when a search is processing, the CPU
> usage is not high. The memory usage is not high either. Each
> solr (jetty) instance only consumes 40M-60M memory. My server
> always has 2-3G memory available.
> >
> >> can solr use more memory to avoid disk operation
> conflicts?
> >
> >Yes, only the memory you have on the machine of course.
> Are you using
> >tomcat or jetty?
> >
> 
> I am using jetty.
> >> For my case, I don't think solr can work as fast
> as 100-200ms on average.
> >
> >We have indices with a lot of entries, not as large as
> yours, but in the
> >range of X Million. and have response times under
> 100ms.
> >What about testing only one core with 5-10 Mio docs? If
> the response
> >time isn't any better maybe you need a different field
> config or sth.
> >different is wrong?
> 
> For the moment, I really don't know. I tried to use java
> -server -jar start.jar to start jetty/solr. I saw when solr
> starts, sometimes a core search for a simple keyword like
> "design" will take 70s, while of course some only take 0-15ms.
> From my side, I do believe the harddisk access by
> these cores delays each other. So finally some cores fall
> behind. And the bad news for me is that the solr distributed
> search's speed is decided by the slowest one. 
> 
> 
> >
> >> So should I add it or the default(without it ) is
> ok?
> >
> >Without is also okay -> solr uses default.
> >With 75 Mio docs it should be around 20 000 but I guess
> there is sth.
> >different wrong: maybe caching or field definition.
> Could you post the
> >latter one?
> >
> 
> Sorry. What are you asking me to post?
> 
>  
> 
> 
> >Regards,
> >Peter.
> >
> >> Hi. Peter.
> >> I think I am not using faceting, highlighting ...
> I read about them
> >> but don't know how to work with them. I am using
> the default "example"
> >> just change the indexed fields.
> >> For my case, I don't think solr can work as fast
> as 100-200ms on
> >> average. I tried some keywords on only single solr
> instance. It
> >> sometimes takes more than 20s. I just input 4
> keywords. I agree it is
> >> keyword concerns. But the issue is it doesn't work
> consistently.
> >>
> >> When 37 instances on same server works at same
> time (when a
> >> distributed search start), it goes worse, I saw
> some solr cores
> >> execute very fast, 0ms, ~40ms, ~200ms. But more
> solr cores executed as
> >> ~2500ms, ~3500ms, ~6700ms. and about 5-10 solr
> cores need more than
> >> 17s. I have 70 cores running. And the search speed
> depends on the
> >> SLOWEST one. Even 69 cores can run at 1ms. but
> last one need 50s. then
> >> the distributed search speed is 50s.
> >> I am aware these cores on same server are
> interfering with each other.
> >> As I have lots of free memory. I want to know,
> with the prerequisite,
> >> can solr use more memory to avoid disk operation
> conflicts?
> >>
> >> Thanks.
> >> Regards.
> >> Scott
> >>
> >> On 2010-07-15 17:19:57, "Peter Karich" wrote:
> >

Get both regular query and function query scores

2010-07-16 Thread Martynas Miliauskas
Hi,

I am using a function query to tweak my regular query's search score, so the
search query outputs the regular-query score modified by some function query. Is
there a way to also obtain the score from the regular query alone?

Thanks!


Re: Re:Re: How to speed up solr search speed

2010-07-16 Thread Geert-Jan Brits
you mentioned that you have a lot of mem free, but your jetty containers are
only using between 40-60 MB.

probably stating the obvious, but have you increased the -Xmx param like for
instance:
java -Xmx1024M -jar start.jar

that way you're configuring the container to use a maximum of 1024 MB ram
instead of the standard which is much lower (I'm not sure what exactly but
it could well be 64MB for non -server, aligning with what you're seeing)

Geert-Jan

2010/7/16 marship 

> Hi Tom Burton-West.
>
>  Sorry, it looks like my email ISP filtered out your replies. I checked the web
> version of the mailing list and saw your reply.
>
>  My query string is always simple like "design", "principle of design",
> "tom"
>
>
>
> EG:
>
> URL:
> http://localhost:7550/solr/select/?q=design&version=2.2&start=0&rows=10&indent=on
>
> Response:
>
> (XML response; markup stripped by the mail archive: status=0, QTime=16,
> q=design, start=0, rows=10, indent=on; first doc: product_208619.)
>
>
>
>
>
> EG:
> http://localhost:7550/solr/select/?q=Principle&version=2.2&start=0&rows=10&indent=on
>
> (XML response, markup stripped: status=0, QTime=94, q=Principle, start=0,
> rows=10; first doc: product_56926.)
>
>
>
> As I am querying over a single core and the other cores are not querying at the
> same time, the QTime looks good.
>
> But when I query the distributed node: (For this case, 6422ms is still a
> not bad one. Many cost ~20s)
>
> URL:
> http://localhost:7499/solr/select/?q=the+first+world+war&version=2.2&start=0&rows=10&indent=on&debugQuery=true
>
> Response:
>
> (XML response, markup stripped: status=0, QTime=6422, q=the first world war,
> debugQuery=true, start=0, rows=10.)
>
>
>
> Actually I am thinking and testing a solution: As I believe the bottleneck
> is in harddisk and all our indexes add up is about 10-15G. What about I just
> add another 16G memory to my server then use "MemDisk" to map a memory disk
> and put all my indexes into it. Then each time, solr/jetty need to load
> index from harddisk, it is loading from memory. This should give solr the
> most throughput and avoid the harddisk access delay. I am testing ...
>
> But if there is a way to make solr better use our limited resources to
> avoid adding new ones, that would be great.
>
>
>
>
>
>


Re:indexing best practices

2010-07-16 Thread marship
Hi. I just noticed that when you add documents to solr, if you turn the auto-commit 
flag off and commit and optimize after posting is done, the speed is super fast. 
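
A minimal SolrJ sketch of that flow (the URL and the 'batches' variable are 
assumptions; autoCommit is left unconfigured in solrconfig.xml):

  import java.util.List;
  import org.apache.solr.client.solrj.SolrServer;
  import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
  import org.apache.solr.common.SolrInputDocument;

  SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
  for (List<SolrInputDocument> batch : batches) {
      server.add(batch);   // post the batches; no commit in between
  }
  server.commit();         // one commit after all posting is done
  server.optimize();       // then optimize once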

I was using 31 clients to post to 31 solr cores at the same time. I think if you 
use 2 clients to post to the same core, the question will be "how fast can your 
clients generate the xml?". In my case, solr is faster than the speed at which I 
create the xml.


 

On 2010-07-17 02:39:58, kenf_nc wrote:
>
>I was curious if anyone has done work on finding what an optimal (or max)
>number of client processes are for indexing. That is, if I have the ability
>to spin up N number of processes that construct a POST to add/update a Solr
>document, is there a point at which the number of clients posting
>simultaneously overloads Solr's ability to keep up with the Add's? I know
>this is very hardware dependent, but am looking for ballpark guidelines.
>This will be in a Tomcat process running on Windows Server 2008, 2 Solr
>instances, one master, one slave standard replication.
>
>Related to this, is there a best practice number of documents to send in a
>single POST. (again I know it depends on the complexity of the document,
>field types, analyzers/tokenizers etc).
>
>And finally, what do you find to be the best approach to getting data into
>Solr. If the technology aspect isn't an issue (except I don't want to use
>EmbeddedSolr), you just want to get documents added/updated as quickly as
>possible.  POST, xml or csv document upload, DataImportHandler, other?  I'm
>just looking for raw speed, not architectural factors.
>
>So, nutshell, all other factors put aside, I'm looking for best approach to
>indexing with pure raw speed the only criteria. 
>
>Thanks,
>Ken


Re:Re:Re: How to speed up solr search speed

2010-07-16 Thread marship
Hi Tom Burton-West.

  Sorry, it looks like my email ISP filtered out your replies. I checked the web 
version of the mailing list and saw your reply.

  My query string is always simple like "design", "principle of design", "tom"

 

EG:

URL: 
http://localhost:7550/solr/select/?q=design&version=2.2&start=0&rows=10&indent=on

Response:

(XML response; markup stripped by the mail archive: status=0, QTime=16,
q=design, start=0, rows=10, indent=on, version=2.2; first doc:
product_208619.)

 

 

EG: 
http://localhost:7550/solr/select/?q=Principle&version=2.2&start=0&rows=10&indent=on

(XML response, markup stripped: status=0, QTime=94, q=Principle, start=0,
rows=10; first doc: product_56926.)

 

As I am querying over a single core and the other cores are not querying at the 
same time, the QTime looks good.

But when I query the distributed node: (For this case, 6422ms is still a not 
bad one. Many cost ~20s)

URL: 
http://localhost:7499/solr/select/?q=the+first+world+war&version=2.2&start=0&rows=10&indent=on&debugQuery=true

Response: 

(XML response, markup stripped: status=0, QTime=6422, q=the first world war,
debugQuery=true, start=0, rows=10.)

 

Actually I am thinking about and testing a solution: I believe the bottleneck is 
the harddisk, and all our indexes add up to about 10-15G. What about I just add 
another 16G memory to my server, then use "MemDisk" to map a memory disk and put 
all my indexes into it? Then each time solr/jetty needs to load index data from 
harddisk, it is loading from memory. This should give solr the most throughput 
and avoid the harddisk access delay. I am testing ...

But if there is a way to make solr better use our limited resources to avoid 
adding new ones, that would be great.

 

 

 

Re: Tag generation

2010-07-16 Thread kenf_nc

Thanks for all the suggestions! I'm absorbing them as quickly as I can. 


indexing best practices

2010-07-16 Thread kenf_nc

I was curious if anyone has done work on finding what an optimal (or max)
number of client processes are for indexing. That is, if I have the ability
to spin up N number of processes that construct a POST to add/update a Solr
document, is there a point at which the number of clients posting
simultaneously overloads Solr's ability to keep up with the Add's? I know
this is very hardware dependent, but am looking for ballpark guidelines.
This will be in a Tomcat process running on Windows Server 2008, 2 Solr
instances, one master, one slave standard replication.

Related to this, is there a best practice number of documents to send in a
single POST. (again I know it depends on the complexity of the document,
field types, analyzers/tokenizers etc).

And finally, what do you find to be the best approach to getting data into
Solr. If the technology aspect isn't an issue (except I don't want to use
EmbeddedSolr), you just want to get documents added/updated as quickly as
possible.  POST, xml or csv document upload, DataImportHandler, other?  I'm
just looking for raw speed, not architectural factors.

So, nutshell, all other factors put aside, I'm looking for best approach to
indexing with pure raw speed the only criteria. 

Thanks,
Ken


Re:Re: How to speed up solr search speed

2010-07-16 Thread marship
Hi. Peter. 

 Thanks for replying.


>Hi Scott!
>
>> I am aware these cores on same server are interfering with each other.
>
>That's not good. Try to use only one core per CPU. With more per CPU you
>won't have any benefits over the single-core version, I think.

 I only have 2 servers, each with an 8-core CPU. Each server has 6G memory. So I 
have 16 CPU cores in total. But I have 70 solr cores, so I have to spread them over 
my 2 servers. Based on my observation, even when a search is processing, the CPU 
usage is not high. The memory usage is not high either. Each solr (jetty) instance 
only consumes 40M-60M memory. My server always has 2-3G memory available.
>
>> can solr use more memory to avoid disk operation conflicts?
>
>Yes, only the memory you have on the machine of course. Are you using
>tomcat or jetty?
>

I am using jetty.
>> For my case, I don't think solr can work as fast as 100-200ms on average.
>
>We have indices with a lot of entries, not as large as yours, but in the
>range of X million, and have response times under 100ms.
>What about testing only one core with 5-10 Mio docs? If the response
>time isn't any better maybe you need a different field config or sth.
>different is wrong?

For the moment, I really don't know. I tried to use java -server -jar start.jar 
to start jetty/solr. I saw that when solr starts, sometimes a core search for a 
simple keyword like "design" will take 70s, while of course some only take 0-15ms. 
From my side, I do believe the harddisk access by these cores delays 
each other. So finally some cores fall behind. And the bad news for me is that the 
solr distributed search's speed is decided by the slowest one. 


>
>> So should I add it or the default(without it ) is ok?
>
>Without is also okay -> solr uses default.
>With 75 Mio docs it should be around 20 000 but I guess there is sth.
>different wrong: maybe caching or field definition. Could you post the
>latter one?
>

Sorry. What are you asking me to post?

 


>Regards,
>Peter.
>
>> Hi. Peter.
>> I think I am not using faceting, highlighting ... I read about them
>> but don't know how to work with them. I am using the default "example"
>> just change the indexed fields.
>> For my case, I don't think solr can work as fast as 100-200ms on
>> average. I tried some keywords on only single solr instance. It
>> sometimes takes more than 20s. I just input 4 keywords. I agree it is
>> keyword-dependent. But the issue is it doesn't work consistently.
>>
>> When 37 instances on same server works at same time (when a
>> distributed search start), it goes worse, I saw some solr cores
>> execute very fast, 0ms, ~40ms, ~200ms. But more solr cores executed as
>> ~2500ms, ~3500ms, ~6700ms. and about 5-10 solr cores need more than
>> 17s. I have 70 cores running. And the search speed depends on the
>> SLOWEST one. Even 69 cores can run at 1ms. but last one need 50s. then
>> the distributed search speed is 50s.
>> I am aware these cores on same server are interfering with each other.
>> As I have lots of free memory. I want to know, with the prerequisite,
>> can solr use more memory to avoid disk operation conflicts?
>>
>> Thanks.
>> Regards.
>> Scott
>>
>> On 2010-07-15 17:19:57, "Peter Karich" wrote:
>>> How does your queries look like? Do you use faceting, highlighting, ... ?
>>> Did you try to customize the cache?
>>> Setting the HashDocSet to "0.005 of all documents" improves our
>>> search speed a lot.
>>> Did you optimize the index?
>>>
>>> 500ms seems to be slow for an 'average' search. I am not an expert
>>> but without highlighting it should be faster than 100ms, or at least 200ms
>>>
>>> Regards,
>>> Peter.
>>>
>>>
 Hi.
 Thanks for replying.
 My documents have many different fields (about 30 fields, 10 different
 types of documents, but these are not the point) and I have to search
 over several fields.
 I was putting all 76M documents into several lucene indexes and use
 the default lucene.net ParaSearch to search over these indexes. That
 was slow, more than 20s.
 Then someone suggested I need to merge all our indexes into a huge
 one, he thought lucene can handle 76M documents in one index easily.
 Then I merged all the documents into a single huge one (which took me
 3 days). That time, the index folder is about 15G (I don't store
 info in the index, just index them). Actually the search is still very
 slow, more than 20s too, and looks slower than using several indexes.
 Then I come to solr. Why I put 1M into each core is I found when a
 core has 1M document, the search speed is fast, range from 0-500ms,
 which is acceptable. I don't know how many documents to saved in one
 core is proper.
 The problem is even if I put 2M documents into each core. Then I
 have only 36 cores at the moment. But when our documents doubles in
 the future, same issue will rise again. So I don't think save 1M in
 each core is the issue.
 The issue is I put too many cores into one ser

Re: Fwd: send to list

2010-07-16 Thread kenf_nc

If at all possible I like to do any processing work up front and not deal
with extravagant queries. If your grid definitions don't change, or don't
change often, just assign a cell number to each 100 square grid. Then in a
pre-processing step assign the appropriate cell number to your document
along with the specific lat and lon. Then your facet query gets much
simpler.
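
A sketch of that pre-processing step, assuming a 10 x 10 grid over the
bounding box from Joe's query (the bounds and field name are assumptions):

  // Compute a grid cell number (0..99) for a document's lat/lon at index
  // time; assumes the coordinates fall inside the grid bounds.
  static int cellId(double lat, double lon) {
      final double MIN_LAT = 57.9, MAX_LAT = 71.2;  // assumed grid bounds
      final double MIN_LON = 4.0,  MAX_LON = 31.0;
      int row = Math.min(9, (int) ((lat - MIN_LAT) / (MAX_LAT - MIN_LAT) * 10));
      int col = Math.min(9, (int) ((lon - MIN_LON) / (MAX_LON - MIN_LON) * 10));
      return row * 10 + col;
  }

Store the result in a cell_id field and the whole request collapses to
&facet=true&facet.field=cell_id.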


Re: send to list

2010-07-16 Thread Mattmann, Chris A (388J)
Hi Joe,

Take a look at the Cartesian Grid work from Patrick O'Leary here [1]. It's not 
fully integrated with Solr and they are moving away from it, but it'll give you 
a good idea of how to get started and to go about doing this...

HTH,
Chris

[1] http://www.nsshutdown.com/projects/lucene/whitepaper/locallucene_v2.html


On 7/16/10 10:58 AM, "Joe Chesak"  wrote:

I wish to display search results on a google map.  I would like to group the 
results such that if more than one hit is in one location, the sum of all hits 
at that location will show up on an icon.


By one location, I mean one square on a grid of 10 x 10 squares = 100 squares.  
I am trying out facets for this, but the request string is enormous, and when I 
send a request for 30 facets the query time increases from 5ms to 355ms.

Here is what such a query looks like...

http://theguide.srv.easyconnect.no:8080/solr/no-gul-biz-web/select?facet=true
&facet.query=lat_trie:[57.9+TO+59.56]+AND+lon_trie:[4+TO+9.4]
&facet.query=lat_trie:[59.56+TO+61.22]+AND+lon_trie:[4+TO+9.4]
&facet.query=lat_trie:[61.22+TO+62.89]+AND+lon_trie:[4+TO+9.4]
&facet.query=lat_trie:[62.89+TO+64.55]+AND+lon_trie:[4+TO+9.4]
&facet.query=lat_trie:[64.55+TO+66.21]+AND+lon_trie:[4+TO+9.4]
&facet.query=lat_trie:[66.21+TO+67.88]+AND+lon_trie:[4+TO+9.4]
&facet.query=lat_trie:[67.88+TO+69.54]+AND+lon_trie:[4+TO+9.4]
&facet.query=lat_trie:[69.54+TO+71.2]+AND+lon_trie:[4+TO+9.4]
&facet.query=lat_trie:[57.9+TO+59.56]+AND+lon_trie:[9.4+TO+14.8]
&facet.query=lat_trie:[59.56+TO+61.22]+AND+lon_trie:[9.4+TO+14.8]
&facet.query=lat_trie:[61.22+TO+62.89]+AND+lon_trie:[9.4+TO+14.8]
&facet.query=lat_trie:[62.89+TO+64.55]+AND+lon_trie:[9.4+TO+14.8]
&facet.query=lat_trie:[64.55+TO+66.21]+AND+lon_trie:[9.4+TO+14.8]
&facet.query=lat_trie:[66.21+TO+67.88]+AND+lon_trie:[9.4+TO+14.8]
&facet.query=lat_trie:[67.88+TO+69.54]+AND+lon_trie:[9.4+TO+14.8]
&facet.query=lat_trie:[69.54+TO+71.2]+AND+lon_trie:[9.4+TO+14.8]
&facet.query=lat_trie:[57.9+TO+59.56]+AND+lon_trie:[14.8+TO+20.2]
&facet.query=lat_trie:[59.56+TO+61.22]+AND+lon_trie:[14.8+TO+20.2]
&facet.query=lat_trie:[61.22+TO+62.89]+AND+lon_trie:[14.8+TO+20.2]
&facet.query=lat_trie:[62.89+TO+64.55]+AND+lon_trie:[14.8+TO+20.2]
&facet.query=lat_trie:[64.55+TO+66.21]+AND+lon_trie:[14.8+TO+20.2]
&facet.query=lat_trie:[66.21+TO+67.88]+AND+lon_trie:[14.8+TO+20.2]
&facet.query=lat_trie:[67.88+TO+69.54]+AND+lon_trie:[14.8+TO+20.2]
&facet.query=lat_trie:[69.54+TO+71.2]+AND+lon_trie:[14.8+TO+20.2]
&facet.query=lat_trie:[57.9+TO+59.56]+AND+lon_trie:[20.2+TO+25.6]
&facet.query=lat_trie:[59.56+TO+61.22]+AND+lon_trie:[20.2+TO+25.6]
&facet.query=lat_trie:[61.22+TO+62.89]+AND+lon_trie:[20.2+TO+25.6]
&facet.query=lat_trie:[62.89+TO+64.55]+AND+lon_trie:[20.2+TO+25.6]
&facet.query=lat_trie:[64.55+TO+66.21]+AND+lon_trie:[20.2+TO+25.6]
&facet.query=lat_trie:[66.21+TO+67.88]+AND+lon_trie:[20.2+TO+25.6]
&facet.query=lat_trie:[67.88+TO+69.54]+AND+lon_trie:[20.2+TO+25.6]
&facet.query=lat_trie:[69.54+TO+71.2]+AND+lon_trie:[20.2+TO+25.6]
&facet.query=lat_trie:[57.9+TO+59.56]+AND+lon_trie:[25.6+TO+31]
&facet.query=lat_trie:[59.56+TO+61.22]+AND+lon_trie:[25.6+TO+31]
&facet.query=lat_trie:[61.22+TO+62.89]+AND+lon_trie:[25.6+TO+31]
&facet.query=lat_trie:[62.89+TO+64.55]+AND+lon_trie:[25.6+TO+31]
&facet.query=lat_trie:[64.55+TO+66.21]+AND+lon_trie:[25.6+TO+31]
&facet.query=lat_trie:[66.21+TO+67.88]+AND+lon_trie:[25.6+TO+31]
&facet.query=lat_trie:[67.88+TO+69.54]+AND+lon_trie:[25.6+TO+31]
&facet.query=lat_trie:[69.54+TO+71.2]+AND+lon_trie:[25.6+TO+31]


Maybe using a grid is the wrong approach, maybe I should be thinking clustering 
instead of grouping.  Is there a best practice for doing this?


Joe





++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.mattm...@jpl.nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++



RE: documents with known relevancy

2010-07-16 Thread Jonathan Rochkind
> Exactly. The weight is the weight of a given tag for a specific document, not
> the weight of the field as in weighted search. So one document may have tag1
> with a weight of 0.1, and another may have the same tag1 with weight=0.8.

I've never used it, but I think this is the use case that the Solr feature to 
use Lucene 'payloads' is meant for?  
http://www.lucidimagination.com/blog/2009/08/05/getting-started-with-payloads/
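
For reference, the field type from that post looks roughly like this (quoted
from memory, so double-check against the article):

  <fieldtype name="payloads" stored="false" indexed="true" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.DelimitedPayloadTokenFilterFactory" encoder="float"/>
    </analyzer>
  </fieldtype>

Input then looks like "tag1|0.1 tag2|0.3"; making the payload actually affect
the score still takes a custom query parser/similarity, as the article
explains.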

Re: documents with known relevancy

2010-07-16 Thread fiedzia


Dennis Gearon wrote:
> 
> Seems to me that you are doing externally to Solr what you could be doing
> internally. If you had ONE field for the tags and weighted those in your SOLR
> query, that is how I am guessing it is usually done.
> 

I guess I used a confusing term for weight. The weight (the value assigned for a
given tag) is document specific and may be different for each document; it
is not the weight of a field as in weighted search.



Re: documents with known relevancy

2010-07-16 Thread fiedzia


Dennis Gearon wrote:
> 
> So does this mean that each document has a different weight for the same
> tag?
> 

Exactly. The weight is the weight of a given tag for a specific document, not
the weight of the field as in weighted search. So one document may have tag1
with a weight of 0.1, and another may have the same tag1 with weight=0.8.


Re: Spatial Search - Best choice (if any)?

2010-07-16 Thread Dave Searle
I'm also just starting a project requiring spatial indexing, so any info would 
be greatly appreciated. I had a quick look at the wiki last night and it 
appears solr has it built into the latest version? Not sure if the patches 
need to be applied directly though. 

My requirements are quite simple, I just need an ability to return results 
based on radial distance from one point

http://wiki.apache.org/solr/SpatialSearch 

Thanks
Dave

Sent from my iPhone

On 16 Jul 2010, at 17:50, "Dennis Gearon"  wrote:

> I hope that those who know will answer this. I am really interested in it 
> also. TIA.
> 
> Dennis Gearon
> 
> Signature Warning
> 
> EARTH has a Right To Life,
>  otherwise we all die.
> 
> Read 'Hot, Flat, and Crowded'
> Laugh at http://www.yert.com/film.php
> 
> 
> --- On Fri, 7/16/10, Saïd Radhouani  wrote:
> 
>> From: Saïd Radhouani 
>> Subject: Spatial Search - Best choice (if any)?
>> To: solr-user@lucene.apache.org
>> Date: Friday, July 16, 2010, 1:21 AM
>> Hi,
>> 
>> Using Solr 1.4, I'm now working on adding spatial search
>> options, such as distance-based sorting, Bounding-box
>> filter, etc.
>> 
>> To the best of my knowledge, there are three possible
>> points we can start from: 
>> 
>> 1. The 
>> http://blog.jteam.nl/2009/08/03/geo-location-search-with-solr-and-lucene/
>> 2. The gissearch.com
>> 3. The 
>> http://www.ibm.com/developerworks/opensource/library/j-spatial/index.html#resources
>> 
>> 
>> I saw that these three options have been used but didn't
>> see any comparison between them. Is there any one out there
>> who can recommend one option over another? 
>> 
>> Thanks,
>> -S


Re: Spatial Search - Best choice (if any)?

2010-07-16 Thread Dennis Gearon
I hope that those who know will answer this. I am really interested in it also. 
TIA.

Dennis Gearon

Signature Warning

EARTH has a Right To Life,
  otherwise we all die.

Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php


--- On Fri, 7/16/10, Saïd Radhouani  wrote:

> From: Saïd Radhouani 
> Subject: Spatial Search - Best choice (if any)?
> To: solr-user@lucene.apache.org
> Date: Friday, July 16, 2010, 1:21 AM
> Hi,
> 
> Using Solr 1.4, I'm now working on adding spatial search
> options, such as distance-based sorting, Bounding-box
> filter, etc.
> 
> To the best of my knowledge, there are three possible
> points we can start from: 
> 
> 1. The 
> http://blog.jteam.nl/2009/08/03/geo-location-search-with-solr-and-lucene/
> 2. The gissearch.com
> 3. The 
> http://www.ibm.com/developerworks/opensource/library/j-spatial/index.html#resources
> 
> 
> I saw that these three options have been used but didn't
> see any comparison between them. Is there any one out there
> who can recommend one option over another? 
> 
> Thanks,
> -S


Re: documents with known relevancy

2010-07-16 Thread Dennis Gearon
So does this mean that each document has a different weight for the same tag?
Dennis Gearon

Signature Warning

EARTH has a Right To Life,
  otherwise we all die.

Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php


--- On Fri, 7/16/10, fiedzia  wrote:

> From: fiedzia 
> Subject: Re: documents with known relevancy
> To: solr-user@lucene.apache.org
> Date: Friday, July 16, 2010, 8:06 AM
> 
> 
> Peter Karich wrote:
> > 
> > Hi,
> > 
> > Why do you need the weight for the tags?
> > 
> 
> The only reason to include weights is to sort results by
> weights.
> So if there are multiple documents containing given tag,
> i want them to be sorted by weight. Also i would like to be
> able 
> to search by multiple tags at once (so if there would be
> field "tags" with
> all tags,
> then documents with highest sum of their weights should be
> first. Sum is just
> example here,
> if solr can offer something similar or more advanced, its
> fine).
> 
> 
> 
> Peter Karich wrote:
> > 
> > you could index it this way:
> > 
> > {
> >  id:     123
> >  tag:    'tag1'
> >  weight:  0.01
> >  uniqueKey: combine(id, tag)
> > }
> > 
> > {
> >  id:     123
> >  tag:    'tag2'
> >  weight:  0.3
> >  uniqueKey: combine(id, tag)
> > }
> > 
> > and specify the query-time boost with the help of the
> weight.
> > Retrieving the document content in a second request to
> another solrindex
> > or using a db.
> > 
> 
> Well, that would work for querying  for single tag. Do
> you know solution
> solving problem of querying for multiple tags?
> 
> Perhaps i can explain the problem better by presenting
> obvious solution:
> create multivalue field "tags" with all tags. This will
> allow to easily ask
> solr for documents matching query
> (which may look like that:  tags:tag1 AND tags:tag2).
> Then get list of all
> results, retrieve tag weights from database and sort them
> by weight. This is
> obviously inefficient, as it requires getting all documents
> from solr
> (possibly large list), then again get them from db, then
> calculate weights
> then sort them. So i am trying to involve solr in this
> processing.
> 
> Other solution i can think could work (though haven't
> examined it fully yet)
> would be to create single text field for tags with tag
> occurrences matching
> tag weight (so if tag2's weight is twice as big as tag1's,
> then the text contains tag1 once and tag2 twice ("tag1 tag2
> tag2"), then
> calculate document score
> based on the amount of occurrences of a given tag in text). From
> what i know about
> solr this could be done,
> but maybe there is a better solution.
> 
> 
> Peter Karich wrote:
> > 
> > there could be a different solution using dynamic
> fields and index-time
> > boosts but I am not sure at the
> moment.    
> > 
> 
> Can you write more about it? Any idea is welcome.
> 
> Thanks for your help anyway.
>


Re: documents with known relevancy

2010-07-16 Thread Dennis Gearon
Seems to me that you are doing externally to Solr what you could be doing 
internally. If you had ONE field for the tags and weighted those in your SOLR 
query, that is how I am guessing it is usually done.
Dennis Gearon

Signature Warning

EARTH has a Right To Life,
  otherwise we all die.

Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php


--- On Fri, 7/16/10, fiedzia  wrote:

> From: fiedzia 
> Subject: documents with known relevancy
> To: solr-user@lucene.apache.org
> Date: Friday, July 16, 2010, 5:59 AM
> 
> I want to  know if what i am trying to achieve is
> doable using solr.
> 
> I have some objects that have tags assigned. A tag is a
> string with a weight
> attached,
> so whole document that i want to index can look like that:
> {
>   id: 123,
>   tags: {
>           tag1: 0.01,
>           tag2: 0.3,
>           ...
>           tagN: some_weight
>           }
> }
> Now i want to store list of tags and sort returned results
> by tag weight.
> The list of tags can be large (up to thousands per
> document, though mostly
> much less).
> So when i am querying solr for documents containing tag1,
> it should return
> all documents containing it,
> sorted by weight of this tag. Is there any way to do that?
> -- 
> View this message in context: 
> http://lucene.472066.n3.nabble.com/documents-with-known-relevancy-tp972462p972462.html
> Sent from the Solr - User mailing list archive at
> Nabble.com.
>


RE: Securing Solr 1.4 in a glassfish container AS NEW THREAD

2010-07-16 Thread Sharp, Jonathan
Hi Bilgin,

Thanks for the snippet -- that helps a lot.

-Jon

-Original Message-
From: Bilgin Ibryam [mailto:bibr...@gmail.com] 
Sent: Friday, July 16, 2010 1:31 AM
To: solr-user@lucene.apache.org
Subject: Re: Securing Solr 1.4 in a glassfish container AS NEW THREAD

Hi Jon,

SolrJ (CommonsHttpSolrServer) internally uses apache http client to connect
to solr. You can check there for some documentation.
I secured solr also with the BASIC auth-method and use the following snippet
to access it from solrJ:

  //set username and password
  ((CommonsHttpSolrServer) server).getHttpClient()
      .getParams().setAuthenticationPreemptive(true);
  Credentials defaultcreds =
      new UsernamePasswordCredentials("username", "secret");
  ((CommonsHttpSolrServer) server).getHttpClient().getState()
      .setCredentials(new AuthScope("localhost", 80, AuthScope.ANY_REALM),
                      defaultcreds);

HTH
Bilgin Ibryam



On Fri, Jul 16, 2010 at 2:35 AM, Sharp, Jonathan  wrote:

> Hi All,
>
> I am considering securing Solr with basic auth in glassfish using the
> container, by adding to web.xml and adding sun-web.xml file to the
> distributed WAR as below.
>
> If using SolrJ to index files, how can I provide the credentials for
> authentication to the http-client (or can someone point me in the direction
> of the right documentation to do that or that will help me make the
> appropriate modifications)?
>
> Also any comment on the below is appreciated.
>
> Add this to web.xml
> ---
> <login-config>
>   <auth-method>BASIC</auth-method>
>   <realm-name>SomeRealm</realm-name>
> </login-config>
>
> <security-constraint>
>   <web-resource-collection>
>     <web-resource-name>Admin Pages</web-resource-name>
>     <url-pattern>/admin</url-pattern>
>     <url-pattern>/admin/*</url-pattern>
>     <http-method>GET</http-method><http-method>POST</http-method>
>     <http-method>PUT</http-method><http-method>TRACE</http-method>
>     <http-method>HEAD</http-method><http-method>OPTIONS</http-method>
>     <http-method>DELETE</http-method>
>   </web-resource-collection>
>   <auth-constraint>
>     <role-name>SomeAdminRole</role-name>
>   </auth-constraint>
> </security-constraint>
>
> <security-constraint>
>   <web-resource-collection>
>     <web-resource-name>Update Servlet</web-resource-name>
>     <url-pattern>/update/*</url-pattern>
>     <http-method>GET</http-method><http-method>POST</http-method>
>     <http-method>PUT</http-method><http-method>TRACE</http-method>
>     <http-method>HEAD</http-method><http-method>OPTIONS</http-method>
>     <http-method>DELETE</http-method>
>   </web-resource-collection>
>   <auth-constraint>
>     <role-name>SomeUpdateRole</role-name>
>   </auth-constraint>
> </security-constraint>
>
> <security-constraint>
>   <web-resource-collection>
>     <web-resource-name>Select Servlet</web-resource-name>
>     <url-pattern>/select/*</url-pattern>
>     <http-method>GET</http-method><http-method>POST</http-method>
>     <http-method>PUT</http-method><http-method>TRACE</http-method>
>     <http-method>HEAD</http-method><http-method>OPTIONS</http-method>
>     <http-method>DELETE</http-method>
>   </web-resource-collection>
>   <auth-constraint>
>     <role-name>SomeSearchRole</role-name>
>   </auth-constraint>
> </security-constraint>
> ---
>
> Also add this as sun-web.xml
>
> <?xml version="1.0" encoding="UTF-8"?>
> <!DOCTYPE sun-web-app PUBLIC "-//Sun Microsystems, Inc.//DTD Application
> Server 9.0 Servlet 2.5//EN"
> "http://www.sun.com/software/appserver/dtds/sun-web-app_2_5-0.dtd">
> <sun-web-app>
>   <context-root>/Solr</context-root>
>   <jsp-config>
>     <property name="keepgenerated" value="true">
>       <description>Keep a copy of the generated servlet class' java
> code.</description>
>     </property>
>   </jsp-config>
>   <security-role-mapping>
>     <role-name>SomeAdminRole</role-name>
>     <group-name>SomeAdminGroup</group-name>
>   </security-role-mapping>
>   <security-role-mapping>
>     <role-name>SomeUpdateRole</role-name>
>     <group-name>SomeUpdateGroup</group-name>
>   </security-role-mapping>
>   <security-role-mapping>
>     <role-name>SomeSearchRole</role-name>
>     <group-name>SomeSearchGroup</group-name>
>   </security-role-mapping>
> </sun-web-app>
> --
>
> -Jon
>
>
>
>


Re: documents with known relevancy

2010-07-16 Thread fiedzia

I came up with another idea, which seems to do what I want. Any comments about
better solutions
or improving efficiency are welcome:

for each document create a multivalued text field "tags" with all tags,
and multiple dynamic fields, one per tag, containing its value, so we have:
{
  id: 123
  tags: tag1, tag2, ..., tagN
  tag1_float: 0.1,
  tag2_float: 0.2,
  ...
  tagN_float: 0.3,
}

then a query for tag1 and tag2 could look like that:
tags:tag1 AND tags:tag2
and sort results by the sum of tag1_float and tag2_float.
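
One way to get that ordering in Solr 1.4, where sorting on a function isn't
available directly, is to make the sum the score (a sketch; field names as
above):

http://localhost:8983/solr/select?q={!func}sum(tag1_float,tag2_float)&fq=tags:tag1+AND+tags:tag2

The fq restricts matches to documents carrying both tags, and the function
query makes the summed weights the score, so the default score ordering is
the desired sort.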



Building maven artifacts

2010-07-16 Thread Pavel Minchenkov
Hi,
I'm trying to run ant task "generate-maven-artifacts" in lucene-solr
build.xml file.
But getting this error:
/home/chardex/lucene/dev/lucene/common-build.xml:312: Error deploying
artifact 'org.apache.lucene:lucene-core:jar': Error deploying artifact: File
/home/chardex/lucene/dev/lucene/build/${project.artifactId}-${project.version}.jar
does not exist

The source code is the latest.
What am I doing wrong?

"generate-maven-artifacts" works fine for solr build.xml file.

Thanks.

-- 
Pavel Minchenkov


Re: documents with known relevancy

2010-07-16 Thread fiedzia


Peter Karich wrote:
> 
> Hi,
> 
> Why do you need the weight for the tags?
> 

The only reason to include weights is to sort results by weights.
So if there are multiple documents containing a given tag,
I want them to be sorted by weight. Also I would like to be able 
to search by multiple tags at once (so if there would be a field "tags" with
all tags,
then documents with the highest sum of their weights should be first. Sum is just
an example here;
if solr can offer something similar or more advanced, it's fine).



Peter Karich wrote:
> 
> you could index it this way:
> 
> {
>  id: 123
>  tag:'tag1'
>  weight:  0.01
>  uniqueKey: combine(id, tag)
> }
> 
> {
>  id: 123
>  tag:'tag2'
>  weight:  0.3
>  uniqueKey: combine(id, tag)
> }
> 
> and specify the query-time boost with the help of the weight.
> Retrieving the document content in a second request to another solrindex
> or using a db.
> 

Well, that would work for querying for a single tag. Do you know a solution
solving the problem of querying for multiple tags?

Perhaps I can explain the problem better by presenting the obvious solution:
create a multivalued field "tags" with all tags. This will allow to easily ask
solr for documents matching a query
(which may look like that: tags:tag1 AND tags:tag2). Then get the list of all
results, retrieve tag weights from the database and sort them by weight. This is
obviously inefficient, as it requires getting all documents from solr
(possibly a large list), then again getting them from the db, then calculating
weights, then sorting them. So I am trying to involve solr in this processing.

Another solution I think could work (though I haven't examined it fully yet)
would be to create a single text field for tags with tag occurrences matching
the tag weight (so if tag2's weight is twice as big as tag1's,
then the text contains tag1 once and tag2 twice ("tag1 tag2 tag2"), then
calculate the document score
based on the number of occurrences of a given tag in the text). From what I
know about solr this could be done,
but maybe there is a better solution.


Peter Karich wrote:
> 
> there could be a different solution using dynamic fields and index-time
> boosts but I am not sure at the moment.   
> 

Can you write more about it? Any idea is welcome.

Thanks for your help anyway.


Re: documents with known relevancy

2010-07-16 Thread Peter Karich
Hi,

Why do you need the weight for the tags?

you could index it this way:

{
 id: 123
 tag:'tag1'
 weight:  0.01
 uniqueKey: combine(id, tag)
}

{
 id: 123
 tag:'tag2'
 weight:  0.3
 uniqueKey: combine(id, tag)
}

and specify the query-time boost with the help of the weight.
Retrieve the document content in a second request to another solr index or 
from a db.
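
For example, a hypothetical query where the client has looked up the stored
weights and turned them into boosts:

  q=tag:tag1^0.01 OR tag:tag2^0.3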

there could be a different solution using dynamic fields and index-time boosts 
but I am not sure at the moment. 

Regards,
Peter.

> I want to know if what I am trying to achieve is doable using solr.
>
> I have some objects that have tags assigned. A tag is a string with a weight
> attached,
> so whole document that i want to index can look like that:
> {
>   id: 123,
>   tags: {
>   tag1: 0.01,
>   tag2: 0.3,
>   ...
>   tagN: some_weight
>   }
> }
> Now i want to store list of tags and sort returned results by tag weight.
> The list of tags can be large (up to thousands per document, though mostly
> much less).
> So when i am querying solr for documents containing tag1, it should return
> all documents containing it,
> sorted by weight of this tag. Is there any way to do that?
>   



Re: Finding distinct unique IDs in documents returned by fq -- Urgent Help Req

2010-07-16 Thread kenf_nc

It may just be a mis-wording, but if you do distinct on 'unique' IDs, the
count should be the same as response.numFound. But if you didn't mean
'unique', just count of some field in the results, Rebecca is correct,
facets should do the job. Something like:

?q=content:query+text&facet=on&facet.field=rootId


documents with known relevancy

2010-07-16 Thread fiedzia

I want to know if what I am trying to achieve is doable using solr.

I have some objects that have tags assigned. A tag is a string with a weight
attached,
so the whole document that I want to index can look like that:
{
  id: 123,
  tags: {
  tag1: 0.01,
  tag2: 0.3,
  ...
  tagN: some_weight
  }
}
Now I want to store the list of tags and sort returned results by tag weight.
The list of tags can be large (up to thousands per document, though mostly
much less).
So when I am querying solr for documents containing tag1, it should return
all documents containing it,
sorted by the weight of this tag. Is there any way to do that?


RE: how to eliminating scoring from a query?

2010-07-16 Thread oferiko

that's actually what I already had in mind, just wasn't sure that specifying
a sort order of index time eliminates the work of scoring.
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/how-to-eliminating-scoring-from-a-query-tp968581p972325.html
Sent from the Solr - User mailing list archive at Nabble.com.


DIH context fails to store in global scope

2010-07-16 Thread Marc Emery
Hi,

I am writing an EventListener that puts some data in the context on import
start:

ctx.setSessionAttribute( DOCTYPE_MAPPING, docTypeMap, Context.SCOPE_GLOBAL
);

but it doesn't seem to work.

Looking at the trunk code of ContextImpl.java, the globalSession is never
written to:

  private void putVal(String name, Object val, Map map) {
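    // note: the 'map' argument (the scope map chosen by the caller, e.g. the
    // global session for SCOPE_GLOBAL) is only used for removal; the put
    // below always targets entitySession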
if(val == null) map.remove(name);
else entitySession.put(name, val);
  }

Thanks
marc


Getting facets count on multiple fields by doing a "Group By"

2010-07-16 Thread Rajinimaski


This is an example of the condition:

I have an employee with the name Rajani and ID 1,
another employee also named Rajani with ID 2, and another Rajani with ID 3.
When I make a facet on name:rajani and facets on ID, the results will be like:

Name = rajani

ID=1
ID=2
ID=3

What I need is like:

(Name=Rajani , ID=1) <1>
(Name=Rajani , ID=2) <1>
(Name=Rajani , ID=3) <1>

i.e. both facet counts in the same line, so that based on the ID I can easily
narrow down the result set again.

Please let me know how I can get faceting like this.
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Getting-facets-count-on-multiple-fields-by-doing-a-Group-By-tp972105p972105.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Using stored terms for faceting

2010-07-16 Thread Erik Hatcher
This is simple faceting; it doesn't even have to be a multi-valued field.
Just index your description field with the desired stop word removal and
other analysis that you want done, and &facet.field=description
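
A minimal sketch of the analysis side in schema.xml (the type name, filter
chain, and stopword file are illustrative assumptions, not from this thread):

<fieldType name="text_facet" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
  </analyzer>
</fieldType>
<field name="description" type="text_facet" indexed="true" stored="true"/>

A query like ?q=*:*&rows=0&facet=on&facet.field=description then returns each
indexed term with its document count.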


Erik


On Jul 15, 2010, at 3:26 AM, Peter Karich wrote:


Dear Hoss,

I will try to clarify what I want to achieve :-)

Assume I have the following three docs:

id:1
description: bmx bike 123

id:2
description: bmx bike 321

id:3
description: a mountain bike

If I query against *:* I want to get the facets and their document counts,
e.g.:

bike: 3
bmx: 2

I reached this with the following approach:
I skip the noise words like 'a'. E.g. for doc 3 I will get the terms
'mountain' and 'bike'.
Those two terms will then additionally be indexed into a multivalued field,
e.g. myfacet, so that I can do faceting on that field.

Is there a simpler approach?

Regards,
Peter.

: is it possible to use the stored terms of a field for a faceted search?

No, the only thing stored fields can be used for is document-centric
operations (ie: once you have a small set of individual docIds, you can
access the stored fields to return to the user, or highlight, etc...)

: I mean, I don't want to get the term frequency per document as it is
: shown here:
: http://wiki.apache.org/solr/TermVectorComponentExampleOptions
:
: I want to get the frequency of the term of my special search and show
: only the 10 most frequent terms and all the nice things that I can do
: for faceting.

i honestly have no idea what you are saying you want -- can you provide
a concrete use case explaining what you mean?  describe some example data
and then explain what type of logic would happen and what type of result
you'd get back?

-Hoss






--
http://karussell.wordpress.com/





Re: problem with storing??

2010-07-16 Thread satya swaroop
hi,
I checked out the admin page and it is indexing the other file types. In the
log files I don't get anything when I send the documents; I checked the log
in catalina (Tomcat). I changed the dismax handler from q=*:* to q=   . I at
least get a response when I send pdf/html files, but I don't get one for the
doc files.


regards,
  swaroop


Re: Custom comparator

2010-07-16 Thread dan sutton
Apologies I didn't make the requirement clear.

I need to keep the best N documents - set A (chosen by some criteria; call
them sponsored docs) - in front of the naturally scoring docs - set B - so
that I return (A,B). The set A docs all need to score above 1% of maxScore in
B, else they join the B set, though I don't really know maxScore until I've
looked at all the docs.

I am looking at the QueryElevationComponent for some hints, but any other
suggestions are appreciated.

Many thanks,
Dan

On Fri, Jul 16, 2010 at 12:03 AM, Erick Erickson wrote:

> Hmmm, why do you need a custom collector? You can use
> the form of the search that returns a TopDocs, from which you
> can get the max score and the array of ScoreDoc each of which
> has its score. So you can just let the underlying code get the
> top N documents, and throw out any that don't score above
> 1%.
>
> HTH
> Erick
>
> On Thu, Jul 15, 2010 at 10:02 AM, dan sutton  wrote:
>
> > Hi,
> >
> > I have a requirement to have a custom comparator that keeps the top N
> > documents (chosen by some criteria) but only if their score is more than
> > e.g. 1% of the maxScore.
> >
> > Looking at SolrIndexSearcher.java, I was hoping to have a custom
> > TopFieldCollector.java to return these via TopFieldCollector.topDocs,
> > but I can't see how to override that class to provide my own. I think I
> > need to do this here (TopFieldCollector.topDocs) as I won't know what
> > the maxScore is until all the docs have been collected and compared?
> >
> > Does anyone have any suggestions? I'd like to avoid having to do two
> > searches.
> >
> > Many Thanks,
> > Dan
> >
>
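
A rough sketch of Erick's post-filtering suggestion above, at the Lucene
level (the class name and result window are placeholders, not from this
thread):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

public class ThresholdSearch {
  /** Collect the top n docs, then drop any scoring below 1% of maxScore. */
  public static List<ScoreDoc> topAboveThreshold(IndexSearcher searcher,
      Query query, int n) throws IOException {
    TopDocs top = searcher.search(query, n);      // normal top-n collection
    float threshold = top.getMaxScore() * 0.01f;  // 1% of the best score
    List<ScoreDoc> kept = new ArrayList<ScoreDoc>();
    for (ScoreDoc sd : top.scoreDocs) {
      if (sd.score >= threshold) {                // post-filter by score
        kept.add(sd);
      }
    }
    return kept;
  }
}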


Re: Securing Solr 1.4 in a glassfish container AS NEW THREAD

2010-07-16 Thread Bilgin Ibryam
Hi Jon,

SolrJ (CommonsHttpSolrServer) internally uses Apache HttpClient to connect
to Solr, so you can check there for some documentation.
I secured Solr with the BASIC auth-method as well, and use the following
snippet to access it from SolrJ:

  // assuming the server was created along these lines (the URL is an example):
  SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

  // set username and password; authenticate preemptively so the credentials
  // are sent with the first request instead of after a 401 challenge
  CommonsHttpSolrServer httpServer = (CommonsHttpSolrServer) server;
  httpServer.getHttpClient().getParams().setAuthenticationPreemptive(true);
  Credentials defaultcreds =
      new UsernamePasswordCredentials("username", "secret");
  httpServer.getHttpClient().getState().setCredentials(
      new AuthScope("localhost", 80, AuthScope.ANY_REALM), defaultcreds);

HTH
Bilgin Ibryam



On Fri, Jul 16, 2010 at 2:35 AM, Sharp, Jonathan  wrote:

> Hi All,
>
> I am considering securing Solr with basic auth in glassfish using the
> container, by adding to web.xml and adding sun-web.xml file to the
> distributed WAR as below.
>
> If using SolrJ to index files, how can I provide the credentials for
> authentication to the http-client (or can someone point me in the direction
> of the right documentation to do that or that will help me make the
> appropriate modifications) ?
>
> Also any comment on the below is appreciated.
>
> Add this to web.xml
> ---
>   <login-config>
>     <auth-method>BASIC</auth-method>
>     <realm-name>SomeRealm</realm-name>
>   </login-config>
>
>   <security-constraint>
>     <web-resource-collection>
>       <web-resource-name>Admin Pages</web-resource-name>
>       <url-pattern>/admin</url-pattern>
>       <url-pattern>/admin/*</url-pattern>
>       <http-method>GET</http-method><http-method>POST</http-method>
>       <http-method>PUT</http-method><http-method>TRACE</http-method>
>       <http-method>HEAD</http-method><http-method>OPTIONS</http-method>
>       <http-method>DELETE</http-method>
>     </web-resource-collection>
>     <auth-constraint>
>       <role-name>SomeAdminRole</role-name>
>     </auth-constraint>
>   </security-constraint>
>
>   <security-constraint>
>     <web-resource-collection>
>       <web-resource-name>Update Servlet</web-resource-name>
>       <url-pattern>/update/*</url-pattern>
>       <http-method>GET</http-method><http-method>POST</http-method>
>       <http-method>PUT</http-method><http-method>TRACE</http-method>
>       <http-method>HEAD</http-method><http-method>OPTIONS</http-method>
>       <http-method>DELETE</http-method>
>     </web-resource-collection>
>     <auth-constraint>
>       <role-name>SomeUpdateRole</role-name>
>     </auth-constraint>
>   </security-constraint>
>
>   <security-constraint>
>     <web-resource-collection>
>       <web-resource-name>Select Servlet</web-resource-name>
>       <url-pattern>/select/*</url-pattern>
>       <http-method>GET</http-method><http-method>POST</http-method>
>       <http-method>PUT</http-method><http-method>TRACE</http-method>
>       <http-method>HEAD</http-method><http-method>OPTIONS</http-method>
>       <http-method>DELETE</http-method>
>     </web-resource-collection>
>     <auth-constraint>
>       <role-name>SomeSearchRole</role-name>
>     </auth-constraint>
>   </security-constraint>
> ---
>
> Also add this as sun-web.xml
>
> <?xml version="1.0" encoding="UTF-8"?>
> <!DOCTYPE sun-web-app PUBLIC "-//Sun Microsystems, Inc.//DTD Application
> Server 9.0 Servlet 2.5//EN"
> "http://www.sun.com/software/appserver/dtds/sun-web-app_2_5-0.dtd">
> <sun-web-app>
>  <context-root>/Solr</context-root>
>  <jsp-config>
>   <property name="keepgenerated" value="true">
>    <description>Keep a copy of the generated servlet class' java
> code.</description>
>   </property>
>  </jsp-config>
>  <security-role-mapping>
>   <role-name>SomeAdminRole</role-name>
>   <group-name>SomeAdminGroup</group-name>
>  </security-role-mapping>
>  <security-role-mapping>
>   <role-name>SomeUpdateRole</role-name>
>   <group-name>SomeUpdateGroup</group-name>
>  </security-role-mapping>
>  <security-role-mapping>
>   <role-name>SomeSearchRole</role-name>
>   <group-name>SomeSearchGroup</group-name>
>  </security-role-mapping>
> </sun-web-app>
> --
>
> -Jon
>
>
>
>


Spatial Search - Best choice (if any)?

2010-07-16 Thread Saïd Radhouani
Hi,

Using Solr 1.4, I'm now working on adding spatial search options, such as 
distance-based sorting, Bounding-box filter, etc.

To the best of my knowledge, there are three possible points we can start from: 

1. The http://blog.jteam.nl/2009/08/03/geo-location-search-with-solr-and-lucene/
2. The gissearch.com
3. The http://www.ibm.com/developerworks/opensource/library/j-spatial/index.html#resources

I saw that these three options have been used but didn't see any comparison
between them. Is there anyone out there who can recommend one option over
another?

Thanks,
-S

Re: Finding distinct unique IDs in documents returned by fq -- Urgent Help Req

2010-07-16 Thread Rebecca Watson
hi,

would faceting work?
http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Faceted-Search-Solr

if you have a field for rootId that is multivalued and facet on it -- you'll
get value+count pairs back (the top 100 values by default, I think)

bec :)

On 16 July 2010 16:07, Ninad Raut  wrote:
> Hi,
>
> I have a scenario in which I have to find count of distinct unique IDs
> present in a field (rootId field in my case) for a particular query.
>
> I require this for pagination purpose.
>
> Is there a way in Solr to do something like this we do in SQL:
>
> select count(distinct(rootId))
> from table
> where (the query part).
>
>
> Regards,
> Ninad R
>


Finding distinct unique IDs in documents returned by fq -- Urgent Help Req

2010-07-16 Thread Ninad Raut
Hi,

I have a scenario in which I have to find count of distinct unique IDs
present in a field (rootId field in my case) for a particular query.

I require this for pagination purpose.

Is there a way in Solr to do something like this we do in SQL:

select count(distinct(rootId))
from table
where (the query part).


Regards,
Ninad R


Re: Solr Best Version

2010-07-16 Thread Tommaso Teofili
Hi all,
I read in a previous thread [1] that also the branch3.x version could be a
good choice, but I don't know what differences exist at the moment between
the two versions and how stable branch3.x is. Maybe someone else could point
these things out.
My 0.0002 cents.
Tommaso

[1] : http://markmail.org/thread/y46ljhfhfgk3geir


2010/7/15 Peter Karich 

> we are using 1.4.0 without any major problems so far. (So, I would use
> 1.4.1 for the next app, just to have the latest version.)
> the trunk is also nice if you want the fuzzy search performance boosts.
>
> Peter.
>
> > Hi all,
> > I'm going to develop a Solr-based search architecture and I wonder if you
> > could suggest which Solr version will best suit my needs.
> > I have 10 Solr machines which use replication, sharding and multi-core;
> > 1 Solr server would index documents (Xml, Pdf, Text ...) on an NFS v3
> > filesystem (I know it's a bad practice but it has been required by the
> > customer) while the others will search over the index.
> >
> > My first idea is to use Solr 1.4.1 but I would like to know which version
> > (1.4.1, branch 3.x, trunk) you suggest (I need a stable version).
> >
> >
> > Thanks in advance for your help
> >
> > Best Regards
> >
>
>
> --
> http://karussell.wordpress.com/
>
>


Re: Query help

2010-07-16 Thread Alejandro Marqués Rodríguez
I can't see a way of retrieving five results from one type and five from
another in a single query. The only way I can think of that would have a
similar behaviour would be:

?q=ContentType:(News+OR+Analysis)&sort=DatePublished+desc&start=0&rows=10

This way you'll have the first 10 results being News or Analysis, though it
could be 7 News and 3 Analysis or even 10 and 0...

If you need Solr to return 5 results from each type, I think the only way to
improve the search speed would be, instead of using just one query, making
two parallel queries.

Regards


2010/7/15 Rupert Bates 

> Sorry, my mistake, the example should have been as follows:
>
> ?q=ContentType:News&sort=DatePublished+desc&start=0&rows=5
> ?q=ContentType:Analysis&sort=DatePublished+desc&start=0&rows=5
>
> Rupert
>
> On 15 July 2010 13:02, kenf_nc  wrote:
> >
> > Your example though doesn't show different ContentType, it shows a
> different
> > sort order. That would be difficult to achieve in one call. Sounds like
> your
> > best bet is asynchronous (multi-threaded) calls if your architecture will
> > allow for it.
> > --
> > View this message in context:
> http://lucene.472066.n3.nabble.com/Query-help-tp969075p969334.html
> > Sent from the Solr - User mailing list archive at Nabble.com.
> >
>
>
>
> --
> Rupert Bates
>
> Software Development Manager
> Guardian News and Media
>
> Tel: 020 3353 3315
> rupert.ba...@guardian.co.uk
> Please consider the environment before printing this email.


-- 
Alejandro Marqués Rodríguez

Paradigma Tecnológico
http://www.paradigmatecnologico.com
Avenida de Europa, 26. Ática 5. 3ª Planta
28224 Pozuelo de Alarcón
Tel.: 91 352 59 42


Re: no response

2010-07-16 Thread satya swaroop
hi,
   I am sorry, the mail you sent was in my sent mail... I didn't look at it. I
am going to check now. I will definitely tell you the entire thing.

regards,
  satya


Re: no response

2010-07-16 Thread Peter Karich
satya,

sorry for being a bit harsh, but did you read Erick's answer in the
'problem with storing??' thread at all?
Just asking the same question again (and not answering old questions) might
be a bit disappointing for people who want to help you.

just my side-note ...

Regards,
Peter.

> Hi all,
> i Have a problem with the solr. when i send the documents(.doc) i am
> not getting the response.
>   example:
>  sa...@geodesic-desktop:~/Desktop$  curl "
> http://localhost:8080/solr/update/extract?stream.file=/home/satya/Desktop/InvestmentDecleration.doc&stream.contentType=application/msword&;
> literal.id=Invest.doc"
> sa...@geodesic-desktop:~/Desktop$
>
>
> could anybody tell me what to do??
>