Help: Solr can't put all pdf files into the index

2012-02-08 Thread 荣康
Hey,
I am using Solr as my search engine to search my pdf files. I have 18219
files (with different file names), all in the same directory. But when I use
Solr to import the files into the index with the DataImport method, Solr
reports that only 17233 files were imported. It's very strange. This problem
has stopped our project for a few days and I can't work out what's wrong.


Please help me!


Schema.xml

[schema.xml markup stripped from the archive; only the uniqueKey value "id"
survives]

and

[second configuration snippet, likewise stripped]

Sincerely,
Rong Kang





Re: How to identify the field with highest score in dismax

2012-02-08 Thread Mikhail Khludnev
Hello,

Have you tried specifying debugQuery=on and looking into the explain section?
It's not really performant, but I propose starting there anyway.
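For example (an illustrative URL, reusing the field names from your message):

http://localhost:8983/solr/select?defType=dismax&qf=Name^10+Details^1&q=Ford+Mustang&fl=*,score&debugQuery=on

The "explain" block in the debug section breaks each document's score down
clause by clause, so you can see whether the winning clause matched on Name
or on Details.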

Regards

On Wed, Feb 8, 2012 at 7:32 PM, crisfromnova  wrote:

> Hi,
>
> According to the Solr documentation, the dismax score is calculated with
> the formula:
> (score of matching clause with the highest score) + ( (tie parameter) *
> (scores of any other matching clauses) ).
>
> Is there a way to identify the field on which the matching clause score is
> the highest?
>
> For example I suppose that I have the following document :
>
> <doc>
>   <field name="Name">Ford Mustang Coupe Cabrio</field>
>   <field name="Details">Ford Mustang is a great car</field>
> </doc>
>
> and the following dismax query :
>
> defType=dismax&qf=Name^10+Details^1&q="Ford+Mustang"+Ford+Mustang
>
> and receive the document with the score : 5.6.
> Is there a way to find out if the score is for the matching on Name field
> or
> for the matching on Details field?
>
> Thanks in advance!
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/How-to-identify-the-field-with-highest-score-in-dismax-tp3726297p3726297.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>



-- 
Sincerely yours
Mikhail Khludnev
Lucid Certified
Apache Lucene/Solr Developer
Grid Dynamics


 


How do i do group by in solr with multiple shards?

2012-02-08 Thread Kashif Khan
Hi all,

I have tried grouping in Solr with multiple shards but it does not work.
Basically I want to do a simple GROUP BY, like the SQL statement, in Solr
with multiple shards. Please suggest how I can do this, as it is not
currently supported OOB by Solr.
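For reference, the single-index grouping syntax I tried is along these lines
(illustrative values, not my exact query):

http://localhost:8983/solr/select?q=*:*&group=true&group.field=category&group.limit=1

It is this combined with the &shards=... parameter that does not work.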

Thanks & regards,
Kashif Khan

--
View this message in context: 
http://lucene.472066.n3.nabble.com/How-do-i-do-group-by-in-solr-with-multiple-shards-tp3728555p3728555.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Sorting solrdocumentlist object after querying

2012-02-08 Thread Kashif Khan
No, that sorting is based on multiple fields. Basically I want to sort them
like a GROUP BY statement in SQL, based on a few fields, which means many
loops to go through. The problem is that I have, say, 1,000,000 Solr docs
after injecting my few extra docs, and I then want to group these docs by
some fields and take 20 records for paging. So I need some shortcut for
that.
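A sketch of the client-side sort I mean (SolrDocumentList extends
ArrayList<SolrDocument>, so Collections.sort works on it directly; the field
names are illustrative, and it assumes fl=*,score so "score" is present):

import java.util.Collections;
import java.util.Comparator;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrDocumentList;

public class GroupSort {
    // Sort by "groupField" ascending, then by "score" descending,
    // mimicking an SQL "GROUP BY groupField" ordering client-side.
    public static void sort(SolrDocumentList docs) {
        Collections.sort(docs, new Comparator<SolrDocument>() {
            public int compare(SolrDocument a, SolrDocument b) {
                String ga = String.valueOf(a.getFieldValue("groupField"));
                String gb = String.valueOf(b.getFieldValue("groupField"));
                int c = ga.compareTo(gb);
                if (c != 0) return c;
                Float sa = (Float) a.getFieldValue("score");
                Float sb = (Float) b.getFieldValue("score");
                return sb.compareTo(sa);
            }
        });
        // paging would then be: docs.subList(0, Math.min(20, docs.size()))
    }
}

But this can only reorder the docs actually returned, which is exactly the
problem with 1,000,000 of them.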
--
Kashif Khan.
http://www.kashifkhan.in



On Wed, Feb 8, 2012 at 11:07 PM, iorixxx [via Lucene] <
ml-node+s472066n3726788...@n3.nabble.com> wrote:

> > I want to sort a SolrDocumentList after it has been queried
> > and obtained
> > from the QueryResponse.getResults(). The reason is i have a
> > SolrDocumentList
> > obtained after querying using QueryResponse.getResults() and
> > i have added
> > few docs to it. Now i want to sort this SolrDocumentList
> > based on the same
> > fields i did the querying before i modified this
> > SolrDocumentList.
>
> QueryResponse.getResults() will return at most "rows" documents. Can't you
> sort them (plus your injected documents) on your own?
>
>
> --
>  If you reply to this email, your message will be added to the discussion
> below:
>
> http://lucene.472066.n3.nabble.com/Sorting-solrdocumentlist-object-after-querying-tp3726303p3726788.html
>  To unsubscribe from Sorting solrdocumentlist object after querying, click
> here
> .
> NAML
>


--
View this message in context: 
http://lucene.472066.n3.nabble.com/Sorting-solrdocumentlist-object-after-querying-tp3726303p3728549.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: solr cloud concepts

2012-02-08 Thread Adeel Qureshi
Thanks for the explanation. It makes sense, but I am hoping you can
clarify things a bit more ..

so now it sounds like in SolrCloud the concept of cores has changed a bit
.. as you explained, for me to have 2 cores with different schemas I
will need 2 different collections .. and one good thing about SolrCores was
that you could create new ones with the coreadmin API or HTTP calls .. creating
new collections is not that automated, right ..

secondly, if collections represent what a SolrCore used to be, then
once I have a collection .. why would I ever want to add multiple cores to
it .. I mean, I am trying to think of a reason why it would make sense to do
that.

Thanks


On Wed, Feb 8, 2012 at 4:41 PM, Mark Miller  wrote:

>
> On Feb 8, 2012, at 5:26 PM, Adeel Qureshi wrote:
>
> > okay so after reading Bruno's blog post .. lets add slice to the mix as
> > well .. so we have got collections, cores, shards, partitions and slices
> :)
> > ..
>
> Yeah - heh - this has bugged me, but we have not really all come down on
> agreement of terminology here. I was a fan of using shard for each node and
> slice for partition. Another couple of committers wanted partitions rather
> than slice. Another says slice in code, shard for both in terminology and
> use context...
>
> I'd even go for shards as partitions and replicas for every node in a
> shard. But those fine points are still settling ;)
>
> >
> > The whole point with cores is to be able to have different schemas on the
> > same solr server instance. So how does that changes with collections ..
> may
> > be an example might help .. if I want to setup a solrcloud cluster with 2
> > cores (different schema) .. with each core having 2 shards (i m assuming
> > shards are really partitions here, across multiple nodes in the cluster)
> ..
> > with one shard being the replica..
>
> So this would mean you want to create 2 collections. Think of a collection
> as a bunch of SolrCores that all share the same schema and config.
>
> So you would start up 2 nodes set to one collection and with numShards=1
> that will give you one shard hosted by two identical SolrCores, giving you
> a replication factor. The full index will be in each of the two SolrCores.
>
> Then if you start another two nodes and specify a different collection
> name, you will get the same thing, but distinct from your first collection
> (although, if both collections have compatible shema/config you can still
> search across them).
>
> >
> >
> > On Wed, Feb 8, 2012 at 11:35 AM, Mark Miller 
> wrote:
> >
> >>
> >> On Feb 8, 2012, at 10:31 AM, Adeel Qureshi wrote:
> >>
> >>> I have been using solr for a while and have recently started getting
> into
> >>> solrcloud .. i am a bit confused with some of the concepts ..
> >>>
> >>> 1. what exactly is the relationship between a collection and the core
> ..
> >>> can a core has multiple collections in it .. in this case all
> collections
> >>> within this core will have the same schema .. and i am assuming all
> >>> instances of collections within the core can be deployed on different
> >> solr
> >>> nodes to achieve distributed search ..
> >>> or is it the other way around where a collection can have multiple
> cores
> >>
> >> Currently, a core basically equals a replica of the index.
> >>
> >> So you might have a collection called collection1 - lets say it's 2
> shards
> >> and each shard has a single replica:
> >>
> >> Collection1
> >> shard1 replica1
> >> shard1 replica2
> >> shard2 replica1
> >> shard2 replica2
> >>
> >> Each of those replicas is a core. So a collection has multiple cores
> >> basically. Also, each of those cores can be on a different machine. So
> yes,
> >> you have distributed indexing and distributed search.
> >>
> >>>
> >>> 2. at some places it has been pointed out that solrcloud doesnt
> actually
> >>> supports replication .. but in the solrcloud wiki the second example is
> >>> supposed to be for replication .. so does solrcloud at this point
> >> supports
> >>> automatic replication where as you add more servers it automatically
> uses
> >>> the additional servers as replicas
> >>
> >> SolrCloud doesn't support the old style Solr replication concept. It
> does
> >> however, handle replication - it's just all pretty much automatic and
> >> behind the scenes - eg all the information about Solr replication in the
> >> wiki documentation for previous versions of Solr is really not
> applicable.
> >> We now achieve replica copies by sending documents to each shard one
> >> document at a time so that we can support near realtime search. The old
> >> style replication is only used in recovery, or when you start a new
> replica
> >> machine and it has to 'catchup' to the other replicas.
> >>
> >>>
> >>> I have a few more questions but I wanted to get these basic ones out of
> >> the
> >>> way first .. I would appreciate any response.
> >>
> >> Fire away.
> >>
> >>>
> >>> Thanks
> >>> Adeel
> >>
> >> - Mark Miller
>> lucidimagination.com

Re: multiple cores in a single instance vs multiple instances with single core

2012-02-08 Thread Jamie Johnson
Thanks Mark. In regards to failover I completely agree; I am wondering more
about performance and memory usage if the indexes are large, and whether
separate Java instances under heavy load would be more or less performant.
Currently we deploy a single core per instance, but multiple instances per
machine.
On Wednesday, February 8, 2012, Mark Miller  wrote:
>
> On Feb 8, 2012, at 9:52 PM, Jamie Johnson wrote:
>
>> In solr cloud what is a better approach / use of resources having
multiple
>> cores on a single instance or multiple instances with a single core? What
>> are the benefits and drawbacks of each?
>
>
> It depends I suppose. If you are talking about on a single machine, I'd
prefer using multiple cores over multiple Solr instances. I think it's just
easier to manage. You have to be sensible about that though - if all the
replicas for a shard are on the same machine, in the same instance, as
different cores, you don't have a lot of room for error - if that box goes
down, goodbye. But you can certainly mix and match instances and cores.
>
> One interesting thing you can do is a poor man's micro-sharding - put a
few shards per machine - then later when you add more nodes to your
cluster, you can bring up a core on one of the new machines, it will catch
up, then you could unload that core on the original machine and replicas.
Then start up any other new nodes to add replicas for the moved shard.
Roughly and/or something like that anyway - I haven't thought it through
thoroughly, but Yonik has brought it up before, and it seems pretty easily
doable.
>
> - Mark Miller
> lucidimagination.com
>
>
>
>
>
>
>
>
>
>
>
>


Re: multiple cores in a single instance vs multiple instances with single core

2012-02-08 Thread Mark Miller

On Feb 8, 2012, at 9:52 PM, Jamie Johnson wrote:

> In solr cloud what is a better approach / use of resources having multiple
> cores on a single instance or multiple instances with a single core? What
> are the benefits and drawbacks of each?


It depends I suppose. If you are talking about on a single machine, I'd prefer 
using multiple cores over multiple Solr instances. I think it's just easier to 
manage. You have to be sensible about that though - if all the replicas for a 
shard are on the same machine, in the same instance, as different cores, you 
don't have a lot of room for error - if that box goes down, goodbye. But you 
can certainly mix and match instances and cores.

One interesting thing you can do is a poor man's micro-sharding - put a few 
shards per machine - then later when you add more nodes to your cluster, you 
can bring up a core on one of the new machines, it will catch up, then you 
could unload that core on the original machine and replicas. Then start up any 
other new nodes to add replicas for the moved shard. Roughly and/or something 
like that anyway - I haven't thought it through thoroughly, but Yonik has 
brought it up before, and it seems pretty easily doable.
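Roughly, the core admin calls involved would be something like (a sketch from
memory - parameter details may be off):

http://newhost:8983/solr/admin/cores?action=CREATE&name=shard3_replica&collection=collection1&shard=shard3
(wait for the new core to catch up, then)
http://oldhost:8983/solr/admin/cores?action=UNLOAD&core=shard3_replica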

- Mark Miller
lucidimagination.com













Re: solr cloud concepts

2012-02-08 Thread Mark Miller

On Feb 8, 2012, at 9:36 PM, Jamie Johnson wrote:

> Mark,
> is the recommendation now to have each solr instance be a separate core in
> solr cloud? I had thought that the core name was by default the collection
> name? Or are you saying that although they have the same name they are
> separate because they are in different JVMs?

By default, the collection name is set to the core name. This is really just 
for convenience when you are getting started. It gives you a default collection 
name of collection1 because the default SolrCore name is collection1, and each 
SolrCore on each instance is addressable as /solr/collection1.

You can certainly have core names be whatever you want and explicitly pass its 
collection. In that case, the URL for each would be different - though I think 
there is an open JIRA issue about making that nicer - so that you can look up 
the right core even if you pass the collection name or something.

- Mark Miller
lucidimagination.com













Re: solr cloud concepts

2012-02-08 Thread Jamie Johnson
Mark,
is the recommendation now to have each solr instance be a separate core in
solr cloud? I had thought that the core name was by default the collection
name? Or are you saying that although they have the same name they are
separate because they are in different JVMs?

On Wednesday, February 8, 2012, Mark Miller  wrote:
>
> On Feb 8, 2012, at 10:31 AM, Adeel Qureshi wrote:
>
>> I have been using solr for a while and have recently started getting into
>> solrcloud .. i am a bit confused with some of the concepts ..
>>
>> 1. what exactly is the relationship between a collection and the core ..
>> can a core has multiple collections in it .. in this case all collections
>> within this core will have the same schema .. and i am assuming all
>> instances of collections within the core can be deployed on different
solr
>> nodes to achieve distributed search ..
>> or is it the other way around where a collection can have multiple cores
>
> Currently, a core basically equals a replica of the index.
>
> So you might have a collection called collection1 - lets say it's 2
shards and each shard has a single replica:
>
> Collection1
> shard1 replica1
> shard1 replica2
> shard2 replica1
> shard2 replica2
>
> Each of those replicas is a core. So a collection has multiple cores
basically. Also, each of those cores can be on a different machine. So yes,
you have distributed indexing and distributed search.
>
>>
>> 2. at some places it has been pointed out that solrcloud doesnt actually
>> supports replication .. but in the solrcloud wiki the second example is
>> supposed to be for replication .. so does solrcloud at this point
supports
>> automatic replication where as you add more servers it automatically uses
>> the additional servers as replicas
>
> SolrCloud doesn't support the old style Solr replication concept. It does
however, handle replication - it's just all pretty much automatic and
behind the scenes - eg all the information about Solr replication in the
wiki documentation for previous versions of Solr is really not applicable.
We now achieve replica copies by sending documents to each shard one
document at a time so that we can support near realtime search. The old
style replication is only used in recovery, or when you start a new replica
machine and it has to 'catchup' to the other replicas.
>
>>
>> I have a few more questions but I wanted to get these basic ones out of
the
>> way first .. I would appreciate any response.
>
> Fire away.
>
>>
>> Thanks
>> Adeel
>
> - Mark Miller
> lucidimagination.com
>
>
>
>
>
>
>
>
>
>
>
>


Re: usage of /etc/jetty.xml when debugging Solr in Eclipse

2012-02-08 Thread jmlucjav
yes, I am using https://github.com/alexwinston/RunJettyRun which apparently is
a fork of the original project, created out of the need to use a custom
jetty.xml.

So I am already setting an additional jetty.xml; this can be done in the Run
configuration, no need for a -D param. But as I mentioned, solr does not
start cleanly if I do that.

So I wanted to understand what role /etc/jetty.xml plays
- when solr is started via 'java -jar start.jar'
- when started with RunJettyRun in eclipse.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/usage-of-etc-jetty-xml-when-debugging-Solr-in-Eclipse-tp3725588p3728008.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Improving performance for SOLR geo queries?

2012-02-08 Thread Nicolas Flacco
I compared locallucene to spatial search and saw a performance
degradation, even using geohash queries, though perhaps I indexed things
wrong? Locallucene across 6 machines handles 150 queries per second fine,
but using geofilt and geohash I got lots of timeouts even when I was doing
only 50 queries per second. Has anybody done a formal comparison of
locallucene with spatial search and latlontype, pointtype and geohash?

On 2/8/12 2:20 PM, "Ryan McKinley"  wrote:

>Hi Matthias-
>
>I'm trying to understand how you have your data indexed so we can give
>reasonable direction.
>
>What field type are you using for your locations?  Is it using the
>solr spatial field types?  What do you see when you look at the debug
>information from &debugQuery=true?
>
>From my experience, there is no single best practice for spatial
>queries -- it will depend on your data density and distribution.
>
>You may also want to look at:
>http://code.google.com/p/lucene-spatial-playground/
>but note this is off lucene trunk -- the geohash queries are super fast
>though
>
>ryan
>
>
>
>
>2012/2/8 Matthias Käppler :
>> Hi Erick,
>>
>> if we're not doing geo searches, we filter by "location tags" that we
>> attach to places. This is simply a hierachical regional id, which is
>> simple to filter for, but much less flexible. We use that on Web a
>> lot, but not on mobile, where we want to performance searches in
>> arbitrary radii around arbitrary positions. For those location tag
>> kind of queries, the average time spent in SOLR is 43msec (I'm looking
>> at the New Relic snapshot of the last 12 hours). I have disabled our
>> "optimization" again just yesterday, so for the bbox queries we're now
>> at an avg of 220ms (same time window). That's a 5 fold increase in
>> response time, and in peak hours it's worse than that.
>>
>> I've also found a blog post from 3 years ago which outlines the inner
>> workings of the SOLR spatial indexing and searching:
>> http://www.searchworkings.org/blog/-/blogs/23842
>> From that it seems as if SOLR already performs a similar optimization
>> we had in mind during the index step, so if I understand correctly, it
>> doesn't even search over all records, only those that were mapped to
>> the grid box identified during indexing.
>>
>> What I would love to see is the suggested way to perform a geo
>> query on SOLR, considering that they're so difficult to cache and
>> expensive to run. Is the best approach to restrict the candidate set
>> as much as possible using cheap filter queries, so that SOLR merely
>> has to do the geo search against these subsets? How does the query
>> planner work here? I see there's a cost attached to a filter query,
>> but one can only set it when cache is set to false? Are cached geo
>> queries executed last when there are cheaper filter queries to cut
>> down on documents? If you have a real world practical setup to share,
>> one that performs well in a production environment that serves
>> requests in the Millions per day, that would be great.
>>
>> I'd love to contribute documentation by the way, if you knew me you'd
>> know I'm an avid open source contributor and actually run several open
>> source projects myself. But tell me, how can I possibly contribute
>> answer to questions I don't have an answer to? That's why I'm here,
>> remember :) So please, these kinds of snippy replies are not helping
>> anyone.
>>
>> Thanks
>> -Matthias
>>
>> On Tue, Feb 7, 2012 at 3:06 PM, Erick Erickson
>> wrote:
>>> So the obvious question is "what is your
>>> performance like without the distance filters?"
>>>
>>> Without that knowledge, we have no clue whether
>>> the modifications you've made had any hope of
>>> speeding up your response times
>>>
>>> As for the docs, any improvements you'd like to
>>> contribute would be happily received
>>>
>>> Best
>>> Erick
>>>
>>> 2012/2/6 Matthias Käppler :
 Hi,

 we need to perform fast geo lookups on an index of ~13M places, and
 were running into performance problems here with SOLR. We haven't done
 a lot of query optimization / SOLR tuning up until now so there's
 probably a lot of things we're missing. I was wondering if you could
 give me some feedback on the way we do things, whether they make
 sense, and especially why a supposed optimization we implemented
 recently seems to have no effect, when we actually thought it would
 help a lot.

 What we do is this: our API is built on a Rails stack and talks to
 SOLR via a Ruby wrapper. We have a few filters that almost always
 apply, which we put in filter queries. Filter cache hit rate is
 excellent, about 97%, and cache size caps at 10k filters (max size is
 32k, but it never seems to reach that many, probably because we
 replicate / delta update every few minutes). Still, geo queries are
 slow, about 250-500msec on average. We send them with cache=false, so
 as to not flood the fq cache and cause undesirable evictions.

Re: solr cloud concepts

2012-02-08 Thread Mark Miller

On Feb 8, 2012, at 5:26 PM, Adeel Qureshi wrote:

> okay so after reading Bruno's blog post .. lets add slice to the mix as
> well .. so we have got collections, cores, shards, partitions and slices :)
> ..

Yeah - heh - this has bugged me, but we have not really all come down on 
agreement of terminology here. I was a fan of using shard for each node and 
slice for partition. Another couple of committers wanted partitions rather than 
slice. Another says slice in code, shard for both in terminology and use 
context...

I'd even go for shards as partitions and replicas for every node in a shard. 
But those fine points are still settling ;)

> 
> The whole point with cores is to be able to have different schemas on the
> same solr server instance. So how does that changes with collections .. may
> be an example might help .. if I want to setup a solrcloud cluster with 2
> cores (different schema) .. with each core having 2 shards (i m assuming
> shards are really partitions here, across multiple nodes in the cluster) ..
> with one shard being the replica..

So this would mean you want to create 2 collections. Think of a collection as a 
bunch of SolrCores that all share the same schema and config. 

So you would start up 2 nodes set to one collection and with numShards=1 that 
will give you one shard hosted by two identical SolrCores, giving you a 
replication factor. The full index will be in each of the two SolrCores.

Then if you start another two nodes and specify a different collection name, 
you will get the same thing, but distinct from your first collection (although, 
if both collections have compatible shema/config you can still search across 
them).
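For concreteness, that first collection can be started roughly as in the
SolrCloud wiki example (a sketch; see the wiki for the exact recipe):

cd node1
java -Dbootstrap_confdir=./solr/conf -Dcollection.configName=conf1 -DzkRun -DnumShards=1 -jar start.jar

cd node2
java -DzkHost=localhost:9983 -jar start.jar

The second node attaches to the same ZooKeeper, and with numShards=1 it comes
up as a replica of the same shard.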

> 
> 
> On Wed, Feb 8, 2012 at 11:35 AM, Mark Miller  wrote:
> 
>> 
>> On Feb 8, 2012, at 10:31 AM, Adeel Qureshi wrote:
>> 
>>> I have been using solr for a while and have recently started getting into
>>> solrcloud .. i am a bit confused with some of the concepts ..
>>> 
>>> 1. what exactly is the relationship between a collection and the core ..
>>> can a core has multiple collections in it .. in this case all collections
>>> within this core will have the same schema .. and i am assuming all
>>> instances of collections within the core can be deployed on different
>> solr
>>> nodes to achieve distributed search ..
>>> or is it the other way around where a collection can have multiple cores
>> 
>> Currently, a core basically equals a replica of the index.
>> 
>> So you might have a collection called collection1 - lets say it's 2 shards
>> and each shard has a single replica:
>> 
>> Collection1
>> shard1 replica1
>> shard1 replica2
>> shard2 replica1
>> shard2 replica2
>> 
>> Each of those replicas is a core. So a collection has multiple cores
>> basically. Also, each of those cores can be on a different machine. So yes,
>> you have distributed indexing and distributed search.
>> 
>>> 
>>> 2. at some places it has been pointed out that solrcloud doesnt actually
>>> supports replication .. but in the solrcloud wiki the second example is
>>> supposed to be for replication .. so does solrcloud at this point
>> supports
>>> automatic replication where as you add more servers it automatically uses
>>> the additional servers as replicas
>> 
>> SolrCloud doesn't support the old style Solr replication concept. It does
>> however, handle replication - it's just all pretty much automatic and
>> behind the scenes - eg all the information about Solr replication in the
>> wiki documentation for previous versions of Solr is really not applicable.
>> We now achieve replica copies by sending documents to each shard one
>> document at a time so that we can support near realtime search. The old
>> style replication is only used in recovery, or when you start a new replica
>> machine and it has to 'catchup' to the other replicas.
>> 
>>> 
>>> I have a few more questions but I wanted to get these basic ones out of
>> the
>>> way first .. I would appreciate any response.
>> 
>> Fire away.
>> 
>>> 
>>> Thanks
>>> Adeel
>> 
>> - Mark Miller
>> lucidimagination.com
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 

- Mark Miller
lucidimagination.com













linking documents in solr

2012-02-08 Thread T Vinod Gupta
hi,
I have a question about linking documents in Solr and want to know if it's
possible. Let's say I have a set of blogs and their authors that I want to
index separately. Is it possible to link a document describing a blog to
another document describing an author? If yes, can I search for blogs with
filters on attributes of the author? And if so, when I update an attribute of
an author (by its id), will the search results reflect the updated
attribute(s)?

thanks


Re: solr cloud concepts

2012-02-08 Thread Adeel Qureshi
okay so after reading Bruno's blog post .. lets add slice to the mix as
well .. so we have got collections, cores, shards, partitions and slices :)
..

The whole point of cores is to be able to have different schemas on the
same Solr server instance. So how does that change with collections .. maybe
an example might help .. if I want to set up a SolrCloud cluster with 2
cores (different schemas) .. with each core having 2 shards (I'm assuming
shards are really partitions here, across multiple nodes in the cluster) ..
with one shard being the replica..


On Wed, Feb 8, 2012 at 11:35 AM, Mark Miller  wrote:

>
> On Feb 8, 2012, at 10:31 AM, Adeel Qureshi wrote:
>
> > I have been using solr for a while and have recently started getting into
> > solrcloud .. i am a bit confused with some of the concepts ..
> >
> > 1. what exactly is the relationship between a collection and the core ..
> > can a core has multiple collections in it .. in this case all collections
> > within this core will have the same schema .. and i am assuming all
> > instances of collections within the core can be deployed on different
> solr
> > nodes to achieve distributed search ..
> > or is it the other way around where a collection can have multiple cores
>
> Currently, a core basically equals a replica of the index.
>
> So you might have a collection called collection1 - lets say it's 2 shards
> and each shard has a single replica:
>
> Collection1
> shard1 replica1
> shard1 replica2
> shard2 replica1
> shard2 replica2
>
> Each of those replicas is a core. So a collection has multiple cores
> basically. Also, each of those cores can be on a different machine. So yes,
> you have distributed indexing and distributed search.
>
> >
> > 2. at some places it has been pointed out that solrcloud doesnt actually
> > supports replication .. but in the solrcloud wiki the second example is
> > supposed to be for replication .. so does solrcloud at this point
> supports
> > automatic replication where as you add more servers it automatically uses
> > the additional servers as replicas
>
> SolrCloud doesn't support the old style Solr replication concept. It does
> however, handle replication - it's just all pretty much automatic and
> behind the scenes - eg all the information about Solr replication in the
> wiki documentation for previous versions of Solr is really not applicable.
> We now achieve replica copies by sending documents to each shard one
> document at a time so that we can support near realtime search. The old
> style replication is only used in recovery, or when you start a new replica
> machine and it has to 'catchup' to the other replicas.
>
> >
> > I have a few more questions but I wanted to get these basic ones out of
> the
> > way first .. I would appreciate any response.
>
> Fire away.
>
> >
> > Thanks
> > Adeel
>
> - Mark Miller
> lucidimagination.com
>
>
>
>
>
>
>
>
>
>
>
>


Re: Improving performance for SOLR geo queries?

2012-02-08 Thread Ryan McKinley
Hi Matthias-

I'm trying to understand how you have your data indexed so we can give
reasonable direction.

What field type are you using for your locations?  Is it using the
solr spatial field types?  What do you see when you look at the debug
information from &debugQuery=true?

From my experience, there is no single best practice for spatial
queries -- it will depend on your data density and distribution.

You may also want to look at:
http://code.google.com/p/lucene-spatial-playground/
but note this is off lucene trunk -- the geohash queries are super fast though
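For reference, the stock spatial filters are invoked like this (assuming a
solr.LatLonType field named "store", as in the wiki examples):

...&q=*:*&fq={!geofilt sfield=store pt=45.15,-93.85 d=5}
...&q=*:*&fq={!bbox sfield=store pt=45.15,-93.85 d=5}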

ryan




2012/2/8 Matthias Käppler :
> Hi Erick,
>
> if we're not doing geo searches, we filter by "location tags" that we
> attach to places. This is simply a hierarchical regional id, which is
> simple to filter for, but much less flexible. We use that on Web a
> lot, but not on mobile, where we want to perform searches in
> arbitrary radii around arbitrary positions. For those location tag
> kind of queries, the average time spent in SOLR is 43msec (I'm looking
> at the New Relic snapshot of the last 12 hours). I have disabled our
> "optimization" again just yesterday, so for the bbox queries we're now
> at an avg of 220ms (same time window). That's a 5 fold increase in
> response time, and in peak hours it's worse than that.
>
> I've also found a blog post from 3 years ago which outlines the inner
> workings of the SOLR spatial indexing and searching:
> http://www.searchworkings.org/blog/-/blogs/23842
> From that it seems as if SOLR already performs a similar optimization
> we had in mind during the index step, so if I understand correctly, it
> doesn't even search over all records, only those that were mapped to
> the grid box identified during indexing.
>
> What I would love to see is the suggested way to perform a geo
> query on SOLR, considering that they're so difficult to cache and
> expensive to run. Is the best approach to restrict the candidate set
> as much as possible using cheap filter queries, so that SOLR merely
> has to do the geo search against these subsets? How does the query
> planner work here? I see there's a cost attached to a filter query,
> but one can only set it when cache is set to false? Are cached geo
> queries executed last when there are cheaper filter queries to cut
> down on documents? If you have a real world practical setup to share,
> one that performs well in a production environment that serves
> requests in the Millions per day, that would be great.
>
> I'd love to contribute documentation by the way, if you knew me you'd
> know I'm an avid open source contributor and actually run several open
> source projects myself. But tell me, how can I possibly contribute
> answer to questions I don't have an answer to? That's why I'm here,
> remember :) So please, these kinds of snippy replies are not helping
> anyone.
>
> Thanks
> -Matthias
>
> On Tue, Feb 7, 2012 at 3:06 PM, Erick Erickson  
> wrote:
>> So the obvious question is "what is your
>> performance like without the distance filters?"
>>
>> Without that knowledge, we have no clue whether
>> the modifications you've made had any hope of
>> speeding up your response times
>>
>> As for the docs, any improvements you'd like to
>> contribute would be happily received
>>
>> Best
>> Erick
>>
>> 2012/2/6 Matthias Käppler :
>>> Hi,
>>>
>>> we need to perform fast geo lookups on an index of ~13M places, and
>>> were running into performance problems here with SOLR. We haven't done
>>> a lot of query optimization / SOLR tuning up until now so there's
>>> probably a lot of things we're missing. I was wondering if you could
>>> give me some feedback on the way we do things, whether they make
>>> sense, and especially why a supposed optimization we implemented
>>> recently seems to have no effect, when we actually thought it would
>>> help a lot.
>>>
>>> What we do is this: our API is built on a Rails stack and talks to
>>> SOLR via a Ruby wrapper. We have a few filters that almost always
>>> apply, which we put in filter queries. Filter cache hit rate is
>>> excellent, about 97%, and cache size caps at 10k filters (max size is
>>> 32k, but it never seems to reach that many, probably because we
>>> replicate / delta update every few minutes). Still, geo queries are
>>> slow, about 250-500msec on average. We send them with cache=false, so
>>> as to not flood the fq cache and cause undesirable evictions.
>>>
>>> Now our idea was this: while the actual geo queries are poorly
>>> cacheable, we could clearly identify geographical regions which are
>>> more often queried than others (naturally, since we're a user driven
>>> service). Therefore, we dynamically partition Earth into a static grid
>>> of overlapping boxes, where the grid size (the distance of the nodes)
>>> depends on the maximum allowed search radius. That way, for every user
>>> query, we would always be able to identify a single bounding box that
>>> covers it. This larger bounding box

Re: SolrCloud is in trunk.

2012-02-08 Thread darren

Good job on this work. A monumental effort.

On Wed, 8 Feb 2012 16:41:13 -0500, Mark Miller 
wrote:
> For those that are interested and have not noticed, the latest work on
> SolrCloud and distributed indexing is now in trunk.
> 
> SolrCloud is our name for a new set of distributed capabilities that
> improve upon the old style distributed search and index based
replication.
> 
> It provides for high availability and fault tolerance while allowing for
> near realtime search and an interface that matches what you are used to
> with previous versions of Solr.
> 
> We are looking to release this in the next 4.0 release, and any feedback
> early users can provide will be very useful. So if you have an interest
in
> these types of features, please take the latest trunk build for a spin
and
> provide some feedback. 
> 
> There is still a lot more planned, so feel free to chime in on what you
> would like to see - this is essentially the end of stage one. 
> 
> You can read more about what we have done on the wiki:
> http://wiki.apache.org/solr/SolrCloud
> 
> Also, a couple blog posts I recently saw pop up:
> 
>
http://blog.sematext.com/2012/02/01/solrcloud-distributed-realtime-search
> http://outerthought.org/blog/491-ot.html
> 
> I'll contribute my own blog post as well when I get a chance, but there
> should be a fair amount of info there to get you started if you are
> interested. 
> 
> Thanks,
> 
> - Mark Miller
> lucidimagination.com


Re: Using UUID for uniqueId

2012-02-08 Thread Anderson vasconcelos
Thanks
2012/2/8 François Schiettecatte 

> Anderson
>
> I would say that this is highly unlikely, but you would need to pay
> attention to how they are generated, this would be a good place to start:
>
>http://en.wikipedia.org/wiki/Universally_unique_identifier
>
> Cheers
>
> François
>
> On Feb 8, 2012, at 1:31 PM, Anderson vasconcelos wrote:
>
> > HI all
> >
> > If i use the UUID like a uniqueId in the future if i break my index in
> > shards, i will have problems? The UUID generation could generate the same
> > UUID in differents machines?
> >
> > Thanks
>
>


SolrCloud is in trunk.

2012-02-08 Thread Mark Miller
For those that are interested and have not noticed, the latest work on 
SolrCloud and distributed indexing is now in trunk.

SolrCloud is our name for a new set of distributed capabilities that improve 
upon the old style distributed search and index based replication.

It provides for high availability and fault tolerance while allowing for near 
realtime search and an interface that matches what you are used to with 
previous versions of Solr.

We are looking to release this in the next 4.0 release, and any feedback early 
users can provide will be very useful. So if you have an interest in these 
types of features, please take the latest trunk build for a spin and provide 
some feedback. 

There is still a lot more planned, so feel free to chime in on what you would 
like to see - this is essentially the end of stage one. 

You can read more about what we have done on the wiki: 
http://wiki.apache.org/solr/SolrCloud

Also, a couple blog posts I recently saw pop up:

http://blog.sematext.com/2012/02/01/solrcloud-distributed-realtime-search
http://outerthought.org/blog/491-ot.html

I'll contribute my own blog post as well when I get a chance, but there should 
be a fair amount of info there to get you started if you are interested. 

Thanks,

- Mark Miller
lucidimagination.com













Index Start Question

2012-02-08 Thread Hoffman, Chase
Please forgive me if this is a dumb question.  I've never dealt with SOLR 
before, and I'm being asked to determine from the logs when a SOLR index is 
kicked off (it is a Windows server).  The TOMCAT service runs continually, so 
no love there.  In parsing the logs, I think 
"org.apache.solr.core.SolrResourceLoader " is the indicator, since 
"org.apache.solr.core.SolrCore execute" seems to occur even when I know an 
index has not been started.

Any advice you could give me would be wonderful.

Best,

--Chase

Chase Hoffman
Infrastructure Systems Administrator, Performance Technologies
The Advisory Board Company
512-681-2190 direct | 512-609-1150 fax
hoffm...@advisory.com | 
www.advisory.com



solr/tomcat performance.

2012-02-08 Thread adm1n
Hi,

I'm running solr+tomcat with the following configuration:
I have 16 slaves, which are queried by an aggregator, while the aggregator
is queried by the users.
My slaveUrls variable in solr.xml (on the aggregator) looks like - ''
I'm running it on a linux machine (not dedicated, there are some other
'heavy' processes) with 16 quad CPUs and 66GB RAM.

I ran some tests and saw that when I sent 400 concurrent requests to the
aggregator, the host stopped responding until I restarted Tomcat. I tried
to 'play' with the Tomcat/Java configuration a little, but it didn't help
much, and the main issues were memory usage and timeouts. Currently I'm
using the following settings:
Java:
-Xms256m -Xmx8192m
I tried to tweak the -XX:MinHeapFreeRatio setting, but from what I could see
no memory was returned to the OS.
Tomcat:
[connector configuration stripped from the archive]
Assuming I'll have ~1000 requests/second hitting the aggregator, across how
many aggregators should I balance the load? Or maybe I can achieve better
performance just by tweaking the current system?
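For reference, the connector settings in question are generically along these
lines (values illustrative only, not a recommendation):

<Connector port="8080" protocol="HTTP/1.1"
           maxThreads="500" acceptCount="200"
           connectionTimeout="20000" />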


Any help/advice will be appreciated,
Thanks!

--
View this message in context: 
http://lucene.472066.n3.nabble.com/solr-tomcat-performance-tp3727199p3727199.html
Sent from the Solr - User mailing list archive at Nabble.com.


Thank you all

2012-02-08 Thread Tim Hibbs
All,

It appears my attempt at using solr for the application I support is
about to fail. I'm personally and professionally disappointed, but I
wanted to say "Many Thanks" to those of you who have provided so much
help to so many on this list. In the right hands and in the right
environments, it has so much potential. You all have shown the
collective knowledge and cooperation it takes to bring that potential to
fruition.

I wish I'd been able to pick up on the right details of the toolset to
be able to make this work.

Best of luck to you all!

Tim Hibbs

On 2/7/2012 2:53 PM, Tim Hibbs wrote:
> Hi, all...
> 
> I have a small problem retrieving the full set of query responses I need
> and would appreciate any help.
> 
> I have a query string as follows:
> 
> +((Title:"sales") (+Title:sales) (TOC:"sales") (+TOC:sales)
> (Keywords:"sales") (+Keywords:sales) (text:"sales") (+text:sales)
> (sales)) +(RepType:"WRO Revenue Services") +(ContentType:SOP
> ContentType:"Key Concept") -(Topics:Backup)
> 
> The query is intended to be:
> 
> "MUST" have at least one of:
> - exact phrase in field Title
> - all of the phrase words in field Title
> - exact phrase in field TOC
> - all of the phrase words in field TOC
> - exact phrase in field Keywords
> - all of the phrase words in field Keywords
> - exact phrase in field text
> - all of the phrase words in field text
> - any of the phrase words in field text
> 
> "MUST" have "WRO Revenue Services" in field RepType
> "MUST" have at least one of:
> - "SOP" in field ContentType
> - "Key Concept" in field ContentType
> "MUST NOT" have "Backup" in field Topics
> 
> It's almost working, but it misses a couple of items that contain a
> single occurrence of the word "sale" in an indexed field. The indexed
> field containing that single occurrence is named "UrlContent".
> 
> schema.xml
> 
> UrlContent is defined as:
> <field name="UrlContent" ... required="false" omitNorms="false"/>
> 
> Copyfields are as follows:
> [copyField declarations stripped from the archive]
> 
> Thanks,
> Tim Hibbs


Re: Using UUID for uniqueId

2012-02-08 Thread François Schiettecatte
Anderson

I would say that this is highly unlikely, but you would need to pay attention 
to how they are generated, this would be a good place to start:

http://en.wikipedia.org/wiki/Universally_unique_identifier
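For what it's worth, the JDK's UUIDs are version 4 (random) and come from a
cryptographically strong source, so generating them independently on several
machines is safe in practice. A minimal sketch:

import java.util.UUID;

public class IdGen {
    public static void main(String[] args) {
        // Version 4 (random) UUID; no cross-machine coordination needed.
        String id = UUID.randomUUID().toString();
        System.out.println(id); // e.g. "f47ac10b-58cc-4372-a567-0e02b2c3d479"
    }
}

Solr also ships a UUID field type (solr.UUIDField) that can generate these
server-side.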

Cheers

François

On Feb 8, 2012, at 1:31 PM, Anderson vasconcelos wrote:

> HI all
> 
> If i use the UUID like a uniqueId in the future if i break my index in
> shards, i will have problems? The UUID generation could generate the same
> UUID in differents machines?
> 
> Thanks



Re: How to reindex about 10Mio. docs

2012-02-08 Thread Otis Gospodnetic
Vadim,

Would using xslt output help?

Otis 

Performance Monitoring SaaS for Solr - 
http://sematext.com/spm/solr-performance-monitoring/index.html



>
> From: Vadim Kisselmann 
>To: solr-user@lucene.apache.org 
>Sent: Wednesday, February 8, 2012 7:09 AM
>Subject: Re: How to reindex about 10Mio. docs
> 
>Another problem appeared ;)
>how can i export my docs in csv-format?
>In Solr 3.1+ i can use the query-param &wt=csv, but in Solr 1.4.1?
>Best Regards
>Vadim
>
>
>2012/2/8 Vadim Kisselmann :
>> Hi Ahmet,
>> thanks for quick response:)
>> I've already thought the same...
>> And it will be a pain to export and import this huge doc-set as CSV.
>> Do I have another solution?
>> Regards
>> Vadim
>>
>>
>> 2012/2/8 Ahmet Arslan :
 i want to reindex about 10Mio. Docs. from one Solr(1.4.1) to
 another
 Solr(1.4.1).
 I changed my schema.xml (field types sint to slong),
 standard
 replication would fail.
 what is the fastest and smartest way to manage this?
 this here sound great (EntityProcessor):
 http://www.searchworkings.org/blog/-/blogs/importing-data-from-another-solr
 But would it work with Solr 1.4.1?
>>>
>>> SolrEntityProcessor is not available in 1.4.1. I would dump stored fields 
>>> into comma separated file, and use http://wiki.apache.org/solr/UpdateCSV to 
>>> feed into new solr instance.
>
>
>
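For the reload side, the UpdateCSV recipe is roughly (file name illustrative):

curl 'http://localhost:8983/solr/update/csv?commit=true' --data-binary @dump.csv -H 'Content-type: text/plain; charset=utf-8'

The export side is the awkward part on 1.4.1, since wt=csv only arrived in
3.1 - hence the XSLT (wt=xslt&tr=...) suggestion above.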

Re: solr cloud concepts

2012-02-08 Thread Bruno Dumon
Hi Adeel,

I just started looking into SolrCloud and had some of the same questions.

I wrote a blog with the understanding I gained so far, maybe it will help
you:

http://outerthought.org/blog/491-ot.html

Regards,

Bruno.

On Wed, Feb 8, 2012 at 4:31 PM, Adeel Qureshi wrote:

> I have been using solr for a while and have recently started getting into
> solrcloud .. i am a bit confused with some of the concepts ..
>
> 1. what exactly is the relationship between a collection and the core ..
> can a core has multiple collections in it .. in this case all collections
> within this core will have the same schema .. and i am assuming all
> instances of collections within the core can be deployed on different solr
> nodes to achieve distributed search ..
> or is it the other way around where a collection can have multiple cores
>
> 2. at some places it has been pointed out that solrcloud doesnt actually
> supports replication .. but in the solrcloud wiki the second example is
> supposed to be for replication .. so does solrcloud at this point supports
> automatic replication where as you add more servers it automatically uses
> the additional servers as replicas
>
> I have a few more questions but I wanted to get these basic ones out of the
> way first .. I would appreciate any response.
>
> Thanks
> Adeel
>



-- 
Bruno Dumon
Outerthought
http://outerthought.org/


Re: Sorting solrdocumentlist object after querying

2012-02-08 Thread Ahmet Arslan
> I want to sort a SolrDocumentList after it has been queried
> and obtained
> from the QueryResponse.getResults(). The reason is i have a
> SolrDocumentList
> obtained after querying using QueryResponse.getResults() and
> i have added
> few docs to it. Now i want to sort this SolrDocumentList
> based on the same
> fields i did the querying before i modified this
> SolrDocumentList.

QueryResponse.getResults() will return at most "rows" documents. Can't you
sort them (plus your injected documents) on your own?


Re: solr cloud concepts

2012-02-08 Thread Mark Miller

On Feb 8, 2012, at 10:31 AM, Adeel Qureshi wrote:

> I have been using solr for a while and have recently started getting into
> solrcloud .. i am a bit confused with some of the concepts ..
> 
> 1. what exactly is the relationship between a collection and the core ..
> can a core has multiple collections in it .. in this case all collections
> within this core will have the same schema .. and i am assuming all
> instances of collections within the core can be deployed on different solr
> nodes to achieve distributed search ..
> or is it the other way around where a collection can have multiple cores

Currently, a core basically equals a replica of the index.

So you might have a collection called collection1 - lets say it's 2 shards and 
each shard has a single replica:

Collection1
shard1 replica1
shard1 replica2
shard2 replica1
shard2 replica2

Each of those replicas is a core. So a collection has multiple cores basically. 
Also, each of those cores can be on a different machine. So yes, you have 
distributed indexing and distributed search.

> 
> 2. at some places it has been pointed out that solrcloud doesnt actually
> supports replication .. but in the solrcloud wiki the second example is
> supposed to be for replication .. so does solrcloud at this point supports
> automatic replication where as you add more servers it automatically uses
> the additional servers as replicas

SolrCloud doesn't support the old style Solr replication concept. It does 
however, handle replication - it's just all pretty much automatic and behind 
the scenes - eg all the information about Solr replication in the wiki 
documentation for previous versions of Solr is really not applicable. We now 
achieve replica copies by sending documents to each shard one document at a 
time so that we can support near realtime search. The old style replication is 
only used in recovery, or when you start a new replica machine and it has to 
'catchup' to the other replicas.

> 
> I have a few more questions but I wanted to get these basic ones out of the
> way first .. I would appreciate any response.

Fire away.

> 
> Thanks
> Adeel

- Mark Miller
lucidimagination.com













Re: Wildcard ? issue?

2012-02-08 Thread Ahmet Arslan
> I have already tried this and it did
> not help because it does not 
> highlight matches if wild-card is used. The field
> configuration turns 
> data to:

This writeup should explain your scenario :
http://wiki.apache.org/solr/MultitermQueryAnalysis


Re: struggling with solr.WordDelimiterFilterFactory and periods "." or dots

2012-02-08 Thread geeky2
hello,

thanks for sticking with me on this ...very frustrating 

ok - i did perform the query with the debug parms using two scenarios:

1) a successful search (where i insert the period / dot) into the itemNo
field; the search returns a document.

itemNo:BP2.1UAA

http://hfsthssolr1.intra.searshc.com:8180/solrpartscat/core1/select/?q=itemNo%3ABP2.1UAA&version=2.2&start=0&rows=10&indent=on&debugQuery=on

results from debug





  status: 0
  QTime: 1
  params: indent=on, rows=10, version=2.2, debugQuery=on, start=0,
          q=itemNo:BP2.1UAA

  one document returned (markup stripped in the archive; field values only):
    PHILIPS
    0333500
    0333500,1549  ,BP2.1UAA
    PLASMA TELEVISION
    BP2.1UAA
    2
    BP2.1UAA
    Plasma Television^
    0
    1549

  querystring: itemNo:BP2.1UAA
  parsedquery: MultiPhraseQuery(itemNo:"bp 2 (1 21) (uaa bp21uaa)")
  parsedquery_toString: itemNo:"bp 2 (1 21) (uaa bp21uaa)"

  explain:
  22.539911 = (MATCH) weight(itemNo:"bp 2 (1 21) (uaa bp21uaa)" in 134993),
  product of:
    0.9994 = queryWeight(itemNo:"bp 2 (1 21) (uaa bp21uaa)"), product of:
      45.079826 = idf(itemNo: bp=829 2=29303 1=43943 21=6716 uaa=32 bp21uaa=1)
      0.02218287 = queryNorm
    22.539913 = (MATCH) fieldWeight(itemNo:"bp 2 (1 21) (uaa bp21uaa)" in
    134993), product of:
      1.0 = tf(phraseFreq=1.0)
      45.079826 = idf(itemNo: bp=829 2=29303 1=43943 21=6716 uaa=32 bp21uaa=1)
      0.5 = fieldNorm(field=itemNo, doc=134993)

  QParser: LuceneQParser
  [timing section: markup stripped, values unreadable - omitted]

2) a NON-successful search (where i do NOT insert a period / dot) into the
itemNo field; the search does NOT return a document

 itemNo:BP21UAA

http://hfsthssolr1.intra.searshc.com:8180/solrpartscat/core1/select/?q=itemNo%3ABP21UAA&version=2.2&start=0&rows=10&indent=on&debugQuery=on





  status: 0
  QTime: 1
  params: indent=on, rows=10, version=2.2, debugQuery=on, start=0,
          q=itemNo:BP21UAA

  no documents returned

  querystring: itemNo:BP21UAA
  parsedquery: MultiPhraseQuery(itemNo:"bp 21 (uaa bp21uaa)")
  parsedquery_toString: itemNo:"bp 21 (uaa bp21uaa)"
  QParser: LuceneQParser
  [timing section: markup stripped - omitted]

the parsedquery part of the debug output looks like it DOES contain the term
that i am entering for my search criteria on the itemNo field ??

does this make sense?
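side note: would adding preserveOriginal="1" to the WordDelimiterFilterFactory
help here, so that the unsplit token survives alongside the parts? something
like this (a sketch - guessing at the rest of the filter's attributes):

<filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="1" generateNumberParts="1"
        catenateWords="1" catenateNumbers="1" catenateAll="1"
        preserveOriginal="1"/>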

thank you,
mark



--
View this message in context: 
http://lucene.472066.n3.nabble.com/struggling-with-solr-WordDelimiterFilterFactory-and-periods-or-dots-tp3724822p3726614.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Which Tokeniser (and/or filter)

2012-02-08 Thread Robert Brown
Attempting to reproduce legacy behaviour (I know!) of simple SQL
substring searching, with and without phrases.

I feel simply NGram'ing 4m CVs may be pushing it?


---

IntelCompute
Web Design & Local Online Marketing

http://www.intelcompute.com


On Wed, 8 Feb 2012 11:27:24 -0500, Erick Erickson
 wrote:
> You'll probably have to index them in separate fields to
> get what you want. The question is always whether it's
> worth it, is the use-case really well served by having a
> variant that keeps dots and things? But that's always more
> a question for your product manager
> 
> Best
> Erick
> 
> On Wed, Feb 8, 2012 at 9:23 AM, Robert Brown  wrote:
>> Thanks Erick,
>>
>> I didn't get confused with multiple tokens vs multiValued  :)
>>
>> Before I go ahead and re-index 4m docs, and believe me I'm using the
>> analysis page like a mad-man!
>>
>> What do I need to configure to have the following both indexed with and
>> without the dots...
>>
>> .net
>> sales manager.
>> £12.50
>>
>> Currently...
>>
>> <filter class="solr.WordDelimiterFilterFactory"
>>        generateWordParts="1"
>>        generateNumberParts="1"
>>        catenateWords="1"
>>        catenateNumbers="1"
>>        catenateAll="1"
>>        splitOnCaseChange="1"
>>        splitOnNumerics="1"
>>        types="wdftypes.txt"
>> />
>>
>> with nothing specific in wdftypes.txt for full-stops.
>>
>> Should there also be any difference when quoting my searches?
>>
>> The analysis page seems to just drop the quotes, but surely actual
>> calls don't do this?
>>
>>
>>
>> ---
>>
>> IntelCompute
>> Web Design & Local Online Marketing
>>
>> http://www.intelcompute.com
>>
>>
>> On Wed, 8 Feb 2012 07:38:42 -0500, Erick Erickson
>>  wrote:
>>> Yes, WDDF creates multiple tokens. But that has
>>> nothing to do with the multiValued suggestion.
>>>
>>> You can get exactly what you want by
>>> 1> setting multiValued="true" in your schema file and re-indexing. Say
>>> positionIncrementGap is set to 100
>>> 2> When you index, add the field for each sentence, so your doc
>>>       looks something like:
>>>       <doc>
>>>         <field>i am a sales-manager in here</field>
>>>         <field>using asp.net and .net daily</field>
>>>         <field>...</field>
>>>       </doc>
>>> 3> search like "sales manager"~100
>>>
>>> Best
>>> Erick
>>>
>>> On Wed, Feb 8, 2012 at 3:05 AM, Rob Brown  wrote:
 Apologies if things were a little vague.

 Given the example snippet to index (numbered to show searches needed to
 match)...

 1: i am a sales-manager in here
 2: using asp.net and .net daily
 3: working in design.
 4: using something called sage 200. and i'm fluent
 5: german sausages.
 6: busy A&E dept earning £10,000 annually


 ... all with newlines in place.

 able to match...

 1. sales
 1. "sales manager"
 1. sales-manager
 1. "sales-manager"
 2. .net
 2. asp.net
 3. design
 4. sage 200
 6. A&E
 6. £10,000

 But do NOT match "fluent german" from 4 + 5 since there's a newline
 between them when indexed, but not when searched.


 Do the filters (wdf in this case) not create multiple tokens, so if
 splitting on period in "asp.net" would create tokens for all of "asp",
 "asp.", "asp.net", ".net", "net".


 Cheers,
 Rob

 --

 IntelCompute
 Web Design and Online Marketing

 http://www.intelcompute.com


 -Original Message-
 From: Chris Hostetter 
 Reply-to: solr-user@lucene.apache.org
 To: solr-user@lucene.apache.org
 Subject: Re: Which Tokeniser (and/or filter)
 Date: Tue, 7 Feb 2012 15:02:36 -0800 (PST)

 : This all seems a bit too much work for such a real-world scenario?

 You haven't really told us what your scenario is.

 You said you want to split tokens on whitespace, full-stop (aka:
 period) and comma only, but then in response to some suggestions you added
 comments on other things that you never mentioned previously...

 1) evidently you don't want the "." in foo.net to cause a split in tokens?
 2) evidently you not only want token splits on newlines, but also
 position gaps to prevent phrases matching across newlines.

 ...these are kind of important details that affect suggestions people
 might give you.

 can you please provide some concrete examples of the types of data you
 have, the types of queries you want them to match, and the types of
 queries you *don't* want to match?


 -Hoss

>>



Re: Which Tokeniser (and/or filter)

2012-02-08 Thread Erick Erickson
You'll probably have to index them in separate fields to
get what you want. The question is always whether it's
worth it, is the use-case really well served by having a
variant that keeps dots and things? But that's always more
a question for your product manager
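Roughly, the two-field setup would be (a sketch - the names are made up, and
double-check the wdftypes.txt syntax against the example file that ships with
Solr):

<field name="body"         type="text_split"    indexed="true" stored="true"/>
<field name="body_literal" type="text_keepdots" indexed="true" stored="false"/>
<copyField source="body" dest="body_literal"/>

where text_keepdots uses a WordDelimiterFilterFactory whose types file maps
the dot to ALPHA so ".net" comes through intact:

\u002E => ALPHA

and then you query both, e.g. dismax with qf=body body_literal.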

Best
Erick

On Wed, Feb 8, 2012 at 9:23 AM, Robert Brown  wrote:
> Thanks Erick,
>
> I didn't get confused with multiple tokens vs multiValued  :)
>
> Before I go ahead and re-index 4m docs, and believe me I'm using the
> analysis page like a mad-man!
>
> What do I need to configure to have the following both indexed with and
> without the dots...
>
> .net
> sales manager.
> £12.50
>
> Currently...
>
> <filter class="solr.WordDelimiterFilterFactory"
>        generateWordParts="1"
>        generateNumberParts="1"
>        catenateWords="1"
>        catenateNumbers="1"
>        catenateAll="1"
>        splitOnCaseChange="1"
>        splitOnNumerics="1"
>        types="wdftypes.txt"
> />
>
> with nothing specific in wdftypes.txt for full-stops.
>
> Should there also be any difference when quoting my searches?
>
> The analysis page seems to just drop the quotes, but surely actual
> calls don't do this?
>
>
>
> ---
>
> IntelCompute
> Web Design & Local Online Marketing
>
> http://www.intelcompute.com
>
>
> On Wed, 8 Feb 2012 07:38:42 -0500, Erick Erickson
>  wrote:
>> Yes, WDDF creates multiple tokens. But that has
>> nothing to do with the multiValued suggestion.
>>
>> You can get exactly what you want by
>> 1> setting multiValued="true" in your schema file and re-indexing. Say
>> positionIncrementGap is set to 100
>> 2> When you index, add the field for each sentence, so your doc
>>       looks something like:
>>      
>>         i am a sales-manager in here
>>        using asp.net and .net daily
>>          .
>>       
>> 3> search like "sales manager"~100
>>
>> Best
>> Erick
>>
>> On Wed, Feb 8, 2012 at 3:05 AM, Rob Brown  wrote:
>>> Apologies if things were a little vague.
>>>
>>> Given the example snippet to index (numbered to show searches needed to
>>> match)...
>>>
>>> 1: i am a sales-manager in here
>>> 2: using asp.net and .net daily
>>> 3: working in design.
>>> 4: using something called sage 200. and i'm fluent
>>> 5: german sausages.
>>> 6: busy A&E dept earning £10,000 annually
>>>
>>>
>>> ... all with newlines in place.
>>>
>>> able to match...
>>>
>>> 1. sales
>>> 1. "sales manager"
>>> 1. sales-manager
>>> 1. "sales-manager"
>>> 2. .net
>>> 2. asp.net
>>> 3. design
>>> 4. sage 200
>>> 6. A&E
>>> 6. £10,000
>>>
>>> But do NOT match "fluent german" from 4 + 5 since there's a newline
>>> between them when indexed, but not when searched.
>>>
>>>
>>> Do the filters (wdf in this case) not create multiple tokens, so if
>>> splitting on period in "asp.net" would create tokens for all of "asp",
>>> "asp.", "asp.net", ".net", "net".
>>>
>>>
>>> Cheers,
>>> Rob
>>>
>>> --
>>>
>>> IntelCompute
>>> Web Design and Online Marketing
>>>
>>> http://www.intelcompute.com
>>>
>>>
>>> -Original Message-
>>> From: Chris Hostetter 
>>> Reply-to: solr-user@lucene.apache.org
>>> To: solr-user@lucene.apache.org
>>> Subject: Re: Which Tokeniser (and/or filter)
>>> Date: Tue, 7 Feb 2012 15:02:36 -0800 (PST)
>>>
>>> : This all seems a bit too much work for such a real-world scenario?
>>>
>>> You haven't really told us what your scenario is.
>>>
>>> You said you want to split tokens on whitespace, full-stop (aka:
>>> period) and comma only, but then in response to some suggestions you added
>>> comments about other things that you never mentioned previously...
>>>
>>> 1) evidently you don't want the "." in foo.net to cause a split in tokens?
>>> 2) evidently you not only want token splits on newlines, but also
>>> position gaps to prevent phrases matching across newlines.
>>>
>>> ...these are kind of important details that affect suggestions people
>>> might give you.
>>>
>>> can you please provide some concrete examples of the types of data you
>>> have, the types of queries you want them to match, and the types of
>>> queries you *don't* want to match?
>>>
>>>
>>> -Hoss
>>>
>


Re: struggling with solr.WordDelimiterFilterFactory and periods "." or dots

2012-02-08 Thread Erick Erickson
Hmmm, that all looks correct, from the output you pasted I'd expect
you to be finding the doc.

So next thing: add &debugQuery=on to your query and look at
the debug information after the list of documents, particularly
the "parsedQuery" bit. Are you searching against the fields you
think you are? If you don't specify a field, Solr uses the default
defined in schema.xml.
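
For example, something like this (URL is illustrative, assuming the default
/select handler and an itemNo field):

    http://localhost:8983/solr/select?q=itemNo:BP21UAA&debugQuery=on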

Next, look at your actual index using either Luke or the TermsComponent
to see what's actually *in* your index rather than what you *think* is. I
can't tell you how many times I've made the wrong assumptions.
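
For the TermsComponent route, if the /terms handler from the example
solrconfig.xml is enabled, a quick sanity check could look like this
(field and prefix are just examples):

    http://localhost:8983/solr/terms?terms.fl=itemNo&terms.prefix=BP21

which lists the indexed terms in itemNo starting with BP21.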

My guess would be that you aren't searching the fields you think you are...

Best
Erick

On Wed, Feb 8, 2012 at 9:06 AM, geeky2  wrote:
> hello,
>
> thank you for the reply.
>
> yes - i did re-index after the changes to the schema.
>
> also - thank you for the direction on using the analyzer - but i am not sure
> if i am interpreting the feedback from the analyzer correctly.
>
> here is what i did:
>
> in the Field value (Index) box - i placed this: BP2.1UAA
>
> in the Field value (Query) box - i placed this: BP21UAA
>
> then after hitting the Analyze button - i see the following:
>
> Under Index Analyzer for:
>
> org.apache.solr.analysis.WordDelimiterFilterFactory {splitOnCaseChange=1,
> generateNumberParts=1, catenateWords=1, luceneMatchVersion=LUCENE_33,
> generateWordParts=1, catenateAll=1, catenateNumbers=1}
>
> i see
>
> position        1       2       3       4
> term text       BP      2       1       UAA
> 21      BP21UAA
>
> Under Query Analyzer for:
>
> org.apache.solr.analysis.WordDelimiterFilterFactory {splitOnCaseChange=1,
> generateNumberParts=1, catenateWords=1, luceneMatchVersion=LUCENE_33,
> generateWordParts=1, catenateAll=1, catenateNumbers=1}
>
> i see
>
> position        1       2       3
> term text       BP      21      UAA
> BP21UAA
>
> the above information leads me to believe that i "should" have BP21UAA as an
> indexed term generated from the BP2.1UAA value coming from the database.
>
> also - the query analysis lead me to believe that i "should" find a document
> when i search on BP21UAA in the itemNo field
>
> do i have this correct?
>
> am i missing something here?
>
> i am still unable to get a hit when i search on BP21UAA in the itemNo field.
>
> thank you,
> mark
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/struggling-with-solr-WordDelimiterFilterFactory-and-periods-or-dots-tp3724822p3726021.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: Wildcard ? issue?

2012-02-08 Thread Dalius Sidlauskas
I have already tried this and it did not help, because it does not
highlight matches if a wild-card is used. The field configuration turns
the data into:


dc_title: calligraf
dc_title_unicode: cal·lígraf
dc_title_unicode_full: cal·lígraf

The debug parsedQuery says:

[Search for *cal·ligraf*]

+DisjunctionMaxQuery((dc_title:*calligraf* |  
dc_title_unicode:cal·ligraf^2.0 | dc_title_unicode_full:cal·ligraf^2.0))


[Search for *cal·ligra?*]

+DisjunctionMaxQuery((dc_title:*cal·ligra?* | 
dc_title_unicode:cal·ligra?^2.0 | dc_title_unicode_full:cal·ligra?^2.0))


Why is the *dc_title* field handled differently? The analysis looks fine:


 Index Analyzer


   org.apache.solr.analysis.HTMLStripCharFilterFactory
   {luceneMatchVersion=LUCENE_34}

textcal·lígraf


   org.apache.solr.analysis.PatternReplaceCharFilterFactory
   {replacement=, pattern=-, maxBlockChars=1,
   luceneMatchVersion=LUCENE_34, blockDelimiters=}

textcal·lígraf


   org.apache.solr.analysis.WhitespaceTokenizerFactory
   {luceneMatchVersion=LUCENE_34}

position1
term text   cal·lígraf
startOffset 43
endOffset   53


   org.apache.solr.analysis.ICUFoldingFilterFactory
   {luceneMatchVersion=LUCENE_34}

position1
term text   calligraf
startOffset 43
endOffset   53


 Query Analyzer


   org.apache.solr.analysis.WhitespaceTokenizerFactory
   {luceneMatchVersion=LUCENE_34}

position1
term text   cal·ligra?
startOffset 0
endOffset   10


   org.apache.solr.analysis.ICUFoldingFilterFactory
   {luceneMatchVersion=LUCENE_34}

position1
term text   calligra?
startOffset 0
endOffset   10


Is this a Solr or Lucene bug?

Regards!
Dalius Sidlauskas


On 08/02/12 16:03, Sethi, Parampreet wrote:

Hi Dalius,

If not already tried, Check http://localhost:8983/solr/admin/analysis.jsp
(enable verbose output for both Field Value index and query for details)
for your queries and see what all filters/tokenizers are being applied.

Hope it helps!

-param

On 2/8/12 10:48 AM, "Dalius Sidlauskas"
wrote:


If you can not read this mail easily check this ticket:
https://issues.apache.org/jira/browse/SOLR-3106 This is a copy.

Regards!
Dalius Sidlauskas


On 08/02/12 15:44, Dalius Sidlauskas wrote:

Sorry for the inaccurate title.

I have 3 fields (dc_title, dc_title_unicode, dc_unicode_full)
containing the same value:

http://www.tei-c.org/ns/1.0";>cal.lígraf

and these fields are configured accordingly:

































And finally my search configuration:



all
edismax
2<-25%
dc_title_unicode_full^2 dc_title_unicode^2 dc_title
10
true
false
1


spellcheck



I am trying to match the field with various search phrases (all of which
are valid). Here are the results:


# search phrase match? Comment
1 cal.lígra? yes
2 cal.ligra? no Changed í to i
3 cal.ligraf yes
4 calligra? no


The problem is attempt #2, which fails to match the data. Attempt #3 works,
replacing ? with f.

One more thing. If * is used instead of ?, other data is matched, such as
cal.lígrafia, but not cal.lígraf...

Also I have spotted some logic mismatch in the debug parsedQuery field:

cal·lígraf: +DisjunctionMaxQuery((dc_title:*calligraf*^2.0 |
dc_title_unicode:cal·lígraf^3.0 | dc_title_unicode_full:cal·lígraf^3.0))
cal·lígra?: +DisjunctionMaxQuery((dc_title:*cal·lígra?*^2.0 |
dc_title_unicode:cal·lígra?^3.0 | dc_title_unicode_full:cal·lígra?^3.0))

Should the second be "*calligra?*" instead?

Environment:
Tomcat 7.0.25 (request encoding UTF-8)
Solr 3.5.0
Java 7 Oracle
Ubuntu 11.10



Re: Wildcard ? issue?

2012-02-08 Thread Sethi, Parampreet
Hi Dalius,

If not already tried, Check http://localhost:8983/solr/admin/analysis.jsp
(enable verbose output for both Field Value index and query for details)
for your queries and see what all filters/tokenizers are being applied.

Hope it helps!

-param

On 2/8/12 10:48 AM, "Dalius Sidlauskas" 
wrote:

>If you can not read this mail easily check this ticket:
>https://issues.apache.org/jira/browse/SOLR-3106 This is a copy.
>
>Regards!
>Dalius Sidlauskas
>
>
>On 08/02/12 15:44, Dalius Sidlauskas wrote:
>> Sorry for the inaccurate title.
>>
>> I have 3 fields (dc_title, dc_title_unicode, dc_unicode_full)
>> containing the same value:
>>
>> http://www.tei-c.org/ns/1.0";>cal.lígraf
>>
>> and these fields are configured accordingly:
>>
>> > positionIncrementGap="100">
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>>
>> > positionIncrementGap="100">
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>>
>> > positionIncrementGap="100">
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>>
>> And finally my search configuration:
>>
>> 
>> 
>> all
>> edismax
>> 2<-25%
>> dc_title_unicode_full^2 dc_title_unicode^2 dc_title
>> 10
>> true
>> false
>> 1
>> 
>> 
>> spellcheck
>> 
>> 
>>
>> I am trying to match the field with various search phrases (all of which
>> are valid). Here are the results:
>>
>>
>> # search phrase match? Comment
>> 1 cal.lígra? yes
>> 2 cal.ligra? no Changed í to i
>> 3 cal.ligraf yes
>> 4 calligra? no
>>
>>
>> The problem is attempt #2, which fails to match the data. Attempt #3 works,
>> replacing ? with f.
>>
>> One more thing. If * is used instead of ?, other data is matched, such as
>> cal.lígrafia, but not cal.lígraf...
>>
>> Also I have spotted some logic mismatch in the debug parsedQuery field:
>>
>> cal·lígraf: +DisjunctionMaxQuery((dc_title:*calligraf*^2.0 |
>> dc_title_unicode:cal·lígraf^3.0 | dc_title_unicode_full:cal·lígraf^3.0))
>> cal·lígra?: +DisjunctionMaxQuery((dc_title:*cal·lígra?*^2.0 |
>> dc_title_unicode:cal·lígra?^3.0 | dc_title_unicode_full:cal·lígra?^3.0))
>>
>> Should the second be "*calligra?*" instead?
>>
>> Environment:
>> Tomcat 7.0.25 (request encoding UTF-8)
>> Solr 3.5.0
>> Java 7 Oracle
>> Ubuntu 11.10
>>



Re: Wildcard ? issue?

2012-02-08 Thread Dalius Sidlauskas
If you can not read this mail easily check this ticket: 
https://issues.apache.org/jira/browse/SOLR-3106 This is a copy.


Regards!
Dalius Sidlauskas


On 08/02/12 15:44, Dalius Sidlauskas wrote:

Sorry for the inaccurate title.

I have 3 fields (dc_title, dc_title_unicode, dc_unicode_full)
containing the same value:


http://www.tei-c.org/ns/1.0";>cal.lígraf

and these fields are configured accordingly:

positionIncrementGap="100">












positionIncrementGap="100">










positionIncrementGap="100">










And finally my search configuration:



all
edismax
2<-25%
dc_title_unicode_full^2 dc_title_unicode^2 dc_title
10
true
false
1


spellcheck



I am trying to match the field with various search phrases (all of which
are valid). Here are the results:



# search phrase match? Comment
1 cal.lígra? yes
2 cal.ligra? no Changed í to i
3 cal.ligraf yes
4 calligra? no


The problem is attempt #2, which fails to match the data. Attempt #3 works,
replacing ? with f.


One more thing. If * is used instead of ?, other data is matched, such as
cal.lígrafia, but not cal.lígraf...


Also I have spotted some logic mismatch in the debug parsedQuery field:

cal·lígraf: +DisjunctionMaxQuery((dc_title:*calligraf*^2.0 |
dc_title_unicode:cal·lígraf^3.0 | dc_title_unicode_full:cal·lígraf^3.0))
cal·lígra?: +DisjunctionMaxQuery((dc_title:*cal·lígra?*^2.0 |
dc_title_unicode:cal·lígra?^3.0 | dc_title_unicode_full:cal·lígra?^3.0))

Should the second be "*calligra?*" instead?

Environment:
Tomcat 7.0.25 (request encoding UTF-8)
Solr 3.5.0
Java 7 Oracle
Ubuntu 11.10



Wildcard ? issue?

2012-02-08 Thread Dalius Sidlauskas

Sorry for the inaccurate title.

I have 3 fields (dc_title, dc_title_unicode, dc_unicode_full)
containing the same value:


http://www.tei-c.org/ns/1.0";>cal.lígraf

and these fields are configured accordingly:


  



  
  


  



  


  
  

  



  


  
  

  


And finally my search configuration:


 
   all
   edismax
   2<-25%
   dc_title_unicode_full^2 dc_title_unicode^2 
dc_title
   10
   true
   false
   1
 

  spellcheck



I am trying to match the field with various search phrases (all of which
are valid). Here are the results:



#   search phrase   match?  Comment
1   cal.lígra?  yes 
2   cal.ligra?  no  Changed í to i
3   cal.ligraf  yes 
4   calligra?   no  


The problem is attempt #2, which fails to match the data. Attempt #3 works,
replacing ? with f.


One more thing. If * is used instead of ?, other data is matched, such as
cal.lígrafia, but not cal.lígraf...


Also I have spotted some logic mismatch in the debug parsedQuery field:

cal·lígraf: +DisjunctionMaxQuery((dc_title:*calligraf*^2.0 |
dc_title_unicode:cal·lígraf^3.0 | dc_title_unicode_full:cal·lígraf^3.0))
cal·lígra?: +DisjunctionMaxQuery((dc_title:*cal·lígra?*^2.0 |
dc_title_unicode:cal·lígra?^3.0 | dc_title_unicode_full:cal·lígra?^3.0))

Should the second be "*calligra?*" instead?

Environment:
Tomcat 7.0.25 (request encoding UTF-8)
Solr 3.5.0
Java 7 Oracle
Ubuntu 11.10

--
Regards!
Dalius Sidlauskas



Sorting solrdocumentlist object after querying

2012-02-08 Thread Kashif Khan
Hi all,

I want to sort a SolrDocumentList after it has been queried and obtained
from QueryResponse.getResults(). The reason is I have a SolrDocumentList
obtained after querying using QueryResponse.getResults() and I have added a
few docs to it. Now I want to sort this SolrDocumentList on the same fields
the query was sorted by before I modified this SolrDocumentList.

Please advise on any alternatives if this is not possible; sample code will
be appreciated a lot. It is urgent.
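
A minimal untested SolrJ sketch of one way to do this client-side:
SolrDocumentList extends ArrayList<SolrDocument>, so java.util.Collections.sort
can be used. "itemNo" below is a made-up field name and null values are not
handled:

    import java.util.Collections;
    import java.util.Comparator;
    import org.apache.solr.common.SolrDocument;
    import org.apache.solr.common.SolrDocumentList;

    public class DocSorter {
        // re-sort docs (e.g. queryResponse.getResults() plus the added docs)
        // on a single field, ascending
        public static void sortByField(SolrDocumentList docs, final String field) {
            Collections.sort(docs, new Comparator<SolrDocument>() {
                @SuppressWarnings("unchecked")
                public int compare(SolrDocument a, SolrDocument b) {
                    Comparable<Object> va = (Comparable<Object>) a.getFieldValue(field);
                    Comparable<Object> vb = (Comparable<Object>) b.getFieldValue(field);
                    return va.compareTo(vb); // negate for descending
                }
            });
        }
    }

Called as DocSorter.sortByField(queryResponse.getResults(), "itemNo") after
adding the extra docs; for multiple sort fields, chain the comparisons the
same way.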



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Sorting-solrdocumentlist-object-after-querying-tp3726303p3726303.html
Sent from the Solr - User mailing list archive at Nabble.com.


How to identify the field with highest score in dismax

2012-02-08 Thread crisfromnova
Hi,

According to the Solr documentation, the dismax score is calculated with the
formula:
(score of the matching clause with the highest score) + ( (tie parameter) *
(sum of the scores of any other matching clauses) ).
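
For example, with made-up numbers: if tie is 0.1, the Name clause scores 5.0
and the Details clause scores 1.2, the final score would be
5.0 + 0.1 * 1.2 = 5.12.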

Is there a way to identify the field on which the matching clause score is
the highest?

For example, suppose that I have the following document:


  Ford Mustang Coupe Cabrio
  Ford Mustang is a great car


and the following dismax query :

defType=dismax&qf=Name^10+Details^1&q="Ford+Mustang"+Ford+Mustang

and receive the document with score 5.6.
Is there a way to find out if the score is for the match on the Name field or
for the match on the Details field?

Thanks in advance!



--
View this message in context: 
http://lucene.472066.n3.nabble.com/How-to-identify-the-field-with-highest-score-in-dismax-tp3726297p3726297.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: is there any practice to load index into RAM to accelerate solr performance?

2012-02-08 Thread Ted Dunning
Add this as well:

http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.155.5030

On Wed, Feb 8, 2012 at 1:56 AM, Andrzej Bialecki  wrote:

> On 08/02/2012 09:17, Ted Dunning wrote:
>
>> This is true with Lucene as it stands.  It would be much faster if there
>> were a specialized in-memory index such as is typically used with high
>> performance search engines.
>>
>
> This could be implemented in Lucene trunk as a Codec. The challenge though
> is to come up with the right data structures.
>
> There has been some interesting research on optimizations for in-memory
> inverted indexes, but it usually involves changing the query evaluation
> algos as well - for reference:
>
> http://digbib.ubka.uni-karlsruhe.de/volltexte/documents/1202502
> http://www.siam.org/proceedings/alenex/2008/alx08_01transierf.pdf
> http://research.google.com/pubs/archive/37365.pdf
>
> --
> Best regards,
> Andrzej Bialecki <><
>  ___. ___ ___ ___ _ _   __
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>


Re: Which Tokeniser (and/or filter)

2012-02-08 Thread Robert Brown
Thanks Erick,

I didn't get confused with multiple tokens vs multiValued  :)

Before I go ahead and re-index 4m docs, and believe me I'm using the
analysis page like a mad-man!

What do I need to configure to have the following both indexed with and
without the dots...

.net
sales manager.
£12.50

Currently...

<filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="1"
        generateNumberParts="1"
        catenateWords="1"
        catenateNumbers="1"
        catenateAll="1"
        splitOnCaseChange="1"
        splitOnNumerics="1"
        types="wdftypes.txt"
/>

with nothing specific in wdftypes.txt for full-stops.

Should there also be any difference when quoting my searches?

The analysis page seems to just drop the quotes, but surely actual
calls don't do this?



---

IntelCompute
Web Design & Local Online Marketing

http://www.intelcompute.com


On Wed, 8 Feb 2012 07:38:42 -0500, Erick Erickson
 wrote:
> Yes, WDDF creates multiple tokens. But that has
> nothing to do with the multiValued suggestion.
> 
> You can get exactly what you want by
> 1> setting multiValued="true" in your schema file and re-indexing. Say
> positionIncrementGap is set to 100
> 2> When you index, add the field for each sentence, so your doc
>   looks something like:
>  
> i am a sales-manager in here
>using asp.net and .net daily
>  .
>   
> 3> search like "sales manager"~100
> 
> Best
> Erick
> 
> On Wed, Feb 8, 2012 at 3:05 AM, Rob Brown  wrote:
>> Apologies if things were a little vague.
>>
>> Given the example snippet to index (numbered to show searches needed to
>> match)...
>>
>> 1: i am a sales-manager in here
>> 2: using asp.net and .net daily
>> 3: working in design.
>> 4: using something called sage 200. and i'm fluent
>> 5: german sausages.
>> 6: busy A&E dept earning £10,000 annually
>>
>>
>> ... all with newlines in place.
>>
>> able to match...
>>
>> 1. sales
>> 1. "sales manager"
>> 1. sales-manager
>> 1. "sales-manager"
>> 2. .net
>> 2. asp.net
>> 3. design
>> 4. sage 200
>> 6. A&E
>> 6. £10,000
>>
>> But do NOT match "fluent german" from 4 + 5 since there's a newline
>> between them when indexed, but not when searched.
>>
>>
>> Do the filters (wdf in this case) not create multiple tokens, so if
>> splitting on period in "asp.net" would create tokens for all of "asp",
>> "asp.", "asp.net", ".net", "net".
>>
>>
>> Cheers,
>> Rob
>>
>> --
>>
>> IntelCompute
>> Web Design and Online Marketing
>>
>> http://www.intelcompute.com
>>
>>
>> -Original Message-
>> From: Chris Hostetter 
>> Reply-to: solr-user@lucene.apache.org
>> To: solr-user@lucene.apache.org
>> Subject: Re: Which Tokeniser (and/or filter)
>> Date: Tue, 7 Feb 2012 15:02:36 -0800 (PST)
>>
>> : This all seems a bit too much work for such a real-world scenario?
>>
>> You haven't really told us what your scenario is.
>>
>> You said you want to split tokens on whitespace, full-stop (aka:
>> period) and comma only, but then in response to some suggestions you added
>> comments about other things that you never mentioned previously...
>>
>> 1) evidently you don't want the "." in foo.net to cause a split in tokens?
>> 2) evidently you not only want token splits on newlines, but also
>> position gaps to prevent phrases matching across newlines.
>>
>> ...these are kind of important details that affect suggestions people
>> might give you.
>>
>> can you please provide some concrete examples of the types of data you
>> have, the types of queries you want them to match, and the types of
>> queries you *don't* want to match?
>>
>>
>> -Hoss
>>



Re: struggling with solr.WordDelimiterFilterFactory and periods "." or dots

2012-02-08 Thread geeky2
hello,

thank you for the reply.

yes - i did re-index after the changes to the schema.

also - thank you for the direction on using the analyzer - but i am not sure
if i am interpreting the feedback from the analyzer correctly.

here is what i did:

in the Field value (Index) box - i placed this: BP2.1UAA

in the Field value (Query) box - i placed this: BP21UAA

then after hitting the Analyze button - i see the following:

Under Index Analyzer for: 

org.apache.solr.analysis.WordDelimiterFilterFactory {splitOnCaseChange=1,
generateNumberParts=1, catenateWords=1, luceneMatchVersion=LUCENE_33,
generateWordParts=1, catenateAll=1, catenateNumbers=1}

i see 

position1   2   3   4
term text   BP  2   1   UAA
21  BP21UAA

Under Query Analyzer for:

org.apache.solr.analysis.WordDelimiterFilterFactory {splitOnCaseChange=1,
generateNumberParts=1, catenateWords=1, luceneMatchVersion=LUCENE_33,
generateWordParts=1, catenateAll=1, catenateNumbers=1}

i see 

position1   2   3
term text   BP  21  UAA
BP21UAA

the above information leads me to believe that i "should" have BP21UAA as an
indexed term generated from the BP2.1UAA value coming from the database.

also - the query analysis lead me to believe that i "should" find a document
when i search on BP21UAA in the itemNo field

do i have this correct?

am i missing something here?

i am still unable to get a hit when i search on BP21UAA in the itemNo field.

thank you,
mark

--
View this message in context: 
http://lucene.472066.n3.nabble.com/struggling-with-solr-WordDelimiterFilterFactory-and-periods-or-dots-tp3724822p3726021.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: usage of /etc/jetty.xml when debugging Solr in Eclipse

2012-02-08 Thread Bernd Fehling

Hi,

run-jetty-run issue #9:
...
In the VM Arguments of your launch configuration set
-Drjrxml=./jetty.xml

If jetty.xml is in the root of your project it will be used (you can also use a 
fully
qualified path name).

The UI port, context and WebApp dir are ignored, since you can define them in 
jetty.xml

Note: You still have to specify a valid "WebApp dir" because there are other 
checks
that the plugin performs.
...


Or you can start solr with jetty as usual and then connect eclipse
to the running process.
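
For the second approach, the standard JPDA flags can be used when starting
solr (the port is just an example):

    java -Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=8000 -jar start.jar

and then attach an Eclipse "Remote Java Application" debug configuration to
that port.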


Regards


Am 08.02.2012 12:24, schrieb jmlucjav:

Hi,

I am following
http://www.lucidimagination.com/devzone/technical-articles/setting-apache-solr-eclipse
in order to be able to debug Solr in eclipse. I got it working fine.

Now, I usually use ./etc/jetty.xml to set logging configuration. When
starting jetty in eclipse I don't see any log files created, so I guessed
jetty.xml is not being used. So I added it to RunJetty Advanced
configuration (Additional jetty.xml), but in that case something goes wrong,
as I get a 'java.net.BindException: Address already in use: JVM_Bind' error,
as if something had been started twice.

So my question is: can jetty.xml be used while debugging in eclipse? If so,
how? I would like to use the same configuration I use when I am just
changing xml stuff in Solr and starting with 'java -jar start.jar'.

thanks in advance



Re: Fields not indexed?

2012-02-08 Thread Radu Toev
I just realized that as I pushed the send button :P
Thanks, I'll have a look.

On Wed, Feb 8, 2012 at 2:58 PM, Dmitry Kan  wrote:

> well, you should add these fields in schema.xml, otherwise solr won't know
> them.
>
> On Wed, Feb 8, 2012 at 2:48 PM, Radu Toev  wrote:
>
> > The schema.xml is the default file that comes with Solr 3.5, didn't
> change
> > anything there.
> >
> > On Wed, Feb 8, 2012 at 2:45 PM, Dmitry Kan  wrote:
> >
> > > How does your schema for the fields look like?
> > >
> > > On Wed, Feb 8, 2012 at 2:41 PM, Radu Toev  wrote:
> > >
> > > > Hi,
> > > >
> > > > I am really new to Solr so I apologize if the question is a little
> off.
> > > > I was playing with DataImportHandler and tried to index a table in a
> MS
> > > SQL
> > > > database.
> > > > I configured my datasource with the necessary parameters and added
> > three
> > > > fields with column(uppercase) and name:
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > The full-import command seems to have completed successfully and I
> see
> > > that
> > > > the number of documents processed is the same as the number of
> entries
> > in
> > > > my table.
> > > > However when I try to run a *:* query from the admin console I only
> get
> > > > responses in the form:
> > > >
> > > >  < doc>
> > > >  1.0
> > > >  1
> > > >   
> > > >
> > > > I'm not sure how to get to the bottom of this.
> > > > Thanks.
> > > >
> > >
> > >
> > >
> > > --
> > > Regards,
> > >
> > > Dmitry Kan
> > >
> >
>
>
>
> --
> Regards,
>
> Dmitry Kan
>


Re: Fields not indexed?

2012-02-08 Thread Dmitry Kan
well, you should add these fields in schema.xml, otherwise solr won't know
them.
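
A hedged sketch of what that could look like in schema.xml (field names and
types are examples only and must match the name= values in your DIH config):

    <field name="id" type="string" indexed="true" stored="true" required="true"/>
    <field name="name" type="text" indexed="true" stored="true"/>
    <field name="description" type="text" indexed="true" stored="true"/>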

On Wed, Feb 8, 2012 at 2:48 PM, Radu Toev  wrote:

> The schema.xml is the default file that comes with Solr 3.5, didn't change
> anything there.
>
> On Wed, Feb 8, 2012 at 2:45 PM, Dmitry Kan  wrote:
>
> > How does your schema for the fields look like?
> >
> > On Wed, Feb 8, 2012 at 2:41 PM, Radu Toev  wrote:
> >
> > > Hi,
> > >
> > > I am really new to Solr so I apologize if the question is a little off.
> > > I was playing with DataImportHandler and tried to index a table in a MS
> > SQL
> > > database.
> > > I configured my datasource with the necessary parameters and added
> three
> > > fields with column(uppercase) and name:
> > >
> > >
> > >
> > >
> > >
> > > The full-import command seems to have completed successfully and I see
> > that
> > > the number of documents processed is the same as the number of entries
> in
> > > my table.
> > > However when I try to run a *:* query from the admin console I only get
> > > responses in the form:
> > >
> > >  < doc>
> > >  1.0
> > >  1
> > >   
> > >
> > > I'm not sure how to get to the bottom of this.
> > > Thanks.
> > >
> >
> >
> >
> > --
> > Regards,
> >
> > Dmitry Kan
> >
>



-- 
Regards,

Dmitry Kan


Re: Fields not indexed?

2012-02-08 Thread Radu Toev
The schema.xml is the default file that comes with Solr 3.5, didn't change
anything there.

On Wed, Feb 8, 2012 at 2:45 PM, Dmitry Kan  wrote:

> How does your schema for the fields look like?
>
> On Wed, Feb 8, 2012 at 2:41 PM, Radu Toev  wrote:
>
> > Hi,
> >
> > I am really new to Solr so I apologize if the question is a little off.
> > I was playing with DataImportHandler and tried to index a table in a MS
> SQL
> > database.
> > I configured my datasource with the necessary parameters and added three
> > fields with column(uppercase) and name:
> >
> >
> >
> >
> >
> > The full-import command seems to have completed successfully and I see
> that
> > the number of documents processed is the same as the number of entries in
> > my table.
> > However when I try to run a *:* query from the admin console I only get
> > responses in the form:
> >
> >  < doc>
> >  1.0
> >  1
> >   
> >
> > I'm not sure how to get to the bottom of this.
> > Thanks.
> >
>
>
>
> --
> Regards,
>
> Dmitry Kan
>


Re: Fields not indexed?

2012-02-08 Thread Dmitry Kan
How does your schema for the fields look like?

On Wed, Feb 8, 2012 at 2:41 PM, Radu Toev  wrote:

> Hi,
>
> I am really new to Solr so I apologize if the question is a little off.
> I was playing with DataImportHandler and tried to index a table in a MS SQL
> database.
> I configured my datasource with the necessary parameters and added three
> fields with column(uppercase) and name:
>
>
>
>
>
> The full-import command seems to have completed successfully and I see that
> the number of documents processed is the same as the number of entries in
> my table.
> However when I try to run a *:* query from the admin console I only get
> responses in the form:
>
>  < doc>
>  1.0
>  1
>   
>
> I'm not sure how to get to the bottom of this.
> Thanks.
>



-- 
Regards,

Dmitry Kan


Re: Which Tokeniser (and/or filter)

2012-02-08 Thread Erick Erickson
Yes, WDDF creates multiple tokens. But that has
nothing to do with the multiValued suggestion.

You can get exactly what you want by
1> setting multiValued="true" in your schema file and re-indexing. Say
positionIncrementGap is set to 100
2> When you index, add the field for each sentence, so your doc
  looks something like:
 
i am a sales-manager in here
   using asp.net and .net daily
 .
  
3> search like "sales manager"~100
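
A minimal untested sketch of steps 1 and 2 ("sentence" is a made-up field
name, and its fieldType is assumed to declare positionIncrementGap="100"):

    <field name="sentence" type="text" indexed="true" stored="true"
           multiValued="true"/>

    <add>
      <doc>
        <field name="sentence">i am a sales-manager in here</field>
        <field name="sentence">using asp.net and .net daily</field>
      </doc>
    </add>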

Best
Erick

On Wed, Feb 8, 2012 at 3:05 AM, Rob Brown  wrote:
> Apologies if things were a little vague.
>
> Given the example snippet to index (numbered to show searches needed to
> match)...
>
> 1: i am a sales-manager in here
> 2: using asp.net and .net daily
> 3: working in design.
> 4: using something called sage 200. and i'm fluent
> 5: german sausages.
> 6: busy A&E dept earning £10,000 annually
>
>
> ... all with newlines in place.
>
> able to match...
>
> 1. sales
> 1. "sales manager"
> 1. sales-manager
> 1. "sales-manager"
> 2. .net
> 2. asp.net
> 3. design
> 4. sage 200
> 6. A&E
> 6. £10,000
>
> But do NOT match "fluent german" from 4 + 5 since there's a newline
> between them when indexed, but not when searched.
>
>
> Do the filters (wdf in this case) not create multiple tokens, so if
> splitting on period in "asp.net" would create tokens for all of "asp",
> "asp.", "asp.net", ".net", "net".
>
>
> Cheers,
> Rob
>
> --
>
> IntelCompute
> Web Design and Online Marketing
>
> http://www.intelcompute.com
>
>
> -Original Message-
> From: Chris Hostetter 
> Reply-to: solr-user@lucene.apache.org
> To: solr-user@lucene.apache.org
> Subject: Re: Which Tokeniser (and/or filter)
> Date: Tue, 7 Feb 2012 15:02:36 -0800 (PST)
>
> : This all seems a bit too much work for such a real-world scenario?
>
> You haven't really told us what your scenario is.
>
> You said you want to split tokens on whitespace, full-stop (aka:
> period) and comma only, but then in response to some suggestions you added
> comments about other things that you never mentioned previously...
>
> 1) evidently you don't want the "." in foo.net to cause a split in tokens?
> 2) evidently you not only want token splits on newlines, but also
> position gaps to prevent phrases matching across newlines.
>
> ...these are kind of important details that affect suggestions people
> might give you.
>
> can you please provide some concrete examples of the types of data you
> have, the types of queries you want them to match, and the types of
> queries you *don't* want to match?
>
>
> -Hoss
>


Re: struggling with solr.WordDelimiterFilterFactory and periods "." or dots

2012-02-08 Thread Erick Erickson
Hmmm, seems OK. Did you re-index after any
schema changes?

You'll learn to love admin/analysis for questions like this,
that page should show you what the actual tokenization
results are, make sure to click the "verbose" check boxes.

Best
Erick

On Tue, Feb 7, 2012 at 10:52 PM, geeky2  wrote:
> hello all,
>
> i am struggling with getting solr.WordDelimiterFilterFactory to behave as is
> indicated in the solr book (Smiley) on page 54.
>
> the example in the books reads like this:
>
>>>
> Here is an example exercising all options:
> WiFi-802.11b to Wi, Fi, WiFi, 802, 11, 80211, b, WiFi80211b
> <<
>
> essentially - i have the same requirement with embedded periods and need to
> return a successful search on a field, even if the user does NOT enter the
> period.
>
> i have a field, itemNo that can contain periods ".".
>
> example content in the itemNo field:
>
> B12.0123
>
> when the user searches on this field, they need to be able to enter an
> itemNo without the period, and still find the item.
>
> example:
>
> user enters: B120123 and a document is returned with B12.0123.
>
>
> unfortunately, the search will NOT return the appropriate document, if the
> user enters B120123.
>
> however - the search does work if the user enters B12 0123 (a space in place
> of the period).
>
> can someone help me understand what is missing from my configuration?
>
>
> this is snipped from my schema.xml file
>
>
>  
>     ...
>    
>     ...
>  
>
>
>
>
>     positionIncrementGap="100">
>      
>        
>         ignoreCase="true" expand="true"/>
>         words="stopwords.txt"/>
>        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1"
> catenateNumbers="1" catenateAll="1" splitOnCaseChange="1"/>
>        
>         protected="protwords.txt"/>
>        
>        
>      
>      
>        
>         words="stopwords.txt"/>
>        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1"
> catenateNumbers="1" catenateAll="1" splitOnCaseChange="1"/>
>        
>         protected="protwords.txt"/>
>        
>        
>      
>    
>
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/struggling-with-solr-WordDelimiterFilterFactory-and-periods-or-dots-tp3724822p3724822.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Custom Document Clustering and Mahout Integration

2012-02-08 Thread Selvam
Hi all,

I am trying to write a custom document clustering component that should
take all the docs in a commit and cluster them; Solr version: 3.5.0.

Main Class:
public class KMeansClusteringEngine extends DocumentClusteringEngine
implements SolrEventListener

I added a newSearcher event listener; that works as expected. But when is
the document clustering called? I have two functions of
DocumentClusteringEngine in my custom code, but when do they get called?
The wiki page says to add clustering.collection=true, but I am not sure,
as my guess is that document clustering is in no way related to search.

  public NamedList cluster(SolrParams params)
  public NamedList cluster(DocSet docSet, SolrParams solrParams)


Note:
Actually I am trying to integrate Solr 3.5 with Mahout 0.5 for incremental
clustering (i.e. mapping new docs to an existing cluster to avoid complete
re-clustering), basing my work on this GitHub code:
https://github.com/gsingers/ApacheCon2010/blob/master/src/main/java/com/grantingersoll/intell/clustering/KMeansClusteringEngine.java
.

I would love to get some support from you.

-- 
Regards,
S.Selvam
http://knackforge.com


Re: How to reindex about 10Mio. docs

2012-02-08 Thread Vadim Kisselmann
Another problem appeared ;)
how can I export my docs in CSV format?
In Solr 3.1+ I can use the query param &wt=csv, but what about in Solr 1.4.1?
Best Regards
Vadim


2012/2/8 Vadim Kisselmann :
> Hi Ahmet,
> thanks for the quick response :)
> I've already thought the same...
> And it will be a pain to export and import this huge doc-set as CSV.
> Do I have another solution?
> Regards
> Vadim
>
>
> 2012/2/8 Ahmet Arslan :
>>> i want to reindex about 10Mio. Docs. from one Solr(1.4.1) to
>>> another
>>> Solr(1.4.1).
>>> I changed my schema.xml (field types sint to slong),
>>> standard
>>> replication would fail.
>>> what is the fastest and smartest way to manage this?
>>> this here sound great (EntityProcessor):
>>> http://www.searchworkings.org/blog/-/blogs/importing-data-from-another-solr
>>> But would it work with Solr 1.4.1?
>>
>> SolrEntityProcessor is not available in 1.4.1. I would dump stored fields 
>> into comma separated file, and use http://wiki.apache.org/solr/UpdateCSV to 
>> feed into new solr instance.


usage of /etc/jetty.xml when debugging Solr in Eclipse

2012-02-08 Thread jmlucjav
Hi,

I am following
http://www.lucidimagination.com/devzone/technical-articles/setting-apache-solr-eclipse
in order to be able to debug Solr in eclipse. I got it working fine.

Now, I usually use ./etc/jetty.xml to set logging configuration. When
starting jetty in eclipse I don't see any log files created, so I guessed
jetty.xml is not being used. So I added it to RunJetty Advanced
configuration (Additional jetty.xml), but in that case something goes wrong,
as I get a 'java.net.BindException: Address already in use: JVM_Bind' error,
as if something had been started twice.

So my question is: can jetty.xml be used while debugging in eclipse? If so,
how? I would like to use the same configuration I use when I am just
changing xml stuff in Solr and starting with 'java -jar start.jar'.

thanks in advance



--
View this message in context: 
http://lucene.472066.n3.nabble.com/usage-of-etc-jetty-xml-when-debugging-Solr-in-Eclipse-tp3725588p3725588.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: How to reindex about 10Mio. docs

2012-02-08 Thread Vadim Kisselmann
Hi Ahmet,
thanks for the quick response :)
I've already thought the same...
And it will be a pain to export and import this huge doc-set as CSV.
Do I have another solution?
Regards
Vadim


2012/2/8 Ahmet Arslan :
>> i want to reindex about 10Mio. Docs. from one Solr(1.4.1) to
>> another
>> Solr(1.4.1).
>> I changed my schema.xml (field types sint to slong),
>> standard
>> replication would fail.
>> what is the fastest and smartest way to manage this?
>> this here sound great (EntityProcessor):
>> http://www.searchworkings.org/blog/-/blogs/importing-data-from-another-solr
>> But would it work with Solr 1.4.1?
>
> SolrEntityProcessor is not available in 1.4.1. I would dump stored fields 
> into comma separated file, and use http://wiki.apache.org/solr/UpdateCSV to 
> feed into new solr instance.


Re: is there any practice to load index into RAM to accelerate solr performance?

2012-02-08 Thread Robert Stewart
I concur with this.  As long as index segment files are cached in the OS file
cache, performance is about as good as it gets.  Pulling segment files into RAM inside the
JVM process may actually be slower, given Lucene's existing data structures and 
algorithms for reading segment file data.   If you have very large index (much 
bigger than available RAM) then it will only be slow when accessing disk for 
uncached segment files.  In that case you might consider sharding index across 
more than one server and using distributed searching (possibly SOLR cloud, 
etc.).

How large is your index in GB?  You can also try making index files smaller by
removing indexed/stored fields you don't need, compressing large stored fields,
etc.  Also maybe turn off storing norms, term frequencies, positions, vectors
and stuff if you don't need them.
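
For example, a slimmed-down field could look like this in schema.xml
(untested sketch; "body" is a made-up name, and note that
omitTermFreqAndPositions also disables phrase queries on that field):

    <field name="body" type="text" indexed="true" stored="false"
           omitNorms="true" omitTermFreqAndPositions="true"
           termVectors="false"/>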

On Feb 8, 2012, at 3:17 AM, Ted Dunning wrote:

> This is true with Lucene as it stands.  It would be much faster if there
> were a specialized in-memory index such as is typically used with high
> performance search engines.
> 
> On Tue, Feb 7, 2012 at 9:50 PM, Lance Norskog  wrote:
> 
>> Experience has shown that it is much faster to run Solr with a small
>> amount of memory and let the rest of the ram be used by the operating
>> system "disk cache". That is, the OS is very good at keeping the right
>> disk blocks in memory, much better than Solr.
>> 
>> How much RAM is in the server and how much RAM does the JVM get? How
>> big are the documents, and how large is the term index for your
>> searches? How many documents do you get with each search? And, do you
>> use filter queries- these are very powerful at limiting searches.
>> 
>> 2012/2/7 James :
>>> Is there any practice to load index into RAM to accelerate solr
>> performance?
>>> The over all documents is about 100 million. The search time around
>> 100ms. I am seeking some method to accelerate the respond time for solr.
>>> Just check that there is some practice use SSD disk. And SSD is also
>> cost much, just want to know is there some method like to load the index
>> file in RAM and keep the RAM index and disk index synchronized. Then I can
>> search on the RAM index.
>> 
>> 
>> 
>> --
>> Lance Norskog
>> goks...@gmail.com
>> 



Re: How to reindex about 10Mio. docs

2012-02-08 Thread Ahmet Arslan
> i want to reindex about 10Mio. Docs. from one Solr(1.4.1) to
> another
> Solr(1.4.1).
> I changed my schema.xml (field types sint to slong),
> standard
> replication would fail.
> what is the fastest and smartest way to manage this?
> this here sound great (EntityProcessor):
> http://www.searchworkings.org/blog/-/blogs/importing-data-from-another-solr
> But would it work with Solr 1.4.1?

SolrEntityProcessor is not available in 1.4.1. I would dump stored fields into 
comma separated file, and use http://wiki.apache.org/solr/UpdateCSV to feed 
into new solr instance.
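
A possible invocation, assuming remote streaming is enabled in solrconfig.xml
and the first line of the CSV holds the field names (paths are examples):

    curl 'http://localhost:8983/solr/update/csv?commit=true&stream.file=/tmp/dump.csv&stream.contentType=text/plain;charset=utf-8'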


How to reindex about 10Mio. docs

2012-02-08 Thread Vadim Kisselmann
Hello folks,

i want to reindex about 10Mio. Docs. from one Solr(1.4.1) to another
Solr(1.4.1).
I changed my schema.xml (field types sint to slong), standard
replication would fail.
what is the fastest and smartest way to manage this?
this here sound great (EntityProcessor):
http://www.searchworkings.org/blog/-/blogs/importing-data-from-another-solr
But would it work with Solr 1.4.1?

Best Regards
Vadim


Re: URI Encoding with Solr and Weblogic

2012-02-08 Thread Elisabeth Adler

Hi,

I found a solution to it.
Adding the Weblogic Server Argument -Dfile.encoding=UTF-8 did not affect 
the encoding.


Only a change to the .war file's weblogic.xml and redeployment of the 
modified .war solved it.

I added the following to the weblogic.xml:

<charset-params>
  <input-charset>
    <resource-path>*</resource-path>
    <java-charset-name>UTF-8</java-charset-name>
  </input-charset>
</charset-params>

Would it make sense to include this in the shipped weblogic.xml file?

Best,
Elisabeth

On 07.02.2012 23:12, Elisabeth Adler wrote:

Hi,

I try to get Solr 3.3.0 to process Arabic search requests using its
admin interface. I have successfully managed to set it up on Tomcat
using the URIEncoding attribute but fail miserably on WebLogic 10.

Invoking the URL http://localhost:7012/solr/select/?q=تهنئة returns the
XML below:


0
0

تهنئة





The search term is just gibberish. Running the query through Luke or
Tomcat returns the expected result and renders the search term correctly.

I have tried to change the URI encoding and JVM default encoding by
setting the following start up arguments in WebLogic:
-Dfile.encoding=UTF-8 -Dweblogic.http.URIDecodeEncoding=UTF-8. I can see
them being set through Solr's admin interface. They don't have any
impact though.

I am running out of ideas on how to get this working. Any thoughts and
pointers are much appreciated.

Thanks,
Elisabeth



Re: Improving performance for SOLR geo queries?

2012-02-08 Thread Matthias Käppler
Hi Erick,

if we're not doing geo searches, we filter by "location tags" that we
attach to places. This is simply a hierachical regional id, which is
simple to filter for, but much less flexible. We use that on Web a
lot, but not on mobile, where we want to performance searches in
arbitrary radii around arbitrary positions. For those location tag
kind of queries, the average time spent in SOLR is 43msec (I'm looking
at the New Relic snapshot of the last 12 hours). I have disabled our
"optimization" again just yesterday, so for the bbox queries we're now
at an avg of 220ms (same time window). That's a 5 fold increase in
response time, and in peak hours it's worse than that.

I've also found a blog post from 3 years ago which outlines the inner
workings of the SOLR spatial indexing and searching:
http://www.searchworkings.org/blog/-/blogs/23842
From that it seems as if SOLR already performs a similar optimization
we had in mind during the index step, so if I understand correctly, it
doesn't even search over all records, only those that were mapped to
the grid box identified during indexing.

What I would love to see is what the suggested way is to perform a geo
query on SOLR, considering that they're so difficult to cache and
expensive to run. Is the best approach to restrict the candidate set
as much as possible using cheap filter queries, so that SOLR merely
has to do the geo search against these subsets? How does the query
planner work here? I see there's a cost attached to a filter query,
but one can only set it when cache is set to false? Are cached geo
queries executed last when there are cheaper filter queries to cut
down on documents? If you have a real world practical setup to share,
one that performs well in a production environment that serves
requests in the Millions per day, that would be great.

I'd love to contribute documentation by the way, if you knew me you'd
know I'm an avid open source contributor and actually run several open
source projects myself. But tell me, how can I possibly contribute
answer to questions I don't have an answer to? That's why I'm here,
remember :) So please, these kinds of snippy replies are not helping
anyone.

Thanks
-Matthias

On Tue, Feb 7, 2012 at 3:06 PM, Erick Erickson  wrote:
> So the obvious question is "what is your
> performance like without the distance filters?"
>
> Without that knowledge, we have no clue whether
> the modifications you've made had any hope of
> speeding up your response times
>
> As for the docs, any improvements you'd like to
> contribute would be happily received
>
> Best
> Erick
>
> 2012/2/6 Matthias Käppler :
>> Hi,
>>
>> we need to perform fast geo lookups on an index of ~13M places, and
>> were running into performance problems here with SOLR. We haven't done
>> a lot of query optimization / SOLR tuning up until now so there's
>> probably a lot of things we're missing. I was wondering if you could
>> give me some feedback on the way we do things, whether they make
>> sense, and especially why a supposed optimization we implemented
>> recently seems to have no effect, when we actually thought it would
>> help a lot.
>>
>> What we do is this: our API is built on a Rails stack and talks to
>> SOLR via a Ruby wrapper. We have a few filters that almost always
>> apply, which we put in filter queries. Filter cache hit rate is
>> excellent, about 97%, and cache size caps at 10k filters (max size is
>> 32k, but it never seems to reach that many, probably because we
>> replicate / delta update every few minutes). Still, geo queries are
>> slow, about 250-500msec on average. We send them with cache=false, so
>> as to not flood the fq cache and cause undesirable evictions.
>>
>> Now our idea was this: while the actual geo queries are poorly
>> cacheable, we could clearly identify geographical regions which are
>> more often queried than others (naturally, since we're a user driven
>> service). Therefore, we dynamically partition Earth into a static grid
>> of overlapping boxes, where the grid size (the distance of the nodes)
>> depends on the maximum allowed search radius. That way, for every user
>> query, we would always be able to identify a single bounding box that
>> covers it. This larger bounding box (200km edge length) we would send
>> to SOLR as a cached filter query, along with the actual user query
>> which would still be sent uncached. Ex:
>>
>> User asks for places in 10km around 49.14839,8.5691, then what we will
>> send to SOLR is something like this:
>>
>> fq={!bbox cache=false d=10 sfield=location_ll pt=49.14839,8.5691}
>> fq={!bbox cache=true d=100.0 sfield=location_ll
>> pt=49.4684836290799,8.31165802979391} <-- this one we derive
>> automatically
>>
>> That way SOLR would intersect the two filters and return the same
>> results as when only looking at the smaller bounding box, but keep the
>> larger box in cache and speed up subsequent geo queries in the same
>> regions. Or so we thought; unfortunately this approa

Re: is there any practice to load index into RAM to accelerate solr performance?

2012-02-08 Thread Andrzej Bialecki

On 08/02/2012 09:17, Ted Dunning wrote:

This is true with Lucene as it stands.  It would be much faster if there
were a specialized in-memory index such as is typically used with high
performance search engines.


This could be implemented in Lucene trunk as a Codec. The challenge 
though is to come up with the right data structures.


There has been some interesting research on optimizations for in-memory 
inverted indexes, but it usually involves changing the query evaluation 
algos as well - for reference:


http://digbib.ubka.uni-karlsruhe.de/volltexte/documents/1202502
http://www.siam.org/proceedings/alenex/2008/alx08_01transierf.pdf
http://research.google.com/pubs/archive/37365.pdf

--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Query in starting solr 3.5

2012-02-08 Thread mechravi25
Hi,

I am using Solr version 3.5. I moved the data import handler files from Solr
1.4 (which I used previously) to the new Solr. When I tried to start Solr
3.5, I got the following message in my log:

WARNING: XML parse warning in "solrres:/dataimport.xml", line 2, column 95:
Include operation failed, reverting to fallback. Resource error reading file
as XML (href='solr/conf/solrconfig_master.xml'). Reason: Can't find resource
'solr/conf/solrconfig_master.xml' in classpath or
'/solr/apache-solr-3.5.0/example/multicore/core1/conf/',
cwd=/solr/apache-solr-3.5.0/example

The partial content of dataimport file that I used in solr1.4 is as follows

http://www.w3.org/2001/XInclude";>









The 3 files given in the fallback tag are present in the location. Does Solr
3.5 support fallback? Can someone please suggest a solution?



Also, I got the following warnings in my log while starting solr 3.5

WARNING: the luceneMatchVersion is not specified, defaulting to LUCENE_24
emulation. You should at some point declare and reindex to at least 3.0,
because 2.4 emulation is deprecated and will be removed in 4.0. This
parameter will be mandatory in 4.0.

The solution I found after googling is to apply a patch. Is there any other
option, other than applying this patch, to overcome the warnings? Which is
the best option?


Kindly help me out.

Thanks in advance.








--
View this message in context: 
http://lucene.472066.n3.nabble.com/Query-in-starting-solr-3-5-tp3725372p3725372.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: is there any practice to load index into RAM to accelerate solr performance?

2012-02-08 Thread Dmitry Kan
Hi,

This talk has some interesting details on setting up an Lucene index in RAM:

http://www.lucidimagination.com/devzone/events/conferences/revolution/2011/lucene-yelp


Would be great to hear your findings!

Dmitry

2012/2/8 James 

> Is there any practice to load index into RAM to accelerate solr
> performance?
> The over all documents is about 100 million. The search time around 100ms.
> I am seeking some method to accelerate the respond time for solr.
> Just check that there is some practice use SSD disk. And SSD is also cost
> much, just want to know is there some method like to load the index file in
> RAM and keep the RAM index and disk index synchronized. Then I can search
> on the RAM index.
>


Re: is there any practice to load index into RAM to accelerate solr performance?

2012-02-08 Thread Patrick Plaatje
A start may be to use a RAM disk for that. Mount it as a normal disk and
have the index files stored there. Have a read here:

http://en.wikipedia.org/wiki/RAM_disk
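
On Linux, for instance, a tmpfs mount can serve as the RAM disk (untested
sketch; paths and size are examples, and tmpfs contents are lost on reboot,
so keep a persistent copy on disk and sync it back):

    mount -t tmpfs -o size=8g tmpfs /mnt/solr-ram

and point Solr's data directory there in solrconfig.xml:

    <dataDir>/mnt/solr-ram/data</dataDir>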

Cheers,

Patrick


2012/2/8 Ted Dunning 

> This is true with Lucene as it stands.  It would be much faster if there
> were a specialized in-memory index such as is typically used with high
> performance search engines.
>
> On Tue, Feb 7, 2012 at 9:50 PM, Lance Norskog  wrote:
>
> > Experience has shown that it is much faster to run Solr with a small
> > amount of memory and let the rest of the ram be used by the operating
> > system "disk cache". That is, the OS is very good at keeping the right
> > disk blocks in memory, much better than Solr.
> >
> > How much RAM is in the server and how much RAM does the JVM get? How
> > big are the documents, and how large is the term index for your
> > searches? How many documents do you get with each search? And, do you
> > use filter queries- these are very powerful at limiting searches.
> >
> > 2012/2/7 James :
> > > Is there any practice to load index into RAM to accelerate solr
> > performance?
> > > The over all documents is about 100 million. The search time around
> > 100ms. I am seeking some method to accelerate the respond time for solr.
> > > Just check that there is some practice use SSD disk. And SSD is also
> > cost much, just want to know is there some method like to load the index
> > file in RAM and keep the RAM index and disk index synchronized. Then I
> can
> > search on the RAM index.
> >
> >
> >
> > --
> > Lance Norskog
> > goks...@gmail.com
> >
>



-- 
Patrick Plaatje
Senior Consultant



Re:Re: is there any practice to load index into RAM to accelerate solr performance?

2012-02-08 Thread James
But Solr does not have an in-memory index, am I right?





At 2012-02-08 16:17:49,"Ted Dunning"  wrote:
>This is true with Lucene as it stands.  It would be much faster if there
>were a specialized in-memory index such as is typically used with high
>performance search engines.
>
>On Tue, Feb 7, 2012 at 9:50 PM, Lance Norskog  wrote:
>
>> Experience has shown that it is much faster to run Solr with a small
>> amount of memory and let the rest of the ram be used by the operating
>> system "disk cache". That is, the OS is very good at keeping the right
>> disk blocks in memory, much better than Solr.
>>
>> How much RAM is in the server and how much RAM does the JVM get? How
>> big are the documents, and how large is the term index for your
>> searches? How many documents do you get with each search? And, do you
>> use filter queries- these are very powerful at limiting searches.
>>
>> 2012/2/7 James :
>> > Is there any practice to load index into RAM to accelerate solr
>> performance?
>> > The over all documents is about 100 million. The search time around
>> 100ms. I am seeking some method to accelerate the respond time for solr.
>> > Just check that there is some practice use SSD disk. And SSD is also
>> cost much, just want to know is there some method like to load the index
>> file in RAM and keep the RAM index and disk index synchronized. Then I can
>> search on the RAM index.
>>
>>
>>
>> --
>> Lance Norskog
>> goks...@gmail.com
>>


Re: is there any practice to load index into RAM to accelerate solr performance?

2012-02-08 Thread Ted Dunning
This is true with Lucene as it stands.  It would be much faster if there
were a specialized in-memory index such as is typically used with high
performance search engines.

On Tue, Feb 7, 2012 at 9:50 PM, Lance Norskog  wrote:

> Experience has shown that it is much faster to run Solr with a small
> amount of memory and let the rest of the ram be used by the operating
> system "disk cache". That is, the OS is very good at keeping the right
> disk blocks in memory, much better than Solr.
>
> How much RAM is in the server and how much RAM does the JVM get? How
> big are the documents, and how large is the term index for your
> searches? How many documents do you get with each search? And, do you
> use filter queries- these are very powerful at limiting searches.
>
> 2012/2/7 James :
> > Is there any practice to load index into RAM to accelerate solr
> performance?
> > The over all documents is about 100 million. The search time around
> 100ms. I am seeking some method to accelerate the respond time for solr.
> > Just check that there is some practice use SSD disk. And SSD is also
> cost much, just want to know is there some method like to load the index
> file in RAM and keep the RAM index and disk index synchronized. Then I can
> search on the RAM index.
>
>
>
> --
> Lance Norskog
> goks...@gmail.com
>


Re: Which Tokeniser (and/or filter)

2012-02-08 Thread Rob Brown
Apologies if things were a little vague.

Given the example snippet to index (numbered to show searches needed to
match)...

1: i am a sales-manager in here
2: using asp.net and .net daily
3: working in design.
4: using something called sage 200. and i'm fluent
5: german sausages.
6: busy A&E dept earning £10,000 annually


... all with newlines in place.

able to match...

1. sales
1. "sales manager"
1. sales-manager
1. "sales-manager"
2. .net
2. asp.net
3. design
4. sage 200
6. A&E
6. £10,000

But do NOT match "fluent german" from 4 + 5 since there's a newline
between them when indexed, but not when searched.


Do the filters (wdf in this case) not create multiple tokens, so if
splitting on period in "asp.net" would create tokens for all of "asp",
"asp.", "asp.net", ".net", "net".


Cheers,
Rob

-- 

IntelCompute
Web Design and Online Marketing

http://www.intelcompute.com


-Original Message-
From: Chris Hostetter 
Reply-to: solr-user@lucene.apache.org
To: solr-user@lucene.apache.org
Subject: Re: Which Tokeniser (and/or filter)
Date: Tue, 7 Feb 2012 15:02:36 -0800 (PST)

: This all seems a bit too much work for such a real-world scenario?

You haven't really told us what your scenario is.

You said you want to split tokens on whitespace, full-stop (aka: 
period) and comma only, but then in response to some suggestions you added 
comments about other things that you never mentioned previously...

1) evidently you don't want the "." in foo.net to cause a split in tokens?
2) evidently you not only want token splits on newlines, but also 
position gaps to prevent phrases matching across newlines.

...these are kind of important details that affect suggestions people 
might give you.

can you please provide some concrete examples of the types of data you
have, the types of queries you want them to match, and the types of 
queries you *don't* want to match?


-Hoss