date:20111207

Otis, Tomás: thanks for the great links!

2011/12/7 Tomás Fernández Löbbe 

> Hi Dimitry, I pointed to the wiki page to enable JMX, then you can use any
> tool that visualizes JMX stuff like Zabbix. See
>
> http://www.lucidimagination.com/blog/2011/10/02/monitoring-apache-solr-and-lucidworks-with-zabbix/
>
> On Wed, Dec 7, 2011 at 11:49 AM, Dmitry Kan  wrote:
>
> > The culprit seems to be the merger (frontend) SOLR. Talking to one shard
> > directly takes substantially less time (1-2 sec).
> >
> > On Wed, Dec 7, 2011 at 4:10 PM, Dmitry Kan  wrote:
> >
> > > Tomás: thanks. The page you gave didn't mention cache specifically, is
> > > there more documentation on this specifically? I have used solrmeter
> > tool,
> > > it draws the cache diagrams, is there a similar tool, but which would
> use
> > > jmx directly and present the cache usage in runtime?
> > >
> > > pravesh:
> > > I have increased the size of filterCache, but the search hasn't become
> > any
> > > faster, taking almost 9 sec on avg :(
> > >
> > > name: search
> > > class: org.apache.solr.handler.component.SearchHandler
> > > version: $Revision: 1052938 $
> > > description: Search using components:
> > >
> >
> org.apache.solr.handler.component.QueryComponent,org.apache.solr.handler.component.FacetComponent,org.apache.solr.handler.component.MoreLikeThisComponent,org.apache.solr.handler.component.HighlightComponent,org.apache.solr.handler.component.StatsComponent,org.apache.solr.handler.component.DebugComponent,
> > >
> > > stats: handlerStart : 1323255147351
> > > requests : 100
> > > errors : 3
> > > timeouts : 0
> > > totalTime : 885438
> > > avgTimePerRequest : 8854.38
> > > avgRequestsPerSecond : 0.008789442
> > >
> > > the stats (copying fieldValueCache as well here, to show term
> > statistics):
> > >
> > > name: fieldValueCache
> > > class: org.apache.solr.search.FastLRUCache
> > > version: 1.0
> > > description: Concurrent LRU Cache(maxSize=1, initialSize=10,
> > > minSize=9000, acceptableSize=9500, cleanupThread=false)
> > > stats: lookups : 79
> > > hits : 77
> > > hitratio : 0.97
> > > inserts : 1
> > > evictions : 0
> > > size : 1
> > > warmupTime : 0
> > > cumulative_lookups : 79
> > > cumulative_hits : 77
> > > cumulative_hitratio : 0.97
> > > cumulative_inserts : 1
> > > cumulative_evictions : 0
> > > item_shingleContent_trigram :
> > >
> >
> {field=shingleContent_trigram,memSize=326924381,tindexSize=4765394,time=215426,phase1=213868,nTerms=14827061,bigTerms=35,termInstances=114359167,uses=78}
> > >  name: filterCache
> > > class: org.apache.solr.search.FastLRUCache
> > > version: 1.0
> > > description: Concurrent LRU Cache(maxSize=153600, initialSize=4096,
> > > minSize=138240, acceptableSize=145920, cleanupThread=false)
> > > stats: lookups : 1082854
> > > hits : 940370
> > > hitratio : 0.86
> > > inserts : 142486
> > > evictions : 0
> > > size : 142486
> > > warmupTime : 0
> > > cumulative_lookups : 1082854
> > > cumulative_hits : 940370
> > > cumulative_hitratio : 0.86
> > > cumulative_inserts : 142486
> > > cumulative_evictions : 0
> > >
> > >
> > > index size: 3,25 GB
> > >
> > > Does anyone have some pointers to where to look at and optimize for
> query
> > > time?
> > >
> > >
> > > 2011/12/7 Tomás Fernández Löbbe 
> > >
> > >> Hi Dimitry, cache information is exposed via JMX, so you should be
> able
> > to
> > >> monitor that information with any JMX tool. See
> > >> http://wiki.apache.org/solr/SolrJmx
> > >>
> > >> On Wed, Dec 7, 2011 at 6:19 AM, Dmitry Kan 
> > wrote:
> > >>
> > >> > Yes, we do require that much.
> > >> > Ok, thanks, I will try increasing the maxsize.
> > >> >
> > >> > On Wed, Dec 7, 2011 at 10:56 AM, pravesh 
> > >> wrote:
> > >> >
> > >> > > >>facet.limit=50
> > >> > > your facet.limit seems too high. Do you actually require this
> much?
> > >> > >
> > >> > > Since there a lot of evictions from filtercache, so, increase the
> > >> maxsize
> > >> > > value to your acceptable limit.
> > >> > >
> > >> > > Regards
> > >> > > Pravesh
> > >> > >
> > >> > > --
> > >> > > View this message in context:
> > >> > >
> > >> >
> > >>
> >
> http://lucene.472066.n3.nabble.com/cache-monitoring-tools-tp3566645p3566811.html
> > >> > > Sent from the Solr - User mailing list archive at Nabble.com.
> > >> > >
> > >> >
> > >> >
> > >> >
> > >> > --
> > >> > Regards,
> > >> >
> > >> > Dmitry Kan
> > >> >
> > >>
> > >
> > >
> > >
> > > --
> > > Regards,
> > >
> > > Dmitry Kan
> > >
> >
> >
> >
> > --
> > Regards,
> >
> > Dmitry Kan
> >
>



-- 
Regards,

Dmitry Kan

Re: Solr 4 near real time commit

Hmmm...that sounds pretty odd...

How are you measuring the commit time?

You likely want to turn off any caches, as they will be expired every second, 
but that should not cause this...

I can try and duplicate your setup tomorrow and see what i can spot.

- Mark

On Dec 7, 2011, at 8:13 PM, yu shen wrote:

> Hi Mark, and all
> 
> I now use commit configuration exactly as below:
> 
>  
>   10 
> 
> 
>   1000 
> 
> 
> But the commit time takes about 60 seconds.
> 
> I have around 120 - 130 documents in my server. And each day, the 
> number will increase about 6000. My symptom is if solr server is just 
> started, the commit time is about 3-5 seconds. But after one day time, the 
> commit time increase substantially to about 1 min.
> 
> Do I missed anything or had any mis-configuration?
> Thanks very much for any help in advance.
> 
> Spark
> 
> 2011/12/7 yu shen 
> Thanks for the correction, I did not notice that <344.gif>
> 
> Spark
> 
> 
> 2011/12/7 Mark Miller 
> Well, if that is exactly what you put, it's wrong.  That second one should
> be softAutoCommit.
> 
> On Wednesday, December 7, 2011, yu shen  wrote:
> > Hi All,
> >
> > I tried using solr 4 nightly build: apache-solr-4.0-2011-12-06_08-52-46.
> > And try to enable autoSoftCommit like below in solrconfig.xml
> > 
> >10
> > 
> > 
> >1000
> > 
> >
> > I try to add a document to this solr instance using solrj client in the
> > nightly build. I do saw a commit time boost. Single document commit will
> > basically take around 10 - 15 seconds.
> >
> > My question is, my configuration mean to do the commitment within 1
> second,
> > why solr still takes 10 seconds.
> >
> > Spark
> >
> 
> --
> - Mark
> 
> http://www.lucidimagination.com
> 
> 

- Mark Miller
lucidimagination.com

how to implement per doc weighting

2011-12-07 Thread Jason Toy

I've been reading the solr source code and made modifications by
implementing a custom Similarity class.
I want to implement a weight to the score by multiplying a number
based on if the current doc has certain term in it.

So if the query was q=data_text:foo

then the Similiarity class would apply a weight of :

if this.doc.containsTerm("bar")
  weight = 5;
else
  weight = 1;
end


I don't see how the Similarity class can get access to the doc and its
terms.  Is there a way to do this?

Alternatively I could preprocess the weights and I was looking into
ExternalFileField,  but I can't tell if that is the right solution.
If you use ExternalFileField it sets the rank of the results and not a
weight?  It seems like ExternalFileField is only to be used for
function queries? Is it possible to sort by
product(score,weight_from_external_file_field)?  I also have 10's of
millions of docs so I  am not sure if this is this will work.

Re: Solr Lucene Index Version

Replication just copies the index, so I'm not sure how this would help offhand?

With SolrCloud this is a breeze - just fire up another replica for a shard and 
the current index will replicate to it.

If you where willing to export the data to some portable format and then pull 
it back in, why not just store the original data and reindex?

On Dec 7, 2011, at 8:39 PM, Jamie Johnson wrote:

> Yeah I was actually hoping that some how I could use the replication
> handler to do this, fire up 1 shard, set another as a slave and see if
> it would replicate the index to it but obviously I'm not sure that
> would work either.
> 
> Something like this would be great too
> https://issues.apache.org/jira/browse/LUCENE-3491
> 
> On Wed, Dec 7, 2011 at 7:48 PM, Mark Miller  wrote:
>> Unfortunately, I think the the only silver bullet here, for pure Solr, is to 
>> build a system that makes it possible to reindex somehow.
>> 
>> On Dec 7, 2011, at 1:38 PM, Erik Hatcher wrote:
>> 
>>> 
>>> On Dec 7, 2011, at 13:20 , Shawn Heisey wrote:
>>> 
 On 12/6/2011 2:06 PM, Erik Hatcher wrote:
> I think the best thing that you could do here would be to lock in a 
> version of Lucene (all the Lucene libraries) that you use with SolrCloud. 
>  Certainly not out of the realm of possibilities of some upcoming 
> SolrCloud capability that requires some upgrading of Lucene though, but 
> you may be set for a little while at least.
 
 I have no weight with the Lucene project, especially because I know very 
 little of its internals.
 
 If the code that handles each new index format were also able to read the 
 index format that preceded it, one could incrementally step forward from 
 revision to revision within trunk, running an optimize (forcedMerge?) at 
 each version to upgrade the index format.
>>> 
>>> Shawn - that is the case with Lucene.  The issue Jamie is bringing up is 
>>> going from an *unreleased* snapshot of Lucene to a later *unreleased* 
>>> snapshot of Lucene - and those types of guarantees aren't made across 
>>> snapshots like this.
>>> 
>>> 
>> 
>> - Mark Miller
>> lucidimagination.com
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 

- Mark Miller
lucidimagination.com

Re: Too long to index PDF - SOLR 3.5, SOLRNet 0.4.0.2001, Tom Cat 7.0

2011-12-07 Thread Soumitra Banerjee

Thanks for the response. I will set the stream accrodingly. As for
extraction of the text from pdf, I want the entire content of the pdf. This
content will be part of a SOLR document, which has an uniqueid.

The unique is for what? Here's my schema:

  







  
  InternalCheckID
  Content

Thanks for your help as always.

Regards, Soumitra


On Wed, Dec 7, 2011 at 3:06 PM, Mauricio Scheffer <
mauricioschef...@gmail.com> wrote:

> Try setting the StreamType to application/pdf, that way Tika doesn't have
> to infer it.
> BTW the second argument to ExtractParameters is the unique key... a value
> of "*" probably doesn't make sense.
>
> --
> Mauricio
>
>
> On Wed, Dec 7, 2011 at 5:50 PM, Soumitra Banerjee <
> soumitrabaner...@gmail.com> wrote:
>
> > All -
> >
> > I am using SOLR 3.5, SOLRNet 0.4.0.2001, Tom Cat 7.0 and am running a job
> > to extract the text from pds, stored on my local hard disk.
> >
> > *Tomcat StdErr log Shows:*
> >
> > INFO: [core1] webapp=/Solr path=/update/extract params={extractOnly=true&
> > literal.id=*&resource.name
> > =C:\XXX\10310.pdf&extractFormat=text&version=2.2}
> > status=0 QTime=125
> > Dec 7, 2011 12:29:36 PM
> org.apache.solr.update.processor.LogUpdateProcessor
> > finish
> > INFO: {} 0 141
> > Dec 7, 2011 12:29:36 PM org.apache.solr.core.SolrCore execute
> > INFO: [core1] webapp=/Solr path=/update/extract params={extractOnly=true&
> > literal.id=*&resource.name
> =C:XXX\10311.pdf&extractFormat=text&version=2.2}
> > status=0 QTime=141
> > Dec 7, 2011 12:29:36 PM
> org.apache.solr.update.processor.LogUpdateProcessor
> > finish
> > INFO: {} 0 125
> > Dec 7, 2011 12:29:36 PM org.apache.solr.core.SolrCore execute
> > INFO: [core1] webapp=/Solr path=/update/extract params={extractOnly=true&
> > literal.id=*&resource.name
> > =C:\XXX\3M_US_EN_10313.pdf&extractFormat=text&version=2.2}
> > status=0 QTime=125
> >
> > *Catalina Log Shows:*
> > **
> > INFO: {} 0 281
> > Dec 7, 2011 12:29:04 PM org.apache.solr.core.SolrCore execute
> > INFO: [core1] webapp=/Solr path=/update/extract params={extractOnly=true&
> > literal.id=*&resource.name
> > =C:\XXX\11511.pdf&extractFormat=text&version=2.2}
> > status=0 QTime=281
> > Dec 7, 2011 12:29:05 PM
> org.apache.solr.update.processor.LogUpdateProcessor
> > finish
> > INFO: {} 0 391
> > Dec 7, 2011 12:29:05 PM org.apache.solr.core.SolrCore execute
> > INFO: [core1] webapp=/Solr path=/update/extract params={extractOnly=true&
> > literal.id=*&resource.name
> > =C:XXX\_11513.pdf&extractFormat=text&version=2.2}
> > status=0 QTime=391
> > Dec 7, 2011 12:29:05 PM
> org.apache.solr.update.processor.LogUpdateProcessor
> > finish
> > INFO: {} 0 328
> > Dec 7, 2011 12:29:05 PM org.apache.solr.core.SolrCore execute
> > INFO: [core1] webapp=/Solr path=/update/extract params={extractOnly=true&
> > literal.id=*&resource.name
> > =C:\XXX\11514.pdf&extractFormat=text&version=2.2}
> > status=0 QTime=328
> >
> > The average pdf file size is around 50 KB. My questions are as follows:
> >
> > 1. Can I improve performance by updating any configutaion file for -
> > SolrConfig, Tomcat, others?
> > 2. Since I am using :
> >
> > var response = solr.Extract(new ExtractParameters(pdffile, "*")
> >
> >
> > from SOLRNet 0.4.0.2001, which just came out (Beta), is this a known
> issue
> > to be fixed in upcomming versions?
> >
> >
> > Any help/pointers from the experts will be highly appreciated. Also let
> me
> > know if you would need additional information and  will be more than
> happy
> > to provide that.
> >
> > Regards, Soumitra
> >
>

Re: Solr Lucene Index Version

2011-12-07 Thread Jamie Johnson

Yeah I was actually hoping that some how I could use the replication
handler to do this, fire up 1 shard, set another as a slave and see if
it would replicate the index to it but obviously I'm not sure that
would work either.

Something like this would be great too
https://issues.apache.org/jira/browse/LUCENE-3491

On Wed, Dec 7, 2011 at 7:48 PM, Mark Miller  wrote:
> Unfortunately, I think the the only silver bullet here, for pure Solr, is to 
> build a system that makes it possible to reindex somehow.
>
> On Dec 7, 2011, at 1:38 PM, Erik Hatcher wrote:
>
>>
>> On Dec 7, 2011, at 13:20 , Shawn Heisey wrote:
>>
>>> On 12/6/2011 2:06 PM, Erik Hatcher wrote:
 I think the best thing that you could do here would be to lock in a 
 version of Lucene (all the Lucene libraries) that you use with SolrCloud.  
 Certainly not out of the realm of possibilities of some upcoming SolrCloud 
 capability that requires some upgrading of Lucene though, but you may be 
 set for a little while at least.
>>>
>>> I have no weight with the Lucene project, especially because I know very 
>>> little of its internals.
>>>
>>> If the code that handles each new index format were also able to read the 
>>> index format that preceded it, one could incrementally step forward from 
>>> revision to revision within trunk, running an optimize (forcedMerge?) at 
>>> each version to upgrade the index format.
>>
>> Shawn - that is the case with Lucene.  The issue Jamie is bringing up is 
>> going from an *unreleased* snapshot of Lucene to a later *unreleased* 
>> snapshot of Lucene - and those types of guarantees aren't made across 
>> snapshots like this.
>>
>>
>
> - Mark Miller
> lucidimagination.com
>
>
>
>
>
>
>
>
>
>
>

Re: Grouping or Facet ?

2011-12-07 Thread Darren Govoni

Yes. That's what I would expect. I guess I didn't understand when you said

"The facet counts are the counts of the *values* in that field"

Because it seems its the count of the number of matching documents
irrespective
if one document has 20 values for that field and another 10, the facet
count will be 2,

one for each document in the results.

On 12/07/2011 09:04 AM, Erick Erickson wrote:

In your example you'll have 10 facets returned each with a value of 1.

Best
Erick

On Tue, Dec 6, 2011 at 9:54 AM, wrote:

Sorry to jump into this thread, but are you saying that the facet count is
not # of result hits?

So if I have 1 document with field CAT that has 10 values and I do a query
that returns this 1 document with faceting, that the CAT facet count will
be 10 not 1? I don't seem to be seeing that behavior in my app (Solr 3.5).

Thanks.

OK, I'm not understanding here. You get the counts and the results if you
facet
on a single category field. The facet counts are the counts of the
*values* in that
field. So it would help me if you showed the output of faceting on a
single
category field and why that didn't work for you

But either way, faceting will probably outperform grouping.

Best
Erick

On Mon, Dec 5, 2011 at 9:05 AM, Juan Pablo Mora wrote:

Because I need the count and the result to return back to the client
side. Both the grouping and the facet offers me a solution to do that,
but my doubt is about performance ...

With Grouping my results are:

"grouped":{
"category":{
"matches": ...,
"groups":[{
"groupValue":"categoryXX",
"doclist":{"numFound":Important_number,"start":0,"docs":[
{
doc:id
category:XX
}
"groupValue":"categoryYY",
"doclist":{"numFound":Important_number,"start":0,"docs":[
{
doc: id
category:YY
}

And with faceting my results are :
"facet.prefix=whatever"
"facet_counts":{
"facet_queries":{},
"facet_fields":{
"namesXX":[
"whatever_name_in_category",76,
...
"namesYY":[
"whatever_name_in_category",76,
...

Both results are OK to me.

De: Erick Erickson [erickerick...@gmail.com]
Enviado el: lunes, 05 de diciembre de 2011 14:48
Para: solr-user@lucene.apache.org
Asunto: Re: Grouping or Facet ?

Why not just use the first form of the document
and just facet.field=category? You'll get
two different facet counts for XX and YY
that way.

I don't think grouping is the way to go here.

Best
Erick

On Sat, Dec 3, 2011 at 6:43 AM, Juan Pablo Mora
wrote:

I need to do some counts on a StrField field to suggest options from
two different categories, and I don´t know what option is the best:

My schema looks:

- id
- name
- category: XX or YY

with Grouping I do:

http://localhost:8983/?q=name:prefix*&group=true&group.field=category

But I can change my schema to to:

- id
- nameXX
- nameYY
- category: XX or YY (only 1 value in nameXX or nameYY)

With facet:
http://localhost:8983/?q=*:*&facet=true&facet.field=nameXX&facet.field=nameYY&facet.prefix=prefix

What option have the best performance ?

Best,
Juampa.

Re: Solr 4 near real time commit

2011-12-07 Thread yu shen

Hi Mark, and all

I now use commit configuration exactly as below:

  10

  1000

But the commit time takes about 60 seconds.

I have around 120 - 130 documents in my server. And each day, the
number will increase about 6000. My symptom is if solr server is just
started, the commit time is about 3-5 seconds. But after one day time, the
commit time increase substantially to about 1 min.

Do I missed anything or had any mis-configuration?
Thanks very much for any help in advance.

Spark

2011/12/7 yu shen 

> Thanks for the correction, I did not notice that [?]
>
> Spark
>
>
> 2011/12/7 Mark Miller 
>
>> Well, if that is exactly what you put, it's wrong.  That second one should
>> be softAutoCommit.
>>
>> On Wednesday, December 7, 2011, yu shen  wrote:
>> > Hi All,
>> >
>> > I tried using solr 4 nightly build: apache-solr-4.0-2011-12-06_08-52-46.
>> > And try to enable autoSoftCommit like below in solrconfig.xml
>> > 
>> >10
>> > 
>> > 
>> >1000
>> > 
>> >
>> > I try to add a document to this solr instance using solrj client in the
>> > nightly build. I do saw a commit time boost. Single document commit will
>> > basically take around 10 - 15 seconds.
>> >
>> > My question is, my configuration mean to do the commitment within 1
>> second,
>> > why solr still takes 10 seconds.
>> >
>> > Spark
>> >
>>
>> --
>> - Mark
>>
>> http://www.lucidimagination.com
>>
>
>

Re: UUID field changed when document is updated

2011-12-07 Thread Lance Norskog

http://www.lucidimagination.com/search/link?url=http://wiki.apache.org/solr/UniqueKey

On Wed, Dec 7, 2011 at 5:04 PM, Lance Norskog  wrote:

> Yes, the SignatureUpdateProcessor is what you want. The 128-bit hash is
> exactly what you want to use in this situation.  You will never get the
> same ID for two urls- collisions have never been observed "in the wild" for
> this hash algorithm.
>
> Another cool thing about using hash-codes as fields is this: you can give
> the first few letters of the code and a wildcard to get a random subset of
> the index with a given size. For example, 0a0* gives 1/(16^3) of the index.
>
> On Wed, Dec 7, 2011 at 2:48 AM, blaise thomson wrote:
>
>> Hi Hoss,
>>
>> Thanks for getting back to me on this.
>>
>> : I've been trying to use the UUIDField in solr to maintain ids of the
>> >: pages I've crawled with nutch (as per
>> >: http://wiki.apache.org/solr/UniqueKey). The use case is that I want to
>> >: have the server able to use these ids in another database for various
>> >: statistics gathering. So I want the link url to act like a primary key
>> >: for determining if a page exists, and if it doesn't exist to generate a
>> >: new uuid.
>> >
>> >
>> >i'm confused ... if you want the URL to be the primary key, then use the
>> >URL as the primary key, why use the UUID Field at all?
>>
>> I do use the URL as the primary key. The thing is that I want to have a
>> fixed length id for the document so that I can reference it in another
>> database. For example, if I want to count clicks of the url, then I was
>> thinking of using a mysql database along with solr, where each document id
>> has a count of the clicks. I didn't want to use the url itself in that db
>> because of its arbitrary length.
>>
>>
>> : 2. Looking at the code for UUIDField (relevant bit pasted below), it
>> >: seems that the UUID is just generated randomly. There is no check if
>> the
>> >: generated UUID has already been used.
>> >
>> >
>> >correct ... if you specify "NEW" then it generates a new UUID for you --
>> >if you wnat to update an existing doc with an existing UUID then you need
>> >to send the real, existing, value of the UUID for the doc you are
>> >updating.
>> >
>> >
>> >: I can sort of solve this problem by generating the UUID myself, as a
>> >: hash of the link url, but that doesn't help me for those random cases
>> >: when the hash might happen to generate the same UUID.
>> >:
>> >: Does anyone know if there is a way for solr to only add a uuid if the
>> >: document doesn't already exist?
>> >
>> >
>> >I don't really understand your second sentence, but based on that first
>> >sentence it sounds like what you want may be to use something like the
>> >SignatureUpdateProcessor to generate a hash based on the URL...
>> >
>> >
>> >https://wiki.apache.org/solr/Deduplication
>>
>>
>> I didn't know actually about this, so thanks for sharing. I'm not sure it
>> does exactly what I want though. I think it is more for checking if the two
>> docs are the same, which for my purposes, the url works fine for.
>>
>> I think I've sort of come to realise that generating a uuid from the url
>> might be the way to go. There is a chance of getting the same uuid from
>> different urls, but it's only 1 in 2^128, so it's basically non-existant.
>>
>> Thanks again,
>> Blaise
>
>
>
>
> --
> Lance Norskog
> goks...@gmail.com
>
>


-- 
Lance Norskog
goks...@gmail.com

Re: UUID field changed when document is updated

2011-12-07 Thread Lance Norskog

Yes, the SignatureUpdateProcessor is what you want. The 128-bit hash is
exactly what you want to use in this situation.  You will never get the
same ID for two urls- collisions have never been observed "in the wild" for
this hash algorithm.

Another cool thing about using hash-codes as fields is this: you can give
the first few letters of the code and a wildcard to get a random subset of
the index with a given size. For example, 0a0* gives 1/(16^3) of the index.

On Wed, Dec 7, 2011 at 2:48 AM, blaise thomson  wrote:

> Hi Hoss,
>
> Thanks for getting back to me on this.
>
> : I've been trying to use the UUIDField in solr to maintain ids of the
> >: pages I've crawled with nutch (as per
> >: http://wiki.apache.org/solr/UniqueKey). The use case is that I want to
> >: have the server able to use these ids in another database for various
> >: statistics gathering. So I want the link url to act like a primary key
> >: for determining if a page exists, and if it doesn't exist to generate a
> >: new uuid.
> >
> >
> >i'm confused ... if you want the URL to be the primary key, then use the
> >URL as the primary key, why use the UUID Field at all?
>
> I do use the URL as the primary key. The thing is that I want to have a
> fixed length id for the document so that I can reference it in another
> database. For example, if I want to count clicks of the url, then I was
> thinking of using a mysql database along with solr, where each document id
> has a count of the clicks. I didn't want to use the url itself in that db
> because of its arbitrary length.
>
>
> : 2. Looking at the code for UUIDField (relevant bit pasted below), it
> >: seems that the UUID is just generated randomly. There is no check if the
> >: generated UUID has already been used.
> >
> >
> >correct ... if you specify "NEW" then it generates a new UUID for you --
> >if you wnat to update an existing doc with an existing UUID then you need
> >to send the real, existing, value of the UUID for the doc you are
> >updating.
> >
> >
> >: I can sort of solve this problem by generating the UUID myself, as a
> >: hash of the link url, but that doesn't help me for those random cases
> >: when the hash might happen to generate the same UUID.
> >:
> >: Does anyone know if there is a way for solr to only add a uuid if the
> >: document doesn't already exist?
> >
> >
> >I don't really understand your second sentence, but based on that first
> >sentence it sounds like what you want may be to use something like the
> >SignatureUpdateProcessor to generate a hash based on the URL...
> >
> >
> >https://wiki.apache.org/solr/Deduplication
>
>
> I didn't know actually about this, so thanks for sharing. I'm not sure it
> does exactly what I want though. I think it is more for checking if the two
> docs are the same, which for my purposes, the url works fine for.
>
> I think I've sort of come to realise that generating a uuid from the url
> might be the way to go. There is a chance of getting the same uuid from
> different urls, but it's only 1 in 2^128, so it's basically non-existant.
>
> Thanks again,
> Blaise




-- 
Lance Norskog
goks...@gmail.com

Re: Solr Lucene Index Version

Unfortunately, I think the the only silver bullet here, for pure Solr, is to 
build a system that makes it possible to reindex somehow.

On Dec 7, 2011, at 1:38 PM, Erik Hatcher wrote:

> 
> On Dec 7, 2011, at 13:20 , Shawn Heisey wrote:
> 
>> On 12/6/2011 2:06 PM, Erik Hatcher wrote:
>>> I think the best thing that you could do here would be to lock in a version 
>>> of Lucene (all the Lucene libraries) that you use with SolrCloud.  
>>> Certainly not out of the realm of possibilities of some upcoming SolrCloud 
>>> capability that requires some upgrading of Lucene though, but you may be 
>>> set for a little while at least.
>> 
>> I have no weight with the Lucene project, especially because I know very 
>> little of its internals.
>> 
>> If the code that handles each new index format were also able to read the 
>> index format that preceded it, one could incrementally step forward from 
>> revision to revision within trunk, running an optimize (forcedMerge?) at 
>> each version to upgrade the index format.
> 
> Shawn - that is the case with Lucene.  The issue Jamie is bringing up is 
> going from an *unreleased* snapshot of Lucene to a later *unreleased* 
> snapshot of Lucene - and those types of guarantees aren't made across 
> snapshots like this.
> 
> 

- Mark Miller
lucidimagination.com

Re: Too long to index PDF - SOLR 3.5, SOLRNet 0.4.0.2001, Tom Cat 7.0

2011-12-07 Thread Mauricio Scheffer

Try setting the StreamType to application/pdf, that way Tika doesn't have
to infer it.
BTW the second argument to ExtractParameters is the unique key... a value
of "*" probably doesn't make sense.

--
Mauricio


On Wed, Dec 7, 2011 at 5:50 PM, Soumitra Banerjee <
soumitrabaner...@gmail.com> wrote:

> All -
>
> I am using SOLR 3.5, SOLRNet 0.4.0.2001, Tom Cat 7.0 and am running a job
> to extract the text from pds, stored on my local hard disk.
>
> *Tomcat StdErr log Shows:*
>
> INFO: [core1] webapp=/Solr path=/update/extract params={extractOnly=true&
> literal.id=*&resource.name
> =C:\XXX\10310.pdf&extractFormat=text&version=2.2}
> status=0 QTime=125
> Dec 7, 2011 12:29:36 PM org.apache.solr.update.processor.LogUpdateProcessor
> finish
> INFO: {} 0 141
> Dec 7, 2011 12:29:36 PM org.apache.solr.core.SolrCore execute
> INFO: [core1] webapp=/Solr path=/update/extract params={extractOnly=true&
> literal.id=*&resource.name=C:XXX\10311.pdf&extractFormat=text&version=2.2}
> status=0 QTime=141
> Dec 7, 2011 12:29:36 PM org.apache.solr.update.processor.LogUpdateProcessor
> finish
> INFO: {} 0 125
> Dec 7, 2011 12:29:36 PM org.apache.solr.core.SolrCore execute
> INFO: [core1] webapp=/Solr path=/update/extract params={extractOnly=true&
> literal.id=*&resource.name
> =C:\XXX\3M_US_EN_10313.pdf&extractFormat=text&version=2.2}
> status=0 QTime=125
>
> *Catalina Log Shows:*
> **
> INFO: {} 0 281
> Dec 7, 2011 12:29:04 PM org.apache.solr.core.SolrCore execute
> INFO: [core1] webapp=/Solr path=/update/extract params={extractOnly=true&
> literal.id=*&resource.name
> =C:\XXX\11511.pdf&extractFormat=text&version=2.2}
> status=0 QTime=281
> Dec 7, 2011 12:29:05 PM org.apache.solr.update.processor.LogUpdateProcessor
> finish
> INFO: {} 0 391
> Dec 7, 2011 12:29:05 PM org.apache.solr.core.SolrCore execute
> INFO: [core1] webapp=/Solr path=/update/extract params={extractOnly=true&
> literal.id=*&resource.name
> =C:XXX\_11513.pdf&extractFormat=text&version=2.2}
> status=0 QTime=391
> Dec 7, 2011 12:29:05 PM org.apache.solr.update.processor.LogUpdateProcessor
> finish
> INFO: {} 0 328
> Dec 7, 2011 12:29:05 PM org.apache.solr.core.SolrCore execute
> INFO: [core1] webapp=/Solr path=/update/extract params={extractOnly=true&
> literal.id=*&resource.name
> =C:\XXX\11514.pdf&extractFormat=text&version=2.2}
> status=0 QTime=328
>
> The average pdf file size is around 50 KB. My questions are as follows:
>
> 1. Can I improve performance by updating any configutaion file for -
> SolrConfig, Tomcat, others?
> 2. Since I am using :
>
> var response = solr.Extract(new ExtractParameters(pdffile, "*")
>
>
> from SOLRNet 0.4.0.2001, which just came out (Beta), is this a known issue
> to be fixed in upcomming versions?
>
>
> Any help/pointers from the experts will be highly appreciated. Also let me
> know if you would need additional information and  will be more than happy
> to provide that.
>
> Regards, Soumitra
>

Too long to index PDF - SOLR 3.5, SOLRNet 0.4.0.2001, Tom Cat 7.0

2011-12-07 Thread Soumitra Banerjee

All -

I am using SOLR 3.5, SOLRNet 0.4.0.2001, Tom Cat 7.0 and am running a job
to extract the text from pds, stored on my local hard disk.

*Tomcat StdErr log Shows:*

INFO: [core1] webapp=/Solr path=/update/extract params={extractOnly=true&
literal.id=*&resource.name=C:\XXX\10310.pdf&extractFormat=text&version=2.2}
status=0 QTime=125
Dec 7, 2011 12:29:36 PM org.apache.solr.update.processor.LogUpdateProcessor
finish
INFO: {} 0 141
Dec 7, 2011 12:29:36 PM org.apache.solr.core.SolrCore execute
INFO: [core1] webapp=/Solr path=/update/extract params={extractOnly=true&
literal.id=*&resource.name=C:XXX\10311.pdf&extractFormat=text&version=2.2}
status=0 QTime=141
Dec 7, 2011 12:29:36 PM org.apache.solr.update.processor.LogUpdateProcessor
finish
INFO: {} 0 125
Dec 7, 2011 12:29:36 PM org.apache.solr.core.SolrCore execute
INFO: [core1] webapp=/Solr path=/update/extract params={extractOnly=true&
literal.id=*&resource.name=C:\XXX\3M_US_EN_10313.pdf&extractFormat=text&version=2.2}
status=0 QTime=125

*Catalina Log Shows:*
**
INFO: {} 0 281
Dec 7, 2011 12:29:04 PM org.apache.solr.core.SolrCore execute
INFO: [core1] webapp=/Solr path=/update/extract params={extractOnly=true&
literal.id=*&resource.name=C:\XXX\11511.pdf&extractFormat=text&version=2.2}
status=0 QTime=281
Dec 7, 2011 12:29:05 PM org.apache.solr.update.processor.LogUpdateProcessor
finish
INFO: {} 0 391
Dec 7, 2011 12:29:05 PM org.apache.solr.core.SolrCore execute
INFO: [core1] webapp=/Solr path=/update/extract params={extractOnly=true&
literal.id=*&resource.name=C:XXX\_11513.pdf&extractFormat=text&version=2.2}
status=0 QTime=391
Dec 7, 2011 12:29:05 PM org.apache.solr.update.processor.LogUpdateProcessor
finish
INFO: {} 0 328
Dec 7, 2011 12:29:05 PM org.apache.solr.core.SolrCore execute
INFO: [core1] webapp=/Solr path=/update/extract params={extractOnly=true&
literal.id=*&resource.name=C:\XXX\11514.pdf&extractFormat=text&version=2.2}
status=0 QTime=328

The average pdf file size is around 50 KB. My questions are as follows:

1. Can I improve performance by updating any configutaion file for -
SolrConfig, Tomcat, others?
2. Since I am using :

var response = solr.Extract(new ExtractParameters(pdffile, "*")


from SOLRNet 0.4.0.2001, which just came out (Beta), is this a known issue
to be fixed in upcomming versions?


Any help/pointers from the experts will be highly appreciated. Also let me
know if you would need additional information and  will be more than happy
to provide that.

Regards, Soumitra

Re: Using result grouping with SolrJ

2011-12-07 Thread Kissue Kissue

Thanks Juan. I guess i have found my reason to migrate to 3.4.

Many thanks.

On Wed, Dec 7, 2011 at 7:43 PM, Juan Grande  wrote:

> Hi Kissue,
>
> Support for grouping on SolrJ was added in Solr 3.4, see
> https://issues.apache.org/jira/browse/SOLR-2637
>
> In previous versions you can access the grouping results by simply
> traversing the various named lists.
>
> *Juan*
>
>
>
> On Wed, Dec 7, 2011 at 1:22 PM, Kissue Kissue  wrote:
>
> > Hi,
> >
> > I am using Solr 3.3 with SolrJ. Does anybody know how i can use result
> > grouping with SolrJ? Particularly how i can retrieve the result grouping
> > results with SolrJ?
> >
> > Any help will be much appreciated.
> >
> > Thanks.
> >
>

Re: avoid overwrite in DataImportHandler

2011-12-07 Thread P Williams

Hi,

I've wondered the same thing myself.  I feel like the "clean" parameter has
something to do with it but it doesn't work as I'd expect either.  Thanks
in advance to anyone who can answer this question.

*clean* : (default 'true'). Tells whether to clean up the index before the
indexing is started.

Tricia

On Wed, Dec 7, 2011 at 12:49 PM, sabman  wrote:

> I have a unique ID defined for the documents I am indexing. I want to avoid
> overwriting the documents that have already been indexed. I am using
> XPathEntityProcessor and TikaEntityProcessor to process the documents.
>
> The DataImportHandler does not seem to have the option to set
> overwrite=false. I have read some other forums to use deduplication instead
> but I don't see how it is related to my problem.
>
> Any help on this (or explanation on how deduplication would apply to my
> probelm ) would be great. Thanks!
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/avoid-overwrite-in-DataImportHandler-tp3568435p3568435.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>

Re: cache monitoring tools?

2011-12-07 Thread Tomás Fernández Löbbe

Hi Dimitry, I pointed to the wiki page to enable JMX, then you can use any
tool that visualizes JMX stuff like Zabbix. See
http://www.lucidimagination.com/blog/2011/10/02/monitoring-apache-solr-and-lucidworks-with-zabbix/

On Wed, Dec 7, 2011 at 11:49 AM, Dmitry Kan  wrote:

> The culprit seems to be the merger (frontend) SOLR. Talking to one shard
> directly takes substantially less time (1-2 sec).
>
> On Wed, Dec 7, 2011 at 4:10 PM, Dmitry Kan  wrote:
>
> > Tomás: thanks. The page you gave didn't mention cache specifically, is
> > there more documentation on this specifically? I have used solrmeter
> tool,
> > it draws the cache diagrams, is there a similar tool, but which would use
> > jmx directly and present the cache usage in runtime?
> >
> > pravesh:
> > I have increased the size of filterCache, but the search hasn't become
> any
> > faster, taking almost 9 sec on avg :(
> >
> > name: search
> > class: org.apache.solr.handler.component.SearchHandler
> > version: $Revision: 1052938 $
> > description: Search using components:
> >
> org.apache.solr.handler.component.QueryComponent,org.apache.solr.handler.component.FacetComponent,org.apache.solr.handler.component.MoreLikeThisComponent,org.apache.solr.handler.component.HighlightComponent,org.apache.solr.handler.component.StatsComponent,org.apache.solr.handler.component.DebugComponent,
> >
> > stats: handlerStart : 1323255147351
> > requests : 100
> > errors : 3
> > timeouts : 0
> > totalTime : 885438
> > avgTimePerRequest : 8854.38
> > avgRequestsPerSecond : 0.008789442
> >
> > the stats (copying fieldValueCache as well here, to show term
> statistics):
> >
> > name: fieldValueCache
> > class: org.apache.solr.search.FastLRUCache
> > version: 1.0
> > description: Concurrent LRU Cache(maxSize=1, initialSize=10,
> > minSize=9000, acceptableSize=9500, cleanupThread=false)
> > stats: lookups : 79
> > hits : 77
> > hitratio : 0.97
> > inserts : 1
> > evictions : 0
> > size : 1
> > warmupTime : 0
> > cumulative_lookups : 79
> > cumulative_hits : 77
> > cumulative_hitratio : 0.97
> > cumulative_inserts : 1
> > cumulative_evictions : 0
> > item_shingleContent_trigram :
> >
> {field=shingleContent_trigram,memSize=326924381,tindexSize=4765394,time=215426,phase1=213868,nTerms=14827061,bigTerms=35,termInstances=114359167,uses=78}
> >  name: filterCache
> > class: org.apache.solr.search.FastLRUCache
> > version: 1.0
> > description: Concurrent LRU Cache(maxSize=153600, initialSize=4096,
> > minSize=138240, acceptableSize=145920, cleanupThread=false)
> > stats: lookups : 1082854
> > hits : 940370
> > hitratio : 0.86
> > inserts : 142486
> > evictions : 0
> > size : 142486
> > warmupTime : 0
> > cumulative_lookups : 1082854
> > cumulative_hits : 940370
> > cumulative_hitratio : 0.86
> > cumulative_inserts : 142486
> > cumulative_evictions : 0
> >
> >
> > index size: 3,25 GB
> >
> > Does anyone have some pointers to where to look at and optimize for query
> > time?
> >
> >
> > 2011/12/7 Tomás Fernández Löbbe 
> >
> >> Hi Dimitry, cache information is exposed via JMX, so you should be able
> to
> >> monitor that information with any JMX tool. See
> >> http://wiki.apache.org/solr/SolrJmx
> >>
> >> On Wed, Dec 7, 2011 at 6:19 AM, Dmitry Kan 
> wrote:
> >>
> >> > Yes, we do require that much.
> >> > Ok, thanks, I will try increasing the maxsize.
> >> >
> >> > On Wed, Dec 7, 2011 at 10:56 AM, pravesh 
> >> wrote:
> >> >
> >> > > >>facet.limit=50
> >> > > your facet.limit seems too high. Do you actually require this much?
> >> > >
> >> > > Since there a lot of evictions from filtercache, so, increase the
> >> maxsize
> >> > > value to your acceptable limit.
> >> > >
> >> > > Regards
> >> > > Pravesh
> >> > >
> >> > > --
> >> > > View this message in context:
> >> > >
> >> >
> >>
> http://lucene.472066.n3.nabble.com/cache-monitoring-tools-tp3566645p3566811.html
> >> > > Sent from the Solr - User mailing list archive at Nabble.com.
> >> > >
> >> >
> >> >
> >> >
> >> > --
> >> > Regards,
> >> >
> >> > Dmitry Kan
> >> >
> >>
> >
> >
> >
> > --
> > Regards,
> >
> > Dmitry Kan
> >
>
>
>
> --
> Regards,
>
> Dmitry Kan
>

avoid overwrite in DataImportHandler

2011-12-07 Thread sabman

I have a unique ID defined for the documents I am indexing. I want to avoid
overwriting the documents that have already been indexed. I am using
XPathEntityProcessor and TikaEntityProcessor to process the documents.

The DataImportHandler does not seem to have the option to set
overwrite=false. I have read some other forums to use deduplication instead
but I don't see how it is related to my problem. 

Any help on this (or explanation on how deduplication would apply to my
probelm ) would be great. Thanks!

--
View this message in context: 
http://lucene.472066.n3.nabble.com/avoid-overwrite-in-DataImportHandler-tp3568435p3568435.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Using result grouping with SolrJ

2011-12-07 Thread Juan Grande

Hi Kissue,

Support for grouping on SolrJ was added in Solr 3.4, see
https://issues.apache.org/jira/browse/SOLR-2637

In previous versions you can access the grouping results by simply
traversing the various named lists.

*Juan*

On Wed, Dec 7, 2011 at 1:22 PM, Kissue Kissue  wrote:

> Hi,
>
> I am using Solr 3.3 with SolrJ. Does anybody know how i can use result
> grouping with SolrJ? Particularly how i can retrieve the result grouping
> results with SolrJ?
>
> Any help will be much appreciated.
>
> Thanks.
>

Re: Solr Lucene Index Version


On Dec 7, 2011, at 13:20 , Shawn Heisey wrote:

> On 12/6/2011 2:06 PM, Erik Hatcher wrote:
>> I think the best thing that you could do here would be to lock in a version 
>> of Lucene (all the Lucene libraries) that you use with SolrCloud.  Certainly 
>> not out of the realm of possibilities of some upcoming SolrCloud capability 
>> that requires some upgrading of Lucene though, but you may be set for a 
>> little while at least.
> 
> I have no weight with the Lucene project, especially because I know very 
> little of its internals.
> 
> If the code that handles each new index format were also able to read the 
> index format that preceded it, one could incrementally step forward from 
> revision to revision within trunk, running an optimize (forcedMerge?) at each 
> version to upgrade the index format.

Shawn - that is the case with Lucene.  The issue Jamie is bringing up is going 
from an *unreleased* snapshot of Lucene to a later *unreleased* snapshot of 
Lucene - and those types of guarantees aren't made across snapshots like this.

Re: Solr Lucene Index Version

2011-12-07 Thread Shawn Heisey


On 12/6/2011 2:06 PM, Erik Hatcher wrote:

I think the best thing that you could do here would be to lock in a version of 
Lucene (all the Lucene libraries) that you use with SolrCloud.  Certainly not 
out of the realm of possibilities of some upcoming SolrCloud capability that 
requires some upgrading of Lucene though, but you may be set for a little while 
at least.


I have no weight with the Lucene project, especially because I know very 
little of its internals.


If the code that handles each new index format were also able to read 
the index format that preceded it, one could incrementally step forward 
from revision to revision within trunk, running an optimize 
(forcedMerge?) at each version to upgrade the index format.


On the other hand, any reasonable production installation will consist 
of redundant hardware, and if someone is already willing to take the 
risk of running trunk, it can be argued that they should also be 
prepared to take down one of their redundant systems and use it to 
reindex on an upgraded version, then run parallel update programs on 
both versions.  This is what I did when upgrading from 1.4.1 to 3.2, 
because of the javabin difficulties.  Once I had that system in place, I 
also used it for the minor steps to 3.4 and 3.5.


I'd hope for the former, but if that's not going to happen, there is the 
latter.


Thanks,
Shawn

Re: cache monitoring tools?

2011-12-07 Thread Otis Gospodnetic

Hi Dmitry,

You should use SPM for Solr - it exposes all Solr metrics and more (JVM, system 
info, etc.)
PLUS it's currently 100% free.

http://sematext.com/spm/solr-performance-monitoring/index.html


We use it with our clients on a regular basis and it helps us a TON - we just 
helped a very popular mobile app company improve Solr performance by a few 
orders of magnitude (including filter tuning) with the help of SPM.

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/


>
> From: Dmitry Kan 
>To: solr-user@lucene.apache.org 
>Sent: Wednesday, December 7, 2011 2:13 AM
>Subject: cache monitoring tools?
> 
>Hello list,
>
>We've noticed quite huge strain on the filterCache in facet queries against
>trigram fields (see schema in the end of this e-mail). The typical query
>contains some keywords in the q parameter and boolean filter query on other
>solr fields. It is also facet query, the facet field is of
>type shingle_text_trigram (see schema) and facet.limit=50.
>
>
>Questions: are there some tools (except for solrmeter) and/or approaches to
>monitor / profile the load on caches, which would help to derive better
>tuning parameters?
>
>Can you recommend checking config parameters of other components but caches?
>
>BTW, this has become much faster compared to solr 1.4 where we had to a lot
>of optimizations on schema level (e.g. by making a number of stored fields
>non-stored)
>
>Here are the relevant stats from admin (SOLR 3.4):
>
>description: Concurrent LRU Cache(maxSize=1, initialSize=10,
>minSize=9000, acceptableSize=9500, cleanupThread=false)
>stats: lookups : 93
>hits : 90
>hitratio : 0.96
>inserts : 1
>evictions : 0
>size : 1
>warmupTime : 0
>cumulative_lookups : 93
>cumulative_hits : 90
>cumulative_hitratio : 0.96
>cumulative_inserts : 1
>cumulative_evictions : 0
>item_shingleContent_trigram :
>{field=shingleContent_trigram,memSize=326924381,tindexSize=4765394,time=222924,phase1=221106,nTerms=14827061,bigTerms=35,termInstances=114359167,uses=91}
>name: filterCache
>class: org.apache.solr.search.FastLRUCache
>version: 1.0
>description: Concurrent LRU Cache(maxSize=512, initialSize=512,
>minSize=460, acceptableSize=486, cleanupThread=false)
>stats: lookups : 1003486
>hits : 2809
>hitratio : 0.00
>inserts : 1000694
>evictions : 1000221
>size : 473
>warmupTime : 0
>cumulative_lookups : 1003486
>cumulative_hits : 2809
>cumulative_hitratio : 0.00
>cumulative_inserts : 1000694
>cumulative_evictions : 1000221
>
>
>schema excerpt:
>
>positionIncrementGap="100">
>   
>      
>      
>      outputUnigrams="true"/>
>   
>
>
>-- 
>Regards,
>
>Dmitry Kan
>
>
>

RE: SolR - Index problems

2011-12-07 Thread Husain, Yavar


Hi Jiggy

When you query the index, what do you get in the tomcat logs? (Check that out 
in tomcat/logs directory)

How much of Heap memory have you allocated to Tomcat?

- Yavar


From: jiggy [new...@trash-mail.com]
Sent: Wednesday, December 07, 2011 9:53 PM
To: solr-user@lucene.apache.org
Subject: SolR - Index problems

Hello Guys,

i have a big problem. I have integrated solr to Magento EE. I have two solr
folder, one is in c:/tomcat 7.0/
and the other one is in my web-folder(c:/www/).

In the tomcat-folder is the data folder of solr, their are about 200 MB
index file(I think here are my datas from magento).
In the www-folder are the bin and conf folder of solr.

My problem is now, when i try a query in the solr admin page, i don't get
any result.

My questions are:
Is that right with the two folders?
And Why i dont get any results ?

I use windows server 2008 r2 and solr 1.4.1.

Can anybody help, i read the reference guide and some contributions in this
forum, but i dont got any result.

Sorry for my bad english.

Thanks in advance.

Best regards,

Jiggy

--
View this message in context: 
http://lucene.472066.n3.nabble.com/SolR-Index-problems-tp3567883p3567883.html
Sent from the Solr - User mailing list archive at Nabble.com.
**
 
This message may contain confidential or proprietary information intended only 
for the use of the 
addressee(s) named above or may contain information that is legally privileged. 
If you are 
not the intended addressee, or the person responsible for delivering it to the 
intended addressee, 
you are hereby notified that reading, disseminating, distributing or copying 
this message is strictly 
prohibited. If you have received this message by mistake, please immediately 
notify us by 
replying to the message and delete the original message and any copies 
immediately thereafter. 

Thank you.- 
**
FAFLD

SolR - Index problems

2011-12-07 Thread jiggy

Hello Guys,

i have a big problem. I have integrated solr to Magento EE. I have two solr
folder, one is in c:/tomcat 7.0/
and the other one is in my web-folder(c:/www/).

In the tomcat-folder is the data folder of solr, their are about 200 MB
index file(I think here are my datas from magento).
In the www-folder are the bin and conf folder of solr.

My problem is now, when i try a query in the solr admin page, i don't get
any result.

My questions are:
Is that right with the two folders?
And Why i dont get any results ?

I use windows server 2008 r2 and solr 1.4.1.

Can anybody help, i read the reference guide and some contributions in this
forum, but i dont got any result.

Sorry for my bad english.

Thanks in advance.

Best regards,

Jiggy

--
View this message in context: 
http://lucene.472066.n3.nabble.com/SolR-Index-problems-tp3567883p3567883.html
Sent from the Solr - User mailing list archive at Nabble.com.

Boost Query in Edismax

2011-12-07 Thread John

I have a complex edismax query:

facet=true&facet.mincount=0&qf=title^0.08+categorysearch^0.05+abstract^0.03+body^0.1&wt=javabin&rows=25&defType=edismax&version=2&omitHeader=true&fl=*,score&bq=eqid:(3yp^1.57+OR+5fi^1.55+OR+c1s^1.55+OR+3ym^1.55+OR+gjz^1.55...)&start=0&q=*:*&facet.field=category&facet.field=equation&facet.field=source&fq=eqid:(3yp+OR+5fi+OR+c1s+OR+3ym+OR+gjz...)

The list inside the parentheses is quite long.

What I am interested in achieving is to get the final score of a document
to be affected only by the highest boost for that document - rather by all
possible boosts.

Please let me know how can I achieve that.

Thanks in advance.

Using result grouping with SolrJ

2011-12-07 Thread Kissue Kissue

Hi,

I am using Solr 3.3 with SolrJ. Does anybody know how i can use result
grouping with SolrJ? Particularly how i can retrieve the result grouping
results with SolrJ?

Any help will be much appreciated.

Thanks.

Re: Solr response writer


Thank you Erik, I will work on your suggestion! It seems it could work, 
provided I can boost matches on "redirect" document type

S

Inizio: Erik Hatcher [erik.hatc...@gmail.com]
Inviato: mercoledì 7 dicembre 2011 16.56
Fine: solr-user@lucene.apache.org
Oggetto: Re: Solr response writer

What you can do is index the "redirect" documents along with the associated 
words, and let Solr do the stemming.   Maybe add a "document type" field and if 
you get a match on a redirect document type, your web service can do what it 
needs to do from there.

Erik



On Dec 7, 2011, at 10:43 , Finotti Simone wrote:

> No, actually it's a .NET web service that queries Endeca (call it Wrapper). 
> It returns to its clients a collection of unique product IDs, then the client 
> will ask other web services for more detailed informations about the given 
> products. As long as no URL redirection is involved, I think that solrnet ( 
> http://code.google.com/p/solrnet/ ) is good enough to make our Wrapper 
> connect to Solr, thus shielding the client from changes in the underlying 
> search engine.
>
> Endeca C# API also returns a 'RedirectionUrl' property in one of its object, 
> which is set to an URL if the text search matches a redirection rule, in this 
> case the Wrapper passes it down to its client (my fault here, I thought there 
> was some sort of redirection through HTTP result code, but that's not the 
> case).
>
> The point is: since Solr doesn't have this feature, my only chance is to 
> implement it into the "wrapping" web service itself, but I need to "access" 
> how the words are analyzed by the search engine to make it work correctly. 
> AFAICS, Solr only returns documents matching the request, so I'm missing 
> something :-(
>
> S
> 
> Inizio: Michael Kuhlmann [k...@solarier.de]
> Inviato: mercoledì 7 dicembre 2011 15.29
> Fine: solr-user@lucene.apache.org
> Oggetto: Re: R: Solr response writer
>
> Am 07.12.2011 15:09, schrieb Finotti Simone:
>> I got your and Michael's point. Indeed, I'm not very skilled in web 
>> devolpment so there may be something that I'm missing. Anyway, Endeca does 
>> something like this:
>>
>> 1. accept a query
>> 2. does the stemming;
>> 3. check if the result of the step 2. matches one of the redirectable words. 
>> If so, returns an URL, otherwise returns the regular matching documents (our 
>> products' description).
>>
>> Do you think that in Solr I will be able to replicate this behaviour without 
>> writing a custom plugin (request handler, response writer, etc)? Maybe I'm a 
>> little dense, but I fail to see how it would be possible...
>
> Endeca not only is a search engine, it's part of a web application. You
> can send a query to the Endeca engine and send the response directly to
> the user; it's already fully rendered. (At least when you configured it
> this way.)
>
> Solr can't do this in any way. Solr responses are always pure technical
> data, not meant to be delivered to an end user. An exception to this is
> the VelocityResponseWriter which can fill a web template.
>
> Anything beyond the possibilities of the VelocityReponseWriter must be
> handled by some web application that anaylzes Solr's reponses.
>
> How do you want ot display your product descriptions, the default case?
> I don't think you want to show some XML data.
>
> Solr is a great search engine, but not more. It's just a small subset of
> commercial search frameworks like Endeca. Therefore, you can't simply
> replace it, you'll need some web application.
>
> However, you don't need a custom response writer in this case, nor do
> you have to Solr extend in any way. At least not for this requrement.
>
> -Kuli
>
>
>
>

XPathEntityProcessor and ExtractingRequestHandler

2011-12-07 Thread Michael Kelleher

Can I use a XPathEntityProcessor in conjunction with an 
ExtractingRequestHandler?  Also, the scripting language that 
XPathEntityProcessor uses/supports, is that just ECMA/JavaScript?


Or is XPathEntityProcessor only supported for use in conjuntion with the 
DataImportHandler?


Thanks.

Re: Solr response writer

What you can do is index the "redirect" documents along with the associated 
words, and let Solr do the stemming.   Maybe add a "document type" field and if 
you get a match on a redirect document type, your web service can do what it 
needs to do from there.

Erik



On Dec 7, 2011, at 10:43 , Finotti Simone wrote:

> No, actually it's a .NET web service that queries Endeca (call it Wrapper). 
> It returns to its clients a collection of unique product IDs, then the client 
> will ask other web services for more detailed informations about the given 
> products. As long as no URL redirection is involved, I think that solrnet ( 
> http://code.google.com/p/solrnet/ ) is good enough to make our Wrapper 
> connect to Solr, thus shielding the client from changes in the underlying 
> search engine.
> 
> Endeca C# API also returns a 'RedirectionUrl' property in one of its object, 
> which is set to an URL if the text search matches a redirection rule, in this 
> case the Wrapper passes it down to its client (my fault here, I thought there 
> was some sort of redirection through HTTP result code, but that's not the 
> case).
> 
> The point is: since Solr doesn't have this feature, my only chance is to 
> implement it into the "wrapping" web service itself, but I need to "access" 
> how the words are analyzed by the search engine to make it work correctly. 
> AFAICS, Solr only returns documents matching the request, so I'm missing 
> something :-(
> 
> S
> 
> Inizio: Michael Kuhlmann [k...@solarier.de]
> Inviato: mercoledì 7 dicembre 2011 15.29
> Fine: solr-user@lucene.apache.org
> Oggetto: Re: R: Solr response writer
> 
> Am 07.12.2011 15:09, schrieb Finotti Simone:
>> I got your and Michael's point. Indeed, I'm not very skilled in web 
>> devolpment so there may be something that I'm missing. Anyway, Endeca does 
>> something like this:
>> 
>> 1. accept a query
>> 2. does the stemming;
>> 3. check if the result of the step 2. matches one of the redirectable words. 
>> If so, returns an URL, otherwise returns the regular matching documents (our 
>> products' description).
>> 
>> Do you think that in Solr I will be able to replicate this behaviour without 
>> writing a custom plugin (request handler, response writer, etc)? Maybe I'm a 
>> little dense, but I fail to see how it would be possible...
> 
> Endeca not only is a search engine, it's part of a web application. You
> can send a query to the Endeca engine and send the response directly to
> the user; it's already fully rendered. (At least when you configured it
> this way.)
> 
> Solr can't do this in any way. Solr responses are always pure technical
> data, not meant to be delivered to an end user. An exception to this is
> the VelocityResponseWriter which can fill a web template.
> 
> Anything beyond the possibilities of the VelocityReponseWriter must be
> handled by some web application that anaylzes Solr's reponses.
> 
> How do you want ot display your product descriptions, the default case?
> I don't think you want to show some XML data.
> 
> Solr is a great search engine, but not more. It's just a small subset of
> commercial search frameworks like Endeca. Therefore, you can't simply
> replace it, you'll need some web application.
> 
> However, you don't need a custom response writer in this case, nor do
> you have to Solr extend in any way. At least not for this requrement.
> 
> -Kuli
> 
> 
> 
>

Re: Solr or SQL fultext search

2011-12-07 Thread Hector Castro

This article shouldn't flat out make the decision for you, but these concerns 
raised by the guys at StackOverflow (over SQL Server 2008) helped guide us 
toward Solr:

http://www.infoq.com/news/2008/11/SQL-Server-Text

--
Hector

On Dec 7, 2011, at 2:17 AM, Mersad wrote:

> hi Everyone,
> 
> I am wondering how much benefit I get if I move from SQL server to Solr in my 
> text -baed search project.
> Any help is apprechiated !
> 
> 
> best
> Mersad

Re: Solr response writer

No, actually it's a .NET web service that queries Endeca (call it Wrapper). It 
returns to its clients a collection of unique product IDs, then the client will 
ask other web services for more detailed informations about the given products. 
As long as no URL redirection is involved, I think that solrnet ( 
http://code.google.com/p/solrnet/ ) is good enough to make our Wrapper connect 
to Solr, thus shielding the client from changes in the underlying search engine.

Endeca C# API also returns a 'RedirectionUrl' property in one of its object, 
which is set to an URL if the text search matches a redirection rule, in this 
case the Wrapper passes it down to its client (my fault here, I thought there 
was some sort of redirection through HTTP result code, but that's not the case).

The point is: since Solr doesn't have this feature, my only chance is to 
implement it into the "wrapping" web service itself, but I need to "access" how 
the words are analyzed by the search engine to make it work correctly. AFAICS, 
Solr only returns documents matching the request, so I'm missing something :-(

S

Inizio: Michael Kuhlmann [k...@solarier.de]
Inviato: mercoledì 7 dicembre 2011 15.29
Fine: solr-user@lucene.apache.org
Oggetto: Re: R: Solr response writer

Am 07.12.2011 15:09, schrieb Finotti Simone:
> I got your and Michael's point. Indeed, I'm not very skilled in web 
> devolpment so there may be something that I'm missing. Anyway, Endeca does 
> something like this:
>
> 1. accept a query
> 2. does the stemming;
> 3. check if the result of the step 2. matches one of the redirectable words. 
> If so, returns an URL, otherwise returns the regular matching documents (our 
> products' description).
>
> Do you think that in Solr I will be able to replicate this behaviour without 
> writing a custom plugin (request handler, response writer, etc)? Maybe I'm a 
> little dense, but I fail to see how it would be possible...

Endeca not only is a search engine, it's part of a web application. You
can send a query to the Endeca engine and send the response directly to
the user; it's already fully rendered. (At least when you configured it
this way.)

Solr can't do this in any way. Solr responses are always pure technical
data, not meant to be delivered to an end user. An exception to this is
the VelocityResponseWriter which can fill a web template.

Anything beyond the possibilities of the VelocityReponseWriter must be
handled by some web application that anaylzes Solr's reponses.

How do you want ot display your product descriptions, the default case?
I don't think you want to show some XML data.

Solr is a great search engine, but not more. It's just a small subset of
commercial search frameworks like Endeca. Therefore, you can't simply
replace it, you'll need some web application.

However, you don't need a custom response writer in this case, nor do
you have to Solr extend in any way. At least not for this requrement.

-Kuli

Difference between field collapsing and result grouping

2011-12-07 Thread Kissue Kissue

Sorry if this question sounds stupid but i am really really confused about
this. Is there actually a difference between field collapsing and result
grouping in SOLR?

I have come across articles that have talked about setting up field
collapsing with commands that look different from the grouping ones while
in some cases i read that they are the same thing.

Any clarifications would be much appreciated.

Thanks.

Re: Solr 4 near real time commit

2011-12-07 Thread yu shen

Thanks for the correction, I did not notice that [?]

Spark

2011/12/7 Mark Miller 

> Well, if that is exactly what you put, it's wrong.  That second one should
> be softAutoCommit.
>
> On Wednesday, December 7, 2011, yu shen  wrote:
> > Hi All,
> >
> > I tried using solr 4 nightly build: apache-solr-4.0-2011-12-06_08-52-46.
> > And try to enable autoSoftCommit like below in solrconfig.xml
> > 
> >10
> > 
> > 
> >1000
> > 
> >
> > I try to add a document to this solr instance using solrj client in the
> > nightly build. I do saw a commit time boost. Single document commit will
> > basically take around 10 - 15 seconds.
> >
> > My question is, my configuration mean to do the commitment within 1
> second,
> > why solr still takes 10 seconds.
> >
> > Spark
> >
>
> --
> - Mark
>
> http://www.lucidimagination.com
>

Re: Solr Lucene Index Version

Jamie -

The details would of course be entirely dependent on what changed, but with 
Lucene trunk/4.0 there is the flexible indexing API with codecs.  I imagine 
with a compatibility codec layer one could provide some insulation to changes.

You're at big scale, so the "just reindex everything" answer isn't really 
satisfactory I understand.  But locking in to a version of Lucene may be a 
decent stop-gap solution, and if/when the format changes you can upgrade one 
node at a time (the Solr request/response won't change!) and reindex in a 
rolling manner probably.  Again, it's still risky as there may be changes to 
the index format needed for enhancements to SolrCloud that you want so you'd be 
stuck at a fixed place with SolrCloud until you could do some reindexing.

Erik


On Dec 7, 2011, at 08:50 , Jamie Johnson wrote:

> Erik,
> 
> Do you have any details behind what would be required to write a tool
> to move from one index format to another?  Any examples/suggestions
> would be appreciated.
> 
> On Tue, Dec 6, 2011 at 5:19 PM, Jamie Johnson  wrote:
>> What about modifying something like SolrIndexConfig.java to change the
>> lucene version that is used when creating the index?  (may not be the
>> right place, but is something like this possible?)
>> 
>> On Tue, Dec 6, 2011 at 5:13 PM, Erik Hatcher  wrote:
>>> Right.  Not sure what to advise you.  We have worked on this problem with 
>>> our LucidWorks platform and have some tools available to do this sort of 
>>> thing, I think, but it's not generally something that you can do with 
>>> Lucene going from a snapshot to a released version.  Perhaps others with 
>>> deeper insight will chime in.
>>> 
>>>Erik
>>> 
>>> 
>>> 
>>> On Dec 6, 2011, at 16:54 , Jamie Johnson wrote:
>>> 
 Problem is that really doesn't help me.  We still have the same issue
 that when the 4.0 becomes final there is no migration utility from
 this pre 4.0 version to 4.0, right?
 
 
 On Tue, Dec 6, 2011 at 4:36 PM, Erik Hatcher  
 wrote:
> Oh geez... no... I didn't mean 3.x JARs... I meant the trunk/4.0 ones 
> that are there now.
> 
>Erik
> 
> On Dec 6, 2011, at 16:22 , Jamie Johnson wrote:
> 
>> So if I wanted to used lucene index 3.5 with SolrCloud I "should" be
>> able to just move the 3.5 jars in and remove any of the snapshot jars
>> that are present when I build locally?
>> 
>> On Tue, Dec 6, 2011 at 4:06 PM, Erik Hatcher  
>> wrote:
>>> Jamie -
>>> 
>>> I think the best thing that you could do here would be to lock in a 
>>> version of Lucene (all the Lucene libraries) that you use with 
>>> SolrCloud.  Certainly not out of the realm of possibilities of some 
>>> upcoming SolrCloud capability that requires some upgrading of Lucene 
>>> though, but you may be set for a little while at least.
>>> 
>>>Erik
>>> 
>>> On Dec 6, 2011, at 15:57 , Jamie Johnson wrote:
>>> 
 Thanks, but I don't believe that will do it.  From my understanding
 that does not control the index version written, it's used to control
 the behavior of some analyzers (taken from some googling).  I'd love
 if someone told me otherwise though.
 
 On Tue, Dec 6, 2011 at 3:48 PM, Alireza Salimi 
  wrote:
> Hi, I'm not sure if it would help.
> 
> in solrconfig.xml:
> 
>  
>  LUCENE_34
> 
> 
> On Tue, Dec 6, 2011 at 3:14 PM, Jamie Johnson  
> wrote:
> 
>> Is there a way to specify the index version solr uses?  We're
>> currently using SolrCloud but with the index format changing I'd be
>> preferable to be able to specify a particular index format to avoid
>> having to do a complete reindex.  Is this possible?
>> 
> 
> 
> 
> --
> Alireza Salimi
> Java EE Developer
>>> 
> 
>>>

Re: Solr Version Upgrade issue

2011-12-07 Thread Erick Erickson

How did you upgrade? What steps did you follow? Do you have
any custom code? Any additional  entries in your
solrconfig.xml?

These details help us diagnose your problem, but it's almost certainly
that you have a mixture of jar files lying around your machine in
a place you don't expect.

Best
Erick

On Wed, Dec 7, 2011 at 1:28 AM, Pawan Darira  wrote:
> I checked that. there are only latest jars. I am not able to figure out the
> issue.
>
> On Tue, Dec 6, 2011 at 6:57 PM, Mark Miller  wrote:
>
>> Looks like you must have a mix of old and new jars.
>>
>> On Tuesday, December 6, 2011, Pawan Darira  wrote:
>> > Hi
>> >
>> > I am trying to upgrade my SOLR version from 1.4 to 3.2. but it's giving
>> me
>> > below exception. I have checked solr home path & it is correct.. Please
>> help
>> >
>> > SEVERE: Could not start Solr. Check solr/home property
>> > java.lang.NoSuchMethodError:
>> >
>>
>> org.apache.solr.common.SolrException.logOnce(Lorg/slf4j/Logger;Ljava/lang/String;Ljava/lang/Throwable;)V
>> >        at org.apache.solr.core.CoreContainer.load(CoreContainer.java:321)
>> >        at org.apache.solr.core.CoreContainer.load(CoreContainer.java:207)
>> >        at
>> >
>>
>> org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:130)
>> >        at
>> >
>> org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:94)
>> >        at
>> >
>>
>> org.apache.catalina.core.ApplicationFilterConfig.getFilter(ApplicationFilterConfig.java:275)
>> >        at
>> >
>>
>> org.apache.catalina.core.ApplicationFilterConfig.setFilterDef(ApplicationFilterConfig.java:397)
>> >        at
>> >
>>
>> org.apache.catalina.core.ApplicationFilterConfig.(ApplicationFilterConfig.java:108)
>> >        at
>> >
>>
>> org.apache.catalina.core.StandardContext.filterStart(StandardContext.java:3720)
>> >        at
>> > org.apache.catalina.core.StandardContext.start(StandardContext.java:4358)
>> >        at
>> >
>>
>> org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:752)
>> >        at
>> > org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:732)
>> >        at
>> > org.apache.catalina.core.StandardHost.addChild(StandardHost.java:553)
>> >        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> >        at
>> >
>>
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>> >        at
>> >
>>
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>> >        at java.lang.reflect.Method.invoke(Method.java:585)
>> >        at
>> >
>>
>> org.apache.tomcat.util.modeler.BaseModelMBean.invoke(BaseModelMBean.java:297)
>> >        at
>> > org.jboss.mx.server.RawDynamicInvoker.invoke(RawDynamicInvoker.java:164)
>> >        at
>> > org.jboss.mx.server.MBeanServerImpl.invoke(MBeanServerImpl.java:659)
>> >        at
>> > org.apache.catalina.core.StandardContext.init(StandardContext.java:5300)
>> >        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> >        at
>> >
>>
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>> >        at
>> >
>>
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>> >        at java.lang.reflect.Method.invoke(Method.java:585)
>> >        at
>> >
>>
>> org.apache.tomcat.util.modeler.BaseModelMBean.invoke(BaseModelMBean.java:297)
>> >        at
>> > org.jboss.mx.server.RawDynamicInvoker.invoke(RawDynamicInvoker.java:164)
>> >        at
>> > org.jboss.mx.server.MBeanServerImpl.invoke(MBeanServerImpl.java:659)
>> >        at
>> >
>>
>> org.jboss.web.tomcat.service.TomcatDeployer.performDeployInternal(TomcatDeployer.java:301)
>> >        at
>> >
>>
>> org.jboss.web.tomcat.service.TomcatDeployer.performDeploy(TomcatDeployer.java:104)
>> >        at
>> > org.jboss.web.AbstractWebDeployer.start(AbstractWebDeployer.java:375)
>> >        at org.jboss.web.WebModule.startModule(WebModule.java:83)
>> >
>>
>> --
>> - Mark
>>
>> http://www.lucidimagination.com
>>

Re: cache monitoring tools?

The culprit seems to be the merger (frontend) SOLR. Talking to one shard
directly takes substantially less time (1-2 sec).

On Wed, Dec 7, 2011 at 4:10 PM, Dmitry Kan  wrote:

> Tomás: thanks. The page you gave didn't mention cache specifically, is
> there more documentation on this specifically? I have used solrmeter tool,
> it draws the cache diagrams, is there a similar tool, but which would use
> jmx directly and present the cache usage in runtime?
>
> pravesh:
> I have increased the size of filterCache, but the search hasn't become any
> faster, taking almost 9 sec on avg :(
>
> name: search
> class: org.apache.solr.handler.component.SearchHandler
> version: $Revision: 1052938 $
> description: Search using components:
> org.apache.solr.handler.component.QueryComponent,org.apache.solr.handler.component.FacetComponent,org.apache.solr.handler.component.MoreLikeThisComponent,org.apache.solr.handler.component.HighlightComponent,org.apache.solr.handler.component.StatsComponent,org.apache.solr.handler.component.DebugComponent,
>
> stats: handlerStart : 1323255147351
> requests : 100
> errors : 3
> timeouts : 0
> totalTime : 885438
> avgTimePerRequest : 8854.38
> avgRequestsPerSecond : 0.008789442
>
> the stats (copying fieldValueCache as well here, to show term statistics):
>
> name: fieldValueCache
> class: org.apache.solr.search.FastLRUCache
> version: 1.0
> description: Concurrent LRU Cache(maxSize=1, initialSize=10,
> minSize=9000, acceptableSize=9500, cleanupThread=false)
> stats: lookups : 79
> hits : 77
> hitratio : 0.97
> inserts : 1
> evictions : 0
> size : 1
> warmupTime : 0
> cumulative_lookups : 79
> cumulative_hits : 77
> cumulative_hitratio : 0.97
> cumulative_inserts : 1
> cumulative_evictions : 0
> item_shingleContent_trigram :
> {field=shingleContent_trigram,memSize=326924381,tindexSize=4765394,time=215426,phase1=213868,nTerms=14827061,bigTerms=35,termInstances=114359167,uses=78}
>  name: filterCache
> class: org.apache.solr.search.FastLRUCache
> version: 1.0
> description: Concurrent LRU Cache(maxSize=153600, initialSize=4096,
> minSize=138240, acceptableSize=145920, cleanupThread=false)
> stats: lookups : 1082854
> hits : 940370
> hitratio : 0.86
> inserts : 142486
> evictions : 0
> size : 142486
> warmupTime : 0
> cumulative_lookups : 1082854
> cumulative_hits : 940370
> cumulative_hitratio : 0.86
> cumulative_inserts : 142486
> cumulative_evictions : 0
>
>
> index size: 3,25 GB
>
> Does anyone have some pointers to where to look at and optimize for query
> time?
>
>
> 2011/12/7 Tomás Fernández Löbbe 
>
>> Hi Dimitry, cache information is exposed via JMX, so you should be able to
>> monitor that information with any JMX tool. See
>> http://wiki.apache.org/solr/SolrJmx
>>
>> On Wed, Dec 7, 2011 at 6:19 AM, Dmitry Kan  wrote:
>>
>> > Yes, we do require that much.
>> > Ok, thanks, I will try increasing the maxsize.
>> >
>> > On Wed, Dec 7, 2011 at 10:56 AM, pravesh 
>> wrote:
>> >
>> > > >>facet.limit=50
>> > > your facet.limit seems too high. Do you actually require this much?
>> > >
>> > > Since there a lot of evictions from filtercache, so, increase the
>> maxsize
>> > > value to your acceptable limit.
>> > >
>> > > Regards
>> > > Pravesh
>> > >
>> > > --
>> > > View this message in context:
>> > >
>> >
>> http://lucene.472066.n3.nabble.com/cache-monitoring-tools-tp3566645p3566811.html
>> > > Sent from the Solr - User mailing list archive at Nabble.com.
>> > >
>> >
>> >
>> >
>> > --
>> > Regards,
>> >
>> > Dmitry Kan
>> >
>>
>
>
>
> --
> Regards,
>
> Dmitry Kan
>



-- 
Regards,

Dmitry Kan

Problem in searching the Indexed PDf and Word documents in apache tika+solr envieonment using solrj ?

2011-12-07 Thread kiran.bodigam

I am trying to index the pdf and word documents in  solr 3.3.0 version+apache
tika uisng SOLRJ when i am able to search the documents with the file name
where as when i am trying to search the any text data in the content(text
data in the file) its not showing any document in response ? Do i need to
follow any steps while integrating tika ?

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Problem-in-searching-the-Indexed-PDf-and-Word-documents-in-apache-tika-solr-envieonment-using-solrj-tp3567588p3567588.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: R: Solr response writer

2011-12-07 Thread Michael Kuhlmann


Am 07.12.2011 15:09, schrieb Finotti Simone:

I got your and Michael's point. Indeed, I'm not very skilled in web devolpment 
so there may be something that I'm missing. Anyway, Endeca does something like 
this:

1. accept a query
2. does the stemming;
3. check if the result of the step 2. matches one of the redirectable words. If 
so, returns an URL, otherwise returns the regular matching documents (our 
products' description).

Do you think that in Solr I will be able to replicate this behaviour without 
writing a custom plugin (request handler, response writer, etc)? Maybe I'm a 
little dense, but I fail to see how it would be possible...


Endeca not only is a search engine, it's part of a web application. You 
can send a query to the Endeca engine and send the response directly to 
the user; it's already fully rendered. (At least when you configured it 
this way.)


Solr can't do this in any way. Solr responses are always pure technical 
data, not meant to be delivered to an end user. An exception to this is 
the VelocityResponseWriter which can fill a web template.


Anything beyond the possibilities of the VelocityReponseWriter must be 
handled by some web application that anaylzes Solr's reponses.


How do you want ot display your product descriptions, the default case? 
I don't think you want to show some XML data.


Solr is a great search engine, but not more. It's just a small subset of 
commercial search frameworks like Endeca. Therefore, you can't simply 
replace it, you'll need some web application.


However, you don't need a custom response writer in this case, nor do 
you have to Solr extend in any way. At least not for this requrement.


-Kuli

Re: Solr Trunk Changes requires a reindex

2011-12-07 Thread Jamie Johnson

Thanks for the response Erick.

On Wed, Dec 7, 2011 at 9:08 AM, Erick Erickson  wrote:
> Not that I now of. That's one drawback to being on the bleeding edge, when
> the index format changes you have to re-index...
>
> Best
> Erick
>
> On Tue, Dec 6, 2011 at 10:09 AM, Jamie Johnson  wrote:
>> Are there any migration utilities to move from an index built by a
>> Solr 4.0 snapshot to Solr Trunk?  The issue is referenced here
>>
>> http://markmail.org/thread/4ruznwzofyrh776j
>> https://issues.apache.org/jira/browse/LUCENE-3490

Problems with SolrUIMA

2011-12-07 Thread Adriana Farina

Hello, 


I'm trying to use the SolrUIMA component of solr 3.4.0. I modified 
solrconfig.xml file in the following way:


   

     
       C:\Users\Stefano\workspace2\UimaComplete\descriptors\analysis_engine\AggregateAE.xml
       
       true
       
       
         false
         
           text
         
       
       
         
           org.apache.uima.RegEx
           
             expressionFound
             campo1
           
         
         
           org.apache.uima.LingAnnotator
           
             category
             campo2
           
           
             precision
             campo3
           
         
         
           org.apache.uima.DictionaryEntry
           
             coveredText
             campo4
           
         
       
     
   
   
   
 


I followed the tutorial I found on the solr wiki 
(http://wiki.apache.org/solr/SolrUIMA) and customized it. However when I start 
the solr webapp (java -jar start.jar) I get the following exception:

org.apache.solr.common.SolrException: Error Instantiating

UpdateRequestProcessorFactory, 
org.apache.solr.uima.processor.UIMAUpdateRequestProcessorFactory is not a 
org.apache.solr.update.processor.UpdateRequestProcessorFactory
       at org.apache.solr.core.SolrCore.createInstance(SolrCore.java:425)
       at org.apache.solr.core.SolrCore.createInitInstance(SolrCore.java:445)
       at org.apache.solr.core.SolrCore.initPlugins(SolrCore.java:1569)
       at org.apache.solr.update.processor.UpdateRequestProcessorChain.init
(UpdateRequestProcessorChain.java:57)
       at org.apache.solr.core.SolrCore.createInitInstance(SolrCore.java:447)
       at org.apache.solr.core.SolrCore.initPlugins(SolrCore.java:1553)
       at org.apache.solr.core.SolrCore.initPlugins(SolrCore.java:1547)
       at 
org.apache.solr.core.SolrCore.loadUpdateProcessorChains(SolrCore.java:620)
       at org.apache.solr.core.SolrCore.(SolrCore.java:561)
       at org.apache.solr.core.CoreContainer.create(CoreContainer.java:463)
       at org.apache.solr.core.CoreContainer.load(CoreContainer.java:316)
       at org.apache.solr.core.CoreContainer.load(CoreContainer.java:207)
       at 
org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.
java:130)
       at 
org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:
94)
       at org.mortbay.jetty.servlet.FilterHolder.doStart(FilterHolder.java:97)
       at 
org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
       at 
org.mortbay.jetty.servlet.ServletHandler.initialize(ServletHandler.java:
713)
       at org.mortbay.jetty.servlet.Context.startContext(Context.java:140)
       at 
org.mortbay.jetty.webapp.WebAppContext.startContext(WebAppContext.java:
1282)
       at 
org.mortbay.jetty.handler.ContextHandler.doStart(ContextHandler.java:518)
       at org.mortbay.jetty.webapp.WebAppContext.doStart(WebAppContext.java:499)
       at 
org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
       at 
org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:
152)
       at org.mortbay.jetty.handler.ContextHandlerCollection.doStart
(ContextHandlerCollection.java:156)
       at 
org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
       at 
org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:
152)
       at 
org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
       at 
org.mortbay.jetty.handler.HandlerWrapper.doStart(HandlerWrapper.java:130)
       at org.mortbay.jetty.Server.doStart(Server.java:224)
       at 
org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
       at org.mortbay.xml.XmlConfiguration.main(XmlConfiguration.java:985)
       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
       at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
       at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
       at java.lang.reflect.Method.invoke(Unknown Source)
       at org.mortbay.start.Main.invokeMain(Main.java:194)
       at org.mortbay.start.Main.start(Main.java:534)
       at org.mortbay.start.Main.start(Main.java:441)
       at org.mortbay.start.Main.main(Main.java:119)

I can't figure out why I'm getting this exception, since I'm following the 
tutorial step by step and the only thing I modified are the definitions of the 
fieldMapping. Can you help me please?

Thank you.

Re: cache monitoring tools?

Tomás: thanks. The page you gave didn't mention cache specifically, is
there more documentation on this specifically? I have used solrmeter tool,
it draws the cache diagrams, is there a similar tool, but which would use
jmx directly and present the cache usage in runtime?

pravesh:
I have increased the size of filterCache, but the search hasn't become any
faster, taking almost 9 sec on avg :(

name: search
class: org.apache.solr.handler.component.SearchHandler
version: $Revision: 1052938 $
description: Search using components:
org.apache.solr.handler.component.QueryComponent,org.apache.solr.handler.component.FacetComponent,org.apache.solr.handler.component.MoreLikeThisComponent,org.apache.solr.handler.component.HighlightComponent,org.apache.solr.handler.component.StatsComponent,org.apache.solr.handler.component.DebugComponent,

stats: handlerStart : 1323255147351
requests : 100
errors : 3
timeouts : 0
totalTime : 885438
avgTimePerRequest : 8854.38
avgRequestsPerSecond : 0.008789442

the stats (copying fieldValueCache as well here, to show term statistics):

name: fieldValueCache
class: org.apache.solr.search.FastLRUCache
version: 1.0
description: Concurrent LRU Cache(maxSize=1, initialSize=10,
minSize=9000, acceptableSize=9500, cleanupThread=false)
stats: lookups : 79
hits : 77
hitratio : 0.97
inserts : 1
evictions : 0
size : 1
warmupTime : 0
cumulative_lookups : 79
cumulative_hits : 77
cumulative_hitratio : 0.97
cumulative_inserts : 1
cumulative_evictions : 0
item_shingleContent_trigram :
{field=shingleContent_trigram,memSize=326924381,tindexSize=4765394,time=215426,phase1=213868,nTerms=14827061,bigTerms=35,termInstances=114359167,uses=78}
name: filterCache
class: org.apache.solr.search.FastLRUCache
version: 1.0
description: Concurrent LRU Cache(maxSize=153600, initialSize=4096,
minSize=138240, acceptableSize=145920, cleanupThread=false)
stats: lookups : 1082854
hits : 940370
hitratio : 0.86
inserts : 142486
evictions : 0
size : 142486
warmupTime : 0
cumulative_lookups : 1082854
cumulative_hits : 940370
cumulative_hitratio : 0.86
cumulative_inserts : 142486
cumulative_evictions : 0


index size: 3,25 GB

Does anyone have some pointers to where to look at and optimize for query
time?


2011/12/7 Tomás Fernández Löbbe 

> Hi Dimitry, cache information is exposed via JMX, so you should be able to
> monitor that information with any JMX tool. See
> http://wiki.apache.org/solr/SolrJmx
>
> On Wed, Dec 7, 2011 at 6:19 AM, Dmitry Kan  wrote:
>
> > Yes, we do require that much.
> > Ok, thanks, I will try increasing the maxsize.
> >
> > On Wed, Dec 7, 2011 at 10:56 AM, pravesh  wrote:
> >
> > > >>facet.limit=50
> > > your facet.limit seems too high. Do you actually require this much?
> > >
> > > Since there a lot of evictions from filtercache, so, increase the
> maxsize
> > > value to your acceptable limit.
> > >
> > > Regards
> > > Pravesh
> > >
> > > --
> > > View this message in context:
> > >
> >
> http://lucene.472066.n3.nabble.com/cache-monitoring-tools-tp3566645p3566811.html
> > > Sent from the Solr - User mailing list archive at Nabble.com.
> > >
> >
> >
> >
> > --
> > Regards,
> >
> > Dmitry Kan
> >
>



-- 
Regards,

Dmitry Kan

R: Solr response writer

I got your and Michael's point. Indeed, I'm not very skilled in web devolpment 
so there may be something that I'm missing. Anyway, Endeca does something like 
this:

1. accept a query
2. does the stemming;
3. check if the result of the step 2. matches one of the redirectable words. If 
so, returns an URL, otherwise returns the regular matching documents (our 
products' description).

Do you think that in Solr I will be able to replicate this behaviour without 
writing a custom plugin (request handler, response writer, etc)? Maybe I'm a 
little dense, but I fail to see how it would be possible... 

S

Inizio: Erik Hatcher [erik.hatc...@gmail.com]
Inviato: mercoledì 7 dicembre 2011 14.40
Fine: solr-user@lucene.apache.org
Oggetto: Re: Solr response writer

Either way (Endeca's 307, which seems crazy to me) or simply plucking off a 
"url" field from the first document returned in a search request... you're 
getting a URL back to your client and then using that URL to further send back 
to a users browser, I presume.  I personally wouldn't implement it with a 
custom response writer, just get the URL from the standard Solr response.

Erik

On Dec 7, 2011, at 08:26 , Finotti Simone wrote:

> That's the scenario:
> I have an XML that maps words W to URLs; when a search request is issued by 
> my web client, a query will be issued to my Solr application. If, after 
> stemming, the query matches any in W, the client must be redirected to the 
> associated URL.
>
> I agree that it should be handled outside, but we are currently on progress 
> of migrating from Endeca, and it has a feature that allow this scenario. For 
> this reason, my boss asked if it was somehow possible to leave that 
> functionality in the search engine.
>
> thanks again
>
> 
> Inizio: Erik Hatcher [erik.hatc...@gmail.com]
> Inviato: mercoledì 7 dicembre 2011 14.12
> Fine: solr-user@lucene.apache.org
> Oggetto: Re: Solr response writer
>
> First, could you tell us more about your use case?   Why do you want to 
> change the response code?   HTTP 307 = Temporary redirect - where are you 
> going to redirect?  Sounds like something best handled outside of Solr.
>
> If you went down the route of creating your own custom response writer, then 
> you'd be locked into a single format (XML, or JSON, or which ever that you 
> subclassed)
>
>
> On Dec 7, 2011, at 06:48 , Finotti Simone wrote:
>
>> Hello,
>> I need to change the HTTP result code of the query result if some conditions 
>> are met.
>>
>> Analyzing the flow of execution of Solr query process, it seems to me that 
>> the "place" that fits better is the QueryResponseWriter. Anyway I didn't 
>> found a way to change the HTTP request layout (I need to set 307 instead of 
>> 200), so I wonder if it's possible at all with the Solr (v 3.4) plugin 
>> mechanism actually provided.
>>
>> Any insight would be greatly appreciated J
>>
>> Thanks
>> S
>
>
>
>
>

Re: Solr 4 near real time commit

Well, if that is exactly what you put, it's wrong.  That second one should
be softAutoCommit.

On Wednesday, December 7, 2011, yu shen  wrote:
> Hi All,
>
> I tried using solr 4 nightly build: apache-solr-4.0-2011-12-06_08-52-46.
> And try to enable autoSoftCommit like below in solrconfig.xml
> 
>10
> 
> 
>1000
> 
>
> I try to add a document to this solr instance using solrj client in the
> nightly build. I do saw a commit time boost. Single document commit will
> basically take around 10 - 15 seconds.
>
> My question is, my configuration mean to do the commitment within 1
second,
> why solr still takes 10 seconds.
>
> Spark
>

-- 
- Mark

http://www.lucidimagination.com

Re: Solr Trunk Changes requires a reindex

2011-12-07 Thread Erick Erickson

Not that I now of. That's one drawback to being on the bleeding edge, when
the index format changes you have to re-index...

Best
Erick

On Tue, Dec 6, 2011 at 10:09 AM, Jamie Johnson  wrote:
> Are there any migration utilities to move from an index built by a
> Solr 4.0 snapshot to Solr Trunk?  The issue is referenced here
>
> http://markmail.org/thread/4ruznwzofyrh776j
> https://issues.apache.org/jira/browse/LUCENE-3490

Re: Grouping or Facet ?

2011-12-07 Thread Erick Erickson

In your example you'll have 10 facets returned each with a value of 1.

Best
Erick

On Tue, Dec 6, 2011 at 9:54 AM,   wrote:
> Sorry to jump into this thread, but are you saying that the facet count is
> not # of result hits?
>
> So if I have 1 document with field CAT that has 10 values and I do a query
> that returns this 1 document with faceting, that the CAT facet count will
> be 10 not 1? I don't seem to be seeing that behavior in my app (Solr 3.5).
>
> Thanks.
>
>> OK, I'm not understanding here. You get the counts and the results if you
>> facet
>> on a single category field. The facet counts are the counts of the
>> *values* in that
>> field. So it would help me if you showed the output of faceting on a
>> single
>> category field and why that didn't work for you
>>
>> But either way, faceting will probably outperform grouping.
>>
>> Best
>> Erick
>>
>> On Mon, Dec 5, 2011 at 9:05 AM, Juan Pablo Mora  wrote:
>>> Because I need the count and the result to return back to the client
>>> side. Both the grouping and the facet offers me a solution to do that,
>>> but my doubt is about performance ...
>>>
>>> With Grouping my results are:
>>>
>>> "grouped":{
>>>    "category":{
>>>      "matches": ...,
>>>      "groups":[{
>>>          "groupValue":"categoryXX",
>>>          "doclist":{"numFound":Important_number,"start":0,"docs":[
>>>              {
>>>               doc:id
>>>               category:XX
>>>              }
>>>           "groupValue":"categoryYY",
>>>          "doclist":{"numFound":Important_number,"start":0,"docs":[
>>>              {
>>>               doc: id
>>>               category:YY
>>>              }
>>>
>>> And with faceting my results are :
>>> "facet.prefix=whatever"
>>> "facet_counts":{
>>>    "facet_queries":{},
>>>    "facet_fields":{
>>>      "namesXX":[
>>>        "whatever_name_in_category",76,
>>>        ...
>>>      "namesYY":[
>>>        "whatever_name_in_category",76,
>>>        ...
>>>
>>> Both results are OK to me.
>>>
>>>
>>> 
>>> De: Erick Erickson [erickerick...@gmail.com]
>>> Enviado el: lunes, 05 de diciembre de 2011 14:48
>>> Para: solr-user@lucene.apache.org
>>> Asunto: Re: Grouping or Facet ?
>>>
>>> Why not just use the first form of the document
>>> and just facet.field=category? You'll get
>>> two different facet counts for XX and YY
>>> that way.
>>>
>>> I don't think grouping is the way to go here.
>>>
>>> Best
>>> Erick
>>>
>>> On Sat, Dec 3, 2011 at 6:43 AM, Juan Pablo Mora 
>>> wrote:
 I need to do some counts on a StrField field to suggest options from
 two different categories, and I don´t know what option is the best:

 My schema looks:

 - id
 - name
 - category: XX or YY

 with Grouping I do:

 http://localhost:8983/?q=name:prefix*&group=true&group.field=category

 But I can change my schema to to:

 - id
 - nameXX
 - nameYY
 - category: XX or YY (only 1 value in nameXX or nameYY)

 With facet:
 http://localhost:8983/?q=*:*&facet=true&facet.field=nameXX&facet.field=nameYY&facet.prefix=prefix


 What option have the best performance ?

 Best,
 Juampa.
>>
>

RE: Delays when deleting by query

2011-12-07 Thread Mike Gallan


I ran some more tests.  I added an explicit commit after each deleteByQuery() 
call and removed the add/reindex step.  This hung up immediately and completed 
(or timed out?) after 20 minutes.  The hangs occur almost exactly 20 minutes 
apart.  Could this be a Tomcat issue?

I ran jconsole but didn't see any extraordinary memory or CPU usage.  The 
delays appear on the first delete attempt immediately after start up so I 
suspect it's not GC related.

I also tried adding documents without deleting.  This worked with no 
significant delays on the commit.  The delete/commit combo appears to be the 
source of the problem.

Any tips on how to debug this are appreciated!
Thanks,Mike

> From: mgal...@hotmail.com
> To: solr-user@lucene.apache.org
> Subject: Delays when deleting by query
> Date: Tue, 6 Dec 2011 08:25:28 -0500
> 
> Hello,
> 
> We're encountering delays of 10+ minutes when trying to delete from our Solr 
> 3.4 instance.  We have 335k documents indexed and interface using SolrJ.  Our 
> schema basically consists of a parent object with multiple child objects.  
> Every object is indexed as a separate document with the child documents 
> referencing parents via a 'parentId' field.  When any part of a parent object 
> is updated solrServer.deleteByQuery() is called to delete the parent and all 
> the child documents, then solrServer.add() is called to reindex them.  We 
> currently rely on autocommit, with maxDocs set to 100 and maxTime set to 30s. 
>  Deletes work fine on another Solr test instance with 22k documents.
> 
> Any thoughts?  Is this sort of delay common when deleting against this many 
> documents?
> 
> Thanks,
> Mike
>

Re: Solr Lucene Index Version

2011-12-07 Thread Jamie Johnson

Erik,

Do you have any details behind what would be required to write a tool
to move from one index format to another?  Any examples/suggestions
would be appreciated.

On Tue, Dec 6, 2011 at 5:19 PM, Jamie Johnson  wrote:
> What about modifying something like SolrIndexConfig.java to change the
> lucene version that is used when creating the index?  (may not be the
> right place, but is something like this possible?)
>
> On Tue, Dec 6, 2011 at 5:13 PM, Erik Hatcher  wrote:
>> Right.  Not sure what to advise you.  We have worked on this problem with 
>> our LucidWorks platform and have some tools available to do this sort of 
>> thing, I think, but it's not generally something that you can do with Lucene 
>> going from a snapshot to a released version.  Perhaps others with deeper 
>> insight will chime in.
>>
>>        Erik
>>
>>
>>
>> On Dec 6, 2011, at 16:54 , Jamie Johnson wrote:
>>
>>> Problem is that really doesn't help me.  We still have the same issue
>>> that when the 4.0 becomes final there is no migration utility from
>>> this pre 4.0 version to 4.0, right?
>>>
>>>
>>> On Tue, Dec 6, 2011 at 4:36 PM, Erik Hatcher  wrote:
 Oh geez... no... I didn't mean 3.x JARs... I meant the trunk/4.0 ones that 
 are there now.

        Erik

 On Dec 6, 2011, at 16:22 , Jamie Johnson wrote:

> So if I wanted to used lucene index 3.5 with SolrCloud I "should" be
> able to just move the 3.5 jars in and remove any of the snapshot jars
> that are present when I build locally?
>
> On Tue, Dec 6, 2011 at 4:06 PM, Erik Hatcher  
> wrote:
>> Jamie -
>>
>> I think the best thing that you could do here would be to lock in a 
>> version of Lucene (all the Lucene libraries) that you use with 
>> SolrCloud.  Certainly not out of the realm of possibilities of some 
>> upcoming SolrCloud capability that requires some upgrading of Lucene 
>> though, but you may be set for a little while at least.
>>
>>        Erik
>>
>> On Dec 6, 2011, at 15:57 , Jamie Johnson wrote:
>>
>>> Thanks, but I don't believe that will do it.  From my understanding
>>> that does not control the index version written, it's used to control
>>> the behavior of some analyzers (taken from some googling).  I'd love
>>> if someone told me otherwise though.
>>>
>>> On Tue, Dec 6, 2011 at 3:48 PM, Alireza Salimi 
>>>  wrote:
 Hi, I'm not sure if it would help.

 in solrconfig.xml:

  
  LUCENE_34


 On Tue, Dec 6, 2011 at 3:14 PM, Jamie Johnson  
 wrote:

> Is there a way to specify the index version solr uses?  We're
> currently using SolrCloud but with the index format changing I'd be
> preferable to be able to specify a particular index format to avoid
> having to do a complete reindex.  Is this possible?
>



 --
 Alireza Salimi
 Java EE Developer
>>

>>

Re: Solr response writer

Either way (Endeca's 307, which seems crazy to me) or simply plucking off a 
"url" field from the first document returned in a search request... you're 
getting a URL back to your client and then using that URL to further send back 
to a users browser, I presume.  I personally wouldn't implement it with a 
custom response writer, just get the URL from the standard Solr response.

Erik

On Dec 7, 2011, at 08:26 , Finotti Simone wrote:

> That's the scenario:
> I have an XML that maps words W to URLs; when a search request is issued by 
> my web client, a query will be issued to my Solr application. If, after 
> stemming, the query matches any in W, the client must be redirected to the 
> associated URL.
> 
> I agree that it should be handled outside, but we are currently on progress 
> of migrating from Endeca, and it has a feature that allow this scenario. For 
> this reason, my boss asked if it was somehow possible to leave that 
> functionality in the search engine.
> 
> thanks again
> 
> 
> Inizio: Erik Hatcher [erik.hatc...@gmail.com]
> Inviato: mercoledì 7 dicembre 2011 14.12
> Fine: solr-user@lucene.apache.org
> Oggetto: Re: Solr response writer
> 
> First, could you tell us more about your use case?   Why do you want to 
> change the response code?   HTTP 307 = Temporary redirect - where are you 
> going to redirect?  Sounds like something best handled outside of Solr.
> 
> If you went down the route of creating your own custom response writer, then 
> you'd be locked into a single format (XML, or JSON, or which ever that you 
> subclassed)
> 
> 
> On Dec 7, 2011, at 06:48 , Finotti Simone wrote:
> 
>> Hello,
>> I need to change the HTTP result code of the query result if some conditions 
>> are met.
>> 
>> Analyzing the flow of execution of Solr query process, it seems to me that 
>> the "place" that fits better is the QueryResponseWriter. Anyway I didn't 
>> found a way to change the HTTP request layout (I need to set 307 instead of 
>> 200), so I wonder if it's possible at all with the Solr (v 3.4) plugin 
>> mechanism actually provided.
>> 
>> Any insight would be greatly appreciated J
>> 
>> Thanks
>> S
> 
> 
> 
> 
>

Re: Solr response writer

2011-12-07 Thread Michael Kuhlmann


Am 07.12.2011 14:26, schrieb Finotti Simone:

That's the scenario:
I have an XML that maps words W to URLs; when a search request is issued by my 
web client, a query will be issued to my Solr application. If, after stemming, 
the query matches any in W, the client must be redirected to the associated URL.

I agree that it should be handled outside, but we are currently on progress of 
migrating from Endeca, and it has a feature that allow this scenario. For this 
reason, my boss asked if it was somehow possible to leave that functionality in 
the search engine.


Of course, your customers will never directly connect to your Solr 
server. They instead connect to your web application, which is itself a 
client to Solr.


Therefore, it's useless to return redirect response codes directly from 
Solr, since you customer>'s browsers will never get them.


Instead, you should handle Solr responses in your web application 
individually, and redirect your customers then.


-Kuli

Re: Solr response writer

That's the scenario:
I have an XML that maps words W to URLs; when a search request is issued by my 
web client, a query will be issued to my Solr application. If, after stemming, 
the query matches any in W, the client must be redirected to the associated URL.

I agree that it should be handled outside, but we are currently on progress of 
migrating from Endeca, and it has a feature that allow this scenario. For this 
reason, my boss asked if it was somehow possible to leave that functionality in 
the search engine.

thanks again

Inizio: Erik Hatcher [erik.hatc...@gmail.com]
Inviato: mercoledì 7 dicembre 2011 14.12
Fine: solr-user@lucene.apache.org
Oggetto: Re: Solr response writer

First, could you tell us more about your use case?   Why do you want to change 
the response code?   HTTP 307 = Temporary redirect - where are you going to 
redirect?  Sounds like something best handled outside of Solr.

If you went down the route of creating your own custom response writer, then 
you'd be locked into a single format (XML, or JSON, or which ever that you 
subclassed)

On Dec 7, 2011, at 06:48 , Finotti Simone wrote:

> Hello,
> I need to change the HTTP result code of the query result if some conditions 
> are met.
>
> Analyzing the flow of execution of Solr query process, it seems to me that 
> the "place" that fits better is the QueryResponseWriter. Anyway I didn't 
> found a way to change the HTTP request layout (I need to set 307 instead of 
> 200), so I wonder if it's possible at all with the Solr (v 3.4) plugin 
> mechanism actually provided.
>
> Any insight would be greatly appreciated J
>
> Thanks
> S

Re: Solr response writer

First, could you tell us more about your use case?   Why do you want to change 
the response code?   HTTP 307 = Temporary redirect - where are you going to 
redirect?  Sounds like something best handled outside of Solr.  

If you went down the route of creating your own custom response writer, then 
you'd be locked into a single format (XML, or JSON, or which ever that you 
subclassed)


On Dec 7, 2011, at 06:48 , Finotti Simone wrote:

> Hello,
> I need to change the HTTP result code of the query result if some conditions 
> are met.
> 
> Analyzing the flow of execution of Solr query process, it seems to me that 
> the "place" that fits better is the QueryResponseWriter. Anyway I didn't 
> found a way to change the HTTP request layout (I need to set 307 instead of 
> 200), so I wonder if it's possible at all with the Solr (v 3.4) plugin 
> mechanism actually provided.
> 
> Any insight would be greatly appreciated J
> 
> Thanks
> S

custom types file for WordDelimeterFilterFactory

2011-12-07 Thread Maurizio Piccini

Hi,
I'm actually having the exact same problem. Did you anyhow find a solution
for this?
cheers
Maurizio

Re: cache monitoring tools?

2011-12-07 Thread Tomás Fernández Löbbe

Hi Dimitry, cache information is exposed via JMX, so you should be able to
monitor that information with any JMX tool. See
http://wiki.apache.org/solr/SolrJmx

On Wed, Dec 7, 2011 at 6:19 AM, Dmitry Kan  wrote:

> Yes, we do require that much.
> Ok, thanks, I will try increasing the maxsize.
>
> On Wed, Dec 7, 2011 at 10:56 AM, pravesh  wrote:
>
> > >>facet.limit=50
> > your facet.limit seems too high. Do you actually require this much?
> >
> > Since there a lot of evictions from filtercache, so, increase the maxsize
> > value to your acceptable limit.
> >
> > Regards
> > Pravesh
> >
> > --
> > View this message in context:
> >
> http://lucene.472066.n3.nabble.com/cache-monitoring-tools-tp3566645p3566811.html
> > Sent from the Solr - User mailing list archive at Nabble.com.
> >
>
>
>
> --
> Regards,
>
> Dmitry Kan
>

Solr response writer

Hello,
I need to change the HTTP result code of the query result if some conditions 
are met.

Analyzing the flow of execution of Solr query process, it seems to me that the 
"place" that fits better is the QueryResponseWriter. Anyway I didn't found a 
way to change the HTTP request layout (I need to set 307 instead of 200), so I 
wonder if it's possible at all with the Solr (v 3.4) plugin mechanism actually 
provided.

Any insight would be greatly appreciated J

Thanks
S

Solr using very high I/O

2011-12-07 Thread Adrian Fita

Hi. I experience an issue where Solr is using huge ammounts of I/O.
Basically it uses the whole HDD continously, leaving nothing to the
other processes. Solr is called by a script which continously indexes
some files.

The index has around 800MB and I can't understand why it could trash
the HDD so much.

I could use some help on how to optimize Solr so it doesn't use so much I/O.

Thank you.
--
Fita Adrian

Re: Solr request handler queries in fiddler

Is it not possible to expose the shards to your IP and eclipse-debug the
queries via the solr frontend? If you need to intercept the queries between
frontend and shards in a non-windows environment, you could try wireshark
or tcpmon (http://ws.apache.org/commons/tcpmon/)

On Wed, Dec 7, 2011 at 10:40 AM, Kashif Khan  wrote:

> i am already using eclipse jetty for debugging but it is really hectic when
> we have shards and queries going to each shard i want to skip it and see in
> the fiddler rather.
> --
> Kashif Khan. B.E.,
> +91 99805 57379
> http://www.kashifkhan.in
>
>
>
> On Wed, Dec 7, 2011 at 12:54 PM, Dmitry Kan [via Lucene] <
> ml-node+s472066n351...@n3.nabble.com> wrote:
>
> > If you mean debugging the queries, you can use eclipse+jetty plugin setup
> > (
> > http://code.google.com/p/run-jetty-run/) with solr web app (
> >
> >
> http://hokiesuns.blogspot.com/2010/01/setting-up-apache-solr-in-eclipse.html
> > )
> >
> > On Tue, Dec 6, 2011 at 2:57 PM, Kashif Khan <[hidden email]<
> http://user/SendEmail.jtp?type=node&node=351&i=0>>
> > wrote:
> >
> > > Hi all,
> > >
> > > I have developed a solr request handler in which i am querying for
> > shards
> > > and mergin the results but i do not see any queries in the fiddler. How
> > can
> > > i track or capture the queries from the request handler in the fiddler
> > to
> > > see the queries and what setting i have to do for that. Please help me
> > out
> > > with this.
> > >
> > > Thank u in advance
> > >
> > > --
> > > View this message in context:
> > >
> >
> http://lucene.472066.n3.nabble.com/Solr-request-handler-queries-in-fiddler-tp3564260p3564260.html
> > > Sent from the Solr - User mailing list archive at Nabble.com.
> > >
> >
> >
> >
> > --
> > Regards,
> >
> > Dmitry Kan
> >
> >
> > --
> >  If you reply to this email, your message will be added to the discussion
> > below:
> >
> >
> http://lucene.472066.n3.nabble.com/Solr-request-handler-queries-in-fiddler-tp3564260p351.html
> >  To unsubscribe from Solr request handler queries in fiddler, click here<
> http://lucene.472066.n3.nabble.com/template/NamlServlet.jtp?macro=unsubscribe_by_code&node=3564260&code=dXBsaW5rMjAxMEBnbWFpbC5jb218MzU2NDI2MHwtMTgzODU3NDI3OQ==
> >
> > .
> > NAML<
> http://lucene.472066.n3.nabble.com/template/NamlServlet.jtp?macro=macro_viewer&id=instant_html%21nabble%3Aemail.naml&base=nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.view.web.template.InstantMailNamespace&breadcrumbs=instant+emails%21nabble%3Aemail.naml-instant_emails%21nabble%3Aemail.naml-send_instant_email%21nabble%3Aemail.naml
> >
> >
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Solr-request-handler-queries-in-fiddler-tp3564260p3566778.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>



-- 
Regards,

Dmitry Kan

Re: UUID field changed when document is updated

2011-12-07 Thread blaise thomson

Hi Hoss,

Thanks for getting back to me on this.

: I've been trying to use the UUIDField in solr to maintain ids of the 
>: pages I've crawled with nutch (as per 
>: http://wiki.apache.org/solr/UniqueKey). The use case is that I want to 
>: have the server able to use these ids in another database for various 
>: statistics gathering. So I want the link url to act like a primary key 
>: for determining if a page exists, and if it doesn't exist to generate a 
>: new uuid.
>
>
>i'm confused ... if you want the URL to be the primary key, then use the 
>URL as the primary key, why use the UUID Field at all?

I do use the URL as the primary key. The thing is that I want to have a fixed 
length id for the document so that I can reference it in another database. For 
example, if I want to count clicks of the url, then I was thinking of using a 
mysql database along with solr, where each document id has a count of the 
clicks. I didn't want to use the url itself in that db because of its arbitrary 
length. 


:     2. Looking at the code for UUIDField (relevant bit pasted below), it 
>: seems that the UUID is just generated randomly. There is no check if the 
>: generated UUID has already been used.
>
>
>correct ... if you specify "NEW" then it generates a new UUID for you -- 
>if you wnat to update an existing doc with an existing UUID then you need 
>to send the real, existing, value of the UUID for the doc you are 
>updating.
>
>
>: I can sort of solve this problem by generating the UUID myself, as a 
>: hash of the link url, but that doesn't help me for those random cases 
>: when the hash might happen to generate the same UUID.
>: 
>: Does anyone know if there is a way for solr to only add a uuid if the 
>: document doesn't already exist?
>
>
>I don't really understand your second sentence, but based on that first 
>sentence it sounds like what you want may be to use something like the 
>SignatureUpdateProcessor to generate a hash based on the URL...
>
>
>https://wiki.apache.org/solr/Deduplication


I didn't know actually about this, so thanks for sharing. I'm not sure it does 
exactly what I want though. I think it is more for checking if the two docs are 
the same, which for my purposes, the url works fine for. 

I think I've sort of come to realise that generating a uuid from the url might 
be the way to go. There is a chance of getting the same uuid from different 
urls, but it's only 1 in 2^128, so it's basically non-existant.

Thanks again,
Blaise

Re: cache monitoring tools?