Re: Good example of multiple tokenizers for a single field

2010-11-29 Thread Shawn Heisey

On 11/29/2010 3:15 PM, Jacob Elder wrote:

I am looking for a clear example of using more than one tokenizer for a
single source field. My application has a single "body" field which until
recently was all latin characters, but we're now encountering both English
and Japanese words in a single message. Obviously, we need to be using CJK
in addition to WhitespaceTokenizerFactory.


What I'd like to see is a CJK filter that runs after tokenization 
(whitespace in my case) and doesn't do anything but handle the CJK 
characters.  If there are no CJK characters in the token, it should do 
nothing at all.  The CJK tokenizer does a whole host of other things 
that I want to handle myself.


Shawn



Re: search strangeness

2010-11-29 Thread ramzesua

Hi, Erick. There is a defaultSearchField in my schema.xml. Can you give me your
example configuration for the text field? (What filters do you use for indexing
and for querying?)
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/search-strangeness-tp1986895p1989466.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: question about Solr SignatureUpdateProcessorFactory

2010-11-29 Thread Chris Hostetter

: Why is also the field name (* above) added to the signature
: and not only the content of the field?
: 
: By purpose or by accident?

It was definitely deliberate.  This way if your signature fields are 
"fieldA,fieldB,fieldC" then these two documents...

Doc1:fieldA:XXX
Doc1:fieldB:YYY

Doc2:fieldB:XXX
Doc2:fieldC:YYY

...don't wind up with identical signature values.

: I would like to suggest removing the field name from the signature and
: not mixing it up.

As mentioned, in the typical case it's important that the field names be 
included in the signature, but I imagine there would be cases where you 
wouldn't want them included (like a simple concat Signature for building 
basic composite keys).

I think the Signature API could definitely be enhanced to have additional 
methods for adding field names vs adding field values.

Wanna open an issue in Jira with some suggestions and use cases?


-Hoss
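
For reference, the fields that go into the signature are the ones listed in the 
processor's configuration in solrconfig.xml; a typical dedupe setup (field and 
chain names here are just illustrative) looks roughly like:

  <updateRequestProcessorChain name="dedupe">
    <processor class="solr.processor.SignatureUpdateProcessorFactory">
      <bool name="enabled">true</bool>
      <str name="signatureField">signature</str>
      <bool name="overwriteDupes">false</bool>
      <str name="fields">fieldA,fieldB,fieldC</str>
      <str name="signatureClass">solr.processor.Lookup3Signature</str>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory"/>
    <processor class="solr.RunUpdateProcessorFactory"/>
  </updateRequestProcessorChain>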


Re: Spell checking question from a Solr novice

2010-11-29 Thread Bill Dueber
On Mon, Oct 18, 2010 at 5:24 PM, Jason Blackerby wrote:

> If you know the misspellings you could prevent them from being added to the
> dictionary with a StopFilterFactory like so:
>
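
(Presumably something along the lines of the following filter in the spellcheck 
source field's analyzer, with the known misspellings listed in the stopword 
file - the file name here is just an example:

  <filter class="solr.StopFilterFactory" words="misspellings.txt" ignoreCase="true"/>
)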


Or, you know, correct the data :-)

-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


Re: solr admin

2010-11-29 Thread Yonik Seeley
On Mon, Nov 29, 2010 at 8:02 PM, Ahmet Arslan  wrote:
>> in Solr admin (http://localhost:8180/services/admin/)
>> I can specify something like:
>>
>> +category_id:200 +xxx:300
>>
>> but how can I specify a sort option?
>>
>> sort:category_id+asc
>
> There is an [FULL INTERFACE] /admin/form.jsp link but it does not have sort 
> option. It seems that you need to append it to your search url.

Heh - yeah... that's an old interface, from the times when sort was
specified along with the query.
Can someone provide a patch to add a way to specify the sort?

-Yonik
http://www.lucidimagination.com


RE: solr admin

2010-11-29 Thread Ahmet Arslan
> in Solr admin (http://localhost:8180/services/admin/)
> I can specify something like:
> 
> +category_id:200 +xxx:300
> 
> but how can I specify a sort option?
> 
> sort:category_id+asc

There is an [FULL INTERFACE] /admin/form.jsp link but it does not have sort 
option. It seems that you need to append it to your search url.
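For example (assuming the standard select handler under the same context path):

http://localhost:8180/services/select?q=%2Bcategory_id%3A200+%2Bxxx%3A300&sort=category_id+asc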


  


Re: Solr DataImportHandler (DIH) and Cassandra

2010-11-29 Thread Mark
The DataSource subclass route is what I will probably be interested in. 
Are there any working examples of this already out there?


On 11/29/10 12:32 PM, Aaron Morton wrote:

AFAIK there is nothing pre-written to pull the data out for you.

You should be able to create your own DataSource subclass 
(http://lucene.apache.org/solr/api/org/apache/solr/handler/dataimport/DataSource.html), 
using the Hector Java library to pull data from Cassandra.


I'm guessing you will need to consider how to perform delta imports. 
Perhaps using the secondary indexes in 0.7* , or maintaining your own 
queues or indexes to know what has changed.


There is also the Lucandra project; not exactly what you're after, but it 
may be of interest anyway: https://github.com/tjake/Lucandra


Hope that helps.
Aaron


On 30 Nov, 2010,at 05:04 AM, Mark  wrote:


Is there any way to use DIH to import from Cassandra? Thanks


RE: special sorting

2010-11-29 Thread Papp Richard
Hmm, any clue how to use it? Use the location_id somehow?

thanks,
  Rich

-Original Message-
From: Tommaso Teofili [mailto:tommaso.teof...@gmail.com] 
Sent: Monday, November 29, 2010 22:08
To: solr-user@lucene.apache.org
Subject: Re: special sorting

Perhaps, depending on your domain logic you could use function queries to
achieve that.
http://wiki.apache.org/solr/FunctionQuery#Sort_By_Function
Regards,
Tommaso

2010/11/29 Papp Richard 

> Hello,
>
>  I have many pages with the same content in the search result (the result
> is the same for some of the cities from the same county)... which means
> that
> I have duplicate content.
>
>  the filter query is something like: +locationId:(60 26a 39a) - for city
> with ID 60
>  and I get the same result for city with ID 62: +locationId:(62 26a 39a)
> (cityID, countyID, countryID)
>
>  how could I use a sorting to have different docs order in results for
> different cities?
>  (for the same city I need to have the same sort order always - it cannot
> be a simple random...)
>
>  could I use somehow the cityID parameter as boost or score ? I tried but
> could't realise too much.
>
> thanks,
>  Rich
>
>



RE: solr admin

2010-11-29 Thread Papp Richard
in Solr admin (http://localhost:8180/services/admin/)
I can specify something like:

+category_id:200 +xxx:300

but how can I specify a sort option?

sort:category_id+asc

regards,
  Rich

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Monday, November 29, 2010 22:00
To: solr-user@lucene.apache.org
Subject: Re: solr admin

I honestly don't understand what you're asking here. Specify what
in solr admin other than fields? what is it you're trying to accomplish?

Best
Erick

On Mon, Nov 29, 2010 at 2:56 PM, Papp Richard  wrote:

> Hello,
>
>  is there any way to specify in the solr admin other than fields? and I'm
> nt talking about the full interface which is also very limited.
>
>  like: score, fl, fq, ...
>
>  and yes, I know that I can use the url... which indeed is not too handy.
>
> thanks,
>  Rich
>
>



Re: Good example of multiple tokenizers for a single field

2010-11-29 Thread Robert Muir
On Mon, Nov 29, 2010 at 5:41 PM, Jonathan Rochkind  wrote:
>
> * As a tokenizer, I use the WhitespaceTokenizer.
>
> * Then I apply a custom filter that looks for CJK chars, and re-tokenizes
> any CJK chars into one-token-per-char. This custom filter was written by
> someone other than me; it is open source; but I'm not sure if it's actually
> in a public repo, or how well documented it is.  I can put you in touch with
> the author to try and ask. There may also be a more standard filter other
> than the custom one I'm using that does the same thing?
>

You are describing what standardtokenizer does.


Re: Good example of multiple tokenizers for a single field

2010-11-29 Thread Jonathan Rochkind
You can only use one tokenizer on a given field, I think. But a tokenizer 
isn't in fact the only thing that can tokenize; an ordinary filter can 
change tokenization too, so you could use two filters in a row.


You could also write your own custom tokenizer that does what you want, 
although I'm not entirely sure that turning exactly what you describe into 
code will actually do what you want; I think it's more complicated. I think 
you'll need a tokenizer that looks for contiguous blocks of bytes that are 
UTF-8 CJK and does one thing to them, and contiguous blocks of bytes that 
are not UTF-8 CJK and does another thing to them, rather than just "first 
do one to the whole string and then do the other."


Dealing with mixed-language fields is tricky; I know of no good 
general-purpose solutions, in part just because of the semantics involved.


If you have some strings for the field you know are CJK, and others you 
know are English, the easiest thing to do is NOT put them in the same 
field, but put them in different fields, and use dismax (for example) to 
search both fields on query.  But if you can't even tell at index time 
which is which, or if you have strings that themselves include both CJK 
and English interspersed with each other, that might not work.


For my own case, where everything is just interspersed in the fields and 
I don't really know what language it is, here's what I do, which is 
definitely not great for CJK, but is better than nothing:


* As a tokenizer, I use the WhitespaceTokenizer.

* Then I apply a custom filter that looks for CJK chars, and 
re-tokenizes any CJK chars into one-token-per-char. This custom filter 
was written by someone other than me; it is open source; but I'm not 
sure if it's actually in a public repo, or how well documented it is.  I 
can put you in touch with the author to try and ask. There may also be a 
more standard filter other than the custom one I'm using that does the 
same thing?


Jonathan

On 11/29/2010 5:30 PM, Jacob Elder wrote:

The problem is that the field is not guaranteed to contain just a single
language. I'm looking for some way to pass it first through CJK, then
Whitespace.

If I'm totally off-target here, is there a recommended way of dealing with
mixed-language fields?

On Mon, Nov 29, 2010 at 5:22 PM, Markus Jelsma
wrote:


You can use only one tokenizer per analyzer. You'd better use separate
fields +
fieldTypes for different languages.


I am looking for a clear example of using more than one tokenizer for a
source single field. My application has a single "body" field which until
recently was all latin characters, but we're now encountering both

English

and Japanese words in a single message. Obviously, we need to be using

CJK

in addition to WhitespaceTokenizerFactory.

I've found some references to using copyFields or NGrams but I can't

quite

grasp what the whole solution would look like.





Re: Good example of multiple tokenizers for a single field

2010-11-29 Thread Robert Muir
On Mon, Nov 29, 2010 at 5:35 PM, Jacob Elder  wrote:
> StandardTokenizer doesn't handle some of the tokens we need, like
> @twitteruser, and as far as I can tell, doesn't handle Chinese, Japanese or
> Korean. Am I wrong about that?

It uses the unigram method for CJK ideographs... the CJKTokenizer just
uses the bigram method; it's just an alternative approach.

The whitespace tokenizer doesn't work at all for CJK though, so give up on that!


Re: Good example of multiple tokenizers for a single field

2010-11-29 Thread Jacob Elder
StandardTokenizer doesn't handle some of the tokens we need, like
@twitteruser, and as far as I can tell, doesn't handle Chinese, Japanese or
Korean. Am I wrong about that?

On Mon, Nov 29, 2010 at 5:31 PM, Robert Muir  wrote:

> On Mon, Nov 29, 2010 at 5:30 PM, Jacob Elder  wrote:
> > The problem is that the field is not guaranteed to contain just a single
> > language. I'm looking for some way to pass it first through CJK, then
> > Whitespace.
> >
> > If I'm totally off-target here, is there a recommended way of dealing
> with
> > mixed-language fields?
> >
>
> maybe you should consider a tokenizer like StandardTokenizer, that
> works reasonably well for most languages.
>



-- 
Jacob Elder
@jelder
(646) 535-3379


Re: Good example of multiple tokenizers for a single field

2010-11-29 Thread Robert Muir
On Mon, Nov 29, 2010 at 5:30 PM, Jacob Elder  wrote:
> The problem is that the field is not guaranteed to contain just a single
> language. I'm looking for some way to pass it first through CJK, then
> Whitespace.
>
> If I'm totally off-target here, is there a recommended way of dealing with
> mixed-language fields?
>

maybe you should consider a tokenizer like StandardTokenizer, that
works reasonably well for most languages.
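
A minimal fieldType along those lines (the name is arbitrary) would be
something like:

  <fieldType name="text_intl" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>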


Re: Good example of multiple tokenizers for a single field

2010-11-29 Thread Jacob Elder
The problem is that the field is not guaranteed to contain just a single
language. I'm looking for some way to pass it first through CJK, then
Whitespace.

If I'm totally off-target here, is there a recommended way of dealing with
mixed-language fields?

On Mon, Nov 29, 2010 at 5:22 PM, Markus Jelsma
wrote:

> You can use only one tokenizer per analyzer. You'd better use separate
> fields +
> fieldTypes for different languages.
>
> > I am looking for a clear example of using more than one tokenizer for a
> > source single field. My application has a single "body" field which until
> > recently was all latin characters, but we're now encountering both
> English
> > and Japanese words in a single message. Obviously, we need to be using
> CJK
> > in addition to WhitespaceTokenizerFactory.
> >
> > I've found some references to using copyFields or NGrams but I can't
> quite
> > grasp what the whole solution would look like.
>



-- 
Jacob Elder
@jelder
(646) 535-3379


Re: Good example of multiple tokenizers for a single field

2010-11-29 Thread Markus Jelsma
You can use only one tokenizer per analyzer. You'd better use separate fields + 
fieldTypes for different languages.

> I am looking for a clear example of using more than one tokenizer for a
> source single field. My application has a single "body" field which until
> recently was all latin characters, but we're now encountering both English
> and Japanese words in a single message. Obviously, we need to be using CJK
> in addition to WhitespaceTokenizerFactory.
> 
> I've found some references to using copyFields or NGrams but I can't quite
> grasp what the whole solution would look like.
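
To sketch the separate-fields approach (field and type names here are made up),
combined with the copyField idea from the original mail:

  <field name="body"     type="string"   indexed="false" stored="true"/>
  <field name="body_en"  type="text_en"  indexed="true"  stored="false"/>
  <field name="body_cjk" type="text_cjk" indexed="true"  stored="false"/>
  <copyField source="body" dest="body_en"/>
  <copyField source="body" dest="body_cjk"/>

and then query both fields at once, e.g. with dismax and qf=body_en body_cjk.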


Termvector based result grouping / field collapsing?

2010-11-29 Thread Shawn Heisey
I was just in a meeting where we discussed customer feedback on our 
website.  One thing that the users would like to see is "galleries" 
where photos that are part of a set are grouped together under a single 
result.  This is basically field collapsing.


The problem I've got is that for most of our content, there's nothing to 
tie different photos together in a coherent way other than similar 
language in fields like the caption.  Is it feasible to use termvector 
information to automatically group documents with similar (but not 
identical) data in one or more fields?


Thanks,
Shawn



Good example of multiple tokenizers for a single field

2010-11-29 Thread Jacob Elder
I am looking for a clear example of using more than one tokenizer for a
single source field. My application has a single "body" field which until
recently was all latin characters, but we're now encountering both English
and Japanese words in a single message. Obviously, we need to be using CJK
in addition to WhitespaceTokenizerFactory.

I've found some references to using copyFields or NGrams but I can't quite
grasp what the whole solution would look like.

-- 
Jacob Elder
@jelder
(646) 535-3379


Re: DIH causing "shutdown hook executing"?

2010-11-29 Thread Erick Erickson
Try without autocommit or bump the limit up considerably to see
if it changes the behavior. You should not be getting
this kind of performance hit after the first million  docs, so, it's
probably worth exploring.

See if you can find anything in your logs that indicates what's
hogging the critical resource maybe?

Best
Erick

On Mon, Nov 29, 2010 at 3:08 PM, Phong Dais  wrote:

> It is entirely possible that the server is asking solr to shutdown.  I'll
> have to ask the admin.
> I'm running Solr-1.4 inside of Jetty.  I definitely have enough disk space.
> I think I did notice solr shutting down while it was idle.  I just
> disregarded it as a fluke...  Perhaps there's something going on.
> I will try to run this inside of tomcat and see what happens.
>
> Not sure if this is related but I had to change the  to single
> instead of the default "native".
> With native, I get a lock time out when starting up solr.  I also have
>  set to 1.  I did not want to have millions of uncommitted
> docs.
> I'm running under Linux RedHat.
>
> Regarding speed, the first million or so documents is done very quickly
> (maybe 3 hrs) but after that, things slows down tremendously.
>
> Thanks for the advice regarding solrj.  I'll definitely look into that.
>
> P.
>
>
> On Mon, Nov 29, 2010 at 2:39 PM, Erick Erickson  >wrote:
>
> > You're right, the OS is asking the server to shut down.  In the default
> > example under Jetty, this is a result of issuing a crtl-c. Is it possible
> > that something is asking your server to quit? What servlet container
> > are you running under? Does the Solr server run for more than this
> > period if you're NOT indexing? And are you sure you have enough
> > resources, especially disk space?
> >
> > On another note, I'm surprised that it's taking 2 days to index 5m
> > documents.
> > That's less than 30 docs/second and Solr should handle a considerably
> > greater load than that. For whatever that's worth...
> >
> > And what version of Solr are you using? You may want to consider
> > writing something in SolrJ to do your indexing, it'll provide you more
> > flexible control over indexing than DIH..
> >
> > Best
> > Erick
> >
> > On Mon, Nov 29, 2010 at 1:20 PM, Phong Dais 
> wrote:
> >
> > > Hi,
> > >
> > > I am in the process of trying to index about 50 mil documents using the
> > > data
> > > import handler.
> > > For some reason, about 2 days into the import, I see this message
> > "shutdown
> > > hook executing" in the log and the solr web server instance exits
> > > "gracefully".
> > > I do not see any errors in the entire log.  This has happened twice
> now,
> > > usually 5 mil or so documents into the import process.
> > >
> > > Does anyone out there knows what this message mean?  It's an INFO log
> > > message so I don't think it is caused by any error.
> > > Does this problem occur because the os is asking the server to shut
> down
> > > (for whatever reason) or is there something wrong with the server
> causing
> > > it
> > > to shutdown?
> > >
> > > Thanks for any help,
> > > Phong
> > >
> >
>


"Bad file descriptor" Errors

2010-11-29 Thread John Williams
Recently, we have started to get "Bad file descriptor" errors in one of our 
Solr instances. This instance is a searcher and its index is stored on a local 
SSD. The master, however, has its index stored on NFS, which seems to be working 
fine, currently. 

I have tried restarting tomcat and bringing over the index fresh from the 
master (via snappull/snapinstall). 

Any help would be greatly appreciated.

Thanks,
John


SEVERE: Exception during commit/optimize:java.lang.RuntimeException: 
java.io.FileNotFoundException: /u/solr/data/index/_w3vs.fnm (Bad file 
descriptor)
at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:371)
at 
org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:512)
at org.apache.solr.core.SolrCore.update(SolrCore.java:771)
at 
org.apache.solr.servlet.SolrUpdateServlet.doPost(SolrUpdateServlet.java:53)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:637)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:717)
at 
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:290)
at 
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
at 
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
at 
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
at 
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
at 
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
at 
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
at 
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298)
at 
org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:852)
at 
org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:588)
at 
org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
at java.lang.Thread.run(Thread.java:662)



Re: special sorting

2010-11-29 Thread Tommaso Teofili
Perhaps, depending on your domain logic you could use function queries to
achieve that.
http://wiki.apache.org/solr/FunctionQuery#Sort_By_Function
Regards,
Tommaso

2010/11/29 Papp Richard 

> Hello,
>
>  I have many pages with the same content in the search result (the result
> is the same for some of the cities from the same county)... which means
> that
> I have duplicate content.
>
>  the filter query is something like: +locationId:(60 26a 39a) - for city
> with ID 60
>  and I get the same result for city with ID 62: +locationId:(62 26a 39a)
> (cityID, countyID, countryID)
>
>  how could I use a sorting to have different docs order in results for
> different cities?
>  (for the same city I need to have the same sort order always - it cannot
> be a simple random...)
>
>  could I use somehow the cityID parameter as boost or score ? I tried but
> could't realise too much.
>
> thanks,
>  Rich
>
>


Re: DIH causing "shutdown hook executing"?

2010-11-29 Thread Phong Dais
It is entirely possible that the server is asking solr to shutdown.  I'll
have to ask the admin.
I'm running Solr-1.4 inside of Jetty.  I definitely have enough disk space.
I think I did notice solr shutting down while it was idle.  I just
disregarded it as a fluke...  Perhaps there's something going on.
I will try to run this inside of tomcat and see what happens.

Not sure if this is related but I had to change the lockType to single
instead of the default "native".
With native, I get a lock time out when starting up solr.  I also have
 set to 1.  I did not want to have millions of uncommitted
docs.
I'm running under Linux RedHat.

Regarding speed, the first million or so documents is done very quickly
(maybe 3 hrs) but after that, things slow down tremendously.

Thanks for the advice regarding solrj.  I'll definitely look into that.

P.


On Mon, Nov 29, 2010 at 2:39 PM, Erick Erickson wrote:

> You're right, the OS is asking the server to shut down.  In the default
> example under Jetty, this is a result of issuing a crtl-c. Is it possible
> that something is asking your server to quit? What servlet container
> are you running under? Does the Solr server run for more than this
> period if you're NOT indexing? And are you sure you have enough
> resources, especially disk space?
>
> On another note, I'm surprised that it's taking 2 days to index 5m
> documents.
> That's less than 30 docs/second and Solr should handle a considerably
> greater load than that. For whatever that's worth...
>
> And what version of Solr are you using? You may want to consider
> writing something in SolrJ to do your indexing, it'll provide you more
> flexible control over indexing than DIH..
>
> Best
> Erick
>
> On Mon, Nov 29, 2010 at 1:20 PM, Phong Dais  wrote:
>
> > Hi,
> >
> > I am in the process of trying to index about 50 mil documents using the
> > data
> > import handler.
> > For some reason, about 2 days into the import, I see this message
> "shutdown
> > hook executing" in the log and the solr web server instance exits
> > "gracefully".
> > I do not see any errors in the entire log.  This has happened twice now,
> > usually 5 mil or so documents into the import process.
> >
> > Does anyone out there knows what this message mean?  It's an INFO log
> > message so I don't think it is caused by any error.
> > Does this problem occur because the os is asking the server to shut down
> > (for whatever reason) or is there something wrong with the server causing
> > it
> > to shutdown?
> >
> > Thanks for any help,
> > Phong
> >
>


special sorting

2010-11-29 Thread Papp Richard
Hello,

  I have many pages with the same content in the search result (the result
is the same for some of the cities from the same county)... which means that
I have duplicate content.

  the filter query is something like: +locationId:(60 26a 39a) - for city
with ID 60
  and I get the same result for city with ID 62: +locationId:(62 26a 39a)
(cityID, countyID, countryID)

  how could I use sorting to get a different document order in the results for
different cities?
  (for the same city I need to have the same sort order always - it cannot
be simply random...)

  could I somehow use the cityID parameter as a boost or in the score? I tried but
couldn't achieve much.

thanks,
  Rich
 




Re: solr admin

2010-11-29 Thread Erick Erickson
I honestly don't understand what you're asking here. Specify what
in solr admin other than fields? what is it you're trying to accomplish?

Best
Erick

On Mon, Nov 29, 2010 at 2:56 PM, Papp Richard  wrote:

> Hello,
>
>  is there any way to specify in the solr admin other than fields? and I'm
> nt talking about the full interface which is also very limited.
>
>  like: score, fl, fq, ...
>
>  and yes, I know that I can use the url... which indeed is not too handy.
>
> thanks,
>  Rich
>
>


solr admin

2010-11-29 Thread Papp Richard
Hello,

  is there any way to specify in the solr admin other than fields? and I'm
not talking about the full interface, which is also very limited.

  like: score, fl, fq, ...

  and yes, I know that I can use the url... which indeed is not too handy.

thanks,
  Rich
 




Re: DIH causing "shutdown hook executing"?

2010-11-29 Thread Erick Erickson
You're right, the OS is asking the server to shut down.  In the default
example under Jetty, this is a result of issuing a ctrl-c. Is it possible
that something is asking your server to quit? What servlet container
are you running under? Does the Solr server run for more than this
period if you're NOT indexing? And are you sure you have enough
resources, especially disk space?

On another note, I'm surprised that it's taking 2 days to index 5m
documents.
That's less than 30 docs/second and Solr should handle a considerably
greater load than that. For whatever that's worth...

And what version of Solr are you using? You may want to consider
writing something in SolrJ to do your indexing, it'll provide you more
flexible control over indexing than DIH..

Best
Erick

On Mon, Nov 29, 2010 at 1:20 PM, Phong Dais  wrote:

> Hi,
>
> I am in the process of trying to index about 50 mil documents using the
> data
> import handler.
> For some reason, about 2 days into the import, I see this message "shutdown
> hook executing" in the log and the solr web server instance exits
> "gracefully".
> I do not see any errors in the entire log.  This has happened twice now,
> usually 5 mil or so documents into the import process.
>
> Does anyone out there knows what this message mean?  It's an INFO log
> message so I don't think it is caused by any error.
> Does this problem occur because the os is asking the server to shut down
> (for whatever reason) or is there something wrong with the server causing
> it
> to shutdown?
>
> Thanks for any help,
> Phong
>


Re: Spellcheck in solr-nutch integration

2010-11-29 Thread Anurag

I solved the problem. All we need to do is modify the schema file.

Also, the spellcheck index is first created when spellcheck.build=true is set.

-
Kumar Anurag

-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Spellcheck-in-solr-nutch-integration-tp1953232p1988252.html
Sent from the Solr - User mailing list archive at Nabble.com.


R: Solr Hot Backup

2010-11-29 Thread Rodolico Piero
Yes, I use the replication only for backup with this call:

http://host:8080/solr/replication?command=backup&location=/home/jboss/backup 

It works fine, but the server must always be up... it's an HTTP call...
I also tried the 'backup' script, but it creates hard links and those are not 
recommended!


-Original Message-
From: Jonathan Rochkind [mailto:rochk...@jhu.edu] 
Sent: Monday, November 29, 2010 19:22
To: solr-user@lucene.apache.org
Subject: Re: Solr Hot Backup

In Solr 1.4, I think the replication features should be able to 
accomplish your goal, and will be easier to use and more robust.

On 11/29/2010 10:22 AM, Upayavira wrote:
> As I understand it, those tools are more Solr 1.3 related, but I don't
> see why they shouldn't work on 1.4.
>
> I would say it is very unlikely that you will corrupt an index with
> them.
>
> Lucene indexes are "write once", that is, any one index file will never
> be updated, only replaced. This means that taking a backup is actually
> exceptionally easy, as (on Unix at least) you can create a copy of the
> index directory with hard links, which takes milliseconds, even for
> multi-gigabyte indexes. You just need to make sure you are not
> committing while you take your backup, and it looks like those tools
> will take care of that for you.
>
> Another perk is that your backups won't take any additional disk space
> (just the space for the directory data, not the files themselves). As
> your index changes, disk usage will gradually increase though.
>
> Upayavira
>
> On Mon, 29 Nov 2010 16:13 +0100, "Rodolico Piero"
>   wrote:
>> Hi all,
>>
>> How can I backup indexes Solr without stopping the server?
>>
>> I saw the following link:
>>
>>
>>
>> http://wiki.apache.org/solr/SolrOperationsTools
>> 
>>
>> http://wiki.apache.org/solr/CollectionDistribution
>>
>>
>>
>> but I'm afraid that running these scripts 'on the fly' indexes could be
>> corrupted.
>>
>> Thanks,
>>
>> Piero.
>>
>>
>>
>>
>>


Re: Solr Hot Backup

2010-11-29 Thread Jonathan Rochkind
In Solr 1.4, I think the replication features should be able to 
accomplish your goal, and will be easier to use and more robust.


On 11/29/2010 10:22 AM, Upayavira wrote:

As I understand it, those tools are more Solr 1.3 related, but I don't
see why they shouldn't work on 1.4.

I would say it is very unlikely that you will corrupt an index with
them.

Lucene indexes are "write once", that is, any one index file will never
be updated, only replaced. This means that taking a backup is actually
exceptionally easy, as (on Unix at least) you can create a copy of the
index directory with hard links, which takes milliseconds, even for
multi-gigabyte indexes. You just need to make sure you are not
committing while you take your backup, and it looks like those tools
will take care of that for you.

Another perk is that your backups won't take any additional disk space
(just the space for the directory data, not the files themselves). As
your index changes, disk usage will gradually increase though.

Upayavira

On Mon, 29 Nov 2010 16:13 +0100, "Rodolico Piero"
  wrote:

Hi all,

How can I backup indexes Solr without stopping the server?

I saw the following link:



http://wiki.apache.org/solr/SolrOperationsTools


http://wiki.apache.org/solr/CollectionDistribution



but I'm afraid that running these scripts 'on the fly' indexes could be
corrupted.

Thanks,

Piero.







DIH causing "shutdown hook executing"?

2010-11-29 Thread Phong Dais
Hi,

I am in the process of trying to index about 50 mil documents using the data
import handler.
For some reason, about 2 days into the import, I see this message "shutdown
hook executing" in the log and the solr web server instance exits
"gracefully".
I do not see any errors in the entire log.  This has happened twice now,
usually 5 mil or so documents into the import process.

Does anyone out there know what this message means?  It's an INFO log
message so I don't think it is caused by any error.
Does this problem occur because the OS is asking the server to shut down
(for whatever reason), or is there something wrong with the server causing it
to shut down?

Thanks for any help,
Phong


Re: Preventing index segment corruption when windows crashes

2010-11-29 Thread Yonik Seeley
On Mon, Nov 29, 2010 at 10:46 AM, Peter Sturge  wrote:
> If a Solr index is running at the time of a system halt, this can
> often corrupt a segments file, requiring the index to be -fix'ed by
> rewriting the offending file.

Really?  That shouldn't be possible (if you mean the index is truly
corrupt - i.e. you can't open it).

-Yonik
http://www.lucidimagination.com


Re: Large Hdd-Space using during commit/optimize

2010-11-29 Thread Upayavira
On Mon, 29 Nov 2010 08:43 -0800, "stockii"  wrote:
> 
> aha okay. thx
> 
> i dont know that solr copys the complete index for optimize. can i solr
> say,
> that he start an optimize, but wihtout copy ? 

No.

The copy is to keep an index available for searches while the optimise
is happening.

Also, to allow for rollback should something go wrong with the optimise.

The simplest thing is to keep your commits low (I suspect you could
ingest 35m documents with just one commit at the end).

In that case, optimisation is not required (optimisation is to reduce
the number of segments in your index, and segments are created by
commits). If you don't do many commits, you won't need to optimise - at
least not at the point of initial ingestion.

Upayavira


Re: Large Hdd-Space using during commit/optimize

2010-11-29 Thread stockii

Aha okay, thanks.

I didn't know that Solr copies the complete index for an optimize. Can I tell
Solr to start an optimize, but without the copy?
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Large-Hdd-Space-using-during-commit-optimize-tp1985807p1987477.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: Boost on newer documents

2010-11-29 Thread Jason Brown
Great - Thank You.


-Original Message-
From: Mat Brown [mailto:m...@patch.com]
Sent: Mon 29/11/2010 16:33
To: solr-user@lucene.apache.org
Subject: Re: Boost on newer documents
 
Hi Jason,

You can use boost functions in the dismax handler to do this:

http://wiki.apache.org/solr/DisMaxQParserPlugin#bf_.28Boost_Functions.29

Mat

On Mon, Nov 29, 2010 at 11:28, Jason Brown  wrote:
>
> Hi,
>
> I use the dismax query to search across several fields.
>
> I find I have a lot of documents with the same document name (one of the 
> fields that the dismax queries) so I wanted to adjust the relevance so that 
> titles with a newer published date have a higher relevance than documents 
> with the same title but are older. Does anyone know how I can achieve this?
>
> Thank You
>
> Jason.
>
> If you wish to view the St. James's Place email disclaimer, please use the 
> link below
>
> http://www.sjp.co.uk/portal/internet/SJPemaildisclaimer
>


If you wish to view the St. James's Place email disclaimer, please use the link 
below

http://www.sjp.co.uk/portal/internet/SJPemaildisclaimer


Re: Boost on newer documents

2010-11-29 Thread Mat Brown
Hi Jason,

You can use boost functions in the dismax handler to do this:

http://wiki.apache.org/solr/DisMaxQParserPlugin#bf_.28Boost_Functions.29

Mat
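
For example, with dismax, something along these lines (assuming a date field
such as published_date holding the publish date):

...&defType=dismax&qf=title+body&bf=recip(ms(NOW,published_date),3.16e-11,1,1)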

On Mon, Nov 29, 2010 at 11:28, Jason Brown  wrote:
>
> Hi,
>
> I use the dismax query to search across several fields.
>
> I find I have a lot of documents with the same document name (one of the 
> fields that the dismax queries) so I wanted to adjust the relevance so that 
> titles with a newer published date have a higher relevance than documents 
> with the same title but are older. Does anyone know how I can achieve this?
>
> Thank You
>
> Jason.
>
> If you wish to view the St. James's Place email disclaimer, please use the 
> link below
>
> http://www.sjp.co.uk/portal/internet/SJPemaildisclaimer
>


Re: Boost on newer documents

2010-11-29 Thread Stefan Matheis
Hi Jason,

maybe, just use another field w/ creation-/modification-date and boost on
this field?

Regards
Stefan

On Mon, Nov 29, 2010 at 5:28 PM, Jason Brown  wrote:

>
> Hi,
>
> I use the dismax query to search across several fields.
>
> I find I have a lot of documents with the same document name (one of the
> fields that the dismax queries) so I wanted to adjust the relevance so that
> titles with a newer published date have a higher relevance than documents
> with the same title but are older. Does anyone know how I can achieve this?
>
> Thank You
>
> Jason.
>
> If you wish to view the St. James's Place email disclaimer, please use the
> link below
>
> http://www.sjp.co.uk/portal/internet/SJPemaildisclaimer
>


Boost on newer documents

2010-11-29 Thread Jason Brown

Hi,

I use the dismax query to search across several fields.

I find I have a lot of documents with the same document name (one of the fields 
that the dismax queries) so I wanted to adjust the relevance so that titles 
with a newer published date have a higher relevance than documents with the 
same title but are older. Does anyone know how I can achieve this?

Thank You

Jason.

If you wish to view the St. James's Place email disclaimer, please use the link 
below

http://www.sjp.co.uk/portal/internet/SJPemaildisclaimer


bf for Dismax completly ignored by 'recip(ms(NOW,INDAT),3.16e-11,1,1)'

2010-11-29 Thread rall0r

Hello,
I've got a problem that I'm unable to solve: as mentioned in the docs, I put
"recip(ms(NOW,INDAT),3.16e-11,1,1)" in the boost-function field "bf".
That is completely ignored by the dismax SearchHandler.

The dismax SearchHandler is set to be the default SearchHandler.
If I post a "solr/select?q={!boost
b=recip(ms(NOW,INDAT),3.16e-11,1,1)}SearchTerm" to the solr-Server, the
request is answered as expected, while doing the same through the PHP client
fails completely.

The solrconfig looks like:

recip(ms(NOW,INDAT),3.16e-11,1,1)

Maybe someone has an idea?
Thanks a lot!
Ralf
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/bf-for-Dismax-completly-ignored-by-recip-ms-NOW-INDAT-3-16e-11-1-1-tp1987228p1987228.html
Sent from the Solr - User mailing list archive at Nabble.com.


Solr DataImportHandler (DIH) and Cassandra

2010-11-29 Thread Mark

Is there any way to use DIH to import from Cassandra? Thanks


Re: search strangeness

2010-11-29 Thread Erick Erickson
On a quick look with Solr 3.1, these results are puzzling. Are you
sure that you are searching the field you think you are? I take it you're
searching the "text" field, but that's controlled by your

entry in schema.xml.

Try using the admin page, particularly the "full interface" link and
turn debugging on, that should give you a better idea of what
is actually being searched. Another admin page that's very useful
is the analysis page, that'll show you exactly what transformations
are made to your terms at index and query time and why.

I'm a little suspicious that you've put the stopword filter in a different
place in the index and query process, but I doubt that
is a problem. The analysis page will help with that too.

But nothing really jumps out at me, if you don't get anywhere with the
admin page, perhaps you can show us the field definitions for the
name, caption and text fields (not the type, the actual field entries
in the schema).

Also, please post the results of appending &debugQuery=on to the request.

Best
Erick

On Mon, Nov 29, 2010 at 10:06 AM, ramzesua  wrote:

>
> Hi all. I have a little question. Can anyone explain, why this solr search
> work so strange? :)
> For example, I make schema.xml:
> I add some fields with fieldType = text. Here 'text' properties
> 
>  
>
> words="stopwords.txt"/>
> generateWordParts="1" generateNumberParts="0" catenateWords="0"
> catenateNumbers="0" catenateAll="0"/>
>
> protected="protwords.txt"/>
>
>  
>
>  
>
> generateWordParts="1" generateNumberParts="1" catenateWords="0"
> catenateNumbers="0" catenateAll="0"/>
> ignoreCase="true" expand="true"/>
> words="stopwords.txt"/>
>
> protected="protwords.txt"/>
>
>  
>
> I copied to text field all my fields:
> 
> 
>
>
> Then I add one document to my index. Here schema browser for field
> 'caption':
>
> _term___frequency_
> |annual |1 |
> |golfer |1 |
> |tournament |1 |
> |welcom |1 |
> |3rd|1 |
>
> After that I tried to find this document by terms:
> annual - no results
> golfer  - found document
> tournament - no results
> welcom - found document
> 3rd - no results
>
> I read a lot of forums, some books and http://wiki.apache.org/solr/
> but
> it don't help me.
> Can anyone explain me, why solr search so strange? Or where is my problem?
> Thank you ...
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/search-strangeness-tp1986895p1986895.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Preventing index segment corruption when windows crashes

2010-11-29 Thread Peter Sturge
Hi,

With the advent of new windows versions, there are increasing
instances of system blue-screens, crashes, freezes and ad-hoc
failures.
If a Solr index is running at the time of a system halt, this can
often corrupt a segments file, requiring the index to be -fix'ed by
rewriting the offending file.
Aside from the vagaries of automating such fixes, depending on the
mergeFactor, this can be quite a few documents permanently lost.

Would anyone have any experience/wisdom/insight on ways to mitigate
such corruption in Lucene/Solr - e.g. applying a temp file technique
etc.; though perhaps not 'just use Linux'.. :-)
There are of course, client-side measures that can hold some number of
pending documents until they are truly committed, but a
server-side/Lucene method would be perferable, if possible.

Thanks,
Peter


BasicHelloRequestHandler plugin

2010-11-29 Thread Hong-Thai Nguyen
Hi,

Thanks for helping us.

I'm creating a 'helloworld' plugin in Solr 1.4, in BasicHelloRequestHandler.java.

In solrconfig.xml, I added:

   Default message 
   -10 

I verified the 'hello' plugin is configured correctly at: 
http://localhost:8983/solr/admin/plugins
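
For a handler like this, the registration in solrconfig.xml would look
something like the following (the parameter names "message" and "anumber" are
assumptions based on the handler code below; the default values match the
ones above):

  <requestHandler name="hello" class="BasicHelloRequestHandler">
    <lst name="defaults">
      <str name="message">Default message</str>
      <int name="anumber">-10</int>
    </lst>
  </requestHandler>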

 

When I executed http://localhost:8983/solr/select?qt=hello, a
java.lang.AbstractMethodError was raised:

type Status report

message null java.lang.AbstractMethodError at 
org.apache.solr.core.SolrCore.execute(SolrCore.java:1316) at 
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338) 
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
 at 
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
 at 
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
 at 
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
 at 
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
 at 
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128) 
at 
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102) 
at 
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
 at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:293) 
at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:849) 
at 
org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:583)
 at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:454) at 
java.lang.Thread.run(Thread.java:595) 

I suppose that handleRequest in the BasicHelloRequestHandler isn't being called.

Here's the BasicHelloRequestHandler.java code:

import com.polyspot.mercury.common.params.HelloParams;
import org.apache.solr.common.SolrException;
import org.apache.solr.common.params.SolrParams;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.common.util.SimpleOrderedMap;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.request.SolrRequestHandler;
import org.apache.solr.response.SolrQueryResponse;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.net.URL;

/**
 * User: nguyenht
 * Date: 26 nov. 2010
 */
public class BasicHelloRequestHandler implements SolrRequestHandler {

  protected static Logger log =
      LoggerFactory.getLogger(BasicHelloRequestHandler.class);

  protected NamedList initArgs = null;
  protected SolrParams defaults;

  /**
   * init will be called just once, immediately after creation.
   * The args are user-level initialization parameters that
   * may be specified when declaring a request handler in
   * solrconfig.xml
   */
  public void init(NamedList args) {
    log.info("initializing BasicHelloRequestHandler: " + args);
    initArgs = args;
    if (args != null) {
      Object o = args.get("defaults");
      if (o != null && o instanceof NamedList) {
        defaults = SolrParams.toSolrParams((NamedList) o);
      }
    }
  }

  /**
   * Handles a query request, this method must be thread safe.
   *
   * Information about the request may be obtained from req and
   * response information may be set using rsp.
   *
   * There are no mandatory actions that handleRequest must perform.
   * An empty handleRequest implementation would fulfill
   * all interface obligations.
   */
  public void handleRequest(SolrQueryRequest solrQueryRequest,
                            SolrQueryResponse solrQueryResponse) {

    log.info("handling request for BasicHelloRequestHandler: ");

    //get request params
    SolrParams params = solrQueryRequest.getParams();
    String message = params.get(HelloParams.MESSAGE);

    if (message == null) {
      throw new SolrException(SolrException.ErrorCode.BAD_REQUEST,
          "message is mandatory");
    }

    log.info("get anumber ");

    Integer anumber = params.getInt(HelloParams.ANUMBER);
    if (anumber == null) {
      anumber = defaults.getInt(HelloParams.ANUMBER);
    }

    int messageLength = message.length();

    //write response
    solrQueryResponse.add("yousaid", message);
    solrQueryResponse.add("message length", messageLength);
    solrQueryResponse.add("optionalNumber", anumber);
  }

  /*
   * methods below are for JMX info
   */

  public String getName() {
    return this.getClass().getName();
  }

  public String getVersion() {
    return "1";  //TODO implement this
  }

  public String getDescription() {
    return "hello";  //TODO implement this
  }

  public Category getCategory() {
    return Category.OTHER;  //TODO implement this
  }

  public String getSourceId() {
    return "some hello source id from " +
        BasicHelloRequestHandler.class.getCanonicalNam

Re: Solr Hot Backup

2010-11-29 Thread Upayavira
As I understand it, those tools are more Solr 1.3 related, but I don't
see why they shouldn't work on 1.4.

I would say it is very unlikely that you will corrupt an index with
them.

Lucene indexes are "write once", that is, any one index file will never
be updated, only replaced. This means that taking a backup is actually
exceptionally easy, as (on Unix at least) you can create a copy of the
index directory with hard links, which takes milliseconds, even for
multi-gigabyte indexes. You just need to make sure you are not
committing while you take your backup, and it looks like those tools
will take care of that for you.
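(On Linux with GNU cp, for example, "cp -lr index index-backup" creates exactly
such a hard-linked copy of the index directory.)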

Another perk is that your backups won't take any additional disk space
(just the space for the directory data, not the files themselves). As
your index changes, disk usage will gradually increase though.

Upayavira

On Mon, 29 Nov 2010 16:13 +0100, "Rodolico Piero"
 wrote:
> Hi all,
> 
> How can I backup indexes Solr without stopping the server?
> 
> I saw the following link:
> 
>  
> 
> http://wiki.apache.org/solr/SolrOperationsTools
>  
> 
> http://wiki.apache.org/solr/CollectionDistribution
> 
>  
> 
> but I'm afraid that running these scripts 'on the fly' indexes could be
> corrupted.
> 
> Thanks,
> 
> Piero.
> 
>  
> 
>  
> 


search strangeness

2010-11-29 Thread ramzesua

Hi all. I have a little question. Can anyone explain why this solr search
works so strangely? :)
For example, I make schema.xml:
I add some fields with fieldType = text. Here 'text' properties

  






  
  
  







  

I copied to text field all my fields:




Then I add one document to my index. Here schema browser for field
'caption':

_term___frequency_
|annual |1 |
|golfer |1 |
|tournament |1 |
|welcom |1 |
|3rd|1 |

After that I tried to find this document by terms:
annual - no results
golfer  - found document
tournament - no results
welcom - found document
3rd - no results

I read a lot of forums, some books and http://wiki.apache.org/solr/ but
it doesn't help me.
Can anyone explain to me why solr searches so strangely? Or where is my problem?
Thank you ...

-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/search-strangeness-tp1986895p1986895.html
Sent from the Solr - User mailing list archive at Nabble.com.


Solr Hot Backup

2010-11-29 Thread Rodolico Piero
Hi all,

How can I backup indexes Solr without stopping the server?

I saw the following link:

 

http://wiki.apache.org/solr/SolrOperationsTools
 

http://wiki.apache.org/solr/CollectionDistribution

 

but I'm afraid that running these scripts 'on the fly' could corrupt the
indexes.

Thanks,

Piero.

 

 



Using Ngram and Phrase search

2010-11-29 Thread Jason, Kim

Hi, all
I want to use both EdgeNGram analysis and phrase search.
But there is a problem.

On a field which does not use EdgeNGram analysis, phrase search works well.
But when using EdgeNGram, phrase search is incorrect.

Now I'm using Solr1.4.0.
Result of EdgeNGram analysis for "pci express" is below.
http://lucene.472066.n3.nabble.com/file/n1986848/before.jpg 

I thought the cause was the term positions.
So I modified the EdgeNGramTokenFilter of lucene-analyzers-2.9.1.
After the modification, the result is below.
http://lucene.472066.n3.nabble.com/file/n1986848/after.jpg 

So phrase search fot "pci express" from ngram index is good work.
But another problem is happend.

For example, when I searh phrase query "pc express", docs included 'pci
express' are searched too.
In this case I don't want to search for 'pci express'.
I just want exact match "pc express".

Please give your ideas.
Thanks,
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Using-Ngram-and-Phrase-search-tp1986848p1986848.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: question about Solr SignatureUpdateProcessorFactory

2010-11-29 Thread Bernd Fehling

Am 29.11.2010 14:55, schrieb Markus Jelsma:
> 
> 
> On Monday 29 November 2010 14:51:33 Bernd Fehling wrote:
>> Dear list,
>> another suggestion about SignatureUpdateProcessorFactory.
>>
>> Why can I make signatures of several fields and place the
>> result in one field but _not_ make a signature of one field
>> and place the result in several fields.
> 
> Use copyField


Ooooh yes, you are right.


> 
>>
>> Could be realized without huge programming?
>>
>> Best regards,
>> Bernd
>>
>> Am 29.11.2010 14:30, schrieb Bernd Fehling:
>>> Dear list,
>>>
>>> a question about Solr SignatureUpdateProcessorFactory:
>>>
>>> for (String field : sigFields) {
>>>
>>>   SolrInputField f = doc.getField(field);
>>>   if (f != null) {
>>>
>>> *sig.add(field);
>>>
>>> Object o = f.getValue();
>>> if (o instanceof String) {
>>> 
>>>   sig.add((String)o);
>>> 
>>> } else if (o instanceof Collection) {
>>> 
>>>   for (Object oo : (Collection)o) {
>>>   
>>> if (oo instanceof String) {
>>> 
>>>   sig.add((String)oo);
>>> 
>>> }
>>>   
>>>   }
>>> 
>>> }
>>>   
>>>   }
>>>
>>> }
>>>
>>> Why is also the field name (* above) added to the signature
>>> and not only the content of the field?
>>>
>>> By purpose or by accident?
>>>
>>> I would like to suggest removing the field name from the signature and
>>> not mixing it up.
>>>
>>> Best regards,
>>> Bernd
> 


Re: question about Solr SignatureUpdateProcessorFactory

2010-11-29 Thread Markus Jelsma


On Monday 29 November 2010 14:51:33 Bernd Fehling wrote:
> Dear list,
> another suggestion about SignatureUpdateProcessorFactory.
> 
> Why can I make signatures of several fields and place the
> result in one field but _not_ make a signature of one field
> and place the result in several fields.

Use copyField
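(e.g. <copyField source="signature" dest="signature_copy"/> in schema.xml - the
destination field name is just an example)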

> 
> Could be realized without huge programming?
> 
> Best regards,
> Bernd
> 
> Am 29.11.2010 14:30, schrieb Bernd Fehling:
> > Dear list,
> > 
> > a question about Solr SignatureUpdateProcessorFactory:
> > 
> > for (String field : sigFields) {
> > 
> >   SolrInputField f = doc.getField(field);
> >   if (f != null) {
> > 
> > *sig.add(field);
> > 
> > Object o = f.getValue();
> > if (o instanceof String) {
> > 
> >   sig.add((String)o);
> > 
> > } else if (o instanceof Collection) {
> > 
> >   for (Object oo : (Collection)o) {
> >   
> > if (oo instanceof String) {
> > 
> >   sig.add((String)oo);
> > 
> > }
> >   
> >   }
> > 
> > }
> >   
> >   }
> > 
> > }
> > 
> > Why is also the field name (* above) added to the signature
> > and not only the content of the field?
> > 
> > By purpose or by accident?
> > 
> > I would like to suggest removing the field name from the signature and
> > not mixing it up.
> > 
> > Best regards,
> > Bernd

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


Re: question about Solr SignatureUpdateProcessorFactory

2010-11-29 Thread Erick Erickson
Why do you want to do this? It'd be the same value, just stored in
multiple fields in the document, which seems a waste. What's
the use-case you're addressing?

Best
Erick

On Mon, Nov 29, 2010 at 8:51 AM, Bernd Fehling <
bernd.fehl...@uni-bielefeld.de> wrote:

> Dear list,
> another suggestion about SignatureUpdateProcessorFactory.
>
> Why can I make signatures of several fields and place the
> result in one field but _not_ make a signature of one field
> and place the result in several fields.
>
> Could be realized without huge programming?
>
> Best regards,
> Bernd
>
>
> Am 29.11.2010 14:30, schrieb Bernd Fehling:
> > Dear list,
> >
> > a question about Solr SignatureUpdateProcessorFactory:
> >
> > for (String field : sigFields) {
> >   SolrInputField f = doc.getField(field);
> >   if (f != null) {
> > *sig.add(field);
> > Object o = f.getValue();
> > if (o instanceof String) {
> >   sig.add((String)o);
> > } else if (o instanceof Collection) {
> >   for (Object oo : (Collection)o) {
> > if (oo instanceof String) {
> >   sig.add((String)oo);
> > }
> >   }
> > }
> >   }
> > }
> >
> > Why is also the field name (* above) added to the signature
> > and not only the content of the field?
> >
> > By purpose or by accident?
> >
> > I would like to suggest removing the field name from the signature and
> > not mixing it up.
> >
> > Best regards,
> > Bernd
>


Re: question about Solr SignatureUpdateProcessorFactory

2010-11-29 Thread Bernd Fehling
Dear list,
another suggestion about SignatureUpdateProcessorFactory.

Why can I make signatures of several fields and place the
result in one field but _not_ make a signature of one field
and place the result in several fields.

Could be realized without huge programming?

Best regards,
Bernd


Am 29.11.2010 14:30, schrieb Bernd Fehling:
> Dear list,
> 
> a question about Solr SignatureUpdateProcessorFactory:
> 
> for (String field : sigFields) {
>   SolrInputField f = doc.getField(field);
>   if (f != null) {
> *sig.add(field);
> Object o = f.getValue();
> if (o instanceof String) {
>   sig.add((String)o);
> } else if (o instanceof Collection) {
>   for (Object oo : (Collection)o) {
> if (oo instanceof String) {
>   sig.add((String)oo);
> }
>   }
> }
>   }
> }
> 
> Why is also the field name (* above) added to the signature
> and not only the content of the field?
> 
> By purpose or by accident?
> 
> I would like to suggest removing the field name from the signature and
> not mixing it up.
> 
> Best regards,
> Bernd


Re: Large Hdd-Space using during commit/optimize

2010-11-29 Thread Erick Erickson
First, don't optimize after every chunk, it's just making extra work for
your system.
If you're using a 3.x or trunk build, optimizing doesn't do much for you
anyway, but
if you must, just optimize after your entire import is done.

Optimizing will pretty much copy the old index into a new set of files, so
you can expect your disk space to at least double because Solr/Lucene
doesn't
delete anything until it's sure that the optimize finished successfully.
Imagine
the consequence of deleting files as they were copied to save disk space.
Now
hit a program error, power glitch or ctrl-c. Your indexes would be
corrupted.

Best
Erick
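
With DIH, for example, each chunk can be pulled in with something like
/solr/dataimport?command=full-import&clean=false&commit=true&optimize=false
(optimize otherwise defaults to true on a full-import), and a single optimize
issued once the whole import is done, if at all.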

On Mon, Nov 29, 2010 at 6:07 AM, stockii  wrote:

>
> Hello.
>
> i have ~37 Million Docs that i want to index.
>
> when i starte a full-import i importing only every 2 Million Docs, because
> of better controll over solr and space/heap 
>
> so when i import 2 million docs and solr start the commit and the optimize
> my used disc-space jumps into the sky. reacten: solr restart and space the
> used space goes down.
>
> why is using solr so many space ?
>
> can i optimize that  ?
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Large-Hdd-Space-using-during-commit-optimize-tp1985807p1985807.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


question about Solr SignatureUpdateProcessorFactory

2010-11-29 Thread Bernd Fehling
Dear list,

a question about Solr SignatureUpdateProcessorFactory:

for (String field : sigFields) {
  SolrInputField f = doc.getField(field);
  if (f != null) {
*sig.add(field);
Object o = f.getValue();
if (o instanceof String) {
  sig.add((String)o);
} else if (o instanceof Collection) {
  for (Object oo : (Collection)o) {
if (oo instanceof String) {
  sig.add((String)oo);
}
  }
}
  }
}

Why is the field name (* above) also added to the signature,
and not only the content of the field?

On purpose or by accident?

I would like to suggest removing the field name from the signature and
not mixing it up.

Best regards,
Bernd


Re: Large Hdd-Space using during commit/optimize

2010-11-29 Thread Upayavira


On Mon, 29 Nov 2010 03:07 -0800, "stockii"  wrote:
> 
> Hello.
> 
> i have ~37 Million Docs that i want to index. 
> 
> when i starte a full-import i importing only every 2 Million Docs,
> because
> of better controll over solr and space/heap 
> 
> so when i import 2 million docs and solr start the commit and the
> optimize
> my used disc-space jumps into the sky. reacten: solr restart and space
> the
> used space goes down.
> 
> why is using solr so many space ?  
> 
> can i optimize that  ? 

What do you mean "into the sky"? What percentage increase are you
seeing?

I'd expect it to double at least. I've heard it suggested that you
should have three times the usual space available for an optimise.

Remember, when your index is optimising, you'll want to keep the
original index online and available for searches, so you'll have at
least two copies of your index on disk during an optimise.

Also, it is my understanding that if you commit infrequently, you won't
need to optimise immediately. There's nothing to stop you importing your
entire corpus, then doing a single commit. That will leave you with only
one segment (or at most two - one that existed before and was empty, and
one containing all of your documents). The net result being you don't
need to optimise at that point.

Note - I'm no solr guru, so I could be wrong with some of the above -
I'm happy to be corrected.

Upayavira


ArrayIndexOutOfBoundsException for query with rows=0 and sort param

2010-11-29 Thread Martin Grotzke
Hi,

after an upgrade from solr-1.3 to 1.4.1 we're getting an
ArrayIndexOutOfBoundsException for a query with rows=0 and a sort
param specified:

java.lang.ArrayIndexOutOfBoundsException: 0
at 
org.apache.lucene.search.FieldComparator$StringOrdValComparator.copy(FieldComparator.java:660)
at 
org.apache.lucene.search.TopFieldCollector$OneComparatorNonScoringCollector.collect(TopFieldCollector.java:84)
at 
org.apache.solr.search.SolrIndexSearcher.sortDocSet(SolrIndexSearcher.java:1391)
at 
org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:872)
at 
org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:341)
at 
org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:182)
at 
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:195)
at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)

The query is e.g.:
/select/?sort=popularity+desc&rows=0&start=0&q=foo

When this is changed to rows=1 or when the sort param is removed the
exception is gone and everything's fine.

With a clean 1.4.1 installation (unzipped, started example and posted
two documents as described in the tutorial) this issue is not
reproducable.

Does anyone have a clue what might be the reason for this and how we
could fix this on the solr side?
Of course - for a quick fix - I'll change our app so that there's no
sort param specified when rows=0.

Thanx && cheers,
Martin

-- 
Martin Grotzke
http://twitter.com/martin_grotzke


Large Hdd-Space using during commit/optimize

2010-11-29 Thread stockii

Hello.

I have ~37 million docs that I want to index.

When I start a full-import I only import 2 million docs at a time, for better
control over Solr and space/heap.

So when I import 2 million docs and Solr starts the commit and the optimize,
the used disk space jumps into the sky. Reaction: restart Solr and the used
space goes down.

Why is Solr using so much space?

Can I optimize that?
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Large-Hdd-Space-using-during-commit-optimize-tp1985807p1985807.html
Sent from the Solr - User mailing list archive at Nabble.com.