Re: Re: Re: Re: problem of solr replication's speed

2010-11-04 Thread kafka0102
Some torment later,
I found the reason for solr replication's slow speed. It's not solr's problem, it's
jetty's. I used to embed jetty7 in my app. But when I found that solr's demo uses
jetty6, I tried jetty6 in my app and was happy to get the fast
speed.
Actually, I also tried running solr's demo on jetty7 with the default conf, and the
replication speed was slow too.

I don't know why the default jetty7 server is so slow. I want to find the
reason. Maybe I can ask the jetty mailing list or continue to read the code.




At 2010-11-02 07:28:54,"Lance Norskog"  wrote:

>This is the time to replicate and open the new index, right? Opening a
>new index can take a lot of time. How many autowarmers and queries are
>there in the caches? Opening a new index re-runs all of the queries in
>all of the caches.
>
>2010/11/1 kafka0102 :
>> I suspected my app has some sleeping op every 1s, so
>> I changed ReplicationHandler.PACKET_SZ to 1024 * 1024*10; // 10MB
>>
>> and log result is like thus :
>> [2010-11-01 
>> 17:49:29][INFO][pool-6-thread-1][SnapPuller.java(1038)]readFully10485760 
>> cost 3184
>> [2010-11-01 
>> 17:49:32][INFO][pool-6-thread-1][SnapPuller.java(1038)]readFully10485760 
>> cost 3426
>> [2010-11-01 
>> 17:49:36][INFO][pool-6-thread-1][SnapPuller.java(1038)]readFully10485760 
>> cost 3359
>> [2010-11-01 
>> 17:49:39][INFO][pool-6-thread-1][SnapPuller.java(1038)]readFully10485760 
>> cost 3166
>> [2010-11-01 
>> 17:49:42][INFO][pool-6-thread-1][SnapPuller.java(1038)]readFully10485760 
>> cost 3513
>> [2010-11-01 
>> 17:49:46][INFO][pool-6-thread-1][SnapPuller.java(1038)]readFully10485760 
>> cost 3140
>> [2010-11-01 
>> 17:49:50][INFO][pool-6-thread-1][SnapPuller.java(1038)]readFully10485760 
>> cost 3471
>>
>> That means it's still slow like before. What's wrong with my env?
>>
>> At 2010-11-01 17:30:32,kafka0102  wrote:
>> I hacked SnapPuller to log the cost, and the log is like thus:
>> [2010-11-01 
>> 17:21:19][INFO][pool-6-thread-1][SnapPuller.java(1037)]readFully1048576 cost 
>> 979
>> [2010-11-01 
>> 17:21:19][INFO][pool-6-thread-1][SnapPuller.java(1037)]readFully1048576 cost 
>> 4
>> [2010-11-01 
>> 17:21:19][INFO][pool-6-thread-1][SnapPuller.java(1037)]readFully1048576 cost 
>> 4
>> [2010-11-01 
>> 17:21:20][INFO][pool-6-thread-1][SnapPuller.java(1037)]readFully1048576 cost 
>> 980
>> [2010-11-01 
>> 17:21:20][INFO][pool-6-thread-1][SnapPuller.java(1037)]readFully1048576 cost 
>> 4
>> [2010-11-01 
>> 17:21:20][INFO][pool-6-thread-1][SnapPuller.java(1037)]readFully1048576 cost 
>> 5
>> [2010-11-01 
>> 17:21:21][INFO][pool-6-thread-1][SnapPuller.java(1037)]readFully1048576 cost 
>> 979
>>
>>
>> It's saying it costs about 1000ms for transferring 1M of data every couple of 
>> reads. I used jetty as the server and embedded solr in my app. I'm so confused. 
>> What have I done wrong?
>>
>>
>> At 2010-11-01 10:12:38,"Lance Norskog"  wrote:
>>
>>>If you are copying from an indexer while you are indexing new content,
>>>this would cause contention for the disk head. Does indexing slow down
>>>during this period?
>>>
>>>Lance
>>>
>>>2010/10/31 Peter Karich :
  we have an identical-sized index and it takes ~5 minutes


> It takes about one hour to replicate a 6G index for solr in my env. But my
> network can transfer files at about 10-20M/s using scp. So solr's http
> replication is too slow; is that normal, or am I doing something wrong?
>


>>>
>>>
>>>
>>>--
>>>Lance Norskog
>>>goks...@gmail.com
>>
>>
>>
>
>
>
>-- 
>Lance Norskog
>goks...@gmail.com


How to Facet on a price range

2010-11-04 Thread jayant

I am able to facet on a particular field because I have an index on that field.
But I am not sure how to facet on a price range when I have the exact price
in the 'price' field. Can anyone help here?
Thanks
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/How-to-Facet-on-a-price-range-tp1846392p1846392.html
Sent from the Solr - User mailing list archive at Nabble.com.
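One common approach, since the exact price is already indexed, is to facet on explicit range queries with facet.query. Below is a minimal SolrJ sketch; the host URL, the 'price' field name and the bucket boundaries are only assumptions for illustration, not a prescribed setup:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class PriceRangeFacet {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
        SolrQuery q = new SolrQuery("*:*");
        q.setFacet(true);
        // each facet.query produces one count in the response
        q.addFacetQuery("price:[* TO 100]");
        q.addFacetQuery("price:[100 TO 500]");
        q.addFacetQuery("price:[500 TO *]");
        QueryResponse rsp = server.query(q);
        System.out.println(rsp.getFacetQuery());   // map of range query -> document count
    }
}

Note the bucket boundaries overlap as written (100 falls in two ranges); adjust the range edges to taste.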


Re: Does Solr support Natural Language Search

2010-11-04 Thread Li Li
I don't think current lucene will offer what you want now.
There are 2 main tasks in a search process.
One is "understanding" users' intention. Because natural language
understanding is difficult, current Information Retrieval systems
"force" users to input some terms to express their needs. But terms have
ambiguities; e.g. apple may mean a fruit or electronics, so users
are asked to input more terms to disambiguate; e.g. "apple fruit"
may suggest the user wants the fruit apple. There are many things that help
detect users' demands -- query expansion ("Searches related to" in google)
suggests terms as the user types. The ultimate goal is understanding
intention by analyzing users' natural language.

Another is "understanding" documents. Current models such as VSM
don't understand documents; they just regard documents as collections
of words. When users input a word, the system returns documents containing
that word (tf); of course idf is also taken into consideration.
But it's far from understanding. That's why keyword stuffing comes
up. Because the machine doesn't really understand the document, it can't
judge whether the document is good or bad, or whether it matches the query
well or not.
So PageRank and some other external information are used to
relieve this problem, but they can't fully solve it.
To fully understand documents needs more advanced NLP techniques. But I
don't think they will achieve human intelligence in the near future,
although I am an NLPer.
Another road is humans helping the machine "understand"; that's what is
called web 2.0, social networks, the semantic web ... But that's also not an easy
task.


2010/11/4 jayant :
>
> Does Solr support Natural Language Search? I did not find any thing about
> this in the reference manual. Please let me know.
> Thanks.
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Does-Solr-support-Natural-Language-Search-tp1839262p1839262.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Dataimporthandler crashed raidcontroller

2010-11-04 Thread Fuad Efendi
I experienced similar problems. It was because we didn't perform load stress 
tests properly before going to production. Nothing is forever: replace the 
controller, change hardware vendor, maintain low temperature inside the rack. 
Thanks
--Original Message--
From: Robert Gründler
To: solr-user@lucene.apache.org
ReplyTo: solr-user@lucene.apache.org
Subject: Dataimporthandler crashed raidcontroller
Sent: Nov 4, 2010 7:21 PM

Hi all,

we had a severe problem with our RAID controller on one of our servers today 
while importing a table with ~8 million rows into a solr index. After 
importing about 4 million
documents, our server shut down, and failed to restart due to a corrupt raid
disk. 

The Solr data import was the only heavy process running on that machine during
the crash.

Has anyone experienced hdd/raid-related problems during indexing large sql 
databases into solr?


thanks!


-robert

 




Sent on the TELUS Mobility network with BlackBerry

Re: Optimize Index

2010-11-04 Thread Erick Erickson
no, you didn't miss anything. The comment at Lucene Revolution was more
along the lines that optimize didn't actually improve much *absent* deletes.

Plus, on a significant-size corpus, the doc frequencies won't change that
much by deleting documents, but that's a case-by-case thing

Best
Erick

On Thu, Nov 4, 2010 at 4:31 PM, Markus Jelsma wrote:

> Huh? That's something new for me. Optimize removes documents that have been
> flagged for deletion. For relevancy it's important those are removed
> because
> document frequencies are not updated for deletes.
>
> Did i miss something?
>
> > For what it's worth, the Solr class instructor at the Lucene Revolution
> > conference recommended *against* optimizing, and instead suggested to
> just
> > let the merge factor do its job.
> >
> > On Thu, Nov 4, 2010 at 2:55 PM, Shawn Heisey  wrote:
> > > On 11/4/2010 7:22 AM, stockiii wrote:
> > >> how can i start an optimize by using DIH, but NOT after an delta- or
> > >> full-import ?
> > >
> > > I'm not aware of a way to do this with DIH, though there might be
> > > something I'm not aware of.  You can do it with an HTTP POST.  Here's
> > > how to do it with curl:
> > >
> > > /usr/bin/curl "http://HOST:PORT/solr/CORE/update" \
> > > -H "Content-Type: text/xml" \
> > > --data-binary '<optimize/>'
> > >
> > > Shawn
>


Dataimporthandler crashed raidcontroller

2010-11-04 Thread Robert Gründler
Hi all,

we had a severe problem with our RAID controller on one of our servers today 
while importing a table with ~8 million rows into a solr index. After 
importing about 4 million
documents, our server shut down, and failed to restart due to a corrupt raid
disk. 

The Solr data import was the only heavy process running on that machine during
the crash.

Has anyone experienced hdd/raid-related problems during indexing large sql 
databases into solr?


thanks!


-robert

 




Re: Testing/packaging question

2010-11-04 Thread Peter Karich

 Hi,

don't know if the python package provides one, but solrj offers to start 
solr embedded (EmbeddedSolrServer), and

setting up a different schema + config is possible. For this see:
https://karussell.wordpress.com/2010/06/10/how-to-test-apache-solrj/

if you need an 'external solr' (via jetty and java -jar start.jar) while 
tests are running, see this:

http://java.dzone.com/articles/getting-know-solr

Regards,
Peter.
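For reference, a minimal sketch of spinning up an embedded core against a throwaway solr home; the path is a placeholder and this is roughly the Solr 1.4-era API, so treat it as a starting point rather than the exact recipe:

import org.apache.solr.client.solrj.embedded.EmbeddedSolrServer;
import org.apache.solr.core.CoreContainer;

public class EmbeddedExample {
    public static void main(String[] args) throws Exception {
        // point solr home at a directory containing conf/schema.xml and conf/solrconfig.xml
        System.setProperty("solr.solr.home", "/tmp/test-solr-home");
        CoreContainer.Initializer initializer = new CoreContainer.Initializer();
        CoreContainer container = initializer.initialize();
        EmbeddedSolrServer server = new EmbeddedSolrServer(container, "");
        // ... index and query through 'server' like any other SolrServer ...
        container.shutdown();
    }
}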



Hi,

I'm pretty much of a Solr newbie currently packaging solrpy for Debian;
see
http://svn.debian.org/viewsvn/python-modules/packages/python-solrpy/trunk/

In order to run solrpy's supplied tests at build time, I'd need Solr to
know about the schema.xml that comes with the tests.
Can anyone tell me how do that properly? I'd basically need Solr to
temporarily recognize that schema.xml without permanently installing it
-- is there any way to do this, eg via environment variables?

TIA
Bernhard Reiter





RE: Does Solr support Natural Language Search

2010-11-04 Thread Steven A Rowe
Hi Jayant,

I think you mean NL search as opposed to Boolean search: the ability to return 
ranked results from queries based on non-required term matches.  Right?

If that is what you meant, then the answer is: "Yes!".  If not, then you should 
rephrase your question.  

Otherwise, the answer could eventually be: "Maybe!!!".  YMMV, TMR.

Steve

> -Original Message-
> From: jayant [mailto:jayan...@hotmail.com]
> Sent: Wednesday, November 03, 2010 11:49 PM
> To: solr-user@lucene.apache.org
> Subject: Does Solr support Natural Language Search
> 
> 
> Does Solr support Natural Language Search? I did not find any thing about
> this in the reference manual. Please let me know.
> Thanks.
> --
> View this message in context: http://lucene.472066.n3.nabble.com/Does-
> Solr-support-Natural-Language-Search-tp1839262p1839262.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: mergeFactor questions

2010-11-04 Thread Tommaso Teofili
Thanks so much Shawn. I am in a scenario with many inserts while searching,
each consisting of ~500 documents; I will monitor the number of
segments, keeping your considerations in mind :-)
Regards,
Tommaso

2010/11/4 Shawn Heisey 

> On 11/4/2010 3:27 AM, Tommaso Teofili wrote:
>
>> - Is mergeFactor a one time configuration setting that is considered only
>> when creating the index for the first time, or can it be adjusted later even
>> with some docs inside the index? e.g. I have mF at 10, then I realize I want
>> quicker searches and I set it to 2 so that at the next optimize/commit I
>> will have no more than 2 segments. My understanding is that one can adjust
>> mF over time, is that right?
>>
>
> The mergeFactor is applied anytime documents are added to the index, not
> just when it is built for the first time.  You can adjust it later, and
> reload the core or restart Solr.  It will apply to any additional indexing
> from that point forward.
>
> With a mergeFactor of 10, having 21 segments (and more) temporarily on the
> disk at the same time is reasonably possible.  I know this applies if you
> are doing a continuous large insert, not sure if you are doing several small
> inserts separately. These segments are:
>
> * The small segment that is being built right now.
> * The previous 10 small segments.
> * The merged segment being created from those above.
> * The previous 9 merged segments.
>
> If it takes a really long time to merge the last 10 small segments and then
> merge the 10 large segments into an even larger segment, you can end up with
> even more small segments from your continuous insert.  If it should take
> long enough that you actually get 10 more new small segments, the large
> merge will pause while it completes the small merge.  I saw this happen
> recently when I decided to see what happens if I built a single shard from
> our entire database.  It took a really long time, partly from that
> super-merge and the optimize that happened later, and took up 85GB of disk
> space.
>
> I'm not really sure what happens if you have this continue beyond a single
> super-merge like I have mentioned.
>
>> - In a replicated environment does it make sense to define different
>> mergeFactors on master and slave? I'd say no, since it influences the number
>> of segments created, that being a concern of whoever actually indexes
>> documents (the master), not of whoever receives (segments of) the index, but
>> please correct me if I am wrong.
>>
>
> Because it only applies when indexes are being built, it has no meaning on
> a slave, which as you said, just copies the data from the master.
>
> Shawn
>
>


Re: querying multiple fields as one

2010-11-04 Thread Jonathan Rochkind

Tommaso Teofili wrote:



No failing, just looking for how to do such "expansion" of fields
automatically (with fields in OR but that's not an issue I think)
  

the dismax query parser is that way.
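As a sketch, the two fields can simply be listed in the dismax qf parameter, either in the handler defaults in solrconfig.xml or per request. A minimal SolrJ illustration follows; the host URL and the choice to pass defType per request are assumptions, not the only way to wire it up:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class DismaxExample {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
        SolrQuery q = new SolrQuery("electronics");
        q.set("defType", "dismax");   // use the dismax query parser
        q.set("qf", "type cat");      // expand the query over both fields, effectively OR'ed
        System.out.println(server.query(q).getResults().getNumFound());
    }
}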



RE: Testing/packaging question

2010-11-04 Thread Turner, Robbin J
You can set up your own tomcat instance which would contain just the configurations 
you need.  You won't even have to recreate all the tomcat configuration and 
binaries, just the ones that are not defaults.  So, if you look up running multiple 
tomcat configuration instances (google it), you'll end up with a set of 
directories.   You'll need your own startup script that points to your 
configurations.  You can use the current startup script as a model; then in 
your build procedures (I've done all this with a script) have it added to the 
system so you can perform a restart.  You'd have to have a couple of other 
environment variables set:

export CATALINA_BASE=/path/to/your/tomcat/instance/conf/files
export CATALINA_HOME=/path/to/default/installation/bin/files
export SOLR_HOME=/path/to/solr/dataNconf

Good luck


From: Bernhard Reiter [ock...@raz.or.at]
Sent: Thursday, November 04, 2010 5:49 PM
To: solr-user@lucene.apache.org
Subject: RE: Testing/packaging question

Thanks for your instructions. Unfortunately, I need to do all that as
part of my package's (python-solrpy) build procedure, so I can't change
any global configuration, such as in the catalina subdirectories.

I've already sensed that restarting tomcat is also just too
system-invasive and would include changing its (system-wide)
configuration.

Are there any other ways to use solr for running the tests from
http://pypi.python.org/packages/source/s/solrpy/solrpy-0.9.3.tar.gz
without having to change any system configuration? Maybe via a user
Tomcat instance such as provided by the tomcat6-user debian package?

Thanks for your help!
Bernhard

Am Donnerstag, den 04.11.2010, 16:15 -0500 schrieb Turner, Robbin J:
> You need to either add that to catalina.sh or create a setenv.sh in the 
> CATALINA_HOME/bin directory.  Then you can restart tomcat.
>
> So, setenv.sh would contain the following:
>
>export JAVA_HOME="/path/to/jre"
>    export JAVA_OPTS="$JAVA_OPTS -Dsolr.solr.home=/path/to/my/schema.xml"
>
> If you were setting the export in your own environment and then issuing the 
> restart, tomcat was not picking up your local environment because it's 
> running as root.  You don't want to change root's environment.
>
> You could also create a context.xml in your 
> CATALINA_HOME/conf/CATALINA/localhost.  You should be able to find those 
> instructions on/through the Solr FAQ.
>
> Hope this helps.
> 
> From: Bernhard Reiter [ock...@raz.or.at]
> Sent: Thursday, November 04, 2010 4:49 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Testing/packaging question
>
> Hi,
>
> I'm now trying to
>
> export JAVA_OPTS="$JAVA_OPTS -Dsolr.solr.home=/path/to/my/schema.xml"
>
> and restarting tomcat (v6 package from ubuntu maverick) via
>
> sudo /etc/init.d/tomcat6 restart
>
> but solr still doesn't seem to find that schema.xml, as it complains
> about unknown fields when running the tests that require that schema.xml
>
> Can someone please tell me what I'm doing wrong -- and what I should be
> doing?
>
> TIA again,
> Bernhard
>
> Am Montag, den 01.11.2010, 19:01 +0100 schrieb Bernhard Reiter:
> > Hi,
> >
> > I'm pretty much of a Solr newbie currently packaging solrpy for Debian;
> > see
> > http://svn.debian.org/viewsvn/python-modules/packages/python-solrpy/trunk/
> >
> > In order to run solrpy's supplied tests at build time, I'd need Solr to
> > know about the schema.xml that comes with the tests.
> > Can anyone tell me how do that properly? I'd basically need Solr to
> > temporarily recognize that schema.xml without permanently installing it
> > -- is there any way to do this, eg via environment variables?
> >
> > TIA
> > Bernhard Reiter

Re: querying multiple fields as one

2010-11-04 Thread Tommaso Teofili
Hi Erick

2010/11/4 Erick Erickson 

> Ken's suggestion to look at dismax is a good one, but I have
> a question
> q=type:electronics cat:electronics
>
> should do what you want assuming your default operator
> is OR.


correct


>  Is it failing? Or is the real question how you can
> do this automatically?
>

No failing, just looking for how to do such "expansion" of fields
automatically (with fields in OR but that's not an issue I think)


>
> I'd expect the ranking to be a bit different, but I'm guessing
> that's not a big issue
>

right, no problem if the scoring isn't exactly the same.
Thanks,
Tommaso



>
> Best
> Erick
>
> On Thu, Nov 4, 2010 at 8:21 AM, Tommaso Teofili
> wrote:
>
> > Hi all,
> > having two fields named 'type' and 'cat' with identical type and options,
> > but different values recorded, would it be possible to query them as they
> > were one field?
> > For instance
> >  q=type:electronics cat:electronics
> > should return same results as
> >  q=common:electronics
> > I know I could make it defining a third field 'common' with copyFields
> from
> > 'type' and 'cat' to 'common' but this wouldn't be feasible if you've
> > already
> > lots of documents in your index and don't want to reindex everything,
> isn't
> > it?
> > Any suggestions?
> > Thanks in advance,
> > Tommaso
> >
>


RE: Testing/packaging question

2010-11-04 Thread Bernhard Reiter
Thanks for your instructions. Unfortunately, I need to do all that as
part of my package's (python-solrpy) build procedure, so I can't change
any global configuration, such as in the catalina subdirectories.

I've already sensed that restarting tomcat is also just too
system-invasive and would include changing its (system-wide)
configuration. 

Are there any other ways to use solr for running the tests from
http://pypi.python.org/packages/source/s/solrpy/solrpy-0.9.3.tar.gz
without having to change any system configuration? Maybe via a user
Tomcat instance such as provided by the tomcat6-user debian package?

Thanks for your help!
Bernhard

Am Donnerstag, den 04.11.2010, 16:15 -0500 schrieb Turner, Robbin J:
> You need to either add that to catalina.sh or create a setenv.sh in the 
> CATALINA_HOME/bin directory.  Then you can restart tomcat.  
> 
> So, setenv.sh would contain the following:
> 
>export JAVA_HOME="/path/to/jre"
>    export JAVA_OPTS="$JAVA_OPTS -Dsolr.solr.home=/path/to/my/schema.xml"
> 
> If you were setting the export in your own environment and then issuing the 
> restart, tomcat was not picking up your local environment because it's 
> running as root.  You don't want to change root's environment.
> 
> You could also create a context.xml in your 
> CATALINA_HOME/conf/CATALINA/localhost.  You should be able to find those 
> instructions on/through the Solr FAQ.
> 
> Hope this helps. 
> 
> From: Bernhard Reiter [ock...@raz.or.at]
> Sent: Thursday, November 04, 2010 4:49 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Testing/packaging question
> 
> Hi,
> 
> I'm now trying to
> 
> export JAVA_OPTS="$JAVA_OPTS -Dsolr.solr.home=/path/to/my/schema.xml"
> 
> and restarting tomcat (v6 package from ubuntu maverick) via
> 
> sudo /etc/init.d/tomcat6 restart
> 
> but solr still doesn't seem to find that schema.xml, as it complains
> about unknown fields when running the tests that require that schema.xml
> 
> Can someone please tell me what I'm doing wrong -- and what I should be
> doing?
> 
> TIA again,
> Bernhard
> 
> Am Montag, den 01.11.2010, 19:01 +0100 schrieb Bernhard Reiter:
> > Hi,
> >
> > I'm pretty much of a Solr newbie currently packaging solrpy for Debian;
> > see
> > http://svn.debian.org/viewsvn/python-modules/packages/python-solrpy/trunk/
> >
> > In order to run solrpy's supplied tests at build time, I'd need Solr to
> > know about the schema.xml that comes with the tests.
> > Can anyone tell me how do that properly? I'd basically need Solr to
> > temporarily recognize that schema.xml without permanently installing it
> > -- is there any way to do this, eg via environment variables?
> >
> > TIA
> > Bernhard Reiter




Re: Problem escaping question marks

2010-11-04 Thread Robert Muir
On Thu, Nov 4, 2010 at 4:58 PM, Stephen Powis  wrote:
> What is the likelihood of this being included in the next release/bug fix
> version of Solr?

In this case, not likely. It will have to wait for Solr 4.0

> Are there docs available online with basic information
> about rolling our own build of Solr that includes this patch?

you can checkout trunk with 'svn checkout
http://svn.apache.org/repos/asf/lucene/dev/trunk' and apply the patch
with 'patch -p0 < foo.patch'


RE: Testing/packaging question

2010-11-04 Thread Bernhard Reiter
The thing is, I only have a schema.xml -- no data, no lib directories.

See the tests subdirectory in the solrpy package:
http://pypi.python.org/packages/source/s/solrpy/solrpy-0.9.3.tar.gz

Bernhard

Am Donnerstag, den 04.11.2010, 15:59 -0500 schrieb Olson, Ron:
> I believe it should point to the directory above, where conf and lib are 
> located (though I have a multi-core setup).
> 
> Mine is set to:
> 
> /usr/local/jboss-5.1.0.GA/server/solr/solr_data/
> 
> And in solr_data the solr.xml defines the two cores, but in each core 
> directory, is a conf, data, and lib directory, which contains the schema.xml.
> 
> 
> 
> -Original Message-
> From: Bernhard Reiter [mailto:ock...@raz.or.at]
> Sent: Thursday, November 04, 2010 3:49 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Testing/packaging question
> 
> Hi,
> 
> I'm now trying to
> 
> export JAVA_OPTS="$JAVA_OPTS -Dsolr.solr.home=/path/to/my/schema.xml"
> 
> and restarting tomcat (v6 package from ubuntu maverick) via
> 
> sudo /etc/init.d/tomcat6 restart
> 
> but solr still doesn't seem to find that schema.xml, as it complains
> about unknown fields when running the tests that require that schema.xml
> 
> Can someone please tell me what I'm doing wrong -- and what I should be
> doing?
> 
> TIA again,
> Bernhard
> 
> Am Montag, den 01.11.2010, 19:01 +0100 schrieb Bernhard Reiter:
> > Hi,
> >
> > I'm pretty much of a Solr newbie currently packaging solrpy for Debian;
> > see
> > http://svn.debian.org/viewsvn/python-modules/packages/python-solrpy/trunk/
> >
> > In order to run solrpy's supplied tests at build time, I'd need Solr to
> > know about the schema.xml that comes with the tests.
> > Can anyone tell me how do that properly? I'd basically need Solr to
> > temporarily recognize that schema.xml without permanently installing it
> > -- is there any way to do this, eg via environment variables?
> >
> > TIA
> > Bernhard Reiter
> 
> 
> 
> 
> DISCLAIMER: This electronic message, including any attachments, files or 
> documents, is intended only for the addressee and may contain CONFIDENTIAL, 
> PROPRIETARY or LEGALLY PRIVILEGED information.  If you are not the intended 
> recipient, you are hereby notified that any use, disclosure, copying or 
> distribution of this message or any of the information included in or with it 
> is  unauthorized and strictly prohibited.  If you have received this message 
> in error, please notify the sender immediately by reply e-mail and 
> permanently delete and destroy this message and its attachments, along with 
> any copies thereof. This message does not create any contractual obligation 
> on behalf of the sender or Law Bulletin Publishing Company.
> Thank you.




Re: Optimize Index

2010-11-04 Thread Peter Karich

 what you can try is maxSegments=2 or more as a 'partial' optimize:

"If the index is so large that optimizes are taking longer than desired 
or using more disk space during optimization than you can spare, 
consider adding the maxSegments parameter to the optimize command. In 
the XML message, this would be an attribute; the URL form and SolrJ have 
the corresponding option too. By default this parameter is 1 since an 
optimize results in a single Lucene "segment". By setting it larger than 
1 but less than the mergeFactor, you permit partial optimization to no 
more than this many segments. Of course the index won't be fully 
optimized and therefore searches will be slower. "


from http://wiki.apache.org/solr/PacktBook2009 (I only found that link; 
there must be sth. on the real wiki for the maxSegments parameter ...)
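As a sketch, the same partial optimize can be issued from SolrJ, which exposes a maxSegments argument on optimize(); the host URL here is just a placeholder:

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class PartialOptimize {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
        // waitFlush=true, waitSearcher=true, merge down to at most 2 segments
        server.optimize(true, true, 2);
    }
}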



Hello.

My Index have ~30 Million documents and a optimize=true is very heavy. it
takes long time ...

how can i start an optimize by using DIH, but NOT after an delta- or
full-import ?

i set my index to compound-index.

thx



--
http://jetwick.com twitter search prototype



Re: Using setStart in solrj

2010-11-04 Thread Peter Karich

 Hi Ron,


 how do I know what the starting row


Always 0.


 especially if the original SolrQuery object has them all


that's the point. solr will normally cache it for you. This is your friend in
solrconfig.xml: <queryResultWindowSize>40</queryResultWindowSize>


just try it first with http to get an impression of what start is good for:
it just sets the starting doc for the current query.
E.g. you have a very complicated query a la
select?q=xy&param1=...&param2=...&paramN=...&rows=20&start=0

the next *page* would be
select?q=xy&param1=...&param2=...&paramN=...&rows=20&start=20

(newStart=oldStart+rows)

(To get the next page you'll need to keep the params either in the 
session or 'encoded' within the url.)


Just try and ask if you need more info :-)

Regards,
Peter.
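A minimal SolrJ paging sketch along those lines; the server URL and the query string are placeholders, and whether the second request is served from the queryResultCache depends on your cache configuration:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class PagingExample {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
        SolrQuery q = new SolrQuery("truck");
        q.setRows(20);

        q.setStart(0);                          // page 1
        QueryResponse page1 = server.query(q);

        q.setStart(20);                         // page 2: newStart = oldStart + rows
        QueryResponse page2 = server.query(q);  // same query re-executed; a warm queryResultCache makes it cheap

        System.out.println(page1.getResults().getNumFound());
    }
}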


Hi all-

First, thanks to all the folks to have helped me so far getting the hang of 
Solr; I promise to give back when I think my contributions will be useful :)

I am at the point where I'm trying to return results back from a search in a war file, using Java 
with solrj. On the result page of the website I'd want to limit the actual results to probably 
around 20 or so, with the usual "next/prev page" paradigm. The issue I've been wrestling 
with is keeping the SolrQuery object around so that I don't need to transmit the entire thing back 
to the client, especially if they search for something like "truck", which could return a 
lot of results.

I was thinking that one solution would be to do a "query.setRows(20);" for the query, 
then return the results back with some sort of an identifier so that on subsequent queries, I could 
also include "query.setStart(someCounter + 1);" to get the next set of 20. In theory, 
that would work at the cost of having to re-execute the query.

I've been looking for information about setStart() and haven't found much more than 
Javadoc that says "sets the starting row for the result set". My question is, 
how do I know what the starting row is? Maybe, based on the search parameters, it will 
always return the results in an implicit order in which case is it just like executing a 
fixed query in a database and then grabbing the next 20 rows from the result set? Because 
the user would be pressing the prev/next buttons, even though the query is being 
re-executed, the parameters would not be changing.

That's the theory, anyway. It seems excessive to keep executing the same query 
over and over again just because the user wants to see the next set of results, 
especially if the original SolrQuery object has them all, but maybe that's just 
what needs to be done, given the stateless nature of the web.

Any info on this method/strategy would be most appreciated.

Thanks,

Ron

DISCLAIMER: This electronic message, including any attachments, files or 
documents, is intended only for the addressee and may contain CONFIDENTIAL, 
PROPRIETARY or LEGALLY PRIVILEGED information.  If you are not the intended 
recipient, you are hereby notified that any use, disclosure, copying or 
distribution of this message or any of the information included in or with it 
is  unauthorized and strictly prohibited.  If you have received this message in 
error, please notify the sender immediately by reply e-mail and permanently 
delete and destroy this message and its attachments, along with any copies 
thereof. This message does not create any contractual obligation on behalf of 
the sender or Law Bulletin Publishing Company.
Thank you.




--
http://jetwick.com twitter search prototype



RE: Testing/packaging question

2010-11-04 Thread Turner, Robbin J
You need to either add that to catalina.sh or create a setenv.sh in the 
CATALINA_HOME/bin directory.  Then you can restart tomcat.  

So, setenv.sh would contain the following:

   export JAVA_HOME="/path/to/jre"
   export JAVA_OPTS="$JAVA_OPTS -Dsolr.solr.home=/path/to/my/schema.xml"

If you were setting the export in your own environment and then issuing the 
restart, tomcat was not picking up your local environment because it's running 
as root.  You don't want to change root's environment.

You could also create a context.xml in your 
CATALINA_HOME/conf/CATALINA/localhost.  You should be able to find those 
instructions on/through the Solr FAQ.

Hope this helps. 

From: Bernhard Reiter [ock...@raz.or.at]
Sent: Thursday, November 04, 2010 4:49 PM
To: solr-user@lucene.apache.org
Subject: Re: Testing/packaging question

Hi,

I'm now trying to

export JAVA_OPTS="$JAVA_OPTS -Dsolr.solr.home=/path/to/my/schema.xml"

and restarting tomcat (v6 package from ubuntu maverick) via

sudo /etc/init.d/tomcat6 restart

but solr still doesn't seem to find that schema.xml, as it complains
about unknown fields when running the tests that require that schema.xml

Can someone please tell me what I'm doing wrong -- and what I should be
doing?

TIA again,
Bernhard

Am Montag, den 01.11.2010, 19:01 +0100 schrieb Bernhard Reiter:
> Hi,
>
> I'm pretty much of a Solr newbie currently packaging solrpy for Debian;
> see
> http://svn.debian.org/viewsvn/python-modules/packages/python-solrpy/trunk/
>
> In order to run solrpy's supplied tests at build time, I'd need Solr to
> know about the schema.xml that comes with the tests.
> Can anyone tell me how do that properly? I'd basically need Solr to
> temporarily recognize that schema.xml without permanently installing it
> -- is there any way to do this, eg via environment variables?
>
> TIA
> Bernhard Reiter

Re: Problem escaping question marks

2010-11-04 Thread Jonathan Rochkind
Wildcard queries, especially a wildcard query with a wildcard both 
_before_ and _after_, are going to be fairly slow for Solr to process, 
anyhow. (In fact, for some reason I thought wildcards weren't even 
supported both before and after, just one or the other).


Still, it's a bug in lucene, it ought not to do that, true.

But there may be a better design to handle your actual use case with 
much better performance anyhow. Based around doing something at indexing 
time to tokenize in a different field on individual letters (if perhaps 
you frequently want to search on arbitrary individual characters), or to 
simply index a "1" or "0" in a field depending on whether it includes a 
question mark if you specifically want to search all the time on 
question marks and don't care about other letters. Or some kind of more 
complex ngram'ing, if you want to be able to search on all sorts of 
sub-strings, efficiently. The trade-off will be disk space for 
performance... but if you start to have a lot of records, that 
wildcard-on-both-sides thing will have unacceptable performance, I predict.
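As a sketch of the "1"/"0" flag idea, the flag would be computed when the document is indexed; the field names and the server URL below are made up for illustration and would need matching schema entries:

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class FlagFieldExample {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
        String firstName = "Wh?tney";
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "42");
        doc.addField("first_name", firstName);
        // cheap boolean flag computed at index time, instead of a double-wildcard query at search time
        doc.addField("first_name_has_qmark", firstName.contains("?") ? "1" : "0");
        server.add(doc);
        server.commit();
        // a query like first_name_has_qmark:1 then finds such documents without wildcards
    }
}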


Jonathan

Stephen Powis wrote:

Looking at the JIRA issue, looks like there's been a new patch related to
this.  This is good news!  We've re-written a portion of our web app to use
Solr instead of mysql.  This part of our app allows clients to construct
rules to match data within their account, and automatically apply actions to
those matched data points.  So far our testing and then rollout has been
smooth, until we encountered the above rule/query.  I guess I assumed since
these metacharacters were escaped that they would be parsed correctly under
any type of query.

What is the likelihood of this being included in the next release/bug fix
version of Solr?  Are there docs available online with basic information
about rolling our own build of Solr that includes this patch?

I appreciate your help!
Thanks!
Stephen


On Thu, Nov 4, 2010 at 9:26 AM, Robert Muir  wrote:

  

On Thu, Nov 4, 2010 at 1:44 AM, Stephen Powis 
wrote:


I want to return any first name with a Question Mark in it
Query: first_name: *\?*

  

There is no way to escape the metacharacters * or ? for a wildcard
query (regardless of queryparser, even if you write your own).
See https://issues.apache.org/jira/browse/LUCENE-588

Its something we could fix, but in all honesty it seems one reason it
isn't fixed is because the bug is so old, yet there hasn't really been
any indication of demand for such a thing...




  


Re: Deletes writing bytes len 0, corrupting the index

2010-11-04 Thread Jason Rutherglen
I'm still seeing this error after downloading the latest 2.9 branch
version, compiling, copying to Solr 1.4 and deploying.  Basically as
mentioned, the .del files are of zero length... Hmm...

On Wed, Oct 13, 2010 at 1:33 PM, Jason Rutherglen
 wrote:
> Thanks Robert, that Jira issue aptly describes what I'm seeing, I think.
>
> On Wed, Oct 13, 2010 at 10:22 AM, Robert Muir  wrote:
>> if you are going to fill up your disk space all the time with solr
>> 1.4.1, I suggest replacing the lucene jars with lucene jars from
>> 2.9-branch 
>> (http://svn.apache.org/repos/asf/lucene/java/branches/lucene_2_9/).
>>
>> then you get the fix for https://issues.apache.org/jira/browse/LUCENE-2593 
>> too.
>>
>> On Wed, Oct 13, 2010 at 11:37 AM, Jason Rutherglen
>>  wrote:
>>> We have unit tests for running out of disk space?  However we have
>>> Tomcat logs that fill up quickly and starve Solr 1.4.1 of space.  The
>>> main segments are probably not corrupted, however routinely now, there
>>> are deletes files of length 0.
>>>
>>> 0 2010-10-12 18:35 _cc_8.del
>>>
>>> Which is fundamental index corruption, though less extreme.  Are we
>>> testing for this?
>>>
>>
>


RE: Testing/packaging question

2010-11-04 Thread Olson, Ron
I believe it should point to the directory above, where conf and lib are 
located (though I have a multi-core setup).

Mine is set to:

/usr/local/jboss-5.1.0.GA/server/solr/solr_data/

And in solr_data the solr.xml defines the two cores, but in each core 
directory, is a conf, data, and lib directory, which contains the schema.xml.



-Original Message-
From: Bernhard Reiter [mailto:ock...@raz.or.at]
Sent: Thursday, November 04, 2010 3:49 PM
To: solr-user@lucene.apache.org
Subject: Re: Testing/packaging question

Hi,

I'm now trying to

export JAVA_OPTS="$JAVA_OPTS -Dsolr.solr.home=/path/to/my/schema.xml"

and restarting tomcat (v6 package from ubuntu maverick) via

sudo /etc/init.d/tomcat6 restart

but solr still doesn't seem to find that schema.xml, as it complains
about unknown fields when running the tests that require that schema.xml

Can someone please tell me what I'm doing wrong -- and what I should be
doing?

TIA again,
Bernhard

Am Montag, den 01.11.2010, 19:01 +0100 schrieb Bernhard Reiter:
> Hi,
>
> I'm pretty much of a Solr newbie currently packaging solrpy for Debian;
> see
> http://svn.debian.org/viewsvn/python-modules/packages/python-solrpy/trunk/
>
> In order to run solrpy's supplied tests at build time, I'd need Solr to
> know about the schema.xml that comes with the tests.
> Can anyone tell me how do that properly? I'd basically need Solr to
> temporarily recognize that schema.xml without permanently installing it
> -- is there any way to do this, eg via environment variables?
>
> TIA
> Bernhard Reiter




DISCLAIMER: This electronic message, including any attachments, files or 
documents, is intended only for the addressee and may contain CONFIDENTIAL, 
PROPRIETARY or LEGALLY PRIVILEGED information.  If you are not the intended 
recipient, you are hereby notified that any use, disclosure, copying or 
distribution of this message or any of the information included in or with it 
is  unauthorized and strictly prohibited.  If you have received this message in 
error, please notify the sender immediately by reply e-mail and permanently 
delete and destroy this message and its attachments, along with any copies 
thereof. This message does not create any contractual obligation on behalf of 
the sender or Law Bulletin Publishing Company.
Thank you.


Re: Problem escaping question marks

2010-11-04 Thread Stephen Powis
Looking at the JIRA issue, looks like there's been a new patch related to
this.  This is good news!  We've re-written a portion of our web app to use
Solr instead of mysql.  This part of our app allows clients to construct
rules to match data within their account, and automatically apply actions to
those matched data points.  So far our testing and then rollout has been
smooth, until we encountered the above rule/query.  I guess I assumed since
these metacharacters were escaped that they would be parsed correctly under
any type of query.

What is the likelihood of this being included in the next release/bug fix
version of Solr?  Are there docs available online with basic information
about rolling our own build of Solr that includes this patch?

I appreciate your help!
Thanks!
Stephen


On Thu, Nov 4, 2010 at 9:26 AM, Robert Muir  wrote:

> On Thu, Nov 4, 2010 at 1:44 AM, Stephen Powis 
> wrote:
> > I want to return any first name with a Question Mark in it
> > Query: first_name: *\?*
> >
>
> There is no way to escape the metacharacters * or ? for a wildcard
> query (regardless of queryparser, even if you write your own).
> See https://issues.apache.org/jira/browse/LUCENE-588
>
> Its something we could fix, but in all honesty it seems one reason it
> isn't fixed is because the bug is so old, yet there hasn't really been
> any indication of demand for such a thing...
>


Re: Testing/packaging question

2010-11-04 Thread Bernhard Reiter
Hi, 

I'm now trying to 

export JAVA_OPTS="$JAVA_OPTS -Dsolr.solr.home=/path/to/my/schema.xml"

and restarting tomcat (v6 package from ubuntu maverick) via

sudo /etc/init.d/tomcat6 restart

but solr still doesn't seem to find that schema.xml, as it complains
about unknown fields when running the tests that require that schema.xml

Can someone please tell me what I'm doing wrong -- and what I should be
doing?

TIA again,
Bernhard

Am Montag, den 01.11.2010, 19:01 +0100 schrieb Bernhard Reiter:
> Hi, 
> 
> I'm pretty much of a Solr newbie currently packaging solrpy for Debian;
> see
> http://svn.debian.org/viewsvn/python-modules/packages/python-solrpy/trunk/
> 
> In order to run solrpy's supplied tests at build time, I'd need Solr to
> know about the schema.xml that comes with the tests.
> Can anyone tell me how do that properly? I'd basically need Solr to
> temporarily recognize that schema.xml without permanently installing it
> -- is there any way to do this, eg via environment variables?
> 
> TIA
> Bernhard Reiter




Re: Does DataImportHandler support Digest authentication

2010-11-04 Thread jayant

I meant to say RESTful APIs.
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Does-DataImportHandler-support-Digest-authentication-tp1844497p1844501.html
Sent from the Solr - User mailing list archive at Nabble.com.


Does DataImportHandler support Digest authentication

2010-11-04 Thread jayant

I need to connect to a RETS api through a http url. But the REST service uses
digest authentication. Can I use DataImportHandler to pass the credentials
for digest authentication?
Thanks
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Does-DataImportHandler-support-Digest-authentication-tp1844497p1844497.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Optimize Index

2010-11-04 Thread Markus Jelsma
Huh? That's something new for me. Optimize removes documents that have been 
flagged for deletion. For relevancy it's important those are removed because 
document frequencies are not updated for deletes.

Did i miss something?

> For what it's worth, the Solr class instructor at the Lucene Revolution
> conference recommended *against* optimizing, and instead suggested to just
> let the merge factor do its job.
> 
> On Thu, Nov 4, 2010 at 2:55 PM, Shawn Heisey  wrote:
> > On 11/4/2010 7:22 AM, stockiii wrote:
> >> how can i start an optimize by using DIH, but NOT after an delta- or
> >> full-import ?
> > 
> > I'm not aware of a way to do this with DIH, though there might be
> > something I'm not aware of.  You can do it with an HTTP POST.  Here's
> > how to do it with curl:
> > 
> > /usr/bin/curl "http://HOST:PORT/solr/CORE/update" \
> > -H "Content-Type: text/xml" \
> > --data-binary '<optimize/>'
> > 
> > Shawn


Re: Optimize Index

2010-11-04 Thread Rich Cariens
For what it's worth, the Solr class instructor at the Lucene Revolution
conference recommended *against* optimizing, and instead suggested to just
let the merge factor do its job.

On Thu, Nov 4, 2010 at 2:55 PM, Shawn Heisey  wrote:

> On 11/4/2010 7:22 AM, stockiii wrote:
>
>> how can i start an optimize by using DIH, but NOT after an delta- or
>> full-import ?
>>
>
> I'm not aware of a way to do this with DIH, though there might be something
> I'm not aware of.  You can do it with an HTTP POST.  Here's how to do it
> with curl:
>
> /usr/bin/curl "http://HOST:PORT/solr/CORE/update" \
> -H "Content-Type: text/xml" \
> --data-binary '<optimize/>'
>
> Shawn
>
>


Using setStart in solrj

2010-11-04 Thread Olson, Ron
Hi all-

First, thanks to all the folks to have helped me so far getting the hang of 
Solr; I promise to give back when I think my contributions will be useful :)

I am at the point where I'm trying to return results back from a search in a 
war file, using Java with solrj. On the result page of the website I'd want to 
limit the actual results to probably around 20 or so, with the usual "next/prev 
page" paradigm. The issue I've been wrestling with is keeping the SolrQuery 
object around so that I don't need to transmit the entire thing back to the 
client, especially if they search for something like "truck", which could 
return a lot of results.

I was thinking that one solution would be to do a "query.setRows(20);" for the 
query, then return the results back with some sort of an identifier so that on 
subsequent queries, I could also include "query.setStart(someCounter + 1);" to 
get the next set of 20. In theory, that would work at the cost of having to 
re-execute the query.

I've been looking for information about setStart() and haven't found much more 
than Javadoc that says "sets the starting row for the result set". My question 
is, how do I know what the starting row is? Maybe, based on the search 
parameters, it will always return the results in an implicit order in which 
case is it just like executing a fixed query in a database and then grabbing 
the next 20 rows from the result set? Because the user would be pressing the 
prev/next buttons, even though the query is being re-executed, the parameters 
would not be changing.

That's the theory, anyway. It seems excessive to keep executing the same query 
over and over again just because the user wants to see the next set of results, 
especially if the original SolrQuery object has them all, but maybe that's just 
what needs to be done, given the stateless nature of the web.

Any info on this method/strategy would be most appreciated.

Thanks,

Ron

DISCLAIMER: This electronic message, including any attachments, files or 
documents, is intended only for the addressee and may contain CONFIDENTIAL, 
PROPRIETARY or LEGALLY PRIVILEGED information.  If you are not the intended 
recipient, you are hereby notified that any use, disclosure, copying or 
distribution of this message or any of the information included in or with it 
is  unauthorized and strictly prohibited.  If you have received this message in 
error, please notify the sender immediately by reply e-mail and permanently 
delete and destroy this message and its attachments, along with any copies 
thereof. This message does not create any contractual obligation on behalf of 
the sender or Law Bulletin Publishing Company.
Thank you.


Re: Optimize Index

2010-11-04 Thread Shawn Heisey

On 11/4/2010 7:22 AM, stockiii wrote:

how can i start an optimize by using DIH, but NOT after an delta- or
full-import ?


I'm not aware of a way to do this with DIH, though there might be 
something I'm not aware of.  You can do it with an HTTP POST.  Here's 
how to do it with curl:


/usr/bin/curl "http://HOST:PORT/solr/CORE/update" \
-H "Content-Type: text/xml" \
--data-binary '<optimize/>'

Shawn



Re: Updating Solr index - DIH delta vs. task queues

2010-11-04 Thread Ezequiel Calderara
I'm in the same scenario, so this answer would be helpful too..
I'm adding...

3) Web Service - Request a web service for all the new data that has been
updated (can this be done?)
On Thu, Nov 4, 2010 at 2:38 PM, Andy  wrote:

> Hi,
> I have data stored in a database that is being updated constantly. I need
> to find a way to update Solr index as data in the database is being updated.
> There seems to be 2 main schools of thoughts on this:
> 1) DIH delta - query the database for all records that have a timestamp
> later than the last_index_time. Import those records for indexing to Solr
> 2) Task queue - every time a record is updated in the database, throw a
> task to a queue to index that record to Solr
> Just want to know what are the pros and cons of each approach and what is
> your experience. For someone starting new, what'd be your recommendation?
> Thanks, Andy
>
>
>




-- 
__
Ezequiel.

Http://www.ironicnet.com


Updating Solr index - DIH delta vs. task queues

2010-11-04 Thread Andy
Hi,
I have data stored in a database that is being updated constantly. I need to 
find a way to update Solr index as data in the database is being updated.
There seems to be 2 main schools of thoughts on this:
1) DIH delta - query the database for all records that have a timestamp later 
than the last_index_time. Import those records for indexing to Solr
2) Task queue - every time a record is updated in the database, throw a task to 
a queue to index that record to Solr
Just want to know what are the pros and cons of each approach and what is your 
experience. For someone starting new, what'd be your recommendation?
Thanks, Andy


  

Re: Negative or zero value for fieldNorm

2010-11-04 Thread Markus Jelsma
On Thursday 04 November 2010 15:12:23 Yonik Seeley wrote:
> On Thu, Nov 4, 2010 at 9:51 AM, Markus Jelsma
> 
>  wrote:
> > I've done some testing with the example docs and it behaves similar when
> > there is a zero doc boost. Luke, however, does not show me the
> > index-time boosts.
> 
> Remember that the norm is a product of the length norm and the index
> time boost... it's recorded as a single number in the index.

Yes.

> > Both document and field boosts are not visible in Luke's output. I've
> > changed doc boost and field boosts for the mp500.xml document but all I
> > ever see returned is boost=1.0. Is this correct?
> 
> Perhaps you still have omitNorms=true for the field you are querying?

The example schema does not have omitNorms=true on the name, cat or features 
field.

> 
> -Yonik
> http://www.lucidimagination.com

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536600 / 06-50258350


Re: Negative or zero value for fieldNorm

2010-11-04 Thread Yonik Seeley
On Thu, Nov 4, 2010 at 9:51 AM, Markus Jelsma
 wrote:
> I've done some testing with the example docs and it behaves similar when there
> is a zero doc boost. Luke, however, does not show me the index-time boosts.

Remember that the norm is a product of the length norm and the index
time boost... it's recorded as a single number in the index.

> Both document and field boosts are not visible in Luke's output. I've changed
> doc boost and field boosts for the mp500.xml document but all I ever see
> returned is boost=1.0. Is this correct?

Perhaps you still have omitNorms=true for the field you are querying?

-Yonik
http://www.lucidimagination.com
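As a rough sketch of how that single number comes about with Lucene 2.9's DefaultSimilarity; this is an illustration of the arithmetic, not the exact code path, and the method name storedNorm is made up:

import org.apache.lucene.search.Similarity;

public class NormSketch {
    // norm stored for a field = lengthNorm(numTerms) * docBoost * fieldBoost, encoded to one byte
    static byte storedNorm(int numTerms, float docBoost, float fieldBoost) {
        float lengthNorm = (float) (1.0 / Math.sqrt(numTerms)); // fewer terms => larger norm
        float norm = lengthNorm * docBoost * fieldBoost;        // index-time boosts are multiplied in
        return Similarity.encodeNorm(norm);                     // a 0 doc boost therefore encodes to fieldNorm=0.0
    }
}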


Re: Negative or zero value for fieldNorm

2010-11-04 Thread Markus Jelsma
I've done some testing with the example docs and it behaves similarly when there 
is a zero doc boost. Luke, however, does not show me the index-time boosts. 
Both document and field boosts are not visible in Luke's output. I've changed 
doc boost and field boosts for the mp500.xml document but all I ever see 
returned is boost=1.0. Is this correct?

Anyway, I'm looking at Nutch now for reasons why it sends a zero boost on a 
document.

On Thursday 04 November 2010 14:16:22 Yonik Seeley wrote:
> On Thu, Nov 4, 2010 at 8:04 AM, Markus Jelsma
> 
>  wrote:
> > The question remains, why does the title field return a fieldNorm=0 for
> > many queries?
> 
> Because the index-time boost was set to 0 when the doc was indexed.  I
> can't say how that happened... look to your indexing code.
> 
> > And a subquestion, does the luke request handler return boost values
> > for documents? I know i get boost values for fields but i haven't seen
> > boost values for documents.
> 
> The doc boost is just multiplied into each field boost and doesn't
> have a separate representation in the index.
> 
> -Yonik
> http://www.lucidimagination.com

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536600 / 06-50258350


Re: Problem escaping question marks

2010-11-04 Thread Robert Muir
On Thu, Nov 4, 2010 at 1:44 AM, Stephen Powis  wrote:
> I want to return any first name with a Question Mark in it
> Query: first_name: *\?*
>

There is no way to escape the metacharacters * or ? for a wildcard
query (regardless of queryparser, even if you write your own).
See https://issues.apache.org/jira/browse/LUCENE-588

Its something we could fix, but in all honesty it seems one reason it
isn't fixed is because the bug is so old, yet there hasn't really been
any indication of demand for such a thing...


Re: Problem escaping question marks

2010-11-04 Thread Jean-Sebastien Vachon
Have you tried encoding it with %3F?

firstname:*%3F*

On 2010-11-04, at 1:44 AM, Stephen Powis wrote:

> I'm having difficulty properly escaping ? in my search queries.  It seems as
> tho it matches any character.
> 
> Some info, a simplified schema and query to explain the issue I'm having.
> I'm currently running solr1.4.1
> 
> Schema:
> 
> 
>  required="false" />
> 
> I want to return any first name with a Question Mark in it
> Query: first_name: *\?*
> 
> Returns all documents with any character in it.
> 
> Can anyone lend a hand?
> Thanks!
> Stephen



Re: Negative or zero value for fieldNorm

2010-11-04 Thread Yonik Seeley
On Thu, Nov 4, 2010 at 8:04 AM, Markus Jelsma
 wrote:
> The question remains, why does the title field return a fieldNorm=0 for many
> queries?

Because the index-time boost was set to 0 when the doc was indexed.  I
can't say how that happened... look to your indexing code.

> And a subquestion, does the luke request handler return boost values
> for documents? I know i get boost values for fields but i haven't seen boost
> values for documents.

The doc boost is just multiplied into each field boost and doesn't
have a separate representation in the index.

-Yonik
http://www.lucidimagination.com


Re: mergeFactor questions

2010-11-04 Thread Shawn Heisey

On 11/4/2010 3:27 AM, Tommaso Teofili wrote:

- Is mergeFactor a one time configuration setting that is considered only
when creating the index for the first time or can it be adjusted later even
with some docs inside the index? e.g. I have mF to 10 then I realize I want
quicker searches and I set it to 2 so that at the next optimize/commit I
will have no more than 2 segments. My understanding is that one can adjust
mF over time, is it right?


The mergeFactor is applied anytime documents are added to the index, not 
just when it is built for the first time.  You can adjust it later, and 
reload the core or restart Solr.  It will apply to any additional 
indexing from that point forward.


With a mergeFactor of 10, having 21 segments (and more) temporarily on 
the disk at the same time is reasonably possible.  I know this applies 
if you are doing a continuous large insert, not sure if you are doing 
several small inserts separately. These segments are:


* The small segment that is being built right now.
* The previous 10 small segments.
* The merged segment being created from those above.
* The previous 9 merged segments.

If it takes a really long time to merge the last 10 small segments and 
then merge the 10 large segments into an even larger segment, you can 
end up with even more small segments from your continuous insert.  If it 
should take long enough that you actually get 10 more new small 
segments, the large merge will pause while it completes the small 
merge.  I saw this happen recently when I decided to see what happens if 
I built a single shard from our entire database.  It took a really long 
time, partly from that super-merge and the optimize that happened later, 
and took up 85GB of disk space.


I'm not really sure what happens if you have this continue beyond a 
single super-merge like I have mentioned.



- In a replicated environment does it make sense to define different
mergeFactors on master and slave? I'd say no since it influences the number
of segments created, that being a concern of who actually index documents
(the master) not of who receives (segments of) index, but please correct me
if I am wrong.


Because it only applies when indexes are being built, it has no meaning 
on a slave, which as you said, just copies the data from the master.


Shawn



Re: ContentStreamDataSource

2010-11-04 Thread Noble Paul നോബിള്‍ नोब्ळ्
for contentstreamdatasource to work you must post the stream in the request
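For example, a minimal SolrJ sketch of posting the XML as the stream to the DIH handler that appears in the log below (/datapush); the file name and host are placeholders, and this is only one way to get a content stream into the request:

import java.io.File;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

public class DataPushExample {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
        ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/datapush");
        req.addFile(new File("records.xml"));        // the posted body becomes the content stream
        req.setParam("command", "full-import");
        req.setParam("clean", "false");
        server.request(req);
    }
}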

On Thu, Nov 4, 2010 at 8:13 AM, Theodor Tolstoy
wrote:

> Hi!
> I am trying to get the ContentStreamDataSource to work properly , but there
> are not many examples out there.
>
> What I have done is that  I have made a copy of my HttpDataSource config
> and replaced the 
> If I understand everything correctly I should be able to use the same URL
> syntax as with HttpDataSource and supply the XML file as post data.
>
> I have tried to post data - both as binary, file and string to the URL, but
> nothing happens.
>
>
> This is the log file:
> 2010-nov-04 12:32:17 org.apache.solr.handler.dataimport.DataImporter
> doFullImport
> INFO: Starting Full Import
> 2010-nov-04 12:32:17 org.apache.solr.handler.dataimport.SolrWriter
> readIndexerProperties
> VARNING: Unable to read: datapush.properties
> 2010-nov-04 12:32:17 org.apache.solr.handler.dataimport.DocBuilder execute
> INFO: Time taken = 0:0:0.0
> 2010-nov-04 12:32:17 org.apache.solr.core.SolrCore execute
> INFO: [] webapp=/solr path=/datapush
> params={clean=false&entity=suLIBRIS&command=full-import} status=0 QTime=0
>
>
> What am I doing wrong?
>
> Regards
> Theodor Tolstoy
> Developer Stockholm university library
>
>


-- 
-
Noble Paul | Systems Architect| AOL | http://aol.com


Re: querying multiple fields as one

2010-11-04 Thread Erick Erickson
Ken's suggestion to look at dismax is a good one, but I have
a question
q=type:electronics cat:electronics

should do what you want assuming your default operator
is OR.  Is it failing? Or is the real question how you can
do this automatically?

I'd expect the ranking to be a bit different, but I'm guessing
that's not a big issue.
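
(If your default operator turns out to be AND instead, you can make the OR
explicit, e.g.

   q=type:electronics OR cat:electronics

or pass q.op=OR on the request.)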

Best
Erick

On Thu, Nov 4, 2010 at 8:21 AM, Tommaso Teofili
wrote:

> Hi all,
> having two fields named 'type' and 'cat' with identical type and options,
> but different values recorded, would it be possible to query them as if they
> were one field?
> For instance
>  q=type:electronics cat:electronics
> should return same results as
>  q=common:electronics
> I know I could do it by defining a third field 'common' with copyFields from
> 'type' and 'cat' to 'common', but this wouldn't be feasible if you already
> have lots of documents in your index and don't want to reindex everything,
> would it?
> Any suggestions?
> Thanks in advance,
> Tommaso
>


Re: querying multiple fields as one

2010-11-04 Thread Ken Stanley
On Thu, Nov 4, 2010 at 8:21 AM, Tommaso Teofili
wrote:

> Hi all,
> having two fields named 'type' and 'cat' with identical type and options,
> but different values recorded, would it be possible to query them as if they
> were one field?
> For instance
>  q=type:electronics cat:electronics
> should return same results as
>  q=common:electronics
> I know I could do it by defining a third field 'common' with copyFields from
> 'type' and 'cat' to 'common', but this wouldn't be feasible if you already
> have lots of documents in your index and don't want to reindex everything,
> would it?
> Any suggestions?
> Thanks in advance,
> Tommaso
>

Tommaso,

If re-indexing is not feasible/preferred, you might try looking into
creating a dismax handler that should give you what you're looking for in
your query: http://wiki.apache.org/solr/DisMaxQParserPlugin. The sample
solrconfig.xml that comes with Solr has a dismax handler that you can modify
to your needs.
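
For example, a stripped-down handler along these lines (the handler name is 
just a placeholder) should search both fields as if they were one, with no 
re-indexing needed:

  <requestHandler name="/commonsearch" class="solr.SearchHandler">
    <lst name="defaults">
      <str name="defType">dismax</str>
      <!-- score against both fields -->
      <str name="qf">type cat</str>
    </lst>
  </requestHandler>

A request like /commonsearch?q=electronics should then behave much like the 
q=common:electronics you described.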

- Ken Stanley


querying multiple fields as one

2010-11-04 Thread Tommaso Teofili
Hi all,
having two fields named 'type' and 'cat' with identical type and options,
but different values recorded, would it be possible to query them as if they
were one field?
For instance
 q=type:electronics cat:electronics
should return same results as
 q=common:electronics
I know I could do it by defining a third field 'common' with copyFields from
'type' and 'cat' to 'common', but this wouldn't be feasible if you already have
lots of documents in your index and don't want to reindex everything, would it?
Any suggestions?
Thanks in advance,
Tommaso


ContentStreamDataSource

2010-11-04 Thread Theodor Tolstoy
Hi!
I am trying to get the ContentStreamDataSource to work properly, but there are 
not many examples out there.

What I have done is make a copy of my HttpDataSource config and 
replace the 

Re: Negative or zero value for fieldNorm

2010-11-04 Thread Markus Jelsma
Hi,

I've worked around the issue by setting omitNorms=true on the title field. Now 
all fieldNorm values are 1.0f and therefore do not mess up my scores anymore. 
This, of course, is hardly a solution even though I currently do not use 
index-time boosts on any field.
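
For reference, the workaround is nothing more than something like this on the 
field definition in schema.xml (the field type here is a guess), followed by 
re-indexing:

  <!-- omitNorms=true drops length normalization and index-time boosts
       for this field, so fieldNorm is effectively 1.0 -->
  <field name="title" type="text" indexed="true" stored="true" omitNorms="true"/>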

The question remains: why does the title field return a fieldNorm of 0 for many 
queries? And a sub-question: does the Luke request handler return boost values 
for documents? I know I get boost values for fields, but I haven't seen boost 
values for documents. 

Cheers,

On Wednesday 03 November 2010 20:44:48 Markus Jelsma wrote:
> > Regarding "Negative or zero value for fieldNorm", I don't see any
> > negative fieldNorms here... just very small positive ones?
> 
> Of course, you're right. The E-# got twisted in my mind and became
> negative. Silly me.
> 
> > Anyway the fieldNorm is the product of the lengthNorm and the
> > index-time boost of the field (which is itself the product of the
> > index time boost on the document and the index time boost of all
> > instances of that field).  Index time boosts default to "1" though, so
> > they have no effect unless something has explicitly set a boost.
> 
> I've just checked docs 7 and 1462 (resp. first and second in debug output
> below) with Luke. The title and content fields have no index time boosts,
> thus defaulting to 1.0f which is fine.
> 
> Then, why does doc 7 have a fieldNorm of 0.0 on title (thereby setting a 0.0
> score on the doc in the result set), and why does doc 1462 have a very, very
> small fieldNorm?
> 
> debugOutput for doc 7:
> 0.0 = fieldNorm(field=title, doc=7)
> 
> Luke on the title field of doc 7.
> 1.0
> 
> Thanks for your reply!
> 
> > -Yonik
> > http://www.lucidimagination.com
> > 
> > 
> > 
> > On Wed, Nov 3, 2010 at 2:30 PM, Markus Jelsma
> > 
> >  wrote:
> > > Hi all,
> > > 
> > > I've got some puzzling issue here. During tests i noticed a document at
> > > the bottom of the results where it should not be. I query using DisMax
> > > on title and content field and have a boost on title using qf. Out of
> > > 30 results, only two documents also have the term in the title.
> > > 
> > > Using debugQuery and fl=*,score I quickly noticed a large negative
> > > maxScore for the complete result set and a portion of the result set
> > > where scores sum up to zero because of a product with 0 (fieldNorm).
> > > 
> > > See below for debug output for a result with score = 0:
> > > 
> > > 0.0 = (MATCH) sum of:
> > >   0.0 = (MATCH) max of:
> > >     0.0 = (MATCH) weight(content:kunstgrasveld in 7), product of:
> > >       0.75658196 = queryWeight(content:kunstgrasveld), product of:
> > >         6.6516633 = idf(docFreq=33, maxDocs=9682)
> > >         0.113743275 = queryNorm
> > >       0.0 = (MATCH) fieldWeight(content:kunstgrasveld in 7), product of:
> > >         2.236068 = tf(termFreq(content:kunstgrasveld)=5)
> > >         6.6516633 = idf(docFreq=33, maxDocs=9682)
> > >         0.0 = fieldNorm(field=content, doc=7)
> > >     0.0 = (MATCH) fieldWeight(title:kunstgrasveld in 7), product of:
> > >       1.0 = tf(termFreq(title:kunstgrasveld)=1)
> > >       8.791729 = idf(docFreq=3, maxDocs=9682)
> > >       0.0 = fieldNorm(field=title, doc=7)
> > > 
> > > And one with a negative score:
> > > 
> > > 3.0716116E-4 = (MATCH) sum of:
> > >   3.0716116E-4 = (MATCH) max of:
> > >     3.0716116E-4 = (MATCH) weight(content:kunstgrasveld in 1462), product of:
> > >       0.75658196 = queryWeight(content:kunstgrasveld), product of:
> > >         6.6516633 = idf(docFreq=33, maxDocs=9682)
> > >         0.113743275 = queryNorm
> > >       4.059853E-4 = (MATCH) fieldWeight(content:kunstgrasveld in 1462), product of:
> > >         1.0 = tf(termFreq(content:kunstgrasveld)=1)
> > >         6.6516633 = idf(docFreq=33, maxDocs=9682)
> > >         6.1035156E-5 = fieldNorm(field=content, doc=1462)
> > > 
> > > There are no funky issues with term analysis for the text fieldType; in
> > > fact, the term passes through unchanged. I don't set omitNorms, I store
> > > termVectors, etc.
> > > 
> > > Because fieldNorm = fieldBoost / sqrt(numTermsForField), I suspect my
> > > input from Nutch is messed up. A fieldNorm can never be <= 0 for a
> > > normal positive boost, and field boosts should not be zero or negative
> > > (correct me if I'm wrong). But since I can't yet figure out what field
> > > boosts Nutch sends to me, I thought I'd drop by on this mailing list
> > > first.
> > > 
> > > There are quite a few query terms that return with zero or negative
> > > scores, and many that behave as I expect. I also find it a bit hard to
> > > comprehend why the docs with negative scores rank higher in the result
> > > set than documents with zero scores. Sorting defaults to score DESC,
> > > but this is perhaps another issue.
> > > 
> > > Anyway, the test runs on a Solr 1.4.1 instance with Java 6 under the
> > > hood. Help or directions are appreciated =)
> > > 
> > > Cheers,
> > > 
> > > --
>

mergeFactor questions

2010-11-04 Thread Tommaso Teofili
Hi all,
Having read the SolrPerformanceFactors wiki page [1], I'd still need a
couple of clarifications about mergeFactor (I am using version 1.4.1) so if
anyone can help it would be nice.

   - Is mergeFactor a one-time configuration setting that is considered only
   when creating the index for the first time, or can it be adjusted later even
   with some docs inside the index? E.g. I have mF set to 10, then I realize I want
   quicker searches and I set it to 2 so that at the next optimize/commit I
   will have no more than 2 segments. My understanding is that one can adjust
   mF over time, is it right?
   - In a replicated environment does it make sense to define different
   mergeFactors on master and slave? I'd say no, since it influences the number
   of segments created, which is a concern of whoever actually indexes documents
   (the master), not of whoever receives (segments of) the index, but please
   correct me if I am wrong.

Thanks for your help,
Regards,
Tommaso

[1] : http://wiki.apache.org/solr/SolrPerformanceFactors


RE: Filter by relevance

2010-11-04 Thread Jason Brown
I have a dismax query where I check for values in 3 fields against documents in 
the index - a title, a list of keyword tags and then full-text of the document.

I usually get lots of results and I can see that the first results are OK - 
it's giving precedence to titles and tag matches, as my dismax boosts on title 
and keywords (normal boost and phrase boost).

After say 20/30 good results I start to get matches based upon just the 
full-text, so these are less relevant. 

I am also doing facet counting on my keyword tags (and presenting the counts in 
the results as a way of filtering), and as you can imagine the counts are high 
because of the number of overall results. I want to somehow make the facet 
counts more closely associated with the higher-relevancy results.

My options as I see it are - 

1) exclude full-text from the dismax altogether
2) configure the dismax normal boost on full-text to zero, but the phrase boost 
to something higher (the aim here is to only really get a hit on the full-text 
if my search term is found as a phrase in the full-text) - see the sketch below
3) limit my results by relevancy or number of results

If I do (3) above, will the facet counts respect the lower number of results? 
That is the overall aim really.
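
For what it's worth, option (2) would look something like this in my dismax 
defaults - field names and boost values here are only placeholders, and whether 
a literal ^0 boost behaves better than dropping full-text from qf entirely is 
something I would still need to test:

  <str name="qf">title^5 keywords^3 fulltext^0</str>
  <!-- full-text only contributes via the phrase boost -->
  <str name="pf">title^5 keywords^3 fulltext^2</str>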

Thank You

Jason.


-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Wed 03/11/2010 23:15
To: solr-user@lucene.apache.org
Subject: Re: Filter by relevance
 
Be aware, though, that relevance isn't absolute, it's only interesting
#within# a query. And it's
then normed between 0 and 1. So picking "a certain value" is rarely doing
what you think it will.
Limiting to the top N docs is usually more reasonable.

But this may be an XY problem. What is it you're trying to accomplish?
Perhaps if you
state the problem, some other suggestions may be in the offing.

Best
Erick

On Wed, Nov 3, 2010 at 4:48 PM, Jason Brown  wrote:

> Is it possible to filter my search results by relevance? For example,
> anything below a certain value shouldn't be returned?
>
> I also retrieve facet counts in my search queries, so it would be useful if
> the facet counts also respected the filter on the relevance.
>
> Thank You.
>
> Jason.
>
>

