date:20110531

Hi all,

I'm using DIH and getting the following error.
My Solr version is Solr3.1.

=
...
Caused by:
com.mysql.jdbc.exceptions.jdbc4.MySQLNonTransientConnectionException: Could
not create connection to database server. Attempted reconnect 3 times.
Giving up.
at sun.reflect.GeneratedConstructorAccessor98.newInstance(Unknown
Source)
at
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
at com.mysql.jdbc.Util.handleNewInstance(Util.java:406)
at com.mysql.jdbc.Util.getInstance(Util.java:381)
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:985)
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:956)
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:926)
at
com.mysql.jdbc.ConnectionImpl.createNewIO(ConnectionImpl.java:2364)
at com.mysql.jdbc.ConnectionImpl.init(ConnectionImpl.java:781)
at com.mysql.jdbc.JDBC4Connection.init(JDBC4Connection.java:46)
at sun.reflect.GeneratedConstructorAccessor94.newInstance(Unknown
Source)
at
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
at com.mysql.jdbc.Util.handleNewInstance(Util.java:406)
at
com.mysql.jdbc.ConnectionImpl.getInstance(ConnectionImpl.java:352)
at
com.mysql.jdbc.NonRegisteringDriver.connect(NonRegisteringDriver.java:284)
at
org.apache.solr.handler.dataimport.JdbcDataSource$1.call(JdbcDataSource.java:161)
at
org.apache.solr.handler.dataimport.JdbcDataSource$1.call(JdbcDataSource.java:128)
at
org.apache.solr.handler.dataimport.JdbcDataSource.getConnection(JdbcDataSource.java:363)
at
org.apache.solr.handler.dataimport.JdbcDataSource.access$200(JdbcDataSource.java:39)
at
org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.init(JdbcDataSource.java:240)
... 11 more
Caused by:
com.mysql.jdbc.exceptions.jdbc4.MySQLNonTransientConnectionException: Data
source rejected establishment of connection,  message from server: Too many
connections
at sun.reflect.GeneratedConstructorAccessor98.newInstance(Unknown
Source)
at
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
at com.mysql.jdbc.Util.handleNewInstance(Util.java:406)
at com.mysql.jdbc.Util.getInstance(Util.java:381)
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:985)
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:956)
at com.mysql.jdbc.MysqlIO.doHandshake(MysqlIO.java:1104)
at
com.mysql.jdbc.ConnectionImpl.createNewIO(ConnectionImpl.java:2292)
... 24 more

=

My dataSource setting is something like this:

dataSource type=JdbcDataSource driver=com.mysql.jdbc.Driver
url=jdbc:mysql://database01/test?autoReconnect=true user=xxx
password=xxx batchSize=-1 /

Any idea to solve this problem?

Thank you!


--
View this message in context: 
http://lucene.472066.n3.nabble.com/DIH-Exception-with-Too-many-connections-tp3005213p3005213.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: DIH: Exception with Too many connections

2011-05-31 Thread Chandan Tamrakar

looks like you are not being able to connect to database , pls see if you
get similar exception when you try to connect from other clients



On Tue, May 31, 2011 at 3:01 PM, tiffany tiffany.c...@future.co.jp wrote:

 Hi all,

 I'm using DIH and getting the following error.
 My Solr version is Solr3.1.

 =
 ...
 Caused by:
 com.mysql.jdbc.exceptions.jdbc4.MySQLNonTransientConnectionException: Could
 not create connection to database server. Attempted reconnect 3 times.
 Giving up.
at sun.reflect.GeneratedConstructorAccessor98.newInstance(Unknown
 Source)
at

 sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
at com.mysql.jdbc.Util.handleNewInstance(Util.java:406)
at com.mysql.jdbc.Util.getInstance(Util.java:381)
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:985)
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:956)
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:926)
at
 com.mysql.jdbc.ConnectionImpl.createNewIO(ConnectionImpl.java:2364)
at com.mysql.jdbc.ConnectionImpl.init(ConnectionImpl.java:781)
at com.mysql.jdbc.JDBC4Connection.init(JDBC4Connection.java:46)
at sun.reflect.GeneratedConstructorAccessor94.newInstance(Unknown
 Source)
at

 sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
at com.mysql.jdbc.Util.handleNewInstance(Util.java:406)
at
 com.mysql.jdbc.ConnectionImpl.getInstance(ConnectionImpl.java:352)
at
 com.mysql.jdbc.NonRegisteringDriver.connect(NonRegisteringDriver.java:284)
at

 org.apache.solr.handler.dataimport.JdbcDataSource$1.call(JdbcDataSource.java:161)
at

 org.apache.solr.handler.dataimport.JdbcDataSource$1.call(JdbcDataSource.java:128)
at

 org.apache.solr.handler.dataimport.JdbcDataSource.getConnection(JdbcDataSource.java:363)
at

 org.apache.solr.handler.dataimport.JdbcDataSource.access$200(JdbcDataSource.java:39)
at

 org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.init(JdbcDataSource.java:240)
... 11 more
 Caused by:
 com.mysql.jdbc.exceptions.jdbc4.MySQLNonTransientConnectionException: Data
 source rejected establishment of connection,  message from server: Too
 many
 connections
at sun.reflect.GeneratedConstructorAccessor98.newInstance(Unknown
 Source)
at

 sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
at com.mysql.jdbc.Util.handleNewInstance(Util.java:406)
at com.mysql.jdbc.Util.getInstance(Util.java:381)
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:985)
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:956)
at com.mysql.jdbc.MysqlIO.doHandshake(MysqlIO.java:1104)
at
 com.mysql.jdbc.ConnectionImpl.createNewIO(ConnectionImpl.java:2292)
... 24 more

 =

 My dataSource setting is something like this:

dataSource type=JdbcDataSource driver=com.mysql.jdbc.Driver
 url=jdbc:mysql://database01/test?autoReconnect=true user=xxx
 password=xxx batchSize=-1 /

 Any idea to solve this problem?

 Thank you!


 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/DIH-Exception-with-Too-many-connections-tp3005213p3005213.html
 Sent from the Solr - User mailing list archive at Nabble.com.




-- 
Chandan Tamrakar
*
*

Re: DIH: Exception with Too many connections

2011-05-31 Thread Stefan Matheis

Tiffany,

On Tue, May 31, 2011 at 11:16 AM, tiffany tiffany.c...@future.co.jp wrote:
 Any idea to solve this problem?

in Addition to Chandan: Check your mysql process list and have a look
what is displayed there

Regards
Stefan

Re: DIH: Exception with Too many connections

Thanks for your reply, Chandan.

Here is the additional information.  
I'm also using the multi-core function, and I run the delta-import command
in parallel due to saving the running time.  If I don't run in parallel, it
works fine.  Each core accesses to the same database server but different
schema.

So, I don't know if I should change something in my database server side or
I can adjust something at the Solr side by adding some kind of property.

Tiffany

--
View this message in context: 
http://lucene.472066.n3.nabble.com/DIH-Exception-with-Too-many-connections-tp3005213p3005313.html
Sent from the Solr - User mailing list archive at Nabble.com.

Getting and viewing a heap dump

2011-05-31 Thread Constantijn Visinescu

Hi Bernd,

I'm assuming Linux here, if you're running something else these
instructions might differ slightly.

First get a heap dump with:
jmap -heap:format=b,file=/path/to/generate/heapdumpfile.hprof 1234

with 1234 being the PID (process id) of the JVM

After you get a Heap dump you can analyze it with Eclipse MAT (Memory
Analyzer Tool).

Just a heads up if you're doing this in production: the JVM will
freeze completely while generating the heap dump, which will seem like
a giant stop the world GC with a 10GB heap.

Good luck with finding out what's eating your memory!

Constantijn

P.S.
Sorry about  altering the subject line, but the spam assassin used by
the mailing list was rejecting my post because it had replication in
the subject line. hope it doesn't mess up the thread.

On Tue, May 31, 2011 at 8:43 AM, Bernd Fehling
bernd.fehl...@uni-bielefeld.de wrote:
 Some more info,
 after one week the servers have the following status:

 Master (indexing only)
 + looks good and has heap size of about 6g from 10g OldGen
 + has loaded meanwhile 2 times the index from scratch via DIH
 + has added new documents into existing index via DIH
 + has optimized and replicated
 + no full GC within one week

 Slave A (search only) Online
 - looks bad and has heap size of 9.5g from 10g OldGen
 + was replicated
 - several full GC

 Slave B (search only) Backup
 + looks good has heap size of 4 g from 10g OldGen
 + was replicated
 + no full GC within one week

 Conclusion:
 + DIH, processing, indexing, replication are fine
 - the search is crap and eats up OldGen heap which can't be
  cleaned up by full GC. May be memory leaks or what ever...

 Due to this Solr 3.1 can _NOT_ be recommended as high-availability,
 high-search-load search engine because of unclear heap problems
 caused by the search. The search is out of the box, so no
 self produced programming errors.

 Any tools available for JAVA to analyze this?
 (like valgrind or electric fence for C++)

 Is it possible to analyze a heap dump produced with jvisualvm?
 Which tools?


 Bernd


 Am 30.05.2011 15:51, schrieb Bernd Fehling:

 Dear list,
 after switching from FAST to Solr I get the first _real_ data.
 This includes search times, memory consumption, perfomance of solr,...

 What I recognized so far is that something eats up my OldGen and
 I assume it might be replication.

 Current Data:
 one master - indexing only
 two slaves - search only
 over 28 million docs
 single instance
 single core
 index size 140g
 current heap size 16g

 After startup I have about 4g heap in use and about 3.5g of OldGen.
 After one week and some replications OldGen is filled close to 100
 percent.
 If I start an optimize under this condition I get OOM of heap.
 So my assumption is that something is eating up my heap.

 Any idea how to trace this down?

 May be a memory leak somewhere?

 Best regards
 Bernd

Re: DIH: Exception with Too many connections

Thanks Stefan!

I executed the  SHOW PROCESSLIST; command. (Is it what you mean? I've never
tried it before...)
It seems that when I executed one delta-import command, several threads were
inserted into the table and removed after commit.  Also it looks like the
number of threads are pretty much equal to the number of entity in my
db-data-config.xml.

So, if the number of threads in the process list is larger than
max_connections, I would get the too many connections error.  Am I
thinking the right way?
If it is right, maybe I should think of the commit timing, changing the
number of max_connections, and/or some other ways...
If there are any other idea, please let me know =)

Thanks a lot!


--
View this message in context: 
http://lucene.472066.n3.nabble.com/DIH-Exception-with-Too-many-connections-tp3005213p3005401.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Getting and viewing a heap dump

2011-05-31 Thread Bernd Fehling


Hi Constantijn,

yes I use Linux 64bit and thanks for the help.

Bernd

Am 31.05.2011 12:22, schrieb Constantijn Visinescu:

Hi Bernd,

I'm assuming Linux here, if you're running something else these
instructions might differ slightly.

First get a heap dump with:
jmap -heap:format=b,file=/path/to/generate/heapdumpfile.hprof 1234

with 1234 being the PID (process id) of the JVM

After you get a Heap dump you can analyze it with Eclipse MAT (Memory
Analyzer Tool).

Just a heads up if you're doing this in production: the JVM will
freeze completely while generating the heap dump, which will seem like
a giant stop the world GC with a 10GB heap.

Good luck with finding out what's eating your memory!

Constantijn

P.S.
Sorry about  altering the subject line, but the spam assassin used by
the mailing list was rejecting my post because it had replication in
the subject line. hope it doesn't mess up the thread.

On Tue, May 31, 2011 at 8:43 AM, Bernd Fehling
bernd.fehl...@uni-bielefeld.de  wrote:

Some more info,
after one week the servers have the following status:

Master (indexing only)
+ looks good and has heap size of about 6g from 10g OldGen
+ has loaded meanwhile 2 times the index from scratch via DIH
+ has added new documents into existing index via DIH
+ has optimized and replicated
+ no full GC within one week

Slave A (search only) Online
- looks bad and has heap size of 9.5g from 10g OldGen
+ was replicated
- several full GC

Slave B (search only) Backup
+ looks good has heap size of 4 g from 10g OldGen
+ was replicated
+ no full GC within one week

Conclusion:
+ DIH, processing, indexing, replication are fine
- the search is crap and eats up OldGen heap which can't be
  cleaned up by full GC. May be memory leaks or what ever...

Due to this Solr 3.1 can _NOT_ be recommended as high-availability,
high-search-load search engine because of unclear heap problems
caused by the search. The search is out of the box, so no
self produced programming errors.

Any tools available for JAVA to analyze this?
(like valgrind or electric fence for C++)

Is it possible to analyze a heap dump produced with jvisualvm?
Which tools?


Bernd


Am 30.05.2011 15:51, schrieb Bernd Fehling:


Dear list,
after switching from FAST to Solr I get the first _real_ data.
This includes search times, memory consumption, perfomance of solr,...

What I recognized so far is that something eats up my OldGen and
I assume it might be replication.

Current Data:
one master - indexing only
two slaves - search only
over 28 million docs
single instance
single core
index size 140g
current heap size 16g

After startup I have about 4g heap in use and about 3.5g of OldGen.
After one week and some replications OldGen is filled close to 100
percent.
If I start an optimize under this condition I get OOM of heap.
So my assumption is that something is eating up my heap.

Any idea how to trace this down?

May be a memory leak somewhere?

Best regards
Bernd





--
*
Bernd FehlingUniversitätsbibliothek Bielefeld
Dipl.-Inform. (FH)Universitätsstr. 25
Tel. +49 521 106-4060   Fax. +49 521 106-4052
bernd.fehl...@uni-bielefeld.de33615 Bielefeld

BASE - Bielefeld Academic Search Engine - www.base-search.net
*

RE: newbie question for DataImportHandler

2011-05-31 Thread Kevin Bootz

In the op it's stated that the index was deleted. I'm guessing that means the 
physical files, /data/  
quote
populate the table 
 with another million rows of data.
 I remove the index that solr previously create. I restart solr and go 
 to
the
 data import handler development console and do the full import again.
endquote

Is there a separate cache that could be causing the issue? I'm a newbie as well 
and it seems that if I delete the index there shouldn't be any vestige info 
left anywhere

Thanks

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Sunday, May 29, 2011 9:00 PM
To: solr-user@lucene.apache.org
Subject: Re: newbie question for DataImportHandler

This trips up a lot of folks. Sold just marks docs as deleted, the terms etc 
are left in the index until an optimize is performed, or the segments are 
merged. This latter isn't very predictable, so just do an optimize.

The docs aren't returned as results though.

Best
Erick
On May 24, 2011 10:22 PM, antoniosi antonio...@gmail.com wrote:
 Hi,

 I am new to Solr; apologize in advance if this is a stupid question.

 I have created a simple database, with only 1 table with 3 columns, 
 id, name, and last_update fields.

 I populate the database with 1 million test rows.
 I run solr, go to the data import handler development console and do a
full
 import. I use the Luke tool to look at the content of the lucene index.

 This all works fine so far.

 I remove all the 1 million rows from my table and populate the table 
 with another million rows of data.
 I remove the index that solr previously create. I restart solr and go 
 to
the
 data import handler development console and do the full import again.

 I use the Luke tool to look at the content of the lucene index. 
 However,
I
 am seeing the old data in my new index.

 Doe Solr keeps a cached copy of the index somewhere?

 I hope I have described my problem clearly.

 Thanks in advance.

 --
 View this message in context:
http://lucene.472066.n3.nabble.com/newbie-question-for-DataImportHandler-tp2982277p2982277.html
 Sent from the Solr - User mailing list archive at Nabble.com.

Re: DIH: Exception with Too many connections

2011-05-31 Thread Stefan Matheis

Tiffany,

On Tue, May 31, 2011 at 12:45 PM, tiffany tiffany.c...@future.co.jp wrote:
 I executed the  SHOW PROCESSLIST; command. (Is it what you mean? I've never
 tried it before...)

Exactly this, yes :)

On Tue, May 31, 2011 at 12:45 PM, tiffany tiffany.c...@future.co.jp wrote:
 So, if the number of threads in the process list is larger than
 max_connections, I would get the too many connections error.  Am I
 thinking the right way?

Yepp, right

On Tue, May 31, 2011 at 12:45 PM, tiffany tiffany.c...@future.co.jp wrote:
 If it is right, maybe I should think of the commit timing, changing the
 number of max_connections, and/or some other ways...

You may lift the allowed Number of Connections for the MySQL-Server?
Or, of course - if possible - tweak your SOLR-Settings, correct

Regards
Stefan

Re: DIH: Exception with Too many connections

2011-05-31 Thread François Schiettecatte

Hi

You might also check the 'max_user_connections' settings too if you have that 
set:

# Maximum number of connections, and per user
max_connections   = 2048
max_user_connections  = 2048

http://dev.mysql.com/doc/refman/5.5/en/server-system-variables.html

Cheers

François


On May 31, 2011, at 7:39 AM, Stefan Matheis wrote:

 Tiffany,
 
 On Tue, May 31, 2011 at 12:45 PM, tiffany tiffany.c...@future.co.jp wrote:
 I executed the  SHOW PROCESSLIST; command. (Is it what you mean? I've never
 tried it before...)
 
 Exactly this, yes :)
 
 On Tue, May 31, 2011 at 12:45 PM, tiffany tiffany.c...@future.co.jp wrote:
 So, if the number of threads in the process list is larger than
 max_connections, I would get the too many connections error.  Am I
 thinking the right way?
 
 Yepp, right
 
 On Tue, May 31, 2011 at 12:45 PM, tiffany tiffany.c...@future.co.jp wrote:
 If it is right, maybe I should think of the commit timing, changing the
 number of max_connections, and/or some other ways...
 
 You may lift the allowed Number of Connections for the MySQL-Server?
 Or, of course - if possible - tweak your SOLR-Settings, correct
 
 Regards
 Stefan

Solr NRT

2011-05-31 Thread Ionut Manta

Hi,

I have the following strange use case:

Index 100 documents and make them immediately available for search. I call
this on the fly indexing. Then the index can be removed. So the size of
the index is not an issue here.
Is this possible with Solr? Anyone tried something similar?

Thank you,
Ionut

RE: Solr NRT

2011-05-31 Thread David Hill


Unless you cross a Solr server commit threshold your client has to post a 
commit/ message for the server content to be available for searching. 
Unfortunatly the Solr tool that is supposed to do this apparently doesn't. I 
asked for community help last week and was surprised to receive no response, I 
thought having to leave a Solr import process in an incomplete state would be 
more of a concern. In any case, our (hopefully temporary) solution was to hack 
the source code for the SimplePostTool demo code to turn it into a CommitTool.  
Once Solr receives the commit/ post you will be able to search for your 
recently added documents.

-Original Message-
From: Ionut Manta [mailto:ionut.ma...@gmail.com]
Sent: Tuesday, May 31, 2011 7:41 AM
To: solr-user@lucene.apache.org
Subject: Solr NRT

Hi,

I have the following strange use case:

Index 100 documents and make them immediately available for search. I call this 
on the fly indexing. Then the index can be removed. So the size of the index 
is not an issue here.
Is this possible with Solr? Anyone tried something similar?

Thank you,
Ionut

This e-mail and any files transmitted with it are confidential and intended 
solely for the use of the individual or entity to whom they are addressed. If 
you have received this e-mail in error please notify the originator of the 
message. This footer also confirms that this e-mail message has been scanned 
for the presence of computer viruses. Any views expressed in this message are 
those of the individual sender, except where the sender specifies and with 
authority, states them to be the views of Iowa Student Loan.

RE: DIH: Exception with Too many connections

Hi,


There is existing bug in DataImportHandler described (and patched) at
https://issues.apache.org/jira/browse/SOLR-2233
It is not used in a thread safe manner, and it is not appropriately closed 
reopened (why?); and new connection is opened unpredictably. It may cause
Too many connections even for huge SQL-side max_connections.

If you are interested, I can continue work on SOLR-2233. CC: dev@lucene (is
anyone working on DIH improvements?)

Thanks,
Fuad Efendi
http://www.tokenizer.ca/


-Original Message-
From: François Schiettecatte [mailto:fschietteca...@gmail.com] 
Sent: May-31-11 7:44 AM
To: solr-user@lucene.apache.org
Subject: Re: DIH: Exception with Too many connections

Hi

You might also check the 'max_user_connections' settings too if you have
that set:

# Maximum number of connections, and per user
max_connections   = 2048
max_user_connections  = 2048

http://dev.mysql.com/doc/refman/5.5/en/server-system-variables.html

Cheers

François



 So, if the number of threads in the process list is larger than 
 max_connections, I would get the too many connections error.  Am I 
 thinking the right way?

Re: Solr NRT

2011-05-31 Thread Ionut Manta

What results did you got with this hack?
How long it takes since you start indexing some documents until you get a
search result?
Did you try NRT?

On Tue, May 31, 2011 at 3:47 PM, David Hill dh...@studentloan.org wrote:


 Unless you cross a Solr server commit threshold your client has to post a
 commit/ message for the server content to be available for searching.
 Unfortunatly the Solr tool that is supposed to do this apparently doesn't. I
 asked for community help last week and was surprised to receive no response,
 I thought having to leave a Solr import process in an incomplete state would
 be more of a concern. In any case, our (hopefully temporary) solution was to
 hack the source code for the SimplePostTool demo code to turn it into a
 CommitTool.  Once Solr receives the commit/ post you will be able to
 search for your recently added documents.

 -Original Message-
 From: Ionut Manta [mailto:ionut.ma...@gmail.com]
 Sent: Tuesday, May 31, 2011 7:41 AM
 To: solr-user@lucene.apache.org
 Subject: Solr NRT

 Hi,

 I have the following strange use case:

 Index 100 documents and make them immediately available for search. I call
 this on the fly indexing. Then the index can be removed. So the size of
 the index is not an issue here.
 Is this possible with Solr? Anyone tried something similar?

 Thank you,
 Ionut

 This e-mail and any files transmitted with it are confidential and intended
 solely for the use of the individual or entity to whom they are addressed.
 If you have received this e-mail in error please notify the originator of
 the message. This footer also confirms that this e-mail message has been
 scanned for the presence of computer viruses. Any views expressed in this
 message are those of the individual sender, except where the sender
 specifies and with authority, states them to be the views of Iowa Student
 Loan.

Re: Bulk indexing, UpdateProcessor overwriteDupes and poor IO performances

2011-05-31 Thread lee carroll

Tanguy

You might have tried this already but can you set overwritedupes to
false and set the signiture key to be the id. That way solr
will manage updates?

from the wiki

http://wiki.apache.org/solr/Deduplication

!-- An example dedup update processor that creates the id field on the fly
   based on the hash code of some other fields.  This example has
overwriteDupes
   set to false since we are using the id field as the
signatureField and Solr
   will maintain uniqueness based on that anyway. --

HTH

Lee


On 30 May 2011 08:32, Tanguy Moal tanguy.m...@gmail.com wrote:

 Hello,

 Sorry for re-posting this but it seems my message got lost in the mailing 
 list's messages stream without hitting anyone's attention... =D

 Shortly, has anyone already experienced dramatic indexing slowdowns during 
 large bulk imports with overwriteDupes turned on and a fairly high duplicates 
 rate (around 4-8x) ?

 It seems to produce a lot of deletions, which in turn appear to make the 
 merging of segments pretty slow, by fairly increasing the number of little 
 reads operations occuring simultaneously with the regular large write 
 operations of the merge. Added to the poor IO performances of a commodity 
 SATA drive, indexing takes ages.

 I temporarily bypassed that limitation by disabling the overwriting of 
 duplicates, but that changes the way I request the index, requiring me to 
 turn on field collapsing at search time.

 Is this a known limitation ?

 Has anyone a few hints on how to optimize the handling of index time 
 deduplication ?

 More details on my setup and the state of my understanding are in my previous 
 message here-after.

 Thank you very much in advance.

 Regards,

 Tanguy

 On 05/25/11 15:35, Tanguy Moal wrote:

 Dear list,

 I'm posting here after some unsuccessful investigations.
 In my setup I push documents to Solr using the StreamingUpdateSolrServer.

 I'm sending a comfortable initial amount of documents (~250M) and wished to 
 perform overwriting of duplicated documents at index time, during the 
 update, taking advantage of the UpdateProcessorChain.

 At the beginning of the indexing stage, everything is quite fast; documents 
 arrive at a rate of about 1000 doc/s.
 The only extra processing during the import is computation of a couple of 
 hashes that are used to identify uniquely documents given their content, 
 using both stock (MD5Signature) and custom (derived from Lookup3Signature) 
 update processors.
 I send a commit command to the server every 500k documents sent.

 During a first period, the server is CPU bound. After a short while (~10 
 minutes), the rate at which documents are received starts to fall 
 dramatically, the server being IO bound.
 I've been firstly thinking of a normal speed decrease during the commit, 
 while my push client is waiting for the flush to occur. That would have been 
 a normal slowdown.

 The thing that retained my attention was the fact that unexpectedly, the 
 server was performing a lot of small reads, way more the number writes, 
 which seem to be larger.
 The combination of the many small reads with the constant amount of bigger 
 writes seem to be creating a lot of IO contention on my commodity SATA 
 drive, and the ETA of my built index started to increase scarily =D

 I then restarted the JVM with JMX enabled so I could start investigating a 
 little bit more. I've the realized that the UpdateHandler was performing 
 many reads while processing the update request.

 Are there any known limitations around the UpdateProcessorChain, when 
 overwriteDupes is set to true ?
 I turned that off, which of course breaks the intent of my built index, but 
 for comparison purposes it's good.

 That did the trick, indexing is fast again, even with the periodic commits.

 I therefor have two questions, an interesting first  one and a boring second 
 one :

 1 / What's the workflow of the UpdateProcessorChain when one or more 
 processors have overwriting of duplicates turned on ? What happens under the 
 hood ?

 I tried to answer that myself looking at DirectUpdateHandler2 and my 
 understanding stopped at the following :
 - The document is added to the lucene IW
 - The duplicates are deleted from the lucene IW
 The dark magic I couldn't understand seems to occur around the idTerm and 
 updateTerm things, in the addDoc method. The deletions seem to be buffered 
 somewhere, I just didn't get it :-)

 I might be wrong since I didn't read the code more than that, but the point 
 might be at how does solr handles deletions, which is something still 
 unclear to me. In anyways, a lot of reads seem to occur for that precise 
 task and it tends to produce a lot of IO, killing indexing performances when 
 overwriteDupes is on. I don't even understand why so many read operations 
 occur at this stage since my process had a comfortable amount of RAM (with 
 Xms=Xmx=8GB), with only 4.5GB are used so far.

 Any help, recommandation or idea is welcome

Re: Spellcheck component not returned with numeric queries

File an issue:
https://issues.apache.org/jira/browse/SOLR-2556

On Monday 30 May 2011 16:07:41 Markus Jelsma wrote:
 Hi,
 
 The spell check component's output is not written when sending queries that
 consist of numbers only. Clients depending on the availability of the
 spellcheck output need to check if the output is actually there.
 
 This is with a very recent Solr 3.x check out. Is this a feature or a bug?
 File an issue?
 
 Cheers,

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350

Re: Nutch Crawl error

This question would be better asked on the Nutch forum rather than the
Solr forum.

Best
Erick

On Thu, May 26, 2011 at 12:06 PM, Roger Shah rs...@caci.com wrote:
 I ran the command bin/nutch crawl urls -dir crawl -depth 3  crawl.log

 When I viewed crawl.log I found some errors such as:

 Can't retrieve Tika parser for mime-typeapplication/x-shockwave-flash, and 
 some other similar messages for other types such as application/xml, etc.

 Do I need to download Tika for these errors to go away?  Where can I download 
 Tika so that it can work with Nutch?  If there are instructions to install 
 Tika to work with Nutch please send them to me.

 Thanks,
 Roger

Re: Facet Query

I'm guessing you're faceting on an analyzed field. This is usually a bad idea.
What is the use-case you're trying to solve?

Best
Erick

On Fri, May 27, 2011 at 12:51 AM, Jasneet Sabharwal
jasneet.sabhar...@ngicorporation.com wrote:
 Hi

 When I do a facet query on my data, it shows me a list of all the words
 present in my database with their count. Is it possible to not get the
 results of common words like a, an, the, http and so one but only get the
 count of stuff we need like microsoft, ipad, solr, etc.

 --
 Thanx  Regards

 Jasneet Sabharwal

Re: How to disable QueryElevationComponent

Let's back up a bit. Why don't you want a uniqueKey? It's usually a good
idea to have one, especially if you're using DIH.

Best
Erick

On Fri, May 27, 2011 at 2:53 AM, Romi romijain3...@gmail.com wrote:
 i removed
 searchComponent name=elevator
 class=org.apache.solr.handler.component.QueryElevationComponent 

     str name=queryFieldTypestring/str
    str name=config-fileelevate.xml/str
  /searchComponent

 from solrconfig.xml but it is showing the following exception:

 java.lang.NullPointerException
        at
 org.apache.solr.handler.dataimport.DataImporter.identifyPk(DataImporter.java:152)
        at
 org.apache.solr.handler.dataimport.DataImporter.init(DataImporter.java:111)
        at
 org.apache.solr.handler.dataimport.DataImportHandler.inform(DataImportHandler.java:113)
        at
 org.apache.solr.core.SolrResourceLoader.inform(SolrResourceLoader.java:486)
        at org.apache.solr.core.SolrCore.init(SolrCore.java:588)
        at org.apache.solr.core.CoreContainer.create(CoreContainer.java:428)
        at org.apache.solr.core.CoreContainer.load(CoreContainer.java:278)
        at
 org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:117)
        at
 org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:83)
        at org.mortbay.jetty.servlet.FilterHolder.doStart(FilterHolder.java:99)
        at 
 org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40)
        at
 org.mortbay.jetty.servlet.ServletHandler.initialize(ServletHandler.java:594)
        at org.mortbay.jetty.servlet.Context.startContext(Context.java:139)
        at
 org.mortbay.jetty.webapp.WebAppContext.startContext(WebAppContext.java:1218)
        at
 org.mortbay.jetty.handler.ContextHandler.doStart(ContextHandler.java:500)
        at 
 org.mortbay.jetty.webapp.WebAppContext.doStart(WebAppContext.java:448)
        at 
 org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40)
        at
 org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:147)
        at
 org.mortbay.jetty.handler.ContextHandlerCollection.doStart(ContextHandlerCollection.java:161)
        at 
 org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40)
        at
 org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:147)
        at 
 org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40)
        at
 org.mortbay.jetty.handler.HandlerWrapper.doStart(HandlerWrapper.java:117)
        at org.mortbay.jetty.Server.doStart(Server.java:210)
        at 
 org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40)
        at org.mortbay.xml.XmlConfiguration.main(XmlConfiguration.java:929)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
        at java.lang.reflect.Method.invoke(Unknown Source)
        at org.mortbay.start.Main.invokeMain(Main.java:183)
        at org.mortbay.start.Main.start(Main.java:497)
        at org.mortbay.start.Main.main(Main.java:115)




 -
 Thanks  Regards
 Romi
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/How-to-disable-QueryElevationComponent-tp2992195p2992320.html
 Sent from the Solr - User mailing list archive at Nabble.com.

Re: Documents update

And it wouldn't work unless all the data is stored anyway. Currently there's
no way to update a single field in a document, although there's work being
done in that direction (see the column stride JIRA).

What do you want to do with these fields? If it's to influence scoring, you
could look at external fields.

If the flags are a selection criteria, it's...harder. What are the flags
used for? Could you consider essentially storing a map of the
uniqueKey's and flags in a special document and having your app
read that document and merge the results with the output? If this seems
irrelevant, a more complete statement of the use-case would be helpful.

Best
Erick

On Fri, May 27, 2011 at 4:33 AM, Denis Kuzmenok forward...@ukr.net wrote:
 I'm  using  3.1  now.  Indexing  lasts for a few hours, and have big
 plain size. Getting all documents would be rather slow :(


 Not with 1.4, but apparently there is a patch for trunk. Not
 sure if it is in 3.1.

 If you are on 1.4, you could first query Solr to get the data
 for the document to be changed, change the modified values,
 and make a complete XML, including all fields, for post.jar.

 Regards,
 Gora

Re: DIH render html entities

Convert them to what? Individual fields in your docs? Text?

If the former, you might get some joy from the XpathEntityProcessor.
If you want to just strip the markup and index all the content you
might get some joy from the various *html* analyzers listed here:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

Best
Erick

On Fri, May 27, 2011 at 5:19 AM, anass talby anass.ta...@gmail.com wrote:
 Sorry my question was not clear.
 when I get data from database, some field contains some html special chars,
 and what i want to do is just convert them automatically.

 On Fri, May 27, 2011 at 1:00 PM, Gora Mohanty g...@mimirtech.com wrote:

 On Fri, May 27, 2011 at 3:50 PM, anass talby anass.ta...@gmail.com
 wrote:
  Is there any way to render html entities in DIH for a specific field?
 [...]

 This does not make too much sense: What do you mean by
 rendering HTML entities. DIH just indexes, so where would
 it render HTML to, even if it could?

 Please take a look at http://wiki.apache.org/solr/UsingMailingLists

 Regards,
 Gora




 --
       Anass

Re: Documents update

2011-05-31 Thread Denis Kuzmenok

Flags   are   stored  to filter results and it's pretty highloaded, it's
working  fine,  but i can't update index very often just to make flags
up to time =\
Where can i read about using external fields / files?


 And it wouldn't work unless all the data is stored anyway. Currently there's
 no way to update a single field in a document, although there's work being
 done in that direction (see the column stride JIRA).

 What do you want to do with these fields? If it's to influence scoring, you
 could look at external fields.

 If the flags are a selection criteria, it's...harder. What are the flags
 used for? Could you consider essentially storing a map of the
 uniqueKey's and flags in a special document and having your app
 read that document and merge the results with the output? If this seems
 irrelevant, a more complete statement of the use-case would be helpful.

 Best
 Erick

Re: Splitting fields

Hmmm, I wonder if a custom Transformer would help here? It can be inserted into
a chain of transformers in DIH.

Essentially, you subclass Transformer and implement one method (transformRow)
and do anything you want. The input is a map of String, Object that
is a simple
representation of the Solr document. You can add/subtract/whatever you
want to that
map and then just return it.

The map in transformRow has all the changes by any other entries in
the transform
chain at this point, and your changes are passed on to the next
transformer in the chain.

The only restriction I know of is that the document has to conform to
the schema when
all is said and done.

Best
Erick

On Fri, May 27, 2011 at 6:47 AM, Joe Fitzgerald
joe_fitzger...@oxfordcorp.com wrote:
 Hello,



 I am in an odd position.  The application server I use has built-in
 integration with SOLR.  Unfortunately, its native capabilities are
 fairly limited, specifically, it only supports a standard/pre-defined
 set of fields which can be indexed.  As a result, it has left me
 kludging how I work with Solr and doing things like putting what I'd
 like to be multiple, separate fields into a single Solr field.



 As an example, I may put a customer id and name into a single field
 called 'custom1'.  Ideally, I'd like this information to be returned in
 separate fields...and even better would be for them to be indexed as
 separate fields but I can live without the latter.  Currently, I'm
 building out a json representation of this information which makes it
 easy for me to deal with when I extract the results...but it all feels
 wrong.



 I do have complete control over the actual Solr installation (just not
 the indexing call to Solr), so I was hoping there may be a way to
 configure Solr to take my single field and split it up into a different
 field for each key in my json representation.



 I don't see anything native to Solr that would do this for me but there
 are a few features that I thought sounded similar and was hoping to get
 some opinions on how I may be able to move forward with this...



 Poly fields, such as the spatial location, might help?  Can I build my
 own poly-field that would split up the main field into subfields?  Do
 poly-fields let me return the subfields?  I don't quite have my head
 around polyfields yet.



 Another option although I suspect this won't be considered a good
 approach, but what about extending the copyField functionality of
 schema.xml to support my needs?  It would seem not entirely unreasonable
 that copyField would provide a means to extract only a portion of the
 contents of the source field to place in the destination field, no?  I'm
 sure people more familiar with Solr's architecture could explain why
 this isn't really an appropriate thing for Solr to handle (just because
 it could doesn't mean it should)...

 The other - and probably best -- option would be to leverage Solr
 directly, bypassing the native integration of my application server,
 which we've already done for most cases.  I'd love to go this route but
 I'm having a hard time figuring out how to easily accomplish the same
 functionality provided by my app server integration...perhaps someone on
 the list could help me with this path forward?  Here is what I'm trying
 to accomplish:



 I'm indexing documents (text, pdf, html...) but I need to include fields
 in the results of my searches which are only available from a db query.
 I know how to have Solr index results from a db query, but I'm having
 trouble getting it to index the documents that are associated to each
 record of that query (full path/filename is one of the fields of that
 query).



 I started to try to use the dataImport handler to do this, by setting up
 a FileDataSource in addition to my jdbc data source.  I tried to
 leverage the filedatasource to populate a sub-entity based on the db
 field that contains the full path/filename, but I wasn't sure how to
 specify the db field from the root query/entity.  Before I spent too
 much time, I also realized I wasn't sure how to get Solr to deal with
 binary file types this way either which upon further reading seemed like
 I would need to leverage Tika - can that be done within the confines of
 dataimporthandler?



 Any advice is greatly appreciated.  Thanks in advance,



 Joe

Re: Edgengram

That'll work for your case, although be aware that string types aren't
analyzed at all,
so case matters, as do spaces etc.

What is the use-case here? If you explain it a bit there might be
better answers

Best
Erick

On Fri, May 27, 2011 at 9:17 AM, Brian Lamb
brian.l...@journalexperts.com wrote:
 For this, I ended up just changing it to string and using abcdefg* to
 match. That seems to work so far.

 Thanks,

 Brian Lamb

 On Wed, May 25, 2011 at 4:53 PM, Brian Lamb
 brian.l...@journalexperts.comwrote:

 Hi all,

 I'm running into some confusion with the way edgengram works. I have the
 field set up as:

 fieldType name=edgengram class=solr.TextField
 positionIncrementGap=1000
    analyzer
      tokenizer class=solr.LowerCaseTokenizerFactory /
        filter class=solr.EdgeNGramFilterFactory minGramSize=1
 maxGramSize=100 side=front /
    /analyzer
 /fieldType

 I've also set up my own similarity class that returns 1 as the idf score.
 What I've found this does is if I match a string abcdefg against a field
 containing abcdefghijklmnop, then the idf will score that as a 7:

 7.0 = idf(myfield: a=51 ab=23 abc=2 abcd=2 abcde=2 abcdef=2 abcdefg=2)

 I get why that's happening, but is there a way to avoid that? Do I need to
 do a new field type to achieve the desired affect?

 Thanks,

 Brian Lamb

Re: applying FastVectorHighlighter truncation patch to solr 3.1

Did you try to apply the patch in Lucene's contrib?

On Tuesday 17 May 2011 18:55:49 Paul wrote:
 I'm having this issue with solr 3.1:
 
 https://issues.apache.org/jira/browse/LUCENE-1824
 
 It looks like there is a patch offered, but I can't figure out how to apply
 it.
 
 What is the easiest way for me to get this fix? I'm just using the
 example solr with changed conf xml files. Is there a file somewhere I
 can just drop in?

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350

Re: Pivot with Stats (or Stats with Pivot)

Well, you can discuss it on the dev list, but given that Solr is open
source, you'll
have to either create your own patch or engage the community to create one.

You haven't really stated why this is a good thing to have, just that
you want it.
So the use-case would be a big help.

Please don't raise a JIRA until you've discussed it though, it may be
that there's
something in the works that one of the devs already knows about...

Best
Erick

On Fri, May 27, 2011 at 10:34 AM, edua...@calandra.com.br wrote:
Nobody?

Please, help

edua...@calandra.com.br
17/05/2011 16:13
Please respond to
solr-user@lucene.apache.org

To
solr-user@lucene.apache.org
cc

Subject
Pivot with Stats (or Stats with Pivot)

Hi All,

Is it possible to get stats (like Stats Component: min ,max, sum, count,

missing, sumOfSquares, mean and stddev) from numeric fields inside
hierarchical facets (with more than one level, like Pivot)?

I would like to query:
...?q=*:*version=2.2start=0rows=0stats=truestats.field=numeric_field1stats.field=numeric_field2stats.pivot=field_x,field_y,field_z
and get min, max, sum, count, etc. from numeric_field1 and
numeric_field2 from all combinations of field_x, field_y and field_z
(hierarchical values).

Using stats.facet I get just one field at one level and using
facet.pivot I get just counts, but no stats.

Looping in client application to do all combinations of facets values
will be to slow because there is a lot of combinations.

Thanks a lot!

Re: Documents update

http://lucene.apache.org/solr/api/org/apache/solr/schema/ExternalFileField.html

On Tuesday 31 May 2011 15:41:32 Denis Kuzmenok wrote:
 Flags   are   stored  to filter results and it's pretty highloaded, it's
 working  fine,  but i can't update index very often just to make flags
 up to time =\
 Where can i read about using external fields / files?
 
  And it wouldn't work unless all the data is stored anyway. Currently
  there's no way to update a single field in a document, although there's
  work being done in that direction (see the column stride JIRA).
  
  What do you want to do with these fields? If it's to influence scoring,
  you could look at external fields.
  
  If the flags are a selection criteria, it's...harder. What are the flags
  used for? Could you consider essentially storing a map of the
  uniqueKey's and flags in a special document and having your app
  read that document and merge the results with the output? If this seems
  irrelevant, a more complete statement of the use-case would be helpful.
  
  Best
  Erick

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350

Re: solr Invalid Date in Date Math String/Invalid Date String

Can we see the results of attaching debugQuery=on to the query? That
often points out the issue.

I'd expect this form to work:
[2006-12-22T00:00:00Z TO 2006-12-22T23:59:59Z]

Best
Erick

2011/5/27 Ellery Leung elleryle...@be-o.com:
 Thank you Mike.

 So I understand that now.  But what about the other items that have values
 on both size?  They don't work at all.


 -Original Message-
 From: Mike Sokolov [mailto:soko...@ifactory.com]
 Sent: 2011年5月27日 10:23 下午
 To: solr-user@lucene.apache.org
 Cc: alucard001
 Subject: Re: solr Invalid Date in Date Math String/Invalid Date String

 The * endpoint for range terms wasn't implemented yet in 1.4.1  As a
 workaround, we use very large and very small values.

 -Mike

 On 05/27/2011 12:55 AM, alucard001 wrote:
 Hi all

 I am using SOLR 1.4.1 (according to solr info), but no matter what date
 field I use (date or tdate) defined in default schema.xml, I cannot do a
 search in solr-admin analysis.jsp:

 fieldtype: date(or tdate)
 fieldvalue(index): 2006-12-22T13:52:13Z (I type it in manually, no
 trailing
 space)
 fieldvalue(query):

 The only success case:
 2006-12-22T13:52:13Z

 All search below are failed:
 * TO NOW
 [* TO NOW]

 2006-12-22T00:00:00Z TO 2006-12-22T23:59:59Z
 2006\-12\-22T00\:00\:00Z TO 2006\-12\-22T23\:59\:59Z
 [2006-12-22T00:00:00Z TO 2006-12-22T23:59:59Z]
 [2006\-12\-22T00\:00\:00Z TO 2006\-12\-22T23\:59\:59Z]

 2006-12-22T00:00:00.000Z TO 2006-12-22T23:59:59.999Z
 2006\-12\-22T00\:00\:00\.000Z TO 2006\-12\-22T23\:59\:59\.999Z
 [2006-12-22T00:00:00.000Z TO 2006-12-22T23:59:59.999Z]
 [2006\-12\-22T00\:00\:00\.000Z TO 2006\-12\-22T23\:59\:59\.999Z]

 2006-12-22T00:00:00Z TO *
 2006\-12\-22T00\:00\:00Z TO *
 [2006-12-22T00:00:00Z TO *]
 [2006\-12\-22T00\:00\:00Z TO *]

 2006-12-22T00:00:00.000Z TO *
 2006\-12\-22T00\:00\:00\.000Z TO *
 [2006-12-22T00:00:00.000Z TO *]
 [2006\-12\-22T00\:00\:00\.000Z TO *]
 (vice versa)

 I get either:
 Invalid Date in Date Math String or
 Invalid Date String
 error

 What's wrong with it?  Can anyone please help me on that?

 Thank you.

 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/solr-Invalid-Date-in-Date-Math-String-Inv
 alid-Date-String-tp2991763p2991763.html
 Sent from the Solr - User mailing list archive at Nabble.com.

WIKI alerts

Anyone noticed that it doesn't work? Already 2 weeks

https://issues.apache.org/jira/browse/INFRA-3667

 

I don't receive WIKI change notifications. I CC to 'Apache Wiki'
wikidi...@apache.org

 

Something is bad.

 

 

-Fuad

Re: Documents update

2011-05-31 Thread Denis Kuzmenok

Will it be slow if there are 3-5 million key/value rows?

 http://lucene.apache.org/solr/api/org/apache/solr/schema/ExternalFileField.html

 On Tuesday 31 May 2011 15:41:32 Denis Kuzmenok wrote:
 Flags   are   stored  to filter results and it's pretty highloaded, it's
 working  fine,  but i can't update index very often just to make flags
 up to time =\
 Where can i read about using external fields / files?

Re: Match in the process of filter, not end, does it mean not matching?

Take a closer look at the results of KeywordTokenizerFactory. It won't break
up the text into any tokens, the entire input is considered a single string. Are
you sure this is what you intend?

I'd start by removing most of your filters, understanding what's happening
at each step then adding them back in again. For instance, it's
unusual (but possibly correct) to use both the MappingCharFilterFactory and
ISOLatin... factory. And I'm not even sure what all the *gram* filters are
doing in a KeywordTokenized field..

Best
Erick

On Sun, May 29, 2011 at 8:39 PM, Ellery Leung elleryle...@be-o.com wrote:
 This is the schema:



                fieldType name=textContains class=solr.TextField
 positionIncrementGap=100

                        analyzer type=index

                                charFilter
 class=solr.MappingCharFilterFactory
 mapping=../../filters/filter-mappings.txt/

                                charFilter
 class=solr.HTMLStripCharFilterFactory /

                                tokenizer
 class=solr.KeywordTokenizerFactory/

                                filter
 class=solr.ISOLatin1AccentFilterFactory/

                                filter class=solr.TrimFilterFactory /

                                filter class=solr.LowerCaseFilterFactory
 /

                                filter
 class=solr.CommonGramsFilterFactory words=../../filters/stopwords.txt
 ignoreCase=true/

                                filter class=solr.ShingleFilterFactory
 minShingleSize=2 maxShingleSize=30/

                                filter class=solr.NGramFilterFactory
 minGramSize=2 maxGramSize=30/

                                filter
 class=solr.RemoveDuplicatesTokenFilterFactory /

                        /analyzer

                        analyzer type=query

                                charFilter
 class=solr.MappingCharFilterFactory
 mapping=../../filters/filter-mappings.txt/

                                charFilter
 class=solr.HTMLStripCharFilterFactory /

                                tokenizer
 class=solr.KeywordTokenizerFactory/

                                filter
 class=solr.ISOLatin1AccentFilterFactory/

                                filter class=solr.TrimFilterFactory /

                                filter class=solr.LowerCaseFilterFactory
 /

                                filter
 class=solr.RemoveDuplicatesTokenFilterFactory /

                        /analyzer

                /fieldType



 And there is a multiValued field:



 field name=textContains_Something type=textContains multiValued=true
 indexed=true stored=true /



 Now I want to search this string: Merry Christmas and Happy New Year



 In Admin Analysis in solr admin, it highlight (in light blue) the matching
 word in LowerCaseFilterFactory, CommonGramsFilterFactory and
 ShingleFilterFactory.  However, it does not have any highlight in
 NGramFilterFactory.



 Now, I did a search in full-interface mode in solr admin:



 textContains_Something:Merry Christmas and Happy New Year



 It contains NO RESULT.



 Does it mean that matching only counts after all tokenizer and filters?



 Thank you in advance for any help.

Re: collapse component with pivot faceting

Please provide a more detailed request. This is so general that it's hard to
respond. What is the use-case you're trying to understand/implement?

You might review:
http://wiki.apache.org/solr/UsingMailingLists

Best
Erick

On Mon, May 30, 2011 at 4:31 AM, Isha Garg isha.g...@orkash.com wrote:
 Hi All!

         Can anyone tell me how pivot faceting works in combination with
 field collapsing.?
 Please guide me in this respect.


 Thanks!
 Isha Garg

Re: Solr Dismax bf bq vs. q:{boost ...}

First, please define what wrong results means, what are you expecting
and what are you seeing?

Second, please post the results of debugQuery=on where we can all
see it, perhaps something will pop out...

Best
Erick

On Mon, May 30, 2011 at 12:27 PM, chazzuka chazz...@gmail.com wrote:
 I tried to do this:

 #1. search phrases in title^3  text^1
 #2. based on result #1 add boost for field closed:0^2
 #3. based on result in #2 boost based on last_modified

 and i tried like these:

 /solr/select
 ?q={!boost b=$dateboost v=$qq defType=dismax}
 dateboost=recip(ms(NOW/HOUR,modified),8640,2,1)
 qq=video
 qf=title^3+text
 pf=title^3+text
 bq=closed:0^2
 debugQuery=true

 then i tried differently by changing solrconfig like these:

 str name=qftitle^3 text/str
 str name=pftitle^3 text/str
 str name=bfrecip(ms(NOW/HOUR,modified),8640,2,1)/str
 str name=bqclosed:0^2/str

 with query:
 /solr/select
 ?q=video
 debugQuery=true

 both seems give wrong results, anyone have an idea about doing those tasks?

 thanks in advanced



 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Solr-Dismax-bf-bq-vs-q-boost-tp3003028p3003028.html
 Sent from the Solr - User mailing list archive at Nabble.com.

Re: Getting and viewing a heap dump

Constantjin:

I've had better luck by sending messages as plain text. The Spam
filter on the user
list sometimes acts up if you send mail in richtext or similar
formats. Gmail has a link
to change this, what client are you using?

And thanks for participating!

Best
Erick

On Tue, May 31, 2011 at 3:22 AM, Constantijn Visinescu
baeli...@gmail.com wrote:
 Hi Bernd,

 I'm assuming Linux here, if you're running something else these
 instructions might differ slightly.

 First get a heap dump with:
 jmap -heap:format=b,file=/path/to/generate/heapdumpfile.hprof 1234

 with 1234 being the PID (process id) of the JVM

 After you get a Heap dump you can analyze it with Eclipse MAT (Memory
 Analyzer Tool).

 Just a heads up if you're doing this in production: the JVM will
 freeze completely while generating the heap dump, which will seem like
 a giant stop the world GC with a 10GB heap.

 Good luck with finding out what's eating your memory!

 Constantijn

 P.S.
 Sorry about  altering the subject line, but the spam assassin used by
 the mailing list was rejecting my post because it had replication in
 the subject line. hope it doesn't mess up the thread.

 On Tue, May 31, 2011 at 8:43 AM, Bernd Fehling
 bernd.fehl...@uni-bielefeld.de wrote:
 Some more info,
 after one week the servers have the following status:

 Master (indexing only)
 + looks good and has heap size of about 6g from 10g OldGen
 + has loaded meanwhile 2 times the index from scratch via DIH
 + has added new documents into existing index via DIH
 + has optimized and replicated
 + no full GC within one week

 Slave A (search only) Online
 - looks bad and has heap size of 9.5g from 10g OldGen
 + was replicated
 - several full GC

 Slave B (search only) Backup
 + looks good has heap size of 4 g from 10g OldGen
 + was replicated
 + no full GC within one week

 Conclusion:
 + DIH, processing, indexing, replication are fine
 - the search is crap and eats up OldGen heap which can't be
  cleaned up by full GC. May be memory leaks or what ever...

 Due to this Solr 3.1 can _NOT_ be recommended as high-availability,
 high-search-load search engine because of unclear heap problems
 caused by the search. The search is out of the box, so no
 self produced programming errors.

 Any tools available for JAVA to analyze this?
 (like valgrind or electric fence for C++)

 Is it possible to analyze a heap dump produced with jvisualvm?
 Which tools?


 Bernd


 Am 30.05.2011 15:51, schrieb Bernd Fehling:

 Dear list,
 after switching from FAST to Solr I get the first _real_ data.
 This includes search times, memory consumption, perfomance of solr,...

 What I recognized so far is that something eats up my OldGen and
 I assume it might be replication.

 Current Data:
 one master - indexing only
 two slaves - search only
 over 28 million docs
 single instance
 single core
 index size 140g
 current heap size 16g

 After startup I have about 4g heap in use and about 3.5g of OldGen.
 After one week and some replications OldGen is filled close to 100
 percent.
 If I start an optimize under this condition I get OOM of heap.
 So my assumption is that something is eating up my heap.

 Any idea how to trace this down?

 May be a memory leak somewhere?

 Best regards
 Bernd

Re: how can i index data in different documents

document isn't a tag recognized in schema.xml. Please review:
http://wiki.apache.org/solr/UsingMailingLists

Best
Erick

On Thu, May 26, 2011 at 6:46 AM, Romi romijain3...@gmail.com wrote:

 Ensure that when you add your documents, their type value is
 effectively set to either table1 or table2.

 did you mean i set document name=d1 type=table1 in schema.xml???

 but as far as i concern there can only be one document tag then what about
 the table2??

 -
 Romi
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/how-can-i-index-data-in-different-documents-tp2988621p2988789.html
 Sent from the Solr - User mailing list archive at Nabble.com.

Re: Getting and viewing a heap dump

2011-05-31 Thread Constantijn Visinescu

I was sending using the gmail webbrowser client in plaintext.

Spamassasin didn't seem to like 3 things (according to the error
message i got back):
- I use a free email adress
- My email address (before the @) ends with a number
- The email had the word replication in the subject line.

No idea where rule #3 came from but it was the easiest to fix so
that's what I changed ;)


On Tue, May 31, 2011 at 4:21 PM, Erick Erickson erickerick...@gmail.com wrote:
 Constantjin:

 I've had better luck by sending messages as plain text. The Spam
 filter on the user
 list sometimes acts up if you send mail in richtext or similar
 formats. Gmail has a link
 to change this, what client are you using?

 And thanks for participating!

 Best
 Erick

 On Tue, May 31, 2011 at 3:22 AM, Constantijn Visinescu
 baeli...@gmail.com wrote:
 Hi Bernd,

 I'm assuming Linux here, if you're running something else these
 instructions might differ slightly.

 First get a heap dump with:
 jmap -heap:format=b,file=/path/to/generate/heapdumpfile.hprof 1234

 with 1234 being the PID (process id) of the JVM

 After you get a Heap dump you can analyze it with Eclipse MAT (Memory
 Analyzer Tool).

 Just a heads up if you're doing this in production: the JVM will
 freeze completely while generating the heap dump, which will seem like
 a giant stop the world GC with a 10GB heap.

 Good luck with finding out what's eating your memory!

 Constantijn

 P.S.
 Sorry about  altering the subject line, but the spam assassin used by
 the mailing list was rejecting my post because it had replication in
 the subject line. hope it doesn't mess up the thread.

 On Tue, May 31, 2011 at 8:43 AM, Bernd Fehling
 bernd.fehl...@uni-bielefeld.de wrote:
 Some more info,
 after one week the servers have the following status:

 Master (indexing only)
 + looks good and has heap size of about 6g from 10g OldGen
 + has loaded meanwhile 2 times the index from scratch via DIH
 + has added new documents into existing index via DIH
 + has optimized and replicated
 + no full GC within one week

 Slave A (search only) Online
 - looks bad and has heap size of 9.5g from 10g OldGen
 + was replicated
 - several full GC

 Slave B (search only) Backup
 + looks good has heap size of 4 g from 10g OldGen
 + was replicated
 + no full GC within one week

 Conclusion:
 + DIH, processing, indexing, replication are fine
 - the search is crap and eats up OldGen heap which can't be
  cleaned up by full GC. May be memory leaks or what ever...

 Due to this Solr 3.1 can _NOT_ be recommended as high-availability,
 high-search-load search engine because of unclear heap problems
 caused by the search. The search is out of the box, so no
 self produced programming errors.

 Any tools available for JAVA to analyze this?
 (like valgrind or electric fence for C++)

 Is it possible to analyze a heap dump produced with jvisualvm?
 Which tools?


 Bernd


 Am 30.05.2011 15:51, schrieb Bernd Fehling:

 Dear list,
 after switching from FAST to Solr I get the first _real_ data.
 This includes search times, memory consumption, perfomance of solr,...

 What I recognized so far is that something eats up my OldGen and
 I assume it might be replication.

 Current Data:
 one master - indexing only
 two slaves - search only
 over 28 million docs
 single instance
 single core
 index size 140g
 current heap size 16g

 After startup I have about 4g heap in use and about 3.5g of OldGen.
 After one week and some replications OldGen is filled close to 100
 percent.
 If I start an optimize under this condition I get OOM of heap.
 So my assumption is that something is eating up my heap.

 Any idea how to trace this down?

 May be a memory leak somewhere?

 Best regards
 Bernd

Re: Solr NRT

2011-05-31 Thread Nagendra Nagarajayya

Did you trying using Solr with RankingAlgorithm ? It supports NRT. You 
can index documents without a commit while searching  concurrently. No 
changes are needed except for enabling NRT through solrconfig.xml. You 
can get information about the implementation from here:


http://solr-ra.tgels.com/wiki/en/Near_Real_Time_Search
http://solr-ra.tgels.com/papers/NRT_Solr_RankingAlgorithm.pdf

You can download Solr with RankingAlgorithm from here:
http://solr-ra.tgels.com

Regards,

- Nagendra Nagarajayya
http://solr-ra.tgels.com
http://rankingalgorithm.tgels.com

On 5/31/2011 5:57 AM, Ionut Manta wrote:

What results did you got with this hack?
How long it takes since you start indexing some documents until you get a
search result?
Did you try NRT?

On Tue, May 31, 2011 at 3:47 PM, David Hilldh...@studentloan.org  wrote:


Unless you cross a Solr server commit threshold your client has to post a
commit/  message for the server content to be available for searching.
Unfortunatly the Solr tool that is supposed to do this apparently doesn't. I
asked for community help last week and was surprised to receive no response,
I thought having to leave a Solr import process in an incomplete state would
be more of a concern. In any case, our (hopefully temporary) solution was to
hack the source code for the SimplePostTool demo code to turn it into a
CommitTool.  Once Solr receives thecommit/  post you will be able to
search for your recently added documents.

-Original Message-
From: Ionut Manta [mailto:ionut.ma...@gmail.com]
Sent: Tuesday, May 31, 2011 7:41 AM
To: solr-user@lucene.apache.org
Subject: Solr NRT

Hi,

I have the following strange use case:

Index 100 documents and make them immediately available for search. I call
this on the fly indexing. Then the index can be removed. So the size of
the index is not an issue here.
Is this possible with Solr? Anyone tried something similar?

Thank you,
Ionut

This e-mail and any files transmitted with it are confidential and intended
solely for the use of the individual or entity to whom they are addressed.
If you have received this e-mail in error please notify the originator of
the message. This footer also confirms that this e-mail message has been
scanned for the presence of computer viruses. Any views expressed in this
message are those of the individual sender, except where the sender
specifies and with authority, states them to be the views of Iowa Student
Loan.

Re: Edgengram

2011-05-31 Thread Brian Lamb

In this particular case, I will be doing a solr search based on user
preferences. So I will not be depending on the user to type abcdefg. That
will be automatically generated based on user selections.

The contents of the field do not contain spaces and since I am created the
search parameters, case isn't important either.

Thanks,

Brian Lamb

On Tue, May 31, 2011 at 9:44 AM, Erick Erickson erickerick...@gmail.comwrote:

 That'll work for your case, although be aware that string types aren't
 analyzed at all,
 so case matters, as do spaces etc.

 What is the use-case here? If you explain it a bit there might be
 better answers

 Best
 Erick

 On Fri, May 27, 2011 at 9:17 AM, Brian Lamb
 brian.l...@journalexperts.com wrote:
  For this, I ended up just changing it to string and using abcdefg* to
  match. That seems to work so far.
 
  Thanks,
 
  Brian Lamb
 
  On Wed, May 25, 2011 at 4:53 PM, Brian Lamb
  brian.l...@journalexperts.comwrote:
 
  Hi all,
 
  I'm running into some confusion with the way edgengram works. I have the
  field set up as:
 
  fieldType name=edgengram class=solr.TextField
  positionIncrementGap=1000
 analyzer
   tokenizer class=solr.LowerCaseTokenizerFactory /
 filter class=solr.EdgeNGramFilterFactory minGramSize=1
  maxGramSize=100 side=front /
 /analyzer
  /fieldType
 
  I've also set up my own similarity class that returns 1 as the idf
 score.
  What I've found this does is if I match a string abcdefg against a
 field
  containing abcdefghijklmnop, then the idf will score that as a 7:
 
  7.0 = idf(myfield: a=51 ab=23 abc=2 abcd=2 abcde=2 abcdef=2 abcdefg=2)
 
  I get why that's happening, but is there a way to avoid that? Do I need
 to
  do a new field type to achieve the desired affect?
 
  Thanks,
 
  Brian Lamb

how does Solr/Lucene index multi-value fields

2011-05-31 Thread Ian Holsman

Hi.

I want to store a list of documents (say each being 30-60k of text) into a 
single SolrDocument. (to speed up post-retrieval querying)

In order to do this, I need to know if lucene calculates the TF/IDF score over 
the entire field or does it treat each value in the list as a unique field? 

If I can't store it as a multi-value, I could create a schema where I put each 
document into a unique field, but I'm not sure how to create the query to 
search all the fields.


Regards
Ian

Better Spellcheck

2011-05-31 Thread Tanner Postert

I've tried to use a spellcheck dictionary built from my own content, but my
content ends up having a lot of misspelled words so the spellcheck ends up
being less than effective. I could use a standard dictionary, but it may
have problems with proper nouns. It also misses phrases. When someone
searches for Untied States I would hope the spellcheck would suggest
United States but it just recognizes that untied is a valid word and
doesn't suggest any thing.

Is there any way around this? Are there any third party modules or
spellcheck systems that I could implement to get these type of features?

Re: Edgengram

2011-05-31 Thread bmdakshinamur...@gmail.com

Can you specify the analyzer you are using for your queries?

May be you could use a KeywordAnalyzer for your queries so you don't end up
matching parts of your query.
http://www.lucidimagination.com/blog/2009/09/08/auto-suggest-from-popular-queries-using-edgengrams/
This should help you.

On Tue, May 31, 2011 at 8:24 PM, Brian Lamb
brian.l...@journalexperts.comwrote:

In this particular case, I will be doing a solr search based on user
preferences. So I will not be depending on the user to type abcdefg. That
will be automatically generated based on user selections.

The contents of the field do not contain spaces and since I am created the
search parameters, case isn't important either.

Thanks,

Brian Lamb

On Tue, May 31, 2011 at 9:44 AM, Erick Erickson erickerick...@gmail.com
wrote:

That'll work for your case, although be aware that string types aren't
analyzed at all,
so case matters, as do spaces etc.

What is the use-case here? If you explain it a bit there might be
better answers

Best
Erick

On Fri, May 27, 2011 at 9:17 AM, Brian Lamb
brian.l...@journalexperts.com wrote:
For this, I ended up just changing it to string and using abcdefg* to
match. That seems to work so far.

Thanks,

Brian Lamb

On Wed, May 25, 2011 at 4:53 PM, Brian Lamb
brian.l...@journalexperts.comwrote:

Hi all,

I'm running into some confusion with the way edgengram works. I have
the
field set up as:

fieldType name=edgengram class=solr.TextField
positionIncrementGap=1000
analyzer
tokenizer class=solr.LowerCaseTokenizerFactory /
filter class=solr.EdgeNGramFilterFactory minGramSize=1
maxGramSize=100 side=front /
/analyzer
/fieldType

I've also set up my own similarity class that returns 1 as the idf
score.
What I've found this does is if I match a string abcdefg against a
field
containing abcdefghijklmnop, then the idf will score that as a 7:

7.0 = idf(myfield: a=51 ab=23 abc=2 abcd=2 abcde=2 abcdef=2 abcdefg=2)

I get why that's happening, but is there a way to avoid that? Do I
need
to
do a new field type to achieve the desired affect?

Thanks,

Brian Lamb

--
Thanks and Regards,
DakshinaMurthy BM

Re: Edgengram

2011-05-31 Thread Brian Lamb

fieldType name=edgengram class=solr.TextField
positionIncrementGap=1000
   analyzer
 tokenizer class=solr.LowerCaseTokenizerFactory /
 filter class=solr.EdgeNGramFilterFactory minGramSize=1
maxGramSize=25 side=front /
   /analyzer
/fieldType

I believe I used that link when I initially set up the field and it worked
great (and I'm still using it in other places). In this particular example
however it does not appear to be practical for me. I mentioned that I have a
similarity class that returns 1 for the idf and in the case of an edgengram,
it returns 1 * length of the search string.

Thanks,

Brian Lamb

On Tue, May 31, 2011 at 11:34 AM, bmdakshinamur...@gmail.com 
bmdakshinamur...@gmail.com wrote:

 Can you specify the analyzer you are using for your queries?

 May be you could use a KeywordAnalyzer for your queries so you don't end up
 matching parts of your query.

 http://www.lucidimagination.com/blog/2009/09/08/auto-suggest-from-popular-queries-using-edgengrams/
 This should help you.

 On Tue, May 31, 2011 at 8:24 PM, Brian Lamb
 brian.l...@journalexperts.comwrote:

  In this particular case, I will be doing a solr search based on user
  preferences. So I will not be depending on the user to type abcdefg.
 That
  will be automatically generated based on user selections.
 
  The contents of the field do not contain spaces and since I am created
 the
  search parameters, case isn't important either.
 
  Thanks,
 
  Brian Lamb
 
  On Tue, May 31, 2011 at 9:44 AM, Erick Erickson erickerick...@gmail.com
  wrote:
 
   That'll work for your case, although be aware that string types aren't
   analyzed at all,
   so case matters, as do spaces etc.
  
   What is the use-case here? If you explain it a bit there might be
   better answers
  
   Best
   Erick
  
   On Fri, May 27, 2011 at 9:17 AM, Brian Lamb
   brian.l...@journalexperts.com wrote:
For this, I ended up just changing it to string and using abcdefg*
 to
match. That seems to work so far.
   
Thanks,
   
Brian Lamb
   
On Wed, May 25, 2011 at 4:53 PM, Brian Lamb
brian.l...@journalexperts.comwrote:
   
Hi all,
   
I'm running into some confusion with the way edgengram works. I have
  the
field set up as:
   
fieldType name=edgengram class=solr.TextField
positionIncrementGap=1000
   analyzer
 tokenizer class=solr.LowerCaseTokenizerFactory /
   filter class=solr.EdgeNGramFilterFactory minGramSize=1
maxGramSize=100 side=front /
   /analyzer
/fieldType
   
I've also set up my own similarity class that returns 1 as the idf
   score.
What I've found this does is if I match a string abcdefg against a
   field
containing abcdefghijklmnop, then the idf will score that as a 7:
   
7.0 = idf(myfield: a=51 ab=23 abc=2 abcd=2 abcde=2 abcdef=2
 abcdefg=2)
   
I get why that's happening, but is there a way to avoid that? Do I
  need
   to
do a new field type to achieve the desired affect?
   
Thanks,
   
Brian Lamb
   
   
  
 



 --
 Thanks and Regards,
 DakshinaMurthy BM

Boosting fields at query time in Standard Request Handler from Solrconfig.xml

2011-05-31 Thread Vignesh Raj

Hi,

I am developing a search engine app using Asp.Net, C# and Solrnet. I use the
standard request handler. Is there a way I can boost the fields at query
time from inside the solrconfig.xml file itself. Just like the qf field
for Dismax handler.
Right now am searching like field1:value^1.5 field2:value^1.2
field3:value^0.8 and this is done in the middle tier. I want Solr itself to
do this using standard request handler. Can I write a similar kind of thing
inside standard req handler?

Here is my solrconfig file.

requestHandler name=standard class=solr.SearchHandler default=true

   lst name=defaults

 str name=echoParamsexplicit/str

 str name=hltrue/str

 str name=hl.snippets3/str

 str name=hl.fragsize25/str

 str name=qffile_description^100.0 file_content^6.0 file_name^10.0
file_comments^4.0

 /str

   /lst

   arr name=last-components

  strspellcheck/str 

   /arr

/requestHandler

 

But am not able to see the results if I add this in my solrconfig.xml file.
I have edited the post to add my req handler code in solrconfig. But, if
have my query string as file_description:result^1.0 file_content:result^0.6
file_name:result^0.5 file_comments:result^0.8, am able to see the required
result.

 

Regards

Vignesh

Re: how does Solr/Lucene index multi-value fields

Can you explain the use-case a bit more here? Especially the post-query
processing and how you expect the multiple documents to help here.

But TF/IDF is calculated over all the values in the field. There's really no
difference between a multi-valued field and storing all the data in a
single field
as far as relevance calculations are concerned.

Best
Erick

On Tue, May 31, 2011 at 11:02 AM, Ian Holsman had...@holsman.net wrote:
 Hi.

 I want to store a list of documents (say each being 30-60k of text) into a 
 single SolrDocument. (to speed up post-retrieval querying)

 In order to do this, I need to know if lucene calculates the TF/IDF score 
 over the entire field or does it treat each value in the list as a unique 
 field?

 If I can't store it as a multi-value, I could create a schema where I put 
 each document into a unique field, but I'm not sure how to create the query 
 to search all the fields.


 Regards
 Ian

Re: how does Solr/Lucene index multi-value fields

2011-05-31 Thread Ian Holsman


On May 31, 2011, at 12:11 PM, Erick Erickson wrote:

 Can you explain the use-case a bit more here? Especially the post-query
 processing and how you expect the multiple documents to help here.
 

we have a collection of related stories. when a user searches for something, we 
might not want to display the story that is most-relevant (according to SOLR), 
but according to other home-grown rules.  by combing all the possibilities in 
one SolrDocument, we can avoid a DB-hit to get related stories.


 But TF/IDF is calculated over all the values in the field. There's really no
 difference between a multi-valued field and storing all the data in a
 single field
 as far as relevance calculations are concerned.
 

so.. it will suck regardless.. I thought we had per-field relevance in the 
current trunk. :-(


 Best
 Erick
 
 On Tue, May 31, 2011 at 11:02 AM, Ian Holsman had...@holsman.net wrote:
 Hi.
 
 I want to store a list of documents (say each being 30-60k of text) into a 
 single SolrDocument. (to speed up post-retrieval querying)
 
 In order to do this, I need to know if lucene calculates the TF/IDF score 
 over the entire field or does it treat each value in the list as a unique 
 field?
 
 If I can't store it as a multi-value, I could create a schema where I put 
 each document into a unique field, but I'm not sure how to create the query 
 to search all the fields.
 
 
 Regards
 Ian

Re: Edgengram

2011-05-31 Thread Tomás Fernández Löbbe

Hi Brian, I don't know if I understand what you are trying to achieve. You
want the term query abcdefg to have an idf of 1 insead of 7? I think using
the KeywordTokenizerFilterFactory at query time should work. I would be
something like:

fieldType name=edgengram class=solr.TextField
positionIncrementGap=1000
  analyzer type=index
tokenizer class=solr.LowerCaseTokenizerFactory /
filter class=solr.EdgeNGramFilterFactory minGramSize=1
maxGramSize=25 side=front /
  /analyzer
  analyzer type=query
  tokenizer class=solr.KeywordTokenizerFactory /
  /analyzer
/fieldType

this way, at query time abcdefg won't be turned to a ab abc abcd abcde
abcdef abcdefg. At index time it will.

Regards,
Tomás


On Tue, May 31, 2011 at 1:07 PM, Brian Lamb
brian.l...@journalexperts.comwrote:

 fieldType name=edgengram class=solr.TextField
 positionIncrementGap=1000
   analyzer
 tokenizer class=solr.LowerCaseTokenizerFactory /
 filter class=solr.EdgeNGramFilterFactory minGramSize=1
 maxGramSize=25 side=front /
   /analyzer
 /fieldType

 I believe I used that link when I initially set up the field and it worked
 great (and I'm still using it in other places). In this particular example
 however it does not appear to be practical for me. I mentioned that I have
 a
 similarity class that returns 1 for the idf and in the case of an
 edgengram,
 it returns 1 * length of the search string.

 Thanks,

 Brian Lamb

 On Tue, May 31, 2011 at 11:34 AM, bmdakshinamur...@gmail.com 
 bmdakshinamur...@gmail.com wrote:

  Can you specify the analyzer you are using for your queries?
 
  May be you could use a KeywordAnalyzer for your queries so you don't end
 up
  matching parts of your query.
 
 
 http://www.lucidimagination.com/blog/2009/09/08/auto-suggest-from-popular-queries-using-edgengrams/
  This should help you.
 
  On Tue, May 31, 2011 at 8:24 PM, Brian Lamb
  brian.l...@journalexperts.comwrote:
 
   In this particular case, I will be doing a solr search based on user
   preferences. So I will not be depending on the user to type abcdefg.
  That
   will be automatically generated based on user selections.
  
   The contents of the field do not contain spaces and since I am created
  the
   search parameters, case isn't important either.
  
   Thanks,
  
   Brian Lamb
  
   On Tue, May 31, 2011 at 9:44 AM, Erick Erickson 
 erickerick...@gmail.com
   wrote:
  
That'll work for your case, although be aware that string types
 aren't
analyzed at all,
so case matters, as do spaces etc.
   
What is the use-case here? If you explain it a bit there might be
better answers
   
Best
Erick
   
On Fri, May 27, 2011 at 9:17 AM, Brian Lamb
brian.l...@journalexperts.com wrote:
 For this, I ended up just changing it to string and using
 abcdefg*
  to
 match. That seems to work so far.

 Thanks,

 Brian Lamb

 On Wed, May 25, 2011 at 4:53 PM, Brian Lamb
 brian.l...@journalexperts.comwrote:

 Hi all,

 I'm running into some confusion with the way edgengram works. I
 have
   the
 field set up as:

 fieldType name=edgengram class=solr.TextField
 positionIncrementGap=1000
analyzer
  tokenizer class=solr.LowerCaseTokenizerFactory /
filter class=solr.EdgeNGramFilterFactory minGramSize=1
 maxGramSize=100 side=front /
/analyzer
 /fieldType

 I've also set up my own similarity class that returns 1 as the idf
score.
 What I've found this does is if I match a string abcdefg against
 a
field
 containing abcdefghijklmnop, then the idf will score that as a
 7:

 7.0 = idf(myfield: a=51 ab=23 abc=2 abcd=2 abcde=2 abcdef=2
  abcdefg=2)

 I get why that's happening, but is there a way to avoid that? Do I
   need
to
 do a new field type to achieve the desired affect?

 Thanks,

 Brian Lamb


   
  
 
 
 
  --
  Thanks and Regards,
  DakshinaMurthy BM

Re: Edgengram

2011-05-31 Thread Tomás Fernández Löbbe

...or also use the LowerCaseTokenizerFactory at query time for consistency,
but not the edge ngram filter.

2011/5/31 Tomás Fernández Löbbe tomasflo...@gmail.com

 Hi Brian, I don't know if I understand what you are trying to achieve. You
 want the term query abcdefg to have an idf of 1 insead of 7? I think using
 the KeywordTokenizerFilterFactory at query time should work. I would be
 something like:

 fieldType name=edgengram class=solr.TextField
 positionIncrementGap=1000
   analyzer type=index

 tokenizer class=solr.LowerCaseTokenizerFactory /
 filter class=solr.EdgeNGramFilterFactory minGramSize=1
 maxGramSize=25 side=front /
   /analyzer
   analyzer type=query
   tokenizer class=solr.KeywordTokenizerFactory /
   /analyzer
 /fieldType

 this way, at query time abcdefg won't be turned to a ab abc abcd abcde
 abcdef abcdefg. At index time it will.

 Regards,
 Tomás


 On Tue, May 31, 2011 at 1:07 PM, Brian Lamb brian.l...@journalexperts.com
  wrote:

 fieldType name=edgengram class=solr.TextField
 positionIncrementGap=1000
   analyzer
 tokenizer class=solr.LowerCaseTokenizerFactory /
 filter class=solr.EdgeNGramFilterFactory minGramSize=1
 maxGramSize=25 side=front /
   /analyzer
 /fieldType

 I believe I used that link when I initially set up the field and it worked
 great (and I'm still using it in other places). In this particular example
 however it does not appear to be practical for me. I mentioned that I have
 a
 similarity class that returns 1 for the idf and in the case of an
 edgengram,
 it returns 1 * length of the search string.

 Thanks,

 Brian Lamb

 On Tue, May 31, 2011 at 11:34 AM, bmdakshinamur...@gmail.com 
 bmdakshinamur...@gmail.com wrote:

  Can you specify the analyzer you are using for your queries?
 
  May be you could use a KeywordAnalyzer for your queries so you don't end
 up
  matching parts of your query.
 
 
 http://www.lucidimagination.com/blog/2009/09/08/auto-suggest-from-popular-queries-using-edgengrams/
  This should help you.
 
  On Tue, May 31, 2011 at 8:24 PM, Brian Lamb
  brian.l...@journalexperts.comwrote:
 
   In this particular case, I will be doing a solr search based on user
   preferences. So I will not be depending on the user to type abcdefg.
  That
   will be automatically generated based on user selections.
  
   The contents of the field do not contain spaces and since I am created
  the
   search parameters, case isn't important either.
  
   Thanks,
  
   Brian Lamb
  
   On Tue, May 31, 2011 at 9:44 AM, Erick Erickson 
 erickerick...@gmail.com
   wrote:
  
That'll work for your case, although be aware that string types
 aren't
analyzed at all,
so case matters, as do spaces etc.
   
What is the use-case here? If you explain it a bit there might be
better answers
   
Best
Erick
   
On Fri, May 27, 2011 at 9:17 AM, Brian Lamb
brian.l...@journalexperts.com wrote:
 For this, I ended up just changing it to string and using
 abcdefg*
  to
 match. That seems to work so far.

 Thanks,

 Brian Lamb

 On Wed, May 25, 2011 at 4:53 PM, Brian Lamb
 brian.l...@journalexperts.comwrote:

 Hi all,

 I'm running into some confusion with the way edgengram works. I
 have
   the
 field set up as:

 fieldType name=edgengram class=solr.TextField
 positionIncrementGap=1000
analyzer
  tokenizer class=solr.LowerCaseTokenizerFactory /
filter class=solr.EdgeNGramFilterFactory
 minGramSize=1
 maxGramSize=100 side=front /
/analyzer
 /fieldType

 I've also set up my own similarity class that returns 1 as the
 idf
score.
 What I've found this does is if I match a string abcdefg
 against a
field
 containing abcdefghijklmnop, then the idf will score that as a
 7:

 7.0 = idf(myfield: a=51 ab=23 abc=2 abcd=2 abcde=2 abcdef=2
  abcdefg=2)

 I get why that's happening, but is there a way to avoid that? Do
 I
   need
to
 do a new field type to achieve the desired affect?

 Thanks,

 Brian Lamb


   
  
 
 
 
  --
  Thanks and Regards,
  DakshinaMurthy BM

Re: how does Solr/Lucene index multi-value fields


On 5/31/2011 12:16 PM, Ian Holsman wrote:
we have a collection of related stories. when a user searches for 
something, we might not want to display the story that is 
most-relevant (according to SOLR), but according to other home-grown 
rules. by combing all the possibilities in one SolrDocument, we can 
avoid a DB-hit to get related stories.


Avoiding a DB hit may or may not actually be a good goal here. You may 
find that hitting the DB to get related stories is _more performant_ 
than retrieving a very large stored field from Solr. (My sense is this 
can be especially a problem on a Solr index that has not been optimized, 
but I'm not sure).


Sorry, don't have an answer to your actual question, but if an attempted 
performance improvement is making other things harder... might want to 
be sure your presumed performance improvement really is a performance 
improvement.

Re: how does Solr/Lucene index multi-value fields

Hmmm, I may have mis-lead you. Re-reading my text it
wasn't very well written

TF/IDF calculations are, indeed, per-field. I was trying
to say that there was no difference between storing all
the data for an individual field as a single long string of text
in a single-valued field or as several shorter strings in
a multi-valued field.

Best
Erick

On Tue, May 31, 2011 at 12:16 PM, Ian Holsman had...@holsman.net wrote:

 On May 31, 2011, at 12:11 PM, Erick Erickson wrote:

 Can you explain the use-case a bit more here? Especially the post-query
 processing and how you expect the multiple documents to help here.


 we have a collection of related stories. when a user searches for something, 
 we might not want to display the story that is most-relevant (according to 
 SOLR), but according to other home-grown rules.  by combing all the 
 possibilities in one SolrDocument, we can avoid a DB-hit to get related 
 stories.


 But TF/IDF is calculated over all the values in the field. There's really no
 difference between a multi-valued field and storing all the data in a
 single field
 as far as relevance calculations are concerned.


 so.. it will suck regardless.. I thought we had per-field relevance in the 
 current trunk. :-(


 Best
 Erick

 On Tue, May 31, 2011 at 11:02 AM, Ian Holsman had...@holsman.net wrote:
 Hi.

 I want to store a list of documents (say each being 30-60k of text) into a 
 single SolrDocument. (to speed up post-retrieval querying)

 In order to do this, I need to know if lucene calculates the TF/IDF score 
 over the entire field or does it treat each value in the list as a unique 
 field?

 If I can't store it as a multi-value, I could create a schema where I put 
 each document into a unique field, but I'm not sure how to create the query 
 to search all the fields.


 Regards
 Ian

Re: Custom Scoring relying on another server.

2011-05-31 Thread arian487

bump

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Custom-Scoring-relying-on-another-server-tp2994546p3006873.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: how does Solr/Lucene index multi-value fields

2011-05-31 Thread Ian Holsman

Thanks Erick.

sadly in my use-case I don't that wouldn't work. I'll go back to storing them 
at the story level, and hitting a DB to get related stories I think.

--I
On May 31, 2011, at 12:27 PM, Erick Erickson wrote:

 Hmmm, I may have mis-lead you. Re-reading my text it
 wasn't very well written
 
 TF/IDF calculations are, indeed, per-field. I was trying
 to say that there was no difference between storing all
 the data for an individual field as a single long string of text
 in a single-valued field or as several shorter strings in
 a multi-valued field.
 
 Best
 Erick
 
 On Tue, May 31, 2011 at 12:16 PM, Ian Holsman had...@holsman.net wrote:
 
 On May 31, 2011, at 12:11 PM, Erick Erickson wrote:
 
 Can you explain the use-case a bit more here? Especially the post-query
 processing and how you expect the multiple documents to help here.
 
 
 we have a collection of related stories. when a user searches for something, 
 we might not want to display the story that is most-relevant (according to 
 SOLR), but according to other home-grown rules.  by combing all the 
 possibilities in one SolrDocument, we can avoid a DB-hit to get related 
 stories.
 
 
 But TF/IDF is calculated over all the values in the field. There's really no
 difference between a multi-valued field and storing all the data in a
 single field
 as far as relevance calculations are concerned.
 
 
 so.. it will suck regardless.. I thought we had per-field relevance in the 
 current trunk. :-(
 
 
 Best
 Erick
 
 On Tue, May 31, 2011 at 11:02 AM, Ian Holsman had...@holsman.net wrote:
 Hi.
 
 I want to store a list of documents (say each being 30-60k of text) into a 
 single SolrDocument. (to speed up post-retrieval querying)
 
 In order to do this, I need to know if lucene calculates the TF/IDF score 
 over the entire field or does it treat each value in the list as a unique 
 field?
 
 If I can't store it as a multi-value, I could create a schema where I put 
 each document into a unique field, but I'm not sure how to create the 
 query to search all the fields.
 
 
 Regards
 Ian

Using multiple CPUs for a single document base?

Is there a way to allow Solr to use multiple CPUs of a single, multi-core box, 
to increase scale (number of documents, number of searches) of the searchbase?

The CoreAdmin wiki page talks about Multiple Cores as essentially independent 
document bases with independent indexes, but with some unification of 
administration at the grosser levels. That's not quite what I'm looking for, 
though. I want a single URL for add and search access, and a single logical 
searchbase, but I want to be able to use more of the resources of the physical 
box where the searchbase runs.

I guess I thought I would get this for free, it being Java and all, but I don't 
seem to: even with hundreds of clients adding and searching, I only seem to use 
one hardware core, and a bit of a second (which I interpret to mean one Java 
thread for Solr, one Java thread for Java I/O).

-==-
Jack Repenning
Technologist
Codesion Business Unit
CollabNet, Inc.
8000 Marina Boulevard, Suite 600
Brisbane, California 94005
office: +1 650.228.2562
twitter: http://twitter.com/jrep











PGP.sig
Description: This is a digitally signed message part

Re: Using multiple CPUs for a single document base?

Are you using a  1.4 version of Solr? It has since been improved for multi-
threaded scalability such as a lot of neat non-blocking components. It's been 
a long time ago since i saw a Solr server not taking advantage of multiple 
cores. Today, they take 200% up to even 1200% CPU time when viewing with top.

Here's one running at almost 700% processing 450+ queries/second.
15522 markus20   0 2432m 859m  10m S  698 10.8   2:05.41 java   
 

 Is there a way to allow Solr to use multiple CPUs of a single, multi-core
 box, to increase scale (number of documents, number of searches) of the
 searchbase?
 
 The CoreAdmin wiki page talks about Multiple Cores as essentially
 independent document bases with independent indexes, but with some
 unification of administration at the grosser levels. That's not quite what
 I'm looking for, though. I want a single URL for add and search access,
 and a single logical searchbase, but I want to be able to use more of the
 resources of the physical box where the searchbase runs.
 
 I guess I thought I would get this for free, it being Java and all, but I
 don't seem to: even with hundreds of clients adding and searching, I only
 seem to use one hardware core, and a bit of a second (which I interpret to
 mean one Java thread for Solr, one Java thread for Java I/O).
 
 -==-
 Jack Repenning
 Technologist
 Codesion Business Unit
 CollabNet, Inc.
 8000 Marina Boulevard, Suite 600
 Brisbane, California 94005
 office: +1 650.228.2562
 twitter: http://twitter.com/jrep

Obtaining query AST?

2011-05-31 Thread darren

Hi,
 I want to write my own query expander. It needs to obtain the AST
(abstract syntax tree) of an already parsed query string, navigate to
certain parts of it (words) and make logical phrases of those words by
adding to the AST - where necessary.

This cannot be done to the string because the query logic cannot be
semantically altered. (e.g. AND, OR, paren's etc) so it must be parsed
first.

How can this be done with SolrJ?

thanks for any tips.
Darren

Re: Using multiple CPUs for a single document base?

On May 31, 2011, at 11:16 AM, Markus Jelsma wrote:

 Are you using a  1.4 version of Solr?

Yeah, about those version numbers ... The tarball I installed claimed its 
version was

  apache-solr-3.1.0

Which sounds comfortably later than 1.4.

But the examples/solr/schema.xml that comes with it claims version 1.3.

I'm confused.

-==-
Jack Repenning
Technologist
Codesion Business Unit
CollabNet, Inc.
8000 Marina Boulevard, Suite 600
Brisbane, California 94005
office: +1 650.228.2562
twitter: http://twitter.com/jrep











PGP.sig
Description: This is a digitally signed message part

Re: Using multiple CPUs for a single document base?

Yeah, ignore the 'multiple cores' you are seeing in the docs there, 
that's about something else unrelated to CPU's, as you discovered, has 
nothing to do with what you're asking about, put it out of your mind.


I kind of think you should get multi-CPU use 'for free' as a Java app 
too. It does for me, heavy Solr usage, I look at my stats, multiple CPU 
cores are being excersized. (Of course, this is never going to be 
perfectly efficient, you aren't going to get double performance  by 
doubling the CPUs in one box, there are various bottlenecks, as we all 
know).


There are also some Java GC tuning you want to do with multiple cores, 
the default JVM settings aren't usually appropriate. (You want to 
background thread your GC, I forget the magic JVM invocations). But 
that's probably not related to your issue if you don't even see more 
than one CPU being exersized at all, weird.


On 5/31/2011 1:44 PM, Jack Repenning wrote:

Is there a way to allow Solr to use multiple CPUs of a single, multi-core box, 
to increase scale (number of documents, number of searches) of the searchbase?

The CoreAdmin wiki page talks about Multiple Cores as essentially independent 
document bases with independent indexes, but with some unification of administration at 
the grosser levels. That's not quite what I'm looking for, though. I want a single URL 
for add and search access, and a single logical searchbase, but I want to be able to use 
more of the resources of the physical box where the searchbase runs.

I guess I thought I would get this for free, it being Java and all, but I don't seem to: 
even with hundreds of clients adding and searching, I only seem to use one hardware core, 
and a bit of a second (which I interpret to mean one Java thread for Solr, one Java 
thread for Java I/O).

-==-
Jack Repenning
Technologist
Codesion Business Unit
CollabNet, Inc.
8000 Marina Boulevard, Suite 600
Brisbane, California 94005
office: +1 650.228.2562
twitter: http://twitter.com/jrep

Re: Using multiple CPUs for a single document base?

1.3 is the schema version. It hasn't had as many upgrades. Solr 3.1 uses the 
1.3 schema version, Solr 1.4.x uses the 1.2 schema version.

 On May 31, 2011, at 11:16 AM, Markus Jelsma wrote:
  Are you using a  1.4 version of Solr?
 
 Yeah, about those version numbers ... The tarball I installed claimed its
 version was
 
   apache-solr-3.1.0
 
 Which sounds comfortably later than 1.4.
 
 But the examples/solr/schema.xml that comes with it claims version 1.3.
 
 I'm confused.
 
 -==-
 Jack Repenning
 Technologist
 Codesion Business Unit
 CollabNet, Inc.
 8000 Marina Boulevard, Suite 600
 Brisbane, California 94005
 office: +1 650.228.2562
 twitter: http://twitter.com/jrep

Re: Using multiple CPUs for a single document base?


On May 31, 2011, at 11:29 AM, Jonathan Rochkind wrote:

 I kind of think you should get multi-CPU use 'for free' as a Java app too.

Ah, probably experimental error? If I apply a stress load consisting only of 
queries, I get automatic multi-core use as expected. I could see where indexing 
new dox could tend toward synchronization and uniprocessing. Perhaps my 
original test load was too add-centric, does that make sense?

-==-
Jack Repenning
Technologist
Codesion Business Unit
CollabNet, Inc.
8000 Marina Boulevard, Suite 600
Brisbane, California 94005
office: +1 650.228.2562
twitter: http://twitter.com/jrep











PGP.sig
Description: This is a digitally signed message part

Re: Obtaining query AST?

2011-05-31 Thread Mike Sokolov

I believe there is a query parser that accepts queries formatted in XML, 
allowing you to provide a parse tree to Solr; perhaps that would get you 
the control you're after.


-Mike

On 05/31/2011 02:24 PM, dar...@ontrenet.com wrote:

Hi,
  I want to write my own query expander. It needs to obtain the AST
(abstract syntax tree) of an already parsed query string, navigate to
certain parts of it (words) and make logical phrases of those words by
adding to the AST - where necessary.

This cannot be done to the string because the query logic cannot be
semantically altered. (e.g. AND, OR, paren's etc) so it must be parsed
first.

How can this be done with SolrJ?

thanks for any tips.
Darren

Re: Obtaining query AST?

2011-05-31 Thread darren

Hi, thanks for the tip. I noticed the XML stuff, but the trouble is I am
taking a query string entered by a user such as this OR that AND (this
AND that) so I'm not sure how to go from that to a representational AST
parse tree...

 I believe there is a query parser that accepts queries formatted in XML,
 allowing you to provide a parse tree to Solr; perhaps that would get you
 the control you're after.

 -Mike

 On 05/31/2011 02:24 PM, dar...@ontrenet.com wrote:
 Hi,
   I want to write my own query expander. It needs to obtain the AST
 (abstract syntax tree) of an already parsed query string, navigate to
 certain parts of it (words) and make logical phrases of those words by
 adding to the AST - where necessary.

 This cannot be done to the string because the query logic cannot be
 semantically altered. (e.g. AND, OR, paren's etc) so it must be parsed
 first.

 How can this be done with SolrJ?

 thanks for any tips.
 Darren

Re: Using multiple CPUs for a single document base?


Yep, that could be it.

You certainly don't get _great_ concurrency support 'for free' in Java, 
concurrent programming is still tricky. Parts of Solr are surely better 
at it than others.


The one place I'd be shocked was just with multiple concurrent queries, 
if those weren't helped by multi-CPUs.   Multi CPUs won't neccesarily 
speed up any single query, they should just speed up the overall 
situation under heavy load.  Which has been my observation. And it may 
be that multi-CPU's don't speed up add/commit much, as you possibly have 
observed.


I do all my 'adds' to a seperate Solr index, and then replicate to a 
slave that actually serves queries. My 'master' that I do my adds to is 
actually on the very same server -- but I run it in an entirely 
different java container, in part to minimize any chance that it will 
end up competing for threads/CPUs with the slave serving queries, the OS 
level alone should ('should', famous last word)  balance it to a 
different cpu core. (Of course, there's still only so much total CPU 
avail on the machine).


On 5/31/2011 2:53 PM, Jack Repenning wrote:

On May 31, 2011, at 11:29 AM, Jonathan Rochkind wrote:


I kind of think you should get multi-CPU use 'for free' as a Java app too.

Ah, probably experimental error? If I apply a stress load consisting only of 
queries, I get automatic multi-core use as expected. I could see where indexing 
new dox could tend toward synchronization and uniprocessing. Perhaps my 
original test load was too add-centric, does that make sense?

-==-
Jack Repenning
Technologist
Codesion Business Unit
CollabNet, Inc.
8000 Marina Boulevard, Suite 600
Brisbane, California 94005
office: +1 650.228.2562
twitter: http://twitter.com/jrep

Re: Obtaining query AST?

You're going to have to parse it yourself.  Or, since Solr is open 
source, you can take pieces of the existing query parsers (dismax or 
lucene), and repurpose them. But I don't _think_ (I could be wrong) 
there is any public API in Solr/SolrJ that will give you an AST.


On 5/31/2011 3:18 PM, dar...@ontrenet.com wrote:

Hi, thanks for the tip. I noticed the XML stuff, but the trouble is I am
taking a query string entered by a user such as this OR that AND (this
AND that) so I'm not sure how to go from that to a representational AST
parse tree...


I believe there is a query parser that accepts queries formatted in XML,
allowing you to provide a parse tree to Solr; perhaps that would get you
the control you're after.

-Mike

On 05/31/2011 02:24 PM, dar...@ontrenet.com wrote:

Hi,
   I want to write my own query expander. It needs to obtain the AST
(abstract syntax tree) of an already parsed query string, navigate to
certain parts of it (words) and make logical phrases of those words by
adding to the AST - where necessary.

This cannot be done to the string because the query logic cannot be
semantically altered. (e.g. AND, OR, paren's etc) so it must be parsed
first.

How can this be done with SolrJ?

thanks for any tips.
Darren

Re: Splitting fields

2011-05-31 Thread Jan Høydahl

Hi,

Write a custom UpdateProcessor, which gives you full control of the 
SolrDocument prior to indexing. The best would be if you write a generic 
FieldSplitterProcessor which is configurable on what field to take as input, 
what delimiter or regex to split on and finally what fields to write the result 
to. This way other may re-use your code for their splitting needs.

See http://wiki.apache.org/solr/UpdateRequestProcessor and 
http://wiki.apache.org/solr/SolrConfigXml#UpdateRequestProcessorChain_section

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

On 27. mai 2011, at 15.47, Joe Fitzgerald wrote:

 Hello,
 
 
 
 I am in an odd position.  The application server I use has built-in
 integration with SOLR.  Unfortunately, its native capabilities are
 fairly limited, specifically, it only supports a standard/pre-defined
 set of fields which can be indexed.  As a result, it has left me
 kludging how I work with Solr and doing things like putting what I'd
 like to be multiple, separate fields into a single Solr field.
 
 
 
 As an example, I may put a customer id and name into a single field
 called 'custom1'.  Ideally, I'd like this information to be returned in
 separate fields...and even better would be for them to be indexed as
 separate fields but I can live without the latter.  Currently, I'm
 building out a json representation of this information which makes it
 easy for me to deal with when I extract the results...but it all feels
 wrong.
 
 
 
 I do have complete control over the actual Solr installation (just not
 the indexing call to Solr), so I was hoping there may be a way to
 configure Solr to take my single field and split it up into a different
 field for each key in my json representation.
 
 
 
 I don't see anything native to Solr that would do this for me but there
 are a few features that I thought sounded similar and was hoping to get
 some opinions on how I may be able to move forward with this...
 
 
 
 Poly fields, such as the spatial location, might help?  Can I build my
 own poly-field that would split up the main field into subfields?  Do
 poly-fields let me return the subfields?  I don't quite have my head
 around polyfields yet.
 
 
 
 Another option although I suspect this won't be considered a good
 approach, but what about extending the copyField functionality of
 schema.xml to support my needs?  It would seem not entirely unreasonable
 that copyField would provide a means to extract only a portion of the
 contents of the source field to place in the destination field, no?  I'm
 sure people more familiar with Solr's architecture could explain why
 this isn't really an appropriate thing for Solr to handle (just because
 it could doesn't mean it should)...
 
 The other - and probably best -- option would be to leverage Solr
 directly, bypassing the native integration of my application server,
 which we've already done for most cases.  I'd love to go this route but
 I'm having a hard time figuring out how to easily accomplish the same
 functionality provided by my app server integration...perhaps someone on
 the list could help me with this path forward?  Here is what I'm trying
 to accomplish:
 
 
 
 I'm indexing documents (text, pdf, html...) but I need to include fields
 in the results of my searches which are only available from a db query.
 I know how to have Solr index results from a db query, but I'm having
 trouble getting it to index the documents that are associated to each
 record of that query (full path/filename is one of the fields of that
 query).
 
 
 
 I started to try to use the dataImport handler to do this, by setting up
 a FileDataSource in addition to my jdbc data source.  I tried to
 leverage the filedatasource to populate a sub-entity based on the db
 field that contains the full path/filename, but I wasn't sure how to
 specify the db field from the root query/entity.  Before I spent too
 much time, I also realized I wasn't sure how to get Solr to deal with
 binary file types this way either which upon further reading seemed like
 I would need to leverage Tika - can that be done within the confines of
 dataimporthandler?
 
 
 
 Any advice is greatly appreciated.  Thanks in advance,
 
 
 
 Joe

Re: Using multiple CPUs for a single document base?

If you use only one thread when indexing then one one core is going to be 
used.

 On May 31, 2011, at 11:29 AM, Jonathan Rochkind wrote:
  I kind of think you should get multi-CPU use 'for free' as a Java app
  too.
 
 Ah, probably experimental error? If I apply a stress load consisting only
 of queries, I get automatic multi-core use as expected. I could see where
 indexing new dox could tend toward synchronization and uniprocessing.
 Perhaps my original test load was too add-centric, does that make sense?
 
 -==-
 Jack Repenning
 Technologist
 Codesion Business Unit
 CollabNet, Inc.
 8000 Marina Boulevard, Suite 600
 Brisbane, California 94005
 office: +1 650.228.2562
 twitter: http://twitter.com/jrep

Re: Boosting fields at query time in Standard Request Handler from Solrconfig.xml

2011-05-31 Thread Jan Høydahl

Hi,

You need to add
str name=defTypeedismax/str

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

On 31. mai 2011, at 18.08, Vignesh Raj wrote:

 Hi,
 
 I am developing a search engine app using Asp.Net, C# and Solrnet. I use the
 standard request handler. Is there a way I can boost the fields at query
 time from inside the solrconfig.xml file itself. Just like the qf field
 for Dismax handler.
 Right now am searching like field1:value^1.5 field2:value^1.2
 field3:value^0.8 and this is done in the middle tier. I want Solr itself to
 do this using standard request handler. Can I write a similar kind of thing
 inside standard req handler?
 
 Here is my solrconfig file.
 
 requestHandler name=standard class=solr.SearchHandler default=true
 
   lst name=defaults
 
 str name=echoParamsexplicit/str
 
 str name=hltrue/str
 
 str name=hl.snippets3/str
 
 str name=hl.fragsize25/str
 
 str name=qffile_description^100.0 file_content^6.0 file_name^10.0
 file_comments^4.0
 
 /str
 
   /lst
 
   arr name=last-components
 
  strspellcheck/str 
 
   /arr
 
 /requestHandler
 
 
 
 But am not able to see the results if I add this in my solrconfig.xml file.
 I have edited the post to add my req handler code in solrconfig. But, if
 have my query string as file_description:result^1.0 file_content:result^0.6
 file_name:result^0.5 file_comments:result^0.8, am able to see the required
 result.
 
 
 
 Regards
 
 Vignesh

Re: Using multiple CPUs for a single document base?

You say it like it's something you have control over; how would one 
choose to use more than one thread when indexing? I guess maybe it 
depends on how you're indexing of course; I guess if you're using SolrJ 
it's straightforward. What if you're using the ordinary HTTP Post 
interface, or DIH?


On 5/31/2011 3:35 PM, Markus Jelsma wrote:

If you use only one thread when indexing then one one core is going to be
used.


On May 31, 2011, at 11:29 AM, Jonathan Rochkind wrote:

I kind of think you should get multi-CPU use 'for free' as a Java app
too.

Ah, probably experimental error? If I apply a stress load consisting only
of queries, I get automatic multi-core use as expected. I could see where
indexing new dox could tend toward synchronization and uniprocessing.
Perhaps my original test load was too add-centric, does that make sense?

-==-
Jack Repenning
Technologist
Codesion Business Unit
CollabNet, Inc.
8000 Marina Boulevard, Suite 600
Brisbane, California 94005
office: +1 650.228.2562
twitter: http://twitter.com/jrep

Re: Using multiple CPUs for a single document base?

On May 31, 2011, at 12:24 PM, Jonathan Rochkind wrote:

 I do all my 'adds' to a seperate Solr index, and then replicate to a slave 
 that actually serves queries.

Yes, that's a step I'm holding in reserve. Probably get there some day, as I 
expect always to have a very high add-to-query ratio. But for the moment, I 
don't think I need it.

 My 'master' that I do my adds to is actually on the very same server -- but I 
 run it in an entirely different java container,

Now THAT was an interesting data point, thanks very much! I hadn't thought of 
running the master on the same box!

-==-
Jack Repenning
Technologist
Codesion Business Unit
CollabNet, Inc.
8000 Marina Boulevard, Suite 600
Brisbane, California 94005
office: +1 650.228.2562
twitter: http://twitter.com/jrep











PGP.sig
Description: This is a digitally signed message part

Re: Splitting fields

I'd go for this option as well. The example update processor can't make it 
more easier and it's a very flexible approach. Judging from the patch in 
SOLR-2105 it should still work with the current 3.2 branch.

https://issues.apache.org/jira/browse/SOLR-2105


 Hi,
 
 Write a custom UpdateProcessor, which gives you full control of the
 SolrDocument prior to indexing. The best would be if you write a generic
 FieldSplitterProcessor which is configurable on what field to take as
 input, what delimiter or regex to split on and finally what fields to
 write the result to. This way other may re-use your code for their
 splitting needs.
 
 See http://wiki.apache.org/solr/UpdateRequestProcessor and
 http://wiki.apache.org/solr/SolrConfigXml#UpdateRequestProcessorChain_sect
 ion
 
 --
 Jan Høydahl, search solution architect
 Cominvent AS - www.cominvent.com
 Solr Training - www.solrtraining.com
 
 On 27. mai 2011, at 15.47, Joe Fitzgerald wrote:
  Hello,
  
  
  
  I am in an odd position.  The application server I use has built-in
  integration with SOLR.  Unfortunately, its native capabilities are
  fairly limited, specifically, it only supports a standard/pre-defined
  set of fields which can be indexed.  As a result, it has left me
  kludging how I work with Solr and doing things like putting what I'd
  like to be multiple, separate fields into a single Solr field.
  
  
  
  As an example, I may put a customer id and name into a single field
  called 'custom1'.  Ideally, I'd like this information to be returned in
  separate fields...and even better would be for them to be indexed as
  separate fields but I can live without the latter.  Currently, I'm
  building out a json representation of this information which makes it
  easy for me to deal with when I extract the results...but it all feels
  wrong.
  
  
  
  I do have complete control over the actual Solr installation (just not
  the indexing call to Solr), so I was hoping there may be a way to
  configure Solr to take my single field and split it up into a different
  field for each key in my json representation.
  
  
  
  I don't see anything native to Solr that would do this for me but there
  are a few features that I thought sounded similar and was hoping to get
  some opinions on how I may be able to move forward with this...
  
  
  
  Poly fields, such as the spatial location, might help?  Can I build my
  own poly-field that would split up the main field into subfields?  Do
  poly-fields let me return the subfields?  I don't quite have my head
  around polyfields yet.
  
  
  
  Another option although I suspect this won't be considered a good
  approach, but what about extending the copyField functionality of
  schema.xml to support my needs?  It would seem not entirely unreasonable
  that copyField would provide a means to extract only a portion of the
  contents of the source field to place in the destination field, no?  I'm
  sure people more familiar with Solr's architecture could explain why
  this isn't really an appropriate thing for Solr to handle (just because
  it could doesn't mean it should)...
  
  The other - and probably best -- option would be to leverage Solr
  directly, bypassing the native integration of my application server,
  which we've already done for most cases.  I'd love to go this route but
  I'm having a hard time figuring out how to easily accomplish the same
  functionality provided by my app server integration...perhaps someone on
  the list could help me with this path forward?  Here is what I'm trying
  to accomplish:
  
  
  
  I'm indexing documents (text, pdf, html...) but I need to include fields
  in the results of my searches which are only available from a db query.
  I know how to have Solr index results from a db query, but I'm having
  trouble getting it to index the documents that are associated to each
  record of that query (full path/filename is one of the fields of that
  query).
  
  
  
  I started to try to use the dataImport handler to do this, by setting up
  a FileDataSource in addition to my jdbc data source.  I tried to
  leverage the filedatasource to populate a sub-entity based on the db
  field that contains the full path/filename, but I wasn't sure how to
  specify the db field from the root query/entity.  Before I spent too
  much time, I also realized I wasn't sure how to get Solr to deal with
  binary file types this way either which upon further reading seemed like
  I would need to leverage Tika - can that be done within the confines of
  dataimporthandler?
  
  
  
  Any advice is greatly appreciated.  Thanks in advance,
  
  
  
  Joe

Searching Database

2011-05-31 Thread Roger Shah

How can I use SOLR (version 3.1) to search in our Microsoft SQL Server 
database?  
I looked at the DIH example but that looks like it is for importing.  I also 
looked at the following link:  http://wiki.apache.org/solr/DataImportHandler

Please send me a link to any instructions to set up SOLR so that I can search 
the database.

Thank You,
Roger

Re: Using multiple CPUs for a single document base?

I haven't given it a try but perhaps opening multiple HTTP connections to the 
update handler will end up in multiple threads thus better CPU utilization. 

Would be nice if someone can prove it here, i'm not in `the lab` right now ;)

 You say it like it's something you have control over; how would one
 choose to use more than one thread when indexing? I guess maybe it
 depends on how you're indexing of course; I guess if you're using SolrJ
 it's straightforward. What if you're using the ordinary HTTP Post
 interface, or DIH?
 
 On 5/31/2011 3:35 PM, Markus Jelsma wrote:
  If you use only one thread when indexing then one one core is going to be
  used.
  
  On May 31, 2011, at 11:29 AM, Jonathan Rochkind wrote:
  I kind of think you should get multi-CPU use 'for free' as a Java app
  too.
  
  Ah, probably experimental error? If I apply a stress load consisting
  only of queries, I get automatic multi-core use as expected. I could
  see where indexing new dox could tend toward synchronization and
  uniprocessing. Perhaps my original test load was too add-centric, does
  that make sense?
  
  -==-
  Jack Repenning
  Technologist
  Codesion Business Unit
  CollabNet, Inc.
  8000 Marina Boulevard, Suite 600
  Brisbane, California 94005
  office: +1 650.228.2562
  twitter: http://twitter.com/jrep

Re: Searching Database

2011-05-31 Thread Stefan Matheis


Roger,

 .. but that looks like it is for importing .. it will remain the same 
- no matter how often you'll search for it :) because that is .. what 
solr is for. you have to import (and this means indexing, analyzing ..) 
the whole content that you want to search.


You could either use DIH to import that content directly or use the 
UpdateXML- / UpdateJSON-Handler to push the Content to SOLR.


Regards
Stefan

Am 31.05.2011 21:44, schrieb Roger Shah:

How can I use SOLR (version 3.1) to search in our Microsoft SQL Server database?
I looked at the DIH example but that looks like it is for importing.  I also 
looked at the following link:  http://wiki.apache.org/solr/DataImportHandler

Please send me a link to any instructions to set up SOLR so that I can search 
the database.

Thank You,
Roger

Re: Searching Database

2011-05-31 Thread Otis Gospodnetic

Roger,

You have to import/index into Solr before you can search it.  Solr can't go 
into 
your MS SQL server and search data in there.

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



- Original Message 
 From: Roger Shah rs...@caci.com
 To: solr-user@lucene.apache.org solr-user@lucene.apache.org
 Sent: Tue, May 31, 2011 3:44:42 PM
 Subject: Searching Database
 
 How can I use SOLR (version 3.1) to search in our Microsoft SQL Server  
database?  

 I looked at the DIH example but that looks like it is for  importing.  I also 
looked at the following link:  http://wiki.apache.org/solr/DataImportHandler
 
 Please send me a link  to any instructions to set up SOLR so that I can 
 search 
the  database.
 
 Thank You,
 Roger

Re: Searching Database

2011-05-31 Thread Gora Mohanty

On Wed, Jun 1, 2011 at 1:14 AM, Roger Shah rs...@caci.com wrote:
 How can I use SOLR (version 3.1) to search in our Microsoft SQL Server 
 database?
 I looked at the DIH example but that looks like it is for importing.  I also 
 looked at the following link:  http://wiki.apache.org/solr/DataImportHandler
[...]

That is exactly what you need. You first have to import the
data into Solr/Lucene before you can search it. I think that
you might be mistaken in your view of Solr: It is *not* an
add-on to a database that allows search.

Regards,
Gora

Re: Searching Database

Roger, how about not hijacking another user's thread and not hijacking your 
already hijacked thread twice more? Chaning the e-mail subject won't change 
the header's contents.

 How can I use SOLR (version 3.1) to search in our Microsoft SQL Server
 database? I looked at the DIH example but that looks like it is for
 importing.  I also looked at the following link: 
 http://wiki.apache.org/solr/DataImportHandler
 
 Please send me a link to any instructions to set up SOLR so that I can
 search the database.
 
 Thank You,
 Roger

RE: Searching Database

2011-05-31 Thread Roger Shah

Sorry, Markus.  I was not aware I need to create a new email.

-Original Message-
From: Markus Jelsma [mailto:markus.jel...@openindex.io] 
Sent: Tuesday, May 31, 2011 3:48 PM
To: solr-user@lucene.apache.org
Subject: Re: Searching Database

Roger, how about not hijacking another user's thread and not hijacking your 
already hijacked thread twice more? Chaning the e-mail subject won't change 
the header's contents.

 How can I use SOLR (version 3.1) to search in our Microsoft SQL Server
 database? I looked at the DIH example but that looks like it is for
 importing.  I also looked at the following link: 
 http://wiki.apache.org/solr/DataImportHandler

 Please send me a link to any instructions to set up SOLR so that I can
 search the database.

 Thank You,
 Roger

Re: Better Spellcheck

2011-05-31 Thread Otis Gospodnetic

Hi Tanner,

We have something we call DYM ReSearcher that helps in situations like these, 
esp. with multi-word queries that Lucene/Solr spellcheckers have trouble with.

See http://sematext.com/products/dym-researcher/index.html

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



- Original Message 
 From: Tanner Postert tanner.post...@gmail.com
 To: solr-user@lucene.apache.org
 Sent: Tue, May 31, 2011 11:31:07 AM
 Subject: Better Spellcheck
 
 I've tried to use a spellcheck dictionary built from my own content, but  my
 content ends up having a lot of misspelled words so the spellcheck ends  up
 being less than effective. I could use a standard dictionary, but it  may
 have problems with proper nouns. It also misses phrases. When  someone
 searches for Untied States I would hope the spellcheck would  suggest
 United States but it just recognizes that untied is a valid word  and
 doesn't suggest any thing.
 
 Is there any way around this? Are there  any third party modules or
 spellcheck systems that I could implement to get  these type of features?

Re: Using multiple CPUs for a single document base?

On May 31, 2011, at 12:44 PM, Markus Jelsma wrote:

 I haven't given it a try but perhaps opening multiple HTTP connections to the 
 update handler will end up in multiple threads thus better CPU utilization. 

My original test case had hundreds of HTTP connections (all to the same URL) 
doing adds, but seemed to use only one CPU core for adding, or to serialize the 
adds somehow, something like that ... at any rate, I couldn't drive CPU use 
above ~120% with that configuration.

This is quite different from queries. For queries (or a rich query-to-add mix), 
I can easily drive CPU use into multiple-hundreds of % CPU, with just a few 
dozen concurrent query connections (running flat out). But adds resist that 
trick. I don't know whether this means that adds really are using a single 
thread, or if they're using multiple threads but synchronizing on some monitor. 
Actually, I can't say I care much: bottom line seems to be I only use one CPU 
core (plus a negligible marginal bit) for adds.

Since I've confirmed that queries spread neatly, I can live with the 
single-thready adds. In production, it seems likely that I'll be more or less 
continuously spending one CPU core on adds, and the rest on queries.

-==-
Jack Repenning
Technologist
Codesion Business Unit
CollabNet, Inc.
8000 Marina Boulevard, Suite 600
Brisbane, California 94005
office: +1 650.228.2562
twitter: http://twitter.com/jrep











PGP.sig
Description: This is a digitally signed message part

Solr memory consumption

2011-05-31 Thread Denis Kuzmenok

I  run  multiple-core  solr with flags: -Xms3g -Xmx6g -D64, but i see
this in top after 6-8 hours and still raising:

17485  test214 10.0g 7.4g 9760 S 308.2 31.3 448:00.75 java
-Xms3g -Xmx6g -D64 -Dsolr.solr.home=/home/test/solr/example/multicore/ -jar 
start.jar
  
Are there any ways to limit memory for sure?

Thanks

Re: Using multiple CPUs for a single document base?

As far as I know you're on the right track, adds are single threaded.
You can have multiple threads making indexing requests from your
client, but that's primarily aimed at making the I/O not be the bottleneck,
at some point the actual indexing of the documents is single-threaded.

It'd be tricky, very tricky to have multiple threads writing to an index at
the same time, much less multiple CPUs.

If you're desperate to index quickly, you can index into several cores, even
on separate machines and merge the results.

Best
Erick

On Tue, May 31, 2011 at 4:13 PM, Jack Repenning jrepenn...@collab.net wrote:
 On May 31, 2011, at 12:44 PM, Markus Jelsma wrote:

 I haven't given it a try but perhaps opening multiple HTTP connections to the
 update handler will end up in multiple threads thus better CPU utilization.

 My original test case had hundreds of HTTP connections (all to the same URL) 
 doing adds, but seemed to use only one CPU core for adding, or to serialize 
 the adds somehow, something like that ... at any rate, I couldn't drive CPU 
 use above ~120% with that configuration.

 This is quite different from queries. For queries (or a rich query-to-add 
 mix), I can easily drive CPU use into multiple-hundreds of % CPU, with just a 
 few dozen concurrent query connections (running flat out). But adds resist 
 that trick. I don't know whether this means that adds really are using a 
 single thread, or if they're using multiple threads but synchronizing on some 
 monitor. Actually, I can't say I care much: bottom line seems to be I only 
 use one CPU core (plus a negligible marginal bit) for adds.

 Since I've confirmed that queries spread neatly, I can live with the 
 single-thready adds. In production, it seems likely that I'll be more or less 
 continuously spending one CPU core on adds, and the rest on queries.

 -==-
 Jack Repenning
 Technologist
 Codesion Business Unit
 CollabNet, Inc.
 8000 Marina Boulevard, Suite 600
 Brisbane, California 94005
 office: +1 650.228.2562
 twitter: http://twitter.com/jrep

What's your query result cache's stats?

Hi,

I've seen the stats page many times, of quite a few installations and even 
more servers. There's one issue that keeps bothering me: the cumulative hit 
ratio of the query result cache, it's almost never higher than 50%.

What are your stats? How do you deal with it?

In some cases i have to disable it because of the high warming penalty i get 
in a frequently changing index. This penalty is worse than the very little 
performance gain i get. Different users accidentally using the same query or a 
single user that's actually browsing the result set only happens very 
occasionally. And if i wanted the hit ratio to climb i'd have to increase the 
cache size and warming size to absurd values, only then i might just reach 
about 60% hit ratio.

Cheers,

RE: Solr memory consumption

It could be environment specific (specific of your top command
implementation, OS, etc)

I have on CentOS 2986m virtual memory showing although -Xmx2g

You have 10g virtual although -Xmx6g 

Don't trust it too much... top command may count OS buffers for opened
files, network sockets, JVM DLLs itself, etc (which is outside Java GC
responsibility); additionally to JVM memory... it counts all memory, not
sure... if you don't have big values for 99.9%wa (which means WAIT I/O -
disk swap usage) everyhing is fine...



-Original Message-
From: Denis Kuzmenok 
Sent: May-31-11 4:18 PM
To: solr-user@lucene.apache.org
Subject: Solr memory consumption

I  run  multiple-core  solr with flags: -Xms3g -Xmx6g -D64, but i see this
in top after 6-8 hours and still raising:

17485  test214 10.0g 7.4g 9760 S 308.2 31.3 448:00.75 java
-Xms3g -Xmx6g -D64 -Dsolr.solr.home=/home/test/solr/example/multicore/ -jar
start.jar
  
Are there any ways to limit memory for sure?

Thanks

Re: What's your query result cache's stats?


On May 31, 2011, at 2:02 PM, Markus Jelsma wrote:

 the cumulative hit 
 ratio of the query result cache, it's almost never higher than 50%.
 
 What are your stats? How do you deal with it?

warmupTime : 0 
cumulative_lookups : 394867 
cumulative_hits : 394780 
cumulative_hitratio : 0.99 
cumulative_inserts : 87 
cumulative_evictions : 0 

Of course, that's shortly after I ran a query-intensive, not very creative load 
test (thousands of identical queries of a not very changeable data set). As a 
matter of fact, the numbers say I had exactly one miss after each insert, and 
everything else was a cache hit. Which makes perfect sense, for my (really 
dumb) test case.

 In some cases i have to disable it because of the high warming penalty i get 
 in a frequently changing index. This penalty is worse than the very little 
 performance gain i get. Different users accidentally using the same query or 
 a 
 single user that's actually browsing the result set only happens very 
 occasionally. And if i wanted the hit ratio to climb i'd have to increase the 
 cache size and warming size to absurd values, only then i might just reach 
 about 60% hit ratio.

If you have humans randomizing the query stream, I'm sure you're right. If 
you're convinced your queries are unrelated and variable, why would you expect 
a query cache to help at all?

On the other hand, I actually plan to use my Solr base to drive a UI, where the 
query parameters never change, and the data underneath changes mostly in bursts 
(generally near the end of the work day), so I suspect I'll only see misses 
after a document add, while lookups ten to cluster early in the day. So I 
actually am hoping for a high hit ratio.

-==-
Jack Repenning
Technologist
Codesion Business Unit
CollabNet, Inc.
8000 Marina Boulevard, Suite 600
Brisbane, California 94005
office: +1 650.228.2562
twitter: http://twitter.com/jrep











PGP.sig
Description: This is a digitally signed message part

Re: Obtaining query AST?

2011-05-31 Thread lboutros

Hi Darren,

I think that if I had to get the parsing result, I would create my own
QueryComponent which would create the parser in the 'prepare' function (you
can take a look to the actual QueryComponent class) and instead of resolving
the query in the 'process' function, I would just parse the query and then
it should be possible to serialize the returned Query object to the
response.

Then you could declare this new query component in the solr config file.
And finally, with solrj, you should be able to get the parsed query in the
response, unserialize it and do your stuff ;)

The Query object could be considered as an AST, I think :).

This is how I would start, if I had to do that.

Ludovic. 

 

-
Jouve
France.
--
View this message in context: 
http://lucene.472066.n3.nabble.com/Obtaining-query-AST-tp3007289p3008330.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Obtaining query AST?

2011-05-31 Thread lboutros

Darren,

you can even take a look to the DebugComponent which returns the parsed
query in a string form.
It uses the QueryParsing class to parse the query, you could perhaps do the
same.

Ludovic.

-
Jouve
France.
--
View this message in context: 
http://lucene.472066.n3.nabble.com/Obtaining-query-AST-tp3007289p3008349.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Obtaining query AST?

2011-05-31 Thread Darren Govoni

Ludovic,
   Thank you for this tip, it sounds useful.

Darren

On Tue, 2011-05-31 at 14:38 -0700, lboutros wrote:

 Darren,
 
 you can even take a look to the DebugComponent which returns the parsed
 query in a string form.
 It uses the QueryParsing class to parse the query, you could perhaps do the
 same.
 
 Ludovic.
 
 -
 Jouve
 France.
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Obtaining-query-AST-tp3007289p3008349.html
 Sent from the Solr - User mailing list archive at Nabble.com.

Re: Obtaining query AST?

2011-05-31 Thread Renaud Delbru


Hi,

have a look at the flexible query parser of lucene (contrib package) 
[1]. It provides a framework to easily create different parsing logic. 
You should be able to access the AST and to modify as you want how it 
can be translated into a Lucene query (look at processors and pipeline 
processors).
One time you have your own query parser, then it is straightforward to 
plug it into Solr.


[1] http://lucene.apache.org/java/3_1_0/api/contrib-queryparser/index.html
--
Renaud Delbru

On 31/05/11 19:24, dar...@ontrenet.com wrote:

Hi,
  I want to write my own query expander. It needs to obtain the AST
(abstract syntax tree) of an already parsed query string, navigate to
certain parts of it (words) and make logical phrases of those words by
adding to the AST - where necessary.

This cannot be done to the string because the query logic cannot be
semantically altered. (e.g. AND, OR, paren's etc) so it must be parsed
first.

How can this be done with SolrJ?

thanks for any tips.
Darren

Re: copyField generates multiple values encountered for non multiValued field

2011-05-31 Thread Alexander Kanarsky

Alexander,

I saw the same behavior in 1.4.x with non-multivalued fields when
updating the document in the index (i.e obtaining the doc from the
index, modifying some fields and then adding the document with the same
id back). I do not know what causes this, but it looks like the
copyField logic completely bypasses the multivalueness check and just
adds the value in addition to whatever already there (instead of
replacing the value). So yes, Solr renders itself into incorrect state
then (note that the index is still correct from the Lucene's
standpoint). 

-Alexander

 


On Wed, 2011-05-25 at 16:50 +0200, Alexander Golubowitsch wrote:
 Dear list,
  
 hope somebody can help me understand/avoid this.
  
 I am sending an add request with allowDuplicates=false to a Solr 1.4.1
 instance.
 This is for debugging purposes, so I am sending the exact same data that are
 already stored in Solr's index.
 I am using the PHP PECL libraries, which fail completely in giving me any
 hint on what goes wrong.
 
 Only sending the same add request again gives me a proper
 SolrClientException that hints:
  
 ERROR: [288400] multiple values encountered for non multiValued field
 field2 [fieldvalue, fieldvalue]
 
 The scenario:
 - field1 is implicitly single value, type text, indexed and stored
 - field2 is generated via a copyField directive in schema.xml, implicitly
 single value, type string, indexed and stored
 
 What appears to happen:
 - On the first add (SolrClient::addDocuments(array(SolrInputDocument
 theDocument))), regular fields like field1 get overwritten as intended
 - field2, defined with a copyField, but still single value, gets
 _appended_ instead
 - When I retrieve the updated document in a query and try to add it again,
 it won't let me because of the inconsistent multi-value state
 - The PECL library, in addition, appears to hit some internal exception
 (that it doesn't handle properly) when encountering multiple values for a
 single value field. That gives me zero results querying a set that includes
 the document via PHP, while the document can be retrieved properly, though
 in inconsistent state, any other way.
 
 But: Solr appears to be generating the corrupted state itsself via
 copyField?
 What's going wrong? I'm pretty confused...
 
 Thank you,
  Alex

Odd (i.e. wrong) File Names in 3.1 distro source zip

2011-05-31 Thread Bob Sandiford

Hi, all.

I just downloaded the apache-solr-3.1.0-src.gz file, and unzipped that.  I see 
inside there, a apache-solr-3.1.0-src file, and tried unzipping that.  There 
weren't any errors, but as I look inside the apache-solr-3.0.1-src file, I see 
that not all the java code (for example) ended up being unzipped with a .java 
extension.

For example,  in the path 
apache-solr-3.1.0\lucene\backwards\src\test\org\apache\lucene\analysis\tokenattributes
 I see two files:
TestSimpleAtt100644
TestTermAttri100644

Any ideas?  Is there some specific tool I should be using to expand these?

I'm doing this in Windows XP.

Thanks!

Bob Sandiford | Lead Software Engineer | SirsiDynix
P: 800.288.8020 X6943 | bob.sandif...@sirsidynix.com
www.sirsidynix.comhttp://www.sirsidynix.com/
Join the conversation - you may even get an iPad or Nook out of it!

[cid:image002.jpg@01CC1FC6.2324C620]http://www.facebook.com/SirsiDynixLike us 
on Facebook!

[cid:image004.jpg@01CC1FC6.2324C620]http://twitter.com/#!/SirsiDynixFollow us 
on Twitter!

Solr vs ElasticSearch

2011-05-31 Thread Mark

I've been hearing more and more about ElasticSearch. Can anyone give me 
a rough overview on how these two technologies differ. What are the 
strengths/weaknesses of each. Why would one choose one of the other?


Thanks

Re: Solr vs ElasticSearch

2011-05-31 Thread Jason Rutherglen

Mark,

Nice email address.  I personally have no idea, maybe ask Shay Banon
to post an answer?  I think it's possible to make Solr more elastic,
eg, it's currently difficult to make it move cores between servers
without a lot of manual labor.

Jason

On Tue, May 31, 2011 at 7:33 PM, Mark static.void@gmail.com wrote:
 I've been hearing more and more about ElasticSearch. Can anyone give me a
 rough overview on how these two technologies differ. What are the
 strengths/weaknesses of each. Why would one choose one of the other?

 Thanks

RE: DIH: Exception with Too many connections

Stephan,

Your advice (check the process list) gave me an important clue for my
solution.
I changed my database connection to the slave instead of master, so that I
can use more threads.
Thank you very much!

*** 

François,

My setting is the default value:
max_connections = 151  
max_user_connections = 0 

I will think of changing the max_connections when I increase the number of
cores.
Thanks!

***

Fuad,

So far, I can handle DIH with my setting.
Thanks for letting me know!


Regards,
Tiffany


--
View this message in context: 
http://lucene.472066.n3.nabble.com/DIH-Exception-with-Too-many-connections-tp3005213p3009206.html
Sent from the Solr - User mailing list archive at Nabble.com.

RE: Solr vs ElasticSearch