Re: highlighter not respecting sentence boundary

2012-06-05 Thread abhayd
Any help on this one?

It seems the highlighting component does not always start the snippet at a
sentence boundary. I tried several options.

Has anyone successfully got this working?
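For reference, the approach that usually comes up for sentence-shaped snippets
is the regex fragmenter, set either per request or as a default in
solrconfig.xml. A sketch only (untested; the field name and pattern are just
illustrative):

  ...&hl=true&hl.fl=body&hl.fragmenter=regex&hl.regex.slop=0.5&hl.regex.pattern=\w[^.!?]{20,200}[.!?]

  <!-- solrconfig.xml, inside the highlighting component -->
  <fragmenter name="regex" class="solr.highlight.RegexFragmenter">
    <lst name="defaults">
      <int name="hl.fragsize">160</int>
      <str name="hl.regex.pattern">\w[^.!?]{20,200}[.!?]</str>
    </lst>
  </fragmenter>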

 

--
View this message in context: 
http://lucene.472066.n3.nabble.com/highlighter-not-respecting-sentence-boundry-tp3984327p3987718.html
Sent from the Solr - User mailing list archive at Nabble.com.


random results at specific slots

2012-06-05 Thread srinir
Hi,

I would like to return results sorted by score (desc), but I would like to
insert random results into some predefined slots (let's say 10, 14 and 18).
The reason I want to do that is that I boost click-through-rate-based features
significantly, and I want to give a chance to documents that don't have
enough click-through data. This would help the results stay fresh.

I looked into the Solr code and it looks like I need a custom QueryComponent
where, once the top results are ordered, I can insert some random results at
my predefined slots and then return. I am wondering whether there is any
other way I can achieve the same?
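A minimal client-side sketch of that idea (plain Java; the names and slot
positions are only illustrative, this is not a Solr API):

  import java.util.ArrayList;
  import java.util.List;

  public class SlotMixer {
      // Splice "fresh" documents into fixed 0-based slots of an already-ranked list.
      public static <T> List<T> mix(List<T> ranked, List<T> random, int[] slots) {
          List<T> out = new ArrayList<T>(ranked);
          for (int i = 0; i < slots.length && i < random.size(); i++) {
              if (slots[i] <= out.size()) {
                  out.add(slots[i], random.get(i)); // shifts later results down by one
              }
          }
          return out;
      }
  }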

Thanks
Srini

--
View this message in context: 
http://lucene.472066.n3.nabble.com/random-results-at-specific-slots-tp3987719.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: Add HTTP-header from ResponseWriter

2012-06-05 Thread Markus Jelsma
Thanks, I'll check the issues.
 
-Original message-
 From:Jack Krupansky j...@basetechnology.com
 Sent: Mon 04-Jun-2012 17:19
 To: solr-user@lucene.apache.org
 Subject: Re: Add HTTP-header from ResponseWriter
 
 There is some commented-out code in SolrDispatchFilter.doFilter:
 
 // add info to http headers
 //TODO: See SOLR-232 and SOLR-267.
 /*try {
   NamedList solrRspHeader = solrRsp.getResponseHeader();
   for (int i = 0; i < solrRspHeader.size(); i++) {
     ((javax.servlet.http.HttpServletResponse) response).addHeader(
         ("Solr-" + solrRspHeader.getName(i)),
         String.valueOf(solrRspHeader.getVal(i)));
   }
 } catch (ClassCastException cce) {
   log.log(Level.WARNING, "exception adding response header log information", cce);
 }*/
 
 And there is a comment from Grant on SOLR-267 that "The changes to 
 SolrDispatchFilter can screw up SolrJ when you have explicit=all ... so I'm 
 going to ... comment out #2 and put a TODO: there and someone can address it 
 on SOLR-232."
 
 I did not see a separate Jira issue for arbitrarily setting HTTP headers 
 from response writers.
 
 -- Jack Krupansky
 
 -Original Message- 
 From: Markus Jelsma
 Sent: Monday, June 04, 2012 7:10 AM
 To: solr-user@lucene.apache.org
 Subject: Add HTTP-header from ResponseWriter
 
 Hi,
 
 There has been discussion before on how to add/set an HTTP header from a 
 ResponseWriter. That was about adding the number of found documents for a 
 CSVResponseWriter. We also need to set the number of found documents, in 
 this case for the JSONResponseWriter, or any ResponseWriter. Is there any 
 progress or open issue I am not aware of? Can the current (trunk) response 
 framework already set or add an HTTP header?
 
 Thanks,
 Markus 
 
 


Re: random results at specific slots

2012-06-05 Thread srinir
Another option I could think of is to write a custom component that implements
handleResponses, where I can pick random documents from across shards and
insert them into the ResponseBuilder's resultIds. I would place this
component at the end (or after QueryComponent). Will that work? Is there a
better solution?

--
View this message in context: 
http://lucene.472066.n3.nabble.com/random-results-at-specific-slots-tp3987719p3987725.html
Sent from the Solr - User mailing list archive at Nabble.com.


maxScore always returned

2012-06-05 Thread Markus Jelsma
Hi,

On trunk the maxScore response attribute is always returned even if score is 
not part of fl. Is this intentional?

Thanks,


Re: Multi-words synonyms matching

2012-06-05 Thread O. Klein
The reason multi-word synonyms work better if you use LUCENE_33 is that
Solr then uses the SlowSynonymFilter instead of the FST-based SynonymFilter
(FSTSynonymFilterFactory).

But I don't know if the difference between them is a bug or not. Maybe
someone has more insight?
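For anyone trying to reproduce the comparison: the switch is driven by
luceneMatchVersion in solrconfig.xml (the factory falls back to the old,
non-FST implementation for older match versions). A minimal sketch, with the
usual multi-word synonym settings (file names and options illustrative):

  <!-- solrconfig.xml -->
  <luceneMatchVersion>LUCENE_33</luceneMatchVersion>

  <!-- schema.xml, query analyzer -->
  <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
          ignoreCase="true" expand="true"/>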




Bernd Fehling-2 wrote
 
 Are you sure with LUCENE_33 (Use of BitVector)?
 
 
 Am 31.05.2012 17:20, schrieb O. Klein:
 I have been struggling with this as well and found that using LUCENE_33
 gives
 the best results.
 
 But as it will be deprecated this is no everlasting solution. May
 somebody
 knows one?

 


--
View this message in context: 
http://lucene.472066.n3.nabble.com/Multi-words-synonyms-matching-tp3898950p3987728.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Strip html

2012-06-05 Thread Tigunn

Hello,

I made some progress on my problem.
The index and fieldtype are good:

I had forgotten the copyField of body_strip_html onto text, the defaultSearchField.
A newbie's mistake.
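A generic sketch of that kind of setup (type and field names are illustrative):

  <fieldType name="text_html" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <charFilter class="solr.HTMLStripCharFilterFactory"/>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

  <field name="body_strip_html" type="text_html" indexed="true" stored="true"/>
  <copyField source="body_strip_html" dest="text"/>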

Now, Solr returns all the XML files I want.
But, in PHP, the text isn't displayed for 2 XML files (where the term castor
is split by HTML or XML tags, for example). Look:
http://lucene.472066.n3.nabble.com/file/n3987731/recherche_solr_tei.jpg 

The php file:

Thank you for your help.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Strip-html-tp3987051p3987731.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Multi-words synonyms matching

2012-06-05 Thread Bernd Fehling
Do you have test cases?

What are you sending to your SynonymFilterFactory?

What are you expecting it should return?

What is it returning when setting to Version.LUCENE_33?

What is it returning when setting to Version.LUCENE_36?



Am 05.06.2012 10:56, schrieb O. Klein:
 The reason multi word synonyms work better if you use LUCENE_33 is because
 then Solr uses the SlowSynonymFilter instead of SynonymFilterFactory
 (FSTSynonymFilterFactory).
 
 But I don't know if the difference between them is a bug or not. Maybe
 someone has more insight?
 
 
 
 
 Bernd Fehling-2 wrote

 Are you sure with LUCENE_33 (Use of BitVector)?


 Am 31.05.2012 17:20, schrieb O. Klein:
 I have been struggling with this as well and found that using LUCENE_33
 gives
 the best results.

 But as it will be deprecated this is no everlasting solution. May
 somebody
 knows one?


 
 
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Multi-words-synonyms-matching-tp3898950p3987728.html
 Sent from the Solr - User mailing list archive at Nabble.com.


Search timeout for Solrcloud

2012-06-05 Thread arin_g
Hi, 
We use SolrCloud in production, and we are facing some issues with queries
that take very long, especially deep-paging queries; these queries keep our
servers very busy. I am looking for a way to stop (kill) queries that take
longer than a specific amount of time (say 5 seconds). I checked timeAllowed,
but it doesn't work (the query still runs completely). Also, I noticed that
there are connTimeout and socketTimeout for distributed searches, but I am
not sure whether they kill the thread (I want to save resources by killing the
query, not just returning a timeout). Also, if I could get partial results,
that would be ideal. Any suggestions?

Thanks,
 arin

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Search-timeout-for-Solrcloud-tp3987716.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: maxScore always returned

2012-06-05 Thread darul
Maybe look into your solrconfig.xml file to see whether fl is set by default on
your request handler (<requestHandler> ... </requestHandler>).
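i.e. something along these lines (handler name and field list are only an
example):

  <requestHandler name="/select" class="solr.SearchHandler">
    <lst name="defaults">
      <str name="fl">id,title</str>  <!-- no score here -->
    </lst>
  </requestHandler>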

--
View this message in context: 
http://lucene.472066.n3.nabble.com/maxScore-always-returned-tp3987727p3987733.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Correct way to deal with source data that may include a multivalued field that needs to be used for sorting?

2012-06-05 Thread Erick Erickson
Older versions of Solr didn't really sort correctly on multivalued fields; they
just didn't complain <G>.

Hmmm. Off the top of my head, you can:
1> You don't say what the documents to be indexed are. Are they Solr-style
 documents on disk, or do you process them with, say, a SolrJ program?
 If the latter, you can simply inspect them as you construct them, decide
 which of the multi-valued field values you want to use for sorting,
 and copy that single value into a new field and sort on that.
2> You could write a custom UpdateRequestProcessorFactory/UpdateRequestProcessor
 pair and do the same thing in the processAdd method (a rough sketch follows).
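Something like this, untested and only a sketch of option 2> (the class name is
made up, and it would still need a matching UpdateRequestProcessorFactory
registered in solrconfig.xml):

  import java.io.IOException;
  import java.util.Collection;
  import org.apache.solr.common.SolrInputDocument;
  import org.apache.solr.update.AddUpdateCommand;
  import org.apache.solr.update.processor.UpdateRequestProcessor;

  public class FirstValueOnlyProcessor extends UpdateRequestProcessor {
      private final String field = "f_normalizedValue";

      public FirstValueOnlyProcessor(UpdateRequestProcessor next) {
          super(next);
      }

      @Override
      public void processAdd(AddUpdateCommand cmd) throws IOException {
          SolrInputDocument doc = cmd.getSolrInputDocument();
          Collection<Object> values = doc.getFieldValues(field);
          if (values != null && values.size() > 1) {
              // keep only the first value so the field stays sortable
              doc.setField(field, values.iterator().next());
          }
          super.processAdd(cmd);
      }
  }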

Best
Erick

On Mon, Jun 4, 2012 at 10:17 PM, Aaron Daubman daub...@gmail.com wrote:
 Greetings,

 I have dirty source data where some documents being indexed, although
 unlikely, may contain multivalued fields that are also required for
 sorting. In previous versions of Solr, sorting on this field worked fine
 (possibly because few or no multivalued fields were ever encountered?),
 however, as of 3.6.0, thanks to
 https://issues.apache.org/jira/browse/SOLR-2339 attempting to sort on this
 field now throws an error:

 [2012-06-04 17:20:01,691] ERROR org.apache.solr.common.SolrException
 org.apache.solr.common.SolrException: can not sort on multivalued field:
 f_normalizedValue

 The relevant bits of the schema.xml are:
 <fieldType name="sfloat" class="solr.TrieFloatField" precisionStep="0"
    positionIncrementGap="0" sortMissingLast="true"/>
 <dynamicField name="f_*" type="sfloat" indexed="true" stored="true"
    required="false" multiValued="true"/>

 Assuming that the source documents being indexed cannot be changed (which,
 at least for now, they cannot), what would be the next best way to allow
 for both the possibility of multiple f_normalizedValue fields appearing in
 indexed documents, as well as being able to sort by f_normalizedValue?

 Thank you,
     Aaron


ReadTimeout on commit

2012-06-05 Thread spring
Hi,

I'm indexing documents in batches of 100 docs. Then commit.

Sometimes I get this exception:

org.apache.solr.client.solrj.SolrServerException:
java.net.SocketTimeoutException: Read timed out
    at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:475)
    at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:249)
    at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105)
    at org.apache.solr.client.solrj.SolrServer.commit(SolrServer.java:178)


I found some similar postings on the web, all recommending autocommit. That
is unfortunately not an option for me, because I have to know whether Solr
committed or not.

What is causing this timeout?

I'm using these settings in solrj:

server.setSoTimeout(1000);
  server.setConnectionTimeout(100);
  server.setDefaultMaxConnectionsPerHost(100);
  server.setMaxTotalConnections(100);
  server.setFollowRedirects(false);
  server.setAllowCompression(true);
  server.setMaxRetries(1);

Thank you



Solr instances: many singles vs multi-core

2012-06-05 Thread Christian von Wendt-Jensen
Hi,

I'm running a cluster of Solr servers for an index split into a lot of shards.
Each shard is replicated. The current setup is one Tomcat instance per shard, even
if the Tomcats are running on the same machine.

My question is this:

Would it be more advisable to run one Tomcat per machine with all the shards as
cores, or is the current setup the best, where each shard runs in its own
Tomcat?

As I see it, one Tomcat running multiple cores is better, as it reduces the
overhead of having many Tomcat instances, and there is the possibility of letting
the cores share all available memory according to how much they actually need. In
the one-shard/one-Tomcat scenario, each instance must have its own predefined
memory settings, whether or not it needs more.

Any opinions on the matter?



Med venlig hilsen / Best Regards

Christian von Wendt-Jensen




RE: maxScore always returned

2012-06-05 Thread Markus Jelsma
Hi.

We set fl in the request handler's default without score.

thanks

 
-Original message-
 From:darul daru...@gmail.com
 Sent: Tue 05-Jun-2012 12:05
 To: solr-user@lucene.apache.org
 Subject: Re: maxScore always returned
 
 maybe look into your solrconfig.xml file whether fl not set by default on
 your request handler requestHandler/requestHandler
 
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/maxScore-always-returned-tp3987727p3987733.html
 Sent from the Solr - User mailing list archive at Nabble.com.
 


SolrDispatchFilter, no hits in response NamedList if distrib=true

2012-06-05 Thread Markus Jelsma
Hi,

I'm adding the numFound to the HTTP response header in a custom 
SolrDispatchFilter in the writeResponse() method, similar to the commented-out 
code in doFilter(). This works just fine, but not for distributed requests. I'm 
trying to read hits from the SolrQueryResponse, but it is not there for 
distrib=true requests. Any idea what I'm doing wrong? 

Thanks,
Markus


Re: Search timeout for Solrcloud

2012-06-05 Thread Jason Rutherglen
There isn't a solution for killing long running queries that works.

On Tue, Jun 5, 2012 at 1:34 AM, arin_g arin...@gmail.com wrote:
 Hi,
 We use solrcloud in production, and we are facing some issues with queries
 that take very long specially deep paging queries, these queries keep our
 servers very busy. i am looking for a way to stop (kill) queries taking
 longer than a specific amount of time (say 5 seconds), i checked timeAllowed
 but it doesn't work (again query  runs completely). Also i noticed that
 there are connTimeout and socketTimeout for distributed searches, but i am
 not sure if they kill the thread (i want to save resources by killing the
 query, not just returning a timeout). Also, if i could get partial results
 that would be ideal. Any suggestions?

 Thanks,
  arin

 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Search-timeout-for-Solrcloud-tp3987716.html
 Sent from the Solr - User mailing list archive at Nabble.com.


RE: Search timeout for Solrcloud

2012-06-05 Thread Markus Jelsma
There's an open issue for improving deep paging performance:
https://issues.apache.org/jira/browse/SOLR-1726
 
 
-Original message-
 From:arin_g arin...@gmail.com
 Sent: Tue 05-Jun-2012 12:03
 To: solr-user@lucene.apache.org
 Subject: Search timeout for Solrcloud
 
 Hi, 
 We use solrcloud in production, and we are facing some issues with queries
 that take very long specially deep paging queries, these queries keep our
 servers very busy. i am looking for a way to stop (kill) queries taking
 longer than a specific amount of time (say 5 seconds), i checked timeAllowed
 but it doesn't work (again query  runs completely). Also i noticed that
 there are connTimeout and socketTimeout for distributed searches, but i am
 not sure if they kill the thread (i want to save resources by killing the
 query, not just returning a timeout). Also, if i could get partial results
 that would be ideal. Any suggestions?
 
 Thanks,
  arin
 
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Search-timeout-for-Solrcloud-tp3987716.html
 Sent from the Solr - User mailing list archive at Nabble.com.
 


filtering number and repeated contents

2012-06-05 Thread Mark , N
Is it possible to filter out numbers and disclaimers (repeated content)
while indexing to Solr? These are all surplus information and we do not want to
index them.

I have tried using the boilerpipe algorithm as well, to remove surplus
information from web pages such as navigational elements, templates, and
advertisements. I think it works well, but I am looking forward to seeing if I
could filter out disclaimer information too, mainly in email texts.
-- 
Thanks,

*Nipen Mark *


Is it faster to search over many different fields or one field that combines the values of all those other fields?

2012-06-05 Thread santamaria2
Say I have various categories of 'tags'. I want a keyword search to search
through my index of articles. So I search over:
1) the title.
2) the body
3) about 10 of these tag-categories. Each tag category is multivalued with a
few words per value.

Without considering the effect on 'relevance', and using the standard Lucene
query parser, would it be faster to specify each of these 10 fields in q (q
= cat1:keyword OR cat2:keyword OR ...), or to copyField the stuff in those
10 fields into one combined field?

Or is it such that I should be slapped in the face for even thinking about
performance in this scenario?

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Is-it-faster-to-search-over-many-different-fields-or-one-field-that-combines-the-values-of-all-those-tp3987766.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Strip html

2012-06-05 Thread Tigunn
I resolved my problem:

I had to specify the field to return with my query.


Thanks A LOT for your help !

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Strip-html-tp3987051p398.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Correct way to deal with source data that may include a multivalued field that needs to be used for sorting?

2012-06-05 Thread Jack Krupansky
By saying dirty data you imply that only one of the values is good or 
clean and that the others can be safely discarded/ignored, as opposed to 
true multi-valued data where each value is there for good reason and needs 
to be preserved. In any case, how do you know/decide which value should be 
used for sorting - and did you just get lucky that Solr happened to use the 
right one?


The preferred technique would be to preprocess and clean the data before 
it is handed to Solr or SolrJ, even if the source must remain dirty. 
Barring that, a preprocessor or a custom update processor, certainly.


Please clarify exactly how the data is being fed into Solr.

And if you really do need to preserve the multiple values, simply store them 
in a separate field that is not sorted. An update processor can do this as 
well.


-- Jack Krupansky

-Original Message- 
From: Erick Erickson

Sent: Tuesday, June 05, 2012 6:34 AM
To: solr-user@lucene.apache.org
Subject: Re: Correct way to deal with source data that may include a 
multivalued field that needs to be used for sorting?


Older versions of Solr didn't really sort correctly on multivalued fields, 
they

just didn't complain G.

Hmmm. Off the top of my head, you can:
1 You don't say what the documents to be indexed are. Are they Solr-style
documents on disk or do you process them with, say, a SolrJ program?
If the latter, you can simply inspect them as you construct them and 
decide

which of the multi-valued field values you want to use to sort
and copy that
single value into a new field and sort on that.
2 You could write a custom 
UpdateRequestProcessorFactory/UpdateRequestProcessor

pair and do the same thing in the processAdd method.

Best
Erick

On Mon, Jun 4, 2012 at 10:17 PM, Aaron Daubman daub...@gmail.com wrote:

Greetings,

I have dirty source data where some documents being indexed, although
unlikely, may contain multivalued fields that are also required for
sorting. In previous versions of Solr, sorting on this field worked fine
(possibly because few or no multivalued fields were ever encountered?),
however, as of 3.6.0, thanks to
https://issues.apache.org/jira/browse/SOLR-2339 attempting to sort on this
field now throws an error:

[2012-06-04 17:20:01,691] ERROR org.apache.solr.common.SolrException
org.apache.solr.common.SolrException: can not sort on multivalued field:
f_normalizedValue

The relevant bits of the schema.xml are:
<fieldType name="sfloat" class="solr.TrieFloatField" precisionStep="0"
   positionIncrementGap="0" sortMissingLast="true"/>
<dynamicField name="f_*" type="sfloat" indexed="true" stored="true"
   required="false" multiValued="true"/>

Assuming that the source documents being indexed cannot be changed (which,
at least for now, they cannot), what would be the next best way to allow
for both the possibility of multiple f_normalizedValue fields appearing in
indexed documents, as well as being able to sort by f_normalizedValue?

Thank you,
Aaron 




Re: Can't index sub-entitties in DIH

2012-06-05 Thread Rafael Taboada
Hi Gora,


 Your configuration files look fine. It would seem that something
 is going wrong with the SELECT in Oracle, or with the JDBC
 driver used to access Oracle. Could you try:

* Manually doing the SELECT for the entity, and sub-entity
  to ensure that things are working.


The SELECTs are working OK.



 * Check the JDBC settings.


I'm using the last version of jdbc6.jar for Oracle 11g. It seems the JDBC
setting is OK because Solr brings back data.



 Sorry, I do not have access to Oracle so that I cannot try this
 out myself.

 Also, have you checked the Solr logs for any error messages?
 Finally, I just noticed that you have extra quotes in:
 ...where usuario_idusuario = '${usuario.idusuario}'
 I doubt that is the cause of your problem, but you could try
 removing them.


If I remove quotes, there is an error about this:

SEVERE: Full Import failed:java.lang.RuntimeException:
java.lang.RuntimeException:
org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to
execute query: SELECT nombre FROM tipodocumento WHERE idtipodocumento =
 Processing Document # 1
at
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:264)
at
org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:375)
at
org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:445)
at
org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:426)
Caused by: java.lang.RuntimeException:
org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to
execute query: SELECT nombre FROM tipodocumento WHERE idtipodocumento =
 Processing Document # 1
at
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:621)
at
org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:327)
at
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:225)
... 3 more
Caused by: org.apache.solr.handler.dataimport.DataImportHandlerException:
Unable to execute query: SELECT nombre FROM tipodocumento WHERE
idtipodocumento =  Processing Document # 1
at
org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:72)
at
org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.init(JdbcDataSource.java:253)
at
org.apache.solr.handler.dataimport.JdbcDataSource.getData(JdbcDataSource.java:210)
at
org.apache.solr.handler.dataimport.JdbcDataSource.getData(JdbcDataSource.java:39)
at
org.apache.solr.handler.dataimport.SqlEntityProcessor.initQuery(SqlEntityProcessor.java:59)
at
org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:73)
at
org.apache.solr.handler.dataimport.EntityProcessorWrapper.pullRow(EntityProcessorWrapper.java:330)
at
org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:296)
at
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:683)
at
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:709)
at
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:619)
... 5 more
Caused by: java.sql.SQLSyntaxErrorException: ORA-00936: missing expression

at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:445)
at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:396)
at oracle.jdbc.driver.T4C8Oall.processError(T4C8Oall.java:879)
at oracle.jdbc.driver.T4CTTIfun.receive(T4CTTIfun.java:450)
at oracle.jdbc.driver.T4CTTIfun.doRPC(T4CTTIfun.java:192)
at oracle.jdbc.driver.T4C8Oall.doOALL(T4C8Oall.java:531)
at oracle.jdbc.driver.T4CStatement.doOall8(T4CStatement.java:193)
at oracle.jdbc.driver.T4CStatement.executeForDescribe(T4CStatement.java:873)
at
oracle.jdbc.driver.OracleStatement.executeMaybeDescribe(OracleStatement.java:1167)
at
oracle.jdbc.driver.OracleStatement.doExecuteWithTimeout(OracleStatement.java:1289)
at
oracle.jdbc.driver.OracleStatement.executeInternal(OracleStatement.java:1909)
at oracle.jdbc.driver.OracleStatement.execute(OracleStatement.java:1871)
at
oracle.jdbc.driver.OracleStatementWrapper.execute(OracleStatementWrapper.java:318)
at
org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.init(JdbcDataSource.java:246
My config files using Oracle are:


db-data-config.xml
<dataConfig>
  <dataSource driver="oracle.jdbc.OracleDriver"
      url="jdbc:oracle:thin:@localhost:1521:solr" user="solr" password="solr" />
  <document>
    <entity name="documento" query="SELECT
        iddocumento,nrodocumento,asunto,autor,tipodocumento FROM documento">
      <field column="iddocumento" name="iddocumento" />
      <field column="nrodocumento" name="nrodocumento" />
      <field column="asunto" name="asuntodocumento" />
      <field column="autor" name="autor" />
      <field column="tipodocumento" name="tipodocumento" />
      <entity name="tipodocumento1" query="SELECT nombre FROM
          tipodocumento WHERE idtipodocumento = '${documento.tipodocumento}'">
        <field column="nombre" name="nombre" />
      </entity>
    </entity>

Re: random results at specific slots

2012-06-05 Thread Jack Krupansky
Take a look at query elevation. It may do exactly what you want, but at a 
minimum, it would show you how this kind of thing can be done.


See:
http://wiki.apache.org/solr/QueryElevationComponent

-- Jack Krupansky

-Original Message- 
From: srinir

Sent: Tuesday, June 05, 2012 3:08 AM
To: solr-user@lucene.apache.org
Subject: random results at specific slots

Hi,

I would like to return results sorted by score (desc), but i would like to
insert random results into some predefined slots (lets say 10, 14 and 18).
The reason I want to do that is I boost click-through rate based features
significantly and i want to give a chance to documents which doesnt have
enough click through rate data. This would help the results stay fresh.

I looked into solr code and it looks like i need a custom QueryComponent
where once the top results are ordered, i can insert some random results at
my predefined slots and then return. I am wondering whether there is any
other way I can achieve the same?

Thanks
Srini

--
View this message in context: 
http://lucene.472066.n3.nabble.com/random-results-at-specific-slots-tp3987719.html
Sent from the Solr - User mailing list archive at Nabble.com. 



HypericHQ plugins?

2012-06-05 Thread Paul Libbrecht
Hello SOLR users,

Is there someone who has written plugins for HypericHQ to monitor the very many 
metrics Solr exposes through JMX?
I am a bit of a newbie to JMX, and the Hyperic tutorials aren't simple enough 
for my taste... so it would help me if someone has done it already.

thanks in advance

Paul

RE: Can't index sub-entitties in DIH

2012-06-05 Thread Dyer, James
I successfully use Oracle with DIH, although none of my imports have 
sub-entities.  (Slight difference: I'm on ojdbc5.jar w/10g...).  It may be that 
you have a driver that doesn't play well with DIH in some cases.  You might want 
to try these possible workarounds:

- Rename the columns in SELECT with AS clauses.
- In cases where the columns in SELECT are the same as what you have in schema.xml, 
omit the <field/> tags  (see 
http://wiki.apache.org/solr/DataImportHandler#A_shorter_data-config)

These are shot-in-the-dark guesses.  I wouldn't expect this to matter, but you 
might as well try it.

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311


-Original Message-
From: Rafael Taboada [mailto:kaliman.fore...@gmail.com] 
Sent: Tuesday, June 05, 2012 8:58 AM
To: solr-user@lucene.apache.org
Subject: Re: Can't index sub-entitties in DIH

Hi Gora,


 Your configuration files look fine. It would seem that something
 is going wrong with the SELECT in Oracle, or with the JDBC
 driver used to access Oracle. Could you try:

* Manually doing the SELECT for the entity, and sub-entity
  to ensure that things are working.


The SELECTs are working OK.



 * Check the JDBC settings.


I'm using tha last version of jdbc6.jar for Oracle 11g. It seems JDBC
setting is OK because solr brings data.



 Sorry, I do not have access to Oracle so that I cannot try this
 out myself.

 Also, have you checked the Solr logs for any error messages?
 Finally, I just noticed that you have extra quotes in:
 ...where usuario_idusuario = '${usuario.idusuario}'
 I doubt that is the cause of your problem, but you could try
 removing them.


If I remove quotes, there is an error about this:

SEVERE: Full Import failed:java.lang.RuntimeException:
java.lang.RuntimeException:
org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to
execute query: SELECT nombre FROM tipodocumento WHERE idtipodocumento =
 Processing Document # 1
at
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:264)
at
org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:375)
at
org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:445)
at
org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:426)
Caused by: java.lang.RuntimeException:
org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to
execute query: SELECT nombre FROM tipodocumento WHERE idtipodocumento =
 Processing Document # 1
at
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:621)
at
org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:327)
at
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:225)
... 3 more
Caused by: org.apache.solr.handler.dataimport.DataImportHandlerException:
Unable to execute query: SELECT nombre FROM tipodocumento WHERE
idtipodocumento =  Processing Document # 1
at
org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:72)
at
org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.init(JdbcDataSource.java:253)
at
org.apache.solr.handler.dataimport.JdbcDataSource.getData(JdbcDataSource.java:210)
at
org.apache.solr.handler.dataimport.JdbcDataSource.getData(JdbcDataSource.java:39)
at
org.apache.solr.handler.dataimport.SqlEntityProcessor.initQuery(SqlEntityProcessor.java:59)
at
org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:73)
at
org.apache.solr.handler.dataimport.EntityProcessorWrapper.pullRow(EntityProcessorWrapper.java:330)
at
org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:296)
at
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:683)
at
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:709)
at
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:619)
... 5 more
Caused by: java.sql.SQLSyntaxErrorException: ORA-00936: missing expression

at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:445)
at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:396)
at oracle.jdbc.driver.T4C8Oall.processError(T4C8Oall.java:879)
at oracle.jdbc.driver.T4CTTIfun.receive(T4CTTIfun.java:450)
at oracle.jdbc.driver.T4CTTIfun.doRPC(T4CTTIfun.java:192)
at oracle.jdbc.driver.T4C8Oall.doOALL(T4C8Oall.java:531)
at oracle.jdbc.driver.T4CStatement.doOall8(T4CStatement.java:193)
at oracle.jdbc.driver.T4CStatement.executeForDescribe(T4CStatement.java:873)
at
oracle.jdbc.driver.OracleStatement.executeMaybeDescribe(OracleStatement.java:1167)
at
oracle.jdbc.driver.OracleStatement.doExecuteWithTimeout(OracleStatement.java:1289)
at
oracle.jdbc.driver.OracleStatement.executeInternal(OracleStatement.java:1909)
at oracle.jdbc.driver.OracleStatement.execute(OracleStatement.java:1871)
at
oracle.jdbc.driver.OracleStatementWrapper.execute(OracleStatementWrapper.java:318)
at

Re: score filter

2012-06-05 Thread debdoot
Hello Grant,

I need to frame a query that is a combination of two query parts and I use a
'function' query to prepare the same. Something like:
q={!type=func q.op=AND df=text}product(query($uq,0.0),query($cq,0.1))

where $uq and $cq are two queries.

Now, I want a search result returned only if I get a hit on $uq. So, I
specify the default value of the $uq query as 0.0 so that the final score is
zero in cases where $uq doesn't record a hit. Even though the scoring works
as expected (i.e., documents that don't match $uq have a score of zero), all
the documents are still returned as search results. Is there a way to filter out
search results that have a score of zero?
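One thing that might work (an untested sketch, reusing the parameter names from
the query above): add a frange filter on the $uq part so that documents which
don't match it are excluded rather than merely scored zero, e.g.

  q={!type=func q.op=AND df=text}product(query($uq,0.0),query($cq,0.1))
  &fq={!frange l=0 incl=false}query($uq)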

Thanks for your help,

Debdoot

--
View this message in context: 
http://lucene.472066.n3.nabble.com/score-filter-tp493438p3987791.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Can't index sub-entitties in DIH

2012-06-05 Thread Rahul Warawdekar
Hi,

One of the possibilities for this kind of issue may be the case
sensitivity of column names in Oracle.
Can you apply a transformer and check the entity map, which actually
contains the keys and their values?
Also, please try specifying upper-case field names for Oracle and see if
that works,
something like:
<entity name="tipodocumento" query="SELECT NOMBRE FROM
    tipodocumento WHERE IDTIPODOCUMENTO = '${documento.TIPODOCUMENTO}'">
  <field column="NOMBRE" name="nombre" />
</entity>

On Tue, Jun 5, 2012 at 9:57 AM, Rafael Taboada kaliman.fore...@gmail.comwrote:

 Hi Gora,


  Your configuration files look fine. It would seem that something
  is going wrong with the SELECT in Oracle, or with the JDBC
  driver used to access Oracle. Could you try:

 * Manually doing the SELECT for the entity, and sub-entity
   to ensure that things are working.
 

 The SELECTs are working OK.



  * Check the JDBC settings.
 

 I'm using tha last version of jdbc6.jar for Oracle 11g. It seems JDBC
 setting is OK because solr brings data.



  Sorry, I do not have access to Oracle so that I cannot try this
  out myself.
 
  Also, have you checked the Solr logs for any error messages?
  Finally, I just noticed that you have extra quotes in:
  ...where usuario_idusuario = '${usuario.idusuario}'
  I doubt that is the cause of your problem, but you could try
  removing them.
 

 If I remove quotes, there is an error about this:

 SEVERE: Full Import failed:java.lang.RuntimeException:
 java.lang.RuntimeException:
 org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to
 execute query: SELECT nombre FROM tipodocumento WHERE idtipodocumento =
  Processing Document # 1
 at
 org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:264)
 at

 org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:375)
 at

 org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:445)
 at

 org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:426)
 Caused by: java.lang.RuntimeException:
 org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to
 execute query: SELECT nombre FROM tipodocumento WHERE idtipodocumento =
  Processing Document # 1
 at

 org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:621)
 at

 org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:327)
 at
 org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:225)
 ... 3 more
 Caused by: org.apache.solr.handler.dataimport.DataImportHandlerException:
 Unable to execute query: SELECT nombre FROM tipodocumento WHERE
 idtipodocumento =  Processing Document # 1
 at

 org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:72)
 at

 org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.init(JdbcDataSource.java:253)
 at

 org.apache.solr.handler.dataimport.JdbcDataSource.getData(JdbcDataSource.java:210)
 at

 org.apache.solr.handler.dataimport.JdbcDataSource.getData(JdbcDataSource.java:39)
 at

 org.apache.solr.handler.dataimport.SqlEntityProcessor.initQuery(SqlEntityProcessor.java:59)
 at

 org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:73)
 at

 org.apache.solr.handler.dataimport.EntityProcessorWrapper.pullRow(EntityProcessorWrapper.java:330)
 at

 org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:296)
 at

 org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:683)
 at

 org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:709)
 at

 org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:619)
 ... 5 more
 Caused by: java.sql.SQLSyntaxErrorException: ORA-00936: missing expression

 at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:445)
 at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:396)
 at oracle.jdbc.driver.T4C8Oall.processError(T4C8Oall.java:879)
 at oracle.jdbc.driver.T4CTTIfun.receive(T4CTTIfun.java:450)
 at oracle.jdbc.driver.T4CTTIfun.doRPC(T4CTTIfun.java:192)
 at oracle.jdbc.driver.T4C8Oall.doOALL(T4C8Oall.java:531)
 at oracle.jdbc.driver.T4CStatement.doOall8(T4CStatement.java:193)
 at
 oracle.jdbc.driver.T4CStatement.executeForDescribe(T4CStatement.java:873)
 at

 oracle.jdbc.driver.OracleStatement.executeMaybeDescribe(OracleStatement.java:1167)
 at

 oracle.jdbc.driver.OracleStatement.doExecuteWithTimeout(OracleStatement.java:1289)
 at

 oracle.jdbc.driver.OracleStatement.executeInternal(OracleStatement.java:1909)
 at oracle.jdbc.driver.OracleStatement.execute(OracleStatement.java:1871)
 at

 oracle.jdbc.driver.OracleStatementWrapper.execute(OracleStatementWrapper.java:318)
 at

 org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.init(JdbcDataSource.java:246
 My config files using Oracle are:


 db-data-config.xml
 dataConfig

Re: Is it faster to search over many different fields or one field that combines the values of all those other fields?

2012-06-05 Thread Michael Della Bitta
I don't have the answer to your question, but I certainly don't think
anybody should be slapped in the face for asking a question!

Michael Della Bitta


Appinions, Inc. -- Where Influence Isn’t a Game.
http://www.appinions.com


On Tue, Jun 5, 2012 at 8:50 AM, santamaria2 aravinda@contify.com wrote:
 Say I have various categories of 'tags'. I want a keyword search to search
 through my index of articles. So I search over:
 1) the title.
 2) the body
 3) about 10 of these tag-categories. Each tag category is multivalued with a
 few words per value.

 Without considering the affect on 'relevance', and using the standard lucene
 query parser, would it be faster to specify each of these 10 fields in q (q
 = cat1:keyword OR cat2:keyword OR ... ), or to copyfield the stuff in those
 10 fields into one combined field?

 Or is it such that I should be slapped in the face for even thinking about
 performance in this scenario?

 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Is-it-faster-to-search-over-many-different-fields-or-one-field-that-combines-the-values-of-all-those-tp3987766.html
 Sent from the Solr - User mailing list archive at Nabble.com.


Re: Can't index sub-entitties in DIH

2012-06-05 Thread Gora Mohanty
Hi,

Sorry, I am stumped, and cannot help further without
access to Oracle. Please disregard the bit about the
quotes: I was reading a single quote followed by a
double quote as three single quotes. There was no
issue there.

Since your configurations for Oracle, and mysql are
different, are you using different Solr cores/instances,
or making sure to restart Solr in between configuration
changes?

Regards,
Gora


Re: Can't index sub-entitties in DIH

2012-06-05 Thread Rafael Taboada
Hi James.

Thanks for your advice.

As I said, alias works for me. I use joins instead of sub-entities...
Heavily...
These config files work for me...

db-data-config.xml
<dataConfig>
   <dataSource type="JdbcDataSource" name="jdbc"
       driver="oracle.jdbc.OracleDriver" url="jdbc:oracle:thin:@localhost:1521:solr"
       user="solr" password="solr" />
   <document>
      <entity name="documento" query="SELECT
          d.iddocumento, d.nrodocumento, d.asunto AS asuntodocumento, d.autor,
          d.estado AS estadodocumento, d.fechacreacion AS fechacreaciondocumento,
          td.idtipodocumento, td.nombre AS nombretipodocumento,
          e.idexpediente, e.nroexpediente, e.nrointerno, e.asunto AS asuntoexpediente,
          e.clienterazonsocial, e.clienteapellidomaterno, e.clienteapellidopaterno,
          e.clientenombres, e.clientedireccionprincipal,
          e.estado AS estadoexpediente, e.fechacreacion AS fechacreacionexpediente,
          p.idproceso, p.nombre AS nombreproceso,
          o.idusuario AS idpropietario, o.nombres AS nombrespropietario,
          o.apellidos AS apellidospropietario,
          u.idunidad, u.nombre AS nombreunidad
          FROM documento d
          LEFT OUTER JOIN usuario o ON (d.propietario = o.idusuario)
          LEFT OUTER JOIN unidad u ON (o.idunidad = u.idunidad)
          LEFT OUTER JOIN tipodocumento td ON (d.tipodocumento = td.idtipodocumento)
          LEFT OUTER JOIN expediente e ON (d.expediente = e.idexpediente)
          LEFT OUTER JOIN proceso p ON (e.proceso = p.idproceso)">
         <field column="iddocumento" name="iddocumento" />
         <field column="nrodocumento" name="nrodocumento" />
         <field column="asuntodocumento" name="asuntodocumento" />
         <field column="autor" name="autor" />
         <field column="estadodocumento" name="estadodocumento" />
         <field column="fechacreaciondocumento" name="fechacreaciondocumento" />
         <field column="idtipodocumento" name="idtipodocumento" />
         <field column="nombretipodocumento" name="nombretipodocumento" />
         <field column="idexpediente" name="idexpediente" />
         <field column="nroexpediente" name="nroexpediente" />
         <field column="nrointerno" name="nrointerno" />
         <field column="asuntoexpediente" name="asuntoexpediente" />
         <field column="clienterazonsocial" name="clienterazonsocial" />
         <field column="clienteapellidomaterno" name="clienteapellidomaterno" />
         <field column="clienteapellidopaterno" name="clienteapellidopaterno" />
         <field column="clientenombres" name="clientenombres" />
         <field column="clientedireccionprincipal" name="clientedireccionprincipal" />
         <field column="estadoexpediente" name="estadoexpediente" />
         <field column="fechacreacionexpediente" name="fechacreacionexpediente" />
         <field column="idproceso" name="idproceso" />
         <field column="nombreproceso" name="nombreproceso" />
         <field column="idpropietario" name="idpropietario" />
         <field column="nombrespropietario" name="nombrespropietario" />
         <field column="apellidospropietario" name="apellidospropietario" />
         <field column="idunidad" name="idunidad" />
         <field column="nombreunidad" name="nombreunidad" />
      </entity>
   </document>
</dataConfig>

schema.xml
<?xml version="1.0" ?>
<schema name="siged" version="1.1">
   <types>
      <fieldtype name="string" class="solr.StrField" sortMissingLast="true"
          omitNorms="true" />
      <fieldType name="tint" class="solr.TrieIntField" precisionStep="8"
          positionIncrementGap="0" />
      <fieldType name="tdate" class="solr.TrieDateField" precisionStep="6"
          positionIncrementGap="0" />
      <fieldType name="text_general" class="solr.TextField"
          positionIncrementGap="100">
         <analyzer type="index">
            <tokenizer class="solr.StandardTokenizerFactory"/>
            <filter class="solr.StopFilterFactory" ignoreCase="true"
                words="stopwords.txt" enablePositionIncrements="true" />
            <filter class="solr.LowerCaseFilterFactory"/>
         </analyzer>
         <analyzer type="query">
            <tokenizer class="solr.StandardTokenizerFactory"/>
            <filter class="solr.StopFilterFactory" ignoreCase="true"
                words="stopwords.txt" enablePositionIncrements="true" />
            <filter class="solr.SynonymFilterFactory"
                synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
            <filter class="solr.LowerCaseFilterFactory"/>
         </analyzer>
      </fieldType>
   </types>

   <fields>
      <!-- SIGED Documentos -->
      <field name="iddocumento" type="tint" indexed="true" stored="true"
          required="true" />
      <field name="nrodocumento" type="text_general" indexed="true"
          stored="true" />
      <field name="asuntodocumento" type="text_general" indexed="true"
          stored="true" />
      <field name="autor" type="text_general" indexed="true" stored="true" />
      <field name="estadodocumento" type="string" indexed="true"
          stored="true" />
      <field name="fechacreaciondocumento" type="string" indexed="true"
          stored="true" />
      <field name="idtipodocumento" type="tint" indexed="true"
          stored="true" />
      <field name="nombretipodocumento" type="text_general" indexed="true"
          stored="true" />
      <field name="idexpediente" type="tint" indexed="true" stored="true" />
      <field name="nroexpediente" type="text_general" indexed="true"
          stored="true" />
      <field name="nrointerno" type="text_general" indexed="true"
          stored="true" />
      <field name="asuntoexpediente" type="text_general" indexed="true"

Re: filtering number and repeated contents

2012-06-05 Thread Jack Krupansky
My (very limited) understanding of boilerpipe in Tika is that it strips 
out short text, which is great for all the menu and navigation text, but 
the typical disclaimer at the bottom of an email is not very short and 
frequently can be longer than the email message body itself. You may have to 
resort to a custom update processor that is programmed with some disclaimer 
signature text strings to be removed from field values.
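As a rough illustration of that idea (plain Java, not a Tika or Solr API; the
signature strings and the regex are obviously just placeholders):

  import java.util.Arrays;
  import java.util.List;

  public class DisclaimerStripper {
      // Known boilerplate fragments to cut; in practice these would be configurable.
      private static final List<String> SIGNATURES = Arrays.asList(
              "This e-mail and any attachments are confidential",
              "If you are not the intended recipient");

      public static String clean(String text) {
          for (String sig : SIGNATURES) {
              int pos = text.indexOf(sig);
              if (pos >= 0) {
                  text = text.substring(0, pos); // drop the disclaimer and everything after it
              }
          }
          // also drop standalone numbers, per the original question
          return text.replaceAll("\\b\\d+\\b", " ").trim();
      }
  }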


-- Jack Krupansky

-Original Message- 
From: Mark , N

Sent: Tuesday, June 05, 2012 8:28 AM
To: solr-user@lucene.apache.org
Subject: filtering number and repeated contents

Is it possible to filter out numbers and disclaimer ( repeated contents)
while indexing to SOLR?
These are all surplus information and do not want to index it

I have tried using boilerpipe algorithm as well to remove surplus
infromation from web pages such as navigational elements, templates, and
advertisements , I think it works well but looking forward to see If I
could filter out  disclaimer information too mainly in email texts.
--
Thanks,

*Nipen Mark * 



Re: Can't index sub-entitties in DIH

2012-06-05 Thread Rafael Taboada
Hi Gora,

Yes, I restart Solr for each change I do.

Thanks for your help...

A small question: does DIH work well with an Oracle database, using all the
features it offers?


On Tue, Jun 5, 2012 at 9:32 AM, Gora Mohanty g...@mimirtech.com wrote:

 Hi,

 Sorry, I am stumped, and cannot help further without
 access to Oracle. Please disregard the bit about the
 quotes: I was reading a single quote followed by a
 double quote as three single quotes. There was no
 issue there.

 Since your configurations for Oracle, and mysql are
 different, are you using different Solr cores/instances,
 or making sure to restart Solr in between configuration
 changes?

 Regards,
 Gora




-- 
Rafael Taboada

/*
 * Phone  992 741 026
 */


Re: Can't index sub-entitties in DIH

2012-06-05 Thread Gora Mohanty
On 5 June 2012 20:05, Rafael Taboada kaliman.fore...@gmail.com wrote:
 Hi James.

 Thanks for your advice.

 As I said, alias works for me. I use joins instead of sub-entities...
 Heavily...
 These config files work for me...
[...]

How about NULL values in the column that
you are doing a left outer join on? Cannot
test this right now, but I believe that a left
outer join behaves differently from a DIH
entity/sub-entity when it comes to NULLs.

Regards,
Gora


Re: Can't index sub-entitties in DIH

2012-06-05 Thread Gora Mohanty
On 5 June 2012 20:08, Rafael Taboada kaliman.fore...@gmail.com wrote:
 Hi Gora,

 Yes, I restart Solr for each change I do.

 Thanks for your help...

 An small question Is DIH work well with Oracle database? Using all the
 features It can do?

Unfortunately, I have never used DIH with Oracle. However,
this should be a simple enough use case that it should just
work. I think that we must be missing something obvious.

For the sub-entity with Oracle case, what message do you
get when the data-import concludes? Is the number of
indexed documents correct? Are there any relevant
messages in the Solr log files?

Regards,
Gora


Re: Is it faster to search over many different fields or one field that combines the values of all those other fields?

2012-06-05 Thread Jack Krupansky
There may be a raw performance advantage to having all values in a single 
combined field, but then you lose the opportunity to boost title and tag 
field hits.


With the extended dismax query parser you have the ability to specify the 
field list in the qf request parameter so that the query can simply be the 
keywords and operators without all of the extra OR operators. qf also lets 
you specify the boost for each field.
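For example (field names and boosts are only illustrative):

  q=chocolate&defType=edismax&qf=title^3 body cat1 cat2 cat3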


-- Jack Krupansky

-Original Message- 
From: santamaria2

Sent: Tuesday, June 05, 2012 8:50 AM
To: solr-user@lucene.apache.org
Subject: Is it faster to search over many different fields or one field that 
combines the values of all those other fields?


Say I have various categories of 'tags'. I want a keyword search to search
through my index of articles. So I search over:
1) the title.
2) the body
3) about 10 of these tag-categories. Each tag category is multivalued with a
few words per value.

Without considering the affect on 'relevance', and using the standard lucene
query parser, would it be faster to specify each of these 10 fields in q (q
= cat1:keyword OR cat2:keyword OR ... ), or to copyfield the stuff in those
10 fields into one combined field?

Or is it such that I should be slapped in the face for even thinking about
performance in this scenario?

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Is-it-faster-to-search-over-many-different-fields-or-one-field-that-combines-the-values-of-all-those-tp3987766.html
Sent from the Solr - User mailing list archive at Nabble.com. 



Re: London OSS search social - meetup 6th June

2012-06-05 Thread Richard Marr
Quick reminder, we're meeting at The Plough in Bloomsbury tomorrow night. 
Details and RSVP on the meetup page:

http://www.meetup.com/london-search-social/events/65873032/

--
Richard Marr

On 3 Jun 2012, at 00:29, Richard Marr richard.m...@gmail.com wrote:

 
 Apologies for the short notice guys, we're meeting up at The Plough in 
 Bloomsbury on Wednesday 6th June.
 
 As usual the format is open and there's a healthy mix of experience and 
 backgrounds. Please come and share wisdom, ask questions, geek out, etc. in 
 the presence of beverages.
 
 -- 
 Richard Marr


Re: Correct way to deal with source data that may include a multivalued field that needs to be used for sorting?

2012-06-05 Thread Aaron Daubman
Thanks for the responses,

By saying dirty data you imply that only one of the values is good or
 clean and that the others can be safely discarded/ignored, as opposed to
 true multi-valued data where each value is there for good reason and needs
 to be preserved. In any case, how do you know/decide which value should be
 used for sorting - and did you just get lucky that Solr happened to use the
 right one?


I haven't gone back and checked the old version's docs where this was
working, however, I suspect that either the field never ended up
appearing in docs more than once, or if it did, it had the same value
repeated...

The real issue here is that the docs are created externally, and the
producer won't (yet) guarantee that fields that should appear once will
actually appear once. Because of this, I don't want to declare the field as
multiValued=false as I don't want to cause indexing errors. It would be
great for me (and apparently many others after searching) if there were an
option as simple as forceSingleValued=true - where some deterministic
behavior such as use first field encountered, ignore all others, would
occur.


The preferred technique would be the preprocess and clean the data before
 it is handed to Solr or SolrJ, even if the source must remain dirty.
 Baring that a preprocessor or a custom update processor certainly.


I could write preprocessors (this is really what will eventually happen
when the producer cleans their data),  custom processors, etc... however,
for something this simple it would be great not to be producing more code
that would have to be maintained.



 Please clarify exactly how the data is being fed into Solr.


 I am using generic code to read from a key/value store and compose
documents. This is another reason fixing the data at this point would not
be desirable, the currently generic code would need to be made specific to
look for these particular fields and then coerce them to single values...

Thanks again,
  Aaron


Re: Is it faster to search over many different fields or one field that combines the values of all those other fields?

2012-06-05 Thread Mikhail Khludnev
IIRC, the Lucene in Action book circles back to this point in almost every
chapter: a multi-field query is faster.

On Tue, Jun 5, 2012 at 7:04 PM, Jack Krupansky j...@basetechnology.comwrote:

 There may be a raw performance advantage to having all values in a single
 combined field, but then you loose the opportunity to boost title and tag
 field hits.

 With the extended dismax query parser you have the ability to specify the
 field list in the qf request parameter so that the query can simply be
 the keywords and operators without all of the extra OR operators. qf also
 lets you specify the boost for each field.

 -- Jack Krupansky

 -Original Message- From: santamaria2
 Sent: Tuesday, June 05, 2012 8:50 AM
 To: solr-user@lucene.apache.org
 Subject: Is it faster to search over many different fields or one field
 that combines the values of all those other fields?


 Say I have various categories of 'tags'. I want a keyword search to search
 through my index of articles. So I search over:
 1) the title.
 2) the body
 3) about 10 of these tag-categories. Each tag category is multivalued with
 a
 few words per value.

 Without considering the affect on 'relevance', and using the standard
 lucene
 query parser, would it be faster to specify each of these 10 fields in q (q
 = cat1:keyword OR cat2:keyword OR ... ), or to copyfield the stuff in those
 10 fields into one combined field?

 Or is it such that I should be slapped in the face for even thinking about
 performance in this scenario?

 --
 View this message in context: http://lucene.472066.n3.**
 nabble.com/Is-it-faster-to-**search-over-many-different-**
 fields-or-one-field-that-**combines-the-values-of-all-**
 those-tp3987766.htmlhttp://lucene.472066.n3.nabble.com/Is-it-faster-to-search-over-many-different-fields-or-one-field-that-combines-the-values-of-all-those-tp3987766.html
 Sent from the Solr - User mailing list archive at Nabble.com.




-- 
Sincerely yours
Mikhail Khludnev
Tech Lead
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Re: Is it faster to search over many different fields or one field that combines the values of all those other fields?

2012-06-05 Thread Gora Mohanty
On 5 June 2012 22:05, Mikhail Khludnev mkhlud...@griddynamics.com wrote:
 IRC, Lucene in Action book loops around this point almost every chapter:
 multifield query is faster.
[...]

Surely this is dependent on the type and volume of one's
data? As with many issues, isn't the answer that it depends,
i.e., one should prototype and have objective measures on
one's own data-sets?

Would love to be educated otherwise.

Regards,
Gora

P.S. Have to get that book.


Re: Search timeout for Solrcloud

2012-06-05 Thread Jack Krupansky
I'm curious... how deep is it that is becoming problematic? Tens of pages, 
hundreds, thousands, millions?


And when you say deep paging, are you incrementing through all pages down to 
the depth or gapping to some very large depth outright? If the former, I 
am wondering if the Solr cache is building up with all those previous 
results.


And is it that the time is simply moderately beyond expectations (e.g. 10 or 
30 seconds or a minute compared to 1 second), or... are we talking about a 
situation where a core is terminally thrashing with garbage collection/OOM 
issues?


-- Jack Krupansky

-Original Message- 
From: arin_g

Sent: Tuesday, June 05, 2012 1:34 AM
To: solr-user@lucene.apache.org
Subject: Search timeout for Solrcloud

Hi,
We use solrcloud in production, and we are facing some issues with queries
that take very long specially deep paging queries, these queries keep our
servers very busy. i am looking for a way to stop (kill) queries taking
longer than a specific amount of time (say 5 seconds), i checked timeAllowed
but it doesn't work (again query  runs completely). Also i noticed that
there are connTimeout and socketTimeout for distributed searches, but i am
not sure if they kill the thread (i want to save resources by killing the
query, not just returning a timeout). Also, if i could get partial results
that would be ideal. Any suggestions?

Thanks,
arin

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Search-timeout-for-Solrcloud-tp3987716.html
Sent from the Solr - User mailing list archive at Nabble.com. 



Boost by Nested Query / Join Needed?

2012-06-05 Thread naleiden
Hi,

First off, I'm about a week into all things Solr, and still trying to figure
out how to fit my relational-shaped peg through a denormalized hole. Please
forgive my ignorance below :-D

I need to store a one-to-N type relationship, and to boost by a
related field.

Let's say I want to index a number of different types of candy, and also a
customer's preference for each type of candy (which I index/update when a
customer makes a purchase), and then boost by that preference on search.

Here is my pared-down attempt at a denormalized schema:

<!-- Common Fields -->
<field name="id" type="string" indexed="true" stored="true" required="true" />
<field name="type" type="string" indexed="true" stored="true" required="true" />

<!-- Fields for 'candy' -->
<field name="name" type="text_general" indexed="true" stored="true"/>
<field name="description" type="text_general" indexed="true" stored="true"/>

<!-- Fields for Customer-Candy Preference ('preference') -->
<field name="user" type="integer" indexed="true" stored="true"/>
<field name="candy" type="integer" indexed="true" stored="true"/>
<field name="weight" type="integer" indexed="true" stored="true" default="0"/>

I am indexing 'candy' and 'preferences' separately, and when indexing one, I
leave the fields of the other empty (with the exception of the required 'id'
and 'type').

Ignoring the query score, this is effectively what I'm looking to do in SQL:

SELECT candy.id, candy.name, candy.description FROM candy
LEFT JOIN preference ON (preference.candy = candy.id AND preference.customer
= 'someCustomerID')
// Where some match is made on query against candy.name or candy.description
ORDER BY preference.weight DESC

My questions are:

1.) Am I making any assumptions with respect to what are effectively
different document types in the schema that will not scale well? I don't
think I want to be duplicating each 'candy' entry for every customer, or
maybe that wouldn't be such a big deal in Solr.

2.) Can someone point me in the right direction on how to perform this type
of boost in a Solr query?
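The only workaround I can picture so far (purely a sketch, ignoring the weight
for the moment; the field and customer names below are just placeholders I made
up) is to denormalize the preference onto each 'candy' document as a multiValued
field and boost on it at query time:

<!-- on each 'candy' document: customers who have shown a preference for it -->
<field name="preferred_by" type="string" indexed="true" stored="false" multiValued="true" />

and then query with dismax and a boost query, e.g.:

q=chocolate&defType=dismax&qf=name+description&bq=preferred_by:someCustomerID^5

But that would mean re-indexing a candy document every time a preference
changes, which feels like the duplication I was hoping to avoid in question 1.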

Thanks in advance,
Nick


--
View this message in context: 
http://lucene.472066.n3.nabble.com/Boost-by-Nested-Query-Join-Needed-tp3987818.html
Sent from the Solr - User mailing list archive at Nabble.com.


Is FileFloatSource's WeakHashMap cache only cleaned by GC?

2012-06-05 Thread Gregg Donovan
We've encountered GC spikes at Etsy after adding new
ExternalFileFields a decent number of times. I was always a little
confused by this behavior -- isn't it just one big float[]? why does
that cause problems for the GC? -- but looking at the FileFloatSource
code a little more carefully, I wonder if this is due to using a
WeakHashMap that is only cleaned by GC or manual invocation of a
request handler.

FileFloatSource stores a WeakHashMap keyed by IndexReader, with float[]
(or CreationPlaceholder) values. In the code[1], it mentions that the
implementation is modeled after the FieldCache implementation.
However, the FieldCacheImpl adds listeners for IndexReader close
events and uses those to purge its caches. [2] Should we be doing the
same in FileFloatSource?

Here's a mostly untested patch[3] with a possible implementation.
There are probably better ways to do it (e.g. I don't love using
another WeakHashMap), but I found it tough to hook into the
IndexReader lifecycle without a) relying on classes other than
FileFloatSource b) changing the public API of FIleFloatSource or c)
changing the implementation too much.

There is a RequestHandler inside of FileFloatSource
(ReloadCacheRequestHandler) that can be used to clear the cache
entirely[4], but this is sub-optimal for us for a few reasons:

--It clears the entire cache. ExternalFileFields often take some
non-trivial time to load and we prefer to do so during SolrCore
warmups. Clearing the entire cache while serving traffic would likely
cause user-facing requests to timeout.
--It forces an extra commit with its consequent cache cycling, etc..

I'm thinking of ways to monitor the size of FileFloatSource's cache to
track its size against GC pause times, but it seems tricky because
even calling WeakHashMap#size() has side-effects. Any ideas?

Overall, what do you think? Does relying on GC to clean this cache
make sense as a possible cause of GC spikiness? If so, does the patch
[3] look like a decent approach?

Thanks!

--Gregg

[1] https://github.com/apache/lucene-solr/blob/a3914cb5c0243913b827762db2d616ad7cc6801d/solr/core/src/java/org/apache/solr/search/function/FileFloatSource.java#L135
[2] https://github.com/apache/lucene-solr/blob/1c0eee5c5cdfddcc715369dad9d35c81027bddca/lucene/core/src/java/org/apache/lucene/search/FieldCacheImpl.java#L166
[3] https://gist.github.com/2876371
[4] https://github.com/apache/lucene-solr/blob/a3914cb5c0243913b827762db2d616ad7cc6801d/solr/core/src/java/org/apache/solr/search/function/FileFloatSource.java#L310


Re: Search timeout for Solrcloud

2012-06-05 Thread arin_g
For example, when we set the start parameter to 1000, 2000 or higher (page
100, 200, ...), it takes very long (20 or 30 seconds, sometimes even 100
seconds).
This usually happens when there is a big jump between pages, mostly from
web crawlers (when they crawl the last page link on our website).


--
View this message in context: 
http://lucene.472066.n3.nabble.com/Search-timeout-for-Solrcloud-tp3987716p3987834.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr 4.0 Clean Commit for production use

2012-06-05 Thread Chris Hostetter

: Hey guys, I am trying to upgrade to Solr 4.0. Do you know where I get a clean

Clarification: 4.0 does not exist yet.  What does exist is the 4x branch, 
from which you can build snapshots that should be very similar to what 
will eventually be released as 4.0.

: http://svn.apache.org/repos/asf/lucene/dev/trunk/solr/ 
: and it looks like they have migrated to 5.0. From the link below it looks

Correct, a 4x branch has been created off of trunk in anticipation of the 
4.0 release process, so that more aggressive experimental work beyond the 
scope of 4.0 can continue on trunk.

I've updated the wiki to try and outline this based on the discussion from 
previous dev@lucene threads...

https://wiki.apache.org/solr/Solr4.0

: My second question would be: Are there any known compatibility
: issues/restrictions with previous versions of Lucene? (I just want to make
: sure I can still use my data indexed with previous Solr/Lucene versions). 

The best thing to do is review the Upgrade instructions in CHANGES.txt, 
however those instructions should not be considered final until the 
final release is voted on -- there may be mistakes/omissions, but the 
best way to help find those mistakes/omissions is for users to help try 
out nightly builds and point them out when you notice them.



-Hoss


Re: Solr 4.0 Clean Commit for production use

2012-06-05 Thread Jack Krupansky

The NightlyBuilds wiki page still says trunk is 4.x even though it is now 5.x.
See:
https://wiki.apache.org/solr/NightlyBuilds

AFAIK, there isn't a 4.x nightly build running. (Is that going to happen 
soon??)


You can checkout the repo for the 4x branch:
http://svn.apache.org/repos/asf/lucene/dev/branches/branch_4x
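e.g., with something like (the local directory name is just an example):

svn co http://svn.apache.org/repos/asf/lucene/dev/branches/branch_4x lucene_solr_4x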

My (limited) understanding is that 4.x can read and write 3.x indexes, but 
any new/modified indexes will be incompatible with 3.x. And you have to be 
careful upgrading master/slave configurations, as noted in CHANGES.txt.


-- Jack Krupansky

-Original Message- 
From: Chris Hostetter

Sent: Tuesday, June 05, 2012 5:37 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr 4.0 Clean Commit for production use


: Hey guys, I am trying to upgrade to Solr 4.0. Do you know where I get a 
clean


Clarification: 4.0 does not exist yet.  What does exist is the 4x branch,
from which you can build snapshots that should be very similar to what
will eventually be released as 4.0.

: http://svn.apache.org/repos/asf/lucene/dev/trunk/solr/
: and it looks like they have migrated to 5.0. From the link below it looks

Correct, a 4x branch has been created off of trunk in anticipation of the
4.0 release process, so that more aggressive experimental work beyond the
scope of 4.0 can continue on trunk.

I've updated the wiki to try and outline this based on the discussion from
previous dev@lucene threads...

https://wiki.apache.org/solr/Solr4.0

: My second question would be: Are there any known compatibility
: issues/restrictions with previous versions of Lucene? (I just want to make
: sure I can still use my data indexed with previous Solr/Lucene versions).

The best thing to do is review the Upgrade instructions in CHANGES.txt,
however those instructions should not be considered final until the
final release is voted on -- there may be mistakes/omissions, but the
best way to help find those mistakes/omissions is for users to help try
out nightly builds and point them out when you notice them.



-Hoss 



Re: Solr 4.0 Clean Commit for production use

2012-06-05 Thread Chris Hostetter

: The Nightly Build wiki still says it is 4.x even though it is now 5.x.
: See:
: https://wiki.apache.org/solr/NightlyBuilds
: 
: AFAIK, there isn't a 4.x nightly build running. (Is that going to happen
: soon??)

Yes...

http://mail-archives.apache.org/mod_mbox/lucene-dev/201205.mbox/%3c3fd307e7-7cd2-4042-8ba7-8a4561dbf...@email.android.com%3E


-Hoss


Re: using Tika (ExtractingRequestHandler)

2012-06-05 Thread Chris Hostetter

I've updated the wiki to try and fill in some of these holes...

http://wiki.apache.org/solr/ExtractingRequestHandler

: i'm looking at using Tika to index a bunch of documents. the wiki page seems 
to be a little bit out of date (// TODO: this is out of date as of Solr 1.4 - 
dist/apache-solr-cell-1.4.jar and all of contrib/extraction/lib are needed) 
and it also looks a little incomplete.
: 
: is there an actual list of all the required jar files? i'm not sure they are 
in the same place in the 3.6.0 distribution as they were in 1.4, and having an 
actual list would be very helpful in figuring out where they are.
: 
: as for Sending Documents to Solr, is there any plan to address this todo: 
// TODO: describe the different ways to send the documents to solr (POST body, 
form encoded, remoteStreaming). this is really just a nice to have, i can see 
how to accomplish my goals using a method that is currently documented.
: 
: thanks,
:richard
: 

-Hoss


Re: Correct way to deal with source data that may include a multivalued field that needs to be used for sorting?

2012-06-05 Thread Chris Hostetter

: The real issue here is that the docs are created externally, and the
: producer won't (yet) guarantee that fields that should appear once will
: actually appear once. Because of this, I don't want to declare the field as
: multiValued=false as I don't want to cause indexing errors. It would be
: great for me (and apparently many others after searching) if there were an
: option as simple as forceSingleValued=true - where some deterministic
: behavior such as use first field encountered, ignore all others, would
: occur.

This will be trivial in Solr 4.0, using one of the new 
FieldValueSubsetUpdateProcessorFactory classes that are now available -- 
just pick your rule... 

https://builds.apache.org/view/G-L/view/Lucene/job/Solr-trunk/javadoc/org/apache/solr/update/processor/FieldValueSubsetUpdateProcessorFactory.html
Direct Known Subclasses:
FirstFieldValueUpdateProcessorFactory, 
LastFieldValueUpdateProcessorFactory, 
MaxFieldValueUpdateProcessorFactory, 
MinFieldValueUpdateProcessorFactory 
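
For example, something like this in solrconfig.xml (an untested sketch; the 
chain and field names are just examples) would keep only the first value seen 
for a field:

<updateRequestProcessorChain name="keep-first-value">
  <processor class="solr.FirstFieldValueUpdateProcessorFactory">
    <str name="fieldName">my_sort_field</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

...and then point your update handler at that chain with the update.chain 
parameter.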

-Hoss


Re: TermComponent and Optimize

2012-06-05 Thread Chris Hostetter

: It seems that TermComponent is looking at all versions of documents in the 
index.
: 
: Does this is the expected behavior for TermComponent? Any suggestion about 
how to solve this?

Yes...

http://wiki.apache.org/solr/TermsComponent
The doc frequencies returned are the number of documents that match the 
term, including any documents that have been marked for deletion but not 
yet removed from the index.

If you delete/replace a document in the index, it still contributes to 
the doc freq for that term until the deletion is expunged (either 
because of a natural segment merge, or forced merging due to optimize)

The reason TermsComponent is so fast is that it only looks at the raw 
terms. If you want the counts to reflect only visible documents, you 
have to use something like faceting, which will be slower because it 
checks the actual (live) document counts.
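
For example (the field name here is just a placeholder), a request like

http://localhost:8983/solr/select?q=*:*&rows=0&facet=true&facet.field=myfield&facet.limit=10

will only count live documents, at the cost of the extra work faceting does.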


-Hoss


Re: using Tika (ExtractingRequestHandler)

2012-06-05 Thread Jack Krupansky

Hoss,

In your edit, I noticed that the wiki makes SolrPlugin a link, but to a 
nonexistent page, although the page SolrPlugins does exist.


See: it is provided as a SolrPlugin,
http://wiki.apache.org/solr/ExtractingRequestHandler

I also noticed a few other things:

1. There is a reference to the /site directory, which does not exist. So, the 
statement "Note, the /site directory in the solr download contains some nice 
example docs to try" is not terribly useful.

2. The path to tutorial.html should be ../../docs/api/doc-files
3. There is no tutorial.pdf file as referenced in the curl examples.

-- Jack Krupansky

-Original Message- 
From: Chris Hostetter

Sent: Tuesday, June 05, 2012 6:47 PM
To: solr-user@lucene.apache.org
Subject: Re: using Tika (ExtractingRequestHandler)


I've updated the wiki to try and fill in some of these holes...

http://wiki.apache.org/solr/ExtractingRequestHandler

: i'm looking at using Tika to index a bunch of documents. the wiki page 
seems to be a little bit out of date (// TODO: this is out of date as of 
Solr 1.4 - dist/apache-solr-cell-1.4.jar and all of contrib/extraction/lib 
are needed) and it also looks a little incomplete.

:
: is there an actual list of all the required jar files? i'm not sure they 
are in the same place in the 3.6.0 distribution as they were in 1.4, and 
having an actual list would be very helpful in figuring out where they are.

:
: as for Sending Documents to Solr, is there any plan to address this 
todo: // TODO: describe the different ways to send the documents to solr 
(POST body, form encoded, remoteStreaming). this is really just a nice to 
have, i can see how to accomplish my goals using a method that is currently 
documented.

:
: thanks,
:richard
:

-Hoss 



Re: Solr instances: many singles vs multi-core

2012-06-05 Thread Jack Krupansky
It probably can work out reasonably well in both scenarios, but you do get 
some additional flexibility with multiple Tomcat instances:


1. Any per-instance Tomcat limits become per-core rather than for all 
cores on that machine.

2. If you have to restart Tomcat, only a single shard is impacted.
3. There are probably a fair number of little details that work better, and 
with more parallelism, if each Solr core is a separate JVM. E.g., 
BooleanQuery's maxClauseCount is a JVM-wide static; PDFBox for Tika in SolrCell 
can have threads blocked on a resource that is shared across 
cores in the JVM (that was an issue - not sure if it still is). But of course 
your usage may not run into any of them. It will also depend a lot on how 
many CPU cores you have.
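
(For reference, by "all the shards as cores" I mean the usual multi-core 
solr.xml layout -- roughly like the following, where the core names and 
instanceDir values are only placeholders:

<solr persistent="true">
  <cores adminPath="/admin/cores">
    <core name="shard1" instanceDir="shard1" />
    <core name="shard2" instanceDir="shard2" />
  </cores>
</solr>

versus one such deployment per Tomcat in the one-shard-per-Tomcat setup.)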


-- Jack Krupansky

-Original Message- 
From: Christian von Wendt-Jensen

Sent: Tuesday, June 05, 2012 7:22 AM
To: solr-user@lucene.apache.org
Subject: Solr instances: many singles vs multi-core

Hi,

I'm running a cluster of Solr servers for an index split up into a lot of 
shards. Each shard is replicated. The current setup is one Tomcat instance per 
shard, even if the Tomcats are running on the same machine.


My question is this:

Would it be more advisable to run one Tomcat per machine with all the shards 
as cores, or is the current setup best, where each shard runs in 
its own Tomcat?


As I see it, I would think that one Tomcat running multiple cores is better, 
as it reduces the overhead of having many Tomcat instances, and there is 
the possibility of letting the cores share all available memory according to how much 
they actually need. In the one-shard/one-Tomcat scenario, each instance must 
have its predefined memory settings, whether or not it needs more or less.


Any opinions on the matter?



Med venlig hilsen / Best Regards

Christian von Wendt-Jensen



Re: index special characters solr

2012-06-05 Thread KPK
Thanks for your reply!

I tried using the types option of WordDelimiterFilterFactory, passing a text
file which treated % and $ as alphabetic characters. But even then they didn't
get indexed, and neither did they show up in search results.
Am I missing something?

Thanks,
Kushal

--
View this message in context: 
http://lucene.472066.n3.nabble.com/index-special-characters-solr-tp3987157p3987888.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: I got ERROR, Unable to execute query

2012-06-05 Thread Jihyun Suh
I was using MySQL 3.x.
After I migrated to MySQL 5.x, I no longer get the same 'Unable to
execute query' error.
Maybe the low MySQL version and Solr have some problem together; I don't know
exactly.


2012/6/5 Jihyun Suh jhsuh.ourli...@gmail.com

 That's why I made a new DB for dataimport test. So my tables have no
 access or activity.
 Those are just dormant ones.


 --

 My current suspicion is that there is activity in that table that is
 preventing DIH access. I mean, like maybe the table is being updated when
 DIH is failing. Maybe somebody is emptying the table and then regenerating
 it and your DIH run is catching the table when it is being emptied. Or
 something like that.

 -- Jack Krupansky


 2012/6/4 Jihyun Suh jhsuh.ourli...@gmail.com

 I read your answer. Thank you.

 But I don't get that error from same table. This time I get error from
 test_5. but when I try to dataimport again, I can index test_5, but from
 test_7 I get that error.

 I don't know the reason. Could you help me?


 --

 Is test_5 created by a stored procedure? If so, is there a possibility
 that
 the stored procedure may have done an update and not returned data - but
 just sometimes?

 -- Jack Krupansky


 2012/6/2 Jihyun Suh jhsuh.ourli...@gmail.com

 I use many tables for indexing.

 During dataimport, I get errors for some tables like Unable to execute
 query. But next time, when I try to dataimport for that table, I can do
 successfully without any error.

 [Thread-17] ERROR o.a.s.h.d.EntityProcessorWrapper - Exception in
 entity :
 test_5:org.apache.solr.handler.dataimport.DataImportHandlerException:
 Unable to execute query:
 SELECT Title, url, synonym, description FROM test_5 WHERE status in
 ('1','s') Processing Document # 11046

 at
 org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:72)
 at
 org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.init(JdbcDataSource.java:253)
 at
 org.apache.solr.handler.dataimport.JdbcDataSource.getData(JdbcDataSource.java:210)
 at
 org.apache.solr.handler.dataimport.JdbcDataSource.getData(JdbcDataSource.java:39)
 at
 org.apache.solr.handler.dataimport.SqlEntityProcessor.initQuery(SqlEntityProcessor.java:59)
 at
 org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:73)
 at
 org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:238)
 at
 org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:596)
 at
 org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:268)
 at
 org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:187)
 at
 org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:359)
 at
 org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:427)
 at
 org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:408)







Solr, I have perfomance problem for indexing.

2012-06-05 Thread Jihyun Suh
I have 128 tables in MySQL 5.x, and each table has 3,5000 rows.
When I start a dataimport (indexing) in Solr, it takes 5 minutes for one
table.
But when Solr indexes the 20th table, it takes around 10 minutes for one table.
And then when it indexes the 40th table, it takes around 20 minutes for one
table.

Does Solr have some performance problem with too many documents?
Should I set some configuration?


Re: index special characters solr

2012-06-05 Thread KPK
Thanks Jack for your help!
I found my mistake, rather than classifying those special characters as
ALPHA , I classified it as a DIGIT. Also I missed the same entry for search
analyzer. So probably that was the reason for not getting relevant results.

I spent a lot of time figuring this out, so I'll paste the schema.xml snippet
I changed, for any newbies, so that they don't waste as much time on this.
I classified the field I wanted to search for keywords (including special
characters) as "text". In the fieldType definition, modify the
solr.WordDelimiterFilterFactory filter:

 <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
 generateNumberParts="1" catenateWords="1" catenateNumbers="1"
 catenateAll="0" splitOnCaseChange="1" preserveOriginal="1"
 types="characters.txt" />
 
in BOTH <analyzer type="index"> and <analyzer type="query">

And make a new characters.txt in the same folder as schema.xml and add the
content :

 $ = ALPHA
 % = ALPHA


(I wanted $ and % to behave as alphabetic characters so that they could be searched)


Then restart jetty/tomcat

This is how I solved the problem.
Hope this helps someone :)


--
View this message in context: 
http://lucene.472066.n3.nabble.com/index-special-characters-solr-tp3987157p3987891.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: index special characters solr

2012-06-05 Thread Jack Krupansky

Thanks. I'm sure someone else will have the same issue at some point.

-- Jack Krupansky

-Original Message- 
From: KPK

Sent: Tuesday, June 05, 2012 9:51 PM
To: solr-user@lucene.apache.org
Subject: Re: index special characters solr

Thanks Jack for your help!
I found my mistake, rather than classifying those special characters as
ALPHA , I classified it as a DIGIT. Also I missed the same entry for search
analyzer. So probably that was the reason for not getting relevant results.

I spent a lot of time figuring this out. So I'll paste my code snippet of
schema.xml which was changed for newbies so that they dont waste so much
time in this.
I classified my field as text in which I wanted to search for keywords
including special characters. In fieldType definition modify the filter
class=solr.WordDelimiterFilterFactory

<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
generateNumberParts="1" catenateWords="1" catenateNumbers="1"
catenateAll="0" splitOnCaseChange="1" preserveOriginal="1"
types="characters.txt" />

in BOTH <analyzer type="index"> and <analyzer type="query">

And make a new characters.txt in the same folder as schema.xml and add the
content :

$ = ALPHA
% = ALPHA


(i wanted $ and % to behave as alphabets so that they could be searched)


Then restart jetty/tomcat

This is how i solved this problem.
Hope this would help someone :)


--
View this message in context: 
http://lucene.472066.n3.nabble.com/index-special-characters-solr-tp3987157p3987891.html
Sent from the Solr - User mailing list archive at Nabble.com. 



Re: Solr, I have perfomance problem for indexing.

2012-06-05 Thread Jack Krupansky

You wrote "3,5000", but is that 35 hundred (3,500) or 35 thousand (35,000)?

Your numbers seem far worse than what many people typically see with Solr 
and DIH.


Is the database running on the same machine?

Check the Solr log file to see if some errors (or warnings) might be 
occurring frequently.


Check the log for the first table from when it starts to when it ends. How 
often is it committing (according to the log)? Does there seem to be any odd 
activity during that period?
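
If frequent auto-commits do turn out to be the culprit, the knob to look at is 
the autoCommit section of solrconfig.xml -- something along these lines, where 
the numbers are only examples:

<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxDocs>10000</maxDocs>
    <maxTime>60000</maxTime> <!-- milliseconds -->
  </autoCommit>
</updateHandler>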


-- Jack Krupansky

-Original Message- 
From: Jihyun Suh

Sent: Tuesday, June 05, 2012 9:25 PM
To: solr-user-h...@lucene.apache.org ; solr-user@lucene.apache.org
Subject: Solr, I have perfomance problem for indexing.

I have 128 tables of mysql 5.x and each table have 3,5000 rows.
When I start dataimport(indexing) in Solr, it takes 5 minutes for one
table.
But When Solr indexs 20th table, it takes around 10 minutes for one table.
And then When it indexs 40th table, it takes around 20 minutes for one
table.

Solr has some performance problem for too many documents?
Should I set some configuration? 



Re: Solr, I have perfomance problem for indexing.

2012-06-05 Thread Lance Norskog
Which Solr do you run?

On Tue, Jun 5, 2012 at 8:02 PM, Jack Krupansky j...@basetechnology.com wrote:
 You wrote 3,5000, but is that 35 hundred (3,500) or 35 thousand (35,000)??

 Your numbers seem far worse than what many people typically see with Solr
 and DIH.

 Is the database running on the same machine?

 Check the Solr log file to see if some errors (or warnings) might be
 occurring frequently.

 Check the log for the first table from when it starts to when it ends. How
 often is it committing (according to the log)? Does there seem to be any odd
 activity during that period?

 -- Jack Krupansky

 -Original Message- From: Jihyun Suh
 Sent: Tuesday, June 05, 2012 9:25 PM
 To: solr-user-h...@lucene.apache.org ; solr-user@lucene.apache.org
 Subject: Solr, I have perfomance problem for indexing.


 I have 128 tables of mysql 5.x and each table have 3,5000 rows.
 When I start dataimport(indexing) in Solr, it takes 5 minutes for one
 table.
 But When Solr indexs 20th table, it takes around 10 minutes for one table.
 And then When it indexs 40th table, it takes around 20 minutes for one
 table.

 Solr has some performance problem for too many documents?
 Should I set some configuration?



-- 
Lance Norskog
goks...@gmail.com


Hiring multiple Lucene/Solr Search Engineers

2012-06-05 Thread SV
Hi,

We are hiring multiple Lucene/Solr engineers, tech leads, and architects based
in Minneapolis - both full-time and consulting - to develop a new search
platform.

Please reach out to me - svamb...@gmail.com

Thanks,
Venkat Ambati
Sr. Manager, Best Buy


Replication

2012-06-05 Thread William Bell
We are using Solr 1.4, and we are experiencing a full index replication
every 15 minutes.

I have checked the solrconfig and it has maxsegments set to 20. It
appears that it is indexing a single segment, but replicating the whole
index.

How can I verify this and possibly fix the issue?
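Is comparing the index version and generation reported by the replication
details command (e.g. http://host:8983/solr/replication?command=details, where
"host" is whichever master or slave you are checking) on both sides the right
way to see whether a full copy happened?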

-- 
Bill Bell
billnb...@gmail.com
cell 720-256-8076