Re: Git repo
http://git.apache.org/

On Sun, Feb 19, 2012 at 7:50 PM, Mark Diggory mdigg...@atmire.com wrote: Is there a git repo location that mirrors apache svn repos for solr? Cheers, Mark (@mire Inc., http://www.atmire.com)

--
Igor Milovanović
https://twitter.com/#!/f13o | http://about.me/igor.milovanovic | http://umotvorine.com/
Re: Development inside or outside of Solr?
I have looked into the TikaCLI with the -language option, and learned that Tika can output only the language metadata. It cannot help me solve my problem though, as my main concern is whether to change Solr or not. Thank you all the same. -- View this message in context: http://lucene.472066.n3.nabble.com/Development-inside-or-outside-of-Solr-tp3759680p3760131.html Sent from the Solr - User mailing list archive at Nabble.com.
Solr logging
Hi, I want to set up my Solr to use log4j and to write log messages into a separate file instead of writing everything to standard output. How can I do it? Which jars should I add? Where should I put the log4j.xml file? Regards, Alex
Re: Solr logging
I got a similar question in the past :) http://lucene.472066.n3.nabble.com/Jetty-logging-td3476715.html#a3483146 I hope it will help you.
Re: Solr logging
Thanks a lot. I've added (and deleted) those libraries and now I don't get these messages on stdout :) I see that log4j is running but it can't find its config file. I wish I could add this to the solr.war. Is this possible? I want to avoid setting parameters in Glassfish. Regards, Alex

On Mon, Feb 20, 2012 at 9:58 AM, darul daru...@gmail.com wrote: I got a similar question in the past :) http://lucene.472066.n3.nabble.com/Jetty-logging-td3476715.html#a3483146
processing of merged tokens
Hello all,

For our search system we'd like to be able to process merged tokens, i.e. when a user enters a query like "hotelsin barcelona", we'd like to know that the user means "hotels in barcelona".

At some point in the past we implemented this kind of functionality with shingles (using ShingleFilter): if we were indexing the sentence "hotels in barcelona" as a document, we'd be able to match merged tokens like "hotelsin" and "inbarcelona" at query time. This solution has two problems: 1) the index size increases a lot, and 2) we only catch a small percentage of the possibilities; merged tokens like "hotelsbarcelona" or "barcelonahotels" cannot be processed.

Our intuition is that there should be a better solution. Maybe it's already solved in Solr or Lucene and we haven't found it yet. If it's not, I can imagine a naive solution that would use TermsEnum to check whether a token exists in the index, and if it doesn't, use the TermsEnum again to check whether it's a concatenation of two known tokens. It's highly likely that there are much better solutions and algorithms for this. It would be great if you could help us identify the best way to solve this problem.

Thanks a lot for your help,
Carlos

Carlos Gonzalez-Cadenas
CEO, ExperienceOn - New generation search
http://www.experienceon.com
Mobile: +34 652 911 201
Skype: carlosgonzalezcadenas
LinkedIn: http://www.linkedin.com/in/carlosgonzalezcadenas
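The naive TermsEnum idea described above can be sketched in plain Python (an illustrative stand-in, not the Lucene API; the term set here plays the role of the index's term dictionary):

```python
# Illustrative sketch of the naive merged-token approach: if a query token
# is not in the term dictionary, try to split it into two known terms.
def split_merged_token(token, known_terms):
    """Return (left, right) if token is a concatenation of two known terms,
    (token,) if it is already known, or None otherwise."""
    if token in known_terms:
        return (token,)
    for i in range(1, len(token)):
        left, right = token[:i], token[i:]
        if left in known_terms and right in known_terms:
            return (left, right)
    return None

terms = {"hotels", "in", "barcelona"}
print(split_merged_token("hotelsin", terms))         # → ('hotels', 'in')
print(split_merged_token("barcelonahotels", terms))  # → ('barcelona', 'hotels')
```

In a real implementation the membership tests would be TermsEnum lookups against the index, and one would likely rank candidate splits by term frequency rather than taking the first match.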
Re: Solr logging
Yes, you can update your .war archive by adding/removing the expected jars.
Re: Solr logging
I've already done that. What I'm more interested in is whether I can add log4j.xml to the war, and where to put it to make it work.

On Mon, Feb 20, 2012 at 10:49 AM, darul daru...@gmail.com wrote: Yes, you can update your .war archive by adding/removing expected jars.
Re: Solr logging
Hmm, I did not try to achieve this, but I'm interested if you find a way... That said, I believe that having the log4j config file outside the war archive is a better solution, in case you need to update its content.
Re: Solr logging
Yep, I suppose it is. But I have several applications installed on Glassfish and I want each one of them to write into a separate file, and your solution with this JVM option was redirecting all messages from all apps to one file. Does anyone know how to accomplish that?

On Mon, Feb 20, 2012 at 11:09 AM, darul daru...@gmail.com wrote: Hmm, I did not try to achieve this but interested if you find a way... [...]
Re: Solr logging
This case is explained here: http://stackoverflow.com/questions/762918/how-to-configure-multiple-log4j-for-different-wars-in-a-single-ear

http://techcrawler.wordpress.com/
Re: Payload and exact search - 2
OK, it works!! Thank you very much. Leonardo
solr and tika
Hi all, in a new installation of Solr (1.4) I configured Tika for indexing rich documents. So I commit my files, and after indexing I can find them with an HTTP query: http://localhost:8983/solr/select?q=attr_content:parola (to search for the word 'parola'), and I find the committed text. But if I search from the Solr front panel, the result is '0 documents'. Suggestions? Thanks, Alessio
Re: Development inside or outside of Solr?
You could take a look at this: http://www.let.rug.nl/vannoord/TextCat/ It will probably require some work to integrate/implement, though.

François

On Feb 20, 2012, at 3:37 AM, bing wrote: I have looked into the TikaCLI with -language option, and learned that Tika can output only the language metadata. [...]
Re: Solr logging
Ola, here is what I have for this:

##
# Log4J configuration for Solr
# http://wiki.apache.org/solr/SolrLogging
#
# 1) Download log4j:
#    http://logging.apache.org/log4j/1.2/download.html
#    (apache-log4j-1.2.16.tar.gz)
#
# 2) Download SLF4J:
#    http://www.slf4j.org/download.html
#    (slf4j-1.6.4.tar.gz)
#
# 3) Unpack Solr:
#    jar xvf apache-solr-3.5.0.war
#
# 4) Delete:
#    WEB-INF/lib/log4j-over-slf4j-1.6.4.jar
#    WEB-INF/lib/slf4j-jdk14-1.6.4.jar
#
# 5) Copy:
#    apache-log4j-1.2.16/log4j-1.2.16.jar -> WEB-INF/lib
#    slf4j-1.6.4/slf4j-log4j12-1.6.4.jar  -> WEB-INF/lib
#    log4j.properties (this file)         -> WEB-INF/classes/ (needs to be created)
#
# 6) Pack Solr:
#    jar cvf apache-solr-3.4.0-omim.war admin favicon.ico index.jsp META-INF WEB-INF
#
# Author: Francois Schiettecatte
# Version: 1.0
##

# Logging levels (helpful reminder): DEBUG INFO WARN ERROR FATAL

# Logging setup
log4j.rootLogger=WARN, SOLR

# Daily Rolling File Appender (SOLR)
log4j.appender.SOLR=org.apache.log4j.DailyRollingFileAppender
log4j.appender.SOLR.File=${catalina.base}/logs/solr.log
log4j.appender.SOLR.Append=true
log4j.appender.SOLR.Encoding=UTF-8
log4j.appender.SOLR.DatePattern='-'yyyy-MM-dd
log4j.appender.SOLR.layout=org.apache.log4j.PatternLayout
log4j.appender.SOLR.layout.ConversionPattern=%d [%t] %-5p %c - %m%n

# Default logging level for Solr
log4j.logger.org.apache.solr=WARN

On Feb 20, 2012, at 5:15 AM, ola nowak wrote: Yep. I suppose it is. But I have several applications installed on glassfish and I want each one of them to write into separate file. [...]
Problem with SolrCloud + Zookeeper + DataImportHandler
Hi All, I've recently downloaded the latest Solr trunk to configure SolrCloud with ZooKeeper, using the standard configuration from the wiki: http://wiki.apache.org/solr/SolrCloud. The problem occurred when I tried to configure the DataImportHandler in solrconfig.xml:

<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">db-data-config.xml</str>
  </lst>
</requestHandler>

After starting Solr with ZooKeeper I got these errors:

Feb 20, 2012 11:30:12 AM org.apache.solr.common.SolrException log
SEVERE: null:org.apache.solr.common.SolrException
  at org.apache.solr.core.SolrCore.init(SolrCore.java:606)
  at org.apache.solr.core.SolrCore.init(SolrCore.java:490)
  at org.apache.solr.core.CoreContainer.create(CoreContainer.java:705)
  at org.apache.solr.core.CoreContainer.load(CoreContainer.java:442)
  at org.apache.solr.core.CoreContainer.load(CoreContainer.java:313)
  at org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:262)
  at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:98)
  at org.mortbay.jetty.servlet.FilterHolder.doStart(FilterHolder.java:97)
  at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
  at org.mortbay.jetty.servlet.ServletHandler.initialize(ServletHandler.java:713)
  at org.mortbay.jetty.servlet.Context.startContext(Context.java:140)
  at org.mortbay.jetty.webapp.WebAppContext.startContext(WebAppContext.java:1282)
  at org.mortbay.jetty.handler.ContextHandler.doStart(ContextHandler.java:518)
  at org.mortbay.jetty.webapp.WebAppContext.doStart(WebAppContext.java:499)
  at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
  at org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:152)
  at org.mortbay.jetty.handler.ContextHandlerCollection.doStart(ContextHandlerCollection.java:156)
  at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
  at org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:152)
  at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
  at org.mortbay.jetty.handler.HandlerWrapper.doStart(HandlerWrapper.java:130)
  at org.mortbay.jetty.Server.doStart(Server.java:224)
  at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
  at org.mortbay.xml.XmlConfiguration.main(XmlConfiguration.java:985)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
  at java.lang.reflect.Method.invoke(Method.java:597)
  at org.mortbay.start.Main.invokeMain(Main.java:194)
  at org.mortbay.start.Main.start(Main.java:534)
  at org.mortbay.start.Main.start(Main.java:441)
  at org.mortbay.start.Main.main(Main.java:119)
Caused by: org.apache.solr.common.SolrException: FATAL: Could not create importer. DataImporter config invalid
  at org.apache.solr.handler.dataimport.DataImportHandler.inform(DataImportHandler.java:120)
  at org.apache.solr.core.SolrResourceLoader.inform(SolrResourceLoader.java:542)
  at org.apache.solr.core.SolrCore.init(SolrCore.java:601)
  ... 31 more
Caused by: org.apache.solr.common.cloud.ZooKeeperException: ZkSolrResourceLoader does not support getConfigDir() - likely, w
  at org.apache.solr.cloud.ZkSolrResourceLoader.getConfigDir(ZkSolrResourceLoader.java:99)
  at org.apache.solr.handler.dataimport.SimplePropertiesWriter.init(SimplePropertiesWriter.java:47)
  at org.apache.solr.handler.dataimport.DataImporter.init(DataImporter.java:112)
  at org.apache.solr.handler.dataimport.DataImportHandler.inform(DataImportHandler.java:114)
  ...
33 more

I've checked that the file db-data-config.xml is available in ZooKeeper:

[zk: localhost:2181(CONNECTED) 0] ls /configs/conf1
[admin-extra.menu-top.html, dict, solrconfig.xml, dataimport.properties, admin-extra.html, solrconfig.xml.old, solrconfig.xml.new, solrconfig.xml~, xslt, db-data-config.xml, velocity, elevate.xml, admin-extra.menu-bottom.html, solrconfig.xml.dataimport, schema.xml]
[zk: localhost:2181(CONNECTED) 1]

Is it possible to configure DIH with ZooKeeper? And how to do it? I'm a little confused by that.

Regards,
Agnieszka Kukalowicz
Re: custom scoring
Hello all: We've done some tests with Em's approach of putting a BooleanQuery in front of our user query, that is:

BooleanQuery
  must (DismaxQuery)
  should (FunctionQuery)

The FunctionQuery obtains the Solr IR score by means of a QueryValueSource, then takes the SQRT of this value, and then multiplies it by our custom query_score float, pulled in by means of a FieldCacheSource.

In particular, we've proceeded in the following way:
- we've loaded the whole index into the page cache of the OS to make sure we don't have disk IO problems that might affect the benchmarks (our machine has enough memory to load all of the index in RAM)
- we've executed an out-of-benchmark query 10-20 times to make sure that everything is JITted and that Lucene's FieldCache is properly populated
- we've disabled all the caches (filter query cache, document cache, query cache)
- we've executed 8 different user queries with and without FunctionQueries, with early termination in both cases (our collector stops after collecting 50 documents per shard)

Em was correct: the query is much faster with the BooleanQuery in front, but it's still 30-40% slower than the query without FunctionQueries. Although one may think it's reasonable that the query response time increases because of the extra computations, we believe the increase is too big, given that we're collecting just 500-600 documents due to the early query termination techniques we currently use. Any ideas on how to make it faster?

Thanks a lot,
Carlos

On Fri, Feb 17, 2012 at 11:07 AM, Carlos Gonzalez-Cadenas c...@experienceon.com wrote: Thanks Em, Robert, Chris for your time and valuable advice. We'll make some tests and will let you know soon.
On Thu, Feb 16, 2012 at 11:43 PM, Em mailformailingli...@yahoo.de wrote: Hello Carlos, I think we misunderstood each other. As an example:

BooleanQuery (
  clauses: (
    MustMatch(
      DisjunctionMaxQuery(
        TermQuery(stopword_field, barcelona),
        TermQuery(stopword_field, hoteles)
      )
    ),
    ShouldMatch(
      FunctionQuery( *please insert your function here* )
    )
  )
)

Explanation: You construct an artificial BooleanQuery which wraps your user's query as well as your function query. Your user's query - in that case - is just a DisjunctionMaxQuery consisting of two TermQueries. In the real world you might construct another BooleanQuery around your DisjunctionMaxQuery in order to have more flexibility. However, the interesting part of the given example is that we specify the user's query as a MustMatch condition of the BooleanQuery and the FunctionQuery just as a ShouldMatch. Constructed that way, I expect the FunctionQuery to score only those documents which fit the MustMatch condition. I conclude that from the fact that the FunctionQuery class also has a skipTo method, and I would expect that the scorer will use it to score only matching documents (however, I did not check where and how it might get called). If my conclusion is wrong, then hopefully Robert Muir (as far as I can see, the author of that class) can tell us what the intention was in constructing an every-time-match-all FunctionQuery. Can you validate whether your QueryParser constructs a query in the form I drew above?
Regards, Em

Am 16.02.2012 20:29, schrieb Carlos Gonzalez-Cadenas: Hello Em: 1) Here's a printout of an example DisMax query (as you can see, mostly MUST terms except for some SHOULD terms used for boosting scores for stopwords):

((+stopword_shortened_phrase:hoteles +stopword_shortened_phrase:barcelona stopword_shortened_phrase:en) | (+stopword_phrase:hoteles +stopword_phrase:barcelona stopword_phrase:en) | (+stopword_shortened_phrase:hoteles +stopword_shortened_phrase:barcelona stopword_shortened_phrase:en) | (+stopword_phrase:hoteles +stopword_phrase:barcelona stopword_phrase:en) | (+stopword_shortened_phrase:hoteles +wildcard_stopword_shortened_phrase:barcelona stopword_shortened_phrase:en) | (+stopword_phrase:hoteles +wildcard_stopword_phrase:barcelona stopword_phrase:en) | (+stopword_shortened_phrase:hoteles +wildcard_stopword_shortened_phrase:barcelona stopword_shortened_phrase:en) | (+stopword_phrase:hoteles +wildcard_stopword_phrase:barcelona stopword_phrase:en))

2) The collector is inserted in the SolrIndexSearcher (replacing the TimeLimitingCollector). We trigger it through the Solr interface by passing the timeAllowed parameter. We know this is a hack, but AFAIK there's no out-of-the-box way to specify custom collectors by now (
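Em's expectation, that the SHOULD FunctionQuery contributes only to documents passing the MUST clause, can be illustrated with a toy Python simulation (this is not Lucene code; the sqrt(score) * query_score formula mirrors the function discussed in this thread):

```python
import math

# Toy simulation of BooleanQuery(must=user query, should=function query):
# documents failing the required clause are skipped entirely, so the
# function score is only ever computed for matching documents.
def score_docs(docs, matches_user_query, ir_score, query_score):
    """Return {doc: final_score} for docs matching the required clause,
    where the optional clause adds sqrt(ir_score) * query_score."""
    results = {}
    for doc in docs:
        if not matches_user_query(doc):
            continue  # MUST clause failed: the optional function query never runs
        base = ir_score(doc)  # stands in for the Lucene IR score
        results[doc] = base + math.sqrt(base) * query_score(doc)
    return results

scores = score_docs(["d1", "d2"], lambda d: d == "d1",
                    lambda d: 4.0, lambda d: 0.5)
print(scores)  # → {'d1': 5.0}: 4.0 + sqrt(4.0) * 0.5; 'd2' is never scored
```

The remaining cost Carlos observes would then come from evaluating the function for the documents that do match, which early termination limits but does not eliminate.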
Re: custom scoring
Carlos, nice to hear that the approach helped you! Could you show us what your query request looks like after reworking?

Regards, Em

Am 20.02.2012 13:30, schrieb Carlos Gonzalez-Cadenas: Hello all: We've done some tests with Em's approach of putting a BooleanQuery in front of our user query [...] Any ideas on how to make it faster?
Re: Development inside or outside of Solr?
Either is possible. For the first, you would write a custom update processor that handled the dual Tika call... For the second, consider writing a SolrJ program that just does it all on the client: download Tika from the Apache project (or tease out all the jars from the Solr distro) and then make it all work on the client. Here's a sample app: http://www.lucidimagination.com/blog/2012/02/14/indexing-with-solrj/

Best,
Erick

On Sun, Feb 19, 2012 at 9:44 PM, bing jsuser1...@hotmail.com wrote: Hi all, I am deploying a multicore Solr server running on Tomcat, where I want to achieve language detection during index/query. Solr 3.5.0 has a wrapped Tika API that can do language detection. Currently, the default behavior of Solr 3.5.0 is that every time I index a document, Solr calls the Tika API to produce the language detection result, i.e. indexing and detection happen at the same time. However, I would like to have the language detection result first, and then decide which core to put the document in, i.e. detection happens before indexing. It seems that I need to do development in one of the following ways: 1. revise Solr itself, changing its default behavior; 2. or write a Java client outside Solr, and call the client through the server (JSP maybe) at index/query time. Can anyone who has met similar conditions give some suggestions about the advantages and disadvantages of the two approaches? Any other alternatives? Thank you. Best, Bing
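The client-side "detect first, then index" flow could look like the following Python sketch (illustrative only: the detector is a stub standing in for a real language identifier such as Tika's, and the core names are made up):

```python
# Sketch of routing a document to a core based on a detected language.
# detect_language is injected so a real detector (e.g. Tika via a Java
# client, or any language-ID library) can be swapped in.
def pick_core(text, detect_language, core_by_lang, default_core):
    """Choose the target Solr core name for a document's text."""
    lang = detect_language(text)
    return core_by_lang.get(lang, default_core)

# A toy detector for demonstration; real code would call an actual
# language identifier instead of this keyword heuristic.
def toy_detector(text):
    return "es" if " el " in f" {text} " else "en"

cores = {"en": "core_en", "es": "core_es"}
print(pick_core("el hotel en barcelona", toy_detector, cores, "core_en"))  # → core_es
print(pick_core("hotels in barcelona", toy_detector, cores, "core_en"))    # → core_en
```

The actual indexing step would then post the document to the chosen core's update handler, which is exactly the part SolrJ handles in the sample app linked above.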
Re: custom scoring
Yeah Em, it helped a lot :) Here it is (for the user query "hoteles"):

+(stopword_shortened_phrase:hoteles | stopword_phrase:hoteles | wildcard_stopword_shortened_phrase:hoteles | wildcard_stopword_phrase:hoteles) product(pow(query((stopword_shortened_phrase:hoteles | stopword_phrase:hoteles | wildcard_stopword_shortened_phrase:hoteles | wildcard_stopword_phrase:hoteles),def=0.0),const(0.5)),float(query_score))

Thanks a lot for your help,
Carlos

On Mon, Feb 20, 2012 at 1:50 PM, Em mailformailingli...@yahoo.de wrote: Carlos, nice to hear that the approach helped you! Could you show us how your query-request looks like after reworking? [...]
How to check for inactive cores in a solr multicore setup?
Hello, I am trying to figure out a way to detect inactive cores in a multicore setup. How is that possible? I queried the STATUS of a core through the CoreAdminHandler. Could anyone please tell me what the 'current' field means?

E.g.: http://localhost:8080/solr/admin/cores?action=STATUS&core=2

Response:

<lst name="2">
  <str name="name">2</str>
  <str name="instanceDir">multicore/solr/2/</str>
  <str name="dataDir">multicore/solr/2/data/</str>
  <date name="startTime">2012-02-17T06:19:20.805Z</date>
  <long name="uptime">279811925</long>
  <lst name="index">
    <int name="numDocs">72373</int>
    <int name="maxDoc">81487</int>
    <long name="version">1328696930153</long>
    <int name="segmentCount">12</int>
    <bool name="current">true</bool>
    <bool name="hasDeletions">true</bool>
    <str name="directory">org.apache.lucene.store.MMapDirectory:org.apache.lucene.store.MMapDirectory@multicore/solr/2/data/index lockFactory=org.apache.lucene.store.NativeFSLockFactory@4cd0b9d7</str>
    <date name="lastModified">2012-02-20T12:02:12Z</date>
  </lst>
</lst>

Please help. Thanks, Nasima
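Extracting fields like 'current' from a STATUS response can be done with a small script. Here's an illustrative Python sketch that parses a simplified, made-up sample of such a response (a real client would fetch the XML from the CoreAdmin STATUS URL first):

```python
import xml.etree.ElementTree as ET

# A simplified, hand-written sample of a CoreAdmin STATUS response; the
# real payload contains many more fields per core.
SAMPLE = """<response>
  <lst name="status">
    <lst name="2">
      <str name="name">2</str>
      <lst name="index">
        <bool name="current">true</bool>
      </lst>
    </lst>
  </lst>
</response>"""

def core_current_flags(status_xml):
    """Return {core_name: bool} mapping each core to its 'current' flag."""
    root = ET.fromstring(status_xml)
    flags = {}
    for core in root.findall("./lst[@name='status']/lst"):
        flag = core.find("./lst[@name='index']/bool[@name='current']")
        flags[core.get("name")] = (flag is not None and flag.text == "true")
    return flags

print(core_current_flags(SAMPLE))  # → {'2': True}
```

Note the wrapping <lst name="status"> element assumed here is how the full STATUS response groups cores; if you query a single core as in the example above, adjust the XPath accordingly.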
RE: customizing standard tokenizer
Thanks, I will use the custom tokenizer. It's less error-prone than the workarounds mentioned.
Re: DataImportHandler running out of memory
DIH is still running out of memory for me, with a Full Import on a database of size 1.5 GB. Solr version: 3.5.0. Note that I have already added batchSize="-1" but am getting the same error. Sharing my DIH config below.

<dataConfig>
  <dataSource type="JdbcDataSource" name="jdbc" driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost:3306/ib" user="root" password="root" batchSize="-1"/>
  <document name="content">
    <entity name="issue" dataSource="jdbc"
            transformer="RegexTransformer,DateFormatTransformer,TemplateTransformer" pk="id"
            query="select ib_issue.`_id` as id, ib_issue.`_issue_title` as issueTitle, ib_issue.`_issue_descr` as issueDescr, createdBy.`_name` as issueCreatedByName, createdBy.`_email` as issueCreatedByEmail from `ib_issue` inner join `ib_user` as createdBy on createdBy.`_id` = ib_issue.`_created_by_user_id` group by ib_issue.`_id`"/>
  </document>
</dataConfig>

Please find the error trace below:

2012-02-20 19:04:40.531:INFO::Started SocketConnector@0.0.0.0:8983
Feb 20, 2012 7:04:57 PM org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/select params={command=status&qt=/dih_ib_jdbc} status=0 QTime=0
Feb 20, 2012 7:04:58 PM org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/select params={command=show-config&qt=/dih_ib_jdbc} status=0 QTime=0
Feb 20, 2012 7:05:30 PM org.apache.solr.handler.dataimport.DataImporter doFullImport
INFO: Starting Full Import
Feb 20, 2012 7:05:30 PM org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/dih_ib_jdbc params={command=full-import} status=0 QTime=0
Feb 20, 2012 7:05:30 PM org.apache.solr.handler.dataimport.SolrWriter readIndexerProperties
INFO: Read dih_ib_jdbc.properties
Feb 20, 2012 7:05:30 PM org.apache.solr.update.DirectUpdateHandler2 deleteAll
INFO: [] REMOVING ALL DOCUMENTS FROM INDEX
Feb 20, 2012 7:05:30 PM org.apache.solr.core.SolrDeletionPolicy onInit
INFO: SolrDeletionPolicy.onInit: commits:num=1
commit{dir=E:\workspace\solr_3_5_0\example\solr\data\index,segFN=segments_1,version=1329744880204,generation=1,filenames=[segments_1]
Feb 20, 2012 7:05:30 PM org.apache.solr.core.SolrDeletionPolicy updateCommits
INFO: newest commit = 1329744880204
Feb 20, 2012 7:05:30 PM org.apache.solr.handler.dataimport.JdbcDataSource$1 call
INFO: Creating a connection for entity issue with URL: jdbc:mysql://localhost:3306/issueburner
Feb 20, 2012 7:05:30 PM org.apache.solr.handler.dataimport.JdbcDataSource$1 call
INFO: Time taken for getConnection(): 172
Feb 20, 2012 7:07:45 PM org.apache.solr.common.SolrException log
SEVERE: Full Import failed:org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.OutOfMemoryError: Java heap space
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:669)
    at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:268)
    at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:187)
    at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:359)
    at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:427)
    at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:408)
Caused by: java.lang.OutOfMemoryError: Java heap space
    at org.apache.lucene.util.UnicodeUtil.UTF16toUTF8(UnicodeUtil.java:377)
    at org.apache.lucene.store.DataOutput.writeString(DataOutput.java:103)
    at org.apache.lucene.index.FieldsWriter.writeField(FieldsWriter.java:200)
    at org.apache.lucene.index.StoredFieldsWriterPerThread.addField(StoredFieldsWriterPerThread.java:58)
    at org.apache.lucene.index.DocFieldProcessorPerThread.processDocument(DocFieldProcessorPerThread.java:265)
    at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:766)
    at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:2327)
    at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:2299)
    at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:240)
    at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:61)
    at org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:115)
    at org.apache.solr.handler.dataimport.SolrWriter.upload(SolrWriter.java:73)
    at org.apache.solr.handler.dataimport.DataImportHandler$1.upload(DataImportHandler.java:293)
    at
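[Editor's note: for reference, the part of the config above that controls streaming is the dataSource element. As far as I know, on Solr 3.x DIH translates batchSize="-1" into setFetchSize(Integer.MIN_VALUE) on the JDBC statement, which makes MySQL Connector/J stream rows one at a time instead of buffering the whole result set in the heap. A sketch, with the same placeholder url/user/password as above:]

```xml
<!-- Sketch of the relevant fragment of the config quoted above.
     batchSize="-1" => fetchSize = Integer.MIN_VALUE, i.e. row streaming
     with MySQL Connector/J. -->
<dataSource type="JdbcDataSource" name="jdbc"
            driver="com.mysql.jdbc.Driver"
            url="jdbc:mysql://localhost:3306/ib"
            user="root" password="root"
            batchSize="-1"/>
```

Since the OutOfMemoryError above is thrown while writing stored fields rather than while fetching rows, streaming alone may not be enough; raising the JVM heap (for example `java -Xmx1024m -jar start.jar`) and checking for unusually large column values would be the next things to try.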
Re: custom scoring
Could you please provide me the original request (the HTTP-request)? I am a little bit confused to what query_score refers. As far as I can see it isn't a magic-value. Kind regards, Em Am 20.02.2012 14:05, schrieb Carlos Gonzalez-Cadenas: Yeah Em, it helped a lot :) Here it is (for the user query hoteles): *+(stopword_shortened_phrase:hoteles | stopword_phrase:hoteles | wildcard_stopword_shortened_phrase:hoteles | wildcard_stopword_phrase:hoteles) * *product(pow(query((stopword_shortened_phrase:hoteles | stopword_phrase:hoteles | wildcard_stopword_shortened_phrase:hoteles | wildcard_stopword_phrase:hoteles),def=0.0),const(0.5)),float(query_score))* Thanks a lot for your help. Carlos Carlos Gonzalez-Cadenas CEO, ExperienceOn - New generation search http://www.experienceon.com Mobile: +34 652 911 201 Skype: carlosgonzalezcadenas LinkedIn: http://www.linkedin.com/in/carlosgonzalezcadenas On Mon, Feb 20, 2012 at 1:50 PM, Em mailformailingli...@yahoo.de wrote: Carlos, nice to hear that the approach helped you! Could you show us how your query-request looks like after reworking? Regards, Em Am 20.02.2012 13:30, schrieb Carlos Gonzalez-Cadenas: Hello all: We've done some tests with Em's approach of putting a BooleanQuery in front of our user query, that means: BooleanQuery must (DismaxQuery) should (FunctionQuery) The FunctionQuery obtains the SOLR IR score by means of a QueryValueSource, then does the SQRT of this value, and then multiplies it by our custom query_score float, pulling it by means of a FieldCacheSource. In particular, we've proceeded in the following way: - we've loaded the whole index in the page cache of the OS to make sure we don't have disk IO problems that might affect the benchmarks (our machine has enough memory to load all the index in RAM) - we've executed an out-of-benchmark query 10-20 times to make sure that everything is jitted and that Lucene's FieldCache is properly populated. 
- we've disabled all the caches (filter query cache, document cache, query cache) - we've executed 8 different user queries with and without FunctionQueries, with early termination in both cases (our collector stops after collecting 50 documents per shard) Em was correct, the query is much faster with the BooleanQuery in front, but it's still 30-40% slower than the query without FunctionQueries. Although one may think that it's reasonable that the query response time increases because of the extra computations, we believe that the increase is too big, given that we're collecting just 500-600 documents due to the early query termination techniques we currently use. Any ideas on how to make it faster?. Thanks a lot, Carlos Carlos Gonzalez-Cadenas CEO, ExperienceOn - New generation search http://www.experienceon.com Mobile: +34 652 911 201 Skype: carlosgonzalezcadenas LinkedIn: http://www.linkedin.com/in/carlosgonzalezcadenas On Fri, Feb 17, 2012 at 11:07 AM, Carlos Gonzalez-Cadenas c...@experienceon.com wrote: Thanks Em, Robert, Chris for your time and valuable advice. We'll make some tests and will let you know soon. On Thu, Feb 16, 2012 at 11:43 PM, Em mailformailingli...@yahoo.de wrote: Hello Carlos, I think we missunderstood eachother. As an example: BooleanQuery ( clauses: ( MustMatch( DisjunctionMaxQuery( TermQuery(stopword_field, barcelona), TermQuery(stopword_field, hoteles) ) ), ShouldMatch( FunctionQuery( *please insert your function here* ) ) ) ) Explanation: You construct an artificial BooleanQuery which wraps your user's query as well as your function query. Your user's query - in that case - is just a DisjunctionMaxQuery consisting of two TermQueries. In the real world you might construct another BooleanQuery around your DisjunctionMaxQuery in order to have more flexibility. However the interesting part of the given example is, that we specify the user's query as a MustMatch-condition of the BooleanQuery and the FunctionQuery just as a ShouldMatch. 
Constructed that way, I am expecting the FunctionQuery to score only those documents which fit the MustMatch condition. I conclude that from the fact that the FunctionQuery class also has a skipTo method, and I would expect that the scorer will use it to score only matching documents (however, I did not search where and how it might get called). If my conclusion is wrong, then hopefully Robert Muir (as far as I can see, the author of that class) can tell us what the intention was in constructing an every-time-match-all FunctionQuery. Can you validate whether your QueryParser constructs a query in the form I drew above? Regards, Em Am 16.02.2012 20:29, schrieb Carlos Gonzalez-Cadenas: Hello Em: 1) Here's a printout of an
Re: custom scoring
Hi Em: The HTTP request is not gonna help you a lot because we use a custom QParser (that builds the query that I've pasted before). In any case, here it is: http://localhost:8080/solr/core0/select?shards=…(shards here)…indent=onwt=exontimeAllowed=50fl=resulting_phrase%2Cquery_id%2Ctype%2Chighlightingstart=0rows=16limit=20q=%7B!exonautocomplete%7Dhoteleshttp://localhost:8080/solr/core0/select?shards=exp302%3A8983%2Fsolr%2Fcore0%2Cexp302%3A8983%2Fsolr%2Fcore1%2Cexp302%3A8983%2Fsolr%2Fcore2%2Cexp302%3A8983%2Fsolr%2Fcore3%2Cexp302%3A8983%2Fsolr%2Fcore4%2Cexp302%3A8983%2Fsolr%2Fcore5%2Cexp302%3A8983%2Fsolr%2Fcore6%2Cexp302%3A8983%2Fsolr%2Fcore7%2Cexp302%3A8983%2Fsolr%2Fcore8%2Cexp302%3A8983%2Fsolr%2Fcore9%2Cexp302%3A8983%2Fsolr%2Fcore10%2Cexp302%3A8983%2Fsolr%2Fcore11sort=score%20desc%2C%20query_score%20descindent=onwt=exontimeAllowed=50fl=resulting_phrase%2Cquery_id%2Ctype%2Chighlightingstart=0vrows=4rows=16limit=20q=%7B!exonautocomplete%7DBARCELONAgyvl7cn3 We're implementing a query autocomplete system, therefore our Lucene documents are queries. query_score is a field that is indexed and stored with every document. It expresses how popular a given query is (i.e. common queries like hotels in barcelona have a bigger query_score than less common queries like hotels in barcelona near the beach). Let me know if you need something else. Thanks, Carlos Carlos Gonzalez-Cadenas CEO, ExperienceOn - New generation search http://www.experienceon.com Mobile: +34 652 911 201 Skype: carlosgonzalezcadenas LinkedIn: http://www.linkedin.com/in/carlosgonzalezcadenas On Mon, Feb 20, 2012 at 3:12 PM, Em mailformailingli...@yahoo.de wrote: Could you please provide me the original request (the HTTP-request)? I am a little bit confused to what query_score refers. As far as I can see it isn't a magic-value. 
Kind regards, Em
postCommit confusion?
in a Solr master/slave replication, if I register a postCommit listener on a slave, which index reader should I get if I do:

@Override
public final void postCommit() {
    final RefCounted<SolrIndexSearcher> refC = core.getNewestSearcher(true);
    try {
        final Map<String, String> userData =
                refC.get().getIndexReader().getIndexCommit().getUserData();
        // do something with userData
    } catch (IOException e) {
        log.error("PostCommit: ", e);
    } finally {
        refC.decref();
    }
}

What I observed is that I get stale userData; is this correct? Doesn't commit replace the IndexReader at the actual commit point? (I observe userData that was there before replication finished, but I expected to see the userData version from the master at this stage.) If I force core.openNewSearcher(false, false); I get the correct, replicated userData I just received from the master… What am I doing wrong? What is the contract of core.getNewestSearcher(true) when called in postCommit(), or better, when does Solr update the commit point? Not so important for the particular problem, but interesting to know these life cycles. Thanks, eks
Is Sphinx better suited to me, or should I look at Solr?
I am creating what is effectively a search engine. Content is collected via spiders and then inserted into my database, where it becomes searchable and filterable. I envision there being around 90K records to be searched at any one time. The content is blog posts and forum posts, so we are basically looking at full text with some additional filters based on location, category and date posted. What is really important to me is speed and relevancy. The index size or index time really isn't too big of an issue. From the benchmarks I have seen it looks like Sphinx is much faster at querying data and showing results, but that Solr has improved relevancy. My website is coded entirely in PHP and I am planning on using a MySQL database. Can anyone please give me a bit of input and help me decide which product might be better suited to me. Regards, James -- View this message in context: http://lucene.472066.n3.nabble.com/Is-Sphinx-better-suited-to-me-or-should-I-look-at-Solr-tp3760988p3760988.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: custom scoring
Hi Carlos, query_score is a field that is indexed and stored with every document. Thanks for clarifying that, now the whole query-string makes more sense to me. Did you check whether query() - without product() and pow() - is also much slower than a normal query? I guess, if the performance-decrease without product() and pow() is not that large, you are hitting the small overhead that comes with every function query. It would be nice, if you could check that. However, let's take a step back and look what you really want to achieve instead of how you are trying to achieve it right now. You want to influence the score of your actual query by a value that represents a combination of some static values and the likelyness of how good a query matches a document. From your query, I can see that you are using the same fields in your FunctionQuery and within your MainQuery (let's call the q-param MainQuery). This means that the scores of your query()-method and your MainQuery should be identical. Let's call this value just score and rename your field query_score popularity. I don't know how you are implementing the FunctionQuery (boost by multiplication, boost by addition), but it seems clear to me that your formula looks this way: score x (score^0.5*popularity) where x is kind of an operator (+,*,...) Why don't you reduce it to score * boost(log(popularity)). This is a trade-off between precision and performance. You could even improve the above by setting the doc's boost equal to log(populary) at indexing time. What do you think about that? Regards, Em Am 20.02.2012 15:37, schrieb Carlos Gonzalez-Cadenas: Hi Em: The HTTP request is not gonna help you a lot because we use a custom QParser (that builds the query that I've pasted before). 
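[Editor's note: Em's two formulas can be compared with plain arithmetic. A small sketch (not Solr code; the scores and popularity values are invented, and it assumes the should-clause simply adds the FunctionQuery value to the main score):]

```java
// Sketch comparing the thread's current combination,
//   score + sqrt(score) * popularity   (should-clause adds the function value)
// with Em's cheaper suggestion,
//   score * log(popularity).
// All numbers are invented for illustration.
public class BoostSketch {
    public static double current(double score, double popularity) {
        return score + Math.sqrt(score) * popularity;
    }

    public static double suggested(double score, double popularity) {
        return score * Math.log(popularity);
    }

    public static void main(String[] args) {
        // weaker text match on a very popular query vs.
        // stronger text match on a less popular query
        System.out.println(current(2.0, 50.0) + " vs " + current(4.0, 10.0));
        System.out.println(suggested(2.0, 50.0) + " vs " + suggested(4.0, 10.0));
    }
}
```

With the current formula the very popular document wins regardless of text relevance (72.7 vs 24.0); with the log-dampened boost the stronger text match wins (7.82 vs 9.21), and the log form is also the cheaper computation at query time.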
How to index a facetfield by searching words matching from another Textfield
Hi everyone, I'm a new Solr user, but I used to work on Endeca. There is a module called Text Tagger in Endeca that auto-indexes values into a facet field (multivalued) when it finds words (from a given word list) in another text field of that document. I didn't see any subjects or any way to do this with Solr? Thanks in advance ;) -- View this message in context: http://lucene.472066.n3.nabble.com/How-to-index-a-facetfield-by-searching-words-matching-from-another-Textfield-tp3761201p3761201.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Is Sphinx better suited to me, or should I look at Solr?
Hi James, I can not speak for Sphinx, since I never used it. However, from reading your requirements there is nothing that fears Solr. Although Sphinx is written in C++, running Solr on top of a HotSpot JVM gives you high performance. Furthermore the HotSpot JVM is optimizing your code at runtime, which sometimes allows long-running applications to run as fast as software written in C++ (and sometimes even faster). Given that Solr is pretty fast and scalable (90k docs are a really small index), you should have a closer look at the features each search-server provides to you and how they suit your needs. You should always keep in mind that users will gladly wait a few milliseconds longer for their highly-relevant search-results, but do not care about a blazing fast 5ms response-time for a collection of trash-results. So try to find out what your concrete needs in terms of relevancy are and which search-server provides you the tools to go. I am pretty sure that both projects provide you php-client-libraries etc. for indexing and searching (Solr does). Kind regards, Em
Re: How to index a facetfield by searching words matching from another Textfield
Hi Xavier, sounds like a job for KeepWordFilter! From the javadocs: "A TokenFilter that only keeps tokens with text contained in the required words. This filter behaves like the inverse of StopFilter." However, you have to provide the wordslist as a .txt-file. By using copyFields and the KeepWordFilter you are able to achieve what you want. Kind regards, Em
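[Editor's note: a minimal schema.xml sketch of Em's suggestion. All field, type and file names here are invented examples, not from the thread: the free-text field is copied into a facet field whose analyzer keeps only the words from the list.]

```xml
<!-- Sketch: tag_facet keeps only tokens listed in tags.txt -->
<fieldType name="text_tags" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeepWordFilterFactory" words="tags.txt" ignoreCase="true"/>
  </analyzer>
</fieldType>

<field name="body"      type="text_general" indexed="true" stored="true"/>
<field name="tag_facet" type="text_tags"    indexed="true" stored="false" multiValued="true"/>
<copyField source="body" dest="tag_facet"/>
```

At index time every token of body that also appears in tags.txt ends up in tag_facet, which can then be faceted on with facet.field=tag_facet.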
lucene operators interfering in edismax
Hi, I am using edismax with end-user entered strings. One search was not finding what appeared to be the best match. The search was: Sage Creek Organics - Enchanted If I remove the -, the doc I want is found with the best score. It turns out (I think) the - is the culprit, as the best match has 'enchanted' and this makes it 'NOT enchanted'. Is my analysis correct? I tried looking at the debug output but saw no NOT entries there... If so, is there a standard way (any filter) to remove Lucene operators from user-entered queries? I thought this must be something usual. thanks javi -- View this message in context: http://lucene.472066.n3.nabble.com/lucene-operators-interfearing-in-edismax-tp3761577p3761577.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: lucene operators interfering in edismax
This should be fixed in trunk by LUCENE-2566 QueryParser: Unary operators +,-,! will not be treated as operators if they are followed by whitespace. -Yonik lucidimagination.com
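[Editor's note: until you are on a version with the LUCENE-2566 fix, a common workaround is to strip lone operator characters from user input before building the query. A rough sketch of such a sanitizer; this is an assumption about the application's input pipeline, not a Solr feature:]

```java
// Sketch: removes +, -, ! when they stand alone (surrounded by whitespace or at
// the start/end of the input), so "Sage Creek Organics - Enchanted" no longer
// negates "Enchanted". Operators attached to a term (e.g. "-foo") are kept.
public class QuerySanitizer {
    public static String stripLoneOperators(String q) {
        return q.replaceAll("(?:^|(?<=\\s))[+\\-!](?=\\s|$)", "")  // drop lone operators
                .replaceAll("\\s+", " ")                            // collapse leftover spaces
                .trim();
    }

    public static void main(String[] args) {
        System.out.println(stripLoneOperators("Sage Creek Organics - Enchanted"));
        // -> Sage Creek Organics Enchanted
    }
}
```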
Exception importing multi-valued UUID field
Hi, I exported a csv file from Solr and made some changes. I then tried to reimport the file and got the exception below. It seems the UUID field type can't import multi-values; I removed all of the multi-values and it imported without an issue. Cheers

org.apache.solr.common.SolrException: Error while creating field 'jobuid{type=uuid,properties=indexed,stored,omitTermFreqAndPositions,multiValued}' from value '845b9db2-2a25-44e3-8eb4-3bf17cd16738,c5477d5d-e77c-45e9-ab61-f7ca05499b37'
    at org.apache.solr.schema.FieldType.createField(FieldType.java:239)
    at org.apache.solr.schema.SchemaField.createField(SchemaField.java:104)
    at org.apache.solr.update.DocumentBuilder.addField(DocumentBuilder.java:203)
    at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:276)
    at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:60)
    at org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:115)
    at org.apache.solr.handler.CSVLoader.doAdd(CSVRequestHandler.java:416)
    at org.apache.solr.handler.SingleThreadedCSVLoader.addDoc(CSVRequestHandler.java:431)
    at org.apache.solr.handler.CSVLoader.load(CSVRequestHandler.java:393)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:67)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
    at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:241)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1368)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:300)
    at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:859)
    at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:588)
    at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
    at java.lang.Thread.run(Thread.java:679)
Caused by: org.apache.solr.common.SolrException: Invalid UUID String: '845b9db2-2a25-44e3-8eb4-3bf17cd16738,c5477d5d-e77c-45e9-ab61-f7ca05499b37'
    at org.apache.solr.schema.UUIDField.toInternal(UUIDField.java:85)
    at org.apache.solr.schema.FieldType.createField(FieldType.java:237)
Re: Exception importing multi-valued UUID field
I also tried it with the comma escaped, so: '845b9db2-2a25-44e3-8eb4-3bf17cd16738\,c5477d5d-e77c-45e9-ab61-f7ca05499b37' So that's in the same format as it was exported, Excel must have removed the slash. But I still get the error with the slash.
Re: Is Sphinx better suited to me, or should I look at Solr?
I gave up on sphinx and went to solr. I feel it is more mature. For example, sphinx didn't have an auto start init script and they tried to hit me up for consultancy fees cos I asked a simple question. I use php and use solarium php client. Nice oop interface. Solr has a great community. My initial struggles were with getting it running, mostly because I don't know much about tomcat and it didn't just work for me as documented, but once i stumbled through it was ok. My search results accross 200k documents is instant on a small 512mb rackspacecloud instance so you will have no probs at all using solr for your needs. Sent from my iPhone On 21/02/2012, at 3:32 AM, Em mailformailingli...@yahoo.de wrote: Hi James, I can not speak for Sphinx, since I never used it. However, from reading your requirements there is nothing that fears Solr. Although Sphinx is written in C++, running Solr on top of a HotSpot JVM gives you high performance. Furthermore the HotSpot JVM is optimizing your code at runtime which sometimes allows long-running applications to run as fast as software written in C++ (and sometimes even faster). Given that Solr is pretty fast and scalable (90k docs are a really small index), you should have a closer look at the features each search-server provides to you and how they suit your needs. You should always keep in mind that users will gladly wait a few milliseconds longer for their highly-relevant search-results, but do not care about a blazing fast 5ms response-time for a collection of trash-results. So try to find out what your concrete needs in terms of relevancy are and which search-server provides you the tools to go. I am pretty sure that both projects provide you php-client-libraries etc. for indexing and searching (Solr does). Kind regards, Em Am 20.02.2012 16:20, schrieb Spadez: I am creating what is effectively a search engine. Content is collected via spiders at then is inserted into my database and becomes searchable and filterable. 
I envision there being around 90K records to be searched at any one time. The content is blog posts and forum posts, so we are basically looking at full text with some additional filters based on location, category and date posted. What is really important to me is speed and relevancy; the index size and indexing time really aren't too big of an issue. From the benchmarks I have seen, it looks like Sphinx is much faster at querying data and showing results, but that Solr has better relevancy. My website is coded entirely in PHP and I am planning on using a MySQL database. Can anyone please give me a bit of input and help me decide which product might be better suited to me? Regards, James -- View this message in context: http://lucene.472066.n3.nabble.com/Is-Sphinx-better-suited-to-me-or-should-I-look-at-Solr-tp3760988p3760988.html Sent from the Solr - User mailing list archive at Nabble.com.
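The query shape James describes (one full-text search plus filters on location, category and date posted) maps naturally onto Solr's q and fq parameters. A minimal sketch of building such a request URL; the field names `location`, `category` and `posted` are hypothetical placeholders for whatever the schema actually defines:

```python
from urllib.parse import urlencode

# One scored full-text query (q) plus filter queries (fq).
# Filter queries restrict the result set without affecting scoring,
# and Solr caches them independently, which keeps repeated
# filter combinations fast.
params = [
    ("q", "text:(search terms)"),          # main full-text query, scored
    ("fq", "location:london"),             # hypothetical filter fields
    ("fq", "category:blog"),
    ("fq", "posted:[NOW-30DAYS TO NOW]"),  # date-range filter
    ("rows", "10"),
]
url = "http://localhost:8983/solr/select?" + urlencode(params)
print(url)
```

Because fq clauses are cached and unscored, putting the location/category/date restrictions there rather than in q is usually the right split for this kind of workload.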
Re: Exception importing multi-valued UUID field
On Mon, Feb 20, 2012 at 7:26 PM, Greg Pelly gfpe...@gmail.com wrote: I exported a CSV file from Solr and made some changes; when I tried to reimport the file I got the exception below. It seems the UUID field type can't import multiple values: once I removed all of the multi-values, it imported without an issue. Did you try split=true? http://wiki.apache.org/solr/UpdateCSV#split -Yonik lucidimagination.com
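To make the split=true suggestion concrete: the failure happens because the whole comma-separated cell is handed to the UUID field as a single value. split=true asks the CSV loader to break it apart first, which a sketch in plain Python illustrates (the UUIDs are the ones from the reported exception):

```python
# What split=true asks Solr's CSV loader to do with a multivalued cell:
# treat the comma-separated string as several values, not one (invalid) UUID.
cell = "845b9db2-2a25-44e3-8eb4-3bf17cd16738,c5477d5d-e77c-45e9-ab61-f7ca05499b37"
values = cell.split(",")
print(values)
```

Without the split, the entire 73-character string reaches UUIDField.toInternal, which rejects it as an invalid UUID.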
Re: Exception importing multi-valued UUID field
I don't think escaping is your problem; you probably want to take that bit out. Try adding f.youruuidfieldname.split=true when importing. You might also have to specify something like f.youruuidfieldname.separator=, but probably not, since I suspect comma is the default. See the split heading at: http://wiki.apache.org/solr/UpdateCSV Although, out of curiosity, I have to ask about your use case: is this some kind of 1-n mapping to other docs? Best Erick On Mon, Feb 20, 2012 at 7:43 PM, Greg Pelly gfpe...@gmail.com wrote: I also tried it with the comma escaped, so: '845b9db2-2a25-44e3-8eb4-3bf17cd16738\,c5477d5d-e77c-45e9-ab61-f7ca05499b37' That's in the same format as it was exported; Excel must have removed the slash. But I still get the error with the slash. On Tue, Feb 21, 2012 at 11:26 AM, Greg Pelly gfpe...@gmail.com wrote: Hi, I exported a CSV file from Solr and made some changes; when I tried to reimport the file I got the exception below. It seems the UUID field type can't import multiple values: once I removed all of the multi-values, it imported without an issue.
Cheers

org.apache.solr.common.SolrException: Error while creating field 'jobuid{type=uuid,properties=indexed,stored,omitTermFreqAndPositions,multiValued}' from value '845b9db2-2a25-44e3-8eb4-3bf17cd16738,c5477d5d-e77c-45e9-ab61-f7ca05499b37'
    at org.apache.solr.schema.FieldType.createField(FieldType.java:239)
    at org.apache.solr.schema.SchemaField.createField(SchemaField.java:104)
    at org.apache.solr.update.DocumentBuilder.addField(DocumentBuilder.java:203)
    at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:276)
    at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:60)
    at org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:115)
    at org.apache.solr.handler.CSVLoader.doAdd(CSVRequestHandler.java:416)
    at org.apache.solr.handler.SingleThreadedCSVLoader.addDoc(CSVRequestHandler.java:431)
    at org.apache.solr.handler.CSVLoader.load(CSVRequestHandler.java:393)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:67)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
    at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:241)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1368)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:300)
    at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:859)
    at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:588)
    at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
    at java.lang.Thread.run(Thread.java:679)
Caused by: org.apache.solr.common.SolrException: Invalid UUID String: '845b9db2-2a25-44e3-8eb4-3bf17cd16738,c5477d5d-e77c-45e9-ab61-f7ca05499b37'
    at org.apache.solr.schema.UUIDField.toInternal(UUIDField.java:85)
    at org.apache.solr.schema.FieldType.createField(FieldType.java:237)
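Erick's per-field suggestion can be sketched as the parameter set for a CSV update request. The field name `jobuid` comes from the stack trace; the file path is hypothetical, and the separator is shown only for clarity since, per the UpdateCSV wiki, comma is the default:

```python
from urllib.parse import urlencode

# Parameters for Solr's CSV update handler, enabling per-field splitting
# so the multivalued "jobuid" cell is broken into separate UUID values.
# The stream.file path is a hypothetical example.
params = [
    ("stream.file", "/tmp/solr-export.csv"),
    ("f.jobuid.split", "true"),     # break the cell into multiple values
    ("f.jobuid.separator", ","),    # comma is the default; shown for clarity
    ("commit", "true"),
]
url = "http://localhost:8983/solr/update/csv?" + urlencode(params)
print(url)
```

With split enabled, each UUID in the cell is handed to UUIDField.toInternal individually, so the "Invalid UUID String" exception above should no longer occur.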