Re: Solr Benchmarks

2006-11-09 Thread Joachim Martin

Hi Walter,

Thunderbird shows that there is an attachment to this message in the
message list, but when I view the message, no attachment is available.
Could you try sending this attachment again?


Thanks --Joachim

Walter Underwood wrote:


I've done some testing using JMeter. I followed the instructions
in the JMeter FAQ for "How do I use external data files in my
test scripts?"

  http://wiki.apache.org/jakarta-jmeter/JMeterFAQ

I'm attaching the script I built with this. A few notes:

 



Re: [Newbie] Solr Setup

2006-10-03 Thread Joachim Martin
If you have deployed solr as a root application, tomcat may be getting 
confused with the /admin/ url, thinking that it is the tomcat 
administration app.


If you have it deployed, I would rename the /admin/ app to be /tadmin/ 
or something to distinguish from the solr /admin/ directory.


--Joachim

Panayiotis Papadopoulos wrote:


It prompts for HTTP authorization asking for password for Admin Realm







Re: Fixed first hits -> custom RequestHandler?

2006-09-28 Thread Joachim Martin

How about a sortOrder field?  Then you can sort by "sortOrder, score".

If you want to promote a book that might not be in the result set, you'd 
OR the featured books in with the query.
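
A minimal sketch of the idea above, outside Solr: the featured ids are ORed into the user's query, and hits are ordered by sortOrder first and score second. The `id` field name, the 0/1 values for sortOrder, and the sample ids are assumptions for illustration only, not part of any existing schema.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    doc_id: str
    score: float
    sort_order: int  # hypothetical field: 0 = promoted, 1 = regular


def build_query(user_query, featured_ids):
    """OR the featured documents into the user's query so they always match.

    The "id" field name is an assumption about the schema.
    """
    if not featured_ids:
        return user_query
    featured = " OR ".join(f"id:{doc_id}" for doc_id in featured_ids)
    return f"({user_query}) OR {featured}"


def rank(hits):
    """Emulate sorting by "sortOrder, score": sortOrder ascending, score descending."""
    return sorted(hits, key=lambda hit: (hit.sort_order, -hit.score))


hits = [
    Hit("b1", 0.92, 1),
    Hit("lia", 0.40, 0),  # the promoted book, despite its low score
    Hit("b2", 0.75, 1),
]
ordered_ids = [hit.doc_id for hit in rank(hits)]
```

The promoted book comes first even though its score is the lowest, and the rest keep their relevance order.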


--Joachim

Otis Gospodnetic wrote:


Hello,

I have a situation where I want certain documents to appear at the top of the 
hit list for certain searches, regardless of their score.  One can think of it 
as the ads right on top of Google's search results (but I'm not dealing with 
ads).

Example:
If I'm searching books in a bookstore, and a person is searching for "lucene", the owner of the bookstore may want to 
promote the recently published "Lucene in Action" instead of some other book about Lucene, so he wants any search for 
"lucene" or "java search" to put the link to "Lucene in Action" on top.

Is there a good way to accomplish this in Solr?
My initial thoughts are that it would be best to have an external store, maybe 
even a Lucene index.  This store would host the data to display on top of hits, 
as well as keywords/phrases that would have to match user's search terms.  A 
custom RequestHandler would then perform a regular search (a la any of the 
existing RequestHandlers), plus pull the data from this side store, and stick 
those in the response.

Is this a good candidate for a custom RequestHandler?

Thanks,
Otis


 





Re: Simple Faceted Searching out of the box

2006-09-22 Thread Joachim Martin
I think you will find that this architecture is quite common.  What
commercial packages provide (remember you are getting this for free!)
are the tools for managing the dynamic export of data out of your
database into the full-text search engine.

Solr provides a very easy way to do this, but yes, you have to do some
programming to automate it.

There are two common ways of doing this.  1) write a component that
periodically checks for new/updated database content and submits it to
solr.  2) write a trigger in the database that immediately posts to solr
(I would use JMS or some other asynchronous messaging system for this).
I'm sure there are other solutions.
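
As a rough sketch of option 1, assuming a hypothetical `jobs` table with an `updated_at` column: pick up the rows changed since the last run and turn each one into a Solr `<add><doc>` update message. Posting the messages to the update URL and sending a commit afterwards is left out; the table, column names and dates are made up for the example.

```python
import sqlite3
import xml.etree.ElementTree as ET


def changed_rows(conn, since):
    """Return rows of the hypothetical `jobs` table modified after `since`."""
    cur = conn.execute(
        "SELECT id, title, updated_at FROM jobs WHERE updated_at > ? ORDER BY id",
        (since,),
    )
    columns = [col[0] for col in cur.description]
    return [dict(zip(columns, row)) for row in cur.fetchall()]


def row_to_add_xml(row):
    """Serialize one row as a Solr <add><doc>...</doc></add> update message."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in row.items():
        # A list or tuple becomes a repeated <field>, i.e. a multi-valued field.
        values = value if isinstance(value, (list, tuple)) else [value]
        for item in values:
            if item is None:
                continue
            ET.SubElement(doc, "field", name=name).text = str(item)
    return ET.tostring(add, encoding="unicode")


# In-memory database standing in for the real one.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, title TEXT, updated_at TEXT)")
conn.executemany(
    "INSERT INTO jobs VALUES (?, ?, ?)",
    [(1, "Night editor", "2006-09-20"), (2, "Press operator", "2006-09-22")],
)
messages = [row_to_add_xml(row) for row in changed_rows(conn, "2006-09-21")]
```

Only the row changed after the last sync date produces a message. Since re-adding a document with the same unique key replaces the old one, the same code path can serve inserts and edits.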

When/if MySQL full-text search is as good as solr/lucene, you can cut
out one of the steps.


I could see a component added to solr that did #1 above for you.  MG4J
has a simple loader that takes a SQL query and indexes the result
(JdbcDocumentCollection).  For Solr, you'd want to be able to handle
multi-valued fields, which complicates things.


If this architecture bothers technical folks, they either are accustomed
to using very expensive software, or haven't been doing this very long.

Of course, I am trying to figure out a way to make Solr more like a
database, so there you go...

--Joachim

Tim Archambault wrote:


Okay, I'll use an example.

A recruitment (jobs) customer goes onto our website and posts an online
job posting to our newspaper website. Upon insert into the database, I
need to generate an xml file to be sent to SOLR to ADD as a record to the
search engine. The same goes for an edit: my database updates the record
and then I have to send an ADD statement to Solr again to commit my
change. 2x the work.

I've been talking with other papers about Solr and I think what bothers
many is that there is a deposit of information in a structured database
here [named A], then we have another set of basically the same data over
here [named B] and they don't understand why they have to manage two
different sets of data [A & B] that are virtually the same thing.  Many
foresee a maintenance nightmare. I've come to the conclusion that there's
somewhat of a disconnect between what a database does and what a search
engine does. I accept that the redundancy is necessary given the very
different tasks that each performs [keep in mind I'm still naive to the
programming details here, I understand conceptually].

In writing this to you another thought came to mind. Maybe there are
alternative ways to inject records into Solr outside the bounds of the
cygwin and CURL examples I've been using. Maybe that is the question we
need to be asking. What are some alternative ways to populate Solr?

Enough said, it's Friday afternoon.

Have a great weekend.

Tim

On 9/22/06, Erik Hatcher <[EMAIL PROTECTED]> wrote:




On Sep 22, 2006, at 2:45 PM, Tim Archambault wrote:
> I believe there's a way to access MSSQL, MySQL etc. directly with
> Lucene,
> but not sure how to do this with SOLR.

Nope.  Lucene is a pure search engine, with no hooks to databases, or
document parsers, etc.  Lots of folks have built these kinds of
things on top of Lucene, but the Lucene core is purely the text engine.

How would you envision communicating with Solr with a database in the
picture?   How would the entire database be initially indexed?  How
would changes to the database trigger Solr updates?   I'm not quite
clear on what it would mean for Solr to work with a database directly
so I'm curious.

Erik








Re: relational design in solr?

2006-09-22 Thread Joachim Martin

Chris,

I think what I am trying to do is actually much simpler than what you
are talking about here.  I do plan on returning document ids and
retrieving full entity data from the database; solr would just be used
for the search, not for results display.

The problem is that some data cannot be "flattened", for example when a
document has repeating fields that are complex types, such as address.

The best example I can think of is a resume database.  You could
certainly just put the whole resume document into the text index and do
full text searches.  But to answer the question of which people received
a Harvard MBA in the last 10 years and have worked at Intel in the last
5 years, you have to correlate the years of attendance with the
schoolName entry.  Otherwise you might be getting years for some other
education/work history entry.

By adding an objType field and combining search results, you can be
sure that the year/schoolName query matched a unique education record.
The tricky bit is in getting a list of field values (e.g. foreign keys,
which are essentially facets) for a result set very quickly.
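
To make the resume example concrete, here is a small in-memory sketch of the objType idea: each education or work entry is its own record carrying a foreign key back to the resume, each search is evaluated against one record at a time, and the resulting key sets are intersected. The field names (`resumeId`, `degree`, `endYear`) and the cutoff years are assumptions for illustration; a real version would get the key sets from the index rather than from a Python list.

```python
def parents_matching(docs, obj_type, predicate):
    """Return the foreign keys of all records of one objType that match."""
    return {
        doc["resumeId"]
        for doc in docs
        if doc["objType"] == obj_type and predicate(doc)
    }


docs = [
    {"objType": "education", "resumeId": 1, "schoolName": "Harvard", "degree": "MBA", "year": 2001},
    {"objType": "work", "resumeId": 1, "employer": "Intel", "endYear": 2005},
    # Resume 2 has "Harvard", "MBA" and a recent year, but never in one record.
    {"objType": "education", "resumeId": 2, "schoolName": "Harvard", "degree": "BA", "year": 1985},
    {"objType": "education", "resumeId": 2, "schoolName": "State", "degree": "MBA", "year": 2001},
    {"objType": "work", "resumeId": 2, "employer": "Intel", "endYear": 2004},
]

harvard_mba = parents_matching(
    docs,
    "education",
    lambda d: d["schoolName"] == "Harvard" and d["degree"] == "MBA" and d["year"] >= 1996,
)
intel = parents_matching(
    docs,
    "work",
    lambda d: d["employer"] == "Intel" and d["endYear"] >= 2001,
)
matches = harvard_mba & intel  # the "join" on the foreign key
```

A fully flattened resume 2 would have matched, because all the individual terms occur somewhere in it. With one record per entry, only resume 1 survives the intersection.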

If this can be done, figuring out a generic way of specifying multiple
searches and relationships between result sets (without reinventing SQL)
becomes the challenge.

We'll see.  I have my doubts that it will work for any but the smallest
of collections, which ours certainly isn't.

Thanks --Joachim

Chris Hostetter wrote:


While it's certainly possible to "join" the results of multiple indexes, I
would do so only when absolutely necessary -- in my experience the only
time I've found that it makes sense is when one aspect of the data
changes extremely rapidly compared to everything else, making complex
reindexing a pain, but reindexing just the changed data in its own index
is a lot more feasible.

As a rule of thumb, when building "paginated" style search applications, I
would advise people to try and flatten their index as much as possible, so
that the application can do one "user query" (based on the user's input)
to get a single page of results, and then use the uniqueKeys from that
page of results to lookup ancillary data from any other indexes (or
databases that you need) -- the key being that all the data you want to
search on, and all the data you need to sort are in the index, but other
data you need to return to the user can come from other sources.

If you find yourself wanting to "join" two indexes for the purposes of
matching or sorting, the amount of work you wind up doing tends to be
prohibitive on really large indexes -- and if your indexes aren't that
large, it would probably just be easier to put everything in one index and
rebuild it frequently.

: I am trying to integrate solr search results with results from a rdbms
: query.  It's working ok, but fairly complicated  due to large size of
: the results from the database, and many different sort requirements.
:
: I know that solr/lucene was not designed to intelligently handle
: multiple document types in the same collection, i.e. provide join
: features, but I'm wondering if anyone on this list has any thoughts on
: how to do it in lucene, and how it might be integrated into a custom
: solr deployment.  I can't see going back to vanilla lucene after solr!
:
: My basic idea is to add an objType field that would be used to define a
: "table".  There would be one main objType, any related objTypes would
: have a field pointing back to the main objs via id, like a foreign key.
:
: I'd run multiple parallel searches and merge the results based on
: foreign keys, either using a Filter or just using custom code.  I'm
: anticipating that iterating through the results to retrieve the foreign
: key values will be too slow.
:
: Our data is highly textual, temporal and spatial, which pretty much
: correspond to the 3 tables I would have.  I can de-normalize a lot of
: the data, but the combination of times, locations and textual
: representations would be way too large to fully flatten.
:
: I'm about to start experimenting with different strategies, and I would
: appreciate any insight anyone can provide.  Would the faceting code help
: here somehow?



-Hoss
 





relational design in solr?

2006-09-19 Thread Joachim Martin
I am trying to integrate solr search results with results from a rdbms 
query.  It's working ok, but fairly complicated  due to large size of 
the results from the database, and many different sort requirements.


I know that solr/lucene was not designed to intelligently handle 
multiple document types in the same collection, i.e. provide join 
features, but I'm wondering if anyone on this list has any thoughts on 
how to do it in lucene, and how it might be integrated into a custom 
solr deployment.  I can't see going back to vanilla lucene after solr!


My basic idea is to add an objType field that would be used to define a 
"table".  There would be one main objType, any related objTypes would 
have a field pointing back to the main objs via id, like a foreign key.


I'd run multiple parallel searches and merge the results based on 
foreign keys, either using a Filter or just using custom code.  I'm 
anticipating that iterating through the results to retrieve the foreign 
key values will be too slow.


Our data is highly textual, temporal and spatial, which pretty much 
correspond to the 3 tables I would have.  I can de-normalize a lot of 
the data, but the combination of times, locations and textual 
representations would be way too large to fully flatten.


I'm about to start experimenting with different strategies, and I would 
appreciate any insight anyone can provide.  Would the faceting code help 
here somehow?


Thanks --Joachim







Re: Facet performance with heterogeneous 'facets'?

2006-09-19 Thread Joachim Martin

Michael Imbeault wrote:

Also, are there any plans to add an option not to run a facet search if
the result set is too big? To avoid 40-second queries if the docset
is too large...



You could run one query with facet=false, check the result size and then 
run it again (should be fast because it is cached) with 
facet=true&rows=0 to get facet results only.
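
A sketch of that two-step flow, with the HTTP call abstracted away: `fetch` is a hypothetical callable that takes the request parameters and returns the parsed response as a dict with `numFound` (and `facets` on the second call). The threshold, the response keys and the facet field are assumptions; only the `facet`, `facet.field` and `rows` parameters come from Solr itself.

```python
def search_with_optional_facets(fetch, query, max_docs_for_facets, facet_field):
    """Run the plain query first; ask for facets only if the result set is small."""
    first = fetch({"q": query, "facet": "false"})
    facets = None
    if first["numFound"] <= max_docs_for_facets:
        # Same query again, facets only: rows=0 skips returning documents.
        second = fetch(
            {"q": query, "facet": "true", "facet.field": facet_field, "rows": 0}
        )
        facets = second["facets"]
    return first, facets


# A stand-in for the real HTTP client, recording what was requested.
calls = []


def fake_fetch(params):
    calls.append(params)
    if params["facet"] == "true":
        return {"numFound": 120, "facets": {"journal": {"nature": 7}}}
    return {"numFound": 120}


small_result, small_facets = search_with_optional_facets(fake_fetch, "cancer", 1000, "journal")
calls_for_small = len(calls)
big_result, big_facets = search_with_optional_facets(fake_fetch, "cancer", 100, "journal")
```

With the limit at 1000 both requests are sent; with the limit at 100 the facet request is skipped and only the document list comes back.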


I would think that the decision to run/not run facets would be highly 
custom to your collection and not easily developed as a configurable 
feature.


--Joachim


SolrCore as Singleton?

2006-09-07 Thread Joachim Martin

Is there a good reason for implementing SolrCore as a Singleton?

We are experimenting with running Solr as a Spring service embedded in
our app.  Since it is a Singleton we cannot have more than one index (not
currently a problem, but could be).

I note the comment:

 // Singleton for now...

If there is no specific reason for making it a Singleton, I'd vote for
removing this so that the SolrCore(dataDir, schema) constructor could be
used to instantiate multiple cores.


Seems to me that since the primary usage scenario of solr is access via
REST (i.e. no Solr jar/API), the Singleton pattern is not necessary here.

--Joachim


Re: Solr now used on Discogs.com

2006-09-06 Thread Joachim Martin
Can you expand on this a bit? 

"Main search engine" would be the search feature, but not 
browsing/category listing?


Are you using Solr for all data storage and search?  Or a RDBMS?  If so, 
what is the split?


Cool site!

--Joachim

Kevin Lewandowski wrote:


I just wanted to say thanks to the Solr developers.

I'm now using Solr for the main search engine on Discogs.com. I've
been through five revisions of the search engine and this was
definitely the least painful. Solr gives me the power of Lucene
without having to deal with the guts. It made for a much faster
implementation than all other search packages I've worked with.

Some stats: there are now 1.1 million documents in the index and it
handles 200,000 searches per day (on a single-cpu P4 server with 1 gig
ram).

Kevin





Re: Embarrasing compilation errors with solr-nightly/example

2006-06-28 Thread Joachim Martin

Sounds to me like you are using the JRE and not a JDK.

Make sure $JAVA_HOME/lib/tools.jar is in your classpath.
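
A quick way to check, sketched as a small script: a JDK of this era ships lib/tools.jar (which contains com.sun.tools.javac.Main), while a plain JRE does not. The script only looks for that file; it does not inspect the classpath, and the messages are my own wording.

```python
import os
from pathlib import Path


def has_tools_jar(java_home):
    """True if java_home contains lib/tools.jar, as a JDK of this era does."""
    return (Path(java_home) / "lib" / "tools.jar").is_file()


java_home = os.environ.get("JAVA_HOME")
if java_home is None:
    print("JAVA_HOME is not set")
elif has_tools_jar(java_home):
    print("JAVA_HOME points at a JDK")
else:
    print("JAVA_HOME has no lib/tools.jar; it is probably a JRE")
```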

--Joachim

James Pine wrote:


I am trying to walk through the Solr tutorial at:
http://incubator.apache.org/solr/tutorial.html and
can't seem to get:
http://localhost:8983/solr/admin/index.jsp

to compile. Here's the top of the error, I've included
the rest at the end of the message:

HTTP ERROR: 500
Unable to compile class for JSP
Generated servlet error:
Jun 28, 2006 9:52:42 AM
org.apache.jasper.compiler.Compiler generateClass
SEVERE: Javac exception 
Unable to find a javac compiler;

com.sun.tools.javac.Main is not on the classpath.
Perhaps JAVA_HOME does not point to the JDK

I believe my workstation is set up to run/compile java
applications because it's part of my job ;o) but
apparently something is amiss. I'm running on a
windows box, using cygwin. My JAVA_HOME and CLASSPATH
environment variables are set up properly AFAIK and I'm
running the 1.5.0_04 JDK. The rest of the stacktrace
appears below. Thanx for your help.

JAMES


at org.apache.tools.ant.taskdefs.compilers.CompilerAdapterFactory.getCompiler(CompilerAdapterFactory.java:105)
at org.apache.tools.ant.taskdefs.Javac.compile(Javac.java:929)
at org.apache.tools.ant.taskdefs.Javac.execute(Javac.java:758)
at org.apache.jasper.compiler.Compiler.generateClass(Compiler.java:382)
at org.apache.jasper.compiler.Compiler.compile(Compiler.java:472)
at org.apache.jasper.compiler.Compiler.compile(Compiler.java:451)
at org.apache.jasper.compiler.Compiler.compile(Compiler.java:439)
at org.apache.jasper.JspCompilationContext.compile(JspCompilationContext.java:511)
at org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:295)
at org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:292)
at org.apache.jasper.servlet.JspServlet.service(JspServlet.java:236)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:689)
at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:428)
at org.mortbay.jetty.servlet.WebApplicationHandler.dispatch(WebApplicationHandler.java:473)
at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:568)
at org.mortbay.http.HttpContext.handle(HttpContext.java:1530)
at org.mortbay.jetty.servlet.WebApplicationContext.handle(WebApplicationContext.java:633)
at org.mortbay.http.HttpContext.handle(HttpContext.java:1482)
at org.mortbay.http.HttpServer.service(HttpServer.java:909)
at org.mortbay.http.HttpConnection.service(HttpConnection.java:820)
at org.mortbay.http.HttpConnection.handleNext(HttpConnection.java:986)
at org.mortbay.http.HttpConnection.handle(HttpConnection.java:837)
at org.mortbay.http.SocketListener.handleConnection(SocketListener.java:245)
at org.mortbay.util.ThreadedServer.handle(ThreadedServer.java:357)
at org.mortbay.util.ThreadPool$PoolThread.run(ThreadPool.java:534)


Generated servlet error:
Jun 28, 2006 9:52:42 AM
org.apache.jasper.compiler.Compiler generateClass


Generated servlet error:
SEVERE: Env: Compile:
javaFileName=/C:/DOCUME~1/user/LOCALS~1/Temp/Jetty__8983__solr//org/apache/jsp/admin\index_jsp.java


Generated servlet error:
  
classpath=/C:/Documents%20and%20Settings/user/Local%20Settings/Temp/Jetty__8983__solr/webapp/WEB-INF/classes/;/C:/Documents%20and%20Settings/user/Local%20Settings/Temp/Jetty__8983__solr/webapp/WEB-INF/lib/lucene-core-nightly.jar;/C:/Documents%20and%20Settings/user/Local%20Settings/Temp/Jetty__8983__solr/webapp/WEB-INF/lib/lucene-highlighter-nightly.jar;/C:/Documents%20and%20Settings/user/Local%20Settings/Temp/Jetty__8983__solr/webapp/WEB-INF/lib/lucene-snowball-nightly.jar;/C:/Documents%20and%20Settings/user/Local%20Settings/Temp/Jetty__8983__solr/webapp/WEB-INF/lib/xpp3-1.1.3.4.O.jar;C:\DOCUME~1\user\LOCALS~1\Temp\Jetty__8983__solr;C:\Documents

and Settings\user\Local
Settings\Temp\Jetty__8983__solr\webapp\WEB-INF\classes;C:\Documents
and Settings\user\Local
Settings\Temp\Jetty__8983__solr\webapp\WEB-INF\lib\lucene-core-nightly.jar;C:\Documents
and Settings\user\Local
Settings\Temp\Jetty__8983__solr\webapp\WEB-INF\lib\lucene-highlighter-nightly.jar;C:\Documents
and Settings\user\Local
Settings\Temp\Jetty__8983__solr\webapp\WEB-INF\lib\lucene-snowball-nightly.jar;C:\Documents
and Settings\user\Local
Settings\Temp\Jetty__8983__solr\webapp\WEB-INF\lib\xpp3-1.1.3.4.O.jar;C:\Program
Files\Java\jre1.5.0_06\lib\ext\dnsns.jar;C:\Program
Files\Java\jre1.5.0_06\lib\ext\jai_codec.jar;C:\Program
Files\Java\jre1.5.0_06\lib\ext\jai_core.jar;C:\Program
Files\Java\jre1.5.0_06\lib\ext\mlibwrapper_jai.jar;C:\Program
Files\Java\jre1.5.0_06\lib\ext\sunjce_provider.jar;C:\Program
Files\Java\jre1.5.0_06\lib\ext\sunpkcs11.jar;C:\solr-nightly\example\start.jar;C:\solr-nightly\example\lib\org.mortbay.jetty.jar;C:\solr-nightly\example\lib\jav

Re: embedding solr in a webapp?

2006-06-07 Thread Joachim Martin
Certainly running a load balanced solr cluster will be our first
approach; I was just wondering if there were any glaring problems with
running solr embedded in each webapp node.  Sounds like there are not.


As for the secondary db lookups, those will be cached, and are necessary
to filter results further based on time (schedule) restrictions.

We will probably also implement a custom ResponseWriter that just
returns a comma separated list of ids, since the IPC time is just one
component of the overhead; xml parsing is another.
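
To illustrate the wire format only (this is not Solr's ResponseWriter API, which would be a Java class), here is what writing and reading such a response could look like. It assumes integer document ids; the function names are hypothetical.

```python
def write_ids(doc_ids):
    """Serialize document ids as a single comma separated line."""
    return ",".join(str(doc_id) for doc_id in doc_ids)


def read_ids(body):
    """Parse the comma separated response back into a list of integer ids."""
    return [int(token) for token in body.split(",") if token]


body = write_ids([42, 7, 1305])
ids = read_ids(body)
```

The client side is a single split, with no XML parser involved, and the order of the ids carries the ranking.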

Thanks  --Joachim

Yonik Seeley wrote:

On 6/7/06, Joachim Martin <[EMAIL PROTECTED]> wrote:

We are looking at running read-only solr nodes embedded in our webapp
nodes.  This would give us the
additional features of solr over lucene, but would keep it in memory and
reduce the overhead of http/xml
transport of results.

Looks like we would just create a request handler and call
handleRequest(req,rsp), and deal with the
search results DocList ourselves.


Yes, that should work fine.


Would there be any reason why this sort of setup would prohibit the use
of index replication in a master/slave
setup?


No, that should still work fine.


Does this make sense?  As you might guess, speed is more important than
flexibility.


It can make sense in certain cases... but it does cut down on your
flexibility to size the search tier independently of the appserver
tier.

Eliminating the IPC might get you 5% more performance, but at what
development & flexibility cost?  It's easier to buy a slightly faster
box, or simply add another server if you are running behind a
load-balancer.  You know your situation best of course :-)


 We are using solr for
a content search, returning ids, and doing a secondary db lookup for
extended entity information.


You go through the trouble of avoiding one IPC call, but you add it
back in with the DB lookup... are the fields too large to store in
Lucene?

-Yonik




embedding solr in a webapp?

2006-06-07 Thread Joachim Martin

Hi,

We are looking at running read-only solr nodes embedded in our webapp
nodes.  This would give us the additional features of solr over lucene,
but would keep it in memory and reduce the overhead of http/xml
transport of results.

Looks like we would just create a request handler and call
handleRequest(req,rsp), and deal with the search results DocList
ourselves.

Would there be any reason why this sort of setup would prohibit the use
of index replication in a master/slave setup?

Does this make sense?  As you might guess, speed is more important than
flexibility.  We are using solr for a content search, returning ids, and
doing a secondary db lookup for extended entity information.


Thanks --Joachim