Re: Near Duplicate Documents

2007-11-20 Thread Rishabh Joshi
Otis,

Thanks for your response.

I took a quick look at the Nutch forum and found that there is an
implementation for de-duplicating documents/pages, but none for near-duplicate
documents. Can you guide me a little further as to where exactly
under Nutch I should be concentrating, regarding near-duplicate documents?

Regards,
Rishabh

On Nov 21, 2007 12:41 PM, Otis Gospodnetic <[EMAIL PROTECTED]>
wrote:

> To whomever started this thread: look at Nutch.  I believe something
> related to this already exists in Nutch for near-duplicate detection.
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
> - Original Message 
> From: Mike Klaas <[EMAIL PROTECTED]>
> To: solr-user@lucene.apache.org
> Sent: Sunday, November 18, 2007 11:08:38 PM
> Subject: Re: Near Duplicate Documents
>
> On 18-Nov-07, at 8:17 AM, Eswar K wrote:
>
> > Is there any plan to implement that feature in the upcoming
> > releases?
>
> Not currently.  Feel free to contribute something if you find a good
> solution.
>
> -Mike
>
>
> > On Nov 18, 2007 9:35 PM, Stuart Sierra <[EMAIL PROTECTED]> wrote:
> >
> >> On Nov 18, 2007 10:50 AM, Eswar K <[EMAIL PROTECTED]> wrote:
> >>> We have a scenario where we want to find documents which are
> >>> similar in content. To elaborate a little more on what we mean
> >>> here, let's take an example.
> >>>
> >>> The email chain in which we are interacting is a good illustration
> >>> of the concept of near dupes (we are not confusing them with
> >>> threads; those are two different things). Each email in this thread
> >>> is treated as a document by the system. A reply to the original
> >>> mail also includes the original mail, in which case it becomes a
> >>> near duplicate of the original mail (depending on the percentage of
> >>> similarity), and so on. Near dupes need not be limited to emails.
> >>
> >> I think this is what's known as "shingling."  See
> >> http://en.wikipedia.org/wiki/W-shingling
> >> Lucene (and therefore Solr) does not implement shingling.  The
> >> "MoreLikeThis" query might be close enough, however.
> >>
> >> -Stuart
> >>
>
>
>
>
>
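[Editor's note: the thread never names a concrete algorithm beyond Stuart's
pointer, so here is a minimal illustration of the w-shingling idea - split
each document into overlapping w-word shingles and compare the shingle sets
with Jaccard similarity. This is background material, not Nutch's or Solr's
implementation; all class and method names are made up for the sketch:]

    import java.util.HashSet;
    import java.util.Set;

    public class Shingler {
        // Build the set of w-word shingles (word n-grams) for a document.
        static Set<String> shingles(String text, int w) {
            String[] words = text.toLowerCase().split("\\s+");
            Set<String> out = new HashSet<String>();
            for (int i = 0; i + w <= words.length; i++) {
                StringBuilder sb = new StringBuilder();
                for (int j = 0; j < w; j++) {
                    if (j > 0) sb.append(' ');
                    sb.append(words[i + j]);
                }
                out.add(sb.toString());
            }
            return out;
        }

        // Jaccard similarity |A n B| / |A u B|; near 1.0 means near-dupe.
        static double jaccard(Set<String> a, Set<String> b) {
            Set<String> inter = new HashSet<String>(a);
            inter.retainAll(b);
            Set<String> union = new HashSet<String>(a);
            union.addAll(b);
            return union.isEmpty() ? 0.0 : (double) inter.size() / union.size();
        }

        public static void main(String[] args) {
            Set<String> d1 = shingles("a rose is a rose is a rose", 4);
            Set<String> d2 = shingles("a rose is a rose is a flower", 4);
            System.out.println(jaccard(d1, d2)); // 0.75 -> near duplicates
        }
    }

A reply that quotes the original mail, as in Eswar's example, shares most of
its shingles with the original, which is exactly what pushes the similarity
toward 1.0.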


Re: two solr instances - index and commit

2007-11-20 Thread Otis Gospodnetic
Uh, avoid NFS and Lucene/Solr, unless you really really don't care about 
performance.  We recently benchmarked Lucene indexing+searching+... on 1) local 
disk, 2) SAN, and 3) NFS.

You have the right to a single guess - which of the three was the 
slowest?

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message 
From: Kasi Sankaralingam <[EMAIL PROTECTED]>
To: "solr-user@lucene.apache.org" 
Sent: Tuesday, November 13, 2007 6:48:03 PM
Subject: RE: two solr instances - index and commit

This works; the only thing you need to be aware of is the stale NFS
problem if you are running in a distributed environment sharing an NFS
partition.

a) Index and commit on one instance (typically partitioned as an index
server)
b) Issue a commit on the search server (which acts like a read-only mode)

Things to watch out for: you will get stale NFS handle problems. I
replaced the Lucene core that is shipped with Solr with the latest one
and it works.

-Original Message-
From: Jae Joo [mailto:[EMAIL PROTECTED]
Sent: Tuesday, November 13, 2007 9:06 AM
To: solr-user
Subject: two solr instances - index and commit

Hi,

I have two Solr instances running under different Tomcat environments.
One Solr instance is for indexing, and I would like to commit to the
other Solr instance.

This is what I tried, but it failed:
Using post.sh (without commit), the docs are indexed in the solr-1
instance. After indexing, I call the commit command with the attribute
of solr-2.

Can anyone help me?

Jae
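
[Editor's note: what "issue a commit on the search server" looks like on the
wire, for readers searching the archive. This is a sketch, assuming the two
instances share one index directory (e.g. over NFS, as Kasi describes) and
that the search instance listens on port 8081 - both assumptions, not facts
from the thread:]

    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class RemoteCommit {
        public static void main(String[] args) throws Exception {
            // POST <commit/> to the *search* instance so it reopens its
            // searcher and sees the segments written by the index instance.
            URL url = new URL("http://localhost:8081/solr/update");
            HttpURLConnection con = (HttpURLConnection) url.openConnection();
            con.setDoOutput(true);  // switches the request to POST
            con.setRequestProperty("Content-Type", "text/xml; charset=utf-8");
            OutputStream out = con.getOutputStream();
            out.write("<commit/>".getBytes("UTF-8"));
            out.close();
            System.out.println("HTTP " + con.getResponseCode());
        }
    }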





Re: Any tips for indexing large amounts of data?

2007-11-20 Thread Eswar K
That's great.

At what size of the index do you think we should look at partitioning the
index file?

Eswar

On Nov 21, 2007 12:57 PM, Otis Gospodnetic <[EMAIL PROTECTED]>
wrote:

> Just tried a search for "web" on this index - 1.1 seconds.  This matches
> about 1MM of about 20MM docs.  Redo the search, and it's 1 ms (cached).
>  This is without any load nor serious benchmarking, clearly.
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
> - Original Message 
> From: Eswar K <[EMAIL PROTECTED]>
> To: solr-user@lucene.apache.org
> Sent: Wednesday, November 21, 2007 2:11:07 AM
> Subject: Re: Any tips for indexing large amounts of data?
>
> Hi Otis,
>
> I understand this is a slightly off-track question, but I am just
> curious to know the performance of search on a 20 GB index file. What
> has been your observation?
>
> Regards,
> Eswar
>
> On Nov 21, 2007 12:33 PM, Otis Gospodnetic <[EMAIL PROTECTED]>
> wrote:
>
> > Mike is right about the occasional slow-down, which appears as a pause
> > and is due to large Lucene index segment merging.  This should go away
> > with newer versions of Lucene where this is happening in the background.
> >
> > That said, we just indexed about 20MM documents on a single 8-core
> > machine with 8 GB of RAM, resulting in a nearly 20 GB index.  The whole
> > process took a little less than 10 hours - that's over 550 docs/second.
> > The vanilla approach before some of our changes apparently required
> > several days to index the same amount of data.
> >
> > Otis
> > --
> > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> >
> > - Original Message 
> > From: Mike Klaas <[EMAIL PROTECTED]>
> > To: solr-user@lucene.apache.org
> > Sent: Monday, November 19, 2007 5:50:19 PM
> > Subject: Re: Any tips for indexing large amounts of data?
> >
> > There should be some slowdown in larger indices as occasionally large
> > segment merge operations must occur.  However, this shouldn't really
> > affect overall speed too much.
> >
> > You haven't really given us enough data to tell you anything useful.
> > I would recommend trying to do the indexing via a webapp to eliminate
> > all your code as a possible factor.  Then, look for signs to what is
> > happening when indexing slows.  For instance, is Solr high in cpu, is
> > the computer thrashing, etc?
> >
> > -Mike
> >
> > On 19-Nov-07, at 2:44 PM, Brendan Grainger wrote:
> >
> > > Hi,
> > >
> > > Thanks for answering this question a while back. I have made some
> > > of the suggestions you mentioned, i.e. not committing until I've
> > > finished indexing. What I am seeing, though, is that as the index
> > > gets larger (around 1 GB), indexing is taking a lot longer. In fact
> > > it slows down to a crawl. Have you got any pointers as to what I
> > > might be doing wrong?
> > >
> > > Also, I was looking at using MultiCore solr. Could this help in
> > > some way?
> > >
> > > Thank you
> > > Brendan
> > >
> > > On Oct 31, 2007, at 10:09 PM, Chris Hostetter wrote:
> > >
> > >>
> > >> : I would think you would see better performance by allowing auto
> > >> commit
> > >> : to handle the commit size instead of reopening the connection
> > >> all the
> > >> : time.
> > >>
> > >> if your goal is "fast" indexing, don't use autoCommit at all ... just
> > >> index everything, and don't commit until you are completely done.
> > >>
> > >> autoCommitting will slow your indexing down (the benefit being
> > >> that more
> > >> results will be visible to searchers as you proceed)
> > >>
> > >>
> > >>
> > >>
> > >> -Hoss
> > >>
> > >
> >
> >
> >
> >
> >
>
>
>
>
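
[Editor's note: Hoss's "don't commit until you are completely done" advice
above, as a posting pattern. A minimal sketch against the plain /update
endpoint; the URL, document count, and field name are placeholders:]

    import java.io.OutputStreamWriter;
    import java.io.Writer;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class BulkIndexer {
        static void post(String xml) throws Exception {
            HttpURLConnection con = (HttpURLConnection)
                new URL("http://localhost:8983/solr/update").openConnection();
            con.setDoOutput(true);  // POST
            con.setRequestProperty("Content-Type", "text/xml; charset=utf-8");
            Writer w = new OutputStreamWriter(con.getOutputStream(), "UTF-8");
            w.write(xml);
            w.close();
            if (con.getResponseCode() != 200)
                throw new RuntimeException("update failed: " + con.getResponseCode());
        }

        public static void main(String[] args) throws Exception {
            for (int i = 0; i < 1000; i++) {   // index everything first...
                post("<add><doc><field name=\"id\">" + i + "</field></doc></add>");
            }
            post("<commit/>");                 // ...then commit exactly once
        }
    }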


Re: Performance question: Solr 64 bit java vs 32 bit mode.

2007-11-20 Thread Otis Gospodnetic
Solr runs equally well on both 64-bit and 32-bit systems.

Your 15 second problem could be caused by IO bottleneck (not likely if your 
index is small and fits in RAM), could be concurrency (esp. if you are using 
compound index format), could be something else on production killing your CPU, 
could be the JVM being busy sweeping the garbage out, etc.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message 
From: Robert Purdy <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Thursday, November 15, 2007 4:05:00 PM
Subject: Performance question: Solr 64 bit java vs 32 bit mode.


Would anyone know if Solr runs better in 64-bit Java vs 32-bit, and could
anyone answer another possibly related question?

I currently have two servers running Solr under identical Tomcat
installations. One is the production server and is under heavy user load,
and the other is under no load at all because it is a test box.

I was looking in the logs on the production server and noticed some
queries were taking about 15 seconds, and this is after auto-warming. So
I decided to execute the same query on the other server with nothing in
the caches and found that it only took 2 seconds to complete.

My question is: why would a dual Intel Core Duo Xserve server in 64-bit
Java mode with 8 GB of RAM allocated to the Tomcat server be slower than
a dual PowerPC G5 server running in 32-bit mode with only 2 GB of RAM
allocated? Is it because the load/concurrency issues on the production
server made the time next to the query in the log greater on the
production server? If so, what is the best way to configure Tomcat to
deal with that issue?

Thanks Robert.
-- 
View this message in context:
 
http://www.nabble.com/Performance-question%3A-Solr-64-bit-java-vs-32-bit-mode.-tf4817186.html#a13781791
Sent from the Solr - User mailing list archive at Nabble.com.






Re: Any tips for indexing large amounts of data?

2007-11-20 Thread Otis Gospodnetic
Just tried a search for "web" on this index - 1.1 seconds.  This matches about 
1MM of about 20MM docs.  Redo the search, and it's 1 ms (cached).  This is 
without any load nor serious benchmarking, clearly.  

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message 
From: Eswar K <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Wednesday, November 21, 2007 2:11:07 AM
Subject: Re: Any tips for indexing large amounts of data?

Hi Otis,

I understand this is a slightly off-track question, but I am just curious
to know the performance of search on a 20 GB index file. What has been
your observation?

Regards,
Eswar

On Nov 21, 2007 12:33 PM, Otis Gospodnetic <[EMAIL PROTECTED]>
wrote:

> Mike is right about the occasional slow-down, which appears as a pause
> and is due to large Lucene index segment merging.  This should go away
> with newer versions of Lucene where this is happening in the background.
>
> That said, we just indexed about 20MM documents on a single 8-core
> machine with 8 GB of RAM, resulting in a nearly 20 GB index.  The whole
> process took a little less than 10 hours - that's over 550 docs/second.
> The vanilla approach before some of our changes apparently required
> several days to index the same amount of data.
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
> - Original Message 
> From: Mike Klaas <[EMAIL PROTECTED]>
> To: solr-user@lucene.apache.org
> Sent: Monday, November 19, 2007 5:50:19 PM
> Subject: Re: Any tips for indexing large amounts of data?
>
> There should be some slowdown in larger indices as occasionally large
> segment merge operations must occur.  However, this shouldn't really
> affect overall speed too much.
>
> You haven't really given us enough data to tell you anything useful.
> I would recommend trying to do the indexing via a webapp to eliminate
> all your code as a possible factor.  Then, look for signs to what is
> happening when indexing slows.  For instance, is Solr high in cpu, is
> the computer thrashing, etc?
>
> -Mike
>
> On 19-Nov-07, at 2:44 PM, Brendan Grainger wrote:
>
> > Hi,
> >
> > Thanks for answering this question a while back. I have made some
> > of the suggestions you mentioned, i.e. not committing until I've
> > finished indexing. What I am seeing, though, is that as the index
> > gets larger (around 1 GB), indexing is taking a lot longer. In fact
> > it slows down to a crawl. Have you got any pointers as to what I
> > might be doing wrong?
> >
> > Also, I was looking at using MultiCore solr. Could this help in
> > some way?
> >
> > Thank you
> > Brendan
> >
> > On Oct 31, 2007, at 10:09 PM, Chris Hostetter wrote:
> >
> >>
> >> : I would think you would see better performance by allowing auto
> >> commit
> >> : to handle the commit size instead of reopening the connection
> >> all the
> >> : time.
> >>
> >> if your goal is "fast" indexing, don't use autoCommit at all ... just
> >> index everything, and don't commit until you are completely done.
> >>
> >> autoCommitting will slow your indexing down (the benefit being
> >> that more
> >> results will be visible to searchers as you proceed)
> >>
> >>
> >>
> >>
> >> -Hoss
> >>
> >
>
>
>
>
>





Re: Performance of Solr on different Platforms

2007-11-20 Thread Otis Gospodnetic
Most of Sematext's customers seem to be RH fans.  I've seen some Ubuntu, some 
Debian, and some SuSE users.  RH feels "safe". :)  Some use Solaris.  Some are 
going crazy with Xen, putting everything in VMs.

RAM - as much as you can afford, as usual.
CPU - AMD Opterons performed the best the last time I benchmarked a bunch of 
different types of hardware - stay away from Sun Niagara servers for Lucene/Solr.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message 
From: Eswar K <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Sunday, November 18, 2007 11:15:48 AM
Subject: Performance of Solr on different Platforms

Hi,

I understand that Solr can be used on different Linux flavors. Is there
any preferred flavor (like Red Hat, Ubuntu, etc.)?
Also, what kind of hardware configuration (processors, RAM, etc.) would
be best suited for the install?
We expect to load it with millions of documents (varying from 2 to 20
million). There might be around 1000 concurrent users.

Your help in this regard will be appreciated.

Regards,
Eswar





Re: Near Duplicate Documents

2007-11-20 Thread Otis Gospodnetic
To whomever started this thread: look at Nutch.  I believe something related to 
this already exists in Nutch for near-duplicate detection.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message 
From: Mike Klaas <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Sunday, November 18, 2007 11:08:38 PM
Subject: Re: Near Duplicate Documents

On 18-Nov-07, at 8:17 AM, Eswar K wrote:

> Is there any plan to implement that feature in the upcoming
> releases?

Not currently.  Feel free to contribute something if you find a good
solution.

-Mike


> On Nov 18, 2007 9:35 PM, Stuart Sierra <[EMAIL PROTECTED]> wrote:
>
>> On Nov 18, 2007 10:50 AM, Eswar K <[EMAIL PROTECTED]> wrote:
>>> We have a scenario where we want to find documents which are
>>> similar in content. To elaborate a little more on what we mean
>>> here, let's take an example.
>>>
>>> The email chain in which we are interacting is a good illustration
>>> of the concept of near dupes (we are not confusing them with
>>> threads; those are two different things). Each email in this thread
>>> is treated as a document by the system. A reply to the original
>>> mail also includes the original mail, in which case it becomes a
>>> near duplicate of the original mail (depending on the percentage of
>>> similarity), and so on. Near dupes need not be limited to emails.
>>
>> I think this is what's known as "shingling."  See
>> http://en.wikipedia.org/wiki/W-shingling
>> Lucene (and therefore Solr) does not implement shingling.  The
>> "MoreLikeThis" query might be close enough, however.
>>
>> -Stuart
>>






Re: Any tips for indexing large amounts of data?

2007-11-20 Thread Eswar K
Hi Otis,

I understand this is a slightly off-track question, but I am just curious to
know the performance of search on a 20 GB index file. What has been your
observation?

Regards,
Eswar

On Nov 21, 2007 12:33 PM, Otis Gospodnetic <[EMAIL PROTECTED]>
wrote:

> Mike is right about the occasional slow-down, which appears as a pause and
> is due to large Lucene index segment merging.  This should go away with
> newer versions of Lucene where this is happening in the background.
>
> That said, we just indexed about 20MM documents on a single 8-core machine
> with 8 GB of RAM, resulting in nearly 20 GB index.  The whole process took a
> little less than 10 hours - that's over 550 docs/second.  The vanilla
> approach before some of our changes apparently required several days to
> index the same amount of data.
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
> - Original Message 
> From: Mike Klaas <[EMAIL PROTECTED]>
> To: solr-user@lucene.apache.org
> Sent: Monday, November 19, 2007 5:50:19 PM
> Subject: Re: Any tips for indexing large amounts of data?
>
> There should be some slowdown in larger indices as occasionally large
> segment merge operations must occur.  However, this shouldn't really
> affect overall speed too much.
>
> You haven't really given us enough data to tell you anything useful.
> I would recommend trying to do the indexing via a webapp to eliminate
> all your code as a possible factor.  Then, look for signs to what is
> happening when indexing slows.  For instance, is Solr high in cpu, is
> the computer thrashing, etc?
>
> -Mike
>
> On 19-Nov-07, at 2:44 PM, Brendan Grainger wrote:
>
> > Hi,
> >
> > Thanks for answering this question a while back. I have made some
> > of the suggestions you mentioned, i.e. not committing until I've
> > finished indexing. What I am seeing, though, is that as the index
> > gets larger (around 1 GB), indexing is taking a lot longer. In fact
> > it slows down to a crawl. Have you got any pointers as to what I
> > might be doing wrong?
> >
> > Also, I was looking at using MultiCore solr. Could this help in
> > some way?
> >
> > Thank you
> > Brendan
> >
> > On Oct 31, 2007, at 10:09 PM, Chris Hostetter wrote:
> >
> >>
> >> : I would think you would see better performance by allowing auto
> >> commit
> >> : to handle the commit size instead of reopening the connection
> >> all the
> >> : time.
> >>
> >> if your goal is "fast" indexing, don't use autoCommit at all ... just
> >> index everything, and don't commit until you are completely done.
> >>
> >> autoCommitting will slow your indexing down (the benefit being
> >> that more
> >> results will be visible to searchers as you proceed)
> >>
> >>
> >>
> >>
> >> -Hoss
> >>
> >
>
>
>
>
>


Re: Any tips for indexing large amounts of data?

2007-11-20 Thread Otis Gospodnetic
Mike is right about the occasional slow-down, which appears as a pause and is 
due to large Lucene index segment merging.  This should go away with newer 
versions of Lucene where this is happening in the background.

That said, we just indexed about 20MM documents on a single 8-core machine with 
8 GB of RAM, resulting in nearly 20 GB index.  The whole process took a little 
less than 10 hours - that's over 550 docs/second.  The vanilla approach before 
some of our changes apparently required several days to index the same amount 
of data.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message 
From: Mike Klaas <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Monday, November 19, 2007 5:50:19 PM
Subject: Re: Any tips for indexing large amounts of data?

There should be some slowdown in larger indices as occasionally large  
segment merge operations must occur.  However, this shouldn't really  
affect overall speed too much.

You haven't really given us enough data to tell you anything useful.   
I would recommend trying to do the indexing via a webapp to eliminate  
all your code as a possible factor.  Then, look for signs to what is  
happening when indexing slows.  For instance, is Solr high in cpu, is  
the computer thrashing, etc?

-Mike

On 19-Nov-07, at 2:44 PM, Brendan Grainger wrote:

> Hi,
>
> Thanks for answering this question a while back. I have made some
> of the suggestions you mentioned, i.e. not committing until I've
> finished indexing. What I am seeing, though, is that as the index
> gets larger (around 1 GB), indexing is taking a lot longer. In fact
> it slows down to a crawl. Have you got any pointers as to what I
> might be doing wrong?
>
> Also, I was looking at using MultiCore solr. Could this help in  
> some way?
>
> Thank you
> Brendan
>
> On Oct 31, 2007, at 10:09 PM, Chris Hostetter wrote:
>
>>
>> : I would think you would see better performance by allowing auto  
>> commit
>> : to handle the commit size instead of reopening the connection  
>> all the
>> : time.
>>
>> if your goal is "fast" indexing, don't use autoCommit at all ... just
>> index everything, and don't commit until you are completely done.
>>
>> autoCommitting will slow your indexing down (the benefit being  
>> that more
>> results will be visible to searchers as you proceed)
>>
>>
>>
>>
>> -Hoss
>>
>






Re: Help with Debian solr/jetty install?

2007-11-20 Thread Otis Gospodnetic
Phillip,

I won't go into details, but I'll point out that the Java compiler is called 
javac and, if memory serves me well, it is defined in one of Jetty's XML config 
files in its etc/ dir.  The Java compiler is used to compile the JSPs that Solr 
uses for the admin UI.  So, make sure you have javac and make sure Jetty can 
find it.
 
Otis

--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message 
From: Phillip Farber <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Tuesday, November 20, 2007 5:55:27 PM
Subject: Help with Debian solr/jetty install?


Hi,

I've successfully run as far as the example admin page on Debian Linux 2.6.

So I installed the solr-jetty package from Debian testing, which gives me
Jetty 5.1.14-1 and Solr 1.2.0+ds1-1.  Jetty starts fine and so does the
Solr home page at http://localhost:8280/solr

But I get an error when I try to run http://localhost:8280/solr/admin

HTTP ERROR: 500
No Java compiler available

I have the sun-java6-jre and sun-java6-jdk packages installed.  I'm new to
servlet containers and Java webapps.  What should I be looking for to
fix this, or what information could I provide the list to get me moving
forward from here?

I've included the trace from the Jetty log and the Java properties dump
from the example below.

Thanks,
Phil

---

Java properties (from the example):
--

sun.boot.library.path = /usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/i386
java.vm.version = 1.6.0-b105
java.vm.name = Java HotSpot(TM) Client VM
user.dir = /tmp/apache-solr-1.2.0/example
java.runtime.version = 1.6.0-b105
os.arch = i386
java.io.tmpdir = /tmp

java.library.path = 
/usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/i386/client:/usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/i386:/usr/lib/jvm/java-6-sun-1.6.0.00/jre/../lib/i386:/usr/java/packages/lib/i386:/lib:/usr/lib
java.class.version = 50.0
jetty.home = /tmp/apache-solr-1.2.0/example
sun.management.compiler = HotSpot Client Compiler
os.version = 2.6.22-2-686
java.class.path = 
/tmp/apache-solr-1.2.0/example:/tmp/apache-solr-1.2.0/example/lib/jetty-6.1.3.jar:/tmp/apache-solr-1.2.0/example/lib/jetty-util-6.1.3.jar:/tmp/apache-solr-1.2.0/example/lib/servlet-api-2.5-6.1.3.jar:/tmp/apache-solr-1.2.0/example/lib/jsp-2.1/ant-1.6.5.jar:/tmp/apache-solr-1.2.0/example/lib/jsp-2.1/core-3.1.1.jar:/tmp/apache-solr-1.2.0/example/lib/jsp-2.1/jsp-2.1.jar:/tmp/apache-solr-1.2.0/example/lib/jsp-2.1/jsp-api-2.1.jar:/usr/share/ant/lib/ant.jar
java.home = /usr/lib/jvm/java-6-sun-1.6.0.00/jre
java.version = 1.6.0
java.ext.dirs = 
/usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/ext:/usr/java/packages/lib/ext
sun.boot.class.path = 
/usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/resources.jar:/usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/rt.jar:/usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/sunrsasign.jar:/usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/jsse.jar:/usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/jce.jar:/usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/charsets.jar:/usr/lib/jvm/java-6-sun-1.6.0.00/jre/classes




Jetty log (from the error under Debian Solr/Jetty):


org.apache.jasper.JasperException: No Java compiler available
    at org.apache.jasper.servlet.JspServletWrapper.handleJspException(JspServletWrapper.java:460)
    at org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:367)
    at org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:329)
    at org.apache.jasper.servlet.JspServlet.service(JspServlet.java:265)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
    at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:428)
    at org.mortbay.jetty.servlet.WebApplicationHandler.dispatch(WebApplicationHandler.java:473)
    at org.mortbay.jetty.servlet.Dispatcher.dispatch(Dispatcher.java:286)
    at org.mortbay.jetty.servlet.Dispatcher.forward(Dispatcher.java:171)
    at org.mortbay.jetty.servlet.Default.handleGet(Default.java:302)
    at org.mortbay.jetty.servlet.Default.service(Default.java:223)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
    at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:428)
    at org.mortbay.jetty.servlet.WebApplicationHandler$CachedChain.doFilter(WebApplicationHandler.java:830)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:185)
    at org.mortbay.jetty.servlet.WebApplicationHandler$CachedChain.doFilter(WebApplicationHandler.java:821)
    at org.mortbay.jetty.servlet.WebApplicationHandler.dispatch(WebApplicationHandler.java:471)
    at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:568)
    at org.mortbay.http.HttpContext.handle(HttpContext.java:1530)
    at org.mortbay.jetty.servlet.WebApplicationContext.handle(WebApplicationContext.java:633)
    at org.mortbay.http.HttpContext.handle(HttpContext.java:1482)
    at org.mortbay.http.HttpServer.service(HttpServer.java
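
[Editor's note: a quick way to test whether the JVM that launches Jetty is a
full JDK rather than a bare JRE - the usual cause of Jasper's "No Java
compiler available". This check is my addition (it uses javax.tools, present
since Java 6, which the poster is running); it does not test Jetty 5's own
compiler lookup, but a null result here means no javac is on board:]

    import javax.tools.JavaCompiler;
    import javax.tools.ToolProvider;

    public class CompilerCheck {
        public static void main(String[] args) {
            // Returns null when running on a JRE without a compiler.
            JavaCompiler jc = ToolProvider.getSystemJavaCompiler();
            System.out.println(jc == null ? "no compiler (JRE only?)"
                                          : "compiler available");
            System.out.println("java.home = " + System.getProperty("java.home"));
        }
    }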

Finding the right place to start ...

2007-11-20 Thread Tracy Flynn

I'm trying to find the right place to start in this community.


I recently posted a question in the thread on SOLR-236.  In that  
posting I mentioned that I was hoping to persuade my management to  
move from a FAST installation to a SOLR-based one.  The changeover was  
approved in principle today.


Our application is a large Rails application. I integrated Solr and  
created a proof-of-concept that covered almost all existing  
functionality and projected new functionality for 2008.


So, I have a few requests for information and possibly help.

I will need the result collapsing described in SOLR-236 to deploy
Solr; it's an absolute requirement. I understand that it's to be
available in Solr 1.3. Is there updated information on the timetable
for Solr 1.3 and what's to be included?


I would also very much like to have SOLR-103 (the SQL upload plugin)
available, though I think I have a workaround if it isn't in Solr 1.3.


I would be happy to offer help in any way I can - e.g. with testing.

If someone can point me to the places I need to look to find  
information that bears on these questions, I'm happy to go and dig.


Thanks for any help.

Tracy Flynn



Re: Problems with Basic Install (newbie question)

2007-11-20 Thread Chris Hostetter

: As far as I know, I do have a full JDK. I'm on OS X and it should come with
: a full JDK:
: http://developer.apple.com/java/

well, 1) it depends on which version of "OS X" you are running (10.1,
10.2?, 10.3?, 10.4?, 10.5?) but I don't think that's your problem ...

you said you could see the admin screen before (so JSPs were working), then
you installed Solr-Drupal, and then you couldn't get the admin screens to
work, and you were getting this exception.

did you manually start Jetty when you saw this error? or did Drupal? ...
either way might be the cause of the problem ... if Drupal started Jetty,
it may be trying to run as a "nobody" user, and can't write to Jetty's tmp
dir for JSPs since you already created it as yourself.

you might also be seeing a varient of this issue...

https://issues.apache.org/jira/browse/SOLR-118



-Hoss



Re: facet - associated fields

2007-11-20 Thread Norberto Meijome
On Tue, 20 Nov 2007 17:39:58 -0500
"Jae Joo" <[EMAIL PROTECTED]> wrote:

> Hi,
> Can anyone help me how to facet and/or search for associated fields? -

 http://wiki.apache.org/solr/SimpleFacetParameters  

_
{Beto|Norberto|Numard} Meijome

Fear not the path of truth for the lack of people walking on it.

I speak for myself, not my employer. Contents may be hot. Slippery when wet. 
Reading disclaimers makes you go blind. Writing them is worse. You have been 
Warned.


Re: Solr cluster topology.

2007-11-20 Thread Norberto Meijome
On Tue, 20 Nov 2007 16:26:27 -0600
Alexander Wallace <[EMAIL PROTECTED]> wrote:

> Interesting, this ALL MASTERS mode... I guess you don't do any  
> replication then...

correct

> In the single master, several slaves mode, I'm assuming the client  
> still writes to one and reads from the others... right?

Correct again.

There is also another approach which I think in SOLR is called FederatedSearch 
, where a front end queries a number of index servers (each with overlapping or 
non-overlapping data sets) and puts together 1 result stream for the answer. 
There was some discussion on the list,  
http://www.mail-archive.com/solr-user@lucene.apache.org/msg06081.html is the 
earliest link in the archive i can find .

B
_
{Beto|Norberto|Numard} Meijome

"People demand freedom of speech to make up for the freedom of thought which 
they avoid. " 
  Soren Aabye Kierkegaard

I speak for myself, not my employer. Contents may be hot. Slippery when wet. 
Reading disclaimers makes you go blind. Writing them is worse. You have been 
Warned.


Help with Debian solr/jetty install?

2007-11-20 Thread Phillip Farber


Hi,

I've successfully run as far as the example admin page on Debian Linux 2.6.

So I installed the solr-jetty package from Debian testing, which gives me 
Jetty 5.1.14-1 and Solr 1.2.0+ds1-1.  Jetty starts fine and so does the 
Solr home page at http://localhost:8280/solr


But I get an error when I try to run http://localhost:8280/solr/admin

HTTP ERROR: 500
No Java compiler available

I have the sun-java6-jre and sun-java6-jdk packages installed.  I'm new to 
servlet containers and Java webapps.  What should I be looking for to 
fix this, or what information could I provide the list to get me moving 
forward from here?


I've included the trace from the Jetty log and the Java properties dump 
from the example below.


Thanks,
Phil

---

Java properties (from the example):
--

sun.boot.library.path = /usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/i386
java.vm.version = 1.6.0-b105
java.vm.name = Java HotSpot(TM) Client VM
user.dir = /tmp/apache-solr-1.2.0/example
java.runtime.version = 1.6.0-b105
os.arch = i386
java.io.tmpdir = /tmp

java.library.path = 
/usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/i386/client:/usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/i386:/usr/lib/jvm/java-6-sun-1.6.0.00/jre/../lib/i386:/usr/java/packages/lib/i386:/lib:/usr/lib

java.class.version = 50.0
jetty.home = /tmp/apache-solr-1.2.0/example
sun.management.compiler = HotSpot Client Compiler
os.version = 2.6.22-2-686
java.class.path = 
/tmp/apache-solr-1.2.0/example:/tmp/apache-solr-1.2.0/example/lib/jetty-6.1.3.jar:/tmp/apache-solr-1.2.0/example/lib/jetty-util-6.1.3.jar:/tmp/apache-solr-1.2.0/example/lib/servlet-api-2.5-6.1.3.jar:/tmp/apache-solr-1.2.0/example/lib/jsp-2.1/ant-1.6.5.jar:/tmp/apache-solr-1.2.0/example/lib/jsp-2.1/core-3.1.1.jar:/tmp/apache-solr-1.2.0/example/lib/jsp-2.1/jsp-2.1.jar:/tmp/apache-solr-1.2.0/example/lib/jsp-2.1/jsp-api-2.1.jar:/usr/share/ant/lib/ant.jar

java.home = /usr/lib/jvm/java-6-sun-1.6.0.00/jre
java.version = 1.6.0
java.ext.dirs = 
/usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/ext:/usr/java/packages/lib/ext
sun.boot.class.path = 
/usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/resources.jar:/usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/rt.jar:/usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/sunrsasign.jar:/usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/jsse.jar:/usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/jce.jar:/usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/charsets.jar:/usr/lib/jvm/java-6-sun-1.6.0.00/jre/classes





Jetty log (from the error under Debian Solr/Jetty):


org.apache.jasper.JasperException: No Java compiler available
    at org.apache.jasper.servlet.JspServletWrapper.handleJspException(JspServletWrapper.java:460)
    at org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:367)
    at org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:329)
    at org.apache.jasper.servlet.JspServlet.service(JspServlet.java:265)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
    at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:428)
    at org.mortbay.jetty.servlet.WebApplicationHandler.dispatch(WebApplicationHandler.java:473)
    at org.mortbay.jetty.servlet.Dispatcher.dispatch(Dispatcher.java:286)
    at org.mortbay.jetty.servlet.Dispatcher.forward(Dispatcher.java:171)
    at org.mortbay.jetty.servlet.Default.handleGet(Default.java:302)
    at org.mortbay.jetty.servlet.Default.service(Default.java:223)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
    at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:428)
    at org.mortbay.jetty.servlet.WebApplicationHandler$CachedChain.doFilter(WebApplicationHandler.java:830)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:185)
    at org.mortbay.jetty.servlet.WebApplicationHandler$CachedChain.doFilter(WebApplicationHandler.java:821)
    at org.mortbay.jetty.servlet.WebApplicationHandler.dispatch(WebApplicationHandler.java:471)
    at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:568)
    at org.mortbay.http.HttpContext.handle(HttpContext.java:1530)
    at org.mortbay.jetty.servlet.WebApplicationContext.handle(WebApplicationContext.java:633)
    at org.mortbay.http.HttpContext.handle(HttpContext.java:1482)
    at org.mortbay.http.HttpServer.service(HttpServer.java:909)
    at org.mortbay.http.HttpConnection.service(HttpConnection.java:820)
    at org.mortbay.http.HttpConnection.handleNext(HttpConnection.java:986)
    at org.mortbay.http.HttpConnection.handle(HttpConnection.java:837)
    at org.mortbay.http.SocketListener.handleConnection(SocketListener.java:245)
    at org.mortbay.util.ThreadedServer.handle(ThreadedServer.java:357)
    at org.mortbay.util.ThreadPool$PoolThread.run(ThreadPool.java:534)
getRootCause():
java.lang.IllegalStateException: No Java compiler available
    at org.apache.

facet - associated fields

2007-11-20 Thread Jae Joo
Hi,
Can anyone help me with how to facet and/or search on associated fields?

[document markup stripped in archiving; the remaining field values were:]

  1234
  Baseball hall of Fame opens Jackie Robinson exhibit
  Description about the new JR hall of fame exhibit.
  20071114
  200711
  0
  press

  Sports
  Baseball
  Major League Baseball

  Arts and Culture
  Culture
  Heritage Sites


Thanks,

Jae


Re: Solr cluster topology.

2007-11-20 Thread Alexander Wallace

Thanks for the response!

Interesting, this ALL MASTERS mode... I guess you don't do any  
replication then...


In the single master, several slaves mode, I'm assuming the client  
still writes to one and reads from the others... right?


On Nov 20, 2007, at 12:54 PM, Matthew Runo wrote:


Yes. The clients will always be a minute or two behind the master.

I like the way some people are doing it - make them all masters!
Just post your updates to each of them - you lose a bit of
performance perhaps, but it doesn't matter if a server bombs out or
you have to upgrade them, since they're all exactly the same.


--Matthew

On Nov 20, 2007, at 7:43 AM, Alexander Wallace wrote:


Hi All!

I just started reading about Solr a couple of days ago (not full
time of course) and it looks like a pretty impressive set of
technologies... I still have a few questions I have not found clear
answers to:


Q: On a cluster, as I understand it, one and only one machine is a
master, and N servers could be slaves... The clients - do they
all talk to the master for indexing and to a load balancer for
searching? Is one particular machine configured to know it is
the master, or is it only the settings for replicating the index
that matter? Or does one post reindexing requests to any of the
slaves and they will forward them to the master?


How can we have failover in the master?

Is it a correct assumption that slaves could always be a bit out
of sync with the master? A matter of minutes, perhaps...


Thanks in advance for your responses!









RE: Weird memory error.

2007-11-20 Thread Norskog, Lance
AppPerfect has a free-for-noncommercial-use version of their tools. I've
used them before and was very impressed.

http://www.appperfect.com/products/devtest.html#versions

 

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Yonik
Seeley
Sent: Tuesday, November 20, 2007 9:12 AM
To: solr-user@lucene.apache.org
Subject: Re: Weird memory error.

On Nov 20, 2007 11:29 AM, Brian Carmalt <[EMAIL PROTECTED]> wrote:
> Can you recommend one? I am not familar with how to profile under
Java.

Netbeans has one for free:
http://www.netbeans.org/products/profiler/

-Yonik


RE: Solr cluster topology.

2007-11-20 Thread Norskog, Lance
http://wiki.apache.org/solr/CollectionDistribution

http://wiki.apache.org/solr/SolrCollectionDistributionScripts

http://wiki.apache.org/solr/SolrCollectionDistributionStatusStats

http://wiki.apache.org/solr/SolrOperationsTools

http://wiki.apache.org/solr/SolrCollectionDistributionOperationsOutline

http://wiki.apache.org/solr/CollectionRebuilding
 
http://wiki.apache.org/solr/SolrAdminGUI




-Original Message-
From: Matthew Runo [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, November 20, 2007 10:54 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr cluster topology.

Yes. The clients will always be a minute or two behind the master.

I like the way some people are doing it - make them all masters! Just
post your updates to each of them - you lose a bit of performance
perhaps, but it doesn't matter if a server bombs out or you have to
upgrade them, since they're all exactly the same.

--Matthew

On Nov 20, 2007, at 7:43 AM, Alexander Wallace wrote:

> Hi All!
>
> I just started reading about Solr a couple of days ago (not full time
> of course) and it looks like a pretty impressive set of
> technologies... I still have a few questions I have not found clear answers to:
>
> Q: On a cluster, as I understand it, one and only one machine is a
> master, and N servers could be slaves... The clients - do they all
> talk to the master for indexing and to a load balancer for
> searching? Is one particular machine configured to know it is the
> master, or is it only the settings for replicating the index that
> matter? Or does one post reindexing requests to any of the slaves
> and they will forward them to the master?
>
> How can we have failover in the master?
>
> Is it a correct assumption that slaves could always be a bit out of
> sync with the master? A matter of minutes, perhaps...
>
> Thanks in advance for your responses!
>
>



Re: Weird memory error.

2007-11-20 Thread Mike Klaas

On 20-Nov-07, at 8:16 AM, Brian Carmalt wrote:


Hello all,

I started looking into the scalability of solr, and have started  
getting weird  results.

I am getting the following error:

Exception in thread "btpool0-3" java.lang.OutOfMemoryError: unable  
to create new native thread

   at java.lang.Thread.start0(Native Method)
   at java.lang.Thread.start(Thread.java:574)
   at org.mortbay.thread.BoundedThreadPool.newThread(BoundedThreadPool.java:377)
   at org.mortbay.thread.BoundedThreadPool.dispatch(BoundedThreadPool.java:94)
   at org.mortbay.jetty.bio.SocketConnector$Connection.dispatch(SocketConnector.java:187)
   at org.mortbay.jetty.bio.SocketConnector.accept(SocketConnector.java:101)
   at org.mortbay.jetty.AbstractConnector$Acceptor.run(AbstractConnector.java:516)
   at org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)


This only occurs when I send docs to the server in batches of
around 10 as separate processes.

If I send them serially, the heap grows up to 1200M with no errors.


Could be running out of stack space (which is used by other things as
well as threads).  But it's hard to imagine that happening at 30 threads.


-Mike
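
[Editor's note: since the thread count is the suspect, the same numbers
JConsole shows can be read in-process via JMX; a tiny sketch, my addition:]

    import java.lang.management.ManagementFactory;
    import java.lang.management.ThreadMXBean;

    public class ThreadStats {
        public static void main(String[] args) {
            ThreadMXBean tm = ManagementFactory.getThreadMXBean();
            // "unable to create new native thread" is about native/OS
            // limits (thread stacks, per-user process caps), not heap, so
            // a low live count here plus the OOM points at stack size
            // (-Xss) or OS limits rather than -Xmx.
            System.out.println("live threads: " + tm.getThreadCount());
            System.out.println("peak threads: " + tm.getPeakThreadCount());
        }
    }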




Re: Solr cluster topology.

2007-11-20 Thread Matthew Runo

Yes. The clients will always be a minute or two behind the master.

I like the way some people are doing it - make them all masters! Just
post your updates to each of them - you lose a bit of performance
perhaps, but it doesn't matter if a server bombs out or you have to
upgrade them, since they're all exactly the same.


--Matthew

On Nov 20, 2007, at 7:43 AM, Alexander Wallace wrote:


Hi All!

I just started reading about Solr a couple of days ago (not full
time of course) and it looks like a pretty impressive set of
technologies... I still have a few questions I have not found clear answers to:


Q: On a cluster, as I understand it, one and only one machine is a
master, and N servers could be slaves... The clients - do they all
talk to the master for indexing and to a load balancer for
searching? Is one particular machine configured to know it is the
master, or is it only the settings for replicating the index that
matter? Or does one post reindexing requests to any of the slaves
and they will forward them to the master?


How can we have failover in the master?

Is it a correct assumption that slaves could always be a bit out of
sync with the master? A matter of minutes, perhaps...


Thanks in advance for your responses!
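
[Editor's note: the "make them all masters" setup Matthew describes reduces
to fanning every update out to each instance. A bare-bones sketch with
placeholder host names; a real deployment would add the retry/queueing that
the failover question above calls for:]

    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class FanOutUpdater {
        static final String[] MASTERS = {
            "http://solr-a:8983/solr/update",   // placeholder hosts
            "http://solr-b:8983/solr/update",
        };

        // Send the same update XML to every instance so they stay identical.
        static void update(String xml) throws Exception {
            for (String m : MASTERS) {
                HttpURLConnection con =
                    (HttpURLConnection) new URL(m).openConnection();
                con.setDoOutput(true);  // POST
                con.setRequestProperty("Content-Type", "text/xml; charset=utf-8");
                OutputStream out = con.getOutputStream();
                out.write(xml.getBytes("UTF-8"));
                out.close();
                con.getResponseCode();  // real code: check, retry, or queue
            }
        }

        public static void main(String[] args) throws Exception {
            update("<add><doc><field name=\"id\">42</field></doc></add>");
            update("<commit/>");
        }
    }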






Re: Payloads, Tokenizers, and Filters. Oh My!

2007-11-20 Thread Chris Hostetter

: I apologize for cross-posting but  I believe both Solr and Lucene users and
: developers should be concerned with this.  I am not aware of a better way to
: reach both communities.

some of these questions strike me as being largely unrelated.  if
anyone wishes to follow up on them further, let's do it in (new) separate
threads for each topic, on the specific list appropriate to the topic...

:* Do TokenFilters belong in the Solr code base at all?

Yes, inasmuch as any Java code belongs in the Solr code base (or the
Nutch code base, for that matter).  They are separate projects with
separate communities and separate needs -- that doesn't mean that there
isn't code in Solr which could be useful to the broader community of
lucene-java; in that case the appropriate course of action is to open a
LUCENE issue to "promote" the code up into lucene-java, and a dependent
issue in SOLR to deprecate the current code and use the newer code
instead.

as some people may be aware, there was a discussion about this sort of
thing at ApacheCon during the Lucene BOF -- some reasons this doesn't
happen as often as it seems like it should are:
  * the code may have subtle dependency tendrils that make it hard to
refactor from one code base to the other.
  * the tests are frequently harder to "promote" than the code (in the
case of most Solr tests that use the TestHarness, it's probably easier
to write new tests from scratch)
  * when promoting the code, it's the best time to consider whether the
existing API is really the "best" API before a lot of new people start
using it (compare Solr's FunctionQuery and Lucene's CustomScoreQuery
for example)
  * someone needs to care enough to follow through on the promotion.

...further discussion is best suited for java-dev since the topic is not
Solr specific (there's a lot of Nutch code out there that people have asked
about promoting as well)

:* How to deal with TokenFilters that add new Tokens to the stream?

This is specifically regarding Payloads, yes?  also a pretty clear-cut
java-dev discussion (and one possibly already being discussed in the
monolithic Payload API thread I haven't started reading yet).
lucene-java sets the API and the semantics ... Solr code will follow them.

:* How to patch TokenFilters and Tokenizers using the model of
:  LUCENE-969 in the Solr code base and in Lucene contrib?

open SOLR issues containing patches for any Solr code that needs
changing, and LUCENE issues containing patches for contrib code that needs
changing.

: I thought it might be useful to figure out which existing TokenFilters need to
: know about Payloads.  To this end I have taken an inventory of the
: TokenFilters out there.  I think it is fair to categorize them by Add (A),
: Delete (D), Modify (M), Observe (O):

again: this is a straightforward lucene-java question ... once the
semantics have been worked out, then there can be a Solr-specific
discussion about following them.

(which is not to say that the Solr classes/use-cases shouldn't be 
considered in the discussion, just that java-dev is the right place to 
have the conversation)




-Hoss



BooleanQuery exception

2007-11-20 Thread Cody Caughlan
I am trying to run a very simple query via the Admin interface and  
receive the exception below.


The query is:

description_t:guard AND title_t:help

I am using dynamic fields (hence the underscored suffix).

Any ideas?

Thanks in advance
/cody

Nov 19, 2007 3:01:31 PM org.apache.solr.core.SolrException log
SEVERE: java.lang.NoSuchMethodError: org.apache.lucene.search.BooleanQuery.clauses()Ljava/util/List;
    at org.apache.solr.search.QueryUtils.isNegative(QueryUtils.java:38)
    at org.apache.solr.search.QueryUtils.makeQueryable(QueryUtils.java:92)
    at org.apache.solr.search.SolrIndexSearcher.getDocListNC(SolrIndexSearcher.java:827)
    at org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:805)
    at org.apache.solr.search.SolrIndexSearcher.getDocList(SolrIndexSearcher.java:698)
    at org.apache.solr.request.StandardRequestHandler.handleRequestBody(StandardRequestHandler.java:122)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:77)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:658)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:191)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:159)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:215)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:188)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:213)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:174)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:117)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:108)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:151)
    at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:874)
    at org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(Http11BaseProtocol.java:665)
    at org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(PoolTcpEndpoint.java:528)
    at org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(LeaderFollowerWorkerThread.java:81)
    at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:689)
    at java.lang.Thread.run(Thread.java:619)
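
[Editor's note: no reply appears in this archive. A NoSuchMethodError on a
Lucene class at runtime usually means the servlet container is loading a
different lucene-core jar than the one Solr 1.2 was built against (e.g. a
stray copy in Tomcat's shared lib). A one-class diagnostic - my suggestion,
not from the thread - prints which jar actually supplies BooleanQuery:]

    import org.apache.lucene.search.BooleanQuery;

    public class WhichLucene {
        public static void main(String[] args) {
            // Prints the jar (or directory) the class was loaded from,
            // which quickly exposes a stale lucene-core on the classpath.
            System.out.println(BooleanQuery.class
                .getProtectionDomain().getCodeSource().getLocation());
        }
    }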




OR-ing together filter queries

2007-11-20 Thread Arnone, Anthony
Hello all,

I am writing my own handler, and I would like to pre-filter the results based
on a field. I'm calling searcher.getDocList() with a custom-constructed query
and filters list, but the filters always seem to AND together. My question is
this: how can I construct the List of filters to make them OR together
(documents are included in the results if they match *any* of my filters)?

For reference, here's how I'm constructing my filters:

  List<Query> filters = new LinkedList<Query>();

  . . .

  while (fieldIter.hasNext()) {
      String filterStr = fieldIter.next();
      // accessField is known ahead of time
      filters.add(new TermQuery(new Term(accessField, filterStr)));
  }

  . . .

  results.docList = s.getDocList(finalQuery,
      filters.size() != 0 ? filters : null,
      Sort.RELEVANCE, start, indexSize, SolrIndexSearcher.GET_SCORES);


Thanks for any help
Anthony

PS: you might notice that I'm asking for ALL of the results in that search.
Never fear - I do a lot of post-processing myself and return a sane (~1000)
number of results in JSON.
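
[Editor's note: no answer appears in this archive, but with this API the
usual approach is to OR the TermQuerys inside a single BooleanQuery
(Occur.SHOULD) and hand getDocList a one-element filter list, since the
entries of the filter list are themselves intersected. A sketch against the
Lucene 2.x API of that era, reusing the poster's accessField:]

    import java.util.Collections;
    import java.util.Iterator;
    import java.util.List;

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;

    public class OrFilterBuilder {
        // One disjunctive filter: a doc passes if it matches ANY term.
        static List<Query> orFilter(String accessField, Iterator<String> terms) {
            BooleanQuery or = new BooleanQuery();
            while (terms.hasNext()) {
                or.add(new TermQuery(new Term(accessField, terms.next())),
                       BooleanClause.Occur.SHOULD);  // SHOULD == OR
            }
            // getDocList ANDs the entries of the filter list together, so
            // wrap the whole OR in a single-element list.
            return Collections.<Query>singletonList(or);
        }
    }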


Re: Weird memory error.

2007-11-20 Thread Yonik Seeley
On Nov 20, 2007 11:29 AM, Brian Carmalt <[EMAIL PROTECTED]> wrote:
> Can you recommend one? I am not familiar with how to profile under Java.

Netbeans has one for free:
http://www.netbeans.org/products/profiler/

-Yonik


Re: Weird memory error.

2007-11-20 Thread Simon Willnauer
I'm using the Eclipse TPTP platform and I'm very happy with it. You will
also find good how-to and tutorial pages on the web.

- simon

On Nov 20, 2007 5:29 PM, Brian Carmalt <[EMAIL PROTECTED]> wrote:

> Can you recommend one? I am not familiar with how to profile under Java.
>
> Yonik Seeley schrieb:
> > Can you try a profiler to see where the memory is being used?
> > -Yonik
> >
> > On Nov 20, 2007 11:16 AM, Brian Carmalt <[EMAIL PROTECTED]> wrote:
> >
> >> Hello all,
> >>
> >> I started looking into the scalability of solr, and have started
> getting
> >> weird  results.
> >> I am getting the following error:
> >>
> >> Exception in thread "btpool0-3" java.lang.OutOfMemoryError: unable to
> >> create new native thread
> >> at java.lang.Thread.start0(Native Method)
> >> at java.lang.Thread.start(Thread.java:574)
> >> at
> >> org.mortbay.thread.BoundedThreadPool.newThread(BoundedThreadPool.java
> :377)
> >> at
> >> org.mortbay.thread.BoundedThreadPool.dispatch(BoundedThreadPool.java
> :94)
> >> at
> >> org.mortbay.jetty.bio.SocketConnector$Connection.dispatch(
> SocketConnector.java:187)
> >> at
> >> org.mortbay.jetty.bio.SocketConnector.accept(SocketConnector.java:101)
> >> at
> >> org.mortbay.jetty.AbstractConnector$Acceptor.run(AbstractConnector.java
> :516)
> >> at
> >> org.mortbay.thread.BoundedThreadPool$PoolThread.run(
> BoundedThreadPool.java:442)
> >>
> >> This only occurs when I send docs to the server in batches of around 10
> >> as separate processes.
> >> If I send them serially, the heap grows up to 1200M with no errors.
> >>
> >> When I observe the VM during its operation, it doesn't seem to run out
> >> of memory.  The VM starts
> >> with 1024M and can allocate up to 1800M. I start getting the error
> >> listed above when the memory
> >> usage is right around 1 G. I have been using the Jconsole program on
> >> windows to observe the
> >> jetty server by using the com.sun.management.jmxremote* functions on
> the
> >> server side. The number of threads
> >> is always around 30, and Jetty can create up to 250, so I don't think
> >> that's the problem. I can't really imagine that
> >> the monitoring process is using the other 800M of the allowable heap
> >> memory, but it could be.
> >> But the problem occurs without monitoring, even when the VM heap is set
> >> to 1500M.
> >>
> >> Does anyone have an idea as to why this error is occurring?
> >>
> >> Thanks,
> >> Brian
> >>
> >>
> >
> >
>
>


Re: Weird memory error.

2007-11-20 Thread Brian Carmalt

Can you recommend one? I am not familiar with how to profile under Java.

Yonik Seeley schrieb:

Can you try a profiler to see where the memory is being used?
-Yonik

On Nov 20, 2007 11:16 AM, Brian Carmalt <[EMAIL PROTECTED]> wrote:
  

Hello all,

I started looking into the scalability of solr, and have started getting
weird  results.
I am getting the following error:

Exception in thread "btpool0-3" java.lang.OutOfMemoryError: unable to
create new native thread
at java.lang.Thread.start0(Native Method)
at java.lang.Thread.start(Thread.java:574)
at
org.mortbay.thread.BoundedThreadPool.newThread(BoundedThreadPool.java:377)
at
org.mortbay.thread.BoundedThreadPool.dispatch(BoundedThreadPool.java:94)
at
org.mortbay.jetty.bio.SocketConnector$Connection.dispatch(SocketConnector.java:187)
at
org.mortbay.jetty.bio.SocketConnector.accept(SocketConnector.java:101)
at
org.mortbay.jetty.AbstractConnector$Acceptor.run(AbstractConnector.java:516)
at
org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)

This only occurs when I send docs to the server in batches of around 10
as separate processes.
If I send them serially, the heap grows up to 1200M with no errors.

When I observe the VM during its operation, it doesn't seem to run out
of memory.  The VM starts
with 1024M and can allocate up to 1800M. I start getting the error
listed above when the memory
usage is right around 1 G. I have been using the Jconsole program on
windows to observe the
jetty server by using the com.sun.management.jmxremote* functions on the
server side. The number of threads
is always around 30, and Jetty can create up to 250, so I don't think
that's the problem. I can't really imagine that
the monitoring process is using the other 800M of the allowable heap
memory, but it could be.
But the problem occurs without monitoring, even when the VM heap is set
to 1500M.

Does anyone have an idea as to why this error is occurring?

Thanks,
Brian




  




Re: Weird memory error.

2007-11-20 Thread Yonik Seeley
Can you try a profiler to see where the memory is being used?
-Yonik

On Nov 20, 2007 11:16 AM, Brian Carmalt <[EMAIL PROTECTED]> wrote:
> Hello all,
>
> I started looking into the scalability of solr, and have started getting
> weird  results.
> I am getting the following error:
>
> Exception in thread "btpool0-3" java.lang.OutOfMemoryError: unable to
> create new native thread
> at java.lang.Thread.start0(Native Method)
> at java.lang.Thread.start(Thread.java:574)
> at
> org.mortbay.thread.BoundedThreadPool.newThread(BoundedThreadPool.java:377)
> at
> org.mortbay.thread.BoundedThreadPool.dispatch(BoundedThreadPool.java:94)
> at
> org.mortbay.jetty.bio.SocketConnector$Connection.dispatch(SocketConnector.java:187)
> at
> org.mortbay.jetty.bio.SocketConnector.accept(SocketConnector.java:101)
> at
> org.mortbay.jetty.AbstractConnector$Acceptor.run(AbstractConnector.java:516)
> at
> org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)
>
> This only occurs when I send docs to the server in batches of around 10
> as separate processes.
> If I send them serially, the heap grows up to 1200M with no errors.
>
> When I observe the VM during its operation, it doesn't seem to run out
> of memory.  The VM starts
> with 1024M and can allocate up to 1800M. I start getting the error
> listed above when the memory
> usage is right around 1 G. I have been using the Jconsole program on
> windows to observe the
> jetty server by using the com.sun.management.jmxremote* functions on the
> server side. The number of threads
> is always around 30, and Jetty can create up to 250, so I don't think
> that's the problem. I can't really imagine that
> the monitoring process is using the other 800M of the allowable heap
> memory, but it could be.
> But the problem occurs without monitoring, even when the VM heap is set
> to 1500M.
>
> Does anyone have an idea as to why this error is occurring?
>
> Thanks,
> Brian
>


Weird memory error.

2007-11-20 Thread Brian Carmalt

Hello all,

I started looking into the scalability of Solr, and have started getting 
weird results.

I am getting the following error:

Exception in thread "btpool0-3" java.lang.OutOfMemoryError: unable to 
create new native thread

   at java.lang.Thread.start0(Native Method)
   at java.lang.Thread.start(Thread.java:574)
   at 
org.mortbay.thread.BoundedThreadPool.newThread(BoundedThreadPool.java:377)
   at 
org.mortbay.thread.BoundedThreadPool.dispatch(BoundedThreadPool.java:94)
   at 
org.mortbay.jetty.bio.SocketConnector$Connection.dispatch(SocketConnector.java:187)
   at 
org.mortbay.jetty.bio.SocketConnector.accept(SocketConnector.java:101)
   at 
org.mortbay.jetty.AbstractConnector$Acceptor.run(AbstractConnector.java:516)
   at 
org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)


This only occurs when I send docs to the server in batches of around 10 
as separate processes.

If I send them serially, the heap grows up to 1200M with no errors.

When I observe the VM during its operation, it doesn't seem to run out 
of memory.  The VM starts
with 1024M and can allocate up to 1800M. I start getting the error 
listed above when the memory
usage is right around 1 G. I have been using the Jconsole program on 
windows to observe the
jetty server by using the com.sun.management.jmxremote* functions on the 
server side. The number of threads
is always around 30, and jetty can create up to 250, so I don't think 
that's the problem. I can't really imagine that
the monitoring process is using the other 800M of the allowable heap 
memory, but it could be.
But the problem occurs without monitoring, even when the VM heap is set 
to 1500M.


Does anyone have an idea as to why this error is occurring?

Thanks,
Brian


Re: rows=VERY_LARGE_VALUE throws exception, and error in some cases

2007-11-20 Thread Yonik Seeley
I recently fixed this in the trunk.
-Yonik

On Nov 20, 2007 10:31 AM, Rishabh Joshi <[EMAIL PROTECTED]> wrote:
> Hi,
>
> We are using Solr 1.2 for our project and have come across the following
> exception and error:
>
> Exception:
> SEVERE: java.lang.OutOfMemoryError: Java heap space
> at org.apache.lucene.util.PriorityQueue.initialize(PriorityQueue.java:36)
>
> Steps to reproduce:
> 1. Restart your Web Server.
> 2. Enter a query with VERY_LARGE_VALUE for "rows" field. For example:
> http://xx.xx.xx.xx:8080/solr/select?q=unix&%20start=0&fl=id&indent=off&rows=9
> 3. Press enter or click on the 'Go' button on the browser.
>
> NOTE:
> 1. This exception is thrown if '9999999' (seven digits) <
> VERY_LARGE_VALUE < '999999999' (nine digits).
> 2. The exception DOES NOT APPEAR AGAIN if we change the VERY_LARGE_VALUE to
> <= '999', execute the query and then change the VERY_LARGE_VALUE  back
> to it's original value and execute the query again.
> 3. If the VERY_LARGE_VALUE >= '9999999999' (ten digits) we get the following
> error:
>
> Error:
> HTTP Status 400 - For input string: "9999999999"
>
> Has anyone come across this scenario before?
>
> Regards,
> Rishabh
>
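
The stack trace in the report points at the likely mechanism: Lucene's
PriorityQueue.initialize pre-allocates its backing array from
start+rows before any hits are collected, so an eight- or nine-digit
rows value forces a correspondingly huge allocation no matter how few
documents match. A minimal sketch approximating just that allocation
(an illustration of the pre-fix behaviour, not Solr's actual code
path):

public class RowsAllocation {
    public static void main(String[] args) {
        // Eight digits - inside the range the report says triggers the OOM.
        int rows = 99999999;
        // Lucene's PriorityQueue.initialize does roughly
        //     heap = new Object[maxSize + 1];
        // with maxSize derived from start + rows, so the allocation happens
        // up front. On a default-sized heap the next line throws
        // java.lang.OutOfMemoryError: Java heap space.
        Object[] heap = new Object[rows + 1];
        System.out.println("allocated " + heap.length + " slots");
    }
}

A ten-digit value never gets that far: it overflows Java's int parsing
(Integer.MAX_VALUE is 2147483647), which is the HTTP 400 "For input
string" error in note 3.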


Re: Invalid value 'explicit' for echoParams parameter

2007-11-20 Thread Chris Hostetter

: I'm confident that /trunk accepts any case:
: 
:   v = v.toUpperCase();

that's in Solr 1.2 as well ... hmmm.

Ahmet: what is the default Locale of your JVM?  

String.toUpperCase() does use the default Locale ... I guess maybe we should 
start being more strict about using "compareToIgnoreCase" (or use 
toUpperCase(Locale.ENGLISH)) in cases like this where we want to test 
input strings against expected constants.
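
A minimal sketch of the trap, assuming a Turkish default locale (which
would fit this report): under tr_TR, toUpperCase() maps 'i' to the
dotted capital 'İ' (U+0130), so "explicit".toUpperCase() never equals
"EXPLICIT".

import java.util.Locale;

public class LocaleTrap {
    public static void main(String[] args) {
        Locale turkish = new Locale("tr", "TR");
        String v = "explicit";
        // Turkish case rules: 'i' uppercases to the dotted capital 'İ'.
        System.out.println(v.toUpperCase(turkish));                           // EXPLİCİT
        System.out.println(v.toUpperCase(turkish).equals("EXPLICIT"));        // false
        // Pinning the locale, or comparing case-insensitively, avoids it:
        System.out.println(v.toUpperCase(Locale.ENGLISH).equals("EXPLICIT")); // true
        System.out.println(v.equalsIgnoreCase("EXPLICIT"));                   // true
    }
}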



-Hoss



Solr cluster topology.

2007-11-20 Thread Alexander Wallace

Hi All!

I just started reading about Solr a couple of days ago (not full time
of course) and it looks like a pretty impressive set of
technologies. I still have a few questions I haven't found clear answers to:


Q: On a cluster, as I understand it, exactly one machine is the
master, and N servers can be slaves. Do the clients all talk to the
master for indexing and to a load balancer for searching? Is one
particular machine configured to know it is the master, or is it only
the settings for replicating the index that matter? Or does one post
reindexing requests to any of the slaves, which then forward them to
the master?


How can we have failover for the master?

Is it a correct assumption that slaves will always be a bit out of
sync with the master? A matter of minutes, perhaps...


Thanks in advance for your responses!




Re: Pagination with Solr

2007-11-20 Thread Chris Hostetter

: What I'm trying to do is parse the response for "numFound:" 
: and if this number is greater than the "rows" parameter, I send another 
: search request to Solr with a new "start" parameter. Is there a better 
: way to do this?  Specifically, is there another way to obtain the 
: "numFound" rather than parsing the response stream/string?

I really don't understand your question ... how do you get any useful 
information from Solr unless you parse the responses to your requests?
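
That said, a client library can do the parsing for you. A minimal
SolrJ sketch - assuming the nightly-build SolrJ client discussed
elsewhere on this list, so package paths may differ between builds -
that reads numFound from the parsed response instead of scraping the
stream:

import java.net.URL;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class PaginationSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical local instance; adjust the URL to your setup.
        CommonsHttpSolrServer server =
                new CommonsHttpSolrServer(new URL("http://localhost:8983/solr"));

        int rows = 10;
        SolrQuery query = new SolrQuery("solr");
        query.setStart(0);
        query.setRows(rows);

        QueryResponse rsp = server.query(query);
        // SolrJ has already parsed the response; numFound sits on the
        // result list.
        long numFound = rsp.getResults().getNumFound();
        System.out.println("numFound = " + numFound);

        // Request the second page only if there is one.
        if (numFound > rows) {
            query.setStart(rows);
            rsp = server.query(query);
        }
    }
}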



-Hoss



rows=VERY_LARGE_VALUE throws exception, and error in some cases

2007-11-20 Thread Rishabh Joshi
Hi,

We are using Solr 1.2 for our project and have come across the following
exception and error:

Exception:
SEVERE: java.lang.OutOfMemoryError: Java heap space
at org.apache.lucene.util.PriorityQueue.initialize(PriorityQueue.java:36)

Steps to reproduce:
1. Restart your Web Server.
2. Enter a query with VERY_LARGE_VALUE for "rows" field. For example:
http://xx.xx.xx.xx:8080/solr/select?q=unix&%20start=0&fl=id&indent=off&rows=9
3. Press enter or click on the 'Go' button on the browser.

NOTE:
1. This exception is thrown if '9999999' (seven digits) <
VERY_LARGE_VALUE < '999999999' (nine digits).
2. The exception DOES NOT APPEAR AGAIN if we change the VERY_LARGE_VALUE to
<= '999', execute the query and then change the VERY_LARGE_VALUE  back
to it's original value and execute the query again.
3. If the VERY_LARGE_VALUE >= '9999999999' (ten digits) we get the following
error:

Error:
HTTP Status 400 - For input string: "9999999999"

Has anyone come across this scenario before?

Regards,
Rishabh


SolrJ "commit" problem

2007-11-20 Thread Traut
Hi

I've got a problem with solrj from nightly build (from 2007-11-12).
I have this code:
solrClient = new CommonsHttpSolrServer(new URL(indexServerUrl));
and after "add" operation firing solrClient.commit(true, true); But commit
operation is not processing in Solr as I can see in log files
 (but I can see in debug mode that status 200 is returning after executing
getHttpConnection().executeMethod(method); in SolrJ client class file)

The command from the console actually does the trick:
[EMAIL PROTECTED] ~]$ curl http://traut-base:/-solr-network/update -H
"Content-Type: text/xml" --data-binary '<commit/>'

I must say that I'm trying to use the SolrJ client from the nightly build with the
Solr server release 1.2. Most likely that is actually the root of the problem.

So, can I use Solr release 1.2 with the nightly-build SolrJ client? Are there
any known problems? What can you say about my "commit" problem?

Thank you in advance

-- 
Best regards,
Traut


Re: Invalid value 'explicit' for echoParams parameter

2007-11-20 Thread Ryan McKinley


The URL is 
http://localhost:8983/solr/select/?q=solr&version=2.2&start=0&rows=10&indent=on
When I added &echoParams=explicit to the query nothing changed. But when I
found and replaced the word 'explicit' with uppercase 'EXPLICIT' in
solrconfig.xml, it worked. The problem is solved. Thanks for your help.



hmmm ... what version are you using?

I'm confident that /trunk accepts any case:

  v = v.toUpperCase();
  if( v.equals( "EXPLICIT" ) ) {
    return EXPLICIT;
  }

ryan


Re: Invalid value 'explicit' for echoParams parameter

2007-11-20 Thread AHMET ARSLAN

-Original e-mail message-
From: Ryan McKinley [EMAIL PROTECTED]
Date: Tue, 20 Nov 2007 07:16:53 +0200
To: solr-user@lucene.apache.org
Subject: Re: Invalid value 'explicit' for echoParams parameter

> AHMET ARSLAN wrote:
> > I am a newbie at solr. I have done everything in the solr tutorial section. 
> > I am using the latest versions of both JDK(1.6.03) and Solr(2.2). I can see 
> > the solr admin page http://localhost:8983/solr/admin/ But when I hit the 
> > search button I receive an http error:
> > 
> > HTTP ERROR: 400
> > 
> > Invalid value 'explicit' for echoParams parameter, use 'EXPLICIT' or 'ALL'
> > RequestURI=/solr/select/
> > 
> > I also tried to run solr under Tomcat but again I was unsuccessful.
> > 
> > Any solutions or document links will be appreciated.
> > 
> > Thanks for your help... 
> > 
> 
> what is the URL when you get this error?
> 
> Have you edited the solrconfig.xml?  What happens if you put: 
> &echoParams=explicit in the query?
> 
> 

The URL is 
http://localhost:8983/solr/select/?q=solr&version=2.2&start=0&rows=10&indent=on
When I added &echoParams=explicit to the query nothing changed. But when I 
found and replaced the word 'explicit' with uppercase 'EXPLICIT' in the 
solrconfig.xml, it worked. The problem is solved. Thanks for your help.


Re: Performance of Solr on different Platforms

2007-11-20 Thread Rishabh Joshi
Eswar,

This link would give you a fair idea of how Solr is used by some of the
sites/companies -
http://wiki.apache.org/solr/SolrPerformanceData

Rishabh

On Nov 20, 2007 10:49 AM, Eswar K <[EMAIL PROTECTED]> wrote:

> In our case, the load is kind of distributed. On average, the QPS would
> be much less than that; 1000 qps is the peak load we ever expect to
> reach. However, the number of documents is going to be in the range of
> 2-20 million.
>
> We would possibly distribute the indexes to different solr instances and
> possibly direct it accordingly to reduce the QPS.
>
> - Eswar
>
> On Nov 20, 2007 10:42 AM, Walter Underwood <[EMAIL PROTECTED]> wrote:
>
> > 1000 qps is a lot of load, at least 30M queries/day.
> >
> > We are running dual CPU Power P5 machines and getting about 80 qps
> > with worst case response times of 5 seconds. 90% of responses are
> > under 70 msec.
> >
> > Our expected peak load is 300 qps on our back-end Solr farm.
> > We execute multiple back-end queries for each query page.
> >
> > With N+1 sizing (full throughput with one server down), we
> > have five servers to do that. We have a separate server
> > for indexing and use the Solr distribution scripts.
> >
> > We have a relatively small index, about 250K docs.
> >
> > wunder
> >
> >
> > On 11/19/07 8:48 PM, "Eswar K" <[EMAIL PROTECTED]> wrote:
> >
> > > It's not going to hit 1000 all the time; it's the expected peak value.
> > >
> > > I guess for distributing the load we should be using collections and I
> > was
> > > looking at the collections documentation (
> > > http://wiki.apache.org/solr/CollectionDistribution) .
> > >
> > > - Eswar
> > > On Nov 20, 2007 12:07 AM, Matthew Runo <[EMAIL PROTECTED]> wrote:
> > >
> > >> I'd think that any platform that can run Java would be fine to run
> > >> SOLR on. Maybe this is more a question of preferred platforms for
> Java
> > >> deployments? That is quite the load for SOLR though, you may find
> that
> > >> you want more than one server.
> > >>
> > >> Do you mean that you're expecting about 1000 QPS over an index with
> up
> > >> to 20 million documents?
> > >>
> > >> --Matthew
> > >>
> > >> On Nov 19, 2007, at 6:00 AM, Eswar K wrote:
> > >>
> > >>> All,
> > >>>
> > >>> Can you give some information on this or at least let me know where I
> > >>> can
> > >>> find this information if its already listed out anywhere.
> > >>>
> > >>> Regards,
> > >>> Eswar
> > >>>
> > >>> On Nov 18, 2007 9:45 PM, Eswar K <[EMAIL PROTECTED]> wrote:
> > >>>
> >  Hi,
> > 
> >  I understand that Solr can be used on different Linux flavors. Is
> >  there
> >  any preferred flavor (Like Red Hat, Ubuntu, etc)?
> >  Also what is the kind of configuration of hardware (Processors,
> >  RAM, etc)
> >  be best suited for the install?
> >  We expect to load it with millions of documents (varying from 2 -
> 20
> >  million). There might be around 1000 concurrent users.
> > 
> >  Your help in this regard will be appreciated.
> > 
> >  Regards,
> >  Eswar
> > 
> > 
> > >>
> > >>
> >
> >
>


Re: Solr PHP client

2007-11-20 Thread Nick Jenkin
You can use curl (www.php.net/curl) to interface with Solr; it's a piece of cake!
-Nick


On 11/20/07, SDIS M. Beauchamp <[EMAIL PROTECTED]> wrote:
> I use the PHP and PHP serialized response writers to query Solr from PHP.
>
> They're very easy to use.
>
> But it's not so easy to update Solr from PHP (that's why my crawlers are not
> written in PHP).
>
> Florent BEAUCHAMP
>
> -Original message-
> From: Jonathan Ariel [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, November 20, 2007 02:49
> To: solr-user@lucene.apache.org
> Subject: Solr PHP client
>
> Hi!
> I'm wondering if someone is using a PHP client for solr. Actually I'm not 
> sure if there is one out there.
> Would you be interested in having a SolrJ port for PHP?
>
> Thanks,
>
> Jonathan Leibiusky
>
>


Re: Solr on Windows / Linux

2007-11-20 Thread Norberto Meijome
On Tue, 20 Nov 2007 10:55:04 +0530
"Eswar K" <[EMAIL PROTECTED]> wrote:

> Is there any difference in the way any of Solr's features work on
> Windows/Linux?


Hi Eswar,
I am developing on FreeBSD 6.2 and 7, testing on a VM with Windows 2003 Server,
and deploying, for now, on Win32 too. We will very possibly deploy to *nix
servers at a later stage (when we start to have dedicated servers for the SOLR
component).

I haven't found any issues across platforms, other than the newline - for some
reason, SOLR didn't seem to like the newline as read from the system
environment under Win32 - defaulting to Unix's \n was enough to fix it.

> Ideally it should not, as it's a Java implementation. I was
> looking at CollectionDistribution and its documentation (
> http://wiki.apache.org/solr/CollectionDistribution). It appeared that it
> uses rsync, which is specific to Unix-like systems.

Use Cygwin for this. I haven't used it for Solr, but I've used it extensively
elsewhere when I have to endure Win32.

cheers,
B

_
{Beto|Norberto|Numard} Meijome

Exhilaration is that feeling you get just after a great idea hits you,
and just before you realize what is wrong with it.

I speak for myself, not my employer. Contents may be hot. Slippery when wet. 
Reading disclaimers makes you go blind. Writing them is worse. You have been 
Warned.


indexing excel file

2007-11-20 Thread crazy

Hi, I want to index an Excel file and I get the following error:

http://dev.torrez.us/public/2006/pundit/java/src/plugin/parse-msexcel/sample/test.xls:
failed(2,0): Can't be handled as Microsoft document.
java.lang.ArrayIndexOutOfBoundsException: No cell at position col1, row 0.

I have already added msexcel to plugin.includes:

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html|htm|js|pdf|msword|mspowerpoint|msexcel)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>

I don't know where the problem is.
Help, please.
-- 
View this message in context: 
http://www.nabble.com/indexing-excel-file-tf4841896.html#a13852743
Sent from the Solr - User mailing list archive at Nabble.com.