Re: lucene-java version mismatches

2009-03-25 Thread Paul Libbrecht

could I suggest that the maven repositories are populated next-time a
release of solr-specific-lucenes are made?
But they are? It is inside the org.apache.solr group, since those lucene jars
are released by Solr -- http://repo2.maven.org/maven2/org/apache/solr/


Nope,

http://repo1.maven.org/maven2/org/apache/solr/solr-lucene-core/1.3.0/

has no sources.
Only the solr-specific ones have.

paul



Status of an update request

2009-03-25 Thread Pierre-Yves LANDRON

Hello,

When I send an update or a commit to Solr via curl, the response I get is
formatted in HTML; I can't find a way to get a machine-readable response file.
Here is what is said on the subject in the Solr config file:
The response format differs from solr1.1 formatting and returns a standard 
error code.
 To enable solr1.1 behavior, remove the /update handler or change its path
What I want, however, is an accurate description of the error and not just a 
standard Apache error code.
Is there a way to obtain an XML response file from solr ?

Thanks,
Kind regards,
P-YL


Re: lucene-java version mismatches

2009-03-25 Thread Shalin Shekhar Mangar
On Wed, Mar 25, 2009 at 12:30 PM, Paul Libbrecht p...@activemath.orgwrote:

 could I suggest that the maven repositories are populated next-time a
 release of solr-specific-lucenes are made?

 But they are? It is inside the org.apache.solr group since those lucene
 jars
 are released by Solr -- http://repo2.maven.org/maven2/org/apache/solr/


 Nope,

 http://repo1.maven.org/maven2/org/apache/solr/solr-lucene-core/1.3.0/

 has no sources.
 Only the solr-specific ones have.


Ah, I see. Solr's build uses the lucene binaries which are checked into
SVN, so sources are a little more difficult to bundle. Either we'd need to
check in the lucene source jars as well, or the ant build would need to check
out the lucene code with the same revision number and make a source jar.

Please open an issue in Jira. It might be difficult for me to find time
for this right now, but we can decide on an acceptable approach. Also note
that lucene's revision number is mentioned in the CHANGES.txt.

-- 
Regards,
Shalin Shekhar Mangar.


Anyone use solr admin and Opera?

2009-03-25 Thread ristretto.rb
Hello,  I'm a happy Solr user.  Thanks for the excellent software!!
Hopefully this is a good question, I have indeed looked around the FAQ
and google and such first.
I have just switched from Firefox to Opera for web browsing.  (Another story)
When I use solr/admin, the home page and stats pages work fine, but
searches return unformatted results all run together. If I view the source,
I see it is XML, and in fact the source is more readable than the page
itself. Perhaps I need a stylesheet, or something. Are there any other
Opera users who have gotten past this problem?

Thanks
gene


numeric range facets

2009-03-25 Thread Ashish P

Similar to getting range facets for dates, where we specify start, end and gap:
can we do the same thing for numeric facets where we specify start, end and
gap?



Re: get all facets

2009-03-25 Thread Shalin Shekhar Mangar
On Wed, Mar 25, 2009 at 7:30 AM, Ashish P ashish.ping...@gmail.com wrote:


 Can I get all the facets in QueryResponse??


You can get all the facets that are returned by the server. Set facet.limit
to the number of facets you want to retrieve.

See
http://lucene.apache.org/solr/api/solrj/org/apache/solr/client/solrj/SolrQuery.html#setFacetLimit(int)
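
For reference, a minimal SolrJ sketch (the server URL and the facet field name
"category" are illustrative, not from the original mail); facet.limit=-1 asks
for every value of the field:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.response.FacetField;
    import org.apache.solr.client.solrj.response.QueryResponse;

    SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
    SolrQuery query = new SolrQuery("*:*");
    query.setFacet(true);
    query.addFacetField("category");   // illustrative field name
    query.setFacetLimit(-1);           // -1 = no limit, return all facet values
    QueryResponse rsp = server.query(query);
    for (FacetField ff : rsp.getFacetFields()) {
        System.out.println(ff.getName() + " has " + ff.getValueCount() + " values");
    }
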
-- 
Regards,
Shalin Shekhar Mangar.


Re: numeric range facets

2009-03-25 Thread Shalin Shekhar Mangar
On Wed, Mar 25, 2009 at 3:26 PM, Ashish P ashish.ping...@gmail.com wrote:


 Similar to getting range facets for date where we specify start, end and
 gap.
 Can we do the same thing for numeric facets where we specify start, end and
 gap.


No. But you can do this with multiple queries by using facet.field with fq
parameters. If you are using the trunk then it should be possible to do this
with one query using the new multi-select facet feature.

See
http://wiki.apache.org/solr/SimpleFacetParameters#head-f277d409b221b407d9c5430f552bf40ee6185c4c
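
A related option that also works on 1.3 is facet.query: send a single request
with one explicit facet query per numeric bucket. A minimal SolrJ sketch (the
field name "price" and the ranges are illustrative, not from the original mail):

    import java.util.Map;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
    SolrQuery q = new SolrQuery("*:*");
    q.setFacet(true);
    q.addFacetQuery("price:[0 TO 99]");     // one facet.query per bucket
    q.addFacetQuery("price:[100 TO 199]");
    q.addFacetQuery("price:[200 TO *]");
    QueryResponse rsp = server.query(q);
    Map<String, Integer> counts = rsp.getFacetQuery();  // facet query -> count
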

-- 
Regards,
Shalin Shekhar Mangar.


Re: Anyone use solr admin and Opera?

2009-03-25 Thread Shalin Shekhar Mangar
On Wed, Mar 25, 2009 at 1:33 PM, ristretto.rb ristretto...@gmail.comwrote:

 Hello,  I'm a happy Solr user.  Thanks for the excellent software!!
 Hopefully this is a good question, I have indeed looked around the FAQ
 and google and such first.
 I have just switched from Firefox to Opera for web browsing.  (Another
 story)
 When I use the solr/admin the home page and stats works fine, but
 searches return unformatted results
 all run together.  If I get source, I see it is XML, and in fact, the
 source is more readable then page
 itself.  Perhaps I need a stylesheet, or something.  Are there there
 any other Opera users that have gotten
 past this problem.


I'd be interested in this too. Safari/Chrome also have the same problem:
they don't render raw XML nicely.

-- 
Regards,
Shalin Shekhar Mangar.


Re: Status of an update request

2009-03-25 Thread Shalin Shekhar Mangar
On Wed, Mar 25, 2009 at 12:42 PM, Pierre-Yves LANDRON
pland...@hotmail.comwrote:


 Hello,

 When I send an update or a commit to solr via curl, the response I get is
 formated in HTML ; I can't find a way to have a machine readable response
 file.
 Here what is said on the subject in the solr config file :
 The response format differs from solr1.1 formatting and returns a standard
 error code.
  To enable solr1.1 behavior, remove the /update handler or change its path
 What I want, however, is an accurate description of the error and not just
 a standard Apache error code.
 Is there a way to obtain an XML response file from solr ?


If the update command executes successfully, then the response is XML. In
case of error, the error page is generated by the servlet container which is
HTML I guess. Not sure what can be done about that. Perhaps Solr can have
its own error pages which output XML with the stack trace information and
the correct HTTP return codes?
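
For reference, when the update does succeed, the 1.3 XML update handler answers
with a small XML document roughly like the following (values are illustrative),
which is what makes the success case machine readable:

    <?xml version="1.0" encoding="UTF-8"?>
    <response>
      <lst name="responseHeader">
        <int name="status">0</int>
        <int name="QTime">12</int>
      </lst>
    </response>
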

-- 
Regards,
Shalin Shekhar Mangar.


Deleting documents

2009-03-25 Thread Rui Pereira
I'm trying to delete documents based on the following type of update
request:
<delete><query>topologyid:3140</query><query>topologyid:3142</query></delete>

This doesn't cause any changes to the index, and if I try to read the response,
the following error occurs:

13:32:35,196 ERROR [STDERR] 25/Mar/2009 13:32:35
org.apache.solr.update.processor.LogUpdateProcessor finish
INFO: {} 0 16
13:32:35,196 ERROR [STDERR] 25/Mar/2009 13:32:35
org.apache.solr.common.SolrException log
SEVERE: org.apache.solr.common.SolrException: missing content stream
at
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:49)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1333)
at
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:303)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:232)
at
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
at
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
at
org.jboss.web.tomcat.filters.ReplyHeaderFilter.doFilter(ReplyHeaderFilter.java:96)
at
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
at
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
at
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:230)
at
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:175)
at
org.jboss.web.tomcat.security.SecurityAssociationValve.invoke(SecurityAssociationValve.java:182)
at
org.jboss.web.tomcat.security.JaccContextValve.invoke(JaccContextValve.java:84)
at
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
at
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
at
org.jboss.web.tomcat.service.jca.CachedConnectionValve.invoke(CachedConnectionValve.java:157)
at
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
at
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:262)
at
org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:844)
at
org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:583)
at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:446)
at java.lang.Thread.run(Unknown Source)
13:32:35,196 ERROR [STDERR] 25/Mar/2009 13:32:35
org.apache.solr.core.SolrCore execute
INFO: [] webapp=/apache-solr-nightly path=/update
params={<delete><query>topologyid:3142</query></delete>=} status=400
QTime=16

Thanks in advance,
   Rui Pereira
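
The log above suggests the delete XML ended up as a request parameter rather
than as the POST body, which is what usually triggers "missing content stream".
A minimal SolrJ sketch that sends the same deletes as a proper request body
(the URL matches the webapp name in the log; everything else is illustrative):

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

    SolrServer server = new CommonsHttpSolrServer("http://localhost:8080/apache-solr-nightly");

    // SolrJ posts <delete><query>...</query></delete> in the request body for us
    server.deleteByQuery("topologyid:3140");
    server.deleteByQuery("topologyid:3142");

    // the deletes only become visible after a commit
    server.commit();
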


Copy solr indexes from 2 solr instance

2009-03-25 Thread prerna07

Hi,

Issue 1:
I have 2 Solr instances; I need to copy indexes from the solr1 instance to solr2
without restarting Solr.
Please suggest how this would work. Both Solrs are on a multicore setup.

Issue 2:
I deleted all indexes from Solr and reloaded my core, and the Solr admin returns
0 results.
The index folder under the core's data directory still contains a number of
files?

Issue 3:
Can I copy/paste the data folder into a running Solr core?

Thanks,
Prerna



speeding up indexing with a LOT of indexed fields

2009-03-25 Thread Britske

hi, 

I'm having difficulty indexing a collection of documents in a reasonable
time.
It's now going at 20 docs / sec on a c1.xlarge instance of Amazon EC2, which
just isn't enough.
This box has 8GB RAM and the equivalent of 20 Xeon processors.

These documents have a couple of stored, indexed, multi- and single-valued
fields, but the main problem lies in them having about 1500 indexed fields of
type sint, range [0,1] (yes, I know this is a lot).

I'm looking for some guidance as what strategies to try out to improve
throughput in indexing. I could slam in some more servers (I will) but my
feeling tells me I can get more out of this.

some additional info: 
 - I'm indexing to 10 cores in parallel.  This is done because :
  - at query time, 1 particular index will always fulfill all requests,
so we can prune the search space to 1/10th of its original size. 
  - each document as represented in a core is actually 1/10th of a
'conceptual' document (which would contain up to 15000 indexed fields) if I
indexed to 1 core. Indexing as 1 doc containing 15.000 indexed fields proved
to give far worse results in searching and indexing than the solution i'm
going with now. 
 - the alternative of simply putting all docs with 1500 indexed field
each in the same core isn't really possible either, because this quickly
results in OOM-errors when sorting on a couple of fields. (even though 9/10
th of all docs in this case would not have the field sorted on, they would
still end up in a lucene fieldCache for this field) 

- to be clear: the 20 docs / second means 2 docs / second / core. Or 2
'conceptual' docs / second overall. 

- each core has maxBufferedDocs ~20 and mergeFactor~10 .  (I actually set
them differently for each partition so that merges of different partitions
don't happen altogether. This seemed to help a bit) 

- running jvm with -server -Xmx6000M -Xms6000M -XX:+UseParallelGC
-XX:+CMSPermGenSweepingEnabled -XX:MaxPermSize=128M to leave room for
diskcaching. 

- I'm spreading the 10 indices over 2 physical disks. 5 to /dev/sda1 5 to
/dev/sdb 


observations: 
- within minutes after feeding starts, the server reaches its max RAM. 
- until then the processors are running at ~70%
- although I throw in a commit at random intervals (between 600 and 800 secs,
again so as not to commit all partitions at the same time) the JVM just keeps
eating all the RAM. 
- not a lot seems to be happening on disk (using dstat) when the RAM hasn't
maxed out. Obviously, afterwards the disk is flooded with swapping. 

questions: 
- is there a good reason why all ram keeps occupied even though I commit
regularly? Perhaps fieldcaches get populated when indexing? I guess not, but
I'm not sure what else could explain this

- would splitting the 'conceptual docs' in even more partitions help at
indexing time? from an application standpoint it's possible, it just
requires some work and it's hard to compare figures so I'd like to know if
it's worth it .

- how is a flush different from a commit and would it help in getting the
ram-usage down?

- because all 15,000 indexed fields look very similar in structure (they are
all sints in [0,1] to start with), I was looking for more efficient ways to
get them into an index using some low-level indexing operations. For example:
for given documents X and Y, and indexed fields 1,2,...,i,...,N, if X.a > Y.a
then this ordering in a lot of cases also holds for fields 2,...,N. Because of
these special properties I could possibly create a sorting algorithm that
takes advantage of this and thus would make indexing faster.
Would even considering this path be useful, given that it would obviously
involve some work to make it work, and presumably a lot more work to get it to
go faster than out of the box?

 - lastly: should I be able to get more out of this box or am I just
complaining ;-) 

Thanks for making it to here, 
and hoping to receive some valuable info, 

Cheers, 
Britske



Re: speeding up indexing with a LOT of indexed fields

2009-03-25 Thread Otis Gospodnetic

Britske,

Here are a few quick ones:

- Does that machine really have 10 CPU cores?  If it has significantly less, 
you may be beyond the indexing sweet spot in terms of indexer threads vs. CPU 
cores

- Your maxBufferedDocs is super small.  Comment that out anyway.  Use 
ramBufferSizeMB and set it as high as you can afford.  No need to commit very 
often, and certainly no need to flush or optimize until the end.

There is a page about indexing performance on either Solr or Lucene Wiki that 
will help.
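
A minimal solrconfig.xml sketch of what that looks like (the value is
illustrative; the same elements also appear under <mainIndex> in the 1.3
example config):

    <indexDefaults>
      <!-- flush the in-memory buffer by RAM usage rather than by document count -->
      <ramBufferSizeMB>256</ramBufferSizeMB>
      <mergeFactor>10</mergeFactor>
      <!-- leave maxBufferedDocs commented out so ramBufferSizeMB controls flushing -->
      <!-- <maxBufferedDocs>1000</maxBufferedDocs> -->
    </indexDefaults>
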

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



- Original Message 
 From: Britske gbr...@gmail.com
 To: solr-user@lucene.apache.org
 Sent: Wednesday, March 25, 2009 10:05:17 AM
 Subject: speeding up indexing with a LOT of indexed fields
 
 
 hi, 
 
 I'm having difficulty indexing a collection of documents in a reasonable
 time. 
 it's now going at 20 docs / sec on a c1.xlarge instance of amazon ec2 which
 just isnt enough. 
 This box has 8GB ram and the equivalent of 20 xeon processors.  
 
 these document have a couple of stored, indexed, multi and single-valued
 fields, but the main problem lies in it having about 1500 indexed fields of
 type sint.  Range [0,1] (Yes, I know this is a lot) 
 
 I'm looking for some guidance as what strategies to try out to improve
 throughput in indexing. I could slam in some more servers (I will) but my
 feeling tells me I can get more out of this.
 
 some additional info: 
 - I'm indexing to 10 cores in parallel.  This is done because :
   - at query time, 1 particular index will always fullfill all requests
 so we can prune the search space to 1/10th of its original size. 
   - each document as represented in a core is actually 1/10th of a
 'conceptual' document (which would contain up to 15000 indexed fields) if I
 indexed to 1 core. Indexing as 1 doc containing 15.000 indexed fields proved
 to give far worse results in searching and indexing than the solution i'm
 going with now. 
  - the alternative of simply putting all docs with 1500 indexed field
 each in the same core isn't really possible either, because this quickly
 results in OOM-errors when sorting on a couple of fields. (even though 9/10
 th of all docs in this case would not have the field sorted on, they would
 still end up in a lucene fieldCache for this field) 
 
 - to be clear: the 20 docs / second means 2 docs / second / core. Or 2
 'conceptual' docs / second overall. 
 
 - each core has maxBufferedDocs ~20 and mergeFactor~10 .  (I actually set
 them differently for each partition so that merges of different partitions
 don't happen altogether. This seemed to help a bit) 
 
 - running jvm with -server -Xmx6000M -Xms6000M -XX:+UseParallelGC
 -XX:+CMSPermGenSweepingEnabled -XX:MaxPermSize=128M to leave room for
 diskcaching. 
 
 - I'm spreading the 10 indices over 2 physical disks. 5 to /dev/sda1 5 to
 /dev/sdb 
 
 
 observations: 
 - within minutes after feeding the server reaches it's max ram. 
 - until then the processors are running on ~70%
 - although I throw in a commit at random intervals (between 600 to 800 secs,
 again so not to commit al partitions at the same time) the jvm just stays
 eating all the ram. 
 - not a lot seems to be happening on disk (using dstat) when the ram hasn't
 maxed out. Obviously, aftwerwards the disk is flooded with swapping. 
 
 questions: 
 - is there a good reason why all ram keeps occupied even though I commit
 regularly? Perhaps fieldcaches get populated when indexing? I guess not, but
 I'm not sure what else could explain this
 
 - would splitting the 'conceptual docs' in even more partitions help at
 indexing time? from an application standpoint it's possible, it just
 requires some work and it's hard to compare figures so I'd like to know if
 it's worth it .
 
 - how is a flush different from a commit and would it help in getting the
 ram-usage down?
 
 - because all 15.000 indexed fields look very similar in structure (they are
 all sints [0,1] to start with, I was looking for more efficient ways to
 get them in an index using some low-level indexing operations. For example:
 for a given document X and Y, and indexed fields 1,2.., i,...,N if X.a  Y.a
 than this ordening in a lot of cases holds for fields 2,...,N. Because of
 these special properties I could possibly create a sorting algorithm that
 takes advantage of this and thus would make indexing faster. 
 Would even considering this path be something that may be useful, because
 obviously it would envolve some work to make it work, and presumably a lot
 more work to get it to go faster than out of the box ?
 
 - lastly: should I be able to get more out of this box or am I just
 complaining ;-) 
 
 Thanks for making it to here, 
 and hoping to receive some valuable info, 
 
 Cheers, 
 Britske
 -- 
 View this message in context: 
 http://www.nabble.com/speeding-up-indexing-with-a-LOT-of-indexed-fields-tp22702364p22702364.html
 Sent from the Solr - User 

Re: Copy solr indexes from 2 solr instance

2009-03-25 Thread Otis Gospodnetic

Prerna,

You could create an index snapshot with snapshooter script and then copy the 
index.  You should do that while the source index is not getting modified.

Re issue #2: run optimize

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



- Original Message 
 From: prerna07 pkhandelw...@sapient.com
 To: solr-user@lucene.apache.org
 Sent: Wednesday, March 25, 2009 9:52:34 AM
 Subject: Copy solr indexes from 2 solr instance
 
 
 Hi,
 
 Issue 1:
 I have 2 solr instances, i need to copy indexes from solr1 instance to solr2
 without restarting the solr. 
 Please suggest how will this work. Both solr are on multicore setup.
 
 Issue2:
 I deleted all indexes from solr and reloaded my core, solr admin return 0
 results. 
 The size of index folder under data directory of core has still number of
 files?
 
 Issue3:
 Can I copy/paste data folder in running core of solr.
 
 Thanks,
 Prerna
 -- 
 View this message in context: 
 http://www.nabble.com/Copy-solr-indexes-from-2-solr-instance-tp22702100p22702100.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Snapinstaller + Overlapping onDeckSearchers Problems

2009-03-25 Thread Otis Gospodnetic

Hm, I can't quite tell from here, but that is just a warning, so it's not super 
problematic at this point.
Could it be that one of your other caches (query cache) is large and lots of 
items are copied on searcher flip?

Could it be that your JVM doesn't have a large or free enough heap?  Can 
you tell if lots of GCing happens during the searcher flip?

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



- Original Message 
 From: Cloude Porteus clo...@instructables.com
 To: solr-user@lucene.apache.org
 Sent: Wednesday, March 25, 2009 1:06:51 AM
 Subject: Snapinstaller + Overlapping onDeckSearchers Problems
 
 We have been running our solr slaves without autowarming our new searchers
 for a long time, but that was causing us 50-75 requests in 20+ seconds
 timeframe after every update on the slaves. I have turned on autowarming and
 that has fixed our slow response times, but I'm running into occasional
 Overlapping onDeckSearchers.
 
 We have replication setup and are using the snapinstaller script every 10
 minutes:
 
 /home/solr/bin/snappuller -M util01 -P 18984 -D /home/solr/write/data -S
 /home/solr/logs -d /home/solr/read/data -u instruct;
 /home/solr/bin/snapinstaller -M util01 -S /home/solr/write/logs -d
 /home/solr/read/data -u instruct
 
 Here's what a successful update/commit log looks like:
 
 [14:13:02.510] start
 commit(optimize=false,waitFlush=false,waitSearcher=true)
 [14:13:02.522] Opening searc...@e9b4bb main
 [14:13:02.524] end_commit_flush
 [14:13:02.525] autowarming searc...@e9b4bb main from searc...@159e6e8 main
 [14:13:02.525]
 filterCache{lookups=1809739,hits=1766607,hitratio=0.97,inserts=43211,evictions=0,
 size=43154,cumulative_lookups=1809739,cumulative_hits=1766607,cumulative_hitratio=0.97,cumulative_inserts=43211,cumulative_evictions=0}
 --
 [14:15:42.372] {commit=} 0 159964
 [14:15:42.373] /update  0 159964
 
 Here's what a unsuccessful update/commit log looks like, where the /update
 took too long and we started another commit:
 
 [21:03:03.829] start
 commit(optimize=false,waitFlush=false,waitSearcher=true)
 [21:03:03.836] Opening searc...@b2f2d6 main
 [21:03:03.836] end_commit_flush
 [21:03:03.836] autowarming searc...@b2f2d6 main from searc...@103c520 main
 [21:03:03.836]
 filterCache{lookups=1062196,hits=1062160,hitratio=0.99,inserts=49144,evictions=0,size=48353,cumulative_lookups=259485564,cumulative_hits=259426904,cumulative_hitratio=0.99,cumulative_inserts=68467,cumulative_evictions=0}
 --
 [21:23:04.794] start
 commit(optimize=false,waitFlush=false,waitSearcher=true)
 [21:23:04.794] PERFORMANCE WARNING: Overlapping onDeckSearchers=2
 [21:23:04.802] Opening searc...@f11bc main
 [21:23:04.802] end_commit_flush
 --
 [21:24:55.987] {commit=} 0 1312158
 [21:24:55.987] /update  0 1312158
 
 
 I don't understand why this sometimes takes two minutes between the start
 commit  /update and sometimes takes 20 minutes? One of our caches has about
 ~40,000 items, but I can't imagine it taking 20 minutes to autowarm a
 searcher.
 
 It would be super handy if the Snapinstaller script would wait until the
 previous one was done before starting a new one, but I'm not sure how to
 make that happen.
 
 Thanks for any help with this.
 
 best,
 cloude
 
 -- 
 VP of Product Development
 Instructables.com
 
 http://www.instructables.com/member/lebowski



Re: Not able to configure multicore

2009-03-25 Thread Otis Gospodnetic

Hm, where does that /solr2 come from?

 Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



- Original Message 
 From: mitulpatel mitulpa...@greymatterindia.com
 To: solr-user@lucene.apache.org
 Sent: Wednesday, March 25, 2009 12:30:11 AM
 Subject: Re: Not able to configure multicore
 
 
 
 
 hossman wrote:
  
  
  : I am facing a problem related to multiple cores configuration. I have
  placed
  : a solr.xml file in solr.home directory. eventhough when I am trying to
  : access http://localhost:8983/solr/admin/cores it gives me tomcat error. 
  : 
  : Can anyone tell me what can be possible issue with this??
  
  not without knowing exactly what the tomcat error message is, what your 
  solr.xml file looks like, what log messages you see on startup, etc...
  
  -Hoss
  
  
 Hello Hoss,
 
 Thanks for reply.
 
 Here is the error message shown on browser:
 HTTP Status 404 - /solr2/admin/cores
 type Status report
 message /solr2/admin/cores
 description The requested resource (/solr2/admin/cores) is not available.
 
 and here is the solr.xml file.
 
 
   
   
 
 
 
 
 -- 
 View this message in context: 
 http://www.nabble.com/Not-able-to-configure-multicore-tp22682691p22695098.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Hardware Questions...

2009-03-25 Thread Otis Gospodnetic

Ah, it's hard to tell.  I look at index size on disk, number of docs, query 
rate, types of queries, etc.


Are you actually seeing problems with your existing servers?  Or seeing specific 
performance movement in one of these aspects (e.g. increasing latency, increased 
GC or memory usage, increased disk IO)?
 
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



- Original Message 
 From: solr s...@highbeam.com
 To: solr-user@lucene.apache.org
 Sent: Tuesday, March 24, 2009 4:51:50 PM
 Subject: Hardware Questions...
 
 We have three Solr servers (several two processor Dell PowerEdge
 servers). I'd like to get three newer servers and I wanted to see what
 we should be getting. I'm thinking the following...
 
 
 
 Dell PowerEdge 2950 III 
 
 2x2.33GHz/12M 1333MHz Quad Core 
 
 16GB RAM 
 6 x 146GB 15K RPM RAID-5 drives
 
 
 
 How do people spec out servers, especially CPU, memory and disk? Is this
 all based on the number of doc's, indexes, etc...
 
 
 
 Also, what are people using for benchmarking and monitoring Solr? Thanks
 - Mike



Re: Snapinstaller + Overlapping onDeckSearchers Problems

2009-03-25 Thread Ryan McKinley


I don't understand why this sometimes takes two minutes between the start
commit  /update and sometimes takes 20 minutes? One of our caches has about
~40,000 items, but I can't imagine it taking 20 minutes to autowarm a
searcher.



What do your cache configs look like?

How big is the autowarm count?

If you have:
  <queryResultCache
    class="solr.LRUCache"
    size="512"
    initialSize="512"
    autowarmCount="32"/>

that will run 32 queries when solr starts.  Are you running 40K
queries when it starts?



ryan




Re: Snapinstaller + Overlapping onDeckSearchers Problems

2009-03-25 Thread Cloude Porteus
Yes, I guess I'm running 40k queries when it starts :) I didn't know that
each count was equal to a query. I thought it was just copying the cache
entries from the previous searcher, but I guess that wouldn't include new
entries. I set it to the size of our filterCache. What should I set the
autowarmCount to if I want to try and fill up the caches?

lookups : 8720372
hits : 8676170
hitratio : 0.99
inserts : 44551
evictions : 0
size : 44417
cumulative_lookups : 8720372
cumulative_hits : 8676170
cumulative_hitratio : 0.99
cumulative_inserts : 44551
cumulative_evictions : 0

best,
cloude

On Wed, Mar 25, 2009 at 8:38 AM, Ryan McKinley ryan...@gmail.com wrote:

 I don't understand why this sometimes takes two minutes between the start
 commit  /update and sometimes takes 20 minutes? One of our caches has
about
 ~40,000 items, but I can't imagine it taking 20 minutes to autowarm a
 searcher.


 What do your cache configs look like?

 How big is the autowarm count?

 If you have:
queryResultCache
  class=solr.LRUCache
  size=512
  initialSize=512
  autowarmCount=32/

 that will run 32 queries when solr starts.  Are you running 40K queries
when it starts?


 ryan





--
VP of Product Development
Instructables.com

http://www.instructables.com/member/lebowski


Strange anomaly(?) with string matching in query

2009-03-25 Thread Kurt Nordstrom

Hello,

We've encountered a strange issue in our Solr install regarding a particular
string that just doesn't seem to want to return results, despite the exact
same string being in the index.

What makes it even stranger is that we had the same data in a previous
install of Solr, and it worked there, but doesn't here.

The string that's been showing the trouble is "Abilene Christian College --
Students -- Yearbooks".  The field, in this case, is of type "text".
Strangely enough, when we search for "Abilene Christian College -- Students
--", the relevant documents are returned.  It just fails when the full
string is specified.

At this point, I'm a little bit stymied.  Any suggestions or ideas would be
highly appreciated.  In order to possibly help with diagnosis, I'm including
links to, hopefully, relevant outputs and configurations.

We're using Solr version 1.3.

This is the output of a search for the string, with debugQuery turned on.
http://pastebin.com/f72c017c1

This is the output of a document containing the string in question.  The
field is dc_subject. http://pastebin.com/f17a2e722

Here is our current schema. http://pastebin.com/f2768bece

If there's any more information or diagnostics that I can post or run,
please let me know.  Thanks for your help and suggestions.

-Kurt



Re: speeding up indexing with a LOT of indexed fields

2009-03-25 Thread Britske

Thanks for the quick reply.

The box has 8 real CPUs. Perhaps a good idea then to reduce the number of cores
to 8 as well. I'm testing out a different scenario with multiple boxes as
well, where clients persist docs to multiple cores on multiple boxes (which
is what multicore was invented for, after all). 

I set maxBufferedDocs this low (instead of using ramBufferSizeMB) because I
was worried about the impact on RAM and wanted to get a grip on when docs were
persisted to disk. I'm still not sure it matters much for the big amount of
RAM consumed. This can't all be coming from buffering docs, can it? On the
other hand, maxBufferedDocs (20) is set for each core, so in total the
number of buffered docs is at most 200. Of course still on the low side, but
I've got some draconian docs here.. ;-) 

I will try to use ramBufferSizeMB and set it higher, but I first have to
get a grip on why RAM usage is maxed all the time before this will make any
difference, I guess. 

Thanks, and please keep the suggestions coming. 

Britske.


Otis Gospodnetic wrote:
 
 
 Britske,
 
 Here are a few quick ones:
 
 - Does that machine really have 10 CPU cores?  If it has significantly
 less, you may be beyond the indexing sweet spot in terms of indexer
 threads vs. CPU cores
 
 - Your maxBufferedDocs is super small.  Comment that out anyway.  use
 ramBufferedSizeMB and set it as high as you can afford.  No need to commit
 very often, certainly no need to flush or optimize until the end.
 
 There is a page about indexing performance on either Solr or Lucene Wiki
 that will help.
 
 Otis
 --
 Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
 
 
 
 - Original Message 
 From: Britske gbr...@gmail.com
 To: solr-user@lucene.apache.org
 Sent: Wednesday, March 25, 2009 10:05:17 AM
 Subject: speeding up indexing with a LOT of indexed fields
 
 
 hi, 
 
 I'm having difficulty indexing a collection of documents in a reasonable
 time. 
 it's now going at 20 docs / sec on a c1.xlarge instance of amazon ec2
 which
 just isnt enough. 
 This box has 8GB ram and the equivalent of 20 xeon processors.  
 
 these document have a couple of stored, indexed, multi and single-valued
 fields, but the main problem lies in it having about 1500 indexed fields
 of
 type sint.  Range [0,1] (Yes, I know this is a lot) 
 
 I'm looking for some guidance as what strategies to try out to improve
 throughput in indexing. I could slam in some more servers (I will) but my
 feeling tells me I can get more out of this.
 
 some additional info: 
 - I'm indexing to 10 cores in parallel.  This is done because :
   - at query time, 1 particular index will always fullfill all
 requests
 so we can prune the search space to 1/10th of its original size. 
   - each document as represented in a core is actually 1/10th of a
 'conceptual' document (which would contain up to 15000 indexed fields) if
 I
 indexed to 1 core. Indexing as 1 doc containing 15.000 indexed fields
 proved
 to give far worse results in searching and indexing than the solution i'm
 going with now. 
  - the alternative of simply putting all docs with 1500 indexed field
 each in the same core isn't really possible either, because this quickly
 results in OOM-errors when sorting on a couple of fields. (even though
 9/10
 th of all docs in this case would not have the field sorted on, they
 would
 still end up in a lucene fieldCache for this field) 
 
 - to be clear: the 20 docs / second means 2 docs / second / core. Or 2
 'conceptual' docs / second overall. 
 
 - each core has maxBufferedDocs ~20 and mergeFactor~10 .  (I actually set
 them differently for each partition so that merges of different
 partitions
 don't happen altogether. This seemed to help a bit) 
 
 - running jvm with -server -Xmx6000M -Xms6000M -XX:+UseParallelGC
 -XX:+CMSPermGenSweepingEnabled -XX:MaxPermSize=128M to leave room for
 diskcaching. 
 
 - I'm spreading the 10 indices over 2 physical disks. 5 to /dev/sda1 5 to
 /dev/sdb 
 
 
 observations: 
 - within minutes after feeding the server reaches it's max ram. 
 - until then the processors are running on ~70%
 - although I throw in a commit at random intervals (between 600 to 800
 secs,
 again so not to commit al partitions at the same time) the jvm just stays
 eating all the ram. 
 - not a lot seems to be happening on disk (using dstat) when the ram
 hasn't
 maxed out. Obviously, aftwerwards the disk is flooded with swapping. 
 
 questions: 
 - is there a good reason why all ram keeps occupied even though I commit
 regularly? Perhaps fieldcaches get populated when indexing? I guess not,
 but
 I'm not sure what else could explain this
 
 - would splitting the 'conceptual docs' in even more partitions help at
 indexing time? from an application standpoint it's possible, it just
 requires some work and it's hard to compare figures so I'd like to know
 if
 it's worth it .
 
 - how is a flush different from a commit and would it help in getting the
 ram-usage down?
 
 - because 

Re: Strange anomaly(?) with string matching in query

2009-03-25 Thread Otis Gospodnetic

Hi,

Take the whole string to your Solr Admin - Analysis page and analyze it.  Does 
it get analyzed the way you'd expect it to be analyzed?

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



- Original Message 
 From: Kurt Nordstrom knordst...@library.unt.edu
 To: solr-user@lucene.apache.org
 Sent: Wednesday, March 25, 2009 11:52:07 AM
 Subject: Strange anomaly(?) with string matching in query
 
 
 Hello,
 
 We've encountered a strange issue in our Solr install regarding a particular
 string that just doesn't seem to want to return results, despite the exact
 same string being in the index.
 
 What makes it even stranger is that we had the same data in a previous
 install of Solr, and it worked there, but doesn't here.
 
 The string that's been showing the trouble is Abilene Christian College --
 Students -- Yearbooks.  The field, in this case, is of type text. 
 Strangely enough, when we search for Abilene Christian College -- Students
 --, the relevant documents are returned.  It just fails when the full
 string is specified.
 
 At this point, I'm a little bit stymied.  Any suggestions or ideas would be
 highly appreciated.  In order to possibly help with diagnosis, I'm including
 links to, hopefully, relevant outputs and configurations.
 
 We're using Solr version 1.3.
 
 This is the output of a search for the string, with debugQuery turned on.
 http://pastebin.com/f72c017c1
 
 This is the output of a document containing the string in question.  The
 field is dc_subject. http://pastebin.com/f17a2e722
 
 Here is our current schema. http://pastebin.com/f2768bece
 
 If there's any more information or diagnostics that I can post or run,
 please let me know.  Thanks for your help and suggestions.
 
 -Kurt
 -- 
 View this message in context: 
 http://www.nabble.com/Strange-anomaly%28-%29-with-string-matching-in-query-tp22704639p22704639.html
 Sent from the Solr - User mailing list archive at Nabble.com.



REST interface for Query

2009-03-25 Thread Olson, Curtis B
Greetings, I am a new subscriber.   I'm Curtis Olson and I work for CACI
under contract at the U.S. Department of State, where we deal with
massive quantities of documents, so Solr is ideal for us.

 

We have a good sized index that we are starting to build up in
development.  Some of the filter constraints can get reasonably complex
(based upon individual users' access), and I find myself creating long
query strings for selection.  I like the REST interfaces for adding to
the index, and wish I could create an XML document for querying.  I
haven't found a request handler that can do this; does one exist?

 

Cheers,

Curtis Olson, S/ES-IRM, CACI Contractor

 



Re: Snapinstaller + Overlapping onDeckSearchers Problems

2009-03-25 Thread Ryan McKinley
It looks like the cache is configured big enough, but the autowarm
count is too big to have good performance.

Try something smaller and see if that fixes both problems.  I imagine
even just warming the most recent 100 queries would precache the most
important ones, but try some higher numbers and see if the performance
is acceptable.

For the filterCache and queryCache, autowarm queries the new index and
caches the results.
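
For example, a filterCache whose autowarmCount is much smaller than its size
might look like this (the numbers are illustrative, not taken from Cloude's
config):

  <filterCache
    class="solr.LRUCache"
    size="50000"
    initialSize="10000"
    autowarmCount="256"/>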




On Mar 25, 2009, at 11:48 AM, Cloude Porteus wrote:

Yes, I guess I'm running 40k queries when it starts :) I didn't know  
that
each count was equal to a query. I thought it was just copying the  
cache
entries from the previous searcher, but I guess that wouldn't  
include new
entries. I set it to the size of our filterCache. What should I set  
the the

autowarmCount to if I want to try and fill up the caches?

lookups : 8720372
hits : 8676170
hitratio : 0.99
inserts : 44551
evictions : 0
size : 44417
cumulative_lookups : 8720372
cumulative_hits : 8676170
cumulative_hitratio : 0.99
cumulative_inserts : 44551
cumulative_evictions : 0

best,
cloude

On Wed, Mar 25, 2009 at 8:38 AM, Ryan McKinley ryan...@gmail.com  
wrote:


I don't understand why this sometimes takes two minutes between  
the start
commit  /update and sometimes takes 20 minutes? One of our caches  
has

about
~40,000 items, but I can't imagine it taking 20 minutes to  
autowarm a

searcher.



What do your cache configs look like?

How big is the autowarm count?

If you have:
  queryResultCache
class=solr.LRUCache
size=512
initialSize=512
autowarmCount=32/

that will run 32 queries when solr starts.  Are you running 40K  
queries

when it starts?



ryan






--
VP of Product Development
Instructables.com

http://www.instructables.com/member/lebowski




How do I accomplish this (semi-)complicated setup?

2009-03-25 Thread Jesper Nøhr
Hi list,

I've finally settled on Solr, seeing as it has almost everything I
could want out of the box.

My setup is a complicated one. It will serve as the search backend on
Bitbucket.org, a mercurial hosting site. We have literally thousands
of code repositories, as well as users and other data. All this needs
to be indexed.

The complication comes in when we have private repositories. Only
select users have access to these, but we still need to index them.

How would I go about accomplishing this? I can't think of a clean way to do it.

Any pointers much appreciated.


Jesper


Re: REST interface for Query

2009-03-25 Thread Otis Gospodnetic

Curtis,

Like this?
https://issues.apache.org/jira/browse/SOLR-839

 
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



- Original Message 
 From: Olson, Curtis B olso...@state.gov
 To: solr-user@lucene.apache.org
 Sent: Wednesday, March 25, 2009 12:28:35 PM
 Subject: REST interface for Query
 
 Greetings, I am a new subscriber.   I'm Curtis Olson and I work for CACI
 under contract at the U.S. Department of State, where we deal with
 massive quantities of documents, so Solr is ideal for us.
 
 
 
 We have a good sized index that we are starting to build up in
 development.   Some of the filter constraints can get reasonable complex
 (based upon individual user's access), and I find myself creating long
 query strings for selection.  I like the REST interfaces for adding to
 the index, and wish I could create an XML document for querying.  I
 haven't found a request handler that can do this, does one exist?
 
 
 
 Cheers,
 
 Curtis Olson, S/ES-IRM, CACI Contractor



Re: How do I accomplish this (semi-)complicated setup?

2009-03-25 Thread Eric Pugh
You could index the user name or ID, and then in your application add the
username as a filter as you pass the query back to Solr.  Maybe have an
access_type that is Public or Private, and then for public searches only
include the ones that have an access_type of Public.
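
A minimal SolrJ sketch of that idea (the field names access_type and
allowed_users are hypothetical illustrations, not something Solr provides out
of the box; allowed_users would be a multiValued field listing the users who
may see a private repository):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
    String currentUserId = "u1";   // hypothetical: whoever is logged in
    SolrQuery q = new SolrQuery("repository search terms");
    // public repositories, plus private ones this user is explicitly allowed to see
    q.addFilterQuery("access_type:public OR allowed_users:" + currentUserId);
    QueryResponse rsp = server.query(q);

With this layout, making a private repository public, or granting a user access
to it, is just a matter of reindexing that repository's documents with the new
access_type / allowed_users values.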


Eric


On Mar 25, 2009, at 12:52 PM, Jesper Nøhr wrote:


Hi list,

I've finally settled on Solr, seeing as it has almost everything I
could want out of the box.

My setup is a complicated one. It will serve as the search backend on
Bitbucket.org, a mercurial hosting site. We have literally thousands
of code repositories, as well as users and other data. All this needs
to be indexed.

The complication comes in when we have private repositories. Only
select users have access to these, but we still need to index them.

How would I go about accomplishing this? I can't think of a clean  
way to do it.


Any pointers much appreciated.


Jesper


-
Eric Pugh | Principal | OpenSource Connections, LLC | 434.466.1467 | 
http://www.opensourceconnections.com
Free/Busy: http://tinyurl.com/eric-cal






Re: How do I accomplish this (semi-)complicated setup?

2009-03-25 Thread Jesper Nøhr
On Wed, Mar 25, 2009 at 5:57 PM, Eric Pugh
ep...@opensourceconnections.com wrote:
 You could index the user name or ID, and then in your application add as
 filter the username as you pass the query back to Solr.  Maybe have a
 access_type that is Public or Private, and then for public searches only
 include the ones that meet the access_type of Public.

That makes sense. Two questions on that:

1. More than one user can have access to a repository, so how would
that work? Also, if a user is added/removed, what's the best way to
keep that in sync?

2. In the event that a repository that is private, is made public, how
easy would it be to run an UPDATE so to speak?


Jesper

 On Mar 25, 2009, at 12:52 PM, Jesper Nøhr wrote:

 Hi list,

 I've finally settled on Solr, seeing as it has almost everything I
 could want out of the box.

 My setup is a complicated one. It will serve as the search backend on
 Bitbucket.org, a mercurial hosting site. We have literally thousands
 of code repositories, as well as users and other data. All this needs
 to be indexed.

 The complication comes in when we have private repositories. Only
 select users have access to these, but we still need to index them.

 How would I go about accomplishing this? I can't think of a clean way to
 do it.

 Any pointers much appreciated.


 Jesper

 -
 Eric Pugh | Principal | OpenSource Connections, LLC | 434.466.1467 |
 http://www.opensourceconnections.com
 Free/Busy: http://tinyurl.com/eric-cal







Re: How do I accomplish this (semi-)complicated setup?

2009-03-25 Thread Alejandro Gonzalez
You can even create separate indexes for private and public access if you need
to (and place them on separate machines), but I think Eric's suggestion is the
best and easiest.

On Wed, Mar 25, 2009 at 5:52 PM, Jesper Nøhr jno...@gmail.com wrote:

 Hi list,

 I've finally settled on Solr, seeing as it has almost everything I
 could want out of the box.

 My setup is a complicated one. It will serve as the search backend on
 Bitbucket.org, a mercurial hosting site. We have literally thousands
 of code repositories, as well as users and other data. All this needs
 to be indexed.

 The complication comes in when we have private repositories. Only
 select users have access to these, but we still need to index them.

 How would I go about accomplishing this? I can't think of a clean way to do
 it.

 Any pointers much appreciated.


 Jesper



Re: Strange anomaly(?) with string matching in query

2009-03-25 Thread Kurt Nordstrom

Otis:

Okay, I'm not sure whether I should be including the quotes in the query
when using the analyzer, so I've run it both ways (no quotes on the index
value).  I'll try to approximate the final tables returned for each term:

The field is dc_subject in both cases, being of type text

***

Version 1 (With Quotes)
Index Value: Abilene Christian College -- Students -- Yearbooks
Query Value: "Abilene Christian College -- Students -- Yearbooks"

Index final table:

  position:  1        2          3        4         5
  term:      abilene  christian  college  students  yearbooks

Query final table:

  position:  1        2          3        4         6
  term:      abilene  christian  college  students  yearbooks


Version 2 (Without Quotes)
Index Value: Abilene Christian College -- Students -- Yearbooks
Query Value: Abilene Christian College -- Students -- Yearbooks


Index final table:

  position:  1        2          3        4         5
  term:      abilene  christian  college  students  yearbooks

Query final table:

  position:  1        2          3        4         5
  term:      abilene  christian  college  students  yearbooks


***

The main difference seems to be that there is no position 5 in the query when I
surround the string with quotes; instead it skips to 6.  This happens at the
WordDelimiterFilterFactory step.  It seems to me like those tokens should be
returning a match, but either way, apparently they're not?  Any suggestions
at this point?


Otis Gospodnetic wrote:
 
 
 Hi,
 
 Take the whole string to your Solr Admin - Analysis page and analyze it. 
 Does it get analyzed the way you'd expect it to be analyzed?
 
 Otis
 --
 Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
 
 
 



Re: How do I accomplish this (semi-)complicated setup?

2009-03-25 Thread Alejandro Gonzalez
I can't see the problem with that. You can manage your users using a DB and
keep the permissions they have there, and create or delete users
without problems. You just have to maintain an index field for each
user with the ids of the repositories he can access, or you can create several
indexes and a users Solr index with a multi-valued field listing the indexes
the user can access.

If you then want to turn a private repository into a public one, you just have
to change the permissions field in your DB or users' index.

On Wed, Mar 25, 2009 at 6:02 PM, Jesper Nøhr jes...@noehr.org wrote:

 On Wed, Mar 25, 2009 at 5:57 PM, Eric Pugh
 ep...@opensourceconnections.com wrote:
  You could index the user name or ID, and then in your application add as
  filter the username as you pass the query back to Solr.  Maybe have a
  access_type that is Public or Private, and then for public searches only
  include the ones that meet the access_type of Public.

 That makes sense. Two questions on that:

 1. More than one user can have access to a repository, so how would
 that work? Also, if a user is added/removed, what's the best way to
 keep that in sync?

 2. In the event that a repository that is private, is made public, how
 easy would it be to run an UPDATE so to speak?


 Jesper

  On Mar 25, 2009, at 12:52 PM, Jesper Nøhr wrote:
 
  Hi list,
 
  I've finally settled on Solr, seeing as it has almost everything I
  could want out of the box.
 
  My setup is a complicated one. It will serve as the search backend on
  Bitbucket.org, a mercurial hosting site. We have literally thousands
  of code repositories, as well as users and other data. All this needs
  to be indexed.
 
  The complication comes in when we have private repositories. Only
  select users have access to these, but we still need to index them.
 
  How would I go about accomplishing this? I can't think of a clean way to
  do it.
 
  Any pointers much appreciated.
 
 
  Jesper
 
  -
  Eric Pugh | Principal | OpenSource Connections, LLC | 434.466.1467 |
  http://www.opensourceconnections.com
  Free/Busy: http://tinyurl.com/eric-cal
 
 
 
 
 



RE: REST interface for Query

2009-03-25 Thread Olson, Curtis B
Otis, that very much looks like what I'm after. 

Curtis

 -Original Message-
 From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com]
 Sent: Wednesday, March 25, 2009 12:53 PM
 To: solr-user@lucene.apache.org
 Subject: Re: REST interface for Query
 
 
 Curtis,
 
 Like this?
 https://issues.apache.org/jira/browse/SOLR-839
 
 
 Otis
 --
 Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
 
 
 
 - Original Message 
  From: Olson, Curtis B olso...@state.gov
  To: solr-user@lucene.apache.org
  Sent: Wednesday, March 25, 2009 12:28:35 PM
  Subject: REST interface for Query
 
  Greetings, I am a new subscriber.   I'm Curtis Olson and I work for
 CACI
  under contract at the U.S. Department of State, where we deal with
  massive quantities of documents, so Solr is ideal for us.
 
 
 
  We have a good sized index that we are starting to build up in
  development.   Some of the filter constraints can get reasonable
 complex
  (based upon individual user's access), and I find myself creating
 long
  query strings for selection.  I like the REST interfaces for adding
 to
  the index, and wish I could create an XML document for querying.  I
  haven't found a request handler that can do this, does one exist?
 
 
 
  Cheers,
 
  Curtis Olson, S/ES-IRM, CACI Contractor



getting started

2009-03-25 Thread nga pham
Hi,

Some of the getting started links don't work.  Can you please enable them?


Re: How do I accomplish this (semi-)complicated setup?

2009-03-25 Thread Jesper Nøhr
Hm, I must be missing something, then.

Consider this.

There are three repositories: A, B, and C. There are two users, U1 and U2.

Repository A is public, while B and C are private. Only U1 can access
B. No one can access C.

I index this data, such that Is_Private is true for B.

Now, when U2 searches, he will only see data for repo A. This is correct.

When U1 searches, what happens? AFAIK, he will also only see data for
A, unless we specify Is_Private:True, but then he will only see data
for B (and C, which he doesn't have access to.)

Secondly, say we grant U2 access to B. How do we tell Solr that he can
see it, then?

Sorry if I'm not making much sense here, but I'm quite confused.


Jesper



On Wed, Mar 25, 2009 at 6:13 PM, Alejandro Gonzalez
alejandrogonzalezd...@gmail.com wrote:
 i can't see the problem about that. you can manage your users using a DB and
 keep there the permissions they could have, and create or erase users
 without problems. you just have to manage a working index field for each
 user with repositories' ids he can access. or u can create several indexes
 and a users solr index with a multi-valued field with the indexes the user
 can access.

 if then u want to turn a private repository into public u just have to
 change the permissions field in your DB or users' index.

 On Wed, Mar 25, 2009 at 6:02 PM, Jesper Nøhr jes...@noehr.org wrote:

 On Wed, Mar 25, 2009 at 5:57 PM, Eric Pugh
 ep...@opensourceconnections.com wrote:
  You could index the user name or ID, and then in your application add as
  filter the username as you pass the query back to Solr.  Maybe have a
  access_type that is Public or Private, and then for public searches only
  include the ones that meet the access_type of Public.

 That makes sense. Two questions on that:

 1. More than one user can have access to a repository, so how would
 that work? Also, if a user is added/removed, what's the best way to
 keep that in sync?

 2. In the event that a repository that is private, is made public, how
 easy would it be to run an UPDATE so to speak?


 Jesper

  On Mar 25, 2009, at 12:52 PM, Jesper Nøhr wrote:
 
  Hi list,
 
  I've finally settled on Solr, seeing as it has almost everything I
  could want out of the box.
 
  My setup is a complicated one. It will serve as the search backend on
  Bitbucket.org, a mercurial hosting site. We have literally thousands
  of code repositories, as well as users and other data. All this needs
  to be indexed.
 
  The complication comes in when we have private repositories. Only
  select users have access to these, but we still need to index them.
 
  How would I go about accomplishing this? I can't think of a clean way to
  do it.
 
  Any pointers much appreciated.
 
 
  Jesper
 
  -
  Eric Pugh | Principal | OpenSource Connections, LLC | 434.466.1467 |
  http://www.opensourceconnections.com
  Free/Busy: http://tinyurl.com/eric-cal
 
 
 
 
 




Re: How do I accomplish this (semi-)complicated setup?

2009-03-25 Thread Alejandro Gonzalez
OK, so you can create a table in a DB where you have a row for each user and a
field with the repositories he/she can access. Then you just have to look at
the DB and include the repository name in the index, so you just have to
control (using query parameters) that the query is run against the right
repositories for that user.

Is that good for you?



On Wed, Mar 25, 2009 at 6:20 PM, Jesper Nøhr jes...@noehr.org wrote:

 Hm, I must be missing something, then.

 Consider this.

 There are three repositories, A and B, C. There are two users, U1 and U2.

 Repository A is public, while B and C are private. Only U1 can access
 B. No one can access C.

 I index this data, such that Is_Private is true for B.

 Now, when U2 searches, he will only see data for repo A. This is correct.

 When U1 searches, what happens? AFAIK, he will also only see data for
 A, unless we specify Is_Private:True, but then he will only see data
 for B (and C, which he doesn't have access to.)

 Secondly, say we grant U2 access to B. How do we tell Solr that he can
 see it, then?

 Sorry if I'm not making much sense here, but I'm quite confused.


 Jesper



 On Wed, Mar 25, 2009 at 6:13 PM, Alejandro Gonzalez
 alejandrogonzalezd...@gmail.com wrote:
  i can't see the problem about that. you can manage your users using a DB
 and
  keep there the permissions they could have, and create or erase users
  without problems. you just have to manage a working index field for
 each
  user with repositories' ids he can access. or u can create several
 indexes
  and a users solr index with a multi-valued field with the indexes the
 user
  can access.
 
  if then u want to turn a private repository into public u just have to
  change the permissions field in your DB or users' index.
 
  On Wed, Mar 25, 2009 at 6:02 PM, Jesper Nøhr jes...@noehr.org wrote:
 
  On Wed, Mar 25, 2009 at 5:57 PM, Eric Pugh
  ep...@opensourceconnections.com wrote:
   You could index the user name or ID, and then in your application add
 as
   filter the username as you pass the query back to Solr.  Maybe have a
   access_type that is Public or Private, and then for public searches
 only
   include the ones that meet the access_type of Public.
 
  That makes sense. Two questions on that:
 
  1. More than one user can have access to a repository, so how would
  that work? Also, if a user is added/removed, what's the best way to
  keep that in sync?
 
  2. In the event that a repository that is private, is made public, how
  easy would it be to run an UPDATE so to speak?
 
 
  Jesper
 
   On Mar 25, 2009, at 12:52 PM, Jesper Nøhr wrote:
  
   Hi list,
  
   I've finally settled on Solr, seeing as it has almost everything I
   could want out of the box.
  
   My setup is a complicated one. It will serve as the search backend on
   Bitbucket.org, a mercurial hosting site. We have literally thousands
   of code repositories, as well as users and other data. All this needs
   to be indexed.
  
   The complication comes in when we have private repositories. Only
   select users have access to these, but we still need to index them.
  
   How would I go about accomplishing this? I can't think of a clean way
 to
   do it.
  
   Any pointers much appreciated.
  
  
   Jesper
  
   -
   Eric Pugh | Principal | OpenSource Connections, LLC | 434.466.1467 |
   http://www.opensourceconnections.com
   Free/Busy: http://tinyurl.com/eric-cal
  
  
  
  
  
 
 



Re: Strange anomaly(?) with string matching in query

2009-03-25 Thread Kurt Nordstrom

Otis,

Absolutely.  Here are the tokenizers and filters for the text fieldtype in
the schema.  http://pastebin.com/f2bb249f3

Thanks!



That's what I suspected.  Want to paste the relevant tokenizer+filters
sections of your schema?  The index-time and query-time analysis has to be
the same or compatible enough, and that's not the case here.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

-- 
View this message in context: 
http://www.nabble.com/Strange-anomaly%28-%29-with-string-matching-in-query-tp22704639p22707191.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: getting started

2009-03-25 Thread Erick Erickson
Which links? Please be as specific as possible.

Erick

On Wed, Mar 25, 2009 at 1:20 PM, nga pham nga.p...@gmail.com wrote:

 Hi

 Some of the getting started link dont work.  Can you please enable it?



Re: getting started

2009-03-25 Thread nga pham
Oops my mistake. Sorry for the trouble

On Wed, Mar 25, 2009 at 10:42 AM, Erick Erickson erickerick...@gmail.comwrote:

 Which links? Please be as specific as possible.

 Erick

 On Wed, Mar 25, 2009 at 1:20 PM, nga pham nga.p...@gmail.com wrote:

  Hi
 
  Some of the getting started link dont work.  Can you please enable it?
 



Can TermIndexInterval be set in Solr?

2009-03-25 Thread Burton-West, Tom
Hello all,

We are experimenting with the ShingleFilter with a very large document set (1 
million full-text books). Because the ShingleFilter indexes every word pair as 
a token, the number of unique terms increases tremendously.  In our experiments 
so far the tii and tis files are getting very large and the tii file will 
eventually be too large to fit into memory.  If we set the TermIndexInterval to 
a larger number than the default 128, the tii file size should go down.  Is it 
possible to set this somehow through Solr configuration or do we need to modify 
the code somewhere and call IndexWriter.setTermIndexInterval?


Tom

Tom Burton-West
Digital Library Production Services
University of Michigan Library

 

Re: getting started

2009-03-25 Thread nga pham
http://lucene.apache.org/solr/tutorial.html#Getting+Started

link - lucene QueryParser syntax

is not working

On Wed, Mar 25, 2009 at 10:48 AM, nga pham nga.p...@gmail.com wrote:

 Oops my mistake. Sorry for the trouble

   On Wed, Mar 25, 2009 at 10:42 AM, Erick Erickson 
 erickerick...@gmail.com wrote:

 Which links? Please be as specific as possible.

 Erick

 On Wed, Mar 25, 2009 at 1:20 PM, nga pham nga.p...@gmail.com wrote:

  Hi
 
  Some of the getting started link dont work.  Can you please enable it?
 





Re: Realtime Searching..

2009-03-25 Thread John Wang
Hi Jon:
We are running various LinkedIn search systems on Zoie in production.

-John

On Thu, Feb 19, 2009 at 9:11 AM, Jon Baer jonb...@gmail.com wrote:

 This part:

 The part of Zoie that enables real-time searchability is the fact that
 ZoieSystem contains three IndexDataLoader objects:

* a RAMLuceneIndexDataLoader, which is a simple wrapper around a
 RAMDirectory,
* a DiskLuceneIndexDataLoader, which can index directly to the
 FSDirectory (followed by an optimize() call if a specified optimizeDuration
 has been exceeded) in batches via an intermediary
* BatchedIndexDataLoader, whose primary job is to queue up and batch
 DataEvents that need to be flushed to disk

 Sounds like it (might) be / (can) be layered into Solr somehow, has anyone
 been using this project or testing it?

 - Jon


 On Feb 19, 2009, at 9:44 AM, Genta Kaneyama wrote:

  Michael,

 I think you might be get interested in zoie.

 zoie: real-time search and indexing system built on Apache Lucene
 http://code.google.com/p/zoie/

 Zoie is realtime search project for lucene by Linkedin.
 Basically, I think it is similar technique to a Otis's trick.

  In the mean time you can use the trick of one large and less frequently
 updated core and one small and more frequently updated core + distributed
 search across them.

 Otis


 Genta


 On Sat, Feb 7, 2009 at 3:02 AM, Michael Austin mausti...@gmail.com
 wrote:

 I need to find a solution for our current social application. It's low
 traffic now because we are early on.. However I'm expecting and want to
 be
 prepaired to grow.  We have messages of different types that are
 aggregated into one stream. Each of these message types have much
 different
 data so that our main queries have a few unions and many joins.  I know
 that
 Solr would work great for searching but we need a realtime system
 (twitter-like) to view user updates.  I'm not interested in a few minutes
 delay; I need something that will be fast updating and searchable and
 have n
 columns per record/document. Can solor do this? what is Ocean?

 Thanks





Re: getting started

2009-03-25 Thread Erick Erickson
OK, now I'll turn it over to the folks who actually maintain that site G.

Meanwhile, here's the link to the 2.4.1 query syntax.

http://lucene.apache.org/java/2_4_1/queryparsersyntax.html

Best
Erick

On Wed, Mar 25, 2009 at 2:00 PM, nga pham nga.p...@gmail.com wrote:

 http://lucene.apache.org/solr/tutorial.html#Getting+Started

 link - lucene QueryParser syntax

 is not working

 On Wed, Mar 25, 2009 at 10:48 AM, nga pham nga.p...@gmail.com wrote:

  Oops my mistake. Sorry for the trouble
 
On Wed, Mar 25, 2009 at 10:42 AM, Erick Erickson 
  erickerick...@gmail.com wrote:
 
  Which links? Please be as specific as possible.
 
  Erick
 
  On Wed, Mar 25, 2009 at 1:20 PM, nga pham nga.p...@gmail.com wrote:
 
   Hi
  
   Some of the getting started link dont work.  Can you please enable it?
  
 
 
 



Solr OpenBitSet OutofMemory Error

2009-03-25 Thread smock

Hello,
After running a nightly release from around January of Solr for about 4
weeks without any problems, I'm starting to see OutofMemory errors:

Mar 24, 2009 1:35:36 AM org.apache.solr.common.SolrException log
SEVERE: java.lang.OutOfMemoryError: Java heap space
at org.apache.lucene.util.OpenBitSet.clone(OpenBitSet.java:640)


Is this a common error to see?  I'm running a lot of faceted queries on an
index with about 7.5 million documents.  I'm giving about 8GBs of memory to
Solr.  While I do update the index frequently, I also optimize frequently -
it's a little strange to me that this problem is showing up now after four
weeks of zero problems.  Any suggestions/ideas would be very much
appreciated!

Thanks,
Harish
-- 
View this message in context: 
http://www.nabble.com/Solr-OpenBitSet-OutofMemory-Error-tp22707576p22707576.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: How do I accomplish this (semi-)complicated setup?

2009-03-25 Thread Jesper Nøhr
OK, we're getting closer. I just have two final questions regarding this then:

1. This would also include all the public repositories, right? If so,
how would such a query look? Some kind of is_public:true AND ...?

2. When a repository is made public, the is_public property in the
Solr index needs to reflect this. How can such an update be made
without having to purge and re-index?


Jesper


On Wed, Mar 25, 2009 at 6:29 PM, Alejandro Gonzalez
alejandrogonzalezd...@gmail.com wrote:
 ok so u can create a table in a DB where you have a row foreach user and a
 field with the reps he/she can access. Then you just have to take a look on
 the db and include the repository name in the index. so you just have to
 control (using query parameters) if the query is done for the right reps for
 that user.

 is it good for u?



 On Wed, Mar 25, 2009 at 6:20 PM, Jesper Nøhr jes...@noehr.org wrote:

 Hm, I must be missing something, then.

 Consider this.

 There are three repositories, A and B, C. There are two users, U1 and U2.

 Repository A is public, while B and C are private. Only U1 can access
 B. No one can access C.

 I index this data, such that Is_Private is true for B.

 Now, when U2 searches, he will only see data for repo A. This is correct.

 When U1 searches, what happens? AFAIK, he will also only see data for
 A, unless we specify Is_Private:True, but then he will only see data
 for B (and C, which he doesn't have access to.)

 Secondly, say we grant U2 access to B. How do we tell Solr that he can
 see it, then?

 Sorry if I'm not making much sense here, but I'm quite confused.


 Jesper



 On Wed, Mar 25, 2009 at 6:13 PM, Alejandro Gonzalez
 alejandrogonzalezd...@gmail.com wrote:
  i can't see the problem about that. you can manage your users using a DB
 and
  keep there the permissions they could have, and create or erase users
  without problems. you just have to manage a working index field for
 each
  user with repositories' ids he can access. or u can create several
 indexes
  and a users solr index with a multi-valued field with the indexes the
 user
  can access.
 
  if then u want to turn a private repository into public u just have to
  change the permissions field in your DB or users' index.
 
  On Wed, Mar 25, 2009 at 6:02 PM, Jesper Nøhr jes...@noehr.org wrote:
 
  On Wed, Mar 25, 2009 at 5:57 PM, Eric Pugh
  ep...@opensourceconnections.com wrote:
   You could index the user name or ID, and then in your application add
 as
   filter the username as you pass the query back to Solr.  Maybe have a
   access_type that is Public or Private, and then for public searches
 only
   include the ones that meet the access_type of Public.
 
  That makes sense. Two questions on that:
 
  1. More than one user can have access to a repository, so how would
  that work? Also, if a user is added/removed, what's the best way to
  keep that in sync?
 
  2. In the event that a repository that is private, is made public, how
  easy would it be to run an UPDATE so to speak?
 
 
  Jesper
 
   On Mar 25, 2009, at 12:52 PM, Jesper Nøhr wrote:
  
   Hi list,
  
   I've finally settled on Solr, seeing as it has almost everything I
   could want out of the box.
  
   My setup is a complicated one. It will serve as the search backend on
   Bitbucket.org, a mercurial hosting site. We have literally thousands
   of code repositories, as well as users and other data. All this needs
   to be indexed.
  
   The complication comes in when we have private repositories. Only
   select users have access to these, but we still need to index them.
  
   How would I go about accomplishing this? I can't think of a clean way
 to
   do it.
  
   Any pointers much appreciated.
  
  
   Jesper
  
   -
   Eric Pugh | Principal | OpenSource Connections, LLC | 434.466.1467 |
   http://www.opensourceconnections.com
   Free/Busy: http://tinyurl.com/eric-cal
  
  
  
  
  
 
 




Re: Can TermIndexInterval be set in Solr?

2009-03-25 Thread Otis Gospodnetic

I think it's the latter.  I don't think the term interval is exposed anywhere.
If you expose it through the config and provide a patch, I think we can add 
this to the core quickly.


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



- Original Message 
 From: Burton-West, Tom tburt...@umich.edu
 To: solr-user@lucene.apache.org solr-user@lucene.apache.org
 Cc: Farber, Phillip pfar...@umich.edu; Dueber, William 
 dueb...@umich.edu
 Sent: Wednesday, March 25, 2009 1:50:17 PM
 Subject: Can TermIndexInterval be set in Solr?
 
 Hello all,
 
 We are experimenting with the ShingleFilter with a very large document set (1 
 million full-text books). Because the ShingleFilter indexes every word pair 
 as a 
 token, the number of unique terms increases tremendously.  In our experiments 
 so 
 far the tii and tis files are getting very large and the tii file will 
 eventually be too large to fit into memory.  If we set the TermIndexInterval 
 to 
 a larger number than the default 128, the tii file size should go down.  Is 
 it 
 possible to set this somehow through Solr configuration or do we need to 
 modify 
 the code somewhere and call IndexWriter.setTermIndexInterval?
 
 
 Tom
 
 Tom Burton-West
 Digital Library Production Services
 University of Michigan Library



Re: Realtime Searching..

2009-03-25 Thread Otis Gospodnetic

Would it not make more sense to wait for Lucene's IW+IR marriage and other 
things happening in core Lucene that will make near-real-time search possible?

 Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



- Original Message 
 From: John Wang john.w...@gmail.com
 To: solr-user@lucene.apache.org
 Sent: Wednesday, March 25, 2009 2:34:04 PM
 Subject: Re: Realtime Searching..
 
 Hi Jon:
 We are running various LinkedIn search systems on Zoie in production.
 
 -John
 
 On Thu, Feb 19, 2009 at 9:11 AM, Jon Baer wrote:
 
  This part:
 
  The part of Zoie that enables real-time searchability is the fact that
  ZoieSystem contains three IndexDataLoader objects:
 
 * a RAMLuceneIndexDataLoader, which is a simple wrapper around a
  RAMDirectory,
 * a DiskLuceneIndexDataLoader, which can index directly to the
  FSDirectory (followed by an optimize() call if a specified optimizeDuration
  has been exceeded) in batches via an intermediary
 * BatchedIndexDataLoader, whose primary job is to queue up and batch
  DataEvents that need to be flushed to disk
 
  Sounds like it (might) be / (can) be layered into Solr somehow, has anyone
  been using this project or testing it?
 
  - Jon
 
 
  On Feb 19, 2009, at 9:44 AM, Genta Kaneyama wrote:
 
   Michael,
 
  I think you might be get interested in zoie.
 
  zoie: real-time search and indexing system built on Apache Lucene
  http://code.google.com/p/zoie/
 
  Zoie is realtime search project for lucene by Linkedin.
  Basically, I think it is similar technique to a Otis's trick.
 
   In the mean time you can use the trick of one large and less frequently
  updated core and one small and more frequently updated core + 
  distributed
  search across them.
 
  Otis
 
 
  Genta
 
 
  On Sat, Feb 7, 2009 at 3:02 AM, Michael Austin 
  wrote:
 
  I need to find a solution for our current social application. It's low
  traffic now because we are early on.. However I'm expecting and want to
  be
  prepaired to grow.  We have messages of different types that are
  aggregated into one stream. Each of these message types have much
  different
  data so that our main queries have a few unions and many joins.  I know
  that
  Solr would work great for searching but we need a realtime system
  (twitter-like) to view user updates.  I'm not interested in a few minutes
  delay; I need something that will be fast updating and searchable and
  have n
  columns per record/document. Can solor do this? what is Ocean?
 
  Thanks
 
 
 



SRW/U and OAI-PMH servers over solr

2009-03-25 Thread Miguel Coxo
Hello there,

I'm looking for a way to implement SRW/U and OAI-PMH servers over solr,
similar to what I have found here:
http://marc.info/?l=solr-dev&m=116405019011211&w=2 . Well, actually, if it is
decoupled (not a plugin) that would be OK, if not better =).

I wanted to know if anyone knows if there is something available out there
that accomplishes this.

From what I have found so far, OCLC has both server implementations
available. I haven't looked too deep into the SRW/U one, but the OAI-PMH can
be configured to work with solr (by implementing a class that does the
actual calls to the data provider).

Any information that you guys can provide is welcome =).

-- 
All the best, Miguel Coxo.


Partition index by time using Solr

2009-03-25 Thread vivek sar
Hi,

  I've used Lucene before, but new to Solr. I've gone through the
mailing list, but unable to find any clear idea on how to partition
Solr indexes. Here is what we want,

  1) Be able to partition indexes by timestamp - basically partition
per day (create a new index directory every day)

  2) Be  able to search partitions based on timestamp. All our queries
are time based, so instead of looking into all the partitions I want
to go directly to the partitions where the data might be.

  3) Be able to purge any data older than 6 months without bringing
down the application. Since, partitions would be marked by timestamp
we would just have to delete the old partitions.


  This is going to be a distributed system with 2 boxes each running
an instance of Solr. I don't  want to replicate data, but each box may
have same timestamp partition with different data. We would be
indexing on avg of  20 million documents (each document = 500 bytes)
with estimate of 10g in index size - evenly distributed across
machines
  (each machine would get roughly 5g of index everyday).

  My questions,

  1) Is this all possible using Solr? If not, should I just do this
using Lucene or is there any other out-of-box alternative?
  2) If it's possible in Solr how do we do this - configuration, setup etc.
  3) How would I optimize the partitions - would it be required when using Solr?

  Thanks,
  -vivek
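
(Not a built-in feature, but one way this is often approached with stock
Solr 1.3 is one core per day plus distributed search over just the cores
that cover the queried time range. A rough sketch -- the core names and
hosts are invented, and the CoreAdmin parameters should be checked against
the wiki:)

  # create the day's partition on each box (CoreAdmin)
  http://box1:8983/solr/admin/cores?action=CREATE&name=core-20090325&instanceDir=core-20090325

  # query only the partitions covering the requested time range, across both boxes
  http://box1:8983/solr/core-20090325/select?q=...
      &shards=box1:8983/solr/core-20090324,box1:8983/solr/core-20090325,box2:8983/solr/core-20090324,box2:8983/solr/core-20090325

Purging data older than 6 months is then just a matter of removing that
day's cores and deleting their index directories, and optimizing can be
done per core, e.g. once a day's partition stops receiving new documents.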


Re: How do I accomplish this (semi-)complicated setup?

2009-03-25 Thread Alejandro Gonzalez
Try using the DB for permission management: when you want to make a rep public
you just add its id or name to every user's permissions field. I think
you don't need to add any is_public field to the index, just an id or name
field saying which rep the indexed doc belongs to. So you can pre-filter the
reps by querying the DB for the reps the user has permissions on and adding
that restriction to the Solr query. This way you can change reps' permissions
without re-indexing. So the Solr query, if the current user is allowed
to search in reps 1 and 2, should be something like ...rep_id:(1 OR 2)...


Alex
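
(For illustration, assuming the index has a rep_id field and the application
has already looked up from the DB that the current user may see public repos
plus private repos 12 and 34, the pre-filtered request could look roughly
like this -- the field names and ids are made up, and the fq value would be
URL-encoded in practice:)

  http://localhost:8983/solr/select
      ?q=user's search terms
      &fq=is_public:true OR rep_id:(12 OR 34)

(If, as suggested above, even the public repos are tracked per user in the
DB, the is_public clause can be dropped and the fq becomes just the list of
allowed rep_ids.)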


On Wed, Mar 25, 2009 at 8:06 PM, Jesper Nøhr jes...@noehr.org wrote:

 OK, we're getting closer. I just have two final questions regarding this
 then:

 1. This would also include all the public repositories, right? If so,
 how would such a query look? Some kind of is_public:true AND ...?

 2. When a repository is made public, the is_public property in the
 Solr index needs to reflect this. How can such an update be made
 without having to purge and re-index?


 Jesper


 On Wed, Mar 25, 2009 at 6:29 PM, Alejandro Gonzalez
 alejandrogonzalezd...@gmail.com wrote:
  ok so u can create a table in a DB where you have a row foreach user and
 a
  field with the reps he/she can access. Then you just have to take a look
 on
  the db and include the repository name in the index. so you just have to
  control (using query parameters) if the query is done for the right reps
 for
  that user.
 
  is it good for u?
 
 
 
  On Wed, Mar 25, 2009 at 6:20 PM, Jesper Nøhr jes...@noehr.org wrote:
 
  Hm, I must be missing something, then.
 
  Consider this.
 
  There are three repositories, A and B, C. There are two users, U1 and
 U2.
 
  Repository A is public, while B and C are private. Only U1 can access
  B. No one can access C.
 
  I index this data, such that Is_Private is true for B.
 
  Now, when U2 searches, he will only see data for repo A. This is
 correct.
 
  When U1 searches, what happens? AFAIK, he will also only see data for
  A, unless we specify Is_Private:True, but then he will only see data
  for B (and C, which he doesn't have access to.)
 
  Secondly, say we grant U2 access to B. How do we tell Solr that he can
  see it, then?
 
  Sorry if I'm not making much sense here, but I'm quite confused.
 
 
  Jesper
 
 
 
  On Wed, Mar 25, 2009 at 6:13 PM, Alejandro Gonzalez
  alejandrogonzalezd...@gmail.com wrote:
   i can't see the problem about that. you can manage your users using a
 DB
  and
   keep there the permissions they could have, and create or erase users
   without problems. you just have to manage a working index field for
  each
   user with repositories' ids he can access. or u can create several
  indexes
   and a users solr index with a multi-valued field with the indexes the
  user
   can access.
  
   if then u want to turn a private repository into public u just have to
   change the permissions field in your DB or users' index.
  
   On Wed, Mar 25, 2009 at 6:02 PM, Jesper Nøhr jes...@noehr.org
 wrote:
  
   On Wed, Mar 25, 2009 at 5:57 PM, Eric Pugh
   ep...@opensourceconnections.com wrote:
You could index the user name or ID, and then in your application
 add
  as
filter the username as you pass the query back to Solr.  Maybe have
 a
access_type that is Public or Private, and then for public searches
  only
include the ones that meet the access_type of Public.
  
   That makes sense. Two questions on that:
  
   1. More than one user can have access to a repository, so how would
   that work? Also, if a user is added/removed, what's the best way to
   keep that in sync?
  
   2. In the event that a repository that is private, is made public,
 how
   easy would it be to run an UPDATE so to speak?
  
  
   Jesper
  
On Mar 25, 2009, at 12:52 PM, Jesper Nøhr wrote:
   
Hi list,
   
I've finally settled on Solr, seeing as it has almost everything I
could want out of the box.
   
My setup is a complicated one. It will serve as the search backend
 on
Bitbucket.org, a mercurial hosting site. We have literally
 thousands
of code repositories, as well as users and other data. All this
 needs
to be indexed.
   
The complication comes in when we have private repositories. Only
select users have access to these, but we still need to index
 them.
   
How would I go about accomplishing this? I can't think of a clean
 way
  to
do it.
   
Any pointers much appreciated.
   
   
Jesper
   
-
Eric Pugh | Principal | OpenSource Connections, LLC | 434.466.1467
 |
http://www.opensourceconnections.com
Free/Busy: http://tinyurl.com/eric-cal
   
   
   
   
   
  
  
 
 



Re: Delta import

2009-03-25 Thread AlexxelA

Yes, my database is remote, MySQL 5, and I'm using Connector/J 5.1.7.  My index
has 2 documents.  When I try to do, let's say, 14 updates it takes about 18
sec total.  Here's the resulting log of the operation:

2009-03-25 15:53:57 org.apache.solr.handler.dataimport.JdbcDataSource$1 call
INFO: Time taken for getConnection(): 411
2009-03-25 15:53:59 org.apache.solr.handler.dataimport.DocBuilder
collectDelta
INFO: Completed ModifiedRowKey for Entity: profil rows obtained : 14
2009-03-25 15:53:59 org.apache.solr.handler.dataimport.DocBuilder
collectDelta
INFO: Completed DeletedRowKey for Entity: profil rows obtained : 0
2009-03-25 15:53:59 org.apache.solr.handler.dataimport.DocBuilder
collectDelta
INFO: Completed parentDeltaQuery for Entity: profil
2009-03-25 15:54:00 org.apache.solr.core.SolrDeletionPolicy onInit
INFO: SolrDeletionPolicy.onInit: commits:num=1
   
commit{dir=/home/solr-tomcat/solr/data/index,segFN=segments_sb,version=1237322897338,generation=1019,filenames=[_uj.frq,
_uj.fdx, _uj.tii, _uj.nrm, _uj.tis, _uj.fnm, _uj.prx, segments_sb, _uj.fdt]
2009-03-25 15:54:00 org.apache.solr.core.SolrDeletionPolicy updateCommits
INFO: last commit = 1237322897338
2009-03-25 15:54:13 org.apache.solr.handler.dataimport.DocBuilder doDelta
INFO: Delta Import completed successfully BOTTLE NECK
2009-03-25 15:54:13 org.apache.solr.handler.dataimport.DocBuilder commit
INFO: Full Import completed successfully
2009-03-25 15:54:13 org.apache.solr.update.DirectUpdateHandler2 commit
INFO: start commit(optimize=true,waitFlush=false,waitSearcher=true)
2009-03-25 15:54:15 org.apache.solr.core.SolrDeletionPolicy onCommit
INFO: SolrDeletionPolicy.onCommit: commits:num=2
   
commit{dir=/home/solr-tomcat/solr/data/index,segFN=segments_sb,version=1237322897338,generation=1019,filenames=[_uj.frq,
_uj.fdx, _uj.tii, _uj.nrm, _uj.tis, _uj.fnm, _uj.prx, segments_sb, _uj.fdt]
   
commit{dir=/home/solr-tomcat/solr/data/index,segFN=segments_sc,version=1237322897339,generation=1020,filenames=[_ul.prx,
_ul.fnm, _ul.tii, _ul.fdt, _ul.nrm, _ul.fdx, _ul.tis, _ul.frq, segments_sc]
2009-03-25 15:54:15 org.apache.solr.core.SolrDeletionPolicy updateCommits
INFO: last commit = 1237322897339
2009-03-25 15:54:15 org.apache.solr.search.SolrIndexSearcher init
INFO: Opening searc...@3da850 main

When I do a full-import it is much faster: it takes about 1 min to index 2
documents.  I tried to play a bit with the config but nothing seems to work
for the moment.

What I want to do is pretty interactive: my production DB has 1.2M documents
and I must be able to delta-import around 2k updates every 5 min.  Is it
possible with the DataImportHandler to reach those kinds of numbers?



Shalin Shekhar Mangar wrote:
 
 On Wed, Mar 25, 2009 at 2:25 AM, AlexxelA
 alexandre.boudrea...@canoe.cawrote:
 

 Ok i'm ok with the fact the solr gonna do X request to database for X
 update.. but when i try to run the delta-import command with 2 row to
 update is it normal that its kinda really slow ~ 1 document fetched / sec
 ?


 Not really, I've seen 1000x faster. Try firing a few of those queries on
 the
 database directly. Are they slow? Is the database remote?
 
 -- 
 Regards,
 Shalin Shekhar Mangar.
 
 

-- 
View this message in context: 
http://www.nabble.com/Delta-import-tp22663196p22710222.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Snapinstaller + Overlapping onDeckSearchers Problems

2009-03-25 Thread Cloude Porteus
I set the autowarm to 2000, which only takes about two minutes and resolves
my issues.

Thanks for your help!

best,
cloude
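
(For reference, the corresponding solrconfig.xml entry would look roughly
like the following; the size value is only a guess based on the ~44k cache
entries shown in the stats quoted below:)

  <filterCache
    class="solr.LRUCache"
    size="50000"
    initialSize="512"
    autowarmCount="2000"/>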

On Wed, Mar 25, 2009 at 9:34 AM, Ryan McKinley ryan...@gmail.com wrote:

 It looks like the cache is configured big enough, but the autowarm count is
 too big to have good performance.

 Try something smaller and see if that fixes both problems.  I imagine even
 just warming the most recent 100 queries would precache the most important
 ones, but try some higher numbers and see if the performance is acceptable.

 for the filterCache and queryCache, autowarm queries the new index and
 caches the results.




 On Mar 25, 2009, at 11:48 AM, Cloude Porteus wrote:

  Yes, I guess I'm running 40k queries when it starts :) I didn't know that
 each count was equal to a query. I thought it was just copying the cache
 entries from the previous searcher, but I guess that wouldn't include new
 entries. I set it to the size of our filterCache. What should I set the
 autowarmCount to if I want to try and fill up the caches?

 lookups : 8720372
 hits : 8676170
 hitratio : 0.99
 inserts : 44551
 evictions : 0
 size : 44417
 cumulative_lookups : 8720372
 cumulative_hits : 8676170
 cumulative_hitratio : 0.99
 cumulative_inserts : 44551
 cumulative_evictions : 0

 best,
 cloude

 On Wed, Mar 25, 2009 at 8:38 AM, Ryan McKinley ryan...@gmail.com wrote:


 I don't understand why this sometimes takes two minutes between the
 start
 commit  /update and sometimes takes 20 minutes? One of our caches has

 about

 ~40,000 items, but I can't imagine it taking 20 minutes to autowarm a
 searcher.



 What do your cache configs look like?

 How big is the autowarm count?

 If you have:
  <queryResultCache
    class="solr.LRUCache"
    size="512"
    initialSize="512"
    autowarmCount="32"/>

 that will run 32 queries when solr starts.  Are you running 40K queries

 when it starts?



 ryan





 --
 VP of Product Development
 Instructables.com

 http://www.instructables.com/member/lebowski





-- 
VP of Product Development
Instructables.com

http://www.instructables.com/member/lebowski


Re: SRW/U and OAI-PMH servers over solr

2009-03-25 Thread Ryan McKinley
I implemented OAI-PMH for solr a few years back for the Massachusetts  
library system...  it appears not to be running right now, but  
check...  http://www.digitalcommonwealth.org/


It would be great to get that code revived and live open source  
somewhere.  As is, it uses a pre-1.3 release that was patched to
support modifiable fields.  (If I did it again, I would suggest  
keeping a parallel SQL database for some of this stuff)


ryan


On Mar 25, 2009, at 3:30 PM, Miguel Coxo wrote:


Hello there,

I'm looking for a way to implement SRW/U and a OAI-PMH servers over  
solr,

similar to what i have found here:
http://marc.info/?l=solr-devm=116405019011211w=2 . Well actually  
if it is

decoupled (not a plugin) would be ok, if not better =).

I wanted to know if anyone knows if there is something available out  
there

that accomplishes this.

For what i have found so far, OCLC has both server implementations
available. I haven't looked too deep into the SRW/U one, but the OAI- 
PMH can

be configured to work with solr (by implementing a class that does the
actual calls to the data provider).

Any information that you guys can provide is welcome =).

--
All the best, Miguel Coxo.




large index vs multicore

2009-03-25 Thread Manepalli, Kalyan
Hi All,
In my project, I have one primary core containing all the basic 
information for a product.
Now I need to add additional information which will be searched and displayed 
in conjunction with the product results.
My question is - from a design and query speed point of view - should I add a new core 
to handle the additional data, or should I add the data to the existing core?

The data size is not very large around 150,000 - 200,000 documents.

Any insights into this will be helpful

Thanks,
Kalyan Manepalli



solr_hostname in scripts.conf

2009-03-25 Thread Garafola Timothy
I've a question.  Is it safe to use 'localhost' as solr_hostname in
scripts.conf?

-- 
-Tim


Re: get all facets

2009-03-25 Thread Ashish P

Actually what I meant was: if there are 100 indexed fields, then there are 100
possible facet fields, right?
So whenever I create a SolrQuery, I have to call addFacetField(fieldName) for
each one. Can I avoid this and just get all facet fields?

Sorry for the confusion.

Thanks again,
Ashish 
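
(As far as I know there is no single switch to facet on every field; a rough
SolrJ sketch that avoids hard-coding field names by fetching them once with
a LukeRequest -- this assumes the LukeRequestHandler is registered, as in the
example solrconfig, and exception handling is omitted:)

  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.SolrServer;
  import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
  import org.apache.solr.client.solrj.request.LukeRequest;
  import org.apache.solr.client.solrj.response.LukeResponse;
  import org.apache.solr.client.solrj.response.QueryResponse;

  SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

  // Discover the indexed fields once, then facet on all of them.
  LukeRequest luke = new LukeRequest();
  LukeResponse fields = luke.process(server);

  SolrQuery query = new SolrQuery("*:*");
  query.setFacet(true);
  query.setFacetMinCount(1);
  for (String field : fields.getFieldInfo().keySet()) {
      query.addFacetField(field);
  }
  QueryResponse rsp = server.query(query);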


Shalin Shekhar Mangar wrote:
 
 On Wed, Mar 25, 2009 at 7:30 AM, Ashish P ashish.ping...@gmail.com
 wrote:
 

 Can I get all the facets in QueryResponse??
 
 
 You can get all the facets that are returned by the server. Set
 facet.limit
 to the number of facets you want to retrieve.
 
 See
 http://lucene.apache.org/solr/api/solrj/org/apache/solr/client/solrj/SolrQuery.html#setFacetLimit(int)
 -- 
 Regards,
 Shalin Shekhar Mangar.
 
 

-- 
View this message in context: 
http://www.nabble.com/get-all-facets-tp22693809p22714256.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: large index vs multicore

2009-03-25 Thread Ryan McKinley




My question is - From design and query speed point of - should I add  
new core to handle the additional data or should I add the data to  
the existing core.


Do you ever need to get results from both sets of data in the same  
query?  If so, putting them in the same index will be faster.  If  
every query is always limited to results within one set or the other --  
and the doc count is not huge, then the choice of single core vs multi  
core is more about what you are more comfortable managing than it is  
about query speeds.


Advantages of multicore-
 - the distinct data is in different indexes, you can maintain them  
independently

   (perhaps one data set never changes and the other changes often)

Advantages of single core (with multiple data sets)
 - everything is in one place
 - replicate / load balance a single index rather than multiple.


ryan


Re: large index vs multicore

2009-03-25 Thread Otis Gospodnetic

Hi,

Without knowing the details, I'd say keep it in the same index if the 
additional information shares some/enough fields with the main product data and 
separately if it's sufficiently distinct (this also means 2 queries and manual 
merging/joining).


Otis --
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



- Original Message 
 From: Manepalli, Kalyan kalyan.manepa...@orbitz.com
 To: solr-user@lucene.apache.org solr-user@lucene.apache.org
 Sent: Wednesday, March 25, 2009 5:46:40 PM
 Subject: large index vs multicore
 
 Hi All,
 In my project, I have one primary core containing all the basic 
 information for a product.
 Now I need to add additional information which will be searched and displayed 
 in 
 conjunction with the product results.
 My question is - From design and query speed point of - should I add new core 
 to 
 handle the additional data or should I add the data to the existing core.
 
 The data size is not very large around 150,000 - 200,000 documents.
 
 Any insights into this will be helpful
 
 Thanks,
 Kalyan Manepalli



Re: Solr OpenBitSet OutofMemory Error

2009-03-25 Thread Otis Gospodnetic

Hi,

I'm not sure if anyone will be able to help without more detail.  First 
suggestion would be to look at Solr with a debugger/profiler to see where 
memory is used up.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



- Original Message 
 From: smock harish.agar...@gmail.com
 To: solr-user@lucene.apache.org
 Sent: Wednesday, March 25, 2009 2:37:26 PM
 Subject: Solr OpenBitSet OutofMemory Error
 
 
 Hello,
 After running a nightly release from around January of Solr for about 4
 weeks without any problems, I'm starting to see OutofMemory errors:
 
 Mar 24, 2009 1:35:36 AM org.apache.solr.common.SolrException log
 SEVERE: java.lang.OutOfMemoryError: Java heap space
 at org.apache.lucene.util.OpenBitSet.clone(OpenBitSet.java:640)
 
 
 Is this a common error to see?  I'm running a lot of faceted queries on an
 index with about 7.5 million documents.  I'm giving about 8GBs of memory to
 Solr.  While I do update the index frequently, I also optimize frequently -
 its a little strange to me that this problem is showing up now after four
 weeks of zero problems.  Any suggestions/ideas would be very much
 appreciated!
 
 Thanks,
 Harish
 -- 
 View this message in context: 
 http://www.nabble.com/Solr-OpenBitSet-OutofMemory-Error-tp22707576p22707576.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Delta import

2009-03-25 Thread Noble Paul നോബിള്‍ नोब्ळ्
Hi Alex, you may be able to use CachedSqlEntityProcessor. You can do a
delta-import using full-import:
http://wiki.apache.org/solr/DataImportHandlerFaq#fullimportdelta

the inner entity can use a CachedSqlEntityProcessor
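
(Very roughly, following that FAQ entry -- table and column names below are
made up, and the variable names should be double-checked against the wiki:)

  <entity name="profil"
          query="select * from profil
                 where '${dataimporter.request.clean}' != 'false'
                    or last_modified > '${dataimporter.last_index_time}'">
    <entity name="detail" processor="CachedSqlEntityProcessor"
            query="select * from profil_detail"
            where="profil_id=profil.id"/>
  </entity>

Then schedule command=full-import&clean=false instead of delta-import; with
the inner entity cached, its query runs only once and the per-row sub-queries
are served from the cache.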

On Thu, Mar 26, 2009 at 1:45 AM, AlexxelA alexandre.boudrea...@canoe.ca wrote:

 Yes my database is remote, mysql 5 and i'm using connector/J 5.1.7.  My index
 has 2 documents.  When i try to do lets say 14 updates it takes about 18
 sec total.  Here's the resulting log of the operation :

 2009-03-25 15:53:57 org.apache.solr.handler.dataimport.JdbcDataSource$1 call
 INFO: Time taken for getConnection(): 411
 2009-03-25 15:53:59 org.apache.solr.handler.dataimport.DocBuilder
 collectDelta
 INFO: Completed ModifiedRowKey for Entity: profil rows obtained : 14
 2009-03-25 15:53:59 org.apache.solr.handler.dataimport.DocBuilder
 collectDelta
 INFO: Completed DeletedRowKey for Entity: profil rows obtained : 0
 2009-03-25 15:53:59 org.apache.solr.handler.dataimport.DocBuilder
 collectDelta
 INFO: Completed parentDeltaQuery for Entity: profil
 2009-03-25 15:54:00 org.apache.solr.core.SolrDeletionPolicy onInit
 INFO: SolrDeletionPolicy.onInit: commits:num=1

 commit{dir=/home/solr-tomcat/solr/data/index,segFN=segments_sb,version=1237322897338,generation=1019,filenames=[_uj.frq,
 _uj.fdx, _uj.tii, _uj.nrm, _uj.tis, _uj.fnm, _uj.prx, segments_sb, _uj.fdt]
 2009-03-25 15:54:00 org.apache.solr.core.SolrDeletionPolicy updateCommits
 INFO: last commit = 1237322897338
 2009-03-25 15:54:13 org.apache.solr.handler.dataimport.DocBuilder doDelta
 INFO: Delta Import completed successfully BOTTLE NECK
 2009-03-25 15:54:13 org.apache.solr.handler.dataimport.DocBuilder commit
 INFO: Full Import completed successfully
 2009-03-25 15:54:13 org.apache.solr.update.DirectUpdateHandler2 commit
 INFO: start commit(optimize=true,waitFlush=false,waitSearcher=true)
 2009-03-25 15:54:15 org.apache.solr.core.SolrDeletionPolicy onCommit
 INFO: SolrDeletionPolicy.onCommit: commits:num=2

 commit{dir=/home/solr-tomcat/solr/data/index,segFN=segments_sb,version=1237322897338,generation=1019,filenames=[_uj.frq,
 _uj.fdx, _uj.tii, _uj.nrm, _uj.tis, _uj.fnm, _uj.prx, segments_sb, _uj.fdt]

 commit{dir=/home/solr-tomcat/solr/data/index,segFN=segments_sc,version=1237322897339,generation=1020,filenames=[_ul.prx,
 _ul.fnm, _ul.tii, _ul.fdt, _ul.nrm, _ul.fdx, _ul.tis, _ul.frq, segments_sc]
 2009-03-25 15:54:15 org.apache.solr.core.SolrDeletionPolicy updateCommits
 INFO: last commit = 1237322897339
 2009-03-25 15:54:15 org.apache.solr.search.SolrIndexSearcher init
 INFO: Opening searc...@3da850 main

 When i do a full-import it is much faster. Take about 1 min to index 2
 documents.  I tried to play a bit with the config but nothing seems to work
 for the moment.

 What i want to do is pretty interactive, my production db has 1,2M documents
 and must be able to delta-import around 2k update every 5min.  Is it
 possible with the dataimporthandle to reach those kinda of number ?



 Shalin Shekhar Mangar wrote:

 On Wed, Mar 25, 2009 at 2:25 AM, AlexxelA
 alexandre.boudrea...@canoe.cawrote:


 Ok i'm ok with the fact the solr gonna do X request to database for X
 update.. but when i try to run the delta-import command with 2 row to
 update is it normal that its kinda really slow ~ 1 document fetched / sec
 ?


 Not really, I've seen 1000x faster. Try firing a few of those queries on
 the
 database directly. Are they slow? Is the database remote?

 --
 Regards,
 Shalin Shekhar Mangar.



 --
 View this message in context: 
 http://www.nabble.com/Delta-import-tp22663196p22710222.html
 Sent from the Solr - User mailing list archive at Nabble.com.





-- 
--Noble Paul


Re: Not able to configure multicore

2009-03-25 Thread mitulpatel

Actually solr2 is an application other than the default one (example), on which I
have configured my application. 

let me explain things in more detail:

so my application path is http://localhost:8983/solr2/admin and I would like
to configure it for multiple cores, so I have placed a solr.xml in the config
directory which contains the following:
<solr persistent="true" sharedLib="lib">
 <cores adminPath="/admin/cores">
  <core name="core0" instanceDir="core0" />
  <core name="core1" instanceDir="core1" />
 </cores>
</solr>

but when I try to access the following:
http://localhost:8983/solr2/admin/cores it gives me a Tomcat 404 error. 

Thanks,
Mitul Patel.


Otis Gospodnetic wrote:
 
 
 Hm, where does that /solr2 come from?
 
  Otis
 --
 Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
 
 
 
 - Original Message 
 From: mitulpatel mitulpa...@greymatterindia.com
 To: solr-user@lucene.apache.org
 Sent: Wednesday, March 25, 2009 12:30:11 AM
 Subject: Re: Not able to configure multicore
 
 
 
 
 hossman wrote:
  
  
  : I am facing a problem related to multiple cores configuration. I have
  placed
  : a solr.xml file in solr.home directory. eventhough when I am trying
 to
  : access http://localhost:8983/solr/admin/cores it gives me tomcat
 error. 
  : 
  : Can anyone tell me what can be possible issue with this??
  
  not without knowing exactly what the tomcat error message is, what your 
  solr.xml file looks like, what log messages you see on startup, etc...
  
  -Hoss
  
  
 Hello Hoss,
 
 Thanks for reply.
 
 Here is the error message shown on browser:
 HTTP Status 404 - /solr2/admin/cores
 type Status report
 message /solr2/admin/cores
 description The requested resource (/solr2/admin/cores) is not available.
 
 and here is the solr.xml file.
 
 
   
   
 
 
 
 
 -- 
 View this message in context: 
 http://www.nabble.com/Not-able-to-configure-multicore-tp22682691p22695098.html
 Sent from the Solr - User mailing list archive at Nabble.com.
 
 
 

-- 
View this message in context: 
http://www.nabble.com/Not-able-to-configure-multicore-tp22682691p22715876.html
Sent from the Solr - User mailing list archive at Nabble.com.



Scheduling DIH

2009-03-25 Thread Tricia Williams

Hello,

   Is there a best way to schedule the DataImportHandler?  The idea 
being to schedule a delta-import every Sunday morning at 7am or perhaps 
every hour, without human intervention.  Writing a cron job to do this 
wouldn't be difficult; I'm just wondering whether this is a built-in feature.


Tricia


Re: Scheduling DIH

2009-03-25 Thread Noble Paul നോബിള്‍ नोब्ळ्
Right now a cron job is the only option.

Building this into DIH has been a common request, though.

What do others think about this?

On Thu, Mar 26, 2009 at 10:11 AM, Tricia Williams
williams.tri...@gmail.com wrote:
 Hello,

   Is there a best way to schedule the DataImportHandler?  The idea being to
 schedule a delta-import every Sunday morning at 7am or perhaps every hour
 without human intervention.  Writing a cron job to do this wouldn't be
 difficult.  I'm just wondering is this a built in feature?

 Tricia




-- 
--Noble Paul
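
(For anyone who lands here, a typical crontab entry for this -- the host,
port and handler path are placeholders and assume DIH is registered at
/dataimport:)

  # delta-import every hour, on the hour
  0 * * * *  curl -s 'http://localhost:8983/solr/dataimport?command=delta-import' > /dev/null

  # or a full rebuild every Sunday at 7am
  0 7 * * 0  curl -s 'http://localhost:8983/solr/dataimport?command=full-import' > /dev/null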