[ANNOUNCE] Apache Solr 4.7.0 released.

2014-02-26 Thread Simon Willnauer
February 2014, Apache Solr™ 4.7 available

The Lucene PMC is pleased to announce the release of Apache Solr 4.7

Solr is the popular, blazing fast, open source NoSQL search platform
from the Apache Lucene project. Its major features include powerful
full-text search, hit highlighting, faceted search, dynamic
clustering, database integration, rich document (e.g., Word, PDF)
handling, and geospatial search.  Solr is highly scalable, providing
fault tolerant distributed search and indexing, and powers the search
and navigation features of many of the world's largest internet sites.

Solr 4.7 is available for immediate download at:
  http://lucene.apache.org/solr/mirrors-solr-latest-redir.html

See the CHANGES.txt file included with the release for a full list of
details.

Solr 4.7 Release Highlights:

* A new 'migrate' collection API to split all documents with a route key
  into another collection.

* Added support for tri-level compositeId routing.

* Admin UI - Added a new "Files" conf directory browser/file viewer.

* Add a QParserPlugin for Lucene's SimpleQueryParser.

* Suggest improvements: a new SuggestComponent that fully utilizes the
  Lucene suggester module; queries can now use multiple suggesters;
  Lucene's FreeTextSuggester and BlendedInfixSuggester are now supported.

* New 'cursorMark' request param for efficient deep paging of sorted
  result sets. See http://s.apache.org/cursorpagination and the short
  SolrJ sketch following the highlights below.

* Add a Solr contrib that allows for building Solr indexes via Hadoop's
  MapReduce.

* Upgrade to Spatial4j 0.4. Various new options are now exposed
  automatically for an RPT field type.  See Spatial4j CHANGES & javadocs.
  https://github.com/spatial4j/spatial4j/blob/master/CHANGES.md

* SSL support for SolrCloud.

Solr 4.7 also includes many other new features as well as numerous
optimizations and bugfixes.
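
For the new cursorMark parameter, here is a minimal SolrJ sketch of deep
paging. Treat it as an illustrative outline rather than the official example:
the server URL, the core name "collection1" and the uniqueKey field "id" are
assumptions.

  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.impl.HttpSolrServer;
  import org.apache.solr.client.solrj.response.QueryResponse;
  import org.apache.solr.common.params.CursorMarkParams;

  public class CursorPagingSketch {
    public static void main(String[] args) throws Exception {
      HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
      SolrQuery q = new SolrQuery("*:*");
      q.setRows(100);
      // the sort must include the uniqueKey field as a tie-breaker
      q.setSort(SolrQuery.SortClause.asc("id"));
      String cursor = CursorMarkParams.CURSOR_MARK_START;   // "*"
      while (true) {
        q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursor);
        QueryResponse rsp = server.query(q);
        // process rsp.getResults() here
        String next = rsp.getNextCursorMark();
        if (cursor.equals(next)) {
          break;   // cursor did not advance, no more pages
        }
        cursor = next;
      }
    }
  }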

Please report any feedback to the mailing lists
(http://lucene.apache.org/solr/discussion.html)

Note: The Apache Software Foundation uses an extensive mirroring network
for distributing releases.  It is possible that the mirror you are using
may not have replicated the release yet.  If that is the case, please
try another mirror.  This also goes for Maven access.


[ANNOUNCE] Apache Solr 4.6 released.

2013-11-24 Thread Simon Willnauer
24 November 2013, Apache Solr™ 4.6 available

The Lucene PMC is pleased to announce the release of Apache Solr 4.6

Solr is the popular, blazing fast, open source NoSQL search platform
from the Apache Lucene project. Its major features include powerful
full-text search, hit highlighting, faceted search, dynamic
clustering, database integration, rich document (e.g., Word, PDF)
handling, and geospatial search.  Solr is highly scalable, providing
fault tolerant distributed search and indexing, and powers the search
and navigation features of many of the world's largest internet sites.

Solr 4.6 is available for immediate download at:
  http://lucene.apache.org/solr/mirrors-solr-latest-redir.html

See the CHANGES.txt file included with the release for a full list of
details.

Solr 4.6 Release Highlights:

* Many improvements and enhancements for shard splitting options

* New AnalyzingInfixLookupFactory to leverage the AnalyzingInfixSuggester

* New CollapsingQParserPlugin for high performance field collapsing on high
  cardinality fields (see the SolrJ sketch following the highlights below)

* New SolrJ APIs for collection management

* New DocBasedVersionConstraintsProcessorFactory providing support for user
  configured doc-centric versioning rules

* New default index format: Lucene46Codec

* New EnumField type

Solr 4.6 also includes many other new features as well as numerous
optimizations and bugfixes.
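
As a small illustration of the CollapsingQParserPlugin mentioned above, a
SolrJ query could look roughly like this; the server URL and the field name
"groupId" are made up for the example.

  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.impl.HttpSolrServer;
  import org.apache.solr.client.solrj.response.QueryResponse;

  public class CollapseSketch {
    public static void main(String[] args) throws Exception {
      HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
      SolrQuery q = new SolrQuery("some query");
      // collapse the result set to one document per groupId value
      q.addFilterQuery("{!collapse field=groupId}");
      QueryResponse rsp = server.query(q);
      System.out.println("collapsed hits: " + rsp.getResults().getNumFound());
    }
  }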

Please report any feedback to the mailing lists
(http://lucene.apache.org/solr/discussion.html)

Note: The Apache Software Foundation uses an extensive mirroring network
for distributing releases.  It is possible that the mirror you are using
may not have replicated the release yet.  If that is the case, please
try another mirror.  This also goes for Maven access.

Happy Searching

Simon


[ANNOUNCE] Apache Solr 4.3 released

2013-05-06 Thread Simon Willnauer
May 2013, Apache Solr™ 4.3 available

The Lucene PMC is pleased to announce the release of Apache Solr 4.3.

Solr is the popular, blazing fast, open source NoSQL search platform
from the Apache Lucene project. Its major features include powerful
full-text search, hit highlighting, faceted search, dynamic
clustering, database integration, rich document (e.g., Word, PDF)
handling, and geospatial search.  Solr is highly scalable, providing
fault tolerant distributed search and indexing, and powers the search
and navigation features of many of the world's largest internet sites.

Solr 4.3 is available for immediate download at:
   http://lucene.apache.org/solr/mirrors-solr-latest-redir.html

See the CHANGES.txt file included with the release for a full list of
details.

Solr 4.3.0 Release Highlights:

* Tired of maintaining core information in solr.xml? Now you can configure
  Solr to automatically find cores by walking an arbitrary directory.

* Shard Splitting: You can now split SolrCloud shards to expand your cluster as
  you grow.

* The read side schema REST API has been improved and expanded upon: all schema
  information is now available and the full live schema can now be returned in
  json or xml.  Ground work is included for the upcoming write side of the
  schema REST API.

* Spatial queries can now search for indexed shapes by "IsWithin",
  "Contains" and "IsDisjointTo" relationships, in addition to the typical
  "Intersects" (example queries follow the highlights below).

* Faceting now supports local parameters for faceting on the same field with
  different options.

* Significant performance improvements for minShouldMatch (mm) queries due to
  skipping, resulting in up to 4000% faster queries.

* Various new highlighting configuration parameters.

* A new solr.xml format that is closer to that of solrconfig.xml. The example
  still uses the old format, but 4.4 will ship with the new format.

* Lucene 4.3.0 bug fixes and optimizations.

Solr 4.3.0 also includes many other new features as well as numerous
optimizations and bugfixes.
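
As an illustration of the new spatial predicates, filter queries against an
RPT field (here a hypothetical field named "geo"; coordinates are purely
illustrative, and WKT polygon shapes additionally require the JTS library on
the classpath) look roughly like this:

  fq=geo:"Intersects(-74.093 41.042 -69.347 44.558)"
  fq=geo:"IsWithin(POLYGON((-10 30, -40 40, -10 -20, 40 20, -10 30)))"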

Please report any feedback to the mailing lists
(http://lucene.apache.org/solr/discussion.html)

Note: The Apache Software Foundation uses an extensive mirroring network
for distributing releases.  It is possible that the mirror you are using
may not have replicated the release yet.  If that is the case, please
try another mirror.  This also goes for Maven access.

Happy searching,
Lucene/Solr developers


Re: Memory leak?? with CloseableThreadLocal with use of Snowball Filter

2012-08-01 Thread Simon Willnauer
On Thu, Aug 2, 2012 at 7:53 AM, roz dev  wrote:
> Thanks Robert for these inputs.
>
> Since we do not really need the Snowball analyzer for this field, we would not use
> it for now. If this still does not address our issue, we would tweak thread
> pool as per eks dev suggestion - I am bit hesitant to do this change yet as
> we would be reducing thread pool which can adversely impact our throughput
>
> If Snowball Filter is being optimized for Solr 4 beta then it would be
> great for us. If you have already filed a JIRA for this then please let me
> know and I would like to follow it

AFAIK Robert already created an issue here:
https://issues.apache.org/jira/browse/LUCENE-4279
and it seems fixed. Given the massive commit last night it's already
committed and backported, so it will be in 4.0-BETA.

simon
>
> Thanks again
> Saroj
>
>
>
>
>
> On Wed, Aug 1, 2012 at 8:37 AM, Robert Muir  wrote:
>
>> On Tue, Jul 31, 2012 at 2:34 PM, roz dev  wrote:
>> > Hi All
>> >
>> > I am using Solr 4 from trunk and using it with Tomcat 6. I am noticing
>> that
>> > when we are indexing lots of data with 16 concurrent threads, Heap grows
>> > continuously. It remains high and ultimately most of the stuff ends up
>> > being moved to Old Gen. Eventually, Old Gen also fills up and we start
>> > getting into excessive GC problem.
>>
>> Hi: I don't claim to know anything about how tomcat manages threads,
>> but really you shouldnt have all these objects.
>>
>> In general snowball stemmers should be reused per-thread-per-field.
>> But if you have a lot of fields*threads, especially if there really is
>> high thread churn on tomcat, then this could be bad with snowball:
>> see eks dev's comment on https://issues.apache.org/jira/browse/LUCENE-3841
>>
>> I think it would be useful to see if you can tune tomcat's threadpool
>> as he describes.
>>
>> separately: Snowball stemmers are currently really ram-expensive for
>> stupid reasons.
>> each one creates a ton of Among objects, e.g. an EnglishStemmer today
>> is about 8KB.
>>
>> I'll regenerate these and open a JIRA issue: as the snowball code
>> generator in their svn was improved
>> recently and each one now takes about 64 bytes instead (the Among's
>> are static and reused).
>>
>> Still this wont really "solve your problem", because the analysis
>> chain could have other heavy parts
>> in initialization, but it seems good to fix.
>>
>> As a workaround until then you can also just use the "good old
>> PorterStemmer" (PorterStemFilterFactory in solr).
>> Its not exactly the same as using Snowball(English) but its pretty
>> close and also much faster.
>>
>> --
>> lucidimagination.com
>>
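
For reference, the PorterStemFilterFactory workaround Robert mentions above
corresponds roughly to the following Lucene analysis chain. This is a hedged
sketch against the Lucene 4.0 API, not code from the thread (the class name is
made up); in Solr the same thing is normally configured in schema.xml.

  import java.io.Reader;
  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.analysis.Tokenizer;
  import org.apache.lucene.analysis.core.LowerCaseFilter;
  import org.apache.lucene.analysis.en.PorterStemFilter;
  import org.apache.lucene.analysis.standard.StandardTokenizer;
  import org.apache.lucene.util.Version;

  public class PorterAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String field, Reader reader) {
      Tokenizer src = new StandardTokenizer(Version.LUCENE_40, reader);
      TokenStream ts = new LowerCaseFilter(Version.LUCENE_40, src);
      // PorterStemFilter is much cheaper per instance than the Snowball English stemmer
      ts = new PorterStemFilter(ts);
      return new TokenStreamComponents(src, ts);
    }
  }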


Re: Solr 4.0 IllegalStateException: this writer hit an OutOfMemoryError; cannot commit

2012-07-10 Thread Simon Willnauer
it really seems that you are hitting an OOM during auto-warming - could
that be the cause of your failure?
Can you raise the JVM memory and see if you still hit the spike and go
OOM? This is very unlikely to be an IndexWriter problem. I'd rather look at
your warmup queries, i.e. fieldCache / fieldValueCache usage. Are you
sorting / faceting on anything?

simon

On Tue, Jul 10, 2012 at 4:49 PM, Vadim Kisselmann
 wrote:
> Hi Robert,
>
>> Can you run Lucene's checkIndex tool on your index?
>
> No, unfortunately not. This Solr should run without stoppage, an
> tomcat-restart is ok, but not more:)
> I tested newer trunk-versions a couple of months ago, but they fail
> all with tomcat.
> i would test 4.0-alpha in next days with tomcat and open an jira-issue
> if it doesn't work with it.
>
>> do you have another exception in your logs? To my knowledge, in all
>> cases that IndexWriter throws an OutOfMemoryError, the original
>> OutOfMemoryError is also rethrown (not just this IllegalStateException
>> noting that at some point, it hit OOM.
>
> Hmm, i checked older logs and found something new, what i have not
> seen in VisualVM. "Java heap space"-Problems, just before OOM.
> My JVM has 8GB -Xmx/-Xms, 16GB for OS, nothing else on this machine.
> This Errors pop up's during normal run according logs, no optimizes,
> high loads(max. 30 queries per minute) or something special at this time.
>
> SCHWERWIEGEND: null:ClientAbortException:  java.net.SocketException: Broken 
> pipe
> SCHWERWIEGEND: null:java.lang.OutOfMemoryError: Java heap space
> SCHWERWIEGEND: auto commit error...:java.lang.IllegalStateException:
> this writer hit an OutOfMemoryError; cannot commit
> SCHWERWIEGEND: Error during auto-warming of
> key:org.apache.solr.search.QueryResultKey@7cba935e:java.lang.OutOfMemoryError:
> Java heap space
> SCHWERWIEGEND: org.apache.solr.common.SolrException: Internal Server Error
> SCHWERWIEGEND: null:org.apache.solr.common.SolrException: Internal Server 
> Error
>
> I knew this failures when i work on virtual machines with solr 1.4,
> big indexes and ridiculous small -Xmx sizes.
> But on real hardware, with enough RAM, fast disks/cpu's it's new for me:)
>
> Best regards
> Vadim


Re: Multiple document types

2012-01-25 Thread Simon Willnauer
On Thu, Jan 26, 2012 at 12:05 AM, Frank DeRose  wrote:
> Hi Simon,
>
> No, not different entity types, but actually different document types (I 
> think). What would be ideal is if we could have multiple  elements 
> in the data-config.xml file and some way of mapping each different  
> element to a different sets of field in the schema.xml file, and to a 
> different index. Then, when Solr got a search request on one url (say, for 
> example, "http://172.24.1.16:8080/gwsolr/cc/doctype1/select/?q=...";), it 
> would search for a document in the first index and when it got a search 
> request on a different url (say, for example, 
> "http://172.24.1.16:8080/gwsolr/pc/doctype1/select/?q=...";), it would search 
> for the document in the second index. In like manner, administrative tasks 
> (like dataimport) would also switch off of the url, so that the url would 
> determine which index was to be loaded by the dataimport command.

seems like you should look at solr's multicore feature:
http://wiki.apache.org/solr/CoreAdmin

simon
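
For illustration, with multiple cores each document type gets its own core
(and therefore its own schema.xml, solrconfig.xml and index), and the URL
selects the core. Host, port and core names below are hypothetical:

  http://localhost:8080/solr/admin/cores?action=CREATE&name=doctype1&instanceDir=doctype1
  http://localhost:8080/solr/doctype1/select?q=...
  http://localhost:8080/solr/doctype2/select?q=...
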
>
> F
>
> -Original Message-
> From: Simon Willnauer [mailto:simon.willna...@googlemail.com]
> Sent: Wednesday, January 25, 2012 2:08 PM
> To: java-user
> Subject: Re: Multiple document types
>
> hey Frank,
>
> can you elaborate what you mean by different doc types? Are you
> referring to an entity ie. a table per entity to speak in SQL terms?
> in general you should get better responses for solr related questions
> on solr-user@lucene.apache.org
>
> simon
>
> On Wed, Jan 25, 2012 at 10:49 PM, Frank DeRose  wrote:
>> It seems that it is not possible to have multiple document types defined in 
>> a single solr schema.xml file. If, in fact, this is not possible, then, what 
>> is the recommended app server deployment strategy for supporting multiple 
>> documents on solr? Do I need to have one webapp instance per document type? 
>> For example, if I am deploying under tomcat, do I need to have a separate 
>> webapps each with its own context-path and set of config files 
>> (data-config.xml and schema.xml, in particular)?
>>
>> _
>> Frank DeRose
>> Guidewire Software | Senior Software Engineer
>> Cell: 510 -589-0752
>> fder...@guidewire.com<mailto:fder...@guidewire.com> | 
>> www.guidewire.com<http://www.guidewire.com/>
>> Deliver insurance your way with flexible core systems from Guidewire.
>>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>


Call for Submission Berlin Buzzwords 2012 - http://berlinbuzzwords.de

2012-01-11 Thread Simon Willnauer
Call for Submission Berlin Buzzwords 2012 - Search, Store, Scale  --
June 4 / 5. 2012

The event will comprise presentations on scalable data processing. We
invite you to submit talks on the topics:
 * IR / Search - Lucene, Solr, katta, ElasticSearch or comparable solutions
 * NoSQL - like CouchDB, MongoDB, Jackrabbit, HBase and others
 * Hadoop - Hadoop itself, MapReduce, Cascading or Pig and relatives

Related topics not explicitly listed above are more than welcome. We are
looking for presentations on the implementation of the systems
themselves, technical talks,
real world applications and case studies.

Important Dates (all dates in GMT +2)
 * Submission deadline: March 11th 2012, 23:59 MEZ
 * Notification of accepted speakers: April 6th, 2012, MEZ
 * Publication of final schedule: April 13th, 2012
 * Conference: June 4/5. 2012

High quality, technical submissions are called for, ranging from
principles to practice. We are looking for real world use cases,
background on the architecture of specific projects and a deep dive
into architectures built on top of e.g. Hadoop clusters.

To submit your proposal please register to our website [1] and log in
[2] once you received the confirmation email. Once this is done you
can submit your proposal here [3]; please do so no later than March
11th, 2012. Acceptance notifications will be sent out soon after the
submission deadline. Please include your name, bio and email, the
title of the talk, and a brief abstract in English. Please
indicate whether you want to give a lightning (10min), short (20min)
or long (40min) presentation and indicate the level of experience with
the topic your audience should have (e.g. whether your talk will be
suitable for newbies or is targeted for experienced users.) If you'd
like to pitch your brand new product in your talk, please let us know
as well -
there will be extra space for presenting new ideas, awesome products
and great new projects.

The presentation format is short. We will be enforcing the schedule rigorously.

If you are interested in sponsoring the event (e.g. we would be happy
to provide videos after the event, free drinks for attendees as well
as an after-show party), please contact us.

Follow @berlinbuzzwords on Twitter for updates. Tickets, news on the
conference, and the final schedule will be published at
http://berlinbuzzwords.de.

Program Committee Chairs:

 *  Isabel Drost (Nokia & Apache Mahout)
 *  Jan Lehnardt (CouchBase & Apache CouchDB)
 *  Simon Willnauer (SearchWorkings & Apache Lucene)
 *  Grant Ingersoll (Lucid Imagination & Apache Lucene)
 *  Owen O’Malley (Yahoo Inc. & Apache Hadoop)
 *  Jim Webber (Neo Technology & Neo4j)
 *  Sean Treadway (Soundcloud)


Please re-distribute this CfP to people who might be interested.

Contact us at:

newthinking communications
GmbH Schönhauser Allee 6/7
10119 Berlin,
Germany
Julia Gemählich 
Isabel Drost 
Simon Willnauer 
 +49(0)30-9210 596

[1] http://berlinbuzzwords.de/user/register
[2] http://berlinbuzzwords.de/user
[3] http://berlinbuzzwords.de/node/add/session


Re: Solr Scoring question

2012-01-05 Thread Simon Willnauer
hey,

On Thu, Jan 5, 2012 at 9:31 PM, Christopher Gross  wrote:
> I'm getting different results running these queries:
>
> http://localhost:8080/solr/select?&q=*:*&fq=source:wiki&fq=tag:car&sort=score+desc,dateSubmitted+asc&fl=title,score,dateSubmitted&rows=100
>
> http://localhost:8080/solr/select?fq=source:wiki&q=tag:car&sort=score+desc,dateSubmitted+desc&fl=title,score,dateSubmitted&rows=100
>
> They return the same amount of results (and I'm assuming the same
> ones) -- but the first one (with q=*:*) has a score of 1 for all
> results, making it only sort by dateSubmitted.  The second one has
> scores, and it properly sorts them.
>
> I was thinking that the two would be equivalent and give the same
> results in the same order, but I'm guessing that there is something
> happening behind the scenes in Solr (Lucene?) that makes the *:* give
> me a score of 1.0 for everything.  I tried to find some documentation
> to figure out if this is the case, but I'm not having much luck for
> that.

q=*:* is a constant score query that retrieves all documents in your
index. The issue here is that with *:* you don't have anything to
score, while with q=tag:car you can score the term car with tf/idf etc.

does that make sense?

simon
>
> I have a JSP file that will take in parameters, do some work on them
> to make them appropriate for Solr, then pass the query it builds to
> Solr.  Should I just put more brains in that to avoid using a *:*
> (we're trying to verify results and we ran into this oddity).
>
> This is for Solr 3.4, running Tomcat 5.5.25 on Java 1.5.
>
> Thanks!  Let me know if Ineed to clarify anything...
>
> -- Chris


Heads Up - Index File Format Change on Trunk

2012-01-05 Thread Simon Willnauer
Folks,

I just committed LUCENE-3628 [1] which cuts over Norms to DocValues.
This is an index file format change and if you are using trunk you
need to reindex before updating.

happy indexing :)

simon

[1] https://issues.apache.org/jira/browse/LUCENE-3628


Re: spellcheck-index is rebuilt on commit

2012-01-03 Thread Simon Willnauer
On Tue, Jan 3, 2012 at 9:12 AM, OliverS  wrote:
> Hi all
>
> Thanks a lot, and it seems to be a bug, but not of 4.0 only. You are right,
> I was doing a commit on an optimized index without adding any new docs (in
> fact, I did this for replication on the master). I will open a ticket as
> soon as I fully understand what's going on. I have difficulties
> understanding Simons answer:
> * building the spellcheck-index is triggered by a new searcher?
> * why would this not happen after post/commit?

a commit in Solr forces a new searcher to be opened. This new searcher
is passed to the spellchecker's listener, which reopens / rebuilds the
spellcheck index. Yet, if you set rebuildOnOptimize=true it only
checks whether the index has a single segment. Since you didn't change
anything since this was last checked, it still has one segment. The
problem is that the listener doesn't save any state or the version of
the index since it was last called, and assumes the index was just
optimized.

simon
>
> Thanks
> Oliver
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/spellcheck-index-is-rebuilt-on-commit-tp3626492p3628423.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: spellcheck-index is rebuilt on commit

2012-01-02 Thread Simon Willnauer
hey, is it possible that during those commits nothing has changed in
the index? I mean, are you committing regardless of whether there are
changes? If so, this could happen: the spellchecker gets an event that
you did a commit, but the index didn't really change. The spellchecker
really only checks whether there is a single segment in the index and
then rebuilds the spellcheck index.

if this is the case, I think this is a bug... can you open a jira ticket?

simon

On Mon, Jan 2, 2012 at 8:36 PM, OliverS  wrote:
> Hi
>
> Looks like they strip the -Text for the list. Whole message here:
> http://lucene.472066.n3.nabble.com/spellcheck-index-is-rebuilt-on-commit-td3626492.html
>
> Yes, I did restart tomcat.
>
> Thanks
> Oliver
>
> Zitat von "Jan Høydahl / Cominvent [via Lucene]"
> :
>
>>
>>
>> Olivier, your log snippets did not make it into the mail. I think
>> the mailing list strips attachments.
>>
>> Did you reload core or restart Jetty/Tomcat after your changes?
>>
>> --
>> Jan Høydahl, search solution architect
>> Cominvent AS - www.cominvent.com
>> Solr Training - www.solrtraining.com
>>
>> On 2. jan. 2012, at 13:48, Oliver Schihin wrote:
>>
>>> Hello
>>>
>>> We are working with solr 4.0, the spellchecker used is still the classic
>>> IndexBasedSpellChecker. Now every time I do a commit, it rebuilds the
>>> spellchecker index, even though I clearly state a build on optimize. The
>>> configuration in solrconfig looks like this:
>>>
>>>
>>> I call commits testwise through curl
>>>
>>>
>>> This is from the log:
>>>
>>>
>>> Where am I wrong, any suggestions? Thanks for help
>>> Oliver
>>>
>>> --
>>> View this message in context:
>>> http://lucene.472066.n3.nabble.com/spellcheck-index-is-rebuilt-on-commit-tp3626492p3626492.html
>>> Sent from the Solr - User mailing list archive at Nabble.com.
>>
>>
>>
>
>
>
> 
> This message was sent using IMP, the Internet Messaging Program.
>
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/spellcheck-index-is-rebuilt-on-commit-tp3626492p3627383.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: Matching all documents in the index

2011-12-13 Thread Simon Willnauer
try *:* instead of *.*

simon
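
For example (host and port hypothetical), the match-all query can be issued as

  http://localhost:8080/solr/select?q=*:*&rows=0

and numFound in the response is the total number of documents in the index.
By contrast, *.* is treated as a wildcard pattern against terms in the
default search field, which is why it matches fewer documents.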

On Tue, Dec 13, 2011 at 5:03 PM, Kissue Kissue  wrote:
> Hi,
>
> I have come across this query in the admin interface: *.*
> Is this meant to match all documents in my index?
>
> Currently when i run query with q= *.*, numFound is 130310 but the actuall
> number of documents in my index is 603308.
> Shen i then run the query with q = *  then numFound is 603308 which is the
> total number of documents in my index.
>
> So what is the difference between query with q = *.*  and q = * ?
>
> I ran into this problem because i have a particular scenario where in my
> index where i have a field called categoryId which i am grouping on and
> another field called orgId which i then filter on. So i do grouping on
> categoryId but on all documents in the index matching the filter query
> field. I use q = *.* but this dosen't give me the true picture as
> highlighted above. So i use q = * and this works fine but takes about
> 2900ms to execute. Is this efficient? Is there a better way to do something
> like this?
>
> Solr version = 3.5
>
> Thanks.


Re: Seek past EOF

2011-11-30 Thread Simon Willnauer
can you give us some details about what filesystem you are using?

simon

On Wed, Nov 30, 2011 at 3:07 PM, Ruben Chadien  wrote:
> Happened again….
>
> I got 3 directories in my index dir
>
> 4096 Nov  4 09:31 index.2004083156
> 4096 Nov 21 10:04 index.2021090440
> 4096 Nov 30 14:55 index.2029024919
>
> as you can se the first two are old and also empty , the last one from
> today is and containing 9 files none of the are 0 size
> and total size 7 GB. The size of the index on the master is 14GB.
>
> Any ideas on what to look for ?
>
> Thanks
> Ruben Chadien
>
>
>
>
> On 29 November 2011 15:58, Mark Miller  wrote:
>
>> Hmm...I've seen a bug like this, but I don't think it would be tickled if
>> you are replicating config files...
>>
>> It def looks related though ... I'll try to dig around.
>>
>> Next time it happens, take a look on the slave for 0 size files - also if
>> the index dir on the slave is plain 'index' or has a timestamp as part of
>> the name (eg timestamp.index).
>>
>> On Tue, Nov 29, 2011 at 9:53 AM, Ruben Chadien > >wrote:
>>
>> > Hi, for the moment there are no 0 sized files, but all indexes are
>> working
>> > now. I will have to look next time it breaks.
>> > Yes, the directory name is "index" and it replicates the schema and a
>> > synonyms file.
>> >
>> > /Ruben Chadien
>> >
>> > On 29 November 2011 15:29, Mark Miller  wrote:
>> >
>> > > Also, on your master, what is the name of the index directory? Just
>> > > 'index'?
>> > >
>> > > And are you replicating config files as well or no?
>> > >
>> > >
>> > > On Nov 29, 2011, at 9:23 AM, Mark Miller wrote:
>> > >
>> > > > Does the problem index have any 0 size files in it?
>> > > >
>> > > > On Nov 29, 2011, at 2:54 AM, Ruben Chadien wrote:
>> > > >
>> > > >> HI all
>> > > >>
>> > > >> After upgrading tol Solr 3.4 we are having trouble with the
>> > replication.
>> > > >> The setup is one indexing master with a few slaves that replicate
>> the
>> > > >> indexes once every night.
>> > > >> The largest index is 20 GB and the master and slaves are on the same
>> > > DMZ.
>> > > >>
>> > > >> Almost every night one of the indexes (17 in total) fail after the
>> > > >> replication with an EOF file.
>> > > >>
>> > > >> SEVERE: Error during auto-warming of
>> > > >> key:org.apache.solr.search.QueryResultKey@bda006e3
>> > :java.io.IOException:
>> > > >> seek past EOF
>> > > >> at
>> > > >>
>> > >
>> >
>> org.apache.lucene.store.MMapDirectory$MMapIndexInput.seek(MMapDirectory.java:347)
>> > > >> at
>> > > org.apache.lucene.index.SegmentTermEnum.seek(SegmentTermEnum.java:114)
>> > > >> at
>> > > >>
>> > >
>> >
>> org.apache.lucene.index.TermInfosReader.seekEnum(TermInfosReader.java:203)
>> > > >> at
>> > org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:273)
>> > > >> at
>> > org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:210)
>> > > >> at
>> > org.apache.lucene.index.SegmentReader.docFreq(SegmentReader.java:507)
>> > > >> at
>> > >
>> org.apache.solr.search.SolrIndexReader.docFreq(SolrIndexReader.java:309)
>> > > >> at
>> > > org.apache.lucene.search.TermQuery$TermWeight$1.add(TermQuery.java:56)
>> > > >> at org.apache.lucene.util.ReaderUtil$Gather.run(ReaderUtil.java:77)
>> > > >> at org.apache.lucene.util.ReaderUtil$Gather.run(ReaderUtil.java:82)
>> > > >>
>> > > >>
>> > > >> After a restart the errors are gone, anyone else seen this ?
>> > > >>
>> > > >> Thanks
>> > > >> Ruben Chadien
>> > > >
>> > > > - Mark Miller
>> > > > lucidimagination.com
>> > > >
>> > > >
>> > > >
>> > > >
>> > > >
>> > > >
>> > > >
>> > > >
>> > > >
>> > > >
>> > > >
>> > >
>> > > - Mark Miller
>> > > lucidimagination.com
>> > >
>> > >
>> > >
>> > >
>> > >
>> > >
>> > >
>> > >
>> > >
>> > >
>> > >
>> > >
>> >
>> >
>> > --
>> > *Ruben Chadien
>> > *Senior Developer
>> > Mobile +47 900 35 371
>> > ruben.chad...@aspiro.com
>> > *
>> >
>> > Aspiro Music AS*
>> > Øvre Slottsgate 25, P.O. Box 8710 Youngstorget, N-0028 Oslo
>> > Tel +47 452 86 900, fax +47 22 37 36 59
>> > www.aspiro.com/music
>> >
>>
>>
>>
>> --
>> - Mark
>>
>> http://www.lucidimagination.com
>>
>
>
>
> --
> *Ruben Chadien
> *Senior Developer
> Mobile +47 900 35 371
> ruben.chad...@aspiro.com
> *
>
> Aspiro Music AS*
> Øvre Slottsgate 25, P.O. Box 8710 Youngstorget, N-0028 Oslo
> Tel +47 452 86 900, fax +47 22 37 36 59
> www.aspiro.com/music


Re: Solr 3.5 very slow (performance)

2011-11-30 Thread Simon Willnauer
I wonder if you have an explicitly configured merge policy? In Solr 1.4,
i.e. Lucene 2.9, LogMergePolicy was the default, but in 3.5
TieredMergePolicy is used by default. This could explain the
difference in segment counts, since from what I understand you are indexing
the same data on 1.4 and 3.5?

simon
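
For reference, the two defaults correspond roughly to the following at the
Lucene level (in Solr the merge policy is configured in solrconfig.xml; class
names are per Lucene 3.5, and the segment size value is purely illustrative):

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.index.IndexWriterConfig;
  import org.apache.lucene.index.LogByteSizeMergePolicy;
  import org.apache.lucene.index.TieredMergePolicy;
  import org.apache.lucene.util.Version;

  public class MergePolicySketch {
    public static IndexWriterConfig config(boolean oldStyle) {
      IndexWriterConfig cfg = new IndexWriterConfig(
          Version.LUCENE_35, new StandardAnalyzer(Version.LUCENE_35));
      if (oldStyle) {
        cfg.setMergePolicy(new LogByteSizeMergePolicy());  // log-based policy, Solr 1.4 era
      } else {
        TieredMergePolicy tmp = new TieredMergePolicy();   // default in 3.5
        tmp.setMaxMergedSegmentMB(5 * 1024);               // illustrative value
        cfg.setMergePolicy(tmp);
      }
      return cfg;
    }
  }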

On Wed, Nov 30, 2011 at 7:42 PM, Mikhail Khludnev
 wrote:
> Hello,
>
> I spot the difference in the number of segments (4 vs 14). For me it
> explains the increased query time, and cpu load, especially because you
> don't use utilize filters via fq=, only q= in your queries.
>
> The first thing you need is make the length of segment chains the same. The
> first clue is try to optimize (I'm sorry. forceMerge() of course) your
> index. If it helps, you'll need to get why your index was optimised but
> isn't optimized now.
>
> And, more statistics from both of your indexes will be really useful:
> num/maxDocs;numTerms;optimized;hasDeletions.
>
> Also I spot that you have deep paging use-case, so you should get some
> benefits from recent 3.5 improvements. Please let me know how it is.
>
>
> On Wed, Nov 30, 2011 at 12:07 PM, Pawel Rog  wrote:
>
>> reader
>> solr 1.4
>> reader : SolrIndexReader{this=8cca36c,r=ReadOnlyDirectoryReader@8cca36c
>> ,refCnt=1,*segments=4*}
>> readerDir : org.apache.lucene.store.NIOFSDirectory@
>> /data/solr_data/itemsfull/index
>>
>> solr 3.5
>> reader : SolrIndexReader{this=3d01e178,r=ReadOnlyDirectoryReader@3d01e178
>> ,refCnt=1,*segments=14*}
>> readerDir : org.apache.lucene.store.MMapDirectory@
>> /data/solr_data_350/itemsfull/index
>>
>
>
>
> --
> Sincerely yours
> Mikhail Khludnev
> Developer
> Grid Dynamics
> tel. 1-415-738-8644
> Skype: mkhludnev
> 
> 


[ANNOUNCE] Apache Solr 3.5 released

2011-11-26 Thread Simon Willnauer
27 November 2011, Apache Solr™ 3.5.0 available
The Lucene PMC is pleased to announce the release of Apache Solr 3.5.0.

Solr is the popular, blazing fast open source enterprise search platform from
the Apache Lucene project. Its major features include powerful full-text
search, hit highlighting, faceted search, dynamic clustering, database
integration, rich document (e.g., Word, PDF) handling, and geospatial search.
Solr is highly scalable, providing distributed search and index replication,
and it powers the search and navigation features of many of the world's
largest internet sites.

This release contains numerous bug fixes, optimizations, and
improvements, some of which are highlighted below.  The release
is available for immediate download at:
   http://www.apache.org/dyn/closer.cgi/lucene/solr (see note below).

See the CHANGES.txt file included with the release for a full list of
details.

Solr 3.5.0 Release Highlights:

  * Bug fixes and improvements from Apache Lucene 3.5.0, including a
very substantial (3-5X) RAM reduction required to hold the terms
index on opening an IndexReader. (LUCENE-2205)

  * Added support for distributed result grouping. (SOLR-2066,
SOLR-2776)

  * Added support for Hunspell stemmer TokenFilter supporting stemming
for 99 languages. (SOLR-2769)

  * A new contrib module "langid" adds language identification
capabilities as an Update Processor, using Tika's
LanguageIdentifier or Cybozu language-detection library (SOLR-1979)

  * Numeric types including Trie and date types now support
sortMissingFirst/Last. (SOLR-2881)

  * Added hl.q parameter. It is optional and if it is specified, it overrides
q parameter in Highlighter. (SOLR-1926)

  * Several minor bugfixes like date parsing for years from 0001-1000, ignored
configurations when using QueryAnalyzer with SpellCheckComponent
and many more.
See CHANGES.txt entries for full details.


Note: The Apache Software Foundation uses an extensive mirroring network for
distributing releases.  It is possible that the mirror you are using may not
have replicated the release yet.  If that is the case, please try another
mirror.  This also goes for Maven access.

Happy searching,

Apache Lucene/Solr Developers


JVM Bugs affecting Lucene & Solr

2011-11-15 Thread Simon Willnauer
hey folks,

we lately looked into
https://issues.apache.org/jira/browse/LUCENE-3235 again, an issue
where a class using ConcurrentHashMap hangs / deadlocks on specific
JVMs in combination with specific CPUs. It turns out it's a JVM bug in
Sun / Oracle Java 1.5 as well as Java 1.6. It's apparently fixed in
1.6.u18, so if you are running on a JVM >= 1.6.u18 you should be safe.
Yet, in older JVMs all classes using
java.util.concurrent.locks.LockSupport are vulnerable, which includes
ConcurrentHashMap, ReentrantLock, CountDownLatch etc. Lucene and Solr
make use of those classes too, so if you are running on an older JVM you
could be affected by this bug and should either upgrade to a newer JVM
or use -XX:+UseMembar to start your JVM.
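
For example, with the stock Jetty-based example distribution that would look
something like this (any other JVM flags are up to you):

  java -XX:+UseMembar -jar start.jar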

In general it's a good idea to keep an eye on
http://wiki.apache.org/lucene-java/SunJavaBugs - we try to keep this
up-to-date.

thanks,

Simon


Re: large scale indexing issues / single threaded bottleneck

2011-10-28 Thread Simon Willnauer
On Fri, Oct 28, 2011 at 9:17 PM, Simon Willnauer
 wrote:
> Hey Roman,
>
> On Fri, Oct 28, 2011 at 8:38 PM, Roman Alekseenkov
>  wrote:
>> Hi everyone,
>>
>> I'm looking for some help with Solr indexing issues on a large scale.
>>
>> We are indexing few terabytes/month on a sizeable Solr cluster (8
>> masters / serving writes, 16 slaves / serving reads). After certain
>> amount of tuning we got to the point where a single Solr instance can
>> handle index size of 100GB without much issues, but after that we are
>> starting to observe noticeable delays on index flush and they are
>> getting larger. See the attached picture for details, it's done for a
>> single JVM on a single machine.
>>
>> We are posting data in 8 threads using javabin format and doing commit
>> every 5K documents, merge factor 20, and ram buffer size about 384MB.
>> From the picture it can be seen that a single-threaded index flushing
>> code kicks in on every commit and blocks all other indexing threads.
>> The hardware is decent (12 physical / 24 virtual cores per machine)
>> and it is mostly idle when the index is flushing. Very little CPU
>> utilization and disk I/O (<5%), with the exception of a single CPU
>> core which actually does index flush (95% CPU, 5% I/O wait).
>>
>> My questions are:
>>
>> 1) will Solr changes from real-time branch help to resolve these
>> issues? I was reading
>> http://blog.mikemccandless.com/2011/05/265-indexing-speedup-with-lucenes.html
>> and it looks like we have exactly the same problem
>
> did you also read http://bit.ly/ujLw6v - here I try to explain the
> major difference between Lucene 3.x and 4.0 and why 3.x has these long
> idle times. In Lucene 3.x a full flush / commit is a single threaded
> process, as you observed there is only one thread making progress. In
> Lucene 4 there is still a single thread executing the commit but other
> threads are not blocked anymore. Depending on how fast the thread can
> flush other threads might help flushing segments for that commit
> concurrently or simply index into new documents writers. So basically
> 4.0 won't have this problem anymore. The realtime branch you talk
> about is already merged into 4.0 trunk.
>
>>
>> 2) what would be the best way to port these (and only these) changes
>> to 3.4.0? I tried to dig into the branching and revisions, but got
>> lost quickly. Tried something like "svn diff
>> […]realtime_search@r953476 […]realtime_search@r1097767", but I'm not
>> sure if it's even possible to merge these into 3.4.0
>
> Possible yes! Worth the trouble, I would say no!
> DocumentsWriterPerThread (DWPT) is a very big change and I don't think
> we should backport this into our stable branch. However, this feature
> is very stable in 4.0 though.
>>
>> 3) what would you recommend for production 24/7 use? 3.4.0?
>
> I think 3.4 is a safe bet! I personally tend to use trunk in
> production too the only problem is that this is basically a moving
> target and introduces extra overhead on your side to watch changes and
> index format modification which could basically prevent you from
> simple upgrades
>
>>
>> 4) is there a workaround that can be used? also, I listed the stack trace 
>> below
>>
>> Thank you!
>> Roman
>>
>> P.S. This single "index flushing" thread spends 99% of all the time in
>> "org.apache.lucene.index.BufferedDeletesStream.applyDeletes", and then
>> the merge seems to go quickly. I looked it up and it looks like the
>> intent here is deleting old commit points (we are keeping only 1
>> non-optimized commit point per config). Not sure why is it taking that
>> long.
>
> in 3.x there is no way to apply deletes without doing a flush (afaik).
> In 3.x a flush means single threaded again - similar to commit just
> without syncing files to disk and writing a new segments file. In 4.0
> you have way more control over this via
> IndexWriterConfig#setMaxBufferedDeleteTerms which are also applied
> without blocking other threads. In trunk we hijack indexing threads to
> do all that work concurrently so you get better cpu utilization and
> due to concurrent flushing better and usually continuous IO
> utilization.
>
> hope that helps.
>
> simon
>>
>> pool-2-thread-1 [RUNNABLE] CPU time: 3:31
>> java.nio.Bits.copyToByteArray(long, Object, long, long)
>> java.nio.DirectByteBuffer.get(byte[], int, int)
>> org.apache.lucene.store.MMapDirectory$MMapIndexInput.readBytes(byte[], int, 
>> int)
>> org.apache.lucene.index.TermBuffer.read(Index

Re: large scale indexing issues / single threaded bottleneck

2011-10-28 Thread Simon Willnauer
Hey Roman,

On Fri, Oct 28, 2011 at 8:38 PM, Roman Alekseenkov
 wrote:
> Hi everyone,
>
> I'm looking for some help with Solr indexing issues on a large scale.
>
> We are indexing few terabytes/month on a sizeable Solr cluster (8
> masters / serving writes, 16 slaves / serving reads). After certain
> amount of tuning we got to the point where a single Solr instance can
> handle index size of 100GB without much issues, but after that we are
> starting to observe noticeable delays on index flush and they are
> getting larger. See the attached picture for details, it's done for a
> single JVM on a single machine.
>
> We are posting data in 8 threads using javabin format and doing commit
> every 5K documents, merge factor 20, and ram buffer size about 384MB.
> From the picture it can be seen that a single-threaded index flushing
> code kicks in on every commit and blocks all other indexing threads.
> The hardware is decent (12 physical / 24 virtual cores per machine)
> and it is mostly idle when the index is flushing. Very little CPU
> utilization and disk I/O (<5%), with the exception of a single CPU
> core which actually does index flush (95% CPU, 5% I/O wait).
>
> My questions are:
>
> 1) will Solr changes from real-time branch help to resolve these
> issues? I was reading
> http://blog.mikemccandless.com/2011/05/265-indexing-speedup-with-lucenes.html
> and it looks like we have exactly the same problem

did you also read http://bit.ly/ujLw6v - there I try to explain the
major difference between Lucene 3.x and 4.0 and why 3.x has these long
idle times. In Lucene 3.x a full flush / commit is a single-threaded
process; as you observed, there is only one thread making progress. In
Lucene 4 there is still a single thread executing the commit, but other
threads are not blocked anymore. Depending on how fast that thread can
flush, other threads might help flushing segments for that commit
concurrently or simply keep indexing into new document writers. So
basically 4.0 won't have this problem anymore. The realtime branch you
talk about is already merged into 4.0 trunk.

>
> 2) what would be the best way to port these (and only these) changes
> to 3.4.0? I tried to dig into the branching and revisions, but got
> lost quickly. Tried something like "svn diff
> […]realtime_search@r953476 […]realtime_search@r1097767", but I'm not
> sure if it's even possible to merge these into 3.4.0

Possible yes! Worth the trouble, I would say no!
DocumentsWriterPerThread (DWPT) is a very big change and I don't think
we should backport it into our stable branch. However, this feature
is very stable in 4.0.
>
> 3) what would you recommend for production 24/7 use? 3.4.0?

I think 3.4 is a safe bet! I personally tend to use trunk in
production too the only problem is that this is basically a moving
target and introduces extra overhead on your side to watch changes and
index format modification which could basically prevent you from
simple upgrades

>
> 4) is there a workaround that can be used? also, I listed the stack trace 
> below
>
> Thank you!
> Roman
>
> P.S. This single "index flushing" thread spends 99% of all the time in
> "org.apache.lucene.index.BufferedDeletesStream.applyDeletes", and then
> the merge seems to go quickly. I looked it up and it looks like the
> intent here is deleting old commit points (we are keeping only 1
> non-optimized commit point per config). Not sure why is it taking that
> long.

In 3.x there is no way to apply deletes without doing a flush (AFAIK).
In 3.x a flush means single-threaded again - similar to a commit, just
without syncing files to disk and writing a new segments file. In 4.0
you have way more control over this via
IndexWriterConfig#setMaxBufferedDeleteTerms, and deletes are applied
without blocking other threads. In trunk we hijack indexing threads to
do all that work concurrently, so you get better CPU utilization and,
due to concurrent flushing, better and usually continuous IO
utilization.

hope that helps.

simon
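
For reference, a hedged sketch against the Lucene 4.0 trunk API mentioned
above; the directory and the values are illustrative, and in Solr the
equivalent settings live in solrconfig.xml.

  import java.io.File;
  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.index.IndexWriterConfig;
  import org.apache.lucene.store.FSDirectory;
  import org.apache.lucene.util.Version;

  public class DeleteBufferSketch {
    public static IndexWriter open(File path) throws Exception {
      IndexWriterConfig cfg = new IndexWriterConfig(
          Version.LUCENE_40, new StandardAnalyzer(Version.LUCENE_40));
      cfg.setRAMBufferSizeMB(384);           // index buffer, as in the setup above
      cfg.setMaxBufferedDeleteTerms(10000);  // cap buffered delete terms before they are applied
      return new IndexWriter(FSDirectory.open(path), cfg);
    }
  }
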
>
> pool-2-thread-1 [RUNNABLE] CPU time: 3:31
> java.nio.Bits.copyToByteArray(long, Object, long, long)
> java.nio.DirectByteBuffer.get(byte[], int, int)
> org.apache.lucene.store.MMapDirectory$MMapIndexInput.readBytes(byte[], int, 
> int)
> org.apache.lucene.index.TermBuffer.read(IndexInput, FieldInfos)
> org.apache.lucene.index.SegmentTermEnum.next()
> org.apache.lucene.index.TermInfosReader.(Directory, String,
> FieldInfos, int, int)
> org.apache.lucene.index.SegmentCoreReaders.(SegmentReader,
> Directory, SegmentInfo, int, int)
> org.apache.lucene.index.SegmentReader.get(boolean, Directory,
> SegmentInfo, int, boolean, int)
> org.apache.lucene.index.IndexWriter$ReaderPool.get(SegmentInfo,
> boolean, int, int)
> org.apache.lucene.index.IndexWriter$ReaderPool.get(SegmentInfo, boolean)
> org.apache.lucene.index.BufferedDeletesStream.applyDeletes(IndexWriter$ReaderPool,
> List)
> org.apache.lucene.index.IndexWriter.doFlush(boolean)
> org.apache.lucene.index.IndexWriter.flush(boolean, b

Re: changing omitNorms on an already built index

2011-10-28 Thread Simon Willnauer
On Fri, Oct 28, 2011 at 12:20 AM, Robert Muir  wrote:
> On Thu, Oct 27, 2011 at 6:00 PM, Simon Willnauer
>  wrote:
>> we are not actively removing norms. if you set omitNorms=true and
>> index documents they won't have norms for this field. Yet, other
>> segment still have norms until they get merged with a segment that has
>> no norms for that field ie. omits norms. omitNorms is anti-viral so
>> once you set it to true it will be true for other segment eventually.
>> If you optimize you index you should see that norms go away.
>>
>
> This is only true in trunk (4.x!)
> https://issues.apache.org/jira/browse/LUCENE-2846

ah right, I thought this was ported - nevermind! thanks robert

simon
>
> --
> lucidimagination.com
>


Re: changing omitNorms on an already built index

2011-10-27 Thread Simon Willnauer
we are not actively removing norms. If you set omitNorms=true and
index documents, they won't have norms for this field. Yet, other
segments still have norms until they get merged with a segment that has
no norms for that field, i.e. omits norms. omitNorms is anti-viral, so
once you set it to true it will eventually be true for other segments.
If you optimize your index you should see the norms go away.

simon

On Thu, Oct 27, 2011 at 11:17 PM, Marc Sturlese  wrote:
> As far as I know there's no issue about this. You have to reindex and that's
> it.
> In which kind of field are you changing the norms? (You just will see
> changes in text fields)
> Using debugQuery=true you can see how norms affect the score (in case you
> have them not omited)
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/changing-omitNorms-on-an-already-built-index-tp3459132p3459169.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: How can I force the threshold for a fuzzy query?

2011-10-27 Thread Simon Willnauer
I am not sure if there is such an option, but you might be able to
override your query parser and reset that value if it is too fuzzy.
Look for protected Query newFuzzyQuery(Term term, float
minimumSimilarity, int prefixLength) - there you can change the actual
value used for minimumSimilarity.

simon
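
A hedged sketch of that override against the Lucene 3.x QueryParser API
(wiring such a parser into Solr would additionally need a custom
QParserPlugin, which is not shown; the 0.7 floor is simply the value from
the question):

  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.index.Term;
  import org.apache.lucene.queryParser.QueryParser;
  import org.apache.lucene.search.Query;
  import org.apache.lucene.util.Version;

  public class ClampedFuzzyQueryParser extends QueryParser {
    public ClampedFuzzyQueryParser(Version v, String field, Analyzer a) {
      super(v, field, a);
    }

    @Override
    protected Query newFuzzyQuery(Term term, float minimumSimilarity, int prefixLength) {
      // never allow anything fuzzier than 0.7, regardless of what the user asked for
      return super.newFuzzyQuery(term, Math.max(minimumSimilarity, 0.7f), prefixLength);
    }
  }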


On Thu, Oct 27, 2011 at 4:54 PM, Gustavo Falco
 wrote:
> Hi guys,
>
> I'm new to Solr (as you may guess for the subject). I'd like to force the
> threshold for fuzzy queries to, say, 0.7. I've read that fuzzy queries are
> expensive, but limiting it's threshold to a number near 1 would help.
>
> So my question is: Is this possible to configure in some of the xml
> configuration files? and if that's so, if I use this query:
>
> myField:myQuery~0.2
>
> Would Solr use the configured threshold instead, preventing indeed that
> anyone force a minor value than what I've set in the xml file? Would it help
> for what I want to do?
>
>
>
> Thanks in advance!
>


Re: Optimization /Commit memory

2011-10-25 Thread Simon Willnauer
RAM cost during optimize / merge is generally low. Optimize is
basically a merge of all segments into one, although there are
exceptions. Lucene streams existing segments from disk and serializes
the new segment on the fly. When you optimize, or in general when you
merge segments, you need disk space for the "source" segments and the
"target" (merged) segment.

If you use the compound file format (CFS) you need additional space once
the merge is done and the files are packed into the CFS, which is
basically the size of the "target" (merged) segment. Once the merge is
done Lucene can free the disk space, unless you have an IndexReader open
that references those segments (Lucene keeps track of these files and
frees disk space once possible).

That said, I think you should use optimize very, very rarely. If your
document collection rarely changes, optimizing once in a while is
useful and reasonable. If your collection is constantly changing you
should rely on the merge policy to balance the number of segments
for you in the background. Lucene 3.4 has a nicely improved
TieredMergePolicy that does a great job (previous versions are also
good - just saying).

A commit is basically flushing the segment you have in memory
(IndexWriter memory) to disk. The compression ratio can be up to 30% of
the RAM cost or even more, depending on your data. The actual commit
doesn't need a notable amount of memory.

hope this helps

simon

On Mon, Oct 24, 2011 at 7:38 PM, Jaeger, Jay - DOT
 wrote:
> I have not spent a lot of time researching it, but one would expect that the 
> OS RAM requirement for optimization of an index to be minimal.
>
> My understanding is that during optimization an essentially new index is 
> built.  Once complete it switches out the indexes and will throw away the old 
> one.  (In Windows it may not throw away the old one until the next Commit).
>
> JRJ
>
> -Original Message-
> From: Sujatha Arun [mailto:suja.a...@gmail.com]
> Sent: Friday, October 21, 2011 12:10 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Optimization /Commit memory
>
> Just one more thing ,when we are talking about Optimization , we
> are referring to  HD  free space for  replicating the index  (2 or 3 times
> the index size  ) .what is role of  RAM (OS) here?
>
> Regards
> Suajtha
>
> On Fri, Oct 21, 2011 at 10:12 AM, Sujatha Arun  wrote:
>
>> Thanks that helps.
>>
>> Regards
>> Sujatha
>>
>>
>> On Thu, Oct 20, 2011 at 6:23 PM, Jaeger, Jay - DOT 
>> wrote:
>>
>>> Well, since the OS RAM includes the JVM RAM, that is part of your
>>> requirement, yes?  Aside from the JVM and normal OS requirements, all you
>>> need OS RAM for is file caching.  Thus, for updates, the OS RAM is not a
>>> major factor.  For searches, you want sufficient OS RAM to cache enough of
>>> the index to get the query performance you need, and to cache queries inside
>>> the JVM if you get a lot of repeat queries (see solrconfig.xml for the
>>> various caches: we have not played with them much).  So, the amount of RAM
>>> necessary for that is very much dependent upon the size of your index, so I
>>> cannot give you a simple number.
>>>
>>> You seem to believe that you have to have sufficient memory to have the
>>> entire index in memory.  Except where extremely high performance is
>>> required, I have not found that to be the case.
>>>
>>> This is just one of those "your mileage may vary" things.  There is not a
>>> single answer or formula that fits every situation.
>>>
>>> JRJ
>>>
>>> -Original Message-
>>> From: Sujatha Arun [mailto:suja.a...@gmail.com]
>>> Sent: Wednesday, October 19, 2011 11:58 PM
>>> To: solr-user@lucene.apache.org
>>> Subject: Re: Optimization /Commit memory
>>>
>>> Thanks  Jay ,
>>>
>>> I was trying to compute the *OS RAM requirement*  *not JVM RAM* for a 14
>>> GB
>>> Index [cumulative Index size of all Instances].And I put it thus -
>>>
>>> Requirement of Operating System RAM for an Index of  14GB is   - Index
>>> Size
>>> + 3 Times the  maximum Index Size of Individual Instance for Optimize .
>>>
>>> That is to say ,I have several Instances ,combined Index Size is 14GB
>>> .Maximum Individual Index Size is 2.5GB .so My requirement for OS RAM is
>>>  14GB +3 * 2.5 GB  ~ = 22GB.
>>>
>>> Correct?
>>>
>>> Regards
>>> Sujatha
>>>
>>>
>>>
>>> On Thu, Oct 20, 2011 at 3:45 AM, Jaeger, Jay - DOT >> >wrote:
>>>
>>> > Commit does not particularly spike disk or memory usage, unless you are
>>> > adding a very large number of documents between commits.  A commit can
>>> cause
>>> > a need to merge indexes, which can increase disk space temporarily.  An
>>> > optimize is *likely* to merge indexes, which will usually increase disk
>>> > space temporarily.
>>> >
>>> > How much disk space depends very much upon how big your index is in the
>>> > first place.  A 2 to 3 times factor of the sum of your peak index file
>>> size
>>> > seems safe, to me.
>>> >
>>> > Solr uses only modest amounts of memory for the JVM for this stuff.
>>> >
>>> 

Re: some basic information on Solr

2011-10-25 Thread Simon Willnauer
hey,

2011/10/24 Dan Wu :
>  Hi all,
>
> I am doing a student project on search engine research. Right now I have
> some basic questions about Slor.
>
> 1. How many types of data file Solr can support (estimate)? i.e. No. of
> file types solr can look at for indexing and searching.
Basically you can use Solr to index all kinds of documents as long as
you can extract the text from them. However, Solr ships with
content extraction support that handles a large set of different
file types. AFAIK it leverages Apache Tika (http://tika.apache.org), which
supports a very large set of document formats
(http://tika.apache.org/0.10/formats.html). Hope this helps!
>
> 2. How much is estimated cost of incidents per year for Solr ?

I have to admit I don't know what you are asking for. can you
elaborate on this a bit? What is an incident in this context?

simon
>
> Since the numbers could vary from different platforms, however we would like
> to know the estimate answers regarding the general cases.
>
> Thanks
>
>
>
> --
> Dan Wu (Fiona Wu)  武丹
> Master of Engineering Management Program Degree Candidate
> Duke University, North Carolina, USA
> Email: dan...@duke.edu
> Tel: 919-599-2730
>


Re: accessing the query string from inside TokenFilter

2011-10-25 Thread Simon Willnauer
On Tue, Oct 25, 2011 at 3:51 PM, Bernd Fehling
 wrote:
> Dear list,
> while writing some TokenFilter for my analyzer chain I need access to
> the query string from inside of my TokenFilter for some comparison, but the
> Filters are working with a TokenStream and get seperate Tokens.
> Currently I couldn't get any access to the query string.
>
> Any idea how to get this done?
>
> Is there an Attribute for "query" or "qstr"?

I don't think there is anything like that but this could be useful. We
could add this and make it optional on the query parser? Maybe even in
lucene. can you bring this to the dev list?

simon
>
> Regards Bernd
>


Re: How to make UnInvertedField faster?

2011-10-22 Thread Simon Willnauer
On Fri, Oct 21, 2011 at 4:37 PM, Michael McCandless
 wrote:
> Well... the limitation of DocValues is that it cannot handle more than
> one value per document (which UnInvertedField can).

you can pack this into one byte[] or use more than one field? I don't
see a real limitation here.

simon
>
> Hopefully we can fix that at some point :)
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> On Fri, Oct 21, 2011 at 7:50 AM, Simon Willnauer
>  wrote:
>> In trunk we have a feature called IndexDocValues which basically
>> creates the uninverted structure at index time. You can then simply
>> suck that into memory or even access it on disk directly
>> (RandomAccess). Even if I can't help you right now this is certainly
>> going to help you here. There is no need to uninvert at all anymore in
>> lucene 4.0
>>
>> simon
>>
>> On Wed, Oct 19, 2011 at 8:05 PM, Michael Ryan  wrote:
>>> I was wondering if anyone has any ideas for making 
>>> UnInvertedField.uninvert()
>>> faster, or other alternatives for generating facets quickly.
>>>
>>> The vast majority of the CPU time for our Solr instances is spent generating
>>> UnInvertedFields after each commit. Here's an example of one of our slower 
>>> fields:
>>>
>>> [2011-10-19 17:46:01,055] INFO125974[pool-1-thread-1] - (SolrCore:440) -
>>> UnInverted multi-valued field 
>>> {field=authorCS,memSize=38063628,tindexSize=422652,
>>> time=15610,phase1=15584,nTerms=1558514,bigTerms=0,termInstances=4510674,uses=0}
>>>
>>> That is from an index with approximately 8 million documents. After each 
>>> commit,
>>> it takes on average about 90 seconds to uninvert all the fields that we 
>>> facet on.
>>>
>>> Any ideas at all would be greatly appreciated.
>>>
>>> -Michael
>>>
>>
>


Re: How to make UnInvertedField faster?

2011-10-21 Thread Simon Willnauer
In trunk we have a feature called IndexDocValues which basically
creates the uninverted structure at index time. You can then simply
suck that into memory or even access it on disk directly
(RandomAccess). Even if I can't help you right now, this is certainly
going to help you here. There is no need to uninvert at all anymore in
Lucene 4.0.

simon

On Wed, Oct 19, 2011 at 8:05 PM, Michael Ryan  wrote:
> I was wondering if anyone has any ideas for making UnInvertedField.uninvert()
> faster, or other alternatives for generating facets quickly.
>
> The vast majority of the CPU time for our Solr instances is spent generating
> UnInvertedFields after each commit. Here's an example of one of our slower 
> fields:
>
> [2011-10-19 17:46:01,055] INFO125974[pool-1-thread-1] - (SolrCore:440) -
> UnInverted multi-valued field 
> {field=authorCS,memSize=38063628,tindexSize=422652,
> time=15610,phase1=15584,nTerms=1558514,bigTerms=0,termInstances=4510674,uses=0}
>
> That is from an index with approximately 8 million documents. After each 
> commit,
> it takes on average about 90 seconds to uninvert all the fields that we facet 
> on.
>
> Any ideas at all would be greatly appreciated.
>
> -Michael
>


Re: Painfully slow indexing

2011-10-21 Thread Simon Willnauer
On Wed, Oct 19, 2011 at 3:58 PM, Pranav Prakash  wrote:
> Hi guys,
>
> I have set up a Solr instance and upon attempting to index document, the
> whole process is painfully slow. I will try to put as much info as I can in
> this mail. Pl. feel free to ask me anything else that might be required.
>
> I am sending documents in batches not exceeding 2,000. The size of each of
> them depends but usually is around 10-15MiB. My indexing script tells me
> that Solr took T seconds to add N documents of size S. For the same data,
> the Solr Log add QTime is QT. Some of the sample data are:
>
>   N                     S                T               QT
> -
>  390 docs  |   3,478,804 Bytes   | 14.5s    |  2297
>  852 docs  |   6,039,535 Bytes   | 25.3s    |  4237
> 1345 docs | 11,147,512 Bytes   |  47s      |  8543
> 1147 docs |   9,457,717 Bytes   |  44s      |  2297
> 1096 docs | 13,058,204 Bytes   |  54.3s   |   8782
>
> The time T includes the time of converting an array of Hash objects into
> XML, POSTing it to Solr and response acknowledged from Solr. Clearly, there
> is a huge difference between both the time T and QT. After a lot of efforts,
> I have no clue why these times do not match.
>
> The Server has 16 cores, 48GiB RAM. JVM options are -Xms5000M -Xmx5000M
> -XX:+UseParNewGC
>
> I believe my Indexing is getting slow. Relevant portion from my schema file
> are as follows. On a related note, every document has one dynamic field.
> Based on this rate, it takes me ~30hrs to do a full index of my database.
> I would really appreciate kindness of community in order to get this
> indexing faster.
>
> 
>
> false
>
> 
>
> 10
>
> 10
>
>  
>
> 2048
>
> 2147483647
>
> 300
>
> 1000
>
> 5
>
> 256
>
> 10
>
> false
>
> 
>
> 
>
> 
>
> true
>
> true
>
> 
>
>  1
>
> 0
>
> 
>
> false
>
> 
>
> 
>
> 
>
>  10
>
> 
>
> 
>
>
> *Pranav Prakash*
>
> "temet nosce"
>
> Twitter  | Blog  |
> Google 
>

hey,

are you calling commit after your batches or doing an optimize by any chance?

I would suggest you stream your documents to Solr and commit
only if you really need to. Set your RAM buffer to something between
256 and 320 MB and remove the maxBufferedDocs setting completely. You
can also experiment with your merge settings a little; 10 merging
threads seems like a lot. I know you have lots of CPU but IO will be
the bottleneck here.
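
As a rough sketch of the streaming approach with SolrJ's
StreamingUpdateSolrServer (available in the 1.4/3.x client; the URL, queue
size, thread count and field names below are illustrative only):

    import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class StreamToSolr {
      public static void main(String[] args) throws Exception {
        // queue up to 2000 docs and stream them over 4 background threads
        StreamingUpdateSolrServer server =
            new StreamingUpdateSolrServer("http://localhost:8983/solr", 2000, 4);
        for (int i = 0; i < 100000; i++) {
          SolrInputDocument doc = new SolrInputDocument();
          doc.addField("id", Integer.toString(i));
          doc.addField("body_t", "document body " + i);
          server.add(doc);          // streams in the background, no per-batch commit
        }
        server.commit();            // commit once at the end
      }
    }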

simon


Checkout SearchWorkings.org - it just went live!

2011-09-09 Thread Simon Willnauer
Hey folks,

As some of you might have heard, I and a small group of other
passionate search technology professionals have been working hard over
the last few months to launch a community site known as
SearchWorkings.org [1]. This initiative has been set up for other
search professionals to have a single point of contact or
comprehensive resource where one can learn and talk about all the
exciting new developments in the world of open source search.

Anyone like yourselves familiar with open source search knows that
technologies like Lucene and Solr have grown tremendously in
popularity over the years, but with this growth there have also come a
number of challenges, such as limited support and education. With the
launch of SearchWorkings.org we are convinced we will overcome and
resolve some of these challenges.

Covering open source search technologies from Apache Lucene and Apache
Solr to Apache Mahout, one of the key objectives for the community is
to create a place where search specialists can engage with one another
and enjoy a single point of contact for various resources, downloads
and documentation.

Like any other community website, content will be added on a regular
basis and community members can also make their own contributions and
stay on top of everything search related too. For now, there is access
to an extensive resource centre offering online tutorials, downloads,
white papers and access to a host of search specialists in the forum.
With the ability to post blog items and keep up to date with relevant
news, the site is a search specialist's dream come true and addresses
what we felt was a clear need in the market.

Searchworkings.org starts off with an initial focus on Lucene, Solr &
Friends but aims to be much broader. Each of you can & should
contribute: tell us your search, data-processing, setup or
optimization story. I am looking forward to more and more blogs,
articles and tutorials about smaller projects like Apache Lucy, real
world case-studies or 3rd party extensions for OSS Search components.

have fun,

Simon

[1] http://www.searchworkings.org
[2] Trademark Acknowledgement: Apache Lucene, Apache Solr, Apache
Mahout and Apache Lucy respective logos are trademarks of The Apache
Software Foundation. All other marks mentioned may be trademarks or
registered trademarks of their respective owners.


Re: Requiring multiple matches of a term

2011-08-22 Thread Simon Willnauer
On Mon, Aug 22, 2011 at 8:10 PM, Chris Hostetter
 wrote:
>
> : One simple way of doing this is maybe to write a wrapper for TermQuery
> : that only returns docs with a Term Frequency  > X as far as I
> : understand the question those terms don't have to be within a certain
> : window right?
>
> I don't think you could do it as a Query Wrapper -- it would have to be a
> Scorer wrapper, correct?

A query wrapper boils down to a scorer. If you don't want to change the
Lucene source you can simply write your own query wrapper.
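
For illustration only, the core frequency check looks roughly like this against
the 3.x postings API. This is just a standalone scan, not the LUCENE-3395 patch,
and the index path and field name are made up:

    import java.io.File;

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermDocs;
    import org.apache.lucene.store.FSDirectory;

    public class MinFreqScan {
      public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open(FSDirectory.open(new File("/path/to/index")));
        TermDocs td = reader.termDocs(new Term("body", "dog"));
        int minFreq = 3;
        while (td.next()) {
          if (td.freq() >= minFreq) {          // keep only docs where the term occurs often enough
            System.out.println("doc " + td.doc() + " matches " + td.freq() + " times");
          }
        }
        td.close();
        reader.close();
      }
    }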

simon
>
> That's the approach rmuir and i were discussing on friday, and i just
> posted a patch of the "guts" that could use some review...
>
>        https://issues.apache.org/jira/browse/LUCENE-3395
>
> ..the end goal would be options in TermQuery that would cause it to
> automaticly wrap it's Scorer in one of these, ala..
>
>        TermQuery q = new TermQuery(new Term("foo","bar"));
>        q.setMinFreq(4.0f);
>        q.setMaxFreq(1000.0f);
>
> ...and in solr, options for this could be added to the {!term} parser...
>
>        q={!term f=foo minTf=4.0 maxTf=1000.0}bar
>
> (could maybe add syntax to the regular query parser, but i think our
> strategic meta-character reserves are dangerously low)
>
>
> -Hoss
>


Re: heads up: re-index 3.x branch Lucene/Solr indices

2011-08-22 Thread Simon Willnauer
Shawn, as long as you are only using a released version of Lucene/Solr
you don't need to be worried at all. This is an index format change
that has never been released. Only if you use an svn checkout do you
need to reindex.

simon

On Mon, Aug 22, 2011 at 8:56 PM, Shawn Heisey  wrote:
> On 8/22/2011 12:38 PM, Shawn Heisey wrote:
>>
>> Just to be clear, if you are not using a compound file, do you need to
>> worry about this?  I am using 3.2, but I've got the compound file turned off
>> and have 11 files per segment.  Upgrading is in my near future, but I think
>> 3.4 will be out by the time I get there.
>
> From what I've just been reading, Solr and Lucene default to using the
> compound file format.  I started with 1.4.0, and now I'm not sure whether I
> turned it off or whether the example solrconfig.xml already had it turned
> off, but I know that it continues to be disabled because of performance
> worries.  Do things still run faster if compound file format is off in the
> newest stable versions?
>
> Thanks,
> Shawn
>
>


heads up: re-index 3.x branch Lucene/Solr indices

2011-08-22 Thread Simon Willnauer
I just reverted a previous commit related to CompoundFile in the 3.x
stable branch.
If you are using the unreleased 3.x branch you need to reindex.

See here for details:

   https://issues.apache.org/jira/browse/LUCENE-3218

If you are using a released version of Lucene/Solr then you can ignore
this message.

Thanks,

Simon


Re: Requiring multiple matches of a term

2011-08-21 Thread Simon Willnauer
On Fri, Aug 19, 2011 at 6:26 PM, Michael Ryan  wrote:
> Is there a way to specify in a query that a term must match at least X times 
> in a document, where X is some value greater than 1?
>

One simple way of doing this might be to write a wrapper for TermQuery
that only returns docs with a term frequency > X. As far as I
understand the question, those terms don't have to be within a certain
window, right?

simon
> For example, I want to only get documents that contain the word "dog" three 
> times.  I've thought that using a proximity query with an arbitrary large 
> distance value might do it:
> "dog dog dog"~10
> And that does seem to return the results I expect.
>
> But when I try for more than three, I start getting unexpected result counts 
> as I change the proximity value:
> "dog dog dog dog"~10 returns 6403 results
> "dog dog dog dog"~20 returns 9291 results
> "dog dog dog dog"~30 returns 6395 results
>
> Anyone ever do something like this and know how I can accomplish this?
>
> -Michael
>


Re: OOM due to JRE Issue (LUCENE-1566)

2011-08-16 Thread Simon Willnauer
hey,

On Tue, Aug 16, 2011 at 9:34 AM, Pranav Prakash  wrote:
> Hi,
>
> This might probably have been discussed long time back, but I got this error
> recently in one of my production slaves.
>
> SEVERE: java.lang.OutOfMemoryError: OutOfMemoryError likely caused by the
> Sun VM Bug described in https://issues.apache.org/jira/browse/LUCENE-1566;
> try calling FSDirectory.setReadChunkSize with a a value smaller than the
> current chunk size (2147483647)
>
> I am currently using Solr1.4. Going through JIRA Issue comments, I found
> that this patch applies to 2.9 or above. We are also planning an upgrade to
> Solr 3.3. Is this patch included in 3.3 so as to I don't have to manually
> apply the patch?
AFAIK, Solr 1.4 is on Lucene 2.9.1, so this patch is already applied to
the version you are using.
Maybe you can provide the stacktrace and more details about your
problem and report back?

simon

>
> What are the other workarounds of the problem?
>
> Thanks in adv.
>
> *Pranav Prakash*
>
> "temet nosce"
>
> Twitter  | Blog  |
> Google 
>


Re: Can I delete the stored value?

2011-07-11 Thread Simon Willnauer
On Mon, Jul 11, 2011 at 8:28 AM, Andrzej Bialecki  wrote:
> On 7/10/11 2:33 PM, Simon Willnauer wrote:
>>
>> Currently there is no easy way to do this. I would need to think how
>> you can force the index to drop those so the answer here is no you
>> can't!
>>
>> simon
>>
>> On Sat, Jul 9, 2011 at 11:11 AM, Gabriele Kahlout
>>   wrote:
>>>
>>> I've stored the contents of some pages I no longer need. How can I now
>>> delete the stored content without re-crawling the pages (i.e. using
>>> updateDocument ). I cannot just remove the field, since I still want the
>>> field to be indexed, I just don't want to store something with it.
>>> My understanding is that field.setValue("") won't do since that should
>>> affect the indexed value as well.
>
> You could pump the content of your index through a FilterIndexReader - i.e.
> implement a subclass of FilterIndexReader that removes stored fields under
> some conditions, and then use IndexWriter.addIndexes with this reader.
>
> See LUCENE-1812 for another practical application of this concept.

Good call Andrzej. To make this work I think you need to use Lucene
directly, so make sure you are on the right version.
simon
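
A rough sketch of Andrzej's FilterIndexReader idea against a 3.x-era API
follows; the class name, paths and the stored field name are invented for
illustration, so test it carefully before pointing it at real data:

    import java.io.File;
    import java.io.IOException;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.FieldSelector;
    import org.apache.lucene.index.FilterIndexReader;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class StripStoredField extends FilterIndexReader {
      private final String field;

      public StripStoredField(IndexReader in, String field) {
        super(in);
        this.field = field;
      }

      @Override
      public Document document(int n, FieldSelector selector) throws IOException {
        Document doc = super.document(n, selector);
        doc.removeFields(field);   // drop the stored value; indexed terms are untouched
        return doc;
      }

      public static void main(String[] args) throws Exception {
        IndexReader old = IndexReader.open(FSDirectory.open(new File("/path/to/old-index")));
        IndexWriter writer = new IndexWriter(
            FSDirectory.open(new File("/path/to/new-index")),
            new IndexWriterConfig(Version.LUCENE_33, new StandardAnalyzer(Version.LUCENE_33)));
        writer.addIndexes(new StripStoredField(old, "content"));   // rewrites into the new index
        writer.close();
        old.close();
      }
    }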
>
> --
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>


Re: Can I delete the stored value?

2011-07-10 Thread Simon Willnauer
Currently there is no easy way to do this. I would need to think about how
you could force the index to drop those, so the answer here is: no, you
can't!

simon

On Sat, Jul 9, 2011 at 11:11 AM, Gabriele Kahlout
 wrote:
> I've stored the contents of some pages I no longer need. How can I now
> delete the stored content without re-crawling the pages (i.e. using
> updateDocument ). I cannot just remove the field, since I still want the
> field to be indexed, I just don't want to store something with it.
> My understanding is that field.setValue("") won't do since that should
> affect the indexed value as well.
>
> --
> Regards,
> K. Gabriele
>
> --- unchanged since 20/9/10 ---
> P.S. If the subject contains "[LON]" or the addressee acknowledges the
> receipt within 48 hours then I don't resend the email.
> subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧ time(x)
> < Now + 48h) ⇒ ¬resend(I, this).
>
> If an email is sent by a sender that is not a trusted contact or the email
> does not contain a valid code then the email is not received. A valid code
> starts with a hyphen and ends with "X".
> ∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈
> L(-[a-z]+[0-9]X)).
>


Re: DelimitedPayloadTokenFilter and Highlighter

2011-07-10 Thread Simon Willnauer
Hey Hannes,

the simplest solution here is maybe to use a second field that is for
highlighting only. This field would then store your content without
the payloads. The other way would be stripping off the payloads during
rendering, which is not a nice option I guess. Since I am not a
highlighter expert there might be better options; maybe you can
write a custom fragmenter or something like that.

simon

On Sat, Jul 9, 2011 at 6:33 PM, Hannes Korte
 wrote:
> Hi,
>
> I'm trying to use the DelimitedPayloadTokenFilter for a field, which I want
> to be highlighted. Unfortunately, the resulting snippets contain the
> original payload strings, e.g. "token|0.5". Is there a way to clean the
> stored string, which is used by the highlighter?
>
> Thanks in advance!
> Hannes
>


Heads Up - Index File Format Change on Trunk

2011-06-10 Thread Simon Willnauer
Hey folks,

I just committed LUCENE-3108 (Landing DocValues on Trunk) which adds a
byte to FieldInfo.
If you are running on trunk you must / should re-index any trunk
indexes once you update to the latest trunk.

It's likely that if you open up old trunk (4.0) indexes, you will get an
exception related to Read Past EOF.

Simon


Travel Assistance applications now open for ApacheCon NA 2011

2011-06-06 Thread Simon Willnauer
The Apache Software Foundation (ASF)'s Travel Assistance Committee (TAC) is
now accepting applications for ApacheCon North America 2011, 7-11 November
in Vancouver BC, Canada.

The TAC is seeking individuals from the Apache community at-large --users,
developers, educators, students, Committers, and Members-- who would like to
attend ApacheCon, but need some financial support in order to be able to get
there. There are limited places available, and all applicants will be scored
on their individual merit.

Financial assistance is available to cover flights/trains, accommodation and
entrance fees either in part or in full, depending on circumstances.
However, the support available for those attending only the BarCamp (7-8
November) is less than that for those attending the entire event (Conference
+ BarCamp 7-11 November). The Travel Assistance Committee aims to support
all official ASF events, including cross-project activities; as such, it may
be prudent for those in Asia and Europe to wait for an event geographically
closer to them.

More information can be found at http://www.apache.org/travel/index.html
including a link to the online application and detailed instructions for
submitting.

Applications will close on 8 July 2011 at 22:00 BST (UTC/GMT +1).

We wish good luck to all those who will apply, and thank you in advance for
tweeting, blogging, and otherwise spreading the word.

Regards,
The Travel Assistance Committee


Re: why query chinese character with bracket become phrase query by default?

2011-05-16 Thread Simon Willnauer
On Mon, May 16, 2011 at 3:51 PM, Yonik Seeley
 wrote:
> On Mon, May 16, 2011 at 5:30 AM, Michael McCandless
>  wrote:
>> To be clear, I'm asking that Yonik revert his commit from yesterday
>> (rev 1103444), where he added "text_nwd" fieldType and dynamic fields
>> *_nwd to the example schema.xml.
>
> So... your position is that until the "text" fieldType is changed to
> support non-whitespace-delimited languages better, that
> no other fieldType should be changed/added to better support
> non-whitespace-delimited languages?
> Man, that seems political, not technical.

To me it seems like neither. It's rather the process of improving,
aligned with outstanding issues.
It shouldn't feel wrong.

Simon
>
> Whatever... I'll "revert".
>
> -Yonik
>


Berlin Buzzwords - conference schedule released

2011-04-12 Thread Simon Willnauer
Hey folks,

The Berlin Buzzwords team recently released the schedule for
the conference on high scalability. The conference focuses on the
topics search,
data analysis and NoSQL. It is to take place on June 6/7th 2011 in Berlin.

We are looking forward to two awesome keynote speakers who shaped the world of
open source data analysis: Doug Cutting (founder of Apache Lucene and
Hadoop) as
well as Ted Dunning (Chief Application Architect at MapR Technologies
and active
developer at Apache Hadoop and Mahout).

This year the program has been extended by one additional track. The first
conference day focuses on the topics Apache Lucene, NoSQL, messaging and data
mining. Speakers include Jakob Homan from Yahoo! who will give an introduction
to the new Hadoop security features, Daniel Einspanjer is going to show how
NoSQL and Hadoop are being used at Mozilla Socorro. In addition Chris
Male gives
a presentation on how to integrate Solr with J2EE applications.

The second day features presentations by Jonathan Gray on Facebook's use of
HBase in their Messaging architecture, Dawid Weiss, Simon Willnauer
and Uwe Schindler are
showing the latest Apache Lucene developments, Mark Miller provides
insights into Solr Performance
and Mathias Stearn is discussing MongoDB scalability questions.

"For our developers Berlin Buzzwords is a great chance to introduce our open
source project Couchbase (based on Apache CouchDB and Memcached), get in touch
with interested users and discuss their technical questions on site," says Jan
Lehnardt, Co-Founder of Couchbase (a merger of CouchOne and Membase, formerly
Northscale) [1].

Registration is open; regular tickets are available for 440,- Euro. There is a
group discount. Prices include coffee break and lunch catering.

After the conference there will be trainings on topics related to Berlin
Buzzwords such as Enterprise Search with Apache Lucene and Solr [2]. For the
very first time we will also have community organised hackathons, that give
Berlin Buzzwords visitors the opportunity to work together with the projects'
developers on interesting tasks.

Berlin Buzzwords is produced by newthinking communications in
collaboration with Isabel Drost (Member of the Apache Software
Foundation, PMC member Apache community development and co-founder of
Apache Mahout), Jan Lehnardt (PMC Chair Apache CouchDB) and Simon
Willnauer (PMC member Apache Lucene).

[1] http://www.heise.de/open/meldung/NoSQL-CouchOne-und-Membase-fusionieren-zu -
Couchbase-1185227.html
[2] http://www.jteam.nl/training/2-day-training-Lucene-Solr.html


[GSoC] Apache Lucene @ Google Summer of Code 2011 [STUDENTS READ THIS]

2011-03-11 Thread Simon Willnauer
Hey folks,

Google Summer of Code 2011 is very close and the Project Applications
Period has started recently. Now it's time to get some excited students
on board for this year's GSoC.

I encourage students to submit an application to the Google Summer of Code
web-application. Lucene & Solr are amazing projects and GSoC is an
incredible opportunity to join the community and push the project
forward.

If you are a student and you are interested in spending some time on a
great open source project while getting paid for it, you should submit
your application from March 28 - April 8, 2011. There are only 3
weeks until this process starts!

Quote from the GSoC website: "We hear almost universally from our
mentoring organizations that the best applications they receive are
from students who took the time to interact and discuss their ideas
before submitting an application, so make sure to check out each
organization's Ideas list to get to know a particular open source
organization better."

So if you have any ideas what Lucene & Solr should have, or if you
find any of the GSoC pre-selected projects [1] interesting, please
join us on d...@lucene.apache.org [2].  Since you as a student must
apply for a certain project via the GSoC website [3], it's a good idea
to work on it ahead of time and include the community and possible
mentors as soon as possible.

Open source development here at the Apache Software
Foundation happens almost exclusively in the public and I encourage you to
follow this. Don't mail folks privately; please use the mailing list to
get the best possible visibility and attract interested community
members and push your idea forward. As always, it's the idea that
counts not the person!

That said, please do not underestimate the complexity of even small
"GSoC - Projects". Don't try to rewrite Lucene or Solr!  A project
usually gains more from a smaller, well discussed and carefully
crafted & tested feature than from a half baked monster change that's
too large to work with.

Once your proposal has been accepted and you begin work, you should
give the community the opportunity to iterate with you.  We prefer
"progress over perfection" so don't hesitate to describe your overall
vision, but when the rubber meets the road let's take it in small
steps.  A code patch of 20 KB is likely to be reviewed very quickly so you
get fast feedback, while a patch even 60 KB in size can take very
long. So try to break up your vision and the community will work with
you to get things done!

On behalf of the Lucene & Solr community,

Go! join the mailing list and apply for GSoC 2011,

Simon

[1] 
https://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=true&jqlQuery=labels+%3D+lucene-gsoc-11
[2] http://lucene.apache.org/java/docs/mailinglists.html
[3] http://www.google-melange.com


Re: Lucene 2.9.x vs 3.x

2011-01-16 Thread Simon Willnauer
On Sat, Jan 15, 2011 at 2:19 PM, Salman Akram
 wrote:
> Hi,
>
> SOLR 1.4.1 uses Lucene 2.9.3 by default (I think so). I have few questions
>
> Are there any major performance (or other) improvements in Lucene
> 3.0.3/Lucene 2.9.4?

you can see all major changes here:
http://lucene.apache.org/java/3_0_3/changes/Changes.html

>
> Does 3.x has major compatibility issues moving from 2.9.x?
I assume you mean 3.0.x instead of 3.x? The answer is no - nothing
major! It's mainly the cutover to Java 5, like generics and varargs etc.
>
> Will SOLR 1.4.1 build work fine with Lucene 3.0.3?

Phew... I am not sure, but it could be - though that should be very easy to
try... just get the sources here
http://svn.apache.org/repos/asf/lucene/solr/tags/release-1.4.1/
change the jar and run a test build

simon
>
> Thanks!
>
> --
> Regards,
>
> Salman Akram
> Senior Software Engineer - Tech Lead
> 80-A, Abu Bakar Block, Garden Town, Pakistan
> Cell: +92-321-4391210
>


Re: Lucene Scorer Extension?

2011-01-09 Thread Simon Willnauer
you should look into this http://wiki.apache.org/solr/FunctionQuery

simon
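
For example, one way to multiply the score of a normal query by a per-document
function is the {!boost} parser. A small SolrJ sketch, where userAffinity_f is
a made-up field standing in for whatever per-document signal you store; truly
per-user data would need your own ValueSource / function plugin, which is what
the FunctionQuery wiki page covers:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

    public class BoostedSearch {
      public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
        SolrQuery query = new SolrQuery();
        // the score of the wrapped query is multiplied by the function given in b
        query.setQuery("{!boost b=userAffinity_f}title:laptop");
        System.out.println(server.query(query).getResults().getNumFound());
      }
    }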

On Fri, Jan 7, 2011 at 3:59 PM, dante stroe  wrote:
> Hello,
>
>     What I am trying to do is build a personalized search engine. The aim
> is to have the resulting documents' scores depend on users' preferences.
> I've already built some Solr plugins (request handlers mainly), however I am
> not sure that what I am trying to do can be achieved by a plugin.
> In short, for each query, for each document, I would like to multiply the
> relevance score of each document(at scoring time of course) by the result of
> a function between some of document's fields values and the user's
> preferences (these users preferences will most likely be loaded in memory
> when the plugin initializes). Of course, I need a new request handler to
> take the userID as a query parameter, but I am not sure on how to access
> each document at scoring time in order to update the score based on
> his preferences. Any ideas? (I have looked over
> this
> and after
> looking at the code as well, it doesn't look so trivial ... has anybody else
> tried something similar?)
>
> Cheers,
> Dante
>


Re: The search response time is too loong

2010-09-27 Thread Simon Willnauer
2010/9/27 newsam :
> I have setup a SOLR searcher instance with Tomcat 5.5.21. However, the 
> response time is too long. Here is my scenario:
> 1. The index file is 8.2G. The doc num is 6110745.
> 2. DELL Server: Intel(R) Xeon(TM) CPU (4 cores) 3.00GHZ, 6G Mem.
>
> I used "Key:*" to query all records by localhost:8080. The response time is 
> 68703 milliseconds. The cpu load is 50% and mem useage is over 400M.

If you wanna get all records use q=*:* instead of Key:* - that should
give you faster results - way faster :)

Why are you actually requesting all results and how many of them are
you fetching? Maybe it would be a good idea to explain your usecase /
problem first.

simon

>
> Any comments are welcomed.
>
>
>


Re: trie

2010-09-21 Thread Simon Willnauer
2010/9/21 Péter Király :
> You can read about it in Lucene in Action second edition.
have a look at 
http://www.lucidimagination.com/developer/whitepaper/Whats-New-in-Apache-Lucene-3-0

page 4 to 8 should give you a good intro to the topic

simon
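
In short, a trie ("numeric") field indexes each number at several precisions so
that a range query only has to visit a handful of terms; in Solr that is what
the tint / tlong style field types use. A minimal sketch at the Lucene 3.x
level (the field name is made up):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.NumericField;
    import org.apache.lucene.search.NumericRangeQuery;
    import org.apache.lucene.search.Query;

    public class TrieSketch {
      public static void main(String[] args) {
        // indexing: the value is encoded as a small set of trie terms at several precisions
        Document doc = new Document();
        doc.add(new NumericField("price").setIntValue(4200));

        // searching: the range expands to only a few terms instead of one term per value
        Query q = NumericRangeQuery.newIntRange("price", 1000, 5000, true, true);
        System.out.println(q);
      }
    }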
>
> Péter
>
> 2010/9/21 Papp Richard :
>>  is there any good tutorial how to use and what is trie? what I found on the
>> net is really blurry.
>>
>> rgeards,
>>  Rich
>>
>>
>> __ Information from ESET NOD32 Antivirus, version of virus signature
>> database 5419 (20100902) __
>>
>> The message was checked by ESET NOD32 Antivirus.
>>
>> http://www.eset.com
>>
>>
>>
>


Re: Can I tell Solr to merge segments more slowly on an I/O starved system?

2010-09-18 Thread Simon Willnauer
On Sun, Sep 19, 2010 at 6:04 AM, Ron Mayer  wrote:
> My system which has documents being added pretty much
> continually seems pretty well behaved except, it seems,
> when large segments get merged.     During that time
> the system starts really dragging, and queries that took
> only a couple seconds are taking dozens.
You might wanna look at the MergePolicy since this is the part of Solr
/ Lucene taking care of the segment merges.
There is a recent contribution to the 3.x branch which could be very
interesting to you, see:

http://svn.apache.org/repos/asf/lucene/dev/branches/branch_3x/lucene/contrib/misc/src/java/org/apache/lucene/index/BalancedSegmentMergePolicy.java

you should also look at this post for frequent updates (Near Realtime)

http://www.lucidimagination.com/search/document/a17d63d8dcc1cb9d/tuning_solr_caches_with_high_commit_rates_nrt#adf9ee007ce18ba6

Simon
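
Besides the merge policy, one more Lucene-level knob is limiting how many merges
run concurrently via the merge scheduler, which can reduce the I/O pressure
merges put on searches. A sketch against the 3.1+ IndexWriterConfig API; this is
an illustration, not something the stock Solr 1.4 solrconfig.xml exposes in
exactly this form:

    import java.io.File;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.ConcurrentMergeScheduler;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class ThrottledMerges {
      public static void main(String[] args) throws Exception {
        ConcurrentMergeScheduler scheduler = new ConcurrentMergeScheduler();
        scheduler.setMaxThreadCount(1);   // run at most one merge at a time
        IndexWriterConfig cfg =
            new IndexWriterConfig(Version.LUCENE_31, new StandardAnalyzer(Version.LUCENE_31))
                .setMergeScheduler(scheduler);
        IndexWriter writer = new IndexWriter(FSDirectory.open(new File("/path/to/index")), cfg);
        // ... add documents; merges now compete less with concurrent searches for I/O
        writer.close();
      }
    }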

>
> Some other I/O bound servers seem to have features
> that let you throttle how much I/O they take for administrative
> background tasks -- for example PostgreSQL's "vacuum_cost_delay"
> and related parameters[1], which are described as
>
>  "The intent of this feature is to allow administrators to
>   reduce the I/O impact of these commands on concurrent
>   database activity. There are many situations in which it is
>   not very important that maintenance commands like VACUUM
>   and ANALYZE finish quickly; however, it is usually very
>   important that these commands do not significantly
>   interfere with the ability of the system to perform other
>   database operations. Cost-based vacuum delay provides
>   a way for administrators to achieve this."
>
> Are there any similar features for Solr, where it can sacrifice the
> speed of doing a commit in favor of leaving more I/O bandwidth
> for users performing searches?
>
> If not, where in the code might I look to add such a feature?
>
>     Ron
>
> [1] http://www.postgresql.org/docs/8.4/static/runtime-config-resource.html
>
>
>
>


Re: No more trunk support for 2.9 indexes

2010-09-18 Thread Simon Willnauer
On Sat, Sep 18, 2010 at 4:13 AM, Chris Hostetter
 wrote:
>
> : Since Lucene 3.0.2 is 'out there', does this mean the format is nailed down,
> : and some sort of porting is possible?
> : Does anyone know of a tool that can read the entire contents of a Solr index
> : and (re)write it another? (as an indexing operation - eg 2.9 -> 3.0.x, so 
> not
> : repl)
>
> 3.0.2 should be able to read 2.9 indexes, so you can open a 2.9 index in
> 3.0.2, optimize, and magicly have a 3.x index.

There will also be a tool / mechanism to do the same on 4.0 - it is not
100% nailed down how this might work eventually, but it seems very
likely that there will be read-only support for old indexes - a
merge will then build new-style segments out of old-style segments.

simon
>
> -Hoss
>
> --
> http://lucenerevolution.org/  ...  October 7-8, Boston
> http://bit.ly/stump-hoss      ...  Stump The Chump!
>
>


Re: Field names

2010-09-13 Thread Simon Willnauer
On Tue, Sep 14, 2010 at 1:39 AM, Peter A. Kirk  wrote:
> Fantastic - that is exactly what I was looking for!
>
> But here is one thing I don't undertstand:
>
> If I call the url:
> http://localhost:8983/solr/admin/luke?numTerms=10&fl=name
>
> Some of the result looks like:
>
> <lst name="fields">
>   <lst name="name">
>     <lst name="topTerms">
>       <int name="gb">18</int>
>
> Does this mean that the term "gb" occurs 18 times in the name field?
Yes that is the Doc Frequency of the term "gb". Remember that deleted
/ updated documents and their terms contribute to the doc frequency
until they are expunged from the index. That either happens through a
segment merge in the background or due to an explicit call to
optimize.
>
> Because if I issue this search:
> http://localhost:8983/solr/select/?q=name:gb
>
> I get results like:
> 
> <result name="response" numFound="9" start="0">
>   ...
> So it only finds 9?
Since the "gb" term says 18 occurrences throughout the index I suspect
you updated you docs once without optimizing or indexing a lot of docs
so that segments are merged. Try to call optimize if you can afford it
and see if the doc-freq count goes back to 9

simon
>
> What do the above results actually tell me?
>
> Thanks,
> Peter
>
> 
> From: Ryan McKinley [ryan...@gmail.com]
> Sent: Tuesday, 14 September 2010 11:30
> To: solr-user@lucene.apache.org
> Subject: Re: Field names
>
> check:
> http://wiki.apache.org/solr/LukeRequestHandler
>
>
>
> On Mon, Sep 13, 2010 at 7:00 PM, Peter A. Kirk  
> wrote:
>> Hi
>>
>> is it possible to issue a query to solr, to get a list which contains all 
>> the field names in the index?
>>
>> What about to get a list of the freqency of individual words in each field?
>>
>> thanks,
>> Peter
>>


Re: mm=0?

2010-09-13 Thread Simon Willnauer
On Mon, Sep 13, 2010 at 8:07 PM, Lance Norskog  wrote:
> "Java Swing" no longer gives ads for "swinger's clubs".
Damned, now I have to explicitly enter it?! - argh!

:)

simon
>
> On Mon, Sep 13, 2010 at 9:37 AM, Dennis Gearon  wrote:
>> I just tried several searches again on google.
>>
>> I think they've refined the ads placements so that certain kind of searches 
>> return no ads, the kinds that I've been doing relative to programming being 
>> one of them.
>>
>> If OTOH I do some product related search, THEN lots of ads show up, but 
>> fairly accurate ones.
>>
>> They've immproved the ads placement a LOT!
>>
>> Dennis Gearon
>>
>> Signature Warning
>> 
>> EARTH has a Right To Life,
>>  otherwise we all die.
>>
>> Read 'Hot, Flat, and Crowded'
>> Laugh at http://www.yert.com/film.php
>>
>>
>> --- On Mon, 9/13/10, Satish Kumar  wrote:
>>
>>> From: Satish Kumar 
>>> Subject: Re: mm=0?
>>> To: solr-user@lucene.apache.org
>>> Date: Monday, September 13, 2010, 7:41 AM
>>> Hi Erik,
>>>
>>> I completely agree with you that showing a random document
>>> for user's query
>>> would be very poor experience. I have raised this in our
>>> product review
>>> meetings before. I was told that because of contractual
>>> agreement some
>>> sponsored content needs to be returned even if it meant no
>>> match. And the
>>> sponsored content drives the ads displayed on the page-- so
>>> it is more for
>>> showing some ad on the page when there is no matching
>>> result from sponsored
>>> content for user's query.
>>>
>>> Note that some other content in addition to sponsored
>>> content is displayed
>>> on the page, so user is not seeing just one random result
>>> when there is not
>>> a good match.
>>>
>>> It looks like I have to do another search to get a random
>>> result when there
>>> are no results. In this case I will use RandomSortField to
>>> generate random
>>> result (so that a different ad is displayed from set of
>>> sponsored ads) for
>>> each no result case.
>>>
>>> Thanks for the comments!
>>>
>>>
>>> Satish
>>>
>>>
>>>
>>> On Sun, Sep 12, 2010 at 10:25 AM, Erick Erickson 
>>> wrote:
>>>
>>> > Could you explain the use-case a bit? Because the
>>> very
>>> > first response I would have is "why in the world did
>>> > product management make this a requirement" and try
>>> > to get the requirement changed
>>> >
>>> > As a user, I'm having a hard time imagining being
>>> well
>>> > served by getting a document in response to a search
>>> that
>>> > had no relation to my search, it was just a random
>>> doc
>>> > selected from the corpus.
>>> >
>>> > All that said, I don't think a single query would do
>>> the trick.
>>> > You could include a "very special" document with a
>>> field
>>> > that no other document had with very special text in
>>> it. Say
>>> > field name "bogusmatch", filled with the text
>>> "bogustext"
>>> > then, at least the second query would match one and
>>> only
>>> > one document and would take minimal time. Or you
>>> could
>>> > tack on to each and every query "OR
>>> bogusmatch:bogustext^0.001"
>>> > (which would really be inexpensive) and filter it out
>>> if there
>>> > was more than one response. By boosting it really low,
>>> it should
>>> > always appear at the end of the list which wouldn't be
>>> a bad thing.
>>> >
>>> > DisMax might help you here...
>>> >
>>> > But do ask if it is really a requirement or just
>>> something nobody's
>>> > objected to before bothering IMO...
>>> >
>>> > Best
>>> > Erick
>>> >
>>> > On Sat, Sep 11, 2010 at 1:10 PM, Satish Kumar <
>>> > satish.kumar.just.d...@gmail.com>
>>> wrote:
>>> >
>>> > > Hi,
>>> > >
>>> > > We have a requirement to show at least one result
>>> every time -- i.e.,
>>> > even
>>> > > if user entered term is not found in any of the
>>> documents. I was hoping
>>> > > setting mm to 0 will return results in all cases,
>>> but it is not.
>>> > >
>>> > > For example, if user entered term "alpha" and it
>>> is *not* in any of the
>>> > > documents in the index, any document in the index
>>> can be returned. If
>>> > term
>>> > > "alpha" is in the document set, documents having
>>> the term "alpha" only
>>> > must
>>> > > be returned.
>>> > >
>>> > > My idea so far is to perform a search using user
>>> entered term. If there
>>> > are
>>> > > any results, return them. If there are no
>>> results, perform another search
>>> > > without the query term-- this means doing two
>>> searches. Any suggestions
>>> > on
>>> > > implementing this requirement using only one
>>> search?
>>> > >
>>> > >
>>> > > Thanks,
>>> > > Satish
>>> > >
>>> >
>>>
>>
>
>
>
> --
> Lance Norskog
> goks...@gmail.com
>


Re: stopwords in AND clauses

2010-09-13 Thread Simon Willnauer
On Mon, Sep 13, 2010 at 3:27 PM, Xavier Noria  wrote:
> Let's suppose we have a regular search field body_t, and an internal
> boolean flag flag_t not exposed to the user.
>
> I'd like
>
>    body_t:foo AND flag_t:true

This is Solr, right? Why don't you use a filter query for your unexposed
flag_t field: q=body_t:foo&fq=flag_t:true
This might help too: http://wiki.apache.org/solr/CommonQueryParameters#fq

simon
>
> to be an intersection, but if "foo" is a stopword I get all documents
> for which flag_t is true, as if the first class was dropped, or if
> technically all documents match an empty string.
>
> Is there a way to get 0 results instead?
>


Re: Tuning Solr caches with high commit rates (NRT)

2010-09-13 Thread Simon Willnauer
On Mon, Sep 13, 2010 at 8:02 AM, Dennis Gearon  wrote:
> BTW, what is a segment?

On the Lucene level an index is composed of one or more index
segments. Each segment is an index by itself and consists of several
files like doc stores, proximity data, term dictionaries etc. During
indexing Lucene / Solr creates those segments depending on ram buffer
/ document buffer settings and flushes them to disk (if you index to
disk). Once a segment has been flushed, Lucene will never change that
segment (well, up to a certain level - let's keep this simple) but
writes new ones for newly added documents. Since segments have a
write-once policy, Lucene merges multiple segments into a new segment
(how and when this happens is a different story) from time to time to
get rid of deleted documents and to reduce the number of overall
segments in the index.
Generally a higher number of segments will also influence your search
performance since Lucene performs almost all operations on a
per-segment level. If you want to reduce the number of segments to one
you need to call optimize and Lucene will merge all existing ones into
one single segment.

hope that answers your question

simon
>
> I've only heard about them in the last 2 weeks here on the list.
> Dennis Gearon
>
> Signature Warning
> 
> EARTH has a Right To Life,
>  otherwise we all die.
>
> Read 'Hot, Flat, and Crowded'
> Laugh at http://www.yert.com/film.php
>
>
> --- On Sun, 9/12/10, Jason Rutherglen  wrote:
>
>> From: Jason Rutherglen 
>> Subject: Re: Tuning Solr caches with high commit rates (NRT)
>> To: solr-user@lucene.apache.org
>> Date: Sunday, September 12, 2010, 7:52 PM
>> Yeah there's no patch... I think
>> Yonik can write it. :-)  Yah... The
>> Lucene version shouldn't matter.  The distributed
>> faceting
>> theoretically can easily be applied to multiple segments,
>> however the
>> way it's written for me is a challenge to untangle and
>> apply
>> successfully to a working patch.  Also I don't have
>> this as an itch to
>> scratch at the moment.
>>
>> On Sun, Sep 12, 2010 at 7:18 PM, Peter Sturge 
>> wrote:
>> > Hi Jason,
>> >
>> > I've tried some limited testing with the 4.x trunk
>> using fcs, and I
>> > must say, I really like the idea of per-segment
>> faceting.
>> > I was hoping to see it in 3.x, but I don't see this
>> option in the
>> > branch_3x trunk. Is your SOLR-1606 patch referred to
>> in SOLR-1617 the
>> > one to use with 3.1?
>> > There seems to be a number of Solr issues tied to this
>> - one of them
>> > being Lucene-1785. Can the per-segment faceting patch
>> work with Lucene
>> > 2.9/branch_3x?
>> >
>> > Thanks,
>> > Peter
>> >
>> >
>> >
>> > On Mon, Sep 13, 2010 at 12:05 AM, Jason Rutherglen
>> > 
>> wrote:
>> >> Peter,
>> >>
>> >> Are you using per-segment faceting, eg, SOLR-1617?
>>  That could help
>> >> your situation.
>> >>
>> >> On Sun, Sep 12, 2010 at 12:26 PM, Peter Sturge
>> 
>> wrote:
>> >>> Hi,
>> >>>
>> >>> Below are some notes regarding Solr cache
>> tuning that should prove
>> >>> useful for anyone who uses Solr with frequent
>> commits (e.g. <5min).
>> >>>
>> >>> Environment:
>> >>> Solr 1.4.1 or branch_3x trunk.
>> >>> Note the 4.x trunk has lots of neat new
>> features, so the notes here
>> >>> are likely less relevant to the 4.x
>> environment.
>> >>>
>> >>> Overview:
>> >>> Our Solr environment makes extensive use of
>> faceting, we perform
>> >>> commits every 30secs, and the indexes tend be
>> on the large-ish side
>> >>> (>20million docs).
>> >>> Note: For our data, when we commit, we are
>> always adding new data,
>> >>> never changing existing data.
>> >>> This type of environment can be tricky to
>> tune, as Solr is more geared
>> >>> toward fast reads than frequent writes.
>> >>>
>> >>> Symptoms:
>> >>> If anyone has used faceting in searches where
>> you are also performing
>> >>> frequent commits, you've likely encountered
>> the dreaded OutOfMemory or
>> >>> GC Overhead Exeeded errors.
>> >>> In high commit rate environments, this is
>> almost always due to
>> >>> multiple 'onDeck' searchers and autowarming -
>> i.e. new searchers don't
>> >>> finish autowarming their caches before the
>> next commit()
>> >>> comes along and invalidates them.
>> >>> Once this starts happening on a regular basis,
>> it is likely your
>> >>> Solr's JVM will run out of memory eventually,
>> as the number of
>> >>> searchers (and their cache arrays) will keep
>> growing until the JVM
>> >>> dies of thirst.
>> >>> To check if your Solr environment is suffering
>> from this, turn on INFO
>> >>> level logging, and look for: 'PERFORMANCE
>> WARNING: Overlapping
>> >>> onDeckSearchers=x'.
>> >>>
>> >>> In tests, we've only ever seen this problem
>> when using faceting, and
>> >>> facet.method=fc.
>> >>>
>> >>> Some solutions to this are:
>> >>>    Reduce the commit rate to allow searchers
>> to fully warm before the
>> >>> next commit
>> >>>    Reduce or eliminate the autowarming in
>> caches
>> >>>    Both of the above
>> >>>

Re: Solr memory use, jmap and TermInfos/tii

2010-09-12 Thread Simon Willnauer
On Sun, Sep 12, 2010 at 12:42 PM, Robert Muir  wrote:
> On Sat, Sep 11, 2010 at 7:51 PM, Michael McCandless <
> luc...@mikemccandless.com> wrote:
>
>> On Sat, Sep 11, 2010 at 11:07 AM, Burton-West, Tom 
>> wrote:
>> >  Is there an example of how to set up the divisor parameter in
>> solrconfig.xml somewhere?
>>
>> Alas I don't know how to configure terms index divisor from Solr...
>>
>>
> To change the divisor in your solrconfig, for example to 4, it looks like
> you need to do this.
>
> <indexReaderFactory name="IndexReaderFactory"
>     class="org.apache.solr.core.StandardIndexReaderFactory">
>   <int name="setTermIndexDivisor">4</int>
> </indexReaderFactory>

Ah, thanks robert! I didn't know about that one either!

simon
>
> This parameter was added in SOLR-1296 so its in Solr 1.4
>
> Tom, i would recommend altering this parameter, instead of the default
> (1)... especially since you don't have to reindex to take advantage of it.
>
> --
> Robert Muir
> rcm...@gmail.com
>


Re: Solr memory use, jmap and TermInfos/tii

2010-09-11 Thread Simon Willnauer
On Sun, Sep 12, 2010 at 1:51 AM, Michael McCandless
 wrote:
> On Sat, Sep 11, 2010 at 11:07 AM, Burton-West, Tom  wrote:
>>  Is there an example of how to set up the divisor parameter in 
>> solrconfig.xml somewhere?
>
> Alas I don't know how to configure terms index divisor from Solr...

You can set the termIndexInterval via


<indexDefaults>
  ...
  <termIndexInterval>128</termIndexInterval>
  ...
</indexDefaults>


which has the same effect but requires reindexing. I don't see that
the index divisor is exposed but maybe we should do so!

simon
>>> In 4.0, w/ flex indexing, the RAM efficiency is much better -- we use large
>>> parallel arrays instead of separate objects, and,
>>> we hold much less in RAM.  Simply upgrading to 4.0 and re-indexing will
>>> show this gain...
>>
>> I'm looking forward to a number of the developments in 4.0, but am a bit 
>> wary of using it in production.   I've wanted to work in some tests with 
>> 4.0, but other more pressing issues have so far prevented this.
>
> Understood.
>
>> What about Lucene 2205?  Would that be a way to get some of the benefit 
>> similar to the changes in flex without the rest of the changes in flex and 
>> 4.0?
>
> 2205 was a similar idea (don't create tons of small objects), but it
> was never committed...
>
>>> I'd be really curious to test the RAM reduction in 4.0 on your terms
>>> dict/index --
>>> is there any way I could get a copy of just the tii/tis files in your
>>> index?  Your index is a great test for Lucene!
>>
>> We haven't been able to make much data available due to copyright and other 
>> legal issues.  However, since there is absolutely no way anyone could 
>> reconstruct copyrighted works from the tii/tis index alone, that should be 
>> ok on that front.  On Monday I'll try to get legal/administrative clearance 
>> to provide the data and also ask around and see if I can get the ok to 
>> either find a spare hard drive to ship, or make some kind of sftp 
>> arrangement.  Hopefully we will find a way to be able to do this.
>
> That would be awesome, thanks!
>
>> BTW  Most of the terms are probably the result of  dirty OCR and the impact 
>> is probably increased by our present "punctuation filter".  When we re-index 
>> we plan to use a more intelligent filter that will truncate extremely long 
>> tokens on punctuation and we also plan to do some minimal prefiltering prior 
>> to sending documents to Solr for indexing.  However, since with now have 
>> over 400 languages , we will have to be conservative in our filtering since 
>> we would rather  index dirty OCR than risk not indexing legitimate content.
>
> Got it... it's a great test case for Lucene :)
>
> Mike
>


Re: How to give path in SCRIPT tag?

2010-09-07 Thread Simon Willnauer
ankita,

your question seems to be somewhat unrelated to Solr / Lucene and
should be asked somewhere else, not on this list. Please try to
keep the focus of your questions on Solr related topics, or use
java-user@ for Lucene related topics.

Thanks,

Simon

On Tue, Sep 7, 2010 at 3:46 PM, ankita shinde  wrote:
> How to give path of folder stored on our local machine in Script tag  'src'
> attribute in html file,head tag.
>
> Is this is correct ?
>
> <script src="C:/evol/core/AbstractManager.js">
>


Re: minMergeDocs supported ?

2010-08-24 Thread Simon Willnauer
Hey, I guess this option has been removed in Lucene 2.0 - you could
look at maxBufferedDocs and ramBufferSizeMB to control how many
documents / how much heap space is used to buffer documents before they are
flushed and merged into a new segment. I don't know what you are trying
to do, but those are the factors you might wanna look at.

simon

On Tue, Aug 24, 2010 at 11:35 AM, stockii  wrote:
>
> in lucene is this option for the index configuration available. In Solr too ?
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/minMergeDocs-supported-tp1302856p1307821.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: search multiple default fields

2010-07-05 Thread Simon Willnauer
Have a look at http://wiki.apache.org/solr/DisMaxRequestHandler and
http://wiki.apache.org/solr/DisMaxRequestHandler#qf_.28Query_Fields.29

that might help with what you are looking for...

simon
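
Concretely, the dismax qf parameter runs one query string across several fields
with optional per-field boosts. A small SolrJ sketch using the field names from
your example (the boost on field3 is just to show the syntax):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

    public class MultiFieldSearch {
      public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
        SolrQuery query = new SolrQuery("myquery");
        query.set("defType", "dismax");                     // use the dismax query parser
        query.set("qf", "defaultfield field2 field3^2");    // fields to search, with optional boosts
        System.out.println(server.query(query).getResults().getNumFound());
      }
    }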

On Tue, Jul 6, 2010 at 3:48 AM, bluestar  wrote:
> hi there,
>
> is it possible to define multiple default search fields in the
> solrconfig.xml?
>
> at the moment i am using a queryfilter programatically but i want to be
> able to configure things such that my query will be processed as:
>
> defaultfield:myquery OR field2:myquery OR field3:myquery ... ..
>
> basically i want my query to match any of my named fields, but not always
> matching the defaultfield...
>
> at the moment i have one default field + a queryfilter which is not
> returning the desired results.
>
> thanks
>
>


Re: Not split a field on whitespaces?

2010-07-05 Thread Simon Willnauer
Use solr.StrField or solr.KeywordTokenizerFactory instead.

simon

On Mon, Jul 5, 2010 at 2:47 PM, Sebastian Funk
 wrote:
> Hey there,
>
> I might be just to blind to see this, but isn't it possible to have a
> solr.TextField not getting filtered in any way. That means the input
> "Michael Jackson" should just stay that way and not get split on
> whitespaces? How do I implement that?
>
> Thanks for any help,
> Sebastian
>


Re: Weird memory error.

2007-11-21 Thread Simon Willnauer
Actually, when I look at the error message, this has nothing to do with
memory.
The error message:
java.lang.OutOfMemoryError: unable to create new native thread

means that the OS cannot create any new native threads for this JVM. So the
limit you are running into is not the JVM memory.
I guess you should rather look for a bottleneck inside your application that
prevents your server threads from being reused when you fire concurrent
batches at your server. Do you do all that in parallel?

In the stacktrace below your connector can not get any new threads from the
pool which has nothing to do with memory.

Try to figure out what is taking so much time during the batch process on
the server.

simon

On Nov 20, 2007 5:16 PM, Brian Carmalt <[EMAIL PROTECTED]> wrote:

> Hello all,
>
> I started looking into the scalability of solr, and have started getting
> weird  results.
> I am getting the following error:
>
> Exception in thread "btpool0-3" java.lang.OutOfMemoryError: unable to
> create new native thread
>at java.lang.Thread.start0(Native Method)
>at java.lang.Thread.start(Thread.java:574)
>at
> org.mortbay.thread.BoundedThreadPool.newThread(BoundedThreadPool.java:377)
>at
> org.mortbay.thread.BoundedThreadPool.dispatch(BoundedThreadPool.java:94)
>at
> org.mortbay.jetty.bio.SocketConnector$Connection.dispatch(
> SocketConnector.java:187)
>at
> org.mortbay.jetty.bio.SocketConnector.accept(SocketConnector.java:101)
>at
> org.mortbay.jetty.AbstractConnector$Acceptor.run(AbstractConnector.java
> :516)
>at
> org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java
> :442)
>
> This only occurs when I send docs to the server in batches of around 10
> as separate processes.
> If I send the serially, the heap grows up to 1200M and with no errors.
>
> When I observe the VM during it's operation, It doesn't seem to run out
> of memory.  The VM starts
> with 1024M and can allocate up to 1800M. I start getting the error
> listed above when the memory
> usage is right around 1 G. I have been using the Jconsole program on
> windows to observe the
> jetty server by using the com.sun.management.jmxremote* functions on the
> server side. The number of threads
> is always around 30, and jetty can create up 250, so I don't think
> that's the problem. I can't really image that
> the monitoring process is using the other 800M of the allowable heap
> memory, but it could be.
> But the problem occurs without monitoring, even when the VM heap is set
> to 1500M.
>
> Does anyone have an idea as to why this error is occurring?
>
> Thanks,
> Brian
>


Re: Weird memory error.

2007-11-20 Thread Simon Willnauer
I'm using the Eclipse TPTP platform and I'm very happy with it. You will
also find good howto or tutorial pages on the web.

- simon

On Nov 20, 2007 5:29 PM, Brian Carmalt <[EMAIL PROTECTED]> wrote:

> Can you recommend one? I am not familar with how to profile under Java.
>
> Yonik Seeley schrieb:
> > Can you try a profiler to see where the memory is being used?
> > -Yonik
> >
> > On Nov 20, 2007 11:16 AM, Brian Carmalt <[EMAIL PROTECTED]> wrote:
> >
> >> Hello all,
> >>
> >> I started looking into the scalability of solr, and have started
> getting
> >> weird  results.
> >> I am getting the following error:
> >>
> >> Exception in thread "btpool0-3" java.lang.OutOfMemoryError: unable to
> >> create new native thread
> >> at java.lang.Thread.start0(Native Method)
> >> at java.lang.Thread.start(Thread.java:574)
> >> at
> >> org.mortbay.thread.BoundedThreadPool.newThread(BoundedThreadPool.java
> :377)
> >> at
> >> org.mortbay.thread.BoundedThreadPool.dispatch(BoundedThreadPool.java
> :94)
> >> at
> >> org.mortbay.jetty.bio.SocketConnector$Connection.dispatch(
> SocketConnector.java:187)
> >> at
> >> org.mortbay.jetty.bio.SocketConnector.accept(SocketConnector.java:101)
> >> at
> >> org.mortbay.jetty.AbstractConnector$Acceptor.run(AbstractConnector.java
> :516)
> >> at
> >> org.mortbay.thread.BoundedThreadPool$PoolThread.run(
> BoundedThreadPool.java:442)
> >>
> >> This only occurs when I send docs to the server in batches of around 10
> >> as separate processes.
> >> If I send the serially, the heap grows up to 1200M and with no errors.
> >>
> >> When I observe the VM during it's operation, It doesn't seem to run out
> >> of memory.  The VM starts
> >> with 1024M and can allocate up to 1800M. I start getting the error
> >> listed above when the memory
> >> usage is right around 1 G. I have been using the Jconsole program on
> >> windows to observe the
> >> jetty server by using the com.sun.management.jmxremote* functions on
> the
> >> server side. The number of threads
> >> is always around 30, and jetty can create up 250, so I don't think
> >> that's the problem. I can't really image that
> >> the monitoring process is using the other 800M of the allowable heap
> >> memory, but it could be.
> >> But the problem occurs without monitoring, even when the VM heap is set
> >> to 1500M.
> >>
> >> Does anyone have an idea as to why this error is occurring?
> >>
> >> Thanks,
> >> Brian
> >>
> >>
> >
> >
>
>


Re: Extending Solr's Admin functionality

2006-09-27 Thread Simon Willnauer

First, I agree with Yonik: defining which classes / parts / MBeans
should be exposed to JMX is the hard part and should be
planned carefully. I could imagine a very flexible layer between JMX
and Solr using Java 1.5 annotations and an integration of commons-modeler.
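
To make that concrete, a minimal sketch of what exposing a statistic as a
standard MBean could look like; all names here are invented for illustration
and are not existing Solr classes:

    import java.lang.management.ManagementFactory;

    import javax.management.MBeanServer;
    import javax.management.ObjectName;

    public class JmxSketch {
      // the management interface a JMX console would see
      public interface IndexStatsMBean {
        int getNumDocs();
      }

      // trivial implementation backed by whatever the core knows
      public static class IndexStats implements IndexStatsMBean {
        public int getNumDocs() { return 42; }
      }

      public static void main(String[] args) throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        server.registerMBean(new IndexStats(), new ObjectName("solr:type=index,name=stats"));
        // any JMX console (jconsole, MX4J's HTTP adaptor, ...) can now read the attribute
        Thread.sleep(60000);   // keep the JVM alive so the bean can be inspected
      }
    }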

Erik, do I get you right that you want to connect to JMX via an external
web application which acts as an admin interface (written in Ruby or
whatever), or are you pointing to an HTTP/XML connector for JMX?
JSR-160 permits extensions to the way in which communication is done
between the client and the server. Basic implementations are using the
mandatory RMI-based implementation required by the JSR-160
specification (IIOP and JRMP) and the (optional) JMXMP. By using other
providers or JMX implementations (such as MX4J) you can take advantage
of protocols like SOAP, Hessian, Burlap over simple HTTP or SSL and
others. (http://mx4j.sourceforge.net)

best regards simon



On 9/27/06, Erik Hatcher <[EMAIL PROTECTED]> wrote:

Ah, so I'm beginning to get it.  If we build Solr with JMX support,
the admin HTTP/XML(err, Ruby) interface could be written into the JMX
HTTP adapter as a separate web application, and allowing users to
plug it in or not.  If I'm understanding that correctly then I'm
quite +1 on JMX!  And I suppose some of these adapters already have
built in web service interfaces.

Erik


On Sep 27, 2006, at 6:20 AM, Simon Willnauer wrote:

> @Otis: I suggest we go a bit more in detail about the features solr
> should expose via JMX and talk about the contribution. I'd love to
> extend solr with more JMX support.
>
>
>
> On 9/27/06, Yonik Seeley <[EMAIL PROTECTED]> wrote:
>> On 9/26/06, Otis Gospodnetic <[EMAIL PROTECTED]> wrote:
>> > On the other hand, some people I talked to also expressed
>> interest in JMX, so I'd encourage Simon to make that contribution.
>>
>> I'm also interested in JMX.
>> It has different adapters, including an HTTP one AFAIK, but I don't
>> know how easy it is to use.
>
> The application should only provide mbeans as an interface for the JMX
> kernel to expose these interfaces to the adapter. Which adapter you
> use depends on you personal preferences. There are lots of JMX Monitor
> apps around with http adaptors like mx4j (http://mx4j.sourceforge.net)
> if deployed in the same container all mbeans are exposed to the
> connector via the mbean registry / server.
>
> @Yonik: What are you interests in JMX?
>
> best regards Simon
>>
>> -Yonik
>>




Re: Extending Solr's Admin functionality

2006-09-27 Thread Simon Willnauer

@Otis: I suggest we go into a bit more detail about the features Solr
should expose via JMX and talk about the contribution. I'd love to
extend Solr with more JMX support.



On 9/27/06, Yonik Seeley <[EMAIL PROTECTED]> wrote:

On 9/26/06, Otis Gospodnetic <[EMAIL PROTECTED]> wrote:
> On the other hand, some people I talked to also expressed interest in JMX, so 
I'd encourage Simon to make that contribution.

I'm also interested in JMX.
It has different adapters, including an HTTP one AFAIK, but I don't
know how easy it is to use.


The application should only provide MBeans as an interface for the JMX
kernel to expose these interfaces to the adapter. Which adapter you
use depends on your personal preferences. There are lots of JMX monitoring
apps around with HTTP adaptors, like MX4J (http://mx4j.sourceforge.net);
if deployed in the same container, all MBeans are exposed to the
connector via the MBean registry / server.

@Yonik: What are you interests in JMX?

best regards Simon


-Yonik



Re: Extending Solr's Admin functionality

2006-09-24 Thread Simon Willnauer

I followed the discussion over the last 3 days and I am still wondering why
nobody turned up with an integration of Solr monitoring and
administration functionality using Java's fantastic management
extension, JMX. I joined a team 2 years ago building a distributed
webspider / searcher (similar to Nutch). In the middle of the
development process someone came up with monitoring the system via the
admin frontend, which communicates via HTTP with the indexer, spider
and searcher parts of the system. At that time I was playing around
with JMX building a generic server-side monitoring app and suggested
using JMX for all the monitoring. It turned out to be a very handy and
easy-to-use solution.
This would also solve a couple of problems about blowing up the core
with extra features if you create the JMX monitoring in an extra jar as
a contrib feature. You wouldn't need to write a frontend either, and
everybody who is used to their JMX monitoring frontend could use it to
monitor Solr as well. Loading custom classes and monitoring the desired
behaviour might be much easier to implement and to analyze.
Additionally, JMX supports memory behaviour monitoring with Java 1.5.
Another feature is triggering messages to global monitoring
systems if errors or undesired behaviour of managed components occur,
which could also be exposed via JMX.

Also, security / firewall doubts would not concern the core and its
security, as JMX connectors do not use the same port as Solr does and
can be blocked by firewalls (no matter which protocol is used to connect
to the server).

Although I'm not very familiar with the Solr core, I spotted some Info MBeans
in Solr which would be accessible without writing any code as well.
After all I do have to admit that accessing all these features via JMX
would not be as user friendly as a hand-made HTTP frontend would be,
but should a search server admin / management frontend really be exposed
to people without any background?!

I would also be happy to contribute my experience with JMX to the project.

best regards Simon


Re: update partial document

2006-09-18 Thread Simon Willnauer

I'm not into the code of Solr at all, but I know that Solr is based on
the Lucene core, which has no update mechanism of its own. To update a
document using Lucene you have to delete and reinsert the document.
That might be the reason for the Solr behaviour as well.

You should consider that Lucene is not a database!
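
For reference, later Lucene releases wrap exactly that delete-and-reinsert in a
single call. A minimal sketch assuming a Lucene 2.9-era API (updateDocument
itself appeared in 2.1), with made-up field names:

    import java.io.File;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.store.FSDirectory;

    public class ReplaceDocument {
      public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter(
            FSDirectory.open(new File("/path/to/index")),
            new StandardAnalyzer(),
            IndexWriter.MaxFieldLength.UNLIMITED);

        Document doc = new Document();
        doc.add(new Field("id", "125125", Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("language", "RU", Field.Store.YES, Field.Index.NOT_ANALYZED));

        // deletes any existing document whose id term matches, then adds the new one
        writer.updateDocument(new Term("id", "125125"), doc);
        writer.close();
      }
    }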

best regards simon

On 9/18/06, Brian Lucas <[EMAIL PROTECTED]> wrote:

Hi, I wanted to inquire if anybody would find an update flag useful that
only replaced the subset of data (ie a certain field) getting passed in,
instead of the whole record.



Pseudo-code for what I'm describing:



125125

true



+ RU



- EN



Instead of deleting and reinserting an entire document, which is ostensibly
what SOLR does each time an update is performed, it's sometimes preferable
to simply replace a single field's value like one does in a database.



Any thoughts on the feasibility or limitations of this?



Brian





Re: does solr know classpath

2006-09-16 Thread Simon Willnauer

/solrwebapp/WEB-INF/lib

to point out one solution

best regards simon

On 9/16/06, James liu <[EMAIL PROTECTED]> wrote:

i set classpath where i put lucene-analyzers-2.0.0.jar...i can use it.

but solr not find it..

where i should put it in?




Solr in production env.

2006-09-11 Thread Simon Willnauer

Hello,

I almost convinced my boss to use Solr in production for a new project,
and hopefully for lots of following projects, but I'm a bit confused
that there is no release available for download. Is Solr still in a
beta state? Are there Solr servers in production? Is it recommendable
to use it in production? I would be glad to hear about some experiences
and recommendations on this topic.


best regards Simon